---
library_name: mlx
license: apache-2.0
language:
- en
- bn
- hi
- kn
- gu
- mr
- ml
- or
- pa
- ta
- te
base_model: sarvamai/sarvam-m
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- mlx
- quantized
- 4bit
- indian-languages
- multilingual
- apple-silicon
- sarvam
- mistral
---

# Sarvam-M 4-bit MLX

This is a 4-bit quantized version of [sarvamai/sarvam-m](https://huggingface.co/sarvamai/sarvam-m), optimized for Apple Silicon using [MLX](https://github.com/ml-explore/mlx).

## Model Details

- **Base Model**: [Sarvam-M](https://huggingface.co/sarvamai/sarvam-m) (24B parameters)
- **Quantization**: 4.5 bits per weight
- **Framework**: MLX (optimized for Apple Silicon)
- **Model Size**: ~12GB (75% reduction from the original ~48GB)
- **Languages**: English + 10 Indic languages (Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)

## Key Features

- **🇮🇳 Indic Language Excellence**: Optimized for Indian languages and cultural context
- **🧮 Hybrid Reasoning**: Supports both "thinking" and "non-thinking" modes for different use cases
- **⚡ Fast Inference**: 4-6x faster than the full-precision original while maintaining quality
- **🎯 Versatile**: Strong performance on math, programming, and multilingual tasks
- **💻 Apple Silicon Optimized**: Runs efficiently on M1/M2/M3/M4 Macs

## Installation

```bash
# Install MLX and dependencies
pip install mlx-lm transformers

# For chat functionality (optional)
pip install gradio
```

## 🛠️ LM Studio Setup

**Having issues with short responses or "EOS token" problems in LM Studio?**

👉 **[See the complete LM Studio Setup Guide](./LM_Studio_Setup_Guide.md)**

**Quick Fix:** Use proper chat formatting:

```
[INST] Your question here [/INST]
```

The model requires specific prompt formatting to work correctly in LM Studio.

## Usage

### Basic Generation

```python
from mlx_lm import load, generate

# Load the model
model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

# Simple generation
response = generate(
    model,
    tokenizer,
    prompt="What is the capital of India?",
    max_tokens=50
)
print(response)
```

### Chat with Thinking Mode Control

```python
from mlx_lm import load, generate

model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

# No thinking mode (direct answers)
messages = [{'role': 'user', 'content': 'What is 2+2?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=20)
print(response)
# Output: The sum of 2 and 2 is **4**.

# With thinking mode (shows reasoning)
messages = [{'role': 'user', 'content': 'Solve: 15 * 23'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)
# Output: Let me calculate... The answer is 345.
```

### Hindi Language Example

```python
# Hindi conversation: "What is the capital of India?"
messages = [{'role': 'user', 'content': 'भारत की राजधानी क्या है?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=50)
print(response)
# Output: भारत की राजधानी **नई दिल्ली** है। यह देश की राजनीतिक, प्रशासनिक...
# (English: The capital of India is **New Delhi**. It is the country's political, administrative...)
```
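The same chat-template pattern extends to the other supported Indic languages. Below is a minimal sketch that asks the same question in a few of them; it reuses only the `load`, `apply_chat_template`, and `generate` calls shown above, and the example questions (illustrative translations) plus the placeholder repo id are assumptions, not benchmarked prompts:

```python
from mlx_lm import load, generate

model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

# "What is the capital of India?" in a few of the supported languages
questions = {
    "Hindi":   "भारत की राजधानी क्या है?",
    "Bengali": "ভারতের রাজধানী কী?",
    "Tamil":   "இந்தியாவின் தலைநகரம் எது?",
}

for language, question in questions.items():
    messages = [{"role": "user", "content": question}]
    # Build the prompt with the model's chat template, thinking mode off
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        enable_thinking=False
    )
    response = generate(model, tokenizer, prompt=prompt, max_tokens=60)
    print(f"--- {language} ---")
    print(response)
```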
### Programming Example

```python
# Code generation
messages = [{'role': 'user', 'content': 'Write a Python function to calculate fibonacci numbers'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=150)
print(response)
```

## Command Line Usage

```bash
# Simple generation
python -m mlx_lm generate \
    --model your-username/sarvam-m-4bit-mlx \
    --prompt "Hello, how are you?" \
    --max-tokens 50

# Interactive chat
python -m mlx_lm chat --model your-username/sarvam-m-4bit-mlx
```

## Performance Benchmarks

| Metric | Value |
|--------|-------|
| Model Size | ~12GB |
| Peak Memory Usage | ~13.3GB |
| Generation Speed | 18-36 tokens/sec |
| Quantization Bits | 4.5 bits per weight |
| Supported Languages | 11 (English + 10 Indic) |

### Quality Comparison

- **Math**: Accurate arithmetic and reasoning
- **Hindi**: Native-level language understanding
- **Programming**: Strong code generation capabilities
- **Cultural Context**: Indian-specific knowledge and values

## Hardware Requirements

- **Minimum**: Apple Silicon Mac (M1/M2/M3/M4) with 16GB RAM
- **Recommended**: 32GB+ RAM for optimal performance
- **Storage**: ~15GB free space

## Supported Languages

1. **English** - Primary language
2. **Hindi** (हिन्दी) - 28% of Indic data
3. **Bengali** (বাংলা) - 8% of Indic data
4. **Gujarati** (ગુજરાતી) - 8% of Indic data
5. **Kannada** (ಕನ್ನಡ) - 8% of Indic data
6. **Malayalam** (മലയാളം) - 8% of Indic data
7. **Marathi** (मराठी) - 8% of Indic data
8. **Oriya** (ଓଡ଼ିଆ) - 8% of Indic data
9. **Punjabi** (ਪੰਜਾਬੀ) - 8% of Indic data
10. **Tamil** (தமிழ்) - 8% of Indic data
11. **Telugu** (తెలుగు) - 8% of Indic data

## License

This model follows the same license as the original Sarvam-M model. Please refer to the [original model card](https://huggingface.co/sarvamai/sarvam-m) for license details.

## Citation

```bibtex
@misc{sarvam-m-mlx,
  title={Sarvam-M 4-bit MLX: Quantized Indian Language Model for Apple Silicon},
  author={Community Contribution},
  year={2025},
  url={https://huggingface.co/your-username/sarvam-m-4bit-mlx}
}
```

## Credits

- **Original Model**: [Sarvam AI](https://sarvam.ai/) for creating Sarvam-M
- **Base Model**: Built on [Mistral Small](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)
- **MLX Framework**: [Apple's MLX team](https://github.com/ml-explore/mlx)
- **Quantization**: Community contribution using MLX-LM tools

## Issues and Support

For issues specific to this MLX version:

- Check that you're using Apple Silicon hardware
- Ensure MLX is properly installed
- Verify you have sufficient RAM (16GB minimum)

For general model issues, refer to the [original Sarvam-M repository](https://huggingface.co/sarvamai/sarvam-m).

---

*This model was quantized using MLX-LM tools and optimized for Apple Silicon. It maintains the quality and capabilities of the original Sarvam-M while providing significant efficiency improvements.*
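For reference, quantized MLX exports of this kind are typically produced with the `mlx-lm` conversion utility. The sketch below shows the general shape of such a command; the exact flags and output path used for this particular export are assumptions (4-bit quantization with the default group size corresponds to roughly 4.5 effective bits per weight):

```bash
# Assumed invocation: convert the original checkpoint to MLX and quantize to 4 bits.
# Adjust the command form and flags to match your installed mlx-lm version.
python -m mlx_lm convert \
    --hf-path sarvamai/sarvam-m \
    --mlx-path ./sarvam-m-4bit-mlx \
    -q --q-bits 4
```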