Sarvam-M 4-bit MLX
This is a 4-bit quantized version of sarvamai/sarvam-m optimized for Apple Silicon using MLX.
Model Details
- Base Model: Sarvam-M (24B parameters)
- Quantization: 4-bit (~4.5 bits per weight effective, including quantization scales)
- Framework: MLX (optimized for Apple Silicon)
- Model Size: ~12GB (75% reduction from original ~48GB)
- Languages: English + 10 Indic languages (Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)
Key Features
- 🇮🇳 Indic Language Excellence: Specifically optimized for Indian languages with cultural context
- 🧮 Hybrid Reasoning: Supports both "thinking" and "non-thinking" modes for different use cases
- ⚡ Fast Inference: 4-6x faster than larger models while maintaining quality
- 🎯 Versatile: Strong performance in math, programming, and multilingual tasks
- 💻 Apple Silicon Optimized: Runs efficiently on M1/M2/M3 MacBooks
Installation
```bash
# Install MLX and dependencies
pip install mlx-lm transformers

# For chat functionality (optional)
pip install gradio
```
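To confirm the install worked before downloading the ~12GB of weights, a quick sanity check (on Apple Silicon the default MLX device should be the GPU):

```bash
# mlx-lm should import cleanly and MLX should see the Apple GPU
python -c "from mlx_lm import load, generate; print('mlx-lm OK')"
python -c "import mlx.core as mx; print(mx.default_device())"
```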
🛠️ LM Studio Setup
Having issues with short responses or "EOS token" problems in LM Studio?
👉 See the complete LM Studio Setup Guide
Quick Fix: Use proper chat formatting:
```
[INST] Your question here [/INST]
```
The model requires specific prompt formatting to work correctly in LM Studio.
Usage
Basic Generation
```python
from mlx_lm import load, generate

# Load the model
model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Simple generation
response = generate(
    model,
    tokenizer,
    prompt="What is the capital of India?",
    max_tokens=50
)
print(response)
```
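For longer responses you may prefer to stream tokens as they are produced. mlx-lm exposes `stream_generate` for this; the exact yield type has changed across mlx-lm releases (plain text chunks in older versions, response objects with a `.text` field in newer ones), so this sketch handles both:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Stream the answer token by token instead of waiting for the full response
for chunk in stream_generate(model, tokenizer,
                             prompt="Briefly describe the Taj Mahal.",
                             max_tokens=100):
    # Newer mlx-lm versions yield response objects with a .text field;
    # older versions yield plain strings
    print(getattr(chunk, "text", chunk), end="", flush=True)
print()
```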
Chat with Thinking Mode Control
```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# No thinking mode (direct answers)
messages = [{'role': 'user', 'content': 'What is 2+2?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=20)
print(response)  # Output: The sum of 2 and 2 is **4**.

# With thinking mode (shows reasoning)
messages = [{'role': 'user', 'content': 'Solve: 15 * 23'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)  # Output: <think>Let me calculate...</think> The answer is 345.
```
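In thinking mode the reasoning trace is wrapped in `<think>...</think>` tags, as shown above. If you only want the final answer (for example when showing output to end users), a small helper can strip the trace; this is a minimal sketch, assuming the tags appear exactly as in the example output:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove the <think>...</think> reasoning block and return only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_thinking("<think>15 * 23 = 345</think> The answer is 345."))
# The answer is 345.
```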
Hindi Language Example
```python
# Hindi conversation ("What is the capital of India?")
messages = [{'role': 'user', 'content': 'भारत की राजधानी क्या है?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=50)
print(response)
# Output: भारत की राजधानी **नई दिल्ली** है। यह देश की राजनीतिक, प्रशासनिक...
# ("The capital of India is **New Delhi**. It is the country's political, administrative...")
```
Programming Example
```python
# Code generation
messages = [{'role': 'user', 'content': 'Write a Python function to calculate fibonacci numbers'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=150)
print(response)
```
Command Line Usage
```bash
# Simple generation
python -m mlx_lm generate \
  --model Jimmi42/sarvam-m-4bit-mlx \
  --prompt "Hello, how are you?" \
  --max-tokens 50

# Interactive chat
python -m mlx_lm chat --model Jimmi42/sarvam-m-4bit-mlx
```
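Sampling options such as temperature and top-p can also be set from the command line, but flag names have shifted between mlx-lm releases, so it is safest to consult the built-in help. If the `huggingface_hub` CLI is available (it ships alongside `transformers`), the weights can also be pre-downloaded rather than fetched on first use:

```bash
# List the generation flags supported by your installed mlx-lm version
python -m mlx_lm generate --help

# Optionally pre-download the ~12GB of model files
huggingface-cli download Jimmi42/sarvam-m-4bit-mlx
```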
Performance Benchmarks
| Metric | Value |
|---|---|
| Model Size | ~12GB |
| Peak Memory Usage | ~13.3GB |
| Generation Speed | 18-36 tokens/sec |
| Quantization Bits | 4.5 bits per weight |
| Supported Languages | 11 (English + 10 Indic) |
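Throughput and memory use vary with the chip generation, prompt length, and what else is running. To measure them on your own machine, `generate` prints its own speed statistics (and, in recent mlx-lm releases, peak memory) when called with `verbose=True`; a minimal sketch:

```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# verbose=True makes mlx-lm report prompt/generation tokens-per-sec
# (and peak memory in recent releases) after the run
generate(
    model,
    tokenizer,
    prompt="Summarise the benefits of 4-bit quantization in two sentences.",
    max_tokens=200,
    verbose=True,
)
```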
Quality Comparison
- Math: Accurate arithmetic and reasoning
- Hindi: Native-level language understanding
- Programming: Strong code generation capabilities
- Cultural Context: Indian-specific knowledge and values
Hardware Requirements
- Minimum: Apple Silicon Mac (M1/M2/M3/M4) with 16GB RAM
- Recommended: 32GB+ RAM for optimal performance
- Storage: ~15GB free space
Supported Languages
- English - Primary language
- Hindi (हिन्दी) - 28% of Indic data
- Bengali (বাংলা) - 8% of Indic data
- Gujarati (ગુજરાતી) - 8% of Indic data
- Kannada (ಕನ್ನಡ) - 8% of Indic data
- Malayalam (മലയാളം) - 8% of Indic data
- Marathi (मराठी) - 8% of Indic data
- Oriya (ଓଡ଼ିଆ) - 8% of Indic data
- Punjabi (ਪੰਜਾਬੀ) - 8% of Indic data
- Tamil (தமிழ்) - 8% of Indic data
- Telugu (తెలుగు) - 8% of Indic data
License
This model follows the same license as the original Sarvam-M model. Please refer to the original model card for license details.
Citation
```bibtex
@misc{sarvam-m-mlx,
  title={Sarvam-M 4-bit MLX: Quantized Indian Language Model for Apple Silicon},
  author={Community Contribution},
  year={2025},
  url={https://huggingface.co/Jimmi42/sarvam-m-4bit-mlx}
}
```
Credits
- Original Model: Sarvam AI for creating Sarvam-M
- Base Model: Built on Mistral Small (mistralai/Mistral-Small-3.1-24B-Base-2503)
- MLX Framework: Apple's MLX team
- Quantization: Community contribution using MLX-LM tools
Issues and Support
For issues specific to this MLX version, work through this checklist (a quick self-check sketch follows the list):
- Check that you're using Apple Silicon hardware
- Ensure MLX is properly installed
- Verify you have sufficient RAM (16GB minimum)
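A minimal self-check sketch covering the hardware points, using only the Python standard library and the macOS `sysctl` tool (the 16GB and ~15GB figures simply mirror the Hardware Requirements above):

```python
import platform
import shutil
import subprocess

# MLX only runs on Apple Silicon; these machines report "arm64"
print("Architecture:", platform.machine())

# Total physical RAM in GB (macOS-specific sysctl key); 16GB is the stated minimum
ram_gb = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"])) / 1e9
print(f"RAM: {ram_gb:.0f} GB")

# ~15GB of free space is needed for the model files
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk: {free_gb:.0f} GB")
```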
For general model issues, refer to the original Sarvam-M repository.
This model was quantized using MLX-LM tools and optimized for Apple Silicon. It maintains the quality and capabilities of the original Sarvam-M while providing significant efficiency improvements.