Sarvam-M 4-bit MLX

This is a 4-bit quantized version of sarvamai/sarvam-m optimized for Apple Silicon using MLX.

Model Details

  • Base Model: Sarvam-M (24B parameters)
  • Quantization: 4-bit (≈4.5 effective bits per weight once group scales are included)
  • Framework: MLX (optimized for Apple Silicon)
  • Model Size: ~12GB (75% reduction from original ~48GB)
  • Languages: English + 10 Indic languages (Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)

Key Features

  • 🇮🇳 Indic Language Excellence: Specifically optimized for Indian languages with cultural context
  • 🧮 Hybrid Reasoning: Supports both "thinking" and "non-thinking" modes for different use cases
  • ⚡ Fast Inference: roughly 4-6x faster generation than larger models of comparable capability, while maintaining quality
  • 🎯 Versatile: Strong performance in math, programming, and multilingual tasks
  • 💻 Apple Silicon Optimized: Runs efficiently on M1/M2/M3/M4 Macs

Installation

# Install MLX and dependencies
pip install mlx-lm transformers

# For chat functionality (optional)
pip install gradio
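
To confirm the install picked up MLX's Metal backend, a quick sanity check (assumes a recent MLX build):

import mlx.core as mx

# Should report a GPU device on Apple Silicon, e.g. Device(gpu, 0)
print(mx.default_device())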

🛠️ LM Studio Setup

Having issues with short responses or "EOS token" problems in LM Studio?

👉 See the complete LM Studio Setup Guide

Quick Fix: Use proper chat formatting:

[INST] Your question here [/INST]

The model requires specific prompt formatting to work correctly in LM Studio.
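
If you are composing raw prompts yourself, a minimal helper like the following (a sketch; the function name is ours, not part of any API) produces the [INST]-wrapped string shown above:

def format_inst_prompt(user_message: str) -> str:
    """Wrap a user message in the [INST] format described above."""
    return f"[INST] {user_message} [/INST]"

print(format_inst_prompt("What is the capital of India?"))
# [INST] What is the capital of India? [/INST]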

Usage

Basic Generation

from mlx_lm import load, generate

# Load the model
model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Simple generation
response = generate(
    model, 
    tokenizer, 
    prompt="What is the capital of India?", 
    max_tokens=50
)
print(response)
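
For longer replies you may prefer to stream tokens as they are produced. A minimal sketch using mlx_lm's stream_generate (recent mlx-lm releases yield response chunks with a .text field; older releases yield plain strings):

from mlx_lm import load, stream_generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Print each chunk as it arrives instead of waiting for the full response
for chunk in stream_generate(model, tokenizer, prompt="Name three rivers in India.", max_tokens=60):
    print(chunk.text, end="", flush=True)
print()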

Chat with Thinking Mode Control

from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# No thinking mode (direct answers)
messages = [{'role': 'user', 'content': 'What is 2+2?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=20)
print(response)  # Output: The sum of 2 and 2 is **4**.

# With thinking mode (shows reasoning)
messages = [{'role': 'user', 'content': 'Solve: 15 * 23'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)  # Output: <think>Let me calculate...</think> The answer is 345.
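
With enable_thinking=True, the reasoning arrives inside <think>...</think> tags, as in the output above. If you only want the final answer for display, a small post-processing helper (our own sketch, not part of mlx-lm) can strip the reasoning block:

import re

def strip_thinking(text: str) -> str:
    """Drop the <think>...</think> reasoning block, keeping the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_thinking("<think>Let me calculate...</think> The answer is 345."))
# The answer is 345.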

Hindi Language Example

# Hindi conversation ("What is the capital of India?")
messages = [{'role': 'user', 'content': 'भारत की राजधानी क्या है?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=50)
print(response)
# Output: भारत की राजधानी **नई दिल्ली** है। यह देश की राजनीतिक, प्रशासनिक...
# ("The capital of India is **New Delhi**. It is the country's political, administrative...")

Programming Example

# Code generation
messages = [{'role': 'user', 'content': 'Write a Python function to calculate fibonacci numbers'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=150)
print(response)

Command Line Usage

# Simple generation
python -m mlx_lm generate \
    --model Jimmi42/sarvam-m-4bit-mlx \
    --prompt "Hello, how are you?" \
    --max-tokens 50

# Interactive chat
python -m mlx_lm chat --model Jimmi42/sarvam-m-4bit-mlx

Performance Benchmarks

Metric               Value
-------------------  ------------------------------
Model Size           ~12GB
Peak Memory Usage    ~13.3GB
Generation Speed     18-36 tokens/sec
Quantization         ≈4.5 effective bits per weight
Supported Languages  11 (English + 10 Indic)
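
Generation speed depends on hardware, prompt length, and sampling settings. To get a rough tokens-per-second figure on your own machine, a simple timing sketch:

import time
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

start = time.perf_counter()
response = generate(model, tokenizer, prompt="Describe Diwali in two sentences.", max_tokens=100)
elapsed = time.perf_counter() - start

# Rough throughput: tokens in the completion divided by wall-clock time
n_tokens = len(tokenizer.encode(response))
print(f"~{n_tokens / elapsed:.1f} tokens/sec")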

Quality Comparison

  • Math: Accurate arithmetic and reasoning
  • Hindi: Native-level language understanding
  • Programming: Strong code generation capabilities
  • Cultural Context: Indian-specific knowledge and values

Hardware Requirements

  • Minimum: Apple Silicon Mac (M1/M2/M3/M4) with 16GB RAM
  • Recommended: 32GB+ RAM for optimal performance
  • Storage: ~15GB free space
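
To check your machine against these requirements, MLX can report the unified memory visible to the GPU (a sketch; mx.metal.device_info() is available on Metal-backed MLX builds, and the exact keys may vary by version):

import mlx.core as mx

info = mx.metal.device_info()
print(f"Chip: {info['architecture']}")
print(f"Unified memory: {info['memory_size'] / 1e9:.0f} GB")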

Supported Languages

  1. English - Primary language
  2. Hindi (हिन्दी) - 28% of Indic data
  3. Bengali (বাংলা) - 8% of Indic data
  4. Gujarati (ગુજરાતી) - 8% of Indic data
  5. Kannada (ಕನ್ನಡ) - 8% of Indic data
  6. Malayalam (മലയാളം) - 8% of Indic data
  7. Marathi (मराठी) - 8% of Indic data
  8. Oriya (ଓଡ଼ିଆ) - 8% of Indic data
  9. Punjabi (ਪੰਜਾਬੀ) - 8% of Indic data
  10. Tamil (தமிழ்) - 8% of Indic data
  11. Telugu (తెలుగు) - 8% of Indic data

License

This model follows the same license as the original Sarvam-M model. Please refer to the original model card for license details.

Citation

@misc{sarvam-m-mlx,
  title={Sarvam-M 4-bit MLX: Quantized Indian Language Model for Apple Silicon},
  author={Community Contribution},
  year={2025},
  url={https://huggingface.co/Jimmi42/sarvam-m-4bit-mlx}
}

Credits

  • Base model: sarvamai/sarvam-m by Sarvam AI
  • Quantization: produced with the MLX-LM toolkit from Apple's MLX project

Issues and Support

For issues specific to this MLX version:

  • Check that you're using Apple Silicon hardware
  • Ensure MLX is properly installed
  • Verify you have sufficient RAM (16GB minimum)

For general model issues, refer to the original Sarvam-M repository.


This model was quantized using MLX-LM tools and optimized for Apple Silicon. It maintains the quality and capabilities of the original Sarvam-M while providing significant efficiency improvements.
