Sarvam-M 4-bit MLX

This is a 4-bit quantized version of sarvamai/sarvam-m optimized for Apple Silicon using MLX.

Model Details

  • Base Model: Sarvam-M (24B parameters)
  • Quantization: 4-bit (≈4.5 effective bits per weight once group scales are included)
  • Framework: MLX (optimized for Apple Silicon)
  • Model Size: ~12GB (75% reduction from original ~48GB)
  • Languages: English + 10 Indic languages (Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)

Key Features

  • 🇮🇳 Indic Language Excellence: Specifically optimized for Indian languages with cultural context
  • 🧮 Hybrid Reasoning: Supports both "thinking" and "non-thinking" modes for different use cases
  • ⚡ Fast Inference: roughly 4-6x faster generation than larger models of comparable capability, while maintaining quality
  • 🎯 Versatile: Strong performance in math, programming, and multilingual tasks
  • 💻 Apple Silicon Optimized: Runs efficiently on M1/M2/M3/M4 Macs

Installation

# Install MLX and dependencies
pip install mlx-lm transformers

# For chat functionality (optional)
pip install gradio
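
To confirm the install picked up MLX's Metal backend, a quick sanity check (assumes a recent MLX build):

import mlx.core as mx

# Should report a GPU device on Apple Silicon, e.g. Device(gpu, 0)
print(mx.default_device())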

🛠️ LM Studio Setup

Having issues with short responses or "EOS token" problems in LM Studio?

👉 See the complete LM Studio Setup Guide

Quick Fix: Use proper chat formatting:

[INST] Your question here [/INST]

The model requires specific prompt formatting to work correctly in LM Studio.
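
If you are composing raw prompts yourself, a minimal helper like the following (a sketch; the function name is ours, not part of any API) produces the [INST]-wrapped string shown above:

def format_inst_prompt(user_message: str) -> str:
    """Wrap a user message in the [INST] format described above."""
    return f"[INST] {user_message} [/INST]"

print(format_inst_prompt("What is the capital of India?"))
# [INST] What is the capital of India? [/INST]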

Usage

Basic Generation

from mlx_lm import load, generate

# Load the model
model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Simple generation
response = generate(
    model, 
    tokenizer, 
    prompt="What is the capital of India?", 
    max_tokens=50
)
print(response)
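
For longer replies you may prefer to stream tokens as they are produced. A minimal sketch using mlx_lm's stream_generate (recent mlx-lm releases yield response chunks with a .text field; older releases yield plain strings):

from mlx_lm import load, stream_generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Print each chunk as it arrives instead of waiting for the full response
for chunk in stream_generate(model, tokenizer, prompt="Name three rivers in India.", max_tokens=60):
    print(chunk.text, end="", flush=True)
print()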

Chat with Thinking Mode Control

from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# No thinking mode (direct answers)
messages = [{'role': 'user', 'content': 'What is 2+2?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=20)
print(response)  # Output: The sum of 2 and 2 is **4**.

# With thinking mode (shows reasoning)
messages = [{'role': 'user', 'content': 'Solve: 15 * 23'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)  # Output: <think>Let me calculate...</think> The answer is 345.
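
With enable_thinking=True, the reasoning arrives inside <think>...</think> tags, as in the output above. If you only want the final answer for display, a small post-processing helper (our own sketch, not part of mlx-lm) can strip the reasoning block:

import re

def strip_thinking(text: str) -> str:
    """Drop the <think>...</think> reasoning block, keeping the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_thinking("<think>Let me calculate...</think> The answer is 345."))
# The answer is 345.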

Hindi Language Example

# Hindi conversation ("What is the capital of India?")
messages = [{'role': 'user', 'content': 'भारत की राजधानी क्या है?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=50)
print(response)
# Output: भारत की राजधानी **नई दिल्ली** है। यह देश की राजनीतिक, प्रशासनिक...
# ("The capital of India is **New Delhi**. It is the country's political, administrative...")

Programming Example

# Code generation
messages = [{'role': 'user', 'content': 'Write a Python function to calculate fibonacci numbers'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=150)
print(response)

Command Line Usage

# Simple generation
python -m mlx_lm generate \
    --model Jimmi42/sarvam-m-4bit-mlx \
    --prompt "Hello, how are you?" \
    --max-tokens 50

# Interactive chat
python -m mlx_lm chat --model Jimmi42/sarvam-m-4bit-mlx

Performance Benchmarks

Metric               Value
-------------------  ------------------------------
Model Size           ~12GB
Peak Memory Usage    ~13.3GB
Generation Speed     18-36 tokens/sec
Quantization         ≈4.5 effective bits per weight
Supported Languages  11 (English + 10 Indic)
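
Generation speed depends on hardware, prompt length, and sampling settings. To get a rough tokens-per-second figure on your own machine, a simple timing sketch:

import time
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

start = time.perf_counter()
response = generate(model, tokenizer, prompt="Describe Diwali in two sentences.", max_tokens=100)
elapsed = time.perf_counter() - start

# Rough throughput: tokens in the completion divided by wall-clock time
n_tokens = len(tokenizer.encode(response))
print(f"~{n_tokens / elapsed:.1f} tokens/sec")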

Quality Comparison

  • Math: Accurate arithmetic and reasoning
  • Hindi: Native-level language understanding
  • Programming: Strong code generation capabilities
  • Cultural Context: Indian-specific knowledge and values

Hardware Requirements

  • Minimum: Apple Silicon Mac (M1/M2/M3/M4) with 16GB RAM
  • Recommended: 32GB+ RAM for optimal performance
  • Storage: ~15GB free space
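
To check your machine against these requirements, MLX can report the unified memory visible to the GPU (a sketch; mx.metal.device_info() is available on Metal-backed MLX builds, and the exact keys may vary by version):

import mlx.core as mx

info = mx.metal.device_info()
print(f"Chip: {info['architecture']}")
print(f"Unified memory: {info['memory_size'] / 1e9:.0f} GB")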

Supported Languages

  1. English - Primary language
  2. Hindi (हिन्दी) - 28% of Indic data
  3. Bengali (বাংলা) - 8% of Indic data
  4. Gujarati (ગુજરાતી) - 8% of Indic data
  5. Kannada (ಕನ್ನಡ) - 8% of Indic data
  6. Malayalam (മലയാളം) - 8% of Indic data
  7. Marathi (मराठी) - 8% of Indic data
  8. Oriya (ଓଡ଼ିଆ) - 8% of Indic data
  9. Punjabi (ਪੰਜਾਬੀ) - 8% of Indic data
  10. Tamil (தமிழ்) - 8% of Indic data
  11. Telugu (తెలుగు) - 8% of Indic data

License

This model follows the same license as the original Sarvam-M model. Please refer to the original model card for license details.

Citation

@misc{sarvam-m-mlx,
  title={Sarvam-M 4-bit MLX: Quantized Indian Language Model for Apple Silicon},
  author={Community Contribution},
  year={2025},
  url={https://huggingface.co/Jimmi42/sarvam-m-4bit-mlx}
}

Credits

  • Base model: sarvamai/sarvam-m by Sarvam AI
  • Quantization: produced with the MLX-LM toolkit from Apple's MLX project

Issues and Support

For issues specific to this MLX version:

  • Check that you're using Apple Silicon hardware
  • Ensure MLX is properly installed
  • Verify you have sufficient RAM (16GB minimum)

For general model issues, refer to the original Sarvam-M repository.


This model was quantized using MLX-LM tools and optimized for Apple Silicon. It maintains the quality and capabilities of the original Sarvam-M while providing significant efficiency improvements.
