Sarvam-M 4-bit MLX

This is a 4-bit quantized version of sarvamai/sarvam-m optimized for Apple Silicon using MLX.

Model Details

  • Base Model: Sarvam-M (24B parameters)
  • Quantization: 4.5 bits per weight
  • Framework: MLX (optimized for Apple Silicon)
  • Model Size: ~12GB (75% reduction from original ~48GB)
  • Languages: English + 10 Indic languages (Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)

Key Features

  • 🇮🇳 Indic Language Excellence: Specifically optimized for Indian languages with cultural context
  • 🧮 Hybrid Reasoning: Supports both "thinking" and "non-thinking" modes for different use cases
  • ⚡ Fast Inference: 4-6x faster than larger models while maintaining quality
  • 🎯 Versatile: Strong performance in math, programming, and multilingual tasks
  • 💻 Apple Silicon Optimized: Runs efficiently on Apple Silicon Macs (M1/M2/M3/M4)

Installation

# Install MLX and dependencies
pip install mlx-lm transformers

# For chat functionality (optional)
pip install gradio
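
To confirm that MLX is installed and using the Metal GPU, a quick sanity check (assuming a standard mlx-lm install):

# Should print Device(gpu, 0) on Apple Silicon
python -c "import mlx.core as mx; print(mx.default_device())"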

🛠️ LM Studio Setup

Having issues with short responses or "EOS token" problems in LM Studio?

👉 See the complete LM Studio Setup Guide

Quick Fix: Use proper chat formatting:

[INST] Your question here [/INST]

The model requires specific prompt formatting to work correctly in LM Studio.
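
If you want to see exactly what the chat template produces (for example, to copy the format into LM Studio's prompt settings), here is a minimal sketch that renders a single turn using only the tokenizer files shipped with this repo; no model weights are downloaded:

from transformers import AutoTokenizer

# Load only the tokenizer and render a one-turn conversation
tokenizer = AutoTokenizer.from_pretrained("Jimmi42/sarvam-m-4bit-mlx")
messages = [{"role": "user", "content": "Your question here"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))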

Usage

Basic Generation

from mlx_lm import load, generate

# Load the model
model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Simple generation
response = generate(
    model, 
    tokenizer, 
    prompt="What is the capital of India?", 
    max_tokens=50
)
print(response)

Chat with Thinking Mode Control

from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# No thinking mode (direct answers)
messages = [{'role': 'user', 'content': 'What is 2+2?'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=20)
print(response)  # Output: The sum of 2 and 2 is **4**.

# With thinking mode (shows reasoning)
messages = [{'role': 'user', 'content': 'Solve: 15 * 23'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)  # Output: <think>Let me calculate...</think> The answer is 345.
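
When thinking mode is enabled, the reasoning is wrapped in <think>...</think> tags ahead of the final answer. If you only want the answer, a minimal post-processing sketch (strip_thinking is a hypothetical helper, assuming the tag format shown above):

import re

def strip_thinking(text: str) -> str:
    """Drop the <think>...</think> block and return only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_thinking(response))  # e.g. "The answer is 345."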

Hindi Language Example

# Hindi conversation
messages = [{'role': 'user', 'content': 'भारत की राजधानी क्या है?'}]  # "What is the capital of India?"
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=50)
print(response)
# Output: भारत की राजधानी **नई दिल्ली** है। यह देश की राजनीतिक, प्रशासनिक...
# ("The capital of India is **New Delhi**. It is the country's political, administrative...")

Programming Example

# Code generation
messages = [{'role': 'user', 'content': 'Write a Python function to calculate fibonacci numbers'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=150)
print(response)

Command Line Usage

# Simple generation
python -m mlx_lm generate \
    --model Jimmi42/sarvam-m-4bit-mlx \
    --prompt "Hello, how are you?" \
    --max-tokens 50

# Interactive chat
python -m mlx_lm chat --model Jimmi42/sarvam-m-4bit-mlx

Performance Benchmarks

  • Model Size: ~12GB
  • Peak Memory Usage: ~13.3GB
  • Generation Speed: 18-36 tokens/sec
  • Quantization: 4.5 bits per weight
  • Supported Languages: 11 (English + 10 Indic)
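
Numbers like these can be reproduced on your own machine: mlx_lm reports prompt/generation tokens-per-second (and, in recent versions, peak memory) when generation is run in verbose mode. A minimal sketch; the prompt is arbitrary:

from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# verbose=True streams the output and prints generation speed stats at the end
generate(
    model,
    tokenizer,
    prompt="Explain the Pythagorean theorem.",
    max_tokens=200,
    verbose=True,
)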

Quality Comparison

  • Math: Accurate arithmetic and reasoning
  • Hindi: Native-level language understanding
  • Programming: Strong code generation capabilities
  • Cultural Context: Indian-specific knowledge and values

Hardware Requirements

  • Minimum: Apple Silicon Mac (M1/M2/M3/M4) with 16GB RAM
  • Recommended: 32GB+ RAM for optimal performance
  • Storage: ~15GB free space

Supported Languages

  1. English - Primary language
  2. Hindi (हिन्दी) - 28% of Indic data
  3. Bengali (বাংলা) - 8% of Indic data
  4. Gujarati (ગુજરાતી) - 8% of Indic data
  5. Kannada (ಕನ್ನಡ) - 8% of Indic data
  6. Malayalam (മലയാളം) - 8% of Indic data
  7. Marathi (मराठी) - 8% of Indic data
  8. Oriya (ଓଡ଼ିଆ) - 8% of Indic data
  9. Punjabi (ਪੰਜਾਬੀ) - 8% of Indic data
  10. Tamil (தமிழ்) - 8% of Indic data
  11. Telugu (తెలుగు) - 8% of Indic data

License

This model follows the same license as the original Sarvam-M model. Please refer to the original model card for license details.

Citation

@misc{sarvam-m-mlx,
  title={Sarvam-M 4-bit MLX: Quantized Indian Language Model for Apple Silicon},
  author={Community Contribution},
  year={2025},
  url={https://huggingface.co/Jimmi42/sarvam-m-4bit-mlx}
}

Credits

  • Base model: sarvamai/sarvam-m by Sarvam AI
  • Quantization and conversion: MLX-LM (Apple's MLX framework)

Issues and Support

For issues specific to this MLX version, work through the checks below (a diagnostic sketch follows the list):

  • Check that you're using Apple Silicon hardware
  • Ensure MLX is properly installed
  • Verify you have sufficient RAM (16GB minimum)
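
A small diagnostic sketch covering the three checks above (macOS-only; reads total RAM via sysctl):

import platform, subprocess

# 1. Apple Silicon check - expect "arm64"
print("Machine:", platform.machine())

# 2. MLX install check - expect Device(gpu, 0)
try:
    import mlx.core as mx
    print("MLX device:", mx.default_device())
except ImportError:
    print("MLX is not installed: pip install mlx-lm")

# 3. RAM check - this model needs 16GB minimum
mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]))
print(f"Total RAM: {mem_bytes / 1e9:.1f} GB")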

For general model issues, refer to the original Sarvam-M repository.


This model was quantized using MLX-LM tools and optimized for Apple Silicon. It maintains the quality and capabilities of the original Sarvam-M while providing significant efficiency improvements.
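
For reference, a comparable 4-bit conversion can be produced with the MLX-LM convert tool. The exact flags used for this upload are not recorded here, so the values below are an assumption; a group size of 64 is what yields an effective ~4.5 bits per weight once per-group scales and biases are counted:

# Quantize the original checkpoint to 4-bit MLX format
python -m mlx_lm convert \
    --hf-path sarvamai/sarvam-m \
    -q --q-bits 4 --q-group-size 64 \
    --mlx-path sarvam-m-4bit-mlx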
