🛠️ LM Studio Setup Guide for Sarvam-M 4-bit MLX
🔍 Problem Diagnosis
The "EOS token issue" where the model stops after a few words is caused by incorrect prompt formatting, not the EOS token itself.
✅ Solution: Proper Chat Format
Option 1: Use Chat Mode in LM Studio
- Load the model in LM Studio
- Switch to Chat mode (not Playground mode)
- Set Chat Template to "Custom" or "Mistral"
- Configure these settings:
- System Prompt: "You are a helpful assistant."
- Chat Template Format: <s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]
Option 2: Manual Prompt Format
If using Playground mode, format your prompts like this:
Simple Format:
[INST] Your question here [/INST]
With System Prompt:
<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT][INST]Your question here[/INST]
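If you are scripting against the model instead of typing prompts by hand, a small helper keeps the formatting consistent. This is a minimal sketch; format_prompt is a hypothetical helper, not part of mlx-lm or LM Studio:

```python
# Hypothetical helper (not provided by mlx-lm or LM Studio) that builds the
# prompt strings shown above so scripts stay consistent with the chat format.
def format_prompt(user, system=None):
    if system:
        return f"<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]"
    return f"[INST] {user} [/INST]"

print(format_prompt("What is AI?", system="You are a helpful assistant."))
```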
Option 3: Thinking Mode (Advanced)
For reasoning tasks, use:
<s>[SYSTEM_PROMPT]You are a helpful assistant. Think deeply before answering the user's question. Do the thinking inside <think>...</think> tags.[/SYSTEM_PROMPT][INST]Your question here[/INST]<think>
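In thinking mode the model emits its reasoning inside <think>...</think> before the final answer. If you only want the answer, you can strip that block after generation. A minimal sketch; strip_think is a hypothetical post-processing helper:

```python
import re

# Hypothetical post-processing step: drop the <think>...</think> block that
# thinking mode produces, keeping only the final answer for display.
def strip_think(text):
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Rayleigh scattering favours short wavelengths.</think> The sky appears blue because..."
print(strip_think(raw))  # -> "The sky appears blue because..."
```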
🎛️ Recommended LM Studio Settings
Generation Parameters (an mlx-lm equivalent is sketched after this list):
- Max Tokens: 512-1024
- Temperature: 0.7-0.8
- Top P: 0.9
- Repetition Penalty: 1.1
- Context Length: 4096
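The same settings can be applied when generating with mlx-lm directly. A minimal sketch, assuming a recent mlx-lm release where make_sampler and make_logits_processors live in mlx_lm.sample_utils; older versions accept temp, top_p, and repetition_penalty as keyword arguments to generate() instead:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Mirror the LM Studio settings above: temperature, top-p, repetition penalty.
sampler = make_sampler(temp=0.7, top_p=0.9)
logits_processors = make_logits_processors(repetition_penalty=1.1)

prompt = "[INST] What is 2+2? Please explain your answer. [/INST]"
response = generate(
    model,
    tokenizer,
    prompt,
    max_tokens=512,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)
```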
Stop Sequences:
Add these stop sequences (a post-processing sketch for scripted use follows this list):
- `</s>` (remove this one if the model stops after only a few words; see Common Issues below)
- `[/INST]`
- `\n\nUser:`
- `\n\nHuman:`
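If you call mlx-lm from Python rather than using LM Studio, you can apply the same stop sequences yourself by truncating the generated text. A minimal sketch; truncate_at_stop is a hypothetical helper:

```python
# Hypothetical helper: cut generated text at the first stop sequence,
# mirroring the LM Studio stop-sequence settings above.
STOP_SEQUENCES = ["</s>", "[/INST]", "\n\nUser:", "\n\nHuman:"]

def truncate_at_stop(text, stops=STOP_SEQUENCES):
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text.rstrip()

print(truncate_at_stop("The sum of 2 and 2 is 4.</s> extra tokens"))
```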
MLX Settings:
- ✅ Enable MLX acceleration
- ✅ Use GPU memory
- Set batch size to 1-4
🧪 Test Examples
Test 1: Basic Math
Prompt: [INST] What is 2+2? Please explain your answer. [/INST]
Expected: The sum of 2 and 2 is **4**. [explanation follows]
Test 2: Reasoning
Prompt: <s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>
Expected: <think>[reasoning]</think> The sky appears blue because...
Test 3: Hindi Language
Prompt: [INST] भारत की राजधानी क्या है? [/INST] (English: "What is the capital of India?")
Expected: भारत की राजधानी **नई दिल्ली** है... (English: "The capital of India is **New Delhi**...")
🚨 Common Issues & Fixes
| Issue | Cause | Solution |
|---|---|---|
| Empty responses | No chat template | Use [INST]...[/INST] format |
| Stops after a few words | Wrong stop tokens | Remove </s> from stop sequences |
| Repeating text | Repetition penalty too low | Increase to 1.1-1.2 |
| Slow responses | CPU inference | Enable MLX acceleration |
📝 Model Information
- Format: MLX 4-bit quantized
- Languages: English + 10 Indic languages
- Context: 4096 tokens
- Based on: Mistral Small architecture
- Special Features: Thinking mode, multi-language support
🔗 Working Example Commands
MLX-LM (Command Line):
```bash
# Basic chat
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "[INST] Hello, how are you? [/INST]" --max-tokens 100

# With thinking
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "<s>[SYSTEM_PROMPT]Think deeply.[/SYSTEM_PROMPT][INST]Explain quantum physics[/INST]<think>" --max-tokens 200
```
Python Code:
```python
from mlx_lm import load, generate

# Load the 4-bit MLX weights and the matching tokenizer
model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Let the tokenizer's chat template produce the correct [INST] formatting
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt, max_tokens=100)
print(response)
```
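For multi-turn chat, append the model's reply to the message list and re-apply the chat template before the next turn. A minimal sketch; the apply_chat_template call is the standard Hugging Face tokenizer API that mlx-lm exposes, and the follow-up question is only an illustration:

```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# First turn
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
reply = generate(model, tokenizer, prompt, max_tokens=100)

# Second turn: feed the assistant reply back in and ask a follow-up
messages += [
    {"role": "assistant", "content": reply},
    {"role": "user", "content": "Give me one everyday example."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt, max_tokens=100))
```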
✅ Success Checklist
- Model loads in LM Studio with MLX enabled
- Chat template is set to Custom/Mistral format
- Test prompt [INST] Hello [/INST] generates a response
- Stop sequences configured correctly
- Generation parameters optimized
- Multi-language capability tested