🛠️ LM Studio Setup Guide for Sarvam-M 4-bit MLX
🔍 Problem Diagnosis
The "EOS token issue" where the model stops after a few words is caused by incorrect prompt formatting, not the EOS token itself.
✅ Solution: Proper Chat Format
Option 1: Use Chat Mode in LM Studio
- Load the model in LM Studio
- Switch to Chat mode (not Playground mode)
- Set Chat Template to "Custom" or "Mistral"
- Configure these settings:
- System Prompt: "You are a helpful assistant."
- Chat Template Format: <s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]
Option 2: Manual Prompt Format
If using Playground mode, format your prompts like this:
Simple Format:
[INST] Your question here [/INST]
With System Prompt:
<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT][INST]Your question here[/INST]
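If you are scripting against the model instead of typing prompts by hand, a small helper keeps the formatting consistent. This is a minimal sketch; format_prompt is a hypothetical helper, not part of mlx-lm or LM Studio:

```python
# Hypothetical helper (not provided by mlx-lm or LM Studio) that builds the
# prompt strings shown above so scripts stay consistent with the chat format.
def format_prompt(user, system=None):
    if system:
        return f"<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]"
    return f"[INST] {user} [/INST]"

print(format_prompt("What is AI?", system="You are a helpful assistant."))
```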
Option 3: Thinking Mode (Advanced)
For reasoning tasks, use:
<s>[SYSTEM_PROMPT]You are a helpful assistant. Think deeply before answering the user's question. Do the thinking inside <think>...</think> tags.[/SYSTEM_PROMPT][INST]Your question here[/INST]<think>
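In thinking mode the model emits its reasoning inside <think>...</think> before the final answer. If you only want the answer, you can strip that block after generation. A minimal sketch; strip_think is a hypothetical post-processing helper:

```python
import re

# Hypothetical post-processing step: drop the <think>...</think> block that
# thinking mode produces, keeping only the final answer for display.
def strip_think(text):
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Rayleigh scattering favours short wavelengths.</think> The sky appears blue because..."
print(strip_think(raw))  # -> "The sky appears blue because..."
```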
🎛️ Recommended LM Studio Settings
Generation Parameters (an mlx-lm equivalent is sketched after this list):
- Max Tokens: 512-1024
- Temperature: 0.7-0.8
- Top P: 0.9
- Repetition Penalty: 1.1
- Context Length: 4096
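The same settings can be applied when generating with mlx-lm directly. A minimal sketch, assuming a recent mlx-lm release where make_sampler and make_logits_processors live in mlx_lm.sample_utils; older versions accept temp, top_p, and repetition_penalty as keyword arguments to generate() instead:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Mirror the LM Studio settings above: temperature, top-p, repetition penalty.
sampler = make_sampler(temp=0.7, top_p=0.9)
logits_processors = make_logits_processors(repetition_penalty=1.1)

prompt = "[INST] What is 2+2? Please explain your answer. [/INST]"
response = generate(
    model,
    tokenizer,
    prompt,
    max_tokens=512,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)
```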
Stop Sequences:
Add these stop sequences (a post-processing sketch for scripted use follows this list):
- `</s>` (remove this one if the model stops after only a few words; see Common Issues below)
- `[/INST]`
- `\n\nUser:`
- `\n\nHuman:`
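If you call mlx-lm from Python rather than using LM Studio, you can apply the same stop sequences yourself by truncating the generated text. A minimal sketch; truncate_at_stop is a hypothetical helper:

```python
# Hypothetical helper: cut generated text at the first stop sequence,
# mirroring the LM Studio stop-sequence settings above.
STOP_SEQUENCES = ["</s>", "[/INST]", "\n\nUser:", "\n\nHuman:"]

def truncate_at_stop(text, stops=STOP_SEQUENCES):
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text.rstrip()

print(truncate_at_stop("The sum of 2 and 2 is 4.</s> extra tokens"))
```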
MLX Settings:
- ✅ Enable MLX acceleration
- ✅ Use GPU memory
- Set batch size to 1-4
🧪 Test Examples
Test 1: Basic Math
Prompt: [INST] What is 2+2? Please explain your answer. [/INST]
Expected: The sum of 2 and 2 is **4**. [explanation follows]
Test 2: Reasoning
Prompt: <s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>
Expected: <think>[reasoning]</think> The sky appears blue because...
Test 3: Hindi Language
Prompt: [INST] भारत की राजधानी क्या है? [/INST] (English: "What is the capital of India?")
Expected: भारत की राजधानी **नई दिल्ली** है... (English: "The capital of India is **New Delhi**...")
🚨 Common Issues & Fixes
| Issue | Cause | Solution |
|---|---|---|
| Empty responses | No chat template | Use [INST]...[/INST] format |
| Stops after a few words | Wrong stop tokens | Remove </s> from stop sequences |
| Repeating text | Repetition penalty too low | Increase to 1.1-1.2 |
| Slow responses | CPU inference | Enable MLX acceleration |
📝 Model Information
- Format: MLX 4-bit quantized
- Languages: English + 10 Indic languages
- Context: 4096 tokens
- Based on: Mistral Small architecture
- Special Features: Thinking mode, multi-language support
🔗 Working Example Commands
MLX-LM (Command Line):
```bash
# Basic chat
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "[INST] Hello, how are you? [/INST]" --max-tokens 100

# With thinking
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "<s>[SYSTEM_PROMPT]Think deeply.[/SYSTEM_PROMPT][INST]Explain quantum physics[/INST]<think>" --max-tokens 200
```
Python Code:
```python
from mlx_lm import load, generate

# Load the 4-bit MLX weights and the matching tokenizer
model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Let the tokenizer's chat template produce the correct [INST] formatting
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt, max_tokens=100)
print(response)
```
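For multi-turn chat, append the model's reply to the message list and re-apply the chat template before the next turn. A minimal sketch; the apply_chat_template call is the standard Hugging Face tokenizer API that mlx-lm exposes, and the follow-up question is only an illustration:

```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# First turn
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
reply = generate(model, tokenizer, prompt, max_tokens=100)

# Second turn: feed the assistant reply back in and ask a follow-up
messages += [
    {"role": "assistant", "content": reply},
    {"role": "user", "content": "Give me one everyday example."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt, max_tokens=100))
```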
✅ Success Checklist
- Model loads in LM Studio with MLX enabled
- Chat template is set to Custom/Mistral format
- Test prompt [INST] Hello [/INST] generates a response
- Stop sequences configured correctly
- Generation parameters optimized
- Multi-language capability tested