# 🛠️ LM Studio Setup Guide for Sarvam-M 4-bit MLX
## 🔍 Problem Diagnosis
The "EOS token issue" where the model stops after a few words is caused by **incorrect prompt formatting**, not the EOS token itself.
## ✅ Solution: Proper Chat Format
### **Option 1: Use Chat Mode in LM Studio**
1. **Load the model** in LM Studio
2. **Switch to Chat mode** (not Playground mode)
3. **Set Chat Template** to "Custom" or "Mistral"
4. **Configure these settings:**
```
System Prompt: "You are a helpful assistant."
Chat Template Format:
<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]
```
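If the tokenizer in this repo ships with a chat template, you can print the exact string the template produces and compare it with what LM Studio sends. A minimal sketch (assumes the `transformers` tokenizer for `Jimmi42/sarvam-m-4bit-mlx` bundles a chat template and that it accepts a system role):

```python
from transformers import AutoTokenizer

# Load the tokenizer that ships with the MLX repo (assumption: it includes a chat template)
tokenizer = AutoTokenizer.from_pretrained("Jimmi42/sarvam-m-4bit-mlx")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]

# Render the template without tokenizing so you can inspect the raw prompt string
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # expected shape: <s>[SYSTEM_PROMPT]...[/SYSTEM_PROMPT][INST]...[/INST]
```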
### **Option 2: Manual Prompt Format**
If using Playground mode, format your prompts like this:
**Simple Format:**
```
[INST] Your question here [/INST]
```
**With System Prompt:**
```
<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT][INST]Your question here[/INST]
```
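If you build prompts by hand (for example, in a script that feeds the Playground), a small helper keeps the format consistent. A minimal sketch; the `format_prompt` helper below is illustrative and not part of the model repo:

```python
def format_prompt(user_message: str, system_prompt: str | None = None) -> str:
    """Wrap a user message in the [INST] format this model expects."""
    if system_prompt:
        return f"<s>[SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT][INST]{user_message}[/INST]"
    return f"[INST] {user_message} [/INST]"

print(format_prompt("What is 2+2?"))
print(format_prompt("Why is the sky blue?", "You are a helpful assistant."))
```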
### **Option 3: Thinking Mode (Advanced)**
For reasoning tasks, use:
```
<s>[SYSTEM_PROMPT]You are a helpful assistant. Think deeply before answering the user's question. Do the thinking inside <think>...</think> tags.[/SYSTEM_PROMPT][INST]Your question here[/INST]<think>
```
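Because the prompt already opens a `<think>` block, the raw completion contains the model's reasoning followed by the visible answer. A minimal sketch for splitting the two (assumes the model closes the block with `</think>`):

```python
import re

def split_thinking(completion: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completion that may contain a </think> marker."""
    match = re.search(r"(.*?)</think>(.*)", completion, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", completion.strip()

reasoning, answer = split_thinking("Light scatters in the atmosphere...</think> The sky appears blue because...")
print(answer)  # "The sky appears blue because..."
```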
## 🎛️ Recommended LM Studio Settings
### **Generation Parameters:**
- **Max Tokens:** 512-1024
- **Temperature:** 0.7-0.8
- **Top P:** 0.9
- **Repetition Penalty:** 1.1
- **Context Length:** 4096
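These values can also be set per request when calling LM Studio's local server (OpenAI-compatible API, default port 1234). A minimal sketch; the model identifier and the exact parameters your LM Studio version accepts (repetition penalty in particular is usually set in the UI) are assumptions:

```python
import requests

# Assumes LM Studio's local server is running on the default port
response = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "sarvam-m-4bit-mlx",  # use the identifier shown in LM Studio
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is 2+2? Please explain your answer."},
        ],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9,
        # add "</s>" / "[/INST]" here only if the model does not already stop cleanly
        "stop": ["\n\nUser:", "\n\nHuman:"],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```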
### **Stop Sequences:**
Add these stop sequences:
- `</s>` (if the model still stops after only a few words, remove this one; see Common Issues below)
- `[/INST]`
- `\n\nUser:`
- `\n\nHuman:`
### **MLX Settings:**
- ✅ Enable MLX acceleration
- ✅ Use GPU memory
- Set batch size to 1-4
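You can confirm that MLX sees the Apple GPU from Python before loading the model. A minimal sketch (requires the `mlx` package):

```python
import mlx.core as mx

# The Metal (Apple GPU) backend must be available for MLX acceleration
print("Metal available:", mx.metal.is_available())
print("Default device:", mx.default_device())
```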
## 🧪 Test Examples
### **Test 1: Basic Math**
```
Prompt: [INST] What is 2+2? Please explain your answer. [/INST]
Expected: The sum of 2 and 2 is **4**. [explanation follows]
```
### **Test 2: Reasoning**
```
Prompt: <s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>
Expected: <think>[reasoning]</think> The sky appears blue because...
```
### **Test 3: Hindi Language**
```
Prompt: [INST] भारत की राजधानी क्या है? [/INST]
Expected: भारत की राजधानी **नई दिल्ली** है...
```
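To run the three tests above outside LM Studio, you can loop them through `mlx_lm` with the same prompt formatting. A minimal sketch:

```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

test_prompts = [
    "[INST] What is 2+2? Please explain your answer. [/INST]",
    "<s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>",
    "[INST] भारत की राजधानी क्या है? [/INST]",
]

for prompt in test_prompts:
    print(generate(model, tokenizer, prompt, max_tokens=200))
```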
## 🚨 Common Issues & Fixes
| Issue | Cause | Solution |
|-------|-------|----------|
| Empty responses | No chat template | Use `[INST]...[/INST]` format |
| Stops after few words | Wrong stop tokens | Remove `</s>` from stop sequences |
| Repeating text | Low repetition penalty | Increase to 1.1-1.2 |
| Slow responses | CPU inference | Enable MLX acceleration |
## 📝 Model Information
- **Format:** MLX 4-bit quantized
- **Languages:** English + 10 Indic languages
- **Context:** 4096 tokens
- **Based on:** Mistral Small architecture
- **Special Features:** Thinking mode, multi-language support
## 🔗 Working Example Commands
### **MLX-LM (Command Line):**
```bash
# Basic chat
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "[INST] Hello, how are you? [/INST]" --max-tokens 100
# With thinking
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "<s>[SYSTEM_PROMPT]Think deeply.[/SYSTEM_PROMPT][INST]Explain quantum physics[/INST]<think>" --max-tokens 200
```
### **Python Code:**
```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Build the prompt through the model's chat template so the [INST] markers are correct
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt, max_tokens=100)
print(response)
```
## ✅ Success Checklist
- [ ] Model loads in LM Studio with MLX enabled
- [ ] Chat template is set to Custom/Mistral format
- [ ] Test prompt: `[INST] Hello [/INST]` generates response
- [ ] Stop sequences configured correctly
- [ ] Generation parameters optimized
- [ ] Multi-language capability tested