# 🛠️ LM Studio Setup Guide for Sarvam-M 4-bit MLX

## 🔍 Problem Diagnosis
The "EOS token issue" where the model stops after a few words is caused by **incorrect prompt formatting**, not the EOS token itself.

## ✅ Solution: Proper Chat Format

### **Option 1: Use Chat Mode in LM Studio**
1. **Load the model** in LM Studio
2. **Switch to Chat mode** (not Playground mode)
3. **Set Chat Template** to "Custom" or "Mistral"
4. **Configure these settings:**
   ```
   System Prompt: "You are a helpful assistant."
   
   Chat Template Format:
   <s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]
   ```

### **Option 2: Manual Prompt Format**
If using Playground mode, format your prompts like this:

**Simple Format:**
```
[INST] Your question here [/INST]
```

**With System Prompt:**
```
<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT][INST]Your question here[/INST]
```
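
A quick way to verify these formats outside LM Studio is a small Python helper that assembles the strings above. This is a minimal sketch; `build_prompt` is a hypothetical helper, not part of mlx_lm or LM Studio.

```python
# Hypothetical helper: assemble the manual Sarvam-M / Mistral-style prompts shown above.
def build_prompt(user: str, system: str | None = None) -> str:
    if system:
        return f"<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]"
    return f"[INST] {user} [/INST]"

print(build_prompt("What is 2+2?"))
print(build_prompt("Why is the sky blue?", system="You are a helpful assistant."))
```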

### **Option 3: Thinking Mode (Advanced)**
For reasoning tasks, use:
```
<s>[SYSTEM_PROMPT]You are a helpful assistant. Think deeply before answering the user's question. Do the thinking inside <think>...</think> tags.[/SYSTEM_PROMPT][INST]Your question here[/INST]<think>
```
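
When the model runs in thinking mode, its reasoning arrives inside `<think>...</think>` ahead of the final answer. The sketch below separates the two; it assumes the completion closes the reasoning with a `</think>` tag as described above (the opening `<think>` is already part of the prompt).

```python
import re

# Sketch: split a thinking-mode completion into (reasoning, answer).
# Assumes the model closes its reasoning with </think>; the opening <think>
# is already part of the prompt, so the completion starts mid-reasoning.
def split_thinking(completion: str) -> tuple[str, str]:
    match = re.search(r"(.*?)</think>(.*)", completion, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", completion.strip()  # no closing tag: treat everything as the answer

reasoning, answer = split_thinking(
    "Light scatters more at short wavelengths...</think> The sky appears blue because..."
)
print(answer)
```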

## 🎛️ Recommended LM Studio Settings

### **Generation Parameters:**
- **Max Tokens:** 512-1024
- **Temperature:** 0.7-0.8
- **Top P:** 0.9
- **Repetition Penalty:** 1.1
- **Context Length:** 4096
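
The parameters above map onto mlx_lm as well. The sketch below is one way to apply them, assuming a recent mlx_lm release that exposes `make_sampler` and `make_logits_processors`; older versions pass `temp` and `repetition_penalty` to `generate` directly.

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Sampling settings mirroring the recommendations above.
sampler = make_sampler(temp=0.7, top_p=0.9)
logits_processors = make_logits_processors(repetition_penalty=1.1)

prompt = "[INST] What is 2+2? Please explain your answer. [/INST]"
response = generate(
    model,
    tokenizer,
    prompt,
    max_tokens=512,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)
```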

### **Stop Sequences:**
Add these stop sequences (if responses still cut off after only a few words, try removing `</s>`; see Common Issues below):
- `</s>`
- `[/INST]`
- `\n\nUser:`
- `\n\nHuman:`
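
LM Studio applies these stop sequences for you. If you call the model from your own script instead, a simple post-processing step can do the same job; this is a minimal sketch, not an mlx_lm feature.

```python
# Sketch: trim a completion at the first occurrence of any stop sequence.
STOP_SEQUENCES = ["</s>", "[/INST]", "\n\nUser:", "\n\nHuman:"]

def apply_stops(text: str, stops=STOP_SEQUENCES) -> str:
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```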

### **MLX Settings:**
- ✅ Enable MLX acceleration
- ✅ Use GPU memory
- Set batch size to 1-4

## 🧪 Test Examples

### **Test 1: Basic Math**
```
Prompt: [INST] What is 2+2? Please explain your answer. [/INST]
Expected: The sum of 2 and 2 is **4**. [explanation follows]
```

### **Test 2: Reasoning**
```
Prompt: <s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>
Expected: <think>[reasoning]</think> The sky appears blue because... 
```

### **Test 3: Hindi Language**
```
Prompt: [INST] भारत की राजधानी क्या है? [/INST]
Expected: भारत की राजधानी **नई दिल्ली** है...
```
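
To run these three checks outside LM Studio, the same prompts can be fed through mlx_lm; this is a sketch that reuses the `load`/`generate` calls from the Python example later in this guide.

```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# The three test prompts from this section, already in the expected format.
tests = [
    "[INST] What is 2+2? Please explain your answer. [/INST]",
    "<s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>",
    "[INST] भारत की राजधानी क्या है? [/INST]",
]

for prompt in tests:
    print(generate(model, tokenizer, prompt, max_tokens=200))
    print("-" * 40)
```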

## 🚨 Common Issues & Fixes

| Issue | Cause | Solution |
|-------|-------|----------|
| Empty responses | No chat template | Use `[INST]...[/INST]` format |
| Stops after few words | Wrong stop tokens | Remove `</s>` from stop sequences |
| Repeating text | Low repetition penalty | Increase to 1.1-1.2 |
| Slow responses | CPU inference | Enable MLX acceleration |

## 📝 Model Information
- **Format:** MLX 4-bit quantized
- **Languages:** English + 10 Indic languages
- **Context:** 4096 tokens
- **Based on:** Mistral Small architecture
- **Special Features:** Thinking mode, multi-language support

## 🔗 Working Example Commands

### **MLX-LM (Command Line):**
```bash
# Basic chat
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "[INST] Hello, how are you? [/INST]" --max-tokens 100

# With thinking
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "<s>[SYSTEM_PROMPT]Think deeply.[/SYSTEM_PROMPT][INST]Explain quantum physics[/INST]<think>" --max-tokens 200
```

### **Python Code:**
```python
from mlx_lm import load, generate

model, tokenizer = load('Jimmi42/sarvam-m-4bit-mlx')

# Format prompt correctly
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt, max_tokens=100)
print(response)
```
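
The same path also works for thinking mode: build the prompt from messages, then append the opening `<think>` tag from Option 3. This is a sketch; it assumes the bundled chat template accepts a system message and ends the prompt with `[/INST]`, so that appending `<think>` reproduces the manual format shown earlier.

```python
from mlx_lm import load, generate

model, tokenizer = load('Jimmi42/sarvam-m-4bit-mlx')

# System prompt + thinking mode (see Option 3); appending <think> assumes
# the templated prompt ends with [/INST].
messages = [
    {"role": "system", "content": "You are a helpful assistant. Think deeply before answering."},
    {"role": "user", "content": "Why is the sky blue?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>"

response = generate(model, tokenizer, prompt, max_tokens=300)
print(response)
```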

## ✅ Success Checklist
- [ ] Model loads in LM Studio with MLX enabled
- [ ] Chat template is set to Custom/Mistral format
- [ ] Test prompt: `[INST] Hello [/INST]` generates response
- [ ] Stop sequences configured correctly
- [ ] Generation parameters optimized
- [ ] Multi-language capability tested