Jimmi42 committed · Commit f6c7c50 · verified · 1 Parent(s): 0fd7176

Add LM Studio Setup Guide - Fix EOS token issue with proper chat formatting

Files changed (1):
  1. LM_Studio_Setup_Guide.md +127 -0
LM_Studio_Setup_Guide.md ADDED
# 🛠️ LM Studio Setup Guide for Sarvam-M 4-bit MLX

## 🔍 Problem Diagnosis
The "EOS token issue", where the model stops after only a few words, is caused by **incorrect prompt formatting**, not by the EOS token itself.

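A quick way to see the format the model expects is to let the bundled chat template build the prompt for you (the same `apply_chat_template` call used in the Python example near the end of this guide). A minimal sketch:

```python
# Sketch: print the prompt string the chat template produces for one user turn.
from mlx_lm import load

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")
messages = [{"role": "user", "content": "Hello"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# If the prompt you actually send lacks the [INST] ... [/INST] markers shown here,
# the model tends to emit EOS almost immediately.
```
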
## ✅ Solution: Proper Chat Format

### **Option 1: Use Chat Mode in LM Studio**
1. **Load the model** in LM Studio
2. **Switch to Chat mode** (not Playground mode)
3. **Set Chat Template** to "Custom" or "Mistral"
4. **Configure these settings:**
```
System Prompt: "You are a helpful assistant."

Chat Template Format:
<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]
```

### **Option 2: Manual Prompt Format**
If you are using Playground mode, format your prompts like this:

**Simple Format:**
```
[INST] Your question here [/INST]
```

**With System Prompt:**
```
<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT][INST]Your question here[/INST]
```

### **Option 3: Thinking Mode (Advanced)**
For reasoning tasks, use:
```
<s>[SYSTEM_PROMPT]You are a helpful assistant. Think deeply before answering the user's question. Do the thinking inside <think>...</think> tags.[/SYSTEM_PROMPT][INST]Your question here[/INST]<think>
```

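In thinking mode the visible answer is preceded by the model's `<think>...</think>` reasoning, so you may want to hide that block before displaying the reply. A minimal sketch (the `strip_thinking` helper is hypothetical, not part of the model repo):

```python
import re

def strip_thinking(text: str) -> str:
    """Drop a <think>...</think> block (or a dangling closing tag) from model output."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # When the prompt itself ends with <think>, the output may start mid-thought
    # and contain only the closing tag; keep just what follows it.
    return text.split("</think>")[-1].strip()

print(strip_thinking("light scattering favours shorter wavelengths</think> The sky appears blue because..."))
# -> "The sky appears blue because..."
```
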
## 🎛️ Recommended LM Studio Settings

### **Generation Parameters:**
- **Max Tokens:** 512-1024
- **Temperature:** 0.7-0.8
- **Top P:** 0.9
- **Repetition Penalty:** 1.1
- **Context Length:** 4096

### **Stop Sequences:**
Add these stop sequences:
- `</s>`
- `[/INST]`
- `\n\nUser:`
- `\n\nHuman:`

### **MLX Settings:**
- ✅ Enable MLX acceleration
- ✅ Use GPU memory
- Set batch size to 1-4

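If you run LM Studio's OpenAI-compatible local server instead of the chat UI, the same parameters and stop sequences can be passed per request. A minimal sketch, assuming the server is enabled on the default `http://localhost:1234` and that the model identifier below matches the name shown in LM Studio's server tab (repetition penalty is left to the UI setting here):

```python
# Sketch: one chat request to LM Studio's local server using the recommended
# sampling settings and stop sequences.
import requests

payload = {
    "model": "sarvam-m-4bit-mlx",  # assumed identifier; copy the exact name from LM Studio
    "messages": [{"role": "user", "content": "What is 2+2? Please explain your answer."}],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.9,
    "stop": ["</s>", "[/INST]", "\n\nUser:", "\n\nHuman:"],
}
resp = requests.post("http://localhost:1234/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```
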
## 🧪 Test Examples

### **Test 1: Basic Math**
```
Prompt: [INST] What is 2+2? Please explain your answer. [/INST]
Expected: The sum of 2 and 2 is **4**. [explanation follows]
```

### **Test 2: Reasoning**
```
Prompt: <s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>
Expected: <think>[reasoning]</think> The sky appears blue because...
```

### **Test 3: Hindi Language** ("What is the capital of India?")
```
Prompt: [INST] भारत की राजधानी क्या है? [/INST]
Expected: भारत की राजधानी **नई दिल्ली** है...
```

## 🚨 Common Issues & Fixes

| Issue | Cause | Solution |
|-------|-------|----------|
| Empty responses | No chat template | Use the `[INST]...[/INST]` format |
| Stops after a few words | Missing chat markers or an over-eager stop sequence | Use the `[INST]...[/INST]` format; if output is still truncated, remove `</s>` from the stop sequences |
| Repeating text | Repetition penalty too low | Increase it to 1.1-1.2 |
| Slow responses | CPU inference | Enable MLX acceleration |

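To check the formatting-related rows quickly, compare a bare prompt with a correctly formatted one; a minimal sketch using the same `mlx_lm` calls as the examples below:

```python
# Sketch: compare a bare prompt with a correctly formatted one.
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

bare = generate(model, tokenizer, "Hello, how are you?", max_tokens=50)
formatted = generate(model, tokenizer, "[INST] Hello, how are you? [/INST]", max_tokens=50)

print("bare:", repr(bare))            # typically empty or cut off after a few tokens
print("formatted:", repr(formatted))  # full reply
```
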
## 📝 Model Information
- **Format:** MLX 4-bit quantized
- **Languages:** English + 10 Indic languages
- **Context:** 4096 tokens
- **Based on:** Mistral Small architecture
- **Special Features:** Thinking mode, multi-language support

## 🔗 Working Example Commands

### **MLX-LM (Command Line):**
```bash
# Basic chat
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "[INST] Hello, how are you? [/INST]" --max-tokens 100

# With thinking
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "<s>[SYSTEM_PROMPT]Think deeply.[/SYSTEM_PROMPT][INST]Explain quantum physics[/INST]<think>" --max-tokens 200
```

### **Python Code:**
```python
from mlx_lm import load, generate

model, tokenizer = load('Jimmi42/sarvam-m-4bit-mlx')

# Format prompt correctly
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt, max_tokens=100)
print(response)
```
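
For thinking mode from Python, the same flow works with a system message plus a trailing `<think>`. A minimal sketch, assuming the chat template accepts a system role (as the `[SYSTEM_PROMPT]` format above suggests):

```python
# Sketch: thinking-mode variant. Appending "<think>" mirrors the manual format
# shown in Option 3; system-role support is an assumption.
from mlx_lm import load, generate

model, tokenizer = load('Jimmi42/sarvam-m-4bit-mlx')

messages = [
    {"role": "system", "content": "You are a helpful assistant. Think deeply before answering the user's question. Do the thinking inside <think>...</think> tags."},
    {"role": "user", "content": "Why is the sky blue?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "<think>"

response = generate(model, tokenizer, prompt, max_tokens=200)
print(response)
```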

## ✅ Success Checklist
- [ ] Model loads in LM Studio with MLX enabled
- [ ] Chat template is set to Custom/Mistral format
- [ ] Test prompt: `[INST] Hello [/INST]` generates a response
- [ ] Stop sequences configured correctly
- [ ] Generation parameters optimized
- [ ] Multi-language capability tested