|
# 🛠️ LM Studio Setup Guide for Sarvam-M 4-bit MLX |
|
|
|
## 🔍 Problem Diagnosis |
|
The "EOS token issue" where the model stops after a few words is caused by **incorrect prompt formatting**, not the EOS token itself. |
|
|
|
## ✅ Solution: Proper Chat Format |
|
|
|
### **Option 1: Use Chat Mode in LM Studio** |
|
1. **Load the model** in LM Studio |
|
2. **Switch to Chat mode** (not Playground mode) |
|
3. **Set Chat Template** to "Custom" or "Mistral" |
|
4. **Configure these settings:** |
|
```
System Prompt: "You are a helpful assistant."

Chat Template Format:
<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]
```
|
|
|
### **Option 2: Manual Prompt Format** |
|
If using Playground mode, format your prompts like this: |
|
|
|
**Simple Format:** |
|
```
[INST] Your question here [/INST]
```
|
|
|
**With System Prompt:** |
|
```
<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT][INST]Your question here[/INST]
``` |
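
If you drive the model through LM Studio's local server instead of the UI, the same manual format can be sent to the raw completions endpoint. A hedged sketch (the default port 1234 and the model id are assumptions; check the Server tab in LM Studio):

```python
import requests

# Manually formatted prompt, matching the "With System Prompt" example above
prompt = (
    "<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT]"
    "[INST]What is AI?[/INST]"
)

# LM Studio exposes an OpenAI-compatible server; /v1/completions takes raw prompts
resp = requests.post(
    "http://localhost:1234/v1/completions",
    json={
        "model": "sarvam-m-4bit-mlx",  # assumed id; copy the exact one from LM Studio
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```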
|
|
|
### **Option 3: Thinking Mode (Advanced)** |
|
For reasoning tasks, use: |
|
```
<s>[SYSTEM_PROMPT]You are a helpful assistant. Think deeply before answering the user's question. Do the thinking inside <think>...</think> tags.[/SYSTEM_PROMPT][INST]Your question here[/INST]<think>
``` |
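
Because the prompt ends with an opening `<think>` tag, the completion starts inside the reasoning block. A small helper to separate the reasoning from the final answer (illustrative names, not part of the model's API):

```python
import re

def split_thinking(completion: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a thinking-mode completion."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", completion, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", completion.strip()  # no closing tag: treat it all as the answer

# The prompt already supplied "<think>", so prepend it before parsing:
raw_output = "Light scatters more at short wavelengths...</think> The sky looks blue because..."
reasoning, answer = split_thinking("<think>" + raw_output)
print(answer)
```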
|
|
|
## 🎛️ Recommended LM Studio Settings |
|
|
|
### **Generation Parameters:** |
|
- **Max Tokens:** 512-1024 |
|
- **Temperature:** 0.7-0.8 |
|
- **Top P:** 0.9 |
|
- **Repetition Penalty:** 1.1 |
|
- **Context Length:** 4096 |
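
The same parameters can also be set programmatically. A hedged sketch for recent `mlx_lm` releases (the sampler helpers shown here exist in current versions; older releases passed `temp` and `repetition_penalty` directly to `generate`):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Mirror the recommended settings above
sampler = make_sampler(temp=0.7, top_p=0.9)
processors = make_logits_processors(repetition_penalty=1.1)

text = generate(
    model, tokenizer,
    prompt="[INST] What is 2+2? [/INST]",
    max_tokens=512,
    sampler=sampler,
    logits_processors=processors,
)
print(text)
```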
|
|
|
### **Stop Sequences:** |
|
Add these stop sequences (if responses get cut off mid-sentence, try removing `</s>`, the model's EOS token):
|
- `</s>` |
|
- `[/INST]` |
|
- `\n\nUser:` |
|
- `\n\nHuman:` |
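
When calling the model through LM Studio's OpenAI-compatible server, the same stop sequences can be passed per request. A hedged sketch (the default port and model id are assumptions):

```python
from openai import OpenAI

# LM Studio's local server accepts any API key string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="sarvam-m-4bit-mlx",  # assumed id; copy the exact one from LM Studio
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=512,
    temperature=0.7,
    stop=["</s>", "[/INST]", "\n\nUser:", "\n\nHuman:"],
)
print(reply.choices[0].message.content)
```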
|
|
|
### **MLX Settings:** |
|
- ✅ Enable MLX acceleration |
|
- ✅ Use GPU memory |
|
- Set batch size to 1-4 |
|
|
|
## 🧪 Test Examples |
|
|
|
### **Test 1: Basic Math** |
|
```
Prompt: [INST] What is 2+2? Please explain your answer. [/INST]
Expected: The sum of 2 and 2 is **4**. [explanation follows]
```
|
|
|
### **Test 2: Reasoning** |
|
```
Prompt: <s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>
Expected: <think>[reasoning]</think> The sky appears blue because...
```
|
|
|
### **Test 3: Hindi Language** |
|
```
Prompt: [INST] भारत की राजधानी क्या है? [/INST]
Expected: भारत की राजधानी **नई दिल्ली** है...
``` |
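
To run all three checks outside LM Studio, a short `mlx_lm` script works as well (a minimal sketch; outputs will vary run to run):

```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

tests = [
    "[INST] What is 2+2? Please explain your answer. [/INST]",
    "<s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>",
    "[INST] भारत की राजधानी क्या है? [/INST]",
]
for prompt in tests:
    print(generate(model, tokenizer, prompt, max_tokens=200))
    print("-" * 40)
```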
|
|
|
## 🚨 Common Issues & Fixes |
|
|
|
| Issue | Cause | Solution |
|-------|-------|----------|
| Empty responses | No chat template | Use the `[INST]...[/INST]` format |
| Stops after a few words | Incorrect prompt formatting | Apply the chat template above; if output is still truncated, remove `</s>` from the stop sequences |
| Repeating text | Repetition penalty too low | Increase it to 1.1-1.2 |
| Slow responses | CPU inference | Enable MLX acceleration |
|
|
|
## 📝 Model Information |
|
- **Format:** MLX 4-bit quantized |
|
- **Languages:** English + 10 Indic languages |
|
- **Context:** 4096 tokens |
|
- **Based on:** Mistral Small architecture |
|
- **Special Features:** Thinking mode, multi-language support |
|
|
|
## 🔗 Working Example Commands |
|
|
|
### **MLX-LM (Command Line):** |
|
```bash
# Basic chat
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "[INST] Hello, how are you? [/INST]" --max-tokens 100

# With thinking
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "<s>[SYSTEM_PROMPT]Think deeply.[/SYSTEM_PROMPT][INST]Explain quantum physics[/INST]<think>" --max-tokens 200
```
|
|
|
### **Python Code:** |
|
```python
from mlx_lm import load, generate

model, tokenizer = load('Jimmi42/sarvam-m-4bit-mlx')

# Apply the chat template so the prompt gets the [INST] wrapper
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt, max_tokens=100)
print(response)
```
|
|
|
## ✅ Success Checklist |
|
- [ ] Model loads in LM Studio with MLX enabled |
|
- [ ] Chat template is set to Custom/Mistral format |
|
- [ ] Test prompt: `[INST] Hello [/INST]` generates response |
|
- [ ] Stop sequences configured correctly |
|
- [ ] Generation parameters optimized |
|
- [ ] Multi-language capability tested |