|
# 🛠️ LM Studio Setup Guide for Sarvam-M 4-bit MLX |
|
|
|
## 🔍 Problem Diagnosis |
|
The "EOS token issue" where the model stops after a few words is caused by **incorrect prompt formatting**, not the EOS token itself. |
|
|
|
## ✅ Solution: Proper Chat Format |
|
|
|
### **Option 1: Use Chat Mode in LM Studio** |
|
1. **Load the model** in LM Studio |
|
2. **Switch to Chat mode** (not Playground mode) |
|
3. **Set Chat Template** to "Custom" or "Mistral" |
|
4. **Configure these settings:** |
|
```
System Prompt: "You are a helpful assistant."

Chat Template Format:
<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]
```
|
|
|
### **Option 2: Manual Prompt Format** |
|
If using Playground mode, format your prompts like this: |
|
|
|
**Simple Format:** |
|
```
[INST] Your question here [/INST]
```
|
|
|
**With System Prompt:** |
|
```
<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT][INST]Your question here[/INST]
``` |
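
If you drive the model through LM Studio's local server instead of the UI, the same manual format can be sent to the raw completions endpoint. A hedged sketch (the default port 1234 and the model id are assumptions; check the Server tab in LM Studio):

```python
import requests

# Manually formatted prompt, matching the "With System Prompt" example above
prompt = (
    "<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT]"
    "[INST]What is AI?[/INST]"
)

# LM Studio exposes an OpenAI-compatible server; /v1/completions takes raw prompts
resp = requests.post(
    "http://localhost:1234/v1/completions",
    json={
        "model": "sarvam-m-4bit-mlx",  # assumed id; copy the exact one from LM Studio
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```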
|
|
|
### **Option 3: Thinking Mode (Advanced)** |
|
For reasoning tasks, use: |
|
```
<s>[SYSTEM_PROMPT]You are a helpful assistant. Think deeply before answering the user's question. Do the thinking inside <think>...</think> tags.[/SYSTEM_PROMPT][INST]Your question here[/INST]<think>
``` |
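
Because the prompt ends with an opening `<think>` tag, the completion starts inside the reasoning block. A small helper to separate the reasoning from the final answer (illustrative names, not part of the model's API):

```python
import re

def split_thinking(completion: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a thinking-mode completion."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", completion, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", completion.strip()  # no closing tag: treat it all as the answer

# The prompt already supplied "<think>", so prepend it before parsing:
raw_output = "Light scatters more at short wavelengths...</think> The sky looks blue because..."
reasoning, answer = split_thinking("<think>" + raw_output)
print(answer)
```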
|
|
|
## 🎛️ Recommended LM Studio Settings |
|
|
|
### **Generation Parameters:** |
|
- **Max Tokens:** 512-1024 |
|
- **Temperature:** 0.7-0.8 |
|
- **Top P:** 0.9 |
|
- **Repetition Penalty:** 1.1 |
|
- **Context Length:** 4096 |
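
The same parameters can also be set programmatically. A hedged sketch for recent `mlx_lm` releases (the sampler helpers shown here exist in current versions; older releases passed `temp` and `repetition_penalty` directly to `generate`):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Mirror the recommended settings above
sampler = make_sampler(temp=0.7, top_p=0.9)
processors = make_logits_processors(repetition_penalty=1.1)

text = generate(
    model, tokenizer,
    prompt="[INST] What is 2+2? [/INST]",
    max_tokens=512,
    sampler=sampler,
    logits_processors=processors,
)
print(text)
```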
|
|
|
### **Stop Sequences:** |
|
Add these stop sequences (if responses get cut off mid-sentence, try removing `</s>`, the model's EOS token):
|
- `</s>` |
|
- `[/INST]` |
|
- `\n\nUser:` |
|
- `\n\nHuman:` |
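
When calling the model through LM Studio's OpenAI-compatible server, the same stop sequences can be passed per request. A hedged sketch (the default port and model id are assumptions):

```python
from openai import OpenAI

# LM Studio's local server accepts any API key string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="sarvam-m-4bit-mlx",  # assumed id; copy the exact one from LM Studio
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=512,
    temperature=0.7,
    stop=["</s>", "[/INST]", "\n\nUser:", "\n\nHuman:"],
)
print(reply.choices[0].message.content)
```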
|
|
|
### **MLX Settings:** |
|
- ✅ Enable MLX acceleration |
|
- ✅ Use GPU memory |
|
- Set batch size to 1-4 |
|
|
|
## 🧪 Test Examples |
|
|
|
### **Test 1: Basic Math** |
|
```
Prompt: [INST] What is 2+2? Please explain your answer. [/INST]
Expected: The sum of 2 and 2 is **4**. [explanation follows]
```
|
|
|
### **Test 2: Reasoning** |
|
```
Prompt: <s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>
Expected: <think>[reasoning]</think> The sky appears blue because...
```
|
|
|
### **Test 3: Hindi Language** |
|
```
Prompt: [INST] भारत की राजधानी क्या है? [/INST]
Expected: भारत की राजधानी **नई दिल्ली** है...
``` |
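
To run all three checks outside LM Studio, a short `mlx_lm` script works as well (a minimal sketch; outputs will vary run to run):

```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

tests = [
    "[INST] What is 2+2? Please explain your answer. [/INST]",
    "<s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>",
    "[INST] भारत की राजधानी क्या है? [/INST]",
]
for prompt in tests:
    print(generate(model, tokenizer, prompt, max_tokens=200))
    print("-" * 40)
```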
|
|
|
## 🚨 Common Issues & Fixes |
|
|
|
| Issue | Cause | Solution |
|-------|-------|----------|
| Empty responses | No chat template | Use the `[INST]...[/INST]` format |
| Stops after a few words | Incorrect prompt formatting | Apply the chat template above; if output is still truncated, remove `</s>` from the stop sequences |
| Repeating text | Repetition penalty too low | Increase it to 1.1-1.2 |
| Slow responses | CPU inference | Enable MLX acceleration |
|
|
|
## 📝 Model Information |
|
- **Format:** MLX 4-bit quantized |
|
- **Languages:** English + 10 Indic languages |
|
- **Context:** 4096 tokens |
|
- **Based on:** Mistral Small architecture |
|
- **Special Features:** Thinking mode, multi-language support |
|
|
|
## 🔗 Working Example Commands |
|
|
|
### **MLX-LM (Command Line):** |
|
```bash
# Basic chat
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "[INST] Hello, how are you? [/INST]" --max-tokens 100

# With thinking
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx --prompt "<s>[SYSTEM_PROMPT]Think deeply.[/SYSTEM_PROMPT][INST]Explain quantum physics[/INST]<think>" --max-tokens 200
```
|
|
|
### **Python Code:** |
|
```python
from mlx_lm import load, generate

model, tokenizer = load('Jimmi42/sarvam-m-4bit-mlx')

# Apply the chat template so the prompt gets the [INST] wrapper
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt, max_tokens=100)
print(response)
```
|
|
|
## ✅ Success Checklist |
|
- [ ] Model loads in LM Studio with MLX enabled |
|
- [ ] Chat template is set to Custom/Mistral format |
|
- [ ] Test prompt: `[INST] Hello [/INST]` generates response |
|
- [ ] Stop sequences configured correctly |
|
- [ ] Generation parameters optimized |
|
- [ ] Multi-language capability tested |