
# 🛠️ LM Studio Setup Guide for Sarvam-M 4-bit MLX

## 🔍 Problem Diagnosis

The "EOS token issue," where the model stops after only a few words, is caused by incorrect prompt formatting, not by the EOS token itself.

## ✅ Solution: Proper Chat Format

### Option 1: Use Chat Mode in LM Studio

1. Load the model in LM Studio.
2. Switch to **Chat** mode (not Playground mode).
3. Set the chat template to "Custom" or "Mistral".
4. Configure these settings:
   - **System Prompt:** `You are a helpful assistant.`
   - **Chat Template Format:**

     ```
     <s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]
     ```
    

### Option 2: Manual Prompt Format

If you are using Playground mode, format your prompts like this:

**Simple format:**

```
[INST] Your question here [/INST]
```

**With a system prompt:**

```
<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT][INST]Your question here[/INST]
```
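If you would rather script these prompts than type them by hand, the sketch below builds the same strings and generates with `mlx_lm`. The `format_prompt` helper is our own illustration, not part of any library:

```python
from typing import Optional

from mlx_lm import load, generate

def format_prompt(user: str, system: Optional[str] = None) -> str:
    """Build a prompt in the manual [INST] format shown above (illustrative helper)."""
    if system is None:
        return f"[INST] {user} [/INST]"
    return f"<s>[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]"

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")
prompt = format_prompt("What is 2+2?", system="You are a helpful assistant.")
print(generate(model, tokenizer, prompt, max_tokens=100))
```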

### Option 3: Thinking Mode (Advanced)

For reasoning tasks, end the prompt with an opening `<think>` tag so the model begins by reasoning:

```
<s>[SYSTEM_PROMPT]You are a helpful assistant. Think deeply before answering the user's question. Do the thinking inside <think>...</think> tags.[/SYSTEM_PROMPT][INST]Your question here[/INST]<think>
```
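In a script, you will usually want to separate the reasoning from the final answer. Here is a minimal sketch that assumes the model closes its reasoning with a `</think>` tag; the `strip_thinking` helper is illustrative, not part of `mlx_lm`:

```python
import re

from mlx_lm import load, generate

THINKING_SYSTEM = (
    "You are a helpful assistant. Think deeply before answering the "
    "user's question. Do the thinking inside <think>...</think> tags."
)

def strip_thinking(text: str) -> str:
    """Drop everything up to </think>; the prompt already opened the tag."""
    return re.sub(r"^.*?</think>", "", text, flags=re.DOTALL).strip()

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")
prompt = (
    f"<s>[SYSTEM_PROMPT]{THINKING_SYSTEM}[/SYSTEM_PROMPT]"
    "[INST]Why is the sky blue?[/INST]<think>"
)
raw = generate(model, tokenizer, prompt, max_tokens=300)
print(strip_thinking(raw))
```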

## 🎛️ Recommended LM Studio Settings

**Generation Parameters:**

- Max Tokens: 512-1024
- Temperature: 0.7-0.8
- Top P: 0.9
- Repetition Penalty: 1.1
- Context Length: 4096
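The same parameters can be applied outside LM Studio. The sketch below mirrors them with `mlx_lm`; it matches recent `mlx_lm` releases, where sampling options are passed via `make_sampler` and `make_logits_processors` (older versions took `temp` directly, so adjust for your installed version):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Mirror the LM Studio settings listed above.
sampler = make_sampler(temp=0.7, top_p=0.9)
processors = make_logits_processors(repetition_penalty=1.1)

response = generate(
    model,
    tokenizer,
    "[INST] What is 2+2? [/INST]",
    max_tokens=512,
    sampler=sampler,
    logits_processors=processors,
)
print(response)
```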

**Stop Sequences:**

Add these stop sequences:

- `</s>`
- `[/INST]`
- `\n\nUser:`
- `\n\nHuman:`
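LM Studio applies stop sequences in its UI; `mlx_lm`'s `generate` takes no stop-sequence argument, so when scripting you can trim the output yourself. The `truncate_at` helper below is our own illustration:

```python
def truncate_at(text, stops=("</s>", "[/INST]", "\n\nUser:", "\n\nHuman:")):
    """Cut the generation at the earliest stop sequence, if one appears."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

print(truncate_at("New Delhi is the capital.</s> stray tokens"))
# -> "New Delhi is the capital."
```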

**MLX Settings:**

- ✅ Enable MLX acceleration
- ✅ Use GPU memory
- Set batch size to 1-4

## 🧪 Test Examples

### Test 1: Basic Math

- Prompt: `[INST] What is 2+2? Please explain your answer. [/INST]`
- Expected: `The sum of 2 and 2 is **4**.` followed by an explanation

### Test 2: Reasoning

- Prompt: `<s>[SYSTEM_PROMPT]Think before answering.[/SYSTEM_PROMPT][INST]Why is the sky blue?[/INST]<think>`
- Expected: `<think>[reasoning]</think> The sky appears blue because...`

### Test 3: Hindi Language

- Prompt: `[INST] भारत की राजधानी क्या है? [/INST]` ("What is the capital of India?")
- Expected: `भारत की राजधानी **नई दिल्ली** है...` ("The capital of India is **New Delhi**...")
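To run all three checks in one go, a small harness like the sketch below works; it relies on the tokenizer's built-in chat template rather than hand-written `[INST]` strings:

```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

tests = [
    "What is 2+2? Please explain your answer.",
    "Why is the sky blue?",
    "भारत की राजधानी क्या है?",  # "What is the capital of India?"
]

for question in tests:
    messages = [{"role": "user", "content": question}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(question, "->", generate(model, tokenizer, prompt, max_tokens=200))
```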

## 🚨 Common Issues & Fixes

| Issue | Cause | Solution |
| --- | --- | --- |
| Empty responses | No chat template | Use the `[INST]...[/INST]` format |
| Stops after a few words | Wrong stop tokens | Remove `</s>` from the stop sequences if it truncates output |
| Repeating text | Repetition penalty too low | Increase it to 1.1-1.2 |
| Slow responses | CPU inference | Enable MLX acceleration |
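For the first two rows, it helps to confirm what the tokenizer actually treats as its EOS token and whether a chat template ships with the model. A quick check (the attribute names follow the standard Hugging Face tokenizer API, which `mlx_lm`'s tokenizer wrapper exposes):

```python
from mlx_lm import load

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")
print("EOS token:", tokenizer.eos_token, "(id:", tokenizer.eos_token_id, ")")
print("Chat template present:", tokenizer.chat_template is not None)
```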

## 📝 Model Information

- **Format:** MLX 4-bit quantized
- **Languages:** English + 10 Indic languages
- **Context:** 4096 tokens
- **Based on:** Mistral Small architecture
- **Special features:** thinking mode, multi-language support

## 🔗 Working Example Commands

**MLX-LM (command line):**

```bash
# Basic chat
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx \
  --prompt "[INST] Hello, how are you? [/INST]" --max-tokens 100

# With thinking
python -m mlx_lm.generate --model Jimmi42/sarvam-m-4bit-mlx \
  --prompt "<s>[SYSTEM_PROMPT]Think deeply.[/SYSTEM_PROMPT][INST]Explain quantum physics[/INST]<think>" \
  --max-tokens 200
```

**Python code:**

```python
from mlx_lm import load, generate

model, tokenizer = load("Jimmi42/sarvam-m-4bit-mlx")

# Format the prompt correctly via the tokenizer's chat template.
messages = [{"role": "user", "content": "What is AI?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt, max_tokens=100)
print(response)
```

## ✅ Success Checklist

- [ ] Model loads in LM Studio with MLX enabled
- [ ] Chat template is set to the Custom/Mistral format
- [ ] Test prompt `[INST] Hello [/INST]` generates a full response
- [ ] Stop sequences are configured correctly
- [ ] Generation parameters are optimized
- [ ] Multi-language capability tested