---
library_name: transformers
license: apache-2.0
language:
- en
- bn
- hi
- kn
- gu
- mr
- ml
- or
- pa
- ta
- te
base_model:
- mistralai/Mistral-Small-3.1-24B-Base-2503
---

# Model Information

`sarvam-m` is a multilingual, hybrid-reasoning, text-only language model built on Mistral-Small. This post-trained version delivers exceptional improvements over the base model:

- +20% average improvement on Indian language benchmarks
- +21.6% enhancement on math benchmarks
- +17.6% boost on programming benchmarks

Performance gains are even more impressive at the intersection of Indian languages and mathematics, with an outstanding +86% improvement on romanized Indian language GSM-8K benchmarks.

Learn more about sarvam-m in our detailed [blog post](https://www.sarvam.ai/blogs/sarvam-m).

# Key Features

- **Hybrid Thinking Mode**: A single versatile model supporting both "think" and "non-think" modes. Use think mode for complex logical reasoning, mathematical problems, and coding tasks, or switch to non-think mode for efficient, general-purpose conversation.
- **Advanced Indic Skills**: Specifically post-trained on Indian languages alongside English, embodying a character that authentically reflects and emphasizes Indian cultural values.
- **Superior Reasoning Capabilities**: Outperforms most similarly-sized models on coding and math benchmarks, demonstrating exceptional reasoning abilities.
- **Seamless Chatting Experience**: Full support for both Indic scripts and romanized versions of Indian languages, providing a smooth and accessible multilingual conversation experience.

# Quickstart

The following code snippet demonstrates how to use `sarvam-m` with Transformers.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sarvamai/sarvam-m"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# prepare the model input
prompt = "Who are you and what is your purpose on this planet?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=True,  # Switches between thinking and non-thinking modes. Default is True.
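    # For non-think mode (a direct answer without the reasoning trace), set
    # enable_thinking=False here instead; the rest of the snippet stays the same.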
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(**model_inputs, max_new_tokens=8192)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :].tolist()
output_text = tokenizer.decode(output_ids)

if "</think>" in output_text:
    reasoning_content = output_text.split("</think>")[0].rstrip("\n")
    content = output_text.split("</think>")[-1].lstrip("\n").rstrip("</s>")
else:
    reasoning_content = ""
    content = output_text.rstrip("</s>")

print("reasoning content:", reasoning_content)
print("content:", content)
```

# How to use with Sarvam APIs

```python
from openai import OpenAI

base_url = "https://api.sarvam.ai/v1"
model_name = "sarvam-m"
api_key = "Your-API-Key"  # get it from https://dashboard.sarvam.ai/

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
).with_options(max_retries=1)

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "say hi"},
        {"role": "user", "content": "say hi"},
    ],
    stream=False,
    max_completion_tokens=2048,
    # reasoning_effort="low",  # set to "low", "medium", or "high" to enable reasoning
)

print(response.choices[0].message.content)

response1 = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You're a helpful AI assistant"},
        {"role": "user", "content": "Explain quantum computing in simple terms"},
    ],
    max_completion_tokens=4096,
    reasoning_effort="medium",  # Optional reasoning mode
)

print("First response:", response1.choices[0].message.content)

# Second turn (using previous response as context)
response2 = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You're a helpful AI assistant"},
        {"role": "user", "content": "Explain quantum computing in simple terms"},
        {"role": "assistant", "content": response1.choices[0].message.content},  # Previous response
        {"role": "user", "content": "Can you give an analogy for superposition?"},
    ],
    reasoning_effort="high",
    max_completion_tokens=8192,
)

print("Follow-up response:", response2.choices[0].message.content)
```

# vLLM Deployment

For easy deployment, we can use `vllm>=0.8.5` and create an OpenAI-compatible API endpoint with `vllm serve sarvamai/sarvam-m`.

For more control, we can query the vLLM server from Python. That way, we can explicitly enable or disable thinking mode.

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

messages = [{"role": "user", "content": "Why is 42 the best number?"}]

# By default, the model is in thinking mode.
# If you want to disable thinking, add:
# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
response = client.chat.completions.create(model=model, messages=messages)
output_text = response.choices[0].message.content

if "</think>" in output_text:
    reasoning_content = output_text.split("</think>")[0].rstrip("\n")
    content = output_text.split("</think>")[-1].lstrip("\n")
else:
    reasoning_content = ""
    content = output_text

print("reasoning content:", reasoning_content)
print("content:", content)

# For the next round, add the model's response directly as an assistant turn.
messages.append(
    {"role": "assistant", "content": output_text}
)
```
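The final `messages.append(...)` above sets up multi-turn chat. Continuing directly from that snippet, a minimal sketch of the next round simply adds another user turn and repeats the same call (the follow-up question below is only an illustrative example, not from the original card):

```python
# Continue the conversation from the previous snippet: add a new user turn
# and call the vLLM server again with the full message history.
messages.append(
    {"role": "user", "content": "Now argue why 73 might be a better number."}
)

follow_up = client.chat.completions.create(
    model=model,
    messages=messages,
    # As above, thinking can be disabled per request if desired:
    # extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(follow_up.choices[0].message.content)
```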