Supercharge Edge AI With High‑Accuracy Reasoning Using NVIDIA Nemotron Nano 2 9B

Community Article Published August 18, 2025

nvidia

Sharath Turuvekere Sreenivas

sharathts

nvidia

Ali Taghibakhshi

jrd971000

nvidia

AI agents are becoming mainstream from edge to cloud, using sophisticated reasoning and iterative planning to autonomously solve complex, multi-step problems. To get the best performance out of these agents at the edge, developers need models that are not only accurate but also highly efficient.

The newly released NVIDIA Nemotron Nano 2 9B brings these capabilities to the edge. It pairs leading accuracy and efficiency, through a hybrid Transformer–Mamba architecture, with a configurable thinking budget, so you can dial accuracy, throughput, and cost to match your real‑world needs.

You can try this model out now at build.nvidia.com

Highlights (TL;DR)

Model size: 9B Parameters
Architecture: Hybrid Transformer–Mamba (Mamba‑2 + a small number of attention layers) for higher throughput at similar accuracy to Transformer‑only peers.
Throughput: Up to 6x higher token-generation throughput than other leading models in its size class.
Cost: A thinking budget lets you control how many “thinking” tokens are used, reducing reasoning costs by up to 60%.
Target: Agents for customer service, support chatbots, analytics copilots, and edge/RTX deployments.
Availability: The model weights are available on Hugging Face, you can try the endpoint on build.nvidia.com, and the model will soon be available as an NVIDIA NIM for high throughput and low latency.
License: nvidia-open-model-license

What Is Nemotron Nano 2?

Nemotron Nano 2 is the newest “Nano” model in the NVIDIA Nemotron family of open models and is purpose-built for enterprise‑grade reasoning and agentic AI. It introduces a configurable thinking budget (you control how much internal reasoning the model does) and a hybrid Transformer-Mamba backbone to raise throughput while preserving accuracy, making it great for PC/edge footprints and cost control.

NVIDIA is releasing the Nemotron family of models to support the open-source community with open weights, open datasets, and training techniques. We encourage developers to use different parts or the whole of Nemotron to improve their models for their specific use cases.

Like other models in the suite, Nemotron Nano 2 leads its size category in accuracy across reasoning tasks such as math, coding, and science, while remaining an effective model for agentic workflows by excelling at both instruction following and function calling.

Figure 1: Chart showing accuracy of Nemotron Nano 2 9B on various popular benchmarks

Alongside best-in-class accuracy, Nemotron Nano 2 delivers high throughput thanks to its hybrid Transformer-Mamba architecture. This lets the model produce those critical thinking tokens at a pace well suited to latency-sensitive environments. As shown in Figure 2, Nemotron Nano 2 achieves up to 6x higher throughput than the next-best open alternative.

Figure 2: Comparison of Throughput and Accuracy of Nemotron Nano 2 9B and Qwen 3 8B

Beyond that, a user-defined thinking budget lets developers right-size the amount of “thinking” the model does, saving tokens while retaining high accuracy. This selective cutoff strategy can reduce unnecessary token generation, lowering inference costs by up to 60% without significantly impacting accuracy.

Figure 3: Chart showing the accuracy of Nemotron Nano 2 9B model on popular benchmarks at various “Thinking Budget” thresholds

How We Built Nemotron Nano 2

Hybrid Architecture: Nemotron Nano 2 uses a hybrid Transformer–Mamba backbone built for reasoning‑heavy, long‑output workloads. Most layers are Mamba‑2 selective state‑space modules, which run in linear time and keep a fixed-size state no matter how long the sequence gets. Because they don’t accumulate a growing KV-cache, they handle long “thinking” traces efficiently, yielding higher tokens‑per‑second and lower memory use. Interleaved among them are a small number of attention “islands” that preserve the Transformer’s strength in content‑based global jumps, which is useful for linking distant facts or instructions. In practice, the hybrid keeps Transformer‑grade accuracy while leaning on Mamba for higher throughput.
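To make the memory argument concrete, here is a rough back-of-the-envelope sketch comparing a growing KV-cache with a fixed-size recurrent state. The layer counts, head sizes, and state size are hypothetical placeholders chosen for illustration, not the real Nemotron Nano 2 configuration.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Attention caches one key and one value vector per token, per layer."""
    return 2 * seq_len * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem=4):
    """A state-space layer keeps one fixed-size state; seq_len does not appear."""
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem


MIB = 2**20

# Hypothetical shapes, for illustration only.
for seq_len in (1_024, 16_384, 131_072):
    kv = kv_cache_bytes(seq_len, n_attn_layers=32, n_kv_heads=8, head_dim=128)
    ssm = ssm_state_bytes(n_ssm_layers=32, state_elems_per_layer=1_000_000)
    print(f"{seq_len:>7} tokens: KV-cache {kv / MIB:9.1f} MiB, SSM state {ssm / MIB:6.1f} MiB")
```

The KV-cache term scales linearly with the number of generated tokens, while the state-space term stays flat, which is why long reasoning traces favor a backbone with only a few attention layers.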

Post-Training Process: On the post-training side, the model undergoes supervised fine-tuning (SFT) on a balanced mixture of reasoning-on and reasoning-off data spanning mathematics, science, programming, tool use, general conversation, and safety. This process is conducted in multiple stages to strengthen performance in specific domains, such as improving tool-calling reliability and enhancing long-context comprehension. Following SFT, the model is further refined through focused reinforcement learning and preference-based optimization, ensuring alignment with desired behaviors and robustness across a wide range of tasks.

Model Compression and Distillation: Nemotron Nano 2 starts from a 12B hybrid Mamba-Transformer base model, NVIDIA-Nemotron-Nano-12B-v2-Base, which was post-trained and aligned for various reasoning and non-reasoning tasks. This post-trained 12B model sets the accuracy bar and serves as the teacher for the pruned and distilled Nano 2 (9B). The 12B parameter model consumes 22.9 GiB of memory for its weights alone (in bfloat16 precision), which exceeds the 22 GiB capacity of the NVIDIA A10G GPU. We therefore compress the 12B parameter model by pruning it to obtain a smaller 9B parameter model. Nemotron Nano 2 is designed to fit within the A10G’s memory limits while running 128k context inference. For compression, we set the model’s budget to 19.66 GiB, leaving a 5% buffer for frameworks like vLLM and 1.3 GiB for a vision encoder. Nemotron Nano 2 is also designed to achieve significantly higher throughput than pure Transformer-based models in reasoning settings (e.g., input/output sequence lengths, ISL/OSL, of 8k/16k) while retaining accuracy.
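The memory figures above can be sanity-checked with simple arithmetic. The sketch below assumes nominal parameter counts (exactly 12e9 and 9e9), 2 bytes per bfloat16 weight, and one plausible reading of the budget (capacity minus a 5% buffer, minus 1.3 GiB). The helper names are illustrative, and the exact accounting behind the published numbers may differ.

```python
GIB = 2**30


def weights_gib(n_params, bytes_per_param=2):
    """Memory for the weights alone; bfloat16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / GIB


def compression_budget_gib(capacity_gib, framework_buffer=0.05, vision_encoder_gib=1.3):
    """Assumed reading: reserve a fractional buffer, then subtract the vision encoder."""
    return capacity_gib * (1 - framework_buffer) - vision_encoder_gib


teacher = weights_gib(12e9)            # nominal 12B teacher
student = weights_gib(9e9)             # nominal 9B student
budget = compression_budget_gib(22.0)  # A10G capacity as quoted in the article

print(f"12B weights: {teacher:.2f} GiB")
print(f"9B weights:  {student:.2f} GiB")
print(f"budget:      {budget:.2f} GiB")
print("9B fits within budget:", student < budget)
```

With these nominal inputs the teacher comes out near 22.35 GiB and the budget near 19.6 GiB, slightly below the quoted 22.9 GiB and 19.66 GiB. The gaps are consistent with a parameter count a little above 12B and a usable capacity a little above 22 GiB, though that explanation is an inference rather than something the article states.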

Figure 4: Chart showing the model training flow for NVIDIA Nemotron Nano 9B V2

To produce the compressed model, we built on the Minitron model compression framework, extending its Neural Architecture Search (NAS) module to find the best architecture within our memory budget. This search involved combinatorial pruning across multiple axes: depth (reducing the original 62 layers to 56), embedding channels, FFN dimension, and Mamba heads. To make this search computationally feasible, we split the search into two phases: (1) determine the optimal depth to prevent significant accuracy degradation (found to be 56 layers in this work), and (2) perform width pruning to find the best configuration at that depth. To recover performance lost during pruning, we retrained the selected candidate architecture using logit-based knowledge distillation, with the original 12B model serving as the teacher. This phase involved using a forward KL divergence loss to transfer knowledge, first with a short distillation run to select the top-performing architecture, followed by a longer distillation run to create the final Nemotron Nano 2 model.
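As a minimal illustration of the forward KL objective used in logit-based distillation, the sketch below computes KL(teacher || student) from raw logits using only the standard library. It is a toy, per-position version with hypothetical function names. The actual training code, batching, and any temperature or weighting terms are not described in this article.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student): the expectation is taken under the teacher."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(teacher_seq, student_seq):
    """Mean forward KL over the token positions of one sequence."""
    per_token = [forward_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)


# Toy example: two token positions over a four-word vocabulary.
teacher = [[2.0, 1.0, 0.1, -1.0], [0.0, 3.0, 0.5, 0.2]]
student = [[1.5, 1.2, 0.0, -0.5], [0.1, 2.0, 1.0, 0.0]]
print(distillation_loss(teacher, student))
```

Because the expectation is taken under the teacher's distribution, the student is penalized most where it assigns low probability to tokens the teacher considers likely.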

You can read about this in more detail in the technical report.

What is a “Thinking Budget”?

The thinking budget lets you set a limit for internal reasoning. This is achieved by inserting the </think> tag once the limit is reached, after which the model stops thinking and moves on to the final response.

We'll look at an example of how you could create a client with this functionality below. The thinking budget will also be included automatically in the downloadable NIM.

This thinking budget allows developers to keep accuracy high and meet response‑time targets - which is especially crucial for customer support, autonomous agent steps, and edge devices where every millisecond counts. Where this is most useful:

Customer service/chatbots with strict SLAs
Edge agents on NVIDIA RTX/Jetson (limited memory/thermal)
Developer/analytics copilots doing multi‑hop tool use
RAG pipelines where you need predictable step times

Because the model's behavior under a given thinking budget varies by domain, you can use Figure 3 as a guideline for choosing a starting budget. Ultimately, it will take some experimentation to arrive at the right budget for your task.
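One simple way to structure that experimentation is to sweep a few budgets on your own evaluation set and keep the smallest one that stays within a tolerance of your full-budget accuracy. The helper below is a hypothetical sketch, and the accuracy numbers are made up purely to show the mechanics. They are not measurements of Nemotron Nano 2.

```python
def smallest_budget_meeting(results, baseline_accuracy, max_drop=0.01):
    """Pick the smallest budget whose accuracy is within max_drop of the baseline.

    results maps thinking budget (tokens) -> accuracy measured on your own eval set.
    Returns None when no budget in the sweep is good enough.
    """
    acceptable = [
        budget
        for budget, accuracy in results.items()
        if accuracy >= baseline_accuracy - max_drop
    ]
    return min(acceptable) if acceptable else None


# Made-up sweep results, for illustration only.
sweep = {256: 0.71, 512: 0.80, 1024: 0.86, 2048: 0.88, 4096: 0.885}
chosen = smallest_budget_meeting(sweep, baseline_accuracy=0.89, max_drop=0.02)
print(chosen)
```

Tightening or loosening `max_drop` makes the accuracy-versus-cost trade-off explicit for each domain you care about.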

How To Use The Nemotron Nano 2 Model

Similar to other Nemotron reasoning models, this model has two thinking modes: Reasoning "ON", which outputs a reasoning chain-of-thought wrapped in thinking tokens, and Reasoning "OFF", which moves directly to the final response without generating thinking tokens. Reasoning is “ON” by default with this model.

When using Reasoning "ON", we recommend a temperature of 0.6 and a top_p of 0.95.
To use Reasoning "OFF", provide /no_think in the system prompt. In this mode, we recommend a temperature of 0.
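These recommendations can be captured in a small helper that builds the request arguments for either mode. This is a sketch: `build_chat_request` is a hypothetical name, and it assumes an OpenAI-compatible chat endpoint and the /think and /no_think system-prompt controls used in this article.

```python
from typing import Any, Dict


def build_chat_request(
    model: str,
    user_prompt: str,
    reasoning: bool = True,
    system_prompt: str = "You are a helpful assistant.",
) -> Dict[str, Any]:
    """Return keyword arguments for an OpenAI-style chat completion call."""
    if reasoning:
        # Reasoning "ON": recommended sampling settings from the article.
        control, sampling = "/think", {"temperature": 0.6, "top_p": 0.95}
    else:
        # Reasoning "OFF": /no_think in the system prompt, temperature 0.
        control, sampling = "/no_think", {"temperature": 0.0}
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": f"{system_prompt} {control}"},
            {"role": "user", "content": user_prompt},
        ],
        **sampling,
    }


request = build_chat_request(
    "nvidia/NVIDIA-Nemotron-Nano-9B-v2", "What is 2+2?", reasoning=False
)
print(request)
```

The resulting dictionary can be unpacked into an OpenAI-compatible client call, for example `client.chat.completions.create(**request)`.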

Let’s start by spinning up a vLLM server for our model:

vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 --trust-remote-code --mamba_ssm_cache_dtype float32

Now that we have our server up and running, let’s set up a client that implements our thinking budget on the client side:

from typing import Any, Dict, List
import openai
from transformers import AutoTokenizer

class ThinkingBudgetClient:
   def __init__(self, base_url: str, api_key: str, tokenizer_name_or_path: str):
       self.base_url = base_url
       self.api_key = api_key
       self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
       self.client = openai.OpenAI(base_url=self.base_url, api_key=self.api_key)

   def chat_completion(
       self,
       model: str,
       messages: List[Dict[str, Any]],
       max_thinking_budget: int = 512,
       max_tokens: int = 1024,
       **kwargs,
   ) -> Dict[str, Any]:
       assert (
           max_tokens > max_thinking_budget
       ), f"thinking budget must be smaller than maximum new tokens. Given {max_tokens=} and {max_thinking_budget=}"


       # 1. first call chat completion to get reasoning content
       response = self.client.chat.completions.create(
           model=model, messages=messages, max_tokens=max_thinking_budget, **kwargs
       )
       content = response.choices[0].message.content


       reasoning_content = content
       if "</think>" not in reasoning_content:
           # the budget ran out mid-thought: end the sentence with a period (.) and close the thinking block
           reasoning_content = f"{reasoning_content}.\n</think>\n\n"
       reasoning_tokens_len = len(
           self.tokenizer.encode(reasoning_content, add_special_tokens=False)
       )
       remaining_tokens = max_tokens - reasoning_tokens_len
       assert (
           remaining_tokens > 0
       ), f"remaining tokens must be positive. Given {remaining_tokens=}. Increase the max_tokens or lower the max_thinking_budget."

       # 2. append reasoning content to messages and call completion
       # build a new list rather than appending in place, so the caller's messages are not mutated
       messages = [*messages, {"role": "assistant", "content": reasoning_content}]
       prompt = self.tokenizer.apply_chat_template(
           messages,
           tokenize=False,
           continue_final_message=True,
       )
       response = self.client.completions.create(
           model=model, prompt=prompt, max_tokens=remaining_tokens, **kwargs
       )

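       response_data = {
           # removesuffix (Python 3.9+) drops only the closing tag; str.strip("</think>")
           # would also eat any leading or trailing characters from the set "</think>"
           "reasoning_content": reasoning_content.strip().removesuffix("</think>").strip(),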
           "content": response.choices[0].text,
           "finish_reason": response.choices[0].finish_reason,
       }
       return response_data

Let’s call our vLLM backend through our thinking-budget client. As an example, we’ll restrict the budget to 32 tokens.

tokenizer_name_or_path = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
client = ThinkingBudgetClient(
   base_url="http://localhost:8000/v1",
   api_key="EMPTY",
   tokenizer_name_or_path=tokenizer_name_or_path,
)

result = client.chat_completion(
   model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
   messages=[
       {"role": "system", "content": "You are a helpful assistant. /think"},
       {"role": "user", "content": "What is 2+2?"},
   ],
   max_thinking_budget=32,
   max_tokens=32768, # can be set up to a maximum of 131072
   temperature=0.6,
   top_p=0.95,
)
print(result)

You should see output similar to the following:

{'reasoning_content': "Okay, the user asked, What is 2+2? Let me think. Well, 2 plus 2 equals 4. That's a basic.", 'content': '2 + 2 equals **4**.\n', 'finish_reason': 'stop'}

Get Started

To summarize, Nemotron Nano 2 9B offers leading accuracy among models in a similar parameter range while delivering up to 6x higher throughput than the next-best open alternative. Enterprises can also save up to 60% in inference costs with the new “Thinking Budget” feature.

NVIDIA also open-sourced a number of additional technical artifacts (including post-training and pre-training datasets) which you can read about here.

You can get started with Nemotron Nano 9B V2 in the following way:

Download from Hugging Face

Coming soon, you’ll be able to download and deploy this model through NVIDIA NIM as well!
