Measuring Open-Source Llama Nemotron Models on DeepResearch Bench
Contributors: David Austin, Raja Biswas, Gilberto Titericz Junior, NVIDIA
NVIDIA’s AI-Q Blueprint, the leading portable, open deep research agent, recently climbed to the top of the “LLM with Search” category of the DeepResearch Bench leaderboard on Hugging Face. This is a significant step forward for the open-source AI stack: it shows that developer-accessible models can power advanced agentic workflows that rival or surpass closed alternatives.
What sets AI-Q apart? It fuses two high-performance open LLMs—Llama 3.3-70B Instruct and Llama-3.3-Nemotron-Super-49B-v1.5—to orchestrate long-context retrieval, agentic reasoning, and robust synthesis.
Core Stack: Model Choices and Technical Innovations
- Llama 3.3-70B Instruct: The foundation for fluent, structured report generation. Derived from Meta’s Llama series, it is openly available under the Llama 3.3 Community License.
- Llama-3.3-Nemotron-Super-49B-v1.5: An optimized, reasoning-focused variant. Built via Neural Architecture Search (NAS), knowledge distillation, and successive rounds of supervised and reinforcement learning, it excels at multi-step reasoning, query planning, tool use, and reflection—all with a reduced memory footprint for efficient deployment on standard GPUs.
The AI-Q reference example also includes:
- NVIDIA NeMo Retriever for scalable, multimodal search across internal and external sources.
- NVIDIA NeMo Agent toolkit for orchestrating complex, multistep agentic workflows.
The architecture supports parallel, low-latency search over local and web data, making it well suited to use cases that demand privacy, compliance, or low-latency on-premises deployment.
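To illustrate the fan-out pattern, here is a minimal asyncio sketch of querying local and web sources concurrently. The `search_local` and `search_web` functions are hypothetical stand-ins, not the NeMo Retriever API; in the actual blueprint these calls go to NeMo Retriever endpoints.

```python
import asyncio

# Hypothetical stand-ins for retriever calls; the real AI-Q stack routes
# these through NVIDIA NeMo Retriever endpoints.
async def search_local(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # simulate an on-premises index lookup
    return [f"local hit for {query!r}"]

async def search_web(query: str) -> list[str]:
    await asyncio.sleep(0.3)  # simulate a slower external web search
    return [f"web hit for {query!r}"]

async def fan_out(query: str) -> list[str]:
    # Issue both searches concurrently and merge the results, so total
    # latency is bounded by the slowest source rather than the sum.
    local, web = await asyncio.gather(search_local(query), search_web(query))
    return local + web

if __name__ == "__main__":
    print(asyncio.run(fan_out("H100 memory bandwidth")))
```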
Deep Reasoning with Llama Nemotron
NVIDIA Llama Nemotron Super isn’t just a fine-tuned instruct model—it’s post-trained for explicit agentic reasoning and supports reasoning ON/OFF toggles via system prompts. You can use it in standard chat LLM mode or switch to deep, chain-of-thought reasoning for agent pipelines—enabling dynamic, context-sensitive workflows.
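As a rough sketch of the toggle, the snippet below assumes the model is served behind an OpenAI-compatible endpoint at localhost:8000 (for example via vLLM) and uses the “detailed thinking on/off” system-prompt convention documented for earlier Nemotron releases; verify the exact wording for v1.5 on its model card.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible server (e.g. vLLM) is already running;
# the "detailed thinking on/off" toggle follows the Nemotron model-card
# convention -- confirm the exact phrasing for v1.5.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, reasoning: bool) -> str:
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    response = client.chat.completions.create(
        model="nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Chat mode for quick answers, reasoning mode for agent pipelines.
print(ask("Summarize the AI-Q architecture in two sentences.", reasoning=False))
print(ask("Plan the sub-queries needed to research GPU supply chains.", reasoning=True))
```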
Key highlights:
- Multi-phase post-training: Combines instruction following, mathematical/programmatic reasoning, and tool-calling skills.
- Transparent model lineage: Directly traceable from open Meta weights, with additional openness around synthetic data and tuning datasets.
- Efficiency: With 49B parameters and a context window of up to 128K tokens, the model runs on a single H100 GPU, keeping inference fast and costs predictable.
Evaluation: Transparency and Robustness in Metrics
One of the core strengths of AI-Q is transparency—not just in outputs, but in reasoning traces and intermediate steps. During development, the NVIDIA team leveraged both standard and new metrics, such as:
- Hallucination detection: Each factual claim is checked at generation time.
- Multi-source synthesis: Synthesis of new insights from disparate evidence.
- Citation trustworthiness: Automated assessment of claim-evidence links.
- RAGAS metrics: Automated scoring of retrieval-augmented generation accuracy.
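For the RAGAS piece, a minimal example with the open-source ragas library looks roughly like the following. Column names and entry points vary across ragas releases, and ragas scores samples with an LLM judge that must be configured separately, so treat this as a sketch of the classic `evaluate` interface rather than AI-Q’s internal harness.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One toy record in the column layout classic ragas versions expect;
# in practice these fields come from the agent's retrieval traces.
data = Dataset.from_dict({
    "question": ["Which model powers AI-Q's reasoning agent?"],
    "answer": ["AI-Q uses Llama-3.3-Nemotron-Super-49B-v1.5 for reasoning."],
    "contexts": [[
        "AI-Q pairs Llama 3.3-70B Instruct with "
        "Llama-3.3-Nemotron-Super-49B-v1.5 for agentic reasoning."
    ]],
})

# ragas evaluates each sample with an LLM judge, so a judge model
# (e.g. an OpenAI key or a locally served model) must be configured.
scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)
```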
The architecture lends itself to granular, stepwise evaluation and debugging, one of the biggest pain points in agentic pipeline development.
Benchmark Results: DeepResearch Bench
DeepResearch Bench evaluates agent stacks on a set of 100+ long-context, real-world research tasks spanning science, finance, art, history, software, and more. Unlike traditional QA benchmarks, its tasks require report-length synthesis and complex multi-hop reasoning:
- AI-Q achieved an overall score of 40.52 in the LLM with Search category, and as of August 2025 holds the top spot among fully open-licensed stacks.
- Strongest metrics: comprehensiveness (depth of report), insightfulness (quality of analysis), and citation quality.
For the Hugging Face Developer Community
- Both Llama-3.3-Nemotron-Super-49B-v1.5 and Llama 3.3-70B Instruct are available for direct use and download on Hugging Face. Try them in your own pipelines with a few lines of Python, or deploy them with vLLM for fast inference and tool-calling support (see the model cards for code and serving examples, and the sketch after this list).
- Open post-training data, transparent evaluation methods, and permissive licensing enable experimentation and reproducibility.
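As a concrete starting point, here is a hedged transformers sketch. The repo ID is assumed from the Hugging Face naming convention for this release (confirm it on the model card), and in bf16 the 49B weights span multiple GPUs with `device_map="auto"`; single-GPU serving typically relies on an FP8 or quantized build.

```python
# pip install transformers accelerate
from transformers import pipeline

# Repo ID assumed from the Hugging Face naming convention for this
# release; confirm it on the model card before use.
generator = pipeline(
    "text-generation",
    model="nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
    torch_dtype="auto",
    device_map="auto",  # spreads bf16 weights across available GPUs
)

messages = [
    {"role": "system", "content": "detailed thinking off"},  # assumed toggle wording
    {"role": "user", "content": "Give one example of a deep-research task."},
]
result = generator(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])
```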
Takeaways
The open-source ecosystem is rapidly closing the gap—and, in some areas, leading—on real-world agent tasks that matter. AI-Q, built on Llama Nemotron, demonstrates that you don’t need to compromise on transparency or control to achieve state-of-the-art results.
Try the stack, or adapt it for your own research agent projects, from Hugging Face or build.nvidia.com.