Measuring Open-Source Llama Nemotron Models on DeepResearch Bench

Community Article Published August 4, 2025

Contributors: David Austin, Raja Biswas, Gilberto Titericz Junior, NVIDIA

NVIDIA’s AI-Q Blueprint—the leading portable, open deep research agent—recently climbed to the top of the Hugging Face “LLM with Search” leaderboard on DeepResearch Bench. This is a significant step forward for the open-source AI stack, proving that developer-accessible models can power advanced agentic workflows that rival or surpass closed alternatives.

What sets AI-Q apart? It fuses two high-performance open LLMs—Llama 3.3-70B Instruct and Llama-3.3-Nemotron-Super-49B-v1.5—to orchestrate long-context retrieval, agentic reasoning, and robust synthesis.

Core Stack: Model Choices and Technical Innovations

  • Llama 3.3-70B Instruct: The foundation for fluent, structured report generation, derived from Meta’s Llama series and distributed under Meta’s Llama community license for broad commercial and research use.
  • Llama-3.3-Nemotron-Super-49B-v1.5: An optimized, reasoning-focused variant. Built via Neural Architecture Search (NAS), knowledge distillation, and successive rounds of supervised and reinforcement learning, it excels at multi-step reasoning, query planning, tool use, and reflection—all with a reduced memory footprint for efficient deployment on standard GPUs.

The AI-Q reference example also supports parallel, low-latency search over both local and web data, making it well suited to use cases that demand privacy, compliance, or low-latency on-premise deployment.
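The fan-out pattern behind that parallel search can be sketched with plain `asyncio`. The two backends below are stand-ins for illustration only, not AI-Q’s actual retrieval APIs; the point is that querying sources concurrently bounds latency by the slowest source rather than the sum of all of them.

```python
import asyncio

# Stand-in search backends (hypothetical; AI-Q's real connectors differ).
async def search_local(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulate index-lookup latency
    return [f"local hit for {query!r}"]

async def search_web(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulate network latency
    return [f"web hit for {query!r}"]

async def parallel_search(query: str) -> list[str]:
    # Both sources run concurrently; results come back in call order.
    local, web = await asyncio.gather(search_local(query), search_web(query))
    return local + web

results = asyncio.run(parallel_search("llama nemotron"))
```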

Deep Reasoning with Llama Nemotron

NVIDIA Llama Nemotron Super isn’t just a fine-tuned instruct model—it’s post-trained for explicit agentic reasoning and supports reasoning ON/OFF toggles via system prompts. You can use it in standard chat LLM mode or switch to deep, chain-of-thought reasoning for agent pipelines—enabling dynamic, context-sensitive workflows.
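A minimal sketch of wiring that toggle into a chat request. The `"detailed thinking on"`/`"detailed thinking off"` system-prompt strings follow the convention documented for earlier Nemotron releases; verify the exact wording against the Llama-3.3-Nemotron-Super-49B-v1.5 model card before relying on it.

```python
# Assumption: reasoning mode is toggled via the system prompt, using the
# "detailed thinking on"/"detailed thinking off" strings from earlier
# Nemotron model cards. Check the v1.5 card for the exact convention.
def build_messages(question: str, reasoning: bool) -> list[dict]:
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# Deep chain-of-thought mode for agent pipelines:
agent_msgs = build_messages("Plan a literature search on RAG evaluation.", reasoning=True)
# Standard chat mode:
chat_msgs = build_messages("Summarize this abstract.", reasoning=False)
```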

Key highlights:

  • Multi-phase post-training: Combines instruction following, mathematical/programmatic reasoning, and tool-calling skills.
  • Transparent model lineage: Directly traceable from open Meta weights, with additional openness around synthetic data and tuning datasets.
  • Efficiency: At 49B parameters with a context window of up to 128K tokens, the model fits on a single H100-class GPU, keeping inference fast and costs predictable.
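The single-GPU claim is easy to sanity-check with back-of-envelope arithmetic. This counts weights only; KV cache for a 128K context and activations add more on top, so treat these as lower bounds.

```python
# Rough weight-memory estimate for a 49B-parameter model at common
# precisions (weights only; KV cache and activations are extra).
params = 49e9
bytes_per_param = {"bf16": 2, "fp8": 1, "int4": 0.5}
weight_gb = {p: params * b / 1e9 for p, b in bytes_per_param.items()}
# bf16 ~ 98 GB (needs quantization or multiple GPUs),
# fp8  ~ 49 GB, int4 ~ 24.5 GB
h100_memory_gb = 80
fits_on_one_h100 = {p: gb < h100_memory_gb for p, gb in weight_gb.items()}
```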

Evaluation: Transparency and Robustness in Metrics

One of the core strengths of AI-Q is transparency—not just in outputs, but in reasoning traces and intermediate steps. During development, the NVIDIA team leveraged both standard and new metrics, such as:

  • Hallucination detection: Each factual claim is checked at generation time.
  • Multi-source synthesis: Synthesis of new insights from disparate evidence.
  • Citation trustworthiness: Automated assessment of claim-evidence links.
  • RAGAS metrics: Automated scoring of retrieval-augmented generation accuracy.
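As a toy illustration of the claim-evidence idea behind citation trustworthiness: score each claim by lexical overlap with its cited passage and flag weakly supported ones. This is deliberately simplistic; production metrics such as RAGAS faithfulness judge entailment with an LLM rather than token overlap.

```python
# Toy claim-evidence checker (illustrative only, not AI-Q's actual metric).
def support_score(claim: str, evidence: str) -> float:
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return 0.0
    overlap = claim_tokens & set(evidence.lower().split())
    return len(overlap) / len(claim_tokens)

def flag_unsupported(claims: list[str], evidence: str, threshold: float = 0.5) -> list[str]:
    # Return claims whose overlap with the cited passage is below threshold.
    return [c for c in claims if support_score(c, evidence) < threshold]

evidence = "the benchmark contains 100 research tasks across many domains"
weak = flag_unsupported(
    ["the benchmark contains 100 research tasks", "the model has 49b parameters"],
    evidence,
)
```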

The architecture lends itself perfectly to granular, stepwise evaluation and debugging—one of the biggest pain points in agentic pipeline development.

Benchmark Results: DeepResearch Bench

DeepResearch Bench evaluates agent stacks on a set of 100+ long-context, real-world research tasks spanning science, finance, art, history, software, and more. Unlike traditional QA benchmarks, its tasks require report-length synthesis and complex multi-hop reasoning:

  • AI-Q achieved an overall score of 40.52 in the LLM with Search category as of August 2025, the top result among fully open-licensed stacks.
  • Strongest metrics: comprehensiveness (depth of report), insightfulness (quality of analysis), and citation quality.

For the Hugging Face Developer Community

  • Both Llama-3.3-Nemotron-Super-49B-v1.5 and Llama 3.3-70B Instruct are available for direct use/download on Hugging Face. Try them in your own pipelines using a few lines of Python, or deploy with vLLM for fast inference and tool-calling support (see the model card for code/serving examples).
  • Open post-training data, transparent evaluation methods, and permissive licensing enable experimentation and reproducibility.

Takeaways

The open-source ecosystem is rapidly closing the gap—and, in some areas, leading—on real-world agent tasks that matter. AI-Q, built on Llama Nemotron, demonstrates that you don’t need to compromise on transparency or control to achieve state-of-the-art results.

Try the stack or adapt it to your own research agent projects from Hugging Face or build.nvidia.com.
