Measuring Open-Source Llama Nemotron Models on DeepResearch Bench
Contributors: David Austin, Raja Biswas, Gilberto Titericz Junior, NVIDIA
NVIDIA’s AI-Q Blueprint, the leading portable, open deep research agent, recently climbed to the top of the “LLM with Search” category of the DeepResearch Bench leaderboard on Hugging Face. This is a significant step forward for the open-source AI stack: it shows that developer-accessible models can power advanced agentic workflows that rival or surpass closed alternatives.
What sets AI-Q apart? It fuses two high-performance open LLMs—Llama 3.3-70B Instruct and Llama-3.3-Nemotron-Super-49B-v1.5—to orchestrate long-context retrieval, agentic reasoning, and robust synthesis.
Core Stack: Model Choices and Technical Innovations
- Llama 3.3-70B Instruct: The foundation for fluent, structured report generation. Derived from Meta’s Llama series, it is openly available under the Llama 3.3 Community License.
- Llama-3.3-Nemotron-Super-49B-v1.5: An optimized, reasoning-focused variant. Built via Neural Architecture Search (NAS), knowledge distillation, and successive rounds of supervised and reinforcement learning, it excels at multi-step reasoning, query planning, tool use, and reflection—all with a reduced memory footprint for efficient deployment on standard GPUs.
The AI-Q reference example also includes:
- NVIDIA NeMo Retriever for scalable, multimodal search across internal and external sources.
- NVIDIA NeMo Agent toolkit for orchestrating complex, multistep agentic workflows.
The architecture supports parallel, low-latency search over local and web data, making it well suited to use cases that demand privacy, compliance, or low-latency on-premises deployment.
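To illustrate the fan-out pattern, here is a minimal asyncio sketch of querying local and web sources concurrently. The `search_local` and `search_web` functions are hypothetical stand-ins, not the NeMo Retriever API; in the actual blueprint these calls go to NeMo Retriever endpoints.

```python
import asyncio

# Hypothetical stand-ins for retriever calls; the real AI-Q stack routes
# these through NVIDIA NeMo Retriever endpoints.
async def search_local(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # simulate an on-premises index lookup
    return [f"local hit for {query!r}"]

async def search_web(query: str) -> list[str]:
    await asyncio.sleep(0.3)  # simulate a slower external web search
    return [f"web hit for {query!r}"]

async def fan_out(query: str) -> list[str]:
    # Issue both searches concurrently and merge the results, so total
    # latency is bounded by the slowest source rather than the sum.
    local, web = await asyncio.gather(search_local(query), search_web(query))
    return local + web

if __name__ == "__main__":
    print(asyncio.run(fan_out("H100 memory bandwidth")))
```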
Deep Reasoning with Llama Nemotron
NVIDIA Llama Nemotron Super isn’t just a fine-tuned instruct model—it’s post-trained for explicit agentic reasoning and supports reasoning ON/OFF toggles via system prompts. You can use it in standard chat LLM mode or switch to deep, chain-of-thought reasoning for agent pipelines—enabling dynamic, context-sensitive workflows.
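As a rough sketch of the toggle, the snippet below assumes the model is served behind an OpenAI-compatible endpoint at localhost:8000 (for example via vLLM) and uses the “detailed thinking on/off” system-prompt convention documented for earlier Nemotron releases; verify the exact wording for v1.5 on its model card.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible server (e.g. vLLM) is already running;
# the "detailed thinking on/off" toggle follows the Nemotron model-card
# convention -- confirm the exact phrasing for v1.5.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, reasoning: bool) -> str:
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    response = client.chat.completions.create(
        model="nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Chat mode for quick answers, reasoning mode for agent pipelines.
print(ask("Summarize the AI-Q architecture in two sentences.", reasoning=False))
print(ask("Plan the sub-queries needed to research GPU supply chains.", reasoning=True))
```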
Key highlights:
- Multi-phase post-training: Combines instruction following, mathematical/programmatic reasoning, and tool-calling skills.
- Transparent model lineage: Directly traceable from open Meta weights, with additional openness around synthetic data and tuning datasets.
- Efficiency: With 49B parameters and a context window of up to 128K tokens, the model runs on a single H100 GPU, keeping inference fast and costs predictable.
Evaluation: Transparency and Robustness in Metrics
One of the core strengths of AI-Q is transparency—not just in outputs, but in reasoning traces and intermediate steps. During development, the NVIDIA team leveraged both standard and new metrics, such as:
- Hallucination detection: Each factual claim is checked at generation time.
- Multi-source synthesis: Synthesis of new insights from disparate evidence.
- Citation trustworthiness: Automated assessment of claim-evidence links.
- RAGAS metrics: Automated scoring of retrieval-augmented generation accuracy.
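For the RAGAS piece, a minimal example with the open-source ragas library looks roughly like the following. Column names and entry points vary across ragas releases, and ragas scores samples with an LLM judge that must be configured separately, so treat this as a sketch of the classic `evaluate` interface rather than AI-Q’s internal harness.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One toy record in the column layout classic ragas versions expect;
# in practice these fields come from the agent's retrieval traces.
data = Dataset.from_dict({
    "question": ["Which model powers AI-Q's reasoning agent?"],
    "answer": ["AI-Q uses Llama-3.3-Nemotron-Super-49B-v1.5 for reasoning."],
    "contexts": [[
        "AI-Q pairs Llama 3.3-70B Instruct with "
        "Llama-3.3-Nemotron-Super-49B-v1.5 for agentic reasoning."
    ]],
})

# ragas evaluates each sample with an LLM judge, so a judge model
# (e.g. an OpenAI key or a locally served model) must be configured.
scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)
```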
The architecture lends itself to granular, stepwise evaluation and debugging, one of the biggest pain points in agentic pipeline development.
Benchmark Results: DeepResearch Bench
DeepResearch Bench evaluates agent stacks on a set of 100+ long-context, real-world research tasks spanning science, finance, art, history, software, and more. Unlike traditional QA benchmarks, its tasks require report-length synthesis and complex multi-hop reasoning:
- AI-Q achieved an overall score of 40.52 in the LLM with Search category, and as of August 2025 holds the top spot among fully open-licensed stacks.
- Strongest metrics: comprehensiveness (depth of report), insightfulness (quality of analysis), and citation quality.
For the Hugging Face Developer Community
- Both Llama-3.3-Nemotron-Super-49B-v1.5 and Llama 3.3-70B Instruct are available for direct use and download on Hugging Face. Try them in your own pipelines with a few lines of Python, or deploy them with vLLM for fast inference and tool-calling support (see the model cards for code and serving examples, and the sketch after this list).
- Open post-training data, transparent evaluation methods, and permissive licensing enable experimentation and reproducibility.
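As a concrete starting point, here is a hedged transformers sketch. The repo ID is assumed from the Hugging Face naming convention for this release (confirm it on the model card), and in bf16 the 49B weights span multiple GPUs with `device_map="auto"`; single-GPU serving typically relies on an FP8 or quantized build.

```python
# pip install transformers accelerate
from transformers import pipeline

# Repo ID assumed from the Hugging Face naming convention for this
# release; confirm it on the model card before use.
generator = pipeline(
    "text-generation",
    model="nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
    torch_dtype="auto",
    device_map="auto",  # spreads bf16 weights across available GPUs
)

messages = [
    {"role": "system", "content": "detailed thinking off"},  # assumed toggle wording
    {"role": "user", "content": "Give one example of a deep-research task."},
]
result = generator(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])
```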
Takeaways
The open-source ecosystem is rapidly closing the gap—and, in some areas, leading—on real-world agent tasks that matter. AI-Q, built on Llama Nemotron, demonstrates that you don’t need to compromise on transparency or control to achieve state-of-the-art results.
Try the stack, or adapt it for your own research agent projects, from Hugging Face or build.nvidia.com.