GR00T-N1-2B

GitHub page: https://github.com/NVIDIA/Isaac-GR00T/

Description:

NVIDIA Isaac GR00T N1 is the world’s first open foundation model for generalized humanoid robot reasoning and skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments. Developers and researchers can post-train GR00T N1 with real or synthetic data for their specific humanoid robot or task.

Isaac GR00T N1-2B is the lightweight version of our model. It is built on pre-trained vision and language encoders and uses a flow matching action transformer to model a chunk of actions conditioned on vision, language, and proprioception.

A detailed description of the Isaac GR00T N1-2B architecture is provided in the Whitepaper.

License/Terms of Use

NSCL V1 License NVIDIA OneWay Noncommercial License_22Mar2022

Deployment Geography:

Global

Use Case:

Researchers, Academics, Open-Source Community: AI-driven robotics research and algorithm development. Developers: Integrate and customize AI for various robotic applications. Startups & Companies: Accelerate robotics development and reduce training costs.

Release Date:

GitHub: 03/17/2025
Hugging Face: 03/17/2025

Reference(s):

NVIDIA-EAGLE: Li, Zhiqi, et al. "Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models." arXiv preprint arXiv:2501.14818 (2025).
Rectified Flow: Liu, Xingchao, and Chengyue Gong. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." The Eleventh International Conference on Learning Representations.
Flow Matching Policy: Black, Kevin, et al. "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint arXiv:2410.24164 (2024).

Model Architecture:

Architecture Type: Vision Transformer, Multilayer Perceptron, Flow matching Transformer

Isaac GR00T N1 uses vision and text transformers to encode the robot's image observations and text instructions. The architecture handles a varying number of views per embodiment by concatenating image token embeddings from all frames into a sequence, followed by language token embeddings.
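
The token layout described above can be sketched with NumPy. This is only an illustration of the concatenation order (image tokens from every view first, language tokens last); the function name, the embedding width, and the token counts are hypothetical and are not taken from the GR00T codebase.

```python
import numpy as np

def build_context_tokens(view_tokens, text_tokens):
    """Concatenate per-view image tokens, then append the language tokens.

    view_tokens: list of arrays shaped (num_image_tokens, dim), one per camera view.
    text_tokens: array shaped (num_text_tokens, dim).
    """
    if not view_tokens:
        raise ValueError("at least one camera view is required")
    return np.concatenate(list(view_tokens) + [text_tokens], axis=0)

rng = np.random.default_rng(0)
dim = 8  # hypothetical embedding width
text = rng.normal(size=(5, dim))

# Two embodiments with a different number of camera views share one code path.
two_views = [rng.normal(size=(4, dim)) for _ in range(2)]
three_views = [rng.normal(size=(4, dim)) for _ in range(3)]

seq_two = build_context_tokens(two_views, text)
seq_three = build_context_tokens(three_views, text)
```

Because the views are joined along the sequence axis, adding a camera only lengthens the sequence; the embedding width stays the same.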

To model proprioception and a sequence of actions conditioned on observations, Isaac GR00T N1-2B uses a flow matching transformer. The flow matching transformer interleaves self-attention over proprioception and actions with cross-attention to the vision and language embeddings. During training, the input actions are corrupted by randomly interpolating between the clean action vector and a Gaussian noise vector. At inference time, the policy first samples a Gaussian noise vector and then iteratively reconstructs a continuous-value action using its velocity prediction.
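
A minimal sketch of the training corruption and the iterative inference described above. It assumes one common flow matching convention (t = 0 is pure noise, t = 1 is the clean action, and the regression target is the straight-line velocity), which may differ in sign or direction from the convention used in the Whitepaper. The trained network is replaced by a hypothetical oracle so the example runs on its own; the chunk shape and step count are also illustrative.

```python
import numpy as np

def corrupt_actions(actions, noise, t):
    """Interpolate between noise (t = 0) and clean actions (t = 1)."""
    return (1.0 - t) * noise + t * actions

def target_velocity(actions, noise):
    """Velocity of the straight path from noise to the clean actions."""
    return actions - noise

def sample_actions(velocity_fn, shape, num_steps, rng):
    """Start from Gaussian noise and Euler-integrate the velocity from t = 0 to t = 1."""
    x = rng.normal(size=shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
clean = rng.uniform(-1.0, 1.0, size=(16, 7))  # hypothetical chunk: 16 steps x 7 joints

# Training side: corrupt the clean chunk and form the regression target.
noise = rng.normal(size=clean.shape)
t = rng.uniform()
noisy = corrupt_actions(clean, noise, t)
target = target_velocity(clean, noise)

# Stand-in for the trained network (assumption): the exact velocity toward `clean`.
def oracle_velocity(x, t):
    return (clean - x) / (1.0 - t)

recovered = sample_actions(oracle_velocity, clean.shape, num_steps=4, rng=rng)
```

With the oracle, a handful of Euler steps recovers the clean chunk. A learned velocity field is only approximate, so the number of integration steps becomes a trade-off between accuracy and latency.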

Network Architecture: Illustrated in Whitepaper Figure 2.
RGB camera frames are processed through a pre-trained vision transformer (SigLip2).
Text is encoded by a pre-trained transformer (T5).
Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable max length before feeding into the MLP.
Actions are encoded and velocity predictions decoded by an MLP, one per unique embodiment.
The flow matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion step conditioning is implemented using adaptive layernorm (AdaLN).
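
The padding and embodiment-indexed encoding of proprioception can be sketched as follows. All dimensions, the zero-padding choice, the tanh activation, and the class name are assumptions made for illustration; the actual layer sizes and padding scheme are defined in the GR00T repository and Whitepaper.

```python
import numpy as np

MAX_STATE_DIM = 8  # hypothetical configurable maximum state length
HIDDEN_DIM = 16    # hypothetical hidden width
EMBED_DIM = 4      # hypothetical output embedding width

def pad_state(state, max_dim=MAX_STATE_DIM):
    """Zero-pad a 1D state vector on the right up to max_dim entries."""
    state = np.asarray(state, dtype=np.float32)
    if state.ndim != 1 or state.shape[0] > max_dim:
        raise ValueError("state must be 1D with at most max_dim entries")
    return np.pad(state, (0, max_dim - state.shape[0]))

class EmbodimentMLP:
    """Two-layer MLP that keeps a separate weight set per embodiment ID."""

    def __init__(self, num_embodiments, rng):
        self.w1 = rng.normal(scale=0.1, size=(num_embodiments, MAX_STATE_DIM, HIDDEN_DIM))
        self.b1 = np.zeros((num_embodiments, HIDDEN_DIM))
        self.w2 = rng.normal(scale=0.1, size=(num_embodiments, HIDDEN_DIM, EMBED_DIM))
        self.b2 = np.zeros((num_embodiments, EMBED_DIM))

    def __call__(self, state, embodiment_id):
        x = pad_state(state)
        # Select this embodiment's weights by indexing the leading axis.
        h = np.tanh(x @ self.w1[embodiment_id] + self.b1[embodiment_id])
        return h @ self.w2[embodiment_id] + self.b2[embodiment_id]

rng = np.random.default_rng(0)
encoder = EmbodimentMLP(num_embodiments=2, rng=rng)

arm_state = [0.1, -0.2, 0.3, 0.0, 0.5]  # 5-dimensional state, padded to 8
emb_arm = encoder(arm_state, embodiment_id=0)
emb_other = encoder(arm_state, embodiment_id=1)
```

The same padded vector yields different embeddings for different embodiment IDs, because each ID selects its own weights.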

Input:

Input Type:
- Vision: Image Frames
- State: Robot Proprioception
- Language Instruction: Text
- Embodiment ID: Integer

Input Format:
- Vision: Variable number of 224x224 uint8 image frames, coming from robot cameras
- State: Floating Point
- Language Instruction: String
- Embodiment ID: Integer indicating which of the training embodiments is observed

Input Parameters:
- Vision: 2D - RGB image, square
- State: 1D - Floating number vector
- Language Instruction: 1D - String
- Embodiment ID: 1D - Integer
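
As a rough illustration of the input format listed above, the sketch below checks one observation before it would be handed to a policy. The function name, the argument layout, and the channel-last (224, 224, 3) frame shape are assumptions; this is not the interface of the Isaac-GR00T repository.

```python
import numpy as np

def validate_observation(frames, state, instruction, embodiment_id):
    """Check one observation against the documented input types and shapes."""
    if len(frames) == 0:
        raise ValueError("need at least one camera frame")
    for frame in frames:
        # Channel-last RGB layout is an assumption of this sketch.
        if frame.shape != (224, 224, 3) or frame.dtype != np.uint8:
            raise ValueError("frames must be 224x224 RGB uint8 arrays")
    state = np.asarray(state)
    if state.ndim != 1 or not np.issubdtype(state.dtype, np.floating):
        raise ValueError("state must be a 1D floating-point vector")
    if not isinstance(instruction, str):
        raise ValueError("instruction must be a string")
    is_int = isinstance(embodiment_id, (int, np.integer))
    if isinstance(embodiment_id, bool) or not is_int:
        raise ValueError("embodiment_id must be an integer")
    return True

# Hypothetical observation: two camera views and a 7-dimensional state.
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(2)]
state = np.zeros(7, dtype=np.float32)
ok = validate_observation(frames, state, "pick up the apple", embodiment_id=0)
```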

Output:

Output Type(s): Actions

Output Format: Continuous-value vectors that correspond to different motor controls on a robot.

Software Integration:

Runtime Engine(s): PyTorch

Supported Hardware Microarchitecture Compatibility: All of the below:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Jetson
NVIDIA Hopper
NVIDIA Lovelace

Supported Operating System(s):

Linux

Model Version(s):

This is the initial version of the model, version 1.0.

Training and Evaluation Datasets:

Training Dataset:

GR00T Pretraining Data Link:
Data Collection Method by dataset: Hybrid: Human, Synthetic.
Labeling Method by dataset: Hybrid: Human, Automated.
Properties:
- Cross-embodiment: Data collected on various robot embodiments
- Sensor types: RGB camera, robot proprioception, robot actuator data
Dataset License(s): Release Legal Tracker (GR00T-N1)

Evaluation:

We evaluate on both simulation and real robot benchmarks, as defined in the Whitepaper.

Sim evaluation benchmarks for upper body control (nspect NSPECT-5WZF-67VI):
- 9 DexMG Whitepaper tasks
- 24 RoboCasa simulated mobile manipulator tasks
- 24 Digital Cousin simulated GR-1 humanoid manipulation tasks
For sim, we automatically measure the success rate in each manipulation behavior.

Real robot benchmarks (nspect NSPECT-IDAT-9M9L):
- Grocery packing task
- Industrial multi-robot coordination with handoffs
Real robot tasks are evaluated by human observers in the lab.

Inference:

Engine: PyTorch
Test Hardware: A6000

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Model Limitations:

This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms.

Risk: Model underperformance in highly dynamic environments. Mitigation: Enhance dataset with dynamic obstacle scenarios and fine-tune models accordingly.

Risk: Integration challenges in specific customer environments. Mitigation: Provide detailed integration guides and support, leveraging NVIDIA's ecosystem.

Risk: Limited initial support for certain robot embodiments. Mitigation: Expand testing and validation across a wider range of robot platforms.
