Improve model card metadata, add sample usage and citation
#16
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,4 +1,5 @@
---
datasets:
- nvidia/NitroGen
tags:
@@ -6,156 +7,80 @@ tags:
- cloning
- gaming
- agent
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/67d8509cb6b70254852d734d/u3VY6_KoT6tEs86YPehU2.gif" width="100%" />

<div align="center">
<p style="font-size: 1.2em;">
<a href="https://nitrogen.minedojo.org/"><strong>Website</strong></a> |
-<a href="https://
<a href="https://huggingface.co/datasets/nvidia/NitroGen"><strong>Dataset</strong></a> |
-<a href="https://
</p>
</div>

-# Model
-### Description:
-NitroGen is a unified vision-to-action model designed to play video games directly from raw frames. It takes video game footage as input and outputs gamepad actions. Unlike models trained with rewards or task objectives, NitroGen is trained purely through large-scale imitation learning on videos of human gameplay. NitroGen works best on games designed for gamepad controls (e.g., action, platformer, and racing games) and is less effective on games that rely heavily on mouse and keyboard (e.g., RTS, MOBA).
-The goal of the NitroGen project is to explore whether large-scale training on diverse human gameplay leads to emergent, general-purpose embodied abilities, similar to how scaling has unlocked emergent behaviors in large language models.
-Potential applications include next-generation game AI, automated QA for video games, and advancing research in general embodied AI.
-NitroGen 1 was developed by NVIDIA and is the first model of the series. This model is for research and development only.
-### License/Terms of Use:
-Governing Terms: [NVIDIA License](https://developer.download.nvidia.com/licenses/NVIDIA-OneWay-Noncommercial-License-22Mar2022.pdf).
-Additional Information: [Apache License](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md) for [https://huggingface.co/google/siglip2-base-patch16-224]().
-### Deployment Geography:
-Global <br>
-### Use Case: <br>
-Researchers, engineers, open source community, companies, gamers. Potential applications include next-generation game AI, automated testing for video games, and generally advancing research in embodied AI.<br>
-### Release Date: <br>
-GitHub 12/19/2025 via []() <br>
-GitHub 12/19/2025 via [https://huggingface.co/nvidia/NitroGen](https://huggingface.co/nvidia/NitroGen) <br>
-## References:
-[VPT](https://arxiv.org/abs/2206.11795), a Minecraft agent trained from internet videos.
-[SIMA](https://arxiv.org/abs/2404.10179), a multi-game agent trained to follow text instructions.
-[GR00T N1](https://arxiv.org/abs/2503.14734), an open foundation model for generalist humanoid robots.
-<br>
-## Model Architecture:
-**Architecture Type:** Vision Transformer, Diffusion Transformer <br>
-**Network Architecture:**
-- RGB frames are processed through a pre-trained vision transformer (SigLip2).
-- A diffusion matching transformer (DiT) then generates actions, conditioned on SigLip output.
-<br>
-**This model was developed based on** SigLip2 <br>
-**Number of model parameters:** $4.93 \times 10^8$ <br>
-## Input(s): <br>
-**Input Type(s):** Image <br>
-**Input Format(s):** Red, Green, Blue (RGB) <br>
-**Input Parameters:** Two-Dimensional (2D) <br>
-**Other Properties Related to Input:** 256x256 Images
-## Output(s)
-**Output Type(s):** Actions for gamepad/game controllers<br>
-**Output Format(s):** Tabular <br>
-**Output Parameters:** 2D: one action dimension and one temporal dimension <br>
-**Other Properties Related to Output:** The output has shape 21x16, two 2D Continuous-value vectors for each joystick, 17 binary values for each button.
-Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
-## Software Integration:
-**Runtime Engine(s):**
-No runtime engine was used.
-**Supported Hardware Microarchitecture Compatibility:** <br>
-* NVIDIA Blackwell <br>
-* NVIDIA Hopper<br>
-**Preferred/Supported Operating System(s):**
-The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>
-* Linux <br>
-* Windows <br>
-## Model Version(s):
-V1 <br>
-## Training, Testing, and Evaluation Datasets:
-## Training Dataset:
-**Data Modality**<br>
-* Image <br>
-* Video <br>
-**Image Training Data Size**<br>
-* More than 1 Billion Images <br>
-* 10,000 to 1 Million Hours <br>
-* Automated <br>
-* Synthetic <br>
-* Automated <br>
-* Automated <br>
-**Acceleration Engine:** None <br>
-**Test Hardware:** H100 <br>

---
+license: other
datasets:
- nvidia/NitroGen
tags:
- cloning
- gaming
- agent
+pipeline_tag: robotics
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/67d8509cb6b70254852d734d/u3VY6_KoT6tEs86YPehU2.gif" width="100%" />

<div align="center">
<p style="font-size: 1.2em;">
<a href="https://nitrogen.minedojo.org/"><strong>Website</strong></a> |
+<a href="https://github.com/MineDojo/NitroGen"><strong>Code</strong></a> |
<a href="https://huggingface.co/datasets/nvidia/NitroGen"><strong>Dataset</strong></a> |
+<a href="https://huggingface.co/papers/2601.02427"><strong>Paper</strong></a>
</p>
</div>

+# NitroGen: An Open Foundation Model for Generalist Gaming Agents

+NitroGen is a unified vision-to-action foundation model designed to play video games directly from raw frames. It is a generalist agent trained via large-scale behavior cloning on 40,000 hours of gameplay across more than 1,000 games, and it maps RGB video footage to gamepad actions.

+NitroGen works best on games designed for gamepad controls (e.g., action, platformer, and racing games) and is less effective on games that rely heavily on mouse and keyboard (e.g., RTS, MOBA).

+## Sample Usage

+### Installation

+To use NitroGen, clone and install the repository:

+```bash
+git clone https://github.com/MineDojo/NitroGen.git
+cd NitroGen
+pip install -e .
+```

+### Inference

+1. **Download the checkpoint** from Hugging Face:
+```bash
+hf download nvidia/NitroGen ng.pt
+```

+2. **Start the inference server**:
+```bash
+python scripts/serve.py <path_to_ng.pt>
+```

+3. **Run the agent** on the game of your choice (currently supports Windows games):
+```bash
+python scripts/play.py --process '<game_executable_name>.exe'
+```

+## Model Details

+- **Architecture:** Vision Transformer (SigLip2) + Diffusion Matching Transformer (DiT).
+- **Parameters:** $4.93 \times 10^8$.
+- **Inputs:** 256x256 RGB images.
+- **Outputs:** Gamepad actions with shape 21x16: 21 action dimensions (two 2D continuous joystick vectors plus 17 binary buttons) over a 16-step temporal horizon; see the sketch below.
+- **Training:** Trained on 40,000 hours of internet-scale gameplay videos.
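
The list above specifies the I/O contract, and a short sketch can make the tensor shapes concrete. This is illustrative only: the function names are invented for this example, and the exact ordering of the 21 action dimensions (joystick axes before button channels) and the [0, 1] input scaling are assumptions, not the repository's documented API.

```python
import numpy as np
from PIL import Image


def preprocess_frame(frame: Image.Image) -> np.ndarray:
    """Resize a captured game frame to the 256x256 RGB input the model expects.

    Scaling pixel values to [0, 1] is an assumption about the preprocessing.
    """
    resized = frame.convert("RGB").resize((256, 256))
    return np.asarray(resized, dtype=np.float32) / 255.0


def decode_action_chunk(actions: np.ndarray):
    """Split a (21, 16) action chunk into per-timestep gamepad commands.

    Assumed (hypothetical) layout: rows 0-3 hold the two 2D joystick vectors
    (left x/y, right x/y), rows 4-20 hold the 17 button channels, and the 16
    columns are the temporal steps.
    """
    assert actions.shape == (21, 16)
    for t in range(actions.shape[1]):
        step = actions[:, t]
        left_stick = step[0:2]    # continuous joystick values
        right_stick = step[2:4]   # continuous joystick values
        buttons = step[4:] > 0.5  # 17 binary button presses
        yield left_stick, right_stick, buttons
```

In the actual pipeline, `scripts/serve.py` and `scripts/play.py` handle frame capture and controller emulation end to end; the sketch only mirrors the shapes described in the list.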

+## Citation

+If you find NitroGen useful in your research, please cite:

+```bibtex
+@misc{magne2026nitrogen,
+  title={NitroGen: An Open Foundation Model for Generalist Gaming Agents},
+  author={Loïc Magne and Anas Awadalla and Guanzhi Wang and Yinzhen Xu and Joshua Belofsky and Fengyuan Hu and Joohwan Kim and Ludwig Schmidt and Georgia Gkioxari and Jan Kautz and Yisong Yue and Yejin Choi and Yuke Zhu and Linxi "Jim" Fan},
+  year={2026},
+  eprint={2601.02427},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2601.02427},
+}
+```

+## License

+Governing Terms: [NVIDIA License](https://developer.download.nvidia.com/licenses/NVIDIA-OneWay-Noncommercial-License-22Mar2022.pdf).
+The model uses a [SigLip2](https://huggingface.co/google/siglip2-base-patch16-224) backbone, which is licensed under Apache 2.0.