Text Generation · Transformers · Safetensors · PyTorch · nvidia
suhara committed · verified · Commit 6d4bf47 · 1 Parent(s): a1cd02c

Update README.md

Files changed (1): README.md +0 -11
README.md CHANGED
@@ -85,17 +85,6 @@ Hugging Face 08/18/2025 via [https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano
  ## Model design
 
  The model was trained on 20T tokens with a batch size of 736, using the Warmup-Stable-Decay (WSD) learning rate schedule: 8B tokens of learning rate warm-up, a peak learning rate of 4.5e-4, and a minimum learning rate of 4.5e-6. There are 62 layers in total, of which 28 are MLP layers and 28 are Mamba-2 layers; the remaining layers use GQA with 8 groups.
-
- ## Computational load
-
- Cumulative compute: 1.45E+24 FLOPS
-
- Estimated energy and emissions for model training: 708.3 MWh
-
- | | \# of tokens | Compute \[FLOPS\] | Energy \[MWh\] |
- | :---- | :---- | :---- | :---- |
- | 12B Base Pre-training | 20T | 1.45E+24 | 708.3 |
-
 
 
  ## Input
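
For context on the hyperparameters in the model-design paragraph above, below is a minimal sketch of a Warmup-Stable-Decay learning-rate schedule. The function, the linear warm-up, the cosine decay shape, and the step counts in the final comment are illustrative assumptions; only the peak learning rate (4.5e-4), minimum learning rate (4.5e-6), 8B-token warm-up, and batch size (736) come from the README, and this is not the training code used for the model.

```python
import math

def wsd_lr(step: int,
           total_steps: int,
           warmup_steps: int,
           decay_steps: int,
           peak_lr: float = 4.5e-4,
           min_lr: float = 4.5e-6) -> float:
    """Warmup-Stable-Decay (WSD) schedule: an illustrative sketch, not NVIDIA's code.

    Warmup: linear ramp from 0 to peak_lr.
    Stable: hold at peak_lr.
    Decay:  cosine anneal from peak_lr to min_lr over the final decay_steps
            (the README does not state the decay shape; cosine is an assumption).
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(1, decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Rough step counts, assuming a hypothetical sequence length of 4096 tokens:
# tokens per step ≈ 736 * 4096 ≈ 3.0M, so the 8B-token warm-up is ≈ 2,650 steps
# and the full 20T-token run is ≈ 6.6M steps.
```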
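
The cumulative-compute figure in the removed "Computational load" section can be sanity-checked against the widely used C ≈ 6·N·D approximation for dense pre-training FLOPs, taking N = 12B parameters and D = 20T tokens from the table; this is only a back-of-the-envelope check under that assumption, not how the reported number was produced.

```python
# Back-of-the-envelope check of the removed cumulative-compute figure,
# using the common C ≈ 6 * N * D approximation (an assumption; the README
# does not say how the 1.45E+24 FLOPS value was derived).
params = 12e9   # N: 12B parameters, per the "12B Base Pre-training" row
tokens = 20e12  # D: 20T training tokens
print(f"{6 * params * tokens:.2e}")  # 1.44e+24, close to the reported 1.45E+24 FLOPS
```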
 