---
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - fp16
  - dequantized
  - gpt-oss
  - mxfp4-upcast
base_model: openai/gpt-oss-120b
model-index:
  - name: gpt-oss-120b-fp16
    results: []
---

# gpt-oss-120b-fp16

## Precision: FP32 vs FP16 (and BF16)

This project saves dequantized checkpoints in **FP16** by default (MXFP4 weights are upcast to BF16 on load, then cast BF16 -> FP16). A quick refresher on the three formats:

- **FP32 (single precision, 32-bit, 4 bytes/param)**
  Reference/default precision in many frameworks. Highest numerical range/precision, **largest memory**.
- **FP16 (half precision, 16-bit, 2 bytes/param)**
  Half the memory of FP32. Great for **inference** on modern GPUs; may underflow/overflow more easily than BF16.
- **BF16 (bfloat16, 16-bit, 2 bytes/param)**
  Same memory as FP16 but with an **FP32-like exponent range**; often more numerically robust than FP16, at the cost of fewer mantissa bits (slightly less precision).

> In this repo, output precision is **FP16** (default) or **BF16** via `--dtype`.
> **FP32 output is not offered** because it doubles disk/RAM vs FP16/BF16 with minimal inference benefit on modern hardware.
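
A minimal loading sketch with `transformers`, assuming the checkpoint is published on the Hub (the repo id below is a placeholder; substitute the actual path):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the actual Hub path of this checkpoint.
model_id = "your-org/gpt-oss-120b-fp16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # or torch.bfloat16 if FP16 under/overflows
    device_map="auto",          # needs `accelerate`; shards ~224 GiB of weights across GPUs
)

prompt = "Explain the difference between FP16 and BF16 in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```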

### Memory math (example: 120B parameters)

Each parameter stores one number:

| Format | Bits | Bytes/param | Approx size for 120B params |
|-------:|-----:|-------------:|-----------------------------:|
| FP32   |   32 |            4 | ~ **447 GiB**               |
| FP16   |   16 |            2 | ~ **224 GiB**               |
| BF16   |   16 |            2 | ~ **224 GiB**               |

> Calculation (GiB): `params * bytes_per_param / 1024^3`
> For 120,000,000,000 params:
> FP32: 480e9 B ≈ 447.03 GiB
> FP16/BF16: 240e9 B ≈ 223.52 GiB
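
The same arithmetic as a small helper, if you want to check other parameter counts or byte widths:

```python
def checkpoint_size_gib(num_params: int, bytes_per_param: int) -> float:
    """Raw weight size in GiB; ignores KV cache, activations, and optimizer states."""
    return num_params * bytes_per_param / 1024**3

params = 120_000_000_000
print(f"FP32:      {checkpoint_size_gib(params, 4):.2f} GiB")  # ~447.03
print(f"FP16/BF16: {checkpoint_size_gib(params, 2):.2f} GiB")  # ~223.52
```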

### When to use which

- **Inference on modern NVIDIA GPUs (Turing+/Ampere+/Ada/Hopper):**
  Use **FP16** (default here) or **BF16**. You’ll get large memory savings and typically **equal or faster** throughput than FP32 thanks to tensor cores.

- **Training / Finetuning:**
  Use **mixed precision** (BF16 or FP16 compute with an FP32 master copy of weights/optimizer states).
  If your GPU supports BF16 well (e.g., A100/H100), **BF16** is preferred for numeric stability.
  (This tool focuses on exporting dequantized checkpoints, not training loops.)

- **If you hit numeric issues in FP16:**
  Try **BF16** (`--dtype bf16`). Same size as FP16 but usually more stable due to its FP32-like exponent range (see the export sketch below).
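
For illustration only (not the project's actual export code): a minimal sketch of the final cast-and-save step, assuming `transformers` has already dequantized the MXFP4 weights to BF16 on load and that the paths below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical paths; the real export tool handles MXFP4 dequantization and sharding.
src = "openai/gpt-oss-120b"    # base model listed in this card's metadata
out = "./gpt-oss-120b-fp16"

# Load dequantized weights in BF16, then cast once to the export dtype.
model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.bfloat16)
export_dtype = torch.float16   # use torch.bfloat16 for a `--dtype bf16`-style export
model.to(dtype=export_dtype)
model.save_pretrained(out, safe_serialization=True)  # writes sharded .safetensors
```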

### Notes

- **FP32** remains the gold standard for numeric headroom and deterministic baselines, but for **inference** it’s typically unnecessary and **costly** (2× memory vs FP16/BF16).
- **Tensor cores** accelerate FP16/BF16 GEMMs on most modern NVIDIA GPUs; FP32 is often slower and more memory-bound.
- If a downstream runtime expects a specific dtype, export to that: FP16 for speed/memory, BF16 for robustness.

---

### WIP

- Upcoming models: cleaned FP16 release (uniform fp16 with fp32 LayerNorms), compressed variants (W8A8, W4A16, mixed experts), 2:4 sparse checkpoints.
- Evals: MMLU, HellaSwag, TruthfulQA, GSM8K, BBH, MT‑Bench; plus latency/throughput and memory footprint on 3090/A100.
- Extras: scripted upload tooling, detailed model cards, and reproducible Docker workflows.