---
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- fp16
- dequantized
- gpt-oss
- mxfp4-upcast
base_model: openai/gpt-oss-120b
model-index:
- name: gpt-oss-120b-fp16
results: []
---
# gpt-oss-120b-fp16
## Precision: FP32 vs FP16 (and BF16)
This project saves dequantized checkpoints in **FP16** (dequantized to BF16, then cast to FP16).
- **FP32 (single precision, 32-bit, 4 bytes/param)**
Reference/default precision in many frameworks. Highest numerical range/precision, **largest memory**.
- **FP16 (half precision, 16-bit, 2 bytes/param)**
Half the memory of FP32. Great for **inference** on modern GPUs; may underflow/overflow more easily than BF16.
- **BF16 (bfloat16, 16-bit, 2 bytes/param)**
Same memory as FP16, **wider exponent like FP32**, often more numerically robust than FP16; slightly less precision in mantissa.
> In this repo, output precision is **FP16** (default) or **BF16** via `--dtype`.
> **FP32 output is not offered** because it doubles disk/RAM vs FP16/BF16 with minimal inference benefit on modern hardware.
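For orientation, the export boils down to loading the quantized base weights, upcasting, and re-saving. The snippet below is a minimal sketch of that idea, not the exact tool in this repo; it assumes a `transformers` version that can dequantize the MXFP4 base checkpoint on load, and enough CPU RAM to hold the upcast weights.

```python
# Hedged sketch: upcast the MXFP4 base model and save an FP16 (or BF16) checkpoint.
# Assumes a transformers build that dequantizes openai/gpt-oss-120b on load;
# the actual export script in this repo may differ in details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

dtype = torch.float16  # or torch.bfloat16, mirroring `--dtype bf16`

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    torch_dtype=dtype,        # upcast target
    device_map="cpu",         # keep on CPU/RAM for the export; needs a lot of memory
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

model.save_pretrained("gpt-oss-120b-fp16", safe_serialization=True)
tokenizer.save_pretrained("gpt-oss-120b-fp16")
```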
### Memory math (example: 120B parameters)
Each parameter stores one number:
| Format | Bits | Bytes/param | Approx size for 120B params |
|-------:|-----:|-------------:|-----------------------------:|
| FP32 | 32 | 4 | ~ **447 GiB** |
| FP16 | 16 | 2 | ~ **224 GiB** |
| BF16 | 16 | 2 | ~ **224 GiB** |
> Calculation (GiB): `params * bytes_per_param / 1024^3`
> For 120,000,000,000 params:
> FP32: 480e9 B ≈ 447.03 GiB
> FP16/BF16: 240e9 B ≈ 223.52 GiB
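The table follows directly from the bytes-per-parameter arithmetic; here is a quick Python check (the 120B parameter count is the nominal figure, not an exact weight count):

```python
# Sanity check of the memory table above: size = params * bytes_per_param / 1024**3.
PARAMS = 120_000_000_000  # nominal 120B parameters

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2}

for dtype, nbytes in BYTES_PER_PARAM.items():
    size_gib = PARAMS * nbytes / 1024**3
    print(f"{dtype}: {size_gib:,.2f} GiB")

# FP32: 447.03 GiB
# FP16: 223.52 GiB
# BF16: 223.52 GiB
```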
### When to use which
- **Inference on modern NVIDIA GPUs (Turing+/Ampere+/Ada/Hopper):**
Use **FP16** (default here) or **BF16**. You’ll get large memory savings and typically **equal or faster** throughput than FP32 thanks to tensor cores.
- **Training / Finetuning:**
Use **mixed precision** (BF16 or FP16 compute with an FP32 master copy of weights/optimizer states).
If your GPU supports BF16 well (e.g., A100/H100), **BF16** is preferred for numeric stability.
(This tool focuses on exporting dequantized checkpoints, not training loops.)
- **If you hit numeric issues in FP16:**
Try **BF16** (`--dtype bf16`). Same size as FP16 but usually more stable due to FP32-like exponent range.
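At inference time the same choice is made when loading the checkpoint. Below is a minimal, hedged example with `transformers`; the repo id is an assumption, so substitute the actual Hub path or a local directory:

```python
# Hedged example: load the dequantized checkpoint for inference in FP16 or BF16.
# MODEL_ID is an assumption; replace with the actual Hub repo id or a local path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "twhitworth/gpt-oss-120b-fp16"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,   # switch to torch.bfloat16 if FP16 misbehaves
    device_map="auto",           # shard across available GPUs; ~224 GiB of weights
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```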
### Notes
- **FP32** remains the gold standard for numeric headroom and deterministic baselines, but for **inference** it’s typically unnecessary and **costly** (2× memory vs FP16/BF16).
- **Tensor cores** accelerate FP16/BF16 GEMMs on most modern NVIDIA GPUs; FP32 is often slower and more memory-bound.
- If a downstream runtime expects a specific dtype, export to that: FP16 for speed/memory, BF16 for robustness.
---
### WIP
- Upcoming models: cleaned FP16 release (uniform fp16 with fp32 LayerNorms), compressed variants (W8A8, W4A16, mixed experts), 2:4 sparse checkpoints.
- Evals: MMLU, HellaSwag, TruthfulQA, GSM8K, BBH, MT‑Bench; plus latency/throughput and memory footprint on 3090/A100.
- Extras: scripted upload tooling, detailed model cards, and reproducible Docker workflows.