---
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- fp16
- dequantized
- gpt-oss
- mxfp4-upcast
base_model: openai/gpt-oss-120b
model-index:
- name: gpt-oss-120b-fp16
  results: []
---

# gpt-oss-120b-fp16

## Precision: FP32 vs FP16 (and BF16)

This project saves dequantized checkpoints of `openai/gpt-oss-120b` in **FP16** (converted BF16 → FP16).

- **FP32 (single precision, 32-bit, 4 bytes/param)**
  Reference/default precision in many frameworks. Highest numerical range/precision, **largest memory**.
- **FP16 (half precision, 16-bit, 2 bytes/param)**
  Half the memory of FP32. Great for **inference** on modern GPUs; may underflow/overflow more easily than BF16.
- **BF16 (bfloat16, 16-bit, 2 bytes/param)**
  Same memory as FP16, **wider exponent like FP32**, often more numerically robust than FP16; slightly less precision in the mantissa.

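A quick way to see the range/precision trade-off between the three formats listed above is to inspect their limits with `torch.finfo`; this is purely illustrative and not part of the export tooling:

```python
import torch

# Compare dynamic range (max, smallest normal) and precision (eps) per dtype.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max={info.max:.3e}  smallest_normal={info.tiny:.3e}  eps={info.eps:.3e}")
```

FP16 tops out around 6.55e4 (hence the overflow risk), while BF16 reaches ~3.4e38 like FP32 but with a coarser `eps`.
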
> In this repo, output precision is **FP16** (default) or **BF16** via `--dtype`.
> **FP32 output is not offered** because it doubles disk/RAM vs FP16/BF16 with minimal inference benefit on modern hardware.

### Memory math (example: 120B parameters)

Each parameter stores one number:

| Format | Bits | Bytes/param | Approx size for 120B params |
|-------:|-----:|------------:|----------------------------:|
| FP32   |   32 |           4 | ~ **447 GiB** |
| FP16   |   16 |           2 | ~ **224 GiB** |
| BF16   |   16 |           2 | ~ **224 GiB** |

> Calculation (GiB): `params * bytes_per_param / 1024^3`
> For 120,000,000,000 params:
> FP32: 480e9 B ≈ 447.03 GiB
> FP16/BF16: 240e9 B ≈ 223.52 GiB

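The same calculation in a few lines of Python, so you can plug in other parameter counts or dtypes (illustrative sketch only):

```python
# Rough checkpoint size: parameter count × bytes per parameter, in GiB.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

def checkpoint_gib(num_params: int, dtype: str) -> float:
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

for dtype in ("fp32", "fp16", "bf16"):
    print(f"{dtype}: {checkpoint_gib(120_000_000_000, dtype):.2f} GiB")
# fp32: 447.03 GiB, fp16/bf16: 223.52 GiB (weights only; KV cache and activations are extra)
```
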
### When to use which

- **Inference on modern NVIDIA GPUs (Turing or newer: Ampere, Ada, Hopper):**
  Use **FP16** (default here) or **BF16**. You’ll get large memory savings and typically **equal or faster** throughput than FP32 thanks to tensor cores; see the loading sketch after this list.

- **Training / Finetuning:**
  Use **mixed precision** (BF16 or FP16 compute with an FP32 master copy of weights/optimizer states).
  If your GPU supports BF16 well (e.g., A100/H100), **BF16** is preferred for numeric stability.
  (This tool focuses on exporting dequantized checkpoints, not training loops.)

- **If you hit numeric issues in FP16:**
  Try **BF16** (`--dtype bf16`). Same size as FP16, but usually more stable thanks to its FP32-like exponent range.

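A minimal loading sketch with `transformers`, assuming the exported checkpoint has been published to the Hub (the `model_id` below is a placeholder, and `device_map="auto"` requires `accelerate`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-namespace/gpt-oss-120b-fp16"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # or torch.bfloat16 if FP16 gives you numeric issues
    device_map="auto",          # shard the ~224 GiB of weights across available devices
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
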
### Notes

- **FP32** remains the gold standard for numeric headroom and deterministic baselines, but for **inference** it’s typically unnecessary and **costly** (2× memory vs FP16/BF16).
- **Tensor cores** accelerate FP16/BF16 GEMMs on most modern NVIDIA GPUs; FP32 is often slower and more memory-bound.
- If a downstream runtime expects a specific dtype, export to that: FP16 for speed/memory, BF16 for robustness.

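For the last point, re-exporting to another dtype can be done with plain `transformers` calls; a sketch assuming you have enough CPU RAM (~224 GiB) to hold the model, with placeholder paths:

```python
import torch
from transformers import AutoModelForCausalLM

src = "your-namespace/gpt-oss-120b-fp16"  # placeholder source repo id
dst = "./gpt-oss-120b-bf16"               # placeholder output directory

# Load on CPU in the stored dtype, cast, and re-save for the downstream runtime.
model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16)
model = model.to(torch.bfloat16)
model.save_pretrained(dst, safe_serialization=True)  # writes .safetensors shards
```
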
---

### WIP

- Upcoming models: cleaned FP16 release (uniform fp16 with fp32 LayerNorms), compressed variants (W8A8, W4A16, mixed experts), 2:4 sparse checkpoints.
- Evals: MMLU, HellaSwag, TruthfulQA, GSM8K, BBH, MT-Bench; plus latency/throughput and memory footprint on 3090/A100.
- Extras: scripted upload tooling, detailed model cards, and reproducible Docker workflows.