---
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- fp16
- dequantized
- gpt-oss
- mxfp4-upcast
base_model: openai/gpt-oss-120b
model-index:
- name: gpt-oss-120b-fp16
  results: []
---

# gpt-oss-120b-fp16

## Precision: FP32 vs FP16 (and BF16)

This project saves dequantized checkpoints in **FP16** (bf16 -> fp16).

- **FP32 (single precision, 32-bit, 4 bytes/param)**
  Reference/default precision in many frameworks. Highest numerical range/precision, **largest memory**.
- **FP16 (half precision, 16-bit, 2 bytes/param)**
  Half the memory of FP32. Great for **inference** on modern GPUs; may underflow/overflow more easily than BF16.
- **BF16 (bfloat16, 16-bit, 2 bytes/param)**
  Same memory as FP16, **wider exponent like FP32**, often more numerically robust than FP16; slightly less precision in the mantissa.

> In this repo, output precision is **FP16** (default) or **BF16** via `--dtype`.
> **FP32 output is not offered** because it doubles disk/RAM vs FP16/BF16 with minimal inference benefit on modern hardware.

### Memory math (example: 120B parameters)

Each parameter stores one number:

| Format | Bits | Bytes/param | Approx. size for 120B params |
|-------:|-----:|------------:|-----------------------------:|
| FP32   |   32 |           4 |               ~ **447 GiB** |
| FP16   |   16 |           2 |               ~ **224 GiB** |
| BF16   |   16 |           2 |               ~ **224 GiB** |

> Calculation (GiB): `params * bytes_per_param / 1024^3`
> For 120,000,000,000 params:
> FP32: 480e9 B ≈ 447.03 GiB
> FP16/BF16: 240e9 B ≈ 223.52 GiB

(A short Python check of this arithmetic is included at the end of this card.)

### When to use which

- **Inference on modern NVIDIA GPUs (Turing or newer: Ampere, Ada, Hopper):**
  Use **FP16** (default here) or **BF16**. You’ll get large memory savings and typically **equal or faster** throughput than FP32 thanks to tensor cores. A minimal loading sketch appears at the end of this card.
- **Training / finetuning:**
  Use **mixed precision** (BF16 or FP16 compute with an FP32 master copy of weights/optimizer states). If your GPU supports BF16 well (e.g., A100/H100), **BF16** is preferred for numeric stability. (This tool focuses on exporting dequantized checkpoints, not training loops.)
- **If you hit numeric issues in FP16:**
  Try **BF16** (`--dtype bf16`). Same size as FP16 but usually more stable thanks to its FP32-like exponent range.

### Notes

- **FP32** remains the gold standard for numeric headroom and deterministic baselines, but for **inference** it’s typically unnecessary and **costly** (2× memory vs FP16/BF16).
- **Tensor cores** accelerate FP16/BF16 GEMMs on most modern NVIDIA GPUs; FP32 is often slower and more memory-bound.
- If a downstream runtime expects a specific dtype, export to that: FP16 for speed/memory, BF16 for robustness.

---

### WIP

- Upcoming models: cleaned FP16 release (uniform fp16 with fp32 LayerNorms), compressed variants (W8A8, W4A16, mixed experts), 2:4 sparse checkpoints.
- Evals: MMLU, HellaSwag, TruthfulQA, GSM8K, BBH, MT-Bench; plus latency/throughput and memory footprint on 3090/A100.
- Extras: scripted upload tooling, detailed model cards, and reproducible Docker workflows.
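
### Appendix A: memory-math check

A minimal sketch that reproduces the numbers in the table above from the `params * bytes_per_param / 1024^3` formula; the 120B parameter count is the nominal figure used in this card, not an exact weight count.

```python
# Sanity-check the memory table: bytes per parameter -> GiB.
PARAMS = 120_000_000_000  # nominal 120B parameters, as used in the table above

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

for dtype, nbytes in BYTES_PER_PARAM.items():
    gib = PARAMS * nbytes / 1024**3  # GiB = bytes / 1024^3
    print(f"{dtype}: {gib:,.2f} GiB")

# Expected output:
# fp32: 447.03 GiB
# fp16: 223.52 GiB
# bf16: 223.52 GiB
```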
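
### Appendix B: loading sketch (FP16/BF16)

A minimal loading sketch with `transformers`, under stated assumptions: the repo id below is a placeholder taken from this card's metadata (replace it with the full Hub id of this checkpoint), and sharding via `device_map="auto"` requires `accelerate` to be installed.

```python
# Hedged example: load the dequantized checkpoint in FP16, or switch to BF16
# if FP16 shows numeric issues. Not a definitive recipe for this repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "gpt-oss-120b-fp16"  # assumption: replace with the full Hub id of this checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,  # or torch.bfloat16 for more numeric headroom
    device_map="auto",          # shard across GPUs; ~224 GiB of weights in 16-bit
)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Keeping the dtype choice in one place (`torch.float16` vs `torch.bfloat16`) makes it easy to A/B the two 16-bit formats without re-downloading or re-exporting the checkpoint.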