---
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- fp16
- dequantized
- gpt-oss
- mxfp4-upcast
base_model: openai/gpt-oss-120b
model-index:
- name: gpt-oss-120b-fp16
results: []
---
# gpt-oss-120b-fp16
## Precision: FP32 vs FP16 (and BF16)
This project saves dequantized checkpoints in **FP16** (dequantized to BF16, then cast to FP16).
- **FP32 (single precision, 32-bit, 4 bytes/param)**
Reference/default precision in many frameworks. Highest numerical range/precision, **largest memory**.
- **FP16 (half precision, 16-bit, 2 bytes/param)**
Half the memory of FP32. Great for **inference** on modern GPUs; may underflow/overflow more easily than BF16.
- **BF16 (bfloat16, 16-bit, 2 bytes/param)**
Same memory as FP16, **wider exponent like FP32**, often more numerically robust than FP16; slightly less precision in mantissa.
> In this repo, output precision is **FP16** (default) or **BF16** via `--dtype`.
> **FP32 output is not offered** because it doubles disk/RAM vs FP16/BF16 with minimal inference benefit on modern hardware.
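For orientation, the export boils down to loading the quantized base weights, upcasting, and re-saving. The snippet below is a minimal sketch of that idea, not the exact tool in this repo; it assumes a `transformers` version that can dequantize the MXFP4 base checkpoint on load, and enough CPU RAM to hold the upcast weights.

```python
# Hedged sketch: upcast the MXFP4 base model and save an FP16 (or BF16) checkpoint.
# Assumes a transformers build that dequantizes openai/gpt-oss-120b on load;
# the actual export script in this repo may differ in details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

dtype = torch.float16  # or torch.bfloat16, mirroring `--dtype bf16`

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    torch_dtype=dtype,        # upcast target
    device_map="cpu",         # keep on CPU/RAM for the export; needs a lot of memory
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

model.save_pretrained("gpt-oss-120b-fp16", safe_serialization=True)
tokenizer.save_pretrained("gpt-oss-120b-fp16")
```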
### Memory math (example: 120B parameters)
Each parameter stores one number:
| Format | Bits | Bytes/param | Approx size for 120B params |
|-------:|-----:|-------------:|-----------------------------:|
| FP32 | 32 | 4 | ~ **447 GiB** |
| FP16 | 16 | 2 | ~ **224 GiB** |
| BF16 | 16 | 2 | ~ **224 GiB** |
> Calculation (GiB): `params * bytes_per_param / 1024^3`
> For 120,000,000,000 params:
> FP32: 480e9 B ≈ 447.03 GiB
> FP16/BF16: 240e9 B ≈ 223.52 GiB
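The table follows directly from the bytes-per-parameter arithmetic; here is a quick Python check (the 120B parameter count is the nominal figure, not an exact weight count):

```python
# Sanity check of the memory table above: size = params * bytes_per_param / 1024**3.
PARAMS = 120_000_000_000  # nominal 120B parameters

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2}

for dtype, nbytes in BYTES_PER_PARAM.items():
    size_gib = PARAMS * nbytes / 1024**3
    print(f"{dtype}: {size_gib:,.2f} GiB")

# FP32: 447.03 GiB
# FP16: 223.52 GiB
# BF16: 223.52 GiB
```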
### When to use which
- **Inference on modern NVIDIA GPUs (Turing+/Ampere+/Ada/Hopper):**
Use **FP16** (default here) or **BF16**. You’ll get large memory savings and typically **equal or faster** throughput than FP32 thanks to tensor cores.
- **Training / Finetuning:**
Use **mixed precision** (BF16 or FP16 compute with an FP32 master copy of weights/optimizer states).
If your GPU supports BF16 well (e.g., A100/H100), **BF16** is preferred for numeric stability.
(This tool focuses on exporting dequantized checkpoints, not training loops.)
- **If you hit numeric issues in FP16:**
Try **BF16** (`--dtype bf16`). Same size as FP16 but usually more stable due to FP32-like exponent range.
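At inference time the same choice is made when loading the checkpoint. Below is a minimal, hedged example with `transformers`; the repo id is an assumption, so substitute the actual Hub path or a local directory:

```python
# Hedged example: load the dequantized checkpoint for inference in FP16 or BF16.
# MODEL_ID is an assumption; replace with the actual Hub repo id or a local path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "twhitworth/gpt-oss-120b-fp16"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,   # switch to torch.bfloat16 if FP16 misbehaves
    device_map="auto",           # shard across available GPUs; ~224 GiB of weights
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```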
### Notes
- **FP32** remains the gold standard for numeric headroom and deterministic baselines, but for **inference** it’s typically unnecessary and **costly** (2× memory vs FP16/BF16).
- **Tensor cores** accelerate FP16/BF16 GEMMs on most modern NVIDIA GPUs; FP32 is often slower and more memory-bound.
- If a downstream runtime expects a specific dtype, export to that: FP16 for speed/memory, BF16 for robustness.
---
### WIP
- Upcoming models: cleaned FP16 release (uniform fp16 with fp32 LayerNorms), compressed variants (W8A8, W4A16, mixed experts), 2:4 sparse checkpoints.
- Evals: MMLU, HellaSwag, TruthfulQA, GSM8K, BBH, MT‑Bench; plus latency/throughput and memory footprint on 3090/A100.
- Extras: scripted upload tooling, detailed model cards, and reproducible Docker workflows.