GPT-OSS-20B · MLX Q5 (gs=32) — practical on 24–32 GB Macs, much closer to Q8 than Q4
I built this because I wanted GPT-OSS-20B (sparse MoE, MXFP4 upstream) to be practical on everyday Macs—especially 24–32 GB machines—without giving up much quality. The result is a 5-bit, group-size 32 MLX quant that keeps latency snappy for interactive chat while landing much closer to the 8-bit baseline than the 4-bit variant.
Repo: halley-ai/gpt-oss-20b-MLX-5bit-gs32
https://huggingface.co/halley-ai/gpt-oss-20b-MLX-5bit-gs32
Why this exists
- On-device, on-prem: MLX (Metal) on Apple Silicon, no external GPU required.
- Fits the sweet spot: Runs on 24 GB (and breathes on 32 GB), leaving KV headroom.
- MoE-friendly quant: gs=32 reduces routing drift; steadier expert selection vs larger groups.
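To make the gs=32 point concrete, here is a tiny NumPy sketch of group-wise affine quantization (illustrative only, not MLX's kernel; the random weights are made up). Each group gets its own scale and offset, so smaller groups follow local weight statistics more tightly and the reconstruction error stays lower.

```python
# Illustrative NumPy sketch of group-wise affine quantization (not MLX's actual kernel).
# Each group of `group_size` weights gets its own scale/offset; smaller groups track
# local weight statistics more closely, which is why gs=32 holds up better.
import numpy as np

def quantize_groupwise(w, bits=5, group_size=32):
    qmax = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / qmax + 1e-12  # avoid div-by-zero
    q = np.clip(np.round((g - lo) / scale), 0, qmax)
    return q * scale + lo  # dequantized values, same shape as the groups

rng = np.random.default_rng(0)
w = rng.normal(size=1 << 16).astype(np.float32)
for gs in (32, 64, 128):
    err = np.abs(quantize_groupwise(w, bits=5, group_size=gs) - w.reshape(-1, gs)).mean()
    print(f"bits=5 gs={gs}: mean abs error {err:.5f}")
```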
Quality vs 8-bit (ctx=4096, same tokenizer)
- 8-bit (reference): PPL 10.75
- Q6 / gs=32: PPL 10.46 (slightly lower PPL than the 8-bit reference in this run)
- Q5 / gs=32: PPL 11.11 (close to 8-bit)
- Q4 / gs=32: PPL 13.70 (meaningful drop; this Q5 build is the intended 24 GB tier)
(Your mileage may vary by prompt mix; these are my runs.)
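If you want to sanity-check these numbers yourself, a rough ctx=4096 perplexity pass with mlx-lm looks like the sketch below. It assumes the loaded model is directly callable on a token batch and returns next-token logits, and that you bring your own eval text (`eval.txt` is a placeholder), so treat it as a sketch rather than my exact harness.

```python
# Rough perplexity pass at ctx=4096; a sketch, assuming the mlx-lm model object is
# callable on a (batch, seq) token array and returns logits. "eval.txt" is a
# placeholder for whatever corpus you evaluate on.
import math
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")
tokens = tokenizer.encode(open("eval.txt").read())

ctx, nll, count = 4096, 0.0, 0
for start in range(0, len(tokens) - ctx - 1, ctx):
    chunk = mx.array(tokens[start : start + ctx + 1])[None]   # (1, ctx+1)
    logits = model(chunk[:, :-1])                              # (1, ctx, vocab)
    loss = nn.losses.cross_entropy(logits[0], chunk[0, 1:], reduction="none")
    nll += loss.sum().item()
    count += loss.size

print(f"perplexity @ ctx={ctx}: {math.exp(nll / count):.2f}")
```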
Real-world throughput (MLX)
- Expect Q5 ≈ 10–20% slower than my Q4/gs32 build at the same settings.
- Typical behavior on recent Macs: low TTFB (~0.3–0.6 s) and fast streaming for interactive use.
- Throughput depends on prompt length and context; CLI micro-runs underreport tok/s due to startup overhead.
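If you do measure, a minimal probe like this reports TTFB and decode tok/s. It assumes `mlx_lm.stream_generate` yields once per generated token; API details differ slightly across mlx-lm releases, so take it as a sketch.

```python
# Minimal latency/throughput probe; a sketch, assuming mlx_lm.stream_generate yields
# once per generated token (details vary a bit across mlx-lm releases).
import time
from mlx_lm import load, stream_generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")
prompt = "Summarize why gs=32 helps MoE quantization."

t0 = time.perf_counter()
ttfb, n_tokens = None, 0
for _ in stream_generate(model, tokenizer, prompt, max_tokens=256):
    if ttfb is None:
        ttfb = time.perf_counter() - t0    # time to first streamed token
    n_tokens += 1
elapsed = time.perf_counter() - t0

print(f"TTFB: {ttfb:.2f} s   decode: {n_tokens / (elapsed - ttfb):.1f} tok/s")
```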
Footprint
- On-disk: ~16 GB (varies slightly by build).
- RAM: Targets 24 GB+; not viable on 16 GB Macs, where the default Metal wired-memory limit caps GPU allocations at roughly 10.67 GB (about two-thirds of RAM).
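To check the budget on your own machine, recent MLX builds expose the Metal limits directly; the function and field names below are an assumption if you're on an older MLX release.

```python
# Check the unified-memory budget Metal reports to MLX on this Mac.
# mx.metal.device_info() and its field names are assumed per recent MLX releases.
import mlx.core as mx

info = mx.metal.device_info()
gib = 1024 ** 3
print(f"total unified memory:    {info['memory_size'] / gib:.1f} GiB")
print(f"recommended working set: {info['max_recommended_working_set_size'] / gib:.1f} GiB")
```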
Quick start (MLX)
```bash
pip install mlx-lm

python -m mlx_lm.convert --hf-path halley-ai/gpt-oss-20b --mlx-path ./gpt-oss-20b-MLX-5bit-gs32 -q \
  --q-bits 5 --q-group-size 32

python -m mlx_lm.generate --model ./gpt-oss-20b-MLX-5bit-gs32 \
  --prompt "Summarize why gs=32 helps MoE quantization."
```
Prompting tips
- Works with ChatML / Harmony-style formatting.
- For cleaner outputs, add to system: “Final answer only; do not output analysis.”
- If your template uses channels, consider stop sequences such as `<|channel|>analysis` and `<|im_end|>` (see the sketch below).
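Putting the tips together in the Python API (a sketch: `apply_chat_template` comes from the wrapped HF tokenizer, and the exact template is whatever the repo ships):

```python
# Prompting sketch: "final answer only" system message plus the stop strings from the
# tips above. Pass the stop strings to whatever client/server handles your generation.
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")

messages = [
    {"role": "system", "content": "Final answer only; do not output analysis."},
    {"role": "user", "content": "Explain in two sentences why gs=32 helps MoE quantization."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
stops = ["<|channel|>analysis", "<|im_end|>"]  # for clients that accept stop sequences

print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```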
Who should try this
- Teams exploring on-prem assistants on M-series Macs (24–32 GB).
- Anyone benchmarking MoE on MLX or comparing Q5 vs Q4 vs 8-bit trade-offs.
Siblings & baseline
- Q6 (gs=32) — near-Q8 fidelity:
halley-ai/gpt-oss-20b-MLX-6bit-gs32
- Q4 (gs=32) — minimal-size tier (bigger quality hit):
halley-ai/gpt-oss-20b-MLX-4bit-gs32
- 8-bit reference (upstream):
lmstudio-community/gpt-oss-20b-MLX-8bit
Notes
- MLX only (macOS ≥ 13.5, Apple Silicon). Not available on HF Inference API.
- License: Apache-2.0 (inherits from base model).
- If you test it, I’d love your tokens/sec, TTFB, and PPL on different Macs/contexts.