GPT-OSS-20B · MLX Q5 (gs=32) — practical on 24–32 GB Macs, much closer to Q8 than Q4

#1
by sebastavar - opened

I built this because I wanted GPT-OSS-20B (sparse MoE, MXFP4 upstream) to be practical on everyday Macs, especially 24–32 GB machines, without giving up much quality. The result is a 5-bit, group-size-32 MLX quant that keeps latency snappy for interactive chat while landing much closer to the 8-bit baseline than the 4-bit variant.

Repo: halley-ai/gpt-oss-20b-MLX-5bit-gs32
https://huggingface.co/halley-ai/gpt-oss-20b-MLX-5bit-gs32

Why this exists

  • On-device, on-prem: MLX (Metal) on Apple Silicon, no external GPU required.
  • Fits the sweet spot: Runs on 24 GB (and breathes on 32 GB), leaving KV headroom.
  • MoE-friendly quant: gs=32 reduces routing drift, keeping expert selection steadier than larger group sizes (see the sketch below).
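
The intuition: with per-group affine quantization, each scale/offset pair only has to cover 32 weights, so a single outlier degrades far fewer neighbors. Here is a minimal NumPy sketch of that effect; it is illustrative only (made-up tensor sizes, and MLX's real quantization kernels differ in detail):

import numpy as np

def fake_quant(w, bits=5, group_size=32):
    # Per-group affine (min/max) quantize-then-dequantize, used only to measure error.
    q_max = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / q_max
    q = np.clip(np.round((g - lo) / scale), 0, q_max)
    return (q * scale + lo).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 256)).astype(np.float32)   # stand-in for one weight matrix
for gs in (32, 64, 128):
    err = np.abs(w - fake_quant(w, bits=5, group_size=gs)).mean()
    print(f"gs={gs}: mean abs reconstruction error = {err:.5f}")

Smaller groups track the local weight range more tightly, which is what keeps router and expert projections from drifting.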

Quality vs 8-bit (ctx=4096, same tokenizer)

  • 8-bit (reference): PPL 10.75
  • Q6 / gs=32: PPL 10.46 (better than 8-bit)
  • Q5 / gs=32: PPL 11.11 (close to 8-bit)
  • Q4 / gs=32: PPL 13.70 (meaningful drop; this Q5 build is the intended 24 GB tier)

(Your mileage may vary by prompt mix; these are my runs.)
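
If you want to re-run these numbers, this is roughly how I'd compute fixed-window perplexity with mlx-lm in Python. It is a sketch rather than the exact script behind the table; it assumes a local heldout.txt and the usual mlx-lm convention that calling the model returns next-token logits:

import math
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")
ids = tokenizer.encode(open("heldout.txt").read())      # held-out eval text (placeholder path)
ctx, nll, count = 4096, 0.0, 0
for i in range(0, len(ids) - ctx, ctx):
    chunk = mx.array(ids[i : i + ctx + 1])[None]        # (1, ctx+1) token window
    logits = model(chunk[:, :-1]).astype(mx.float32)    # (1, ctx, vocab) next-token logits
    nll += nn.losses.cross_entropy(logits, chunk[:, 1:], reduction="sum").item()
    count += ctx
print(f"PPL@{ctx}: {math.exp(nll / count):.2f}")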

Real-world throughput (MLX)

  • Expect Q5 ≈ 10–20% slower than my Q4/gs32 build at the same settings.
  • Typical behavior on recent Macs: low TTFB (~0.3–0.6 s) and fast streaming for interactive use.
  • Throughput depends on prompt length and context; CLI micro-runs underreport tok/s due to startup overhead.
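
To reproduce TTFB and decode tok/s in Python instead of eyeballing CLI runs, a small timing loop over mlx-lm's streaming API does the job; treat the exact stream_generate keyword arguments as my assumption about recent mlx-lm releases:

import time
from mlx_lm import load, stream_generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")
prompt = "Explain the trade-off between group size and quantization error."
start, ttfb, n_tokens = time.perf_counter(), None, 0
for _resp in stream_generate(model, tokenizer, prompt, max_tokens=256):
    if ttfb is None:
        ttfb = time.perf_counter() - start              # time to first streamed token
    n_tokens += 1                                       # one response per generated token
elapsed = time.perf_counter() - start
print(f"TTFB: {ttfb:.2f}s | decode: {n_tokens / (elapsed - ttfb):.1f} tok/s")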

Footprint

  • On-disk: ~16 GB (varies slightly by build).
  • RAM: Targets 24 GB and up; not viable on 16 GB machines, where the Metal working-set ceiling is about 10.67 GB (you can check yours with the snippet below).
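
Checking the ceiling your own machine reports is straightforward; this assumes mlx exposes Metal device properties under the key shown (my recollection of the API, so verify against your mlx version):

import mlx.core as mx

info = mx.metal.device_info()                            # Metal device properties (assumed API)
limit_gb = info["max_recommended_working_set_size"] / 1024**3
print(f"Recommended GPU working set: {limit_gb:.2f} GB") # ~10.67 GB on a 16 GB Mac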

Quick start (MLX)

pip install mlx-lm
# reproduce the quant from the upstream weights (or just load the prebuilt repo; see the Python snippet below)
python -m mlx_lm.convert --hf-path openai/gpt-oss-20b --mlx-path ./gpt-oss-20b-MLX-5bit-gs32 -q \
  --q-bits 5 --q-group-size 32
# then generate with the local quantized model
python -m mlx_lm.generate --model ./gpt-oss-20b-MLX-5bit-gs32 \
  --prompt "Summarize why gs=32 helps MoE quantization."

Prompting tips

  • Works with ChatML / Harmony-style formatting.
  • For cleaner outputs, add to system: “Final answer only; do not output analysis.”
  • If your template uses channels, consider stops like: <|channel|>analysis, <|im_end|>.
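
Putting the tips together, here's how I'd route that system line through the tokenizer's chat template in Python; the template bundled with the repo renders the Harmony/channel structure, and the message contents below are just examples:

from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")
messages = [
    {"role": "system", "content": "Final answer only; do not output analysis."},
    {"role": "user", "content": "Why does gs=32 help MoE quantization?"},
]
# apply_chat_template returns the prompt tokens with the generation header appended
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))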

Who should try this

  • Teams exploring on-prem assistants on M-series Macs (24–32 GB).
  • Anyone benchmarking MoE on MLX or comparing Q5 vs Q4 vs 8-bit trade-offs.

Siblings & baseline

  • Q6 (gs=32) — near-Q8 fidelity: halley-ai/gpt-oss-20b-MLX-6bit-gs32
  • Q4 (gs=32) — minimal-size tier (bigger quality hit): halley-ai/gpt-oss-20b-MLX-4bit-gs32
  • 8-bit reference (upstream): lmstudio-community/gpt-oss-20b-MLX-8bit

Notes

  • MLX only (macOS ≥ 13.5, Apple Silicon). Not available on HF Inference API.
  • License: Apache-2.0 (inherits from base model).
  • If you test it, I’d love your tokens/sec, TTFB, and PPL on different Macs/contexts.
