GPT-OSS-20B · MLX Q5 (gs=32) — practical on 24–32 GB Macs, much closer to Q8 than Q4

#1
by sebastavar - opened

I built this because I wanted GPT-OSS-20B (sparse MoE, MXFP4 upstream) to be practical on everyday Macs, especially 24–32 GB machines, without giving up much quality. The result is a 5-bit, group-size-32 MLX quant that keeps latency snappy for interactive chat while landing much closer to the 8-bit baseline than the 4-bit variant.

Repo: halley-ai/gpt-oss-20b-MLX-5bit-gs32
https://huggingface.co/halley-ai/gpt-oss-20b-MLX-5bit-gs32

Why this exists

  • On-device, on-prem: MLX (Metal) on Apple Silicon, no external GPU required.
  • Fits the sweet spot: Runs on 24 GB (and breathes on 32 GB), leaving KV headroom.
  • MoE-friendly quant: gs=32 reduces routing drift, keeping expert selection steadier than larger group sizes (see the sketch below).
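
The intuition: with per-group affine quantization, each scale/offset pair only has to cover 32 weights, so a single outlier degrades far fewer neighbors. Here is a minimal NumPy sketch of that effect; it is illustrative only (made-up tensor sizes, and MLX's real quantization kernels differ in detail):

import numpy as np

def fake_quant(w, bits=5, group_size=32):
    # Per-group affine (min/max) quantize-then-dequantize, used only to measure error.
    q_max = 2 ** bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / q_max
    q = np.clip(np.round((g - lo) / scale), 0, q_max)
    return (q * scale + lo).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 256)).astype(np.float32)   # stand-in for one weight matrix
for gs in (32, 64, 128):
    err = np.abs(w - fake_quant(w, bits=5, group_size=gs)).mean()
    print(f"gs={gs}: mean abs reconstruction error = {err:.5f}")

Smaller groups track the local weight range more tightly, which is what keeps router and expert projections from drifting.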

Quality vs 8-bit (ctx=4096, same tokenizer)

  • 8-bit (reference): PPL 10.75
  • Q6 / gs=32: PPL 10.46 (better than 8-bit)
  • Q5 / gs=32: PPL 11.11 (close to 8-bit)
  • Q4 / gs=32: PPL 13.70 (meaningful drop; this Q5 build is the intended 24 GB tier)

(Your mileage may vary by prompt mix; these are my runs.)
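
If you want to re-run these numbers, this is roughly how I'd compute fixed-window perplexity with mlx-lm in Python. It is a sketch rather than the exact script behind the table; it assumes a local heldout.txt and the usual mlx-lm convention that calling the model returns next-token logits:

import math
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")
ids = tokenizer.encode(open("heldout.txt").read())      # held-out eval text (placeholder path)
ctx, nll, count = 4096, 0.0, 0
for i in range(0, len(ids) - ctx, ctx):
    chunk = mx.array(ids[i : i + ctx + 1])[None]        # (1, ctx+1) token window
    logits = model(chunk[:, :-1]).astype(mx.float32)    # (1, ctx, vocab) next-token logits
    nll += nn.losses.cross_entropy(logits, chunk[:, 1:], reduction="sum").item()
    count += ctx
print(f"PPL@{ctx}: {math.exp(nll / count):.2f}")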

Real-world throughput (MLX)

  • Expect Q5 ≈ 10–20% slower than my Q4/gs32 build at the same settings.
  • Typical behavior on recent Macs: low TTFB (~0.3–0.6 s) and fast streaming for interactive use.
  • Throughput depends on prompt length and context; CLI micro-runs underreport tok/s due to startup overhead.
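
To reproduce TTFB and decode tok/s in Python instead of eyeballing CLI runs, a small timing loop over mlx-lm's streaming API does the job; treat the exact stream_generate keyword arguments as my assumption about recent mlx-lm releases:

import time
from mlx_lm import load, stream_generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")
prompt = "Explain the trade-off between group size and quantization error."
start, ttfb, n_tokens = time.perf_counter(), None, 0
for _resp in stream_generate(model, tokenizer, prompt, max_tokens=256):
    if ttfb is None:
        ttfb = time.perf_counter() - start              # time to first streamed token
    n_tokens += 1                                       # one response per generated token
elapsed = time.perf_counter() - start
print(f"TTFB: {ttfb:.2f}s | decode: {n_tokens / (elapsed - ttfb):.1f} tok/s")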

Footprint

  • On-disk: ~16 GB (varies slightly by build).
  • RAM: Targets 24 GB and up; not viable on 16 GB machines, where the Metal working-set ceiling is about 10.67 GB (you can check yours with the snippet below).
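
Checking the ceiling your own machine reports is straightforward; this assumes mlx exposes Metal device properties under the key shown (my recollection of the API, so verify against your mlx version):

import mlx.core as mx

info = mx.metal.device_info()                            # Metal device properties (assumed API)
limit_gb = info["max_recommended_working_set_size"] / 1024**3
print(f"Recommended GPU working set: {limit_gb:.2f} GB") # ~10.67 GB on a 16 GB Mac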

Quick start (MLX)

pip install mlx-lm
# reproduce the quant from the upstream weights (or just load the prebuilt repo; see the Python snippet below)
python -m mlx_lm.convert --hf-path openai/gpt-oss-20b --mlx-path ./gpt-oss-20b-MLX-5bit-gs32 -q \
  --q-bits 5 --q-group-size 32
# then generate with the local quantized model
python -m mlx_lm.generate --model ./gpt-oss-20b-MLX-5bit-gs32 \
  --prompt "Summarize why gs=32 helps MoE quantization."

Prompting tips

  • Works with ChatML / Harmony-style formatting.
  • For cleaner outputs, add to system: “Final answer only; do not output analysis.”
  • If your template uses channels, consider stops like: <|channel|>analysis, <|im_end|>.
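
Putting the tips together, here's how I'd route that system line through the tokenizer's chat template in Python; the template bundled with the repo renders the Harmony/channel structure, and the message contents below are just examples:

from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")
messages = [
    {"role": "system", "content": "Final answer only; do not output analysis."},
    {"role": "user", "content": "Why does gs=32 help MoE quantization?"},
]
# apply_chat_template returns the prompt tokens with the generation header appended
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))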

Who should try this

  • Teams exploring on-prem assistants on M-series Macs (24–32 GB).
  • Anyone benchmarking MoE on MLX or comparing Q5 vs Q4 vs 8-bit trade-offs.

Siblings & baseline

  • Q6 (gs=32) — near-Q8 fidelity: halley-ai/gpt-oss-20b-MLX-6bit-gs32
  • Q4 (gs=32) — minimal-size tier (bigger quality hit): halley-ai/gpt-oss-20b-MLX-4bit-gs32
  • 8-bit reference (upstream): lmstudio-community/gpt-oss-20b-MLX-8bit

Notes

  • MLX only (macOS ≥ 13.5, Apple Silicon). Not available on HF Inference API.
  • License: Apache-2.0 (inherits from base model).
  • If you test it, I’d love your tokens/sec, TTFB, and PPL on different Macs/contexts.
