GPT-OSS-20B on a 16 GB Mac (MLX): Why Q3 Quantization Didn't Work, and What I Recommend
TL;DR
I tried to run GPT-OSS-20B (MLX) on my 16 GB Apple Silicon Mac by quantizing the MoE experts to 3-bit (Q3). Unfortunately, Q3 significantly degraded output quality regardless of whether I used group size 32 or 64. If you're facing similar constraints, my clear recommendation is to skip Q3 quantization with MLX entirely and switch to a GGUF build with llama.cpp. If you're determined to stick with MLX, the safer path is a Mac with at least 24 GB of RAM.
Environment
- Hardware: 16 GB Apple Silicon Mac (Unified Memory)
- Practical GPU limit: roughly 10.67 GiB available for MLX/Metal workloads
- Software stack: latest `mlx_lm` + Python 3.11/3.12; also tested llama.cpp with the Metal backend
What Went Wrong
Objective: Run GPT-OSS-20B without crashes or swapping on a 16 GB machine while maintaining decent quality.
Strategy: Quantize MoE experts to Q3 (group sizes tested: gs32 and gs64), leaving other parts at higher precision.
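For reference, the conversion sweep itself is short. Here is a minimal sketch, assuming a recent `mlx_lm` whose `convert()` accepts `q_bits` and `q_group_size`; the repo id and output paths are placeholders, not verbatim what I ran:

```python
# Sketch: produce Q3 MLX checkpoints to compare group sizes. Assumes a recent
# mlx_lm where convert() takes q_bits / q_group_size; the hf_path below is the
# assumed upstream repo id and may differ from the checkpoint you start from.
from mlx_lm import convert

for gs in (32, 64):
    convert(
        hf_path="openai/gpt-oss-20b",             # assumed upstream repo id
        mlx_path=f"gpt-oss-20b-mlx-q3-gs{gs}",    # local output folder
        quantize=True,
        q_bits=3,
        q_group_size=gs,
        # Keeping non-expert weights at higher precision goes through
        # convert()'s quant_predicate hook, whose exact contract varies by
        # mlx_lm version, so it is omitted from this sketch.
    )
```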
Issues Encountered:
- Generation drifted into repetitive quiz-like outputs (e.g., "Paris.\n\nB: ... C: ...").
- Perplexity spiked dramatically (~90+), despite correct evaluation settings and tokenizer.
- Changing group sizes didn't help; both gs32 and gs64 resulted in unstable outputs.
Additional Learnings (to save you time):
- MLX expects a local model folder to include `config.json`, tokenizer files, and weights (`model.safetensors` or `weights.npz`). Missing metadata triggers confusing HF repo-id errors.
- MLX's `generate` function has inconsistent kwargs across versions. Stick to greedy generation tests without temperature to validate quickly.
- Always use the raw variant of WikiText (`wikitext-2-raw-v1`) and proper teacher forcing (input tokens shifted by one) for valid perplexity checks. A sketch of both checks follows this list.
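To make the last two points concrete, here is a minimal sketch of the greedy smoke test and the teacher-forced perplexity check. It assumes a recent `mlx_lm` (greedy decoding by default, `model(tokens)` returning next-token logits) plus the Hugging Face `datasets` package; exact kwargs are version-dependent.

```python
# Sketch of the two sanity checks: greedy generation and shift-by-one perplexity
# on the raw WikiText-2 test split. Paths and the 4096-token slice are placeholders.
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load, generate
from datasets import load_dataset

model, tokenizer = load("gpt-oss-20b-mlx-q3-gs32")  # local MLX model folder

# 1) Greedy smoke test: no temperature/sampler kwargs, so it works across versions.
print(generate(model, tokenizer, prompt="The capital of France is", max_tokens=32))

# 2) Teacher-forced perplexity on wikitext-2-raw-v1 (raw variant, not the pre-tokenized one).
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
tokens = tokenizer.encode(text)[:4097]   # small slice keeps this cheap; real runs should window the full split
inputs = mx.array(tokens)[None]          # shape (1, T)
logits = model(inputs[:, :-1])           # predict token t+1 from tokens <= t
targets = inputs[:, 1:]                  # labels shifted by one
nll = nn.losses.cross_entropy(logits, targets, reduction="mean")
print("perplexity:", mx.exp(nll).item())
```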
My Recommended Solution: GGUF with llama.cpp
For my 16 GB Mac, the most reliable and stable solution was a GGUF build running under llama.cpp. The flexibility llama.cpp gives you in splitting layers between CPU and GPU let me run GPT-OSS-20B comfortably, without the quality degradation I saw with Q3.
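If you want that same CPU/GPU split from Python rather than the llama.cpp CLI or LM Studio, the llama-cpp-python bindings expose it as `n_gpu_layers`. A minimal sketch; the GGUF filename and layer count are assumptions you will need to adapt to your build and memory budget:

```python
# Sketch using the llama-cpp-python bindings (pip install llama-cpp-python,
# built with Metal). Raise n_gpu_layers until you approach the ~10.7 GiB Metal
# budget; the remaining layers stay on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b-Q4_K_M.gguf",  # assumed local GGUF file
    n_gpu_layers=20,                        # partial offload; -1 offloads everything
    n_ctx=4096,
)
out = llm("Explain MoE routing in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```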
Notes:
- Choose a Q4_K profile that suits your memory budget. 16 GB machines typically handle Q4 with smart CPU offloading well.
- Consider GUI options like LM Studio with GGUF builds for convenience.
If You Still Want MLX
Buy a new Mac with at least 24 GB RAM.
Bottom Line
For GPT-OSS-20B on MLX with limited GPU memory (16 GB Macs), Q3 quantization proved problematic. My practical recommendation: avoid Q3 quantization and prefer GGUF with llama.cpp, or at minimum, use MLX Q4 experts with CPU offload.