GPT-OSS-20B on a 16 GB Mac (MLX): Why Q3 Quantization Didn't Work, and What I Recommend


TL;DR

I tried to run GPT-OSS-20B (MLX) on my 16 GB Apple Silicon Mac by quantizing the MoE experts to 3 bits (Q3). Unfortunately, Q3 significantly degraded output quality, regardless of whether I used group size 32 or 64. If you're facing similar constraints, my clear recommendation is to skip Q3 quantization with MLX entirely and switch to a GGUF build with llama.cpp. If you're determined to stick with MLX, the safer path is a Mac with at least 24 GB of RAM.

Environment

  • Hardware: 16 GB Apple Silicon Mac (Unified Memory)
  • Practical GPU limit: ~10.67 GiB available for MLX/Metal workloads (see the check after this list)
  • Software stack: Latest mlx_lm + Python 3.11/3.12; tested llama.cpp with Metal backend
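
That ~10.67 GiB is roughly what Metal reports as its recommended working-set size on a 16 GB machine. Here is a minimal sketch for checking your own budget, assuming a recent mlx release that exposes mx.metal.device_info() (the exact dictionary keys may differ between versions):

```python
# Minimal check of the Metal memory budget MLX sees on this machine.
# Assumes a recent mlx version exposing mx.metal.device_info(); the exact
# dictionary keys may vary between releases.
import mlx.core as mx

info = mx.metal.device_info()
total = info.get("memory_size", 0)                        # physical unified memory
budget = info.get("max_recommended_working_set_size", 0)  # Metal's working-set limit

print(f"Unified memory:                {total / 2**30:.2f} GiB")
print(f"Recommended Metal working set: {budget / 2**30:.2f} GiB")
```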

What Went Wrong

Objective: Run GPT-OSS-20B without crashes or swapping on a 16 GB machine while maintaining decent quality.

Strategy: Quantize MoE experts to Q3 (group sizes tested: gs32 and gs64), leaving other parts at higher precision.
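
For reference, here is a minimal sketch of one way to express that recipe with mlx_lm (not necessarily my exact invocation). It assumes a recent mlx_lm whose convert() accepts a quant_predicate callable taking (path, module, config); the "experts" substring used to match MoE expert weights and the Q6 fallback are illustrative, so inspect your checkpoint's module names first.

```python
# Sketch: quantize MoE experts to Q3, keep everything else at higher precision.
# Assumes mlx_lm.convert() supports a quant_predicate callable; the path
# matching below is a guess for gpt-oss's expert modules.
from mlx_lm import convert

def q3_experts_only(path, module, config):
    """Per-module quantization settings: Q3 for experts, Q6 elsewhere."""
    if not hasattr(module, "to_quantized"):    # skip norms and other non-quantizable layers
        return False
    if "experts" in path:                      # hypothetical match for MoE expert weights
        return {"bits": 3, "group_size": 32}   # the gs32 run; use 64 for gs64
    return {"bits": 6, "group_size": 64}       # keep non-expert weights at higher precision

convert(
    hf_path="openai/gpt-oss-20b",
    mlx_path="gpt-oss-20b-q3-experts",
    quantize=True,
    quant_predicate=q3_experts_only,
)
```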

Issues Encountered:

  • Generation drifted into repetitive quiz-like outputs (e.g., "Paris.\n\nB: ... C: ...").
  • Perplexity spiked dramatically (~90+), despite correct evaluation settings and tokenizer.
  • Changing group sizes didn't help; both gs32 and gs64 resulted in unstable outputs.

Additional Learnings (to save you time):

  • MLX expects a local model folder to include config.json, tokenizer files, and weights (model.safetensors or weights.npz). Missing metadata triggers confusing HF repo-id errors.
  • MLX's generate function has inconsistent kwargs across versions. Stick to greedy generation (no temperature argument) to validate quickly.
  • Always use the raw variant of WikiText (wikitext-2-raw-v1) and proper teacher forcing (targets shifted by one token relative to the inputs) for valid perplexity checks; see the sketch after this list.
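
A minimal sketch of both checks, assuming a recent mlx_lm (where generation defaults to greedy sampling when no sampler or temperature is passed) plus the datasets package; the local model path and chunk length are placeholders:

```python
# Quick validation sketch: a greedy smoke test plus teacher-forced perplexity
# on wikitext-2-raw-v1. The local model path and chunk length are placeholders.
import math
import mlx.core as mx
import mlx.nn as nn
from datasets import load_dataset
from mlx_lm import load, generate

model, tokenizer = load("gpt-oss-20b-q3-experts")   # local converted folder

# 1) Greedy smoke test: no temperature kwargs, so version differences don't bite.
print(generate(model, tokenizer, prompt="The capital of France is", max_tokens=32))

# 2) Teacher-forced perplexity on the *raw* WikiText-2 test split.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer.encode(text)

chunk, total_nll, total_tokens = 1024, 0.0, 0
for start in range(0, len(ids) - chunk, chunk):
    window = mx.array(ids[start : start + chunk])[None]    # shape (1, chunk)
    logits = model(window)                                  # shape (1, chunk, vocab)
    # Predict token t+1 from tokens <= t: shift logits and targets by one.
    nll = nn.losses.cross_entropy(logits[:, :-1, :], window[:, 1:], reduction="sum")
    total_nll += nll.item()
    total_tokens += window.shape[1] - 1

print(f"wikitext-2-raw perplexity: {math.exp(total_nll / total_tokens):.2f}")
```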

My Recommended Solution: GGUF with llama.cpp

For my 16 GB Mac, the most reliable solution was a GGUF build running on llama.cpp. Its flexible CPU/GPU layer distribution let me run GPT-OSS-20B comfortably, without the precision issues described above.

Notes:

  • Choose a Q4_K profile that fits your memory budget; 16 GB machines typically handle Q4 well with sensible CPU offloading (see the sketch after these notes).
  • Consider GUI options like LM Studio with GGUF builds for convenience.
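
If you prefer scripting over a GUI, here is a minimal sketch using the llama-cpp-python bindings (the same llama.cpp engine underneath, though not necessarily how you'll run it); the GGUF filename and the n_gpu_layers split are placeholders to tune against your Metal budget.

```python
# Sketch of the GGUF route via llama-cpp-python. The model file and the
# CPU/GPU layer split are placeholders; lower n_gpu_layers if the Metal
# working set overflows on a 16 GB machine.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b-Q4_K_M.gguf",   # hypothetical local GGUF file
    n_ctx=4096,
    n_gpu_layers=20,                        # keep some layers on CPU to stay in budget
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```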

If You Still Want MLX

Buy a new Mac with at least 24 GB RAM.

Bottom Line

For GPT-OSS-20B on MLX with limited GPU memory (a 16 GB Mac), Q3 quantization proved problematic. My practical recommendation: avoid Q3, prefer GGUF with llama.cpp, or, at minimum, use MLX Q4 experts with CPU offload.
