GPT-OSS-20B on a 16 GB Mac (MLX): Why Q3 Quantization Didn't Work, and What I Recommend


TL;DR

I tried to run GPT-OSS-20B (MLX) on my 16 GB Apple Silicon Mac by quantizing the MoE experts to 3 bits (Q3). Unfortunately, Q3 significantly degraded output quality, regardless of whether I used group size 32 or 64. If you're facing similar constraints, my clear recommendation is to skip Q3 quantization with MLX entirely and switch to a GGUF build with llama.cpp. If you're determined to stick with MLX, the safer path is a Mac with at least 24 GB of RAM.

Environment

  • Hardware: 16 GB Apple Silicon Mac (Unified Memory)
  • Practical GPU limit: ~10.67 GiB available for MLX/Metal workloads (see the check after this list)
  • Software stack: Latest mlx_lm + Python 3.11/3.12; tested llama.cpp with Metal backend
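
That ~10.67 GiB is roughly what Metal reports as its recommended working-set size on a 16 GB machine. Here is a minimal sketch for checking your own budget, assuming a recent mlx release that exposes mx.metal.device_info() (the exact dictionary keys may differ between versions):

```python
# Minimal check of the Metal memory budget MLX sees on this machine.
# Assumes a recent mlx version exposing mx.metal.device_info(); the exact
# dictionary keys may vary between releases.
import mlx.core as mx

info = mx.metal.device_info()
total = info.get("memory_size", 0)                        # physical unified memory
budget = info.get("max_recommended_working_set_size", 0)  # Metal's working-set limit

print(f"Unified memory:                {total / 2**30:.2f} GiB")
print(f"Recommended Metal working set: {budget / 2**30:.2f} GiB")
```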

What Went Wrong

Objective: Run GPT-OSS-20B without crashes or swapping on a 16 GB machine while maintaining decent quality.

Strategy: Quantize MoE experts to Q3 (group sizes tested: gs32 and gs64), leaving other parts at higher precision.
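
For reference, here is a minimal sketch of one way to express that recipe with mlx_lm (not necessarily my exact invocation). It assumes a recent mlx_lm whose convert() accepts a quant_predicate callable taking (path, module, config); the "experts" substring used to match MoE expert weights and the Q6 fallback are illustrative, so inspect your checkpoint's module names first.

```python
# Sketch: quantize MoE experts to Q3, keep everything else at higher precision.
# Assumes mlx_lm.convert() supports a quant_predicate callable; the path
# matching below is a guess for gpt-oss's expert modules.
from mlx_lm import convert

def q3_experts_only(path, module, config):
    """Per-module quantization settings: Q3 for experts, Q6 elsewhere."""
    if not hasattr(module, "to_quantized"):    # skip norms and other non-quantizable layers
        return False
    if "experts" in path:                      # hypothetical match for MoE expert weights
        return {"bits": 3, "group_size": 32}   # the gs32 run; use 64 for gs64
    return {"bits": 6, "group_size": 64}       # keep non-expert weights at higher precision

convert(
    hf_path="openai/gpt-oss-20b",
    mlx_path="gpt-oss-20b-q3-experts",
    quantize=True,
    quant_predicate=q3_experts_only,
)
```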

Issues Encountered:

  • Generation drifted into repetitive quiz-like outputs (e.g., "Paris.\n\nB: ... C: ...").
  • Perplexity spiked dramatically (~90+), despite correct evaluation settings and tokenizer.
  • Changing group sizes didn't help; both gs32 and gs64 resulted in unstable outputs.

Additional Learnings (to save you time):

  • MLX expects a local model folder to include config.json, tokenizer files, and weights (model.safetensors or weights.npz). Missing metadata triggers confusing HF repo-id errors.
  • MLX's generate function has inconsistent kwargs across versions. Stick to greedy generation (no temperature argument) to validate quickly.
  • Always use the raw variant of WikiText (wikitext-2-raw-v1) and proper teacher forcing (targets shifted by one token relative to the inputs) for valid perplexity checks; see the sketch after this list.
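
A minimal sketch of both checks, assuming a recent mlx_lm (where generation defaults to greedy sampling when no sampler or temperature is passed) plus the datasets package; the local model path and chunk length are placeholders:

```python
# Quick validation sketch: a greedy smoke test plus teacher-forced perplexity
# on wikitext-2-raw-v1. The local model path and chunk length are placeholders.
import math
import mlx.core as mx
import mlx.nn as nn
from datasets import load_dataset
from mlx_lm import load, generate

model, tokenizer = load("gpt-oss-20b-q3-experts")   # local converted folder

# 1) Greedy smoke test: no temperature kwargs, so version differences don't bite.
print(generate(model, tokenizer, prompt="The capital of France is", max_tokens=32))

# 2) Teacher-forced perplexity on the *raw* WikiText-2 test split.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer.encode(text)

chunk, total_nll, total_tokens = 1024, 0.0, 0
for start in range(0, len(ids) - chunk, chunk):
    window = mx.array(ids[start : start + chunk])[None]    # shape (1, chunk)
    logits = model(window)                                  # shape (1, chunk, vocab)
    # Predict token t+1 from tokens <= t: shift logits and targets by one.
    nll = nn.losses.cross_entropy(logits[:, :-1, :], window[:, 1:], reduction="sum")
    total_nll += nll.item()
    total_tokens += window.shape[1] - 1

print(f"wikitext-2-raw perplexity: {math.exp(total_nll / total_tokens):.2f}")
```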

My Recommended Solution: GGUF with llama.cpp

For my 16 GB Mac, the most reliable solution was a GGUF build running on llama.cpp. Its flexible CPU/GPU layer distribution let me run GPT-OSS-20B comfortably, without the precision issues described above.

Notes:

  • Choose a Q4_K profile that fits your memory budget; 16 GB machines typically handle Q4 well with sensible CPU offloading (see the sketch after these notes).
  • Consider GUI options like LM Studio with GGUF builds for convenience.
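
If you prefer scripting over a GUI, here is a minimal sketch using the llama-cpp-python bindings (the same llama.cpp engine underneath, though not necessarily how you'll run it); the GGUF filename and the n_gpu_layers split are placeholders to tune against your Metal budget.

```python
# Sketch of the GGUF route via llama-cpp-python. The model file and the
# CPU/GPU layer split are placeholders; lower n_gpu_layers if the Metal
# working set overflows on a 16 GB machine.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-20b-Q4_K_M.gguf",   # hypothetical local GGUF file
    n_ctx=4096,
    n_gpu_layers=20,                        # keep some layers on CPU to stay in budget
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```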

If You Still Want MLX

Buy a new Mac with at least 24 GB RAM.

Bottom Line

For GPT-OSS-20B on MLX with limited GPU memory (a 16 GB Mac), Q3 quantization proved problematic. My practical recommendation: avoid Q3, prefer GGUF with llama.cpp, or, at minimum, use MLX Q4 experts with CPU offload.
