---
library_name: mlx
pipeline_tag: text-generation
inference: false
license: apache-2.0
base_model: openai/gpt-oss-20b
base_model_relation: quantized
language:
- en
- ro
tags:
- apple-silicon
- metal
- arm64
- 6-bit
- group-size-32
- moe
- mpx4
- openai
- halley-ai
---

# gpt-oss-20b — MLX 6-bit (group size 32)

**Summary.** This is a 6-bit (**Q6**) **MLX** quantization of **gpt-oss-20b** (sparse Mixture-of-Experts, MPx4). Group size is **32**.
Built for **Apple Silicon** with Metal acceleration.

- **Base model:** `openai/gpt-oss-20b` (Apache-2.0)
- **Quantization:** MLX Q6, `q_group_size=32` (some tensors remain FP16 for stability)
- **Files:** MLX weight shards + `config.json`; tokenizer files included for drop-in use
- **Footprint:** ~**18.38 GB** on disk
- **Intended use:** local inference / research on M-series Macs
- **Not intended for:** safety-critical decisions; outputs may be inaccurate or biased

## Requirements

**Runs on:** Apple Silicon (M1 or newer) with **macOS ≥ 13.5** via **MLX (Metal)**.

**Not supported:** Intel macOS / Linux / Windows (use a GGUF build + llama.cpp instead).

**RAM guidance:** 32 GB minimum for Q6 (gs=32); a 24 GB MacBook Pro **won't run it**. Extra RAM improves headroom.
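
To sanity-check the environment before downloading ~18 GB of weights, a minimal sketch (assuming a recent MLX build that exposes `mx.metal.is_available()` and `mx.metal.device_info()`):

```python
# Quick environment check for the requirements above (helper names are assumptions
# about recent MLX releases; adjust to your installed version).
import platform
import mlx.core as mx

assert platform.machine() == "arm64", "Apple Silicon (arm64) is required"
assert mx.metal.is_available(), "MLX Metal backend is not available"
print(mx.metal.device_info())  # reports architecture and memory limits for headroom checks
```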
## How to use (MLX)

```bash
pip install mlx-lm transformers
```

```python
# Python API (uses tokenizer bundled with this repo)
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute π.",
    max_tokens=256, max_kv_size=512,
))
```
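
The model can also be exercised from the command line. A minimal sketch using the `mlx_lm.generate` CLI that ships with mlx-lm (flag names can vary slightly between releases):

```bash
# One-off generation from the terminal; also prints prompt/generation tokens-per-second.
mlx_lm.generate \
  --model halley-ai/gpt-oss-20b-MLX-6bit-gs32 \
  --prompt "Explain the Chudnovsky algorithm to compute π." \
  --max-tokens 256
```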
## Performance (Apple Silicon, real-world)

LM Studio / CLI (MLX, Q6, gs=32): ~49–55 tok/s, TTFB ~0.35–0.45 s for ≈2k-token responses, measured on an M1 Max with 32 GB (short fixed-length runs show lower tok/s due to startup overhead).
Throughput varies with Mac model, context, and sampler settings.
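
To get a comparable number on your own machine, a rough sketch using the Python API; the only assumption beyond the example above is the `verbose` flag, which prints prompt and generation tokens-per-second in current mlx-lm releases:

```python
# Rough throughput check; results vary with Mac model, context, and sampler settings.
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
generate(
    model, tokenizer,
    prompt="Write a detailed overview of the Chudnovsky algorithm.",
    max_tokens=2048,
    verbose=True,  # prints timing statistics alongside the completion
)
```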
## Evaluation

Perplexity (PPL) streaming evaluation on WikiText-2; window = stride = 4096, ~100k tokens, EOS inserted between documents.

<table>
  <thead>
    <tr><th>Variant</th><th>PPL (ctx=4096)</th></tr>
  </thead>
  <tbody>
    <tr><td>MLX 8-bit, gs=64 (reference)</td><td>10.75</td></tr>
    <tr><td><strong>MLX 6-bit (gs=32)</strong></td><td><strong>10.46 (−2.7% vs 8-bit/gs64)</strong></td></tr>
    <tr><td>MLX 5-bit (gs=32)</td><td>11.11 (+3.3% vs 8-bit/gs64, +6.2% vs 6-bit/gs32)</td></tr>
    <tr><td>MLX 4-bit (gs=32)</td><td>13.70 (+27.4% vs 8-bit/gs64, +31.0% vs 6-bit/gs32)</td></tr>
  </tbody>
</table>

**Interpretation**

- MLX 6-bit/gs32: Best of the group; edges out 8-bit/gs64 slightly at a smaller footprint.
- MLX 5-bit/gs32: Small, consistent drop vs 6-bit/gs32 and 8-bit/gs64 (~3–6% PPL); strong “fits-16GB” option when GPU buffer limits matter.
- MLX 8-bit/gs64: Solid reference; near-FP16 quality at a larger footprint.
- MLX 4-bit/gs32: Trades accuracy for footprint; use when RAM is constrained or throughput is the priority.
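
For reference, the evaluation setup above can be reproduced with a short script. The sketch below assumes the Hugging Face `datasets` package for WikiText-2 and MLX's standard cross-entropy loss; it is not the exact script used to produce the table:

```python
# Hedged sketch: streaming perplexity with non-overlapping 4096-token windows,
# EOS inserted between documents, capped at ~100k tokens (as described above).
import math
import mlx.core as mx
import mlx.nn as nn
from datasets import load_dataset  # assumption: HF datasets for WikiText-2
from mlx_lm import load

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
docs = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]

# Concatenate documents, inserting EOS between them, then cap at ~100k tokens.
ids = []
for doc in docs:
    if doc.strip():
        ids.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])
ids = ids[:100_000]

window = 4096  # window = stride -> non-overlapping chunks
total_nll, total_tokens = 0.0, 0
for start in range(0, len(ids) - 1, window):
    chunk = ids[start:start + window + 1]
    if len(chunk) < 2:
        break
    inputs = mx.array(chunk[:-1])[None]   # model inputs
    targets = mx.array(chunk[1:])[None]   # next-token targets
    logits = model(inputs)
    total_nll += nn.losses.cross_entropy(logits, targets, reduction="sum").item()
    total_tokens += targets.size

print(f"PPL (ctx={window}): {math.exp(total_nll / total_tokens):.2f}")
```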
## Conversion details (provenance)

```bash
python -m mlx_lm convert \
  --hf-path openai/gpt-oss-20b \
  --mlx-path gpt-oss-20b-mlx-q6-gs32 \
  --q-bits 6 --q-group-size 32 -q
```

- Some non-expert tensors (embeddings, norms, router) remain FP16.
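
The same conversion can also be driven from Python; a sketch assuming the `mlx_lm.convert` entry point and its current keyword arguments:

```python
# Python equivalent of the CLI conversion above (argument names assumed from mlx-lm's API).
from mlx_lm import convert

convert(
    hf_path="openai/gpt-oss-20b",
    mlx_path="gpt-oss-20b-mlx-q6-gs32",
    quantize=True,
    q_bits=6,
    q_group_size=32,
)
```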
## Sibling & reference models

- halley-ai/gpt-oss-20b-MLX-5bit-gs32
- halley-ai/gpt-oss-20b-MLX-4bit-gs32
- Reference (8-bit, upstream): lmstudio-community/gpt-oss-20b-MLX-8bit

## Limitations & biases

Outputs may be factually wrong or unsafe. Don’t use them for medical, legal, or financial decisions without human review.
MoE models can be sensitive to prompt wording; prefer explicit instructions and structure.

## License & credits

- License: Apache-2.0 (inherited from the base model)
- Base model: OpenAI gpt-oss-20b
- Quantization: Halley AI Lab (MLX Q6, gs=32)
- Please cite both the base model and this repository when you use the weights.