---
library_name: mlx
pipeline_tag: text-generation
inference: false # MLX is macOS-only; HF Inference API won't run it
license: apache-2.0
base_model: openai/gpt-oss-20b
base_model_relation: quantized
language:
- en
- ro
tags:
- apple-silicon
- metal
- arm64
- 6-bit
- group-size-32
- moe
- mxfp4
- openai
- halley-ai
---
# gpt-oss-20b — MLX 6-bit (group size 32)
**Summary.** This is a 6-bit (**Q6**) **MLX** quantization of **gpt-oss-20b** (sparse Mixture-of-Experts, MXFP4 upstream). Group size is **32**.
Built for **Apple Silicon** with Metal acceleration.
- **Base model:** `openai/gpt-oss-20b` (Apache-2.0)
- **Quantization:** MLX Q6, `q_group_size=32` (some tensors remain FP16 for stability)
- **Files:** MLX weight shards + `config.json`; tokenizer files included for drop-in use
- **Footprint:** ~**18.38 GB** on disk (see the rough size check after this list)
- **Intended use:** local inference / research on M-series Macs
- **Not intended for:** safety-critical decisions; outputs may be inaccurate or biased
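The ~18.38 GB figure is consistent with a quick back-of-envelope estimate. A minimal sketch, assuming ~21 B total parameters and MLX's affine quantization (one fp16 scale and one fp16 bias per 32-weight group), and ignoring the tensors kept in FP16:

```python
# Rough size estimate for Q6 / gs=32 (assumptions: ~21e9 total parameters,
# fp16 scale + fp16 bias per group of 32 weights; FP16 tensors not modeled).
params = 21e9
bits_per_weight = 6 + (16 + 16) / 32            # payload + per-group overhead = 7 bits
print(f"{params * bits_per_weight / 8 / 1e9:.2f} GB")   # ≈ 18.4 GB
```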
## Requirements
**Runs on:** Apple Silicon (M1 or newer) with **macOS ≥ 13.5** via **MLX (Metal)**.
**Not supported:** Intel macOS / Linux / Windows (use a GGUF build + llama.cpp instead).
**RAM guidance:** 32 GB unified memory minimum for Q6 (gs=32); a 24 GB MacBook Pro will not run it. Extra RAM improves headroom.
## How to use (MLX)
```bash
pip install mlx-lm transformers
```
```python
# Python API (uses tokenizer bundled with this repo)
from mlx_lm import load, generate
model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute π.",
    max_tokens=256, max_kv_size=512,
))
```
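gpt-oss is trained as a chat model (harmony format), so for chat-style use it is worth applying the bundled chat template instead of passing a raw string. A minimal sketch, assuming a recent `mlx-lm` that accepts a token-id prompt:

```python
# Chat-style prompting via the bundled chat template (sketch; recent mlx-lm
# versions accept a list of token ids as the prompt).
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
messages = [{"role": "user", "content": "Explain the Chudnovsky algorithm to compute π."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```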
## Performance (Apple Silicon, real-world)
LM Studio / CLI (MLX, Q6, gs=32): ~49–55 tok/s, TTFB ~0.35–0.45 s on ≈2k-token responses, measured on an M1 Max with 32 GB (short fixed-length runs show lower tok/s because of startup overhead).
Throughput varies with Mac model, context length, and sampler settings.
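To get a rough throughput number on your own machine, you can time a generation end to end. A minimal sketch (it includes prompt processing, so it slightly under-reports steady-state decode speed; the prompt is just an example):

```python
# Quick-and-dirty throughput check (end-to-end, includes prompt processing).
import time
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
prompt = "Write a detailed explanation of the fast Fourier transform."

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
elapsed = time.perf_counter() - start
n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```

`generate(..., verbose=True)` also reports prompt and generation speeds separately.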
## Evaluation
Perplexity (PPL) streaming evaluation on WikiText-2; window=stride=4096, ~100k tokens, EOS inserted between docs.
<table>
<thead>
<tr><th>Variant</th><th>PPL (ctx=4096)</th></tr>
</thead>
<tbody>
<tr><td>MLX 8-bit (gs=64, reference)</td><td>10.75</td></tr>
<tr><td><strong>MLX 6-bit (gs=32)</strong></td><td><strong>10.46 (−2.7% vs 8-bit/gs64)</strong></td></tr>
<tr><td>MLX 5-bit (gs=32)</td><td>11.11 (+3.3% vs 8-bit/gs64, +6.2% vs 6-bit/gs32)</td></tr>
<tr><td>MLX 4-bit (gs=32)</td><td>13.70 (+27.4% vs 8-bit/gs64, +31.0% vs 6-bit/gs32)</td></tr>
</tbody>
</table>
**Interpretation**
- MLX 6-bit/gs32: Best of the group; edges out 8-bit/gs64 slightly at a smaller footprint.
- MLX 5-bit/gs32: Small, consistent drop vs 6-bit/gs32 and 8-bit/gs64 (~3–6% PPL); strong “fits-16GB” option when GPU buffer limits matter.
- MLX 8-bit/gs64: Solid reference; near‑FP16 quality at a larger footprint.
- MLX 4-bit/gs32: Trades accuracy for footprint; use when RAM is constrained or throughput is the priority.
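For reference, the numbers above come from a streaming sliding-window evaluation as described (window = stride = 4096, EOS between documents, ~100k tokens). A minimal sketch of that procedure, not the exact script used; dataset loading (via `datasets`) and tokenization details are assumptions:

```python
# Streaming perplexity sketch: non-overlapping 4096-token windows over the
# WikiText-2 test split, EOS inserted between documents, ~100k-token budget.
import math
import mlx.core as mx
from mlx_lm import load
from datasets import load_dataset  # extra dependency for the eval data

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
docs = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]

ids = []
for doc in docs:
    if doc.strip():
        ids.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])
ids = ids[:100_000]

window = 4096
total_nll, total_tokens = 0.0, 0
for start in range(0, len(ids) - window + 1, window):    # stride == window
    chunk = mx.array(ids[start:start + window])[None]
    logits = model(chunk).astype(mx.float32)              # [1, T, vocab]
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    targets = chunk[:, 1:]
    nll = -mx.take_along_axis(logprobs[:, :-1, :], targets[..., None], axis=-1)
    total_nll += nll.sum().item()
    total_tokens += targets.size

print(f"PPL: {math.exp(total_nll / total_tokens):.2f}")
```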
## Conversion details (provenance)
```bash
python -m mlx_lm convert \
--hf-path openai/gpt-oss-20b \
--mlx-path gpt-oss-20b-mlx-q6-gs32 \
--q-bits 6 --q-group-size 32 -q
```
- Some non-expert tensors (embeddings, norms, router) remain FP16.
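To see which tensors stayed in FP16 after conversion, you can inspect the converted shards directly. A minimal sketch, assuming the usual `model*.safetensors` shard layout in the output directory:

```python
# List quantized vs. fp16/bf16 tensors in the converted output (sketch).
import glob
import mlx.core as mx

quantized, fp16 = [], []
for shard in sorted(glob.glob("gpt-oss-20b-mlx-q6-gs32/model*.safetensors")):
    for name, arr in mx.load(shard).items():
        if name.endswith(".scales"):        # companion tensor of a quantized weight
            quantized.append(name.removesuffix(".scales"))
        elif name.endswith(".biases"):
            continue                        # also a quantization companion tensor
        elif arr.dtype in (mx.float16, mx.bfloat16):
            fp16.append(name)

print(f"{len(quantized)} quantized weights, e.g. {quantized[:3]}")
print(f"{len(fp16)} fp16/bf16 tensors, e.g. {fp16[:3]}")
```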
## Sibling & reference models
- halley-ai/gpt-oss-20b-MLX-5bit-gs32
- halley-ai/gpt-oss-20b-MLX-4bit-gs32
- Reference (8-bit, upstream): lmstudio-community/gpt-oss-20b-MLX-8bit
## Limitations & biases
Outputs may be factually wrong or unsafe. Don’t use for medical, legal, or financial decisions without human review.
MoE models can be sensitive to prompt wording; prefer explicit instructions and structure.
## License & credits
- License: Apache-2.0 (inherits from base model)
- Base model: OpenAI gpt-oss-20B
- Quantization: Halley AI Lab (MLX Q6, gs=32)
- Please cite both the base model and this repository when you use the weights.