---
library_name: mlx
pipeline_tag: text-generation
inference: false         # MLX is macOS-only; HF Inference API won't run it
license: apache-2.0
base_model: openai/gpt-oss-20b
base_model_relation: quantized
language:
- en
- ro
tags:
- apple-silicon
- metal
- arm64
- 6-bit
- group-size-32
- moe
- mpx4
- openai
- halley-ai
---
# gpt-oss-20b — MLX 6-bit (group size 32)

**Summary.** This is a 6-bit (**Q6**) **MLX** quantization of **gpt-oss-20B** (sparse Mixture-of-Experts, MXFP4 upstream). Group size is **32**.  
Built for **Apple Silicon** with Metal acceleration.

- **Base model:** `openai/gpt-oss-20b` (Apache-2.0)
- **Quantization:** MLX Q6, `q_group_size=32` (some tensors remain FP16 for stability)
- **Files:** MLX weight shards + `config.json`; tokenizer files included for drop-in use
- **Footprint:** ~**18.38 GB** on disk
- **Intended use:** local inference / research on M-series Macs
- **Not intended for:** safety-critical decisions; outputs may be inaccurate or biased

## Requirements
**Runs on:** Apple Silicon (M1 or newer) with **macOS ≥ 13.5** via **MLX (Metal)**.  
**Not supported:** Intel macOS / Linux / Windows (use a GGUF build + llama.cpp instead).  
**RAM guidance:** 32 GB unified memory minimum for Q6 (gs=32); on a 24 GB Mac the ~18.4 GB of weights plus KV cache exceed the GPU working-set limit, so it **won’t run**. Extra RAM improves headroom.
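
A quick way to check the guidance above before downloading ~18 GB of weights (a sketch; macOS-only, reads total unified memory via `sysctl`):

```python
# Check unified memory against the 32 GB guidance (sketch; macOS-only).
import subprocess

ram_gib = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"])) / 2**30
status = "meets the 32 GB guidance" if ram_gib >= 32 else "below the 32 GB guidance"
print(f"Unified memory: {ram_gib:.0f} GiB ({status})")
```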

## How to use (MLX)

```bash
pip install mlx-lm transformers
```

```python
# Python API (uses tokenizer bundled with this repo)
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute π.",
    max_tokens=256, max_kv_size=512
))
```
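
gpt-oss-20b is an instruction-tuned model, so longer interactions usually work better when the prompt is rendered through the bundled chat template. A minimal sketch of that pattern (the sampling budget is illustrative, not a tuned recommendation):

```python
# Chat-style prompting via the tokenizer's chat template (sketch).
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")

messages = [
    {"role": "user", "content": "Explain the Chudnovsky algorithm to compute π."},
]
# tokenize=False keeps the rendered conversation as a plain string for generate().
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```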

## Performance (Apple Silicon, real-world)

LM Studio / CLI (MLX, Q6, gs=32): ~49–55 tok/s with TTFB ~0.35–0.45 s on ≈2k-token responses, measured on an M1 Max (32 GB). Short fixed-length runs show lower tok/s because startup overhead dominates.
Throughput varies with Mac model, context length, and sampler settings.
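
To get a comparable number on your own machine, `mlx_lm`'s `generate` can print throughput and peak memory when `verbose=True`; a minimal sketch (prompt and token budget are arbitrary):

```python
# Rough local throughput check (sketch; results depend on your Mac,
# context length, and sampler settings).
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
generate(
    model, tokenizer,
    prompt="Write a short essay on the history of π computation.",
    max_tokens=2048,
    verbose=True,  # prints prompt/generation tok/s and peak memory
)
```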

## Evaluation

Perplexity (PPL) streaming evaluation on WikiText-2; window=stride=4096, ~100k tokens, EOS inserted between docs.
<table>
  <thead>
    <tr><th>Variant</th><th>PPL (ctx=4096)</th></tr>
  </thead>
  <tbody>
    <tr><td>MLX 8-bit (gs=64, reference)</td><td>10.75</td></tr>
    <tr><td><strong>MLX 6-bit (gs=32)</strong></td><td><strong>10.46 (−2.7% vs 8-bit/gs64)</strong></td></tr>
    <tr><td>MLX 5-bit (gs=32)</td><td>11.11 (+3.3% vs 8-bit/gs64, +6.2% vs 6-bit/gs32)</td></tr>
    <tr><td>MLX 4-bit (gs=32)</td><td>13.70 (+27.4% vs 8-bit/gs64, +31.0% vs 6-bit/gs32)</td></tr>
  </tbody>
</table>

**Interpretation**
- MLX 6-bit/gs32: Best of the group; edges out 8-bit/gs64 slightly at a smaller footprint.
- MLX 5-bit/gs32: Small, consistent drop vs 6-bit/gs32 and 8-bit/gs64 (~3–6% PPL); strong “fits-16GB” option when GPU buffer limits matter.
- MLX 8-bit/gs64: Solid reference; near‑FP16 quality at a larger footprint.
- MLX 4-bit/gs32: Trades accuracy for footprint; use when RAM is constrained or throughput is the priority.
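
The sketch below shows one way to approximate the streaming PPL setup described above with `mlx_lm` and Hugging Face `datasets`; the dataset split, token budget, and loss accumulation details are assumptions, not the exact harness used for the table.

```python
# Approximate the streaming PPL evaluation (window = stride = 4096,
# EOS between documents, ~100k tokens). Sketch only; not the exact harness.
import math

import mlx.core as mx
import mlx.nn as nn
from datasets import load_dataset
from mlx_lm import load

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-6bit-gs32")
texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]

# Concatenate documents with EOS separators, then cap the token budget.
ids = []
for doc in texts:
    if doc.strip():
        ids.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])
ids = ids[:100_000]

window = 4096
total_nll, total_tokens = 0.0, 0
for start in range(0, len(ids) - 1, window):
    chunk = ids[start:start + window + 1]  # +1 so targets are shifted by one
    if len(chunk) < 2:
        break
    inputs = mx.array(chunk[:-1])[None]
    targets = mx.array(chunk[1:])[None]
    logits = model(inputs)
    nll = nn.losses.cross_entropy(logits.astype(mx.float32), targets, reduction="sum")
    total_nll += nll.item()
    total_tokens += targets.size

print(f"PPL (ctx={window}): {math.exp(total_nll / total_tokens):.2f}")
```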

## Conversion details (provenance)

```bash
python -m mlx_lm convert \
  --hf-path openai/gpt-oss-20b \
  --mlx-path gpt-oss-20b-mlx-q6-gs32 \
  --q-bits 6 --q-group-size 32 -q
```

- Some non-expert tensors (embeddings, norms, router) remain FP16.
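
If you want to confirm which tensors stayed in FP16 versus packed quantized storage, the safetensors headers carry per-tensor dtypes; a minimal sketch, assuming the converted folder from the command above sits next to the script:

```python
# List storage dtypes straight from the safetensors headers (sketch; the
# local path is an assumption, point it at your converted or downloaded copy).
import json
import struct
from collections import Counter
from glob import glob

dtype_counts = Counter()
for shard in sorted(glob("gpt-oss-20b-mlx-q6-gs32/*.safetensors")):
    with open(shard, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]  # 8-byte little-endian header size
        header = json.loads(f.read(header_len))
    for name, info in header.items():
        if name == "__metadata__":
            continue
        # Quantized MLX weights are stored packed (uint32) next to FP16
        # scales/biases; unquantized tensors keep a floating-point dtype.
        dtype_counts[info["dtype"]] += 1

print(dict(dtype_counts))
```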

## Sibling & reference models
- halley-ai/gpt-oss-20b-MLX-5bit-gs32
- halley-ai/gpt-oss-20b-MLX-4bit-gs32
- Reference (8-bit, upstream): lmstudio-community/gpt-oss-20b-MLX-8bit

## Limitations & biases

Outputs may be factually wrong or unsafe. Don’t use for medical, legal, or financial decisions without human review.
MoE models can be sensitive to prompt wording; prefer explicit instructions and structure.

## License & credits
- License: Apache-2.0 (inherits from base model)
- Base model: OpenAI gpt-oss-20B
- Quantization: Halley AI Lab (MLX Q6, gs=32)
- Please cite both the base model and this repository when you use the weights.