---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
tags:
- mlx
- gpt
- openai
- chatGPT
---


# OpenAI GPT OSS 20B - MLX 4-bit

## Model Description

This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. It was converted from the original `gpt_oss` architecture to MLX format with the development version of `mlx-lm` (v0.26.3).

> [!TIP]
> For best results with tool calling and reasoning, update LM Studio to the latest version (0.3.22).

## Quick Start

### Installation

```bash
pip install mlx-lm
```

### Usage

```python
from mlx_lm import load, generate

# Load the quantized model
model, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")  # local path to the converted model

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```
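
### Chat-Formatted Prompting

The model ships with a chat template (`chat_template.jinja`, listed in the file structure below), which can be applied through the tokenizer's standard chat-template API. A minimal sketch:

```python
from mlx_lm import load, generate

model, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")

# Format the conversation with the bundled chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```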

## Technical Specifications

### Model Architecture
- **Model Type**: `gpt_oss` (GPT Open Source)
- **Architecture**: `GptOssForCausalLM`
- **Parameters**: ~20 billion parameters
- **Quantization**: 4-bit precision (4.504 bits per weight)

### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |
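
These values can be cross-checked against the shipped `config.json`; a quick sketch, assuming the keys follow the usual Hugging Face naming for this architecture:

```python
import json

# Inspect the configuration shipped with the converted model
with open("path/to/gpt-oss-20b-MLX-4bit/config.json") as f:
    cfg = json.load(f)

for key in ("hidden_size", "intermediate_size", "num_hidden_layers",
            "num_attention_heads", "num_key_value_heads", "head_dim",
            "vocab_size"):
    print(key, "=", cfg.get(key))
```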

### Mixture of Experts (MoE) Configuration
- **Number of Local Experts**: 32
- **Experts per Token**: 4
- **Router Auxiliary Loss Coefficient**: 0.9
- **SwiGLU Limit**: 7.0
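
With this configuration, a router scores all 32 experts for each token and only the 4 highest-scoring experts are evaluated, their outputs mixed by softmaxed router scores. A minimal NumPy sketch of that routing step (illustrative only, not the MLX kernel):

```python
import numpy as np

NUM_EXPERTS, TOP_K, HIDDEN = 32, 4, 2880

def route(x, router_weight):
    """Score all experts for one token, keep the top 4, softmax their scores."""
    logits = router_weight @ x                   # (32,) one score per expert
    top = np.argsort(logits)[-TOP_K:]            # indices of the 4 best experts
    w = np.exp(logits[top] - logits[top].max())  # softmax over the selected scores
    return top, w / w.sum()

rng = np.random.default_rng(0)
experts, weights = route(rng.standard_normal(HIDDEN),
                         rng.standard_normal((NUM_EXPERTS, HIDDEN)))
print(experts, weights.round(3))  # 4 expert ids and their mixing weights
```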

### Attention Mechanism
- **Attention Type**: Hybrid sliding window and full attention
- **Sliding Window Size**: 128 tokens
- **Max Position Embeddings**: 131,072 tokens
- **Initial Context Length**: 4,096 tokens
- **Attention Pattern**: Alternating sliding and full attention layers
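
In a sliding-window layer, each query attends only to the most recent 128 positions, while a full-attention layer sees the entire prefix. A small sketch of the two causal masks (an illustration of the mechanism, not the MLX implementation):

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """True where query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                  # causal: never attend to future tokens
    if window is not None:
        mask &= j > i - window     # sliding window: only the last `window` keys
    return mask

full_mask = causal_mask(8)               # full-attention layers
sliding_mask = causal_mask(8, window=4)  # window=128 in the real model
```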

### RoPE (Rotary Position Embedding) Configuration
- **RoPE Theta**: 150,000
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN)
- **Scaling Factor**: 32.0
- **Beta Fast**: 32.0
- **Beta Slow**: 1.0
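
YaRN extends the context by interpolating the low-frequency RoPE dimensions by the scaling factor while leaving the high-frequency dimensions (which carry local positional detail) near their original values, blended by a linear ramp controlled by beta fast/slow. A simplified sketch following the common open-source YaRN formulation (the attention-temperature adjustment is omitted, and the exact MLX code may differ):

```python
import math
import numpy as np

DIM, BASE, FACTOR = 64, 150_000.0, 32.0
BETA_FAST, BETA_SLOW = 32.0, 1.0
ORIGINAL_CTX = 4096

def correction_dim(num_rotations):
    # Dimension index whose frequency completes `num_rotations` turns over ORIGINAL_CTX
    return (DIM * math.log(ORIGINAL_CTX / (num_rotations * 2 * math.pi))
            / (2 * math.log(BASE)))

low = max(math.floor(correction_dim(BETA_FAST)), 0)
high = min(math.ceil(correction_dim(BETA_SLOW)), DIM // 2 - 1)

inv_freq = BASE ** (-np.arange(0, DIM, 2) / DIM)  # original RoPE frequencies
ramp = np.clip((np.arange(DIM // 2) - low) / max(high - low, 1e-3), 0.0, 1.0)

# High-frequency dims (ramp=0) keep their original frequency;
# low-frequency dims (ramp=1) are interpolated by the scaling factor.
inv_freq_yarn = inv_freq * (1 - ramp) + (inv_freq / FACTOR) * ramp
```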

### Quantization Details
- **Quantization Method**: MLX 4-bit quantization
- **Group Size**: 64
- **Effective Bits per Weight**: 4.504
- **Size Reduction**: 13GB → 11GB (~15% reduction)
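
The effective 4.504 bits per weight follows from the group-wise scheme: each group of 64 four-bit weights also carries dequantization parameters. A back-of-the-envelope check, assuming one 16-bit scale and one 16-bit bias per group:

```python
# Per group: 64 four-bit values plus (assumed) one fp16 scale and one fp16 bias
bits_per_group = 64 * 4 + 16 + 16
print(bits_per_group / 64)  # 4.5 bits/weight; the reported 4.504 adds a small
                            # overhead from parameters kept at higher precision
```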

### File Structure
```
gpt-oss-20b-MLX-4bit/
├── config.json                        # Model configuration
├── model-00001-of-00003.safetensors   # Model weights (part 1)
├── model-00002-of-00003.safetensors   # Model weights (part 2)
├── model-00003-of-00003.safetensors   # Model weights (part 3)
├── model.safetensors.index.json       # Model sharding index
├── tokenizer.json                     # Tokenizer configuration
├── tokenizer_config.json              # Tokenizer settings
├── special_tokens_map.json            # Special tokens mapping
├── generation_config.json             # Generation parameters
└── chat_template.jinja                # Chat template
```

## Performance Characteristics

### Hardware Requirements
- **Platform**: Apple Silicon (M1, M2, M3, M4 series)
- **Memory**: ~11GB for model weights
- **Recommended RAM**: 16GB+ for optimal performance

### Layer Configuration
The model uses an alternating attention pattern across its 24 layers:
- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens)
- **Odd layers (1, 3, 5, ...)**: Full attention
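
Expressed as the per-layer list that this architecture's configuration typically carries (the string names are assumed):

```python
# Alternating attention types across the 24 layers, matching the pattern above
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(24)
]
```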

## Training Details

### Tokenizer
- **Type**: BPE tokenizer (`o200k_harmony`) with a 201,088-token vocabulary
- **Special Tokens**:
  - EOS Token ID: 200002
  - Pad Token ID: 199999
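
These IDs can be spot-checked from the loaded tokenizer; a small sketch (whether `pad_token_id` is exposed directly depends on which config file sets it):

```python
from mlx_lm import load

_, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")
print(tokenizer.eos_token_id)  # expected: 200002
print(tokenizer.pad_token_id)  # expected: 199999, if set by the tokenizer config
```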

### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-05)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Attention Bias**: Enabled
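
RMSNorm rescales activations by their root-mean-square rather than centering and scaling as LayerNorm does. A minimal sketch using the ε above:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    """Normalize by the root-mean-square only; unlike LayerNorm, no mean subtraction."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight
```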

## Conversion Process

This model was converted using:
- **MLX-LM Version**: 0.26.3 (development branch)
- **Conversion Command**: 
  ```bash
  python3 -m mlx_lm convert \
    --hf-path "/path/to/openai-gpt-oss-20b" \
    --mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
    --quantize \
    --q-bits 4
  ```
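
Note that `mlx-lm`'s default quantization group size is 64, matching the group size reported above, so no explicit `--q-group-size` flag is needed.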

## Known Limitations

1. **Architecture Specificity**: This model uses the `gpt_oss` architecture which is only supported in MLX-LM v0.26.3+
2. **Platform Dependency**: Optimized specifically for Apple Silicon; may not run on other platforms
3. **Quantization Trade-offs**: 4-bit quantization may result in slight quality degradation compared to full precision

## Compatibility

- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support
- **Apple Silicon**: M1, M2, M3, M4 series processors
- **macOS**: Compatible with recent macOS versions supporting MLX

## License

The original OpenAI GPT OSS 20B model is released under the Apache 2.0 license (reflected in the metadata above), and this quantized conversion inherits those terms. Please refer to the original model card for the full license text.

## Acknowledgments

- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization achieved using `mlx-lm` development tools

---

**Model Size**: 11GB  
**Quantization**: 4-bit (4.504 bits/weight)  
**Created**: August 6, 2025  
**MLX-LM Version**: 0.26.3 (development)