Update README.md

---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
---

# OpenAI GPT OSS 20B - MLX 4-bit

## Model Description

This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. The model was converted from the original `gpt_oss` architecture to MLX format with the development version of `mlx-lm` (v0.26.3).

## Quick Start

### Installation

```bash
pip install mlx-lm
```

### Usage

```python
from mlx_lm import load, generate

# Load the quantized model (point this at your local copy of the weights)
model, tokenizer = load("/path/to/gpt-oss-20b-MLX-4bit")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```
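
For chat-style prompts, the bundled `chat_template.jinja` (listed under File Structure below) can be applied through the tokenizer. A minimal sketch, continuing from the snippet above and assuming the standard Hugging Face `apply_chat_template` API that `mlx-lm`'s tokenizer wrapper forwards to:

```python
# Chat-style prompting via the bundled chat template (continues the snippet
# above; `apply_chat_template` is delegated to the underlying Hugging Face
# tokenizer, so this assumes that API is available).
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(generate(model, tokenizer, prompt=chat_prompt, max_tokens=512))
```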

## Technical Specifications

### Model Architecture
- **Model Type**: `gpt_oss` (GPT Open Source)
- **Architecture**: `GptOssForCausalLM`
- **Parameters**: ~20 billion
- **Quantization**: 4-bit precision (4.504 bits per weight)

### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |

### Mixture of Experts (MoE) Configuration
- **Number of Local Experts**: 32
- **Experts per Token**: 4 (top-4 routing; see the sketch below)
- **Router Auxiliary Loss Coefficient**: 0.9
- **SwiGLU Limit**: 7.0
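
To make those numbers concrete, here is a hypothetical sketch of top-4-of-32 expert selection (illustrative only; not the actual `mlx-lm` implementation):

```python
import mlx.core as mx

# Hypothetical top-4-of-32 routing: keep the four best-scoring experts for
# one token and mix their outputs with softmax-normalized gate weights.
num_experts, experts_per_token = 32, 4
router_logits = mx.random.normal((1, num_experts))  # router scores for one token

top_k = mx.argsort(router_logits, axis=-1)[..., -experts_per_token:]
gates = mx.softmax(mx.take_along_axis(router_logits, top_k, axis=-1), axis=-1)
print(top_k.tolist(), gates.tolist())  # chosen expert ids and mixing weights
```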

### Attention Mechanism
- **Attention Type**: Hybrid sliding window and full attention
- **Sliding Window Size**: 128 tokens
- **Max Position Embeddings**: 131,072 tokens
- **Initial Context Length**: 4,096 tokens
- **Attention Pattern**: Alternating sliding and full attention layers

### RoPE (Rotary Position Embedding) Configuration
- **RoPE Theta**: 150,000
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN)
- **Scaling Factor**: 32.0 (see the check below)
- **Beta Fast**: 32.0
- **Beta Slow**: 1.0
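
These figures are mutually consistent; a one-line check:

```python
# The YaRN scaling factor stretches the initial context to the full range.
initial_context = 4_096   # initial context length (tokens)
yarn_factor = 32.0        # YaRN scaling factor
print(int(initial_context * yarn_factor))  # 131072 == max position embeddings
```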

### Quantization Details
- **Quantization Method**: MLX 4-bit quantization
- **Group Size**: 64
- **Effective Bits per Weight**: 4.504 (see the estimate below)
- **Size Reduction**: 13 GB → 11 GB (~15% reduction)
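
A rough back-of-the-envelope for the effective bits-per-weight figure, assuming MLX's affine quantization stores a float16 scale and bias per group of 64 weights (the small remainder comes from tensors left unquantized):

```python
# Approximate accounting for the 4.504 bits/weight reported above
# (assumes an fp16 scale + fp16 bias per 64-weight group).
bits, group_size = 4, 64
group_overhead = 2 * 16 / group_size  # two fp16 values amortized per group
print(bits + group_overhead)          # 4.5, close to the reported 4.504
```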

### File Structure
```
gpt-oss-20b-MLX-4bit/
├── config.json                        # Model configuration
├── model-00001-of-00003.safetensors   # Model weights (part 1)
├── model-00002-of-00003.safetensors   # Model weights (part 2)
├── model-00003-of-00003.safetensors   # Model weights (part 3)
├── model.safetensors.index.json       # Model sharding index
├── tokenizer.json                     # Tokenizer configuration
├── tokenizer_config.json              # Tokenizer settings
├── special_tokens_map.json            # Special tokens mapping
├── generation_config.json             # Generation parameters
└── chat_template.jinja                # Chat template
```

## Performance Characteristics

### Hardware Requirements
- **Platform**: Apple Silicon (M1, M2, M3, M4 series)
- **Memory**: ~11 GB for model weights
- **Recommended RAM**: 16 GB+ for optimal performance

### Layer Configuration
The model uses an alternating attention pattern across its 24 layers, as sketched below:
- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens)
- **Odd layers (1, 3, 5, ...)**: Full attention
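
A minimal sketch of that schedule (illustrative; the authoritative list is `layer_types` in `config.json`):

```python
# Alternating attention schedule across the 24 layers described above.
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(24)
]
print(layer_types[:2])  # ['sliding_attention', 'full_attention']
```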

## Training Details

### Tokenizer
- **Type**: Custom tokenizer with a 201,088-token vocabulary
- **Special Tokens**:
  - EOS Token ID: 200,002
  - Pad Token ID: 199,999
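
These IDs can be confirmed from the loaded tokenizer (a sketch; `eos_token_id` and `pad_token_id` are assumed to be forwarded to the underlying Hugging Face tokenizer):

```python
# Verify the special-token IDs reported above.
from mlx_lm import load

_, tokenizer = load("/path/to/gpt-oss-20b-MLX-4bit")
print(tokenizer.eos_token_id)  # expected: 200002
print(tokenizer.pad_token_id)  # expected: 199999
```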

### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-05)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Attention Bias**: Enabled

## Conversion Process

This model was converted using:
- **MLX-LM Version**: 0.26.3 (development branch)
- **Conversion Command**:
```bash
python3 -m mlx_lm convert \
    --hf-path "/path/to/openai-gpt-oss-20b" \
    --mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
    --quantize \
    --q-bits 4
```
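
The same conversion can also be driven from Python; a sketch using the `mlx_lm.convert` API with the parameters above (`q_group_size` added explicitly to match the reported group size):

```python
# Python-API equivalent of the CLI command above (a sketch).
from mlx_lm import convert

convert(
    hf_path="/path/to/openai-gpt-oss-20b",
    mlx_path="/path/to/gpt-oss-20b-MLX-4bit",
    quantize=True,
    q_bits=4,         # 4-bit weights
    q_group_size=64,  # matches the reported group size
)
```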

## Known Limitations

1. **Architecture Specificity**: This model uses the `gpt_oss` architecture, which is only supported in MLX-LM v0.26.3+
2. **Platform Dependency**: Optimized specifically for Apple Silicon; it may not run on other platforms
3. **Quantization Trade-offs**: 4-bit quantization may cause slight quality degradation compared to full precision

## Compatibility

- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support
- **Apple Silicon**: M1, M2, M3, M4 series processors
- **macOS**: Recent versions with MLX support

## License

Apache 2.0 (see the metadata above); please refer to the original OpenAI GPT OSS 20B model's license terms.

## Acknowledgments

- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization performed with the `mlx-lm` development tools

---

**Model Size**: 11 GB
**Quantization**: 4-bit (4.504 bits/weight)
**Created**: August 6, 2025
**MLX-LM Version**: 0.26.3 (development)