InferenceIllusionist committed
Commit ee982f8 · verified · 1 Parent(s): 98d6191

Update README.md

Files changed (1):
  1. README.md +163 -3
README.md CHANGED
@@ -1,3 +1,163 @@

Removed (previous README front matter):

---
license: apache-2.0
---

Added (new README):

---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
---

# OpenAI GPT OSS 20B - MLX 4-bit

## Model Description

This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. It was converted from the original `gpt_oss` architecture to MLX format with the development version of `mlx-lm` (v0.26.3).

## Quick Start

### Installation

```bash
pip install mlx-lm
```

### Usage

```python
from mlx_lm import load, generate

# Load the quantized model from a local directory or a Hugging Face repo id
model, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```

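Since a `chat_template.jinja` ships with the weights, chat-style prompts can also be rendered through the tokenizer's chat template before calling `generate`. This is a minimal sketch, assuming the tokenizer returned by `mlx-lm` proxies the Hugging Face `apply_chat_template` API; the path is a placeholder:

```python
from mlx_lm import load, generate

# Placeholder path: a local copy of this repo or its Hugging Face repo id
model, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")

# Render the bundled chat template into a single prompt string
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```
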
## Technical Specifications

### Model Architecture
- **Model Type**: `gpt_oss` (GPT Open Source)
- **Architecture**: `GptOssForCausalLM`
- **Parameters**: ~20 billion
- **Quantization**: 4-bit precision (4.504 bits per weight)

### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |

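The head counts above imply grouped-query attention: 64 query heads share 8 key-value heads, i.e. 8 query heads per KV head. As a rough worked example (assuming a 16-bit KV cache; the exact footprint depends on the runtime), the cache size at a given context length follows directly from these numbers:

```python
# Back-of-the-envelope KV-cache size from the parameters above
layers, kv_heads, head_dim = 24, 8, 64
bytes_per_value = 2          # assuming fp16/bf16 cache entries
context = 8192               # example context length

kv_bytes = 2 * layers * kv_heads * head_dim * context * bytes_per_value  # K and V
print(f"~{kv_bytes / 2**20:.0f} MiB of KV cache at {context} tokens")
# 2 * 24 * 8 * 64 * 8192 * 2 bytes = 384 MiB
```
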
### Mixture of Experts (MoE) Configuration
- **Number of Local Experts**: 32
- **Experts per Token**: 4
- **Router Auxiliary Loss Coefficient**: 0.9
- **SwiGLU Limit**: 7.0

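As a loose illustration of this configuration (not the model's actual router code), each token's hidden state is scored against all 32 experts and only the 4 highest-scoring experts are kept, with their gate weights renormalised:

```python
import numpy as np

# Toy top-k router matching the card's numbers: 32 experts, 4 active per token
num_experts, experts_per_token, hidden = 32, 4, 2880

rng = np.random.default_rng(0)
x = rng.standard_normal(hidden)                         # one token's hidden state
router_w = rng.standard_normal((num_experts, hidden)) * 0.02

logits = router_w @ x                                   # one score per expert
top = np.argsort(logits)[-experts_per_token:]           # indices of the 4 best experts
weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalised gate weights

print("selected experts:", sorted(top.tolist()))
print("gate weights sum to", weights.sum())             # ~1.0
```
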
### Attention Mechanism
- **Attention Type**: Hybrid sliding-window and full attention
- **Sliding Window Size**: 128 tokens
- **Max Position Embeddings**: 131,072 tokens
- **Initial Context Length**: 4,096 tokens
- **Attention Pattern**: Alternating sliding-window and full-attention layers

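To make the hybrid scheme concrete, the sketch below contrasts a full causal mask with a 128-token sliding-window causal mask; it is illustrative only, not how MLX constructs its masks internally:

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """True where attention is allowed; `window` limits how far back a query can look."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    mask = k <= q                      # causal: keys up to the current position
    if window is not None:
        mask &= (q - k) < window       # sliding window: only the last `window` keys
    return mask

full = causal_mask(256)                 # full-attention layers
sliding = causal_mask(256, window=128)  # sliding-window layers
print(full.sum(), sliding.sum())        # the sliding mask allows far fewer key positions
```
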
### RoPE (Rotary Position Embedding) Configuration
- **RoPE Theta**: 150,000
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN)
- **Scaling Factor**: 32.0
- **Beta Fast**: 32.0
- **Beta Slow**: 1.0

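These values are mutually consistent: YaRN stretches the initial 4,096-token context by the 32x scaling factor to reach the 131,072-token maximum. A one-line sanity check:

```python
# Extended context = initial context length * YaRN scaling factor
original_context, yarn_factor = 4096, 32.0
assert int(original_context * yarn_factor) == 131_072  # matches max position embeddings
```
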
### Quantization Details
- **Quantization Method**: MLX 4-bit quantization
- **Group Size**: 64
- **Effective Bits per Weight**: 4.504
- **Size Reduction**: 13GB → 11GB (~15% reduction)

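The 4.504 effective bits per weight is roughly what group quantization predicts: each group of 64 weights stores 4-bit values plus a per-group scale and bias. Assuming 16-bit scale and bias (an assumption about the MLX format, not stated in this card), that adds 32 extra bits per 64 weights, and the small remainder comes from tensors kept at higher precision:

```python
# Rough per-weight cost of 4-bit affine group quantization
bits, group_size = 4, 64
overhead_bits_per_group = 16 + 16          # one scale and one bias per group (assumed 16-bit)
effective = bits + overhead_bits_per_group / group_size
print(effective)   # 4.5 bits/weight; the reported 4.504 also reflects unquantized tensors
```
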
### File Structure
```
gpt-oss-20b-MLX-4bit/
├── config.json                        # Model configuration
├── model-00001-of-00003.safetensors   # Model weights (part 1)
├── model-00002-of-00003.safetensors   # Model weights (part 2)
├── model-00003-of-00003.safetensors   # Model weights (part 3)
├── model.safetensors.index.json       # Model sharding index
├── tokenizer.json                     # Tokenizer configuration
├── tokenizer_config.json              # Tokenizer settings
├── special_tokens_map.json            # Special tokens mapping
├── generation_config.json             # Generation parameters
└── chat_template.jinja                # Chat template
```

## Performance Characteristics

### Hardware Requirements
- **Platform**: Apple Silicon (M1, M2, M3, M4 series)
- **Memory**: ~11GB for model weights
- **Recommended RAM**: 16GB+ for optimal performance

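Before loading on a memory-constrained machine, the weight footprint can be estimated from the safetensors shards on disk; a standard-library sketch with a placeholder path:

```python
from pathlib import Path

# Placeholder path: wherever the downloaded shards live
model_dir = Path("path/to/gpt-oss-20b-MLX-4bit")

shard_bytes = sum(f.stat().st_size for f in model_dir.glob("*.safetensors"))
print(f"weights on disk: {shard_bytes / 2**30:.1f} GiB "
      f"(expect roughly this much unified memory for the weights alone)")
```
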
### Layer Configuration
The model uses an alternating attention pattern across its 24 layers:
- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens)
- **Odd layers (1, 3, 5, ...)**: Full attention

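Expressed as code, the alternation corresponds to a per-layer type list of the kind `gpt_oss` configs carry (the `sliding_attention`/`full_attention` names are an assumption about the config schema):

```python
# Even layers use the 128-token sliding window, odd layers attend to the full context
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(24)
]
print(layer_types[:4])  # ['sliding_attention', 'full_attention', 'sliding_attention', 'full_attention']
```
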
## Training Details

### Tokenizer
- **Type**: Custom tokenizer with a 201,088-token vocabulary
- **Special Tokens**:
  - EOS Token ID: 200,002
  - Pad Token ID: 199,999

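The IDs and vocabulary size can be checked against the bundled tokenizer files; a small sketch, assuming the `transformers` tokenizer API is available and using a placeholder path:

```python
from transformers import AutoTokenizer

# Placeholder path: a local copy of this repo or its Hugging Face repo id
tok = AutoTokenizer.from_pretrained("path/to/gpt-oss-20b-MLX-4bit")

print("vocab size:", len(tok))          # compare against the value listed above
print("eos token id:", tok.eos_token_id)
print("pad token id:", tok.pad_token_id)
```
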
### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-05)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Attention Bias**: Enabled

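For reference, both pieces are simple to write down; a NumPy sketch of RMSNorm and SiLU (illustrative, not the model's own code):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: divide by the root-mean-square of the features, then apply a learned gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + np.exp(-x)))

x = np.random.randn(4, 2880).astype(np.float32)
w = np.ones(2880, dtype=np.float32)
print(rms_norm(x, w).shape, silu(x).shape)
```
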
## Conversion Process

This model was converted with:
- **MLX-LM Version**: 0.26.3 (development branch)
- **Conversion Command**:
```bash
python3 -m mlx_lm convert \
    --hf-path "/path/to/openai-gpt-oss-20b" \
    --mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
    --quantize \
    --q-bits 4
```

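The conversion can presumably also be driven from Python. This sketch assumes `mlx-lm` exposes a `convert()` function with these keyword arguments; check the signature of the installed version before relying on it:

```python
from mlx_lm import convert  # in some versions: from mlx_lm.convert import convert

# Paths are placeholders, mirroring the CLI invocation above
convert(
    hf_path="/path/to/openai-gpt-oss-20b",
    mlx_path="/path/to/gpt-oss-20b-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,   # the group size reported in this card
)
```
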
## Known Limitations

1. **Architecture Specificity**: This model uses the `gpt_oss` architecture, which is supported only in MLX-LM v0.26.3+
2. **Platform Dependency**: Optimized specifically for Apple Silicon; may not run on other platforms
3. **Quantization Trade-offs**: 4-bit quantization may result in slight quality degradation compared to full precision

## Compatibility

- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support
- **Apple Silicon**: M1, M2, M3, M4 series processors
- **macOS**: Compatible with recent macOS versions supporting MLX

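A quick, standard-library way to confirm the installed `mlx-lm` meets this version floor before attempting to load the model:

```python
from importlib.metadata import version

installed = version("mlx-lm")
print("mlx-lm", installed)
# gpt_oss support requires 0.26.3 or later; anything older needs an upgrade:
#   pip install -U mlx-lm
```
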
## License

Apache 2.0, following the original OpenAI GPT OSS 20B model; refer to the original model card for the full terms.

## Acknowledgments

- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization performed with `mlx-lm` development tools

---

**Model Size**: 11GB  
**Quantization**: 4-bit (4.504 bits/weight)  
**Created**: August 6, 2025  
**MLX-LM Version**: 0.26.3 (development)