---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
tags:
- mlx
- gpt
- openai
- chatGPT
---
# OpenAI GPT OSS 20B - MLX 4-bit
## Model Description
This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. It was converted from the original `gpt_oss` architecture to MLX format with the development version of `mlx-lm` (v0.26.3).
> [!TIP]
> For best results with tool calling and reasoning, update LM Studio to the latest version (0.3.22).
## Quick Start
### Installation
```bash
pip install mlx-lm
```
### Usage
```python
from mlx_lm import load, generate
# Load the quantized model
model, tokenizer = load("/path/to/gpt-oss-20b-MLX-4bit")  # local path or Hugging Face repo ID
# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```
## Technical Specifications
### Model Architecture
- **Model Type**: `gpt_oss` (GPT Open Source)
- **Architecture**: `GptOssForCausalLM`
- **Parameters**: ~20 billion parameters
- **Quantization**: 4-bit precision (4.504 bits per weight)
### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |
### Mixture of Experts (MoE) Configuration
- **Number of Local Experts**: 32
- **Experts per Token**: 4
- **Router Auxiliary Loss Coefficient**: 0.9
- **SwiGLU Limit**: 7.0
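The top-4-of-32 routing can be pictured with a small NumPy sketch. This is illustrative only; the shapes and the softmax-over-selected-experts normalization describe the general MoE pattern, not MLX's actual kernel:
```python
import numpy as np

def route_tokens(router_logits, top_k=4):
    """Pick the top-4 of 32 experts per token and renormalize their
    gate weights (illustrative sketch, not the MLX implementation)."""
    top_idx = np.argsort(router_logits, axis=-1)[:, -top_k:]         # best 4 experts per token
    top_logits = np.take_along_axis(router_logits, top_idx, axis=-1)
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)                       # softmax over the selected experts
    return top_idx, gates

logits = np.random.randn(3, 32)              # 3 tokens, 32 local experts
experts, weights = route_tokens(logits)
print(experts.shape, weights.sum(axis=-1))   # (3, 4), each row sums to 1.0
```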
### Attention Mechanism
- **Attention Type**: Hybrid sliding window and full attention
- **Sliding Window Size**: 128 tokens
- **Max Position Embeddings**: 131,072 tokens
- **Initial Context Length**: 4,096 tokens
- **Attention Pattern**: Alternating sliding and full attention layers
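The difference between the two layer types comes down to the attention mask. A minimal sketch for intuition (MLX constructs these masks internally):
```python
import numpy as np

def attention_mask(seq_len, window=None):
    """True where a query may attend to a key. window=128 mimics the
    sliding-window layers; window=None mimics full-attention layers."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    allowed = j <= i                  # causal: no attending to future tokens
    if window is not None:
        allowed &= (i - j) < window   # only the last `window` keys are visible
    return allowed

print(attention_mask(6, window=3).astype(int))  # banded causal mask
print(attention_mask(6).astype(int))            # full causal mask
```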
### RoPE (Rotary Position Embedding) Configuration
- **RoPE Theta**: 150,000
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN)
- **Scaling Factor**: 32.0
- **Beta Fast**: 32.0
- **Beta Slow**: 1.0
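The base rotary frequencies follow directly from theta and the head dimension (64). The sketch below computes only these base frequencies; the full YaRN interpolation between the `beta_fast` and `beta_slow` regimes is omitted:
```python
import numpy as np

# Base RoPE inverse frequencies for theta = 150,000 and head_dim = 64.
# YaRN then rescales each dimension's frequency to stretch the usable
# context by the 32x scaling factor; that step is omitted here.
head_dim, theta = 64, 150_000
inv_freq = 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))
angles = np.outer(np.arange(8), inv_freq)   # rotation angle per (position, dim pair)
print(angles.shape)                         # (8, 32): 8 positions, 32 frequency pairs
```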
### Quantization Details
- **Quantization Method**: MLX 4-bit quantization
- **Group Size**: 64
- **Effective Bits per Weight**: 4.504
- **Size Reduction**: 13GB → 11GB (~15% reduction)
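The ~0.5 bits above the nominal 4 come from per-group metadata: with a group size of 64, storing a 16-bit scale and a 16-bit bias per group adds 32/64 = 0.5 bits per weight (plus a little more for any tensors kept at higher precision). A toy version of group-wise affine quantization, not MLX's exact kernel:
```python
import numpy as np

def quantize_group(w, bits=4):
    """Affine-quantize one 64-weight group to 4-bit codes plus a
    per-group scale and zero point (toy version of the scheme)."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((w - lo) / scale).astype(np.uint8)   # values 0..15
    return codes, scale, lo

group = np.random.randn(64).astype(np.float32)            # one quantization group
codes, scale, zero = quantize_group(group)
reconstructed = codes * scale + zero
print(np.abs(group - reconstructed).max())                # worst-case error in this group
```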
### File Structure
```
gpt-oss-20b-MLX-4bit/
├── config.json # Model configuration
├── model-00001-of-00003.safetensors # Model weights (part 1)
├── model-00002-of-00003.safetensors # Model weights (part 2)
├── model-00003-of-00003.safetensors # Model weights (part 3)
├── model.safetensors.index.json # Model sharding index
├── tokenizer.json # Tokenizer configuration
├── tokenizer_config.json # Tokenizer settings
├── special_tokens_map.json # Special tokens mapping
├── generation_config.json # Generation parameters
└── chat_template.jinja # Chat template
```
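Because the package ships a `chat_template.jinja`, conversational prompts can be formatted through the standard `apply_chat_template` API. A sketch, using a placeholder path as in the Quick Start:
```python
from mlx_lm import load, generate

model, tokenizer = load("/path/to/gpt-oss-20b-MLX-4bit")  # local path or Hugging Face repo ID

# chat_template.jinja is applied through the standard tokenizer API
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```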
## Performance Characteristics
### Hardware Requirements
- **Platform**: Apple Silicon (M1, M2, M3, M4 series)
- **Memory**: ~11GB for model weights
- **Recommended RAM**: 16GB+ for optimal performance
### Layer Configuration
The model uses an alternating attention pattern across its 24 layers:
- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens)
- **Odd layers (1, 3, 5, ...)**: Full attention
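You can confirm this pattern from the shipped `config.json`; the sketch below assumes the `layer_types` key from the Hugging Face `gpt_oss` configuration and a placeholder path:
```python
import json

# Print the per-layer attention kind recorded in config.json.
with open("/path/to/gpt-oss-20b-MLX-4bit/config.json") as f:
    config = json.load(f)
for i, kind in enumerate(config.get("layer_types", [])):
    print(i, kind)   # alternates sliding_attention / full_attention
```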
## Training Details
### Tokenizer
- **Type**: Custom tokenizer with a 201,088-token vocabulary
- **Special Tokens**:
  - EOS Token ID: 200002
  - Pad Token ID: 199999
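A quick sanity check of these IDs, assuming the loaded tokenizer exposes the standard Hugging Face attributes:
```python
from mlx_lm import load

_, tokenizer = load("/path/to/gpt-oss-20b-MLX-4bit")  # placeholder path
print(tokenizer.eos_token_id)   # expected: 200002
print(tokenizer.pad_token_id)   # expected: 199999
```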
### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-05)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Attention Bias**: Enabled
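RMSNorm with the ε above is simple enough to write out. A reference sketch, not the MLX kernel:
```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    """Scale activations by their root mean square; unlike LayerNorm,
    there is no mean subtraction and no bias (reference sketch)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.randn(2, 2880).astype(np.float32)   # hidden size 2,880
w = np.ones(2880, dtype=np.float32)               # learned scale, initialized to 1
print(rms_norm(x, w).shape)                       # (2, 2880)
```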
## Conversion Process
This model was converted using:
- **MLX-LM Version**: 0.26.3 (development branch)
- **Conversion Command**:
```bash
python3 -m mlx_lm convert \
--hf-path "/path/to/openai-gpt-oss-20b" \
--mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
--quantize \
--q-bits 4
```
## Known Limitations
1. **Architecture Specificity**: This model uses the `gpt_oss` architecture which is only supported in MLX-LM v0.26.3+
2. **Platform Dependency**: Optimized specifically for Apple Silicon; may not run on other platforms
3. **Quantization Trade-offs**: 4-bit quantization may result in slight quality degradation compared to full precision
## Compatibility
- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support
- **Apple Silicon**: M1, M2, M3, M4 series processors
- **macOS**: Compatible with recent macOS versions supporting MLX
## License
Apache 2.0, per the original OpenAI GPT OSS 20B release; refer to the upstream model card for full terms.
## Acknowledgments
- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization achieved using `mlx-lm` development tools
---
- **Model Size**: 11GB
- **Quantization**: 4-bit (4.504 bits/weight)
- **Created**: August 6, 2025
- **MLX-LM Version**: 0.26.3 (development)