|
---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
tags:
- mlx
- gpt
- openai
- chatGPT
---
|
|
|
|
|
# OpenAI GPT OSS 20B - MLX 4-bit |
|
|
|
## Model Description |
|
|
|
This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. It was converted from the original `gpt_oss` architecture to MLX format with the development version of `mlx-lm` (v0.26.3).
|
|
|
> [!TIP]
> For best results with tool calling and reasoning, update LM Studio to the latest version (0.3.22).
|
|
|
## Quick Start |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install mlx-lm |
|
``` |
|
|
|
### Usage |
|
|
|
```python |
|
from mlx_lm import load, generate |
|
|
|
# Load the quantized model |
|
model, tokenizer = load("/path/to/gpt-oss-20b-MLX-4bit")
|
|
|
# Generate text |
|
prompt = "Explain quantum computing in simple terms:" |
|
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512) |
|
print(response) |
|
``` |
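Because the model ships with a chat template (`chat_template.jinja`), conversational prompts are best built through the tokenizer. A minimal sketch, assuming a recent `mlx-lm` whose `generate` accepts a pre-tokenized prompt:

```python
from mlx_lm import load, generate

model, tokenizer = load("/path/to/gpt-oss-20b-MLX-4bit")

# Render the conversation with the bundled chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```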
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture |
|
- **Model Type**: `gpt_oss` (GPT Open Source) |
|
- **Architecture**: `GptOssForCausalLM` |
|
- **Parameters**: ~20 billion
|
- **Quantization**: 4-bit precision (4.504 bits per weight) |
|
|
|
### Core Parameters |
|
| Parameter | Value | |
|
|-----------|-------| |
|
| Hidden Size | 2,880 | |
|
| Intermediate Size | 2,880 | |
|
| Number of Layers | 24 | |
|
| Attention Heads | 64 | |
|
| Key-Value Heads | 8 | |
|
| Head Dimension | 64 | |
|
| Vocabulary Size | 201,088 | |
|
|
|
### Mixture of Experts (MoE) Configuration |
|
- **Number of Local Experts**: 32 |
|
- **Experts per Token**: 4 (top-4 routing; see the sketch below)
|
- **Router Auxiliary Loss Coefficient**: 0.9 |
|
- **SwiGLU Limit**: 7.0 |
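As referenced above, a schematic NumPy sketch of top-4-of-32 routing (standard top-k MoE selection; names and shapes are illustrative, not the model's actual kernel):

```python
import numpy as np

def route(hidden, router_w, top_k=4):
    """Score all 32 experts per token, keep the best 4,
    and softmax-normalize their mixing weights."""
    logits = hidden @ router_w                     # (tokens, 32) expert scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the 4 best experts
    top_logits = np.take_along_axis(logits, top, axis=-1)
    w = np.exp(top_logits - top_logits.max(-1, keepdims=True))
    return top, w / w.sum(-1, keepdims=True)       # weights sum to 1 per token

experts, weights = route(np.random.randn(5, 2880), np.random.randn(2880, 32))
```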
|
|
|
### Attention Mechanism |
|
- **Attention Type**: Hybrid sliding window and full attention |
|
- **Sliding Window Size**: 128 tokens |
|
- **Max Position Embeddings**: 131,072 tokens |
|
- **Initial Context Length**: 4,096 tokens |
|
- **Attention Pattern**: Alternating sliding-window and full-attention layers (see the mask sketch below)
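The mask sketch referenced above, as a hypothetical helper (additive masks of the kind most attention implementations consume; not MLX's own code):

```python
import numpy as np

def causal_mask(seq_len, sliding_window=None):
    """Additive attention mask: causal everywhere; with a sliding window,
    each query also ignores keys more than `sliding_window` tokens back."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    allowed = j <= i                 # causal: never attend to the future
    if sliding_window is not None:
        allowed &= j > i - sliding_window
    return np.where(allowed, 0.0, -np.inf)

full = causal_mask(8)                        # full-attention layers
windowed = causal_mask(8, sliding_window=4)  # the model uses window=128
```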
|
|
|
### RoPE (Rotary Position Embedding) Configuration |
|
- **RoPE Theta**: 150,000 |
|
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN) |
|
- **Scaling Factor**: 32.0 (4,096 × 32 = 131,072, matching the max position embeddings; see the sketch below)
|
- **Beta Fast**: 32.0 |
|
- **Beta Slow**: 1.0 |
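A minimal sketch of the base rotary frequencies these parameters feed into; YaRN's per-frequency interpolation between beta_fast and beta_slow is omitted:

```python
import numpy as np

def rope_inv_freq(head_dim=64, theta=150_000.0):
    """Base rotary inverse frequencies for 32 dimension pairs;
    YaRN rescales a subset of these to stretch the context window."""
    return 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)

print(rope_inv_freq()[:4])  # fastest-rotating pairs
```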
|
|
|
### Quantization Details |
|
- **Quantization Method**: MLX 4-bit quantization |
|
- **Group Size**: 64 |
|
- **Effective Bits per Weight**: 4.504 (see the breakdown below)
|
- **Size Reduction**: 13GB → 11GB (~15% reduction) |
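The fractional figure is explained by per-group metadata: storing an fp16 scale and bias for every 64-weight group adds 32/64 = 0.5 bits, giving ~4.5 bits/weight, with the remainder presumably from tensors left unquantized. A schematic of group-wise affine 4-bit quantization (illustrative; not MLX's exact kernel):

```python
import numpy as np

def quantize_group(w, bits=4):
    """Quantize one 64-weight group to 4-bit codes
    plus an fp16 scale and bias."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2**bits - 1) or 1.0             # avoid div-by-zero
    codes = np.round((w - lo) / scale).astype(np.uint8)  # values in 0..15
    return codes, np.float16(scale), np.float16(lo)

def dequantize_group(codes, scale, bias):
    return codes.astype(np.float32) * np.float32(scale) + np.float32(bias)

group = np.random.randn(64).astype(np.float32)
codes, scale, bias = quantize_group(group)
max_err = np.abs(dequantize_group(codes, scale, bias) - group).max()
```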
|
|
|
### File Structure |
|
``` |
|
gpt-oss-20b-MLX-4bit/ |
|
├── config.json # Model configuration |
|
├── model-00001-of-00003.safetensors # Model weights (part 1) |
|
├── model-00002-of-00003.safetensors # Model weights (part 2) |
|
├── model-00003-of-00003.safetensors # Model weights (part 3) |
|
├── model.safetensors.index.json # Model sharding index |
|
├── tokenizer.json # Tokenizer configuration |
|
├── tokenizer_config.json # Tokenizer settings |
|
├── special_tokens_map.json # Special tokens mapping |
|
├── generation_config.json # Generation parameters |
|
└── chat_template.jinja # Chat template |
|
``` |
|
|
|
## Performance Characteristics |
|
|
|
### Hardware Requirements |
|
- **Platform**: Apple Silicon (M1, M2, M3, M4 series) |
|
- **Memory**: ~11GB for model weights, plus KV cache at long contexts (estimated below)
|
- **Recommended RAM**: 16GB+ for optimal performance |
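The KV-cache estimate referenced above, derived from the specs in this card (the sliding-window layers actually cap their cache near 128 tokens, so this is an upper bound):

```python
def kv_cache_bytes(seq_len, layers=24, kv_heads=8, head_dim=64, dtype_bytes=2):
    # keys + values per layer, assuming fp16 cache entries
    return layers * 2 * kv_heads * head_dim * seq_len * dtype_bytes

print(kv_cache_bytes(4096) / 2**20)  # ~192 MiB at a 4,096-token context
```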
|
|
|
### Layer Configuration |
|
The model uses an alternating attention pattern across its 24 layers (reconstructed in the snippet below):
|
- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens) |
|
- **Odd layers (1, 3, 5, ...)**: Full attention |
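A plausible reconstruction of the per-layer schedule (type names are hypothetical):

```python
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(24)
]
```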
|
|
|
## Training Details |
|
|
|
### Tokenizer |
|
- **Type**: Custom tokenizer with a 201,088-token vocabulary (see the quick check below)

- **Special Tokens**:

  - EOS Token ID: 200002

  - Pad Token ID: 199999
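Since the tokenizer files follow the standard Hugging Face layout, these IDs can be sanity-checked with `transformers` (assuming it is installed; this is an optional cross-check, not required for MLX inference):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/gpt-oss-20b-MLX-4bit")
print(len(tok))          # tokenizer entries (config.json reports a 201,088 vocab, possibly padded)
print(tok.eos_token_id)  # expected: 200002
print(tok.pad_token_id)  # expected: 199999
```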
|
|
|
### Model Configuration |
|
- **Hidden Activation**: SiLU (Swish) |
|
- **Normalization**: RMSNorm (ε = 1e-05) |
|
- **Initializer Range**: 0.02 |
|
- **Attention Dropout**: 0.0 |
|
- **Attention Bias**: Enabled |
|
|
|
## Conversion Process |
|
|
|
This model was converted using:
|
- **MLX-LM Version**: 0.26.3 (development branch) |
|
- **Conversion Command**: |
|
```bash |
|
python3 -m mlx_lm convert \ |
|
--hf-path "/path/to/openai-gpt-oss-20b" \ |
|
--mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \ |
|
--quantize \ |
|
--q-bits 4 |
|
``` |
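The same conversion can be scripted; a sketch assuming the `mlx_lm.convert` Python API available in v0.26.3+:

```python
from mlx_lm import convert

convert(
    hf_path="/path/to/openai-gpt-oss-20b",
    mlx_path="/path/to/gpt-oss-20b-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,  # matches the group size reported above
)
```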
|
|
|
## Known Limitations |
|
|
|
1. **Architecture Specificity**: This model uses the `gpt_oss` architecture, which is supported only in MLX-LM v0.26.3+
|
2. **Platform Dependency**: Optimized specifically for Apple Silicon; may not run on other platforms |
|
3. **Quantization Trade-offs**: 4-bit quantization may result in slight quality degradation compared to full precision |
|
|
|
## Compatibility |
|
|
|
- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support |
|
- **Apple Silicon**: M1, M2, M3, M4 series processors |
|
- **macOS**: Compatible with recent macOS versions supporting MLX |
|
|
|
## License |
|
|
|
Please refer to the original OpenAI GPT OSS 20B model license terms. |
|
|
|
## Acknowledgments |
|
|
|
- Original model by OpenAI |
|
- MLX framework by Apple Machine Learning Research |
|
- Quantization achieved using `mlx-lm` development tools |
|
|
|
--- |
|
|
|
**Model Size**: 11GB |
|
**Quantization**: 4-bit (4.504 bits/weight) |
|
**Created**: August 6, 2025 |
|
**MLX-LM Version**: 0.26.3 (development) |