---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
tags:
- mlx
- gpt
- openai
- chatGPT
---
# OpenAI GPT OSS 20B - MLX 4-bit
## Model Description
This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. It was converted from the original `gpt_oss` architecture to MLX format with a development build of `mlx-lm` (v0.26.3).
> [!TIP]
> For best results with tool calling and reasoning, update LM Studio to the latest version (0.3.22).
## Quick Start
### Installation
```bash
pip install mlx-lm
```
### Usage
```python
from mlx_lm import load, generate
# Load the quantized model (point this at your local copy of the weights,
# or the Hub repo ID if downloading)
model, tokenizer = load("gpt-oss-20b-MLX-4bit")
# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```
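For chat-style use, the bundled `chat_template.jinja` formats the conversation; a minimal sketch, assuming the tokenizer exposes the standard `apply_chat_template` method:
```python
from mlx_lm import load, generate

model, tokenizer = load("gpt-oss-20b-MLX-4bit")

# Format a conversation using the model's bundled chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
```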
## Technical Specifications
### Model Architecture
- **Model Type**: `gpt_oss` (GPT Open Source)
- **Architecture**: `GptOssForCausalLM`
- **Parameters**: ~20 billion parameters
- **Quantization**: 4-bit precision (4.504 bits per weight)
### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |
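These values mirror the shipped `config.json`; a quick local spot-check, assuming the standard Hugging Face field names:
```python
import json

with open("gpt-oss-20b-MLX-4bit/config.json") as f:
    cfg = json.load(f)

# Compare the table above against the shipped configuration
for key in ("hidden_size", "intermediate_size", "num_hidden_layers",
            "num_attention_heads", "num_key_value_heads", "head_dim",
            "vocab_size"):
    print(f"{key}: {cfg[key]}")
```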
### Mixture of Experts (MoE) Configuration
- **Number of Local Experts**: 32
- **Experts per Token**: 4
- **Router Auxiliary Loss Coefficient**: 0.9
- **SwiGLU Limit**: 7.0
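Top-4-of-32 routing means each token's hidden state is scored against all 32 experts and only the 4 highest-scoring experts run. A simplified sketch of that selection (illustrative only, not the model's actual kernel):
```python
import mlx.core as mx

def route(hidden, router_weight, num_experts_per_tok=4):
    # Score every expert for each token: (tokens, num_local_experts)
    logits = hidden @ router_weight.T
    # Keep only the indices of the top-4 experts per token
    idx = mx.argpartition(-logits, kth=num_experts_per_tok - 1, axis=-1)
    top_idx = idx[..., :num_experts_per_tok]
    top_logits = mx.take_along_axis(logits, top_idx, axis=-1)
    # Normalize the selected scores into mixing weights
    weights = mx.softmax(top_logits, axis=-1)
    return top_idx, weights
```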
### Attention Mechanism
- **Attention Type**: Hybrid sliding window and full attention
- **Sliding Window Size**: 128 tokens
- **Max Position Embeddings**: 131,072 tokens
- **Initial Context Length**: 4,096 tokens
- **Attention Pattern**: Alternating sliding and full attention layers
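A sliding-window layer restricts each query to the most recent 128 keys, while a full-attention layer sees the entire causal prefix. A toy mask construction, just to illustrate the pattern:
```python
import mlx.core as mx

def causal_mask(seq_len, window=None):
    q = mx.arange(seq_len)[:, None]
    k = mx.arange(seq_len)[None, :]
    allowed = k <= q  # causal: no attending to future positions
    if window is not None:
        # Sliding window: only the last `window` keys remain visible
        allowed = mx.logical_and(allowed, (q - k) < window)
    return allowed

full_mask = causal_mask(8)               # full-attention layers
sliding_mask = causal_mask(8, window=4)  # sliding-window layers (128 in the real model)
```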
### RoPE (Rotary Position Embedding) Configuration
- **RoPE Theta**: 150,000
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN)
- **Scaling Factor**: 32.0
- **Beta Fast**: 32.0
- **Beta Slow**: 1.0
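The base rotary frequencies follow the usual θ^(−2i/d) schedule with θ = 150,000; YaRN then rescales them to stretch the 4,096-token pretraining context toward the 131,072-token maximum. A sketch of just the base frequencies (the YaRN interpolation itself is omitted):
```python
import mlx.core as mx

def rope_inv_freqs(head_dim=64, theta=150_000.0):
    # One inverse frequency per rotated dimension pair
    exponents = mx.arange(0, head_dim, 2).astype(mx.float32) / head_dim
    return mx.power(theta, -exponents)

inv_freq = rope_inv_freqs()
print(inv_freq[:4])  # fastest-rotating pairs
```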
### Quantization Details
- **Quantization Method**: MLX 4-bit quantization
- **Group Size**: 64
- **Effective Bits per Weight**: 4.504
- **Size Reduction**: 13GB → 11GB (~15% reduction)
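The 4.504 figure follows from the group layout: assuming one fp16 scale and one fp16 bias per 64-weight group (MLX's affine quantization layout), the quantized tensors cost 4.5 bits/weight, with the remainder coming from tensors kept at higher precision:
```python
# Back-of-the-envelope check on the 4.504 bits/weight figure
bits, group_size = 4, 64
per_group = group_size * bits + 16 + 16  # 4-bit payload + fp16 scale + fp16 bias
print(per_group / group_size)            # 4.5 bits/weight
# The remaining ~0.004 comes from unquantized tensors (e.g., norms)
```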
### File Structure
```
gpt-oss-20b-MLX-4bit/
├── config.json # Model configuration
├── model-00001-of-00003.safetensors # Model weights (part 1)
├── model-00002-of-00003.safetensors # Model weights (part 2)
├── model-00003-of-00003.safetensors # Model weights (part 3)
├── model.safetensors.index.json # Model sharding index
├── tokenizer.json # Tokenizer configuration
├── tokenizer_config.json # Tokenizer settings
├── special_tokens_map.json # Special tokens mapping
├── generation_config.json # Generation parameters
└── chat_template.jinja # Chat template
```
## Performance Characteristics
### Hardware Requirements
- **Platform**: Apple Silicon (M1, M2, M3, M4 series)
- **Memory**: ~11GB for model weights
- **Recommended RAM**: 16GB+ for optimal performance
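The ~11GB weight footprint follows directly from the effective bit rate; a quick sanity check:
```python
params = 20e9            # ~20 billion parameters
bits_per_weight = 4.504  # effective bits/weight after quantization
print(params * bits_per_weight / 8 / 1e9)  # ≈ 11.3 GB of weights
```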
### Layer Configuration
The model uses an alternating attention pattern across its 24 layers:
- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens)
- **Odd layers (1, 3, 5, ...)**: Full attention
## Training Details
### Tokenizer
- **Type**: Custom tokenizer with a 201,088-token vocabulary
- **Special Tokens**:
- EOS Token ID: 200,002
- Pad Token ID: 199,999
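These IDs can be confirmed from the shipped tokenizer; a quick check, assuming it loads through `mlx_lm`:
```python
from mlx_lm import load

_, tokenizer = load("gpt-oss-20b-MLX-4bit")
print(tokenizer.eos_token_id)  # expected: 200002
print(tokenizer.pad_token_id)  # expected: 199999
```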
### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-05)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Attention Bias**: Enabled
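RMSNorm rescales activations by their root-mean-square rather than subtracting a mean; a minimal sketch with the model's ε:
```python
import mlx.core as mx

def rms_norm(x, weight, eps=1e-5):
    # Scale by the reciprocal RMS over the last axis, then apply the learned gain
    return weight * x * mx.rsqrt(mx.mean(x * x, axis=-1, keepdims=True) + eps)
```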
## Conversion Process
This model was converted using:
- **MLX-LM Version**: 0.26.3 (development branch)
- **Conversion Command**:
```bash
python3 -m mlx_lm convert \
--hf-path "/path/to/openai-gpt-oss-20b" \
--mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
--quantize \
--q-bits 4
```
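The same conversion can be driven from Python; a sketch assuming `mlx_lm`'s `convert` helper and its standard keyword arguments:
```python
from mlx_lm import convert

# Equivalent to the CLI invocation above (group size 64 is the mlx-lm default)
convert(
    hf_path="/path/to/openai-gpt-oss-20b",
    mlx_path="/path/to/gpt-oss-20b-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```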
## Known Limitations
1. **Architecture Specificity**: This model uses the `gpt_oss` architecture, which is supported only in MLX-LM v0.26.3 and later
2. **Platform Dependency**: Optimized specifically for Apple Silicon; may not run on other platforms
3. **Quantization Trade-offs**: 4-bit quantization may result in slight quality degradation compared to full precision
## Compatibility
- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support
- **Apple Silicon**: M1, M2, M3, M4 series processors
- **macOS**: Compatible with recent macOS versions supporting MLX
## License
Please refer to the original OpenAI GPT OSS 20B model license terms.
## Acknowledgments
- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization achieved using `mlx-lm` development tools
---
- **Model Size**: 11GB
- **Quantization**: 4-bit (4.504 bits/weight)
- **Created**: August 6, 2025
- **MLX-LM Version**: 0.26.3 (development)