---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
tags:
- mlx
- gpt
- openai
- chatGPT
---


# OpenAI GPT OSS 20B - MLX 4-bit

## Model Description

This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. It was converted from the original `gpt_oss` architecture to MLX format with the development version of `mlx-lm` (v0.26.3).

> [!TIP]
> For best results with tool calling and reasoning, update LM Studio to the latest version (0.3.22).

## Quick Start

### Installation

```bash
pip install mlx-lm
```

### Usage

```python
from mlx_lm import load, generate

# Load the quantized model
model, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")  # local path to the converted model

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```
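
### Chat-Formatted Prompting

The model ships with a chat template (`chat_template.jinja`, listed in the file structure below), which can be applied through the tokenizer's standard chat-template API. A minimal sketch:

```python
from mlx_lm import load, generate

model, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")

# Format the conversation with the bundled chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```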

## Technical Specifications

### Model Architecture
- **Model Type**: `gpt_oss` (GPT Open Source)
- **Architecture**: `GptOssForCausalLM`
- **Parameters**: ~20 billion parameters
- **Quantization**: 4-bit precision (4.504 bits per weight)

### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |
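
These values can be cross-checked against the shipped `config.json`; a quick sketch, assuming the keys follow the usual Hugging Face naming for this architecture:

```python
import json

# Inspect the configuration shipped with the converted model
with open("path/to/gpt-oss-20b-MLX-4bit/config.json") as f:
    cfg = json.load(f)

for key in ("hidden_size", "intermediate_size", "num_hidden_layers",
            "num_attention_heads", "num_key_value_heads", "head_dim",
            "vocab_size"):
    print(key, "=", cfg.get(key))
```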

### Mixture of Experts (MoE) Configuration
- **Number of Local Experts**: 32
- **Experts per Token**: 4
- **Router Auxiliary Loss Coefficient**: 0.9
- **SwiGLU Limit**: 7.0
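
With this configuration, a router scores all 32 experts for each token and only the 4 highest-scoring experts are evaluated, their outputs mixed by softmaxed router scores. A minimal NumPy sketch of that routing step (illustrative only, not the MLX kernel):

```python
import numpy as np

NUM_EXPERTS, TOP_K, HIDDEN = 32, 4, 2880

def route(x, router_weight):
    """Score all experts for one token, keep the top 4, softmax their scores."""
    logits = router_weight @ x                   # (32,) one score per expert
    top = np.argsort(logits)[-TOP_K:]            # indices of the 4 best experts
    w = np.exp(logits[top] - logits[top].max())  # softmax over the selected scores
    return top, w / w.sum()

rng = np.random.default_rng(0)
experts, weights = route(rng.standard_normal(HIDDEN),
                         rng.standard_normal((NUM_EXPERTS, HIDDEN)))
print(experts, weights.round(3))  # 4 expert ids and their mixing weights
```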

### Attention Mechanism
- **Attention Type**: Hybrid sliding window and full attention
- **Sliding Window Size**: 128 tokens
- **Max Position Embeddings**: 131,072 tokens
- **Initial Context Length**: 4,096 tokens
- **Attention Pattern**: Alternating sliding and full attention layers
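
In a sliding-window layer, each query attends only to the most recent 128 positions, while a full-attention layer sees the entire prefix. A small sketch of the two causal masks (an illustration of the mechanism, not the MLX implementation):

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """True where query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                  # causal: never attend to future tokens
    if window is not None:
        mask &= j > i - window     # sliding window: only the last `window` keys
    return mask

full_mask = causal_mask(8)               # full-attention layers
sliding_mask = causal_mask(8, window=4)  # window=128 in the real model
```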

### RoPE (Rotary Position Embedding) Configuration
- **RoPE Theta**: 150,000
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN)
- **Scaling Factor**: 32.0
- **Beta Fast**: 32.0
- **Beta Slow**: 1.0
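
YaRN extends the context by interpolating the low-frequency RoPE dimensions by the scaling factor while leaving the high-frequency dimensions (which carry local positional detail) near their original values, blended by a linear ramp controlled by beta fast/slow. A simplified sketch following the common open-source YaRN formulation (the attention-temperature adjustment is omitted, and the exact MLX code may differ):

```python
import math
import numpy as np

DIM, BASE, FACTOR = 64, 150_000.0, 32.0
BETA_FAST, BETA_SLOW = 32.0, 1.0
ORIGINAL_CTX = 4096

def correction_dim(num_rotations):
    # Dimension index whose frequency completes `num_rotations` turns over ORIGINAL_CTX
    return (DIM * math.log(ORIGINAL_CTX / (num_rotations * 2 * math.pi))
            / (2 * math.log(BASE)))

low = max(math.floor(correction_dim(BETA_FAST)), 0)
high = min(math.ceil(correction_dim(BETA_SLOW)), DIM // 2 - 1)

inv_freq = BASE ** (-np.arange(0, DIM, 2) / DIM)  # original RoPE frequencies
ramp = np.clip((np.arange(DIM // 2) - low) / max(high - low, 1e-3), 0.0, 1.0)

# High-frequency dims (ramp=0) keep their original frequency;
# low-frequency dims (ramp=1) are interpolated by the scaling factor.
inv_freq_yarn = inv_freq * (1 - ramp) + (inv_freq / FACTOR) * ramp
```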

### Quantization Details
- **Quantization Method**: MLX 4-bit quantization
- **Group Size**: 64
- **Effective Bits per Weight**: 4.504
- **Size Reduction**: 13GB → 11GB (~15% reduction)
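
The effective 4.504 bits per weight follows from the group-wise scheme: each group of 64 four-bit weights also carries dequantization parameters. A back-of-the-envelope check, assuming one 16-bit scale and one 16-bit bias per group:

```python
# Per group: 64 four-bit values plus (assumed) one fp16 scale and one fp16 bias
bits_per_group = 64 * 4 + 16 + 16
print(bits_per_group / 64)  # 4.5 bits/weight; the reported 4.504 adds a small
                            # overhead from parameters kept at higher precision
```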

### File Structure
```
gpt-oss-20b-MLX-4bit/
├── config.json                        # Model configuration
├── model-00001-of-00003.safetensors   # Model weights (part 1)
├── model-00002-of-00003.safetensors   # Model weights (part 2)
├── model-00003-of-00003.safetensors   # Model weights (part 3)
├── model.safetensors.index.json       # Model sharding index
├── tokenizer.json                     # Tokenizer configuration
├── tokenizer_config.json              # Tokenizer settings
├── special_tokens_map.json            # Special tokens mapping
├── generation_config.json             # Generation parameters
└── chat_template.jinja                # Chat template
```

## Performance Characteristics

### Hardware Requirements
- **Platform**: Apple Silicon (M1, M2, M3, M4 series)
- **Memory**: ~11GB for model weights
- **Recommended RAM**: 16GB+ for optimal performance

### Layer Configuration
The model uses an alternating attention pattern across its 24 layers:
- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens)
- **Odd layers (1, 3, 5, ...)**: Full attention
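
Expressed as the per-layer list that this architecture's configuration typically carries (the string names are assumed):

```python
# Alternating attention types across the 24 layers, matching the pattern above
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(24)
]
```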

## Training Details

### Tokenizer
- **Type**: BPE tokenizer (`o200k_harmony`) with a 201,088-token vocabulary
- **Special Tokens**:
  - EOS Token ID: 200002
  - Pad Token ID: 199999
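
These IDs can be spot-checked from the loaded tokenizer; a small sketch (whether `pad_token_id` is exposed directly depends on which config file sets it):

```python
from mlx_lm import load

_, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")
print(tokenizer.eos_token_id)  # expected: 200002
print(tokenizer.pad_token_id)  # expected: 199999, if set by the tokenizer config
```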

### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-05)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Attention Bias**: Enabled
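
RMSNorm rescales activations by their root-mean-square rather than centering and scaling as LayerNorm does. A minimal sketch using the ε above:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    """Normalize by the root-mean-square only; unlike LayerNorm, no mean subtraction."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight
```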

## Conversion Process

This model was converted using:
- **MLX-LM Version**: 0.26.3 (development branch)
- **Conversion Command**: 
  ```bash
  python3 -m mlx_lm convert \
    --hf-path "/path/to/openai-gpt-oss-20b" \
    --mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
    --quantize \
    --q-bits 4
  ```
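
Note that `mlx-lm`'s default quantization group size is 64, matching the group size reported above, so no explicit `--q-group-size` flag is needed.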

## Known Limitations

1. **Architecture Specificity**: This model uses the `gpt_oss` architecture which is only supported in MLX-LM v0.26.3+
2. **Platform Dependency**: Optimized specifically for Apple Silicon; may not run on other platforms
3. **Quantization Trade-offs**: 4-bit quantization may result in slight quality degradation compared to full precision

## Compatibility

- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support
- **Apple Silicon**: M1, M2, M3, M4 series processors
- **macOS**: Compatible with recent macOS versions supporting MLX

## License

The original OpenAI GPT OSS 20B model is released under the Apache 2.0 license (reflected in the metadata above), and this quantized conversion inherits those terms. Please refer to the original model card for the full license text.

## Acknowledgments

- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization achieved using `mlx-lm` development tools

---

**Model Size**: 11GB  
**Quantization**: 4-bit (4.504 bits/weight)  
**Created**: August 6, 2025  
**MLX-LM Version**: 0.26.3 (development)