InferenceIllusionist committed
Commit ee982f8 · verified · 1 Parent(s): 98d6191

Update README.md

Files changed (1):
  1. README.md +163 -3
README.md CHANGED
@@ -1,3 +1,163 @@

Removed (previous README front matter):

---
license: apache-2.0
---

Added (new README):

---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
---

# OpenAI GPT OSS 20B - MLX 4-bit

## Model Description

This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. It was converted from the original `gpt_oss` architecture to MLX format with the development version of `mlx-lm` (v0.26.3).

## Quick Start

### Installation

```bash
pip install mlx-lm
```

### Usage

```python
from mlx_lm import load, generate

# Load the quantized model from a local directory or a Hugging Face repo id
model, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```

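Since a `chat_template.jinja` ships with the weights, chat-style prompts can also be rendered through the tokenizer's chat template before calling `generate`. This is a minimal sketch, assuming the tokenizer returned by `mlx-lm` proxies the Hugging Face `apply_chat_template` API; the path is a placeholder:

```python
from mlx_lm import load, generate

# Placeholder path: a local copy of this repo or its Hugging Face repo id
model, tokenizer = load("path/to/gpt-oss-20b-MLX-4bit")

# Render the bundled chat template into a single prompt string
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```
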
## Technical Specifications

### Model Architecture
- **Model Type**: `gpt_oss` (GPT Open Source)
- **Architecture**: `GptOssForCausalLM`
- **Parameters**: ~20 billion
- **Quantization**: 4-bit precision (4.504 bits per weight)

### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |

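The head counts above imply grouped-query attention: 64 query heads share 8 key-value heads, i.e. 8 query heads per KV head. As a rough worked example (assuming a 16-bit KV cache; the exact footprint depends on the runtime), the cache size at a given context length follows directly from these numbers:

```python
# Back-of-the-envelope KV-cache size from the parameters above
layers, kv_heads, head_dim = 24, 8, 64
bytes_per_value = 2          # assuming fp16/bf16 cache entries
context = 8192               # example context length

kv_bytes = 2 * layers * kv_heads * head_dim * context * bytes_per_value  # K and V
print(f"~{kv_bytes / 2**20:.0f} MiB of KV cache at {context} tokens")
# 2 * 24 * 8 * 64 * 8192 * 2 bytes = 384 MiB
```
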
### Mixture of Experts (MoE) Configuration
- **Number of Local Experts**: 32
- **Experts per Token**: 4
- **Router Auxiliary Loss Coefficient**: 0.9
- **SwiGLU Limit**: 7.0

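As a loose illustration of this configuration (not the model's actual router code), each token's hidden state is scored against all 32 experts and only the 4 highest-scoring experts are kept, with their gate weights renormalised:

```python
import numpy as np

# Toy top-k router matching the card's numbers: 32 experts, 4 active per token
num_experts, experts_per_token, hidden = 32, 4, 2880

rng = np.random.default_rng(0)
x = rng.standard_normal(hidden)                         # one token's hidden state
router_w = rng.standard_normal((num_experts, hidden)) * 0.02

logits = router_w @ x                                   # one score per expert
top = np.argsort(logits)[-experts_per_token:]           # indices of the 4 best experts
weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalised gate weights

print("selected experts:", sorted(top.tolist()))
print("gate weights sum to", weights.sum())             # ~1.0
```
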
### Attention Mechanism
- **Attention Type**: Hybrid sliding-window and full attention
- **Sliding Window Size**: 128 tokens
- **Max Position Embeddings**: 131,072 tokens
- **Initial Context Length**: 4,096 tokens
- **Attention Pattern**: Alternating sliding-window and full-attention layers

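To make the hybrid scheme concrete, the sketch below contrasts a full causal mask with a 128-token sliding-window causal mask; it is illustrative only, not how MLX constructs its masks internally:

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """True where attention is allowed; `window` limits how far back a query can look."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    mask = k <= q                      # causal: keys up to the current position
    if window is not None:
        mask &= (q - k) < window       # sliding window: only the last `window` keys
    return mask

full = causal_mask(256)                 # full-attention layers
sliding = causal_mask(256, window=128)  # sliding-window layers
print(full.sum(), sliding.sum())        # the sliding mask allows far fewer key positions
```
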
### RoPE (Rotary Position Embedding) Configuration
- **RoPE Theta**: 150,000
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN)
- **Scaling Factor**: 32.0
- **Beta Fast**: 32.0
- **Beta Slow**: 1.0

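These values are mutually consistent: YaRN stretches the initial 4,096-token context by the 32x scaling factor to reach the 131,072-token maximum. A one-line sanity check:

```python
# Extended context = initial context length * YaRN scaling factor
original_context, yarn_factor = 4096, 32.0
assert int(original_context * yarn_factor) == 131_072  # matches max position embeddings
```
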
### Quantization Details
- **Quantization Method**: MLX 4-bit quantization
- **Group Size**: 64
- **Effective Bits per Weight**: 4.504
- **Size Reduction**: 13GB → 11GB (~15% reduction)

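The 4.504 effective bits per weight is roughly what group quantization predicts: each group of 64 weights stores 4-bit values plus a per-group scale and bias. Assuming 16-bit scale and bias (an assumption about the MLX format, not stated in this card), that adds 32 extra bits per 64 weights, and the small remainder comes from tensors kept at higher precision:

```python
# Rough per-weight cost of 4-bit affine group quantization
bits, group_size = 4, 64
overhead_bits_per_group = 16 + 16          # one scale and one bias per group (assumed 16-bit)
effective = bits + overhead_bits_per_group / group_size
print(effective)   # 4.5 bits/weight; the reported 4.504 also reflects unquantized tensors
```
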
### File Structure
```
gpt-oss-20b-MLX-4bit/
├── config.json                        # Model configuration
├── model-00001-of-00003.safetensors   # Model weights (part 1)
├── model-00002-of-00003.safetensors   # Model weights (part 2)
├── model-00003-of-00003.safetensors   # Model weights (part 3)
├── model.safetensors.index.json       # Model sharding index
├── tokenizer.json                     # Tokenizer configuration
├── tokenizer_config.json              # Tokenizer settings
├── special_tokens_map.json            # Special tokens mapping
├── generation_config.json             # Generation parameters
└── chat_template.jinja                # Chat template
```

## Performance Characteristics

### Hardware Requirements
- **Platform**: Apple Silicon (M1, M2, M3, M4 series)
- **Memory**: ~11GB for model weights
- **Recommended RAM**: 16GB+ for optimal performance

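Before loading on a memory-constrained machine, the weight footprint can be estimated from the safetensors shards on disk; a standard-library sketch with a placeholder path:

```python
from pathlib import Path

# Placeholder path: wherever the downloaded shards live
model_dir = Path("path/to/gpt-oss-20b-MLX-4bit")

shard_bytes = sum(f.stat().st_size for f in model_dir.glob("*.safetensors"))
print(f"weights on disk: {shard_bytes / 2**30:.1f} GiB "
      f"(expect roughly this much unified memory for the weights alone)")
```
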
### Layer Configuration
The model uses an alternating attention pattern across its 24 layers:
- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens)
- **Odd layers (1, 3, 5, ...)**: Full attention

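Expressed as code, the alternation corresponds to a per-layer type list of the kind `gpt_oss` configs carry (the `sliding_attention`/`full_attention` names are an assumption about the config schema):

```python
# Even layers use the 128-token sliding window, odd layers attend to the full context
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(24)
]
print(layer_types[:4])  # ['sliding_attention', 'full_attention', 'sliding_attention', 'full_attention']
```
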
## Training Details

### Tokenizer
- **Type**: Custom tokenizer with a 201,088-token vocabulary
- **Special Tokens**:
  - EOS Token ID: 200,002
  - Pad Token ID: 199,999

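The IDs and vocabulary size can be checked against the bundled tokenizer files; a small sketch, assuming the `transformers` tokenizer API is available and using a placeholder path:

```python
from transformers import AutoTokenizer

# Placeholder path: a local copy of this repo or its Hugging Face repo id
tok = AutoTokenizer.from_pretrained("path/to/gpt-oss-20b-MLX-4bit")

print("vocab size:", len(tok))          # compare against the value listed above
print("eos token id:", tok.eos_token_id)
print("pad token id:", tok.pad_token_id)
```
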
### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-05)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Attention Bias**: Enabled

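For reference, both pieces are simple to write down; a NumPy sketch of RMSNorm and SiLU (illustrative, not the model's own code):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: divide by the root-mean-square of the features, then apply a learned gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + np.exp(-x)))

x = np.random.randn(4, 2880).astype(np.float32)
w = np.ones(2880, dtype=np.float32)
print(rms_norm(x, w).shape, silu(x).shape)
```
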
## Conversion Process

This model was converted with:
- **MLX-LM Version**: 0.26.3 (development branch)
- **Conversion Command**:
```bash
python3 -m mlx_lm convert \
    --hf-path "/path/to/openai-gpt-oss-20b" \
    --mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
    --quantize \
    --q-bits 4
```

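The conversion can presumably also be driven from Python. This sketch assumes `mlx-lm` exposes a `convert()` function with these keyword arguments; check the signature of the installed version before relying on it:

```python
from mlx_lm import convert  # in some versions: from mlx_lm.convert import convert

# Paths are placeholders, mirroring the CLI invocation above
convert(
    hf_path="/path/to/openai-gpt-oss-20b",
    mlx_path="/path/to/gpt-oss-20b-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,   # the group size reported in this card
)
```
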
## Known Limitations

1. **Architecture Specificity**: This model uses the `gpt_oss` architecture, which is supported only in MLX-LM v0.26.3+
2. **Platform Dependency**: Optimized specifically for Apple Silicon; may not run on other platforms
3. **Quantization Trade-offs**: 4-bit quantization may result in slight quality degradation compared to full precision

## Compatibility

- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support
- **Apple Silicon**: M1, M2, M3, M4 series processors
- **macOS**: Compatible with recent macOS versions supporting MLX

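A quick, standard-library way to confirm the installed `mlx-lm` meets this version floor before attempting to load the model:

```python
from importlib.metadata import version

installed = version("mlx-lm")
print("mlx-lm", installed)
# gpt_oss support requires 0.26.3 or later; anything older needs an upgrade:
#   pip install -U mlx-lm
```
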
## License

Apache 2.0, following the original OpenAI GPT OSS 20B model; refer to the original model card for the full terms.

## Acknowledgments

- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization performed with `mlx-lm` development tools

---

**Model Size**: 11GB  
**Quantization**: 4-bit (4.504 bits/weight)  
**Created**: August 6, 2025  
**MLX-LM Version**: 0.26.3 (development)