---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
tags:
- mlx
- gpt
- openai
- chatGPT
---

# OpenAI GPT OSS 20B - MLX 4-bit

## Model Description

This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. The model was converted from the original `gpt_oss` architecture to MLX format with the development version of `mlx-lm` (v0.26.3).

> [!TIP]
> For best results with tool calling and reasoning, update LM Studio to the latest version (0.3.22).

## Quick Start

### Installation

```bash
pip install mlx-lm
```

### Usage

```python
from mlx_lm import load, generate

# Load the quantized model
model, tokenizer = load("/path/to/gpt-oss-20b-MLX-4bit")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```

## Technical Specifications

### Model Architecture

- **Model Type**: `gpt_oss` (GPT Open Source)
- **Architecture**: `GptOssForCausalLM`
- **Parameters**: ~20 billion
- **Quantization**: 4-bit precision (4.504 bits per weight)

### Core Parameters

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |

### Mixture of Experts (MoE) Configuration

- **Number of Local Experts**: 32
- **Experts per Token**: 4
- **Router Auxiliary Loss Coefficient**: 0.9
- **SwiGLU Limit**: 7.0

### Attention Mechanism

- **Attention Type**: Hybrid sliding window and full attention
- **Sliding Window Size**: 128 tokens
- **Max Position Embeddings**: 131,072 tokens
- **Initial Context Length**: 4,096 tokens
- **Attention Pattern**: Alternating sliding-window and full-attention layers

### RoPE (Rotary Position Embedding) Configuration

- **RoPE Theta**: 150,000
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN)
- **Scaling Factor**: 32.0
- **Beta Fast**: 32.0
- **Beta Slow**: 1.0

### Quantization Details

- **Quantization Method**: MLX 4-bit quantization
- **Group Size**: 64
- **Effective Bits per Weight**: 4.504
- **Size Reduction**: 13GB → 11GB (~15% reduction)

### File Structure

```
gpt-oss-20b-MLX-4bit/
├── config.json                       # Model configuration
├── model-00001-of-00003.safetensors  # Model weights (part 1)
├── model-00002-of-00003.safetensors  # Model weights (part 2)
├── model-00003-of-00003.safetensors  # Model weights (part 3)
├── model.safetensors.index.json      # Model sharding index
├── tokenizer.json                    # Tokenizer configuration
├── tokenizer_config.json             # Tokenizer settings
├── special_tokens_map.json           # Special tokens mapping
├── generation_config.json            # Generation parameters
└── chat_template.jinja               # Chat template
```

## Performance Characteristics

### Hardware Requirements

- **Platform**: Apple Silicon (M1, M2, M3, M4 series)
- **Memory**: ~11GB for model weights
- **Recommended RAM**: 16GB+ for optimal performance

### Layer Configuration

The model alternates attention types across its 24 layers:

- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens)
- **Odd layers (1, 3, 5, ...)**: Full attention

## Training Details

### Tokenizer

- **Type**: Custom tokenizer with a 201,088-token vocabulary
- **Special Tokens**:
  - EOS Token ID: 200002
  - Pad Token ID: 199999

### Model Configuration

- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-05)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Attention Bias**: Enabled
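Since this is a chat model, prompts generally work best when formatted with the bundled chat template (`chat_template.jinja`, listed under File Structure above) rather than passed as raw strings. A minimal sketch, assuming the same placeholder model path as in Quick Start:

```python
from mlx_lm import load, generate

model, tokenizer = load("/path/to/gpt-oss-20b-MLX-4bit")

# Format a chat-style request with the bundled chat template;
# apply_chat_template returns token IDs, which generate() accepts directly.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```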
## Conversion Process

This model was converted with:

- **MLX-LM Version**: 0.26.3 (development branch)
- **Conversion Command**:

```bash
python3 -m mlx_lm convert \
  --hf-path "/path/to/openai-gpt-oss-20b" \
  --mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
  --quantize \
  --q-bits 4
```

## Known Limitations

1. **Architecture Specificity**: This model uses the `gpt_oss` architecture, which is only supported in MLX-LM v0.26.3+.
2. **Platform Dependency**: Optimized specifically for Apple Silicon; it may not run on other platforms.
3. **Quantization Trade-offs**: 4-bit quantization may result in slight quality degradation compared to full precision.

## Compatibility

- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support
- **Apple Silicon**: M1, M2, M3, M4 series processors
- **macOS**: Compatible with recent macOS versions supporting MLX

## License

Please refer to the original OpenAI GPT OSS 20B model license terms.

## Acknowledgments

- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization performed with `mlx-lm` development tools

---

**Model Size**: 11GB
**Quantization**: 4-bit (4.504 bits/weight)
**Created**: August 6, 2025
**MLX-LM Version**: 0.26.3 (development)
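To confirm that your installed `mlx-lm` meets this version requirement, a quick check using only the Python standard library (not an mlx-lm API):

```python
from importlib.metadata import version

# gpt_oss support requires mlx-lm v0.26.3 or later (see Compatibility above)
print(f"mlx-lm {version('mlx-lm')} installed; 0.26.3+ required for gpt_oss")
```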