---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
tags:
- mlx
- gpt
- openai
- chatGPT
---
# OpenAI GPT OSS 20B - MLX 4-bit
## Model Description
This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. It was converted from the original `gpt_oss` architecture to MLX format with a development build of `mlx-lm` (v0.26.3).
> [!TIP]
> For best results with tool calling and reasoning, update LM Studio to the latest version (0.3.22).
## Quick Start
### Installation
```bash
pip install mlx-lm
```
### Usage
```python
from mlx_lm import load, generate
# Load the quantized model (point this at your local copy of the weights,
# or the Hub repo ID if downloading)
model, tokenizer = load("gpt-oss-20b-MLX-4bit")
# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```
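For chat-style use, the bundled `chat_template.jinja` formats the conversation; a minimal sketch, assuming the tokenizer exposes the standard `apply_chat_template` method:
```python
from mlx_lm import load, generate

model, tokenizer = load("gpt-oss-20b-MLX-4bit")

# Format a conversation using the model's bundled chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
```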
## Technical Specifications
### Model Architecture
- **Model Type**: `gpt_oss` (GPT Open Source)
- **Architecture**: `GptOssForCausalLM`
- **Parameters**: ~20 billion parameters
- **Quantization**: 4-bit precision (4.504 bits per weight)
### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |
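These values mirror the shipped `config.json`; a quick local spot-check, assuming the standard Hugging Face field names:
```python
import json

with open("gpt-oss-20b-MLX-4bit/config.json") as f:
    cfg = json.load(f)

# Compare the table above against the shipped configuration
for key in ("hidden_size", "intermediate_size", "num_hidden_layers",
            "num_attention_heads", "num_key_value_heads", "head_dim",
            "vocab_size"):
    print(f"{key}: {cfg[key]}")
```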
### Mixture of Experts (MoE) Configuration
- **Number of Local Experts**: 32
- **Experts per Token**: 4
- **Router Auxiliary Loss Coefficient**: 0.9
- **SwiGLU Limit**: 7.0
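Top-4-of-32 routing means each token's hidden state is scored against all 32 experts and only the 4 highest-scoring experts run. A simplified sketch of that selection (illustrative only, not the model's actual kernel):
```python
import mlx.core as mx

def route(hidden, router_weight, num_experts_per_tok=4):
    # Score every expert for each token: (tokens, num_local_experts)
    logits = hidden @ router_weight.T
    # Keep only the indices of the top-4 experts per token
    idx = mx.argpartition(-logits, kth=num_experts_per_tok - 1, axis=-1)
    top_idx = idx[..., :num_experts_per_tok]
    top_logits = mx.take_along_axis(logits, top_idx, axis=-1)
    # Normalize the selected scores into mixing weights
    weights = mx.softmax(top_logits, axis=-1)
    return top_idx, weights
```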
### Attention Mechanism
- **Attention Type**: Hybrid sliding window and full attention
- **Sliding Window Size**: 128 tokens
- **Max Position Embeddings**: 131,072 tokens
- **Initial Context Length**: 4,096 tokens
- **Attention Pattern**: Alternating sliding and full attention layers
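A sliding-window layer restricts each query to the most recent 128 keys, while a full-attention layer sees the entire causal prefix. A toy mask construction, just to illustrate the pattern:
```python
import mlx.core as mx

def causal_mask(seq_len, window=None):
    q = mx.arange(seq_len)[:, None]
    k = mx.arange(seq_len)[None, :]
    allowed = k <= q  # causal: no attending to future positions
    if window is not None:
        # Sliding window: only the last `window` keys remain visible
        allowed = mx.logical_and(allowed, (q - k) < window)
    return allowed

full_mask = causal_mask(8)               # full-attention layers
sliding_mask = causal_mask(8, window=4)  # sliding-window layers (128 in the real model)
```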
### RoPE (Rotary Position Embedding) Configuration
- **RoPE Theta**: 150,000
- **RoPE Scaling Type**: YaRN (Yet another RoPE extensioN)
- **Scaling Factor**: 32.0
- **Beta Fast**: 32.0
- **Beta Slow**: 1.0
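The base rotary frequencies follow the usual θ^(−2i/d) schedule with θ = 150,000; YaRN then rescales them to stretch the 4,096-token pretraining context toward the 131,072-token maximum. A sketch of just the base frequencies (the YaRN interpolation itself is omitted):
```python
import mlx.core as mx

def rope_inv_freqs(head_dim=64, theta=150_000.0):
    # One inverse frequency per rotated dimension pair
    exponents = mx.arange(0, head_dim, 2).astype(mx.float32) / head_dim
    return mx.power(theta, -exponents)

inv_freq = rope_inv_freqs()
print(inv_freq[:4])  # fastest-rotating pairs
```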
### Quantization Details
- **Quantization Method**: MLX 4-bit quantization
- **Group Size**: 64
- **Effective Bits per Weight**: 4.504
- **Size Reduction**: 13GB → 11GB (~15% reduction)
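The 4.504 figure follows from the group layout: assuming one fp16 scale and one fp16 bias per 64-weight group (MLX's affine quantization layout), the quantized tensors cost 4.5 bits/weight, with the remainder coming from tensors kept at higher precision:
```python
# Back-of-the-envelope check on the 4.504 bits/weight figure
bits, group_size = 4, 64
per_group = group_size * bits + 16 + 16  # 4-bit payload + fp16 scale + fp16 bias
print(per_group / group_size)            # 4.5 bits/weight
# The remaining ~0.004 comes from unquantized tensors (e.g., norms)
```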
### File Structure
```
gpt-oss-20b-MLX-4bit/
├── config.json # Model configuration
├── model-00001-of-00003.safetensors # Model weights (part 1)
├── model-00002-of-00003.safetensors # Model weights (part 2)
├── model-00003-of-00003.safetensors # Model weights (part 3)
├── model.safetensors.index.json # Model sharding index
├── tokenizer.json # Tokenizer configuration
├── tokenizer_config.json # Tokenizer settings
├── special_tokens_map.json # Special tokens mapping
├── generation_config.json # Generation parameters
└── chat_template.jinja # Chat template
```
## Performance Characteristics
### Hardware Requirements
- **Platform**: Apple Silicon (M1, M2, M3, M4 series)
- **Memory**: ~11GB for model weights
- **Recommended RAM**: 16GB+ for optimal performance
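The ~11GB weight footprint follows directly from the effective bit rate; a quick sanity check:
```python
params = 20e9            # ~20 billion parameters
bits_per_weight = 4.504  # effective bits/weight after quantization
print(params * bits_per_weight / 8 / 1e9)  # ≈ 11.3 GB of weights
```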
### Layer Configuration
The model uses an alternating attention pattern across its 24 layers:
- **Even layers (0, 2, 4, ...)**: Sliding window attention (128 tokens)
- **Odd layers (1, 3, 5, ...)**: Full attention
## Training Details
### Tokenizer
- **Type**: Custom tokenizer with a 201,088-token vocabulary
- **Special Tokens**:
- EOS Token ID: 200,002
- Pad Token ID: 199,999
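These IDs can be confirmed from the shipped tokenizer; a quick check, assuming it loads through `mlx_lm`:
```python
from mlx_lm import load

_, tokenizer = load("gpt-oss-20b-MLX-4bit")
print(tokenizer.eos_token_id)  # expected: 200002
print(tokenizer.pad_token_id)  # expected: 199999
```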
### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-05)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
- **Attention Bias**: Enabled
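RMSNorm rescales activations by their root-mean-square rather than subtracting a mean; a minimal sketch with the model's ε:
```python
import mlx.core as mx

def rms_norm(x, weight, eps=1e-5):
    # Scale by the reciprocal RMS over the last axis, then apply the learned gain
    return weight * x * mx.rsqrt(mx.mean(x * x, axis=-1, keepdims=True) + eps)
```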
## Conversion Process
This model was converted using:
- **MLX-LM Version**: 0.26.3 (development branch)
- **Conversion Command**:
```bash
python3 -m mlx_lm convert \
--hf-path "/path/to/openai-gpt-oss-20b" \
--mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
--quantize \
--q-bits 4
```
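The same conversion can be driven from Python; a sketch assuming `mlx_lm`'s `convert` helper and its standard keyword arguments:
```python
from mlx_lm import convert

# Equivalent to the CLI invocation above (group size 64 is the mlx-lm default)
convert(
    hf_path="/path/to/openai-gpt-oss-20b",
    mlx_path="/path/to/gpt-oss-20b-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```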
## Known Limitations
1. **Architecture Specificity**: This model uses the `gpt_oss` architecture, which is supported only in MLX-LM v0.26.3 and later
2. **Platform Dependency**: Optimized specifically for Apple Silicon; may not run on other platforms
3. **Quantization Trade-offs**: 4-bit quantization may result in slight quality degradation compared to full precision
## Compatibility
- **MLX-LM**: Requires v0.26.3 or later for `gpt_oss` support
- **Apple Silicon**: M1, M2, M3, M4 series processors
- **macOS**: Compatible with recent macOS versions supporting MLX
## License
Please refer to the original OpenAI GPT OSS 20B model license terms.
## Acknowledgments
- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization achieved using `mlx-lm` development tools
---
- **Model Size**: 11GB
- **Quantization**: 4-bit (4.504 bits/weight)
- **Created**: August 6, 2025
- **MLX-LM Version**: 0.26.3 (development)