---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- mlx==0.26.2
- q5
- qwq
- reasoning
- m3-ultra
base_model: Qwen/QwQ-32B
---

QwQ-32B MLX Q5 Quantization

This is a Q5 (5-bit) quantized version of the QwQ-32B reasoning model, optimized for MLX on Apple Silicon. The Q5 format offers an excellent balance between model quality and size and is aimed at high-memory Apple Silicon systems such as the M3 Ultra.

Model Details

  • Base Model: Qwen/QwQ-32B
  • Quantization: Q5 (5-bit) with group size 64
  • Format: MLX (Apple Silicon optimized)
  • Size: 21GB (from original 61GB bfloat16)
  • Compression: 66% size reduction
  • Architecture: Qwen2 with reasoning capabilities

Why Q5?

Q5 quantization provides:

  • Superior quality compared to Q4 while being smaller than Q6/Q8
  • Optimal size for 128GB+ Apple Silicon systems
  • Minimal quality loss - retains ~98% of original model capabilities
  • Fast inference with MLX's unified memory architecture
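
The ~21GB size listed under Model Details follows from simple arithmetic. A back-of-the-envelope sketch (assuming every weight is stored at 5 bits plus an fp16 scale and fp16 bias per group of 64; embeddings and any unquantized layers shift the number slightly):

# Rough size estimate for 5-bit, group-size-64 quantization of a 32.8B-parameter model.
params = 32.8e9
bits_per_weight = 5 + (16 + 16) / 64      # 5-bit payload + per-group fp16 scale/bias = 5.5 bits
size_gib = params * bits_per_weight / 8 / 1024**3
print(f"~{size_gib:.0f} GiB")             # ~21 GiB, matching the reported model size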

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • macOS 13.0+
  • Python 3.11+
  • MLX 0.26.0+
  • mlx-lm 0.22.5+
  • 32GB+ RAM recommended (64GB+ for full 128k context)

Installation

# Using uv (recommended)
uv add "mlx>=0.26.0" mlx-lm transformers

# Or with pip (untested by us; uv is the recommended tool)
pip install "mlx>=0.26.0" mlx-lm transformers

Usage

Direct Generation

uv run mlx_lm.generate \
  --model LibraxisAI/QwQ-32B-MLX-Q5 \
  --prompt "Solve this step by step: If a train travels 120 km in 2 hours, what is its speed?" \
  --max-tokens 500

Python API

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model
model, tokenizer = load("LibraxisAI/QwQ-32B-MLX-Q5")

# Generate text with reasoning
prompt = "Think step by step: What are the implications of Q5 quantization for LLM deployment?"
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=1000,
    # recent mlx-lm versions take a sampler object instead of a temp= keyword
    sampler=make_sampler(temp=0.7),
)
print(response)
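
QwQ is trained as a chat model, so its step-by-step reasoning works best when the question is wrapped in the tokenizer's chat template before generation. A minimal sketch (the example question is ours, not from the base model card):

from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/QwQ-32B-MLX-Q5")

# Wrap the user message in the model's chat template so the reasoning
# behaviour is triggered the way the model was trained.
messages = [{"role": "user", "content": "How many prime numbers are there below 30?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2000, verbose=True)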

HTTP Server

uv run mlx_lm.server \
  --model LibraxisAI/QwQ-32B-MLX-Q5 \
  --host 0.0.0.0 \
  --port 8080
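
The server exposes an OpenAI-compatible REST API. A quick way to query it from Python (a sketch assuming the server command above, i.e. localhost on port 8080 and the standard chat-completions endpoint):

import json
import urllib.request

# Chat completion request against the local mlx_lm.server instance started above.
payload = {
    "model": "LibraxisAI/QwQ-32B-MLX-Q5",
    "messages": [{"role": "user", "content": "Explain Q5 quantization in two sentences."}],
    "max_tokens": 300,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])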

Performance Benchmarks

Tested on Mac Studio M3 Ultra (512GB):

| Metric | Value |
|---|---|
| Model Size | 21 GB |
| Peak Memory Usage | ~25 GB |
| Generation Speed | ~12-15 tokens/sec |
| Max Context Length | 131,072 tokens (128k) |
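
To get a rough throughput number on your own machine, you can time a generation directly (a sketch; it assumes a recent MLX where the memory helpers are exposed at the top level of mlx.core, and note that generate(..., verbose=True) already prints its own tokens/sec and peak-memory stats):

import time
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/QwQ-32B-MLX-Q5")

start = time.perf_counter()
text = generate(model, tokenizer, prompt="Briefly explain unified memory.", max_tokens=256)
elapsed = time.perf_counter() - start

# Crude tokens/sec over the whole call (includes prompt processing).
print(f"~{len(tokenizer.encode(text)) / elapsed:.1f} tokens/sec")
print(f"peak memory: {mx.get_peak_memory() / 1024**3:.1f} GB")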

Special Features

QwQ (Qwen with Questions) is designed for:

  • Deep reasoning and step-by-step problem solving
  • Mathematical reasoning and logical deduction
  • Code generation with explanations
  • Self-reflection and error correction

Limitations

โš ๏ธ Important: This Q5 model as for the release date, of this quant is NOT compatible with LM Studio (yet), which only supports 2, 3, 4, 6, and 8-bit quantizations & we didn't test it with Ollama or any other inference client. Use MLX directly or via the MLX server - we've created a comfortable, command generation script to run the server properly.

Conversion Details

This model was quantized using:

uv run mlx_lm.convert \
  --hf-path Qwen/QwQ-32B \
  --mlx-path QwQ-32B-MLX-Q5 \
  --dtype bfloat16 \
  -q --q-bits 5 --q-group-size 64
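
To sanity-check the converted weights, you can inspect the quantization block that mlx_lm.convert records in the output config.json (a minimal sketch; the exact key layout may vary between mlx-lm releases):

import json
from pathlib import Path

# Path is the --mlx-path used in the conversion command above.
cfg = json.loads(Path("QwQ-32B-MLX-Q5/config.json").read_text())
print(cfg.get("quantization"))  # expected: {"group_size": 64, "bits": 5}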

Frontier M3 Ultra Optimization

This model is specifically optimized for the Mac Studio M3 Ultra setup with 512GB unified memory. For best performance:

import mlx.core as mx

# Cap MLX's Metal memory use for large models.
# (In recent MLX releases these helpers live at the top level of mlx.core;
# the older mx.metal.set_memory_limit / set_cache_limit variants are deprecated.)
mx.set_memory_limit(100 * 1024**3)  # 100 GB
mx.set_cache_limit(20 * 1024**3)    # 20 GB cache

Tools Included

We provide utility scripts for easy model management:

  1. convert-to-mlx.sh - command-generation tool for converting any Hugging Face model to MLX format, with extensive customization options and Q5 quantization support on mlx>=0.26.0
  2. mlx-serve.sh - Launch MLX server with custom parameters

Historical Note

The LibraxisAI Q5 models were among the first Q5 quantized MLX models available on Hugging Face, pioneering the use of 5-bit quantization for Apple Silicon optimization.

Citation

If you use this model, please cite:

@misc{qwq-32b-q5-mlx,
  author = {LibraxisAI},
  title = {QwQ-32B Q5 MLX - Reasoning Model for Apple Silicon},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/LibraxisAI/QwQ-32B-MLX-Q5}
}

License

This model follows the original QwQ license (Apache-2.0). See the base model card for full details.

Authors of the repository

  • Monika Szymanska
  • Maciej Gad, DVM

Acknowledgments

  • Apple MLX team and community for the amazing 0.26.0+ framework
  • Qwen team for the innovative QwQ reasoning model
  • Klaudiusz-AI 🐉