---
license: apache-2.0
language:
- en
- zh
base_model: YOYO-AI/Qwen3-30B-A3B-YOYO-V2
pipeline_tag: text-generation
tags:
- merge
- mlx
library_name: mlx
---
# Qwen3-30B-A3B-YOYO-V2-dwq5-mlx
Below is an analysis of YOYO-V2-dwq5's performance compared to the other quantized variants of YOYO-V2 itself (dwq3, dwq4, q6).
## Comparison Table (YOYO-V2 Quantized Variants)

| Task | dwq5 | dwq4 | dwq3 | q6 |
|---|---|---|---|---|
| arc_challenge | 0.523 | 0.511 | 0.497 | 0.532 |
| arc_easy | 0.682 | 0.655 | 0.657 | 0.685 |
| boolq | 0.883 | 0.879 | 0.876 | 0.886 |
| hellaswag | 0.676 | 0.673 | 0.686 | 0.683 |
| openbookqa | 0.436 | 0.450 | 0.414 | 0.456 |
| piqa | 0.778 | 0.772 | 0.785 | 0.782 |
| winogrande | 0.626 | 0.643 | 0.640 | 0.639 |
YOYO-V2-q6 posts the highest score on most tasks in this benchmark set.
## Critical Insights from YOYO-V2's Internal Quantization Comparison
### YOYO-V2-dwq5 Generally Improves Over Lower-DWQ Variants
- dwq5 surpasses dwq4 on most tasks (e.g., +0.027 on arc_easy, +0.012 on arc_challenge, +0.004 on boolq); dwq4 keeps a small edge only on openbookqa and winogrande.
- dwq5 surpasses dwq3 on most tasks (e.g., +0.025 on arc_easy, +0.026 on arc_challenge, +0.022 on openbookqa); dwq3 stays slightly ahead on hellaswag, piqa, and winogrande.
- This shows an overall upward trend as DWQ precision increases from 3-bit → 4-bit → 5-bit.
### YOYO-V2-dwq5 Is Closest to YOYO-V2-q6
- On 3 of 7 tasks, dwq5 is within 0.005 of q6 (arc_easy: 0.682 vs 0.685, boolq: 0.883 vs 0.886, piqa: 0.778 vs 0.782).
- On the remaining four tasks, dwq5 trails q6 by a somewhat larger margin:
  - arc_challenge (0.523 vs 0.532): -0.009
  - hellaswag (0.676 vs 0.683): -0.007
  - winogrande (0.626 vs 0.639): -0.013
  - openbookqa (0.436 vs 0.456): -0.020
- This suggests q6 retains slightly more precision for tasks requiring high attention to detail (e.g., winogrande, openbookqa). All of these per-task deltas can be recomputed from the comparison table, as in the sketch below.
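The gaps quoted above follow directly from the comparison table. The snippet below is a minimal sketch in plain Python: the score dictionary is copied from the table, and the task/variant names are just labels for this card, not an API.

```python
# Scores copied from the comparison table above.
scores = {
    "arc_challenge": {"dwq5": 0.523, "dwq4": 0.511, "dwq3": 0.497, "q6": 0.532},
    "arc_easy":      {"dwq5": 0.682, "dwq4": 0.655, "dwq3": 0.657, "q6": 0.685},
    "boolq":         {"dwq5": 0.883, "dwq4": 0.879, "dwq3": 0.876, "q6": 0.886},
    "hellaswag":     {"dwq5": 0.676, "dwq4": 0.673, "dwq3": 0.686, "q6": 0.683},
    "openbookqa":    {"dwq5": 0.436, "dwq4": 0.450, "dwq3": 0.414, "q6": 0.456},
    "piqa":          {"dwq5": 0.778, "dwq4": 0.772, "dwq3": 0.785, "q6": 0.782},
    "winogrande":    {"dwq5": 0.626, "dwq4": 0.643, "dwq3": 0.640, "q6": 0.639},
}

# Delta of dwq5 relative to each other variant, per task
# (positive means dwq5 scores higher).
for task, row in scores.items():
    deltas = {v: round(row["dwq5"] - row[v], 3) for v in ("dwq4", "dwq3", "q6")}
    print(f"{task:13s} {deltas}")
```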
### Why the Q6 Gap Persists
DWQ quantization (dynamic) and fixed Q6 quantization both retain most of the base model's quality, but q6 keeps marginal gains on high-precision tasks:

- boolq: q6's score (0.886) is the highest absolute value in this benchmark set.
- piqa: q6's lead over dwq5 (0.782 vs 0.778) is about 0.5% relative; small, but still relevant for logic-reasoning tasks.
## Practical Takeaways for Model Selection
| Quant | Best For | Why |
|---|---|---|
| dwq5 | Hardware with moderate resources | Best balance between speed and accuracy (5-bit DWQ) |
| q6 | High-precision tasks (e.g., reasoning) | Slightly better than dwq5 on most tasks; optimal for stability |
- For most use cases, q6 is still the top performer, holding a small but consistent edge over dwq5 (roughly 0.003–0.020 absolute across the seven tasks).
- dwq5 is ideal if you need to reduce the memory footprint while staying near q6 performance (e.g., on memory-constrained devices); see the rough size estimate sketched below.
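As a rough illustration of that trade-off, here is a back-of-the-envelope estimate of weight memory. It assumes ~30B total parameters stored at the nominal bits per weight and ignores quantization group scales, higher-precision embeddings, the KV cache, and activations, so the real on-disk and runtime sizes will differ.

```python
# Back-of-the-envelope weight-memory estimate for a ~30B-parameter model.
# Assumption: all weights stored at the nominal bit width; group scales,
# embeddings, KV cache, and activations are ignored.
PARAMS = 30e9

def weight_gib(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 2**30

for name, bits in [("dwq5 (5-bit)", 5), ("q6 (6-bit)", 6), ("bf16 baseline", 16)]:
    print(f"{name:14s} ~{weight_gib(bits):5.1f} GiB")
# dwq5 ~17.5 GiB, q6 ~21.0 GiB, bf16 ~55.9 GiB (weights only)
```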
dwq5 outperforms the lower-DWQ quantizations (dwq3, dwq4) on most tasks, showing a clear progression in quality as the DWQ bit width increases from 3 to 5 bits. However, it does not surpass YOYO-V2-q6: q6 maintains a small but consistent lead (0.003–0.020) on every task, with the largest gaps on openbookqa and winogrande.
This confirms that YOYO-V2's quality steadily improves with higher quantization fidelity within its own DWQ variants, but the fixed Q6 quantization still delivers marginal gains for critical tasks where even small precision losses are unacceptable.
In short: dwq5 > dwq4 > dwq3 on most tasks, but q6 remains the most reliable choice for high-stakes applications. For deployment: choose dwq5 when memory is constrained; use q6 for maximum accuracy.
This model **Qwen3-30B-A3B-YOYO-V2-dwq5-mlx** was converted to MLX format from [YOYO-AI/Qwen3-30B-A3B-YOYO-V2](https://huggingface.co/YOYO-AI/Qwen3-30B-A3B-YOYO-V2) using mlx-lm version **0.26.4**.
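For reference, a plain groupwise 5-bit conversion with mlx-lm looks roughly like the sketch below. This shows only the generic `mlx_lm.convert` flow; it is not the exact command used for this release, since the dwq5 weights additionally go through DWQ (distilled weight quantization) calibration, which is not shown, and the output path is just an illustrative name.

```bash
# Generic 5-bit quantized conversion with mlx-lm (sketch only).
# The published dwq5 weights involve an extra DWQ distillation step not shown here.
mlx_lm.convert \
    --hf-path YOYO-AI/Qwen3-30B-A3B-YOYO-V2 \
    --mlx-path Qwen3-30B-A3B-YOYO-V2-5bit-mlx \
    -q --q-bits 5
```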
## Use with mlx
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate

# Load the quantized model and its tokenizer.
model, tokenizer = load("Qwen3-30B-A3B-YOYO-V2-dwq5-mlx")

prompt = "hello"

# Apply the chat template when the tokenizer provides one.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
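The model can also be queried from the command line via the `mlx_lm.generate` entry point shipped with mlx-lm; the model argument below is assumed to be whatever local path or repo id you loaded above.

```bash
# Command-line generation with mlx-lm (model path/repo id is illustrative).
mlx_lm.generate --model Qwen3-30B-A3B-YOYO-V2-dwq5-mlx \
    --prompt "hello" --max-tokens 256
```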