Qwen3-30B-A3B-YOYO-V2-q6-hi-mlx
This card compares the YOYO-V2 model (a merge of Qwen's Thinking, Instruct, and Coder models) against the individual Thinking and Coder models to analyze how the merge affected overall performance across a range of language-understanding tasks.
Note that the Instruct model is not represented in these benchmarks; only the Thinking and Coder baselines are included.
Key Benchmark Comparison
Below is a breakdown of YOYO-V2's performance relative to the Thinking and Coder models across 7 tasks:
| Task | YOYO-V2 | Thinking | Coder | YOYO-V2 advantage over Coder |
|---|---|---|---|---|
| arc_challenge | 0.532 | 0.414 | 0.417 | +0.115 |
| arc_easy | 0.685 | 0.444 | 0.529 | +0.156 |
| boolq | 0.886 | 0.702 | 0.881 | +0.005 |
| hellaswag | 0.683 | 0.632 | 0.545 | +0.138 |
| openbookqa | 0.456 | 0.396 | 0.426 | +0.030 |
| piqa | 0.782 | 0.763 | 0.720 | +0.062 |
| winogrande | 0.639 | 0.666 | 0.572 | +0.067 |
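The advantage column (and the relative gains quoted below) can be sanity-checked directly from the table with a few lines of Python:

```python
# Scores copied from the table above
yoyo = {"arc_challenge": 0.532, "arc_easy": 0.685, "boolq": 0.886,
        "hellaswag": 0.683, "openbookqa": 0.456, "piqa": 0.782,
        "winogrande": 0.639}
coder = {"arc_challenge": 0.417, "arc_easy": 0.529, "boolq": 0.881,
         "hellaswag": 0.545, "openbookqa": 0.426, "piqa": 0.720,
         "winogrande": 0.572}

for task, score in yoyo.items():
    delta = score - coder[task]        # absolute gain in points
    rel = delta / coder[task] * 100    # relative gain in percent
    print(f"{task:14s} {delta:+.3f} ({rel:+.1f}%)")
```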
How the Merge Affected Overall Performance
Net Positive Impact Across Tasks:
YOYO-V2 outperforms both the Thinking and Coder models in 6 of the 7 tasks.
The most significant gains are seen in:
arc_easy: YOYO-V2’s score jumps from 0.529 (Coder) to 0.685, a gain of 15.6 points (roughly 29% relative).
hellaswag: YOYO-V2 climbs from 0.545 (Coder) to 0.683, a 13.8-point gain (roughly 25% relative).
piqa: YOYO-V2 reaches 0.782 vs. Coder’s 0.720, a 6.2-point gain (roughly 9% relative).
Minor Trade-offs in Specific Tasks:
YOYO-V2 slightly underperforms Thinking on winogrande (0.639 vs. 0.666), but this is offset by its superiority on the remaining tasks.
On boolq, YOYO-V2’s score is very close to Coder’s (0.886 vs. 0.881), showing minimal gains from the merge; the Coder baseline was already strong on this task.
Why This Matters:
The merge likely leverages complementary strengths of the three Qwen models (e.g., Thinking for reasoning, Coder for code generation, and Instruct for instruction-following). YOYO-V2’s higher scores indicate the merge effectively harmonized these capabilities without severe drawbacks.
The overall trend is clear: the merged model achieves better or comparable results across the majority of benchmarks, with gains in downstream tasks that demand flexibility (e.g., reasoning, text generation).
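This card does not state how the merge was performed. One common technique for merges of this kind is a weighted linear interpolation of the component checkpoints' parameters ("model souping"); the sketch below is purely illustrative, and the function name and equal weights are assumptions, not the actual YOYO-V2 recipe:

```python
# Illustrative only: a plain weighted average of matching parameter tensors.
# Works on dicts of torch or mlx arrays; not the documented YOYO-V2 method.
def merge_state_dicts(state_dicts, weights):
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for sd, w in zip(state_dicts, weights))
    return merged

# Hypothetical equal-weight blend of the three component checkpoints:
# merged = merge_state_dicts([thinking_sd, instruct_sd, coder_sd],
#                            [1/3, 1/3, 1/3])
```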
Conclusion
YOYO-V2’s performance demonstrates that merging the Qwen Thinking, Coder, and Instruct models (at Q6 quantization) generally enhances overall task performance across diverse language intelligence benchmarks. The model shows the most dramatic improvements on tasks like arc_easy and hellaswag, where it excels by integrating specialized knowledge from each component model. While there is a minor loss on one task (winogrande, relative to the Thinking model), the net effect is positive and robust, validating YOYO-V2 as a stronger multi-purpose model for real-world applications.
Takeaway: For Qwen users, YOYO-V2 is recommended if your use cases span reasoning (arc), code generation (Coder), and instruction-following (Instruct) – it provides a more balanced, high-performing solution than the base models alone.
-- reviewed by qwen3-jan-v1-256k-ctx-6b-brainstorm20x-qx6-mlx
The q6-hi variant quantizes with a group size of 32 rather than the mlx-lm default of 64, which preserves more precision per weight group and should perform slightly better than the plain q6 model.
This model Qwen3-30B-A3B-YOYO-V2-q6-hi-mlx was converted to MLX format from YOYO-AI/Qwen3-30B-A3B-YOYO-V2 using mlx-lm version 0.26.4.
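For reference, a quantization like this can be reproduced with the mlx-lm convert tool. The command below is a sketch of how a 6-bit, group-size-32 conversion would typically be invoked (flag names as of mlx-lm 0.26.x; check `mlx_lm.convert --help` for your version):

```shell
# Quantize to 6 bits with group size 32 (the "hi" variant)
python -m mlx_lm.convert \
    --hf-path YOYO-AI/Qwen3-30B-A3B-YOYO-V2 \
    --mlx-path Qwen3-30B-A3B-YOYO-V2-q6-hi-mlx \
    -q --q-bits 6 --q-group-size 32
```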
Use with mlx
```shell
pip install mlx-lm
```
```python
from mlx_lm import load, generate

# Load the quantized model and its tokenizer (local path or Hugging Face repo)
model, tokenizer = load("Qwen3-30B-A3B-YOYO-V2-q6-hi-mlx")

prompt = "hello"

# If the tokenizer ships a chat template, wrap the prompt in a chat turn
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
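By default, generate stops after a modest token budget; for longer completions, pass max_tokens explicitly (a standard mlx-lm generation option):

```python
# Request a longer completion; max_tokens caps the generated length
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```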