
QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha
- Solo Innovation: Breaking Performance Barriers with Minimal Resources -
Powered by personal research with insights from agentica-org

Overview

QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha is a language model built on top of the DeepSeek‑R1‑Distill‑Qwen‑1.5B base. Developed entirely by a solo researcher, with valuable inspiration from Berkeley's agentica-org work, the model employs a reinforcement learning distillation framework that substantially improves performance while keeping training-data requirements and compute costs to a minimum. Despite having only 1.5B parameters, the model achieves a 47.18 MMLU score and outperforms prior baselines on several math and reasoning benchmarks.


Data

Our training dataset comprises 6,170 meticulously curated problem–answer pairs drawn from high-quality sources:

  • AIME problems (QwQ-32B generated)
  • AMC problems (QwQ-32B generated)
  • MMLU problems (QwQ-32B generated)
  • Complementary academic math and reasoning datasets (QwQ-32B generated)

By focusing on a lean yet highly informative dataset, the model efficiently learns critical reasoning capabilities without the burden of excessive data volume.

All training samples were generated with QwQ-32B, referencing the benchmark datasets listed above together with other complementary datasets.
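
The curated problem–answer pairs can be stored in a simple JSONL layout. The field names below are illustrative assumptions for a sketch, not the published schema:

```python
import json

# Hypothetical record layout -- field names are assumptions, not the real schema.
sample = {
    "source": "AIME",  # originating benchmark family
    "problem": "How many positive integers n <= 100 make n^2 + 1 divisible by 5?",
    "answer": "40",    # final answer distilled from a QwQ-32B reasoning trace
}
line = json.dumps(sample)  # one record per line in the JSONL file
print(line)
```

Each line is an independent JSON object, which keeps the dataset easy to shard and stream during training.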

Training Recipe

To maximize performance with minimal resources, QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha utilizes an innovative training strategy that includes:

- Scaled Group Relative Policy Optimization (GRPO): an adaptation of PPO that normalizes the advantage function across samples generated from the same prompt.
- KL Divergence Regularization: additional regularization applied on top of the surrogate loss to prevent significant policy drift.
- Iterative Context Scaling: progressive expansion of the context length to boost model performance while reducing compute costs.
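
To illustrate the group-relative normalization at the heart of GRPO (a minimal sketch, not the project's actual training code), each sample's advantage is its reward standardized against the other rollouts drawn from the same prompt:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each rollout's reward against
    the mean and std of all rollouts for the same prompt (sketch only)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Four rollouts for one prompt: two correct (reward 1), two incorrect (reward 0).
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Because normalization happens within a prompt's own rollout group, no learned value network is needed to estimate a baseline, which is one reason GRPO is attractive at small compute budgets.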

Training was carried out on H200 GPUs for 336 hours at an exceptionally low cost of approximately $1,341. This carefully engineered approach makes it possible to obtain competitive performance with very limited training data.
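
Iterative context scaling can be pictured as a staged schedule of growing context lengths. The stage values below are hypothetical, since the exact schedule used in training is not published:

```python
def context_schedule(start=8192, target=32768, stages=4):
    """Evenly spaced context-length stages from `start` to `target`.
    Hypothetical values; the real training schedule is not published."""
    step = (target - start) // (stages - 1)
    return [start + i * step for i in range(stages)]

print(context_schedule())  # → [8192, 16384, 24576, 32768]
```

Starting with shorter sequences keeps early training cheap; later stages expose the model to the full context window it must handle at inference time.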


Evaluation

The model has been rigorously evaluated on a variety of challenging benchmarks. Below is a snapshot of the results:

| Benchmark    | Pass@1 | cons@64 | Avg. Token Count |
|--------------|--------|---------|------------------|
| MMLU         | 47.18  |         |                  |
| AIME 2024    | 33.33  | 53.33   | 21,191           |
| AIME 2025-I  | 34.58  | 40.00   | 17,952           |
| AIME 2025-II | 21.56  | 33.33   | 21,376           |
| AMC 2023     | 75.00  | 58.92   | 44.17            |
| MATH 5000    | 38.89  |         | 20,173           |
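
For reference, Pass@1 scores a single sampled answer against the reference, while cons@64 takes a majority vote over 64 independent samples (self-consistency). A minimal sketch of both metrics:

```python
from collections import Counter

def pass_at_1(answers, gold):
    """Fraction of independent samples whose final answer matches the reference."""
    return sum(a == gold for a in answers) / len(answers)

def cons_at_k(answers, gold):
    """Majority vote over k samples: correct iff the most common answer
    equals the reference (self-consistency decoding)."""
    majority, _ = Counter(answers).most_common(1)[0]
    return majority == gold

samples = ["204", "204", "210"]           # three sampled final answers
print(pass_at_1(samples, "204"))          # → 0.666...
print(cons_at_k(samples, "204"))          # → True
```

Majority voting can exceed single-sample accuracy whenever errors are scattered across different wrong answers, which is why cons@64 is often higher than Pass@1.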

Comparison

(Benchmark comparison chart: image not reproduced in this text version.)

Serving QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha

Deploy your model effortlessly using high-performance inference systems, including:

  • vLLM
  • Hugging Face Text Generation Inference (TGI)
  • SGLang
  • TensorRT-LLM

All these systems support the OpenAI Chat Completions API format, ensuring smooth integration into your applications.


How to use:

Runs on a single A40 GPU!

Serving Model:

vllm serve AXCXEPT/QwQ-32B-Distill-Qwen-1.5B-Alpha --max-model-len 32768 --enforce-eager

Call API Without Streaming:

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

prompt = """Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop."""
completion = client.chat.completions.create(
  model="AXCXEPT/QwQ-32B-Distill-Qwen-1.5B-Alpha",
  messages=[
    {"role": "user", "content": prompt}
  ]
)

print(completion.choices[0].message)

Call API With Streaming:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
prompt = """Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop."""
messages = [{"role": "user", "content": prompt}]
stream = client.chat.completions.create(model=model,
                                        messages=messages,
                                        stream=True)

print("client: Start streaming chat completions...")
printed_reasoning_content = False
printed_content = False

for chunk in stream:
    delta = chunk.choices[0].delta
    # A delta may carry reasoning tokens, answer tokens, or neither;
    # read both fields rather than only the first one present.
    reasoning_content = getattr(delta, "reasoning_content", None)
    content = getattr(delta, "content", None)

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("reasoning_content:", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    elif content is not None:
        if not printed_content:
            printed_content = True
            print("\ncontent:", end="", flush=True)
        # Extract and print the content
        print(content, end="", flush=True)

License

This project is released under the MIT License, reflecting our commitment to open and accessible AI. We firmly believe that cutting-edge AI research should be available for anyone to use, modify, and build upon.


Special Thanks

We extend our sincere gratitude to the following teams and organizations whose contributions and ideas were instrumental in this project:

  • Qwen Team (Alibaba Cloud): for creating the exceptional QwQ-32B model used as the distillation source.
  • Agentica-org (Berkeley Sky Computing Lab and Berkeley AI Research): for valuable insights and pioneering reinforcement learning techniques.
  • DeepSeek AI: for developing the robust foundational model upon which this research is built.

Their groundbreaking work made our innovations possible.
