Overview

newmindai/QwQ-32B-r1 is a LoRA adapter fine-tuned via reinforcement learning (RL) on top of the base model Qwen/QwQ-32B. Training incorporates:

  • ORMs (Open Reward Modules)
  • DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)
  • SimpleScaling (Multi-objective loss balancing)

This is an adapter, not a fully merged model. To use it, you must load it on top of the base model (Qwen/QwQ-32B) using the peft library.


Training Setup

Base Model

  • Architecture: QwQ-32B (Qwen-style transformer)
  • Libraries: transformers, trl, deepspeed, accelerate, vllm
  • Tokenizer: Custom-trained (compatible with Hugging Face format)

Reward Modules (ORMs)

Reward Function   Description
math              Evaluates symbolic math correctness (MathORM)
accuracy          Targets numeric accuracy (MathAccuracy)
format            Enforces strict formatting constraints
cosine            Measures similarity to gold responses
repetition        Penalizes repeated or degenerate outputs
soft_overlong     Soft penalty for overly long generations

These were combined and scaled during training with adaptive weighting.
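
The training code for the adapter is not published, so the exact aggregation is not reproduced here. The sketch below illustrates one plausible way to combine several reward signals with adaptive, normalized weights; all names (combine_rewards, adapt_weights, RewardFn) and the weighting rule itself are illustrative assumptions, not the released training code.

# Illustrative sketch only: combining multiple ORM signals with adaptive weights.
from typing import Callable, Dict, List

RewardFn = Callable[[str, str], float]  # (completion, reference) -> score

def combine_rewards(completion: str, reference: str,
                    reward_fns: Dict[str, RewardFn],
                    weights: Dict[str, float]) -> float:
    # Weighted sum of the individual reward signals (math, accuracy, format, ...).
    return sum(weights.get(name, 1.0) * fn(completion, reference)
               for name, fn in reward_fns.items())

def adapt_weights(weights: Dict[str, float],
                  recent_scores: Dict[str, List[float]]) -> Dict[str, float]:
    # Toy adaptive step: down-weight objectives whose recent scores have
    # saturated, so the remaining objectives keep driving the policy update.
    adapted = {}
    for name, w in weights.items():
        scores = recent_scores.get(name, [])
        mean = sum(scores) / len(scores) if scores else 0.0
        adapted[name] = w * (1.0 - min(mean, 0.9))
    norm = sum(adapted.values()) or 1.0  # renormalize so the weights sum to 1
    return {name: w / norm for name, w in adapted.items()}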

Scaling Techniques

  • DAPO: Stabilizes the RL objective through decoupled clipping ranges, dynamic sampling of prompts, and length-aware reward shaping; a sketch of the related soft overlong penalty follows this list.
  • SimpleScaling (newmindai/simplescaling): Controls optimizer behavior and reward balance across multiple objectives.
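
Of the rewards above, soft_overlong has the most mechanical shape, and DAPO describes a concrete form for it: no penalty while the generation stays under a length buffer, a linearly growing penalty inside the buffer, and a full penalty past the cap. The sketch below follows that shape; the max_len and buffer values are illustrative assumptions, not the actual training configuration.

# Sketch of a DAPO-style soft overlong penalty (the soft_overlong reward above).
# The length cap and buffer size are illustrative, not the training values.
def soft_overlong_penalty(length: int, max_len: int = 4096, buffer: int = 512) -> float:
    if length <= max_len - buffer:
        return 0.0                                      # well under the cap: no penalty
    if length <= max_len:
        return -(length - (max_len - buffer)) / buffer  # linear penalty inside the buffer
    return -1.0                                         # past the cap: full penalty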

Training Regime

  • Stage 1 (Wait #1): Model explores reward landscape; initial rewards unstable.
  • Stage 2 (Wait #2): Convergence improves as ORM signals align.
  • Aha Moment: Clear gains in math and formatting scores around 2K steps after warm-up.

Evaluation

🐍 Mezura-SnakeBench Benchmarking
Final performance was benchmarked using the Mezura SnakeBench framework — a standardized evaluation suite developed by NewmindAI for structured Turkish NLP tasks.


Usage Example (LoRA Adapter)

This adapter must be loaded on top of the base model Qwen/QwQ-32B using the peft library:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "Qwen/QwQ-32B"
adapter_id = "newmindai/QwQ-32B-r1"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_id)

# Inference
prompt = "Türkiye'nin en yüksek dağı nedir?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
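
If a standalone checkpoint is needed (for example to serve the model with vllm, which appears in the training stack above), the LoRA weights can be merged into the base model with peft's merge_and_unload(). The output directory below is an illustrative placeholder.

# Optional: merge the adapter into the base weights and save a standalone model.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("QwQ-32B-r1-merged")  # illustrative output path
tokenizer.save_pretrained("QwQ-32B-r1-merged")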