MIRA-QA-Group3

A student scorer from MIRA (Mid-training Rubric Anchoring for Source-Aware Data Selection), fine-tuned to score mathematical reasoning QA (with thinking traces) along a group-specific set of anchor rubric dimensions.

📄 Paper: MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection (EMNLP 2026)
💻 Code: https://github.com/Multilingual-Multimodal-NLP/mira

TL;DR

MIRA is a source-aware data selection framework for heterogeneous mid-training corpora. Instead of applying a single global quality rubric, MIRA (1) clusters sources into capability-coherent groups, (2) lets a frontier teacher (Kimi-K2.6) freely propose rubric dimensions and anchors them per group, (3) distills the anchored teacher into a lightweight per-group student scorer, and (4) applies reliability-aware aggregation with per-source retention thresholds.

This repository is one of those student scorers — variant 3 in the QA family, specialized for mathematical reasoning QA (with thinking traces). Given an in-distribution record, it produces a numerical score and a short rationale for every anchor dimension in this group's rubric.

Model summary


| Field | Value |
| --- | --- |
| Architecture | Mixture-of-Experts decoder (35B total / ≈3B active params) |
| Base model | Qwen3.5-35B-A3B-Base |
| Fine-tuning | Full-parameter SFT on Kimi-K2.6 anchored teacher labels |
| Domain | Mathematical reasoning corpora with explicit chain-of-thought: AceReason-1.1, lean_qa, mathcode, QwQ Big-SFT, DeepSeek-R1 distill, Kimi distill. Uses the qa_with_think prompt family (5 think + 5 combined + 5 QA anchor dims). |
| Anchor rubric | 15 group-specific dimensions (`group_2_dim_anchors.jsonl` in the project repo) |
| Source count | 6 QA sources |
| Output | Structured (score, rationale) per anchor dimension |
| Precision | BF16 |
| License | Apache-2.0 (inherits from Qwen3) |

Sources covered

This scorer is calibrated for the following mid-training sources in the QA / Math reasoning group:

| Source | Description |
| --- | --- |
| `ace_reason` | AceReason-1.1-SFT |
| `lean_qa` | Lean theorem-prover QA |
| `mathcode` | Math-code distill QA |
| `qwq` | QwQ Big-SFT-Data think-dedup |
| `reason_dpsk32` | DeepSeek-R1 32B distill outputs |
| `reason_kimi` | Kimi distill outputs |

The full source-grouping report (KMeans k=4 / 5 clusters, intra-group cosine similarities) is in the project repo.

Anchor dimensions (15 slots)

The scoring rubric for this group was discovered via Kimi-K2.6 free-form judging and clustered into 15 anchor dimensions (KMeans k=15 over the group's dim-score embeddings). The dimensions below are sorted by cluster size; larger clusters dominate the corpus and carry more signal. Anchor names are read verbatim from this group's group_2_dim_anchors.jsonl; some names recur across slots (for example, Answer Completeness at A8 and A11) because semantically related but distinct rubric facets were clustered separately by the teacher.

| Slot | Dimension | Cluster size |
| --- | --- | --- |
| A1 | Factual Correctness | 11,915 |
| A2 | Difficulty Appropriateness | 10,521 |
| A3 | Educational Value | 10,206 |
| A4 | Difficulty Level | 10,142 |
| A5 | Specificity | 7,645 |
| A6 | Answer Depth | 6,902 |
| A7 | Think-Response Consistency | 4,881 |
| A8 | Answer Completeness | 4,858 |
| A9 | Verifiability | 4,613 |
| A10 | Pedagogical Value | 4,537 |
| A11 | Answer Completeness | 4,291 |
| A12 | Question Clarity | 4,107 |
| A13 | Answer Conciseness | 3,812 |
| A14 | Actionability | 3,783 |
| A15 | Practical Applicability | 2,787 |

The scorer outputs one [Ai] <dimension>: <score>/10 — <rationale> line per slot, plus overall, training_recommendation, domain_tag, and brief.
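
The output format described above can be read back with a couple of regular expressions. The following is a minimal sketch, not the project's own parser: the function name `parse_scorer_output`, the two patterns, and the sample text are all illustrative assumptions, and it assumes the score and rationale are separated by an em dash (U+2014) or a plain hyphen.

```python
import re

# One "[A<k>] <dimension>: <score>/10 <dash> <rationale>" line per anchor slot.
# \u2014 is the em dash shown in the card's format; a plain hyphen is also accepted.
DIM_LINE = re.compile(
    r"^\[A(?P<slot>\d+)\]\s*(?P<name>.+?):\s*"
    r"(?P<score>\d+(?:\.\d+)?)/10\s*[\u2014-]\s*(?P<rationale>.*)$"
)
# Trailing summary fields named in the card.
FIELD_LINE = re.compile(
    r"^(?P<key>overall|training_recommendation|domain_tag|brief):\s*(?P<value>.*)$"
)


def parse_scorer_output(text: str) -> dict:
    """Return {'dims': {slot: {...}}} plus any summary fields found in the text."""
    result = {"dims": {}}
    for raw in text.splitlines():
        line = raw.strip()
        m = DIM_LINE.match(line)
        if m:
            result["dims"][int(m["slot"])] = {
                "name": m["name"],
                "score": float(m["score"]),
                "rationale": m["rationale"],
            }
            continue
        m = FIELD_LINE.match(line)
        if m:
            result[m["key"]] = m["value"]
    return result


# Made-up sample output, for illustration only.
sample = (
    "[A1] Factual Correctness: 8/10 \u2014 steps check out\n"
    "[A7] Think-Response Consistency: 6/10 \u2014 answer drifts from trace\n"
    "overall: 72\n"
    "training_recommendation: keep\n"
)
parsed = parse_scorer_output(sample)
print(parsed["dims"][1]["score"], parsed["overall"])
```

Lines that match neither pattern are skipped silently, so a caller that needs all 15 slots should check `len(parsed["dims"])` before trusting a record's scores.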

Where this model fits in the MIRA pipeline

┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│ 1. Rubric        │  │ 2. Anchored      │  │ 3. Reliability   │  │ 4. Data          │
│    Discovery     │→ │    Judge         │→ │    Aggregation   │→ │    Selection     │
│ (Kimi-K2.6,      │  │    Distillation  │  │ (mask unreliable │  │ (per-source      │
│  free-form       │  │ ◀── THIS MODEL   │  │  src×dim cells)  │  │  retention)      │
│  judging)        │  │                  │  │                  │  │                  │
└──────────────────┘  └──────────────────┘  └──────────────────┘  └──────────────────┘

MIRA-QA-Group3 lives in Stage 2: it scores the full QA / Math reasoning corpus so that downstream stages can apply reliability masking and source-aware retention.
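
To make the hand-off to Stages 3 and 4 concrete, here is a rough sketch of how masked (source, dimension) cells and per-source thresholds might be applied to this scorer's output. It is an assumption-laden illustration, not the paper's aggregation rule: the unweighted mean, the function names `aggregate` and `select`, the mask contents, and the threshold values are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def aggregate(record: dict, masked: Set[Tuple[str, str]]) -> float:
    """Mean of a record's dimension scores, skipping masked (source, dim) cells."""
    kept = [
        score
        for dim, score in record["scores"].items()
        if (record["source"], dim) not in masked
    ]
    return sum(kept) / len(kept) if kept else 0.0


def select(records: List[dict], masked: Set[Tuple[str, str]],
           thresholds: Dict[str, float]) -> List[dict]:
    """Keep records whose aggregated score meets their own source's threshold."""
    return [r for r in records if aggregate(r, masked) >= thresholds[r["source"]]]


# Hypothetical scores, mask, and thresholds.
records = [
    {"source": "lean_qa", "scores": {"A1": 9.0, "A9": 2.0}},
    {"source": "qwq", "scores": {"A1": 5.0, "A9": 6.0}},
    {"source": "qwq", "scores": {"A1": 8.0, "A9": 8.0}},
]
masked = {("lean_qa", "A9")}           # A9 treated as unreliable for lean_qa
thresholds = {"lean_qa": 7.0, "qwq": 7.0}

selected = select(records, masked, thresholds)
print(len(selected), "of", len(records), "records kept")
```

In this toy example the first record survives only because its low A9 score falls in a masked cell, which is the kind of effect a per-cell reliability mask is meant to have.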

Intended use

Primary: Score mathematical reasoning QA (with thinking traces) on this group's anchor dimensions to drive source-aware data selection and filtering.
Secondary: Research on rubric distillation, semantic quality scoring, and reliability diagnostics for heterogeneous training corpora.

Not intended for:

General-purpose chat or instruction following — fine-tuned to emit structured scores, not freeform dialogue.
Single-shot quality judgments without the anchor-dimension prompt template — outputs will be miscalibrated.
Records outside the QA / Math reasoning group; use the matching sibling scorer instead.

Deployment

The scorer is designed to be served via vLLM behind an OpenAI-compatible endpoint and called in batch from the MIRA scoring pipeline.

1. Serve with vLLM (recommended)

vllm serve whw06/MIRA-QA-Group3 \
    --tensor-parallel-size 8 \
    --dtype bfloat16 \
    --max-model-len 65536 \
    --max-num-batched-tokens 131072 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --port 8000

Why these values (verified on H200 141GB during the paper's per-source evaluation):

max-model-len=65536 — 2× the mid-training cutoff. Records can hit ~60K tokens for densely-tokenized sources; 40K runs into prompt-overflow errors.
max-num-batched-tokens=131072 — supports two full-length sequences per scheduling step.
gpu-memory-utilization=0.9 — 35B BF16 weights take ~70GB, leaving ~57GB KV cache. Roughly 4 concurrent 65K-context sequences per GPU.
8-way tensor parallel works well for the 35B MoE on a single 8×H200/A100 node.
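
The figures in the notes above can be reproduced with simple arithmetic. This sketch assumes 2 bytes per parameter for BF16, decimal gigabytes, and the whole model resident on a single 141 GB GPU; those assumptions are mine, chosen because they recover the quoted numbers.

```python
# Back-of-the-envelope check of the memory and batching figures quoted above.
# Assumptions: 35e9 parameters, 2 bytes each (BF16), decimal GB, one 141 GB GPU.
params = 35e9
bytes_per_param = 2
gpu_gb = 141
gpu_memory_utilization = 0.9

weights_gb = params * bytes_per_param / 1e9      # total weight size
budget_gb = gpu_gb * gpu_memory_utilization      # memory the server may claim
kv_cache_gb = budget_gb - weights_gb             # remainder available for KV cache

max_model_len = 65536
max_num_batched_tokens = 131072
full_len_seqs_per_step = max_num_batched_tokens // max_model_len

print(f"weights ~{weights_gb:.0f} GB, budget ~{budget_gb:.0f} GB, "
      f"KV cache ~{kv_cache_gb:.0f} GB")
print(f"full-length sequences per scheduling step: {full_len_seqs_per_step}")
```

With tensor parallelism the weights are split across GPUs, so per-GPU headroom in the 8-way configuration should be larger than this single-GPU estimate suggests.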

2. Call from Python

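from openai import OpenAI

# SYSTEM_PROMPT and USER_PROMPT are produced by the prompt builder in the
# project repo (scoring/score_qa_anchored.py); the values below are placeholders.
SYSTEM_PROMPT = "<system prompt with the anchor calibration references>"
USER_PROMPT = "<record to score + [A1]..[A15] output template>"

# Point the OpenAI client at the local vLLM server started in step 1.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="whw06/MIRA-QA-Group3",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},   # group-2 anchor calibration
        {"role": "user",   "content": USER_PROMPT},     # record + [A1]..[A15] template
    ],
    temperature=0.7,
    top_p=0.95,
    max_tokens=2048,
)
print(resp.choices[0].message.content)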
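
Since the scorer is meant to be called in batch, one possible way to fan requests out is a thread pool around a single-record scoring function. This is a generic sketch rather than the MIRA pipeline's code: `score_batch` and `fake_score_one` are hypothetical names, and the stand-in function replaces a real call to `client.chat.completions.create(...)` so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def score_batch(
    user_prompts: Iterable[str],
    score_one: Callable[[str], str],
    max_workers: int = 8,
) -> List[str]:
    """Run score_one over all prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_one, user_prompts))


# Stand-in for a function that sends one chat-completion request and
# returns the response text.
def fake_score_one(prompt: str) -> str:
    return f"overall: {len(prompt)}"


results = score_batch(["record a", "record bb"], fake_score_one, max_workers=2)
print(results)
```

Threads are sufficient here because each call spends its time waiting on the HTTP response; `max_workers` would need tuning against the server's concurrency limits.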

3. Prompt template

The user message asks for one structured line per anchor dimension (top-15 of this group):

[A1] {anchor_dim_1}: <score>/10 — <justification>
[A2] {anchor_dim_2}: <score>/10 — <justification>
...
[A15] {anchor_dim_15}: <score>/10 — <justification>
overall: <0-100>
training_recommendation: <keep | downsample | drop>
domain_tag: <short tag>
brief: <one-sentence summary>

The system prompt embeds the top-12 anchor calibration references (canonical examples from clustering) so the student matches the teacher's scoring scale. The full prompt builder, anchor JSONL files, and output parser are in the project repo's scoring/score_qa_anchored.py.
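
For readers who only need the response skeleton, it can be generated from a list of anchor names. The helper below is a hypothetical illustration (`build_output_template` is not the project's prompt builder) and reproduces only the line layout shown in this section.

```python
from typing import List


def build_output_template(anchor_names: List[str]) -> str:
    """Render one [Ai] line per anchor name, followed by the four summary fields."""
    lines = [
        f"[A{i}] {name}: <score>/10 \u2014 <justification>"
        for i, name in enumerate(anchor_names, start=1)
    ]
    lines += [
        "overall: <0-100>",
        "training_recommendation: <keep | downsample | drop>",
        "domain_tag: <short tag>",
        "brief: <one-sentence summary>",
    ]
    return "\n".join(lines)


# First three anchor names from this card's table, for illustration.
template = build_output_template(
    ["Factual Correctness", "Difficulty Appropriateness", "Educational Value"]
)
print(template)
```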

Training details


| Item | Details |
| --- | --- |
| Teacher | Kimi-K2.6 (free-form rubric discovery in Phase 1; anchored re-scoring in Phase 2) |
| Training data | Kimi-K2.6 anchored labels on this group's Phase-2 corpus, split into a distillation set + a held-out validation split for reliability diagnostics |
| Loss | Standard next-token CE over (score, rationale) labels for every anchor dimension |
| Hyperparameters | Held constant across all MIRA student scorers; full settings in paper Appendix A.4 |
| Validation | Per-dimension teacher–student MAE and Spearman ρ on a held-out split; dimensions failing reliability thresholds are masked post-hoc (Figure 3 in the paper) |

Training loss / step curve is preserved in trainer_state.json for full reproducibility.
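
The validation metrics named above (per-dimension teacher–student MAE and Spearman ρ) can be computed without extra dependencies. The sketch below is illustrative only: the function names, the toy scores, and the cut-offs `max_mae=1.5` and `min_rho=0.5` are assumptions, not the thresholds used in the paper.

```python
from typing import Dict, List, Sequence


def mean_absolute_error(teacher: Sequence[float], student: Sequence[float]) -> float:
    return sum(abs(t - s) for t, s in zip(teacher, student)) / len(teacher)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank vectors (undefined for constant input)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def reliable_dims(teacher: Dict[str, list], student: Dict[str, list],
                  max_mae: float = 1.5, min_rho: float = 0.5) -> Dict[str, bool]:
    """Flag each dimension as reliable under the hypothetical cut-offs."""
    return {
        dim: mean_absolute_error(teacher[dim], student[dim]) <= max_mae
        and spearman_rho(teacher[dim], student[dim]) >= min_rho
        for dim in teacher
    }


# Toy held-out scores for two dimensions.
teacher = {"A1": [8, 6, 9, 3], "A9": [5, 7, 2, 8]}
student = {"A1": [7, 6, 9, 4], "A9": [8, 3, 7, 2]}
print(reliable_dims(teacher, student))
```

Dimensions flagged `False` here correspond to the ones that would be masked post-hoc during aggregation.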

Headline results (from the paper)

End-to-end downstream evaluation: Qwen2.5-Coder-14B mid-trained on 25B-token MIRA-selected subsets vs. baselines, then SFT, evaluated on 9 code benchmarks across 4 categories.

| Method | Code Gen | MultiPL-E | SQL (EX) | SWE-Multi | Macro Avg |
| --- | --- | --- | --- | --- | --- |
| Base + SFT (no mid) | 53.91 | 72.57 | 64.24 | 3.67 | 48.60 |
| Raw Mixture (50B) | 53.71 | 67.42 | 94.18 | 40.00 | 63.83 |
| Random (25B) | 52.71 | 71.44 | 91.03 | 35.00 | 63.23 |
| DataMan (25B) | 53.82 | 71.38 | 93.84 | 33.00 | 63.01 |
| DSIR (25B) | 48.74 | 67.26 | 95.20 | 27.00 | 59.55 |
| PPL (25B) | 50.52 | 57.74 | 90.66 | 20.00 | 54.73 |
| MIRA-Global (25B) | 53.12 | 67.84 | 94.26 | 32.00 | 61.81 |
| MIRA-Group (25B) | 54.53 | 71.85 | 94.08 | 36.33 | 64.20 |
| MIRA-Source (25B) | 54.18 | 72.84 | 94.38 | 30.33 | 62.93 |

MIRA-Group matches the full 50B-token raw mixture on the macro average (64.20 vs. 63.83) while using only half the tokens, and outperforms all 25B-token selection baselines on the macro average. This scorer is one of the 12 student models used by the MIRA-Group variant.

Sibling models

MIRA releases one student scorer per source-group variant. Use the matching scorer for each record's format:

Agent: whw06/MIRA-Agent-Group1 · -Group2 · -Group3 · -Group4
QA: whw06/MIRA-QA-Group1 · -Group2 · MIRA-QA-Group3 (this model) · -Group4 · -Group5
Text: whw06/MIRA-Text-Group1 · -Group2 · -Group3

Limitations

MIRA addresses source-aware filtering only. Source discovery, mixture-ratio design, curriculum scheduling, deduplication and contamination control remain orthogonal concerns.
This scorer is calibrated against the QA / Math reasoning group; cross-domain transfer is not advised — use the matching sibling for other source formats.
Some anchor dimensions exhibit high teacher–student MAE and are masked post-hoc during aggregation (see paper §3.4). The model still emits scores for masked dimensions; downstream consumers should re-apply the reliability mask from the project repository.
Calibrated on 6 sources within this group; behavior on out-of-distribution formats is unverified.

Citation

@inproceedings{wang2026mira,
  title     = {MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection},
  author    = {Wang, Haowen and Du, Yaxin and Yang, Jian and Wu, Jiajun and
               Liu, Shukai and Zhang, Yuxuan and Wang, Pingjie and Chen, Siheng and
               Zheng, Tuney and Zhou, Ming and Liu, Xianglong},
  booktitle = {Proceedings of the 2026 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2026}
}

Acknowledgments

Built on Qwen3.5-35B-A3B-Base and the Megatron-LM training stack. Teacher labels generated with Kimi-K2.6.
