Weaver Distilled for All Datasets (gte-Qwen2-1.5B-instruct)

A general-purpose distilled cross-encoder based on gte-Qwen2-1.5B-instruct, trained to predict the correctness of reasoning responses across multiple domains: mathematics (MATH500), science (GPQA), and academic knowledge (MMLU-Pro). The verifier was trained on Weaver scores aggregated over 35 different verifiers and reward models.

Model Details

Base Model: Alibaba-NLP/gte-Qwen2-1.5B-instruct (1.5B parameters)
Architecture: Cross-encoder with MLP head (1536 → 768 → 384 → 1)
Max Sequence Length: 4096 tokens
Training Data: Combined MATH500, GPQA, and MMLU-Pro with Weaver scores from 35 LM judges and reward models
Task: Binary classification for answer correctness prediction across domains
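
To make the head dimensions above concrete, here is a minimal PyTorch sketch of an MLP with the listed widths. Only the layer sizes (1536 → 768 → 384 → 1) come from this card; the ReLU activations, the bias terms, and the variable names `head` and `pooled` are assumptions for illustration, and the released checkpoint may differ.

```python
import torch
from torch import nn

# Hypothetical reconstruction of the scoring head. Only the layer widths
# (1536 -> 768 -> 384 -> 1) come from the model card; the ReLU activations
# and the use of bias terms are assumptions.
head = nn.Sequential(
    nn.Linear(1536, 768),
    nn.ReLU(),
    nn.Linear(768, 384),
    nn.ReLU(),
    nn.Linear(384, 1),  # single logit; a sigmoid maps it to a score in (0, 1)
)

# Stand-in for two pooled encoder outputs (hidden size 1536).
pooled = torch.zeros(2, 1536)
logits = head(pooled)  # one logit per input pair

# Parameter count of the head alone, under the assumptions above.
num_params = sum(p.numel() for p in head.parameters())
```

Under these assumptions the head adds roughly 1.5M parameters, which is small next to the 1.5B-parameter base model.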

Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "hazyresearch/Weaver_Distilled_All_Datasets_gte-Qwen2-1.5B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage - works across math, science, and academic domains
instruction = "What is the derivative of f(x) = 3x² + 2x - 1?"
response = "Using the power rule: f'(x) = 6x + 2. The derivative of 3x² is 6x, the derivative of 2x is 2, and the derivative of -1 is 0."

# Tokenize input pair
inputs = tokenizer(
    instruction,
    response,
    truncation=True,
    max_length=4096,
    padding=True,
    return_tensors="pt"
)

# Get correctness score
with torch.no_grad():
    outputs = model(**inputs)
    score = torch.sigmoid(outputs.logits).item()
    
print(f"Correctness score: {score:.3f}")
print(f"Prediction: {'Correct' if score > 0.5 else 'Incorrect'}")
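
A correctness score like the one above can also be used to choose among several candidate responses to the same instruction. The following is a minimal sketch of that idea, not part of the released code: `rank_responses` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for a function that would wrap the tokenizer and model calls from the Quick Start.

```python
from typing import Callable, List, Tuple


def rank_responses(
    instruction: str,
    responses: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each candidate response and return them sorted best-first."""
    scored = [(response, score_fn(instruction, response)) for response in responses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(instruction: str, response: str) -> float:
    # Hypothetical stand-in scorer so the sketch runs without the model.
    # In practice this would return the sigmoid of the verifier's logit.
    return 0.9 if "6x + 2" in response else 0.1


candidates = ["f'(x) = 3x + 2", "f'(x) = 6x + 2"]
ranked = rank_responses(
    "What is the derivative of f(x) = 3x² + 2x - 1?", candidates, toy_score
)
best_response, best_score = ranked[0]
print(f"Best response: {best_response} (score {best_score:.3f})")
```

Passing the scorer in as a function keeps the ranking logic independent of how the score is computed, so the same helper works with a batched or cached scorer.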

Training Details

This model was trained using the Weaver distillation pipeline on a combined dataset spanning multiple reasoning domains. For training your own distilled models, see the distillation README.

Citation

@misc{saadfalcon2025shrinkinggenerationverificationgapweak,
      title={Shrinking the Generation-Verification Gap with Weak Verifiers}, 
      author={Jon Saad-Falcon and E. Kelly Buchanan and Mayee F. Chen and Tzu-Heng Huang and Brendan McLaughlin and Tanvir Bhathal and Shang Zhu and Ben Athiwaratkun and Frederic Sala and Scott Linderman and Azalia Mirhoseini and Christopher Ré},
      year={2025},
      eprint={2506.18203},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2506.18203}, 
}
