Meta-rater Language Model - All (25) Quality Scores (1.3B Parameters, 30B Tokens)
This repository contains the model described in the paper "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models".
Code: https://github.com/opendatalab/Meta-rater
Model Description
This is a 1.3B-parameter, decoder-only transformer language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the Meta-rater framework with all 25 quality scores. It is the flagship model of the Meta-rater research, combining natural language quality signals, data importance scores, and model-based ratings through learned optimal weightings.
Model Details
- Architecture: Transformer decoder-only
- Parameters: 1.345B (1,345,423,360 parameters)
- Training Tokens: 30B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Data Selection Method: Meta-rater with all 25 quality scores
- Optimization: Learned optimal weightings through 256 proxy models
Architecture Specifications
- Hidden Dimension: 2,048
- Number of Layers: 24
- Attention Heads: 16
- Key-Value Heads: 16
- MLP Ratio: 8/3
- Position Encoding: RoPE (base=10,000)
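These specifications are consistent with the 1,345,423,360-parameter figure above. A minimal sanity-check sketch, assuming a LLaMA-style block (SwiGLU MLP with intermediate size 5,504, i.e., 2,048 × 8/3 rounded up to a multiple of 256; RMSNorm; untied input/output embeddings — all assumptions, since the card does not list them):

# Sanity check: reconstruct the reported 1,345,423,360 parameters from the
# specs above. intermediate_size = 5,504 is an assumption (2,048 * 8/3 =
# 5,461.3, rounded up to a multiple of 256), as are the RMSNorm weights and
# untied input/output embeddings.
vocab, hidden, layers, intermediate = 32_000, 2_048, 24, 5_504

embeddings = 2 * vocab * hidden         # input embedding + LM head (untied)
attention  = 4 * hidden * hidden        # Q, K, V, O projections
mlp        = 3 * hidden * intermediate  # gate, up, down projections (SwiGLU)
norms      = 2 * hidden                 # two RMSNorm weights per layer

total = embeddings + layers * (attention + mlp + norms) + hidden  # + final norm
print(f"{total:,}")  # 1,345,423,360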
Meta-rater Framework
The training data was selected using the complete Meta-rater framework, which integrates three groups of signals:
Natural Language Quality Signals (11)
- RedPajama rule-based measures (word count, entropy, unique words, etc.)
- Text naturalness and linguistic integrity indicators
Data Importance Scores (3)
- DSIR similarity to Books, Wikipedia, and AutoMathText
- Domain-specific quality assessment
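For intuition, DSIR ranks documents by the log-likelihood ratio of hashed n-gram features under a target-domain model versus a raw-data model. Below is an illustrative toy sketch of that scoring step (unigram hashing, the small bucket count, and add-one smoothing are simplifying assumptions, not the authors' implementation):

# Illustrative-only sketch of DSIR-style importance scoring: rank a document
# by the log-likelihood ratio of its hashed token features under a target
# corpus model vs. a raw corpus model.
import hashlib
import math
from collections import Counter

NUM_BUCKETS = 10_000  # toy feature-space size (assumption)

def buckets(text: str) -> list[int]:
    # Hash each token into a fixed number of feature buckets.
    return [int(hashlib.md5(w.encode()).hexdigest(), 16) % NUM_BUCKETS
            for w in text.lower().split()]

def log_probs(corpus: list[str]) -> dict[int, float]:
    # Add-one-smoothed bag-of-buckets log-probabilities.
    counts = Counter(b for doc in corpus for b in buckets(doc))
    total = sum(counts.values()) + NUM_BUCKETS
    return {b: math.log((counts[b] + 1) / total) for b in range(NUM_BUCKETS)}

def dsir_score(doc: str, logp_target: dict, logp_raw: dict) -> float:
    # Higher score = more similar to the target domain (e.g., Books, Wikipedia).
    return sum(logp_target[b] - logp_raw[b] for b in buckets(doc))

logp_t = log_probs(["the history of rome", "a novel about love"])      # "target"
logp_r = log_probs(["click here to buy now", "free free free deals"])  # "raw"
print(dsir_score("a novel about rome", logp_t, logp_r) > 0)  # True
print(dsir_score("buy now free deals", logp_t, logp_r) < 0)  # True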
Model-based Ratings (11)
- PRRC (4): Professionalism, Readability, Reasoning, Cleanliness
- QuRating (4): Required Expertise, Writing Style, Facts & Trivia, Educational Value
- FineWeb-Edu (1): Educational value assessment
- WanjuanCC (2): Advertisement detection, Fluency evaluation
Optimal Weighting
Top contributing quality scores with learned weights:
- Educational Value (5.64%)
- doc_frac_no_alph_words (4.93%)
- FineWeb-Edu (4.93%)
- lines_uppercase_letter_fraction (4.88%)
- Facts & Trivia (4.77%)
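As an illustration of how learned weights combine per-document ratings into a single selection score, here is a hypothetical sketch using only the five weights above (the key names, the normalization of raw scores, and the orientation of rule-based signals such as doc_frac_no_alph_words are assumptions; the remaining 20 raters are elided):

# Hypothetical combination step using the five learned weights listed above.
TOP_WEIGHTS = {
    "educational_value":               0.0564,
    "doc_frac_no_alph_words":          0.0493,
    "fineweb_edu":                     0.0493,
    "lines_uppercase_letter_fraction": 0.0488,
    "facts_and_trivia":                0.0477,
    # ... the remaining 20 quality scores carry the other ~75% of the weight
}

def composite_score(doc_scores: dict[str, float]) -> float:
    """Weighted sum of (normalized) quality scores for one document;
    documents are then ranked and the top fraction kept for pre-training."""
    return sum(w * doc_scores.get(name, 0.0) for name, w in TOP_WEIGHTS.items())

doc = {"educational_value": 0.9, "fineweb_edu": 0.8, "facts_and_trivia": 0.7,
       "doc_frac_no_alph_words": 0.1, "lines_uppercase_letter_fraction": 0.2}
print(round(composite_score(doc), 4))  # 0.1383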
Training Details
- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~14 hours
- Meta-rater Construction: 256 proxy models for optimal weight learning
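A back-of-envelope check of the schedule these hyperparameters imply (the step count is an inference from the listed numbers, not taken from the paper):

# Back-of-envelope training schedule from the listed hyperparameters.
tokens_total    = 30_000_000_000  # 30B training tokens
tokens_per_step = 4_194_304       # global batch size in tokens (= 2**22)
context         = 1_024           # context window

steps = tokens_total / tokens_per_step
sequences_per_step = tokens_per_step // context
print(f"{steps:,.0f} steps, {sequences_per_step:,} sequences per step")
# -> 7,153 steps, 4,096 sequences per step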
Performance Results
Downstream Task Performance (Average Accuracy)
General Knowledge: 58.90% (+6.11% vs Random)
- ARC-Easy: 58.25%
- ARC-Challenge: 29.86%
- SciQ: 88.60%
Commonsense Reasoning: 45.41% (+1.47% vs Random)
- HellaSwag: 39.81%
- SIQA: 42.68%
- WinoGrande: 53.75%
Reading Comprehension: 31.55% (+1.53% vs Random)
- RACE: 31.10%
- OpenbookQA: 32.00%
Overall Average: 47.01% (+3.23% vs Random)
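The category and overall numbers above are unweighted means over the individual benchmarks, which can be verified directly:

# Verify the reported averages (unweighted means over the benchmarks).
from statistics import mean

general = [58.25, 29.86, 88.60]  # ARC-Easy, ARC-Challenge, SciQ
common  = [39.81, 42.68, 53.75]  # HellaSwag, SIQA, WinoGrande
reading = [31.10, 32.00]         # RACE, OpenbookQA

print(round(mean(general), 2))                     # 58.9  (reported 58.90)
print(round(mean(common), 2))                      # 45.41
print(round(mean(reading), 2))                     # 31.55
print(round(mean(general + common + reading), 2))  # 47.01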
Key Achievements
- State-of-the-Art: Outperforms all baseline data selection methods evaluated
- Convergence Speed: 2x faster convergence compared to random selection
- Token Efficiency: Matches Random-60B performance using only 30B tokens
- Holistic Quality: Balanced improvements across all task categories
- Multi-dimensional: Successfully integrates 25 complementary quality metrics
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-25raters"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text from a prompt
prompt = "The advancement of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Applications
This model is well-suited for:
- General-purpose language modeling with high quality standards
- Research requiring state-of-the-art baseline performance
- Educational applications across multiple domains
- Content generation with balanced quality across dimensions
- Multi-domain tasks requiring diverse capabilities
- Production systems needing reliable, high-quality text generation
Strengths
- Holistic Quality: Balanced performance across all evaluation dimensions
- Training Efficiency: Superior token efficiency compared to random selection
- Robust Performance: Consistent improvements across diverse task types
- Multi-dimensional: Benefits from comprehensive quality assessment
- Research Validated: Empirically optimized through systematic methodology
- Scalable: Framework scales to larger models (validated up to 7.2B)
Research Significance
This model demonstrates several key findings:
- Multi-dimensional beats single-dimensional selection: 47.01% overall vs. 46.16% for the best single rater
- Quality integration superiority: Outperforms simple combination methods
- Efficiency gains: Achieves 2x convergence speed improvement
- Scalability: Benefits persist at larger model scales
- Comprehensive approach: 25 quality scores provide complementary information
Comparison with Baselines
- vs Random Baseline: +3.23% overall improvement
- vs Best Single Rater (QuRating Educational Value): +0.85% improvement
- vs Simple Mean Combination (uniform weighting): +2.36% improvement
- vs Previous SOTA: Establishes new state-of-the-art for data selection methods
Limitations
- Rating all 25 quality scores adds computational cost during data selection
- Optimized for SlimPajama-style web-crawled data
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- Requires proxy model training for weight optimization
Technical Innovation
The Meta-rater framework introduces:
- Systematic quality integration: Moving beyond single-dimensional selection
- Learned optimal weightings: Data-driven rather than heuristic combinations
- Proxy model methodology: Efficient exploration of weight space
- Multi-dimensional assessment: Comprehensive quality evaluation (PRRC)
- Scalable paradigm: Framework applicable to diverse quality metrics
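One hedged reading of the proxy-model step (the regression family, sampling scheme, and stand-in loss below are illustrative assumptions, not the authors' implementation): sample candidate weightings, train a small proxy model on data selected with each, fit a regressor from weightings to proxy validation loss, and keep the weighting with the lowest predicted loss.

# Illustrative-only sketch of proxy-based weight search.
import numpy as np
from sklearn.linear_model import Ridge  # regression family is an assumption

rng = np.random.default_rng(0)
NUM_PROXIES, NUM_SCORES = 256, 25  # 256 proxy runs over 25 quality scores

def train_proxy_and_eval(w: np.ndarray) -> float:
    """Stand-in for the expensive step: select data with weighting w, train
    a small proxy LM, and return its validation loss (synthetic here)."""
    hidden_optimum = np.full(NUM_SCORES, 1.0 / NUM_SCORES)
    return float(((w - hidden_optimum) ** 2).sum() + rng.normal(0.0, 1e-4))

# 1) Sample candidate weightings over the 25 scores (points on the simplex).
W = rng.dirichlet(np.ones(NUM_SCORES), size=NUM_PROXIES)

# 2) Evaluate each weighting with a proxy training run.
losses = np.array([train_proxy_and_eval(w) for w in W])

# 3) Fit a regressor mapping weightings to proxy validation loss.
reg = Ridge(alpha=1.0).fit(W, losses)

# 4) Keep the candidate weighting with the lowest predicted loss.
candidates = rng.dirichlet(np.ones(NUM_SCORES), size=100_000)
best_weights = candidates[np.argmin(reg.predict(candidates))]
print(best_weights.round(3))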
Citation
If you use this model in your research, please cite:
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
Related Resources
- PRRC Rating Models: Individual ModernBERT models for quality assessment
- Annotated SlimPajama-627B: Fully labeled dataset with all 25 quality scores
- Meta-rater Scripts: Implementation and training code
- Proxy Models: Smaller models used for weight optimization
License
Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.
Contact
For questions or issues, please contact the authors or open an issue in the repository.