Random Baseline Language Model (7.2B Parameters, 150B Tokens)
This repository contains the 7.2B parameter random baseline language model used in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.
Code: https://github.com/opendatalab/Meta-rater
Model Description
This is a 7.2B parameter transformer-based decoder-only language model trained from scratch on 150B tokens randomly sampled from the SlimPajama dataset. It is the largest baseline model in the Meta-rater study and serves as the reference point for how random data selection performs at this scale.
Model Details
- Architecture: Transformer decoder-only
- Parameters: 7.2B (7,241,732,096 parameters)
- Training Tokens: 150B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Training Data: Randomly sampled from the SlimPajama dataset
- Domain Distribution: Fixed proportion across all domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)
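For reference, the fixed sampling proportions above can be written as a simple weight table. The snippet below is only an illustration of that configuration (the variable name is ours, not from the training code) and checks that the listed shares sum to 100%.

```python
# SlimPajama domain shares listed above, expressed as illustrative sampling weights.
# (Name and structure are assumptions for illustration, not the paper's code.)
domain_weights = {
    "CommonCrawl": 52.2,
    "C4": 26.7,
    "GitHub": 5.2,
    "Books": 4.2,
    "ArXiv": 4.6,
    "Wikipedia": 3.8,
    "StackExchange": 3.3,
}
assert abs(sum(domain_weights.values()) - 100.0) < 1e-6  # shares sum to 100%
```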
Architecture Specifications
- Hidden Dimension: 4,096
- Number of Layers: 32
- Attention Heads: 32
- Key-Value Heads: 8 (Grouped Query Attention)
- MLP Ratio: 8/3
- Position Encoding: RoPE (base=10,000)
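These specifications map onto a LLaMA-style configuration. The sketch below expresses them with Hugging Face's LlamaConfig as an approximation; the intermediate size is our assumption, derived from the 8/3 MLP ratio and rounded to a hardware-friendly multiple of 256, and is not a value confirmed in the paper.

```python
from transformers import LlamaConfig

# Approximate LLaMA-style configuration mirroring the specifications above.
# intermediate_size is an assumption: 8/3 * 4096 ≈ 10923, rounded up to 11008
# (a multiple of 256, as in LLaMA-7B); the exact value is not stated here.
config = LlamaConfig(
    vocab_size=32_000,              # LLaMA tokenizer
    hidden_size=4_096,              # hidden dimension
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,          # grouped query attention
    intermediate_size=11_008,       # ~8/3 MLP ratio (assumed rounding)
    max_position_embeddings=1_024,  # context window
    rope_theta=10_000.0,            # RoPE base
)
print(config)
```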
Training Details
- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~284 hours
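For intuition, the listed batch size and token budget imply roughly 4,096 sequences per optimizer step and about 36K steps in total. The snippet below is illustrative arithmetic only, not an excerpt from the training code.

```python
# Back-of-the-envelope numbers derived from the settings listed above.
context_window = 1_024           # tokens per sequence
global_batch_tokens = 4_194_304  # tokens per optimizer step
total_tokens = 150e9             # training token budget

sequences_per_step = global_batch_tokens // context_window  # 4096
approx_steps = total_tokens / global_batch_tokens            # ~35,763

print(f"Sequences per step: {sequences_per_step}")
print(f"Approximate optimizer steps: {approx_steps:,.0f}")
```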
Performance Results
Downstream Task Performance (Average Accuracy)
General Knowledge: 65.10%
- ARC-Easy: 67.77%
- ARC-Challenge: 36.43%
- SciQ: 91.10%
Commonsense Reasoning: 52.01%
- HellaSwag: 53.02%
- SIQA: 42.73%
- WinoGrande: 60.29%
Reading Comprehension: 35.87%
- RACE: 34.73%
- OpenbookQA: 37.00%
Overall Average: 52.12%
Knowledge-Intensive Tasks
- MMLU: 26.21%
- NaturalQuestions: 10.89%
Scaling Analysis
Performance Progression Across Scales
- 1.3B Random: 43.78% overall
- 3.3B Random: 52.98% overall (+9.20 points vs. 1.3B)
- 7.2B Random: 52.12% overall (-0.86 points vs. 3.3B)
Scale Observations
- Plateau Effect: Performance plateaus or slightly decreases at 7.2B scale
- Knowledge Tasks: NaturalQuestions shows continued improvement with scale
- Efficiency: Diminishing returns from parameter scaling with random data
- Data Quality Impact: Highlights importance of curation at larger scales
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-7b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "Recent advances in machine learning have"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
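For a model of this size, full-precision loading on a single GPU may exhaust memory. One option (a suggestion, not a requirement of this model card; device_map="auto" needs the accelerate package installed) is to load the weights in bfloat16 with automatic device placement:

```python
# Optional: reduce memory by loading in bfloat16 and letting accelerate place weights.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```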
Research Context
This model provides crucial insights for the Meta-rater research:
Scaling Law Implications
- Data Quality Importance: Random selection shows diminishing returns at scale
- Ceiling Effects: Parameter scaling alone insufficient for continued improvement
- Meta-rater Value: Quality data selection becomes more valuable at larger scales
Key Research Findings
- Plateau Phenomenon: Random data selection hits performance plateau
- Efficiency Questions: Massive parameter increases yield minimal gains
- Quality Selection Necessity: Demonstrates need for systematic data curation
Applications
This model can be used for:
- Scaling research and understanding parameter efficiency
- Baseline establishment for large-scale language modeling
- Research on diminishing returns in parameter scaling
- Data quality impact studies at scale
- Computational efficiency analysis
Strengths
- Large parameter capacity for complex pattern learning
- Extensive training on diverse content
- Strong knowledge retention capabilities
- Valuable baseline for scaling studies
Limitations
- Performance plateau despite increased parameters
- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- High computational requirements with modest performance gains
- Demonstrates inefficiency of random data selection at scale
Critical Scaling Insights
Diminishing Returns Pattern
- 3.3B to 7.2B: ~2.2x parameters, -0.86 points overall accuracy
- Training Cost: 284 hours vs 129 hours (+120% training time)
- Efficiency: Negative return on computational investment
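The ratios above follow directly from the figures quoted in this card; the snippet below simply recomputes them (illustrative arithmetic, not results from the paper).

```python
# Recompute the scaling ratios from the figures quoted in this card.
params_3_3b, params_7_2b = 3.3e9, 7.2e9
hours_3_3b, hours_7_2b = 129, 284
acc_3_3b, acc_7_2b = 52.98, 52.12

print(f"Parameter ratio: {params_7_2b / params_3_3b:.1f}x")          # ~2.2x
print(f"Training time increase: {hours_7_2b / hours_3_3b - 1:.0%}")  # ~+120%
print(f"Accuracy change: {acc_7_2b - acc_3_3b:+.2f} points")         # -0.86
```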
Data Quality Imperative
This model demonstrates why data curation becomes crucial at scale:
- Random selection fails to utilize increased model capacity
- Quality data selection (Meta-rater) shows continued benefits
- Parameter scaling alone insufficient for performance gains
Comparison with Meta-rater 7.2B
The corresponding Meta-rater model achieves:
- Overall Performance: 55.24% vs. 52.12% (+3.12 points)
- Efficiency: Same training cost, significantly better results
- Scalability: Meta-rater benefits increase at larger scales
Citation
If you use this model in your research, please cite:
```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```
License
Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.
Contact
For questions or issues, please contact the authors or open an issue in the repository.
⭐ Star us on GitHub if you find Meta-rater useful! ⭐
Made with ❤️ by the OpenDataLab team