Random Baseline Language Model (7.2B Parameters, 150B Tokens)
This repository contains the 7.2B parameter random baseline language model used in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.
Code: https://github.com/opendatalab/Meta-rater
Model Description
This is a 7.2B parameter transformer-based decoder-only language model trained from scratch on 150B tokens randomly sampled from the SlimPajama dataset. It is the largest baseline model in the Meta-rater study and serves as the reference point for how random data selection performs at this scale.
Model Details
- Architecture: Transformer decoder-only
- Parameters: 7.2B (7,241,732,096 parameters)
- Training Tokens: 150B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Training Data: Randomly sampled from the SlimPajama dataset
- Domain Distribution: Fixed proportion across all domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)
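For reference, the fixed sampling proportions above can be written as a simple weight table. The snippet below is only an illustration of that configuration (the variable name is ours, not from the training code) and checks that the listed shares sum to 100%.

```python
# SlimPajama domain shares listed above, expressed as illustrative sampling weights.
# (Name and structure are assumptions for illustration, not the paper's code.)
domain_weights = {
    "CommonCrawl": 52.2,
    "C4": 26.7,
    "GitHub": 5.2,
    "Books": 4.2,
    "ArXiv": 4.6,
    "Wikipedia": 3.8,
    "StackExchange": 3.3,
}
assert abs(sum(domain_weights.values()) - 100.0) < 1e-6  # shares sum to 100%
```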
Architecture Specifications
- Hidden Dimension: 4,096
- Number of Layers: 32
- Attention Heads: 32
- Key-Value Heads: 8 (Grouped Query Attention)
- MLP Ratio: 8/3
- Position Encoding: RoPE (base=10,000)
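These specifications map onto a LLaMA-style configuration. The sketch below expresses them with Hugging Face's LlamaConfig as an approximation; the intermediate size is our assumption, derived from the 8/3 MLP ratio and rounded to a hardware-friendly multiple of 256, and is not a value confirmed in the paper.

```python
from transformers import LlamaConfig

# Approximate LLaMA-style configuration mirroring the specifications above.
# intermediate_size is an assumption: 8/3 * 4096 ≈ 10923, rounded up to 11008
# (a multiple of 256, as in LLaMA-7B); the exact value is not stated here.
config = LlamaConfig(
    vocab_size=32_000,              # LLaMA tokenizer
    hidden_size=4_096,              # hidden dimension
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,          # grouped query attention
    intermediate_size=11_008,       # ~8/3 MLP ratio (assumed rounding)
    max_position_embeddings=1_024,  # context window
    rope_theta=10_000.0,            # RoPE base
)
print(config)
```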
Training Details
- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~284 hours
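For intuition, the listed batch size and token budget imply roughly 4,096 sequences per optimizer step and about 36K steps in total. The snippet below is illustrative arithmetic only, not an excerpt from the training code.

```python
# Back-of-the-envelope numbers derived from the settings listed above.
context_window = 1_024           # tokens per sequence
global_batch_tokens = 4_194_304  # tokens per optimizer step
total_tokens = 150e9             # training token budget

sequences_per_step = global_batch_tokens // context_window  # 4096
approx_steps = total_tokens / global_batch_tokens            # ~35,763

print(f"Sequences per step: {sequences_per_step}")
print(f"Approximate optimizer steps: {approx_steps:,.0f}")
```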
Performance Results
Downstream Task Performance (Average Accuracy)
General Knowledge: 65.10%
- ARC-Easy: 67.77%
- ARC-Challenge: 36.43%
- SciQ: 91.10%
Commonsense Reasoning: 52.01%
- HellaSwag: 53.02%
- SIQA: 42.73%
- WinoGrande: 60.29%
Reading Comprehension: 35.87%
- RACE: 34.73%
- OpenbookQA: 37.00%
Overall Average: 52.12%
Knowledge-Intensive Tasks
- MMLU: 26.21%
- NaturalQuestions: 10.89%
Scaling Analysis
Performance Progression Across Scales
- 1.3B Random: 43.78% overall
- 3.3B Random: 52.98% overall (+9.20 points vs. 1.3B)
- 7.2B Random: 52.12% overall (-0.86 points vs. 3.3B)
Scale Observations
- Plateau Effect: Performance plateaus or slightly decreases at 7.2B scale
- Knowledge Tasks: NaturalQuestions shows continued improvement with scale
- Efficiency: Diminishing returns from parameter scaling with random data
- Data Quality Impact: Highlights importance of curation at larger scales
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-7b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "Recent advances in machine learning have"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
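For a model of this size, full-precision loading on a single GPU may exhaust memory. One option (a suggestion, not a requirement of this model card; device_map="auto" needs the accelerate package installed) is to load the weights in bfloat16 with automatic device placement:

```python
# Optional: reduce memory by loading in bfloat16 and letting accelerate place weights.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```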
Research Context
This model provides crucial insights for the Meta-rater research:
Scaling Law Implications
- Data Quality Importance: Random selection shows diminishing returns at scale
- Ceiling Effects: Parameter scaling alone insufficient for continued improvement
- Meta-rater Value: Quality data selection becomes more valuable at larger scales
Key Research Findings
- Plateau Phenomenon: Random data selection hits performance plateau
- Efficiency Questions: Massive parameter increases yield minimal gains
- Quality Selection Necessity: Demonstrates need for systematic data curation
Applications
This model can be used for:
- Scaling research and understanding parameter efficiency
- Baseline establishment for large-scale language modeling
- Research on diminishing returns in parameter scaling
- Data quality impact studies at scale
- Computational efficiency analysis
Strengths
- Large parameter capacity for complex pattern learning
- Extensive training on diverse content
- Strong knowledge retention capabilities
- Valuable baseline for scaling studies
Limitations
- Performance plateau despite increased parameters
- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- High computational requirements with modest performance gains
- Demonstrates inefficiency of random data selection at scale
Critical Scaling Insights
Diminishing Returns Pattern
- 3.3B to 7.2B: ~2.2x parameters, -0.86 points overall accuracy
- Training Cost: 284 hours vs 129 hours (+120% training time)
- Efficiency: Negative return on computational investment
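The ratios above follow directly from the figures quoted in this card; the snippet below simply recomputes them (illustrative arithmetic, not results from the paper).

```python
# Recompute the scaling ratios from the figures quoted in this card.
params_3_3b, params_7_2b = 3.3e9, 7.2e9
hours_3_3b, hours_7_2b = 129, 284
acc_3_3b, acc_7_2b = 52.98, 52.12

print(f"Parameter ratio: {params_7_2b / params_3_3b:.1f}x")          # ~2.2x
print(f"Training time increase: {hours_7_2b / hours_3_3b - 1:.0%}")  # ~+120%
print(f"Accuracy change: {acc_7_2b - acc_3_3b:+.2f} points")         # -0.86
```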
Data Quality Imperative
This model demonstrates why data curation becomes crucial at scale:
- Random selection fails to utilize increased model capacity
- Quality data selection (Meta-rater) shows continued benefits
- Parameter scaling alone insufficient for performance gains
Comparison with Meta-rater 7.2B
The corresponding Meta-rater model achieves:
- Overall Performance: 55.24% vs. 52.12% (+3.12 points)
- Efficiency: Same training cost, significantly better results
- Scalability: Meta-rater benefits increase at larger scales
Citation
If you use this model in your research, please cite:
```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```
License
Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.
Contact
For questions or issues, please contact the authors or open an issue in the repository.
⭐ Star us on GitHub if you find Meta-rater useful! ⭐
Made with ❤️ by the OpenDataLab team