Tags: Text Generation · Transformers · Safetensors · English · internlm2 · custom_code

Random Baseline Language Model (7.2B Parameters, 150B Tokens)

This repository contains the 7.2B parameter random baseline language model used in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.

Code: https://github.com/opendatalab/Meta-rater

Model Description

This is a 7.2B parameter transformer-based decoder-only language model trained from scratch on 150B tokens randomly sampled from the SlimPajama dataset. It is the largest baseline model in the Meta-rater study and serves as the reference point for what random data selection achieves at this scale.

Model Details

  • Architecture: Transformer decoder-only
  • Parameters: 7.2B (7,241,732,096 parameters)
  • Training Tokens: 150B tokens
  • Context Window: 1,024 tokens
  • Vocabulary Size: 32,000 (LLaMA tokenizer)
  • Training Data: Randomly sampled from SlimPajama dataset
  • Domain Distribution: Fixed proportions across domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%); see the sketch below
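
These proportions can be written as plain sampling weights. A minimal sketch (the dictionary name and the normalization check are illustrative, not taken from the original training code):

# Illustrative only: SlimPajama domain proportions (in %) used for random sampling,
# copied from the list above.
slimpajama_domain_mix = {
    "CommonCrawl": 52.2,
    "C4": 26.7,
    "GitHub": 5.2,
    "Books": 4.2,
    "ArXiv": 4.6,
    "Wikipedia": 3.8,
    "StackExchange": 3.3,
}

# Sanity check: the listed proportions cover the full dataset (~100%).
assert abs(sum(slimpajama_domain_mix.values()) - 100.0) < 1e-6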

Architecture Specifications

  • Hidden Dimension: 4,096
  • Number of Layers: 32
  • Attention Heads: 32
  • Key-Value Heads: 8 (Grouped Query Attention)
  • MLP Ratio: 8/3
  • Position Encoding: RoPE (base=10,000)
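
Taken together, these settings correspond roughly to the LLaMA-style hyperparameters below. This is a sketch for orientation only: the actual config class (the repository ships custom InternLM2 code) and the rounded MLP intermediate size are not stated in this card.

# Rough summary of the architecture hyperparameters listed above (not the shipped config).
arch_config = {
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,         # grouped-query attention (GQA)
    "mlp_ratio": 8 / 3,               # intermediate size ≈ 8/3 × 4096 ≈ 10,923 before any rounding
    "max_position_embeddings": 1024,  # context window
    "vocab_size": 32000,              # LLaMA tokenizer
    "rope_theta": 10000.0,            # RoPE base
}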

Training Details

  • Hardware: 32x NVIDIA A800 GPUs
  • Global Batch Size: 4,194,304 tokens
  • Learning Rate: 5e-5
  • Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
  • Training Time: ~284 hours
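
The batch size and token budget above also imply a rough optimizer step count; a back-of-the-envelope sketch (the exact step count and learning-rate schedule are not stated in this card):

# Back-of-the-envelope training arithmetic from the figures above.
tokens_per_step = 4_194_304                       # global batch size in tokens
seq_len = 1_024                                   # context window
sequences_per_step = tokens_per_step // seq_len   # 4,096 sequences per optimizer step

total_tokens = 150_000_000_000                    # 150B training tokens
approx_steps = total_tokens / tokens_per_step     # ≈ 35,800 optimizer steps

print(sequences_per_step, round(approx_steps))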

Performance Results

Downstream Task Performance (Average Accuracy)

  • General Knowledge: 65.10%

    • ARC-Easy: 67.77%
    • ARC-Challenge: 36.43%
    • SciQ: 91.10%
  • Commonsense Reasoning: 52.01%

    • HellaSwag: 53.02%
    • SIQA: 42.73%
    • WinoGrande: 60.29%
  • Reading Comprehension: 35.87%

    • RACE: 34.73%
    • OpenbookQA: 37.00%
  • Overall Average: 52.12%
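
Each category figure above is the unweighted mean of its listed tasks, as the quick check below shows (the 52.12% overall average is taken directly from the paper and is not reproduced by this per-category averaging):

# Check that each category score is the unweighted mean of its tasks (scores copied from above).
general = [67.77, 36.43, 91.10]      # ARC-Easy, ARC-Challenge, SciQ
commonsense = [53.02, 42.73, 60.29]  # HellaSwag, SIQA, WinoGrande
reading = [34.73, 37.00]             # RACE, OpenbookQA

for name, scores in [("General Knowledge", general),
                     ("Commonsense Reasoning", commonsense),
                     ("Reading Comprehension", reading)]:
    # Reported in the card: 65.10%, 52.01%, 35.87% (agreement is up to rounding).
    print(f"{name}: {sum(scores) / len(scores):.2f}%")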

Knowledge-Intensive Tasks

  • MMLU: 26.21%
  • NaturalQuestions: 10.89%

Scaling Analysis

Performance Progression Across Scales

  • 1.3B Random: 43.78% overall
  • 3.3B Random: 52.98% overall (+9.20%)
  • 7.2B Random: 52.12% overall (-0.86%)

Scale Observations

  • Plateau Effect: Performance plateaus or slightly decreases at 7.2B scale
  • Knowledge Tasks: NaturalQuestions shows continued improvement with scale
  • Efficiency: Diminishing returns from parameter scaling with random data
  • Data Quality Impact: Highlights importance of curation at larger scales

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-7b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# The repo is tagged custom_code (InternLM2 architecture), so trust_remote_code is needed
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate text
prompt = "Recent advances in machine learning have"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
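
For GPU inference, the checkpoint (stored in BF16) can also be loaded in bfloat16; a variant of the snippet above, assuming the accelerate package is installed for device_map="auto":

# Optional: load the BF16 weights directly onto GPU memory.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",        # requires the `accelerate` package
    trust_remote_code=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)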

Research Context

This model provides crucial insights for the Meta-rater research:

Scaling Law Implications

  • Data Quality Importance: Random selection shows diminishing returns at scale
  • Ceiling Effects: Parameter scaling alone insufficient for continued improvement
  • Meta-rater Value: Quality data selection becomes more valuable at larger scales

Key Research Findings

  • Plateau Phenomenon: Random data selection hits performance plateau
  • Efficiency Questions: Massive parameter increases yield minimal gains
  • Quality Selection Necessity: Demonstrates need for systematic data curation

Applications

This model can be used for:

  • Scaling research and understanding parameter efficiency
  • Baseline establishment for large-scale language modeling
  • Research on diminishing returns in parameter scaling
  • Data quality impact studies at scale
  • Computational efficiency analysis

Strengths

  • Large parameter capacity for complex pattern learning
  • Extensive training on diverse content
  • Strong knowledge retention capabilities
  • Valuable baseline for scaling studies

Limitations

  • Performance plateau despite increased parameters
  • Trained on randomly selected data without quality filtering
  • Limited context window (1,024 tokens)
  • No instruction tuning or safety alignment
  • High computational requirements with modest performance gains
  • Demonstrates inefficiency of random data selection at scale

Critical Scaling Insights

Diminishing Returns Pattern

  • 3.3B to 7.2B: ~2.2x the parameters, -0.86 percentage points in overall performance
  • Training Cost: 284 hours vs 129 hours (+120% training time)
  • Efficiency: Negative return on computational investment

Data Quality Imperative

This model demonstrates why data curation becomes crucial at scale:

  • Random selection fails to utilize increased model capacity
  • Quality data selection (Meta-rater) shows continued benefits
  • Parameter scaling alone insufficient for performance gains

Comparison with Meta-rater 7.2B

The corresponding Meta-rater model achieves:

  • Overall Performance: 55.24% vs. 52.12% (+3.12 percentage points)
  • Efficiency: Same training cost, significantly better results
  • Scalability: Meta-rater benefits increase at larger scales

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

Contact

For questions or issues, please contact the authors or open an issue in the repository.


⭐ Star us on GitHub if you find Meta-rater useful! ⭐

Made with ❤️ by the OpenDataLab team
