---
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# Random Baseline Language Model (3.3B Parameters, 100B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 3.3B-parameter, decoder-only transformer language model trained from scratch on 100B tokens randomly sampled from the SlimPajama dataset. It serves as a scaling baseline for comparing data selection methods in the Meta-rater research, showing what increased model size and training data achieve without any quality-based data selection.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 3.3B (3,335,989,760 parameters)
- **Training Tokens**: 100B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Training Data**: Randomly sampled from the SlimPajama dataset
- **Domain Distribution**: Fixed proportions across domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)

## Architecture Specifications

- **Hidden Dimension**: 2,560
- **Number of Layers**: 40
- **Attention Heads**: 20
- **Key-Value Heads**: 20
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~129 hours

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 64.22%
  - ARC-Easy: 66.33%
  - ARC-Challenge: 33.53%
  - SciQ: 92.80%
- **Commonsense Reasoning**: 53.55%
  - HellaSwag: 57.35%
  - SIQA: 43.71%
  - WinoGrande: 59.59%
- **Reading Comprehension**: 35.28%
  - RACE: 34.35%
  - OpenbookQA: 36.20%
- **Overall Average**: 52.98%

### Knowledge-Intensive Tasks

- **MMLU**: 25.48%
- **NaturalQuestions**: 6.28%

## Scaling Improvements

Compared to the 1.3B random baseline (30B tokens):

- **General Knowledge**: +11.43 points (52.79% → 64.22%)
- **Commonsense Reasoning**: +9.61 points (43.94% → 53.55%)
- **Reading Comprehension**: +5.26 points (30.02% → 35.28%)
- **Overall Average**: +9.20 points (43.78% → 52.98%)

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-3b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
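As a quick sanity check on the Architecture Specifications above, the sketch below rebuilds the architecture as a LLaMA-style `transformers.LlamaConfig`. This is an illustration, not the released configuration file: the intermediate size of 6,912 (the 8/3 MLP ratio applied to the hidden dimension of 2,560, rounded) and untied embeddings are assumptions, chosen because they reproduce the reported 3,335,989,760 parameters.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Hypothetical LLaMA-style config mirroring the Architecture Specifications section.
# It is NOT the official config shipped with the checkpoint.
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=2_560,
    intermediate_size=6_912,        # assumed rounding of the 8/3 MLP ratio (2,560 * 8/3 ≈ 6,827)
    num_hidden_layers=40,
    num_attention_heads=20,
    num_key_value_heads=20,         # equal to the head count, i.e. standard multi-head attention
    max_position_embeddings=1_024,
    rope_theta=10_000.0,
    tie_word_embeddings=False,      # untied embeddings are consistent with the parameter count
)

# Randomly initialized model: useful only for inspecting shapes and the parameter count.
model = LlamaForCausalLM(config)
print(f"{model.num_parameters():,} parameters")  # expected: ~3,335,989,760
```

For actual inference, load the pretrained weights with `AutoModelForCausalLM.from_pretrained` as shown in the Usage section above.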
## Research Context

This model serves as a crucial scaling baseline in the Meta-rater research:

- **Scale Validation**: Demonstrates that the benefits of data selection persist at larger scales
- **Efficiency Comparison**: Meta-rater models show consistent advantages even as parameter count grows
- **Performance Ceiling**: Establishes the performance ceiling for random selection at this scale

### Key Scaling Findings

- **Data Selection Benefits Persist**: Meta-rater maintains its advantages at the 3.3B scale
- **Improved Absolute Performance**: Substantial gains from increased model size and training data
- **Knowledge Tasks**: Particularly strong improvements on knowledge-intensive evaluations
- **Efficiency Gains**: Meta-rater still provides meaningful improvements over random selection

## Applications

This model can be used for:

- **Scaling research** and baseline comparisons
- **General language modeling** with improved capabilities
- **Research on training efficiency** at larger scales
- **Educational purposes** for understanding scale effects
- **Benchmark establishment** for 3.3B parameter models

## Strengths

- Significantly improved performance over smaller baselines
- Strong knowledge retention and reasoning capabilities
- Robust performance across diverse task categories
- Valuable reference point for scaling experiments

## Limitations

- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- High computational requirements for training
- Performance still lower than models trained with curated data selection

## Comparison with Meta-rater

Compared with the equivalent Meta-rater 3.3B model:

- **Overall Performance Gap**: 54.71% (Meta-rater) vs. 52.98% (Random), +1.73 points
- **General Knowledge**: 67.51% vs. 64.22%, +3.29 points
- **Efficiency**: Meta-rater achieves better performance with the same computational budget

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

The model is released under the MIT license (see the metadata above). Please also refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.