---
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# Random Baseline Language Model (3.3B Parameters, 100B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 3.3B-parameter, decoder-only transformer language model trained from scratch on 100B tokens randomly sampled from the SlimPajama dataset. It serves as a scaling baseline for comparing data selection methods in the Meta-rater research, showing what increased model size and training data achieve without any quality-based data selection.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 3.3B (3,335,989,760 parameters)
- **Training Tokens**: 100B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Training Data**: Randomly sampled from the SlimPajama dataset
- **Domain Distribution**: Fixed proportions across domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)

## Architecture Specifications

- **Hidden Dimension**: 2,560
- **Number of Layers**: 40
- **Attention Heads**: 20
- **Key-Value Heads**: 20
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~129 hours

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 64.22%
  - ARC-Easy: 66.33%
  - ARC-Challenge: 33.53%
  - SciQ: 92.80%
- **Commonsense Reasoning**: 53.55%
  - HellaSwag: 57.35%
  - SIQA: 43.71%
  - WinoGrande: 59.59%
- **Reading Comprehension**: 35.28%
  - RACE: 34.35%
  - OpenbookQA: 36.20%
- **Overall Average**: 52.98%

### Knowledge-Intensive Tasks

- **MMLU**: 25.48%
- **NaturalQuestions**: 6.28%

## Scaling Improvements

Compared to the 1.3B random baseline (30B tokens):

- **General Knowledge**: +11.43 points (52.79% → 64.22%)
- **Commonsense Reasoning**: +9.61 points (43.94% → 53.55%)
- **Reading Comprehension**: +5.26 points (30.02% → 35.28%)
- **Overall Average**: +9.20 points (43.78% → 52.98%)

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-3b-random"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
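As a quick sanity check on the Architecture Specifications above, the sketch below rebuilds the architecture as a LLaMA-style `transformers.LlamaConfig`. This is an illustration, not the released configuration file: the intermediate size of 6,912 (the 8/3 MLP ratio applied to the hidden dimension of 2,560, rounded) and untied embeddings are assumptions, chosen because they reproduce the reported 3,335,989,760 parameters.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Hypothetical LLaMA-style config mirroring the Architecture Specifications section.
# It is NOT the official config shipped with the checkpoint.
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=2_560,
    intermediate_size=6_912,        # assumed rounding of the 8/3 MLP ratio (2,560 * 8/3 ≈ 6,827)
    num_hidden_layers=40,
    num_attention_heads=20,
    num_key_value_heads=20,         # equal to the head count, i.e. standard multi-head attention
    max_position_embeddings=1_024,
    rope_theta=10_000.0,
    tie_word_embeddings=False,      # untied embeddings are consistent with the parameter count
)

# Randomly initialized model: useful only for inspecting shapes and the parameter count.
model = LlamaForCausalLM(config)
print(f"{model.num_parameters():,} parameters")  # expected: ~3,335,989,760
```

For actual inference, load the pretrained weights with `AutoModelForCausalLM.from_pretrained` as shown in the Usage section above.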
## Research Context

This model serves as a crucial scaling baseline in the Meta-rater research:

- **Scale Validation**: Demonstrates that the benefits of data selection persist at larger scales
- **Efficiency Comparison**: Meta-rater models show consistent advantages even as parameter count grows
- **Performance Ceiling**: Establishes the performance ceiling for random selection at this scale

### Key Scaling Findings

- **Data Selection Benefits Persist**: Meta-rater maintains its advantages at the 3.3B scale
- **Improved Absolute Performance**: Substantial gains from increased model size and training data
- **Knowledge Tasks**: Particularly strong improvements on knowledge-intensive evaluations
- **Efficiency Gains**: Meta-rater still provides meaningful improvements over random selection

## Applications

This model can be used for:

- **Scaling research** and baseline comparisons
- **General language modeling** with improved capabilities
- **Research on training efficiency** at larger scales
- **Educational purposes** for understanding scale effects
- **Benchmark establishment** for 3.3B parameter models

## Strengths

- Significantly improved performance over smaller baselines
- Strong knowledge retention and reasoning capabilities
- Robust performance across diverse task categories
- Valuable reference point for scaling experiments

## Limitations

- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- High computational requirements for training
- Performance still lower than models trained with curated data selection

## Comparison with Meta-rater

Compared with the equivalent Meta-rater 3.3B model:

- **Overall Performance Gap**: 54.71% (Meta-rater) vs. 52.98% (Random), +1.73 points
- **General Knowledge**: 67.51% vs. 64.22%, +3.29 points
- **Efficiency**: Meta-rater achieves better performance with the same computational budget

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

The model is released under the MIT license (see the metadata above). Please also refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.