AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts
Abstract
To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Rather than relying on supervised fine-tuning or on classifiers trained with human-annotated data, our approach uses meta-prompted language models as zero-shot verifiers to autonomously evaluate and select high-quality mathematical content. We release the curated open-source AutoMathText dataset, which encompasses over 200GB of data. To demonstrate the efficacy of our method, we continually pretrained a 7B-parameter Mistral language model on the AutoMathText dataset, achieving substantial improvements in downstream performance on the MATH dataset while using orders of magnitude fewer tokens than previous continual pretraining work. Our method yields a twofold increase in pretraining token efficiency compared to baselines, underscoring its potential for enhancing models' mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
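The selection step described in the abstract can be illustrated with a minimal sketch. It assumes the verifier's judgment is read off as a softmax over the model's logits for a "YES" and a "NO" answer token, and that a document is kept when that score clears a threshold. The function names, the example logits, and the 0.6 threshold are hypothetical and are not taken from the released code.

```python
import math


def yes_no_score(logit_yes: float, logit_no: float) -> float:
    """Return the softmax probability mass on YES given two answer-token logits.

    Assumption: the verifier model was prompted with a yes/no question about
    the document, and these are its logits for the two answer tokens.
    """
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_documents(scored_docs, threshold=0.6):
    """Keep documents whose YES score is at least `threshold`.

    `scored_docs` is an iterable of (doc_id, logit_yes, logit_no) tuples.
    The threshold value is an illustrative assumption.
    """
    return [
        doc_id
        for doc_id, logit_yes, logit_no in scored_docs
        if yes_no_score(logit_yes, logit_no) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical logits for three documents.
    docs = [
        ("algebra_notes", 2.0, 0.0),
        ("cooking_blog", 0.0, 2.0),
        ("ambiguous_page", 0.1, 0.0),
    ]
    print(select_documents(docs))
```

Because the score is a continuous value rather than a binary label, the threshold can be tuned to trade dataset size against quality without re-running the verifier.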
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation (2024)
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora (2024)
- EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data (2023)
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (2024)