This repository contains a fastText pretraining data filter targeting the LAMBADA task, as described in the paper "Improving Pretraining Data Using Perplexity Correlations". The filter selects high-quality pretraining data based on correlations between LLM perplexity and downstream benchmark performance.
Code: https://github.com/TristanThrush/perplexity-correlations
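
As a rough illustration of how a fastText filter of this kind might be applied, here is a minimal sketch. It assumes the model is a binary classifier loadable with `fasttext.load_model` and that the "keep" class is named `__label__include`; that label name, the `0.5` threshold, and the helper names `load_filter`, `include_score`, and `filter_documents` are assumptions for illustration, not something this card states. Check the labels the model actually emits before relying on it.

```python
from typing import Iterable, List

# Assumption: the label for documents to keep. Verify against the real model.
INCLUDE_LABEL = "__label__include"


def load_filter(model_path: str):
    """Load a fastText model from disk (requires `pip install fasttext`)."""
    import fasttext  # imported lazily so the helpers below work without it

    return fasttext.load_model(model_path)


def include_score(model, text: str, include_label: str = INCLUDE_LABEL) -> float:
    """Return the probability the model assigns to `include_label`."""
    # fastText's predict() processes one line at a time and rejects
    # newlines, so collapse all whitespace into single spaces first.
    cleaned = " ".join(text.split())
    # k=-1 asks for every label, so the include label is present if it exists.
    labels, probs = model.predict(cleaned, k=-1)
    for label, prob in zip(labels, probs):
        if label == include_label:
            return float(prob)
    return 0.0  # the model never emitted the assumed label


def filter_documents(model, docs: Iterable[str], threshold: float = 0.5) -> List[str]:
    """Keep the documents whose include score meets the threshold."""
    return [doc for doc in docs if include_score(model, doc) >= threshold]
```

With the model file downloaded locally, usage would look like `model = load_filter("path/to/model.bin")` followed by `filter_documents(model, docs)`, where the path is a placeholder. Thresholding is only one option: documents could instead be ranked by `include_score` and the top fraction kept.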