This is the fastText pretraining data filter targeting the LAMBADA FR task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816. This filter uses perplexity correlations to identify high-quality pretraining data without requiring any LLM training. It is designed to be used with the fastText library.

GitHub: https://github.com/TristanThrush/perplexity-correlations
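
The snippet below is a minimal sketch of how a filter like this might be applied with the fastText Python bindings; it is not taken from the paper or the repository. The model file name `model.bin`, the label name `__label__include`, and the `0.5` threshold are all assumptions made for illustration, so check the files in this repository and the GitHub README for the actual values. The helper names (`load_fasttext_predictor`, `include_score`, `filter_documents`) are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# A predictor maps one line of text to (labels, probabilities),
# mirroring the return shape of fastText's `model.predict`.
Predictor = Callable[[str], Tuple[Sequence[str], Sequence[float]]]

# ASSUMPTION: the "keep" label name is not stated in this card.
INCLUDE_LABEL = "__label__include"


def load_fasttext_predictor(model_path: str) -> Predictor:
    """Wrap a fastText .bin model as a Predictor (needs `pip install fasttext`)."""
    import fasttext  # imported lazily so the helpers below work without it

    model = fasttext.load_model(model_path)
    # k=-1 asks fastText for every label, not only the top one.
    return lambda line: model.predict(line, k=-1)


def include_score(predict: Predictor, text: str,
                  include_label: str = INCLUDE_LABEL) -> float:
    """Return the probability assigned to the include label (0.0 if absent)."""
    # fastText's predict rejects strings containing newlines,
    # so collapse all whitespace into single spaces first.
    line = " ".join(text.split())
    labels, probs = predict(line)
    for label, prob in zip(labels, probs):
        if label == include_label:
            return float(prob)
    return 0.0


def filter_documents(predict: Predictor, docs: Iterable[str],
                     threshold: float = 0.5) -> List[str]:
    """Keep documents whose include score is at least `threshold` (assumed value)."""
    return [d for d in docs if include_score(predict, d) >= threshold]


# Example usage (hypothetical file name):
# predict = load_fasttext_predictor("model.bin")
# kept = filter_documents(predict, ["Un premier document.", "Un autre texte."])
```

Keeping the scoring logic separate from model loading means the threshold and label name can be adjusted, or the helpers tested with a stub predictor, without having the model file on hand.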
