Improve model card with metadata and link to code
This PR adds metadata to the model card, including the `pipeline_tag` and `library_name`. It also adds a link to the GitHub repository for easier access to the code.
README.md CHANGED

```diff
@@ -1,6 +1,13 @@
 ---
 license: mit
+library_name: fasttext
+pipeline_tag: text-classification
 ---
-
-the SciQ task, discussed in the main text of the Perplexity
-
+
+This is the fastText pretraining data filter targeting the SciQ task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816
+
+This package can be used to get LLM pretraining data sampling distributions using simple statistical methods. The compute requirements are minimal, and you don't need to train any LLMs yourself.
+
+Essentially, this approach encourages training on domains where lower loss is very correlated with higher downstream performance. We can use existing and freely available LLMs to do this.
+
+Code: https://github.com/TristanThrush/perplexity-correlations
```
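For context on what the new `library_name: fasttext` metadata enables: the model file can be downloaded from the Hub and run with the `fasttext` package. Below is a minimal sketch; the repo id, the `model.bin` filename, and the label layout are assumptions not stated in this PR, so check the actual model repo (for example via `model.labels`) before relying on them.

```python
# Minimal sketch of using a Hub-hosted fastText filter. The repo id and
# filename below are placeholders/assumptions, not taken from this PR.
import fasttext
from huggingface_hub import hf_hub_download

# Download the serialized fastText model from the Hub ("model.bin" is an
# assumed filename; check the files listed in the actual model repo).
model_path = hf_hub_download(
    repo_id="user/sciq-fasttext-filter",  # placeholder repo id
    filename="model.bin",
)
model = fasttext.load_model(model_path)

def filter_score(text: str) -> float:
    """Probability the filter assigns to its top label for this document."""
    # fastText's predict() rejects embedded newlines, so flatten the text.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return float(probs[0])

print(model.labels)  # inspect the real label names before interpreting scores
print(filter_score("The mitochondria is the powerhouse of the cell."))
```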
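The model card's claim about deriving pretraining data sampling distributions can be illustrated with a toy aggregation built on `filter_score` from the sketch above. Note this is only a sketch of the general idea (average a filter score per domain, then normalize), not the estimator from the Perplexity Correlations paper; the actual method is in the linked repo.

```python
# Toy illustration (not the paper's estimator): turn per-document filter
# scores into a normalized sampling distribution over pretraining domains.
import numpy as np

docs_by_domain = {  # hypothetical corpus snippets grouped by domain
    "encyclopedia": ["Photosynthesis converts light energy into chemical energy.",
                     "The French Revolution began in 1789."],
    "web_forums": ["lol same, my build crashed again",
                   "anyone know a fix for this error?"],
}

# Mean filter score per domain, then normalize into a distribution.
domain_scores = {
    domain: float(np.mean([filter_score(t) for t in texts]))
    for domain, texts in docs_by_domain.items()
}
total = sum(domain_scores.values())
sampling_distribution = {d: s / total for d, s in domain_scores.items()}
print(sampling_distribution)  # e.g. {'encyclopedia': 0.6, 'web_forums': 0.4}
```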