Improve model card with metadata and link to code
This PR adds metadata to the model card, including the `pipeline_tag` and `library_name`. It also adds a link to the GitHub repository for easier access to the code.
README.md CHANGED

```diff
@@ -1,6 +1,13 @@
 ---
 license: mit
+library_name: fasttext
+pipeline_tag: text-classification
 ---
-
-the SciQ task, discussed in the main text of the Perplexity
-
+
+This is the fastText pretraining data filter targeting the SciQ task, discussed in the main text of the Perplexity Correlations paper: https://arxiv.org/abs/2409.05816
+
+This package can be used to get LLM pretraining data sampling distributions using simple statistical methods. The compute requirements are minimal, and you don't need to train any LLMs yourself.
+
+Essentially, this approach encourages training on domains where lower loss is very correlated with higher downstream performance. We can use existing and freely available LLMs to do this.
+
+Code: https://github.com/TristanThrush/perplexity-correlations
```
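For context on what the new `library_name: fasttext` metadata enables: the model file can be downloaded from the Hub and run with the `fasttext` package. Below is a minimal sketch; the repo id, the `model.bin` filename, and the label layout are assumptions not stated in this PR, so check the actual model repo (for example via `model.labels`) before relying on them.

```python
# Minimal sketch of using a Hub-hosted fastText filter. The repo id and
# filename below are placeholders/assumptions, not taken from this PR.
import fasttext
from huggingface_hub import hf_hub_download

# Download the serialized fastText model from the Hub ("model.bin" is an
# assumed filename; check the files listed in the actual model repo).
model_path = hf_hub_download(
    repo_id="user/sciq-fasttext-filter",  # placeholder repo id
    filename="model.bin",
)
model = fasttext.load_model(model_path)

def filter_score(text: str) -> float:
    """Probability the filter assigns to its top label for this document."""
    # fastText's predict() rejects embedded newlines, so flatten the text.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return float(probs[0])

print(model.labels)  # inspect the real label names before interpreting scores
print(filter_score("The mitochondria is the powerhouse of the cell."))
```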
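The model card's claim about deriving pretraining data sampling distributions can be illustrated with a toy aggregation built on `filter_score` from the sketch above. Note this is only a sketch of the general idea (average a filter score per domain, then normalize), not the estimator from the Perplexity Correlations paper; the actual method is in the linked repo.

```python
# Toy illustration (not the paper's estimator): turn per-document filter
# scores into a normalized sampling distribution over pretraining domains.
import numpy as np

docs_by_domain = {  # hypothetical corpus snippets grouped by domain
    "encyclopedia": ["Photosynthesis converts light energy into chemical energy.",
                     "The French Revolution began in 1789."],
    "web_forums": ["lol same, my build crashed again",
                   "anyone know a fix for this error?"],
}

# Mean filter score per domain, then normalize into a distribution.
domain_scores = {
    domain: float(np.mean([filter_score(t) for t in texts]))
    for domain, texts in docs_by_domain.items()
}
total = sum(domain_scores.values())
sampling_distribution = {d: s / total for d, s in domain_scores.items()}
print(sampling_distribution)  # e.g. {'encyclopedia': 0.6, 'web_forums': 0.4}
```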