---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- EleutherAI/the_pile_deduplicated
---

The *Pythia Scaling Suite* is a collection of models developed to facilitate
interpretability research. It contains two sets of eight models of sizes
70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
models: one trained on the Pile, and one trained on the Pile after the dataset
has been globally deduplicated. All 8 model sizes are trained on the exact
same data, in the exact same order. We also provide 154 intermediate
checkpoints per model, hosted on Hugging Face as branches.

The Pythia model suite was designed to promote scientific
research on large language models, especially interpretability research.
Despite not centering downstream performance as a design goal, we find the
models <a href="#evaluations">match or exceed</a> the performance of
similar- and same-sized models, such as those in the OPT and GPT-Neo suites.

<details>
<summary style="font-weight:600">Details on the previous early release and naming convention.</summary>

Previously, we released an early version of the Pythia suite to the public.
However, we decided to retrain the model suite to address a few hyperparameter
discrepancies. This model card <a href="#changelog">lists the changes</a>;
see Appendix B in the Pythia paper for further discussion. We found no
difference in benchmark performance between the two Pythia versions.
The old models are
[still available](https://huggingface.co/models?other=pythia_v0), but we
suggest the retrained suite if you are just starting to use Pythia.<br>
**This is the current release.**

Please note that all models in the *Pythia* suite were renamed in January
2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
comparing the old and new names</a> is provided in this model card, together
with exact parameter counts.
</details>
<br>

# Pythia-2.8B-deduped

## Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
  for training procedure, config files, and details on how to use.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
  Discord](https://discord.gg/zBGx3azzUn) and post them in `#release-discussion`.
  Please read the existing *Pythia* documentation before asking about it in the
  EleutherAI Discord. For general correspondence:
  [[email protected]](mailto:[email protected]).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
| 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
| 410M | 302,311,424 | 24 | 1024 | 16 | 2M | 3.0 x 10<sup>-4</sup> | OPT-350M |
| 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
| 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 2M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
| 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
| 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
| 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |

<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
non-deduped models of a given size have the same hyperparameters. “Equivalent”
models have <b>exactly</b> the same architecture, and the same number of
non-embedding parameters. All models in the current release were trained with
a uniform batch size of 2M tokens (see the <a href="#changelog">changelog</a>).</figcaption>
</figure>

## Uses and Limitations

### Intended Use

The primary intended use of Pythia is research on the behavior, functionality,
and limitations of large language models. This suite is intended to provide
a controlled setting for performing scientific experiments. We also provide
154 checkpoints per model: the initial `step0`, 10 log-spaced checkpoints
`step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
`step143000`. These checkpoints are hosted on Hugging Face as branches. Note
that branch `143000` corresponds exactly to the model checkpoint on the `main`
branch of each model.
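
Counting these out is a useful sanity check. The short sketch below enumerates
the 154 branch names implied by the schedule just described (the `step{N}`
naming follows this card's convention for Hub branches):

```python
# Enumerate the 154 checkpoint branches described above.
log_spaced = [2**i for i in range(10)]            # steps 1, 2, 4, ..., 512
evenly_spaced = list(range(1000, 143_001, 1000))  # steps 1000, 2000, ..., 143000
steps = [0] + log_spaced + evenly_spaced

branches = [f"step{s}" for s in steps]            # branch names on the Hugging Face Hub
assert len(branches) == 154                       # 1 + 10 + 143
```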
+
|
| 95 |
+
You may also further fine-tune and adapt Pythia-2.8B-deduped for deployment,
|
| 96 |
+
as long as your use is in accordance with the Apache 2.0 license. Pythia
|
| 97 |
+
models work with the Hugging Face [Transformers
|
| 98 |
+
Library](https://huggingface.co/docs/transformers/index). If you decide to use
|
| 99 |
+
pre-trained Pythia-2.8B-deduped as a basis for your fine-tuned model, please
|
| 100 |
+
conduct your own risk and bias assessment.
|
| 101 |
+
|
| 102 |
+
### Out-of-scope use
|
| 103 |
+
|
| 104 |
+
The Pythia Suite is **not** intended for deployment. It is not a in itself
|
| 105 |
+
a product and cannot be used for human-facing interactions. For example,
|
| 106 |
+
the model may generate harmful or offensive text. Please evaluate the risks
|
| 107 |
+
associated with your particular use case.
|
| 108 |
+
|
| 109 |
+
Pythia models are English-language only, and are not suitable for translation
|
| 110 |
+
or generating text in other languages.
|
| 111 |
+
|
| 112 |
+
Pythia-2.8B-deduped has not been fine-tuned for downstream contexts in which
|
| 113 |
+
language models are commonly deployed, such as writing genre prose,
|
| 114 |
+
or commercial chatbots. This means Pythia-2.8B-deduped will **not**
|
| 115 |
+
respond to a given prompt the way a product like ChatGPT does. This is because,
|
| 116 |
+
unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
|
| 117 |
+
Learning from Human Feedback (RLHF) to better “follow” human instructions.
|
| 118 |
+
|
| 119 |
+
### Limitations and biases
|
| 120 |
+
|
| 121 |
+
The core functionality of a large language model is to take a string of text
|
| 122 |
+
and predict the next token. The token used by the model need not produce the
|
| 123 |
+
most “accurate” text. Never rely on Pythia-2.8B-deduped to produce factually accurate
|
| 124 |
+
output.
|
| 125 |
+
|
| 126 |
+
This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
|
| 127 |
+
known to contain profanity and texts that are lewd or otherwise offensive.
|
| 128 |
+
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
|
| 129 |
+
discussion of documented biases with regards to gender, religion, and race.
|
| 130 |
+
Pythia-2.8B-deduped may produce socially unacceptable or undesirable text, *even if*
|
| 131 |
+
the prompt itself does not include anything explicitly offensive.
|
| 132 |
+
|
| 133 |
+
If you plan on using text generated through, for example, the Hosted Inference
|
| 134 |
+
API, we recommend having a human curate the outputs of this language model
|
| 135 |
+
before presenting it to other people. Please inform your audience that the
|
| 136 |
+
text was generated by Pythia-2.8B-deduped.
|
| 137 |
+
|
| 138 |
+
### Quickstart
|
| 139 |
+
|
| 140 |
+
Pythia models can be loaded and used via the following code, demonstrated here
|
| 141 |
+
for the third `pythia-70m-deduped` checkpoint:
|
| 142 |
+
|
| 143 |
+
```python
|
| 144 |
+
from transformers import GPTNeoXForCausalLM, AutoTokenizer
|
| 145 |
+
|
| 146 |
+
model = GPTNeoXForCausalLM.from_pretrained(
|
| 147 |
+
"EleutherAI/pythia-70m-deduped",
|
| 148 |
+
revision="step3000",
|
| 149 |
+
cache_dir="./pythia-70m-deduped/step3000",
|
| 150 |
+
)
|
| 151 |
+
|
| 152 |
+
tokenizer = AutoTokenizer.from_pretrained(
|
| 153 |
+
"EleutherAI/pythia-70m-deduped",
|
| 154 |
+
revision="step3000",
|
| 155 |
+
cache_dir="./pythia-70m-deduped/step3000",
|
| 156 |
+
)
|
| 157 |
+
|
| 158 |
+
inputs = tokenizer("Hello, I am", return_tensors="pt")
|
| 159 |
+
tokens = model.generate(**inputs)
|
| 160 |
+
tokenizer.decode(tokens[0])
|
| 161 |
+
```
|
| 162 |
+
|
| 163 |
+
Revision/branch `step143000` corresponds exactly to the model checkpoint on
|
| 164 |
+
the `main` branch of each model.<br>
|
| 165 |
+
For more information on how to use all Pythia models, see [documentation on
|
| 166 |
+
GitHub](https://github.com/EleutherAI/pythia).
|
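
Rather than hard-coding step numbers, you can also discover the available
checkpoint branches programmatically. A minimal sketch, assuming a recent
`huggingface_hub` client (filtering on the `step` prefix is our convention
here, not an official helper):

```python
from huggingface_hub import list_repo_refs

# List all branches of the repo; checkpoint branches are named `step{N}`.
refs = list_repo_refs("EleutherAI/pythia-70m-deduped")
checkpoints = sorted(
    (b.name for b in refs.branches if b.name.startswith("step")),
    key=lambda name: int(name.removeprefix("step")),
)
print(checkpoints[0], "...", checkpoints[-1])  # step0 ... step143000
```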

## Training

### Training data

Pythia-2.8B-deduped was trained on the Pile **after the dataset has been globally
deduplicated**.<br>
[The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).

### Training procedure

All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
from `step1000` to `step143000` (which is the same as `main`). In addition, we
also provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
This corresponds to training for just under 1 epoch on the Pile for
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

All *Pythia* models were trained for 143,000 steps at a batch size
of 2M (2,097,152 tokens).<br>
See [GitHub](https://github.com/EleutherAI/pythia) for more details on the
training procedure, including [how to reproduce
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
Pythia uses the same tokenizer as
[GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
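
The token counts above follow directly from these two numbers; a quick
arithmetic check:

```python
# The totals quoted above follow from 143,000 steps at 2,097,152 tokens per step.
tokens_per_step = 2_097_152  # batch size "2M"
total_steps = 143_000

assert tokens_per_step * total_steps == 299_892_736_000  # tokens seen during training
assert tokens_per_step * 1_000 == 2_097_152_000          # tokens between checkpoints
```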

## Evaluations

All 16 *Pythia* models were evaluated using the [LM Evaluation
Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
the results by model and step at `results/json/*` in the [GitHub
repository](https://github.com/EleutherAI/pythia/tree/main/results/json/).<br>
Expand the sections below to see plots of evaluation results for all
Pythia and Pythia-deduped models compared with OPT and BLOOM.

<details>
<summary>LAMBADA – OpenAI</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai_v1.png" style="width:auto"/>
</details>

<details>
<summary>Physical Interaction: Question Answering (PIQA)</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa_v1.png" style="width:auto"/>
</details>

<details>
<summary>WinoGrande</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande_v1.png" style="width:auto"/>
</details>

<details>
<summary>AI2 Reasoning Challenge—Easy Set</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_easy_v1.png" style="width:auto"/>
</details>

<details>
<summary>SciQ</summary>
<img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq_v1.png" style="width:auto"/>
</details>
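
If you want to reproduce a data point from these plots, you can run the
harness yourself. A sketch using the Python entry point of recent harness
releases (v0.4+); the API differed in older versions, so treat the exact call
as an assumption to check against your installed version:

```python
import lm_eval

# Evaluate one Pythia checkpoint on LAMBADA (OpenAI variant);
# `revision` selects a checkpoint branch and defaults to `main` if omitted.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-2.8b-deduped,revision=step143000",
    tasks=["lambada_openai"],
)
print(results["results"]["lambada_openai"])
```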

## Changelog

This section compares the previously released
[Pythia v0](https://huggingface.co/models?other=pythia_v0) models with the
current ones. See Appendix B of the Pythia paper for further discussion of these
changes and the motivation behind them. We found that retraining Pythia had no
impact on benchmark performance.

- All model sizes are now trained with a uniform batch size of 2M tokens.
Previously, the models of size 160M, 410M, and 1.4B parameters were trained
with batch sizes of 4M tokens.
- We added checkpoints at initialization (step 0) and steps {1,2,4,8,16,32,64,
128,256,512}, in addition to every 1000 training steps.
- Flash Attention was used in the new retrained suite.
- We remedied a minor inconsistency that existed in the original suite: all
models of size 2.8B parameters or smaller had a learning rate (LR) schedule
which decayed to a minimum LR of 10% of the starting LR, but the 6.9B and
12B models all used an LR schedule which decayed to a minimum LR of 0. In
the redone training runs, we rectified this inconsistency: all models are now
trained with an LR decaying to a minimum of 0.1× their maximum LR.

### Naming convention and parameter count

*Pythia* models were renamed in January 2023. It is possible that the old
naming convention still persists in some documentation by accident. The
current naming convention (70M, 160M, etc.) is based on total parameter count.

<figure style="width:32em">

| current Pythia suffix | old suffix | total params | non-embedding params |
| --------------------: | ---------: | -------------: | -------------------: |
| 70M | 19M | 70,426,624 | 18,915,328 |
| 160M | 125M | 162,322,944 | 85,056,000 |
| 410M | 350M | 405,334,016 | 302,311,424 |
| 1B | 800M | 1,011,781,632 | 805,736,448 |
| 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
| 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
| 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
| 12B | 13B | 11,846,072,320 | 11,327,027,200 |
</figure>
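
For most sizes, the gap between the two parameter counts is exactly the input
embedding plus the output unembedding matrix, i.e. 2 × vocab × d_model. A
sketch of the arithmetic, assuming the padded vocabulary of 50,304 implied by
the table (the 6.9B and 12B models pad the vocabulary further, so the same
formula needs a larger vocab for those two rows):

```python
# total params = non-embedding params + 2 * vocab * d_model
# (input embedding plus untied output unembedding). The vocabulary size of
# 50,304 is inferred from the table above; the 6.9B and 12B models pad the
# vocabulary further, so this constant does not apply to those two rows.
VOCAB = 50_304

def total_params(non_embedding: int, d_model: int) -> int:
    return non_embedding + 2 * VOCAB * d_model

assert total_params(18_915_328, 512) == 70_426_624         # Pythia-70M
assert total_params(302_311_424, 1024) == 405_334_016      # Pythia-410M
assert total_params(2_517_652_480, 2560) == 2_775_208_960  # Pythia-2.8B
```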