---
license: apache-2.0
datasets:
- HuggingFaceTB/smollm-corpus
language:
- en
library_name: transformers
pipeline_tag: text2text-generation
tags:
- fineweb
- t5
---
|
|
|
# tFINE-base-300m
|
|
|
An encoder-decoder model (T5 architecture) pretrained with [nanoT5](https://github.com/pszemraj/nanoT5/tree/flan-dataset) (a usage sketch follows the overview below):
|
|
|
- tokenizer: sentencepiece BPE w/ byte fallback, 48k vocab (from [vocab scaling laws](https://hf.co/collections/sail/scaling-laws-with-vocabulary-6699e0cbd77a8b2870859bfe))
- data: `fineweb-edu-dedup` split of `HuggingFaceTB/smollm-corpus`
- context length: 1024 tokens
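
A minimal inference sketch with 🤗 transformers is below. The repo id (`pszemraj/tFINE-base-300m`) and the presence of T5-style sentinel tokens (`<extra_id_0>`, …) in the tokenizer are assumptions, so adjust to this model's actual Hub id. The model is pretrained on span corruption only (no instruction tuning), so prompt it with sentinel tokens rather than natural-language instructions.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "pszemraj/tFINE-base-300m"  # assumed repo id; point at this repo's actual Hub path

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# fill-in-the-blank style prompt using T5 sentinel tokens (assumed to exist in this tokenizer)
text = "The capital of France is <extra_id_0>, a city known for the <extra_id_1> Tower."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=24)

# keep special tokens so the predicted sentinel spans are visible
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```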
|
|
|
## details
|
|
|
Detailed info, including training logs, configs, and checkpoints, can be found under `checkpoints/` in this repo.
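
To pull just those artifacts locally, a sketch using `huggingface_hub` is shown below; the repo id is an assumption and should be replaced with this model's actual Hub id.

```python
from huggingface_hub import snapshot_download

# download only the checkpoints/ folder (logs, configs, checkpoints)
local_dir = snapshot_download(
    repo_id="pszemraj/tFINE-base-300m",   # assumed repo id; adjust as needed
    allow_patterns=["checkpoints/*"],
)
print(local_dir)
```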
|
|
|
<details>
<summary><strong>Expand hyperparameter overview</strong></summary>

1. Model:
   - Dropout rate: 0.0
   - Activations: `silu`, `gated-silu`
   - torch compile: true

2. Data processing:
   - Input length: 1024
   - MLM probability: 0.15

3. Optimization:
   - Optimizer: AdamW with scaling
   - Base learning rate: 0.008
   - Batch size: 120
   - Total training steps: 80,000
   - Warmup steps: 10,000
   - Learning rate scheduler: Cosine (an illustrative schedule sketch follows this details block)
   - Weight decay: 0.0001
   - Gradient clipping: 1.0
   - Gradient accumulation steps: 24
   - Final cosine learning rate: 1e-5

4. Hardware:
   - Device: RTX 4080
   - Precision: bfloat16, tf32

</details>
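
For reference, the warmup + cosine decay described above can be sketched as a standalone function using the listed values (10k warmup steps, 80k total steps, peak LR 0.008, final LR 1e-5). The exact curve nanoT5 implements may differ in detail, so treat this as illustrative.

```python
import math

BASE_LR = 0.008       # peak learning rate reached after warmup
FINAL_LR = 1e-5       # cosine floor at the end of training
WARMUP_STEPS = 10_000
TOTAL_STEPS = 80_000

def lr_at(step: int) -> float:
    """Linear warmup followed by cosine decay, per the hyperparameters listed above."""
    if step < WARMUP_STEPS:
        # linear ramp from 0 to BASE_LR over the warmup steps
        return BASE_LR * step / WARMUP_STEPS
    # cosine decay from BASE_LR down to FINAL_LR over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (BASE_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))

for s in (0, 5_000, 10_000, 40_000, 80_000):
    print(f"step {s:>6}: lr = {lr_at(s):.6f}")
```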
|
|
|
## plots
|
|
|
|
|
training loss

![loss](./checkpoints/loss_over_steps.png)
|
|
|
|
|
<details>
<summary><strong>Expand grad and weights L2 norm plots</strong></summary>

grad norm

![grad](./checkpoints/grad_l2_over_steps.png)

weights norm

![weights](./checkpoints/weights_l2_over_steps.png)

</details>
|
|
|
---