---
license: apache-2.0
datasets:
- HuggingFaceTB/smollm-corpus
language:
- en
library_name: transformers
pipeline_tag: text2text-generation
tags:
- fineweb
- t5
---
# tFINE-base-300m
An encoder-decoder (T5 architecture) pretrained with [nanoT5](https://github.com/pszemraj/nanoT5/tree/flan-dataset):
- tokenizer: sentencepiece BPE w/ byte fallback, 48k vocab (from [vocab scaling laws](https://hf.co/collections/sail/scaling-laws-with-vocabulary-6699e0cbd77a8b2870859bfe))
- data: `fineweb-edu-dedup` split of `HuggingFaceTB/smollm-corpus`
- context length: 1024 tokens
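A minimal usage sketch with 🤗 Transformers is below. The repo id is an assumption for illustration; substitute this card's actual model id:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# NOTE: repo id is an assumption for illustration; replace with this card's actual id
model_id = "pszemraj/tFINE-base-300m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# T5-style models are pretrained on span corruption, so sentinel tokens
# can be used to probe fill-in-the-blank behavior
text = "The capital of France is <extra_id_0>."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

Note that this checkpoint is a pretrained base model (no instruction tuning), so expect span-infilling behavior rather than chat-style responses.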
## details
Detailed info, including training logs, configs, and checkpoints, can be found under `checkpoints/` in this repo.
<details>
<summary><strong>Expand hyperparameter overview</strong></summary>

1. Model:
- Dropout rate: 0.0
- Activations: `silu`, `gated-silu`
- torch compile: true
2. Data processing:
- Input length: 1024
- MLM probability: 0.15
3. Optimization:
- Optimizer: AdamW with scaling
- Base learning rate: 0.008
- Batch size: 120
- Total training steps: 80,000
- Warmup steps: 10,000
- Learning rate scheduler: Cosine
- Weight decay: 0.0001
- Gradient clipping: 1.0
- Gradient accumulation steps: 24
- Final cosine learning rate: 1e-5
4. Hardware:
- Device: RTX 4080
- Precision: bfloat16, tf32
</details>
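The optimization settings above imply a rough pretraining token budget. A back-of-the-envelope sketch, assuming the batch size of 120 counts full-length sequences per optimizer step (with gradient accumulation already folded in):

```python
# Rough token budget from the hyperparameters above.
# Assumption: batch size = sequences per optimizer step, all at full context length.
steps = 80_000       # total training steps
batch_size = 120     # sequences per step (assumed)
ctx = 1024           # context length in tokens

tokens_seen = steps * batch_size * ctx
print(f"~{tokens_seen:,} tokens")  # ~9.8B tokens
```

Actual tokens seen will be somewhat lower if sequences are not all packed to the full context length.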
## plots
training loss
![loss](./checkpoints/loss_over_steps.png)
<details>
<summary><strong>Expand grad and weights L2 norm plots</strong></summary>
grad norm
![grad](./checkpoints/grad_l2_over_steps.png)
weights norm
![weights](./checkpoints/weights_l2_over_steps.png)
</details>