Update README.md
README.md CHANGED
@@ -16,6 +16,35 @@ tags:
- SiLU activations
- `fineweb-edu-dedup` split of `HuggingFaceTB/smollm-corpus`

+## details
+
+
+1. Model:
+   - Dropout rate: 0.0
+   - Activations: `silu`, `gated-silu`
+   - Model compilation: enabled
+
+2. Data processing:
+   - Input length: 1024
+   - MLM probability: 0.15
+
+3. Optimization:
+   - Optimizer: AdamW with scaling
+   - Base learning rate: 0.008
+   - Batch size: 120
+   - Total training steps: 80,000
+   - Warmup steps: 10,000
+   - Learning rate scheduler: Cosine
+   - Weight decay: 0.0001
+   - Gradient clipping: 1.0
+   - Gradient accumulation steps: 24
+   - Final cosine learning rate: 1e-5
+
+4. Hardware utilization:
+   - Device: GPU
+   - Precision: bfloat16, tf32
+
+
## plots
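The `gated-silu` activation listed under "Model" in the added section is not spelled out in the diff. Below is a minimal sketch of what a gated-SiLU (SwiGLU-style) feed-forward block typically looks like in PyTorch, assuming a standard transformer FFN with the 0.0 dropout rate from the list; the class name, projection names, and dimensions are illustrative, not taken from the model's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSiLUFeedForward(nn.Module):
    """Gated-SiLU (SwiGLU-style) feed-forward block.

    Hypothetical sketch: names and shapes are illustrative, not the
    model's actual implementation.
    """

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.0):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.wo = nn.Linear(d_ff, d_model, bias=False)    # output projection
        self.dropout = nn.Dropout(dropout)                # 0.0 per the README

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-activated gate multiplied elementwise with the linear branch.
        hidden = F.silu(self.wi_0(x)) * self.wi_1(x)
        return self.wo(self.dropout(hidden))


if __name__ == "__main__":
    block = GatedSiLUFeedForward(d_model=512, d_ff=2048)
    out = block(torch.randn(2, 1024, 512))  # (batch, seq_len=1024, d_model)
    print(out.shape)
```

The plain `silu` variant differs only in that the gate branch is dropped and a single projection feeds the activation.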
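The "Optimization" and "Hardware utilization" entries read like a training configuration. The sketch below shows one plausible way to wire those numbers together in plain PyTorch: linear warmup for 10,000 steps, cosine decay to a final learning rate of 1e-5 over 80,000 total steps, weight decay 0.0001, gradient clipping at 1.0, gradient accumulation over 24 steps, bfloat16 autocast with TF32 enabled, and `torch.compile`. It uses a vanilla `AdamW`; the "with scaling" qualifier and the data pipeline (batch size 120, input length 1024, MLM probability 0.15) are not reproduced here, and the `model(**batch).loss` call assumes a Hugging Face-style model interface, so treat this as an approximation rather than the training script actually used.

```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters from the README's "details" section.
BASE_LR = 0.008
FINAL_LR = 1e-5
WARMUP_STEPS = 10_000
TOTAL_STEPS = 80_000
WEIGHT_DECAY = 1e-4
GRAD_CLIP = 1.0
GRAD_ACCUM_STEPS = 24

# "Precision: tf32" for matmuls/convolutions on GPU.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True


def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay from BASE_LR down to FINAL_LR."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return (FINAL_LR + (BASE_LR - FINAL_LR) * cosine) / BASE_LR


def train(model, data_loader):
    model = torch.compile(model)  # "Model compilation: enabled"
    optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)
    scheduler = LambdaLR(optimizer, lr_lambda)

    optimizer.zero_grad(set_to_none=True)
    for step, batch in enumerate(data_loader, start=1):
        # "Precision: bfloat16" via autocast; loss scaled for accumulation.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(**batch).loss / GRAD_ACCUM_STEPS
        loss.backward()

        if step % GRAD_ACCUM_STEPS == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad(set_to_none=True)
```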