Update README.md
README.md CHANGED
@@ -26,6 +26,15 @@ Training was conducted on the [LUMI supercomputer](https://www.lumi-supercompute
 The project aimed to train multilingual encoder models that support long context and all official Finnish languages¹. The model can theoretically extrapolate to a context length of 128,000 tokens.
 
 ¹Multiple Sámi languages are spoken in Finland, but Northern Sámi is the most widespread and thus included in the training data. English is not the official language of Finland, but it is widely used. Latin was included for potential clinical use.
+## Table of Contents
+1. [Model Overview](#model-overview)
+2. [Training](#training)
+3. [Training data](#training-data)
+4. [Evaluation results](#evaluation-results)
+5. [Ethical Considerations and Limitations](#ethical-considerations-and-limitations)
+6. [Acknowledgements](#aknowledgements)
+7. [Licence](#licence)
+8. [Citation information](#citation-information)
 ## Model Overview
 | Hyperparameter | Value |
 | :------------- | :----: |
@@ -36,9 +45,9 @@ The project aimed to train multilingual encoder models that support long context
 | sequence_length | 16,000 / 128,000 |
 ## Training
 Pretraining was done using Distributed Data Parallelism, AdamW with ZeroRedundancyOptimizer, and the WSD learning rate schedule.
-The model was trained with a learning rate of
+The model was trained with a learning rate of 8e-4, a sequence length of 1024, and a RoPE theta of 10,000 for 377B tokens over 117,300 steps.
 ### Long context training
-The model was trained with a learning rate of
+The model was trained with a learning rate of 5e-4, increasing the context length from 1024 to 16,000 in six stages,
 where each sequence length was trained for an equal number of tokens, totaling 53B tokens over 16,560 steps.
 RoPE theta in global layers was increased to 1,000,000. Long documents were sampled from the original data in the distribution below:
 |Sequence length | % |
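For context on the setup named in the diff above, the sketch below shows one way Distributed Data Parallelism, AdamW wrapped in ZeroRedundancyOptimizer, and a warmup-stable-decay (WSD) schedule typically fit together in PyTorch. It is an illustrative sketch rather than this project's training code: the placeholder model, warmup and decay fractions, and weight decay value are assumptions, while the peak learning rate (8e-4) and step count (117,300) come from the README text.

```python
"""Illustrative sketch of DDP + AdamW (sharded via ZeroRedundancyOptimizer)
+ a WSD learning rate schedule. Launch with torchrun. The model, warmup/decay
fractions, and weight decay are placeholders, not values from this project."""
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.optim.lr_scheduler import LambdaLR

dist.init_process_group(backend="nccl")  # rank/world size come from torchrun
device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())

encoder = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)  # placeholder model
model = DDP(encoder.to(device), device_ids=[device.index])

# AdamW with its optimizer state sharded across ranks (ZeRO stage-1 style).
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,
    lr=8e-4,            # peak LR of the main pretraining phase (from the README)
    weight_decay=0.01,  # assumption, not stated in the README
)

total_steps = 117_300            # main pretraining phase (from the README)
warmup_steps = 2_000             # assumption
decay_steps = total_steps // 10  # assumption: decay over the last ~10% of steps

def wsd(step: int) -> float:
    """Warmup-stable-decay multiplier applied to the peak learning rate."""
    if step < warmup_steps:               # linear warmup: 0 -> 1
        return step / max(1, warmup_steps)
    if step < total_steps - decay_steps:  # stable plateau at peak LR
        return 1.0
    return max(0.0, (total_steps - step) / decay_steps)  # linear decay: 1 -> 0

scheduler = LambdaLR(optimizer, lr_lambda=wsd)
# Training loop (not shown): forward/backward, optimizer.step(), scheduler.step().
```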
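Similarly, the effect of raising RoPE theta from 10,000 to 1,000,000 in the global layers can be illustrated with the standard rotary-frequency formula: a larger base stretches the longest rotary wavelength far beyond the pretraining context, which is what allows extrapolation toward 128,000 tokens. The snippet below is generic, not code from this repository, and the head dimension of 64 is an assumption.

```python
import math
import torch

def rope_wavelengths(theta: float, head_dim: int = 64) -> torch.Tensor:
    """Per-dimension-pair rotation periods (in tokens) for standard RoPE."""
    i = torch.arange(0, head_dim, 2, dtype=torch.float32)
    inv_freq = theta ** (-i / head_dim)  # theta^(-2k/d)
    return 2 * math.pi / inv_freq        # wavelength = 2*pi / angular frequency

short = rope_wavelengths(10_000.0)     # base used for the 1,024-token phase
long_ = rope_wavelengths(1_000_000.0)  # base used in global layers for long context

# The slowest-rotating pair roughly bounds how far apart two positions can be
# before their rotary phases wrap around and become ambiguous.
print(f"longest wavelength @ theta=1e4: {short.max().item():,.0f} tokens")  # ~47k
print(f"longest wavelength @ theta=1e6: {long_.max().item():,.0f} tokens")  # ~4.1M
```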