Improve model card: add `library_name` and primary paper link (#2)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED
@@ -1,23 +1,25 @@
 ---
-language:
-- fi
-license: apache-2.0
-tags:
-- finnish
-- llama
 datasets:
 - Finnish-NLP/CulturaX_fi_cleaned
 - Finnish-NLP/HPLT_1.2_fi_cleaned
 - Finnish-NLP/wikipedia_20231101_fi_cleaned
 - Finnish-NLP/Reddit_fi_2006_2022
 - intfloat/multilingual_cc_news
-
+language:
+- fi
+license: apache-2.0
 pipeline_tag: text-generation
-
+tags:
+- finnish
+- llama
+inference: false
+library_name: transformers
 ---
 
 # Ahma-3B for Finnish
 
+This model was presented in the paper [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264).
+
 Ahma-3B is 3B parameter decoder-only transformer model based on Meta's Llama (v1) architecture pretrained from scratch on Finnish language. Original Llama model architecture was introduced in
 [this paper](https://arxiv.org/abs/2302.13971)
 and first released at [this page](https://github.com/facebookresearch/llama).
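Note on the metadata added above: `library_name: transformers` tells the Hub which library the checkpoint loads with, `inference: false` disables the hosted inference widget, and the existing `pipeline_tag: text-generation` keeps the model listed under text generation. A minimal loading sketch with the Transformers library follows; the repo id `Finnish-NLP/Ahma-3B` is an assumption for illustration, since the diff itself does not name the repository.

```python
# Minimal sketch of what `library_name: transformers` implies for users.
# The repo id "Finnish-NLP/Ahma-3B" is assumed here, not stated in the diff.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B")
model = AutoModelForCausalLM.from_pretrained("Finnish-NLP/Ahma-3B")

inputs = tokenizer("Suomen pääkaupunki on", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```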
@@ -62,7 +64,11 @@ system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti.
 
 
 def format_prompt(prompt: str) -> str:
-    prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
+    prompt = f" [INST] <<SYS>>
+{system_prompt.strip()}
+<</SYS>>
+
+{prompt.strip()} [/INST] "
     return prompt
 
 
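For context, `format_prompt` above wraps a user message in the Llama 2 style chat template ([INST] and <<SYS>> markers) used in the card's usage example. Below is a small self-contained sketch of the same template, written in single-line form so the f-string stays valid Python on its own; the `system_prompt` here is trimmed to the part visible in the hunk header.

```python
# Sketch of the prompt template shown above. The card defines a longer
# Finnish system message; this one is trimmed to the visible part.
system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti."


def format_prompt(prompt: str) -> str:
    # Llama 2 style chat format: system message between <<SYS>> tags,
    # user message wrapped in [INST] ... [/INST].
    return f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "


print(format_prompt("Mikä on Suomen pääkaupunki?"))
```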
@@ -144,27 +150,27 @@ The final training dataset had 23 billion words (calculated with regex "\w+") and
 The first stage:
 |Dataset | Words | Ratio |
 |:-----------------------------|:------------|:-------------|
-|CulturaX | 12.820B | 59.88
-|HPLT v1.2 | 5.034B | 23.51
-|Suomi24 | 3.018B | 14.09
-|Reddit | 0.141B | 0.66
-|CC-News | 0.311B | 1.45
-|FI news corpus | 0.004B | 0.02
-|Project Lönnrot | 0.083B | 0.39
-|**TOTAL** | **21.410B** | **100.0
+|CulturaX | 12.820B | 59.88% |
+|HPLT v1.2 | 5.034B | 23.51% |
+|Suomi24 | 3.018B | 14.09% |
+|Reddit | 0.141B | 0.66% |
+|CC-News | 0.311B | 1.45% |
+|FI news corpus | 0.004B | 0.02% |
+|Project Lönnrot | 0.083B | 0.39% |
+|**TOTAL** | **21.410B** | **100.0%** |
 
 
 The second stage:
 |Dataset | Words | Ratio |
 |:--------------------------------------------------------------|:------------|:------------|
-|CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48
-|Wikipedia | 0.095B | 2.34
-|STT | 0.253B | 6.23
-|Yle | 0.212B | 5.22
-|Finnish parliament speeches | 0.021B | 0.52
-|Finnish higher education public theses | 0.855B | 21.07
-|Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14
-|**TOTAL** | **4.059B** | **100.0
+|CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48% |
+|Wikipedia | 0.095B | 2.34% |
+|STT | 0.253B | 6.23% |
+|Yle | 0.212B | 5.22% |
+|Finnish parliament speeches | 0.021B | 0.52% |
+|Finnish higher education public theses | 0.855B | 21.07% |
+|Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14% |
+|**TOTAL** | **4.059B** | **100.0%** |
 
 ## Training procedure
 
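As a quick cross-check of the ratio columns in the tables above, each percentage is the dataset's word count divided by the stage total. A short sketch using the stage 1 numbers (values in billions of words, copied from the table):

```python
# Recompute the stage 1 "Ratio" column from the "Words" column above.
# Word counts are in billions; the table rounds the total to 21.410B.
stage1_words = {
    "CulturaX": 12.820,
    "HPLT v1.2": 5.034,
    "Suomi24": 3.018,
    "Reddit": 0.141,
    "CC-News": 0.311,
    "FI news corpus": 0.004,
    "Project Lönnrot": 0.083,
}

total = sum(stage1_words.values())
for name, words in stage1_words.items():
    print(f"{name}: {100 * words / total:.2f}%")  # e.g. CulturaX -> 59.88%
```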