Improve model card: add `library_name` and primary paper link (#2)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED
@@ -1,23 +1,25 @@
 ---
-language:
-- fi
-license: apache-2.0
-tags:
-- finnish
-- llama
 datasets:
 - Finnish-NLP/CulturaX_fi_cleaned
 - Finnish-NLP/HPLT_1.2_fi_cleaned
 - Finnish-NLP/wikipedia_20231101_fi_cleaned
 - Finnish-NLP/Reddit_fi_2006_2022
 - intfloat/multilingual_cc_news
-
+language:
+- fi
+license: apache-2.0
 pipeline_tag: text-generation
-
+tags:
+- finnish
+- llama
+inference: false
+library_name: transformers
 ---
 
 # Ahma-3B for Finnish
 
+This model was presented in the paper [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264).
+
 Ahma-3B is 3B parameter decoder-only transformer model based on Meta's Llama (v1) architecture pretrained from scratch on Finnish language. Original Llama model architecture was introduced in
 [this paper](https://arxiv.org/abs/2302.13971)
 and first released at [this page](https://github.com/facebookresearch/llama).
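Note on the metadata added above: `library_name: transformers` tells the Hub which library the checkpoint loads with, `inference: false` disables the hosted inference widget, and the existing `pipeline_tag: text-generation` keeps the model listed under text generation. A minimal loading sketch with the Transformers library follows; the repo id `Finnish-NLP/Ahma-3B` is an assumption for illustration, since the diff itself does not name the repository.

```python
# Minimal sketch of what `library_name: transformers` implies for users.
# The repo id "Finnish-NLP/Ahma-3B" is assumed here, not stated in the diff.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B")
model = AutoModelForCausalLM.from_pretrained("Finnish-NLP/Ahma-3B")

inputs = tokenizer("Suomen pääkaupunki on", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```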
@@ -62,7 +64,11 @@ system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti.
 
 
 def format_prompt(prompt: str) -> str:
-    prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
+    prompt = f" [INST] <<SYS>>
+{system_prompt.strip()}
+<</SYS>>
+
+{prompt.strip()} [/INST] "
     return prompt
 
 
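For context, `format_prompt` above wraps a user message in the Llama 2 style chat template ([INST] and <<SYS>> markers) used in the card's usage example. Below is a small self-contained sketch of the same template, written in single-line form so the f-string stays valid Python on its own; the `system_prompt` here is trimmed to the part visible in the hunk header.

```python
# Sketch of the prompt template shown above. The card defines a longer
# Finnish system message; this one is trimmed to the visible part.
system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti."


def format_prompt(prompt: str) -> str:
    # Llama 2 style chat format: system message between <<SYS>> tags,
    # user message wrapped in [INST] ... [/INST].
    return f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "


print(format_prompt("Mikä on Suomen pääkaupunki?"))
```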
@@ -144,27 +150,27 @@ The final training dataset had 23 billion words (calculated with regex "\w+") and
 The first stage:
 |Dataset | Words | Ratio |
 |:-----------------------------|:------------|:-------------|
-|CulturaX | 12.820B | 59.88
-|HPLT v1.2 | 5.034B | 23.51
-|Suomi24 | 3.018B | 14.09
-|Reddit | 0.141B | 0.66
-|CC-News | 0.311B | 1.45
-|FI news corpus | 0.004B | 0.02
-|Project Lönnrot | 0.083B | 0.39
-|**TOTAL** | **21.410B** | **100.0
+|CulturaX | 12.820B | 59.88% |
+|HPLT v1.2 | 5.034B | 23.51% |
+|Suomi24 | 3.018B | 14.09% |
+|Reddit | 0.141B | 0.66% |
+|CC-News | 0.311B | 1.45% |
+|FI news corpus | 0.004B | 0.02% |
+|Project Lönnrot | 0.083B | 0.39% |
+|**TOTAL** | **21.410B** | **100.0%** |
 
 
 The second stage:
 |Dataset | Words | Ratio |
 |:--------------------------------------------------------------|:------------|:------------|
-|CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48
-|Wikipedia | 0.095B | 2.34
-|STT | 0.253B | 6.23
-|Yle | 0.212B | 5.22
-|Finnish parliament speeches | 0.021B | 0.52
-|Finnish higher education public theses | 0.855B | 21.07
-|Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14
-|**TOTAL** | **4.059B** | **100.0
+|CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48% |
+|Wikipedia | 0.095B | 2.34% |
+|STT | 0.253B | 6.23% |
+|Yle | 0.212B | 5.22% |
+|Finnish parliament speeches | 0.021B | 0.52% |
+|Finnish higher education public theses | 0.855B | 21.07% |
+|Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14% |
+|**TOTAL** | **4.059B** | **100.0%** |
 
 ## Training procedure
 
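As a quick cross-check of the ratio columns in the tables above, each percentage is the dataset's word count divided by the stage total. A short sketch using the stage 1 numbers (values in billions of words, copied from the table):

```python
# Recompute the stage 1 "Ratio" column from the "Words" column above.
# Word counts are in billions; the table rounds the total to 21.410B.
stage1_words = {
    "CulturaX": 12.820,
    "HPLT v1.2": 5.034,
    "Suomi24": 3.018,
    "Reddit": 0.141,
    "CC-News": 0.311,
    "FI news corpus": 0.004,
    "Project Lönnrot": 0.083,
}

total = sum(stage1_words.values())
for name, words in stage1_words.items():
    print(f"{name}: {100 * words / total:.2f}%")  # e.g. CulturaX -> 59.88%
```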