ibaucells committed (verified)
Commit b3bf026 · 1 Parent(s): 6c73c6d
.gitattributes CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  images/salamandra_header.png filter=lfs diff=lfs merge=lfs -text
+ images/corpus_languages.png filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -66,13 +66,11 @@ datasets:
  ![](./images/logo_alia_2.png)
  
  > [!WARNING]
- > **WARNING:** This is an intermediate checkpoint, as training is still ongoing.
- >
- > The weights will be promptly updated as soon as the training process is complete.
+ > **WARNING:** This is a base language model that has not undergone instruction tuning or alignment with human preferences. As a result, it may generate outputs that are inappropriate, misleading, biased, or unsafe. These risks can be mitigated through additional post-training stages, which are strongly recommended before deployment in any production system, especially for high-stakes applications.
  
  # ALIA-40b Model Card
  
- ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants. This model card corresponds to the 40B base version.
+ ALIA-40b is a highly multilingual model pre-trained from scratch that will be released with both base and instruction-tuned variants. This model card corresponds to the 40b base version.
  
  To visit the model cards of other model versions, please refer to the [Model Index](#model-index).
  
@@ -85,12 +83,12 @@ Along with the open weights, all training scripts and configuration files are ma
  
  ### Description
  
- Transformer-based decoder-only language model that has been pre-trained from scratch on 6.9 trillion tokens of highly curated data.
+ Transformer-based decoder-only language model that has been pre-trained from scratch on 9.37 trillion tokens of highly curated data.
  The pre-training corpus contains text in 35 European languages and code.
  
  ### Hyperparameters
  
- The full list of hyperparameters can be found [here](https://github.com/langtech-bsc/alia/blob/main/configs/bsc_40b.yaml).
+ The full list of hyperparameters can be found [here](https://github.com/langtech-bsc/alia/blob/main/configs).
  
  ### Architecture
  
@@ -101,7 +99,7 @@ The full list of hyperparameters can be found [here](https://github.com/langtech
  | Layers | 48 |
  | Hidden size | 8,192 |
  | Attention heads | 64 |
- | Context length | 4,096 |
+ | Context length | 32,768 |
  | Vocabulary size | 256,000 |
  | Precision | bfloat16 |
  | Embedding type | RoPE |
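
For orientation, the architecture table above maps onto a Hugging Face configuration roughly as follows. This is a minimal sketch assuming a Llama-style decoder-only implementation; the field names and the RoPE base frequency are our assumptions, and the `config.json` shipped with the checkpoint is authoritative.

```python
from transformers import LlamaConfig

# Sketch only: the architecture table expressed as a config
# (assumed Llama-style implementation, not confirmed by the repo).
config = LlamaConfig(
    num_hidden_layers=48,            # Layers
    hidden_size=8192,                # Hidden size
    num_attention_heads=64,          # Attention heads
    max_position_embeddings=32768,   # Context length
    vocab_size=256000,               # Vocabulary size
    torch_dtype="bfloat16",          # Precision
    rope_theta=10000.0,              # RoPE base frequency (assumed default)
)
```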
@@ -304,11 +302,11 @@ for output in outputs:
  ### Pretraining Data
  
  The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
- The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
+ The initial 1.6 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
  and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). This way, code and English data were downsampled by half,
  Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
- During the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
- This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
+ During the subsequent training, the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
+ This adjustment resulted in a total of 2.68 trillion tokens used across 2 epochs, distributed as outlined below:
  
  ![lang distrib](./images/corpus_languages.png)
  
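
As an illustration of the resampling scheme described above, the sketch below applies the stated weights (0.5x for code and English, 2x for Spain's co-official languages, 1x otherwise) to per-language token counts. The helper names and the example counts are made up for illustration.

```python
# Illustrative sketch, not taken from the training code.
CO_OFFICIAL = {"es", "ca", "gl", "eu"}

def sampling_weight(lang: str) -> float:
    """Weights as described in the paragraph above."""
    if lang in ("en", "code"):
        return 0.5   # code and English downsampled by half
    if lang in CO_OFFICIAL:
        return 2.0   # Spain's co-official languages oversampled by 2x
    return 1.0       # everything else kept in original proportion

def resample(token_counts: dict[str, float]) -> dict[str, float]:
    """Scale raw token counts by the language weights and renormalize to shares."""
    scaled = {lang: n * sampling_weight(lang) for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: n / total for lang, n in scaled.items()}

# Example with made-up counts (billions of tokens):
print(resample({"en": 800, "code": 400, "es": 300, "ca": 40, "de": 200}))
```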
@@ -347,11 +345,13 @@ Feel free to click the expand button below to see the full list of sources.
  | Croatian Web as Corpus 2.1 (hrWaC) | hr | Ljubešić & Klubička, 2014 |
  | DaNewsroom | da | Varab & Schluter, 2020 |
  | Danish GigaWord | da | Strømberg-Derczynski et al., 2021 |
+ | Dolmino-mix-1124 (subset without synthetically generated data and proprietary licenses) | en | Team OLMo, 2024 |
  | DK-CLARIN Reference Corpus of General Danish | da | [Link](https://korpus.dsl.dk/clarin/) |
  | Estonian National Corpus 2021 (ENC) | et | Koppel & Kallas, 2022 |
  | Estonian Reference Corpus (ERC) | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
  | EusCrawl (w/o Wikipedia or NC-licenses) | eu | Artetxe et al., 2022 |
  | FineWeb-Edu (350BT subset) | en | Penedo et al., 2024 |
+ | FineWeb2 (ad hoc subset of 178BT) | ar, as, bg, ca, cs, cy, da, de, el, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk | Penedo et al., 2024 |
  | French Public Domain Books (French-PD) | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Books) |
  | French Public Domain Newspapers (French-PD) | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
  | German Web as Corpus (DeWaC) | de | [Link](https://docs.sslmit.unibo.it/doku.php?id=corpora:dewac) |
@@ -395,6 +395,7 @@ Feel free to click the expand button below to see the full list of sources.
  | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
  | Yle Finnish News Archive (Yle-News) | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
  
+
  To consult the data summary document with the respective licences, please send an e-mail to [email protected].
  
  <details>
@@ -439,6 +440,7 @@ To consult the data summary document with the respective licences, please send a
  - Soldaini, L., & Lo, K. (2023). peS2o (Pretraining Efficiently on S2ORC) Dataset. Allen Institute for AI.
  - Strømberg-Derczynski, L., Ciosici, M., Baglini, R., Christiansen, M. H., Dalsgaard, J. A., Fusaroli, R., Henrichsen, P. J., Hvingelby, R., Kirkedal, A., Kjeldsen, A. S., Ladefoged, C., Nielsen, F. Å., Madsen, J., Petersen, M. L., Rystrøm, J. H., & Varab, D. (2021). The Danish Gigaword Corpus. Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 413–421. [Link](https://aclanthology.org/2021.nodalida-main.46)
  - Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. 208–220. [Link](https://doi.org/10.18653/v1/2023.trustnlp-1.18)
+ - Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, & Hannaneh Hajishirzi. (2024). 2 OLMo 2 Furious. [Link](https://arxiv.org/abs/2501.00656)
  - Varab, D., & Schluter, N. (2020). DaNewsroom: A Large-scale Danish Summarisation Dataset. Proceedings of The 12th Language Resources and Evaluation Conference, 6731–6739. [Link](https://www.aclweb.org/anthology/2020.lrec-1.831)
  - Váradi, T., Nyéki, B., Koeva, S., Tadić, M., Štefanec, V., Ogrodniczuk, M., Nitoń, B., Pezik, P., Barbu Mititelu, V., Irimia, E., Mitrofan, M., Tufiș, D., Garabík, R., Krek, S., & Repar, A. (2022). Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 100–108). European Language Resources Association. [Link](https://aclanthology.org/2022.lrec-1.11)
  - Wagner Filho, J. A., Wilkens, R., Idiart, M., & Villavicencio, A. (2018). The BrWaC Corpus: A New Open Resource for Brazilian Portuguese. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
@@ -454,6 +456,8 @@ To consult the data summary document with the respective licences, please send a
  
  </details>
  
+ In the final pre-training phase, we used a high-quality subset of 160 billion tokens. Additionally, to extend the model's context window to 32K tokens, a further 6.3 billion tokens were processed using the Llama 3.1 RoPE interpolation strategy.
+
  We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
  
  <details>
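
The context extension mentioned above can be expressed through transformers' `rope_scaling` configuration. The sketch below is only an illustration of the Llama 3.1 frequency-interpolation strategy named in the text: the scaling factors are placeholder assumptions, not the values actually used for this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM

# Hedged sketch: Llama 3.1-style RoPE scaling for a 4k -> 32k extension.
# All numeric factors below are assumptions for illustration only.
model = AutoModelForCausalLM.from_pretrained(
    "BSC-LT/ALIA-40b",
    torch_dtype=torch.bfloat16,
    rope_scaling={
        "rope_type": "llama3",                    # Llama 3.1 interpolation strategy
        "factor": 8.0,                            # assumed: 32768 / 4096
        "low_freq_factor": 1.0,                   # assumed
        "high_freq_factor": 4.0,                  # assumed
        "original_max_position_embeddings": 4096, # pre-extension context length
    },
)
```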
@@ -735,36 +739,36 @@ All results reported below are on a 5-shot setting.
  <td>Commonsense Reasoning</td>
  <td>xstorycloze_es</td>
  <td>acc</td>
- <td>78.89</td>
+ <td>79.5</td>
  </tr>
  <tr>
  <td rowspan="2">NLI</td>
  <td>wnli_es</td>
  <td>acc</td>
- <td>60.56</td>
+ <td>64.8</td>
  </tr>
  <tr>
  <td>xnli_es</td>
  <td>acc</td>
- <td>48.31</td>
+ <td>50.4</td>
  </tr>
  <tr>
  <td>Paraphrasing</td>
  <td>paws_es</td>
  <td>acc</td>
- <td>67.50</td>
+ <td>63.8</td>
  </tr>
  <tr>
  <td>QA</td>
  <td>xquad_es</td>
  <td>acc</td>
- <td>74.03</td>
+ <td>73.4</td>
  </tr>
  <tr>
  <td>Translation</td>
  <td>flores_es</td>
  <td>bleu</td>
- <td>25.12</td>
+ <td>25.9</td>
  </tr>
  </tbody>
  </table>
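
The 5-shot results in these tables follow the LM Evaluation Harness setup. Below is a minimal sketch of reproducing a few of the Spanish tasks with the harness's Python API (lm-eval >= 0.4); task names follow the table above, but their availability may depend on the installed harness version.

```python
import lm_eval

# Sketch only: 5-shot evaluation of a few Spanish tasks from the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=BSC-LT/ALIA-40b,dtype=bfloat16",
    tasks=["xstorycloze_es", "xnli_es", "paws_es"],
    num_fewshot=5,  # all results in these tables are 5-shot
)
print(results["results"])
```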
@@ -783,66 +787,66 @@ All results reported below are on a 5-shot setting.
  <td rowspan="2">Commonsense Reasoning</td>
  <td>copa_ca</td>
  <td>acc</td>
- <td>85.20</td>
+ <td>86.0</td>
  </tr>
  <tr>
  <td>xstorycloze_ca</td>
  <td>acc</td>
- <td>78.09</td>
+ <td>80.0</td>
  </tr>
  <tr>
  <td rowspan="2">NLI</td>
  <td>wnli_ca</td>
  <td>acc</td>
- <td>60.56</td>
+ <td>70.0</td>
  </tr>
  <tr>
  <td>xnli_ca</td>
  <td>acc</td>
- <td>49.84</td>
+ <td>50.7</td>
  </tr>
  <tr>
  <td rowspan="2">Paraphrasing</td>
  <td>parafraseja</td>
  <td>acc</td>
- <td>64.33</td>
+ <td>67.8</td>
  </tr>
  <tr>
  <td>paws_ca</td>
  <td>acc</td>
- <td>67.35</td>
+ <td>67.5</td>
  </tr>
  <tr>
  <td rowspan="5">QA</td>
  <td>arc_ca_easy</td>
  <td>acc</td>
- <td>78.87</td>
+ <td>81.0</td>
  </tr>
  <tr>
  <td>arc_ca_challenge</td>
  <td>acc</td>
- <td>51.62</td>
+ <td>53.0</td>
  </tr>
  <tr>
  <td>openbookqa_ca</td>
  <td>acc</td>
- <td>38.40</td>
+ <td>41.6</td>
  </tr>
  <tr>
  <td>piqa_ca</td>
  <td>acc</td>
- <td>74.86</td>
+ <td>75.8</td>
  </tr>
  <tr>
  <td>siqa_ca</td>
  <td>acc</td>
- <td>53.07</td>
+ <td>53.9</td>
  </tr>
  <tr>
  <td>Translation</td>
  <td>flores_ca</td>
  <td>bleu</td>
- <td>32.97</td>
+ <td>33.7</td>
  </tr>
  </tbody></table>
  
@@ -860,51 +864,51 @@ All results reported below are on a 5-shot setting.
  <td rowspan="2">Commonsense Reasoning</td>
  <td>xcopa_eu</td>
  <td>acc</td>
- <td>74.20</td>
+ <td>78.8</td>
  </tr>
  <tr>
  <td>xstorycloze_eu</td>
  <td>acc</td>
- <td>70.75</td>
+ <td>72.2</td>
  </tr>
  <tr>
  <td rowspan="2">NLI</td>
  <td>wnli_eu</td>
  <td>acc</td>
- <td>54.93</td>
+ <td>66.2</td>
  </tr>
  <tr>
  <td>xnli_eu</td>
  <td>acc</td>
- <td>46.54</td>
+ <td>45.9</td>
  </tr>
  <tr>
  <td rowspan="3">QA</td>
  <td>eus_exams</td>
  <td>acc</td>
- <td>55.12</td>
+ <td>61.5</td>
  </tr>
  <tr>
  <td>eus_proficiency</td>
  <td>acc</td>
- <td>54.25</td>
+ <td>60.4</td>
  </tr>
  <tr>
  <td>eus_trivia</td>
  <td>acc</td>
- <td>63.62</td>
+ <td>67.2</td>
  </tr>
  <tr>
  <td>Reading Comprehension</td>
  <td>eus_reading</td>
  <td>acc</td>
- <td>52.56</td>
+ <td>61.1</td>
  </tr>
  <tr>
  <td>Translation</td>
  <td>flores_eu</td>
  <td>bleu</td>
- <td>19.85</td>
+ <td>21.3</td>
  </tr>
  </tbody></table>
  
@@ -922,24 +926,24 @@ All results reported below are on a 5-shot setting.
  <td rowspan="2">Paraphrasing</td>
  <td>parafrases_gl</td>
  <td>acc</td>
- <td>60.20</td>
+ <td>60.2</td>
  </tr>
  <tr>
  <td>paws_gl</td>
  <td>acc</td>
- <td>69.10</td>
+ <td>63.0</td>
  </tr>
  <tr>
  <td>QA</td>
  <td>openbookqa_gl</td>
  <td>acc</td>
- <td>35.00</td>
+ <td>36.6</td>
  </tr>
  <tr>
  <td>Translation</td>
  <td>flores_gl</td>
  <td>bleu</td>
- <td>30.19</td>
+ <td>31.2</td>
  </tr>
  </tbody>
  </table>
@@ -958,66 +962,75 @@ All results reported below are on a 5-shot setting.
  <td rowspan="2">Commonsense Reasoning</td>
  <td>copa</td>
  <td>acc</td>
- <td>91</td>
+ <td>94.0</td>
  </tr>
  <tr>
  <td>xstorycloze_en</td>
  <td>acc</td>
- <td>82.20</td>
+ <td>83.2</td>
  </tr>
  <tr>
  <td rowspan="2">NLI</td>
  <td>wnli</td>
  <td>acc</td>
- <td>61.97</td>
+ <td>67.6</td>
  </tr>
  <tr>
  <td>xnli_en</td>
  <td>acc</td>
- <td>51.77</td>
+ <td>57.0</td>
  </tr>
  <tr>
  <td>Paraphrasing</td>
  <td>paws *</td>
  <td>acc</td>
- <td>64.65</td>
+ <td>68.5</td>
  </tr>
  <tr>
  <td rowspan="6">QA</td>
  <td>arc_easy</td>
  <td>acc</td>
- <td>85.40</td>
+ <td>86.5</td>
  </tr>
  <tr>
  <td>arc_challenge</td>
  <td>acc</td>
- <td>58.70</td>
+ <td>59.4</td>
  </tr>
  <tr>
  <td>openbookqa</td>
  <td>acc</td>
- <td>37.80</td>
+ <td>38.4</td>
  </tr>
  <tr>
  <td>piqa</td>
  <td>acc</td>
- <td>81.77</td>
+ <td>81.7</td>
  </tr>
  <tr>
  <td>social_iqa</td>
  <td>acc</td>
- <td>53.48</td>
+ <td>53.8</td>
  </tr>
  <tr>
- <td>squad_en **</td>
+ <td>xquad_en</td>
  <td>acc</td>
- <td>81.53</td>
+ <td>80.7</td>
  </tr>
  </tbody></table>
  
+
  \* The current LM Evaluation Harness implementation lacks correct pre-processing; these results were obtained with adequate pre-processing.
  
- \*\* This task is not yet available in the official Harness, we hope to add it soon.
+ ### Long Context Evaluation
+
+ To assess the long-context capabilities of our model, we conduct a "needle in a haystack" test with the following configuration:
+
+ - **Needle Phrase**: *"The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."*
+ - **Retrieval Question**: *"The best thing to do in San Francisco is"*
+ - **Evaluator**: [prometheus-8x7b-v2.0](https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0), used as the evaluation judge.
+
+ ![](./images/LongContext_eval.png)
  
  ---
  
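As a rough illustration of the needle-in-a-haystack setup described in the hunk above, the sketch below builds a single probe at a chosen insertion depth. The helper and its parameters are hypothetical, not the evaluation code actually used; in the real test, the model's answer to each probe is scored by the judge model named above.

```python
# Hypothetical sketch of constructing one needle-in-a-haystack probe.
NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = "The best thing to do in San Francisco is"

def build_probe(filler_sentences: list[str], n_ctx_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    haystack = filler_sentences[:n_ctx_sentences]
    haystack.insert(int(depth * len(haystack)), NEEDLE)
    return " ".join(haystack) + f"\n\nQuestion: {QUESTION}"

# Example: a probe with the needle buried halfway into 1,000 filler sentences.
probe = build_probe(["Filler sentence."] * 2000, n_ctx_sentences=1000, depth=0.5)
```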
@@ -1034,13 +1047,13 @@ We highlight that these results can be expected from a pretrained model that has
  ## Additional information
  
  ### Author
- The Language Technologies Unit from Barcelona Supercomputing Center.
+ The Language Technologies Lab at the Barcelona Supercomputing Center.
  
  ### Contact
  For further information, please send an email to <[email protected]>.
  
  ### Copyright
- Copyright(c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.
+ Copyright (c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
  
  ### Funding
  This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
@@ -1084,6 +1097,6 @@ The Barcelona Supercomputing Center, as the owner and creator of the model, shal
  ## Model Index
  |Model|Base|Instruct|
  |:---:|:---:|:---:|
- |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
- |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
- |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
+ |2b| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
+ |7b| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
+ |40b| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
Release note v0.9.md ADDED
@@ -0,0 +1 @@
+ In this release, the ALIA-40b checkpoint has completed the main pre-training (1.6 epochs over 2.4 trillion tokens and 2 epochs over 2.68 trillion tokens, roughly 9.2 trillion tokens in total) using a 4k context window, which was unfinished in the previous release. Additionally, it has undergone a preliminary final pre-training stage with a subset of high-quality data and a context extension to 32k tokens.
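
A quick sanity check of the token accounting in this note, using the figures from the model card:

```python
# Main pre-training: 1.6 epochs over 2.4T tokens plus 2 epochs over 2.68T tokens.
main = 1.6 * 2.4e12 + 2 * 2.68e12   # = 9.2e12 tokens
# Final phase: 160B high-quality tokens plus 6.3B for the 32k context extension.
total = main + 160e9 + 6.3e9        # ~9.37e12, matching the Description section
print(f"{main:.3g} main, {total:.3g} total")
```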
images/corpus_languages.png CHANGED
Git LFS Details
  • SHA256: a66685fd95c43e997ba20620231b00d43e1085aba09af54c69b3a414b6a25f73
  • Pointer size: 131 Bytes
  • Size of remote file: 352 kB
images/salamandra_header.png CHANGED
Git LFS Details (before)
  • SHA256: de12bec43f22c0c41b45b84425759d6c9e38ecdf06d58519f048f10fe6e826de
  • Pointer size: 133 Bytes
  • Size of remote file: 11.1 MB
Git LFS Details (after)
  • SHA256: 4be1584a2e8cb549a8740c7893c75a638510289215482968e665566d39f4cfb1
  • Pointer size: 128 Bytes
  • Size of remote file: 133 Bytes
model-00001-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3ff526e46d494258a656971416da3cc1510464c9c6f32ee47a2a926ca3d75a7a
+ oid sha256:86eea25bb9c816c55fbe801a79689dfeb2b9444c9868ebcbe40fa1c7759a78de
  size 4898947792
model-00002-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f02d3cbb8e0f9fbde20b806f90233b3dd5c9260b22a6fb35c03d5d717aca9471
+ oid sha256:bab17a5d55a0ea3b1f6e7c49fca860488b5d4ca723bfc1c0c660db758cf7acee
  size 4932603064
model-00003-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9635e1d0c5287498d473c264157de7b60c0364d31f8df1813b68dbda6960b9b7
+ oid sha256:4f51af4a82d6a6ce07157d81688cee43f9d016ed6148e15daafefef5da41c97e
  size 4932636064
model-00004-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ceb246808ed7507cf405f47f901b17f051c7ca3915d5b4e5e5b1ab9b44b2509f
+ oid sha256:7d0f30ce33bc6731ea6d58ed7d337e446dfe4abe6a50469abbe1bdb40da97523
  size 4831940128
model-00005-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:5ee56cb425b2fc75365979ac567e67c22eeb4b49b6edaa4a0c72905094bf5ab4
+ oid sha256:c383044bd870e52b5d556b0312a7593dda826ea6fd40cc88d57f7ca054fa8a0e
  size 4932603096
model-00006-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a76cd8c3d64a6d6db3cad041528d0452d8bd853de7969429a2ca7ae448649ad6
+ oid sha256:30a71fb6ae629d637aa5e6dbc1482f31b8402099ea7dd6a7f5897bae75901909
  size 4932603088
model-00007-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:98275b7e8711ec35c7a48591daa33efacc062d0af056bbbcafffa4c5349700cc
+ oid sha256:58b5d28b6bf820608f4c7590611e91291037756a209c957b659cf5cdc306c093
  size 4932636096
model-00008-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c8c14df494eedafb26c1ddbaf2f64c5c7b492aa1562b3bab896de10357d68dc7
+ oid sha256:ff854c6b09431715b07ad1685fc4e902d68326d2343425f1fff6781cc86cc66b
  size 4831940160
model-00009-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b9970d40aa9954e8ace15a559111f090c76a177d05d183ccf2b975d584319d39
+ oid sha256:5f0579b928cc75e1d628d4d8d67d0f414a8550c18fe9063f39023498a601cc8e
  size 4932603096
model-00010-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c6b26875ba10aed59dfca60fd383773f83bc5de4606b1e020a8d6cdeb823b8a7
+ oid sha256:adc5fbe7645d6ab910a3314b23eb42bf793b751d7940fcb73d4d938c42c3047d
  size 4932603088
model-00011-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1c1dec561c5b21c305f50cc33ce688ccff660f22db8266f4edda6408a5b2e9e0
+ oid sha256:2c814623773121f9be5579ed775ae49835e16f1e43ea16f39198798c5d9a2a32
  size 4932636096
model-00012-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c34abe63aede8a40b6c1cceaa1f2048ded19a7b96a8488c67da874453694e330
+ oid sha256:91fe722802a3dcac49de528f280a2dbef7c7a0cc9e403aecdaeeb836b13fa375
  size 4831940160
model-00013-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:28d9169546c4d6fb4a5161ec273d1848a8e16ead6c239dc2e145774272f47fb2
+ oid sha256:1b7e784fae5452f080f3cbe4779388843fb0b92f02250f799e3035c119b1c8e5
  size 4932603096
model-00014-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:93ff37553fd5ac3066df67c15f2300381b804b56e778e61d479295f69f8c0080
+ oid sha256:8e20dc5e90c2a64710cbd50728b5228d916082d883f274977af599e2b3f3bbb0
  size 4932603088
model-00015-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:46049c9ce047e7db0ccfff44cb4d4b241bb4fb69619b1882709a2ccc5c0ddbdf
+ oid sha256:9f32f39b3ae508d99be07600a6a96bc37c96be9dc10ab5ee774aaca4d5ddd1e3
  size 4932636096
model-00016-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:5e7c1e9cfd124c531fecfb8a39d4344cdc1ee2b2e33fc4783ce8fae6a59d8179
+ oid sha256:c95947ba9f96a43078f1984c4b74f58a9ad52cb5c4f83243b52ddc5fb29396d3
  size 3019983016
model-00017-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:90fc55c26a532d8eb9a96d15149b9391105466b120e0a0b496c7422b77228564
+ oid sha256:60f09ac53921db320447f39598c5b63ceec67e63daad6dff8ad2e19389fddea5
  size 4194304128
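
To verify a downloaded shard against the LFS oids listed above, a small sketch (the expected hash below is the first shard's new oid from this commit):

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while blob := f.read(chunk):
            h.update(blob)
    return h.hexdigest()

assert sha256_of("model-00001-of-00017.safetensors") == (
    "86eea25bb9c816c55fbe801a79689dfeb2b9444c9868ebcbe40fa1c7759a78de"
)
```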