ibaucells committed (verified)
Commit b3bf026 · 1 Parent(s): 6c73c6d
.gitattributes CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  images/salamandra_header.png filter=lfs diff=lfs merge=lfs -text
+ images/corpus_languages.png filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -66,13 +66,11 @@ datasets:
  ![](./images/logo_alia_2.png)
  
  > [!WARNING]
- > **WARNING:** This is an intermediate checkpoint, as training is still ongoing.
- >
- > The weights will be promptly updated as soon as the training process is complete.
+ > **WARNING:** This is a base language model that has not undergone instruction tuning or alignment with human preferences. As a result, it may generate outputs that are inappropriate, misleading, biased, or unsafe. These risks can be mitigated through additional post-training stages, which are strongly recommended before deployment in any production system, especially for high-stakes applications.
  
  # ALIA-40b Model Card
  
- ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants. This model card corresponds to the 40B base version.
+ ALIA-40b is a highly multilingual model pre-trained from scratch that will be released with both base and instruction-tuned variants. This model card corresponds to the 40b base version.
  
  To visit the model cards of other model versions, please refer to the [Model Index](#model-index).
  
@@ -85,12 +83,12 @@ Along with the open weights, all training scripts and configuration files are ma
  
  ### Description
  
- Transformer-based decoder-only language model that has been pre-trained from scratch on 6.9 trillion tokens of highly curated data.
+ Transformer-based decoder-only language model that has been pre-trained from scratch on 9.37 trillion tokens of highly curated data.
  The pre-training corpus contains text in 35 European languages and code.
  
  ### Hyperparameters
  
- The full list of hyperparameters can be found [here](https://github.com/langtech-bsc/alia/blob/main/configs/bsc_40b.yaml).
+ The full list of hyperparameters can be found [here](https://github.com/langtech-bsc/alia/blob/main/configs).
  
  ### Architecture
  
@@ -101,7 +99,7 @@ The full list of hyperparameters can be found [here](https://github.com/langtech
  | Layers | 48 |
  | Hidden size | 8,192 |
  | Attention heads | 64 |
- | Context length | 4,096 |
+ | Context length | 32,768 |
  | Vocabulary size | 256,000 |
  | Precision | bfloat16 |
  | Embedding type | RoPE |
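
For orientation, the architecture table above maps onto a Hugging Face configuration roughly as follows. This is a minimal sketch assuming a Llama-style decoder-only implementation; the field names and the RoPE base frequency are our assumptions, and the `config.json` shipped with the checkpoint is authoritative.

```python
from transformers import LlamaConfig

# Sketch only: the architecture table expressed as a config
# (assumed Llama-style implementation, not confirmed by the repo).
config = LlamaConfig(
    num_hidden_layers=48,            # Layers
    hidden_size=8192,                # Hidden size
    num_attention_heads=64,          # Attention heads
    max_position_embeddings=32768,   # Context length
    vocab_size=256000,               # Vocabulary size
    torch_dtype="bfloat16",          # Precision
    rope_theta=10000.0,              # RoPE base frequency (assumed default)
)
```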
@@ -304,11 +302,11 @@ for output in outputs:
  ### Pretraining Data
  
  The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
- The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
+ The initial 1.6 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
  and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). This way, code and English data were downsampled by half,
  Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
- During the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
- This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
+ During the subsequent training, the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
+ This adjustment resulted in a total of 2.68 trillion tokens used across 2 epochs, distributed as outlined below:
  
  ![lang distrib](./images/corpus_languages.png)
  
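
As an illustration of the resampling scheme described above, the sketch below applies the stated weights (0.5x for code and English, 2x for Spain's co-official languages, 1x otherwise) to per-language token counts. The helper names and the example counts are made up for illustration.

```python
# Illustrative sketch, not taken from the training code.
CO_OFFICIAL = {"es", "ca", "gl", "eu"}

def sampling_weight(lang: str) -> float:
    """Weights as described in the paragraph above."""
    if lang in ("en", "code"):
        return 0.5   # code and English downsampled by half
    if lang in CO_OFFICIAL:
        return 2.0   # Spain's co-official languages oversampled by 2x
    return 1.0       # everything else kept in original proportion

def resample(token_counts: dict[str, float]) -> dict[str, float]:
    """Scale raw token counts by the language weights and renormalize to shares."""
    scaled = {lang: n * sampling_weight(lang) for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: n / total for lang, n in scaled.items()}

# Example with made-up counts (billions of tokens):
print(resample({"en": 800, "code": 400, "es": 300, "ca": 40, "de": 200}))
```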
@@ -347,11 +345,13 @@ Feel free to click the expand button below to see the full list of sources.
  | Croatian Web as Corpus 2.1 (hrWaC) | hr | Ljubešić & Klubička, 2014 |
  | DaNewsroom | da | Varab & Schluter, 2020 |
  | Danish GigaWord | da | Strømberg-Derczynski et al., 2021 |
+ | Dolmino-mix-1124 (subset without synthetically generated data and proprietary licenses) | en | Team OLMo, 2024 |
  | DK-CLARIN Reference Corpus of General Danish | da | [Link](https://korpus.dsl.dk/clarin/) |
  | Estonian National Corpus 2021 (ENC) | et | Koppel & Kallas, 2022 |
  | Estonian Reference Corpus (ERC) | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
  | EusCrawl (w/o Wikipedia or NC-licenses) | eu | Artetxe et al., 2022 |
  | FineWeb-Edu (350BT subset) | en | Penedo et al., 2024 |
+ | FineWeb2 (ad hoc subset of 178BT) | ar, as, bg, ca, cs, cy, da, de, el, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk | Penedo et al., 2024 |
  | French Public Domain Books (French-PD) | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Books) |
  | French Public Domain Newspapers (French-PD) | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
  | German Web as Corpus (DeWaC) | de | [Link](https://docs.sslmit.unibo.it/doku.php?id=corpora:dewac) |
@@ -395,6 +395,7 @@ Feel free to click the expand button below to see the full list of sources.
  | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
  | Yle Finnish News Archive (Yle-News) | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
  
+
  To consult the data summary document with the respective licences, please send an e-mail to [email protected].
  
  <details>
@@ -439,6 +440,7 @@ To consult the data summary document with the respective licences, please send a
  - Soldaini, L., & Lo, K. (2023). peS2o (Pretraining Efficiently on S2ORC) Dataset. Allen Institute for AI.
  - Strømberg-Derczynski, L., Ciosici, M., Baglini, R., Christiansen, M. H., Dalsgaard, J. A., Fusaroli, R., Henrichsen, P. J., Hvingelby, R., Kirkedal, A., Kjeldsen, A. S., Ladefoged, C., Nielsen, F. Å., Madsen, J., Petersen, M. L., Rystrøm, J. H., & Varab, D. (2021). The Danish Gigaword Corpus. Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 413–421. [Link](https://aclanthology.org/2021.nodalida-main.46)
  - Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. 208–220. [Link](https://doi.org/10.18653/v1/2023.trustnlp-1.18)
+ - Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, & Hannaneh Hajishirzi. (2024). 2 OLMo 2 Furious. [Link](https://arxiv.org/abs/2501.00656)
  - Varab, D., & Schluter, N. (2020). DaNewsroom: A Large-scale Danish Summarisation Dataset. Proceedings of The 12th Language Resources and Evaluation Conference, 6731–6739. [Link](https://www.aclweb.org/anthology/2020.lrec-1.831)
  - Váradi, T., Nyéki, B., Koeva, S., Tadić, M., Štefanec, V., Ogrodniczuk, M., Nitoń, B., Pezik, P., Barbu Mititelu, V., Irimia, E., Mitrofan, M., Tufiș, D., Garabík, R., Krek, S., & Repar, A. (2022). Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 100–108). European Language Resources Association. [Link](https://aclanthology.org/2022.lrec-1.11)
  - Wagner Filho, J. A., Wilkens, R., Idiart, M., & Villavicencio, A. (2018). The BrWaC Corpus: A New Open Resource for Brazilian Portuguese. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
@@ -454,6 +456,8 @@ To consult the data summary document with the respective licences, please send a
  
  </details>
  
+ In the final pre-training phase, we used a high-quality subset of 160 billion tokens. Additionally, to extend the model's context window to 32K tokens, a further 6.3 billion tokens were processed using the Llama 3.1 RoPE interpolation strategy.
+
  We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
  
  <details>
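
The context extension mentioned above can be expressed through transformers' `rope_scaling` configuration. The sketch below is only an illustration of the Llama 3.1 frequency-interpolation strategy named in the text: the scaling factors are placeholder assumptions, not the values actually used for this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM

# Hedged sketch: Llama 3.1-style RoPE scaling for a 4k -> 32k extension.
# All numeric factors below are assumptions for illustration only.
model = AutoModelForCausalLM.from_pretrained(
    "BSC-LT/ALIA-40b",
    torch_dtype=torch.bfloat16,
    rope_scaling={
        "rope_type": "llama3",                    # Llama 3.1 interpolation strategy
        "factor": 8.0,                            # assumed: 32768 / 4096
        "low_freq_factor": 1.0,                   # assumed
        "high_freq_factor": 4.0,                  # assumed
        "original_max_position_embeddings": 4096, # pre-extension context length
    },
)
```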
@@ -735,36 +739,36 @@ All results reported below are on a 5-shot setting.
  <td>Commonsense Reasoning</td>
  <td>xstorycloze_es</td>
  <td>acc</td>
- <td>78.89</td>
+ <td>79.5</td>
  </tr>
  <tr>
  <td rowspan="2">NLI</td>
  <td>wnli_es</td>
  <td>acc</td>
- <td>60.56</td>
+ <td>64.8</td>
  </tr>
  <tr>
  <td>xnli_es</td>
  <td>acc</td>
- <td>48.31</td>
+ <td>50.4</td>
  </tr>
  <tr>
  <td>Paraphrasing</td>
  <td>paws_es</td>
  <td>acc</td>
- <td>67.50</td>
+ <td>63.8</td>
  </tr>
  <tr>
  <td>QA</td>
  <td>xquad_es</td>
  <td>acc</td>
- <td>74.03</td>
+ <td>73.4</td>
  </tr>
  <tr>
  <td>Translation</td>
  <td>flores_es</td>
  <td>bleu</td>
- <td>25.12</td>
+ <td>25.9</td>
  </tr>
  </tbody>
  </table>
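
The 5-shot results in these tables follow the LM Evaluation Harness setup. Below is a minimal sketch of reproducing a few of the Spanish tasks with the harness's Python API (lm-eval >= 0.4); task names follow the table above, but their availability may depend on the installed harness version.

```python
import lm_eval

# Sketch only: 5-shot evaluation of a few Spanish tasks from the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=BSC-LT/ALIA-40b,dtype=bfloat16",
    tasks=["xstorycloze_es", "xnli_es", "paws_es"],
    num_fewshot=5,  # all results in these tables are 5-shot
)
print(results["results"])
```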
@@ -783,66 +787,66 @@ All results reported below are on a 5-shot setting.
  <td rowspan="2">Commonsense Reasoning</td>
  <td>copa_ca</td>
  <td>acc</td>
- <td>85.20</td>
+ <td>86.0</td>
  </tr>
  <tr>
  <td>xstorycloze_ca</td>
  <td>acc</td>
- <td>78.09</td>
+ <td>80.0</td>
  </tr>
  <tr>
  <td rowspan="2">NLI</td>
  <td>wnli_ca</td>
  <td>acc</td>
- <td>60.56</td>
+ <td>70.0</td>
  </tr>
  <tr>
  <td>xnli_ca</td>
  <td>acc</td>
- <td>49.84</td>
+ <td>50.7</td>
  </tr>
  <tr>
  <td rowspan="2">Paraphrasing</td>
  <td>parafraseja</td>
  <td>acc</td>
- <td>64.33</td>
+ <td>67.8</td>
  </tr>
  <tr>
  <td>paws_ca</td>
  <td>acc</td>
- <td>67.35</td>
+ <td>67.5</td>
  </tr>
  <tr>
  <td rowspan="5">QA</td>
  <td>arc_ca_easy</td>
  <td>acc</td>
- <td>78.87</td>
+ <td>81.0</td>
  </tr>
  <tr>
  <td>arc_ca_challenge</td>
  <td>acc</td>
- <td>51.62</td>
+ <td>53.0</td>
  </tr>
  <tr>
  <td>openbookqa_ca</td>
  <td>acc</td>
- <td>38.40</td>
+ <td>41.6</td>
  </tr>
  <tr>
  <td>piqa_ca</td>
  <td>acc</td>
- <td>74.86</td>
+ <td>75.8</td>
  </tr>
  <tr>
  <td>siqa_ca</td>
  <td>acc</td>
- <td>53.07</td>
+ <td>53.9</td>
  </tr>
  <tr>
  <td>Translation</td>
  <td>flores_ca</td>
  <td>bleu</td>
- <td>32.97</td>
+ <td>33.7</td>
  </tr>
  </tbody></table>
  
@@ -860,51 +864,51 @@ All results reported below are on a 5-shot setting.
  <td rowspan="2">Commonsense Reasoning</td>
  <td>xcopa_eu</td>
  <td>acc</td>
- <td>74.20</td>
+ <td>78.8</td>
  </tr>
  <tr>
  <td>xstorycloze_eu</td>
  <td>acc</td>
- <td>70.75</td>
+ <td>72.2</td>
  </tr>
  <tr>
  <td rowspan="2">NLI</td>
  <td>wnli_eu</td>
  <td>acc</td>
- <td>54.93</td>
+ <td>66.2</td>
  </tr>
  <tr>
  <td>xnli_eu</td>
  <td>acc</td>
- <td>46.54</td>
+ <td>45.9</td>
  </tr>
  <tr>
  <td rowspan="3">QA</td>
  <td>eus_exams</td>
  <td>acc</td>
- <td>55.12</td>
+ <td>61.5</td>
  </tr>
  <tr>
  <td>eus_proficiency</td>
  <td>acc</td>
- <td>54.25</td>
+ <td>60.4</td>
  </tr>
  <tr>
  <td>eus_trivia</td>
  <td>acc</td>
- <td>63.62</td>
+ <td>67.2</td>
  </tr>
  <tr>
  <td>Reading Comprehension</td>
  <td>eus_reading</td>
  <td>acc</td>
- <td>52.56</td>
+ <td>61.1</td>
  </tr>
  <tr>
  <td>Translation</td>
  <td>flores_eu</td>
  <td>bleu</td>
- <td>19.85</td>
+ <td>21.3</td>
  </tr>
  </tbody></table>
  
@@ -922,24 +926,24 @@ All results reported below are on a 5-shot setting.
  <td rowspan="2">Paraphrasing</td>
  <td>parafrases_gl</td>
  <td>acc</td>
- <td>60.20</td>
+ <td>60.2</td>
  </tr>
  <tr>
  <td>paws_gl</td>
  <td>acc</td>
- <td>69.10</td>
+ <td>63.0</td>
  </tr>
  <tr>
  <td>QA</td>
  <td>openbookqa_gl</td>
  <td>acc</td>
- <td>35.00</td>
+ <td>36.6</td>
  </tr>
  <tr>
  <td>Translation</td>
  <td>flores_gl</td>
  <td>bleu</td>
- <td>30.19</td>
+ <td>31.2</td>
  </tr>
  </tbody>
  </table>
@@ -958,66 +962,75 @@ All results reported below are on a 5-shot setting.
  <td rowspan="2">Commonsense Reasoning</td>
  <td>copa</td>
  <td>acc</td>
- <td>91</td>
+ <td>94.0</td>
  </tr>
  <tr>
  <td>xstorycloze_en</td>
  <td>acc</td>
- <td>82.20</td>
+ <td>83.2</td>
  </tr>
  <tr>
  <td rowspan="2">NLI</td>
  <td>wnli</td>
  <td>acc</td>
- <td>61.97</td>
+ <td>67.6</td>
  </tr>
  <tr>
  <td>xnli_en</td>
  <td>acc</td>
- <td>51.77</td>
+ <td>57.0</td>
  </tr>
  <tr>
  <td>Paraphrasing</td>
  <td>paws *</td>
  <td>acc</td>
- <td>64.65</td>
+ <td>68.5</td>
  </tr>
  <tr>
  <td rowspan="6">QA</td>
  <td>arc_easy</td>
  <td>acc</td>
- <td>85.40</td>
+ <td>86.5</td>
  </tr>
  <tr>
  <td>arc_challenge</td>
  <td>acc</td>
- <td>58.70</td>
+ <td>59.4</td>
  </tr>
  <tr>
  <td>openbookqa</td>
  <td>acc</td>
- <td>37.80</td>
+ <td>38.4</td>
  </tr>
  <tr>
  <td>piqa</td>
  <td>acc</td>
- <td>81.77</td>
+ <td>81.7</td>
  </tr>
  <tr>
  <td>social_iqa</td>
  <td>acc</td>
- <td>53.48</td>
+ <td>53.8</td>
  </tr>
  <tr>
- <td>squad_en **</td>
+ <td>xquad_en</td>
  <td>acc</td>
- <td>81.53</td>
+ <td>80.7</td>
  </tr>
  </tbody></table>
  
+
  \* The current LM Evaluation Harness implementation lacks correct pre-processing; these results were obtained with adequate pre-processing.
  
- \*\* This task is not yet available in the official Harness, we hope to add it soon.
+ ### Long Context Evaluation
+
+ To assess the long-context capabilities of our model, we conduct a "needle in a haystack" test with the following configuration:
+
+ - **Needle Phrase**: *"The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."*
+ - **Retrieval Question**: *"The best thing to do in San Francisco is"*
+ - **Evaluator**: [prometheus-8x7b-v2.0](https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0), used as the evaluation judge.
+
+ ![](./images/LongContext_eval.png)
  
  ---
  
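As a rough illustration of the needle-in-a-haystack setup described in the hunk above, the sketch below builds a single probe at a chosen insertion depth. The helper and its parameters are hypothetical, not the evaluation code actually used; in the real test, the model's answer to each probe is scored by the judge model named above.

```python
# Hypothetical sketch of constructing one needle-in-a-haystack probe.
NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = "The best thing to do in San Francisco is"

def build_probe(filler_sentences: list[str], n_ctx_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    haystack = filler_sentences[:n_ctx_sentences]
    haystack.insert(int(depth * len(haystack)), NEEDLE)
    return " ".join(haystack) + f"\n\nQuestion: {QUESTION}"

# Example: a probe with the needle buried halfway into 1,000 filler sentences.
probe = build_probe(["Filler sentence."] * 2000, n_ctx_sentences=1000, depth=0.5)
```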
@@ -1034,13 +1047,13 @@ We highlight that these results can be expected from a pretrained model that has
  ## Additional information
  
  ### Author
- The Language Technologies Unit from Barcelona Supercomputing Center.
+ The Language Technologies Lab at the Barcelona Supercomputing Center.
  
  ### Contact
  For further information, please send an email to <[email protected]>.
  
  ### Copyright
- Copyright(c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.
+ Copyright (c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
  
  ### Funding
  This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
@@ -1084,6 +1097,6 @@ The Barcelona Supercomputing Center, as the owner and creator of the model, shal
  ## Model Index
  |Model|Base|Instruct|
  |:---:|:---:|:---:|
- |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
- |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
- |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
+ |2b| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
+ |7b| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
+ |40b| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
Release note v0.9.md ADDED
@@ -0,0 +1 @@
+ In this release, the ALIA-40b checkpoint has completed the main pre-training (1.6 epochs over 2.4 trillion tokens and 2 epochs over 2.68 trillion tokens, roughly 9.2 trillion tokens in total) using a 4k context window, which was unfinished in the previous release. Additionally, it has undergone a preliminary final pre-training stage with a subset of high-quality data and a context extension to 32k tokens.
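
A quick sanity check of the token accounting in this note, using the figures from the model card:

```python
# Main pre-training: 1.6 epochs over 2.4T tokens plus 2 epochs over 2.68T tokens.
main = 1.6 * 2.4e12 + 2 * 2.68e12   # = 9.2e12 tokens
# Final phase: 160B high-quality tokens plus 6.3B for the 32k context extension.
total = main + 160e9 + 6.3e9        # ~9.37e12, matching the Description section
print(f"{main:.3g} main, {total:.3g} total")
```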
images/corpus_languages.png CHANGED
Git LFS Details
  • SHA256: a66685fd95c43e997ba20620231b00d43e1085aba09af54c69b3a414b6a25f73
  • Pointer size: 131 Bytes
  • Size of remote file: 352 kB
images/salamandra_header.png CHANGED
Git LFS Details (before)
  • SHA256: de12bec43f22c0c41b45b84425759d6c9e38ecdf06d58519f048f10fe6e826de
  • Pointer size: 133 Bytes
  • Size of remote file: 11.1 MB
Git LFS Details (after)
  • SHA256: 4be1584a2e8cb549a8740c7893c75a638510289215482968e665566d39f4cfb1
  • Pointer size: 128 Bytes
  • Size of remote file: 133 Bytes
model-00001-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3ff526e46d494258a656971416da3cc1510464c9c6f32ee47a2a926ca3d75a7a
+ oid sha256:86eea25bb9c816c55fbe801a79689dfeb2b9444c9868ebcbe40fa1c7759a78de
  size 4898947792
model-00002-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f02d3cbb8e0f9fbde20b806f90233b3dd5c9260b22a6fb35c03d5d717aca9471
+ oid sha256:bab17a5d55a0ea3b1f6e7c49fca860488b5d4ca723bfc1c0c660db758cf7acee
  size 4932603064
model-00003-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9635e1d0c5287498d473c264157de7b60c0364d31f8df1813b68dbda6960b9b7
+ oid sha256:4f51af4a82d6a6ce07157d81688cee43f9d016ed6148e15daafefef5da41c97e
  size 4932636064
model-00004-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ceb246808ed7507cf405f47f901b17f051c7ca3915d5b4e5e5b1ab9b44b2509f
+ oid sha256:7d0f30ce33bc6731ea6d58ed7d337e446dfe4abe6a50469abbe1bdb40da97523
  size 4831940128
model-00005-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:5ee56cb425b2fc75365979ac567e67c22eeb4b49b6edaa4a0c72905094bf5ab4
+ oid sha256:c383044bd870e52b5d556b0312a7593dda826ea6fd40cc88d57f7ca054fa8a0e
  size 4932603096
model-00006-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a76cd8c3d64a6d6db3cad041528d0452d8bd853de7969429a2ca7ae448649ad6
+ oid sha256:30a71fb6ae629d637aa5e6dbc1482f31b8402099ea7dd6a7f5897bae75901909
  size 4932603088
model-00007-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:98275b7e8711ec35c7a48591daa33efacc062d0af056bbbcafffa4c5349700cc
+ oid sha256:58b5d28b6bf820608f4c7590611e91291037756a209c957b659cf5cdc306c093
  size 4932636096
model-00008-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c8c14df494eedafb26c1ddbaf2f64c5c7b492aa1562b3bab896de10357d68dc7
+ oid sha256:ff854c6b09431715b07ad1685fc4e902d68326d2343425f1fff6781cc86cc66b
  size 4831940160
model-00009-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b9970d40aa9954e8ace15a559111f090c76a177d05d183ccf2b975d584319d39
+ oid sha256:5f0579b928cc75e1d628d4d8d67d0f414a8550c18fe9063f39023498a601cc8e
  size 4932603096
model-00010-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c6b26875ba10aed59dfca60fd383773f83bc5de4606b1e020a8d6cdeb823b8a7
+ oid sha256:adc5fbe7645d6ab910a3314b23eb42bf793b751d7940fcb73d4d938c42c3047d
  size 4932603088
model-00011-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1c1dec561c5b21c305f50cc33ce688ccff660f22db8266f4edda6408a5b2e9e0
+ oid sha256:2c814623773121f9be5579ed775ae49835e16f1e43ea16f39198798c5d9a2a32
  size 4932636096
model-00012-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c34abe63aede8a40b6c1cceaa1f2048ded19a7b96a8488c67da874453694e330
+ oid sha256:91fe722802a3dcac49de528f280a2dbef7c7a0cc9e403aecdaeeb836b13fa375
  size 4831940160
model-00013-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:28d9169546c4d6fb4a5161ec273d1848a8e16ead6c239dc2e145774272f47fb2
+ oid sha256:1b7e784fae5452f080f3cbe4779388843fb0b92f02250f799e3035c119b1c8e5
  size 4932603096
model-00014-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:93ff37553fd5ac3066df67c15f2300381b804b56e778e61d479295f69f8c0080
+ oid sha256:8e20dc5e90c2a64710cbd50728b5228d916082d883f274977af599e2b3f3bbb0
  size 4932603088
model-00015-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:46049c9ce047e7db0ccfff44cb4d4b241bb4fb69619b1882709a2ccc5c0ddbdf
+ oid sha256:9f32f39b3ae508d99be07600a6a96bc37c96be9dc10ab5ee774aaca4d5ddd1e3
  size 4932636096
model-00016-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:5e7c1e9cfd124c531fecfb8a39d4344cdc1ee2b2e33fc4783ce8fae6a59d8179
+ oid sha256:c95947ba9f96a43078f1984c4b74f58a9ad52cb5c4f83243b52ddc5fb29396d3
  size 3019983016
model-00017-of-00017.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:90fc55c26a532d8eb9a96d15149b9391105466b120e0a0b496c7422b77228564
+ oid sha256:60f09ac53921db320447f39598c5b63ceec67e63daad6dff8ad2e19389fddea5
  size 4194304128
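
To verify a downloaded shard against the LFS oids listed above, a small sketch (the expected hash below is the first shard's new oid from this commit):

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while blob := f.read(chunk):
            h.update(blob)
    return h.hexdigest()

assert sha256_of("model-00001-of-00017.safetensors") == (
    "86eea25bb9c816c55fbe801a79689dfeb2b9444c9868ebcbe40fa1c7759a78de"
)
```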