v0.9
- .gitattributes +2 -0
- README.md +72 -59
- Release note v0.9.md +1 -0
- images/corpus_languages.png +0 -0
- images/salamandra_header.png +2 -2
- model-00001-of-00017.safetensors +1 -1
- model-00002-of-00017.safetensors +1 -1
- model-00003-of-00017.safetensors +1 -1
- model-00004-of-00017.safetensors +1 -1
- model-00005-of-00017.safetensors +1 -1
- model-00006-of-00017.safetensors +1 -1
- model-00007-of-00017.safetensors +1 -1
- model-00008-of-00017.safetensors +1 -1
- model-00009-of-00017.safetensors +1 -1
- model-00010-of-00017.safetensors +1 -1
- model-00011-of-00017.safetensors +1 -1
- model-00012-of-00017.safetensors +1 -1
- model-00013-of-00017.safetensors +1 -1
- model-00014-of-00017.safetensors +1 -1
- model-00015-of-00017.safetensors +1 -1
- model-00016-of-00017.safetensors +1 -1
- model-00017-of-00017.safetensors +1 -1
.gitattributes
CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 images/salamandra_header.png filter=lfs diff=lfs merge=lfs -text
+images/corpus_languages.png filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
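The two added rules use the standard Git LFS attribute form (`filter=lfs diff=lfs merge=lfs -text`), which routes matching paths through the LFS filter. As a rough illustration of how such patterns select paths, here is a minimal matcher sketch; `lfs_tracked` is a hypothetical helper, and it is simplified relative to real gitattributes semantics (no `**`, anchoring, or negation support):

```python
from fnmatch import fnmatch

def lfs_tracked(path: str, patterns: list[str]) -> bool:
    """Return True if `path` matches any LFS track pattern.

    Simplified: real .gitattributes matching also handles `**`,
    leading-slash anchoring, and negation, which fnmatch does not.
    """
    # A pattern without a slash is matched against the basename,
    # mirroring gitattributes semantics.
    basename = path.rsplit("/", 1)[-1]
    for pat in patterns:
        target = path if "/" in pat else basename
        if fnmatch(target, pat):
            return True
    return False

# The two patterns added in this commit:
patterns = [
    "images/corpus_languages.png",
    "tokenizer.json",
]
```

Note that `tokenizer.json` (no slash) would match the file at any depth, while `images/corpus_languages.png` matches only that exact path.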
README.md
CHANGED
@@ -66,13 +66,11 @@ datasets:
 ![](images/salamandra_header.png)

 > [!WARNING]
-> **WARNING:** This is
->
-> The weights will be promptly updated as soon as the training process is complete.
+> **WARNING:** This is a base language model that has not undergone instruction tuning or alignment with human preferences. As a result, it may generate outputs that are inappropriate, misleading, biased, or unsafe. These risks can be mitigated through additional post-training stages, which is strongly recommended before deployment in any production system, especially for high-stakes applications.

 # ALIA-40b Model Card

-ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants. This model card corresponds to the
+ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants. This model card corresponds to the 40b base version.

 To visit the model cards of other model versions, please refer to the [Model Index](#model-index).

@@ -85,12 +83,12 @@ Along with the open weights, all training scripts and configuration files are ma

 ### Description

-Transformer-based decoder-only language model that has been pre-trained from scratch on
+Transformer-based decoder-only language model that has been pre-trained from scratch on 9.37 trillion tokens of highly curated data.
 The pre-training corpus contains text in 35 European languages and code.

 ### Hyperparameters

-The full list of hyperparameters can be found [here](https://github.com/langtech-bsc/alia/blob/main/configs
+The full list of hyperparameters can be found [here](https://github.com/langtech-bsc/alia/blob/main/configs).

 ### Architecture

@@ -101,7 +99,7 @@ The full list of hyperparameters can be found [here](https://github.com/langtech
 | Layers | 48 |
 | Hidden size | 8,192 |
 | Attention heads | 64 |
-| Context length |
+| Context length | 32,768 |
 | Vocabulary size | 256,000 |
 | Precision | bfloat16 |
 | Embedding type | RoPE |
@@ -304,11 +302,11 @@ for output in outputs:
 ### Pretraining Data

 The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
-The initial 1.
+The initial 1.6 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
-During the following
-This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
+During the following training, the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
+This adjustment resulted in a total of 2.68 trillion tokens used across 2 epochs, distributed as outlined below:

 ![](./images/corpus_languages.png)

@@ -347,11 +345,13 @@ Feel free to click the expand button below to see the full list of sources.
 | Croatian Web as Corpus 2.1 (hrWaC) | hr | Ljubešić & Klubička, 2014 |
 | DaNewsroom | da | Varab & Schluter, 2020 |
 | Danish GigaWord | da | Strømberg-Derczynski et al., 2021 |
+| Dolmino-mix-1124 (subset without synthetically generated data and privative licenses) | en | Team OLMo, 2024 |
 | DK-CLARIN Reference Corpus of General Danish | da | [Link](https://korpus.dsl.dk/clarin/) |
 | Estonian National Corpus 2021 (ENC) | et | Koppel & Kallas, 2022 |
 | Estonian Reference Corpus (ERC) | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
 | EusCrawl (w/o Wikipedia or NC-licenses) | eu | Artetxe et al., 2022 |
 | FineWeb-Edu (350BT subset) | en | Penedo et al., 2024 |
+| Fineweb2 (ad hoc subset of 178BT) | ar, as, bg, ca, cs, cy, da, de, el, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk | Penedo et al., 2024 |
 | French Public Domain Books (French-PD) | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Books) |
 | French Public Domain Newspapers (French-PD) | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
 | German Web as Corpus (DeWaC) | de | [Link](https://docs.sslmit.unibo.it/doku.php?id=corpora:dewac) |
@@ -395,6 +395,7 @@ Feel free to click the expand button below to see the full list of sources.
 | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
 | Yle Finnish News Archive (Yle-News) | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |

+
 To consult the data summary document with the respective licences, please send an e-mail to [email protected].

 <details>
@@ -439,6 +440,7 @@ To consult the data summary document with the respective licences, please send a
 - Soldaini, L., & Lo, K. (2023). peS2o (Pretraining Efficiently on S2ORC) Dataset. Allen Institute for AI.
 - Strømberg-Derczynski, L., Ciosici, M., Baglini, R., Christiansen, M. H., Dalsgaard, J. A., Fusaroli, R., Henrichsen, P. J., Hvingelby, R., Kirkedal, A., Kjeldsen, A. S., Ladefoged, C., Nielsen, F. Å., Madsen, J., Petersen, M. L., Rystrøm, J. H., & Varab, D. (2021). The Danish Gigaword Corpus. Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 413–421. [Link](https://aclanthology.org/2021.nodalida-main.46)
 - Subramani, N., Luccioni, S., Dodge, J., & Mitchell, M. (2023). Detecting Personal Information in Training Corpora: An Analysis. 208–220. [Link](https://doi.org/10.18653/v1/2023.trustnlp-1.18)
+- Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, & Hannaneh Hajishirzi (2024).
 - Varab, D., & Schluter, N. (2020). DaNewsroom: A Large-scale Danish Summarisation Dataset. Proceedings of The 12th Language Resources and Evaluation Conference, 6731–6739. [Link](https://www.aclweb.org/anthology/2020.lrec-1.831)
 - Váradi, T., Nyéki, B., Koeva, S., Tadić, M., Štefanec, V., Ogrodniczuk, M., Nitoń, B., Pezik, P., Barbu Mititelu, V., Irimia, E., Mitrofan, M., Tufiș, D., Garabík, R., Krek, S., & Repar, A. (2022). Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 100–108). European Language Resources Association. [Link](https://aclanthology.org/2022.lrec-1.11)
 - Wagner Filho, J. A., Wilkens, R., Idiart, M., & Villavicencio, A. (2018). The brwac corpus: A new open resource for brazilian portuguese. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
@@ -454,6 +456,8 @@ To consult the data summary document with the respective licences, please send a

 </details>

+In the final pre-training phase, we used a high-quality subset of 160 billion tokens. Additionally, to expand the model's context window to 32K, 6.3 billion tokens were processed using the Llama 3.1 RoPE interpolation strategy.
+
 We provide an extense Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).

 <details>
@@ -735,36 +739,36 @@ All results reported below are on a 5-shot setting.
 <td>Commonsense Reasoning</td>
 <td>xstorycloze_es</td>
 <td>acc</td>
-<td>
+<td>79.5</td>
 </tr>
 <tr>
 <td rowspan="2">NLI</td>
 <td>wnli_es</td>
 <td>acc</td>
-<td>
+<td>64.8</td>
 </tr>
 <tr>
 <td>xnli_es</td>
 <td>acc</td>
-<td>
+<td>50.4</td>
 </tr>
 <tr>
 <td>Paraphrasing</td>
 <td>paws_es</td>
 <td>acc</td>
-<td>
+<td>63.8</td>
 </tr>
 <tr>
 <td>QA</td>
 <td>xquad_es</td>
 <td>acc</td>
-<td>
+<td>73.4</td>
 </tr>
 <tr>
 <td>Translation</td>
 <td>flores_es</td>
 <td>bleu</td>
-<td>25.
+<td>25.9</td>
 </tr>
 </tbody>
 </table>
@@ -783,66 +787,66 @@ All results reported below are on a 5-shot setting.
 <td rowspan="2">Commonsense Reasoning</td>
 <td>copa_ca</td>
 <td>acc</td>
-<td>
+<td>86.0</td>
 </tr>
 <tr>
 <td>xstorycloze_ca</td>
 <td>acc</td>
-<td>
+<td>80.0</td>
 </tr>
 <tr>
 <td rowspan="2">NLI</td>
 <td>wnli_ca</td>
 <td>acc</td>
-<td>
+<td>70.0</td>
 </tr>
 <tr>
 <td>xnli_ca</td>
 <td>acc</td>
-<td>
+<td>50.7</td>
 </tr>
 <tr>
 <td rowspan="2">Paraphrasing</td>
 <td>parafraseja</td>
 <td>acc</td>
-<td>
+<td>67.8</td>
 </tr>
 <tr>
 <td>paws_ca</td>
 <td>acc</td>
-<td>67.
+<td>67.5</td>
 </tr>
 <tr>
 <td rowspan="5">QA</td>
 <td>arc_ca_easy</td>
 <td>acc</td>
-<td>
+<td>81.0</td>
 </tr>
 <tr>
 <td>arc_ca_challenge</td>
 <td>acc</td>
-<td>
+<td>53.0</td>
 </tr>
 <tr>
 <td>openbookqa_ca</td>
 <td>acc</td>
-<td>
+<td>41.6</td>
 </tr>
 <tr>
 <td>piqa_ca</td>
 <td>acc</td>
-<td>
+<td>75.8</td>
 </tr>
 <tr>
 <td>siqa_ca</td>
 <td>acc</td>
-<td>53.
+<td>53.9</td>
 </tr>
 <tr>
 <td>Translation</td>
 <td>flores_ca</td>
 <td>bleu</td>
-<td>
+<td>33.7</td>
 </tr>
 </tbody></table>

@@ -860,51 +864,51 @@ All results reported below are on a 5-shot setting.
 <td rowspan="2">Commonsense Reasoning</td>
 <td>xcopa_eu</td>
 <td>acc</td>
-<td>
+<td>78.8</td>
 </tr>
 <tr>
 <td>xstorycloze_eu</td>
 <td>acc</td>
-<td>
+<td>72.2</td>
 </tr>
 <tr>
 <td rowspan="2">NLI</td>
 <td>wnli_eu</td>
 <td>acc</td>
-<td>
+<td>66.2</td>
 </tr>
 <tr>
 <td>xnli_eu</td>
 <td>acc</td>
-<td>
+<td>45.9</td>
 </tr>
 <tr>
 <td rowspan="3">QA</td>
 <td>eus_exams</td>
 <td>acc</td>
-<td>
+<td>61.5</td>
 </tr>
 <tr>
 <td>eus_proficiency</td>
 <td>acc</td>
-<td>
+<td>60.4</td>
 </tr>
 <tr>
 <td>eus_trivia</td>
 <td>acc</td>
-<td>
+<td>67.2</td>
 </tr>
 <tr>
 <td>Reading Comprehension</td>
 <td>eus_reading</td>
 <td>acc</td>
-<td>
+<td>61.1</td>
 </tr>
 <tr>
 <td>Translation</td>
 <td>flores_eu</td>
 <td>bleu</td>
-<td>
+<td>21.3</td>
 </tr>
 </tbody></table>

@@ -922,24 +926,24 @@ All results reported below are on a 5-shot setting.
 <td rowspan="2">Paraphrasing</td>
 <td>parafrases_gl</td>
 <td>acc</td>
-<td>60.
+<td>60.2</td>
 </tr>
 <tr>
 <td>paws_gl</td>
 <td>acc</td>
-<td>
+<td>63.0</td>
 </tr>
 <tr>
 <td>QA</td>
 <td>openbookqa_gl</td>
 <td>acc</td>
-<td>
+<td>36.6</td>
 </tr>
 <tr>
 <td>Translation</td>
 <td>flores_gl</td>
 <td>bleu</td>
-<td>
+<td>31.2</td>
 </tr>
 </tbody>
 </table>
@@ -958,66 +962,75 @@ All results reported below are on a 5-shot setting.
 <td rowspan="2">Commonsense Reasoning</td>
 <td>copa</td>
 <td>acc</td>
-<td>
+<td>94.0</td>
 </tr>
 <tr>
 <td>xstorycloze_en</td>
 <td>acc</td>
-<td>
+<td>83.2</td>
 </tr>
 <tr>
 <td rowspan="2">NLI</td>
 <td>wnli</td>
 <td>acc</td>
-<td>
+<td>67.6</td>
 </tr>
 <tr>
 <td>xnli_en</td>
 <td>acc</td>
-<td>
+<td>57.0</td>
 </tr>
 <tr>
 <td>Paraphrasing</td>
 <td>paws *</td>
 <td>acc</td>
-<td>
+<td>68.5</td>
 </tr>
 <tr>
 <td rowspan="6">QA</td>
 <td>arc_easy</td>
 <td>acc</td>
-<td>
+<td>86.5</td>
 </tr>
 <tr>
 <td>arc_challenge</td>
 <td>acc</td>
-<td>
+<td>59.4</td>
 </tr>
 <tr>
 <td>openbookqa</td>
 <td>acc</td>
-<td>
+<td>38.4</td>
 </tr>
 <tr>
 <td>piqa</td>
 <td>acc</td>
-<td>81.
+<td>81.7</td>
 </tr>
 <tr>
 <td>social_iqa</td>
 <td>acc</td>
-<td>53.
+<td>53.8</td>
 </tr>
 <tr>
-<td>
+<td>xquad_en</td>
 <td>acc</td>
-<td>
+<td>80.7</td>
 </tr>
 </tbody></table>

+
 \* Current LM Evaluation Harness implementation is lacking correct pre-processing. These results are obtained with adequate pre-processing.

-
+### Long Context Evaluation
+
+To assess the long-context capabilities of our model, we conduct a "needle in a haystack" test with the following configuration:
+
+- **Needle Phrase**: *"The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."*
+- **Retrieval Question**: *"The best thing to do in San Francisco is"*
+- **Evaluator**: [prometheus-8x7b-v2.0](https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0), used as the evaluation judge.
+
+![]()

 ---

@@ -1034,13 +1047,13 @@ We highlight that these results can be expected from a pretrained model that has
 ## Additional information

 ### Author
-The Language Technologies
+The Language Technologies Lab from Barcelona Supercomputing Center.

 ### Contact
 For further information, please send an email to <[email protected]>.

 ### Copyright
-Copyright(c)
+Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.

 ### Funding
 This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
@@ -1084,6 +1097,6 @@ The Barcelona Supercomputing Center, as the owner and creator of the model, shal
 ## Model Index
 |Model|Base|Instruct|
 |:---:|:---:|:---:|
-
-
-
+|2b| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
+|7b| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
+|40b| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
Release note v0.9.md
ADDED
@@ -0,0 +1 @@
+In this release, the ALIA-40B checkpoint has completed the main pre-training on 8.56 trillion tokens (1.6 epochs with 2.4 trillion tokens and 2 epochs with 2.68 trillion tokens) using a 4k context window, which was unfinished in the previous release. Additionally, it has undergone a preliminary final pre-training stage with a subset of high-quality data and a context extension to 32k tokens.
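The release note's context extension to 32k relies on the Llama 3.1 RoPE interpolation strategy mentioned in the README. As a minimal sketch of that frequency-scaling rule, using Llama 3.1's published default parameters (the actual values used for ALIA-40b are not stated in this commit):

```python
import math

def llama3_scaled_rope_freqs(head_dim: int = 128,
                             base: float = 500_000.0,
                             factor: float = 8.0,
                             low_freq_factor: float = 1.0,
                             high_freq_factor: float = 4.0,
                             original_ctx: int = 8192) -> list[float]:
    """Llama 3.1-style RoPE frequency interpolation (illustrative defaults).

    High-frequency components (short wavelengths) are kept intact,
    low-frequency components are divided by `factor`, and wavelengths
    in between are blended smoothly between the two regimes.
    """
    freqs = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    low_wavelen = original_ctx / low_freq_factor
    high_wavelen = original_ctx / high_freq_factor
    scaled = []
    for f in freqs:
        wavelen = 2 * math.pi / f
        if wavelen < high_wavelen:
            scaled.append(f)                # high frequency: unchanged
        elif wavelen > low_wavelen:
            scaled.append(f / factor)       # low frequency: fully scaled
        else:
            # smooth interpolation between the two regimes
            smooth = (original_ctx / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled
```

The scaled frequencies replace the standard RoPE frequencies at inference time, which stretches long-wavelength position signals to cover the extended window without retraining from scratch.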
images/corpus_languages.png
CHANGED
(binary image updated; stored via Git LFS)
images/salamandra_header.png
CHANGED
(binary image updated; stored via Git LFS)
model-00001-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:86eea25bb9c816c55fbe801a79689dfeb2b9444c9868ebcbe40fa1c7759a78de
 size 4898947792
model-00002-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:bab17a5d55a0ea3b1f6e7c49fca860488b5d4ca723bfc1c0c660db758cf7acee
 size 4932603064
model-00003-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:4f51af4a82d6a6ce07157d81688cee43f9d016ed6148e15daafefef5da41c97e
 size 4932636064
model-00004-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:7d0f30ce33bc6731ea6d58ed7d337e446dfe4abe6a50469abbe1bdb40da97523
 size 4831940128
model-00005-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:c383044bd870e52b5d556b0312a7593dda826ea6fd40cc88d57f7ca054fa8a0e
 size 4932603096
model-00006-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:30a71fb6ae629d637aa5e6dbc1482f31b8402099ea7dd6a7f5897bae75901909
 size 4932603088
model-00007-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:58b5d28b6bf820608f4c7590611e91291037756a209c957b659cf5cdc306c093
 size 4932636096
model-00008-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:ff854c6b09431715b07ad1685fc4e902d68326d2343425f1fff6781cc86cc66b
 size 4831940160
model-00009-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:5f0579b928cc75e1d628d4d8d67d0f414a8550c18fe9063f39023498a601cc8e
 size 4932603096
model-00010-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:adc5fbe7645d6ab910a3314b23eb42bf793b751d7940fcb73d4d938c42c3047d
 size 4932603088
model-00011-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:2c814623773121f9be5579ed775ae49835e16f1e43ea16f39198798c5d9a2a32
 size 4932636096
model-00012-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:91fe722802a3dcac49de528f280a2dbef7c7a0cc9e403aecdaeeb836b13fa375
 size 4831940160
model-00013-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:1b7e784fae5452f080f3cbe4779388843fb0b92f02250f799e3035c119b1c8e5
 size 4932603096
model-00014-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:8e20dc5e90c2a64710cbd50728b5228d916082d883f274977af599e2b3f3bbb0
 size 4932603088
model-00015-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:9f32f39b3ae508d99be07600a6a96bc37c96be9dc10ab5ee774aaca4d5ddd1e3
 size 4932636096
model-00016-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:c95947ba9f96a43078f1984c4b74f58a9ad52cb5c4f83243b52ddc5fb29396d3
 size 3019983016
model-00017-of-00017.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:60f09ac53921db320447f39598c5b63ceec67e63daad6dff8ad2e19389fddea5
 size 4194304128
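Each `.safetensors` entry above is a Git LFS pointer file with `version`, `oid`, and `size` lines; this commit swaps in new `oid` hashes for the updated weights. A small sketch for checking a downloaded shard against its pointer; `parse_lfs_pointer` and `verify_shard` are hypothetical helper names, not part of any library:

```python
import hashlib
from pathlib import Path

def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its oid hash and byte size."""
    # Each pointer line is "<key> <value>", e.g. "oid sha256:<hex>".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "oid": fields["oid"].removeprefix("sha256:"),
        "size": int(fields["size"]),
    }

def verify_shard(path: Path, pointer_text: str) -> bool:
    """Check a downloaded file against its LFS pointer (size + sha256)."""
    meta = parse_lfs_pointer(pointer_text)
    data = path.read_bytes()
    return (len(data) == meta["size"]
            and hashlib.sha256(data).hexdigest() == meta["oid"])
```

For multi-gigabyte shards, a production check would hash in chunks rather than read the whole file into memory; the chunked variant is a straightforward extension with `hashlib.sha256()` and repeated `update()` calls.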