Update README.md

Adjusted results. The evaluation now takes into account that not all models have BOS tokens, and it excludes loss computation on the first token for all models. Also adds EuroLLM 22B Preview.
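For readers who want to see what the first-token exclusion means in practice, here is a minimal sketch. It is not the evaluation script behind these numbers. It assumes that character-level perplexity is the exponential of the summed token-level negative log-likelihood (in nats) divided by the number of characters scored, and that the characters of the excluded first token are dropped from that count as well. The function name `char_level_perplexity` and both of its parameters are hypothetical.

```python
import math


def char_level_perplexity(token_nlls, token_char_lens):
    """Sketch of character-level perplexity with the first token excluded.

    token_nlls: per-token negative log-likelihoods in nats, one per text
        token (any BOS token is assumed to be left out already).
    token_char_lens: number of characters each of those tokens covers.

    Assumption: the first text token is dropped from both the loss sum and
    the character count, so models with and without a BOS token are scored
    on the same span of text.
    """
    if len(token_nlls) != len(token_char_lens):
        raise ValueError("token_nlls and token_char_lens must have equal length")

    scored_nll = sum(token_nlls[1:])          # skip the loss on the first token
    scored_chars = sum(token_char_lens[1:])   # skip its characters as well
    if scored_chars == 0:
        raise ValueError("no characters left to score after dropping the first token")

    return math.exp(scored_nll / scored_chars)


# Toy input: four tokens; the first one (NLL 5.0, 3 characters) is ignored.
toy_nlls = [5.0, 1.2, 0.8, 2.0]
toy_char_lens = [3, 4, 2, 4]
toy_ppl = char_level_perplexity(toy_nlls, toy_char_lens)  # exp(4.0 / 10)
```

Normalising by characters rather than tokens is what makes the scores comparable across tokenizers: a model that splits the same text into more tokens is not rewarded or penalised for it.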

Files changed (1)

README.md +29 -30

@@ -120,33 +120,32 @@ Character-level perplexity creates a standardised comparison by calculating how
 **What data did we use?**
 We use WMT24++ as it is a multilingual, language-parallel evaluation set that none of the models have seen during training. WMT24++ is a composite of texts from news, literature, speech, and social media; thus, it is suitable for foundational model benchmarking.
-| Language | TildeOpen-30B | Gemma-2-27B | EuroLLM-9B | ALIA-40B |
-|----------|---------------|-------------|------------|-----------------|
-| Bulgarian | **2.1716** | 2.3541 | 2.3502 | 2.2411 |
-| Croatian | **2.2259** | 2.6809 | 2.6780 | 2.3456 |
-| Czech | **2.2682** | 2.4873 | 2.4808 | 2.3639 |
-| Danish | **2.0968** | 2.2608 | 2.2586 | 2.1543 |
-| Dutch | **2.0136** | 2.1249 | 2.1185 | 2.0629 |
-| English | 2.1497 | **2.0342** | 2.1897 | 2.1027 |
-| Estonian | **2.2825** | 2.7163 | 2.5652 | 2.4232 |
-| Finnish | **2.1687** | 2.4069 | 2.3844 | 2.2774 |
-| French | 1.9779 | 2.0195 | 2.0479 | **1.9750** |
-| German | **1.9664** | 2.0214 | 2.0499 | 1.9725 |
-| Hungarian | **2.1481** | 2.3308 | 2.3705 | 2.2493 |
-| Icelandic | **2.2011** | 3.1917 | 5.3162 | 4.0978 |
-| Italian | **2.0431** | 2.1065 | 2.1213 | 2.0604 |
-| Latvian | **2.2477** | 2.6701 | 2.4896 | 2.4352 |
-| Lithuanian | **2.2301** | 2.5495 | 2.4754 | 2.4109 |
-| Norwegian | **2.2445** | 2.4173 | 2.5121 | 2.3152 |
-| Polish | **2.1214** | 2.2294 | 2.2264 | 2.1847 |
-| Portuguese | **2.0810** | 2.1554 | 2.1561 | 2.0884 |
-| Romanian | **2.1266** | 2.2724 | 2.2821 | 2.1974 |
-| Russian | **2.1502** | 2.2091 | 2.2813 | 2.1889 |
-| Serbian | **2.3708** | 2.8053 | 4.7160 | 2.5119 |
-| Slovak | **2.2281** | 2.4674 | 2.4588 | 2.3505 |
-| Slovenian | **2.2662** | 2.5798 | 2.5087 | 2.3611 |
-| Spanish | 2.0400 | 2.0665 | 2.1186 | **2.0055** |
-| Swedish | **2.1471** | 2.2971 | 2.2856 | 2.2039 |
-| Turkish | **2.2108** | 2.3665 | 2.3508 | 3.0611 |
-| Ukrainian | **2.2470** | 2.4000 | 2.4251 | 2.3168 |

+| Language | TildeOpen 30B | Gemma 2 27B | EuroLLM 22B Preview | ALIA 40B |
+|----------|---------------|-------------|---------------------|----------|
+| Bulgarian | **2.0539** | 2.2184 | 2.1985 | 2.1336 |
+| Czech | **2.1579** | 2.3522 | 2.3221 | 2.2719 |
+| Danish | **2.0030** | 2.1517 | 2.1353 | 2.0805 |
+| German | **1.8769** | 1.9285 | 1.9452 | 1.9040 |
+| English | 2.0378 | **1.9525** | 2.0568 | 2.0261 |
+| Spanish | 1.9503 | 1.9752 | 2.0145 | **1.9369** |
+| Estonian | **2.1711** | 2.5747 | 2.3852 | 2.3250 |
+| Finnish | **2.0497** | 2.2880 | 2.2388 | 2.1831 |
+| French | **1.8978** | 1.9355 | 1.9282 | 1.9084 |
+| Croatian | **2.1147** | 2.5440 | 2.4905 | 2.2433 |
+| Hungarian | **2.0539** | 2.2228 | 2.2256 | 2.1635 |
+| Icelandic | **2.0873** | 3.0329 | 4.7908 | 3.9570 |
+| Italian | **1.9565** | 2.0137 | 2.0098 | 1.9887 |
+| Lithuanian | **2.1247** | 2.4175 | 2.3137 | 2.3075 |
+| Latvian | **2.1439** | 2.5355 | 2.3141 | 2.3276 |
+| Dutch | **1.9333** | 2.0312 | 2.0079 | 1.9904 |
+| Norwegian | **2.1284** | 2.2862 | 2.3506 | 2.2253 |
+| Polish | **2.0241** | 2.1294 | 2.0803 | 2.0803 |
+| Portuguese | **1.9899** | 2.0597 | 2.0272 | 2.0187 |
+| Romanian | **2.0196** | 2.1606 | 2.1641 | 2.1114 |
+| Russian | **2.0424** | 2.0900 | 2.1095 | 2.0871 |
+| Slovak | **2.1192** | 2.3380 | 2.3029 | 2.2609 |
+| Slovenian | **2.1556** | 2.4443 | 2.3398 | 2.2589 |
+| Serbian | **2.2469** | 2.6351 | 4.2471 | 2.3743 |
+| Swedish | **2.0410** | 2.1809 | 2.1464 | 2.1211 |
+| Turkish | **2.0997** | 2.2470 | 2.2202 | 2.2320 |
+| Ukrainian | **2.1376** | 2.2665 | 2.2691 | 2.2086 |