Commit 6238e57 · 1 Parent(s): e3038db
Update README.md
README.md CHANGED
@@ -27,8 +27,23 @@ Afrikaans (af), Arabic (ar), Armenian (hy), Azerbaijani (az), Basque (eu), Bashk
<table><thead><tr><th>Language Family</th><th>Languages</th></tr></thead><tbody><tr><td>Afro-Asiatic</td><td>Arabic (ar), Hebrew (he)</td></tr><tr><td>Austro-Asiatic</td><td>Vietnamese (vi)</td></tr><tr><td>Austronesian</td><td>Indonesian (id), Javanese (jv), Malay (ms), Tagalog (tl)</td></tr><tr><td>Baltic</td><td>Latvian (lv), Lithuanian (lt)</td></tr><tr><td>Basque</td><td>Basque (eu)</td></tr><tr><td>Dravidian</td><td>Malayalam (ml), Tamil (ta), Telugu (te)</td></tr><tr><td>Indo-European (Armenian)</td><td>Armenian (hy)</td></tr><tr><td>Indo-European (Indo-Aryan)</td><td>Bengali (bn), Marathi (mr), Hindi (hi), Urdu (ur)</td></tr><tr><td>Indo-European (Germanic)</td><td>Afrikaans (af), Danish (da), English (en), German (de), Swedish (sv)</td></tr><tr><td>Indo-European (Romance)</td><td>French (fr), Italian (it), Portuguese (pt), Romanian (ro), Spanish (es)</td></tr><tr><td>Indo-European (Greek)</td><td>Greek (el)</td></tr><tr><td>Indo-European (Iranian)</td><td>Ossetian (os), Tajik (tg), Persian (fa)</td></tr><tr><td>Japonic</td><td>Japanese (ja)</td></tr><tr><td>Kartvelian</td><td>Georgian (ka)</td></tr><tr><td>Koreanic</td><td>Korean (ko)</td></tr><tr><td>Kra-Dai</td><td>Thai (th)</td></tr><tr><td>Mongolic</td><td>Buryat (bxr), Kalmyk (xal), Mongolian (mn)</td></tr><tr><td>Niger-Congo</td><td>Swahili (sw), Yoruba (yo)</td></tr><tr><td>Slavic</td><td>Belarusian (be), Bulgarian (bg), Russian (ru), Ukrainian (uk), Polish (pl)</td></tr><tr><td>Sino-Tibetan</td><td>Burmese (my)</td></tr><tr><td>Turkic (Karluk)</td><td>Uzbek (uz)</td></tr><tr><td>Turkic (Kipchak)</td><td>Bashkir (ba), Kazakh (kk), Kyrgyz (ky), Tatar (tt)</td></tr><tr><td>Turkic (Oghuz)</td><td>Azerbaijani (az), Chuvash (cv), Turkish (tr), Turkmen (tk)</td></tr><tr><td>Turkic (Siberian)</td><td>Tuvan (tyv), Yakut (sax)</td></tr><tr><td>Uralic</td><td>Estonian (et), Finnish (fi), Hungarian (hu)</td></tr></tbody></table>
-##
The models are pretrained on 16 V100 GPUs for 600k training steps with a set of fixed hyperparameters: vocabulary size of 100k, context window of 2048, learning rate of 2e−4, and batch size of 4.
-The mGPT architecture is based on GPT-3. We use the architecture description by Brown et al., the code base on GPT-2 (Radford et al., 2019) in the HuggingFace library (Wolf et al., 2020) and Megatron-LM (Shoeybi et al., 2019).
+## Technical details
+The mGPT architecture is based on GPT-3. We use the architecture description by Brown et al., the code base on GPT-2 (Radford et al., 2019) in the HuggingFace library (Wolf et al., 2020) and Megatron-LM (Shoeybi et al., 2019).
+
+## Perplexity
+
+The mGPT13B model achieves the best perplexities within the 2-to-10 score range for the majority of languages, including Dravidian (Malayalam, Tamil, Telugu), Indo-Aryan (Bengali, Hindi, Marathi), Slavic (Belarusian, Ukrainian, Russian, Bulgarian), Sino-Tibetan (Burmese), Kipchak (Bashkir, Kazakh) and others. Higher perplexities up to 20 are for only seven languages from different families.
+
+#### Language-wise perplexity results
+
+
+
+#### Family-wise perplexity results
+
+
+
+_The scores are averaged over the number of languages within each family._
+
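The added README text reports per-language perplexities and family-wise scores "averaged over the number of languages within each family". As a minimal sketch of that aggregation, assuming perplexity is computed as the exponential of the negative mean token log-probability and that the family score is an unweighted mean over languages: the helper names `perplexity` and `family_average` are hypothetical, and the numbers are placeholders, not mGPT results.

```python
import math
from collections import defaultdict


def perplexity(token_log_probs):
    """Return exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def family_average(lang_ppl, lang_family):
    """Unweighted mean of per-language perplexities within each family."""
    buckets = defaultdict(list)
    for lang, ppl in lang_ppl.items():
        buckets[lang_family[lang]].append(ppl)
    return {family: sum(vals) / len(vals) for family, vals in buckets.items()}


# Family assignments follow the README table; the perplexity values are
# placeholders for illustration only (assumption, not reported results).
lang_family = {"ml": "Dravidian", "ta": "Dravidian", "lv": "Baltic"}
lang_ppl = {"ml": 4.0, "ta": 6.0, "lv": 9.0}
family_ppl = family_average(lang_ppl, lang_family)
```

With this kind of macro-average, every language counts equally within its family regardless of how much evaluation text it has, which matches the README's wording but may differ from how the plotted scores were actually produced.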