MaLA corpus
MaLA Corpus for Massive Language Adaptation of Large Language Models https://mala-lm.github.io
Viewer • Updated • 1.14B • 1.93k • 2
Note: The noisy version of the MaLA monolingual corpus, which integrates texts from different sources without cleaning.
MaLA-LM/mala-monolingual-filter
Viewer • Updated • 1.42B • 13.2k • 2
Note: The filtered version of the MaLA monolingual corpus, with further data filtering applied.
MaLA-LM/mala-monolingual-dedup
Viewer • Updated • 969M • 11.5k • 2
Note: The deduplicated version of the MaLA monolingual corpus, with repeated data points removed.
MaLA-LM/mala-monolingual-split
Viewer • Updated • 538M • 9.18k • 2
Note: The final version of the MaLA monolingual corpus, produced by splitting the filtered and deduplicated version into training and test sets.
MaLA-LM/mala-bilingual-translation-corpus
Viewer • Updated • 14.4B • 2.13k • 5
Note: The MaLA bilingual translation corpus contains parallel data in more than 2,500 language pairs (500+ languages).
MaLA-LM/mala-code-reasoning
Viewer • Updated • 44.9M • 134 • 1
Note: The first version of the MaLA code and reasoning dataset, used for training https://huggingface.co/MaLA-LM/emma-500-llama2-7b
MaLA-LM/mala-code-reasoning-v2
Viewer • Updated • 89.7M • 261 • 3
Note: The second version of the MaLA code and reasoning dataset, used for training the EMMA-500 Llama 3(.1) Mono/Bi model series.
MaLA-LM/mala-opus-dedup-2410
Viewer • Updated • 44.3B • 11.6k • 1
Note: mala-opus-dedup-2410 is the bilingual part of the MaLA Corpus. It is a cleaned and deduplicated version of the OPUS corpus, collected from OPUS with a cutoff of October 2024 (2410).
MaLA-LM/mala-opus-dedup-2410-sample
Viewer • Updated • 6.48B • 611
Note: A sampled set of MaLA-LM/mala-opus-dedup-2410
MaLA-LM/mala-opus-dedup-shuffle-2410
Preview • Updated • 1.26k
Note: A shuffled version of MaLA-LM/mala-opus-dedup-2410
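
For readers who want to inspect one of the datasets listed above without downloading it in full, the following is a minimal sketch using the Hugging Face `datasets` library in streaming mode. The repository IDs are copied from the listing; the helper names `hub_url` and `peek`, the default split name `"train"`, and the optional `config` argument are assumptions made for illustration, and a given repository may require a specific configuration or use different split names.

```python
from itertools import islice
from typing import Optional

# Repository IDs copied from the collection listing above.
MALA_DATASETS = [
    "MaLA-LM/mala-monolingual-filter",
    "MaLA-LM/mala-monolingual-dedup",
    "MaLA-LM/mala-monolingual-split",
    "MaLA-LM/mala-bilingual-translation-corpus",
    "MaLA-LM/mala-code-reasoning",
    "MaLA-LM/mala-code-reasoning-v2",
    "MaLA-LM/mala-opus-dedup-2410",
    "MaLA-LM/mala-opus-dedup-2410-sample",
    "MaLA-LM/mala-opus-dedup-shuffle-2410",
]


def hub_url(repo_id: str) -> str:
    """Return the Hugging Face Hub page URL for a dataset repository ID."""
    return f"https://huggingface.co/datasets/{repo_id}"


def peek(repo_id: str, split: str = "train",
         config: Optional[str] = None, n: int = 3) -> list:
    """Stream the first n rows of a dataset without a full download.

    Hypothetical helper: the split name and config are assumptions and
    may need adjusting for a given repository.
    """
    # Imported lazily so the rest of the module works without `datasets`.
    from datasets import load_dataset

    stream = load_dataset(repo_id, name=config, split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    for repo_id in MALA_DATASETS:
        print(hub_url(repo_id))
```

Calling `peek("MaLA-LM/mala-monolingual-split")` would need network access and the `datasets` package installed. Streaming is used here because several of the listed corpora contain billions of rows.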