Why does the ROOTS Corpus not include the German language?
BLOOM was trained on the ROOTS Corpus. Why does this corpus not contain German in its linguistic makeup? Is there a specific reason for that? English and French are included, so one would expect German as well.
Hi @akratz! Each language intentionally included in ROOTS was the outcome of significant human curation to identify good sources and language-specific pre-processing steps. Since the project was volunteer-driven, this meant that languages beyond the starting set (made up of languages with the most speakers around the world) were selected based on participants' interest and bandwidth. We did not manage to create a working group for German in time for training the model. We hope that future data curation efforts can re-use some of the tools and methodology we proposed to address this limitation, though!
You can find more details in the following paper:
https://huggingface.co/papers/2303.03915
Additionally, while BLOOM was not initially trained on German, there has been some really amazing work on post-hoc adaptation and language transfer. Check out this one!
https://opengptx.dfki.de/