Why does the ROOTS Corpus not include German language?

#221

by akratz - opened Mar 27, 2023

BLOOM was trained on the ROOTS Corpus. Why does this corpus not contain German in its linguistic makeup? Is there a specific reason for that? English and French are included, so one would expect German as well.

yjernite

BigScience Workshop org Mar 27, 2023

Hi @akratz! Each language intentionally included in ROOTS was the outcome of significant human curation to identify good sources and language-specific pre-processing steps. Since the project was volunteer-driven, this meant that languages beyond the starting set (made up of languages with the most speakers around the world) were selected based on participants' interest and bandwidth - we did not manage to create a working group for German in time for training the model. We hope that future efforts for data curation can re-use some of the tools and methodology we proposed to address this limitation, though!

You can find more details in the following paper:
https://huggingface.co/papers/2303.03915

yjernite

BigScience Workshop org Mar 27, 2023

Additionally, while BLOOM was not initially trained on German, there has been some really amazing work on post-hoc adaptation and language transfer. Check out this one!
https://opengptx.dfki.de/
