Training datasets for the fineweb2 classifier.

EuroLingua-GPT
🧠 What is EuroLingua-GPT?
EuroLingua-GPT is a multilingual large language model initiative led by Fraunhofer IAIS, AI Sweden, and TU Dresden. It aims to build a state-of-the-art open-source LLM tailored for Europe, covering 37 European languages and beyond.
🎯 Project Goal
- Develop a high-performing multilingual LLM optimized for European languages.
- Collect, curate, and evaluate large-scale multilingual datasets.
- Train and align the model using current transformer and instruction-tuning techniques.
- Openly release the model to support research, innovation, and responsible AI development in Europe.
- Training framework: Modalities (GitHub)
🗓️ Project Timeline
May 1, 2024 – October 1, 2025
Datasets
- Eurolingua/tokenizer_final_dataset (6 downloads)
- Eurolingua/Fineweb_2_500k_removed (11.7M rows, 98 downloads)
- Eurolingua/Fineweb_2_500k_filtered (11.2M rows, 146 downloads)
- Eurolingua/Fineweb_2_500k_both (11.6M rows, 91 downloads)
- Eurolingua/truthfulqax (178 downloads, 1 like)
- Eurolingua/gsm8kx (309 downloads, 2 likes)
- Eurolingua/hellaswagx (210 downloads, 1 like)
- Eurolingua/arcx (447 downloads, 1 like)
- Eurolingua/mmlux (208 downloads, 1 like)