Training data
Hello
Thanks for this model - it was clearly needed in the EU!
Do you have an exhaustive description of the data that you used, both for pretraining and instruction tuning? Since this is an EU project and the models are Apache-licensed, I'd be very hopeful to see the datasets released/transparently described!
Thanks again for the work!
Hi Bram. Thank you!
We are going to release a technical report describing all the data, pre-training and post-training details soon.
And we're also planning to release the data once we release the final model (which we're starting to work on now).
That sounds awesome! Transparency is so important and very much appreciated - thanks a lot!
Any update?
The final model is still training.
It's a 22B... it's taking time :)
FYI, Mistral-3.1-24B-instruct is giving quite good results (better than EuroLLM-9B-Instruct), but as soon as I try to fine-tune it, I lose quality.
Our experience testing Mistral 3.1 is that it's good only on a handful of languages. I believe the Gemma 3 models are better and have broader coverage (but they are also hard to fine-tune).
In my tests (Dutch), Gemma's tokenizer is well suited (lower fertility), but the model is indeed hard to fine-tune. Qwen 2.5 and Llama 3 are still my go-tos at the moment.
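By fertility I mean the average number of subword tokens per whitespace-separated word; lower is generally better for the target language. Here's a minimal sketch of how you can compare it across tokenizers with `transformers` (the model IDs and the Dutch sample sentence are just illustrative, some of these repos are gated, and a larger corpus gives a more stable estimate):

```python
from transformers import AutoTokenizer

# Illustrative model IDs -- some repos are gated and may require
# `huggingface-cli login`; swap in whichever checkpoints you actually use.
MODEL_IDS = [
    "google/gemma-3-4b-it",
    "Qwen/Qwen2.5-7B-Instruct",
    "meta-llama/Meta-Llama-3-8B-Instruct",
]

# Any representative Dutch text works; more text gives a more reliable number.
SAMPLE = (
    "De snelle bruine vos springt over de luie hond, "
    "terwijl de zon langzaam achter de horizon verdwijnt."
)

def fertility(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    tokens = tokenizer.tokenize(text)
    words = text.split()
    return len(tokens) / len(words)

for model_id in MODEL_IDS:
    tok = AutoTokenizer.from_pretrained(model_id)
    print(f"{model_id}: fertility = {fertility(tok, SAMPLE):.2f}")
```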