Training data
Hello
Thanks for this model - it was clearly needed in the EU!
Do you have an exhaustive description of the data that you used, both for pretraining and instruction tuning? Since this is an EU project and the models are Apache-licensed, I'd be very hopeful to see the datasets released/transparently described!
Thanks again for the work!
Hi Bram. Thank you!
We are going to release a technical report describing all the data, pre-training and post-training details soon.
And we're also planning to release the data once we release the final model (which we're starting to work on now).
That sounds awesome! Transparency is so important and very much appreciated - thanks a lot!
Any update?
The final model is still training.
It's a 22B... it's taking time :)
FYI, Mistral-3.1-24B-instruct is giving quite good results (better than EuroLLM-9B-Instruct), but as soon as I try to fine-tune it, I lose quality.
Our experience testing Mistral 3.1 is that it's good only on a handful of languages. I believe the Gemma 3 models are better and have broader coverage (but they are also hard to fine-tune).
In my tests (Dutch), Gemma's tokenizer is well suited (lower fertility), but the model is indeed hard to fine-tune. Qwen 2.5 and Llama 3 are still my go-tos at the moment.
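By fertility I mean the average number of subword tokens per whitespace-separated word; lower is generally better for the target language. Here's a minimal sketch of how you can compare it across tokenizers with `transformers` (the model IDs and the Dutch sample sentence are just illustrative, some of these repos are gated, and a larger corpus gives a more stable estimate):

```python
from transformers import AutoTokenizer

# Illustrative model IDs -- some repos are gated and may require
# `huggingface-cli login`; swap in whichever checkpoints you actually use.
MODEL_IDS = [
    "google/gemma-3-4b-it",
    "Qwen/Qwen2.5-7B-Instruct",
    "meta-llama/Meta-Llama-3-8B-Instruct",
]

# Any representative Dutch text works; more text gives a more reliable number.
SAMPLE = (
    "De snelle bruine vos springt over de luie hond, "
    "terwijl de zon langzaam achter de horizon verdwijnt."
)

def fertility(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    tokens = tokenizer.tokenize(text)
    words = text.split()
    return len(tokens) / len(words)

for model_id in MODEL_IDS:
    tok = AutoTokenizer.from_pretrained(model_id)
    print(f"{model_id}: fertility = {fertility(tok, SAMPLE):.2f}")
```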