TRAINING DATA

#23
by amanpreet7 - opened

I wanted to ask which datasets, or to be precise, what datasets you used to train this LLM? I am thrilled to know.

Hi @amanpreet7,

Welcome to the Google Gemma family of open models. Gemma models are trained on a large corpus of openly available internet data such as books, novels, blogs, etc. Gemma 3 models are pre-trained on a slightly larger token budget than Gemma 2: we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B. The increase in tokens accounts for the mix of images and text used during pre-training. We also increase the amount of multilingual data to improve language coverage, adding both monolingual and parallel data and handling the imbalance in language representation using a strategy inspired by Chung et al. (2023).

To learn more about the Gemma models, please visit the following page.

Thanks.

What were the names of the datasets that you used?

Hi, the model is trained on a large amount of openly available data such as blogs, novels, and other open-source resources. We don't have any information on the specific datasets used. For more technical details about the model, please see the following document.

Thanks.
