TRAINING DATA

#23
by amanpreet7 - opened

I wanted to ask which datasets, or to be precise, what datasets you used to train this LLM? I am thrilled to know.

Hi @amanpreet7,

Welcome to the Google Gemma family of open models. Gemma models are trained on a large corpus of openly available internet data such as books, novels, blogs, etc. Gemma 3 models are pre-trained on a slightly larger token budget than Gemma 2: we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B. The increase in tokens accounts for the mix of images and text used during pre-training. We also increase the amount of multilingual data to improve language coverage, adding both monolingual and parallel data and handling the imbalance in language representation using a strategy inspired by Chung et al. (2023).

To learn more about the Gemma models, please visit the following page.

Thanks.

What were the names of the datasets that you used?

Hi, the model is trained on a large amount of openly available data such as blogs, novels, and other open-source resources. We don't have any information on the specific datasets used. For more technical details about the model, please see the following document.

Thanks.
