Gemma 3 fine tuning max token length

#22
by mukhayy - opened

Looking to fine-tune google/gemma-3-12b-it on my dataset of around 10k examples. But my dataset outputs are quite lengthy (some of them may reach 125k tokens, with the average around 60k), so I thought I might take advantage of this model's max_position_embeddings = 131072. But I haven't seen any fine-tuning example that sets max_seq_length of trl.SFTTrainer to 131072.
Is this doable, or does 131072 only apply to inference? How do people approach fine-tuning on datasets with lengthy outputs? Could you also share what hardware (and how many GPUs) works best in your experience?
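For concreteness, this is roughly the setup I had in mind (just a sketch; the dataset file is a placeholder and the exact SFTConfig parameter names may differ between TRL versions):

```python
# Rough sketch of the intended training setup with TRL's SFTTrainer.
# "my_10k_examples.jsonl" is a placeholder for my own dataset.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="my_10k_examples.jsonl", split="train")

config = SFTConfig(
    output_dir="gemma3-12b-longctx",
    max_seq_length=131072,            # the value I'm unsure about for training
    per_device_train_batch_size=1,    # long sequences -> tiny batches
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    bf16=True,
)

trainer = SFTTrainer(
    model="google/gemma-3-12b-it",
    args=config,
    train_dataset=dataset,
)
trainer.train()
```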
Thank you
