Video Handling & Training
I see in the description that this model can handle images.
Can it also handle videos?
Also, I was looking to train it on Arabic. Can I get some metrics on its performance on Arabic?
If I want to train it on Arabic, how do I do that? Should I continue training the pretrained model with only the language-model layers unfrozen and the rest frozen? Will that work?
I see in your technical document that you pretrain the model. Since this is a vision model, how is it pretrained? Is the pretraining objective causal LM, i.e., you feed an image (or a series of images) plus text and it learns to produce the text as output, just like in text models? Or are the vision encoder and the language model trained separately and then merged end to end (with the MLP projection layers) during instruction tuning?
Hi @mosama, the gemma-3-1b-pt model is text-only and does not handle videos directly. For video, the other Gemma 3 variants (4B, 12B & 27B) are suitable: they are multimodal and integrate a vision encoder (SigLIP) that converts visual data into "soft tokens", which are then processed by the language model.
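As an illustration, here is a minimal inference sketch with one of the multimodal variants, assuming a recent transformers release with Gemma 3 support; the image URL and prompt are placeholders, not part of any official example:

```python
# Minimal sketch: image + text inference with a multimodal Gemma 3 variant.
# The 1B model is text-only; use the 4B/12B/27B variants for images.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",  # multimodal variant, not gemma-3-1b-pt
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])
```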
While gemma-3-1b-pt has general multilingual capabilities, specific performance metrics for Arabic would require evaluating fine-tuned versions. Fine-tuning for Arabic is feasible using Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA, which adapt the model by training only a small subset of parameters; a sketch is shown below. Please refer to the mentioned link for more details. Thank you.
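Here is a minimal LoRA fine-tuning sketch for adapting gemma-3-1b-pt to Arabic text. It is illustrative only: the dataset name is hypothetical, and the hyperparameters and target modules are assumptions rather than official recommendations.

```python
# Minimal LoRA sketch: the base weights stay frozen, only small adapter
# matrices are trained, which is usually much cheaper than full fine-tuning.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "google/gemma-3-1b-pt"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach LoRA adapters to the attention projections (assumed target modules).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hypothetical Arabic text dataset with a "text" column.
dataset = load_dataset("your-org/arabic-corpus", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gemma3-1b-arabic-lora",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

This achieves roughly what you described (keeping most weights frozen), except that instead of unfreezing the language-model layers it trains low-rank adapters on top of them.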
Is it possible to just add a SigLIP encoder and projector to the 1B model to make it multimodal?