SigLIP or SigLIP2 encoder?

#37 by orrzohar


Google org

Hi @orrzohar ,

Yes, SigLIP and SigLIP 2 utilize similar encoder architectures, both employing the Vision Transformer (ViT) design with learned positional embeddings.
Please refer to this reference.
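
For instance, here is a minimal sketch (the checkpoint ID is just an example) showing that the positional embeddings in the transformers SigLIP vision encoder are a plain learned embedding table:

```python
from transformers import SiglipVisionModel

# Example checkpoint; any SigLIP vision checkpoint exposes the same structure.
model = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")

# The positional embeddings are a learned nn.Embedding table,
# one row per image patch position (14x14 = 196 for 224px images at patch size 16).
pos_emb = model.vision_model.embeddings.position_embedding
print(type(pos_emb).__name__, tuple(pos_emb.weight.shape))  # Embedding (196, 768)
```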

Thank you.

Hi @GopiUppari ,
I am familiar with SigLIP.
However, the Gemma3 paper does not state whether SigLIP or SigLIP 2 was used. It is impossible to tell from the config either, because the architecture is the same: both are defined as siglip_vision_model.
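
To illustrate, a minimal sketch (the model ID is just one example of a Gemma 3 multimodal checkpoint):

```python
from transformers import AutoConfig

# Example Gemma 3 checkpoint; any multimodal variant reports the same vision config.
cfg = AutoConfig.from_pretrained("google/gemma-3-4b-it")

# Prints "siglip_vision_model" either way, so the config alone
# cannot tell SigLIP weights apart from SigLIP 2 weights.
print(cfg.vision_config.model_type)
```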
Did Gemma3 use the SigLIP or the SigLIP 2 checkpoints?

Best,
Orr

I'm also curious whether the siglip_vision_model's embeddings remain general-purpose (i.e., frozen during Gemma training) or whether SigLIP was fine-tuned to improve Gemma's performance.

@udaybondi I would be shocked if they kept the encoder frozen, everyone trains the encoder nowadays

According to the Gemma3 paper, they used SigLIP rather than SigLIP 2, and they froze its weights during training for "simplicity". But it is not stated whether the weights they used are the same as the public SigLIP release.
https://arxiv.org/pdf/2503.19786
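
One way to probe that would be to diff the Gemma 3 vision tower against a public SigLIP release. A rough sketch, not a verified result: the model IDs are examples, `vision_tower` is the attribute name in recent transformers versions, and shape mismatches are expected since Gemma 3 runs its encoder at a higher resolution:

```python
import torch
from transformers import AutoModel, SiglipVisionModel

# Example checkpoints; the Gemma 3 report describes a SigLIP-based encoder.
gemma = AutoModel.from_pretrained("google/gemma-3-4b-it", torch_dtype=torch.float32)
siglip = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")

g_state = gemma.vision_tower.state_dict()  # attribute may be nested in older versions
s_state = siglip.state_dict()

# Compare only tensors whose shapes agree; resolution-dependent tensors
# (e.g. positional embeddings) will differ in shape and are skipped.
shared = g_state.keys() & s_state.keys()
matches = sum(
    g_state[k].shape == s_state[k].shape
    and torch.allclose(g_state[k].float(), s_state[k].float(), atol=1e-3)
    for k in shared
)
print(f"{matches}/{len(shared)} shared tensors match")
```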

"We use a vision encoder based on SigLIP (Zhai et al., 2023)." could be SigLIP2, SigLIP, or even encoders from Paligemma/similar...
