SigLIP2 or SigLIP1

#3
by JosephusCheung - opened

License

BAGEL is licensed under the Apache 2.0 license. It is finetuned from Qwen2.5-7B-Instruct and siglip-so400m-14-980-flash-attn2-navit model, and uses the FLUX.1-schnell VAE model, all under Apache 2.0.

siglip-so400m-14-980-flash-attn2-navit by HuggingFaceM4 is SigLIP1, but your paper says:

We adopt SigLIP2-so400m/14 [74] with a fixed 384-resolution as the initialization of the ViT encoder. Building

ByteDance Seed org

Thanks for pointing out this issue! We use siglip-so400m-14-384-flash-attn2. The information in the license has been updated.

I'm still confused.

'siglip-so400m-14-384-flash-attn2' seems to be SigLIP1. But SigLIP2-so400m/14 was mentioned in your paper.

Let me clarify this. We use SigLIP2-so400m/14 with a 384x384 input resolution, then we interpolate the position embeddings to 980x980.
This is actually what siglip-so400m-14-384-flash-attn2 has done to siglip-so400m-14-384.
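
For readers unfamiliar with the interpolation step mentioned above, here is a minimal sketch in plain Python. It is only an illustration, not the BAGEL or SigLIP code: `interpolate_pos_embed` is a hypothetical helper, and the bilinear, corner-aligned resampling is an assumption, since the thread does not say which interpolation mode was used. With a patch size of 14, a 384x384 input gives a 27x27 grid of position embeddings and a 980x980 input gives a 70x70 grid.

```python
def interpolate_pos_embed(grid, new_size):
    """Bilinearly resize a square grid of position embeddings.

    grid: list of rows; each row is a list of embedding vectors (lists of floats).
    new_size: number of patches per side in the target grid.
    """
    old_size = len(grid)
    dim = len(grid[0][0])

    def source_coord(i):
        # Map target index i onto the source axis so the corners line up.
        # Corner alignment is an assumption of this sketch.
        if new_size == 1 or old_size == 1:
            return 0.0
        return i * (old_size - 1) / (new_size - 1)

    out = []
    for r in range(new_size):
        y = source_coord(r)
        y0 = int(y)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0
        row = []
        for c in range(new_size):
            x = source_coord(c)
            x0 = int(x)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0
            # Blend the four neighbouring embeddings, component by component.
            vec = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][k] + wx * grid[y0][x1][k])
                + wy * ((1 - wx) * grid[y1][x0][k] + wx * grid[y1][x1][k])
                for k in range(dim)
            ]
            row.append(vec)
        out.append(row)
    return out


PATCH = 14
old_side = 384 // PATCH  # patches per side at 384x384
new_side = 980 // PATCH  # patches per side at 980x980

# Toy 2-dimensional "embeddings" that simply store their (row, column) index.
toy_grid = [[[float(r), float(c)] for c in range(old_side)] for r in range(old_side)]
resized = interpolate_pos_embed(toy_grid, new_side)
```

In practice the embeddings are tensors with a much larger hidden dimension, and the resize would normally be done with a library routine rather than nested loops. The sketch only shows why the number of position embeddings grows from 27 * 27 = 729 to 70 * 70 = 4900 while the learned values are reused.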
