Finetune of SigLIP 2 So400m for Long Context
Finetuned from SigLIP 2, this model behaves the same as the base model except that its maximum text length is now 256 tokens instead of 64.
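For reference, here's a minimal usage sketch. It assumes the standard Hugging Face `transformers` SigLIP API and that this repository ships the usual SigLIP processor config; the image path and caption are placeholders.

```python
# Minimal usage sketch (assumes the standard Hugging Face transformers SigLIP
# API; image path and caption are placeholders).
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("fancyfeast/so400m-long")
processor = AutoProcessor.from_pretrained("fancyfeast/so400m-long")

image = Image.open("example.jpg")
texts = ["A long, detailed caption that would previously have been truncated at 64 tokens..."]

# padding="max_length" matches SigLIP's training-time padding; text is now
# padded/truncated to 256 tokens instead of 64.
inputs = processor(text=texts, images=image, padding="max_length",
                   max_length=256, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# SigLIP scores are sigmoids over pairwise logits, not a softmax.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)
```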
Training Settings:
- Training Samples: 10,000,000
- Warmup Samples: 1,000,000
- Batch Size: 256
- Learning Rate: 4e-4
- Schedule: Cosine
- AMP: bfloat16
- Model Weights: float32
- Optimizer: AdamW
- Weight Decay: 0.2
- Clip Grad Norm: 1.0
- Maximum Token Length: 256
These settings are by no means optimal. The SigLIP paper suggests that weight decay hurts when finetuning SigLIP models, and of course these kinds of models tend to benefit from much larger batch sizes. I merely used some defaults from older code.
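For concreteness, here's a minimal sketch of how those settings map onto a plain PyTorch optimizer and schedule. This is not the actual training script: the parameter list is a stand-in for the unfrozen text-tower parameters described below, and the linear warmup into the cosine decay is an assumption.

```python
# Minimal sketch of the listed settings in plain PyTorch; not the actual
# training script. `params` stands in for the unfrozen text-tower parameters.
import math
import torch

BATCH_SIZE = 256
TOTAL_STEPS = 10_000_000 // BATCH_SIZE   # 10M training samples
WARMUP_STEPS = 1_000_000 // BATCH_SIZE   # 1M warmup samples

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in parameter list
optimizer = torch.optim.AdamW(params, lr=4e-4, weight_decay=0.2)

def lr_lambda(step: int) -> float:
    # Linear warmup (assumed), then cosine decay to zero.
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside each training step: bfloat16 autocast over float32 weights,
# gradient clipping at 1.0.
#   with torch.autocast("cuda", dtype=torch.bfloat16):
#       loss = ...
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```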
On a test set of 16K samples, the model starts at a loss of 17.65 and finishes at a loss of 2.51.
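The loss here is the usual SigLIP pairwise sigmoid loss; a sketch of the standard formulation is below for reference (it assumes L2-normalized embeddings and a log-space `logit_scale` as in the Hugging Face implementation, and is not the exact evaluation script).

```python
# Sketch of the SigLIP pairwise sigmoid loss (standard formulation from the
# SigLIP paper). Assumes L2-normalized embeddings and a log-space logit_scale.
import torch
import torch.nn.functional as F

def siglip_loss(image_embeds: torch.Tensor,
                text_embeds: torch.Tensor,
                logit_scale: torch.Tensor,
                logit_bias: torch.Tensor) -> torch.Tensor:
    # Pairwise logits for every image/text combination in the batch.
    logits = image_embeds @ text_embeds.t() * logit_scale.exp() + logit_bias
    # +1 on the diagonal (matching pairs), -1 for all other pairs.
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    # -log sigmoid(label * logit), summed over texts, averaged over images.
    return -F.logsigmoid(labels * logits).sum(dim=1).mean()
```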
The dataset consists of about 1.2M text-image pairs drawn from a variety of sources. About 250k examples are random CommonCrawl image/alt-text pairs, which should best match so400m's original training data. The remainder comes from the JoyCaption dataset, which contains a wide variety of image types and paired text such as descriptive captions, booru tag lists, Stable Diffusion prompts, and VQA.
During training the vision tower was kept completely frozen, along with logit_scale, logit_bias, and the text tower's head. The rest of the text tower was left unfrozen. This is to help ensure that the finetuning process preserves the original embedding space and focuses merely on upgrading the context length and the types of text handled.
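A minimal sketch of this freezing scheme, assuming the Hugging Face `SiglipModel` parameter layout (`vision_model`, `text_model`, `text_model.head`, `logit_scale`, `logit_bias`):

```python
# Sketch of the freezing scheme, assuming the Hugging Face SiglipModel layout.
from transformers import AutoModel

model = AutoModel.from_pretrained("google/siglip2-so400m-patch14-384")

# Freeze everything: vision tower, logit_scale, logit_bias, and (for now)
# the text tower.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze the text tower...
for p in model.text_model.parameters():
    p.requires_grad = True

# ...but keep its final head frozen.
for p in model.text_model.head.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
```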
The position embeddings were expanded by leaving the original 64 embeddings intact in their original positions, while initializing the new positions randomly. No ablations were performed to determine whether this is the optimal approach; however, I noted during experimentation that the model is fairly insensitive to the position embeddings.
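Roughly, the expansion looks like the following sketch, assuming the Hugging Face `SiglipTextModel` embedding layout; the normal(0, 0.02) init for the new rows is illustrative rather than the exact init used.

```python
# Sketch of the position-embedding expansion, assuming the Hugging Face
# SiglipTextModel embedding layout. The normal(0, 0.02) init is illustrative.
import torch

def expand_position_embeddings(text_model, new_max_len: int = 256) -> None:
    old_emb = text_model.embeddings.position_embedding     # nn.Embedding(64, dim)
    old_len, dim = old_emb.weight.shape

    new_emb = torch.nn.Embedding(new_max_len, dim)
    torch.nn.init.normal_(new_emb.weight, std=0.02)         # random init for new slots
    with torch.no_grad():
        new_emb.weight[:old_len] = old_emb.weight            # keep the original 64 rows intact

    text_model.embeddings.position_embedding = new_emb
    # Extend the cached position ids and the config to the new length.
    text_model.embeddings.position_ids = torch.arange(new_max_len).unsqueeze(0)
    text_model.config.max_position_embeddings = new_max_len
```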
In practice I've found that this model performs slightly better than the base SigLIP 2 so400m, but it tends to prefer shorter text: given two texts that both perfectly describe the image, the model will tend to score the shorter of the two higher. The model's ability to recognize booru tag lists for photorealistic images is also imperfect.
Credits
Credits to the SigLIP 2 team for their amazing work on improving an already great model.
BibTeX entry and citation info
```bibtex
@misc{tschannen2025siglip2multilingualvisionlanguage,
  title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
  author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
  year={2025},
  eprint={2502.14786},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.14786},
}
```
Model: fancyfeast/so400m-long
Base model: google/siglip2-so400m-patch14-384