FG-CLIP: Fine-Grained Visual and Textual Alignment
Paper • 2505.05071
Would you share the total training cost? Training IDEFICS2-8B used "approximately 1.5 billion images and 225 billion text tokens", which is quite a lot for an 8B-parameter LMM.