Tempo14's Collections
vision
What matters when building vision-language models?
Paper
•
2405.02246
•
Published
•
104
An Introduction to Vision-Language Modeling
Paper
•
2405.17247
•
Published
•
90
DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark
Paper
•
2405.19707
•
Published
•
7
Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations
Paper
•
2410.08049
•
Published
•
8
Task Vectors are Cross-Modal
Paper
•
2410.22330
•
Published
•
11
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
Paper
•
2411.14347
•
Published
•
13
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Paper
•
2412.04424
•
Published
•
63
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper
•
2412.04467
•
Published
•
111
iFormer: Integrating ConvNet and Transformer for Mobile Application
Paper
•
2501.15369
•
Published
•
12
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Paper
•
2502.14786
•
Published
•
142
Paper
•
2502.17941
•
Published
•
8