Many VLMs claim to process hours of video. But can they follow the story?๐ค Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!โณ
We test three skills that matter for real-world use: ๐ Localized Retrieval: Find a specific action. ๐งฉ Information Synthesis: Piece together scattered clues. ๐ Fine-Grained Perception: Analyze detailed motion (e.g., count how many times a person swings an axe).
The results are in, and they're revealing. Only Gemini 2.5 pro handles 1-hour-long videos. Performance drops sharply with duration, proving that long video understanding is still challenging. We've found the breaking pointsโnow the community can start fixing them.๐
Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.
Humans often solve visual problems by sketching ideas in our minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal โmental sketchesโ?
Thatโs the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.
These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.
๐ง Mirage is trained in two phases:
1) Grounding: It learns to produce latent tokens anchored in real images. 2) Refinement: The model drops the images and learns to generate visual tokens on its own.
๐ And yes, it works! On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines. Smart sketches > empty words.
Extremely bullish on @CohereForAI's Aya Vision (8B & 32B) - new SOTA open-weight VLMs
- 8B wins up to 81% of the time in its class, better than Gemini Flash - 32B beats Llama 3.2 90B! - Covers 23 languages, excels in image captioning, VQA & more - Integrated on transformers from Day 0!
Weโre thrilled to share ๐ฆ๐บ๐ผ๐น๐ฉ๐๐ (256M & 500M)โthe smallest Visual Language Models ever built. Think: running on <1GB of GPU memoryโyou can fine-tune it on your laptop and run it on your toaster!
Why Itโs Game-Changing: - ๐ข๐๐๐ฝ๐ฒ๐ฟ๐ณ๐ผ๐ฟ๐บ๐ ๐๐ฎ๐ฟ๐ด๐ฒ๐ฟ ๐ ๐ผ๐ฑ๐ฒ๐น๐: Even the 256M model surpasses our SOTA 80B-parameter model from just 17 months ago. Over 300x reduction! ๐ ๐ถ๐ด๐ต๐๐ ๐๐ณ๐ณ๐ถ๐ฐ๐ถ๐ฒ๐ป๐ฐ๐: The 256M version delivers 80% of our 2.2B modelโs performance, and the 500M version hits 90% ๐๐ถ๐ด๐ต๐๐ป๐ถ๐ป๐ด-๐๐ฎ๐๐ ๐ฆ๐ฒ๐ฎ๐ฟ๐ฐ๐ต: SmolVLM integrates with ColiPali for state-of-the-art retrieval speedsโon par with models 10x bigger. That means cheaper, faster indexing and real-world impact.
Whatโs New Under the Hood: - ๐ก๐ฒ๐ ๐ฉ๐ถ๐๐ถ๐ผ๐ป ๐๐ป๐ฐ๐ผ๐ฑ๐ฒ๐ฟ: Smaller overall size (400M -> 93M), but with higher resolution. - ๐๐ถ๐ด๐ต๐ฒ๐ฟ ๐ฃ๐ถ๐ ๐ฒ๐น๐/๐ง๐ผ๐ธ๐ฒ๐ป: 4096 vs. 1820โmore efficient image processing. - ๐ฆ๐บ๐ฎ๐ฟ๐ ๐ง๐ผ๐ธ๐ฒ๐ป๐ถ๐๐ฎ๐๐ถ๐ผ๐ป: Faster training and a performance boost.