LaViDa-1.0 Collection LArge VIsion-language Diffusion moDel with mAsking • 11 items • Updated 6 days ago • 6
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning Paper • 2505.08617 • Published 19 days ago • 40
view article Article What is test-time compute and how to scale it? By Kseniase and 1 other • Feb 6 • 89
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought Paper • 2504.05599 • Published Apr 8 • 83
view article Article You could have designed state of the art positional encoding By FL33TW00D-HF • Nov 25, 2024 • 280
view article Article SmolLM - blazingly fast and remarkably powerful By loubnabnl and 2 others • Jul 16, 2024 • 374
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models Paper • 2503.06749 • Published Mar 9 • 30
Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers? Paper • 2503.10632 • Published Mar 13 • 14
view article Article SmolVLM - small yet mighty Vision Language Model By andito and 4 others • Nov 26, 2024 • 298
WebArena: A Realistic Web Environment for Building Autonomous Agents Paper • 2307.13854 • Published Jul 25, 2023 • 25
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters Paper • 2408.03314 • Published Aug 6, 2024 • 63
Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation Paper • 2402.10210 • Published Feb 15, 2024 • 36
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation Paper • 2401.05675 • Published Jan 11, 2024 • 26
Video ReCap: Recursive Captioning of Hour-Long Videos Paper • 2402.13250 • Published Feb 20, 2024 • 27
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? Paper • 2307.16368 • Published Jul 31, 2023 • 12