- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 27
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 43
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 22
Collections
Collections including paper arxiv:2503.12937
- InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 8
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 22
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 8
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 39
- Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs
  Paper • 2503.12303 • Published • 7
- DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
  Paper • 2503.12797 • Published • 29
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
  Paper • 2503.12937 • Published • 27
- MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
  Paper • 2503.13399 • Published • 20
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
  Paper • 2503.12605 • Published • 31
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
  Paper • 2503.12937 • Published • 27
- Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
  Paper • 2503.12271 • Published • 9
- Video-T1: Test-Time Scaling for Video Generation
  Paper • 2503.18942 • Published • 85
- Personalize Anything for Free with Diffusion Transformer
  Paper • 2503.12590 • Published • 42
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
  Paper • 2503.12937 • Published • 27
- Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks
  Paper • 2503.11514 • Published • 15
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
  Paper • 2503.05592 • Published • 25
- Learning from Failures in Multi-Attempt Reinforcement Learning
  Paper • 2503.04808 • Published • 17
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
  Paper • 2503.12937 • Published • 27