-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 27 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 43 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 22
Collections
Discover the best community collections!
Collections including paper arxiv:2503.10639
-
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
Paper • 2503.12797 • Published • 29 -
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
Paper • 2503.12329 • Published • 24 -
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Paper • 2503.10639 • Published • 47
-
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
Paper • 2404.13026 • Published • 24 -
Distilling Diversity and Control in Diffusion Models
Paper • 2503.10637 • Published • 14 -
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Paper • 2503.10639 • Published • 47
-
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Paper • 2410.14059 • Published • 60 -
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Paper • 2503.05179 • Published • 43 -
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper • 2503.04130 • Published • 89 -
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Paper • 2503.10639 • Published • 47
-
RL + Transformer = A General-Purpose Problem Solver
Paper • 2501.14176 • Published • 27 -
Towards General-Purpose Model-Free Reinforcement Learning
Paper • 2501.16142 • Published • 28 -
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Paper • 2501.17161 • Published • 118 -
MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization
Paper • 2412.12098 • Published • 4
-
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 29 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 106 -
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 26 -
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 100
-
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Paper • 2411.02337 • Published • 37 -
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper • 2411.04996 • Published • 51 -
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
Paper • 2411.03562 • Published • 67 -
StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
Paper • 2410.08815 • Published • 49
-
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Paper • 2407.09413 • Published • 11 -
MAVIS: Mathematical Visual Instruction Tuning
Paper • 2407.08739 • Published • 33 -
Kvasir-VQA: A Text-Image Pair GI Tract Dataset
Paper • 2409.01437 • Published • 71 -
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper • 2409.05840 • Published • 48