-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 27 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 43 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 22
Collections
Discover the best community collections!
Collections including paper arxiv:2503.21620
-
Can Large Language Models Understand Context?
Paper • 2402.00858 • Published • 23 -
OLMo: Accelerating the Science of Language Models
Paper • 2402.00838 • Published • 83 -
Self-Rewarding Language Models
Paper • 2401.10020 • Published • 147 -
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity
Paper • 2401.17072 • Published • 25
-
InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
Paper • 2502.11573 • Published • 8 -
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Paper • 2502.02339 • Published • 22 -
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Paper • 2502.11775 • Published • 8 -
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Paper • 2412.18319 • Published • 39
-
End-to-End Goal-Driven Web Navigation
Paper • 1602.02261 • Published -
Learning Language Games through Interaction
Paper • 1606.02447 • Published -
Naturalizing a Programming Language via Interactive Learning
Paper • 1704.06956 • Published -
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
Paper • 1802.08802 • Published
-
MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
Paper • 2503.16874 • Published • 42 -
ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation
Paper • 2503.21729 • Published • 25 -
Modifying Large Language Model Post-Training for Diverse Creative Writing
Paper • 2503.17126 • Published • 33 -
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
Paper • 2503.21620 • Published • 53
-
Gemini Robotics: Bringing AI into the Physical World
Paper • 2503.20020 • Published • 21 -
Magma: A Foundation Model for Multimodal AI Agents
Paper • 2502.13130 • Published • 57 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 50 -
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • 2410.23218 • Published • 50
-
RuCCoD: Towards Automated ICD Coding in Russian
Paper • 2502.21263 • Published • 126 -
Unified Reward Model for Multimodal Understanding and Generation
Paper • 2503.05236 • Published • 112 -
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Paper • 2503.05179 • Published • 43 -
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Paper • 2503.05592 • Published • 25
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Paper • 2310.11441 • Published • 28 -
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Paper • 2501.12326 • Published • 56 -
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Paper • 2406.08451 • Published • 25 -
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
Paper • 2406.10819 • Published • 1
-
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration
Paper • 2502.17110 • Published • 12 -
WebGames: Challenging General-Purpose Web-Browsing AI Agents
Paper • 2502.18356 • Published • 12 -
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
Paper • 2502.18906 • Published • 12 -
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
Paper • 2503.02268 • Published • 10
-
LLM Pruning and Distillation in Practice: The Minitron Approach
Paper • 2408.11796 • Published • 58 -
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Paper • 2408.09174 • Published • 52 -
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Paper • 2408.10914 • Published • 42 -
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Paper • 2408.11878 • Published • 57