-
Cosmos World Foundation Model Platform for Physical AI
Paper • 2501.03575 • Published • 67 -
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Paper • 2501.00599 • Published • 41 -
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
Paper • 2501.03841 • Published • 52 -
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Paper • 2501.04003 • Published • 24
Collections
Discover the best community collections!
Collections including paper arxiv:2501.03841
-
Agents for self-driving laboratories applied to quantum computing
Paper • 2412.07978 • Published • 1 -
Towards Scientific Discovery with Generative AI: Progress, Opportunities, and Challenges
Paper • 2412.11427 • Published • 1 -
AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions
Paper • 2411.18015 • Published • 1 -
LLM4SR: A Survey on Large Language Models for Scientific Research
Paper • 2501.04306 • Published • 33
-
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Paper • 2501.02955 • Published • 40 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 98 -
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper • 2501.12380 • Published • 78 -
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper • 2501.09781 • Published • 24
-
OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 37 -
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Paper • 2411.19650 • Published -
Octo: An Open-Source Generalist Robot Policy
Paper • 2405.12213 • Published • 26 -
Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression
Paper • 2412.03293 • Published
-
GRUtopia: Dream General Robots in a City at Scale
Paper • 2407.10943 • Published • 24 -
Make-An-Agent: A Generalizable Policy Network Generator with Behavior-Prompted Diffusion
Paper • 2407.10973 • Published • 10 -
Cross Anything: General Quadruped Robot Navigation through Complex Terrains
Paper • 2407.16412 • Published • 6 -
RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands
Paper • 2408.11048 • Published • 4
-
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
Paper • 2406.02523 • Published • 11 -
UniT: Unified Tactile Representation for Robot Learning
Paper • 2408.06481 • Published • 10 -
Latent Action Pretraining from Videos
Paper • 2410.11758 • Published • 2 -
Neural Fields in Robotics: A Survey
Paper • 2410.20220 • Published • 4
-
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Paper • 2311.17049 • Published • 1 -
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper • 2405.04434 • Published • 15 -
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Paper • 2303.17376 • Published -
Sigmoid Loss for Language Image Pre-Training
Paper • 2303.15343 • Published • 6