Abstract
PlayerOne is an egocentric realistic world simulator that constructs a world from a user-captured scene image and generates motion-aligned egocentric videos, using a coarse-to-fine training pipeline, a part-disentangled motion injection scheme, and a joint reconstruction framework.
We introduce PlayerOne, the first egocentric realistic world simulator, enabling immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the user's real-world motion as captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first pretrains on large-scale egocentric text-video pairs for coarse-level egocentric understanding, then finetunes on synchronous motion-video data extracted from egocentric-exocentric video datasets via our automatic construction pipeline. Moreover, considering the varying importance of different body parts, we design a part-disentangled motion injection scheme that enables precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and the video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate strong generalization in precise control of varying human movements and world-consistent modeling of diverse scenarios. This work marks the first endeavor into egocentric real-world simulation and can pave the way for the community to explore fresh frontiers of world modeling and its diverse applications.
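To make the part-disentangled motion injection idea concrete, below is a minimal, hypothetical sketch of how per-part motion streams could be encoded separately and fused into a conditioning sequence for a video generator. The split into head/hands/body, every module name, and all dimensions are assumptions for illustration only, not the paper's actual implementation.

```python
# Hypothetical sketch of part-disentangled motion injection.
# All names, shapes, and the head/hands/body split are assumptions;
# the paper's actual architecture may differ.
import torch
import torch.nn as nn

class PartDisentangledMotionInjector(nn.Module):
    """Encodes each body-part motion stream with its own encoder, then
    fuses the part tokens into one conditioning sequence that a video
    diffusion backbone could attend to via cross-attention."""

    def __init__(self, motion_dim=63, token_dim=768,
                 parts=("head", "hands", "body")):
        super().__init__()
        self.parts = parts
        # One lightweight encoder per part, so each part is modeled independently.
        self.encoders = nn.ModuleDict({
            p: nn.Sequential(nn.Linear(motion_dim, token_dim), nn.GELU(),
                             nn.Linear(token_dim, token_dim))
            for p in parts
        })
        # Learned per-part scales, reflecting the varying importance of parts.
        self.part_scale = nn.ParameterDict({
            p: nn.Parameter(torch.ones(1)) for p in parts
        })

    def forward(self, motion_by_part):
        # motion_by_part: dict part -> (batch, frames, motion_dim) pose features
        tokens = []
        for p in self.parts:
            t = self.encoders[p](motion_by_part[p])  # (B, T, token_dim)
            tokens.append(self.part_scale[p] * t)
        # Concatenate along the token axis into one conditioning sequence.
        return torch.cat(tokens, dim=1)              # (B, len(parts)*T, token_dim)

# Usage with dummy data:
injector = PartDisentangledMotionInjector()
dummy = {p: torch.randn(2, 16, 63) for p in ("head", "hands", "body")}
cond_tokens = injector(dummy)  # shape: (2, 48, 768)
```

Keeping a separate encoder and scale per part is one simple way to realize "precise control of part-level movements": each stream can be reweighted or swapped without retraining the others.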
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos (2025)
- EgoM2P: Egocentric Multimodal Multitask Pretraining (2025)
- Improving Keystep Recognition in Ego-Video via Dexterous Focus (2025)
- Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs (2025)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing (2025)
- DreamDance: Animating Character Art via Inpainting Stable Gaussian Worlds (2025)
- HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation (2025)