EgoTwin: Dreaming Body and View in First Person
Abstract
EgoTwin, a diffusion-transformer framework, addresses viewpoint alignment and causal interplay in joint egocentric video and human motion generation through a head-centric motion representation and a cybernetics-inspired interaction mechanism.
While exocentric video synthesis has achieved great progress, egocentric video generation, which requires modeling first-person view content together with the camera motion patterns induced by the wearer's body movements, remains largely underexplored. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from the human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
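To make the head-centric idea concrete, below is a minimal sketch (not the authors' code) of one way such a representation could be formed: joint positions are re-expressed relative to the head joint, so the egocentric camera trajectory can be read directly from the head path. The joint count, head-joint index, and frame layout are illustrative assumptions; the paper's actual motion parameterization may differ.

```python
# Minimal sketch of a head-centric motion representation, assuming a
# (T, J, 3) array of world-space joint positions. The head-joint index
# and skeleton layout are hypothetical, chosen only for illustration.
import numpy as np

HEAD_JOINT = 15  # hypothetical index of the head joint in the skeleton


def to_head_centric(joints_world: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """joints_world: (T, J, 3) joint positions in world coordinates.

    Returns:
      head_traj:  (T, 3) head positions, i.e. the egocentric camera path.
      joints_rel: (T, J, 3) joint positions anchored to the head joint.
    """
    head_traj = joints_world[:, HEAD_JOINT, :]          # camera trajectory
    joints_rel = joints_world - head_traj[:, None, :]   # anchor motion to head
    return head_traj, joints_rel


# Usage on random data: 120 frames, 22-joint skeleton.
if __name__ == "__main__":
    motion = np.random.randn(120, 22, 3).astype(np.float32)
    cam_path, motion_rel = to_head_centric(motion)
    print(cam_path.shape, motion_rel.shape)  # (120, 3) (120, 22, 3)
```

With this anchoring, checking viewpoint alignment reduces to comparing the generated video's camera trajectory against `head_traj`, which is one plausible reading of the consistency metrics described in the abstract.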
Community
#EgoTwin is an egocentric world model that jointly predicts & generates egocentric video and human motion in a view-consistent and causally coherent manner.
- Project: https://egotwin.pages.dev
- Paper: https://arxiv.org/abs/2508.13013
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation (2025)
- EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba (2025)
- SnapMoGen: Human Motion Generation from Expressive Texts (2025)
- MotionGPT3: Human Motion as a Second Modality (2025)
- RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space (2025)
- X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents (2025)
- X-MoGen: Unified Motion Generation across Humans and Animals (2025)