EgoTwin: Dreaming Body and View in First Person
Abstract
EgoTwin, a diffusion-transformer framework, addresses viewpoint alignment and causal interplay in joint egocentric video and human motion generation through a head-centric motion representation and a cybernetics-inspired interaction mechanism.
While exocentric video synthesis has achieved great progress, egocentric video generation, which requires modeling first-person view content together with the camera motion patterns induced by the wearer's body movements, remains largely underexplored. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from the human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
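To make the head-centric idea concrete, below is a minimal sketch (not the authors' code) of one way such a representation could be formed: joint positions are re-expressed relative to the head joint, so the egocentric camera trajectory can be read directly from the head path. The joint count, head-joint index, and frame layout are illustrative assumptions; the paper's actual motion parameterization may differ.

```python
# Minimal sketch of a head-centric motion representation, assuming a
# (T, J, 3) array of world-space joint positions. The head-joint index
# and skeleton layout are hypothetical, chosen only for illustration.
import numpy as np

HEAD_JOINT = 15  # hypothetical index of the head joint in the skeleton


def to_head_centric(joints_world: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """joints_world: (T, J, 3) joint positions in world coordinates.

    Returns:
      head_traj:  (T, 3) head positions, i.e. the egocentric camera path.
      joints_rel: (T, J, 3) joint positions anchored to the head joint.
    """
    head_traj = joints_world[:, HEAD_JOINT, :]          # camera trajectory
    joints_rel = joints_world - head_traj[:, None, :]   # anchor motion to head
    return head_traj, joints_rel


# Usage on random data: 120 frames, 22-joint skeleton.
if __name__ == "__main__":
    motion = np.random.randn(120, 22, 3).astype(np.float32)
    cam_path, motion_rel = to_head_centric(motion)
    print(cam_path.shape, motion_rel.shape)  # (120, 3) (120, 22, 3)
```

With this anchoring, checking viewpoint alignment reduces to comparing the generated video's camera trajectory against `head_traj`, which is one plausible reading of the consistency metrics described in the abstract.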
Community
#EgoTwin is an egocentric world model that jointly predicts & generates egocentric video and human motion in a view-consistent and causally coherent manner.
- Project: https://egotwin.pages.dev
- Paper: https://arxiv.org/abs/2508.13013
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation (2025)
- EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba (2025)
- SnapMoGen: Human Motion Generation from Expressive Texts (2025)
- MotionGPT3: Human Motion as a Second Modality (2025)
- RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space (2025)
- X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents (2025)
- X-MoGen: Unified Motion Generation across Humans and Animals (2025)