Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Abstract
A lightweight framework for identity preservation in video generation using conditional image branches and restricted self-attentions outperforms full-parameter methods with minimal additional parameters.
Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just ~1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated with other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
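The "lightweight and plug-and-play" claim boils down to freezing the pretrained video backbone and training only the small modules added for the identity branch. A minimal PyTorch sketch of that setup is below; the function and module names are illustrative assumptions, not the released code.

```python
# Illustrative sketch only: freeze a pretrained video backbone and train just
# the small added identity modules (~1% of parameters). Names are hypothetical.
import torch.nn as nn

def attach_identity_branch(backbone: nn.Module, identity_modules: nn.Module) -> nn.Module:
    for p in backbone.parameters():          # keep the base video model untouched
        p.requires_grad = False
    for p in identity_modules.parameters():  # only the newly added branch is trained
        p.requires_grad = True

    trainable = sum(p.numel() for p in identity_modules.parameters())
    total = trainable + sum(p.numel() for p in backbone.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.2%})")
    return nn.ModuleDict({"backbone": backbone, "identity": identity_modules})
```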
Community
GitHub: https://github.com/WeChatCV/Stand-In
Webpage: https://stand-in-video.github.io/
Stand-In framework: Stand-In is a lightweight, plug-and-play framework for identity-preserving video generation. By adding and training only ~1% extra parameters, it achieves state-of-the-art performance in identity preservation, video quality, and prompt adherence.
Identity injection without explicit face extractors: A conditional image branch is introduced into the video generation model. The image and video branches exchange information via restricted self-attention with conditional position mapping, enabling strong identity learning from small datasets (see the attention sketch after these points).
High compatibility and generalization: Although trained solely on real-person data, Stand-In generalizes to cartoons, objects, and other domains, and can be directly applied to tasks such as pose-guided video generation, video stylization, and face swapping.
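Below is a minimal PyTorch sketch of how a restricted self-attention with a conditional image branch might be wired: video queries attend to both video and reference-image tokens, while image queries attend only to image tokens. The class, shapes, and masking details are assumptions for illustration, not the paper's implementation; in particular, the conditional position mapping is only indicated by a comment.

```python
# Sketch of restricted self-attention with a conditional image branch.
# All names, shapes, and the exact restriction scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RestrictedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv_video = nn.Linear(dim, dim * 3)
        # Separate projections for the image branch: the kind of small added
        # module that would be trained while the backbone stays frozen.
        self.qkv_image = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

    def forward(self, video_tokens, image_tokens):
        # video_tokens: (B, Nv, D) latent tokens of the video frames.
        # image_tokens: (B, Ni, D) tokens of the reference identity image,
        # assumed to carry "conditional position" embeddings that map them
        # onto the spatial grid of the first frame.
        B, Nv, D = video_tokens.shape

        qv, kv, vv = self.qkv_video(video_tokens).chunk(3, dim=-1)
        qi, ki, vi = self.qkv_image(image_tokens).chunk(3, dim=-1)

        def split(x):  # (B, N, D) -> (B, H, N, Dh)
            return x.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        def merge(x):  # (B, H, N, Dh) -> (B, N, D)
            return x.transpose(1, 2).reshape(B, -1, D)

        # Video queries attend to both video and image keys/values, so
        # identity information flows from the image branch into the video.
        k = torch.cat([kv, ki], dim=1)
        v = torch.cat([vv, vi], dim=1)
        video_out = F.scaled_dot_product_attention(split(qv), split(k), split(v))

        # Restriction: image queries attend only to image tokens, keeping the
        # reference branch independent of the evolving video latent.
        image_out = F.scaled_dot_product_attention(split(qi), split(ki), split(vi))

        return self.out(merge(video_out)), self.out(merge(image_out))
```

In this sketch the image branch reuses the backbone's attention layout but has its own QKV projection, which is the sort of small added parameter set that keeps the trainable fraction near 1% of the full model.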
arXiv Explained breakdown of this paper: https://arxivexplained.com/papers/stand-in-a-lightweight-and-plug-and-play-identity-control-for-video-generation
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation (2025)
- Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration (2025)
- Proteus-ID: ID-Consistent and Motion-Coherent Video Customization (2025)
- Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA (2025)
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations (2025)
- OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation (2025)
- Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend