Papers
arxiv:2508.07901

Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

Published on Aug 11
· Submitted by RichardQRQ on Aug 14
#3 Paper of the day
Abstract

A lightweight, plug-and-play framework for identity preservation in video generation, built on a conditional image branch and restricted self-attention, outperforms full-parameter training methods while adding only ~1% extra parameters.

AI-generated summary

Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just ~1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
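
The mechanism the abstract names can be pictured with a short sketch. The PyTorch block below is a hypothetical reading, not the paper's released code: it assumes "restricted self-attention" means the identity-image tokens act only as extra keys/values for the video tokens (image queries never attend into the video stream), and that "conditional position mapping" means the image tokens receive positional embeddings aligned with the video frames. `RestrictedSelfAttention` and its arguments are illustrative names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RestrictedSelfAttention(nn.Module):
    """Hypothetical sketch: video tokens attend over [video + identity image]
    tokens, while image tokens attend only over themselves."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, video, image, video_pos, image_pos):
        # video: (B, Nv, D), image: (B, Ni, D); *_pos are positional
        # embeddings. "Conditional position mapping" is assumed here to mean
        # image_pos reuses the ids of the first video frame, aligning branches.
        x = torch.cat([video + video_pos, image + image_pos], dim=1)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, -1).transpose(1, 2)
                   for t in (q, k, v))

        Nv = video.shape[1]
        # Restriction mask: True = may attend. Video rows see everything;
        # image rows see only image columns, so identity stays a pure condition.
        allowed = torch.ones(N, N, dtype=torch.bool, device=x.device)
        allowed[Nv:, :Nv] = False  # block image -> video attention
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)[:, :Nv]  # only the updated video tokens


# Toy shapes: 16 video tokens and 4 identity-image tokens of width 64.
attn = RestrictedSelfAttention(dim=64)
video, image = torch.randn(2, 16, 64), torch.randn(2, 4, 64)
out = attn(video, image, torch.randn(2, 16, 64), torch.randn(2, 4, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

In the actual framework the attention weights would come from the pre-trained video model, with only the small image-conditioning components newly trained, which is consistent with the ~1% trainable-parameter figure quoted above.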

Community

Paper submitter

💻 GitHub: https://github.com/WeChatCV/Stand-In
🚀 Webpage: https://stand-in-video.github.io/

  • Stand-In framework: Stand-In is a lightweight, plug-and-play framework for identity-preserving video generation. By adding and training only ~1% extra parameters, it achieves state-of-the-art performance in identity preservation, video quality, and prompt adherence (a minimal freeze-the-base sketch follows this list).

  • Identity injection without explicit face extractors: A conditional image branch is introduced into the video generation model. The image and video branches exchange information via restricted self-attention with conditional position mapping, enabling strong identity learning from small datasets.

  • High compatibility and generalization: Although trained solely on real-person data, Stand-In generalizes to cartoons, objects, and other domains, and can be directly applied to tasks such as pose-guided video generation, video stylization, and face swapping.
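
The page does not reproduce the repository's interface, so the freeze-the-base recipe referenced in the first bullet is illustrated here with placeholder names (`attach_identity_adapter`, `base`, and `adapter` are made up for this sketch, not the actual Stand-In API):

```python
import torch.nn as nn

def attach_identity_adapter(base_model: nn.Module, adapter: nn.Module) -> None:
    """Freeze the pre-trained video model; train only the identity adapter."""
    for p in base_model.parameters():
        p.requires_grad_(False)
    for p in adapter.parameters():
        p.requires_grad_(True)
    frozen = sum(p.numel() for p in base_model.parameters())
    trainable = sum(p.numel() for p in adapter.parameters())
    share = 100.0 * trainable / (frozen + trainable)
    print(f"trainable: {trainable:,} params ({share:.2f}% of total)")

# Placeholder modules standing in for a video backbone and the image branch.
base = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(24)])
adapter = nn.Linear(1024, 256)
attach_identity_adapter(base, adapter)  # prints roughly 1% of total
```

Because the base weights never change under this recipe, the same small adapter can in principle be combined with other pipelines (pose guidance, stylization), which is the compatibility point the third bullet makes.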


Models citing this paper 1

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 2