Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Abstract
A lightweight framework for identity preservation in video generation using conditional image branches and restricted self-attentions outperforms full-parameter methods with minimal additional parameters.
Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just ~1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated with other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
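The "lightweight and plug-and-play" claim boils down to freezing the pretrained video backbone and training only the small modules added for the identity branch. A minimal PyTorch sketch of that setup is below; the function and module names are illustrative assumptions, not the released code.

```python
# Illustrative sketch only: freeze a pretrained video backbone and train just
# the small added identity modules (~1% of parameters). Names are hypothetical.
import torch.nn as nn

def attach_identity_branch(backbone: nn.Module, identity_modules: nn.Module) -> nn.Module:
    for p in backbone.parameters():          # keep the base video model untouched
        p.requires_grad = False
    for p in identity_modules.parameters():  # only the newly added branch is trained
        p.requires_grad = True

    trainable = sum(p.numel() for p in identity_modules.parameters())
    total = trainable + sum(p.numel() for p in backbone.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.2%})")
    return nn.ModuleDict({"backbone": backbone, "identity": identity_modules})
```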
Community
GitHub: https://github.com/WeChatCV/Stand-In
Webpage: https://stand-in-video.github.io/
Stand-In framework: Stand-In is a lightweight, plug-and-play framework for identity-preserving video generation. By adding and training only ~1% extra parameters, it achieves state-of-the-art performance in identity preservation, video quality, and prompt adherence.
Identity injection without explicit face extractors: A conditional image branch is introduced into the video generation model. The image and video branches exchange information via restricted self-attention with conditional position mapping, enabling strong identity learning from small datasets (see the attention sketch after these points).
High compatibility and generalization: Although trained solely on real-person data, Stand-In generalizes to cartoons, objects, and other domains, and can be directly applied to tasks such as pose-guided video generation, video stylization, and face swapping.
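Below is a minimal PyTorch sketch of how a restricted self-attention with a conditional image branch might be wired: video queries attend to both video and reference-image tokens, while image queries attend only to image tokens. The class, shapes, and masking details are assumptions for illustration, not the paper's implementation; in particular, the conditional position mapping is only indicated by a comment.

```python
# Sketch of restricted self-attention with a conditional image branch.
# All names, shapes, and the exact restriction scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RestrictedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv_video = nn.Linear(dim, dim * 3)
        # Separate projections for the image branch: the kind of small added
        # module that would be trained while the backbone stays frozen.
        self.qkv_image = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

    def forward(self, video_tokens, image_tokens):
        # video_tokens: (B, Nv, D) latent tokens of the video frames.
        # image_tokens: (B, Ni, D) tokens of the reference identity image,
        # assumed to carry "conditional position" embeddings that map them
        # onto the spatial grid of the first frame.
        B, Nv, D = video_tokens.shape

        qv, kv, vv = self.qkv_video(video_tokens).chunk(3, dim=-1)
        qi, ki, vi = self.qkv_image(image_tokens).chunk(3, dim=-1)

        def split(x):  # (B, N, D) -> (B, H, N, Dh)
            return x.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        def merge(x):  # (B, H, N, Dh) -> (B, N, D)
            return x.transpose(1, 2).reshape(B, -1, D)

        # Video queries attend to both video and image keys/values, so
        # identity information flows from the image branch into the video.
        k = torch.cat([kv, ki], dim=1)
        v = torch.cat([vv, vi], dim=1)
        video_out = F.scaled_dot_product_attention(split(qv), split(k), split(v))

        # Restriction: image queries attend only to image tokens, keeping the
        # reference branch independent of the evolving video latent.
        image_out = F.scaled_dot_product_attention(split(qi), split(ki), split(vi))

        return self.out(merge(video_out)), self.out(merge(image_out))
```

In this sketch the image branch reuses the backbone's attention layout but has its own QKV projection, which is the sort of small added parameter set that keeps the trainable fraction near 1% of the full model.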
arXiv Explained breakdown of this paper: https://arxivexplained.com/papers/stand-in-a-lightweight-and-plug-and-play-identity-control-for-video-generation
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation (2025)
- Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration (2025)
- Proteus-ID: ID-Consistent and Motion-Coherent Video Customization (2025)
- Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA (2025)
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations (2025)
- OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation (2025)
- Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend