CharacterShot: Controllable and Consistent 4D Character Animation
Abstract
CharacterShot is a 4D character animation framework that uses a DiT-based video model and a dual-attention module to generate consistent 4D character animations from a single reference image and a 2D pose sequence.
In this paper, we propose CharacterShot, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which accepts any 2D pose sequence as a controllable signal. We then lift the animation model from 2D to 3D by introducing a dual-attention module together with a camera prior, generating multi-view videos with spatial-temporal and spatial-view consistency. Finally, we apply a novel neighbor-constrained 4D Gaussian splatting optimization to these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct Character4D, a large-scale dataset containing 13,115 unique characters with diverse appearances and motions, rendered from multiple viewpoints. Extensive experiments on our newly constructed benchmark, CharacterBench, demonstrate that our approach outperforms current state-of-the-art methods. Code, models, and datasets will be publicly available at https://github.com/Jeoyal/CharacterShot.
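The dual-attention idea can be pictured as interleaved attention along two axes of a multi-view video: a spatial-temporal pass (tokens attend across frames within one view) followed by a spatial-view pass (tokens attend across camera views within one frame). Below is a minimal PyTorch sketch of such a block; the class name `DualAttentionBlock`, the `(batch, views, frames, tokens, dim)` layout, and the pre-norm residual structure are our assumptions for illustration, not the authors' released code.

```python
# Hypothetical sketch of a dual-attention block; shapes and module
# structure are assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn


class DualAttentionBlock(nn.Module):
    """Temporal attention (across frames, per view) followed by view
    attention (across views, per frame) on tokens of shape
    (batch, views, frames, tokens, dim)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.view_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, v, f, n, d = x.shape

        # Spatial-temporal attention: each token attends across frames
        # within its own view, enforcing temporal consistency.
        xt = x.permute(0, 1, 3, 2, 4).reshape(b * v * n, f, d)
        h = self.norm_t(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        x = xt.reshape(b, v, n, f, d).permute(0, 1, 3, 2, 4)

        # Spatial-view attention: each token attends across camera views
        # within its own frame, enforcing cross-view consistency.
        xv = x.permute(0, 2, 3, 1, 4).reshape(b * f * n, v, d)
        h = self.norm_v(xv)
        xv = xv + self.view_attn(h, h, h)[0]
        return xv.reshape(b, f, n, v, d).permute(0, 3, 1, 2, 4)
```

A stack of such blocks, conditioned on camera-prior embeddings, would process the multi-view latent grid inside the DiT backbone; the conditioning mechanism is likewise an assumption here.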
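Similarly, the neighbor-constrained 4D Gaussian splatting step can be read as a regularizer that ties each Gaussian's motion to that of its spatial neighbors, discouraging flicker and drift between frames. The loss form and k-nearest-neighbor scheme below are a sketch under that assumption; the paper's actual constraint may differ.

```python
# Hypothetical neighbor-consistency regularizer for 4D Gaussian splatting;
# the exact loss used by CharacterShot is not reproduced here.
import torch


def neighbor_consistency_loss(pos_t: torch.Tensor,
                              pos_t1: torch.Tensor,
                              k: int = 8) -> torch.Tensor:
    """Encourage neighboring Gaussians to move coherently between two
    timesteps. pos_t, pos_t1: (N, 3) Gaussian centers at frames t, t+1."""
    # k nearest neighbors at frame t (drop index 0, the point itself).
    dists = torch.cdist(pos_t, pos_t)                      # (N, N)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]  # (N, k)

    # Per-Gaussian displacement between consecutive frames.
    disp = pos_t1 - pos_t                                  # (N, 3)

    # Penalize deviation from each neighborhood's displacements.
    neighbor_disp = disp[knn]                              # (N, k, 3)
    return (disp.unsqueeze(1) - neighbor_disp).pow(2).sum(-1).mean()
```

In practice such a term would be weighted against the photometric rendering loss during optimization of the 4D representation.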
Community
good paper👍
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FramePrompt: In-context Controllable Animation with Zero Structural Changes (2025)
- AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models (2025)
- Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis (2025)
- PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image (2025)
- MVG4D: Image Matrix-Based Multi-View and Motion Generation for 4D Content Creation from a Single Image (2025)
- AnimeColor: Reference-based Animation Colorization with Diffusion Transformers (2025)
- 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation (2025)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`