Abstract
Captain Cinema generates high-quality short movies from textual descriptions using top-down keyframe planning and bottom-up video synthesis with interleaved training of Multimodal Diffusion Transformers.
We present Captain Cinema, a framework for short movie generation. Given a detailed textual description of a movie storyline, our approach first generates a sequence of keyframes that outline the entire narrative, ensuring long-range coherence in both the storyline and the visual appearance (e.g., scenes and characters). We refer to this step as top-down keyframe planning. These keyframes then serve as conditioning signals for a video synthesis model that supports long-context learning, which produces the spatio-temporal dynamics between them. We refer to this step as bottom-up video synthesis. To support stable and efficient generation of multi-scene, long narrative cinematic works, we introduce an interleaved training strategy for Multimodal Diffusion Transformers (MM-DiT), specifically adapted to long-context video data. Our model is trained on a specially curated cinematic dataset consisting of interleaved data pairs. Our experiments demonstrate that Captain Cinema performs favorably in the automated creation of visually coherent, narratively consistent short movies with high quality and efficiency. Project page: https://thecinema.ai
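The abstract describes the pipeline only at a high level; the sketch below is a minimal, hypothetical Python illustration of the two stages it names (top-down keyframe planning, then bottom-up synthesis between adjacent keyframes). All class and method names here are assumptions for illustration, not the authors' released API.

```python
# Hypothetical sketch of Captain Cinema's two-stage pipeline as described
# in the abstract. All class and method names are illustrative placeholders,
# not the authors' actual implementation.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Keyframe:
    caption: str        # per-scene description derived from the storyline
    image: bytes = b""  # planned still frame (stubbed out here)


class KeyframePlanner:
    """Top-down stage: outline the entire narrative as a keyframe sequence."""

    def plan(self, storyline: str) -> list[Keyframe]:
        # A real planner would run an image-generating MM-DiT over the
        # storyline; this stub just splits it into per-scene captions.
        return [Keyframe(caption=s.strip())
                for s in storyline.split(".") if s.strip()]


class VideoSynthesizer:
    """Bottom-up stage: fill in spatio-temporal dynamics between keyframes."""

    def interpolate(self, start: Keyframe, end: Keyframe) -> str:
        # A real long-context video model would generate frames conditioned
        # on both keyframes; this stub returns a clip label instead.
        return f"clip[{start.caption!r} -> {end.caption!r}]"


def generate_short_movie(storyline: str) -> list[str]:
    keyframes = KeyframePlanner().plan(storyline)  # top-down planning
    synth = VideoSynthesizer()
    # Bottom-up synthesis: one clip per consecutive keyframe pair.
    return [synth.interpolate(a, b)
            for a, b in zip(keyframes, keyframes[1:])]


if __name__ == "__main__":
    story = "A sailor leaves port. A storm rises at sea. She returns home."
    for clip in generate_short_movie(story):
        print(clip)
```

Chaining clips over consecutive keyframe pairs is what lets the keyframe plan carry scene and character identity across the whole movie, while the video model only has to fill in local dynamics between each pair.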
Community
Good Paper on End-to-End Multi-Shot Generation!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation (2025)
- AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation (2025)
- Sekai: A Video Dataset towards World Exploration (2025)
- SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers (2025)
- ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models (2025)
- Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition (2025)
- EchoShot: Multi-Shot Portrait Video Generation (2025)