StoryMem: Multi-shot Long Video Storytelling with Memory
Given a story script with per-shot text descriptions, StoryMem generates appealing minute-long, multi-shot narrative videos with highly coherent characters and cinematic visual quality. This is achieved through shot-by-shot generation using a memory-conditioned single-shot video diffusion model. See our Project Page for more details and video results.
Getting Started
Installation
git clone --single-branch --branch main [email protected]:Kevin-thu/StoryMem.git
cd StoryMem
conda create -n storymem python=3.11
conda activate storymem
pip install -r requirements.txt
pip install flash_attn
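Note that flash_attn builds against the locally installed PyTorch and CUDA toolkit, so it can help to confirm that a CUDA-enabled torch is already present before installing it. A minimal check, assuming requirements.txt installs PyTorch:

```python
# Quick sanity check before building flash_attn:
# confirm a CUDA-enabled PyTorch is installed and a GPU is visible.
import torch

print("torch version :", torch.__version__)
print("CUDA build    :", torch.version.cuda)
print("GPU available :", torch.cuda.is_available())
```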
Model Download
Download the Wan2.2 base models and the StoryMem M2V LoRA from Hugging Face:
| Models | Download Links | Description |
|---|---|---|
| Wan2.2 T2V-A14B | 🤗 Hugging Face | Text-to-Video MoE model |
| Wan2.2 I2V-A14B | 🤗 Hugging Face | Image-to-Video MoE model |
| StoryMem Wan2.2 M2V-A14B | 🤗 Hugging Face | Memory-to-Video fine-tuned LoRA |
You can easily download models using huggingface-cli:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./models/Wan2.2-T2V-A14B
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./models/Wan2.2-I2V-A14B
huggingface-cli download Kevin-thu/StoryMem --local-dir ./models/StoryMem
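Alternatively, the same repositories can be fetched from Python with `huggingface_hub.snapshot_download`; the sketch below is equivalent to the CLI commands above:

```python
# Download the same repositories as the CLI commands above via the Python API.
from huggingface_hub import snapshot_download

for repo_id, local_dir in [
    ("Wan-AI/Wan2.2-T2V-A14B", "./models/Wan2.2-T2V-A14B"),
    ("Wan-AI/Wan2.2-I2V-A14B", "./models/Wan2.2-I2V-A14B"),
    ("Kevin-thu/StoryMem", "./models/StoryMem"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```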
There are two models provided:
- `StoryMem/Wan2.2-MI2V-A14B`: supports M2V and MI2V (memory + first-frame image conditioning)
- `StoryMem/Wan2.2-MM2V-A14B`: supports M2V and MM2V (memory + first 5 motion frames conditioning)
Run the Code
You can run an example using the following command:
bash run_example.sh
This script first uses the T2V model to generate the first shot as the initial memory. It then uses our M2V model to generate the remaining shots shot by shot, automatically extracting keyframes and updating the memory after each shot.
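For intuition, the sketch below mirrors this loop in Python. The callables (`generate_first_shot`, `generate_next_shot`, `extract_keyframes`) are placeholders for the T2V model, the M2V model, and the keyframe extractor; they are not the actual StoryMem API.

```python
# Illustrative shot-by-shot generation loop with a bounded keyframe memory.
# The three callables are placeholders, not the real StoryMem entry points.
from collections import deque
from typing import Callable, List

def generate_story(
    shot_prompts: List[str],
    generate_first_shot: Callable[[str], object],             # T2V: prompt -> video
    generate_next_shot: Callable[[str, List[list]], object],  # M2V: prompt + memory -> video
    extract_keyframes: Callable[[object], list],              # video -> keyframes
    max_memory_size: int = 10,
) -> List[object]:
    memory = deque(maxlen=max_memory_size)  # keyframes of the most recent shots
    videos = []
    for i, prompt in enumerate(shot_prompts):
        if i == 0:
            video = generate_first_shot(prompt)                # first shot seeds the memory
        else:
            video = generate_next_shot(prompt, list(memory))   # condition on memory
        videos.append(video)
        memory.append(extract_keyframes(video))                # update memory after each shot
    return videos
```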
Key arguments:
- `story_script_path`: Path to the story script JSON file.
- `output_dir`: Directory to save the generated videos. Default is `./results`.
- `t2v_model_path`: Path to the T2V model. Default is `./models/Wan2.2-T2V-A14B`.
- `i2v_model_path`: Path to the I2V model. Default is `./models/Wan2.2-I2V-A14B`.
- `lora_weight_path`: Path to the M2V LoRA weights. Default is `./models/StoryMem`.
- `seed`: Random seed. Default is `0`.
- `size`: Output video resolution. Default is `832*480`.
- `max_memory_size`: Maximum number of shots to keep in memory. Default is `10`.
- `t2v_first_shot`: Use T2V to generate the first shot as the initial memory.
- `m2v_first_shot`: Use M2V to generate the first shot (for the MR2V setting, where reference images are provided as initial memory). Reference images should be placed in `output_dir` as `00_00_keyframe0.jpg`, ..., `00_00_keyframeN.jpg` (see the sketch after this list).
- `mi2v`: Enable MI2V (memory + first-frame image conditioning) to connect adjacent shots when `scene_cut` is `False`.
- `mm2v`: Enable MM2V (memory + first 5 motion frames conditioning) to connect adjacent shots when `scene_cut` is `False`.
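For the MR2V setting referenced by `m2v_first_shot`, the small sketch below copies user-provided reference images into `output_dir` under the expected file names; the source image paths here are hypothetical.

```python
# Place reference images as the initial memory for the MR2V setting.
# Files must be named 00_00_keyframe0.jpg, ..., 00_00_keyframeN.jpg inside output_dir.
import shutil
from pathlib import Path

output_dir = Path("./results")  # should match the output_dir you pass to the pipeline
reference_images = ["refs/protagonist.jpg", "refs/main_scene.jpg"]  # hypothetical paths

output_dir.mkdir(parents=True, exist_ok=True)
for i, src in enumerate(reference_images):
    shutil.copy(src, output_dir / f"00_00_keyframe{i}.jpg")
```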
ST-Bench
To support evaluating multi-shot long video storytelling, we establish and release ST-Bench, which can be found in the subfolder ./story.
We prompt GPT-5 to create 30 long story scripts spanning diverse styles, each containing a story overview, 8–12 shot-level text prompts, and scene-cut indicators.
In total, ST-Bench provides 300 detailed video prompts describing characters, scenes, dynamic events, shot types, and possibly camera movements.
To create your own story script, you can follow the system prompt below, which we use to generate structured, shot-level story scripts:
System prompt for story script:
You are an expert director of story videos. Your task is to design a story script about [..., e.g. a funny man].
Each prompt corresponds to a five-second video clip, so avoid overly complex text rendering, extreme motions, or audio-dependent effects. The overall story should remain simple, clear, and easy to follow.
Your output must follow the JSON format shown in the example:
[ ... an example JSON story script here, e.g. ./story/black.json ... ]
Field Instructions:
- story_overview: A concise summary of the whole story.
- scene_num: Sequential index of the scene.
- cut: Whether this prompt starts with a scene cut:
  - "True": a new cut.
  - "False": continue smoothly from the last frame of the previous prompt. Must ensure the two adjacent prompts can be naturally concatenated into a smooth continuous clip.
  - The first prompt in the story must always have "True".
- video_prompts: A list of text-to-video prompts forming the story beats within the scene. Prompts should reflect natural, smooth, and logical story progression.
What Each Video Prompt Should Describe (if relevant)
- Characters: appearance, attire, age, style.
- Actions & interactions: motion, gestures, expressions, eye contact, simple physical actions.
- Scene & background: indoor/outdoor location, props, layout, lighting, environment details.
- Atmosphere & mood: emotional tone, colors, aesthetic feeling.
- Camera & editing: shot type (e.g., close-up / medium / wide), simple camera movement, transitions.
The prompts should be concise but sufficiently detailed (1–4 sentences).
Return only a valid JSON story script.
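To make the format concrete, the snippet below writes an invented story script using the fields described above; the exact field nesting may differ slightly from ./story/black.json, which remains the authoritative reference.

```python
# Write an illustrative story script (content invented; see ./story/black.json
# for the authoritative format and field nesting).
import json

story = [
    {
        "story_overview": "A stray ginger cat befriends a baker and becomes the bakery's mascot.",
        "scene_num": 1,
        "cut": "True",
        "video_prompts": [
            "Wide shot of a small corner bakery at dawn; a ginger cat with a white chest waits by the glass door in warm orange light.",
            "Medium shot: the baker, a middle-aged man in a flour-dusted apron, unlocks the door and notices the cat, smiling gently.",
        ],
    },
    {
        "scene_num": 2,
        "cut": "False",
        "video_prompts": [
            "Continuing from the doorway, the baker kneels and offers a small piece of bread; close-up on the ginger cat sniffing it.",
        ],
    },
]

with open("./story/my_story.json", "w") as f:
    json.dump(story, f, indent=2)
```

The resulting file can then be passed to the pipeline via story_script_path.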
To improve consistency, provide explicit character descriptions in each shot prompt, which helps the model match the intended memory.
❤️ Acknowledgement
Our implementation is based on Wan2.2. Thanks for the great open-source work!
Citation
If any part of our paper or code is helpful to your research, please consider citing our work and giving us a star. Thanks for your support!
@article{zhang2025storymem,
title={{StoryMem}: Multi-shot Long Video Storytelling with Memory},
author={Zhang, Kaiwen and Jiang, Liming and Wang, Angtian and Fang, Jacob Zhiyuan and Zhi, Tiancheng and Yan, Qing and Kang, Hao and Lu, Xin and Pan, Xingang},
journal={arXiv preprint},
volume={arXiv:2512.19539},
year={2025}
}