StreamDiT: Real-Time Streaming Text-to-Video Generation
Abstract
StreamDiT is a streaming video generation model built on transformer-based diffusion that enables real-time video generation with high content consistency and visual quality.
Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching with an added moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embeddings and window attention. To demonstrate the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU and can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g., streaming generation, interactive generation, and video-to-video. We provide video results and more examples on our project website: https://cumulo-autumn.github.io/StreamDiT/
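The abstract describes a moving buffer in which frame chunks sit at staggered noise levels, so that each model evaluation advances every buffered chunk and the oldest, fully denoised chunk is streamed out while fresh noise is enqueued. The following is a minimal, hypothetical sketch of that loop; `denoise_step` is a stand-in for the actual StreamDiT model, and the buffer sizes and noise-level schedule are illustrative assumptions, not details from the paper.

```python
import numpy as np

def denoise_step(buffer, noise_levels):
    """Placeholder for one model evaluation over all buffered chunks.

    A real model would condition each chunk on its own time embedding
    (noise_levels); here we simply shrink values toward zero as a stand-in.
    """
    return [chunk * (1.0 - 1.0 / len(noise_levels)) for chunk in buffer]

def stream_generate(num_chunks_out, buffer_size=4, frames_per_chunk=2, shape=(8, 8)):
    """Emit `num_chunks_out` chunks from a moving denoising buffer."""
    rng = np.random.default_rng(0)
    # Buffer holds `buffer_size` chunks at staggered noise levels:
    # the oldest chunk is nearly clean, the newest is pure noise.
    buffer = [rng.standard_normal((frames_per_chunk, *shape)) for _ in range(buffer_size)]
    levels = [(i + 1) / buffer_size for i in range(buffer_size)]  # newest = 1.0
    outputs = []
    for _ in range(num_chunks_out):
        buffer = denoise_step(buffer, levels)
        outputs.append(buffer.pop(0))  # oldest chunk is done: stream it out
        buffer.append(rng.standard_normal((frames_per_chunk, *shape)))  # enqueue noise
    return outputs
```

Under this scheme each emitted chunk costs one function evaluation, which matches the abstract's claim that after distillation the total NFEs equal the number of chunks in the buffer.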
Community
We present StreamDiT, a real-time streaming text-to-video generation model capable of producing diverse and open-domain video scenes.
Our model enables real-time applications, e.g., streaming generation and interactive generation.
This advancement opens the door to a wide range of downstream applications, including real-time storytelling, virtual avatar control, and live content creation.
Our approach demonstrates that efficient architecture design and training strategies can close the gap between offline generation quality and online usability, making continuous and responsive video generation practically viable.
- 🌐 Project website: https://cumulo-autumn.github.io/StreamDiT/
- 📄 Paper: https://arxiv.org/abs/2507.03745
Key Features
- Real-time Generation: 16 FPS on a single GPU (H100)
- High Resolution: 512p video generation
- Streaming Capability: Continuous video generation without length limits
- Interactive Applications: Support for real-time video editing and style transfer
- Efficient Architecture: 4B parameter model with optimized inference