Abstract
Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from the spatio-temporal UNet of Stable Video Diffusion (SVD), we reduce memory and computational cost by lowering the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemas that reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce denoising to a single step. Our model, coined MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi 14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/
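As a rough illustration of the single-step generation that adversarial finetuning enables, here is a minimal PyTorch sketch. The `unet` interface, its conditioning argument, and the latent shapes are assumptions for illustration (assuming SVD's 8x spatial VAE downsampling), not the authors' released code.

```python
import torch

@torch.no_grad()
def one_step_generate(unet, image_latent, sigma=1.0):
    """Hypothetical single-step sampler: a distilled UNet maps noise
    (conditioned on an input-image latent) directly to clean video latents."""
    # 14 frames x 4 latent channels x 64 x 32 spatial grid, matching the
    # 14x512x256 px clip from the abstract under an assumed 8x VAE factor.
    noise = torch.randn(1, 14, 4, 64, 32, device=image_latent.device)
    # One UNet evaluation replaces the usual multi-step denoising loop.
    return unet(noise * sigma, timestep=torch.tensor([sigma]), cond=image_latent)
```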
Community
Optimizing Stable Video Diffusion to run at 7 fps on a phone
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MoViE: Mobile Diffusion for Video Editing (2024)
- Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models (2024)
- Accelerating Video Diffusion Models via Distribution Matching (2024)
- Fast and Memory-Efficient Video Diffusion Using Streamlined Inference (2024)
- DIVD: Deblurring with Improved Video Diffusion Model (2024)
- REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents (2024)
- PoM: Efficient Image and Video Generation with the Polynomial Mixer (2024)