TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models
Abstract
TalkingMachines transforms a pretrained image-to-video model into an audio-driven avatar generator, supports infinite video streaming without error accumulation, and applies engineering optimizations to achieve real-time performance.
In this paper, we present TalkingMachines -- an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators. TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model. Our primary contributions are: (1) we adapt a pretrained state-of-the-art image-to-video DiT into an 18-billion-parameter audio-driven avatar generation model; (2) we enable infinite video streaming without error accumulation through asymmetric knowledge distillation from a bidirectional teacher model into a sparse causal, autoregressive student model; (3) we design a high-throughput, low-latency inference pipeline with several key engineering optimizations: (a) disaggregation of the DiT and VAE decoder across separate devices, (b) efficient overlap of inter-device communication and computation using CUDA streams, and (c) elimination of redundant recomputations to maximize frame-generation throughput. Demo videos are available at https://aaxwaz.github.io/TalkingMachines/
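As a rough illustration of contribution (3), the sketch below shows one way the DiT and VAE decoder could be disaggregated across two GPUs, with a side CUDA stream overlapping the latent transfer and VAE decode with the next denoising chunk. This is a minimal sketch under assumed interfaces: load_causal_dit, load_vae_decoder, generate_chunk, audio_chunks, and emit_frames are hypothetical placeholders, not the paper's actual API.

```python
import torch

# Hypothetical loaders standing in for the paper's models (illustration only).
dit = load_causal_dit().to("cuda:0")           # autoregressive DiT on device 0
vae_decoder = load_vae_decoder().to("cuda:1")  # VAE decoder disaggregated onto device 1

decode_stream = torch.cuda.Stream(device="cuda:1")  # side stream for transfer + decode
prev_context = None

for audio_chunk in audio_chunks:  # streaming audio features (placeholder iterable)
    # Denoise the next latent chunk on device 0, reusing cached context from
    # earlier chunks rather than recomputing them (causal, autoregressive student).
    latents = dit.generate_chunk(audio_chunk, prev_context)  # assumed API

    # Mark the point at which the latents are ready on device 0's stream.
    ready = torch.cuda.Event()
    ready.record(torch.cuda.current_stream("cuda:0"))

    # On device 1's side stream: wait for the latents, copy them across, and decode
    # to RGB frames. Device 0 is free to start the next DiT chunk in the meantime.
    with torch.cuda.stream(decode_stream):
        decode_stream.wait_event(ready)
        frames = vae_decoder(latents.to("cuda:1", non_blocking=True))
        emit_frames(frames)  # hypothetical sink, e.g. a streaming video encoder

    prev_context = latents
```

The key design point this sketch tries to capture is that decoding never blocks denoising: the DiT keeps the main stream on one GPU busy while the copy and VAE decode are enqueued on a separate stream of the other GPU, synchronized only through a CUDA event.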
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Playing with Transformer at 30+ FPS via Next-Frame Diffusion (2025)
- MAGI-1: Autoregressive Video Generation at Scale (2025)
- LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model (2025)
- LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer (2025)
- LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval (2025)
- Minute-Long Videos with Dual Parallelisms (2025)
- Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation (2025)