Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation
Abstract
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose DICE-Talk, a novel framework that follows the idea of disentangling identity from emotion and then cooperating emotions that share similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on the MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
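To make the components described above more concrete, the following is a minimal PyTorch sketch of how an identity-agnostic Gaussian emotion embedder and a correlation-aware Emotion Bank could be wired together. All module names, dimensions, and design details here are illustrative assumptions for exposition, not the authors' released implementation; the emotion discrimination objective (a classifier on diffusion latents) is omitted for brevity.

```python
# Hypothetical sketch of the abstract's ideas: a Gaussian emotion embedding fused
# by cross-modal attention, a learnable Emotion Bank queried by vector quantization,
# and attention-based aggregation over the bank to model inter-emotion correlations.
# Module names, shapes, and wiring are illustrative assumptions only.
import torch
import torch.nn as nn


class DisentangledEmotionEmbedder(nn.Module):
    """Fuses audio and visual emotion cues and outputs a Gaussian emotion code."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Cross-modal attention: audio features attend to visual features.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, Ta, dim), visual_feats: (B, Tv, dim)
        fused, _ = self.cross_attn(audio_feats, visual_feats, visual_feats)
        pooled = fused.mean(dim=1)                       # (B, dim)
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        # Reparameterization: sample an identity-agnostic emotion vector.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar


class EmotionBank(nn.Module):
    """Learnable bank of emotion prototypes; attention over the bank lets
    correlated emotions contribute to the conditioning signal."""

    def __init__(self, num_emotions=8, dim=256, heads=4):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(num_emotions, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z):
        # z: (B, dim) emotion code from the embedder.
        # Vector quantization: snap each code to its nearest bank prototype.
        dists = torch.cdist(z, self.bank)                # (B, num_emotions)
        nearest = dists.argmin(dim=1)                    # (B,)
        quantized = self.bank[nearest]                   # (B, dim)
        # Attend from the quantized code to the whole bank.
        query = quantized.unsqueeze(1)                   # (B, 1, dim)
        bank = self.bank.unsqueeze(0).expand(z.size(0), -1, -1)
        cond, _ = self.attn(query, bank, bank)
        return cond.squeeze(1), nearest                  # conditioning vector, code index


# Usage sketch (shapes are illustrative).
embedder, bank = DisentangledEmotionEmbedder(), EmotionBank()
audio = torch.randn(2, 50, 256)   # e.g. 50 audio frames
visual = torch.randn(2, 25, 256)  # e.g. 25 reference-video frames
z, mu, logvar = embedder(audio, visual)
cond, idx = bank(z)
print(cond.shape, idx.shape)      # torch.Size([2, 256]) torch.Size([2])
```

The conditioning vector produced by the bank would then be injected into the diffusion backbone as the emotion condition, while the Gaussian bottleneck and a latent-space emotion classifier would respectively discourage identity leakage and enforce affective consistency.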
Community
We propose DICE-Talk, a new framework for generating talking-head videos with vivid, identity-preserving emotional expressions.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter (2025)
- EmoHead: Emotional Talking Head via Manipulating Semantic Expression Parameters (2025)
- DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model (2025)
- Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait (2025)
- PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation (2025)
- EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models (2025)
- MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation (2025)