Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

CVPR 2025
*denotes equal contribution
¹POSTECH, ²KRAFTON, ³KAIST

Abstract

Recent advancements in speech-driven 3D talking head generation have made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and corresponding lip movements. In this work, we claim that three criteria—Temporal Synchronization, Lip Readability, and Expressiveness—are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists to meet these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We find that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements with the given speech. In addition, we utilize this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well the generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization.

Perceptually Accurate 3D Talking Head Generation

What defines perceptually accurate lip movement for a speech signal? In this work, we define three criteria for assessing the perceptual alignment between speech and the lip movements of 3D talking heads: Temporal Synchronization, Lip Readability, and Expressiveness (a). Our motivating hypothesis is that there exists a desirable representation space that models the correspondence between diverse speech characteristics and 3D facial movements and complies with all three criteria, as illustrated in (b): representations of the same phoneme form clusters, are sensitive to temporal misalignment, and follow a consistent trajectory as speech intensity increases. Consequently, we build a rich speech-mesh synchronized representation space that exhibits these desirable properties.

Human Studies on Alignment Criteria

We present intriguing findings on human audio-visual perception through human studies.



Preference scores (1-3) for 3D talking heads with varying lip movement intensities paired with different speech intensities. Humans prefer lip movements whose intensity matches the intensity of the speech.


Human preference between (A) samples with precise timing but low expressiveness, and (B) samples with high expressiveness but 100 ms of asynchrony, twice the commonly accepted 50 ms threshold. Humans are more sensitive to expressiveness than to temporal synchronization when judging lip movements, highlighting the importance of expressiveness.

Speech-Mesh Representation as Perceptual Loss

We train our speech-mesh representation space in two stages. As a key application, we propose plugging this representation into existing 3D talking head models as a perceptual loss to enhance the quality of their lip movements.
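As a rough illustration, the sketch below shows one way such a plug-in perceptual loss could be wired into an existing model: frozen speech and mesh encoders embed the driving audio and the predicted vertex sequence into the shared representation space, and the loss penalizes their per-frame cosine distance. The encoder names (`speech_encoder`, `mesh_encoder`), the loss weight `lambda_perc`, and the cosine formulation are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def perceptual_loss(speech_feat: torch.Tensor, mesh_feat: torch.Tensor) -> torch.Tensor:
    """Cosine-distance perceptual loss in a shared speech-mesh space.

    speech_feat: (B, T, D) embeddings of the driving speech (frozen encoder).
    mesh_feat:   (B, T, D) embeddings of the predicted 3D face meshes.
    """
    speech_feat = F.normalize(speech_feat, dim=-1)
    mesh_feat = F.normalize(mesh_feat, dim=-1)
    # 1 - cosine similarity per frame, averaged over batch and time.
    return (1.0 - (speech_feat * mesh_feat).sum(dim=-1)).mean()

# Hypothetical integration into an existing model's training step:
# feat_a = speech_encoder(audio)             # frozen speech branch
# feat_m = mesh_encoder(predicted_vertices)  # frozen mesh branch; gradients reach the generator
# loss = recon_loss + lambda_perc * perceptual_loss(feat_a, feat_m)
```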

Emergent desirable properties in our representation

We find that our learned representation exhibits the desirable characteristics we pursue for each of the three criteria.



Temporal Synchronization. Cosine similarity drops when temporal misalignment is introduced between the input speech and the 3D face mesh sequence.
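As a minimal sketch of how this sensitivity can be probed, assuming per-frame speech and mesh embeddings have already been extracted from the representation space (the offset range and the `f_speech` / `f_mesh` names are placeholders):

```python
import torch
import torch.nn.functional as F

def sync_similarity(speech_feat: torch.Tensor, mesh_feat: torch.Tensor, offset: int) -> float:
    """Mean cosine similarity after shifting the mesh sequence by `offset` frames.

    speech_feat, mesh_feat: (T, D) per-frame embeddings of a paired clip.
    """
    if offset > 0:
        speech_feat, mesh_feat = speech_feat[:-offset], mesh_feat[offset:]
    elif offset < 0:
        speech_feat, mesh_feat = speech_feat[-offset:], mesh_feat[:offset]
    return F.cosine_similarity(speech_feat, mesh_feat, dim=-1).mean().item()

# A synchronization-sensitive representation should peak at offset 0 and
# decay as the shift grows in either direction:
# curve = [sync_similarity(f_speech, f_mesh, k) for k in range(-10, 11)]
```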


Lip Readability. Our representation tends to form distinct clusters according to phonemes, with vowels and consonants grouped closely. We also observe a directional progression in the feature space, shifting from phonemes with mouth opening (e.g., /aj/) to those with mouth closing (e.g., /f/).
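A small sketch of how this kind of inspection could be reproduced, assuming features have already been pooled per phoneme segment; the 2D projection via t-SNE is an illustrative choice, not necessarily the method behind the figure:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_phoneme_clusters(features: np.ndarray, phonemes: list[str]) -> None:
    """Project per-phoneme features (N, D) to 2D and color-code them by phoneme label."""
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    for label in sorted(set(phonemes)):
        idx = [i for i, p in enumerate(phonemes) if p == label]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=8, label=label)
    plt.legend(fontsize=6, ncol=4)
    plt.title("Phoneme clusters in the representation space")
    plt.show()
```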


Expressiveness. We plot speech features at varying speech intensities, showing a directional trend as intensity increases from lowest to highest.
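As a rough way to quantify such a trend, the sketch below regresses clip-level features on a scalar intensity label and checks whether projections onto the resulting direction increase with intensity; this regression-based analysis is an illustrative assumption, not the paper's procedure:

```python
import numpy as np

def intensity_trend(features: np.ndarray, intensity: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Mean projection of features (N, D) onto an intensity-correlated direction,
    reported per intensity bin; a monotonically increasing result supports a
    directional trend in the feature space."""
    z = (intensity - intensity.mean()) / (intensity.std() + 1e-8)
    direction = features.T @ z / len(z)            # covariance of each feature dim with intensity
    direction /= np.linalg.norm(direction) + 1e-8
    proj = features @ direction
    edges = np.quantile(intensity, np.linspace(0.0, 1.0, n_bins + 1))
    return np.array([proj[(intensity >= lo) & (intensity <= hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])
```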

Effectiveness of our perceptual loss for lip readability

Effectiveness of our perceptual loss for expressiveness

Acknowledgments

We thank the members of AMILab for their helpful discussions and proofreading. This research was supported by a grant from KRAFTON AI, and was also partially supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.RS-2021-II212068, Artificial Intelligence Innovation Hub; No.RS-2023-00225630, Development of Artificial Intelligence for Text-based 3D Movie Generation; No. RS-2024-00457882, National AI Research Lab Project) and Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2024 (Project Name: Development of barrier-free experiential XR contents technology to improve accessibility to online activities for the physically disabled, Project Number: RS-2024-00396700, Contribution Rate: 25%).