Abstract
We propose GaussianTalker, a novel framework for real-time generation of
pose-controllable talking heads. It leverages the fast rendering capabilities
of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly
controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS
representation of the head and deforms it in sync with the audio. A key insight
is to encode the 3D Gaussian attributes into a shared implicit feature
representation, where they are merged with audio features to manipulate each
Gaussian attribute. This design exploits spatially aware features and
enforces interactions between neighboring points. The feature embeddings are
then fed to a spatial-audio attention module, which predicts frame-wise offsets
for the attributes of each Gaussian. This attention-based design is more stable than previous
concatenation or multiplication approaches for manipulating the numerous
Gaussians and their intricate parameters. Experimental results showcase
GaussianTalker's superiority in facial fidelity, lip synchronization accuracy,
and rendering speed compared to previous methods. Specifically, GaussianTalker
achieves a remarkable rendering speed of 120 FPS, surpassing previous
benchmarks.
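As a rough illustration of the core idea (not the released code), the offset prediction can be sketched in PyTorch as follows; the module name SpatialAudioAttention, the feature dimensions, and the exact attribute split are assumptions made for this example:

# Hypothetical sketch: per-Gaussian offsets predicted by cross-attending
# spatial Gaussian features (queries) to audio features (keys/values).
import torch
import torch.nn as nn

class SpatialAudioAttention(nn.Module):
    def __init__(self, feat_dim=64, audio_dim=32, num_heads=4):
        super().__init__()
        # Per-Gaussian spatial features act as queries; audio features as keys/values.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)
        # Head predicting frame-wise offsets for each Gaussian attribute:
        # position (3), rotation quaternion (4), scale (3), opacity (1), color (3).
        self.offset_head = nn.Linear(feat_dim, 3 + 4 + 3 + 1 + 3)

    def forward(self, gaussian_feats, audio_feats):
        # gaussian_feats: (B, N, feat_dim)  spatial features of N canonical Gaussians
        # audio_feats:    (B, T, audio_dim) audio features for the current frame window
        fused, _ = self.attn(gaussian_feats, audio_feats, audio_feats)
        offsets = self.offset_head(fused)                       # (B, N, 14)
        d_xyz, d_rot, d_scale, d_opac, d_color = torch.split(
            offsets, [3, 4, 3, 1, 3], dim=-1)
        return d_xyz, d_rot, d_scale, d_opac, d_color

The predicted offsets are added to the canonical Gaussian attributes for each frame before rasterization, so the canonical head itself stays fixed.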
Overall Framework
GaussianTalker utilizes a multi-resolution triplane to leverage features at different scales that depict a canonical 3D head. These features are fed into a spatial-audio attention module along with the audio features to predict per-frame deformations, enabling fast and reliable talking head synthesis.
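A minimal sketch of how such a multi-resolution triplane query could look is given below; the resolutions, channel counts, and concatenation-based fusion are assumptions for illustration, not the released implementation:

# Illustrative sketch: sampling a multi-resolution triplane at Gaussian centers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResTriplane(nn.Module):
    def __init__(self, resolutions=(64, 128, 256), channels=16):
        super().__init__()
        # Three axis-aligned planes (xy, xz, yz) per resolution level.
        self.planes = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(3, channels, r, r)) for r in resolutions
        ])

    def forward(self, xyz):
        # xyz: (N, 3) canonical Gaussian centers, assumed normalized to [-1, 1]
        coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]  # xy, xz, yz
        feats = []
        for plane in self.planes:                       # each resolution level
            level = []
            for i, uv in enumerate(coords):             # each of the three planes
                grid = uv.view(1, -1, 1, 2)             # (1, N, 1, 2) for grid_sample
                sampled = F.grid_sample(plane[i:i + 1], grid, align_corners=True)
                level.append(sampled.view(-1, uv.shape[0]).t())  # (N, channels)
            feats.append(torch.cat(level, dim=-1))      # fuse the three planes
        return torch.cat(feats, dim=-1)                 # (N, 3 * channels * levels)

The per-point features returned here play the role of the spatial queries fed to the attention module described above.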
Comparison with Baseline Models
Fidelity, lip synchronization, and inference time
comparison between existing 3D talking face synthesis
models and ours. Our method, GaussianTalker,
achieves results on par with or better than existing methods at a much higher FPS.
Note that we also include GaussianTalker∗, a more efficient
and faster variant. The size of each bubble represents the inference time per frame of each method.
Qualitative Experiments
Self-Driven Results
Cross-Driven Results
Importance of our Spatial-Audio Attention Module
Speech-related Motion Disentanglement
Our spatial-audio attention module effectively disentangles speech-related motion by conditioning the unrelated facial motion and scene variations on other input conditions.
We thereby disentangle the speech-related motion from the video, allowing the model to better capture the correspondence between the input speech audio and the corresponding facial motion.
Stabilization of Scene Variations
By conditioning the spatial-audio attention module on the facial viewpoint, we effectively control scene variations that do not correlate with the speech audio, such as hair motion and skin illumination at certain viewing angles.
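One plausible way to realize such conditioning, sketched below under assumed names and dimensions, is to embed the non-speech factors (e.g., the camera viewing direction) as extra tokens appended to the attention context, so motion unrelated to speech can be explained without corrupting the audio branch:

# Hypothetical sketch: appending a viewpoint condition token to the attention context.
import torch
import torch.nn as nn

class ConditionedAttention(nn.Module):
    def __init__(self, feat_dim=64, cond_dim=64, num_heads=4):
        super().__init__()
        self.view_embed = nn.Linear(3, cond_dim)   # camera direction -> condition token
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, gaussian_feats, audio_tokens, view_dir):
        # gaussian_feats: (B, N, feat_dim), audio_tokens: (B, T, cond_dim),
        # view_dir: (B, 3) per-frame camera viewing direction
        view_token = self.view_embed(view_dir).unsqueeze(1)       # (B, 1, cond_dim)
        context = torch.cat([audio_tokens, view_token], dim=1)    # add condition token
        fused, _ = self.attn(gaussian_feats, context, context)
        return fused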
Citation
If you find our work useful in your research, please cite our work as:

@misc{cho2024gaussiantalker,
  title={GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting},
  author={Kyusun Cho and Joungbin Lee and Heeji Yoon and Yeobin Hong and Jaehoon Ko and Sangjun Ahn and Seungryong Kim},
  year={2024},
  eprint={xxxx.xxxxx},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Acknowledgements
The website template was borrowed from Michaël Gharbi.