We present a novel framework for talking-head video editing that allows users to freely edit head pose, emotion, and eye blink while maintaining audio-visual synchronization. Unlike previous approaches, which mainly focus on generating a talking-head video, our model can edit the talking heads in an input video and restore them to the full frames, which supports a broader range of applications. The proposed framework consists of two parts: (a) a reconstruction-based generator that produces talking heads that fit the original frame while following freely controllable attributes, including head pose, emotion, and eye blink; and (b) a multi-attribute discriminator that enforces attribute-visual synchronization. We additionally introduce attention modules and a perceptual loss to improve overall generation quality. We compare our method against existing approaches using both quantitative metrics and qualitative comparisons.
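To make the two-part design concrete, the sketch below outlines one possible generator/discriminator interface in PyTorch. All module names, layer configurations, attribute dimensions (3-D pose, 7-way emotion, scalar blink), and tensor shapes are our own illustrative assumptions; the paper does not specify these details.

```python
import torch
import torch.nn as nn

# Minimal structural sketch of the two components described above.
# Layer choices and dimensions are assumptions, not the paper's
# actual architecture.

class ReconstructionGenerator(nn.Module):
    """Generates a talking-head frame conditioned on audio and attributes."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.frame_enc = nn.Sequential(nn.Conv2d(3, feat_dim, 4, 2, 1), nn.ReLU())
        self.audio_enc = nn.Linear(80, feat_dim)        # 80-bin Mel spectrogram (assumed)
        self.attr_enc = nn.Linear(3 + 7 + 1, feat_dim)  # pose + emotion + blink (assumed dims)
        self.decoder = nn.Sequential(nn.ConvTranspose2d(feat_dim, 3, 4, 2, 1), nn.Tanh())

    def forward(self, frame, mel, attrs):
        f = self.frame_enc(frame)            # (B, C, H/2, W/2)
        a = self.audio_enc(mel).mean(dim=1)  # pool over time -> (B, C)
        p = self.attr_enc(attrs)             # (B, C)
        cond = (a + p)[:, :, None, None]     # broadcast conditioning onto the spatial map
        return self.decoder(f + cond)        # (B, 3, H, W)

class MultiAttributeDiscriminator(nn.Module):
    """Scores agreement between a frame and each conditioning attribute."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, feat_dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.audio_enc = nn.Linear(80, feat_dim)
        self.attr_enc = nn.Linear(3 + 7 + 1, feat_dim)
        # One synchronization head per attribute group: each sees the frame
        # embedding concatenated with the corresponding attribute embedding.
        self.heads = nn.ModuleDict(
            {k: nn.Linear(2 * feat_dim, 1) for k in ("audio", "attrs")}
        )

    def forward(self, frame, mel, attrs):
        h = self.frame_enc(frame)
        emb = {"audio": self.audio_enc(mel).mean(dim=1), "attrs": self.attr_enc(attrs)}
        return {k: head(torch.cat([h, emb[k]], dim=1)) for k, head in self.heads.items()}
```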
Overview of the proposed framework: the encoders embed the inputs together and feed them into the generator, while the input audio Mel spectrogram, head pose, emotion, and eye blink are extracted from the target frames during the training stage. A set of synchronization losses is then calculated by a pre-trained multi-attribute discriminator between the generated frames and the input attributes to enforce attribute-visual synchronization.
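A hedged sketch of one generator training step following this caption is given below. It reuses the modules sketched earlier; the L1 reconstruction term, the BCE form of the synchronization losses, the loss weights, and the helpers `extract_attributes` / `perceptual_loss` are illustrative assumptions rather than the paper's published training code.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of one generator training step, following the caption above.

def generator_step(gen, disc, src_frame, mel, target_frame,
                   lambda_sync=1.0, lambda_perc=0.1):
    # During training, the conditioning attributes (head pose, emotion,
    # eye blink) are extracted from the target frames.
    attrs = extract_attributes(target_frame)  # hypothetical extractor

    fake = gen(src_frame, mel, attrs)

    # Reconstruction: the generated head should match the target frame.
    loss = F.l1_loss(fake, target_frame)

    # Synchronization losses: the pre-trained discriminator is kept frozen
    # (disc.requires_grad_(False)); gradients still flow through `fake`
    # back into the generator.
    scores = disc(fake, mel, attrs)
    sync = sum(F.binary_cross_entropy_with_logits(s, torch.ones_like(s))
               for s in scores.values())

    # Perceptual loss (e.g., on pre-trained VGG features) improves overall
    # generation quality; perceptual_loss is a hypothetical helper.
    return loss + lambda_sync * sync + lambda_perc * perceptual_loss(fake, target_frame)
```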
@INPROCEEDINGS{Huang2023FETE,
  author={Huang, Yuantian and Iizuka, Satoshi and Fukui, Kazuhiro},
  booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Free-View Expressive Talking Head Video Editing},
  year={2023},
  month={June},
  pages={1-5},
  doi={10.1109/ICASSP49357.2023.10095745},
  url={https://ieeexplore.ieee.org/abstract/document/10095745},
}