Expressive Talking Head Video Encoding
in StyleGAN2 Latent-Space
(CVEU Workshop @ ICCV'23 - Best Paper)

University of Maryland, College Park

Abstract

While recent advances in video re-enactment research have yielded promising results, existing approaches fall short in capturing the fine, detailed, and expressive facial features (e.g., lip-pressing, mouth puckering, mouth gaping, and wrinkles) that are crucial for generating realistic animated face videos. To this end, we propose an end-to-end expressive face video encoding approach that facilitates data-efficient, high-quality video re-synthesis by optimizing low-dimensional edits of a single Identity-latent. The approach builds on StyleGAN2 image inversion and multi-stage non-linear latent-space editing to generate videos that are nearly comparable to the input videos. While existing StyleGAN latent-based editing techniques focus on generating plausible edits of static images, we automate the latent-space editing to capture the fine expressive facial deformations across a sequence of frames, using an encoding that resides in the Style-latent-space (StyleSpace) of StyleGAN2. The resulting encoding can be superimposed on a single Identity-latent to facilitate re-enactment of face videos at 1024². The proposed framework economically captures face identity, head-pose, and complex expressive facial motions at fine levels, thereby bypassing the training, person-specific modeling, dependence on landmarks/keypoints, and low-resolution synthesis that hamper most re-enactment approaches. The approach is designed for maximum data efficiency: a single W+ latent and 35 parameters per frame enable high-fidelity video rendering. The pipeline can also be used for puppeteering (i.e., motion transfer).
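A minimal sketch of the data-efficiency arithmetic implied above. The W+ shape (18 × 512) is the standard StyleGAN2 configuration at 1024², and the 3/32 split of the 35 per-frame parameters is inferred from the pipeline caption below (which names 32 facial-attribute parameters, leaving 3 presumably for head-pose); neither detail is taken from the authors' code.

import numpy as np

ID_LATENT_SHAPE = (18, 512)     # one W+ identity latent, stored once per video
PARAMS_PER_FRAME = 35           # assumed split: 3 head-pose + 32 facial-attribute

def encoding_size(num_frames: int) -> int:
    """Total number of floats needed to encode a video of num_frames frames."""
    return int(np.prod(ID_LATENT_SHAPE)) + PARAMS_PER_FRAME * num_frames

# Example: a 10-second clip at 30 fps.
print(encoding_size(300))       # 9216 + 35 * 300 = 19,716 floats
# Raw 1024x1024 RGB frames, for comparison: 300 * 1024 * 1024 * 3 ≈ 9.4e8 values.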

Pipeline

The multi-stage pipeline for encoding a video in latent-space: The (1) pre-processing stage aligns the input sequence of frames, which are fed to the (2) GAN inversion stage to obtain the corresponding sequence of W+ latents. From these, the latent with the best inversion quality and a near-frontal head-pose is chosen as the ID-latent in the (3) ID-latent selection stage. The (4) head-pose encoding stage encodes the yaw and pitch of the target frames with reference to the ID-latent, generating a series of head-pose-adjusted ID-latents. Subsequently, the (5) facial-attribute encoding stage encodes the facial deformations using 32 parameters anchored onto the head-pose-adjusted ID-latents. Finally, the encoded parameters (35/frame) and the ID-latent are used to synthesize the re-enacted frames in the (6) rendering stage.

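For concreteness, the data flow of the six stages can be read as the Python skeleton below. Every function is a hypothetical stub that only mirrors the stages described above; the actual pipeline relies on a pretrained StyleGAN2 generator and an inversion method, which the stubs do not implement.

from typing import List, Tuple
import numpy as np

W_PLUS_SHAPE = (18, 512)  # standard StyleGAN2 W+ shape at 1024x1024

def align_face(frame: np.ndarray) -> np.ndarray:               # (1) pre-processing
    return frame                                                # stub: crop and align in practice

def gan_inversion(frame: np.ndarray) -> np.ndarray:             # (2) GAN inversion
    return np.zeros(W_PLUS_SHAPE)                               # stub: encoder/optimization in practice

def select_id_latent(latents: List[np.ndarray]) -> np.ndarray:  # (3) ID-latent selection
    return latents[0]                                           # stub: best near-frontal inversion

def encode_head_pose(w: np.ndarray, id_latent: np.ndarray) -> np.ndarray:   # (4) head-pose encoding
    return np.zeros(3)                                          # stub: yaw/pitch relative to ID-latent

def encode_attributes(w: np.ndarray, posed_id: np.ndarray) -> np.ndarray:   # (5) facial-attribute encoding
    return np.zeros(32)                                         # stub: 32 StyleSpace parameters

def encode_video(frames: List[np.ndarray]) -> Tuple[np.ndarray, np.ndarray]:
    aligned = [align_face(f) for f in frames]
    latents = [gan_inversion(f) for f in aligned]
    id_latent = select_id_latent(latents)
    pose = np.stack([encode_head_pose(w, id_latent) for w in latents])
    attrs = np.stack([encode_attributes(w, id_latent) for w in latents])
    return id_latent, np.hstack([pose, attrs])                  # (N, 35) per-frame parameters

# Stage (6), rendering, would superimpose each frame's 35 parameters on the
# ID-latent and feed the result to the StyleGAN2 generator.
id_latent, params = encode_video([np.zeros((1024, 1024, 3))] * 4)
print(id_latent.shape, params.shape)                            # (18, 512) (4, 35)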

Expressive Facial Features

The proposed encoding scheme is carefully designed so that complex and fine expressive facial details, such as lip-pressing, mouth puckering, mouth gaping, and wrinkles around the eyes, mouth, nasal bridge, and forehead, are well captured.

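One way to read how these deformations are recovered: the 32 per-frame attribute parameters weight a set of StyleSpace edit directions that are superimposed on the head-pose-adjusted ID-latent before rendering. The sketch below assumes a linear superposition and uses random placeholder directions with a 9088-channel StyleSpace (the channel count reported for the 1024² StyleGAN2 in the StyleSpace literature); none of this is the authors' code.

import numpy as np

STYLESPACE_DIM = 9088     # StyleSpace channel count for 1024^2 StyleGAN2 (assumed)
NUM_PARAMS = 32           # facial-attribute parameters per frame

rng = np.random.default_rng(0)
directions = rng.standard_normal((NUM_PARAMS, STYLESPACE_DIM))  # placeholder edit directions

def superimpose(id_style: np.ndarray, attr_params: np.ndarray) -> np.ndarray:
    """Add the parameter-weighted edit directions onto the identity's style code."""
    return id_style + attr_params @ directions

id_style = rng.standard_normal(STYLESPACE_DIM)    # ID-latent mapped into StyleSpace (assumed)
frame_params = 0.1 * rng.standard_normal(NUM_PARAMS)
edited = superimpose(id_style, frame_params)      # would be fed to the generator for this frame
print(edited.shape)                               # (9088,)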

Video Re-Synthesis

Puppeteering

Acknowledgments

We thank the Pexels community (https://www.pexels.com) for freely licensing their videos and permitting the editing of human data.

BibTeX

@article{oorloff2022expressivefacevideoencoding,
      title={Expressive Talking Head Video Encoding in StyleGAN2 Latent-Space},
      author={Trevine Oorloff and Yaser Yacoob},
      year={2022},
      eprint={2203.14512},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}