Robust One-Shot Face Video Re-enactment using
Hybrid Latent Spaces of StyleGAN2
(ICCV 2023)

University of Maryland - College Park

Abstract

While recent research has progressively overcome the low-resolution constraint of one-shot facial video re-enactment with the help of StyleGAN’s high-fidelity portrait generation, these approaches rely on at least one of the following: explicit 2D/3D priors, optical-flow-based warping as motion descriptors, off-the-shelf encoders, etc., which constrain their performance (e.g., inconsistent predictions, inability to capture fine details and facial accessories, poor generalization, artifacts). We propose an end-to-end framework that simultaneously supports face attribute edits, facial motions and deformations, and facial identity control for video generation. It employs a hybrid latent space that encodes a given frame into a pair of latents: the Identity latent, $W_{ID}$, and the Facial deformation latent, $S_F$, which respectively reside in the $W+$ and $SS$ spaces of StyleGAN2, thereby combining the impressive editability-distortion trade-off of $W+$ with the high disentanglement of $SS$. These hybrid latents drive the StyleGAN2 generator to achieve high-fidelity face video re-enactment at $1024^2$. Furthermore, the model supports the generation of realistic re-enactment videos with other latent-based semantic edits (e.g., beard, age, make-up). Qualitative and quantitative analyses against state-of-the-art methods demonstrate the superiority of the proposed approach.

Overview

The proposed end-to-end framework simultaneously supports face attribute edits, facial motions and deformations, and facial identity control for video re-enactment (both same- and cross-identity) at $1024^2$, with zero reliance on explicit structural facial priors.

Pipeline

The high-level re-enactment process (top) and the expanded architectures of the encoding (bottom-left) and re-enactment (bottom-right) processes are depicted. In encoding, given a frame, the Encoder, $E$, outputs a pair of latents: the Identity latent, $W_{ID}$, and the Facial-deformation latent, $S_F$. In re-enactment, $S^D_F$ (of the driving frame) is added to $W^S_{ID}$ (of the source frame) transformed using $A(\cdot)$, yielding the animated $SS$ latent from which the StyleGAN2 Generator, $G$, synthesizes the re-enacted frame.
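To make the composition concrete, below is a minimal PyTorch-style sketch of one re-enactment step. It assumes trained modules E (Encoder), A (StyleGAN2's learned affine style layers, mapping $W+$ into the $SS$ style space), and G (StyleGAN2 synthesis network); all function and variable names here are illustrative, not the authors' released API.

import torch

@torch.no_grad()
def reenact_frame(E, A, G, src_frame, drv_frame):
    """One re-enactment step following the pipeline figure.

    E, A, G are assumed trained modules (Encoder, affine style layers,
    StyleGAN2 synthesis network); names are illustrative assumptions.
    """
    w_id_src, _ = E(src_frame)   # identity latent of the source, in W+
    _, s_f_drv = E(drv_frame)    # facial-deformation latent of the driver, in SS
    animated_ss = A(w_id_src) + s_f_drv   # animated SS latent
    return G(animated_ss)        # re-enacted frame at 1024x1024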

One-Shot Same Identity Re-enactment

Experiment: Re-enactment using a single source frame and a driving sequence belonging to the same identity.
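As a usage sketch (reusing the illustrative module names from the pipeline sketch above), the identity latent is encoded once from the single source frame and reused across the entire driving sequence:

import torch

@torch.no_grad()
def reenact_video(E, A, G, src_frame, drv_frames):
    """Same-identity re-enactment: one source frame drives the whole clip."""
    w_id_src, _ = E(src_frame)              # encode the source identity once
    frames = []
    for drv in drv_frames:
        _, s_f = E(drv)                     # per-frame facial deformation
        frames.append(G(A(w_id_src) + s_f))
    return frames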

One-Shot Cross Identity Re-enactment

Experiment: Re-enactment using a single source frame and a driving sequence belonging to a different identity.

One-Shot Robustness

Experiment: Heatmaps of same-identity re-enactment reconstruction loss, averaged over 5 runs, each initiated with a different source frame. This measures the robustness of re-enactment to source frames with diverse head poses and expressions.
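A sketch of the underlying computation, continuing the illustrative names from the earlier snippets; the per-pixel L1 reconstruction loss is an assumption for illustration, as the loss used for the heatmaps is not specified here:

import torch
import torch.nn.functional as F

@torch.no_grad()
def per_frame_recon_loss(E, A, G, source_frames, drv_frames):
    """Same-identity reconstruction loss per driving frame, averaged over
    one run per source frame (5 runs in the experiment). L1 loss is an
    assumed stand-in; E, A, G are the illustrative modules from above."""
    losses = torch.zeros(len(drv_frames))
    for src in source_frames:                   # one run per source frame
        w_id, _ = E(src)
        for t, drv in enumerate(drv_frames):
            _, s_f = E(drv)
            recon = G(A(w_id) + s_f)
            losses[t] += F.l1_loss(recon, drv)  # driving frame is ground truth
    return losses / len(source_frames)          # averaged loss per frame

The heatmaps visualize these per-frame averaged losses across the driving sequence.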

BibTeX

@misc{oorloff2023oneshot,
      title={One-Shot Face Re-enactment using Hybrid Latent Spaces of StyleGAN2},
      author={Trevine Oorloff and Yaser Yacoob},
      year={2023},
      eprint={2302.07848},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}