1 Introduction

Neural face reenactment aims to transfer the rigid 3D face/head orientation and the deformable facial expression of a target facial image to a source facial image. Such technology is the key enabler for creating high-quality digital head avatars that find a multitude of applications in telepresence, Augmented Reality/Virtual Reality (AR/VR), and the creative industries. Recently, thanks to the advent of Deep Learning, Neural Face Reenactment has seen remarkable progress (Burkov et al., 2020; Meshry et al., 2021; Wang et al., 2021b; Zakharov et al., 2020). In spite of this, synthesizing photorealistic face/head sequences remains a challenging problem, with the quality of existing solutions being far from sufficient for the aforementioned demanding applications.

A major challenge that most prior works (Bao et al., 2018; Burkov et al., 2020; Ha et al., 2020; Zakharov et al., 2019, 2020; Zeng et al., 2020) have focused on is how to achieve identity and head pose/expression disentanglement, in order to both preserve the appearance and identity characteristics of the source face and successfully transfer the head pose and expression of the target face. A recent line of research relies on training conditional Generative Adversarial Networks (GANs) (Deng et al., 2020; Kowalski et al., 2020; Shoshan et al., 2021) in order to produce disentangled embeddings and control the generation process. However, such methods mainly focus on synthetic image generation, rendering reenactment on real faces challenging. Another line of works (Zakharov et al., 2019, 2020) relies on training with paired data (i.e., source and target facial images of the same identity), leading to poor cross-person face reenactment.

In this work, we propose a neural face reenactment framework that addresses the aforementioned limitations of the state-of-the-art (SOTA), motivated by the remarkable ability of modern pre-trained GANs (e.g., StyleGAN Karras et al. 2019; 2020a; b) to generate realistic and aesthetically pleasing faces, often indistinguishable from real ones. The research question we address in this paper is: Can a pre-trained GAN be adapted for face reenactment? A key challenge that needs to be addressed to this end is the absence of any inherent semantic structure in the latent space of GANs. In order to gain control over the generative process, inspired by Voynov and Babenko (2020), we propose to learn a set of latent directions (i.e., direction vectors in the GAN’s latent space) that are responsible for controlling head pose and expression variations in the generated facial images. Knowledge of these directions directly equips the pre-trained GAN with the ability of controllable generation in terms of head pose and expression, allowing for effective face reenactment. Specifically, in this work we present a simple pipeline to learn such directions, leveraging the ability of a linear 3D shape model (Feng et al., 2021) to capture disentangled directions for head pose, identity, and expression, which is crucial towards effective neural face reenactment. Moreover, another key challenge that needs to be addressed is how to use the GAN for the manipulation of real-world images. Capitalizing on Tov et al. (2021), we further show that by embedding real images in the GAN latent space, our pipeline can be successfully used for real face reenactment. Overall, we make the following contributions:

  1. Instead of training from-scratch conditional generative models (Burkov et al., 2020; Zakharov et al., 2020), we present a novel approach to face reenactment by finding the directions in the latent space of a pre-trained GAN (i.e., StyleGAN2 (Karras et al., 2020b) fine-tuned on the VoxCeleb1 dataset) that are responsible for controlling the rigid head orientation and expression, and show how these directions can be used for neural face reenactment on video datasets.

  2. We present a simple pipeline that is trained with the aid of a linear 3D shape model (Feng et al., 2021), which is inherently equipped with disentangled directions for facial shape in terms of head pose, identity, and expression. We further show that our pipeline can be trained with real images by first embedding them into the GAN space, allowing for effective reenactment of real-world faces.

  3. We show that our method features several favorable properties, including requiring a single source image (one-shot) and enabling cross-person reenactment.

  4. We perform several qualitative and quantitative comparisons with recent state-of-the-art reenactment methods, illustrating that our approach typically produces reenacted faces of notably higher quality on the standard benchmarks of VoxCeleb1 & 2 (Chung et al., 2018; Nagrani et al., 2017).

Compared to our previous work in Bounareli et al. (2022), this paper further investigates the real image inversion step and proposes a joint training scheme (Sect. 3.4) that eliminates the need for the optimization step during inference described in Sect. 3.2, resulting in a more efficient inference process and better quantitative and qualitative results. The proposed joint training scheme effectively addresses visual artifacts on the reenacted images caused by large head pose variations between the source and target faces, resulting in improved overall image quality. We qualitatively and quantitatively show that, by jointly learning the real image inversion encoder and the directions, our method achieves compelling results without the need for one-shot fine-tuning during inference. Finally, to further improve the visual quality of the reenacted images in terms of crucial (for the purpose of face reenactment) background and identity characteristics, we propose to further fine-tune the feature space \({\mathcal {F}}\) of StyleGAN2 (Sect. 3.5).

2 Related Work

2.1 Semantic Face Editing

Several recent works (Härkönen et al., 2020; Oldfield et al., 2021, 2023; Shen & Zhou, 2021; Shen et al., 2020; Voynov & Babenko, 2020; Tzelepis et al., 2021, 2022; Yang et al., 2021; Yao et al., 2021) study the existence of directions/paths in the latent space of a pre-trained GAN in order to perform editing (i.e., with respect to specific facial attributes) on the generated facial images. Voynov and Babenko (2020) introduced an unsupervised method that optimizes a set of vectors in the GAN’s latent space by learning to distinguish (using a “reconstructor” network) the image transformations caused by distinct latent directions. This leads to the discovery of a set of “interpretable”, but not “controllable”, directions; i.e., the optimized latent directions cannot be used for controllable (in terms of head pose and expression) facial editing and, thus, for face reenactment. Our method is inspired by the work of Voynov and Babenko (2020), extending it in several ways to make it suitable for neural face reenactment. Another line of recent works allows for explicit controllable facial image editing (Abdal et al., 2021; Deng et al., 2020; Durall Lopez et al., 2021; Ghosh et al., 2020; Nitzan et al., 2020; Shoshan et al., 2021; Wang et al., 2021a). However, these methods mostly rely on synthetic image editing rather than performing face reenactment on real video data. A work that is related to our framework is StyleRig (Tewari et al., 2020b), which uses the parameters of a 3D Morphable Model (3DMM) (Blanz & Vetter, 1999) to control the images generated by a pre-trained StyleGAN2 (Karras et al., 2020b). However, in contrast to our method, StyleRig’s training pipeline is not end-to-end and is significantly more complicated than ours, while in order to learn better disentangled directions, StyleRig requires training distinct models for different attributes (e.g., head pose and expression). This, along with the fact that StyleRig operates mainly on synthetic images, poses a notable restriction towards real-world face reenactment, where various facial attributes change simultaneously. By contrast, we propose to learn all disentangled directions for face reenactment simultaneously, allowing in this way for the effective editing of all attributes, a subset of them, or a single one, whilst we also optimize our framework on real faces. A follow-up work, PIE (Tewari et al., 2020a), focuses on inverting real images to enable editing using StyleRig (Tewari et al., 2020b). However, this method is computationally expensive (10 min/image), which is prohibitive for video-based facial reenactment. By contrast, we propose a framework that effectively and efficiently performs face reenactment (0.13 sec/image).

2.2 GAN Inversion

GAN inversion methods aim to encode real images into the latent space of pre-trained GANs (Karras et al., 2019, 2020b), allowing for subsequent editing using existing methods of synthetic image manipulation. The major challenge in the GAN inversion problem is the so-called “editability-perception” trade-off; that is, finding a sweet spot between faithful reconstruction of the real image and the editability of the corresponding latent code. The majority of recent inversion methods (Alaluf et al., 2021, 2022; Dinh et al., 2022; Richardson et al., 2021; Tov et al., 2021; Wang et al., 2022a) train encoder-based architectures that focus on predicting the latent codes \({\textbf{w}}\) that best reconstruct the original (real) images and that allow for subsequent editing. Zhu et al. (2020) propose a hybrid approach that consists of learning an encoder followed by an optimization step in the latent space to refine the similarity between the reconstructed and real images. Richardson et al. (2021) introduce a method that aims to improve the “editability-perception” trade-off, while more recently Roich et al. (2021) propose to fine-tune the generator to better capture/transfer appearance features.

The aforementioned works typically perform inversion onto the \({\mathcal {W}}+\) latent space of StyleGAN2. However, Parmar et al. (2022) have shown that \({\mathcal {W}}+\) is not capable of fully reconstructing the real images. Specifically, details such as the background, the hair style, or facial accessories, e.g., hats and glasses, cannot be inverted with high fidelity. A recent line of works (Alaluf et al., 2022; Bai et al., 2022; Wang et al., 2022a; Yao et al., 2022a) proposes to mitigate this by investigating more expressive spaces of StyleGAN2 (such as the feature space \({\mathcal {F}} \in {{\mathbb {R}}}^{h \times w \times c}\) Kang et al. 2021) to perform real image inversion. Although such methods are able to produce high quality reconstructions, their ability to accurately edit the inverted images is limited. Especially when changing the head pose, such methods tend to produce many visual artifacts (Fig. 25). In order to balance expressive invertibility and editing performance, Parmar et al. (2022) (SAM) propose to fuse different spaces, i.e., the \({\mathcal {W}}+\) latent space and the feature space \({\mathcal {F}} = \{{\mathcal {F}}_4, {\mathcal {F}}_6, {\mathcal {F}}_8, {\mathcal {F}}_{10} \}\), where each \({\mathcal {F}}_i\) corresponds to a different feature layer of StyleGAN2 (Karras et al., 2020b). In more detail, they propose to break the facial images into different segments (background, hat, glasses, etc.) and choose the most suitable space to invert each segment, leveraging the editing capabilities of the \({\mathcal {W}}+\) latent space and the reconstruction quality of the feature space \({\mathcal {F}}\). However, when performing global edits, e.g., changing the head pose orientation, SAM (Parmar et al., 2022) results in notable visual artifacts, in contrast to our method, as will be shown in the experimental section.

Fig. 1

Overview of the proposed framework: given a pair of source \({\textbf{I}}_s\) and target \({\textbf{I}}_t\) images, we compute their head pose/expression parameter vectors \({\textbf{p}}_s\) and \({\textbf{p}}_t\), respectively, using the \(\mathrm {Net_{3D}}\) network. The matrix of directions \({\textbf{A}}\) is trained so that, given the shift \(\Delta {\textbf{w}} = {\textbf{A}}\Delta {\textbf{p}}\), the reenacted image \({\textbf{I}}_r\), generated from the latent code \({\textbf{w}}_r = {\textbf{w}}_s + \Delta {\textbf{w}}\), transfers the head pose and expression of the target face while maintaining the identity of the source face

2.3 Neural Face Reenactment

Neural face reenactment poses a challenging problem that requires strong generalization across many different identities and a large range of head poses and expressions. Many of the proposed methods rely on facial landmark information (Ha et al., 2020; Hsu et al., 2022; Tripathy et al., 2020, 2021; Wang et al., 2022b; Zakharov et al., 2019, 2020; Zhang et al., 2020). Specifically, Zakharov et al. (2020) propose a one-shot face reenactment method driven by landmarks, which decomposes an image into pose-dependent and pose-independent components. A limitation of landmark-based methods is that landmarks preserve identity information, thus impeding their applicability to cross-subject face reenactment (Burkov et al., 2020). To mitigate this limitation, Hsu et al. (2022) propose an ID-preserving Shape Generator (IDSG) that transforms the target facial landmarks so that they preserve the identity, i.e., the facial shape, of the source image. Additionally, several methods (Doukas et al., 2021; Ren et al., 2021; Yang et al., 2022; Yao et al., 2020) rely on 3D shape models to remove the identity details of the driving images. Warping-based methods (Doukas et al., 2021; Ren et al., 2021; Siarohin et al., 2019; Wang et al., 2021b; Wiles et al., 2018; Yang et al., 2022) synthesize the reenacted images based on the motion of the driving faces. Specifically, HeadGAN (Doukas et al., 2021) and Face2Face (Yang et al., 2022) are warping-based methods conditioned on 3D Morphable Models. Whilst such methods produce realistic results, they suffer from several visual artifacts and head pose mismatch, especially under large head pose variations. Finally, Meshry et al. (2021) propose a two-step architecture that aims to disentangle the spatial and style components of an image, leading to better preservation of the source identity.

In contrast to the methods discussed above, which rely on training conditional generative models on large paired datasets in order to learn facial descriptors with disentanglement properties, in this paper we propose a novel and simple face reenactment framework that learns disentangled directions in the latent space of a StyleGAN2 (Karras et al., 2020b) fine-tuned on the VoxCeleb (Nagrani et al., 2017) dataset. We show that the discovery of meaningful and disentangled directions that control the head pose and the facial expression can be used for high-quality self- and cross-identity reenactment.

3 Proposed Method

In this section, we present the proposed framework for one-shot neural face reenactment via finding directions in the latent space of StyleGAN2. More specifically, we begin with the most basic variant of our framework for finding reenactment latent directions using unpaired synthetic images in Sect. 3.1—an overview of this is shown in Fig. 1. Next, in Sect. 3.2 we extend this methodology for handling real images along with synthetic ones (i.e., towards real face reenactment), while in Sect. 3.3 we investigate the incorporation of paired video data. In Sect. 3.4 we introduce a joint training scheme that allows for optimization-free reenactment, leading to efficient and consistent neural reenactment. Finally, in Sect. 3.5, on top of the previously introduced variants of our method, we propose the refinement of crucial visual details (i.e., background, hair style) by leveraging the impressive reconstruction capability of StyleGAN2’s feature space \({\mathcal {F}}\).

3.1 Finding Reenactment Latent Directions on Unpaired Synthetic Images

3.1.1 StyleGAN2 Background

Let \({\mathcal {G}}\) denote the generator of StyleGAN2 (Karras et al., 2020b), as shown in Fig. 1. Specifically, \({\mathcal {G}}\) takes as input a latent code \({\textbf{w}}\in {\mathcal {W}}\subset {{\mathbb {R}}}^{512}\), which is typically the output of StyleGAN2’s MLP-based Mapping Network f acting on samples \({\textbf{z}}\in {{\mathbb {R}}}^{512}\) drawn from the standard Gaussian \({\mathcal {N}}({\textbf{0}},{\textbf{I}})\). That is, given a latent code \({\textbf{z}}\sim {\mathcal {N}}({\textbf{0}},{\textbf{I}})\), the generator produces an image \({\mathcal {G}}(f({\textbf{z}}))\in {{\mathbb {R}}}^{3\times 256\times 256}\).

StyleGAN2 is typically pre-trained on the Flickr-Faces-HQ (FFHQ) dataset (Karras et al., 2019), which exhibits poor diversity in terms of head pose and facial expression; for instance, FFHQ does not typically account for roll changes in head pose. In order to compare our method with other state-of-the-art methods, we fine-tune StyleGAN2’s generator \({\mathcal {G}}\) on the VoxCeleb dataset (Nagrani et al., 2017), which provides a much wider range of head poses and facial expressions, rendering it very useful for the task of neural face reenactment via the appropriate latent directions, as will be discussed in the following sections. We note that we fine-tune the StyleGAN2 generator on the VoxCeleb dataset using the method of Karras et al. (2020a), without imposing any reenactment objectives. That is, the fine-tuned generator can produce synthetic images with random identities (different from the identities of VoxCeleb) that follow the distribution of the VoxCeleb dataset in terms of head poses and expressions.

3.1.2 3D Morphable Model (Net3D)

Given an image, Net3D (Feng et al., 2021) encodes the depicted face into a facial shape vector \({{\textbf {s}}}\in {{\mathbb {R}}}^{3N}\), where N denotes the number of vertices, which can be decomposed in terms of a linear 3D facial shape model as

$$\begin{aligned} {{\textbf {s}}} = \bar{{{\textbf {s}}}} + {{\textbf {S}}}_{i}{\textbf{p}}_{i} + {{\textbf {S}}}_{\theta }{\textbf{p}}_{\theta } + {{\textbf {S}}}_{e}{\textbf{p}}_{e}, \end{aligned}$$
(1)

where \(\bar{{{\textbf {s}}}}\) denotes the mean 3D facial shape, \({{\textbf {S}}}_{i}\in {{\mathbb {R}}}^{3N\times m_{i}}\), \({{\textbf {S}}}_{\theta }\in {{\mathbb {R}}}^{3N\times m_{\theta }}\), and \({{\textbf {S}}}_{e}\in {{\mathbb {R}}}^{3N\times m_{e}}\) denote the PCA bases for identity, head orientation, and expression, and \({\textbf{p}}_{i}\), \({\textbf{p}}_{\theta }\), and \({\textbf{p}}_{e}\) denote the corresponding identity, head orientation, and expression coefficients, respectively. The variables \(m_{i}\), \(m_{\theta }\), and \(m_{e}\) correspond to the number of identity, head pose, and expression coefficients. For reenactment, we are interested in manipulating head orientation and expression; thus, our head pose/expression parameter vector is given as \({\textbf{p}}=[{\textbf{p}}_{\theta }, {\textbf{p}}_{e}] \in {{\mathbb {R}}}^{3+m_e}\). All PCA shape bases are orthogonal to each other and hence capture disentangled variations of identity and expression; moreover, they are computed in a frontalized reference frame and are thus also disentangled from head orientation. These bases can also be interpreted as directions in the shape space. We propose to learn similar directions in the GAN latent space, as described in detail in the following section.
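To make the decomposition concrete, the following sketch reconstructs a 3D facial shape from Eq. (1) and assembles the reenactment parameter vector \({\textbf{p}}\); the function names, array shapes, and the use of NumPy are illustrative assumptions and do not reflect DECA's actual interface.

```python
import numpy as np

def reconstruct_shape(s_mean, S_id, S_pose, S_exp, p_id, p_pose, p_exp):
    """Linear 3D shape model of Eq. (1): s = s_mean + S_i p_i + S_theta p_theta + S_e p_e.

    s_mean: (3N,) mean shape; S_*: (3N, m_*) PCA bases; p_*: (m_*,) coefficients.
    """
    return s_mean + S_id @ p_id + S_pose @ p_pose + S_exp @ p_exp

def reenactment_params(p_pose, p_exp):
    """Keep only the coefficients manipulated for reenactment: p = [p_theta, p_e]."""
    return np.concatenate([p_pose, p_exp])  # shape (3 + m_e,)
```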

3.1.3 Reenactment Latent Directions

In particular, we propose to associate a change \(\Delta {\textbf{p}}\) in the head pose orientation and expression, with a change \(\Delta {\textbf{w}}\) in the (intermediate) latent GAN space so that the two generated images \(G({\textbf{w}})\) and \(G({\textbf{w}}+\Delta {\textbf{w}})\) differ only in head pose and expression by the same amount \(\Delta {\textbf{s}}\) induced by \(\Delta {\textbf{p}}\). If the directions sought in the GAN latent space are assumed to be linear (Nitzan et al., 2021), this implies the following linear relationship

$$\begin{aligned} \Delta {\textbf{w}} = {\textbf{A}}\Delta {\textbf{p}}, \end{aligned}$$
(2)

where \({\textbf{A}}\in {{\mathbb {R}}}^{d_{\textrm{out}}\times d_{\textrm{in}}}\) is a matrix whose columns represent the directions in the GAN latent space. In our case, \(d_{\textrm{in}} = 3+m_e\) and \(d_{\textrm{out}} = N_{l}\times 512\), where \(N_{l}\) is the number of the generator’s layers to which we opt to apply the shifts.
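A minimal sketch of the directions matrix as a trainable module is given below; the class name, initialization scale, and default dimensions (\(m_e = 12\), \(N_l = 8\)) follow the implementation details reported later, but are otherwise assumptions.

```python
import torch
import torch.nn as nn

class DirectionsMatrix(nn.Module):
    """Trainable matrix A of Eq. (2), mapping a change Dp in pose/expression
    parameters to a shift Dw in the latent space (the only trainable module
    of the basic pipeline)."""

    def __init__(self, num_params: int = 15, num_layers: int = 8, latent_dim: int = 512):
        super().__init__()
        self.num_layers, self.latent_dim = num_layers, latent_dim
        # A has shape (N_l * 512) x (3 + m_e).
        self.A = nn.Parameter(0.01 * torch.randn(num_layers * latent_dim, num_params))

    def forward(self, delta_p: torch.Tensor) -> torch.Tensor:
        # delta_p: (batch, 3 + m_e)  ->  delta_w: (batch, N_l, 512)
        delta_w = delta_p @ self.A.t()
        return delta_w.view(-1, self.num_layers, self.latent_dim)
```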

3.1.4 Training Pipeline

In order to optimize the matrix of controllable latent directions \({\textbf{A}}\), we propose a simple pipeline, shown in Fig. 1. Specifically, during training, a pair of source (\({\textbf{z}}_s\)) and target (\({\textbf{z}}_t\)) latent codes is drawn from \({\mathcal {N}}({\textbf{0}},{\textbf{I}})\), giving rise to a pair of source (\({\textbf{I}}_s = {\mathcal {G}}(f({\textbf{z}}_s))\)) and target (\({\textbf{I}}_t = {\mathcal {G}}(f({\textbf{z}}_t))\)) images, as shown in the left part of Fig. 1. The images \({\textbf{I}}_s\) and \({\textbf{I}}_t\) are then encoded by the pre-trained Net3D into the head pose/expression parameter vectors \({\textbf{p}}_s\) and \({\textbf{p}}_t\), respectively. Using (2), we calculate the shift \(\Delta {\textbf{w}}\) in the intermediate latent space of StyleGAN2 as \(\Delta {\textbf{w}}={\textbf{A}}\Delta {\textbf{p}}={\textbf{A}}({\textbf{p}}_t-{\textbf{p}}_s)\) and the reenactment latent code \({\textbf{w}}_{r}={\textbf{w}}_s+\Delta {\textbf{w}}\), where \({\textbf{w}}_s=f({\textbf{z}}_s)\). Using the latter, we arrive at the reenacted image \({\textbf{I}}_{r}= {\mathcal {G}}({\textbf{w}}_{r})\).

It is worth noting that the only trainable module of the proposed framework is the matrix \({\textbf{A}}\in {{\mathbb {R}}}^{d_{\textrm{out}}\times d_{\textrm{in}}}\); that is, the number of trainable parameters of the proposed framework is 65K. We also note that, before training, we estimate the distribution of each element of the head pose/expression parameters \({\textbf{p}}\) by randomly generating \(10\textrm{K}\) images and calculating their corresponding \({\textbf{p}}\) vectors using the pre-trained Net3D (Feng et al., 2021). Using the estimated distributions, during training we re-scale each element of \({\textbf{p}}\) from its original range to a common range \([-a,a]\) (with a being a hyperparameter empirically set to 6). In the appendices (Sect. A.1.1) we further discuss the re-scaling of each element of \({\textbf{p}}\). To further encourage disentanglement in the optimized latent directions matrix \({\textbf{A}}\), we follow a training strategy where, for \(50\%\) of the training samples, we reenact only one attribute by using \(\Delta {\textbf{p}} = [0,\ldots ,\varepsilon ,\ldots ,0]\), where \(\varepsilon \) is uniformly sampled from \({\mathcal {U}}[-a, a]\). In the appendices (Sect. A.1.3) we show that this training strategy improves the disentanglement between the learned directions.
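The sketch below summarizes one forward pass of this pipeline on a synthetic pair; `mapping`, `generator`, and `net3d` are stand-ins for StyleGAN2's mapping network \(f\), its synthesis network \({\mathcal {G}}\), and the pre-trained Net3D, whose exact interfaces differ in practice.

```python
import torch

def synthetic_reenactment_step(mapping, generator, net3d, directions, batch_size=4):
    """One forward pass of the basic pipeline of Fig. 1 (losses omitted).
    Only the `directions` module (matrix A) receives gradient updates."""
    z_s = torch.randn(batch_size, 512)                 # source latent codes
    z_t = torch.randn(batch_size, 512)                 # target latent codes
    w_s, w_t = mapping(z_s), mapping(z_t)              # w_s = f(z_s), w_t = f(z_t)
    I_s, I_t = generator(w_s), generator(w_t)          # source / target images
    p_s, p_t = net3d(I_s), net3d(I_t)                  # (batch, 3 + m_e) parameter vectors
    delta_w = directions(p_t - p_s)                    # Eq. (2): Dw = A (p_t - p_s)
    w_r = w_s.unsqueeze(1) + delta_w                   # broadcast the shift over the N_l layers
    I_r = generator(w_r)                               # reenacted image
    return I_s, I_t, I_r
```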

3.1.5 Losses

We train our framework by minimizing the following total loss:

$$\begin{aligned} {\mathcal {L}} = \lambda _{r} {\mathcal {L}}_{r} + \lambda _{id} {\mathcal {L}}_{id} + \lambda _{per} {\mathcal {L}}_{per}, \end{aligned}$$
(3)

where \({\mathcal {L}}_{r}\), \({\mathcal {L}}_{id}\), and \({\mathcal {L}}_{per}\) denote respectively the reenactment, identity, and perceptual losses with \(\lambda _{r}\), \(\lambda _{id}\), and \(\lambda _{per}\) being weighting hyperparameters empirically set to \(\lambda _{r}=1\), \(\lambda _{id}=10\) and \(\lambda _{per}=10\). We detail each loss term below.

Reenactment loss \({\mathcal {L}}_{r}\) We define the reenactment loss as

$$\begin{aligned} {\mathcal {L}}_{r} = {\mathcal {L}}_{sh} + {\mathcal {L}}_{eye} + {\mathcal {L}}_{mouth}, \end{aligned}$$

where the shape loss term \({\mathcal {L}}_{sh}=\Vert {\textbf{S}}_r-{\textbf{S}}_{gt}\Vert _1\) imposes head pose and expression transfer from target to source; here, \({\textbf{S}}_r\) is the 3D shape of the reenacted image and \({\textbf{S}}_{gt}\) is the reconstructed ground-truth 3D shape calculated using (1). Specifically, the ground-truth 3D facial shape \({\textbf{S}}_{gt}\) should have the identity, i.e., facial shape, of the source image and the facial expression and head pose of the target image, both in self reenactment and in cross-subject reenactment. In self reenactment, \({\textbf{S}}_{gt}\) coincides with \({\textbf{S}}_{t}\), the facial shape of the target image. In cross-subject reenactment, we calculate \({\textbf{S}}_{gt}\) using the identity coefficients \({\textbf{p}}_{i}^s\) of the source face and the facial expression and head pose coefficients \({\textbf{p}}_{e}^t, {\textbf{p}}_{\theta }^t\) of the target face as:

$$\begin{aligned} {{\textbf {S}}}_{gt} = \bar{{{\textbf {s}}}} + {{\textbf {S}}}_{i}{\textbf{p}}_{i}^s + {{\textbf {S}}}_{\theta }{\textbf{p}}_{\theta }^t + {{\textbf {S}}}_{e}{\textbf{p}}_{e}^t. \end{aligned}$$
(4)

To enhance the expression transfer, we include the eye (\({\mathcal {L}}_{eye}\)) and mouth (\({\mathcal {L}}_{mouth}\)) losses. The eye loss \({\mathcal {L}}_{eye}\) (the mouth loss \({\mathcal {L}}_{mouth}\) is computed in a similar fashion) compares the distances between paired landmarks on the upper and lower eyelids of the reenacted and the reconstructed ground-truth shapes. In Appendix A.2, we provide a detailed discussion of \({\mathcal {L}}_{eye}\) and \({\mathcal {L}}_{mouth}\).

Identity loss \({\mathcal {L}}_{id}\) We define the identity loss as the cosine similarity between feature representations extracted from the source \({\textbf{I}}_s\) and the reenacted \({\textbf{I}}_r\) image using ArcFace (Deng et al., 2019). The identity loss enforces identity preservation between the source and the reenacted image.

Perceptual loss \({\mathcal {L}}_{per}\) We define the perceptual loss similarly to Johnson et al. (2016) in order to improve the quality of the reenacted face images.
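A sketch of how these terms combine into Eq. (3) is given below; the feature extractors are passed in as arguments, the eyelid-landmark indices are placeholders, and the exact reductions are assumptions rather than the settings used in the paper.

```python
import torch.nn.functional as F

def eyelid_distance_loss(lms_r, lms_gt, upper_idx, lower_idx):
    """L_eye-style term: compare distances between paired upper/lower eyelid
    landmarks of the reenacted and ground-truth shapes (indices are placeholders)."""
    d_r = (lms_r[:, upper_idx] - lms_r[:, lower_idx]).norm(dim=-1)
    d_gt = (lms_gt[:, upper_idx] - lms_gt[:, lower_idx]).norm(dim=-1)
    return (d_r - d_gt).abs().mean()

def total_loss(S_r, S_gt, id_src, id_reen, feat_src, feat_reen,
               lambda_r=1.0, lambda_id=10.0, lambda_per=10.0):
    """Eq. (3) with the eye/mouth terms omitted for brevity. S_r, S_gt: reenacted and
    ground-truth 3D shapes; id_*: ArcFace embeddings of the source/reenacted images;
    feat_*: perceptual features of the reenacted/source images."""
    L_sh = (S_r - S_gt).abs().mean()                                   # shape L1
    L_id = 1.0 - F.cosine_similarity(id_src, id_reen, dim=-1).mean()   # identity
    L_per = F.mse_loss(feat_src, feat_reen)                            # perceptual
    return lambda_r * L_sh + lambda_id * L_id + lambda_per * L_per
```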

3.2 Fine-Tuning on Unpaired Real Images

In this section, we extend the basic pipeline of the proposed framework, described in the previous section, to learn from both synthetic and real images. To do so, we propose to (a) use a pipeline for inverting real images back into the latent space of StyleGAN2, and (b) adopt a mixed training approach (using both synthetic and inverted latent codes) for discovering the latent directions (Sect. 3.1.3).

As discussed in previous sections, the main challenge in the GAN inversion problem is finding a good trade-off between faithful reconstruction of the real image and effective editability using the inverted latent code. Although satisfying both requirements is challenging (Alaluf et al., 2021; Richardson et al., 2021; Tov et al., 2021), we found that the following pipeline produces compelling results for our goal of face/head reenactment. During training, we employ an encoder-based method (e4e) (Tov et al., 2021) to invert the real images into the \({\mathcal {W}}+\) latent space of StyleGAN2 (Abdal et al., 2019). However, directly using the inverted \({\mathcal {W}}+\) latent codes performs poorly in face reenactment due to the domain gap between the synthetic and inverted latent codes. To alleviate this, we propose a mixed-data approach (i.e., using both synthetic and real images) for training the pipeline presented in Sect. 3.1. Specifically, we first invert the extracted frames from the VoxCeleb dataset and, during training, at each iteration (i.e., for each batch) we use 50% random latent codes \({\textbf{w}}\) and 50% embedded latent codes \({\textbf{w}}^{inv}\).
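The batch construction described above could look roughly as follows; the code assumes the inverted codes are precomputed e4e outputs stored as a single \({\mathcal {W}}+\) tensor and that synthetic codes are broadcast to the same layout, both of which are implementation assumptions.

```python
import torch

def mixed_latent_batch(mapping, inverted_codes, batch_size=8):
    """Build a half-synthetic / half-inverted batch of latent codes.
    `inverted_codes`: tensor of shape (num_frames, num_layers, 512) holding
    precomputed e4e inversions of VoxCeleb frames (an assumed storage format)."""
    half = batch_size // 2
    num_layers = inverted_codes.shape[1]
    w_synth = mapping(torch.randn(half, 512))                 # (half, 512)
    w_synth = w_synth.unsqueeze(1).repeat(1, num_layers, 1)   # replicate over layers
    idx = torch.randint(0, inverted_codes.shape[0], (half,))
    w_real = inverted_codes[idx]                              # (half, num_layers, 512)
    return torch.cat([w_synth, w_real], dim=0)
```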

Since the inverted images using e4e (Tov et al., 2021) might still be missing some crucial identity details, we propose to use an additional optimization step (only during inference), similarly to Roich et al. (2021), in order to slightly update the generator \({\mathcal {G}}\) and arrive at better reenacted images in terms of identity preservation. Note that this step does not affect the calculation of \({\textbf{w}}^{inv}\) and is used only during inference to obtain a higher quality inversion. We perform the optimization for 200 steps and only on the source frame of each video. In Fig. 2 we illustrate examples of neural face reenactment without optimizing the generator’s weights (w/o opt., third column) and with optimization (w/ opt., fourth column). Clearly, the reenacted images without the additional optimization step are not able to faithfully reconstruct the real images, while the reenacted images produced after optimizing the generator weights resemble the real ones more closely.
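The inference-time generator refinement can be sketched as below; the optimizer, learning rate, and exact loss mix are assumptions rather than the paper's settings, and `lpips_fn` stands in for a perceptual distance such as LPIPS.

```python
import torch
import torch.nn.functional as F

def refine_generator(generator, w_inv, source_image, lpips_fn, steps=200, lr=1e-3):
    """Briefly fine-tune G so that G(w_inv) better matches the source frame,
    in the spirit of Roich et al. (2021); run only once, on the source frame."""
    optimizer = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(steps):
        reconstruction = generator(w_inv)
        loss = F.l1_loss(reconstruction, source_image) \
            + lpips_fn(reconstruction, source_image).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return generator
```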

Fig. 2

Examples of face reenactment without (“w/o opt.”) and with (“w/ opt.”) the generator’s optimization. We additionally show results using our proposed joint training scheme (“Joint Training”) and the refinement of StyleGAN2’s feature space (“FSR”), described in Sects. 3.4 and 3.5, respectively

3.3 Fine-Tuning on Paired Real Images (Video Data)

In the previous sections, we presented the proposed framework for learning from unpaired synthetic and real images. Whilst this provides the benefit of learning from a very large number of identities, making it useful for cross-person reenactment, we show that we can achieve additional improvements by optimizing novel losses introduced by further training on paired data from the VoxCeleb1 (Nagrani et al., 2017) video dataset.

Compared to training from scratch on video data, as in most previous methods (e.g., Zakharov et al. 2019, 2020; Burkov et al. 2020), we argue that our approach offers a more balanced strategy that combines the best of both worlds; that is, training with unpaired images and fine-tuning with paired video data. From each video of our training set, we randomly sample a source and a target face that have the same identity but different head pose/expression. Consequently, we minimize the following loss function

$$\begin{aligned} {\mathcal {L}} = \lambda _{r}{\mathcal {L}}_{r} + \lambda _{id}{\mathcal {L}}_{id} + \lambda _{per}{\mathcal {L}}_{per} + \lambda _{pix}{\mathcal {L}}_{pix}, \end{aligned}$$
(5)

where \({\mathcal {L}}_{r}\) is the same reenactment loss defined in Sect. 3.1, \({\mathcal {L}}_{id}\) and \({\mathcal {L}}_{per}\) are the identity and perceptual losses, this time calculated between the reenacted image \({\textbf{I}}_r\) and the target image \({\textbf{I}}_t\), and \({\mathcal {L}}_{pix}\) is a pixel-wise L1 loss between the reenacted and target images.

3.4 Joint Training of the Real Image Inversion Encoder \({\mathcal {E}}_w\) and the Directions Matrix \({\textbf{A}}\)

Fig. 3

To eliminate the need for the optimization step during inference, we propose to jointly train the real image inversion encoder \({\mathcal {E}}_w\) and the directions matrix \({\textbf{A}}\). We note that during training both the generator \({\mathcal {G}}\) and the \(\mathrm {Net_{3D}}\) network are frozen

As discussed in Sect. 3.2, the encoder-based e4e (Tov et al., 2021) inversion method often fails to faithfully reconstruct real images, typically missing crucial identity characteristics, as shown in the third column (“w/o opt.”) of Fig. 2. Clearly, this poses a certain limitation to the face reenactment methodology presented in Sect. 3.1.4. Optimizing the generator’s weights leads to notable improvements (Sect. 3.2), as shown in the fourth column (“w/ opt.”) of Fig. 2; however, this comes at a significant cost for the task of face reenactment (the optimization of \({\mathcal {G}}\) takes approximately 20 sec. per frame).

In this section, we propose to jointly train the real image inversion encoder \({\mathcal {E}}_w\) and the directions matrix \({\textbf{A}}\), which leads to optimization-free face reenactment at inference time. To do so, we use paired data as described in Sect. 3.3. An overview of this approach is shown in Fig. 3. Specifically, we first sample a source (\({\textbf{I}}_s\)) and a target (\({\textbf{I}}_t\)) image from the same video of the VoxCeleb1 (Nagrani et al., 2017) training set, which have the same identity but different head pose/expression. These images are then fed into the inversion encoder \({\mathcal {E}}_w\) to predict the corresponding source (\({\textbf{w}}_s\)) and target (\({\textbf{w}}_t\)) latent codes. Then, the pre-trained \(\mathrm {Net_{3D}}\) network extracts the corresponding source (\({\textbf{p}}_s\)) and target (\({\textbf{p}}_t\)) parameter vectors. Finally, as described in Sect. 3.1, we generate the reenacted image using the latent code \({\textbf{w}}_{r} = {\textbf{w}}_s + \Delta {\textbf{w}}\), where \(\Delta {\textbf{w}}={\textbf{A}}({\textbf{p}}_t-{\textbf{p}}_s)\).
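A sketch of this joint forward pass is shown below; module interfaces are stand-ins, and the shift produced by \({\textbf{A}}\) is assumed to be shape-compatible with the \({\mathcal {W}}+\) codes (in practice it is applied only to the \(N_l\) layers it covers).

```python
def joint_training_forward(encoder_w, generator, net3d, directions, I_s, I_t):
    """Forward pass of the joint training scheme (Fig. 3). The generator and
    Net3D are frozen; the encoder E_w and the directions matrix A are trained."""
    w_s, w_t = encoder_w(I_s), encoder_w(I_t)            # inverted W+ latent codes
    I_s_hat, I_t_hat = generator(w_s), generator(w_t)    # reconstructions, used in L_{E_w}
    p_s, p_t = net3d(I_s), net3d(I_t)                    # pose/expression parameter vectors
    w_r = w_s + directions(p_t - p_s)                    # w_r = w_s + A (p_t - p_s)
    I_r = generator(w_r)                                 # reenacted image, used in L_A
    return I_s_hat, I_t_hat, I_r
```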

3.4.1 Real Image Encoder \({\mathcal {E}}_w\) Optimization Objective

In order to train the real image encoder \({\mathcal {E}}_w\) we minimize the following loss:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{{\mathcal {E}}_w} = \lambda _{id} ({\mathcal {L}}_{id}({\textbf{I}}_s, \hat{{\textbf{I}}}_s) + {\mathcal {L}}_{id}({\textbf{I}}_t, \hat{{\textbf{I}}}_t)) \\ \quad +\lambda _{per} ({\mathcal {L}}_{per}({\textbf{I}}_s, \hat{{\textbf{I}}}_s) + {\mathcal {L}}_{per}({\textbf{I}}_t, \hat{{\textbf{I}}}_t)) \\ \quad +\lambda _{pix} ({\mathcal {L}}_{pix}({\textbf{I}}_s, \hat{{\textbf{I}}}_s) + {\mathcal {L}}_{pix}({\textbf{I}}_t, \hat{{\textbf{I}}}_t)) \\ \quad +\lambda _{style} ({\mathcal {L}}_{style}({\textbf{I}}_s, \hat{{\textbf{I}}}_s) + {\mathcal {L}}_{style}({\textbf{I}}_t, \hat{{\textbf{I}}}_t)), \end{aligned} \end{aligned}$$
(6)

where \(\hat{{\textbf{I}}}_s = {\mathcal {G}}({\mathcal {E}}_w({\textbf{I}}_s))\) and \(\hat{{\textbf{I}}}_t = {\mathcal {G}}({\mathcal {E}}_w({\textbf{I}}_t))\) denote the reconstructions of the source and target images, and \({\mathcal {L}}_{id}\), \({\mathcal {L}}_{per}\), and \({\mathcal {L}}_{pix}\) denote the identity, perceptual, and pixel-wise losses described in the previous sections.

Additionally, to further improve the style and quality of the reconstructed images, we propose to use a style loss \({\mathcal {L}}_{style}\), similarly to Barattin et al. (2023). Specifically, we use FaRL (Zheng et al., 2022), a method for general facial representation learning that leverages contrastive learning between image and text pairs to learn meaningful feature representations of facial images. In our method, we use its Transformer-based image encoder, \({\mathcal {E}}_{FaRL}\), to extract a 512-dimensional feature vector from each image. The proposed style loss is defined as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{style} = \Vert {\mathcal {E}}_{FaRL}({\textbf{I}}_s) - {\mathcal {E}}_{FaRL}(\hat{{\textbf{I}}}_s)\Vert _1 \\ +\Vert {\mathcal {E}}_{FaRL}({\textbf{I}}_t) - {\mathcal {E}}_{FaRL}(\hat{{\textbf{I}}}_t)\Vert _1. \end{aligned} \end{aligned}$$
(7)
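In code, the style loss could be sketched as follows, where `farl_encoder` stands in for FaRL's image encoder (its exact call signature is an assumption) and the per-sample L1 norm is mean-reduced over the batch.

```python
def style_loss(farl_encoder, I_s, I_s_hat, I_t, I_t_hat):
    """L_style of Eq. (7): L1 distance between the 512-d FaRL embeddings of each
    real image and its reconstruction, summed over the source and target pairs."""
    def embedding_l1(real, recon):
        return (farl_encoder(real) - farl_encoder(recon)).abs().sum(dim=-1).mean()
    return embedding_l1(I_s, I_s_hat) + embedding_l1(I_t, I_t_hat)
```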

3.4.2 Directions Matrix \({\textbf{A}}\) Optimization Objective

In order to train the directions matrix \({\textbf{A}}\) we minimize the following loss:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{{\textbf{A}}} = \lambda _{r}{\mathcal {L}}_{r} +\lambda _{id}{\mathcal {L}}_{id} + \lambda _{per}{\mathcal {L}}_{per} \\ +\lambda _{pix}{\mathcal {L}}_{pix} + \lambda _{style}{\mathcal {L}}_{style}, \end{aligned} \end{aligned}$$
(8)

where \({\mathcal {L}}_{r}\), \({\mathcal {L}}_{id}\), \({\mathcal {L}}_{per}\), \({\mathcal {L}}_{pix}\), and \({\mathcal {L}}_{style}\) denote, respectively, the reenactment loss defined in Sect. 3.1 and the identity, perceptual, pixel-wise, and style losses, all calculated between the reenacted image \({\textbf{I}}_r\) and the target image \({\textbf{I}}_t\).

Fig. 4

Cycle loss: given a pair of source (\({\textbf{I}}_s^1\)) and target (\({\textbf{I}}_t^1\)) images, we calculate the corresponding reenacted image \({\textbf{I}}_r^1\). We then use this reenacted image as the source and the source image of the first pair as the target, and calculate a second reenacted image \({\textbf{I}}_r^2\), which is constrained to be similar to \({\textbf{I}}_s^1\)

Moreover, to further improve the reenactment results, we propose an additional cycle loss term \({\mathcal {L}}_{cycle}\) (Bounareli et al., 2023; Sanchez & Valstar, 2020). Specifically, as shown in Fig. 4, given a pair of source (\({\textbf{I}}_s^1\)) and target (\({\textbf{I}}_t^1\)) images, we calculate the corresponding reenacted image \({\textbf{I}}_r^1 \equiv {\textbf{I}}_t^1\). Using the reenacted image \({\textbf{I}}_r^1\) as the source and the original source image \({\textbf{I}}_s^1\) as the target, we then calculate a new reenacted image \({\textbf{I}}_r^2\), which is constrained to be similar to \({\textbf{I}}_s^1\). Consequently, we calculate all reconstruction losses, i.e., \({\mathcal {L}}_{id}\), \({\mathcal {L}}_{per}\), \({\mathcal {L}}_{pix}\), and \({\mathcal {L}}_{style}\), between the source image \({\textbf{I}}_s^1\) and the reenacted image \({\textbf{I}}_r^2\). In our ablation studies (Sect. 4.2), we show that the proposed cycle loss improves the face reenactment performance.
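The cycle pass can be sketched as follows; `reconstruction_losses(a, b)` is assumed to sum the identity, perceptual, pixel-wise, and style terms, and the module interfaces are the same stand-ins as before.

```python
def cycle_loss(encoder_w, generator, net3d, directions, I_s, I_t, reconstruction_losses):
    """L_cycle (Fig. 4): reenact I_s towards I_t, then reenact the result back
    towards I_s and penalize its difference from I_s."""
    p_s, p_t = net3d(I_s), net3d(I_t)
    w_s = encoder_w(I_s)
    I_r1 = generator(w_s + directions(p_t - p_s))        # first reenactment (towards I_t)
    w_r1, p_r1 = encoder_w(I_r1), net3d(I_r1)
    I_r2 = generator(w_r1 + directions(p_s - p_r1))      # reenact back towards I_s
    return reconstruction_losses(I_r2, I_s)
```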

3.4.3 Joint Optimization Objective

Overall, the objective of the joint optimization is as follows:

$$\begin{aligned} {\mathcal {L}} = {\mathcal {L}}_{{\textbf{A}}} + {\mathcal {L}}_{{\mathcal {E}}_w} + {\mathcal {L}}_{cycle}. \end{aligned}$$
(9)

We note that, in this training phase, we fine-tune the matrix \({\textbf{A}}\) and the real image inversion encoder \({\mathcal {E}}_w\), trained as described in Sect. 3.2. As demonstrated in Fig. 2, using the proposed joint training scheme (Joint Training), our method is able to reconstruct the identity details of the real faces without performing any optimization step. In Sect. 4, we quantitatively demonstrate that the proposed method produces results on self reenactment similar to those obtained when optimizing the generator’s weights. Nevertheless, on the more challenging task of cross-subject reenactment and under large head pose differences between the source and target faces, the joint training scheme outperforms our results with optimization, producing more realistic images with fewer visual artifacts.

3.5 Feature Space \({\mathcal {F}}\) Refinement

Fig. 5

Training of feature space encoder \({\mathcal {E}}_{{\mathcal {F}}}\) in the real image inversion task. \({\mathcal {E}}_{{\mathcal {F}}}\) takes as input a real image and predicts the shift \(\Delta f_4\) that updates the feature map \(f_4\) of the \(4^{th}\) feature layer of StyleGAN2’s generator

In this section, we propose an additional module for our face reenactment framework that refines the feature space \({\mathcal {F}}\) of the StyleGAN2 generator, taking advantage of its exceptional expressiveness (e.g., in terms of background, hair style/color, or hair accessories). In order to mitigate the limited editability of \({\mathcal {F}}\) (Kang et al., 2021; Parmar et al., 2022), we propose a two-step training procedure, which we illustrate in Fig. 5. Specifically, we first train a feature space encoder \({\mathcal {E}}_{{\mathcal {F}}}\), using the ResNet-18 (He et al., 2016) architecture, on the task of real image inversion. \({\mathcal {E}}_{{\mathcal {F}}}\) takes as input a real image and predicts the shift \(\Delta f_4\) that updates the feature map \(f_4\) as:

$$\begin{aligned} \hat{f_4} = f_4 + \Delta f_4, \end{aligned}$$
(10)

where \(f_4\) is the feature map calculated using the inverted latent code \({\textbf{w}}\). The training objective in this step consists of the reconstruction losses, namely the identity, perceptual, pixel-wise, and style losses, calculated between the reconstructed images \(\hat{\mathbf {{I}}}\) and the real images \({\textbf{I}}\), as described in Eq. (6). It is worth noting that we only refine the \(4^{th}\) feature layer of StyleGAN2’s generator \({\mathcal {G}}\), which we found to be particularly beneficial for the face reenactment task, in contrast to later feature layers which, despite being able to reconstruct the real images almost perfectly, suffer from poor semantic editability (as shown by Yao et al. (2022b)).
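A possible sketch of \({\mathcal {E}}_{{\mathcal {F}}}\) is given below; the ResNet-18 backbone follows the text, while the output spatial resolution and channel count of \(\Delta f_4\) (here \(512\times 16\times 16\)) and the upsampling head are assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class FeatureEncoder(nn.Module):
    """E_F sketch: maps a real image to a shift Df4 for the 4th feature layer
    of StyleGAN2's generator (Eq. 10)."""

    def __init__(self, out_channels: int = 512, out_size: int = 16):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, H/32, W/32)
        self.head = nn.Sequential(
            nn.Upsample(size=(out_size, out_size), mode="bilinear", align_corners=False),
            nn.Conv2d(512, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(image))   # Df4, so that f4_hat = f4 + Df4
```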

As discussed above, directly using the updated feature map \(\hat{f_4}\) to refine details on the edited images leads to visual artifacts. To address this, we propose a framework that efficiently learns to predict the updated feature map of the edited image, \(\hat{f_4^r}\), from the refined source feature map \(\hat{f_4^s}\). We illustrate this in Fig. 6: given a source and a target image pair, we first calculate the reenacted latent code \({\textbf{w}}_r\) as described in Sect. 3.4. We note that the directions matrix \({\textbf{A}}\) and the real image inversion encoder \({\mathcal {E}}_w\) are frozen during this training step. Then, using the feature encoder \({\mathcal {E}}_{{\mathcal {F}}}\), we calculate the refined source feature map \(\hat{f_4^s}\) using (10). In order to calculate the refined feature map of the reenacted image, \(\hat{f_4^r}\), we introduce the Feature Transformation (FT) module, which takes as input the difference between the refined source feature map \(\hat{f_4^s}\) and the reenacted feature map \(f_4^r\), and outputs the shift \(\Delta f_4^r\), which is then used to calculate the updated feature map \(\hat{f_4^r}\) via (10). As shown in Fig. 6, the proposed Feature Transformation (FT) module predicts two modulation parameters, namely \(\gamma \) and \(\beta \), that transform the shift \(\Delta f_4^s\) of the source feature map into the shift \(\Delta f_4^r\) of the reenacted feature map as:

$$\begin{aligned} \Delta f_4^r = \gamma \odot \Delta f_4^s + \beta . \end{aligned}$$
(11)

As illustrated in Fig. 6, the FT module consists of two convolutional blocks with 2 convolutional layers each. We note that in this training step we train both the FT module and the feature space encoder \({\mathcal {E}}_{{\mathcal {F}}}\). Our training objective consists of the reconstruction losses, namely identity, perceptual, pixel-wise and style, calculated between the reenacted and the target images (described in detail in Sect. 3.4).
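The FT module can be sketched as follows; the channel count and the exact layout of the two convolutional blocks (activations, kernel sizes) are assumptions beyond what the text specifies.

```python
import torch
import torch.nn as nn

class FeatureTransformation(nn.Module):
    """FT module sketch: predict modulation parameters gamma and beta from the
    difference between the refined source feature map and the reenacted feature
    map, then transform the source shift into the reenacted shift (Eq. 11)."""

    def __init__(self, channels: int = 512):
        super().__init__()
        def conv_block():
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            )
        self.gamma_net, self.beta_net = conv_block(), conv_block()

    def forward(self, f4_s_hat, f4_r, delta_f4_s):
        diff = f4_s_hat - f4_r                          # refined source minus reenacted map
        gamma, beta = self.gamma_net(diff), self.beta_net(diff)
        delta_f4_r = gamma * delta_f4_s + beta          # Eq. (11)
        return f4_r + delta_f4_r                        # refined reenacted feature map
```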

Fig. 6

Training of the feature space encoder \({\mathcal {E}}_{{\mathcal {F}}}\) and the Feature Transformation (FT) module to efficiently refine the feature map \(f_4^r\) of the reenacted images

Finally, in Fig. 7 we present indicative results of the variant of our method that learns to refine the feature space \({\mathcal {F}}\) (“FSR”), in comparison to the variant described in Sect. 3.4 (“Joint Training”) and Parmar et al. (2022) (“SAM”). We note that using only the \({\mathcal {W}}+\) latent space (Joint Training, Sect. 3.4) leads to relatively faithful reconstruction, albeit without reconstructing every detail of the background or the hair styles. As we will show in the experimental section, both qualitatively and quantitatively, as well as in the conducted user study, this level of detail is crucial for the task of face reenactment. SAM (Parmar et al., 2022) is able to better reconstruct the background; however, the reenacted images suffer from visual artifacts (marked with red arrows in Fig. 7) and thus look unrealistic, especially around the face area. By contrast, the proposed framework that learns to refine the feature space \({\mathcal {F}}\) (“FSR”) leads to notably more faithful face reenactment while exhibiting fewer artifacts.

Fig. 7

Face reenactment examples using only the \({\mathcal {W}}+\) latent space (“Joint Training”), SAM method (Parmar et al., 2022) and our proposed method for feature space refinement (“FSR”)

4 Experiments

In this section, we present qualitative and quantitative results, along with a user study, in order to evaluate the proposed framework (all its variants) in the task of neural face reenactment and compare with several recent state-of-the-art (SOTA) approaches. The bulk of our results and comparisons, reported in Sect. 4.1, are on self- and cross-person reenactment on the VoxCeleb1 (Nagrani et al., 2017) dataset. Comparisons with state-of-the-art on the VoxCeleb2 (Chung et al., 2018) test set are provided in the appendices. Finally, in Sect. 4.2 we report ablation studies on the various design choices of our method and in Sect. 4.3 we discuss its limitations.

Implementation details We fine-tune StyleGAN2 on the VoxCeleb1 dataset at \(256\times 256\) image resolution and train the e4e encoder of Tov et al. (2021) for real image inversion. The 3D shape model we use (i.e., the Net3D module shown in Figs. 1 and 3) is DECA (Feng et al., 2021). For the training procedures described in Sects. 3.1, 3.2, and 3.3, we only learn the directions matrix \({\textbf{A}}\in {{\mathbb {R}}}^{(N_l\times 512) \times k}\), where \(k=3+m_e\), \(m_e = 12\), and \(N_l=8\). We train three matrices of directions: (i) the first on synthetically generated images (Sect. 3.1), (ii) the second on mixed real and synthetic data (Sect. 3.2), and (iii) the third by fine-tuning (ii) on paired data (Sect. 3.3). Additionally, in the proposed joint training scheme (Sect. 3.4), we fine-tune both the directions matrix \({\textbf{A}}\) and the real image inversion encoder \({\mathcal {E}}_w\). Finally, in the feature space refinement variant (Sect. 3.5), we train both the feature space encoder \({\mathcal {E}}_{{\mathcal {F}}}\) and the proposed Feature Transformation (FT) module. It is worth noting that during the first and second training phases, we perform cross-subject training, i.e., the source and target faces have different identities. This approach enables our model to generalize effectively across various identities, resulting in improved performance on the challenging task of cross-subject reenactment. In the remaining training phases, we perform self reenactment, where the source and target faces are sampled from the same video. For training, we used the Adam optimizer (Kingma & Ba, 2015) with a constant learning rate of \(10^{-4}\). All models are implemented in PyTorch (Paszke et al., 2019).

4.1 Comparison with State-of-the-Art on VoxCeleb

In this section, we compare the performance of our method against the state-of-the-art in face reenactment on VoxCeleb1 (Nagrani et al., 2017). We conduct two types of experiments, namely self- and cross-person reenactment. For evaluation purposes, we use both the video data provided by Zakharov et al. (2019) and the original test set of VoxCeleb1. We note that there is no overlap between the train and test identities and videos. We compare our method quantitatively and qualitatively with nine methods: X2Face (Wiles et al., 2018), FOMM (Siarohin et al., 2019), Fast bi-layer (Zakharov et al., 2020), Neural-Head (Burkov et al., 2020), LSR (Meshry et al., 2021), PIR (Ren et al., 2021), HeadGAN (Doukas et al., 2021), Dual (Hsu et al., 2022), and Face2Face (Yang et al., 2022). For X2Face (Wiles et al., 2018), FOMM (Siarohin et al., 2019), PIR (Ren et al., 2021), HeadGAN (Doukas et al., 2021), and Face2Face (Yang et al., 2022), we use the models pre-trained by the authors on VoxCeleb1. For Fast bi-layer (Zakharov et al., 2020), Neural-Head (Burkov et al., 2020), and LSR (Meshry et al., 2021), we use the models pre-trained by the authors on VoxCeleb2 (Chung et al., 2018). Regarding Dual (Hsu et al., 2022), we use the model pre-trained by the authors on both the VoxCeleb (Chung et al., 2018; Nagrani et al., 2017) and MPIE (Gross et al., 2010) datasets. For a fair comparison with Neural-Head (Burkov et al., 2020), LSR (Meshry et al., 2021), and Dual (Hsu et al., 2022), we evaluate their models under the one-shot setting. We will refer to our method that optimizes the generator’s weights during inference as Latent Optimization Reenactment (LOR), whereas LOR+ refers to our final model with joint training and feature space refinement. We note that in the LOR+ model, we do not optimize the generator weights.

4.1.1 Quantitative Comparisons

We report eight different metrics. We compute the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) to measure perceptual similarity and, to quantify identity preservation, the cosine similarity (CSIM) of ArcFace (Deng et al., 2019) features. Moreover, we measure the quality of the reenacted images using the Fréchet Inception Distance (FID) metric (Heusel et al., 2017), while we also report the Fréchet Video Distance (FVD) (Skorokhodov et al., 2022; Unterthiner et al., 2018), which measures both the video quality and the temporal consistency of the generated videos. To quantify head pose/expression transfer, we calculate the normalized mean error (NME) between the predicted landmarks of the reenacted and target images. We use the landmark estimator of Bulat and Tzimiropoulos (2017), and we compute the NME by normalizing the landmark error by the square root of the ground-truth face bounding box area, scaled by a factor of \(10^3\). We further evaluate the head pose transfer by calculating the average \({\mathcal {L}}_1\) distance of the head pose orientation (Average Rotation Distance, ARD) in degrees, and the expression transfer by calculating the average \({\mathcal {L}}_1\) distance of the expression coefficients \({\textbf{p}}_{e}\) (Average Expression Distance, AED) and the Action Units Hamming distance (AU-H), computed as in Doukas et al. (2021).
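For reference, the ARD and AED metrics described above reduce to simple mean absolute differences; the sketch below is illustrative and not the exact evaluation code.

```python
import numpy as np

def average_rotation_distance(euler_reenacted, euler_target):
    """ARD: mean L1 distance (in degrees) between yaw/pitch/roll angles of the
    reenacted and target frames; inputs of shape (num_frames, 3)."""
    return float(np.abs(euler_reenacted - euler_target).mean())

def average_expression_distance(p_e_reenacted, p_e_target):
    """AED: mean L1 distance between expression coefficients p_e of the reenacted
    and target frames; inputs of shape (num_frames, m_e)."""
    return float(np.abs(p_e_reenacted - p_e_target).mean())
```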

Table 1 Quantitative results on self-reenactment. The results are reported on the combined original test set of VoxCeleb1 (Nagrani et al., 2017) and the test set released by Zakharov et al. (2019). For CSIM metric, higher is better (\(\uparrow \)), while in all other metrics lower is better (\(\downarrow \))
Table 2 Quantitative comparisons on the benchmark set (Benchmark-L) with image pairs from the VoxCeleb1 dataset, where the average head pose distance is larger than \(10^{\circ }\). For CSIM metric, higher is better (\(\uparrow \)), while in all other metrics lower is better (\(\downarrow \))
Fig. 8

Qualitative results and comparisons for self (top three rows) and cross-subject reenactment (last three rows) on VoxCeleb1. The first and second columns show the source and target faces. Our method preserves the appearance and identity characteristics (e.g., face shape) of the source face significantly better and also faithfully transfers the target head pose/expression without producing visual artifacts

In Table 1 we report quantitative results on self reenactment, using the original test set of VoxCeleb1 (Nagrani et al., 2017) and the test set provided by Zakharov et al. (2019). Additionally, in Table 2 we report results under a more challenging self-reenactment condition, where the source and target faces have a large head pose difference. Specifically, we randomly selected 1,000 image pairs from the test set of VoxCeleb1 with a head pose distance larger than \(10^{\circ }\). The head pose distance is calculated as the average of the absolute differences of the three Euler angles (i.e., yaw, pitch, and roll) between the source and target faces. In the appendices (Sect. A.4), we provide additional details regarding our benchmark dataset. We note that in self reenactment, all metrics are calculated between the reenacted and the target faces. As shown in Table 1, the warping-based methods, namely X2Face, PIR, HeadGAN, and Face2Face, achieve high CSIM values; however, we argue that this is due to their warping-based technique, which enables better reconstruction of the background and other identity characteristics. Importantly, these results are accompanied by poor quantitative and qualitative results when there is a significant change in the head pose (e.g., see Fig. 8 and Table 2). Additionally, regarding head pose/expression transfer, our method (LOR+) achieves NME results similar to Fast Bi-layer (Zakharov et al., 2020), while on the ARD and AED metrics we outperform all methods. Our results on the FID and FVD metrics confirm that the quality of our generated videos resembles that of the VoxCeleb dataset. Finally, under the challenging condition of large head pose differences between the source and target faces (Table 2), our method (LOR+) outperforms all methods.

Cross-subject reenactment is more challenging than self reenactment, as the source and target faces have different identities, and in this case it is important to maintain the source identity characteristics without transferring the target ones. In Table 3, we report the quantitative results for cross-subject reenactment, where we randomly select 200 video pairs from the small test set of Zakharov et al. (2019). In this task, the CSIM metric is calculated between the source and the reenacted faces, while the ARD, AED, and AU-H metrics are calculated between the target and the reenacted faces. As shown in Table 3, our method (LOR+) achieves the best results on head pose and expression transfer, while also achieving a high CSIM score. It is worth noting that the high CSIM values of FOMM, HeadGAN, and Face2Face are not accompanied by good qualitative results, as shown in Figs. 8 and 27, where in most cases those methods are not able to generate realistic images.

Table 3 Quantitative results on cross-subject reenactment. For CSIM metric, higher is better (\(\uparrow \)), while in all other metrics lower is better (\(\downarrow \))

To further evaluate the performance of the compared reenactment methods, we conduct a user study in which we ask 30 users to select the method that best reenacts the source frame on the self and cross-subject reenactment tasks. For the purposes of the user study, we utilise only our final model (LOR+). The results are reported in Table 14; as shown, our method is the most preferred by a large margin (52.1% versus 19.2% for the second-best method), which also validates our quantitative results.

Additionally, in Table 4 we report comparisons of the inference time required to generate a video of 200 frames. As shown, X2Face (Wiles et al., 2018) and FOMM (Siarohin et al., 2019) are the fastest methods; however, their overall performance (quantitative and qualitative results) is unsatisfactory (i.e., they exhibit visual artifacts). Our proposed method (LOR+) is able to generate compelling reenacted images while also being competitive in terms of inference time. Notably, our final model (LOR+) outperforms our model that requires the optimization step (LOR), which is a time-consuming operation.

Table 4 Quantitative comparisons on inference time required to generate a video of 200 frames

4.1.2 Qualitative Comparisons

Quantitative comparisons alone are insufficient to capture the quality of reenactment. Hence, we opt for qualitative visual comparisons in multiple ways: (a) results in Fig. 8, (b) in the appendices, we provide more results on self and cross-subject reenactment both on the VoxCeleb1 and VoxCeleb2 datasets (Figs. 23, 26, 27, 28, 29), and (c) we also provide a supplementary video with self and cross-subject reenactment results from the VoxCeleb1 and VoxCeleb2 datasets. As we can see from Fig. 8 and the videos provided in the supplementary material, our method provides, for the majority of videos, the highest reenactment quality, including accurate transfer of head pose and expression and significantly enhanced identity preservation compared to all other methods. Importantly, one great advantage of our method in cross-subject reenactment, as shown in Fig. 8, is that it is able to reenact the source face with minimal identity leakage (e.g., facial shape) from the target face, in contrast to landmark-based methods such as Fast Bi-layer (Zakharov et al., 2020). Finally, to show that our method generalises well to other facial video datasets, we provide additional results on the FaceForensics (Rössler et al., 2018) and 300-VW (Shen et al., 2015) datasets in the appendices (Fig. 30).

Table 5 Quantitative results of the various models of our work on self reenactment (SR), self reenactment with image pairs that have large head pose difference (SR - large head pose) and on cross-subject reenactment (CR)

4.2 Ablation Studies

In this section, we perform several ablation tests to (a) assess the different variants of our method, i.e., the optimization of generator \({\mathcal {G}}\) during inference (Sect. 3.3), the proposed joint training scheme (Sect. 3.4) and the refinement of the feature space (Sect. 3.5), (b) measure the impact of the identity and perceptual losses, and the additional shape losses for the eyes and mouth (Sect. 3.1), (c) validate our trained models on synthetic, mixed and paired images, and (d) measure the impact of the style and cycle losses (introduced in Sect. 3.4).

Fig. 9

Qualitative comparisons of the various models of our work on self reenactment

For (a), we report results of our method on self and cross-subject reenactment with our model (LOR), described in Sect. 3.3, without performing optimization (w/o opt.) and with optimization (w/ opt.) of the generator \({\mathcal {G}}\) during inference. We also report results of our final model (LOR+) without the additional feature space refinement (FSR) of Sect. 3.5, i.e., using only the joint training scheme of Sect. 3.4, and with feature space refinement. As shown in Table 5, the optimization of \({\mathcal {G}}\) during inference improves our results (as expected), especially regarding identity preservation (CSIM), compared to our model without optimization. Nevertheless, our proposed joint training scheme (LOR+ w/o FSR) achieves the same results on the image reconstruction metrics (CSIM and LPIPS) and improves our results on head pose/expression transfer (ARD, AED) without performing any optimization of the generator. Additionally, the proposed refinement of the StyleGAN2 feature space (LOR+ w/ FSR) improves all quantitative results. It is worth mentioning that the newly proposed components (Joint Training and Feature Space Refinement), compared to our previous work (Bounareli et al., 2022), improve our results especially on the challenging tasks of self reenactment with large head pose differences between the source and target faces and cross-subject reenactment. Figure 9 illustrates results on self reenactment using the models described above. As shown, LOR without optimization cannot accurately reconstruct the identity of the source face, while with optimization the identity details are better reconstructed but the reenacted images contain noticeable visual artifacts. By contrast, the proposed joint training scheme (LOR+ w/o FSR) is able to accurately reconstruct the identity of the source faces and produce artifact-free images without any subject fine-tuning. Finally, the proposed feature space refinement (LOR+ w/ FSR) further improves our qualitative results by producing more realistic images (i.e., better background and hair style reconstruction).

For (b), we perform experiments on synthetic images with and without the identity and perceptual losses. To evaluate the models, we randomly generate 5K pairs of synthetic images (source and target) and reenact each source image with the head pose and expression of the corresponding target. As shown in Table 6, incorporating the identity and perceptual losses is crucial for isolating the latent directions that strictly control head pose and expression without affecting the identity of the source face. In a similar fashion, Table 6 shows the impact of the additional shape losses, namely the eye \({\mathcal {L}}_{eye}\) and mouth \({\mathcal {L}}_{mouth}\) losses: omitting them leads to higher head pose and expression errors. The impact of these losses is also evident in our qualitative comparisons in Fig. 10. When we exclude the identity and perceptual losses from training, the generated images lack several appearance details, while omitting the eye and mouth losses leads to less accurate facial expression transfer.
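For reference, the sketch below shows one common way such losses are formed. The face-recognition backbone, the 68-point landmark index ranges, and the cosine/L1 formulations are assumptions for illustration, not the exact definitions of Sect. 3.1.

```python
import torch
import torch.nn.functional as F

def identity_loss(id_net, reenacted, source):
    """1 - cosine similarity between embeddings of a pre-trained
    face-recognition network (e.g. an ArcFace-style model)."""
    e_r = F.normalize(id_net(reenacted), dim=-1)
    e_s = F.normalize(id_net(source), dim=-1)
    return (1.0 - (e_r * e_s).sum(dim=-1)).mean()

def eye_mouth_losses(lmk_reenacted, lmk_target):
    """Shape losses on landmarks of shape (B, 68, 2), assuming the
    standard 68-point layout; the index ranges are illustrative."""
    eyes, mouth = slice(36, 48), slice(48, 68)
    l_eye = F.l1_loss(lmk_reenacted[:, eyes], lmk_target[:, eyes])
    l_mouth = F.l1_loss(lmk_reenacted[:, mouth], lmk_target[:, mouth])
    return l_eye, l_mouth
```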

Table 6 Ablation study on the impact of the identity \({\mathcal {L}}_{id}\) and perceptual \({\mathcal {L}}_{per}\) losses, and of the eye \({\mathcal {L}}_{eye}\) and mouth \({\mathcal {L}}_{mouth}\) losses. CSIM is calculated between the source and the reenacted images, which are in different head poses and expressions
Fig. 10 Qualitative comparisons on the impact of the identity \({\mathcal {L}}_{id}\) and perceptual \({\mathcal {L}}_{per}\) losses, and of the eye \({\mathcal {L}}_{eye}\) and mouth \({\mathcal {L}}_{mouth}\) losses

For (c), we evaluate the three different training schemes for self-reenactment, namely synthetic only (Sect. 3.1), mixed synthetic-real (Sect. 3.2), and mixed synthetic-real fine-tuned with paired data (Sect. 3.3). The results, reported in Table 7 and Fig. 11, illustrate the impact of each training scheme, with the one using paired data providing, as expected, the best results. As shown in Fig. 11, our final model trained with paired data produces more realistic images with fewer artifacts.

Table 7 Ablation studies on self-reenactment using three different models: (a) trained on synthetic images, (b) trained on both synthetic and real images, and (c) fine-tuned on paired data
Fig. 11 Qualitative results of the three different models: trained on synthetic images, trained on both synthetic and real images, and fine-tuned on paired data

Table 8 Ablation study on the impact of style \({\mathcal {L}}_{style}\) and cycle \({\mathcal {L}}_{cycle}\) losses

Finally, for (d) we perform experiments on self reenactment using our model with the joint training scheme, without the style loss \({\mathcal {L}}_{style}\) and without the cycle loss \({\mathcal {L}}_{cycle}\). As shown in Table 8, our final model trained with both losses performs better both on identity preservation and on head pose/expression transfer. This is also illustrated in Fig. 12, where the final model using both the style and the cycle loss better preserves the identity and appearance of the source face.
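To make the role of these two terms concrete, the sketch below shows one common way of instantiating them. The Gram-matrix form of the style loss and the L1 reconstruction in the cycle term are assumptions for illustration, not the exact definitions of Sect. 3.4.

```python
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram matrix of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(feats_reenacted, feats_source):
    """Match Gram matrices over a list of feature maps (e.g. VGG layers)
    to preserve the source appearance."""
    return sum(F.l1_loss(gram(a), gram(b))
               for a, b in zip(feats_reenacted, feats_source))

def cycle_loss(reenact, source_img, target_img):
    """Reenact the source with the target's pose/expression, map the result
    back to the source's own pose/expression, and compare with the source."""
    forward = reenact(source_img, target_img)
    backward = reenact(forward, source_img)
    return F.l1_loss(backward, source_img)
```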

Fig. 12 Qualitative comparisons on the impact of the style \({\mathcal {L}}_{style}\) and cycle \({\mathcal {L}}_{cycle}\) losses

Fig. 13 Cases where the reconstruction of facial accessories such as hats fails. The first two columns show the source and target images, while the reenacted images are presented in the last column

4.3 Limitations

As shown in both our quantitative and qualitative results, our method is able to efficiently reenact the source faces, preserving the source identity characteristics and faithfully transferring the target head pose and expression. Our proposed method, which is based on a pre-trained StyleGAN2 model, enables both self and cross-subject reenactment using only one source frame and without any further subject fine-tuning. The proposed joint training scheme of the real image encoder \({\mathcal {E}}_w\) and the direction matrix \({\textbf{A}}\) enables more accurate identity reconstruction and facial image editing with fewer visual artifacts, especially in the challenging case of extreme head poses. Additionally, the refinement of StyleGAN2's feature space \({\mathcal {F}}\) enables better reconstruction of various image details, including the background, hair style/color, and facial accessories, resulting in visually more realistic images. Nevertheless, as shown in Fig. 13, for hair accessories such as hats, which are underrepresented in the training dataset, our method is not able to faithfully reconstruct every detail when editing the head pose.

5 Conclusions

In this paper, we presented a novel approach to neural head/face reenactment that uses a 3D shape model to learn disentangled directions of head pose and expression in the latent GAN space. This approach comes with specific advantages, such as the use of powerful pre-trained GANs and 3D shape models, which have been thoroughly developed and studied by the research community over the past years. Our method successfully disentangles the facial movements and the appearance of the input images by leveraging the disentangled properties of the pre-trained StyleGAN2 model. Consequently, our framework effectively mimics the target head pose and expression without transferring identity details from the driving images. Additionally, our method features several favorable properties, including one-shot face reenactment without the need for subject-specific fine-tuning. It also allows for improved cross-subject reenactment through the proposed unpaired training with synthetic and real images. While our method demonstrates compelling results, it relies on the capabilities of the StyleGAN2 model, which are bounded by the distribution of its training dataset. If the dataset lacks diversity in terms of complex backgrounds or facial accessories such as hats and glasses, this can limit our model's ability to generalize to more complex datasets. This limitation highlights the importance of using more diverse video datasets when training the generative models.

Fig. 14 Analysis of the correlation between shifts \(\Vert \Delta {\textbf{w}}\Vert \) in the latent space and the predicted changes \(|\hat{\Delta {\textbf{p}}}|\) in the parameter space. We show results for four different attributes (yaw and pitch angles, smile, and open mouth). For all attributes the correlation is high, indicating a strong linear relationship
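Since the figure reports a linear relationship, the per-attribute score can be read as a Pearson correlation between the latent shift magnitudes and the absolute predicted parameter changes; the snippet below is a hypothetical illustration of that computation rather than the exact analysis code.

```python
import torch

def attribute_correlation(delta_w, delta_p):
    """Pearson correlation between latent shift magnitudes ||Δw|| and the
    absolute predicted changes |Δp| of one attribute's parameter.
    delta_w: (N, D) latent shifts; delta_p: (N,) predicted parameter changes."""
    x = delta_w.norm(dim=1)   # magnitude of each latent shift
    y = delta_p.abs()         # magnitude of the induced parameter change
    return torch.corrcoef(torch.stack([x, y]))[0, 1]
```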

Finally, we acknowledge that although face reenactment can be used in a variety of applications, such as art, entertainment, and video conferencing, it can also be applied for malicious purposes, including the creation of deepfakes, that could potentially harm individuals and society. It is important for researchers in our field to be aware of the potential risks and to promote the responsible use of this technology.