One-shot Neural Face Reenactment via Finding Directions in GAN's Latent Space

In this paper, we present our framework for neural face/head reenactment whose goal is to transfer the 3D head orientation and expression of a target face to a source face. Previous methods focus on learning embedding networks for identity and head pose/expression disentanglement which proves to be a rather hard task, degrading the quality of the generated images. We take a different approach, bypassing the training of such networks, by using (fine-tuned) pre-trained GANs which have been shown capable of producing high-quality facial images. Because GANs are characterized by weak controllability, the core of our approach is a method to discover which directions in latent GAN space are responsible for controlling head pose and expression variations. We present a simple pipeline to learn such directions with the aid of a 3D shape model which, by construction, inherently captures disentangled directions for head pose, identity, and expression. Moreover, we show that by embedding real images in the GAN latent space, our method can be successfully used for the reenactment of real-world faces. Our method features several favorable properties including using a single source image (one-shot) and enabling cross-person reenactment. Extensive qualitative and quantitative results show that our approach typically produces reenacted faces of notably higher quality than those produced by state-of-the-art methods for the standard benchmarks of VoxCeleb1&2.


Introduction
Neural face reenactment aims to transfer the rigid 3D face/head orientation and the deformable facial expression of a target facial image to a source facial image. Such technology is the key enabler for creating high-quality digital head avatars that find a multitude of applications in telepresence, Augmented Reality/Virtual Reality (AR/VR), and the creative industries. Recently, thanks to the advent of Deep Learning, neural face reenactment has seen remarkable progress Burkov et al (2020); Zakharov et al (2020); Wang et al (2021b); Meshry et al (2021). In spite of this, synthesizing photorealistic face/head sequences remains a challenging problem, with the quality of existing solutions being far from sufficient for the demanding applications mentioned above.
A major challenge that most prior works Bao et al (2018); Zeng et al (2020); Zakharov et al (2019, 2020); Burkov et al (2020); Ha et al (2020) have focused on is how to achieve identity and head pose/expression disentanglement, in order to both preserve the appearance and identity characteristics of the source face and successfully transfer the head pose and expression of the target face. A recent line of research relies on training conditional Generative Adversarial Networks (GANs) Deng et al (2020); Kowalski et al (2020); Shoshan et al (2021) in order to produce disentangled embeddings and control the generation process. However, such methods mainly focus on synthetic image generation, rendering reenactment on real faces challenging. Another line of works Zakharov et al (2019, 2020) relies on training with paired data (i.e., source and target facial images of the same identity), leading to poor cross-person face reenactment.
In this work, we propose a neural face reenactment framework that addresses the aforementioned limitations of the state of the art (SOTA), motivated by the remarkable ability of modern pre-trained GANs (e.g., StyleGAN Karras et al (2019, 2020b,a)) in generating realistic and aesthetically pleasing faces, often indistinguishable from real ones. The research question we address in this paper is: can a pre-trained GAN be adapted for face reenactment? A key challenge that needs to be addressed to this end is the absence of any inherent semantic structure in the latent space of GANs. In order to gain control over the generative process, inspired by Voynov and Babenko (2020), we propose to learn a set of latent directions (i.e., direction vectors in the GAN's latent space) that are responsible for controlling head pose and expression variations in the generated facial images. Knowledge of these directions directly equips the pre-trained GAN with the ability of controllable generation in terms of head pose and expression, allowing for effective face reenactment. Specifically, in this work we present a simple pipeline to learn such directions, leveraging the ability of a linear 3D shape model Feng et al (2021) in capturing disentangled directions for head pose, identity, and expression, which is crucial towards effective neural face reenactment. Moreover, another key challenge that needs to be addressed is how to use the GAN for the manipulation of real-world images. Capitalizing on Tov et al (2021), we further show that by embedding real images in the GAN latent space, our pipeline can be successfully used for real face reenactment. Overall, among our contributions is a joint training scheme (Sect. 3.4) that eliminates the need for the optimization step during inference, described in Sect. 3.2, resulting in a more efficient inference process and better quantitative and qualitative results. The proposed joint training scheme efficiently addresses visual artifacts on the reenacted images caused by large head pose variations between the source and target faces, resulting in improved overall image quality. We qualitatively and quantitatively show that by jointly learning the real image inversion encoder and the directions, our method achieves compelling results without the need of one-shot fine-tuning during inference. Finally, to further improve the visual quality of the reenacted images in terms of crucial (for the purpose of face reenactment) background and identity characteristics, we propose to further fine-tune the feature space F of StyleGAN2 (Sect. 3.5).

Related work

Semantic face editing
Several recent works Shen et al (2020); Härkönen et al (2020); Voynov and Babenko (2020); Shen and Zhou (2021); Oldfield et al (2021); Tzelepis et al (2021); Yao et al (2021); Yang et al (2021); Oldfield et al (2023); Tzelepis et al (2022) study the existence of directions/paths in the latent space of a pre-trained GAN in order to perform editing (i.e., with respect to specific facial attributes) on the generated facial images. Voynov and Babenko (2020) introduced an unsupervised method that optimizes a set of vectors in the GAN's latent space by learning to distinguish (using a "reconstructor" network) the image transformations caused by distinct latent directions. This leads to the discovery of a set of "interpretable", but not "controllable", directions; i.e., the optimized latent directions cannot be used for controllable (in terms of head pose and expression) facial editing and, thus, for face reenactment. Our method is inspired by the work of Voynov and Babenko (2020), extending it in several ways to make it suitable for neural face reenactment. Another line of recent works allows for explicit controllable facial image editing Deng et al (2020); Ghosh et al (2020); Durall Lopez et al (2021); Shoshan et al (2021); Wang et al (2021a); Nitzan et al (2020); Abdal et al (2021).
However, these methods mostly rely on synthetic image editing rather than performing face reenactment on real video data. A work that is related to our framework is StyleRig Tewari et al (2020b), which uses 3D Morphable Model (3DMM) Blanz and Vetter (1999) parameters to control the images generated by a pre-trained StyleGAN2 Karras et al (2020b). However, in contrast to our method, StyleRig's training pipeline is not end-to-end and is significantly more complicated than ours, while in order to learn better disentangled directions, StyleRig requires the training of distinct models for different attributes (e.g., head pose and expression). This, along with the fact that StyleRig operates mainly on synthetic images, poses a notable restriction towards real-world face reenactment, where various facial attributes change simultaneously. By contrast, we propose to learn all disentangled directions for face reenactment simultaneously, allowing in this way for the effective editing of all, a subset, or a single attribute, whilst we also optimize our framework on real faces. A follow-up work, PIE Tewari et al (2020a), focuses on inverting real images to enable editing using StyleRig Tewari et al (2020b). However, their method is computationally expensive (10 min/image), which is prohibitive for video-based facial reenactment. By contrast, we propose a framework that effectively and efficiently performs face reenactment (0.13 sec/image).

GAN inversion
GAN inversion methods aim to encode real images into the latent space of pre-trained GANs Karras et al (2019, 2020b), allowing for subsequent editing using existing methods of synthetic image manipulation. The major challenge in the GAN inversion problem comprises the so-called "editability-perception" trade-off; that is, finding a sweet spot between faithful reconstruction of the real image and the editability of the corresponding latent code. Wang et al (2022a) train encoder-based architectures that focus on predicting the latent codes w that best reconstruct the original (real) images and that allow for subsequent editing. Zhu et al (2020) propose a hybrid approach which consists of learning an encoder followed by an optimization step on the latent space to refine the similarity between the reconstructed and real images. Richardson et al (2021) introduce a method that aims to improve the "editability-perception" trade-off, while more recently Roich et al (2021) propose to fine-tune the generator to better capture/transfer appearance features.
The aforementioned works typically perform inversion onto the W+ latent space of StyleGAN2. However, Parmar et al (2022) have shown that W+ is not capable of fully reconstructing the real images. Specifically, details such as the background, the hair style, or facial accessories, e.g., hats and glasses, cannot be inverted with high fidelity. A recent line of works Wang et al (2022a); Yao et al (2022a); Bai et al (2022); Alaluf et al (2022) proposes to mitigate this by investigating more expressive spaces of StyleGAN2 (such as the feature space F ∈ R h×w×c Kang et al (2021)) to perform real image inversion. Although such methods are able to produce high quality reconstructions, their ability to accurately edit the inverted images is limited. Especially when changing the head pose, such methods tend to produce many visual artifacts (Fig. A12). In order to balance between expressive invertibility and editing performance, the authors of SAM Parmar et al (2022) propose to fuse different spaces, i.e., the W+ latent space and the feature space F = {F 4 , F 6 , F 8 , F 10 }, where each element corresponds to a different feature layer of StyleGAN2 Karras et al (2020b). In more detail, they propose to break the facial images into different segments (background, hat, glasses, etc.) and choose the most suitable space to invert each segment, leveraging the editing capabilities of the W+ latent space and the reconstruction quality of the feature space F. However, when performing global edits, e.g., changing the head pose orientation, SAM Parmar et al (2022) results in notable visual artifacts, in contrast to our method, as will be shown in the experimental section.

Neural face reenactment
Neural face reenactment poses a challenging problem that requires strong generalization ability across many different identities and a large range of head poses and expressions. Many of the proposed methods rely on facial landmark information Zakharov et al (2019); Tripathy et al (2020); Zhang et al (2020); Ha et al (2020); Tripathy et al (2021); Zakharov et al (2020); Wang et al (2022b); Hsu et al (2022). Specifically, Zakharov et al (2020) propose a one-shot face reenactment method driven by landmarks, which decomposes an image into pose-dependent and pose-independent components. A limitation of landmark-based methods is that landmarks preserve identity information, thus impeding their applicability to cross-subject face reenactment Burkov et al (2020). In order to mitigate this limitation, Hsu et al (2022) propose to use an ID-preserving Shape Generator (IDSG) that transforms the target facial landmarks so that they preserve the identity, i.e., facial shape, of the source image. Additionally, several methods, e.g., Doukas et al (2021), rely on training conditional generative models on large paired datasets in order to learn facial descriptors with disentanglement properties. By contrast to such methods, in this paper we propose a novel and simple face reenactment framework that learns disentangled directions in the latent space of a StyleGAN2 Karras et al (2020b) pre-trained on the VoxCeleb Nagrani et al (2017) dataset. We show that the discovery of meaningful and disentangled directions that are responsible for controlling the head pose and the facial expression can be used for high quality self- and cross-identity reenactment.

Fig. 1: Overview of the proposed framework: Given a pair of source I s and target I t images, we calculate the corresponding head pose/expression parameter vectors p s and p t using the Net 3D network. The matrix of directions A is trained so that, given the shift ∆w = A∆p, the reenacted image I r , generated using the latent code w r = w s + ∆w, transfers the head pose and expression of the target face while maintaining the identity of the source face.

Proposed Method
In this section, we present the proposed framework for one-shot neural face reenactment via finding directions in the latent space of StyleGAN2. More specifically, we begin with the most basic variant of our framework for finding reenactment latent directions using unpaired synthetic images in Sect. 3.1 -- an overview of this is shown in Fig. 1. Next, in Sect. 3.2 we extend this methodology for handling real images along with synthetic ones (i.e., towards real face reenactment), while in Sect. 3.3 we investigate the incorporation of paired video data. In Sect. 3.4 we introduce a joint training scheme that allows for optimization-free reenactment, leading to efficient and consistent neural reenactment. Finally, in Sect. 3.5, on top of the previously introduced variants of our method, we propose the refinement of crucial visual details (i.e., background, hair style) by leveraging the impressive reconstruction capability of StyleGAN2's feature space F.

Finding reenactment latent directions on unpaired synthetic images

StyleGAN2 background
Let G denote the generator of StyleGAN2 Karras et al (2020b), as shown in Fig. 1. Specifically, G takes as input a latent code w ∈ W ⊂ R 512 , which is typically the output of StyleGAN2's input MLP-based Mapping Network f that acts on samples z ∈ R 512 drawn from the standard Gaussian N (0, I). That is, given a latent code z ∼ N (0, I), the generator produces an image G(w), where w = f (z). StyleGAN2 is typically pre-trained on the Flickr-Faces-HQ (FFHQ) dataset Karras et al (2019), which exhibits poor diversity in terms of head pose and facial expression; for instance, FFHQ does not typically account for roll changes in head pose. In order to compare our method with other state-of-the-art methods, we fine-tune StyleGAN2's generator G on the VoxCeleb dataset Nagrani et al (2017), which provides a much wider range of head poses and facial expressions, rendering it very useful for the task of neural face reenactment by finding the appropriate latent directions, as will be discussed in the following sections. We note that we fine-tune StyleGAN2's generator on the VoxCeleb dataset using the method provided by Karras et al (2020a), while we do not impose any reenactment objectives. That is, the fine-tuned generator can produce synthetic images with random identities (different from the identities of VoxCeleb) that follow the distribution of the VoxCeleb dataset in terms of head poses and expressions.

3D Morphable Model (Net3D)
Given an image, Net3D Feng et al (2021) encodes the depicted face into a facial shape vector s ∈ R 3N , where N denotes the number of vertices, which can be decomposed in terms of a linear 3D facial shape model as

s = s̄ + S i p i + S θ p θ + S e p e , (1)

where s̄ denotes the mean 3D facial shape, S i ∈ R 3N ×m i , S θ ∈ R 3N ×m θ , and S e ∈ R 3N ×m e denote the PCA bases for identity, head orientation, and expression, and p i , p θ , and p e denote the corresponding identity, head orientation, and expression coefficients, respectively. The variables m i , m θ , and m e correspond to the number of identity, head pose, and expression coefficients. For reenactment, we are interested in manipulating head orientation and expression; thus, our head pose/expression parameter vector is given as p = [p θ , p e ] ∈ R 3+m e .
We note that all PCA shape bases are orthogonal to each other and hence capture disentangled variations of identity and expression. Moreover, they are calculated in a frontalized reference frame; thus, they are also disentangled from head orientation. These bases can also be interpreted as directions in the shape space. We propose to learn similar directions in the GAN latent space, as described in detail in the following section.
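As a concrete illustration, the linear shape model of (1) can be sketched as follows; the bases and dimensions here are random, made-up stand-ins (the real ones come from the 3D shape model of Feng et al (2021)):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100                        # number of 3D vertices (toy value)
m_i, m_theta, m_e = 20, 3, 12  # identity / head-pose / expression coefficients

s_mean = rng.normal(size=3 * N)            # mean 3D facial shape, s-bar
S_id   = rng.normal(size=(3 * N, m_i))     # PCA basis for identity
S_pose = rng.normal(size=(3 * N, m_theta)) # PCA basis for head orientation
S_exp  = rng.normal(size=(3 * N, m_e))     # PCA basis for expression

def compose_shape(p_i, p_theta, p_e):
    """Linear 3DMM of Eq. (1): s = s_mean + S_id p_i + S_pose p_theta + S_exp p_e."""
    return s_mean + S_id @ p_i + S_pose @ p_theta + S_exp @ p_e

# Zero coefficients recover the mean shape.
s = compose_shape(np.zeros(m_i), np.zeros(m_theta), np.zeros(m_e))
assert np.allclose(s, s_mean)
```

Because the model is linear, editing head pose or expression amounts to moving along the columns of the corresponding basis; this is exactly the behavior we seek to mirror with directions in the GAN latent space.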

Reenactment latent directions
In particular, we propose to associate a change ∆p in head pose/orientation and expression with a change ∆w in the (intermediate) latent GAN space, so that the two generated images G(w) and G(w + ∆w) differ only in head pose and expression, by the amount ∆s induced by ∆p. If the directions sought in the GAN latent space are assumed to be linear Nitzan et al (2021), this implies the linear relationship ∆w = A∆p, where A ∈ R d out ×d in is a matrix whose columns represent the directions in the GAN latent space. In our case, d in = 3 + m e and d out = N l × 512, where N l is the number of the generator's layers to which we apply the shift changes.
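A minimal sketch of this linear relationship, using the dimensions reported later in the paper (m e = 12, N l = 8) and a randomly initialized A standing in for the trained directions matrix:

```python
import numpy as np

m_e = 12            # number of expression coefficients
N_l = 8             # number of generator layers receiving shifts
d_in = 3 + m_e      # 3 head-pose angles + expression coefficients
d_out = N_l * 512   # one 512-d shift per affected generator layer

rng = np.random.default_rng(0)
A = rng.normal(scale=0.01, size=(d_out, d_in))  # stand-in for the trained A

p_s = rng.normal(size=d_in)   # source pose/expression parameters
p_t = rng.normal(size=d_in)   # target pose/expression parameters

delta_w = A @ (p_t - p_s)            # latent shift, Δw = AΔp
delta_w = delta_w.reshape(N_l, 512)  # one 512-d shift per generator layer
```

Note that linearity means doubling ∆p doubles the latent shift, which is what makes the amount of change in pose/expression directly controllable.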

Training pipeline
In order to optimize the matrix of controllable latent directions A, we propose a simple pipeline, shown in Fig. 1. It is worth noting that the only trainable module of the proposed framework is the matrix A ∈ R d out ×d in ; i.e., the number of trainable parameters of the proposed framework is 65K. We also note that, before training, we estimate the distribution of each element of the head pose/expression parameters p by randomly generating 10K images and calculating their corresponding p vectors using the pre-trained Net3D Feng et al (2021). Using the estimated distributions, during training we re-scale each element of p from its original range to a common range [−a, a] (a being a hyperparameter empirically set to 6). In the appendices (Sect. A.1.1) we further discuss the rescaling of each element of p. To further encourage disentanglement in the optimized latent directions matrix A, we follow a training strategy where for 50% of the training samples we reenact only one attribute by using ∆p = [0, . . . , ε, . . . , 0], where ε is uniformly sampled from U[−a, a]. In the appendices (Sect. A.1.3) we show that the above training strategy improves the disentanglement between the learned directions.
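The rescaling and the single-attribute sampling strategy described above can be sketched as follows (a toy sketch; the per-element ranges `lo`/`hi` would in practice come from the estimated distributions of the 10K generated samples):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, a = 15, 6.0   # 3 pose + 12 expression parameters, common range [-a, a]

def rescale(p, lo, hi, a=6.0):
    """Map an element of p from its estimated range [lo, hi] to [-a, a]."""
    return (p - lo) / (hi - lo) * 2 * a - a

def sample_delta_p():
    """With probability 0.5, change a single attribute only, i.e.
    Δp = [0, ..., ε, ..., 0] with ε ~ U[-a, a] (disentanglement strategy);
    otherwise sample a full random Δp."""
    if rng.random() < 0.5:
        dp = np.zeros(d_in)
        dp[rng.integers(d_in)] = rng.uniform(-a, a)
        return dp
    return rng.uniform(-a, a, size=d_in)
```

Reenacting a single attribute for half of the samples penalizes any direction that entangles, say, yaw with expression, since the loss then only rewards change along the chosen attribute.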

Losses
We train our framework by minimizing the following total loss:

L = λ r L r + λ id L id + λ per L per ,

where L r , L id , and L per denote the reenactment, identity, and perceptual losses, respectively, with λ r , λ id , and λ per being weighting hyperparameters empirically set to λ r = 1, λ id = 10, and λ per = 10. We detail each loss term below.

Reenactment loss L r
We define the reenactment loss as

L r = L sh + L eye + L mouth ,

where the shape loss term L sh = ∥S r − S gt ∥ 1 imposes head pose and expression transfer from target to source, S r is the 3D shape of the reenacted image, and S gt is the reconstructed ground-truth 3D shape calculated using (1). Specifically, the ground-truth 3D facial shape S gt should have the identity, i.e., facial shape, of the source image and the facial expression and head pose of the target image, both in self reenactment and in cross-subject reenactment.
In self reenactment, S gt is the same as S t , the facial shape of the target image. In cross-subject reenactment, we calculate S gt using the identity coefficients p s i of the source face and the facial expression and head pose coefficients p t e , p t θ of the target face as

S gt = s̄ + S i p s i + S θ p t θ + S e p t e .

To enhance the expression transfer, we calculate the eye (L eye ) and mouth (L mouth ) losses. The eye loss L eye (the mouth loss L mouth is computed in a similar fashion) compares the inner distances between the eye landmark pairs of the upper and lower eyelids between the reenacted and reconstructed ground-truth shapes. In Appendix A.2, we provide a detailed discussion on L eye and L mouth .
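The eyelid-distance comparison behind L eye can be sketched as follows; the landmark index pairs below are hypothetical placeholders, as the real indices depend on the 3D shape model's landmark layout:

```python
import numpy as np

# Hypothetical (upper, lower) eyelid landmark index pairs; the actual
# indices depend on the shape model used.
EYE_PAIRS = [(37, 41), (38, 40), (43, 47), (44, 46)]

def eye_loss(shape_r, shape_gt, pairs=EYE_PAIRS):
    """Compare upper/lower eyelid opening distances between the reenacted
    shape S_r and the ground-truth shape S_gt (both (N, 3) vertex arrays),
    using an L1 penalty on the per-pair distance difference."""
    loss = 0.0
    for up, lo in pairs:
        d_r  = np.linalg.norm(shape_r[up]  - shape_r[lo])   # reenacted opening
        d_gt = np.linalg.norm(shape_gt[up] - shape_gt[lo])  # ground-truth opening
        loss += abs(d_r - d_gt)
    return loss / len(pairs)
```

Comparing distances rather than raw landmark positions makes the term insensitive to global head placement and focuses it on how open the eyes (or mouth) are.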

Identity loss L id
We define the identity loss as the cosine similarity between feature representations extracted from the source image I s and the reenacted image I r using ArcFace Deng et al (2019). The identity loss imposes identity preservation between the source and the reenacted image.
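A minimal sketch of such a loss; the 1 − cos(·) form is our assumption of how the similarity enters the minimized objective, and plain vectors stand in for ArcFace embeddings:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_loss(feat_src, feat_reenacted):
    """Identity loss as 1 minus the cosine similarity of the two face
    embeddings (random vectors stand in for ArcFace features here)."""
    return 1.0 - cosine_sim(feat_src, feat_reenacted)
```

The loss is zero when the two embeddings point in the same direction, i.e., when the face recognition network considers the two images to depict the same identity.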

Perceptual loss L per
We define the perceptual loss similarly to Johnson et al (2016) in order to improve the quality of the reenacted face images.

Fine-tuning on unpaired real images
In this section, we extend the basic pipeline of the proposed framework, described in the previous section, in order to learn from both synthetic and real images. To do so, we propose to (a) use a pipeline for inverting the images back to the latent code space of StyleGAN2, and (b) adopt a mixed training approach (using both synthetic and inverted latent codes) for discovering the latent directions (Sect. 3.1.3).
As discussed in previous sections, the main challenge in the GAN inversion problem is finding a good trade-off between faithful reconstruction of the real image and effective editability using the inverted latent code. Although satisfying both requirements is challenging Alaluf et al (2021); Richardson et al (2021); Tov et al (2021), we found that the following pipeline produces compelling results for the purposes of our goal (i.e., face/head reenactment). During training, we employ an encoder-based method (e4e) Tov et al (2021) to invert the real images into the W+ latent space of StyleGAN2 Abdal et al (2019). However, directly using the inverted W+ latent codes performs poorly in face reenactment due to the domain gap between the synthetic and inverted latent codes. To alleviate this, we propose a mixed-data approach (i.e., using both synthetic and real images) for training the pipeline presented in Sect. 3.1. Specifically, we first invert the extracted frames from the VoxCeleb dataset, and during training, at each iteration (i.e., for each batch) we use 50% random latent codes w and 50% embedded latent codes w inv .
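The 50/50 batch composition can be sketched as follows (a toy sketch: a precomputed matrix of inverted codes stands in for the e4e inversions of the VoxCeleb frames):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_batch(w_inverted_bank, batch_size=8, dim=512):
    """Build a training batch with 50% random latent codes w and 50% codes
    w_inv inverted from real frames (here: rows of a precomputed bank)."""
    half = batch_size // 2
    w_rand = rng.normal(size=(half, dim))               # synthetic codes
    idx = rng.integers(len(w_inverted_bank), size=batch_size - half)
    w_inv = w_inverted_bank[idx]                        # inverted real codes
    batch = np.concatenate([w_rand, w_inv], axis=0)
    return batch[rng.permutation(batch_size)]           # shuffle within batch
```

Mixing the two sources in every batch is what closes the domain gap: the directions matrix never overfits to either the synthetic or the inverted latent distribution.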
Since the inverted images using e4e Tov et al (2021) might still be missing some crucial identity details, we propose to use an additional optimization step (only during inference), similarly to Roich et al (2021), in order to slightly update the generator G and arrive at better reenacted images in terms of identity preservation. Note that this step does not affect the calculation of w inv and is used only during inference to obtain a higher quality inversion. We perform the optimization for 200 steps and only on the source frame of each video. In Fig. 2 we compare the two settings (the "w/o opt." and "w/ opt." columns), where we observe that, clearly, the reenacted images without the additional optimization step are not able to faithfully reconstruct the real images, while the reenacted images after optimizing the generator weights resemble the real ones more closely.

Fine-tuning on paired real images (video data)
In the previous sections, we presented the proposed framework for learning from unpaired synthetic and real images. Whilst this provides the benefit of learning from a very large number of identities, making it useful for cross-person reenactment, we show that we can achieve additional improvements by optimizing novel losses introduced by further training on paired data from the VoxCeleb1 Nagrani et al (2017) video dataset.
Compared to training from scratch on video data, as in most previous methods (e.g., Zakharov et al (2019, 2020); Burkov et al (2020)), we argue that our approach offers a more balanced strategy that combines the best of both worlds; that is, training with unpaired images and fine-tuning with paired video data. From each video of our training set, we randomly sample a source and a target face that have the same identity but different head pose/expression. Consequently, we minimize the following loss function:

L = λ r L r + λ id L id + λ per L per + λ pix L pix ,

where L r is the same reenactment loss defined in Sect. 3.1, L id and L per are the identity and perceptual losses, this time however calculated between the reenacted image I r and the target image I t , and L pix is a pixel-wise L1 loss between the reenacted and target images.

Joint Training of the real image inversion encoder E w and the directions matrix A
As discussed in Sect. 3.2, the encoder-based e4e Tov et al (2021) inversion method often fails to faithfully reconstruct real images, typically failing to preserve crucial identity characteristics, as shown in the third column ("w/o opt.") of Fig. 2. Clearly, this poses a certain limitation to the face reenactment methodology presented in Sect. 3.1.4. Optimizing the generator's weights leads to notable improvements (Sect. 3.2), as shown in the fourth column ("w/ opt.") of Fig. 2; albeit, this comes at a significant cost for the task of face reenactment (that is, the optimization of G takes approximately 20 sec. per frame).
In this section, we propose to jointly train the real image inversion encoder E w and the directions matrix A, which leads to optimization-free face reenactment at inference time. To do so, we use paired data as described in Sect. 3.3. An overview of this approach is shown in Fig. 3. Specifically, we first sample a source (I s ) and a target (I t ) image from the same video of the VoxCeleb1 Nagrani et al (2017) training set, that have the same identity but different head pose/expression. These images are then fed into the inversion encoder E w to predict the corresponding source (w s ) and target (w t ) latent codes. Then, the pre-trained Net 3D network extracts the corresponding source (p s ) and target (p t ) parameter vectors. Finally, as described in Sect. 3.1, we generate the reenacted image using the latent code w r = w s + ∆w, where ∆w = A(p t − p s ).
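The forward pass of this joint scheme can be sketched as follows; random linear projections are hypothetical stand-ins for the trained E w and Net 3D networks, and images are flattened toy vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_in, d_w = 64, 15, 512  # toy image dim; 3 + m_e = 15 pose/expr params

# Random projections stand in for the trained modules (assumptions):
W_enc  = rng.normal(size=(d_img, d_w))
W_pose = rng.normal(size=(d_img, d_in))
E_w   = lambda img: img @ W_enc     # inversion encoder (stand-in)
Net3D = lambda img: img @ W_pose    # pose/expression estimator (stand-in)

A = rng.normal(scale=0.01, size=(d_w, d_in))  # directions matrix (stand-in)

def reenact_code(img_s, img_t):
    """Reenacted latent code: w_r = w_s + Δw, with Δw = A (p_t - p_s)."""
    w_s = E_w(img_s)
    p_s, p_t = Net3D(img_s), Net3D(img_t)
    return w_s + A @ (p_t - p_s)
```

Note the sanity property of this formulation: when source and target coincide, p_t − p_s = 0, so the reenacted code reduces to the plain inversion w_s.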

Real image encoder E w optimization objective
In order to train the real image encoder E w , we minimize the following loss:

L E = λ id L id + λ per L per + λ pix L pix + λ style L style , (7)

where L id , L per , and L pix denote the identity, perceptual, and pixel-wise losses described in the previous sections. Additionally, to further improve the style and the quality of the reconstructed images, we propose to use a style loss L style , similarly to Barattin et al (2023). Specifically, we use FaRL Zheng et al (2022), a method for general facial representation learning that leverages contrastive learning between image and text pairs to learn meaningful feature representations of facial images. In our method, we use the image Transformer-based encoder, E F aRL , to extract a 512-dimensional feature vector from each image. The proposed style loss is then defined as the distance between the FaRL feature vectors of the reconstructed and real images:

L style = ∥E F aRL (Î) − E F aRL (I)∥ 2 .

Directions matrix A optimization objective
In order to train the directions matrix A, we minimize the following loss:

L A = λ r L r + λ id L id + λ per L per + λ pix L pix + λ style L style , (9)

Fig. 4: Cycle loss: Given a pair of source (I 1 s ) and target (I 1 t ) images, we calculate the corresponding reenacted image I 1 r . We then use this image as source and the source image of the first pair as target, and calculate the second reenacted image I 2 r , which is imposed to be similar to I 1 s .
where L r , L id , L per , L pix , and L style denote respectively the reenactment loss defined in Sect. 3.1, and the identity, perceptual, pixel-wise, and style losses calculated between the reenacted image I r and the target image I t . Moreover, to further improve the reenactment results we propose an additional cycle loss term L cycle Sanchez and Valstar (2020); Bounareli et al (2023). Specifically, as shown in Fig. 4, given an image pair of a source (I 1 s ) and a target (I 1 t ) image, we calculate the corresponding reenacted image I 1 r ≡ I 1 t . Having as source image the reenacted image I 1 r and as target the source image I 1 s , we calculate a new reenacted image I 2 r that is imposed to be similar to I 1 s . Consequently, we calculate all reconstruction losses, i.e., L id , L per , L pix , and L style , between the source image I 1 s and the reenacted image I 2 r . In our ablation studies (Sect. 4.2), we show that using the proposed cycle loss improves the face reenactment performance.
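The cycle consistency above can be sketched generically as follows (a toy sketch: `reenact` is any function mapping a source/target image pair to a reenacted image, and a single L1 term stands in for the full set of reconstruction losses):

```python
import numpy as np

def l1(a, b):
    """Mean absolute (pixel-wise) difference, standing in for the full
    set of reconstruction losses."""
    return float(np.abs(a - b).mean())

def cycle_loss(reenact, img_s, img_t):
    """Cycle consistency: reenact source->target, then drive the result
    back using the original source as target; this should recover img_s."""
    img_r1 = reenact(img_s, img_t)   # I_r^1: source driven by target
    img_r2 = reenact(img_r1, img_s)  # I_r^2: driven back by the source
    return l1(img_r2, img_s)
```

The round trip penalizes any information (identity, appearance) that leaks or is destroyed during reenactment, since such losses cannot be undone on the way back.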

Joint optimization objective
Overall, the objective of the joint optimization is as follows:

L = L E + L A ,

i.e., the sum of the encoder and directions objectives defined above.

Fig. 5: Training of the feature space encoder E F in the real image inversion task. E F takes as input a real image and predicts the shift ∆f 4 that updates the feature map f 4 of the 4th feature layer of StyleGAN2's generator.
We note that, in this training phase, we fine-tune the matrix A and the real image inversion encoder E w , trained as described in Sect. 3.2. As demonstrated in Fig. 2, using the proposed joint training scheme ("Joint Training"), our method is able to reconstruct the identity details of the real faces without performing any optimization step. In Sect. 4, we quantitatively demonstrate that, on self reenactment, our proposed method produces results similar to those of our method when optimizing the generator's weights. Nevertheless, on the more challenging task of cross-subject reenactment and on large head pose differences between the source and target faces, the joint training scheme outperforms our results with optimization, producing more realistic images with fewer visual artifacts.

Feature space F refinement
In this section, we propose an additional module for our face reenactment framework that refines the feature space F of StyleGAN2's generator, taking advantage of its exceptional expressiveness (e.g., in terms of background, hair style/color, or hair accessories). In order to mitigate the limited editability of F Parmar et al (2022); Kang et al (2021), we propose a two-step training procedure, which we illustrate in Fig. 5. Specifically, we first train a feature space encoder E F , using the ResNet-18 He et al (2016) architecture, on the task of real image inversion. E F takes as input a real image and predicts the shift ∆f 4 that updates the feature map f 4 as

f̂ 4 = f 4 + ∆f 4 , (11)

where f 4 is the feature map calculated using the inverted latent code w. The training objective in this step consists of the reconstruction losses, namely identity, perceptual, pixel-wise, and style, calculated between the reconstructed images Î and the real images I, as described in (Eq. 7). It is worth noting that we only refine the 4th feature layer of StyleGAN2's generator G, which we found to be particularly beneficial to the face reenactment task, in contrast to later feature layers that, despite their capability of reconstructing the real images almost perfectly, suffer from poor semantic editability (as shown by Yao et al (2022b)).
As discussed above, directly using the updated feature map f̂ 4 to refine details on the edited images leads to visual artifacts. To address this, we propose a framework that efficiently learns to predict the updated feature map f̂ r 4 of the edited image using the refined source feature map f̂ s 4 . We illustrate this in Fig. 6, where, given a source and target image pair, we first calculate the reenacted latent code w r as described in Sect. 3.4. We note that the directions matrix A and the real image inversion encoder E w are frozen during this training step. Then, using the feature encoder E F , we calculate the refined source feature map f̂ s 4 using (11). In order to calculate the refined feature map f̂ r 4 of the reenacted image, we introduce the Feature Transformation (FT) module, which takes as input the difference between the refined source feature map f̂ s 4 and the reenacted feature map f r 4 and outputs the shift ∆f r 4 , which is then used to calculate the updated feature map f̂ r 4 via (11). As shown in Fig. 6, the proposed FT module learns two modulation parameters, namely γ and β, that efficiently transform the shift ∆f s 4 of the source feature map into the shift ∆f r 4 of the reenacted feature map as

∆f r 4 = γ · ∆f s 4 + β.

As illustrated in Fig. 6, the FT module consists of two convolutional blocks with two convolutional layers each. We note that in this training step we train both the FT module and the feature space encoder E F . Our training objective consists of the reconstruction losses, namely identity, perceptual, pixel-wise, and style, calculated between the reenacted and the target images (described in detail in Sect. 3.4).
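The modulation step can be sketched as follows; here γ and β are constant stand-ins with toy spatial dimensions, whereas in the actual FT module they would be predicted by the convolutional blocks from the feature maps:

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w = 512, 8, 8                     # toy shape for the f_4 feature map

delta_f_s = rng.normal(size=(c, h, w))  # shift of the source feature map
f_r = rng.normal(size=(c, h, w))        # reenacted feature map (unrefined)

# Stand-ins for the predicted modulation parameters of the FT module:
gamma = rng.normal(size=(c, h, w))
beta  = rng.normal(size=(c, h, w))

delta_f_r = gamma * delta_f_s + beta    # modulate the source shift
f_r_hat   = f_r + delta_f_r             # refined reenacted feature map
```

Intuitively, the source shift encodes what W+ failed to reconstruct (background, hair detail), and the elementwise γ/β modulation re-aligns that residual to the geometry of the reenacted pose.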
Finally, in Fig. 7 we give some indicative results of the proposed reenactment variant of our method that learns to optimize the feature space F ("FSR"), in comparison to the variant of our method described in Sect. 3.4 ("Joint Training") and to Parmar et al (2022) ("SAM"). We note that using the W+ latent space (Joint Training / Sect. 3.4) leads to relatively faithful reconstruction performance, albeit without reconstructing every detail of the background or the hair style. As we will show in the experimental section, both qualitatively and quantitatively, as well as in the conducted user study, such level of detail is crucial for the task of face reenactment. Similarly, SAM Parmar et al (2022) is able to better reconstruct the background; however, the reenacted images suffer from visual artifacts (marked with red arrows in Fig. 7) and thus look unrealistic, especially around the face area. By contrast, the proposed framework that learns to optimize the feature space F ("FSR") leads to notably more faithful face reenactment while exhibiting fewer artifacts.

Experiments
In this section, we present qualitative and quantitative results, along with a user study, in order to evaluate the proposed framework (all its variants), where k = 3 + m_e, m_e = 12, and N_l = 8. We train three matrices of directions: (i) the first on synthetically generated images (Sect. 3.1), (ii) the second on mixed real and synthetic data (Sect. 3.2), and (iii) the third by fine-tuning (ii) on paired data (Sect. 3.3). All competing models are evaluated under the one-shot setting. We note that we will refer to our method that optimizes the generator's weights during inference as Latent Optimization Reenactment (LOR), whereas LOR+ will refer to our final model with joint training and feature space refinement. We note that in the LOR+ model, we do not optimize the generator weights.

Quantitative comparisons
We report eight different metrics, including the Learned Perceptual Image Patch Similarity (LPIPS) Zhang et al (2018).
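One of the reported metrics, CSIM, measures identity preservation as the cosine similarity between face embeddings. The sketch below illustrates the computation; in the paper's setting the embeddings would come from a pre-trained face recognition network, while here random tensors stand in for them.

```python
import torch
import torch.nn.functional as F

def csim(emb_a, emb_b):
    """CSIM: cosine similarity between identity embeddings of two faces.
    The embeddings are placeholders for the output of a pre-trained
    face recognition network."""
    return F.cosine_similarity(emb_a, emb_b, dim=-1)

src = torch.randn(4, 512)   # stand-in identity embeddings of 4 faces
out = csim(src, src)        # identical embeddings give similarity 1.0
```

Higher CSIM between a source face and its reenactment indicates better identity preservation.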
In Table 1 we report quantitative results on self reenactment, using the original test set of VoxCeleb1 Nagrani et al (2017) and the test set provided by Zakharov et al (2019). Additionally, in Table 2 we report results on a more challenging self reenactment condition, where the source and target faces have a large head pose difference. Specifically, we randomly selected from the test set of VoxCeleb1 1,000 image pairs with a head pose distance larger than 10°. The head pose distance is calculated as the average of the absolute differences of the three Euler angles (i.e., yaw, pitch, and roll) between the source and target faces. In the appendices (Sect. A.4), we provide additional details regarding our benchmark dataset. We note that in self reenactment, all metrics are calculated between the reenacted and the target faces. As shown in Table 1, the warping-based methods, namely X2Face, PIR, HeadGAN, and Face2Face, have high CSIM values; however, we argue that this is due to their warping-based technique, which enables better reconstruction of the background and other identity characteristics. Importantly, these results are accompanied by poor quantitative and qualitative results when there is a significant change in the head pose (e.g., see Fig.
8 and Table 2). Additionally, regarding head pose/expression transfer, our method (LOR+) achieves results similar to Fast Bi-layer Zakharov et al (2020) on NME, while on the ARD and AED metrics we outperform all methods. Our results on the FID and FVD metrics confirm that the quality of our generated videos resembles that of the VoxCeleb dataset. Moreover, on the challenging condition with large head pose differences between the source and target faces (Table 2), our method (LOR+) outperforms all methods. Cross-subject reenactment is more challenging than self reenactment, as the source and target faces have different identities, and in this case it is important to maintain the source identity characteristics without transferring the target ones. In Table 3, we report the quantitative results for cross-subject reenactment on randomly selected pairs. To further evaluate the performance of reenactment methods, we conduct a user study, where we ask 30 users to select the method that best reenacts the source frame on the self and cross-subject reenactment tasks. For the purposes of the user study we utilise only our final model (LOR+). The results are reported in Table A6; as shown, our method is the most preferred (by a large margin: 52.1% versus 19.2% for the second-best method), which also validates our quantitative results.
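The head pose distance used to build the large-pose benchmark above can be sketched as follows; the function and variable names are illustrative.

```python
import numpy as np

def head_pose_distance(angles_src, angles_tgt):
    """Average absolute difference of the three Euler angles
    (yaw, pitch, roll), in degrees."""
    a = np.asarray(angles_src, dtype=float)
    b = np.asarray(angles_tgt, dtype=float)
    return np.mean(np.abs(a - b))

# keep only pairs with distance larger than 10 degrees (the Table 2 rule)
pairs = [((10, 0, 0), (40, 5, 0)),   # distance = (30 + 5 + 0) / 3 > 10
         ((1, 1, 0), (2, 0, 1))]     # distance = 1, too small
hard = [p for p in pairs if head_pose_distance(*p) > 10]
print(len(hard))  # -> 1
```

Applying this filter over candidate source-target pairs yields a benchmark of pairs with substantial head rotation.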
Additionally, Fig. 8 shows qualitative results and comparisons for self (top three rows) and cross-subject reenactment (last three rows) on VoxCeleb1; the first and second columns show the source and target faces. Our method preserves the appearance and identity characteristics (e.g., face shape) of the source face significantly better and also faithfully transfers the target head pose/expression without producing visual artifacts. A notable advantage of our method on cross-subject reenactment, as shown in Fig. 8, is that it is able to reenact the source face with minimal identity leakage (e.g., facial shape) from the target face, in contrast to landmark-based methods such as Fast Bi-layer Zakharov et al (2020). Finally, to show that our method is able to generalise well to other facial video datasets, we provide additional results on the FaceForensics Rössler et al (2018) and 300-VW Shen et al (2015) datasets in the appendices (Fig. A17).

Ablation studies
In this section, we perform several ablation tests to (a) assess the different variants of our method, i.e., the optimization of the generator G during inference (Sect. 3.3), the proposed joint training scheme (Sect. 3.4), and the refinement of the feature space (Sect. 3.5), (b) measure the impact of the identity and perceptual losses, and of the additional shape losses for the eyes and mouth (Sect. 3.1), (c) validate our trained models on synthetic, mixed, and paired images, and (d) measure the impact of the style and cycle losses (introduced in Sect. 3.4).
For (a), we report results of our method on self and cross-subject reenactment, with our model (LOR) described in Sect. 3.3 without (w/o opt.) and with (w/ opt.) optimization of the generator G during inference. We also report results of our final model (LOR+) without the additional feature space refinement (FSR) (Sect. 3.4) and with feature space refinement (Sect. 3.5). As shown in Table 5, the optimization of G during inference improves our results (as expected), especially regarding identity preservation, and particularly on the challenging tasks of self reenactment with large head pose differences between the source and target faces and cross-subject reenactment. Fig. 9 illustrates results on self reenactment using the above models. As shown, LOR without optimization cannot accurately reconstruct the identity of the source face, while with optimization the identity details are better reconstructed but the reenacted images contain noticeable visual artifacts. On the contrary, the proposed joint training scheme (LOR+ w/o FSR) is able to accurately reconstruct the identity of the source faces and produce artifact-free images without performing any subject fine-tuning. Finally, the proposed feature space refinement (LOR+ w/ FSR) further improves our qualitative results by producing more realistic images (i.e., better background and hair style reconstruction).
Table 6: Ablation study on the impact of the identity L_id and perceptual L_per losses, and on the impact of the eye L_eye and mouth L_mouth losses. CSIM is calculated between the source and the reenacted images, which have different head pose and expression.

For (b), we perform experiments on synthetic images with and without the identity and perceptual losses. To evaluate the models, we randomly generate 5K pairs of synthetic images (source and target) and reenact the source image with the head pose and expression of the target. As shown in Table 6, the incorporation of the identity and perceptual losses is crucial to isolate the latent space directions that strictly control the head pose and expression characteristics without affecting the identity of the source face. In a similar fashion, in Table 6, we show the impact of the additional shape losses, namely the eye L_eye and mouth L_mouth losses. As shown, omitting these losses leads to higher head pose and expression errors. The impact of these losses is also evident in our qualitative comparisons in Fig. 10. As shown, when we exclude the identity and perceptual losses from the training process, the generated images lack several appearance details, while omitting the eye and mouth losses leads to less accurate facial expression transfer.
For (c), we evaluate the three different training schemes, namely synthetic only (Sect. 3.1), mixed synthetic-real (Sect. 3.2), and mixed synthetic-real fine-tuned with paired data (Sect. 3.3), for self reenactment. The results, reported in Table 7 and in Fig. 11, illustrate the impact of each of these training schemes, with the one using paired data providing the best results, as expected. As shown in Fig. 11, our final model trained with paired data produces more realistic images with fewer artifacts. Finally, for (d), we perform experiments on self reenactment using our model with the joint training scheme, without the style loss L_style and without the cycle loss L_cycle. As shown in Table 8, our final model with both losses achieves better results both on identity preservation and on head pose/expression transfer. Additionally, as illustrated in Fig. 12, our final model using both the style and the cycle loss has improved results in terms of identity/appearance preservation.

Limitations
As shown both in our quantitative and qualitative results, our method is able to efficiently reenact the source faces, preserving the source identity characteristics and faithfully transferring the target head pose and expression. Our proposed method, which is based on a pre-trained StyleGAN2 model, enables both self and cross-subject reenactment using only one source frame and without any further subject fine-tuning. The proposed joint training scheme of the real image encoder E_w and the directions matrix A enables more accurate identity reconstruction and facial image editing without many visual artifacts, especially on the challenging task of extreme head poses. Additionally, the refinement of StyleGAN2's feature space F enables better reconstruction of various image details, including the background, hair style/color, and facial accessories, resulting in visually more realistic images. Nevertheless, as shown in Fig. 13, for hair accessories such as hats, which are underrepresented in the training dataset, our method is not able to faithfully reconstruct every detail when editing the head pose orientation.

Conclusions
In this paper, we presented a novel approach to neural head/face reenactment that uses a 3D shape model to learn disentangled directions of head pose and expression in the latent GAN space. This approach comes with specific advantages, such as the use of powerful pre-trained GANs and 3D shape models, which have been thoroughly developed and studied by the research community over the past years. Our method is able to successfully disentangle the facial movements and the appearance of the input images by leveraging the disentangled properties of the pre-trained StyleGAN2 model. Consequently, our framework effectively mimics the target head pose and expression without transferring identity details from the driving images. Additionally, our method features several favorable properties, including one-shot face reenactment without the need for further subject-specific fine-tuning. It also allows for improved cross-subject reenactment through the proposed training with unpaired synthetic and real images. While our method demonstrates compelling results, it relies on the capabilities of the StyleGAN2 model, which is bounded by the distribution of the training dataset. If the dataset lacks diversity in terms of complex backgrounds or facial accessories such as hats, glasses, etc., this can affect our model's ability to generalize well to more complex datasets. This limitation highlights the importance of using more diverse video datasets during the training of generative models. Finally, we acknowledge that although face reenactment can be used in a variety of applications such as art, entertainment, and video conferencing, it can also be applied for malicious purposes, including deepfake creation, that could potentially harm individuals and society. It is important for researchers in our field to be aware of the potential risks and to promote the responsible use of this technology.

Appendix A
In this appendix, we first provide an analysis of the discovered directions in the latent space in App. A.1 and describe in detail the calculation of the shape losses in App. A.2. Additionally, we show results of our method on the task of facial attribute editing in App. A.3. In App. A.4, we provide details about the benchmark datasets used to evaluate our method on large head pose variations. Finally, in App. A.5, we compare the proposed framework with state-of-the-art methods for synthetic image editing on the FFHQ dataset Karras et al (2019), and we show comparisons on real image editing against five methods that perform real image inversion using the feature space of StyleGAN2 Karras et al (2020b). Moreover, we provide additional quantitative and qualitative results on both the VoxCeleb1 Nagrani et al (2017) and the VoxCeleb2 Chung et al (2018) datasets, and we show additional results on the FaceForensics Rössler et al (2018) and the 300-VW Shen et al (2015) datasets.

A.1 Analysis of the learned directions
A.1.1 Head pose/expression parameter vector

The elements of p = [p_θ, p_e], i.e., the head pose p_θ and the expression p_e coefficients, are typically in different ranges of values. That is, the head pose p_θ is given in terms of the three Euler angles (i.e., yaw, pitch, and roll) in degrees (i.e., in the range [−90, 90]), while the expression coefficients p_e are given in the range [−2, 2], with the vast majority (99%) of samples in the VoxCeleb1 dataset lying within the range [−1, 1]. In order to bring each element of p = [p_θ, p_e] into a common range of values [−a, a], we sampled 10,000 synthetic facial images and calculated the corresponding values of p_θ and p_e using the pre-trained DECA Feng et al (2021) network. We then re-scaled each element x of p into [−a, a] using min-max scaling, i.e., x' = (x − x_min)/(x_max − x_min) × 2a − a. This way, we guarantee that each component contributes evenly to the overall facial representation, regardless of its original range, providing stability in the training process. The specific re-scaling range, i.e., [−6, 6], is practically imposed by StyleGAN's latent space, as Voynov and Babenko (2020) originally pointed out: traversing the latent space outside this range often leads to severe degradation in the quality of the generated images, since such latent codes lie in regions of low density.
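The min-max re-scaling above can be written directly; a = 6 matches the range discussed, and the function name is illustrative.

```python
import numpy as np

def rescale(x, x_min, x_max, a=6.0):
    """Min-max rescaling of a parameter component into [-a, a].
    x_min and x_max are the component's statistics gathered from
    the sampled synthetic images."""
    return (x - x_min) / (x_max - x_min) * 2 * a - a

# yaw angle in degrees, with statistics [-90, 90]
print(rescale(0.0, -90.0, 90.0))    # midpoint -> 0.0
print(rescale(90.0, -90.0, 90.0))   # maximum  -> 6.0
print(rescale(-90.0, -90.0, 90.0))  # minimum  -> -6.0
```

After this transformation every component of p spans the same [−6, 6] interval, so each contributes comparably during training.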

A.1.2 Linearity
In this work, we discover the disentangled directions that control the head pose and the expression by optimising a matrix A so that ∆w = A ∆p (Eq. A1), where ∆w denotes a shift in the latent space and ∆p denotes the corresponding change in the parameter space. That is, independently of the source attributes, we assume linearity between a shift ∆w that is applied to an arbitrary code w and the induced change ∆p in the parameter space, i.e., the change between the source and the reenacted attributes.
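The linear model of Eq. A1 can be sketched numerically. Here k = 15 follows k = 3 + m_e with m_e = 12 from the experiments section; the random matrix A and the flat 512-dimensional latent code are simplifying assumptions (the actual model operates in the W+ space with N_l = 8 layers).

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 15, 512               # 3 pose + 12 expression params; latent dim
A = rng.normal(size=(d, k))  # directions matrix (one column per attribute)

# desired change in the parameter space: alter only the yaw component
delta_p = np.zeros(k)
delta_p[0] = 0.7             # re-scaled yaw change, all other attrs fixed
delta_w = A @ delta_p        # Eq. (A1): shift applied to any latent code

w = rng.normal(size=d)       # an arbitrary latent code
w_reenacted = w + delta_w    # code of the reenacted image
```

Because the map is linear, the same ∆w produces the same attribute change regardless of which code w it is added to, which is exactly the assumption tested in the correlation analysis.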
Several recent methods propose to learn linear directions in the latent space of StyleGAN Voynov and Babenko (2020); Shen et al (2020); Shen and Zhou (2021) in order to perform synthetic image editing, based on the fact that the W latent space of StyleGAN Karras et al (2019) has been designed to be linear and disentangled. Furthermore, Nitzan et al (2021) provide a comprehensive analysis of the existence of linear relations between the magnitude of change in the semantic attributes (e.g., head orientation, smile, etc.) and the traversal distance along the corresponding linear latent paths. In order to further support our hypothesis (i.e., Eq. A1), we perform a similar analysis by examining the correlation between random shifts ∆w in the latent space and the predicted shifts ∆p in the parameter space. Specifically, given a known change ∆p, we calculate the corresponding ∆w using Eq. A1 and apply this ∆w to random latent codes of images with different attributes. Then, we calculate the predicted change ∆p between the source and the reenacted images. In Fig. A1 we demonstrate the results of our analysis for four different attributes, namely the yaw angle, the pitch angle, smile, and open mouth. For all attributes, the calculated correlation is close to 0.9, indicating a strong linear relationship. Finally, additional visual results of two different subjects in different head poses and expressions are depicted in Fig. A2. Specifically, we show the ground truth change ∥∆p∥ in the parameter space, the corresponding ∥∆w∥, and the predicted changes ∥∆p∥ between the source and shifted images. In the row reporting the predicted ∥∆p∥ above the presented images, the two values separated by commas correspond to the subjects depicted in the first and second rows. As shown, a change ∥∆w∥ corresponds to a similar change ∥∆p∥ in the parameter space, independently of the facial attributes of the source images.
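The shape of this correlation analysis can be simulated without a generator. In the sketch below, the DECA prediction on the shifted images is replaced by the true parameter change plus estimation noise; this stand-in, the matrix scaling, and all constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n = 15, 512, 200
A = rng.normal(size=(d, k)) / np.sqrt(d)

# sample known parameter changes and map them to latent shifts via Eq. (A1)
delta_p = rng.uniform(-1, 1, size=(n, k))
delta_w = delta_p @ A.T

# stand-in for the predicted change on the shifted images: the true change
# plus estimation noise (the real analysis uses the generator + DECA)
delta_p_hat = delta_p + 0.1 * rng.normal(size=(n, k))

# Pearson correlation between the shift magnitudes
r = np.corrcoef(np.linalg.norm(delta_w, axis=1),
                np.linalg.norm(delta_p_hat, axis=1))[0, 1]
print(r)  # high under this simulation, mirroring Fig. A1
```

A correlation near 1 means the traversal distance in latent space tracks the magnitude of the attribute change, supporting the linearity assumption.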

A.1.3 Disentanglement
Following the common understanding of disentanglement in the area of GANs Chen et al (2016); Deng et al (2020); Karras et al (2019), we refer to a latent direction as disentangled when travelling along it leads to image generations where only a single attribute changes. To assess the directions learnt by our method in terms of disentanglement, in Fig. A3 we illustrate the differences between the source and reenacted attributes when changing a single attribute. In Fig. A3a, we only transfer the yaw angle from the target image, while in Fig. A3b we only transfer the smile expression from the target image. We observe that the differences in the remaining attributes (i.e., pitch, roll, and expression in Fig. A3a, and yaw, pitch, and roll in Fig. A3b) are clearly small, which indicates that the discovered directions are disentangled. We note that these plots were calculated using 2,000 random image pairs. In Fig. A3a, we show the differences in the yaw angle, calculated as the absolute difference between the source and the target yaw angles (measured in degrees), while the differences in the unchanged attributes were calculated between the source and reenacted images. In a similar way, in Fig. A3b we show the differences in expression, calculated as the absolute difference between the source and the target expression (p_e coefficients). Moreover, in Fig. A4 we demonstrate visual results of editing only one direction, namely the yaw and pitch angles and smile. As shown, when altering the head pose, i.e., the yaw and pitch angles, all other facial attributes, i.e., the facial expressions, remain unchanged. Additionally, when altering the smile expression, we observe changes only around the mouth area, while the head orientation remains the same. In more detail, in the first subject where smile is controlled (row 5), the eyes remain closed despite editing the smile expression, while in the second subject (row 6) the raised brows remain unaffected.
Finally, in order to encourage better disentanglement between the facial attributes that we control, during training we propose to change only one attribute for 50% of the training samples within each batch. To validate the effectiveness of this training choice, in Table A1 we compare two models trained on synthetic images, indicating with "True" the model trained with single attribute changes and with "False" the model without them. Specifically, we change only one attribute, namely the yaw, pitch, or roll head rotation angle, or one of the expression coefficients (e_i, i = 1, ..., 12). We then calculate and report the error, i.e., the ℓ1-distance between the source and the reenacted attributes that should remain unchanged. For instance, when changing only the yaw angle, both the pitch and roll angles, as well as the expressions, should remain the same as those of the source image. We note that for the expression error we report the mean error across all expressions. When we alter a specific expression e_i, we calculate the expression error by excluding that particular expression, as denoted in the last column of Table A1. As shown, adopting this training strategy leads to better disentanglement with respect to all 3 Euler angles and all 12 facial expressions.
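The single-attribute sampling strategy above can be sketched as follows; the sampling distribution of ∆p and all names are assumptions made for illustration.

```python
import numpy as np

def sample_delta_p(batch_size, k=15, rng=None):
    """Sample parameter changes so that ~50% of the batch alters only a
    single attribute (yaw, pitch, roll, or one expression coefficient)."""
    rng = rng or np.random.default_rng()
    delta_p = rng.uniform(-1, 1, size=(batch_size, k))
    single = rng.random(batch_size) < 0.5     # which samples are restricted
    for i in np.where(single)[0]:
        keep = rng.integers(k)                # the one attribute to change
        mask = np.zeros(k)
        mask[keep] = 1.0
        delta_p[i] *= mask                    # zero out all other attributes
    return delta_p

dp = sample_delta_p(8, rng=np.random.default_rng(0))
# rows flagged as single-attribute have exactly one non-zero entry
```

Training on such restricted changes penalizes any direction that moves more than its intended attribute, which is what Table A1 measures.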

A.3 Image editing
Our method is able to discover the disentangled directions of head pose and expression in the latent space of StyleGAN2. Consequently, apart from face reenactment, our model can perform head pose and expression editing by simply setting the desired head pose or expression. Fig. A6 illustrates results of per-attribute editing. As shown, our model can alter the head pose (i.e., yaw, pitch, and roll) or the expression (e.g., open mouth, smile) while keeping all other attributes unchanged. Similarly, our method can be used for the frontalization task. We compare our model with the methods of pSp Richardson et al (2021) and R&R Zhou et al (2020), and we report both qualitative (Fig. A7) and quantitative (Table A2) results. Specifically, we randomly select 250 frames of different identities from the VoxCeleb test set and perform frontalization. In Table A2, we evaluate the identity preservation (CSIM) and the Average Rotation Distance (ARD) between the source and the frontalized images.
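Frontalization fits naturally into this editing view: set the desired yaw, pitch, and roll to zero and map the resulting parameter change through the directions matrix. The sketch below assumes the first three entries of p are the (re-scaled) Euler angles and uses a random A for illustration.

```python
import numpy as np

def frontalize_shift(p_source, A):
    """Frontalization as attribute editing: cancel the three Euler angles
    and keep the expression unchanged, then map the parameter change to a
    latent shift via the linear direction model (sketch of Eq. A1)."""
    delta_p = np.zeros_like(p_source)
    delta_p[:3] = -p_source[:3]     # drive yaw, pitch, roll to zero
    return A @ delta_p              # latent shift added to the inverted code

k, d = 15, 512
A = np.random.default_rng(2).normal(size=(d, k))
p = np.zeros(k)
p[:3] = [0.7, -0.3, 0.1]            # re-scaled yaw, pitch, roll of the source
dw = frontalize_shift(p, A)         # add dw to the source latent code
```

Since the expression components of ∆p are zero, the edit leaves the facial expression untouched by construction.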

A.4 Benchmark datasets with large head pose variations
As mentioned in Sect. 4.1.1, the benchmark used in Table 2 (Benchmark-L), which evaluates our method on large head pose reenactment, contains 1,000 image pairs from the VoxCeleb1 dataset with a head pose distance larger than 10°, calculated as the average L1 distance of the three Euler angles (yaw, pitch, roll). To further validate our method on larger head pose differences, we generate a new benchmark dataset (Benchmark-XL) using images from both the VoxCeleb1 and the VoxCeleb2 datasets. Specifically, we randomly select 1,000 image pairs where the distance in the yaw angle is larger than 30° and the distance in the pitch or roll angle is larger than 20°. As shown in Fig. A9, Benchmark-XL consists of image pairs with "extreme" head pose differences compared to the distribution of the overall dataset. In Tables A3, A4 and in Fig. A10, we demonstrate the quantitative and qualitative comparisons on Benchmark-XL, on VoxCeleb1 and VoxCeleb2, respectively. As shown, our method is able to better preserve the identity of the source face.

Table A2: Quantitative results on the frontalization task. We compare our method with pSp Richardson et al (2021) and R&R Zhou et al (2020) by evaluating the identity preservation (CSIM) and the Average Rotation Distance (ARD) between the source and the frontalized images.

A.5 Additional results

A.5.1 Comparisons with synthetic image editing methods
In order to show the superiority of our method over methods for synthetic image editing, we compare against two state-of-the-art methods, namely ID-disentanglement Nitzan et al (2020) and StyleFlow Abdal et al (2021).
In Table A5 and in Fig. A11, we show quantitative and qualitative results of our method against ID-disentanglement Nitzan et al (2020) and StyleFlow Abdal et al (2021). As shown in Table A5, our method outperforms both methods on identity preservation (CSIM) and on the head pose/expression transfer metrics, namely ARD, AED, and NME. Additionally, as illustrated in Fig. A11, our method can successfully edit the source image.

A.5.2 Comparisons with real image inversion methods
Additionally, in order to validate that our proposed Feature Transformation module is necessary for performing image editing without visual artifacts when altering the feature space of StyleGAN2, we compare our method against four methods that perform real image inversion using the feature space and one method that learns to alter the weights of the StyleGAN2 generator. Specifically, we compare against SAM Parmar et al (2022) and other inversion methods, including one (2022) that proposes to alter the generator's weights using a hypernetwork. In Fig. A12, we demonstrate results of editing the head pose using our directions matrix A after first inverting the real images using the above methods. As shown, our method is the only one without visual artifacts when editing the head pose orientation. All the aforementioned methods are able to faithfully reconstruct the real images but fail at editing.

A.5.3 Additional comparisons
In Table A6, we report the results of our user study. Specifically, we ask 30 users to select the method that best reenacts the source frame on the self and cross-subject reenactment tasks. For the purposes of the user study we utilise only our final model (LOR+), and we compare against X2Face Wiles et al (2018), FOMM Siarohin et al (2019), Fast Bi-layer Zakharov et al (2020), Neural-Head Burkov et al (2020), LSR Meshry et al (2021), and PIR Ren et al (2021). As shown, our method is the most preferred, by a large margin: 52.1% versus 19.2% for the second-best method.
We provide additional results on self (Fig. A13) and cross-subject (Figs. A14, A15) reenactment on the VoxCeleb1 Nagrani et al (2017) dataset, comparing our method with X2Face Wiles et al (2018), FOMM Siarohin et al (2019), Fast Bi-layer Zakharov et al (2020), Neural-Head Burkov et al (2020), LSR Meshry et al (2021), PIR Ren et al (2021), HeadGAN Doukas et al (2021), Dual Hsu et al (2022), and Face2Face Yang et al (2022). Moreover, in Fig. A16 we show additional comparisons on the VoxCeleb2 Chung et al (2018) dataset, both on self and cross-subject reenactment. Additionally, we provide a supplementary video with randomly selected videos on self reenactment and randomly selected pairs on cross-subject reenactment from the test sets of the VoxCeleb1 and VoxCeleb2 datasets. Finally, we show that our method is able to generalise well to other facial video datasets: in Fig. A17 we provide results on the FaceForensics Rössler et al (2018) and 300-VW Shen et al (2015) datasets, both on self (Fig. A17a) and cross-subject (Fig. A17b) reenactment.
Table A6: Results of a user study that we conduct to evaluate the user preference (Pref.(%)) on the generated images of stateof-the-art methods.

Method                                  Pref. (%)
X2Face Wiles et al (2018)               1.3
FOMM Siarohin et al (2019)              5.0
Fast Bi-layer Zakharov et al (2020)     9.4
Neural-Head Burkov et al (2020)         19.2
LSR Meshry et al (2021)                 10.7
PIR Ren et al (2021)                    2.3
LOR+ (Ours)                             52.1

Several methods Yao et al (2020); Ren et al (2021); Yang et al (2022) rely on 3D shape models to remove the identity details of the driving images. Warping-based methods Wiles et al (2018); Siarohin et al (2019); Wang et al (2021b); Ren et al (2021); Doukas et al (2021); Yang et al (2022) synthesize the reenacted images based on the motion of the driving faces. Specifically, HeadGAN Doukas et al (2021) and Face2Face Yang et al (2022) are warping-based methods conditioned on 3D Morphable Models. Whilst such methods produce realistic results, they suffer from several visual artifacts and head pose mismatches, especially under large head pose variations. Finally, Meshry et al (2021) propose a two-step architecture that aims to disentangle the spatial and style components of an image, leading to better preservation of the source identity.

Fig. 3 :
Fig. 3: To eliminate the need for the optimization step during inference, we propose to jointly train the real image inversion encoder E w and the directions matrix A. We note that during training both the generator G and the Net 3D network are frozen.

Fig. 6 :
Fig. 6: Training of the feature space encoder E F and the Feature Transformation (FT) module to efficiently refine the feature map f r 4 of the reenacted images.
The optimization of G improves identity preservation (CSIM) compared to our model without optimization. Nevertheless, our proposed joint training scheme (LOR+ w/o FSR) achieves the same results on the image reconstruction metrics (CSIM and LPIPS) and improves our results on head pose/expression transfer (ARD, AED) without performing any optimization of the generator. Additionally, the proposed refinement of the feature space of StyleGAN2 (LOR+ w/ FSR) improves all quantitative results. It is worth mentioning that the newly proposed components (Joint Training and Feature Space Refinement) improve our results compared to our previous work Bounareli et al (2022).

Fig. 9 :
Fig. 9: Qualitative comparisons of the various models of our work on self reenactment.

Fig. 10 :
Fig. 10: Qualitative comparisons on the impact of the identity L id and perceptual L per losses, and on the impact of eye L eye and mouth L mouth losses.

Fig. 11 :
Fig. 11: Qualitative results of the three different models trained on synthetic images, on both synthetic and real images and on paired data.

Fig. 12:
Fig. 13:
Fig. 12: Qualitative comparisons on the impact of the style L style and cycle L cycle losses.

Fig. A1 :
Fig. A1: Analysis of the correlation between shifts ∥∆w∥ in the latent space and the predicted changes ∥∆p∥ in the parameter space. We show results for four different attributes (yaw and pitch angles, smile, and open mouth). For all attributes the correlation is high, indicating a strong linear relationship.

Fig. A9 ,
In Fig. A9, we present a comparison of the distributions of the three Euler angles (yaw, pitch, and roll) and the average head pose distance between the VoxCeleb1 dataset and the aforementioned benchmark dataset (Benchmark-L). As shown, Benchmark-L comprises image pairs that have larger head pose distances compared to the average head pose distance observed in the VoxCeleb1 dataset. Additionally, Fig. A8 illustrates some indicative example image pairs from the benchmark dataset, covering a wide range of head pose differences.

Fig. A2: Visual results illustrating the strongly linear relationship between ∥∆p∥ and ∥∆w∥. Specifically, given two different input images and ground truth changes ∥∆p∥ in the parameter space, we calculate the corresponding shift ∥∆w∥ in the latent space and the predicted changes ∥∆p∥ between the source and shifted images. We note that a similar shift ∥∆w∥ corresponds to a similar change in the parameter space, independently of the facial attributes of the source images.
Our method successfully transfers the target head pose and expression and generates realistic images without many visual artifacts, compared to the other state-of-the-art methods.
The authors of ID-disentanglement Nitzan et al (2020) introduce a method that learns to disentangle the head pose/expression and the identity characteristics using a StyleGAN2 pre-trained on the FFHQ dataset. Additionally, StyleFlow Abdal et al (2021) is a state-of-the-art method that finds meaningful non-linear directions in the latent space of StyleGAN2 using supervision from multiple attribute classifiers and regressors. Both ID-disentanglement Nitzan et al (2020) and StyleFlow Abdal et al (2021) provide pre-trained models using the StyleGAN2 generator trained on the FFHQ dataset Karras et al (2019). Consequently, in order to fairly compare against these methods, we train our model using synthetically generated images from a StyleGAN2 generator trained on FFHQ. We compare against ID-disentanglement Nitzan et al (2020) and StyleFlow Abdal et al (2021) on cross-subject reenactment using synthetic images. Specifically, we use the small test set (1,000 images) provided by the authors of StyleFlow Abdal et al (2021), and we randomly select 500 image pairs (source and target faces) to perform face reenactment.

(a) L1 distance in pitch and roll angles (in degrees) and expression (p_e coefficients) when transferring only the yaw angle from the target images. (b) L1 distance in yaw, pitch, and roll angles (in degrees) when transferring only the smile expression from the target images.

Fig. A3: Difference between the source and reenacted facial attributes when transferring only one facial attribute (e.g., yaw angle and smile expression) from the target images.
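The disentanglement check behind Fig. A3 can be sketched as follows; all names, dimensions, and the idealized editor below are illustrative and not the paper's code. The idea is to edit only one attribute (here, yaw) and measure the L1 drift of the attributes that should stay fixed.

```python
import numpy as np

# Toy parameter vector: [yaw, pitch, roll, 7 expression coefficients].
# Dimensions are hypothetical stand-ins for the 3D shape model parameters.
rng = np.random.default_rng(0)
source = rng.normal(size=10)

def edit_yaw(params: np.ndarray, target_yaw: float) -> np.ndarray:
    """Idealized yaw-only editor: changes index 0 and leaves the rest intact."""
    edited = params.copy()
    edited[0] = target_yaw
    return edited

edited = edit_yaw(source, target_yaw=0.5)

# L1 distance over the attributes that should remain fixed; for a perfectly
# disentangled direction this drift is zero, which is what Fig. A3 probes
# for the learned (non-ideal) directions.
drift = np.abs(edited[1:] - source[1:]).mean()
print(drift)  # 0.0 for this idealized editor
```

For the learned directions, the same measurement is taken between real source and reenacted images, so any non-zero drift quantifies leakage between the edited attribute and the others.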

Fig. A4: Visual examples of editing only one facial attribute, namely the yaw and pitch angles and the smile. The source images are depicted inside the red boxes.

Fig. A6: Our method can perform head pose and expression editing on real images. Specifically, we are able to edit one attribute while keeping all other attributes unchanged. The first column shows the source images, while the remaining columns show edits of different expressions and head poses.

Fig. A8: Indicative examples of source-target image pairs from our benchmark set (Benchmark-L), where the average head pose distance is larger than 10°.

Fig. A9: Comparison of the distributions of the three Euler angles (yaw, pitch, and roll) and the average head pose distance between the VoxCeleb1 dataset, the benchmark set called here Benchmark-L (average head pose distance larger than 10°), and the new benchmark set called here Benchmark-XL (yaw distance larger than 30°, pitch/roll distance larger than 20°).

Our framework is trained in several phases: (i) the first is trained on synthetic images, (ii) the second on mixed real and synthetic data (Sect. 3.2), and (iii) the third fine-tunes (ii) on paired data (Sect. 3.3). Additionally, in the proposed joint training scheme (Sect. 3.4), we fine-tune both the directions matrix A and the real-image inversion encoder E_w. Finally, in the feature-space refinement variant (Sect. 3.5), we train both the feature-space encoder E_F and the proposed Feature Transformation (FT) module. It is worth noting that during the first and second training phases we perform cross-subject training, i.e., the source and target faces have different identities. This enables our model to generalize effectively across various identities, resulting in improved performance on the challenging task of cross-subject reenactment. In the remaining training phases we perform self-reenactment, where the source and target faces are sampled from the same video. For training, we used the Adam optimizer Kingma and Ba (2015) with a constant learning rate of 10⁻⁴. All models are implemented in PyTorch Paszke et al (2019).

In this section, we compare the performance of our method against the state of the art in face reenactment on VoxCeleb1 Nagrani et al (2017). We conduct two types of experiments, namely self- and cross-person reenactment. For evaluation, we use both the video data provided by Zakharov et al (2019) and the original test set of VoxCeleb1. We note that there is no overlap between the train and test identities and videos.
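A minimal sketch of what one joint training step (fine-tuning both the directions matrix A and the inversion encoder E_w) might look like. All modules, shapes, and the L1 placeholder loss below are illustrative stand-ins, not the paper's actual generator, encoder, or objectives; only the optimizer settings (Adam, constant learning rate 10⁻⁴) come from the text.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in dimensions and modules (hypothetical, for illustration only).
latent_dim, k, img_dim = 512, 15, 48
G = nn.Linear(latent_dim, img_dim)      # stand-in for the frozen generator
E_w = nn.Linear(img_dim, latent_dim)    # stand-in for the inversion encoder
A = nn.Parameter(torch.randn(latent_dim, k) * 0.01)  # learnable directions

# A and E_w are optimized jointly with the paper's settings: Adam, lr 1e-4.
opt = torch.optim.Adam([A, *E_w.parameters()], lr=1e-4)

def step(src_img, delta_p, tgt_img):
    """One joint training step: invert, shift along directions, reconstruct."""
    w = E_w(src_img)                  # embed the source image in latent space
    w_shifted = w + delta_p @ A.T     # shift along the learned directions
    recon = G(w_shifted)              # generate the reenacted image
    loss = nn.functional.l1_loss(recon, tgt_img)  # placeholder reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

src = torch.randn(2, img_dim)
tgt = torch.randn(2, img_dim)
dp = torch.randn(2, k)
print(step(src, dp, tgt))
```

In the actual pipeline the generator stays frozen and the loss combines reconstruction, identity, style, and cycle terms; the sketch only shows how the gradient flows through both A and E_w in a single optimizer.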
to measure the perceptual similarities, and to quantify identity preservation we compute the cosine similarity (CSIM) of ArcFace Deng et al (2019) features. Moreover, we measure the quality of the reenacted images using the Fréchet Inception Distance (FID) Heusel et al (2017), and we also report the Fréchet Video Distance (FVD) Unterthiner et al (2018); Skorokhodov et al (2022), which measures both the video quality and the temporal consistency of the generated videos. To quantify head pose/expression transfer, we calculate the normalized mean error (NME) between the predicted landmarks in the reenacted and target images. We use Bulat and Tzimiropoulos (2017) for landmark estimation, and we calculate the NME by normalizing with the square root of the ground-truth face bounding-box area, scaled by a factor of 10³. We further evaluate head pose transfer by calculating the average L1 distance of the head pose orientation (Average Rotation Distance, ARD) in degrees, and expression transfer by calculating the average L1 distance of the expression coefficients p_e (Average Expression Distance, AED) and the Action Units Hamming distance (AU-H), computed as in Doukas et al
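The metrics above can be sketched as follows. The helper names and toy inputs are ours; the NME normalization follows the description in the text (square root of the ground-truth bounding-box area, scaled by 10³), and CSIM is plain cosine similarity over identity features such as ArcFace's.

```python
import numpy as np

def nme(pred_lms, gt_lms, bbox_w, bbox_h):
    """Normalized mean error between predicted and ground-truth landmarks,
    normalized by sqrt(bbox area) and scaled by 1e3, as described above."""
    norm = np.sqrt(bbox_w * bbox_h)
    per_point = np.linalg.norm(pred_lms - gt_lms, axis=1)
    return per_point.mean() / norm * 1e3

def ard(pose_a, pose_b):
    """Average Rotation Distance: mean L1 over (yaw, pitch, roll) in degrees."""
    return np.abs(pose_a - pose_b).mean()

def aed(expr_a, expr_b):
    """Average Expression Distance: mean L1 over expression coefficients p_e."""
    return np.abs(expr_a - expr_b).mean()

def csim(feat_a, feat_b):
    """Cosine similarity of identity (e.g., ArcFace) feature vectors."""
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return float(a @ b)

# Toy sanity checks: identical inputs give perfect scores.
print(csim(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(nme(np.zeros((68, 2)), np.zeros((68, 2)), 100, 100))  # 0.0
```

FID, FVD, LPIPS, and AU-H require pre-trained networks and are omitted from the sketch; for those, the text's cited implementations apply.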

Table 1: Quantitative comparisons against state-of-the-art methods, including Zakharov et al (2019), on self-reenactment. The results are reported on the combined original test set of VoxCeleb1 Nagrani et al (2017) and the test set released by Zakharov et al (2019). For the CSIM metric, higher is better (↑), while for all other metrics lower is better (↓).

Table 2: Quantitative comparisons on the benchmark set (Benchmark-L) with image pairs from the VoxCeleb1 dataset, where the average head pose distance is larger than 10°. For the CSIM metric, higher is better (↑), while for all other metrics lower is better (↓).

Table 3: Quantitative results on cross-subject reenactment. For the CSIM metric, higher is better (↑), while for all other metrics lower is better (↓).
fer, while we achieve a high score in the CSIM metric. It is worth noting that the high CSIM values for FOMM, HeadGAN, and Face2Face are not accompanied by good qualitative results, as shown in Figs. 8 and A14, where, in most cases, those methods are not able to generate realistic images.

Table 4: Quantitative comparisons of the inference time required to generate a video of 200 frames.

Table 5: Quantitative results of the various models of our work on self-reenactment (SR), self-reenactment with image pairs that have a large head pose difference (SR-large head pose), and cross-subject reenactment (CR).

Table 7: Ablation studies on self-reenactment using three different models: (a) trained on synthetic images, (b) trained on both synthetic and real images, and (c) fine-tuned on paired data.

Table 8: Ablation study on the impact of the style (L_style) and cycle (L_cycle) losses.

Table A3: Quantitative comparisons on Benchmark-XL with image pairs from the VoxCeleb1 dataset, where the distance in the yaw angle is larger than 30° and in the pitch or roll angles larger than 20°.

Table A4: Quantitative comparisons on Benchmark-XL with image pairs from the VoxCeleb2 dataset, where the distance in the yaw angle is larger than 30° and in the pitch or roll angles larger than 20°.

Table A5: Quantitative comparisons against two state-of-the-art methods for synthetic image editing, namely ID-dis Nitzan et al (2020) and StyleFlow Abdal et al (2021). For the CSIM metric, higher is better (↑), while for all other metrics lower is better (↓).