1 Introduction

Facial beautification plays an important role in our social lives. People with attractive faces enjoy advantages in social activities such as dating and voting. As an important task in facial beautification, face reshaping aims to beautify the shape of portrait faces in images and allows users to customize the beautification of individual facial features. However, most existing workflows rely on image-processing software such as Adobe Photoshop to reshape the face image, which remains laborious and time-consuming. Generating visually pleasing results with proper shape distortions usually requires professional artistic skill and subjective aesthetics, which is challenging for common users without any reference information. Therefore, it is of great value to develop a reference-based method that automatically generates a reasonable shape transformation while still allowing users to flexibly control the results.

Existing face-reshaping methods can be roughly classified into two groups: 2D-image-based methods and 3D-face-model-based methods. 2D-image-based methods, such as image morphing, derive from image-processing techniques and image analyses [1, 2]. For example, numerous works [3, 4] present image-morphing-based methods to reshape faces or clone facial expressions. Because such methods only edit facial landmarks to achieve different operations in the 2D image, they can only deal with frontal faces, for which landmarks are easy to obtain. On the other hand, 3D-model-based methods can introduce more information than 2D-image-based methods. For example, numerous works [5–7] estimate 3D face shapes from 2D images and edit the face on the 3D model. However, the performance of 3D face models is limited because their parameters exert only global control and cannot manipulate individual components [8].

To overcome the aforementioned problems, we propose a novel method named face shape transfer (FST) via semantic warping, which is capable of transferring both the overall face and individual components (e.g., eyes, nose, and mouth) of the reference image to the source image while preserving the facial structure and personal identity. As shown in Fig. 1, we utilize the cycle consistency strategy [9] and an encoder-decoder network [10] to model the shape transformation. Additionally, to enable component-level controllability, we introduce five encoding networks to learn feature embeddings for the five face components (i.e., left eye, right eye, nose, mouth, and skin) from the semantic parsing results, which preserves the original structure of each component to the greatest extent. Then, to efficiently utilize the features from semantic parsing maps at different scales, we adopt an intuitive design that directly connects all layers in the global dense network, ensuring maximum information flow between layers. Finally, we introduce a spatial transformer network to allow flexible warping operations, together with several loss functions that prevent ghosting artifacts and obvious distortions.

Figure 1

Overall pipeline of face shape transfer (FST). The pipeline consists of three parts: local embedding network, global dense network and spatial transformer network. The source mask \(\boldsymbol{P}_{\mathrm{src}}\) and reference mask \(\boldsymbol{P}_{\mathrm{ref}}\) are fed into the local embedding network to obtain the source embedding \(\boldsymbol{z}_{\mathrm{src}}\) and reference embedding \(\boldsymbol{z}_{\mathrm{ref}}\), respectively. The global dense network maps \(\boldsymbol{z}_{\mathrm{src}}\) and \(\boldsymbol{z}_{\mathrm{ref}}\) into feature maps to capture the dense correspondences in \(\boldsymbol{P}_{\mathrm{src}}\) and \(\boldsymbol{P}_{\mathrm{ref}}\). The spatial transformer network uses the mapped feature map to calculate the warping parameters θ and transform the shape via differentiable bilinear sampling operations. “Src.”, “Ref.” and “Res” represent source image, reference image, and result, respectively

The main contributions of this paper are as follows:

1) We propose a novel method named face shape transfer via semantic warping, which is capable of transferring both the overall face and individual components (e.g., eyes, nose, and mouth) of the reference image to the source image without the intermediate-representation limitations (e.g., 3D morphable models and pre-defined landmarks) of existing methods.

2) We introduce a novel spatial transformer network with two innovative loss functions: a coordinate-based reconstruction loss and a facial component loss. These allow flexible warping operations and smoother translation of all pixels in the same semantic region.

3) We contribute a large-scale, high-resolution face dataset. Both qualitative and quantitative experiments are performed on our dataset and another benchmark dataset to demonstrate the superiority of our method over other state-of-the-art methods.

2 Related work

2D-based method. Traditional 2D-image-based methods compute transformation distances from the reference face shape to that of the source based on facial landmarks. Since large-scale deformation is prone to distortion, some methods [11–13] build a facial image database of the source person to retrieve the expression most similar to the reference as the basis of deformation, and then warp and stitch existing patches together. Although these methods succeed in face mimicking for a specific source person, collecting and pre-processing such a large dataset for each source person is expensive in practice. Recently, numerous methods [14–19] based on generative adversarial networks (GANs) [20] have been developed for face reshaping. However, most GAN-based methods require large image sets for training, and the results are generated based on similar examples, which largely limits diversity and controllability. Moreover, the personal identity of the source image cannot be preserved, which is the main difference between our method and other GAN models.

3D-based method. 3D-based face reshaping methods usually reconstruct 3D face models from 2D images and then apply 3D model reshaping methods [21, 22]. Although the development of statistical shape models [23–25] and example-based models [26, 27] has matured the modeling technology, 3D face reconstruction over a wide range of poses and expressions remains a challenging ill-posed problem. These methods require a 3D morphable model to simulate the shape transformation, but this process is time-consuming and costly, which limits their application.

Local editing method. Local editing methods [28–32] address local editing (e.g., nose and background), as opposed to most GAN-based image editing methods that modify the global appearance [33–35]. For example, Editing in Style [28] tries to identify the contribution of each channel in the style vectors to specific parts. Structured noise [29] replaces the learned constant in StyleGAN with an input tensor that combines local and global codes. Meanwhile, GANs are widely leveraged to learn a mapping from a reference in the source domain to the target domain. For local editing in particular, references often take the form of semantic masks [30, 36] or hand-drawn sketches [19, 37]. In the context of semantic-guided facial image synthesis, SPADE [36] leverages semantic information to modulate the image decoder for better visual fidelity and texture-semantic alignment. SEAN [30] encodes real images into per-region style codes and manipulates them, but it requires pairs of images and segmentation masks for training. Recently, SofGAN [38] uses semantic volumes for 3D editable image synthesis. However, its interpretation of 3D geometry is still lacking, and considerable semantically labeled 3D scans are required to train the semantic rendering. In addition, there is no mechanism for preserving view consistency in the synthesized textures.

3 Method

3.1 Overall framework

Given a source image \(\boldsymbol{I}_{\mathrm{src}} \in \mathbb{R}^{3 \times H \times W}\) and a reference image \(\boldsymbol{I}_{\mathrm{ref}} \in \mathbb{R}^{3 \times H \times W}\), where W and H are the width and height of the image, FST aims to transfer the face shape of the reference image to the source image. As illustrated in Fig. 1, the inputs to FST are the semantic label masks of the source and reference images, \(\boldsymbol{P}_{\mathrm{src}} \in \mathbb{R}^{C \times H \times W}\) and \(\boldsymbol{P}_{\mathrm{ref}} \in \mathbb{R}^{C \times H \times W}\), obtained using the face parsing network [39], where C is the number of face components (e.g., eyes, nose, and mouth). First, the local embedding network learns embedding features from the five face components. Then, the features of each component are fed into the global dense network, which ensures maximum information flow between layers. Finally, the spatial transformer network performs reshaping operations on \(\boldsymbol{P}_{\mathrm{src}}\) according to the decoding results to obtain the final result \(\boldsymbol{P}_{\mathrm{res}}\); the same operations can be applied to \(\boldsymbol{I}_{\mathrm{src}}\) to obtain \(\boldsymbol{I}_{\mathrm{res}}\).
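The forward pass below is a minimal sketch of this three-stage pipeline; the module names (local_embedding, global_dense, spatial_transformer) and their internals are placeholders for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class FST(nn.Module):
    """Minimal sketch of the FST forward pass; submodules are passed in as placeholders."""

    def __init__(self, local_embedding, global_dense, spatial_transformer):
        super().__init__()
        self.local_embedding = local_embedding          # five per-component encoders
        self.global_dense = global_dense                # densely connected encoder-decoder
        self.spatial_transformer = spatial_transformer  # predicts warping parameters

    def forward(self, p_src, p_ref, i_src):
        # 1) Per-component shape embeddings of both parsing maps.
        z_src = self.local_embedding(p_src)   # (B, D)
        z_ref = self.local_embedding(p_ref)   # (B, D)

        # 2) Dense correspondence features between source and reference shapes.
        feat = self.global_dense(torch.cat([z_src, z_ref], dim=1))

        # 3) Warp both the parsing map and the image with the predicted field.
        p_res = self.spatial_transformer(feat, p_src)
        i_res = self.spatial_transformer(feat, i_src)
        return p_res, i_res
```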

3.2 Local embedding network

Since a global network is quite limited in learning and recovering all local details of each instance [40], we design a separate face component encoding strategy to preserve local details. To this end, we segment the foreground face image into five components according to the face mask, which efficiently avoids interference from other face components when transforming an individual component. For each face component, we use a dedicated auto-encoder network to learn its original structure, preserving the local relations and global shape of the input data during embedding. To balance the accuracy and efficiency of the separate encoding, the input size of each of the five components is determined by its maximum size. With separate auto-encoder networks, we can conveniently change facial components in the encoding results and recombine components from different faces. Following pix2pixHD [41], which trains an auto-encoder network to obtain a feature vector for each instance in the image, we add a component-wise average pooling layer to the output of the encoder, which computes an average feature for each face component and accommodates components of different shapes.
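A minimal sketch of one such per-component branch is given below; the layer sizes are illustrative assumptions, and only the component-wise average pooling at the end follows the description above.

```python
import torch
import torch.nn as nn

class ComponentEncoder(nn.Module):
    """Sketch of one per-component encoder branch (channel sizes are illustrative)."""

    def __init__(self, in_ch=1, feat_dim=128):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, component_mask):
        feat = self.encode(component_mask)   # (B, D, h, w)
        # Component-wise average pooling: one feature vector per component,
        # in the spirit of the pix2pixHD-style instance encoder.
        return feat.mean(dim=(2, 3))         # (B, D)

# Hypothetical usage: crop each of the five components to its fixed input size
# (e.g., 64 x 32 for the eyes) and feed it to its own encoder, then concatenate.
```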

To verify whether the shape embedding features can capture meaningful facial structure information, we first apply the mean shift clustering method [42] to group face component shapes and then apply the t-SNE [43] scheme for visualization. Figure 2 shows that face components within the same cluster share a similar facial structure, while neighboring clusters are similar in certain semantic parts.

Figure 2

Visualization of the encoding space for the face component shape
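A visualization in the spirit of Fig. 2 can be produced along the following lines; this is a sketch assuming the embeddings of one component have been exported to a NumPy array (the file name is hypothetical).

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# z: (N, 128) shape embeddings collected from one component encoder (assumed precomputed).
z = np.load("component_embeddings.npy")                     # hypothetical file

labels = MeanShift().fit_predict(z)                         # group similar component shapes
z_2d = TSNE(n_components=2, init="pca").fit_transform(z)    # 2-D layout for plotting

plt.scatter(z_2d[:, 0], z_2d[:, 1], c=labels, s=4, cmap="tab20")
plt.title("Encoding space of face component shapes")
plt.savefig("component_embedding_tsne.png", dpi=200)
```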

3.3 Global dense network

To provide more information about different instances of the same object category in the source mask \(\boldsymbol{P}_{\mathrm{src}}\) and reference mask \(\boldsymbol{P}_{\mathrm{ref}}\), we first concatenate the latent encoding vectors of the source embedding \(\boldsymbol{z}_{\mathrm{src}}\) and reference embedding \(\boldsymbol{z}_{\mathrm{ref}}\) component by component, which yields an object representation that copes with the different components. Then, to efficiently utilize the features from different-scale images, all layers in the dense blocks and dense transition layers are connected directly, ensuring maximum information flow between layers in the network. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes its own feature maps to all subsequent layers. During decoding, we combine features by concatenating them before passing them into subsequent layers, which produces a discriminative and appropriate descriptor for the affine transformation process.
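A DenseNet-style block with this connectivity might look as follows; the growth rate and number of layers are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """DenseNet-style block: every layer sees the concatenation of all earlier outputs."""

    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(),
                nn.Conv2d(ch, growth, 3, padding=1),
            ))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            y = layer(torch.cat(features, dim=1))  # direct connection to all preceding layers
            features.append(y)
        return torch.cat(features, dim=1)          # pass every feature map forward
```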

3.4 Spatial transformer network

The original spatial transformer network [44] was developed to improve object recognition performance by reducing geometric variation in the input. However, low-dimensional affine transformations [44] and homography transformations [45] fail to meet the requirement of dense and complex deformations. Therefore, we introduce a spatial transformer network that predicts warping parameters to enable shape transformation on \(\boldsymbol{I}_{\mathrm{src}}\).

Based on the decoded feature vectors of the face components, the spatial transformer network predicts dense correspondences denoted by a matrix \(\boldsymbol{C} \in \mathbb{R}^{2 \times H \times W}\). Specifically, \({\textit{C}}_{1, i, j}\) and \({\textit{C}}_{2, i, j}\) indicate the target position to which each pixel of the source image is warped. To this end, the feature vectors are fed into the localization network to obtain the transformation parameters θ for the subsequent calculation. During training, θ is updated by the facial component loss to reach the expected affine transformation matrix. With the updated affine transformation matrix, we obtain the output feature maps, and hence the pixel values. After that, a parameterized sampling grid provides the coordinate relations between \(\boldsymbol{I}_{\mathrm{src}}\) and \(\boldsymbol{I}_{\mathrm{res}}\) via pixel matching. Finally, differentiable bilinear sampling operations are carried out on the source image to obtain the final results according to these coordinate relations, which cannot be derived from the earlier spatial transformer networks [44, 45].
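The warping step can be sketched with PyTorch's differentiable sampling primitives as below; for brevity the sketch predicts a single affine θ, whereas the network described above produces a denser correspondence map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpHead(nn.Module):
    """Sketch of the warp: a localization head predicts parameters and the image is
    resampled with differentiable bilinear sampling (single affine warp shown here)."""

    def __init__(self, feat_dim):
        super().__init__()
        self.loc = nn.Linear(feat_dim, 6)   # theta for a 2x3 affine matrix
        self.loc.weight.data.zero_()
        # Initialize to the identity transform so training starts from "no warp".
        self.loc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, feat, image):
        theta = self.loc(feat).view(-1, 2, 3)
        grid = F.affine_grid(theta, image.size(), align_corners=False)
        return F.grid_sample(image, grid, mode="bilinear", align_corners=False)
```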

3.5 Loss functions

To make \(\boldsymbol{P}_{\mathrm{res}}\) similar to \(\boldsymbol{P}_{\mathrm{ref}}\), our objective contains four terms: reconstruction loss for preserving the semantic information during shape transformation, adversarial loss for preserving the local structure, cycle-consistency loss for reducing unreasonable operations, and facial component loss for reducing the artifacts inside each facial part.

Reconstruction loss. We compute the widely-used \(\mathcal{L}_{1}\) loss [46] and perceptual loss [47] as our reconstruction loss \(\mathcal{L}_{\mathrm{rec}}^{\mathrm{p}}\) to preserve the global semantic information, which is defined as follows:

$$ \mathcal{L}_{\mathrm{rec}}^{\mathrm{p}}=\lambda _{\mathrm{l1}}\|\boldsymbol{P}_{ \mathrm{ref}}-\boldsymbol{P}_{\mathrm{res}}\|_{1}+\lambda _{\mathrm{per}}\|\phi ( \boldsymbol{P}_{\mathrm{ref}})-\phi (\boldsymbol{P}_{\mathrm{res}})\|_{1}, $$
(1)

where ϕ is a pre-trained VGG-19 network [48], and \(\lambda _{\mathrm{l1}}\) and \(\lambda _{\mathrm{per}}\) denote the loss weights of the \(\mathcal{L}_{1}\) loss and perceptual loss, respectively. However, the above reconstruction loss does not consider the position variance of small facial components. For example, it can easily capture differences between skin regions but fails to capture changes in the left eye. Following Chu et al. [49], we compute the distance between the center points of the same component in \(\boldsymbol{P}_{\mathrm{res}}\) and \(\boldsymbol{P}_{\mathrm{ref}}\). First, we calculate the average coordinate \(\left (x^{c}, y^{c}\right )\) of each face component c and regard it as its central location as follows:

$$ \left \{ \textstyle\begin{array}{l} x^{c}=\frac{1}{H \times W} \sum \limits _{j=1}^{H} \sum \limits _{k=1}^{W} P^{(c, j, k)} \times j, \\ y^{c}=\frac{1}{H \times W} \sum \limits _{j=1}^{H} \sum \limits _{k=1}^{W} P^{(c, j, k)} \times k, \end{array}\displaystyle \right . $$
(2)

where P indicates the parsing map. Then, the location-based reconstruction loss, which involves the five components of each of the two images (ten in total), is

$$ \mathcal{L}_{\mathrm{rec}}^{\mathrm{l}}=\frac{1}{\lambda _{\mathrm{l}}} \sum _{c=1}^{C} \left (\left \|x_{\mathrm{res}}^{c}-x_{\mathrm{ref}}^{c}\right \|_{2}+ \left \|y_{\mathrm{res}}^{c}-y_{\mathrm{ref}}^{c}\right \|_{2}\right ), $$
(3)

where \(\lambda _{\mathrm{l}}\) is equal to the pixel ratio of each face component in the whole image, \(x_{\mathrm{res}}^{c}\) and \(y_{\mathrm{res}}^{c}\) indicate the average coordinates in the final result, \(x_{\mathrm{ref}}^{c}\) and \(y_{\mathrm{ref}}^{c}\) indicate the average coordinates in the reference input, and c indexes the components. Finally, the full reconstruction loss is

$$ \mathcal{L}_{\mathrm{r}}= \lambda _{\mathrm{p}} \mathcal{L}_{\mathrm{rec}}^{\mathrm{p}}+ \lambda _{\mathrm{l}} \mathcal{L}_{\mathrm{rec}}^{\mathrm{l}}, $$
(4)

where \(\lambda _{\mathrm{p}}\) and \(\lambda _{\mathrm{l}}\) are weights for the loss terms.
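A sketch of the component-center computation and the location-based term of Eqs. (2) and (3) is given below; interpreting the pixel-ratio weight as one value per component and normalizing by H × W (as written in Eq. (2)) are assumptions carried over from the text.

```python
import torch

def component_centers(p):
    """Average coordinates of each channel of a parsing map p of shape (B, C, H, W),
    following Eq. (2): x^c sums over the row index, y^c over the column index."""
    B, C, H, W = p.shape
    rows = torch.arange(H, device=p.device).view(1, 1, H, 1)
    cols = torch.arange(W, device=p.device).view(1, 1, 1, W)
    cx = (p * rows).sum(dim=(2, 3)) / (H * W)   # (B, C)
    cy = (p * cols).sum(dim=(2, 3)) / (H * W)   # (B, C)
    return cx, cy

def location_loss(p_res, p_ref, weights):
    """Eq. (3): distance between component centers, weighted by the per-component
    pixel ratios `weights` (shape (C,), an assumption of this sketch)."""
    x_res, y_res = component_centers(p_res)
    x_ref, y_ref = component_centers(p_ref)
    dist = (x_res - x_ref).abs() + (y_res - y_ref).abs()
    return (dist / weights).sum(dim=1).mean()
```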

Adversarial loss. After preserving the global structure, we use the adversarial loss to address the local structure. In this work, we derive a discriminator similar to that of StyleGAN [50], applied to the face parsing result. Specifically, we use adversarial learning to approximate the distribution of \(\boldsymbol{P}_{\mathrm{res}}\) to that of the reference mask \(\boldsymbol{P}_{\mathrm{ref}}\) and calculate the adversarial loss \(\mathcal{L}_{\mathrm{a}}\). The training strategy is the same as that of the WGAN [51] model.

Cycle consistency loss. Following Chu et al. [49], we compute the cycle consistency loss to maintain the integrity of the original data during transformation, which is illustrated in Fig. 3. Specifically, \(\boldsymbol{P}_{\mathrm{res}}\) is input into the encoder network to obtain the encoding result \(\boldsymbol{z}_{\mathrm{res}}\). Next, we concatenate \(\boldsymbol{z}_{\mathrm{res}}\) and \(\boldsymbol{z}_{\mathrm{src}}\), feeding them into the global dense network to reconstruct the source parsing map, denoted as \(\boldsymbol{P}_{\mathrm{cyc}}\). Finally, we apply the pixel-wise reconstruction loss Eq. (1) to compute the loss between \(\boldsymbol{P}_{\mathrm{cyc}}\) and \(\boldsymbol{P}_{\mathrm{src}}\), referred to as \(\mathcal{L}_{\mathrm{cyc}}\).

Figure 3

Illustration of the cycle consistency loss. \(\boldsymbol{P}_{\mathrm{cyc}}\) indicates the reconstructed source parsing map
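A minimal sketch of this cycle, reusing the placeholder modules from the pipeline sketch in Sect. 3.1 and simplifying the pixel-wise reconstruction loss of Eq. (1) to an L1 term, might read as follows.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(p_res, z_src, p_src,
                           local_embedding, global_dense, spatial_transformer):
    """Sketch of L_cyc: re-encode the result, map it back with the source embedding,
    and compare the reconstructed parsing map with the source."""
    z_res = local_embedding(p_res)                         # encode the transferred shape
    feat = global_dense(torch.cat([z_res, z_src], dim=1))  # map back toward the source
    p_cyc = spatial_transformer(feat, p_res)               # reconstructed source parsing map
    return F.l1_loss(p_cyc, p_src)                         # simplified pixel-wise term
```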

Facial component loss. To further enhance perceptually significant facial components, we introduce a facial component loss for the left eye, right eye, nose, and mouth. We first crop the regions of interest with ROI alignment [52] and train a separate small local discriminator for each region to distinguish whether the restored patches are real, pushing them closer to real facial component shapes. Because the above loss functions only act on the face parsing result, they constrain the result only in the semantic space. We therefore also build a coordinate map for \(\boldsymbol{P}_{\mathrm{src}}\) as \(\boldsymbol{M}_{\mathrm{src}} \in \mathbb{R}^{2 \times H \times W}\); similarly, \(\boldsymbol{M}_{\mathrm{cyc}}\) can be obtained. We add this spatial coordinate term to ensure that the coordinates of different facial components are well aligned. The facial component loss is defined as follows:

$$ \begin{aligned} \mathcal{L}_{\mathrm{fp}}= \sum _{ROI}\log \left (1-D_{ROI} \left (\boldsymbol{P}_{\mathrm{res}}^{ROI}\right )\right ) + \\ \frac{1}{H \times W} \sum _{i=1}^{H} \sum _{j=1}^{W}\left \| \boldsymbol{M}_{\mathrm{src}}^{(i, j)}-\boldsymbol{M}_{\mathrm{cyc}}^{(i, j)} \right \|_{1}, \end{aligned} $$
(5)

where \(ROI\) represents the region of the facial component (e.g., left eye, right eye, nose and mouth), and \(D_{ROI}\) is the local discriminator for each region.
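Equation (5) can be sketched with torchvision's ROI alignment and small local discriminators; the box format, crop size, and the assumption that each discriminator outputs a probability in (0, 1) are illustrative choices, not the exact implementation.

```python
import torch
from torchvision.ops import roi_align

def facial_component_loss(p_res, rois, local_discriminators, m_src, m_cyc):
    """Sketch of Eq. (5). `rois` maps each component name to a (K, 5) box tensor in
    (batch_index, x1, y1, x2, y2) format, `local_discriminators` maps it to a small
    patch discriminator assumed to output a probability in (0, 1), and m_src / m_cyc
    are the 2 x H x W coordinate maps described in the text."""
    adv = 0.0
    for name, boxes in rois.items():
        patches = roi_align(p_res, boxes, output_size=(32, 32))   # crop the component region
        adv = adv + torch.log(1.0 - local_discriminators[name](patches) + 1e-8).mean()
    # Coordinate-map term: keep the reconstructed coordinates aligned with the source.
    coord = (m_src - m_cyc).abs().mean()
    return adv + coord
```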

Since the constructed pixels may not be well aligned in the coordinate space, we add a spatial coordinate loss when computing the cycle consistency. We construct a coordinate map \(\boldsymbol{M}_{\mathrm{src}} \in \mathbb{R}^{2 \times H \times W}\), where \(\boldsymbol{M}_{\mathrm{src}}^{(i, j)}=(i, j)\) indicates the spatial coordinate. After obtaining the reconstructed \(\boldsymbol{P}_{\mathrm{cyc}}\), we convert it to a coordinate map \(\boldsymbol{M}_{\mathrm{cyc}}\). Since \(\boldsymbol{M}_{\mathrm{cyc}}\) has already been mapped by the global dense network, it may not be as well aligned as \(\boldsymbol{M}_{\mathrm{src}}\). Inspired by Ref. [49], we minimize this distance by optimizing the following function:

$$ \mathcal{L}_{\mathrm{s}}=\frac{1}{H \times W} \sum _{i=1}^{H} \sum _{j=1}^{W} \left \|\boldsymbol{M}_{\mathrm{src}}^{(i, j)}-\boldsymbol{M}_{\mathrm{cyc}}^{(i, j)}\right \|_{2}. $$
(6)

This spatially-variant consistency loss in the coordinate space can constrain the per-pixel correspondence to be one-to-one and reversible, which reduces the artifacts inside each facial part.

Overall loss functions. The overall loss function for the proposed method is as follows:

$$ \mathcal{L}_{\text{shape}}=\lambda _{\mathrm{r}} \mathcal{L}_{\mathrm{r}}+ \lambda _{\mathrm{a}} \mathcal{L}_{\mathrm{a}}+\lambda _{\mathrm{cyc}} \mathcal{L}_{ \mathrm{cyc}}+\lambda _{\mathrm{fp}} \mathcal{L}_{\mathrm{fp}} + \lambda _{\mathrm{s}} \mathcal{L}_{\mathrm{s}}, $$
(7)

where \(\lambda _{\mathrm{r}}\), \(\lambda _{\mathrm{a}}\), \(\lambda _{\mathrm{cyc}}\), \(\lambda _{\mathrm{fp}}\) and \(\lambda _{\mathrm{s}}\) are the weights balancing the corresponding loss terms.

3.6 Datasets

With the development of mobile devices, the quality and resolution of images have been greatly improved, such that the existing benchmark datasets, except CelebAMask-HQ [53], struggle to meet our requirements. However, CelebAMask-HQ suffers from the problem that only a small number of its images can serve as reference images, which restricts the generalizability of our method due to the obvious imbalance between source and reference images. To further train our method, we construct a large-scale face dataset that contains 14,000 high-resolution face images from different countries and regions in Asia. The collected dataset comprises in-the-wild images, and the ages span from 5 to 50 years old, covering the main age groups with a high demand for face editing. Meanwhile, the subjects range from ordinary people to celebrities, which meets the requirements of different degrees of shape transformation. These images vary in resolution and visual quality, ranging from 32 × 32 to 5760 × 3840. Some show crowds of several people, whereas others focus on the face of a single person. Thus, several image processing steps are necessary to ensure consistent quality and to center the image on the face region. To improve the overall image quality, we preprocess each JPEG image using two pre-trained neural networks: a convolutional autoencoder trained to remove JPEG artifacts in natural images and an adversarially trained super-resolution network. To handle cases where the facial region extends outside the image, we employ padding and filtering to extend the dimensions of the images. Then, we select an oriented crop rectangle based on the facial landmark annotations, transform it to 4096 × 4096 pixels using bilinear filtering, and scale it to 1024 × 1024 resolution using a box filter. We perform the above processing for all 32,569 collected images. We further analyze the resulting 1024 × 1024 images to estimate the final image quality, sort the images accordingly, and discard all but the best 14,000 images.
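The geometric part of this pipeline can be sketched with Pillow as below; the padding amount and the axis-aligned `crop_box` are hypothetical stand-ins for the oriented, landmark-based crop, and the de-JPEG and super-resolution networks are omitted.

```python
from PIL import Image, ImageOps

def preprocess(path, crop_box, out_path):
    """Pad so the face-aligned crop fits inside the image, crop, upsample to
    4096 x 4096 with bilinear filtering, then downscale to 1024 x 1024 with a
    box filter, as described in the text."""
    img = Image.open(path).convert("RGB")
    pad = 256                                    # illustrative padding amount
    img = ImageOps.expand(img, border=pad, fill=0)
    x1, y1, x2, y2 = [c + pad for c in crop_box] # shift the crop by the padding
    face = img.crop((x1, y1, x2, y2))
    face = face.resize((4096, 4096), Image.Resampling.BILINEAR)
    face = face.resize((1024, 1024), Image.Resampling.BOX)
    face.save(out_path)
```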

4 Experiments

4.1 Experiment details

We implement our method in PyTorch and train the model on a single Nvidia RTX TITAN GPU. For the encoder, we use five basic dense blocks [54] to extract local features and then a flatten layer to obtain a 128-dimensional vector for each face component. The decoder consists of five symmetric dense blocks and dense transition layers with transposed convolution layers for upsampling. During training, we use the Adam optimizer [55] and set the batch size to 32. We set \(\lambda _{\mathrm{r}} = 200\), \(\lambda _{\mathrm{a}} = 1\), \(\lambda _{\mathrm{cyc}} = 1\) and \(\lambda _{\mathrm{s}} = 200\). Similar to CycleGAN [9], we set the initial learning rate to 0.0001, fix it for the first 40 epochs, and linearly decrease it over another 40 epochs.
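The reported optimization schedule corresponds to a training loop along the following lines; `model`, `loader`, and `compute_shape_loss` are placeholders for the network, the data pipeline, and the overall loss of Eq. (7).

```python
import torch

def train(model, loader, compute_shape_loss, epochs=80):
    """Training-loop sketch: Adam with lr 1e-4, fixed for the first 40 epochs,
    then decayed linearly to zero over another 40 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    lr_lambda = lambda e: 1.0 if e < 40 else max(0.0, 1.0 - (e - 40) / 40.0)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for epoch in range(epochs):
        for p_src, p_ref, i_src in loader:                      # batches of 32
            optimizer.zero_grad()
            loss = compute_shape_loss(model, p_src, p_ref, i_src)  # Eq. (7)
            loss.backward()
            optimizer.step()
        scheduler.step()
```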

The evaluation dataset CelebAMask-HQ contains 30,000 aligned facial images of size 1024 × 1024 and the corresponding semantic segmentation labels of size 512 × 512. Each label comprises 19 classes (e.g., skin, eyes, eyebrows, and hair). In our experiments, five components are considered: eyes (left eye and right eye), nose, mouth (upper lip, mouth, and lower lip), and skin. We obtain a rough version of the face components from the semantic segmentation labels by an image dilation operation, which yields the mask images. We take 2000 images as the test set for performance evaluation, and all images are resized to 256 × 256. In our experiments, the input size of each of the five face components is determined by its maximum size. Specifically, we use 64 × 32, 64 × 32, 128 × 64, 64 × 128, and 256 × 256 for the left eye, right eye, nose, mouth, and skin, respectively, in our local embedding network. On a single Nvidia RTX TITAN GPU, the runtime for a 1024 × 1024 source image is 0.4 s, including 0.1 s for face parsing of both the source and reference images and 0.3 s for shape transformation.
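A sketch of the mask extraction step with OpenCV is given below; the kernel size, the number of dilation iterations, and the interpretation of the component sizes as (height, width) are assumptions of this sketch.

```python
import cv2
import numpy as np

# Fixed per-component input sizes used by the local embedding network
# (order assumed to be height x width).
COMPONENT_SIZES = {
    "left_eye": (64, 32), "right_eye": (64, 32),
    "nose": (128, 64), "mouth": (64, 128), "skin": (256, 256),
}

def rough_component_mask(label_map, class_ids, size, dilate_iter=2):
    """Binary mask for one component from a parsing label map (H, W) of class indices,
    slightly dilated as described in the text, then resized to its fixed input size."""
    mask = np.isin(label_map, class_ids).astype(np.uint8) * 255
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8), iterations=dilate_iter)
    h, w = size
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
```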

4.2 Overall face shape transfer

Qualitative evaluation. To demonstrate the superiority of our method, we compare the quality of sample results with several benchmark methods: the parametric weight-change face reshaping method [22], MaskGAN [53], EditGAN [17], and DeepFaceEditing [19]. As shown in Fig. 4, the 3D modeling method only reshapes the face contour region. This has two effects: first, it limits the reshaping diversity and the global coordination of the results because there are no warping operations outside that region; second, the background is easily affected because half of the operation region lies on the background. In addition, the error between 3D modeling and 2D projection is unavoidable, and many tiny distortions appear in the contour region of the face. MaskGAN fails to preserve the texture of the source image and performs poorly under large deformation. Meanwhile, EditGAN and DeepFaceEditing suffer from artifacts and blurring between the face and the background. Our method outperforms these methods in both realism and fidelity. Owing to the mask-guided operation, our method is not limited by large-scale shape transformations.

Figure 4

Results for overall face shape transformation. From left to right, we show the source images, reference images and results from MaskGAN [53], EditGAN [17], DeepFaceEditing [19] and our method. (Source from CelebAMask-HQ [53])

Quantitative evaluation. To demonstrate the effectiveness of the shape transformation, we calculate the cosine similarity and the structural similarity index measure (SSIM) between the edited and reference masks, where a greater score represents higher similarity. We first select 600 pairs of faces from the testing set of CelebAMask-HQ; each pair contains a source mask, a reference mask, and a modified mask, with operations performed on the whole face. The evaluation results are listed in Table 1, where the scores are averaged over all experiments. According to the cosine similarity and SSIM scores of the two evaluation categories, our method performs the shape transformation effectively because the edited mask is more similar to the reference mask than the source mask is. In other words, our method successfully transfers the reference face shape to the source. However, due to differences in hairstyle and face orientation, the edited masks are not completely identical to the reference masks.
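The two scores can be computed roughly as follows; how the masks are encoded before comparison (integer label maps here) is an assumption of this sketch, not the paper's exact protocol.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mask_similarity(p_a, p_b):
    """Cosine similarity and SSIM between two parsing masks given as (H, W) arrays
    of integer class labels."""
    a = p_a.astype(np.float64).ravel()
    b = p_b.astype(np.float64).ravel()
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    ssim = structural_similarity(
        p_a.astype(np.float64), p_b.astype(np.float64),
        data_range=float(p_a.max() - p_a.min()) + 1e-8,
    )
    return cosine, ssim
```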

Table 1 Evaluation of the effectiveness of shape transformation. SSIM is structural similarity index measure. \({\boldsymbol{P}}_{\mathrm{src}}\), \({\boldsymbol{P}}_{\mathrm{ref}}\), and \({\boldsymbol{P}}_{\mathrm{res}}\) represent the parsing map of the source, reference, and result images, respectively

Personal identity preservation. When people reshape their faces, they want to beautify themselves while still being recognized as the same person, rather than becoming too similar to the reference faces. Our method therefore has to balance these two targets, which makes identity preservation crucial. To evaluate the identity preservation ability, we conduct an additional face verification experiment via a person re-identification (Re-ID) method [56]. In the experimental setting, we first select 600 pairs of faces from the testing set of CelebAMask-HQ; each pair contains an unmodified face and a modified face, where the operations of the different methods are kept at the same level on the whole face. The Re-ID accuracies are listed in Table 2, which shows that FST preserves the original identity better than the other state-of-the-art face manipulation methods. To further demonstrate that FST achieves noticeable shape transfer according to the reference image, we also report the Re-ID accuracy of the source images. As shown in Table 2, the source images achieve very high Re-ID accuracy, demonstrating that FST does not merely perform small manipulations to obtain a higher Re-ID score. Additionally, we evaluate the inference speed and find that the encoder-decoder structure improves the re-identification scores without extra time cost. FST thus balances the preservation of the source identity, the degree of face reshaping, and efficiency.

Table 2 Evaluation of the personal identity preservation of generated images. The best results are marked in bold. Re-ID is re-identification

User study. To further evaluate the image quality of the above-mentioned methods, we collect 2000 pairwise comparisons from a total of 40 participants under the same time and environment conditions. For each subject, we first show a reference image as an instruction to guide the users. During the study, we randomly choose two of the methods and present one result for each. We then ask each subject to select the result that better reflects the target mask and can still be recognized as the same person, in terms of face component shape and global coordination. The results in Table 3 show that FST performs favorably against the state-of-the-art methods, meaning that we achieve high-quality face shape transfer and high-fidelity identity preservation simultaneously.

Table 3 User study comparing our method with different reference-based image editing methods

4.3 Individual component shape transfer

Qualitative evaluation. To demonstrate the ability to adjust facial components, we choose the mouth region as the target and compare the component editing results of different state-of-the-art methods. First, we directly replace the masks of the mouth, upper lip, and lower lip in the source mask with those from the reference mask. Figure 5 shows the visual results of FST and the other state-of-the-art methods. Given the wide-open reference mouth, MaskGAN [53], FENeRF [57], and DeepFaceEditing [19] all generate teeth between the upper and lower lips. However, their limited performance makes the results unrealistic due to obvious distortions and color errors. FENeRF [57] loses all background information, and distortions occupy most of the contour region. Considering that DeepFaceEditing [19] is an image generation method, it should, with the help of sketches, provide high-quality results; however, its inability to recover mouth details makes it fail in this task. Compared with the above methods, our method uses parsing masks to control the shape and thus produces better results while preserving the facial structure and personal identity.

Figure 5

Magnified image of individual component shape transformation. We show the source images, reference images and results from MaskGAN [53], FENeRF [57], DeepFaceEditing [19] and our method. (Source from CelebAMask-HQ)

Quantitative evaluation. To measure the generation quality of the different models, we introduce the Fréchet inception distance (FID) [58] and sliced Wasserstein distance (SWD) [59] into the quantitative evaluation. Table 4 reports the comparison results when reshaping the mouth region. MaskGAN [53] produces plausible results but sometimes cannot accurately transfer the mouth shape because it exchanges attributes of the source image in latent space. FENeRF [57] achieves a good score but fails in the mouth region; the performance of EditGAN [17] may be influenced by the size of the training data and the network design. DeepFaceEditing [19] has inferior reconstruction ability compared with the other methods because the target image does not provide spatial information to learn a better mapping with the user-defined mask.
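FID can be computed, for example, with torchmetrics as sketched below; the paper does not state which FID implementation it uses, so this is only one reasonable choice, and the SWD computation is omitted.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_images, generated_images):
    """FID sketch with torchmetrics; inputs are uint8 tensors of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    return float(fid.compute())
```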

Table 4 Quantitative evaluation of the final results. ↓ indicates that a lower value is better and vice versa. FID is Fréchet inception distance and SWD is sliced Wasserstein distance

4.4 Ablation study

Loss functions. To qualitatively demonstrate the contribution of the loss functions, we randomly select 70 reference images to guide the source image in shape transformation on the whole face. We provide visual comparisons in Fig. 6 to verify the effectiveness of the designed loss functions. Here, we gradually add \(\mathcal{L}_{\mathrm{r}}\), \(\mathcal{L}_{\mathrm{cyc}}\) and \(\mathcal{L}_{\mathrm{fp}}\) to the training. Using only \(\mathcal{L}_{\mathrm{r}}\) leads to many more artifacts, e.g., around the eye and skin regions, because the method only learns the semantic difference and is unaware of the facial structure. After adding \(\mathcal{L}_{\mathrm{cyc}}\), the mapping functions are further constrained and produce fewer distortions than before. The transformation results appear better aligned and more visually pleasing when \(\mathcal{L}_{\mathrm{fp}}\) is added, because \(\mathcal{L}_{\mathrm{fp}}\) penalizes distortions and artifacts and enforces a smoother contour for all pixels in the same region.

Figure 6

Results for loss ablation. From left to right, we show a source image, a reference image, and the results, which gradually add the full reconstruction loss \(\mathcal{L}_{\mathrm{r}}\), the cycle consistency loss \(\mathcal{L}_{\mathrm{cyc}}\), and the facial component loss \(\mathcal{L}_{\mathrm{fp}}\). Ref. denotes the reference image. (Source from CelebAMask-HQ)

Encoder networks. To demonstrate the strength of the local embedding network, we randomly select 70 reference images to guide the source image in shape transformation on the nose. We provide visual comparisons in Fig. 7 to verify the effectiveness of the designed encoder network. When ResNet [60] is used for global encoding, artifacts cannot be avoided on either side of the nose, because it is difficult to accurately locate the nose-related features in the global encoding vectors. Although the artifacts are small, their impact on the final results is obvious, since faces exhibit complex multidimensional visual patterns and abnormal parts are easy to notice. Therefore, the global encoding method cannot reshape individual face components without affecting other components. When ResNet is used for local encoding, the face orientation becomes the most crucial factor influencing the final results. As shown in Fig. 7, if the face is turned slightly, the direction of the nose bridge is distorted. Compared with the two methods mentioned above, our method learns the original structure of each component, which helps preserve the facial structure.

Figure 7

Results of encoder ablation. From left to right, we show the source images, reference images and results of our method, the global encoding network using ResNet [60] and the local encoding network using ResNet. (Source from CelebAMask-HQ)

Shape transformation networks. To demonstrate the strength of the spatial transformer network, we randomly select 70 reference images to guide the source image in shape transformation on the whole face. We provide visual comparisons in Fig. 8 to verify the effectiveness of the designed shape transformation network. When affine or perspective transformations are used to reshape the face, distortions and artifacts cannot be avoided, and pixels in the same region do not form a smooth contour, especially around the nose. Moreover, it is easy to see that our spatial transformer network achieves denser and more accurate shape transformation on the source image than the other shape transformation networks.

Figure 8

Results of shape transformation networks ablation. From left to right, we show source images, reference images, our results, affine transformation results with 6 parameters and perspective transformation results. (Source from CelebAMask-HQ)

5 Conclusion and discussion

In this work, we propose face shape transfer (FST), a novel face-reshaping framework that obtains high-quality results and overcomes the limitations of existing methods regarding the degree of shape transformation and the need for a precise intermediate representation. Through separate face component encoding networks, FST extracts the original structure of each component, which preserves the facial structure and personal identity of the source image. Meanwhile, a novel spatial transformer network with a coordinate-based reconstruction loss and a region-based facial component loss is introduced to transmit and fuse the features of the source and reference images, further boosting the performance of face reshaping. In addition, we show that the learned embedding space of the semantic parsing map allows us to directly manipulate the parsing map and generate shape changes according to the user's preference. Extensive experiments demonstrate that our framework achieves state-of-the-art face-reshaping results with observable geometric changes.

Since our method operates on shape, it struggles with shapeless attributes (e.g., hairstyle and skin color) and thus cannot handle editing tasks such as changing the complexion or hairstyle. Additionally, due to the lack of corresponding training data, obvious facial occlusion and rotation also pose great challenges to our method. To mitigate these shortcomings, we will focus on attribute disentanglement and on eliminating dataset bias, leading to more robust and accurate face editing.