Introduction

Facial expression manipulation aims to transfer the facial expression of a target image onto a source image while maintaining the source image’s original identity information. It has significant potential in various applications, such as film production, electronic games, and short videos. Research in the field of artificial intelligence has made significant progress [1,2,3,4]. However, since facial expressions are complex and can introduce ambiguity into facial structure, it remains challenging to produce accurate facial expression images with photo-realistic textures and faithful structures.

State-of-the-art facial expression manipulation methods can be divided into two categories: message judgment methods and sign judgment methods.

  (i) Message judgment methods focus on directly encoding expression features and thus learning expression patterns of face images. For example, StarGAN [5] was proposed to transfer expressions between different image domains under the guidance of discrete expression labels, e.g., happy, angry, or sad. However, message judgment methods fail to perform continuous expression editing and cannot guarantee the quality of generated images.

  (ii) Sign judgment methods estimate facial action unit (AU) signals to synthesize expressions. For example, Pumarola et al. [6] utilized AU intensities as guidance to synthesize facial expression images. They incorporated attention mechanisms into the generator’s last layer, enabling the manipulation of images with complex backgrounds. However, labeling AUs for face datasets requires enormous manual effort. Meanwhile, learning only global AU features limits local expression editing performance.

Research [7] has shown that human attention naturally focuses on specific facial regions when recognizing and distinguishing facial expressions. For example, the eyes play a vital role in analyzing fear, while the mouth is crucial for identifying happiness. Motivated by this observation, we propose to extract local facial features of key facial regions, i.e., the eye, mouth, and cheek regions, and then inject target AU signals into these local features for facial expression manipulation. In this fashion, expression features can be purposefully integrated into the corresponding facial regions, facilitating fine-grained facial expression manipulation.

Previous methods mainly focus on the whole face and neglect local facial parts, resulting in overlapping and blurring of local facial regions in the generated results. To address this problem, we propose a local semantic segmentation mask-based GAN (LSGAN). Our LSGAN captures texture details of key facial regions through several networks designed around local semantic regions, and uses reconstruction networks to further preserve the structural information of the image. Accordingly, our LSGAN comprises a semantic mask generator (SMG), an adversarial autoencoder (AAE), a transformative generator (TG), and an AU-intensity discriminator (AUD). First, we design the SMG to generate masks of key facial regions of source facial images, i.e., the eye, mouth, and cheek regions. Then, we propose the AAE to map facial masks into structured latent codes. In addition, we introduce ME-graphAU [8] to predict the AU intensity of target images. Afterward, our TG integrates target AU-intensity labels and the corresponding source facial region codes to generate the desired facial expression images. Furthermore, we design the AUD to capture facial expression variations and evaluate the quality of generated images. During training, we introduce reconstruction losses to preserve the identity and structure information of the generated faces. Studies [9, 10] have made significant contributions to function optimization; building on them, we adopt the adaptive moment estimation (Adam) solver to optimize the loss function of each module.

The main contributions of our work are threefold:

  • We propose a facial expression manipulation method, dubbed LSGAN, to generate target facial expression images with photo-realistic textures and faithful structures. LSGAN combines key facial region masks with target AU-intensity labels to achieve facial expression manipulation.

  • We design an AAE to generate latent codes of facial semantic masks. In particular, our TG integrates these latent codes with target expression labels to generate the desired facial expression images, thus alleviating the correspondence ambiguity between source and target expression faces. We introduce self-reconstruction and cyclic reconstruction, with the same local network structure as the generator, to ensure the stability of our generation network and maintain the feature and structural invariance of unrelated regions.

  • Our experiments demonstrate that LSGAN achieves better facial expression manipulation performance. The average MSE over the 16 AU intensities of our method is 0.018, which is lower than that of state-of-the-art methods. We also demonstrate the importance of the proposed facial region partition in facial expression synthesis.

Related work

Given two unpaired images (\(I_i,I_t\)), our main goal is to generate a new image \(\widetilde{I_i}\) with the facial expression of \(I_t\) while preserving the identity information of the original image \(I_i\). Because differences in facial expressions typically occur in key areas such as the eyes and mouth rather than uniformly across the entire face, we perform attribute classification on faces to capture the features of these key regions. These key region features are then exploited during facial expression manipulation. Therefore, the related work mainly covers two aspects: facial attribute classification and facial expression manipulation.

Facial attribute classification

Yang et al. [11] focused on facial attribute recognition using deep convolutional neural networks. The primary objective of their study is to attain a heightened response within facial regions and consequently generate candidate face windows. However, the complexity of the CNN structure results in significant time costs in practice. To address this limitation, Zhang et al. [12] proposed a framework that integrates face detection and alignment by employing unified cascaded CNNs and multi-task learning, aiming to streamline the process and improve efficiency. While the aforementioned methods effectively acquire feature points for facial attributes, the resulting attribute regions often contain redundant components. Aggarwal [13, 14] applied deep learning methods to crop segmentation and achieved relatively complete region segmentation results.

To enhance accuracy and alleviate the impact of overlapping regions, Zhao et al. [15] utilized a semantic segmentation method for facial attribute classification. Their method introduces global pyramid pooling, which offers additional contextual information, and also proposes a deeply supervised optimization strategy designed for ResNet-based fully convolutional networks (FCNs). To tackle the drawbacks of hole convolution in semantic segmentation, Lin et al. [16] proposed a multi-path refinement network called RefineNet, which explicitly incorporates all the information derived from the downsampling process and employs long-range residual connections to achieve accurate high-resolution predictions. RefineNet only utilizes the residual layers of a conventional ResNet, thereby avoiding the computational costs associated with hole convolution.

To further reduce the computational time of semantic segmentation networks, Yu et al. [17] introduced the Bilateral Segmentation Network (BiSeNet), with the objective of striking a balance between accuracy and speed. As an enhancement of BiSeNet, Yu et al. [18] proposed BiSeNetV2, which features a more concise bilateral structure. Their work introduces multiple auxiliary training branches to enhance the feature extraction capabilities of various shallow networks, and designs an efficient feature fusion module to effectively integrate spatial detail information with high-level semantic information. However, the simplistic feature extraction framework used in these methods leads to a decline in accuracy, and the attention mechanism within the network leaves room for further improvement.

Facial expression manipulation

Generative adversarial networks have emerged as the prevailing approach in facial expression manipulation. Expanding upon classic GAN architectures, numerous variants have been devised to further improve performance. One such variant is conditional GAN (cGAN), which introduces additional condition information to control the distribution of generated data. In recent years, there has been a proliferation of studies utilizing cGAN for facial expression synthesis.

Fig. 1
figure 1

The architecture of our framework, which consists of a mask generator, an AAE, a transformative generator and an AU-intensity discriminator. Given a source face image \(I_i\) and a target AU-intensity vector \(u_t\): \(\{u_{t_e}, u_{t_m}\}\), we concatenate each latent code from a local region with the corresponding AU vector to generate a new image \(\widetilde{I_i}\)

Fig. 2
figure 2

An overview of SMG. a Network architecture. b Components of attention refinement modules

Prominent approaches, such as StarGAN [5] and AttGAN [19], employ generator networks that utilize input images and target domain information to generate images in diverse domains. For instance, StarGAN [5] utilizes facial images and target facial attributes to enable attribute editing through a single generator and discriminator. AttGAN [19] adopts an encoder-decoder architecture similar to StarGAN’s but represents facial attributes using latent representations. Ding et al. [20] modeled the intensity of facial expressions to generate a wider range of expressions, but the global expression in their method only describes the overall facial emotion, resulting in a limited ability to capture fine details. Geng et al. [21] used the 3D Morphable Model (3DMM) to fit an image and then re-render it with the desired expression. Another model, LGP-GAN [22], employs a two-stage cascaded structure and integrates both local and global perception to generate facial expressions. However, the complexity of facial expressions, especially microexpressions with intricate local details, presents significant challenges for these methods.

The facial action coding system (FACS) [23] provides commonly used descriptors such as raised cheeks or depressed lips in expression manipulation approaches. Tools like OpenFace [24] have been developed to achieve the recognition of Action Units (AUs). Additionally, a recent work called ME-graphAU [8] introduces a deep learning-based approach for modeling AU relationships explicitly. This method aims to describe the intricate connections between different AUs and provide a more comprehensive understanding of facial expressions.

With the help of these tools, Pumarola et al. [6] introduced a technique that utilizes AUs to guide the synthesis of facial expressions. This approach allows precise control over the strength of individual AUs and their combination into a cohesive expression. However, it only learns global AU features, which limits its performance in editing local expression details. Wang et al. [25] added a path for predicting an appearance flow to align the input image with the target expression. Wu et al. [26] proposed a cascaded expression focal GAN that progressively modifies facial expressions by emphasizing local expression features. Considering the distinct structured appearances of facial expressions, Song et al. [27] and Qiao et al. [28] proposed geometry-guided GANs that leverage facial landmarks to define the facial geometry and generate facial expressions. Nevertheless, aligning landmarks from source images with target images that possess distinct facial shapes is a significant challenge and frequently results in artifacts in the generated images. These approaches typically rely on global expressions, AU predictions, or landmark predictions and lack the ability to estimate them automatically. Ling et al. [29] improved the generator architecture of GANimation and used relative AUs as input; with relative action units, the generator learns to transform only the regions of interest specified by non-zero-valued relative AUs. To further preserve identity information and edit only relevant areas, Wang et al. [30] added an attention module to the generator for facial expression manipulation and obtained long-range dependencies in the image by using self-attention blocks instead of direct skip connections. To improve the performance of expression transfer, Shao et al. [31] disentangled the input image into two fine-grained representations (AU-related and AU-free features) and proposed an EET framework that explicitly transfers fine-grained expressions by straightforwardly mapping the unpaired input to two synthesized images with swapped AU-related features. Tang et al. [32] introduced an end-to-end expression-guided GAN, enabling the manipulation of fine-grained expressions and the synthesis of continuous intermediate expressions between source and target expressions. However, these methods still overlook the role of local region details in preserving the local structure and texture of the image.

Method

Our LSGAN consists of a semantic mask generator (SMG), an adversarial autoencoder (AAE), a transformative generator (TG), and an AU-intensity discriminator (AUD), as shown in Fig. 1. Our SMG receives the input image \(I_i\) and produces the key facial region masks, i.e., \(I_{{\text {eye}}}\), \(I_{{\text {mouth}}}\), and \(I_{{\text {part}}}\). Then, our AAE formulates local latent codes for the facial region masks. Afterward, our TG generates a new facial image with the target expression. During this procedure, our AUD forces the generated facial expression images to lie on the same manifold as real frontal faces.

Semantic mask generator (SMG)

We observe that facial expression changes mainly occur in key facial regions, such as the mouth and eyes. Therefore, we design the SMG to locate the eye and mouth regions and generate the corresponding local facial part masks.

Our SMG consists of three modules: a spatial path module, a context path module, and a feature fusion module. The spatial path module encodes rich spatial information, and the context path module provides a sufficient receptive field. Our spatial path module consists of a convolutional layer and two basic blocks. Here, we adopt the ResNet [33] architecture, which includes a residual branch and a short-cut branch, to design our basic blocks. Our context path module includes two basic blocks and two attention refinement modules (ARM). Inspired by [34], we design the attention refinement module, which leverages pooling layers along the horizontal and vertical coordinate directions to capture contextual information. The components of the ARM are shown in Fig. 2b. By pooling the input feature map of size C\(\times \)H\(\times \)W along the horizontal and vertical directions, we obtain the following feature maps:

$$\begin{aligned} \begin{aligned} z_c^h(h)&=\frac{1}{W}\sum _{0\le i<W}x_c(h,i),\\ z_c^w(w)&=\frac{1}{H}\sum _{0\le j<H}x_c(j,w).\\ \end{aligned} \end{aligned}$$
(1)

Then, with the formula \(f=\delta (F_1([z^h,z^w]))\), we concatenate \(z^h\) with \(z^w\) and perform the \(F_1\) operation on the concatenated result, which involves dimension reduction and activation using a 1\(\times \)1 convolutional kernel. Along the spatial dimension, we split f into \(f^h\in \mathbb {R}^{C/r\times H\times 1}\) and \(f^w\in \mathbb {R}^{C/r\times 1\times W}\). We then perform dimension expansion using a 1\(\times \)1 convolutional kernel and apply the sigmoid activation function to obtain the final attention weights \(g^h\in \mathbb {R}^{C\times H\times 1}\) and \(g^w\in \mathbb {R}^{C\times 1\times W}\) in both directions:

$$\begin{aligned} \begin{aligned}&g^h=\sigma (F_h(f^h)),\\&g^w=\sigma (F_w(f^w)).\\ \end{aligned} \end{aligned}$$
(2)

Finally, the output formula of ARM can be expressed as follows:

$$\begin{aligned} \begin{aligned} y_c(i,j)=x_c(i,j)\times g_c^h(i)\times g_c^w(j), \end{aligned} \end{aligned}$$
(3)

where \(x_c(i,j)\) and \(y_c(i,j)\) correspond to the input and output features, respectively. Furthermore, our feature fusion module is composed of a convolutional layer and an attention refinement module. It fuses the output features of the former two modules and generates face segmentation results (see Fig. 2).
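For concreteness, the following is a minimal PyTorch sketch of an ARM implementing Eqs. (1)–(3), assuming a channel-reduction ratio r and a BatchNorm+ReLU pair as the activation \(\delta \); the class and argument names are illustrative rather than taken from our released code.

```python
import torch
import torch.nn as nn

class AttentionRefinementModule(nn.Module):
    """Sketch of the ARM in Eqs. (1)-(3): pooling along H and W,
    a shared 1x1 reduction, and per-direction attention weights."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        reduced = max(channels // reduction, 8)
        self.reduce = nn.Sequential(                 # F_1: 1x1 conv + activation (delta)
            nn.Conv2d(channels, reduced, kernel_size=1),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
        )
        self.expand_h = nn.Conv2d(reduced, channels, kernel_size=1)  # F_h
        self.expand_w = nn.Conv2d(reduced, channels, kernel_size=1)  # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)            # Eq. (1): C x H x 1
        z_w = x.mean(dim=2, keepdim=True)            # Eq. (1): C x 1 x W
        # f = delta(F_1([z^h, z^w])): concatenate along the spatial axis, then reduce
        f = self.reduce(torch.cat([z_h, z_w.permute(0, 1, 3, 2)], dim=2))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.expand_h(f_h))                       # Eq. (2): C x H x 1
        g_w = torch.sigmoid(self.expand_w(f_w.permute(0, 1, 3, 2)))   # Eq. (2): C x 1 x W
        return x * g_h * g_w                                          # Eq. (3)
```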

Since facial expression manipulation can cause large shape changes of facial components, we enlarge the eye, eyebrow, and mouth areas and divide the source image \(I_i\) into eye region masks \(I_{{\text {eye}}}\), mouth region masks \(I_{{\text {mouth}}}\), and cheek region masks \(I_{{\text {part}}}\). In this way, we provide facial semantic priors for the following procedures.

Adversarial autoencoder (AAE)

After obtaining the local facial part masks, encoding their latent codes becomes crucial for our task. Therefore, we introduce an AAE [35] to encode the latent code for local facial part masks.

Our AAE is composed of an encoder E, a decoder, and a discriminator \(D_z\), as shown in Fig. 3. The encoder generates latent codes \(z=E(x)\sim q_z\) from a source image x, where x can be the original image \(I_i\) or a local image such as \(I_{{\text {eye}}}\) or \(I_{{\text {mouth}}}\). The decoder reconstructs the input image with 5 convolution layers and a fully-connected layer. The discriminator judges whether a latent code arises from the encoder’s prediction or from a sampled distribution specified by the user. We employ a latent adversarial loss \(L_{{\text {adv}}}^z\) to learn the structured mapping between the latent space and a Gaussian distribution:

$$\begin{aligned} \begin{aligned} L_{{\text {adv}}}^z&=\mathbb {E}_{z^{'}\sim p_{z}}[\log D_z(z^{'})]\\ {}&\quad +\mathbb {E}_{z\sim p_{{\text {data}}}}[\log (1-D_z(E(x)))], \end{aligned} \end{aligned}$$
(4)

where \(z=E(x)\) is the latent code of an image x sampled from the image domain \(p_{{\text {data}}}\), and \(z^{'}\) follows the Gaussian distribution \(p_z\). We pretrain the AAE and introduce the encoder into our LSGAN. In this manner, we can estimate latent codes for different regions, namely \(z_e\) for the eye region, \(z_m\) for the mouth region, and \(z_p\) for sub-critical regions.
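The latent adversarial training of Eq. (4) can be sketched as follows; \(D_z\) is assumed to output a probability in (0, 1), the latent dimensionality is a placeholder, and the encoder update uses the common non-saturating variant.

```python
import torch
import torch.nn.functional as F

def latent_adversarial_losses(encoder, d_z, x, latent_dim: int = 64):
    """Sketch of the latent adversarial game in Eq. (4)."""
    z_fake = encoder(x)                                            # z = E(x) ~ q_z
    z_real = torch.randn(x.size(0), latent_dim, device=x.device)   # z' ~ p_z (Gaussian prior)

    # Discriminator loss: prior samples are "real", encoded codes are "fake"
    p_real = d_z(z_real)
    p_fake = d_z(z_fake.detach())
    d_loss = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) + \
             F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))

    # Encoder loss (non-saturating): make E(x) indistinguishable from the prior
    p_enc = d_z(z_fake)
    e_loss = F.binary_cross_entropy(p_enc, torch.ones_like(p_enc))
    return d_loss, e_loss
```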

After obtaining the latent code of each region, we divide the target AU-intensity vector \(u_t\) into two subsets: \(u_{t_e}\) for the eye region and \(u_{t_m}\) for the mouth region, while \(z_p\) is still concatenated with the full \(u_t\). We use the target AU-intensity vector as a conditional variable and construct \(\widetilde{I_i}=G(z|u_t)\) to generate an image with the expected expression, where z is the set of latent codes of all regions. We concatenate each latent code with the corresponding AU vector to obtain the representations of the different regions: \(R_e\) for the eye region, \(R_m\) for the mouth region, and \(R_p\) for sub-critical regions.

Transformative generator (TG)

Our TG constructs facial images with the target expression by explicitly exploiting facial semantic priors of the source image and the AU intensity of the target image. It comprises three distinct generation branches, as shown in Fig. 1, and takes the local representations of the different regions as input. Specifically, the upper branch is primarily responsible for expression changes in the eye region and its surrounding areas, the middle branch focuses on expression changes in the mouth area, and the lower branch handles expression changes in sub-critical regions, such as the nose, cheeks, and chin. Each branch contains 6 up-sampling residual blocks, producing the respective outputs \(\widetilde{I_e}\), \(\widetilde{I_m}\), and \(\widetilde{I_p}\). These outputs are then fused based on their corresponding local masks to generate the final image \(\widetilde{I_i}\), as sketched below.
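The concatenation of region codes with their AU sub-vectors and the mask-based fusion of the three branch outputs can be summarized by the following sketch; the branch generators, masks, and tensor shapes are placeholders for illustration.

```python
import torch

def transform_and_fuse(g_eye, g_mouth, g_part,        # three branch generators (6 up-sampling residual blocks each)
                       z_e, z_m, z_p,                  # region latent codes from the AAE encoder
                       u_te, u_tm, u_t,                # target AU-intensity sub-vectors / full vector
                       m_eye, m_mouth, m_part):        # binary region masks from the SMG
    """Sketch of the TG: each region code is concatenated with its AU
    sub-vector (R_e, R_m, R_p) and decoded, then the branch outputs are
    fused with the corresponding local masks into the final image."""
    r_e = torch.cat([z_e, u_te], dim=1)
    r_m = torch.cat([z_m, u_tm], dim=1)
    r_p = torch.cat([z_p, u_t], dim=1)

    i_e = g_eye(r_e)       # eye-region prediction
    i_m = g_mouth(r_m)     # mouth-region prediction
    i_p = g_part(r_p)      # sub-critical-region prediction

    # Mask-based fusion into the final image (masks assumed to partition the face)
    return m_eye * i_e + m_mouth * i_m + m_part * i_p
```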

To improve the structural stability of our TG, we reconstruct the original image \(I_i\) and the generated image \(\widetilde{I_i}\) based on the original image’s AU-intensity vector \(u_i\). In this manner, we obtain the cyclic-reconstruction image \(I_{{\text {cyc}}}\) and the self-reconstruction image \(I_r\).

Fig. 3
figure 3

An overview of AAE

AU-intensity discriminator (AUD)

Our AUD has two tasks: (1) discriminating images generated by the TG from real ones; and (2) assessing the expression intensity of the generated image relative to its target AU-intensity vector. We utilize \(D_{{\text {adv}}}\) to complete the first task. For the second task, we introduce \(D_{{\text {cls}}}\) into the AUD to ensure the accurate transmission of AU changes throughout the generation process. Our AUD is composed of 6 convolution layers with a stride of 2. With this design, the AUD can more effectively evaluate the authenticity and quality of generated images while simultaneously controlling the changes in facial expressions.
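A minimal sketch of such a discriminator is given below, assuming \(128\times 128\) inputs, 16 AUs, and particular channel widths that are not specified in the text.

```python
import torch
import torch.nn as nn

class AUIntensityDiscriminator(nn.Module):
    """Sketch of the AUD: a shared trunk of 6 stride-2 conv layers with a
    real/fake head (D_adv) and an AU-intensity regression head (D_cls)."""

    def __init__(self, in_channels: int = 3, num_aus: int = 16, base: int = 64):
        super().__init__()
        layers, c = [], in_channels
        for i in range(6):                                  # 6 conv layers, stride 2
            out_c = base * 2 ** min(i, 3)
            layers += [nn.Conv2d(c, out_c, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            c = out_c
        self.trunk = nn.Sequential(*layers)                 # 128x128 input -> 2x2 feature map
        self.d_adv = nn.Conv2d(c, 1, kernel_size=2)         # real/fake score
        self.d_cls = nn.Conv2d(c, num_aus, kernel_size=2)   # predicted AU intensities

    def forward(self, img: torch.Tensor):
        h = self.trunk(img)
        return (self.d_adv(h).view(img.size(0), -1),
                self.d_cls(h).view(img.size(0), -1))
```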

Loss function

Our approach focuses on generating facial images that accurately reflect the expected facial expression while preserving the underlying identity structure of the original image. To this end, our generator loss function encompasses not only the expression loss but also an identity loss. We pretrain two networks (\(G_{{\text {exp}}}\), \(C_{{\text {id}}}\)) to obtain the facial expression and identity information of the current image; the architecture of both networks is inspired by the traditional Visual Geometry Group 19-layer (VGG19) network [36]. Furthermore, to enhance the stability of our generator network, we incorporate a self-reconstruction loss and a cyclic-reconstruction loss into the overall loss function. Self-reconstruction takes the original image as both input and output, whereas cyclic reconstruction takes the generated image as input and the original image as output. The network used for image reconstruction is consistent with the original generator, which consists of three hierarchical branches. The reconstruction losses act as regularization terms that help prevent overfitting: by compelling the network to capture key features during reconstruction, they enhance the network’s ability to represent input data and thereby improve its stability. Moreover, integrating the reconstruction losses with the other losses allows a balance between different objectives, enhancing the overall model’s stability and generalization.

Adversarial loss

\(L_{{\text {adv}}}\) models the discriminator’s ability to correctly distinguish real facial expression images from generated ones. The adversarial loss is formulated as:

$$\begin{aligned} \begin{aligned} L_{{\text {adv}}}&=\mathbb {E}_{I_i\sim p_{{\text {data}}}}[\log D(I_i)]\\ {}&\quad +\mathbb {E}_{z\sim p_l}[\log (1-D(\widetilde{I_i}))], \end{aligned} \end{aligned}$$
(5)

where \(I_i\) is the input image and z is its corresponding set of latent codes. Our AU-intensity discriminator D is composed of \(D_{{\text {adv}}}\) and \(D_{{\text {cls}}}\): \(D_{{\text {adv}}}\) discriminates images generated by the TG from real ones, while \(D_{{\text {cls}}}\) ensures the accurate transmission of AU changes.

Expression loss

Since the expression is decomposed into a set of AU intensity values, we use an expression loss to align the AU distribution of the generated image with that of the target image:

$$\begin{aligned} \begin{aligned} L_{{\text {au}}}=-\frac{1}{d}\sum _{j=1}^{d}\sum _{q=0}^{m}{||G_{{\text {exp}}}(\widetilde{I_i})-u_t||}_2, \end{aligned} \end{aligned}$$
(6)

where \(G_{{\text {exp}}}\) is pretrained to predict each AU intensity at multiple levels, and \(u_t\) is the AU-intensity vector of the target image.
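In practice, this term can be computed as a distance between predicted and target AU intensities; the sketch below is a simplified per-AU mean-squared-error variant and omits the level-wise summation of Eq. (6).

```python
import torch

def expression_loss(g_exp, generated: torch.Tensor, u_t: torch.Tensor) -> torch.Tensor:
    """Simplified expression (AU) loss: penalize the distance between the AU
    intensities predicted for the generated image and the target vector u_t."""
    u_pred = g_exp(generated)            # shape (batch, d), one intensity per AU
    return ((u_pred - u_t) ** 2).mean()  # mean squared AU-intensity error
```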

Identity loss

We use identity loss to preserve the identity information in the input image \(I_i\):

$$\begin{aligned} \begin{aligned} L_{{\text {id}}}=-\sum _{k=1}^{n}\mathbbm {1}_{k=l_i}\log \left( C_{{\text {id}}}^{(k)}(\widetilde{I_i})\right) , \end{aligned} \end{aligned}$$
(7)

where \(C_{{\text {id}}}\) is pretrained to construct a mapping between input images \(I_i\) and their identity labels \(l_i\).
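A direct implementation of Eq. (7) is the standard cross-entropy below, assuming \(C_{{\text {id}}}\) returns identity logits over the n identities.

```python
import torch
import torch.nn.functional as F

def identity_loss(c_id, generated: torch.Tensor, identity_labels: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (7): cross-entropy between the identity predicted for
    the generated image and the source identity label l_i."""
    logits = c_id(generated)                       # (batch, n) identity logits
    return F.cross_entropy(logits, identity_labels)
```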

Self-reconstruction loss

To ensure the stability of image generation, the generator G should be able to self-reconstruct \(I_i\). Since the L1 norm alone often results in structural distortion and image blurring, we apply both an L1 loss and an MS-SSIM loss [37] to constrain self-reconstruction:

$$\begin{aligned} \begin{aligned} L_{{\text {rec}}}={||I_r-I_i||}_1+(1-{\text {SSIM}}(I_r,I_i)), \end{aligned} \end{aligned}$$
(8)

where \(I_r\) represents the image generated by self-reconstruction, which should be as similar to the input image \(I_i\) as possible.
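Equation (8) combines an L1 term with a structural term; the sketch below uses the ssim function of the third-party pytorch-msssim package (an assumed dependency), which also provides an ms_ssim variant matching the MS-SSIM loss [37].

```python
import torch
from pytorch_msssim import ssim   # third-party "pytorch-msssim" package (assumed dependency)

def self_reconstruction_loss(i_r: torch.Tensor, i_i: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (8): L1 distance plus a structural term between the
    self-reconstructed image I_r and the input image I_i."""
    l1 = torch.abs(i_r - i_i).mean()
    structural = 1.0 - ssim(i_r, i_i, data_range=1.0)   # images assumed scaled to [0, 1]
    return l1 + structural
```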

Cyclic-reconstruction loss

To further ensure the integrity of identity information, a cyclic-reconstruction loss is also introduced:

$$\begin{aligned} \begin{aligned} L_{{\text {cyc}}}={||I_{{\text {cyc}}}-I_i||}_1, \end{aligned} \end{aligned}$$
(9)

where \(I_{{\text {cyc}}}\) represents the image generated by cyclic reconstruction, whose input is the image \(\widetilde{I_i}\) generated by G.

Overall objective function

Combining the losses introduced above, the full objective function is formulated as:

$$\begin{aligned} \begin{aligned} L=L_{{\text {adv}}}+\lambda _{{\text {au}}}L_{{\text {au}}}+\lambda _{{\text {id}}}L_{{\text {id}}}+\lambda _{{\text {rec}}}L_{{\text {rec}}}+\lambda _{{\text {cyc}}}L_{{\text {cyc}}}, \end{aligned} \end{aligned}$$
(10)

where \(\lambda _{{\text {au}}}\), \(\lambda _{{\text {id}}}\), \(\lambda _{{\text {rec}}}\) and \(\lambda _{{\text {cyc}}}\) are the hyper-parameters that represent the weight of each loss function.
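Putting the terms together, the weighted sum of Eq. (10) is simply the following; the default weights repeat the values reported in the experiments section and the individual loss values are assumed to be computed as sketched above.

```python
def total_loss(l_adv, l_au, l_id, l_rec, l_cyc,
               lambda_au=100.0, lambda_id=60.0, lambda_rec=50.0, lambda_cyc=50.0):
    """Weighted generator objective of Eq. (10)."""
    return (l_adv + lambda_au * l_au + lambda_id * l_id
            + lambda_rec * l_rec + lambda_cyc * l_cyc)
```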

Fig. 4
figure 4

Generating images under different hyper-parameters

Experiment

Fig. 5
figure 5

Some generated images based on our LSGAN. The input source image has a neutral expression

Datasets and settings

Datasets

To evaluate the effectiveness and generalization of our approach, we conduct experiments on two widely used datasets, namely RaFD [38] and DISFA [39]. The RaFD dataset is composed of high-quality facial images of 67 models displaying eight distinct emotional expressions, namely anger, disgust, fear, happiness, sadness, surprise, contempt, and neutrality. Each expression is depicted in three different gaze directions across five camera angles.

The DISFA dataset, established in 2013, contains AU video samples obtained from 27 participants (15 males and 12 females) watching a 242-second video comprising nine segments meant to elicit various emotions. During the video recording, subjects’ facial expressions were captured from the front with consistent environmental conditions, including lighting and background. The video resolution was set at 1024\(\times \)768, with a frame rate of 20 fps. Each participant’s data consisted of 4845 frames, each annotated by two FACS experts with start and end times for 12 types of AUs and corresponding intensity levels ranging from 0 to 5.

Implementation details

We implemented our network in PyTorch, and the experiments were run on a machine with an Intel Core i7-8700 CPU and an NVIDIA 3090 GPU. In our experiments, each image is cropped to a size of \(128\times 128\). Since the RaFD dataset lacks corresponding AU vectors, we use ME-graphAU [8] to annotate the intensities of 16 AUs (1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25 and 26) as continuous expression labels.

We set the hyper-parameters as \(\lambda _{{\text {au}}}\) = 100, \(\lambda _{{\text {id}}}\) = 60, \(\lambda _{{\text {rec}}}\) = 50 and \(\lambda _{{\text {cyc}}}\) = 50. These parameters are subject to a simple constraint: generally speaking, \(\lambda _{{\text {au}}}\) is approximately equal to the sum of \(\lambda _{{\text {rec}}}\) and \(\lambda _{{\text {cyc}}}\), and \(\lambda _{{\text {id}}}\) falls within the range [50, 100]. Results generated under different parameters are shown in Fig. 4: when \(\lambda _{{\text {rec}}}\) is set to 10, the details of the eyebrows and eyes appear incomplete, while when \(\lambda _{{\text {rec}}}\) is 200, the changes in facial expressions are not sufficiently prominent.

The adversarial learning in E, G, \(D_z\) and \(D_{{\text {img}}}\) employs the Adam solver and a learning rate of \(10^{-4}\), while \(G_{{\text {exp}}}\) uses a learning rate of \(2 \times 10^{-4}\). We train our framework for 600 and 16 epochs on RaFD and DISFA datasets, respectively, with a batch size of 8.
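A sketch of this optimizer setup is shown below; only the learning rates come from the text, while the module names and the Adam beta values are assumptions.

```python
import torch

def build_optimizers(modules: dict):
    """Adam solvers with the learning rates stated above. `modules` maps
    names ('E', 'G', 'D_z', 'D_img', 'G_exp') to nn.Module instances; the
    betas are a common GAN default, not taken from the paper."""
    betas = (0.5, 0.999)
    lrs = {'E': 1e-4, 'G': 1e-4, 'D_z': 1e-4, 'D_img': 1e-4, 'G_exp': 2e-4}
    return {name: torch.optim.Adam(module.parameters(), lr=lrs[name], betas=betas)
            for name, module in modules.items()}
```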

Fig. 6
figure 6

Qualitative comparison of facial expression synthesis on RaFD database (target facial expression from top to bottom: disgusted, surprised and happy). The results of MASK are used in LSGAN

Evaluation metrics

To quantitatively evaluate our method for expression transfer, we introduce the mean square error (MSE) and the intraclass correlation coefficient (ICC) to measure the difference and correlation between the AU intensities of generated images and the ground truth. To further compare the quality and structural similarity of images generated by different methods, we introduce evaluation metrics such as peak signal-to-noise ratio (PSNR) [40], structural similarity (SSIM) [40], Fréchet inception distance (FID) [41] and the LPIPS distance [42].

Fig. 7
figure 7

Facial expression manipulation based on different target facial expression images of DISFA datasets

PSNR is a widely used metric for assessing the quality of an image. It quantifies the degree of distortion in an image by comparing the differences between the original and processed/compressed versions. The higher the PSNR value, the lower the level of distortion. The formula for calculating PSNR is expressed as follows:

$$\begin{aligned} \begin{aligned} {\text {PSNR}}=10\cdot \log _{10}\left( \frac{{\text {MAX}}^2}{{\text {MSE}}}\right) \end{aligned} \end{aligned}$$
(11)

Here, MAX represents the maximum possible pixel value of the image (e.g., for an 8-bit image, MAX = 255), and MSE denotes the mean squared error, computed as the average of the squared differences between corresponding pixels of two images.
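A straightforward NumPy implementation of Eq. (11) is given below for reference, assuming 8-bit images by default.

```python
import numpy as np

def psnr(reference: np.ndarray, distorted: np.ndarray, max_value: float = 255.0) -> float:
    """PSNR of Eq. (11): 10 * log10(MAX^2 / MSE) between two images."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:                      # identical images
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```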

SSIM is another important metric for evaluating image quality. Unlike PSNR, which focuses solely on pixel-wise differences, SSIM also takes into account structural information and texture similarity. It provides a more comprehensive assessment of perceived image quality by considering both local and global image characteristics. SSIM is a metric based on three comparisons between samples x and y: luminance, contrast, and structure, expressed by the following equation:

$$\begin{aligned} \begin{aligned} {\text {SSIM}}(x,y)=[l(x,y)^\alpha \cdot c(x,y)^\beta \cdot s(x,y)^\gamma ], \end{aligned} \end{aligned}$$
(12)

where \(l(x,y)\) represents the luminance comparison, \(c(x,y)\) the contrast comparison (reflecting the magnitude of brightness changes in the image, i.e., the standard deviation of pixels), and \(s(x,y)\) the structure comparison. The parameters \(\alpha \), \(\beta \), and \(\gamma \) are constants.

FID is a measure commonly employed to assess the dissimilarity between two multivariate normal distributions. It is often used to evaluate the performance of generative models such as GANs. The feature means \(\mu _g\) and covariances \(C_g\) of generated images, along with the means \(\mu _r\) and covariances \(C_r\) of real images, are used to compute the distance between the two feature distributions. This distance is termed FID, defined as:

$$\begin{aligned} \begin{aligned} {\text {FID}}\left( P_r,P_g\right)&=||\mu _r-\mu _g||^2\\ {}&\quad +{\text {Tr}}\left( C_r+C_g-2\left( C_rC_g\right) ^{1/2}\right) . \end{aligned} \end{aligned}$$
(13)

Here, \({\text {Tr}}(\cdot )\) denotes the trace operation (the sum of the elements on the main diagonal of a square matrix).
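Equation (13) can be computed from two sets of Inception features as follows; the feature extraction step is assumed to have been performed beforehand, and the matrix square root uses SciPy.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID of Eq. (13) from two feature sets of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c_r = np.cov(feats_real, rowvar=False)
    c_g = np.cov(feats_gen, rowvar=False)
    cov_sqrt, _ = linalg.sqrtm(c_r @ c_g, disp=False)   # matrix square root (C_r C_g)^(1/2)
    if np.iscomplexobj(cov_sqrt):                       # drop tiny imaginary parts from numerical noise
        cov_sqrt = cov_sqrt.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(c_r + c_g - 2.0 * cov_sqrt))
```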

Fig. 8
figure 8

Visual comparison of different expressions for EGGAN, GANIMATION, LSGAN\(\#1\) with Cross entropy loss in \(L_{{\text {au}}}\) and LSGAN\(\#2\) with MSE loss in \(L_{{\text {au}}}\) in the DISFA dataset

Table 1 Quantitative evaluation of expression manipulation for STARGAN [5], GANIMATION [6], EGGAN [32], FADM [43] and our LSGAN
Table 2 Quantitative comparison with FID (lower is better), PSNR (higher is better) and SSIM (higher is better) on the generated images of different methods
Fig. 9
figure 9

Origin images and target expressions for quantitative evaluation. The original and target images are used to generate their corresponding reconstructed images

In contrast, LPIPS distance is a perceptual similarity metric and has been demonstrated to correlate well with human perceptual similarity. LPIPS distance has been widely utilized in various computer vision tasks, including image synthesis and style transfer. Given a reference block x from the ground truth image and a distorted block \(x_0\) from a noisy image, the formula for calculating LPIPS is as follows:

$$\begin{aligned} \begin{aligned} d(x,x_0)=\sum _{l}\frac{1}{H_lW_l}\sum _{h,w}\Big |\Big |w_l\odot (\hat{y}_{hw}^l-\hat{y_0}_{hw}^l)\Big |\Big |_2^2. \end{aligned} \end{aligned}$$
(14)

Here, \(y^l\) and \({y_0}^l\) represent the feature maps of the l-th layer of the images.
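In practice, LPIPS is usually computed with the authors' lpips package rather than re-implementing Eq. (14) from scratch; the sketch below assumes that package and an AlexNet backbone, which is a common but not mandatory choice.

```python
import torch
import lpips   # pip package "lpips" (assumed dependency)

loss_fn = lpips.LPIPS(net='alex')   # AlexNet-based perceptual features

def lpips_distance(x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
    """Perceptual distance of Eq. (14); inputs are (N, 3, H, W) tensors scaled to [-1, 1]."""
    return loss_fn(x, x0)
```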

Table 3 Quantitative comparison with FID (lower is better) and LPIPS (lower is better) on reconstructed images
Fig. 10
figure 10

Illustration of the effectiveness of different loss terms. LSGAN is trained without \(L_{{\text {id}}}\), \(L_{{\text {rec}}}\), \(L_{{\text {cyc}}}\), respectively

Qualitative evaluations

We divide the images in the RaFD dataset into 8 discrete emotional expression categories: neutral, angry, contemptuous, disgusted, fearful, happy, sad, and surprised. Figure 5 shows the generation results of our LSGAN.

The results indicate that our method can generate various types of images with target expressions. We further compare the performance of our LSGAN with current state-of-the-art approaches, as shown in Fig. 6.

Among them, StarGAN [5] handles image translation across different domains well, but its results are slightly blurry and contain some artifacts. EGGAN [32] achieves higher-quality results but still ignores the structural integrity of local regions, such as the right eyebrow shown in Fig. 6. With a generator targeting local semantic regions and reconstruction networks that maintain facial structure, both global expressions and local muscle actions look natural in the images generated by our LSGAN. This shows that our method preserves more complete local-region details while achieving expression transformation.

To conduct a more detailed analysis of the AU-intensity vector, we perform further experiments on the DISFA dataset. For different target images and their expressions, our method manipulates the expression of the original image while retaining its identity features, as shown in Fig. 7.

The comparison with other methods is shown in Fig. 8. Trained on a large amount of data, the various methods all produce reasonable results. However, we find that GANIMATION [6] produces blurring and overlapping artifacts in local areas when there is significant facial deformation, such as going from an open mouth to a closed mouth. By introducing reconstruction networks, our method has slight advantages in terms of structural integrity and local muscle actions, such as texture details in the eye region. This demonstrates the advantage of our method in controlling key facial region details.

Quantitative evaluations

To quantitatively evaluate expression manipulation, we compute the MSE and ICC between the AU intensities of generated images and the ground truth. We use ME-graphAU to estimate the AU-intensity vector for each image. Since the compared methods only report results for a subset of AU intensities, we choose 16 AU intensities near the eyes and mouth. As shown in Table 1, although the latest diffusion-based method [43] produces high-quality images, it struggles to capture detailed changes in facial expressions. Our method is the most accurate in predicting the AU-intensity vector near the mouth area. Overall, our method achieves the highest average ICC and the lowest average MSE over the 16 AU intensities, which proves the effectiveness of our region-based network design.

Beyond facial expression transfer, we further conduct a comparative analysis of image quality among the various methods. We use FID, PSNR and SSIM to analyze the quality of generated images, as shown in Table 2. Compared with the other methods, the images generated by LSGAN are closer to the ground truth in image quality and structural similarity.

To further compare the stability and image quality of different networks, we use FID metrics and LPIPS distance to evaluate the reconstructed image on the DISFA datasets. We select 3 pairs of images for evaluation, as shown in Fig. 9.

The quantitative evaluation results of the reconstructed images in Fig. 9 are shown in Table 3 below.

Here, FIG (1), FIG (2) and FIG (3) correspond to the three pairs of images in Fig. 9, respectively, and FIG (Avg) is the average result over 50 pairs of test images. The results indicate that our method has better reconstruction performance; that is, our LSGAN achieves more stable image generation while preserving the original image identity and structure. In addition, the LPIPS results reflect that our network maintains the features of invariant local regions well.

The ablation study

In this section, we evaluate the main components in our LSGAN. Specifically, we investigate the effects of different loss terms in our framework by examining their impact on image generation. To accomplish this, we train the network by removing one of three key loss terms: identity loss, self-reconstruction loss, and cyclic reconstruction loss, which are denoted as \(E\_wtI\), \(E\_wtR\), and \(E\_wtC\), respectively. Figure 10 shows some of the results with target facial expression of happiness.

Table 4 Quantitative comparison with FID (lower is better)

We further evaluate the generated image and ground truth with FID metrics, and the results are shown in Table 4. In summary, the various losses used in our network are necessary.

Fig. 11
figure 11

Illustration of the effectiveness of different network structure modules. We compare LSGAN with its three network structure variants (\(E\_wt{\text {ARM}}\), \(E\_wt{\text {SMG}}\), \(E\_{wt({\text {AAE}}+{\text {TG}})}\))

Table 5 Quantitative comparison with FID (lower is better)
Table 6 Quantitative evaluation (MSE) of expression manipulation for our LSGAN and its variants on the RaFD dataset
Table 7 Quantitative evaluation (ICC) of expression manipulation for our LSGAN and its variants on the RaFD dataset

To further analyze the effects of different network architectural modules on our LSGAN, we conduct an ablation study on LSGAN and its variants. The images with disgust expressions generated by our LSGAN and its variants are shown in Fig. 11. \(E\_wt{\text {ARM}}\) denotes the adoption of the conventional channel attention network SENet instead of the ARM, \(E\_wt{\text {SMG}}\) denotes using global images alone for expression manipulation without incorporating local feature alignment, and \(E\_{wt({\text {AAE}}+{\text {TG}})}\) replaces the combination of AAE and TG with a traditional encoder-decoder structure. It can be seen that the variants of LSGAN show a decrease in the quality of generated images in key facial regions.

Similarly, we evaluate the generated images against the ground truth with the FID metric, as shown in Table 5. It is evident that, compared to its variants, LSGAN alleviates issues such as low resolution and image artifacts.

We also quantitatively evaluate different variants of LSGAN in terms of transferring fine-grained expressions. Tables 6 and 7 present the MSE and ICC between the AU intensity of the ground truth and the generated image of LSGAN and its variants, respectively. The absence of reconstruction loss significantly diminishes the network’s capacity to convey expression intensity labels. Moreover, when compared to conventional encoding and decoding architectures, the utilization of SMG, AAE and TG enables LSGAN to effectively align the features extracted from the input image with those of the target image. This alignment facilitates the precise transfer of nuanced facial expressions, enhancing our network’s ability to faithfully capture fine-grained expression details. In summary, LSGAN showcases superior performance in fine-grained expression manipulation, as evidenced by its highest average ICC and lowest average MSE among its variants. This demonstrates the effectiveness of LSGAN in accurately manipulating facial expressions with fine details.

Conclusion

This paper introduces a novel approach for fine-grained facial expression manipulation, termed LSGAN. By integrating facial expression images generated from various facial regions, our approach is able to fully capture and leverage region-specific information while preserving the overall structural integrity of the image. At the same time, our method ensures that the AU intensity of the generated image is basically consistent with the target image. Our proposed method has been rigorously evaluated through both qualitative and quantitative analyses using publicly available databases, demonstrating its high performance in generating expression-specific facial images.