1 Introduction

The success of computer vision methods based on deep learning [10, 11, 40] requires a large amount of training data. A large number of private images have been shared on various application platforms [21, 48], which has raised concerns about personal privacy and security. For example, the General Data Protection Regulation (GDPR) is in force in Europe, requiring organizations to define privacy policies based on user preferences [5]. In this context, tools that help users understand how their sensitive data are exchanged are valuable [2]. However, many existing computer vision tasks, such as person re-identification [31, 32] and action recognition [4, 49], do not require clear facial information. Therefore, anonymization technology can be used to process faces before publishing such data.

Facial de-identification (De-ID) techniques [14, 35, 39] thus came into being; these approaches aim to remove one person’s identity by replacing the real face with a generated or simulated face while keeping the original head pose, facial expression and background unchanged. As facial De-ID plays a vital role in privacy protection, it has attracted extensive attention.

In this context, the earliest De-ID techniques obfuscated privacy-sensitive identity information via image distortion operations such as mosaics and image blur. However, these operations also destroy privacy-insensitive information and decrease the visual quality and authenticity of the images or videos, thus compromising their utility [8]. Other related approaches replace the faces in the image to be processed with new face images found in a predefined reference image set. Their disadvantage is that the quality of the resulting image depends heavily on the selected reference image, and they cannot protect the identity privacy of the faces in the reference set. Segmentation-based methods [36] are also used for anonymizing faces, but they often make faces undetectable, and the methods in [14, 35] suffer from insufficient identity erasure. On the other hand, even though the images generated by some methods [6] can fool recognition systems, they can easily be recognized by humans.

Recently, more sophisticated De-ID techniques have been proposed based on generative adversarial networks (GANs) [7]. The conditional identity anonymization GAN (CIAGAN) [24] is a state-of-the-art method for face identity privacy protection. However, the CIAGAN ignores the attribute information contained in the face identity features and retains attribute information only through facial key points, which leads to many visual defects in its results.

To overcome these problems, we propose a new face De-ID and generation framework based on a GAN and semantic images [44] to realize high-quality and controllable face De-ID. Specifically, to preserve the basic attribute information of the original image, we use a semantic image instead of landmarks to guide the generation process. To prevent the normalization layers from washing out the information contained in the image, the semantic image is input into the generator through the spatially adaptive normalization of [34]. By concatenating the latent space with the representation layer of the face classifier, we achieve a rich latent space that embeds both identity and expression information. In addition, a structural similarity index (SSIM) loss [45] and a perceptual loss [17] are added to the objective function to improve the quality of the generated images.

Our contributions in this work are fourfold.

  • Facial semantic image extraction is exploited to maintain the key pose and expression of the original face.

  • We integrate the features of the new identity into the original features by leveraging the output of the feature representation layer of a pretrained face classification model.

  • We design a hybrid identity discriminator composed of an image quality analysis module, a Visual Geometry Group (VGG)-based perceptual loss function, and a contrastive identity loss to guide the identity anonymization process.

  • Extensive experiments show that our approach surpasses most existing techniques in the De-ID field.

2 Related work

2.1 Conventional face De-ID methods

Face anonymity technology aims to protect the private information of faces. Traditional methods directly change the data distribution of the original images for face de-identification [1, 29]. The problem with these methods is that all the objects in an image are blacked out, blurred or pixelated independently, and the De-ID efficacy cannot be guaranteed since fixed image operations can be easily reconstructed [8].

Beyond simple image processing operations such as blurring, pixelation or adding random noise, the K-same method [30], which builds on the k-anonymity algorithm [41], generates face images by averaging k faces in the dataset. This ensures visual privacy protection but often produces “ghosting” artifacts [14]. As variants of the K-same scheme continue to emerge in the literature, they have focused mainly on preserving important attribute information from the original images or improving the naturalness of the resulting faces [18, 25].

2.2 Deep learning-based face De-ID methods

With the development of deep learning-based computer vision, generative adversarial networks (GANs) and their variants have been extended to privacy de-identification. GANs have a natural advantage due to their strong generation ability under the guidance of a discriminator, which inspires frameworks that generate realistic image samples via adversarial training. GANs have become the main trend in face de-identification research.

GANs realize face de-identification by generating image pixels instead of deleting or modifying information. According to whether additional reference faces are used, the existing GAN-based methods can be divided into two kinds: (1) one-to-one methods and (2) many-to-one methods. In one-to-one generation methods, no reference faces are used; thus, there is no privacy leakage of reference faces. Among the one-to-one generation methods, the privacy-preserving GAN (PPGAN) [47] forces the removal of the identity from the identity-related feature space learned by a pretrained discriminator, while visual correspondence is maintained through pixel-level structural similarity. However, the PPGAN tends to generate images with unique facial features, which leads to low image generation quality. Another one-to-one generation method, DeepPrivacy [14], has achieved good performance. This method first masks the face and then generates it under the guidance of landmarks. Fawkes [38] aims to anonymize identity without changing the visual perception of the face. Different from previous methods, Gu et al. [9] proposed a method that can anonymize and deanonymize at the same time. In addition, this method no longer relies on a mask to eliminate the original identity but directly modifies the original face.

In many-to-one methods, at least one reference face is exploited to fuse attribute or ID information into the generated faces. Conditional GANs (cGANs) [16, 27] have become popular tools for controlling the appearance of synthesized data. Meden et al. [26] proposed a generation model that generates anonymous faces based on the K faces whose identities are closest to that of the input face. The CIAGAN [24] takes an existing face image, landmarks, the masked face and a specified identity as input and tries to generate a face image with a new identity. The controllable face anonymization network (CFA-Net) [22] controls the anonymization process by manipulating the identity vector in the feature space and can generate a variety of new faces that remain highly similar to the original image content.

3 The semantic-aware deidentification GAN

3.1 Overview

As illustrated in Fig. 1, our model takes an image, the corresponding semantic image, the masked face and an identity feature as inputs. We aim to erase the identifiable features in the facial image and preserve the other attributes of the original image, including its pose, expression, and background.

Fig. 1
figure 1

The framework of the proposed model

Our model consists of the following three blocks: Block I for semantic image extraction, Block II for identity transformation, and Block III for face generation with SPADE.

Block I aims to perform facial segmentation via facial semantic image extraction. We propose the use of a semantic image to maintain the key pose and expression of the original face. Block II provides randomly selected identity information extracted from the given face dataset. Block III aims to generate an anonymized face image according to the original face and the new identity information. In this component, the generator is an encoder-decoder model in which the encoder embeds the original image information into a low-dimensional space. Then, the decoder decodes the combined information of the source image, identity features and semantic image into a generated image. In addition, we use spatially adaptive denormalization (SPADE) [34] to enhance the conditional GAN so that it can produce realistic-looking results.

In addition, the identity information represented by the feature layer of a pretrained face classifier is fused into the generator, while the identity discriminator ensures that the generated images are anonymized to the greatest extent possible.

3.2 Facial semantic image extraction

It has been shown that, thanks to the guidance of semantic information, generation methods based on semantic maps [16] obtain higher-quality images than those based on random noise. In addition, previous studies [46] have shown that correlations exist between facial components, and it is difficult for a generation model to mine these correlations, so randomly generated facial images can be easily identified. To effectively preserve the basic attributes and the correlations between facial components in the original face, the edge-aware graph representation network (EAGRNet) [44] is introduced into our model. The EAGRNet models the relationships between regions by learning graphical representations of facial images and can capture long-distance correlations in facial images.

As illustrated in Fig. 2, the semantic image extraction process of the EAGRNet includes three stages: feature and edge extraction, edge-aware graph reasoning, and semantic decoding. In the feature and edge extraction phase, the EAGRNet takes a residual network (ResNet) [43] as the backbone to extract low-level and high-level features for multiscale representation, and a spatial pyramid pooling operation is exploited to learn multiscale contextual information; the pyramid pooling module outputs a 16 × 16 feature map. Furthermore, an edge perception module is constructed to acquire an edge map for the subsequent module; it outputs a 32 × 32 feature map.

Fig. 2
figure 2

The overall framework of EAGRNet

Then, to build long-range relations among facial components, the feature map and edge map are fed into the edge-aware graph reasoning module (the EAGR module in Fig. 2). In the EAGR module, to learn intrinsic graph representations, collections of pixels that tend to reside in the same facial component are projected onto K (K ≥ 1) vertices of a graph. Accordingly, the original features are projected onto the vertices in an edge-aware fashion, the relations between the vertices (regions) are reasoned over the graph, and the learned graph representation is projected back onto the pixel grid, yielding a refined feature map with the same size as the original.

Finally, in the semantic decoding stage, the EAGRNet uses a two-way decoder, with both paths taking the 32 × 32 feature map as input. The decoder combines the feature maps of the two paths to generate the final face parsing result.
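To make the role of Block I concrete, the following minimal sketch (in PyTorch) shows how a pretrained face-parsing network could be wrapped to produce the one-hot semantic image consumed by the rest of the pipeline; the `SemanticImageExtractor` wrapper, the `parsing_net` argument and the class count are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical wrapper around a pretrained face-parsing network (e.g., an
# EAGRNet-style model): it turns an RGB face crop into the one-hot semantic
# image that conditions the generator. `parsing_net` and `num_classes` are
# assumptions for illustration.
class SemanticImageExtractor(nn.Module):
    def __init__(self, parsing_net: nn.Module, num_classes: int = 19):
        super().__init__()
        self.parsing_net = parsing_net                  # pretrained, frozen face parser
        self.num_classes = num_classes

    @torch.no_grad()
    def forward(self, face: torch.Tensor) -> torch.Tensor:
        # face: (B, 3, H, W) in [0, 1]
        logits = self.parsing_net(face)                 # (B, C, h, w) per-pixel class logits
        logits = F.interpolate(logits, size=face.shape[-2:],
                               mode="bilinear", align_corners=False)
        labels = logits.argmax(dim=1)                   # (B, H, W) facial-component labels
        one_hot = F.one_hot(labels, self.num_classes)   # (B, H, W, C)
        return one_hot.permute(0, 3, 1, 2).float()      # (B, C, H, W) semantic image
```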

3.3 Identity transformation

A traditional encoder-decoder structure easily learns to reconstruct its input, which leads to anonymization failure. To realize the anonymization of the original identity, we integrate the features of the new identity into the original features. Different from the CIAGAN [24], as shown in Block II of Fig. 1, we exploit a pretrained face classification model and leverage the output of its feature representation layer as the identity attribute. In this way, we achieve a rich latent space that embeds both identity and expression information. Based on the new identity features, the generator can learn the features of the reference image, not just the identity, to anonymize the original image.
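As an illustration of this identity transformation, the sketch below builds an identity encoder from a classification backbone by dropping the final classification head and using the representation-layer output as the identity feature; the ResNet-50 backbone, the projection size and the names are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Illustrative identity encoder: a face classifier pretrained on identity labels,
# with the classification head removed so that the representation-layer output
# serves as the identity feature fed to the generator.
class IdentityEncoder(nn.Module):
    def __init__(self, emb_dim: int = 512):
        super().__init__()
        backbone = resnet50(weights=None)   # load your own face-classification weights here
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        self.proj = nn.Linear(2048, emb_dim)    # compact identity embedding (assumed size)

    @torch.no_grad()
    def forward(self, reference_face: torch.Tensor) -> torch.Tensor:
        # reference_face: (B, 3, H, W) crop of the randomly selected new identity
        feat = self.features(reference_face).flatten(1)   # (B, 2048) representation layer
        return self.proj(feat)                            # (B, emb_dim) identity feature
```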

3.4 Face generator with SPADE

The data distribution of the semantic image may cause traditional convolutional networks to fail because their normalization layers tend to wash out the information contained in the input semantic masks. Inspired by SPADE [34], we employ spatially adaptive normalization to replace the traditional normalization layers. As shown in Fig. 3, the face generator takes the semantic image, the masked image and the identity feature from the pretrained identity classifier as inputs. The masked image is reshaped by a CNN and then fed into a SPADE ResBlk together with the semantic image, where a SPADE ResBlk is a residual block with SPADE. To obtain a reasonable feature dimension, we downsample the output by stacking 4 SPADE ResBlks and then concatenate the result with the reshaped identity feature. This concatenated feature is fed into 4 ResNet blocks, 4 SPADE ResBlks with upsampling layers, and a final convolutional layer to generate a facial image at the target spatial resolution.
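The following structural sketch outlines this data flow under several assumptions: `make_spade_block` is a hypothetical factory supplying SPADE residual blocks (a SPADE layer matching (1) is sketched after the equation below), and the channel widths and identity dimension are illustrative guesses rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Plain residual block used at the bottleneck (channel count is illustrative).
class ResBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

# Structural sketch of the generator's data flow (Block III). `make_spade_block`
# is a caller-supplied factory producing SPADE residual blocks that take
# (features, semantic image).
class FaceGenerator(nn.Module):
    def __init__(self, make_spade_block, id_dim: int = 512, base_ch: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(3, base_ch, 3, padding=1)  # reshape the masked image
        self.down = nn.ModuleList(
            [make_spade_block(base_ch * 2 ** i, base_ch * 2 ** (i + 1)) for i in range(4)]
        )
        self.fuse = nn.Conv2d(base_ch * 16 + id_dim, base_ch * 16, 1)
        self.mid = nn.ModuleList([ResBlock(base_ch * 16) for _ in range(4)])
        self.up = nn.ModuleList(
            [make_spade_block(base_ch * 2 ** (4 - i), base_ch * 2 ** (3 - i)) for i in range(4)]
        )
        self.to_rgb = nn.Conv2d(base_ch, 3, 3, padding=1)

    def forward(self, masked, semantic, id_feat):
        x = self.stem(masked)
        for blk in self.down:                                  # 4 SPADE ResBlks, downsampling
            x = F.avg_pool2d(blk(x, semantic), 2)
        # Broadcast the identity feature spatially and concatenate it at the bottleneck.
        idm = id_feat[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        x = self.fuse(torch.cat([x, idm], dim=1))
        for blk in self.mid:                                   # 4 plain residual blocks
            x = blk(x)
        for blk in self.up:                                    # 4 SPADE ResBlks, upsampling
            x = blk(F.interpolate(x, scale_factor=2.0, mode="nearest"), semantic)
        return torch.sigmoid(self.to_rgb(x))                   # map to the [0, 1] pixel range
```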

The spatially adaptive denormalization structure is shown in Fig. 4. The relationship between the output \(l_{out}^{i}\) and the input \(l_{in}\) of the module is defined as:

$$ \begin{array}{c} l_{out}^{i} = \alpha_{c,h,w}^{i}{l_{in}} + \beta_{c,h,w}^{i}, \end{array} $$
(1)

where \(\alpha _{c,h,w}^{i}\) and \(\beta _{c,h,w}^{i}\) are the modulation parameters indexed by channel c, height h and width w. This conditional normalization layer modulates the activations by a spatially adaptive transformation learned from the input semantic layout and can effectively propagate the semantic information throughout the network (Fig. 4).
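A minimal SPADE layer consistent with (1) is sketched below: the one-hot semantic image predicts the per-pixel scale \(\alpha\) and shift \(\beta\) that modulate the normalized activations. The hidden width and kernel sizes are illustrative; the parameter-free normalization follows the original SPADE design [34].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal SPADE layer consistent with (1): the semantic image predicts per-pixel
# modulation parameters alpha (scale) and beta (shift), which denormalize the
# normalized activation l_in. Hidden width is illustrative.
class SPADE(nn.Module):
    def __init__(self, channels: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)  # normalization without learned affine
        self.shared = nn.Sequential(
            nn.Conv2d(num_classes, hidden, 3, padding=1), nn.ReLU(inplace=True)
        )
        self.to_alpha = nn.Conv2d(hidden, channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, l_in: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # Resize the one-hot semantic image to the activation's spatial size.
        semantic = F.interpolate(semantic, size=l_in.shape[-2:], mode="nearest")
        ctx = self.shared(semantic)
        alpha, beta = self.to_alpha(ctx), self.to_beta(ctx)   # (B, C, H, W) modulation maps
        # l_out = alpha * normalized(l_in) + beta, applied pixelwise as in (1).
        return self.norm(l_in) * alpha + beta
```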

Fig. 3
figure 3

Generator based on SPADE

Fig. 4
figure 4

Spatially adaptive denormalization

A discriminator network is used to differentiate between the generated face images and the real images. In this paper, we employ the least-squares GAN (LSGAN) [23] to train our face identity anonymization and generation network in an adversarial manner. The LSGAN can boost training stability and produce more realistic images than the regular GAN [7]. The LSGAN loss functions for the discriminator and generator are:

$$ \begin{array}{c} L(D) = \frac{1}{2}{E_{x \sim {p_{I}}(x)}}[{(D(x) - 1)^{2}}] + \frac{1}{2}{E_{z \sim {p_{z}}(z)}}[D{(G(z))^{2}}], \end{array} $$
(2)
$$ \begin{array}{c} L(G) = \frac{1}{2}{E_{z \sim {p_{z}}(z)}}[{(D(G(z)) - 1)^{2}}], \end{array} $$
(3)

where \(p_{I}\) is the distribution of the real face images and \(p_{z}\) is the distribution of the latent variable z. The adversarial loss \(L_{adv}\) is computed as follows:

$$ \begin{array}{c} {L_{adv}} = L(G) + L(D) \end{array} $$
(4)
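For reference, (2)–(4) translate directly into the following loss functions, where `d_real` and `d_fake` stand for the discriminator outputs D(x) and D(G(z)) and batch means stand in for the expectations:

```python
import torch

# LSGAN objectives of (2)-(4); batch means approximate the expectations.
def lsgan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # L(D) = 1/2 E[(D(x) - 1)^2] + 1/2 E[D(G(z))^2]
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # L(G) = 1/2 E[(D(G(z)) - 1)^2]
    return 0.5 * ((d_fake - 1.0) ** 2).mean()
```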

The SSIM was originally proposed for image quality analysis to overcome the limitations of the mean squared error (MSE). The SSIM is utilized here to measure the structural similarity between two images. SSIM is defined as:

$$ \begin{array}{c} SSIM(G(x),y) = \frac{{2{\mu_{G(x)}}{\mu_{y}} + {C_{1}}}}{{\mu_{G(x)}^{2} + {\mu_{y}^{2}} + {C_{1}}}} \cdot \frac{{2{\sigma_{G(x)y}} + {C_{2}}}}{{\sigma_{G(x)}^{2} + {\sigma_{y}^{2}} + {C_{2}}}}, \end{array} $$
(5)

where \(\mu_{x}\) and \(\sigma_{x}^{2}\) are the mean and the variance of x, respectively, and \(\sigma_{xy}\) is the covariance of x and y. \(C_{1}\) and \(C_{2}\) are constants used to maintain stability. The SSIM ranges from 0 to 1. The SSIM loss is defined as follows:

$$ \begin{array}{c} {L_{S}} = 1 - SSIM(G(x),y) \end{array} $$
(6)
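A direct implementation of (5)–(6) using global image statistics, as the equation is written, is sketched below; windowed SSIM variants compute the same quantities over local patches, and the constants are the usual defaults for images in [0, 1].

```python
import torch

# SSIM loss of (5)-(6) with global image statistics, as the equation is written;
# windowed SSIM variants compute the same terms over local patches instead.
def ssim_loss(gen: torch.Tensor, target: torch.Tensor,
              c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    # gen, target: (B, C, H, W) images in [0, 1]
    mu_g = gen.mean(dim=(1, 2, 3))
    mu_t = target.mean(dim=(1, 2, 3))
    var_g = gen.var(dim=(1, 2, 3), unbiased=False)
    var_t = target.var(dim=(1, 2, 3), unbiased=False)
    cov = ((gen - mu_g[:, None, None, None]) *
           (target - mu_t[:, None, None, None])).mean(dim=(1, 2, 3))
    ssim = ((2 * mu_g * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_g ** 2 + mu_t ** 2 + c1) * (var_g + var_t + c2))
    return 1.0 - ssim.mean()                      # L_S = 1 - SSIM, as in (6)
```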

To generate visually pleasing images, we also use a VGG-based perceptual loss [17]. The perceptual loss measures high-level feature differences between the target and the generated output, such as differences in content and style. In our approach, we extract the high-level features (the relu3-3 layer) of VGG-16 for both the real target image and the output of the generator. The L1 distance between these features of the target and generated images is used to guide the generator G. The perceptual loss is defined as:

$$ \begin{array}{c} {L_{P}} = \frac{1}{{{C_{p}}{W_{p}}{H_{p}}}}\sum\limits_{c = 1}^{{C_{p}}} {\sum\limits_{w = 1}^{{W_{p}}} {\sum\limits_{h = 1}^{{H_{p}}} {\left| {V{{(G(z|x))}^{c,w,h}} - V{{(y)}^{c,w,h}}} \right|} } }, \end{array} $$
(7)

where V(⋅) denotes a particular layer of VGG-16, and \(C_{p}\), \(W_{p}\) and \(H_{p}\) are the dimensions of that layer.
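A sketch of this perceptual loss is given below, assuming a recent torchvision and ImageNet-pretrained VGG-16 weights; the slice `features[:16]` ends at the relu3-3 activation, and input normalization to ImageNet statistics is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

# Perceptual loss of (7): L1 distance between relu3-3 activations of VGG-16 for
# the generated and target images. The slice features[:16] ends at relu3-3.
class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)               # VGG is a fixed feature extractor

    def forward(self, gen: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Mean absolute difference, i.e., the sum in (7) divided by C_p * W_p * H_p.
        return (self.vgg(gen) - self.vgg(target)).abs().mean()
```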

To guide the identity anonymization process, we design an identity discriminator that adopts a Siamese network architecture. We train the identity discriminator using a contrastive loss:

$$ \begin{array}{c} {L_{C}}(m,{(Y,{X_{1}},{X_{2}})^{i}}) = \left\{ {\begin{array}{ll} {||{X_{1}^{i}} - {X_{2}^{i}}|{|_{2}},} & {Y = 1}\\ {\max (0,m - ||{X_{1}^{i}} - {X_{2}^{i}}|{|_{2}}),} & {Y = 0} \end{array}} \right., \end{array} $$
(8)

where ||⋅||2 denotes the l2 norm of a vector and m is the margin. Finally, under the guidance of the identity discriminator, the generator learns to generate a face with some of the features of the desired identity while retaining the basic attributes of the real image.
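The contrastive identity loss of (8) can be written as follows, where `x1` and `x2` are the Siamese branch embeddings and `y` marks same-identity (1) versus different-identity (0) pairs; the margin value is a placeholder.

```python
import torch
import torch.nn.functional as F

# Contrastive identity loss of (8). x1, x2 are Siamese embeddings; y is 1 for
# same-identity pairs and 0 for different-identity pairs; m is the margin.
def contrastive_loss(x1: torch.Tensor, x2: torch.Tensor,
                     y: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    d = F.pairwise_distance(x1, x2, p=2)                   # ||x1 - x2||_2 per pair
    pos = y * d                                            # pull same-ID pairs together
    neg = (1.0 - y) * torch.clamp(margin - d, min=0.0)     # push different-ID pairs apart
    return (pos + neg).mean()
```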

The overall objective function for learning the network parameters in the proposed method is given as the sum of all the loss functions defined above:

$$ \begin{array}{c} {L_{tot}} = {L_{adv}} + {\lambda_{1}}{L_{S}} + {\lambda_{2}}{L_{P}} + {\lambda_{3}}{L_{C}}, \end{array} $$
(9)

where Ladv is the adversarial loss, LP is the perceptual loss, and LS is the SSIM loss. The variables λ1, λ2, and λ3 are hyperparameters used to weight the different loss terms.

4 Experiments

4.1 Experimental settings

4.1.1 Datasets and baseline methods

The CelebA [20] dataset consists of 202,599 face images (218×178 pixels each) with 40 binary attribute annotations per image, such as age (old or young), gender, whether the image is blurry, and whether the person is bald. The dataset has an official split into a training set containing 162,770 images, a validation set containing 19,867 images and a test set containing 19,962 images.

The FG-NET Aging Dataset (FG-NET-AD) [33] contains 1002 images of 82 persons ranging from newborns to 69 years old, with most subjects between 0 and 40 years old. The face images in FG-NET-AD also exhibit significant diversity in resolution, quality, illumination and viewpoint. To evaluate the model's generalization on diverse data, we also select images from the CALFW [3] and LFW [13] datasets to set up new datasets with specific data distributions.

We compare our method with several state-of-the-art methods, including Fawkes [38], DeepPrivacy [14] and the CIAGAN [24]. Fawkes aims to reduce the probability of face identification without changing the visual appearance of the face. Unlike Fawkes, the CIAGAN and DeepPrivacy are committed to generating new faces. DeepPrivacy includes a generator and a discriminator and uses landmarks to guide the generation process. Compared with DeepPrivacy, the CIAGAN adds a discriminator for guiding identity and uses the face silhouette to guide the generation process.

4.1.2 Implementation details

We resize all images to 64×64 pixels for the quantitative experiments and to 128×128 pixels for the qualitative experiments, and we normalize all pixel values to the range [0, 1]. We use the higher resolution for the qualitative results to make subtle visual changes more apparent.
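The corresponding preprocessing can be expressed as a simple transform (a sketch; the exact augmentation pipeline used in the paper is not specified here):

```python
from torchvision import transforms

# Preprocessing matching the settings above: resize to 64x64 (quantitative) or
# 128x128 (qualitative); ToTensor() already maps 8-bit pixels to [0, 1].
def build_transform(size: int = 128) -> transforms.Compose:
    return transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
    ])
```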

We train our model on 35,579 images from 1200 persons, as done for the CIAGAN, at a resolution of 128×128. To evaluate our model's performance accurately, we test the SDGAN and the baseline approaches on distinct datasets, including 363 persons (each with more than 30 images) from the same CelebA dataset and the 82 persons from the FG-NET Aging Dataset. In addition, to test the model's generalization ability, we set up three mixed datasets by selecting images of distinct ages, genders, and skin tones from the CALFW, CelebA and LFW datasets: the Gender dataset, containing 50 females and 50 males; the Age dataset, containing 17 children, 35 adults and 35 elderly people; and the Skin dataset, containing 100 persons with white, yellow, brown and black skin tones (25 persons per skin tone type).

4.2 Detection and identification

We first evaluate two important properties that an anonymization method should have: a high detection rate and a low identification rate. In other words, we do not want the generated face to be identified as the original ID by an identification system, but we still want the face to be detected by a detection system. In addition, we want the generated faces of the same person to remain identifiable as one person, so that visual applications such as Re-ID and action recognition are not affected by De-ID. Therefore, we evaluate our approach in terms of face detection, identification and re-identification metrics: a high detection rate, a low identification rate and a high re-identification rate indicate better anonymization.

We perform detection using the Dlib machine learning library [19] and the single-stage headless (SSH) detector [28]. For identification, we use a pretrained FaceNet model [37] based on the Inception-ResNet backbone [42] and apply the standard Recall@1 metric to judge whether a generated face and its corresponding original face belong to the same ID, that is, to measure the De-ID effect. For re-identification, we compute Recall@1 over all generated face images to measure the proportion of samples whose nearest neighbor comes from the same class, which implicitly evaluates the impact of De-ID on Re-ID scenarios that rely on facial information.
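A sketch of the Recall@1 computation used for both identification and re-identification is shown below; embeddings are assumed to come from a FaceNet-style encoder, and cosine similarity is used for the nearest-neighbor search as an illustrative choice.

```python
import torch
import torch.nn.functional as F

# Recall@1 over face embeddings: the fraction of queries whose nearest gallery
# embedding shares the same identity label. For the re-ID protocol the gallery
# is the query set itself, so the self-match is excluded.
def recall_at_1(query: torch.Tensor, gallery: torch.Tensor,
                q_labels: torch.Tensor, g_labels: torch.Tensor,
                exclude_self: bool = False) -> float:
    sim = F.normalize(query, dim=1) @ F.normalize(gallery, dim=1).t()
    if exclude_self:
        sim.fill_diagonal_(float("-inf"))
    nearest = sim.argmax(dim=1)
    return (g_labels[nearest] == q_labels).float().mean().item()
```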

In Table 1, we show the detection and identification results of the proposed SDGAN and the baseline models. Among the existing methods, the CIAGAN and DeepPrivacy achieve advanced performance; that is, they have higher detection rates and lower identification rates than the other methods. Although Fawkes can preserve the visual appearance of the original face, it has difficulty anonymizing the face image. The detection rates of the classical Dlib [19] and the deep learning-based SSH [28] detectors for our anonymized images are 98.12% and 99.76%, respectively, which are higher than those of the CIAGAN and DeepPrivacy; the SSH detection rate for our anonymized images is almost 100%. The testing results obtained with FaceNet show that the identification rate of our model nearly reaches 0.0%, which suggests that our model removes almost all identity information and outperforms the CIAGAN and DeepPrivacy in this respect. These experimental results demonstrate that our method not only generates reliable faces but also achieves advanced De-ID performance.

Table 1 Results of the tested detection and identification methods on the CelebA dataset. Lower (↓) results imply better anonymization; higher (↑) results imply better detection

In addition, in terms of the Re-ID score, the Recall@1 scores of all the De-ID models are lower than those of the original face images, suggesting that the faces generated for one person by each model cannot fully maintain the same ID. Our model achieves better Recall@1 scores than the CIAGAN and DeepPrivacy, which indicates that our model has less impact on Re-ID scenarios than the baseline approaches. Fawkes provides the best Recall@1 score because its anonymization mechanism does not change the visual perception of the face, which guarantees identification consistency but forgoes visual ID protection.

Table 2 reports the comparison results on the FG-NET-AD dataset. Our method provides the best SSH detection rate, identification rate and re-identification rate, and the second-best Dlib detection rate (slightly lower than that of Fawkes), indicating that the proposed SDGAN anonymizes faces successfully. It can also be observed that the SSH detection rates of the CIAGAN, DeepPrivacy and Fawkes drop significantly on the FG-NET-AD dataset compared to the CelebA dataset, while our model remains steady, suggesting that the SDGAN is robust enough to overcome the low image quality of FG-NET-AD. Owing to space limitations, the quantitative evaluation results on the Gender, Age and Skin datasets are available at https://github.com/kimhyeongbok/SDGAN; similar results are observed on these datasets.

Table 2 Results of the tested detection and identification methods on the FG-NET-AD

4.3 Generation quality

4.3.1 Quantitative results

In this section, we evaluate the visual quality of the generated images from a quantitative point of view by using the Fréchet inception distance (FID) [12], SSIM [45] and the peak signal-to-noise ratio (PSNR) [15]. The FID and SSIM are metrics that compare the statistics of generated samples to those of real samples. The lower the FID and the higher the SSIM are, the better the results are, corresponding to more similar real and generated samples. The PSNR evaluates the image quality based on the errors between corresponding pixels. The higher the PSNR is, the higher the image quality.

As shown in Table 3, compared with Fawkes [38] and the CIAGAN [24], our method achieves significantly improved generation quality. For example, compared with the CIAGAN on the CelebA dataset, our method reduces the FID by 51.94 and improves the SSIM and PSNR by 0.12 and 6.58, respectively. DeepPrivacy provides a slightly better PSNR score but a much worse FID score than our model, which indicates that the faces generated by DeepPrivacy have a higher peak signal-to-noise ratio but are more difficult to detect and have lower visual quality than those generated by our model. Overall, the quantitative FID, SSIM and PSNR results show that the quality of the images generated by our method is better than that of the existing advanced methods. In Table 4, similar results can be observed: the proposed SDGAN provides much better FID and SSIM than the CIAGAN and DeepPrivacy, and a PSNR comparable to that of DeepPrivacy.

Table 3 FID, SSIM and PSNR results on the CelebA dataset. The lower (↓) the FID and the higher (↑) the SSIM, the better the results are, corresponding to more similar real and generated samples. The higher (↑) the PSNR is, the higher the image quality
Table 4 FID, SSIM and PSNR results on the FG-NET-AD

4.3.2 Qualitative results

In this section, we qualitatively evaluate the quality of the generated images under diverse conditions: normal, side, occluded, and other challenging scenes. Figure 5 shows the images generated under normal conditions. Compared with the results of DeepPrivacy, the CIAGAN and our method, the image generated by Fawkes is highly similar to the original image, so it is difficult to regard it as anonymized from a visual point of view. Both DeepPrivacy and our model provide more realistic images than the CIAGAN; meanwhile, the faces generated by DeepPrivacy look comparably natural to those of our model.

Fig. 5
figure 5

Images generated by each method in normal scenes. Images in the same column correspond to the same original image

Figure 6 shows the generated images obtained from side scenes. Although Fawkes can generate high-quality images, it cannot effectively anonymize the images. Compared with those of DeepPrivacy and the CIAGAN, the images generated by our method are more realistic. For example, in the second and fourth columns, both DeepPrivacy and the CIAGAN generate unrealistic images.

Fig. 6
figure 6

Images generated by each method in side scenes. Images in the same column correspond to the same original image

Figure 7 shows the generated images obtained in occluded scenes. Fawkes still has difficulty anonymizing identities. Compared with the CIAGAN, DeepPrivacy and our method can generate images with higher quality. Furthermore, our method can generate more realistic images than DeepPrivacy. As shown in columns 3 and 9, DeepPrivacy destroys the integrity of the occlusion. In addition, as shown in column 7, our method can better maintain the expression of the original image than DeepPrivacy.

Fig. 7
figure 7

Images generated by each method in occluded scenes. Images in the same column correspond to the same original image

Figure 8 shows the images generated in other scenes, including uneven illumination, different ages, skin colors and low image quality. These images come from the FG-NET-AD, Age and Skin datasets. Since the quantitative evaluation and the algorithmic principle have shown that Fawkes has difficulty anonymizing identities, its resulting images are not provided here but are available online. Figure 9 provides multiface images generated only by our model and DeepPrivacy, because the CIAGAN cannot process multiface cases and therefore produces no results. In our model, the Dlib face detector is exploited to find all faces in an image. Image pieces are then obtained by segmenting the original image so that each piece contains exactly one face, and each piece is de-identified. Finally, the generated faces replace their corresponding original faces to anonymize all the persons in the image.
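A minimal sketch of this multiface pipeline is given below; `deidentify_crop` is a placeholder for SDGAN inference on a single face crop, and the simple paste-back step stands in for any blending the full system may apply.

```python
import dlib
import numpy as np

# Multiface anonymization: detect every face with Dlib, de-identify each crop
# independently, and paste the generated faces back into the original image.
# `deidentify_crop` must return an image of the same size as its input crop.
def anonymize_all_faces(image: np.ndarray, deidentify_crop) -> np.ndarray:
    detector = dlib.get_frontal_face_detector()
    out = image.copy()
    for rect in detector(image, 1):                      # upsample once to catch small faces
        t, b = max(rect.top(), 0), min(rect.bottom(), image.shape[0])
        l, r = max(rect.left(), 0), min(rect.right(), image.shape[1])
        out[t:b, l:r] = deidentify_crop(image[t:b, l:r])
    return out
```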

Fig. 8
figure 8

Images generated by each method in diverse age, gender, skin color, and image quality scenes. Images in the same column correspond to the same original image

Fig. 9
figure 9

Multiface images generated by each method. Images in the same column correspond to the same original image

In all these challenging cases, most of the faces generated by the CIAGAN are flawed or unrealistic, while DeepPrivacy and our method can generate images with higher quality. In the tests across diverse age groups, our model and the CIAGAN cannot guarantee age consistency because the age factor was not considered in the design of their generators and discriminators. DeepPrivacy obtains better generation consistency with respect to age and gender, especially in the children's images. However, in terms of maintaining expressions, the facial images generated by DeepPrivacy essentially change the expression relative to the original images. Some of these images, such as column 10 in Fig. 6, column 8 in Fig. 7, columns 2 and 3 in Fig. 8, and column 4 in Fig. 9, change from a smile to an appearance of displeasure, or vice versa, which violates a key requirement of De-ID. In contrast, the CIAGAN and our model can better maintain the expression of the original image. Overall, these results verify that the performance of our method is better than that of DeepPrivacy and the CIAGAN.

4.4 Ablation studies

In this section, we perform an ablation study with our method to demonstrate the value of our design choices. In Table 5, we show several variants of our model.

Table 5 Ablation study of our model

Effectiveness of SPADE. Compared with that of the baseline, the quality of the image generated by V1 is significantly improved. Specifically, the FID decreases by 41.45, and the SSIM and PSNR increase by 0.05 and 3.12, respectively. Thus, the effectiveness of SPADE in our method is verified.

Effectiveness of LP and LS. V2 and V1 have similar Dlib detection rates, and the SSH detection rate of V2 is 0.69 higher than that of V1, which verifies that LP and LS can improve the authenticity of the generated images. Compared with those of V1, the SSIM and PSNR of V2 are improved by 0.09 and 4.23, respectively, which verifies that LP and LS can improve the quality of the generated images. In addition, we verify the effectiveness of LP and LS separately. Table 5 shows that V4, which removes LS, obtains a worse FID than V6. Compared with V6, V5, which removes LP, obtains worse Dlib, SSH, FID and PSNR values. Thus, the effectiveness of LP and LS is verified.

Effectiveness of the identity features. We try two different identity features, F0 and F1, where F0 represents all features of the image and F1 is a simplified feature that eliminates the identity-independent components of F0. We find that V3 and V6 achieve higher performance than V2, which verifies the validity of the selected identity features. In addition, V3 obtains a lower FID than V6 because eliminating the interference of irrelevant features allows the generator to focus more on identity-related features.

5 Conclusion

In this paper, we propose the SDGAN for high-fidelity face de-identification. Specifically, we add identity features and a semantic image to the generator. The introduction of identity features enables the generator to learn image features, not just identity features, and the combination of SPADE and a semantic image preserves the basic attributes of the original face. In addition, we introduce a perceptual loss and an SSIM loss to ensure the quality of the generated images. The results of the ablation studies verify the effectiveness of the above components, and extensive experimental results demonstrate the effectiveness and advancement of the proposed method in terms of identity anonymization.