Attacks on State-of-the-Art Face Recognition using Attentional Adversarial Attack Generative Network

With the broad use of face recognition, its vulnerability to attack has gradually emerged, so it is important to study how face recognition networks can be attacked. In this paper, we focus on a novel way of attacking face recognition networks that misleads the network into identifying someone as a specific target person, rather than merely causing an inconspicuous misclassification. For this purpose, we introduce an attentional adversarial attack generative network to generate fake face images. To capture the semantic information of the target person, this work adds a conditional variational autoencoder and attention modules to learn the instance-level correspondences between faces. Unlike the traditional two-player GAN, this work introduces face recognition networks as a third player in the competition between generator and discriminator, which allows the attacker to impersonate the target person better. The generated faces, which are unlikely to attract the notice of onlookers, can evade recognition by state-of-the-art networks, and most of them are recognized as the target person.


Introduction
Neural networks are widely used in many tasks across society and are profoundly changing our lives. Good algorithms, adequate training data, and computing power allow neural networks to surpass humans in many tasks, such as face recognition. Face recognition can be used to determine whom a face image belongs to, or whether two face images belong to the same person. Applications based on this technology are gradually being adopted for important tasks, such as identity authentication in railway stations and payment. Unfortunately, it has been shown that a face recognition network can be deceived inconspicuously by small malicious changes to its inputs. The changed inputs are called adversarial examples, and they implement adversarial attacks on networks. Szegedy et al. [42] were the first to show that adversarial attacks can be implemented by applying a perturbation that is imperceptible to human eyes. Following the work of Szegedy et al., many works focus on how to craft adversarial examples to attack neural networks, and neural networks have gradually come under suspicion. Work on adversarial attacks can promote the development of neural networks. Akhtar et al. [1] review the contributions of these works in real-world scenarios. Inspired by these predecessors, we also study adversarial attacks.
Most adversarial attacks aim at misleading a classifier to a false label, not to a specific predetermined label. Besides, attacks designed for image classifiers cannot be applied directly to face recognition networks. Existing works produce perturbations on the images [13,42,28], or apply makeup to faces and add eyeglasses, hats or occlusions [34,33,14]. Their adversarial examples are fixed by the algorithms, which makes the attacks inflexible, and these algorithms cannot accept arbitrary images as inputs. Our goal is to generate face images that are similar to the original images but are classified as the target person, as shown in Fig. 1. A method that directly manipulates the intensity of input images is intensity-based.
Our work uses a geometry-based method to generate adversarial examples. We use a generative adversarial net (GAN) [12] to produce adversarial examples that are not limited by data, algorithms or target networks. It can accept any face as input and convert it into an adversarial example for attacks. To generate adversarial examples, we present A³GN, which produces a fake image whose appearance is similar to the original but which is classified as the target person.
In the face verification domain, whether two faces belong to one person is decided by the cosine distance between the feature maps in the last layer, not by the probability for each category. A³GN therefore pays more attention to exploring the feature distribution of faces. To get the instance information, we introduce a conditional variational autoencoder to obtain the latent code from the target face, and attentional modules are provided to capture more of the feature representation and facial dependencies of the target face. For adversarial examples, A³GN adopts two discriminators: one, called the normal discriminator, estimates whether the generated faces are real; the other, called the instance discriminator, estimates whether the generated faces can be classified as the target person. Meanwhile, a cosine loss is introduced to ensure that the fake images are classified as the target person by the target model. Our main contributions are three-fold:
• We focus on a novel way of attacking state-of-the-art face recognition networks. In face verification, they are misled into identifying someone as the target person, rather than merely misclassifying inconspicuously, according to the feature map and not the probability.
• A GAN is introduced to generate the adversarial examples, in contrast to traditional intensity-based attacks. This work presents a new GAN named A³GN to generate adversarial examples that are similar to the originals but have the same feature representation as the target face.
• The good performance of A³GN is shown by a set of evaluation criteria covering physical likeness, similarity score, and accuracy of recognition.

Related work

Face recognition
Convolutional neural networks have seen great development and success in face recognition. With the development of advanced architectures and discriminative learning approaches, face recognition performance has been boosted to an unprecedented level. Face recognition can be categorized into face verification and face identification. In our work, we focus on face verification, which determines whether a pair of faces belongs to the same person, whereas face identification classifies a face to a specific identity. Learning discriminative deep face representations through large-scale face identity classification was proposed by [43,44,39,38,40,37]. More and more CNN-based approaches concentrate on loss functions [25,46,8,48,41]. Our goal is to generate adversarial examples with A³GN to attack these state-of-the-art face recognition networks in face verification.

Adversarial attack
With their remarkable accuracy, neural networks have gained access to many important domains in society, such as self-driving cars, surveillance and identity authentication, and their security has become a critical problem. Szegedy et al. [42] were the first to reveal perturbations that can fool DNNs. Moosavi-Dezfooli et al. [29] demonstrate that a 'universal perturbation' can fool the classifier on any image for most types of models. Goodfellow et al. [13] argue that the intrinsic reason for adversarial attacks is the linearity and high dimensionality of inputs. Su et al. [36] present a method that uses differential evolution to generate one-pixel adversarial perturbations to attack models in an extremely specific scenario. In the face recognition domain, Bose et al. [3] craft adversarial examples by solving a constrained optimization problem so that a face detector cannot detect faces. Sharif et al. [33] propose a method focusing on facial biometric systems, which are widely used in surveillance and access control. Many works explore more imperceptible adversarial examples to attack neural networks efficiently [23,10,4,20,30,27].
In this paper, we focus on generating quasi-imperceptible adversarial examples for white-box, targeted attacks.

Generative adversarial network
Generative adversarial networks [12] have achieved impressive results in image generation [31,9], style transfer [45,22,11], image-to-image translation [21,52,53] and representation learning [31,32,26]. Most works utilize conditional variables such as attributes [50,6]. CycleGAN [52] preserves key attributes between the input and the translated images through a cycle consistency loss, which has brought a good improvement in unpaired image-to-image translation. Conditional VAEs [35] have shown good performance for image-to-image translation, learning a mapping from input to output image. In [53], cVAE-GAN and cLR-GAN are used to learn a low-dimensional latent code and then map from a high-dimensional input to a high-dimensional output. In our work, we use a conditional variational autoencoder GAN to learn the feature representation of the target person in order to generate adversarial examples from any face to attack face recognition networks.

Threat model
In our work, the adversary aims at fooling a face recognition network into recognizing someone as another person, which is an impersonation (targeted) attack. To achieve this, we consider a white-box attack in which the adversary has access to the face recognition networks, including knowledge of their outputs, parameters, architectures and training data. We also consider a black-box attack to test the transferability of A³GN. We adopt state-of-the-art face recognition networks as target models. Since different networks learn different feature representations for face images, this lets us show that A³GN can be applied to attack different networks. Our adversarial goal is to generate adversarial images that are similar to the original images from the LFW dataset but are recognized as the target person, by imitating the feature representation of the target person from the target models. For this goal, we present A³GN for generating adversarial examples.

A³GN
We adopt a conditional variational autoencoder generative adversarial network, which can learn target features for generating adversarial examples that fool the target network. To better guarantee that generated images are classified as the target person, we introduce attentional blocks into the conditional variational autoencoder GAN to constitute A³GN, shown in Fig. 2.

Conditional variational autoencoder GAN. To explore the feature distribution of different faces, we use cVAE-GAN to capture the instance information of different faces so that the generator can produce the adversarial examples. Given a target image y, an encoding function E learns a latent code z of y, E(y) → z. Generator G_1 combines z and an input image x to synthesize the output x′, G_1(x, z) → x′. The normal discriminator D_1 determines whether x′ is real or not. To make the generated images indistinguishable from real images, we adopt an adversarial loss (Eq. 1), in which the generated image G_1(x, z) carries the latent code z learned from the target image y, while the normal discriminator D_1 tries to distinguish the generated image G_1(x, z) from real images. The normal discriminator D_1 tries to maximize D_1(x), in opposition to the generator G_1.
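The display form of the adversarial loss referred to as Eq. 1 is not reproduced in the text above. As a hedged reconstruction only, the standard two-player GAN objective written with the symbols defined in this section (x the input image, z = E(y) the latent code, G_1 the generator, D_1 the normal discriminator) would be as follows; the name L_adv is an assumption used for illustration:

```latex
\mathcal{L}_{adv} \;=\; \mathbb{E}_{x}\big[\log D_1(x)\big]
\;+\; \mathbb{E}_{x,z}\big[\log\big(1 - D_1(G_1(x,z))\big)\big]
```

Under this form, D_1 maximizes the objective while G_1 minimizes it, which matches the opposition between the two players described in the text.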
For training stability and high-quality generated images, we replace Eq. 1 with the Wasserstein GAN objective with gradient penalty [2, 15], where x̂ is sampled between a pair of real and generated images, and λ_gp is set to 10.
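The Wasserstein objective with gradient penalty mentioned above is likewise missing its display equation. A sketch of the usual WGAN-GP form, written with the same symbols and offered as an assumption rather than the authors' exact expression, is:

```latex
\mathcal{L}_{adv} \;=\; \mathbb{E}_{x}\big[D_1(x)\big]
\;-\; \mathbb{E}_{x,z}\big[D_1(G_1(x,z))\big]
\;-\; \lambda_{gp}\,\mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D_1(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad \lambda_{gp} = 10
```

Here x̂ denotes a point sampled between a real image and a generated image, and the last term penalizes critic gradients whose norm departs from 1.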
To preserve the content of the input images while changing the instance-level information and part of the feature representation of the inputs, we add a cycle-consistency loss [52] to the generator as a reconstruction loss L_rec, where G_2 takes the generated image G_1(x, z) as input and reconstructs the original image x. The reconstruction loss adopts the L1 norm. Here, G_1 and G_2 are two different generators with inputs of different dimensions.
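The reconstruction loss described above can be written compactly. The following is a reconstruction from the verbal definition (L1 norm between x and the output of G_2 applied to G_1(x, z)), not a quotation of the original equation:

```latex
\mathcal{L}_{rec} \;=\; \mathbb{E}_{x,z}\big[\lVert x - G_2(G_1(x,z)) \rVert_1\big]
```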
Instance discriminator. In this work, we propose an instance discriminator D_2 as a third player in the competition, which leads to better impersonation of target faces. To generate images whose feature representation is similar to that of the target image, we directly adopt a face recognition network as the instance discriminator. For a given input image x and a latent code z from the target image y, E(y) → z, our goal is to translate x into x′, G_1(x, z) → x′, which can be classified as y by D_2. To achieve this, we adopt a cosine loss L_cos, where D_2 is the instance discriminator, and D_2(y) and D_2(G_1(x, z)) denote the feature representations of y and G_1(x, z). Minimizing the cosine loss minimizes the difference between the generated image G_1(x, z) and the target image y in feature space, which benefits the generation of adversarial examples.

The objective functions combine the adversarial loss with the reconstruction loss and the cosine loss, where λ_rec and λ_cos are hyper-parameters that control the relative importance of the reconstruction loss and the cosine loss, respectively, compared to the adversarial loss. In our work, we use λ_rec = 10 and λ_cos = 10.

Instance-level attentional block. During feature extraction and example generation, we plug geometric attentional blocks into the VAE to constitute an attentional variational autoencoder. In addition, we add channel-wise attentional blocks to the generator to model interdependencies between channels and capture the feature representation of faces; we call the result the attentional generator.
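The cosine loss and the overall objectives described in the instance-discriminator paragraphs above have lost their display equations. One form consistent with the verbal description (cosine agreement between the features D_2(y) and D_2(G_1(x, z)), and a weighted sum with λ_rec and λ_cos) is sketched below; the exact expressions, including the "1 − cos" form and the sign convention for L_{D_1}, are assumptions:

```latex
\begin{align}
\mathcal{L}_{cos} &= \mathbb{E}_{x,z}\big[\,1 - \cos\big(D_2(y),\, D_2(G_1(x,z))\big)\big] \\
\mathcal{L}_{D_1} &= -\,\mathcal{L}_{adv} \\
\mathcal{L}_{G}   &= \mathcal{L}_{adv} + \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{cos}\,\mathcal{L}_{cos},
\qquad \lambda_{rec} = \lambda_{cos} = 10
\end{align}
```

Both L_{D_1} and L_G are minimized by their respective networks, while the instance discriminator D_2 only supplies features.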

Figure 4. Overview of a channel-wise attentional block in the generator. "sq&ex" is the squeeze operation (global average pooling) and the excitation operation (gating mechanism with a sigmoid activation). "scale" is the operation that rescales the transformation output with the activations after squeeze and excitation to obtain channels with different weights of importance.

The VAE in our work learns the feature representation of the target person, whose facial dependency is significant for capturing the latent code. This is related to the self-attention method, which computes the response at one point in a sequence, such as a facial feature, by attending to all points. For this purpose, we introduce the non-local block [47] to capture the facial dependency. For instance-level learning, we combine the basic variational autoencoder residual block with the non-local block to propose the attentional VAE (AVAE) in our A³GN, shown in Fig. 3. As shown in Fig. 2, the attentional VAE can encode the geometric information of the target face and effectively learn the facial dependency between different parts of the human face.
We concatenate the original face x (3-dimensional) with the latent code z (7-dimensional) as the input of the attentional generator in Fig. 4. After two subsampling convolution layers in the generator, we introduce squeeze-and-excitation (SE) operations [18] to emphasize informative features and suppress less useful ones across channels. The squeeze operation compresses global spatial information into a channel descriptor by using global average pooling to generate channel-wise statistics. In the excitation operation, a gating mechanism with a sigmoid activation is employed to capture channel-wise dependencies. Finally, we employ scaling to rescale the transformation output. Owing to squeeze-and-excitation, we can retain more of the informative features from the latent code and suppress useless information in the channels, which contributes to capturing the feature representation of the target person.
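To make the non-local (geometric attention) idea above concrete, here is a minimal NumPy sketch of an embedded-Gaussian non-local block on a single feature map. It is an illustration under assumptions, not the authors' implementation: the weight matrices stand in for 1×1 convolutions, and all shapes, names and the random initialization are hypothetical.

```python
import numpy as np


def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """Embedded-Gaussian non-local block on one feature map (illustrative).

    x: (C, H, W) feature map.
    w_theta, w_phi, w_g: (C_inner, C) matrices acting as 1x1 convolutions.
    w_out: (C, C_inner) matrix projecting back to C channels.
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                # (C, N) with N = H*W positions
    theta = w_theta @ flat                    # queries, (C_inner, N)
    phi = w_phi @ flat                        # keys, (C_inner, N)
    g = w_g @ flat                            # values, (C_inner, N)
    scores = theta.T @ phi                    # pairwise affinities, (N, N)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)   # softmax over all positions j
    y = g @ attn.T                            # y[:, i] = sum_j attn[i, j] * g[:, j]
    out = w_out @ y + flat                    # residual connection
    return out.reshape(c, h, w), attn


# Hypothetical sizes: 8 channels, 4x4 spatial grid, 4 inner channels.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 4))
w_theta, w_phi, w_g = (0.1 * rng.standard_normal((4, 8)) for _ in range(3))
w_out = 0.1 * rng.standard_normal((8, 4))
out, attn = non_local_block(features, w_theta, w_phi, w_g, w_out)
print(out.shape, attn.shape)  # (8, 4, 4) (16, 16)
```

Each output position is a weighted sum over every position of the map, which is how such a block can relate distant facial regions; the residual connection lets it be inserted into an existing residual encoder.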
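The squeeze, excitation and scale steps described above can be sketched in a few lines of NumPy. This is a simplified illustration that follows the usual squeeze-and-excitation layout (global average pooling, a two-layer bottleneck with ReLU, then a sigmoid gate); the reduction ratio, weight shapes and random values are assumptions, not taken from the paper.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def squeeze_excite(x, w1, w2):
    """Channel-wise attention on one feature map (illustrative).

    x: (C, H, W) feature map.
    w1: (C // r, C) bottleneck weights; w2: (C, C // r) expansion weights.
    """
    squeezed = x.mean(axis=(1, 2))             # squeeze: global average pool, (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # excitation, first layer with ReLU
    gates = sigmoid(w2 @ hidden)               # excitation gate in (0, 1), (C,)
    scaled = x * gates[:, None, None]          # scale: reweight each channel
    return scaled, gates


# Hypothetical sizes: 16 channels, 8x8 spatial grid, reduction ratio r = 4.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 8, 8))
w1 = 0.1 * rng.standard_normal((4, 16))
w2 = 0.1 * rng.standard_normal((16, 4))
scaled, gates = squeeze_excite(features, w1, w2)
print(scaled.shape, gates.shape)  # (16, 8, 8) (16,)
```

Because each gate lies strictly between 0 and 1, the block can only attenuate channels relative to one another, which is the sense in which it emphasizes informative channels and suppresses less useful ones.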

Evaluation
In our work, we define a set of specific evaluation criteria to measure the effectiveness of the attacks:
• Real accuracy, fake accuracy and mAP. Real accuracy is the percentage of original images that are classified as the target person, which is usually 0%, while fake accuracy is the percentage of generated images that are classified as the target person. mAP is the mean average precision over thresholds ranging from 0 to 1 with a step of 0.01.
• Similarity score. The cosine distance between the original or generated faces and the target faces is used as a similarity score. Cosine distance is a significant metric in face recognition for verifying whether two images belong to one person. In our results, we report the similarity scores before and after the attack and the improvement (∆) in similarity score to show the effectiveness of A³GN for attacks. Meanwhile, the similarity scores between the real image and the fake image indicate how far the generated images can still be recognized as their real identities by face recognition networks. The lower the similarity score between the real image and the fake image, the more successful the attack.
• SSIM. SSIM here denotes the percentage of generated faces whose structural similarity index with the corresponding original face is higher than a threshold. It is a quantitative criterion for determining whether the generated faces are only slightly perturbed compared with the original faces. In our work, we set the threshold to 0.9 to evaluate the quality of generated images compared to the original images.
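The criteria listed above reduce to a few small computations, sketched here in plain Python. Several points are assumptions made for illustration: the paper's "cosine distance" is treated as cosine similarity (higher means closer to the target, matching the 0.45 rule), the threshold sweep is read as the mean of the accuracy over thresholds 0.00, 0.01, ..., 1.00, and the feature vectors and SSIM values are made-up toy inputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def fraction_above(values, threshold):
    """Share of values strictly above a threshold (accuracy or SSIM rate)."""
    return sum(v > threshold for v in values) / len(values)


def mean_accuracy_over_thresholds(scores, steps=100):
    """Average of the accuracy over thresholds 0, 1/steps, ..., 1."""
    thresholds = [i / steps for i in range(steps + 1)]
    return sum(fraction_above(scores, t) for t in thresholds) / len(thresholds)


# Toy 2-D "features": one target face and three generated faces.
target = [1.0, 0.0]
generated = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
scores = [cosine_similarity(f, target) for f in generated]

fake_accuracy = fraction_above(scores, 0.45)        # share classified as target
threshold_mean = mean_accuracy_over_thresholds(scores)
ssim_rate = fraction_above([0.97, 0.93, 0.88, 0.91], 0.9)  # toy SSIM values
print(fake_accuracy, threshold_mean, ssim_rate)
```

In practice the feature vectors would come from the last layer of the target face recognition network, and the SSIM values from an image-quality library; both are outside the scope of this sketch.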

Datasets
The state-of-the-art face recognition networks are trained on the CASIA-WebFace dataset and the refined MS-Celeb-1M [8]. Our A³GN is also trained on CASIA-WebFace. At inference time, we run A³GN on LFW, generating adversarial examples that are paired with target faces to verify whether they belong to one person.

CASIA-WebFace. The CASIA-WebFace dataset [51] is a web-collected dataset with 494,414 face images belonging to 10,575 different individuals. In our experiments, we use aligned CASIA-WebFace, whose images have a size of 112×112 after alignment.

MS-Celeb-1M. The original MS-Celeb-1M dataset [16] contains about 100k identities with 10 million images. In [8], the noise of MS-Celeb-1M is reduced, and the refined MS-Celeb-1M contains 3.8M images of 85k unique identities.

LFW. The LFW dataset [19] contains 13,233 web-collected images from 5,749 different identities, with large variations in pose, expression and illumination. In face verification, the verification accuracy is usually measured on 6,000 face pairs. In our work, however, we pair all the images in LFW with the target face image.

A³GN with Attentional Block
In this section, we run experiments to verify the feasibility and effectiveness of the attentional blocks. We train A³GN on CASIA-WebFace and use it at inference time to generate adversarial examples on LFW that attack the target model. We employ ArcFace [8], which has an accuracy of 99.42% on LFW, as the target model in these experiments.

Network architecture. We design A³GN based on cVAE-GAN. For the encoder, we use a classifier with 4 basic residual blocks to produce the latent code with 7 dimensions. Adapted from [6, 52], the generator in our work is composed of two convolution layers for downsampling, 6 residual blocks, and two convolution layers for upsampling. We use instance normalization in the generator but not in the discriminator. Our work has two discriminators. One is the target face recognition network, which classifies whether the image patches belong to the target person or not and is called the instance discriminator; the other is a PatchGAN discriminator [21], which classifies whether the image patches are real or not and is called the normal discriminator.

Training details. In the training process, the target person is represented by 7 different face images for capturing the latent code. All input images are resized and cropped to 112×112. Because our goal is to generate images that fool the face recognition network, all images are aligned in the same way as in face verification. We update the generator once with L_G after five normal discriminator updates and one generator update with L_cos, while the instance discriminator is fixed throughout. All models are trained for 200,000 iterations using Adam [24] with β_1 = 0.5 and β_2 = 0.999. The batch size is set to 32 in all experiments. We set the learning rate to 0.0001 for the first 100,000 iterations and linearly decay it to 0 over the next 100,000 iterations.

Quantitative evaluation. We perform a quantitative analysis of the mAP, the difference in similarity score, and SSIM on our baseline. All results are averaged over 5 target faces to reduce the effect of chance. The performance of the baseline is shown in Table 1. We design two groups of experiments with different conditions to verify the effectiveness of A³GN. One encodes image A to attack image A. The other uses two different images, A and A′, one for encoding and one for the attack. Neither A nor A′ is among the target images used for the latent code during training. In the baseline experiment, we choose one target person randomly to test the performance. The threshold of cosine distance for fake accuracy is set to 0.45. The same-image experiment (A → A) achieves higher accuracy than the experiment with two different images because the encoder can learn the feature representation of A itself for attacking A. As shown in Table 1, our baseline can generally fool the target model: most of the generated images are classified as the target person at the threshold of 0.45. SSIM is used as a criterion for the quality of generated images in some similar works, but we think it is not a strict criterion for evaluating how similar the generated images look to the original images to human eyes.
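The learning-rate schedule and update cadence stated in the training details can be sketched in plain Python. This is a minimal illustration of those two rules only; the function names are hypothetical, and the counting convention (discriminator updates numbered from 1) is an assumption.

```python
def learning_rate(iteration, base_lr=1e-4, total_iters=200_000, decay_start=100_000):
    """Constant rate for the first half, then linear decay to 0."""
    if iteration < decay_start:
        return base_lr
    remaining = max(total_iters - iteration, 0)
    return base_lr * remaining / (total_iters - decay_start)


def is_generator_step(d_updates, n_critic=5):
    """True after every n_critic normal-discriminator updates (counted from 1)."""
    return d_updates % n_critic == 0


for it in (0, 100_000, 150_000, 200_000):
    print(it, learning_rate(it))

print([s for s in range(1, 11) if is_generator_step(s)])  # [5, 10]
```

In a real training loop, the returned rate would be written into the optimizer at each iteration, and the generator losses would be applied only on the steps flagged by `is_generator_step`.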
We notice that the accuracy and mAP are not extremely high. To improve the performance, we consider introducing attentional blocks to learn more of the feature representation of the target person. To capture the facial dependencies of the target person, we introduce geometric attentional blocks, that is, non-local blocks, into the encoder. The performance is shown in Table 2. Every criterion improves compared with the baseline, so capturing the facial dependencies is effective for encoding the latent code.
The geometric attention in the encoder can capture the instance-level information effectively. We conjecture that introducing attentional blocks in the generator may also improve performance. During image generation, L_rec forces the fake images to be more similar to the original images, which results in a loss of the feature representation of the target person. Thus, we consider introducing channel-wise attentional blocks into the generator to focus on the information of the latent code. The performance is shown in Table 3. It exceeds our expectations, outperforming the experiment with geometric attentional blocks by 1.95% of mAP in the A → A condition. We conjecture that channel-wise attentional blocks primarily preserve the instance information from the latent code.
Following the two aforementioned experiments on attentional blocks, we combine geometric attention and channel-wise attention to improve the performance. The ablation study results are shown in Table 4, and the accuracy curves in A → A are shown in Fig. 5. As we can see, most of the generated images reach a cosine distance of more than 0.4, which far surpasses the result between real images and the target image. A³GN can fool the face recognition network successfully.
Qualitative evaluation. In addition to the quantitative evaluation, we show the effectiveness of the 4 different models through the qualitative comparison in Fig. 6. All the generated images in Fig. 6 are classified as the target person at the threshold of 0.45, and they are similar to the original images, with only quasi-imperceptible perturbation.
We observe that our model provides a high visual quality of attack results on LFW, even in the baseline. However, the generated images slightly resemble the target image in physical appearance, for example around the nose and eyes. We conjecture that this is because the generator works hard to raise the cosine distance between the generated images and the target image. This suggests that the face recognition network recognizes people by focusing more on their noses and eyes, and less on the contours of their faces and mouths. Furthermore, we choose 5 different target face images to show the results of attacks in Fig. 7. Most of the generated images lean slightly toward the target face image. It would seem that most face recognition networks recognize people by their facial features, and a slight change to the facial features, imperceptible to observers, can fool the face recognition network into recognizing another person. Meanwhile, a mask learned from the target person can also fool the network.

White-box attack. In the white-box scenario, we choose 3 different state-of-the-art face recognition networks to verify the feasibility of our model A³GN on different target networks. The performance on different target models in the white-box scenario is shown in Table 5.

Black-box attack. In this section, we explore whether fooling one face recognition network leads to successfully fooling other networks. In the black-box scenario, the parameters, architectures and feature space of the target models are not available during training. The instance discriminator in the black-box scenario is only ArcFace [8] in this experiment, and we have no access to the target networks, ResNet [17] with softmax, SphereFace [25] and MobileFaceNet [5], during training. At inference time, we only obtain the feature maps of images from the last layers to test the performance. The performance on different target networks in the black-box scenario is shown in Table 6. Each result in Table 6 is lower than that in Table 5, but we also observe that the generated images can disturb the target networks slightly. Black-box attacks are left as future work to explore and study.
Comparison with previous works. We compare our A³GN with previous attack models in face recognition on the CASIA-WebFace dataset. Because they focus on fooling the classifier into a false label, we compare our performance in this way in Table 7. If the cosine distance between the original image and the generated image is lower than 0.45, it is seen as a success for an attack. As we can see, the success rate of fooling the face recognition network into a false label for A³GN is 99.94%, so it fools the network almost completely. Though this is 0.02% lower than GFLM, A³GN can also force the target model to recognize the generated image as the target person.

Conclusion
Face recognition is a compelling task in deep learning, and it is necessary to learn how face recognition networks can be attacked. In this paper, we focus on a novel way of attacking target models by fooling them into a specific label. For this purpose, we present A³GN to generate adversarial examples that are similar to the original images but are classified as the target person. To learn the feature representation of target images, we introduce geometric attention and channel-wise attention into A³GN to achieve good performance. Finally, we show the results of experiments on different target faces, white-box attacks, and black-box attacks. However, our model is limited to attacking one target person. Building one model that can attack different target faces is left as future work.

Figure 1. Adversarial attack results in our work. The first column is the target face. The 2nd and 4th columns are the original images and the rest are the generated images. Given target images, our work is to generate images similar to the original faces but classified as the target person.

Figure 2. Overview of A³GN. The attentional variational autoencoder (AVAE) captures the latent code z from the target face y. The original face x is then concatenated with z to generate x′, G(x, z) → x′, in the attentional generator. G(x, z) is sent, together with x, into the normal discriminator to determine whether it is a real image or not, and, together with y, into the instance discriminator to determine whether it can be classified as the target person or not.
Real accuracy & Fake accuracy & mAP. Fake accuracy is defined as the percentage of adversarial examples that are successfully classified as the target person by the target model. When the cosine distance between an example and the target face is more than 0.45, we consider the example to be the target face, with a true predicted label.

Table 4. Ablation study of 4 different models for A³GN. Baseline: Conditional GAN baseline. Geometric attention: Conditional GAN with geometric attentional blocks. Channel-wise attention: Conditional GAN with channel-wise attentional blocks. Both: Conditional GAN with geometric attentional and channel-wise attentional blocks. The threshold of cosine distance is set to 0.45.

Figure 5. Accuracy curve at different thresholds. The horizontal axis represents the threshold and the vertical axis represents the accuracy at that threshold. GA means geometric attention. CWA means channel-wise attention.

Figure 7. Generated images by A³GN for 5 target faces. The left is the original image and the right is the generated image.

Table 1. Performance of baseline with two conditions.

Table 2. Performance of geometric attention with two conditions.

Table 3. Performance of channel-wise attention with two conditions.

Table 5. A³GN performance on different target models in white-box scenario.

Table 6. A³GN performance on different target models in black-box scenario.

Table 7. Comparison with other attack models in face recognition. 'SR' means the success rate of fooling the network to a false label. 'Attack acc. on CASIA' means the accuracy of fooling the network to a target label. An attack counts as a success when the cosine distance between the original image and the generated image is lower than 0.45.
Columns: model, SR (%), Attack acc. on CASIA (%).