Mask-aware Photorealistic Face Attribute Manipulation

The task of face attribute manipulation has found increasing applications, yet it remains challenging: the attributes of a face image must be edited while its unique details are preserved. In this paper, we combine the Variational AutoEncoder (VAE) and Generative Adversarial Network (GAN) for photorealistic image generation. We propose an effective method that modifies a small number of pixels in the feature maps of an encoder, changing the attribute strength continuously without disturbing global information. Our VAE and GAN training objectives are reinforced by the supervision of a face recognition loss and a cycle consistency loss for faithful preservation of face details. Moreover, we generate facial masks to enforce background consistency, which lets training focus on manipulating the foreground face rather than the background. Experimental results demonstrate that our method, called Mask-Adversarial AutoEncoder (M-AAE), can generate high-quality images with changed attributes and outperforms prior methods in detail preservation.


Introduction
The task of face attribute manipulation is to edit the face attributes shown in an image, e.g., hair color, facial expression, age and so on. It has a wide range of applications, such as data augmentation and age-invariant face verification [Park et al., 2010; Chi et al., 2017]. Essentially, this is an image generation problem. But unlike the style translation task [Gatys et al., 2016; Li et al., 2017], attribute manipulation is more challenging due to the requirement of modifying only some image features while keeping others unchanged (including the image background).
With the advent of generative adversarial networks (GANs) [Goodfellow et al., 2014], the quality of generated images has improved over time. The family of GAN methods can be divided into two main categories: those with noise input [Mirza and Osindero, 2014; Yang et al., 2017] and those conditioned on an input image [Johnson et al., 2016; Choi et al., 2017]. Our method falls into the second category, with the aim of changing the face attributes in the input image while preserving its details.
One simple option to achieve this goal is to use the conditional GAN framework [Mirza and Osindero, 2014; Zhang et al., 2017], which concatenates the input image with a one-hot attribute vector to encode the desired manipulation. However, such a global transformation can neither guarantee facial detail preservation nor make a continuous change in the attribute strength. Another option is to directly learn the image-to-image translation along attributes. CycleGAN [Zhu et al., 2017] learns such a translation rule from unpaired images with a cycle consistency constraint. The recent UNIT method [Liu et al., 2017] uses generative adversarial networks (GANs) and variational autoencoders (VAEs) for robust modelling of different image domains, and then also applies the cycle consistency constraint to learn domain translation effectively. [Shen and Liu, 2017] proposed to learn only the residual image before and after attribute manipulation by using two transformation networks, one for attribute manipulation and the other for its dual operation.
The above methods share one common drawback: there exists no mechanism to keep the unique facial traits while editing attributes. Most likely we will observe changed attributes with lost personal details. [Zhang et al., 2017] provided a partial remedy by feeding the face images before and after attribute manipulation into a face recognition network and penalizing their feature distance. This is essentially one way to preserve facial identity information. However, it may still change non-targeted features beyond identity, or other parts of the image (e.g., the background), which is not visually pleasing. We especially note the importance of keeping the background unchanged, since it is often observed to change along with the foreground face. This suggests some face attribute manipulation efforts are wasted in irrelevant regions. Pasting the original background around the manipulated face with a face mask is not a solution, because the two parts can be drastically incompatible.
In this paper, we learn to simultaneously manipulate the target attributes of a face image and keep its background untouched. Our method is based on the VAE-GAN framework [Zhang et al., 2017] for strong modeling of photorealistic images. We propose an effective method to modify a minimum number of feature map pixels from our encoder. This allows us to maximally preserve the global image information and also change the strength of target attributes continuously. To avoid loss of the unique facial details during attribute editing, we attach to the VAE-GAN objectives an additional face recognition loss and a cycle consistency loss (to ensure image consistency after two inverse manipulations). Furthermore, we mask out image backgrounds to coherently penalize their difference before and after face attribute manipulation. We call our method the Mask-Adversarial AutoEncoder (M-AAE) and support its efficacy with extensive experiments.
In summary, the contributions of this paper are as follows: • We present an effective method that modifies a small number of pixels in our learned feature maps to realize continuous manipulation of face attributes.
• We propose a Mask-Adversarial AutoEncoder (M-AAE) training objective to ensure faithful facial detail preservation as well as background consistency.
• The proposed method demonstrates state-of-the-art performance in photorealistic attribute manipulation.

Related Work
Face attribute manipulation Most methods of face attribute manipulation are based on generative models. These methods fall into two main groups: those guided by an extra input vector, and those that directly learn an image-to-image translation along attributes. The first group often takes an attribute vector as the guidance for manipulating the desired attribute. The CAAE method [Zhang et al., 2017] concatenates the one-hot age label with latent image features, which are fed into the generator for age progression. StarGAN [Choi et al., 2017] takes the one-hot vector to represent domain information for "domain transfer". However, such global transformations based on an external code usually cannot preserve the facial details well after attribute manipulation. The second group of methods operates only in the image domain and learns the image-to-image translation directly. CycleGAN [Zhu et al., 2017] and the UNIT method [Liu et al., 2017] are such examples, supervised by a cycle consistency loss that requires the manipulated image to be mappable back to the original image. [Shen and Liu, 2017] further proposed to learn only the residual image before and after attribute manipulation, which is easier and leads to higher-quality image prediction. Unfortunately, these methods still have difficulty manipulating the target attribute while keeping others unchanged.

Methodology
Our goal is to manipulate an attribute of an input face image and generate a new one, e.g., to change the hair color from black to yellow. The difficulty lies in the generation of photorealistic as well as faithful face images, i.e., the generated image should look real and have its unique details preserved, including the background. We propose a Mask-Adversarial AutoEncoder (M-AAE) method to address these challenges, as detailed below.

Framework Overview
Our M-AAE method is based on the VAE-GAN framework, as shown in Fig. 1. The encoder-decoder De(En(x)) of the VAE for input image x is treated as the GAN's generator G(x). The discriminator D(·) of the GAN tells the generated image G(x) apart from real images. To manipulate attributes of input image x, we design a simple but effective mechanism to uniformly modify the encoded features En(x) by a relative value ±δ, which is fed into the decoder to control the attribute strength present in the output.

Training process Besides training with the VAE and GAN loss functions, we also use a face recognition loss and a cycle consistency loss for faithful preservation of face details. The face recognition module extracts features from images before and after attribute manipulation, and penalizes their feature discrepancy to preserve identity information. The cycle consistency loss aims to preserve other unique facial information by penalizing the difference between the input image x and the generated image after the two inverse attribute transformations G+(x) and G−(x). To ensure background consistency, we further generate facial masks to penalize the background difference between input x and output G(x).

Testing process We simply feed the input image x through our generator G(x) = De(En(x)), changing the relative attribute strength δ in the latent features En(x).

Figure 1: Overview of our M-AAE framework. The encoder-decoder De(En(x)) of the VAE for input image x is treated as the generator G(x) of the GAN, whose discriminator D(·) tells fake from real. We manipulate attributes by modifying the encoded features En(x) by a relative value ±δ, and train using image pairs with opposite face attributes. Our training is supervised by 5 loss functions to both preserve facial details and ensure background consistency (see text for details). We test using only the generator G(·).

Attribute Manipulation in Encoded Features
To manipulate face attributes, rather than take a one-hot attribute vector as in [Zhang et al., 2017; Choi et al., 2017], we modify the latent features in our encoder so that the attribute strength can be changed continuously. One intuitive way is to uniformly increase or decrease the responses of the entire feature map by a relative value δ; however, we empirically observed a global change of image tone when doing this. Instead, we propose to modify only a minimum number of feature map pixels whose receptive field covers the whole image in the image domain. Fig. ?? illustrates how to find such minimum pixels at the top feature layer recursively from the bottom layer. In this way, the image-level manipulation can be performed efficiently with a modest feature modification. More importantly, we avoid a large loss of image information. Our experiments demonstrate the efficacy of this design for information preservation during attribute manipulation. In practice, the relative value δ is chosen as half the value range of the feature map pixels for reversing one particular attribute (δ ≈ 5 in our scenario). The modified features are then fed into the decoder to generate the output image G+(x) or G−(x) with a strengthened or weakened attribute.
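As a concrete (hedged) illustration of this mechanism, the NumPy sketch below shifts only a chosen subset of feature-map pixels by δ while leaving the rest untouched. The pixel positions here are hypothetical placeholders; in the paper they are found recursively from receptive-field coverage.

```python
import numpy as np

def manipulate_features(feat, delta, pixels):
    """Shift only the selected feature-map pixels by delta.

    feat   : (C, H, W) encoded feature map En(x)
    delta  : relative attribute strength (the paper uses delta ~ 5)
    pixels : (h, w) positions whose receptive field covers the whole
             input image (hypothetical choices in this sketch)
    """
    out = feat.copy()
    for h, w in pixels:
        out[:, h, w] += delta  # uniform shift across all channels
    return out

feat = np.zeros((8, 4, 4))
edited = manipulate_features(feat, 5.0, [(1, 1), (2, 2)])
```

Using −δ instead of +δ yields the inverse manipulation G−(x), and intermediate values of δ give continuous attribute strengths.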

Learning of Mask-Adversarial AutoEncoder
VAE loss The VAE consists of an encoder that maps an image x to a latent feature z ∼ En(x) = q(z|x) and a decoder that maps z back to image space x′ ∼ De(z) = p(x|z). The VAE regularizes the encoder by imposing a prior over the latent distribution p(z), where z ∼ N(0, I) is assumed to be Gaussian. The VAE also penalizes the reconstruction error between x and x′, giving the loss function:

L_VAE = λ1 KL(q(z|x) || p(z)) + λ2 ||x − x′||_1,

where λ1 and λ2 balance the prior regularization term and the reconstruction error term, and KL is the Kullback-Leibler divergence. The reconstruction error term is equivalent to the L1 norm between x and x′, since we assume p(x|z) has a Laplacian distribution.
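For concreteness, here is a minimal NumPy sketch of this loss, assuming a diagonal-Gaussian posterior parameterized by (μ, log σ²) so that the KL term has a closed form; the function and argument names are illustrative, not the paper's implementation.

```python
import numpy as np

def vae_loss(x, x_rec, mu, logvar, lam1=1.0, lam2=1.0):
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    # L1 reconstruction term (Laplacian likelihood assumption)
    rec = np.sum(np.abs(x - x_rec))
    return lam1 * kl + lam2 * rec

x = np.ones((3, 8, 8))
loss = vae_loss(x, x, np.zeros(64), np.zeros(64))  # perfect posterior + recon
```

With μ = 0, log σ² = 0 and a perfect reconstruction, both terms vanish and the loss is zero.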

GAN loss
The GAN loss is introduced to improve the photorealistic quality of the generated image. Since the encoder-decoder of the VAE is treated as the GAN generator, we use the input image x and the generated image G(x) from the VAE as the real and fake images for discriminative training. The GAN loss function is as follows:

L_GAN = E[log D(x)] + E[log(1 − D(G(x)))],

which the discriminator maximizes and the generator minimizes. The weights of the generator and discriminator are updated alternately in the training process.
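A minimal sketch of the two alternating objectives, written as losses to be minimized. Note the generator side below uses the common non-saturating variant (maximize log D(G(x))), which is an implementation assumption rather than something the paper specifies.

```python
import numpy as np

def d_loss(d_real, d_fake):
    # discriminator ascends log D(x) + log(1 - D(G(x))),
    # i.e., descends the negative of the GAN objective
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def g_loss(d_fake):
    # non-saturating generator objective: ascend log D(G(x))
    return -np.mean(np.log(d_fake))

# an undecided discriminator outputs 0.5 for both real and fake
ld = d_loss(np.array([0.5]), np.array([0.5]))
lg = g_loss(np.array([0.5]))
```

At D ≡ 0.5 the discriminator loss equals 2 log 2, the classic equilibrium value of the minimax game.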
ID loss Making the generated image photorealistic is not enough for face attribute manipulation. Imagine an extreme case where a perfectly realistic generated image keeps no unique traits of the face: it simply does not look like the original face at all. This is not acceptable for faithful face manipulation. To preserve personal information as much as possible, we use a face recognition network [Parkhi et al., 2015] to penalize the shift of face identity, which is one of the most important facial features to consider. Concretely, we extract identity features from images before and after attribute manipulation, and enforce them to be close to each other. The ID loss function is then defined as:

L_ID = ||F_ID(x) − F_ID(G(x))||_1,

where F_ID(·) is the feature extractor from the face recognition network.

Figure 2: Facial attribute manipulation results with 7 attributes on the CelebA dataset. We compare with the state-of-the-art residual image GAN [Shen and Liu, 2017], UNIT [Liu et al., 2017] and StarGAN [Choi et al., 2017] (first row) and our various baselines (second row). For each method, the results are shown for the manipulation of the corresponding attributes in the attribute chart.
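The ID loss reduces to a distance between two feature vectors; a tiny sketch (L1 distance here, as an illustrative metric choice, with F_ID treated as an opaque feature extractor):

```python
import numpy as np

def id_loss(f_x, f_gx):
    # distance between identity features F_ID(x) and F_ID(G(x));
    # L1 is used here for illustration, matching the paper's other losses
    return np.sum(np.abs(f_x - f_gx))

same = id_loss(np.ones(128), np.ones(128))      # identity fully preserved
shift = id_loss(np.zeros(4), np.ones(4))        # features drifted apart
```

The recognition network's weights stay frozen, so this loss only shapes the generator.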
Cycle consistency loss We also want to keep facial characteristics beyond identity after manipulation of the target attribute. Since it is hard to keep track of characteristics that have no supervision, we follow the idea of self-supervision in [Zhu et al., 2017]. Specifically, we impose the cycle consistency constraint along the dimension of the attribute. We apply the two inverse transformations G+(·) and G−(·) with attribute strengths +δ and −δ to an image x, and ensure that the resulting image G−(G+(x)) arrives close to the input x. The cycle consistency loss is defined as:

L_cyc = ||G−(G+(x)) − x||_1 + ||G+(G−(y)) − y||_1,

where x and y are the training image pair with opposite attribute labels, and we impose the cycle consistency constraint for both of them. The L1 norm is used to measure the image distance.
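The round-trip constraint can be sketched as below, with toy "generators" standing in for G+ and G−; when the two transformations are exact inverses, the loss vanishes.

```python
import numpy as np

def cycle_loss(x, y, g_plus, g_minus):
    # x must survive the +delta then -delta round trip, and y the reverse
    return (np.sum(np.abs(g_minus(g_plus(x)) - x))
            + np.sum(np.abs(g_plus(g_minus(y)) - y)))

# toy generators: a perfect pair of inverse transformations
g_plus = lambda img: img + 5.0
g_minus = lambda img: img - 5.0
loss = cycle_loss(np.zeros((3, 4, 4)), np.ones((3, 4, 4)), g_plus, g_minus)
```

In training, any residue the round trip leaves behind (lost texture, drifted colors) is penalized directly in pixel space.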
Mask loss In some cases, we observed that the image background changes along with the foreground face under previous attribute manipulation methods. This is not visually pleasing and also suggests that some manipulation efforts are wasted in the wrong regions. We argue that pasting the original background around the manipulated face is not ideal because the two parts can be incompatible. Here we learn to change the foreground face attribute and keep the background the same in a coherent way. We generate a facial mask (and thus a background mask as well) using FCN [Long et al., 2015], and penalize the background difference between the input x and the generated G(x):

L_mask = ||Mask(x) − Mask(G(x))||_1,

where Mask(·) is the mask-out operator using the generated background mask. Note that the background mask of input x is shared for both input x and output G(x); we do not generate a separate mask for G(x), which would lead to an inconsistent penalty.
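A small sketch of the shared-mask penalty, assuming a binary background mask (1 on background, 0 on the face) computed once from x:

```python
import numpy as np

def mask_loss(x, g_x, bg_mask):
    # bg_mask is computed from x only and shared for both x and G(x),
    # so edits inside the face region are never penalized
    return np.sum(np.abs(bg_mask * (x - g_x)))

x = np.zeros((3, 4, 4))
g_x = np.zeros((3, 4, 4))
g_x[:, 1:3, 1:3] = 9.0            # only the face region changed
bg = np.ones((3, 4, 4))
bg[:, 1:3, 1:3] = 0.0             # zero out the face in the mask
loss = mask_loss(x, g_x, bg)      # background untouched -> zero loss
```

Any change leaking outside the face region would contribute to the loss, steering the generator's capacity toward the foreground.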

Overall Training Procedure
Our final training objective is defined as follows:

L = α1·L_VAE + α2·L_GAN + α3·L_ID + α4·L_cyc + α5·L_mask,

where the weights α1 ∼ α5 balance the relative importance of our 5 loss terms. The GAN generator, i.e., the encoder-decoder, is trained jointly, while the GAN discriminator is trained alternately. The face recognition network is only used to extract features and its weights are frozen. We choose the first 11 layers of the recognition network [Parkhi et al., 2015] as the feature extractor.
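The combination of the 5 terms is a plain weighted sum; the sketch below makes that explicit. The weight values shown are placeholders, since the paper does not list its α settings here.

```python
def total_loss(l_vae, l_gan, l_id, l_cyc, l_mask, alphas):
    # weighted sum of the 5 M-AAE loss terms; alphas = (a1, ..., a5)
    a1, a2, a3, a4, a5 = alphas
    return a1 * l_vae + a2 * l_gan + a3 * l_id + a4 * l_cyc + a5 * l_mask

# placeholder loss values and unit weights, purely for illustration
loss = total_loss(1.0, 2.0, 3.0, 4.0, 5.0, (1, 1, 1, 1, 1))
```

Only the generator descends this full objective; the discriminator alternately ascends its own GAN term.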

Experiments
In this section, we first introduce the dataset we use and our implementation details. Our M-AAE is then compared against state-of-the-art methods both qualitatively and quantitatively to show its advantages. An ablation study is conducted to demonstrate the contribution of each component of our framework.

Dataset and Implementation Details
We evaluated on the CelebA dataset [Liu et al., 2015]. This dataset contains 202,599 face images of 10,177 celebrities. Each image is labeled with 40 binary attributes, e.g., "hair color", "age", "gender" and "pale skin". We choose 7 typical attributes (see Fig. 2) for our attribute manipulation experiments. For each attribute, we select 1000 testing images and train with the remaining images in the dataset.

Qualitative Evaluation
Fig. 2 compares our method with the state-of-the-art residual image GAN [Shen and Liu, 2017], UNIT [Liu et al., 2017] and StarGAN [Choi et al., 2017] in the first row. The recent residual image GAN and StarGAN achieve top performance in image translation and attribute manipulation. The UNIT method is similar to ours in using the VAE-GAN framework and the cycle-consistency constraint. We observed that all these methods can produce artifacts or lose personal features to some extent. Their performance is usually good on single attribute manipulation, or on multi-attribute manipulation when the target attributes are correlated (e.g., "pale skin" and "gender"). However, the performance deteriorates in more complex scenarios. For example, the residual image GAN totally collapses when generating images with eyeglasses. In comparison, our M-AAE method produces fewer artifacts and better preserves personal features.

Fig. 2 also compares our various baselines to demonstrate the contribution of our major components. From the comparison of results in (e) and (f), we find that modifying a meaningful subset of feature map pixels better preserves global face information (e.g., color tone) than modifying the entire feature map. Note that these two baselines already use the cycle consistency loss in our VAE-GAN framework, whose efficacy is validated by similar works like UNIT [Liu et al., 2017]. Hence in (g), we further show that adding an ID loss enhances identity preservation while editing other attributes. When we use an extra mask loss, the background becomes sharper and the foreground facial details are also enhanced with higher fidelity.

Quantitative Evaluation
For quantitative evaluations, we perform a user study, inviting volunteers to evaluate the attribute manipulation results. Given a set of generated images from different methods, the volunteers are instructed to rank the methods based on perceptual realism, quality of the transferred attribute, and preservation of personal features. The generated images from different methods are shuffled before being presented. There are 30 validated volunteers evaluating results on the 7 attributes chosen from CelebA. The average rank (between 1 and 7) of each method is calculated and shown in Table 1. Note that we experiment with different numbers of manipulated attributes from 1 to 4, which have gradually increasing difficulty. From these results, again, we find our advantage over prior methods (better ranks), especially in the multi-attribute manipulation cases. Our ID loss and mask loss improve the results steadily due to their preservation of foreground facial details and the background scene.

Analysis
We show more of our results in Fig. 3 to empirically demonstrate the generalization ability of our method. Our method handles a rich combination of attributes well, successfully preserving the unique facial details and background in the generated image with different attributes. We also show our capability of continuous manipulation of attribute strength in Fig. 4. We achieve this by adjusting the attribute strength δ in the latent features, which is more flexible than prior methods that take a fixed attribute vector as input.

Conclusion and Future Work
In this paper, we propose a Mask-Adversarial AutoEncoder (M-AAE) method to effectively manipulate human face attributes. Our method is based on the VAE-GAN framework, and we propose an effective method to modify a minimum number of pixels in the feature maps of an encoder, which allows us to change the attribute strength continuously without hindering global information. Our method pays special attention to facial detail preservation and image background consistency. We introduce the face recognition loss and cycle consistency loss for faithful preservation of face details, and also propose a mask loss to ensure background consistency. Experiments show that our method can generate highly photorealistic and faithful images with changing attributes. In principle, our method can be extended to deal with more image translation tasks (e.g., style transformation) which will be included in our future work.