Mitigating Demographic Bias in Facial Datasets with Style-Based Multi-attribute Transfer

Deep learning has catalysed progress in tasks such as face recognition and analysis, leading to a quick integration of technological solutions in multiple layers of our society. While such systems have proven to be accurate by standard evaluation metrics and benchmarks, a surge of work has recently exposed the demographic bias that such algorithms exhibit–highlighting that accuracy does not entail fairness. Clearly, deploying biased systems under real-world settings can have grave consequences for affected populations. Indeed, learning methods are prone to inheriting, or even amplifying the bias present in a training set, manifested by uneven representation across demographic groups. In facial datasets, this particularly relates to attributes such as skin tone, gender, and age. In this work, we address the problem of mitigating bias in facial datasets by data augmentation. We propose a multi-attribute framework that can successfully transfer complex, multi-scale facial patterns even if these belong to underrepresented groups in the training set. This is achieved by relaxing the rigid dependence on a single attribute label, and further introducing a tensor-based mixing structure that captures multiplicative interactions between attributes in a multilinear fashion. We evaluate our method with an extensive set of qualitative and quantitative experiments on several datasets, with rigorous comparisons to state-of-the-art methods. We find that the proposed framework can successfully mitigate dataset bias, as evinced by extensive evaluations on established diversity metrics, while significantly improving fairness metrics such as equality of opportunity.


Introduction
Deep learning-based models have been successfully utilized to advance the state-of-the-art in face analysis, resulting in accurate algorithms for the recognition of the identity (Masi et al. 2018), age (Fu et al. 2010; Georgopoulos et al. 2018), gender (Ng et al. 2012), and expressions (Li and Deng 2020) of the human face. Building on this success, automatic face analysis facilitates modern human-computer interaction and has found application in numerous areas of everyday life. Most importantly, machine learning and computer vision models have been applied to critical decision-making tasks, from predicting recidivism and criminal behavior, to assessing candidates in interviews, and automating border control. With the wide adoption of this technology, its developers bear the responsibility to ensure that these systems do not discriminate against any subpopulation of users, i.e., that they are fair. However, machine learning models are prone to inheriting or even amplifying the bias that is present in the training data. In the context of face analysis, an algorithm can perform unfairly when applied to demographic groups that are underrepresented in the training set (e.g., faces of a specific gender, skin tone, or age group). This can happen even when the algorithm appears accurate under current evaluation metrics, which inherently fail to capture properties such as fairness and diversity.
In recent years, a surge of work has exposed the demographic bias of face analysis systems. Buolamwini and Gebru (2018) showed that commercial gender classification systems performed significantly worse on darker-skinned females. Moreover, state-of-the-art face recognition models have been reported to demonstrate bias with regard to the age, gender, and skin tone of the input face (Serna et al. 2019; Wang et al. 2019; Nagpal et al. 2019). In most cases, these demographic disparities in model performance are caused by the lack of diversity in the publicly available face datasets (Kuhlman et al. 2020; Holstein et al. 2019). For instance, widely adopted facial datasets like LFW (Huang et al. 2008) and CelebA (Liu et al. 2015) contain mainly faces of lighter skin. Similarly, only 0.5% of the faces in MORPH (Ricanek and Tesafaye 2006) are of people over 60 years old, while 87% of the faces in FG-NET (Lanitis 2002) are younger than 30 years old. Nevertheless, collecting a diverse dataset large enough for modern deep learning tasks is a herculean task. Thus, modern state-of-the-art face analysis systems are trained and tested on datasets that are lacking in either size or diversity. To circumvent this issue, practitioners turn to the plethora of available data augmentation techniques [see Shorten and Khoshgoftaar (2019) for a survey].
Data augmentation methods range from simple image transformations (e.g., mirroring, rotation, and random cropping) to non-photorealistic image mixing [e.g., Inoue (2018)] and deep generative models (Sandfort et al. 2019). In the latter category, neural style transfer has proved to be an efficient data augmentation tool that can be used to train robust classifiers (Jackson et al. 2019; Perez and Wang 2017; Zheng et al. 2019). In this spirit, we propose a novel style transfer framework that is tailored to the task of diversity-enhancing data augmentation. Since our aim is to enhance the demographic diversity of a facial dataset, we propose a style transfer approach using Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) that is able to transfer multiple demographic attributes for each image in a biased set. The resulting image set is less ridden with demographic biases, and can hence be used to train fairer face classifiers.
Translation of demographic attributes has been studied extensively, albeit for individual attributes. In particular, facial aging is a long-standing task in computer vision (Ramanathan et al. 2009; Fu et al. 2010; Georgopoulos et al. 2018), with recent GAN-based approaches being able to produce realistically aged and rejuvenated faces (Wang et al. 2018; Duong et al. 2019). On the other hand, modifying the gender of a face is a common application of facial attribute transfer (Choi et al. 2018; He et al. 2019). However, these methods are not capable of synthesizing realistic samples at the tails of the distribution (e.g., for faces over 60 years old), as we show experimentally in Sect. 4.4. At the same time, face synthesis in such models is usually conditioned on a single attribute label. Hence, these methods are only able to generate a single image per attribute class, which is constrained by the bias of the training set. The aforementioned limitations prevent these approaches from being able to efficiently mitigate dataset bias.
In this work, instead of collapsing attribute information into a single label, we condition the face generation on discriminative representations for each attribute. For instance, despite training with the provided binary gender labels, our method learns a high-dimensional, continuous-valued representation of gender, which better reflects the underlying non-binary nature of the attribute. Motivated by the recent success of style-based GANs (Huang et al. 2018; Kim et al. 2020; Park et al. 2019; Ma et al. 2018) in transferring arbitrary styles at different scales, we propose to transfer the joint demographic style of each subpopulation. Consequently, our method is able to modify the attributes of each face and enhance the diversity of the dataset. In order to combine the different representations capturing complex facial patterns related to each attribute, we propose a novel extension to AdaIN (Huang and Belongie 2017). The proposed method is tailored to handle the mixing of multiple attribute representations by introducing a tensor-based mixing structure that captures multiplicative attribute interactions in a multilinear fashion, effectively facilitating multi-attribute transfer given a single input image.
As we show in this paper, the proposed formulation is flexible enough to synthesize images of large intra-class diversity, transferring complex facial patterns even for samples that belong to the tails of the training set distribution. Summarizing, the contributions of this paper are listed in what follows.
- We introduce a novel style transfer GAN that is able to transfer multiple demographic attributes simultaneously. By conditioning on different sets of target images, our framework is able to generate diverse images for each attribute class.
- We also study the case of bias in gender recognition, within the binary paradigm in which it is currently commonly framed in practice, on two datasets: MORPH and KANFace. The experimental analysis indicates that by augmenting the training sets using our model, we are able to mitigate the classifier biases more effectively than other state-of-the-art methods.
The proposed framework is based on our prior work, which used standard generative architectures that could be adapted to perform face aging and focused solely on age progression. This work significantly extends our preliminary work, as it is designed to transfer multiple attributes instead of just age, by proposing a novel multilinear extension to AdaIN that is suitable for multiple attributes. That is, the proposed method is able to handle the mixing of multiple attribute representations in a multilinear fashion. The style transfer model is evaluated on additional datasets and compared to multi-attribute GANs [i.e., StarGAN (Choi et al. 2018) and AttGAN (He et al. 2019)]. Lastly, the focus of the preliminary work was dataset diversity. In this work, we make the connection between diversity and algorithmic fairness. In particular, we enhance the diversity of datasets and evaluate the fairness of the classifiers that are trained on them, offering a thorough comparison to 7 state-of-the-art bias mitigation methods on 2 datasets.

Related Work
In this section, we provide an overview of related work from the area of style-based image transfer with adversarial learning (Sect. 2.1). Furthermore, we introduce related work in terms of demographic attribute editing (Sect. 2.2), as several methods have been proposed in literature for editing such attributes; albeit with most works focusing on one attribute individually. Finally, we discuss related literature on fairness and bias mitigation, focusing on face analysis (Sect. 2.3).
MUNIT uses AdaIN in a GAN setting to perform image-to-image translation by injecting style content into an autoencoder-like network at the bottleneck layers. The seminal work of StyleGAN proposed a generator architecture using AdaIN to modulate style content at multiple resolutions. The state-of-the-art image-to-image translation methods continue to adopt style-based approaches (Kim et al. 2020; Park et al. 2019; Ma et al. 2018), and we follow in this vein, treating demographic attributes as target styles.

Transfer of Demographic Attributes
Synthesizing faces of a specific target demographic has been studied extensively, albeit for individual demographic attributes. In particular, age progression refers to the task of rendering an aged or rejuvenated image of an input face (Fu et al. 2010; Ramanathan et al. 2009; Georgopoulos et al. 2018). While earlier works in age progression proposed simplistic prototype-based approaches that could not produce photorealistic results, a number of recently proposed GAN-based methods are capable of convincing face aging. A conditional adversarial autoencoder (CAAE) that models aging as the traversal of a low-dimensional manifold has also been introduced. GAN-based image translation approaches were also proposed in Wang et al. (2018) and Yang et al. (2018), where pre-trained networks were used to facilitate identity preservation and aging accuracy, respectively. Besides age, generating faces with different genders is a standard benchmark of the CELEB-A dataset (Liu et al. 2015), and is successfully performed by numerous attribute editing approaches (Choi et al. 2018; He et al. 2019, 2017; Perarnau et al. 2016; Li et al. 2016). Lastly, in order to mitigate racial bias in face recognition, Yucer et al. (2020) proposed to use CycleGAN to transfer race.
Fig. 1 Overview of the proposed method for multi-attribute transfer by way of example: An input image $X_{g_1,a_1}$ of gender $g_1$ and age $a_1$ is passed through the Generator (left), to translate it to target age $a_5$ and gender $g_0$. At each upsampling block in the generator, we perform AdaIN to modulate the style content. The new statistics are computed using a multiplicative fusion module that captures the interactions between the discriminators' activations' moments when evaluated on target images of the desired classes. The logits output from the discriminators are used to train our adversarial loss $L_{adv}$, and the target and synthetic images are used to train the generator's $L_{fm}$ feature-matching loss
In our work, instead of focusing on a single attribute, we propose to simultaneously edit multiple demographic attributes. Unlike many stand-alone age progression methods (Yang et al. 2018) and style transfer-based GANs (Huang et al. 2018), we require no additional networks to obtain the conditioning information.

Fairness-aware Learning and Face Analysis
In light of the growing concerns regarding algorithmic discrimination, the field of fairness-aware machine learning has attracted interest from the research community. Most of the work focuses on tabular data [e.g., the UCI Adult income dataset, Dua and Graff (2017)] and tackles the problem of fair classification with regard to a protected attribute [e.g., Edwards and Storkey (2015), Madras et al. (2018)]. In such cases, both the output variable and the protected attribute are binary. For instance, a common scenario in the UCI Adult dataset is the classification of the subjects into two categories based on their wage, while being fair with regard to gender. These works have given rise to different definitions of fairness (Mehrabi et al. 2019; Verma and Rubin 2018), the most common of which are: (i) demographic parity, (ii) equalized odds, and (iii) equality of opportunity. Demographic parity ensures that the predicted label $\hat{Y}$ is independent of the protected attribute $S$, i.e., $P(\hat{Y} = 1 \mid S = 1) = P(\hat{Y} = 1 \mid S = 0)$. Unlike demographic parity, a predictor $\hat{Y}$ that satisfies equalized odds can depend on $S$, but only through the target variable $Y$ (Hardt et al. 2016), that is:
$$P(\hat{Y} = 1 \mid S = 0, Y = y) = P(\hat{Y} = 1 \mid S = 1, Y = y), \quad y \in \{0, 1\}.$$
This definition implies that both the true and false positive rates will be the same for each population. Equality of opportunity is a relaxed notion of equalized odds that only requires the true positive rates to be equal (Hardt et al. 2016), formally:
$$P(\hat{Y} = 1 \mid S = 0, Y = 1) = P(\hat{Y} = 1 \mid S = 1, Y = 1).$$
In this work, we use equality of opportunity to quantitatively evaluate the fairness of our face analysis models, as it is one of the most commonly employed approaches. It is only very recently that fairness-aware algorithms for face analysis have attracted similar attention. The experimental results in Buolamwini and Gebru (2018) indicate that commercial gender recognition systems demonstrate a significant difference in performance between lighter male and darker female faces.
This study resulted in replies from the vendors that focused on the importance of diversity in the training datasets (Raji and Buolamwini 2019). Thereafter, a series of in-processing and pre-processing methods have been proposed for fair face analysis, the former targeting unfairness at the algorithm level and the latter at the data level. In the former category, Alvi et al. (2018) proposed a joint learning and un-learning framework, while Kim et al. (2019) learn a fair classifier by minimizing the mutual information between the intermediate representation and the bias. These methods employ techniques from domain adaptation to learn a representation that minimizes classification loss while being invariant to the sensitive attribute. In the latter category, Sattigeri et al. (2018) extend AC-GAN (Odena et al. 2017) to generate a fair dataset, while Quadrianto et al. (2018) use an autoencoder to remove sensitive information from images. In this work, we introduce an image-to-image translation model to augment the training set, and thus our framework is most closely related to Quadrianto et al. (2018). However, contrary to Quadrianto et al. (2018), whose approach removes demographic attributes (e.g., producing gender-less faces), our neural style transfer approach generates naturalistic faces by translating these attributes.
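The three fairness definitions given earlier in this section can be illustrated with a small, self-contained sketch. It assumes binary labels, binary predictions and a binary protected attribute, and it reports each criterion as an absolute gap (zero meaning the criterion holds exactly). The function names and the toy data are illustrative assumptions, not part of the evaluation protocol of this paper.

```python
from typing import Sequence


def rate(preds: Sequence[int], mask: Sequence[bool]) -> float:
    """Fraction of positive predictions among the samples selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    if not selected:
        raise ValueError("empty group: the rate is undefined")
    return sum(selected) / len(selected)


def demographic_parity_gap(y_hat, s) -> float:
    """|P(Y_hat=1 | S=1) - P(Y_hat=1 | S=0)|."""
    return abs(rate(y_hat, [a == 1 for a in s]) - rate(y_hat, [a == 0 for a in s]))


def group_gap(y_hat, y, s, label: int) -> float:
    """|P(Y_hat=1 | S=1, Y=label) - P(Y_hat=1 | S=0, Y=label)|."""
    r1 = rate(y_hat, [a == 1 and t == label for a, t in zip(s, y)])
    r0 = rate(y_hat, [a == 0 and t == label for a, t in zip(s, y)])
    return abs(r1 - r0)


def equal_opportunity_gap(y_hat, y, s) -> float:
    """Gap in true positive rates only (Y = 1)."""
    return group_gap(y_hat, y, s, label=1)


def equalized_odds_gap(y_hat, y, s) -> float:
    """Largest gap over Y in {0, 1}: false positive and true positive rates."""
    return max(group_gap(y_hat, y, s, label=0), group_gap(y_hat, y, s, label=1))


# Toy data (hypothetical): four samples per protected group.
y     = [1, 1, 0, 0, 1, 1, 0, 0]
s     = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0]

dp = demographic_parity_gap(y_hat, s)    # 0.25
eo = equal_opportunity_gap(y_hat, y, s)  # 0.5
eodds = equalized_odds_gap(y_hat, y, s)  # 0.5
```

In this toy example the false positive rates coincide, so the equalized odds gap is driven entirely by the difference in true positive rates.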

Methodology
In this section, we introduce the proposed multi-attribute style transfer framework that enhances the diversity of a face dataset suffering from demographic bias. Our aim is to utilize the proposed generative model to augment a biased training set. Thus, by training a face analysis model on the augmented set, we can mitigate the demographic bias and achieve fairer classification performance.
Drawing inspiration from style transfer GAN literature, the proposed model is able to transfer the joint demographic attribute style from a set of images. In particular, we utilize discriminative features for each attribute at different scales to guide the image translation using AdaIN. To obtain the joint demographic attribute style, we present a novel extension to the AdaIN framework suitable for mixing the statistics of multiple attribute representations. The proposed approach models multiplicative interactions between the attributes, leading to a single fused variable for conditioning at each generator layer.
The remainder of this section is structured as follows. In Sect. 3.1 we introduce the notation as well as the basic matrix and tensor operations used in the paper. The multi-attribute style transfer framework is presented in Sect. 3.2. The different components of the training objective are analyzed in Sect. 3.3. Finally, an overview of the proposed method is visualized in Fig. 1.

Notation
In this section, we introduce the notation and definitions of the operations used throughout this paper. Concretely, we denote tensors by calligraphic letters, e.g., $\mathcal{X}$, matrices by uppercase boldface letters, e.g., $\mathbf{X}$, and vectors by lowercase boldface letters, e.g., $\mathbf{x}$. We refer to each element of a $K$th order tensor $\mathcal{X}$ by $K$ indices, i.e., $(\mathcal{X})_{i_1, i_2, \ldots, i_K} \doteq x_{i_1 i_2 \ldots i_K}$. Similarly, we refer to the elements of matrix $\mathbf{X}$ as $x_{ij}$, while $\mathbf{x}_j$ denotes the $j$-th column of the matrix. The $D$-dimensional vector of ones is denoted as $\mathbf{1}_D$.
Hadamard product The Hadamard product of A ∈ R I ×N and B ∈ R I ×N is denoted by A * B and is the element-wise multiplication of the two matrices.
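Kronecker product The Kronecker product of two matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$ is denoted by $\mathbf{A} \otimes \mathbf{B} \in \mathbb{R}^{(IK) \times (JL)}$ and is defined as:
$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & \cdots & a_{1J}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{I1}\mathbf{B} & \cdots & a_{IJ}\mathbf{B} \end{bmatrix} \tag{1}$$
Khatri-Rao product The Khatri-Rao product of two matrices $\mathbf{A} \in \mathbb{R}^{I \times N}$ and $\mathbf{B} \in \mathbb{R}^{J \times N}$ is denoted by $\mathbf{A} \odot \mathbf{B}$. The resulting matrix is of dimensions $(IJ) \times N$, and is defined as the column-wise Kronecker product:
$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_N \otimes \mathbf{b}_N \end{bmatrix} \tag{2}$$
Tensor unfolding The mode-$k$ unfolding of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_K}$ maps $\mathcal{X}$ to a matrix $\mathbf{X}_{(k)} \in \mathbb{R}^{I_k \times \bar{I}_k}$, with $\bar{I}_k = \prod_{t \neq k} I_t$, whose columns are the mode-$k$ fibers of $\mathcal{X}$.
Mode-k vector product The mode-$k$ vector product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_K}$ and a vector $\mathbf{v} \in \mathbb{R}^{I_k}$ is denoted by $\mathcal{X} \times_k \mathbf{v} \in \mathbb{R}^{I_1 \times \cdots \times I_{k-1} \times I_{k+1} \times \cdots \times I_K}$ and is defined element-wise as:
$$(\mathcal{X} \times_k \mathbf{v})_{i_1 \ldots i_{k-1} i_{k+1} \ldots i_K} = \sum_{i_k = 1}^{I_k} x_{i_1 i_2 \ldots i_K} v_{i_k} \tag{3}$$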
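The operations above can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in this work: the helper names are hypothetical, modes are indexed from 0, and the unfolding is assumed to follow the column ordering of Kolda and Bader (2009), in which the earliest remaining mode varies fastest. The final lines check numerically that contracting modes 2 and 3 of a third-order tensor with two vectors agrees with multiplying the mode-1 unfolding by a Kronecker product.

```python
import numpy as np


def khatri_rao(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Column-wise Kronecker product of A (I x N) and B (J x N) -> (I*J x N)."""
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B need the same number of columns")
    return np.einsum("in,jn->ijn", A, B).reshape(A.shape[0] * B.shape[0], -1)


def unfold(X: np.ndarray, k: int) -> np.ndarray:
    """Mode-k unfolding (0-based k); the earliest remaining mode varies fastest."""
    return np.reshape(np.moveaxis(X, k, 0), (X.shape[k], -1), order="F")


def mode_k_vector_product(X: np.ndarray, v: np.ndarray, k: int) -> np.ndarray:
    """Contract mode k (0-based) of X with the vector v, removing that mode."""
    return np.tensordot(X, v, axes=([k], [0]))


rng = np.random.default_rng(0)
A, B = rng.normal(size=(2, 3)), rng.normal(size=(4, 3))
kron = np.kron(A, B)    # Kronecker product, shape (8, 9)
kr = khatri_rao(A, B)   # Khatri-Rao product, shape (8, 3)

X = rng.normal(size=(2, 3, 4))
a, b = rng.normal(size=3), rng.normal(size=4)
# Contract the third mode first, then the second (0-based modes 2 and 1).
lhs = mode_k_vector_product(mode_k_vector_product(X, b, 2), a, 1)
# Same result through the mode-1 unfolding and a Kronecker product.
rhs = unfold(X, 0) @ np.kron(b, a)
```

Contracting the highest mode first keeps the remaining mode indices unchanged, which avoids off-by-one mistakes when several modes are removed in sequence.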

CP decomposition
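The CANDECOMP/PARAFAC (CP) tensor decomposition (Carroll and Chang 1970; Harshman 1970) factorizes a tensor into a sum of rank-one tensors. Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_K}$ be a tensor of rank $R$. Its CP decomposition is:
$$\mathcal{X} = \sum_{r=1}^{R} \mathbf{u}^{(1)}_r \circ \mathbf{u}^{(2)}_r \circ \cdots \circ \mathbf{u}^{(K)}_r \tag{4}$$
where $\circ$ denotes the vector outer product. The matrices $\mathbf{U}^{(k)} = [\mathbf{u}^{(k)}_1, \mathbf{u}^{(k)}_2, \ldots, \mathbf{u}^{(k)}_R] \in \mathbb{R}^{I_k \times R}$, $k = 1, \ldots, K$, consist of the rank-one components. The CP decomposition of a third order tensor $\mathcal{X}$ is written in matrix form as (Kolda and Bader 2009):
$$\mathbf{X}_{(1)} = \mathbf{U}^{(1)} \left( \mathbf{U}^{(3)} \odot \mathbf{U}^{(2)} \right)^{\top} \tag{5}$$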
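As a numerical sanity check of the third-order matrix form, the sketch below builds a tensor from randomly drawn factor matrices and compares its mode-1 unfolding with the product of the first factor and the transposed Khatri-Rao product of the other two. The helper names, sizes and rank are illustrative assumptions, and the unfolding is assumed to follow the ordering of Kolda and Bader (2009).

```python
import numpy as np


def khatri_rao(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Column-wise Kronecker product: (I x R), (J x R) -> (I*J x R)."""
    return np.einsum("ir,jr->ijr", A, B).reshape(A.shape[0] * B.shape[0], -1)


def unfold_mode1(X: np.ndarray) -> np.ndarray:
    """Mode-1 unfolding with the second index varying fastest along the columns."""
    return X.reshape(X.shape[0], -1, order="F")


def cp_to_tensor(U1: np.ndarray, U2: np.ndarray, U3: np.ndarray) -> np.ndarray:
    """Sum of R rank-one terms u1_r o u2_r o u3_r (third-order case)."""
    return np.einsum("ir,jr,kr->ijk", U1, U2, U3)


rng = np.random.default_rng(0)
R = 2  # number of rank-one components (arbitrary for this example)
U1, U2, U3 = (rng.normal(size=(n, R)) for n in (3, 4, 5))

X = cp_to_tensor(U1, U2, U3)             # shape (3, 4, 5)
matrix_form = U1 @ khatri_rao(U3, U2).T  # shape (3, 20)
```

The factors hold `R * (3 + 4 + 5)` numbers instead of the `3 * 4 * 5` entries of the full tensor, which is the property exploited later to keep the fusion layers small.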

Proposed Framework
Style-conditioned Generator For simplicity of presentation, we consider the case of two attributes, namely gender and age. The generator of our framework is an autoencoder $G$ that learns a mapping from a face $X_{g,a}$ of gender $g$ and age $a$ to a synthesized face $\hat{X}_{g',a'}$ of gender $g'$ and age $a'$. To achieve this, $G$ uses the gender features of a target image $X_{g'}$ and the age features of a target image $X_{a'}$, which are extracted using the corresponding discriminators $D_{gen}$ and $D_{age}$. We consider these features to capture the demographic styles $s_{g'}$ and $s_{a'}$. The styles are obtained from the target images as the activations at different layers of the discriminator networks. The age and gender styles are then fused and injected into the decoder of the generator using the proposed multi-attribute AdaIN, which is described in detail below.
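Multi-attribute AdaIN The standard approach to style transfer is to condition the generation on a single style representation. This is achieved by using AdaIN at different layers of the generator. The AdaIN operator scales and shifts the normalized activations at each layer of the decoder, in order to match a target style. We denote the activations of the $i$-th layer of the generator by $\mathcal{Z}^i$ and the corresponding target style by $s^i$, which is represented by a set of two vectors, i.e., $s^i \doteq \{\boldsymbol{\mu}^i, \boldsymbol{\sigma}^i\}$.
In this work, the vectors $\boldsymbol{\mu}^i$ and $\boldsymbol{\sigma}^i$ are obtained from a target image as the channel-wise first and second order moments of the activations at each layer of the discriminators. We cast vectors $\boldsymbol{\mu}^i$ and $\boldsymbol{\sigma}^i$ to tensors $\mathcal{M}^i$ and $\mathcal{S}^i$ that have the same dimensions as $\mathcal{Z}^i$, by replicating them along the spatial dimensions. Then, the AdaIN operator is defined as:
$$\mathrm{AdaIN}(\mathcal{Z}^i, s^i) = \mathcal{S}^i * \frac{\mathcal{Z}^i - \mu(\mathcal{Z}^i)}{\sigma(\mathcal{Z}^i)} + \mathcal{M}^i \tag{8}$$
where $\mu(\mathcal{Z}^i)$ and $\sigma(\mathcal{Z}^i)$ denote the tensorized channel-wise mean and variance of the activations $\mathcal{Z}^i$, and the product and division are element-wise. In this work, we aim to transfer a collection of styles (i.e., attributes) that are captured by different activations. To this end, we introduce a multilinear style mixing model that captures the multiplicative interactions (Jayakumar et al. 2020) between the activations for each attribute, by assuming a tensorial mixing structure. Concretely, given target styles for age $s^i_a \doteq \{\boldsymbol{\mu}^i_a, \boldsymbol{\sigma}^i_a\}$ and gender $s^i_g \doteq \{\boldsymbol{\mu}^i_g, \boldsymbol{\sigma}^i_g\}$ at layer $i$ of the discriminators, we propose the following factorization:
$$\boldsymbol{\mu}^i = \mathcal{W}^i \times_2 \boldsymbol{\mu}^i_a \times_3 \boldsymbol{\mu}^i_g \tag{9}$$
which can be written (as shown in Kolda (2006), Kolda and Bader (2009)) as:
$$\boldsymbol{\mu}^i = \mathbf{W}^i_{(1)} \left( \boldsymbol{\mu}^i_g \otimes \boldsymbol{\mu}^i_a \right) \tag{10}$$
where $\mathbf{W}^i_{(1)}$ is the mode-1 unfolding of tensor $\mathcal{W}^i$. The fused second order statistics are calculated from $\boldsymbol{\sigma}^i_a$ and $\boldsymbol{\sigma}^i_g$ in a similar fashion:
$$\boldsymbol{\sigma}^i = \mathbf{H}^i_{(1)} \left( \boldsymbol{\sigma}^i_g \otimes \boldsymbol{\sigma}^i_a \right) \tag{11}$$
The multiplicative interactions between the features are captured by tensors $\mathcal{W}^i, \mathcal{H}^i \in \mathbb{R}^{d^i_c \times d^i_c \times d^i_c}$, where $d^i_c$ is the number of channels at layer $i$. Due to the dimensionality of the higher order tensors ($(d^i_c)^3$ parameters each), the number of parameters of the fusion model grows exponentially with the number of attributes. Indicatively, a convolutional layer of the discriminator with 256 filters would require a tensor of about 16.8M parameters to fuse the activations (statistics) of two attributes. To avoid this, tensors $\mathcal{W}^i$ and $\mathcal{H}^i$ assume a CP decomposition, and Eqs. (10) and (11) become:
$$\boldsymbol{\mu}^i = \mathbf{U}^{i,(1)} \left( \left( \mathbf{U}^{i,(3)\top} \boldsymbol{\mu}^i_g \right) * \left( \mathbf{U}^{i,(2)\top} \boldsymbol{\mu}^i_a \right) \right) \tag{12}$$
$$\boldsymbol{\sigma}^i = \mathbf{V}^{i,(1)} \left( \left( \mathbf{V}^{i,(3)\top} \boldsymbol{\sigma}^i_g \right) * \left( \mathbf{V}^{i,(2)\top} \boldsymbol{\sigma}^i_a \right) \right) \tag{13}$$
where $\{\mathbf{U}^{i,(k)}, \mathbf{V}^{i,(k)}\}$, $k \in \{1, \ldots, 3\}$, consist of the rank-1 components of the CP decompositions of tensors $\mathcal{W}^i$ and $\mathcal{H}^i$, respectively. The resulting $\boldsymbol{\mu}^i$ and $\boldsymbol{\sigma}^i$ are then used in Eqs. (6) and (8) to modulate the style in the layers of the generator.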
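The fusion described above can be sketched in NumPy under a few stated assumptions: a third-order mixing tensor with equal dimensions along every mode, activations stored as a height x width x channels array, and normalization by the channel-wise standard deviation with a small `eps` for numerical stability. The function names, array sizes and the `eps` value are illustrative and are not taken from the released model. The check at the end confirms that the CP-factorized fusion gives the same vector as contracting the full mixing tensor with both style vectors.

```python
import numpy as np


def fuse_full(W: np.ndarray, s_a: np.ndarray, s_g: np.ndarray) -> np.ndarray:
    """Full multiplicative fusion: contract modes 2 and 3 of W with s_a and s_g."""
    return np.einsum("ijk,j,k->i", W, s_a, s_g)


def fuse_cp(U1, U2, U3, s_a: np.ndarray, s_g: np.ndarray) -> np.ndarray:
    """CP-factorized fusion: U1 ((U3^T s_g) * (U2^T s_a)), never forming W."""
    return U1 @ ((U3.T @ s_g) * (U2.T @ s_a))


def adain(Z: np.ndarray, mu: np.ndarray, sigma: np.ndarray, eps: float = 1e-5):
    """Normalize each channel of Z (H x W x C), then scale by sigma and shift by mu."""
    mean = Z.mean(axis=(0, 1), keepdims=True)
    std = Z.std(axis=(0, 1), keepdims=True)
    return sigma * (Z - mean) / (std + eps) + mu


rng = np.random.default_rng(0)
d_c, rank = 8, 4  # toy channel count and CP rank
U1, U2, U3 = (rng.normal(size=(d_c, rank)) for _ in range(3))
mu_a, mu_g = rng.normal(size=d_c), rng.normal(size=d_c)

W = np.einsum("ir,jr,kr->ijk", U1, U2, U3)  # full tensor, built only for the check
mu_full = fuse_full(W, mu_a, mu_g)
mu_cp = fuse_cp(U1, U2, U3, mu_a, mu_g)

Z = rng.normal(size=(16, 16, d_c))          # toy generator activations
sigma = np.abs(rng.normal(size=d_c)) + 0.5  # stand-in for the fused scale statistics
out = adain(Z, mu_cp, sigma)
```

The factorized path costs a few matrix-vector products with `d_c x rank` matrices, whereas the full path needs the `d_c^3` tensor in memory.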
Discriminators In the proposed framework we utilize multiple discriminator networks, one for each attribute. Each of the discriminators is trained to distinguish between real and fake images of each class. Since each discriminator is responsible for one attribute, the resulting representations are also discriminative with respect to that attribute. These representations are used to condition the style transfer in the decoder of the generator at each layer. Hence, we design $D_{gen}$ and $D_{age}$ as mirrored decoders of the generator.

Training Objective
Our full objective function comprises two terms, namely the adversarial and the reconstruction loss. The losses are described as follows.
Adversarial loss To train the generator to synthesize photorealistic images with characteristics of the desired class, we use an adversarial loss. Given an input image $X_{g,a}$ and its translation $G(X_{g,a}, s_{g'}, s_{a'})$ to age $a'$ and gender class $g'$, conditioned on demographic styles $s_{g'}$ and $s_{a'}$ (obtained for each layer from $D_{gen}(X_{g'})$ and $D_{age}(X_{a'})$ respectively), we compute an adversarial loss term $L^{y}_{adv}$ for each attribute $y$, where $D_y$ is the discriminator for attribute $y$. The combined adversarial loss $L_{adv}$ covers both age and gender. Rather than having the generator minimize $L_{adv}$, we instead adopt a feature-matching loss $L_{fm}$ (Salimans et al. 2016), computed between the synthetic and the target images, in order to train the generator to match the attribute patterns of particular target images.
Reconstruction loss Our framework is trained to edit the selected demographic attributes, while preserving the remaining input information. To this end, we use a cycle constraint $L_{rec}$ to ensure that the synthetic images can be translated back to the original input, i.e., that $G(G(X_{g,a}, s_{g'}, s_{a'}), s_g, s_a)$ stays close to $X_{g,a}$, where $a$, $g$ are the input labels for $X$'s age and gender respectively.
Full objective The full objective functions for $D$ and $G$ combine the terms above, where $\lambda_{rec}$ is the hyper-parameter weighting the reconstruction loss term.

Experiments
In this section, we present a series of experiments designed to evaluate the efficacy of the proposed style-based model and method for mitigating dataset bias via data augmentation. Firstly, we outline in Sect.
In this work, we benchmark our model on widely adopted datasets [e.g., Ricanek and Tesafaye (2006)] that are annotated with binary gender labels. In particular, the datasets are annotated with the sex labels 'Male' and 'Female', and therefore it is common practice in both the face analysis (Ng et al. 2012; Dantcheva et al. 2015) and fairness literature (Buolamwini and Gebru 2018; Quadrianto et al. 2019; Zhao et al. 2017; Hendricks et al. 2018) to use these available labels under this categorization. As such, we can only address gender bias within this imposed binary classification paradigm. We note however that gender is widely considered to be non-binary, and as such, inappropriate categorization of this attribute runs the risk of resulting in unfair face analysis systems. Furthermore, similar to Buolamwini and Gebru (2018), we investigate bias with regard to "skin tone". That is, using the "race" labels of MORPH, we classify the faces into "light-skinned" or "dark-skinned".

Implementation Details
We adopt the same generator architecture as StarGAN (Choi et al. 2018), with 6 layers in the encoder-decoder and 6 residual blocks (He et al. 2016) in the bottleneck. We depart from their choice of architecture for our discriminators, however, and use a simple 3-layer CNN to match the dimensionality of the decoder (more details on the model architectures can be found in the supplementary material). Following recent progress in stabilising the training of GANs (Mescheder et al. 2018), we use the $R_1$ gradient penalty loss term in the discriminator's objective:
$$L_{gp} = \mathbb{E}_{X} \left[ \left\lVert \nabla_{X} D_y(X) \right\rVert_2^2 \right]$$
where $X$ denotes the real images sampled from the true data distribution.
To further encourage the translated images to contain class-relevant attribute patterns, we find it beneficial to use the style parameters extracted from the synthetic images as the target styles in the reconstruction term in Eq. (19). Concretely, rather than translating the synthetic images back to the input images using the style parameters $s_g$, $s_a$ from the input images, we instead use $\hat{s}_g$ from $D_{gen}(X_{g,a})$ and $\hat{s}_a$ from $D_{age}(X_{g,a})$. By requiring the synthetic images to retain sufficient style information to reconstruct the original images, the reconstruction term thus also aids in the attribute transfer.
We train our networks with the following hyperparameters: $\lambda_{rec} = 0.01$ and $\lambda_{gp} = 10.0$. We opt to train our networks end-to-end with Adam (Kingma and Ba 2014), with a learning rate of $10^{-4}$, $\beta_1 = 0.5$, and $\beta_2 = 0.99$. For the multiplicative fusion layers, we set the rank of the tensors in the CP decomposition to be equal to half the number of filters at each layer.
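To make the effect of this rank choice concrete, the short calculation below compares the parameter count of a full third-order mixing tensor with that of its CP factors for the 256-filter layer mentioned in the methodology. It assumes three factor matrices of size filters x rank per tensor and ignores any bias terms; the function names are illustrative.

```python
def full_tensor_params(d_c: int) -> int:
    """Entries of a d_c x d_c x d_c mixing tensor."""
    return d_c ** 3


def cp_params(d_c: int, rank: int) -> int:
    """Entries of three d_c x rank factor matrices."""
    return 3 * d_c * rank


d_c = 256
rank = d_c // 2                      # half the number of filters

full = full_tensor_params(d_c)       # 16777216
factored = cp_params(d_c, rank)      # 98304
reduction = full / factored          # roughly 170x fewer parameters
print(full, factored, round(reduction, 1))
```

The same count applies to each of the two mixing tensors (one for the first order and one for the second order statistics) at a given layer.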

Datasets
We conduct experiments on a number of popular databases with demographic attribute annotations. Concretely, we adopt the MORPH (Ricanek and Tesafaye 2006), CACD (Chen et al. 2014), KANFace, and UTKFace datasets. MORPH's second album features 55,134 near-frontal facial images of 13,618 identities, which are captured in a controlled setting and demonstrate little variation in expressions, illumination, or background. For our experiments on bias mitigation, we consider all three of the given annotated attributes: 'age', 'gender', and 'race'. We use images with the attribute skin tone labelled as either dark-skinned or light-skinned, which amounts to 53,140 facial images, covering more than 96% of the dataset. In contrast, the CACD dataset features images captured in-the-wild, collected from Google Images. It contains over 160,000 images from 2000 celebrities. The dataset is annotated with regard to age. We manually annotate gender, using the provided identities. Given these two annotated attributes, we model both age and gender for our experiments on CACD. In Sect. 4.6, we train classifiers on the augmented MORPH and KANFace datasets. KANFace is the largest manually annotated video dataset of faces captured in-the-wild. We utilize the static version of KANFace that consists of 40K images that are annotated with regard to 'identity', 'age', 'gender', and 'kinship'. Similarly to CACD, we utilize only the age and gender labels. Finally, the UTKFace dataset consists of over 20,000 images of individuals aged between 0 and 116 years old. The dataset is labelled with 'age', 'gender', and 'ethnicity' labels. Similar to MORPH, we utilize 'skin tone' labels instead of 'ethnicity'. For all datasets, we use 80% of the images for training and keep 20% for testing. The faces are grouped into 5 distinct age classes: 0-18, 19-30, 31-45, 46-60, 61+. This age split is more fine-grained than the one commonly used in face aging literature (Yang et al. 2018; Wang et al. 2016; Yang et al. 2016) (i.e., 0-30, 31-40, 41-50, 51+), and is utilized to uncover the biases against faces under 18 and over 60 years old.
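The age grouping and the 80/20 split described above can be sketched as follows. The bin boundaries come from the text; the function names, the 0-based class indices, and the use of a seeded random shuffle are assumptions made for illustration (the text does not specify how the split is drawn, e.g., whether it is stratified or identity-disjoint).

```python
import random
from typing import List, Sequence, Tuple

# Inclusive upper bounds of the first four classes: 0-18, 19-30, 31-45, 46-60.
# Anything above the last bound falls into the fifth class (61+).
AGE_UPPER_BOUNDS = (18, 30, 45, 60)


def age_class(age: int) -> int:
    """Map an age in years to a class index in {0, 1, 2, 3, 4}."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for idx, upper in enumerate(AGE_UPPER_BOUNDS):
        if age <= upper:
            return idx
    return len(AGE_UPPER_BOUNDS)


def train_test_split(items: Sequence, train_fraction: float = 0.8,
                     seed: int = 0) -> Tuple[List, List]:
    """Shuffle with a fixed seed and cut at the requested training fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


classes = [age_class(a) for a in (5, 18, 19, 45, 60, 61, 116)]
train, test = train_test_split(range(10))
```

For datasets with several images per identity, a split drawn at the identity level would avoid the same person appearing in both partitions; that variant is not shown here.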

Baselines
We benchmark our model against two strong baselines for multi-attribute image translation, namely StarGAN (Choi et al. 2018) and AttGAN (He et al. 2019). Additionally, since face aging has been in itself investigated in the literature, we compare against two standalone face aging GANs, namely IPCGAN and CAAE. Both StarGAN and AttGAN translate an input image to multiple target domains by conditioning on a one-hot encoded label vector. StarGAN differs from AttGAN in that it employs a cycle-constraint reconstruction loss, while AttGAN utilizes a so-called attribute classification constraint to encourage correct attribute classification in synthesized images. On the other hand, IPCGAN uses a separate pre-trained age classifier to enforce the aging features of a particular target age class on the translated faces, while CAAE learns face aging by traversing a manifold. We benchmark our model against these 4 baselines both qualitatively and quantitatively in the sections that follow. The baselines were implemented using the authors' publicly released code where available.

Attribute Transfer
Our approach to dataset bias mitigation involves augmenting the biased datasets in such a way that the joint distribution of the attributes' labels is uniform. For this reason, the ability to synthesize sharp and realistic images that are well-classified as the target class is paramount. In Figs. 2a to 2c we translate a given facial image to all attribute combinations for the available class labels and compare the results of our method to the baselines. We find that both IPCGAN and CAAE fail to produce sharp images representative of the target class, especially for faces under 18 and over 60, due to the underrepresentation of these classes in the training set. At the same time, AttGAN fails to modify the input image in a noticeable manner, especially in the case of CACD, where the generated faces are very similar to the input (e.g., row 4 of Fig. 2b). Furthermore, whilst StarGAN produces prominent class-discriminative features in the synthetic images, it fails to retain photo-realism in the generated facial images. As a result, StarGAN's image translation results often contain artifacts. This is particularly pronounced in StarGAN's transfers to underrepresented demographic groups such as people over 60 years old (see Fig. 2b). Indicatively, only 0.6% of the MORPH and 0.9% of the CACD training sets are over 60 years old. On the other hand, our method is able to synthesize sharp facial images with prominent attribute patterns without sacrificing the photorealism. Additional results for the task of image translation are presented in Figs. 3a and 3b. The proposed framework is able to transfer distinct patterns for age, gender, and skin tone.
Figure caption: Multiple demographic attribute transfer on the test sets of MORPH and CACD using our method. Our method is particularly good at aging inputs to the tail ends of the distribution. The red boxes show the input image, in the position of its attribute labels
In particular, we highlight the transfer of global aging patterns such as wrinkles and hair, as well as features associated with gender, such as jawlines and facial hair. In Fig. 4, we present generated facial images for all attribute combinations on CACD.
The ability of the proposed model to generate images of all demographic subpopulations that exist in the training set, regardless of their representational support, is leveraged in Sect. 4.6 to mitigate the classification bias using data augmentation.
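The goal of a uniform joint label distribution can be made concrete with a small planning step. The sketch below is illustrative only and is not the authors' pipeline: the function name `augmentation_plan`, the toy labels, and the choice of matching every attribute combination to the largest observed group are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def augmentation_plan(labels, classes_per_attribute):
    """Count how many synthetic images each attribute combination needs
    so that all combinations reach the size of the largest observed one.

    labels: list of tuples, one tuple of attribute classes per image.
    classes_per_attribute: list of lists with the classes of each attribute.
    """
    counts = Counter(labels)  # missing combinations count as 0
    combos = list(product(*classes_per_attribute))
    target = max(counts[c] for c in combos)
    return {c: target - counts[c] for c in combos}


# Toy (age group, gender) labels; the numbers are made up for illustration.
labels = (
    [("<18", "F")] * 2
    + [("18-60", "F")] * 50
    + [("18-60", "M")] * 40
    + [(">60", "M")] * 1
)
plan = augmentation_plan(labels, [["<18", "18-60", ">60"], ["F", "M"]])

for combo, n_needed in sorted(plan.items()):
    print(combo, n_needed)
```

In this toy setting, combinations that are absent from the data (such as females over 60) require as many synthetic images as the largest group contains, which is why the ability to translate towards poorly supported classes matters.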

Intra-class Diversity
A useful property of the proposed method lies in its ability to synthesize multiple image variants per class by conditioning on different within-class attribute styles. That is, the network is trained to transfer specific attribute styles present in the target image, rather than collapsing to a class-specific pattern. This implies that the synthesized images adapt a given attribute style to the attribute style of the image we condition on, in effect being able to modulate (both attenuate and accentuate) the effect of each target attribute style, thus providing much more fine-grained control over the synthesis process. For instance, in Fig. 5 we demonstrate how an input image can have its attribute content accentuated to different degrees according to the choice of target face. To further demonstrate this point, in Fig. 6 we show that even faces belonging to the oldest age group (over 60 years old) can be translated to look even older by using a target image with more pronounced aging patterns. This property is particularly useful in the case of celebrity datasets (e.g., CACD, KANFace, CELEB-A, LFW) where the apparent age of a face can be significantly lower than its actual age.

Fig. 4 Translating images from the test set of CACD to all combinations of the dataset's gender and age labels using our method. The red square indicates the input image, and is positioned according to its label.

Diversity Enhancement
In this section, we benchmark the diversity-enhancing capabilities of our model with regard to the available demographic attributes. In particular, we translate a test set of 1000 faces to all attribute classes and calculate the diversity metrics of the augmented set, as proposed in Merler et al. (2019). We measure the Shannon H (ShH) and Simpson D (SiD) diversity indices and the Shannon E (ShE) and Simpson E (SiE) evenness indices. The metrics are calculated as follows:

$$\mathrm{ShH} = -\sum_{i=1}^{S} p_i \ln p_i, \qquad \mathrm{SiD} = \frac{1}{\sum_{i=1}^{S} p_i^2}, \qquad \mathrm{ShE} = \frac{\mathrm{ShH}}{\ln S}, \qquad \mathrm{SiE} = \frac{\mathrm{SiD}}{S},$$

where $S$ denotes the number of classes and $p_i$ is the probability of each class. Higher diversity indices indicate a more diverse dataset, while evenness indices closer to 1 indicate a label distribution that is closer to uniform. The age and gender labels for each image are obtained using the Face++ public API, while the skin tone labels are obtained using the Clarifai API. In the case of the age attribute, we adopt the standard protocol as described in the age progression literature (Yang et al. 2018; Wang et al. 2016; Yang et al. 2016), and translate only faces from the youngest age group (under 18 years old) to the rest of the age groups; that is, we perform face aging.
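As a rough illustration of how these four indices behave, the sketch below computes them from a list of predicted labels. It assumes the standard definitions of the Shannon and Simpson indices (natural logarithm, inverse Simpson form); the function name `diversity_indices` and the toy label lists are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def diversity_indices(labels, num_classes=None):
    """Shannon/Simpson diversity (ShH, SiD) and evenness (ShE, SiE).

    labels: iterable of class labels (e.g., predicted age groups).
    num_classes: total number of classes S; defaults to the number of
        distinct labels observed, which ignores empty classes.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    probs = [c / n for c in counts.values()]
    s = num_classes if num_classes is not None else len(counts)

    shannon_h = -sum(p * math.log(p) for p in probs)
    simpson_d = 1.0 / sum(p * p for p in probs)
    # Evenness normalizes diversity by its maximum (uniform) value.
    shannon_e = shannon_h / math.log(s) if s > 1 else 0.0
    simpson_e = simpson_d / s
    return {"ShH": shannon_h, "SiD": simpson_d, "ShE": shannon_e, "SiE": simpson_e}


# Toy examples: a perfectly balanced set and a heavily skewed one.
uniform = diversity_indices(["a", "b", "c", "d"] * 25)
skewed = diversity_indices(["a"] * 97 + ["b", "c", "d"])

print(uniform)
print(skewed)
```

For the balanced toy set both evenness indices equal 1, while the skewed set scores far lower on every index, which is the behaviour the evaluation relies on.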
We compare the diversity metrics of the augmented sets to those of the original test images (GT) and present the results in Tables 1, 2 and 3. Along with our method, both StarGAN and AttGAN are capable of synthesizing a dataset with a distribution of labels close to uniform in the case of the binary attributes skin tone and gender. For the more challenging age attribute however, the superiority of our method and modelling choice is evident, with our augmented datasets' age labels being much more evenly spread than those of the baselines, especially the age progression ones. This is particularly pronounced in the case of the biased MORPH test-set (Table 1), where the difference between the ground truth images and the synthetic ones is significant. Our model consistently outperforms the baselines for all attributes in both datasets.

Mitigating Classifier Bias
In this section, we investigate the effect that a lack of diversity has on model performance. To this end, we fine-tune and test a state-of-the-art gender recognition model (Rothe et al. 2018) on the MORPH and KANFace datasets. By measuring the True Positive Rate (TPR) for each demographic subpopulation, we are able to uncover the bias of the model with regard to age and skin tone. In particular, Figs. 7 and 8 show that the model trained on MORPH is biased against dark-skinned females (TPR of 40% for dark-skinned females over 60 years old), while the model trained on KANFace is biased against male faces under 18 (TPR of 64%). Age and skin tone bias has been studied for the task of face recognition in Nagpal et al. (2019), as well as in human perception in the psychology literature (Schaich et al. 2016; Bothwell et al. 1989).
We propose to mitigate the bias in both models by augmenting the training set using the proposed image translation method. In particular, we train the models on the diverse augmented sets and evaluate the fairness of the trained classifiers. The MORPH dataset is augmented using the models trained on MORPH in Sect. 4.4.1.
For the KANFace dataset we use the models trained on CACD.
Of the various fairness metrics discussed in Sect. 2.3, we opt to quantify fairness using Equality of Opportunity (EO) (Hardt et al. 2016), which is defined as the difference in TPRs between the subpopulations. In particular, we report the EO score between each (age, gender) class for KANFace and each (age, gender, skin tone) class for MORPH.
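A minimal sketch of this measurement is given below, assuming EO is taken as the absolute TPR gap between two subpopulations. The exact pairing of subpopulations used in the experiments is not restated here, so the helper names (`tpr_per_group`, `equality_of_opportunity`) and the toy predictions are illustrative assumptions.

```python
from collections import defaultdict


def tpr_per_group(y_true, y_pred, groups, positive=1):
    """True positive rate of the positive class within each subpopulation."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == positive:  # only ground-truth positives enter the TPR
            totals[g] += 1
            hits[g] += int(p == positive)
    return {g: hits[g] / totals[g] for g in totals}


def equality_of_opportunity(tprs, group_a, group_b):
    """Absolute TPR gap between two subpopulations (lower is fairer)."""
    return abs(tprs[group_a] - tprs[group_b])


# Toy data: label 1 is the positive class; groups are made-up age bins.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1]
groups = ["18-60", "18-60", "18-60", "18-60", "<18", "<18", "18-60", "<18"]

tprs = tpr_per_group(y_true, y_pred, groups)
eo = equality_of_opportunity(tprs, "18-60", "<18")
print(tprs, eo)
```

An EO of 0 would mean that the classifier recovers the positive class equally well in both groups; the toy data yields a gap of 0.25.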
We present the TPRs (↑) and EO (↓) on the ground-truth and synthetic test sets in Figs. 8 and 7 for MORPH and KANFace, respectively. Despite StarGAN's images having strong class-discriminative features as established in Sect. 4.4, we demonstrate here that training on its artifact-ridden images leads to even more biased classifiers. Similarly, whilst training on AttGAN-generated training sets can improve the TPR for dark-skinned women over 60 years old by 20% for MORPH, the dataset bias is exacerbated in KANFace when training on augmentations from this model (compared to our augmentations, which lead to almost half the EO for males under 18).
Overall, the results indicate that the proposed framework is the only method able to mitigate all of the demonstrated dataset biases by generating more diverse data that can be used to train fairer classifiers.

Table caption: We calculate the indices for the age attribute by age progressing all under-18s in the test set, following the standard practice in the age progression literature (Wang et al. 2018). For this reason, we do not report the age diversity of the ground-truth images. Our method strictly outperforms or is near-perfect for all metrics and attributes.

Comparison to Debiasing Methods
To further showcase the efficacy of the proposed framework in bias mitigation, we provide a thorough comparison with 7 state-of-the-art debiasing methods and present the results in Fig. 9. The chosen baselines use different approaches to fairness. One of them debiases pretrained network embeddings by decomposing them into task-specific and protected-attribute representations. Similarly, adversarial learning is used to remove the sensitive information from the learned representation in Alvi et al. (2018), Kim et al. (2019), and Zhang et al. (2018). An autoencoder that translates the input into fair images is proposed in Quadrianto et al. (2019). Lastly, Wang et al. (2020) use both a domain-independent (IND) and a domain-discriminative (DISC) approach to fair classification.
Most of these approaches are proposed for binary protected attributes; our method is more general, as it can handle protected attributes with multiple classes (e.g., age group). The results in Fig. 9a highlight the ability of the proposed method to mitigate the age bias of the gender classifier, especially in the case of male faces under 18 years old (age group a0) in the KANFace dataset. In particular, our framework achieves significantly lower EO (0.19-0.22 EO for age group 0), with the second-best baseline reaching 0.2-0.26 EO for age group 0.
We conduct a similar experiment on MORPH, where the sensitive attribute is binary (i.e., skin tone). The results in Fig. 9b show that our method can produce competitive results. However, our method does not achieve the lowest EO, which can be attributed to the fact that the other baselines were designed to tackle binary protected attributes. In particular, Wang et al. (2020) IND and Quadrianto et al. (2019) achieve the fairest classification results, with our method being a close second.

Limitations of the Framework
Due to the generative nature of the proposed framework, our method suffers from its dependence on external data. We investigate this inherent shortcoming of our method by conducting two experiments, which are presented below.
Training on limited data. Firstly, we investigate the impact of the size of the training set of the GAN on the bias of the trained classifier. Concretely, we show in Fig. 10 the results of training a classifier with our synthetic data, when our GAN is trained on training subsets of different sizes.
For each model, we report the equality of opportunity for the most underrepresented class, i.e., male faces under 18 on KANFace. We notice that when our model is trained on less than 25% of the training images, the bias of the trained classifier is in fact magnified. It should be noted that this problem is not specific to our method, but one that will affect all generative model-based bias mitigation methods that rely on augmentation.

Augmentation with real data. Since the use of the proposed framework assumes the existence of an external training set, we explore the option of augmenting the training set of the biased classifier using real images from the external set. In particular, we use the CACD dataset as the external set, and try to flatten the distribution of the training set of the classifier (i.e., KANFace). However, the support for each class in the external set is not sufficient to do so in the extreme cases of people under 18 and over 60 years old. The resulting training distributions along with equality of opportunity scores are presented in Fig. 11. The results highlight the advantage of using the proposed data augmentation method, instead of directly training on the external data.
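The limitation of real-data augmentation can be illustrated with a small calculation: a class can only be topped up by as many images as the external set actually contains. The sketch below is a hypothetical illustration; the function name `flatten_with_external` and all counts are invented for the example and do not correspond to the actual KANFace or CACD statistics.

```python
def flatten_with_external(train_counts, external_counts):
    """Top up each class towards the size of the largest training class,
    limited by how many real images the external set offers per class."""
    target = max(train_counts.values())
    augmented = {}
    for cls, n in train_counts.items():
        needed = target - n
        available = external_counts.get(cls, 0)
        augmented[cls] = n + min(needed, available)
    return augmented


# Made-up per-class image counts for the classifier's training set and
# for the external set used as a source of additional real images.
train_counts = {"<18": 20, "18-60": 1000, ">60": 50}
external_counts = {"<18": 30, "18-60": 5000, ">60": 100}

augmented = flatten_with_external(train_counts, external_counts)
print(augmented)
```

With these toy numbers the tail classes remain far below the target even after all available external images are used, mirroring the insufficient support described above.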
Learning the real distribution of face images is a task of high data and computational complexity (Arora and Zhang 2017). Therefore, generative models have to rely on their inductive bias in order to generalize to unobserved modes of variation. However, any finite dataset is inherently biased, and training a generative model on a biased dataset can even exacerbate this bias. Indicatively, GANs and VAEs are not able to generate images of unobserved attribute combinations (Zhao et al. 2018). Different approaches for debiased generation have been proposed in the literature and include importance reweighting (Grover et al. 2019a, b) and modeling the multiplicative interactions. Therefore, the proposed approach, like any generative model-based approach to data augmentation, should be used with caution regarding the distribution of the training set.

Conclusions
In this paper, we proposed a style-based neural data augmentation framework that can be used to enhance the demographic diversity of a given dataset. To this end, we introduced a novel style transfer method that is able to simultaneously transfer multiple facial demographic attributes. Contrary to recent work in multi-attribute face translation, the proposed framework leverages attribute-specific demographic style to facilitate image translation, rather than class labels. In order to mix the style information of multiple attributes, we further introduced a multilinear extension to AdaIN that fuses the different styles by modeling their multiplicative interactions.

Our style transfer framework is evaluated against baseline attribute transfer models (StarGAN and AttGAN), as well as GAN-based age progression methods (CAAE and IPCGAN) in a series of qualitative and quantitative experiments. In particular, we demonstrate that the proposed model is able to realistically transfer age, gender, and skin tone on the CACD and MORPH datasets, even for underrepresented classes (such as images of people over 60 or under 18 years old). We further evaluate the diversity-enhancing capabilities of the models by measuring the diversity [as proposed in Merler et al. (2019)] of the augmented test sets. By augmenting the test set using our method, we are able to achieve the most evenly spread distribution of age, gender, and skin tone predictions. Lastly, we quantify the effect of biased datasets on the fairness of face analysis models. Focusing on gender recognition, we showcase how our framework can be used to mitigate existing age and skin tone bias in a state-of-the-art model (Rothe et al. 2018) on the MORPH and KANFace datasets. By measuring equality of opportunity, we show that the model is fairer when trained on the augmentations produced by our model. At the same time, the results in Figs. 7 and 8 indicate that training on non-diverse or non-photorealistic synthetic images can even deteriorate the fairness of a pretrained classifier.

Fig. 11 Equality of Opportunity scores (two left-most columns) of the classifiers trained using the ground-truth data (first row), the data augmented with real images from a different dataset (middle row), and with our synthetic images (bottom row). The right-most column shows the support of the datasets used to train the classifiers for each row.
In future work, we aim to extend our style-transfer framework to handle heavily biased datasets (e.g., missing classes). Furthermore, we plan to investigate different models that are less dependent on large annotated training sets, e.g., non-adversarial frameworks. Another direction is to extend the method to handle an arbitrary number of attributes, without the restrictions imposed by the size of the tensors. This can be achieved by using coupled tensor decompositions that utilize parameter sharing. Lastly, we plan to extend our method to better account for attributes for which a discrete categorization is less applicable. In particular, we plan to address bias beyond the traditional binary gender paradigm to better reflect the full spectrum of gender identity.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.