Disentangling Geometry and Appearance with Regularised Geometry-Aware Generative Adversarial Networks
Abstract
Deep generative models have significantly advanced image generation, enabling the generation of visually pleasing images with realistic texture. Apart from the texture, it is the shape geometry of objects that strongly dictates their appearance. However, currently available generative models do not incorporate geometric information into the image generation process. This often yields visual objects of degraded quality. In this work, we propose a regularised Geometry-Aware Generative Adversarial Network (GAGAN) which disentangles appearance and shape in the latent space. This regularised GAGAN enables the generation of images with both realistic texture and shape. Specifically, we condition the generator on a statistical shape prior. The prior is enforced through mapping the generated images onto a canonical coordinate frame using a differentiable geometric transformation. In addition to incorporating geometric information, this constrains the search space and increases the model’s robustness. We show that our approach is versatile, able to generalise across domains (faces, sketches, hands and cats) and sample sizes (from as little as \(\sim 200{-}30{,}000\) samples to more than \(200{,}000\)). We demonstrate superior performance through extensive quantitative and qualitative experiments in a variety of tasks and settings. Finally, we leverage our model to automatically and accurately detect errors or drifting in facial landmarks detection and tracking in-the-wild.
Keywords
Generative adversarial network · Image generation · Active shape model · Disentanglement · Representation learning · Face analysis · Deep learning · Generative models · GAN
1 Introduction
Despite their merit, GANs and their variants (Radford et al. 2015; Odena et al. 2016; Mirza and Osindero 2014) cannot adequately model sets of images with large visual variability in a fine-grained manner. Consequently, the quality of the generated images is severely affected in terms of shape and appearance. Specific to faces, visual texture (e.g., skin texture of faces, lighting) as well as pose and deformations (e.g., facial expressions, view angle) affect the appearance of a visual object. The interactions of these texture and geometric factors emulate the entangled variability, giving rise to the rich structure of visual object appearance. The vast majority of deep generative models, including GANs, do not allow geometric information to be incorporated into the image generation process without explicit labels. As a result, the shape of the generated visual object cannot be controlled explicitly and the visual quality of the produced images degenerates significantly, as depicted for instance in Fig. 1. In particular, while GAN-based models (Radford et al. 2015; Arjovsky et al. 2017; Goodfellow et al. 2014) (cf. Sect. 2.1 for a brief overview) generate realistic visual texture, e.g., facial texture in this example, geometry is not precisely followed.

In summary, the contributions of this work are the following.

We propose a novel method to address the issue of dataset shift, specifically label shift and covariate shift (Quionero-Candela et al. 2009).

We extend the model to automatically detect poor tracking and landmarks detection results.

We extend GAGAN to generate entire images, including the actual visual object and the background.

We demonstrate the versatility of our model in a variety of settings and demonstrate superior performance across domains, sample sizes, image sizes, and GAN network architectures.

We demonstrate the power of our model in terms of representation and generalisation by performing cross-database experiments.

By encoding prior knowledge and forcing the generated images to follow a specified statistical shape prior, GAGAN generates morphologically credible images.

By leveraging domain-specific information such as symmetries and local geometric invariances, GAGAN is able to disentangle the shape from the appearance of the objects.

By employing a flexible differentiable transformation, GAGAN can be seen as a meta-algorithm and used to augment any existing GAN architecture.

By constraining the search space using a shape model built in a strongly supervised way, GAGAN works well on very small datasets, unlike existing approaches.
2 Background and Related Work
In this section, we review related work and background in image generation with generative models in Sect. 2.1 and statistical models of shape and their use in Sect. 2.2.
2.1 Generative Models for Image Generation
Current methods for realistic image generation mainly rely on three types of deep generative models, namely Variational Autoencoders (VAEs), autoregressive models, and Generative Adversarial Networks (GANs). Albeit different, the above-mentioned deep generative models share a common setup. Let \(\mathbf {x}_1, \mathbf {x}_2,\)\( \ldots , \mathbf {x}_N\) denote a set of N real images drawn from a true data distribution \(P_{data}(\mathbf {x})\). Deep generative models, implicitly or explicitly, estimate a distribution \(P_G(\mathbf {x}, \theta )\) by learning a nonlinear mapping \(G(\mathbf {z})\) parametrised with \(\theta \) and \(\mathbf {z}\sim \mathcal {N}(\mathbf 0 ,{\mathbf {I}})\). The generated samples are compared to the true data distribution through a probability distance metric, e.g., the Kullback–Leibler divergence or the Jensen–Shannon divergence. New images are then generated by sampling from \(P_G\).
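To make this shared setup concrete, the following minimal sketch (in PyTorch; the class and variable names are illustrative and not taken from any of the cited implementations) shows a latent vector \(\mathbf {z}\sim \mathcal {N}(\mathbf 0 ,{\mathbf {I}})\) being mapped through a learned generator to produce samples from the implicit distribution \(P_G\).

```python
# Minimal sketch of the shared deep generative model setup (GAN flavour),
# assuming PyTorch; all names are illustrative, not from the cited works.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, img_dim=64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Tanh(),  # x = G(z; theta)
        )

    def forward(self, z):
        return self.net(z)

G = Generator()
z = torch.randn(16, 100)   # z ~ N(0, I)
x_fake = G(z)              # samples from the implicit distribution P_G
# In a GAN, a discriminator compares x_fake against real samples, which
# implicitly minimises a Jensen-Shannon-type divergence between P_G and P_data.
```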
Variational Autoencoders VAEs approximate the probability distribution of the training data with a known distribution. Inference is performed by finding the parameters of the model that maximise a lower bound on the log-likelihood of the marginal distribution (Kingma and Welling 2014; Reed et al. 2016). Typically, VAEs jointly train a top-down decoder with a bottom-up encoder for inference. For images, VAE decoders model the output pixels as conditionally independent given the latent variables. This makes them straightforward to train, but results in a restrictive approximate posterior distribution (Rezende and Mohamed 2015; Kingma et al. 2016). In particular, they do not model any spatial structure in the images and fail to capture small-scale features such as texture and sharp edges, which significantly hurts both log-likelihood and quality of generated samples compared to other models (Larsen et al. 2016). Invertible density estimators were introduced by Rezende and Mohamed (2015), Kingma et al. (2016) and Dinh et al. (2017) to transform latent variables, which allows for exact log-likelihood computation and exact inference. However, the invertibility constraint is restrictive as the actual calculation of the inverse needs to be done in a computationally efficient manner.
Autoregressive Models Unlike VAEs, autoregressive models, most notably PixelCNN (van den Oord et al. 2016) and PixelCNN++ (Salimans et al. 2017), directly model the conditional probability distribution over pixels. These models are capable of capturing fine details in images and thus generate outstanding samples, but at the cost of slow sampling speed. As opposed to conventional convolutional architectures, autoregressive models do not apply downsampling between layers, and in order to capture dependencies between distant pixels, the depth of a PixelCNN grows linearly with the size of the images. PixelCNNs also do not explicitly learn a latent representation of the data, and therefore do not allow control over the image generation.
2.2 Statistical Shape Models
Statistical shape models were first introduced by Cootes et al. (1995). By exploiting a statistical model of shape, these models are able to accurately represent an object’s deformations based on training data. Improved statistical shape models include Active Appearance Models (AAMs), where both the shape and the texture are modelled (Edwards et al. 1998; Cootes et al. 2001). In AAMs, a statistical model of shape is built first and the texture is then described by employing a linear model of appearance in a shape variation-free canonical coordinate frame. Fitting the AAM to a new instance is then done by deforming the target image (forward framework) or the template (inverse framework) (Matthews and Baker 2004), or both simultaneously (bidirectional framework) (Kossaifi et al. 2015). The resulting problem can be solved analytically and effectively using Gauss–Newton optimisation (Tzimiropoulos and Pantic 2016) or second-order methods based on Newton optimisation (Kossaifi et al. 2014). However, using pixel intensities for building the appearance model does not yield satisfactory results due to their variability in the presence of illumination, pose and occlusion variations. To remedy this issue, several robust image descriptors (or features) have been proposed, including Histograms of Oriented Gradients (HOG) (Dalal and Triggs 2005), the Image Gradient Orientation kernel (IGO) (Tzimiropoulos et al. 2012), Local Binary Patterns (LBP) (Ojala et al. 2002) and SIFT features (Lowe 2004). The latter are considered the most robust for fitting AAMs (Antonakos et al. 2015). Using these features, AAMs have been shown to give state-of-the-art results in facial landmark localisation, even for in-the-wild data (Tzimiropoulos and Pantic 2016, 2014a; Antonakos et al. 2015; Kossaifi et al. 2017; Tzimiropoulos and Pantic 2017).
AAMs naturally belong to the class of generative models. As such, they are more interpretable and typically require less data than their discriminative counterparts, such as deep learning-based approaches (Kossaifi et al. 2017; Tzimiropoulos and Pantic 2017; Sagonas et al. 2013a). Lately, thanks to the democratisation of large corpora of annotated data, deep methods tend to outperform traditional approaches, including AAMs, in areas such as facial landmark localisation, and allow learning features end-to-end rather than relying on handcrafted ones. However, the statistical shape model employed by AAMs has several advantages. In particular, by constraining the search space, it allows methods that can be trained on smaller datasets. Thanks to their generative nature, AAMs can also be used to sample new instances, unseen during training, that respect the morphology of the training shapes.
In this work, we depart from the existing approaches and propose a new method, detailed in the next section, that retains the advantages of a GAN while constraining its output on statistical shape models, built in a strongly supervised way, akin to that of Active Shape Models (Cootes et al. 1995) and AAMs. To this end, we impose a shape prior on the output of the generator, hence explicitly controlling the shape of the generated object.
3 GeometryAware GAN
In GAGAN, we disentangle the input random noise vector \(\mathbf {z}\) to enforce a geometric prior and learn a meaningful latent representation. We do this by separating the shape \(\mathbf {p}\in \mathbb {R}^{N \times n}\) of objects from their appearance \(\mathbf {c}\in \mathbb {R}^{N \times k}\). Their concatenation \(\mathbf {z} = [\mathbf {p}, \mathbf {c}]\) is used as input to the model.
We first model the geometry of N images \(\mathbf {X} = \{\mathbf {X}^{(1)}, \ldots , \mathbf {X}^{(N)}\}\in \mathbb {R}^{N \times h \times w}\) using a collection of fiducial points \(\mathbf {s} = \{\mathbf {s}^{(1)}, \ldots , \mathbf {s}^{(N)}\} \in \mathbb {N}^{N \times m \times 2}\), where h and w represent the height and width of a given image and m denotes the number of fiducial points. The set of all fiducial points of a sample composes its shape. Using a statistical shape model, we can compactly represent any shape \(\mathbf {s}\) as a set of normally distributed variables \(\mathbf {p}\) (cf. Sect. 3.1). We enforce the geometry by conditioning the output of the generator. The discriminator, instead of being fed the output of the generator, sees the images mapped onto the canonical coordinate frame by a differentiable geometric transformation (motion model, explained in Sect. 3.2). By assuming a factorised distribution for the latent variables, we propose GAGAN, a conditional GAN to disentangle the latent space (cf. Sect. 3.4). We further extend GAGAN by perturbation-motivated data augmentation (cf. Sect. 3.5) and \(\alpha , \beta \) regularisation (cf. Sect. 3.6).
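As a rough illustration of this pipeline, the sketch below shows one GAGAN forward pass; `generator`, `discriminator`, `shape_model` and `piecewise_affine_warp` are placeholder components (our names, not the authors' code), and the exact losses and architectures are omitted.

```python
# Illustrative sketch of one GAGAN forward pass (placeholder components).
import torch

def gagan_forward(generator, discriminator, piecewise_affine_warp,
                  shape_model, real_images, real_shapes, n_appearance=50):
    batch = real_images.size(0)
    # Shape prior: normalised shape parameters p ~ N(0, I) (Sect. 3.1).
    p = torch.randn(batch, shape_model.n_components)
    # Appearance code c ~ N(0, I); the input is the concatenation z = [p, c].
    c = torch.randn(batch, n_appearance)
    z = torch.cat([p, c], dim=1)
    fake_images = generator(z)

    # The discriminator never sees raw images: real and generated images are
    # first mapped onto the canonical coordinate frame by the differentiable
    # warp, using the shape that conditions each image (Sect. 3.2).
    fake_shapes = shape_model.reconstruct(p)   # shapes implied by the prior p
    fake_canonical = piecewise_affine_warp(fake_images, fake_shapes)
    real_canonical = piecewise_affine_warp(real_images, real_shapes)
    return discriminator(real_canonical), discriminator(fake_canonical)
```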
3.1 Building the Shape Model
We can interpret our model from a probabilistic standpoint, where the shape parameters \(\mathbf {p}_1, \ldots , \mathbf {p}_n\) are independent Gaussian variables with zero mean and variances \(\lambda _1, \ldots , \lambda _n\) (Davies et al. 2008). By using the normalised shape parameters \(\frac{\mathbf {p}_1}{\sqrt{\lambda _1}}, \ldots , \frac{\mathbf {p}_n}{\sqrt{\lambda _n}}\), we enforce them to be independent and normally distributed, suitable as input to our generator. This also gives us a criterion to assess how realistic a shape is: the sum of its squared normalised parameters, \( \sum _{k=1}^n \frac{\mathbf {p}_k^2}{\lambda _k} \sim \chi ^2_n\), follows a Chi-squared distribution with n degrees of freedom (Davies et al. 2008).
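A minimal sketch of such a shape model is given below (NumPy; the `ShapeModel` class is our illustration, assuming Procrustes-aligned training shapes). It builds the PCA basis, normalises the shape parameters by \(\sqrt{\lambda _k}\), and scores the plausibility of a shape through the Chi-squared criterion.

```python
# Sketch of a PCA-based statistical shape model and the chi-squared criterion;
# illustrative only, assuming the training shapes are already aligned.
import numpy as np

class ShapeModel:
    def __init__(self, shapes, n_components):
        # shapes: (N, m, 2) array of landmark coordinates.
        data = shapes.reshape(len(shapes), -1)
        self.mean = data.mean(axis=0)
        centred = data - self.mean
        cov = centred.T @ centred / (len(shapes) - 1)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1][:n_components]
        self.eigvals = eigvals[order]        # lambda_1, ..., lambda_n
        self.basis = eigvecs[:, order]       # shape eigenvectors
        self.n_components = n_components

    def normalised_params(self, shape):
        p = self.basis.T @ (shape.ravel() - self.mean)
        return p / np.sqrt(self.eigvals)     # p_k / sqrt(lambda_k) ~ N(0, 1)

    def realism(self, shape):
        # Sum of squared normalised parameters; approximately chi-squared
        # with n degrees of freedom for shapes plausible under the model.
        return np.sum(self.normalised_params(shape) ** 2)

    def reconstruct(self, p_normalised):
        p = p_normalised * np.sqrt(self.eigvals)
        return (self.mean + self.basis @ p).reshape(-1, 2)
```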
3.2 Enforcing the Geometric Prior
In this work, we use a piecewise affine warping as the motion model. The piecewise affine warping maps the pixels of a source shape onto a target shape; here, the target is the canonical shape. This is done by first triangulating both shapes, typically via Delaunay triangulation. An affine transformation is then used to map the points inside each simplex of the source shape to the corresponding triangle in the target shape, using their barycentric coordinates with respect to the vertices of that simplex. The pixel value at each mapped location is obtained by nearest-neighbour sampling or interpolation. This process is illustrated in Fig. 4.
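The sketch below illustrates such a warp with NumPy and SciPy (nearest-neighbour sampling only; a simplified illustration, not the implementation used in the paper).

```python
# Simplified piecewise affine warp onto the canonical shape using
# barycentric coordinates; illustrative, nearest-neighbour sampling only.
import numpy as np
from scipy.spatial import Delaunay

def piecewise_affine_warp(image, source_shape, canonical_shape):
    """Map pixels of `image` (H, W[, C]) from `source_shape` onto
    `canonical_shape`; both shapes are (m, 2) landmark arrays."""
    tri = Delaunay(canonical_shape)           # triangulate the target frame
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)

    simplex = tri.find_simplex(pts)           # triangle containing each pixel
    valid = simplex >= 0
    warped = np.zeros_like(image)

    # Barycentric coordinates of each pixel w.r.t. its canonical triangle.
    trans = tri.transform[simplex[valid]]
    bary2 = np.einsum('ijk,ik->ij', trans[:, :2], pts[valid] - trans[:, 2])
    bary = np.column_stack([bary2, 1 - bary2.sum(axis=1)])

    # The same barycentric weights applied to the source triangle vertices
    # give the corresponding location in the input image (affine per triangle).
    src_tris = source_shape[tri.simplices[simplex[valid]]]   # (n, 3, 2)
    src_pts = np.einsum('ij,ijk->ik', bary, src_tris)

    # Nearest-neighbour sampling of the source image.
    sx = np.clip(np.round(src_pts[:, 0]).astype(int), 0, w - 1)
    sy = np.clip(np.round(src_pts[:, 1]).astype(int), 0, h - 1)
    warped.reshape(h * w, -1)[valid] = image.reshape(h * w, -1)[sy * w + sx]
    return warped
```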
3.3 Local Appearance Preservation
The statistical shape model provides rich information about the images being generated. In particular, it is desirable for the appearance of a face to be dependent on the set of fiducial points that compose it, i.e., a baby’s face has a different shape and appearance from that of a woman or a man. However, we also know that certain transformations should preserve appearance and identity. For instance, differences in head pose should ideally not affect appearance.
Rather than directly feeding the training shapes, we create several appearance-preserving variations of each shape, feed them to the generator, and ensure that the resulting samples have similar appearance. Consequently, for each sample we generate several variants by mirroring it, projecting it into the normalised shape space, adding normally distributed random noise, and then using these perturbed shapes as input. As the outputs will have different shapes and thus should look different, we cannot directly compare them. However, the geometric transformation projects them onto a canonical coordinate frame where they can be compared, allowing us to add a loss that accounts for local appearance preservation.
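A possible form of this loss is sketched below (PyTorch-style pseudocode with placeholder components; the noise scale and the exact set of variants, e.g. whether mirroring is included, are illustrative assumptions).

```python
# Sketch of a local appearance-preservation loss: shape variants with the
# same appearance code are generated, warped onto the canonical frame and
# encouraged to match. All component names are placeholders.
import torch

def appearance_preservation_loss(generator, warp_to_canonical, shape_model,
                                 p, c, noise_scale=0.1):
    # Two appearance-preserving variants of the same normalised shape p
    # (mirroring could be added as a further variant).
    variants = [p, p + noise_scale * torch.randn_like(p)]
    canonical_outputs = []
    for p_v in variants:
        z = torch.cat([p_v, c], dim=1)          # same appearance code c
        img = generator(z)
        shape = shape_model.reconstruct(p_v)
        canonical_outputs.append(warp_to_canonical(img, shape))
    # Outputs have different shapes but, once warped onto the canonical
    # coordinate frame, their appearance should agree.
    return torch.mean(torch.abs(canonical_outputs[0] - canonical_outputs[1]))
```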
3.4 GAGAN
3.5 Data Augmentation with Perturbations
In order to provide more variety in shapes and avoid the generator learning to produce faces only for the shape priors it has seen, we augment the set of training shapes by adding a large number of small random perturbations to them. These are sampled from a Gaussian distribution in the normalised shape space and projected back onto the original space, therefore enforcing their correctness according to the statistical shape model.
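Using the `ShapeModel` sketched in Sect. 3.1, this augmentation could look as follows (illustrative NumPy sketch; the number of variants and the noise scale are assumptions, not the values used in the paper).

```python
# Sketch of the shape augmentation: perturb training shapes with Gaussian
# noise in the normalised shape space and project back, so the perturbed
# shapes remain valid under the statistical shape model.
import numpy as np

def augment_shapes(shape_model, shapes, n_variants=10, sigma=0.1):
    augmented = []
    for s in shapes:
        p = shape_model.normalised_params(s)            # to normalised space
        for _ in range(n_variants):
            p_perturbed = p + sigma * np.random.randn(*p.shape)
            augmented.append(shape_model.reconstruct(p_perturbed))
        augmented.append(s)                             # keep the original
    return np.stack(augmented)
```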
3.6 \((\alpha ,\beta )\) Regularisation
We aim at using the representational power of the discriminator to precisely evaluate the quality of facial landmark estimation. This requires training on certain datasets and their underlying probability distribution, and testing/evaluating on different distributions. Due to the in-the-wild nature of the images, this can lead to covariate shift.
In addition, the annotations for the various datasets were obtained differently, sometimes with large variations. For instance, most of the data used for our small set of human faces was annotated in a semi-automatic way, while for CelebA, we used a state-of-the-art facial landmarks detector. This difference in labelling leads to label shift, which needs to be tackled during training.
4 Experimental Evaluation
In this section, we investigate the performance of GAGAN. In particular, we have four primary goals for our evaluation. The first goal is to show the generality of our model in terms of image domains, image sizes and GAN architectures. In Sect. 4.2, we show that regularised GAGAN can be applied to different domains, not just faces, as well as different image sizes and different GAN architectures. We also discuss the limitations of GAGAN. Following this, we compare regularised GAGAN against GAGAN in Sect. 4.5, specifically addressing image quality and the ability to detect badly aligned image-shape pairs. The qualitative and quantitative assessment of regularised GAGAN against state-of-the-art conditional GAN (CGAN) models is presented in Sects. 4.3 and 4.4. This includes investigating the ability to disentangle the latent space and thus generate images with control over shape and appearance. Furthermore, we quantitatively assess how precisely the generator can generate images with given shapes and how accurately the discriminator can detect when a given shape and image are not aligned. In particular, we verify that, given an image and a corresponding shape, the discriminator accurately assesses how well the two correspond. In all experiments, we compare our model with existing conditional GAN variations and with GAGAN without regularisation. Finally, we show the influence of the regularisation in an extensive ablation study in Sect. 4.6.
4.1 Experimental Setting
Cats Dataset For the generation of faces of cats, we used the dataset introduced in Sagonas et al. (2015) and Sagonas et al. (2016). In particular, we used \(348\) images of cats, for which \(48\) facial landmarks were manually annotated (Sagonas et al. 2015), including the ears and boundaries of the face. We first build the statistical shape space as we did previously for human faces. The resulting canonical shape is represented in Fig. 5.
Hand Gestures Dataset We used the Hand Gesture Recognition (HGR) dataset (Grzejszczak et al. 2016; Nalepa and Kawulok 2014; Kawulok et al. 2014), which contains gestures from Polish Sign Language (‘P’ in the gesture’s ID) and American Sign Language (‘A’). We only used the subsample of HGR that has all 25 hand feature point locations, as some annotations only include the feature points that are visible. This results in a training set of 276 samples.
Sketch Dataset Finally, to demonstrate the versatility of the method, we apply GAGAN to the Face Sketch in the Wild dataset (FSW) (Yang et al. 2014), which contains 450 greyscale sketches of faces. Similarly to the face databases described above, the sketches are annotated with 68 facial landmarks.
Baselines and State-of-the-Art Comparisons For comparison, we used the Conditional GAN (CGAN) (Mirza and Osindero 2014), modified to generate images conditioned on the shape or shape parameters. ShapeCGAN is a CGAN conditioned on shapes by channel-wise concatenation and PCGAN is a CGAN conditioned on the shape parameters by channel-wise concatenation. To be able to compare with our model, we also ran experiments on HeatmapCGAN, a CGAN conditioned on shapes by heatmap concatenation. First, a heatmap taking the value \(1\) at the expected positions of the landmarks and \(0\) everywhere else is created. This is then used as an additional channel and concatenated to the image passed on to the discriminator. For the generator, the shapes are flattened and concatenated to the latent vector \(\mathbf {z}\). All models use the architecture of DCGAN (Radford et al. 2015).
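For illustration, the heatmap conditioning used by the HeatmapCGAN baseline can be constructed as below (a minimal sketch; the function name is ours).

```python
# Sketch of the heatmap conditioning for the HeatmapCGAN baseline:
# a single-channel map that is 1 at landmark locations and 0 elsewhere,
# concatenated to the image fed to the discriminator.
import numpy as np

def shape_to_heatmap(shape, height, width):
    """shape: (m, 2) landmark coordinates in pixel units."""
    heatmap = np.zeros((height, width), dtype=np.float32)
    for x, y in np.round(shape).astype(int):
        if 0 <= x < width and 0 <= y < height:
            heatmap[y, x] = 1.0
    return heatmap

# Usage: discriminator_input = np.concatenate([image, heatmap[..., None]], axis=-1)
```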
4.2 Generality of GAGAN
This subsection presents results showing the versatility and generality of GAGAN. As such, we show the different domains GAGAN can be applied to, as well as GAGAN used for different image sizes (\(128\times 128\) and \(256\times 256\)) and different architectures [improved Wasserstein GAN (Gulrajani et al. 2017)]. Further, we extend GAGAN to generate the entire image and discuss the limitations of GAGAN.
Different Domains Figure 7 shows the different image domains that GAGAN was applied to. GAGAN is a general model able to generate any structured objects, including human faces, but also cat faces, sketches and hands. Generally, it is only restricted to objects that have an underlying geometry that can be leveraged by the model. The first row of Fig. 7 shows representative samples of faces generated from CelebA (Liu et al. 2015). We also trained GAGAN to successfully generate face sketches from 450 sketches annotated with 68 landmark points (sixth row). Further, GAGAN was trained on cat images annotated with 48 facial landmarks (seventh row) and on a subset of HGR which includes 25 hand feature point annotations (last row). More samples generated for GAGAN-small and CelebA can be found at the end of the paper (cf. Figs. 21, 22 and 23).
Different Sizes and Architectures GAGAN leverages a statistical shape model, as well as a differentiable piecewise affine transformation, to learn a geometry-aware generative adversarial model. These concepts are not limited to a specific image size or GAN architecture. We extended the DCGAN architecture to generate images of size \(128 \times 128\) and \(256\times 256\), and samples from our best model are shown in Fig. 7 (second and third row). Further, we transferred the concept of GAGAN to the improved Wasserstein GAN (Gulrajani et al. 2017). We call this model Geometry-Aware WGAN-GP (GAWGAN-GP) and samples from the model are also depicted in Fig. 7 (third row).
Limitations Regularised GAGAN has three dependencies: (1) shape annotations, (2) the statistical shape model and (3) the piecewise affine transformation. The performance of both the statistical shape model and the piecewise affine transformation depends on the shape annotations. The generated hands (cf. Fig. 7, last row) suffer in quality because both the statistical shape model and the piecewise affine transformation require outer shape annotations, whereas the annotations of HGR (Grzejszczak et al. 2016; Nalepa and Kawulok 2014; Kawulok et al. 2014) only provide inner shape annotations. This also explains the observed thinness of the generated fingers. The second limitation is the piecewise affine transformation, which performs best when all shape meshes are visible. Therefore, side faces are difficult to handle. One way to address this issue is to use a part-based model (Tzimiropoulos and Pantic 2014b) based on a more flexible, patch-based transformation. However, it is worth noting that our method does not suffer as much as AAMs from this limitation since the generator creates images in their original coordinate frame. Only the discriminator sees the warped image. As such, the discriminator also learns which deformation corresponds to which shape parameters. This is why the images generated by GAGAN do not display artefacts resulting from deformation by the piecewise affine warping.
4.3 Qualitative Results
This subsection presents a qualitative evaluation of our proposed regularised GAGAN. Unless mentioned otherwise, regularised GAGAN was trained with \(\alpha =0.01\), \(\beta =0.5\), \(\lambda =1.0\).
Disentangled Representation In our experiments, we visualise the disentanglement by interpolating only one continuous latent variable \(\mathbf {p}_j^{(i)}\) in the range \([-2.5, 2.5]\) while keeping all other \(\mathbf {p}_k^{(i)}, k \ne j\) and \(\mathbf {c}^{(i)}\) fixed. Figure 10 shows that the continuous latent variables \(\mathbf {p}\) encode visual concepts such as pose, morphology and facial expression while the appearance remains constant, indicating successful disentanglement. Figure 7 shows some representative samples drawn from \(\mathbf {z}\) at a resolution of 64 \(\times \) 64. Only one latent variable \(\mathbf {p}_j^{(i)}\) was changed at a time for each row in Fig. 10. We observe realistic and shape-following images for a wide range of facial expressions (rows 1–2), poses (rows 3–4) and morphologies (rows 5–6). We show the entire range of \([-2.5, 2.5]\) as this was the range that GAGAN was trained on. Extreme facial expressions shown in rows 1–2 (first 3 samples each) are hard for GAGAN to generate because they are less realistic and do not occur naturally, with lips too thin to generate.
Similarly, we randomly sampled only \(\mathbf {c}^{(i)} \sim \mathcal {N}(\mathbf 0 ,{\mathbf {I}})\) and kept \(\mathbf {p}^{(i)}\) fixed. The results are shown in Fig. 11. In each row we sampled different \(\mathbf {c}^{(i)} \sim \mathcal {N}(\mathbf 0 ,{\mathbf {I}})\). As depicted in Fig. 11, by sampling only \(\mathbf {c}^{(i)}\), the shape remains constant in every row while the identity varies from image to image. The proportion of men and women sampled seems to be balanced, though we observed fewer older people. Interestingly, the model was able to generate accessories such as glasses during sampling.
Figure 12 shows examples of cats and hands generated by varying the shape parameter \(\mathbf {p}^{(i)}\) while keeping \(\mathbf {c}^{(i)}\) constant.
4.4 Quantitative Results
This section discusses quantitative results; in particular, we focus on the discriminative ability of GAGAN to verify landmark detections.
Generation of Aligned Faces The facial landmark detector introduced in Bulat and Tzimiropoulos (2017) detects fiducial points with an accuracy that is, in most cases, higher than that of human annotators. Since our model takes as input a shape prior and outputs an image that respects that prior, we can assess how well the prior is followed by running that detector on the produced images and measuring the distance between the shape prior and the actual detected shape. We directly run the method on \(10{,}000\) images generated by the generator of our GAGAN. The error is measured in terms of the established normalised point-to-point error (pt–pt error), as introduced in Zhu and Ramanan (2012) and defined as the RMS error normalised by the face size. Following Tzimiropoulos and Pantic (2016, 2017) and Kossaifi et al. (2014, 2015, 2017), we produced the cumulative error distribution curve in Fig. 13, depicting, for each value on the x-axis, the percentage of images for which the point-to-point error was lower than this value. For comparison, we ran the facial landmark detector on our GAGAN-small set and computed the error using the ground truth provided with the data. As can be observed, most of the images are well fitted for the model trained on our GAGAN-small set. When trained on CelebA, our model generates faces according to the given prior with similar accuracy as the landmark detector obtains on our training set.
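For reference, a sketch of the normalised point-to-point error and the cumulative error distribution curve is given below (normalising by the ground-truth bounding-box diagonal is our assumption of a typical "face size" definition; the exact protocol follows Zhu and Ramanan (2012)).

```python
# Sketch of the normalised point-to-point (pt-pt) error and the cumulative
# error distribution curve; illustrative, not the authors' exact protocol.
import numpy as np

def pt_pt_error(predicted, ground_truth):
    """predicted, ground_truth: (m, 2) landmark arrays."""
    per_point = np.linalg.norm(predicted - ground_truth, axis=1)
    rms = np.sqrt(np.mean(per_point ** 2))
    # Face size taken here as the diagonal of the ground-truth bounding box.
    span = ground_truth.max(axis=0) - ground_truth.min(axis=0)
    face_size = np.linalg.norm(span)
    return rms / face_size

def cumulative_error_curve(errors, thresholds):
    # Fraction of images whose pt-pt error is below each threshold.
    errors = np.asarray(errors)
    return [np.mean(errors <= t) for t in thresholds]
```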
Discriminative Power As GAGAN is trained to discriminate real and fake images conditioned on shapes, we measured how accurately the GAGAN discriminator can assess how well image and shape pairs are aligned. In practice, this is useful because, even though state-of-the-art landmark detection performs well, verification of the landmarks is usually done manually. This can now be done automatically by using the regularised GAGAN to discriminate between good and bad alignment. In the following experiments, we vary the pt–pt error by adding perturbations to the shapes to assess this capability. For all experiments we report the average prediction score and one standard deviation in relation to the pt–pt error of a given image-shape input. We compare the best GAGAN model (\(\beta =0.5, \alpha =0.1\)) to the baseline models HeatmapCGAN, PCGAN and ShapeCGAN. We trained each of the models on the GAGAN-small set without 300-VW and tested it on 300-VW. Similarly to Fig. 20, the curves are plotted by calculating the average prediction score from image-shape pairs from CelebA and its perturbations and the corresponding pt–pt error. In Fig. 14a, HeatmapCGAN has a similar predictive performance, but with a much higher variance. We plot PCGAN and ShapeCGAN separately in Fig. 14b to assess the slope and predictability. ShapeCGAN performs worse than GAGAN and HeatmapCGAN as the scale of its predictions is much smaller (0.0, 0.003) and the curves show high spikes of variance, especially at the critical pt–pt error of 0.05. PCGAN fails almost completely to detect any differences in pt–pt error.
4.5 Improvement Through Regularisation
We also compare the original GAGAN\(^{1}\) against GAGAN with \((\alpha> 0,\beta > 0)\) regularisation. We compared the ability to discriminate alignment quantitatively with test samples from 300-VW and CelebA. Both GAGAN and regularised GAGAN were trained on the GAGAN-small set, leaving out 300-VW samples, with \(\lambda =1.0\). In both test cases, regularised GAGAN results in a better discrimination of alignment in terms of slope and variance, as observed in Fig. 15.
Further, we also investigated the generation of images of GAGAN and regularised GAGAN. Figure 16 shows samples from both models, GAGAN (first row) and GAGAN with \((\alpha , \beta )\)regularisation (second row). While GAGAN without regularisation suffers from minor artefacts, we observe smoother textures generated from GAGAN with regularisation.
4.6 Ablation Study
In this subsection, we present experiments conducted to show the impact of the perturbation and the \((\alpha ,\beta )\) regularisation on GAGAN. Firstly, we show qualitative and quantitative results for GAGAN trained with and without perturbation. Secondly, similarly to the quantitative experiments in Sect. 4.4, we use our trained GAGAN discriminator for automatic facial landmark verification while varying the alignment of image-shape pairs.
Cross-Database Results With and Without Perturbation We conducted cross-database experiments by training GAGAN on the GAGAN-small set and testing the ability of automatic facial landmark verification on CelebA. Figure 17 shows the pt–pt error curves in relation to the average prediction score. GAGAN without perturbation cannot discriminate between well and badly aligned image-shape pairs, whereas GAGAN with perturbation establishes a smooth trend, although only in the prediction score range between 0.0 and 0.2. The performance of GAGAN with perturbation is also poor because we did not employ any \((\alpha ,\beta )\) regularisation. This allowed for an unbiased evaluation of the perturbation.
Test on Helen and LFPW Experiments using test sets whose corresponding training datasets (LFPW, Helen) are included in the training set were performed to assess how well GAGAN discriminators are able to distinguish between good and bad alignment. Since the GAGAN-small set consists of shape annotations which have been manually verified and corrected by experts, we know that the annotations are well aligned to the images. We trained with all samples except the ones from the LFPW and Helen test sets and varied \(\alpha \) or \(\beta \) while keeping respectively \(\beta =0.5\) and \(\alpha =0.1\) constant. GAGAN was trained with \(\lambda = 1.0\). Figure 18 shows that, with \(\alpha = 0.1\) and \(\beta =0.5\), GAGAN predicts alignments with pt–pt error \(=0.0\) with an average score of almost 1.0, decreasing to an average score of 0.0 for pt–pt error \(> 0.05\). This scoring behaviour is desirable as, in practice, any alignment with pt–pt error \( > 0.05\) should be manually corrected.
We also varied \(\beta = [0.1, 0.5, 0.8]\) while keeping \(\alpha =0.1\) constant in our cross-database experiments. Results are visualised in Fig. 20. By increasing \(\beta \) we decrease the variance in predictions and obtain a clearer threshold between well aligned image-shape pairs (pt–pt error \(< 0.05\)) and badly aligned ones, for both 300-VW and CelebA. With \(\beta =0.0\), there is a clear division between aligned images and shapes (pt–pt error \(=0.0\)) and image-shape pairs with a pt–pt error of approximately 0.05–0.2 for 300-VW. However, the average prediction score rises again at a pt–pt error of 0.2, which is contrary to how the discriminator should rank the images and shapes. Further, with increasing \(\beta \), the variance of the average prediction score decreases, thus giving better precision to the predictions.
5 Conclusion
We introduced regularised GAGAN, a novel method able to produce realistic images conditioned on disentangled latent shape and appearance representations. The generator samples from the probability distribution of a statistical shape model and generates faces that respect the induced geometry. This is enforced by an implicit connection from the shape parameters fed to the generator to a differentiable geometric transformation applied to its output. The discriminator, trained only on images normalised to canonical image coordinates, is able not only to differentiate realistic from fake samples, but also to judge the alignment between image and shape without being explicitly conditioned on the prior. The resulting representational power allows us to automatically assess the quality of facial landmark tracking, while avoiding dataset shifts. We demonstrated superior performance compared to other methods across datasets, domains and sample sizes. Our method is general and can be used to augment any existing GAN architecture.
6 Appendix
Footnotes
1. Original GAGAN can be expressed as regularised GAGAN with \((\alpha = 0, \beta =0)\).
Notes
Acknowledgements
The work of Linh Tran, Yannis Panagakis and Maja Pantic has been funded by the European Community Horizon 2020 under Grant Agreement Nos. 688835, 645094 (DE-ENIGMA).
References
Antonakos, E., Alabort-i-Medina, J., Tzimiropoulos, G., & Zafeiriou, S. (2015). Feature-based Lucas–Kanade and active appearance models. IEEE Transactions on Image Processing, 24(9), 2617.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.
Belhumeur, P. N., Jacobs, D. W., Kriegman, D. J., & Kumar, N. (2011). Localizing parts of faces using a consensus of exemplars. In The 24th IEEE conference on computer vision and pattern recognition (CVPR) (pp. 545–552).
Bulat, A., & Tzimiropoulos, G. (2017). How far are we from solving the 2D and 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In International conference on computer vision.
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems (pp. 2172–2180).
Cootes, T. F., Edwards, G. J., & Taylor, C. J. (2001). Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(6), 681.
Cootes, T., Taylor, C., Cooper, D., & Graham, J. (1995). Active shape models—their training and application. Computer Vision and Image Understanding, 61(1), 38.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.
Davies, R., Twining, C., & Taylor, C. (2008). Statistical models of shape: Optimisation and evaluation (1st ed.). Berlin: Springer.
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using Real NVP. In 5th international conference on learning representations (ICLR).
Edwards, G. J., Taylor, C. J., & Cootes, T. F. (1998). Interpreting face images using active appearance models. In IEEE international conference on automatic face and gesture recognition (FG) (pp. 300–305).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).
Gross, R., Matthews, I., Cohn, J., Kanade, T., & Baker, S. (2010). Multi-PIE. Image and Vision Computing (IVC), 28(5), 807.
Grzejszczak, T., Kawulok, M., & Galuszka, A. (2016). Hand landmarks detection and localization in color images. Multimedia Tools and Applications, 75(23), 16363. https://doi.org/10.1007/s11042-015-2934-5.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in neural information processing systems (pp. 5767–5777).
Jain, V., & Seung, S. (2009). Natural image denoising with convolutional networks. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems (pp. 769–776). Red Hook: Curran Associates Inc.
Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (pp. 694–711).
Kawulok, M., Kawulok, J., Nalepa, J., & Smolka, B. (2014). Self-adaptive algorithm for segmenting skin regions. EURASIP Journal on Advances in Signal Processing, 2014(170), 1. https://doi.org/10.1186/1687-6180-2014-170.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In 2nd international conference on learning representations (ICLR).
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., & Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems (pp. 4743–4751).
Kossaifi, J., Tran, L., Panagakis, Y., & Pantic, M. (2017). GAGAN: Geometry-aware generative adversarial networks. In IEEE CVPR. arXiv:1712.00684.
Kossaifi, J., Tzimiropoulos, G., & Pantic, M. (2014). Fast Newton active appearance models. In Proceedings of the IEEE international conference on image processing (ICIP 2014) (pp. 1420–1424).
Kossaifi, J., Tzimiropoulos, G., & Pantic, M. (2015). Fast and exact bi-directional fitting of active appearance models. In Proceedings of the IEEE international conference on image processing (ICIP 2015) (pp. 1135–1139).
Kossaifi, J., Tzimiropoulos, G., Todorovic, S., & Pantic, M. (2017). AFEW-VA database for valence and arousal estimation in-the-wild. Image and Vision Computing, 65(Supplement C), 23. Multimodal Sentiment Analysis and Mining in the Wild.
Kossaifi, J., Tzimiropoulos, G., & Pantic, M. (2017). Fast and exact Newton and bidirectional fitting of active appearance models. IEEE Transactions on Image Processing, 26(2), 1040.
Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2016). Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning (pp. 1558–1566).
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al. (2016). Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802.
Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of international conference on computer vision (ICCV).
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2), 91.
Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., & LeCun, Y. (2016). Disentangling factors of variation in deep representation using adversarial training. In Advances in neural information processing systems (pp. 5040–5048).
Matthews, I., & Baker, S. (2004). Active appearance models revisited. International Journal of Computer Vision (IJCV), 60(2), 135.
Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
Nalepa, J., & Kawulok, M. (2014). Fast and accurate hand shape classification. In International conference: Beyond databases, architectures and structures (pp. 364–373).
Odena, A., Olah, C., & Shlens, J. (2016). Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585.
Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2536–2544).
Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. D. (2009). Dataset shift in machine learning. Cambridge: The MIT Press.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., & Lee, H. (2016). Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
Rezende, D., & Mohamed, S. (2015). Variational inference with normalizing flows. In International conference on machine learning (pp. 1530–1538).
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2016). 300 faces in-the-wild challenge: Database and results. Image and Vision Computing (IVC), 47, 3. Special Issue on Facial Landmark Localisation “In-The-Wild”.
Sagonas, C., Panagakis, Y., Zafeiriou, S., & Pantic, M. (2015). Robust statistical face frontalization. In Proceedings of IEEE international conference on computer vision (ICCV 2015).
Sagonas, C., Panagakis, Y., Zafeiriou, S., & Pantic, M. (2016). Robust statistical frontalization of human and animal faces. International Journal of Computer Vision. Special Issue on “Machine Vision Applications”.
Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2013a). A semi-automatic methodology for facial landmark annotation. In CVPR workshops.
Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2013b). 300 faces in-the-wild challenge: The first facial landmark localization challenge. In The IEEE international conference on computer vision (ICCV) workshops (pp. 397–403).
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training GANs. In Advances in neural information processing systems (pp. 2234–2242).
Salimans, T., Karpathy, A., Chen, X., & Kingma, D. P. (2017). PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In 5th international conference on learning representations (ICLR).
Shen, J., Zafeiriou, S., Chrysos, G., Kossaifi, J., Tzimiropoulos, G., & Pantic, M. (2015). The first facial landmark tracking in-the-wild challenge: Benchmark and results. In Proceedings of IEEE international conference on computer vision, 300 videos in the wild (300-VW): Facial landmark tracking in-the-wild challenge & workshop (ICCVW’15) (pp. 50–58).
Tipping, M. E., & Bishop, C. M. (2003). Bayesian image super-resolution. In Advances in neural information processing systems (pp. 1303–1310).
Tran, L., Yin, X., & Liu, X. (2017). Disentangled representation learning GAN for pose-invariant face recognition. IEEE CVPR, 4(5), 7.
Tzimiropoulos, G., & Pantic, M. (2014a). Gauss–Newton deformable part models for face alignment in-the-wild. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1851–1858).
Tzimiropoulos, G., & Pantic, M. (2014b). In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1851–1858).
Tzimiropoulos, G., & Pantic, M. (2016). Fast algorithms for fitting active appearance models to unconstrained images. International Journal of Computer Vision, 122, 1–17.
Tzimiropoulos, G., & Pantic, M. (2017). Fast algorithms for fitting active appearance models to unconstrained images. International Journal of Computer Vision, 122(1), 17.
Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2012). Subspace learning from image gradient orientations. IEEE TPAMI, 34(12), 2454.
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. (2016). Conditional image generation with PixelCNN decoders. In Advances in neural information processing systems (pp. 4790–4798).
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on machine learning (pp. 1096–1103).
Wang, C., Wang, C., Xu, C., & Tao, D. (2017). Tag disentangled generative adversarial networks for object image re-rendering. In Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI (pp. 2901–2907).
Xie, J., Xu, L., & Chen, E. (2012). Image denoising and inpainting with deep neural networks. In Advances in neural information processing systems (pp. 341–349).
Yang, J., Wright, J., Huang, T. S., & Ma, Y. (2010). Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11), 2861.
Yang, H., Zou, C., & Patras, I. (2014). Face sketch landmarks localization in the wild. IEEE Signal Processing Letters, 21(11), 1321.
Zhao, J., Mathieu, M., & LeCun, Y. (2016). Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126.
Zhou, J. B. F., & Lin, Z. (2013). Exemplar-based graph matching for robust facial landmark localization. In IEEE international conference on computer vision (ICCV) (pp. 1025–1032).
Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the 2012 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2879–2886).
Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE international conference on computer vision (ICCV) (pp. 2242–2251).
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.