An Adversarial Neuro-Tensorial Approach For Learning Disentangled Representations

Several factors contribute to the appearance of an object in a visual scene, including pose, illumination, and deformation, to mention a few. Each factor accounts for a source of variability in the data, while the multiplicative interactions of these factors emulate the entangled variability, giving rise to the rich structure of visual object appearance. Disentangling such unobserved factors from visual data is a challenging task, especially when the data have been captured in uncontrolled recording conditions (also referred to as "in-the-wild") and label information is not available. In this paper, we propose the first unsupervised deep learning method for disentangling multiple latent factors of variation in face images captured in-the-wild. To this end, we propose a deep latent variable model, where the multiplicative interactions of multiple latent factors of variation are explicitly modelled by means of a multilinear (tensor) structure. We demonstrate that the proposed approach indeed learns disentangled representations of facial expressions and pose, which can be used in various applications, including face editing, as well as 3D face reconstruction and classification of facial expression, identity and pose.


Introduction
The appearance of visual objects is significantly affected by multiple factors of variability such as, for example, pose, illumination, identity, and expression in case of faces. Each factor accounts for a source of variability in the data, while their complex interactions give rise to the observed entangled variability. Discovering the modes of variation or in other words disentangling the latent factors of variations in visual data is a very important problem in the intersection of statistics, machine learning and computer vision.
Factor analysis [10] and the closely related Principal Component Analysis (PCA) [14] are probably the most popular statistical methods to find a single mode of variation that explains the data. Nevertheless, facial (and other forms) of visual data have many different and possibly independent, modes of variations and hence methods such as PCA are not able to identify them. For example, it is not uncommon when training PCA models using faces for the first principal component to contain pose as well as expression variations.
An early approach for learning different modes of variation in the data is TensorFaces [34]. In particular, TensorFaces not only requires the facial data to be labelled (e.g., in terms of expression, identity, illumination, etc.) but the data tensor must also contain all samples in all different variations. This is the primary reason that such tensor decompositions are still mainly applied to databases that have been captured in a very controlled environment, such as the Weizmann face database [34].
Recent unsupervised tensor decompositions [30,35] automatically discover the modes of variation in unlabelled data. In particular, the most recent one [35] assumes that the original visual data have been produced by a hidden multilinear structure and the aim of the unsupervised tensor decomposition is to discover both the underlying multilinear structure, as well as the corresponding weights (coefficients) that best explain the data. Special instances of the unsupervised tensor decomposition are the Shape-from-Shading (SfS) decompositions in [15,29] and the multilinear decompositions for 3D face description in [35]. In [35], it is shown that indeed the method can be used to learn representations where many modes of variation have been disentangled (e.g., identity, expression and illumination etc.). Nevertheless, the method is applicable to faces which have been frontalised by applying a warping function (e.g., a piece-wise affine warping transform [23]).
Another promising line of research for discovering latent representations is unsupervised Deep Neural Networks (DNNs). Unsupervised DNN architectures include Auto-Encoders (AEs) [1], as well as Generative Adversarial Networks (GANs) [11] and adversarial versions of AEs, e.g., the Adversarial Auto-Encoder (AAE) [21]. Even though GANs, as well as AAEs, provide very elegant frameworks for discovering powerful low-dimensional embeddings without having to align the faces, due to the complexity of the networks, unavoidably all modes of variation are multiplexed in the latent representation. Only with the use of labels is it possible to model/learn the manifold over the latent representation, usually as a post-processing step [28].

[Figure 1: Given a single in-the-wild image, our network learns disentangled representations for pose, illumination, expression and identity. Using these representations, we are able to manipulate the image and edit the pose or expression (expression editing and pose editing panels).]
In this paper, we show that it is possible to learn a disentangled representation of the human face captured in arbitrary recording conditions in an unsupervised manner by imposing a multilinear structure on the latent representation of an AAE [28]. To the best of our knowledge, this is the first time that tensor decompositions have been combined with DNNs for learning disentangled representations. We demonstrate the power of the proposed approach by showing expression/pose transfer using only the latent variable that is related to expression/pose. We also demonstrate that the disentangled low-dimensional embeddings are useful for many other applications, such as facial expression, pose, and identity recognition and clustering. An example of the proposed approach is given in Fig. 1. In particular, the left pair of images has been decomposed, using the encoder of the proposed neural network E(·), into many different latent representations, including latent representations for pose, illumination, identity and expression. Since our framework has learned a disentangled representation, we can easily transfer the expression by only changing the latent variable related to expression and passing the latent vector into the decoder of our neural network D(·). Similarly, we can transfer the pose by just changing the latent variable related to pose.

Related Work
Learning disentangled representations that explain multiple factors of variation in the data as disjoint latent dimensions is desirable in several machine learning, computer vision, and graphics tasks.
Indeed, bilinear factor analysis models [31] have been employed for disentangling two factors of variation (e.g., head pose and facial identity) in the data. Identity, expression, pose, and illumination variations are disentangled in [34] by applying a Tucker decomposition (also known as the multilinear Singular Value Decomposition (SVD) [8]) to a carefully constructed tensor (i.e., multidimensional array) built using label information. Interestingly, the modes of variation in well-aligned images can be recovered via a multilinear tensor factorisation [35] without any supervision. The main limitation of the aforementioned methods is that inference given a single unseen test sample might be ill-posed.
More recently, both supervised and unsupervised deep learning methods have been developed for disentangled representation learning. Transforming auto-encoders [13] is among the earliest methods for disentangling latent factors by means of auto-encoder capsules. In [9], hidden factors of variation are disentangled via inference in a variant of the restricted Boltzmann machine. Disentangled representations of input images are obtained by the hidden layers of deep networks in [7] and through a higher-order Boltzmann machine in [25]. The Deep Convolutional Inverse Graphics Network [18] learns a representation that is disentangled with respect to transformations such as out-of-plane rotations and lighting variations. The methods in [6,22,5,32,33] extract disentangled and interpretable visual representations by employing adversarial training. The method in [28] disentangles the latent representations of illumination, surface normals, and albedo of face images using an image rendering pipeline. Trained with weak supervision, [28] shows multiple image editing tasks by manipulating the relevant latent representations. This editing approach, though, still requires expression labels and sampling multiple images of a specific expression.
Here, the proposed network is able to edit the expression of a face image given another single in-the-wild face image of arbitrary expression. Furthermore, we are able to edit the pose of a face in the image which is not possible in [28].

Methodology
In this section, we summarise the main multilinear models used to represent texture, shape and normals (three different modalities). In particular, we assume that for each modality there is a different core tensor, but all modalities share the same latent representation of weights for identity and expression. In the following, we assume that we have a set of n facial images (e.g., in the training batch) and their corresponding 3D facial shapes, as well as their per-pixel normals (the 3D shape and normals have been produced by fitting a 3D model on the image, e.g., [4]).

Facial Texture
We follow the assumption in [35] that visual data have been produced by a hidden multilinear structure and that an unsupervised tensor decomposition can be applied to discover both the underlying multilinear structure and the corresponding weights (coefficients) that best explain the data. Following [35], disentangled representations (e.g., identity, expression, illumination, etc.) can be learnt from frontalised facial images. The frontalisation process is performed by applying a piecewise affine transform using the sparse shape recovered by a face alignment process, and unavoidably suffers from warping artifacts. In this paper we do not apply any warping process; instead, we apply the multilinear decomposition only to near-frontal faces (automatically detected by the use of a 3D face fitting process). In particular, assuming a near-frontal facial image rasterised in a vector x_f ∈ R^(k_x × 1), given a core tensor Q ∈ R^(k_x × k_l × k_exp × k_id), this can be decomposed as

x_f = Q ×_2 z_l ×_3 z_exp ×_4 z_id,   (1)

Tensor notation: tensors (i.e., multidimensional arrays) are denoted by calligraphic letters, e.g., X. The mode-m matricisation of a tensor X ∈ R^(I_1 × I_2 × ··· × I_M) maps X to a matrix X_(m) ∈ R^(I_m × Ī_m), where Ī_m = ∏_{i≠m} I_i. The mode-m vector product of a tensor X ∈ R^(I_1 × I_2 × ··· × I_M) with a vector x ∈ R^(I_m) is denoted by X ×_m x ∈ R^(I_1 × ··· × I_{m−1} × I_{m+1} × ··· × I_M).
The Kronecker product is denoted by ⊗ and the Khatri-Rao (i.e., column-wise Kronecker product) product is denoted by . More details on tensors and multilinear operators can be found in [16].
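As a concrete illustration of the mode-m matricisation described above, the following NumPy sketch unfolds a small toy tensor. The `matricise` helper is illustrative and uses one common ordering convention for the columns; conventions differ across references.

```python
import numpy as np

def matricise(X, m):
    # mode-m matricisation (0-based m): mode m becomes the rows and the
    # remaining modes are flattened into the columns (one of several conventions)
    return np.moveaxis(X, m, 0).reshape(X.shape[m], -1)

X = np.arange(24).reshape(2, 3, 4)   # a toy 2 x 3 x 4 tensor
X1 = matricise(X, 1)                 # unfold along the second mode
assert X1.shape == (3, 8)
assert X1[0].tolist() == [0, 1, 2, 3, 12, 13, 14, 15]  # X[:, 0, :] flattened
```

The rows of the unfolding are the mode-m fibres of the tensor, which is exactly the structure exploited by the projection layers described later.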
where z_l ∈ R^(k_l), z_exp ∈ R^(k_exp) and z_id ∈ R^(k_id) are the weights that correspond to illumination, expression and identity, respectively. The equivalent form, in the case where a number of images in the batch are stacked in the columns of a matrix X_f ∈ R^(k_x × n), is

X_f = Q_(1) (Z_l ⊙ Z_exp ⊙ Z_id),   (2)

where Q_(1) is the mode-1 matricisation of tensor Q and Z_l, Z_exp and Z_id are the corresponding matrices that gather the weights of the decomposition for all images in the batch. That is, Z_exp ∈ R^(k_exp × n) stacks the n latent variables of expression, Z_id ∈ R^(k_id × n) stacks the n latent variables of identity and Z_l ∈ R^(k_l × n) stacks the n latent variables of illumination.
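The batch form above can be sanity-checked numerically: each column of the Khatri-Rao product equals the Kronecker product of the per-image weight vectors, so the batch decomposition agrees column-by-column with the single-image form. A minimal NumPy sketch with toy dimensions (all sizes and the random core matrix `Q1` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k_x, k_l, k_exp, k_id, n = 12, 3, 4, 5, 6  # toy sizes (hypothetical)

Q1 = rng.standard_normal((k_x, k_l * k_exp * k_id))  # mode-1 matricisation of Q
Z_l, Z_exp, Z_id = (rng.standard_normal((k, n)) for k in (k_l, k_exp, k_id))

def khatri_rao(A, B):
    # column-wise Kronecker product: (A ⊙ B)[:, i] = kron(A[:, i], B[:, i])
    return np.einsum('in,jn->ijn', A, B).reshape(A.shape[0] * B.shape[0], -1)

# batch form: X_f = Q_(1) (Z_l ⊙ Z_exp ⊙ Z_id)
X_batch = Q1 @ khatri_rao(khatri_rao(Z_l, Z_exp), Z_id)

# single-image form for image 0 uses the Kronecker product of its weights
x0 = Q1 @ np.kron(np.kron(Z_l[:, 0], Z_exp[:, 0]), Z_id[:, 0])
assert np.allclose(X_batch[:, 0], x0)
```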

3D Facial Shape
It is quite common to use a bilinear model for disentangling identity and expression in 3D facial shape [3]. Hence, for the 3D shape we assume that there is a different core tensor B ∈ R^(k_3d × k_exp × k_id) and each 3D facial shape x_3d ∈ R^(k_3d) can be decomposed as

x_3d = B ×_2 z_exp ×_3 z_id,   (3)

where z_exp and z_id are exactly the same weights as in the texture decomposition. The tensor decomposition for the n images in the batch can be written as

X_3d = B_(1) (Z_exp ⊙ Z_id),   (4)

where B_(1) is the mode-1 matricisation of tensor B.

Facial Normals
The tensor decomposition we opted to use for facial normals is exactly the same as for the texture; hence we can use the same core tensor and weights. The difference is that, since facial normals do not depend on illumination parameters (assuming a Lambertian illumination model), we just need to replace the illumination weights with a constant. Thus, the decomposition for normals can be written as

N = Q_(1) (1 ⊙ Z_exp ⊙ Z_id),   (5)

where 1 ∈ R^(k_l × n) is a matrix of ones.

3D Facial Pose
Finally, we have another latent variable for the 3D pose. This latent variable z_p ∈ R^9 represents a 3D rotation; denoting by x^i ∈ R^(k_x) the image at index i (indexing is denoted in the following by the superscript), the corresponding z_p^i can be reshaped into a rotation matrix R^i ∈ R^(3×3). As proposed in [36], we apply this rotation to the feature of image x^i created by 2-way synthesis, i.e., the i-th column of the feature matrix (Z_exp ⊙ Z_id) ∈ R^(k_exp k_id × n). This feature vector is reshaped into a matrix with three rows, left-multiplied by R^i, and vectorised again; the resulting feature in R^(k_exp k_id) becomes the input of the decoders for normals and albedo. We refer to this transformation of the feature vector (Z_exp ⊙ Z_id)^i as rotation.
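The rotation operation can be sketched as follows. The reshape of the k_exp·k_id feature into a three-row matrix follows the description above, though the exact memory layout used here is an assumption of this sketch, as is the particular rotation matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
k_exp, k_id = 15, 80               # k_exp * k_id = 1200, divisible by 3

f = rng.standard_normal(k_exp * k_id)   # (Z_exp ⊙ Z_id) column for one image
t = 0.3                                  # toy rotation angle about the z-axis
R = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])

F = f.reshape(3, -1)          # view the feature as a 3 x (k_exp k_id / 3) matrix
f_rot = (R @ F).reshape(-1)   # left-multiply by R and vectorise again

# a rotation is orthogonal, so it preserves the norm of the feature
assert np.isclose(np.linalg.norm(f_rot), np.linalg.norm(f))
```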

Network Architecture
We incorporate the structure imposed by (2), (4) and (5) into an auto-encoder network, see Figure 2. For matrices Y_i ∈ R^(k_{y_i} × n), we refer to the operation Y_1 ⊙ Y_2 ∈ R^(k_{y_1} k_{y_2} × n) as 2-way synthesis and Y_1 ⊙ Y_2 ⊙ Y_3 ∈ R^(k_{y_1} k_{y_2} k_{y_3} × n) as 3-way synthesis. The multiplication of a feature matrix by B_(1) or Q_(1), the mode-1 matricisations of tensors B and Q, is referred to as projection and can be represented by an unbiased fully-connected layer.
Our network follows the architecture of [28]. The encoder E receives an input image x, and the convolutional encoder stack first encodes it into z_i, an intermediate latent variable vector of size 128 × 1. z_i is then transformed into latent codes for background z_b, mask z_m, illumination z_l, pose z_p, identity z_id and expression z_exp via fully-connected layers.
The decoder D takes the latent codes as input. z_b and z_m (128 × 1 vectors) are directly passed into convolutional decoder stacks to estimate the background and the face mask respectively. The remaining latent variables follow 3 streams:
1. z_exp (15 × 1 vector) and z_id (80 × 1 vector) are joined by 2-way synthesis and projection to estimate the facial shape x̂_3d.
2. The result of the 2-way synthesis of z_exp and z_id is rotated using z_p. The rotated feature is passed into 2 different convolutional decoder stacks: one for normal estimation and another for albedo. Using the estimated normal map, albedo, illumination component z_l, mask and background, a reconstructed image x̂ is rendered.
3. z_exp, z_id and z_l are joined by 3-way synthesis and projection to estimate a frontal normal map and a frontal reconstruction of the image.
Streams 1 and 3 drive the disentangling of expression and identity components, while stream 2 focuses on the reconstruction of the image by adding in the pose components.
Our input images are aligned and cropped face images from CelebA [19] of size 64 × 64, so k_x = 3 × 64 × 64. Furthermore, k_3d = 3 × 9375, k_l = 9, k_id = 80 and k_exp = 15. More details on the network, such as the convolutional encoder and decoder stacks, are given in the supplementary material.

Training
We use in-the-wild face images for training. Hence, we only have access to the image itself (x) and do not have ground truth data for pose, illumination, normals, albedo, expression, identity or 3D shape. The main loss function combines the reconstruction loss of the image x with adversarial and verification terms:

E = E_recon + λ_adv E_adv + λ_veri E_veri,

where E_recon = ||x − x̂||²_2, E_adv represents the adversarial loss and E_veri the verification loss. We use the trained verification network V [37] to find face embeddings of our images x and x̂. As both images are supposed to represent the same person, we minimise the cosine distance between the embeddings: E_veri = 1 − cos(V(x), V(x̂)). A discriminative network D is trained at the same time as our network to distinguish between the generated and real images [11]. We incorporate the discriminative network by following the auto-encoder loss distribution matching approach of [2]: the discriminative network D is itself an auto-encoder trying to reconstruct the input image, so the adversarial loss is

E_adv = ||x̂ − D(x̂)||²_2,

where D here denotes the discriminative auto-encoder. As fully unsupervised training often results in semantically meaningless latent representations, [28] proposed training using pseudo ground truth values for normals, lighting and 3D facial shape. We re-use this technique and introduce further pseudo ground truth values for pose x̃_p, expression x̃_exp and identity x̃_id. x̃_p, x̃_exp and x̃_id are obtained by fitting coarse face geometry to every image in the training set using a 3D Morphable Model (3DMM) [4]. We incorporate the constraints used in [28] for illumination, normals and albedo, and introduce the following new objectives, starting with pose:

E_p = ||R − x̃_p||²_F,

where R is the rotation matrix reshaped from z_p and x̃_p is a 3D camera rotation matrix.
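A minimal sketch of how the combined objective could be assembled, assuming scalar weights λ_adv and λ_veri; the helper names (`total_loss`, `cosine_distance`) and the toy inputs are illustrative, not the paper's implementation:

```python
import numpy as np

def cosine_distance(a, b):
    # E_veri = 1 - cos(V(x), V(x_hat)) on verification embeddings
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def total_loss(x, x_hat, emb, emb_hat, e_adv, lam_adv=1.0, lam_veri=1.0):
    # E = E_recon + lam_adv * E_adv + lam_veri * E_veri
    # (the weight values lam_adv, lam_veri are assumptions of this sketch)
    e_recon = float(np.sum((x - x_hat) ** 2))
    return e_recon + lam_adv * e_adv + lam_veri * cosine_distance(emb, emb_hat)

x, x_hat = np.zeros(4), np.ones(4)      # toy image and reconstruction
emb = emb_hat = np.array([1.0, 0.0])    # identical embeddings: E_veri = 0
assert np.isclose(total_loss(x, x_hat, emb, emb_hat, e_adv=0.0), 4.0)
```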
For expression,

E_exp = ||fc(z_exp) − x̃_exp||²_2,

where fc(·) is a fully-connected layer and x̃_exp ∈ R^28 is a pseudo ground truth vector representing the 3DMM expression components of the image x.
For identity,

E_id = ||fc(z_id) − x̃_id||²_2,

where fc(·) is a fully-connected layer and x̃_id ∈ R^157 is a pseudo ground truth vector representing the 3DMM identity components of the image x.

Multilinear Losses
Directly applying the above losses as constraints to the latent variables does not result in a well-disentangled representation. To disentangle the variations, we impose a tensor structure on the image using the following losses. For the 3D shape,

E_3d = ||X̃_3d − B_(1) (Z_exp ⊙ Z_id)||²_F,   (12)

where X̃_3d gathers the 3D facial shapes of the fitted models.
For the texture,

E_f = ||X_f − Q_(1) (Z_l ⊙ Z_exp ⊙ Z_id)||²_F,   (13)

where X_f gathers semi-frontal face images. During training, E_f is only applied on near-frontal face images, filtered using x̃_p.
For the normals,

E_n = ||Ñ_f − Q_(1) (1 ⊙ Z_exp ⊙ Z_id)||²_F,   (14)

where Ñ_f gathers near-frontal normal maps. During training, the loss E_n is only applied on near-frontal normal maps.

The model is trained end-to-end by applying gradient descent to batches of images, where (12), (13) and (14) are written in the following general form:

E = ||X − B_(1) (Z^(1) ⊙ Z^(2) ⊙ ··· ⊙ Z^(M))||²_F,   (15)

where X ∈ R^(k × n) is a data matrix, B_(1) is the mode-1 matricisation of a tensor B and Z^(i) ∈ R^(k_{z_i} × n) are the latent variable matrices. The partial derivatives of (15) with respect to the latent variables Z^(i) are computed as follows. Let x̂ = vec(X) be a vectorisation of X and ẑ^(i) = vec(Z^(i)) be a vectorisation of Z^(i); then (15) can be rewritten as an equivalent L2 loss (16) on the vectorised quantities. The partial derivative of (15) with respect to Z^(i) is then obtained by matricising the partial derivative of (16) with respect to ẑ^(i), which is easy to compute analytically; the full derivation can be found in the supplementary material. To efficiently compute the above-mentioned operations, [17] has been employed.

[Figure 5: Given a single image (input), we infer meaningful expression and identity components to reconstruct a 3D mesh of the face (reconstruction). We compare the reconstruction against the ground truth provided by 3DMM fitting.]

Experiments
We perform several experiments on data unseen during training to show that our network is indeed able to disentangle pose, expression and identity. We edit expression or pose by swapping the latent expression/pose component learnt by the encoder E (Eq. (6)) with the latent expression/pose component predicted from another image. Our goal is for the decoder D (Eq. (7)) to take in the changed values and return the edited image.

Expression and Pose Editing in-the-wild
Given two in-the-wild face images, we are able to transfer the expression or pose of one person to another. Transferring the expression between two different facial images without fitting a 3D model is a very challenging problem. Generally, transfer of expressions is considered in the context of the same person under an elaborate blending framework [38] or by transferring certain classes of expressions [27].
For this experiment, we work with completely unseen data (a hold-out set of CelebA) and no labels. We first encode both input images x^i and x^j:

[z^i_exp, z^i_id, z^i_p, ...] = E(x^i),   [z^j_exp, z^j_id, z^j_p, ...] = E(x^j),

where E(·) is our encoder and z_exp, z_id, z_p are the latent representations of expression, identity and pose respectively.
Assuming we want x^i to take on the expression or pose of x^j, we then decode

x^{jii} = D(z^j_exp, z^i_id, z^i_p),   x^{iij} = D(z^i_exp, z^i_id, z^j_p),

where D(·) is our decoder (the remaining latent codes are kept from x^i). The resulting x^{jii} is our result image where x^i has the expression of x^j, while x^{iij} is the edited image where x^i has changed to the pose of x^j.
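The editing procedure reduces to swapping one latent component before decoding. A schematic sketch, where dictionaries stand in for the encoder outputs and the hypothetical `edit` helper performs the swap:

```python
import numpy as np

def edit(z_i, z_j, component):
    # swap one latent component of image i with that of image j;
    # the dicts stand in for the encoder outputs E(x_i), E(x_j)
    z_edit = dict(z_i)
    z_edit[component] = z_j[component]
    return z_edit  # this would then be passed to the decoder D(·)

z_i = {'exp': np.zeros(15), 'id': np.ones(80), 'p': np.eye(3).reshape(-1)}
z_j = {'exp': np.ones(15),  'id': np.zeros(80), 'p': np.eye(3).reshape(-1)}

z_new = edit(z_i, z_j, 'exp')     # x_i takes on the expression of x_j
assert np.allclose(z_new['exp'], 1.0) and np.allclose(z_new['id'], 1.0)
```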
As there is currently no prior work on this expression editing experiment, we use the images synthesised by the fitted 3DMM models as a baseline. For this baseline, we fit a shape model to both images and extract the expression components of the model. We then generate a new face shape using the expression components of one face and the identity components of the other face in the 3DMM setting. This technique has much higher overhead than our proposed method, as it requires expensive 3DMM fitting for the images. Our expression editing results and the baseline results are shown in Figure 3. Though the baseline is very strong, it does not change the texture of the face, which can result in unnatural-looking faces where the original expression is still visible in the texture. The baseline is also not able to handle the area inside the mouth. Our editing results show more natural-looking faces.
For pose editing, the background is unknown once the pose has changed so we focus on the face only for this experiment. Figure 4 shows our pose editing results. For the baseline, we fit a 3DMM model to both images and extract the rotation. We then synthesise x i with the rotation of x j . This technique has high overhead as it requires expensive 3DMM fitting of the images.

3D Reconstruction
The latent variables z_exp and z_id that our network learns are highly meaningful. Not only can they be used to reconstruct the image in 2D, they can also be mapped into the expression (x̃_exp) and identity (x̃_id) components of a 3DMM; this mapping is learnt inside the network. By replacing the expression and identity components of a mean face shape with x̃_exp and x̃_id, we are able to reconstruct the 3D mesh of a face given a single in-the-wild input image. We compare these reconstructed meshes against the result of a 3DMM fitted to the input image.
The results of the experiment are visualised in Figure 5. We observe that the reconstruction is very close to the ground truth. Both techniques, though, do not capture the identity of the person in the input image well, due to a known weakness of 3DMMs.

[Figure 6: Visualisation of our Z_exp and baseline Z_0 using t-SNE. Our latent Z_exp clusters better with regard to expression than the latent space Z_0 of an auto-encoder.]

[Figure 7: Visualisation of our Z_p and baseline Z_0 using t-SNE. It is evident that the proposed disentangled Z_p clusters better with regard to pose than the latent space Z_0 of an auto-encoder.]

Quantitative Evaluation of the Latent Space
We want to test whether our latent space corresponds well to the variation it is supposed to capture. For our quantitative experiment, we use Multi-PIE [12] as the test dataset. This dataset contains labelled variations in identity, expression and pose. Multi-PIE is an especially hard dataset to tackle, as the images were captured in a controlled environment quite different from our training images. Moreover, the expressions contained in Multi-PIE do not correspond to the 7 basic expressions and can easily be confused.
We encoded 10368 images of the Multi-PIE dataset and then trained a linear SVM classifier using 90% of the identity labels and the corresponding latent codes z_id. We then tested on the remaining 10% of the z_id codes to check whether they are discriminative for identity, using 10-fold cross-validation to evaluate the accuracy of the learnt classifier. We repeated this experiment for expression with z_exp and for pose with z_p. Our results in Table 1 show that our latent representation is indeed discriminative, showcasing its power on a previously unseen dataset.
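The evaluation protocol can be sketched on synthetic latents. In this illustrative sketch a simple nearest-centroid classifier stands in for the linear SVM used above, and the clustered Gaussian "latents" are purely synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic stand-in for encoded z_id latents: 3 well-separated classes
centroids = rng.standard_normal((3, 8)) * 5.0
Z = np.vstack([c + 0.1 * rng.standard_normal((30, 8)) for c in centroids])
y = np.repeat(np.arange(3), 30)

# 90/10 train/test split, mirroring the paper's protocol
idx = rng.permutation(len(y))
tr, te = idx[:81], idx[81:]

# nearest-centroid classifier (a stand-in for the linear SVM)
mu = np.stack([Z[tr][y[tr] == c].mean(axis=0) for c in range(3)])
pred = np.argmin(((Z[te][:, None] - mu[None]) ** 2).sum(-1), axis=1)
assert (pred == y[te]).mean() > 0.9   # well-separated latents classify easily
```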
We visualise, using t-SNE [20], the latent codes Z_exp and Z_p encoded from Multi-PIE according to their expression and pose labels, and compare against the latent representation Z_0 learnt by an in-house large-scale adversarial auto-encoder of similar architecture trained with 2 million faces [21]. Figures 6 and 7 show that even though our encoder has not seen any images of Multi-PIE, it obtains strong latent representations that cluster well by expression and pose (contrary to the representation learned by the tested auto-encoder).

Table 1: Classification accuracy results. We classify identity using z_id, expression using z_exp and pose using z_p.

           Identity   Expression   Pose
Accuracy   83.85%     86.07%       95.73%

Conclusion
We proposed the first, to the best of our knowledge, attempt to jointly disentangle modes of variation that correspond to expression, identity, illumination and pose using no explicit labels for these attributes. To do so, we proposed the first approach, as far as we know, that combines a powerful Deep Convolutional Neural Network (DCNN) architecture with unsupervised tensor decompositions. We demonstrated the power of our methodology in expression and pose transfer, as well as in discovering powerful features for pose and expression classification.

A. Network Details
The convolutional encoder stack (Fig. 2) is composed of three convolutions with 96, 48 and 24 filters of size 5 × 5, respectively. Each convolution is followed by max-pooling and a thresholding nonlinearity. We pad the filter responses so that the final output of the convolutional stack is a set of filter responses of size 24 × 8 × 8 for a 3 × 64 × 64 input image. The pooling indices of the max-pooling are preserved for the unpooling layers in the decoder stack.
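The spatial dimensions above are consistent with three 'same'-padded convolutions each followed by 2 × 2 max-pooling, as this small arithmetic check illustrates:

```python
def out_size(s, pool=2):
    # a 'same'-padded 5x5 convolution keeps the spatial size;
    # the following 2x2 max-pool halves it
    return s // pool

s = 64
for _ in range(3):   # three conv + pool stages (96, 48 and 24 filters)
    s = out_size(s)
assert s == 8        # matches the 24 x 8 x 8 output for a 3 x 64 x 64 input
```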
The decoder stacks for the mask and background are strictly symmetric to the encoder stack and have skip connections to the input encoder stack at the corresponding unpooling layers. These skip connections between the encoder and the decoder allow for the details of the background to be preserved.
The other decoder stacks use upsampling and are also strictly symmetric to the encoder stack.

B. Derivation Details
The model is trained end-to-end by applying gradient descent to batches of images, where (12), (13) and (14) are written in the following general form:

E = ||X − B_(1) (Z^(1) ⊙ Z^(2) ⊙ ··· ⊙ Z^(M))||²_F,   (15)

where X ∈ R^(k × n) is a data matrix, B_(1) is the mode-1 matricisation of a tensor B and Z^(i) ∈ R^(k_{z_i} × n) are the latent variable matrices. The partial derivatives of (15) with respect to the latent variables Z^(i) are computed as follows. Let x̂ = vec(X) be a vectorisation of X; then (15) is equivalent to

E = ||x̂ − vec(B_(1) (Z^(1) ⊙ Z^(2) ⊙ ··· ⊙ Z^(M)))||²_2,   (16)

as both the Frobenius norm and the L2 norm are the sum of all squared elements.
The partial derivative of (15) with respect to Z (i) is obtained by matricising (23).
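The equivalence of the Frobenius and vectorised L2 forms, on which the derivation relies, can be verified numerically; the toy sizes and random matrices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
k, k1, k2, n = 6, 3, 4, 5
B1 = rng.standard_normal((k, k1 * k2))   # stand-in for B_(1)
Z1, Z2 = rng.standard_normal((k1, n)), rng.standard_normal((k2, n))
X = rng.standard_normal((k, n))

# Khatri-Rao product Z^(1) ⊙ Z^(2)
KR = np.einsum('in,jn->ijn', Z1, Z2).reshape(k1 * k2, n)
R = X - B1 @ KR   # residual of the general form (15)

# Frobenius norm of the matrix residual equals the L2 norm of its vectorisation
assert np.isclose(np.linalg.norm(R, 'fro'), np.linalg.norm(R.reshape(-1)))
```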

D. Interpolation Results
We interpolate the z^i_exp / z^i_id of the input image x^i on the right-hand side towards the z^t_exp / z^t_id of the target image x^t on the left-hand side. The interpolation is linear, at intervals of 0.1. For the interpolation we do not modify the background, so the background remains that of image x^i.
For expression interpolation, we expect the identity and pose to stay the same as the input image x i and only the expression to change gradually from the expression of the input image to the expression of the target image x t . Figure 10 shows the expression interpolation. We can clearly see the change in expression while pose and identity remain constant.
For identity interpolation, we expect the expression and pose to stay the same as the input image x i and only the identity to change gradually from the identity of the input image to the identity of the target image x t . Figure 11 shows the identity interpolation. We can clearly observe the change in identity while other variations remain limited.

E. Expression Transfer from Video
We conducted another challenging experiment to test the potential of our method: can we transfer facial expressions from an "in-the-wild" video to a given template image (also an "in-the-wild" image)? For this experiment, we split the input video into frames and extract the expression component z_exp of each frame. We then replace the expression component of the template image with the z_exp of each video frame and decode. The decoded images form a new video sequence in which the person in the template image has taken on the expression of the input video at each frame. The result can be seen here: https://youtu.be/tUTRSrY_ON8. The original video is shown on the left side, while the template image is shown on the right side. The result of the expression transfer is the 2nd video from the left. We compare against a baseline (3rd video from the left) where the template image has been warped to the landmarks of the input video. We can clearly see that our method is able to disentangle expression from pose, and the change is only at the expression level. The baseline, though, is only able to transform expression and pose together. Our result video also displays expressions that are more natural to the person in the template image. To conclude, we are able to animate a template face using the disentangled facial expression components of a video sequence.