1 Introduction

The appearance of visual objects is significantly affected by multiple factors of variability such as pose, illumination and, in the case of faces, identity and expression. Each factor accounts for a source of variability in the data, while their complex interactions give rise to the observed entangled variability. Discovering the modes of variation, or in other words disentangling the latent factors of variation in visual data, is a very important problem at the intersection of statistics, machine learning, and computer vision.

Fig. 1

Given a single in-the-wild image, our network learns disentangled representations for pose, illumination, expression and identity. Using these representations, we are able to manipulate the image and edit the pose or expression

Factor analysis (Fabrigar and Wegener 2011) and the closely related Principal Component Analysis (PCA) (Hotelling 1933) are probably the most popular statistical methods that find a single mode of variation explaining the data. Nevertheless, visual appearance (e.g., facial appearance) is affected by several modes of variation. Hence, methods such as PCA are not able to identify such multiple factors of variation. For example, when PCA is applied to facial images, the first principal component captures both pose and expression variations.

An early approach for learning different modes of variation in the data is TensorFaces (Vasilescu and Terzopoulos 2002). In particular, TensorFaces is a strictly supervised method, as it not only requires the facial data to be labelled (e.g., in terms of expression, identity, illumination, etc.) but also requires the data tensor to contain all samples under all different variations. This is the primary reason that the use of such tensor decompositions is still limited to databases that have been captured in a strictly controlled environment, such as the Weizmann face database (Vasilescu and Terzopoulos 2002).

Recent unsupervised tensor decomposition methods (Tang et al. 2013; Wang et al. 2017b) automatically discover the modes of variation in unlabelled data. In particular, the most recent one (Wang et al. 2017b) assumes that the original visual data have been produced by a hidden multilinear structure, and the aim of the unsupervised tensor decomposition is to discover both the underlying multilinear structure and the corresponding weights (coefficients) that best explain the data. Special instances of the unsupervised tensor decomposition are the Shape-from-Shading (SfS) decompositions in Kemelmacher-Shlizerman (2013), Snape et al. (2015) and the multilinear decompositions for 3D face description in Wang et al. (2017b). In Wang et al. (2017b), it is shown that the method can indeed be used to learn representations where many modes of variation have been disentangled (e.g., identity, expression and illumination). Nevertheless, the method in Wang et al. (2017b) is not able to recover pose variations and bypasses this problem by being applied to faces that have been frontalised with a warping function [e.g., piece-wise affine warping (Matthews and Baker 2004)].

Another promising line of research for discovering latent representations is unsupervised Deep Neural Networks (DNNs). Unsupervised DNN architectures include Auto-Encoders (AEs) (Bengio et al. 2013), as well as Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) and adversarial versions of AEs, e.g., Adversarial Auto-Encoders (AAEs) (Makhzani et al. 2015). Even though GANs and AAEs provide very elegant frameworks for discovering powerful low-dimensional embeddings without having to align the faces, all modes of variation are unavoidably multiplexed in the latent representation due to the complexity of the networks. Only with the use of labels is it possible to model/learn the manifold over the latent representation, usually as a post-processing step (Shu et al. 2017).

In this paper, we show that it is possible to learn a disentangled representation of the human face captured in arbitrary recording conditions in a pseudo-supervised manner (Footnote 1) by imposing a multilinear structure on the latent representation of an AAE (Shu et al. 2017). To the best of our knowledge, this is the first time that unsupervised tensor decompositions have been combined with DNNs for learning disentangled representations. We demonstrate the power of the proposed approach by showing expression/pose transfer using only the latent variable that is related to expression/pose. We also demonstrate that the disentangled low-dimensional embeddings are useful for many other applications, such as facial expression, pose, and identity recognition and clustering. An example of the proposed approach is given in Fig. 1. In particular, the left pair of images has been decomposed, using the encoder of the proposed neural network \(E(\cdot )\), into several latent representations, including latent representations for pose, illumination, identity and expression. Since our framework has learned a disentangled representation, we can easily transfer the expression by only changing the latent variable related to expression and passing the latent vector into the decoder of our neural network \(D(\cdot )\). Similarly, we can transfer the pose merely by changing the latent variable related to pose.

2 Related Work

Learning disentangled representations that explain multiple factors of variation in the data as disjoint latent dimensions is desirable in several machine learning, computer vision, and graphics tasks.

Indeed, bilinear factor analysis models (Tenenbaum and Freeman 2000) have been employed for disentangling two factors of variation (e.g., head pose and facial identity) in the data. Identity, expression, pose, and illumination variations are disentangled in Vasilescu and Terzopoulos (2002) by applying the Tucker decomposition [also known as multilinear Singular Value Decomposition (SVD) (De Lathauwer et al. 2000)] to a carefully constructed tensor built using label information. Interestingly, the modes of variation in well aligned images can be recovered via a multilinear matrix factorisation (Wang et al. 2017b) without any supervision. However, inference in Wang et al. (2017b) might be ill-posed.

More recently, both supervised and unsupervised deep learning methods have been developed for disentangled representation learning. Transforming auto-encoders (Hinton et al. 2011) are among the earliest methods for disentangling latent factors by means of auto-encoder capsules. In Desjardins et al. (2012), hidden factors of variation are disentangled via inference in a variant of the restricted Boltzmann machine. Disentangled representations of input images are obtained by the hidden layers of deep networks in Cheung et al. (2014) and through a higher-order Boltzmann machine in Reed et al. (2014). The Deep Convolutional Inverse Graphics Network (Kulkarni et al. 2015) learns a representation that is disentangled with respect to transformations such as out-of-plane rotations and lighting variations. The methods in Chen et al. (2016), Mathieu et al. (2016), Wang et al. (2017a), Tewari et al. (2017) and Tran et al. (2017) extract disentangled and interpretable visual representations by employing adversarial training. Recent works in face modeling (Tewari et al. 2018; Tran and Liu 2018) also employ self-supervision or pseudo-supervision to learn 3D Morphable Models from images. They rely on the use of a 3D-to-2D image rendering layer to separate shape and texture. In contrast to Tewari et al. (2018) and Tran and Liu (2018), the proposed network does not render the 3D shape into a 2D image. Learning the components of a 3D Morphable Model is an additional advantage of the pseudo-supervision employed. The method in Shu et al. (2017) disentangles the latent representations of illumination, surface normals, and albedo of face images using an image rendering pipeline. Trained with pseudo-supervision, Shu et al. (2017) undertakes multiple image editing tasks by manipulating the relevant latent representations. Nonetheless, this editing approach still requires expression labelling, as well as sufficient sampling of a specific expression.

Here, the proposed network is able to edit the expression of a face image given another single in-the-wild face image of arbitrary expression. Furthermore, we are able to edit the pose of a face in the image which is not possible in Shu et al. (2017).

3 Proposed Method

In this section, we introduce the main multilinear models used to describe three different image modalities, namely texture, 3D shape and 3D surface normals. To this end, we assume that for each modality there is a different core tensor, but all modalities share the same latent representation of weights for identity and expression. During training, all the core tensors inside the network are randomly initialised and learnt end-to-end. In the following, we assume that we have a set of n facial images (e.g., in the training batch) together with their corresponding 3D facial shapes and per-pixel normals (the 3D shape and normals have been produced by fitting a 3D model to the 2D image, e.g., Booth et al. 2017).

3.1 Facial Texture

The main assumption here follows from Wang et al. (2017b): the rich structure of visual data is a result of multiplicative interactions of hidden (latent) factors, and hence the underlying multilinear structure, as well as the corresponding weights (coefficients) that best explain the data, can be recovered using the unsupervised tensor decomposition (Wang et al. 2017b). Indeed, following Wang et al. (2017b), disentangled representations (e.g., of identity, expression, and illumination) can be learnt from frontalised facial images. The frontalisation process is performed by applying a piecewise affine transform using the sparse shape recovered by a face alignment process. Inevitably, this process suffers from warping artefacts. Therefore, rather than applying any warping process, we perform the multilinear decomposition only on near frontal faces, which can be automatically detected during the 3D face fitting stage. In particular, assuming a near frontal facial image rasterised into a vector \(\varvec{x}_{f} \in {\mathbb {R}}^{k_{x} \times 1}\) and given a core tensor \({\mathcal {Q}} \in {\mathbb {R}}^{k_x \times k_l \times k_{exp} \times k_{id}}\) (Footnote 2), this image can be decomposed as

$$\begin{aligned} \begin{aligned} \varvec{x}_{f}&= {\mathcal {Q}} \times _2\varvec{z}_{l} \times _3 \varvec{z}_{exp} \times _4 \varvec{z}_{id}, \end{aligned} \end{aligned}$$
(1)

where \(\varvec{z}_{l} \in {\mathbb {R}}^{k_l}, \varvec{z}_{exp} \in {\mathbb {R}}^{k_{exp}}\) and \(\varvec{z}_{id} \in {\mathbb {R}}^{k_{id}}\) are the weights that correspond to illumination, expression and identity respectively. The equivalent form, when the images in the batch are stacked in the columns of a matrix \(\varvec{X}_{f} \in {\mathbb {R}}^{k_x \times n}\), is

$$\begin{aligned} \varvec{X}_{f} = \varvec{Q}_{(1)} (\varvec{Z}_{l} \odot \varvec{Z}_{exp} \odot \varvec{Z}_{id}), \end{aligned}$$
(2)

where \(\varvec{Q}_{(1)}\) is a mode-1 matricisation of tensor \({\mathcal {Q}}\) and \(\varvec{Z}_{l}\), \(\varvec{Z}_{exp}\) and \(\varvec{Z}_{id}\) are the corresponding matrices that gather the weights of the decomposition for all images in the batch. That is, \(\varvec{Z}_{exp} \in {\mathbb {R}}^{k_{exp} \times n}\) stacks the n latent variables of expressions of the images, \(\varvec{Z}_{id} \in {\mathbb {R}}^{k_{id} \times n}\) stacks the n latent variables of identity and \(\varvec{Z}_{l} \in {\mathbb {R}}^{k_{l} \times n}\) stacks the n latent variables of illumination.
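
To make the batched form concrete, here is a minimal NumPy sketch (ours, with small illustrative dimensions) that builds Eq. (2) from a hand-written Khatri-Rao product and checks it against the mode-n products of Eq. (1):

```python
import numpy as np

# Illustrative dimensions (the paper uses k_x = 3*64*64, k_l = 9, k_exp = 15, k_id = 80)
k_x, k_l, k_exp, k_id, n = 12, 3, 4, 5, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((k_x, k_l, k_exp, k_id))      # core tensor
Z_l = rng.standard_normal((k_l, n))                   # illumination weights, one column per image
Z_exp = rng.standard_normal((k_exp, n))               # expression weights
Z_id = rng.standard_normal((k_id, n))                 # identity weights

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product of A (p x n) and B (q x n)."""
    p, n = A.shape
    q, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(p * q, n)

# Eq. (2): X_f = Q_(1) (Z_l ⊙ Z_exp ⊙ Z_id)
Q1 = Q.reshape(k_x, -1)                               # mode-1 matricisation of Q
X_f = Q1 @ khatri_rao(Z_l, khatri_rao(Z_exp, Z_id))

# Sanity check against Eq. (1) for the first image, using mode-n products with the weight vectors
x0 = np.einsum('abcd,b,c,d->a', Q, Z_l[:, 0], Z_exp[:, 0], Z_id[:, 0])
assert np.allclose(X_f[:, 0], x0)
```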

3.2 3D Facial Shape

It is quite common to use a bilinear model for disentangling identity and expression in 3D facial shape (Bolkart and Wuhrer 2016). Hence, for 3D shape we assume that there is a different core tensor \({\mathcal {B}} \in {\mathbb {R}}^{k_{3d} \times k_{exp} \times k_{id}}\) and each 3D facial shape \(\varvec{x}_{3d} \in {\mathbb {R}}^{k_{3d}}\) can be decomposed as:

$$\begin{aligned} \varvec{x}_{3d} = {\mathcal {B}} \times _2 \varvec{z}_{exp} \times _3 \varvec{z}_{id}, \end{aligned}$$
(3)

where \(\varvec{z}_{exp}\) and \(\varvec{z}_{id}\) are exactly the same weights as in the texture decomposition (2). The tensor decomposition for the n images in the batch is therefore written as

$$\begin{aligned} \varvec{X}_{3d} = \varvec{B}_{(1)} (\varvec{Z}_{exp} \odot \varvec{Z}_{id}), \end{aligned}$$
(4)

where \(\varvec{B}_{(1)}\) is the mode-1 matricisation of tensor \({\mathcal {B}}\).

3.3 Facial Normals

The tensor decomposition we opted to use for facial normals is exactly the same as for the texture; hence we can use the same core tensor and weights. The difference is that, since facial normals do not depend on illumination parameters (assuming a Lambertian illumination model), we simply replace the illumination weights with a constant (Footnote 3). Thus, the decomposition for normals can be written as

$$\begin{aligned} \varvec{X}_{N} = \varvec{Q}_{(1)} \left( \frac{1}{k_l} \varvec{{\mathfrak {1}}} \odot \varvec{Z}_{exp} \odot \varvec{Z}_{id}\right) , \end{aligned}$$
(5)

where \(\varvec{{\mathfrak {1}}}\) is a matrix of ones of the same size as \(\varvec{Z}_{l}\).

3.4 3D Facial Pose

Finally, we define another latent variable for the 3D pose. This latent variable \(\varvec{z}_p \in {\mathbb {R}}^{9}\) represents a 3D rotation. We denote by \(\varvec{x}^i \in {\mathbb {R}}^{k_x}\) the image at index i; in the following, indexing is denoted by the superscript. The corresponding \(\varvec{z}_p^i\) can be reshaped into a rotation matrix \(\varvec{R}^i \in {\mathbb {R}}^{3 \times 3}\). As proposed in Worrall et al. (2017), we apply this rotation to the feature of the image \(\varvec{x}^i\) created by 2-way synthesis (explained in Sect. 3.5). This feature vector is the i-th column of the feature matrix resulting from the 2-way synthesis \((\varvec{Z}_{exp} \odot \varvec{Z}_{id}) \in {\mathbb {R}}^{k_{exp}k_{id} \times n}\). We denote the feature vector corresponding to a single image as \((\varvec{Z}_{exp} \odot \varvec{Z}_{id})^i \in {\mathbb {R}}^{k_{exp}k_{id}}\). Next, \((\varvec{Z}_{exp} \odot \varvec{Z}_{id})^i\) is reshaped into a \(3 \times \frac{k_{exp}k_{id}}{3}\) matrix and left-multiplied by \(\varvec{R}^i\). After another round of vectorisation, the resulting feature vector in \({\mathbb {R}}^{k_{exp}k_{id}}\) becomes the input of the decoders for normals and albedo. This transformation from the feature vector \((\varvec{Z}_{exp} \odot \varvec{Z}_{id})^i\) to the rotated feature is referred to as rotation.
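
The operation described above can be sketched in a few lines of NumPy; the rotation matrix below is an arbitrary example, whereas in the network \(\varvec{R}^i\) comes from reshaping \(\varvec{z}_p^i\):

```python
import numpy as np

k_exp, k_id = 15, 80                        # as in the paper; k_exp * k_id is divisible by 3
f = np.random.randn(k_exp * k_id)           # i-th column of (Z_exp ⊙ Z_id), one image

# In the network, z_p^i (9 values) is reshaped into R^i; here we use an example rotation about z
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

# Reshape the feature into a 3 x (k_exp*k_id / 3) matrix, rotate, and vectorise again
F = f.reshape(3, -1)
f_rotated = (R @ F).reshape(-1)             # fed to the normal and albedo decoders
```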

Fig. 2

Our network is an end-to-end trained auto-encoder. The encoder E extracts latent variables corresponding to illumination, pose, expression and identity from the input image \(\varvec{x}\). These latent variables are then fed into the decoder D to reconstruct the image. We impose a multilinear structure and enforce the disentanglement of variations. The grey triangles represent the losses: adversarial loss A, verification loss V, \(L_1\) and \(L_2\) losses

3.5 Network Architecture

We incorporate the structure imposed by Eqs. (2), (4) and (5) into an auto-encoder network, see Fig. 2. For matrices \(\varvec{Y}_i \in {\mathbb {R}}^{k_{yi} \times n}\), we refer to the operation \(\varvec{Y}_1 \odot \varvec{Y}_2 \in {\mathbb {R}}^{k_{y1}k_{y2} \times n}\) as 2-way synthesis and to \(\varvec{Y}_1 \odot \varvec{Y}_2 \odot \varvec{Y}_3 \in {\mathbb {R}}^{k_{y1}k_{y2}k_{y3} \times n}\) as 3-way synthesis. The multiplication of a feature matrix by \(\varvec{B}_{(1)}\) or \(\varvec{Q}_{(1)}\), the mode-1 matricisations of tensors \({\mathcal {B}}\) and \({\mathcal {Q}}\), is referred to as projection and can be represented by a fully-connected layer without a bias term.
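
The sketch below (ours; layer names and dimensions are illustrative) shows how 2-way synthesis followed by projection can be realised in PyTorch, with the projection implemented as a linear layer without bias whose weight plays the role of \(\varvec{B}_{(1)}\):

```python
import torch
import torch.nn as nn

def synthesis(*Zs):
    """N-way synthesis: column-wise Khatri-Rao product of matrices of shape (k_i, n)."""
    out = Zs[0]
    for Z in Zs[1:]:
        k1, n = out.shape
        k2, _ = Z.shape
        out = (out.unsqueeze(1) * Z.unsqueeze(0)).reshape(k1 * k2, n)
    return out

k_exp, k_id, k_3d, n = 15, 80, 3 * 9375, 4
Z_exp, Z_id = torch.randn(k_exp, n), torch.randn(k_id, n)

# Projection: multiplication by the mode-1 matricisation, i.e. a fully-connected layer with no bias
projection = nn.Linear(k_exp * k_id, k_3d, bias=False)    # its weight plays the role of B_(1)

X_3d_hat = projection(synthesis(Z_exp, Z_id).t()).t()      # cf. Eq. (4), shape (k_3d, n)
```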

Our network follows the architecture of Shu et al. (2017). The encoder E receives an input image \(\varvec{x}\) and the convolutional encoder stack first encodes it into \(\varvec{z}_{i}\), an intermediate latent variable vector of size \(128 \times 1\). \(\varvec{z}_{i}\) is then transformed into latent codes for background \(\varvec{z}_{b}\), mask \(\varvec{z}_{m}\), illumination \(\varvec{z}_{l}\), pose \(\varvec{z}_{p}\), identity \(\varvec{z}_{id}\) and expression \(\varvec{z}_{exp}\) via fully-connected layers.

$$\begin{aligned} E(\varvec{x}) = [\varvec{z}_{b}, \varvec{z}_{m}, \varvec{z}_{l}, \varvec{z}_{p}, \varvec{z}_{id}, \varvec{z}_{exp} ]^T. \end{aligned}$$
(6)

The decoder D takes in the latent codes as input. \(\varvec{z}_{b}\) and \(\varvec{z}_{m}\) (\(128 \times 1\) vectors) are directly passed into convolutional decoder stacks to estimate background and face mask respectively. The remaining latent variables follow 3 streams:

  1. \(\varvec{z}_{exp}\) (a \(15 \times 1\) vector) and \(\varvec{z}_{id}\) (an \(80 \times 1\) vector) are joined by 2-way synthesis and projection to estimate the facial shape \(\hat{\varvec{x}_{3d}}\).

  2. The result of the 2-way synthesis of \(\varvec{z}_{exp}\) and \(\varvec{z}_{id}\) is rotated using \(\varvec{z}_{p}\). The rotated feature is passed into 2 different convolutional decoder stacks: one for normal estimation and another for albedo. Using the estimated normal map, albedo, illumination component \(\varvec{z}_{l}\), mask and background, we render a reconstructed image \(\hat{\varvec{x}}\).

  3. \(\varvec{z}_{exp}\), \(\varvec{z}_{id}\) and \(\varvec{z}_{l}\) are combined by 3-way synthesis and projection to estimate a frontal normal map and a frontal reconstruction of the image.

Streams 1 and 3 drive the disentangling of expression and identity components, while stream 2 focuses on the reconstruction of the image by adding the pose components. The decoder D then outputs the reconstructed image from the latent codes.

$$\begin{aligned} D(\varvec{z}_{b}, \varvec{z}_{m}, \varvec{z}_{l}, \varvec{z}_{p}, \varvec{z}_{id}, \varvec{z}_{exp}) = \hat{\varvec{x}}. \end{aligned}$$
(7)

Our input images are aligned and cropped facial images from the CelebA database (Liu et al. 2015) of size \(64 \times 64\), so \(k_x = 3 \times 64 \times 64\). The remaining dimensions are \(k_{3d} = 3 \times 9375\), \(k_{l} = 9\), \(k_{id} = 80\) and \(k_{exp} = 15\). More details on the network, such as the convolutional encoder and decoder stacks, can be found in the supplementary material.

3.6 Training

We use in-the-wild face images for training. Hence, we only have access to the image itself (\(\varvec{x}\)), while ground truth labelling for pose, illumination, normals, albedo, expression, identity or 3D shape is unavailable. The main loss function is defined on the reconstruction of the image \(\varvec{x}\):

$$\begin{aligned} E_x = E_{recon} + \lambda _{adv} E_{adv} + \lambda _{veri} E_{veri}, \end{aligned}$$
(8)

where \(\hat{\varvec{x}}\) is the reconstructed image, \(E_{recon} = \Vert \varvec{x} - \hat{\varvec{x}} \Vert _1\) is the reconstruction loss, \(\lambda _{adv}\) and \(\lambda _{veri}\) are regularisation weights, \(E_{adv}\) represents the adversarial loss and \(E_{veri}\) the verification loss. We use the pre-trained verification network \({\mathcal {V}}\) (Wu et al. 2015) to extract face embeddings of our images \(\varvec{x}\) and \(\hat{\varvec{x}}\). As both images are supposed to represent the same person, we minimise the cosine distance between the embeddings: \(E_{veri} = 1 - \cos ({\mathcal {V}}(\varvec{x}), {\mathcal {V}}(\hat{\varvec{x}}))\). Simultaneously, a discriminative network \({\mathcal {D}}\) is trained to distinguish between the generated and real images (Goodfellow et al. 2014). We incorporate the discriminative information by following the auto-encoder loss distribution matching approach of Berthelot et al. (2017). The discriminative network \({\mathcal {D}}\) is itself an auto-encoder trying to reconstruct the input image \(\varvec{x}\), so the adversarial loss is \(E_{adv} = \Vert \hat{\varvec{x}} - {\mathcal {D}}(\hat{\varvec{x}})\Vert _1\). \({\mathcal {D}}\) is trained to minimise \(\Vert \varvec{x} - {\mathcal {D}}(\varvec{x})\Vert _1 - k_t \Vert \hat{\varvec{x}} - {\mathcal {D}}(\hat{\varvec{x}})\Vert _1\).
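
A sketch (ours) of how these terms can be assembled; the mean-reduced L1 norms, the default weights and the handling of the BEGAN balancing term \(k_t\) are illustrative assumptions rather than the authors' exact settings:

```python
import torch
import torch.nn.functional as F

def generator_loss(x, x_hat, D_net, V_net, lambda_adv=0.1, lambda_veri=0.1):
    """E_x = E_recon + lambda_adv * E_adv + lambda_veri * E_veri, cf. Eq. (8).

    D_net is the auto-encoder discriminator and V_net a pre-trained face verification
    network returning one embedding per image (batch x d)."""
    E_recon = (x - x_hat).abs().mean()                           # L1 reconstruction loss
    E_adv = (x_hat - D_net(x_hat)).abs().mean()                  # BEGAN-style adversarial loss
    E_veri = 1.0 - F.cosine_similarity(V_net(x), V_net(x_hat), dim=1).mean()
    return E_recon + lambda_adv * E_adv + lambda_veri * E_veri

def discriminator_loss(x, x_hat, D_net, k_t):
    """||x - D(x)||_1 - k_t ||x_hat - D(x_hat)||_1, minimised by the discriminator."""
    x_hat = x_hat.detach()                                       # do not backprop into the generator
    return (x - D_net(x)).abs().mean() - k_t * (x_hat - D_net(x_hat)).abs().mean()
```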

As fully unsupervised training often results in semantically meaningless latent representations, Shu et al. (2017) proposed to train with pseudo ground truth values for normals, lighting and 3D facial shape. We adopt this technique here and introduce further pseudo ground truth values for pose \(\hat{\varvec{x}_p}\), expression \(\hat{\varvec{x}_{exp}}\) and identity \(\hat{\varvec{x}_{id}}\), which are obtained by fitting coarse face geometry to every image in the training set using a 3D Morphable Model (Booth et al. 2017). We also incorporate the constraints used in Shu et al. (2017) for illumination, normals and albedo. Hence, the following new objectives are introduced:

$$\begin{aligned} E_{p} = \Vert \varvec{z}_p - \hat{\varvec{x}_p} \Vert ^2_2, \end{aligned}$$
(9)

where \(\hat{\varvec{x}_p}\) is the 3D camera rotation matrix obtained from the 3DMM fitting.

$$\begin{aligned} E_{exp} = \Vert fc(\varvec{z}_{exp}) - \hat{\varvec{x}_{exp}} \Vert ^2_2, \end{aligned}$$
(10)

where fc(\(\cdot \)) is a fully-connected layer and \(\hat{\varvec{x}_{exp}} \in {\mathbb {R}}^{28}\) is a pseudo ground truth vector representing 3DMM expression components of the image \(\varvec{x}\).

$$\begin{aligned} E_{id} = \Vert fc(\varvec{z}_{id}) - \hat{\varvec{x}_{id}} \Vert ^2_2, \end{aligned}$$
(11)

where fc(\(\cdot \)) is a fully-connected layer and \(\hat{\varvec{x}_{id}} \in {\mathbb {R}}^{157}\) is a pseudo ground truth vector representing 3DMM identity components of the image \(\varvec{x}\).
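
The pseudo-supervision terms of Eqs. (9)-(11) can be sketched as follows (ours); the fully-connected layers and target dimensions follow the text, while the batching and mean reduction are assumptions:

```python
import torch
import torch.nn as nn

k_p, k_exp, k_id = 9, 15, 80
fc_exp = nn.Linear(k_exp, 28)     # maps z_exp to the 28 3DMM expression components
fc_id = nn.Linear(k_id, 157)      # maps z_id to the 157 3DMM identity components

def pseudo_supervision_losses(z_p, z_exp, z_id, x_p_hat, x_exp_hat, x_id_hat):
    """E_p, E_exp and E_id of Eqs. (9)-(11); the x_*_hat are the 3DMM pseudo ground truths.

    All tensors have one row per image in the batch."""
    E_p = ((z_p - x_p_hat) ** 2).sum(dim=1).mean()
    E_exp = ((fc_exp(z_exp) - x_exp_hat) ** 2).sum(dim=1).mean()
    E_id = ((fc_id(z_id) - x_id_hat) ** 2).sum(dim=1).mean()
    return E_p, E_exp, E_id
```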

Fig. 3

Our proof-of-concept network is an end-to-end trained auto-encoder. The encoder E extracts latent variables corresponding to expression and identity from the input image \(\varvec{x}\). These latent variables are then fed into the decoder D to reconstruct the image. A separate stream also reconstructs facial texture from \(\varvec{z}_{id}\). We impose a multilinear structure and enforce the disentanglement of variations. In the extended version (a), the encoder also extracts a latent variable corresponding to pose. The decoder takes in this information and reconstructs an image containing pose variations

3.6.1 Multilinear Losses

Directly applying the above losses as constraints to the latent variables does not result in a well-disentangled representation. To achieve better performance, we impose a tensor structure on the image using the following losses:

$$\begin{aligned} E_{3d} = \Vert \hat{\varvec{x}_{3d}} - {\mathcal {B}} \times _2 \varvec{z}_{exp} \times _3 \varvec{z}_{id} \Vert ^2_2 , \end{aligned}$$
(12)

where \(\hat{\varvec{x}_{3d}}\) is the 3D facial shape of the fitted model.

$$\begin{aligned} E_{f} = \Vert \varvec{x}_{f} - {\mathcal {Q}} \times _2 \varvec{z}_{l} \times _3 \varvec{z}_{exp} \times _4 \varvec{z}_{id} \Vert ^2_2, \end{aligned}$$
(13)

where \(\varvec{x}_{f}\) is a near-frontal face image. During training, \(E_{f}\) is only applied to near-frontal face images, which are filtered using \(\hat{\varvec{x}_p}\).

$$\begin{aligned} E_{n} = \Vert \hat{\varvec{n}_{f}} - {\mathcal {Q}} \times _2 \frac{1}{k_l} \varvec{1} \times _3 \varvec{z}_{exp} \times _4 \varvec{z}_{id} \Vert ^2_2, \end{aligned}$$
(14)

where \(\hat{\varvec{n}_{f}}\) is a near-frontal normal map. During training, the loss \(E_{n}\) is only applied to near-frontal normal maps.

The model is trained end-to-end by applying gradient descent to batches of images, where Eqs. (12), (13) and (14) are written in the following general form:

$$\begin{aligned} E = \Vert \varvec{X} - \varvec{B}_{(1)} (\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(M)})\Vert ^2_F, \end{aligned}$$
(15)

where M is the number of modes of variation, \(\varvec{X} \in {\mathbb {R}}^{k \times n}\) is a data matrix, \(\varvec{B}_{(1)}\) is the mode-1 matricisation of a tensor \({\mathcal {B}}\) and \(\varvec{Z}^{(i)} \in {\mathbb {R}}^{k_{zi} \times n}\) are the latent variable matrices.
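
For clarity, the batched multilinear loss of Eq. (15) can also be written directly in an automatic-differentiation framework, in which case the gradients of Eqs. (16) and (21) are obtained automatically; the explicit derivation below remains useful for a custom implementation. A minimal PyTorch sketch (ours, with illustrative dimensions):

```python
import torch
from math import prod

def khatri_rao(matrices):
    """Column-wise Khatri-Rao product of a list of (k_i, n) matrices."""
    out = matrices[0]
    for Z in matrices[1:]:
        k1, n = out.shape
        k2, _ = Z.shape
        out = (out.unsqueeze(1) * Z.unsqueeze(0)).reshape(k1 * k2, n)
    return out

def multilinear_loss(X, B1, Zs):
    """E = ||X - B_(1) (Z^(1) ⊙ ... ⊙ Z^(M))||_F^2, cf. Eq. (15)."""
    return ((X - B1 @ khatri_rao(Zs)) ** 2).sum()

# Gradients with respect to each Z^(i) (cf. Eqs. (16) and (21)) via automatic differentiation
k, n, dims = 6, 5, [2, 3, 4]
Zs = [torch.randn(d, n, requires_grad=True) for d in dims]
B1, X = torch.randn(k, prod(dims)), torch.randn(k, n)
multilinear_loss(X, B1, Zs).backward()   # Zs[i].grad now holds dE/dZ^(i)
```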

The partial derivative of (15) with respect to the latent variable \(\varvec{Z}^{(i)}\) is computed as follows: let \(\hat{\varvec{x}} = vec(\varvec{X})\) be the vectorised \(\varvec{X}\), \(\hat{\varvec{z}}^{(i)} = vec(\varvec{Z}^{(i)})\) be the vectorised \(\varvec{Z}^{(i)}\),

\(\hat{\varvec{Z}^{(i-1)}} = \varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(i-1)} \) and \(\hat{\varvec{Z}^{(i+1)}} = \varvec{Z}^{(i+1)} \odot \dots \odot \varvec{Z}^{(M)} \); then (15) is equivalent to:

$$\begin{aligned} \begin{aligned}&\Vert \hat{\varvec{x}} - (\varvec{I} \otimes \varvec{B}_{(1)}) vec(\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(M)} )\Vert ^2_F \\&\quad = \Vert \hat{\varvec{x}} - (\varvec{I} \otimes \varvec{B}_{(1)}) (\varvec{I} \odot \hat{\varvec{Z}^{(i-1)}} ) \otimes \varvec{I} \\&\qquad \cdot \varvec{I} \odot (\hat{\varvec{Z}^{(i+1)}} (\varvec{I} \otimes \varvec{\mathbb {1}}) ) \cdot \hat{\varvec{z}^{(i)}}\Vert ^2_2 \end{aligned} \end{aligned}$$
(16)

Consequently, the partial derivative of (15) with respect to \(\varvec{Z}^{(i)}\) is obtained by matricising the partial derivative of (16) with respect to \(\hat{\varvec{z}}^{(i)}\). The derivation details are given in the subsequent section.

3.6.2 Derivation Details

The model is trained end-to-end by applying gradient descent to batches of images, where (12), (13) and (14) are written in the following general form:

$$\begin{aligned} E = \Vert \varvec{X} - \varvec{B}_{(1)} (\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(M)})\Vert ^2_F, \end{aligned}$$
(15)

where \(\varvec{X} \in {\mathbb {R}}^{k \times n}\) is a data matrix, \(\varvec{B}_{(1)}\) is the mode-1 matricisation of a tensor \({\mathcal {B}}\) and \(\varvec{Z}^{(i)} \in {\mathbb {R}}^{k_{zi} \times n}\) are the latent variable matrices.

The partial derivative of (15) with respect to the latent variable \(\varvec{Z}^{(i)}\) is computed as follows: let \(\hat{\varvec{x}} = vec(\varvec{X})\) be a vectorisation of \(\varvec{X}\); then (15) is equivalent to:

$$\begin{aligned} \begin{aligned}&\Vert \varvec{X} - \varvec{B}_{(1)} (\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(M)})\Vert ^2_F \\&\quad = \Vert vec(\varvec{X} - \varvec{B}_{(1)} (\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(M)}))\Vert ^2_2 \\&\quad = \Vert \hat{\varvec{x}} - vec(\varvec{B}_{(1)} (\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(M)}))\Vert ^2_2, \end{aligned} \end{aligned}$$
(17)

as both the squared Frobenius norm and the squared \(L_2\) norm equal the sum of all squared elements.

$$\begin{aligned} \begin{aligned}&\Vert \hat{\varvec{x}} - vec(\varvec{B}_{(1)} (\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(M)}))\Vert ^2_2 \\&\quad = \Vert \hat{\varvec{x}} - (\varvec{I} \otimes \varvec{B}_{(1)}) vec(\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(M)} )\Vert ^2_2, \end{aligned} \end{aligned}$$
(18)

as the property \(vec(\varvec{BZ}) = (\varvec{I} \otimes \varvec{B}) vec(\varvec{Z})\) holds (Neudecker 1969).

Using \(vec(\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} ) = (\varvec{I} \odot \varvec{Z}^{(1)}) \otimes \varvec{I} \cdot vec(\varvec{Z}^{(2)})\) (Roemer 2012) and letting \(\hat{\varvec{Z}^{(i-1)}} = \varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(i-1)} \) and \(\hat{\varvec{Z}^{(i)}} = \varvec{Z}^{(i)} \odot \dots \odot \varvec{Z}^{(M)} \), the following holds:

$$\begin{aligned} \begin{aligned}&\Vert \hat{\varvec{x}} - (\varvec{I} \otimes \varvec{B}_{(1)}) vec(\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} \odot \dots \odot \varvec{Z}^{(M)} )\Vert ^2_2 \\&\quad = \Vert \hat{\varvec{x}} - (\varvec{I} \otimes \varvec{B}_{(1)}) (\varvec{I} \odot \hat{\varvec{Z}^{(i-1)}} ) \otimes \varvec{I} \cdot vec(\hat{\varvec{Z}^{(i)}} )\Vert ^2_2 \\ \end{aligned} \end{aligned}$$
(19)

Using \(vec(\varvec{Z}^{(1)} \odot \varvec{Z}^{(2)} ) = \varvec{I} \odot (\varvec{Z}^{(2)} (\varvec{I} \otimes \varvec{\mathbb {1}}) ) \cdot vec(\varvec{Z}^{(1)})\) (Roemer 2012) and letting \(\hat{\varvec{Z}^{(i+1)}} = \varvec{Z}^{(i+1)} \odot \dots \odot \varvec{Z}^{(M)} \):

$$\begin{aligned} \begin{aligned}&\Vert \hat{\varvec{x}} - (\varvec{I} \otimes \varvec{B}_{(1)}) (\varvec{I} \odot \hat{\varvec{Z}^{(i-1)}} ) \otimes \varvec{I} \cdot vec(\hat{\varvec{Z}^{(i)}} )\Vert ^2_2 \\&\quad = \Vert \hat{\varvec{x}} - (\varvec{I} \otimes \varvec{B}_{(1)}) (\varvec{I} \odot \hat{\varvec{Z}^{(i-1)}} ) \otimes \varvec{I} \\&\qquad \cdot \varvec{I} \odot (\hat{\varvec{Z}^{(i+1)}} (\varvec{I} \otimes \varvec{\mathbb {1}}) ) \cdot vec(\varvec{Z}^{(i)})\Vert ^2_2 \\ \end{aligned} \end{aligned}$$
(20)

Letting \(\hat{\varvec{z}}^{(i)} = vec(\varvec{Z}^{(i)})\) be a vectorisation of \(\varvec{Z}^{(i)}\), this becomes:

$$\begin{aligned} \begin{aligned}&\Vert \hat{\varvec{x}} - (\varvec{I} \otimes \varvec{B}_{(1)}) (\varvec{I} \odot \hat{\varvec{Z}^{(i-1)}} ) \otimes \varvec{I} \\&\quad \cdot \varvec{I} \odot (\hat{\varvec{Z}^{(i+1)}} (\varvec{I} \otimes \varvec{\mathbb {1}}) ) \cdot \hat{\varvec{z}^{(i)}}\Vert ^2_2 \\ \end{aligned} \end{aligned}$$
(16)

We then compute the partial derivative of (16) with respect to \(\hat{\varvec{z}^{(i)}}\):

$$\begin{aligned} \frac{\partial \Vert \hat{\varvec{x}} - \varvec{A} \hat{\varvec{z}^{(i)}} \Vert ^2_2}{\partial \hat{\varvec{z}^{(i)}}} = 2 \varvec{A}^T (\varvec{A} \cdot \hat{\varvec{z}^{(i)}} - \hat{\varvec{x}} ), \end{aligned}$$
(21)

where \(\varvec{A} = (\varvec{I} \otimes \varvec{B}_{(1)}) (\varvec{I} \odot \hat{\varvec{Z}^{(i-1)}} ) \otimes \varvec{I} \cdot \varvec{I} \odot (\hat{\varvec{Z}^{(i+1)}} (\varvec{I} \otimes \varvec{\mathbb {1}}) ) \).

The partial derivative of (15) with respect to \(\varvec{Z}^{(i)}\) is obtained by matricising (21).

To efficiently compute the above-mentioned operations, Tensorly (Kossaifi et al. 2016) has been employed.
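
For reference, the operations above map directly onto Tensorly calls such as `unfold`, `khatri_rao` and `multi_mode_dot`; a small sketch with illustrative dimensions (assuming Tensorly's default NumPy backend and non-reversed Khatri-Rao ordering):

```python
import numpy as np
import tensorly as tl
from tensorly.tenalg import khatri_rao, multi_mode_dot

k_x, k_l, k_exp, k_id, n = 12, 3, 4, 5, 8
Q = np.random.randn(k_x, k_l, k_exp, k_id)            # core tensor

# Eq. (1): mode-n products of the core tensor with the per-image weight vectors
z_l, z_exp, z_id = np.random.randn(k_l), np.random.randn(k_exp), np.random.randn(k_id)
x_f = multi_mode_dot(Q, [z_l, z_exp, z_id], modes=[1, 2, 3])

# Eq. (2): mode-1 matricisation (mode 0 in Tensorly's 0-indexed convention) and Khatri-Rao product
Z_l, Z_exp, Z_id = np.random.randn(k_l, n), np.random.randn(k_exp, n), np.random.randn(k_id, n)
X_f = tl.unfold(Q, mode=0) @ khatri_rao([Z_l, Z_exp, Z_id])
```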

Fig. 4

Our network is able to transfer the expression from one face to another by disentangling the expression components of the images. The ground truth has been computed using the ground truth texture with synthetic identity and expression components

Fig. 5

Given a single image, we infer meaningful expression and identity components to reconstruct a 3D mesh of the face. We compare the reconstruction (last row) against the ground truth (2nd row)

Fig. 6

Given a single image, we infer the facial texture. We compare the reconstructed facial texture (last row) against the ground truth texture (2nd row)

Fig. 7

Our network is able to transfer the pose from one face to another by disentangling the pose, expression and identity components of the images. The ground truth has been computed using the ground truth texture with synthetic pose, identity and expression components

4 Proof of Concept Experiments

We develop a lighter version of our proposed network, a proof-of-concept network (visualised in Fig. 3), to show that our network is able to learn and disentangle pose, expression and identity.

In order to showcase the ability of the network, we leverage our newly proposed 4DFAB database (Cheng et al. 2018), where subjects were invited to attend four sessions at different times over a span of five years. In each experiment session, the subject was asked to articulate 6 different facial expressions (anger, disgust, fear, happiness, sadness, surprise), and we manually select the most expressive mesh (i.e., the apex frame) for this experiment. In total, 1795 facial meshes from 364 recording sessions (with 170 unique identities) are used. We keep 148 identities for training and leave 22 identities for testing; there is no overlap of identities between the two sets. Within the training set, we synthetically augment each facial mesh by generating new facial meshes with 20 randomly selected expressions. Our training set contains 35,900 meshes in total, and the test set contains 387 meshes. For each mesh, we have the ground truth facial texture as well as the expression and identity components of the 3DMM model.

4.1 Disentangling Expression and Identity

We create frontal images of the facial meshes, hence there is no illumination or pose variation in this training dataset. On this synthetic dataset, we train a proof-of-concept network, a lighter version of our network obtained by removing the illumination and pose streams (visualised in Fig. 3).

4.1.1 Expression Editing

We show the disentanglement between expression and identity by transferring the expression of one person to another.

For this experiment, we work with unseen data (a hold-out set consisting of 22 unseen identities) and no labels. We first encode both input images \(\varvec{x}^i\) and \(\varvec{x}^j\):

$$\begin{aligned} \begin{aligned} E(\varvec{x}^i)&= \varvec{z}_{exp}^i, \varvec{z}_{id}^i,\\ E(\varvec{x}^j)&= \varvec{z}_{exp}^j, \varvec{z}_{id}^j, \end{aligned} \end{aligned}$$
(22)

where \(E(\cdot )\) is our encoder and \(\varvec{z}_{exp}\) and \(\varvec{z}_{id}\) are the latent representations of expression and identity respectively.

Assuming we want \(\varvec{x}^i\) to emulate the expression of \(\varvec{x}^j\), we decode:

$$\begin{aligned} \begin{aligned} D(\varvec{z}_{exp}^j, \varvec{z}_{id}^i)&= \varvec{x}^{ji}, \end{aligned} \end{aligned}$$
(23)

where \(D(\cdot )\) is our decoder. The resulting \(\varvec{x}^{ji}\) becomes our edited image, where \(\varvec{x}^{i}\) has the expression of \(\varvec{x}^{j}\). Figure 4 shows how the network is able to separate expression and identity: the edited images clearly maintain the identity while the expression changes.
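
In code, expression editing amounts to swapping a single latent code before decoding. A schematic sketch is given below; the encoder/decoder interfaces (a dictionary of latent codes and keyword arguments) are placeholders for the trained networks, not the authors' actual API:

```python
import torch

def transfer_expression(encoder, decoder, x_i, x_j):
    """Give image x_i the expression of image x_j, cf. Eqs. (22)-(23)."""
    with torch.no_grad():
        z_i = encoder(x_i)                               # assumed to return {'exp': ..., 'id': ...}
        z_j = encoder(x_j)
        x_ji = decoder(z_exp=z_j['exp'], z_id=z_i['id'])
    return x_ji
```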

4.1.2 3D Reconstruction and Facial Texture

The latent variables \(\varvec{z}_{exp}\) and \(\varvec{z}_{id}\) that our network learns are extremely meaningful. Not only can they be used to reconstruct the image in 2D, but they can also be mapped to the expression (\(\varvec{x}_{exp}\)) and identity (\(\varvec{x}_{id}\)) components of a 3DMM model. This mapping is learnt inside the network. By replacing the expression and identity components of a mean face shape with \(\hat{\varvec{x}_{exp}}\) and \(\hat{\varvec{x}_{id}}\), we are able to reconstruct the 3D mesh of a face given a single input image. We compare these reconstructed meshes against the ground truth 3DMM used to create the input image in Fig. 5.

At the same time, the network is able to learn a mapping from \(\varvec{z}_{id}\) to facial texture. Therefore, we can predict the facial texture given a single input image. We compare the reconstructed facial texture with the ground truth facial texture in Fig. 6.

4.2 Disentangling Pose, Expression and Identity

Our synthetic training set contains 35,900 meshes in total. For each mesh, we have the ground truth facial texture as well as the expression and identity components of the 3DMM, from which we create a corresponding image with one of 7 given poses. As there is no illumination variation in this training set, we train a proof-of-concept network with the illumination stream removed (visualised in Fig. 3a) on this synthetic dataset.

4.2.1 Pose Editing

We show the disentanglement between pose, expression and identity by transferring the pose of one person to another. Figure 7 shows how the network is able to separate pose from expression and identity. This experiment highlights the ability of our proposed network to learn large pose variations even from profile to frontal faces.

Fig. 8

We compare our expression editing results with Wang et al. (2017b). As Wang et al. (2017b) is not able to disentangle pose, editing expressions from images of different poses returns noisy results

Fig. 9

Our network is able to transfer the expression from one face to another by disentangling the expression components of the images. We compare our expression editing results with a baseline where a 3DMM has been fit to both input images

Fig. 10

Our network is able to transfer the pose of one face to another by disentangling the pose components of the images. We compare our pose editing results with a baseline where a 3DMM has been fit to both input images

Fig. 11

Our network is also able to transfer the identity of one image to another by disentangling the identity components of the images

5 Experiments in-the-Wild

We train our network on in-the-wild data and perform several experiments on unseen data to show that our network is indeed able to disentangle illumination, pose, expression and identity.

We edit expression or pose by swapping the latent expression/pose component learnt by the encoder E [Eq. (6)] with the latent expression/pose component predicted from another image. We feed the decoder D [Eq. (7)] with the modified latent component to retrieve our edited image.

5.1 Expression, Pose and Identity Editing in-the-Wild

Given two in-the-wild images of faces, we are able to transfer the expression or pose of one person to another. We are also able to swap the face of the person from one image to another. Transferring the expression between two different facial images without fitting a 3D model is a very challenging problem. Generally, it is considered in the context of the same person under an elaborate blending framework (Yang et al. 2011) or by transferring certain classes of expressions (Sagonas et al. 2017).

For this experiment, we work with completely unseen data (a hold-out set of CelebA) and no labels. We first encode both input images \(\varvec{x}^i\) and \(\varvec{x}^j\):

$$\begin{aligned} \begin{aligned} E(\varvec{x}^i)&= \varvec{z}_{exp}^i, \varvec{z}_{id}^i, \varvec{z}_{p}^i\\ E(\varvec{x}^j)&= \varvec{z}_{exp}^j, \varvec{z}_{id}^j, \varvec{z}_{p}^j, \end{aligned} \end{aligned}$$
(24)

where \(E(\cdot )\) is our encoder and \(\varvec{z}_{exp}\), \(\varvec{z}_{id}\), \(\varvec{z}_{p}\) are the latent representations of expression, identity and pose respectively.

Assuming we want \(\varvec{x}^i\) to take on the expression, pose or identity of \(\varvec{x}^j\), we then decode:

$$\begin{aligned} \begin{aligned} D(\varvec{z}_{exp}^j, \varvec{z}_{id}^i, \varvec{z}_{p}^i)&= \varvec{x}^{jii} \\ D(\varvec{z}_{exp}^i, \varvec{z}_{id}^i, \varvec{z}_{p}^j)&= \varvec{x}^{iij}\\ D(\varvec{z}_{exp}^i, \varvec{z}_{id}^j, \varvec{z}_{p}^i)&= \varvec{x}^{iji}\\ \end{aligned} \end{aligned}$$
(25)

where \(D(\cdot )\) is our decoder.

The resulting \(\varvec{x}^{jii}\) then becomes our result image, where \(\varvec{x}^{i}\) has the expression of \(\varvec{x}^{j}\). \(\varvec{x}^{iij}\) is the edited image where \(\varvec{x}^{i}\) has been changed to the pose of \(\varvec{x}^{j}\). \(\varvec{x}^{iji}\) is the edit where the face of \(\varvec{x}^{i}\) has been changed to the face of \(\varvec{x}^{j}\).

As there is currently no prior work for this expression editing experiment that does not fit an AAM (Cootes et al. 2001) or a 3DMM, we use the image synthesised by the fitted 3DMM as a baseline, which indeed performs quite well. Compared with our method, other very closely related works (Wang et al. 2017b; Shu et al. 2017) are not able to disentangle illumination, pose, expression and identity. In particular, Shu et al. (2017) disentangles the illumination of an image, while Wang et al. (2017b) disentangles illumination, expression and identity from “frontalised” images. Hence, they are not able to disentangle pose, and none of these methods can be applied to the expression/pose editing experiments on a dataset that contains pose variations, such as CelebA. If Wang et al. (2017b) were applied directly to our test images, it would not be able to perform expression editing well, as shown in Fig. 8.

For the 3DMM baseline, we fit a shape model to both images and extract the expression components of the model. This fitting step has a high overhead of 20 s per image. We then generate a new face shape using the expression components of one face and the identity components of the other face in the same 3DMM setting. This technique has a much higher overhead than our proposed method, as it requires time-consuming 3DMM fitting of the images. Our expression editing results and the baseline results are shown in Fig. 9. Though the baseline is very strong, it does not change the texture of the face, which can produce unnatural-looking faces that still display the original expression in the texture. Also, the baseline method cannot fill in the inner mouth area. Our editing results show more natural-looking faces.

For pose editing, the background is unknown once the pose has changed; thus, for this experiment, we focus mainly on the face region. Figure 10 shows our pose editing results. For the baseline method, we fit a 3DMM to both images and estimate the rotation matrix. We then synthesise \(\varvec{x}^{i}\) with the rotation of \(\varvec{x}^{j}\). This technique has a high overhead, as it requires expensive 3DMM fitting of the images.

Figure 11 shows our results on the task of face swapping where the identity of one image has been swapped with the face of another person from the second image.

Fig. 12

Histogram of cosine similarities on 600 pairs of “non-same” people from CelebA

5.1.1 Quantitative Studies

We conducted a quantitative evaluation of the expression editing experiment. We ran a face recognition experiment on 50 pairs of images where only the expression has been transferred. We passed them to a face recognition network (Deng et al. 2018) and extracted their respective embeddings. All 50 pairs of embeddings had a cosine similarity larger than 0.3. In comparison, we selected 600 pairs of different people from CelebA and computed their average cosine similarity, which is 0.062. The histogram of these cosine similarities is visualised in Fig. 12. This indicates that the expression editing does preserve identity in terms of machine perception.
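
The check reduces to comparing embedding cosine similarities against a threshold; a sketch (ours), where `embed` stands for the face recognition network of Deng et al. (2018) and 0.3 is the threshold quoted above:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_preserved(embed, img, img_edited, threshold=0.3):
    """True if the edited image keeps the identity of the original according to the embeddings."""
    return cosine_similarity(embed(img), embed(img_edited)) > threshold
```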

Fig. 13

Ablation study on different losses (multilinear, adversarial, verification) for expression editing. The results show that incorporating multilinear losses indeed helps the network to better disentangle the expression variations

Fig. 14

Ablation study on different losses (multilinear, adversarial, verification) for facial pose editing. The results show that incorporating multilinear losses helps the network to better disentangle the pose variations

5.1.2 Ablation Studies

We performed a series of ablation studies. We first trained a network without multilinear losses by simply feeding the concatenated parameters \(\varvec{p}=[\varvec{z}_{p}, \varvec{z}_{exp}, {\varvec{z}_{id}}]\) to the decoder; thus the training of the network is driven only by the reconstruction loss and the pseudo-supervision from the 3DMM on the pose, expression and identity latent variables, i.e., \(\varvec{z}_{p}, \varvec{z}_{exp}\) and \({\varvec{z}_{id}}\). Next, we incorporated the other losses (i.e., multilinear losses, adversarial loss, verification loss) step by step in the network and trained different models. In this way, we can observe at each step how each additional loss improves the result.

Fig. 15

Texture reconstruction compared with Tewari et al. (2018), Tran and Liu (2018). Tewari et al. (2018), Tran and Liu (2018) have been trained with images of higher resolutions of 240\(\,\times \,\)240 and 128\(\,\times \,\)128 respectively. In comparison our model has only been trained with images of size 64\(\,\times \,\)64 pixels

Fig. 16

Expression interpolation

Fig. 17

Identity interpolation

In Figs. 13 and 14, we compare the expression and pose editing results. We find that the results without multilinear losses show some entanglement of the variations in terms of illumination, identity, expression and pose. In particular, the entanglement with illumination is strong; examples can be found in the second and ninth rows of Fig. 13. Indeed, by incorporating multilinear losses in the network, the identity and expression variations are better disentangled. Furthermore, the incorporation of the adversarial and verification losses enhances the quality of the images, making them look more realistic, but does not contribute in a meaningful way to the disentanglement.

5.1.3 Discussion on Texture Quality

It has to be noted that our baseline 3DMM method (Booth et al. 2017) does not change the facial texture. It directly samples the original texture and maps it to a 3D face. Hence, the texture quality is exactly the same as that of the original image, as no low-dimensional texture representation is used. In terms of texture quality, direct texture mapping has an edge over our proposed method, which models the texture using a low-dimensional representation. But direct texture mapping is also prone to artefacts and does not learn the new expression in the texture. Looking at Fig. 9, column 2, rows 4, 5 and 7, we observe that the texture itself did not change in the baseline result. The eyes and cheeks did not adjust to show a smiling or neutral face. The expression change results from the change in the 3D shape, but the texture itself remained the same as in the input. A low-dimensional texture representation does not have this issue and can generate a new texture with the changed expression.

Generally, methods similar to ours that estimate facial texture are not able to recover the same amount of detail as the original image. Figure 15 visualises how our texture reconstruction compares to state-of-the-art works that have been trained on images of higher resolutions.

Fig. 18

Using the illumination and normals estimated by our network, we are able to relight target faces using illumination from the source image. The source \(\hat{\varvec{s}}^{source}\) and target shading \(\hat{\varvec{s}}^{target}\) are displayed to visualise against the new transferred shading \(\varvec{s}^{transfer}\). We compare against Shu et al. (2017)

Fig. 19

Given a single image, we infer meaningful expression and identity components to reconstruct a 3D mesh of the face. We compare our 3D estimation against recent works (Jackson et al. 2017; Feng et al. 2018)

5.2 Expression and Identity Interpolation

We interpolate \(\varvec{z}_{exp}^i\) / \(\varvec{z}_{id}^i\) of the input image \(\varvec{x}^i\) on the right-hand side towards \(\varvec{z}_{exp}^t\) / \(\varvec{z}_{id}^t\) of the target image \(\varvec{x}^t\) on the left-hand side. The interpolation is linear with a step of 0.1. For the interpolation, we do not modify the background, so the background remains that of image \(\varvec{x}^i\).
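
A sketch of the interpolation procedure (ours; the latent-code dictionary and decoder interface are placeholders), which linearly blends a single latent code in steps of 0.1 while keeping the remaining codes of the input image:

```python
import numpy as np

def interpolate_expression(decoder, z_i, z_t, steps=11):
    """Interpolate z_exp linearly from the input code z_i['exp'] to the target code z_t['exp']."""
    frames = []
    for alpha in np.linspace(0.0, 1.0, steps):            # 0.0, 0.1, ..., 1.0
        z_exp = (1.0 - alpha) * z_i['exp'] + alpha * z_t['exp']
        frames.append(decoder(z_exp=z_exp, z_id=z_i['id'], z_p=z_i['p']))
    return frames
```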

For expression interpolation, we expect the identity and pose to stay the same as the input image \(\varvec{x}^i\) and only the expression to change gradually from the expression of the input image to the expression of the target image \(\varvec{x}^t\). Figure 16 shows the expression interpolation. We can clearly see the change in expression while pose and identity remain constant.

For identity interpolation, we expect the expression and pose to stay the same as the input image \(\varvec{x}^i\) and only the identity to change gradually from the identity of the input image to the identity of the target image \(\varvec{x}^t\). Figure 17 shows the identity interpolation. We can clearly observe the change in identity while other variations remain limited.

5.3 Illumination Editing

We transfer illumination by estimating the normals \(\hat{\varvec{n}}\), albedo \(\hat{\varvec{a}}\) and illumination components \(\hat{\varvec{l}}\) of the source (\(\varvec{x}^{source}\)) and target (\(\varvec{x}^{target}\)) images. Then we use \(\hat{\varvec{n}}^{target}\) and \(\hat{\varvec{l}}^{source}\) to compute the transferred shading \(\varvec{s}^{transfer}\) and multiply the new shading by \(\hat{\varvec{a}}^{target}\) to create the relighted image \(\varvec{x}^{transfer}\). In Fig. 18, we show the performance of our method and compare against Shu et al. (2017) on illumination transfer. We observe that our method outperforms Shu et al. (2017), as we obtain more realistic-looking results.
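
Assuming the common second-order (9-coefficient) spherical-harmonics Lambertian model for the illumination code (the paper uses \(k_l = 9\)), the transfer can be sketched as follows; the per-pixel formulation and the standard real SH basis below are our assumptions:

```python
import numpy as np

def sh_basis(normals):
    """Second-order real spherical-harmonics basis evaluated at unit normals (H x W x 3)."""
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=-1)                                        # H x W x 9

def relight(normals_target, albedo_target, l_source):
    """s_transfer = SH(n_target) . l_source ;  x_transfer = a_target * s_transfer."""
    shading = sh_basis(normals_target) @ l_source      # H x W
    return albedo_target * shading[..., None]          # broadcast over the colour channels
```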

5.4 3D Reconstruction

The latent variables \(\varvec{z}_{exp}\) and \(\varvec{z}_{id}\) that our network learns are extremely meaningful. Not only can they be used to reconstruct the image in 2D, but they can also be mapped to the expression (\(\varvec{x}_{exp}\)) and identity (\(\varvec{x}_{id}\)) components of a 3DMM. This mapping is learnt inside the network. By replacing the expression and identity components of a mean face shape with \(\hat{\varvec{x}_{exp}}\) and \(\hat{\varvec{x}_{id}}\), we are able to reconstruct the 3D mesh of a face given a single in-the-wild 2D image. We compare these reconstructed meshes against the 3DMM fitted to the input image.

The results of the experiment are visualised in Fig. 19. We observe that the reconstruction is comparable to other state-of-the-art techniques (Jackson et al. 2017; Feng et al. 2018). None of the techniques, though, captures the identity of the person in the input image well, due to a known weakness of 3DMMs.

5.5 Normal Estimation

Fig. 20

Comparison of the estimated normals obtained using the proposed model vs the ones obtained by Wang et al. (2017b) and Shu et al. (2017)

Fig. 21

Visualisation of our \(\varvec{Z}_{exp}\) and baseline \(\varvec{Z}_{0}\) using t-SNE. Our latent \(\varvec{Z}_{exp}\) clusters better with regards to expression than the latent space \(\varvec{Z}_{0}\) of an auto-encoder

Fig. 22

Visualisation of our \(\varvec{Z}_{p}\) and baseline \(\varvec{Z}_{0}\) using t-SNE. It is evident that the proposed disentangled \(\varvec{Z}_{p}\) clusters better with regards to pose than the latent space \(\varvec{Z}_{0}\) of an auto-encoder

Table 1 Angular error for the various surface normal estimation methods on the Photoface (Zafeiriou et al. 2013) dataset. We also show the proportion of normals with angular error below 35\(^{\circ }\) and 40\(^{\circ }\)

We evaluate our method on the surface normal estimation task on the Photoface (Zafeiriou et al. 2013) dataset, which provides information about the illumination. Taking the normals found using calibrated Photometric Stereo (Woodham 1980) as “ground truth”, we calculate the angular error between our estimated normals and this “ground truth”. Figure 20 and Table 1 quantitatively evaluate our proposed method against prior works (Wang et al. 2017b; Shu et al. 2017) on the normal estimation task. We observe that our proposed method performs on par with or outperforms previous methods.
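
The reported angular error is the per-pixel angle between estimated and “ground truth” unit normals; a minimal sketch:

```python
import numpy as np

def mean_angular_error_deg(n_est, n_gt, mask=None):
    """Mean angular error (in degrees) between two normal maps of shape (H, W, 3)."""
    n_est = n_est / np.linalg.norm(n_est, axis=-1, keepdims=True)
    n_gt = n_gt / np.linalg.norm(n_gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(n_est * n_gt, axis=-1), -1.0, 1.0)
    errors = np.degrees(np.arccos(cos))
    return errors[mask].mean() if mask is not None else errors.mean()
```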

5.6 Quantitative Evaluation of the Latent Space

We want to test whether our latent space corresponds well to the variation that it is supposed to learn. For our quantitative experiment, we used Multi-PIE (Gross et al. 2010) as our test dataset. This dataset contains labelled variations in identity, expression and pose. Disentanglement of variations in Multi-PIE is particularly challenging, as its images are captured under laboratory conditions, which are quite different from those of our training images. Moreover, the expressions contained in Multi-PIE do not correspond to the 7 basic expressions and can easily be confused.

Table 2 Classification accuracy results: we try to classify 54 identities using \(\varvec{z}_{id}\), 6 expressions using \(\varvec{z}_{exp}\) and 7 poses using \(\varvec{z}_{p}\). We compare against standard baseline methods such as SIFT and CNN
Table 3 Identity classification accuracy results: we classify 54 identities using \(\varvec{z}_{id}\) with and without verification loss
Table 4 Classification accuracy results in comparison with Wang et al. (2017b): as Wang et al. (2017b) works on frontal images, we only consider frontal images in this experiment. We try to classify 54 identities using \(\varvec{z}_{id}\) versus \(\varvec{C}\), 6 expressions using \(\varvec{z}_{exp}\) versus \(\varvec{E}\) and 16 illuminations using \(\varvec{z}_{ill}\) versus \(\varvec{L}\)

We encoded 10,368 images of the Multi-PIE dataset, spanning 54 identities, 6 expressions and 7 poses, and trained a linear SVM classifier using 90% of the identity labels and the latent variables \(\varvec{z}_{id}\). We then tested on the remaining 10% of \(\varvec{z}_{id}\) to check whether they are discriminative for identity classification. We use 10-fold cross-validation to evaluate the accuracy of the learnt classifier. We repeat this experiment for expression with \(\varvec{z}_{exp}\) and for pose with \(\varvec{z}_{p}\). Our results in Table 2 show that our latent representation is indeed discriminative. We compare against standard baselines such as Bag-of-Words (BoW) models with SIFT features (Sivic and Zisserman 2009) and a standard CNN. Our model does not outperform the standard CNN model, which is fully supervised and requires a separate model for the classification of each variation. Still, our results are a strong indication that the latent representation found is discriminative. This experiment showcases the discriminative power of our latent representation on a previously unseen dataset.
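
A sketch of this classification protocol using scikit-learn; the classifier settings are illustrative, and 10-fold cross-validation reproduces the 90%/10% train/test split described above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def latent_classification_accuracy(Z, labels):
    """10-fold cross-validated accuracy of a linear SVM on latent codes Z (n_samples x k)."""
    scores = cross_val_score(LinearSVC(), Z, labels, cv=10)
    return scores.mean()

# e.g. accuracy_id = latent_classification_accuracy(Z_id, identity_labels)
```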

As an ablation study, we test the accuracy of the identity classification of \(\varvec{z}_{id}\) from a model trained without the verification loss. The results in Table 3 show that, although adding the verification loss improves the performance, the gain is not significant enough to prove that this loss is a substantial contributor to the information.

In order to quantitatively compare with Wang et al. (2017b), we run another experiment on only the frontal images of the dataset, spanning 54 identities, 6 expressions and 16 illuminations. The results in Table 4 show how our proposed model outperforms Wang et al. (2017b) in these classification tasks. Our latent representation has stronger discriminative power than the one learnt by Wang et al. (2017b).

We visualise, using t-SNE (Maaten and Hinton 2008), the latent \(\varvec{Z}_{exp}\) and \(\varvec{Z}_{p}\) encoded from Multi-PIE according to their expression and pose labels, and compare against the latent representation \(\varvec{Z}_0\) learnt by an in-house large-scale adversarial auto-encoder of similar architecture trained with 2 million faces (Makhzani et al. 2015). Figures 21 and 22 show that, even though our encoder has not seen any images of Multi-PIE, it manages to create informative latent representations that cluster expression and pose well (contrary to the representation learned by the tested auto-encoder).
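
The visualisation itself is a standard t-SNE embedding of the latent codes, coloured by the Multi-PIE labels; a sketch using scikit-learn and matplotlib:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latents(Z, labels, title):
    """Embed latent codes Z (n_samples x k) in 2D with t-SNE and colour the points by label."""
    emb = TSNE(n_components=2).fit_transform(Z)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap='tab10')
    plt.title(title)
    plt.show()

# e.g. plot_latents(Z_exp, expression_labels, 'Z_exp coloured by expression')
```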

6 Limitations

Some of our results do still show entanglement of the variations. Sometimes, despite aiming to change only the expression, pose or illumination is modified as well. This happens mainly in very challenging scenarios where, for example, one of the images shows extreme lighting conditions, is black and white, or displays large pose variations. Due to the dataset (CelebA) we used, we do struggle with large pose variations. The proof-of-concept experiments show that such variations can be learned with a more balanced dataset.

7 Conclusion

We proposed the first, to the best of our knowledge, attempt to jointly disentangle modes of variation that correspond to expression, identity, illumination and pose using no explicit labels regarding these attributes. More specifically, we proposed the first, as far as we know, approach that combines a powerful Deep Convolutional Neural Network (DCNN) architecture with unsupervised tensor decompositions. We demonstrated the power of our methodology in expression and pose transfer, as well as in discovering powerful features for pose and expression classification. For future work, we believe that designing networks with skip connections that achieve better reconstruction quality while still learning a representation space in which some of the variations are disentangled is a promising research direction.