1 Introduction

3D-assisted 2D face recognition has been attracting increasing attention because it can be used for pose-invariant face matching. This requires fitting a 3D face model to the input image, and using the fitted model to align the input and reference images for matching. As 3D facial shapes are intrinsically invariant to pose and illumination, the fitted shape also provides an invariant representation that can be used directly for recognition. The use of a face prior has been demonstrated to offer impressive performance on images of faces subject to wide pose variations, even outperforming deep learning [1, 2].

Most popular are 3D morphable face models, which represent 3D faces in a PCA subspace. The 3D face models proposed in the literature can capture and represent different modes of variation. Some focus solely on 3D shape (3DSM) [3–6]. Others (3DMM) also model the skin texture [7–9], or even facial expression (E-3DMM) [10, 11]. When fitting a 3DMM to an input image, it is essential to estimate the scene illumination, as skin texture and lighting are intrinsically entwined and need to be separated.

The problem of fitting a 3D model to a 2D image becomes challenging when the input image exhibits intra-personal variations not captured by the 3D model, or when the image is corrupted in some way. In this work, we use the term ‘intra-personal’ to denote any variations which are not inter-personal (i.e. not facial shape and texture). We assume that the shape fitting would be affected to a lesser extent, provided the automatic landmarking procedure used is robust to shape variations and occlusion. However, fitting the skin texture using a 3DMM or E-3DMM becomes problematic if the domain of the input data has changed. The problem associated with the missing modes of variation could be rectified by enhancing the 3D face model. However, this would require collecting relevant 3D face data, a labour-intensive task which would often be impractical. In any case, this approach would not be appropriate for dealing with other image degradation effects, such as occlusion or image compression artefacts.

The aim of this paper is to develop techniques that can harness the benefits of 3D models in 2D face recognition when the input image is corrupted, e.g. by occlusion, or when it exhibits intra-personal variations which cannot be explicitly synthesised by the models. We address the problem by learning, directly from 2D face data, the subspace spanned by the missing modes of variation in the surface texture space superimposed on the 3D face shape structure. This is accomplished by estimating the pose of the input image and the face shape from the detected landmarks. The difference of the aligned input and reference images is used to construct a surface texture map. A training set of these texture maps then defines the perturbation space, which can be represented using PCA bases. Assuming that the image perturbation subspace is orthogonal to the 3D face model space, these additive components can be recovered from an unseen input image, resulting in an improved fit of the 3D face model.

We refer to this proposed method as the unified 3DMM (U-3DMM). Unlike existing 3DMMs, U-3DMM models additional modes of variation in a unified linear framework, which can also generalise to occlusion. In addition, fitting U-3DMM to 2D images is very efficient. It involves first estimating the perturbation component of the input image. Once this component is removed, the core 3D face model fitting is a linear estimation problem. Last, the training set for U-3DMM is much easier to collect than that for 3DMMs.

We conduct an extensive evaluation of U-3DMM on databases which contain diverse modes of variation and perturbation. Experiments show the face recognition rates of U-3DMM are very competitive with state-of-the-art methods. We also present baseline face recognition results on a new dataset including combined pose, illumination and occlusion variations. The datasets and features extracted by U-3DMM will be made publicly available.

The contributions can be summarised as follows:

  • U-3DMM augments the core 3D face model by an additional PCA subspace, a perturbation subspace. Specifically, we project 2D images to 3D space via geometric fitting. Then, in the 3D space, the difference of two images (one being a reference and the other exhibiting additional variations) serves as a training sample to learn the perturbation part of U-3DMM. This process is detailed in Sect. 4.3 and Fig. 4. The linear model of these supplementary variations is generic: the framework can model any variation, e.g. occlusion, provided appropriate training data is available.

  • It is an open problem to achieve an accurate and efficient fitting for 3DMMs. Unlike the non-linear models used by 3DMMs, such as the Phong illumination model, the linear perturbation model of U-3DMM can be fitted very efficiently.

  • The large numbers of 3D faces needed to train inter- and intra-personal variations are expensive to collect. In comparison, the proposed method uses 2D images, which are much cheaper and easier to acquire, to train diverse variations in the U-3DMM framework.

The paper is organised as follows. In Sect. 2, we present the related work. The 3DMM and its fitting problem are formulated in Sect. 3. Section 4 details our methodology. The proposed algorithm is evaluated in Sect. 5. Section 6 draws conclusions.

2 Related Work

In this section, we discuss the current state of the art. We first introduce various 3D models and fitting strategies, and then discuss the motivation for this work.

2.1 3D Face Models

3D face modeling is an active research field with many applications. The biometrics community uses 3D models to improve face recognition performance. In the graphics and animation community, 3D models are used to reconstruct facial details such as wrinkles. In this work, we mainly focus on the 3D models used for biometrics, namely face recognition. These 3D models are classified into three categories: 3D shape model (3DSM), 3D Morphable Model (3DMM) and extended 3DMM (E-3DMM).

3DSM solves the pose problem using either pose normalisation (PN) [6] or pose synthesis (PS) [3–5]. In the PN approach, input images of arbitrary poses are converted to a canonical (frontal) view via a 3D model, and then traditional 2D face matchers are used for recognition. On the other hand, PS methods synthesise multiple virtual images with different poses for each gallery image. Only the virtual images with a pose similar to the probe are chosen for matching. However, these models can only explicitly model one intra-personal variation (pose).

Unlike the 3DSM, the 3D morphable model (3DMM) [7, 12] consists of not only a shape model but also a texture model, both learned from a set of 3D exemplar faces. The traditional 3DMMs [7, 12] can explicitly model pose and illumination variations. Pose is estimated using either a perspective camera [7, 12] or an affine camera [8, 13], and illumination is modelled by either the Phong model [7, 12] or the Spherical Harmonic model [8, 13].

In addition to pose and illumination variations, the extended 3DMM (E-3DMM) [10, 11, 14, 15] can model facial expressions. Specifically, the authors collected a large number of 3D scans with diverse expressions to train a shape model which can capture both facial shape and expression variations. Experiments show E-3DMM achieves promising face recognition performance in the presence of pose and expression variations. A very recent work [14] uses E-3DMM to improve the accuracy of facial landmark detection.

2.2 Fitting

3DMM and E-3DMM can recover the pose, shape, facial texture and illumination from a single image via a fitting process. The fitting is mainly conducted by minimising the RGB value differences over all the pixels in the facial area between the input image and its model-based reconstruction. As the fitting is an ill-posed problem, it is difficult to achieve an efficient and accurate fitting. To improve the fitting performance, many methods have been proposed.

The first fitting method was Stochastic Newton Optimisation (SNO) [7]. To reduce the computational cost, SNO randomly samples a small subset of the model vertices to construct the fitting cost function. However, this small subset does not capture enough information about the whole face, leading to an inferior fit. The Inverse Compositional Image Alignment (ICIA) algorithm [16, 17], a gradient-based method, modifies the cost function so that the Jacobian matrix becomes constant. Thus, the Jacobian matrix does not need to be updated in every iteration, improving efficiency. Efficiency is also the driver behind the linear shape and texture fitting algorithm (LiST) [18]. LiST constructs linear systems for the shape and texture optimisations, and it uses gradient-based methods to optimise pose and illumination. Multi-Feature Fitting (MFF) [19] is an accurate fitting strategy. MFF extracts many complementary features, such as edges and specular highlights, from the input image to constrain the fitting, leading to a smoother cost function. A recent work [8] proposes an efficient fitting strategy. Specifically, a probabilistic model incorporating the model generalisation error is used to estimate shape. To model specular reflectance, [8] first projects the fitting cost function into a specularity-free space to model the diffuse light. After that, the results are projected back to the original RGB colour space to model specularity. Two more recent works [20, 21] use local image features for fitting, achieving promising results.

2.3 Motivation

Although 3DMM and its variants (3DSM and E-3DMM) model pose, illumination and expression, they do not explicitly model other intra-personal variations, which limits their applicability. Many of the existing 3DMMs model the intra-personal variations in a non-linear fashion, making the fitting a difficult problem. In comparison, we propose a unified linear framework which can model many more intra-personal variations. In addition, the linear nature of our framework leads to a very efficient and accurate fitting.

3 Traditional 3D Morphable Model

In this section, the traditional 3DMMs [7, 12] and the fitting problem are formulated. To construct a 3DMM, the registered 3D facial scans including shape and texture are needed. Let the ith vertex of a registered face be located at (\(x_i,y_i,z_i\)) and have grey value \(g_i\). Then the shape and texture can be represented as \(\mathbf s '=(x_1,y_1,z_1,...,x_n,y_n,z_n)^{T}\) and \(\mathbf t '=(g_1,g_2,...,g_n)^{T}\), respectively. Symbol n is the number of vertices of a registered face. PCA is then applied to m example faces \(\mathbf s '\) and \(\mathbf t '\) separately to express shape \(\mathbf s \) and texture \(\mathbf t \) as:

$$\begin{aligned} \mathbf s =\mathbf s _0+\mathbf S \varvec{\alpha },\quad \mathbf t =\mathbf t _0+ \mathbf T \varvec{\beta } \end{aligned}$$
(1)

where \(\mathbf s \in \mathbb {R}^{3n}\) and \(\mathbf t \in \mathbb {R}^{n}\). \(\mathbf s _0\) and \(\mathbf t _0\) are the mean shape and texture of the m training faces, respectively. The columns of \(\mathbf S \) and \(\mathbf T \) are the eigenvectors of the shape and texture covariance matrices, respectively. The free coefficients \(\varvec{\alpha }\) and \(\varvec{\beta }\) constitute low-dimensional codings of \(\mathbf s \) and \(\mathbf t \), respectively.
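
As an illustration, the PCA bases of Eq. (1) could be computed along the following lines; this is a minimal sketch assuming the registered scans are already stacked as row vectors, and all names are illustrative rather than taken from any particular implementation:

```python
import numpy as np

def build_pca_model(X, n_components):
    """Build a PCA model from m example faces stacked as the rows of X.

    Returns the sample mean, a basis whose columns are eigenvectors of the
    sample covariance matrix, and the corresponding variances."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # The right singular vectors of the centred data matrix are the
    # eigenvectors of the covariance matrix.
    _, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:n_components].T
    variances = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    return mean, basis, variances

# shapes:   (m, 3n) matrix of registered shape vectors s'
# textures: (m, n)  matrix of registered texture vectors t'
# s0, S, var_s = build_pca_model(shapes, n_components=99)
# t0, T, var_t = build_pca_model(textures, n_components=99)
```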

3DMM can recover the 3D shape, texture, pose, and illumination from a single image via a fitting process. The fitting is conducted by minimising the intensity differences between the input and model reconstructed images. To perform such a minimisation, the 3DMM has to be aligned to the input image by projecting the 3D vertices of \(\mathbf s (\varvec{\alpha })\) to a 2D image plane via a camera model parameterised by \(\rho \). Then we define \(\mathbf a ^{M}\) and \(\mathbf a ^{I}\): (1) \(\mathbf a ^{M}\) is a vector concatenating the pixel values generated by the vertices of a 3DMM. The value of \(\mathbf a ^{M}\) is determined by facial texture and illumination. In common with [7, 19], the texture is represented by \(\mathbf t (\varvec{\beta })\) and the illumination is modelled by the Phong reflection with parameter \(\mu \). (2) Based on the current alignment determined by \(\varvec{\alpha }\) and \(\rho \), the vertices of a 3DMM find the nearest corresponding pixels of a 2D input image. The corresponding pixel values are concatenated as a vector \(\mathbf a ^{I}\). Therefore, \(\mathbf a ^{M}\) and \(\mathbf a ^{I}\) depend on \(\{\varvec{\beta },\mu \}\) and \(\{\varvec{\alpha },\rho \}\), respectively. The fitting can be formulated:

$$\begin{aligned} \min \limits _{\varvec{\alpha }, \rho , \varvec{\beta },\mu } \Vert \mathbf a ^{I}(\varvec{\alpha }, \rho )-\mathbf a ^{M}(\varvec{\beta },\mu )\Vert ^2 \end{aligned}$$
(2)

In common with [7, 12], \(\mathbf a ^{M}\) is formulated as:

$$\begin{aligned} \mathbf a ^{M}=\underbrace{(\mathbf t _0 + \mathbf T \varvec{\beta })}_{inter-personal}.*\, \underbrace{(l_{a}{} \mathbf I +l_{d} \mathbf N \mathbf d )+ \mathbf e }_{illumination} \end{aligned}$$
(3)

where \(.*\) denotes element-wise multiplication; \(l_{a}\) and \(l_{d}\) are the strengths of ambient and directed light; \(\mathbf I \) is a vector with all entries equal to 1; \(\mathbf N \in \mathbb {R}^{n \times 3}\) is stacked by the surface normal at each vertex; \(\mathbf d \in \mathbb {R}^{3}\) denotes light direction; \(\mathbf e \) is stacked by the specular reflectance e of every vertex: \(e= k_s \langle \mathbf v ,\mathbf r \rangle ^{\tau }\). \(k_s\) is a constant for specularity; \(\mathbf v \) and \(\mathbf r \) denote the viewing and reflection directions respectively. \(\tau \) denotes the coefficient of shininess. Then Eq. (2) can be rewritten as:

$$\begin{aligned} \min \limits _{\phi } \Vert \underbrace{{\mathbf{a}}^{I}(\varvec{\alpha }, \rho )}_{input}-\underbrace{((\mathbf{t}_0+ \mathbf{T} \varvec{\beta }).*(l_{a}{\mathbf{I}}+l_{d} \mathbf{N} \mathbf{d})+\mathbf{e})}_{reconstruction} \Vert ^2 \end{aligned}$$
(4)

where \(\phi =\{\varvec{\alpha }, \rho , \varvec{\beta },l_{a}, l_{d}, \mathbf d , {\tau } \}\). This is a difficult non-linear optimisation problem due to (1) the exponential form of e and (2) the element-wise multiplication. For different optimisation strategies, refer to Sect. 2.2.
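
For concreteness, the following minimal sketch evaluates the reconstruction \(\mathbf a ^{M}\) of Eq. (3) for a given parameter set; the per-vertex normals, viewing and reflection directions are assumed to be supplied by the renderer, and the clipping of the specular dot product is a practical guard rather than part of the original formulation:

```python
import numpy as np

def reconstruct_phong(t0, T, beta, N, d, l_a, l_d, k_s, v, r, tau):
    """Evaluate a^M = (t0 + T beta) .* (l_a 1 + l_d N d) + e  (cf. Eq. 3).

    t0: (n,) mean texture; T: (n, k) texture basis; beta: (k,) coefficients;
    N, v, r: (n, 3) per-vertex normals, viewing and reflection directions."""
    albedo = t0 + T @ beta                 # inter-personal texture
    shading = l_a + l_d * (N @ d)          # ambient plus directed light
    # Specular term e = k_s <v, r>^tau; negative dot products are clipped
    # here as a practical guard (an assumption, not part of Eq. 3).
    specular = k_s * np.clip(np.sum(v * r, axis=1), 0.0, None) ** tau
    return albedo * shading + specular
```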

4 Unified 3D Morphable Model (U-3DMM)

We propose a unified 3D morphable model (U-3DMM), which linearly models inter- and intra-personal variations. Inter-personal variations, which are usually used to model identity, discriminate between different people. In comparison, intra-personal variations are caused by various other random factors such as illumination and occlusion. Inter- and intra-personal variations jointly determine the observed images, as shown in Fig. 1. In this section, first, the construction of our U-3DMM is described. Next, an efficient fitting strategy is detailed. Finally, we propose a method to train the intra-personal variations using 2D images. In this work, the 3D model used is the Surrey Face Model [22].

4.1 Model

Like 3DMM, U-3DMM consists of shape and texture models. The shape model is exactly the same as \(\mathbf s \) in Eq. (1). Here we only focus on the texture part of U-3DMM.

Fig. 1. Intra- and inter-personal variations. The images in the 2nd column are obtained by subtracting the ones in the 3rd column from those in the 1st column, with an offset of 128.

Motivation and Assumption. The existing 3DMMs model the relationship between inter- and intra-personal variations in a non-linear fashion, for example, via the element-wise multiplication between the inter-personal and illumination terms in Eq. (3). This non-linear relationship has two weaknesses: (1) it does not generalise well, because a different relationship has to be found for each intra-personal variation; for example, the Phong model can only model illumination. (2) The non-linearity makes optimisation difficult. To solve these two problems, we assume an input face is equal to the sum of inter- and intra-personal variations, following [23, 24]:

$$\begin{aligned} \mathbf a = \mathbf a ^{inter}+ \mathbf a ^{intra} \end{aligned}$$
(5)

where \(\mathbf a \) is a face, i.e. either \(\mathbf a ^{M}\) or \(\mathbf a ^{I}\). \(\mathbf a ^{inter}\) and \(\mathbf a ^{intra}\) are the inter- and intra-personal parts respectively. The effectiveness of this assumption has been validated in [23, 24]. Specifically, this assumption is successfully used for metric learning in [23] and for sparse representation-based classification in [24]. The former greatly improves the generalisation capacity of the learned metric and the latter solves the single training sample problem. In the field of 3D modeling, this assumption enables a 3DMM to model various intra-personal variations in a unified framework. In addition, it leads to the efficient and accurate fitting detailed in Sect. 4.2.

Modeling. Instead of a non-linear relationship in Eq. (3), the reconstructed texture of U-3DMM is linearly modelled as the sum of two parts in Eq. (5). Each part is modelled linearly. To train these two parts separately, training data \(\mathbf t '\) and \(\mathbf u '\) are used: \(\mathbf t '\), which is the same as in Sect. 3, captures the identity facial texture information; \(\mathbf u '\) represents one training sample of texture in 3D that captures intra-personal variation such as expression. \(\mathbf u '\) has the same dimension as \(\mathbf t '\) and it is organised in the same order in 3D space as \(\mathbf t '\). \(\mathbf u '\) can be any type of intra-personal variation. The generation of \(\mathbf u '\) will be detailed in Sect. 4.3. PCA is applied to m samples \(\mathbf t '\) and p samples \(\mathbf u '\) separately to generate the inter- and intra-personal subspaces \(\mathbf T \) and \(\mathbf U \) respectively. Thus the U-3DMM reconstructed texture \(\mathbf a ^M\) is formulated as:

$$\begin{aligned} \mathbf a ^M=\underbrace{\mathbf{t}_0+ \mathbf{T} \varvec{\beta }}_{inter-personal} + \underbrace{\mathbf{U} \varvec{\gamma }}_{intra-personal} \end{aligned}$$
(6)

\( \mathbf T , \mathbf t _0\) and \(\varvec{\beta }\) have the same meaning as in Eq. (1). The inter-personal part is the same as that of 3DMM in Eq. (3). The columns of \(\mathbf U \in \mathbb {R}^{n\times p}\) are the eigenvectors of the intra-personal variation covariance matrix. \(\varvec{\gamma }\) is a free parameter that determines the intra-personal variations. It is assumed that \(\varvec{\beta }\) and \(\varvec{\gamma }\) have Gaussian distributions:

$$\begin{aligned} p(\varvec{\beta }) \sim \mathcal {N}\left( 0, \varvec{\sigma }_{t}\right) \end{aligned}$$
(7)
$$\begin{aligned} p(\varvec{\gamma }) \sim \mathcal {N}\left( \varvec{\gamma }_0, \varvec{\sigma }_{u} \right) \end{aligned}$$
(8)

where the value of \(\varvec{\gamma }_0\) is computed by projecting the mean of all the training samples \(\mathbf u '\) to PCA space \(\mathbf U \), \(\varvec{\sigma }_{t} =(\sigma _{1,t},...,\sigma _{m-1,t} )^T \), \(\varvec{\sigma }_{u} =(\sigma _{1,u},...,\sigma _{p,u} )^T \), and \(\sigma _{i,t}^2\) and \(\sigma _{i,u}^2\) are the ith eigenvalues of inter- and intra-personal variation covariance matrices respectively.
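
A minimal sketch of how \(\mathbf U \), \(\varvec{\gamma }_0\) and \(\varvec{\sigma }_u\) could be computed from a set of training samples \(\mathbf u '\) (generated as in Sect. 4.3) follows; since Eq. (6) contains no intra-personal mean term, the samples are left uncentred here, which is one possible reading of the text, and all names are illustrative:

```python
import numpy as np

def build_intra_personal_model(U_samples, p):
    """U_samples: (num_samples, n) matrix whose rows are training samples u'.

    Returns the PCA basis U (columns are eigenvectors), the prior mean
    gamma_0 obtained by projecting the sample mean onto U, and the
    per-component standard deviations sigma_u."""
    # The samples are kept uncentred here, since Eq. (6) contains no
    # separate intra-personal mean term (an assumption).
    _, singular_values, Vt = np.linalg.svd(U_samples, full_matrices=False)
    U = Vt[:p].T
    sigma_u = singular_values[:p] / np.sqrt(max(U_samples.shape[0] - 1, 1))
    gamma_0 = U.T @ U_samples.mean(axis=0)
    return U, gamma_0, sigma_u
```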

Advantages. The main advantage of U-3DMM is that it can generalise well to diverse intra-personal variations. Table 1 shows that U-3DMM has better generalisation capacity than the existing 3D models. This advantage results from the unified intra-personal part in Eq. (6) which can model more intra-personal variations than the existing 3D models. In addition, compared with the complicated non-linear inter-personal and illumination modeling in Eq. (3), we explicitly linearise the inter- and intra-personal parts in two PCA spaces.

Table 1. Generalisation capacity of 3D models

4.2 Fitting

By virtue of a fitting process, U-3DMM can recover the pose, 3D facial shape, facial texture and intra-personal variations from an input image as shown in Fig. 2. Linearly separating intra- and inter-personal parts allows us to achieve an efficient fitting. Based on Eq. (6), the fitting problem of U-3DMM is formulated as:

$$\begin{aligned} \min \limits _{\varvec{\alpha }, \rho , \varvec{\beta },\varvec{\gamma }} \Vert \underbrace{\mathbf{a}^{I}(\varvec{\alpha }, \rho )}_{input}- \underbrace{(\mathbf{t}_0 + \mathbf{T} \varvec{\beta }+\mathbf{U} \varvec{\gamma })}_{reconstruction}\Vert ^2 \end{aligned}$$
(9)

Compared with Eq. (4), the reconstruction part is clearly linear. To solve this fitting problem, we propose a fitting strategy which sequentially optimises the pose (\(\rho \)), shape (\(\varvec{\alpha }\)), intra-personal (\(\varvec{\gamma }\)) and facial texture (\(\varvec{\beta }\)) parameters in four separate steps. A closed-form solution can be obtained for each of these steps. These parameters are optimised by iterating the two sequences of steps in turn, as shown in Fig. 3. In each step, only one group of parameters is estimated, and the others are regarded as constant.

Fig. 2. Input and output of a fitting.

Fig. 3. Topology of the fitting algorithm.

Pose and Shape Estimations. In the first two steps, pose (\(\rho \)) and shape (\(\varvec{\alpha }\)) are optimised by solving linear systems using the method in [8]. Specifically, motivated by the fact that the pose and shape variations cause the facial landmarks to shift, \(\rho \) and \(\varvec{\alpha }\) are estimated by minimising the distance between the landmarks of the input images and those reconstructed from the model. The cost functions for \(\rho \) and \(\varvec{\alpha }\) are linear [8], thus \(\rho \) and \(\varvec{\alpha }\) have closed-form solutions. Once \(\rho \) and \(\varvec{\alpha }\) are estimated, the correspondence between the vertices of the model and pixels of the input images is established.
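
The sketch below illustrates one plausible instantiation of this linear landmark-based estimation, assuming an affine camera and a simple isotropic regulariser; it is not the exact formulation of [8], and all names are illustrative:

```python
import numpy as np

def estimate_affine_camera(X3d, x2d):
    """Least-squares affine camera from L landmark correspondences.

    X3d: (L, 3) model landmark positions; x2d: (L, 2) detected landmarks.
    Returns a 2x4 matrix P such that each landmark satisfies x_i ~ P [X_i; 1]."""
    A = np.hstack([X3d, np.ones((X3d.shape[0], 1))])   # homogeneous 3D points
    P, *_ = np.linalg.lstsq(A, x2d, rcond=None)
    return P.T

def estimate_shape(P, s0_lm, S_lm, x2d, reg=1.0):
    """Closed-form shape coefficients from a linear landmark cost.

    s0_lm: (3L,) mean-shape landmark coordinates; S_lm: (3L, k) shape basis
    restricted to the landmark vertices; P: 2x4 affine camera."""
    L = x2d.shape[0]
    R, t = P[:, :3], P[:, 3]
    mean_proj = (s0_lm.reshape(L, 3) @ R.T + t).ravel()
    # Basis vectors are directions: rotated and scaled, but not translated.
    A = np.column_stack([(S_lm[:, j].reshape(L, 3) @ R.T).ravel()
                         for j in range(S_lm.shape[1])])
    b = x2d.ravel() - mean_proj
    return np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]), A.T @ b)
```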

Intra-personal Variation Estimation. The cost function in Eq. (9) is used to estimate \(\varvec{\gamma }\). In this step, \(\mathbf a ^{I}(\varvec{\alpha }, \rho )\) in Eq. (9) is constant since \(\rho \) and \(\varvec{\alpha }\) have already been recovered in the first two steps. To avoid over-fitting, a regularisation term based on Eq. (8) is used to constrain the optimisation. Therefore, the optimisation problem is defined as:

$$\begin{aligned} \min \limits _{\varvec{\gamma }} \Vert ( \mathbf a ^{I}-\mathbf t _0 - \mathbf T \varvec{\beta } )-\mathbf U \varvec{\gamma }\Vert ^2 +\lambda _1 \Vert (\varvec{\gamma }-\varvec{\gamma }_0)./\varvec{\sigma }_u \Vert ^2 \end{aligned}$$
(10)

The closed-form solution is \(\varvec{\gamma }=(\mathbf U ^T\mathbf U + \varvec{\Sigma }_u)^{-1} (\mathbf U ^T(\mathbf a ^{I}-\mathbf t _0 - \mathbf T \varvec{\beta })+ \lambda _1 (\varvec{\gamma }_0./ \varvec{\sigma }_{u}^2))\) where \( \varvec{\Sigma }_u=\text {diag}(\lambda _1/\sigma _{1,u}^2,...,\lambda _1/\sigma _{p,u}^2)\), ./ denotes element-wise division, and \(\lambda _1\) is a weighting parameter for the regularisation term. Note that \(\varvec{\beta }\) is set to \(\varvec{0}\) in the first iteration: in other words the mean facial texture \(\mathbf t _0\) is used as the initial estimate of the reconstructed image. In subsequent iterations, \(\varvec{\beta }\) is replaced by the estimate recovered in the previous iteration.

Facial Texture Estimation. Having obtained an estimate of {\(\rho \), \(\varvec{\alpha }\), \(\varvec{\gamma }\)}, \(\varvec{\beta }\) can be recovered in the final step. Similar to Eq. (10), the cost function for estimating \(\varvec{\beta }\) is defined as:

$$\begin{aligned} \min \limits _{\varvec{\beta }} \Vert ( \mathbf a ^{I}-\mathbf t _0 - \mathbf U \varvec{\gamma } ) -\mathbf T \varvec{\beta }\Vert ^2 +\lambda _2 \Vert \varvec{\beta }./\varvec{\sigma }_t \Vert ^2 \end{aligned}$$
(11)

The closed-form solution is: \(\varvec{\beta }=(\mathbf T ^T\mathbf T + \varvec{\Sigma }_t)^{-1} \mathbf T ^T(\mathbf a ^{I}-\mathbf t _0 - \mathbf U \varvec{\gamma })\), where \(\lambda _2\) is a free weighting parameter and \( \varvec{\Sigma }_t=\text {diag}(\lambda _2/\sigma _{1,t}^2,...,\lambda _2/\sigma _{m-1,t}^2)\).
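
Putting the two closed-form solutions together, the alternating texture updates could be implemented roughly as follows; this is a sketch under the stated assumptions, with the input vector \(\mathbf a ^{I}\) assumed to be already sampled at the model vertices after the pose and shape steps:

```python
import numpy as np

def fit_texture_and_intra(a_I, t0, T, U, gamma0, sigma_u, sigma_t,
                          lam1=1.0, lam2=1.0, n_iters=5):
    """Alternate the closed-form updates of Eqs. (10) and (11)."""
    beta = np.zeros(T.shape[1])          # first iteration uses the mean texture t0
    Sigma_u = np.diag(lam1 / sigma_u ** 2)
    Sigma_t = np.diag(lam2 / sigma_t ** 2)
    for _ in range(n_iters):
        # Intra-personal step (Eq. 10): explain the residual by U gamma.
        r = a_I - t0 - T @ beta
        gamma = np.linalg.solve(U.T @ U + Sigma_u,
                                U.T @ r + lam1 * (gamma0 / sigma_u ** 2))
        # Facial texture step (Eq. 11): explain the residual by T beta.
        r = a_I - t0 - U @ gamma
        beta = np.linalg.solve(T.T @ T + Sigma_t, T.T @ r)
    return beta, gamma
```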

4.3 Intra-personal Variation Data Collection

An important prerequisite of building the U-3DMM is to collect intra-personal variation data, i.e. \( \mathbf u '\). The straightforward approach would be to collect enough 3D scans to capture all types of intra-personal variations. However, such 3D data collection is very expensive. In comparison, it is much easier and cheaper to collect 2D image data which covers such variations. Motivated by this, we propose a method to use 2D images to generate 3D intra-personal variation \( \mathbf u '\).

Fig. 4. 3D intra-personal variation data generation. \(\mathbf a _c\) and \(\mathbf a _e\) are a 2D image pair without and with intra-personal variation; they are projected to 3D space to reconstruct \(\mathbf u _c\) and \(\mathbf u _e\); \( \mathbf u '\) is the generated 3D data.

The outline of our method is illustrated in Fig. 4. Assume that we have two facial images of the same person: one without intra-personal variations, \(\mathbf a _c\), and the other, \(\mathbf a _e\), with such variation, e.g. the illumination variation in Fig. 4. In the real world, it is easy to collect this type of image pair from the internet or from publicly available face databases. To project \(\mathbf a _e\) and \(\mathbf a _c\) to 3D space, the correspondence between them and the shape model of U-3DMM has to be established first. As in Sect. 4.2, such a correspondence can be created via geometric fitting, i.e. the pose and shape fitting. By virtue of this correspondence, the intensities of \(\mathbf a _e\) and \(\mathbf a _c\) can be associated with the 3D vertices of the shape model, generating 3D data \(\mathbf u _e\) and \(\mathbf u _c\). In Eq. (6), the reconstructed image is computed as a sum of inter- and intra-personal variations. We therefore define the intra-personal variation \( \mathbf u '\) as the difference between \(\mathbf u _e\) and \(\mathbf u _c\):

$$\begin{aligned} \mathbf u ' = \mathbf u _e - \mathbf u _c \end{aligned}$$
(12)

PCA is applied to the samples of \(\mathbf u '\) to obtain \(\mathbf U \) of Eq. (6).

Invisible Regions. Due to self-occlusions and pose variations, some facial parts of the 2D images (\(\mathbf a _e\) and \(\mathbf a _c\)) are not visible. In this work, the pixel values of \(\mathbf u _e\) and \(\mathbf u _c\) corresponding to the self-occluded parts of \(\mathbf a _e\) and \(\mathbf a _c\) are set to 0. Although those invisible parts are set to 0 for some training images, the same parts are visible in other training images under different poses. Therefore, training images of different poses are complementary in modelling the intra-personal variation part.
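
A minimal sketch of generating one training sample \(\mathbf u '\) from an image pair, including the zeroing of invisible vertices, might look as follows; the vertex-to-pixel correspondences and visibility masks are assumed to come from the geometric fitting described above, and all names are illustrative:

```python
import numpy as np

def intra_personal_sample(img_c, img_e, vert_to_pix_c, vert_to_pix_e,
                          visible_c, visible_e):
    """Generate one 3D intra-personal training sample u' = u_e - u_c (Eq. 12).

    vert_to_pix_*: for each of the n model vertices, the flat index of its
    nearest pixel in the corresponding image (from the geometric fitting);
    visible_*: boolean masks marking vertices that are not self-occluded."""
    u_c = np.where(visible_c, img_c.ravel()[vert_to_pix_c], 0.0)
    u_e = np.where(visible_e, img_e.ravel()[vert_to_pix_e], 0.0)
    return u_e - u_c
```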

5 Experiments

Face recognition aims to reduce the impact of intra-personal variations but keep the discriminative inter-personal information. Thus, we remove the intra-personal variations estimated during fitting and keep the shape and texture for face recognition.

In common with [7, 8, 18], \(\varvec{\alpha }\) and \(\varvec{\beta }\) are concatenated as a facial feature for face recognition. Cosine similarity and a nearest neighbour classifier are used. Landmarks are manually assigned for the initialisation of the U-3DMM fitting. To demonstrate the effectiveness of U-3DMM, we compare it with the state of the art. To make an extensive comparison, we implemented a very effective 3DMM using multi-feature fitting [19], as well as Sparse Representation Classification (SRC) [25] and Extended SRC (ESRC) [24]. The recognition rates of the other methods are cited from their papers. We evaluated these methods on Multi-PIE [26], AR [27], and a new synthetic database. Labeled Faces in the Wild (LFW) [28] is another popular face dataset; however, most subjects in LFW have only one image. As U-3DMM needs image pairs of the same subject to train the intra-personal term, LFW is not appropriate for evaluating our method and is not used in our experiments.
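
For illustration, the matching step could be implemented as a few lines of cosine-similarity nearest-neighbour search over the concatenated \([\varvec{\alpha }; \varvec{\beta }]\) features (a sketch; names are illustrative):

```python
import numpy as np

def match(probe_feat, gallery_feats):
    """Nearest-neighbour matching with cosine similarity.

    probe_feat: concatenated [alpha; beta] feature of the probe image;
    gallery_feats: (G, d) matrix of gallery features. Returns the index
    of the best-matching gallery entry."""
    def normalise(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = normalise(gallery_feats) @ normalise(probe_feat)
    return int(np.argmax(sims))
```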

5.1 Pose, Occlusion and Illumination Variations

U-3DMM is the first 3D approach to explicitly model combined pose, occlusion and illumination variations. In this section, U-3DMM is compared with the state of the art.

Database and Protocol. To our knowledge, there is no database containing large pose, occlusion and illumination variations. Nevertheless, the Multi-PIE database [26] contains two out of three variations, i.e. pose and illumination. We add random occlusions to Multi-PIE images to synthesise a dataset containing all these variations. To simulate real occlusions, the synthetic ones have various sizes and locations within a face.

We generate random occlusions on the facial images as follows. First, we detect the facial area, whose width and height are denoted by \(W\) and \(H\). Then, a uniformly distributed random coordinate (x, y) inside the facial area is generated. Last, the width and height (w and h) of the occlusion are produced by \(\{w, h\}=\{W, H\} \times \mathrm{rand}(0.2,0.5)\), where \(\mathrm{rand}\) denotes a uniformly distributed random number generator and (0.2, 0.5) is the range of the random numbers. Hence, the occlusion area of one image can be represented as (x, y, w, h).
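
A sketch of this occlusion generation procedure is given below; filling the occluder with a constant value and keeping the rectangle entirely inside the face box are illustrative choices not dictated by the text:

```python
import numpy as np

def add_random_occlusion(image, face_box, rng=None):
    """Overlay a random rectangular occlusion inside the detected face box.

    face_box = (x0, y0, W, H). The occluder's width and height are drawn
    uniformly from 20-50% of the face width and height; it is filled with
    a constant value and kept inside the box (both illustrative choices)."""
    rng = rng or np.random.default_rng()
    x0, y0, W, H = face_box
    w = int(W * rng.uniform(0.2, 0.5))
    h = int(H * rng.uniform(0.2, 0.5))
    x = x0 + int(rng.integers(0, max(W - w, 1)))
    y = y0 + int(rng.integers(0, max(H - h, 1)))
    occluded = image.copy()
    occluded[y:y + h, x:x + w] = 0
    return occluded, (x, y, w, h)
```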

A subset of Multi-PIE containing four typical illuminations and four typical poses is used. The four illuminations are left, frontal and right lighting, and ambient lighting (no directed lighting) with lighting IDs 02, 07, 12 and 00. The four poses are frontal and left-rotated by angles 15\(^\circ \), 30\(^\circ \), 45\(^\circ \) with pose IDs 051, 140, 130 and 080. Random occlusions are applied to these images. To train the intra-personal variation part of U-3DMM, a subset of 37 identities (from ID-251 to ID-292) in session II is used. The test set contains all the subjects (from ID-001 to ID-250) in session I. In the test set, the frontal images with ambient lighting and without occlusion are the gallery images and the others are the probe. Both training and test sets contain various pose, illumination and occlusion variations.

Results. Both the 3D shape and texture parameters (\(\varvec{\alpha }\) and \(\varvec{\beta }\)) of U-3DMM are discriminative. It is interesting to explore their impact on performance. From Table 2, the face recognition rates when using texture information only are much higher than those when using shape only, indicating that texture is more discriminative than shape. Across different illuminations, the performance when using shape only varies less than when using texture only. Clearly, facial texture is more sensitive to illumination variations than shape. It is also observed that combining shape and texture by concatenating \(\varvec{\alpha }\) and \(\varvec{\beta }\) consistently outperforms either one of them. In all the following evaluations, we use both shape and texture.

We compare qualitatively the reconstruction performance of 3DMM and U-3DMM. Facial textures reconstructed by \(\varvec{\beta }\) are shown in Fig. 5. As shown, 3DMM suffers from over-fitting caused by the occlusion, while U-3DMM reconstructs the facial features more accurately.

Fig. 5. Frontal texture reconstructions.

Table 2. Average recognition rate (%) of U-3DMM over all the poses and occlusions per illumination

To demonstrate the effectiveness of U-3DMM, it is compared with state-of-the-art methods: SRC [25], ESRC [24] and 3DMM [19]. Figure 6 illustrates how the recognition performance varies with illumination, averaged over poses and occlusions. Our U-3DMM outperforms the other three methods because it can effectively handle pose, illumination and occlusion simultaneously. By comparison, SRC and ESRC do not handle the pose problem and 3DMM does not explicitly model occlusion. SRC is the worst because it suffers from the problem of having just ‘a single image per subject in the gallery’ [24, 25]. In the case of ‘frontal light’, all the methods perform worse than under the other three illuminations. The inferior performance results from the fact that the illumination difference between the gallery (ambient lighting) and probe (frontal lighting) is larger than under the other illumination conditions.

Figure 7 shows how recognition rates, averaged over illuminations and occlusions, vary with pose. All the face recognition rates decrease with increasing pose variation, showing that pose variation presents a challenging problem. U-3DMM works much better than the others due to its strong intra-personal variation modeling capacity. In the case of frontal pose (rotation angle of 0\(^\circ \)), where only illumination and occlusion are present, ESRC achieves promising recognition rates because it can explicitly model illumination and occlusion. U-3DMM outperforms ESRC because U-3DMM can extract discriminative shape information for recognition while ESRC cannot.

Fig. 6. Average recognition rate over all the poses and occlusions per illumination.

Fig. 7. Average recognition rate over all the illuminations and occlusions per pose.

5.2 Pose and Illumination Variations

Face recognition in the presence of pose variations (PFR) and combined pose and illumination variations (PIFR) is very challenging. Extensive research has been conducted to solve the PFR and PIFR problems. In this section, U-3DMM is compared with state-of-the-art methods on the Multi-PIE database.

Database and Protocol. Following the existing work, two settings (Setting-I and Setting-II) are used for PFR and PIFR respectively. Setting-I uses a subset in session 01 consisting of 249 subjects with 7 poses and 20 illumination variations. The images of the first 100 subjects constitute the training set. The remaining 149 subjects form the test set. In the test set, the frontal images under neutral illumination work as the gallery and the remaining are probe images. Setting-II uses the images of all the 4 sessions (01–04) under 7 poses and only neutral illumination. The images from the first 200 subjects are used for training and the remaining 137 subjects for testing. In the test set, the frontal images from the earliest session work as gallery, and the others are probes.

Table 3. Recognition rate (%) across poses on Multi-PIE

Pose-Robust Face Recognition. Table 3 compares our U-3DMM with the state of the art. SPAE [31] works best among all the 2D methods due to its strong non-linear modeling capacity. E-3DMM [15] and U-3DMM outperform the other two 3D methods ([6] and [32]), as E-3DMM [15] and U-3DMM can model both pose and facial shape, rather than pose only as in [6, 32]. E-3DMM only reports results using the High-Dimensional Gabor Feature (HDF) [33] rather than the PCA coefficients (\(\varvec{\alpha }\) and \(\varvec{\beta }\)). For a fair comparison, we also extracted the HDF feature from pose-normalised rendered images as follows: First, an input image is aligned to U-3DMM via geometric fitting. Second, the intensity value of each vertex of U-3DMM is assigned the value of the corresponding pixel of the input image. The values of the invisible vertices of U-3DMM are taken from their symmetric visible counterparts. Last, a frontal face image is rendered using the obtained intensity values of U-3DMM. U-3DMM (HDF) works much better than U-3DMM (PCA) because (1) the HDF feature captures both global and local facial information, in comparison with only the global information captured by the PCA coefficients; and (2) HDF uses Gabor features, which are more invariant than the pixel values encoded by the PCA coefficients. Our U-3DMM (HDF) works slightly worse than E-3DMM (HDF); however, U-3DMM has advantages over E-3DMM: (1) E-3DMM itself can only model pose and expression, whereas U-3DMM can explicitly model more intra-personal variations; (2) the expression part of E-3DMM is trained using 3D faces with various expressions, while U-3DMM is trained using easily collected 2D images; (3) E-3DMM estimates the depth of the background, incurring extra computational cost without improving face recognition rates, whereas U-3DMM does not.
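
A short sketch of the symmetric filling step used when rendering the pose-normalised image is shown below; the symmetry correspondence between left and right model vertices is assumed to be precomputed, and the names are illustrative:

```python
import numpy as np

def fill_invisible_by_symmetry(vertex_values, visible, mirror_index):
    """Replace values at self-occluded vertices with those of their
    horizontally symmetric counterparts, when the latter are visible.

    mirror_index[i] is the index of the vertex mirrored across the
    facial symmetry plane (assumed precomputed for the model)."""
    filled = vertex_values.copy()
    use_mirror = (~visible) & visible[mirror_index]
    filled[use_mirror] = vertex_values[mirror_index[use_mirror]]
    return filled
```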

Pose- and Illumination-Robust Face Recognition. As shown in Table 4, the subspace method [34] works much worse than the others. The deep learning methods, i.e. FIP (face identity-preserving) [2], RL (FIP reconstructed features) [2] and MVP (multi-view perceptron) [35], achieve promising results. U-3DMM outperforms these deep learning methods (RL, FIP, and MVP) because 3D methods intrinsically model pose and illumination. Apart from their lower performance, deep learning methods share the difficulty of designing a ‘good’ architecture, because (1) there is no theoretical guide and (2) the large number of free parameters is hard to tune.

Table 4. Recognition rate (%) averaged over 20 illuminations on Multi-PIE

5.3 Other Intra-personal Variations

To further validate the effectiveness of U-3DMM, we evaluate it on the AR database [27].

Database and Protocol. The AR database contains more than 4000 frontal images of 126 subjects with variations in expression, illumination and occlusion. To train the intra-personal component of U-3DMM, we use 10 randomly chosen subjects (5 male and 5 female) from Session 1, with 13 images per subject. Following [10], we randomly chose 100 subjects in Session 1 for testing. In the test set, the neutral images work as the gallery and the others are the probes.

Results. We compare U-3DMM with 3DMM, E-3DMM [10], SRC, and ESRC. In the presence of illumination, occlusion or both, U-3DMM works much better than SRC and ESRC. This conclusion is consistent with that drawn in Sect. 5.1. For expression variations, our U-3DMM outperforms SRC and ESRC, but works worse than E-3DMM. Note that E-3DMM uses a commercial SDK to extract facial features, and the authors [10] do not report face recognition rates using only the shape and texture coefficients that our U-3DMM uses. In addition, our U-3DMM has two advantages over E-3DMM [10]: (1) U-3DMM can potentially model any variation, while E-3DMM is designed specifically to capture pose and expression variations; (2) U-3DMM is more efficient than E-3DMM (Table 5).

Table 5. Recognition rate (%) evaluated on AR database

6 Conclusions

We propose the U-3DMM, which provides a generic linear framework to model complicated intra-personal variations. The linearity of U-3DMM leads to an efficient and accurate fitting. The experimental results demonstrate that U-3DMM achieves very competitive face recognition rates against the state-of-the-art.