1 Introduction

Face alignment aims to automatically localize fiducial facial points (or landmarks). It is a fundamental step for many facial analysis tasks, e.g. face recognition [19, 20], face frontalization [21, 22], expression recognition [11, 31], and face attribute prediction [7, 25]. These tasks are essential to Human-System Interaction (HSI) applications, including driver-car interaction, human-robot interaction and mobile applications.

The field of face alignment has witnessed rapid progress in recent years, especially with the application and development of cascaded regression methods [2, 6, 27, 38, 39]. Methods of this kind typically learn a sequence of descent directions from image features that iteratively move an initial shape towards the ground truth. Among the various cascaded regression approaches for face alignment, SDM [27] has become one of the most popular due to its high efficiency and state-of-the-art performance. The approach is also theoretically sound to some extent, as it can be explained rigorously as optimizing a non-linear problem with Newton’s method.

However, SDM has two main drawbacks: 1) It relies heavily on the initialization and is prone to local optima. SDM is derived from Newton’s method, which converges only to a local optimum. If the initialised shape is far away from the target shape, the algorithm is prone to a poor local optimum (see Fig. 1a for an example). 2) It is likely to learn conflicting descent directions during optimization. As the feature extraction function in face alignment is not easy to describe, a simple function \( h(x)=x^{-1} \) is used to illustrate this. Suppose the aim is to seek the optimal x (x* = 3.5), for which h(x*) = 0.286, from a range of initial values x0. According to SDM, a descent map r can be calculated to move x0 towards x* iteratively using the following equation:

$$ {x}_k={x}_{k-1}-r\left(h\left({x}_{k-1}\right)-h\left({x}_{\ast}\right)\right) $$
(1)
Fig. 1 (a) Failure cases of SDM due to poor initializations. Top row: initial shape, bottom row: results after four iterations. Red points: predicted landmarks, green points: ground-truth landmarks. (b) Initialization points that have conflicting descent directions

For x0 ∈ [1:0.2:6] (0.2 is the interval), all initial points can be moved closer to x* with r = −7. Nevertheless, if x0 < 0, e.g. x0 = −1, the same r = −7 moves the point farther away from x* (see Fig. 1b).
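The toy example can be checked numerically. The following is a minimal sketch, assuming the feature function is h(x) = 1/x (which matches h(3.5) ≈ 0.286); the function names are illustrative and not part of the original method.

```python
def h(x):
    """Toy feature function assumed from the text: h(x) = 1/x."""
    return 1.0 / x


def sdm_step(x, r, x_star):
    """One update of Eq. (1): x_k = x_{k-1} - r * (h(x_{k-1}) - h(x_star))."""
    return x - r * (h(x) - h(x_star))


X_STAR = 3.5  # target value from the text
R = -7.0      # descent map from the text


def moves_closer(x0, r=R, x_star=X_STAR):
    """True if a single update brings x0 strictly closer to x_star."""
    x1 = sdm_step(x0, r, x_star)
    return abs(x1 - x_star) < abs(x0 - x_star)


if __name__ == "__main__":
    # A few positive starting points and one negative starting point.
    for x0 in (1.2, 2.0, 4.0, 6.0, -1.0):
        x1 = sdm_step(x0, R, X_STAR)
        print(f"x0={x0:5.1f} -> x1={x1:8.3f}  closer={moves_closer(x0)}")
```

With a single shared r, the positive starting points listed here move towards x*, while x0 = −1 is pushed away, which is the conflict between descent directions that the example is meant to illustrate.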

In fact, compatible descent directions can be learned via SDM only if the initial points are close to each other and share the same destination. This strong prerequisite is very difficult to meet in face alignment, since face images vary in head pose and facial expression and are therefore expected to have different shape-feature relationships. This also leads to another issue of SDM: the algorithm rests on the fragile assumption that the non-linear feature extraction function (e.g. SIFT [13] or [17]) is identical for all face images. As stated in [28], the feature extraction function is parameterized not only by the facial landmark locations but also by the image itself, e.g. by the head pose and the subject shown.

It can be inferred that one possible cause of the above issues is that the face alignment task spans multiple optimization subspaces, which cannot be explained within a single optimization process. Although SDM has been extensively studied and further developed in the past few years, few works address this essential but relatively unexplored problem [8, 28, 29, 32, 35]. Xiong and De la Torre reached the same conclusion as this paper and proposed a global SDM (GSDM) [29], which partitions the domain in feature and shape PCA spaces for face tracking. However, that method is inappropriate for face alignment on still images, as choosing the suitable domain depends on ground-truth face shapes. The use of PCA also remains a concern, since it might result in unquantified information loss. Recently, Zhang et al. [35] improved GSDM by projecting both the feature and the shape into a mutual sign-correlation subspace. Their method, however, has the same constraint as GSDM. Some other works resort to a multi-view approach, estimating the head pose and then performing face alignment for a particular view [12, 32]. The performance improves, but the heuristic partition with respect to head pose alone is still suboptimal because it neglects other shape deformations and appearance variations. Moreover, dividing the pose range is a purely empirical step that often requires many attempts.

To solve the aforementioned problems, this paper proposes an efficient and novel alternative optimization subspace learning method, multi-subspace SDM (MS-SDM), which extends SDM to unconstrained face alignment. The main contributions of our work are: 1) Discovering optimization subspaces with a semantic meaning by applying an unsupervised clustering algorithm, k-means, to both the shape and the feature space. 2) Predicting the subspace accurately by taking into account the relative proximity between the subspace and the sample. The proposed MS-SDM has been validated on challenging datasets which cover a wide range of head poses, facial expressions and facial appearances. Experimental results show the superiority of MS-SDM over SDM and GSDM.

2 Related work

A large number of methods have been developed for face alignment. They can be divided into two main categories: generative approaches and discriminative approaches.

Generative approaches, such as Active Appearance Models [4] and Constrained Local Models [5], first construct compact shape and appearance spaces with Principal Component Analysis (PCA), then build a model instance that is fitted to the face image under a single optimization process. Although various improvements have been made, the drawbacks of this kind of approach remain obvious: the expressive power of the built parameter space is limited and the final results depend heavily on the initialization.

Discriminative approaches do not build a parameter space beforehand; instead, they learn a direct mapping from image features to landmark locations [2, 27, 29, 38, 39]. Cascaded regression [2, 27, 38, 39] is a representative discriminative approach which has dominated the face alignment field in recent years due to its high efficiency and state-of-the-art performance.

2.1 Face alignment with cascaded regression

Starting with a rough initial shape, cascaded regression predicts the shape increment from image features with a series of mapping functions and updates the shape iteratively. Cao et al. [2] apply boosted ferns to learn both features and non-linear mappings, which yields promising results. In contrast, Xiong et al. [27] propose to use simple linear regression and hand-crafted features to accomplish cascaded regression; the resulting method is named the Supervised Descent Method (SDM). Such a simple configuration surprisingly generated state-of-the-art results. Recently, deep learning has also been applied to face alignment. The strong learning ability of deep models and the end-to-end learning mode enable deep learning based methods to produce remarkable performance even on the most challenging datasets [15, 18, 30, 33, 34, 36]. However, deep learning methods typically require a huge amount of training data and very high computational capability, which makes them difficult to deploy on devices with limited resources. Setting aside the ongoing debate between deep learning and traditional methods, this paper seeks a trade-off between the efficiency and the accuracy of the algorithm by building on SDM. Readers are referred to the surveys [3, 23] for a comprehensive comparison of mainstream face alignment methods.

2.2 Face alignment with SDM based approaches

SDM produces state-of-the-art performance with a very elegant configuration; it has been regarded as an important benchmark method and has triggered numerous new approaches in face alignment. As discussed above, a sequence of generic descent directions can be learned via SDM only if the initializations are close to each other and the feature extraction function has a unique minimum. However, these prerequisites do not hold for faces under unconstrained conditions.

In [38], Zhu et al. start each iteration by exploring a shape space rather than committing to a single initialization. This makes the optimization process less sensitive to poor initializations and can lead to more robust face alignment. Nevertheless, the expressive power of a single regression in each iteration remains a concern. A few studies [12, 32] adopt an intuitive multi-view approach to cover a wider optimization space and achieve good performance. However, defining the optimization space according to head pose alone is still sub-optimal, since it neglects other shape deformations and appearance variations. In addition, dividing the head pose range is purely empirical and usually needs many attempts. Xiong et al. [29] theoretically analyze this limitation of SDM and propose Global SDM (GSDM), which partitions the optimization space into several domains based on reduced shape and feature representations. Although their method works well for face tracking and pose estimation, it is inappropriate for face alignment on still images, as it requires the ground-truth shape during prediction. Moreover, the reduced feature and shape spaces might lose some important information. To address the limitation of GSDM, Zhu et al. [39] propose to learn a composition from predicted domain-specific shapes. This method performs well for faces with large poses and extreme expressions. Some other works resort to three-dimensional (3D) face modelling [8, 9, 26, 40], which requires additional 3D annotations of the training data. This paper presents an efficient alternative for optimization subspace learning that does not require any additional assumptions.

3 Methodology

In this section, the SDM method is recalled first and its limitations are theoretically analysed. Then, the proposed MS-SDM is introduced.

3.1 Supervised descent method

SDM converts the face alignment task, which is originally a non-linear least squares problem, into a simple least squares problem. It avoids computing the Jacobian and Hessian by learning the descent directions in a supervised manner, which significantly reduces the algorithm’s complexity while still producing state-of-the-art performance. Specifically, given a face image I and the initial facial landmark coordinates x0, face alignment can be framed as minimizing the following function over Δx:

$$ f\left({\mathbf{x}}_0+\Delta \mathbf{x}\right)={\left\Vert h\left({\mathbf{x}}_0+\Delta \mathbf{x},I\right)-h\left({\mathbf{x}}_{\ast },I\right)\right\Vert}_2^2 $$
(2)

where h(x, I) represents the SIFT (or HOG) features extracted around the landmark locations x of image I, and x* represents the ground-truth landmark locations. Following Newton’s method, a second-order Taylor expansion approximates (2) as:

$$ f\left({\mathbf{x}}_0+\Delta \mathbf{x}\right)\approx f\left({\mathbf{x}}_0\right)+{\mathbf{J}}_f{\left({\mathbf{x}}_0\right)}^T\Delta \mathbf{x}+\frac{1}{2}\Delta {\mathbf{x}}^T{\mathbf{H}}_f\left({\mathbf{x}}_0\right)\Delta \mathbf{x} $$
(3)

where Jf (x0) and Hf (x0) are the Jacobian and Hessian matrices of f evaluated at x0. Differentiating (3) with respect to Δx and setting the derivative to zero yields:

$$ {\displaystyle \begin{array}{rl}\Delta \mathbf{x}&=-{\mathbf{H}}_f{\left({\mathbf{x}}_0\right)}^{-1}{\mathbf{J}}_f\left({\mathbf{x}}_0\right)\\ {}&=-2{\mathbf{H}}_f{\left({\mathbf{x}}_0\right)}^{-1}{\mathbf{J}}_h^{T}\left({\mathbf{x}}_0\right)\left(h\left({\mathbf{x}}_0,I\right)-h\left({\mathbf{x}}_{\ast },I\right)\right)\\ {}&=-2{\mathbf{H}}_f{\left({\mathbf{x}}_0\right)}^{-1}{\mathbf{J}}_h^{T}\left({\mathbf{x}}_0\right)h\left({\mathbf{x}}_0,I\right)+2{\mathbf{H}}_f{\left({\mathbf{x}}_0\right)}^{-1}{\mathbf{J}}_h^{T}\left({\mathbf{x}}_0\right)h\left({\mathbf{x}}_{\ast },I\right)\end{array}} $$
(4)
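
The step from the first to the second line of (4) uses the gradient of the squared-norm objective in (2). As a worked intermediate step, and assuming for the sake of the derivation that h is differentiable at x0, the chain rule gives:

$$ {\mathbf{J}}_f\left({\mathbf{x}}_0\right)=2{\mathbf{J}}_h^{T}\left({\mathbf{x}}_0\right)\left(h\left({\mathbf{x}}_0,I\right)-h\left({\mathbf{x}}_{\ast },I\right)\right) $$

where Jh (x0) denotes the Jacobian of h with respect to x evaluated at x0. Substituting this expression into the first line of (4) gives the second line.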

According to (4), computing the descent direction Δx requires either that h(x, I) be twice differentiable or that numerical approximations of the Jacobian and Hessian be available. These requirements are difficult to meet in practice: 1) SIFT and HOG features are non-differentiable image operators; 2) numerically estimating the Jacobian or the Hessian in (4) is computationally expensive, since the Hessian matrix can be large and inverting it has \( O(p^3) \) time complexity and \( O(p^2) \) space complexity, where p is the dimension of the parameters to estimate [28]. Instead, SDM uses a single pair R and b, shared by all face images, to represent \( -2{\mathbf{H}}_f^{-1}{\mathbf{J}}_h^{T} \) and \( 2{\mathbf{H}}_f^{-1}{\mathbf{J}}_h^{T}h\left({\mathbf{x}}_{\ast },I\right) \), respectively; this pair is referred to as the descent direction. R and b define a linear mapping between Δx and h(x0, I), which can be learned from the training set by minimizing:

$$ {\sum}_{i=1}^N{\left\Vert \Delta {\mathbf{x}}_{\ast}^i-\mathbf{R}h\left({\mathbf{x}}_0^i,{I}_i\right)-\mathbf{b}\right\Vert}_2^2 $$
(5)

where N is the number of images in the training set and \( \Delta {\boldsymbol{x}}_{\ast}^i={\boldsymbol{x}}_{\ast}^i-{\boldsymbol{x}}_0^i \). Since the ground-truth shape can rarely be reached in a single update step, a sequence of such descent directions, denoted {Rk} and {bk}, is learned during training. Then, for a new face image, the shape update in each iteration k is calculated as:

$$ \Delta {\mathbf{x}}_k={\mathbf{R}}_kh\left({\mathbf{x}}_{k-1},I\right)+{\mathbf{b}}_k $$
(6)
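
The training of the cascade in (5) and its application in (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature extractor is a hypothetical stand-in (a `tanh` of the offset to the hidden true shape, playing the role of SIFT features that depend on the image), a small ridge term is added for numerical stability, and all function names are invented for the example.

```python
import numpy as np


def fit_stage(features, targets, ridge=1e-3):
    """Least-squares fit of Eq. (5) with a small ridge term (an assumption).

    features: (N, d) array of h(x, I); targets: (N, p) array of x_star - x.
    Returns R with shape (p, d) and b with shape (p,).
    """
    n, d = features.shape
    A = np.hstack([features, np.ones((n, 1))])   # extra column for the bias b
    reg = ridge * np.eye(d + 1)
    reg[-1, -1] = 0.0                            # the bias is not penalized
    W = np.linalg.solve(A.T @ A + reg, A.T @ targets)
    return W[:-1].T, W[-1]


def train_cascade(extract, x0, x_star, n_stages=4):
    """Learn {R_k}, {b_k}; extract maps shapes (N, p) to features (N, d)."""
    x = x0.copy()
    stages = []
    for _ in range(n_stages):
        feats = extract(x)
        R, b = fit_stage(feats, x_star - x)
        x = x + feats @ R.T + b                  # Eq. (6), then shape update
        stages.append((R, b))
    return stages, x


def apply_cascade(extract, x0, stages):
    """Run the learned cascade from the initial shapes x0."""
    x = x0.copy()
    for R, b in stages:
        x = x + extract(x) @ R.T + b
    return x


def demo(seed=0, n=200, p=4):
    """Toy run: returns mean error before training, after training, after replay."""
    rng = np.random.default_rng(seed)
    x_star = rng.normal(size=(n, p))             # hidden ground-truth shapes
    x0 = np.zeros((n, p))                        # every sample starts at the mean
    # Hypothetical stand-in for image features around the current shape.
    extract = lambda x: np.tanh(0.5 * (x - x_star))
    stages, x_train = train_cascade(extract, x0, x_star)
    x_replay = apply_cascade(extract, x0, stages)
    err = lambda x: np.linalg.norm(x - x_star, axis=1).mean()
    return err(x0), err(x_train), err(x_replay)


if __name__ == "__main__":
    print(demo())
```

In this toy setting every sample shares the same shape-feature relationship, so a single pair (R, b) per stage is sufficient; the limitation discussed next arises when that relationship differs between images.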

The function h(x, I) is parameterized not only by x but also by the face image [28], and therefore depends strongly on head pose, facial expression, facial appearance and illumination. Consequently, the appropriate R and b may vary across face images. Therefore, although SDM can generate promising face alignment results in ordinary scenarios, it degrades in unconstrained scenarios where faces have large head poses and extreme expressions.

In [29], the authors observe the same problem. They propose to partition the original optimization space into several domains based on the reduced shape deviation Δx and feature deviation Δh. They prove that each domain contains a generic descent direction which moves the initial shape closer to the ground-truth shape for every sample that belongs to it when both of the following conditions hold: 1) h(x, I) is strictly monotonic around x* and 2) h(x, I) is locally Lipschitz continuous anchored at x* with K (K ≥ 0) as the Lipschitz constant. However, the solution proposed in [29] only satisfies the first condition and is based on the assumption that Δx and Δh are embedded in a lower-dimensional manifold. Moreover, to predict the specific domain that a sample belongs to, the ground-truth shape x* must be given. This is infeasible during testing, as the ground-truth shape is precisely what needs to be predicted.

3.2 Multi-subspace SDM

To address the problems mentioned above, an alternative two-step framework, MS-SDM (see Fig. 2), is proposed. It first learns subspaces with semantic meanings from the original optimization space via k-means. Then, for each subspace, a dedicated linear regressor from face features to the shape update is learned. During testing, a sample is assigned to a subspace by a pre-trained Naive Bayes classifier. The corresponding subspace-specific regressor then updates the shape iteratively as:

$$ \Delta {\mathbf{x}}_k={\mathbf{R}}_{k,s}h\left({\mathbf{x}}_{k-1},I\right)+{\mathbf{b}}_{k,s} $$
(7)

where s represents the subspace label.
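
The test-time update in (7) amounts to selecting one cascade of regressors by the predicted label s and running it. The sketch below assumes the label has already been predicted and that the regressors are stored in a dictionary keyed by subspace; the function name, the container layout and the toy feature function are illustrative assumptions.

```python
import numpy as np


def ms_sdm_predict(extract, x0, subspace, regressors):
    """Apply Eq. (7) with the cascade of the given subspace.

    extract:    callable mapping a shape vector to a feature vector h(x, I)
    x0:         initial shape vector
    subspace:   predicted subspace label s
    regressors: dict mapping s to a list of (R_{k,s}, b_{k,s}) pairs
    """
    x = np.asarray(x0, dtype=float).copy()
    for R, b in regressors[subspace]:
        x = x + R @ extract(x) + b   # delta_x_k = R_{k,s} h(x_{k-1}, I) + b_{k,s}
    return x


if __name__ == "__main__":
    target = np.array([1.0, 2.0])
    extract = lambda x: x - target   # hypothetical linear toy feature
    eye = np.eye(2)
    regressors = {
        0: [(-0.5 * eye, np.zeros(2))] * 3,  # halves the offset at each stage
        1: [(-1.0 * eye, np.zeros(2))],      # removes the offset in one stage
    }
    print(ms_sdm_predict(extract, [9.0, -6.0], 0, regressors))
    print(ms_sdm_predict(extract, [9.0, -6.0], 1, regressors))
```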

Fig. 2 The pipeline of MS-SDM

3.2.1 Semantic subspace learning via K-means

To learn better optimization subspaces, samples which have similar regression targets Δx are assumed to fall inside the same optimization space and to have compatible descent directions. The classic clustering algorithm k-means is then applied to the Δx of all training samples to automatically find the key facial shape variations and divide the original training set into several subsets. In order to preserve all the useful information hidden in the shape space, the initial Δx of each sample is used during the clustering process. As shown in Fig. 3a, subsets generated in this way correlate strongly with head pose. It can also be observed that each subset relates to a particular kind of head pose, such as left-profile face, right-profile face, left-rolling face and right-rolling face.
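
The clustering step described above can be sketched with a plain implementation of Lloyd's k-means applied to the Δx vectors. This is an illustrative sketch only: the farthest-point initialisation is an assumption made for reproducibility (the initialisation used in the paper is not stated), and the two synthetic "pose" groups are toy data.

```python
import numpy as np


def kmeans(data, k, n_iter=50):
    """Lloyd's k-means with farthest-point initialisation (an assumption).

    data: (N, p) array, one shape deviation delta_x per row.
    Returns integer labels of shape (N,) and centres of shape (k, p).
    """
    data = np.asarray(data, dtype=float)
    centres = [data[0]]
    for _ in range(1, k):
        # Squared distance of every sample to its nearest chosen centre.
        d2 = np.min([((data - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(data[d2.argmax()])
    centres = np.array(centres, dtype=float)

    labels = np.zeros(len(data), dtype=int)
    for it in range(n_iter):
        d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                               # assignments are stable
        labels = new_labels
        for j in range(k):
            members = data[labels == j]
            if len(members):                    # keep the old centre if empty
                centres[j] = members.mean(axis=0)
    return labels, centres


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic groups of shape deviations, e.g. two opposite head poses.
    group_a = 3.0 + rng.uniform(-0.5, 0.5, size=(50, 4))
    group_b = -3.0 + rng.uniform(-0.5, 0.5, size=(50, 4))
    labels, centres = kmeans(np.vstack([group_a, group_b]), k=2)
    print(labels[:5], labels[-5:])
    print(centres.round(2))
```

Each resulting subset would then receive its own sequence of regressors, and the cluster centres can serve as subspace-specific mean deviations.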

Fig. 3 Comparison between subspaces learned from (a) Δx and (b) Δh. Each row represents a subset and contains three example images and the mean shape of all the samples in the subset. The number of k-means clusters is set to 5.

Since the face shape update Δx is predicted from the feature deviation Δh, the descent direction pair R and b also describes the hidden relationship between Δx and Δh. Following this intuition, k-means is further applied to Δh to find a feature-based partition of the optimization space. Surprisingly, the generated subspaces are highly consistent with the subspaces obtained from the head pose’s point of view. The relevant results are shown in Fig. 3b. This indicates that samples in each subspace have similar shape-feature relationships and can therefore be expected to share a unified descent direction.

3.2.2 Robust subspace prediction with naive Bayes

As the aforementioned subspace learning relies on the ground-truth shape, which is unavailable during testing, the main difficulty of the final shape prediction becomes predicting the subspace to which a sample belongs. A straightforward solution to this problem is a multi-class classifier (e.g. Random Forest, SVM or Naive Bayes) which learns the class label from face appearance features.

In the test phase, a mean face is placed onto the given face bounding box and SIFT features are extracted around each landmark (see Fig. 2). The concatenation of all extracted features is regarded as the appearance feature for subsequent classification. Random Forest was first tested in our experiment due to its high performance in similar tasks. However, with this approach a few samples were assigned to a completely incompatible subspace (for example, a left-profile face was assigned a right-profile view regressor), which severely degrades the overall prediction accuracy.

The core reason behind this phenomenon is that Random Forest treats all subspaces equally. In particular, during training it assigns the same loss penalty to every sub-optimal subspace prediction. However, some sub-optimal subspaces provide relatively similar initial-shape-indexed features and can predict shapes similar to those of the optimal one, so they should be penalized less. Therefore, a classification algorithm suited to this task should be able to capture the relative proximity between the sample and the subspace.

Naive Bayes appears to be a good option for this problem. A Naive Bayes classifier is the function that assigns a class label y = Ck for some k as follows:

$$ y=\underset{k\in \left\{1,\dots, K\right\}}{\arg \max}\; p\left({C}_k\right)\prod_{i=1}^n p\left({x}_i\mid {C}_k\right) $$
(8)

where x = {x1, …, xn} represents the feature vector of a sample, p(Ck) is the prior probability of class Ck, and p(xi|Ck) is the class-conditional likelihood of the feature xi given class Ck. Because the Naive Bayes classifier assumes that each feature xi is conditionally independent of every other feature xj (j ≠ i) given the class, p(x|Ck) equals the product of all p(xi|Ck). The likelihood p(x|Ck) can be interpreted as a measure of proximity between the current sample and the class centre: it is small if the sample is far away from the class centre and large if the sample is close to it. Since p(x|Ck) enters the maximization in (8) directly, the relative proximity between the sample and each class is naturally embedded in the Naive Bayes classifier, which helps to avoid assigning a sample to an incompatible subspace.
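
The decision rule in (8) can be illustrated with a small Naive Bayes classifier. The sketch below assumes Gaussian class-conditional densities for each feature, which the text does not specify, and works in log space to avoid numerical underflow; the class name and the toy data are hypothetical.

```python
import numpy as np


class GaussianNaiveBayes:
    """Naive Bayes with per-feature Gaussian likelihoods (an assumption)."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_prior_ = np.log([(y == c).mean() for c in self.classes_])
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # A small constant keeps the variances strictly positive.
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-6
        return self

    def log_joint(self, X):
        """log p(C_k) + sum_i log p(x_i | C_k) for each sample and class, (N, K)."""
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - self.mean_[None, :, :]
        log_lik = -0.5 * (np.log(2.0 * np.pi * self.var_)[None, :, :]
                          + diff ** 2 / self.var_[None, :, :]).sum(axis=2)
        return self.log_prior_[None, :] + log_lik

    def predict(self, X):
        """Arg-max over classes, i.e. the rule in Eq. (8) in log space."""
        return self.classes_[self.log_joint(X).argmax(axis=1)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centres = np.array([[-4.0, -4.0], [0.0, 0.0], [4.0, 4.0]])
    X = np.vstack([c + rng.uniform(-0.5, 0.5, size=(50, 2)) for c in centres])
    y = np.repeat([0, 1, 2], 50)
    model = GaussianNaiveBayes().fit(X, y)
    query = np.array([[-3.5, -3.5]])
    print(model.predict(query), model.log_joint(query).round(1))
```

For a query near the first class centre, the log joint score decreases with the distance to each class centre, so a neighbouring class scores higher than a distant one; this is the relative-proximity behaviour described above.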

4 Experiments

Dataset

Evaluations are performed on a widely used benchmark dataset, 300 W [16], and on the NTHU Drowsy Driver Detection (NTHU-DDD) video dataset [24]. The 300 W dataset is a mixture of several well-known benchmark datasets, including AFW [37], LFPW [1], HELEN [10] and XM2VTS [14], and is challenging because its images cover a very wide range of head poses, facial expressions, appearances, occlusions and illuminations. It unifies all the annotations with the 68-point mark-up and offers another challenging 135-image dataset named IBUG.

During the experiment, all the training samples from LFPW and HELEN together with the whole of AFW form the training set, which has 3148 images in total. The testing set comprises a common testing set and a challenging testing set, with 689 images in total. The common testing set is composed of the testing samples from LFPW and HELEN, which have near-frontal head poses. IBUG is regarded as the challenging set, as it consists mainly of samples with large head poses and extreme facial expressions. Since the face detector’s influence on the final face alignment results is not considered in this paper, the prescribed face bounding boxes provided by 300 W are used.

Evaluation metric

The prediction error is measured as the average point-to-point Euclidean error normalised by the inter-pupil distance (the Euclidean distance between the eyes’ centres). For simplicity, the ‘%’ sign is omitted.
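
The metric can be computed as in the sketch below. Two details are assumptions rather than statements from the text: each eye centre is approximated by the mean of the six eye-contour landmarks, and the eye landmarks are taken to occupy the 0-based indices 36–41 and 42–47 of the 68-point mark-up.

```python
import numpy as np

# Assumed 0-based eye-contour indices in the 68-point mark-up.
RIGHT_EYE = slice(36, 42)
LEFT_EYE = slice(42, 48)


def normalised_mean_error(pred, gt):
    """Mean point-to-point error divided by the inter-pupil distance, in percent.

    pred, gt: (68, 2) arrays of predicted and ground-truth landmarks.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Eye centres approximated by the mean of the eye-contour points.
    inter_pupil = np.linalg.norm(gt[LEFT_EYE].mean(axis=0) - gt[RIGHT_EYE].mean(axis=0))
    per_point = np.linalg.norm(pred - gt, axis=1)
    return 100.0 * per_point.mean() / inter_pupil


if __name__ == "__main__":
    gt = np.zeros((68, 2))
    gt[LEFT_EYE] = [10.0, 0.0]                 # eye centres 10 pixels apart
    pred = gt + np.array([3.0, 4.0])           # every landmark is off by 5 pixels
    print(normalised_mean_error(pred, gt))
```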

Implementation

During training, data augmentation similar to that in [27] is applied to enlarge the training data and improve the model’s generalization capability: the face bounding box of each training sample is randomly translated and scaled ten times. As the samples in each subspace relate closely to a specific head pose, the mean shape of each subspace is calculated. Before prediction, each sample is allocated a subspace-specific mean shape, which is closer to the ground-truth shape than the general mean shape. For subspace learning, the number of clusters is varied from 3 to 8 and the corresponding error is calculated. The setting of 5 subspaces generates the best results.
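
The bounding-box augmentation described above can be sketched as follows. The perturbation ranges (a shift of up to 10% of the box size and a scale between 0.9 and 1.1) are placeholder assumptions, since the text does not give the magnitudes; the function name is likewise illustrative.

```python
import random


def augment_box(box, n=10, max_shift=0.1, scale_range=(0.9, 1.1), seed=None):
    """Return n randomly translated and scaled copies of a bounding box.

    box: (x, y, w, h) with (x, y) the top-left corner.
    max_shift and scale_range are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)
    x, y, w, h = box
    out = []
    for _ in range(n):
        s = rng.uniform(*scale_range)
        dx = rng.uniform(-max_shift, max_shift) * w
        dy = rng.uniform(-max_shift, max_shift) * h
        new_w, new_h = s * w, s * h
        # Scale about the box centre, then translate the centre.
        cx, cy = x + w / 2 + dx, y + h / 2 + dy
        out.append((cx - new_w / 2, cy - new_h / 2, new_w, new_h))
    return out


if __name__ == "__main__":
    for b in augment_box((100.0, 50.0, 200.0, 200.0), seed=0):
        print(tuple(round(v, 1) for v in b))
```

The mean shape would then be placed into each perturbed box to obtain ten different initial shapes per training image.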

During the training of the subspace classifier, it was observed that features indexed on multiple initial shapes yield higher prediction accuracy than features indexed on a single initial mean shape. This is probably because multiple initial shapes, which cover more points on the face region, generate a larger feature pool and offer more information to the classifier. Therefore, shape-indexed features using all the subspace-specific mean shapes are extracted to train the subspace classifier.

4.1 Comparison with SDM

The released model of SDM was trained on private datasets, and the training data has been shown to be an important factor in the final performance of the model. Moreover, no off-the-shelf GSDM model has been released. To enable a fair comparison on the same benchmark dataset, we re-implemented SDM and GSDM ourselves. Our implementation achieves a detection accuracy close to that of similar implementations reported in state-of-the-art works [34].

As shown in Table 1, the proposed MS-SDM outperforms SDM on all testing sets, especially on the challenging set. The challenging set contains many samples with large head poses and extreme facial expressions, whose descent directions conflict with those of near-frontal faces. As SDM can only learn an average descent direction, which is biased towards the descent direction shared by the majority of samples (near-frontal faces), the learned descent direction cannot handle the minority of challenging samples. In contrast, MS-SDM classifies each sample into a subspace in which samples share similar descent directions, so that even challenging samples receive an effective descent direction. Figure 4 presents some example results which illustrate the advantage of MS-SDM over SDM.

Table 1 Comparison with SDM and GSDM
Fig. 4 Example results from the testing set

4.2 Comparison with GSDM

GSDM offers an optimization space partition strategy for SDM which has demonstrated its effectiveness in real-time face tracking. To compare MS-SDM with GSDM on still images, GSDM is given access to all the ground-truth shapes, which it needs in order to select a domain. For both approaches, the subspaces are learned from the training set, and a specific linear regressor is trained for each subspace. For a fair comparison, the optimization space is partitioned into eight subspaces, the same number as reported in [29]. As shown in Table 1, MS-SDM shows higher detection accuracy than GSDM on both testing sets. Moreover, at test time MS-SDM assigns subspaces without knowing the ground-truth shapes, which GSDM requires.

4.3 Tracking results on driver dataset

Figure 5 shows tracking results of our method on the NTHU-DDD video dataset [24]. The detected facial landmarks can support driver drowsiness detection and, more broadly, facial analysis of drivers aimed at reducing car accidents.

Fig. 5 Tracking results on the NTHU Drowsy Driver Detection (NTHU-DDD) video dataset [24]

4.4 Mobile facial tracking implementation

Based on MS-SDM, an Android facial tracking application was developed to track the user’s face with 66 landmarks in real time. The application can robustly track the face over a large range of head poses and facial expressions (see Fig. 6), while its low hardware requirements allow it to run smoothly on an Android smartphone. It can also benefit many other mobile applications, such as automated face makeup, personalised emoji generation and objective facial functionality assessment.

Fig. 6 Screenshots of the facial tracking mobile application based on MS-SDM

5 Conclusion

With its elegant formulation, SDM achieves state-of-the-art performance for face alignment under relatively controlled scenarios. As SDM is a local algorithm and prone to learning conflicting descent directions during training, it struggles with face images captured under unconstrained scenarios, where faces have large poses and extreme facial expressions. This paper proposes a novel two-step framework, MS-SDM, which pushes SDM closer to unconstrained face alignment. By applying k-means to the shape variations, semantic subspaces which correlate intuitively with head poses are found. Then, using a Naive Bayes classifier, each sample is allocated the most suitable subspace-specific regressor. The proposed approach is validated on challenging datasets and in a mobile facial tracking application. In future work, we will apply deep learning techniques to extract more informative facial features or to partition the feature-shape relationship into subspaces with clearer semantic meaning.