Multi-subspace supervised descent method for robust face alignment
Supervised Descent Method (SDM) is one of the leading cascaded regression approaches for face alignment, with state-of-the-art performance and a solid theoretical basis. However, SDM is prone to local optima and tends to average conflicting descent directions. This makes SDM ineffective at covering the complex facial shape space that results from large head poses and rich non-rigid face deformations. In this paper, a novel two-step framework called multi-subspace SDM (MS-SDM) is proposed to equip SDM with a stronger capability for dealing with unconstrained faces. The optimization space is first partitioned with regard to shape variations using k-means. The generated subspaces show semantic significance that correlates highly with head poses. Faces within a given subspace also show compatible shape-appearance relationships. Then, Naive Bayes is applied to conduct robust subspace prediction by considering the relative proximity of each subspace to the sample. This ensures that each sample is allocated to the most appropriate subspace-specific regressor. The proposed method is validated on benchmark face datasets with a mobile facial tracking implementation.
Keywords: Unconstrained face alignment, SDM, Subspace learning, Cascaded regression
Face alignment aims to automatically localize fiducial facial points (or landmarks). It is a fundamental step for many facial analysis tasks, e.g. facial recognition [19, 20], face frontalization [21, 22], expression recognition [11, 31], and face attributes prediction [7, 25]. These tasks are essential to Human-System Interaction (HSI) applications including driver-car interaction, human-robot interaction and mobile applications.
The field of face alignment has witnessed rapid progress in recent years, especially with the application and development of cascaded regression methods [2, 6, 27, 38, 39]. Methods of this kind typically learn a sequence of descent directions from image features that iteratively move an initial shape towards the ground truth. Among the various cascaded regression approaches for face alignment, SDM has become one of the most popular due to its high efficiency and state-of-the-art performance. The approach is also theoretically sound to some extent, with a rigorous explanation from the perspective of optimizing a non-linear problem with Newton’s method.
For x0 ∈ [1:0.2:6] (i.e. from 1 to 6 in steps of 0.2), all initial points can be moved closer to x* with r = −7. Nevertheless, if x0 < 0, e.g. x0 = −1, the point moves farther away from x* with r = −7 (see Fig. 1b).
In fact, compatible descent directions can be learned via SDM only if the initial points are close to each other and also share the same destination. However, this strong prerequisite is very difficult to meet in face alignment, since face images vary in head pose and facial expression, and are therefore expected to have different shape-feature relationships. This also leads to another issue of SDM: the algorithm rests on the fragile assumption that the non-linear feature extraction function (e.g. SIFT) is identical for all face images. As noted in previous work, the feature extraction function is parameterized not only by facial landmark locations but also by the image itself, for example by faces with different head poses and from different subjects.
It can be inferred that one possible cause of the above issues is that the face alignment task occupies multiple optimization subspaces, and these subspaces cannot be explained within a single optimization process. Although SDM has been extensively studied and further developed in the past few years, there are few works on this essential but relatively unexplored problem [8, 28, 29, 32, 35]. Xiong and De la Torre made the same inference as this paper and proposed global SDM (GSDM), which partitions the domain in feature and shape PCA spaces for face tracking. However, that method is inappropriate for face alignment on still images, as picking the suitable domain depends on ground-truth face shapes. The use of PCA also remains a concern, since it might result in unquantified information loss. Recently, Zhang et al. improved GSDM by projecting both the feature and the shape into a mutual sign-correlation subspace. Their method, however, has the same constraint as GSDM. Some other works resort to the multi-view approach, estimating head poses and then aligning the face on a particular view [12, 32]. The performance improves, but the heuristic partition with respect to head poses only is still suboptimal because it neglects other shape deformations and appearance variations. Moreover, how to divide the pose range is a purely empirical step that often requires many attempts.
To solve the aforementioned problems, this paper proposes an efficient and novel alternative optimization subspace learning method, multi-subspace SDM (MS-SDM), which pushes SDM towards unconstrained face alignment. The main contributions of our work are: 1) discovering optimization subspaces with a semantic meaning by applying an elegant unsupervised clustering algorithm, k-means, on both the shape and the feature space; 2) predicting the subspace accurately by considering the relative proximity between the subspace and the sample. The proposed MS-SDM has been validated on challenging datasets that cover a wide range of head poses, facial expressions and facial appearances. Experimental results show the superiority of MS-SDM over SDM and GSDM.
2 Related work
A large number of works have been developed for face alignment which can be divided into two main categories: generative approaches and discriminative approaches.
Generative approaches, such as Active Appearance Models and Constrained Local Models, first construct compact shape and appearance spaces with Principal Component Analysis (PCA), then build a model instance to fit the face image under a single optimization process. Although various improvements have been made, the drawbacks of this kind of approach remain obvious: the expressive power of the built parameter space is limited and the final results depend heavily on the initialization.
Discriminative approaches don’t build a parameter space beforehand, but alternatively they learn a direct mapping from image features to landmark locations [2, 27, 29, 38, 39]. Cascaded regression [2, 27, 38, 39] is a representative discriminative approach which has dominated the face alignment field in recent years due to its high efficiency and the state-of-the-art performance.
2.1 Face alignment with cascaded regression
Starting with a rough initial shape, cascaded regression predicts the shape increment from image features with a series of mapping functions, and updates the shape iteratively. Cao et al. apply boosted ferns to learn both features and non-linear mappings, which produces promising results. In contrast, Xiong et al. propose to use simple linear regression and hand-crafted features to accomplish cascaded regression, an approach named Supervised Descent Method (SDM). Such a simple configuration surprisingly generated state-of-the-art results. Recently, deep learning has also been applied to face alignment. The strong learning ability of deep models and the end-to-end learning mode enable deep learning based methods to produce remarkable performance even on the most challenging datasets [15, 18, 30, 33, 34, 36]. However, deep learning methods typically require a huge amount of training data and very high computational capability, which makes them difficult to deploy on devices with limited resources. Setting aside the ongoing debate between deep learning and traditional methods, this paper makes a trade-off between the efficiency and the accuracy of the algorithm by building on SDM. Readers are referred to the surveys [3, 23] for a comprehensive comparison of mainstream face alignment methods.
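The iterative update described above can be illustrated with a minimal one-dimensional sketch. It is not the implementation used in this paper: the toy feature function `h(x) = x*x`, the scalar "shape" and the helper names `fit_linear`, `train_cascade` and `apply_cascade` are all assumptions made for illustration. Each stage fits a linear map from the feature deviation to the remaining shape increment by least squares, then applies it to every training sample.

```python
import math


def h(x):
    # Toy non-linear "feature extraction" function (an assumption, not SIFT).
    return x * x


def fit_linear(feats, targets):
    """1-D least squares: targets ~ r * feats + b."""
    n = len(feats)
    mf, mt = sum(feats) / n, sum(targets) / n
    var = sum((f - mf) ** 2 for f in feats)
    cov = sum((f - mf) * (t - mt) for f, t in zip(feats, targets))
    r = cov / var if var > 0 else 0.0
    return r, mt - r * mf


def rmse(xs, x_true):
    return math.sqrt(sum((x - t) ** 2 for x, t in zip(xs, x_true)) / len(xs))


def train_cascade(x_true, x_init, n_stages=4):
    """Learn one (r, b) pair per stage; return the stages and training RMSE."""
    xs = list(x_init)
    ys = [h(t) for t in x_true]  # observed features at the true position
    stages, errors = [], [rmse(xs, x_true)]
    for _ in range(n_stages):
        feats = [h(x) - y for x, y in zip(xs, ys)]    # feature deviation
        targets = [t - x for t, x in zip(x_true, xs)]  # remaining increment
        r, b = fit_linear(feats, targets)
        stages.append((r, b))
        xs = [x + r * f + b for x, f in zip(xs, feats)]
        errors.append(rmse(xs, x_true))
    return stages, errors


def apply_cascade(stages, x0, y):
    """Run the learned stages from initial point x0 given observed feature y."""
    x = x0
    for r, b in stages:
        x = x + r * (h(x) - y) + b
    return x
```

Because every stage is a least-squares fit with an intercept, the training RMSE cannot increase from one stage to the next; how quickly it falls depends on how consistent the per-sample descent directions are, which is the issue the remainder of the paper addresses.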
2.2 Face alignment with SDM based approaches
SDM achieves state-of-the-art performance with a very elegant configuration; it has been regarded as an important benchmark method and has triggered numerous new approaches in face alignment. As discussed above, a sequence of generic descent directions can be learned via SDM only if the initializations are close to each other and the feature extraction function has a unique minimum. However, these prerequisites do not hold for faces under unconstrained conditions.
Zhu et al. start each iteration by exploring a shape space rather than locking onto a single initialization. This makes the optimization process less sensitive to poor initializations and can lead to more robust face alignment. Nevertheless, the expressive power of a single regression in each iteration remains a concern. A few studies [12, 32] adopt an intuitive multi-view approach to cover a wider optimization space and achieve good performance. However, defining the optimization space according to head poses only is still sub-optimal, since it neglects other shape deformations and appearance variations. In addition, dividing the head pose range is purely empirical and usually needs many attempts. Xiong et al. theoretically analyze this limitation of SDM and propose Global SDM (GSDM), which partitions the optimization space into several domains based on reduced shape and feature spaces. Although their method works well for face tracking and pose estimation, it is inappropriate for face alignment on still images, as it requires the ground-truth shape during prediction. Meanwhile, the reduced feature and shape spaces might lose important information. To address the limitation of GSDM, Zhu et al. propose to learn a composition from predicted domain-specific shapes. This method performs well for faces with large poses and extreme expressions. Some other works resort to three-dimensional (3D) face modelling [8, 9, 26, 40], which requires additional 3D annotations of the training data. This paper presents an efficient alternative for optimization subspace learning that does not require any additional assumptions.
In this section, the SDM method is recalled first and its limitations are theoretically analysed. Then, the proposed MS-SDM is introduced.
3.1 Supervised descent method
The function h(x, I) is parameterized not only by x but also by the face image I, which depends heavily on head pose, facial expression, facial appearance and illumination. Consequently, R and b may vary across face images. Therefore, although SDM can generate promising face alignment results in ordinary scenarios, it suffers in unconstrained scenarios where faces have large head poses and extreme expressions.
The same problem has been observed in prior work, whose authors propose to partition the original optimization space into several domains based on the reduced shape deviation Δx and feature deviation Δh. They prove that each domain contains a generic descent direction, which moves the initial shape closer to the ground-truth shape for every sample that belongs to the domain, when both of the following conditions hold: 1) h(x, I) is strictly monotonic around x*, and 2) h(x, I) is locally Lipschitz continuous anchored at x* with K (K ≥ 0) as the Lipschitz constant. However, the solution proposed there only satisfies the first condition and is based on the assumption that Δx and Δh are embedded in a lower-dimensional manifold. Moreover, to predict the specific domain that a sample belongs to, the ground-truth shape x* must be given. This is clearly infeasible at the testing stage, as the ground-truth shape is precisely what needs to be predicted.
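For orientation, the two conditions can be written out as follows. This is a sketch under the usual textbook definitions of a strictly monotone operator and of Lipschitz continuity anchored at a point; the exact form and sign convention used in the cited work may differ, and the first inequality is reversed for a strictly decreasing function.

```latex
\begin{align}
  \langle x - x_*,\; h(x, I) - h(x_*, I) \rangle &> 0
    && \text{for all } x \neq x_* \text{ in a neighbourhood of } x_*, \\
  \lVert h(x, I) - h(x_*, I) \rVert_2 &\le K \,\lVert x - x_* \rVert_2,
    && K \ge 0 .
\end{align}
```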
3.2 Multi-subspace SDM
3.2.1 Semantic subspace learning via K-means
Since the face shape update Δx is predicted from the feature deviation Δh, the descent direction pair R and b also describes the hidden relationship between Δx and Δh. Inspired by this intuition, k-means is further applied to Δh to find a feature-based partition of the optimization space. Surprisingly, the generated subspaces are highly consistent with the subspaces obtained from the head-pose point of view. The relevant results are shown in Fig. 3b. This indicates that samples in each subspace have similar shape-feature relationships and can therefore be expected to share a unified descent direction.
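The partition step can be sketched with a plain implementation of Lloyd's k-means. This is an illustrative sketch only: the input vectors stand in for per-sample deviations (Δx or Δh), the function names are hypothetical, and the farthest-point initialisation is a choice made here for reproducibility rather than something taken from the paper.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    # Farthest-point initialisation (an assumption of this sketch).
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = mean_vec(members)
    return centers, labels
```

Each resulting cluster would then receive its own sequence of regressors, trained only on the samples assigned to it.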
3.2.2 Robust subspace prediction with naive Bayes
As the aforementioned subspace learning relies on the ground-truth shape, which is unavailable during testing, the main difficulty of the final shape prediction becomes predicting the subspace that a sample belongs to. A straightforward solution to this problem is a multi-class classifier (e.g. Random Forest, SVM or Naive Bayes) that learns the class label from face appearance features.
In the test phase, a mean face is placed onto the given face bounding box and SIFT features are extracted around each landmark (see Fig. 2). The concatenation of all extracted features is used as the appearance feature for subsequent classification. Random Forest was tested first in our experiments due to its high performance in similar tasks. However, with this approach, a few samples were assigned to a completely incompatible subspace, for example a left-profile face was assigned to a right-profile regressor, which severely degrades the overall prediction accuracy.
The core reason behind this phenomenon is that Random Forest treats all subspaces equally. In particular, during training, it assigns the same loss penalty to every sub-optimal subspace prediction. However, some sub-optimal subspaces provide relatively similar initial-shape-indexed features and can predict shapes similar to those of the optimal one, so they should be penalised less. Therefore, a classification algorithm suited to this task should be able to identify the relative proximity between the sample and each subspace.
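One way to read "relative proximity" is as a per-subspace posterior probability, which Naive Bayes provides directly. The sketch below uses a Gaussian Naive Bayes model; the Gaussian likelihood, the variance floor and the function names are assumptions of this example, since the exact variant is not specified in the text above.

```python
import math


def fit_gaussian_nb(features, labels, var_floor=1e-6):
    """Per-class log prior, feature means and feature variances."""
    model = {}
    n_total = len(labels)
    for c in sorted(set(labels)):
        rows = [f for f, l in zip(features, labels) if l == c]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(zip(*rows), means)
        ]
        model[c] = (math.log(n / n_total), means, variances)
    return model


def posterior(model, x):
    """Posterior probability of each class (subspace) for feature vector x."""
    log_scores = {}
    for c, (log_prior, means, variances) in model.items():
        score = log_prior
        for v, m, s2 in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        log_scores[c] = score
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}
```

With such a model, a sample lying between two neighbouring subspaces receives a graded score for each of them, whereas a distant subspace receives a probability close to zero, which is the behaviour the paragraph above asks for.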
Evaluations are performed on a widely used benchmark dataset, 300 W, and on the NTHU Drowsy Driver Detection (NTHU-DDD) video dataset. The 300 W dataset is a mixture of several well-known benchmark datasets, including AFW, LFPW, HELEN and XM2VTS, and is challenging because its images cover a very wide range of head poses, facial expressions, appearances, occlusions and illumination conditions. It unifies all the annotations with the 68-point mark-up and offers another challenging 135-image dataset named IBUG.
In the experiments, all the training samples from LFPW and HELEN together with the whole of AFW form the training set, which has 3148 images in total. The testing set comprises a common testing set and a challenging testing set, and has 689 images in total. The common testing set is composed of the testing samples from LFPW and HELEN, which have near-frontal head poses. IBUG is regarded as the challenging set, as it generally consists of samples with large head poses and extreme facial expressions. Since the influence of the face detector on the final face alignment results is not considered in this paper, the face bounding boxes provided by 300 W are used.
The prediction error is measured as the average point-to-point Euclidean error normalised by the inter-pupil distance (the Euclidean distance between the eyes’ centres). For simplicity, the ‘%’ is omitted.
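The metric can be written as a short function. This is a minimal sketch: the function name and argument layout are assumptions, the eye centres are passed in explicitly rather than derived from a particular landmark mark-up, and the result is scaled by 100 on the assumption that the reported figures are percentages with the sign dropped.

```python
import math


def normalised_mean_error(pred, gt, left_eye_centre, right_eye_centre):
    """Mean point-to-point error divided by inter-pupil distance, in percent.

    pred, gt: sequences of (x, y) landmark coordinates in the same order.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    inter_pupil = math.dist(left_eye_centre, right_eye_centre)
    if inter_pupil == 0:
        raise ValueError("eye centres must not coincide")
    mean_error = sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)
    return 100.0 * mean_error / inter_pupil
```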
During training, data augmentation similar to that of prior work is applied to enlarge the training data and improve the model’s generalization capability: the face bounding box of each training sample is randomly translated and scaled ten times. As the samples in each subspace relate closely to a specific head pose, the mean shape of each subspace is calculated. Before prediction, each sample is allocated a subspace-specific mean shape, which is closer to the ground-truth shape than the general mean shape. For subspace learning, the number of clusters is varied from 3 to 8 and the corresponding error is calculated. The setting of 5 subspaces is shown to generate the best results.
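The bounding-box perturbation can be sketched as follows. Only the number of copies (ten) comes from the text; the perturbation ranges `max_shift` and `max_scale`, the `(x, y, width, height)` box layout and the function name are assumptions chosen for illustration.

```python
import random


def augment_bbox(bbox, n_copies=10, max_shift=0.05, max_scale=0.1, seed=None):
    """Return n_copies randomly translated and scaled copies of a box.

    bbox is (x, y, width, height) with (x, y) the top-left corner.
    Shifts are fractions of the box size; scaling is about the box centre.
    """
    rng = random.Random(seed)
    x, y, w, h = bbox
    boxes = []
    for _ in range(n_copies):
        scale = 1.0 + rng.uniform(-max_scale, max_scale)
        dx = rng.uniform(-max_shift, max_shift) * w
        dy = rng.uniform(-max_shift, max_shift) * h
        new_w, new_h = w * scale, h * scale
        cx, cy = x + w / 2 + dx, y + h / 2 + dy
        boxes.append((cx - new_w / 2, cy - new_h / 2, new_w, new_h))
    return boxes
```

Each perturbed box yields a different initial shape for the same ground-truth annotation, so the regressors see a wider spread of starting points during training.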
When training the subspace classifier, it was observed that features indexed on multiple initial shapes give higher prediction accuracy than features indexed on a single initial mean shape. This is probably because multiple initial shapes, which cover more points in the face region, generate a larger feature pool and offer more information to the classifier. Therefore, shape-indexed features using all the subspace-specific mean shapes are extracted to train the subspace classifier.
4.1 Comparison with SDM
The released SDM model was trained on private datasets, and the training data has been shown to be an important factor in the final performance of the model. In addition, no off-the-shelf GSDM model has been released. To enable a fair comparison on the same benchmark dataset, we re-implemented SDM and GSDM ourselves. Our implementation achieves detection accuracy close to that of similar implementations reported in state-of-the-art works.
Table 1 Comparison with SDM and GSDM
4.2 Comparison with GSDM
GSDM offers an optimization space partition strategy for SDM which has demonstrated its effectiveness in real-time face tracking. To compare MS-SDM with GSDM, it is assumed that all the ground-truth shapes are known, so that GSDM works even on still images. For both approaches, the subspaces are learned from the training set, and a specific linear regressor is trained for each subspace. For a fair comparison, the optimization space is partitioned into eight subspaces, the same number as reported for GSDM. As shown in Table 1, MS-SDM shows higher detection accuracy than GSDM on both testing sets. Moreover, MS-SDM predicts the subspace without knowing the ground-truth shapes, which GSDM requires.
4.3 Tracking results on driver dataset
4.4 Facial Mobile tracking implementation
With its elegant formulation, SDM shows state-of-the-art performance for face alignment under relatively controlled scenarios. As SDM is a local algorithm and is prone to learning conflicting descent directions during training, it struggles with face images captured under unconstrained scenarios, where faces have large poses and extreme facial expressions. This paper proposes a novel two-step framework, MS-SDM, which pushes SDM closer to unconstrained face alignment. By applying k-means to the shape variations, semantic subspaces that correlate intuitively with head poses are found. Then, using a Naive Bayes classifier, each sample is allocated to the most suitable subspace-specific regressor. The proposed approach is validated on challenging datasets and in a mobile facial tracking application. In future work, we will apply deep learning techniques to extract more informative facial features or to partition the feature-shape relationship into subspaces with clearer semantic meaning.
This work was supported by the EPSRC through the project 4D Facial Sensing and Modelling (EP/N025849/1), the UoP RIDF2017 fund and Emteq (https://emteq.net/), and was supported in part by the Open Fund of the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences (Y6S9011F51).
- 5. Cristinacce D, Cootes TF (2006) Feature detection and tracking with constrained local models. In: BMVC, vol 1, no 2, p 3
- 8. Jourabloo A, Liu X (2015) Pose-invariant 3D face alignment. In: IEEE international conference on computer vision (ICCV), pp 3694–3702
- 9. Jourabloo A, Liu X (2016) Large-pose face alignment via CNN-based dense 3D model fitting. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 4188–4196
- 10. Le V, Brandt J, Lin Z, Bourdev L, Huang TS (2012) Interactive facial feature localization. In: European conference on computer vision. Springer, pp 679–692
- 11. Lian Z, Li Y, Tao J, Huang J, Niu M (2019) Expression analysis based on face regions in real-world conditions. Int J Autom Comput, pp 1–12
- 14. Messer K, Matas J, Kittler J, Luettin J, Maitre G (1999) XM2VTSDB: The extended M2VTS database. In: Second international conference on audio and video-based biometric person authentication 964:965–966
- 16. Sagonas C, Tzimiropoulos G, Zafeiriou S, Pantic M (2013) 300 faces in-the-wild challenge: The first facial landmark localization challenge. In: IEEE international conference on computer vision workshops (ICCV workshop), pp 397–403
- 18. Shao X, Xing J, Lv JJ, Xiao C, Liu P, Feng Y, Cheng C, Si F (2017) Unconstrained face alignment without face detection. In: IEEE conference on computer vision and pattern recognition workshops (CVPR workshop), pp 2069–2077
- 21. Wang Y, Yu H, Dong J, Stevens B, Liu H (2016) Facial expression-aware face frontalization. In: Asian conference on computer vision. Springer, pp 375–388
- 22. Wang Y, Yu H, Dong J, Jian M, Liu H (2017) Cascade support vector regression-based facial expression-aware face frontalization. In: IEEE international conference on image processing (ICIP), pp 2831–2835
- 24. Weng CH, Lai YH, Lai SH (2016) Driver drowsiness detection via a hierarchical temporal deep belief network. In: Asian conference on computer vision. Springer, pp 117–133
- 25. Xia Y, Lou J, Dong J, Li G, Yu H (2018) SDM-based means of gradient for eye center localization. In: IEEE international conference on pervasive intelligence and computing (PiCom), pp 862–867
- 26. Xiao S, Li J, Chen Y, Wang Z, Feng J, Yan S, Kassim AA (2017) 3D-assisted coarse-to-fine extreme-pose facial landmark detection. In: IEEE conference on computer vision and pattern recognition workshops (CVPR workshop), pp 2060–2068
- 27. Xiong X, De la Torre F (2013) Supervised descent method and its applications to face alignment. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 532–539
- 28. Xiong X, De la Torre F (2014) Supervised descent method for solving nonlinear least squares problems in computer vision. arXiv preprint arXiv:1405.0601
- 29. Xiong X, De la Torre F (2015) Global supervised descent method. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2664–2673
- 30. Yang J, Liu Q, Zhang K (2017) Stacked hourglass network for robust facial landmark localisation. In: IEEE conference on computer vision and pattern recognition workshops (CVPR workshop), pp 2025–2033
- 32. Yu X, Lin ZL, Zhang S, Metaxas DN (2016) Nonlinear hierarchical part-based regression for unconstrained face alignment. In: IJCAI, pp 2711–2717
- 33. Zhang J, Shan S, Kan M, Chen X (2014) Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In: European conference on computer vision. Springer, pp 1–16
- 35. Zhang Y, Liu S, Yang X, Shi D, Zhang JJ (2016) Sign-correlation partition based on global supervised descent method for face alignment. In: Asian conference on computer vision. Springer, pp 281–295
- 36. Zhao Y, Tang F, Dong W, Huang F, Zhang X (2018) Joint face alignment and segmentation via deep multi-task learning. Multimed Tools Appl 1–18
- 37. Zhu X, Ramanan D (2012) Face detection, pose estimation, and landmark localization in the wild. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2879–2886
- 38. Zhu S, Li C, Loy CC, Tang X (2015) Face alignment by coarse-to-fine shape searching. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 4998–5006
- 39. Zhu S, Li C, Loy CC, Tang X (2016) Unconstrained face alignment via cascaded compositional learning. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 3409–3417
- 40. Zhu X, Lei Z, Liu X, Shi H, Li SZ (2016) Face alignment across large poses: A 3D solution. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 146–155
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.