In this section, the SDM method is recalled first and its limitations are theoretically analysed. Then, the proposed MS-SDM is introduced.
Supervised descent method
SDM recasts face alignment, which is originally a non-linear least squares problem, as a sequence of simple linear least squares problems. By learning descent directions in a supervised manner, it avoids computing the Jacobian and Hessian, which significantly reduces the algorithm's complexity while still achieving state-of-the-art performance. Specifically, given a face image I and initial facial landmark coordinates x0, face alignment can be framed as minimizing the following function over Δx:
$$ f\left({\mathbf{x}}_0+\Delta \mathbf{x}\right)={\left\Vert h\left({\mathbf{x}}_0+\Delta \mathbf{x},I\right)-h\left({\mathbf{x}}_{\ast },I\right)\right\Vert}_2^2 $$
(2)
where h(x, I) represents the SIFT (or HOG) features extracted around the landmark locations x of image I, and x* represents the ground-truth landmark locations. Following Newton's method, with a second-order Taylor expansion, (2) can be approximated as:
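For concreteness, a minimal sketch of one possible feature mapping h(x, I) is given below: HOG descriptors computed on fixed-size patches centred at the current landmark estimates and concatenated into a single vector. The patch size and HOG parameters are illustrative assumptions (using scikit-image), not the exact settings used in this work.

```python
# Sketch of a feature mapping h(x, I): HOG descriptors from fixed-size patches
# around the current landmark estimates, concatenated into one vector.
# Patch size and HOG parameters are illustrative assumptions.
import numpy as np
from skimage.feature import hog

def h(landmarks, image, patch_size=32):
    """landmarks: (L, 2) array of (row, col) coordinates; image: 2-D grayscale array."""
    half = patch_size // 2
    padded = np.pad(image, half, mode="edge")  # avoid patches falling off the border
    descriptors = []
    for (r, c) in np.round(landmarks).astype(int):
        # After padding by `half`, this slice is centred at (r, c) of the original image.
        patch = padded[r:r + patch_size, c:c + patch_size]
        descriptors.append(hog(patch, orientations=9,
                               pixels_per_cell=(8, 8),
                               cells_per_block=(2, 2)))
    return np.concatenate(descriptors)
```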
$$ f\left({\mathbf{x}}_0+\Delta \mathbf{x}\right)\approx f\left({\mathbf{x}}_0\right)+{\mathbf{J}}_f{\left({\mathbf{x}}_0\right)}^{\mathrm{T}}\Delta \mathbf{x}+\frac{1}{2}\Delta {\mathbf{x}}^{\mathrm{T}}{\mathbf{H}}_f\left({\mathbf{x}}_0\right)\Delta \mathbf{x} $$
(3)
where Jf(x0) and Hf(x0) are the Jacobian and Hessian matrices of f evaluated at x0. Differentiating (3) with respect to Δx and setting the derivative to zero yields:
$$ \begin{aligned}\Delta \mathbf{x}&=-{\mathbf{H}}_f{\left({\mathbf{x}}_0\right)}^{-1}{\mathbf{J}}_f\left({\mathbf{x}}_0\right)\\ &=-2{\mathbf{H}}_f{\left({\mathbf{x}}_0\right)}^{-1}{\mathbf{J}}_h^{\mathrm{T}}\left({\mathbf{x}}_0\right)\left(h\left({\mathbf{x}}_0,I\right)-h\left({\mathbf{x}}_{\ast },I\right)\right)\\ &=-2{\mathbf{H}}_f{\left({\mathbf{x}}_0\right)}^{-1}{\mathbf{J}}_h^{\mathrm{T}}\left({\mathbf{x}}_0\right)h\left({\mathbf{x}}_0,I\right)+2{\mathbf{H}}_f{\left({\mathbf{x}}_0\right)}^{-1}{\mathbf{J}}_h^{\mathrm{T}}\left({\mathbf{x}}_0\right)h\left({\mathbf{x}}_{\ast },I\right)\end{aligned} $$
(4)
According to (4), computing the descent direction Δx requires h(x, I) to be twice differentiable, or requires numerical approximations of the Jacobian and Hessian. However, these requirements are difficult to meet in practice: 1) SIFT and HOG features are non-differentiable image operators; 2) numerically estimating the Jacobian or the Hessian in (4) is computationally expensive, since the Hessian can be very large and inverting it costs O(p³) time and O(p²) space, where p is the dimension of the parameters to estimate [28]. Instead, SDM uses a single pair R and b, shared across all face images, to stand for \( -2{\mathbf{H}}_f^{-1}{\mathbf{J}}_h^{\mathrm{T}} \) and \( 2{\mathbf{H}}_f^{-1}{\mathbf{J}}_h^{\mathrm{T}}h\left({\mathbf{x}}_{\ast },I\right) \), which are referred to as the generic descent direction and bias term. R and b define a linear mapping from h(x0, I) to Δx, which can be learned from the training set by minimizing:
$$ {\sum}_{i=1}^N{\left\Vert \Delta {\mathbf{x}}_{\ast}^i-\mathbf{R}h\left({\mathbf{x}}_0^i,{I}_i\right)-\mathbf{b}\right\Vert}_2^2 $$
(5)
where N is the number of images in the training set and \( \Delta {\mathbf{x}}_{\ast}^i={\mathbf{x}}_{\ast}^i-{\mathbf{x}}_0^i \). Since the ground-truth shape is rarely reached in a single update step, a sequence of such descent directions, denoted {Rk} and {bk}, is learned during training. Then, for a new face image, the shape update in each iteration k is calculated as:
$$ \Delta {\mathbf{x}}_k={\mathbf{R}}_kh\left({\mathbf{x}}_{k-1},I\right)+{\mathbf{b}}_k $$
(6)
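The following sketch illustrates how the cascade {Rk, bk} could be learned and applied. Each stage solves the linear least-squares problem in (5) in closed form and then applies (6) to every training shape before the next stage is learned. The feature function h, the bias handling and the number of stages K are illustrative assumptions rather than the exact training procedure used here.

```python
# Sketch of SDM training (Eq. 5) and inference (Eq. 6).
# Each stage is an ordinary least-squares fit from features to shape residuals.
import numpy as np

def train_sdm(images, gt_shapes, init_shapes, h, K=4):
    shapes = [s.copy() for s in init_shapes]
    stages = []
    for _ in range(K):
        # Stack per-sample features (with a constant 1 so the bias b is absorbed)
        # and the corresponding shape residuals Δx* = x* - x.
        Phi = np.stack([np.append(h(x, I), 1.0) for x, I in zip(shapes, images)])
        D = np.stack([(x_star - x).ravel() for x_star, x in zip(gt_shapes, shapes)])
        W, *_ = np.linalg.lstsq(Phi, D, rcond=None)   # minimizes Eq. (5)
        R, b = W[:-1].T, W[-1]
        stages.append((R, b))
        # Apply Eq. (6) to all training samples before learning the next stage.
        shapes = [x + (R @ h(x, I) + b).reshape(x.shape)
                  for x, I in zip(shapes, images)]
    return stages

def predict_sdm(image, init_shape, h, stages):
    x = init_shape.copy()
    for R, b in stages:                               # Eq. (6), one update per stage
        x = x + (R @ h(x, image) + b).reshape(x.shape)
    return x
```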
The function h(x, I) is parameterized not only by x but also by the face image [28], and therefore depends strongly on head pose, facial expression, facial appearance and illumination. Consequently, the optimal R and b may vary across different face images. Hence, although SDM generates promising face alignment results in ordinary scenarios, it suffers in unconstrained scenarios where faces exhibit large head poses and extreme expressions.
In [29], the authors observe the same problem. They propose to partition the original optimization space into several domains based on the dimensionality-reduced shape deviation Δx and feature deviation Δh. They prove that each domain contains a generic descent direction that moves the initial shape closer to the ground-truth shape for every sample belonging to it, provided that both of the following conditions hold: 1) h(x, I) is strictly monotonic around x*; and 2) h(x, I) is locally Lipschitz continuous anchored at x* with Lipschitz constant K (K ≥ 0). However, the solution proposed in [29] only satisfies the first condition and relies on the assumption that Δx and Δh are embedded in a lower-dimensional manifold. Moreover, predicting the specific domain that a sample belongs to requires the ground-truth shape x*. This is clearly infeasible at the testing stage, since the ground-truth shape is precisely what needs to be predicted.
Multi-subspace SDM
To address the problems mentioned above, an alternative two-step framework, MS-SDM (see Fig. 2), is proposed. It first learns subspaces with semantic meanings from the original optimization space via k-means. Then, for each subspace, a dedicated linear regressor from face features to the shape update is learned. During testing, each sample is assigned to a suitable subspace by a pre-trained Naive Bayes classifier and is then handled by the corresponding subspace-specific regressor, which gradually updates the shape as:
$$ \Delta {\mathbf{x}}_k={\mathbf{R}}_{k,s}h\left({\mathbf{x}}_{k-1},I\right)+{\mathbf{b}}_{k,s} $$
(7)
where s represents the subspace label.
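At test time, a single MS-SDM prediction could look like the sketch below: a pre-trained classifier predicts the subspace label s from the appearance features of the initial shape, and the regressor cascade belonging to that subspace is then applied as in (7). The names subspace_classifier and stages_per_subspace are hypothetical placeholders for the trained components.

```python
# Sketch of MS-SDM inference (Eq. 7): pick a subspace, then run its cascade.
import numpy as np

def predict_ms_sdm(image, mean_shape, h, subspace_classifier, stages_per_subspace):
    features = h(mean_shape, image)
    # Predict the subspace label s from the initial-shape-indexed features.
    s = int(subspace_classifier.predict(features[None, :])[0])
    x = mean_shape.copy()
    for R, b in stages_per_subspace[s]:               # Eq. (7), subspace-specific stages
        x = x + (R @ h(x, image) + b).reshape(x.shape)
    return x
```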
Semantic subspace learning via K-means
To learn meaningful optimization subspaces, samples with similar regression targets Δx are assumed to fall inside the same optimization subspace and to share compatible descent directions. The classic clustering algorithm k-means is therefore applied to the Δx of all training samples to automatically discover the key facial shape variations and divide the original training set into several subsets. To preserve all the useful information hidden in the shape space, the initial Δx of each sample is used directly during clustering. As shown in Fig. 3a, the subsets generated in this way correlate strongly with head pose: each subset corresponds to a particular kind of head pose, such as a left-profile face, right-profile face, left-rolling face or right-rolling face.
Since the face shape update Δx is predicted from the feature deviation Δh, the descent direction pair R and b also describes the hidden relationship between Δx and Δh. Motivated by this intuition, k-means is further applied to Δh to obtain a feature-based partition of the optimization space. Interestingly, the resulting subspaces are highly consistent with those obtained from the head-pose point of view; the relevant results are shown in Fig. 3b. This indicates that the samples in each subspace share a close shape-feature relationship and can therefore be expected to share a unified descent direction.
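A minimal sketch of this subspace learning step, assuming scikit-learn's KMeans, is given below. It clusters the shape deviations Δx, repeats the clustering on the feature deviations Δh, and cross-tabulates the two label sets to check how well the partitions agree; the number of subspaces S is an assumption.

```python
# Sketch of semantic subspace learning: k-means on Δx, then on Δh,
# followed by a simple agreement check between the two partitions.
import numpy as np
from sklearn.cluster import KMeans

def learn_subspaces(delta_x, delta_h, S=4, seed=0):
    """delta_x: (N, 2L) shape deviations; delta_h: (N, d) feature deviations."""
    shape_labels = KMeans(n_clusters=S, random_state=seed).fit_predict(delta_x)
    feat_labels = KMeans(n_clusters=S, random_state=seed).fit_predict(delta_h)
    # Cross-tabulate the two label sets; a near block-diagonal table
    # (up to a label permutation) means the partitions are consistent.
    agreement = np.zeros((S, S), dtype=int)
    for a, b in zip(shape_labels, feat_labels):
        agreement[a, b] += 1
    return shape_labels, agreement
```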
Robust subspace prediction with naive Bayes
Since the aforementioned subspace learning relies on the ground-truth shape, which is unavailable during testing, the main difficulty in the final shape prediction becomes predicting which subspace a sample belongs to. A straightforward solution is a multi-class classifier (e.g. Random Forest, SVM or Naive Bayes) that learns the subspace label from face appearance features.
In the test phase, a mean face is placed onto the given face bounding box and SIFT features are extracted around each landmark (see Fig. 2). The concatenation of all extracted features is regarded as the appearance feature for the subsequent classification. Random Forest was tested first in our experiments because of its strong performance in similar tasks. However, with this approach, a few samples were assigned to a completely incompatible subspace; for example, a left-profile face could be assigned the right-profile regressor, which severely degrades the overall prediction accuracy.
The core reason behind this phenomenon is that Random Forest treats all subspaces equally: during training, it assigns the same loss penalty to every sub-optimal subspace prediction. However, some sub-optimal subspaces provide initial-shape-indexed features similar to those of the optimal subspace and can predict similar shapes, so they should be penalized more lightly. Therefore, a classification algorithm suited to this task should be able to capture the relative proximity between a sample and each subspace.
Naive Bayes appears to be a good option for this problem. A Naive Bayes classifier assigns a class label y = Ck for some k as follows:
$$ y=\arg {\max}_{k\in \left\{1,\dots, K\right\}}\;p\left({C}_k\right){\prod}_{i=1}^n p\left({x}_i\mid {C}_k\right) $$
(8)
where x = {x1, …, xn} represents the feature vector of a sample, p(Ck) is the prior probability of class Ck, and p(xi|Ck) is the conditional probability (likelihood) of the feature value xi given class Ck. Because the Naive Bayes classifier assumes each feature xi is conditionally independent of every other feature xj (j ≠ i) given the class, p(x|Ck) equals the product of all p(xi|Ck). The likelihood p(x|Ck) can be regarded as a measure of how close the current sample is to the class centre: if the sample is far from the class centre, p(x|Ck) is small; otherwise it is large. Since p(x|Ck) directly enters the maximization in (8), the relative proximity between the sample and each class is naturally embedded in the Naive Bayes classifier, which helps avoid assigning a sample to a completely incompatible subspace.
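As an illustration, the subspace classifier could be realized with a Gaussian Naive Bayes model, as sketched below. GaussianNB from scikit-learn is an assumption consistent with (8) rather than a prescribed choice, and the inputs are the mean-shape-indexed appearance features described above.

```python
# Sketch of the subspace classifier: Gaussian Naive Bayes trained on the
# concatenated mean-shape-indexed appearance features and the k-means labels.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def train_subspace_classifier(images, mean_shape, subspace_labels, h):
    X = np.stack([h(mean_shape, I) for I in images])  # features extracted at the mean shape
    clf = GaussianNB().fit(X, subspace_labels)
    return clf

# At test time, clf.predict_proba exposes the class posteriors, whose ranking
# follows the product in Eq. (8); the arg max gives the subspace label s used in Eq. (7).
```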