Simultaneous multi-descent regression and feature learning for facial landmarking in depth images

Face alignment (or facial landmarking) is an important task in many face-related applications, ranging from registration, tracking, and animation to higher-level classification problems such as face, expression, or attribute recognition. While several solutions have been presented in the literature for this task so far, reliably locating salient facial features across a wide range of poses still remains challenging. To address this issue, we propose in this paper a novel method for automatic facial landmark localization in 3D face data designed specifically to address appearance variability caused by significant pose variations. Our method builds on recent cascaded regression-based methods for facial landmarking and uses a gating mechanism to incorporate multiple linear cascaded regression models, each trained for a limited range of poses, into a single powerful landmarking model capable of processing arbitrary-posed input data. We develop two distinct approaches around the proposed gating mechanism: (1) the first uses a gated multiple ridge descent mechanism in conjunction with established (hand-crafted) histogram of gradients features for face alignment and achieves state-of-the-art landmarking performance across a wide range of facial poses, and (2) the second simultaneously learns multiple-descent directions as well as binary features that are optimal for the alignment task and, in addition to competitive landmarking results, also ensures extremely rapid processing. We evaluate both approaches in rigorous experiments on several popular datasets of 3D face images, i.e., the FRGCv2 and Bosphorus 3D face datasets and image collections F and G from the University of Notre Dame. The results of our evaluation show that both approaches compare favorably to the state-of-the-art, while exhibiting considerable robustness to pose variations.


Introduction
Face alignment or facial landmarking refers to the task of locating salient facial features in facial images, which is of paramount importance in various applications including face registration and recognition [21,41], expression recognition [49], face tracking [37], normalization of facial pose, size and expressions [16], face synthesis from morphable models and facial animation [23], to name a few. In real-world scenarios where face images are often acquired in uncontrolled conditions, one has to deal with various unfavorable factors that adversely affect landmarking performance, including pose, expression, and illumination variations as well as partial occlusions of the facial areas. These factors influence the appearance of the facial features in traditional 2D images, e.g., [12], but also in the 3D (or better said 2.5D) face data used in this work. Although some of the existing landmark localization procedures promise to be (at least partially) robust to some of the factors mentioned above (e.g., [9,25,36]), reliable localization of facial landmarks in the presence of highly variable nuisance factors still remains a considerable challenge.

Fig. 1 Sample results of the GRID landmarking approach proposed in this paper. Our model is able to reliably estimate the location of salient facial features in 3D face data even in the presence of large pose variations, i.e., with yaw angles up to ±90°.
With the advancement of 3D acquisition technology, landmark localization on 3D facial data has recently been researched extensively [13,36]. Many of the 3D landmarking techniques proposed in the literature in the last few years rely on the so-called cascaded-regression framework, where facial landmarks are estimated by regressing from facial features to landmark locations in a cascaded (iterative) manner [2]. Techniques following this framework made considerable advancements towards robust facial landmarking, although they generally still use hand-crafted features, such as SIFTs or HOGs [2,3,39]. Additionally, these methods typically rely on a single regression model in each stage of the cascade to estimate the facial landmarks regardless of the facial characteristics. However, as facial appearance is a complex function of various factors, such as facial shape, pose, incident illumination, expression, and occlusion, a single model is often not sufficient to capture the broad range of variability commonly encountered with facial-image data and to robustly estimate the location of the most salient facial features.
To address this problem, we propose in this paper a novel gating mechanism that incorporates multiple cascaded-regression-based models, each trained for a narrow range of facial poses, into a single (coherent) landmarking model that is able to reliably estimate the location of salient facial features from arbitrarily posed input face data. The combination of simpler view-specific landmarking approaches gives the combined gating-based model the expressive power needed to describe the considerable appearance variability typically seen with 3D face data captured under different facial poses and consequently allows it to reliably estimate the landmark locations regardless of the facial pose of the input image. The model is partially motivated by the success of earlier methods designed for 2D images that combine multiple landmarking models trained for face alignment of different views, e.g., [4,15,29,47,50], but unlike these early methods does not rely on parametric appearance or shape models.
We develop two distinct facial landmarking approaches around the proposed gating mechanism. The first relies on a combination of the Gated multiple RIdge Descent (GRID) mechanism and established HOG features and, as illustrated in Fig. 1, achieves remarkable landmarking performance across a broad range of pose variation. Even for poses with yaw rotations of up to ±90°, the model is still able to reliably estimate the location of salient facial features. The second approach again relies on the introduced gating mechanism, but in addition to the cascaded regression models needed for face alignment of each group of poses, also learns a feature representation that is used with the regression models for landmark estimation. Specifically, the model Simultaneously learns MUltiple-descent directions as well as binary Features (SMUF) that are optimal for the alignment task and, due to their binary nature, also ensure extremely fast execution times. This second approach follows recent trends in computer vision and aims to learn the feature representation that is optimal for a given task, but unlike the deep learning models that are typically used for feature learning [40,42], it uses a computationally much simpler scheme, where binary features are learned based on a learning objective that can be solved using standard optimization procedures.
To evaluate the proposed landmarking approaches, we conduct experiments on multiple datasets of 3D face images, i.e., the Face Recognition Grand Challenge v2 (FRGCv2) dataset, the Bosphorus 3D Face dataset and image collections F and G of the University of Notre Dame. We present extensive experiments and comparisons with state-of-the-art methods from the literature. The results of our evaluation show that GRID ensures state-of-the-art performance for facial landmark localization in 3D face data across pose, while SMUF not only yields competitive landmarking accuracy, but is also extremely fast.
Our main contributions in this paper are:
- We propose a gating mechanism for face alignment in 3D face data that allows us to combine multiple alignment models and harness their combined strength for face alignment across pose. The use of multiple models adds to the overall localization performance, since each model needs to account only for a limited set of plausible facial variations. With this approach reliable landmarking is possible even under large head rotations such as profile facial images, with yaw rotations up to ±90° (see Fig. 1), where many competing methods fail.
- We develop two distinct landmarking approaches based on the introduced gating mechanism, where one is optimized for performance and the second one is optimized for both performance and speed. We evaluate both approaches in rigorous experiments on multiple face datasets and report competitive performance in comparison to competing methods from the literature.
- We study different configurations of the proposed approaches and investigate their behavior when localizing specific facial landmarks.
The rest of the paper is organized as follows: In Section 2 we describe prior work in the field of facial landmarking with the goal of providing the necessary context for our contributions. In Section 3 we present our gating mechanism and the GRID and SMUF alignment techniques. We describe the experiments conducted to evaluate the performance of the proposed methods and discuss the results in Section 4. We conclude the paper with some final remarks and future research challenges in Section 5.

Related Work
Numerous methods have been proposed in the literature for the task of automatic facial landmark localization over recent years. In this section we present a brief overview of these methods with a focus on alignment techniques that work on 3D images. These techniques can be categorized in various ways, but here we chose to perform a categorization as shown in Fig. 2. We classify existing techniques into two groups: i) techniques that are entirely dependent on geometric information and derive prior knowledge about the facial structure and location of facial landmarks by defining a number of heuristic rules and ii) techniques that rely on trained statistical models. The latter group of techniques is further divided according to the type of the model utilized into generative and discriminative methods. A high-level comparison of the related works discussed in this section is given in an accompanying table.

Fig. 2 Taxonomy of 3D landmarking approaches as discussed in the related work section. Recent work is largely focusing on statistical approaches, where landmarking is learned from annotated training data using either generative or discriminative models.

Geometric Approaches
Geometric approaches to facial landmarking are generally training-free and depend solely on geometric information such as surface curvature or shape index values. A number of rules and heuristics encode the prior knowledge about the relationships between adjacent landmarks (e.g., the nose tip lies on the face symmetry axis, eyes are located above the nose tip, etc.). In most cases, the rules used to define the location of facial landmarks require the face to be in an upright and near-frontal position. Moreover, a common downside of these methods is that the landmarks are detected in sequence (commonly starting by detecting the nose tip) and the success rate of finding the next landmark in the sequence is dependent on successfully locating the preceding landmark in the sequence. With these methods an incorrect detection of one landmark affects the detection of all subsequent landmarks.
Exemplar geometric methods [1,6,11,22,33] start by detecting the nose tip and use its location to constrain the search space of the remaining landmarks. Landmark detection can be grounded on the analysis of Gaussian curvatures [11], profile curvatures [6], x and y coordinate projections of the depth data [33], shape index values and facial symmetry lines [1] or horizontal slices of range images [22], to name a few.

Statistical Approaches
Statistical landmarking approaches also exploit local shape information around candidate landmark locations. Additionally, these methods derive some prior knowledge from the training data about the location constraints and encode the acquired knowledge into a statistical model. Thus, these methods require training. However, the fact that statistical methods generally address a complete set of landmarks defined by the model could prove to be a problem when a large number of landmarks is (self-)occluded or data is missing from the input images due to acquisition errors. Recently, efforts have been made to handle such problems, e.g., [36] use a flexible shape model that works even with an incomplete set of landmarks.
In terms of the type of statistical model used, approaches from this group can be divided into techniques that rely either on generative or on discriminative models. We discuss both types of techniques in the following two subsections.

Generative Approaches
Landmark locations can be modeled by generative models, such as Active Appearance Models, Active Shape Models [8,18,34] or morphable models [9]. Techniques from this group often learn face appearances (and/or shape) by conducting Principal Component Analysis (PCA) [38] on manually annotated and aligned training data. Given a test image, alignment/fitting is achieved by minimizing the difference between the current estimate of the appearance (and/or shape) and the test face. Generative approaches are often computationally expensive and may perform poorly in the presence of occlusions and pose, expression and illumination variations due to the involved fitting procedure.

Discriminative Approaches
More recent methods for face alignment focus mostly on discriminative approaches that learn a mapping function that predicts the shape, i.e., the landmark locations, directly from corresponding image features. Methods from this category typically offer better landmark localization performance when compared to generative models, especially for faces with greater variability in appearance [2,3,9,36]. These methods commonly incorporate shape constraints into the models and use local descriptors that are more robust to appearance variations than the conventional depth/intensity pixel values used with generative approaches. Discriminative approaches include random forests [7], graph matching [30], cascaded regression [2,3,9,39], specifically tailored shape models [24,25,36,48], hidden Markov models [17] and convolutional neural networks [40].
The landmarking techniques proposed in this work fall into the group of discriminative approaches. They build on recent face alignment techniques that rely on cascaded regression models that have proven highly successful for landmark localization from 2D face images, e.g., [44,45]. However, compared to these models our solutions exhibit unique features, such as the novel gating mechanism for exploiting multiple pose-specific landmarking models and the ability to incorporate task-specific binary features into the landmarking procedure.

Methodology
In this section we describe GRID and SMUF, two novel facial landmarking approaches built around the gating mechanism illustrated in Fig. 3. As can be seen, the gating mechanism partitions the search space for the landmark localization procedure into a number of sub-domains, where each sub-domain encompasses a range of similar facial poses. A separate landmarking model is then trained for each of the sub-domains and the gating mechanism is used to select the most suitable landmarking model for the given test image. Based on this overall framework, we develop two distinct landmarking techniques, which are described next.

GRID Description
We design GRID (Gated Multiple Ridge Descent) in line with the powerful cascaded regression framework for face alignment, where landmark locations (or in other words, the facial shape) are estimated by regressing from facial features to landmark locations in a cascaded manner. In the first step of this framework, features are extracted from some initial landmark configuration (estimated from the training data) and a regression model is applied on the extracted features to predict landmark updates that better align the landmarks with the actual test image. The update results in a new landmark configuration that forms the basis for the next step in the cascade. The entire procedure is then repeated multiple times and, thus, sequentially refines the predicted locations of the facial landmarks in the test image.
With GRID, we train multiple cascaded regression models and integrate them into a gated approach that is robust to pose variations. While different regression models and feature representations have been proposed in the literature for facial landmarking, we build GRID around the Supervised Descent Method (SDM) from [44], which has proven successful not only for facial landmark localization in 2D images [43], but also for alignment of 3D face images, as we have shown in [2,14].

Background
To train the regression models needed for landmarking, SDM requires a number of facial images $\{I^n\}_{n=1}^{N}$, where each image $I$ has $L$ landmarks annotated in the form of a shape vector $x_* \in \mathbb{R}^{2L \times 1}$. The landmark localization task is then posed as a minimization problem over $\Delta x$:

$$f(x_1 + \Delta x) = \left\| h(I, x_1 + \Delta x) - \phi_* \right\|_2^2, \qquad (1)$$

where $h$ is a feature extraction function, $\phi_* = h(I, x_*)$ are features extracted around the ground truth landmarks $x_*$, $x_1$ is an initial landmark configuration and $\Delta x$ is a landmark update (known for the training data). Eq. (1) represents a non-linear least squares problem and in general has no closed-form solution. However, it was shown in [43] that the problem can be solved through a cascade of least squares regression problems. Thus, for each step $k$ in the cascade, the solution of the least squares problem results in a regression matrix (also referred to as a descent map, DM) $R_k$ that can be used to predict the update of the landmark locations from the current image features. The learning algorithm is formulated as a minimization of the loss between the true shape updates $\tilde{x}_k^n = x_k^n - x_*^n$ and the expected updates over all training images, i.e.:

$$\arg\min_{R_k} \sum_{n=1}^{N} \left\| \tilde{x}_k^n - R_k \left( \phi_k^n - \bar{\phi}_* \right) \right\|_2^2, \qquad (2)$$

where $n$ is a training-image index, $\phi_k^n = h(I^n, x_k^n)$ are the features extracted at the current landmark estimate, and $\bar{\phi}_*$ is an average feature vector computed from the ground truth locations $x_*^n$ over all training images. Eq. (2) now represents a sequence of ordinary least squares regression problems that can be solved in closed form.
During test time, the algorithm starts with some initial landmark locations $x_1$, for which the face shape (landmark configuration) is defined by the average landmark locations of the training images and the position of the face shape is determined by the face detection procedure, and sequentially updates the initial estimate to obtain the final landmark locations, i.e.:

$$x_{k+1} = x_k - R_k \left( \phi_k - \bar{\phi}_* \right), \qquad (3)$$

so that the final shape $x_k$ converges to $x_*$ for all training images. The number of steps $K$ in the cascade, where $k = 1, 2, \dots, K$, commonly varies depending on the implementation, but values of $K$ are usually between 3 and 10.
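The test-time cascade described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `run_cascade`, the callback `extract_features` (standing in for h) and the sign convention of the update, which follows from defining the shape residual as current estimate minus ground truth, are assumptions made for the example.

```python
import numpy as np

def run_cascade(image, x1, descent_maps, phi_mean, extract_features):
    """Refine landmarks stage by stage: x <- x - R_k (phi_k - phi_mean).

    image            : input passed through to the feature extractor
    x1               : initial shape vector (length 2L)
    descent_maps     : list of K regression matrices R_k (2L x m)
    phi_mean         : average ground-truth feature vector (length m)
    extract_features : hypothetical callback h(image, x) -> feature vector
    """
    x = np.asarray(x1, dtype=float).copy()
    for R_k in descent_maps:                  # one descent map per cascade stage
        phi_k = extract_features(image, x)    # features at the current estimate
        x = x - R_k @ (phi_k - phi_mean)      # predicted shape update
    return x
```

With a toy feature function that returns the shape itself and descent maps equal to half the identity, each stage halves the distance to the target, which is a quick way to check the update direction.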

Ridge Regression
The original SDM [43] formulation uses a least squares solution of Eq. (2) to learn the DMs, i.e.:

$$R_k = \tilde{X}_k \tilde{\Phi}_k^{\top} \left( \tilde{\Phi}_k \tilde{\Phi}_k^{\top} \right)^{-1}, \qquad (4)$$

where $\tilde{X}_k$ is a shape matrix with $n$-th column $\tilde{x}_k^n$ and $\tilde{\Phi}_k$ is a feature matrix with $n$-th column $\phi_k^n - \bar{\phi}_*$. To solve Eq. (4) one needs to compute the inverse of $\tilde{\Phi}_k \tilde{\Phi}_k^{\top}$, which, however, may be singular when the size of the feature vectors is too large or when the features are correlated. To overcome this issue, the original SDM applies PCA [38] to the image features before inverting the matrix.
However, we have shown in [2] that better landmarking performance is achieved if ridge regression is used in the original feature space instead of least squares regression in the PCA subspace. The optimization function in (2) in this case can be written as

$$\arg\min_{R_k} \sum_{n=1}^{N} \left\| \tilde{x}_k^n - R_k \left( \phi_k^n - \bar{\phi}_* \right) \right\|_2^2 + \gamma_k \left\| R_k \right\|_F^2, \qquad (5)$$

where $\gamma_k$ denotes a regularization factor, and the solution of Eq. (5) is computed as:

$$R_k = \tilde{X}_k \tilde{\Phi}_k^{\top} \left( \tilde{\Phi}_k \tilde{\Phi}_k^{\top} + \gamma_k I \right)^{-1}, \qquad (6)$$

where $I$ is an identity matrix. The regularization factor $\gamma_k \geq 0$ controls the general instability of the least squares estimate. Selecting a suitable value for $\gamma_k$ avoids over-fitting and helps to produce estimates of $R_k$ that generalize better to unseen data.
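As a rough illustration of how one descent map could be learned with ridge regression, the sketch below assumes the standard closed-form ridge solution R = X Φᵀ (Φ Φᵀ + γI)⁻¹ with training samples stored as columns. The function name `learn_descent_map` is hypothetical, and the linear system is solved directly instead of forming an explicit inverse.

```python
import numpy as np

def learn_descent_map(X_tilde, Phi_tilde, gamma):
    """Closed-form ridge estimate of one descent map.

    X_tilde   : (2L, N) shape residuals, one training sample per column
    Phi_tilde : (m, N) centred features, one training sample per column
    gamma     : non-negative regularization factor
    """
    m = Phi_tilde.shape[0]
    A = Phi_tilde @ Phi_tilde.T + gamma * np.eye(m)   # (m, m), symmetric
    B = X_tilde @ Phi_tilde.T                         # (2L, m)
    # R A = B  <=>  A R^T = B^T because A is symmetric.
    return np.linalg.solve(A, B.T).T
```

Setting `gamma` to zero recovers the ordinary least squares estimate whenever the feature Gram matrix is invertible; larger values shrink the entries of the map, which is the regularizing effect described in the text.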

Gated Multiple Ridge Descent
Experimental results in [2,43,45] have shown that the original SDM achieves remarkable landmarking performance on various 2D and 3D datasets. However, it still tends to perform poorly when, for example, large head rotations are present in the facial data [45]. Such rotations cause complex facial appearance variations that are difficult to model and hard to account for when using only a single DM in each step of the landmarking cascade.
To increase the robustness of the model to pose variations, we propose to exploit multiple DMs $\{R_k^z\}_{z=1}^{Z}$ such that each of the $Z$ DMs accounts for a specific range of head rotations, as illustrated in Fig. 3. Towards this end, we partition the available training images $\{I^n\}_{n=1}^{N}$ into $Z$ pose-specific subsets and train separate regression cascades for each subset in line with Eq. (6).
Once all $Z$ cascades (series of DMs) are trained, a gating function $g_z$ is used to select the most suitable DM (from the $Z$ DMs available in the first cascade stage) for a given test image. The selection procedure begins by computing features $\{\phi_1^z\}_{z=1}^{Z}$ from the initial landmark locations $\{x_1^z\}_{z=1}^{Z}$ in the test image. Here, the initial landmark locations $x_1^z$ (see Fig. 4) are computed by averaging the ground truth shapes over all training images from the $z$-th training subset. The most fitting DM for the given test image is then selected based on the output of the gating function $g_z$, which is defined in terms of the feature vector length $m$ (in the case of SIFT $m = 128 \times L$) and the average $\bar{\phi}_*$ and the covariance matrix $\Sigma_*$ of the ground truth features over the subset $z$. We select the subset $z^*$ for which the gating function outputs the lowest value:

$$z^* = \arg\min_{z} g_z. \qquad (9)$$

By doing so, we reliably choose the DM $R_1^{z^*}$ that has been trained on images with face orientations similar to the orientation of the test face. For all the subsequent steps $k$ we use the DMs $R_k^{z^*}$ that correspond to the $z^*$-th subset selected in the first step $k = 1$ and, for efficiency reasons, do not change the regression cascade in subsequent steps. Location updates on a given test image are thus computed as

$$x_{k+1} = x_k - R_k^{z^*} \left( \phi_k - \bar{\phi}_* \right). \qquad (10)$$

The described procedure results in significantly improved performance in the case of large head rotations, as shown later in Section 4.
It needs to be noted that we rely on HOG features to implement the feature extraction function $h$ in GRID. We select HOG features because of their proven performance in prior landmarking models, e.g., [2,14].
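The selection step can be illustrated with a small sketch. The exact form of the gating function is not reproduced here, so the code assumes a Gaussian negative log-likelihood built from the per-subset mean and covariance of the ground truth features and the feature length m; this choice, as well as the names `gaussian_gate` and `select_subset`, is an assumption made for illustration only.

```python
import numpy as np

def gaussian_gate(phi, mean, cov):
    """Assumed gate: Gaussian negative log-likelihood of phi for one pose subset."""
    m = phi.shape[0]
    diff = phi - mean
    _, logdet = np.linalg.slogdet(cov)            # log-determinant of the covariance
    maha = diff @ np.linalg.solve(cov, diff)      # squared Mahalanobis distance
    return 0.5 * (maha + logdet + m * np.log(2.0 * np.pi))

def select_subset(phis, means, covs):
    """Return the index z* of the pose subset with the lowest gate value.

    phis  : features extracted at each subset's initial shape x_1^z
    means : per-subset average ground-truth feature vectors
    covs  : per-subset ground-truth feature covariance matrices
    """
    scores = [gaussian_gate(p, mu, S) for p, mu, S in zip(phis, means, covs)]
    return int(np.argmin(scores))
```

The returned index is then used to pick the whole cascade for the test image, so the gate is evaluated only once, at the first stage.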

SMUF Description
The GRID landmarking approach presented in the previous section relies on HOG features to encode the appearance of the facial landmarks during face alignment. With SMUF (Simultaneous MUlti-descent regression and binary Feature learning) we take a step further and try to learn facial features that are optimal for face alignment. We choose to learn binary features because of their simplicity and, above all, their low computational cost. In the following subsections, we first review the idea of binary feature learning and then develop the SMUF approach that jointly learns a landmarking model as well as corresponding binary features that are optimal for this task.

Binary Feature Learning
Hand-crafted binary features, such as Local Binary Patterns (LBPs) [27], represent powerful image descriptors that have proven highly effective in various computer vision tasks. These features typically rely on pixel comparisons within a local neighborhood and heuristic rules to encode the pixel comparisons into binary codes. As such, they may be suboptimal and better features could potentially be constructed by learning binary features based on some dedicated learning objective.
Gong et al. [10], for example, propose a learning objective where binary features are learned from an initial image representation d, such that the quantization error is minimized. Binary features φ (containing only 0s and 1s) are computed from d using a matrix of hash functions W and the signum function sgn(·), and the learning objective L_q, which needs to be minimized over W on some training data, measures the resulting quantization error. It was shown by Lu et al. [19,20] that descriptive binary image features can be computed based on the above quantization scheme if pixel (or depth in our case) difference values are used as input d for binarization. For SMUF we follow this approach and compute one depth difference vector d for each considered landmark, as illustrated in Fig. 5.

Fig. 5 We use depth difference vectors d as the basis for binary feature calculation as proposed in [19,20]. The example image shows how one such vector is computed for a selected landmark. The local pixel neighborhood shown here is of size 3 × 3 and is selected only for illustration purposes. We use larger neighborhoods for the actual SMUF implementation.
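To make the binarization idea concrete, the following sketch computes a depth-difference vector for one landmark and maps it to a 0/1 code. Several details are assumptions rather than the paper's definitions: differences are taken as neighbour minus centre, the centre entry is dropped, and the code is 1 wherever the projection Wᵀd is positive. The function names are hypothetical.

```python
import numpy as np

def depth_difference_vector(depth, row, col, radius=1):
    """Differences between each neighbour and the centre pixel of a square patch."""
    patch = depth[row - radius: row + radius + 1, col - radius: col + radius + 1]
    diffs = (patch - depth[row, col]).ravel()
    centre = diffs.size // 2
    return np.delete(diffs, centre)        # drop the centre-minus-centre entry

def binarize(d, W):
    """Map d to a 0/1 code: 1 where the projection W^T d is positive, else 0."""
    return (W.T @ d > 0).astype(np.uint8)
```

For a 3 × 3 neighbourhood this yields an 8-dimensional difference vector; with an identity projection the code simply marks which neighbours are deeper than the centre.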

Simultaneous DM and Feature Learning
The learning objective presented in the previous section is focused on representational power, as the binary features are computed in a manner that minimizes a quantization loss. To make the features useful for landmarking, we now formulate a joint optimization function that allows us to simultaneously learn a regression cascade and corresponding binary features that are optimal for the landmarking task. Let $D_k$ be a set of depth-difference vectors extracted from patches centered at the facial landmarks, where $k$ stands for the cascade stage, $k = 1, 2, \dots, K$. The depth-difference-vector matrix $D_k$ is mapped to a binary feature matrix $\Phi_k$ using a feature projection matrix $W_k$ and, again, the signum function sgn(·). To learn $W_k$, we formulate an optimization problem by rewriting (5) in matrix form and extending it with the additional constraint $C_2$, which gives the joint objective $C$ in (14). Here, $\tilde{D}_k = D_k - D_*$, and $D_*$ are the depth-difference values of the ground truth landmark locations. As already emphasized above, the objective of $C_2$ is to minimize the quantization loss between the original depth-difference values and the binarized features, so that most of the depth-difference energy can be preserved in the learned binary features. We find optimal values for $R_k$ and $W_k$ by an iterative optimization procedure, where $W_k$ is initialized to a random orthogonal matrix. If we assume a fixed $W_k$, compute the partial derivative of $C$ in (14) with respect to $R_k$ and set the derivative to zero, we obtain the solution for $R_k$ in (16). In the next step we aim to learn $W_k$ with a fixed $R_k$ and, hence, rewrite (14) as (17). If we differentiate (17) with respect to $W_k$ and set the derivative to zero, we obtain the update rule for $W_k$ in (18). The two optimization steps from (16) and (18) are then repeated until both $R_k$ and $W_k$ converge.
Once stable versions of $R_k$ and $W_k$ are obtained, we compute the shape update and repeat the entire procedure for the next stage in the cascade. Note that because $\tilde{\Phi}_k$ is binary, location updates can be computed extremely quickly by simply summing up (specific) rows from $R_k$.
Finally, we compute separate regression cascades and projection matrices for each of the $Z$ training subsets, that is, for each considered group of poses, and integrate the computed cascades into the overall SMUF approach using the same gating mechanism as described above for the GRID approach.
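The speed argument for binary features can be checked with a short sketch: multiplying a regression matrix by a 0/1 vector is the same as summing the entries of the matrix selected by the non-zero bits. The example assumes the convention in which the update is the product R b, so the selected entries are columns of R; under the transposed convention they would be rows. Both function names are hypothetical.

```python
import numpy as np

def update_dense(R, b):
    """Shape update computed as an ordinary matrix-vector product."""
    return R @ b

def update_by_selection(R, b):
    """Same update, computed by summing the columns of R where b == 1."""
    return R[:, b.astype(bool)].sum(axis=1)
```

The second variant needs no multiplications at all, which is where the fast execution with binary features comes from.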

Training and testing of GRID and SMUF
The overall processing pipeline for the SMUF landmarking approach is shown in Fig. 6. The procedure for GRID is identical, except for the fact that no features are learned during training.
The training stage for both methods begins by preprocessing all $N$ training images $\{I^n\}_{n=1}^{N}$, where a depth component of the surface normal is computed in each pixel instead of using the original depth values. In each image the face is detected using a simple clustering procedure [32] and initial landmark locations $x_1^n$ are set based on the detected facial area. To capture the variance of the face detection procedure and to enlarge the amount of training data, we define additional initial landmark locations for each training image by randomly sampling scale and displacement parameters for the detected area from a normal distribution. Starting from the initial locations matrix $X_1$ along with the ground truth locations $X_*$, a number of DMs $R_k^z$ (and for SMUF also projection matrices $W_k^z$) are learned. The updates (16) and (18) are iteratively re-computed until convergence (we empirically estimated that 4 steps are sufficient) for each shape update step $k$ and each of the $Z$ training subsets.
When a test image is presented to the landmarking procedure, it goes through the same face detection, preprocessing, landmark initialization and feature extraction steps as the training images. DMs and feature projections are then selected as described in Section 3.1.3 and the final landmark locations are computed based on (10).
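The augmentation of initial landmark locations can be sketched as follows. The mean shape is assumed to be stored in unit-box coordinates and placed into a detection box (x, y, w, h); the standard deviations `scale_std` and `shift_std` are hypothetical values, since the sampling parameters are not specified in the text.

```python
import numpy as np

def perturbed_initial_shapes(mean_shape, box, n_samples, rng,
                             scale_std=0.05, shift_std=0.05):
    """Place the mean shape into randomly scaled and shifted copies of a box.

    mean_shape : (L, 2) landmark coordinates in unit-box coordinates [0, 1]
    box        : detected face area as (x, y, w, h)
    Returns an array of shape (n_samples, L, 2).
    """
    x0, y0, w, h = box
    shapes = []
    for _ in range(n_samples):
        s = 1.0 + scale_std * rng.standard_normal()                  # random scale
        dx, dy = shift_std * rng.standard_normal(2) * np.array([w, h])  # random shift
        cx, cy = x0 + w / 2.0 + dx, y0 + h / 2.0 + dy                # perturbed centre
        sw, sh = s * w, s * h                                        # perturbed size
        shapes.append(np.column_stack([cx - sw / 2.0 + mean_shape[:, 0] * sw,
                                       cy - sh / 2.0 + mean_shape[:, 1] * sh]))
    return np.stack(shapes)
```

With both standard deviations set to zero the function reduces to placing the mean shape into the detected box, which corresponds to the unperturbed initialization.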

Experiments
In this section we evaluate the proposed GRID and SMUF landmarking approaches and compare them to the state-of-the-art. We report landmarking performance in accordance with the standard methodology used in this area [32] for all experiments. Specifically, we use the localization error, i.e., the Euclidean distance in mm between the location of the detected landmark and the manually annotated ground truth landmark, for performance reporting. Additionally, we also compute the mean localization error over all landmarks of each test face for some of the experiments.
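The error measure described above can be written down directly. The sketch below assumes that predicted and ground truth landmarks are given as arrays with one landmark per row and coordinates already expressed in mm; the function names are hypothetical.

```python
import numpy as np

def localization_errors(pred, gt):
    """Per-landmark Euclidean distance between predicted and ground truth points."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=1)

def mean_localization_error(pred, gt):
    """Mean of the per-landmark errors for one test face."""
    return float(localization_errors(pred, gt).mean())
```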

Experimental Datasets
We conduct experiments with three popular datasets of 3D face images: the Face Recognition Grand Challenge version 2 (FRGCv2) dataset [26], the Bosphorus 3D Face dataset [31] and the University of Notre Dame dataset (collections F and G, hereinafter referred to as the UND dataset) [46]. We chose these datasets because they are among the most frequently used 3D face datasets and because they contain challenging 3D images with a high degree of variability in face orientations and are, therefore, well suited for assessing the robustness to such variations. The main characteristics of the datasets are summarized in Table 2.
The FRGCv2 dataset contains 4007 3D face images of 466 individuals. Images of the dataset were acquired with a laser-based Konica Minolta Vivid 910 scanner. Subjects exhibit minor pose variations and various facial expressions. We utilize the ground truth landmarks (8 landmarks per face) from [25], which were manually annotated on a subset of 975 images from 149 subjects.
The Bosphorus dataset consists of 4666 face samples from 105 subjects. Each sample includes a 2D color image, a 3D point cloud and 24 manually annotated landmarks (in our experiments we exclude ear dimple landmarks and use the remaining 22 landmarks). Next to expression and occlusion variations, images in the dataset also exhibit large variations in pose. Images from the dataset were captured using a structured-light based Inspeck Mega Capturor II Digitizer.
The UND dataset contains 1680 semi-profile and profile 3D face images of 537 subjects. For our experiments, we use a subset of 236 images with yaw rotations of ±45° and 174 images with yaw rotations of ±60° along with the manual annotations (8 landmarks for frontal faces and 5 for non-frontal faces) also provided by [25]. Images from this dataset were captured by the same acquisition device as used with FRGCv2.

Performance Evaluation on the Bosphorus Dataset
In the first series of experiments we evaluate the performance of GRID and SMUF on the Bosphorus dataset which is particularly suitable to assess the robustness to large pose variations.We perform experiments using a two-fold cross validation setup using half of the images for training and the other half for testing.To increase the size of training data we extend the training set by horizontally flipping each of the aviable training images.We form test sets with respect to the yaw rotation angle or the presence of expressions/occlusions.We implement both methods with K = 7 cascade stages and use this setup also for all following assessments.The results of this series of experiments are summarized in Table 3.For both GRID and SMUF, we train three landmarking variants, each with a different number of DMs.The first column in Table 3 marked 1 DM corresponds to the variant that use only one DM that was trained on images with 22 annotated landmarks (these are generally images of near frontal faces, since large rotations lead to self-occlusions and fewer annotations).The second column represents the GRID and SMUF variants with 3 DMs: one DM is computed in the same way as in the variants in column 1, while the second and the third DM is computed using images with the head rotations up to 45 • to the left and right, respectively.The varaints in the third column correspond to the setup in Fig. 3 and contain an additional two DMs corresponding to head rotations in the ranges of [45 • , 90 • ] and [−45 • , −90 • ].The DM of the near-frontal images are trained using 22 landmarks per face image, while the DMs of rotated images are trained using 14 landmarks per face as some of the landmarks in these images are typically self-occluded.
As expected, we can observe that the robustness to face rotations is significantly increased when more DMs are utilized. With the GRID and SMUF variants with 5 DMs we achieve reliable landmark localization even on profile face images with yaw rotations up to ±90°. It can also be seen from the last two rows in Table 3 that the same localization errors are obtained for all three variants when evaluated on the frontal face images with expression and occlusion variations. This indicates that the expressions and occlusions do not affect the DM selection process, since in all cases the frontal DM is correctly chosen by the gating function.
When comparing the performance of SMUF and GRID, we can see that in general GRID ensures slightly better localization results than SMUF for all implemented variants. However, while there is an evident trend towards lower average localization errors for GRID, it is clear from Table 3 that the performance differences are not statistically significant. Thus, we can conclude that for the Bosphorus dataset both techniques perform more or less equally.

Evaluation on the FRGCv2 and UND Datasets
In the second series of experiments, we evaluate GRID and SMUF on the joint FRGCv2 and UND datasets. In contrast to the Bosphorus dataset, where images contain solely the head regions and a face detector is therefore not required, images from the FRGCv2 and UND datasets may also contain parts of the upper body, and thus face detection is needed to initialize the landmark locations. In this series of experiments we hence employ a simple face detector that relies on k-means clustering, similar to the one presented in [32]. Setting the number of clusters to k = 3 and including several heuristic conditions, this detector divides a 3D image into three regions that most likely correspond to the background, body and head/face regions. The face region is then selected as the cluster with the lowest mean depth value (see Fig. 7).
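A minimal sketch of such a depth-based detector is given below. It assumes that clustering is performed on the raw depth values alone and that a lower depth value means a point closer to the sensor; the heuristic conditions mentioned above are omitted because they are not specified here. The function names are illustrative only.

```python
def kmeans_1d(values, k=3, iters=50):
    """Plain Lloyd's k-means on scalar values; returns (centers, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic initialisation: centers spread evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def detect_face_mask(depth, k=3):
    """Return a boolean mask of the non-empty cluster with the lowest mean depth."""
    flat = [v for row in depth for v in row]
    centers, labels = kmeans_1d(flat, k)
    face = min(set(labels), key=lambda c: centers[c])
    width = len(depth[0])
    return [[labels[r * width + c] == face for c in range(width)]
            for r in range(len(depth))]
```

In practice the resulting mask would still need to be cropped and smoothed, as illustrated in Fig. 7.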
The face detector introduces additional variability into the facial regions, since the detected face may still include smaller parts of the upper body, neck and hair. For that reason we also report face misdetection rates and selection rates for this experiment. The face misdetection rate is defined as the percentage of images with a discrepancy between the location of the face detection box and the locations of the ground truth landmarks. The selection rate is defined as the percentage of images where the correct DM has been selected by the gating function, where we consider a selected DM incorrect if it has been trained on right profile face images while the corresponding test image is facing left, or vice versa. The localization errors are then computed exclusively on the images with correct face detections and DM selections. This type of reporting is adopted from [25], which we use for baseline comparison in this experiment.
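The reporting protocol can be summarized with the short sketch below. The record layout and key names are hypothetical, both rates are assumed to be computed over all test images, and the localization error is assumed to be the Euclidean distance (in mm) between predicted and ground truth landmark positions.

```python
import math


def evaluate(records):
    """Summarise detection, selection and localization results.

    Each record is a dict with (hypothetical) keys:
      'face_ok'   -- True if the face detection agrees with the ground truth
      'dm_ok'     -- True if the gating function picked an acceptable DM
      'predicted' -- list of (x, y, z) landmark estimates in mm
      'truth'     -- list of (x, y, z) ground truth landmarks in mm
    Returns (misdetection rate %, selection rate %, mean error, std of error).
    """
    n = len(records)
    misdetection_rate = 100.0 * sum(not r['face_ok'] for r in records) / n
    selection_rate = 100.0 * sum(r['dm_ok'] for r in records) / n

    # Errors are accumulated only over images that passed both checks.
    errors = []
    for r in records:
        if r['face_ok'] and r['dm_ok']:
            errors.extend(math.dist(p, t) for p, t in zip(r['predicted'], r['truth']))
    if not errors:
        return misdetection_rate, selection_rate, float('nan'), float('nan')
    mean = sum(errors) / len(errors)
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))
    return misdetection_rate, selection_rate, mean, std
```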
The results of the experiments are presented in Table 4. For details on the dataset abbreviations in the first column please refer to [25], since the experimental setup and the landmark annotations are adopted from there. In short, however, DB00F denotes an image subset with varying facial expressions, which is further partitioned into neutral (neut.), mild and extreme (extr.) facial expressions. The remaining image subsets contain faces with 45° or 60° yaw rotations either to the right (R), the left (L) or both (RL). As the GRID and SMUF methods require non-frontal images to train some of the DMs, our experimental setup differs from [25] only in the construction of the training set, where we also employ images from the Bosphorus dataset. Detection and selection rates are consistently above 95% for all subsets, as can be observed from the first two columns in Table 4. Localization errors of our two landmarking approaches are compared to Perakis et al. [25] (last column), which, to the best of our knowledge, achieves the highest performance in the literature on these datasets. The results show the robustness of our methods to both expression variations and rotations. The mean localization error is under 6 mm on all tested subsets for both GRID and SMUF. Since the training data is taken from the Bosphorus dataset (acquired with a different 3D camera), the results also imply good generalization to data from different sensors. All experiments from this section were performed using 5 DMs, as we observed earlier in Section 4.2 that this setting is the most robust to rotation variations.

Table 4 Mean localization errors (and corresponding standard deviations) on the FRGCv2 and UND datasets. Results are reported for specific subsets in accordance with the protocol from [25]. The subsets feature images with different levels of expression variations (DB00F), and yaw rotations of 45° and 60° to the right (R), the left (L) or in both directions (RL). The results show that both GRID and SMUF offer competitive performance when compared to the method of Perakis et al. [25].

Comparison to the State-of-the-art
In the next series of experiments, we compare the performance of GRID and SMUF to the performance of state-of-the-art landmarking methods from the literature. Specifically, we select the method of Sukno et al. [36], the technique of Creusot et al. [5], and the landmarking approaches of Passalis et al. [24] and Perakis et al. [25] for our comparison. To the best of our knowledge, these landmarking methods are the only ones that were evaluated on both frontal and rotated 3D facial images. Following the experimental protocols of other authors, we used the DB00F45RL subset when performing experiments on the FRGCv2+UND database and used the entire database for experimentation on the Bosphorus dataset. Additionally, we also implement our gated landmarking approach with hand-crafted binary features, that is, with LBPs (uniform, neighborhood size of 8 and radius of 1), to assess the usefulness of learning binary features instead of using off-the-shelf binary feature extractors.
The results of the comparison are shown in Table 5. We observe that on the Bosphorus dataset both GRID and SMUF significantly outperform the competing methods from the literature and achieve not only lower average localization errors, but also significantly smaller standard deviations of these errors. The only exceptions here are the nose tip and corners, where the method of Sukno et al. performs similarly or slightly better. We also see similar results for the FRGCv2+UND dataset, where both GRID and SMUF achieve a considerable reduction in the localization errors for all considered landmarks compared to the state-of-the-art.
When comparing the learned binary features used in SMUF to the hand-crafted LBP features, we also see an obvious performance improvement with the learned features, reinforcing our assumption that learning binary features is beneficial for face alignment. The comparison between GRID and SMUF shows a similar picture as in the previous series of experiments, where GRID was found to perform slightly better than SMUF, but not significantly so.

Landmark Analysis
In this section we evaluate how the overall localization performance varies across the individual landmarks for both GRID and SMUF. Fig. 8 illustrates the mean localization errors achieved for the individual landmarks; the size of the circles is proportional to the errors. It can be observed that the landmarks corresponding to the nose tip and the eye and mouth corners exhibit low localization errors. This is expected, as these landmarks correspond to well-pronounced facial parts with distinctive "corner-like" shapes. In contrast, landmarks relating to nose saddle points, the chin tip and eyebrow points correspond to indistinctive "edge-like" local shapes and therefore result in high localization errors. This observation is also supported by the box plots in Figs. 9 and 10, which show localization errors of individual landmarks on the Bosphorus and the FRGCv2+UND databases. The presented behaviour is consistent for both evaluated methods.
To further analyze the landmarking performance of GRID and SMUF and their generalization ability, we also performed a cross-database experiment, where we built the test set with images from the FRGCv2+UND dataset, while the training set was generated using images from the Bosphorus dataset. Results relating to the cross-database experiment are illustrated by the green box plots in Fig. 10. When compared to the experiment where both training and test sets are from the FRGCv2+UND dataset, we observe a slight increase in localization errors for most of the landmarks, except the chin tip, where the difference between mean errors is larger. We presume that the high mean error for the chin tip comes from the increased appearance variability caused by the face detection procedure needed for the FRGCv2+UND data (see Fig. 12g). Since such variability is not present in the training set from the Bosphorus dataset, the landmarking procedures cannot learn to accommodate the inaccuracies of the face detector. In terms of comparison of GRID and SMUF, we see no significant difference in their performance in these experiments.

Computational Cost
In the last series of experiments we evaluate the average time needed by the GRID and SMUF methods to localize landmarks on a single test image. We compute the average processing time over 100 randomly selected test images from the Bosphorus dataset. The size of the input images is 250×200 and we compute the locations of all 22 landmarks during the benchmark. A PC with the following specifications is used for the assessment: Intel Xeon CPU 2.67 GHz with 12 GB RAM. Both landmarking techniques are implemented in Matlab and could be further sped up if implemented in a compiled language such as C/C++. We start from detected and localized face regions and measure the time for feature extraction, DM selection and location updates, which take less than $3 \times 10^{-2}$ s for SMUF (see Fig. 13) and a little less than $8 \times 10^{-2}$ s for GRID. When compared to hand-crafted features, the learned binary features can be extracted almost 3 times faster than HOG features and 15 times faster than LBP features.

Fig. 13 Average running time of the GRID and SMUF methods to localize landmarks on one face image (computed over 100 randomly selected test images). For the benchmarking, images from the Bosphorus dataset were used and 22 landmarks were predicted. The results show that SMUF is around 3× faster than GRID, while yielding only slightly higher localization errors.
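The benchmarking procedure can be approximated with the sketch below, which averages the wall-clock time of each processing stage over a random subset of test images. It is written in Python rather than Matlab for illustration, and the stage names and callables are hypothetical placeholders for feature extraction, DM selection and location updates.

```python
import random
import time


def stage_runtimes(stages, images, n_samples=100, seed=0):
    """Average per-image runtime of each stage in a processing pipeline.

    stages -- list of (name, fn) pairs applied in order; each fn receives
              the output of the previous stage (hypothetical callables)
    images -- non-empty list of test inputs
    Returns a dict mapping stage name to mean seconds per image.
    """
    rng = random.Random(seed)
    subset = rng.sample(images, min(n_samples, len(images)))
    totals = {name: 0.0 for name, _ in stages}
    for image in subset:
        data = image
        for name, fn in stages:
            start = time.perf_counter()
            data = fn(data)
            totals[name] += time.perf_counter() - start
    return {name: total / len(subset) for name, total in totals.items()}
```

Summing the returned values gives the total per-image runtime, which is the quantity compared between GRID and SMUF above.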

Conclusion and Future Work
We have presented two approaches to facial landmark localization from 3D face images, GRID and SMUF, that are robust to rotations, facial expressions and, in part, occlusions. We proposed a gating mechanism that allowed us to incorporate multiple pose-specific landmarking models (based on HOG features) into the alignment procedure and also developed a simultaneous descent map and binary feature learning algorithm around the proposed gating mechanism. To assess performance, we evaluated the developed landmarking techniques on three challenging datasets containing 3D face images with large head rotations. Our results showed that the proposed solutions exhibit high robustness to different types of appearance variations and display competitive performance when compared to the state-of-the-art. As part of our future work we plan to combine the proposed landmarking methods with face frontalization (or pose correction) procedures and incorporate all developed methods into pose-invariant 3D face recognition systems.

Fig. 3 Schematic representation of the gating mechanism used in this work. Multiple landmarking models (each containing a cascade of regression models) are trained during the learning stage. At run-time a gating function is used to select the landmarking model that best fits the characteristics of the test data.

Fig. 4 Multiple landmark initializations $\{x_1^z\}_{z=1:3}$ and the ground truth landmarks $x^*$ superimposed on an example test image. The gating mechanism used in this work determines the pose of the test image (and consequently selects a landmarking cascade) by comparing features extracted from different shape initializations of the test image to the average features extracted from the true landmark locations of all images in the pose-specific training sets.
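The selection step described in this caption can be sketched as a nearest-mean decision. The Euclidean distance used here is an assumption (the caption does not name the comparison measure), and the function and argument names are illustrative.

```python
import math


def select_descent_map(features_per_init, mean_features_per_pose):
    """Pick the pose-specific cascade whose training statistics fit best.

    features_per_init      -- features_per_init[z] is the feature vector
                              extracted around the z-th shape initialization
    mean_features_per_pose -- mean_features_per_pose[z] is the average feature
                              vector at the true landmarks of training set z
    Returns the index z with the smallest Euclidean distance (an assumed
    choice of distance measure).
    """
    distances = [math.dist(f, m)
                 for f, m in zip(features_per_init, mean_features_per_pose)]
    return min(range(len(distances)), key=distances.__getitem__)
```

For example, with three initializations the call returns 0, 1 or 2, and the corresponding cascade is then used to refine the landmarks.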

Fig. 6 Overview of the training and testing stages of the GRID and SMUF landmarking techniques. Both techniques use a similar processing pipeline, but the SMUF approach also learns features (marked by $W_k^z$) in each stage of the training procedure in addition to the landmarking cascade (marked by $R_k^z$, $k = 1, 2, \ldots, K$) learned by GRID.
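At test time, a selected cascade refines the landmark estimate stage by stage. The sketch below assumes the common cascaded linear regression form, in which each stage adds a linear function of the current features to the shape estimate; the exact update rule and feature extraction are not given in this caption, so `extract_features` is a hypothetical callable and any bias term is assumed to be folded into the feature vector.

```python
def run_cascade(x0, regressors, extract_features):
    """Apply K linear regression stages to refine a landmark vector.

    x0               -- initial shape as a flat list [x1, y1, x2, y2, ...]
    regressors       -- list of K matrices R_k (lists of rows); row i maps the
                        feature vector to the update of coordinate i
    extract_features -- hypothetical callable: shape -> feature vector
    """
    x = list(x0)
    for R in regressors:
        # Features are re-extracted around the current landmark estimate.
        phi = extract_features(x)
        # Matrix-vector product R_k * phi gives the shape update.
        update = [sum(r * f for r, f in zip(row, phi)) for row in R]
        x = [xi + ui for xi, ui in zip(x, update)]
    return x
```

With K = 7 stages, as used in the experiments, `regressors` would hold seven matrices for each pose-specific cascade.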

Fig. 7 Illustration of the face detection procedure used on the FRGCv2 and UND datasets. The procedure uses a simple k-means clustering approach (with k = 3) and selects the cluster with the lowest mean depth as the face region. The figure shows: input image (left), color-coded clusters (middle), cropped and smoothed face image (right).

Fig. 8 Mean localization errors achieved by GRID and SMUF for individual landmarks of the FRGCv2+UND and Bosphorus datasets. The size of the circles corresponds to the localization errors. The numbering of the landmarks as shown here is also used in Figs. 9 and 10. The lowest errors are achieved on distinct landmarks with corner-like properties, e.g., the mouth corners.

Fig. 9 Localization errors in the form of box-and-whisker plots for the Bosphorus dataset achieved with the GRID and SMUF landmarking techniques. The results show that the lowest localization errors with both methods are achieved on distinct landmarks with corner-like characteristics, such as the eye or mouth corners or the nose tip. The figure is best viewed in color.

Fig. 11 Exemplar landmark detection results on the Bosphorus database: the first row depicts randomly chosen test samples (a)-(h), while the second row includes samples with high localization errors due to expressions (i)-(j), occlusions (k)-(l), head rotations (m)-(n) and incorrectly selected descent maps (o)-(p).

Fig. 12 Exemplar landmark detection results on the UND and FRGCv2 datasets: the first row depicts test images with typical localization performance, while images from the second row are selected among the worst samples measured by the localization error.

Qualitative Evaluation
In this section we qualitatively assess the landmarking performance of the proposed landmarking methods. Figs. 11 and 12 show exemplar face images from the Bosphorus and UND datasets with localized landmarks marked by red dots. The top rows of both figures contain samples with typical localization performance, where we can see that the methods exhibit stable performance in the presence of different types of variability, such as expressions, partial occlusions and head rotations. However, there are some cases where landmarks are poorly localized. Such samples with large localization errors are shown in the second rows of Figs. 11 and 12. For example, large occlusions of face areas (Figs. 11k and 11l) can cause increased localization errors of visible landmarks. Some of the localization errors originate from poor face detection and cropping, where an image can also contain non-head regions (Figs. 12e and 12g) or parts of the face area are cropped out (Fig. 12f). Incorrectly selected descent maps can also be the cause of landmark localization errors (Figs. 11o, 11p and 12h).

Table 1
Summary of the existing 3D facial landmark detection methods

Table 2
Overview of the datasets used for experimentation.
The FRGCv2 dataset is among the most frequently used datasets of 3D face images, whereas the Bosphorus and UND datasets contain challenging images with a high degree of variability in face orientations and are, hence, well suited for our experiments.

Table 3
Mean localization errors (and standard deviations) for the GRID and SMUF methods on the Bosphorus dataset. Results are reported for different variants of both landmarking techniques implemented with 1, 3, or 5 regression cascades. The best overall performance is achieved with 5 cascades for both techniques. The results also show that the gating function always selects the correct cascade (observe the results for frontal images across the different landmarking variants).

Table 5
Localization errors of GRID and SMUF in comparison to the state-of-the-art on non-frontal facial datasets for 10 common facial landmarks. GRID and SMUF significantly outperform competing methods on all experimental datasets. When comparing the learned binary features from SMUF to hand-crafted LBP features, we observe better performance for the learned binary features. GRID and SMUF again perform similarly for all landmarks.