1 Introduction

Face alignment or facial landmarking refers to the task of locating salient facial features in face images and is of paramount importance in various applications, including face registration and recognition [21, 41], expression recognition [49], face tracking [37], normalization of facial pose, size, and expression [16], and face synthesis from morphable models and facial animation [23], to name a few. In real-world scenarios, where face images are often acquired in uncontrolled conditions, one has to deal with various unfavorable factors that adversely affect landmarking performance, including pose, expression, and illumination variations as well as partial occlusions of facial areas. These factors influence the appearance of the facial features not only in traditional 2D images, e.g., [12], but also in the 3D (or, more precisely, 2.5D) face data used in this work. Although some of the existing landmark localization procedures promise to be (at least partially) robust to some of the factors mentioned above (e.g., [9, 25, 36]), reliable localization of facial landmarks in the presence of highly variable nuisance factors remains a considerable challenge.

With the advancement of 3D acquisition technology, landmark localization on 3D facial data has recently been researched extensively [13, 36]. Many of the 3D landmarking techniques proposed in the literature in the last few years rely on the so-called cascaded regression framework, where facial landmarks are estimated by regressing from facial features to landmark locations in a cascaded (iterative) manner [2]. Techniques following this framework have made considerable advancements toward robust facial landmarking, although they generally still use hand-crafted features, such as the scale-invariant feature transform (SIFT) or histograms of oriented gradients (HOG) [2, 3, 39]. Additionally, these methods typically rely on a single regression model in each stage of the cascade to estimate the facial landmarks regardless of the facial characteristics. However, as facial appearance is a complex function of various factors, such as facial shape, pose, incident illumination, expression, and occlusion, a single model is often not sufficient to capture the broad range of variability commonly encountered in facial-image data and to robustly estimate the location of the most salient facial features.

Fig. 1

Sample results of the GRID landmarking approach proposed in this paper. Our model is able to reliably estimate the location of salient facial features in 3D face data even in the presence of large pose variations, i.e., with yaw angles of up to ± 90°

To address this problem, we propose in this paper a novel gating mechanism that incorporates multiple cascaded regression-based models, each trained for a narrow range of facial poses, into a single (coherent) landmarking model that is able to reliably estimate the location of salient facial features from arbitrarily posed input face data. The combination of simpler view-specific landmarking approaches provides the gating-based model with the necessary expressive power to describe the considerable appearance variability typically seen in 3D face data captured under different facial poses and, consequently, allows it to reliably estimate the landmark locations regardless of the facial pose of the input image. The model is partially motivated by the success of earlier methods designed for 2D images that combine multiple landmarking models trained for face alignment of different views, e.g., [4, 15, 29, 47, 50], but unlike these early methods it does not rely on parametric appearance or shape models.

We develop two distinct facial landmarking approaches around the proposed gating mechanism. The first relies on a combination of the Gated multiple RIdge Descent (GRID) mechanism and established HOG features and, as illustrated in Fig. 1, achieves remarkable landmarking performance across a broad range of pose variations. Even for poses with yaw rotations of up to ± 90°, the model is still able to reliably estimate the location of salient facial features. The second approach again relies on the introduced gating mechanism, but in addition to the cascaded regression models needed for face alignment of each group of poses, it also learns a feature representation that is used with the regression models for landmark estimation. Specifically, the model Simultaneously learns MUltiple-descent directions as well as binary Features (SMUF) that are optimal for the alignment task and, due to their binary nature, also ensure extremely fast execution times. This second approach follows recent trends in computer vision and aims to learn a feature representation that is optimal for a given task, but unlike the deep learning models typically used for feature learning [40, 42], it relies on a computationally much simpler scheme, where binary features are learned based on a learning objective that can be solved using standard optimization procedures.

To evaluate the proposed landmarking approaches, we conduct experiments on multiple datasets of 3D face images, i.e., the Face Recognition Grand Challenge v2 (FRGCv2) [26], the Bosphorus 3D face dataset [31], and the University of Notre Dame dataset (collections F and G, hereinafter referred to as the UND dataset) [46]. We present extensive experiments and comparisons with state-of-the-art methods from the literature. The results of our evaluation show that GRID ensures state-of-the-art performance for facial landmark localization in 3D face data across pose, while SMUF not only yields competitive landmarking accuracy but is also extremely fast.

Our main contributions in this paper are as follows:

  • We propose a gating mechanism for face alignment in 3D face data that allows us to combine multiple alignment models and exploit their joint power for face alignment across pose. The use of multiple models adds to the overall localization performance, since each model needs to account only for a limited set of plausible facial variations. With this approach, reliable landmarking is possible even under large head rotations, such as in profile facial images with yaw rotations of up to ± 90° (see Fig. 1), where many competing methods fail.

  • We develop two distinct landmarking approaches based on the introduced gating mechanism, where one is optimized for performance and the second one is optimized for both performance and speed. We evaluate both approaches in rigorous experiments on multiple face datasets and report competitive performance in comparison to competing methods from the literature.

  • We study different configurations of the proposed approaches and investigate their behavior when localizing specific facial landmarks.

The rest of the paper is organized as follows: In Sect. 2, we describe prior work in the field of facial landmarking with the goal of providing the necessary context for our contributions. In Sect. 3, we present our gating mechanism and the GRID and SMUF alignment techniques. We describe the experiments conducted to evaluate the performance of the proposed methods and discuss the results in Sect. 4. We conclude the paper with some final remarks and future research challenges in Sect. 5.

2 Related work

Numerous methods for automatic facial landmark localization have been proposed in the literature over recent years. In this section, we present a brief overview of these methods with a focus on alignment techniques that work on 3D images. These techniques can be categorized in various ways, but here we adopt the categorization shown in Fig. 2. We classify existing techniques into two groups: (1) techniques that depend entirely on geometric information and derive prior knowledge about the facial structure and the location of facial landmarks by defining a number of heuristic rules and (2) techniques that rely on trained statistical models. The latter group of techniques is further divided, according to the type of model utilized, into generative and discriminative methods. A high-level comparison of the related work discussed in this section is given in Table 1.

Table 1 Summary of the existing 3D facial landmark detection methods
Fig. 2

Taxonomy of 3D landmarking approaches as discussed in “Related work” section. Recent work is largely focusing on statistical approaches, where landmarking is learned from annotated training data using either generative or discriminative models

2.1 Geometric approaches

Geometric approaches to facial landmarking are generally training-free and depend solely on geometric information such as surface curvature or shape index values. A number of rules and heuristics encode the prior knowledge about the relationships between adjacent landmarks (e.g., the nose tip lies on the facial symmetry axis, the eyes are located above the nose tip, etc.). In most cases, the rules used to define the location of facial landmarks require the face to be in an upright and near-frontal position. Moreover, a common downside of these methods is that the landmarks are detected in sequence (commonly starting with the nose tip) and the success rate of finding the next landmark in the sequence depends on successfully locating the preceding one. With these methods, an incorrect detection of one landmark affects the detection of all subsequent landmarks.

Exemplar geometric methods  [1, 6, 11, 22, 33] start by detecting the nose tip and use its location to constrain the search space of the remaining landmarks. Landmark detection can be grounded on the analysis of Gaussian curvatures [11], profile curvatures [6], x and y coordinate projections of the depth data [33], shape index values and facial symmetry lines [1], or horizontal slices of range images [22], to name a few.

2.2 Statistical approaches

Statistical landmarking approaches also exploit local shape information around candidate landmark locations. Additionally, these methods derive prior knowledge about location constraints from the training data and encode the acquired knowledge into a statistical model. Thus, these methods require a training set of facial images with annotated landmarks. Unlike training-free geometric approaches, statistics are utilized uniformly for all landmarks, as there is no need for specific rules for each individual landmark. Since all landmarks are handled simultaneously, approaches from this group are typically more robust to local distortions, missing data, and occlusions of individual landmarks. However, the fact that statistical methods generally address a complete set of landmarks defined by the model can prove problematic when a large number of landmarks is (self-)occluded or data are missing from the input images due to acquisition errors. Recently, efforts have been made to handle such problems, e.g., [36] use a flexible shape model that works even with an incomplete set of landmarks.

In terms of the type of statistical model used, approaches from this group can be divided into techniques that rely either on generative or on discriminative models. We discuss both types of techniques in the following two subsections.

2.2.1 Generative approaches

Landmark locations can be modeled by generative models, such as active appearance models, active shape models [8, 18, 34], or morphable models [9]. Techniques from this group often learn facial appearance (and/or shape) by conducting principal component analysis (PCA) [38] on manually annotated and aligned training data. Given a test image, alignment/fitting is achieved by minimizing the difference between the current estimate of the appearance (and/or shape) and the test face. Due to the involved fitting procedure, generative approaches are often computationally expensive and may perform poorly in the presence of occlusions as well as pose, expression, and illumination variations.

2.2.2 Discriminative approaches

More recent approaches to face alignment mostly rely on discriminative models that learn a mapping function to predict the shape, i.e., the landmark locations, directly from corresponding image features. Methods from this category typically offer better landmark localization performance than generative models, especially for faces with greater variability in appearance [2, 3, 9, 36]. These methods commonly incorporate shape constraints into the models and use local descriptors that are more robust to appearance variations than the conventional depth/intensity pixel values used with generative approaches. Discriminative approaches include random forests [7], graph matching [30], cascaded regression [2, 3, 9, 39], specifically tailored shape models [24, 25, 36, 48], hidden Markov models [17], and convolutional neural networks [40].

The landmarking techniques proposed in this work fall into the group of discriminative approaches. They build on recent face alignment techniques relying on cascaded regression models, which have proven highly successful for landmark localization in 2D face images, e.g., [44, 45]. However, compared to these models, our solutions exhibit unique features, such as the novel gating mechanism for exploiting multiple pose-specific landmarking models and the ability to incorporate task-specific binary features into the landmarking procedure.

3 Methodology

In this section, we describe GRID and SMUF, two novel facial landmarking approaches built around the gating mechanism illustrated in Fig. 3. As can be seen, the gating mechanism partitions the search space for the landmark localization procedure into a number of sub-domains, where each sub-domain encompasses a range of similar facial poses. A separate landmarking model is then trained for each of the sub-domains, and the gating mechanism is used to select the most suitable landmarking model for the given test image. Based on this overall framework, we develop two distinct landmarking techniques, which are described next.

Fig. 3

Schematic representation of the gating mechanism used in this work. Multiple landmarking models (each containing a cascade of regression models) are trained during the learning stage. At run time, a gating function is used to select the landmarking model that best fits the characteristics of the test data

3.1 GRID description

We design GRID (Gated multiple RIdge Descent) in line with the powerful cascaded regression framework for face alignment, where landmark locations (or, in other words, the facial shape) are estimated by regressing from facial features to landmark locations in a cascaded manner. In the first step of this framework, features are extracted from some initial landmark configuration (estimated from the training data) and a regression model is applied to the extracted features to predict landmark updates that better align the landmarks with the actual test image. The update results in a new landmark configuration that forms the basis for the next step in the cascade. The entire procedure is repeated multiple times and, thus, sequentially refines the predicted locations of the facial landmarks in the test image.

With GRID, we train multiple cascaded regression models and integrate them into a gated approach that is robust to pose variations. While different regression models and feature representations have been proposed in the literature for facial landmarking, we build GRID around the supervised descent method (SDM) from [44], which has proven successful not only for facial landmark localization in 2D images [43], but also for the alignment of 3D face images, as we have shown in [2, 14].

3.1.1 Background

To train the regression models needed for landmarking, SDM requires a number of facial images \(\{{\varvec{I}}_n\}_{n=1}^N\), where each image \({\varvec{I}}\) has L landmarks annotated in the form of a shape vector \({\varvec{x}}_*\in \mathbb {R}^{2L\times 1}\). The landmark localization task is then posed as a minimization problem over \(\varDelta {\varvec{x}}\):

$$\begin{aligned} \underset{\varDelta {\varvec{x}}}{\arg \min }\left\| h({\varvec{I}},{\varvec{x}}_1+\varDelta {\varvec{x}}) - {\varvec{\phi }}_* \right\| ^2 \text {,} \end{aligned}$$
(1)

where h is a feature extraction function, \({\varvec{\phi }}_* = {h}({\varvec{I}},{\varvec{x}}_*)\) are features extracted around the ground truth landmarks \({\varvec{x}}_*\), \({\varvec{x}}_1\) is an initial landmark configuration, and \(\varDelta {\varvec{x}}\) is a landmark update (known for the training data).

Equation (1) represents a nonlinear least squares problem and, in general, has no closed-form solution. However, it was shown in [43] that the problem can be solved through a cascade of least squares regression problems. Thus, for each step k in the cascade, the solution of the least squares problem is given in the form of a regression matrix \({\varvec{R}}_k\) (also referred to as a descent map, DM) that can be used to predict the update of the landmark locations from the current image features. The learning algorithm is formulated as a minimization of the loss between the true shape updates \(\hat{{\varvec{x}}}_k^n = {\varvec{x}}_k^n-{\varvec{x}}_*^n\) and the expected updates over all training images, i.e.,

$$\begin{aligned} \underset{{\varvec{R}}_k}{\arg \min }\sum _{n}\left\| \hat{{\varvec{x}}}_k^n - {\varvec{R}}_k \left( {\varvec{\phi }}_{k}^n - \overline{{\varvec{\phi }}_*}\right) \right\| ^2 \text {,} \end{aligned}$$
(2)

where n is a training-image index and \(\overline{{\varvec{\phi }}_*}\) is an average feature vector computed from the ground truth locations \({\varvec{x}}_*^n\) over all training images. Equation (2) now represents a sequence of ordinary least squares regression problems that can be solved in closed form.

At test time, the algorithm starts with some initial landmark locations \({\varvec{x}}_1\), for which the face shape (landmark configuration) is defined by the average landmark locations of the training images and the position of the shape is determined by the face detection procedure, and sequentially updates this initial estimate to obtain the final landmark locations, i.e.,

$$\begin{aligned} {\varvec{x}}_{k+1} = {\varvec{x}}_k + {\varvec{R}}_{k}\left( {\varvec{\phi }}_k - \overline{{\varvec{\phi }}_*} \right) \text {,} \end{aligned}$$
(3)

so that the final shape \({\varvec{x}}_k\) converges to \({\varvec{x}}_*\) for all training images. The number of steps K in the cascade, where \(k=1,2,\ldots ,K\), varies depending on the implementation, but typical values of K lie between 3 and 10.
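To make this procedure concrete, the following minimal Python/NumPy sketch implements the test-time cascade of Eq. (3); the names used here (e.g., extract_features standing in for h) are illustrative assumptions rather than our actual implementation, which is written in Matlab (cf. Sect. 4.7).

```python
import numpy as np

def sdm_inference(image, x_init, R_list, phi_bar, extract_features):
    """Test-time SDM cascade of Eq. (3).

    image            -- input (range) image
    x_init           -- initial shape x_1, a (2L,) vector of landmark coordinates
    R_list           -- the K learned descent maps R_1, ..., R_K
    phi_bar          -- mean ground-truth feature vector from the training data
    extract_features -- feature function h(I, x) evaluated around the landmarks
    """
    x = x_init.copy()
    for R_k in R_list:                    # K cascade stages, typically 3 <= K <= 10
        phi = extract_features(image, x)  # features at the current landmark estimate
        x = x + R_k @ (phi - phi_bar)     # shape update of Eq. (3)
    return x
```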

3.1.2 Ridge regression

The original SDM [43] formulation uses a least squares solution of Eq. (2) to learn the DMs, i.e.,

$$\begin{aligned} {\varvec{R}}_k = \hat{{\varvec{X}}}_k\hat{{\varvec{\varPhi }}}_k^\top \left( \hat{{\varvec{\varPhi }}}_k \hat{{\varvec{\varPhi }}}_k^\top \right) ^{-1}\text {,} \end{aligned}$$
(4)

where \(\hat{{\varvec{X}}}_k\) is a shape matrix with nth column \(\hat{{\varvec{x}}}_k^n\) and \(\hat{{\varvec{\varPhi }}}_k\) is a feature matrix with nth column \({\varvec{\phi }}_{k}^n - \overline{{\varvec{\phi }}_*}\). To solve Eq. (4), one needs to compute the inverse of \(\hat{{\varvec{\varPhi }}}_k \hat{{\varvec{\varPhi }}}_k^\top\), which, however, may be singular when the size of the feature vectors is too large or when the features are correlated. To overcome this issue, the original SDM applies PCA [38] to the image features before inverting the matrix.

However, we have shown in [2] that better landmarking performance is achieved if ridge regression is used in the original feature space instead of least squares regression in the PCA subspace. The optimization function in (2) in this case can be written as

$$\begin{aligned} \underset{{\varvec{R}}_k}{\arg \min }\sum _{n}\left\| \hat{{\varvec{x}}}_k^n - {\varvec{R}}_k \left( {\varvec{\phi }}_{k}^n - \overline{{\varvec{\phi }}_*}\right) \right\| ^2 + \gamma _k\left\| {\varvec{R}}_k\right\| ^2\text {,} \end{aligned}$$
(5)

where \(\gamma _k\) denotes a regularization factor and the solution of Eq. (5) is computed as

$$\begin{aligned} {\varvec{R}}_k = \hat{{\varvec{X}}}_k\hat{{\varvec{\varPhi }}}_k^\top \left( \hat{{\varvec{\varPhi }}}_k\hat{{\varvec{\varPhi }}}_k^\top + \gamma _k {\mathbf {I}}\right) ^{-1}\text {,} \end{aligned}$$
(6)

where \({\mathbf {I}}\) is an identity matrix. The regularization factor \(\gamma _k \ge 0\) mitigates the instability of the least squares estimate. Selecting a suitable value for \(\gamma _k\) avoids over-fitting and helps to produce estimates of \({\varvec{R}}_k\) that generalize better to unseen data.
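For illustration, the ridge solution of Eq. (6) can be computed for one cascade stage as in the sketch below; the matrix shapes follow the definitions above, and solving a linear system instead of forming the inverse explicitly is a standard numerical choice, not a detail taken from our implementation.

```python
import numpy as np

def learn_descent_map(X_hat, Phi_hat, gamma):
    """Closed-form ridge regression of Eq. (6) for one cascade stage.

    X_hat   -- (2L, N) matrix of true shape updates (columns: x_hat_k^n)
    Phi_hat -- (m, N) matrix of centered features (columns: phi_k^n - phi_bar)
    gamma   -- regularization factor gamma_k >= 0
    """
    m = Phi_hat.shape[0]
    A = Phi_hat @ Phi_hat.T + gamma * np.eye(m)  # regularized Gram matrix
    # R_k = X_hat Phi_hat^T A^{-1}, computed by solving A^T R_k^T = Phi_hat X_hat^T
    R_k = np.linalg.solve(A.T, (X_hat @ Phi_hat.T).T).T
    return R_k
```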

3.1.3 Gated multiple ridge descent

Experimental results in [2, 43, 45] have shown that the original SDM achieves remarkable landmarking performance on various 2D and 3D datasets. However, it still tends to perform poorly when, for example, large head rotations are present in the facial data [45]. Such rotations cause complex facial appearance variations that are difficult to model and hard to account for when using only a single DM in each step of the landmarking cascade.

To increase the robustness of the model to pose variations, we propose to exploit multiple DMs \(\{{\varvec{R}}_k^z\}_{z=\{1:Z\}}\) such that each of the Z DMs accounts for a specific range of head rotations, as illustrated in Fig. 3. Toward this end, we partition the available training images \(\{{\varvec{I}}_n\}_{n=1}^N\) into Z pose-specific subsets and train separate regression cascades for each subset in line with Eq. (6).

Fig. 4

Multiple landmark initializations \(\{{\varvec{x}}_1^z\}_{z=\{1:3\}}\) and the ground truth landmarks \({\varvec{x}}_*\) superimposed on an example test image. The gating mechanism used in this work determines the pose of the test image (and consequently selects a landmarking cascade) by comparing the features extracted from different shape initializations of the test image to the average features extracted from the true landmark locations of all images in the pose-specific training sets

Once all Z cascades (series of DMs) are trained, a gating function \(g_z\) is used to select the most suitable DM (from the Z DMs available in the first cascade stage) for a given test image. The selection procedure begins by computing features \(\{{\varvec{\phi }}_1^z\}_{z=\{1:Z\}}\) from the initial landmark locations \(\{{\varvec{x}}_1^z\}_{z=\{1:Z\}}\) in the test image. Here, the initial landmark locations \({\varvec{x}}_1^z\) (see Fig. 4) are computed by averaging the ground truth shapes over all training images from the zth training subset:

$$\begin{aligned} {\varvec{x}}_1^z=\frac{1}{|z|}\sum _{n\in z}{\varvec{x}}_*^n\text {.} \end{aligned}$$
(7)

The most fitting DM for the given test image is then selected based on the output of the gating function \(g_z\):

$$\begin{aligned} g_z({\varvec{\phi }}_1^z) = \sqrt{\frac{1}{m}\left( {\varvec{\phi }}_1^z - \overline{{\varvec{\phi }}_*}\right) ^\top {\varvec{\varSigma }}_*^{-1}\left( {\varvec{\phi }}_1^z - \overline{{\varvec{\phi }}_*}\right) }\text {,} \end{aligned}$$
(8)

where m is the feature vector length (in the case of SIFT features, \(m=128\times L\)) and \(\overline{{\varvec{\phi }}_*}\) and \({\varvec{\varSigma }}_*\) are the average and the covariance matrix of the ground truth features over the zth training subset, respectively. We select the subset \(z_*\) for which the gating function outputs the lowest value:

$$\begin{aligned} z_* \in \{1,\ldots , Z\}:g_{z_*}=\min _{z}(g_z)\text {.} \end{aligned}$$
(9)

By doing so, we reliably choose the DM \({\varvec{R}}_1^{z_*}\) that has been trained on images with face orientations similar to that of the test face. For all subsequent steps k, we use the DMs \({\varvec{R}}_k^{z_*}\) that correspond to the \(z_*\)th subset selected in the first step \(k=1\) and, for efficiency reasons, do not change the regression cascade in subsequent steps. Location updates on a given test image are thus computed as

$$\begin{aligned} {\varvec{x}}_{k+1} = {\varvec{x}}_{k} + {\varvec{R}}_k^{z_*}\left( {\varvec{\phi }}_{k}^{z_*} - \overline{{\varvec{\phi }}_*} \right) \text {.} \end{aligned}$$
(10)

The described procedure results in significantly improved performance in the case of large head rotations, as shown later in Sect. 4.

Note that we rely on HOG features to implement the feature extraction function h in GRID because of their proven performance in prior landmarking models, e.g., [2, 14].
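A minimal sketch of this gating step (Eqs. (7)–(9)) is given below; the per-subset statistics (the mean feature vector and the inverse covariance matrix of each training subset) are assumed to be precomputed, and all names are again illustrative.

```python
import numpy as np

def select_descent_map(image, inits, stats, extract_features):
    """Gating function of Eqs. (8)-(9): select the pose-specific cascade z*.

    inits            -- list of Z mean initial shapes x_1^z (Eq. (7))
    stats            -- list of Z pairs (phi_bar_z, Sigma_inv_z) computed from the
                        ground-truth features of the z-th training subset
    extract_features -- feature function h (HOG in the case of GRID)
    """
    scores = []
    for x_init, (phi_bar, Sigma_inv) in zip(inits, stats):
        phi = extract_features(image, x_init)
        diff = phi - phi_bar
        # normalized Mahalanobis-type distance of Eq. (8)
        scores.append(np.sqrt(diff @ Sigma_inv @ diff / diff.size))
    return int(np.argmin(scores))  # z* of Eq. (9)
```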

3.2 SMUF description

The GRID landmarking approach presented in the previous section relies on HOG features to encode the appearance of the facial landmarks during face alignment. With SMUF (Simultaneous MUlti-descent regression and binary Feature learning), we take a step further and learn facial features that are optimal for face alignment. We choose to learn binary features due to their simplicity and, most of all, their computational efficiency. In the following subsections, we first review the idea of binary feature learning and then develop the SMUF approach, which jointly learns a landmarking model and corresponding binary features that are optimal for this task.

3.2.1 Binary feature learning

Hand-crafted binary features, such as local binary patterns (LBPs) [27], are powerful image descriptors that have proven highly effective in various computer vision tasks. These features typically rely on pixel comparisons within a local neighborhood and on heuristic rules that encode the comparisons into binary codes. As such, they may be suboptimal, and better features could potentially be constructed by learning the binary encoding under a dedicated learning objective.

Gong et al. [10], for example, propose a learning objective where binary features are learned from an initial image representation \({\varvec{d}}\) such that the quantization error is minimized. Binary features \({\varvec{\phi }}\) (containing only 0s and 1s) can be computed from \({\varvec{d}}\) as

$$\begin{aligned} {\varvec{\phi }}=0.5({{\,\mathrm{sgn}\,}}({\varvec{W}}^\top {\varvec{d}})+1)\text {,} \end{aligned}$$
(11)

where \({\varvec{W}}\) is a matrix of hash functions that defines the length of the binary code and \({{\,\mathrm{sgn}\,}}(.)\) stands for the signum function. The learning objective \(L_q\) that needs to be minimized over \({\varvec{W}}\) on some training data can be written as

$$\begin{aligned} L_q=\left\| {{\varvec{\phi }}} - 0.5 - {\varvec{W}}^\top {\varvec{d}} \right\| ^2\text {.} \end{aligned}$$
(12)

It was shown by Lu et al. [19, 20] that descriptive binary image features can be computed based on the above quantization scheme if pixel (or depth in our case) difference values are used as input \({\varvec{d}}\) for binarization. For SMUF, we follow this approach and compute one depth-difference vector \({\varvec{d}}\) for each considered landmark, as illustrated in Fig. 5.

Fig. 5

We use depth-difference vectors \({\varvec{d}}\) as the basis for binary feature calculation as proposed in [19, 20]. The example image above shows how one such vector is computed for a selected landmark. The local pixel neighborhood shown here is of size \(3\times 3\) and is selected only for illustration purposes. We use larger neighborhoods for the actual SMUF implementation
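A simple sketch of this feature computation is given below; it assumes a range image indexed as depth[y, x], integer landmark coordinates, and a neighborhood radius r, all of which are illustrative choices (the actual SMUF implementation uses larger neighborhoods than the \(3\times 3\) example of Fig. 5).

```python
import numpy as np

def depth_difference_vector(depth, landmark, r=3):
    """Depth differences between a landmark and its (2r+1)x(2r+1) neighborhood,
    in the spirit of [19, 20]; the zero difference at the center is kept for
    simplicity."""
    x, y = landmark
    patch = depth[y - r:y + r + 1, x - r:x + r + 1]
    return patch.flatten() - depth[y, x]

def binarize(W, d):
    """Binary feature of Eq. (11): phi = 0.5 * (sgn(W^T d) + 1), with sgn(0)
    mapped to +1 so that the output is strictly binary."""
    return (W.T @ d >= 0).astype(float)
```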

3.2.2 Simultaneous DM and feature learning

The learning objective presented in the previous section is focused on representational power, as the binary features are computed in a manner that minimizes a quantization loss. To make the features useful for landmarking, we now formulate a joint optimization function that allows us to simultaneously learn a regression cascade and corresponding binary features that are optimal for the landmarking task.

Let \({{\varvec{D}}}_k = [{{\varvec{d}}}_k^1,\ldots ,{{\varvec{d}}}_k^{LN}]\) be a set of depth-difference vectors extracted from patches centered at the facial landmarks \({\varvec{X}}_k = [{\varvec{x}}_k^1,\ldots ,{\varvec{x}}_k^{LN}]\), where k stands for the cascade stage, \(k=1,2,\ldots ,K\). The depth-difference matrix \({{\varvec{D}}}_k\) is mapped to a binary feature matrix \({\varvec{\varPhi }}_k\) as follows:

$$\begin{aligned} {\varvec{\varPhi }}_k=0.5({{\,\mathrm{sgn}\,}}({\varvec{W}}_k^\top {{\varvec{D}}}_k)+1)\text {,} \end{aligned}$$
(13)

where \({\varvec{W}}_k\) is a feature projection matrix and \({{\,\mathrm{sgn}\,}}(.)\) is again the signum function. To learn \({\varvec{W}}_k\), we formulate the following optimization problem by rewriting (5) into matrix form and extending it with the additional constraint \(C_2\):

$$\begin{aligned} \mathop {{\mathrm{arg}}\,{\mathrm{min}}}\limits _{{\varvec{R}}_k,{\varvec{W}}_k} C &=C_1 + \lambda C_2 \nonumber \\ &=\left\| \hat{{\varvec{X}}}_k - {\varvec{R}}_k\tilde{{\varvec{\varPhi }}}_k \right\| ^2 + \gamma \left\| {\varvec{R}}_k \right\| ^2 \nonumber \\&+\,\lambda \left\| {\varvec{R_k}}(\tilde{{\varvec{\varPhi }}}_k - 0.5 - {\varvec{W}}_k^\top \hat{{\varvec{D}}}_k) \right\| ^2\text {,} \end{aligned}$$
(14)

where

$$\begin{aligned} \tilde{{\varvec{\varPhi }}}_k=0.5({{\,\mathrm{sgn}\,}}({\varvec{W}}_k^\top \hat{{\varvec{D}}}_k)+1) \end{aligned}$$
(15)

and \(\hat{{\varvec{D}}}_k = {\varvec{D}}_k - {\varvec{D}}_*\), where \({\varvec{D}}_*\) are the depth-difference values of the ground truth landmark locations. As already emphasized above, the objective of \(C_2\) is to minimize the quantization loss between the original depth-difference values and the binarized features, so that most of the depth-difference energy can be preserved in the learned binary features.

We find optimal values for \({\varvec{R}}_k\) and \({\varvec{W}}_k\) through an iterative optimization procedure, where \({\varvec{W}}_k\) is initialized to a random orthogonal matrix. If we assume a fixed \({\varvec{W}}_k\), compute the partial derivative of C in (14) with respect to \({\varvec{R}}_k\), and set the derivative to zero, we obtain the following solution for \({\varvec{R}}_k\):

$$\begin{aligned} {\varvec{R}}_k &=\hat{{\varvec{X}}}_k\tilde{{\varvec{\varPhi }}}_k^\top \left[ \tilde{{\varvec{\varPhi }}}_k\tilde{{\varvec{\varPhi }}}_k^\top +\gamma _k {\mathbf {I}} \right. \nonumber \\&\left. +\,\lambda (\tilde{{\varvec{\varPhi }}}_k-0.5-{\varvec{W}}_k^\top \hat{{\varvec{D}}}_k)(\tilde{{\varvec{\varPhi }}}_k-0.5 -{\varvec{W}}_k^\top \hat{{\varvec{D}}}_k)^\top \right] ^{-1}\text {.} \end{aligned}$$
(16)

In the next step, we aim to learn \({\varvec{W}}_k\) with a fixed \({\varvec{R}}_k\) and, hence, rewrite (14) as follows:

$$\begin{aligned} \mathop {{\mathrm{arg}}\,{\mathrm{min}}}\limits _{{\varvec{W}}_k} C &=\left\| \hat{{\varvec{X}}}_k - {\varvec{R}}_k{\varvec{W}}_k^\top \hat{{\varvec{D}}}_k \right\| ^2 + \gamma \left\| {\varvec{R}}_k \right\| ^2 \nonumber \\&+\,\lambda \left\| {\varvec{R_k}}(\tilde{{\varvec{\varPhi }}}_k - 0.5 - {\varvec{W}}_k^\top \hat{{\varvec{D}}}_k) \right\| ^2\text {.} \end{aligned}$$
(17)

If we differentiate (17) with respect to \({\varvec{W}}_k\) and set the derivative to zero, we obtain the following update rule for \({\varvec{W}}_k\):

$$\begin{aligned} {\varvec{W}}_k = \left[ ({\varvec{R}}_k^{-1}\hat{{\varvec{X}}}_k+\lambda (\tilde{{\varvec{\varPhi }}}_k-0.5))\hat{{\varvec{D}}}_k \right] ^\top /(1+\gamma +\lambda )\text {.} \end{aligned}$$
(18)

The two optimization steps in (16) and (18) are then repeated until both \({\varvec{R}}_k\) and \({\varvec{W}}_k\) converge.
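The alternating scheme can be sketched as follows. Note that, to make the matrix dimensions agree, we read \(\hat{{\varvec{D}}}_k\) in Eq. (18) as transposed and use a pseudo-inverse in place of \({\varvec{R}}_k^{-1}\), since \({\varvec{R}}_k\) is generally not square; both choices are our own reading of the update rules, so the sketch should be taken as illustrative rather than as a reference implementation.

```python
import numpy as np

def learn_stage(X_hat, D_hat, code_len, gamma, lam, n_iter=4, seed=0):
    """Alternating optimization of Eqs. (16) and (18) for one cascade stage.

    X_hat -- (2L, N) true shape updates; D_hat -- (d, N) centered depth
    differences, with d >= code_len assumed for the orthogonal init of W.
    """
    rng = np.random.default_rng(seed)
    d = D_hat.shape[0]
    W = np.linalg.qr(rng.standard_normal((d, code_len)))[0]  # random orthogonal init
    for _ in range(n_iter):                       # ~4 iterations suffice (Sect. 3.3)
        Phi = (W.T @ D_hat >= 0).astype(float)    # binarization of Eq. (15)
        E = Phi - 0.5 - W.T @ D_hat               # quantization residual
        A = Phi @ Phi.T + gamma * np.eye(code_len) + lam * (E @ E.T)
        R = X_hat @ Phi.T @ np.linalg.inv(A)      # update of Eq. (16)
        M = np.linalg.pinv(R) @ X_hat + lam * (Phi - 0.5)
        W = (M @ D_hat.T).T / (1.0 + gamma + lam) # update of Eq. (18)
    return R, W
```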

Once stable versions of \({\varvec{R}}_k\) and \({\varvec{W}}_k\) are obtained, we compute the shape update in accordance with

$$\begin{aligned} {\varvec{X}}_{k+1} = {\varvec{X}}_k + {\varvec{R}}_k\tilde{{\varvec{\varPhi }}}_k \end{aligned}$$
(19)

and repeat the entire procedure for the next stage in the cascade. Note that because \(\tilde{{\varvec{\varPhi }}}_k\) is binary, location updates can be computed extremely quickly by simply summing up (specific) columns of \({\varvec{R}}_k\).
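For a single face, this shortcut amounts to the following sketch, where phi_binary is the 0/1 feature vector of Eq. (13) for that face:

```python
import numpy as np

def fast_shape_update(x, R_k, phi_binary):
    """Shape update of Eq. (19) for one face: since the feature vector is binary,
    the product R_k @ phi reduces to summing the columns of R_k that correspond
    to the 1-bits of phi, so no multiplications are needed."""
    return x + R_k[:, phi_binary.astype(bool)].sum(axis=1)
```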

Finally, we compute separate regression cascades and projection matrices for each of the Z training subsets, that is, for each considered group of poses, and integrate the computed cascades into the overall SMUF approach using the same gating mechanism as described above for the GRID approach.

3.3 Training and testing of GRID and SMUF

The overall processing pipeline for the SMUF landmarking approach is shown in Fig. 6. The procedure for GRID is identical, except for the fact that no features are learned during training.

Fig. 6

Overview of the training and testing stages of the GRID and SMUF landmarking techniques. Both techniques use a similar processing pipeline, but the SMUF approach also learns features (marked by \({\varvec{W}}_k^z\)) in each stage of the training procedure, in addition to the landmarking cascade (marked by \({\varvec{R}}_k^z, k=1,2,\ldots ,K\)) learned by GRID

The training stage of both methods begins by preprocessing all N training images \(\{{\varvec{I}}_n\}_{n=1:N}\), where the depth component of the surface normal is computed at each pixel and used instead of the original depth values. In each image, the face is detected using a simple clustering procedure [32] and the initial landmark locations \({\varvec{x}}_1^n\) are set based on the detected facial area. To capture the variance of the face detection procedure and to enlarge the amount of training data, we define additional initial landmark locations for each training image by randomly sampling scale and displacement parameters for the detected area from a normal distribution (see the sketch below). Starting from the initial location matrix \({\varvec{X}}_1\) along with the ground truth locations \({\varvec{X}}_*\), a number of DMs \({\varvec{R}}_k^z\) (and, for SMUF, also projection matrices \({\varvec{W}}_k^z\)) are learned. The updates (16) and (18) are iteratively recomputed until convergence (we empirically found that 4 iterations suffice) for each shape update step k and each of the Z training subsets.
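A possible implementation of the augmentation step is sketched below; the number of perturbations and the standard deviations of the scale and displacement distributions are assumptions, as the exact sampling parameters are implementation details.

```python
import numpy as np

def augment_initializations(x_mean, face_box, n_aug=10, sigma_s=0.05, sigma_t=3.0, seed=0):
    """Perturbed initial shapes mimicking the variance of the face detector.

    x_mean   -- mean training shape placed in the detected face box, a (2L,)
                vector with interleaved (x, y) coordinates
    face_box -- (cx, cy, w, h) of the detected facial area
    """
    rng = np.random.default_rng(seed)
    cx, cy = face_box[0], face_box[1]
    center = np.tile([cx, cy], x_mean.size // 2)
    inits = []
    for _ in range(n_aug):
        s = 1.0 + sigma_s * rng.standard_normal()                        # random scale
        t = np.tile(sigma_t * rng.standard_normal(2), x_mean.size // 2)  # random shift
        inits.append(center + s * (x_mean - center) + t)
    return inits
```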

When a test image is presented to the landmarking procedure, it goes through the same face detection, preprocessing, and feature extraction steps as the training images. DMs and feature projections are then selected as described in Sect. 3.1.3 and the final landmark locations are computed based on (10). The pseudocode of the GRID method is summarized in Algorithm 1, while the steps for the SMUF method are outlined in Algorithm 2.

Algorithm 1: Pseudocode of the GRID landmarking method
Algorithm 2: Pseudocode of the SMUF landmarking method
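For reference, the test-time flow of GRID (cf. Algorithm 1) can be summarized by combining the sketches introduced above; the SMUF flow (Algorithm 2) is analogous, with the learned binary features taking the place of HOG. As before, all names are illustrative.

```python
import numpy as np

def grid_localize(depth, cascades, inits, stats, extract_features):
    """Test-time GRID flow: gate to the pose-specific cascade, then run the DMs.

    cascades[z]  -- pair (R_list_z, phi_bar_z): the K descent maps and the mean
                    ground-truth feature vector of the z-th training subset
    inits, stats -- per-subset initializations and feature statistics used by
                    select_descent_map (see the sketch in Sect. 3.1.3)
    """
    z = select_descent_map(depth, inits, stats, extract_features)  # Eqs. (8)-(9)
    R_list, phi_bar = cascades[z]
    x = inits[z].copy()
    for R_k in R_list:                    # K = 7 stages in our experiments (Sect. 4.2)
        phi = extract_features(depth, x)
        x = x + R_k @ (phi - phi_bar)     # update of Eq. (10)
    return x
```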

4 Experiments

In this section, we evaluate the proposed GRID and SMUF landmarking approaches and compare them to the state-of-the-art. For all experiments, we report landmarking performance in accordance with the standard methodology used in this area [32]. Specifically, we use the localization error, i.e., the Euclidean distance in mm between the location of a detected landmark and the manually annotated ground truth landmark. For some of the experiments, we additionally compute the mean localization error over all landmarks of each test face.
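For one test face, the error computation thus amounts to the following sketch (assuming mm-scaled 3D landmark coordinates):

```python
import numpy as np

def localization_errors(pred, gt):
    """Per-landmark Euclidean distances (in mm) between detected and ground
    truth landmarks, and the mean error over all landmarks of one test face;
    both inputs have shape (L, 3)."""
    errs = np.linalg.norm(pred - gt, axis=1)
    return errs, errs.mean()
```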

Table 2 Overview of the datasets used for experimentation

4.1 Experimental datasets

We conduct experiments with three popular datasets of 3D face images: the FRGCv2 dataset, the Bosphorus 3D face dataset, and the UND dataset. We chose these datasets because they are among the most frequently used 3D face datasets and because they contain challenging 3D images with a high degree of variability in face orientations and are, therefore, well suited for assessing the robustness to such variations. The main characteristics of the datasets are summarized in Table 2.

Table 3 Mean localization errors (and standard deviations) for the GRID and SMUF methods on the Bosphorus dataset

The FRGCv2 dataset contains 4007 3D face images of 466 individuals. Images of the dataset were acquired with a laser-based Konica Minolta Vivid 910 scanner. Subjects exhibit minor pose variations and various facial expressions. We utilize the ground truth landmarks (8 landmarks per face) from [25], which were manually annotated on a subset of 975 images from 149 subjects.

The Bosphorus dataset consists of 4666 face samples from 105 subjects. Each sample includes a 2D color image, a 3D point cloud, and 24 manually annotated landmarks (in our experiments, we exclude ear dimple landmarks and use the remaining 22 landmarks). Next to expression and occlusion variations, images in the dataset also exhibit large variations in pose. Images from the dataset were captured using a structured-light-based Inspeck Mega Capturor II Digitizer.

The UND dataset contains 1680 semi-profile and profile 3D face images of 537 subjects. For our experiments, we use a subset of 236 images with yaw rotations of \(\pm {45}^{\circ }\) and 174 images with yaw rotations of \(\pm {60}^{\circ }\) along with the manual annotations (8 landmarks for frontal faces and 5 for non-frontal faces) also provided by [25]. Images from this dataset were captured by the same acquisition device as used with FRGCv2.

4.2 Performance evaluation on the Bosphorus dataset

In the first series of experiments, we evaluate the performance of GRID and SMUF on the Bosphorus dataset, which is particularly suitable for assessing robustness to large pose variations. We perform experiments in a twofold cross-validation setup, using half of the images for training and the other half for testing. To increase the size of the training data, we extend the training set by horizontally flipping each of the available training images. We form test sets with respect to the yaw rotation angle or the presence of expressions/occlusions. We implement both methods with \(K=7\) cascade stages and keep this setup for all following assessments.

The results of this series of experiments are summarized in Table 3. For both GRID and SMUF, we train three landmarking variants, each with a different number of DMs. The first column in Table 3, marked 1 DM, corresponds to the variant that uses only one DM trained on images with 22 annotated landmarks (these are generally images of near-frontal faces, since large rotations lead to self-occlusions and fewer annotations). The second column represents the GRID and SMUF variants with 3 DMs: one DM is computed in the same way as in the first column, while the second and third DMs are computed using images with head rotations of up to \({45}^{\circ }\) to the left and right, respectively. The variants in the third column correspond to the setup in Fig. 3 and contain two additional DMs corresponding to head rotations in the ranges \([{45}^{\circ }, {90}^{\circ }]\) and \([-{45}^{\circ },-{90}^{\circ }]\). The DMs for near-frontal images are trained using 22 landmarks per face image, while the DMs for rotated images are trained using 14 landmarks per face, as some of the landmarks in these images are typically self-occluded.

As expected, we can observe that the robustness to face rotations is significantly increased when more DMs are utilized. With the GRID and SMUF variants with 5 DMs, we achieve reliable landmark localization even on profile face images with yaw rotations of up to ± 90°. It can also be seen from the last two rows of Table 3 that the same localization errors are obtained for all three variants when evaluated on frontal face images with expression and occlusion variations. This indicates that expressions and occlusions do not affect the DM selection process, since in all cases the frontal DM is correctly chosen by the gating function.

When comparing the performance of SMUF and GRID, we can see that, in general, GRID ensures slightly better localization results than SMUF for all implemented variants. However, while there is an evident trend toward lower average localization errors for GRID, it is clear from Table 3 that the performance differences are not statistically significant. Thus, we can conclude that, on the Bosphorus dataset, both techniques perform more or less equally.

4.3 Evaluation on the FRGCv2 and UND datasets

In the second series of experiments, we evaluate GRID and SMUF on the joint FRGCv2 and UND datasets. Contrary to the Bosphorus dataset, where images contain solely the head region and, therefore, no face detector is required, images from the FRGCv2 and UND datasets may also contain parts of the upper body, so face detection is needed to initialize the landmark locations. In this series of experiments, we, hence, employ a simple face detector that relies on k-means clustering, similar to the one presented in [32]. With the number of clusters set to \(k=3\) and several heuristic conditions included, this detector divides a 3D image into three regions that most likely correspond to the background, body, and head/face. The face region candidate is then selected as the cluster with the lowest mean depth value (yellow region in Fig. 7). By doing so, a few other minor parts of the image may be selected besides the face region; these are later reliably discarded by retaining only the largest connected area (right image in Fig. 7).

Fig. 7

Illustration of the face detection procedure used on the FRGCv2 and UND datasets. The procedure uses a simple k-means clustering approach (with \(k=3\)) and selects the cluster with the lowest mean depth as the face region. The figure shows the input image (left), the color-coded clusters (middle), and the cropped and smoothed face image (right)
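A minimal sketch of this clustering-based detector is given below; it assumes a range image in which smaller depth values are closer to the sensor and invalid pixels are NaN, and it omits the additional heuristic conditions used in practice.

```python
import numpy as np
from scipy.ndimage import label
from sklearn.cluster import KMeans

def detect_face(depth, k=3):
    """Depth-based face detection in the spirit of [32]: k-means over depth
    values (k=3: background / body / head), keep the cluster with the lowest
    mean depth, then retain only the largest connected component."""
    valid = np.isfinite(depth)
    z = depth[valid].reshape(-1, 1)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(z)
    face_cluster = np.argmin([z[labels == c].mean() for c in range(k)])
    mask = np.zeros(depth.shape, dtype=bool)
    mask[valid] = labels == face_cluster
    comps, n = label(mask)                 # connected components of the candidate
    if n > 1:                              # discard spurious non-face regions
        sizes = np.bincount(comps.ravel())[1:]
        mask = comps == (np.argmax(sizes) + 1)
    return mask
```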

The face detector introduces additional variability into the facial regions, since the detected face may still include smaller parts of the upper body, neck, and hair. For this reason, we also report face mis-detection rates and selection rates for this experiment. The face mis-detection rate is defined as the percentage of images with a discrepancy between the location of the face detection box and the locations of the ground truth landmarks. The selection rate is defined as the percentage of images for which the correct DM has been selected by the gating function, where we consider a DM incorrect if it was trained on right profile face images while the corresponding test image is facing left, or vice versa. The localization errors are then computed exclusively on the images with correct face detections and DM selections. This type of reporting is adopted from [25], which we use as the baseline for comparison in this experiment.

The results of the experiments are presented in Table 4. For details on the dataset abbreviations in the first column, please refer to [25], since the experimental setup and the landmark annotations are adopted from there. In short, DB00F denotes an image subset with varying facial expressions, which is further partitioned into neutral (neut.), mild, and extreme (extr.) facial expressions. The remaining image subsets contain faces with \(45^{\circ }\) or \(60^{\circ }\) yaw rotations either to the right (R), the left (L), or both (RL). As the GRID and SMUF methods require non-frontal images to train some of the DMs, our experimental setup differs from [25] only in the construction of the training set, where we also employ images from the Bosphorus dataset.

As can be observed from the first two columns of Table 4, detection and selection rates are consistently above \(95\%\) for all subsets. The localization errors of our two landmarking approaches are compared to Perakis et al. [25] (last column), which, to the best of our knowledge, achieves the highest performance in the literature on these datasets. The results demonstrate the robustness of our methods to both expression variations and rotations. The mean localization error is under 6 mm on all tested subsets for both GRID and SMUF. Since the training data are taken from the Bosphorus dataset (acquired with a different 3D camera), the results also imply good generalization to data from different sensors. All experiments in this section were performed using 5 DMs, as we observed earlier in Sect. 4.2 that this setting is the most robust to rotation variations.

Table 4 Mean localization errors (and corresponding standard deviations) on the FRGCv2 and UND datasets
Table 5 Localization errors of GRID and SMUF in comparison to the state-of-the-art on non-frontal facial datasets for 10 common facial landmarks. GRID and SMUF significantly outperform competing methods on all experimental datasets. When comparing the learned binary features from SMUF to hand-crafted LBP features, we observe better performance for the learned binary features. GRID and SMUF again perform similarly for all landmarks

4.4 Comparison to the state-of-the-art

In the next series of experiments, we compare the performance of GRID and SMUF to that of state-of-the-art landmarking methods from the literature. Specifically, we select the method of Sukno et al. [36], the technique of Creusot et al. [5], and the landmarking approaches of Passalis et al. [24] and Perakis et al. [25] for our comparison. To the best of our knowledge, these landmarking methods are the only ones that have been evaluated on both frontal and rotated 3D facial images. Following the experimental protocols of other authors, we use the DB00F45RL subset when performing experiments on the FRGCv2+UND database and the entire database when experimenting on the Bosphorus dataset. Additionally, we also implement our gated landmarking approach with hand-crafted binary features, that is, with LBPs (uniform, neighborhood size of 8 and radius of 1), to demonstrate the benefit of learning binary features over using off-the-shelf binary feature extractors.

Fig. 8

Mean localization errors achieved by GRID and SMUF for individual landmarks of the FRGCv2+UND and Bosphorus datasets. The size of the circles corresponds to the localization errors. The numbering of the landmarks as shown here is also used in Figs. 9 and 10. The lowest errors are achieved on distinct landmarks with corner-like properties, e.g., the mouth corners

The results of the comparison are shown in Table 5. We observe that on the Bosphorus dataset both GRID and SMUF significantly outperform the competing methods from the literature and achieve not only lower average localization errors, but also significantly smaller standard deviations of these errors. The only exceptions are the nose tip and corners, where the method of Sukno et al. performs similarly or slightly better. We see similar results for the FRGCv2+UND dataset, where both GRID and SMUF achieve a considerable reduction in the localization errors for all considered landmarks compared to the state-of-the-art.

When comparing the learned binary features used in SMUF to the hand-crafted LBP features, we also see an obvious performance improvement for the learned features, reinforcing our assumption that learning binary features is beneficial for face alignment. The comparison between GRID and SMUF shows a similar picture as in the previous series of experiments, where GRID was found to perform slightly better than SMUF, but not significantly so.

Fig. 9

Localization errors in the form of box-and-whisker plots for the Bosphorus dataset achieved with the GRID and SMUF landmarking techniques. The results show that the lowest localization errors with both methods are achieved on distinct landmarks with corner-like characteristics, such as the eye or mouth corners or the nose tip. The figure is best viewed in color

Fig. 10

Localization errors in the form of box-and-whisker plots achieved on the FRGCv2+UND dataset with the GRID and SMUF landmarking techniques. Results are also presented for a cross-dataset experiment, where the landmarkers are trained on the Bosphorus dataset and evaluated on the FRGCv2+UND dataset. Lower errors are again achieved on distinct landmarks. The methods generalize well to novel datasets, with the median errors for the cross-dataset experiment being only slightly larger than for the within-dataset experiments for the majority of landmarks. The figure is best viewed in color

Fig. 11

Exemplar landmark detection results on the Bosphorus database: the first row depicts randomly chosen test samples (a–h), while the second row includes samples with high localization errors due to expressions (i, j), occlusions (k, l), head rotations (m, n), and incorrectly selected descent maps (o, p)

Fig. 12

Exemplar landmark detection results on the UND and FRGCv2 datasets: the first row depicts test images with typical localization performance, while images from the second row are selected among the worst samples measured by the localization error

4.5 Landmark analysis

In this section, we evaluate how the overall localization performance varies across the individual landmarks for both GRID and SMUF. Figure 8 illustrates the mean localization errors achieved for the individual landmarks; the size of the circles is proportional to the errors. It can be observed that the landmarks corresponding to the nose tip and the eye and mouth corners exhibit low localization errors. This is expected, as these landmarks correspond to well-pronounced facial parts with distinctive “corner-like” shapes. In contrast, landmarks relating to the nose saddle points, the chin tip, and the eyebrows correspond to indistinctive “edge-like” local shapes and, therefore, result in high localization errors. This observation is also supported by the box plots in Figs. 9 and 10, which show the localization errors of individual landmarks on the Bosphorus and FRGCv2+UND databases. The presented behavior is consistent for both evaluated methods. Note that the number of landmarks in Figs. 9 and 10 depends on the employed dataset and not on the chosen landmarking method.

To further analyze the landmarking performance of GRID and SMUF and their generalization ability, we also performed a cross-database experiment, where we built the test set from images of the FRGCv2+UND dataset, while the training set was generated using images from the Bosphorus dataset. The results of the cross-database experiment are illustrated by the green box plots in Fig. 10. Compared to the experiment where both the training and test sets come from the FRGCv2+UND dataset, we observe a slight increase in localization errors for most of the landmarks, except for the chin tip, where the difference between the mean errors is larger. We presume that the high mean error for the chin tip stems from the increased appearance variability caused by the face detection procedure needed for the FRGCv2+UND data (see Fig. 12g). Since such variability is not present in the training set from the Bosphorus dataset, the landmarking procedures cannot learn to accommodate the inaccuracies of the face detector. In terms of the comparison of GRID and SMUF, we see no significant difference in their performance in these experiments.

4.6 Qualitative evaluation

In this section, we qualitatively assess the performance of the proposed landmarking methods. Figures 11 and 12 show exemplar face images from the Bosphorus and UND datasets with the localized landmarks marked by red dots. The top rows of both figures contain samples with typical localization performance, where we can see that the methods perform stably in the presence of different types of variability, such as expressions, partial occlusions, and head rotations. However, there are some cases where landmarks are poorly localized. Such samples with large localization errors are shown in the second rows of Figs. 11 and 12; for example, large occlusions of facial areas (Fig. 11k, l) can increase the localization errors of the visible landmarks. Some of the localization errors originate from poor face detection and cropping, where an image may also contain non-head regions (Fig. 12e, g) or parts of the face area are cropped out (Fig. 12f). Mis-selected descent maps can also cause landmark localization errors (Fig. 11o, p, h).

4.7 Computational cost

In the last series of experiments, we evaluate the average time needed by the GRID and SMUF methods to localize landmarks on a single test image. We compute the average processing time over 100 randomly selected test images from the Bosphorus dataset. The size of the input images is \(250 \times 200\) pixels, and we compute the locations of all 22 landmarks during the benchmark. A PC with the following specifications is used for the assessment: Intel Xeon CPU at 2.67 GHz with 12 GB of RAM. Both landmarking techniques are implemented in Matlab, run entirely on the CPU, and could be further sped up if implemented in a compiled language such as C/C++. We start from detected and localized face regions and measure the time for feature extraction, DM selection, and location updates, which take less than \(3\times 10^{-2}\,\hbox {s}\) for SMUF (see Fig. 13) and a little less than \(8\times 10^{-2}\,\hbox {s}\) for GRID. Compared to hand-crafted features, the learned binary features can be extracted almost 3 times faster than HOG features and 15 times faster than LBP features.

Fig. 13

Average running time of the GRID and SMUF methods for localizing landmarks on one face image (computed over 100 randomly selected test images). For the benchmark, images from the Bosphorus dataset were used and 22 landmarks were predicted. The results show that SMUF is around \(3 \times\) faster than GRID, while incurring only slightly higher localization errors

5 Conclusion and future work

We have presented two approaches to facial landmark localization in 3D face images, GRID and SMUF, that are robust to rotations, facial expressions, and, partially, also to occlusions. We proposed a gating mechanism that allowed us to incorporate multiple pose-specific landmarking models (based on HOG features) into the alignment procedure and also developed a simultaneous descent map and binary feature learning algorithm around the proposed gating mechanism. To assess performance, we evaluated the developed landmarking techniques on three challenging datasets containing 3D face images with large head rotations. Our results showed that the proposed solutions exhibit high robustness to different types of appearance variations and display competitive performance compared to the state-of-the-art. Both proposed approaches need only a fraction of a second to compute the landmarks on a given face image and could run in real time in combination with a suitable 3D sensor. Both methods exhibit comparable performance, with a slight, although not statistically significant, advantage of GRID over SMUF. Therefore, when fast processing times are required, SMUF is the preferable choice.

As a part of our future work, we plan to combine the proposed landmarking methods with face frontalization (or pose correction) procedures and incorporate all developed methods into pose-invariant 3D face recognition systems.