
Fig. 1.

Two facial expressions (a, b) from our database brought into dense correspondence using the proposed framework. High geometric and photometric details are accurately morphed between both expressions via a dense corresponding mesh.

1 Introduction

The fully automated matching of sparse or dense facial landmarks in unconstrained 2D or 3D measurement data, e.g. the semantic annotation of facial images captured in the wild, is of great interest in various fields, ranging from entertainment to affective computing. When dealing with conventional cameras, the loss of information due to the perspective projection requires sophisticated techniques for the robust estimation of pose or facial landmarks. Even more demanding is the ill-posed inverse problem of estimating 3D shape from 2D images. Knowledge about plausible variations in facial shape and appearance, as well as their correlation, is learned from training samples and used to constrain results to desired solutions, especially in unconstrained environments. Similarly, for the semantic annotation and tracking of facial features from 3D data, statistical shape and appearance models (SSAM) of faces improve the reliability and robustness of automated approaches, as has been shown recently [1].

Facial morphology varies between individuals due to factors like sex, age, or ethnicity, while significant intra-individual changes are caused by facial expressions. Although 3D databases covering a wide variety of both inter- and intra-individual factors are publicly available (e.g. [2, 3]), the training samples used to construct statistical face models are restricted to face scans in neutral position (see [4, 5]). Only a few models include expressions, for instance the work published by Brunton, Bolkart, and Wuhrer in [6] or by Cao et al. [7]. Unfortunately, these models do not include appearance and solely capture 3D shape variation, which limits their use in computer vision applications.

One reason for the scarcity of statistical models of facial shape and appearance is the challenging problem of dense correspondence estimation for faces. Many generic shape matching methods as well as approaches specifically tuned to estimate dense correspondence for faces have been proposed, but they lack either accuracy, robustness, or automation. Approaches satisfying all three characteristics are needed to establish the next generation of 3D face models, to handle the improved geometric and photometric resolution of new scanning devices and growing 3D databases, and to serve applications requiring highly accurate semantic annotation of faces in raw measurement data.

With applications for the fully automated processing of large-scale databases in mind, we propose a new framework for dense 3D face matching (see Fig. 2). To ensure robustness of the automated processing, we extract reliable prior knowledge on facial shape and appearance from the input data using 2D facial landmark detectors and non-rigid fitting of 3D face models. Highly accurate dense correspondence, even for fine facial structures (see Fig. 1), is obtained by combining this prior knowledge with a variational approach for the matching of geometric and photometric facial features. We evaluate and compare the performance of our method in localizing facial landmarks on 2500 scans of the publicly available BU3DFE dataset [2] as well as on 400 high-resolution 3D face models acquired using our own prototypical stereophotogrammetric setup. The accurate dense correspondence established by our method not only improves various applications such as the retargeting of facial shape and texture or the detection of facial expressions; by providing the basis for the fully automated computation of individual blendshape rigs as well as large-scale statistical face models, our framework also opens up new directions for computer vision tasks, particularly in the emerging field of consumer devices equipped with 3D sensors.

Fig. 2.

The proposed framework includes two stages for initialization and dense matching of accurate correspondence. The matching allows semantic annotations and a reference mesh to be transferred to the input data.

2 Related Work

Semantic face annotation has been a subject of active research in different communities during the last twenty years. The general problem can be stated as the definition of inter-individually corresponding facial landmarks, ranging from a few landmarks at distinct anatomical structures to an arbitrary number of points covering the entire face, and their identification in raw measurement data. However, the measurement device and its sensor characteristics affect how accurately significant features can be located and distinguished from other landmarks as well as from surrounding facial and non-facial parts.

In the case of 2D images taken with conventional cameras, a great variety of algorithms exists for the detection of sparse facial landmarks [8]. Usually, locally significant, intensity-based features around landmark points are extracted from facial images contained in a database and used to train landmark predictors. Current methods are able to reliably locate the silhouette of a face as well as a number of sparse landmarks from the frontal view, even in the presence of a wide range of inter- and intra-individual variation and in unconstrained situations [9–12]. Similarly, for the detection of sparse landmarks on 3D measurement data, knowledge about characteristic geometric properties is gathered from an annotated database. For instance, in [13–15] local quantities like geodesic length, surface area, or curvature measures are employed to learn the relevant features of distinct facial landmarks for later prediction.

More challenging is the problem of establishing dense correspondence across the entire face, where landmarks cannot be clearly defined by local photographic or geometric features. Instead, correspondence estimation in regions like the cheeks or the forehead is usually constrained by means of mathematical objectives. In the case of 2D warping techniques, the topological subdivision of the facial region into geometric primitives allows the definition of dense correspondence. For example, in [16, 17] triangular patches covering the facial region are established and affinely warped to match new landmark positions. These approaches yield continuous correspondence mappings for the entire face varying inter- and intra-individually, but suffer from the strong assumption of affine warps.

Recent methods that directly operate in the 3D domain take advantage of the ability to measure distortion of the surface or the embedding space when deforming one shape into another. In the computer graphics community, general non-rigid shape matching approaches have been developed that are based, for instance, on the as-isometric-as-possible assumption or on measuring the deformation energy (see [18–22] for a comprehensive survey). A common strategy often adopted in computer vision tasks is to use spatial warping techniques, like non-linear variants of the well-known Iterative Closest Point algorithm (ICP) [23], that deform a template surface into the target, e.g. by locally constraining coherent deformation of the surface (see Coherent Point Drift [24] for a general technique, and [4, 5] particularly for faces).

When matching objects of a specific type, like human faces, methods significantly benefit from additional prior knowledge that is incorporated into the matching process. For instance, in [7, 14] a 3D statistical shape model (SSM) of the face is fitted to the target. Non-linear ICP is then used to warp the template to the target in order to project dense correspondence. Similarly, in [15] Gilani, Shafait, and Mian combined feature detectors based on geodesic curves with the fitting of a deformable model to assign densely corresponding points to unseen faces. The advantages of these methods are their reliability and robustness, which make them ideally suited for the automated processing of large databases. However, the accuracy of the established correspondence is limited, mainly for two reasons: (1) the constraints derived from the prior knowledge are not flexible enough to match individual features, and (2) most approaches only use the facial geometry for matching.

Alternatively, the problem of face matching can be cast as an image registration task. Using continuous parametrizations of the target and template surfaces, their features can be mapped into a common plane (see Fig. 3). In [25], annotated surface patches were matched to a template by mapping both to the unit circle. Via the common parametrization, dense correspondence was established to build a statistical shape model of anatomical structures that was successfully applied in medical image processing [26, 27]. Additionally, methods like Optical Flow can be used to improve dense correspondence between the flattened photographic textures. As appearance varies heavily between individuals, the methods of [28, 29] applied a smoothing filter to the estimated flow field to obtain valid correspondences. By exploiting the temporal dependency between successive scans recorded with professional 3D video setups, recent work has shown that highly accurate dense correspondence can be established over entire facial performances of an actor [30–32]. Kaiser et al. [33] as well as Savran and Sankur [34] proposed variational registration methods employing robust similarity measures on photographic and geometric features. They showed that image registration methods can be used to successfully establish accurate dense correspondence between individuals. However, variational approaches are typically prone to converging to undesired local minima or require additional user interaction, which prevents them from being used in the fully automated processing of large-scale face databases.

Fig. 3.

Matching of a facial surface S to the reference R: Parametrizations \({\varPhi }_S\) and \({\varPhi }_R\) are computed and photometric as well as geometric features are mapped to the plane. The dense correspondence mapping \({\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}\) accurately registers photographic and geometric features from S and R.

3 Challenges and Overview

The estimation of dense correspondences on human faces is particularly challenging, mainly due to the following reasons: (1) global facial morphology significantly varies between individuals, (2) facial expressions cause large intra-individual changes in shape and appearance, and (3) large regions like the cheeks or the forehead provide little information on correspondence between individuals.

To build a fully automated method for accurate dense correspondence estimation on human faces, we propose a novel pipeline that addresses these challenges by combining the reliability of methods using prior knowledge with the accuracy of variational matching based on image registration (see Fig. 2). Our key contribution can be divided into two subsequent processing stages:

  1.

    Given a photographically textured raw 3D surface S, we estimate reliable initial correspondence using 2D facial landmark detectors and non-rigid fitting of a 3D SSAM to S. The initial correspondence estimates are used to compute a parametrization for the surface \({\varPhi }_S: S \subset \mathbb {R}^3 \mapsto \mathbb {R}^2\) that maps features into the plane as reliable initial values for variational correspondence matching.

  2.

    We employ an image registration approach to estimate a mapping \({\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}: \mathbb {R}^2 \mapsto \mathbb {R}^2\) that accurately optimizes dense correspondence by matching individual photographic and geometric features from \({\varPhi }_S\) to a reference template \({\varPhi }_R\), which we call the Unified Facial Parameter Domain (ufpd) (see Fig. 3).

By using prior information to compute \({\varPhi }_S\), our framework accounts for challenges (1) and (2). The combination of photometric and geometric features with reasonable constraints that penalize non-isometric deformations during image registration further helps to define correspondence in the weakly featured regions of challenge (3) and to accurately match intra- and inter-individual features that have been roughly aligned in the first stage.

A central concept of our approach is the definition of the planar \(\textsc {ufpd} \subset [0,1]^2\) (see Subsect. 4.1). We propose the ufpd as the reference template domain \({\varPhi }_R\) aggregating all relevant information for robust and reliable optimization of the dense correspondence mapping \({\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}\). Similar to the template provided by face SSMs, we learn significant geometric and photometric features in the ufpd from our high-resolution face database; these serve as the reference during the variational matching stage. Using the inverse of \({\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}\), we are able to transfer a dense corresponding mesh to the surface S via \({\varPhi }_S^{-1} \circ {\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}^{-1} \circ {\varPhi }_R\).
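
To make the composition concrete, the following minimal Python sketch transfers reference mesh vertices onto S. All callables are hypothetical placeholders for the mappings defined above, assumed to operate on point arrays:

```python
import numpy as np

def transfer_reference_mesh(ref_vertices_3d, phi_R, psi_inv, phi_S_inverse):
    """Transfer reference mesh vertices onto the target surface S.

    ref_vertices_3d : (n, 3) vertices of the reference mesh
    phi_R           : maps reference-surface points into the ufpd
    psi_inv         : inverse dense correspondence mapping, ufpd -> image of Phi_S
    phi_S_inverse   : lifts parameter-plane points back onto the surface S
    """
    uv_ref = phi_R(ref_vertices_3d)   # flatten reference vertices into the ufpd
    uv_S = psi_inv(uv_ref)            # pull them through the inverse matching
    return phi_S_inverse(uv_S)        # 3D positions on S in dense correspondence
```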

Fig. 4.

Left: Our prototypical stereophotogrammetric setup. Right: Facial landmarks detected in a frontal view using implementations of [35] and [11]. The set of landmarks used for SSAM fitting and parametrization is marked in green. (Color figure online)

4 Method

This section describes the main parts of our matching framework. The data is acquired with our prototypical stereophotogrammetric setup using eight DSLR cameras (six Nikon D800E, two Nikon D810, 36 MP each) in four stereo-pairs, and two flashes (Elinchrom 1000) arranged in a semicircular arc around a common focal point. We employed the method of Beeler et al.  [36] for stereo-matching and Poisson surface reconstruction [37] to obtain detailed facial surfaces. High-resolution photographic textures are seamlessly composed by Poisson image editing [38]. Before describing the processing stages for new facial surfaces in detail, we define the ufpd as follows.
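
As an illustration of the reconstruction step, the following hedged sketch uses Open3D's implementation of Poisson surface reconstruction [37]; the file names and parameter choices are assumptions, not our exact pipeline:

```python
import open3d as o3d

# Merged point cloud from stereo matching (placeholder file name).
pcd = o3d.io.read_point_cloud("stereo_points.ply")
pcd.estimate_normals()  # Poisson reconstruction requires oriented normals
# depth controls the octree resolution and thus the level of surface detail.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=10)
o3d.io.write_triangle_mesh("face_surface.ply", mesh)
```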

4.1 The Unified Facial Parameter Domain

A key component of our framework is the Unified Facial Parameter Domain (ufpd), which serves as a flattened facial template during variational matching, similar to [28]. As the \(\textsc {ufpd} \subset [0,1]^2\) represents inter- and intra-individually varying faces, we propose to learn significant photometric and geometric facial features from a representative database.

Initially, a reference parametrization \({\varPhi }_R\) of the average face of the Basel Face Model (BFM, see [4]) was computed employing the QuadCover method presented in [39], as it minimizes isometric distortion. Using \({\varPhi }_R\), the set of sparse facial landmarks provided by the 2D landmark detectors described in Subsect. 4.2 is marked on the average face and mapped to the ufpd accordingly. As the BFM only provides low-resolution vertex colors, we computed an average photographic texture (16 MP resolution) from the high-resolution face database acquired with our own setup by projecting the initial parametrization of the fitted BFM to the facial surfaces (see Subsect. 4.2). The rendered textures were then median-averaged to retain sharp edges and salient features around anatomical structures like the eyes. We use mean curvature as the geometric feature of each surface during surface matching. To avoid the unwanted influence of non-corresponding high-frequency features like facial hair or small wrinkles, all surfaces were filtered using Laplacian surface smoothing. Analogously to the generation of the photographic texture, the averaged mean curvature images were mapped to the ufpd (see Fig. 5).
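
The per-pixel median used for texture averaging can be sketched as follows (a minimal numpy example; array names and shapes are assumptions):

```python
import numpy as np

def median_average_textures(textures):
    """textures: (k, h, w, 3) stack of photographic textures rendered into the ufpd."""
    stack = np.asarray(textures, dtype=np.float32)
    # The per-pixel median suppresses outliers (e.g. individual blemishes) while
    # retaining sharp edges around structures like the eyes, unlike the mean.
    return np.median(stack, axis=0)
```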

Fig. 5.

The unified facial parameter domain: average photographic texture with the set of sparse landmarks (a), the same texture overlaid with its corresponding weight map (b), and the mean curvature with values mapped to a normalized gray scale ranging from black (min) to white (max) (c).

To account for how informative the photometric and geometric features are in different facial regions, we defined weight maps in the ufpd for use during dense correspondence optimization. The photometric texture is particularly informative in the regions around the eyes, eyebrows, and mouth, because these clearly separate skin from other anatomical structures. Similarly, the color of the nostrils is highly valuable for matching due to its strong contrast to the skin tone. The geometric features are matched on the entire facial surface except the outer hairline, because this region varies heavily between individuals and would disrupt the matching procedure.

Together, the set of sparse landmarks, the textures, and the weight maps define the ufpd as shown in Fig. 5. Note that the ufpd is defined once in advance, independently of the processing of new scans. In principle, the parametrization of the ufpd is extensible and can simply be adapted to different scenarios; additional features or weight maps used for dense matching can easily be integrated.

4.2 Initial Estimation of Facial Landmarks

The detection of sparse facial landmarks is done on the frontal view of a face using two state-of-the-art algorithms (see Fig. 4). The method of Kazemi and Sullivan [11] employs cascades of weak learners and proved more accurate in adapting to individual morphological features and facial expressions. STASM, provided by Milborrow and Nicolls [35], fits a statistical shape and appearance model to the image data and appeared to be more robust. The two methods detect 68 and 77 facial landmarks in frontal faces, respectively. For further processing, we combined a set of well-defined landmarks faithfully predicted by both approaches (Exo- and Endocanthion, Pronasale, and Cheilion).
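
The Kazemi–Sullivan detector [11] is available in dlib; the following sketch illustrates the detection step. The model file path and the landmark indices for the combined set are assumptions based on the common 68-point annotation scheme:

```python
import dlib

detector = dlib.get_frontal_face_detector()
# Pretrained 68-point model (path is a placeholder).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Hypothetical indices in the 68-point scheme roughly matching the combined set:
# eye corners (exo-/endocanthion), nose tip (pronasale), mouth corners (cheilion).
COMBINED = [36, 39, 42, 45, 30, 48, 54]

img = dlib.load_rgb_image("frontal_view.jpg")
rects = detector(img)
if rects:
    shape = predictor(img, rects[0])
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in COMBINED]
```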

Fig. 6.

For model fitting to (a), the SSAM is first rigidly aligned using 3D landmarks (b) and fitted to the data (c). The resulting SSAM instance as shown in (d) roughly matches the facial features (mouth, nose or eyes).

Dense correspondence is further estimated using an SSAM fitted to the raw 3D data similar to [7] (see Fig. 6). Since we desire a combination of shape and color information for better alignment of significant structures like the eyes or mouth, we implemented a fitting routine using the BFM. The sparse facial landmarks are used to estimate the initial parameters of the similarity transform aligning the SSAM with the raw data by performing a single ICP step [23]. Starting with the average face of the BFM, new shape and intensity parameters \(P=(P_{S}\in \mathbb {R}^m,P_I\in \mathbb {R}^n)\) are obtained as the maximum a posteriori estimates employing a centered isotropic Gaussian prior with hyper-parameter \(\sigma =(\sigma _S,\sigma _I)\) according to

$$\begin{aligned} p( P | C) \propto p( C | P )~p(P|0, \sigma ), \end{aligned}$$
(1)

where \(C\subset \mathbb {N}\times \mathbb {N}\times \mathbb {R}\) is our set of robust correspondences between the SSAM and the point cloud, each with an additional weight assigned. For color estimation, correspondence is established by nearest-neighbour lookup, while for shape estimation, points are also matched by similar colors. Data likelihoods p(C|P) are defined as isotropic Gaussians according to the Euclidean distance by

$$\begin{aligned} p(C|P) = \prod \limits _{(\imath ,\jmath , \beta ) \in C}\mathcal {N}(x_{\imath }|m_{\jmath },\beta ), \end{aligned}$$
(2)

where \(x_{\imath }\) are positions or colors of the point cloud and \(m_{\jmath }\) those of model vertices. Varying point density in both the BFM and the target introduces a bias into the data likelihood p(C|P) (e.g. the high vertex density around the cheeks in the BFM). We therefore set the correspondence weight \(\beta \) to the inverse of the summed frequencies with which each point occurs in the set of correspondences. Correspondences are determined in each optimization step, and new parameters for shape and intensity are estimated by solving the system of linear equations with Tikhonov regularization (see [40]) equivalent to Eq. 1.
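
For a linear model \(x \approx \text {mean} + U P\), the Tikhonov-regularized update equivalent to Eq. 1 reduces to a weighted least-squares solve. A minimal numpy sketch (all array names are placeholders):

```python
import numpy as np

def map_update(U, mean, x, beta, sigma):
    """U: (3k, m) model basis; mean: (3k,) model mean; x: (3k,) stacked target
    values; beta: (k,) per-correspondence weights/variances; sigma: prior std."""
    W = np.repeat(1.0 / beta, 3)  # per-coordinate precision from the weights
    # Normal equations with Tikhonov term from the centered Gaussian prior.
    lhs = U.T @ (U * W[:, None]) + (1.0 / sigma**2) * np.eye(U.shape[1])
    rhs = U.T @ (W * (x - mean))
    return np.linalg.solve(lhs, rhs)  # MAP estimate of the parameters P
```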

4.3 Computation of the Initial Parametrization \({\varPhi }_S\)

Because variational methods are prone to converging to local minima, we propose a method to estimate the initial parametrization \({\varPhi }_S\) such that it roughly aligns with the features of the ufpd. We constrain the computation of \({\varPhi }_S\) using the sparse facial landmarks and the reference parametrization \({\varPhi }_R\) projected from the fitted SSAM.

The set of sparse facial landmarks on S is used to match the corresponding positions \(x_{\imath }\) as defined in the ufpd. Nearest vertices on the surface mesh \(S=(V,E)\) are determined and a set of labeled correspondences \(L=\{(v_\imath ,x_\imath )~|~v_\imath \in V,~x_\imath \in \textsc {ufpd} \}\) is assembled. Using the projected reference parametrization \({\varPhi }_R\), the facial region on S is segmented and non-facial parts that map outside the ufpd are discarded. We fix the boundary of the facial region as defined by \({\varPhi }_R\) to its corresponding position in the ufpd. Similarly, the inner vertices are soft-constrained to their positions as defined by \({\varPhi }_R\). Two separate sets are determined by \(K^{\partial }=\{(v_\imath , y_\imath )~|~ v_\imath \in V^\partial ,~y_\imath \in \textsc {ufpd} \}\) for the boundary vertices \(V^\partial \) of S and \(K^{\circ }\) for the inner vertices \(V^{\circ }=V\backslash V^\partial \).

Fig. 7.

Initial parametrizations \({\varPhi }_S\) for a sad expression computed without any soft-constraints (a), with constraints for the inner vertices \(V^{\circ }\) (b), and additionally combined with facial landmarks L (c). Note that the latter already aligns features to the ufpd nicely (d).

To compute \({\varPhi }_S\) while accounting for the soft constraints defined by the landmarks L and the inner vertices \(K^{\circ }\) as well as the fixed boundary \(K^{\partial }\) of the facial surface, we adopt the approach of convex-combination maps [41]. Here, the mapped position of each vertex is expressed as a weighted sum of its 1-ring neighbors:

$$\begin{aligned} u(v_\imath ) = \sum \limits _{\jmath \in N_1(v_\imath )} \lambda _{\imath \jmath } u(v_\jmath ). \end{aligned}$$
(3)

To keep geometric distortion minimal, \(\lambda _{\imath \jmath }\) is calculated as the mean value weight defined in [42], while the boundary vertices are constrained according to \(K^{\partial }\). By rewriting Eq. 3 as a linear least squares problem in the mapped coordinates of the inner vertices \(u(V^{\circ })\), the soft constraints

$$\begin{aligned} \frac{\alpha }{|L|}\sum \limits _{(v_\imath , x_\imath ) \in L}\Vert u(v_\imath ) - x_\imath \Vert ^2 + \frac{\beta }{|K^\circ |}\sum \limits _{(v_\imath ,y_\imath ) \in K^\circ }\Vert u(v_\imath ) - y_\imath \Vert ^2 \end{aligned}$$
(4)

can conveniently be added, where \(\alpha , \beta \) are weighting factors accounting for the influence of the soft constraints. The solution of the equivalent sparse system of linear equations in \(V^{\circ }\) gives the desired mapping \({\varPhi }_S\) (see Fig. 7).
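
A sketch of the resulting sparse solve with scipy (the mean value weights, the boundary terms, and the index bookkeeping are assumed precomputed; names are placeholders):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def solve_parametrization(Lap, b_bnd, L_idx, L_uv, K_idx, K_uv, alpha, beta):
    """Lap: (n, n) sparse system of Eq. 3 over the n inner vertices,
    b_bnd: (n, 2) fixed-boundary contribution moved to the right-hand side,
    (L_idx, L_uv) / (K_idx, K_uv): soft constraints of Eq. 4 in inner-vertex indexing."""
    n = Lap.shape[0]
    wL = np.sqrt(alpha / max(len(L_idx), 1))  # square roots, since lsqr
    wK = np.sqrt(beta / max(len(K_idx), 1))   # minimizes the squared residual
    I = sp.eye(n, format="csr")
    A = sp.vstack([Lap, wL * I[L_idx], wK * I[K_idx]]).tocsr()
    uv = np.empty((n, 2))
    for d in range(2):  # the u and v coordinates decouple
        rhs = np.concatenate([b_bnd[:, d], wL * L_uv[:, d], wK * K_uv[:, d]])
        uv[:, d] = lsqr(A, rhs)[0]
    return uv
```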

4.4 Variational Matching for Accurate Dense Correspondence

Dense correspondence of S is improved using a variational approach for surface matching inspired by the work presented in [18, 33, 34]. The initial surface parametrization \({\varPhi }_S\) allows us to map arbitrary features of S into the plane. We can then use off-the-shelf image registration frameworks to combine robust similarity measures with reasonable regularization terms into a common objective for optimizing the correspondence mapping \({\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}\).

To measure the similarity between photographic and geometric features, we use two data terms defined according to the weight maps in the ufpd (see Fig. 5b and c). Several image metrics have been investigated (e.g. sum of squared differences, flavors of mutual information, gradient metrics), and we found an advanced version of Normalized Cross Correlation (NCC, see [43]) to be well suited for our purpose. As a correlation measure, the advantage of NCC is its robustness to changes, e.g. in lighting or exposure, as well as to individual facial characteristics like skin tone that vary significantly with respect to the ufpd.
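
A small numpy example of why a correlation measure suits this setting: plain NCC (a simpler variant of the advanced metric in [43]) is invariant to affine intensity changes, which absorbs differences in exposure and skin tone:

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    # Normalize both patches to zero mean and unit variance, then correlate.
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return float((a * b).mean())

patch = np.random.rand(32, 32)
print(ncc(patch, 0.6 * patch + 0.2))  # ~1.0: gain and offset changes are ignored
```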

To regularize the correspondence mapping \({\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}\) in regions with less significant facial features, we use a regularization term similar to [34]. This term, called orthonormality criterion \(P_{oc}\) as defined by Eq. (7) in [44], employs the Green-Lagrange strain to measure isometric distortion. Additionally, local foldings of \({\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}\) are avoided by the bending energy \(P_{be}\) as defined by Klein and Staring [43].

We discretized the correspondence mapping \({\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}\in F\), where F is the space of cubic B-spline transformations with \(128^2\) basis functions located on a uniform regular grid covering the ufpd. The objective used to solve the image registration therefore becomes:

$$\begin{aligned} \begin{aligned} O(u)=&\int _{\textsc {ufpd}}{w_{be}P_{be}(u,x) + w_{oc}P_{oc}(u,x) dx} \\&\quad + \int _{\textsc {ufpd}}{m_P(x) NCC(P_S(u(x)),P_R(x))dx} \\&\quad + \int _{\textsc {ufpd}}{m_G(x) NCC(G_S(u(x)),G_R(x))dx}, \end{aligned} \end{aligned}$$
(5)

where \(m_P(x), m_G(x)\) are weight maps of the photometric (\(P_S, P_R\)) and the geometric (\(G_S, G_R\)) features from S and the ufpd. The optimization of Eq. 5 was implemented using elastix, a framework for rigid and non-rigid image registration [45]. A multi-scale approach is used during optimization for both the discretization of the correspondence mapping and the image resolution. We employ a quasi-Newton L-BFGS optimizer with line search for faster convergence.
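
As a minimal illustration of the discretized objective (not the elastix internals), the following sketch evaluates an analogue of Eq. 5 on pixel grids; warp() and penalties() stand in for the B-spline resampling and the \(P_{be}\)/\(P_{oc}\) terms of [43, 44], and the NCC terms are negated so that the whole objective is minimized:

```python
import numpy as np

def weighted_ncc(a, b, m, eps=1e-8):
    w = m / (m.sum() + eps)                       # normalized weight map
    ma, mb = (w * a).sum(), (w * b).sum()
    va = np.sqrt((w * (a - ma) ** 2).sum())
    vb = np.sqrt((w * (b - mb) ** 2).sum())
    return (w * (a - ma) * (b - mb)).sum() / (va * vb + eps)

def objective(u, P_S, P_R, G_S, G_R, m_P, m_G, w_be, w_oc, warp, penalties):
    P_be, P_oc = penalties(u)                     # regularization terms of Eq. 5
    data = -weighted_ncc(warp(P_S, u), P_R, m_P)  # photometric data term
    data -= weighted_ncc(warp(G_S, u), G_R, m_G)  # geometric data term
    return w_be * P_be + w_oc * P_oc + data
```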

5 Experiments and Results

We built a database consisting of 400 facial surfaces acquired with our stereophotogrammetric setup as described at the beginning of Sect. 4. We tested the proposed framework on our database because it contains highly detailed reconstructions, including high-resolution photographic textures, comparable to 3D models acquired with state-of-the-art stereophotogrammetric devices. We refer the reader to the supplementary material provided with this paper for a collection of representative surfaces from our database.

Fig. 8.

Close-ups of the photometric features (eyes) and the geometric features (nose area) before (upper row) and after (lower row) dense correspondence optimization. The ufpd and the individual features are overlaid using a chessboard pattern alternating between S and R. Note the accurate correspondence between characteristic morphological structures.

We also ran extensive experiments on all 2500 cases of the BU3DFE database [2]. This database contains 3D models of 100 persons varying in sex, age, and ethnicity. The faces were captured in neutral position as well as showing the 6 basic emotions of the Facial Action Coding System [46] at 4 levels of intensity. An initial surface reconstruction using [37] was done to close holes and remove meshing artifacts frequently contained in the raw data (e.g. below the chin). All data was processed in a fully automatic fashion.

During initial correspondence estimation in the first stage of our framework, sparse facial landmarks were reliably located at the expected positions in the 2D images. In some cases, especially in the presence of extreme expressions or when the camera perspective significantly differs from the frontal view, facial landmark detection was less accurate. However, the combination of landmarks from both detectors is generally reliable and serves as valuable information in further processing. During fitting of the SSAM, the incorporation of color information improves the registration to structures like the mouth, eyes, and eyebrows, where geometric information is less significant. Unfortunately, the BFM is built from a dataset containing neutral expressions only, and it fails to adapt to strong variations in the shape of the mouth or the eyes. To avoid implausible results in the case of expressions, we gave high weight to the prior distribution in shape space and chose \(\sigma =(10,1)\) in Eq. 1.

For the same reason, the initial surface parametrization \({\varPhi }_S\) was computed with higher weight given to the set of landmarks L than to the inner vertices \(V^{\circ }\), using \(\alpha = 1, \beta = 0.001\). Theoretically, the constraints that we have added to Eq. 4 might lead to non-injective parametrizations (see [41] for a discussion). In practice, we did not find any cases where this occurred. The initial parametrizations obtained were roughly aligned with the ufpd and served as suitable starting values for the optimization of dense correspondence (see Fig. 7).

The dense correspondence mapping accurately registers photographic and geometric features with the ufpd (see Fig. 8). We fixed the weights \(w_{be}=150\) and \(w_{oc}=2\) to ensure bijectivity of \({\varPsi }_{{\varPhi }_S\rightarrow {\varPhi }_R}\) and to reasonably constrain matching in regions with less significant features. To quantitatively evaluate our approach, we measured the deviation of the landmarks distributed with the BU3DFE. Corresponding landmarks were defined in the ufpd and identified accordingly on the original surfaces after matching. Landmark-wise Euclidean distance was computed and averaged (see Table 1). Using the proposed framework, we were able to predict the landmarks with higher accuracy than previous approaches (except the Pronasale in [13], where about 200 cases were discarded). Moreover, as indicated by the standard deviations, the prediction uncertainty is significantly reduced.
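
The evaluation protocol amounts to a landmark-wise mean and standard deviation of Euclidean distances; a minimal sketch (array shapes are assumptions, units follow the database):

```python
import numpy as np

def landmark_errors(pred, gt):
    """pred, gt: (n_scans, n_landmarks, 3) predicted and ground-truth positions."""
    d = np.linalg.norm(pred - gt, axis=-1)  # per-scan, per-landmark distance
    return d.mean(axis=0), d.std(axis=0)    # per-landmark mean error and SD
```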

Fig. 9.

Results for several expressions from the BU3DFE. Note the accurate dense correspondence established over the entire surface. The red dots indicate the location of landmarks used for quantitative evaluation. (Color figure online)

Table 1. Localization error on the BU3DFE database (for landmarks see Fig. 9). The improvement with respect to the best result from previous work is reported in the last column.

In fact, using the surface mapping established with our approach, we are able to predict any number of landmarks or mesh vertices identified in the ufpd. Here, we used a low-resolution reference mesh of about 15k vertices, which is sufficient for the resolution available in the BU3DFE. We segmented facial regions in the ufpd and generated a color-coded texture overlaid with a chessboard pattern. The result for several facial expression scans of a single individual is shown in Fig. 9.

Finally, the reference mesh was transferred to all surfaces of the BU3DFE. To demonstrate the suitability of our approach for morphological analysis and the generation of statistical face models, we built two SSAMs using Principal Component Analysis (PCA) on the vertex positions and the photographic textures of the 3D face models. The first SSAM contains the geometric variation related to inter-individual morphology using the neutral scans only. The second model captures the intra-individual variations due to facial expressions: we computed displacement vector fields for the expressions with respect to the neutral scan of each subject and applied them to the average face of the first SSAM using vertex correspondence. Note the morphological variation captured by the shape parameters in Fig. 10. The chessboard pattern is accurately morphed when the shape varies.
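
Building the PCA shape model from the densely corresponding meshes can be sketched as follows (a minimal numpy example; the texture model is built analogously on pixel values):

```python
import numpy as np

def build_pca_model(shapes, n_components):
    """shapes: (n_subjects, n_vertices * 3) flattened corresponding vertex positions."""
    mean = shapes.mean(axis=0)
    U, s, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
    basis = Vt[:n_components].T                        # principal modes
    std = s[:n_components] / np.sqrt(len(shapes) - 1)  # per-mode standard deviation
    return mean, basis, std

# A synthetic face at +2 SD of mode k: mean + 2 * std[k] * basis[:, k]
```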

6 Limitations and Future Work

In some rare cases of extreme expressions, we found that matching in the mouth and forehead regions is disturbed by folds, e.g. when folds are matched to other features like the eyebrows. Special detectors could be used to remove these features from the textures. Similar strategies could be integrated to handle severe changes in surface area or topology caused by an open mouth or closed eyes. In the future, we will use the BU3DFE-SSAM instead of the BFM because it already contains several expressions and thus adapts better to individual morphology. A Riemannian variant of the BU3DFE-SSAM will be established, as non-linear shape spaces have been shown to be superior to PCA-based models, e.g. for learning relationships between shape and expressions.

Fig. 10.

The BU3DFE-SSM. Left: the face shape according to \(\pm 2SD\) of the first three shape parameters of the neutral model. Middle: The average neutral face. Right: The first three shape parameters (\(\pm 2SD\)) of the expression model.

The variational matching in the plane comes at the cost of geometric distortions introduced by the parametrization. As the proposed framework is independent of the actual ufpd, improved definitions will be investigated in further experiments. Similarly, we aim to learn the parameters used in our framework from an annotated ground-truth database to further improve the accuracy and robustness of automated data processing. The run time of the framework highly depends on the resolution of the input data. In our experiments, we measured times between 0.5 and 3 min on a standard workstation without optimizing our code. We believe that the computation time could be significantly reduced if certain routines were implemented more efficiently and parallelized.

7 Conclusions

We have presented a framework for the fully automated determination of highly accurate dense correspondence for facial surfaces. We showed that the proposed approach works well on a wide range of textured 3D face models varying inter- and intra-individually, and that it outperforms state-of-the-art methods, as confirmed by our experiments. To the best of our knowledge, no SSAM of faces based on a variety of facial expressions with dense correspondence has been released to the research community yet. We aim to publish the BU3DFE-SSAM and believe that this model, including geometric as well as photometric variation, will help researchers to understand the complex nature of facial morphology. The proposed framework will help to build the next generation of highly detailed 3D face models on a large scale and thus opens up new directions for applications in computer vision and computer graphics.