1 Introduction

Face alignment fits a face model to an image and extracts the semantic meanings of facial pixels. Traditionally, face alignment means locating the feature points of a human face, such as the corners of the eyes, the corners of the mouth, and the tip of the nose. It is a fundamental preprocessing step for many computer vision tasks, e.g., face recognition [3], facial expression analysis [2], and facial animation [6, 7]. In view of its importance, face alignment has been widely studied since the Active Shape Model (ASM) of Cootes in the early 1990s [10].

Despite continuous improvements in alignment accuracy, face alignment remains a very challenging problem. Traditional 2D face alignment achieves satisfactory accuracy in small to medium poses, but this does not cover the varied conditions of real-world applications: non-frontal faces, low image resolution, variable illumination, occlusion, etc. 3D face alignment instead reconstructs the 3D face structure from a 2D image and estimates the positions of the 3D and 2D facial feature points after aligning the 3D face to the 2D image.

Motivated by the need for an efficient model, robustness to pose variation, and the lack of prior work handling large poses, this paper proposes a novel and efficient network structure and uses different loss functions to optimize the 3D parameters and the 3D vertices, with the goal of locating the 2D and 3D facial feature points under arbitrary poses. MobileNet [18] owes its efficiency to the Depthwise Separable Convolutions used in its structure. DenseNet [19], through its dense connections between convolutional layers, strengthens feature propagation, reuses features more effectively, and reduces the number of parameters to a certain extent. Inspired by these two designs, our network structure combines the efficiency of Depthwise Separable Convolution with the feature reuse of dense connections, achieving a balance between high efficiency and high precision. Finally, extensive experiments are conducted on a large subset of the AFLW dataset [23] with a wide range of poses and on the AFLW2000-3D dataset [35], in comparison with a number of methods. An overview of our method is shown in Fig. 1.

In summary, our contributions are as follows:

  1. We propose a novel and efficient network structure (MobDenseNet). To the best of our knowledge, this is the first time that Depthwise Separable Convolution and dense connections have been combined in a single network, leading to a new DNN structure.

  2. Different loss functions are used to optimize the parameters of the 3D Morphable Model and the 3D vertices, yielding a face alignment method that can estimate 2D/3D landmarks under arbitrary poses.

  3. We experimentally verify that our algorithm significantly improves the performance of 3D face alignment compared to previous algorithms. The proposed face alignment method can deal with arbitrary poses and is more efficient.

Fig. 1. Overview of our method: the efficient fully convolutional neural network (MobDenseNet), whose details are given in Fig. 2. The 3D parameters and 3D vertices are constrained using different loss functions.

2 Related Work

In this section, we will review the prior work in generic face alignment and 3D face alignment.

2.1 Generic Face Alignment

Face alignment has a long history of achievements, including the classic AAM [9, 26] and ASM [8] models. These methods treat face alignment as an optimization problem: finding the shape and appearance parameters that make the appearance model best fit the input face. The basic idea of the Constrained Local Model (CLM) [1, 11, 27] among discriminative methods is to learn a set of local appearance models, one for each landmark, and to combine the decisions of the local models with a global shape model. Cascaded regression gradually refines an initial prediction through a series of regressors; each regressor relies on the output of the previous one to perform simple image operations, and the entire system can be learned automatically from training samples [12]. Explicit Shape Regression (ESR) [7], proposed by Cao et al., comprises three components: two-level boosted regression, shape-indexed features, and a correlation-based feature selection method.

Besides traditional models, deep convolutional neural networks have recently been used for facial landmark localization. Sun et al. [28] were the first to use a CNN to regress landmark locations from the raw face image, accurately positioning 5 facial keypoints from coarse to fine. The work of [16], drawing on human body pose estimation, introduces boundary information into the landmark regression. While most recent facial landmark detectors follow the "coarse to fine" paradigm, Feng et al. [14] took a different approach using cascaded convolutional neural networks; they also compared the loss functions commonly used in facial landmark detection and, on this basis, proposed the wing loss.

Fig. 2. Details of MobDenseNet. k3n64s1 denotes the kernel size (k), number of feature maps (n), and stride (s) of conv1.

2.2 3D Face Alignment

Although traditional methods have achieved a great deal in face alignment, they are affected by non-frontal faces, illumination, and occlusion in real-life applications. The most common remedy is the multi-view framework [29], which uses different landmark configurations for different views. For example, TSPM [34] and CDM [33] use DPM-like [15] methods to align faces with different shape models and finally select the most probable model as the result. However, since every view must be tested, the computational cost of the multi-view approach is always high.

In addition to multi-view solutions, 3D face alignment is a more general approach. 3D face alignment [16, 20] aims to fit a 3D morphable model (3DMM) [3] to a 2D image. The 3D Morphable Model is a typical statistical 3D face model that encodes prior knowledge of 3D faces obtained through statistical analysis. Zhu et al. [35] proposed a localization method based on the 3D face shape, which solves the problem of feature points being invisible under extreme poses (such as profile faces). Liu et al. [21] used a cascade of 6 convolutional neural networks together with 3D face modeling to locate facial feature points in large poses. The work in [13] designed a UV position map to represent the complete 3D shape of a face in 2D UV space.

Our approach is also based on convolutional neural networks, but we redesign the network structure to make it efficient and robust. At the same time, we use different loss functions for the 3D parameters and the 3D vertices to constrain their respective semantic information.

3 Proposed Method

In this section we introduce the proposed robust 3D face alignment (R3FA) method, which fits a 3D morphable model with an efficient fully convolutional neural network.

3.1 3D Morphable Model

The 3D Morphable Model is one of the most successful methods for describing the 3D face space. Blanz et al. [3] proposed a 3D morphable model (3DMM) of the 3D face space based on PCA. It is expressed as follows:

$$S=\overline{S}+A_{id}\alpha _{id}+A_{exp}\alpha _{exp} \qquad (1)$$

where S is a specific 3D face, \(\overline{S}\) is the mean face, \(A_{id}\) is the principal axes trained on 3D face scans with neutral expression and \(\alpha _{id}\) is the shape parameter, \(A_{exp}\) is the principal axes trained on the offsets between expression scans and neutral scans and \(\alpha _{exp}\) is the expression parameter. The coefficients \(\{\alpha _{id},\alpha _{exp}\}\) thus define a unique 3D face. In this work \(A_{id}\) comes from the BFM model [24] and \(A_{exp}\) comes from the FaceWarehouse model [5].

In the process of 3DMM fitting, we use the Weak Perspective Projection to project 3DMM onto the 2D face plane. This process can be expressed as follows:

$$S_{2d}=f*Pr*R*\{S+t_{3d}\} \qquad (2)$$

where \(S_{2d}\) is the 2D coordinate matrix of the 3D face after weak perspective projection, rotation, and translation; f is the scaling factor; Pr is the orthographic projection matrix \(\left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \end{array}\right) \); R is the rotation matrix constructed from the three rotation angles pitch, yaw, and roll; and \(t_{3d}\) is the translation vector of the 3D points. Therefore, to model a specific face we only need to solve for the 3D parameters \(P=[f,pitch,yaw,roll,t_{3d},\alpha _{id},\alpha _{exp}]\).
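To make Eqs. (1) and (2) concrete, the following minimal NumPy sketch reconstructs a face from the parameter vector P and projects it onto the image plane. The Euler-angle convention for R and the shapes of the basis matrices are illustrative assumptions; in our setting \(A_{id}\) and \(A_{exp}\) come from the BFM and FaceWarehouse models as described above.

```python
# A minimal sketch of Eqs. (1) and (2), assuming vertex-major [x, y, z, ...]
# storage for shapes and a Z-Y-X Euler-angle convention for R.
import numpy as np

def euler_to_rotation(pitch, yaw, roll):
    """Build the rotation matrix R from pitch, yaw, roll (radians)."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    Ry = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    Rz = np.array([[np.cos(roll), -np.sin(roll), 0],
                   [np.sin(roll),  np.cos(roll), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

def project_3dmm(mean_shape, A_id, A_exp, f, pitch, yaw, roll, t3d,
                 alpha_id, alpha_exp):
    # Eq. (1): S = mean + A_id * alpha_id + A_exp * alpha_exp
    S = mean_shape + A_id @ alpha_id + A_exp @ alpha_exp   # (3N,)
    S = S.reshape(-1, 3).T                                 # 3 x N vertices
    # Eq. (2): S_2d = f * Pr * R * (S + t_3d)
    Pr = np.array([[1, 0, 0],
                   [0, 1, 0]], dtype=np.float64)           # orthographic Pr
    R = euler_to_rotation(pitch, yaw, roll)
    return f * Pr @ (R @ (S + t3d.reshape(3, 1)))          # 2 x N coords
```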

3.2 MobDenseNet Structure

MobileNet [18] is efficient because of the Depthwise Separable Convolutions used in its structure. Building on MobileNetV1 [18], MobileNetV2 [25] incorporates the recently popular residual idea, in which shortcut connections are realized by element-wise addition. However, [19] observed that in ResNet [17], the first network built on residual connections, many layers contribute little and can be randomly discarded during training, which suggests that residual connections are prone to redundant information. To address this, DenseNet [19] proposes that for any layer of the network, the feature maps of all preceding layers serve as its input, and its own feature maps serve as input to all subsequent layers. However, DenseNet has many parameters and its structure is not efficient. Combining MobileNet's efficiency with DenseNet's feature enhancement, we build a new network structure, MobDenseNet, by introducing DenseNet's dense connections into the overall framework of MobileNet. Our network structure thus offers both MobileNet's high efficiency and enhanced feature representation.

The architecture of MobDenseNet is illustrated in Fig. 2. MobDenseNet is a fully convolutional neural network without a fully connected layer. Conv1 is a convolution layer with kernel size (k) 3, stride (s) 2, and number of feature maps (n) 32, which extracts rough features. Layer1 to Layer7 are 7 dense blocks that extract deep features. Figure 3 shows the details of one of the dense blocks, Layer3. A set of \(1\times 1\), \(3 \times 3\), \(1\times 1\) convolution layers forms the basic unit of MobDenseNet, called a MobileBlock; as shown in Fig. 3, this unit is consistent with MobileNetV2. Layer3 contains three MobileBlocks, where each MobileBlock's output is concatenated with its input to form the input of the next MobileBlock. In this way, MobDenseNet retains the simplicity and efficiency of MobileNet. To match the number of channels across the dense connections, we add a transition layer (a \(1\times 1\) convolution layer) after each MobileBlock, whose purpose is to adjust the number of channels of the previous MobileBlock's output feature map; a sketch of one dense layer follows. We use both real face images and generated face images to train our MobDenseNet (details can be found in the suppl. material).
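The following PyTorch sketch illustrates one dense layer of this design. It assumes the MobileBlock follows the MobileNetV2 inverted-residual pattern (1×1 expand, 3×3 depthwise, 1×1 project) and that a 1×1 transition convolution follows each MobileBlock; the channel counts, expansion factor, and growth rate are illustrative, not the exact configuration of Fig. 3.

```python
# A minimal sketch of a MobileBlock plus dense connections; hyperparameters
# are placeholders, not the exact values used in MobDenseNet.
import torch
import torch.nn as nn

class MobileBlock(nn.Module):
    """1x1 expand / 3x3 depthwise / 1x1 project unit, as in MobileNetV2."""
    def __init__(self, in_ch, out_ch, expansion=6, stride=1):
        super().__init__()
        hid = in_ch * expansion
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hid, 1, bias=False),              # 1x1 expand
            nn.BatchNorm2d(hid), nn.ReLU6(inplace=True),
            nn.Conv2d(hid, hid, 3, stride, 1, groups=hid,      # 3x3 depthwise
                      bias=False),
            nn.BatchNorm2d(hid), nn.ReLU6(inplace=True),
            nn.Conv2d(hid, out_ch, 1, bias=False),             # 1x1 project
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

class DenseLayer(nn.Module):
    """Three MobileBlocks; each block's output is concatenated with its
    input, and a 1x1 transition conv keeps the channel counts matched."""
    def __init__(self, in_ch, growth=32, n_blocks=3):
        super().__init__()
        self.blocks, self.transitions = nn.ModuleList(), nn.ModuleList()
        ch = in_ch
        for _ in range(n_blocks):
            self.blocks.append(MobileBlock(ch, growth))
            ch += growth                                        # widened by concat
            self.transitions.append(nn.Conv2d(ch, ch, 1, bias=False))

    def forward(self, x):
        features = x
        for block, trans in zip(self.blocks, self.transitions):
            out = block(features)
            features = trans(torch.cat([features, out], dim=1))  # dense connection
        return features
```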

Fig. 3. The details of one of the dense blocks, Layer3. A set of \(1\times 1\), \(3 \times 3\), \(1\times 1\) convolution layers in MobDenseNet forms a basic unit called a MobileBlock. The transition layer matches the number of channels between the input and output feature maps.

3.3 Loss Function

We choose two different loss functions to jointly train MobDenseNet: one for the 3D parameters and one for the 3D vertices. For the 3D parameters, we follow the Weighted Parameter Distance Cost (WPDC) of Zhu et al. [35] to measure the difference between the ground-truth and predicted 3D parameters. The basic idea is to explicitly model the importance of each parameter:

$$L_{wpdc}=(P_{gt}-\overline{P})^TW(P_{gt}-\overline{P}) \qquad (3)$$

where \(\overline{P}\) is the estimate and \(P_{gt}\) is the ground truth. The diagonal matrix W contains the weights: for each element of the parameter vector P, its weight is the inverse of the standard deviation obtained from the data used in 3DMM training. Since our ultimate goal is to accurately obtain the 68 facial landmarks, for the 3D face vertices reconstructed from the 3D parameters we use the Wing loss [14], which is defined as:

$$L_{wing}(\varDelta {V(P)})=\begin{cases} \omega \ln (1+|\varDelta {V(P)}|/\epsilon ) & \text{if } |\varDelta {V(P)}| < \omega \\ |\varDelta {V(P)}|-C & \text{otherwise} \end{cases} \qquad (4)$$

where \(\varDelta {V(P)}=V(P_{gt})-V(\overline{P})\); \(V(P_{gt})\) and \(V(\overline{P})\) are the ground-truth 3D facial vertices and the 3D facial vertices reconstructed from the 3D parameters predicted by the network, respectively. \(\omega \) and \(\epsilon \) are parameters, and \(C = \omega - \omega \ln (1+\omega /\epsilon )\) is a constant that smoothly links the piecewise linear and nonlinear parts.

Overall, the framework is optimized by the following loss function:

$$L_{loss}=\lambda _{1}L_{wpdc}+\lambda _{2}L_{wing} \qquad (5)$$

where \(\lambda _{1}\) and \(\lambda _{2}\) are parameters that balance the contributions of \(L_{wpdc}\) and \(L_{wing}\). The selection of these parameters is discussed in the next section.
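For illustration, a minimal PyTorch sketch of Eqs. (3)-(5) follows. The diagonal weight vector w, the vertex reconstruction V(·), and the tensor shapes are assumptions for this sketch; in our method W is derived from the 3DMM training data and V from the model of Sect. 3.1.

```python
# A sketch of the joint loss; w holds the diagonal of W, p_* are batches of
# 3D parameter vectors, and v_* are batches of 3D vertex coordinates.
import torch

def wpdc_loss(p_pred, p_gt, w):
    """Eq. (3): weighted parameter distance cost."""
    diff = p_gt - p_pred
    return (diff * w * diff).sum(dim=1).mean()

def wing_loss(v_pred, v_gt, omega=10.0, epsilon=2.0):
    """Eq. (4): wing loss on the 3D vertex residuals."""
    delta = (v_gt - v_pred).abs()
    c = omega - omega * torch.log(torch.tensor(1.0 + omega / epsilon))
    losses = torch.where(delta < omega,
                         omega * torch.log(1.0 + delta / epsilon),
                         delta - c)
    return losses.mean()

def total_loss(p_pred, p_gt, v_pred, v_gt, w, lam1=0.5, lam2=1.0):
    """Eq. (5): weighted sum, with the lambdas reported in Sect. 4.1."""
    return lam1 * wpdc_loss(p_pred, p_gt, w) + lam2 * wing_loss(v_pred, v_gt)
```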

4 Experiments

In this section, we evaluate the performance of R3FA on three face alignment tasks: face alignment in small and medium poses, face alignment in large poses, and face reconstruction in extreme poses (\(\pm 90^\circ \) yaw angles).

4.1 Implementation Details

We use the PyTorch deep learning framework to train the MobDenseNet models. The loss weights of R3FA are empirically set to \(\lambda _{1}= 0.5\) and \(\lambda _{2}= 1\). In our experiments, we set the Wing loss parameters to \(\omega = 10\) and \(\epsilon = 2\). The Adam solver [22] is employed with a mini-batch size of 128 and an initial learning rate of 0.01. Our training set contains 680,000 face images, including 430,000 real face images and 250,000 synthetic face images. The real face images come from the 300W-LP [35] dataset, and various data augmentation methods are adopted to expand it. We train for a total of 40 epochs, reducing the learning rate to 0.002, 0.0004, and 0.00008 after 15, 25, and 30 epochs, respectively.
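As a sketch of this setup, the schedule above corresponds to a standard PyTorch training loop; `MobDenseNet`, `train_loader`, `reconstruct_vertices` (the V(·) of Eq. (4)), and the weight vector `w` are hypothetical names standing in for our actual implementation, and `total_loss` refers to the sketch in Sect. 3.3. Note that MultiStepLR with gamma = 0.2 at epochs 15, 25, and 30 reproduces exactly the reported schedule 0.01 → 0.002 → 0.0004 → 0.00008.

```python
# A minimal training-loop sketch under the assumptions named above.
import torch

model = MobDenseNet()                               # assumed constructor
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15, 25, 30], gamma=0.2)

for epoch in range(40):
    for images, p_gt, v_gt in train_loader:         # mini-batch size 128
        optimizer.zero_grad()
        p_pred = model(images)                      # predicted 3D parameters
        v_pred = reconstruct_vertices(p_pred)       # 3D vertices via Eq. (1)
        loss = total_loss(p_pred, p_gt, v_pred, v_gt, w)
        loss.backward()
        optimizer.step()
    scheduler.step()                                # step the LR schedule
```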

4.2 Evaluation Databases

We evaluate the performance of R3FA on two publicly available face datasets, AFLW [23] and AFLW2000-3D [35]. These two datasets contain small and medium poses, large poses, and extreme poses (\(\pm 90^\circ \) yaw angles). We divide AFLW and AFLW2000-3D into three intervals, \([0^\circ ,30^\circ ]\), \([30^\circ ,60^\circ ]\), and \([60^\circ ,90^\circ ]\), according to the absolute yaw angle of the face, with each interval containing about 1/3 of the images.

AFLW. The AFLW face database is a large-scale, multi-pose, multi-view face database in which each face is annotated with 21 feature points. It is highly varied, covering a wide range of poses, expressions, illumination conditions, and ethnicities. The database consists of approximately 25,000 hand-labeled face images, of which 59% are of women and 41% of men. Most of the images are color; only a few are grayscale. We use only a subset of the extreme-pose face images from AFLW for qualitative analysis.

AFLW2000-3D. AFLW2000-3D was constructed by [35] to evaluate 3D face alignment on challenging unconstrained images. It contains the first 2,000 images of AFLW and extends their annotations with fitted 3DMM parameters and 68 3D landmarks. We use this database to evaluate the performance of our method on face alignment tasks.

4.3 Evaluation Metric

Given the ground-truth 2D landmarks \(U_i\), their visibility \(v_i\), and the estimated landmarks \(\hat{U_i}\) of \(N_t\) testing images, we use the Normalized Mean Error (NME), i.e., the average normalized estimation error over the visible landmarks:

$$NME=\frac{1}{N_t} \sum _{i=1}^{N_t}\left( \frac{1}{d_i|v_i|_1}\sum _{j=1}^{N}v_i(j)\,||\hat{U_i}(:,j)-U_i(:,j)||\right) \qquad (6)$$

where \(d_i\) is the square root of the face bounding box size, as used by [37]. Note that \(d_i\) is normally the distance between the two eye centers in most prior face alignment work, which deals with near-frontal face images.
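A minimal NumPy sketch of Eq. (6) follows; the input conventions (2×N landmark matrices per image, a binary visibility vector, and the bounding-box-based \(d_i\)) mirror the definitions above.

```python
# Computing the NME of Eq. (6) over a list of test images.
import numpy as np

def nme(U_pred, U_gt, vis, d):
    """U_pred/U_gt: lists of 2 x N landmark matrices; vis: binary (N,)
    visibility vectors; d: per-image normalization factors d_i."""
    errors = []
    for u_hat, u, v, di in zip(U_pred, U_gt, vis, d):
        err = np.linalg.norm(u_hat - u, axis=0)          # per-landmark error
        errors.append((v * err).sum() / (di * v.sum()))  # visible only
    return np.mean(errors)
```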

4.4 Comparison Experiments

Comparison on AFLW. From the AFLW dataset, 21,080 images are selected as test samples, each with 21 landmarks. During testing, we divide the test set into 3 subsets according to absolute yaw angle: \([0^\circ ,30^\circ ]\), \([30^\circ ,60^\circ ]\), and \([60^\circ ,90^\circ ]\), with 11,596, 5,457, and 4,027 samples respectively. Since few experiments have been conducted on AFLW, we choose baseline methods with released code, including CDM [33], RCPR [4], ESR [7], SDM [32], 3DDFA [35], and nonlinear 3DMM [30]. Table 1 reports the comparison results, i.e., the NME(%) of face alignment on AFLW with the first and second best results highlighted; the results of publicly provided alignment models are marked with their references. Figure 4 shows the corresponding CED curves; our CED curve is compared only against the best method in Table 1. The results show that R3FA significantly improves face alignment accuracy across the full range of poses, and its minimal standard deviation across yaw intervals also proves its robustness to pose changes.

Table 1. The NME(%) of face alignment results on AFLW and AFLW2000-3D.
Table 2. The NME(%) of face alignment results on AFLW and AFLW2000-3D with the different network structures.
Fig. 4.
figure 4

Comparisons of cumulative errors distribution (CED) curves on AFLW.

Fig. 5.
figure 5

Comparisons of cumulative errors distribution (CED) curves on AFLW2000-3D.

Comparison on AFLW2000-3D. From the AFLW2000-3D dataset, 2,000 images are selected as test samples, each with 68 landmarks. Since both visible and invisible landmarks are evaluated, 3D face alignment evaluation reduces to a full-landmark evaluation. We divide the test set into 3 subsets according to absolute yaw angle: \([0^\circ ,30^\circ ]\), \([30^\circ ,60^\circ ]\), and \([60^\circ ,90^\circ ]\), with 1,312, 383, and 305 samples respectively. Table 1 reports the comparison results, i.e., the NME(%) of face alignment on AFLW2000-3D with the first and second best results highlighted; the results of publicly provided alignment models are marked with their references. Figure 5 shows the corresponding CED curves; our CED curve is compared only against the best method in Table 1. Table 1 and Fig. 5 demonstrate that our algorithm also significantly improves prediction in invisible regions, showing good robustness for face alignment in arbitrary poses.

Comparison on Different Network Structures. We compare a variety of network structures in our experiments: ResNeXt50 [31], MobileNetV2 [25], DenseNet121 [19], and our proposed MobDenseNet. To the best of our knowledge, this is the first time these three popular and efficient network structures have been applied to 3D face alignment. Table 2 reports the comparison results, i.e., the NME(%) of face alignment on AFLW and AFLW2000-3D for the different network structures, along with the time each sample takes to extract parameters through the network model and the parameter size of each model. The parameter extraction time (ms/pic) is measured on a GTX 1080Ti with 64 GB RAM. These network structures fall into two categories: efficient structures, represented by MobileNetV2, and high-precision structures, such as ResNeXt50 and DenseNet121. To balance efficiency and precision we designed MobDenseNet, and the experimental results confirm the motivation and expectations of our design: our network structure achieves a balance between high efficiency and high precision. Further comparison and analysis against MobileNetV2 and DenseNet can be found in the suppl. material. The 2D/3D alignment results of our method are shown in Fig. 6.

Fig. 6. The 2D/3D face alignment results of our method: 2D face alignment (second row), 3D face alignment (third row), and the 3D face mesh aligned to the 2D image (fourth row).

5 Conclusions

In this paper, we propose a novel and efficient framework (R3FA) that solves 2D/3D face alignment in full pose. To balance the computational efficiency and alignment accuracy of the model, we propose a new deep network, MobDenseNet. We jointly optimize the 3D reconstruction parameters and 3D vertices with two loss functions, and we train our network with both real and synthetic images. Compared to existing algorithms, we achieve the best accuracy on both the AFLW and AFLW2000-3D datasets. Comparison experiments with several popular networks show that our algorithm achieves a good balance between accuracy and efficiency. In the future, we will further improve the accuracy of 2D/3D face alignment while making the algorithm even more efficient.