1 Introduction

We define head pose as the yaw, pitch and roll angles that determine the orientation of the head in the camera reference system [13]. Head pose estimation has attracted much research due to its relevance as a pre-processing step in many face analysis tasks, such as facial landmark alignment [2, 20] or facial expression recognition [4]. It is also used in video-surveillance [11] and is intrinsically linked to human-computer interaction, through social communication [12], gaze [18] and focus of attention [1] estimation.
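
For concreteness, one common convention (an illustrative assumption; the exact axis ordering depends on the annotation procedure) composes the head rotation matrix from these angles as

\[
R = R_z(\mathrm{roll})\,R_y(\mathrm{yaw})\,R_x(\mathrm{pitch}),
\qquad
R_x(\alpha)=\begin{pmatrix}1 & 0 & 0\\ 0 & \cos\alpha & -\sin\alpha\\ 0 & \sin\alpha & \cos\alpha\end{pmatrix},
\]

with \(R_y\) and \(R_z\) defined analogously for rotations about the vertical and optical axes of the camera.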

There are many approaches to image-based head pose estimation. Some of them use very low resolution images [11] or 3D range data [5]. In this paper we only consider methods that use 2D images of average or high resolution. Among these, manifold embedding and non-linear regression techniques are possibly the most popular. The former assume that separate continuous head pose sub-spaces exist according to appearance [14]. Non-linear regression methods learn a mapping from image features to pose angles; Random Forests [5, 19] and Convolutional Neural Networks (CNNs) [6, 10, 15] are among the most prevalent.

At present, the best performing approaches are based on CNNs. Yang et al. [20] use a small CNN with 3 convolutional layers, 3 pooling layers and 2 fully connected layers to regress the yaw, pitch and roll angles. Ranjan et al. [15] fuse intermediate feature layers at different resolutions and use a multi-task approach to detect faces and estimate facial landmarks, head pose and gender. The H-CNN architecture [10] uses an inception module [17] that pools and concatenates features from intermediate layers, and is jointly trained to estimate landmark visibility, facial landmarks and head pose. In Table 1 we show the performance of these approaches. Although they use the same databases, their results cannot be immediately compared. This will be further discussed in Sect. 2.

In this paper we review the problem of estimating head pose by regressing the yaw, pitch and roll head angles from medium/high resolution images acquired “in-the-wild”, i.e. in realistic unrestricted conditions. Our contributions are:

  • A brief survey of the best head pose estimation algorithms.

  • Definition of an evaluation methodology and publicly available benchmark to precisely compare the performance of head pose estimation algorithms.

  • The establishment of the state-of-the-art on this benchmark.

2 Benchmarking Head Pose

There are many public databases with labeled face data. However, very few of them provide head pose ground truth, because of the difficulty of accurately estimating these angles. Traditionally, pose estimation algorithms have been evaluated with databases acquired in laboratory conditions and with imprecise angular information [13]. Later, more realistic and accurate data-sets such as AFLW [8] emerged. They contain images of challenging real-world situations, acquired without any restrictions on position, illumination or image quality.

Here we propose the use of three databases:

  • AFLW [8]. It contains a collection of 25993 faces acquired in an uncontrolled scenario, with head poses ranging between ±120\(^{\circ }\) for yaw and ±90\(^{\circ }\) for pitch and roll. It provides a mean 3D face structure and manual annotations for 21 facial landmarks. We compute the pose angles from the labeled landmarks using the POSIT algorithm [3], assuming each face has the 3D structure of the mean face (see the sketch after this list). We have found several annotation errors and, consequently, removed the affected faces from our benchmark. From the remaining faces we randomly choose 21074, 2068 and 1000 instances for training, validation and testing respectively. These images will be available after publication.

  • AFW [21]. This small database has traditionally been used only for testing purposes. It has 250 images with 468 faces in quite challenging settings. It provides discrete yaw labels ranging from −90\(^{\circ }\) to 90\(^{\circ }\) in 15\(^{\circ }\) intervals, plus the facial bounding boxes. These labels were manually annotated, hence they are often not very accurate.

  • 300W. It includes 689 challenging faces obtained from the testing subsets of other databases (HELEN, LFPW and IBUG). This is the most popular face alignment benchmark. It provides face bounding boxes and 68 manually annotated landmarks, but no pose information. We again use the AFLW mean 3D face and the POSIT algorithm [3] to estimate the three pose angles for each face instance. This data-set will also be publicly available.
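
Although the benchmark labels are computed with the POSIT algorithm [3], the following minimal sketch illustrates the same idea of recovering pose angles from 2D landmarks and the mean 3D face, using OpenCV's solvePnP as a readily available stand-in; the function and array names, the pinhole camera approximation and the Euler decomposition convention are illustrative assumptions.

```python
# Hedged sketch: head pose from 2D landmarks and a mean 3D face.
# The paper uses POSIT [3]; cv2.solvePnP is used here as a stand-in.
# `mean_face_3d` (Nx3) and `landmarks_2d` (Nx2) are hypothetical inputs.
import cv2
import numpy as np

def head_pose_from_landmarks(mean_face_3d, landmarks_2d, img_w, img_h):
    # Simple pinhole approximation: focal length ~ image width,
    # principal point at the image centre, no lens distortion.
    camera_matrix = np.array([[img_w, 0.0, img_w / 2.0],
                              [0.0, img_w, img_h / 2.0],
                              [0.0, 0.0, 1.0]])
    ok, rvec, _ = cv2.solvePnP(mean_face_3d.astype(np.float64),
                               landmarks_2d.astype(np.float64),
                               camera_matrix, np.zeros(4),
                               flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> rotation matrix
    # One possible ZYX Euler decomposition; the angle naming must be
    # matched to the convention used for the annotations.
    yaw = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0])))
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return yaw, pitch, roll
```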

Table 1. Head pose estimation published results. For AFLW and 300W we show the Mean Absolute Error (MAE) in degrees. For AFW we show the classification success rate.

In Table 1 we show the published results of the best head pose estimation algorithms. The AFLW figures are not comparable among any of the cited works. Some select 1000 test images at random and use the rest for training [10, 15]. Valle et al. [19] choose 10% of the images for testing and the rest for training. Gao et al. [6] use 15561 randomly chosen face images for training and the remaining 7848 for testing. Moreover, none of these AFLW subsets are publicly available, hence it is impossible to make a fair comparison among any of these approaches.

Similarly, the results for AFW are not comparable. Some approaches test on the whole database [15, 19]. However, each was trained on a different subset of AFLW. Moreover, Kumar et al. [10] test on the 341 images whose height is larger than 150 pixels. Peng et al. [14] test on a different set of 459 faces.

Finally, the head pose labels for 300W are not available. Yang et al. [20] compute them from an average face composed of 49 3D points. Unfortunately, this information is not public.

In summary, to obtain comparable results all algorithms should use the same training, validation and test data-sets. For our benchmark we propose a single training and a single validation data-set, composed respectively of 21074 and 2068 face images randomly chosen from AFLW. For testing we have three data-sets: the AFLW test is performed on the remaining 1000 images, whereas the AFW and 300W tests use respectively all 468 and 689 faces of those data-sets.

Note that our labels may also have small errors, caused by the assumption that all faces share the same 3D structure.

3 Experiments

3.1 Methodology

Following the best published results [6, 10, 15, 20], we use a distributed face representation extracted from a deep CNN. Training such a model from scratch requires a large amount of data and computing power. The usual approach in computer vision is therefore to take a general architecture already trained on a related problem and fine-tune it for the task at hand (see Fig. 1).

Fig. 1. Transfer learning methodology to fine-tune ImageNet generic weights.

To build our baseline regressors we use the pre-trained AlexNet [9], GoogLeNet [17], VGG [16] and ResNet [7] architectures, top performers in the image classification task of the ILSVRC competition. AlexNet was also used by Ranjan et al. [15], GoogLeNet by Kumar et al. [10], and VGG-Net by Gao et al. [6]. In each architecture we replace the final 1000-unit Softmax classification layer with a Euclidean loss layer with three units modeling the yaw, pitch and roll angles.
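
A minimal sketch of this change follows, written in PyTorch/torchvision for illustration rather than in the Caffe prototxt files actually edited; ResNet-152 is chosen only as an example (the same replacement applies to the other architectures) and nn.MSELoss stands in for Caffe's Euclidean loss.

```python
# Minimal sketch (PyTorch, not the Caffe setup actually used): swap the
# 1000-way ImageNet classification head for a 3-unit regression head that
# outputs (yaw, pitch, roll).
import torch.nn as nn
import torchvision.models as models

model = models.resnet152(pretrained=True)       # ImageNet generic weights
model.fc = nn.Linear(model.fc.in_features, 3)   # yaw, pitch, roll

# Caffe's EuclideanLoss is the (halved, averaged) squared L2 distance;
# nn.MSELoss plays the analogous role here.
criterion = nn.MSELoss()
```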

For fine-tuning and evaluation we use the Caffe framework with a GeForce GTX 1080 (8 GB) graphics processor, following the same procedure for each model. We use the Nesterov Accelerated Gradient (NAG) method, initialize the learning rate to \(\alpha =10^{-5}\) and reduce it by a factor \(\gamma =0.1\) after “step size” iterations (see Table 2). Momentum is set to \(\mu =0.9\). Table 2 reports the remaining optimization parameters for each architecture. We optimize GPU memory occupation by setting the batch size and the number of iterations according to the network size, so larger networks use smaller batches and more iterations (see Table 2). The network weights used for testing are those at the last iteration. They will be publicly available after publication.
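
These settings translate to the following sketch, which continues the PyTorch illustration above (the experiments use Caffe's solver configuration instead); `train_loader`, `num_iterations` and `step_size` are placeholders for a hypothetical data loader and the per-architecture values of Table 2.

```python
# Hedged sketch of the optimization described above: NAG with lr=1e-5,
# momentum=0.9, and learning-rate decay by 0.1 every `step_size` iterations.
import torch

def fine_tune(model, criterion, train_loader, num_iterations, step_size):
    # `num_iterations` and `step_size` are the per-architecture values of
    # Table 2; `train_loader` is a hypothetical iterator over (images, angles).
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-5,
                                momentum=0.9, nesterov=True)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                step_size=step_size, gamma=0.1)
    for _ in range(num_iterations):
        images, angles = next(train_loader)
        optimizer.zero_grad()
        loss = criterion(model(images), angles)
        loss.backward()
        optimizer.step()
        scheduler.step()   # decays the learning rate every `step_size` calls
    return model
```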

Table 2. Training parameter values for each architecture.

Fine-tuning the parameters of the largest net, ResNet-152, takes 8 h, and it processes test images at an average rate of 4 FPS. In Fig. 2 we show a pair of learning curves for the VGG-19 and ResNet-152 architectures. The validation curves are more stable because the whole validation set is always processed, whereas the training performance has a larger variance since it depends on the current batch. Vertical dashed red lines mark the number of iterations required to complete an epoch.

Fig. 2. Sample learning curves for VGG-19 and ResNet-152 architectures.

In Table 3 we present the results of the baseline regressors for each network architecture. In general, these results confirm that the deeper the representation, the better the performance, a well-known fact in the deep learning literature [7].

In AFLW we use the Mean Absolute Error (MAE) of each angle as the evaluation metric. With this metric, the baseline model using AlexNet achieves better performance than that of Ranjan et al. [15]. Similarly, the GoogLeNet results improve on those of Kumar et al. [10]. For VGG-16, the results are only marginally better than those of Gao et al. [6], although our net was pre-trained on the more general ImageNet data-set.
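
For reference, the metric amounts to the following minimal sketch, where `pred` and `gt` are hypothetical (N, 3) arrays of predicted and ground-truth (yaw, pitch, roll) angles in degrees:

```python
import numpy as np

def mae_per_angle(pred, gt):
    # Mean Absolute Error for each of the three angles, in degrees.
    return np.mean(np.abs(pred - gt), axis=0)
```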

In AFW, since it provides discrete labels, we use the classification success rate as the metric. Here, although the results are again not strictly comparable, the models of Kumar et al. [10] and Ranjan et al. [15] improve on those achieved by our baseline regressors. This is surprising since in the more precise AFLW regression case the result is the opposite. Perhaps the discretization works against our models or, since AFW was manually labeled, the annotation error is higher, making the differences less significant.
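
The exact success criterion on AFW is not spelled out in the cited works; a common choice, assumed in this sketch, is to count a prediction as correct when it lies within 15\(^{\circ }\) of the annotated discrete yaw label:

```python
import numpy as np

def yaw_success_rate(pred_yaw, gt_yaw, tol=15.0):
    # Fraction of faces whose predicted yaw is within `tol` degrees of the
    # discrete ground-truth label (assumed criterion).
    return float(np.mean(np.abs(pred_yaw - gt_yaw) <= tol))
```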

Table 3. Head pose baseline estimation results.

Fig. 3. Representative results with yaw errors greater than 15\(^{\circ }\) for AFLW (top), AFW (middle) and 300W (bottom) databases. Below each image we display the yaw, pitch and roll angle values. Green and blue colors represent respectively estimated and ground truth angles. (Color figure online)

The MAEs of Yang et al. [20] in 300W, although not strictly comparable, are better than those of our baseline regressors. This may be because they train their CNN on the 300W training data-set and, perhaps, over-fit to it.

Finally, in Fig. 3 we present some representative face images with head pose estimation errors greater than 15\(^{\circ }\), obtained using the ResNet-152 architecture. As can be noticed, the estimation sometimes seems to be more accurate than the annotation. This may be caused by manual annotation errors.

4 Conclusions

We have surveyed the state-of-the-art in face pose estimation “in-the-wild”. Although the best performing approaches use the same databases, their results are not comparable because they train and test on different subsets.

In this paper we have defined an evaluation procedure and benchmark data-sets with images captured in unrestricted settings. We have also trained a set of CNN-based regressors that provide baseline results for our benchmark. The results in Table 3 represent the reproducible state-of-the-art for this problem.

The model based on the deepest network architecture, ResNet, provides the best overall performance, confirming that deeper representations have better generalization capabilities. Compared with the best published results in the literature, although not strictly comparable, the ResNet model achieves better performance on the challenging AFLW data-set.

By making the baseline regressors and the benchmark data-sets publicly available, we expect that future algorithms will be compared on fair grounds.