Introduction

The capacity to estimate the head pose of another person is a common human ability that presents a unique challenge for computer vision systems. People quickly and effortlessly interpret the orientation and movement of a human head, which allows them to infer the intentions of nearby people and to comprehend an important non-verbal form of communication.

In a computer vision context, head pose estimation (HPE) is the process of inferring the orientation of a human head from digital imagery. Like other facial vision tasks, an ideal head pose estimator must demonstrate invariance to a variety of image-changing factors, such as camera distortion, projective geometry, multi-source non-Lambertian lighting, as well as biological appearance, facial expression, and the presence of accessories like glasses and hats [1].

Head pose is an important cue in computer vision when facial information is used, and it has a wide variety of uses in human-computer interaction, which explains the steadily increasing attention it has received from the scientific community over the last three decades.

Although many techniques have been developed over the years to address this issue, head pose estimation remains an open research topic, particularly in unconstrained environments [2].

As in other application domains, HPE has greatly benefited in recent years from the exploitation of deep learning (DL) techniques and the extensive use of Deep Neural Networks. In this article, we review the topic from the distinctive perspective of deep learning, discussing and comparing the many different ways in which Deep Neural Networks have contributed to the development of the field.

Motivation

HPE systems play an important role in the development of intelligent environments, and several computer vision applications rely on a robust HPE system as a prerequisite: for example, gaze estimation [3], virtual/augmented reality [4], and human-computer interaction [5] strongly benefit from knowing the exact position of the head in 3D space. Some application examples are:

  • Human Social Behaviour Analysis: People use the orientation of their heads to convey rich, inter-personal information. For example, there is important meaning in the movement of the head as a form of gesturing in a conversation [6] to indicate when to switch roles and begin speaking or to indicate who is the intended target subject [7, 8]. People nod to indicate that they understand what is being said, and they use additional gestures to indicate dissent, confusion, consideration, and agreement [9].

    In addition to the information that is implied by deliberate head gestures, there is much that can be inferred by observing a person’s head. For instance, quick head movements may be a sign of surprise or alarm, and they can also trigger reflexive responses from other observers [10].

    Therefore, HPE can be used in smart rooms to monitor participants in a meeting and to record their activities; in particular, their attention can be indirectly related to their head pose [11]. Systems exploiting head pose estimation to analyse people’s behaviour and human interaction in meetings and workplaces have been proposed in [12,13,14].

    There are also studies on systems for automatic pain monitoring that show how including head pose can improve the performance for both person-specific and general classifiers [15].

  • Driving Safety & Assistance: HPE systems are particularly useful for assisting drivers by providing contextual alert signals, for example in the case of pedestrians outside the driver’s field of view [16].

    Moreover, the head pose can give clues about the intention of a pedestrian: for example, a pedestrian will typically wait for the driver of a stopped automobile to look at him before stepping into a crosswalk, a cue that is also very important in the case of autonomous vehicles.

    Applications that infer the driver’s pose are very important for safety, as they can provide insights about distraction, intention, sleepiness and awareness, or detect the driver’s blind spots [17]. For this reason, in recent years many datasets that address this specific scenario have been published [18,19,20].

  • Surveillance and Safety: Head pose estimation in surveillance video images is an important computer vision task because it tracks visual attention and provides insight into human behavioural intentions [21, 22]. Systems for directing an automated surveillance network have been proposed in [23, 24].

  • Targeted Advertisement: Methods to track the visual attention of wandering people have been proposed in the literature [25]. These systems count people looking at particular outdoor advertisements (targeted advertisement) and can determine what a person is looking at even when movement is unconstrained. Such systems can also be used for behaviour analysis and cognitive science in real-world indoor applications, such as TV viewer behaviour analysis [26].

  • Interface Design: By perceiving human attention when users look at an interface (e.g. a web page or a software UI), it is possible to evaluate the importance and significance of the displayed visual elements and further guide the design or rearrangement of these elements [27] (see Fig. 1).

Fig. 1

An example of application to driver assistance. Right: Green box indicates yaw < ± 45\(^\circ\) and potential awareness of vehicle. Left: Red box indicates possible inattention (image from [7])

Therefore, head pose estimation can be used to monitor human social activities, to observe the behaviour of specific targets, but also to enhance the function of some face-related tasks, including expression detection, gaze estimation (Fig. 2), full-body pose estimation and identity recognition.

Fig. 2

Example of a task strongly linked to head pose estimation: although the eyes are in the same position in both face images, the perception is that the two gazes are oriented differently. Gaze prediction comes from a combination of both eye and head pose direction [28]

The intrinsic interaction between head pose and other face parts is also confirmed in more recent research. Studies in [29,30,31,32] suggest that the mutual relationship between face parts can be exploited not only for HPE, but also for other visual tasks such as gender recognition, race classification, and age estimation making head pose estimation a useful and important task for many applications.

Contribution and Structure

The main contributions of the article are:

  • a complete and updated review of all the available databases for the head pose estimation task, with a detailed comparison of their main characteristics (number of subjects, DoF, acquisition scenario) and an analysis of which are the most used and useful in the literature;

  • a categorization and explanation of the different approaches used in the literature for head pose estimation, with a specific focus on modern deep learning approaches;

  • a report and discussion of modern head pose estimation methods and their comparative performance on common datasets, with an in-depth analysis of different evaluation pipelines and a clear tabular presentation of the data;

The remainder of the article is organized as follows: Section “Head Pose Estimation” contains an introduction to the basic concepts of the head pose estimation field; Section “Datasets” presents a detailed list of available datasets and their characteristics; Section “Head Pose Rotations Representations” explains the main techniques for representing rotations used in the HPE field; Section “Methods” describes prominent deep learning based approaches for head pose estimation; Section “Evaluation Metrics” reports the most common evaluation metrics; Section “Evaluation” delineates most used evaluation pipelines; Section “Discussion” presents a discussion of datasets, evaluation metrics/pipelines and possible research directions; Section “Conclusion” concludes the paper summarizing the contribution of the proposed work.

Note: All numerical results reported in the following tables are borrowed from the original publications.

Head Pose Estimation

In the computer vision context, head pose estimation is most commonly interpreted as the ability to infer the orientation of a person’s head relative to the view of a camera. More rigorously, head pose estimation is the ability to infer the orientation of a head relative to a global coordinate system, but this subtle difference requires knowledge of the intrinsic camera parameters to undo the perceptual bias from perspective distortion [1].

At the coarsest level, head pose estimation applies to algorithms that identify a head in one of a few discrete orientations, e.g. a frontal versus left/right profile view. At the fine (i.e., granular) level, a head pose estimate might be a continuous angular measurement across multiple Degrees of Freedom (DoF).

In particular, in the head pose estimation task it is common to predict relative orientation with Euler angles—pitch, yaw and roll. They define an object’s rotation in a 3D environment; if these three angles are predicted correctly, the direction in which the human head is facing can be determined (see Fig. 3).

Fig. 3

Euler angles in Head Pose Estimation (image source [33])

Although head pose estimation is an old and extensively investigated problem, achieving acceptable quality on it has become possible only thanks to recent advances in deep learning. Challenging conditions like extreme poses, bad lighting, occlusions and other faces in the frame make it difficult to detect and estimate head poses.

Nevertheless, SOTA methods for head pose estimation satisfy, on standard datasets, all the following criteria, first proposed by Erik Murphy-Chutorian in [1]:

  • Accurate: the system should provide a reasonable estimate of pose with a mean absolute error of 5\(^\circ\) or less.

  • Monocular: the system should be able to estimate head pose from a single camera. Although accuracy might be improved by stereo or multi-view imagery, this should not be a requirement for the system to operate.

  • Autonomous: there should be no expectation of manual initialization, detection, or localization, precluding the use of pure-tracking approaches that measure the relative head pose w.r.t. some initial configuration and shape/geometric approaches that assume facial feature locations are already known.

  • Multi-Person: the system should be able to estimate the pose of multiple people in one image.

  • Identity & Lighting Invariant: the system must work across all identities with the dynamic lighting found in many environments.

  • Resolution Independent: the system should apply to near-field and far-field images with both high and low resolution.

  • Full Range of Head Motion: the methods should be able to provide a smooth, continuous estimate of pitch, yaw and roll, even when the face is pointed away from the camera.

  • Real-Time: the system should be able to estimate a continuous range of head orientation with fast (30fps or faster) operation.

Datasets

Most HPE models are trained and evaluated using publicly available datasets. These datasets have evolved significantly over the last years, especially in terms of the complexity of environmental conditions.

Most datasets provide rotation information by means of Euler angles, which define the orientation of a rigid body with respect to a fixed coordinate system; three rotations are always sufficient to express any target orientation. These rotation angles can be extrinsic or intrinsic: the former express rotations with respect to the xyz axes of the original motionless coordinate system, while the latter express rotations with respect to the axes of a rotating XYZ coordinate system rigidly attached to the moving body.
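
To make the distinction concrete, the sketch below contrasts the two conventions using SciPy’s rotation utilities; the library choice and the example angles are ours and not prescribed by any dataset.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

angles = [30, 20, 10]  # example angles in degrees

# Lowercase axis letters -> extrinsic rotations (about the fixed xyz axes).
r_ext = R.from_euler('xyz', angles, degrees=True)

# Uppercase axis letters -> intrinsic rotations (about the moving XYZ axes).
r_int = R.from_euler('XYZ', angles, degrees=True)

# The two conventions generally describe different orientations...
print(np.allclose(r_ext.as_matrix(), r_int.as_matrix()))   # False

# ...but an intrinsic sequence equals the extrinsic sequence applied in the
# reverse order, which is why annotations must state their convention.
r_rev = R.from_euler('zyx', angles[::-1], degrees=True)
print(np.allclose(r_int.as_matrix(), r_rev.as_matrix()))   # True
```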

Since various formalisms exist to express a rotation in three dimensions beyond Euler angles, e.g. rotation matrices, unit quaternions, Rodrigues’ formula, among others, the datasets contain different forms of representation (many of these formalisms use more than the minimum number of three parameters). More details about some of the representations exploited by the models to solve the HPE task can be found in Section “Head Pose Rotations Representations”.

Head pose datasets can be categorized by different aspects, such as imaging characteristics, data diversity, acquisition scenario, annotation type, and annotation technique [18]. These aspects play an important role on whether and how the dataset identifies the challenges of the head pose estimation task.

  • Imaging characteristics: relate to the image resolution, number of cameras, bit depth, frame rate, modality (RGB, grayscale, depth, infrared), geometric setup and field of view.

  • Data diversity: incorporates aspects such as the number of subjects, the distribution of age, gender, ethnicity, facial expressions, occlusions (e.g. glasses, hands, facial hair) and head pose angles. Data diversity is essential for training and evaluating robust estimation models.

  • Acquisition scenario: covers the circumstances under which the acquisition of the head pose takes place. The most important distinction is between in-laboratory and in-the-wild acquisition. While the former constrains the data to a well-defined, static environment, the latter offers more variety by being acquired in unconstrained environments, such as outdoors, thus covering many challenging conditions like differing illumination and variable backgrounds. Head movement can be staged by following a predefined trajectory, or it can be naturalistic, captured while the subject performs a different task, such as driving a car.

  • Annotation type: describes what meta-information, such as head pose, comes alongside the image data and how it is represented. For example, head pose can be defined by a full 6 degrees of freedom (DoF) transformation from the camera coordinate system to the head coordinate system (covering 3 DoF for translation and 3 DoF in rotation) or only a subset of them can be provided. Annotation types can differ also in their granularity of sampling the DoF space: there are discrete annotation types that classify a finite set of head poses, and there are continuous annotation types that offer head pose annotations on a continuous scale for all the DoFs.

  • Annotation technique: there are different methods for obtaining the head pose annotation (label) accompanying each image. The annotation technique has a large impact on data quality (see Table 1, 2 and 3).

Table 1 Available datasets for Head Pose Estimation
Table 2 Legend for Table 1

Available Datasets

There are many available datasets in the literature:

  • 300W-LP [53]: The 300W-LP (Large Pose) is a synthetic extension of the 300W database [71], generated to augment the number of challenging samples with extreme poses. It includes \({122\,450}\) images with yaw angle in range ±89\(^\circ\).

  • AFLW [45]: Annotated Facial Landmarks in the Wild is a challenging dataset which was collected from the internet, in totally unconstrained conditions. It contains a collection of 25,993 faces with head poses ranging between ± 120\(^\circ\) for yaw and ± 90\(^\circ\) for pitch and roll. The pitch, yaw and roll angles were obtained automatically from the labelled landmarks using the POSIT algorithm [72], assuming the structure of a mean 3D face; for this reason, several annotation errors were found [73].

  • AFLW2000-3D [53]: This dataset contains the first 2000 identities of the in-the-wild AFLW [45] dataset, which have been re-annotated with 68 3D landmarks using a 3D model fit to each face. Consequently, this dataset contains accurate fine-grained pose annotations and is a prime candidate to be used as a test set in the head pose estimation task. Yaw varies within ± 120\(^\circ\), while roll and pitch within ± 90\(^\circ\).

  • AFW [47]: Annotated Faces in the Wild is a small database (a subset of AFLW [45]), which is normally used for testing purposes only. AFW has 250 images containing 468 faces in very challenging environments. The yaw angles vary between ± 90\(^\circ\) with a step size of 15\(^\circ\). The ground-truth is manually annotated, so it may contain errors.

  • AISL [54]: The AISL head orientation database is a collection of small-scale head images with various backgrounds of an indoor scene. This dataset contains 6480 images of 20 subjects under 36 yaw angles, 3 pitch angles and 3 different backgrounds. The orientation is described by two categories: yaw angle over 360\(^\circ\) with an interval of 10\(^\circ\), and pitch angle in the range ± 45\(^\circ\) with an interval of 45\(^\circ\).

  • AutoPOSE [19]: It is a large-scale dataset that provides 1.1 million images taken from a car’s dashboard view. AutoPOSE’s ground-truth head orientation was acquired with a sub-millimetre accurate motion capturing system placed in a car simulator. The rotations are limited to the range [– 90\(^\circ\), + 90\(^\circ\)]; the average pitch angle is shifted towards negative values, due to the placement of the camera in the dashboard.

  • BioVid Heat Pain [15]: It contains videos and physiological data of 90 persons subjected to well-defined pain stimuli of 4 intensities, built for the development of automatic pain monitoring systems. It includes information about the head pose of the recorded subjects for all three angles (pitch, yaw, roll), all in the range ± 50\(^\circ\).

  • BIWI Kinect [46]: It is gathered in a laboratory setting by recording RGB-D video of different subjects across different head poses, using a Kinect v2 device. It contains roughly 15,000 frames, and the rotations are \(\pm 75^\circ\) for yaw, \(\pm 60^\circ\) for pitch and \(\pm 50^\circ\) for roll. A 3D model was fit to each individual’s point cloud and the head rotations were tracked to produce the pose annotations. This dataset is commonly used as a benchmark for pose estimation with depth-based methods, which attests to the precision of its labels.

  • BJUT-3D [42]: The database consists of \({46\,500}\) images collected from the 3D faces of 250 male and 250 female participants. The total number of poses in the database is 93. The pitch rotation is quantized into 9 angles [– 40\(^\circ\), +40\(^\circ\)], where the difference between two consecutive poses is 10\(^\circ\). Similarly, the yaw rotation is divided into 13 angles [-60\(^\circ\), +60\(^\circ\)], with the same angular step size as for the pitch.

  • Bosphorus [40]: It contains 5 thousand high resolution face scans from 105 different subjects. The 3D scans are obtained by a commercial structured-light based 3D digitizer. It offers 13 discrete head pose annotations (seven yaw angles, four pitch angles, and two roll angles), with different facial expressions and occlusions.

  • BU [34]: The Boston University Head Tracking dataset includes only 200 images and 5 subjects, which is the main drawback of this database. The acquisition process is repeated in two sessions: initially, illumination conditions are uniform; then, subject faces are exposed to rather complex scenarios with changing illumination. All three rotation angles were recorded thanks to a magnetic tracker attached to each participant’s head. Pose variation is mostly below 30\(^\circ\). Since the presence of facial occlusions (e.g., eyeglasses, facial hair, etc.) is very limited, most methods perform very well on it.

  • CAS-PEAL [37]: CAS-PEAL is a large dataset of \({99\,594}\) images from \({1040}\) participants (595 male and 445 female subjects). It contains a total of 21 poses combining different yaw and pitch angles: the yaw orientation varies between – 45\(^\circ\) and + 45\(^\circ\) with an interval of 15\(^\circ\) between two consecutive poses; the pitch orientation has only three values, – 30\(^\circ\), 0\(^\circ\), and + 30\(^\circ\). Although the dataset has sufficient data for evaluation and training, its complexity is low, as the number of poses is quite limited.

  • CAVE [49]: The Columbia Gaze dataset contains a total of 5880 images of 56 different subjects (32 male, 24 female) of different ethnic groups and ages. The dataset was mainly created for the gaze estimation task, but it also contains information about the head pose of the participants, so it can be used for discrete head pose estimation. For each subject, a combination of five horizontal head poses (0\(^\circ\), ± 15\(^\circ\), ± 30\(^\circ\)), seven horizontal gaze directions (0\(^\circ\), ± 5\(^\circ\), ± 10\(^\circ\), ± 15\(^\circ\)), and three vertical gaze directions (0\(^\circ\), ± 10\(^\circ\)) is available.

  • CCNU [56]: All images in CCNU are low-resolution images collected in a classroom. The database consists of 58 participants, captured in 75 different poses, for a total of \({4\,350}\) images. The face images are collected under changing illumination conditions and facial expressions, thus adding more complexity to the images. For obtaining the ground-truth data, SensoMotoric Instruments (SMI) eye tracking glasses are used. The head orientation ranges from – 90\(^\circ\) to + 90\(^\circ\) in the horizontal direction, while the vertical direction spans the range – 45\(^\circ\) to + 90\(^\circ\).

  • CMU Multi-Pie [44]: This is a database collected from subjects exhibiting multiple expressions under different illumination conditions in a constrained environment. All high-resolution images are captured using a system of 15 cameras, for a total of 75 thousand images. The only rotation angle available is the yaw, with an increment step of 15\(^\circ\).

  • CMU Panoptic Dataset [55]: It is a large-scale dataset providing 3D pose annotations for multiple people engaged in social activities. It contains 65 videos with multi-view annotations captured inside a dome with approximately 30 HD cameras. The Panoptic dataset includes 3D facial landmarks and calibrated camera extrinsics and intrinsics, but does not include head pose information. Using landmarks and camera calibrations, it is possible to locate and crop images of the subjects’ heads and compute the corresponding camera-relative Euler angles.

    After processing the dataset to address the head pose problem [7], it contains 1,342,018 images. The yaw angle distribution is almost uniform and ranges in ±179\(^\circ\), but at angles near 90\(^\circ\) and – 90\(^\circ\) there are fewer images due to the effect of Gimbal lock. For the two angles pitch and roll the magnitudes are in the range ± 89\(^\circ\).

  • CMU-PIE [35]: The CMU Pose, Illumination, and Expression (PIE) dataset contains over 40,000 facial images of 68 people. Using the CMU 3D Room each person is imaged across 13 different poses, under 43 different illumination conditions and with 4 different expressions. The pose ground-truth was obtained with a 13 cameras array, each positioned to provide a specific relative pose angle. This consisted of 9 cameras at approximately 22.5\(^\circ\) intervals across yaw, one camera above the centre, one camera below the centre, and one in each corner of the room.

  • DAD-3DHeads [70]: This is an in-the-wild database that contains a variety of extreme poses, facial expressions, challenging illuminations, and severe occlusions cases. It consists of 44 thousand images annotated using a 3D head model, a non-linear optimization algorithm and a final manual adjustment. To validate head pose annotations the rotation matrices were compared to the ground-truth matrices from the BIWI dataset [46].

  • Dali3DHP [51]: This is an extreme head pose database collected with a camera mounted on a treadmill. The dataset was collected in two different sessions from 33 individuals. Ground-truth data were collected using a Shimmer sensor 2, which was attached to each person’s head. The database is large, since it contains more than 60,000 depth and colour images. All three rotation angles (pitch, yaw and roll) were defined at acquisition time, covering the following head angles: pitch [\(-\) 65.76\(^\circ\), + 52.60\(^\circ\)], roll [\(-\)29.85\(^\circ\), + 27.09\(^\circ\)], and yaw [\(-\) 89.29\(^\circ\), + 75.57\(^\circ\)].

  • DD-Pose [18]: It contains 330 thousand measurements from multiple cameras acquired by an in-car setup during naturalistic drives by 27 subjects. Large out-of-plane head rotations and occlusions are induced by complex driving scenarios, such as parking and driver-pedestrian interactions. Precise continuous 6 DoF head pose annotations are obtained by a motion capture sensor and a novel calibration device. The angles vary in the following ranges, ignoring outliers with less than 10 measurements in a 3\(^\circ\) neighbourhood: pitch \(\in\) [– 69\(^\circ\), + 57\(^\circ\)], yaw \(\in\) [– 138\(^\circ\), + 126\(^\circ\)], roll \(\in\) [– 63\(^\circ\), + 60\(^\circ\)].

  • DriveAHead [20]: It is another driver head pose dataset; it contains frame-by-frame head pose labels obtained from a motion-capture system for 20 subjects (about 1 million frames). It includes parking manoeuvres, driving on the highway and through a small town, different occlusions and illuminations, thus providing distributions of head orientation angles and head positions which are typical of naturalistic drives. Images were collected with a resolution of 512\(\times\)424 pixels and 6 DoF annotations; the range of angles is [– 45\(^\circ\), + 45\(^\circ\)] for pitch, [– 40\(^\circ\), + 40\(^\circ\)] for roll and mainly [– 90\(^\circ\), + 90\(^\circ\)] for yaw.

  • ETH [41]: The ETH Face Pose Range Image Dataset contains more than 10 thousand images of 20 persons (3 of them being female) at a resolution of \(640\times 480\) pixels. Each person freely turned her head while the scanner captured range images at 28 fps. Yaw varies between -90\(^\circ\) to + 90\(^\circ\), pitch between – 45\(^\circ\) to +45\(^\circ\), whereas roll is not considered.

  • FacePix [39]: The FacePix database depicts 30 individuals, for a total of \({5\,430}\) images. It is an imbalanced dataset with 25 males and 5 females. Yaw rotation varies from – 90\(^\circ\) (extreme left profile) to + 90\(^\circ\) (extreme right profile), with a step size of 2\(^\circ\); no other rotation angles were considered.

  • GI4E-HP [57]: It contains 36 thousand images from 10 subjects recorded with a web-cam in an in-laboratory environment. Head pose annotations are given in 6 DoF using a magnetic reference sensor. All transformations and camera intrinsics are provided. Head pose annotations are given relative to an initial subjective frontal pose of the subject.

  • GOTCHA-I [66]: This dataset is a collection of 682 videos of 62 subjects in 11 different indoor and outdoor environments, addressing both security and surveillance problems. To obtain the ground-truth, a 3D head model is reconstructed and processed using the Blender software. There are 137,826 labelled frames with 2223 head poses per subject in the range of [– 40\(^\circ\), + 40\(^\circ\)] in yaw, [-30\(^\circ\), +30\(^\circ\)] in pitch and [– 20\(^\circ\), + 20\(^\circ\)] in roll, with a step of 5\(^\circ\).

  • ICT-3DHP [48]: It is a large dataset which was collected in-the-wild, i.e. captured in an unconstrained environment. The ground-truth was acquired through a Polhemus Fastrack flock-of-birds tracker, containing a magnetic sensor, attached to a cap worn by the participants; the dataset contains both RGB and depth data. The database is annotated for all three rotation angles, pitch, yaw and roll. No accurate information about the angle ranges is provided.

  • IDIAP Head Pose [36]: It contains 66,295 head images stemming from 8 video recordings of meetings, each approximately one minute in duration, showing a few people in a meeting room. In each sequence, two subjects, which are always visible, were continuously annotated using a magnetic sensor. Therefore, each image has a complete annotation of head pose orientation in pitch (range [– 60\(^\circ\), + 15\(^\circ\)]), yaw (range ± 60\(^\circ\)) and roll (range ± 30\(^\circ\)).

  • M2FPA [67]: This dataset contains a total of 397,544 images of 229 subjects with 62 poses (including 13 yaw angles, 6 pitch angles and 44 yaw-pitch combinations), 4 attributes and 7 illuminations. There are 6 classes for pitch in the range of [– 30\(^\circ\), +45\(^\circ\)] with a step of 15\(^\circ\) and 13 measurements for yaw in the range ± 90\(^\circ\) with a step of 15\(^\circ\).

  • McGill [50]: The database consists of 60 videos of 60 different participants and contains 18,000 video frames in total. The videos were recorded in both indoor and outdoor environments. The participants were free to behave as they wanted during the video collection process; therefore, arbitrary illumination conditions and background clutter are present, especially outdoors. Only yaw angles are estimated, using a semi-automatic procedure, with variations in the range [– 90\(^\circ\), + 90\(^\circ\)].

  • MDM corpus [68]: The Multimodal Driver Monitoring database was collected from 59 subjects recorded while driving a car and performing various tasks. To record the head pose, the Fi-Cap device was used, which continuously tracks the head movement of the driver using fiducial markers, providing frame-based annotations to train head pose algorithms in naturalistic driving conditions. This set consists of 48.9 h of recordings (10,541,166 frames); it covers a large range of head poses along all three rotation axes, due to the large number of subjects included and the variety of primary and secondary driving activities considered during data acquisition. Yaw angles range around the origin, spanning from – 80\(^\circ\) to 80\(^\circ\), while pitch angles have an asymmetric range spanning from – 50\(^\circ\) to 100\(^\circ\).

  • MTFL [52]: The Multi-Task Facial Landmark dataset contains 12,995 outdoor face images from the web. These images come from the CUHK Face Alignment database and the AFLW dataset. Each image is annotated with a bounding box and five facial landmarks. There are ground-truth annotations for gender, age, smiling, wearing glasses and head pose. For the latter, the images are manually categorized into 5 discrete classes: Left-profile, Left, Frontal, Right, Right-profile.

  • Pandora [60]: It has been specifically created for head centre localization, head pose and shoulder pose estimation and is inspired by the automotive context. A frontal fixed device acquires the upper body part of the subjects, simulating the point of view of the camera placed inside the dashboard. Subjects also perform driving-like actions, such as grasping the steering wheel, looking to the rear-view or lateral mirrors, shifting gears and so on. Pandora contains more than 250 thousand full resolution RGB (1920\(\times\) 1080 pixels) and depth images (512 \(\times\) 424) acquired with a Microsoft Kinect 1 device. Subjects perform wide head movements: ± 70\(^\circ\) roll, ± 100\(^\circ\) pitch and ± 125\(^\circ\) yaw. Garments as well as various objects are worn or used by the subjects to create head occlusions. The ground-truth annotations have been collected using a wearable Inertial Measurement Unit (IMU) sensor.

  • Pointing’04 [38]: It is one of the oldest databases, released in 2004, and was long considered the classical benchmark for HPE (in some studies it is also called the PRIMA database [74]). Despite its age, it is still used for research purposes, due to its challenging nature and the large variety of consecutive poses [29,30,31,32]. A total of 15 participants (between 15 and 40 years old) were involved in the image acquisition. Some of them wear eyeglasses or have facial hair, thus increasing the task complexity. Images were collected in an indoor lab environment, with very low illumination conditions. Each participant was asked to look at markers on the wall, and two rotation angles (yaw and pitch) were annotated through a subsequent manual labelling process (thus introducing some errors). The head orientation varies between ± 90\(^\circ\) both in the horizontal and vertical directions, while the difference between two consecutive poses in the horizontal and vertical orientation is kept at 15\(^\circ\) and 30\(^\circ\), respectively.

  • SASE [61]: This is a 3D database collected with a Kinect 2 camera. It consists of both RGB and depth images of 32 male and 18 female subjects, for a total of 30,000 frames. The subjects have different ethnicities and hairstyles, with an age range of 7–35 years. All three rotation angles (pitch, yaw, and roll) are considered. The participants show different facial expressions during image acquisition, so that, along with head pose estimation, the database may also be used for emotion recognition. For each person, a large sample of head poses is included, within the bounds of – 45\(^\circ\) to 45\(^\circ\) for yaw, – 75\(^\circ\) to 75\(^\circ\) for pitch and – 45\(^\circ\) to 45\(^\circ\) for roll.

  • SyLaHP [62]: The Synthetic dataset for Landmark based Head Pose estimation was proposed by Werner et al. [62] along with a benchmark protocol to learn head pose on top of any landmark detector (called HPFL). It contains about 101 thousand synthetic images from 30 subjects, with varying ethnicity, age and gender. The angles are in the ranges: ± 70\(^\circ\) for pitch, ± 90\(^\circ\) for yaw and ±55\(^\circ\) for roll.

  • SynHead [63]: This is a large-scale synthetic dataset for head pose estimation in videos containing 10 head models (5 female and 5 male), 70 motion tracks and \({510\,960}\) frames. Such synthetic dataset, which considers all Euler angles, generates 100% reliable ground-truth to compensate for errors existing in manually annotated datasets. The Euler angles are in the range of [– 100\(^\circ\), +100\(^\circ\)].

  • Synthetic [58]: The Synthetic image database is a large database of 74,000 high quality images taken from head models. A total of 37 sequences have been considered, where each sequence includes 2000 frames. The head pose in face images covers ± 50\(^\circ\) of roll, ± 75\(^\circ\) for yaw, and ± 60\(^\circ\) for pitch. The database is quite challenging as different ages, races, and facial expressions are included.

  • Taiwan RoboticsLab [43]: It contains 6660 images of 90 subjects. For each subject there are 74 images: 37 images were taken every 5 degrees from the right profile (defined as + 90\(^\circ\)) to the left profile (defined as – 90\(^\circ\)) in the yaw rotation, using a camera array, and the remaining 37 images were synthesized from the existing 37 images by flipping them horizontally with commercial image processing software.

  • UbiPose [64]: This dataset relies on videos from the UBImpressed dataset, which was captured to study the performance of students from the hospitality industry at their workplace. The data are recorded using a Kinect 2 sensor; however, the ground-truth head pose is indirectly inferred from facial landmarks. There are 10.4 thousand validated inferred head poses, and most frames fall within a [20\(^\circ\), 40\(^\circ\)] interval.

  • UET-Headpose [69]: The UET-Headpose dataset was created to capture the head pose of annotated people in many conditions; it includes 12,848 images obtained from 9 people. The dataset has a uniform yaw angle distribution over all directions in the range [– 179\(^\circ\), 179\(^\circ\)], obtained by having the annotated people rotate through all yaw directions during collection. Therefore, it is possible to learn all yaw angles within a 360\(^\circ\) range.

  • UMD Faces [59]: This dataset has 367,888 annotated faces of 8277 subjects. It contains information about bounding boxes (verified by humans), twenty-one keypoint locations, Euler angles and the gender of the subject. These annotations have been generated using the All-in-one CNN model [75], therefore the dataset may contain erroneous annotations, especially for the pitch, yaw and roll angles.

  • VGGFace2 [65]: This is a very large HPE database, released in 2018. It contains 3.31 million images of 9131 subjects, with an average of 362 images per subject. The database is constructed with images downloaded from Google Image Search and shows large variations in pose, illumination, age, profession, and ethnicity. However, pose (pitch, yaw and roll) is estimated using pre-trained pose classifiers defining 5 classes for angles in the ranges [– 100\(^\circ\), – 40\(^\circ\)), [– 40\(^\circ\), – 10\(^\circ\)), [– 10\(^\circ\), + 10\(^\circ\)), [+ 10\(^\circ\), +40\(^\circ\)) and [+ 40\(^\circ\), + 100\(^\circ\)).

Head Pose Rotations Representations

Many possible representations can be used to express rotations of rigid bodies. The most widely used in the field of head pose estimation is the one based on Euler angles, but other representations are also exploited in the literature because of some problems with this specific one.

Furthermore, it has been shown that any rotation representation in 3D with fewer than five dimensions is discontinuous, making the learning process harder [76]. In the following, we briefly review different rotation parametrizations and their pros and cons, to see how they might affect regression performance.

Euler Angles

The Euler angles were introduced by Leonhard Euler in rigid body dynamics to describe the orientation of a reference system attached to a rigid solid in motion. Three parameters are needed to describe an orientation in a 3 dimensional Euclidean Space \({\mathbb {R}}^{3}\).

Thus, the Euler angles are a set of three angular coordinates which specify the orientation of a reference system with orthogonal axes, usually mobile, with respect to another reference system with known orthogonal axes, called the standard orientation. This standard initial orientation is normally represented by a motionless (fixed) coordinate system.

Euler angles can represent any rotation by means of three successive elemental rotations around three independent axes.

$$\begin{aligned} R_{x}(\alpha )&= \begin{bmatrix} 1 & 0 & 0\\ 0 & \cos (\alpha ) & -\sin (\alpha )\\ 0 & \sin (\alpha ) & \cos (\alpha ) \end{bmatrix} \\ R_{y}(\beta )&= \begin{bmatrix} \cos (\beta ) & 0 & \sin (\beta )\\ 0 & 1 & 0\\ -\sin (\beta ) & 0 & \cos (\beta ) \end{bmatrix} \\ R_{z}(\gamma )&= \begin{bmatrix} \cos (\gamma ) & -\sin (\gamma ) & 0\\ \sin (\gamma ) & \cos (\gamma ) & 0\\ 0 & 0 & 1 \end{bmatrix}. \end{aligned}$$

These three elemental rotations around distinct axes can be composed to obtain a single rotation matrix using matrix multiplication:

$$\begin{aligned} R\ =\ R_{x}R_{y}R_{z}. \end{aligned}$$

Matrix multiplication is not commutative, and the same applies to rotations; therefore, the order of application of the three successive elemental rotations is important.
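
As a minimal NumPy sketch of this composition (the angle values are arbitrary and chosen only for illustration), note how swapping the multiplication order yields a different matrix:

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

alpha, beta, gamma = np.radians([20.0, 45.0, 10.0])   # arbitrary example angles

R1 = rot_x(alpha) @ rot_y(beta) @ rot_z(gamma)   # R = Rx Ry Rz
R2 = rot_z(gamma) @ rot_y(beta) @ rot_x(alpha)   # same angles, different order

print(np.allclose(R1, R2))   # False: rotation composition is not commutative
```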

However, the definition of Euler angles is not unique: in the literature many different conventions are used, which differ in the sequence of rotations and in the axes about which the rotations are carried out (see Fig. 4).

Fig. 4

Different processes from the same initial pose to the same final pose in different rotation order (image from [77])

Following the Tait–Bryan convention, we can define x, y and z as the original axes and X, Y, and Z as the axes after rotation. The line representing the intersection between the planes xy and YZ is called the line of nodes N, see Fig. 5. The Euler angles with this convention are: \(\alpha\), the rotation angle between x and N, covering a range of \(2\pi\); \(\beta\), the rotation angle between z and Z, covering a range of \(\pi\); \(\gamma\), the rotation angle between N and X, covering a range of \(2\pi\).

Many datasets have annotations of pitch, yaw and roll angles, but not all of them explicitly mention the order; determining it becomes a tedious and error-prone process.

The main limitation of Euler angles remains the gimbal lock: when the second elemental rotation reaches 90 (or – 90) degrees, the first and third axes become parallel (i.e. linearly dependent), which gives an infinite number of Euler-angle solutions for the same rotation, and the individual first and third angles can no longer be determined. This is a great limitation when wide ranges of rotations [– 180\(^\circ\), +180\(^\circ\)] are considered (see Fig. 5).
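
A small numerical illustration of this degeneracy, again with SciPy and assuming the intrinsic XYZ sequence: with the middle angle at 90\(^\circ\), only the sum of the first and third angles is observable, so two different Euler triples map to the same rotation matrix.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Middle (y) rotation fixed at 90 degrees: the gimbal-lock configuration.
Ra = R.from_euler('XYZ', [25, 90, 15], degrees=True).as_matrix()
Rb = R.from_euler('XYZ', [40, 90,  0], degrees=True).as_matrix()

# Both triples share first + third = 40 degrees and give the same matrix,
# so the individual angles cannot be recovered from the rotation itself.
print(np.allclose(Ra, Rb))   # True
```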

Fig. 5

Euler angles, image from Wikipedia [78]

Rotation Matrix

Each rotation can be uniquely described with a rotation matrix. The rotation matrix R is a special orthogonal \(3\times 3\) matrix, with a determinant equal to one, that represents a rotation in Euclidean space.

$$\begin{aligned} R = \begin{pmatrix} r_{11} & r_{12} & r_{13}\\ r_{21} & r_{22} & r_{23}\\ r_{31} & r_{32} & r_{33} \end{pmatrix}, \quad R^\textrm{T}R = RR^\textrm{T} = I, \quad \det (R) = 1. \end{aligned}$$

Rotations can be composed using matrix multiplication, and the resulting matrix is again a rotation matrix. A rotation is thus represented using nine parameters.

To regress the parameters with back-propagation, an orthogonality constraint must be enforced; otherwise, something different from a rotation matrix will be obtained during inference [79].
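
One common way to obtain a valid rotation at inference time (a sketch of a possible post-processing step, not necessarily the procedure adopted in [79]) is to project the unconstrained \(3\times 3\) network output onto the closest rotation matrix via SVD:

```python
import numpy as np

def project_to_so3(m: np.ndarray) -> np.ndarray:
    """Project an arbitrary 3x3 matrix onto the closest rotation matrix."""
    u, _, vt = np.linalg.svd(m)
    r = u @ vt
    if np.linalg.det(r) < 0:      # fix an improper rotation (determinant -1)
        u[:, -1] *= -1
        r = u @ vt
    return r

raw_output = np.eye(3) + 0.1 * np.random.randn(3, 3)  # noisy, non-orthogonal prediction
rot = project_to_so3(raw_output)
print(np.allclose(rot @ rot.T, np.eye(3)), np.isclose(np.linalg.det(rot), 1.0))
```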

A drawback of rotation matrices is that they are less intuitive: in general, it is not easy to understand what a matrix is doing by simply looking at it. This is why Euler angles are sometimes more favourable.

Let v be the column vector giving the position of a point in the standard initial orientation and R the rotation matrix. Then, the rotated vector u is obtained by multiplying the rotation matrix by the vector.

$$\begin{aligned} u\ =\ R\ \cdot \ v. \end{aligned}$$

The ease by which vectors can be rotated using a rotation matrix, as well as the ease of combining successive rotations, make the rotation matrix a useful and popular way to represent rotations, even though it is less concise than other representations [28].

Quaternions

Quaternions are a compact way to represent rotations, they have four parameters, which can be interpreted as a scalar component plus a three-dimensional vector component:

$$\begin{aligned} q\ =\ \left( s_{0}, \overrightarrow{v}\right) \ =\ \left( s_{0}, v_{1}, v_{2}, v_{3}\right) . \end{aligned}$$

Quaternions are quite popular because they are more compact than the matrix representation and because it is simple to combine two individual rotations, represented as quaternions, using the quaternion product.

Unlike Euler angles, quaternions are free from the gimbal lock problem, but they still have an ambiguity caused by their antipodal symmetry: q and \(-q\) correspond to the same rotation.
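
The sketch below (again relying on SciPy, whose quaternions use the [x, y, z, w] ordering) shows both properties: composition through the quaternion product and the antipodal ambiguity.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

r1 = R.from_euler('XYZ', [10, 20, 30], degrees=True)
r2 = R.from_euler('XYZ', [-5, 40,  0], degrees=True)

combined = r2 * r1          # compose rotations (internally a quaternion product)
q = combined.as_quat()      # four parameters: [x, y, z, w]

# Antipodal symmetry: q and -q represent exactly the same rotation.
print(np.allclose(R.from_quat(q).as_matrix(), R.from_quat(-q).as_matrix()))  # True
```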

Furthermore, it has recently been demonstrated that, for 3D rotations, all representations are discontinuous in real Euclidean spaces of four or fewer dimensions, and empirical results suggest that continuous representations outperform discontinuous ones [76]. This means that the Euler angle and quaternion representations might not be well suited for regression tasks.

Methods

The approaches used in the literature to solve the task of head pose estimation differ considerably from one another: they have different degrees of automation and different prerequisites, and they are based on different assumptions.

Fig. 6

Our taxonomy of deep learning approaches for head pose estimation problem

We arrange each system by the approach that underlies its implementation (taking as reference the classifications proposed in previous works [1, 28]), giving a description and evaluating the advantages and disadvantages of each approach. Our taxonomy is briefly summarized in Fig. 6.

Since head pose estimation has been investigated for a long time, many methods have emerged during this period; however, starting from 2015, methods based on convolutional neural networks have been used more and more, highlighting a shift in methodology, from traditional machine learning (ML) methods towards deep learning (DL) approaches.

In the following sections, we first briefly review "classical methods" (Section "Classical Methods"), including all approaches that are little, or no longer, considered in the most recent research, and then shift the focus to deep learning based models:

  • Segmentation based models (Section “Segmentation Based Models”):

    compute head pose using probability maps produced by a face segmentation algorithm [29,30,31,32, 80];

  • Model based methods (Section “Model Based Methods”):

    exploit facial keypoints, either for regressing head pose [62, 81,82,83] or for reconstructing 3DMM and learn its rotation parameters [84,85,86,87].

  • Non-linear regression methods (Section “Non-linear Regression Methods”):

    use deep convolutional neural network to develop a mapping from the image to the head pose measurements [7, 8, 60, 63, 76, 88,89,90,91];

  • Multi-task methods (Section “Multi-task methods”):

    jointly solve head pose with other correlated tasks (e.g. face detection or face alignment) to improve the overall performance [75, 92,93,94,95,96,97,98,99,100,101,102,103];

Additional details about classical methods can be found in [1, 104]. More recent surveys are [2, 28]; with respect to them, we will cover the parts relating to the state-of-the-art models in more detail, with a special focus on multi-task learning, 3DMM based and CNN based models.

Classical Methods

Here we briefly recall a short list of methods that played an important role for HPE but have been either outdated by more recent techniques, or are difficult to integrate with deep learning technology, which is the main focus of this survey:

  • Appearance template methods: compare a face image to a set of exemplar templates to find the most similar view [105, 106];

  • Detector array: use a series of head detectors, each trained for a specific pose and assign the pose relative to the detector with the greatest support [107,108,109];

  • Manifold embedding: embed an image into low-dimensional manifolds that model the continuous variation in head pose and use these for pose regression [110,111,112,113,114,115,116,117,118,119];

  • Tracking methods: use temporal constraint to recover the pose from observed movements in video frames [51, 120,121,122,123,124];

  • Hybrid classical approaches: combine one or more of the aforementioned methods in a single model [1, 104];

Segmentation Based Methods

These methods address the problem of head pose estimation by exploiting the strong relationship between the head pose and the position of various face parts. The idea is that the performance of the face pose predictor can be improved if an efficiently pre-parsed image, carrying information about the various facial features, is provided as input [29,30,31,32].

The first step is to perform semantic segmentation over the input image either by training a single segmentation model or multiple (discrete) pose specific models. Each model parses the face into different parts (e.g. nose, mouth, eyes, hair) and produces probability maps. Given a new image, the probabilities associated to face parts by the single model or the different pose-specific models are used as the only information for estimating the head pose by using specifically designed algorithms or by training a classifier (e.g. Random Forest, SVMs, etc...).

Huang et al. [125] were the first to exploit the relation between face segmentation and head pose estimation. In their method, the face is first segmented into three parts (skin, hair, background) using traditional texture-based techniques; then, in a second stage, they estimate basic discrete head poses (“frontal”, “right-profile” and “left-profile”) using a simple regressor.

More modern works address segmentation by means of Deep Neural Networks, which typically allow a larger number of segmentation classes and discrete poses to be considered (e.g. 13 poses [29, 31] or 93 poses [30, 32, 80]).

Khan et al. [29] proposed a simple algorithm that exploits the probabilities associated with face parts to predict head pose: first, they run the segmentation models for all the different poses, obtaining probability maps; then, they take the maximum of such probabilities to assign a pose to each pixel; finally, they count the total number of pixels associated with each discrete pose and assign to the face image the pose with the highest count. A similar approach was taken in [30], but relying on the concept of super-pixels, i.e. small meaningful patches belonging to the same object.
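
A minimal sketch of this voting scheme is given below; the array layout and variable names are ours, and in the original works the probability maps come from pose-specific segmentation networks.

```python
import numpy as np

def vote_head_pose(prob_maps: np.ndarray) -> int:
    """Pick a discrete pose from pose-specific face-part probability maps.

    prob_maps has shape (n_poses, H, W): prob_maps[p, i, j] is the face-part
    probability assigned to pixel (i, j) by the segmentation model for pose p.
    """
    # Assign each pixel to the pose whose model gives it the highest probability.
    per_pixel_pose = prob_maps.argmax(axis=0)                        # (H, W)
    # Count the pixels voting for each pose and return the winner.
    votes = np.bincount(per_pixel_pose.ravel(), minlength=prob_maps.shape[0])
    return int(votes.argmax())

# Example with 13 discrete poses on a 64x64 crop of random "probabilities".
dummy_maps = np.random.rand(13, 64, 64)
print(vote_head_pose(dummy_maps))
```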

The estimation of the head pose after performing segmentation can be done by many traditional ML techniques, comprising multi-class linear SVM [31], Random Forest [32] and Soft-Max classifiers [80].

The main advantage of these methods is that they are able to exploit the strong relationship between the head pose and the position of the various face parts, which is useful for accurate pose estimation. Moreover, these methods do not require any landmark detection process or face alignment step. Finally, these systems are typically multi-task: they combine HPE, facial expression detection, gender recognition and age classification in a single framework (see Fig. 7).

Fig. 7

Segmentation based method: perform face segmentation and from probability maps infer head pose (image from [29])

A drawback of this technique is that manually segmented face images are needed for training, and creating supervised segmentation datasets is a notoriously onerous operation. On the other side, face segmentation has a lot of different applications, e.g. for editing [126, 127], so we may expect a steady improvement on this aspect of the task.

Surprisingly, only the coarse head pose classification task has been addressed so far. Testing these techniques on the more challenging continuous regression problem is an open issue, that could definitely help to assess the quality of the technique.

Model Based Methods

Model based methods require either a 3D head model or the localization of facial keypoints (landmarks), such as eyes, eyebrows, nose, lips, etc. (or both in some cases), and from these they estimate the head pose. It has been shown that such factors, for instance the location of the face in relation to the contour of the head, strongly influence the human perception of the head [1]. For this reason, model based methods are particularly interesting: they can directly exploit properties which are known to influence human head pose estimation. Moreover, in recent years, with the development of deep learning and the high availability of data, methods which directly extract facial landmarks have improved their performance enormously and have become the dominant approach in facial analysis tasks [8].

A by-product of face alignment is the ability to recover the 3D pose of the head in two different ways: (I) the Landmark-to-Pose approach and (II) by exploiting deformable methods.

In the landmark-to-pose approach, the keypoints are given as input to a ML or DL algorithm that regresses the head rotation angles.

Werner et al. [62] proposed a benchmark protocol to learn a pose estimator on top of any landmark detector, called HPFL, that trains a Support Vector Regression (SVR) model using landmarks as features. To exploit the power of Deep Neural Networks not only to compute landmarks but also to obtain Euler angles, Gupta et al. [81] proposed a deep learning architecture that regresses head pose from uncertainty maps computed from 5 facial keypoints. Xia et al. [82] also used a CNN, but they give as input a heatmap of 68 landmarks stacked with a transformed version of the input image, so that the neural network can focus on the area around the facial landmarks while extracting features from the image, reducing interference from the wild environment. Dapogny et al. [83] proposed an attentional cascade model that iteratively refines head pose and landmark estimates. The advantage is that using head pose information to refine landmark alignment provides more precise landmark estimates (as also stated in [128]), which in turn help refine the head pose prediction, further advocating for an entwined landmark alignment and head pose prediction scheme. The disadvantage is that the network is bigger and requires a longer training time.
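
As a simple illustration of the landmark-to-pose idea, in the spirit of training an SVR on top of a landmark detector as in HPFL [62], the following hedged sketch uses placeholder data; a real setup would use detected landmarks and dataset annotations.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Placeholder training data: 68 (x, y) landmarks flattened to 136 features,
# with (pitch, yaw, roll) targets in degrees. In practice the landmarks come
# from any off-the-shelf detector and the angles from dataset annotations.
X_train = np.random.rand(500, 68 * 2)
y_train = np.random.uniform(-90, 90, size=(500, 3))

# One SVR per output angle, wrapped in a multi-output regressor.
model = MultiOutputRegressor(SVR(kernel='rbf', C=10.0))
model.fit(X_train, y_train)

landmarks = np.random.rand(1, 68 * 2)      # landmarks of a new face image
pitch, yaw, roll = model.predict(landmarks)[0]
```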

For this reason, other researchers have recently tried to define methods that do not need training to estimate the head pose once facial landmarks are detected. Abate et al. [129] used a quad-tree, i.e. a particular kind of unbalanced tree that divides the image into smaller and smaller quadrants, to measure the distance between the representation of the input face and a reference model. Barra et al. [130] (2020) exploit a spider-web shaped model that uses the landmark locations to build a feature vector, which in turn is compared to a set of prototypical vectors to determine the closest one and establish the pose. Unfortunately, with these two methods only discrete poses can be obtained (with a 5\(^\circ\) angular step); they are computationally efficient but less effective than other methods.

Deformable methods, instead, use a non-rigid face model and fit it to the image such that it conforms to the facial structure of each individual, estimating the head pose from the correspondence between feature points on the 2D face image and those on a 3D facial model.

The 3D pose information of the head can be inferred by solving the Perspective-n-Point (PnP) problem, i.e. the problem of estimating the pose of an object by finding the rotation matrix R and the translation vector t given intrinsic camera parameters, known locations of n 3D points and their corresponding 2D projection in the image. Indeed, by looking for the projection relation between a 3D facial model and a 2D face image, head pose angles can be calculated from the elements in the rotation matrix directly.

The simplest and most commonly used pipeline involves a number of steps [8]: (1) face alignment; (2) definition of a 3D human mean face model; (3) approximation of the camera intrinsic parameters; (4) solving the 2D-3D correspondence problem using one of the available PnP algorithms, such as POSIT [72] or DLS [131]. In their basic form, these methods do not need to include and train a pose estimation model; moreover, any method for face alignment can be used, such as Dlib [132] or FAN [133] (see [134] for a survey on face alignment methods). The drawback of the PnP approach is that the camera parameters are typically not known, so they are approximated, leading to errors in the final prediction.
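
A sketch of steps (2)–(4) with OpenCV is shown below; the mean-face 3D coordinates, the detected 2D landmarks and the focal-length approximation are illustrative placeholders rather than values prescribed by the cited works.

```python
import cv2
import numpy as np

# (2) A mean 3D face model: coordinates (arbitrary metric units) of a few
#     stable landmarks (nose tip, chin, eye corners, mouth corners).
model_points = np.array([
    [0.0,    0.0,    0.0],      # nose tip
    [0.0,  -63.6,  -12.5],      # chin
    [-43.3,  32.7,  -26.0],     # left eye outer corner
    [43.3,   32.7,  -26.0],     # right eye outer corner
    [-28.9, -28.9,  -24.1],     # left mouth corner
    [28.9,  -28.9,  -24.1],     # right mouth corner
], dtype=np.float64)

# (1) The corresponding 2D landmarks, e.g. from Dlib or FAN (placeholder values).
image_points = np.array([
    [359, 391], [399, 561], [337, 297],
    [513, 301], [345, 465], [453, 469]], dtype=np.float64)

# (3) Approximate intrinsics: focal length ~ image width, principal point at centre.
w, h = 640, 480
camera_matrix = np.array([[w, 0, w / 2],
                          [0, w, h / 2],
                          [0, 0, 1]], dtype=np.float64)
dist_coeffs = np.zeros((4, 1))              # assume no lens distortion

# (4) Solve the 2D-3D correspondence problem.
ok, rvec, tvec = cv2.solvePnP(model_points, image_points,
                              camera_matrix, dist_coeffs)
rotation_matrix, _ = cv2.Rodrigues(rvec)    # head pose as a rotation matrix
```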

Modern deformable approaches rely on a 3D face morphable model and learn to deform it to adapt to the person’s head, then solve the 2D-3D correspondence more effectively.

Wu et al. [84] assumed the availability of a 3D deformable facial model and followed a cascaded procedure that iteratively updates the facial landmark locations, the head pose angles and the non-rigid deformations. There is no learning involved for the head pose, which is estimated from the 3D deformable model by minimizing the projection error over all landmark points. Liu et al. [85] trained a CNN to reconstruct a personalized 3D face model from the input head image and estimate the head pose through an iterative 3D-2D keypoint matching algorithm under constrained perspective transformation (see Fig. 8). Diaz Barros et al. [135] proposed a hybrid method that incorporates two strategies: (1) a temporal tracking scheme, which uses optical flow to compute the correspondences of a set of keypoints in every pair of frames; (2) a head pose estimation scheme which estimates the pose independently in each frame by aligning 2D facial landmarks to every image; in both schemes the head pose is estimated by minimizing the reprojection error of the 3D-2D correspondences.

Fig. 8

An example of deformable model: A personalized 3D face is reconstructed from the input head image using a CNN, then keypoints matching is used to obtain the pose [85]

Unfortunately, these methods use deep learning only for face alignment and rely on some projection method to compute the head pose, not exploiting its full potential. Instead, the state-of-the-art networks for head pose estimation follow a different approach, also based on 3DMMs. In this case, the focus is on the 3DMM-based 3D dense alignment and 3D dense reconstruction task. The network can be directly used for pose estimation: indeed, the regressed 3DMM parameters include pose, as well as shape and expression. There is no keypoint matching involved.

Zhu et al. [53] proposed an alignment framework termed 3D Dense Face Alignment (3DDFA), which directly fits a 3D face model to RGB images via convolutional neural networks. The primary task of 3DDFA is to align facial landmarks, even the occluded ones, using a dense 3D model. The 3D head pose is produced as a result of the 3D fitting process. SynergyNet [86] is a novel network designed to predict complete 3D facial geometry, including 3D alignment, face orientation and 3D face modelling. The network defines a synergy process that exploits the relation between 3D landmarks and 3DMM parameters to improve the overall performance. Despite the large amount of work on 3DMM-based 3D dense alignment, and the fact that many of the proposed approaches directly estimate rotation matrices, Wu et al. were the first to propose a discussion of the head pose estimation task; previous works only focused on the evaluation of landmarks and 3D faces. The authors, besides evaluating SynergyNet, conducted extensive and detailed benchmarking of other 3DMM-based methods, such as 3DDFA-TPAMI [136], 2DASL [137] and 3DDFA-V2 [138], highlighting the better performance of the proposed network due to the innovative synergy process introduced (see Fig. 9).

Fig. 9

In SynergyNet a backbone network learns to regress 3DMM parameters (pose, shape, expression) [86]

SADRNet, proposed very recently by Ruan et al. [87], is another network that ranks among the state-of-the-art models on the AFLW2000 [53] dataset. It is an encoder-decoder architecture that regresses the deformation D and infers the pose parameters f, R and t to reconstruct the 3D face geometry from a single 2D face image. The most important novelty of the network is the attention mechanism used to enhance the visible facial information and to estimate the transformation matrix from visible landmarks only, providing robustness to occlusions and large pose variations.

Finally, with the development of consumer-level depth sensors, many studies have tried to exploit 3D-face model-based approaches using RGB-D data. These studies have developed in parallel with those presented above and mainly use optimization techniques, such as the ICP algorithm [139], which aim to minimize the discrepancy between depth data and a parametrized 3D model. Martin et al. [140] proposed a real-time head pose estimation method that first creates a point-cloud-based 3D head model from the input depth image and then registers the 3D head model with the iterative closest point (ICP) algorithm [139] for head pose estimation. Mayer et al. [141] proposed estimating head poses by registering a 3D morphable model (3DMM) to the input depth data through a combination of particle swarm optimization (PSO) and the ICP algorithm [139]; higher pose estimation accuracy is achieved at the expense of a much higher computational cost. A 3D morphable model and online 3D reconstruction are used by Yu et al. [64] for full head pose estimation, thus also handling extreme poses. Although estimating the head pose on depth images avoids the cluttered backgrounds and illumination changes that are common in RGB images, the main disadvantage is that depth sensors are not available in most current real-world applications.

Summing up, there is a huge literature of approaches based on facial keypoints, which are used as key elements of deformable methods, given as input features to neural networks, or are even the only information needed in the PnP approach. It is evident that there is a close relationship between the head pose and the distribution of the landmarks, so they are valuable information for estimating head pose [82]. Moreover, there is a growing number of landmark detectors/trackers that can be used freely for research purposes, and rapid progress is being made in improving landmark quality, including in unconstrained scenarios with difficult lighting, out-of-plane head poses and occlusions [62].

The PnP approach is one of the most used in the literature, but it has a disadvantage: many parameters (such as the camera intrinsics) are typically approximated, and this can lead to inaccuracies in the results. Moreover, when a mean face model is used, even with perfect registration, the images of two different people will not line up exactly, since the location of facial features varies between people, leading to errors in the final result [82]. For this reason, recently developed approaches rely on face reconstruction as a step prior to 2D-3D keypoint matching [85]. These methods typically require high-resolution images, and the position of the landmarks must be initialized before the pose estimation.

Recent research has focused on landmark-to-pose approaches that regress the head pose from the landmark configuration using deep networks, and on 3DMM-based approaches that reconstruct and align a 3D dense face model with the images. Less research has been devoted to the latter case, but it seems a very promising direction, able to achieve remarkable results, even if the head pose is only obtained as a by-product. The main drawbacks of 3DDFA approaches are that the networks are quite complex and their training depends on costly face mesh annotations. Nevertheless, SADRNet [87] reconstructs the 3D model of the face (starting from a cropped image) in 13.5 ms. However, it is not clear how these results generalize to low-resolution far-field imagery, due to the difficulty of achieving a good fitting and precise image feature locations in those conditions.

Non-linear regression methods

Non-linear regression methods do not require keypoint detection, but directly predict the head pose angles from images. A model is trained in a supervised manner and learns a functional mapping from the image space to discrete/continuous pose directions. The main challenge is to train the model in a way that ensures the regression tool learns a proper mapping.

Early approaches used classical machine learning models such as Support Vector Regressor (SVR) [105], Localized Gradient Histograms (LCH) [142] or Random Forest (RF) [46, 56].

In the last decade, there has been a drastic shift towards the deep learning paradigm, with an increasing use of convolutional neural networks to estimate the three-dimensional head pose with higher accuracy.

First attempts with deep models exploited simple architectures [143, 144] and common networks [73], such as AlexNet [145], VGG [146] and ResNet [147]. Patacchiola et al. [148] improved the results by introducing dropout and adaptive gradient methods during the training of the network, and by training a different specialized network for each rotation angle (pitch, yaw, roll), which permits fine-tuning for a specific degree of freedom without losing predictive power on another one. Gu et al. [63] use a recurrent neural network to regress the head pose Euler angles by exploiting the time dimension in video sequences. The RNN can learn motion information implicitly, gaining robustness to large head pose variations and occlusions.

Ruiz et al. [8] proposed a three-branch convolutional neural network structure, which they called Hopenet, where each branch is responsible for one of the Euler angles. All branches share a backbone network that can be of arbitrary structure, e.g. ResNet50 [147], AlexNet [145] or VGG [146]. This backbone is augmented with a branch-specific fully-connected layer that predicts a specific angle. By having three cross-entropy losses, one for each Euler angle, three signals are backpropagated into the network, which improves learning (see Fig. 10).
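The per-angle loss can be sketched as follows, in the spirit of the combined cross-entropy and expected-value regression shown in Fig. 10; the bin width, the angle range and the weighting factor alpha are illustrative assumptions:

import torch
import torch.nn.functional as F

NUM_BINS = 66                                      # e.g. [-99, +99] degrees in 3-degree bins
BIN_CENTERS = torch.arange(NUM_BINS, dtype=torch.float32) * 3 - 99 + 1.5   # centre of each bin

def angle_loss(logits, angle_deg, alpha=0.5):
    """logits: (B, NUM_BINS) output of one branch; angle_deg: (B,) ground truth in degrees."""
    # coarse classification target: index of the bin containing the angle
    bin_idx = torch.clamp(((angle_deg + 99) / 3).long(), 0, NUM_BINS - 1)
    ce = F.cross_entropy(logits, bin_idx)
    # fine regression term: expected angle under the softmax distribution over bins
    expected = (F.softmax(logits, dim=1) * BIN_CENTERS.to(logits.device)).sum(dim=1)
    mse = F.mse_loss(expected, angle_deg)
    return ce + alpha * mse

# the total loss sums the contributions of the yaw, pitch and roll branches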

Fig. 10

Hopenet architecture [8]: ResNet50 with combined Mean Squared Error and Cross Entropy losses (image from https://indatalabs.com/blog/head-pose-estimation-with-cv)

The overall framework of Hopenet is also adopted by Zhou et al. [7] for their network WHENet. WHENet adopts a lighter backbone than previous work: EfficientNet-B0 [149], which incorporates Inverted Residual Blocks from MobileNetV2 to reduce the number of parameters while adding skip connections. The network is optimized for the full range of Euler angles (360\(^{\circ }\)), not only for the narrow range (180\(^{\circ }\)) considered by previous works. This is achieved through a careful choice of a wrapped loss function, as well as by developing an automated labelling method for the CMU Panoptic dataset [55], which is used during training.

FSA-Net [88] introduced a feature aggregation method to improve pose estimation. QuatNet [89] proposed a quaternion-based face pose regression framework that is claimed to be more effective than Euler angle-based methods. The quaternion representation is also used by Zeng et al. in their SRNet [150], where a specific Structural Relation-aware module is introduced; this module improves the prediction quality because discriminative pose features are learned from a global perspective (by capturing valuable facial structure information) rather than from low-level local details. TriNet [76] used a three-vector-based representation that replaces Euler-based and quaternion-based representations to increase efficacy. RankPose [90] is another CNN that explored a Siamese architecture and a ranking loss to distinguish pose-related features from a mixture of pose-related and irrelevant ones, such as age, lighting and identity. Hempel et al., with 6DRepNet [151], efficiently regress a compressed 6D form of the rotation matrix; this representation has been reported to introduce smaller errors for direct regression than vector-based ones, and it made 6DRepNet one of the SOTA models on popular datasets.
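For reference, a 6D output of this kind can be mapped back to a rotation matrix with a Gram-Schmidt orthonormalisation, a standard construction for this representation; the tensor shapes below are assumptions for illustration:

import torch
import torch.nn.functional as F

def rotation_matrix_from_6d(x):
    """x: (B, 6) network output -> (B, 3, 3) rotation matrices."""
    a1, a2 = x[:, :3], x[:, 3:]
    b1 = F.normalize(a1, dim=1)
    # remove the component of a2 along b1, then normalise
    b2 = F.normalize(a2 - (b1 * a2).sum(dim=1, keepdim=True) * b1, dim=1)
    b3 = torch.cross(b1, b2, dim=1)
    return torch.stack((b1, b2, b3), dim=2)   # columns b1, b2, b3

Because the mapping is continuous, small changes in the 6D output produce small changes in the resulting rotation, which is one of the reasons this representation regresses well.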

Given that the bounding box significantly affects the quality of a trained NN for the HPE problem [152, 153], Sheka et al. [91] (2021) proposed to average the predictions of the same neural network applied to crops with various bounding box offsets, in what they call an offset ensemble.
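A minimal sketch of this idea is given below; the offset grid and the model/crop helpers are hypothetical placeholders, not the exact procedure of [91]:

import numpy as np

def offset_ensemble(model, image, bbox, offsets=(-0.05, 0.0, 0.05)):
    """bbox = (x, y, w, h); offsets are fractions of the box size."""
    x, y, w, h = bbox
    preds = []
    for dx in offsets:
        for dy in offsets:
            shifted = (x + dx * w, y + dy * h, w, h)
            crop = crop_and_resize(image, shifted)   # hypothetical cropping helper
            preds.append(model.predict(crop))        # hypothetical (yaw, pitch, roll) output
    return np.mean(preds, axis=0)                    # average over the crop ensemble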

Not only the bounding box but also illumination and occlusion affect the final result. For this reason, Wang et al. in their FSEN [154] included low-light enhancement, strong-light suppression and face occlusion detection modules. Combined with a four-branch CNN, in which three branches extract three independent discriminative features for the pose angles and one branch extracts composite features corresponding to multiple pose angles, this improved the results on benchmark datasets.

Recently, some attempts have been made to propose lightweight networks that obtain good results at lower costs: Berral-Soler et al. [155] and Dhingra [156] proposed the RealHePoNet and LwPosr networks, respectively. However, the results are less accurate than those obtained with more complex models.

Other researchers, to overcome the limitations of publicly available datasets, which are limited in size, resolution, annotation accuracy and diversity, used synthetic data generated from high-quality 3D facial models to train their networks [58, 63]. Wang et al. [157] proposed a coarse-to-fine network to predict head pose trained on synthetically rendered faces. However, they noticed that the difference (domain gap) between rendered (source domain) and real-world (target domain) images negatively affects performance. For this reason, Domain Adaptation (DA) techniques are applied in [158, 159] to reduce the influence of domain differences.

Recently, Liu et al. proposed the ARHPE model [160], a novel asymmetric relation-aware network able to learn discriminative representations of adjacent head pose images. Different weights are assigned to the yaw and pitch directions by introducing the half-width at half-maximum of the Lorentz distribution. This has proven effective in extracting more discriminative features, even if it has been tested only on two DoF (see Fig. 11).

Fig. 11

POSEidon architecture [60]: depth images are provided to a head localization CNN, then the head region is given in input to the POSEidon network to obtain pitch, yaw and roll estimations (image from [60])

Finally, some studies leveraged depth data [46, 60, 161]. Among them, the best performing is POSEidon [60], a network composed of three independent convolutional nets followed by a fusion layer, specially conceived for estimating the pose from depth. It is the state-of-the-art model on the BIWI database [46] (see Table 4).

The main advantage of CNN-based head pose estimation is the strong learning ability of these networks, especially for image processing, which makes it possible to achieve the desired results. These algorithms work properly with both high- and low-resolution images, and they have demonstrated the ability to tolerate some errors in the training data. They do not depend on the chosen head model, the landmark detection method, the subset of points used for aligning the head model or the optimization method used for aligning 2D to 3D points. Moreover, they can be computationally efficient, straightforward to implement and easily updated with the addition of new data (being data-driven, their upper limit is high).

However, the performance of these methods drops drastically if the labelled face images are not properly annotated. It can be difficult to obtain sufficient data with head pose annotations for training, especially data with changes in appearance (such as sex, age group and race) or environmental interference (such as lighting conditions and shooting angle). Many datasets do not have a uniform distribution of data (many images contain frontal or near-frontal faces), causing difficulties in learning large pose variations. Moreover, powerful CNNs are complex and can require a long training time. It is also worth stressing that all these methods rely on a face detection step prior to pose estimation, which can heavily influence the result.

Multi-task Methods

The idea behind multi-task methods is to relate head pose estimation to other face image analysis problems, such as gender recognition, landmark detection, facial expression recognition, race classification, etc., since it has been shown that jointly solving multiple tasks can lead to better performance [52, 75, 92,93,94,95,96, 162,163,164].

The multi-task learning (MTL) paradigm encompasses a set of learning techniques that provide effective mechanisms for sharing information among multiple tasks. It enables the use of larger and more diverse datasets, improving the stability of training and the generalization of the final model.

Among the multi-task methods adopting traditional machine learning frameworks are [162, 163]. The former adopts the graph-guided FEGA-MTL framework for head pose classification of mobile targets from multi-view image sources: the physical space is divided into a discrete number of planar regions and the model tries to learn the pose-appearance relationship in each region. The latter pursued the same goal but evaluated the SVM-MTL framework.

Multi-task methods have become particularly popular with the advent of deep learning because of the unique ability of neural networks to transfer and share knowledge among various tasks. MTL has been widely used to simultaneously learn related tasks, such as: face detection + head pose estimation [97, 102, 103, 165, 166], face alignment + head pose estimation [93, 94, 98,99,100], face detection + face alignment + head pose estimation [95, 96, 101], face detection + face alignment + head pose estimation + gender recognition [92, 167], combinations with other tasks such as face recognition and appearance attribute estimation (age, smile, etc.) [52, 75], and finally head pose estimation + gaze estimation [168].

Zhang et al. [52] were the first to investigate the possibility of optimizing multiple tasks using a Task-Constrained Deep Convolutional Neural Network (TCDCN) to jointly optimize facial landmark detection with a set of related tasks, such as head pose estimation. The proposed network learns a shared feature space that is optimized to solve all the tasks at the same time. The network does not perform face detection, so it requires an image of a face as input or an additional preprocessing step. A similar network was proposed by Ahn et al. [165], but their focus was on real-time driver face detection and head pose estimation.

Ranjan et al. [92] proposed a model called Hyperface that performs face detection, face alignment, pose estimation and gender recognition. The network is designed to exploit the fact that the information contained in features is hierarchically distributed throughout the network: lower layers respond to edges and corners and hence have better localization properties (more suitable for the face alignment and pose estimation tasks), while higher layers are class-specific and suitable for learning complex tasks such as face detection and gender recognition. They make use of all intermediate layer features (called hyperfeatures) through a technique named feature fusion, which transforms the features into a common subspace where they can be combined linearly or non-linearly. They show that fusing intermediate layers improves the performance of the structure-dependent tasks of pose estimation and landmark localization, since the features become invariant to geometry in the deeper layers of the CNN.
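The following is an illustrative, deliberately simplified sketch of the feature fusion idea, not the exact Hyperface layer configuration: pooled features from shallow, middle and deep stages of a backbone are concatenated into a shared hyperfeature from which each task head predicts:

import torch
import torch.nn as nn

class FusedMultiTaskNet(nn.Module):
    def __init__(self, backbone_stages, stage_channels=(64, 128, 256), fused_dim=512):
        super().__init__()
        self.stages = nn.ModuleList(backbone_stages)      # sequential backbone blocks
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fuse = nn.Linear(sum(stage_channels), fused_dim)
        self.pose_head = nn.Linear(fused_dim, 3)          # yaw, pitch, roll
        self.gender_head = nn.Linear(fused_dim, 2)        # example of an additional task

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(self.pool(x).flatten(1))         # pool every intermediate stage
        fused = torch.relu(self.fuse(torch.cat(feats, dim=1)))
        return self.pose_head(fused), self.gender_head(fused)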

Then, Ranjan et al. [75] proposed another model called All-in-One. It differs from Hyperface in that (i) it simultaneously performs a higher number of tasks, and (ii) domain-based regularization is adopted by training on multiple datasets, each one specific to a subset of the tasks.

Xu et al. [93] brought a new type of network into the field, i.e. a cascaded architecture designed in a hierarchical way based on coarse-to-fine principles, which refines the shape and pose sequentially. Other cascaded architectures have been presented in the literature; the main differences among them are the number of stages and the type and number of tasks addressed in each stage [96, 97] (see Fig. 12).

Fig. 12

A convolutional neural network with feature fusion; examples are Hyperface [92] and All-in-One [75] (image from [97])

Kumar et al. [94] transformed the cascaded regression formulation into an iterative scheme by proposing the KEPLER model. In each iteration, a regressor predicts the visibility, the pose and the corrections for the next stage, and a rendering module uses these corrections to prepare new rendered data employed in the next iteration. The network is trained on three tasks, namely pose, visibility and the bounded error, using ground-truth annotations. The joint training is helpful since it models the inherent relationship between the number of visible points, the pose and the amount of correction needed for a keypoint in a particular pose.

Many other researchers focused on reducing the time needed for the network to solve the tasks; indeed, this is the main drawback of some of the presented models (e.g. Hyperface [92] or All-in-One [75]) and limits real-world applications. Cheng et al. [95] proposed a model that exploits a single-shot object detection (SSD) module to perform multi-scale face detection, face alignment and head pose estimation at the same time at a much higher speed. ASMNet [100] is a lightweight CNN assisted by an Active Shape Model (ASM) [169], used to guide the learning of the network, that achieves acceptable performance for face alignment and pose estimation while having a significantly smaller number of parameters and floating-point operations. ATPN [99] and MOS [101] focused on defining network structures with an even smaller number of parameters to increase efficiency. Other architectures, such as Multitask-net [102] and TRFH [103], leveraged the feature pyramid network to detect faces at different scales (see Fig. 13).

Fig. 13

Encoder-decoder network, called MNN, adopted in [98]

Valle et al. [98] proposed another type of architecture, an encoder-decoder CNN (see Fig. 13). They locate the head pose estimation task at the end of the encoder network, so that the network bottleneck acts as an embedding representing the face pose. The visibility and face alignment tasks are instead located at the end of the decoder, since they require information about the spatial location of landmarks in the image. This is the only paper to propose an encoder-decoder architecture. The presented model, called MNN, achieves results comparable to the state-of-the-art methods for the head pose estimation task; this is due to the network architecture and to a new training strategy that uses reannotated datasets.

Recently, Malakshan et al. [170] presented a completely different approach that jointly solves the Face Super-Resolution (FSR) and HPE problems. To this end, a Multi-Stage Generative Adversarial Network (MSGAN) is proposed: it benefits from a pose-aware adversarial loss and from head pose estimation feedback to generate super-resolved images that are properly aligned for HPE. Even if the network has not improved on SOTA methods on standard datasets, it significantly increased the pose estimation accuracy for low-resolution face images while obtaining very accurate results for the original high-resolution images (MAE = 4.11 on the BIWI dataset).

The main advantage of the multi-task approach is that many tasks can be solved with a single model. Furthermore, since all these tasks are strictly related, the overall performance improves thanks to the network's ability to effectively learn correlations between data from different distributions, so more discriminative features are learned. Also, some methods perform face detection together with head pose estimation, reducing the time needed for image preprocessing. Another advantage is that multiple datasets can be used for training, increasing the amount of available data.

The main disadvantage of the multi-task approach is the lack of public benchmark datasets with annotations for all the tasks. It is difficult to compare multi-task models with one another and with other head pose estimation methods, because they use different combinations of datasets for training and testing; the better performance of a model could therefore be due mainly to the training strategy rather than to the architecture of the proposed network. Moreover, some of the older models were not suited for real-world usage, e.g. the Hyperface and All-in-One architectures took 3.5 s to process a single image [75]. Newer models, however, have managed to limit this problem, making real-time systems possible.

Table 3 Head pose estimation publications most cited in recent literature

Evaluation Metrics

A common and informative metric used for evaluating HPE frameworks is the Mean Absolute Error (MAE) over the three angles, i.e., pitch, yaw and roll. MAE is quite popular (most of the papers discussed here use it as the main evaluation metric) since it provides a single statistic that gives a quick insight into the performance, for both fine and coarse pose estimation.

$$\begin{aligned} \textrm{MAE} = \frac{1}{n} \sum _{i=1}^{n} \left| y_{i} - \hat{y}_{i}\right| . \end{aligned}$$

However, in scenarios with large-range pose variations (360\(^\circ\)), this evaluation method is not reasonable. For example, when the actual angle is 170\(^\circ\) and the predicted angle is – 170\(^\circ\), the two angles are only 20\(^\circ\) apart, but the computed MAE is 340\(^\circ\), much bigger than the actual angular error [69].

For this reason, another measure has been proposed in the literature, called Mean Absolute Wrapped Error (MAWE) [7, 69]. The difference is clear by its definition:

$$\begin{aligned} \textrm{MAWE} = \frac{1}{n} \sum _{i=1}^{n} \min \left( \left| y_{i} - \hat{y}_{i}\right| ,\ 360-\left| y_{i} - \hat{y}_{i}\right| \right) . \end{aligned}$$
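A small numerical check of the wrap-around issue, using the example above, clarifies the difference between the two metrics:

import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mawe(y_true, y_pred):
    diff = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.mean(np.minimum(diff, 360 - diff))

print(mae([170.0], [-170.0]))    # 340.0 -- overstates the error
print(mawe([170.0], [-170.0]))   # 20.0  -- the true angular distance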

Another measure, mainly used for coarse head pose estimation, is the so-called Pose Estimation Accuracy (PEA). Being an accuracy measure, this metric depends on the number of poses and therefore gives little information about the actual system performance. No recent work uses it.

In recent studies on head pose estimation in the driving context, new evaluation metrics have been proposed [18,19,20]; however, no work on general head pose estimation uses them.

The first metric is the Balanced Mean Angular Error (BMAE), introduced to address the problem of the higher number of frontal pose images during evaluation, which leads to an unbalanced distribution of head orientations. The idea is to split the dataset into bins based on the angular difference from the frontal pose and average the MAE over the bins [18]

$$\begin{aligned} \textrm{BMAE}\ =\ \frac{d}{k} \sum _{i} \phi _{i, i+d}\ \ \ \ \ i \in d{\mathbb {N}}\cap [0, k], \end{aligned}$$

where \(\phi _{i, i+d}\) is the MAE over all hypotheses whose angular difference between the ground-truth and the frontal pose lies between i and \(i+d\), d is the bin size and k is the maximum angle considered.
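A possible implementation of BMAE is sketched below; it averages the per-bin MAEs, which coincides with the (d/k)-weighted sum of the definition when all bins are populated (the bin width d = 15 and the maximum angle k = 75 are arbitrary example values):

import numpy as np

def bmae(abs_errors, dist_from_frontal, d=15, k=75):
    """abs_errors: per-sample absolute angular errors;
    dist_from_frontal: per-sample angular distance of the ground truth from the frontal pose."""
    abs_errors = np.asarray(abs_errors)
    dist = np.asarray(dist_from_frontal)
    bin_maes = []
    for lo in range(0, k, d):                        # bins [lo, lo + d)
        mask = (dist >= lo) & (dist < lo + d)
        if mask.any():
            bin_maes.append(abs_errors[mask].mean())
    return float(np.mean(bin_maes))                  # average of the per-bin MAEs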

Two other metrics employed are the Standard Deviation (Std), which provides insight into the error distribution around the ground truth, and the Root Mean Squared Error (RMSE), which weights larger errors more heavily.

$$\begin{aligned} \textrm{RMSE} = \sqrt{\frac{1}{n} \sum _{i=1}^{n}\left( y_{i} - \hat{y}_{i}\right) ^{2}} \end{aligned}$$

RMSE takes the squared difference between the predicted value and the ground-truth value, weighting larger errors more heavily. Thus, a high variation in the predictions of an algorithm results in a higher overall error compared to the mean of the unsquared values [19].

Table 4 Evaluation results of head pose estimation on AFLW2000 [53] and BIWI [46]
Table 5 Evaluation results of head pose estimation on AFLW2000 [53] and BIWI [46] for methods exploiting depth data
Table 6 Evaluation results of head pose estimation on AFLW [45] (ordered by training pipeline)
Table 7 Evaluation results of head pose estimation on other databases
Table 8 Evaluation results of head pose estimation on other databases

Evaluation

Comparing different methods is a complex and delicate problem, due to the large number of different datasets that can be used for training and testing, and the different features that can be exploited by the models, such as depth information. The community is pushing for the adoption of well-defined evaluation pipelines, discussed in the following section, that allow for a fair comparison between models; results relative to this group are given in Table 4 (no depth) and Table 5 (depth). In Table 6 we report figures relative to evaluation on the AFLW dataset [45], although the precise pipeline may be different or unknown. Finally, many systems use ad-hoc datasets for training, testing or both, as is the case, for instance, in thematic scenarios like driving or video surveillance. Results relative to this latter group are provided in Tables 7 and 8, split into two parts for typographical reasons.

Evaluation Pipelines

Currently, in the state-of-the-art works [7, 8, 76, 82, 86, 87, 90, 91, 166, 187], there are two primary datasets for training, 300W-LP [53] and BIWI [46], and two corresponding main datasets for testing, AFLW2000-3D [53] and a part of BIWI [46].

The two most used evaluation protocols are [88]:

  • P1: Training is performed on a single dataset (300W-LP [53]), while BIWI [46] and AFLW2000-3D [53] are used as test sets. Only images with head rotation angles in the range [– 99\(^\circ\), + 99\(^\circ\)] are typically considered (in the case of AFLW2000, 31 images are discarded);

  • P2: Training and test sets are both derived from the BIWI dataset [46]. In some cases a random split is applied (typically 80% and 20% of the images), in others a split by subject (18 and 2 subjects); recently the most common is the split by sequence (16 and 8 sequences for training and testing, respectively), but n-fold cross-validation and leave-one-out cross-validation are also used in the literature.

However, a major drawback of the considered evaluation pipelines is that the head pose angles (including pitch, yaw and roll) are all in the range [– 99\(^\circ\), + 99\(^\circ\)], limiting the prediction of the models to a “narrow range” that makes them less effective on large-angle data, such as those acquired from security cameras [69].

For this reason, researchers frequently use additional head pose datasets. Zhou et al., for training the WHENet model [7], use the CMU Panoptic dataset [55] both to increase the amount of data and to obtain comprehensive yaw angles in the range [– 179\(^\circ\), + 179\(^\circ\)]. This is necessary to obtain a model optimized for the full range (360\(^\circ\)) of face orientations, which on such a task outperforms models exclusively trained on 300W-LP [53]. Albiero et al. [166] instead annotated the WIDER FACE database [189] using a deep learning regressor, and used it during training to increase the robustness of the model. Recently, Viet et al. [69] released the UET-Headpose dataset, also with uniform yaw angles in the range ±179\(^\circ\), which can be used as a new benchmark for full-range models.

Moreover, the semi-automatic pipeline used to label 300W-LP [53] and AFLW2000-3D [53] has been criticised for not producing accurate annotations for extreme poses and occluded faces [133]. Valle et al. [98] re-annotated AFLW2000-3D with poses estimated from the correct landmarks; this led to an improvement in model performance.

Other researchers train on synthetic datasets and test on real ones [58, 63, 157,158,159]. Kuhnke et al. [158] proposed novel benchmark datasets derived from BIWI [46] and SynHead [63], namely Biwi+, SynBiwi+ and SynHead++. They proposed these new datasets because SynHead was rendered using the Euler angles provided by BIWI, but with a different sequence of rotation axes; this rotation order, dissimilar to the BIWI one, causes several SynHead and BIWI images with the same label to show different head rotations. For this reason, the reannotated SynHead+ contains SynHead images with corrected angles. For every image in the BIWI dataset, SynBiwi+ has 10 corresponding images containing the 10 synthetic head models of SynHead, and SynHead++ is the union of SynHead+ and SynBiwi+. To further improve reproducibility, manually collected bounding boxes for BIWI are provided in the Biwi+ dataset.
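The rotation-order problem mentioned above is easy to reproduce: composing the same Euler angles in a different axis order yields a different head rotation. The angles below are arbitrary example values:

import numpy as np
from scipy.spatial.transform import Rotation as R

angles = [30.0, 20.0, 10.0]                       # degrees

R_zyx = R.from_euler("ZYX", angles, degrees=True).as_matrix()
R_xyz = R.from_euler("XYZ", angles, degrees=True).as_matrix()

# The two matrices differ, so data rendered with one convention but labelled
# with the other ends up with inconsistent ground truth.
print(np.allclose(R_zyx, R_xyz))                  # False
print(np.linalg.norm(R_zyx - R_xyz))              # non-zero discrepancy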

Another dataset often used in the literature, both for training and testing, is AFLW [45]; however, there is no single evaluation protocol shared by the many published studies. The most common is:

  • P3: Training and test sets are defined by a random split: 23,386 images are used for training the model (of which typically 2,000 are employed as a validation set) and 1,000 images for testing. More details about other evaluation pipelines for AFLW are given in Table 6.

Discussion

Head pose estimation is an active research field of computer vision. It remains a challenging task due to several intrinsic and extrinsic problems, and to the growing number of specialized application contexts [2]. We organize this discussion in four parts: datasets, methodologies, open problems, and research directions.

Datasets

New databases are released every year, because deep learning models require a huge quantity of data for training, but especially to overcome the limitations of previously released datasets, such as limited head rotation angle ranges, non-uniform distributions of angles, data captured in constrained environments, limited quality of ground-truth annotations, etc. (see Fig. 14).

Almost all the most recent databases have annotations for all three rotation angles (pitch, yaw and roll), mainly acquired using depth cameras or optical motion capture systems. This is a major improvement with respect to earlier datasets, which were acquired using direct suggestion or camera array methods, resulting in a discrete number of poses and annotations limited to one or two DoF.

The complexity of the images has grown from simple faces on a flat background to more complex scenarios with images acquired in the wild. However, a major drawback of the latter type is that the pose is typically annotated manually or estimated with neural networks trained on other datasets, leading to inaccuracies in the ground-truth annotations (see for example Fig. 15).

Fig. 14

Example of the distribution of the head rotation angles for the AFLW2000 dataset [53] (image from [195])

Another drawback of almost all the datasets is the data imbalance issue: the distribution between easy frontal faces and more challenging orientations is heavily unbalanced. Techniques to increase the number of hard faces [195] or to enhance the contribution of hard examples (such as HEM [150]) can be used to alter the data distribution and overcome this issue, making trained models more robust and with a better generalization capability (Fig. 14).
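One simple way to mitigate this imbalance, sketched below under the assumption of a 15-degree bin width, is to resample or weight training images by the inverse frequency of their yaw bin; hard-example mining schemes such as HEM [150] are more elaborate than this:

import numpy as np

def inverse_frequency_weights(yaw_deg, bin_width=15):
    """yaw_deg: per-sample ground-truth yaw in [-180, 180]; returns sampling probabilities."""
    yaw_deg = np.asarray(yaw_deg)
    bins = np.floor((yaw_deg + 180) / bin_width).astype(int)
    counts = np.bincount(bins, minlength=int(360 / bin_width) + 1)
    weights = 1.0 / counts[bins]          # rare (non-frontal) poses get larger weights
    return weights / weights.sum()

# the resulting weights can drive a weighted random sampler during training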

Fig. 15

Example of inaccuracies in the ground-truth annotations of the AFLW2000 dataset [53]. In some cases the results of the SADRNet [87] model are more accurate than the ground-truth. From the top row to the bottom row: the AFLW2000 [53] images, the sparse alignment results of SADRNet [87] and the corresponding ground-truth (blue for the former and red for the latter), the reconstructed face models of SADRNet [87], and the ground-truth face models [87]. Valle et al. [98] reannotated AFLW2000 with poses estimated from corrected landmarks and evaluated their MNN model: the MAE fell from 3.83 to 1.71 after the reannotation (image from [87])

Among all the databases, Boston University [34] is still used to evaluate head pose estimation methods even though it is one of the oldest; some model-based and segmentation-based methods obtain very accurate performance on it, as can be seen in Table 7. Pointing'04 [38] is also still employed for research purposes, even though it was introduced back in 2004, owing to its challenging nature and high image diversity.

BIWI Kinect [46] has become the de-facto benchmark dataset, with a high number of publications evaluating their models on it. However, this dataset has two main disadvantages: it is a narrow-range dataset (head rotation angles go from – 75\(^\circ\) to + 75\(^\circ\)), making it unsuitable for evaluating models optimized for full-range (360\(^\circ\)) head rotations; furthermore, its images were acquired in a constrained environment, so it is less challenging than others captured with varying lighting conditions, backgrounds or occlusions.

Nowadays, synthetic databases [58, 62, 63] enable a more precise evaluation and comparison of HPE methods because they contain nearly perfect ground-truth data. However, training solely on synthetic data can cause poor performance when testing on real-world data, due to the mismatch or shift of the underlying data distribution (domain gap). For this reason, training on a combination of synthetic and real data can improve the final result; see for example the FSA-Net [88] model tested on the BIWI dataset [46] in Table 7.

Recently, the most active sub-field seems to be driver head pose estimation: in the last five years, five public datasets addressing this specific scenario have been released, each with thousands or millions of images. This is mainly due to the increasing interest in driving assistance systems that aim to monitor the driver's attention, behaviour and intention, and to the fact that the head pose is a key element for obtaining accurate results [18, 19].

Methodologies

In parallel with the growing number and quality of available datasets, the number of head pose publications has constantly increased over the past few years. More and more people are interested in this area, leading to the development of many different and innovative approaches. Nowadays, deep learning and methods based on convolutional neural networks are the most pervasive: they are used to estimate the head pose from monocular images, from a set of detected facial landmarks, from a combination of both in a multi-task approach, or even to perform 3D dense face alignment/reconstruction, from which the head pose information is obtained as a by-product.

Segmentation-based methods are the only recently developed methods that mainly rely on classical machine learning models. They proved the existence of a strong correlation between face segments and the corresponding pose, and showed that a precise face segmentation may lead to very accurate pose estimations [30]. However, a severe drop in performance is often registered when segmentation is applied in unconstrained environments [32], which hence remain a challenge for future research.

What emerges most clearly from the literature is the strong correlation between face alignment and head pose estimation, which is exploited in different ways. Among the best performing methods are:

  • Xia et al. [82] perform face alignment and then create a landmark heatmap that is given as input (along with the facial image) to a CNN. They obtain the best result on the AFLW2000 dataset [53] because the heatmap generator improves the generalization ability by making the CNN focus on the area around the facial landmarks and significantly reducing the interference from the background. However, this method does not remarkably improve the performance on datasets taken under controlled conditions, such as BIWI [46].

  • Valle et al. [98] combine face alignment and head pose estimation in a multi-task model, improving the overall performance and obtaining the best result on the AFLW dataset [45].

  • Xin et al. [187] construct a landmark-connection graph to model the complex non-linear mapping between graph topologies and head pose angles. Their model has the lowest MAE, when trained and tested on the BIWI dataset [46], among the models that use only RGB data.

  • Wu et al. [86] exploit facial landmarks to guide 3D facial geometry learning. The pose in this case is a by-product that a backbone network learns during 3DMM parameter regression. SynergyNet outperforms all deep learning regressors on the AFLW2000 dataset [53].

A different class of models that looks particularly promising are those based on 3DMM. They focus on face reconstruction and incorporate occlusion-aware mechanisms that are very useful in complex scenarios. Moreover, because these methods do not use any ground-truth head pose label during training, they do not suffer from the inaccurate head pose labels that exist in most publicly available training datasets. Room for improvement might exist in designing specialized loss functions and in addressing the head pose estimation task specifically.

From Table 3 we can see that almost all the models can estimate 3 DoF; actually, some of them (such as the 3DMM-based ones) can estimate 6 DoF, but the databases are mainly equipped with 3 DoF or fewer. This highlights a great evolution: until a few years ago, researchers focused more on yaw estimation, because of its importance in applications such as human attention and gaze estimation. Deep learning changed the trend, and all three rotation angles are currently addressed in most works.

From Table 5 we observe that methods using depth data, alone or in conjunction with RGB information, can usually achieve better results. In particular, the use of depth data enhances the efficacy under challenging illumination conditions and occlusions, making these models suitable for particularly complex scenarios, such as automotive ones. From Table 7 we can see that, recently, thermal infrared (IR) images have also been used as input for HPE algorithms, in some cases obtaining better results than with depth information. However, depth or infrared data are not always available in real-world contexts, and the sensors are also quite expensive; therefore, methods based only on monocular images generalize better and are simpler to deploy.

Issues and Problems

The main problem that emerges from this analysis is that different experimental set-ups and different validation protocols are adopted for HPE algorithms, and this strongly influences the evaluation, making comparisons difficult. Another source of noise comes from the preprocessing phase, which may easily result in the detection of different bounding boxes/facial keypoints, eventually influencing further elaboration steps.

Fig. 16

Influence of bbox margin and background on head pose estimation: (a) influence of the bbox margin: the values predicted by FSA-Net [88] change significantly with the bounding box size on all three axes, showing that the network is not robust to changes of the bbox margin; (b) influence of the background: the values predicted by SSR-Net-MD [88] are not robust to different backgrounds, e.g. the offset of pitch and yaw between A1 and A2 is about 5\(^\circ\) (images from [153])

Coming to more technical problems, Shao et al. [179] discovered in their experiments that the bounding box margin has a large impact on the final accuracy of the model; head pose estimators are vulnerable to changes in the background scene around the target face, as shown in Fig. 16.

To solve this problem, Xue et al. [153] propose a convolutional cropping module (CCM) that learns to crop the input image to an attentional area for head pose regression, together with a background augmentation technique that makes the network more robust to background noise. In their experiments, the MAE of SSR-Net-MD [88] fell from 6.01 to 5.38 and that of FSA-Net [88] from 5.25 to 5.13 thanks to the CCM and background augmentation. If, on the one hand, this shows that there are techniques to improve the results, on the other hand, differences in the way the bounding boxes are obtained prevent a valid comparison of HPE methods.

The same problem emerged for face landmark detectors, as shown by Xin et al. [187] in their experiments, reported in Table 9.

Table 9 Influence of different landmark detectors for EVA-GCN performance

Also, the impact of image quality has received little study in the literature. When few low-quality images are present in the training data, networks can easily fail to cope with these under-represented cases. Using synthesized low-resolution samples and data augmentation during training is a delicate trade-off between the positive gain deriving from more diverse training instances and the additional difficulty related to the higher problem complexity. It has been shown that when the resolution variation increases, the performance on the original High-Resolution (HR) samples drops [8]. Few studies have been conducted on establishing a resolution-agnostic HPE framework [170].

The last question concerns the evaluation metrics used. MAE is the standard evaluation metric, but it is appropriate only for narrow-range models, as explained in section “Evaluation Metrics”. It is worth noting that Cao et al. [76] also criticise the use of the MAE of Euler angles as an evaluation metric, since according to them it cannot correctly measure the performance on profile images. They propose to use the Mean Absolute Error of Vectors (MAEV) to assess the performance: three vectors, extracted from the rotation matrix, describe the head pose, and the difference between the ground-truth vectors and the predicted ones is computed. They showed that this representation is more consistent and that MAEV is a more reliable indicator for the evaluation of pose estimation results (see Fig. 17).

Fig. 17

Comparison of pose estimation results with MAE and MAEV metrics on AFLW2000 profile images. All models are trained on 300W-LP (image from [76])

The MAWE metric (details in section “Evaluation Metrics”) could be a better choice: first, it can be used with the Euler angle representation; second, when used to evaluate narrow-range methods it gives the same result as MAE; third, narrow-range methods have now reached very high accuracy, and it seems the time has come to switch to full-range methods with MAWE as the main evaluation metric.

Research Directions

Due to the growing specialization of the field on ad-hoc contexts and tasks, it is natural to expect more and more investigation of topics like domain adaptation, partial domain adaptation, learning with inaccurate or semi-supervised labels, and knowledge transfer.

For similar reasons, we expect an increasing application of multi-task learning, which has seen a steady and strong development from 2017 to today. Head pose can be used as the principal task to enhance other face-related subtasks, including gender classification, expression detection and identity recognition.

For deformable models, an important improvement would be the ability to selectively ignore parts of the model that are self-occluded, overcoming a fundamental limitation in an otherwise very promising category, especially in unconstrained conditions.

Another interesting direction, not yet explored, is the use of deep learning in segmentation-based methods. A possibility is to use convolutional neural networks to regress pose angles from segmented faces; alternatively, segmentation-based methods could be extended through geometric/deformable methods, where the feature extraction and classification could exploit specific deep learning architectures.

Finally, only Malakshan et al. [170] explored the use of generative models, showing that HPE can be effectively solved in conjunction with other face-related tasks typically associated with the generative field. This seems a very interesting possibility, which showed promising results in another partially unexplored area of the HPE task: extreme low-resolution images. We expect the development of a specific sub-field that studies these techniques.

Although general head pose estimation will continue to be an exciting field with a lot of room for improvement, we expect an even stronger development of specific sub-fields addressing thematic application areas, such as security and surveillance, recently addressed with the release of the GOTCHA-I [66] database, or driver head pose estimation, which is already a very active field [16,17,18,19,20, 68]. Indeed, the role of head pose estimation in driving systems is becoming more and more important: by monitoring the head pose of the driver in real time and analysing the driver's behaviour, it will be possible to determine whether the driving status is good, with a profound impact on the future of automotive safety.

We expect new datasets will continue to be released with an increasing focus on 6 degrees of freedom and full range head angles, thanks to the development of new cheap and powerful RGB-D cameras (such as Microsoft Kinect), and other acquisition techniques.

Conclusion

Head pose estimation is a very important task for human-computer interaction, since it provides rich information about the intent, motivation and visual attention of people.

Despite the extensive research in this field, especially during the last years, HPE still remains challenging when images are collected under unconstrained conditions.

In this article, we presented a detailed list of publicly available databases and gave an in-depth survey of head pose estimation methods, briefly mentioning the oldest, no longer used classical approaches and then providing an extensive analysis of modern techniques, mainly based on deep learning. Indeed, most current head pose estimation methods exploit convolutional neural networks, from direct regressors to deformable approaches, passing through multi-task learning. We also presented a comparative analysis of the state-of-the-art performance obtained so far in the field, providing organized and informative tables.

The article also discusses and suggests possible directions for future work. In particular, we expect the introduction of new light DL architectures that can perform well on challenging datasets, i.e., those collected in unconstrained environments.

We also expect the development of new sub-fields with dedicated databases and evaluation pipelines, such as the “driver head pose estimation” that is already very active.

An important trend observed is that the number of head pose publications has constantly increased in the past few years. This is a sign that more and more people are interested in this area, which means that the development cycle of new methods will be faster. A constant and periodic updating of the literature is therefore important.

We hope that this survey may help to clarify the evolution of the field and its evaluation methodologies and techniques, thanks to the comprehensive list of datasets, methods and algorithms provided.