Abstract
Human head pose estimation in images has applications in many fields, such as human–computer interaction and video surveillance. In this work, we address this problem, defined here as the estimation of both vertical (tilt/pitch) and horizontal (pan/yaw) angles, using a single Convolutional Neural Network (ConvNet) model that balances precision and inference speed to maximize its usability in real-world applications. Our model is trained on the combination of two datasets: ‘Pointing’04’ (to cover a wide range of poses) and ‘Annotated Facial Landmarks in the Wild’ (to improve the robustness of the model on real-world images). Three different partitions of the combined dataset are defined and used for training, validation and testing. As a result of this work, we have obtained a trained ConvNet model, coined RealHePoNet, that, given a low-resolution grayscale input image and without using facial landmarks, estimates both tilt and pan angles with low error (\(\sim 4.4^{\circ}\) average error on the test partition). Given its low inference time (6 ms per head), we consider our model usable even when paired with medium-spec hardware (e.g. a GTX 1060 GPU).
Code available at: https://github.com/rafabs97/headpose_final
Demo video at: https://www.youtube.com/watch?v=2UeuXh5DjAE
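As an illustration of how an average angular error such as the one reported above could be computed, the following minimal sketch evaluates the mean absolute error (MAE) separately for tilt and pan and then averages the two. It assumes that the reported figure is an MAE averaged over both angles, which the abstract does not state explicitly; the function name `mean_absolute_error` and the sample angles are hypothetical and are not taken from the released code.

```python
from typing import Sequence, Tuple

# A pose is a (tilt, pan) pair in degrees. This layout is an assumption
# made for the example, not the format used by the authors' code.
Pose = Tuple[float, float]


def mean_absolute_error(
    y_true: Sequence[Pose], y_pred: Sequence[Pose]
) -> Tuple[float, float, float]:
    """Return (tilt MAE, pan MAE, mean of both) in degrees."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Absolute error per sample, accumulated independently for each angle.
    tilt_mae = sum(abs(t[0] - p[0]) for t, p in zip(y_true, y_pred)) / n
    pan_mae = sum(abs(t[1] - p[1]) for t, p in zip(y_true, y_pred)) / n
    return tilt_mae, pan_mae, (tilt_mae + pan_mae) / 2


# Made-up ground-truth and predicted poses, for illustration only.
truth = [(0.0, 0.0), (15.0, -30.0), (-30.0, 60.0)]
pred = [(2.0, -4.0), (12.0, -24.0), (-31.0, 65.0)]
result = mean_absolute_error(truth, pred)
print(result)  # (2.0, 5.0, 3.5)
```

Reporting the two angles separately before averaging makes it visible when one of them dominates the overall error.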
Abbreviations
- AFLW: Annotated Facial Landmarks in the Wild
- CNN: Convolutional Neural Network
- Conv: Convolution
- ConvNet: Convolutional Neural Network
- CT: Confidence Threshold
- FC: Fully connected
- flops: Floating point operations per second
- FPS: Frames per second
- HPE: Head pose estimation
- IoU: Intersection over Union
- MAE: Mean Absolute Error
- MSE: Mean Squared Error
- SSD: Single Shot Detector
Acknowledgements
This work has been partially funded by the Spanish projects TIN2019-75279-P and RED2018-102511-T. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
Ethics declarations
Conflicts of Interest
The authors declare that they have no conflict of interest.
Code availability
Code is publicly available at: https://github.com/rafabs97/headpose_final/.
About this article
Cite this article
Berral-Soler, R., Madrid-Cuevas, F.J., Muñoz-Salinas, R. et al. RealHePoNet: a robust single-stage ConvNet for head pose estimation in the wild. Neural Comput & Applic 33, 7673–7689 (2021). https://doi.org/10.1007/s00521-020-05511-4
DOI: https://doi.org/10.1007/s00521-020-05511-4