RealHePoNet: a robust single-stage ConvNet for head pose estimation in the wild

Berral-Soler, Rafael; Madrid-Cuevas, Francisco J.; Muñoz-Salinas, Rafael; Marín-Jiménez, Manuel J.

doi:10.1007/s00521-020-05511-4

Original Article
Published: 20 November 2020

Neural Computing and Applications, volume 33, pages 7673–7689 (2021)

Rafael Berral-Soler¹,
Francisco J. Madrid-Cuevas¹,
Rafael Muñoz-Salinas¹ &
Manuel J. Marín-Jiménez¹ (ORCID: orcid.org/0000-0001-9294-6714)

Abstract

Human head pose estimation in images has applications in many fields, such as human–computer interaction and video surveillance. In this work, we address this problem, defined here as the estimation of both vertical (tilt/pitch) and horizontal (pan/yaw) angles, using a single Convolutional Neural Network (ConvNet) model that balances precision and inference speed in order to maximize its usability in real-world applications. Our model is trained on the combination of two datasets: ‘Pointing’04’ (to cover a wide range of poses) and ‘Annotated Facial Landmarks in the Wild’ (to improve the robustness of the model on real-world images). Three different partitions of the combined dataset are defined and used for training, validation and testing. As a result of this work, we have obtained a trained ConvNet model, coined RealHePoNet, that, given a low-resolution grayscale input image and without needing facial landmarks, estimates both tilt and pan angles with low error (about 4.4° average error on the test partition). Also, given its low inference time (6 ms per head), we consider our model usable even when paired with medium-spec hardware (e.g. a GTX 1060 GPU).

Code available at: https://github.com/rafabs97/headpose_final
Demo video at: https://www.youtube.com/watch?v=2UeuXh5DjAE
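To make the two headline figures concrete, the following is a minimal sketch of how an average angular error over tilt and pan could be computed, and how a per-head inference time translates into throughput. It assumes the reported average error is a mean absolute error (MAE) averaged over both angles; the function names and the sample poses are hypothetical and are not taken from the authors' code.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference, in degrees, between paired angle lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def pose_error(pred_poses, true_poses):
    """Return (tilt MAE, pan MAE, mean of both) for lists of (tilt, pan) pairs."""
    tilt = mean_absolute_error([p[0] for p in pred_poses],
                               [t[0] for t in true_poses])
    pan = mean_absolute_error([p[1] for p in pred_poses],
                              [t[1] for t in true_poses])
    return tilt, pan, (tilt + pan) / 2.0


def heads_per_second(ms_per_head):
    """Convert a per-head inference time in milliseconds to heads per second."""
    return 1000.0 / ms_per_head


# Made-up (tilt, pan) poses in degrees, for illustration only.
predicted = [(10.0, -30.0), (-5.0, 45.0), (0.0, 0.0)]
ground_truth = [(15.0, -25.0), (-5.0, 50.0), (3.0, -4.0)]

tilt_mae, pan_mae, mean_mae = pose_error(predicted, ground_truth)
print(f"tilt MAE: {tilt_mae:.2f}, pan MAE: {pan_mae:.2f}, mean: {mean_mae:.2f}")

# 6 ms per head is the inference time reported in the abstract.
print(f"throughput at 6 ms per head: {heads_per_second(6.0):.1f} heads/s")
```

Under this reading, 6 ms per head corresponds to roughly 166 heads per second for the pose network alone; the time spent detecting heads in a frame is not included in that figure.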

Abbreviations

AFLW: Annotated Facial Landmarks in the Wild
CNN: Convolutional Neural Network
Conv: Convolution
ConvNet: Convolutional Neural Network
CT: Confidence Threshold
FC: Fully connected
flops: Floating point operations per second
FPS: Frames per second
HPE: Head pose estimation
IoU: Intersection over Union
MAE: Mean Absolute Error
MSE: Mean Squared Error
SSD: Single Shot Detector
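As an illustration of one of the terms above, Intersection over Union (IoU) measures the overlap between two bounding boxes as the area of their intersection divided by the area of their union. The sketch below assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; this box format and the function name are assumptions for illustration, and this page does not state how the article itself applies IoU.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region, clamped at zero
    # when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Degenerate (zero-area) boxes give a zero union; report no overlap.
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A value of 1 means the boxes coincide and 0 means they do not overlap at all.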

Acknowledgements

This work has been partially funded by the Spanish projects TIN2019-75279-P and RED2018-102511-T. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Author information

Authors and Affiliations

Department of Computing and Numerical Analysis, University of Cordoba, Cordoba, Spain
Rafael Berral-Soler, Francisco J. Madrid-Cuevas, Rafael Muñoz-Salinas & Manuel J. Marín-Jiménez

Corresponding author

Correspondence to Manuel J. Marín-Jiménez.

Ethics declarations

Conflicts of Interest

The authors declare that they have no conflict of interest.

Code availability

Code is publicly available at: https://github.com/rafabs97/headpose_final/.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Berral-Soler, R., Madrid-Cuevas, F.J., Muñoz-Salinas, R. et al. RealHePoNet: a robust single-stage ConvNet for head pose estimation in the wild. Neural Comput & Applic 33, 7673–7689 (2021). https://doi.org/10.1007/s00521-020-05511-4

Received: 11 May 2020
Accepted: 04 November 2020
Published: 20 November 2020
Issue Date: July 2021
DOI: https://doi.org/10.1007/s00521-020-05511-4
