Multiple Teacher Knowledge Distillation for Head Pose Estimation Without Keypoints

  • Original Research
SN Computer Science

Abstract

In recent years, human head pose estimation has played a significant role in facial analysis, with a variety of practical applications such as gaze estimation, virtual reality, and driver assistance. Given its importance, in this paper we propose a lightweight model for the task of head pose estimation. First, the teacher models are trained on the synthetic dataset 300W-LPA to obtain head pose pseudo labels; then an architecture with a ResNet18 backbone is adopted and trained on the ensemble of these pseudo labels via a knowledge distillation process. The real-world head pose datasets AFLW-2000 and BIWI are used to evaluate the efficacy of our proposed approach. Experimental results show a significant improvement in testing accuracy over other state-of-the-art head pose estimation methods. Furthermore, our model runs in real time at \(\sim\)300 FPS when inferring on a Tesla V100. Source code and pre-trained weights are available at github.com/chientv99/headpose.
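The distillation step summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state how the teachers' pseudo labels are combined or which loss the student minimizes, so the plain averaging of (yaw, pitch, roll) angles and the mean squared error used here are assumptions, and the function names `ensemble_pseudo_label` and `distillation_loss` are hypothetical rather than taken from the released code.

```python
from statistics import fmean
from typing import List, Tuple

# A head pose as (yaw, pitch, roll) in degrees.
Pose = Tuple[float, float, float]


def ensemble_pseudo_label(teacher_preds: List[Pose]) -> Pose:
    """Combine several teachers' predictions for one image.

    Assumption: the ensemble is a plain per-angle average. This is only
    sensible when the angles stay well away from the +/-180 degree wrap.
    """
    if not teacher_preds:
        raise ValueError("need at least one teacher prediction")
    yaw, pitch, roll = (fmean(axis) for axis in zip(*teacher_preds))
    return (yaw, pitch, roll)


def distillation_loss(student_pred: Pose, pseudo_label: Pose) -> float:
    """Mean squared error between student output and ensemble pseudo label.

    Assumption: MSE stands in for whatever loss the paper actually uses.
    """
    return fmean((s - t) ** 2 for s, t in zip(student_pred, pseudo_label))


if __name__ == "__main__":
    # Three hypothetical teacher outputs for the same face image.
    teachers = [(10.0, -5.0, 2.0), (14.0, -3.0, 0.0), (12.0, -4.0, 1.0)]
    label = ensemble_pseudo_label(teachers)
    print("pseudo label:", label)
    print("student loss:", distillation_loss((13.0, -4.0, 1.0), label))
```

In an actual training loop the student (a ResNet18-based network, per the abstract) would be updated by gradient descent on such a loss; the sketch only shows how a single target and loss value could be formed.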

Availability of data and materials

The modified head pose dataset will be available soon at https://drive.google.com/drive/folders/11N3O-eONLXGRrr_x9PJRjBQVibiK32dO?usp=sharing.

Code availability

All source code for the head pose estimation method is available at https://github.com/chientv99/headpose.

Funding

Not applicable.

Author information

Corresponding author

Correspondence to Chien Thai.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances on Pattern Recognition Applications and Methods 2022” guest edited by Ana Fred, Maria De Marsico and Gabriella Sanniti di Baja.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Thai, C., Nham, N., Tran, V. et al. Multiple Teacher Knowledge Distillation for Head Pose Estimation Without Keypoints. SN COMPUT. SCI. 4, 758 (2023). https://doi.org/10.1007/s42979-023-02233-x

  • DOI: https://doi.org/10.1007/s42979-023-02233-x
