
Lightweight head pose estimation without keypoints based on multi-scale lightweight neural network

  • Original article
  • Published in: The Visual Computer

Abstract

Head pose estimation methods that do not rely on facial keypoints have emerged as a promising research field. However, several challenges remain unsolved: current methods incur high computational cost, require large amounts of memory, and are difficult to deploy in practical applications. To overcome these issues, we propose a lightweight, high-precision head pose estimation method based on a dual-stream convolutional neural network. The network comprises a dual-stream lightweight backbone, an external attention module, and a soft stagewise regression (SSR) module. The dual-stream lightweight backbone extracts features from the original image more effectively while keeping computational overhead low. The external attention module enhances the feature maps extracted by the backbone and improves feature attention. The SSR module calculates the probability of the head pose in each direction and predicts the head pose by regression. Extensive experiments on the Annotated Facial Landmarks in the Wild (AFLW2000) and Biwi Kinect Head Pose Database (BIWI) datasets demonstrate that the proposed model has fewer parameters and lower estimation errors than recent state-of-the-art head pose estimation methods.
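The regression step of an SSR module can be illustrated with a minimal Python sketch of soft stagewise regression in the style of SSR-Net (Yang et al., 2018), which FSA-Net also applies to head pose. This is a simplified sketch under stated assumptions, not the implementation used in this article: the learned dynamic-range terms (per-stage shift and scale) are omitted, bins are centred so that an odd bin count places one bin at zero, and the function name `ssr_predict`, the scale `v = 99` and the three-stage, three-bin layout are hypothetical illustrative choices.

```python
from typing import Sequence


def ssr_predict(stage_probs: Sequence[Sequence[float]], v: float = 99.0) -> float:
    """Simplified soft stagewise regression for a single angle (hypothetical helper).

    stage_probs[k] is assumed to be the softmax output of stage k over its s_k bins.
    Bin i of stage k is centred at offset (i - s_k // 2). The learned dynamic-range
    shift/scale terms of SSR-Net are deliberately omitted in this sketch.
    """
    pred = 0.0
    scale = 1.0  # running product s_1 * s_2 * ... * s_k
    for probs in stage_probs:
        s_k = len(probs)
        if s_k == 0 or abs(sum(probs) - 1.0) > 1e-6:
            raise ValueError("each stage must be a non-empty probability distribution")
        scale *= s_k
        # Expected bin offset of this stage; dividing by the cumulative bin count
        # makes each later stage refine the estimate inside earlier, coarser bins.
        expected = sum(p * (i - s_k // 2) for i, p in enumerate(probs))
        pred += expected / scale
    return v * pred
```

For example, with three stages of three bins each, putting all probability on the upper bin of every stage gives 99 × (1/3 + 1/9 + 1/27) ≈ 47.7, while uniform probabilities give 0. Applying the function separately to the yaw, pitch and roll stage outputs yields the three Euler angles. Because each stage divides by a larger cumulative bin count, a few coarse bins per stage are enough to produce a fine-grained continuous estimate, which is why this kind of head is attractive for compact models.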


[Figs. 1–17: images not reproduced]


Data availability statement

The data that support the findings of this study are openly available in public repositories: 300W-LP [http://www.cbsr.ia.ac.cn/users/xiangyuzhu/projects/3ddfa/main.htm], AFLW2000 [http://www.cbsr.ia.ac.cn/users/xiangyuzhu/projects/3DDFA/main.htm], and the BIWI dataset [https://data.vision.ee.ethz.ch/cvl/gfanelli/head_pose/head_forest.html#db].


Acknowledgements

This study was supported by the National Natural Science Foundation of China (Nos. 61967012, 61866022, and 61861027) and the Science and Technology Program of Gansu Province (Grant No. 20JR5RA459).

Author information

Corresponding author

Correspondence to Xiaolei Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest related to this work, and no commercial or associative interest that represents a conflict of interest in connection with the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, X., Lu, Y., Cao, B. et al. Lightweight head pose estimation without keypoints based on multi-scale lightweight neural network. Vis Comput 39, 2455–2469 (2023). https://doi.org/10.1007/s00371-023-02781-6



  • DOI: https://doi.org/10.1007/s00371-023-02781-6
