EfficientPose: Efficient human pose estimation with neural architecture search

Abstract

Human pose estimation from images and videos is a key task in many multimedia applications. Previous methods achieve strong performance but rarely take efficiency into consideration, which makes the networks difficult to deploy on lightweight devices. Real-time multimedia applications now call for more efficient models to enable better interaction. Moreover, most deep neural networks for pose estimation directly reuse backbones designed for image classification, which are not optimized for the pose estimation task. In this paper, we propose an efficient framework for human pose estimation with two parts: an efficient backbone and an efficient head. By applying a differentiable neural architecture search method, we customize the backbone network design for pose estimation and reduce computational cost with negligible accuracy degradation. For the efficient head, we slim the transposed convolutions and propose a spatial information correction module to improve the final prediction. In experiments, we evaluate our networks on the MPII and COCO datasets. Our smallest model requires only 0.65 GFLOPs while achieving 88.1% PCKh@0.5 on MPII, and our large model needs only 2 GFLOPs while its accuracy is competitive with the state-of-the-art large model, HRNet, which takes 9.5 GFLOPs.
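For readers unfamiliar with the PCKh@0.5 figure quoted above, the following is a minimal sketch of how such a score is commonly computed: a predicted keypoint counts as correct when its distance to the ground truth is at most 0.5 times the person's head segment length. The function name `pckh`, its arguments, and the toy coordinates are illustrative assumptions, not the authors' evaluation code; the official MPII protocol also skips joints that are not annotated, which this sketch omits.

```python
import math


def pckh(pred, gt, head_sizes, alpha=0.5):
    """Fraction of keypoints whose error is within alpha * head size.

    pred, gt:    list of persons, each a list of (x, y) keypoints.
    head_sizes:  one head segment length per person, in the same
                 units as the coordinates (assumed to be given).
    """
    correct = 0
    total = 0
    for person_pred, person_gt, head in zip(pred, gt, head_sizes):
        for (px, py), (gx, gy) in zip(person_pred, person_gt):
            dist = math.hypot(px - gx, py - gy)
            # A keypoint is a hit when it lies within the normalized threshold.
            correct += dist <= alpha * head
            total += 1
    return correct / total if total else 0.0


# Toy data, made up for illustration: one person, three joints, head size 10.
pred = [[(10.0, 10.0), (20.0, 24.0), (40.0, 40.0)]]
gt = [[(10.0, 12.0), (20.0, 20.0), (30.0, 40.0)]]
score = pckh(pred, gt, head_sizes=[10.0])
```

In this toy case the three keypoint errors are 2, 4, and 10 pixels against a threshold of 5, so two of the three joints count as correct.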


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) (Nos. 61733007 and 61876212) and Zhejiang Lab (No. 2019NB0AB02).

Author information

Corresponding author

Correspondence to Xinggang Wang.

Additional information

Wenqiang Zhang is a master's student in the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan. His research interests include pose estimation and neural architecture search.

Jiemin Fang received his B.E. degree from the School of Electronic Information and Communications, Huazhong University of Science and Technology in 2018. He is currently a Ph.D. candidate at the Institute of Artificial Intelligence and School of Electronic Information and Communications, Huazhong University of Science and Technology. His research interests include AutoML and efficient deep learning.

Xinggang Wang received his B.S. and Ph.D. degrees in electronics and information engineering from Huazhong University of Science and Technology, in 2009 and 2014, respectively. He is currently an associate professor with the School of Electronic Information and Communications, HUST. His research interests include computer vision and machine learning.

Wenyu Liu received his B.S. degree in computer science from Tsinghua University, Beijing, China, in 1986, and his M.S. and Ph.D. degrees, both in electronics and information engineering, from Huazhong University of Science and Technology (HUST), in 1991 and 2001, respectively. He is now a professor and associate dean of the School of Electronic Information and Communications, HUST. His current research areas include computer vision, multimedia, and machine learning.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.

About this article

Cite this article

Zhang, W., Fang, J., Wang, X. et al. EfficientPose: Efficient human pose estimation with neural architecture search. Comp. Visual Media (2021). https://doi.org/10.1007/s41095-021-0214-z

Keywords

  • pose estimation
  • neural architecture search
  • efficient deep learning