Two-stage multi-view deep network for 3D human pose reconstruction using images and its 2D joint heatmaps through enhanced stack-hourglass approach

Verma, Pratishtha; Srivastava, Rajeev

doi:10.1007/s00371-021-02120-7

Original article
Published: 24 May 2021

Volume 38, pages 2417–2430, (2022)
The Visual Computer

Abstract

Humans can easily infer the 3D pose of a person from a 2D image, but 3D human pose reconstruction (HPR) remains a challenging task for machines. Traditional methods for 3D HPR reconstruct the 3D pose either directly from the image or from 2D joint locations. Each of these strategies has its merits and demerits. In this paper, we combine the merits of both within a deep architecture. The model exploits both sources of information concurrently in a multi-view scenario and fuses them in the subsequent stage using early and late fusion strategies. We also introduce an enhanced stack-hourglass network for predicting 2D keypoint heatmaps. The predicted 2D keypoint heatmaps and the image are then fed to a simple CNN architecture, together with both fusion strategies, for 3D pose reconstruction. Experimental results show that the proposed method achieves performance comparable to state-of-the-art methods on the MPII and Human3.6M datasets.
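
The abstract names early and late fusion of the image and its 2D joint heatmaps but does not say how either is implemented. The sketch below is only a generic illustration of the two ideas, not the authors' network: the function names, the 16-joint count, the 64×64 resolution, the feature size, and the pooled linear map standing in for a CNN branch are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_JOINTS, HEIGHT, WIDTH, FEAT_DIM = 16, 64, 64, 32


def toy_branch(x, weights):
    """Stand-in for a CNN branch: global average pooling over the
    spatial axes, followed by a linear map and a ReLU."""
    pooled = x.mean(axis=(1, 2))              # (channels,)
    return np.maximum(weights @ pooled, 0.0)  # (FEAT_DIM,)


def early_fusion(image, heatmaps, w_joint):
    """Early fusion: stack image and heatmap channels first,
    then run a single branch on the combined input."""
    stacked = np.concatenate([image, heatmaps], axis=0)  # (3 + J, H, W)
    return toy_branch(stacked, w_joint)


def late_fusion(image, heatmaps, w_image, w_heatmap):
    """Late fusion: run one branch per input,
    then concatenate the resulting feature vectors."""
    return np.concatenate([
        toy_branch(image, w_image),
        toy_branch(heatmaps, w_heatmap),
    ])


# Random placeholders for one RGB view and its predicted joint heatmaps.
image = rng.random((3, HEIGHT, WIDTH))
heatmaps = rng.random((NUM_JOINTS, HEIGHT, WIDTH))

w_joint = rng.standard_normal((FEAT_DIM, 3 + NUM_JOINTS))
w_image = rng.standard_normal((FEAT_DIM, 3))
w_heatmap = rng.standard_normal((FEAT_DIM, NUM_JOINTS))

early = early_fusion(image, heatmaps, w_joint)
late = late_fusion(image, heatmaps, w_image, w_heatmap)
print(early.shape, late.shape)
```

In this toy setup, early fusion lets a single branch see both inputs from its first layer, while late fusion keeps the two branches independent until their features are joined. In either case a regression head for the 3D joint coordinates would follow.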

Author information

Authors and Affiliations

IIT BHU, Varanasi, 221005, India
Pratishtha Verma & Rajeev Srivastava

Corresponding author

Correspondence to Pratishtha Verma.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Verma, P., Srivastava, R. Two-stage multi-view deep network for 3D human pose reconstruction using images and its 2D joint heatmaps through enhanced stack-hourglass approach. Vis Comput 38, 2417–2430 (2022). https://doi.org/10.1007/s00371-021-02120-7

Accepted: 22 March 2021
Published: 24 May 2021
Issue Date: July 2022
DOI: https://doi.org/10.1007/s00371-021-02120-7
