Abstract
Recovering multi-person 3D poses with absolute scales from a single RGB image is a challenging problem due to the inherent depth and scale ambiguity of a single view. Resolving this ambiguity requires aggregating cues from the entire image, such as body sizes, scene layout, and inter-person relationships. However, most previous methods adopt a top-down scheme that first performs 2D pose detection and then regresses the 3D pose and scale for each detected person individually, ignoring global contextual cues. In this paper, we propose a novel system that first regresses a set of 2.5D representations of body parts and then reconstructs the 3D absolute poses from these 2.5D representations with a depth-aware part association algorithm. Such a single-shot bottom-up scheme allows the system to better learn and reason about inter-person depth relationships, improving both 3D and 2D pose estimation. The experiments demonstrate that the proposed approach achieves state-of-the-art performance on the CMU Panoptic and MuPoTS-3D datasets and is applicable to in-the-wild videos.
J. Zhen and Q. Fang—Equal contribution.
X. Zhou—State Key Lab of CAD&CG.
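The abstract's final reconstruction step, lifting a 2.5D joint estimate (pixel coordinates plus an absolute depth per joint) to a 3D camera-frame pose, can be sketched with the standard pinhole back-projection model. This is a minimal illustration, not the authors' full pipeline; the function name and the intrinsics values are assumptions for the example.

```python
import numpy as np

def backproject_25d(joints_2d, depths, fx, fy, cx, cy):
    """Lift 2.5D joint estimates (pixel coordinates + absolute depth per
    joint) into 3D camera coordinates with the pinhole camera model.

    joints_2d: (J, 2) array of (u, v) pixel coordinates
    depths:    (J,) array of absolute depths Z (e.g., in meters)
    fx, fy:    focal lengths in pixels; cx, cy: principal point
    """
    u, v = joints_2d[:, 0], joints_2d[:, 1]
    # Invert the projection u = fx * X / Z + cx (and likewise for v).
    X = (u - cx) * depths / fx
    Y = (v - cy) * depths / fy
    return np.stack([X, Y, depths], axis=1)  # (J, 3) camera-frame points

# Example: two joints of one person, with hypothetical intrinsics.
joints = np.array([[320.0, 240.0], [350.0, 300.0]])
z = np.array([3.0, 3.1])
pts3d = backproject_25d(joints, z, fx=1000.0, fy=1000.0, cx=320.0, cy=240.0)
```

Because depth and scale trade off under this projection, per-joint depths cannot be recovered from a single view without the global cues the paper's single-shot scheme is designed to exploit.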
Acknowledgements
The authors would like to acknowledge support from NSFC (No. 61806176), Fundamental Research Funds for the Central Universities (2019QNA5022) and ZJU-SenseTime Joint Lab of 3D Vision.
Electronic supplementary material: Supplementary material 2 (mp4, 49595 KB)
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Zhen, J. et al. (2020). SMAP: Single-Shot Multi-person Absolute 3D Pose Estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12360. Springer, Cham. https://doi.org/10.1007/978-3-030-58555-6_33
Print ISBN: 978-3-030-58554-9
Online ISBN: 978-3-030-58555-6