SMAP: Single-Shot Multi-person Absolute 3D Pose Estimation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12360)

Abstract

Recovering multi-person 3D poses with absolute scales from a single RGB image is a challenging problem due to the inherent depth and scale ambiguity of a single view. Addressing this ambiguity requires aggregating various cues over the entire image, such as body sizes, scene layouts, and inter-person relationships. However, most previous methods adopt a top-down scheme that first performs 2D pose detection and then regresses the 3D pose and scale for each detected person individually, ignoring global contextual cues. In this paper, we propose a novel system that first regresses a set of 2.5D representations of body parts and then reconstructs the 3D absolute poses based on these 2.5D representations with a depth-aware part association algorithm. Such a single-shot bottom-up scheme allows the system to better learn and reason about the inter-person depth relationship, improving both 3D and 2D pose estimation. Experiments demonstrate that the proposed approach achieves state-of-the-art performance on the CMU Panoptic and MuPoTS-3D datasets and is applicable to in-the-wild videos.
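The reconstruction step described above can be illustrated with a minimal sketch: given 2D joint locations and per-joint absolute depths (the kind of 2.5D quantities the abstract says the network regresses), each joint is lifted to absolute 3D camera coordinates by pinhole back-projection. The function name, intrinsic values, and toy inputs below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def back_project(keypoints_2d, depths, fx, fy, cx, cy):
    """Lift 2D keypoints with per-joint absolute depths Z (metres) to 3D
    camera coordinates using the pinhole model:
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy.
    keypoints_2d: (N, 2) array of pixel coordinates (u, v)."""
    kp = np.asarray(keypoints_2d, dtype=float)
    Z = np.asarray(depths, dtype=float)
    X = (kp[:, 0] - cx) * Z / fx
    Y = (kp[:, 1] - cy) * Z / fy
    return np.stack([X, Y, Z], axis=1)

# Toy example: two joints at the image centre, 2 m and 3 m from the camera.
pts = np.array([[320.0, 240.0], [320.0, 240.0]])
P = back_project(pts, [2.0, 3.0], fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Because depth and pixel position jointly determine absolute 3D position, a joint at the same pixel but a larger depth maps to a different 3D point, which is why resolving absolute depth is the crux of the problem the paper addresses.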

Keywords

Human pose estimation · 3D from a single image

Notes

Acknowledgements

The authors would like to acknowledge support from NSFC (No. 61806176), Fundamental Research Funds for the Central Universities (2019QNA5022) and ZJU-SenseTime Joint Lab of 3D Vision.

Supplementary material

Supplementary material 1 (pdf 93 KB)

Supplementary material 2 (mp4 49595 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Zhejiang University, Hangzhou, China
  2. SenseTime, Science Park, Hong Kong