Multi-Person 3D Pose and Shape Estimation via Inverse Kinematics and Refinement

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Estimating 3D poses and shapes in the form of meshes from monocular RGB images is challenging: it is inherently harder than estimating 3D poses alone in the form of skeletons or heatmaps. When interacting persons are involved, 3D mesh reconstruction becomes even more challenging due to the ambiguity introduced by person-to-person occlusions. To tackle these challenges, we propose a coarse-to-fine pipeline that benefits from 1) inverse kinematics applied to occlusion-robust 3D skeleton estimates and 2) a Transformer-based relation-aware refinement technique. In our pipeline, we first obtain occlusion-robust 3D skeletons for multiple persons from an RGB image. We then apply inverse kinematics to convert the estimated skeletons into deformable 3D mesh parameters. Finally, we apply Transformer-based mesh refinement, which refines the obtained mesh parameters while considering intra- and inter-person relations among the 3D meshes. Through extensive experiments, we demonstrate the effectiveness of our method, outperforming state-of-the-art methods on the 3DPW, MuPoTS, and AGORA datasets.
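
To make the three-stage pipeline concrete, below is a minimal PyTorch sketch of the coarse-to-fine structure described in the abstract. The class and function names (RelationAwareRefiner, coarse_to_fine), the per-joint tokenization, and the tensor shapes are illustrative assumptions for exposition only; they are not taken from the authors' implementation.

```python
# Illustrative sketch only: module names, tensor shapes, and the per-joint
# tokenization below are assumptions, not the authors' released implementation.
import torch
import torch.nn as nn


class RelationAwareRefiner(nn.Module):
    """Transformer encoder over the joint tokens of every detected person.

    Flattening all persons' joints into one token sequence lets self-attention
    relate joints within one person (intra-person) and across different
    persons (inter-person), mirroring the relation-aware refinement stage
    described in the abstract.
    """

    def __init__(self, joint_dim: int = 3, d_model: int = 128, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(joint_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, joint_dim)  # predicts residual corrections

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        # pose: (batch, persons, joints, 3) SMPL-style axis-angle joint rotations
        b, p, j, c = pose.shape
        tokens = self.embed(pose.reshape(b, p * j, c))  # one token per joint of every person
        residual = self.head(self.encoder(tokens))      # relation-aware corrections
        return pose + residual.reshape(b, p, j, c)


def coarse_to_fine(image, skeleton_net, inverse_kinematics, refiner):
    """(1) occlusion-robust 3D skeletons -> (2) inverse kinematics to mesh
    (pose) parameters -> (3) relation-aware Transformer refinement."""
    skeletons = skeleton_net(image)              # (B, persons, joints, 3) 3D joint positions
    coarse_pose = inverse_kinematics(skeletons)  # (B, persons, joints, 3) joint rotations
    return refiner(coarse_pose)


if __name__ == "__main__":
    # Smoke test with stand-in callables for the skeleton estimator and IK step.
    refiner = RelationAwareRefiner()
    out = coarse_to_fine(
        image=torch.zeros(1, 3, 256, 256),
        skeleton_net=lambda img: torch.randn(1, 3, 24, 3),
        inverse_kinematics=lambda s: torch.randn(1, 3, 24, 3),
        refiner=refiner,
    )
    print(out.shape)  # torch.Size([1, 3, 24, 3])
```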

M. Saqlain—This research was conducted while Dr. Saqlain was a post-doctoral researcher at UNIST.

G. Kim and M. Shin—Both were undergraduate interns at UNIST.

Acknowledgements

This work was supported by IITP grants (No. 2021-0-01778 Development of human image synthesis and discrimination technology below the perceptual threshold; No. 2020-0-01336 Artificial intelligence graduate school program (UNIST); No. 2021-0-02068 Artificial intelligence innovation hub; No. 2022-0-00264 Comprehensive video understanding and generation with knowledge-based deep logic neural network) and the NRF grant (No. 2022R1F1A1074828), all funded by the Korean government (MSIT).

Author information

Corresponding author

Correspondence to Seungryul Baek.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 5832 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Cha, J., Saqlain, M., Kim, G., Shin, M., Baek, S. (2022). Multi-Person 3D Pose and Shape Estimation via Inverse Kinematics and Refinement. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13665. Springer, Cham. https://doi.org/10.1007/978-3-031-20065-6_38

  • DOI: https://doi.org/10.1007/978-3-031-20065-6_38

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20064-9

  • Online ISBN: 978-3-031-20065-6

  • eBook Packages: Computer Science, Computer Science (R0)
