Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-view Geometry

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12348)


Epipolar constraints are at the core of feature matching and depth estimation in current multi-person multi-camera 3D human pose estimation methods. Despite the satisfactory performance of this formulation in sparser crowd scenes, its effectiveness is frequently challenged in denser crowds, mainly due to two sources of ambiguity. The first is the mismatch of human joints resulting from the simple cues provided by the Euclidean distances between joints and epipolar lines. The second is the lack of robustness resulting from the naive formulation of the problem as a least squares minimization. In this paper, we depart from the multi-person 3D pose estimation formulation and instead reformulate the problem as crowd pose estimation. Our method consists of two key components: a graph model for fast cross-view matching, and a maximum a posteriori (MAP) estimator for the reconstruction of the 3D human poses. We demonstrate the effectiveness and superiority of our proposed method on four benchmark datasets. Our code is available at:
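The conventional pipeline that the abstract argues against can be sketched as follows. This is not the authors' graph-matching or MAP method; it is a minimal, dependency-free illustration of the two baseline ingredients the abstract names: scoring a cross-view joint match by its point-to-epipolar-line distance, and triangulating a joint by linear least squares. The function names, the fundamental matrix `F`, and the toy two-camera rig (identity intrinsics, a 1-unit horizontal baseline) are assumptions chosen for illustration only.

```python
import math


def matvec(M, v):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def epipolar_distance(F, x1, x2):
    """Distance of joint x2 (view 2) from the epipolar line of joint x1 (view 1).

    F is a 3x3 fundamental matrix with the convention x2^T F x1 = 0, so the
    epipolar line in view 2 is l = F @ [u1, v1, 1].
    """
    a, b, c = matvec(F, [x1[0], x1[1], 1.0])
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * x2[0] + b * x2[1] + c) / norm


def solve3(M, y):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    A = [list(row) + [yi] for row, yi in zip(M, y)]  # augmented matrix
    n = 3
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (A[r][n] - s) / A[r][r]
    return x


def triangulate_least_squares(projections, points):
    """Linear least-squares triangulation of one joint seen in >= 2 views.

    Each view (3x4 matrix P, observation (u, v)) contributes the rows
    (u*P[2] - P[0]) . X = 0 and (v*P[2] - P[1]) . X = 0 with X = (X, Y, Z, 1).
    Moving the last column to the right-hand side gives A X = b, which is
    solved through the normal equations (A^T A) X = A^T b.
    """
    rows, rhs = [], []
    for P, (u, v) in zip(projections, points):
        for coord, k in ((u, 0), (v, 1)):
            r = [coord * P[2][j] - P[k][j] for j in range(4)]
            rows.append(r[:3])
            rhs.append(-r[3])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(AtA, Atb)


if __name__ == "__main__":
    # Toy rig (assumed): identity intrinsics, camera 2 shifted by 1 unit along x.
    P1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
    P2 = [[1, 0, 0, -1], [0, 1, 0, 0], [0, 0, 1, 0]]
    # With identity intrinsics, F equals the essential matrix [t]_x for t = (-1, 0, 0).
    F = [[0, 0, 0], [0, 0, 1], [0, -1, 0]]

    joint_v1 = (0.125, 0.05)          # a joint of person A in view 1
    candidates_v2 = [(-0.125, 0.05),  # the same joint of person A in view 2
                     (-0.325, 0.05)]  # a joint of person B on the same image row
    for cand in candidates_v2:
        print(cand, epipolar_distance(F, joint_v1, cand))
    print(triangulate_least_squares([P1, P2], [joint_v1, candidates_v2[0]]))
```

In this toy rig both view-2 candidates lie exactly on the epipolar line of person A's joint, so the distance cue alone cannot tell them apart; this is the kind of matching ambiguity that grows with crowd density. The triangulation step minimizes a plain sum of squared algebraic residuals, so a single mismatched view can shift the 3D estimate substantially, which is the robustness issue the abstract attributes to the least squares formulation. The normal equations are used here only to avoid dependencies; an SVD-based solver is the more numerically stable choice in practice.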


Keywords: 3D pose estimation; Occlusion; Correspondence problem



The authors would like to thank Yawei Li and Weixiao Liu for useful discussion. This work is supported in part by the Office of Naval Research Award N00014-17-1-2142 and the Singapore MOE Tier 1 grant R-252-000-A65-114.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. The Johns Hopkins University, Baltimore, USA
  2. National University of Singapore, Singapore, Singapore
