Full-Body Awareness from Partial Observations

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12362)


There has been great progress in human 3D mesh recovery and great interest in learning about the world from consumer video data. Unfortunately, current methods for 3D human mesh recovery work rather poorly on consumer video data, since on the Internet, unusual camera viewpoints and aggressive truncations are the norm rather than a rarity. We study this problem and make a number of contributions to address it: (i) we propose a simple but highly effective self-training framework that adapts human 3D mesh recovery systems to consumer videos and demonstrate its application to two recent systems; (ii) we introduce evaluation protocols and keypoint annotations for 13K frames across four consumer video datasets for studying this task, including evaluations on out-of-image keypoints; and (iii) we show that our method substantially improves PCK and human-subject judgments compared to baselines, both on test videos from the dataset it was trained on, as well as on three other datasets without further adaptation.
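The PCK (Percentage of Correct Keypoints) metric mentioned above measures the fraction of annotated keypoints whose prediction falls within a distance threshold of the ground truth. A minimal sketch is below; the exact threshold normalization used in the paper (e.g., relative to person or bounding-box size) is an assumption not specified here, and the function name and signature are illustrative.

```python
import numpy as np

def pck(pred, gt, visible, threshold):
    """Fraction of annotated keypoints predicted within `threshold`
    pixels of the ground-truth location.

    pred, gt: float arrays of shape (N, K, 2) -- N images, K keypoints.
    visible:  bool array of shape (N, K) -- which keypoints are annotated.
    """
    # Euclidean distance between predicted and ground-truth keypoints.
    dists = np.linalg.norm(pred - gt, axis=-1)  # shape (N, K)
    # A keypoint counts as correct if annotated and within the threshold.
    correct = (dists <= threshold) & visible
    return correct.sum() / max(visible.sum(), 1)
```

For example, with one image, two annotated keypoints, and a 5-pixel threshold, a prediction 1 pixel off and another 10 pixels off yields a PCK of 0.5.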


Human pose estimation 



This work was supported by the DARPA Machine Common Sense Program. We thank Dimitri Zhukov, Jean-Baptiste Alayrac, and Luowei Zhou, for allowing sharing of frames from their datasets, and Angjoo Kanazawa and Nikos Kolotouros for polished and easily extended code. Thanks to the members of Fouhey AI Lab and Karan Desai for the great suggestions!

Supplementary material

Supplementary material 1 (pdf 2867 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Michigan, Ann Arbor, USA
