Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image

  • Federica Bogo
  • Angjoo Kanazawa
  • Christoph Lassner
  • Peter Gehler
  • Javier Romero
  • Michael J. Black
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9909)

Abstract

We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we first use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fit it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior pose accuracy with respect to the state of the art.
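The data term described above, which penalizes the error between projected 3D model joints and detected 2D joints, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the pinhole camera with a single focal length, the per-joint confidence weights, the plain squared-distance penalty, and the function names `project` and `joint_reprojection_error`. The sketch only evaluates the error for given 3D joints; in a fitting procedure, a term of this kind would be minimized over the model's pose and shape parameters.

```python
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def project(joint: Point3, focal: float, center: Point2) -> Point2:
    """Pinhole projection of a camera-space 3D point to pixel coordinates.

    Assumption: a single focal length (in pixels) and a known principal point.
    """
    x, y, z = joint
    if z <= 0:
        raise ValueError("joint must lie in front of the camera (z > 0)")
    return (focal * x / z + center[0], focal * y / z + center[1])


def joint_reprojection_error(
    joints_3d: Sequence[Point3],
    joints_2d: Sequence[Point2],
    confidences: Sequence[float],
    focal: float,
    center: Point2,
) -> float:
    """Confidence-weighted sum of squared 2D distances between projected
    model joints and detected joints (hypothetical, simplified penalty)."""
    if not (len(joints_3d) == len(joints_2d) == len(confidences)):
        raise ValueError("inputs must contain one entry per joint")
    total = 0.0
    for j3, j2, w in zip(joints_3d, joints_2d, confidences):
        u, v = project(j3, focal, center)
        total += w * ((u - j2[0]) ** 2 + (v - j2[1]) ** 2)
    return total


# Toy example: two model joints, two detections, equal confidence.
model_joints = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0)]
detections = [(320.0, 240.0), (573.0, 244.0)]
error = joint_reprojection_error(
    model_joints, detections, [1.0, 1.0], focal=1000.0, center=(320.0, 240.0)
)
print(error)
```

In the toy example the first joint projects exactly onto its detection, while the second is off by 3 pixels horizontally and 4 pixels vertically, so only the second joint contributes to the error. Lowering a joint's confidence weight reduces its influence, which is one simple way to discount unreliable detections.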

Keywords

3D body shape, Human pose, 2D to 3D, CNN

Notes

Acknowledgements

We thank M. Al Borno for inspiring the capsule representation, N. Mahmood for help with the figures, and I. Akhter for helpful discussions.

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Federica Bogo (2)
  • Angjoo Kanazawa (3)
  • Christoph Lassner (1, 4)
  • Peter Gehler (1, 4)
  • Javier Romero (1)
  • Michael J. Black (1)

  1. Max Planck Institute for Intelligent Systems, Tübingen, Germany
  2. Microsoft Research, Cambridge, UK
  3. University of Maryland, College Park, USA
  4. University of Tübingen, Tübingen, Germany
