Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12357)


We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene, all from a single in-the-wild image captured in an uncontrolled environment. Notably, our method runs on datasets without any scene- or object-level 3D supervision. Our key insight is that considering humans and objects jointly gives rise to "3D common sense" constraints that can be used to resolve ambiguity. In particular, we introduce a scale loss that learns the distribution of object size from data; an occlusion-aware silhouette re-projection loss to optimize object pose; and a human-object interaction loss to capture the spatial layout of objects with which humans interact. We empirically validate that our constraints dramatically reduce the space of likely 3D spatial configurations. We demonstrate our approach on challenging, in-the-wild images of humans interacting with large objects (such as bicycles, motorcycles, and surfboards) and handheld objects (such as laptops, tennis rackets, and skateboards). We quantify the ability of our approach to recover human-object arrangements and outline remaining challenges in this relatively unexplored domain. The project webpage can be found at
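To make the occlusion-aware silhouette re-projection loss concrete, here is a minimal NumPy sketch of the general idea: compare a rendered object silhouette to a target instance mask, but ignore pixels hidden by an occluder (e.g. the person standing in front of the object). The function name, the pixel-wise L2 form, and the visible-area normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def occlusion_aware_silhouette_loss(rendered, target, occluder):
    """Pixel-wise L2 silhouette loss that skips occluded pixels.

    rendered: HxW rendered object silhouette, values in [0, 1].
    target:   HxW target instance mask for the object.
    occluder: HxW mask of pixels covered by an occluder (e.g. a person);
              these pixels carry no information about the object and
              are excluded from the loss.
    """
    visible = 1.0 - occluder            # pixels where the object is observable
    diff = (rendered - target) * visible
    n = max(visible.sum(), 1.0)         # normalize by visible area
    return float((diff ** 2).sum() / n)
```

Masking the difference rather than the inputs means a rendered silhouette is never penalized for "filling in" the object behind the occluder, which is exactly the behavior one wants when optimizing object pose under partial occlusion.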



We thank Georgia Gkioxari and Shubham Tulsiani for insightful discussion and Victoria Dean and Gengshan Yang for useful feedback. We also thank Senthil Purushwalkam for deadline reminders. This work was funded in part by the CMU Argo AI Center for Autonomous Vehicle Research.

Supplementary material

504453_1_En_3_MOESM1_ESM.pdf (PDF, 27.6 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Carnegie Mellon University, Pittsburgh, USA
  2. Facebook AI Research, Menlo Park, USA
  3. Argo AI, Pittsburgh, USA
  4. UC Berkeley, Berkeley, USA
