International Journal of Computer Vision, Volume 110, Issue 3, pp 259–274

People Watching: Human Actions as a Cue for Single View Geometry

  • David F. Fouhey
  • Vincent Delaitre
  • Abhinav Gupta
  • Alexei A. Efros
  • Ivan Laptev
  • Josef Sivic


We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene understanding approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.


Scene understanding, Action recognition, 3D reconstruction



This work was supported by NSF Graduate Research and NDSEG Fellowships to DF, and by ONR-MURI N000141010934, NSF IIS-1320083, the MSR-INRIA laboratory, the EIT-ICT labs, Google, ERC Activia, and the Quaero Programme, funded by OSEO.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • David F. Fouhey (1)
  • Vincent Delaitre (2)
  • Abhinav Gupta (1)
  • Alexei A. Efros (1, 3)
  • Ivan Laptev (2)
  • Josef Sivic (2)

  1. Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
  2. WILLOW Project, Département d’Informatique de l’École Normale Supérieure, ENS/INRIA/CNRS UMR 8548, Paris, France
  3. EECS Department, UC Berkeley, Berkeley, USA
