People Watching: Human Actions as a Cue for Single View Geometry


We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene understanding approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.




This work was supported by NSF Graduate Research and NDSEG Fellowships to DF, and by ONR-MURI N000141010934, NSF IIS-1320083, the MSR-INRIA laboratory, the EIT-ICT labs, Google, ERC Activia, and the Quaero Programme, funded by OSEO.

Author information

Correspondence to David F. Fouhey.

Additional information

Communicated by Carlo Colombo.

About this article

Cite this article

Fouhey, D.F., Delaitre, V., Gupta, A. et al. People Watching: Human Actions as a Cue for Single View Geometry. Int J Comput Vis 110, 259–274 (2014). https://doi.org/10.1007/s11263-014-0710-z


  • Scene understanding
  • Action recognition
  • 3D reconstruction