People Watching: Human Actions as a Cue for Single View Geometry
Abstract
We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene understanding approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.
Keywords
Scene understanding, Action recognition, 3D reconstruction
Acknowledgments
This work was supported by NSF Graduate Research and NDSEG Fellowships to DF, and by ONR-MURI N000141010934, NSF IIS-1320083, the MSR-INRIA laboratory, the EIT-ICT labs, Google, ERC Activia, and the Quaero Programme, funded by OSEO.