Scene Semantics from Long-Term Observation of People

Delaitre, Vincent; Fouhey, David F.; Laptev, Ivan; Sivic, Josef; Gupta, Abhinav; Efros, Alexei A.

doi:10.1007/978-3-642-33783-3_21

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 7577)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Our everyday objects support various tasks and can be used by people for different purposes. While object classification is a widely studied topic in computer vision, recognition of object function, i.e., what people can do with an object and how they do it, is rarely addressed. In this paper we construct a functional object description with the aim of recognizing objects by the way people interact with them. We describe scene objects (sofas, tables, chairs) by associated human poses and object appearance. Our model is learned discriminatively from automatically estimated body poses in many realistic scenes. In particular, we make use of time-lapse videos from YouTube, which provide a rich source of common human-object interactions and minimize the effort of manual object annotation. We show how the models learned from human observations significantly improve object recognition and enable prediction of characteristic human poses in new scenes. Results are shown on a dataset of more than 400,000 frames obtained from 146 time-lapse videos of challenging and realistic indoor scenes.

Author information

Authors and Affiliations

INRIA/École Normale Supérieure, Paris, France
Vincent Delaitre, Ivan Laptev, Josef Sivic & Alexei A. Efros
Carnegie Mellon University, USA
David F. Fouhey, Abhinav Gupta & Alexei A. Efros

Editor information

Editors and Affiliations

Microsoft Research Ltd., CB3 0FB, Cambridge, UK
Andrew Fitzgibbon
Dept. of Computer Science, University of North Carolina, 27599, Chapel Hill, NC, USA
Svetlana Lazebnik
California Institute of Technology, 91125, Pasadena, CA, USA
Pietro Perona
Institute of Industrial Science, The University of Tokyo, 153-8505, Tokyo, Japan
Yoichi Sato
INRIA, 38330, Montbonnot, France
Cordelia Schmid

About this paper

Cite this paper

Delaitre, V., Fouhey, D.F., Laptev, I., Sivic, J., Gupta, A., Efros, A.A. (2012). Scene Semantics from Long-Term Observation of People. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds) Computer Vision – ECCV 2012. ECCV 2012. Lecture Notes in Computer Science, vol 7577. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33783-3_21

DOI: https://doi.org/10.1007/978-3-642-33783-3_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33782-6
Online ISBN: 978-3-642-33783-3
eBook Packages: Computer Science, Computer Science (R0)
