International Journal of Computer Vision, Volume 112, Issue 2, pp 204–220

Indoor Scene Understanding with Geometric and Semantic Contexts

  • Wongun Choi
  • Yu-Wei Chao
  • Caroline Pantofaru
  • Silvio Savarese


Truly understanding a scene involves integrating information at multiple levels as well as studying the interactions between scene elements. Individual object detectors, layout estimators and scene classifiers are powerful but are ultimately confounded by complicated real-world scenes with high variability, different viewpoints and occlusions. We propose a method that automatically learns the interactions among scene elements and applies them to the holistic understanding of indoor scenes from a single image. This interpretation is performed within a hierarchical interaction model that describes an image by a parse graph, thereby fusing together object detection, layout estimation and scene classification. At the root of the parse graph are the scene type and layout, while the leaves are the individual object detections. In between lies the core of the system, our 3D Geometric Phrases (3DGPs). We conduct extensive experimental evaluations on single-image 3D scene understanding using both 2D and 3D metrics. The results demonstrate that our model with 3DGPs can provide robust estimation of scene type, 3D space, and 3D objects by leveraging the contextual relationships among the visual elements.
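The parse graph described above can be pictured as a small tree-shaped data structure. The following is only a minimal sketch of that hierarchy, not the authors' implementation: the class names, the fields (`label`, `score`, `layout_id`), the example scene type "livingroom", and the choice to let ungrouped detections hang directly from the root are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Detection:
    """Leaf node: one object detection (fields are hypothetical)."""
    label: str
    score: float


@dataclass
class GeometricPhrase:
    """Intermediate node: a group of objects treated as one unit."""
    name: str
    members: List[Detection] = field(default_factory=list)


@dataclass
class ParseGraph:
    """Root node: scene type and layout, with phrases and objects below."""
    scene_type: str
    layout_id: int  # hypothetical index of a candidate room layout
    phrases: List[GeometricPhrase] = field(default_factory=list)
    objects: List[Detection] = field(default_factory=list)  # assumed: ungrouped leaves

    def leaves(self) -> List[Detection]:
        """Collect every detection, grouped ones first, then ungrouped ones."""
        grouped = [d for p in self.phrases for d in p.members]
        return grouped + self.objects


# Illustrative values only; none of these come from the paper.
sofa = Detection("sofa", 0.8)
table = Detection("table", 0.6)
chair = Detection("chair", 0.4)
graph = ParseGraph(
    scene_type="livingroom",
    layout_id=0,
    phrases=[GeometricPhrase("sofa-table", [sofa, table])],
    objects=[chair],
)
print([d.label for d in graph.leaves()])
```

The sketch captures only the three-level structure (scene and layout at the root, phrases in the middle, detections at the leaves); it says nothing about how such a graph would be scored or inferred.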


Scene understanding, Scene parsing, Object recognition, 3D layout



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Wongun Choi (1)
  • Yu-Wei Chao (2)
  • Caroline Pantofaru (3)
  • Silvio Savarese (4)

  1. NEC Laboratories America, Cupertino, USA
  2. University of Michigan, Ann Arbor, USA
  3. Google, Inc, Mountain View, USA
  4. Stanford University, Stanford, USA
