A Generic Model to Compose Vision Modules for Holistic Scene Understanding

Li, Congcong; Kowdle, Adarsh; Saxena, Ashutosh; Chen, Tsuhan

doi:10.1007/978-3-642-35749-7_6

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 6553)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Holistic scene understanding involves many vision tasks, such as depth estimation, scene categorization, and event categorization. Each of these tasks explores some aspect of the scene, but the tasks are related in that they represent attributes of the same scene. The intuition is that one task can provide meaningful attributes to aid the learning process of another. In this work, we propose a generic model (together with learning and inference techniques) for connecting different vision tasks in the form of a 2-layer cascade. Our model treats the first layer as a hidden layer, whose latent variables are inferred by feedback from the second layer. The feedback mechanism allows the first-layer classifiers to focus on more important image modes, and draws their output towards “attributes” rather than the original “labels”. Our model also automatically discovers sparse connections between the attributes learned on the first layer and the target task on the second layer. Note that in our model, the same vision tasks can act as attribute learners as well as target tasks, depending on the layer on which they are set up. In extensive experiments, we show that the same proposed model improves performance in all the tasks we consider: single-image depth estimation, scene categorization, saliency detection, and event categorization.
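
The 2-layer structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a feed-forward toy version under stated assumptions, and it omits the feedback step that re-infers the first-layer latent variables. The data, the helper names (`fit_ridge`, `fit_lasso_ista`, `make_toy_data`), the use of ridge regression for the first layer, and the use of an L1 penalty as a stand-in for the sparse connections are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_ridge(X, Y, lam=1e-2):
    """Closed-form ridge regression; returns weights mapping X to Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def fit_lasso_ista(A, y, lam=0.1, n_iter=500):
    """L1-regularized least squares solved with ISTA (proximal gradient).

    Minimizes (1 / 2n) * ||A w - y||^2 + lam * ||w||_1.
    """
    n = A.shape[0]
    # Step size 1/L, where L is the Lipschitz constant of the smooth term.
    step = n / (np.linalg.norm(A, 2) ** 2)
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y) / n
        z = w - step * grad
        # Soft-thresholding drives weights of unhelpful attributes to zero.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

def make_toy_data(n=200, d=5, seed=0):
    """Synthetic data (assumption): two auxiliary tasks, one target task."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    aux = np.stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]], axis=1)
    # The target depends only on the first auxiliary task.
    y = 2.0 * aux[:, 0] + 0.01 * rng.normal(size=n)
    return X, aux, y

X, aux_labels, y = make_toy_data()

# Layer 1: one predictor per auxiliary task, trained on its own labels.
W1 = fit_ridge(X, aux_labels)
attributes = X @ W1  # first-layer outputs, used as attributes

# Layer 2: target task with sparse weights over the first-layer attributes.
w2 = fit_lasso_ista(attributes, y)
pred = attributes @ w2
mse = float(np.mean((pred - y) ** 2))
```

In this toy setup the second-layer weight on the irrelevant attribute is shrunk towards zero, which is one simple way to picture a sparse connection between layers. In the model described above, the first-layer outputs would additionally be adjusted by feedback from the second layer rather than being fixed after independent training.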

Author information

Authors and Affiliations

School of Electrical & Computer Engineering, Cornell University, USA
Congcong Li, Adarsh Kowdle & Tsuhan Chen
Department of Computer Science, Cornell University, USA
Ashutosh Saxena

Editor information

Editors and Affiliations

Department of Computer Science, University of Toronto, 10 King’s College Road, ON M5S 3G4, Toronto, Canada
Kiriakos N. Kutulakos

About this paper

Cite this paper

Li, C., Kowdle, A., Saxena, A., Chen, T. (2012). A Generic Model to Compose Vision Modules for Holistic Scene Understanding. In: Kutulakos, K.N. (eds) Trends and Topics in Computer Vision. ECCV 2010. Lecture Notes in Computer Science, vol 6553. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35749-7_6

DOI: https://doi.org/10.1007/978-3-642-35749-7_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35748-0
Online ISBN: 978-3-642-35749-7
eBook Packages: Computer Science (R0)
