Abstract
In this paper, we propose a computational model of the recognition of real-world scenes that bypasses the segmentation and processing of individual objects or regions. The procedure is based on a very low-dimensional representation of the scene, which we term the Spatial Envelope. We propose a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. We then show that these dimensions may be reliably estimated using spectral and coarsely localized information. The model generates a multidimensional space in which scenes sharing membership in semantic categories (e.g., streets, highways, coasts) are projected close together. The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization, and that modeling a holistic representation of the scene provides information about its probable semantic category.
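To make "spectral and coarsely localized information" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a grayscale image stored as a 2-D NumPy array, summarizes its energy spectrum by the fraction of energy in a few orientation bins, and then repeats that summary on a coarse grid of blocks. The function names, the choice of four orientation bins, and the 2×2 grid are all illustrative assumptions; the abstract does not specify how the features are computed or how the perceptual dimensions are estimated from them.

```python
import numpy as np


def energy_spectrum(image):
    """Squared magnitude of the centered 2-D DFT of a grayscale image."""
    img = np.asarray(image, dtype=float)
    img = img - img.mean()  # remove the DC component so it does not dominate
    return np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2


def oriented_energy(image, n_orient=4):
    """Fraction of spectral energy in each of n_orient orientation bins.

    Bins are centered at 0, pi/n_orient, 2*pi/n_orient, ... (illustrative choice).
    """
    spec = energy_spectrum(image)
    h, w = spec.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]  # vertical frequencies
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]  # horizontal frequencies
    # Orientation of each frequency sample, folded into [0, pi).
    theta = np.mod(np.arctan2(fy, fx), np.pi)
    width = np.pi / n_orient
    bins = np.floor((theta + width / 2) / width).astype(int) % n_orient
    energy = np.bincount(bins.ravel(), weights=spec.ravel(), minlength=n_orient)
    total = energy.sum()
    return energy / total if total > 0 else np.zeros(n_orient)


def coarse_features(image, grid=2, n_orient=4):
    """Oriented-energy summary computed separately on a grid x grid layout."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    bh, bw = h // grid, w // grid
    feats = np.zeros((grid, grid, n_orient))
    for i in range(grid):
        for j in range(grid):
            block = img[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            feats[i, j] = oriented_energy(block, n_orient)
    return feats


# Synthetic example: vertical stripes vary only along x, so their energy
# lies on the horizontal frequency axis (orientation bin 0).
x = np.arange(64)
vertical_stripes = np.tile(np.sin(2 * np.pi * 8 * x / 64), (64, 1))
global_summary = oriented_energy(vertical_stripes)   # shape (4,)
local_summary = coarse_features(vertical_stripes)    # shape (2, 2, 4)
```

A summary of this kind has only a handful of numbers per image and involves no segmentation, which is the property the abstract emphasizes. Mapping such features onto naturalness, openness, and the other dimensions would require a learned estimator that is not sketched here.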
Cite this article
Oliva, A., Torralba, A. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision 42, 145–175 (2001). https://doi.org/10.1023/A:1011139631724