International Journal of Computer Vision, Volume 42, Issue 3, pp 145–175

Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope

  • Aude Oliva
  • Antonio Torralba


In this paper, we propose a computational model of the recognition of real-world scenes that bypasses the segmentation and the processing of individual objects or regions. The procedure is based on a very low-dimensional representation of the scene, which we term the Spatial Envelope. We propose a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. Then, we show that these dimensions may be reliably estimated using spectral and coarsely localized information. The model generates a multidimensional space in which scenes sharing membership in semantic categories (e.g., streets, highways, coasts) are projected close together. The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization, and that a holistic representation of the scene is informative about its probable semantic category.
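The abstract does not spell out how the spectral and coarsely localized information is computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a grayscale image stored as a 2-D NumPy array, takes the squared magnitude of the 2-D discrete Fourier transform as the energy spectrum, repeats that on the cells of a coarse grid, and reduces the resulting feature vectors with principal components. The function names, the 4 x 4 grid, and the number of retained components are all hypothetical choices made for illustration.

```python
import numpy as np


def energy_spectrum(image):
    """Squared magnitude of the 2-D DFT, with zero frequency shifted to the centre."""
    image = np.asarray(image, dtype=float)
    image = image - image.mean()  # remove the mean so the DC term does not dominate
    return np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2


def coarse_spectral_features(image, grid=4):
    """Energy spectrum of each cell of a grid x grid tiling, flattened into one vector.

    The grid size is an assumption for this sketch; pixels left over when the
    image size is not divisible by `grid` are ignored.
    """
    image = np.asarray(image, dtype=float)
    height, width = image.shape
    cell_h, cell_w = height // grid, width // grid
    cells = [
        energy_spectrum(
            image[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w]
        ).ravel()
        for r in range(grid)
        for c in range(grid)
    ]
    return np.concatenate(cells)


def pca_project(features, n_components):
    """Project the rows of `features` onto their first principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.random((10, 64, 64))  # random stand-ins for scene photographs
    features = np.stack([coarse_spectral_features(im) for im in images])
    low_dim = pca_project(features, n_components=3)
    print(features.shape, low_dim.shape)
```

With real photographs in place of the random arrays, each row of `low_dim` would be a low-dimensional descriptor of one scene. Mapping such descriptors to the perceptual dimensions named in the abstract would require a further learned step that this sketch does not attempt.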

scene recognition, natural images, energy spectrum, principal components, spatial layout





Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • Aude Oliva (1)
  • Antonio Torralba (2)

  1. Harvard Medical School and the Brigham and Women's Hospital, Boston
  2. Department of Brain and Cognitive Sciences, MIT, Cambridge
