International Journal of Computer Vision, Volume 119, Issue 1, pp 3–22

SUN Database: Exploring a Large Collection of Scene Categories

  • Jianxiong Xiao
  • Krista A. Ehinger
  • James Hays
  • Antonio Torralba
  • Aude Oliva

Abstract

Progress in scene understanding requires reasoning about the rich and diverse visual environments that make up our daily experience. To this end, we propose the Scene UNderstanding (SUN) database, a nearly exhaustive collection of scenes categorized at the same level of specificity as human discourse. The database contains 908 distinct scene categories and 131,072 images. Given this data, with both scene and object labels available, we perform an in-depth analysis of co-occurrence statistics and contextual relationships. To better understand this large-scale taxonomy of scene categories, we perform two human experiments: we quantify human scene recognition accuracy, and we measure how typical each image is of its assigned scene category. Next, we perform computational experiments: scene recognition with global image features, indoor versus outdoor classification, and “scene detection,” in which we relax the assumption that one image depicts only one scene category. Finally, we relate human experiments to machine performance and explore the relationship between human and machine recognition errors and the relationship between image “typicality” and machine recognition accuracy.
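The abstract mentions scene recognition with global image features. As a concrete illustration, the sketch below shows what such a baseline can look like in Python, substituting a simple “tiny image” descriptor and a one-vs-rest linear SVM for the paper’s actual feature set; the directory layout (sun/train, sun/test), descriptor size, and all parameters are illustrative assumptions, not part of the original work.

```python
# Minimal sketch of scene recognition with a global image descriptor.
# A "tiny image" feature stands in for GIST-like global features; the
# sun/train and sun/test folder layout is a hypothetical convention.
import numpy as np
from pathlib import Path
from PIL import Image
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def tiny_image_feature(path, size=(16, 16)):
    """Downsample to a small grayscale patch and flatten to a global descriptor."""
    img = Image.open(path).convert("L").resize(size)
    x = np.asarray(img, dtype=np.float32).ravel()
    return (x - x.mean()) / (x.std() + 1e-8)  # zero-mean, unit-variance

def load_split(root):
    """Assumes root/<category>/<image>.jpg; returns feature matrix and labels."""
    X, y = [], []
    for cat_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for img_path in cat_dir.glob("*.jpg"):
            X.append(tiny_image_feature(img_path))
            y.append(cat_dir.name)
    return np.stack(X), np.array(y)

X_train, y_train = load_split("sun/train")
X_test, y_test = load_split("sun/test")

clf = LinearSVC(C=1.0).fit(X_train, y_train)  # one-vs-rest linear SVM
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

In practice, any global descriptor (GIST, HOG histograms, and similar) would slot into the feature function above; the evaluation loop stays the same.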

Keywords

Scene recognition · Scene detection · Scene descriptor · Scene typicality · Scene and object · Visual context

Acknowledgments

We thank Yinda Zhang for help on the scene classification experiments. This work is funded by a Google Research Award to J.X., NSF Grant 1016862 to A.O., NSF CAREER Award 0747120 to A.T., NSF CAREER Award 1149853 to J.H., ONR MURI N000141010933, Foxconn, and gifts from Microsoft and Google. K.A.E. was funded by an NSF Graduate Research Fellowship.

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Jianxiong Xiao (1)
  • Krista A. Ehinger (2)
  • James Hays (3)
  • Antonio Torralba (4)
  • Aude Oliva (4)

  1. Princeton University, Princeton, USA
  2. Harvard Medical School, Boston, USA
  3. Brown University, Providence, USA
  4. Massachusetts Institute of Technology, Cambridge, USA
