The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding

International Journal of Computer Vision

Abstract

In this paper we present the first large-scale scene attribute database. First, we perform crowdsourced human studies to find a taxonomy of 102 discriminative attributes. We discover attributes related to materials, surface properties, lighting, affordances, and spatial layout. Next, we build the “SUN attribute database” on top of the diverse SUN categorical database. We use crowdsourcing to annotate attributes for 14,340 images from 707 scene categories. We perform numerous experiments to study the interplay between scene attributes and scene categories. We train and evaluate attribute classifiers and then study the feasibility of attributes as an intermediate scene representation for scene classification, zero-shot learning, automatic image captioning, semantic image search, and parsing natural images. We show that when used as features for these tasks, low-dimensional scene attributes can compete with or improve on state-of-the-art performance. The experiments suggest that scene attributes are an effective low-dimensional feature for capturing high-level context and semantics in scenes.
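As a sketch of how attributes can serve as an intermediate scene representation, the snippet below applies independent linear attribute classifiers to low-level image features and collects the resulting confidences into a compact 102-dimensional descriptor. This is a hypothetical minimal example: the weights are random, and the feature dimension and function names are illustrative, not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ATTRIBUTES = 102  # size of the discovered attribute taxonomy
FEATURE_DIM = 512     # stand-in for a low-level descriptor (e.g. GIST- or HOG-like)

def predict_attributes(features, weights, biases):
    """Score all attributes at once with linear classifiers and squash
    each score to a confidence in (0, 1) via a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + biases)))

# Toy data: 20 images with random low-level features and random classifiers.
features = rng.normal(size=(20, FEATURE_DIM))
weights = rng.normal(scale=0.1, size=(FEATURE_DIM, NUM_ATTRIBUTES))
biases = np.zeros(NUM_ATTRIBUTES)

attribute_vectors = predict_attributes(features, weights, biases)
print(attribute_vectors.shape)  # (20, 102)
```

In the pipeline the paper describes, each attribute classifier would instead be trained on crowdsourced labels, and the predicted 102-dimensional attribute vector would stand in for much higher-dimensional low-level features in scene classification, retrieval, and related tasks.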



Notes

  1. Individual attribute presence might be ambiguous in certain scenes, just as category membership can be ambiguous. A scene has only one category label, though, and the larger the number of categories, the more ambiguous that membership becomes. With over one hundred scene attributes in our taxonomy, however, several attributes may be strongly present, offering a description of the scene with more context than the category label alone. This also enables an attribute-based representation to make finer-grained distinctions about which components or characteristics of the scene are ambiguous or obvious.

  2. Word cloud made using the software available at www.wordle.net by Jonathan Feinberg.

  3. SUN Attribute Classifiers along with the full SUN Attribute dataset and associated code are available at www.cs.brown.edu/~gen/sunattributes.html.

  4. The images in the SUN Attribute dataset were originally taken from the full SUN dataset, which includes more than 900 scene categories. Thus, some of the SUN Attribute images also appear in the SUN 397 dataset, which is likewise a subset of the full SUN dataset. The scene classifiers using low-level and predicted attribute features were trained and tested on the SUN 397 dataset minus any overlapping images from the SUN Attribute dataset, to avoid testing scene classification on the same images used to train the attribute classifiers.

  5. Because ground truth attributes were collected on the SUN Attribute set of images, the classifiers using the ground truth attributes directly as features were trained and tested on the SUN Attribute dataset.

  6. http://dsl1.cewit.stonybrook.edu/vicente/py/website/search
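The train/test separation described in note 4 amounts to a set difference over image identifiers. A minimal sketch, with purely illustrative file names that are not the datasets' actual paths:

```python
# Hypothetical image identifiers, illustrative only.
sun397_images = {"abbey/img_001.jpg", "abbey/img_002.jpg", "zoo/img_104.jpg"}
sun_attribute_images = {"abbey/img_002.jpg", "airport/img_033.jpg"}

# Drop every image used to train attribute classifiers from the scene pool,
# so scene classification is never evaluated on attribute-training images.
scene_pool = sun397_images - sun_attribute_images
print(sorted(scene_pool))  # ['abbey/img_001.jpg', 'zoo/img_104.jpg']
```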


Acknowledgments

We thank Vazheh Moussavi (Brown Univ.) for his insights and contributions in the data annotation process. Genevieve Patterson is supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program. This work is also funded by NSF CAREER Award 1149853 to James Hays.

Corresponding author

Correspondence to Genevieve Patterson.

Appendix: Scene Attributes

See Appendix Table 5.

Table 5 Complete list of discovered scene attributes

Cite this article

Patterson, G., Xu, C., Su, H. et al. The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding. Int J Comput Vis 108, 59–81 (2014). https://doi.org/10.1007/s11263-013-0695-z

