Abstract
In this paper we present the first large-scale scene attribute database. First, we perform crowdsourced human studies to find a taxonomy of 102 discriminative attributes. We discover attributes related to materials, surface properties, lighting, affordances, and spatial layout. Next, we build the “SUN attribute database” on top of the diverse SUN categorical database. We use crowdsourcing to annotate attributes for 14,340 images from 707 scene categories. We perform numerous experiments to study the interplay between scene attributes and scene categories. We train and evaluate attribute classifiers and then study the feasibility of attributes as an intermediate scene representation for scene classification, zero-shot learning, automatic image captioning, semantic image search, and parsing natural images. We show that when used as features for these tasks, low-dimensional scene attributes can compete with or improve on state-of-the-art performance. The experiments suggest that scene attributes are an effective low-dimensional feature for capturing high-level context and semantics in scenes.
Notes
Individual attribute presence might be ambiguous in certain scenes, just as category membership can be ambiguous. Scenes have only one category label, though, and the larger the number of categories, the more ambiguous the membership is. However, with over one hundred scene attributes in our taxonomy, several attributes may be strongly present, offering a description of that scene that has more context than the scene category label alone. This also enables an attribute-based representation to make finer-grained distinctions about which components or characteristics of the scene are ambiguous or obvious.
Word cloud made using the software available at www.wordle.net by Jonathan Feinberg.
SUN Attribute Classifiers along with the full SUN Attribute dataset and associated code are available at www.cs.brown.edu/~gen/sunattributes.html.
The images in the SUN Attribute dataset were originally taken from the whole SUN dataset, which includes more than 900 scene categories. Thus, some portion of the SUN Attribute images also appear in the SUN397 dataset, which is also a subset of the full SUN dataset. To avoid testing scene classification on the same images used to train attribute classifiers, the scene classifiers using low-level and predicted attribute features were trained and tested on the SUN397 dataset minus any images that overlap with the SUN Attribute dataset.
Because ground truth attributes were collected on the SUN Attribute set of images, the classifiers using the ground truth attributes directly as features were trained and tested on the SUN Attribute dataset.
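The overlap removal described in the notes above amounts to a set difference over image identifiers. The following is a minimal sketch of that idea, not the authors' code: the function name `remove_overlap` and the example file paths are hypothetical, and it assumes that images in both datasets can be matched by a shared identifier such as a relative path.

```python
def remove_overlap(sun397_images, sun_attribute_images):
    """Return the SUN397 image IDs that do not appear in the SUN Attribute set.

    Both arguments are iterables of image identifiers (assumed here to be
    relative file paths that are comparable across the two datasets).
    The order of the SUN397 list is preserved.
    """
    held_out = set(sun_attribute_images)  # set gives O(1) membership checks
    return [img for img in sun397_images if img not in held_out]


# Hypothetical identifiers, for illustration only.
sun397 = ["a/abbey/img_001.jpg", "b/beach/img_002.jpg", "z/zoo/img_003.jpg"]
sun_attribute = ["b/beach/img_002.jpg", "k/kitchen/img_004.jpg"]

# Images that remain available for training and testing scene classifiers.
scene_classifier_images = remove_overlap(sun397, sun_attribute)
```

Filtering before the train/test split, rather than after, keeps images seen by the attribute classifiers out of both halves of the scene classification experiment.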
Acknowledgments
We thank Vazheh Moussavi (Brown Univ.) for his insights and contributions in the data annotation process. Genevieve Patterson is supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program. This work is also funded by NSF CAREER Award 1149853 to James Hays.
Appendix: Scene Attributes
See Appendix Table 5.
Cite this article
Patterson, G., Xu, C., Su, H. et al. The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding. Int J Comput Vis 108, 59–81 (2014). https://doi.org/10.1007/s11263-013-0695-z
DOI: https://doi.org/10.1007/s11263-013-0695-z