The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding

Abstract

In this paper we present the first large-scale scene attribute database. First, we perform crowdsourced human studies to find a taxonomy of 102 discriminative attributes. We discover attributes related to materials, surface properties, lighting, affordances, and spatial layout. Next, we build the “SUN attribute database” on top of the diverse SUN categorical database. We use crowdsourcing to annotate attributes for 14,340 images from 707 scene categories. We perform numerous experiments to study the interplay between scene attributes and scene categories. We train and evaluate attribute classifiers and then study the feasibility of attributes as an intermediate scene representation for scene classification, zero-shot learning, automatic image captioning, semantic image search, and parsing of natural images. We show that when used as features for these tasks, low-dimensional scene attributes can compete with or improve on state-of-the-art performance. The experiments suggest that scene attributes are an effective low-dimensional feature for capturing high-level context and semantics in scenes.
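As a rough illustration of the pipeline the abstract describes (not the authors' exact implementation), the sketch below trains one binary classifier per attribute and then uses the 102-dimensional vector of attribute confidences as a low-dimensional feature for scene-category classification. The data loading, image features, and choice of `LinearSVC` are assumptions made for illustration only.

```python
# Minimal sketch, assuming precomputed image features X (n_images x d),
# binary attribute labels A (n_images x 102), and scene labels y.
import numpy as np
from sklearn.svm import LinearSVC

NUM_ATTRIBUTES = 102  # size of the attribute taxonomy reported in the paper


def train_attribute_classifiers(X, A):
    """Fit one independent binary classifier per attribute (classifier choice is an assumption)."""
    classifiers = []
    for j in range(NUM_ATTRIBUTES):
        clf = LinearSVC(C=1.0)
        clf.fit(X, A[:, j])
        classifiers.append(clf)
    return classifiers


def attribute_features(classifiers, X):
    """Stack per-attribute decision values into a 102-D descriptor for each image."""
    return np.stack([clf.decision_function(X) for clf in classifiers], axis=1)


def train_scene_classifier(classifiers, X, y):
    """Train a scene-category classifier on top of the attribute representation."""
    F = attribute_features(classifiers, X)  # (n_images, 102)
    return LinearSVC(C=1.0).fit(F, y)
```

The same 102-dimensional attribute vector could, in principle, also serve as the intermediate representation for the other tasks mentioned above (zero-shot learning, captioning, retrieval), with only the final predictor swapped out.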

Keywords

Scene understanding · Crowdsourcing · Attributes · Image captioning · Scene parsing


Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Genevieve Patterson (1)
  • Chen Xu (1)
  • Hang Su (1)
  • James Hays (1)

  1. Department of Computer Science, Brown University, Providence, USA