Learning the Relative Importance of Objects from Tagged Images for Retrieval and Cross-Modal Search

Abstract

We introduce an approach to image retrieval and auto-tagging that leverages the implicit information about object importance conveyed by the list of keyword tags a person supplies for an image. We propose an unsupervised learning procedure based on Kernel Canonical Correlation Analysis that discovers the relationship between how humans tag images (e.g., the order in which words are mentioned) and the relative importance of objects and their layout in the scene. Using this discovered connection, we show how to boost accuracy for novel queries, such that the search results better preserve the aspects a human may find most worth mentioning. We evaluate our approach on three datasets using either keyword tags or natural language descriptions, and quantify results with both ground truth parameters and direct tests with human subjects. Our results show clear improvements over approaches that either rely on image features alone, or that use words and image features but ignore the implied importance cues. Overall, our work provides a novel way to incorporate high-level human perception of scenes into visual representations for enhanced image search.
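
To make the learning step concrete, below is a minimal sketch of a standard regularized kernel CCA formulation, one common way to realize the shared semantic space the abstract describes. It assumes precomputed kernel matrices Kx (visual features) and Ky (tag-list features such as word order and rank); the function names, the regularization constant kappa, and the toy retrieval step are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.linalg import eigh

def center_kernel(K):
    """Double-center a kernel matrix (removes the feature-space mean)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, kappa=0.1, n_components=2):
    """Regularized kernel CCA via a symmetric generalized eigenproblem.

    Kx, Ky : (n, n) kernel matrices over the same n training pairs.
    Returns dual weight matrices (alpha, beta), one column per
    projection direction, ordered by decreasing canonical correlation.
    """
    n = Kx.shape[0]
    Kx, Ky = center_kernel(Kx), center_kernel(Ky)
    Rx = Kx + kappa * np.eye(n)  # regularization keeps the problem well-posed
    Ry = Ky + kappa * np.eye(n)
    # Cross-view coupling on the off-diagonal blocks ...
    A = np.zeros((2 * n, 2 * n))
    A[:n, n:] = Kx @ Ky
    A[n:, :n] = Ky @ Kx
    # ... and regularized within-view scatter on the diagonal blocks.
    B = np.zeros((2 * n, 2 * n))
    B[:n, :n] = Rx @ Rx
    B[n:, n:] = Ry @ Ry
    _, vecs = eigh(A, B)                    # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_components]   # keep the most correlated pairs
    return top[:n], top[n:]

# Toy usage with linear kernels on two synthetic "views" of a shared signal.
rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 4))           # latent content shared by both views
X = Z @ rng.standard_normal((4, 20))       # stand-in for visual features
Y = Z @ rng.standard_normal((4, 30))       # stand-in for tag-based features
alpha, beta = kcca(X @ X.T, Y @ Y.T)
img_proj = center_kernel(X @ X.T) @ alpha  # images in the shared semantic space
tag_proj = center_kernel(Y @ Y.T) @ beta   # tag lists in the same space
# Cross-modal search: rank img_proj rows by distance to a tag query's projection.
```

Roughly, retrieval then reduces to nearest-neighbor search in the learned space: a query from either modality (an untagged image, or a tag list alone) is projected with the corresponding dual weights, and items from the other modality are ranked by their distance to it.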

Author information

Correspondence to Sung Ju Hwang.

About this article

Cite this article

Hwang, S.J., Grauman, K. Learning the Relative Importance of Objects from Tagged Images for Retrieval and Cross-Modal Search. Int J Comput Vis 100, 134–153 (2012). https://doi.org/10.1007/s11263-011-0494-3

Keywords

  • Image retrieval
  • Image tags
  • Multi-modal retrieval
  • Cross-modal retrieval
  • Image search
  • Object recognition
  • Auto annotation
  • Kernelized canonical correlation analysis