International Journal of Computer Vision, Volume 119, Issue 1, pp 46–59

Large Scale Retrieval and Generation of Image Descriptions

  • Vicente Ordonez
  • Xufeng Han
  • Polina Kuznetsova
  • Girish Kulkarni
  • Margaret Mitchell
  • Kota Yamaguchi
  • Karl Stratos
  • Amit Goyal
  • Jesse Dodge
  • Alyssa Mensch
  • Hal Daumé III
  • Alexander C. Berg
  • Yejin Choi
  • Tamara L. Berg


What is the story of an image? What is the relationship between pictures, language, and the information we can extract using state-of-the-art computational recognition systems? To address both of these questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, our generated textual descriptions (captions) should both sound as though a person wrote them and remain true to the image content. To this end we develop data-driven approaches for image description generation, using retrieval-based techniques to gather either: (a) whole captions associated with a visually similar image, or (b) relevant bits of text (phrases) from a large collection of image + description pairs. In the case of (b), we develop optimization algorithms to merge the retrieved phrases into valid natural language sentences. The end result is two simple but effective methods for harnessing the power of big data to produce image captions that are more general, relevant, and human-like than previous attempts.
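Approach (a), transferring the whole caption of the visually most similar database image, amounts to a nearest-neighbor lookup in an image feature space. The sketch below is illustrative only: the function name, the use of a single feature vector per image, and the cosine-similarity choice are assumptions for this example, not the paper's exact feature pipeline.

```python
import numpy as np

def retrieve_caption(query_feat, db_feats, db_captions, k=1):
    """Transfer caption(s) from the visually nearest captioned image(s).

    query_feat : (d,) feature vector for the query image
    db_feats   : (n, d) feature matrix for the captioned database images
    db_captions: list of n caption strings, row-aligned with db_feats
    """
    # Cosine similarity between the query and every database image.
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q
    # Transfer the captions of the k most similar images.
    top = np.argsort(-sims)[:k]
    return [db_captions[i] for i in top]
```

With a large enough captioned collection, even this simple transfer can produce fluent, relevant captions, because some database image is likely to be visually (and semantically) close to the query.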


Keywords: Retrieval · Image description · Data driven · Big data · Natural language processing



This work was supported in part by the 2011 JHU-CLSP Summer Workshop Program. Tamara L. Berg and Kota Yamaguchi were supported in part by NSF CAREER IIS-1054133; Hal Daumé III and Amit Goyal were partially supported by NSF Award IIS-1139909.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Vicente Ordonez (1)
  • Xufeng Han (1)
  • Polina Kuznetsova (2)
  • Girish Kulkarni (2)
  • Margaret Mitchell (3)
  • Kota Yamaguchi (4)
  • Karl Stratos (5)
  • Amit Goyal (6)
  • Jesse Dodge (7)
  • Alyssa Mensch (8)
  • Hal Daumé III (9)
  • Alexander C. Berg (1)
  • Yejin Choi (10)
  • Tamara L. Berg (1)

  1. University of North Carolina, Chapel Hill, USA
  2. Stony Brook University, Stony Brook, USA
  3. Microsoft Research, Redmond, USA
  4. Tohoku University, Sendai, Japan
  5. Columbia University, New York, USA
  6. Yahoo! Labs, Sunnyvale, USA
  7. Carnegie Mellon University, Pittsburgh, USA
  8. University of Pennsylvania, Philadelphia, USA
  9. University of Maryland, College Park, USA
  10. University of Washington, Seattle, USA
