Large Scale Retrieval and Generation of Image Descriptions


What is the story of an image? What is the relationship between pictures, language, and the information we can extract using state-of-the-art computational recognition systems? To address both questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, the generated textual descriptions (captions) should both sound as if a person wrote them and remain true to the image content. To this end we develop data-driven approaches to image description generation, using retrieval-based techniques to gather either: (a) whole captions associated with a visually similar image, or (b) relevant bits of text (phrases) from a large collection of image + description pairs. In the case of (b), we develop optimization algorithms to merge the retrieved phrases into valid natural language sentences. The end result is two simple but effective methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.
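As a rough illustration of approach (a), the sketch below transfers the caption of the most visually similar image in a captioned collection. It is a minimal toy example under stated assumptions, not the authors' system: the feature vectors, the cosine similarity measure, and the names `cosine_similarity` and `transfer_caption` are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def transfer_caption(query_features, collection):
    """Return the caption of the collection image closest to the query.

    `collection` is a list of (features, caption) pairs. The similarity
    measure is an assumption made for this sketch.
    """
    if not collection:
        raise ValueError("collection must contain at least one captioned image")
    _, best_caption = max(
        collection,
        key=lambda pair: cosine_similarity(query_features, pair[0]),
    )
    return best_caption


# Toy collection with made-up 3-dimensional features.
collection = [
    ([0.9, 0.1, 0.0], "A boat on the lake at sunset."),
    ([0.0, 0.8, 0.6], "A dog running across the grass."),
]
caption = transfer_caption([0.1, 0.7, 0.7], collection)
print(caption)
```

With a large enough collection, the nearest neighbor is more likely to depict similar content, which is what makes whole-caption transfer plausible at scale.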




  2. The coefficient \(\alpha \) can be tuned via grid search, and scores are normalized to lie in \([0, 1]\).

  3. An interesting but non-trivial extension to this generation technique is allowing re-ordering or omission of phrases (Kuznetsova et al. 2012).
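The note on tuning \(\alpha \) can be made concrete with a small sketch. The text here does not say which scores \(\alpha \) weights, so the code below assumes a convex combination \(\alpha s_a + (1 - \alpha ) s_b\) of two min-max normalized scores and selects \(\alpha \) by top-1 accuracy on a toy validation set. The objective, the data, and all function names are assumptions for illustration only.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so they lie in [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # no spread: map everything to 0
    return [(s - lo) / (hi - lo) for s in scores]


def combined_scores(scores_a, scores_b, alpha):
    """Assumed combination: alpha * a + (1 - alpha) * b, per candidate."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(scores_a, scores_b)]


def top1_accuracy(examples, alpha):
    """Fraction of examples whose top-scoring candidate is the gold one."""
    hits = 0
    for scores_a, scores_b, gold in examples:
        combined = combined_scores(
            min_max_normalize(scores_a), min_max_normalize(scores_b), alpha
        )
        predicted = max(range(len(combined)), key=combined.__getitem__)
        hits += int(predicted == gold)
    return hits / len(examples)


def grid_search_alpha(examples, steps=11):
    """Try evenly spaced alpha values in [0, 1] and keep the best one."""
    grid = [i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda alpha: top1_accuracy(examples, alpha))


# Hypothetical validation data: (scores_a, scores_b, index of gold candidate).
examples = [
    ([3.0, 1.0, 2.0], [0.2, 0.9, 0.1], 0),
    ([1.0, 5.0, 2.0], [0.7, 0.6, 0.1], 1),
]
best_alpha = grid_search_alpha(examples)
print(best_alpha, top1_accuracy(examples, best_alpha))
```

Normalizing both scores to the same range before combining them keeps \(\alpha \) interpretable as a relative weight rather than a correction for differing scales.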


  1. Aker, A., & Gaizauskas, R. (2010). Generating image descriptions using dependency relational patterns. In ACL.

  2. Barnard, K., Duygulu, P., de Freitas, N., Forsyth, D., Blei, D., & Jordan, M. (2003). Matching words and pictures. Journal of Machine Learning Research, 3, 1107–1135.


  3. Berg, T., Berg, A., Edwards, J., & Forsyth, D. (2004). Who’s in the picture? In NIPS.

  4. Berg, T., Berg, A., Edwards, J., Maire, M., White, R., Learned-Miller, E., Teh, Y., & Forsyth, D. (2004). Names and faces. In CVPR.

  5. Berg, T.L., Berg, A.C., & Shih, J. (2010). Automatic attribute discovery and characterization from noisy web data. In ECCV.

  6. Brants, T., & Franz, A. (2006). Web 1t 5-gram version 1. In LDC.

  7. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In WWW.

  8. Chum, O., Philbin, J., & Zisserman, A. (2008). Near duplicate image detection: min-hash and tf-idf weighting. In BMVC.

  9. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.

  10. Deng, J., Berg, A.C., & Fei-Fei, L. (2011). Hierarchical semantic indexing for large scale image retrieval. In CVPR.

  11. Deng, J., Berg, A.C., Li, K., & Fei-Fei, L. (2010). What does classifying more than 10,000 image categories tell us? In ECCV.

  12. Deng, J., Krause, J., Berg, A.C., & Fei-Fei, L. (2012). Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. In CVPR.

  13. Deng, J., Satheesh, S., Berg, A.C., & Fei-Fei, L. (2011). Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS.

  14. Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. (2002). Object recognition as machine translation. In ECCV.

  15. Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D.A. (2009). Describing objects by their attributes. In CVPR.

  16. Farhadi, A., Hejrati, M., Sadeghi, A., Young, P., Rashtchian, C., Hockenmaier, J., & Forsyth, D.A. (2010). Every picture tells a story: generating sentences for images. In ECCV.

  17. Felzenszwalb, P.F., Girshick, R.B., & McAllester, D. (2011). Discriminatively trained deformable part models, release 4.

  18. Feng, Y., & Lapata, M. (2010). How many words is a picture worth? automatic caption generation for news images. In ACL.

  19. Ferrari, V., & Zisserman, A. (2007). Learning visual attributes. In NIPS.

  20. Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., & Saenko, K. (2013). Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV.

  21. Hays, J., & Efros, A.A. (2008). im2gps: estimating geographic information from a single image. In CVPR.

  22. Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853–899.


  23. Hoiem, D., Efros, A.A., & Hebert, M. (2005). Geometric context from a single image. In ICCV.

  24. Jing, Y., & Baluja, S. (2008). Pagerank for product image search. In WWW.

  25. Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., et al. (2013). Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 2891–2903.


  26. Kumar, N., Berg, A.C., Belhumeur, P.N., & Nayar, S.K. (2009). Attribute and simile classifiers for face verification. In ICCV.

  27. Kuznetsova, P., Ordonez, V., Berg, A., Berg, T.L., & Choi, Y. (2012). Collective generation of natural image descriptions. In ACL.

  28. Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., & Choi, Y. (2013). Generalizing image captions for image-text parallel corpus. In ACL.

  29. Lampert, C., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In CVPR.

  30. Leung, T.K., & Malik, J. (1999). Recognizing surfaces using three-dimensional textons. In ICCV.

  31. Li, S., Kulkarni, G., Berg, T.L., Berg, A.C., & Choi, Y. (2011). Composing simple image descriptions using web-scale n-grams. In CoNLL.

  32. Li, W., Xu, W., Wu, M., Yuan, C., & Lu, Q. (2006). Extractive summarization using inter- and intra- event relevance. In International Conference on Computational Linguistics.

  33. Li, L.-J., Su, H., Xing, E.P., & Fei-Fei, L. (2010). Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS.

  34. Lin, C.Y. (2004). Rouge: A package for automatic evaluation of summaries. In ACL.

  35. Lowe, D. G. (2004). Distinctive image features from scale invariant keypoints. International Journal of Computer Vision, 60, 91–110.


  36. Mason, R., & Charniak, E. (2014). Nonparametric method for data-driven image captioning. In ACL.

  37. Mihalcea, R. (2005). Language independent extractive summarization. In AAAI.

  38. Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Han, X., Mensch, A., Berg, A., Berg, T.L., & Daumé, III, H. (2012). Midge: Generating image descriptions from computer vision detections. In EACL.

  39. Nenkova, A., Vanderwende, L., & McKeown, K. (2006). A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In SIGIR.

  40. Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.


  41. Ordonez, V., Deng, J., Choi, Y., Berg, A.C., & Berg, T.L. (2013). From large scale image categorization to entry-level categories. In ICCV.

  42. Ordonez, V., Kulkarni, G., & Berg, T.L. (2011). Im2text: Describing images using 1 million captioned photographs. In NIPS.

  43. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A method for automatic evaluation of machine translation. In ACL.

  44. Petrov, S., Barrett, L., Thibaux, R., & Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. In COLING/ACL.

  45. Petrov, S., & Klein, D. (2007). Improved inference for unlexicalized parsing. In HLT-NAACL.

  46. Radev, D.R., & Allison, T. (2004). Mead—A platform for multidocument multilingual text summarization. In LREC.

  47. Rashtchian, C., Young, P., Hodosh, M., & Hockenmaier, J. (2010). Collecting image annotations using amazon’s mechanical turk. In NAACL Workshop Creating Speech and Language Data With Amazon’s Mechanical Turk.

  48. Roelleke, T., & Wang, J. (2008). Tf-idf uncovered: a study of theories and probabilities. In SIGIR.

  49. Sivic, J., & Zisserman, A. (2003). Video google: A text retrieval approach to object matching in videos. In ICCV.

  50. Stratos, K., Sood, A., Mensch, A., Han, X., Mitchell, M., Yamaguchi, K., Dodge, J., Goyal, A., Daumé, III, H., Berg, A., & Berg, T.L. (2012). Understanding and predicting importance in images. In CVPR.

  51. Tighe, J., & Lazebnik, S. (2010). Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV.

  52. Torralba, A., Fergus, R., & Freeman, W. (2008). 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1958–1970.


  53. Wong, K.F., Wu, M., & Li, W. (2008). Extractive summarization using supervised and semi-supervised learning. In COLING.

  54. Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In CVPR.

  55. Yang, Y., Teo, C.L., Daumé, III, H., & Aloimonos, Y. (2011). Corpus-guided sentence generation of natural images. In EMNLP.

  56. Yao, B., Yang, X., Lin, L., Lee, M. W., & Zhu, S. C. (2010). I2t: Image parsing to text description. Proceedings of the IEEE.

  57. Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event description. Transactions of the Association for Computational Linguistics, 2, 67–78.



We acknowledge the support of the 2011 JHU-CLSP Summer Workshop Program. Tamara L. Berg and Kota Yamaguchi were supported in part by NSF CAREER IIS-1054133; Hal Daumé III and Amit Goyal were partially supported by NSF Award IIS-1139909.

Author information



Corresponding author

Correspondence to Alexander C. Berg.

Additional information

Communicated by Antonio Torralba and Alexei Efros.


About this article


Cite this article

Ordonez, V., Han, X., Kuznetsova, P. et al. Large Scale Retrieval and Generation of Image Descriptions. Int J Comput Vis 119, 46–59 (2016).



  • Retrieval
  • Image description
  • Data driven
  • Big data
  • Natural language processing