
Large Scale Retrieval and Generation of Image Descriptions


Abstract

What is the story of an image? What is the relationship between pictures, language, and information we can extract using state of the art computational recognition systems? In an attempt to address both of these questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, we would like our generated textual descriptions (captions) to both sound like a person wrote them, and also remain true to the image content. To do this we develop data-driven approaches for image description generation, using retrieval-based techniques to gather either: (a) whole captions associated with a visually similar image, or (b) relevant bits of text (phrases) from a large collection of image + description pairs. In the case of (b), we develop optimization algorithms to merge the retrieved phrases into valid natural language sentences. The end result is two simple, but effective, methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.
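
As a concrete illustration of approach (a), the sketch below transfers the caption of the single most visually similar image in a captioned collection, using nearest-neighbor search over global feature vectors. This is a minimal sketch, not the paper's implementation: the feature arrays and plain L2 distance are assumptions for illustration, whereas the actual system combines several global descriptors over a large collection of captioned photographs (cf. Ordonez et al. 2011).

    import numpy as np

    def transfer_caption(query_feat, collection_feats, collection_captions):
        # query_feat: (d,) global descriptor of the query image
        # collection_feats: (n, d) descriptors of the captioned collection
        # collection_captions: list of n caption strings
        # Return the caption attached to the nearest neighbor under L2 distance.
        dists = np.linalg.norm(collection_feats - query_feat, axis=1)
        return collection_captions[int(np.argmin(dists))]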


Notes

  1. http://www.imageclef.org/2011.

  2. The coefficient \(\alpha\) can be tuned via grid search, and scores are normalized to lie in \([0, 1]\) (see the first sketch following these notes).

  3. An interesting but non-trivial extension to this generation technique is allowing re-ordering or omission of phrases (Kuznetsova et al. 2012); the second sketch following these notes shows the fixed-order composition that such an extension would relax.
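
To make Note 2 concrete, here is a minimal sketch assuming min-max normalization and a held-out validation criterion supplied by the caller: two score vectors are normalized into [0, 1] and combined as \(\alpha s_1 + (1-\alpha) s_2\), with \(\alpha\) chosen by grid search. The evaluate callback and function names are hypothetical placeholders, not part of the paper's code.

    import numpy as np

    def minmax(scores):
        # Map raw scores into [0, 1]; a constant vector maps to all zeros.
        s = np.asarray(scores, dtype=float)
        lo, hi = s.min(), s.max()
        return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)

    def grid_search_alpha(s1, s2, evaluate, steps=21):
        # Try alpha in {0, 0.05, ..., 1} and keep the value whose combined
        # score does best on the caller's validation criterion.
        s1, s2 = minmax(s1), minmax(s2)
        best_alpha, best_val = 0.0, float("-inf")
        for alpha in np.linspace(0.0, 1.0, steps):
            val = evaluate(alpha * s1 + (1.0 - alpha) * s2)
            if val > best_val:
                best_alpha, best_val = alpha, val
        return best_alpha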
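
As context for Note 3, this second sketch composes one retrieved phrase per slot in fixed left-to-right order, keeping the combination that a caller-supplied language-model score rates highest; the re-ordering and omission of phrases the note describes is exactly what this simple scheme lacks. All names are illustrative, and the exhaustive product is only practical for a handful of candidates per slot.

    from itertools import product

    def compose_caption(phrase_slots, score_fn):
        # phrase_slots: ordered list of candidate lists, e.g.
        #   [["a brown dog"], ["running", "sleeping"], ["on the beach"]]
        # score_fn: maps a full candidate caption string to a (log-)score.
        best_caption, best_score = None, float("-inf")
        for combo in product(*phrase_slots):
            caption = " ".join(combo)
            score = score_fn(caption)
            if score > best_score:
                best_caption, best_score = caption, score
        return best_caption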

References

  • Aker, A., & Gaizauskas, R. (2010). Generating image descriptions using dependency relational patterns. In ACL.

  • Barnard, K., Duygulu, P., de Freitas, N., Forsyth, D., Blei, D., & Jordan, M. (2003). Matching words and pictures. Journal of Machine Learning Research, 3, 1107–1135.


  • Berg, T., Berg, A., Edwards, J., & Forsyth, D. (2004). Who’s in the picture? In NIPS.

  • Berg, T., Berg, A., Edwards, J., Maire, M., White, R., Learned-Miller, E., Teh, Y., & Forsyth, D. (2004). Names and faces. In CVPR.

  • Berg, T.L., Berg, A.C., & Shih, J. (2010). Automatic attribute discovery and characterization from noisy web data. In ECCV.

  • Brants, T., & Franz, A. (2006). Web 1T 5-gram version 1. In LDC.

  • Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In WWW.

  • Chum, O., Philbin, J., & Zisserman, A. (2008). Near duplicate image detection: min-hash and tf-idf weighting. In BMVC.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.

  • Deng, J., Berg, A.C., & Fei-Fei, L. (2011). Hierarchical semantic indexing for large scale image retrieval. In CVPR.

  • Deng, J., Berg, A.C., Li, K., & Fei-Fei, L. (2010). What does classifying more than 10,000 image categories tell us? In ECCV.

  • Deng, J., Krause, J., Berg, A.C., & Fei-Fei, L. (2012). Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. In CVPR.

  • Deng, J., Satheesh, S., Berg, A.C., & Fei-Fei, L. (2011). Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS.

  • Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. (2002). Object recognition as machine translation. In ECCV.

  • Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D.A. (2009). Describing objects by their attributes. In CVPR.

  • Farhadi, A., Hejrati, M., Sadeghi, A., Young, P., Rashtchian, C., Hockenmaier, J., & Forsyth, D.A. (2010). Every picture tells a story: generating sentences for images. In ECCV.

  • Felzenszwalb, P.F., Girshick, R.B., & McAllester, D. (2011). Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/

  • Feng, Y., & Lapata, M. (2010). How many words is a picture worth? Automatic caption generation for news images. In ACL.

  • Ferrari, V., & Zisserman, A. (2007). Learning visual attributes. In NIPS.

  • Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., & Saenko, K. (2013). YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV.

  • Hays, J., & Efros, A.A. (2008). im2gps: estimating geographic information from a single image. In CVPR.

  • Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853–899.


  • Hoiem, D., Efros, A.A., & Hebert, M. (2005). Geometric context from a single image. In ICCV.

  • Jing, Y., & Baluja, S. (2008). Pagerank for product image search. In WWW.

  • Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., et al. (2013). Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 2891–2903.


  • Kumar, N., Berg, A.C., Belhumeur, P.N., & Nayar, S.K. (2009). Attribute and simile classifiers for face verification. In ICCV.

  • Kuznetsova, P., Ordonez, V., Berg, A., Berg, T.L., & Choi, Y. (2012). Collective generation of natural image descriptions. In ACL.

  • Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., & Choi, Y. (2013). Generalizing image captions for image-text parallel corpus. In ACL.

  • Lampert, C., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In CVPR.

  • Leung, T.K., & Malik, J. (1999). Recognizing surfaces using three-dimensional textons. In ICCV.

  • Li, S., Kulkarni, G., Berg, T.L., Berg, A.C., & Choi, Y. (2011). Composing simple image descriptions using web-scale n-grams. In CoNLL.

  • Li, W., Xu, W., Wu, M., Yuan, C., & Lu, Q. (2006). Extractive summarization using inter- and intra-event relevance. In International Conference on Computational Linguistics.

  • Li, L.-J., Su, H., Xing, E.P., & Fei-Fei, L. (2010). Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS.

  • Lin, C.Y. (2004). ROUGE: A package for automatic evaluation of summaries. In ACL.

  • Lowe, D. G. (2004). Distinctive image features from scale invariant keypoints. International Journal of Computer Vision, 60, 91–110.


  • Mason, R., & Charniak, E. (2014). Nonparametric method for data-driven image captioning. In ACL.

  • Mihalcea, R. (2005). Language independent extractive summarization. In AAAI.

  • Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Han, X., Mensch, A., Berg, A., Berg, T.L., & Daumé, III, H. (2012). Midge: Generating image descriptions from computer vision detections. In EACL.

  • Nenkova, A., Vanderwende, L., & McKeown, K. (2006). A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In SIGIR.

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.


  • Ordonez, V., Deng, J., Choi, Y., Berg, A.C., & Berg, T.L. (2013). From large scale image categorization to entry-level categories. In ICCV.

  • Ordonez, V., Kulkarni, G., & Berg, T.L. (2011). Im2text: Describing images using 1 million captioned photographs. In NIPS.

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In ACL.

  • Petrov, S., Barrett, L., Thibaux, R., & Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. In COLING/ACL.

  • Petrov, S., & Klein, D. (2007). Improved inference for unlexicalized parsing. In HLT-NAACL.

  • Radev, D.R., & Allison, T. (2004). Mead—A platform for multidocument multilingual text summarization. In LREC.

  • Rashtchian, C., Young, P., Hodosh, M., & Hockenmaier, J. (2010). Collecting image annotations using Amazon’s Mechanical Turk. In NAACL Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk.

  • Roelleke, T., & Wang, J. (2008). Tf-idf uncovered: a study of theories and probabilities. In SIGIR.

  • Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In ICCV.

  • Stratos, K., Sood, A., Mensch, A., Han, X., Mitchell, M., Yamaguchi, K., Dodge, J., Goyal, A., Daumé, III, H., Berg, A., & Berg, T.L. (2012). Understanding and predicting importance in images. In CVPR.

  • Tighe, J., & Lazebnik, S. (2010). Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV.

  • Torralba, A., Fergus, R., & Freeman, W. (2008). 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1958–1970.


  • Wong, K.F., Wu, M., & Li, W. (2008). Extractive summarization using supervised and semi-supervised learning. In COLING.

  • Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In CVPR.

  • Yang, Y., Teo, C.L., Daumé, III, H., & Aloimonos, Y. (2011). Corpus-guided sentence generation of natural images. In EMNLP.

  • Yao, B., Yang, X., Lin, L., Lee, M. W., & Zhu, S. C. (2010). I2T: Image parsing to text description. Proceedings of the IEEE.

  • Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67–78.


Acknowledgments

This work was supported in part by the 2011 JHU-CLSP Summer Workshop Program. Tamara L. Berg and Kota Yamaguchi were supported in part by NSF CAREER IIS-1054133; Hal Daumé III and Amit Goyal were partially supported by NSF Award IIS-1139909.

Author information


Corresponding author

Correspondence to Alexander C. Berg.

Additional information

Communicated by Antonio Torralba and Alexei Efros.


About this article


Cite this article

Ordonez, V., Han, X., Kuznetsova, P. et al. Large Scale Retrieval and Generation of Image Descriptions. Int J Comput Vis 119, 46–59 (2016). https://doi.org/10.1007/s11263-015-0840-y

