Captioning the Images: A Deep Analysis

Conference paper in Computing, Communication and Signal Processing

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 810)

Abstract

Image captioning is one of the fundamental tasks in machine learning, since the ability to generate textual captions for an image can greatly assist us in day-to-day life. It is not merely an object classification or recognition task: the model must capture the dependencies among the recognized objects and their attributes and encode that knowledge correctly in a caption written in a natural language such as English. In recent years, the internet has been overwhelmed with a huge amount of textual and visual data, including billions of unstructured images and videos. Meaningful captions serve as useful keys for retrieval, creative searching, and powerful browsing of these images. In this paper, we analyze and classify the recent state of the art in image captioning and discuss the significant differences among the approaches. We provide a comparative review of existing models and techniques, along with their advantages and disadvantages. Future directions in the field of automatic image caption generation are also explored.
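The task decomposition the abstract describes (detect objects and their attributes, model the dependencies among them, then realize that knowledge as an English sentence) can be illustrated with a toy template-based generator in the spirit of early detection-driven captioning systems. This is an illustrative sketch only, not code from the paper; the detection dictionary format and the phrase templates are hypothetical assumptions made for the example.

```python
# Toy template-based captioner (illustrative sketch, not from the paper).
# Each hypothetical "detection" carries an object noun, an optional
# attribute, and an optional (relation, target-object) dependency.

def generate_caption(detections):
    """Realize a list of detection dicts as a single English caption."""
    phrases = []
    for det in detections:
        attr = det.get("attribute")
        obj = det["object"]
        rel = det.get("relation")  # e.g. ("sitting on", "sofa")
        # Attach the attribute to the noun, then the dependency phrase.
        noun = f"{attr} {obj}" if attr else obj
        phrase = f"a {noun}"
        if rel:
            phrase += f" {rel[0]} a {rel[1]}"
        phrases.append(phrase)
    if not phrases:
        return "An empty scene."
    return "There is " + " and ".join(phrases) + "."

detections = [
    {"object": "dog", "attribute": "brown", "relation": ("sitting on", "sofa")},
    {"object": "ball", "attribute": "red"},
]
print(generate_caption(detections))
# -> There is a brown dog sitting on a sofa and a red ball.
```

The sketch makes the survey's point concrete: recognizing "dog" and "ball" alone is not enough; the caption is only correct because the attribute ("brown") and the spatial dependency ("sitting on a sofa") are bound to the right objects before language generation.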



Author information

Correspondence to Chaitrali P. Chaudhari.


Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Chaudhari, C.P., Devane, S. (2019). Captioning the Images: A Deep Analysis. In: Iyer, B., Nalbalwar, S., Pathak, N. (eds) Computing, Communication and Signal Processing. Advances in Intelligent Systems and Computing, vol 810. Springer, Singapore. https://doi.org/10.1007/978-981-13-1513-8_100

Download citation

  • DOI: https://doi.org/10.1007/978-981-13-1513-8_100

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1512-1

  • Online ISBN: 978-981-13-1513-8

  • eBook Packages: Engineering (R0)
