Abstract
Image captioning is a fundamental task in machine learning, since the ability to generate text captions for an image can have great impact by assisting us in day-to-day life. It is not merely an object classification or recognition task: the model must capture the dependencies among the recognized objects and their attributes and encode that knowledge correctly in a caption written in a natural language such as English. In recent years, the internet has been overwhelmed with textual and visual data comprising billions of unstructured images and videos. Meaningful captions serve as useful keys for retrieving, creatively searching, and efficiently browsing these images. In this paper, we analyze and classify the recent state of the art in image captioning and discuss the significant differences among the approaches. We provide a comparative review of existing models and techniques, with their advantages and disadvantages. Future directions in the field of automatic image caption generation are also explored.
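The point above, that a caption must encode objects, their attributes, and the relations between them rather than just a list of labels, can be illustrated with a minimal sketch. The triple format, the function names, and the sentence template below are illustrative assumptions, not the method of any particular surveyed paper:

```python
# Minimal sketch: composing a caption from detected objects, their
# attributes, and the relations (dependencies) between them.

def compose_caption(detections, relations):
    """Turn {object: attribute} detections and (subject, predicate, object)
    relations into a simple English sentence."""
    # Describe each detected object together with its attribute, e.g. "brown dog".
    phrases = {obj: f"{attr} {obj}" for obj, attr in detections.items()}
    # Each relation becomes a clause linking two described objects.
    clauses = [
        f"a {phrases.get(s, s)} {pred} a {phrases.get(o, o)}"
        for s, pred, o in relations
    ]
    return (" and ".join(clauses) + ".").capitalize()

if __name__ == "__main__":
    detections = {"dog": "brown", "sofa": "red"}
    relations = [("dog", "is lying on", "sofa")]
    print(compose_caption(detections, relations))
```

A pure label list for this scene would be just "dog, sofa"; the relation triple is what lets the system produce "A brown dog is lying on a red sofa." Modern neural captioners learn this composition implicitly, but the information they must encode is the same.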
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
Cite this paper
Chaudhari, C.P., Devane, S. (2019). Captioning the Images: A Deep Analysis. In: Iyer, B., Nalbalwar, S., Pathak, N. (eds) Computing, Communication and Signal Processing. Advances in Intelligent Systems and Computing, vol 810. Springer, Singapore. https://doi.org/10.1007/978-981-13-1513-8_100
Print ISBN: 978-981-13-1512-1
Online ISBN: 978-981-13-1513-8