Deep Neural Network Based Image Captioning

Tripathi, Anurag; Srivastava, Siddharth; Kothari, Ravi

doi:10.1007/978-3-030-04780-1_23

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11297)

Included in the following conference series:

International Conference on Big Data Analytics

Abstract

Generating a concise natural language description of an image enables a number of applications, including fast keyword-based search of large image collections. Driven primarily by advances in deep learning, machine-based image caption generation has recently received substantially increased attention. In this paper, we provide a brief review of deep learning based image caption generation, along with a brief overview of the datasets and metrics used to evaluate captioning algorithms. We conclude with a discussion of promising directions for future research.

Author information

Authors and Affiliations

Indian Institute of Technology Delhi, New Delhi, India
Anurag Tripathi & Siddharth Srivastava
Ashoka University, Sonepat, India
Ravi Kothari

Corresponding author

Correspondence to Ravi Kothari.

Editor information

Editors and Affiliations

Ashoka University, Sonepat, India
Anirban Mondal
IBM Research - India, New Delhi, India
Himanshu Gupta
University of Minnesota, Minneapolis, MN, USA
Jaideep Srivastava
IIIT, Hyderabad, India
P. Krishna Reddy
National Institute of Technology, Warangal, India
D.V.L.N. Somayajulu

About this paper

Cite this paper

Tripathi, A., Srivastava, S., Kothari, R. (2018). Deep Neural Network Based Image Captioning. In: Mondal, A., Gupta, H., Srivastava, J., Reddy, P., Somayajulu, D. (eds) Big Data Analytics. BDA 2018. Lecture Notes in Computer Science, vol 11297. Springer, Cham. https://doi.org/10.1007/978-3-030-04780-1_23

DOI: https://doi.org/10.1007/978-3-030-04780-1_23
Published: 22 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04779-5
Online ISBN: 978-3-030-04780-1
eBook Packages: Computer Science, Computer Science (R0)
