Abstract
We present an attention-based image captioning method that uses DenseNet features. Conventional image captioning methods rely on visual information from the whole scene to generate captions; such a mechanism often fails to capture salient objects and can produce semantically incorrect captions. We instead use an attention mechanism that focuses on the relevant parts of the image to generate a fine-grained description. Image features are extracted with DenseNet, and experiments are conducted on the MSCOCO dataset. Our proposed method achieves BLEU-2, BLEU-3, and BLEU-4 scores of 53.6, 39.8, and 29.5, respectively, outperforming state-of-the-art methods.
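The abstract does not specify the attention architecture, but the described mechanism, scoring DenseNet region features against the decoder state and attending to the relevant regions, can be sketched as Bahdanau-style soft attention. The dimensions, projection matrices (`W_a`, `W_h`, `v`), and random initialisation below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a DenseNet-style feature map flattened into L regions of D dims.
L, D, H = 49, 1024, 512                  # 7x7 spatial grid, feature depth, decoder size

features = rng.standard_normal((L, D))   # a_i: per-region image features
h_prev   = rng.standard_normal(H)        # decoder hidden state at step t-1

# Hypothetical learned projections (randomly initialised here for the sketch).
W_a = rng.standard_normal((D, H)) * 0.01
W_h = rng.standard_normal((H, H)) * 0.01
v   = rng.standard_normal(H) * 0.01

def soft_attention(features, h_prev):
    """Bahdanau-style soft attention: score each region, softmax, weighted sum."""
    scores = np.tanh(features @ W_a + h_prev @ W_h) @ v   # (L,) one score per region
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                  # attention weights, sum to 1
    context = alpha @ features                            # (D,) attended image context
    return context, alpha

context, alpha = soft_attention(features, h_prev)
# context feeds the caption decoder at step t; alpha shows where the model "looks"
```

At each decoding step the context vector is concatenated with the word embedding and fed to the LSTM decoder, so different words can attend to different image regions.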
References
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: The ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, vol. 29, pp. 65–72 (2005)
Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: Computer Vision and Pattern Recognition, pp. 2625–2634 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hossain, M.Z., Sohel, F., Shiratuddin, M.F., Laga, H.: A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. (CSUR) 51, 118 (2019)
Hossain, M.Z., Sohel, F., Shiratuddin, M.F., Laga, H., Bennamoun, M.: Bi-SAN-CAP: bi-directional self-attention for image captioning. In: Digital Image Computing: Techniques and Applications (DICTA) (2019)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. IEEE (2017)
Jia, X., Gavves, E., Fernando, B., Tuytelaars, T.: Guiding the long-short term memory model for image caption generation. In: International Conference on Computer Vision, pp. 2407–2415 (2015)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, vol. 8 (2004)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: International Conference on Learning Representations (2015)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Park, C.C., Kim, B., Kim, G.: Attend to you: personalized image captioning with context sequence memory networks. In: Computer Vision and Pattern Recognition, pp. 6432–6440 (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Hossain, M.Z., Sohel, F., Shiratuddin, M.F., Laga, H., Bennamoun, M. (2019). Attention-Based Image Captioning Using DenseNet Features. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1143. Springer, Cham. https://doi.org/10.1007/978-3-030-36802-9_13
Print ISBN: 978-3-030-36801-2
Online ISBN: 978-3-030-36802-9