Improve Image Captioning by Self-attention

  • Conference paper
Neural Information Processing (ICONIP 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1143)

Abstract

The conventional attention mechanism has been widely adopted in prevalent image captioning frameworks. In most prior work, attention weights are determined only by the visual features and the hidden state of a Recurrent Neural Network (RNN), while the interactions among the visual features themselves are not modelled. In this paper, we introduce self-attention into the image captioning framework to leverage non-local correlations among visual features. Moreover, we propose three distinct methods for fusing self-attention with the conventional attention mechanism. Extensive experiments on the MSCOCO dataset show that self-attention enables the captioning model to achieve performance competitive with state-of-the-art methods.
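For concreteness, the sketch below illustrates the general idea the abstract describes: a scaled dot-product self-attention layer first relates image region features to one another, and a conventional additive attention, conditioned on the RNN hidden state, then selects a visual context vector. The module names, dimensions, and the particular fusion shown (feeding self-attended features into the conventional attention) are illustrative assumptions, not the authors' implementation or their three fusion variants.

```python
# Minimal sketch (not the authors' code): self-attention over visual features,
# followed by a conventional additive attention driven by the RNN hidden state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSelfAttention(nn.Module):
    """Scaled dot-product self-attention over a set of image region features."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** 0.5

    def forward(self, v):                                   # v: (batch, regions, dim)
        q, k = self.query(v), self.key(v)
        attn = F.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        return attn @ self.value(v) + v                     # residual, as in non-local blocks

class ConventionalAttention(nn.Module):
    """Additive attention whose weights depend on visual features and the RNN hidden state."""
    def __init__(self, dim, hid):
        super().__init__()
        self.v_proj = nn.Linear(dim, hid)
        self.h_proj = nn.Linear(hid, hid)
        self.score = nn.Linear(hid, 1)

    def forward(self, v, h):                                # v: (B, R, dim), h: (B, hid)
        e = self.score(torch.tanh(self.v_proj(v) + self.h_proj(h).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)                         # one weight per region
        return (alpha * v).sum(dim=1)                       # attended visual context: (B, dim)

# Hypothetical usage: 36 region features per image, one LSTM decoding step.
v = torch.randn(2, 36, 512)
h = torch.randn(2, 512)
context = ConventionalAttention(512, 512)(VisualSelfAttention(512)(v), h)
print(context.shape)                                        # torch.Size([2, 512])
```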

Acknowledgement

This work is supported by NSFC (Nos. 61772330, 61533012, 61876109), the pre-research project (No. 61403120201), the Shanghai authentication key lab (2017XCWZK01), and the Technology Committee interdisciplinary program of Shanghai Jiao Tong University (YG2019QNA09).

Author information

Corresponding author

Correspondence to Hongtao Lu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Z., Li, Y., Lu, H. (2019). Improve Image Captioning by Self-attention. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1143. Springer, Cham. https://doi.org/10.1007/978-3-030-36802-9_11

  • DOI: https://doi.org/10.1007/978-3-030-36802-9_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-36801-2

  • Online ISBN: 978-3-030-36802-9

  • eBook Packages: Computer Science, Computer Science (R0)
