Image Captioning with Local-Global Visual Interaction Network

Wang, Changzhi; Gu, Xiaodong

doi:10.1007/978-981-99-1645-0_38

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1793)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Existing attention-based image captioning approaches treat the local and global features of an image separately, neglecting the intrinsic interaction between them that provides important guidance for caption generation. To alleviate this issue, we propose a novel Local-Global Visual Interaction Network (LGVIN) that explores the interactions between local and global features. Specifically, we devise a new visual interaction graph network that mainly consists of a visual interaction encoding module and a visual interaction fusion module. The former implicitly encodes the visual relationships between local and global features to obtain an enhanced visual representation containing rich local-global feature relationships. The latter fuses the multiple relationship features obtained by the encoding module to further enrich relationship attribute information at different levels. In addition, we introduce a new relationship-attention-based LSTM module that guides word generation by dynamically focusing on the previously output fused relationship information. Extensive experimental results demonstrate the superiority of our LGVIN approach, and our model outperforms current similar relationship-based image captioning methods.
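
The three stages named in the abstract (interaction encoding, fusion, and relationship attention) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention in place of the learned graph network, an element-wise mean in place of the learned fusion module, and a fixed vector in place of the LSTM hidden state. The function names `interact`, `fuse`, and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def interact(local_feats, global_feat):
    """Illustrative local-global interaction (assumed form, not from the paper).

    Each local feature is scored against the global feature. The global
    feature is enhanced with the attention-weighted sum of local features,
    and each local feature is enhanced with the global feature scaled by
    its own attention weight.
    """
    d = len(global_feat)
    weights = softmax([dot(v, global_feat) / math.sqrt(d) for v in local_feats])
    context = [sum(w * v[i] for w, v in zip(weights, local_feats)) for i in range(d)]
    enhanced_global = [g + c for g, c in zip(global_feat, context)]
    enhanced_local = [
        [x + w * g for x, g in zip(v, global_feat)]
        for w, v in zip(weights, local_feats)
    ]
    return enhanced_local, enhanced_global, weights


def fuse(relation_feats):
    """Fuse several relationship features by element-wise mean
    (a stand-in for a learned fusion module)."""
    n = len(relation_feats)
    return [sum(col) / n for col in zip(*relation_feats)]


def attend(relation_feats, hidden):
    """Relationship attention: weight each feature by its similarity to a
    decoder hidden state and return the weighted sum and the weights."""
    d = len(hidden)
    weights = softmax([dot(r, hidden) / math.sqrt(d) for r in relation_feats])
    context = [sum(w * r[i] for w, r in zip(weights, relation_feats)) for i in range(d)]
    return context, weights


if __name__ == "__main__":
    # Toy 2-dimensional features: three image regions and one global vector.
    local_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    global_feat = [0.5, 0.5]
    enh_local, enh_global, w = interact(local_feats, global_feat)
    fused = fuse(enh_local + [enh_global])
    context, attn = attend(enh_local + [enh_global], hidden=[1.0, 0.0])
    print(w, fused, context, attn)
```

In a trained model the dot-product scores and the mean would be replaced by learned projections, and `hidden` would come from the LSTM at each decoding step; the sketch only shows how information flows between the local and global representations.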

This work was supported by the National Natural Science Foundation of China under grant 62176062.

Notes

1. Available: https://github.com/tylin/coco-caption

Author information

Authors and Affiliations

Department of Electronic Engineering, Fudan University, Shanghai, 200438, China
Changzhi Wang & Xiaodong Gu

Corresponding author

Correspondence to Xiaodong Gu.

Editor information

Editors and Affiliations

Indian Institute of Technology Indore, Indore, India
Mohammad Tanveer
Indian Institute of Information Technology - Allahabad, Prayagraj, India
Sonali Agarwal
Kobe University, Kobe, Japan
Seiichi Ozawa
Indian Institute of Technology Patna, Patna, India
Asif Ekbal
University of Innsbruck, Innsbruck, Austria
Adam Jatowt

Cite this paper

Wang, C., Gu, X. (2023). Image Captioning with Local-Global Visual Interaction Network. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1793. Springer, Singapore. https://doi.org/10.1007/978-981-99-1645-0_38

DOI: https://doi.org/10.1007/978-981-99-1645-0_38
Published: 14 April 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1644-3
Online ISBN: 978-981-99-1645-0
eBook Packages: Computer Science, Computer Science (R0)
