Extractive text-image summarization with relation-enhanced graph attention network

Xie, Feng; Chen, Jingqiang; Chen, Kejia

doi:10.1007/s10844-022-00757-x


Published: 28 October 2022

Journal of Intelligent Information Systems, volume 61, pages 325–341 (2023)


Abstract

Multi-modal summarization with multi-modal output (MSMO) aims to generate a multi-modal summary of a multi-modal document, improving readability by drawing on information from different modalities. Most existing Seq2Seq-based MSMO models do not adequately capture multi-modal relations, which are important for generating high-quality multi-modal summaries. To address this issue, this paper proposes a relation-enhanced graph attention network for extractive text-image summarization (ReGAT-Summ) that captures inter-modal and intra-modal relations in the multi-modal document. First, a multi-modal graph is constructed from the document. Then, node representations are computed by the proposed graph neural network. Finally, a sentence-image selector is trained to select salient sentences and images, which are further aligned by training. To our knowledge, this is the first work to explore a graph-based model for MSMO. Experiments on two news datasets, E-DailyMail and NYTime800k, demonstrate that ReGAT-Summ achieves state-of-the-art performance in terms of automatic metrics and human evaluations.
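
The three stages named in the abstract (graph construction, attention-based node representations, sentence-image selection) can be illustrated with a small toy sketch. This is not the authors' ReGAT-Summ implementation, whose details are not given here. The fully connected graph, the two relation labels "intra" and "inter", the additive relation bias, the residual update, and all function names below are assumptions made purely for illustration.

```python
import math

def build_graph(num_sentences, num_images):
    """Toy multi-modal graph (assumed fully connected); edges carry a relation label."""
    kinds = ["sent"] * num_sentences + ["img"] * num_images
    neighbours = {
        i: [(j, "intra" if kinds[i] == kinds[j] else "inter")
            for j in range(len(kinds)) if j != i]
        for i in range(len(kinds))
    }
    return kinds, neighbours

def softmax(xs):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(feats, neighbours, rel_bias):
    """One attention pass: scaled dot-product logits plus a per-relation bias
    (hypothetical stand-in for relation enhancement), then a residual update."""
    dim = len(feats[0])
    updated = []
    for i, h_i in enumerate(feats):
        nbrs = neighbours[i]
        if not nbrs:  # isolated node: keep its features unchanged
            updated.append(list(h_i))
            continue
        logits = [
            sum(a * b for a, b in zip(h_i, feats[j])) / math.sqrt(dim) + rel_bias[rel]
            for j, rel in nbrs
        ]
        weights = softmax(logits)
        agg = [sum(w * feats[j][d] for w, (j, _) in zip(weights, nbrs))
               for d in range(dim)]
        updated.append([a + b for a, b in zip(h_i, agg)])
    return updated

def select(scores, kinds, k_sent, k_img):
    """Pick the top-scoring sentences and images, returned in document order."""
    def top(kind, k):
        idx = [i for i, t in enumerate(kinds) if t == kind]
        return sorted(sorted(idx, key=lambda i: scores[i], reverse=True)[:k])
    return top("sent", k_sent), top("img", k_img)

# Toy usage: 3 sentences and 2 images with made-up 2-d features.
kinds, neighbours = build_graph(num_sentences=3, num_images=2)
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [1.0, 0.2]]
hidden = attention_layer(feats, neighbours, {"intra": 0.0, "inter": 0.5})
scores = [sum(h) for h in hidden]  # placeholder for a trained selector
print(select(scores, kinds, k_sent=2, k_img=1))
```

In a trained model the features would come from text and image encoders and the scores from a learned selector. Here both are placeholders, so the output only shows the data flow.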


Availability of data and material

The authors declare that all supporting data are available.


Acknowledgements

This research was sponsored by the National Natural Science Foundation of China (No. 61806101). We thank the anonymous reviewers for their helpful comments. Jingqiang Chen is the corresponding author.

Funding

This research was sponsored by the National Natural Science Foundation of China (No. 61806101).

Author information

Feng Xie and Jingqiang Chen contributed equally to this work.

Authors and Affiliations

Nanjing University of Posts and Telecommunications, Nanjing, 210049, Jiangsu, China
Feng Xie, Jingqiang Chen & Kejia Chen


Contributions

Feng Xie and Jingqiang Chen contributed equally to this work.

Corresponding author

Correspondence to Jingqiang Chen.

Ethics declarations

Ethical Approval and Consent to participate

Not Applicable.

Consent for publication

The authors declare that they consent to publication.

Human and Animal Ethics

Not Applicable.

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (zip 6613 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Xie, F., Chen, J. & Chen, K. Extractive text-image summarization with relation-enhanced graph attention network. J Intell Inf Syst 61, 325–341 (2023). https://doi.org/10.1007/s10844-022-00757-x


Received: 25 July 2022
Revised: 11 October 2022
Accepted: 12 October 2022
Published: 28 October 2022
Issue Date: October 2023
DOI: https://doi.org/10.1007/s10844-022-00757-x

