Extractive text-image summarization with relation-enhanced graph attention network

Journal of Intelligent Information Systems

Abstract

Multi-modal summarization with multi-modal output (MSMO) aims to generate a multi-modal summary for a multi-modal document, improving readability by drawing on information from different modalities. Most existing Seq2Seq-based MSMO models fail to capture multi-modal relations well, although these relations are significant for generating high-quality multi-modal summaries. To address this issue, this paper proposes a relation-enhanced graph attention network for extractive text-image summarization (ReGAT-Summ) that captures inter-modal and intra-modal relations in the multi-modal document. First, a multi-modal graph is constructed from the document. Then, node representations are computed by the proposed graph neural network. Finally, a sentence-image selector is trained to select salient sentences and images, which are further aligned by training. To our knowledge, we are the first to explore a graph-based model for MSMO. Experiments on two news datasets, E-DailyMail and NYTime800k, demonstrate that ReGAT-Summ achieves state-of-the-art performance in terms of automatic metrics and human evaluations.
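
The three steps in the abstract (build a graph, compute node representations, select sentences and images) can be illustrated with a minimal sketch. This is not the authors' ReGAT-Summ implementation: scaled dot-product attention with a per-relation bias stands in for the paper's relation-enhanced attention, the selector is an untrained linear scorer, and every name, edge, and feature vector below is a hypothetical toy value.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(features, modality, edges, relation_bias):
    """One round of relation-biased attention over a multi-modal graph.

    features: node id -> feature vector; modality: node id -> "sent" or "img";
    edges: undirected (u, v) pairs; relation_bias: "intra"/"inter" -> float.
    """
    neighbours = {node: [node] for node in features}  # each node attends to itself
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    dim = len(next(iter(features.values())))
    updated = {}
    for i, nbrs in neighbours.items():
        scores = []
        for j in nbrs:
            # Assumed relation typing: same modality = intra-modal, else inter-modal.
            relation = "intra" if modality[i] == modality[j] else "inter"
            scores.append(dot(features[i], features[j]) / math.sqrt(dim)
                          + relation_bias[relation])
        weights = softmax(scores)
        # New representation: attention-weighted average of neighbour features.
        updated[i] = [sum(w * features[j][k] for w, j in zip(weights, nbrs))
                      for k in range(dim)]
    return updated

def select(features, modality, weight, n_sentences, n_images):
    """Rank nodes by a linear salience score and keep the top ones per modality."""
    ranked = sorted(features, key=lambda n: dot(features[n], weight), reverse=True)
    sentences = [n for n in ranked if modality[n] == "sent"][:n_sentences]
    images = [n for n in ranked if modality[n] == "img"][:n_images]
    return sentences, images

# Toy document: three sentences and two images with 2-d features (hypothetical).
features = {"s0": [1.0, 0.0], "s1": [0.0, 1.0], "s2": [0.9, 0.1],
            "i0": [1.0, 0.2], "i1": [0.0, 0.9]}
modality = {"s0": "sent", "s1": "sent", "s2": "sent", "i0": "img", "i1": "img"}
edges = [("s0", "s1"), ("s1", "s2"), ("s0", "i0"), ("s2", "i0"), ("s1", "i1")]
relation_bias = {"intra": 0.0, "inter": 0.5}

hidden = attention_layer(features, modality, edges, relation_bias)
sentences, images = select(hidden, modality, weight=[1.0, 0.0],
                           n_sentences=2, n_images=1)
print(sentences, images)
```

In a trained model the relation biases and the selector weights would be learned from labelled summaries rather than fixed by hand as they are here.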

Availability of data and material

The authors declare that all supporting data are available.

Acknowledgements

This research was sponsored by the National Natural Science Foundation of China (No. 61806101). We thank the anonymous reviewers for their helpful comments. Jingqiang Chen is the corresponding author.

Funding

This research was sponsored by the National Natural Science Foundation of China (No. 61806101).

Author information

Contributions

Feng Xie and Jingqiang Chen contributed equally to this work.

Corresponding author

Correspondence to Jingqiang Chen.

Ethics declarations

Ethical Approval and Consent to participate

Not Applicable.

Consent for publication

The authors declare that they consent to publication.

Human and Animal Ethics

Not Applicable.

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (zip 6613 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xie, F., Chen, J. & Chen, K. Extractive text-image summarization with relation-enhanced graph attention network. J Intell Inf Syst 61, 325–341 (2023). https://doi.org/10.1007/s10844-022-00757-x
