Abstract
Multi-modal summarization with multi-modal output (MSMO) aims to generate a multi-modal summary of a multi-modal document, improving the readability of the summary by drawing on information from different modalities. Most existing Seq2Seq-based MSMO models do not adequately capture multi-modal relations, which are important for generating high-quality multi-modal summaries. To address this issue, this paper proposes a relation-enhanced graph attention network for extractive text-image summarization (ReGAT-Summ) that captures inter-modal and intra-modal relations in the multi-modal document. First, a multi-modal graph is constructed from the document. Then, node representations are computed by the proposed graph neural network. Finally, a sentence-image selector is trained to select salient sentences and images, which are further aligned through training. To our knowledge, this is the first work to explore a graph-based model for MSMO. Experiments on two news datasets, E-DailyMail and NYTime800k, demonstrate that ReGAT-Summ achieves state-of-the-art performance on both automatic metrics and human evaluation.
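The three steps named in the abstract (graph construction, node representation, selection) can be illustrated with a toy sketch. This is not the ReGAT-Summ architecture, whose details the abstract does not give. It assumes a fully connected graph, dot-product attention with a scalar bias per relation type, and a fixed linear scorer. All names (`relation_attention_layer`, `relation_bias`, `select`) are hypothetical, and the sentence-image alignment step is omitted.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relation_attention_layer(features, kinds, relation_bias):
    """One attention pass over a fully connected multi-modal graph.

    features: node vectors (sentences and images in a shared space).
    kinds: "sent" or "img" label for each node.
    relation_bias: assumed scalar bias per (kind, kind) relation type,
        standing in for learned intra-/inter-modal relation parameters.
    """
    updated = []
    for i, h_i in enumerate(features):
        # Attention logit = feature similarity + bias of the relation type.
        logits = [
            dot(h_i, h_j) + relation_bias[(kinds[i], kinds[j])]
            for j, h_j in enumerate(features)
        ]
        weights = softmax(logits)
        # New representation = attention-weighted sum of all node features.
        updated.append([
            sum(w * h[d] for w, h in zip(weights, features))
            for d in range(len(h_i))
        ])
    return updated


def select(features, kinds, scorer, num_sentences, num_images):
    """Score nodes with a linear scorer and keep the top ones per modality."""
    scores = [dot(scorer, h) for h in features]

    def top(kind, k):
        idx = [i for i, t in enumerate(kinds) if t == kind]
        best = sorted(idx, key=lambda i: scores[i], reverse=True)[:k]
        return sorted(best)  # restore document order

    return top("sent", num_sentences), top("img", num_images)


# Toy document: three sentence nodes and two image nodes (made-up features).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2], [0.1, 0.9]]
kinds = ["sent", "sent", "sent", "img", "img"]
relation_bias = {
    ("sent", "sent"): 0.0, ("img", "img"): 0.0,    # intra-modal relations
    ("sent", "img"): 0.5, ("img", "sent"): 0.5,    # inter-modal relations
}
hidden = relation_attention_layer(features, kinds, relation_bias)
sentences, images = select(hidden, kinds, [1.0, 0.0], 2, 1)
print(sentences, images)
```

In a trained model the relation parameters and the scorer would be learned from labelled summaries rather than fixed by hand as they are here.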
Availability of data and material
The authors declare that all supporting data are available.
Acknowledgements
This research was sponsored by the National Natural Science Foundation of China (No. 61806101). We thank the anonymous reviewers for their helpful comments. JingQiang Chen is the corresponding author.
Funding
This research was sponsored by the National Natural Science Foundation of China (No. 61806101).
Author information
Contributions
Feng Xie and JingQiang Chen contributed equally to this work.
Ethics declarations
Ethical Approval and Consent to participate
Not Applicable.
Consent for publication
The authors declare that they consent to publication.
Human and Animal Ethics
Not Applicable.
Competing interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xie, F., Chen, J. & Chen, K. Extractive text-image summarization with relation-enhanced graph attention network. J Intell Inf Syst 61, 325–341 (2023). https://doi.org/10.1007/s10844-022-00757-x
DOI: https://doi.org/10.1007/s10844-022-00757-x