Unsupervised Key-Phrase Extraction from Long Texts with Multilingual Sentence Transformers

  • Conference paper
Discovery Science (DS 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14276)

Abstract

Key-phrase extraction concerns retrieving a small set of phrases that encapsulate the core concepts of an input textual document. As in other text mining tasks, current methods often rely on pre-trained neural language models. Using these models, state-of-the-art supervised systems for key-phrase extraction require large amounts of labelled data and generalize poorly outside the training domain, while unsupervised approaches generally achieve lower accuracy. This paper presents a multilingual unsupervised approach to key-phrase extraction, improving upon previous methods in several ways (e.g., using representations from pre-trained Transformer models, while supporting the processing of long documents). Experimental results on datasets covering multiple languages and domains attest to the quality of the results.
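The full pipeline is described in the body of the paper, but the core idea summarized above (scoring candidate phrases by their similarity to the document in a multilingual sentence-embedding space) can be sketched briefly. The snippet below is a minimal illustration, not the authors' implementation: the model name, the n-gram candidate generation, and the English stop-word list are assumptions for the example, and the long-document handling highlighted in the abstract is not shown.

```python
# Minimal EmbedRank-style sketch (illustrative only, not the paper's exact method):
# embed candidate n-grams and the document with a multilingual sentence transformer,
# then rank candidates by cosine similarity to the document embedding.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def extract_key_phrases(document: str, top_k: int = 5) -> list[str]:
    # Candidate phrases: unigrams to trigrams, with English stop words removed.
    # (A multilingual pipeline would use a language-appropriate stop list or POS patterns.)
    vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
    vectorizer.fit([document])
    candidates = vectorizer.get_feature_names_out().tolist()

    # Embed the document and the candidates in the same multilingual space.
    # Model choice is an assumption; any multilingual sentence-transformers model works here.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    doc_embedding = model.encode([document])
    candidate_embeddings = model.encode(candidates)

    # Rank candidates by similarity to the whole-document embedding.
    scores = cosine_similarity(candidate_embeddings, doc_embedding).ravel()
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [phrase for phrase, _ in ranked[:top_k]]


if __name__ == "__main__":
    text = (
        "Key-phrase extraction retrieves a small set of phrases that "
        "encapsulate the core concepts of a textual document."
    )
    print(extract_key_phrases(text))
```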

Notes

  1. https://github.com/epfml/sent2vec
  2. https://github.com/jhlau/doc2vec
  3. https://github.com/araag2/KP_Extraction
  4. https://spacy.io/models/
  5. https://github.com/adbar/simplemma/
  6. https://www.sbert.net/
  7. https://huggingface.co/allenai/longformer-large-4096
  8. https://huggingface.co/spaces/mteb/leaderboard

Acknowledgement

This research was supported by the European Union’s H2020 research and innovation programme, under grant agreement No. 874850 (MOOD), as well as by Fundação para a Ciência e Tecnologia (FCT), namely through the INESC-ID multi-annual funding with reference UIDB/50021/2020, and through the project grants with references PTDC/CCI-CIF/32607/2017 (MIMU), DSAIPA/DS/0102/2019 (DEBAQI), and POCI/01/0145/FEDER/031460 (DARGMINTS).

Author information

Corresponding author

Correspondence to Bruno Martins.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dias, H., Guimarães, A., Martins, B., Roche, M. (2023). Unsupervised Key-Phrase Extraction from Long Texts with Multilingual Sentence Transformers. In: Bifet, A., Lorena, A.C., Ribeiro, R.P., Gama, J., Abreu, P.H. (eds) Discovery Science. DS 2023. Lecture Notes in Computer Science (LNAI), vol. 14276. Springer, Cham. https://doi.org/10.1007/978-3-031-45275-8_10

  • DOI: https://doi.org/10.1007/978-3-031-45275-8_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45274-1

  • Online ISBN: 978-3-031-45275-8

  • eBook Packages: Computer Science, Computer Science (R0)
