Unsupervised Key-Phrase Extraction from Long Texts with Multilingual Sentence Transformers

  • Conference paper
Discovery Science (DS 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14276)

Abstract

Key-phrase extraction concerns retrieving a small set of phrases that encapsulate the core concepts of an input textual document. As in other text mining tasks, current methods often rely on pre-trained neural language models. Using these models, state-of-the-art supervised systems for key-phrase extraction require large amounts of labelled data and generalize poorly outside the training domain, while unsupervised approaches generally achieve lower accuracy. This paper presents a multilingual unsupervised approach to key-phrase extraction, improving upon previous methods in several ways (e.g., using representations from pre-trained Transformer models, while supporting the processing of long documents). Experimental results on datasets covering multiple languages and domains attest to the quality of the results.
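The full pipeline is described in the body of the paper, but the core idea summarized above (scoring candidate phrases by their similarity to the document in a multilingual sentence-embedding space) can be sketched briefly. The snippet below is a minimal illustration, not the authors' implementation: the model name, the n-gram candidate generation, and the English stop-word list are assumptions for the example, and the long-document handling highlighted in the abstract is not shown.

```python
# Minimal EmbedRank-style sketch (illustrative only, not the paper's exact method):
# embed candidate n-grams and the document with a multilingual sentence transformer,
# then rank candidates by cosine similarity to the document embedding.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def extract_key_phrases(document: str, top_k: int = 5) -> list[str]:
    # Candidate phrases: unigrams to trigrams, with English stop words removed.
    # (A multilingual pipeline would use a language-appropriate stop list or POS patterns.)
    vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
    vectorizer.fit([document])
    candidates = vectorizer.get_feature_names_out().tolist()

    # Embed the document and the candidates in the same multilingual space.
    # Model choice is an assumption; any multilingual sentence-transformers model works here.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    doc_embedding = model.encode([document])
    candidate_embeddings = model.encode(candidates)

    # Rank candidates by similarity to the whole-document embedding.
    scores = cosine_similarity(candidate_embeddings, doc_embedding).ravel()
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [phrase for phrase, _ in ranked[:top_k]]


if __name__ == "__main__":
    text = (
        "Key-phrase extraction retrieves a small set of phrases that "
        "encapsulate the core concepts of a textual document."
    )
    print(extract_key_phrases(text))
```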

Notes

  1. https://github.com/epfml/sent2vec
  2. https://github.com/jhlau/doc2vec
  3. https://github.com/araag2/KP_Extraction
  4. https://spacy.io/models/
  5. https://github.com/adbar/simplemma/
  6. https://www.sbert.net/
  7. https://huggingface.co/allenai/longformer-large-4096
  8. https://huggingface.co/spaces/mteb/leaderboard

Acknowledgement

This research was supported by the European Union’s H2020 research and innovation programme, under grant agreement No. 874850 (MOOD), as well as by Fundação para a Ciência e Tecnologia (FCT), namely through the INESC-ID multi-annual funding with reference UIDB/50021/2020, and through the project grants with references PTDC/CCI-CIF/32607/2017 (MIMU), DSAIPA/DS/0102/2019 (DEBAQI), and POCI/01/0145/FEDER/031460 (DARGMINTS).

Author information

Corresponding author

Correspondence to Bruno Martins.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dias, H., Guimarães, A., Martins, B., Roche, M. (2023). Unsupervised Key-Phrase Extraction from Long Texts with Multilingual Sentence Transformers. In: Bifet, A., Lorena, A.C., Ribeiro, R.P., Gama, J., Abreu, P.H. (eds) Discovery Science. DS 2023. Lecture Notes in Computer Science (LNAI), vol. 14276. Springer, Cham. https://doi.org/10.1007/978-3-031-45275-8_10

  • DOI: https://doi.org/10.1007/978-3-031-45275-8_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45274-1

  • Online ISBN: 978-3-031-45275-8

  • eBook Packages: Computer Science, Computer Science (R0)
