Keyword Extraction from Short Texts with a Text-to-Text Transfer Transformer

Part of the Communications in Computer and Information Science book series (CCIS, volume 1716)

Abstract

The paper explores the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the new Polish Open Science Metadata Corpus (POSMAC), a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project and released with this paper. We compare the results obtained by four different methods, namely plT5kw, extremeText, TermoPL and KeyBERT, and conclude that the plT5kw model yields particularly promising results for both frequent and sparsely represented keywords. Furthermore, a plT5kw keyword generation model trained on the POSMAC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts, which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.
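To make the evaluation setting concrete, the following is a minimal sketch of how generated keywords might be scored against the author-assigned keywords of an abstract, using micro-averaged precision, recall and F1 over exact matches. It is an illustration only and is not taken from the paper: the function names, the whitespace and lowercase normalization, and the toy data are all assumptions, and the evaluation protocol actually used by the authors may differ.

```python
from typing import Iterable, List, Set, Tuple


def normalize(keyword: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately simple normalization)."""
    return " ".join(keyword.lower().split())


def micro_prf(
    predicted: Iterable[List[str]], gold: Iterable[List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact keyword matches."""
    tp = n_pred = n_gold = 0
    for pred_kws, gold_kws in zip(predicted, gold):
        pred_set: Set[str] = {normalize(k) for k in pred_kws}
        gold_set: Set[str] = {normalize(k) for k in gold_kws}
        tp += len(pred_set & gold_set)  # keywords present in both sets
        n_pred += len(pred_set)
        n_gold += len(gold_set)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical, not drawn from POSMAC): one keyword list per document.
predicted = [["keyword extraction", "T5"], ["Polish", "corpus", "metadata"]]
gold = [["Keyword  extraction", "language model"], ["Polish", "metadata"]]

precision, recall, f1 = micro_prf(predicted, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching of this kind is strict: a generated keyword that is a valid paraphrase or a differently inflected form of a gold keyword counts as an error, which is one plausible reason why evaluating a generative text-to-text model on this task is harder than evaluating a classifier over a fixed label set.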

Keywords

  • Keyword extraction
  • T5 language model
  • POSMAC
  • Polish

Notes

  1. This paper also offers an up-to-date review of keyword extraction methods.

  2. https://curlicat.eu/.

  3. http://clip.ipipan.waw.pl/POSMAC.

  4. https://bibliotekanauki.pl/.

  5. We use the traditional term keyword to refer to potentially multiword phrases found in the Keywords section of a scientific abstract.

  6. https://huggingface.co/allegro/plt5-large.

  7. See https://vict0rs.ch/2018/05/24/sample-multilabel-dataset/.

  8. DiaBiz is a corpus developed in the CLARIN-Biz project. It contains some 4,000 phone-based customer support calls covering a range of topics and business processes.

References

  1. Boudin, F.: Unsupervised keyphrase extraction with multipartite graphs. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 667–672. Association for Computational Linguistics, New Orleans (2018). https://aclanthology.org/N18-2105

  2. Bougouin, A., Boudin, F., Daille, B.: TopicRank: graph-based topic ranking for keyphrase extraction. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 543–551. Asian Federation of Natural Language Processing, Nagoya (2013). https://aclanthology.org/I13-1062

  3. Chrabrowa, A., et al.: Evaluation of transfer learning for Polish with a text-to-text model. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 4374–4394. European Language Resources Association, Marseille (2022). https://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.466.pdf

  4. El-Beltagy, S.R., Rafea, A.: KP-miner: participation in SemEval-2. In: Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 190–193. Association for Computational Linguistics, Uppsala (2010). https://aclanthology.org/S10-1041

  5. Firoozeh, N., Nazarenko, A., Alizon, F., Daille, B.: Keyword extraction: issues and methods. Nat. Lang. Eng. 26(3), 259–291 (2020)

  6. Florescu, C., Caragea, C.: PositionRank: an unsupervised approach to keyphrase extraction from scholarly documents. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1105–1115. Association for Computational Linguistics, Vancouver (2017). https://aclanthology.org/P17-1102

  7. Frantzi, K., Ananiadou, S., Mima, H.: Automatic recognition of multi-word terms: the C-value/NC-value method. Int. J. Digit. Libr. 3(2), 115–130 (2000). https://doi.org/10.1007/s007999900023

  8. Giarelis, N., Kanakaris, N., Karacapilidis, N.: A comparative assessment of state-of-the-art methods for multilingual unsupervised keyphrase extraction. In: Maglogiannis, I., Macintyre, J., Iliadis, L. (eds.) AIAI 2021. IAICT, vol. 627, pp. 635–645. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-79150-6_50

  9. Grootendorst, M.: KeyBERT: minimal keyword extraction with BERT (2020). https://doi.org/10.5281/zenodo.4461265

  10. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Association for Computational Linguistics, Valencia (2017). http://aclanthology.org/E17-2068

  11. Marciniak, M., Mykowiecka, A., Rychlik, P.: TermoPL—a flexible tool for terminology extraction. In: Calzolari, N., et al. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 2278–2284. European Language Resources Association (2016). http://www.lrec-conf.org/proceedings/lrec2016/pdf/296_Paper.pdf

  12. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411. Association for Computational Linguistics, Barcelona (2004). http://aclanthology.org/W04-3252

  13. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020). http://jmlr.org/papers/v21/20-074.html

  14. Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4512–4525. Association for Computational Linguistics (2020). http://aclanthology.org/2020.emnlp-main.365/

  15. Sechidis, K., Tsoumakas, G., Vlahavas, I.: On the stratification of multi-label data. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) ECML PKDD 2011. LNCS (LNAI), vol. 6913, pp. 145–158. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23808-6_10

  16. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS 2017), pp. 5998–6008 (2017). https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

  17. Wydmuch, M., Jasinska, K., Kuznetsov, M., Busa-Fekete, R., Dembczyński, K.: A no-regret generalization of hierarchical softmax to extreme multi-label classification. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS 2018), pp. 6358–6368. Curran Associates Inc. (2018). https://proceedings.neurips.cc/paper/2018/hash/8b8388180314a337c9aa3c5aa8e2f37a-Abstract.html

Acknowledgements

The work reported here was supported by 1) the European Commission in the CEF Telecom Programme (Action No: 2019-EU-IA-0034, Grant Agreement No: INEA/CEF/ICT/A2019/1926831) and the Polish Ministry of Science and Higher Education (research project 5103/CEF/2020/2, funds for 2020-2022) and 2) the National Centre for Research and Development, research grant POIR.01.01.01-00-1237/19.

Author information

Correspondence to Piotr Pęzik, Agnieszka Mikołajczyk, Adam Wawrzyński, Bartłomiej Nitoń or Maciej Ogrodniczuk.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Nitoń, B., Ogrodniczuk, M. (2022). Keyword Extraction from Short Texts with a Text-to-Text Transfer Transformer. In: Szczerbicki, E., Wojtkiewicz, K., Nguyen, S.V., Pietranik, M., Krótkiewicz, M. (eds) Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2022. Communications in Computer and Information Science, vol 1716. Springer, Singapore. https://doi.org/10.1007/978-981-19-8234-7_41

  • DOI: https://doi.org/10.1007/978-981-19-8234-7_41

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8233-0

  • Online ISBN: 978-981-19-8234-7

  • eBook Packages: Computer Science (R0)