Abstract
This paper explores the applicability of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the new Polish Open Science Metadata Corpus (POSMAC), a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project, which is released with this paper. We compare the results obtained by four different methods, namely plT5kw, extremeText, TermoPL and KeyBERT, and conclude that the plT5kw model yields particularly promising results for both frequent and sparsely represented keywords. Furthermore, a plT5kw keyword generation model trained on POSMAC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts, which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.
Keywords
- Keyword extraction
- T5 language model
- POSMAC
- Polish
Notes
- 1. This paper also offers an up-to-date review of keyword extraction methods.
- 5. We use the traditional term keyword to refer to potentially multiword phrases found in the Keywords section of a scientific abstract.
- 8. DiaBiz is a corpus developed in the CLARIN-Biz project. It contains some 4,000 phone-based customer support calls covering a range of topics and business processes.
Acknowledgements
The work reported here was supported by 1) the European Commission in the CEF Telecom Programme (Action No: 2019-EU-IA-0034, Grant Agreement No: INEA/CEF/ICT/A2019/1926831) and the Polish Ministry of Science and Higher Education (research project 5103/CEF/2020/2, funds for 2020-2022) and 2) the National Centre for Research and Development, research grant POIR.01.01.01-00-1237/19.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Nitoń, B., Ogrodniczuk, M. (2022). Keyword Extraction from Short Texts with a Text-to-Text Transfer Transformer. In: Szczerbicki, E., Wojtkiewicz, K., Nguyen, S.V., Pietranik, M., Krótkiewicz, M. (eds) Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2022. Communications in Computer and Information Science, vol 1716. Springer, Singapore. https://doi.org/10.1007/978-981-19-8234-7_41
DOI: https://doi.org/10.1007/978-981-19-8234-7_41
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8233-0
Online ISBN: 978-981-19-8234-7
eBook Packages: Computer Science, Computer Science (R0)