Abstract
In this paper we present a new approach to lemmatisation in inflectional languages, using Polish as an example. We introduce the problem domain, describe the solution, which is based on the Transformer architecture and a learning process on lexical data, and present experimental results showing a high degree of generalisation of the new solution. Finally, we present conclusions and plans for future research.
Acknowledgements
This work has been carried out as part of the Project “SentiCognitiveServices - next generation service for automating voice of customer and social media support based on artificial intelligence methods” (POIR.01.01.01-00-0806/16), cofinanced by the European Regional Development Fund under the Smart Growth Programme 2014–2020.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Walentynowicz, W., Piasecki, M., Kot, A. (2022). Context-free Transformer-based Generative Lemmatiser for Polish. In: Nguyen, N.T., Manolopoulos, Y., Chbeir, R., Kozierkiewicz, A., Trawiński, B. (eds) Computational Collective Intelligence. ICCCI 2022. Lecture Notes in Computer Science, vol 13501. Springer, Cham. https://doi.org/10.1007/978-3-031-16014-1_17
DOI: https://doi.org/10.1007/978-3-031-16014-1_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16013-4
Online ISBN: 978-3-031-16014-1
eBook Packages: Computer Science, Computer Science (R0)