Context-free Transformer-based Generative Lemmatiser for Polish

  • Conference paper
Computational Collective Intelligence (ICCCI 2022)

Abstract

In this paper we present a new approach to lemmatisation in inflectional languages, using Polish as an example. We introduce the problem domain, describe the solution (a Transformer architecture trained on lexical data) and present experimental results showing that the new solution generalises to a high degree. We close with conclusions and plans for future research.

Acknowledgements

This work has been carried out as part of the Project “SentiCognitiveServices - next generation service for automating voice of customer and social media support based on artificial intelligence methods” (POIR.01.01.01-00-0806/16), cofinanced by the European Regional Development Fund under the Smart Growth Programme 2014–2020.

Author information

Corresponding author

Correspondence to Wiktor Walentynowicz.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Walentynowicz, W., Piasecki, M., Kot, A. (2022). Context-free Transformer-based Generative Lemmatiser for Polish. In: Nguyen, N.T., Manolopoulos, Y., Chbeir, R., Kozierkiewicz, A., Trawiński, B. (eds) Computational Collective Intelligence. ICCCI 2022. Lecture Notes in Computer Science, vol 13501. Springer, Cham. https://doi.org/10.1007/978-3-031-16014-1_17

  • DOI: https://doi.org/10.1007/978-3-031-16014-1_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16013-4

  • Online ISBN: 978-3-031-16014-1

  • eBook Packages: Computer Science, Computer Science (R0)
