Part of Speech Tagging for Polish: State of the Art and Future Perspectives

Kobyliński, Łukasz; Kieraś, Witold

doi:10.1007/978-3-319-75477-2_21

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9623)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In this paper we discuss the intricacies of part-of-speech (POS) tagging for Polish, present the current state of the art by comparing the available taggers in detail, and show the main obstacles that prevent Polish POS tagging from exceeding an accuracy of 91% of correctly tagged word segments. As this result is not only lower than for English taggers but also below the results for other highly inflective languages, such as Czech and Slovene, we try to identify the main weaknesses of the taggers, their underlying algorithms, the training data, or difficulties inherent to the language that explain this difference. For this purpose we analyze the errors made individually by each of the available Polish POS taggers, by an ensemble of the taggers, and by the publicly available, well-known OpenNLP tagger, adapted to the Polish tagset. Finally, we propose further steps that should be taken to narrow the gap between Polish and English POS tagging performance.

Notes

1. The IPI PAN Corpus was the first large, POS-tagged reference corpus of Polish; it has since been superseded by the National Corpus of Polish.
2. Now also available online at http://sgjp.pl.
3. http://opennlp.apache.org/.
4. As the difference in accuracy between these two approaches turned out not to be statistically significant, we have limited further experiments to maximum entropy models. Trained models are available at: http://zil.ipipan.waw.pl/OpenNLP.
5. In fact, this number is further reduced by the morphosyntactic analyser.
6. One exception to this general observation is gerunds (ger), which are, however, systematically homonymous with nouns and are thus extremely difficult to disambiguate, not only for taggers but also for human annotators.
7. This phenomenon is typical of fusional languages such as Polish and other Slavonic languages.

References

Manning, C.D.: Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In: Gelbukh, A.F. (ed.) CICLing 2011. LNCS, vol. 6608, pp. 171–189. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19400-9_14
Przepiórkowski, A., Woliński, M.: The unbearable lightness of tagging: a case study in morphosyntactic tagging of Polish. In: Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03), EACL 2003, pp. 109–116 (2003)
Przepiórkowski, A.: A comparison of two morphosyntactic tagsets of Polish. In: Koseska-Toszewa, V., Dimitrova, L., Roszko, R. (eds.) Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop, Warsaw, pp. 138–144 (2009)
Woliński, M.: Morfeusz reloaded. [18], pp. 1106–1111
Woliński, M.: Morfeusz: a practical tool for the morphological analysis of Polish. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.) Intelligent Information Processing and Web Mining. AINSC, vol. 35, pp. 503–512. Springer, Heidelberg (2006). https://doi.org/10.1007/3-540-33521-8_55
Saloni, Z., Woliński, M., Wołosz, R., Gruszczyński, W., Skowrońska, D.: Słownik gramatyczny języka polskiego, 2nd edn. Warszawa (2012)
Przepiórkowski, A., Bańko, M., Górski, R., Lewandowska-Tomaszczyk, B. (eds.) Narodowy Korpus Języka Polskiego. Warszawa (2012)
Dębowski, Ł.: Trigram morphosyntactic tagger for Polish. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.) Intelligent Information Processing and Web Mining. AINSC, vol. 25, pp. 409–413. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-39985-8_43
Piasecki, M.: Polish tagger TaKIPI: rule based construction and optimisation. Task Q. 11, 151–167 (2007)
Acedański, S.: A morphosyntactic Brill tagger for inflectional languages. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) NLP 2010. LNCS (LNAI), vol. 6233, pp. 3–14. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14770-8_3
Radziszewski, A., Śniatowski, T.: A memory-based tagger for Polish. In: Proceedings of the LTC 2011 (2011)
Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. SCI, vol. 467, pp. 215–230. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16
Waszczuk, J.: Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India, pp. 2789–2804 (2012)
Kobyliński, Ł.: PoliTa: a multitagger for Polish. [18], pp. 2949–2954
Radziszewski, A., Acedański, S.: Taggers gonna tag: an argument against evaluating disambiguation capacities of morphosyntactic taggers. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 81–87. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32790-2_9
Awramiuk, E.: Systemowość polskiej homonimii międzyparadygmatycznej. Białystok (1999)
Radziszewski, A.: Evaluation of lemmatisation accuracy of four Polish taggers. In: Proceedings of the LTC 2013 (2013)
Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavík, Iceland. ELRA (2014)

Acknowledgment

Work partly financed by the Polish Ministry of Science and Higher Education, a program in support of scientific units involved in the development of a European research infrastructure for the humanities and social sciences in the scope of the CLARIN ERIC consortium and partly financed by Polish National Science Center grant 2014/15/B/HS2/03119.

Author information

Authors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, Jana Kazimierza 5, 01-248, Warszawa, Poland
Łukasz Kobyliński & Witold Kieraś

Corresponding author

Correspondence to Łukasz Kobyliński.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Kobyliński, Ł., Kieraś, W. (2018). Part of Speech Tagging for Polish: State of the Art and Future Perspectives. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_21

DOI: https://doi.org/10.1007/978-3-319-75477-2_21
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75476-5
Online ISBN: 978-3-319-75477-2
eBook Packages: Computer Science, Computer Science (R0)
