Part of Speech Tagging for Polish: State of the Art and Future Perspectives

  • Łukasz Kobyliński
  • Witold Kieraś
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9623)

Abstract

In this paper we discuss the intricacies of part-of-speech tagging for Polish, present the current state of the art by comparing the available taggers in detail, and show the main obstacles that keep the accuracy of Polish POS tagging from exceeding 91% of correctly tagged word segments. As this result is not only lower than that of English taggers but also below the results for other highly inflected languages, such as Czech and Slovene, we try to identify the main weaknesses of the taggers, their underlying algorithms, the training data, or difficulties inherent to the language that could explain this difference. For this purpose we analyze the errors made individually by each of the available Polish POS taggers, by an ensemble of the taggers, and by the publicly available, well-known OpenNLP tagger, adapted to the Polish tagset. Finally, we propose further steps that should be taken to narrow the gap between Polish and English POS tagging performance.
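
As an illustration of the evaluation measure named in the abstract, the following minimal Python sketch computes accuracy as the share of correctly tagged word segments and combines several taggers by a simple plurality vote. The function names, the toy tag strings, and the voting scheme are assumptions made for illustration only; they are not taken from the paper and do not describe how the evaluated taggers or the ensemble actually work.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Share of segments whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def majority_vote(tagger_outputs):
    """Combine per-segment tags from several taggers by plurality vote.

    Ties go to the tag proposed by the earliest tagger in the list.
    This is an assumed, simplified scheme, not the paper's ensemble.
    """
    combined = []
    for tags in zip(*tagger_outputs):
        counts = Counter(tags)
        best = max(counts.values())
        combined.append(next(t for t in tags if counts[t] == best))
    return combined


# Toy data: hypothetical tags for four word segments.
gold = ["subst:sg:nom:m1", "fin:sg:ter:imperf", "adj:sg:acc:f:pos", "subst:sg:acc:f"]
tagger_a = [gold[0], gold[1], "adj:sg:nom:f:pos", gold[3]]
tagger_b = [gold[0], "fin:sg:ter:perf", gold[2], gold[3]]
tagger_c = ["subst:sg:acc:m1", gold[1], gold[2], "subst:sg:nom:f"]

ensemble = majority_vote([tagger_a, tagger_b, tagger_c])
```

In this toy setting each tagger makes its errors on different segments, so the vote corrects all of them. On real data the taggers' errors overlap, which limits what voting alone can gain.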

Notes

Acknowledgment

Work partly financed by the Polish Ministry of Science and Higher Education, a program in support of scientific units involved in the development of a European research infrastructure for the humanities and social sciences in the scope of the CLARIN ERIC consortium and partly financed by Polish National Science Center grant 2014/15/B/HS2/03119.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Institute of Computer Science, Polish Academy of Sciences, Warszawa, Poland