Abstract
This paper describes the construction of a morpho-syntactic tagger for Polish as an ensemble of the two best-performing Polish taggers, TaKIPI and Pantera, extended with a third tagger, RFTagger, trained on a Polish corpus. Several methods of ensemble construction were tested; the largest reduction in tagging error was achieved with simple, unweighted voting among the three taggers. Two evaluation metrics were used: weak and strong accuracy. The ensemble improved significantly on both metrics, reaching nearly 94% weak accuracy. This is one percentage point higher than the best individual tagger tested, an error rate reduction of over 15%.
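To make the voting scheme concrete, the following is a minimal sketch of unweighted per-token voting over the outputs of three taggers, together with the arithmetic behind a relative error rate reduction. It is an illustration under stated assumptions, not the authors' implementation: the function names, the example tags, and the tie-breaking rule (fall back to the earliest-listed tagger when all taggers disagree) are hypothetical, and the sketch assumes that all taggers share the same tokenization.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(tags: Sequence[str]) -> str:
    """Pick the tag proposed by the most taggers for a single token.

    Assumption: ties are broken in favour of the earliest-listed tagger.
    The abstract does not say how ties were resolved.
    """
    if not tags:
        raise ValueError("at least one tagger output is required")
    counts = Counter(tags)
    best = max(counts.values())
    for tag in tags:  # iterate in tagger order so ties go to the first tagger
        if counts[tag] == best:
            return tag
    raise AssertionError("unreachable")


def combine(outputs: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-tagger tag sequences token by token.

    Assumption: every tagger produced exactly one tag per token over the
    same tokenization, so the sequences align position by position.
    """
    return [majority_vote(token_tags) for token_tags in zip(*outputs)]


def relative_error_reduction(base_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    base_error = 1.0 - base_accuracy
    new_error = 1.0 - new_accuracy
    return (base_error - new_error) / base_error


# Hypothetical outputs of three taggers for a two-token sentence.
tagger_a = ["subst:sg:nom:m1", "fin:sg:ter:imperf"]
tagger_b = ["subst:sg:nom:m1", "fin:sg:ter:perf"]
tagger_c = ["subst:sg:acc:m3", "fin:sg:ter:perf"]

print(combine([tagger_a, tagger_b, tagger_c]))
# ['subst:sg:nom:m1', 'fin:sg:ter:perf']
```

Because accuracy is already high, a gain of one percentage point removes a comparatively large share of the remaining errors, which is why the abstract reports both figures.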
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Śniatowski, T., Piasecki, M. (2012). Combining Polish Morphosyntactic Taggers. In: Bouvry, P., Kłopotek, M.A., Leprévost, F., Marciniak, M., Mykowiecka, A., Rybiński, H. (eds) Security and Intelligent Information Systems. SIIS 2011. Lecture Notes in Computer Science, vol 7053. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25261-7_28
DOI: https://doi.org/10.1007/978-3-642-25261-7_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25260-0
Online ISBN: 978-3-642-25261-7
eBook Packages: Computer Science, Computer Science (R0)