Tagging Icelandic text: an experiment with integrations and combinations of taggers

Abstract

We use integrations and combinations of taggers to improve the tagging accuracy of Icelandic text. The best-performing integrated tagger, which uses our linguistic rule-based tagger for initial disambiguation and a trigram tagger for full disambiguation, achieves an accuracy of 91.80%. Combining five different taggers by simple voting yields 93.34% accuracy. Adding two linguistically motivated rules to the combined tagger raises the accuracy to 93.48%. Relative to the best-performing tagger in the combination pool, this method reduces the error rate by 20.5%.
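
The two quantitative ideas in the abstract, simple voting and relative error-rate reduction, can be illustrated with a minimal Python sketch. It is not the implementation used in the paper. The function names, the placeholder tags, and the tie-breaking rule (the earliest-listed tagger wins) are assumptions made for illustration only; the abstract does not say how ties are resolved. The error-reduction check assumes the 91.80% tagger is the baseline, which is consistent with the reported 20.5%.

```python
from collections import Counter
from typing import Sequence


def simple_vote(tags: Sequence[str]) -> str:
    """Return the tag proposed by the most taggers for a single token.

    Assumption: ties go to the earliest-listed tagger among the tied tags.
    """
    if not tags:
        raise ValueError("no tags supplied")
    counts = Counter(tags)
    best = max(counts.values())
    # Walk in tagger order so that a tie is resolved by list position.
    for tag in tags:
        if counts[tag] == best:
            return tag


def combine(outputs: Sequence[Sequence[str]]) -> list[str]:
    """Vote token by token over the aligned output of several taggers."""
    return [simple_vote(column) for column in zip(*outputs)]


def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Relative error-rate reduction in percent; accuracies are in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


# Hypothetical output of five taggers for a three-token sentence.
outputs = [
    ["N", "V", "A"],
    ["N", "V", "N"],
    ["N", "A", "A"],
    ["V", "V", "A"],
    ["N", "V", "N"],
]
print(combine(outputs))  # ['N', 'V', 'A']

# Error rate falls from 8.20% to 6.52%: (8.20 - 6.52) / 8.20 is about 20.5%.
print(round(relative_error_reduction(91.80, 93.48), 1))  # 20.5
```

The linguistically motivated rules mentioned in the abstract are not modelled here, since the abstract does not describe them.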

Abbreviations

DDT: data-driven taggers

HMM: Hidden Markov model

IFD: Icelandic frequency dictionary

LMR: linguistically motivated rules

Acknowledgements

Thanks to the Institute of Lexicography at the University of Iceland for providing access to the IFD corpus, and to Professor Y. Wilks for valuable comments and suggestions in the preparation of this paper.

Author information

Corresponding author

Correspondence to Hrafn Loftsson.

About this article

Cite this article

Loftsson, H. Tagging Icelandic text: an experiment with integrations and combinations of taggers. Lang Resources & Evaluation 40, 175–181 (2006). https://doi.org/10.1007/s10579-006-9013-5

Keywords

  • Combination of taggers
  • Integration of taggers
  • Linguistically motivated rules
  • Simple voting
  • Tagging accuracy