A Case Study in Tagging Case in German: An Assessment of Statistical Approaches

  • Simon Clematide
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 380)

Abstract

In this study, we assess the performance of purely statistical approaches using supervised machine learning for predicting case in German (nominative, accusative, dative, genitive, n/a). We experiment with two different treebanks containing morphological annotations: TIGER and TUEBA. An evaluation with 10-fold cross-validation serves as the basis for systematic comparisons of the optimal parametrizations of different approaches. We test taggers based on Hidden Markov Models (HMM), Decision Trees, and Conditional Random Fields (CRF). The CRF approach based on our hand-crafted feature model achieves an accuracy of about 94%. This outperforms all other approaches and results in an improvement of 11% compared to a baseline HMM trigram tagger and an improvement of 2% compared to a state-of-the-art tagger for rich morphological tagsets. Moreover, we investigate the effect of additional (morphological) categories (gender, number, person, part of speech) in the internal tagset used for the training. Rich internal tagsets improve results for all tested approaches.
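The distinction between a rich internal tagset used for training and the external case labels used for scoring can be illustrated with a minimal sketch. The dot-separated tag format (for example `NN.Acc.Sg.Fem`), the helper names `to_external` and `case_accuracy`, and the toy sequences are assumptions made for illustration only; they are not taken from the paper, TIGER, or TUEBA.

```python
# Hypothetical illustration: a tagger is trained on rich internal tags
# (part of speech, case, number, gender), but accuracy is measured only
# on the external case label (nominative, accusative, dative, genitive, n/a).

CASES = {"Nom", "Acc", "Dat", "Gen"}  # assumed abbreviations for the four cases


def to_external(internal_tag: str) -> str:
    """Project a rich internal tag onto its external case label."""
    for part in internal_tag.split("."):
        if part in CASES:
            return part
    return "n/a"  # tokens that carry no case, e.g. finite verbs


def case_accuracy(gold, predicted):
    """Share of tokens whose projected case label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold:
        return 0.0
    correct = sum(to_external(g) == to_external(p) for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy sequences in the assumed internal tag format.
gold = ["ART.Nom.Sg.Masc", "NN.Nom.Sg.Masc", "VVFIN", "ART.Acc.Sg.Fem", "NN.Acc.Sg.Fem"]
pred = ["ART.Nom.Sg.Masc", "NN.Nom.Sg.Masc", "VVFIN", "ART.Nom.Sg.Fem", "NN.Acc.Sg.Neut"]

print(case_accuracy(gold, pred))
```

In this toy example the last prediction has the wrong gender but the right case, so it still counts as correct after projection; only the fourth token is a case error.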

Keywords

German Case Tagging; Supervised Learning; Decision Trees; Conditional Random Fields; Hidden Markov Models; Morphologically annotated treebanks; Evaluation

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Simon Clematide (1)
  1. Institute of Computational Linguistics, University of Zurich, Zürich, Switzerland
