
From 0 to 10 million annotated words: part-of-speech tagging for Middle High German

  • Sarah Schulz
  • Nora Ketschik
Original Paper

Abstract

By building a part-of-speech (POS) tagger for Middle High German (MHG), we investigate strategies for dealing with a low-resource, diverse and non-standard language in natural language processing. We examine how much training data is needed and how data quality influences tagger performance. Since the lack of annotated resources poses a problem for training a tagger, we show how existing resources can be adapted to serve as additional training data. The resulting POS model achieves a tagging accuracy of about 91% on a diverse test set representing the different genres, time periods and varieties of MHG. To verify its general applicability, we also evaluate its performance separately on different genres, authors and varieties of MHG. Finally, we explore self-learning techniques, which allow unannotated data to be used to improve tagging performance on specific subcorpora.

Keywords

Historical language; Part-of-speech tagging; Digital Humanities; Non-standard text processing; Middle High German

Notes

Acknowledgements

This work was completed within the Center for Reflected Text Analytics (CRETA), which is supported by the German Ministry of Education and Research (BMBF); we are grateful for this financial support. We also thank our colleagues at the MHDBDB for their collaboration and the reviewers for their helpful comments. This work is based on a talk given at the meeting of “Digital Humanities im deutschsprachigen Raum” (DHd) in March 2017 in Bern.

Funding

Funding was provided by Bundesministerium für Bildung und Forschung (Grant No. 01UG1601).

Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. Institute for Natural Language Processing (IMS), University of Stuttgart, Stuttgart, Germany
  2. Institute for Literary Studies (ILW), University of Stuttgart, Stuttgart, Germany
