Reconstructing Complete Lemmas for Incomplete German Compounds

  • Noëmi Aepli
  • Martin Volk
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8105)

Abstract

This paper discusses elliptical compounds, which are frequently used in German to avoid repetition. The phenomenon involves truncated words, mostly truncated compounds. Such words pose a challenge for PoS tagging and lemmatization and often result in unknown or incomplete lemmas. We present an approach that reconstructs the complete lemmas of truncated compounds in order to improve subsequent language technology or corpus linguistic applications. Results show an F-measure of 95.6% for the detection of elliptical compound patterns and 86.4% for the correction of compound lemmas.
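To make the task concrete, the following is a minimal sketch of pattern-based detection and lemma reconstruction. It is not the authors' method, which this page does not describe. The example phrases, the regular expression, the function names, and the small `known_heads` lexicon are all illustrative assumptions.

```python
import re

# Assumed surface pattern: a truncated word ending in a hyphen, a coordinating
# conjunction, then a full compound (e.g. "Kinder- und Jugendliteratur").
PATTERN = re.compile(r"\b(\w+)-\s+(und|oder|bzw\.)\s+(\w+)\b")


def reconstruct(truncated, full, known_heads):
    """Complete the truncated part with the longest known head that ends `full`.

    `known_heads` is a hypothetical lexicon of possible compound heads.
    Returns None when no head in the lexicon matches.
    """
    lowered = full.lower()
    for head in sorted(known_heads, key=len, reverse=True):
        if lowered.endswith(head.lower()) and len(head) < len(full):
            return truncated + head.lower()
    return None


def find_elliptical_compounds(text, known_heads):
    """Return (truncated form, full compound, reconstructed lemma) triples."""
    results = []
    for match in PATTERN.finditer(text):
        truncated, _conjunction, full = match.groups()
        lemma = reconstruct(truncated, full, known_heads)
        results.append((truncated + "-", full, lemma))
    return results


if __name__ == "__main__":
    heads = {"Literatur", "Gang"}
    print(find_elliptical_compounds("Kinder- und Jugendliteratur", heads))
    print(find_elliptical_compounds("Ein- und Ausgang", heads))
```

A real system would need a compound splitter rather than a fixed head list, and would have to handle further patterns such as truncation on the right-hand side; the sketch covers only the simplest left-truncated case.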

Keywords

Elliptical compounds, decompounding, corpus annotation

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Noëmi Aepli (1)
  • Martin Volk (1)
  1. University of Zurich, Switzerland