Dsolve—Morphological Segmentation for German Using Conditional Random Fields

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 537)

Abstract

We describe Dsolve, a system for segmenting morphologically complex German words into their constituent morphs. Our approach treats morphological segmentation as a classification task in which the locations and types of morph boundaries are predicted by a Conditional Random Field model trained on manually annotated data. Predicting morph-boundary types in addition to their locations distinguishes Dsolve from similar approaches previously suggested in the literature. We show that, somewhat counter-intuitively, predicting boundary types improves performance over the simpler task of predicting only boundary locations.
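
To make the classification framing concrete, the following minimal Python sketch shows one way a segmented word could be turned into a per-character label sequence of the kind a sequence classifier such as a CRF is trained on. The boundary symbols (`~`, `+`, `#`), the label names, and the example word are assumptions chosen for illustration; the abstract does not specify Dsolve's actual boundary inventory or encoding. The `to_binary` helper collapses the typed labels into the simpler location-only task.

```python
# Hypothetical boundary symbols and type names (assumptions, not taken from the paper).
BOUNDARY_TYPES = {"~": "PREFIX", "+": "SUFFIX", "#": "COMPOUND"}
NO_BOUNDARY = "NONE"


def encode(segmented):
    """Split a marked-up word into (characters, labels).

    labels[i] names the boundary type that follows characters[i],
    or NO_BOUNDARY if no morph boundary follows that character.
    """
    chars, labels = [], []
    for symbol in segmented:
        if symbol in BOUNDARY_TYPES:
            if not chars or labels[-1] != NO_BOUNDARY:
                raise ValueError("misplaced boundary symbol in %r" % segmented)
            labels[-1] = BOUNDARY_TYPES[symbol]
        else:
            chars.append(symbol)
            labels.append(NO_BOUNDARY)
    return chars, labels


def to_binary(labels):
    """Collapse typed labels to the location-only task (boundary vs. none)."""
    return ["BOUNDARY" if label != NO_BOUNDARY else NO_BOUNDARY for label in labels]


def decode(chars, labels):
    """Rebuild the marked-up word from characters and typed labels."""
    symbol_for = {name: symbol for symbol, name in BOUNDARY_TYPES.items()}
    pieces = []
    for char, label in zip(chars, labels):
        pieces.append(char)
        if label != NO_BOUNDARY:
            pieces.append(symbol_for[label])
    return "".join(pieces)


if __name__ == "__main__":
    chars, labels = encode("un~glaub+lich")
    for char, label in zip(chars, labels):
        print(char, label)
    print(to_binary(labels))
    print(decode(chars, labels))
```

In this encoding the typed and location-only tasks share the same input characters and differ only in the label set, which is what makes the comparison described in the abstract possible.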

Keywords

Classification Task; Model Order; Conditional Random Field; Boundary Type; Boundary Detection
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Berlin-Brandenburg Academy of Sciences and Humanities, Berlin, Germany
