Designing and Comparing G2P-Type Lemmatizers for a Morphology-Rich Language

  • Steffen Eger
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 537)


We consider the statistical lemmatization problem, in which lemmatizers are trained on (word form, lemma) pairs. In particular, we study this problem for ancient Latin, a language with a high degree of morphological variability. We investigate whether general-purpose string-to-string transduction models are suitable for this task and find that they typically perform (much) better than more restricted lemmatization techniques and heuristics based on suffix transformations. We also test experimentally whether string transduction systems that perform well on one string-to-string translation task (here, grapheme-to-phoneme conversion, G2P) also perform well on another (here, lemmatization), and vice versa. We find that a joint n-gram model performs better on G2P than a discriminative model of our own making, but that this relationship is reversed for lemmatization. Finally, we investigate how the learned lemmatizers can complement lexicon-based systems, e.g., by tackling the out-of-vocabulary (OOV) problem and/or the disambiguation problem.
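To make the notion of a suffix-transformation heuristic concrete, the following is a minimal sketch of such a baseline learned from (word form, lemma) pairs. It is not the system evaluated in the paper: the function names (`suffix_rule`, `train`, `lemmatize`) and the toy Latin pairs are assumptions chosen purely for illustration, and a real baseline would likely add context features and smarter back-off.

```python
from collections import Counter, defaultdict


def suffix_rule(form, lemma):
    """Strip the longest common prefix and return (form_suffix, lemma_suffix)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def train(pairs):
    """Count how often each form suffix is rewritten to each lemma suffix."""
    rules = defaultdict(Counter)
    for form, lemma in pairs:
        old, new = suffix_rule(form, lemma)
        rules[old][new] += 1
    return rules


def lemmatize(form, rules):
    """Apply the most frequent rewrite for the longest known suffix of `form`."""
    for start in range(len(form) + 1):  # start=0 tries the longest suffix first
        suffix = form[start:]
        if suffix in rules:
            new = rules[suffix].most_common(1)[0][0]
            return form[:start] + new
    return form  # no rule matched: fall back to the word form itself


# Toy training data (hypothetical, for illustration only).
toy_pairs = [
    ("amabam", "amo"),
    ("laudabam", "laudo"),
    ("rosae", "rosa"),
    ("puellae", "puella"),
]
toy_rules = train(toy_pairs)
print(lemmatize("cantabam", toy_rules))
print(lemmatize("portae", toy_rules))
```

Such a heuristic can only rewrite word endings, so it cannot capture stem-internal changes. General string-to-string transduction models are not restricted in this way.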


Keywords: Word pair, Word form, Conditional random field, Input string, Word class



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Text Technology Lab, Goethe University, Frankfurt am Main, Germany
