Designing and Comparing G2P-Type Lemmatizers for a Morphology-Rich Language

Part of the Communications in Computer and Information Science book series (CCIS, volume 537)


We consider the statistical lemmatization problem in which lemmatizers are trained on (word form, lemma) pairs. In particular, we consider this problem for ancient Latin, a language with a high degree of morphological variability. We investigate whether general-purpose string-to-string transduction models are suitable for this task and find that they typically perform (much) better than more restricted lemmatization techniques and heuristics based on suffix transformations. We also test experimentally whether string transduction systems that perform well on one string-to-string translation task (here, grapheme-to-phoneme conversion, G2P) also perform well on another (here, lemmatization), and vice versa. We find that a joint n-gram model performs better on G2P than a discriminative model of our own making, but that this relationship is reversed for lemmatization. Finally, we investigate how the learned lemmatizers can complement lexicon-based systems, e.g., by tackling the out-of-vocabulary (OOV) and/or the disambiguation problem.
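The suffix-transformation baseline that the abstract contrasts with general string transducers can be illustrated with a minimal sketch. It assumes that a rule is simply what remains of the word form and the lemma after their longest common prefix is stripped, and that the most frequent rule for the longest matching suffix is applied at prediction time. The function names and the toy Latin pairs are illustrative assumptions and are not taken from the paper or from any of the systems it evaluates.

```python
from collections import Counter, defaultdict


def suffix_rule(form, lemma):
    """Return (form_suffix, lemma_suffix) after stripping the longest common prefix."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def train(pairs):
    """Count how often each form suffix is rewritten to each lemma suffix."""
    rules = defaultdict(Counter)
    for form, lemma in pairs:
        old, new = suffix_rule(form, lemma)
        rules[old][new] += 1
    return rules


def lemmatize(form, rules):
    """Apply the most frequent rule for the longest known suffix of `form`."""
    for start in range(len(form) + 1):  # start=0 tries the longest suffix first
        old = form[start:]
        if old in rules:
            new, _count = rules[old].most_common(1)[0]
            return form[:start] + new
    return form  # no rule matched: copy the input unchanged


# Hypothetical toy training pairs, for illustration only.
toy_pairs = [
    ("rosam", "rosa"),
    ("portam", "porta"),
    ("rosarum", "rosa"),
    ("amicorum", "amicus"),
    ("dominorum", "dominus"),
]
rules = train(toy_pairs)
print(lemmatize("puellam", rules))   # puella
print(lemmatize("servorum", rules))  # servus
```

Because such rules only see the end of the word, they cannot express stem-internal changes or prefixes, which is one reason a general string-to-string model may be expected to do better on a morphologically rich language.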


  • Word Pair
  • Word Form
  • Conditional Random Field
  • Input String
  • Word Class

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

  • DOI: 10.1007/978-3-319-23980-4_2
  • Chapter length: 14 pages


  1. In stemming, all that typically matters is that related words map to the same (linguistic or even non-linguistic) object.

  2. See, e.g.,

  3. In our experiments below, we choose an n-gram order of 6 for Phonetisaurus. Increasing the n-gram order did not lead to better performance in preliminary tests.

  4. We use the alignments produced by the Phonetisaurus toolkit.

  5. Although CRFs are rather old and not always the best-performing sequence labeling models [17], we use them here mainly for practical reasons. In particular, the CRF package we are using, available from, provides a very convenient interface to modeling sequence labeling.

  6. Increasing the window size typically does not lead to better performance, as we verified in preliminary experiments.

  7. Typically, word forms in other word classes are also not inflectional, so that the learning problem would be trivial.

  8. In fact, it seems that Mate simply stores input strings that occur fewer than 5 times, rather than learning substitution patterns from these (personal communication with Bernd Bohnet). Thus, the evaluation scenario adopted in this work puts Mate at a general disadvantage, since we generally train systems on arbitrary lists of word pairs selected from a lexicon rather than on the distributions found in ‘real’ text.

  9. E.g., when the lemmatizer is developed to assist a lexicon-based lemmatizer.

  10. We also performed the alternative decoding strategy where lemmatizers are separately trained, but found it to perform worse.

  11. Available at

  12. We could not use Perseus because the TreeTagger was trained on Perseus.
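Notes 4 to 6 mention character alignments, a CRF used for sequence labeling, and a context window. A minimal sketch of window-based character features for such a setup follows. It assumes a one-to-one alignment in which every input character is labeled with the output substring it maps to; the function names, feature names, padding symbol, and the example alignment are hypothetical and do not reproduce the feature set of the paper or of any CRF package.

```python
def window_features(chars, i, window=2, pad="#"):
    """Features for position i: the characters within `window` steps of i."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the word are filled with the padding symbol.
        feats[f"c[{offset:+d}]"] = chars[j] if 0 <= j < len(chars) else pad
    return feats


def sequence_features(word, window=2):
    """One feature dictionary per input character."""
    chars = list(word)
    return [window_features(chars, i, window) for i in range(len(chars))]


# Hypothetical aligned training instance: "rosam" -> "rosa",
# where the final "m" is aligned to the empty string.
features = sequence_features("rosam", window=1)
labels = ["r", "o", "s", "a", ""]
for feats, label in zip(features, labels):
    print(feats, "->", repr(label))
```

A list of per-character feature dictionaries paired with a list of labels is the kind of input that sequence labeling toolkits commonly accept; a larger window exposes more context per position but also increases the number of features.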


  1. Bartlett, S., Kondrak, G., Cherry, C.: Automatic syllabification with structured SVMs for letter-to-phoneme conversion. In: McKeown, K., Moore, J.D., Teufel, S., Allan, J., Furui, S. (eds.) ACL, pp. 568–576. Association for Computational Linguistics, Morristown (2008)

  2. Bisani, M., Ney, H.: Joint-sequence models for grapheme-to-phoneme conversion. Speech Commun. 50(5), 434–451 (2008)

  3. Bohnet, B.: Top accuracy and fast dependency parsing is not a contradiction. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Coling 2010 Organizing Committee, Beijing, China, pp. 89–97, August 2010

  4. Brill, E., Moore, R.C.: An improved error model for noisy channel spelling correction. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL 2000, pp. 286–293. Association for Computational Linguistics, Stroudsburg (2000)

  5. Daelemans, W., Groenewald, H.J., Huyssteen, G.B.V.: Prototype-based active learning for lemmatization. In: Angelova, G., Bontcheva, K., Mitkov, R., Nicolov, N., Nikolov, N. (eds.) RANLP, pp. 65–70. RANLP 2009 Organising Committee/ACL, Morristown (2009)

  6. Dreyer, M., Smith, J., Eisner, J.: Latent-variable modeling of string transductions with finite-state methods. In: EMNLP, pp. 1080–1089. ACL (2008)

  7. Eger, S.: Sequence segmentation by enumeration: an exploration. Prague Bull. Math. Linguist. 100, 113–132 (2013)

  8. Eger, S., vor der Brück, T., Mehler, A.: Lexicon-assisted tagging and lemmatization in Latin: a comparison of six taggers and two lemmatization methods. In: Latech 2015. Association for Computational Linguistics (2015, accepted)

  9. Gesmundo, A., Samardzic, T.: Lemmatisation as a tagging task. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (vol. 2: Short Papers), pp. 368–372. Association for Computational Linguistics (2012)

  10. Jiampojamarn, S., Cherry, C., Kondrak, G.: Joint processing and discriminative training for letter-to-phoneme conversion. In: Proceedings of ACL-08: HLT, pp. 905–913. Association for Computational Linguistics, Columbus, June 2008

  11. Jiampojamarn, S., Cherry, C., Kondrak, G.: Integrating joint n-gram features into a discriminative training framework. In: NAACL-HLT, pp. 697–700. Association for Computational Linguistics (2010)

  12. Juršič, M., Mozetič, I., Lavrač, N.: Learning ripple down rules for efficient lemmatization. In: Mladenić, D., Grobelnik, M. (eds.) Proceedings of the 10th International Multiconference Information Society, pp. 206–209. IJS, Ljubljana (2007)

  13. Juršič, M., Mozetič, I., Lavrač, N.: LemmaGen: multilingual lemmatisation with induced ripple-down rules. J. Univ. Comput. Sci. 16, 1190–1214 (2010)

  14. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of 18th International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann, San Francisco (2001)

  15. Mehler, A., vor der Brück, T., Gleim, R., Geelhaar, T.: Towards a network model of the coreness of texts: an experiment in classifying Latin texts using the ttlab Latin tagger. In: Biemann, C., Mehler, A. (eds.) Text Mining: From Ontology Learning to Automated Text Processing Applications. Theory and Applications of Natural Language Processing, pp. 87–112. Springer, Berlin (2015)

  16. Migne, J.P. (ed.): Patrologiae Cursus Completus: Series Latina, vol. 1–221. Chadwyck-Healey, Cambridge (1844–1855)

  17. Nguyen, N., Guo, Y.: Comparisons of sequence labeling algorithms and extensions. In: Ghahramani, Z. (ed.) ICML. ACM International Conference Proceeding Series, vol. 227, pp. 681–688. ACM, New York (2007)

  18. Novak, J.R., Minematsu, N., Hirose, K.: WFST-based grapheme-to-phoneme conversion: open source tools for alignment, model-building and decoding. In: Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pp. 45–49. Association for Computational Linguistics, Donostia-San Sebastián, July 2012

  19. Porter, M.: An algorithm for suffix stripping. Program Electron. Libr. Inf. Syst. 14(3), 130–137 (1980)

  20. Richmond, K., Clark, R.A.J., Fitt, S.: Robust LTS rules with the Combilex speech technology lexicon. In: INTERSPEECH, pp. 1295–1298. ISCA (2009)

  21. Sherif, T., Kondrak, G.: Substring-based transliteration. In: Carroll, J.A., van den Bosch, A., Zaenen, A. (eds.) ACL. Association for Computational Linguistics, Morristown (2007)

  22. Smith, D.A., Rydberg-Cox, J.A., Crane, G.R.: The Perseus project: a digital library for the humanities. Literary and Linguistic Computing 15(1), 15–25 (2000)

  23. Toutanova, K., Cherry, C.: A global model for joint lemmatization and part-of-speech prediction. In: Su, K.Y., Su, J., Wiebe, J. (eds.) ACL/IJCNLP, pp. 486–494. Association for Computational Linguistics, Morristown (2009)

Author information

Corresponding author

Correspondence to Steffen Eger.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Eger, S. (2015). Designing and Comparing G2P-Type Lemmatizers for a Morphology-Rich Language. In: Mahlow, C., Piotrowski, M. (eds) Systems and Frameworks for Computational Morphology. SFCM 2015. Communications in Computer and Information Science, vol 537. Springer, Cham.

  • DOI: 10.1007/978-3-319-23980-4_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23978-1

  • Online ISBN: 978-3-319-23980-4

  • eBook Packages: Computer Science (R0)