Lemmatization for Ancient Languages: Rules or Neural Networks?

  • Conference paper
Artificial Intelligence and Natural Language (AINL 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 930)

Abstract

Lemmatisation, one of the most important stages of text preprocessing, consists in grouping the inflected forms of a word together so that they can be analysed as a single item. This task is often considered solved for most modern languages regardless of their morphological type, but the situation is dramatically different for ancient languages. A rich inflectional system and a high level of orthographic variation, both common to these languages, together with a lack of resources, make lemmatising historical data a challenging task. The task is becoming increasingly important as manuscripts are now being digitised extensively, yet it remains poorly covered in the literature. In this work, I compare a rule-based and a neural-network-based approach to lemmatisation, using Early Irish data (Old and Middle Irish are often described together as “Early Irish”).

Author information

Correspondence to Oksana Dereza.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Dereza, O. (2018). Lemmatization for Ancient Languages: Rules or Neural Networks?. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2018. Communications in Computer and Information Science, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-01204-5_4

  • DOI: https://doi.org/10.1007/978-3-030-01204-5_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01203-8

  • Online ISBN: 978-3-030-01204-5

  • eBook Packages: Computer Science, Computer Science (R0)
