Abstract
Automatic lemmatisation is a core component of many language processing tasks. In inflectionally rich languages such as Slovene, assigning the correct lemma to each word in running text is not trivial: nouns and adjectives, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, because their word forms cannot be matched against a lexicon that gives the correct lemma, its part of speech and its paradigm class. The paper discusses a machine learning approach to the automatic lemmatisation of unknown words, in particular nouns and adjectives, in Slovene texts. We decompose the problem of learning to lemmatise into two subproblems: the first is learning to perform morphosyntactic tagging, and the second is learning to perform morphological analysis, which produces the lemma from the word form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn morphosyntactic tagging, and a first-order decision list learning system is used to learn rules for morphological analysis. The dataset is the 90,000-word Slovene translation of Orwell’s ‘1984’, split into a training set and a validation set. The validation set is the Appendix of the novel, on which the two components are tested extensively, singly and in combination. The trained model is then applied to an open-domain test set of 25,000 words, pre-annotated with their lemmas. Here, 13,000 noun or adjective tokens are previously unseen cases. Tested on these unknown words, our method achieves an accuracy of 81% on the lemmatisation task.
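The two-step decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the lookup-table tagger stands in for the statistical trigram tagger, the three hand-written suffix rules stand in for a learned first-order decision list, and the MULTEXT-East-style tags, the rule format and all function names are assumptions made for illustration only.

```python
from typing import Callable, List, Tuple

# A rule is (tag, form_suffix, lemma_suffix). Rules are tried in order and
# the first one that matches decides the lemma, as in a decision list.
Rule = Tuple[str, str, str]

# Toy hand-written rules (assumed for illustration, not learned from data).
RULES: List[Rule] = [
    ("Ncfsg", "e", "a"),    # mize   -> miza (feminine noun, genitive singular)
    ("Ncfsa", "o", "a"),    # mizo   -> miza (feminine noun, accusative singular)
    ("Afpmsg", "ega", ""),  # lepega -> lep  (masculine adjective, genitive singular)
]


def analyse(form: str, tag: str, rules: List[Rule] = RULES) -> str:
    """Step 2: derive the lemma from the word form, given its tag."""
    for rule_tag, form_suffix, lemma_suffix in rules:
        if tag == rule_tag and form.endswith(form_suffix):
            stem = form[: len(form) - len(form_suffix)]
            return stem + lemma_suffix
    # Default rule: when nothing matches, the form is taken as its own lemma.
    return form


def toy_tagger(tokens: List[str]) -> List[str]:
    """Step 1 stand-in: a lookup table instead of a trained trigram tagger."""
    lookup = {"mize": "Ncfsg", "mizo": "Ncfsa", "lepega": "Afpmsg"}
    return [lookup.get(token, "X") for token in tokens]


def lemmatise(
    tokens: List[str],
    tagger: Callable[[List[str]], List[str]] = toy_tagger,
) -> List[str]:
    """Tag the tokens first, then analyse each (form, tag) pair."""
    tags = tagger(tokens)
    return [analyse(form, tag) for form, tag in zip(tokens, tags)]


if __name__ == "__main__":
    print(lemmatise(["lepega", "mize", "mizo", "in"]))
```

Because the analysis step only sees the tag produced by the tagging step, a tagging error can propagate to the lemma, which is one reason to evaluate the two components both singly and in combination.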
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Džeroski, S., Erjavec, T. (2000). Learning to Lemmatise Slovene Words. In: Cussens, J., Džeroski, S. (eds) Learning Language in Logic. LLL 1999. Lecture Notes in Computer Science, vol 1925. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40030-3_5
DOI: https://doi.org/10.1007/3-540-40030-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41145-1
Online ISBN: 978-3-540-40030-1
eBook Packages: Springer Book Archive