Corpus-Based Lexeme Ranking for Morphological Guessers
- Cite this paper as:
- Lindén K., Tuovila J. (2009) Corpus-Based Lexeme Ranking for Morphological Guessers. In: Mahlow C., Piotrowski M. (eds) State of the Art in Computational Morphology. SFCM 2009. Communications in Computer and Information Science, vol 41. Springer, Berlin, Heidelberg
Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names or compounds of such words. To add new words to a morphological lexicon, we need to determine their base form and indicate their inflectional paradigm. A base form and a paradigm define a lexeme. In this article, we evaluate a lexicon-based method augmented with data from a corpus or the internet for generating and ranking lexeme suggestions for new words. As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few, otherwise it may become more efficient to create the entries by hand. By generating lexeme suggestions with an entry generator and then further generating some key word forms for the lexemes, we can find support for the lexemes in a corpus. Our ranking methods have 56–79% average precision and 78–89% recall among the top 6 candidates, i.e., an F-score of 65–84%, indicating that the first correct entry suggestion is on the average found as the second or third candidate. The corpus-based ranking methods were found to be significant in practice as they save time for the lexicographer by increasing recall with 7–8% among the top candidates.
Unable to display preview. Download preview PDF.