Division of Spanish Words into Morphemes with a Genetic Algorithm

  • Alexander Gelbukh
  • Grigori Sidorov
  • Diego Lara-Reyes
  • Liliana Chanona-Hernandez
Conference paper

DOI: 10.1007/978-3-540-69858-6_4

Part of the Lecture Notes in Computer Science book series (LNCS, volume 5039)
Cite this paper as:
Gelbukh A., Sidorov G., Lara-Reyes D., Chanona-Hernandez L. (2008) Division of Spanish Words into Morphemes with a Genetic Algorithm. In: Kapetanios E., Sugumaran V., Spiliopoulou M. (eds) Natural Language and Information Systems. NLDB 2008. Lecture Notes in Computer Science, vol 5039. Springer, Berlin, Heidelberg

Abstract

We discuss an unsupervised technique for determining the morpheme structure of words in an inflective language, with Spanish as a case study. For this, we use global optimization (implemented with a genetic algorithm), whereas most previous work is based on heuristics calculated from conditional probabilities of word parts. Thus, we deal with the complete space of solutions and do not reduce it at the risk of eliminating some correct solutions beforehand. Also, we work at the derivational level, in contrast to the more traditional grammatical level, which is concerned only with inflections. The algorithm works as follows. The input is a wordlist built from a large dictionary or corpus of the given language, and the output is the same wordlist with each word divided into morphemes. First, we build a redundant list of all strings that might possibly be prefixes, suffixes, and stems of the words in the wordlist. Then, we detect possible paradigms in this set and filter out all items from the lists of possible prefixes and suffixes (though not stems) that do not participate in such paradigms. Finally, a subset of those lists of possible prefixes, stems, and suffixes is chosen using the genetic algorithm. The fitness function is based on the idea of minimum description length, i.e., we choose the minimum number of elements necessary to cover all the words. The obtained subset is used to divide the words in the wordlist. Algorithm parameters are presented. A preliminary evaluation of the experimental results for a dictionary of Spanish is given.
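The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_affix` limit, the GA settings (population size, tournament selection, one-point crossover, mutation rate), and the exact form of the fitness function are all assumptions made for illustration, and the paradigm-filtering step is omitted for brevity. The fitness simply counts the selected elements and adds a large penalty for every word that cannot be covered, which is one simple reading of "the minimum number of elements necessary to cover all the words".

```python
import random

def candidate_splits(word, max_affix=3):
    """Redundant list of (prefix, stem, suffix) splits with a non-empty stem."""
    n = len(word)
    return [(word[:i], word[i:j], word[j:])
            for i in range(min(max_affix, n - 1) + 1)
            for j in range(i + 1, n + 1)
            if n - j <= max_affix]

def build_inventory(words, max_affix=3):
    """Collect every candidate prefix, stem, and suffix as a (kind, string) pair."""
    splits = {w: candidate_splits(w, max_affix) for w in words}
    items = set()
    for cands in splits.values():
        for p, s, x in cands:
            items.add(("stem", s))
            if p:
                items.add(("prefix", p))
            if x:
                items.add(("suffix", x))
    return splits, sorted(items)

def usable(split, chosen):
    """A split is usable if its stem and its non-empty affixes are all selected."""
    p, s, x = split
    return (("stem", s) in chosen
            and (not p or ("prefix", p) in chosen)
            and (not x or ("suffix", x) in chosen))

def fitness(bits, inventory, splits):
    """Assumed MDL-style cost (lower is better): selected items plus coverage penalty."""
    chosen = {item for item, b in zip(inventory, bits) if b}
    uncovered = sum(1 for cands in splits.values()
                    if not any(usable(c, chosen) for c in cands))
    return len(chosen) + uncovered * (len(inventory) + 1)

def evolve(words, generations=100, pop_size=40, p_mut=0.02, seed=0):
    """Plain generational GA over bit vectors that select inventory items."""
    rng = random.Random(seed)
    splits, inventory = build_inventory(words)
    n = len(inventory)
    # Seed with the trivial solution: every whole word is its own stem.
    whole = {("stem", w) for w in words}
    pop = [[int(item in whole) for item in inventory]]
    pop += [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]
    score = lambda ind: fitness(ind, inventory, splits)
    for _ in range(generations):
        ranked = sorted(pop, key=score)
        nxt = ranked[:2]  # elitism: the two best individuals survive unchanged
        while len(nxt) < pop_size:
            # Tournament of size 3: the lowest index in the ranked list wins.
            a = ranked[min(rng.sample(range(pop_size), 3))]
            b = ranked[min(rng.sample(range(pop_size), 3))]
            cut = rng.randrange(1, n)  # one-point crossover
            nxt.append([bit ^ (rng.random() < p_mut) for bit in a[:cut] + b[cut:]])
        pop = nxt
    return min(pop, key=score), inventory, splits

def segment(best, inventory, splits):
    """Divide each word using the first split supported by the selected items."""
    chosen = {item for item, b in zip(inventory, best) if b}
    result = {}
    for w, cands in splits.items():
        for c in cands:
            if usable(c, chosen):
                result[w] = "-".join(part for part in c if part)
                break
    return result

words = ["canto", "cantas", "canta", "hablo", "hablas", "habla"]
best, inventory, splits = evolve(words)
for word, division in segment(best, inventory, splits).items():
    print(word, "->", division)
```

Because the initial population contains the trivial solution (each word as an undivided stem) and the best individuals are carried over unchanged, the result always covers every word and never uses more elements than there are words. Whether the divisions found are linguistically plausible depends on the wordlist and on the assumed parameters.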

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Alexander Gelbukh (1)
  • Grigori Sidorov (1)
  • Diego Lara-Reyes (1)
  • Liliana Chanona-Hernandez (2)
  1. Natural Language and Text Processing Laboratory, Center for Research in Computer Science, National Polytechnic Institute, Mexico City, Mexico
  2. ESIME Zacatenco, National Polytechnic Institute, Mexico City, Mexico
