Division of Spanish Words into Morphemes with a Genetic Algorithm

Gelbukh, Alexander; Sidorov, Grigori; Lara-Reyes, Diego; Chanona-Hernandez, Liliana

doi:10.1007/978-3-540-69858-6_4


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5039)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems


Abstract

We discuss an unsupervised technique for determining the morpheme structure of words in an inflective language, with Spanish as a case study. We use global optimization (implemented with a genetic algorithm), whereas most previous work relies on heuristics calculated from conditional probabilities of word parts. We therefore search the complete space of solutions instead of reducing it beforehand at the risk of eliminating some correct solutions. We also work at the derivational level, in contrast to the more traditional grammatical level, which is concerned only with inflections. The algorithm works as follows. The input is a wordlist built from a large dictionary or corpus of the given language, and the output is the same wordlist with each word divided into morphemes. First, we build a redundant list of all strings that might possibly be prefixes, suffixes, and stems of the words in the wordlist. Then, we detect possible paradigms in this set and remove from the lists of possible prefixes and suffixes (though not stems) all items that do not participate in such paradigms. Finally, a subset of those lists of possible prefixes, stems, and suffixes is chosen using the genetic algorithm. The fitness function is based on the idea of minimum description length, i.e., we choose the minimum number of elements necessary to cover all the words. The resulting subset is used to divide the words in the wordlist. We present the algorithm parameters and a preliminary evaluation of the experimental results for a dictionary of Spanish.
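The three steps named in the abstract (redundant candidate lists, paradigm filtering, genetic selection of a minimal covering subset) can be illustrated with a toy Python sketch. This is not the authors' implementation. The abstract does not define a paradigm, the chromosome encoding, the genetic operators, or the exact fitness formula, so everything below is an assumption made for illustration: a "paradigm" is taken to be two or more affixes attested with the same remainder of a word, a chromosome is a bit vector over all candidate elements, and the cost is the number of selected elements plus a penalty for each uncovered word. The names `candidate_splits`, `paradigm_filter`, and `segment` are hypothetical.

```python
import random

def candidate_splits(word):
    """Every (prefix, stem, suffix) split of `word` with a non-empty stem."""
    n = len(word)
    return [(word[:i], word[i:j], word[j:])
            for i in range(n) for j in range(i + 1, n + 1)]

def paradigm_filter(words):
    """Keep affixes that alternate with another affix on a shared remainder.
    Assumption: a 'paradigm' is two or more endings (or beginnings)
    attested with the same non-empty remainder of a word."""
    endings, beginnings = {}, {}
    for w in words:
        for k in range(1, len(w) + 1):      # non-empty head w[:k]
            endings.setdefault(w[:k], set()).add(w[k:])
        for k in range(len(w)):             # non-empty tail w[k:]
            beginnings.setdefault(w[k:], set()).add(w[:k])
    suffixes = {""} | {s for g in endings.values() if len(g) > 1 for s in g}
    prefixes = {""} | {p for g in beginnings.values() if len(g) > 1 for p in g}
    return prefixes, suffixes

def segment(words, generations=200, pop_size=40, seed=0):
    """Toy GA: choose few prefixes, stems, suffixes that cover every word."""
    rng = random.Random(seed)
    prefixes, suffixes = paradigm_filter(words)
    # Stems are not filtered, only prefixes and suffixes (as in the abstract).
    splits = {w: [(("P", p), ("S", s), ("X", x))
                  for p, s, x in candidate_splits(w)
                  if p in prefixes and x in suffixes] for w in words}
    items = sorted({e for ts in splits.values() for t in ts for e in t})
    index = {e: i for i, e in enumerate(items)}
    penalty = len(items) + 1                # one uncovered word outweighs any saving

    def usable(w, genes):
        # Splits of w whose three elements are all switched on in `genes`.
        return [t for t in splits[w] if all(genes[index[e]] for e in t)]

    def cost(genes):                        # lower is better
        uncovered = sum(1 for w in words if not usable(w, genes))
        return sum(genes) + penalty * uncovered

    # Seeding with the all-ones individual guarantees a covering solution.
    pop = [[1] * len(items)]
    pop += [[rng.randint(0, 1) for _ in items] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[:pop_size // 2]       # truncation selection, elite kept
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(items))        # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(len(items))] ^= 1     # single bit-flip mutation
            children.append(child)
        pop = parents + children
    best = min(pop, key=cost)
    # Arbitrary tie-break (an assumption): take the first usable split.
    return {w: tuple(e[1] for e in usable(w, best)[0]) for w in words}

words = ["gato", "gatos", "perro", "perros"]
for word, parts in segment(words).items():
    print(word, "->", "-".join(p for p in parts if p))
```

Because the best individual is never discarded and the initial population contains the all-ones chromosome, the returned segmentation always covers every input word. How close it comes to a linguistically sensible division depends on the cost function and parameters, which are only placeholders here.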

Work done under partial support of the Mexican Government (CONACYT, SNI) and the National Polytechnic Institute, Mexico (SIP, COFAA, PIFI).



Author information

Authors and Affiliations

Natural Language and Text Processing Laboratory, Center for Research in Computer Science, National Polytechnic Institute, Av. Juan Dios Batiz, s/n, Zacatenco, 07738, Mexico City, Mexico
Alexander Gelbukh, Grigori Sidorov & Diego Lara-Reyes
ESIME Zacatenco, National Polytechnic Institute, Zacatenco, 07738, Mexico City, Mexico
Liliana Chanona-Hernandez


Editor information

Epaminondas Kapetanios, Vijayan Sugumaran, Myra Spiliopoulou


About this paper

Cite this paper

Gelbukh, A., Sidorov, G., Lara-Reyes, D., Chanona-Hernandez, L. (2008). Division of Spanish Words into Morphemes with a Genetic Algorithm. In: Kapetanios, E., Sugumaran, V., Spiliopoulou, M. (eds) Natural Language and Information Systems. NLDB 2008. Lecture Notes in Computer Science, vol 5039. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69858-6_4


DOI: https://doi.org/10.1007/978-3-540-69858-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69857-9
Online ISBN: 978-3-540-69858-6
eBook Packages: Computer Science, Computer Science (R0)
