Testing Word Similarity: Language Independent Approach with Examples from Romance

  • Mikhail Alexandrov
  • Xavier Blanco
  • Pavel Makagonov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3136)


Identifying words with the same basic meaning (stemming) has important applications in Information Retrieval, above all in constructing word frequency lists. The usual morphology-based approaches (including the Porter stemmers) rely on language-dependent linguistic resources or knowledge, which causes problems when working with multilingual data and multi-thematic document collections. We suggest several empirical formulae with easy-to-adjust parameters and demonstrate how to construct such formulae for a given language using an inductive method of model self-organization. This method considers a set of models (formulae) of a given class and selects the best ones using training and test samples. We describe the method and give detailed examples for French, Italian, Portuguese, and Spanish. The formulae are evaluated on real domain-oriented document collections. Our approach can be easily applied to other European languages.


Information Retrieval, Word Pair, Document Collection, External Criterion, Word Similarity
These keywords were added by machine and not by the authors.



References

  1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
  2. Cramer, H.: Mathematical Methods of Statistics. Cambridge (1946)
  3. Gelbukh, A.: Exact and approximate prefix search under access locality requirements for morphological analysis and spelling correction. Computación y Sistemas 6(3), 167–182 (2003)
  4. Gelbukh, A., Sidorov, G.: Zipf and Heaps Laws’ Coefficients Depend on Language. In: Gelbukh, A. (ed.) CICLing 2001. LNCS, vol. 2004, pp. 332–335. Springer, Heidelberg (2001)
  5. Gelbukh, A., Sidorov, G.: Morphological Analysis of Inflective Languages through Generation. Procesamiento de Lenguaje Natural (29), 105–112 (2002)
  6. Gelbukh, A., Sidorov, G.: Approach to construction of automatic morphological analysis systems for inflective languages with little effort. In: Gelbukh, A. (ed.) CICLing 2003. LNCS, vol. 2588, pp. 215–220. Springer, Heidelberg (2003)
  7. Ivahnenko, A.: Manual on typical algorithms of modeling. Tehnika Publ., Kiev (1980) (in Russian)
  8. Makagonov, P., Alexandrov, M.: Constructing empirical formulas for testing word similarity by the inductive method of model self-organization. In: Ranchhod, Mamede (eds.) Advances in Natural Language Processing. LNCS (LNAI), vol. 2379, pp. 239–247. Springer, Heidelberg (2002)
  9. Porter, M.: An algorithm for suffix stripping. Program 14, 130–137 (1980)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Mikhail Alexandrov (1)
  • Xavier Blanco (2)
  • Pavel Makagonov (3)
  1. Center for Computing Research, National Polytechnic Institute (IPN), Mexico
  2. Department of French and Romance Philology, Autonomous University of Barcelona
  3. Mixteca University of Technology, Mexico
