Testing Word Similarity: Language Independent Approach with Examples from Romance
Identification of words with the same basic meaning (stemming) has important applications in Information Retrieval, above all for constructing word frequency lists. The usual morphology-based approaches (including the Porter stemmers) rely on language-dependent linguistic resources or knowledge, which causes problems when working with multilingual data and multi-thematic document collections. We suggest several empirical formulae with easily adjustable parameters and demonstrate how to construct such formulae for a given language using an inductive method of model self-organization. This method considers a set of models (formulae) of a given class and selects the best ones using training and test samples. We describe the method and give detailed examples for French, Italian, Portuguese, and Spanish. The formulae are evaluated on real domain-oriented document collections. Our approach can be easily applied to other European languages.
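The abstract does not give the formulae themselves, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes a hypothetical formula class in which two words count as similar when the share of non-matching letters stays below a threshold `a + b*y`, where `y` is the length of their common prefix. It also assumes a simple grid search: each model class is fitted on a training sample, and the class with the lowest error on a separate test sample is kept. The function names, the features, the parameter grids, and the word pairs are all illustrative and not taken from the paper.

```python
from itertools import product


def common_prefix_len(w1, w2):
    """Number of leading letters shared by both words."""
    n = 0
    for c1, c2 in zip(w1, w2):
        if c1 != c2:
            break
        n += 1
    return n


def features(w1, w2):
    """Return (y, ratio): common prefix length and share of non-matching letters."""
    y = common_prefix_len(w1, w2)
    total = len(w1) + len(w2)
    return y, (total - 2 * y) / total


def predict(params, w1, w2):
    """Hypothetical formula: the pair is similar if ratio <= a + b*y."""
    a, b = params
    y, ratio = features(w1, w2)
    return ratio <= a + b * y


def error_rate(params, sample):
    """Fraction of labeled pairs (w1, w2, is_similar) that the formula gets wrong."""
    wrong = sum(predict(params, w1, w2) != label for w1, w2, label in sample)
    return wrong / len(sample)


def fit(train, a_grid, b_grid):
    """Pick the grid point with the lowest error on the training sample."""
    return min(product(a_grid, b_grid), key=lambda p: error_rate(p, train))


def select_model(train, test, classes):
    """Fit every model class on train, then keep the class that does best on test."""
    best = None
    for name, (a_grid, b_grid) in classes.items():
        params = fit(train, a_grid, b_grid)
        err = error_rate(params, test)
        if best is None or err < best[2]:
            best = (name, params, err)
    return best


# Two illustrative model classes: a constant threshold and a linear one.
CLASSES = {
    "constant": ([i / 20 for i in range(11)], [0.0]),
    "linear": ([i / 20 for i in range(11)], [i / 100 for i in range(11)]),
}

# Tiny hand-labeled Spanish samples, for illustration only.
TRAIN = [("cantar", "cantamos", True), ("libro", "libros", True), ("casa", "caso", False)]
TEST = [("hablar", "hablamos", True), ("pera", "perro", False)]

if __name__ == "__main__":
    print(select_model(TRAIN, TEST, CLASSES))
```

Keeping the test sample separate from the training sample is what lets the selection step compare model classes of different complexity; real use would need far larger labeled samples than the toy lists above.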
Keywords: Information Retrieval, Word Pair, Document Collection, External Criterion, Word Similarity
- 1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
- 2. Cramer, H.: Mathematical methods of statistics, Cambridge (1946)
- 3. Gelbukh, A.: Exact and approximate prefix search under access locality requirements for morphological analysis and spelling correction. Computación y Sistemas 6(3), 167–182 (2003)
- 5. Gelbukh, A., Sidorov, G.: Morphological Analysis of Inflective Languages through Generation. Procesamiento de Lenguaje Natural (29), 105–112 (2002)
- 7. Ivahnenko, A.: Manual on typical algorithms of modeling. Tehnika Publ., Kiev (1980) (in Russian)
- 9. Porter, M.: An algorithm for suffix stripping. Program 14, 130–137 (1980)