Constructing Empirical Formulas for Testing Word Similarity by the Inductive Method of Model Self-Organization
Identifying words with the same base meaning is a necessary step in many algorithms of computational linguistics and text processing. For this task we propose a knowledge-poor approach: an empirical formula based on the number of coincident letters in the initial parts of two words and the number of non-coincident letters in their final parts. To construct such a formula for a given language, we use the inductive method of model self-organization developed by A. Ivahnenko. This method considers a set of models (formulas) of a given class and selects the best ones using training and test samples. We give a detailed example for English. We also show how to apply the formula to create a word frequency list.
Keywords: Empirical Formula, Word Pair, Automatic Documentation, Mathematical Linguistics, External Criterion
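The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the formula from the paper: the decision rule (share of non-coincident letters compared with a polynomial in the length of the common initial part), the coefficient grid, the toy word pairs, and all function names are assumptions made for illustration only. The sketch fits candidate formulas of increasing complexity on a training sample and then keeps the one that makes the fewest errors on a separate test sample, which plays the role of the external criterion.

```python
from itertools import product, takewhile


def common_prefix_len(a, b):
    """Number of coincident letters in the initial parts of two words."""
    return sum(1 for _ in takewhile(lambda p: p[0] == p[1], zip(a, b)))


def is_similar(a, b, coeffs):
    """Assumed rule: n / s <= c0 + c1*y + c2*y**2 + ...

    y: length of the common initial part
    n: number of non-coincident letters in the final parts of both words
    s: total number of letters in both words
    """
    y = common_prefix_len(a, b)
    n = (len(a) - y) + (len(b) - y)
    s = len(a) + len(b)
    threshold = sum(c * y ** k for k, c in enumerate(coeffs))
    return n / s <= threshold


def errors(coeffs, sample):
    """Count labeled pairs (a, b, same_base) that the rule misclassifies."""
    return sum(is_similar(a, b, coeffs) != label for a, b, label in sample)


# Hypothetical coefficient grid: -0.20, -0.15, ..., 0.60
GRID = tuple(k / 20 for k in range(-4, 13))


def select_model(train, test, max_degree=2, grid=GRID):
    """Pick the polynomial formula with the fewest errors on the test sample."""
    best = None
    for degree in range(max_degree + 1):
        # Fit: coefficients of this degree with the fewest training errors.
        fitted = min(product(grid, repeat=degree + 1),
                     key=lambda c: errors(c, train))
        # External criterion: judge the fitted formula on unseen pairs.
        score = errors(fitted, test)
        if best is None or score < best[0]:
            best = (score, fitted)
    return best[1]


# Toy labeled samples (invented for this sketch, not taken from the paper).
train_pairs = [
    ("testing", "tested", True),
    ("walks", "walked", True),
    ("move", "moon", False),
    ("cat", "car", False),
]
test_pairs = [
    ("jumps", "jumped", True),
    ("dog", "dot", False),
]

model = select_model(train_pairs, test_pairs)
print("selected coefficients:", model)
```

Ties are broken in favor of the lower-degree formula because a more complex candidate replaces the current best only when it makes strictly fewer errors on the test sample. A realistic setup would use far larger samples and a proper fitting procedure instead of a coarse grid search.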
- 1. Gelbukh, A. (1992): Effective implementation of morphology model for an inflectional natural language. J. Automatic Documentation and Mathematical Linguistics, Allerton Press, 26, NI, pp. 22–31.
- 2. Gelbukh, A. (ed.) (2000): Computational Linguistics and Intelligent Text Processing. Proc. of CICLing-2002, IPN, Mexico City, 2002, 430 pp.
- 3. Gelbukh, A. and G. Sidorov (2002): A Method for Development of Automatic Morphological Analysis Systems for Inflective Languages. In: Text, Speech and Dialogue, Proc. of TSD-2002, September 2002, Lecture Notes in Computer Science, Springer Verlag.
- 4. Gelbukh, A. (2002): A data structure for prefix search under access locality requirements and its application to spelling correction. J. Computación y Sistemas.
- 5. Ivahnenko, A. (1980): Manual on typical algorithms of modeling. Tehnika Publ., Kiev (in Russian).
- 6. Makagonov, P., Alexandrov, M. (2002): Empirical formula for testing word similarity and its application for constructing a word frequency list. In: A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing, Proc. of CICLing-2002, Lecture Notes in Computer Science, N 2276, Springer Verlag, pp. 425–432.
- 7. Manning, C. D., Schütze, H. (1999): Foundations of statistical natural language processing. MIT Press.
- 8. Porter, M. (1980): An algorithm for suffix stripping. Program, 14, pp. 130–137.