N-Gram Similarity and Distance
In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
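As a point of reference for the two unigram measures named in the abstract, the following is a minimal Python sketch of the standard dynamic-programming computations of edit distance (unit costs for insertion, deletion, and substitution) and longest-common-subsequence length. It does not reproduce the paper's n-gram definitions or algorithms, which the abstract does not state; the function names `edit_distance` and `lcs_length` are illustrative choices, not taken from the paper.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit (Levenshtein) distance between x and y."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, cy in enumerate(y, start=1):
            substitution = prev[j - 1] + (0 if cx == cy else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(lcs_length("kitten", "sitting"))
```

Both functions keep only two rows of the dynamic-programming table, so memory use grows with the length of the second string rather than with the product of the two lengths.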