N-Gram Similarity and Distance
Many applications need to quantify algorithmically the similarity between two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and on the length of the longest common subsequence. We develop a notion of n-gram similarity and distance, and show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report experiments suggesting that the new measures outperform their unigram equivalents.
Keywords: Word Pair, Edit Distance, Identity Match, Longest Common Subsequence, Candidate Pair
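For orientation, the two unigram measures named in the abstract can be sketched with textbook dynamic programming. This is a minimal illustration of the standard definitions only, not the paper's n-gram algorithms, which are not reproduced here. The function names `lcs_length` and `edit_distance` are hypothetical, and unit costs for insertion, deletion, and substitution are an assumption.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y (a similarity)."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Matching symbols extend the best alignment of both prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise skip a symbol in x or in y, whichever is better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(x: str, y: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # deleting all i symbols of x[:i] yields the empty string
        for j, b in enumerate(y, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a from x
                            curr[j - 1] + 1,      # insert b into x
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("kitten", "sitting"))
    print(edit_distance("kitten", "sitting"))
```

Both functions keep only two rows of the table, so they run in time proportional to the product of the string lengths and in space proportional to the length of the second string.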