Abstract

In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
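The two unigram measures named in the abstract can be sketched with standard dynamic programming. The block below is a minimal illustration, not the paper's algorithm: `lcs_length` and `edit_distance` are the textbook recurrences, while `ngrams`, `ngram_similarity`, and `ngram_distance` are hypothetical helper names that simply apply those recurrences to sequences of contiguous n-grams, counting two n-grams as matching only when they are identical. The paper's formal definitions may differ (for example in how partial n-gram matches and string boundaries are handled); the only property relied on here is that n = 1 gives back the unigram measures.

```python
def ngrams(s, n):
    """Return the contiguous n-grams of s as a list (empty if len(s) < n)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Unit-cost edit (Levenshtein) distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Assumption: an illustrative generalization with exact n-gram matching only.
# This is NOT claimed to be the definition given in the paper.
def ngram_similarity(s, t, n=1):
    return lcs_length(ngrams(s, n), ngrams(t, n))


def ngram_distance(s, t, n=1):
    return edit_distance(ngrams(s, n), ngrams(t, n))


if __name__ == "__main__":
    print(lcs_length("kitten", "sitting"))     # 4
    print(edit_distance("kitten", "sitting"))  # 3
```

With n = 1 the n-gram sequence is just the list of characters, so `ngram_similarity` and `ngram_distance` coincide with `lcs_length` and `edit_distance`, which mirrors the special-case relationship stated in the abstract.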

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Grzegorz Kondrak, Department of Computing Science, University of Alberta, Edmonton, Canada
