N-Gram Similarity and Distance

  • Conference paper
String Processing and Information Retrieval (SPIRE 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3772)

Abstract

In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
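For orientation, the two unigram measures that the abstract names as special cases can be sketched with the standard dynamic-programming recurrences. This is only an illustration of unit-cost edit distance and longest-common-subsequence length; it is not the paper's n-gram algorithm, and the function names `edit_distance` and `lcs_length` are chosen here for the example.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit (Levenshtein) distance, computed row by row."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty string
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from x
                curr[j - 1] + 1,         # insert b into x
                prev[j - 1] + (a != b),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)           # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1])) # skip a symbol in x or y
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # 3
    print(lcs_length("kitten", "sitting"))     # 4
```

Both functions compare single symbols, which is the unigram (n = 1) setting that the abstract contrasts with measures defined over longer n-grams.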

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kondrak, G. (2005). N-Gram Similarity and Distance. In: Consens, M., Navarro, G. (eds) String Processing and Information Retrieval. SPIRE 2005. Lecture Notes in Computer Science, vol 3772. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575832_13

  • DOI: https://doi.org/10.1007/11575832_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29740-6

  • Online ISBN: 978-3-540-32241-2

  • eBook Packages: Computer Science, Computer Science (R0)
