N-Gram Similarity and Distance

Kondrak, Grzegorz

doi:10.1007/11575832_13

Grzegorz Kondrak¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 3772))

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

2244 Accesses
135 Citations

Abstract

In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based are edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Brew, C., McKelvie, D.: Word-pair extraction for lexicography. In: Proc. of the 2nd Intl Conf. on New Methods in Language Processing, pp. 45–55 (1996)
Google Scholar
Chvátal, V., Sankoff, D.: Longest common subsequences of two random sequences. Journal of Applied Probability 12, 306–315 (1975)
Article MATH MathSciNet Google Scholar
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press, Cambridge (2001)
MATH Google Scholar
Hewson, J.: A computer-generated dictionary of proto-Algonquian, Hull, Canadian Museum of Civilization, Quebec (1993)
Google Scholar
Lambert, B.L., Lin, S.-J., Chang, K.-Y., Gandhi, S.K.: Similarity As a Risk Factor in Drug-Name Confusion Errors: The Look-Alike (Orthographic) and Sound-Alike (Phonetic) Model. Medical Care 37(12), 1214–1225 (1999)
Article Google Scholar
Marzal, A., Vidal, E.: Computation of normalized edit distance and applications. IEEE Trans. Pattern Analysis and Machine Intelligence 15(9), 926–932 (1993)
Article Google Scholar
Melamed, I.D.: Manual annotation of translational equivalence: The Blinker project. Technical Report IRCS #98-07, University of Pennsylvania (1998)
Google Scholar
Melamed, I.D.: Bitext maps and alignment via pattern recognition. Computational Linguistics 25(1), 107–130 (1999)
Google Scholar
Sankoff, D., Kruskal, J.B. (eds.): Time warps, string edits, and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, Reading (1983)
Google Scholar
Smyth, B.: Computing Patterns in Strings. Pearson, London (2003)
Google Scholar
Tufis, D.: A cheap and fast way to build useful translation lexicons. In: Proc. of the 19th Intl Conf. on Computational Linguistics, pp. 1030–1036 (2002)
Google Scholar
Ukkonen, E.: Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science 92, 191–211 (1992)
Article MATH MathSciNet Google Scholar
Use caution — avoid confusion. United States Pharmacopeial Convention Quality Review, No. 76 (March 2001), Available from http://www.bhhs.org/pdf/qr76.pdf
Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. Journal of the Association for Computing Machinery 21(1), 168–173 (1974)
MATH MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
Grzegorz Kondrak

Authors

Grzegorz Kondrak
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

University of Toronto,
Mariano Consens
Dept. of Computer Science, University of Chile,
Gonzalo Navarro

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Kondrak, G. (2005). N-Gram Similarity and Distance. In: Consens, M., Navarro, G. (eds) String Processing and Information Retrieval. SPIRE 2005. Lecture Notes in Computer Science, vol 3772. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575832_13

Download citation

DOI: https://doi.org/10.1007/11575832_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29740-6
Online ISBN: 978-3-540-32241-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics