Abstract
A method is proposed for approximating the classic edit distance between strings. The method maps strings into vectors belonging to a space with an easily calculable metric. The mapping preserves the closeness of strings and makes it possible to accelerate the computation of edit distances. The developed q-gram method for approximating edit distances and its two randomized versions improve the approximation quality in comparison with well-known results.
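To make the idea of mapping strings into vectors concrete, the following is a minimal sketch of the standard q-gram distance, that is, the L1 distance between vectors of q-gram counts. It is not the method developed in the article, whose details are not given in the abstract; the function names `qgram_profile`, `qgram_distance`, and `edit_distance` and the default `q=2` are assumptions chosen for illustration. The sketch relies on the known property that a single edit operation changes at most q of the q-grams in a string, so the q-gram distance does not exceed 2q times the edit distance.

```python
from collections import Counter


def qgram_profile(s, q=2):
    """Map a string to a sparse vector of q-gram counts (illustrative helper)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a, b, q=2):
    """L1 distance between the q-gram count vectors of two strings."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    # Counter returns 0 for q-grams that do not occur in a string.
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


def edit_distance(a, b):
    """Classic dynamic-programming edit distance, used here for comparison."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    x, y, q = "kitten", "sitting", 2
    d_q = qgram_distance(x, y, q)
    d_e = edit_distance(x, y)
    # The q-gram distance is cheap to compute and bounds the edit distance from below.
    print(d_q, d_e, d_q <= 2 * q * d_e)
```

The vector distance costs time linear in the string lengths, whereas the dynamic-programming edit distance is quadratic, which is why a vector representation of this kind can serve as a fast filter before exact comparison.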
Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 18–38, July–August 2007.
Sokolov, A.M. Vector representations for efficient comparison and search for similar strings. Cybern Syst Anal 43, 484–498 (2007). https://doi.org/10.1007/s10559-007-0075-1
DOI: https://doi.org/10.1007/s10559-007-0075-1