Vector representations for efficient comparison and search for similar strings

Cybernetics and Systems Analysis

Abstract

A method is proposed for approximating the classic edit distance between strings. The method maps strings to vectors in a space with an easily computable metric. It preserves the closeness of strings and makes it possible to accelerate the computation of edit distances. The developed q-gram method for approximating edit distances and its two randomized versions improve the approximation quality in comparison with well-known results.
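
To make the idea of mapping strings to vectors concrete, here is a minimal Python sketch of the generic q-gram count vector with an L1 metric. It is not the method developed in the article, whose details the abstract does not give. The function names and the default q = 2 are assumptions chosen for illustration. The sketch relies on the known bound due to Ukkonen that the L1 distance between q-gram count vectors is at most 2q times the edit distance, so the cheap vector distance gives a lower bound on the expensive one.

```python
from collections import Counter


def qgram_profile(s: str, q: int = 2) -> Counter:
    """Count every length-q substring of s (a sparse q-gram count vector)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """L1 distance between the q-gram count vectors of a and b."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    # Counter returns 0 for q-grams that are absent from one of the strings.
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, used here only for comparison."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]
```

For "kitten" and "sitting" with q = 2, the vector distance is 7 and the edit distance is 3, which is consistent with the bound 7 ≤ 2 · 2 · 3. The vector distance takes time linear in the string lengths, whereas the dynamic program is quadratic.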

Additional information

Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 18–38, July–August 2007.

About this article

Cite this article

Sokolov, A.M. Vector representations for efficient comparison and search for similar strings. Cybern Syst Anal 43, 484–498 (2007). https://doi.org/10.1007/s10559-007-0075-1

  • DOI: https://doi.org/10.1007/s10559-007-0075-1
