Investigation of accelerated search for close text sequences with the help of vector representations

  • Cybernetics
Cybernetics and Systems Analysis

Abstract

Results of numerical experiments on artificial data are presented. The experiments test theoretically derived properties of a randomized scheme for embedding the edit distance into a vector space. The application of this scheme to searching for similar texts is also described, with duplicate filtering and spam detection as the target problems.

Author information

Correspondence to A. M. Sokolov.

Additional information

Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 32–47, July–August 2008.

About this article

Cite this article

Sokolov, A.M. Investigation of accelerated search for close text sequences with the help of vector representations. Cybern Syst Anal 44, 493–506 (2008). https://doi.org/10.1007/s10559-008-9021-0

  • DOI: https://doi.org/10.1007/s10559-008-9021-0
