Cybernetics and Systems Analysis, Volume 44, Issue 4, pp 493–506

Investigation of accelerated search for close text sequences with the help of vector representations

  • A. M. Sokolov
Cybernetics

Abstract

The results of numerical experiments on artificial data are presented. The experiments test theoretically derived properties of a randomized scheme for embedding edit distance into a vector space. The application of the scheme to the search for similar texts is also described for the problems of duplicate filtering and spam detection.
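
To make the idea of replacing edit distance with a vector-space distance concrete, the following minimal sketch compares the exact edit distance with the L1 distance between q-gram count vectors. This is a deterministic stand-in chosen for illustration, not the randomized embedding studied in the article; the function names and the default q = 3 are assumptions. The only property it relies on is the well-known q-gram bound: the L1 distance between q-gram profiles does not exceed 2q times the edit distance.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def qgram_profile(s: str, q: int = 3) -> Counter:
    """Sparse vector of q-gram counts (illustrative embedding, an assumption)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a: str, b: str, q: int = 3) -> int:
    """L1 distance between the q-gram count vectors of a and b."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


if __name__ == "__main__":
    x, y = "kitten", "sitting"
    print(edit_distance(x, y), qgram_distance(x, y, q=3))
```

Because the vector distance is cheap to compute and never exceeds 2q times the edit distance, a pair whose q-gram distance is above 2q·k cannot be within edit distance k, so such pairs can be discarded before the quadratic-time comparison is run.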

Keywords

edit distance approximation; approximate nearest-neighbor search; neural information technology; spam detection; duplicate detection

Copyright information

© Springer Science+Business Media, Inc. 2008

Authors and Affiliations

  1. International Scientific-Educational Center of Information Technologies and Systems, NAS of Ukraine, Kiev, Ukraine
  2. Ministry of Education and Science of Ukraine, Kiev, Ukraine
