Abstract
Results of numerical experiments on artificial data are presented. The experiments test theoretically derived properties of a randomized scheme for embedding the edit distance into a vector space. The application of the scheme to the search for similar texts is also described, using the problems of duplicate filtering and spam detection as examples.
Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 32–47, July–August 2008.
Sokolov, A.M. Investigation of accelerated search for close text sequences with the help of vector representations. Cybern Syst Anal 44, 493–506 (2008). https://doi.org/10.1007/s10559-008-9021-0