Investigation of accelerated search for close text sequences with the help of vector representations
The results of numerical experiments on artificial data are presented. The experiments test theoretically derived properties of a randomized scheme for embedding edit distance into a vector space. The application of the scheme to the search for similar texts is also described for two problems: duplicate filtering and spam detection.
Keywords: edit distance approximation, approximate nearest-neighbor search, neural information technology, spam detection, duplicate detection
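To make the idea of replacing edit distance by a vector-space distance concrete, here is a minimal sketch. It does not reproduce the randomized embedding studied in the paper; as a stand-in it uses a deterministic q-gram count vector, and the function names (`edit_distance`, `qgram_profile`, `l1_distance`) are illustrative assumptions. It relies on the known q-gram lemma: the L1 distance between q-gram profiles is at most 2q times the edit distance, so the cheap vector distance can be used to discard dissimilar strings before running the exact computation.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def qgram_profile(s: str, q: int = 2) -> Counter:
    """Sparse vector of q-gram counts (illustrative stand-in embedding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def l1_distance(u: Counter, v: Counter) -> int:
    """L1 distance between two sparse count vectors."""
    return sum(abs(u[k] - v[k]) for k in set(u) | set(v))


if __name__ == "__main__":
    x, y, q = "kitten", "sitting", 2
    d_edit = edit_distance(x, y)
    d_vec = l1_distance(qgram_profile(x, q), qgram_profile(y, q))
    # q-gram lemma: d_vec <= 2 * q * d_edit
    print(d_edit, d_vec, d_vec <= 2 * q * d_edit)
```

In a search setting, a candidate whose vector distance to the query exceeds 2q times the allowed edit-distance threshold can be skipped without running the quadratic-time dynamic program.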