Investigation of accelerated search for close text sequences with the help of vector representations

Sokolov, A. M.

doi:10.1007/s10559-008-9021-0

Cybernetics and Systems Analysis
Published: 15 August 2008

Volume 44, pages 493–506 (2008)

Abstract

The results of numerical experiments on artificial data are presented. The experiments test theoretically derived properties of a randomized scheme for embedding the edit distance into a vector space. The application of the scheme to the search for similar texts is also described for two problems: duplicate filtering and spam detection.

Author information

Authors and Affiliations

International Scientific-Educational Center of Information Technologies and Systems, NAS of Ukraine, Kiev, Ukraine
A. M. Sokolov
Ministry of Education and Science of Ukraine, Kiev, Ukraine
A. M. Sokolov

Corresponding author

Correspondence to A. M. Sokolov.

Additional information

__________

Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 32–47, July–August 2008.

About this article

Cite this article

Sokolov, A.M. Investigation of accelerated search for close text sequences with the help of vector representations. Cybern Syst Anal 44, 493–506 (2008). https://doi.org/10.1007/s10559-008-9021-0

Received: 09 August 2007
Published: 15 August 2008
Issue Date: July 2008
DOI: https://doi.org/10.1007/s10559-008-9021-0
