Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

Uno, Takeaki

doi:10.1007/s10115-009-0271-6

Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

Regular Paper
Published: 25 November 2009

Volume 25, pages 229–251, (2010)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Takeaki Uno¹

189 Accesses
17 Citations
Explore all metrics

Abstract

Finding similar substrings/substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, feature vectors of pictures, photos, videos, etc. Although the existence of polynomial time algorithms for such problems is trivial since the number of substrings is bounded by the square of their lengths, straightforward algorithms do not work for huge databases because of their high degree order of the computation time. This paper addresses the problem of finding pairs of strings with small Hamming distances from huge databases composed of short strings of a fixed length. Comparison of long strings can be solved by inputting all their substrings of fixed length so that we can find candidates of similar non-short substrings. We focus on the practical efficiency of algorithms, and propose an algorithm that runs in time almost linear in the input/output size. We prove that the computation time of its variant is linear in the database size when the length of the short strings is constant, and computational experiments for genome sequences and Web texts show its practical efficiency. Slight modifications adapt to the edit distance and mismatch tolerance computation. An implementation is available at the author’s homepage.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

References

Abrahamson K (1987) Generalized string matching. SIAM J Comput 16: 1039–1051
Article MATH MathSciNet Google Scholar
Altschul FS, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410
Google Scholar
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402
Article Google Scholar
Amir A, Lewenstein M, Porat E (2000) Faster algorithms for string matching with k mismatches. In: Symposium on discrete algorithms, pp 794–803
Assent I, Krieger R, Glavic B, Seidl T (2008) Clustering multidimensional sequences in spatial and temporal databases. Knowl Inf Syst 16: 29–51
Article Google Scholar
Brown P, Botstein D (2000) Exploring the new world of the genome with DNA microarrays. Nat Genet 21: 33–37
Article Google Scholar
Faloutsos C, Barber R, Flickner M, Hafner J, Niblack W, Petkovic D, Equitz W (1994) Efficient and effective querying by image content. Intell Inf Syst 3: 231–262
Article Google Scholar
Feigenbaum J, Kannan S, Strauss M, Viswanathan M (1999) An approximate L1-difference algorithm for massive data streams. In: Proceedings of FOCS99
Fleischmann RD, Adams MD, White O, Clayton RA (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 28: 496–512
Article Google Scholar
Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science 31: 1441–1442
Google Scholar
Kawahara D, Kurohashi S (2006) Case frame compilation from the web using high-performance computing. In: Proceedings of the 5th international conference on language resources and evaluation (LREC2006), pp 1344–1347
Koga H, Ishibashi T, Watanabe T (2007) Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowl Inf Syst 12: 25–53
Article Google Scholar
Liu J, Goss S, Murray G (1994) Similarity comparison and analysis of sequential data. In: IEEE international conference on expert systems for development, pp 138–143
Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22: 935–948
Article MATH MathSciNet Google Scholar
Muthukrishnan S, Sahinalp SC (2000) Approximate nearest neighbors and sequence comparison with block operations. In: Proceedings of 32nd annual ACM symposium on theory of computing, pp 416–424
Muthukrishnan S, Sahinalp SC (2002) Simple and practical sequence nearest neighbors under block edit operations. In: Proceedings of CPM2002
Pearson WR (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132: 185–219
Google Scholar
Popendorf K, Osana Y, Hachiya T, Sakakibara Y (2007) Murasaki-homology detection across multiple large-scale genomes. In: Fifth annual RECOMB satellite workshop on comparative genomics (San Diego, USA, 2007)
Shivakumar N, Garcia-Molina H (1996) Building a scalable and accurate copy detection mechanism. In: International conference on digital libraries, proceedings of the first ACM international conference on digital libraries, pp 160–168
Yamada S, Gotoh O, Yamana H (2006) Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost. BMC Bioinform 7: 524
Article Google Scholar
Yamada T, Morishita S (2003) Computing highly specific and mismatch tolerant oligomers efficiently. Bioinformatics Conference 2003
Yamada T, Morishita S (2005) Accelerated off-target search algorithm for siRNA. Bioinformatics 21: 1316–1324
Article Google Scholar

Download references

Author information

Authors and Affiliations

National Institute of Informatics, 2-1-2, Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan
Takeaki Uno

Authors

Takeaki Uno
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Takeaki Uno.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Uno, T. Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data. Knowl Inf Syst 25, 229–251 (2010). https://doi.org/10.1007/s10115-009-0271-6

Download citation

Received: 30 September 2008
Revised: 20 October 2009
Accepted: 23 October 2009
Published: 25 November 2009
Issue Date: November 2010
DOI: https://doi.org/10.1007/s10115-009-0271-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

Abstract

Access this article

Similar content being viewed by others

Indexed Matching Statistics and Shortest Unique Substrings

On-line String Matching in Highly Similar DNA Sequences

Suffix sorting via matching statistics

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

Abstract

Access this article

Similar content being viewed by others

Indexed Matching Statistics and Shortest Unique Substrings

On-line String Matching in Highly Similar DNA Sequences

Suffix sorting via matching statistics

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation