Skip to main content
Log in

Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Finding similar substrings/substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, feature vectors of pictures, photos, videos, etc. Although the existence of polynomial time algorithms for such problems is trivial since the number of substrings is bounded by the square of their lengths, straightforward algorithms do not work for huge databases because of their high degree order of the computation time. This paper addresses the problem of finding pairs of strings with small Hamming distances from huge databases composed of short strings of a fixed length. Comparison of long strings can be solved by inputting all their substrings of fixed length so that we can find candidates of similar non-short substrings. We focus on the practical efficiency of algorithms, and propose an algorithm that runs in time almost linear in the input/output size. We prove that the computation time of its variant is linear in the database size when the length of the short strings is constant, and computational experiments for genome sequences and Web texts show its practical efficiency. Slight modifications adapt to the edit distance and mismatch tolerance computation. An implementation is available at the author’s homepage.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Abrahamson K (1987) Generalized string matching. SIAM J Comput 16: 1039–1051

    Article  MATH  MathSciNet  Google Scholar 

  2. Altschul FS, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410

    Google Scholar 

  3. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402

    Article  Google Scholar 

  4. Amir A, Lewenstein M, Porat E (2000) Faster algorithms for string matching with k mismatches. In: Symposium on discrete algorithms, pp 794–803

  5. Assent I, Krieger R, Glavic B, Seidl T (2008) Clustering multidimensional sequences in spatial and temporal databases. Knowl Inf Syst 16: 29–51

    Article  Google Scholar 

  6. Brown P, Botstein D (2000) Exploring the new world of the genome with DNA microarrays. Nat Genet 21: 33–37

    Article  Google Scholar 

  7. Faloutsos C, Barber R, Flickner M, Hafner J, Niblack W, Petkovic D, Equitz W (1994) Efficient and effective querying by image content. Intell Inf Syst 3: 231–262

    Article  Google Scholar 

  8. Feigenbaum J, Kannan S, Strauss M, Viswanathan M (1999) An approximate L1-difference algorithm for massive data streams. In: Proceedings of FOCS99

  9. Fleischmann RD, Adams MD, White O, Clayton RA (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 28: 496–512

    Article  Google Scholar 

  10. Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science 31: 1441–1442

    Google Scholar 

  11. Kawahara D, Kurohashi S (2006) Case frame compilation from the web using high-performance computing. In: Proceedings of the 5th international conference on language resources and evaluation (LREC2006), pp 1344–1347

  12. Koga H, Ishibashi T, Watanabe T (2007) Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowl Inf Syst 12: 25–53

    Article  Google Scholar 

  13. Liu J, Goss S, Murray G (1994) Similarity comparison and analysis of sequential data. In: IEEE international conference on expert systems for development, pp 138–143

  14. Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22: 935–948

    Article  MATH  MathSciNet  Google Scholar 

  15. Muthukrishnan S, Sahinalp SC (2000) Approximate nearest neighbors and sequence comparison with block operations. In: Proceedings of 32nd annual ACM symposium on theory of computing, pp 416–424

  16. Muthukrishnan S, Sahinalp SC (2002) Simple and practical sequence nearest neighbors under block edit operations. In: Proceedings of CPM2002

  17. Pearson WR (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132: 185–219

    Google Scholar 

  18. Popendorf K, Osana Y, Hachiya T, Sakakibara Y (2007) Murasaki-homology detection across multiple large-scale genomes. In: Fifth annual RECOMB satellite workshop on comparative genomics (San Diego, USA, 2007)

  19. Shivakumar N, Garcia-Molina H (1996) Building a scalable and accurate copy detection mechanism. In: International conference on digital libraries, proceedings of the first ACM international conference on digital libraries, pp 160–168

  20. Yamada S, Gotoh O, Yamana H (2006) Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost. BMC Bioinform 7: 524

    Article  Google Scholar 

  21. Yamada T, Morishita S (2003) Computing highly specific and mismatch tolerant oligomers efficiently. Bioinformatics Conference 2003

  22. Yamada T, Morishita S (2005) Accelerated off-target search algorithm for siRNA. Bioinformatics 21: 1316–1324

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Takeaki Uno.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Uno, T. Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data. Knowl Inf Syst 25, 229–251 (2010). https://doi.org/10.1007/s10115-009-0271-6

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-009-0271-6

Keywords

Navigation