An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5012)

Abstract

Finding similar substrings or substructures is a central task in analyzing huge amounts of string data such as genome sequences, web documents, and log data. From the viewpoint of complexity theory, the existence of polynomial-time algorithms for such problems is usually trivial, since the number of substrings of a string is bounded by the square of its length. However, straightforward algorithms are impractical for huge databases because their running time is a high-degree polynomial in the input size. This paper addresses the problem of finding pairs of strings with small Hamming distance in huge databases composed of short strings. By solving the problem for all substrings of a fixed length, we can efficiently find candidates for similar longer substrings. We focus on the practical efficiency of algorithms and propose an algorithm that runs in time almost linear in the database size. We prove that the computation time of a variant of the algorithm is linear in the database size when the length of the short strings to be found is constant. With slight modifications, the algorithm adapts to edit distance and to mismatch-tolerance computation. Computational experiments on genome sequences show the efficiency of the algorithm. An implementation is available at the author’s homepage.
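
The abstract states the problem but not the algorithm, so the following is only a minimal sketch of the problem setting, not the paper's method. It assumes a generic pigeonhole filter: if two strings of equal length differ in at most d positions, then among d + 1 disjoint blocks at least one block matches exactly, so only strings that share a block need to be compared. The names `hamming`, `similar_pairs`, and `windows` are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(strings, d):
    """Return index pairs (i, j), i < j, with Hamming distance <= d.

    Assumes all strings share one length l > d. The block filter is a
    generic technique used here for illustration, not the paper's algorithm.
    """
    if not strings:
        return set()
    l = len(strings[0])
    k = d + 1  # d mismatches can touch at most d of the d + 1 blocks
    bounds = [(b * l // k, (b + 1) * l // k) for b in range(k)]
    found = set()
    for lo, hi in bounds:
        # Group strings by the exact content of the current block.
        buckets = defaultdict(list)
        for i, s in enumerate(strings):
            buckets[s[lo:hi]].append(i)
        # Verify candidates that agree on this block.
        for group in buckets.values():
            for i, j in combinations(group, 2):
                if hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return found


def windows(text, l):
    """All substrings of fixed length l, in order of start position."""
    return [text[p:p + l] for p in range(len(text) - l + 1)]
```

For example, `similar_pairs(windows(genome, 30), 2)` would report the pairs of length-30 windows of a string `genome` that differ in at most two positions; such pairs can then serve as seeds when looking for longer similar regions. The verification step is still quadratic within each bucket, so this sketch makes no claim about the near-linear running time reported in the abstract.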



Editor information

Takashi Washio, Einoshin Suzuki, Kai Ming Ting, Akihiro Inokuchi

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Uno, T. (2008). An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2008. Lecture Notes in Computer Science, vol 5012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68125-0_31

  • DOI: https://doi.org/10.1007/978-3-540-68125-0_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68124-3

  • Online ISBN: 978-3-540-68125-0

  • eBook Packages: Computer Science, Computer Science (R0)
