Fast Computation of Good Multiple Spaced Seeds

  • Lucian Ilie
  • Silvana Ilie
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4645)

Abstract

Homology search finds similar segments between two biological sequences, such as DNA or protein sequences. A significant fraction of the world's computing power is dedicated to performing such tasks. The introduction of optimal spaced seeds by Ma et al. has increased both the sensitivity and the speed of homology search, and spaced seeds have been adopted by many alignment programs such as BLAST. With the further improvement provided by multiple spaced seeds in PatternHunterII, the sensitivity of dynamic programming is approached at BLASTn speed. Whereas computing optimal multiple spaced seeds was proved to be NP-hard, we show that, from a practical point of view, computing good ones can be very efficient. We give a simple heuristic algorithm which computes good multiple seeds in polynomial time. Computing sensitivity is not required. When allowing the computation of the sensitivity for a few seeds, we obtain better multiple seeds than previous ones in much shorter time.
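To make the notion of a (multiple) spaced seed hit concrete, the following is a minimal sketch and not the heuristic algorithm of the paper. It assumes a common convention that the abstract does not spell out: a seed is a string of '1' (position must match) and '*' (don't care), and an alignment is a string of '1' (match) and '0' (mismatch). The function names seed_hits and multiple_seed_hits are hypothetical.

```python
from typing import Iterable, List


def seed_hits(seed: str, alignment: str) -> List[int]:
    """Return every start position at which `seed` hits `alignment`.

    Assumed encoding (illustrative, not taken from the paper):
      seed      -- '1' = match required, '*' = don't care
      alignment -- '1' = match, '0' = mismatch
    """
    # Offsets inside the seed window where a match is required.
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    return [
        start
        for start in range(len(alignment) - span + 1)
        if all(alignment[start + i] == "1" for i in required)
    ]


def multiple_seed_hits(seeds: Iterable[str], alignment: str) -> bool:
    """A multiple seed hits the alignment if at least one of its seeds does."""
    return any(seed_hits(s, alignment) for s in seeds)


# Both seeds below have weight 3 (three required matches).
alignment = "1101011"
contiguous_hits = seed_hits("111", alignment)   # no run of three matches
spaced_hits = seed_hits("11*1", alignment)      # skips the mismatch at index 2
any_hit = multiple_seed_hits(["111", "11*1"], alignment)
```

In this toy alignment the contiguous seed finds nothing, while the spaced seed of the same weight hits at position 0. Under the same assumed encoding, the sensitivity mentioned in the abstract would be the probability that such a hit occurs on a random alignment.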

Keywords

homology search, multiple spaced seeds, sensitivity, string overlaps, PatternHunterII, BLAST

References

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
  2. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped Blast and Psi-Blast: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
  3. Brejova, B., Brown, D., Vinar, T.: Optimal spaced seeds for homologous coding regions. J. Bioinf. and Comput. Biol. 1, 595–610 (2004)
  4. Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: Proc. of RECOMB 2003, pp. 67–75. ACM Press, New York (2003)
  5. Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 73–85. Springer, Heidelberg (2001)
  6. Choi, K.P., Zhang, L.: Sensitivity analysis and efficient method for identifying optimal spaced seeds. J. Comput. Sys. Sci. 68, 22–40 (2004)
  7. Choi, K.P., Zeng, F., Zhang, L.: Good Spaced Seeds for Homology Search. Bioinformatics 20, 1053–1059 (2004)
  8. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins Univ. Press, Baltimore (1996)
  9. Ilie, L., Ilie, S.: Long spaced seeds for finding similarities between biological sequences. In: Proc. of BIOCOMP 2007 (to appear)
  10. Karp, R., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Develop. 31, 249–260 (1987)
  11. Keich, U., Li, M., Ma, B., Tromp, J.: On spaced seeds for similarity search. Discrete Appl. Math. 3, 253–263 (2004)
  12. Kisman, D., Li, M., Ma, B., Wang, L.: tPatternHunter: Gapped, fast and sensitive translated homology search. Bioinformatics 21, 542–544 (2005)
  13. Kong, Y.: Generalized correlation functions and their applications in selection of optimal multiple spaced seeds for homology search. J. Comput. Biol. (to appear)
  14. Kucherov, G., Noé, L., Ponty, Y.: Estimating seed sensitivity on homogeneous alignments. In: Proc. of BIBE 2004, Taiwan, pp. 387–394 (2004)
  15. Li, M.: personal communication
  16. Li, M., Ma, B., Kisman, D., Tromp, J.: PatternHunterII: highly sensitive and fast homology search. J. Bioinformatics and Comput. Biol. 2, 417–440 (2004)
  17. Lipman, D.J., Pearson, W.R.: Rapid and sensitive protein similarity searches. Science 227, 1435–1441 (1985)
  18. Li, M., Ma, B., Zhang, L.: Superiority and complexity of spaced seeds. In: Proc. of SODA 2006. SIAM, pp. 444–453 (2006)
  19. Ma, B.: personal communication
  20. Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)
  21. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970)
  22. Ning, Z., Cox, A.J., Mullikin, J.C.: SSAHA: A fast search method for large DNA databases. Genome Res. 11, 1725–1729 (2001)
  23. Noé, L., Kucherov, G.: Yass: enhancing the sensitivity of DNA similarity search. Nucleic Acids Res. 33, 540–543 (2005)
  24. Pevzner, P., Waterman, M.S.: Multiple filtration and approximate pattern matching. Algorithmica 13, 135–154 (1995)
  25. Preparata, F.P., Zhang, L., Choi, K.P.: Quick, practical selection of effective seeds for homology search. J. Comput. Biol. 12, 137–1152 (2005)
  26. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)
  27. Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity search. In: Proc. of RECOMB 2004, pp. 76–85. ACM Press, New York (2004)
  28. Xu, J., Brown, D., Li, M., Ma, B.: Optimizing multiple spaced seeds for homology search. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 47–58. Springer, Heidelberg (2004)
  29. Yang, I.-H., Wang, S.-H., Chen, H.-H., Huang, P.-H., Chao, K.-M.: Efficient methods for generating optimal single and multiple spaced seeds. In: Proc. of IEEE 4th Symp. on Bioinformatics and Bioengineering, Taiwan, pp. 411–418. IEEE Computer Society Press, Los Alamitos (2004)
  30. Zhang, Z., Schwartz, S., Wagner, L., Miller, W.: A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000)

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Lucian Ilie (1)
  • Silvana Ilie (2)

  1. Department of Computer Science, University of Western Ontario, N6A 5B7, London, Ontario, Canada
  2. Numerical Analysis, Centre for Mathematical Sciences, Lund University, Box 118, SE-221 00 Lund, Sweden
