Multi-seed Lossless Filtration

  • Gregory Kucherov
  • Laurent Noé
  • Mikhail Roytberg
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3109)

Abstract

We study a method of seed-based lossless filtration for approximate string matching and related applications. The method is based on the simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
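To make the notion of a lossless seed family concrete, the following is a minimal sketch, not the algorithms presented in the paper. It assumes the common convention in which a spaced seed is written as a string over '#' (position must match) and '-' (don't care), and a family is called lossless for a problem (m, k) if every alignment of length m with at most k mismatches is hit by at least one seed at some position. The function names are hypothetical, and the check is a brute-force enumeration that is only practical for small m and k.

```python
from itertools import combinations


def seed_offsets(seed):
    """Positions of the '#' (must-match) characters within the seed."""
    return [i for i, c in enumerate(seed) if c == "#"]


def hits(seed, mismatches, m):
    """True if the seed fits at some position of a length-m alignment
    so that none of its '#' positions falls on a mismatch."""
    offsets = seed_offsets(seed)
    span = len(seed)
    for start in range(m - span + 1):
        if all(start + o not in mismatches for o in offsets):
            return True
    return False


def is_lossless(family, m, k):
    """Brute-force check that the family hits every length-m alignment
    with at most k mismatches (assumed definition, see lead-in).

    Checking exactly min(k, m) mismatches is enough: turning a mismatch
    into a match can never destroy an existing hit."""
    k = min(k, m)
    for mismatch_positions in combinations(range(m), k):
        mismatches = set(mismatch_positions)
        if not any(hits(seed, mismatches, m) for seed in family):
            return False
    return True


if __name__ == "__main__":
    # Neither seed alone handles length 3 with one mismatch,
    # but the two seeds together do.
    print(is_lossless(["##"], 3, 1))         # False
    print(is_lossless(["#-#"], 3, 1))        # False
    print(is_lossless(["##", "#-#"], 3, 1))  # True
```

The toy example shows the point of using several seeds at once: the contiguous seed misses the alignment with a mismatch in the middle, the spaced seed misses alignments with a mismatch at either end, and the family of both covers all three cases.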

References

  1. Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fundamenta Informaticae 56, 51–70 (2003); preliminary version in Combinatorial Pattern Matching (2001)
  2. Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings – Practical on-line search algorithms for texts and biological sequences, p. 280. Cambridge University Press, Cambridge (2002), ISBN 0-521-81307-7
  3. Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997)
  4. Ma, B., Tromp, J., Li, M.: PatternHunter: Faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)
  5. Schwartz, S., Kent, J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., Miller, W.: Human–mouse alignments with BLASTZ. Genome Research 13, 103–107 (2003)
  6. Noé, L., Kucherov, G.: YASS: Similarity search in DNA sequences. Research Report RR-4852, INRIA (2003), http://www.inria.fr/rrrt/rr-4852.html
  7. Pevzner, P., Waterman, M.: Multiple filtration and approximate pattern matching. Algorithmica 13, 135–154 (1995)
  8. Califano, A., Rigoutsos, I.: Flash: A fast look-up algorithm for string homology. In: Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology, pp. 56–64 (1993)
  9. Buhler, J.: Provably sensitive indexing strategies for biosequence similarity search. In: Proceedings of the 6th Annual International Conference on Computational Molecular Biology (RECOMB 2002), pp. 90–99. ACM Press, Washington (2002)
  10. Keich, U., Li, M., Ma, B., Tromp, J.: On spaced seeds for similarity search. Discrete Applied Mathematics (2004) (to appear)
  11. Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB 2003), pp. 67–75. ACM Press, Berlin (2003)
  12. Brejova, B., Brown, D., Vinar, T.: Vector seeds: An extension to spaced seeds allows substantial improvements in sensitivity and specificity. In: Benson, G., Page, R.D.M. (eds.) WABI 2003. LNCS (LNBI), vol. 2812, pp. 39–54. Springer, Heidelberg (2003)
  13. Kucherov, G., Noé, L., Ponty, Y.: Estimating seed sensitivity on homogeneous alignments. In: Proceedings of the IEEE 4th Symposium on Bioinformatics and Bioengineering (BIBE 2004), May 19-21, IEEE Computer Society Press, Los Alamitos (2004)
  14. Choi, K., Zhang, L.: Sensitivity analysis and efficient method for identifying optimal spaced seeds. Journal of Computer and System Sciences (2003) (to appear)
  15. Li, F., Stormo, G.: Selection of optimal DNA oligos for gene expression arrays. Bioinformatics 17, 1067–1076 (2001)
  16. Kaderali, L., Schliep, A.: Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics 18, 1340–1349 (2002)
  17. Rahmann, S.: Fast large scale oligonucleotide selection using the longest common factor approach. Journal of Bioinformatics and Computational Biology 1, 343–361 (2003)
  18. Zheng, J., Close, T., Jiang, T., Lonardi, S.: Efficient selection of unique and popular oligos for large EST databases. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 384–401. Springer, Heidelberg (2003)
  19. Burkhardt, S., Kärkkäinen, J.: One-gapped q-gram filters for Levenshtein distance. In: Apostolico, A., Takeda, M. (eds.) CPM 2002. LNCS, vol. 2373, pp. 225–234. Springer, Heidelberg (2002)
  20. Li, M., Ma, B., Kisman, D., Tromp, J.: PatternHunter II: Highly sensitive and fast homology search. Journal of Bioinformatics and Computational Biology (2004); earlier version in GIW 2003 (International Conference on Genome Informatics)
  21. Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity search. In: Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), ACM Press, New York (2004)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Gregory Kucherov (1)
  • Laurent Noé (1)
  • Mikhail Roytberg (2)

  1. INRIA/LORIA, Villers-lès-Nancy, France
  2. Institute of Mathematical Problems in Biology, Pushchino, Moscow Region, Russia