Subset Seed Automaton

  • Gregory Kucherov
  • Laurent Noé
  • Mikhail Roytberg
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4783)

Abstract

We study the pattern matching automaton introduced in [1] for the purpose of seed-based similarity search. We show that our definition provides a compact automaton, much smaller than the one obtained by applying the Aho-Corasick construction. We study properties of this automaton and present an efficient implementation of the automaton construction. We also present some experimental results and show that this automaton can be successfully applied to more general situations.

Keywords

spaced seed, subset seed, automaton, seed sensitivity

References

  1. Kucherov, G., Noé, L., Roytberg, M.: A unifying framework for seed sensitivity and its application to subset seeds. JBCB 4, 553–569 (2006)
  2. Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fundamenta Informaticae 56, 51–70 (2003)
  3. Ma, B., Tromp, J., Li, M.: PatternHunter: Faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)
  4. Brown, D., Li, M., Ma, B.: A tutorial of recent developments in the seeding of local alignment. JBCB 2, 819–842 (2004)
  5. Brown, D.: A survey of seeding for sequence alignments. In: Bioinformatics Algorithms: Techniques and Applications (to appear, 2007)
  6. Li, M., Ma, B., Kisman, D., Tromp, J.: PatternHunter II: Highly sensitive and fast homology search. Journal of Bioinformatics and Computational Biology 2, 417–439 (2004)
  7. Noé, L., Kucherov, G.: YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Research 33(web-server issue), W540–W543 (2005)
  8. Califano, A., Rigoutsos, I.: Flash: A fast look-up algorithm for string homology. In: Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology (ISMB), pp. 56–64 (1993)
  9. Tsur, D.: Optimal probing patterns for sequencing by hybridization. In: Bücher, P., Moret, B.M.E. (eds.) WABI 2006. LNCS (LNBI), vol. 4175, pp. 366–375. Springer, Heidelberg (2006)
  10. Schwartz, S., Kent, J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., Miller, W.: Human–mouse alignments with BLASTZ. Genome Research 13, 103–107 (2003)
  11. Sun, Y., Buhler, J.: Choosing the best heuristic for seeded alignment of DNA sequences. BMC Bioinformatics 7 (2006)
  12. Csürös, M., Ma, B.: Rapid homology search with two-stage extension and daughter seeds. In: Wang, L. (ed.) COCOON 2005. LNCS, vol. 3595, pp. 104–114. Springer, Heidelberg (2005)
  13. Mak, D., Gelfand, Y., Benson, G.: Indel seeds for homology search. Bioinformatics 22, e341–e349 (2006)
  14. Brejová, B., Brown, D., Vinar, T.: Vector seeds: An extension to spaced seeds. Journal of Computer and System Sciences 70, 364–380 (2005)
  15. Keich, U., Li, M., Ma, B., Tromp, J.: On spaced seeds for similarity search. Discrete Applied Mathematics 138, 253–263 (2004); preliminary version in 2002
  16. Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB), pp. 67–75 (2003)
  17. Brejová, B., Brown, D., Vinar, T.: Optimal spaced seeds for homologous coding regions. Journal of Bioinformatics and Computational Biology 1, 595–610 (2004)
  18. Cole, R., Hariharan, R., Indyk, P.: Tree pattern matching and subset matching in deterministic O(n log³ n)-time. In: Proceedings of 10th Symposium on Discrete Algorithms (SODA), pp. 245–254 (1999)
  19. Holub, J., Smyth, W.F., Wang, S.: Fast pattern-matching on indeterminate strings. Journal of Discrete Algorithms (2006)
  20. Rahman, S., Iliopoulos, C., Mouchard, L.: Pattern matching in degenerate DNA/RNA sequences. In: Proceedings of the Workshop on Algorithms and Computation (WALCOM), pp. 109–120 (2007)
  21. Noé, L., Kucherov, G.: Improved hit criteria for DNA local alignment. BMC Bioinformatics 5 (2004)
  22. Aho, A.V., Corasick, M.J.: Efficient string matching: An aid to bibliographic search. Communications of the ACM 18, 333–340 (1975)
  23. Amir, A., Porat, E., Lewenstein, M.: Approximate subset matching with don’t cares. In: Proceedings of 12th Symposium on Discrete Algorithms (SODA), pp. 305–306 (2001)

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Gregory Kucherov
    • 1
  • Laurent Noé
    • 1
  • Mikhail Roytberg
    • 2
  1. LIFL/CNRS/INRIA, Bât. M3 Cité Scientifique, 59655, Villeneuve d’Ascq cedex, France
  2. Institute of Mathematical Problems in Biology, Pushchino, Moscow Region, 142290, Russia
