Procrastination Leads to Efficient Filtration for Local Multiple Alignment

  • Aaron E. Darling
  • Todd J. Treangen
  • Louxin Zhang
  • Carla Kuiken
  • Xavier Messeguer
  • Nicole T. Perna
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4175)


We describe an efficient local multiple alignment filtration heuristic for identifying conserved regions in one or more DNA sequences. The method incorporates several novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously, (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low-multiplicity matches are encountered. The resulting local multiple alignments may have nucleotide substitutions and internal gaps as large as w characters in any occurrence of the motif. The algorithm consumes \(\mathcal{O}(wN)\) memory and \(\mathcal{O}(wN \log wN)\) time, where N is the sequence length. We score the significance of multiple alignments using entropy-based motif scoring methods. We demonstrate the performance of our filtration method on Alu-repeat-rich segments of the human genome and a large set of Hepatitis C virus genomes. The GPL implementation of our algorithm in C++ is called procrastAligner and is freely available from
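As an illustration of ideas (1) and (2), the following is a minimal Python sketch, not the procrastAligner implementation. It assumes a seed pattern written as a string of `1` (match) and `0` (don't-care) characters, and assumes that each window is indexed under the smaller of its forward and reverse-complement keys so that one lookup covers both strands. The names `seed_key` and `seed_matches` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Translation table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return s.translate(COMPLEMENT)[::-1]


def seed_key(window, pattern):
    """Strand-independent key for one window under a palindromic pattern.

    Because the pattern reads the same in both directions, sampling the
    reverse-complemented window with the same mask inspects the same
    sequence positions as sampling the forward window.
    """
    fwd = "".join(c for c, m in zip(window, pattern) if m == "1")
    rc = reverse_complement(window)
    rev = "".join(c for c, m in zip(rc, pattern) if m == "1")
    return min(fwd, rev)


def seed_matches(seq, pattern):
    """Group window start positions by key, highest multiplicity first."""
    if pattern != pattern[::-1]:
        raise ValueError("pattern must be palindromic")
    groups = defaultdict(list)
    for i in range(len(seq) - len(pattern) + 1):
        window = seq[i:i + len(pattern)]
        if set(window) <= set("ACGT"):  # skip windows with ambiguous bases
            groups[seed_key(window, pattern)].append(i)
    # A seed match needs at least two occurrences; its multiplicity is the
    # number of positions that share the key.
    matches = [positions for positions in groups.values() if len(positions) >= 2]
    matches.sort(key=len, reverse=True)
    return matches
```

With the pattern `11011`, the windows `ACGTT` and `AACGT` (reverse complements of each other) receive the same key, so an occurrence on either strand lands in the same group. Returning groups in order of decreasing multiplicity mirrors the order in which the abstract says seeds are extended; the extension and procrastination steps themselves are not sketched here.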


Neighborhood Group, Neighborhood List, Seed Match, Link Extension, Match Record
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Aaron E. Darling (1)
  • Todd J. Treangen (2)
  • Louxin Zhang (4)
  • Carla Kuiken (5)
  • Xavier Messeguer (2)
  • Nicole T. Perna (3)

  1. Department of Computer Science, University of Wisconsin, USA
  2. Department of Computer Science, Technical University of Catalonia, Barcelona, Spain
  3. Department of Animal Health and Biomedical Sciences, Genome Center, University of Wisconsin, USA
  4. Department of Mathematics, National University of Singapore, Singapore
  5. T-10 Theoretical Biology Division, Los Alamos National Laboratory, USA
