Fast algorithms for aligning sequences with restricted affine gap penalties

  • Kun-Mao Chao
Session 8: Computational Biology II
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1276)


Affine gap penalties are generally considered appropriate for aligning DNA and protein sequences. (“Affine” means that a gap of length k is penalized by α + kβ, i.e., it costs α to open a gap plus β for each symbol in the gap.) For certain applications, such as aligning a cDNA sequence with a genomic DNA sequence, it might be adequate to use restricted affine gap penalties, which penalize long gaps with a constant penalty. As it turns out, several techniques developed for solving the approximate string matching problem can be used to yield efficient algorithms for computing the optimal alignment with restricted affine gap penalties. In particular, efficient algorithms can be derived based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the cost tables. To speed up the computation, the q-gram paradigm can be used to locate the interval in the longer sequence that should be aligned with the shorter sequence. We have implemented the above methods in C on Sun workstations running SunOS Unix. Preliminary experiments show that these approaches are very promising for aligning a cDNA sequence with a genomic DNA sequence.
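
The two penalty schemes described above can be illustrated with a minimal sketch. It is not the paper's C implementation. The function names and the parameter `cap` are hypothetical, and the restricted penalty is assumed here to take the form min(α + kβ, cap), which is one way to read "long gaps receive a constant penalty"; the exact definition used in the paper may differ.

```python
def affine_gap_penalty(k: int, alpha: float, beta: float) -> float:
    """Affine penalty for a gap of k symbols: alpha to open, beta per symbol."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return alpha + k * beta


def restricted_affine_gap_penalty(
    k: int, alpha: float, beta: float, cap: float
) -> float:
    """Affine penalty that stops growing once it reaches `cap`.

    ASSUMPTION: `cap` is a hypothetical constant standing in for the
    fixed penalty charged to long gaps; it is not taken from the paper.
    """
    return min(affine_gap_penalty(k, alpha, beta), cap)
```

Under this assumed form, a short gap costs the same under both schemes, while a gap long enough to reach `cap` (for example an intron-sized gap in a cDNA-to-genomic alignment) costs the same constant regardless of its length.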

Key Words

cDNA sequences; q-gram paradigm; Dynamic programming; Genomic DNA sequences; Sequence comparison





Copyright information

© Springer-Verlag Berlin Heidelberg 1997

Authors and Affiliations

  • Kun-Mao Chao
    • 1
  1. Department of Computer Science and Information Management, Providence University, Shalu, Taichung, Taiwan
