Sequence Alignment

Part of the book series: Computational Biology (COBO, volume 11)

Abstract

Before we can discuss comparative gene finding, we go through some of the basic theory behind sequence alignment in this chapter. The chapter is divided into two parts. In the first part we describe the basic concepts of pairwise alignment, including substitution schemes and gap models, and then move on to the application of dynamic programming to global and local alignments. We finish the first part with an overview of heuristic database searches and the statistical foundation they rest upon. Extending dynamic programming to multiple alignments is complicated by the increased computational complexity. As a result, a wide variety of heuristic alignment algorithms has evolved. In the second part of the chapter we give an account of the most common of these approaches, including progressive alignments, iterative methods, hidden Markov models, genetic algorithms, simulated annealing, and alignment profiles.
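
As a rough illustration of the dynamic programming idea mentioned in the abstract, the following minimal sketch computes a global alignment score (in the style of Needleman and Wunsch) and a local alignment score (in the style of Smith and Waterman). The function names, the match/mismatch values, and the linear gap penalty are assumptions chosen for brevity. They are not the substitution schemes or gap models used in the chapter.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (linear gap penalty; values are illustrative)."""
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # Column 0: a[:i] against the empty prefix of b.
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,    # align a[i-1] with b[j-1]
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (same illustrative scoring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere, which makes it local.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Both functions keep only two rows of the table, so they use memory proportional to the shorter dimension but return a score only. Recovering the alignment itself would require storing the full table or traceback pointers.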

Corresponding author

Correspondence to Marina Axelson-Fisk.

Copyright information

© 2010 Springer-Verlag London

About this chapter

Cite this chapter

Axelson-Fisk, M. (2010). Sequence Alignment. In: Comparative Gene Finding. Computational Biology, vol 11. Springer, London. https://doi.org/10.1007/978-1-84996-104-2_3

  • DOI: https://doi.org/10.1007/978-1-84996-104-2_3

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84996-103-5

  • Online ISBN: 978-1-84996-104-2

  • eBook Packages: Computer Science, Computer Science (R0)
