Abstract
Before we can discuss comparative gene finding, this chapter reviews some of the basic theory behind sequence alignment. The chapter is divided into two parts. In the first part we describe the basic concepts of pairwise alignment, including substitution schemes and gap models, and then apply dynamic programming to global and local alignments. We close the first part with an overview of heuristic database searches and the statistical foundation they rest upon. Extending dynamic programming to multiple alignments is complicated by the increased computational complexity, and as a result a variety of heuristic alignment algorithms has evolved. In the second part of the chapter we give an account of the most common of these methods, including progressive alignments, iterative methods, hidden Markov models, genetic algorithms, simulated annealing, and alignment profiles.
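As a rough illustration of the dynamic programming idea mentioned above, the following minimal sketch computes a global (Needleman-Wunsch style) and a local (Smith-Waterman style) alignment score for two strings. The function names, the simple match/mismatch scoring, and the linear gap penalty are assumptions chosen for brevity; they are not taken from the chapter, which covers more general substitution schemes and gap models.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap penalty (illustrative scoring)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Local alignment score: same recursion, but cells are floored at zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)  # the best local score may occur anywhere
        prev = curr
    return best


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
    print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

Both functions keep only two rows of the score matrix, which is enough to obtain the score; recovering the alignment itself would require storing the full matrix or traceback pointers.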
Copyright information
© 2010 Springer-Verlag London
About this chapter
Cite this chapter
Axelson-Fisk, M. (2010). Sequence Alignment. In: Comparative Gene Finding. Computational Biology, vol 11. Springer, London. https://doi.org/10.1007/978-1-84996-104-2_3
DOI: https://doi.org/10.1007/978-1-84996-104-2_3
Publisher Name: Springer, London
Print ISBN: 978-1-84996-103-5
Online ISBN: 978-1-84996-104-2
eBook Packages: Computer Science (R0)