Sequence Alignment

  • Xuhua Xia


Sequence alignment serves two purposes. Global alignment is used mainly to identify site homology between sequences, which facilitates the inference of ancestor-descendant relationships through molecular phylogenetics. Local alignment is used mainly to identify sequence similarities that may be due to either inheritance or convergence. The dynamic programming algorithm for pairwise alignment and profile alignment is illustrated in detail for both constant and affine gap penalties, followed by progressive multiple sequence alignment using a guide tree and by the alignment of protein-coding nucleotide sequences against aligned amino acid sequences. The derivation of PAM and BLOSUM matrices is illustrated numerically in detail, and the effect of purifying selection on substitution patterns is discussed. Also discussed are the pros and cons of representing internal sequences by a profile or by a reconstructed sequence in multiple sequence alignment.
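
As a rough illustration of the dynamic programming idea mentioned above, the following minimal sketch computes a global pairwise alignment score in the style of Needleman and Wunsch. It is not taken from the chapter. The function name `global_alignment_score` and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for illustration, and "constant gap penalty" is taken here to mean the same penalty for every gap position.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of s and t under a constant
    per-position gap penalty (illustrative scoring values)."""
    n, m = len(s), len(t)
    # score[i][j] holds the best score for aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # Aligning a prefix against an empty string costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal move: align s[i-1] with t[j-1].
            pair = match if s[i - 1] == t[j - 1] else mismatch
            diag = score[i - 1][j - 1] + pair
            # Vertical move: s[i-1] is aligned with a gap in t.
            up = score[i - 1][j] + gap
            # Horizontal move: t[j-1] is aligned with a gap in s.
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)

    return score[n][m]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The sketch returns only the score. Recovering the alignment itself would require a traceback through the matrix, and an affine gap penalty would need separate gap-open and gap-extend terms (for example, with additional matrices as in Gotoh's method), neither of which is shown here.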



Copyright information

© Springer Science+Business Media LLC 2018

Authors and Affiliations

  • Xuhua Xia
    • University of Ottawa, CAREG and Biology Department, Ottawa, Canada
