Evolutionary Genomics pp 203-235

Part of the Methods in Molecular Biology book series (MIMB, volume 855)

Alignment Methods: Strategies, Challenges, Benchmarking, and Comparative Overview

Protocol

Abstract

Comparative evolutionary analyses of molecular sequences are based solely on the identities and differences detected between homologous characters. Errors in this homology statement, that is, errors in the alignment of the sequences, are likely to lead to errors in the downstream analyses. Sequence alignment and phylogenetic inference are tightly connected, and many popular alignment programs use the phylogeny to divide the alignment problem into smaller tasks. They then neglect the phylogenetic tree, however, and produce alignments that are not evolutionarily meaningful. The use of phylogeny-aware methods reduces the error, but the resulting alignments, with an evolutionarily correct representation of homology, can challenge existing practices and methods for viewing and visualising the sequences. The inter-dependency of alignment and phylogeny can be resolved by joint estimation of the two; methods based on statistical models allow the alignment parameters to be inferred from the data and correctly take into account the uncertainty of the solution, but they remain computationally challenging. Widely used alignment methods are based on heuristic algorithms and are unlikely to find globally optimal solutions. The whole concept of one correct alignment for the sequences is questionable, however, as there typically exist vast numbers of alternative, roughly equally good alignments that should also be considered. This uncertainty is hidden by many popular alignment programs and is rarely taken into account correctly in the downstream analyses. The quest to find and improve the alignment solution is complicated by the lack of suitable measures of alignment goodness. The difficulty of comparing alternative solutions also affects benchmarks of alignment methods, whose results depend strongly on the measure used. As the effects of alignment error cannot be predicted, comparing the alignments’ performance in downstream analyses is recommended.
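The point about alternative, equally good alignments can be illustrated with a small sketch. The code below is not taken from the chapter: it is a minimal pairwise global alignment by dynamic programming that also counts how many alignments reach the best score. The function name and the scoring values (match 1, mismatch -1, linear gap cost -1) are assumptions chosen for illustration, and alignments that differ only in the order of an adjacent insertion and deletion are counted separately.

```python
def count_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, number_of_co_optimal_alignments) for a global
    pairwise alignment of strings a and b under a linear gap cost.

    The scoring values are illustrative assumptions, not recommendations.
    """
    n, m = len(a), len(b)
    # score[i][j]: best score for aligning a[:i] with b[:j]
    # count[i][j]: number of distinct alignments achieving that score
    score = [[0] * (m + 1) for _ in range(n + 1)]
    count = [[0] * (m + 1) for _ in range(n + 1)]
    count[0][0] = 1
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
        count[i][0] = 1
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
        count[0][j] = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, count[i - 1][j - 1]),  # (mis)match
                (score[i - 1][j] + gap, count[i - 1][j]),          # gap in b
                (score[i][j - 1] + gap, count[i][j - 1]),          # gap in a
            ]
            best = max(s for s, _ in candidates)
            score[i][j] = best
            # Sum the counts of every move that ties for the best score.
            count[i][j] = sum(c for s, c in candidates if s == best)
    return score[n][m], count[n][m]


if __name__ == "__main__":
    # Even a single-residue difference in length leaves several
    # equally scoring placements of the gap.
    print(count_optimal_alignments("AAA", "A"))
```

Under these assumed scores, aligning "AAA" with "A" has three equally good solutions, one for each position the single "A" can match. The count grows quickly for longer, repetitive sequences, which is one way to see why a single reported alignment hides uncertainty.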

Key words

Character homology; Evolutionary sequence alignment; Dynamic programming; Insertions and deletions; Alignment correctness

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. European Bioinformatics Institute (EMBL), Hinxton, UK
  2. Institute of Biotechnology, University of Helsinki, Helsinki, Finland
