Multiple Alignment of DNA Sequences with MAFFT

  • Kazutaka Katoh
  • George Asimenos
  • Hiroyuki Toh
Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 537)

Abstract

Multiple alignment of DNA sequences is an important step in various molecular biological analyses. As large amounts of sequence data become available through genome and other large-scale sequencing projects, a multiple sequence alignment (MSA) program must now be scalable as well as accurate. In this chapter, we outline the algorithms of the MSA program MAFFT and provide practical advice for several typical situations that biologists face. For genome alignment, which is beyond the scope of MAFFT, we introduce two tools: TBA and MAUVE.

Key words

Multiple sequence alignment, progressive method, iterative refinement method, consistency objective function, genome comparison

Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Kazutaka Katoh (1)
  • George Asimenos (2)
  • Hiroyuki Toh (3)

  1. Digital Medicine Initiative, Kyushu University, Fukuoka, Japan
  2. Department of Computer Science, Stanford University, Stanford, USA
  3. Medical Institute of Bioregulation, Kyushu University, Fukuoka, Japan
