
An Introduction to Multiple Sequence Alignment — and the T-Coffee Shop. Beyond Just Aligning Sequences: How Good can you Make your Alignment, and so What?

  • Steven M. Thompson

Abstract

I begin the chapter with the fundamental principles of multiple-sequence alignment: pair-wise dynamic programming first, then significance and similarity statistics, then amino acid scoring matrices, and finally the multiple-sequence alignment algorithms themselves. Next I discuss the reliability issues, complications, and applications of multiple-sequence alignment. The chapter concludes with a description of the T-Coffee multiple-sequence alignment package and a tutorial on using it.

Keywords

Multiple sequence alignment, Dynamic programming, Significance, Reliability, T-Coffee, M-Coffee, 3DCoffee

Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. School of Computational Science, Florida State University, Tallahassee, USA
