Multiple Sequence Alignment for Large Heterogeneous Datasets Using SATé, PASTA, and UPP

Multiple Sequence Alignment

Part of the book series: Methods in Molecular Biology (MIMB, volume 2231)

Abstract

The estimation of very large multiple sequence alignments is a challenging problem that requires special techniques in order to achieve high accuracy. Here we describe two software packages—PASTA and UPP—for constructing alignments on large and ultra-large datasets. Both methods have been able to produce highly accurate alignments on 1,000,000 sequences, and trees computed on these alignments are also highly accurate. PASTA provides the best tree accuracy when the input sequences are all full-length, but UPP provides improved accuracy compared to PASTA and other methods when the input contains a large number of fragmentary sequences. Both methods are available in open source form on GitHub.

Notes

  1. This chapter is an update of [9], a previous article for Methods in Molecular Biology, which focused on using SATé [1, 2] for co-estimation of alignments and trees.

References

  1. Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T (2009) Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324(5934):1561–1564

  2. Liu K, Warnow T, Holder MT, Nelesen SM, Yu J, Stamatakis AP, Linder CR (2012) SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol 61(1):90–106

  3. Mirarab S, Nguyen N, Warnow T (2014) PASTA: ultra-large multiple sequence alignment. In: International conference on research in computational molecular biology. Springer, Berlin, pp 177–191

  4. Mirarab S, Nguyen N, Wang L-S, Guo S, Kim J, Warnow T (2015) PASTA: ultra-large multiple sequence alignment of nucleotide and amino acid sequences. J Comput Biol 22:377–386

  5. Nguyen N, Mirarab S, Kumar K, Warnow T (2015) Ultra-large alignments using phylogeny aware profiles. Genome Biol 16:124. A preliminary version appeared in the Proceedings RECOMB 2015

  6. Mirarab S, Nguyen N, Warnow T (2012) SEPP: SATé-enabled phylogenetic placement. In: Pacific symposium on biocomputing, pp 247–58

  7. Nguyen N, Mirarab S, Liu B, Pop M, Warnow T (2014) TIPP: taxonomic identification and phylogenetic profiling. Bioinformatics 30(24):3548–3555

  8. Nguyen N, Nute M, Mirarab S, Warnow T (2016) HIPPI: highly accurate protein family classification with ensembles of hidden Markov models. BMC Bioinformatics 17(Suppl 10):765

  9. Liu K, Warnow T (2014) Large-scale multiple sequence alignment and tree estimation using SATé. In: Multiple sequence alignment methods. Springer, Berlin, pp 219–244

  10. Mirarab S (2019) Github site for PASTA software. https://github.com/smirarab/pasta. Accessed 13 July 2019

  11. Mirarab S (2019) Github site for Ensemble of HMM methods (SEPP, TIPP, UPP) software. https://github.com/smirarab/sepp. Accessed 13 July 2019

  12. Price MN, Dehal PS, Arkin AP (2010) FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 5(3):e9490. https://doi.org/10.1371/journal.pone.0009490

  13. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Soding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539

  14. Katoh K, Toh H (2008) Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinf 9(4):286–298

  15. Wheeler T, Kececioglu J (2007) Multiple alignment by aligning alignments. In: Proceedings of the 15th ISCB conference on intelligent systems for molecular biology, pp 559–568

  16. Löytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Nat Acad Sci 102:10557–10562

  17. Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113

  18. Stamatakis A (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690

  19. Balaban M, Moshiri N, Mai U, Mirarab S (2019) TreeCluster: clustering biological sequences using phylogenetic trees. bioRxiv, https://doi.org/10.1101/591388

  20. Suchard MA, Redelings BD (2006) BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22:2047–2048

  21. Redelings BD, Suchard MA (2007) Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol Biol 7:40

  22. Nute M, Warnow T (2016) Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17(10):764

  23. Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320(5883):1632–1635

  24. Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, Dessimoz C (2015) Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Syst Biol 64(5):778–791

  25. Collins K PASTA for proteins github site. https://github.com/kodicollins/pasta-databases

  26. Nute M (2019) Github site for PASTA+BAli-Phy. https://github.com/mgnute/pasta. Accessed 18 July 2019

  27. Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis. Cambridge University Press, Cambridge

  28. Warnow T (2018) Computational phylogenetics: an introduction to designing methods for phylogeny estimation. Cambridge University Press, Cambridge

  29. Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform 23:205–211

  30. Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 39:W29–W37

  31. Novák Á, Miklós I, Lyngsoe R, Hein J (2008) StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24:2403–2404

  32. Huelsenbeck J, Ronquist F (2001) MrBayes: Bayesian inference of phylogeny. Bioinformatics 17:754–755

  33. Drummond A, Rambaut A (2007) BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7:214

  34. Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H, Xie D, Suchard MA, Rambaut A, Drummond AJ (2014) BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10(4):e1003537

  35. Lefort V, Desper R, Gascuel O (2015) FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Mol Biol Evol 32(10):2798–2800

  36. Goloboff P, Farris J, Nixon K (2008) TNT, a free program for phylogenetic analysis. Cladistics 24:1–13

  37. Swofford DL (1996) PAUP*: Phylogenetic analysis using parsimony (and other methods), Version 4.0. Sinauer Associates, Sunderland

  38. Naser-Khdour S, Minh BQ, Zhang W, Stone E, Lanfear R (2019) The prevalence and impact of model violations in phylogenetics. bioRxiv, https://doi.org/10.1101/460121

  39. Crotty SM, Minh BQ, Bean NG, Holland BR, Tuke J, Jermiin LS, Haeseler Av (2019) GHOST: recovering historical signal from heterotachously-evolved sequence alignments. bioRxiv, https://doi.org/10.1101/174789

  40. Jermiin LS, Catullo RA, Holland BR (2018) A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. bioRxiv, https://doi.org/10.1101/400648

  41. Nelesen S, Liu K, Wang L-S, Linder CR, Warnow T (2012) DACTAL: divide-and-conquer trees (almost) without alignments. Bioinformatics 28:i274–i282

  42. Zhang Q, Rao S, Warnow T (2019) Constrained incremental tree building: new absolute fast converging phylogeny estimation methods with improved scalability and accuracy. Algorithms Mol Biol 14(1):2

  43. Le T, Sy A, Molloy EK, Zhang QR, Rao S, Warnow T (2019) Using INC within divide-and-conquer phylogeny estimation. In: International conference on algorithms for computational biology. Springer, Berlin, pp 167–178

  44. Molloy EK, Warnow T (2018) NJMerge: a generic technique for scaling phylogeny estimation methods and its application to species trees. In: RECOMB International conference on comparative genomics. Springer, Berlin, pp 260–276

  45. Molloy EK, Warnow T (2019) TreeMerge: a new method for improving the scalability of species tree estimation methods. Bioinformatics 35(14):i417–i426

  46. Sayyari E, Whitfield JB, Mirarab S (2017) Fragmentary gene sequences negatively impact gene tree and species tree reconstruction. Mol Biol Evol 34(12):3279–3291

  47. Jarvis E, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho S, Faircloth BC, Nabholz B, Howard JT, Suh A, Weber CC, daFonseca RR, Li J, Zhang F, Li H, Zhou L, Narula N, Liu L, Ganapathy G, Boussau B, Bayzid MS, Zavidovych V, Subramanian S, Gabaldón T, Capella-Gutiérrez S, Huerta-Cepas J, Rekepalli B, Munch K, Schierup M, Lindow B, Warren WC, Ray D, Green RE, Bruford MW, Zhan X, Dixon A, Li S, Li N, Huang Y, Derryberry EP, Bertelsen MF, Sheldon FH, Brumfield RT, Mello CV, Lovell PV, Wirthlin M, Schneider MPC, Prosdocimi F, Samaniego JA, Velazquez AMV, Alfaro-Núnez A, Campos PF, Petersen B, Sicheritz-Ponten T, Pas A, Bailey T, Scofield P, Bunce M, Lambert DM, Zhou Q, Perelman P, Driskell AC, Shapiro B, Xiong Z, Zeng Y, Liu S, Li Z, Liu B, Wu K, Xiao J, Yinqi X, Zheng Q, Zhang Y, Yang H, Wang J, Smeds L, Rheindt FE, Braun M, Fjeldsa J, Orlando L, Barker FK, Jonsson KA, Johnson W, Koepfli K-P, O’Brien S, Haussler D, Ryder OA, Rahbek C, Willerslev E, Graves GR, Glenn TC, McCormack J, Burt D, Ellegren H, Alstrom P, Edwards SV, Stamatakis A, Mindell DP, Cracraft J, Braun EL, Warnow T, Jun W, Gilbert MTP, Zhang G (2014) Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346(6215):1320–1331

  48. Do CB, Gross SS, Batzoglou S (2006) CONTRAlign: discriminative training for protein sequence alignment. In: Proceedings of the tenth annual international conference on computational molecular biology (RECOMB 2006). Springer, Berlin, pp 160–174

  49. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 15(2):330–340

  50. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2006) ProbCons: probabilistic consistency-based multiple sequence alignment of amino acid sequences. Software available at http://probcons.stanford.edu/download.html

  51. Liu K, Linder C, Warnow T (2011) RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE 6(11):e27731

  52. Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21(9):2104–2105

  53. Posada D, Crandall K (1998) Modeltest: testing the model of DNA substitution. Bioinformatics 14(9):817–818

  54. Hoff M, Orf S, Riehm B, Darriba D, Stamatakis A (2016) Does the choice of nucleotide substitution models matter topologically? BMC Bioinformatics 17:143

  55. Tavaré S (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. In: Lectures on mathematics in the life sciences, vol 17. American Mathematical Society, Providence, pp 57–86

  56. Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed) Mammalian protein metabolism. Academic, New York, pp 21–132

  57. Nute M, Warnow T (2016) Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17:764. Special issue for RECOMB-CG 2016. https://doi.org/10.1186/s12864-016-3101-8

  58. Nute M, Saleh E, Warnow T (2018) Evaluating statistical multiple sequence alignment in comparison to other alignment methods on protein data sets. Syst Biol 68(3):396–411

  59. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL (2002) The Pfam protein families database. Nucleic Acids Res 30:276–280

  60. Mai U, Mirarab S (2018) TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics 19(S5):272

Acknowledgements

This paper was supported by NSF grant ABI-1458652 to TW and NSF grant IIS-1845967 to SM.

Corresponding author

Correspondence to Tandy Warnow.

Copyright information

© 2021 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Warnow, T., Mirarab, S. (2021). Multiple Sequence Alignment for Large Heterogeneous Datasets Using SATé, PASTA, and UPP. In: Katoh, K. (eds) Multiple Sequence Alignment. Methods in Molecular Biology, vol 2231. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1036-7_7

  • DOI: https://doi.org/10.1007/978-1-0716-1036-7_7

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-1035-0

  • Online ISBN: 978-1-0716-1036-7

  • eBook Packages: Springer Protocols
