Multiple Sequence Alignment for Large Heterogeneous Datasets Using SATé, PASTA, and UPP

Warnow, Tandy; Mirarab, Siavash

doi:10.1007/978-1-0716-1036-7_7

Part of the book series: Methods in Molecular Biology (MIMB, volume 2231)

Abstract

The estimation of very large multiple sequence alignments is a challenging problem that requires special techniques in order to achieve high accuracy. Here we describe two software packages—PASTA and UPP—for constructing alignments on large and ultra-large datasets. Both methods have been able to produce highly accurate alignments on 1,000,000 sequences, and trees computed on these alignments are also highly accurate. PASTA provides the best tree accuracy when the input sequences are all full-length, but UPP provides improved accuracy compared to PASTA and other methods when the input contains a large number of fragmentary sequences. Both methods are available in open source form on GitHub.

Notes

1. This chapter is an update of [9], a previous article for Methods in Molecular Biology, which focused on using SATé [1, 2] for co-estimation of alignments and trees.

References

Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T (2009) Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324(5934):1561–1564
Liu K, Warnow T, Holder MT, Nelesen SM, Yu J, Stamatakis AP, Linder CR (2012) SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol 61(1):90–106
Mirarab S, Nguyen N, Warnow T (2014) PASTA: ultra-large multiple sequence alignment. In: International conference on research in computational molecular biology. Springer, Berlin, pp 177–191
Mirarab S, Nguyen N, Wang L-S, Guo S, Kim J, Warnow T (2015) PASTA: ultra-large multiple sequence alignment of nucleotide and amino acid sequences. J Comput Biol 22:377–386
Nguyen N, Mirarab S, Kumar K, Warnow T (2015) Ultra-large alignments using phylogeny aware profiles. Genome Biol 16:124. A preliminary version appeared in the Proceedings RECOMB 2015
Mirarab S, Nguyen N, Warnow T (2012) SEPP: SATé-enabled phylogenetic placement. In: Pacific symposium on biocomputing, pp 247–258
Nguyen N, Mirarab S, Liu B, Pop M, Warnow T (2014) TIPP: taxonomic identification and phylogenetic profiling. Bioinformatics 30(24):3548–3555
Nguyen N, Nute M, Mirarab S, Warnow T (2016) HIPPI: highly accurate protein family classification with ensembles of hidden Markov models. BMC Bioinformatics 17(Suppl 10):765
Liu K, Warnow T (2014) Large-scale multiple sequence alignment and tree estimation using SATé. In: Multiple sequence alignment methods. Springer, Berlin, pp 219–244
Mirarab S (2019) GitHub site for PASTA software. https://github.com/smirarab/pasta. Accessed 13 July 2019
Mirarab S (2019) GitHub site for Ensemble of HMM methods (SEPP, TIPP, UPP) software. https://github.com/smirarab/sepp. Accessed 13 July 2019
Price MN, Dehal PS, Arkin AP (2010) FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 5(3):e9490. https://doi.org/10.1371/journal.pone.0009490
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Soding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539
Katoh K, Toh H (2008) Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinf 9(4):286–298
Wheeler T, Kececioglu J (2007) Multiple alignment by aligning alignments. In: Proceedings of the 15th ISCB conference on intelligent systems for molecular biology, pp 559–568
Löytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Nat Acad Sci 102:10557–10562
Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113
Stamatakis A (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690
Balaban M, Moshiri N, Mai U, Mirarab S (2019) TreeCluster: clustering biological sequences using phylogenetic trees. bioRxiv, https://doi.org/10.1101/591388
Suchard MA, Redelings BD (2006) BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22:2047–2048
Redelings BD, Suchard MA (2007) Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol Biol 7:40
Nute M, Warnow T (2016) Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17(10):764
Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320(5883):1632–1635
Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, Dessimoz C (2015) Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Syst Biol 64(5):778–791
Collins K PASTA for proteins GitHub site. https://github.com/kodicollins/pasta-databases
Nute M (2019) GitHub site for PASTA+BAli-Phy. https://github.com/mgnute/pasta. Accessed 18 July 2019
Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis. Cambridge University Press, Cambridge
Warnow T (2018) Computational phylogenetics: an introduction to designing methods for phylogeny estimation. Cambridge University Press, Cambridge
Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform 23:205–211
Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 39:W29–W37
Novák Á, Miklós I, Lyngsoe R, Hein J (2008) StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24:2403–2404
Huelsenbeck J, Ronquist F (2001) MrBayes: Bayesian inference of phylogeny. Bioinformatics 17:754–755
Drummond A, Rambaut A (2007) BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7:214
Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H, Xie D, Suchard MA, Rambaut A, Drummond AJ (2014) BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10(4):e1003537
Lefort V, Desper R, Gascuel O (2015) FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Mol Biol Evol 32(10):2798–2800
Goloboff P, Farris J, Nixon K (2008) TNT, a free program for phylogenetic analysis. Cladistics 24:1–13
Swofford DL (1996) PAUP*: Phylogenetic analysis using parsimony (and other methods), Version 4.0. Sinauer Associates, Sunderland
Naser-Khdour S, Minh BQ, Zhang W, Stone E, Lanfear R (2019) The prevalence and impact of model violations in phylogenetics. bioRxiv, https://doi.org/10.1101/460121
Crotty SM, Minh BQ, Bean NG, Holland BR, Tuke J, Jermiin LS, von Haeseler A (2019) GHOST: recovering historical signal from heterotachously-evolved sequence alignments. bioRxiv, https://doi.org/10.1101/174789
Jermiin LS, Catullo RA, Holland BR (2018) A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. bioRxiv, https://doi.org/10.1101/400648
Nelesen S, Liu K, Wang L-S, Linder CR, Warnow T (2012) DACTAL: divide-and-conquer trees (almost) without alignments. Bioinformatics 28:i274–i282
Zhang Q, Rao S, Warnow T (2019) Constrained incremental tree building: new absolute fast converging phylogeny estimation methods with improved scalability and accuracy. Algorithms Mol Biol 14(1):2
Le T, Sy A, Molloy EK, Zhang QR, Rao S, Warnow T (2019) Using INC within divide-and-conquer phylogeny estimation. In: International conference on algorithms for computational biology. Springer, Berlin, pp 167–178
Molloy EK, Warnow T (2018) NJMerge: a generic technique for scaling phylogeny estimation methods and its application to species trees. In: RECOMB International conference on comparative genomics. Springer, Berlin, pp 260–276
Molloy EK, Warnow T (2019) TreeMerge: a new method for improving the scalability of species tree estimation methods. Bioinformatics 35(14):i417–i426
Sayyari E, Whitfield JB, Mirarab S (2017) Fragmentary gene sequences negatively impact gene tree and species tree reconstruction. Mol Biol Evol 34(12):3279–3291
Jarvis E, Mirarab S, Aberer AJ, Li B, Houde P, Li C, Ho S, Faircloth BC, Nabholz B, Howard JT, Suh A, Weber CC, daFonseca RR, Li J, Zhang F, Li H, Zhou L, Narula N, Liu L, Ganapathy G, Boussau B, Bayzid MS, Zavidovych V, Subramanian S, Gabaldón T, Capella-Gutiérrez S, Huerta-Cepas J, Rekepalli B, Munch K, Schierup M, Lindow B, Warren WC, Ray D, Green RE, Bruford MW, Zhan X, Dixon A, Li S, Li N, Huang Y, Derryberry EP, Bertelsen MF, Sheldon FH, Brumfield RT, Mello CV, Lovell PV, Wirthlin M, Schneider MPC, Prosdocimi F, Samaniego JA, Velazquez AMV, Alfaro-Núnez A, Campos PF, Petersen B, Sicheritz-Ponten T, Pas A, Bailey T, Scofield P, Bunce M, Lambert DM, Zhou Q, Perelman P, Driskell AC, Shapiro B, Xiong Z, Zeng Y, Liu S, Li Z, Liu B, Wu K, Xiao J, Yinqi X, Zheng Q, Zhang Y, Yang H, Wang J, Smeds L, Rheindt FE, Braun M, Fjeldsa J, Orlando L, Barker FK, Jonsson KA, Johnson W, Koepfli K-P, O’Brien S, Haussler D, Ryder OA, Rahbek C, Willerslev E, Graves GR, Glenn TC, McCormack J, Burt D, Ellegren H, Alstrom P, Edwards SV, Stamatakis A, Mindell DP, Cracraft J, Braun EL, Warnow T, Jun W, Gilbert MTP, Zhang G (2014) Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346(6215):1320–1331
Do CB, Gross SS, Batzoglou S (2006) CONTRAlign: discriminative training for protein sequence alignment. In: Proceedings of the tenth annual international conference on computational molecular biology (RECOMB 2006). Springer, Berlin, pp 160–174
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 15(2):330–340
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2006) ProbCons: probabilistic consistency-based multiple sequence alignment of amino acid sequences. Software available at http://probcons.stanford.edu/download.html
Liu K, Linder C, Warnow T (2012) RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE 6(11):e27731
Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21(9):2104–2105
Posada D, Crandall K (1998) Modeltest: testing the model of DNA substitution. Bioinformatics 14(9):817–818
Hoff M, Orf S, Riehm B, Darriba D, Stamatakis A (2016) Does the choice of nucleotide substitution models matter topologically? BMC Bioinformatics 17:143
Tavaré S (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. In: Lectures on mathematics in the life sciences, vol 17. American Mathematical Society, Providence, pp 57–86
Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed) Mammalian protein metabolism. Academic, New York, pp 21–132
Nute M, Warnow T (2016) Scaling statistical multiple sequence alignment to large datasets. BMC Genomics 17:764. Special issue for RECOMB-CG 2016. https://doi.org/10.1186/s12864-016-3101-8
Nute M, Saleh E, Warnow T (2018) Evaluating statistical multiple sequence alignment in comparison to other alignment methods on protein data sets. Syst Biol 68(3):396–411
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL (2002) The Pfam protein families database. Nucleic Acids Res 30:276–280
Mai U, Mirarab S (2018) TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics 19(S5):272

Acknowledgements

This paper was supported by NSF grant ABI-1458652 to TW and NSF grant IIS-1845967 to SM.

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Tandy Warnow
Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA, USA
Siavash Mirarab

Corresponding author

Correspondence to Tandy Warnow.

Editor information

Editors and Affiliations

Research Institute for Microbial Disease, Osaka University, Osaka, Japan
Kazutaka Katoh

About this protocol

Cite this protocol

Warnow, T., Mirarab, S. (2021). Multiple Sequence Alignment for Large Heterogeneous Datasets Using SATé, PASTA, and UPP. In: Katoh, K. (eds) Multiple Sequence Alignment. Methods in Molecular Biology, vol 2231. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1036-7_7

DOI: https://doi.org/10.1007/978-1-0716-1036-7_7
Published: 09 December 2020
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-1035-0
Online ISBN: 978-1-0716-1036-7
eBook Packages: Springer Protocols