Abstract
Phylogenomics is the inference of phylogenetic trees based on multiple marker genes sampled from the genomes of interest. An important challenge in phylogenomics is the potential incongruence among the evolutionary histories of individual genes, which can be widespread in microorganisms due to the prevalence of horizontal gene transfer. This protocol describes procedures for building a phylogenetic tree of a large number of microbial genomes using a broad sampling of marker genes that are representative of whole-genome evolution. The protocol highlights the use of a gene tree summary method, which can effectively reconstruct the species tree while accounting for the topological conflicts among individual gene trees. The pipeline described in this protocol scales to tens of thousands of genomes while retaining high accuracy. We discuss multiple software tools, libraries, and scripts that enable convenient adoption of the protocol. The protocol is suitable for microbiology and microbiome studies based on public genomes and metagenomic data.
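To make the idea of summarizing conflicting gene trees concrete, the following is a minimal illustrative sketch, not the pipeline of this protocol. It assumes each gene tree is given as a list of bipartitions (one side of each internal branch, as a set of taxon names) and takes a majority vote on how a single quartet of taxa is resolved. The function names `quartet_topology` and `dominant_topology` and the toy data are hypothetical; practical summary methods consider all quartets jointly and search for the species tree that best agrees with them.

```python
from collections import Counter


def quartet_topology(splits, quartet):
    """Return how a gene tree resolves four taxa, or None if unresolved.

    `splits` is a list of frozensets, each holding the taxa on one side of
    an internal branch (an assumed, simplified tree representation).
    """
    quartet = frozenset(quartet)
    for side in splits:
        inside = quartet & side
        # A branch resolves the quartet when it separates it two-versus-two.
        if len(inside) == 2:
            return frozenset({inside, quartet - inside})
    return None


def dominant_topology(gene_trees, quartet):
    """Majority vote over gene trees for the resolution of one quartet."""
    votes = Counter()
    for splits in gene_trees:
        topology = quartet_topology(splits, quartet)
        if topology is not None:
            votes[topology] += 1
    return votes.most_common(1)[0] if votes else (None, 0)


# Toy data: two gene trees support AB|CD, while one conflicting gene tree
# (for example, after a horizontal transfer) supports AC|BD.
gene_trees = [
    [frozenset("AB")],
    [frozenset("AB")],
    [frozenset("AC")],
]

best, count = dominant_topology(gene_trees, "ABCD")
print(sorted("".join(sorted(pair)) for pair in best), count)
```

In this toy case the conflicting gene tree is outvoted and the quartet is resolved as AB|CD, which is the intuition behind tolerating topological conflict among gene trees rather than forcing all genes onto one history.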
Copyright information
© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Zhu, Q., Mirarab, S. (2022). Assembling a Reference Phylogenomic Tree of Bacteria and Archaea by Summarizing Many Gene Phylogenies. In: Luo, H. (eds) Environmental Microbial Evolution. Methods in Molecular Biology, vol 2569. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2691-7_7
DOI: https://doi.org/10.1007/978-1-0716-2691-7_7
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2690-0
Online ISBN: 978-1-0716-2691-7
eBook Packages: Springer Protocols