Abstract
Discordances between species trees and gene trees can complicate phylogenetic reconstruction. ASTRAL is a leading method for inferring species trees from gene trees while accounting for incomplete lineage sorting. It finds the tree that shares the maximum number of quartets with the input trees, drawing bipartitions from a predefined set of bipartitions X. In this paper, we introduce ASTRAL-III, which substantially improves on the running time of ASTRAL-II by handling polytomies more efficiently, exploiting similarities between gene trees, and trimming unnecessary parts of the search space. The asymptotic running time in the presence of polytomies is reduced from \(O(n^3k|X|^{1.726})\) for n species and k genes to \(O(D|X|^{1.726})\), where \(D=O(nk)\) is the sum of degrees of all unique nodes in the input trees. ASTRAL-III enables us to test whether contracting low support branches in gene trees improves accuracy by reducing noise. In extensive simulations and on real data, we show that removing branches with very low support improves accuracy, while overly aggressive filtering is harmful.
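To make the optimization criterion concrete, the following is a minimal brute-force sketch of the quartet score of a candidate tree against a set of gene trees. It is an illustration only and not the ASTRAL algorithm, which uses dynamic programming over the bipartition set X. The nested-tuple tree representation and the function names are assumptions made for this example, and all trees are assumed to have the same leaf set.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clusters(tree):
    """Collect the leaf set below every internal node (one side of a bipartition)."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clusters(child))
    return found

def quartet_topology(tree_clusters, quartet):
    """Return the split induced on four taxa, or None if the quartet is unresolved."""
    a, b, c, d = quartet
    for left, right in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        for cluster in tree_clusters:
            left_in = [x in cluster for x in left]
            right_in = [x in cluster for x in right]
            # A cluster separating one pair from the other resolves the quartet.
            if (all(left_in) and not any(right_in)) or (all(right_in) and not any(left_in)):
                return frozenset([frozenset(left), frozenset(right)])
    return None

def quartet_score(candidate, gene_trees):
    """Count quartet trees shared between the candidate and each gene tree.

    Assumes every gene tree has exactly the same leaf set as the candidate.
    """
    taxa = sorted(leaves(candidate))
    candidate_clusters = clusters(candidate)
    gene_clusters = [clusters(g) for g in gene_trees]
    score = 0
    for quartet in combinations(taxa, 4):
        topology = quartet_topology(candidate_clusters, quartet)
        if topology is None:
            continue
        score += sum(1 for g in gene_clusters if quartet_topology(g, quartet) == topology)
    return score

# Hypothetical five-taxon example.
species = ((("A", "B"), "C"), ("D", "E"))
genes = [((("A", "B"), "C"), ("D", "E")), ((("A", "C"), "B"), ("D", "E"))]
print(quartet_score(species, genes))  # 5 shared with the first gene tree, 3 with the second: 8
```

Enumerating all quartets takes time proportional to \(n^4\) per gene tree, so this sketch is only usable on very small inputs; it is meant to clarify what is being maximized, not how.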
Acknowledgments
This work was supported by NSF grant IIS-1565862 to SM and ES. Computations were performed at the San Diego Supercomputer Center (SDSC) through XSEDE allocations, which are supported by NSF grant ACI-1053575.
Appendices
A Derivations
Derivation of Equation 6 : First note that:
Now, we note that:
Replacing this (and similar calculations for other terms) in Eq. 10 directly gives us Eq. 6:
Derivation of the Upper Bound \(\varvec{U(Z)}\): In ASTRAL, V(Z) denotes the total contribution to the support of the best rooted tree \(T_Z\) on taxon set Z, where each quartet tree in the set of input gene trees contributes 0 if it conflicts with \(T_Z\) or intersects it in only one leaf, and otherwise contributes 1 or 2, depending on the number of nodes in \(T_Z\) it maps to. Let U(Z) be the sum of the maximum possible support of each quartet tree in the gene trees with respect to any resolution \(T_Z\) of set Z, allowing the resolution to change for each gene tree. In other words, let Q(Z) be the set of quartets that would be resolved one way or another in any resolution of Z, and note that these are the quartets that include two or more leaves in Z; then, U(Z) is the number of resolved gene tree quartets that would match some resolution of Z and are included in Q(Z). More formally,
where
Clearly, \(V(Z)\le U(Z)\) (equality can be achieved only if all gene trees are compatible with some resolution of Z). Then, letting \(d=|M|\) and defining \(z_i=|Z \cap M_i|\) and \(l_i=|L \cap M_i|=|M_i|\), we have
Notice that based on Eq. 4,
(going from the fourth term to the fifth is accomplished by changing the order of sums). Therefore,
B Simulations and Commands
Simulation Setup
S100: To generate the gene trees and species trees using SimPhy, we used this command:
Large-\(\varvec{n}\) Simulated Dataset: To compare the running times of ASTRAL-II and ASTRAL-III, we created another dataset with very large numbers of species using SimPhy under the MSCM. Since we are only comparing running times, we use only true gene trees to infer the ASTRAL species trees. We have three sub-datasets with 5000, 2000, and 1000 species (plus one outgroup). Each sub-dataset has 4 replicates, and each replicate has a different species tree with 500 gene trees. Species trees are generated based on the birth-death process with birth and death rates drawn from log-uniform distributions. We sampled the number of generations and the effective population size from log-normal and uniform distributions, respectively, such that we have medium amounts of ILS. The average FN rates between the true gene trees and the species tree range between 4% and 23% for 1K, between 21% and 58% for 2K, and between 21% and 33% for 5K.
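As an illustration of the FN rate reported above, the sketch below computes the fraction of non-trivial bipartitions of a reference tree that are missing from another tree, and averages it over gene trees. This is a minimal sketch under stated assumptions: trees are nested tuples of leaf names, all trees share the same leaf set, the species tree is taken as the reference, and the function names are hypothetical. It is not the tool used to produce the numbers in this paper.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree.

    Each bipartition is stored as the side that does not contain a fixed
    anchor taxon, so the result does not depend on where the tree is rooted.
    """
    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        if anchor in side:
            side = all_taxa - side
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(all_taxa) - 2:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def fn_rate(reference, other):
    """Fraction of reference bipartitions that are absent from the other tree."""
    ref = bipartitions(reference)
    if not ref:
        return 0.0
    return len(ref - bipartitions(other)) / len(ref)

# Hypothetical five-taxon example.
species = ((("A", "B"), "C"), ("D", "E"))
gene_trees = [((("A", "B"), "C"), ("D", "E")), ((("A", "C"), "B"), ("D", "E"))]
average_fn = sum(fn_rate(species, g) for g in gene_trees) / len(gene_trees)
print(average_fn)  # first gene tree misses 0 of 2 splits, second misses 1 of 2: 0.25
```

For fully resolved trees on the same taxa, this quantity coincides with the normalized Robinson-Foulds distance.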
To generate the gene trees and true species trees using SimPhy, we used the parameters given in Table 3 and the following commands.
1K:
2K:
5K:
Commands
Contracting Branches: To contract gene tree branches with bootstrap support up to a certain threshold, we used this command:
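Independently of the command-line tool used, the contraction operation itself can be sketched in Python. The following is a minimal illustration, not the command used in this study: the `Node` class, `contract_low_support`, and `to_newick` are hypothetical names introduced for this example, and a branch is contracted when its support is less than or equal to the threshold.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: Optional[str] = None        # leaf label; None for internal nodes
    support: Optional[float] = None   # bootstrap support of the branch above this node
    children: List["Node"] = field(default_factory=list)

def contract_low_support(node, threshold):
    """Return a copy of the tree in which internal branches with
    support <= threshold are collapsed, creating polytomies."""
    new_children = []
    for child in node.children:
        contracted = contract_low_support(child, threshold)
        is_internal = bool(contracted.children)
        if is_internal and contracted.support is not None and contracted.support <= threshold:
            # Remove the branch by attaching the grandchildren to this node.
            new_children.extend(contracted.children)
        else:
            new_children.append(contracted)
    return Node(node.name, node.support, new_children)

def to_newick(node):
    """Serialize the topology only (no support values or branch lengths)."""
    if not node.children:
        return node.name
    return "(" + ",".join(to_newick(child) for child in node.children) + ")"

# Hypothetical gene tree ((A,B)5,(C,D)90,E) with supports in percent.
tree = Node(children=[
    Node(support=5, children=[Node("A"), Node("B")]),
    Node(support=90, children=[Node("C"), Node("D")]),
    Node("E"),
])
print(to_newick(contract_low_support(tree, 10)))  # (A,B,(C,D),E)
```

The function returns a new tree rather than modifying its input, which makes it straightforward to produce several contracted versions of the same gene tree at different thresholds.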
Drawing Bootstrap Support on ML Gene Trees: To draw bootstrap support on the best ML gene trees, we first reroot both the best ML gene tree and the bootstrap gene trees using this command:
Then we draw bootstrap supports on the branches:
Gene Tree Estimation: We used FastTree version 2.1.9, double precision. To estimate the best ML gene trees, we used the following command:
where we have all the alignments in the PHYLIP format in the file all-genes.phylip for each replicate, and \(<num>\) is the number of alignments in this file.
For bootstrapping analysis, we first generate bootstrapped sequences using RAxML version 8.2.9 with the following command:
and then we use FastTree to perform the actual ML analyses; for FastTree bootstrap runs, we use the same command and models that we used for the best ML gene trees.
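Generating a bootstrapped alignment amounts to resampling alignment columns with replacement. The sketch below illustrates that idea only; it is not RAxML's implementation, and the dictionary-based alignment representation and the function name `bootstrap_alignment` are assumptions made for this example.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    alignment maps each taxon name to its aligned sequence; all sequences
    must have the same length. The same sampled columns are used for every
    taxon, so each resampled site is a complete original column.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns) for taxon, seq in alignment.items()}

# Hypothetical toy alignment; a seeded generator keeps the replicate reproducible.
alignment = {"A": "ACGTAC", "B": "ACGAAC", "C": "TCGAAT"}
replicate = bootstrap_alignment(alignment, random.Random(1))
print(replicate)
```

Each replicate alignment has the same dimensions as the original, and a gene tree estimated from it is one bootstrap replicate for that gene.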
Running ASTRAL: In this paper, ASTRAL-II refers to ASTRAL version 4.11.1 and ASTRAL-III refers to ASTRAL version 5.2.5. Both versions can be found at the link below:
Both versions of ASTRAL were run with the following command:
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhang, C., Sayyari, E., Mirarab, S. (2017). ASTRAL-III: Increased Scalability and Impacts of Contracting Low Support Branches. In: Meidanis, J., Nakhleh, L. (eds) Comparative Genomics. RECOMB-CG 2017. Lecture Notes in Computer Science, vol 10562. Springer, Cham. https://doi.org/10.1007/978-3-319-67979-2_4
DOI: https://doi.org/10.1007/978-3-319-67979-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67978-5
Online ISBN: 978-3-319-67979-2
eBook Packages: Computer Science (R0)