Abstract
Phylogenomics—the estimation of species trees from multi-locus datasets—is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this paper, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL. All scripts and datasets used in this study are available on the Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.
Acknowledgments
This study was supported in part by NSF grants CCF-1535977 and 1513629 (to TW) and by the Ira and Debra Cohen Graduate Fellowship in Computer Science (to EKM). SR was supported by NSF grants DMS-1614242, CCF-1740707 (TRIPODS), and DMS-1902892, as well as a Simons Fellowship and a Vilas Associates Award. BL was supported by NSF grant DMS-1614242 (to SR). All computational analyses were performed on the Illinois Campus Cluster and the Blue Waters supercomputer, computing resources that are operated and financially supported by UIUC in conjunction with the National Center for Supercomputing Applications. Blue Waters is supported by the NSF (grants OCI-0725070 and ACI-1238993) and the state of Illinois.
A Additional Proofs
A.1 Proof of Theorem 1: case (2)
In case (2), assume that \(T^{\mathcal {Q}} = (((A,B),C),D)\), let R be the most recent common ancestor of A, B, C (but not D) in \(T^{\mathcal {Q}}\), and let I be the number of gene copies exiting R. As in case (1), it suffices to prove (2) almost surely. Let \(i_{x} \in \{1,...,I\}\) be the ancestral lineage of \(x \in \{a,b,c\}\) in R. Then
On the other hand,
with a similar result for \(\mathbf {P}'_{I}[q=ad|bc]\). By symmetry again, the last terms on the right-hand sides of (9) and (10) are equal. This implies
where \(x = \mathbf {P}'_{I}[i_a = i_b]\). This function g attains its minimum at the smallest possible value of x, which by Lemma 1 is \(x = 1/I\). Evaluating at \(x = 1/I\) gives
which establishes (2) in case (2).
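As a sketch of how this argument can be carried out, one way to write the decomposition is given below. It is a reconstruction under stated assumptions rather than the authors' exact statement: it assumes that copies sharing an ancestral lineage in R coalesce at or below R (there is no incomplete lineage sorting in the GDL model), and that under \(\mathbf {P}'_{I}\) the lineage \(i_c\) is uniform on \(\{1,...,I\}\) and independent of \((i_a, i_b)\).

```latex
% Reconstruction sketch under the assumptions stated in the lead-in.
% x denotes P'_I[i_a = i_b], as in the text.
\begin{align*}
\mathbf{P}'_I[q = ab|cd]
  &= \mathbf{P}'_I[i_a = i_b]
   + \mathbf{P}'_I[i_a, i_b, i_c \text{ distinct}]\,
     \mathbf{P}'_I[q = ab|cd \mid i_a, i_b, i_c \text{ distinct}], \\
\mathbf{P}'_I[q = ac|bd]
  &= \mathbf{P}'_I[i_a = i_c \neq i_b]
   + \mathbf{P}'_I[i_a, i_b, i_c \text{ distinct}]\,
     \mathbf{P}'_I[q = ac|bd \mid i_a, i_b, i_c \text{ distinct}], \\
\mathbf{P}'_I[q = ab|cd] - \mathbf{P}'_I[q = ac|bd]
  &= x - \frac{1 - x}{I} \;=:\; g(x),
  \qquad
  g\!\left(\tfrac{1}{I}\right) = \frac{1}{I} - \frac{I - 1}{I^2} = \frac{1}{I^2} > 0.
\end{align*}
```

Under these assumptions g is increasing in x, so the bound \(x \ge 1/I\) is what makes the difference strictly positive.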
A.2 Proof of Theorem 2
First, we prove consistency for the exact version of ASTRAL. The input to the ASTRAL/ONE pipeline is the collection of gene trees \(\mathcal {T} = \{t_i\}_{i=1}^k\), where \(t_i\) is labeled by individuals (i.e., gene copies) \(R_i \subseteq R\). For each species and each gene tree \(t_i\), we pick a gene copy uniformly at random, producing a new gene tree \(\tilde{t}_i\). Recall that the quartet score of \(\widetilde{T}\) with respect to \(\widetilde{\mathcal {T}} = \{\tilde{t}_i\}_{i=1}^k\) is then
We note that the score only depends on the unrooted topology of \(\widetilde{T}\). Under the GDL model, by independence of the gene trees (and non-negativity), \(Q_k(\widetilde{T})/k\) converges almost surely to its expectation simultaneously for all unrooted species tree topologies over S.
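The sampling step and the quartet score described above can be illustrated with a small Python sketch. It is not the ASTRAL implementation: trees are written as nested tuples, gene copies carry hypothetical labels of the form `A_1` (species, underscore, copy index), and the helper names are assumptions made for this example. The score counts, over all sampled gene trees and all 4-tuples of species present in them, the quartets on which the gene tree and the candidate species tree induce the same unrooted topology.

```python
import random
from itertools import combinations

def leaves(tree):
    """Return the leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def clusters(tree):
    """Return the leaf set below every internal node."""
    if isinstance(tree, str):
        return []
    found = [frozenset(leaves(tree))]
    for child in tree:
        found.extend(clusters(child))
    return found

def quartet_topology(tree, quartet):
    """Unrooted split induced on four leaves, or None if unresolved."""
    q = frozenset(quartet)
    for cluster in clusters(tree):
        side = cluster & q
        if len(side) == 2:
            return frozenset({side, q - side})
    return None

def restrict(tree, keep):
    """Restrict a tree to the leaves in `keep`, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def species_of(label):
    """Hypothetical labeling convention: 'A_1' is copy 1 in species A."""
    return label.split("_")[0]

def relabel(tree):
    """Replace every gene-copy label by its species label."""
    if isinstance(tree, str):
        return species_of(tree)
    return tuple(relabel(c) for c in tree)

def sample_one_copy(mul_tree, rng):
    """Keep one uniformly chosen copy per species, then relabel by species."""
    by_species = {}
    for leaf in leaves(mul_tree):
        by_species.setdefault(species_of(leaf), []).append(leaf)
    keep = {rng.choice(copies) for copies in by_species.values()}
    return relabel(restrict(mul_tree, keep))

def quartet_score(species_tree, gene_trees):
    """Count quartets on which each gene tree agrees with the species tree."""
    score = 0
    for t in gene_trees:
        present = sorted(set(leaves(t)))
        for q in combinations(present, 4):
            top = quartet_topology(t, q)
            if top is not None and top == quartet_topology(species_tree, q):
                score += 1
    return score

# Toy input: one gene family with a duplication inside species A.
mul_tree = (((("A_1", "A_2"), "B_1"), "C_1"), "D_1")
rng = random.Random(0)
sampled = [sample_one_copy(mul_tree, rng) for _ in range(5)]
print(quartet_score(((("A", "B"), "C"), "D"), sampled))
print(quartet_score(((("A", "C"), "B"), "D"), sampled))
```

Because the score compares unrooted quartet splits only, rerooting the candidate tree does not change it, which matches the remark that the score depends only on the unrooted topology.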
For a species \(A \in S\) and gene tree \(\tilde{t}_i\), let \(A_i\) be the gene copy in A on \(\tilde{t}_i\) if it exists and let \(\mathcal {E}_i^A\) be the event that it exists. For a 4-tuple of species \(\mathcal {Q} = \{A,B,C,D\}\), let \(\mathcal {Q}_i = \{A_i,B_i,C_i,D_i\}\) and \(\mathcal {E}_i^{\mathcal {Q}} = \mathcal {E}_i^A\cap \mathcal {E}_i^B\cap \mathcal {E}_i^C\cap \mathcal {E}_i^D\). The expectation can then be written as
as, on the event \((\mathcal {E}_1^\mathcal {Q})^c\), there is no contribution from \(\mathcal {Q}\) in the sum over the first sample.
Based on the proof of Theorem 1, a different way to write \(\mathbf {E}[ \mathbf {1}(\widetilde{T}_{ext}^{\mathcal {Q}_1},\tilde{t}_1^{\mathcal {Q}_1}) \,|\,\mathcal {E}_1^\mathcal {Q}]\) is in terms of the original gene tree \(t_1\). Let a, b, c, d be random gene copies on \(t_1\) in A, B, C, D respectively. Then if q is the topology of \(t_1\) restricted to a, b, c, d,
From (5), we know that this expression is maximized (strictly) at the true species tree \(\mathbf {P}'[q = T^{\mathcal {Q}}]\). Hence, together with (11) and the law of large numbers, almost surely the quartet score is eventually maximized by the true species tree as \(k \rightarrow +\infty \). This completes the proof for the exact version.
The default version is statistically consistent for the same reason as in the proof of Theorem 3. As the number of MUL-trees sampled tends to infinity, the true species tree will appear as one of the input gene trees almost surely. So ASTRAL returns the true species tree topology almost surely as the number of sampled MUL-trees increases.
A.3 Proof of Theorem 3: case (2)
In case (2), assume that \(T^{\mathcal {Q}} = (((A,B),C),D)\) and let R be the most recent common ancestor of A, B, C (but not D) in T. We want to establish (8) in this case. For \(i=1,2,3\), let \(\mathcal {N}_{AB|CD}^{\mathcal {Q},\{i\}}\) (respectively \(\mathcal {N}_{AC|BD}^{\mathcal {Q},\{i\}}\)) be the number of choices consisting of one gene copy from each species in \(\mathcal {Q}\) whose corresponding restriction on \(t^{\mathcal {Q}}\) agrees with AB|CD (respectively AC|BD) and where, in addition, copies of A, B, C descend from i distinct lineages in R. We make five observations:
- Contributions to \(\mathcal {N}_{AB|CD}^{\mathcal {Q},\{2\}}\) necessarily come from copies in A and B descending from the same lineage in R, together with a copy in C descending from a distinct lineage and any copy in D. Similarly for \(\mathcal {N}_{AC|BD}^{\mathcal {Q},\{2\}}\).
- Moreover \(\mathcal {N}_{AC|BD}^{\mathcal {Q},\{1\}} = 0\) almost surely, as in that case the corresponding copies from A and B coalesce (backwards in time) below R.
- Arguing as in the proof of Theorem 1, by symmetry we have the equality \(\mathbf {E}_I[\mathcal {N}_{AB|CD}^{\mathcal {Q},\{3\}}] = \mathbf {E}_I[\mathcal {N}_{AC|BD}^{\mathcal {Q},\{3\}}].\)
- For \(j \in \{1,\ldots ,I\}\), let \(\mathcal {A}_j\) be the number of gene copies in A descending from j in R, and similarly define \(\mathcal {B}_j\), \(\mathcal {C}_j\). Let \(\mathcal {D}\) be the number of gene copies in D. Then, under the conditional probability \(\mathbf {P}_I\), \(\mathcal {D}\) is independent of \((\mathcal {A}_j, \mathcal {B}_j, \mathcal {C}_j)_{j=1}^I\). Moreover, under \(\mathbf {P}_I\), \((\mathcal {C}_j)_{j=1}^I\) is independent of \((\mathcal {A}_j, \mathcal {B}_j)_{j=1}^I\).
- Similarly to case (1), by symmetry we have \(X^{=} \equiv \mathbf {E}_I[\mathcal {A}_{j_1} \mathcal {B}_{j_1}] = \mathbf {E}_I[\mathcal {A}_{1} \mathcal {B}_{1}]\), \(X^{\ne } \equiv \mathbf {E}_I[\mathcal {A}_{j_1} \mathcal {B}_{k_1}] = \mathbf {E}_I[\mathcal {A}_1]\mathbf {E}_I[\mathcal {B}_1]\) for all \(j_1, k_1\) with \(j_1\ne k_1\). Define also \(X = I X^{=} + I(I-1) X^{\ne }\), \(Y \equiv \mathbf {E}_I[\mathcal {C}_{1}]\) and \(Z \equiv \mathbf {E}_I[\mathcal {D}]\).
Putting these observations together, we obtain
where we used Lemma 2 on the last line.
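The counts used in this proof can be made concrete on a toy example. The Python sketch below enumerates every choice of one gene copy per species in a single multi-copy gene tree and tallies the unrooted quartet topology each choice induces; these totals correspond to the counts \(\mathcal {N}\) summed over i. The nested-tuple tree encoding, the `A_1` style copy labels, and the function names are assumptions of this example, not part of ASTRAL-multi.

```python
from collections import Counter
from itertools import product

def leaves(tree):
    """Return the leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def clusters(tree):
    """Return the leaf set below every internal node."""
    if isinstance(tree, str):
        return []
    found = [frozenset(leaves(tree))]
    for child in tree:
        found.extend(clusters(child))
    return found

def species_of(label):
    """Hypothetical labeling convention: 'A_1' is copy 1 in species A."""
    return label.split("_")[0]

def count_quartet_choices(mul_tree, species):
    """Tally the quartet topology induced by each choice of one copy per species."""
    copies = {s: [l for l in leaves(mul_tree) if species_of(l) == s]
              for s in species}
    all_clusters = clusters(mul_tree)
    counts = Counter()
    for choice in product(*(copies[s] for s in species)):
        q = frozenset(choice)
        for cluster in all_clusters:
            side = cluster & q
            if len(side) == 2:
                # Name the split by species, e.g. "AB|CD".
                left = "".join(sorted(species_of(l) for l in side))
                right = "".join(sorted(species_of(l) for l in q - side))
                counts["|".join(sorted([left, right]))] += 1
                break
    return counts

# Two lineages exit R (I = 2); each leaves one copy in A, B and C.
mul_tree = (((("A_1", "B_1"), "C_1"), (("A_2", "B_2"), "C_2")), "D_1")
counts = count_quartet_choices(mul_tree, ["A", "B", "C", "D"])
for topology in sorted(counts):
    print(topology, counts[topology])
```

In this toy tree, every choice in which the copies of A and B descend from the same lineage in R yields AB|CD, while the remaining choices split between the two alternative topologies.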
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Legried, B., Molloy, E.K., Warnow, T., Roch, S. (2020). Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss. In: Schwartz, R. (eds) Research in Computational Molecular Biology. RECOMB 2020. Lecture Notes in Computer Science, vol 12074. Springer, Cham. https://doi.org/10.1007/978-3-030-45257-5_8
DOI: https://doi.org/10.1007/978-3-030-45257-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45256-8
Online ISBN: 978-3-030-45257-5
eBook Packages: Computer Science, Computer Science (R0)