Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2020)

Abstract

Phylogenomics—the estimation of species trees from multi-locus datasets—is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this paper, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL. All scripts and datasets used in this study are available on the Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.

Acknowledgments

This study was supported in part by NSF grants CCF-1535977 and 1513629 (to TW) and by the Ira and Debra Cohen Graduate Fellowship in Computer Science (to EKM). SR was supported by NSF grants DMS-1614242, CCF-1740707 (TRIPODS), and DMS-1902892, as well as a Simons Fellowship and a Vilas Associates Award. BL was supported by NSF grant DMS-1614242 (to SR). All computational analyses were performed on the Illinois Campus Cluster and the Blue Waters supercomputer, computing resources that are operated and financially supported by UIUC in conjunction with the National Center for Supercomputing Applications. Blue Waters is supported by the NSF (grants OCI-0725070 and ACI-1238993) and the state of Illinois.

Author information

Corresponding author

Correspondence to Sébastien Roch.

A Additional Proofs

A.1 Proof of Theorem 1: case (2)

In case (2), assume that \(T^{\mathcal {Q}} = (((A,B),C),D)\), let R be the most recent common ancestor of A, B, and C (but not D) in \(T^{\mathcal {Q}}\), and let I be the number of gene copies exiting R. As in case (1), it suffices to prove (2) almost surely. Let \(i_{x} \in \{1,\ldots ,I\}\) be the ancestral lineage of \(x \in \{a,b,c\}\) in R. Then

$$\begin{aligned} \mathbf {P}'_I[q = ab|cd]&= \mathbf {P}'_I[i_a = i_b] + \mathbf {P}'_I[q = ab|cd \text { and }i_a, i_b, i_c \text { all distinct}]. \end{aligned} \tag{9}$$

On the other hand,

$$\begin{aligned} \mathbf {P}'_I[q = ac|bd]&= \mathbf {P}'_I[i_b \ne i_a = i_c] + \mathbf {P}'_I[q = ac|bd \text { and } i_a, i_b, i_c \text { all distinct}], \end{aligned} \tag{10}$$

with a similar result for \(\mathbf {P}'_{I}[q=ad|bc]\). By symmetry again, the last terms on the right-hand sides of (9) and (10) are equal. This implies

$$\begin{aligned}&\mathbf {P}'_I[q = ab|cd] - \mathbf {P}'_I[q = ac|bd] = \mathbf {P}'_I[i_a = i_b] - \mathbf {P}'_I[i_b \ne i_a = i_c]\\&= x - (1-x)\mathbf {P}'_{I}[i_{a}=i_{c}|i_{a}\ne i_{b}] = x - (1-x)\frac{1}{I} \equiv g(x), \end{aligned}$$

where \(x = \mathbf {P}'_{I}[i_a = i_b]\). The function g is linear in x with slope \(1 + 1/I > 0\), so it is increasing and attains its minimum at the smallest possible value of x, which by Lemma 1 is \(x = 1/I\). Evaluating at \(x = 1/I\) gives

$$g(1/I) = \frac{1}{I} - \frac{1}{I} + \frac{1}{I^2} = \frac{1}{I^2} > 0,$$

which establishes (2) in case (2).
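The arithmetic in this case can be checked exactly with rational numbers. The sketch below is only a sanity check: the function names are hypothetical, and the toy lineage model (in which \(i_a\) and \(i_c\) are independent and uniform, and \(i_b\) copies \(i_a\) with probability p) is an assumption made for illustration, not the GDL model of the paper. It only needs to reproduce the two properties used above, namely \(\mathbf {P}'_{I}[i_{a}=i_{c}|i_{a}\ne i_{b}] = 1/I\) and \(x \ge 1/I\).

```python
from fractions import Fraction
from itertools import product


def g(x, I):
    """g(x) = x - (1 - x)/I, as in the derivation above."""
    return x - (1 - x) * Fraction(1, I)


def gap_by_enumeration(I, p):
    """Return (x, P[i_a = i_b] - P[i_b != i_a = i_c]) by exact enumeration.

    Toy model (an assumption, not the GDL model): i_a and i_c are independent
    and uniform on {0, ..., I-1}; i_b equals i_a with probability p and is
    otherwise an independent uniform draw.
    """
    u = Fraction(1, I)
    same = Fraction(0)   # accumulates P[i_a = i_b]
    other = Fraction(0)  # accumulates P[i_b != i_a = i_c]
    for ia, ib, ic in product(range(I), repeat=3):
        prob_b = p * (ia == ib) + (1 - p) * u  # P[i_b = ib | i_a = ia]
        weight = u * prob_b * u                # joint probability of (ia, ib, ic)
        if ia == ib:
            same += weight
        elif ia == ic:
            other += weight
    return same, same - other


if __name__ == "__main__":
    for I in (2, 3, 5):
        for p in (Fraction(0), Fraction(1, 3), Fraction(1)):
            x, gap = gap_by_enumeration(I, p)
            print(I, p, x, gap, gap == g(x, I))
```

With p = 0 the toy model gives the boundary value x = 1/I, where the gap equals \(1/I^2\); larger p only increases x and hence g(x).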

A.2 Proof of Theorem 2

First, we prove consistency for the exact version of ASTRAL. The input to the ASTRAL/ONE pipeline is the collection of gene trees \(\mathcal {T} = \{t_i\}_{i=1}^k\), where \(t_i\) is labeled by individuals (i.e., gene copies) \(R_i \subseteq R\). For each gene tree \(t_i\) and each species, we pick one gene copy uniformly at random, producing a new gene tree \(\tilde{t}_i\). Recall that the quartet score of \(\widetilde{T}\) with respect to \(\widetilde{\mathcal {T}} = \{\tilde{t}_i\}_{i=1}^k\) is then

$$ Q_k(\widetilde{T}) = \sum _{i=1}^k \sum _{\mathcal {J} = \{a,b,c,d\} \subseteq R_i} \mathbf {1}(\widetilde{T}_{ext}^{\mathcal {J}},\tilde{t}_i^{\mathcal {J}}). $$

We note that the score only depends on the unrooted topology of \(\widetilde{T}\). Under the GDL model, by independence of the gene trees (and non-negativity), \(Q_k(\widetilde{T})/k\) converges almost surely to its expectation simultaneously for all unrooted species tree topologies over S.
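The sampling step and the quartet score described above can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not the ASTRAL implementation (which relies on dynamic programming): trees are written as nested tuples of leaf labels, the helper names are hypothetical, and `species_of` is an assumed user-supplied function mapping a gene-copy label to its species.

```python
import random
from itertools import combinations


def leaves(node):
    """Leaf labels of a nested-tuple tree, left to right."""
    if isinstance(node, tuple):
        return [x for child in node for x in leaves(child)]
    return [node]


def restrict(node, keep):
    """Prune the tree to the leaves in `keep`, suppressing unary nodes."""
    if not isinstance(node, tuple):
        return node if node in keep else None
    kids = [r for r in (restrict(child, keep) for child in node) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def relabel(node, names):
    """Replace every leaf label by names[label]."""
    if isinstance(node, tuple):
        return tuple(relabel(child, names) for child in node)
    return names[node]


def sample_one_copy(gene_tree, species_of, rng):
    """Keep one uniformly random gene copy per species, relabeled by species."""
    by_species = {}
    for copy in leaves(gene_tree):
        by_species.setdefault(species_of(copy), []).append(copy)
    chosen = {rng.choice(copies): sp for sp, copies in by_species.items()}
    return relabel(restrict(gene_tree, set(chosen)), chosen)


def clades(tree):
    """Leaf sets below every node of the tree (leaves and root included)."""
    out = []

    def walk(node):
        if isinstance(node, tuple):
            below = frozenset().union(*(walk(child) for child in node))
        else:
            below = frozenset([node])
        out.append(below)
        return below

    walk(tree)
    return out


def quartet_topology(tree_clades, quartet):
    """Unrooted topology induced on four leaves, or None if unresolved."""
    q = frozenset(quartet)
    for clade in tree_clades:
        side = clade & q
        if len(side) == 2:  # an edge separating two of the leaves from the other two
            return frozenset([side, q - side])
    return None


def quartet_score(species_tree, sampled_gene_trees):
    """Number of (gene tree, quartet) pairs on which the two topologies agree."""
    species_clades = clades(species_tree)
    score = 0
    for t in sampled_gene_trees:
        gene_clades = clades(t)
        for quartet in combinations(sorted(leaves(t)), 4):
            topo = quartet_topology(gene_clades, quartet)
            if topo is not None and topo == quartet_topology(species_clades, quartet):
                score += 1
    return score
```

Because the topology test only looks at which leaves an edge separates, the score is unchanged if the candidate species tree is rerooted, in line with the remark that it depends only on the unrooted topology.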

For a species \(A \in S\) and gene tree \(\tilde{t}_i\), let \(A_i\) be the gene copy in A on \(\tilde{t}_i\) if it exists and let \(\mathcal {E}_i^A\) be the event that it exists. For a 4-tuple of species \(\mathcal {Q} = \{A,B,C,D\}\), let \(\mathcal {Q}_i = \{A_i,B_i,C_i,D_i\}\) and \(\mathcal {E}_i^{\mathcal {Q}} = \mathcal {E}_i^A\cap \mathcal {E}_i^B\cap \mathcal {E}_i^C\cap \mathcal {E}_i^D\). The expectation can then be written as

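$$\begin{aligned} \mathbf {E}\left[ \frac{1}{k} Q_k(\widetilde{T})\right] = \sum _{\mathcal {Q} = \{A,B,C,D\} \subseteq S} \mathbf {E}\left[ \mathbf {1}(\widetilde{T}_{ext}^{\mathcal {Q}_1},\tilde{t}_1^{\mathcal {Q}_1}) \,\big |\,\mathcal {E}_1^\mathcal {Q}\right] \mathbf {P}[\mathcal {E}_1^\mathcal {Q}], \end{aligned} \tag{11}$$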

as, on the event \((\mathcal {E}_1^\mathcal {Q})^c\), there is no contribution from \(\mathcal {Q}\) in the sum over the first sample.

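Based on the proof of Theorem 1, a different way to write \(\mathbf {E}[ \mathbf {1}(\widetilde{T}_{ext}^{\mathcal {Q}_1},\tilde{t}_1^{\mathcal {Q}_1}) \,|\,\mathcal {E}_1^\mathcal {Q}]\) is in terms of the original gene tree \(t_1\). Let a, b, c, d be random gene copies on \(t_1\) in A, B, C, D respectively. Then if q is the topology of \(t_1\) restricted to a, b, c, d,

$$\mathbf {E}[ \mathbf {1}(\widetilde{T}_{ext}^{\mathcal {Q}_1},\tilde{t}_1^{\mathcal {Q}_1}) \,|\,\mathcal {E}_1^\mathcal {Q}] = \mathbf {P}'[q = \widetilde{T}^{\mathcal {Q}}].$$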

From (5), we know that this expression is strictly maximized when \(\widetilde{T}^{\mathcal {Q}}\) is the true species tree topology \(T^{\mathcal {Q}}\), that is, at \(\mathbf {P}'[q = T^{\mathcal {Q}}]\). Hence, together with (11) and the law of large numbers, almost surely the quartet score is eventually maximized by the true species tree as \(k \rightarrow +\infty \). This completes the proof for the exact version.

The default version is statistically consistent for the same reason as in the proof of Theorem 3. As the number of MUL-trees sampled tends to infinity, the true species tree will appear as one of the input gene trees almost surely. So ASTRAL returns the true species tree topology almost surely as the number of sampled MUL-trees increases.

A.3 Proof of Theorem 3: case (2)

In case (2), assume that \(T^{\mathcal {Q}} = (((A,B),C),D)\) and let R be the most recent common ancestor of A, B, and C (but not D) in T. We want to establish (8) in this case. For \(i=1,2,3\), let \(\mathcal {N}_{AB|CD}^{\mathcal {Q},\{i\}}\) (respectively \(\mathcal {N}_{AC|BD}^{\mathcal {Q},\{i\}}\)) be the number of choices consisting of one gene copy from each species in \(\mathcal {Q}\) whose corresponding restriction on \(t^{\mathcal {Q}}\) agrees with AB|CD (respectively AC|BD) and where, in addition, the copies from A, B, and C descend from i distinct lineages in R. We make five observations:

  • Contributions to \(\mathcal {N}_{AB|CD}^{\mathcal {Q},\{2\}}\) necessarily come from copies in A and B descending from the same lineage in R, together with a copy in C descending from a distinct lineage and any copy in D. Similarly for \(\mathcal {N}_{AC|BD}^{\mathcal {Q},\{2\}}\).

  • Moreover \(\mathcal {N}_{AC|BD}^{\mathcal {Q},\{1\}} = 0\) almost surely, as in that case the corresponding copies from A and B coalesce (backwards in time) below R.

  • Arguing as in the proof of Theorem 1, by symmetry we have the equality \(\mathbf {E}_I[\mathcal {N}_{AB|CD}^{\mathcal {Q},\{3\}}] = \mathbf {E}_I[\mathcal {N}_{AC|BD}^{\mathcal {Q},\{3\}}].\)

  • For \(j \in \{1,\ldots ,I\}\), let \(\mathcal {A}_j\) be the number of gene copies in A descending from j in R, and similarly define \(\mathcal {B}_j\), \(\mathcal {C}_j\). Let \(\mathcal {D}\) be the number of gene copies in D. Then, under the conditional probability \(\mathbf {P}_I\), \(\mathcal {D}\) is independent of \((\mathcal {A}_j, \mathcal {B}_j, \mathcal {C}_j)_{j=1}^I\). Moreover, under \(\mathbf {P}_I\), \((\mathcal {C}_j)_{j=1}^I\) is independent of \((\mathcal {A}_j, \mathcal {B}_j)_{j=1}^I\).

  • Similarly to case 1), by symmetry we have \(X^{=} \equiv \mathbf {E}_I[\mathcal {A}_{j_1} \mathcal {B}_{j_1}] = \mathbf {E}_I[\mathcal {A}_{1} \mathcal {B}_ {1}]\), \(X^{\ne } \equiv \mathbf {E}_I[\mathcal {A}_{j_1} \mathcal {B}_{k_1}] = \mathbf {E}_I[\mathcal {A}_1]\mathbf {E}_I[\mathcal {B}_1]\) for all \(j_1, k_1\) with \(j_1\ne k_1\). Define also \(X = I X^{=} + I(I-1) X^{\ne }\), \(Y \equiv \mathbf {E}_I[\mathcal {C}_{1}]\) and \(Z \equiv \mathbf {E}_I[\mathcal {D}]\).

Putting these observations together, we obtain

$$\begin{aligned}&\mathbf {E}_I[\mathcal {N}^{\mathcal {Q}}_{AB|CD}] - \mathbf {E}_I[\mathcal {N}^{\mathcal {Q}}_{AC|BD}]\\&\qquad = \mathbf {E}_I[\mathcal {N}^{\mathcal {Q},\{1\}}_{AB|CD}] + \mathbf {E}_I[\mathcal {N}^{\mathcal {Q},\{2\}}_{AB|CD}] - \mathbf {E}_I[\mathcal {N}^{\mathcal {Q},\{2\}}_{AC|BD}]\\&\qquad = I X^{=} Y Z + I(I-1) X^{=} Y Z - I(I-1) X^{\ne } Y Z\\&\qquad > 0, \end{aligned}$$

where we used Lemma 2 on the last line.
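The sign of the last expression can be made explicit by factoring out the common terms. The statement of Lemma 2 is not reproduced in this appendix, so the step below assumes that it supplies \(X^{=} \ge X^{\ne }\), and that \(X^{=}\), Y and Z are strictly positive:

$$\begin{aligned} I X^{=} Y Z + I(I-1) X^{=} Y Z - I(I-1) X^{\ne } Y Z = I\, Y Z \left[ X^{=} + (I-1)\left( X^{=} - X^{\ne }\right) \right] . \end{aligned}$$

Under those assumptions the bracket is the sum of a strictly positive term and a nonnegative one, so the whole expression is strictly positive for every \(I \ge 1\).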

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Legried, B., Molloy, E.K., Warnow, T., Roch, S. (2020). Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss. In: Schwartz, R. (ed.) Research in Computational Molecular Biology. RECOMB 2020. Lecture Notes in Computer Science, vol 12074. Springer, Cham. https://doi.org/10.1007/978-3-030-45257-5_8

  • DOI: https://doi.org/10.1007/978-3-030-45257-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-45256-8

  • Online ISBN: 978-3-030-45257-5

  • eBook Packages: Computer Science (R0)
