Statistical Consistency of Coalescent-Based Species Tree Methods Under Models of Missing Data

  • Michael Nute
  • Jed Chou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10562)


The estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer, all of which can produce gene trees that differ from the species tree. Methods that estimate species trees in the presence of gene tree discord resulting from ILS have been developed and proven statistically consistent when gene tree discord is due only to ILS and every gene tree contains the full set of species. Here we address the statistical consistency of coalescent-based species tree estimation methods when gene trees are missing species, i.e., in the presence of missing data.



MN was supported by NSF grants DBI-1461364, CCF-1535977 and AF:1513629 and by a fellowship from the CompGen initiative in the Coordinated Science Laboratory at UIUC. JC was supported by the Mathematics Department at UIUC.

A great deal of thanks is owed to our advisor, Dr. Tandy Warnow, who guided this manuscript from start to finish and pushed us to leave no stone unturned.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, USA
  2. Department of Mathematics, University of Illinois at Urbana-Champaign, Urbana, USA
