NJMerge: A Generic Technique for Scaling Phylogeny Estimation Methods and Its Application to Species Trees

  • Erin K. Molloy
  • Tandy Warnow
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11183)

Abstract

Divide-and-conquer methods, which divide the species set into overlapping subsets, construct trees on the subsets, and then combine the trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of these approaches. In this paper, we present a new divide-and-conquer approach that does not require supertree estimation: we divide the species set into disjoint subsets, construct trees on the subsets, and then combine the trees using a distance matrix computed on the full species set. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of the Neighbor Joining algorithm. We report on the results of an extensive simulation study evaluating NJMerge’s utility in scaling three popular species tree estimation methods: ASTRAL, SVDquartets, and concatenation analysis using RAxML. We find that NJMerge provides substantial improvements in running time without sacrificing accuracy and sometimes even improves accuracy. Furthermore, although NJMerge can sometimes fail to return a tree, the failure rate in our experiments is less than 1%. Together, these results suggest that NJMerge is a valuable technique for scaling computationally intensive methods to larger datasets, especially when computational resources are limited. NJMerge is freely available on GitHub: https://github.com/ekmolloy/njmerge. All datasets, scripts, and supplementary materials are freely available through the Illinois Data Bank: https://doi.org/10.13012/B2IDB-1424746_V1.
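
For readers unfamiliar with the algorithm that NJMerge extends, the following is a minimal sketch of classic Neighbor Joining (topology only), not of NJMerge itself. The function name `neighbor_joining`, the nested-tuple tree representation, and the four-taxon distance matrix are illustrative assumptions, not taken from the paper or from the NJMerge code. How NJMerge uses the subset trees during merging is not described in this abstract and is not shown here.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Classic Neighbor Joining, topology only (hypothetical helper).

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the unrooted tree as a 3-tuple of nested tuples.
    """
    nodes = list(names)
    d = dict(dist)  # copy so the caller's matrix is not modified
    while len(nodes) > 3:
        n = len(nodes)
        # r[i] is the sum of distances from i to every other active node.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}
        # Join the pair minimizing Q(i, j) = (n - 2) d(i, j) - r[i] - r[j].
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        u = (i, j)  # the new internal node is represented by its children
        for k in nodes:
            if k not in (i, j):
                # Standard NJ update for the distance from u to k.
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))]
                    + d[frozenset((j, k))]
                    - d[frozenset((i, j))]
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    return tuple(nodes)


# Illustrative additive distances for the unrooted tree AB|CD
# (assumed values, not data from the paper).
pairs = {
    ("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
    ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9,
}
dist = {frozenset(p): v for p, v in pairs.items()}
tree = neighbor_joining(["A", "B", "C", "D"], dist)
```

On additive distances such as these, the join places A with B (equivalently C with D), which is the split of the generating tree. In the approach described above, the distance matrix would be computed on the full species set, while the computationally intensive method is run only on the disjoint subsets.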

Keywords

Phylogenomics, Species trees, Incomplete lineage sorting, Divide-and-conquer, Neighbor Joining, NJst, ASTRAL, SVDquartets

Notes

Acknowledgments

The authors wish to thank the anonymous reviewers, whose feedback led to improvements in the paper.

Funding

This work was supported by the National Science Foundation (award CCF-1535977) to TW. EKM was supported by the NSF Graduate Research Fellowship (award DGE-1144245) and the Ira and Debra Cohen Graduate Fellowship in Computer Science. Computational experiments were performed on Blue Waters, which is supported by the NSF (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA
