NJMerge: A Generic Technique for Scaling Phylogeny Estimation Methods and Its Application to Species Trees

  • Erin K. Molloy
  • Tandy Warnow
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11183)


Divide-and-conquer methods, which divide the species set into overlapping subsets, construct trees on the subsets, and then combine the trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of these approaches. In this paper, we present a new divide-and-conquer approach that does not require supertree estimation: we divide the species set into disjoint subsets, construct trees on the subsets, and then combine the trees using a distance matrix computed on the full species set. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of the Neighbor Joining algorithm. We report on the results of an extensive simulation study evaluating NJMerge’s utility in scaling three popular species tree estimation methods: ASTRAL, SVDquartets, and concatenation analysis using RAxML. We find that NJMerge provides substantial improvements in running time without sacrificing accuracy and sometimes even improves accuracy. Furthermore, although NJMerge can sometimes fail to return a tree, the failure rate in our experiments is less than 1%. Together, these results suggest that NJMerge is a valuable technique for scaling computationally intensive methods to larger datasets, especially when computational resources are limited. NJMerge is freely available on GitHub. All datasets, scripts, and supplementary materials are freely available through the Illinois Data Bank.
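The abstract describes NJMerge only as an extension of Neighbor Joining and does not spell out how the subset trees enter the merger step. As background, the following is a minimal sketch of the classic Neighbor Joining procedure (Saitou and Nei) that NJMerge builds on, not of NJMerge itself. The function names, the nested-tuple tree representation, and the toy distance matrix are illustrative assumptions rather than anything taken from the paper or its software.

```python
from itertools import combinations


def nj_pick_pair(dist, nodes):
    """Return the pair of active nodes minimizing the Neighbor Joining Q-criterion."""
    n = len(nodes)
    # Row sums r_i = sum over active k of d(i, k).
    row_sum = {i: sum(dist[i][k] for k in nodes if k != i) for i in nodes}
    best_pair, best_q = None, None
    for i, j in combinations(nodes, 2):
        # Q(i, j) = (n - 2) * d(i, j) - r_i - r_j
        q = (n - 2) * dist[i][j] - row_sum[i] - row_sum[j]
        if best_q is None or q < best_q:
            best_pair, best_q = (i, j), q
    return best_pair, best_q


def neighbor_joining(dist):
    """Plain Neighbor Joining; returns an unrooted topology as nested tuples."""
    # Work on a copy so the caller's matrix is not modified.
    dist = {i: dict(row) for i, row in dist.items()}
    nodes = list(dist)
    while len(nodes) > 3:
        (i, j), _ = nj_pick_pair(dist, nodes)
        new = (i, j)  # label the new internal node by its two children
        dist[new] = {}
        for k in nodes:
            if k not in (i, j):
                # Standard NJ update for distances to the new node.
                d = 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
                dist[new][k] = d
                dist[k][new] = d
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    return tuple(nodes)


# Toy 5-taxon distance matrix (hypothetical values, for illustration only).
labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
D = {x: {y: rows[i][j] for j, y in enumerate(labels) if y != x}
     for i, x in enumerate(labels)}

first_pair, first_q = nj_pick_pair(D, labels)  # joins ('a', 'b') with Q = -50
tree = neighbor_joining(D)
print(first_pair, first_q, tree)
```

In this sketch every pair of active nodes is a candidate for joining. A merger method in the spirit of the abstract would additionally have to take the trees estimated on the disjoint subsets into account when choosing joins, which this code does not attempt.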


Keywords: Phylogenomics, Species trees, Incomplete lineage sorting, Divide-and-conquer, Neighbor Joining, NJst, ASTRAL, SVDquartets



The authors wish to thank the anonymous reviewers, whose feedback led to improvements in the paper.


This work was supported by the National Science Foundation (award CCF-1535977) to TW. EKM was supported by the NSF Graduate Research Fellowship (award DGE-1144245) and the Ira and Debra Cohen Graduate Fellowship in Computer Science. Computational experiments were performed on Blue Waters, which is supported by the NSF (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA
