Enhancing Searches for Optimal Trees Using SIESTA
Many supertree estimation and multi-locus species tree estimation methods compute trees by combining trees on subsets of the species set based on some NP-hard optimization criterion. A recent approach to computing large trees has been to constrain the search space by defining a set of “allowed bipartitions”, and then use dynamic programming to find provably optimal solutions within that constrained space in polynomial time. Several phylogenomic estimation methods, such as ASTRAL, the MDC algorithm in PhyloNet, and FastRFS, use this approach. We present SIESTA, a method that allows the dynamic programming method to return a data structure that compactly represents all the optimal trees in the search space. As a result, SIESTA provides multiple capabilities, including: (1) counting the number of optimal trees, (2) calculating consensus trees, (3) generating a random optimal tree, and (4) annotating each branch of a given optimal tree with the proportion of optimal trees in which it appears. SIESTA is available as open source software on GitHub at https://github.com/pranjalv123/SIESTA.
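To make the idea of counting optimal trees inside a constrained search space concrete, the following is a minimal sketch and not SIESTA's implementation. It makes several simplifying assumptions that are not in the text above: trees are rooted and binary, the constraint set is given as allowed clades rather than unrooted bipartitions, and the criterion is to maximize the sum of a user-supplied weight over the clades of the tree. The names `count_optimal_trees`, `allowed_clades`, and `clade_weight` are hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def count_optimal_trees(taxa, allowed_clades, clade_weight):
    """Return (best_score, number_of_optimal_trees) over rooted binary trees
    whose clades all lie in allowed_clades (toy criterion: maximize the sum
    of clade_weight over every clade in the tree)."""
    taxa = frozenset(taxa)
    allowed = {frozenset(c) for c in allowed_clades}
    allowed |= {frozenset([t]) for t in taxa}  # leaves are always allowed
    allowed.add(taxa)                          # so is the root clade

    @lru_cache(maxsize=None)
    def solve(clade):
        if len(clade) == 1:
            return clade_weight(clade), 1
        anchor = min(clade)  # fixes which child is "left", so each split is seen once
        best, count = None, 0
        for left in allowed:
            if anchor not in left or not left < clade:
                continue
            right = clade - left
            if right not in allowed:
                continue
            left_score, left_count = solve(left)
            right_score, right_count = solve(right)
            if left_count == 0 or right_count == 0:
                continue  # a child clade cannot be resolved within the allowed set
            score = left_score + right_score + clade_weight(clade)
            ways = left_count * right_count
            if best is None or score > best:
                best, count = score, ways   # strictly better: restart the count
            elif score == best:
                count += ways               # tie: accumulate optimal trees
        return best, count

    return solve(taxa)


# Toy input: every subset of four taxa is an allowed clade.
taxa = "abcd"
all_clades = [frozenset(c) for k in range(1, 5) for c in combinations(taxa, k)]

# With a constant weight every tree ties, so all 15 rooted binary trees are optimal.
print(count_optimal_trees(taxa, all_clades, lambda c: 0))   # (0, 15)

# Rewarding only the clade {a, b} leaves the three trees that contain it.
prefer_ab = lambda c: 1 if c == frozenset("ab") else 0
print(count_optimal_trees(taxa, all_clades, prefer_ab))     # (1, 3)
```

The only change relative to a dynamic program that returns a single optimal tree is that ties are accumulated instead of discarded. Storing, for each clade, the list of tied splits rather than just their count would give a compact representation from which optimal trees could be sampled or summarized.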
We thank the anonymous reviewers for their helpful criticisms of an earlier draft, which greatly improved the manuscript. We also thank Erin Molloy, Sarah Christensen, and Siavash Mirarab for feedback on the initial results.
Funding. This study made use of the Illinois Campus Cluster, a computing resource that is operated by the Illinois Campus Cluster Program in conjunction with the National Center for Supercomputing Applications and which is supported by funds from the University of Illinois at Urbana-Champaign. This work was partially supported by the U.S. National Science Foundation Graduate Research Fellowship Program under Grant Number DGE-1144245 to PV and by U.S. National Science Foundation grant CCF-1535977 to TW.