TreeShrink: Efficient Detection of Outlier Tree Leaves

  • Uyen Mai
  • Siavash Mirarab
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10562)


Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks k leaves whose removal reduces the tree diameter maximally. We present a polynomial-time solution to this “k-shrink” problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok on three of the five datasets, and the two methods tie on one more.
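
To make the k-shrink objective concrete, the following is a minimal brute-force sketch in Python. It is not the polynomial-time algorithm referred to in the abstract: it simply enumerates every set of k leaves, so it is exponential in k and only suitable for toy trees. The adjacency-dictionary tree representation, the function names, and the example tree are all assumptions made for illustration. It relies on the fact that pruning leaves does not change path lengths between the remaining leaves, so the diameter after removal is the largest distance among the leaves that are kept.

```python
from itertools import combinations


def leaf_distances(adj):
    """Return the leaves and all pairwise leaf-to-leaf path lengths.

    `adj` is a hypothetical representation: {node: {neighbor: branch_length}}.
    """
    leaves = [v for v, nbrs in adj.items() if len(nbrs) == 1]
    dist = {}
    for src in leaves:
        # Iterative traversal; in a tree every node is reached by one path.
        stack = [(src, None, 0.0)]
        while stack:
            node, parent, d = stack.pop()
            if node != src and len(adj[node]) == 1:
                dist[(src, node)] = d
            for nbr, w in adj[node].items():
                if nbr != parent:
                    stack.append((nbr, node, d + w))
    return leaves, dist


def diameter(leaves, dist):
    """Largest pairwise distance among the given leaves (0.0 if fewer than 2)."""
    return max((dist[(a, b)] for a, b in combinations(leaves, 2)), default=0.0)


def k_shrink_brute_force(adj, k):
    """Try every set of k leaves; return the set whose removal minimizes the diameter."""
    leaves, dist = leaf_distances(adj)
    if k > len(leaves):
        raise ValueError("k exceeds the number of leaves")
    best_removed, best_diam = (), float("inf")
    for removed in combinations(leaves, k):
        kept = [x for x in leaves if x not in removed]
        d = diameter(kept, dist)
        if d < best_diam:
            best_removed, best_diam = removed, d
    return set(best_removed), best_diam


# Toy tree (illustrative only): leaf D sits on a much longer branch than A, B, C.
adj = {
    "A": {"u": 1}, "B": {"u": 1}, "C": {"v": 1}, "D": {"v": 10},
    "u": {"A": 1, "B": 1, "v": 1},
    "v": {"C": 1, "D": 10, "u": 1},
}
print(k_shrink_brute_force(adj, 1))
```

On this toy tree the single removal that shrinks the diameter most is the long-branched leaf D, which is the kind of taxon the abstract describes as artificially inflating the diameter. Deciding whether such a reduction is unexpectedly large is a separate statistical step that this sketch does not attempt.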


Keywords: Tree diameter, Rogue taxon removal, Gene tree discordance



This work was supported by the NSF grant IIS-1565862 to SM and UM. Computations were performed at the San Diego Supercomputer Center (SDSC) through XSEDE allocations; XSEDE is supported by the NSF grant ACI-1053575.


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Computer Science and Engineering, University of California at San Diego, San Diego, USA
  2. Electrical and Computer Engineering, University of California at San Diego, San Diego, USA
