An Introduction to Supertree Construction (and Partitioned Phylogenetic Analyses) with a View Toward the Distinction Between Gene Trees and Species Trees
The dominant approach to the analysis of phylogenomic data is the concatenation of the individual gene data sets into a giant supermatrix that is analyzed en masse. Nevertheless, there remain compelling arguments for a partitioned approach in which individual partitions (usually genes) are instead analyzed separately and the resulting trees are combined to yield the final phylogeny. For instance, it has been argued that this supertree framework, which remains controversial, can better account for natural evolutionary processes like horizontal gene transfer and incomplete lineage sorting that can cause the gene trees, although accurate for the evolutionary history of the genes, to differ from the species tree. In this chapter, I review the different methods of supertree construction (broadly defined), including newer model-based methods based on a multispecies coalescent model. In so doing, I elaborate on some of their strengths and weaknesses relative to one another as well as provide a rough guide to performing a supertree analysis before addressing criticisms of the supertree approach in general. In the end, however, rather than dogmatically advocating supertree construction and partitioned analyses in general, I instead argue that a combined, “global congruence” approach in which data sets are analyzed under both a supermatrix (unpartitioned) and supertree (partitioned) framework represents the best strategy in our attempts to uncover the Tree of Life.
I thank László Zsolt Garamszegi for the invitation to contribute to this exciting project and his incredible patience in putting it all together. Thanks also go to Las and two anonymous reviewers for their comments that helped improve and focus my original thoughts.
Hidden support is the phenomenon whereby consistent secondary signals among a set of data partitions can overrule their conflicting primary signals to yield a novel solution not found in any of the individual data sets. As a simplified example, take the case of two separate gene data sets, each with an aligned length of 1000 nucleotides. In the first data set, 60 % of the positions support a sister-group relationship between A and B (primary signal), whereas 40 % support the clustering of B and C (secondary signal). In the second data set, 60 % support the clustering of A and C, whereas 40 % support B and C.
Separate analyses of each data set will yield conflicting results (AB vs. AC); however, when the data sets are combined, each of these solutions is now only supported by 30 % of the data. By contrast, the secondary signals supporting BC are now present among 40 % of the combined data and now form the primary signal. In other words, each separate data set possessed hidden support for BC that could combine and determine the overall solution upon the concatenation of the data sets. Because supertree analyses work with trees as their primary data source, these secondary signals in the raw character data are normally invisible and cannot be accounted for.
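The arithmetic of this example can be checked with a short sketch. The site counts come directly from the example above; the dictionary layout and the helper names `best_supported` and `combine` are illustrative assumptions, and "support" is reduced here to a simple tally of sites rather than any real tree-building criterion.

```python
# Site counts supporting each grouping, taken from the two-gene example above.
gene1 = {"AB": 600, "BC": 400}   # 60 % vs 40 % of 1000 positions
gene2 = {"AC": 600, "BC": 400}   # 60 % vs 40 % of 1000 positions


def best_supported(counts):
    """Return the grouping backed by the largest number of sites."""
    return max(counts, key=counts.get)


def combine(*datasets):
    """Sum per-grouping site counts, as concatenating the data sets would."""
    total = {}
    for counts in datasets:
        for grouping, n in counts.items():
            total[grouping] = total.get(grouping, 0) + n
    return total


combined = combine(gene1, gene2)
n_sites = sum(combined.values())
proportions = {g: n / n_sites for g, n in combined.items()}

print(best_supported(gene1))     # AB
print(best_supported(gene2))     # AC
print(best_supported(combined))  # BC
print(proportions)               # {'AB': 0.3, 'BC': 0.4, 'AC': 0.3}
```

A supertree method would see only the two winning groupings (AB and AC) and never the 800 sites that jointly favor BC.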
Long-branch attraction is an artifact in the phylogenetic analysis of DNA sequence data that was first exposed by Felsenstein (1978) and results from saturation in such data. Felsenstein observed that taxa at the ends of very long branches that themselves were separated by a short intervening branch often clustered to form sister taxa in a maximum parsimony analysis. Optimization criteria that used an explicit model of evolution, like maximum likelihood, were more immune to this problem.
This artifactual attraction of the long branches arises because the taxa are characterized by high rates of molecular evolution (as indicated by the long branches) and a concomitantly large number of shared convergent changes that, through their sheer number, are falsely interpreted as evidence for shared common ancestry. It is now known that long-branch attraction is a general problem (i.e., it can affect nonmolecular data, although it is far less likely to do so) and can occur even if the branches lie on distant parts of the tree (see Bergsten 2005).
A long-standing mathematical principle (Ponstein 1966) showing that there is a one-to-one correspondence between a tree (a “directed acyclic graph”) and its encoding as a binary matrix. Whereas additive binary coding (Farris et al. 1970) of the tree will derive the matrix, the tree can be recreated from the matrix via analysis of the latter using virtually any optimization criterion (see Fig. 3.2).
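The correspondence can be illustrated with a minimal sketch. It assumes a rooted tree written as nested Python tuples and codes one column per non-root clade (1 for taxa inside the clade, 0 otherwise), which is a simplified stand-in for additive binary coding. The functions `leaf_sets`, `encode`, and `decode` are hypothetical helpers, and `decode` rebuilds the tree by direct nesting of the columns rather than by analysis under an optimization criterion.

```python
def leaf_sets(tree, out):
    """Append the leaf set of every internal node to `out`; return the leaf set of `tree`."""
    if isinstance(tree, str):                       # a leaf is just a taxon name
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def encode(tree):
    """Return (taxa, matrix): one row per taxon, one binary column per non-root clade."""
    found = []
    taxa = sorted(leaf_sets(tree, found))
    columns = [c for c in found if len(c) < len(taxa)]   # the root clade is uninformative
    matrix = {t: [1 if t in c else 0 for c in columns] for t in taxa}
    return taxa, matrix


def decode(taxa, matrix):
    """Rebuild the nested-tuple tree from the columns of the binary matrix."""
    n_cols = len(matrix[taxa[0]])
    groups = [frozenset(t for t in taxa if matrix[t][j]) for j in range(n_cols)]
    groups.append(frozenset(taxa))                  # restore the root clade
    groups.sort(key=len)                            # build small clades first
    forest = {frozenset([t]): t for t in taxa}      # clade -> subtree built so far
    for g in groups:
        # Every subtree still in the forest that nests inside g is a child of g.
        kids = sorted((k for k in forest if k < g), key=sorted)
        forest[g] = tuple(forest.pop(k) for k in kids)
    return forest[frozenset(taxa)]


tree = ((("A", "B"), "C"), ("D", "E"))
taxa, matrix = encode(tree)
for taxon in taxa:
    print(taxon, matrix[taxon])
print(decode(taxa, matrix) == tree)   # True
```

In a supertree setting, taxa absent from a given source tree would additionally be scored as missing data for that tree's columns, which this sketch does not cover.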
NP-complete problems form a class of nondeterministic polynomial (NP) time problems for which no efficient solution is known and for which the running time of exact methods increases tremendously with the size of the problem. As such, heuristic rather than exact algorithms must be used beyond a certain problem size, meaning that there is no guarantee that the optimal solution has been found. In phylogenetics, classic examples of NP-complete problems include finding the optimal tree under maximum parsimony and maximum likelihood.
Polynomial time algorithms are said to be “fast” in the sense that they have an efficient solution that scales “reasonably” with the size of the problem. A cogent example here is neighbor joining (NJ), the running time of which scales no worse than the cube of the number of taxa (i.e., O(n³)). This is in stark contrast to the NP-complete maximum parsimony and maximum-likelihood methods, where the running times scale super-exponentially with respect to the problem size.
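To give a feel for the difference in scaling, the sketch below compares n³ with the number of distinct unrooted, fully bifurcating trees on n labeled taxa, (2n − 5)!!. That count is a standard result that is not stated in the text above, and it is used here only as a proxy for the search space an exhaustive parsimony or likelihood search would face, not as a measure of actual running time. The function name `n_unrooted_trees` is an illustrative assumption.

```python
def n_unrooted_trees(n_taxa):
    """(2n - 5)!!: number of unrooted, fully bifurcating trees on n labeled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


print(n_unrooted_trees(4))    # 3
print(n_unrooted_trees(5))    # 15
print(n_unrooted_trees(10))   # 2027025

# Cubic growth next to the growth of tree space.
for n in (5, 10, 20, 50):
    print(f"n = {n:2d}   n^3 = {n ** 3:6d}   trees = {n_unrooted_trees(n):.3e}")
```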
Saturation is a phenomenon attributed primarily to DNA sequence data and which arises because of the limited character state space for such data (i.e., the four nucleotides A, C, G, and T). As such, the potential for homoplasy in the form of either convergence or back mutation is high (e.g., two completely random DNA sequences are expected to be 25 % similar). Saturation can also occur, although it is less likely, in both amino-acid and morphological character data.
In practice, saturation is visualized by the degree of divergence between two sequences leveling off or plateauing with time since their divergence because faster evolving sites have experienced multiple substitutions (“multiple hits”) with the increased potential for homoplastic similarity. Another method is to look for deviations from an expected transition:transversion ratio of 1:2 in neutral/silent sites, given the faster rate of evolution for transitions compared to transversions and, again, greater opportunity for multiple hits.
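The plateau can be sketched numerically under one simple substitution model. The example below assumes the Jukes-Cantor (JC69) model, which is not mentioned in the text above and is chosen only because its expected proportion of differing sites has a closed form, p = 3/4 · (1 − e^(−4d/3)), where d is the true number of substitutions per site. The function name `expected_p_distance` is an illustrative assumption.

```python
import math


def expected_p_distance(d):
    """Expected proportion of differing sites under JC69 after d substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# The observed difference tracks the true distance at first, then levels off.
for d in (0.05, 0.1, 0.5, 1.0, 2.0, 5.0):
    p = expected_p_distance(d)
    print(f"true distance {d:4.2f} -> expected observed difference {p:.3f}")
```

The curve approaches but never exceeds 0.75, which matches the expectation that two completely random DNA sequences are about 25 % similar.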
- Adams EM III (1972) Consensus techniques and the comparison of taxonomic trees. Syst Zool 21:390–397
- Adams EM III (1986) N-trees as nestings: complexity, similarity, and consensus. J Classif 3:299–317
- Arnold CL, Matthews J, Nunn CL (2010) The 10k Trees website: a new online resource for primate phylogeny. Evol Anthropol 19:114–118
- Asher RJ, Müller J (2012) Molecular tools in palaeobiology: divergence and mechanisms. In: Asher RJ, Müller J (eds) From clone to bone: the synergy of morphological and molecular tools in palaeobiology. Cambridge studies in morphology and molecules: new paradigms in evolutionary biology, vol 4. Cambridge University Press, Cambridge, pp 1–15
- Baum BR (1992) Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon 41:3–10
- Beddard FE (1900) A book of whales. G.P. Putnam’s Sons, New York
- Bergsten J (2005) A review of long-branch attraction. Cladistics 21(2):163–193
- Bininda-Emonds ORP (2004b) New uses for old phylogenies: an introduction to the volume. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 3–14
- Bininda-Emonds ORP (2010) The future of supertrees: bridging the gap with supermatrices. Palaeodiversity 3(Suppl.):99–106
- Bininda-Emonds ORP, Jones KE, Price SA, Cardillo M, Grenyer R, Purvis A (2004) Garbage in, garbage out: data issues in supertree construction. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 267–280
- Bininda-Emonds ORP, Jones KE, Price SA, Grenyer R, Cardillo M, Habib M, Purvis A, Gittleman JL (2003) Supertrees are a necessary not-so-evil: a comment on Gatesy et al. Syst Biol 52(5):724–729
- Chen D, Diao L, Eulenstein O, Fernández-Baca D, Sanderson MJ (2003) Flipping: a supertree construction method. In: Janowitz MF, Lapointe F-J, McMorris FR, Mirkin B, Roberts FS (eds) Bioconsensus, DIMACS series in discrete mathematics and theoretical computer science, vol 61. American Mathematical Society, Providence, RI, pp 135–160
- Chippindale PT, Wiens JJ (1994) Weighting, partitioning, and combining characters in phylogenetic analysis. Syst Biol 43:278–287
- Cotton JA, Page RDM (2004) Tangled trees from multiple markers: reconciling conflict between phylogenies to build molecular supertrees. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 107–125
- de Queiroz A, Donoghue MJ, Kim J (1995) Separate versus combined analysis of phylogenetic evidence. Annu Rev Ecol Syst 26:657–681
- Degnan JH, Rosenberg NA (2006) Discordance of species trees with their most likely gene trees. PLoS Genet 2(5):762–768
- Farris JS, Kluge AG, Eckhardt MJ (1970) A numerical approach to phylogenetic systematics. Syst Zool 19:172–191
- Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 27:401–410
- Felsenstein J (1985b) Phylogenies and the comparative method. Am Nat 125:1–15
- Gatesy J, O’Grady P, Baker RH (1999) Corroboration among data sets in simultaneous analysis: hidden support for phylogenetic relationships among higher level artiodactyl taxa. Cladistics 15(3):271–313
- Gatesy J, Springer MS (2004) A critique of matrix representation with parsimony supertrees. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 369–388
- Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G (1979) Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool 28(2):132–163
- Gordon AD (1986) Consensus supertrees: the synthesis of rooted trees containing overlapping sets of labeled leaves. J Classif 3:31–39
- Harvey PH, Pagel MD (1991) The comparative method in evolutionary biology. Oxford University Press, Oxford
- Hillis DM (1987) Molecular versus morphological approaches to systematics. Annu Rev Ecol Syst 18:23–42
- Kluge AG (1989) A concern for evidence and a phylogenetic hypothesis of relationships among Epicrates (Boidae, Serpentes). Syst Zool 38:7–25
- Lapointe F-J, Cucumel G (1997) The average consensus procedure: combination of weighted trees containing identical or overlapping sets of taxa. Syst Biol 46(2):306–312
- Lapointe F-J, Levasseur C (2004) Everything you always wanted to know about the average consensus, and more. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 87–105
- Lee MS, Camens AB (2009) Strong morphological support for the molecular evolutionary tree of placental mammals. J Evol Biol 22(11):2243–2257. doi: 10.1111/j.1420-9101.2009.01843.x
- Lindqvist C, Schuster SC, Sun Y, Talbot SL, Qi J, Ratan A, Tomsho LP, Kasson L, Zeyl E, Aars J, Miller W, Ingolfsson O, Bachmann L, Wiig O (2010) Complete mitochondrial genome of a Pleistocene jawbone unveils the origin of polar bear. Proc Natl Acad Sci USA 107(11):5053–5057. doi: 10.1073/pnas.0914266107
- Liu LA, Yu LL, Edwards SV (2010) A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol 10
- Maddison WP (1997) Gene trees in species trees. Syst Biol 46(3):523–536
- Miller W, Schuster SC, Welch AJ, Ratan A, Bedoya-Reina OC, Zhao F, Kim HL, Burhans RC, Drautz DI, Wittekindt NE, Tomsho LP, Ibarra-Laclette E, Herrera-Estrella L, Peacock E, Farley S, Sage GK, Rode K, Obbard M, Montiel R, Bachmann L, Ingolfsson O, Aars J, Mailund T, Wiig O, Talbot SL, Lindqvist C (2012) Polar and brown bear genomes reveal ancient admixture and demographic footprints of past climate change. Proc Natl Acad Sci USA 109(36):E2382–E2390. doi: 10.1073/pnas.1210506109
- Mossel E, Roch S (2007) Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. http://arxiv.org/abs/0710.0262
- Murphy WJ, Janecka JE, Stadler T, Eizirik E, Ryder OA, Gatesy J, Meredith RW, Springer MS (2012) Response to comment on “impacts of the cretaceous terrestrial revolution and KPg extinction on mammal diversification”. Science 337(6090):34
- Nguyen N, Mirarab S, Warnow T (2012) MRL and SuperFine plus MRL: new supertree methods. Algorithms Mol Biol 7(1):3
- Page RDM (1994) Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol 43(1):58–77
- Page RDM (2002) Modified mincut supertrees. In: Guigó R, Gusfield D (eds) Proceedings of Algorithms in bioinformatics, second international workshop, WABI, Rome, Italy. Lecture Notes in computer science, vol 2452. Springer, Berlin, pp 537–552, 17–21 Sept 2002
- Piaggio-Talice R, Burleigh JG, Eulenstein O (2004) Quartet supertrees. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 173–191
- Ponstein J (1966) Matrices in graph and network theory. Van Gorcum, Assen, Netherlands
- Purvis A (1995a) A composite estimate of primate phylogeny. Philos Trans R Soc Lond B 348:405–421
- Purvis A (1995b) A modification to Baum and Ragan’s method for combining phylogenetic trees. Syst Biol 44:251–255
- Ranwez V, Berry V, Criscuolo A, Fabre PH, Guillemot S, Scornavacca C, Douzery EJ (2007) PhySIC: a veto supertree method with desirable properties. Syst Biol 56(5):798–817. doi: 10.1080/10635150701639754
- Ronquist F (1996) Matrix representation of trees, redundancy, and weighting. Syst Biol 45:247–253
- Ronquist F, Huelsenbeck JP, Britton T (2004) Bayesian supertrees. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 193–224
- Roshan U, Moret BME, Williams TL, Warnow T (2004) Performance of supertree methods on various data set decompositions. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 301–328
- Ross HA, Rodrigo AG (2004) An assessment of matrix representation with compatibility in supertree construction. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 35–63
- Sanderson MJ, Donoghue MJ, Piel W, Eriksson T (1994) TreeBASE: a prototype database of phylogenetic analyses and an interactive tool for browsing the phylogeny of life. Am J Bot 81(6):183
- Scornavacca C, Berry V, Lefort V, Douzery EJ, Ranwez V (2008) PhySIC_IST: cleaning source trees to infer more informative supertrees. BMC Bioinformatics 9:413. doi: 10.1186/1471-2105-9-413
- Semple C, Steel M (2000) A supertree method for rooted trees. Discrete Appl Math 105(1–3):147–158
- Stamatakis A (in press) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. doi: 10.1093/bioinformatics/btu033
- Strimmer K, von Haeseler A (1996) Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evol 13:964–969
- Swofford DL (2002) PAUP*. Phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Massachusetts
- Thorley JL, Wilkinson M (2003) A view of supertree methods. In: Janowitz MF, Lapointe F-J, McMorris FR, Mirkin B, Roberts FS (eds) Bioconsensus, DIMACS series in discrete mathematics and theoretical computer science, vol 61. American Mathematical Society, Providence, RI, pp 185–193
- Wilkinson M, Thorley JL, Pisani D, Lapointe F-J, McInerney JO (2004) Some desiderata for liberal supertrees. In: Bininda-Emonds ORP (ed) Phylogenetic supertrees: combining information to reveal the tree of life, computational biology, vol 4. Kluwer Academic, Dordrecht, the Netherlands, pp 227–246