Biology & Philosophy, Volume 25, Issue 4, pp 659–673

Gene sharing and genome evolution: networks in trees and trees in networks

Article

Abstract

Frequent lateral genetic transfer undermines the existence of a unique “tree of life” that relates all organisms. Vertical inheritance is nonetheless of vital interest in the study of microbial evolution, and knowing the “tree of cells” can yield insights into ecological continuity, the rates of change of different cellular characters, and the evolutionary plasticity of genomes. Notwithstanding within-species recombination, the relationships most frequently recovered from genomic data at shallow to moderate taxonomic depths are likely to reflect cellular inheritance. At the same time, it is clear that several types of ‘average signals’ from whole genomes can be highly misleading, and the existence of a central tendency must not be taken as prima facie evidence of vertical descent. Phylogenetic networks offer an attractive solution, since they can be formulated in ways that mitigate the misleading aspects of hybrid evolutionary signals in genomes. But the connections in a network typically show genetic relatedness without distinguishing between vertical and lateral inheritance of genetic material. The solution may lie in a compromise between strict tree-thinking and network paradigms: build a phylogenetic network, but identify the set of connections in the network that are potentially due to vertical descent. Even if a single tree cannot be unambiguously identified, choosing a subnetwork of putative vertical connections can still lead to drastic reductions in the set of candidate vertical hypotheses.

Keywords

Lateral genetic transfer; Microbial genomics; Tree of Life; Phylogenetic networks

Introduction

Whole-genome analyses of microbial evolution emerged soon after a sufficient number of genomes had been published (Huynen and Bork 1998; Lawrence and Ochman 1998). Early phylogenomic work (e.g., Fitz-Gibbon and House 1999; Snel et al. 1999; Tekaia et al. 1999; Wolf et al. 2001) was done optimistically, with a pervading assumption that a tree recovered from genomic data would accurately describe the relationships among the genomes covered, and by extension the organisms specified by those genomes. However, phylogenetic discordance (usually in the form of trees that had strongly supported but incompatible conclusions) persisted, and led to numerous attempts to quantify the amount of apparent phylogenetic signal that was due to (1) strictly vertical inheritance of genetic material, (2) lateral genetic transfer (LGT), and (3) other confounding factors such as cryptic paralogy, lineage sorting, and errors in phylogenetic inference (Clarke et al. 2002; Kurland et al. 2003; Gophna et al. 2005; Galtier and Daubin 2008). While (3) is no doubt a contributor to phylogenetic discordance, different methods based on phylogenetic trees, compatibility of phylogenetic profiles, and compositional outliers in the genome have all highlighted a significant role for LGT in genome evolution (Lawrence and Ochman 1998; Nakamura et al. 2004; Beiko et al. 2005; Dagan and Martin 2007). Different predictive methods target different aspects of the compositional or phylogenetic signal within genes, and often do not agree on which genes are the best candidates for transfer (Ragan 2001; Ragan et al. 2006). While each method will identify a distinct combination of true LGT cases and statistical ‘false positives’, it is clear from all approaches that LGT is not a rare process.

Given very high estimates for the rate of LGT from some published analyses, Doolittle and Bapteste (2007) argued that the notion of a microbial Tree of Life was a philosophical artefact that had been adopted for historical rather than scientific reasons. Critically, using a tree to represent genome evolution reflected an assumption of ‘pattern monism’ that coerces many different phylogenetic signals into a single representation that has no valid interpretation. While ‘rampant lateralists’ are correct in holding that the strict notion of a tree of genomes is invalidated by LGT, other units of genetic and cellular organization apart from the genome are interesting in their own right, and exhibit evolutionary patterns that can be described using a tree. Cells that reproduce through binary fission do not merge again, at least not in a symmetric way and likely not across great evolutionary distances as occurs with LGT (Dorward et al. 1989); therefore their patterns of descent can be accurately described with a rooted tree, even if such a tree does not conform to the original view of the “Tree of Life” as relating species rather than individuals. Whole genes may be inherited vertically or laterally as cohesive units, and most analyses to date have made this assumption, although recent evidence suggests that intra-gene recombination may play a greater role than previously appreciated (Inagaki et al. 2006; Chan et al. 2009a, b). Individual nucleotide residues (and possibly aggregates of adjacent residues) also evolve in a treelike fashion due to the semiconservative nature of DNA replication. But tree construction from such small units is unreliable for reasons of statistical inconsistency (e.g., Swofford et al. 2001).

Returning to genome-level analysis, Doolittle (2004) proposed the ship of Theseus as an analogy for genomes that undergo extensive LGT (see also Andam et al. 2010; Franklin-Hall 2010). Although the mode by which genes (here, planks in the boat) are inherited is not always vertical from parent cell to offspring cell, the organismal lineage imposes a meaningful continuity, even if no gene in the extant genome has a history that exactly tracks the cellular lineage. In such a scenario, the predominantly vertical signal in genome evolution might be recoverable and recognized as vertical in nature. But if LGT is sufficiently frequent, then the vertical signal will not stand out, and assigning the label of ‘vertical inheritance’ to any coherent signal will not be justified. The frequency of LGT therefore defines a continuum of possible representations of genome evolution, from a strict tree (no LGT), through simple networks with few reticulations (rare LGT or LGT concentrated into highways of gene sharing: see e.g., Andam et al. 2010), to a ‘haze’ in which the distribution of edges is not distinguishable from a random arrangement. The extremes of this continuum are not taken into serious consideration except as null hypotheses against which empirical observations can be compared (Creevey et al. 2004; Bapteste et al. 2005). In the context of pattern pluralism defined by Doolittle and Bapteste (2007), a ‘verticalist’ view of evolution is one in which patterns are plural but tractable (and justifiably recognized as vertical or lateral), whereas the ‘lateralist’ position is that LGT is sufficiently frequent that no mapping of patterns to different modes of inheritance can be carried out. 
Doolittle (2004) identified four points along a continuum of opinion concerning the impact of LGT, but the first three (LGT ‘conservatives’, ‘core vertical genome’ supporters, and ‘ship of Theseus’) all expect that a meaningful organismal (= cellular) tree exists, and is recoverable (at least in principle) from genomic data.

Trees from networks, Part 1: Imposing a tree on non-tree-like data

Tree thinking has dominated the first ~15 years of the era of microbial genome sequencing. Discordant gene trees were well known prior to the advent of whole genome sequences; these were thought to be due to methodological artifacts, or the occasional case of LGT (Woese et al. 1980; Hilario and Gogarten 1993). Rigorous phylogenetic approaches based on likelihood are statistically consistent as long as model assumptions are reasonably well met, but depending on the difficulty of the problem, tens or hundreds of thousands of sequence characters (e.g., nucleotide or amino acid residues in a multiple sequence alignment) may be necessary to reliably recover the correct signal (Swofford et al. 2001). Complete genome data presented researchers with thousands of genes per genome, and the opportunity was ripe for the ‘true’ evolutionary signal to finally outpace the noise. But new techniques were needed to extract such signals from complete genomes, and the strategies that gained widespread use were dependent on filtering and aggregation of the data. The implications and assumptions of any particular approach are dependent on the way(s) in which filtering and aggregation are carried out.

Filtering strategies

One approach to building genome histories is to narrow the scope of the analysis to a restricted data set, which can then be analyzed in a uniform way. The extreme of a filtering approach is to take a single gene that is present in the genomes of all organisms under consideration, infer its evolutionary history, and extrapolate the inferred relationships back to the set of genomes. The textbook Tree of Life with its splits between Bacteria, Archaea and Eukarya is based on phylogenetic analysis of 16S rDNA (Woese and Fox 1977). Other conserved genes are often used as markers, including RNA polymerase subunit β′ (Walsh et al. 2004) and RecA (Thompson et al. 2004). With the availability of many genes from each genome, ‘supermatrix’ approaches were developed in which a set of genes present in each genome were combined into a single alignment and treated as a single gene for the purposes of phylogenetic analysis. Examples of this approach include the concatenation of conserved loci from across a broad range of eukaryotes (Baldauf et al. 2000), and the construction of a concatenated alignment of 31 ‘LGT-free’ genes covering all three domains of life (Ciccarelli et al. 2006). Filtering approaches can produce robustly supported phylogenies, but are subject to several problems:

Statistical artifacts of phylogenetic inference

As data set sizes increase, the risk of recovering phylogenetic artifacts due to small sample size (‘stochastic error’) is diminished (Swofford et al. 2001). But another class of problems arises when assumptions of phylogenetic models concerning equal rates of evolution or compositional stability are violated; in such cases, adding more data can give even stronger support for an incorrect answer (Rodríguez-Ezpeleta et al. 2007). Recoding DNA to purine and pyrimidine character states can compensate for divergence in the relative abundance of A+T versus G+C nucleotides, and has a substantial impact on both single-gene and concatenated alignment-based (Phillips et al. 2004) phylogenies. The set of sequences under consideration may also lack the ability to resolve every relationship in the tree: for example, 16S rDNA sequences are often identical between strains of the same named species (Rocap et al. 2002; Jaspers and Overmann 2004).
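The purine/pyrimidine recoding mentioned above can be illustrated with a minimal sketch. It is not the procedure used in the cited studies: the function name `ry_recode`, the example sequences, and the choice to pass gaps and ambiguous symbols through unchanged are all assumptions made here for illustration.

```python
# Map each nucleotide to its chemical class: purines (A, G) become R,
# pyrimidines (C, T, and U for RNA) become Y.
RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def ry_recode(sequence: str) -> str:
    """Recode an aligned nucleotide sequence into R/Y character states.

    Gaps and any other symbols are left unchanged (an assumption made
    here for simplicity).
    """
    return "".join(RY_MAP.get(base, base) for base in sequence.upper())


# Two hypothetical aligned sequences with very different G+C content
# but the same purine/pyrimidine pattern.
at_rich = "ATAATTAAT-"
gc_rich = "GCGGCCGGC-"
print(ry_recode(at_rich))
print(ry_recode(gc_rich))
```

After recoding, the two sequences are identical, so a compositional difference between them can no longer be read as phylogenetic signal; the cost is that transitions within each class become invisible.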

Gene histories that do not match the cellular history

Apart from LGT, phenomena such as lineage sorting and paralogy can produce gene trees that do not match the cellular history (Degnan and Rosenberg 2009). One can attempt to mitigate the effects of paralogy by restricting the analysis to those genes for which only one homologous copy is present in each genome; however, this will drastically reduce the number of available genes for analysis (Lerat et al. 2003; Chan et al. 2009a). Notably, since the 16S rRNA gene is present in multiple divergent copies in some genomes (Case et al. 2007), it is not always a suitable candidate for this type of restricted analysis.
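The single-copy restriction described above amounts to a simple filter over gene-family assignments. The sketch below assumes that every gene has already been assigned to a homologous family; the function name, genome labels, and family labels are hypothetical.

```python
from collections import Counter


def single_copy_families(genomes: dict[str, list[str]]) -> set[str]:
    """Return the gene families with exactly one copy in every genome."""
    counts = {name: Counter(families) for name, families in genomes.items()}
    all_families = set().union(*(c.keys() for c in counts.values()))
    return {
        family
        for family in all_families
        if all(c[family] == 1 for c in counts.values())
    }


# Hypothetical family assignments, one list entry per gene copy.
genomes = {
    "genome_1": ["fam_A", "fam_B", "rRNA_16S", "rRNA_16S"],
    "genome_2": ["fam_A", "fam_B", "rRNA_16S"],
    "genome_3": ["fam_A", "fam_B", "rRNA_16S", "fam_C"],
}
print(sorted(single_copy_families(genomes)))
```

In this toy input the duplicated `rRNA_16S` family and the patchily distributed `fam_C` are both excluded, which shows how quickly the usable gene set shrinks as genomes are added.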

Mutually incompatible gene histories

A tree inference method will construct a tree, even if the underlying data are not treelike. When multiple conflicting phylogenetic signals are present in a multiple sequence alignment, the resulting tree may reflect the plurality signal or an ‘average signal’ that is not exhibited by any component of the data set. Posada and Crandall (2002) examined this effect in the context of genetic recombination, which induces very similar effects to LGT, and found that the phylogenetic consequences were dependent on the extent to which different parts of the alignment supported different histories, and the degree of difference between these histories. Not surprisingly, two trees that differ greatly (with a large nearest-neighbor interchange distance, for example) are more likely to produce a misleading averaging effect than two trees that differ minimally in their topology. Tests of statistical support such as the bootstrap or the Shimodaira-Hasegawa test (Shimodaira and Hasegawa 1999) are of little use in diagnosing these problems, because averaging artifacts can have very high statistical support values (Beiko et al. 2008). Given recent observations of LGT that disrupts single genes (Inagaki et al. 2006; Chan et al. 2009a, b), it is not even safe to assume that a single-gene phylogeny is immune to these effects.

Aggregation strategies

As an alternative to the filtering approach, aggregation strategies treat each character (e.g., gene) as a distinct trait to be analyzed separately, after which the complete set of signals can be combined into a comprehensive picture of genomic relatedness. Genome phylogeny methods typically use a genomic criterion such as the proportion of shared genes, the similarity of shared genes, conservation in gene order, or the similarity of compositional attributes, to compute pairwise distances between genomes (reviewed in Snel et al. 2005). These distances can then be used to infer a tree. More intensive approaches involve a ‘pipeline’ analysis whereby many gene trees are inferred independently, then combined or compared to generate a consensus signal. However, aggregation approaches are subject to the same statistical and averaging problems as the filtering methods described above, and consistent results between the two types of approach do not constitute independent evidence for any implied evolutionary history. As in the filtering cases above, forcing incompatible genomic signals into a single tree representation can still give well-supported relationships that are incorrect. In a genome phylogeny framework, simulations carried out with high levels of LGT yielded the desirable effect of collapsing nodes for which there was conflicting support, particularly those nodes near the base of the simulated tree of genomes (Beiko et al. 2008). However, deviations from random patterns of exchange led to the recovery of incorrect relationships with strong statistical support.
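As one concrete instance of a genomic distance criterion, the sketch below computes pairwise distances from the proportion of shared gene families. The specific formula (a Jaccard distance, one minus shared families over total families) is an assumption chosen for simplicity and is not claimed to be the measure used in any cited study; the genome and family names are hypothetical.

```python
from itertools import combinations


def gene_content_distance(a: set[str], b: set[str]) -> float:
    """Jaccard distance between two gene-family sets (0 = identical content)."""
    union = a | b
    if not union:
        return 0.0  # two empty genomes are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(genomes: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Pairwise distances for every unordered pair of genomes."""
    return {
        (x, y): gene_content_distance(genomes[x], genomes[y])
        for x, y in combinations(sorted(genomes), 2)
    }


genomes = {
    "g1": {"a", "b", "c", "d"},
    "g2": {"a", "b", "c", "e"},
    "g3": {"a", "f", "g", "h"},
}
for pair, dist in distance_matrix(genomes).items():
    print(pair, round(dist, 3))
```

The resulting matrix could be passed to any distance-based tree-building method. Note that a single number per genome pair necessarily blends vertical and lateral contributions, which is the averaging concern raised in the text.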

These results on simulated data show that an overly constrained tree structure may or may not reflect some subset of the evolutionary signal within a set of genomes. Since genomes exhibit biased patterns of LGT (Kunin et al. 2005; Beiko et al. 2005), phylogenetic averaging artefacts are likely to emerge when a tree-like structure is imposed on a set of genomes. Many aggregate approaches have an inherent advantage in that they infer the histories of all entities independently, enabling the use of comparisons to assess the consistency of support for a given central tendency (Creevey et al. 2004; Beiko et al. 2005; Puigbò et al. 2009), and the filtering or reweighting of results to emphasize certain evolutionary trajectories (Gophna et al. 2005).

How do we solve the problems outlined above? Model violations are important, but they have been the subject of intensive benchmarking and validation, and can now be tested for and corrected in a number of different ways (e.g., Ababneh et al. 2006). From a conceptual point of view, these problems can be set aside since they do not bear on the underlying representational question. The problem of gene trees not matching the tree of cellular divisions is one of process, and identification of such cases relies on the use of accessory evidence such as the presence of paralogous copies of a gene on a genome, or unusual compositional properties that may be indicative of recent LGT. If LGT were rare, then such trees would stand out against a backdrop of many trees that are in mutual agreement. But mutually incompatible gene histories yield a fundamental problem of representation: such relationships cannot be displayed using a strict hierarchical branching process. Given this limitation, and the demonstration that imposing a simplified tree structure on data with heterogeneous signals can produce trees that are completely misleading, we must either find a perfect filtering method that retains only vertical signals, or adopt a more-general network representation if we wish to capture the evolutionary patterns of interest in genomic data.

Networks from trees: assembling a genomic web

Given the fundamental limitations of tree-based approaches identified above, many authors have proposed a shift (or, more precisely, a generalization) from strictly tree-based to network-based approaches. The critical advantage of a network is that an entity in the network (terminal or internal node) can have multiple affinities that are incompatible with the strict branching pattern of a tree. In the context of a rooted network, which imposes ancestor/descendant relationships without forcing any given path to be ‘vertical’, this means that a given entity can have multiple ancestors. There has been extensive theoretical work on different types of phylogenetic networks, and there are many different classes with varying degrees of ability to capture and represent reticulated relationships among genomes (reviewed in Huson and Bryant 2006). Network reconstruction algorithms can be distinguished based on the input data they can accept as well as the type of network they produce. Since the principal concern here is the appropriateness of a representation, below I describe network types without considering in detail the different data types such as distances (Springman et al. 2009), bipartitions (Zhaxybayeva et al. 2009), and trees (Laing et al. 2009), that can be used to build these networks. Figure 1 shows several different ways of summarizing the relationships of a pair of trees: the two source trees shown in Fig. 1a differ in the positioning of only a single ‘hybrid’ leaf node (Hyb), but the impact on the resulting aggregate tree or network can be significant and depends on the choice of visualization used. Figure 1b shows a strict consensus tree which contains the intersection of all bipartitions in the two input trees: all relationships subsequent to the last common ancestor of Hyb and its two partners HP1 and HP2 are therefore collapsed. 
Completely resolved tree representations are possible, but would require either the favoring of one input tree over the other (based, for example, on an extrinsic hypothesis or a preponderance of genes supporting one tree), or a phylogenetic ‘average’ as described above.
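The collapsing behaviour of a strict consensus can be sketched with a toy example. The code below uses rooted clusters (the leaf set below each internal node) rather than unrooted bipartitions, and four-taxon trees rather than the 12-taxon trees of Fig. 1; the nested-tuple encoding and the outgroup label `Out` are assumptions made for illustration.

```python
def clusters(tree) -> set[frozenset]:
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node) -> frozenset:
        if isinstance(node, str):  # a leaf is represented by its name
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


# Two rooted trees that differ only in the placement of Hyb.
tree_1 = ((("HP1", "Hyb"), "HP2"), "Out")
tree_2 = (("HP1", ("Hyb", "HP2")), "Out")

# The strict consensus keeps only the clusters present in both trees.
shared = clusters(tree_1) & clusters(tree_2)
for cluster in sorted(shared, key=len):
    print(sorted(cluster))
```

Only the cluster containing HP1, HP2 and Hyb together (plus the full taxon set) survives, so the relationships among those three taxa collapse into an unresolved node.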
Fig. 1

Visualizations of conflicting trees, the resulting consensus tree and two different types of phylogenetic network. (a) The two rooted trees differ only in the positioning of a hybrid taxon (Hyb) relative to two putative parents (HP1 and HP2); expressed in terms of LGT, this scenario could arise where the genes in a given genome have phylogenetic affinities with two other lineages. (b) Consensus tree collapsing all relationships back to the common ancestor of HP1 and HP2. (c) Unrooted consensus network showing all splits contained in the two trees. (d) Rooted cluster network showing the full range of possible reticulations surrounding Hyb, HP1 and HP2. (e) Rooted galled network showing the reticulate origin of Hyb against an otherwise strictly bifurcating tree. (f) Reconciliation network in which the topological permutation implied by the comparison of the two input trees is mapped onto one of these trees. All trees and networks were constructed and visualized using either Dendroscope 2.3 (Huson et al. 2007) or SplitsTree 4.10 (Huson and Bryant 2006).

Split networks represent disagreement within a data set by showing a complete or restricted set of splits that are supported by the data. A simple example would involve two fully resolved trees, covering the same set of taxa, which induce incompatible bipartitions on those taxa. For example, if tree X1 contained an edge that split taxa T = {A, B, C, D} into groups {A, B} and {C, D}, while tree X2 induced a split of T into groups {A, C} and {B, D}, then the consensus representation of these two trees could not itself be a tree. However, a network formulation in which A has B and C as alternative partners would be sufficient to represent the relationships present in the input data. These networks do not explicitly represent the connections among entities in the input trees; instead, they enumerate the disagreements in affinity that are found in the source data (Huson and Bryant 2006). As such, reticulation-inducing events such as LGT will lead to a large number of conflicting splits from which relationships cannot easily be extracted (Beiko and Ragan 2009). Nonetheless, splits networks, particularly those constructed by the Neighbor-Net algorithm (Bryant and Moulton 2004) and consensus networks built from trees (Holland et al. 2004; see also Fig. 1c) have been widely used to represent phylogenetic uncertainty or disagreement, and can serve as a useful starting point for the inference of other types of phylogenetic network (Huson et al. 2005). A further distinguishing feature of splits networks is that they tend to be unrooted, making no assumptions about the horizontal or vertical nature of any given pattern of inheritance.
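The incompatibility of the two splits in the example above can be checked mechanically. The sketch below relies on the standard criterion that two splits of the same taxon set can coexist in one tree only if at least one of the four pairwise intersections of their sides is empty; the function name and data layout are assumptions.

```python
Split = tuple[frozenset, frozenset]


def compatible(split_1: Split, split_2: Split) -> bool:
    """True if both splits can be displayed by a single tree."""
    a, b = split_1
    c, d = split_2
    # Compatible when some side of one split is disjoint from some side
    # of the other.
    return any(not (x & y) for x in (a, b) for y in (c, d))


# The splits induced by trees X1 and X2 on T = {A, B, C, D}.
x1 = (frozenset("AB"), frozenset("CD"))
x2 = (frozenset("AC"), frozenset("BD"))
print(compatible(x1, x2))

# A five-taxon pair that can coexist in one tree.
s1 = (frozenset("AB"), frozenset("CDE"))
s2 = (frozenset("ABC"), frozenset("DE"))
print(compatible(s1, s2))
```

A split network is needed exactly when a collection of supported splits contains at least one incompatible pair such as `x1` and `x2`.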

Reticulation networks aim to explicitly represent the hybrid relationships among a set of taxa by making direct connections between lineages that are implicated in reticulate evolution. Genetic recombination and hybridization (Huber et al. 2006) have spurred the initial development of these representations; cluster networks and galled networks (Huson et al. 2009) are recent developments that can be used to represent reticulate relationships that have arisen due to LGT. These different types of networks differ in their interpretability and the efficiency of the algorithms used to construct them: cluster networks tend to be more complex, but have a relatively efficient polynomial-time solution (Huson and Rupp 2008; see Fig. 1d). ‘Galled’ networks, so called because they attach reticulation features (‘galls’) to an underlying backbone tree, require fewer reticulations to represent a series of events (Fig. 1e), but their construction depends on a pair of NP-complete problems which have no efficient exact solution. In spite of this limitation, Huson et al. (2009) showed that inference of such networks was feasible for relatively large (>100 taxa) sets of trees. Galled networks are a generalization of galled trees (Gusfield et al. 2003), which do not permit galls to overlap, and are therefore inappropriate for a general representation of LGT relationships.

A third type of network might be termed reconciliation networks, in which a set of trees are reconciled with a reference topology. Tree permutation operators such as subtree prune-and-regraft (SPR) induce topological changes that are similar to those caused by LGT, and the reconciliation is usually achieved via iterative applications of these operations to one tree until it is reconciled with another (Nakhleh et al. 2005; Beiko and Hamilton 2006; Hickey et al. 2008; see Fig. 1f and visualizations in MacLeod et al. 2005). The series of operations thus recovered are then assembled into a global picture of network-like relationships. The direct linkage of operation to evolutionary process is a distinct advantage of these methods, but the aggregation of potentially large numbers of edit operations across many trees and the lack of efficient reconciliation solutions pose significant problems to this class of approaches (Beiko and Ragan 2009). Most importantly, the need to propose a reference tree upon which operations are to be overlaid returns us to the problems of tree forcing outlined above: we will have either one piece of the evolutionary puzzle, or none if our tree is reflective of averaging artifacts.
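A single subtree prune-and-regraft move can be sketched on a toy rooted tree. This is a deliberately restricted illustration, not any of the cited reconciliation algorithms: it moves only a single leaf, regrafts it as the sister of another leaf, and assumes a strictly binary tree encoded as nested tuples. The taxon label `Out` is hypothetical.

```python
def prune(tree, leaf):
    """Remove `leaf` and suppress the resulting single-child node."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    if left == leaf:
        return right
    if right == leaf:
        return left
    return (prune(left, leaf), prune(right, leaf))


def regraft(tree, target, leaf):
    """Attach `leaf` as the sister of the leaf named `target`."""
    if isinstance(tree, str):
        return (tree, leaf) if tree == target else tree
    left, right = tree
    return (regraft(left, target, leaf), regraft(right, target, leaf))


def spr_leaf(tree, leaf, target):
    """One leaf-level SPR move: prune `leaf`, then regraft it next to `target`."""
    return regraft(prune(tree, leaf), target, leaf)


tree_1 = ((("HP1", "Hyb"), "HP2"), "Out")
print(spr_leaf(tree_1, "Hyb", "HP2"))
```

Moving Hyb from beside HP1 to beside HP2 converts one topology into the other in a single operation, which is the sense in which an SPR move mimics a lateral transfer event.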

Which type of network can best capture the reticulate patterns amongst genomes? The galled network (Fig. 1e) and reconciliation network (Fig. 1f) are both attractive, because they introduce a minimal number of extra network edges to represent the single implied lateral or hybrid relationship. In doing so we retain a clear and accurate depiction of the relationships that are common to both trees, and the simplest possible representation of the divergent patterns found in the trees. However, there are fundamental differences between the two approaches that produce distinct advantages and disadvantages. To date, reticulation networks have depended on a notion of directionality of evolution; for example, the input trees used to produce the networks in Fig. 1d and e must be rooted. But rooting phylogenetic trees requires either an explicit or implicit outgroup, or the assumption of a molecular clock (e.g., for midpoint rooting of a tree). The latter assumption is frequently violated, especially when evolutionary distances are large (Kuo and Ochman 2009), while the former criterion is invalid if we allow for rampant LGT, since gene outgroups will vary from tree to tree in ways that cannot be identified. The requirement for rooting of reconciliation networks is not as stringent, because some approaches (e.g., Beiko and Hamilton 2006) require only the reference tree to be rooted. However, the use of a reference tree requires the choice of a single ‘privileged’ path of vertical descent, onto which other (by comparison, vertical or lateral) paths can be mapped. Given the serious challenges facing the recovery of a vertical Tree of Cells, especially at the deepest levels of divergence, proposing a single, unambiguously vertical tree is likely impossible (Creevey et al. 2004; Puigbò et al. 2009).

Phylogenetic network methods (which include aggregate tree approaches) are subject to the constraints of computational tractability: the complexity of reticulation networks and reconciliation networks has been noted above, although approximations and heuristic algorithms are active areas of research. Current implementations struggle to handle hundreds or thousands of input trees covering hundreds or thousands of taxa. A typical analysis will constrain either the set of genes (Huson et al. 2009) or the set of taxa (Lerat et al. 2005) under consideration, but both of these approaches can potentially miss important and interesting phylogenetic signals if certain critical taxa or genes are eliminated.

Trees from networks, part 2: extracting the tree(s) of cells

To the set of network algorithms outlined in the preceding section, we can add structures that can be conceived of even if they have not yet been explored in depth. In particular, unrooted reticulation networks could represent conflicting signals without the need for rooted input trees, and without introducing a confusing series of splits that obfuscate the relationships being shown. A network that can theoretically and minimally represent all of the reticulate relationships in a data set will display all evolutionary trajectories of interest, allow investigations into the degree of mosaicism in different taxon and gene sets, and reveal sets of genes that may be inherited en bloc or over a short span of time, either vertically or laterally. What reticulation networks fail to do, and reconciliation networks can do only by imposing unacceptable constraints, is distinguish vertical from lateral histories, i.e., highlight the Tree of Cells against a backdrop of LGT. Identifying a ‘privileged’ set of connections in a network that correspond to vertical inheritance pathways would highlight genes that tend to be inherited from parent to offspring, describe a path of ecological continuity throughout the evolutionary history of a group of organisms, and allow the identification of cellular traits that are slow to change. With this information as a backdrop, rare events such as significant shifts in cell wall structure or chemistry (Cavalier-Smith 2006) and wholesale transfer of genes encoding multiprotein complexes such as the flagellum (Desmond et al. 2007) would also be identified.

How can we identify such privileged connections in a reticulation network? Regrettably, in most cases the data that we would like to use to this end are inextricably linked up with the hypotheses under consideration. For example, one could propose that for each set of mutually exclusive alternative connections in a network, the one that is supported by a plurality of genes will be designated as the vertical edge. In doing this, we will have already progressed beyond the phylogenetic averaging artefacts described above, because the network will contain only those relationships that are supported by the input data. However, in deep parts of the tree the plurality signal may be one of several with statistically similar support (Creevey et al. 2004). Preferring a single edge under these circumstances would not be reasonable. Furthermore, some groups of closely related genomes may have multiple alternative connections to very divergent ‘other’ lineages: one of the best-studied examples of these is the hyperthermophile Aquifex aeolicus, which has strong genetic affinities with (among others) the Thermotogae, certain classes of Proteobacteria, and certain lineages of Archaea. The question of the ‘sister’ group to Aquificae remains controversial and may never be resolved (Cavalier-Smith 2002; Beiko et al. 2005; Boussau et al. 2008), and the emergence of a single new genome (e.g., thermophilic Epsilon-proteobacteria that group even more strongly with A. aeolicus) may either clarify or further confuse the question. Vertical or quasi-vertical edges may also be assigned based on agreement with the phylogenetic tendencies of genes that are thought to be recalcitrant to transfer (Jain et al. 1999). But no gene has been shown to be completely untransferable, and many examples of informational protein transfer have been reported (Omelchenko et al. 2003). Again, choosing privileged edges cannot be done conclusively.
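The plurality rule discussed above, together with its main caveat, can be sketched as follows. The fixed `min_margin` threshold is an assumption standing in for a proper statistical comparison of support, and the edge labels and gene counts are invented for illustration.

```python
from typing import Optional


def pick_vertical_edge(
    support: dict[str, int], min_margin: float = 0.2
) -> Optional[str]:
    """Return the plurality edge among mutually exclusive alternatives.

    Returns None when the leading edge is not clearly ahead of the
    runner-up. `min_margin` is a hypothetical cutoff on the difference
    in gene counts, expressed as a fraction of all supporting genes.
    """
    total = sum(support.values())
    if total == 0:
        return None
    ranked = sorted(support.items(), key=lambda item: item[1], reverse=True)
    best_edge, best_count = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if (best_count - runner_up) / total < min_margin:
        return None  # alternatives have similar support; prefer none
    return best_edge


# A shallow node with a dominant signal, and a deep node without one.
shallow = {"A-B": 120, "A-C": 30, "A-D": 10}
deep = {"A-B": 40, "A-C": 38, "A-D": 35}
print(pick_vertical_edge(shallow))
print(pick_vertical_edge(deep))
```

Returning `None` rather than forcing a choice reflects the argument in the text: where several alternatives have comparable support, all of them remain candidate vertical edges.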

But often lost in the tree versus network debate is the fact that a network can implicitly reject the vast majority of alternative hypotheses (but see Bucknam et al. 2006 for an example of the use of trees to reject candidate relationships). The galled network in Fig. 1e is not an unambiguous tree, and indeed such networks built on genetic data from prokaryotes can be quite complex and contain many reticulations (Huson et al. 2009). If one compares, however, a recovered network against the number of possible relationships in a network covering a given number of taxa, it will usually be the case that most possible relationships are not displayed. In Fig. 1 there are $n = 12$ taxa, and any completely resolved, rooted tree will contain $n - 2 = 10$ bipartitions (ignoring the trivial bipartitions at the tips of the tree), while the unrooted equivalent will contain $n - 3 = 9$ bipartitions. A set of $n$ entities can be grouped into $2^n$ distinct subsets, including the complete set and the empty set. Consequently the total number of possible non-trivial bipartitions in a completely connected phylogenetic network is equal to $2^{n-1} - n - 1$, where the exponent $n - 1$ arises because sets are paired in a bipartition, $n$ is the number of trivial bipartitions, and 1 accounts for the complete set/empty set pair; this gives 2,035 when $n = 12$. The unrooted split network in Fig. 1c contains a total of 13 non-trivial bipartitions, or 4 more than would be present in an unrooted tree, but still over 100 times fewer than the maximum possible. Even the characteristic ‘squash’ seen in the center of many split network reconstructions (e.g., Puigbò et al. 2009) still rejects many alternative relationships. And starting from a reconstructed reticulation network, it may be possible to reject certain network edges as candidate vertical edges if, for instance, they are caused by a small number of genes relative to the plurality signal, or if the only genes represented are known to be readily exchanged.
Once such edges have been eliminated from consideration, the remaining features will constitute a candidate network of vertical evolution. Such a network will likely contain more reticulations near its base (or its ‘center’, if unrooted), but may also contain many unambiguous edges that uniquely support a given grouping of genomes. Hypotheses that can be tested using a Tree of Cells, including those outlined above, can be adapted to such a network either by considering all possible trees exhibited by that network, or by adapting tree-like notions of conservation and predictive power to the constrained network model.
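
The bipartition counts quoted above for n = 12 can be checked with a short sketch. The function names are illustrative assumptions, and the value 13 is simply the count reported in the text for the split network in Fig. 1c; the brute-force enumeration is included only to confirm the closed-form expression.

```python
from itertools import combinations


def max_nontrivial_bipartitions(n: int) -> int:
    """Closed form: 2^(n-1) paired subsets, minus n trivial
    bipartitions, minus the complete set/empty set pair."""
    return 2 ** (n - 1) - n - 1


def count_by_enumeration(n: int) -> int:
    """Brute-force check: enumerate every split of n taxa in which
    both sides contain at least two taxa."""
    taxa = frozenset(range(n))
    seen = set()
    for k in range(2, n - 1):  # side sizes 2 .. n-2
        for side in combinations(taxa, k):
            s = frozenset(side)
            # A split and its complement are the same bipartition.
            seen.add(frozenset((s, taxa - s)))
    return len(seen)


n = 12
rooted_tree = n - 2      # resolved rooted tree
unrooted_tree = n - 3    # resolved unrooted tree
network_splits = 13      # count reported in the text for Fig. 1c
maximum = max_nontrivial_bipartitions(n)

print(rooted_tree, unrooted_tree)      # 10 9
print(maximum)                         # 2035
print(count_by_enumeration(n))         # 2035
print(network_splits - unrooted_tree)  # 4
print(maximum // network_splits)       # 156
```

The final ratio shows that the 13 displayed bipartitions are roughly 1/156 of those that a completely connected network on 12 taxa could contain.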

Conclusion

Doolittle (2004) argued:

…any “Ship of Theseus” phylogeny is not unambiguously constructible, and is so far from the original conceptual understanding of the Tree of Life as to require a radical reworking of this understanding, not some subtle terminological negotiation.

But microorganisms are not ships: for them there is no safe harbor where they can retool, and they cannot survive merely through replacement of worn-out parts. The first condition imposes functional and ecological continuity on lineages of microorganisms, and underscores the importance of the microbial Tree of (organismal) Life. Were Theseus’ ship to remain in continuous service while changing from a Greek galley into a modern battleship over the course of millennia, the very existence of a continuous path would be of tremendous importance to understanding the nature of sailing vessels. The simple fact that lineages are recognizable, and we are consequently presented with a network rather than a ‘haze’ of life, demonstrates some degree of organismal continuity (see Andam et al. 2010; Franklin-Hall 2010, for additional perspectives on the ship of Theseus analogy).

It is important to recognize that the network approaches outlined above will not free us from all of the constraints imposed by LGT and gene content change in general: as we trace relationships back through a tree or network, we will lose predictive power because our confidence in ancestral gene content will diminish. It is perhaps reasonable to map the majority of ‘core’ genes of a given species (however defined) or genus back to a hypothetical common cellular ancestor or ancestral population, but the size of this core will decrease as we cast the genomic net more widely. Conversely, the pan-genome, ‘accessory genome’ or variable genome, comprising genes with patchy distributions in a group (Tettelin et al. 2005), will grow as more genomes are added. Not only will we be uncertain about which components of the (shrinking) core and (expanding) variable genome were present in a hypothesized common ancestor, there will also be a complement of ‘invisible’ genes that were present in this ancestor but shared only with lineages that have not yet been sampled, or have gone extinct. These problems are not induced solely by processes of LGT: gene duplication and loss, which can occur in a strictly vertical framework, also increase the uncertainty about ancestral genome content, and genes present in an ancestor will become extinct if they are lost from all lineages that contained them.
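
The opposing trends in core and variable genome size can be seen in a toy calculation. The gene identifiers and genome contents below are invented, and `core_and_variable` is a hypothetical helper that treats each genome as a set of gene-family labels; real analyses depend on orthology assignment, which this sketch takes as given.

```python
from typing import List, Set, Tuple


def core_and_variable(genomes: List[Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, variable) for a non-empty list of gene sets.

    core     -- genes present in every genome
    variable -- genes present in at least one genome but not in all
    """
    if not genomes:
        raise ValueError("at least one genome is required")
    core = set.intersection(*genomes)
    union = set.union(*genomes)
    return core, union - core


# Invented gene-family labels for three hypothetical genomes.
genomes = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g3", "g5"},
]

# Add genomes one at a time and watch the core shrink
# while the variable genome grows.
for k in range(1, len(genomes) + 1):
    core, variable = core_and_variable(genomes[:k])
    print(k, len(core), len(variable))
# 1 3 0
# 2 2 2
# 3 1 4
```

Genes that were present in the ancestor but are absent from every sampled genome never appear in either set, which is the ‘invisible’ complement described above.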

If the importance of the organismal tree is recognized, then the central question is not “Is there a tree of life?” but rather “To what extent can the organismal tree be recovered from the available data?” An extreme lateralist might argue that questions about monophyletic, paraphyletic or polyphyletic origins of groups such as the Gamma-proteobacteria are meaningless due to the effects of rampant LGT. But if the candidate edges within a phylogenetic network can be sufficiently restricted (either in the initial construction or through a process of post hoc elimination), then groupings will emerge that can be examined in a vertical context, without the misleading effects of signal averaging. Tapping the resulting organismal continuity will be an important step toward understanding the evolution of microorganisms, individually and in interacting communities.

Notes

Acknowledgments

I would like to thank the participants of the 2009 “Perspectives on the Tree of Life” symposium, sponsored by the Leverhulme Trust, for lively discussions and valuable insights, and am particularly indebted to two anonymous referees, Ford Doolittle and Donovan Parks for comments on earlier versions of the manuscript. I also acknowledge the financial support of Genome Atlantic and the Canada Research Chairs program.

References

  1. Ababneh F, Jermiin LS, Ma C, Robinson J (2006) Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics 22:1225–1231
  2. Andam CP, Williams D, Gogarten JP (2010) Natural taxonomy in light of horizontal gene transfer. Biol Philos
  3. Baldauf SL, Roger AJ, Wenk-Siefert I, Doolittle WF (2000) A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290:972–977
  4. Bapteste E, Susko E, Leigh J, MacLeod D, Charlebois RL, Doolittle WF (2005) Do orthologous gene phylogenies really support tree-thinking? BMC Evol Biol 5:33
  5. Beiko RG, Hamilton N (2006) Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol 6:15
  6. Beiko RG, Ragan MA (2009) Untangling hybrid phylogenetic signals: horizontal gene transfer and artifacts of phylogenetic reconstruction. Methods Mol Biol 532:241–256
  7. Beiko RG, Harlow TJ, Ragan MA (2005) Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102:14332–14337
  8. Beiko RG, Doolittle WF, Charlebois RL (2008) The impact of reticulate evolution on genome phylogeny. Syst Biol 57:844–856
  9. Boussau B, Guéguen L, Gouy M (2008) Accounting for horizontal gene transfers explains conflicting hypotheses regarding the position of aquificales in the phylogeny of Bacteria. BMC Evol Biol 8:272
  10. Bryant D, Moulton V (2004) Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol Biol Evol 21:255–265
  11. Bucknam J, Boucher Y, Bapteste E (2006) Refuting phylogenetic relationships. Biol Direct 1:26
  12. Case RJ, Boucher Y, Dahllöf I, Holmström C, Doolittle WF, Kjelleberg S (2007) Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl Environ Microbiol 73:278–288
  13. Cavalier-Smith T (2002) The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial mega classification. Int J Syst Evol Microbiol 52:7–76
  14. Cavalier-Smith T (2006) Rooting the tree of life by transition analyses. Biol Direct 1:19
  15. Chan CX, Beiko RG, Darling AE, Ragan MA (2009a) Lateral transfer of genes and gene fragments in prokaryotes. Gen Biol Evol. doi:10.1093/gbe/evp044
  16. Chan CX, Darling AE, Beiko RG, Ragan MA (2009b) Are protein domains modules of lateral genetic transfer? PLoS ONE 4:e4524
  17. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P (2006) Toward automatic reconstruction of a highly resolved tree of life. Science 311:1283–1287
  18. Clarke GD, Beiko RG, Ragan MA, Charlebois RL (2002) Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J Bacteriol 184:2072–2080
  19. Creevey CJ, Fitzpatrick DA, Philip GK, Kinsella RJ, O’Connell MJ, Pentony MM, Travers SA, Wilkinson M, McInerney JO (2004) Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc Biol Sci 271:2551–2558
  20. Dagan T, Martin W (2007) Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci U S A 104:870–875
  21. Degnan JH, Rosenberg NA (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol 24:332–340
  22. Desmond E, Brochier-Armanet C, Gribaldo S (2007) Phylogenomics of the archaeal flagellum: rare horizontal gene transfer in a unique motility structure. BMC Evol Biol 7:106
  23. Doolittle WF (2004) If the Tree of Life fell, would we recognize the sound? In: Sapp J (ed) Microbial evolution: concepts and controversies. Oxford University Press, USA, pp 119–133
  24. Doolittle WF, Bapteste E (2007) Pattern pluralism and the Tree of Life hypothesis. Proc Natl Acad Sci U S A 104:2043–2049
  25. Dorward DE, Garon CF, Judd RC (1989) Export and intercellular transfer of DNA via membrane blebs of Neisseria gonorrhoeae. J Bacteriol 171:2499–2505
  26. Fitz-Gibbon ST, House CH (1999) Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res 27:4218–4222
  27. Franklin-Hall L (2010) Trashing the tree: Bad reasons and good reasons. Biol Philos
  28. Galtier N, Daubin V (2008) Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond B Biol Sci 27:1512
  29. Gophna U, Doolittle WF, Charlebois RL (2005) Weighted genome trees: refinements and applications. J Bacteriol 187:1305–1316
  30. Gusfield D, Eddhu S, Langley C (2003) Efficient reconstruction of phylogenetic networks with constrained recombination. In: Proceedings of the IEEE CSB 2003, Stanford, CA, USA, p 363
  31. Hickey G, Dehne F, Rau-Chaplin A, Blouin C (2008) SPR distance computation for unrooted trees. Evol Bioinform Online 4:17–27
  32. Hilario E, Gogarten JP (1993) Horizontal transfer of ATPase genes–the tree of life becomes a net of life. Biosystems 31:111–119
  33. Holland BR, Huber KT, Moulton V, Lockhart PJ (2004) Using consensus networks to visualize contradictory evidence for species phylogeny. Mol Biol Evol 21:1459–1461
  34. Huber KT, Oxelman B, Lott M, Moulton V (2006) Reconstructing the evolutionary history of polyploids from multilabeled trees. Mol Biol Evol 23:1784–1791
  35. Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 23:254–267
  36. Huson DH, Rupp R (2008) Summarizing multiple gene trees using cluster networks. In: Crandall K, Lagergren J (eds) Algorithms in bioinformatics, WABI 2008, 5251. Berlin/Heidelberg: Springer, pp 211–225. In Lecture Notes in Bioinformatics (LNBI)
  37. Huson DH, Klöpper TH, Lockhart PJ, Steel MA (2005) Reconstruction of reticulate networks from gene trees. In: Miyano S et al (eds) Research in computational biology. Lecture Notes in Computer Science, vol 3500. Springer-Verlag, Berlin, pp 233–249
  38. Huson DH, Richter DC, Rausch C, Dezulian T, Franz M, Rupp R (2007) Dendroscope: an interactive viewer for large phylogenetic trees. BMC Bioinformatics 8:460
  39. Huson DH, Rupp R, Berry V, Gambette P, Paul C (2009) Computing galled networks from real data. Bioinformatics 25:i85–i93
  40. Huynen MA, Bork P (1998) Measuring genome evolution. Proc Natl Acad Sci U S A 95:5849–5856
  41. Inagaki Y, Susko E, Roger AJ (2006) Recombination between elongation factor 1-alpha genes from distantly related archaeal lineages. Proc Natl Acad Sci U S A 103:4528–4533
  42. Jain R, Rivera MC, Lake JA (1999) Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci U S A 96:3801–3806
  43. Jaspers E, Overmann J (2004) Ecological significance of microdiversity: identical 16S rRNA gene sequences can be found in bacteria with highly divergent genomes and ecophysiologies. Appl Environ Microbiol 70:4831–4839
  44. Kunin V, Goldovsky L, Darzentas N, Ouzounis CA (2005) The net of life: reconstructing the microbial phylogenetic network. Genome Res 15:954–959
  45. Kuo C-H, Ochman H (2009) Inferring clocks when lacking rocks: the variable rates of molecular evolution in bacteria. Biol Direct 4:35
  46. Kurland CG, Canback B, Berg OG (2003) Horizontal gene transfer: a critical view. Proc Natl Acad Sci U S A 100:9658–9662
  47. Laing CR, Buchanan C, Taboada EN, Zhang Y, Karmali MA, Thomas JE, Gannon VP (2009) In silico genomic analyses reveal three distinct lineages of Escherichia coli O157:H7, one of which is associated with hyper-virulence. BMC Genomics 10:287
  48. Lawrence JG, Ochman H (1998) Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A 95:9413–9417
  49. Lerat E, Daubin V, Moran NA (2003) From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol 1:e19
  50. Lerat E, Daubin V, Ochman H, Moran NA (2005) Evolutionary origins of genomic repertoires in bacteria. PLoS Biol 3:e130
  51. MacLeod D, Charlebois RL, Doolittle WF, Bapteste E (2005) Deduction of probable events of lateral gene transfer through comparison of phylogenetic trees by recursive consolidation and rearrangement. BMC Evol Biol 5:27
  52. Nakamura Y, Itoh T, Matsuda H, Gojobori T (2004) Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat Genet 36:760–766
  53. Nakhleh L, Ruths D, Wang LS (2005) RIATA-HGT: a fast and accurate heuristic for reconstructing horizontal gene transfer. Lect Notes Comput Sci 3595:84–93
  54. Omelchenko MV, Makarova KS, Wolf YI, Rogozin IB, Koonin EV (2003) Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biol 4:R55
  55. Phillips MJ, Delsuc F, Penny D (2004) Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol 21:1455–1458
  56. Posada D, Crandall KA (2002) The effect of recombination on the accuracy of phylogeny estimation. J Mol Evol 54:396–402
  57. Puigbò P, Wolf YI, Koonin EV (2009) Search for a ‘Tree of Life’ in the thicket of the phylogenetic forest. J Biol 8:59
  58. Ragan MA (2001) On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett 201:187–191
  59. Ragan MA, Harlow TJ, Beiko RG (2006) Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends Microbiol 14:4–8
  60. Rocap G, Distel DL, Waterbury JB, Chisholm SW (2002) Resolution of Prochlorococcus and Synechococcus ecotypes by using 16S–23S ribosomal DNA internal transcribed spacer sequences. Appl Environ Microbiol 68:1180–1191
  61. Rodríguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF, Philippe H (2007) Detecting and overcoming systematic errors in genome-scale phylogenies. Syst Biol 56:389–399
  62. Shimodaira H, Hasegawa M (1999) Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol 16:1114–1116
  63. Snel B, Bork P, Huynen MA (1999) Genome phylogeny based on gene content. Nat Genet 21:108–110
  64. Snel B, Huynen MA, Dutilh BE (2005) Genome trees and the nature of genome evolution. Annu Rev Microbiol 59:191–209
  65. Springman AC, Lacher DW, Wu G, Milton N, Whittam TS, Davies HD, Manning SD (2009) Selection, recombination, and virulence gene diversity among Group B Streptococcal genotypes. J Bacteriol 191:5419–5427
  66. Swofford DL, Waddell PJ, Huelsenbeck JP, Foster PG, Lewis PO, Rogers JS (2001) Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst Biol 50:525–539
  67. Tekaia F, Lazcano A, Dujon B (1999) The genomic tree as revealed from whole proteome comparisons. Genome Res 9:550–557
  68. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O’Connor KJ, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, Fraser CM (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A 102:13950–13955
  69. Thompson CC, Thompson FL, Vandemeulebroecke K, Hoste B, Dawyndt P, Swings J (2004) Use of recA as an alternative phylogenetic marker in the family Vibrionaceae. Int J Syst Evol Microbiol 54:919–924
  70. Walsh DA, Bapteste E, Kamekura M, Doolittle WF (2004) Evolution of the RNA polymerase B’ subunit gene (rpoB’) in Halobacteriales: a complementary molecular marker to the SSU rRNA gene. Mol Biol Evol 21:2340–2351
  71. Woese CR, Fox GE (1977) Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A 74:5088–5090
  72. Woese CR, Gibson J, Fox GE (1980) Do genealogical patterns in purple photosynthetic bacteria reflect interspecific gene transfer? Nature 283:212–214
  73. Wolf YI, Rogozin IB, Grishin NV, Tatusov RL, Koonin EV (2001) Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol 20:8
  74. Zhaxybayeva O, Doolittle WF, Papke RT, Gogarten JP (2009) Intertwined evolutionary histories of marine Synechococcus and Prochlorococcus marinus. Gen Biol Evol 2009:325

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Faculty of Computer Science, Dalhousie University, Halifax, Canada
