Modeling Gene Family Evolution and Reconciling Phylogenetic Discord

Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 856)

Abstract

Large-scale databases are available that contain homologous gene families constructed from hundreds of complete genome sequences from across the three domains of life. Here, we discuss the approaches of increasing complexity aimed at extracting information on the pattern and process of gene family evolution from such datasets. In particular, we consider the models that invoke processes of gene birth (duplication and transfer) and death (loss) to explain the evolution of gene families. First, we review birth-and-death models of family size evolution and their implications in light of the universal features of family size distribution observed across different species and the three domains of life. Subsequently, we proceed to recent developments on models capable of more completely considering information in the sequences of homologous gene families through the probabilistic reconciliation of the phylogenetic histories of individual genes with the phylogenetic history of the genomes in which they have resided. To illustrate the methods and results presented, we use data from the HOGENOM database, demonstrating that the distribution of homologous gene family sizes in the genomes of the eukaryota, archaea, and bacteria exhibits remarkably similar shapes. We show that these distributions are best described by models of gene family size evolution, where for individual genes the death (loss) rate is larger than the birth (duplication and transfer) rate but new families are continually supplied to the genome by a process of origination. Finally, we use probabilistic reconciliation methods to take into consideration additional information from gene phylogenies, and find that, for prokaryotes, the majority of birth events are the result of transfer.

Key words

Gene family evolution; Gene duplication; Gene loss; Horizontal gene transfer; Birth-and-death models; Reconciliation

1 Introduction

The strongest evidence for the universal ancestry of all life on Earth comes from two sources: (1) the shared molecular characters essential to the functioning of the cell, such as fundamental biological polymers, core metabolism, and the nearly universal genetic code; (2) sequence similarity between functionally related proteins in the bacteria, archaea, and eukaryota (1, 2). However, the majority of functionally related genes, similar to other phylogenetic characters, exhibit a more restricted distribution and consequently, taken separately, can only provide phylogenetic information on finer scales. Nonetheless, considered together, the ensemble of related sequences carries a comprehensive record of the evolutionary history and mechanisms that have generated them (3). Sequence similarity on these finer scales has been used to construct large-scale databases of putative sets of sequences of common ancestry, in particular homologous proteins and protein domains. At present, such databases constructed from hundreds of complete genome sequences from across the three domains of life are available. Here, we discuss the methods capable of extracting information on the pattern and process of genome evolution from large-scale datasets composed of homologous gene families.

2 Birth-and-Death Processes and the Shape of the Protein Universe

The majority of bacterial, archaeal, and eukaryotic genes belong to homologous families (4) which together contain a potential treasure trove of information on the pattern and process of descent of these genes, and the genomes in which they reside. A qualitative examination of the number of family members in genomes and the phylogenetic distribution of the families reveals two important patterns: (1) the distribution of the majority of homologous gene families is not universal, but phylogenetically limited and (2) many families contain multiple members from the same genomes while at the same time being characterized by a patchy distribution. These observations imply that (1) some process of gene origination must exist that results in the ongoing generation of sequences sufficiently different to be seen as a novel gene family and (2) processes of gene birth capable of creating new genes with recognizable homology from the existing ones must also exist in parallel with processes of gene death leading to the loss of existing genes.

Considering the latter case first, several molecular mechanisms are known to be involved in the creation of new gene structures in a genome. Among eukaryotes, a range of mechanisms are known to be capable of producing gene-sized duplications of genetic material. These mechanisms include exon shuffling, reverse transcription of expressed genes, and the action of mobile elements; for reviews, see refs. 5, 6. In the case of prokaryotes, mechanisms for duplication are less well understood and horizontally transferred genes are believed to be an important, perhaps dominant, source of new gene structures entering the genome (7). Note that transfer of DNA into the prokaryotic cell can occur primarily by three means: (1) transduction by viruses, (2) conjugation by plasmids, and (3) natural genetic transformation: the ability of some bacteria to take up DNA fragments released by another cell. For details, see ref. 8. While we expect duplication to produce gene copies with recognizable homology, whether transfer is seen as gene origination or gene birth in the context of a particular genome depends on the presence of recognizable homologs. In contrast to duplication and transfer, the loss of genes is thought to most frequently result from a cascade of small deletion events with small or no fitness effect, which follow the initial inactivation of a gene (the emergence of a pseudogene). As in the case of pseudogenization, molecular mechanisms can generate new gene structures or lead to the loss of existing ones in the genomes of individual cells; the fate of these genomic changes, whether they will fix or be lost in the population, will be determined by their selective effects and population genetic parameters, such as effective population size.

On the broadest scale, the strength of genetic drift has been hypothesized to be a dominant factor influencing genome size across all three domains of life (9). As we see in the following section, the pattern of the distribution of homologous gene family sizes in and among genomes can, to a large extent, also be described in terms of essentially neutral stochastic birth-and-death processes. Birth (duplication and transfer) and death (loss) in the context of these models correspond to the addition and removal of genes to homologous gene families over evolutionary timescales that are long compared to the mutational and population genetic timescales.

The mechanisms responsible for the origination of gene families are not well understood. A significant fraction of genes in genomes from all three domains of life appears to be of very recent origin insofar as they are restricted to a particular genome and possess no known homologs. By some counts, such orphan genes constitute, e.g., one-third of the genes in the human genome (6) and 14% in a survey of 60 bacterial genomes (10). While there are signs that a large fraction of orphan genes in prokaryotic genomes may have viral origin (11), where these genes come from and, more generally, what the dominant processes of gene origination are remain largely unresolved fundamental questions. Nevertheless, as we show below using birth-and-death processes as models, the continuous presence and significance of origination during the course of genome evolution is readily apparent from the record it has left in the pattern of homologous gene family sizes, i.e., in the shape of the protein universe.

2.1 The Distribution of Homologous Gene Family Sizes

The frequency distributions of gene family sizes in the complete genomes of organisms from all three domains exhibit remarkably similar shapes with characteristic long, slowly decaying tails (12, 13, 14). These distributions all have a power-law shape; for large family size n, the frequency of families f(n) falls off as \( f(n) \propto {n^{\gamma }} \) with some γ < 0. This power-law shape is apparent in the log-log plots of Fig. 1 and corresponds to an excess of large and very large families compared to what would be expected based on the size of the average gene family. Even more remarkable is the similarity of the family size distributions between species from a single domain (columns in Fig. 1), and even between domains (rows in Fig. 1). This similarity implies that the processes that have generated these distributions may share universal features across species and across the three domains. Here, we focus on the information that can be inferred under the assumption that particular forms of birth-and-death processes have shaped these distributions and will not consider potential connections with power-law scaling in functional genome content (15) or homology networks and their connection to other biological networks with similar characteristics (16).
Fig. 1.

Distribution of homologous gene family sizes across the three domains. The distribution of homologous gene family sizes was derived from version 5 of the HOGENOM database (17). The results for the three domains are based on the complete genomes of 820 bacteria, 62 archaea, and 64 eukaryotes, and correspond to the average of the frequencies of family sizes across species in the domain. Dashed lines indicate fits with different origination–duplication–loss (ODL) models. The linear model corresponds to the model of Reed et al. and the nonlinear model is that proposed by Karev et al.; see text for details. The bottom row presents the relative rate of duplication as a function of family size corresponding to the fits of the nonlinear model of Eq. 2 in the two rows above it.
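As a rough illustration of how a family size distribution and its log-log slope might be obtained, the following minimal Python sketch tabulates the frequencies f(n) from a list of family sizes and estimates γ by a least-squares line through the log-log points. The function names and the input format (one integer per family in a genome) are assumptions made for this example; a least-squares slope is only a crude diagnostic and is not the fitting procedure used for Fig. 1.

```python
import math
from collections import Counter

def size_frequencies(family_sizes):
    """Map each family size n to the fraction of families having that size."""
    counts = Counter(family_sizes)
    total = len(family_sizes)
    return {n: c / total for n, c in sorted(counts.items())}

def loglog_slope(freqs, n_min=1):
    """Least-squares slope of log f(n) against log n for sizes n >= n_min.

    For f(n) proportional to n**gamma the slope equals gamma.
    """
    points = [(math.log(n), math.log(f))
              for n, f in freqs.items() if n >= n_min and f > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Hypothetical input: one entry per family, giving its size in one genome.
example_sizes = [1, 1, 1, 1, 2, 2, 3, 5]
print(size_frequencies(example_sizes))
```

The `n_min` argument restricts the estimate to the tail, where the asymptotic power law is expected to hold.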

2.2 Interpreting the Pattern of Gene Family Sizes

Huynen and van Nimwegen were the first to describe and interpret a widespread pattern of a slowly decaying asymptotic power law in the distribution of homologous gene family sizes. They examined a diverse set of genomes spanning the bacteria, archaea, eukaryota, and viruses (12). They found that a simple, but relatively abstract, stochastic birth-and-death process, one where the duplication and loss events are correlated within a family, produces power-law distributions (for details, see below). They found the exponent γ to be between −2 and −4 in their studies. In fact, a value consistent with these results of γ between −2 and −3 has been observed in all subsequent studies and can easily be read off from Fig. 1. In the context of Huynen and van Nimwegen’s model, this indicates that the origination rate (in general, a combination of gain resulting from transfer, and the birth of new families with no homologs in other genomes) that is required to compensate for the stochastic loss of families must be significant.

Subsequent work has shown that for models in which the birth and death of genes in a gene family are considered independent, the asymptotic decay of the distribution of gene family sizes can also become a power law, although such behavior is only exhibited by a specific subclass of origination–duplication–loss-type birth-and-death models. As demonstrated by Karev et al. (14), this is the case for nonlinear models (see below) in which the death rate approaches the birth rate for large families but is considerably greater than the birth rate for small families (see bottom row of Fig. 1). Karev et al. have been able to accurately reproduce the distributions of gene (and domain) family sizes for a range of analyzed genomes. The origination rates necessary to fit empirical family size distributions were found to be relatively high, and comparable, at least in small prokaryotic genomes, to the overall intragenomic duplication rate. This has been interpreted as support for the key role of horizontal gene transfer (HGT) in these genomes (14, 18, 19).

At about the same time as the work of Karev and colleagues appeared, Reed et al. demonstrated (20) that a very simple birth-and-death process can also exhibit an asymptotic power law. They considered a model in which the birth and death of genes are independent of each other and of family size, and origination occurs randomly with a uniform rate (see below), and found asymptotic power-law behavior under the condition that the rate of birth (duplication) is larger than the rate of death (loss). In Fig. 1, we show comparisons of the fits of the linear model of Reed et al. and the nonlinear model of Karev et al. to gene family size distributions for the three domains. We can see that, considering data from individual species (top row of Fig. 1), the linear model (described by three parameters) provides fits of comparable quality to those of the model of Karev et al. (described by five parameters), despite its relative simplicity. If we consider, however, the fits to distributions averaged over the species of each domain, we can observe that the nonlinear model clearly provides a better fit (second row of Fig. 1). As the functions being fit are discrete probability distributions, one can easily calculate the probability of the observed empirical distribution given values of the model parameters, and subsequently perform fitting by maximizing the likelihood of the model parameters. For the case of the averaged distributions, this method of fitting using likelihood allows a clear interpretation of the fit to the averaged distributions, as corresponding to the hypothesis of a birth-and-death process with identical parameter values across all species in the domain having generated the observed distribution.
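The likelihood-based fitting described above can be sketched in a few lines of Python. As a stand-in for the ODL models, the example assumes a one-parameter truncated discrete power law on family sizes 1..N; this model, the function names, and the simple grid search are assumptions chosen to keep the sketch short, not the procedure used to produce Fig. 1. The same `log_likelihood` function would apply to any model that returns a discrete probability distribution over family sizes.

```python
import math

def truncated_power_law(gamma, n_max):
    """Probabilities p(n) proportional to n**gamma for n = 1..n_max."""
    weights = [n ** gamma for n in range(1, n_max + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

def log_likelihood(counts, pmf):
    """Log-probability of the observed counts; counts[n - 1] is the number
    of families of size n (the multinomial constant is omitted)."""
    return sum(c * math.log(p) for c, p in zip(counts, pmf) if c > 0)

def fit_gamma(counts, lo=-5.0, hi=-0.5, steps=451):
    """Grid search for the exponent that maximizes the likelihood."""
    n_max = len(counts)
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda g: log_likelihood(
        counts, truncated_power_law(g, n_max)))

# Hypothetical counts of families of size 1, 2, 3, ... in one genome.
observed = [620, 110, 45, 22, 12, 8, 5, 3, 2, 1]
print(fit_gamma(observed))
```

Fitting an averaged distribution in this way amounts to pooling the counts from all species of a domain, which matches the interpretation of identical parameter values across species given above.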

Perhaps more conclusively, the parameter values obtained in the case of the linear model, corresponding to a birth-to-death ratio of between roughly 2 and 5 (δ/λ = 4.9 for the human dataset with the best apparent fit), are qualitatively at odds with empirical estimates of the recent duplication and loss rates in eukaryotic genomes, which unanimously indicate a value much smaller than one (see Table 1 in ref. 6).

2.3 The Theory of Birth-and-Death Processes

Historically, the biological application of birth-and-death processes, starting with the seminal work of Yule (22) in the 1920s and continuing in the following decades (23, 24, 25, 26), was the construction of stochastic models that can furnish a means for interpreting random fluctuations in the population size with time. The application of birth-and-death processes to the sizes of gene families is more recent. The realization that the sizes of gene families can be compared with the aim of better understanding adaptive evolutionary processes and organismal phylogeny began with the work of Hughes and Nei (27, 28) and others (29) in the context of the debate on whether differences in the copy number of major histocompatibility complex genes across species have evolved due to adaptive or stochastic forces. As described above, recent work has focused on explaining the distribution of the number of genes in homologous gene families in genomes as the result of stochastic birth-and-death processes (see also Chap. 3 of ref. 6).

A birth-and-death process is a stochastic process in which transitions between states labeled by integers (representing the number of individuals, cells, lineages, etc.) are only allowed to neighboring states (see Fig. 2). An increase by one of the number of individuals (or genes in a gene family) constitutes a birth, whereas a decrease by one is a death. More formally, the dynamics of a population (of individuals, or of genes in a gene family) is represented by a Markov process, i.e., the state of the population at time t is described by the value of a random variable that satisfies the Markov property (for an accessible review, see ref. 18). In general, for each state n, the transitions are described by a pair of rates: a birth rate \( {\delta_n} \) for the transition from state n to n + 1, and a death rate \( {\lambda_n} \) for the transition from state n to n − 1. A third elementary process besides birth and death that is relevant in the context of gene family size evolution is origination. As described above, not all gene families are of the same age; consequently, to model the origination of new families, families with a single gene are taken to originate at some constant rate Ω, as shown in Fig. 2. Considering a similar rate of influx into each state can be regarded as a model of HGT (cf. Fig. 2).
Fig. 2.

Birth-and-death models of homologous gene family evolution. A birth-and-death process is a stochastic process in which transitions between states labeled by integers (representing the number of individuals, cells, lineages, etc.) are only allowed to neighboring states. A jump to the right constitutes a birth, whereas a jump to the left is a death. In the context of birth-and-death processes that model the evolution of homologous gene families, the number of representatives a homologous gene family has in a given genome corresponds to the model state. Birth represents the addition of a gene to a family in a genome as a result of (1) origination of a new family with a single member, (2) duplication of an existing gene, or (3) gain of a gene by means of horizontal transfer of a gene from the same family from a different genome. The three models pictured above have been used in different contexts to model observed patterns of gene family size: (a) the stationary distribution of nonlinear origination–duplication–loss-type models is able to reproduce the general shape and in particular the power-law-like tail of the distribution of homologous gene family sizes (cf. Subheading 2 and ref. 14), while transient distributions of linear origination–duplication–loss models can be used to construct models of gene family size evolution along a phylogeny, modeling the “inparalog,” i.e., vertically evolving, component of the family size distribution (21); (b) and (c) linear gain–loss and gain–duplication–loss-type models are used to model the nonvertically evolving, so-called xenolog, component of the family size distribution along a branch of a phylogenetic tree.

The simplest type of birth-and-death processes with biological relevance are linear birth-and-death processes. Linear birth-and-death processes are described by a single birth rate δ and a single death rate λ from which the state-wise rates can be derived by the following first order rate law:
$$ {\delta_n} = \delta n\quad {\hbox{and}}\quad {\lambda_n} = \lambda n. $$
(1)

In other words, a gene (individual) in a gene family (population) gives birth to a new gene at a rate δ and undergoes death at a rate λ, independent of the size of the gene family. The stationary distribution of a linear birth-and-death process with origination at some rate Ω can be shown to be (1) a stretched exponential if δ ≤ λ, i.e., the birth rate does not exceed the death rate or (2) exhibiting an asymptotic power-law behavior with exponent γ = −(Ω/(δ − λ) + 1) (30) if δ > λ. The transient distribution can be analytically expressed for the linear version of all three processes shown in Fig. 2. These distributions are important in deriving the probability of observing a particular pattern of family sizes at the leaves of a phylogeny, as well as in estimating branch-wise duplication, transfer, and loss parameters from a forest of gene trees that have been mapped, using a series of duplication, transfer, and loss events, to the branches of a species phylogeny (see Subheading 4).
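To make the notion of an analytically expressible transient distribution concrete, the sketch below evaluates the classical closed-form result for the simplest case: a linear birth-and-death process with the rates of Eq. 1, started from a single gene, with no origination or gain. This restriction, the function name, and the parameter values are assumptions made for illustration; the versions with gain or origination shown in Fig. 2 require additional terms that are not covered here.

```python
import math

def transient_distribution(birth, death, t, n_max):
    """P(family size = n at time t | size 1 at time 0) for n = 0..n_max,
    for a linear birth-and-death process with per-gene rates `birth`
    (delta) and `death` (lambda), without origination or gain."""
    if abs(birth - death) < 1e-12:
        # Limit of the expressions below when the two rates coincide.
        alpha = beta = birth * t / (1.0 + birth * t)
    else:
        growth = math.exp((birth - death) * t)
        alpha = death * (growth - 1.0) / (birth * growth - death)
        beta = birth * (growth - 1.0) / (birth * growth - death)
    # alpha is the probability that the family has been lost by time t;
    # surviving families have a geometrically distributed size.
    probs = [alpha]
    probs += [(1.0 - alpha) * (1.0 - beta) * beta ** (n - 1)
              for n in range(1, n_max + 1)]
    return probs

# Illustrative parameter values only (loss rate larger than duplication rate).
dist = transient_distribution(birth=0.3, death=0.5, t=2.0, n_max=200)
print(dist[0], sum(n * p for n, p in enumerate(dist)))
```

Under these assumptions the expected family size is exp((δ − λ)t), which provides a simple check on the computed probabilities.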

A succession of more complex nonlinear models can be constructed, the simplest proposed (14) being a model with a family size-dependent duplication and loss rate parameterized by a pair of constants a and b:
$$\begin{array}{ccccc}{\delta_n} = \delta (n)n = \left( {\frac{{\delta ^{\prime}(n + a)}}{n}} \right)n\quad {\hbox{and}}\quad {\lambda_n} = \lambda (n)n = \left( {\frac{{\lambda ^{\prime}(n + b)}}{n}} \right)n, \end{array}$$
(2)
where we have not simplified by n to emphasize the relationship with the linear model above. For this class of models, asymptotic power laws are obtained only if δ′ < λ′ (14), i.e., the birth rate is smaller than the death rate. It is important to note that the linear origination–duplication–loss type model of Reed et al. (20) differs from those of Karev et al. (14) in details related to how origination is considered and in how the space of possible states (family sizes) and hence the stationary state is defined. While Hughes and Reed consider gene families to originate at a constant rate and family size to be unbounded, Karev et al. assume that family sizes are bounded and consider reflecting boundary conditions. Discrete time models that are closely related to the continuous time models considered by Karev et al. were presented by Wójtowicz and Tiuryn (31).
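For a birth-and-death chain on a bounded set of family sizes 1..N with reflecting boundaries, the stationary distribution follows from the balance of flows between neighboring states: the probability of size n + 1 times the death rate at n + 1 equals the probability of size n times the birth rate at n. The Python sketch below applies this recursion to the state-wise rates of Eq. 2. The function names, the choice of N, and the parameter values are assumptions made for illustration, and the treatment of origination is reduced to the reflecting boundary at n = 1.

```python
def stationary_distribution(birth_rate, death_rate, n_max):
    """Stationary probabilities for family sizes 1..n_max of a
    birth-and-death chain with reflecting boundaries, obtained from
    p[n + 1] * death_rate(n + 1) = p[n] * birth_rate(n)."""
    weights = [1.0]
    for n in range(1, n_max):
        weights.append(weights[-1] * birth_rate(n) / death_rate(n + 1))
    norm = sum(weights)
    return [w / norm for w in weights]

def nonlinear_rates(delta_prime, lambda_prime, a, b):
    """State-wise rates of Eq. 2: delta_n = delta'(n + a) and
    lambda_n = lambda'(n + b)."""
    return (lambda n: delta_prime * (n + a),
            lambda n: lambda_prime * (n + b))

# Illustrative parameter values only.
birth, death = nonlinear_rates(delta_prime=0.95, lambda_prime=1.0, a=1.0, b=3.0)
p = stationary_distribution(birth, death, n_max=500)
print(p[:5])
```

The list index is shifted by one relative to the family size, so `p[0]` is the stationary probability of a family of size 1.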

A different, more abstract type of birth-and-death process was historically the first to be proposed to model the distribution of gene family sizes (12). Similarly to the above model, a gene family is founded by a single ancestor, and the size of the family may change as a result of duplications and losses (birth and death). However, in contrast to the birth-and-death models considered so far, duplications and losses are considered to act “coherently” on genes within one gene family. That is, if a certain gene is likely to duplicate (be lost), then all genes of its family are likely to duplicate (be lost). More formally, denoting the size of a gene family at time t by \( {n_t} \),
$$ {n_t} = {\alpha_t}{n_{{t - 1}}}, $$
(3)
where \( {\alpha_t} \) is a random multiplication factor, giving the instantaneous ratio of birth to death, that is drawn independently at each time step from some distribution P(α). The distribution of gene family sizes that is the result of many such processes can be shown to have a power-law distribution, provided the further important condition that some form of origination be present is met. The exponent of the power-law asymptotic followed by the family size distribution is in this case independent of the exact nature of origination (independent, e.g., of whether one considers reflecting boundary conditions or random influx) and is given by \( \gamma = - \left( {1 - {\mu_{\alpha }}/\sigma_{\alpha }^2} \right) \), where \( {\mu_{\alpha }} = \left\langle {\log (\alpha )} \right\rangle \) is the mean of the logarithm of the random variable α and \( \sigma_{\alpha }^2 = \left\langle {{{\log }^2}(\alpha )} \right\rangle - \mu_{\alpha }^2 \) is its variance (12). Interestingly, this implies that birth-and-death models with coherent noise (also called multiplicative noise) produce a power-law asymptotic regardless of whether the birth rate is smaller or larger than the death rate. The value of the exponent, however, can give an indication of their relative values. The reason is that, since \( \sigma_{\alpha }^2 \) is positive, γ < −1 implies \( {\mu_{\alpha }} = \left\langle {\log (\alpha )} \right\rangle < 0 \), which can be shown to be equivalent to the geometric mean of α, i.e., the instantaneous ratio of birth to death, being smaller than unity.
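The relationship between the exponent and the statistics of the multiplication factor can be checked numerically. The Python sketch below takes a sample of α values, computes the mean and variance of log α, and evaluates the expression for γ quoted above from ref. 12, together with the geometric mean of α. The function name and the sample values are assumptions made for illustration.

```python
import math

def coherent_noise_exponent(alphas):
    """Return (gamma, geometric_mean) for a sample of multiplication
    factors, with gamma = -(1 - mu / sigma2), where mu and sigma2 are
    the mean and variance of log(alpha)."""
    logs = [math.log(a) for a in alphas]
    mu = sum(logs) / len(logs)
    sigma2 = sum(x * x for x in logs) / len(logs) - mu * mu
    gamma = -(1.0 - mu / sigma2)
    geometric_mean = math.exp(mu)
    return gamma, geometric_mean

# Hypothetical sample in which halving is more frequent than doubling.
gamma, gmean = coherent_noise_exponent([0.5, 2.0, 0.5, 0.5])
print(gamma, gmean)
```

In this example the geometric mean is below one and the resulting exponent is below −1, in line with the argument given in the text.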

2.4 Birth and Death Along a Species Phylogeny

So far, we have only considered the distribution of homologous gene family sizes in genomes of individual species and the average of such distributions across domains. The distributions of gene family sizes in different species are, however, not independent, but rather reflect correlated histories related by common descent along a species phylogeny. The phylogenetic profile of a gene family, consisting of the number of homologs within the same family in each genome, encodes this information. Such phylogenetic profiles can be informative even though they neglect a large part of the information present in gene sequences, and profile datasets have been used both to construct organismal phylogenies (32, 33, 34, 35, 36) and to reconstruct ancestral gene content (37). These methods have, however, proved sensitive to methods of homology inference and have relatively poor performance as methods of phylogenetic analysis. This can be explained, in the case of prokaryotes, by high levels of homoplasy resulting from both HGT and extensive parallel loss of gene families in certain bacterial genomes (30). (Remember that homoplasy, also called convergent evolution, describes the acquisition of the same biological trait – in this case, genes from the same family – in unrelated lineages.)

The primary advantage of the above attempts at reconstructing phylogeny is their relative ease of implementation and computational tractability on large datasets derived from complete genomes. They, however, suffer two major shortcomings: (1) they lack an explicit model of evolution and consequently provide at best indirect information on processes and (2) they disregard a great deal of phylogenetically relevant information present in homologous sequences by considering only presence–absence or at most the gene copy number in genomes.

The first of these shortcomings can be overcome by considering phylogenetic profiles as observations at the leaves of a species tree generated by a birth-and-death process of sufficient complexity. Csűrös and Miklós have recently developed an efficient algorithm for calculating the probability of observing a given phylogenetic profile as a function of branch-wise parameters of duplication, gain, and loss along a species tree (21). Their model assumes that gene families evolve according to a linear birth-and-death process along the branches of the species tree. Each branch is characterized by a duplication rate, a gain rate, and a loss rate. A gene family evolves along the tree from the root toward the leaves according to the birth-and-death process. At internal nodes of the tree, families are instantaneously copied to evolve independently along descendant branches. Transient distributions of the linear version of processes presented in Fig. 2 give the expected change in the number of vertically inherited genes (“inparalogs”) and recently acquired ones (“xenologs”) (38). Leading up to the work of Csűrös and Miklós, other groups had also developed likelihood-based methodologies. These either only considered duplication and loss (39) or relied on heuristic restrictions on maximal ancestral family size for computational tractability (40, 41).
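The structure of such a likelihood calculation can be illustrated with a deliberately simplified Python sketch. It is not the algorithm of ref. 21: it bounds the family size at a maximum value (closer in spirit to the heuristic restrictions mentioned above), uses the same duplication, gain, and loss rates on every branch, and obtains transition probabilities by uniformization. The tree encoding (a leaf is a species name, an internal node is a list of (child, branch length) pairs), the function names, and all parameter values are assumptions made for this example.

```python
import math

def apply_transition(vec, dup, gain, loss, t, tol=1e-12):
    """Return, for each start size i, the sum over j of P_ij(t) * vec[j]
    for a linear gain-duplication-loss chain truncated at len(vec) - 1."""
    n_max = len(vec) - 1
    birth = [dup * n + gain if n < n_max else 0.0 for n in range(n_max + 1)]
    death = [loss * n for n in range(n_max + 1)]
    q = max(b + d for b, d in zip(birth, death)) or 1.0  # uniformization rate
    weight = math.exp(-q * t)          # Poisson probability of zero jumps
    total = weight
    current = list(vec)
    result = [weight * v for v in current]
    k = 0
    while 1.0 - total > tol and k < 100000:
        k += 1
        step = []
        for n in range(n_max + 1):
            value = (1.0 - (birth[n] + death[n]) / q) * current[n]
            if n < n_max:
                value += birth[n] / q * current[n + 1]
            if n > 0:
                value += death[n] / q * current[n - 1]
            step.append(value)
        current = step
        weight *= q * t / k            # Poisson probability of k jumps
        total += weight
        result = [r + weight * c for r, c in zip(result, current)]
    return result

def profile_likelihood(tree, profile, dup, gain, loss, root_prior):
    """Probability of the observed family sizes at the leaves, summing over
    all ancestral sizes up to len(root_prior) - 1 (pruning recursion)."""
    n_max = len(root_prior) - 1

    def partial(node):
        if isinstance(node, str):                  # leaf: observed size
            like = [0.0] * (n_max + 1)
            like[profile[node]] = 1.0
            return like
        like = [1.0] * (n_max + 1)
        for child, length in node:                 # copies evolve independently
            below = apply_transition(partial(child), dup, gain, loss, length)
            like = [a * b for a, b in zip(like, below)]
        return like

    return sum(p * l for p, l in zip(root_prior, partial(tree)))

# Hypothetical three-species example with family sizes 1, 2, and 0.
tree = [([("A", 0.5), ("B", 0.5)], 0.3), ("C", 0.8)]
prior = [0.0, 1.0] + [0.0] * 9                     # one gene at the root
print(profile_likelihood(tree, {"A": 1, "B": 2, "C": 0}, 0.2, 0.05, 0.4, prior))
```

Maximizing this quantity over the rates, summed in log form over many families, corresponds to the estimation problem described in the text, although a practical implementation would use branch-specific rates.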

Using the above approach, it is possible to search for the branch-wise duplication, gain, and loss rates that maximize the likelihood given a set of observed profiles (derived from complete genome sequences) and a species phylogeny. Conceptually, this is no different than searching for branch-wise substitution rates that maximize the likelihood given a set of homologous sites (see, for instance, Chap. 16 of ref. 42). The phylogenetic profile of an individual gene family in the former case plays the role of a column of an alignment in the latter. In Table 1, we present results obtained in this manner using COUNT (43), a software package that provides an implementation of this calculation. The results in Table 1 lend further support to both the observation that birth-and-death rates are similar across the tree of life (although here we have only considered prokaryotes) and the pattern of death (loss) rates being on average significantly larger than birth (duplication and gain) rates. Similar to what was observed for 28 archaeal genomes (21), duplications are inferred to account for the majority of birth events.
Table 1

Relative rates of duplication, gain, and loss for prokaryotic phyla obtained by maximum likelihood using COUNT (43)

Phylum name                   Loss   Duplication   Gain    # of genomes
Actinobacteria                0.75   0.23          0.010   31
Alphaproteobacteria           0.85   0.13          0.008   47
Bacillales                    0.52   0.42          0.048   16
Bacteroidetes/chlorobi        0.59   0.38          0.024   10
Betaproteobacteria            0.63   0.32          0.037   32
Chlamydiae/verrucomicrobia    0.70   0.24          0.043   7
Clostridia                    0.57   0.37          0.055   11
Cyanobacteria                 0.68   0.28          0.027   14
Deltaproteobacteria           0.64   0.33          0.024   13
Epsilonproteobacteria         0.54   0.29          0.158   7
Gammaproteobacteria           0.88   0.10          0.009   70
Lactobacillales               0.66   0.29          0.036   21
Mollicutes                    0.49   0.47          0.023   14
Spirochetes                   0.79   0.19          0.014   7
Crenarchaeota                 0.69   0.28          0.018   11
Euryarchaeota                 0.66   0.31          0.016   25

Rooted reference trees were obtained from concatenates of universal and near-universal genes and phylogenetic profiles extracted from version 4 of the HOGENOM database (17). Relative rates correspond to the ratio of the average of the branch-wise rates (of duplication, gain, and loss) to the average branch-wise sum of the three rates
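The definition of the relative rates given in the note above can be written out as a short Python sketch. The input format (one tuple of duplication, gain, and loss rates per branch), the function name, and the numerical values are assumptions made for illustration; they are not output of COUNT.

```python
def relative_rates(branch_rates):
    """Ratio of the average branch-wise rate of each event type to the
    average branch-wise sum of the three rates.

    branch_rates is a list of (duplication, gain, loss) tuples, one per
    branch of the species tree.
    """
    k = len(branch_rates)
    mean_dup = sum(r[0] for r in branch_rates) / k
    mean_gain = sum(r[1] for r in branch_rates) / k
    mean_loss = sum(r[2] for r in branch_rates) / k
    mean_total = sum(sum(r) for r in branch_rates) / k
    return {"duplication": mean_dup / mean_total,
            "gain": mean_gain / mean_total,
            "loss": mean_loss / mean_total}

# Hypothetical branch-wise estimates for a two-branch tree.
print(relative_rates([(0.2, 0.01, 0.5), (0.4, 0.03, 0.9)]))
```

By construction the three relative rates sum to one, which is why each row of Table 1 sums to approximately one after rounding.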

3 The Ubiquity of Phylogenetic Discord and the Joint Reconstruction of Pattern and Process

In order to extract as much information as possible, we must step beyond phylogenetic profiles and consider in more detail the phylogenetic information contained in the sequences of homologous gene families. This can be done by using some model of sequence evolution to infer a gene phylogeny from the multiple sequence alignment (MSA) of the family. Because gene families evolve through not only the genome level process of speciation but also the gene level processes of origination, duplication, transfer, and loss described above, the phylogenies of individual families constructed in this manner reflect intricate individual genic histories. Differences in the histories of individual families inevitably lead to phylogenetic discord among gene families. The amount of phylogenetic conflict reflects the extent of HGT among genomes, and consequently the profusion of phylogenetic discord that we observe among prokaryotes (see below) is interpreted as reflecting large rates of transfer.

Independent of the degree of HGT, however, the existence of gene level processes of birth and death makes it necessary to extend the implicit model behind the tree of species. This extension consists of taking into consideration the processes of gene origination, birth, and death described above. The classic concept of the species tree implicitly assumes that all genes evolve along a strictly shared track—the branches of the species tree. The presence of duplications, transfers, and losses obliges us to replace this model by a tree, the branches of which can be best visualized as tubes—tubes within which genes may duplicate and be lost, and among which they can be transferred. This tree of genomes is a straightforward extension of the classic tree of species with its branches characterized by rates of duplication, transfer, and loss.

For this tree of genomes to be useful, however, methods based on statistical models capable of considering data from complete genome sequences and inferring such a tree need to be developed. Below, we describe recent progress in the construction of tractable models of genome evolution that are full, probabilistic models of all variables, in particular in our case of branch-wise duplication, transfer, and loss rates and the species tree topology.

3.1 Phylogenetic Discord Among Homologous Gene Families

Apparent phylogenetic conflict can result from different processes. First of all, inferred gene tree topologies can be different from the species tree, and hence from each other, in the absence of any biological processes due to reconstruction errors. Such errors can result from stochastic differences caused by, e.g., insufficient sequence length and, more problematically, from systematic reconstruction artifacts due to departures from model assumptions (44). More informatively, phylogenetic discord can result from three important biological processes (summarized in Fig. 3): lineage sorting, HGT, and hidden paralogy.
Fig. 3.

Evolutionary processes behind phylogenetic discord. Phylogenetic incongruences can be the result of three major evolutionary processes (45): (1) deep coalescence resulting from incomplete lineage sorting (see previous chapter); (2) hidden paralogy (resulting from duplication and differential loss); and (3) horizontal gene transfer (HGT). Incomplete lineage sorting occurs when an ancestral species undergoes two speciation events in rapid succession. If, for a given gene, the ancestral polymorphism has not been fully resolved into two monophyletic lineages at the time of the second speciation, with a probability determined by the effective population size, the gene tree will differ from the species tree. A potential source of incongruence relevant over wider phylogenetic scales is hidden paralogy. If a gene family contains paralogous copies (genes that are related by a duplication event, e.g., the dashed and grey lines above), the gene phylogeny will partly reflect the duplication history of the gene that is independent of species divergence history. The third process is HGT. If genetic exchanges occur between species, then the phylogeny of individual genes will be influenced by the number and nature of transfers they have undergone. In the above figure, we illustrate how a particular gene tree topology can be explained by each process. Depending on the parameters (duplication, transfer, and loss rates and effective population size) describing the branches of the species tree, the three different scenarios have different probabilities.

Galtier and Daubin (45) analyzed the level of phylogenetic conflict between genes in several datasets extracted from the HOGENOM (17) database. Their aim was to ascertain the relative contribution of HGT to the amount of phylogenetic discord by comparing metazoan datasets (where HGT can be assumed to be rare) to prokaryotic ones. Their results were consistent with expectations, as the level of discord measured for metazoan sequences was smaller than for any of the bacterial datasets considered. Interestingly, however, the differences in the amount of discord among the bacterial datasets were also found to be large (see Table 1 of ref. 45). These large differences in the amount of discord, presumably caused by differences in rates of transfer, stand in stark contrast to the broadly similar rates of gene birth and death implied by the similarity of the gene family size distributions.

A further finding of the study of Galtier and Daubin was that even in the case of actinobacteria (the prokaryotic dataset with the highest degree of self-conflict) more than 75% of the genes did not significantly reject the consensus tree. While it is clear that including more and more species would cause this particular measure to converge to a much smaller value, a series of more careful studies have demonstrated that there exists a strong signal of vertical inheritance in prokaryotic genomes despite persistent HGT (46, 47, 48, 49) (see also Chap. 3 of this volume, ref. 50).
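As a point of reference for what "discord" between two topologies means in practice, the following minimal sketch counts the clades that appear in only one of two rooted trees (a rooted Robinson-Foulds distance). It is an illustration only and is not the test of significant conflict used in ref. 45; the nested-tuple tree encoding and the function names are assumptions made for this example.

```python
def clades(tree):
    """Return the non-trivial clades (frozensets of leaf names) of a rooted tree.

    A tree is either a leaf name (str) or a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by all trees on these leaves
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: number of clades found in exactly one tree."""
    return len(clades(tree_a) ^ clades(tree_b))


species_tree = ((("A", "B"), "C"), "D")
congruent_gene_tree = ((("A", "B"), "C"), "D")
discordant_gene_tree = ((("A", "C"), "B"), "D")  # groups A with C instead of B

print(rf_distance(species_tree, congruent_gene_tree))
print(rf_distance(species_tree, discordant_gene_tree))
```

A distance of this kind only detects disagreement; it does not say which of the processes in Fig. 3 produced it.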

3.2 Reconciling Phylogenetic Discord

The detection and measurement of phylogenetic discord among a group of phylogenetic trees can be accomplished relatively easily, for instance, by using some measure of distance between trees (see Chap. 30 of ref. 42 for an introduction to distance measures). A different and harder problem consists of constructing a reconciliation between two trees, i.e., of proposing a set of evolutionary events (such as speciations, duplications, transfers, and losses) that correspond to an evolutionary scenario in which one of the trees (the gene tree) has resulted from evolution along the other tree (the species tree). In Fig. 3, we present three different reconciliations involving different sets of events for the same gene tree. Until recently, the set of events considered in the context of the reconciliation problem was limited to speciation, duplication, and loss events, and to lineage sorting, as discussed in Chap. 1 of this volume (ref. 64) and in Chaps. 29 and 25 of ref. 42, respectively. Goodman (51) was the first to describe an algorithm to find the reconciliation that minimizes the number of duplication and loss events, followed more recently by several others (see ref. 42 for citations).
If transfers are also considered, the problem of reconciliation becomes difficult from a combinatorial perspective for two reasons: (1) the set of events must be restricted to those that respect the partial order of evolution imposed by speciation events on the species tree (52); this corresponds to forbidding the transfer of genes from a species (branch of the species tree) to species from which it has descended (ancestral branches of the species tree), i.e., forbidding transfers that “go backward in time”; (2) if transfer events are considered where the acquisition of a homologous copy implies the loss of the extant copy, the problem of identifying the minimum number of such events can easily be shown to correspond to the problem of finding the shortest path between two trees using subtree prune and regraft (SPR) operations, which is known to be NP-complete (see Chaps. 4 and 30 of ref. 42).
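For the duplication-and-loss case, a reconciliation is commonly obtained by mapping each gene tree node to the last common ancestor (LCA) of the species found below it; a gene tree node is then read as a duplication when it maps to the same species tree node as one of its children. The sketch below illustrates this mapping under simplifying assumptions: trees are nested tuples, gene tree leaves are labeled by the species in which they reside, losses are not counted, and all function names are hypothetical.

```python
def species_clades(species_tree):
    """Leaf sets of all nodes of a rooted species tree given as nested tuples of names."""
    out = []

    def visit(node):
        if isinstance(node, str):
            leaf_set = frozenset([node])
        else:
            leaf_set = frozenset().union(*(visit(child) for child in node))
        out.append(leaf_set)
        return leaf_set

    visit(species_tree)
    return out


def lca_map(species_set, clade_list):
    """Smallest species-tree clade containing species_set, i.e., the LCA mapping."""
    return min((c for c in clade_list if species_set <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene tree nodes mapped to the same species node as one of their children."""
    clade_list = species_clades(species_tree)
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):  # gene leaf, labeled by the species it resides in
            return lca_map(frozenset([node]), clade_list)
        child_maps = [visit(child) for child in node]
        here = lca_map(frozenset().union(*child_maps), clade_list)
        if here in child_maps:
            duplications += 1
        return here

    visit(gene_tree)
    return duplications


species_tree = (("A", "B"), "C")
print(count_duplications((("A", "B"), "C"), species_tree))                # congruent tree
print(count_duplications((("A", "C"), "B"), species_tree))                # discordant tree
print(count_duplications(((("A", "B"), ("A", "B")), "C"), species_tree))  # two copies in A and B
```

Note that the discordant tree is explained here by a duplication (hidden paralogy) simply because no other event type is available to the model; this is precisely the limitation that motivates including transfers.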

The latter process of replacement of genes by HGT is biologically motivated by the elevated probability of functional redundancy in the case of homologous genes (53). Such replacement is particularly relevant in modeling genes that are present in a single copy in all or most genomes. A variety of approaches have been put forward to solve the problem of tree reconciliation for the case when the replacement of genes is relevant (53, 54, 55). These approaches offer heuristic algorithms to find approximate solutions to the SPR and the closely related maximum agreement forest (MAF) problems efficiently. However, they are all limited to single-label trees, i.e., trees of families that do not have multiple members in any of the genomes considered.

The former problem of considering only transfers that respect the partial time order implied by the species tree can be resolved by fully specifying the time order of speciation events. As shown by Tofigh (56), and described below, this allows the construction of a dynamic programming algorithm that is able to efficiently traverse all possible reconciliations allowing the calculation of the sum of the probabilities of all reconciliations given a tree, the most parsimonious reconciliation (57, 58), or the reconciliation with the highest likelihood.
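The effect of fixing the order of speciations can be sketched in a few lines. In the example below, which is an illustration under assumed conventions and not the formal construction of ref. 56, each branch of a hypothetical species tree ((A,B),(C,D)) is described by the rank of the speciation at which it starts and the rank of the node at which it ends, with rank 1 the oldest speciation and the leaves placed after the last speciation. A transfer is admissible only between branches that coexist in some time slice.

```python
def time_slices(branches, n_speciations):
    """Map slice k to the branches alive between speciation k and the next event.

    branches: dict name -> (rank of parent speciation, rank of end node), where
    speciations have ranks 1..n_speciations (1 = oldest) and present-day leaves
    have rank n_speciations + 1.
    """
    slices = {k: [] for k in range(1, n_speciations + 1)}
    for name, (start, end) in branches.items():
        for k in range(start, end):
            slices[k].append(name)
    return slices


def transfer_allowed(donor, recipient, branches):
    """A transfer is time-consistent only if donor and recipient share a time slice."""
    d_start, d_end = branches[donor]
    r_start, r_end = branches[recipient]
    return donor != recipient and max(d_start, r_start) < min(d_end, r_end)


# Hypothetical dated tree ((A,B),(C,D)): root = rank 1, the A/B split = rank 2,
# the C/D split = rank 3, leaves = rank 4.
branches = {
    "AB": (1, 2), "CD": (1, 3),
    "A": (2, 4), "B": (2, 4),
    "C": (3, 4), "D": (3, 4),
}

print(time_slices(branches, n_speciations=3))
print(transfer_allowed("CD", "A", branches))  # CD and A coexist in slice 2
print(transfer_allowed("AB", "C", branches))  # AB ends before C begins
```

Had the C/D split been placed before the A/B split, the set of admissible transfers would change, which is why the time order is treated as part of the species tree S′.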

3.3 The Probability of a Gene Tree Given a Species Tree and Rates of Duplication, Transfer, and Loss

Tofigh et al. consider the forest of gene trees to be generated by a common birth-and-death process taking place on a shared species (or genome) tree. They derive the probability \( p\left( {G\left| {S^{\prime}} \right.,{\mathcal{M}_{\rm{BD}}},r} \right) \) of a gene tree topology G given a reconciliation r, where \( {\mathcal{M}_{\rm{BD}}} \) is a birth-and-death process taking place on S′, a species tree for which the order of speciation events is fully specified. Provided the process \( {\mathcal{M}_{\rm{BD}}} \) is linear, the probability of gene tree topology G can be expressed given a reconciliation r that maps branches and nodes of G to S′ using events considered in \( {\mathcal{M}_{\rm{BD}}} \).

This calculation requires two functions: (1) the probability of extinction \( Q_e(t) \), i.e., the probability that a gene observed on branch e at time t evolves such that it is not observed in any extant genome (at time t = 0); and (2) the propagator \( Q_{ef}(t, t^{\prime}) \), which gives the probability that a gene observed on branch e at time t evolves such that it has a descendant present on branch f at time t′ and, furthermore, that any descendants of the gene observed at the leaves of S′ (at time t = 0) descend from this copy. These functions can be obtained numerically from the systems of differential equations found in ref. 56.
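To give a feel for how such functions are obtained, the sketch below treats a deliberately simplified case: a single branch with duplication and loss only, ignoring transfers and speciations, so it is not the full system of ref. 56. In this case the extinction probability satisfies dE/dt = λ − (δ + λ)E + δE², with E(0) = 0, where δ is the duplication rate, λ is the loss rate, and t is measured backward from the present. The numerical solution can be checked against the classical closed form for the linear birth-and-death process. The function names and the parameter values are assumptions made for the example.

```python
import math


def extinction_closed_form(t, dup, loss):
    """Probability that a single gene at time t before present leaves no copy at time 0."""
    if math.isclose(dup, loss):
        return dup * t / (1.0 + dup * t)
    growth = math.exp((dup - loss) * t)
    return loss * (growth - 1.0) / (dup * growth - loss)


def extinction_numeric(t, dup, loss, steps=1000):
    """Integrate dE/dt = loss - (dup + loss) E + dup E^2 from E(0) = 0 with RK4."""
    def rhs(e):
        return loss - (dup + loss) * e + dup * e * e

    h = t / steps
    e = 0.0
    for _ in range(steps):
        k1 = rhs(e)
        k2 = rhs(e + 0.5 * h * k1)
        k3 = rhs(e + 0.5 * h * k2)
        k4 = rhs(e + h * k3)
        e += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return e


# Hypothetical rates with loss larger than duplication.
dup_rate, loss_rate, elapsed = 0.3, 0.5, 2.0
print(extinction_numeric(elapsed, dup_rate, loss_rate))
print(extinction_closed_form(elapsed, dup_rate, loss_rate))
```

When transfers and speciations are included, no such closed form is generally available, and numerical integration along the time slices of S′ becomes the practical route.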

Fig. 4.

Probabilistic DTL model. If we consider gene trees to be generated by a linear birth-and-death process \( {\mathcal{M}_{\rm{BD}}} \) taking place on a tree S′ with the order of speciation events fully specified, we can express the probability of a gene tree topology G given a reconciliation. Specifying the order of speciation events corresponds to constructing time slices, which decompose the branches of the species tree into pieces yielding the tree S′. For example, the branch leading to Genome A is decomposed into three branches labeled 2, 4, 7 (for a formal definition, see ref. 56). Transfers are only possible between branches in the same time slice, e.g., between 7 and 9, but not 4 and 9. A reconciliation consists of mapping the branches and nodes of G to the branches and nodes of S′. For a given gene tree, there are many possible reconciliations. For G, we can construct (1) a transfer scenario, where node g of G is a speciation at the root of S′, e is a transfer from 4 to 9, f is a speciation at the end of 3, and the branch below f traverses the speciation at the end of 6 implying at least one loss and also (2) a duplication scenario, where e maps to the root, g is a duplication above it, the position of f is unchanged, but at least four losses have occurred. The probability of extinction \( Q_e(t) \) and the propagator \( Q_{ef}(t, t^{\prime}) \) can be used to construct the probability of a given reconciliation as shown for the black subtree of G. Because the probability of a reconciliation can be hierarchically decomposed into the product of the probabilities of the reconstructions of the subtrees of G, a dynamic programming algorithm can be derived that is able to calculate the sum or maximum of the probability over all reconciliations.

As illustrated in Fig. 4, the same gene tree can be reconciled in different ways with the species tree. The probability of extinction \( Q_e(t) \) and the propagator \( Q_{ef}(t, t^{\prime}) \), together with rates of origination, duplication, and transfer, can be used to calculate the probability of a gene tree topology for an arbitrary reconciliation (here, we present an example with a rooted gene tree; however, the position of the root can be considered to be part of the reconciliation without changing the complexity of the dynamic programming algorithm). For this probability to be useful, however, we must be able to either sum over all reconciliations,

$$ p(G|S^{\prime},{\mathcal{M}_{\rm{BD}}}) = \sum\limits_{{r \in \Omega }} {p(G|S^{\prime},{\mathcal{M}_{\rm{BD}}},r)}, $$
(4)
to obtain the probability of G given S′ and \( {\mathcal{M}_{\rm{BD}}} \), or alternatively be able to find the most likely reconciliation allowing the calculation of:
$$ {p_{{\max }}}(G|S^{\prime},{\mathcal{M}_{\rm{BD}}}) = \mathop{{\max }}\limits_r p(G|S^{\prime},{\mathcal{M}_{\rm{BD}}},r). $$
(5)

The probability of a reconciliation can be hierarchically decomposed into the product of probabilities of the reconstructions of subtrees of G. This allows the construction of a dynamic programming algorithm that can efficiently sum or take the maximum over reconciliations, allowing the calculation of both Eqs. 4 and 5. Furthermore, the same dynamic programming scheme can be used to calculate the most parsimonious reconciliation given costs of the possible events with reduced complexity (57).
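The reason one recursion serves both Eq. 4 and Eq. 5 is that sums and maxima of nonnegative numbers both distribute over products. The toy example below makes this concrete: a small gene tree whose nodes may each be mapped to one of three branches, with the weight of a mapping given by a product of local factors. The factors are arbitrary made-up numbers and not the probabilities of the DTL model; the tree, the names, and the weights are all assumptions. The dynamic programming result is compared with brute-force enumeration of all mappings.

```python
import math
from functools import lru_cache
from itertools import product

# Toy gene tree: node -> children (leaves have none). Node names are hypothetical.
CHILDREN = {"g": ["e", "f"], "e": [], "f": ["a", "b"], "a": [], "b": []}
ROOT = "g"
BRANCHES = (0, 1, 2)  # hypothetical species-tree branches a node may be mapped to


def root_weight(branch):
    """Made-up weight for placing the gene tree root on a branch."""
    return 1.0 / len(BRANCHES)


def local_factor(node, branch, parent_branch):
    """Made-up positive weight for mapping node to branch given its parent's branch."""
    return 1.0 / (2 + abs(branch - parent_branch) + ord(node) % 3)


def dynamic_programming(combine):
    """Combine over all mappings with combine = sum (as in Eq. 4) or max (as in Eq. 5)."""
    @lru_cache(maxsize=None)
    def up(node, branch):
        value = 1.0
        for child in CHILDREN[node]:
            value *= combine(
                local_factor(child, cb, branch) * up(child, cb) for cb in BRANCHES
            )
        return value

    return combine(root_weight(b) * up(ROOT, b) for b in BRANCHES)


def brute_force(combine):
    """Enumerate every mapping of nodes to branches explicitly."""
    nodes = list(CHILDREN)

    def weight(mapping):
        w = root_weight(mapping[ROOT])
        for node, kids in CHILDREN.items():
            for child in kids:
                w *= local_factor(child, mapping[child], mapping[node])
        return w

    return combine(
        weight(dict(zip(nodes, choice)))
        for choice in product(BRANCHES, repeat=len(nodes))
    )


print(dynamic_programming(sum), brute_force(sum))
print(dynamic_programming(max), brute_force(max))
```

The brute-force enumeration grows exponentially with the number of gene tree nodes, whereas the memoized recursion visits each (node, branch) pair once.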

3.4 Hierarchical Probabilistic Models of Duplication, Transfer, and Loss

Using the above dynamic programming algorithm, it is possible to calculate the likelihood of a species tree topology S′ and the parameters describing \( {\mathcal{M}_{\rm{BD}}} \), i.e., rates of duplication, transfer, and loss on its branches, given a forest of gene trees obtained from homologous gene families:
$$ L(S^{\prime},{\mathcal{M}_{\rm{BD}}}|\{ {G_f}\} ) = \prod\limits_{{f \in {\rm{families}}}} {p({G_f}|S^{\prime},{\mathcal{M}_{\rm{BD}}})}, $$
(6)
$$ {\hbox{where}}\quad {G_f} = \mathop{{\arg \max }}\limits_G \{ L(G|{\hbox{MSA}}\;{\hbox{of}}\;f)\}, $$
and the product goes over the set of most likely gene trees \( \{ {G_f}\} \) encoding the sequence information in families of homologous genes composing a set of genomes. This expression can be thought of as being similar to the classic likelihood of a gene tree topology G and some model of sequence evolution \( {\mathcal{M}_{{{\rm{seq}}{.}}}} \) with parameters, such as branch-wise substitution rates, given a multiple sequence alignment:
$$ L(G,{\mathcal{M}_{{{\rm{seq}}{.}}}}|{\hbox{MSA}}) = \prod\limits_{{i \in {\rm{sites}}}} {p({\hbox{column}}\;i{\hbox{ of MSA}}|G,{ }{\mathcal{M}_{{{\rm{seq}}{.}}}})}, $$
(7)
where in this case the product goes over columns of homologous sites composing an MSA. In Fig. 5, we present results obtained using such an approach, where we have kept the species tree topology fixed and maximized the likelihood given by Eq. 6 over the space of possible orders in time of speciations and uniform rate parameters. We can see that the inferred ratio of birth to death is in good agreement with that obtained from phylogenetic profiles (see Table 1). In contrast to the results from profiles, however, once additional information from the sequences of the proteins in homologous families is taken into consideration in the form of gene tree topologies, we infer for both phyla considered that the majority of birth events are the result of transfer.
Fig. 5.

Relative rates of duplication, transfer, and loss for two prokaryotic phyla. The results were obtained by maximum likelihood using reference trees inferred from concatenated alignments of universal and near-universal genes and all homologous gene families with trees available in version 4 of the HOGENOM database (17). These results show that while the ratio of birth to death is practically identical, taking into consideration phylogenetic information from gene trees, the majority of birth events are inferred to have resulted from transfer and not duplication, in contrast to results obtained from phylogenetic profiles (see Table 1). The histograms correspond to results obtained for 1,000 jackknife samples of 20% of all trees (see Chap. 20 of ref. 42 for a discussion of resampling). The calculation was implemented using results from refs. 56 and 57. We kept the species tree topology fixed and maximized Eq. 6 over the space of possible orders in time of speciations and uniform rate parameters. We assumed each branch of S′ to have branch lengths compatible with the time order of speciations with all time slices being of equal width and inferred global rates of duplication, transfer, and loss.

This scheme has two shortcomings. First, instead of complete sequence information, only the most likely gene tree topologies are considered. Second, global information on how likely different gene tree topologies are given S′ and \( {\mathcal{M}_{\rm{BD}}} \) is not considered. Both of these shortcomings can be addressed by combining Eqs. 6 and 7 in a hierarchical likelihood framework. Using such a framework allows us to use global information on the species phylogeny and the birth-and-death process, together with sequence information from each family, to improve gene trees, while at the same time inferring the species phylogeny and the parameters of the birth-and-death process. Such a hierarchical framework was first suggested by Maddison (59) and has recently been implemented using a duplication and loss model (excluding transfer) (60) and models of transfer (excluding duplication and loss) (61, 62). The dynamic programming approach presents the first opportunity to construct a hierarchical model that considers all three processes. That is, we can express the likelihood of S′, \( \{ {G_f}\} \), and \( {\mathcal{M}_{\rm{BD}}} \) given a set of homologous gene families as
$$ L(S^{\prime},\{ {G_f}\}, {\mathcal{M}_{\rm{BD}}}|{\hbox{families}}) = \prod\limits_{{f \in {\rm{families}}}} p ({G_f}|S^{\prime},{\mathcal{M}_{\rm{BD}}}) \times L({G_f}|{\hbox{MSA}}\;{\hbox{of}}\;f). $$
(8)

It is important to note that this hierarchical likelihood function is amenable to parallel computation, because the \( p({G_f}|S^{\prime},{\mathcal{M}_{\rm{BD}}}) \times L({G_f}|{\hbox{MSA}}\;{\hbox{of}}\;f) \) terms can be computed independently by client nodes. It is possible to implement an efficient optimization scheme consisting of a hierarchical optimization loop, wherein clients optimize the individual gene trees \( G_f \) using the independent terms in the hierarchical likelihood product while keeping S′ and \( {\mathcal{M}_{\rm{BD}}} \) fixed, until conditionally optimal gene trees are attained; these are then used to optimize S′ and \( {\mathcal{M}_{\rm{BD}}} \).
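The structure of such a loop can be sketched with a toy stand-in for each ingredient. In the example below, which is a schematic under stated assumptions and not an implementation of the DTL model, each family has a handful of candidate gene trees, each summarized by a hypothetical sequence log-likelihood and a hypothetical count of inferred events; the term \( p({G_f}|S^{\prime},{\mathcal{M}_{\rm{BD}}}) \) is replaced by a Poisson probability of that count with a single global rate; and the species tree is held fixed. Threads from the Python standard library play the role of client nodes.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical candidates per family: (sequence log-likelihood, number of events).
FAMILIES = {
    "fam1": [(-100.0, 0), (-101.0, 2), (-103.0, 3)],
    "fam2": [(-250.0, 5), (-250.5, 2)],
    "fam3": [(-80.0, 1), (-80.2, 4)],
}


def log_prior(n_events, rate):
    """Stand-in for log p(G_f | S', M_BD): Poisson log-probability of the event count."""
    return n_events * math.log(rate) - rate - math.lgamma(n_events + 1)


def best_candidate(candidates, rate):
    """Client step for one family: maximize its own term with the global rate fixed."""
    return max(candidates, key=lambda c: c[0] + log_prior(c[1], rate))


def joint_loglik(chosen, rate):
    """Log of the hierarchical likelihood: a sum of independent per-family terms."""
    return sum(ll + log_prior(n, rate) for ll, n in chosen)


def optimize(families, rate=1.0, max_rounds=20):
    history = []
    with ThreadPoolExecutor() as pool:
        for _ in range(max_rounds):
            # Client step: families are independent given the rate, so they run in parallel.
            chosen = list(pool.map(lambda c: best_candidate(c, rate), families.values()))
            # Server step: with gene trees fixed, the Poisson rate has a closed-form optimum.
            rate = max(sum(n for _, n in chosen) / len(chosen), 1e-6)
            history.append(joint_loglik(chosen, rate))
            if len(history) > 1 and math.isclose(history[-1], history[-2]):
                break
    return chosen, rate, history


chosen, rate, history = optimize(FAMILIES, rate=3.0)
print(chosen, rate, history)
```

Because each step maximizes the joint objective over one block of variables with the other held fixed, the recorded log-likelihood cannot decrease from one round to the next, although the loop may stop at a local optimum.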

4 Conclusion

In conclusion, the distributions of homologous gene family sizes in the genomes of the eukaryota, archaea, and bacteria show astonishingly similar shapes. These distributions are best described by models of gene family size evolution, where the loss rates of individual genes are larger than their duplication rate but new families are continually supplied to the genome by a process of origination that in general includes both transfer and the generation of new gene families. This picture is supported by analysis of phylogenetic profiles using maximum likelihood. Taking into consideration additional information from the sequences of the proteins in homologous families in the form of gene tree topologies, the inferred ratio of birth to death is found to be in good agreement with that obtained from phylogenetic profiles; however, in prokaryotes, the majority of birth events is inferred to be the result of transfer.

It has not been demonstrated to date that a single tree can adequately describe the evolution of entire genomes across the diversity of life, and certainly no such tree has been inferred. However, the recent advances in the construction and implementation of hierarchical probabilistic models of duplication, transfer, and loss presented here have the potential to allow us to undertake this project of inferring genome trees based on sequence information from complete genomes. While this task is currently computationally daunting, the use of parallel computing and recent advances in algorithms hold the promise of making it feasible in the foreseeable future.

From a biological perspective, birth-and-death models of gene family size evolution are essentially neutral models of evolution. They ignore completely the individuality of gene families and any potential selective forces that make some of them expendable and others indispensable. The fact that they nonetheless accurately reproduce the observed family size distributions suggests that genome evolution, at least on this coarse scale of observation, might be in large part the result of a stochastic process that is only modulated by selection (6, 19). Even so, as soon as we are able to better reconstruct the pattern and process of duplication, transfer, and loss, we can expect to observe more and more of this modulation by selection and, by proxy, to learn more about the biology of genome evolution over large timescales, so as to better understand the population genetic, biochemical, and ecological constraints and opportunities that govern the evolution of genomes in general and the transfers of genes in particular. This requires integrating information reconstructed from ancestral genomes and DTL events with system-level models of phenotype, such as metabolic networks (3, 63).

5 Exercises

  1. Using log-log axes on the range \( [0.1, 10^{6}] \), plot the following functions: \( e^{-x} \), \( e^{-x/10} \), \( e^{-x/100} \), \( e^{-x/1000} \), \( x^{-1} \), \( x^{-3} \), \( x^{-9} \), and observe how power-law-like tails decay much more slowly than any exponential function.
  2. Using both the COG (http://www.ncbi.nlm.nih.gov/COG) and the HOGENOM (http://pbil.univ-lyon1.fr/databases/HOGENOM) databases, construct the histogram in Fig. 1 of the frequency of homologous gene family sizes in the human genome, i.e., the fraction \( f_n \) of times you see a family of size n among all homologous gene families in the human genome.
  3. Using the result that the stationary distribution \( p_n \) of family sizes is reached exponentially fast and assuming that this occurs according to the relationship \( |p_n(t) - p_n| \propto e^{-(\delta + \lambda)t} \), considering the rates of duplication and loss from Table 1 of ref. 6, estimate the amount of time (in units of percentage of divergence at silent sites) that the distribution of family sizes needs to reach the stationary distribution following a perturbation. Is this number large or small? Which organisms can be described by such divergence in comparison to the human genome?
  4. Using the form of the transient distribution for the linear duplication–loss process given in Table 1 of ref. 38, express the duplication rate δ and the loss rate λ using the fraction of families with 0 genes and the mean number of genes in a family.
  5. Write down the differential equation giving \( p_n(t) \), the probability of a family having size n at time t, using only the probabilities \( p_{n-1}(t) \) and \( p_{n+1}(t) \) and the rates of duplication \( \delta_n = \delta n \) and loss \( \lambda_n = \lambda n \) (note that the case \( p_0(t) \) needs to be treated differently; the solution can be found in ref. 20).
  6. Using the transient distribution of the duplication–loss process from Table 1 and the results of Lemma 1 in ref. 38, and assuming the species tree to be ((A:y, B:y):x, C:x + y) with branch lengths x, y in arbitrary units of time (see ref. 42 for a description of the Newick format), a duplication rate of δ, a loss rate of λ, and assuming the probability of the number of genes in a family at the root of the tree to be given by a Poisson distribution with mean \( n_0 \), further limiting the number of genes at internal nodes to a maximum of M genes, derive the probability of observing a profile \( \{ n_A, n_B, n_C \} \).
  7. In what respect would including gain introduce significant new complications in the above calculations?
  8. Considering only duplications and losses (excluding transfer), express \( Q_{ef}(t, t^{\prime}) \) using transient distributions from ref. 38 and the extinction probability \( Q_e(t) \).

References

  1. Crick, F. H. (1968) The origin of the genetic code. J Mol Biol, 38, 367–79.
  2. Theobald, D. L. (2010) A formal test of the theory of universal common ancestry. Nature, 465, 219–22.
  3. Boussau, B. and Daubin, V. (2010) Genomes as documents of evolutionary history. Trends Ecol Evol, 25, 224–32.
  4. Koonin, E. V. and Wolf, Y. I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res, 36, 6688–719.
  5. Long, M., Betrán, E., Thornton, K., and Wang, W. (2003) The origin of new genes: glimpses from the young and old. Nat Rev Genet, 4, 865–75.
  6. Lynch, M. (2007) The origins of genome architecture. Sinauer Associates.
  7. Lerat, E., Daubin, V., Ochman, H., and Moran, N. A. (2005) Evolutionary origins of genomic repertoires in bacteria. PLoS Biol, 3, e130.
  8. Gogarten, J. P. and Townsend, J. P. (2005) Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol, 3, 679–87.
  9. Lynch, M. and Conery, J. S. (2003) The origins of genome complexity. Science, 302, 1401–4.
  10. Siew, N. and Fischer, D. (2003) Analysis of singleton orfans in fully sequenced microbial genomes. Proteins, 53, 241–51.
  11. Daubin, V. and Ochman, H. (2004) Bacterial genomes as new gene homes: the genealogy of orfans in E. coli. Genome Res, 14, 1036–42.
  12. Huynen, M. A. and van Nimwegen, E. (1998) The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol, 15, 583–9.
  13. Qian, J., Luscombe, N. M., and Gerstein, M. (2001) Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J Mol Biol, 313, 673–81.
  14. Karev, G. P., Wolf, Y. I., Rzhetsky, A. Y., Berezovskaya, F. S., and Koonin, E. V. (2002) Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol Biol, 2, 18.
  15. Molina, N. and van Nimwegen, E. (2009) Scaling laws in functional genome content across prokaryotic clades and lifestyles. Trends Genet, 25, 243–7.
  16. Koonin, E. V., Wolf, Y. I., and Karev, G. P. (2006) Power laws, scale-free networks and genome biology. Molecular biology intelligence unit, Landes Bioscience/Eurekah.com.
  17. Penel, S., Arigon, A.-M., Dufayard, J.-F., Sertier, A.-S., Daubin, V., Duret, L., Gouy, M., and Perrière, G. (2009) Databases of homologous gene families for comparative genomics. BMC Bioinformatics, 10 Suppl 6, S3.
  18. Novozhilov, A. S., Karev, G. P., and Koonin, E. V. (2006) Biological applications of the theory of birth-and-death processes. Brief Bioinform, 7, 70–85.
  19. Koonin, E. V., Wolf, Y. I., and Karev, G. P. (2002) The structure of the protein universe and genome evolution. Nature, 420, 218–23.
  20. Reed, W. J. and Hughes, B. D. (2004) A model explaining the size distribution of gene and protein families. Math Biosci, 189, 97–102.
  21. Csűrös, M. and Miklós, I. (2009) Streamlining and large ancestral genomes in archaea inferred with a phylogenetic birth-and-death model. Mol Biol Evol, 26, 2087–95.
  22. Yule, G. U. (1925) A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character, 213, 21–87.
  23. Feller, W. (1939) Die Grundlagen der Volterraschen Theorie des Kampfes ums Dasein in wahrscheinlichkeitstheoretischer Behandlung. Acta Biotheoretica Series A, 5, 11–39.
  24. Kendall, D. G. (1948) On the generalized “birth-and-death” process. The Annals of Mathematical Statistics, 19, 1–15.
  25. Bartholomay, A. (1958) On the linear birth and death processes of biology as Markoff chains. Bulletin of Mathematical Biology, 20, 97–118.
  26. Takács, L. (1962) Introduction to the theory of queues. Oxford University Press.
  27. Ota, T. and Nei, M. (1994) Divergent evolution and evolution by the birth-and-death process in the immunoglobulin VH gene family. Mol Biol Evol, 11, 469–82.
  28. Nei, M., Gu, X., and Sitnikova, T. (1997) Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proc Natl Acad Sci USA, 94, 7799–806.
  29. Yanai, I., Camacho, C. J., and DeLisi, C. (2000) Predictions of gene family distributions in microbial genomes: evolution by gene duplication and modification. Phys Rev Lett, 85, 2641–4.
  30. Hughes, A. L., Ekollu, V., Friedman, R., and Rose, J. R. (2005) Gene family content-based phylogeny of prokaryotes: the effect of criteria for inferring homology. Syst Biol, 54, 268–76.
  31. Wójtowicz, D. and Tiuryn, J. (2007) Evolution of gene families based on gene duplication, loss, accumulated change, and innovation. J Comput Biol, 14, 479–95.
  32. Fitz-Gibbon, S. T. and House, C. H. (1999) Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res, 27, 4218–22.
  33. Snel, B., Bork, P., and Huynen, M. A. (1999) Genome phylogeny based on gene content. Nat Genet, 21, 108–10.
  34. Wolf, Y. I., Rogozin, I. B., Grishin, N. V., and Koonin, E. V. (2002) Genome trees and the tree of life. Trends Genet, 18, 472–9.
  35. Deeds, E. J., Hennessey, H., and Shakhnovich, E. I. (2005) Prokaryotic phylogenies inferred from protein structural domains. Genome Res, 15, 393–402.
  36. Lienau, E. K., DeSalle, R., Rosenfeld, J. A., and Planet, P. J. (2006) Reciprocal illumination in the gene content tree of life. Syst Biol, 55, 441–53.
  37. Mirkin, B. G., Fenner, T. I., Galperin, M. Y., and Koonin, E. V. (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol Biol, 3, 2.
  38. Csűrös, M. and Miklós, I. (2009) Mathematical framework for phylogenetic birth-and-death models. arXiv, p. 0902.0970.
  39. Hahn, M. W., De Bie, T., Stajich, J. E., Nguyen, C., and Cristianini, N. (2005) Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res, 15, 1153–60.
  40. Spencer, M., Susko, E., and Roger, A. J. (2006) Modelling prokaryote gene content. Evol Bioinform Online, 2, 157–78.
  41. Iwasaki, W. and Takagi, T. (2007) Reconstruction of highly heterogeneous gene-content evolution across the three domains of life. Bioinformatics, 23, i230–9.
  42. Felsenstein, J. (2004) Inferring phylogenies. Sinauer Associates.
  43. Csűrös, M. (2010) Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics, 26, 1910–2.
  44. Jeffroy, O., Brinkmann, H., Delsuc, F., and Philippe, H. (2006) Phylogenomics: the beginning of incongruence? Trends Genet, 22, 225–31.
  45. Galtier, N. and Daubin, V. (2008) Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond B Biol Sci, 363, 4023–9.
  46. Daubin, V., Moran, N. A., and Ochman, H. (2003) Phylogenetics and the cohesion of bacterial genomes. Science, 301, 829–32.
  47. Ochman, H., Lerat, E., and Daubin, V. (2005) Examining bacterial species under the specter of gene transfer and exchange. Proc Natl Acad Sci USA, 102 Suppl 1, 6595–9.
  48. Beiko, R. G., Harlow, T. J., and Ragan, M. A. (2005) Highways of gene sharing in prokaryotes. Proc Natl Acad Sci USA, 102, 14332–7.
  49. Puigbò, P., Wolf, Y. I., and Koonin, E. V. (2009) Search for a ‘tree of life’ in the thicket of the phylogenetic forest. J Biol, 8, 59.
  50. Puigbò, P., Wolf, Y. I., and Koonin, E. V. (2012) Genome-wide comparative analysis of phylogenetic trees: the prokaryotic forest of life. In Anisimova, M. (ed.), Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York.
  51. Goodman, M., Czelusniak, J., Moore, W., Herrera, R., and Matsuda, G. (1979) Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Systematic Zoology, 28, 132–163.
  52. Hallett, M., Lagergren, J., and Tofigh, A. (2004) Simultaneous identification of duplications and lateral transfers. RECOMB ’04: Proceedings of the eighth annual international conference on Research in computational molecular biology, New York, NY, USA, pp. 347–356, ACM.
  53. Abby, S. S., Tannier, E., Gouy, M., and Daubin, V. (2010) Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics, 11, 324.
  54. Nakhleh, L., Ruths, D., and Wang, L.-S. (2005) RIATA-HGT: A fast and accurate heuristic for reconstructing horizontal gene transfer. In Wang, L. (ed.), Computing and Combinatorics, vol. 3595 of Lecture Notes in Computer Science, pp. 84–93, Springer Berlin / Heidelberg.
  55. Beiko, R. G. and Hamilton, N. (2006) Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol, 6, 15.
  56. Tofigh, A. (2009) Using Trees to Capture Reticulate Evolution: Lateral Gene Transfers and Cancer Progression. Ph.D. thesis, KTH, School of Computer Science and Communication.
  57. Doyon, J.-P., Scornavacca, C., Gorbunov, K. Y., Szöllősi, G. J., Ranwez, V., and Berry, V. (2010) An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. Proceedings of RECOMB Comparative Genomics, to appear.
  58. David, L. A. and Alm, E. J. (2011) Rapid evolutionary innovation during an archaean genetic expansion. Nature, 469, 93–6.
  59. Maddison, W. P. (1997) Gene trees in species trees. Systematic Biology, 46, 523–536.
  60. Akerborg, O., Sennblad, B., Arvestad, L., and Lagergren, J. (2009) Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci USA, 106, 5714–9.
  61. Suchard, M. A. (2005) Stochastic models for horizontal gene transfer: taking a random walk through tree space. Genetics, 170, 419–31.
  62. Bloomquist, E. W. and Suchard, M. A. (2010) Unifying vertical and nonvertical evolution: a stochastic ARG-based framework. Syst Biol, 59, 27–41.
  63. Wagner, A. (2009) Evolutionary constraints permeate large metabolic networks. BMC Evol Biol, 9, 231.
  64. Anderson, C., Liu, L., Pearl, D., and Edwards, S. V. (2012) Tangled Trees: The Challenge of Inferring Species Trees from Coalescent and Non-Coalescent Genes. In Anisimova, M. (ed.), Evolutionary genomics: statistical and computational methods.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. UMR CNRS 5558, LBBE, “Biometrie et Biologie Evolutive”, UCB Lyon 1, Villeurbanne, France
