1 Atomic Events in Bacterial Microevolution

Bacterial microevolution is the study of the evolutionary forces that shape the genetic diversity of a natural population of bacteria. This evolutionary process takes place as a result of the genetic changes happening within each of the genomes of the bacterial cells constituting the population. Over time, these changes are amplified or weakened by the effects of both genetic drift and natural selection. Genetic drift represents the evolution caused by the death and birth of cells in the bacterial population, and it acts at random on all genetic variants (Charlesworth 2009). The effects of genetic drift are higher when the population size is small, and so it could be thought given the large number of cells in bacterial populations that genetic drift would be weak. However, bacterial populations sometimes go through punctual bottlenecks during which genetic drift has a large effect, for example during transmission of pathogens from one host to another (Didelot et al. 2016). It is also believed that the strong structure of bacterial habitat, sometimes at the microscale can lead to much smaller effective population sizes than intuition suggests (Vos et al. 2013). Natural selection on the other hand acts in a nonrandom fashion, amplifying some variations and suppressing others, and is a very potent evolutionary force in shaping the diversity of bacterial species (Petersen et al. 2007; Buckee et al. 2008; Pepperell et al. 2013).

The genetic changes occurring on a single bacterial cell can be classified into mutation and recombination events, and the events of interest differ whether the focus is on the core genome (the regions shared by all genomes in the population) or the accessory genome (the regions that are found in some but not all of the genomes). As far as the core genome is concerned, the main type of mutation is the point mutation, whereby a single nucleotide is replaced, and the main type of recombination is called homologous recombination, in which a relatively short fragment of the genome is replaced with a homologous fragment coming from another bacterial cell (Didelot and Maiden 2010). There are three biological mechanisms that can lead to homologous recombination, namely conjugation (where two bacterial cells come in contact so that DNA can be transmitted from donor to recipient), transduction (where a phage acts as vector from donor to recipient) and transformation (where naked DNA is picked up by the recipient from the environment, possibly following the death of the donor cell) (Thomas and Nielsen 2005). But since their outcomes are hard to distinguish this diversity of mechanisms is usually ignored in evolutionary models of homologous recombination.

Point mutation and homologous recombination events clearly act on the evolution of the accessory genome in the same way as they do for the core genome. However, they do not change the genetic content of core genomes. There are two types of endogenous mutations that can alter the genetic content of a genome, duplication and deletion, and they can be thought of as opposite forces, with the former increasing the number of copies of a gene by one and the latter decreasing it by one. Finally, the accessory genome is also subject to non-homologous recombination, where a bacterial cell imports a DNA fragment from another cell and inserts it in its genome, without overwriting a previously existing homologous fragment (Ochman et al. 2000). Non-homologous recombination is often called lateral gene transfer or horizontal gene transfer, and in this chapter we will be using these three terms interchangeably. It should be noted, however, that this terminology is not always consistently used in the literature, with some authors using the term horizontal gene transfer to refer to both homologous and non-homologous recombination.

The three biological mechanisms mentioned above for homologous recombination (conjugation, transduction, and transformation) can lead to non-homologous recombination and once again, it is helpful when studying the bacterial microevolution of the accessory genome to set aside the mechanism at play. Likewise, genetic duplication and deletion can have multiple causes that we will not explore. It should in fact be noted that even though it is useful to present and study them as separate, the atomic evolutionary events briefly described above are not biologically independent (Lawrence 1999; Everitt et al. 2014; Oliveira et al. 2017). For example, a single event of recombination could involve the replacement of some genes (homologous recombination), the insertion of new genes (non-homologous recombination) and the loss of some other genes (genetic deletion).

Furthermore, non-homologous recombination can sometimes be duplicative if the newly imported material is homologous to a sequence found somewhere else in the genome. In this case, the number of copies of the genes concerned is increased by one, as in a duplication event. The evolutionary distance between donor and recipient of such a non-homologous recombination event is then a crucial factor: if this distance is small the effect is similar to a duplication event, which can be seen as a transfer event where recipient and donor are the same organism. If distance between organisms is high, then the difference between the newly imported copy and the copy already present will likely be high too, providing a clear sign that duplication was not involved. This situation is analogous to the detection of homologous recombination in the core genome, where events from a closely related source do not leave a trace, or perhaps involve just a single substitution in which case they are undistinguishable from point mutation (Didelot et al. 2010).

2 Neutral Phylogenetic Framework of Bacterial Pangenome Microevolution

2.1 Challenges with a Comprehensive Model

The microevolutionary events that act on the bacterial pangenome, as briefly described above, can be combined into an evolutionary model of how the pangenome evolves over time. Let us consider a comprehensive model, which would account for the whole population of bacterial cells, including the fact that cells die and reproduce over time (so that genetic drift is included) and that various selective pressures are exerted. In this model, the genome of each cell is affected by various mutation and recombination events, all of which happens at a certain rates over time for each cell. All the rates involved in this model (birth and death of individuals, selection for specific variants and various evolutionary events) would not be assumed to be constant, but would be allowed to vary over time. This model falls in the class of forward-in-time models, due to the fact that it considers evolution as it unfolds over time, and famous examples of such models in the general population genetics literature are the Wright–Fisher model (Fisher 1931; Wright 1931) and the Moran model (Moran 1958). Figure 1 illustrates such a forward-in-time model of pangenome evolution.

Fig. 1
figure 1

Illustration of the forward-in-time evolution of a bacterial population and its pangenome. At each time step, an individual is removed and another gives birth as in a standard Moran model. Furthermore, at each time step the accessory genome of each individual may evolve via deletion (orange cross), duplication (red square) and recombination (orange arrow)

The idea of this comprehensive model is to replicate exactly the processes that we know occur in nature, so that it is of the highest possible realism. However, a comprehensive model would also integrate the diversification process of the whole community of microorganism found at a given spot, with the impact of their biotic interactions and genetic exchange, but most importantly, of the competitive process leading to natural selection of the fittest. Such level of description of natural processes would render the model impossible to use, and that is why it is not found in the literature. It is educative though to ask ourselves why this model is unusable, as this will guide us towards more practical models that feature some of these ideal properties.

The first problem with this comprehensive model is computational: it would require very large amounts of computer memory to store the state of a population at a single time point, even much more so to track its evolution over time, and an equally impossible amount of computer power to consider the evolutionary events happening to all members of the population. But even more importantly, there is statistical problem with the comprehensive model, in the sense that there are too many unknown quantities involved, for which we would not be able to take even a very rough guess at what their value might be. Therefore, even if the computational problems could be overcome, and analysis conducted under the comprehensive model, the results would be worthless since the quantities to be estimated would be unidentifiable. Simplifications will therefore have to be made to reach a model that has practical use, with the best model being not the most comprehensive one, but the one that achieves the best trade-off between biological realism on one hand, and computational and statistical considerations on the other hand.

Beyond the degree of complexity of a model and the search for a trade-off between computational efficiency and model realism, models may rely on different conceptual formalisation of bacterial genomes and their evolutionary process. These different concepts will generate different approaches and methods that are in general complementary. We will thus present different elements of phylogenetic models of pangenome evolution, which flavours may be combined to provide a variety of practical models.

2.2 Analysing Selection Based on Neutral Models

Perhaps the greatest challenge posed by the comprehensive model above would be its attempt at encompassing the role of natural selection. As previously mentioned, it is clear that natural selection plays a crucial role in shaping the microevolution of the pangenomes of bacterial populations, but the effect of this force is different for all genes or nucleotidic site and their allelic variants, may vary significantly over time, and be different for different segments of the populations, for example if some lineages are adapted to a certain environment. Such adaptation of a lineage will involve many traits distributed in the pangenome of that population, and new mutation arising in this background might interact with it; this leads to complex epistatic (i.e. non-additive fitness) interactions between genomic traits, affecting the probability of selecting new genetic variants in one or another genomic background—a process that could add infinite degrees of complexity to the exhaustive model. Model design can, therefore, be greatly simplified by considering no effect of selection, or in other words neutrality of evolution.

Even if the role of selection is not explicitly included in a model, it does not mean that analyses based on this model are completely uninformative about selection. Neutral evolutionary models provide a framework to search for evidence of natural selection. This can be achieved formally by contrasting observed patterns in compared genomic data to expectations under neutral models. Another approach to detect selection is to fit a neutral model to genomic data, having heterogeneous parameters to describe the evolution process of each species lineage and/or gene family; outlier species or genes with ‘abnormally’ high or low parameters can provide a clue to non-neutral processes taking place. Similarly, the identification of historical changes of processes (e.g. acceleration or slowing down of diversification rates) in the scenario of pangenome evolution can provide strong clues of selection affecting the species lineage or gene under focus (Boussau et al. 2004; David and Alm 2011; Lassalle et al. 2017).

This approach is similar to the way that the role of selection is being investigated in the core genome. In this more frequently explored setting, a typical pipeline (Hedge and Wilson 2016) involves reconstructing a phylogenetic tree, classifying substitutions in terms of whether they are synonymous or not and estimating evolutionary rates so that selection tests can be applied based on variations in the rates of synonymous and non-synonymous substitutions along the genome (Wilson and McVean 2006; Castillo-Ramírez et al. 2011) or between populations (McDonald and Kreitman 1991; Vos 2011). In this popular approach, the evolutionary models used to build the phylogenetic tree and reconstruct substitution events are purely neutral, but still lead to invaluable insights into the natural selection process.

2.3 Phylogenetic Approach

A neutral version of the comprehensive model described above would still be impossible to use in practice. A major difficulty is that it considers the evolution of the population forward-in-time, so that every single cell in the population has to be included. However, any dataset we may have available for analysis will only include a small fraction of the population, sampled typically at a single time point (or a few recent time points in the best-case scenario). However, considering the evolution of the whole population over time can appear wasteful, since most cells in the past would not have had any descendants surviving in the present-time sample. A much more tractable approach is therefore to only consider the genealogical process of the sampled genomes, which is a backward-in-time process. Under relatively mild assumptions, and without introducing too much approximation, this genealogical process can be described without reference to the whole forward-in-time process. In particular, the coalescent model (Kingman 1982) describes the genealogical process of a population following either the Wright–Fisher or the Moran model of forward-in-time evolution. Extensions of the basic coalescent model have been derived to deal with fluctuating population size (Griffiths and Tavare 1994), homologous recombination (Griffiths and Marjoram 1997), which for bacteria is analogous to gene conversion (Wiuf and Hein 2000), and many other forms of relaxation of the assumptions ruling the evolutionary process (Donnelly and Tavare 1995; Nordborg 2001; Rosenberg and Nordborg 2002).

Considering this genealogical process, and the ability to reconstruct it with relatively high accuracy from genome sequences, is pivotal to lead to a usable model of pangenome evolution. A simple approach is to focus on core genome elements and apply a standard phylogenetic method typically based on maximum likelihood or Bayesian inference under an evolutionary model of neutral point mutations (Yang and Rannala 2012). Bacteria reproduce clonally and most species recombine relatively rarely (Vos and Didelot 2009; Yang et al. 2018), so that this simple approach can often be sufficient for our purposes. Phylogenetic methods have also been developed that can account for the effect of homologous recombination while still reconstructing a single tree (Didelot and Wilson 2015; Croucher et al. 2015). Methods that attempt to reconstruct a graph of ancestry rather than a single tree are superior in principle, but rarely used in practice due to their high computational cost (Didelot et al. 2010; Vaughan et al. 2017).

In the context of a phylogenetic tree reconstructed from the core genome, we can consider the events of duplication, deletion and non-homologous recombination that shape the accessory genome. These events happen on the branches of the phylogeny at certain rates that may vary over time and lineages. Notwithstanding such remaining complexities, a phylogenetic model of bacterial pangenome microevolution represents a practical approach relative to the comprehensive forward-in-time model. The events that affect the accessory genome are relatively rare, which results in a strong phylogenetic inertia of genome gene contents, i.e. a large correlation between gene contents and core genome-based diversity (Konstantinidis et al. 2006; Kislyuk et al. 2011). Ignoring this effect would lead to strong misinterpretation of gene distribution patterns, especially in case of a diversity bias in genome sampling, e.g. when surveying a pathogen epidemics where clusters of closely related strains occur. Modelling the pangenome evolution within a phylogenetic framework where evolution of the gene content takes place along the genealogical tree avoids such pitfalls, in the same way as a phylogenetic framework avoids false conclusions to be reached when performing bacterial genome-wide association mapping (Collins and Didelot 2018).

3 Description of Pangenome Data for Inclusion in Microevolutionary Models

3.1 Units of Pangenome Evolution

In order to describe further the existing models of pangenome microevolution, it is necessary to consider the unit in which the pangenome is being described. Figure 2 illustrates the different approaches that have been used for that purpose. The ideal starting point would be a complete sequence of each genome of interest, but this is rarely available due to repeat regions in the genomes that obscure the exact ordering of sequences along the genomes, at least based on short read sequencing. For that reason, the most frequently used data is a de novo assembly of each genome, which can be performed, for example using Velvet (Zerbino and Birney 2008) or SPAdes (Bankevich et al. 2012). This results a set of genomic regions called contigs, which occur in an unknown order either on chromosomes or on plasmids.

Fig. 2
figure 2

(a) Datasets of raw genome sequence assemblies can be dealt with different approaches. (b) Whole-genome alignment of the compared genomes. Extensive homologous regions (syntenic blocks) are mapped in different colours. Orange indicate the core genome, the other colours indicate accessory syntenic blocks; the absence of a colour block indicate genome-specific (orphan) sequence. (c) From there, core and accessory genomes can be treated separately. An ungapped core genome alignment is produced (left hand, orange blocks; dotted vertical lines indicate insertion point of accessory material). Regions of abnormal sequence similarity (with respect to genome-wide divergence) are identified (hatched boxes) as clues of homologous recombination events. Presence of accessory genomic blocks is counted in each genome, an information that can be easily stored in a numeric matrix. (d) The clonal genealogy (orange tubes) is computed from the core genome alignment purged of recombined regions. Homologous recombination events and gain/loss of accessory genome blocks can be inferred by confronting this tree to the information gathered in (c). (e) Another approach consists of annotating the genes present in the genomes (e.g. protein-coding sequences) and to classify them onto homologous gene families based on sequence similarity. (f) The occurrence pattern of each family allow to classify them as core (when a single copy of the gene is present in every genome) or accessory. Aligning gene family sequences allow to infer gene-specific phylogenies, or gene trees. Core gene trees can be used to generate a consensus tree representative of the core genome history (species tree). Differences of topologies between the species tree and the individual core gene trees reveal genes subject to homologous recombination. (g) Ancestral states and evolution scenarios can be reconstructed for each gene family, through reconciliation of the gene family tree with the species tree under a model with events of gene duplication, transfer and loss. Sample scenarios for families f, b, k, c and d are presented, with the reconciled gene tree (colour line) embedded in the species tree (tube); shaded areas indicate events of whole-genome diversification, where gene speciation events occur. Note that simpler gene gain/loss scenarios and ancestral gene content reconstruction can be reconstructed based on gene family occurrence data using the method depicted in (d)

A first approach considers a genome alignment, where every part of a genome is assigned to a syntenic block—segments of genome sequences that are all homologous and can be aligned (Fig. 2b, c). These sequence segments can have boundaries falling anywhere in the genome, notably between or within protein-coding sequences, and the often span many genes. While this is a flexible view of pangenome evolution that is probably the most realistic—evolving genomes ignore human annotation of functional elements—it may be cumbersome to implement with a growing number of compared genome. Indeed, every genome added to the dataset may result in the breakage of a syntenic block into several parts due to insertion, deletions, or rearrangements in one of the homologous genome segments. Homologous genome segments can themselves be difficult to align at the nucleotide level when they include fast-evolving genome regions. For these reasons, even the best software for this task such as progressiveMauve (Darling et al. 2010) or MUGSY (Angiuoli and Salzberg 2011) can only deal with between 10 and 100 genomes, depending on how diverse they are. This first alignment approach works best on the well-conserved parts of pangenomes, i.e. the core genomes and possibly large conserved accessory regions of the genomes. This partial sampling of genome sequences is practical because it allows to represent the homology between genomes as a concatenated alignment of all these syntenic blocks, which amounts to a representative map of the genome. Alternatively, a representative whole-genome alignment can be obtained by mapping all homologous sequences in compared genomes to the genome of a reference isolate, using, for example MUMmer (Kurtz et al. 2004). This can result in reference-biased representation, which may be avoided by restricting the alignment to the core genome.

A second approach focuses on genes, or more specifically on families of homologous genes. These are usually defined based on sequence similarity and restricted to protein-coding sequences, even though it can be applied to conserved intergenic sequences as well (Fig. 2e). In this representation, rather than a reference whole-genome map, we consider independent gene families, which members need not be localised in a genome. The diversity of the gene family can conveniently be represented with a phylogenetic tree based on all nucleotide positions in the aligned genes, which allows the estimation of statistical support (Fig. 2f). This information can, in turn, be used to inform the ancestral reconstruction of genome gene content (as discussed below). There are several ways in which this gene family content identification can be performed. If a representative from each family is known in advance, similarity search tools like BLAST (Altschul et al. 1997) can be used to search them in each genome, and, for example BIGSdb automates this process (Jolley and Maiden 2010). Alternatively, each de novo assembled genome can be annotated separately, using, for example Prokka (Seemann 2014) or RAST (Aziz et al. 2008). Homologs can then be identified by using a combination of similarity search between genes from different genomes (e.g. with BLAST) and similarity network analysis, as implemented, for example in the software OrthoMCL (Li et al. 2003), with integrated pipelines implemented in software like Roary (Page et al. 2015) and MMseqs (Steinegger and Söding 2017).

A consequence of this distinction is the way genetic exchange between genomes is considered. In the nucleotide-centred vision, genetic exchange will result either in the replacement of a region (homologous recombination) or the insertion of a sequence at a defined position in the genome map (non-homologous recombination) (Fig. 2c). Similarly, a genetic duplication event will consist in recopying a segment of genome sequence and inserting it next to its template (tandem duplication) or away from it. Homologous recombination events can be evidenced based on a scan of the genome map, looking for increased or decreased sequence similarity (or phylogenetic relatedness) between compared genomes along the genome map (Didelot et al. 2007). Non-homologous recombination and duplication events consist of insertion events and are simply evidenced by some region being only represented in some genomes in the alignment—the others featuring a large ‘gap’, or long string of missing characters. Distinguishing non-homologous recombination from duplication events can be tricky: even comparing the inserted segment to the rest of the genome and finding a similar region is not conclusive that it would be the duplication template (or copy) of the studied region. Such a pattern could also result from a recombination with a related organism leading to the insertion of genetic material that had homologous counterpart already residing in the recipient genome. Not finding a similar region is not conclusive of the insertion resulting from a non-homologous recombination event either, as an ancient duplication followed by a loss (or deletion) in the compared lineage may result in the same pattern. The answer to this conundrum is modelling of the possible sequences of events, or scenarios, and determine the most likely based on patterns of sequence divergence.

In the second approach centred on genes, the exchange of genetic material is made most evident in the phylogeny of genes, or gene tree, because the gene from the recipient will be more closely related to genes from the donor than to genes from closely related species. In this context, the event is rather called horizontal gene transfer, in opposition to vertical evolution, which would have resulted in the ‘normal’ clustering of genes from closely related species. Again, this representation ignores the locus where genes sit, and it is therefore not straightforward to know from the gene tree whether the horizontal gene transfer event resulted in the replacement of a resident sequence or in the addition of a new one.

There are also other evolving units that can be considered as the basis of pangenome microevolution modelling, including conserved protein domains or short sequences of a constant length, which are also known as words, features, or k-mers (Sims and Kim 2011; Sheppard et al. 2013b). Some units may seem more natural than others from a theoretical point of view, but in practice all units have pros and cons, and the choice of unit is guided by the evolutionary resolution required by each pangenome investigation.

3.2 Granularity of Homologous Groups

When modelling the pangenome diversity with homologous gene families, a further distinction can be done on which homologous link to consider clustering genes into families. A popular approach is to consider orthology relationships. In theory, genes are orthologous when they are related only by events of speciation (i.e. diversification of the whole genome), not by duplication of horizontal gene transfer. Because the true course of gene diversification events is unknown, we must rely on practical definitions of orthology. This theoretical definition implies that two orthologues cannot occur in the same genome. A usual criteria is thus to look for the bidirectional best hit (BBH) in a similarity search of all the genes in a genome against all of the genes in another genome pairwise genome (Altenhoff and Dessimoz 2009).

This pairwise relationship can be used to build a network of genes covering the whole pangenome dataset, where cliques (groups where the found relationship is transitive among members) are recognised as clusters of orthologous genes or COGs (Tatusov et al. 1997). Using this practical definition, it is straightforward to classify any gene into a cluster, many of which will however be clusters of genes on their own: orphan genes with no homologues, but also those resulting from a recent transfer or duplication. By construction, these COGs can only be absent or present in a single copy in a genome, which proves very convenient for representing the distribution of genes in the pangenome by a genome-to-COG binary matrix filled with zeros and ones. This representation can be handled by many simple methods that model the transition between these binary states over the tree of the genomes, i.e. events of gene gain and loss (Fig. 3a) (Mirkin et al. 2003). This approach has been widely used, but suffers from its stringent definition that leaves many homologous genes out of COGs under scrutiny, which may strongly flaw the inferred ancestral genome gene contents and the derived conclusions on ancestral functional repertoires.

Fig. 3
figure 3

Schematic representation of scenarios of gene family evolution, inferred with varied levels of information input. (a, b) Phylogenetic profile: only the information of presence/absence of gene families is used to infer the scenarios of evolution. (a) Genes are classified into orthologous groups; (b) genes are classified into homologous groups, retaining the information that yellow and brown gene families are related, the emergence of the latter being explained by a duplication of the former. (c) Gene tree/species tree reconciliation: the information of topology of the gene tree is used, revealing a more complex scenario, with an additional transfer. (d) Synteny-aware reconciliation: the information of linkage of genes (synteny) is used, suggesting that apparently independent transfer events may have happen as one joint event

Instead, it is possible to consider a whole family of homologues, which distribution of the family in genomes can again be represented in a matrix of counts, where this time values range from zero to any integer number. Models of pangenome evolution can account for this multiplicity of gene copy number by invoking extra gene gain events (Fig. 3b) (Csurös 2008; Csurös and Miklós 2009). The nature of these gain events—duplication or horizontal gene transfer—is, however, not inferred as it fundamentally requires to know the phylogenetic relationships between genes within a homologous family.

3.3 Linkage of Genes and Syntenic Blocks

Notwithstanding the type of evolving unit considered (aligned genome segment or gene family), all units are usually considered to evolve independently on the phylogeny. This is, however, not always realistic given the high linkage disequilibrium found in bacterial genome—the non-independence of physically linked characters in evolution, a consequence of their clonal mode of reproduction.

Linkage can be introduced in a pangenome evolution model by specifying the location of genes on a genomic map. The evolution of genes on the map is then modelled through events of insertion, deletion and rearrangement. This map can relate the absolute position of genes in contemporary genomes (i.e. with nucleotide site coordinates) by chopping all genomes in the dataset into syntenic blocks, where homology and overall gene order between genomes is conserved (Vallenet et al. 2006; Darling et al. 2010). However, as mentioned above, the larger the genome sample, the more syntenic blocks will split and shrink. Based on such genome maps, the history of each syntenic blocks can be estimated, describing the ancestral events of pangenome evolution. Even though in theory the map evolves over time due to genome rearrangements (Darling et al. 2008), in practice the maps are assumed to be constant in order to allow to focus on fine-grained changes within the syntenic blocks. This assumption is commonly made, for example when investigating homologous recombination in the core genome (Didelot et al. 2010).

Another option is to map the relative position of smaller evolving units (usually gene families) in each genome of the dataset. Such a relative map can be represented by a matrix of presence or absence of a direct adjacency between genes in a given genome, contemporary or ancestral. This more abstract representation allows the use of incomplete data, such as draft genome assemblies, where the physical linkage of sequences is not fully or not unambiguously documented. The evolution of gene neighbourhood is modelled by invoking events of creation and breakage of adjacencies between neighbour genes, thereby modelling any insertion, deletion and rearrangement. Ancestral state reconstruction (see below) is then undertaken, by estimating a genome map at each ancestral node of a species phylogeny (Fig. 3d) (Bérard et al. 2012; Patterson et al. 2013; Duchemin et al. 2017). These models are, however, quite heavy computationally and may become overwhelmed by large structural diversity in the dataset.

4 Methodological Approaches to the Reconstructing Pangenome Microevolution

4.1 Ancestral State Reconstruction

The inference of ancestral genomes and corresponding gene gain and loss scenarios can be a complex and computationally intensive task, but it can also be simplified to the point that it becomes almost straightforward if the research questions are relatively simple. For example, using profiles of gene presence/absence in genomes and a phylogenetic tree as only input, ancestral state reconstruction can be applied to infer in which internal nodes of the tree the genes were present, and therefore on which branches the genes were gained and lost. For a general review on ancestral state reconstruction, see Joy et al. (2016). One of the simplest and most popular approach is to perform a parsimonious reconstruction, where the number of gain and loss events is minimised without the need to estimate any parameter (Mirkin et al. 2003). In practice, this is more or less equivalent to performing maximum likelihood inference under a model in which gain and loss happen at the same small rates. However, probabilistic modelling of state evolution has the interesting property to integrate over several possible scenarios. Even a maximum likelihood point estimate of the presence of a gene at a given ancestral node will therefore consist of a non-binary probability, a nuanced result allowing to consider the uncertainty in the ancestral reconstruction (Pagel 1999). A similar Bayesian approach is stochastic character mapping (Huelsenbeck et al. 2003), which consists in sampling gain and loss histories from their posterior probability distribution via a Monte Carlo method.

Ancestral state reconstruction is particularly well suited to analyses focused on specific genes rather than the whole pangenome, for example analysing the gain and loss of pathogenicity genes (Dingle et al. 2014) or resistance genes (Ward et al. 2014). It can also be applied more generally to all genes in a pangenome, and the rates of gain and loss would typically be assumed to be equal meaning that the genome size is at equilibrium (Touchon et al. 2009). Alternatively, the reconstruction can be based on genomic elements known to be gained and lost in one block, such as bacteriophages, plasmids, and integrative conjugational elements (Zhou et al. 2013). This represents one simple way of dealing with the linkage of genes mentioned previously, although at the cost of potentially losing information about the gene content evolution of the genomic elements assumed to be perfectly linked. At the other end of the spectrum, the reconstruction can be based on smaller elements than genes, for example k-mers, but in this case it becomes vital to relax the assumption of a fixed clock on gain and loss, for example using a local clock model (Didelot et al. 2009) as illustrated in Fig. 4. This technique has been applied to the pangenomes of Escherichia coli (Didelot et al. 2012) and Campylobacter jejuni (Sheppard et al. 2013a), showing in both cases a strong relationship between evolution of the accessory genome via gain and loss events and evolution of the core genome via homologous recombination.

Fig. 4
figure 4

Illustration of a pangenome gain and loss model with local clock. The clonal genealogy is shown in black. The width of the red block on the left of the branches is proportional to the rate of gain. Similarly, the blue block on the right of each branch represents the rate of loss. Both the gain and loss rates occasionally change over time. Individual gain events are represented by red arrows, and individual loss events are represented by blue arrows

An important drawback of ancestral state reconstruction methods is that they ignore the nature (recombination or duplication) and origins (recombination donor) of gene gain events, which can yield partial and inaccurate scenarios when the true history is complex, especially with many recombination events. In particular, the exploitable signal from a profile of gene presence/absence in extant genomes are quickly saturated when several gene copies coexist in a genome, and likely descend from separate past events. This issue can sometimes be tackled by defining strict families of orthologs, where every gene is present in one copy or none, but at the cost of losing the information on evolution of homologues. Ancestral state reconstruction could also in principle be applied to data on family of homologues, where each genome can contain zero, one or more copies of a gene. This would require to fit a ladder model similar to the ones used when analysing microsatellite data (Ohta and Kimura 1973; Wilson and Balding 1998). This approach is difficult in practice because bacterial accessory gene families of interest have often too complex histories to reliably infer orthologous groups and have high gain and/or loss rates that quickly saturate signals. It has, however, been applied successfully in studies where genomes of single representatives from fairly distant species were compared, thus ignoring the ‘messy’ variation introduced by within-population evolution (Csurös and Miklós 2009).

4.2 Phylogenetic Reconciliation

To deliver more informative scenarios of evolution, it is necessary to know the origin of gene gains, which effectively means to know the relationship between observed genes. Gene tree versus species tree reconciliation methods compare the topologies of phylogenetic trees built from individual gene sequences against a reference species tree (Maddison 1997). In the context of pangenome analysis, the species tree is a phylogenetic tree reconstructed from the whole of the core genome. Species and gene trees often have inconsistent topologies, which could happen by chance, especially since the gene tree typically has limited statistical support, or may be the result of evolutionary events affecting the history of the gene relative to the clonal history. Reconciliation methods intend to explain the significant topological discords by events of gene duplication, transfer, or loss (Szollosi et al. 2015). Figure 3c illustrates the principles behind reconciliation methods. Practically, both trees are annotated with the inferred events, such that there is a full agreement on the course of events, from the root of the gene lineage to the contemporary distribution of genes in genomes—thus reconstructing the path of evolution and diversification of genes in the clonal frame of genome evolution. As a result, this approach allows to explicitly determine the donors and recipients of transferred genes, or the ancestor in which a gene was duplicated. Methods for pangenome reconciliation analysis have been proposed based on parsimonious reconstruction (Abby et al. 2010; Jacox et al. 2016) and probabilistic models (Szollosi et al. 2012, 2013).

The ancestral state reconstruction approach and the reconciliation approach have a lot in common, and the latter can be thought of as a natural extension of the former when observation is not limited to presence or absence or number of copies of a gene, but also includes the phylogenetic relationships between genes from separate genomes. Reconciliation methods are therefore superior in the sense that they exploit more of the available signal, but they are also much more challenging to implement computationally and have so far been limited to analysis of a handful of genomes. Ancestral state reconstruction methods are currently more popular but we predict that reconciliation methods will become increasingly widespread in the near future with the development of more effective statistical methods. Beyond the study of the atomic events whereby the pangenome evolves, both methods allow to infer ancestral states in hypothetical ancestors, or in other words to reconstruct ancestral genomes. Doing so, one can derive the expected phenotypic traits of the ancestors—antimicrobial resistance, metabolic activities, even ecological lifestyle. These inferred traits can then be compared to historical records of Earth evolution or pathogen epidemic spread to try and find causal relations between biological activity and the course of events (David and Alm 2011; Holden et al. 2013), or be considered in support of further ancestral reconstruction, such as scenarios of ecological niche colonisation (Lassalle et al. 2017).