1 Introduction

Bacteria dominate our planet and can be traced back billions of years in the geological record. They play critical roles in shaping our habitat, from adding oxygen to the atmosphere to fixing nitrogen in the soil. They also play a vital role in human health, with commensal/mutualistic bacteria influencing nutrition and immunity, and pathogenic bacteria causing diseases from epidemics like the Black Death of medieval times to modern-day chronic biofilm infections resulting in the spread of antibiotic resistance. A defining characteristic of bacteria in both the environment and health is their ability to rapidly evolve and adapt. Here we discuss the elegant population-level organizational scheme that bacterial species use, wherein their genomes are distributed among large numbers of strains, with no single strain having more than a small minority of the genes available at the population level. This distributed pan(supra)-genome provides for adaptation to countless novel challenges and environmental niches.

Individual bacterial genomes have a discrete number of genes. However, enormous differences in gene content exist even among the genomes of strains of a single species. Therefore, the gene content of a single strain is less than the full complement of different genes from all strains. The comprehensive set of genes within a species, i.e., all genes from all strains, is defined as the pangenome (or supragenome). The pangenome is organized into the core genome, which corresponds to the set of genes conserved across all strains in the species, and the accessory genome (or distributed genome), which comprises all noncore genes. We compiled pangenome papers from PubMed, identifying 295 species-specific pangenome projects performed on approximately 70 genera (Fig. 1). In all of these projects, the pangenome was found to be substantially larger than the core genome (Fig. 2).
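The definitions above map directly onto set operations, which the following minimal sketch illustrates. It assumes that each strain has already been reduced to a set of gene-cluster identifiers; the strain names, gene names, and the function `summarize_pangenome` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple


def summarize_pangenome(
    strains: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (pangenome, core, accessory) as sets of gene-cluster IDs."""
    if not strains:
        raise ValueError("at least one strain is required")
    gene_sets = list(strains.values())
    pangenome = set().union(*gene_sets)      # all genes from all strains
    core = set.intersection(*gene_sets)      # genes present in every strain
    accessory = pangenome - core             # all noncore genes
    return pangenome, core, accessory


# Hypothetical toy data: three strains of one species.
strains = {
    "strain_A": {"dnaA", "gyrB", "recA", "capsule_1"},
    "strain_B": {"dnaA", "gyrB", "recA", "bacteriocin_x"},
    "strain_C": {"dnaA", "gyrB", "recA", "capsule_1", "phage_int"},
}

pangenome, core, accessory = summarize_pangenome(strains)
print(len(pangenome), len(core), len(accessory))
```

Even in this toy data set, every strain carries fewer genes than the pangenome, which is the relationship described above for real species.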

Fig. 1

Overview of the number of pangenomic studies per genus from a literature search between 2005 and 2018 (See “References for Fig. 1”)

Fig. 2

The number of core and total gene clusters for genera with at least three available pangenomic projects. Numbers at the top correspond to the mean percent core. This is only an estimate as these numbers vary considerably based on parameters and strain selection processes

The diversity within a species’ pangenome provides a reservoir of genetic material available to bacterial cells to respond to selective pressures. Horizontal gene transfer (HGT) is the process by which individual bacterial cells can take up genetic material from their environment or neighboring bacteria and generate novel, strain-specific gene combinations. It seems logical that when HGT occurs among strains of the same species, these events are more likely to be adaptive or work in concert within the biological network than random mutations or genes acquired from distantly related species. This has been demonstrated to be the case in multiple species, where the majority of accessory genes appear to be evolving in tandem with the core genome (Gladitz et al. 2005). In this manner, the pangenome allows a species to incorporate more solutions to environmental stresses and niches than can be encoded by a single strain (Ehrlich et al. 2005, 2010).

2 Steps in the Assembly of a Pangenome

Pangenome analyses are performed on a set of strains from the same species, or very closely related species (often different species grouped together by genus, though we will not be examining those projects here). The set of all coding sequences (CDS) is clustered by sequence similarity with the objective of generating groups of orthologous genes. This is a multistep process that begins with whole-genome sequencing (WGS) of multiple independent (nonclonal, nonderivative) bacterial strains selected to represent the broadest geographic and phenotypic ranges of the species of interest. Following sequencing, the remaining steps are computational and include (1) assembly of genomes into contigs, (2) annotation of protein-coding sequences (CDS), and (3) clustering of CDSs based on the sequence similarity of nucleic acids or amino acids of their cognate encoded proteins. Once clusters are defined, they are classified based on strain prevalence into core or accessory (distributed) clusters. The accessory/distributed set of gene clusters is often further organized into those that are widely distributed (near core/soft core) in a population and those that are rare (shell) or unique (Fig. 3).
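The clustering step (3) can be illustrated with a toy sketch. It uses a greedy incremental scheme, loosely in the spirit of tools such as CD-HIT: sequences are processed from longest to shortest, and each one either joins the first cluster whose representative it resembles or founds a new cluster. As a loud simplification, similarity is measured here with `difflib.SequenceMatcher.ratio()` from the Python standard library, which is only a crude stand-in for alignment-based percent identity; the 0.9 threshold, the sequences, and the function name `greedy_cluster` are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def greedy_cluster(
    proteins: Dict[str, str], min_similarity: float = 0.9
) -> Dict[str, List[str]]:
    """Greedy incremental clustering; keys are cluster representatives."""
    clusters: Dict[str, List[str]] = {}
    # Longest sequences are considered first and become representatives.
    order = sorted(proteins, key=lambda name: len(proteins[name]), reverse=True)
    for name in order:
        for rep in clusters:
            # Crude similarity score, NOT a true alignment identity.
            score = SequenceMatcher(None, proteins[rep], proteins[name]).ratio()
            if score >= min_similarity:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Hypothetical protein sequences: cds_1 and cds_2 differ at one residue.
proteins = {
    "cds_1": "MKTAYIAKQRQISFVKSHFSRQ",
    "cds_2": "MKTAYIAKQRQISFVKSHFSRA",
    "cds_3": "MLLPPWWWGGHHNNDDEECC",
}
clusters = greedy_cluster(proteins)
print(clusters)
```

In practice, the choice of similarity measure and threshold strongly influences how many clusters are reported, which is one reason why pangenome size estimates differ among projects.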

Fig. 3

Histogram of the number of gene clusters present in a given number of genomes. Taken from a project examining 12 genomes of Moraxella catarrhalis (Davie et al. 2011), with a total of 2383 gene clusters
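A histogram of this kind, together with the prevalence-based classification described above, can be derived from a presence/absence table. The sketch below assumes each genome is represented as a set of gene-cluster identifiers. The soft-core cutoff is a project-specific parameter: the default of 0.95 and the value of 0.75 used for the four-genome toy data are assumptions, as are all names in the code.

```python
from collections import Counter
from typing import Dict, Set


def cluster_prevalence(genomes: Dict[str, Set[str]]) -> Counter:
    """For each gene cluster, count the genomes in which it is present."""
    counts: Counter = Counter()
    for clusters in genomes.values():
        counts.update(clusters)
    return counts


def classify(n_present: int, n_genomes: int, soft_core_min: float = 0.95) -> str:
    """Label a cluster by prevalence (the cutoff is an assumed parameter)."""
    if n_present == n_genomes:
        return "core"
    if n_present == 1:
        return "unique"
    if n_present / n_genomes >= soft_core_min:
        return "soft core"
    return "shell"


# Hypothetical presence/absence data for four genomes.
genomes = {
    "g1": {"a", "b", "c", "d"},
    "g2": {"a", "b", "c"},
    "g3": {"a", "b", "e", "f"},
    "g4": {"a", "c", "f"},
}

counts = cluster_prevalence(genomes)
# Histogram: number of genomes -> number of clusters with that prevalence.
histogram = Counter(counts.values())
labels = {c: classify(n, len(genomes), soft_core_min=0.75) for c, n in counts.items()}
print(sorted(histogram.items()))
print(labels)
```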

The tools and the parameters used to characterize gene clusters vary widely among projects (Fig. 4). Generally, the first project(s) within a species tend to focus on the basic characterization of the pangenome. Subsequent projects often emphasize specific areas of interest, such as the distribution of virulence factors, levels of horizontal gene transfer, or epigenetic factors. Our survey of 295 pangenome projects did not reveal a strong preference for any individual assembly program. This is likely because assembly programs and versions perform differently depending on the examined species and the employed DNA sequencing technology. Further, many pangenome projects utilize pre-assembled genomes from publicly available databases (GenBank, EMBL, DDBJ, JGI, PubMLST, etc.). This survey found that the CD-HIT program was the most frequently used gene clustering software, though a diverse set of other programs was also utilized for this purpose. Finally, commonly used software for other analyses includes programs for gene annotation (RAST, Prokka, PHAST, and Prodigal) (Aziz et al. 2008; Seemann 2014; Zhou et al. 2011; Hyatt et al. 2010), genome/gene alignments (Muscle, Mauve, Mega, and ClustalW) (Edgar 2004; Darling et al. 2004; Kumar et al. 1994; Higgins and Sharp 1988), and phylogenetic tree building (Mega, RAxML, and PhyML) (Kumar et al. 1994; Stamatakis 2006; Guindon et al. 2010). Overall, there is high variability in the methods/software used for pangenome analyses, reflecting diversity in the scope and goals of these projects.

Fig. 4

Frequency of reference to programs over the past 5 years in pangenome publications (referenced at least 4 times)

3 Size of the Pangenome

The size of a species’ pangenome, relative to the size of the core genome, is highly variable across the eubacteria. Figures 1 and 2 summarize the variability we encountered in 295 species-specific pangenome projects. Papers included in this summary span from 2005 [when the first pangenomes were described in S. agalactiae (Tettelin et al. 2005) and H. influenzae (Shen et al. 2005; Hogg et al. 2007)] through 2018. In all cases, the pangenome was substantially larger than the set of genes in a given strain. The size of the core genomes ranged from <20 to >60% of the pangenome (Fig. 2).

In some cases, calculations on the size of the pangenome may reflect inaccuracies in the current taxonomy, instead of the underlying biology. An instance of high genomic diversity is observed with Gardnerella vaginalis, where only 27% (746/2792) of its gene clusters are core (Ahmed et al. 2012). It is likely that G. vaginalis appears so genomically diverse because traditional biochemical tests used to identify strains within this taxon were unable to distinguish among the multiple genomically diverse species that are actually present. Thus, in this case, the apparent large size of the pangenome (and the corresponding small size of the core genome) arose from the unintentional merging of multiple species into a single species. In contrast, instances of low genomic diversity are observed in the genus Bacillus. Both Bacillus anthracis and Bacillus thuringiensis closely resemble B. cereus (Vilas-Bôas et al. 2007). B. thuringiensis appears to correspond to multiple phylogenetic clades (lineages) within B. cereus. B. anthracis (a species with one of the smallest pangenomes) likely represents a single phylogenetic lineage within the broader, more diverse definition of B. cereus that acquired a clinically important set of toxin genes (Okinaka and Keim 2016; Hall et al. 2010).
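The percent-core values used throughout this section are simple ratios of core clusters to total clusters. As a worked check of the G. vaginalis figure quoted above, the following sketch recomputes it; the function name `percent_core` is hypothetical, and the cluster counts are the ones cited in the text.

```python
def percent_core(n_core: int, n_total: int) -> float:
    """Percentage of pangenome gene clusters that belong to the core genome."""
    if n_total <= 0 or not 0 <= n_core <= n_total:
        raise ValueError("expected 0 <= n_core <= n_total and n_total > 0")
    return 100.0 * n_core / n_total


# Counts reported for Gardnerella vaginalis: 746 core of 2792 total clusters.
print(round(percent_core(746, 2792)))  # 27
```

Because both counts depend on clustering parameters and on which strains are sampled, such percentages are best read as estimates rather than fixed properties of a species.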

It is tempting to speculate that there are general principles that directly associate the size of the pangenome with the biology of the species. Factors that may play a substantial role are the extent of gene transfer, the degree of interactions with competing and cooperating species, the number of niches inhabited, or the lifestyle of the bacterium. The hypothesis that highly specialized environments lead to smaller genome sizes has been explored in the context of obligate intracellular species and pathogens (Merhej et al. 2009; Georgiades et al. 2011). A study of overall differences between the genomes of 12 highly pathogenic species compared to their most closely related nonpathogenic cousins found that, for the sets of bacteria studied, the most virulent species generally had smaller genomes, which suggests gene loss as well as loss-of-function mutations (Georgiades and Raoult 2011). The reduced genome size is hypothesized to be a consequence of extreme specialization of the pathogens to their hosts, while the less-specialized nonpathogens show greater levels of genomic variation due to selective pressure to remain competitive in more diverse environments (Georgiades and Raoult 2011). While this is an interesting idea, not all studies point to a relationship between pathogenicity and genome size (Bonar et al. 2018).

In a related vein, longitudinal comparative genomic studies of pathogenic clonal lineages of Pseudomonas aeruginosa, Burkholderia sp., and Haemophilus influenzae have captured microevolution and host adaptation in the human lung (Rau et al. 2012; Lee et al. 2017; Pettigrew et al. 2018; Moleres et al. 2018; Bianconi et al. 2018; Burns et al. 2001; Li et al. 2005; Jorth et al. 2015; Silva et al. 2016). In many cases, these changes reveal gene deletions when compared to their antecedents. For instance, serial isolates of H. influenzae clonal lineages in COPD patients display a significant association with loss-of-function mutations in the ompP1 (fadL) accessory gene. fadL is beneficial to this bacterium in early infection, as it promotes adhesion and intracellular invasion via interactions with the epithelial cell ligand hCEACAM1 (human carcinoembryonic antigen-related cell adhesion molecule 1). In contrast, it may hinder long-term survival in the lung, as its expression increases sensitivity to arachidonic acid, an exogenous mammalian long-chain fatty acid with bactericidal effects (Moleres et al. 2018). This is indicative of selective pressure in favor of ompP1 function in the nasopharynx and against its function in the lungs. These observations support the general concept that gene loss may accompany the ability to survive within highly circumscribed niches (Rau et al. 2012; Lee et al. 2017; Pettigrew et al. 2018; Moleres et al. 2018). Nonetheless, one must keep in mind that evolution in niches that do not support transmission may not be relevant to the evolution of the pangenome. Large-scale comparative pangenome and evolutionary studies promise to reveal the rules that shape the overall pangenome size, as well as identify disease and tissue-specific genes (and gene losses).

4 The Accessory Genome and Functional Diversity

In general, core genomes are enriched for housekeeping functions. These include energy production, amino acid metabolism, nucleotide metabolism, lipid transport, and translational machinery. Accessory genomes often encode genes involved in protein trafficking and defense, as well as many niche-specific functions. Further, plasmids, phage, and transposons are also often associated with accessory genomes. This section focuses on functional diversity as it pertains to the accessory genome.

Phenotypic traits can result from a blend of core genes with highly variable accessory genes. This is exemplified by the production of the capsule (Swartley et al. 1997; Bentley et al. 2006), synthesis of the extracellular polymeric substance (EPS) (Harris et al. 2017), and modification of the cell wall (Gerlach et al. 2018). Here, conserved modules encoded in the core and soft-core genomes are modified by components encoded by the accessory genome, providing a mechanism to generate phenotypic variability. In Neisseria meningitidis, capsule biosynthesis genes are encoded within a single syntenic cps chromosomal region, which encodes both core and accessory genes. Variations in the accessory genes yield diversity in capsular types (Harrison et al. 2013). In Lactobacillus salivarius, the EPS cluster 2 contributes to the biofilm matrix. The genes at the extremities of this multigene cluster are core, while there is extensive variation in the genes encoded in the center of the cluster. These differences in glycosyltransferases and EPS biosynthesis-related proteins contribute to variations in the EPS structure (Harris et al. 2017). Yet another example is observed in methicillin-resistant Staphylococcus aureus (MRSA), where strains evade host immunity by modification of wall teichoic acid (WTA) using an alternative WTA glycosyltransferase encoded on a prophage (Gerlach et al. 2018). These studies exemplify how diversity within the accessory genome can provide bacteria with a blueprint to generate variability. This genomic flexibility is likely to increase the adaptive potential of bacterial species in the face of environmental stresses.

Genes encoded by the accessory genome can influence pathogenic potential. A well-studied example is Escherichia coli; this species encodes a highly diverse pangenome, where variability within the accessory genome leads to strains that differ in their ability to colonize human cell types and to trigger pathogenicity (Rasko et al. 2008). E. coli strains are grouped into pathovars based on the presence of virulence markers, often encoded on mobile elements (Kaper et al. 2004). Whole-genome comparative analyses of pathovars demonstrate that strains of the same pathovar are not always phylogenetically clustered (Rasko et al. 2008; Salipante et al. 2015; Hazen et al. 2013). This pattern of clustering is consistent with the transfer of accessory genes among E. coli strains, as well as the independent acquisition of virulence traits by strains in the same pathovar. One prominent example of HGT among E. coli strains of different pathovars is observed in the highly pathogenic strain that caused the 2011 German food poisoning outbreak (Mahan et al. 2013). Multiple genomic studies ultimately concluded that the outbreak was caused by a Shiga toxin-producing E. coli (STEC) of serotype O104:H4, which harbored multiple genes commonly associated with enteroaggregative E. coli (EAEC), including plasmid-encoded type I aggregative adherence fimbriae that mediate colonization and biofilm formation, an assortment of serine proteases (SPATEs), and chromosomally encoded Shigella enterotoxin 1 (Askar et al. 2011; Mellmann et al. 2011; Rasko et al. 2011). Moreover, the prevalence of genetic transfer among E. coli strains is highlighted by the lack of an exclusive genomic signature among commensal E. coli strains. The strains that asymptomatically colonize the human gastrointestinal tract are genetically diverse (Rasko et al. 2008). These commensal strains may serve as genetic repositories for virulence determinants and, in addition, gene transfer events may modify their pathogenic potential and drug sensitivity. In conclusion, the accessory genome of E. coli is a critical determinant of tissue tropism, pathogenic potential, and clinical presentation.

Non-orthologous accessory genes with related functions are often syntenic across strains. We propose that this genomic configuration allows one variant to be replaced by another in the process of recombination, where the neighboring genes provide an anchor for homologous recombination. One example is the genomic region that encodes the DpnI, DpnII, or the DpnIII type II restriction enzymes in S. pneumoniae. These loci differ in the sequence of the enzymes, the number of genes in the locus, and their ability to restrict phages or transforming DNA (Johnston et al. 2013a; Eutsey et al. 2015). Another example is the genomic region that encodes bacteriocins downstream of the blp histidine kinase signal transduction system in S. pneumoniae. While the genes in this region are predicted to be bacteriocins, the number of genes, their sequence, and the cells they target differ across strains (Lux et al. 2007; Dawid et al. 2007; Valente et al. 2016; Rezaei Javan et al. 2018). Other examples of this proposed mechanism, wherein conserved flanking genes anchor multiple variants of pathogenicity genes, include the paralogous vHiSLR genes of H. influenzae (Kress-Bennett et al. 2016) and the bro gene variants of Moraxella catarrhalis (Earl et al. 2016). Syntenic regions that encode non-homologous genes within a single functional class may provide a pangenomic “switch,” allowing cells to flip between variants of a single function to optimize fitness in diverse niches.

In summary, many of the genes in the accessory genome provide new functions or variations on a conserved function in a manner that expands the ability of strains to survive or adapt in their environments. In this manner, the strain diversity resulting from variations in the accessory genome may serve as a population-level tool to ensure the survival of a bacterial species.

5 Pangenome Plasticity

Speaking teleologically, via intra- and inter-species gene transfer, individual bacterial strains can draw from an expanded set of genes for their own adaptation and evolutionary success. This phenomenon was observed as early as 1928 in Griffith’s experiment, where a nonencapsulated strain of S. pneumoniae integrated DNA from an encapsulated isolate, leading to its conversion from avirulent to virulent (Griffith 1928). Almost a century later, the bacterial research community has described numerous instances of gene transfer among bacterial strains.

5.1 Gene Transfer Events Within and Across Species

Gene transfer events can occur anywhere, and our literature review identified 19 manuscripts that describe bacterial in vivo gene transfer within human patients (Table 1). A common theme is the acquisition of antibiotic resistance, particularly in regard to carbapenems, β-lactamases, and quinolones. Resistance was commonly the result of genes acquired via bacteriophages, plasmids, or pathogenicity islands (Conlan et al. 2014; Bielaszewska et al. 2007; Datta et al. 2017; Feld et al. 2008; Langhanki et al. 2018; Mena et al. 2006; Neuwirth et al. 2001; Soto et al. 2011). In our set, five cases show HGT between different bacterial species: Serratia marcescens and Escherichia coli (Mata et al. 2010), two instances of Klebsiella pneumoniae and E. coli (Gona et al. 2014; Göttig et al. 2015), Staphylococcus aureus and Staphylococcus epidermidis (Hurdle et al. 2005), and Enterobacter cloacae and E. coli (Sidjabat et al. 2014). These studies highlight how bacteria occupying the same niche can evolve during the infectious disease process, posing new challenges for treatment.

Table 1 Summary of studies on in vivo recombination

Cross-species transfer events introduce new genes into the species, thus expanding the pangenome. A prominent example is acquisition of the type 3 secretion system (T3SS) by multiple Gram-negative bacteria. The T3SS allows for the transport of effector proteins from the bacterial cytosol directly into the host cells (Hacker et al. 1997; Hueck 1998). In most cases, the genes encoding this injection system, and their effectors, have been acquired by HGT (Brown and Finlay 2011). These T3SS systems are critical components of virulence. For instance, in Salmonella, acquisition of the SPI1 T3SS enables the bacterium to invade host cells, while acquisition of the SPI2 T3SS enables it to escape host defenses and survive within host cells inside a protective vacuole (Jennings et al. 2017; Ochman et al. 1996). Another example of cross-species transfer has been observed in S. pneumoniae, where a multigene locus was acquired from Streptococcus suis (Antic et al. 2017). This locus was acquired exclusively by a phylogenetically distinct subset of strains within the S. pneumoniae species—a subset much more likely to infect the conjunctiva. The genes acquired from S. suis appear to contribute to the tissue tropism by promoting adherence to the ocular epithelium. Thus, expansion of the pangenome by gene acquisition from outside the species can contribute to bacterial virulence and tropism.

Gene transfer among strains of the same species provides a mechanism to redistribute accessory/distributed genes within single strains. Studies on vaccine-escape strains of S. pneumoniae identified multiple genes acquired from a single donor (Golubchik et al. 2012). These recombination events ranged from 0.04 to 44 kb in size, and were located in various regions of the genome, including the capsular locus. Separate analyses of whole genomes of S. pneumoniae have captured multiple instances of serotype switches, including from 23F to 3 and from 19F to 19A (Chewapreecha et al. 2014; Croucher et al. 2014a; Hiller et al. 2011). A current vaccine targets the 19F capsule, but not the 19A. Serotype 19F strains were widely prevalent pre-vaccine, while serotype 19A strains have spread in the USA during the post-vaccine era (Geno et al. 2015). This serotype switch has been observed in vaccinated and non-vaccinated populations. These observations are consistent with a model where HGT generates diverse genotypes, selective pressure from vaccines drives the spread of a subset of strains, and competition across strains shapes the population and distribution of accessory genes.

Studies that describe recombination among strains driven by natural competence and transformation suggest that multiple transfers may occur both simultaneously and sequentially between individual donors and recipient strains. A study on S. pneumoniae captured the progressive accumulation of recombinations in a set of six clinical strains isolated from a pediatric patient over a 7-month period. One strain incurred multiple recombination events from the same donor, over two instances of recombination. These events introduced recombinations at 23 sites, and led to the exchange of over 7% of the genome (Hiller et al. 2010). Similarly, a laboratory study in H. influenzae also captured multiple gene transfer events after a bout of recombination (Mell et al. 2011). For this study, DNA from a clinical strain was used to transform a laboratory strain. Transformants were observed to have multiple recombination events over the length of the chromosome, collectively corresponding to ~1–3% of the genome. These analyses not only demonstrate HGT events across strains, but also suggest that strains may display multiple transfers during a single competence event.

HGT occurring through natural competence and transformation is unique among HGT mechanisms, in that it is driven by the recipient as opposed to by the donor (as is the case with mating and transduction). This means that it is an expressed phenotype that is triggered by the recipient cell. Thus, as a mechanism of mutation and evolution, it is expressed when a cell is stressed and provides a genetic means to adapt to a stressful environment resulting in mutation-on-demand (Ehrlich et al. 2005).

5.2 Constraints on Gene Transfer

While there is clear evidence of HGT among strains of the same species, distributed genes are not randomly distributed within a species. Instead, they tend to be associated with specific lineages, suggesting that pangenome evolution operates with forces that promote as well as limit gene transfer (Croucher et al. 2014b), as discussed in the next paragraphs.

There is increasing evidence that co-selection of genes limits gene transfer. A genome-wide study in S. pneumoniae demonstrated that a set of 876 loci, annotated to function in metabolism or transport, displayed a nonrandom distribution (Watkins et al. 2015). The authors show that groups of coevolved genes (alleles) are adapted to particular metabolic niches. They predict that disruption of these groups of alleles, a process mediated by HGT, would lead to a drop in strain fitness. A computational approach applied to S. pneumoniae and N. meningitidis also uncovered co-selection of genes associated with drug resistance and virulence (Pensar et al. 2019). Genome architecture may also limit gene transfer. Many bacterial genomes encode short sequences that are enriched in close proximity to the replication terminus. The location of these sequences is under selection, such that HGT events that disrupt these elements impose a fitness cost (Hendrickson et al. 2018). Thus, allele co-selection and genomic architecture illustrate genome-wide features that, when disturbed, can result in loss of fitness and consequently restrict gene flow.

In addition to factors that limit gene transfer via their influence on fitness, bacteria encode genes that serve as barriers to incoming DNA, such as restriction modification (RM) systems, phage-defense systems, and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-associated proteins (CRISPR-Cas). Most RM and CRISPR-Cas systems exert their influence on double-stranded DNA. While DNA entering the cell by transformation is single stranded, these systems still appear to serve as barriers to transformation; a compelling model proposed that they do so via their activity on the transformed chromosome (Johnston et al. 2013b). Studies in N. meningitidis and S. pneumoniae illustrate the role of RM systems in limiting HGT. Strains of N. meningitidis organize into distinct phylogenetic groups that are associated with the distribution of >20 RM systems (Budroni et al. 2011). This distribution is consistent with the hypothesis that the RM systems limit HGT among clades. Similarly, the PMEN1 pandemic lineage of S. pneumoniae displays asymmetric gene transfer. Heterologous gene transfer from PMEN1 to other strains is abundant, yet transfer into PMEN1 is modest (Wyres et al. 2012). The DpnIII RM system contributes to this structure, as it appears to limit HGT into PMEN1 strains, and is almost exclusively found in the PMEN1 lineage (Eutsey et al. 2015). Type I RM systems can also limit gene transfer; however, their architecture may allow rapid evolution of HGT barriers. The type I RM systems have a multifunctional component, where modification in one sequence can lead to changes in both methylation and endonuclease activity. This is in contrast to type II RM systems, where the protein that directs methylation is distinct from the protein that directs endonuclease activity, such that changes in specificity require mutations in more than one protein (Wilson and Murray 2003). In this manner, type I RM systems can rapidly evolve new specificities and generate diversity. A recent study in S. pneumoniae demonstrated that phase variation in the SpnIV phase-variable Type I RM limits acquisition of genomic islands by transformation (Kwun et al. 2018). The work captures an instance of phase variation on a type I RM system that generated an HGT barrier between nearly identical strains. Together, these studies suggest that RM systems may foster genomic stability within subsets of strains.

Many bacteria encode an abortive infection (Abi) system, which appears to be an altruistic mechanism to protect the population at large. When bacteria possessing an Abi system are infected by phage, the system is activated and triggers the death of the bacterial host. In this manner, death of the infected isolate avoids spread of the phage across the bacterial community (Chopin et al. 2005). In an exciting twist, phage defense systems may also be encoded by prophage, illustrating cooperation between bacteria and phage to restrict unrelated phages (Dedrick et al. 2017; Bondy-Denomy et al. 2016).

CRISPR-Cas confers adaptive immunity in prokaryotes and has the ability to inhibit conjugation, transduction, and transformation. CRISPR-Cas systems are composed of cas genes and arrays of palindromic nucleotide repeats that are interspersed with short unique DNA segments called spacers. The spacers are acquired from foreign DNA, usually bacteriophages. Following acquisition, spacers are transcribed and processed into small CRISPR RNA (crRNA) molecules. A complex formed by Cas proteins and crRNA leads to the degradation of invading foreign nucleic acid, protecting cells from future invasion (Jiang and Doudna 2017; Adli 2018). Many bacterial species and lineages are devoid of CRISPR-Cas systems. In vitro studies in multiple bacteria reveal an inverse correlation between HGT and the presence of a functional CRISPR-Cas system (Jiang et al. 2013; Watson et al. 2018). In Enterococcus faecalis, multidrug-resistant plasmids were observed in strains that lacked CRISPR-Cas systems, while the drug-sensitive strains encoded this system (Palmer and Gilmore 2010). Further, under selective pressure for the acquisition of antibiotic-resistant plasmids, Staphylococcus epidermidis strains acquired inactivating mutations in the CRISPR-Cas system (Jiang et al. 2013). These studies suggest that bacteria encounter a tradeoff: the fitness advantages associated with phage resistance afforded by CRISPR-Cas must be balanced against a decrease in genomic plasticity and the benefits conferred by acquisition of novel genes. Nonetheless, the role of phage protection systems in restricting gene flow is far from fully resolved. Some studies find contrasting results, and do not support the conclusion that CRISPR-Cas limits HGT. A large-scale computational study revealed that the activity of the CRISPR-Cas system was not associated with HGT events over long evolutionary timescales (Gophna et al. 2015). Further, a study in Pectobacterium atrosepticum suggests that CRISPR-Cas systems may actually contribute to HGT via their role in protecting bacteria against phage attack (Watson et al. 2018). Thus, more research is required to determine the ultimate influence of CRISPR-Cas systems on the genomic plasticity of bacterial populations.

In conclusion, the set of genes in a species’ pangenome can expand via the introduction of genes from other species, rearrange across strains via an intra-species exchange, or vary with mutations. The shuffling of accessory genes and alleles generates new combinations that are subsequently subjected to the forces of selection on gene products and genome-wide features. Moreover, RMs, CRISPR-Cas, and phage-defense systems may also influence gene flow across strains and species. All factors combined, genomic plasticity emerges as a successful strategy for bacterial survival.

6 A Balance in the Accessory Genome

A remarkable observation comes from recent mathematical models and population studies. Negative frequency-dependent selection may stabilize the proportion of individual accessory genes in a population of S. pneumoniae (Azarian et al. 2018; Corander et al. 2017). As expected, the authors observed that vaccination led to a dramatic drop in the representation of vaccine-sensitive strains. In doing so, the distribution of accessory genes within the population differed from that of the pre-vaccine population. Interestingly, over time, the frequency of the accessory genes trended toward that seen in the pre-vaccine population. These results suggest that the distribution of genes in the pneumococcal pangenome may have an equilibrium point. It remains to be determined whether similar patterns are observed in other species. The suggestion that the composition of pangenomes tends toward an equilibrium has important implications regarding our ability to predict the nature of replacement strains after the introduction of therapies that target subsets of strains within a bacterial population using a microbiome-sparing approach.
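The intuition behind negative frequency-dependent selection can be conveyed with a deliberately simplified toy model. The sketch below is not the multilocus model used in the cited studies; it follows a single accessory gene whose carriers are assumed to have relative fitness w = 1 + s(e - p), where p is the current gene frequency, e is an assumed equilibrium frequency, and s is an assumed selection strength. All names and parameter values are hypothetical.

```python
from typing import List


def simulate_nfds(
    p0: float, equilibrium: float, strength: float = 0.5, generations: int = 200
) -> List[float]:
    """Track the frequency of one accessory gene under toy NFDS.

    Carriers have relative fitness w = 1 + strength * (equilibrium - p);
    non-carriers have fitness 1. The gene is favored when rare (p below
    the equilibrium) and disfavored when common.
    """
    if not (0.0 < p0 < 1.0 and 0.0 < equilibrium < 1.0 and 0.0 < strength < 1.0):
        raise ValueError("p0, equilibrium, and strength must lie in (0, 1)")
    trajectory = [p0]
    p = p0
    for _ in range(generations):
        w = 1.0 + strength * (equilibrium - p)
        # Standard one-locus update: carrier share of total fitness.
        p = p * w / (p * w + (1.0 - p))
        trajectory.append(p)
    return trajectory


# A perturbation (for example, removal of strains carrying the gene)
# drops the frequency from an assumed equilibrium of 0.4 to 0.1.
after_drop = simulate_nfds(p0=0.1, equilibrium=0.4)
after_rise = simulate_nfds(p0=0.8, equilibrium=0.4)
print(round(after_drop[-1], 3), round(after_rise[-1], 3))
```

In this toy setting, the gene frequency drifts back toward the same value regardless of the direction of the perturbation, which mirrors qualitatively the return toward pre-vaccine accessory gene frequencies described above.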

7 Clinical Applications

Pangenomic analyses can be utilized to identify potential therapeutic targets. Target specificity can be customized depending on the desired effect. The core genome can be used to target an entire species, as it contains genes possessed by every member of the species. Alternatively, targeting select members of the accessory genome, or the “microbiome-sparing” approach, will ensure that only strains containing the gene of interest are affected. Both strategies can be utilized to combat a wide variety of pathogens.

Current efforts to combat pathogenic bacteria include targeting the bacterial capsule, a large polysaccharide layer that is a major virulence determinant with a key role in immune evasion. Strains vary in the composition of their capsules: those with identical capsules are placed in the same serotype, and those with highly similar capsules in the same serogroup. For example, more than 97 serotypes are known for S. pneumoniae, falling into 46 serogroups (Bentley et al. 2006; Geno et al. 2015; Tzeng et al. 2016), and more than 12 serotypes are known for N. meningitidis (Harrison et al. 2013; Geno et al. 2015; Tzeng et al. 2016; Claus et al. 1997). New serotypes can arise by HGT, as in the movement of SiaD genes between N. meningitidis strains, or through mispairing during gene replication, which is responsible for serotypes 15B/C in S. pneumoniae (Claus et al. 1997; van Selm et al. 2003). Capsular polysaccharide vaccines are available for S. pneumoniae, S. typhi, and N. meningitidis (Geno et al. 2015; Tzeng et al. 2016; Hessel et al. 1999). These specifically target the bacterial capsule, but young children (under the age of two) fail to mount an antibody response to them. To combat this, polysaccharide–protein conjugate vaccines were designed, which combine the polysaccharide antigen with protein carriers and render it more immunogenic in young children (Finn 2004; Nair 2012; Szu et al. 1989; Lin et al. 2001). Development of conjugate vaccines faces major challenges, such as cost, host immune response, and bacterial structures (Nair 2012). Therefore, it would be ideal to create capsular polysaccharide vaccines with better immunogenicity. However, the structures of some capsule sugars are too similar to those found in mammalian tissues to be useful as polysaccharide vaccines. In these cases, vaccines could be designed to target virulence via accessory genes or to target these species as a whole via the core genome (Pichichero 2017; Daniels et al. 2016; Chan et al. 2018).

Using the accessory genome to create strain-specific drugs and vaccines has wide implications. For example, it is easy to imagine the creation of therapies against bacterial pathogens that are able to spare the larger microbiome. Commensal bacteria in the microbiome and pathogenic bacteria of the same species may share the same core genome, but can have vast differences in the content of their accessory genomes. If a therapy targets protein products from genes found only in the accessory genomes of pathogenic bacteria, it will not disturb the patient’s microflora as the commensal bacteria would lack the proteins the therapy is created against. This strategy has the potential to greatly improve patient health and recovery following a bacterial infection.
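The distinction between species-wide and microbiome-sparing targets can be made concrete with a gene presence/absence table. The following is a minimal sketch, assuming each strain is represented simply as a set of gene names; the strain labels, gene names, and the pathogen/commensal assignments are invented for illustration and do not come from any of the cited studies.

```python
def partition_pangenome(presence):
    """Split the pangenome into core genes (in every strain) and accessory genes."""
    gene_sets = list(presence.values())
    pangenome = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return core, pangenome - core

def pathogen_specific_targets(presence, pathogens, commensals):
    """Genes found in every pathogenic strain and in no commensal strain."""
    in_all_pathogens = set.intersection(*(presence[s] for s in pathogens))
    in_any_commensal = set().union(*(presence[s] for s in commensals))
    return in_all_pathogens - in_any_commensal

# Hypothetical presence/absence data: strain -> set of genes it carries.
PRESENCE = {
    "pathogen_1": {"core_a", "core_b", "vir_1", "vir_2"},
    "pathogen_2": {"core_a", "core_b", "vir_1", "acc_3"},
    "commensal_1": {"core_a", "core_b", "acc_3"},
    "commensal_2": {"core_a", "core_b", "acc_4"},
}

core, accessory = partition_pangenome(PRESENCE)
sparing_targets = pathogen_specific_targets(
    PRESENCE,
    pathogens=["pathogen_1", "pathogen_2"],
    commensals=["commensal_1", "commensal_2"],
)
```

In this toy data set the core genes would be candidates for a species-wide therapy, whereas `sparing_targets` holds the only accessory gene shared by both pathogens and absent from both commensals. Real analyses would also have to account for sequence variation, expression, and incomplete sampling of the population, none of which this sketch models.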

Pangenomic studies can aid in the development of diagnostic tools. As with vaccines and drug development, accessory genes can be used to identify a particular strain/phenotype and core genes to identify a specific species. A study of 17 clinical isolates of G. vaginalis was used to propose the reclassification of G. vaginalis as a genus, based on the extent of pangenomic variation (Ahmed et al. 2012). Previously, metronidazole was used as a blanket antibiotic for the treatment of bacterial vaginosis. However, the understanding that metronidazole-resistant clades of G. vaginalis are actually different species opens the way for the development of diagnostic tools to inform antibiotic treatment for patients with bacterial vaginosis (Balashov et al. 2014). Similarly, pangenomic studies among phenotypically divergent M. catarrhalis strains led to the characterization of a deep phylogenetic clade structure that separated the pathogenic sero-resistant strains from commensal sero-sensitive strains (Earl et al. 2016). In yet another example, Staphylococcus epidermidis was divided into two phylogenetic groups. One group included both commensals and pathogens; the other was composed exclusively of commensal strains. Strains in the second group encoded formate dehydrogenase, revealing a potential diagnostic marker (Conlan et al. 2012). A study in Helicobacter pylori identified lineage-specific genes; some have already been associated with acid resistance and virulence, and thus are potential targets to guide treatments (van Vliet 2017). Moreover, when studies associating pangenome and phenotype identify unannotated genes as diagnostic markers, they provide genetic fodder for linking new functions, distribution, and disease outcome (Ehrlich et al. 2010). One caution to consider in the development of diagnostics is that chronic infections can be caused by multiple strains of the same species, and analysis of a single strain could misdirect treatment.

A crucial benefit of pangenomic analyses is their ability to determine the presence or absence of antibiotic resistance markers. Prescribing an ineffective antibiotic is detrimental to the patient’s health and adds to the problem of global antibiotic resistance. Pangenomic analyses of the distribution and transmission of resistance genes have been performed on E. coli strains collected from wastewater treatment plants (Mahfouz et al. 2018), community-associated Clostridium difficile strains isolated from farm animals and humans (Knetsch et al. 2018), and strains of Stenotrophomonas maltophilia collected from cystic fibrosis (CF) patients (Esposito et al. 2017). Given that related strains often differ in their drug resistance profile, probing the accessory genome for genes that encode drug resistance will be a critical component of personalized medicine.
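As a rough illustration of probing the accessory genome for resistance genes, the sketch below looks up known markers in the gene content of each isolate and then combines the results across all isolates from one infection. The marker names, drug names, and isolates are placeholders invented for this example, and the presence of a marker gene is treated as a prediction only, since it does not by itself establish a resistant phenotype.

```python
# Hypothetical lookup table: resistance marker gene -> drug it is assumed to defeat.
MARKER_TO_DRUG = {
    "res_marker_1": "drug_X",
    "res_marker_2": "drug_Y",
}

def predicted_resistance(isolate_genes, marker_to_drug):
    """Drugs an isolate is predicted to resist, based on marker presence alone."""
    return {drug for gene, drug in marker_to_drug.items() if gene in isolate_genes}

def infection_resistance(isolates, marker_to_drug):
    """Union of predicted resistances across every isolate from one infection."""
    drugs = set()
    for genes in isolates.values():
        drugs |= predicted_resistance(genes, marker_to_drug)
    return drugs

# Hypothetical polyclonal infection: two isolates of the same species.
ISOLATES = {
    "isolate_1": {"core_a", "res_marker_1"},
    "isolate_2": {"core_a", "res_marker_2"},
}

per_isolate = {
    name: predicted_resistance(genes, MARKER_TO_DRUG)
    for name, genes in ISOLATES.items()
}
combined = infection_resistance(ISOLATES, MARKER_TO_DRUG)
```

Here each isolate alone would suggest resistance to a single drug, while the combined profile flags both, which is one reason to pool the results from every isolate rather than rely on one.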

Genome-scale models (GEMs) of metabolism can provide great insight into the link between metabolism and pathogenesis. These network reconstructions provide context for the relationship between gene, gene product, and phenotype. Pangenomic analyses in three species observed that the majority of core genes are associated with metabolism (Cornejo et al. 2013; Bosi et al. 2016; Vieira et al. 2011). Pangenomic analysis of inflammatory bowel disease (IBD)-associated E. coli strains reported metabolic differences between IBD-associated strains and nonassociated strains, where the former set appeared to utilize energy more efficiently (Fang et al. 2018). The differences in metabolic capabilities in disease and healthy states provide a promising place to explore diagnostic applications of the pangenome. Furthermore, the link between metabolism and virulence can be explored, and be used diagnostically to differentiate strains that cause mild or severe symptom presentation (Bosi et al. 2016).

Beyond their use in selecting targets for vaccines, therapeutics, and diagnosis, pangenomic analyses have also served as an epidemiological tool. The origin of the 2010 cholera outbreak in Haiti was traced using pangenomic analysis of Vibrio cholerae. Initially, it was unclear whether the epidemic originated with a local strain or an Asian strain. A pangenomic analysis revealed that the epidemic was caused by strains that originated in Southeast Asia (Reimer et al. 2011; Hendriksen et al. 2011; Chin et al. 2011; Mutreja et al. 2011; Orata et al. 2014; Hasan et al. 2012). Such epidemiological studies allow better strategic planning to avoid future epidemics.

8 Conclusions

The Distributed Genome Hypothesis provides both a historical and theoretical framework for understanding bacterial genomic plasticity, and puts it in the context of other classes of chronic pathogens (viruses and eukaryotic parasites) that have developed different mechanistic strategies for the generation of genetic diversity in situ. Viruses such as HIV-1 utilize an error-prone DNA polymerase (reverse transcriptase) to generate enormous diversity resulting in the development of a quasispecies within days of infection (Korber et al. 2001). Trypanosomes utilize a cassetting mechanism for antigen switching wherein they have an entire chromosome of outer surface protein cassettes that they can exchange within the larger functional protein whenever the host adaptive immune response recognizes the previous cassette (Horn 2014). Thus, within this context, we can view HGT of distributed genes among bacterial strains of a species as yet another means of “programmed” variation (Ehrlich et al. 2010).

9 Perspectives

The plasticity provided by the eubacterial pangenome may be driving the evolution of other domains of life. The rapid recombination of bacterial strains provided the evolutionary pressure for the development of the vertebrate adaptive immune system, which is mechanistically similar to what the bacteria are doing: it is essentially a random gene rearrangement phenomenon, very similar to HGT (Hu et al. 2007). Lastly, as the variability within species becomes apparent, it raises the question of how best to define a species. While pangenomic analyses do not offer the ultimate solution, they may provide a useful definition. Once the core genome of a species is defined, strains can be assigned, or not assigned, to a species based on the extent to which they share the same core genome (Nistico et al. 2014).
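A core-genome-based species assignment of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the core genome is taken as the intersection of gene sets from a hypothetical panel of reference strains, and the 95% cutoff is an arbitrary value chosen for the example, not a threshold proposed in the cited work.

```python
def core_genome(reference_strains):
    """Genes present in every reference strain of the species."""
    return set.intersection(*reference_strains.values())

def shared_core_fraction(candidate_genes, core):
    """Fraction of the species core genome that the candidate strain carries."""
    return len(core & candidate_genes) / len(core)

def belongs_to_species(candidate_genes, core, threshold=0.95):
    """Assign the candidate to the species if it carries enough of the core."""
    return shared_core_fraction(candidate_genes, core) >= threshold

# Hypothetical reference panel and candidate strains.
REFERENCES = {
    "ref_1": {"g1", "g2", "g3", "g4", "acc_1"},
    "ref_2": {"g1", "g2", "g3", "g4", "acc_2"},
}
CORE = core_genome(REFERENCES)

candidate_in = {"g1", "g2", "g3", "g4", "acc_9"}
candidate_out = {"g1", "g2", "other_1", "other_2"}
```

With these toy inputs, `belongs_to_species(candidate_in, CORE)` evaluates to true because the candidate carries all four core genes, while `candidate_out` carries only half of them and falls below the cutoff. In practice, the result depends on how well the reference panel samples the species and on how gene identity is decided.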