The evolution of biological diversity through genetic change raises questions about how much variation one can expect among closely related genomes. Some answers are emerging from the application of two different technologies to comparative genomics of bacteria. One, complete genome sequencing, is providing detailed blueprints of one or a few examples of each genome of interest. The second, physical mapping by pulsed-field gel electrophoresis (PFGE), is providing 'skeleton' views of large numbers of closely related genomes. Together, these technologies are providing insights into the dynamics of genome plasticity that are both detailed and broad. At this early stage in comparative genomics, the main generalizations that are emerging concerning rearrangements in bacterial genome organization are as follows. First, large chromosome inversions and translocations are common, even between closely related species. Second, chromosome inversions are usually symmetric around the axis of DNA replication. Third, chromosomal rearrangements are less common within species, but a dramatic increase in the frequency of inversions and translocations seems to be associated with the ability of bacteria to infect eukaryotic hosts, possibly reflecting a bacterial response to the challenges posed by the immune system.

The underlying causes of rearrangement

The bacterial RecA protein is required for damage to chromosomes - in particular chromosome breaks - to be repaired, and it acts by using a duplicate copy of the damaged sequence as a template for repair. The template is normally the homologous sequence on a sister chromosome, but when sequences are present in multiple copies within a genome, RecA can promote recombination between paralogs. Such recombination events can result in rearrangements in the order of genes on the chromosome [1]. Thus, recombination between repeated sequences that are in the same orientation as each other (direct repeats) can result in tandem duplication of the region bounded by the repeat sequences (Figure 1a). These duplications are usually unstable, unless maintained by selection or by the accumulation of sufficient mutations to avoid subsequent recombination. Some large duplications have been stabilized during evolution [2]. Recombination between direct repeats can also result in deletion of the intervening sequence, generating a fragment that can potentially insert back into the genome at the site of another copy of the repeat, thus generating a translocation (Figure 1b). Recombination between repeat sequences that are in the opposite orientation to each other (inverse repeats) can result in inversion of the intervening sequence (Figure 1c).

Figure 1

Genome rearrangement by homologous recombination between repetitive sequences. A circular bacterial genome is illustrated. The dashed line represents the replication origin-terminus axis about which bi-directional replication of the chromosome occurs. Red arrows indicate the positions and relative orientations of the repeat sequences, W, X, Y and Z. The lower-case letters a, b, c and d represent sequences bounded by some of these repeat sequences. (a) Recombination between non-allelic repeat sequences (Y and Z) present on sister chromosomes after replication can lead to duplication of the Y - d - Z region. (b) Recombination between repeat sequences in the same orientation on the same chromosome (Y and Z) can lead to the excision of a DNA fragment (Z/Y - d) that can recombine at another repeat position on the chromosome (W), resulting in a translocation. (c) Recombination between repeat sequences in inverse orientations on the chromosome (X and Z) can lead to inversion of the intervening sequence.
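
The three outcomes described above (inversion, deletion followed by translocation, and tandem duplication) can be mimicked with a toy model of gene order. The following is a minimal sketch in Python, assuming a chromosome is represented as a list of (name, strand) markers; the marker layout, the function names and the index arguments are all hypothetical illustrations and are not taken from Figure 1 or from the cited work.

```python
# Toy model of rearrangement by recombination between repeats. A chromosome
# is a list of (name, strand) markers read clockwise from the origin. The
# layout below is hypothetical and is not the one drawn in the figure.

def invert(chrom, i, j):
    """Inversion: recombination between inverted repeats at indices i < j
    reverses the intervening markers and flips their strands."""
    assert chrom[i][1] != chrom[j][1], "repeats must be in opposite orientation"
    inner = [(name, -strand) for name, strand in reversed(chrom[i + 1:j])]
    return chrom[:i + 1] + inner + chrom[j:]

def excise(chrom, i, j):
    """Deletion: recombination between direct repeats at i < j on one
    chromosome leaves one repeat copy behind and releases the rest."""
    assert chrom[i][1] == chrom[j][1], "repeats must be in the same orientation"
    return chrom[:i + 1] + chrom[j + 1:], chrom[i + 1:j + 1]

def insert_after(chrom, k, fragment):
    """Translocation step: the excised fragment re-enters next to repeat k."""
    return chrom[:k + 1] + fragment + chrom[k + 1:]

def duplicate(chrom, i, j):
    """Tandem duplication: unequal exchange between direct repeats at i < j
    on sister chromosomes repeats the block chrom[i:j]. The hybrid repeat at
    the junction is simply recorded under the name of repeat i."""
    assert chrom[i][1] == chrom[j][1], "repeats must be in the same orientation"
    return chrom[:j] + chrom[i:j] + chrom[j:]

def names(chrom):
    """Gene order as a string, ignoring strand."""
    return "".join(name for name, _ in chrom)

chrom = [("W", 1), ("a", 1), ("X", -1), ("b", 1),
         ("Y", 1), ("d", 1), ("Z", 1), ("c", 1)]

inverted = invert(chrom, 2, 6)                       # X and Z: inverted repeats
remaining, fragment = excise(chrom, 4, 6)            # Y and Z: direct repeats
translocated = insert_after(remaining, 0, fragment)  # fragment re-enters at W
duplicated = duplicate(chrom, 4, 6)

print(names(inverted), names(translocated), names(duplicated))
```

The model tracks only marker order and orientation. It does not check that two markers are actually copies of the same repeat, and it ignores sequence length, so it says nothing about how often each event occurs.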

To these phenomena one can add the acquisition of new DNA by horizontal transfer from another genome, which in addition to introducing new genetic information may upset the stability of a genome and trigger other compensating rearrangements. The sequences involved in RecA-mediated rearrangements are usually long repeats, such as rRNA operons, transposons and IS (insertion sequence) elements. Sequences as short as 10-100 nucleotides can also be substrates for homologous recombination, but this is usually limited to sequences in close proximity to one another [1].

Another outcome of homologous recombination between repeated sequences is gene conversion - homogenization of the sequences within a gene family, to prevent the divergence of repeated sequences - which maintains the sequence similarity required for RecA-mediated recombination over evolutionary time scales [3]. RecA activity is thus a double-edged sword: it is needed to maintain the chromosome integrity required for completing replication, but it also promotes rearrangements within the genome. An interesting exception is that Buchnera lacks a recA gene but compensates by having over 100 copies of its entire genome per cell [4].

Historical background and new techniques

Microbial genome analysis has its origins in the intensive laboratory analysis of just a few bacterial strains, in particular Escherichia coli K-12 and Salmonella typhimurium LT2 (proper name: Salmonella enterica serovar Typhimurium). These species had a common ancestor approximately 140 million years ago [5]. Their genetic maps are almost identical in organization and their major phenotypic differences can be explained by the horizontal acquisition of DNA segments into one or other of the species. With the exception of a large inversion around the terminus of replication, the two genomes seem to be stable in organization and to be diverging by the accumulation of point mutations. A high frequency of recombination in the terminus region is related to the mechanism of chromosome separation after replication, and different inversions around the terminus are found in other closely related bacteria [1]. The general conclusion drawn from these early comparisons was that bacterial genomes were stable in organization. Within the past few years, however, the application of new technologies to the analysis of the genomes of a wide variety of bacterial species has challenged this view.

The two most important experimental techniques for comparative analysis of genome organization have been whole-genome sequencing and physical mapping of genome organization. Whole-genome sequencing provides complete information on a genome, facilitating many types of analysis including comparative analysis of genome organization. The TIGR Microbial Database [6] currently lists dozens of completed bacterial genome sequences and over one hundred in progress. In most cases a few genomes from each species are being sequenced. The availability of a genome sequence, while not essential, also facilitates the physical analysis of that genome and of related genomes. Physical mapping, most often by PFGE in conjunction with restriction digestion and probing for specific sequences, is used to generate skeleton structures of genome organization and is suited to screening large numbers of strains. The most informative comparisons, in terms of evaluating genome dynamics, are those made between close relatives - either sister species or isolates within a species.

Genome rearrangements at three levels of comparison

Comparisons between genome arrangements in related bacteria have been made at several levels: interspecific, intraspecific (serovars, or immunologically detectable variants, and biovars, or biochemically detectable variants within a single species), and within presumed clonal populations. The phylogenetic relationships between many of the bacterial species referred to in the following discussion, based on an analysis of their 16S rDNA, are shown in Figure 2.

Figure 2

Phylogenetic relationships of the bacteria discussed in the text, based on 16S rDNA sequences. The unrooted tree was built using the neighbor-joining method with 500 bootstrap replicates. A similar clustering of close relatives is found using the maximum parsimony method.

Interspecific comparisons

Complete genome sequence data for three pairs of related species (Vibrio cholerae and Escherichia coli; Streptococcus pneumoniae and Streptococcus pyogenes; and Mycobacterium tuberculosis and Mycobacterium leprae) have been used to compare the positions of conserved sequences within each genome. The comparisons reveal in each case a distinct X-shaped pattern in scatterplots, suggestive of large chromosomal inversions that reverse the genomic sequence symmetrically around the axis of replication [7]. Similarly, Chlamydia trachomatis and Chlamydia pneumoniae differ by multiple large inversions, apparently oriented around the axis of replication. In addition, the region around the terminus of replication in these two species is subject to a high rate of reorganization [8]. Table 1 summarizes all the interspecific rearrangements referred to in this section.
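
To see why repeated inversions that are symmetric about the replication axis would produce an X-shaped scatterplot, it can help to simulate them. The following is a minimal sketch in Python under invented assumptions: a 1,000 kb circular chromosome with its origin at position 0, one gene every 50 kb, and three nested inversions of arbitrary size. None of these numbers comes from the genomes discussed above.

```python
# Hypothetical numbers throughout: a 1,000 kb circular chromosome with the
# origin at position 0 and the terminus at 500, and one gene every 50 kb.
LENGTH = 1000

def symmetric_inversion(pos, half_width, length=LENGTH):
    """Invert the segment reaching half_width either side of the origin.
    A gene inside the segment moves to its mirror image across the axis."""
    distance_from_origin = min(pos, length - pos)
    if distance_from_origin <= half_width:
        return (length - pos) % length
    return pos

def compare(positions, half_widths):
    """Return (x, y) pairs: position before vs. after the inversions."""
    points = []
    for x in positions:
        y = x
        for w in half_widths:
            y = symmetric_inversion(y, w)
        points.append((x, y))
    return points

points = compare(range(0, LENGTH, 50), half_widths=[400, 250, 100])

# Genes inverted an even number of times stay on the diagonal y = x; genes
# inverted an odd number of times land on the anti-diagonal y = LENGTH - x.
diagonal = [(x, y) for x, y in points if y == x]
anti_diagonal = [(x, y) for x, y in points
                 if y == (LENGTH - x) % LENGTH and y != x]

print(len(diagonal), "genes on the diagonal,",
      len(anti_diagonal), "on the anti-diagonal")
```

Because each symmetric inversion preserves a gene's distance from the origin, every point falls on one of the two diagonals, and together the two lines form the X. Inversions that were not symmetric about the axis would scatter points off both lines.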

Table 1 Interspecific genome comparisons

As discussed earlier, another important type of genome rearrangement is translocation. Comparative genomics using whole-genome sequences shows that the genomes of the close relatives Mycoplasma genitalium and Mycoplasma pneumoniae [9] can be subdivided into six segments, which are ordered differently in the two species. Within each segment the order of genes is conserved, and the increased size of the M. pneumoniae chromosome is due mainly to gene duplications. In both species there is strong uniformity of direction of transcription, and this direction is not changed by the translocations [9].

Other interspecific genome comparisons reveal that inversions, translocations and deletions typically distinguish closely related species. Thus, Neisseria meningitidis and Neisseria gonorrhoeae differ by multiple translocations and/or inversions of blocks of genetic markers within a 500 kilobase region [10]. Mycobacterium tuberculosis differs from the attenuated vaccine strain Mycobacterium bovis BCG Pasteur in carrying two tandem duplications in its chromosome, of 29 kb and 36 kb respectively [11]. In addition, clinical isolates of Mycobacterium tuberculosis differ from each other in having deletions of up to several kilobases, probably linked to homologous recombination between multiple copies of IS6110 in the genome [12]. Finally, the chromosomal locations of 30 putative orthologs between Bacillus subtilis and Bacillus cereus are arranged in an apparently random manner [13], similar to what is seen when comparing the genomes of very distantly related organisms. Also, within B. subtilis several variants created by X-ray mutagenesis have large inversions (1,700-1,900 kb) around the axis of replication, a 100 kb translocation, and smaller duplications and deletions [14]. In conclusion, in every case where interspecific comparisons have been made, clear evidence of large chromosomal rearrangements has been found.

Intraspecific comparisons

Comparisons within species, including comparisons between different serovars and biovars of the same species, reveal that rearrangements are less common within than between species (summarized in Table 2). Using PFGE after restriction digestion targeted to cut the genome within the conserved and repetitive rRNA operons, 'rrn genomic skeletons' were established for isolates of many serovars of Salmonella enterica [15]. The order of fragments, which is ABCDEFG in S. typhimurium and E. coli K-12, is conserved in most Salmonella serovars, most of which are host-generalists. In S. typhi and S. paratyphi C (which have human hosts), however, and in S. pullorum and S. gallinarum (which have fowl hosts), these fragments are rearranged. Thus, of 127 natural isolates of S. typhi examined, 21 different genome orders were found, all postulated to be due to inversions and translocations with end-points in rrn operons [15]. A feature of these rearrangements is that the distance from the origin of chromosome replication is well conserved, as is the direction of transcription relative to the direction of chromosome replication. A similar PFGE analysis has been made of Shigella, the human pathogenic form of E. coli [16]. This showed that of the four traditional Shigella subgroups (often referred to as species), S. boydii and S. sonnei had chromosomal arrangements identical to E. coli K-12, while S. dysenteriae and S. flexneri had different large rearrangements [17]. Interestingly, the Shiga toxin genes on the S. dysenteriae chromosome are bracketed by IS600 sequences, and increased toxin production is caused by tandem amplification via recombination between the IS600 elements [18].
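
When fragment orders such as ABCDEFG are tallied across many isolates, the same circular map can be written down starting from any fragment, and in either reading direction. The following is a minimal sketch in Python of one way to collapse such equivalent spellings before counting distinct genome orders. It is an illustration only: it is not necessarily how the cited studies counted arrangements, and every order in the example list other than ABCDEFG is hypothetical.

```python
def canonical_order(order):
    """Pick one representative for a circular fragment order by taking the
    alphabetically smallest of all rotations in both reading directions."""
    candidates = []
    for seq in (order, order[::-1]):
        for k in range(len(seq)):
            candidates.append(seq[k:] + seq[:k])
    return min(candidates)

# ABCDEFG is the order given in the text; the other strings are made up.
observed = ["ABCDEFG", "CDEFGAB", "GFEDCBA", "ACBDEFG"]
distinct = {canonical_order(o) for o in observed}

print(sorted(distinct))
```

Treating the two reading directions as equivalent discards orientation information. An analysis that also tracks the direction of each fragment relative to replication would need a signed representation instead.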

Table 2 Intraspecific genome comparisons

The genome sequence of the strain Lactococcus lactis IL1403 has been determined [19] and physical genome maps have been created for several Lactococcus lactis strains (subspecies lactis and subspecies cremoris strains) and strains of the related Streptococcus thermophilus [20,21]. Within each group, strains were similar with the exception of the L. lactis subspecies cremoris, where different strains were polymorphic, in part due to an inversion of half the chromosome. This inversion is due to a homologous recombination event between two defective copies of IS905 and does not alter the symmetry of the replication origin and terminus, oriC and terC [22]. Comparison of the physical and genetic maps of strains representing two serovars of Leptospira interrogans suggests that at least two inversions in the large replicon distinguish their genomes [23].

Brucella is a Gram-negative bacterium pathogenic for animals. The genus is divided into six species and numerous biovars. Physical maps of the genomes of reference strains in each species show a high conservation of restriction sites and the presence of two chromosomes. The exception is a large inversion in the small chromosome of B. abortus. But physical mapping of the genomes of the four biovars within one of these species, Brucella suis, reveals differences in both chromosome number and size. These differences can be explained by rearrangements due to homologous recombination between the three rrn loci in the genome [24]. It is proposed that the ancestor of Brucella had a single chromosome and that recombination, probably between rrn genes, led to the creation of two chromosomes.

Multiple chromosomes are also found in other bacteria, mostly within the Proteobacteria, including Rhodobacter sphaeroides, Leptospira interrogans, Rhizobium spp., Burkholderia cepacia, Agrobacterium spp., and Ochrobactrum anthropi [24]. There is no evidence that the presence of multiple chromosomes in these genomes is related to a common phylogeny, since it is not always shared by all members of a genus (for example, the Rhodobacter genus) or even by all strains of the same species (for example, Brucella suis). In Streptomyces, the most common rearrangements found are sequence and length variations in the terminal inverted repeats (TIRs) at the ends of the linear replicons. This variation is due to homologous recombination between repetitive sequences and results in amplifications, deletions, a high frequency of spontaneous mutations, and the transfer of sequences between different chromosome arms [25]. The exchange of telomeric regions has also been described for linear replicons in the unrelated bacterium Borrelia burgdorferi [26]. Instability may not be a particular feature of linearity, however, because when Streptomyces chromosomes are circularized they remain unstable in these regions [27]. The lack of housekeeping genes in a large region at each end of the Streptomyces linear chromosomes probably permits the detection of deletions at a high frequency because they do not affect cell viability under laboratory conditions.

Clinical and clonal populations

The closest relatives that have been subjected to genome organization analysis are presumed clonal derivatives associated with clinical infections (summarized in Table 3). Bordetella pertussis strains from a whooping cough outbreak in Canada were subjected to restriction-enzyme genome mapping. Among 70 isolates, presumed to be descended from the same starting clone, 14 different types were found (distinguished by restriction fragment length polymorphism, RFLP). Representatives of these 14 types were further analyzed and shown to have 11 different genome orders, due in each case to large chromosomal inversions [28]. A similar analysis among different laboratory strains also revealed frequent large inversions [29]. B. pertussis carries about 100 copies of the 1 kilobase insertion sequence IS481, providing many targets for homologous recombination. The positions of the origin and terminus of replication are unknown in Bordetella, but because almost all of the inversions are around the same axis, it seems likely that they are in fact symmetric about the origin-terminus axis, as has been observed in other species. The genomes of clinical isolates of Pseudomonas aeruginosa from cystic fibrosis patients were analyzed by PFGE, revealing that 50% of them have large chromosomal inversions [30]. Most of these inversions are approximately symmetric about the replication axis. It is not known if these rearrangements confer any advantage on the strains in colonizing the lung habitat.

Table 3 Comparisons of clinical and clonal isolates

PFGE analysis of the genomes of 30 Neisseria meningitidis epidemic strains belonging to the ET-5 complex, isolated from various parts of the world over a period of 20 years, revealed 10 different types, including some with genome order rearrangements [31]. A striking feature of N. meningitidis revealed by complete genome sequencing [32,33] is the presence of hundreds of repetitive elements that could contribute to genome rearrangements important for evasion of the host immune system. Finally, within a defined lineage of N. gonorrhoeae strains with pilin variations, an inversion of more than one third of the chromosome was found [34]. The end points of the inversion are within a multicopy gene family involved in pilin production.

Constraints on the frequency of rearrangement

Rearrangement could be constrained by the number and size of repetitive sequences in a genome. Most bacterial genomes, however, contain multiple copies of some highly expressed genes (such as rrn genes) or multiple copies of insertion sequences. While the relative positions of these sequences could influence which rearrangements occur most frequently, recombination between short repeats and the mobility of IS elements should increase the variety of rearrangements. A second potential constraint is the rate of recombination, but experimental data from S. typhimurium show that rates of recombination between long repeat sequences are at least as high as nucleotide substitution rates [1]. For example, the rate of inversion between the tuf genes, approximately 10^-8 per cell per generation, is equal to the rate of nucleotide substitution within the same genes [3,35]. If one can generalize from this, then recombination can rearrange genome organization as fast as genomes diverge by nucleotide substitution. A third possible constraint is that the fitness of the rearranged genomes is in general reduced. Indeed, inversions that reverse the orientation of sequences on either side of the replication terminus of S. typhimurium and E. coli usually occur very infrequently, or make the bacteria very unfit [1]. Inversions that do not alter the replication axis (that is, inversions that do not change the distance of genes from the origin of replication, or their orientation relative to the direction of replication) may be the least disruptive in terms of fitness, but this has never been rigorously tested. In conclusion, fitness costs may be an important constraint on the fixation of genome rearrangements in bacteria, but there are very few relevant measurements.

Comparing these theoretical constraints on genome rearrangements with the data from natural bacterial isolates, several patterns emerge. One is that inversions and translocations are very common between even closely related species (compare Figure 2 and Table 1). This presumably reflects the frequency of the rearrangement events, on a time scale of tens to hundreds of millions of years. In general, however, the variety of large rearrangements found is quite limited. Almost all rearrangements conserve the axis of chromosome replication, suggesting that this is important for fitness on the evolutionary time scale. Below the species level, within and between serovars/biovars, there are interesting differences in genome stability (Table 2). Some genomes are stable while others, like that of S. typhi, have many different arrangements. Even more striking are the data from clinical isolates that are probably very closely related, separated from one another on a time scale of months to a few years (Table 3). In such cases, a very high frequency of rearranged genomes is found. One thing that S. typhi and the clinical isolates of Neisseria, Pseudomonas and Bordetella have in common is that they have all encountered and survived the human immune system. It is tempting to speculate that such encounters select for variants of the infecting strain. Thus, if an invading pathogen population is targeted by the immune system, bacteria within that population with genome rearrangements may be sufficiently different in phenotype to escape and establish an infection. Variation generated by genome rearrangements has several advantages for a bacterial population invading a complex environment: it occurs at a high frequency, it is reversible, and it can simultaneously alter the expression pattern of many genes.

In conclusion, the main constraint on genome rearrangements on an evolutionary time scale may be bacterial fitness, in particular associated with the global regulation of gene expression patterns and the orderly and efficient replication of the genome. In certain complex environments, however, such as those encountered on invading a eukaryotic host, bacterial fitness may be positively associated, at least on a short time scale, with the generation of genome rearrangements.