Introduction

The availability of complete genome sequences from multiple isolates of a given species has opened up a whole new range of research strategies. By far the best-studied bacterial species is Escherichia coli, and the highest number of individual genome sequences is available for this species, which has been the working horse of bacteriology for as long as the specialization exists. Numerous basic molecular processes have been first characterized and extensively studied in E. coli, leading to insights that could subsequently be applied to other bacteria [47]. Despite the vast amount of knowledge already available for E. coli, based on decades of experimental research, genetic manipulation and, more recently, observations based on single or multiple genome sequences, comparison of a large number of E. coli genome sequences can still provide novel insights, such as the presence of genomic islands, present in some pathogenicity groups, but missing in others. At the time of writing, there are more than 100 E. coli genome sequence projects reported, many of which have been deposited to GenBank. Here, we compare 61 publically available genome sequences of E. coli and Shigella spp. isolates.

Escherichia spp. and Shigella spp. are Gram-negative, facultative anaerobic, intestinal bacteria belonging to the Enterobacteriaceae, which are taxonomically placed within the gamma subdivision of the Proteobacteria phylum. Although Shigella spp. isolates have been rewarded their own genus, which is divided into several species (representing different sero-groups), its separation from Escherichia spp. is mainly historical. For example, in Bergey’s Manual of Systematic Bacteriology, the section on Shigella phylogeny begins with the following sentence: “Scientific evidence accumulated to date strongly supports that view that Shigella species are biotypes/pathotypes or clones of E. coli” [39]. More than 50 years ago it was observed that Shigella spp. and E. coli have the same fertility system [25]; in 1972, Brenner et al. [3] found that based on DNA/DNA hybridization, that Shigella spp. and E. coli are the same species. Experiments with multilocus enzyme electrophoresis concluded that nearly all of the Shigella species are clones from within E. coli species [35]. Further, analysis of 16S rRNA sequence alignment places Shigella spp. within E. coli [6]. Thus, all current evidence indicates that Shigella spp. should be classified as E. coli [23, 36]. Both genera contain highly diverse species, although Shigella spp. are as related to E. coli as they are to each other. E. coli is a ubiquitous component of the intestinal gut flora of animals including humans, and can survive and multiply in abiotic environments as well. The species comprises both benign and pathogenic variants, whilst Shigella spp. are all enteropathogens in mammals.

E. coli isolates have in the past been divided into subgroups in various ways. Based on established pathogenicity towards the human host, pathogenic versus commensal E. coli have been recognized, although it is acknowledged that 'pathogenic' E. coli strains may colonize other animal species asymptomatically. Pathogenic E. coli have further been subdivided according to their typical site of infection and clinical manifestations in humans, for instance enteropathogenic, uropathogenic, or extra-intestinal pathogenic E. coli, or based on their virulence mechanisms, such as enterohemorragic (EHEC), enterotoxigenic, enteroinvasive, and enteroaggregative E. coli [1, 13]. Other divisions that are frequently used are based on serology (e.g., serotypes O127:H7 or K12) or, mainly for population genetic purposes, on phylogenetic properties of particular housekeeping genes, as established by MULTI-ENZYME electrophoresis and later by multilocus sequence typing (MLST) [25]. Finally, some isolates are described simply for their source of isolation, such as environmental isolates or avian pathogenic E. coli.

All these subdivisions have been applied more or less frequently to group isolates that share particular features. We were interested to see if any of these groupings would hold when isolates were compared based on their complete genome sequences, considering some or all of their genes. Isolates from some groups (based on whatever grouping) have been more frequently sequenced than from others, and complete information on all characteristics of interest (pathogenicity, source of isolation, serotype) is not available for all sequenced isolates. Despite these recognized shortcomings in sampling bias and recorded information, comparison of these 61 genome sequences revealed that neither the 16S gene, nor gene fragments usually used for MLST, provides biologically meaningful information on the relatedness of the sequenced isolates. The best way to analyze this is by taking into account all the genomic content, rather than looking at one or a few individual genes. The E. coli core genome has been previously reported to be less than half the genes [13], with more than half the E. coli genes in any given genome being found in some strains, but missing in others. Many of these variable genes can be clustered to specific regions, located on genomic islands in an E. coli chromosome.

Materials and Methods

Bacterial Genomes and Gene Annotations

Sixty-one bacterial genomes of E. coli and Shigella spp. were used in this study (Table 1). Of these, 39 fully sequenced genomes and 19 genomes for which the sequence was still in progress at the time of extraction were obtained from GenBank (1). Sequence from E. coli O103 Oslo was obtained from Norwegian Veterinary Institute and sequences from strains LANL ECA and LANL ECF were obtained from Los Alamos National Lab. Genome sequences of Escherichia albertii, Escherichia fergusonii, and Salmonella enterica Typhimurium LT2 were included for comparison (Table 1). The ‘quality score’ for each genome is given in Table 1, based on the suggested scale by Chain et al. [4]. A completely sequenced genome that has been deposited to GenBank is given a score of ‘1’, with the only exception being E. coli O157:H7 isolate EDL933, which currently has more than 4,000 “N’s” in the DNA sequence of the GenBank file, representing unfilled gaps along the chromsomal sequence—hence, this genome is given a lower score of ‘2’. The higher scores represent lower quality (and often more contigs, or pieces of the DNA, although sequence quality is not measured only by this, as described in [4]).

Table 1 Genomes used in this study

16S Ribosomal RNA Analysis

The sequences encoding 16S ribosomal RNA were extracted from the analyzed genomes using RNAmmer [22]; sequences with an RNAmmer score above 1,400 were considered reliable and were kept for analysis. From every genome, the gene with highest similarity to rrsH of E. coli K12 MG1655 was selected and these sequences were aligned using ClustalX [24]. A phylogenic tree was generated by ClustalX using the Bootstrap neighborhood-joining method, showing the bootstrap values at branch points, visualized by NJPlot [34].

In Silico MLST

The alleles for seven housekeeping genes used for MLST of various species (www.mlst.net) were analyzed. These were fragments of adk, fumC, icd, gyrB, mdh, purA, and recA. The obtained DNA sequences were extracted from the genome sequence, concatenated and phylogenically analyzed as described above. Alignments were not manually adjusted to avoid subjective interpretation of the outcome.

Predicted Proteome Analysis

The predicted proteomes comprising all protein-coding genes were extracted from the GenBank files for the published genomes. For unpublished genomes, they were predicted using EasyGene [30]. All predicted proteomes were compared by BLASTP reciprocal pairwise comparison. Two genes were attributed to a single gene family and considered 'conserved' when they shared at least 50% amino acid identity over at least 50% of the length of the longest gene.

A hierarchical clustering was performed for the complete pan-genome as described by Snipen et al. [38]. Briefly, a pan-genome matrix was constructed consisting of 1 s and 0 s where each row corresponds to a gene family, as described above, and each column to a genome. Cell (i,j) in the matrix is 1 if gene family i is present in genome j, or 0 if it is absent. Manhattan distances were calculated and used for hierarchical clustering to generate the tree. The plotted distance between two genomes shows the proportion of gene families where their present/absent status differs. Thus, pan-genome hierarchical clustering analyses genes that are not conserved, but vary in their presence or absence between genomes. Shorter distances represent genomes with more gene families in common. Genes only occurring in a single genome (singletons) were not included in the analysis. Bootstrap values (per mil) were computed for each inner node by re-sampling the rows of the matrix.

A pan- and core genome plot was constructed according to [12]. The order of genomes was chosen based on the pan-genome tree, starting with the largest E. coli O157 genome. For the pan-genome curve, all cumulative BLAST hits found in the genomes were plotted as a running total, which increases as more genomes are added. The number of gene families with at least one representative in every genome was plotted for the core genome and this slowly decreases with the addition of more genomes, as these genomes may lack genes from gene families that had been conserved in the previously plotted genomes.

A BLAST atlas was constructed as described by Hallin et al. [14].

Results and Discussion

A number of characteristics of each of the 61 genomes are summarized in Table 1, such as their size, their number of recognized protein genes, and their gene density. Their GC content varies around 50% for all genomes (not shown), but their size and number of genes varies extensively. The smallest E. coli genome included is that of strain BL21 (DE3) sequenced by the Korean consortium, which is only 4.56 Mbp, and the smallest Shigella genome is that of Shigella dysenteriae Sd197, with 4.56 Mbp. The longest genome of the completed genomes so far belongs to E. coli O157:H7 strain EC4115, with 5.70 Mbp. Longer genomes are listed in Table 1, but since those sequences are still in multiple contigs, it is possible that their stated length is overestimated. These size differences mean that around one million nucleotides (approximately 20% of a genome) can be absent in one E. coli or Shigella isolate and present in another. These 'extra' sequences are not void, as indicated by the variation in number of genes: the longest E coli genome has 1,158 more predicted genes than the shortest E. coli genome (5,315 genes for strain EC4115 and 4157 genes for BL21). Further, the observed gene density is relatively constant, at 0.911 ± 0.04 genes per 1,000 base pairs. It should be noted that published proteomes have been defined using different gene prediction programs and definitions, so that the observed slight variation in gene density might be explained by non-standardized gene identification.

Phylogeny of 16S Ribosomal RNA and MLST Genes

A phylogenetic tree based on the 16S ribosomal RNA sequences extracted from a representative set of 20 Enterobacteriacea genomes is shown in panel a of Fig. 1, which is in agreement with the known phylogeny of the family. The tree for the full set of the 61 E. coli and Shigella strains, including two additional species of Escherichia and one from S. enterica is shown in Fig. 1b. From this figure, it is obvious that phylogeny of the 16S rRNA gene does not resolve well within the genus level, as is known, because the rRNA operons are so similar. Although some of the tree nodes are predicted with uncertainty, clearly the genera Shigella and Escherichia are not separated, nor are E. coli genes separated from those of E. fergusonii or E. albertii. This finding was expected, considering the close relatedness between Escherichia spp. and Shigella spp. In general, 16S sequences are not suitable to analyze inter-strain relationships within a species or between closely related species, as illustrated with this set of Enterobacteriaceae genes. This questions the reliability to use 16S as an indicator for the species to which unknown sequenced DNA belongs [45].

Figure 1
figure 1figure 1

Phylogenetic tree based on extracted 16S rRNA sequences. a Comparison of 20 different Enterobacteriaceae, based on extracted 16S rRNA sequences from the GenBank sequence files. E. coli and Shigella are shown in green. b Tree of 61 sequenced E. coli (black) and related species (colored), based on the alignment of the 16S rRNA gene sequence. Apart from Shigella spp., the genes from E. albertii and E. fergusonii are also included (arrows). The 16S rRNA gene of S. enterica Typhimurium LT2 was used as the root. Bootstrap values, indicated in red, show that most nodes are predicted with uncertainty; nevertheless, the genera Escherichia spp. and Shigella spp. are not separated in this tree, and the three Escherichia species are also mixed

Next, it was investigated if conserved housekeeping genes, frequently assessed for MLST, provide a better representation of the relatedness of the investigated genomes. Various MLST schemes are in use for E. coli [9, 28] or Shigella spp. [35] but these are not standardized and the genes assessed in these schemes are not conserved in all genomes. We used the combination of seven housekeeping genes that has been applied to a number of bacterial species [26] (www.mlst.net). Since S. enterica lacks fumC (an observation that somewhat weakens the general applicability of this MLST gene set), that genome was not included in the analysis. The resulting tree, shown in Fig. 2, still mixes E. coli with Shigella species, and does not separate all pathogenic strains from commensal strains. Some of the phylogroups previously defined by multilocus enzyme electrophoresis are clearly separated, such as the E cluster containing all O157 strains, the A/B cluster of commensal K12 and B strains, and the B2 cluster containing some of the uropathogenic strains, in accordance to comparisons carried out by others [40]. Other authors concluded that the O157 serotype of EHEC probably evolved in successive evolutionary events [9]; however, that conclusion is not supported by the MLST tree. And although the B phylogroup is known for its commensal isolates, one of which being used by Delbrück and Luria for their famous phage work, this branch also contains the enteroaggregative strain 101-1 (Fig. 2). Moreover, the two S. dysenteriae strains are widely separated from each other. Pupo et al. [36], who used a different set of MLST genes, also found that isolates of the three species Shigella flexneri, Shigella boydii, and S. dysenteriae, could not always be grouped together nor separated from E. coli. Various enteroinvasive E. coli serotypes have been suggested as ancestral to the different Shigella serogroups [23], which could explain the lack of differentiation power of MLST in this case. Apparently, neither MLST gene sets are suitable to group these Enterobacteriaceae organisms in a meaningful way. The performance of MLST could in theory be improved by selecting different genes, for instance using a set of genes specifically chosen to produce the desired grouping. However, the strength of MLST analysis should be that a conserved set of genes is able to identify phylogenetic relationships in any collection of isolates from one species. If one has to select a 'standard' gene set specifically for the species under investigation, it weakens the general application of MLST considerably.

Figure 2
figure 2

Phylogenetic tree of concatenated MLST gene alleles (adk, fumC, icd, gyrB, mdh, purA, recA), extracted from the genome sequences. Color use is the same as in Fig. 1

Pan-Genome Comparisons

MLST analyzes allelic differences in genes whose presence has to be conserved in all genomes. However, we hypothesized that genes that are variably present could provide useful information as to the true relatedness of the analyzed genomes. Since the variable fraction contain genes that are present in some, absent in other genomes, a phylogenetic analysis cannot be performed to capture all information. Figure 3 displays a pan-genome clustering tree, based on the gene families that are variably present in the analyzed genomes (gene families comprising singletons were excluded). The hierarchical clustering obtained by this analysis correctly separates the Shigella spp. and S. Typhimurium from Escherichia spp. and, within the latter genus, separates E. coli from the other Escherichia spp (Fig. 3). Moreover, all E. coli O157:H7 genomes now cluster together, as do the K12 derivatives (W3110, MG1655, DH1, BW2952, DH10B, and ATCC8739). The strains belonging to phylogenic group B are also positioned in one cluster, to which the non-pathogenic commensal strain HS also seems to belong. All these are avirulent isolates, and it is quite impressive that all these are positioned close together in the tree. We conclude that this analysis of variable genes identifies inter-strain relationships that can be correlated to the lifestyle of the organisms.

Figure 3
figure 3

Pan-genome clustering of E. coli (black) and related species (colored), based on the alignment of their variable gene content. The genomes now cluster according to species and a relatedness between E. coli K12 derivatives (green block) and group B isolates (orange block) is visible

The contribution of every genome to the complete pan-genome of E. coli and related organisms is demonstrated in Fig. 4, where the pan-genome and core genome, as defined by other authors [40] of the analyzed sequences is plotted. The number of novel gene families for every added genome is also shown. As can be seen, all genomes contribute to the increase of the pan-genome. This increase is less strong when similar genomes are added (for instance all four K12 genomes, or the B strains). The addition of Shigella spp. genomes does not alter the shape of the pan-genome curve, but addition of the other Escherichia genomes causes a sharp increase. The contribution of E. fergusonii to the pan-genome has been noted before [42].

Figure 4
figure 4

Pan- and core genome plot of the analyzed genomes. The blue pan-genome curve connects the cumulative number of gene families present in the analyzed genomes. The red core genome curve connects the conserved number of gene families. The gray bars show the numbers of novel gene families identified in each genome

The core genome reduces in size as more genomes are added, with an expected significant drop when the shorter genomes are assessed (starting with E. coli K12 DH10B, at position 18 in Fig. 4). The core genome reaches 1,472 gene families conserved in 53 E. coli genomes, which is further reduced to 993 gene families if Shigella spp. are considered as well. The bars show how many novel gene families each genome contributes to the growing pan-genome. It should be noted that the order in which genomes are analyzed influences the number of these reported novel gene families other than for singletons. When novel genes are considered, instead of novel gene families, the findings can be even more dramatic. For instance, six novel E. coli genome sequences identified approximately 10,000 novel genes [42]. Previous work has estimated a core genome of 1,976 genes for 20 E. coli genomes and a pan-genome of 17,831 genes. Our analysis of 53 E. coli genomes identified 1,472 conserved gene families and 13,296 gene families comprising the pan-genome. We prefer to report these findings as gene families, instead of individual genes, using clearly defined criteria for inclusion of genes into a gene family (described in the “Materials and Methods” section).

Where are all these variable genes located in a genome? Gene order is not strongly conserved between the analyzed genomes, so that gene location depends which genome is considered. Nevertheless, by visualizing where a gene, whose presence can vary, is located on a single reference genome provides further information, and this can be visualized in a BLAST atlas [14]. In the BLAST atlas of Fig. 5, it becomes apparent that the variable gene content is not evenly distributed over the reference genome, but appears to be distributed over various islands. The reference chromosome of E. coli O157:H7 EC4115 was chosen, as it is the largest chromosome for which a complete sequence is currently available. Around this, all other genomes are plotted, whereby lack of color indicates that particular gene from EC4115 is missing in the shown genome. The strong conservation of gene presence within the O157 serotype (in green) contrasts with the multiple 'gaps' seen in the other lanes. Every gap represents multiple genes in strain EC4115, illustrating that gene variation is not evenly distributed along the genome, but located in islands.

Figure 5
figure 5

BLAST atlas. In the middle, a genome atlas of E. coli O157:H7 strain EC4115 is shown, around which BLAST lanes are shown. Every lane corresponds to a genome, with the following colors (going outwards): green E. coli O157:H7 (15 lanes); light blue E. coli LANL strains (two lanes); dark blue Shigella spp. (eight lanes); red E. coli K12 and derivatives (six lanes); orange E. coli strain B phylogroup (four lanes); followed by all other E. coli genomes in different colors. The outermost three lanes represent E. fergusonii, E. albertii, and S. enterica Typhimurium LT2. Lack of color indicates that the genes at that position in strain EC4115 were not found in the genome of that lane. The position of replication origin and terminus is indicated

Concluding Remarks

“This gene is not found in E. coli”, is an expression often heard in discussions about novel genes in various organisms, and when people are looking for functional matches in databases. It is a sobering thought to realize that any given E. coli genome sequenced will have only roughly 20% of its genes part of the E. coli core, and the remaining 80% are not found in all other E. coli genomes. After a comparison of the diversity with many sequenced E. coli genomes, it has become clear such a statement can only be valid when it is specified which E. coli genome sequence has been searched. Of the predicted pan-genome comprising about 16,000 gene families, the core (slightly less than a thousand genes) is found to be only about a fifth of a typical E. coli genome which contains around 5,000 genes. Many of the accessible or variable genes, making up more than 90% of the pan-genome and roughly four fifth of a typical genome, are often found co-localized on genomic islands. The diversity within the species E. coli, and the overlap in gene content between this and related species is far greater than many had anticipated, and represents a broad set of functions for adapting to many different environments. The comparative methods used here are generally applicable to genomes of related species, and are considered a valuable tool to evaluate current insights of species' relatedness and evolutionary history.