Comparative Genomics of Bifidobacterium, Lactobacillus and Related Probiotic Genera
- Cite this article as:
- Lukjancenko, O., Ussery, D.W. & Wassenaar, T.M. Microb Ecol (2012) 63: 651. doi:10.1007/s00248-011-9948-y
Six bacterial genera containing species commonly used as probiotics for human consumption or starter cultures for food fermentation were compared and contrasted, based on publicly available complete genome sequences. The analysis included 19 Bifidobacterium genomes, 21 Lactobacillus genomes, 4 Lactococcus and 3 Leuconostoc genomes, as well as a selection of Enterococcus (11) and Streptococcus (23) genomes. The latter two genera included genomes from probiotic or commensal as well as pathogenic organisms to investigate if their non-pathogenic members shared more genes with the other probiotic genomes than their pathogenic members. The pan- and core genome of each genus was defined. Pairwise BLASTP genome comparison was performed within and between genera. It turned out that pathogenic Streptococcus and Enterococcus shared more gene families than did the non-pathogenic genomes. In silico multilocus sequence typing was carried out for all genomes per genus, and the variable gene content of genomes was compared within the genera. Informative BLAST Atlases were constructed to visualize genomic variation within genera. The clusters of orthologous groups (COG) classes of all genes in the pan- and core genome of each genus were compared. In addition, it was investigated whether pathogenic genomes contain different COG classes compared to the probiotic or fermentative organisms, again comparing their pan- and core genomes. The obtained results were compared with published data from the literature. This study illustrates how over 80 genomes can be broadly compared using simple bioinformatic tools, leading to both confirmation of known information as well as novel observations.
The first bacterial genome sequences were published in 1995, and within 15 years, over a thousand fully sequenced bacterial genomes have become publicly available. A number of these genome sequences are derived from bacteria used as probiotics or starter cultures in food fermentation, or both. Reid and co-workers defined probiotics as “live microorganisms which when administered in adequate amounts confer a health benefit on the host”. A number of bacterial species from various genera are in use as probiotics, including members of Lactobacillus, Lactococcus and, less commonly, Leuconostoc. These Firmicutes are sometimes collectively described as lactic acid bacteria (LAB). Other commonly used probiotic species belong to Bifidobacterium, a genus within the phylum Actinobacteria. These genera exclusively contain species that are unlikely to cause disease while colonizing the intestine, and although some species (e.g. Bifidobacterium dentium) have been associated with dental disease, these are more commonly members of a normal oral flora. The distinction between normal gut flora (commensals) and probiotic bacteria having a beneficial effect on their host’s health cannot always be made, for which reason we collectively describe them here as ‘non-pathogens’. Species belonging to LAB or Bifidobacterium are also frequently used in food fermentation, another application where the bacterial load of food is desirably increased. Besides LAB and Bifidobacterium, fermentation starter cultures typically include Streptococcus thermophilus, a non-pathogenic member of a genus that mostly contains pathogenic species. Some strains of Enterococcus are also in use as starter cultures or probiotics, although the species used also contain pathogenic strains. These two genera are therefore of interest, and their species that are used as starter cultures are included in our general description of ‘non-pathogens’.
Other types of bacteria (particular strains of Escherichia coli, Pediococcus species and others) or yeasts used as starter cultures or probiotics are not treated here.
For all six genera of interest, multiple genome sequences are publicly available. In many cases, several genomes per species have been sequenced, so that the variation between and even within species can be assessed. One obvious question that could be addressed by comparison of these genomes is: what genes (if any) are common to all genomes of non-pathogens and distinct from genes found in (related) pathogens? Such a comparison requires including multiple species and genera of multiple bacterial phyla (in this case, the phyla Firmicutes and Actinobacteria). As a general rule, genetic diversity increases with evolutionary distance, so that the genetic variation in such a collection of genomes will be enormous. One way of extracting information from such complex data is by grouping genes into functional groups or families, so that gene families rather than individual genes are compared. Such grouping is based on protein sequence similarity, as this approximately predicts conservation of gene function, ignoring the exceptions resulting from parallel evolution where functional similarity does not coincide with sequence conservation. Slight differences in function, resulting from minor differences in sequence, are usually ignored in these groupings, so that fewer but broader groups are obtained.
In this contribution, two approaches were used to compare over 80 genomes from six bacterial genera of interest. First, all protein-coding genes from these genomes were grouped into gene families based on sequence identity using a defined similarity cut-off, after which comparisons between and across genera could be performed. Genomes were then compared within their genus for both conserved and variable genes. Second, clusters of orthologous groups (COG) of genes were used to produce functional groups of genes. An attempt was made to identify differences in functional gene distribution between pathogenic and non-pathogenic members of the six genera of interest.
Materials and Methods
Selection of Genomes Used in This Study
Genomes selected for analysis (original columns: genome; size, bp or Mb; number of genes; comments. Only genome names and comments are available here)
Genome | Comments
Lactobacillus acidophilus NCFM | Commercial strain for yogurt, fluid milk production
Lactobacillus brevis ATCC 367 | Starter culture for beer, sourdough, and silage
Lactobacillus casei ATCC 334 | Starter culture for milk fermentation and flavour development of cheese
Lactobacillus casei BL23 |
Lactobacillus crispatus ST1 | Normal oral/vaginal flora, chicken isolate
Lactobacillus delbrueckii bulgaricus ATCC 11842 |
Lactobacillus delbrueckii bulgaricus ATCC BAA-365 | Thermophilic starter culture for yogurt, Swiss and Italian-type cheeses
Lactobacillus fermentum IFO 3956 |
Lactobacillus gasseri ATCC 33323 | Human isolate, type strain
Lactobacillus helveticus DPC 4571 |
Lactobacillus johnsonii FI9785 | Competitive exclusion strain in chicken
Lactobacillus johnsonii NCC 533 |
Lactobacillus plantarum JDM1 |
Lactobacillus plantarum WCFS1 |
Lactobacillus reuteri DSM 20016 | Type strain, human isolate
Lactobacillus reuteri JCM 1112 |
Lactobacillus rhamnosus GG |
Lactobacillus rhamnosus GG ATCC53103 |
Lactobacillus rhamnosus Lc 705 |
Lactobacillus sakei sakei 23K |
Lactobacillus salivarius UCC118 |
Lactococcus lactis cremoris MG1363 | Plasmid-cured NCDO712, lab strain
Lactococcus lactis cremoris SK11 |
Lactococcus lactis lactis Il1403 |
Lactococcus lactis lactis KF147 |
Leuconostoc citreum KM20 | Kimchi (food, Korea)
Leuconostoc kimchii IMSNU11154 | Kimchi? not specified
Leuconostoc mesenteroides mesenteroides ATCC 8293 | Food fermentation, not specified
Enterococcus faecalis V583 | Clinical, blood isolate, vancomycin resistant
Enterococcus faecalis T11 |
Enterococcus faecalis E1Sol | Faecal isolate, antibiotic-naïve, normal flora
Enterococcus faecalis OG1RF | No info - lab strain?
Enterococcus faecalis T3 |
Enterococcus gallinarum EG2 |
Enterococcus casseliflavus EC10 |
Enterococcus casseliflavus EC20 |
Enterococcus faecium PC4.1 | Human microbiome, normal flora
Enterococcus faecium Com12 |
Enterococcus faecium Com15 |
Streptococcus agalactiae 2603V/R | Clinical isolate, common in adults
Streptococcus agalactiae A909 |
Streptococcus agalactiae NEM316 |
Streptococcus dysgalactiae equisimilis GGS 124 |
Streptococcus gallolyticus UCN34 | Normally rumen flora, this is a clinical human isolate from endocarditis
Streptococcus gordonii str. Challis CH1 | Causes caries and periodontal diseases
Streptococcus infantarius infantarius ATCC BAA-102 | Human microbiome project, normal flora
Streptococcus mitis B6 |
Streptococcus mutans NN2025 | Normally oral flora, can cause caries, endocarditis. Clinical isolate
Streptococcus mutans UA159 | Oral flora, can cause caries, caries isolate
Streptococcus pneumoniae ATCC 700669 | Alternative name Spain 23FST81. Pandemic, high prevalence, invasive
Streptococcus pneumoniae G54 | Resistant clinical isolate
Streptococcus pneumoniae TIGR4 | Virulent clinical isolate
Streptococcus pyogenes M1 GAS SF370 |
Streptococcus pyogenes MGAS10270 | Sequenced for comparative genome analysis
Streptococcus pyogenes MGAS8232 |
Streptococcus sanguinis SK36 | Indigenous oral bacteria, causes dental decay, oral plaque isolate
Streptococcus suis 05ZYH33 | Causes disease in pigs and occasionally humans
Streptococcus suis BM407 | Human clinical isolate
Streptococcus suis GZ1 | Causes meningitis, arthritis, pneumonia in pigs human epidemic in China
Streptococcus thermophilus CNRZ1066 | Isolated from yogurt for industrial dairy fermentations
Streptococcus thermophilus LMD-9 | Used in the manufacture of fermented dairy foods
Streptococcus thermophilus LMG 18311 | Isolated from yogurt for industrial dairy fermentations
Bifidobacterium adolescentis ATCC 15703 | Normal gut flora
Bifidobacterium animalis lactis AD011 | Normal gut flora
Bifidobacterium animalis lactis BB-12 | Normal gut flora
Bifidobacterium animalis lactis Bl-04 | Normal gut flora
Bifidobacterium animalis lactis DSM 10140 | Normal gut flora
Bifidobacterium animalis lactis V9 | Normal gut flora
Bifidobacterium animalis lactis HN019 | Normal gut flora
Bifidobacterium dentium Bd1 | Normal oral and gut flora, can cause caries, caries isolate
Bifidobacterium dentium ATCC 27678 | Human microbiome, faeces isolate
Bifidobacterium longum DJO10A | Normal gut flora, probiotic
Bifidobacterium longum NCC2705 | Normal gut flora, probiotic
Bifidobacterium longum infantis ATCC 15697 | Normal gut flora, probiotic
Bifidobacterium longum infantis CCUG 52486 | Normal gut flora, human microbiome project
Bifidobacterium longum longum JDM301 | Normal gut flora, probiotic
Bifidobacterium angulatum DSM 20098 | Normal gut flora, type strain
Bifidobacterium bifidum NCIMB 41171 | Normal gut flora, probiotic
Bifidobacterium catenulatum DSM 16992 | Normal gut flora
Bifidobacterium gallicum DSM 20093 | Human microbiome project
Bifidobacterium pseudocatenulatum DSM 20438 | Human microbiome project
Definition of Gene Families and Pan- and Core Genome
The pan-genome of a collection of genomes represents all genes encountered in these genomes. In order to define a pan-genome, the criteria to score a gene as ‘conserved’ or ‘novel’ were used as previously described. Simply put, two genes are considered to belong to the same gene family and thus ‘conserved’ when their amino acid sequences are at least 50% identical over at least 50% of the length of the longer gene. All genes of a genome are thus grouped into gene families. Multiple genes per genome can belong to a single gene family, resulting in a lower number of gene families per genome than the reported number of genes. A gene that finds no match under these criteria is put in its own gene family as a singleton.
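The 50/50 rule can be illustrated with a minimal Python sketch. It assumes that pairwise alignment results (percent identity and alignment length) are already available, for instance from BLASTP, and it links qualifying pairs into families by single linkage. The linkage strategy, the function names and the input layout are assumptions made for illustration and are not taken from the published method.

```python
def same_family(identity_pct, aligned_len, len_a, len_b,
                min_identity=50.0, min_coverage=0.5):
    """50/50 rule: at least 50% identity over at least 50% of the longer protein."""
    longer = max(len_a, len_b)
    return identity_pct >= min_identity and aligned_len >= min_coverage * longer


def build_families(lengths, hits):
    """Group genes into families (single linkage, an assumption of this sketch).

    lengths: dict mapping gene id -> protein length
    hits:    iterable of (gene_a, gene_b, identity_pct, aligned_len)
    Genes without any qualifying hit end up alone, i.e. as singletons.
    """
    parent = {gene: gene for gene in lengths}

    def find(gene):
        # Follow parent links to the family representative (with path halving).
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b, identity, aligned in hits:
        if same_family(identity, aligned, lengths[a], lengths[b]):
            parent[find(a)] = find(b)

    families = {}
    for gene in lengths:
        families.setdefault(find(gene), set()).add(gene)
    return sorted(families.values(), key=lambda fam: sorted(fam))


# Toy data: only g1/g2 satisfy both thresholds.
lengths = {"g1": 200, "g2": 180, "g3": 400, "g4": 150}
hits = [("g1", "g2", 72.0, 170),   # passes identity and coverage
        ("g1", "g3", 90.0, 150),   # covers less than half of the longer protein
        ("g3", "g4", 45.0, 140)]   # identity below 50%
print(build_families(lengths, hits))
```

In this toy input, g3 and g4 remain singletons, which mirrors how a gene without a match is placed in its own family.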
An accumulative pan-genome was constructed according to Friis et al., who built on work by Tettelin and co-workers. The resulting pan-genome curve increases in size as more genomes are analyzed. Its shape depends on the order in which the genomes are added, but the final accumulative pan-genome is not influenced by the order of analysis. Similarly, a core genome is defined as all gene families conserved in all analyzed genomes, and this decreases in size as more genomes are analyzed.
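The accumulation described above amounts to a running set union (pan-genome) and a running set intersection (core genome). The following Python sketch shows this with toy data; it assumes each genome has already been reduced to a set of gene family identifiers, and the function name and input layout are hypothetical.

```python
def accumulate_pan_core(genomes):
    """Return (name, pan_size, core_size) after each genome is added.

    genomes: ordered list of (name, set_of_gene_family_ids)
    """
    pan = set()
    core = None
    curve = []
    for name, families in genomes:
        pan |= families                                   # union grows or stays equal
        core = set(families) if core is None else core & families  # intersection shrinks
        curve.append((name, len(pan), len(core)))
    return curve


toy = [("A", {1, 2, 3, 4}), ("B", {1, 2, 3, 5}), ("C", {1, 2, 6})]
print(accumulate_pan_core(toy))
print(accumulate_pan_core(list(reversed(toy))))
```

The intermediate points differ between the two orders, while the final pan- and core genome sizes are the same, which is the order dependence of the curve versus the order independence of the end result.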
Pairwise pan- and core genomes were calculated for all genome combinations as above, and for each combination, the obtained core genome was expressed as the fraction of the pan-genome. These percentages were visualized in a BLAST Matrix.
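As a rough illustration of the pairwise calculation, the sketch below expresses the shared gene families of two genomes as a percentage of their combined gene families. It assumes each genome is represented as a set of gene family identifiers; the function names are hypothetical, and the actual BLAST Matrix software may handle details (such as paralogues) differently.

```python
from itertools import combinations


def shared_percentage(families_a, families_b):
    """Pairwise core genome as a percentage of the pairwise pan-genome."""
    pan = families_a | families_b
    core = families_a & families_b
    return 100.0 * len(core) / len(pan) if pan else 0.0


def pairwise_matrix(genomes):
    """genomes: dict mapping genome name -> set of gene family ids."""
    return {(a, b): shared_percentage(genomes[a], genomes[b])
            for a, b in combinations(sorted(genomes), 2)}


toy = {"A": {1, 2, 3, 4}, "B": {3, 4, 5, 6}, "C": {1, 2, 3, 4}}
for pair, pct in pairwise_matrix(toy).items():
    print(pair, round(pct, 1))
```

Because the denominator is the union of both genomes, a small genome fully contained in a large one still scores well below 100%, which makes the measure sensitive to differences in genome size as well as content.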
Core Genome Consensus Tree
Phylogenetic trees were constructed for all core genes that were conserved within the analyzed Firmicute genomes. Multiple alignments of all core sequences were performed with MUSCLE software. PAUP was used to construct a set of core trees. These trees were then compared, and a best-fit consensus tree was constructed as described by Retief.
In Silico MLST Analysis
In silico multilocus sequence typing (MLST) analysis was performed with gene fragments extracted from the genome sequences. For Bifidobacterium, gene fragments from clpC, fusA, gyrB, ileS, purF, rplB and rpoB were extracted, according to the method proposed for Bifidobacterium bifidum, Bifidobacterium breve and Bifidobacterium longum. For Enterococcus, the gene set of gdh, gyd, pstS, gki, aroE, xpt and yqlI, which is advised for use in Enterococcus faecalis (http://www.mlst.net), was compared with that designed for Enterococcus faecium, which is based on atpA, ddl, gdh, purK, gyd, pstS and adk. For Lactobacillus, de Las Rivas and co-workers described an MLST gene set specified for Lactobacillus plantarum based on the target genes pgm, ddl, gyrB, purK1, gdh, mutS and tkt4. Two alternative combinations of genes have been proposed for Lactobacillus casei: ftsZ, polA, mutL, metRS, nrdD and pgm or fusA, ileS, lepA, leuS, pyrG, recA and recG (http://www.pasteur.fr). A fourth gene set (gdh, gyrA, mapA, nox, pgmA and pta) has recently been described for Lactobacillus sanfranciscensis, but since this species is not represented in our dataset, this scheme was not used. For each genus, after concatenation of the gene fragments, a maximum likelihood phylogenetic tree was constructed.
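The concatenation step can be sketched as follows in Python. The example assumes that the fragments of each locus have already been extracted and aligned to equal length across genomes; the toy sequences and function names are hypothetical. The simple proportion of differing sites computed at the end is only a sanity check on the concatenated alignment and is not a substitute for the maximum likelihood tree used in the analysis.

```python
def concatenate_loci(fragments, loci):
    """Join aligned locus fragments per genome in a fixed locus order.

    fragments: dict mapping genome -> dict mapping locus -> aligned sequence
    loci:      list of locus names defining the concatenation order
    """
    concatenated = {}
    for genome, by_locus in fragments.items():
        missing = [locus for locus in loci if locus not in by_locus]
        if missing:
            raise KeyError(f"{genome} lacks loci: {missing}")
        concatenated[genome] = "".join(by_locus[locus] for locus in loci)
    return concatenated


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping positions with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


# Toy two-locus example (real schemes use six or seven loci).
loci = ["gdh", "pstS"]
fragments = {
    "strain1": {"gdh": "ACGT", "pstS": "GGA"},
    "strain2": {"gdh": "ACGA", "pstS": "G-A"},
}
joined = concatenate_loci(fragments, loci)
print(joined)
print(p_distance(joined["strain1"], joined["strain2"]))
```

Keeping the locus order fixed for all genomes matters, since any tree-building method expects the same column to represent the same site in every sequence.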
Analysis of Variable Gene Content
The variable gene content of the analyzed genomes was compared using the method by Snipen and Ussery. This method calculates Manhattan distances based on a matrix in which the presence or absence of each gene in each genome is scored with the binary score of 0 (absent) or 1 (present). Core genes and singletons are ignored. BLAST Atlases were produced according to Hallin and co-workers.
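A minimal Python sketch of this distance calculation is given below. It assumes each genome is represented as a set of gene family identifiers, drops families present in all genomes (core) or in only one genome (singletons), and sums the absolute differences of the remaining 0/1 scores. The function names are hypothetical, and any weighting or clustering steps of the published method are not reproduced here.

```python
def variable_families(presence):
    """Families found in at least two genomes but not in all of them.

    presence: dict mapping genome -> set of gene family ids
    """
    n_genomes = len(presence)
    counts = {}
    for families in presence.values():
        for family in families:
            counts[family] = counts.get(family, 0) + 1
    return sorted(f for f, c in counts.items() if 2 <= c < n_genomes)


def manhattan_distances(presence):
    """Pairwise Manhattan distances over binary presence/absence vectors."""
    families = variable_families(presence)
    vectors = {genome: [1 if f in found else 0 for f in families]
               for genome, found in presence.items()}
    names = sorted(vectors)
    return {(a, b): sum(abs(x - y) for x, y in zip(vectors[a], vectors[b]))
            for i, a in enumerate(names) for b in names[i + 1:]}


toy = {
    "A": {"core", "x", "y", "onlyA"},
    "B": {"core", "x", "z"},
    "C": {"core", "y", "z"},
    "D": {"core", "x", "y"},
}
print(variable_families(toy))
print(manhattan_distances(toy))
```

For binary vectors, the Manhattan distance is simply the number of variable families present in one genome but not the other, so genomes A and D in the toy data have distance 0 despite A carrying an extra singleton.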
COG is a database of proteins in which each sequence is assigned to a group. All proteins within a group are believed to have a common ancestor and are likely to share a common function. The groups are in turn clustered into super-groups called functional groups. In this analysis, each protein found was compared to the COG database using BLASTP to identify the functional group to which it belongs. An R script was used to analyze the protein composition in pan- and core genomes, and the results were visualized in a pie chart. This was done using standard operating procedures.
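The tallying step can be illustrated with a short Python sketch (the analysis itself used an R script, which is not reproduced here). It assumes each gene family has already been assigned a single one-letter COG class, groups the letters into the three top-level categories of the classic COG scheme, and reports percentages among functionally annotated genes only, so the poorly characterized classes R and S are left out. The function name and input format are hypothetical.

```python
from collections import Counter

# Top-level grouping of one-letter COG classes (classic COG scheme).
TOP_LEVEL = {
    "information storage and processing": set("JAKLB"),
    "cellular processes and signalling": set("DYVTMNZWUO"),
    "metabolism": set("CGEFHIPQ"),
}


def top_level_fractions(cog_letters):
    """Percentage of each top-level category among functionally annotated genes.

    cog_letters: iterable of one-letter COG classes, one per gene family.
    Letters outside the three categories (e.g. R, S) are ignored.
    """
    counts = Counter()
    for letter in cog_letters:
        for category, members in TOP_LEVEL.items():
            if letter in members:
                counts[category] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * counts[category] / total for category in TOP_LEVEL}


toy_core = ["J", "J", "L", "O", "E", "F", "G", "C", "S", "R"]
for category, pct in top_level_fractions(toy_core).items():
    print(f"{category}: {pct:.1f}%")
```

Running the same function on the pan-genome and on the core genome of a genus gives two distributions whose differences indicate which categories are enriched among conserved genes.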
Results
Comparison of Pan-Genomes
Average findings per genus and their pan- and core genome (table columns: number of genomes included; number of species; average genome size (kbp); average % CG; average number of genes (min–max values); average number of gene families (min–max values))
When a BLAST Matrix was constructed with all genomes included in the analysis, the similarity between Bifidobacterium genomes and those of the other genera remained below 3%, illustrative of the difference of Bifidobacterium compared to the Firmicutes (results not shown). Thus, despite their sharing of an ecological niche, these bacteria share relatively few genes. A comparison of all Firmicute genomes is provided as Supplementary Fig. S1. As expected, the similarity found within any of these genera is much higher than that between genera. For instance, the three Leuconostoc genomes produced a similarity of 49.5–52.3% between them, but around 8% to 10% to genomes of other genera. The four Lc. lactis genomes gave slightly higher similarities of 16.1–18.4% to all other Firmicute genomes whilst sharing 59.5–66.1% between themselves. An Enterococcus and a Streptococcus genome typically share 10% to 15% of their genes, and an Enterococcus and a Lactococcus genome 14% to 16%. Different Enterococcus species share around 30% of their genes, but multiple genomes within one species of this genus share around 70% of their genes.
Comparison of Core Genomes and Conserved Genes
A more commonly used procedure is to compare only a small subset of core genes. In population biology, MLST of six or seven core gene fragments is frequently used to assess evolutionary distances between isolates within a species. MLST analysis is based on DNA sequences. We adapted this approach to perform in silico MLST for all isolates within a genus, as a measure for evolutionary distance of core genes, and used this for analysis of three genera. Unfortunately, despite the reputation of MLST as being generally applicable and despite a considerable number of gene families being conserved even between Firmicutes and Bifidobacteria (63 gene families), different MLST target gene sets have been proposed for various species, and most of these are not conserved between all species (Supplementary Table S1). In order to compare our findings with published data, we have used fragments of various genes depending on the genus, as suggested in the literature.
The MLST website (http://www.mlst.net) lists two different gene sets to be used for Enterococci. Figure 5 (right side) shows the results obtained with each. Both trees produce little resolution within the species, especially when compared with the consensus tree based on 243 core genes in the previous figure.
For Lactobacilli, four MLST schemes are available: one for Lb. plantarum, two for Lb. casei (one of which is available at http://www.pasteur.fr) and one for Lb. sanfranciscensis, which is not represented in our dataset. The first three MLST schemes were tested, which produced different trees (Supplementary Fig. S2). All three trees clustered multiple strains per species, but the branch positions of these species varied according to the gene set used. It cannot be stated which MLST tree is ‘correct’, as they all display the evolutionary relationship of the genes analyzed, but obviously, the phylogeny of core genes is not always conserved within a genome, as it is affected by recombination. This is also visible from the numbers of core genes producing consensus branches in Fig. 4. With this variation in mind, an MLST tree should be interpreted with caution, as it represents only a tiny fraction of the complete core genome of a strain.
Comparison of Variable Gene Content
The pan-genome of a species or genus comprises both conserved core and variable genes. The latter can also be used to establish inter-genome relationships, although not by phylogeny. Instead, clustering of presence or absence of variable genes can be performed. This method calculates Manhattan distances for genes variably present. Core genes and genes found in only one genome were excluded from this analysis, as they cannot identify any correlation between genomes. Thus, only genes whose presence varies, found in at least two genomes but absent in at least one genome, are assessed. The resulting clustering is not a phylogenetic tree, since it is not based on phylogeny of individual genes. Instead, it shows which genomes share more of their variable genes than others.
The analysis of variable gene content can simultaneously be performed with genomes of varying similarity, so that Fig. 6b combines all Firmicute genomes. The 21 Lactobacillus genomes are split into two major groups, which match a deep branch in the phylogenetic tree of 16S rRNA genes of this genus. However, the clustering based on variable gene content produces a different picture to the consensus tree based on core genes (compare Figs. 4 and 6b). This probably reflects different evolutionary forces at play. Genes whose presence is variable may be located on mobile elements or may be more frequently subjected to DNA recombination than core genes. The three Leuconostoc genomes are placed within the Lactobacillus genus; apparently, these share a considerable number of variable genes.
The three major clusters within the Streptococcus genus visible in Fig. 6 largely match their taxonomic relationship as defined by 16S rRNA, although the distance between S. thermophilus and Streptococcus infantarius, which are both part of the ‘Salivarius group Streptococci’, is better captured by variable gene content than by 16S rRNA phylogeny. The discrepancy between this clustering and the consensus core gene tree is even more extensive for this genus.
The four Lc. lactis genomes are placed between Streptococcus and Enterococcus, which recalls their inclusion, prior to the 1980s, in the single genus Streptococcus. Within the genus Enterococcus, the clustering in Fig. 6 separates each of the analyzed species and confirms that Enterococcus casseliflavus and Enterococcus gallinarum are more closely related to E. faecium than to E. faecalis.
Visualization of Conserved and Variable Gene Content
A BLAST Atlas of Streptococcus genomes with S. thermophilus LMD-9 as the reference is provided as Supplementary Fig. S3. Two non-pathogenic E. faecalis genomes were included as well, since these are normal human flora strains and could be considered to share a similar niche to S. thermophilus, at least when colonizing the human gut. There is considerable variation in protein-coding genes between the three S. thermophilus genomes, and as expected, there is even less conservation in other species of Streptococcus or in the two E. faecalis genomes. Apparently, similarity in bacterial lifestyle is not necessarily reflected in significant homology in gene content.
COG Comparison of Pan- and Core Genomes
Relative fractions of COG groups within the functionally annotated genes for the six genera (table; recoverable column heading: Cellular process, signalling)
Figure 10 also shows the plots for Lactococcus (middle) and Leuconostoc (bottom). Although these last two genera are represented by only four and three genomes, respectively, all pan-genomes look surprisingly similar. However, when concentrating on the functionally annotated genes only (Table 3), some differences become apparent. The core genes of Lactococcus and Leuconostoc display a distribution of the three major COG classes similar to that of Bifidobacterium (which is taxonomically removed) and different from that of the core genome of Lactobacillus, to which they are much more closely related. Note that, in their pan-genomes, these three COG groups are similarly divided in Bifidobacterium and Lactobacillus. The shifts observed between pan-genome and core genome within a genus are contrasting between Lactobacillus and Lactococcus, whereas there is hardly a shift for Leuconostoc. From Fig. 10, it can be seen that, in the pan-genome of Lactococcus, class L genes make up a relatively large proportion. Within the metabolic gene classes, for Lactobacillus, a strong enrichment of nucleotide metabolism genes (class F) is observed in the core genes, whereas genes related to amino acid metabolism (class E) are more favoured in the core genome of Lactococcus. A significant increase in the core genes of COG class O (post-translational modification and chaperones) is observed for all analyzed genera. This could be an indication of the importance of such genes in the natural habitat of these gut bacteria.
The COG distribution plots for the pan-genome genes and the core genes of Enterococcus and Streptococcus are provided as Supplementary Fig. S4; the percentages of the three functionally classified COG top levels are included in Table 3. In contrast to the above examples, these two genera contain both pathogenic and non-pathogenic isolates. As in the previous examples, the large fraction of genes with unknown function is minimized in the core genome, but for both genera, metabolism genes are neither over- nor underrepresented in the core genome. As before, a strong conservation of genes of COG class J (translation, ribosomal structure and biogenesis) was observed. Carbohydrate transport and metabolism genes (class G) were more frequently found in the Enterococcus pan-genome than in the Streptococcus pan-genome, though this was less pronounced for their core genomes.
Relative fractions of COG groups within the functionally annotated genes for non-pathogens/pathogens. The arrows indicate how the reported percentages increase or decrease in the core genome compared to the pan genome. (Table; recoverable column heading: Cell. process, signalling)
Discussion
The comparative analysis presented here of 81 bacterial genomes, covering 6 genera and 43 different species, could be performed by grouping their genes into gene families and comparing core and pan-genomes of various subsets of genomes. The findings frequently confirmed taxonomic relationships but could not identify common signatures, in terms of gene content, for all non-pathogenic bacteria included in the analysis. This finding is surprising, as all these species occupy a similar niche. Conserved genes were compared by means of a consensus tree, while genes variably present were analyzed by cluster analysis. The latter indicated that Leuconostoc genomes share a considerable number of variable genes with Lactobacillus. Functional analysis of the proteins coded by the genes comprising a genus’ core genome identified the relatively strong conservation of information storage genes; this was observed for all genera analyzed. When all genomes were divided into a pathogenic and a non-pathogenic group, the pan-genomes of both groups showed a surprisingly similar COG distribution; however, their core genomes differed considerably. It was observed that, in the core genome of non-pathogenic genomes, genes for post-translational modification and chaperones were enriched.
A simultaneous comparison of the pan- and core genomes of publicly available genomes of Lactobacillus, Lactococcus, Leuconostoc, Enterococcus, Streptococcus and Bifidobacterium, as was performed here, has not been published before, but similar analyses have been published for smaller selections of organisms. Canchaya and co-workers performed comparative genomics of the then five available Lactobacillus genomes from different species and commented on the high variability within this genus. Schleifer and Ludwig stated that “It is widely recognized that the taxonomy of this genus is unsatisfactory due to the highly heterogeneous nature of its members”. Indeed, data presented here illustrate the diversity within Lactobacillus. However, the heterogeneity of this genus is not larger than that of other bacteria. Using the same comparison criteria as applied here, the pan-genome of 53 E. coli genomes was found to comprise 13,000 gene families, even within this single species. Similarly, an analysis of 27 genomes from 7 Vibrio species produced a pan-genome of nearly 15,000 gene families for this genus, and 38 genomes of 5 Burkholderia species contained as many as 26,000 gene families. Thus, the diversity in gene content within the genus Lactobacillus, based on the genome sequences currently available, is not exceptional in the bacterial world.
Our analyses are mainly based on core genomes, an approach that others followed as well. Those authors had defined a core genome for Lactobacillus whose size is similar to our findings. However, the fraction of identified orthologous genes in the pairwise comparisons performed by those authors ranges from 52.3% to 68.9%, which is much higher than our findings of between 12% and 42%, shown in the BLAST Matrix of Fig. 2. The difference may be due to the way these percentages were calculated. Whereas we express these as the fraction of gene families found in the reciprocal pan-genome of the pair of analyzed genomes, their calculations are different, and they do not state the cut-off used to recognize orthologous genes as such. In view of their limited reported range, we believe our way of expressing pairwise homology is more useful, as it gives a more sensitive measure. In a subsequent publication, comparative genomics was performed with a larger set of 12 Lactobacillus genomes. Inclusion of 7 more genomes reduced their core genome to 141 genes, which indicates that they used stricter criteria of inclusion than the 50–50 rule we applied. Similar to our analysis, these authors compared the COG classes of the core genes they had identified, and they also reported the largest class represented to be genes involved in translation, followed by replication.
Comparative genomics of both Lactobacillus and Bifidobacterium was presented in a review, which mentioned the ability of Bifidobacterium to “synthesize at least 19 amino acids and (…) all of the enzymes that are needed for the biosynthesis of pyrimidine and purine nucleotides”. These authors further emphasized the importance of carbohydrate metabolism for Bifidobacterium with its capability to degrade complex sugars. Indeed, top-level metabolism genes form a major part of the Bifidobacterium core genome (Fig. 9) with class E (amino acid metabolism) as the largest fraction within that category. When we compare this core genome with that of Lactobacillus (Fig. 10), our analysis shows that class F genes (nucleotide metabolism) comprise the largest metabolism gene fraction in the Lactobacillus core genome. Ventura and co-workers used a known physiological characteristic (Bifidobacterium species are known for their prototrophy) and looked for evidence of this in the genomes. In contrast, we have done a neutral analysis of pan- and core genome COG class representation and then compared this between genera. Our approach reveals novel insights that would remain unnoticed when known phenotypes are taken as a start, for instance the conservation of COG class O genes, involved in post-translational modification and chaperones, in both of these genera.
The authors of a recent review on Bifidobacterium genomics pointed out that most Bifidobacterium genomes have been sequenced from organisms that have a long history of culture outside their natural habitat, the gut, with the exception of B. longum DJO10A. There is good evidence that the genome of Bifidobacterium is subject to gene reduction to adapt to prolonged culture conditions. This could potentially bias our comparative analysis of Bifidobacterium genomes with that of the other probiotic organisms.
The term ‘lactic acid bacteria’ is commonly used to describe bacteria used as starter cultures and in fermentation of foodstuffs. LAB can refer to species from the genera Lactobacillus, Lactococcus, Leuconostoc, Streptococcus, Enterococcus, Pediococcus or all of the Lactobacillales, and sometimes includes Bifidobacterium as well. However, there are good reasons why these bacteria have been placed into different genera and phyla. The analyses presented here support their current taxonomic positions and stress their differences in gene content. The term LAB incorrectly suggests all these organisms are somehow related, a view that is still being presented in the literature. The use of the term LAB is misleading, as the genetic content of these various genera differs significantly. Moreover, some of the genera within LAB comprise only non-pathogenic species (Leuconostoc, Bifidobacterium, Lactobacillus), whereas other genera are a mixture of pathogenic and non-pathogenic species and strains (Streptococcus, Enterococcus). It would be better to refrain from the term LAB, as there is no common denominator, other than the production of lactic acid (which is not restricted to these organisms), to collectively describe all species and strains supposedly included in this diverse group of organisms.
An extensive comparative study of Enterococcus genomes could not be identified from the literature. Most studies concentrate on pathogenicity of E. faecalis. Vebø and co-workers compared probiotic and (uro-)pathogenic E. faecalis genomes; however, those comparisons were not based on sequence data. The Enterococcus genomes we have included were mostly from pathogenic organisms (only two non-pathogenic E. faecalis strains whose sequences were nearing completion were publicly available at the time of analysis), which limits the strength of this analysis, as it cannot be used to compare and contrast multiple non-pathogenic with pathogenic Enterococcus genomes. The 11 genomes included represent only 4 species, giving a pan-genome of nearly 8,000 gene families. The first four species of Lactobacillus or Streptococcus genomes in the pan-genome plots of Fig. 1 produce smaller pan-genomes, which could suggest that the diversity of Enterococcus could be at least as extensive as that of Lactobacillus. The pairwise BLAST comparison within this genus resulted in homologues varying from 24% to 84%, again indicating extensive intra-genus diversity.
Streptococcus and Enterococcus are frequently considered to be closely related, but the BLAST Matrix comparing all included genomes (Supplementary Fig. S1) does not support this. Instead, somewhat surprisingly, the observed homology between Leuconostoc and Streptococcus genomes is slightly higher than that between Streptococcus and Enterococcus. On the other hand, Lc. lactis was positioned in between these two genera in the tree based on variable gene content. A shared gene pool between these genera can be hypothesized. Based on the conserved core genes, however, Enterococcus is more closely related to Streptococcus, while Lactococcus is more distinct.
A small comparative study of Streptococcus genomes combined with MLST suggested that S. thermophilus is a relatively young clone, evolved by genome reduction which removed or inactivated Streptococcus virulence genes. It is possible, however, that the reduced genomes observed are the result of prolonged use as starter cultures, as no fresh human isolates have been sequenced to date. In a short review, Delorme states that “S. thermophilus is related to Lactococcus lactis…”. Indeed, from the all-against-all BLAST Matrix, a similarity between 17.3% and 20.2% is recorded between genomes of these two species, which is higher than that between S. thermophilus and any other non-streptococcal genome. However, Lc. lactis also shares 16.0% to 18.0% of reciprocal genes with S. suis, so these overlapping percentages of gene similarity are no indicator of similarity in (probiotic) phenotype. Within the Streptococcus genus, the stated similarity of S. thermophilus with Streptococcus sanguinis (the only member of the viridans group for which a genome sequence is available) is confirmed in our Matrix, but an even higher similarity is found with Streptococcus gordonii.
The COG analysis of the core genomes of separate genera identified both similarities and differences. The three top-level functional COG groups are relatively equally divided over the functionally annotated pan-genes of all species, but their core genomes differ. Notably, Lactobacillus and Leuconostoc both have a smaller fraction of metabolism core genes than the other four genera and a larger information storage gene fraction. Information storage genes are essential, but redundancy allows so much variation between organisms that they are not all maintained in a core genome of diverse species. In the approach presented here, we first identified the core genomes of groups of bacteria and then sorted the genes in these core genomes for top-level COG categories. As a consequence, genes that were insufficiently conserved based on sequence similarity to be maintained in the core genome are removed despite their possible functional conservation. Using this approach, we found no correlation between the diversity within a genus (using the difference of their pan- and core genome as a measure) and the fraction of their information/storage COG genes. This lack of correlation is illustrated by the core genome of Bifidobacterium (724, or 10% of its pan-genome) and Leuconostoc (1,164, or 40% of its pan-genome). These two core genomes contain 34% and 31% information/storage genes, respectively, despite a huge difference in the degree of variation in these two genera.
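The order of operations described above (first derive the core genome, then sort its genes into top-level COG groups) can be sketched as follows. This is a minimal illustration under stated assumptions: genomes are represented as sets of gene family identifiers, the mapping from family to top-level COG group is supplied as a plain dictionary, and all names and toy values are hypothetical rather than taken from the study.

```python
from collections import Counter


def pan_and_core(genomes):
    """Return (pan-genome, core genome) as sets of gene family identifiers.

    The pan-genome is the union of all families; the core genome is the
    set of families present in every genome.
    """
    family_sets = list(genomes.values())
    if not family_sets:
        return set(), set()
    pan = set().union(*family_sets)
    core = set.intersection(*family_sets)
    return pan, core


def cog_group_fractions(families, family_to_group):
    """Fraction of functionally annotated families per top-level COG group."""
    counts = Counter(
        family_to_group[f] for f in families if f in family_to_group
    )
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()} if total else {}


# Hypothetical toy data for three genomes of one genus.
toy_genus = {
    "g1": {"famA", "famB", "famC", "famD"},
    "g2": {"famA", "famB", "famC", "famE"},
    "g3": {"famA", "famB", "famF"},
}
toy_groups = {
    "famA": "information storage and processing",
    "famB": "metabolism",
    "famC": "metabolism",
    "famD": "cellular processes",
}

pan, core = pan_and_core(toy_genus)
core_share = len(core) / len(pan)          # core size as a fraction of the pan-genome
core_fractions = cog_group_fractions(core, toy_groups)
pan_fractions = cog_group_fractions(pan, toy_groups)
print(len(pan), len(core), round(core_share, 2), core_fractions)
```

Because classification happens after the intersection, a family that is functionally conserved but too divergent in sequence to cluster across all genomes never reaches the COG step. This is the effect noted in the text.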
Of particular interest is the COG analysis where all genomes were divided into a pathogenic and a non-pathogenic group. Virulence genes are not a separate COG category, but from the comparison of the core genomes of the pathogenic group with that of the non-pathogenic group, we can hypothesize that genes belonging to COG categories M (cell wall/membrane biosynthesis) and O (post-translational modification, chaperones) would mostly contribute to virulence. Conversely, it could be assumed that genes highly overrepresented in the core genome of the non-pathogenic group (compared to the core genome of the pathogenic group) most likely contribute to their probiotic or fermentative lifestyle. We observe enrichment for genes belonging to COG class J (translation, ribosomal structure and biogenesis) and again O (post-translational modification and chaperones). The finding that core genes of the non-pathogenic isolates are more frequently information storage genes and less frequently metabolic genes than the core genes of pathogens is counter-intuitive. It is generally accepted that commensals and probiotic strains are most adequately equipped to live in the intestine, which would imply that they share a large number of (conserved) metabolic genes to do so. Instead, the reduced metabolism gene fraction in their core genome suggests that there is a large variation within these genes, which reflects the diversity of the various commensal, fermentative and probiotic isolates. The vast enrichment for information/storage genes in the core genome of the non-pathogenic organisms is possibly a reflection of the relatively poor conservation of all other functional classes in this group, an effect that appears to be less pronounced in the (ecologically more diverse) pathogenic group. The fact that Bifidobacterium is not represented in the pathogenic group may have skewed these results slightly.
A more accurate prediction of conserved genes with an important role in bacteria with a non-pathogenic lifestyle may become possible in the future, when more non-pathogenic Enterococcus genomes become available, which would allow comparison of gene content within a genus or even a species.
This study illustrates the value of comparative genomics of multiple genomes within and between related species and genera. The tools applied are relatively simple, yet they can analyze a vast number of genes, and the results can support or contradict existing hypotheses and taxonomic divisions, as well as generate novel hypotheses. We believe the data presented here can assist in understanding the commensal and probiotic relationship of bacteria with their human host. The work presented here demonstrates that the analyses used can be applied to large numbers of genomes when searching for general mechanisms or trends, even across genera. The presented analyses can be taken as a test case for the comparison of multiple genomes from a highly variable dataset.
The authors are grateful to all research groups that have submitted their genome sequences to public databases, without which this analysis would not have been possible. TMW acknowledges the support provided by the Safety and Environmental Assurance Centre at Unilever for part of this work. OL and DWU received support from the Center for Genomic Epidemiology at the Technical University of Denmark; part of this work was funded by grant 09-067103/DSF from the Danish Council for Strategic Research.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.