Genome and pan-genome analysis to classify emerging bacteria
- 3.9k Downloads
In the recent years, genomic and pan-genomic studies have become increasingly important. Culturomics allows to study human microbiota through the use of different culture conditions, coupled with a method of rapid identification by MALDI-TOF, or 16S rRNA. Bacterial taxonomy is undergoing many changes as a consequence. With the help of pan-genomic analyses, species can be redefined, and new species definitions generated.
Genomics, coupled with culturomics, has led to the discovery of many novel bacterial species or genera, including Akkermansia muciniphila and Microvirga massiliensis. Using the genome to define species has been applied within the genus Klebsiella. A discontinuity or an abrupt break in the core/pan-genome ratio can uncover novel species.
Applying genomic and pan-genomic analyses to the reclassification of other bacterial species or genera will be important in the future of medical microbiology. The pan-genome is one of many new innovative tools in bacterial taxonomy.
This article was reviewed by William Martin, Eric Bapteste and James Mcinerney.
Open peer review
Reviewed by William Martin, Eric Bapteste and James Mcinerney.
KeywordsGenomic Novel species Pan-genome Taxonomy
Average nucleotide identity
Antibiotic resistance gene
Blast score ratio
Cluster of orthologous group
Genome-to-genome distance calculator
Genomes online database
Internal transcribed spacer
Kyoto encyclopedia of genes and genomes
List of prokaryotic names with standing in nomenclature
Matrix-assisted laser desorption/ionization time-of-flight
Multilocus sequence analysis
Single nucleotide polymorphism
The study of digestive bacterial ecosystems was initially explored by microbial culture in the 1970s [1, 2, 3, 4]. The birth of genomics, followed by the development of Next Generation Sequencing (NGS) methods in 2004, made it possible to discover the uncultivable, such as Akkermansia muciniphila  thanks to metagenomics. Since the emergence of metagenomics, microbial culture has gradually been replaced by molecular tools for complex microbiota study . In 2015, a new approach called “culturomics” was developed and intensive culture assays were carried out to select the 18 best culture conditions to cultivate the largest number of isolates . Culturomics has also allowed the identification of a large number of prokaryotic species, as it allows the simultaneous combination of different culture conditions, using 16S rRNA gene amplification and matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) . As a result, by increasing the number of species of bacteria being discovered, a new polyphasic method for bacterial species description has emerged; the “taxonogenomics” . In order to describe a new bacterium, taxonogenomics complements classic features with the description of the whole genome sequence with the proteomic information obtained by MALDI-TOF MS and has permitted to describe, for example, Microvirga massiliensis .
NGS application in genomics and metagenomics
This technical revolution offers new fields of application, such as genomics (whole-genome sequencing, WGS), single-cell sequencing (SCS) and metagenomics. In particular, complete genome sequencing allows a better understanding of the genetic basis of phenotypic variability, but also to analyze biodiversity and estimate the diversity of single nucleotide polymorphism (SNP) . Single-cell genomics allows the sequencing of a single isolated cell by capturing and amplifying its DNA. Single-cell sequencing is able to produce partial  and complete  genomes. SCS DNA allows the identification and assembly of genomes from uncultivated microorganisms [40, 41, 42, 43] but it is estimated that only 1% of bacterial species have been cultivated in laboratory . Nevertheless, SCS has its own limits along with inherent biases [40, 43], especially during the amplification step, including false-positive and false-negative errors, allelic dropout events and coverage non-uniformity . However, single-cell genomics and metagenomics are two complementary approaches to analyze bacterial communities . The sequencing of metagenomes has made it possible to inventory the diversity of microbial ecosystems  and to determine the interactions between microorganisms in ecosystems . This technique, which allows DNA sequencing of bacteria present in specific environments, offers many advantages. Indeed, it opens a window on a world of great wealth that was hitherto totally unknown.
Assembly, finishing and annotations
Organism sequencing has recently become accessible in many laboratories where specific consortium sequencing projects are proposed, and currently the number of projects is exponentially growing. However, NGS technologies have higher error rates (~ 0.1–15%) and smaller read lengths (35–1000 bp) than those obtained from Sanger sequencing platforms . After sequencing, the data produced (reads) are computationally reconstructed into longer continuous sequences (contigs), a step called assembly . This process consists of an overlap of reads aimed at reconstructing the initial sequence of the genome. It’s called de novo when no reference is available. The assembly with Sanger data is based on two-by-two comparison of reads, looking for overlapping sequences of minimal length with associated identity percentages (CAP assembler, for example ). The change of scale due to the huge volume of data, the short-read lengths and the non-uniform confidence in base calling, excluded this assembly strategy. The most commonly used approach for the assembly of short-read data is based on treating k-mers. All k-mers length substrings of all sequencing reads appear as nodes in a graph. The nodes are joined by edges if the nodes share a k-1 length substring. This graph is known as the de Bruijn graph . Short read de Bruijn graph assemblers determine optimal paths through this graph to form the contigs of an assembly. Each contig ends when there is no outgoing edge from a visited node, or when the branching in the graph becomes too complex to be resolved. Frequently, the raw de Bruijn graph is reduced by collapsing any linear chains of nodes and edges into a single metanode. An assembly quality can then be assessed using a set of metrics. The usual ones are the total count of contigs, scaffolds, their total length, N50 (length of the smallest contig in the set of contigs that represent at least 50% of the genome), and their average length [53, 54]. Bilen et al., removed scaffolds less than 800 bp in size or less than 25% of the median depth (identified as possible contaminants) . A good metric is also the proportion of reads mapped back, or not, to the contigs . The assembly quality is an important criterion for ulterior analysis such as rate of lateral transfer or annotation . The next important step required is genome annotation. To make genomic data valuable, a reliable and correct annotation is essential . It is used to identify, locate and distinguish gene function using similarities when studying protein databases by BLAST , and can provide a basis for many genome analyses . Organisms whose genome is now fully sequenced have revealed that nearly 40% of the genes identified have no assigned function; either because they do not resemble any known genes, or because (for half of them) they resemble other genes with unknown function . The first step in annotation is to predict gene function, which is usually done individually for each gene using computational tools. However, the identification of gene function requires the combination of several complementary experimental approaches, whether by computer (in silico analysis), biochemical, or genetic (in vivo and in vitro analysis). The second step in annotation is to identify relationships between genes, proteins and regulatory elements. These relationships can be of very varied nature: physical interactions between proteins/DNA, proteins/RNA and proteins/proteins, networks regulating gene expression, metabolic pathways or others.
Example of a mapping study: Akkermansia muciniphila
Powerful approaches based on mapping short-read sequences to a reference genome are used to analyze WGS data from closely related isolates [5, 61, 62, 63, 64]. Several bioinformatics tools are used in a mapping study, such as the CLC genomics Workbench (CLC bio, Aarhus, Denmark).
By using metagenomics data from a human stool sample, the genome of Akkermansia muciniphila was successfully assembled after mapping the short-read sequences . The stool sample was from a patient admitted to the intensive care unit and treated with a 10-day course of imipenem . High-level colonization by the Verrucomicrobia phylum was reported in this patient. Indeed, the reads from pyrosequencing are classified in the phylum range and up to 84% of these reads belong to Verrucomicrobia phyla. Reads were generated from a SOLiD sequencer, whereas short-reads shotgun and paired-end runs were generated on a 454 sequencer. Both technologies generated 1.4 G bases of metagenomic sequence data from the sample. The several mapping sequences from SOLiD and 454 data against the A. muciniphila type strain ATCC BAA-835 allowed to obtain the genome of A. muciniphila strain Urmite with 1 scaffold and 58 contigs.
The presence of a range of putative antibiotic resistance genes (ARGs) from different antibiotic classes has been demonstrated in a recent study of A. muciniphila strain Urmite . The putative ARG were beta-lactamase, macrolides, vancomycin, chloramphenicol, sulfonamide, tetracycline and trimethoprim.
The short-read sequence mapping approach of a reference genome has been successful in assembling a genome directly from a human stool sample. However, this approach remains limited since very divergent sequences, not present in the reference, could not be detected . Indeed, if the draft genome obtain contains highly divergent additional genes with respect to the reference genome, we may lose this information and not be able to completely reconstruct the original genetic content. Consequently, mapping method cannot be used on very divergent genomes . These limitations will be progressively lessened by the exponential growth of data and subsequent genomes available in generalist databases.
Finding novel species
Novel species identification
- Culturomics (MALDI-TOF-MS and 16S rRNA)
In recent years, with the introduction of a rapid and inexpensive identification method using MALDI-TOF mass spectrometry, the volume of microbial culture has considerably increased, which now allows to more readily detect pathogenic bacterial species. This technique of reference for bacterial identification in clinical microbiology laboratories has also developed a new concept to study human microbiota also called “microbial culturomics” [7, 66]. Culturomics was developed in a study of a complex microbiota and allows a large number of isolates to be grown through the selection of the 18 best culture conditions [7, 9]. This technique is based on the diversification of culture conditions by varying the time and temperature of incubation but also culture medium composition and atmosphere . In a preliminary work, this approach allowed the cultivation of 340 bacterial species, including 31 novel species, as well as species belonging to rare phyla (Synergistetes and Deinococcus-Thermus), using 212 different cultivation conditions . A detailed analysis made it possible to select successively the 70 then the 18 most appropriate culture conditions in order to explore the greatest possible diversity for each sample. Another work has permitted to cultivated more than 50% of the known species of the human digestive tract, including 247 novel species [7, 8]. Novel species are identified by MALDI-TOF, and 16S rRNA gene sequencing for non-identified spectra in MALDI-TOF.
- other methods
Novel species were also identified with multilocus sequence analysis (MLSA) of several concatenated housekeeping gene sequences (rrs, recA, gyrB, dnaK, glnII and rpoD) [67, 68]. The housekeeping genes are usually involved in the expression and maintenance of genetic information at the transcription or translation level. The recA gene is essential for the maintenance and repair of DNA and is a good resolutive tool for predicting lineage and genus among rhizobial strains . Phylogenetic analysis to discover new species also uses DNA sequences of the internal transcribed spacer 2 (ITS2) region .
The nucleotide sequence or the peptide sequence can also be used. In general, the peptide sequence is preferred, since it can avoid some biases inherent to the G + C content of the organism studied, as well as the degeneration on the third nucleotide of the codon .
Taxonogenomics strategy and the example of a novel species study: Microvirga massiliensis
This new polyphasic strategy, called “taxonogenomics”, systematically combines phenotypic and genomic criteria [9, 72]. Using this strategy, 15 novel bacteria have officially been considered as new species and/or new genera in official validation lists No. 153 and No. 155 by the International Taxonomy Committee of the International Journal of Systematic and Evolutionary Microbiology [66, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83]. Lagier et al., identified a novel bacterial species, Microvirga massiliensis, from a human stool sample using culturomics and metagenomics approaches . Recently, the description of the genome of Microvirga massiliensis sp. nov. strain JC119T was realised via the taxonogenomics approach . The draft genome of this bacterium is 9,207,211 bp long, which is the largest bacterial genome of a human isolate. Of the 8762 predicted genes, 8685 were protein-coding genes, 77 were rRNA genes, including 21 rRNA genes, and the genome exhibits a G + C content of 63.28%.
The bacterial pan-genome
History of taxonomy and the concept of bacterial species
Since the early 1990s, because of advances in molecular biology, bacterial classification has been in perpetual upheaval. Taxonomic information is essential regarding the identification, nomenclature and classification of microbial strains  and to better understanding the biodiversity and relationships among living microorganisms .
Currently, prokaryote taxonomy relies on polyphasic combinations of phenotypic properties (pathogenesis, morphology, environmental and culture conditions), chemotaxonomic properties (chemical composition of cellular components) and genotypic properties [11, 90], including DNA-DNA hybridization (DDH), DNA G + C content  and 16S rRNA sequence similarity [12, 92] The application of molecular hybridization methods provides a genomic definition of the bacterial species, taking into account the similarity rate and the thermal stability of the hybrids obtained by the DNAs of two bacterial isolates. Isolates belonging to the same species are characterized by homologies of their DNA, which result in hybridization percentages greater than or equal to 70%, and stability of the hybrids formed below 5 °C [93, 94, 95]. This definition is still recognized by the international bacterial taxonomy committees.
The development of DNA sequencing has led to the determination of a threshold in order to define the species based on the similarity of gene sequences, initially based on the 16S rRNA gene. These data were later compared to those obtained by DDH [96, 97]. As for the classification of prokaryotes, the 16S rRNA gene is an effective molecular marker due to its functional stability, conservation and universal presence . However, for bacterial taxonomy, this gene has several limitations; notably, the presence of SNPs in the rRNA operon in a single genome [23, 99]; the use of a single gene that may not reflect the evolution history of the genome (~ 0.07% of a genome) [19, 100] or the high degree of conservation in a same genus, like Brucella or Rickettsia . The divergence of 1.3% accepted from two sequences, corresponding to 50 million years of divergence [17, 22] also brings a limitation such that the presence in multiple of the 16 s rRNA genes and sometimes variable copies [102, 103]. VanBerkum et al., showed that a small portion of the 16S rRNA gene sequence of Bradyrhizobium elkanii is originated from a Mesorhizobium spp. genome by lateral transfer [100, 104].
What happens when two species have a 98.6% similarity percentage? Are they two distinct species? Establishing a threshold is not biological and cannot be based essentially on this criterion, especially when different thresholds are used by different biologists.
With the advent of whole genomes sequencing, phylogeny has entered a new era: the era of phylogenomics. Many studies have already demonstrated the importance of genomes in bacterial taxonomy by suggesting a focus on the presence or absence of genes within genomes [105, 106, 107]; the gene content ; the presence of SNPs or indels in conserved genes ; the comparison of orthologous genes ; the study of metabolic pathways and chromosome gene order [111, 112]; or by sequence similarity at the genome level, estimated by parameters such as “digital DDH”, Average Nucleotide Identity (ANI) or AGIOS using the Genome-To-Genome Distance Calculator (GGDC), ANI calculator and in-lab pipeline named Marseille Average Genomic identity (MAGi) softwares [113, 114, 115, 116].
Pan-genome study for taxonomic purposes: The Klebsiella genus
The taxonomic classification of Klebsiella species has been the subject of a long controversy. Klebsiella species are part of the large Enterobacteriaceae family, which are Gram-negative bacteria. Originally, the Klebsiella genus was divided into pathovars linked to the diseases they caused: Klebsiella pneumoniae, Klebsiella ozaenae and Klebsiella rhinoscleromatis . With the development of new tools such as G + C content composition, DNA-DNA hybridization and 16S rRNA sequencing [11, 95], classification of Klebsiella species has been continuously revised [118, 119]. K. ozaenae and K. rhinoscleromatis were notably reclassified as K. pneumoniae subspecies [70, 120, 121].
This pan-genomic analysis enabled to conlude that K. pneumoniae subsp. ozaenae or K. pneumoniae subsp. rhinoscleromatis, which exhibit as many differences between them as with other Klebsiella species, may be distinct Klebsiella species  or, following Ereshefsky’s eliminative pluralism on species philosophy , distinct pan-genome-derived species.
Since the introduction of DNA sequencing by Sanger and Coulson in 1977, great progress has been made. A growing amount of data is being generated, requiring continuously advanced computer processing. Numerous studies illustrating different methods have been published in different fields, such as genome assembly and annotation, as well as research on new bacterial species and the taxonomic classification of bacteria.
Regarding genome analysis, a complete genome of Akkermansia muciniphila was obtained directly from human stool samples by an original approach. Based on taxonogenomics, the creation of Microvirga massiliensis sp. nov. containing the strain JC119T could be done.
Taxonomy traditionally operates with the principle of discontinuous variation. A break in the ratio of the core/pan-genome means that there is no transition from one species to another, leading to a definition of different species. This leap in the ratio represents a major difference between genomes. These irreconcilable differences cannot exist within a single species. The great discontinuity variation in the core/pan-genome ratio observed in Klebsiella species may help to redefine these species. Thus, we believe that pan-genome studies can help to better visualize gene content differences, inform species (re)definitions or classify species on their own according to discontinuous genomic content. However, this strategy will have to be empirically and systematically tested in the currently proposed genera.
Genomics challenges taxonomy. We are at the very beginning of the interpretation of the genome for taxonomic purposes.
This study was supported by Mediterranée Infection and the National Research Agency under the program “Investissements d’avenir”, reference ANR-10-IAHU-03.
Availability of data and materials
DR designed the research project. AC performed analysis and wrote the manuscript. PEF provided support. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 11.Staley JT. The bacterial species dilemma and the genomic-phylogenetic species concept. Philos Trans R Soc B Biol Sci. 2006;361:1899–909.Google Scholar
- 30.Mira A, Martín-Cuadrado AB, D’Auria G, Rodríguez-Valera F. The bacterial pan-genome:a new paradigm in microbiology. Int Microbiol Off J Span Soc Microbiol. 2010;13:45–57.Google Scholar
- 33.den Bakker HC, Cummings CA, Ferreira V, Vatta P, Orsi RH, Degoricija L, et al. Comparative genomics of the bacterial genus Listeria: genome evolution is characterized by limited gene acquisition and limited gene loss. BMC Genomics. 2010;11:688.Google Scholar
- 35.Diene SM, Merhej V, Henry M, Filali AE, Roux V, Robert C, et al. The Rhizome of the Multidrug-Resistant Enterobacter aerogenes Genome Reveals How New “Killer Bugs” Are Created because of a Sympatric Lifestyle. Mol Biol Evol. 2012;30:mss236.Google Scholar
- 52.Idury RM, Waterman MS. A new algorithm for DNA sequence assembly. J Comput Biol J Comput Mol Cell Biol. 1995;2:291–306.Google Scholar
- 57.Lugli GA, Milani C, Mancabelli L, van Sinderen D, Ventura M. MEGAnnotator: a user-friendly pipeline for microbial genomes assembly and annotation. FEMS Microbiol Lett. 2016;363.Google Scholar
- 60.Médigue C, Bocs S, Labarre L, Mathé C, Vallenet D. L’annotation in silico des séquences génomiques. médecine/sciences. 2002;18:237–50.Google Scholar
- 66.Lagier J-C, Armougom F, Million M, Hugon P, Pagnier I, Robert C, et al. Microbial culturomics: paradigm shift in the human gut microbiome study. Clin Microbiol Infect Off Publ Eur Soc Clin Microbiol Infect Dis. 2012;18:1185–93.Google Scholar
- 67.Msaddak A, Rejili M, Durán D, Rey L, Imperial J, Palacios JM, et al. Members of microvirga and Bradyrhizobium genera are native endosymbiotic bacteria nodulating Lupinus luteus in northern Tunisian soils. FEMS Microbiol Ecol. 2017;93.Google Scholar
- 68.Busquets A, Gomila M, Beiki F, Mulet M, Rahimian H, García-Valdés E, et al. Pseudomonas caspiana sp. nov., a citrus pathogen in the Pseudomonas syringae phylogenetic group. Syst Appl Microbiol. [cited 2017 Jun 13]; Available from: http://www.sciencedirect.com/science/article/pii/S0723202017300450
- 69.Vinuesa P, León-Barrios M, Silva C, Willems A, Jarabo-Lorenzo A, Pérez-Galdona R, et al. Bradyrhizobium canariense sp. nov., an acid-tolerant endosymbiont that nodulates endemic genistoid legumes (Papilionoideae: Genisteae) from the Canary Islands, along with Bradyrhizobium japonicum bv. Genistearum, Bradyrhizobium genospecies alpha and Bradyrhizobium genospecies beta. Int J Syst Evol Microbiol. 2005;55:569–75.PubMedGoogle Scholar
- 77.Oren A, Garrity GM. List of new names and new combinations previously effectively, but not validly, published. Int J Syst Evol Microbiol. 2014;66:1–5.Google Scholar
- 83.Pei J, Chu C, Li X, Lu B, Wu Y. CLADES: A classification-based machine learning method for species delimitation from population genetic data. Mol Ecol Resour. 2018;18(5):1144–56.Google Scholar
- 84.Paolozzi L, Liébart J-C, Sansonetti P. Microbiologie: biologie des procaryotes et de leurs virus. Paris: Dunod; 2015.Google Scholar
- 85.Skerman VBD, McGowan V, Sneath PHA. Approved lists of bacterial names. Int J Syst Evol Microbiol. 1980;30:225–420.Google Scholar
- 87.Le Minor L, Popoff MY. Designation of Salmonella enterica sp. nov., nom. Rev., as the type and only species of the Genus Salmonella: request for an opinion. Int J Syst Evol Microbiol. 1987;37:465–8.Google Scholar
- 92.Stackebrandt E, Ebers J. Taxonomic parameters revisited: tarnished gold standards. Microbiol Today. 2006;33:152–5.Google Scholar
- 93.Krawiec S. Concept of a bacterial species. Int J Syst Bacteriol. 1985;35:217–20.Google Scholar
- 95.Wayne LG, Brenner DJ, Colwell RR, Grimont PAD, Kandler O, Krichevsky MI, et al. Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. Int J Syst Bacteriol. 1987;37:463–4.Google Scholar
- 97.Stackebrandt E, Goebel BM. Taxonomic note: a place for DNA-DNA Reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol. 1994;44:846–9.Google Scholar
- 99.Rainey FA, Ward-Rainey NL, Janssen PH, Hippe H, Stackebrandt E. Clostridium paradoxum DSM 7308T contains multiple 16S rRNA genes with heterogeneous intervening sequences. Microbiol Read Engl. 1996;142(Pt 8):2087–95.Google Scholar
- 103.Marchandin H, Teyssier C, Siméon De Buochberg M, Jean-Pierre H, Carriere C, Jumas-Bilak E. Intra-chromosomal heterogeneity between the four 16S rRNA gene copies in the genus Veillonella: implications for phylogeny and taxonomy. Microbiol Read Engl. 2003;149:1493–501.Google Scholar
- 109.Gupta RS. The branching order and phylogenetic placement of species from completed bacterial genomes, based on conserved indels found in various proteins. Int Microbiol Off J Span Soc Microbiol. 2001;4:187–202.Google Scholar
- 110.Coenye T, Vandamme P. Extracting phylogenetic information from whole-genome sequencing projects: the lactic acid bacteria as a test case. Microbiol Read Engl. 2003;149:3507–17.Google Scholar
- 111.Huson DH, Steel M. Phylogenetic trees based on gene content. Bioinforma Oxf Engl. 2004;20:2044–9.Google Scholar
- 119.Brenner DJ, Farmer JJ, Hickman FW, Asbury MA, Steigerwalt AG. Taxonomic and Nomenclature Changes in Enterobacteriaceae. [cited 2017 May 31]. Available from: https://www.abebooks.fr/Taxonomic-Nomenclature-Enterobacteriaceae-Don-Brenner-Farmer/4158251114/bd
- 120.Klebsiella OI. Bergey’s man Syst Bacteriol; 1974.Google Scholar
- 121.Orskov I, Genus V. Klebsiella. Bergey’s manual of systematic bacteriology. 1984:461–5.Google Scholar
- 122.Rouli L, Merhej V, Fournier P-E, Raoult D. The bacterial pangenome as a new tool for analyzing pathogenic bacteria. New Microbes New Infect. [cited 2015 Jul 20]; Available from: http://www.sciencedirect.com/science/article/pii/S2052297515000529
- 123.Ereshefsky M. Eliminative Pluralism. Philos Sci. 1992;59:671–90.Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.