Introduction

Strain DSM 20147T is the type strain of a species in a subgroup of industrially relevant bacteria. It was originally isolated during a screening for L-glutamic acid producing microorganisms and was classified as a member of the genus Corynebacterium [1]. This genus comprises Gram-positive bacteria with a high G + C content. It currently contains 126 validly published members (species and subspecies), 4 of which are synonyms of other species within the genus, 27 of which were later reclassified as members of 7 other genera, and 1 of which was abolished in an erratum [2–11]. The remaining 93 were isolated from diverse sources such as soil, sea, or ripening cheese, but also from human clinical samples and animals.

Within this diverse genus, C. callunae has been found to be a producer of L-glutamic acid, like one of the most prominent representatives of the corynebacteria, C. glutamicum [1]. The biological context of this species is, unfortunately, essentially unknown, as it was first described in a patent application [1] that mentions neither the geographic location nor the exact habitat of the strain. Based on the name and the habitats of its close relatives C. glutamicum, C. deserti, and C. efficiens, the most likely habitat of C. callunae is soil around heather plants. While the biotechnological uses and capabilities of this subgroup within the genus Corynebacterium have been studied in detail, especially for C. glutamicum, the ability of all these strains to secrete considerable amounts of L-glutamic acid is still not well understood in the context of the environment.

C. callunae DSM 20147T harbors two cryptic plasmids: pCC1 (4,109 bp), which encodes a Rep protein that shows similarity to those of the corynebacterial plasmids pAG3 and pBL1, and pCC2 (85,023 bp), the Rep protein of which has possible orthologs in many other corynebacteria. Aside from this, DSM 20147T is an alkaline-tolerant bacterium, which grows well at pH 5.0–9.0 (optimum pH 6–8) [1]. Here we present a summary classification and a set of features for C. callunae DSM 20147T, together with a description of the genome sequencing and annotation.

Organism information

Classification and features

A representative genomic 16S rRNA sequence of C. callunae DSM 20147T was compared against the Ribosomal Database Project database [12], confirming the initial taxonomic classification. C. callunae shows the highest similarity to C. glutamicum and C. deserti (97% each).

Figure 1 shows the phylogenetic neighborhood of C. callunae in a 16S rRNA based tree. C. callunae forms a subgroup that also contains C. glutamicum ATCC 13032T, C. deserti GIMN1.010T, and C. efficiens YS-314T.

Figure 1

Phylogenetic tree highlighting the position of C. callunae relative to type strains of other species within the genus Corynebacterium. Species with at least one publicly available genome sequence (not necessarily the type strain) are highlighted in bold face. The tree is based on sequences aligned by the RDP aligner and uses the Jukes-Cantor corrected distance model to construct a distance matrix based on alignment model positions without alignment inserts, using a minimum comparable position of 200. The tree was built with RDP Tree Builder, which uses Weighbor [13] with an alphabet size of 4 and a length size of 1000. Tree building also involved a bootstrapping process repeated 100 times to generate a majority consensus tree [14]. Rhodococcus equi (X80614) was used as an outgroup.
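The Jukes-Cantor correction named in the caption can be illustrated with a minimal Python sketch. It is not the RDP implementation: the function name, the handling of gaps and ambiguous bases, and the `min_comparable` parameter are assumptions made for illustration (the caption states that a minimum of 200 comparable positions was used). The corrected distance is d = -3/4 ln(1 - 4p/3), where p is the fraction of mismatching positions.

```python
import math


def jukes_cantor_distance(seq_a: str, seq_b: str, min_comparable: int = 1) -> float:
    """Jukes-Cantor corrected distance between two aligned nucleotide sequences.

    Hypothetical helper for illustration only. Positions where either
    sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGTU" or b not in "ACGTU":
            continue  # gap or ambiguity code: not a comparable position
        compared += 1
        if a != b:
            mismatches += 1

    if compared < min_comparable:
        raise ValueError("too few comparable positions")

    p = mismatches / compared  # uncorrected (observed) distance
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Toy alignment: one mismatch in four comparable positions (p = 0.25).
    print(jukes_cantor_distance("ACGT", "ACGA"))
```

The corrected distance is always at least as large as the observed mismatch fraction, because it accounts for multiple substitutions at the same site.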

C. callunae DSM 20147T is a Gram-positive, rod-shaped bacterium, which is 1–2 μm long and 0.4–0.6 μm wide (Figure 2). It is described as non-motile [1], which is consistent with a complete lack of genes associated with ‘cell motility’ (functional category N in the COGs table). Growth of DSM 20147T was shown at temperatures of 25–37°C, with optimal L-glutamic acid production at 25–35°C [1]. Carbon sources utilized by strain DSM 20147T include dextrose, fructose, galactose, inulin, inositol, maltose, mannitol, mannose, raffinose, salicin, sucrose and trehalose [1]. DSM 20147T tested positive for citrate, catalase and urease, but shows no nitrate reduction activity [1]. Details on the chemotaxonomy are largely missing, but can be inferred from the close relatives C. glutamicum, C. efficiens, and C. deserti [3]. Based on these relatives, meso-diaminopimelic acid is expected to be the major diamino acid of the cell wall, with arabinose and galactose as the main sugars (chemotype IV). Short-chain mycolic acids (32–36 carbon atoms) are also certain to be present, as all necessary genes were found. The major cellular fatty acids are expected to be hexadecanoic acid (C16:0, 40–50%) and octadecenoic acid (C18:1 ω9c, 40–50%), with small amounts of octadecanoic acid (C18:0, ~1%) and possibly others. MK-9(H2) is thought to be the major menaquinone, although MK-8(H2) might also be present in significant amounts. Phosphatidylinositol, diphosphatidylglycerol, and phosphatidylglycerol as well as their glycosides are expected to be the main components of the polar lipids (Table 1).

Figure 2

Scanning electron micrograph of C. callunae DSM 20147T.

Table 1 Classification and general features of C. callunae DSM 20147T according to the MIGS recommendations [15]

Genome sequencing and annotation

Genome project history

Due to its phylogenetic position in the close neighborhood of industrially relevant species of the genus Corynebacterium, C. callunae was selected for sequencing as part of a project to define production-relevant loci in corynebacteria. Although it is not part of the GEBA project, sequencing of the type strain will nonetheless aid the GEBA effort. The genome project is deposited in the Genomes OnLine Database [28] and the complete genome sequence is deposited in GenBank. Sequencing, finishing and annotation were performed at the CeBiTec. A summary of the project information is shown in Table 2.

Table 2 Genome sequencing project information

Growth conditions and DNA isolation

C. callunae DSM 20147T was grown aerobically in CASO bouillon (Carl Roth GmbH, Karlsruhe, Germany) at 30°C. DNA was isolated from ~10^8 cells using the protocol described by Tauch et al. [29].

Genome sequencing and assembly

Two libraries were prepared: a WGS library using the Illumina-Compatible Nextera DNA Sample Prep Kit (Epicentre, WI, U.S.A.) and a 6 kb mate-pair library using the Nextera Mate Pair Sample Preparation Kit, both according to the manufacturer's protocol. Both libraries were sequenced in a 2 × 250 bp paired read run on the MiSeq platform, yielding 1,747,266 total reads and providing 99.51× coverage of the genome. Reads were assembled using the Newbler assembler v2.8 (Roche). The initial Newbler assembly consisted of 29 contigs in four scaffolds. Analysis of the four scaffolds revealed two to be extrachromosomal elements (plasmids pCC1 and pCC2), one to make up the chromosome, and the remaining one to contain the seven copies of the RRN operon.

The Phred/Phrap/Consed software package [30–33] was used for sequence assembly and quality assessment in the subsequent finishing process. Gaps between contigs were closed by manual editing in Consed (for repetitive elements).

Genome annotation

Gene prediction and annotation were done using the PGAP pipeline [34]. Genes were identified using GeneMark [35], GLIMMER [36], and Prodigal [37]. For annotation, BLAST searches against the NCBI Protein Clusters Database [38] were performed, and the annotation was enriched by searches against the Conserved Domain Database [39] and subsequent assignment of coding sequences to COGs. Non-coding genes and miscellaneous features were predicted using tRNAscan-SE [40], Infernal [41], RNAmmer [42], Rfam [43], TMHMM [44], and SignalP [45].

Genome properties

The genome (2,928,683 bp in total) includes one circular chromosome of 2,839,551 bp (52.39% G + C content) and two plasmids of 4,109 bp (54.42% G + C content) and 85,023 bp (54.38% G + C content; Figure 3). For chromosome and plasmids, a total of 2,729 genes were predicted, 2,647 of which are protein coding genes. A putative function was assigned to 2,085 (76.40%) of the protein coding genes; the remainder were annotated as hypothetical proteins. In total, 1,937 protein coding genes belong to 314 paralogous families in this genome, corresponding to a gene content redundancy of 41.52%. The properties and the statistics of the genome are summarized in Tables 3, 4 and 5.
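As a quick consistency check on the replicon figures, the following Python sketch sums the three replicon lengths and computes a length-weighted G + C content for the whole genome. It assumes a chromosome length of 2,839,551 bp, which is the value that makes the three replicons add up to the stated total of 2,928,683 bp; the variable names are illustrative only, and the weighted G + C value is derived here rather than reported in the text.

```python
# Replicon lengths (bp) and G + C contents (%) as given in the text.
# Assumption: the chromosome is 2,839,551 bp, the length that is
# consistent with the stated genome total of 2,928,683 bp.
replicons = {
    "chromosome": (2_839_551, 52.39),
    "pCC1": (4_109, 54.42),
    "pCC2": (85_023, 54.38),
}

# Total genome size is the sum of all replicon lengths.
total_bp = sum(length for length, _ in replicons.values())

# Length-weighted mean G + C content across all replicons.
weighted_gc = sum(length * gc for length, gc in replicons.values()) / total_bp

# Share of predicted genes that are protein coding (2,647 of 2,729).
protein_coding_share = 100 * 2_647 / 2_729

print(f"total genome size: {total_bp:,} bp")
print(f"length-weighted G + C content: {weighted_gc:.2f}%")
print(f"protein coding share of all genes: {protein_coding_share:.2f}%")
```

Because the chromosome accounts for about 97% of the total length, the weighted G + C content stays close to the chromosomal value.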

Figure 3

Graphical map of the chromosome and the two plasmids pCC1 and pCC2 (not drawn to scale). From the outside in: Genes on forward strand (color by COG categories), Genes on reverse strand (color by COG categories), GC content, GC skew.

Table 3 Summary of genome: one chromosome and two plasmids
Table 4 Genome statistics
Table 5 Number of genes associated with the general COG functional categories

Insights from the genome sequence

The complete genome sequence of C. callunae has already been mined for biotechnological purposes, namely to define the core genome of the C. glutamicum - C. efficiens - C. callunae subgroup in order to define the chassis genome for C. glutamicum [46]. Comparison of the three genomes using EDGAR [47] reveals that the core genome of this group comprises just 1,873 genes. The number of genes that are found only in C. callunae is also relatively small (366), especially when compared to the number of singletons found in the other two (926 and 773 in C. glutamicum and C. efficiens, respectively; Figure 4). As C. callunae was shown to produce L-glutamate in an amount comparable to C. glutamicum, C. callunae might be considered a potential candidate for future genome reduction efforts, since its chromosome is already considerably smaller than those of C. glutamicum and C. efficiens (2.84 Mbp versus 3.21 Mbp and 3.15 Mbp, respectively). This future approach is aided by the observation that many of the singletons are clustered in just three regions (I: H924_02045-H924_02230, 37 genes, 25.2 kbp; II: H924_03630-H924_03880, 50 genes, 52.5 kbp; III: H924_07070-H924_07380, 61 genes, 48.2 kbp), which together constitute ~4.4% of the genome size. As at least region II and region III are likely prophages, loss of these regions should be neutral or even beneficial, as demonstrated for C. glutamicum [48].
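The ~4.4% figure can be reproduced from the region sizes given above. The short Python sketch below adds up the three singleton-rich regions and relates them to the chromosome size; it assumes that the percentage refers to the 2.84 Mbp chromosome rather than to the whole genome including plasmids, and the variable names are illustrative.

```python
# Sizes of the three singleton-rich regions (kbp), as given in the text.
regions_kbp = {"I": 25.2, "II": 52.5, "III": 48.2}

# Assumption: the reference size is the chromosome (2.84 Mbp per the text).
chromosome_kbp = 2_840.0

# Combined size of the three regions and their share of the chromosome.
total_regions_kbp = sum(regions_kbp.values())
fraction_percent = 100 * total_regions_kbp / chromosome_kbp

print(f"combined region size: {total_regions_kbp:.1f} kbp")
print(f"share of chromosome: {fraction_percent:.1f}%")
```

Removing all three regions would therefore shorten the chromosome by roughly 126 kbp.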

Figure 4

Venn diagram depicting the number of genes shared between C. callunae, C. glutamicum, and C. efficiens. EDGAR [47] was used to determine the core genome shared between the three species and the singletons unique to each of them.

One central prerequisite for future rational strain development is the genetic accessibility of the prospective strain. Knowledge of the complete genome sequence of C. callunae helps to overcome at least two of the main obstacles: the construction of plasmids usable as vectors and the removal of elements that hinder DNA transfer. For the former, knowledge of the sequences of the two plasmids pCC1 and pCC2 allows the use of plasmid-tagging approaches via a counter-selectable marker [49] to cure them, should conventional approaches like heat-shock curing fail. Once the plasmids are cured, their sequences help to identify the minimal gene set necessary for replication in order to build shuttle vectors, as demonstrated for pCC1 [50]. For the latter, the genome sequence helps to identify restriction-modification systems. A preliminary analysis revealed the presence of at least 4 such systems, one of which is located in the potential prophage region II. Removal of such systems has been shown to significantly enhance the stability of introduced foreign DNA, thus facilitating genetic engineering approaches [48].

Conclusion

The complete genome sequence of C. callunae is the third genome sequence of the C. glutamicum - C. deserti - C. efficiens - C. callunae subgroup of L-glutamic acid producing corynebacteria within the genus Corynebacterium. Knowledge of the complete genome sequence has already contributed to identifying the core genome of this group. With a size of 2.84 Mbp and a total of 2,647 protein coding genes, the genome of C. callunae is by far the smallest within this group. Therefore, this bacterium might be an ideal choice for future development of a platform strain, as the otherwise high degree of similarity of its genome content to the well-studied C. glutamicum would allow an easy transfer of knowledge to the new host. Furthermore, knowledge of the complete genome sequence also facilitates the identification of possible targets to improve the accessibility to genetic engineering (like restriction-modification systems) and to enhance genome stability (like phages and transposases).

Authors' contributions

MP prepared and wrote the manuscript, AA performed library preparation and sequencing, HB and KN performed electron microscopy, JK coordinated the study, and CR assembled and analyzed the genome sequence.