Introduction

The use of microorganisms in food fermentations is the means for converting perishable and frequently inedible raw materials into safe, shelf-stable and nutritionally upgraded foods [1]. The economic importance of starter cultures for the food industry has led to the continuous search for the discovery of new microorganisms with important technological characteristics. In many cases it has been proven that traditionally fermented foods represent a natural reservoir of undiscovered microbial strains for possible diverse food applications [2, 3].

Streptococcus thermophilus is among the species commonly used in the dairy industry, mainly in the fermentation of yogurt and several cheese varieties, contributing to the desirable organoleptic characteristics of the final product [4, 5]. It is the sole species considered GRAS within the Streptococcus genus, which includes mostly pathogens and opportunistic pathogens [6]. Due to the industrial significance of the species, a plethora of studies has been conducted for a number of strains, revealing information about their diverse technological features [7, 8]. Furthermore, during the last 15 years, the advance of high-throughput sequencing techniques along with the development of novel bioinformatics tools facilitated the analysis of complete genome sequences, providing information for the overall genetic content of S. thermophilus [912]. These studies have demonstrated that S. thermophilus strains have been adapted to the milk environment through extensive reductive evolution as indicated by the large number of pseudogenes found in all strains. Adaptation to the milk environment is also supported by the loss of genes related to carbohydrate metabolism and virulence.

In this study, we present the analysis of the complete genome sequence of S. thermophilus ACA-DC 2. The genomic insights acquired could be proven useful for the exploitation of the specific strain in the production of fermented dairy products.

Organism information

Classification and features

Streptococcus thermophilus ACA-DC 2 is classified within the order Lactobacillales of the class Bacilli . It is a non-sporulating, Gram-positive bacterium with coccus-shaped cells (Fig. 1). The strain was isolated from traditional Greek yogurt manufactured through backslopping [13, 14]. Its optimum growth takes place in M17 medium at 42 oC under microaerophilic conditions within 24 h. Information about the classification and the features of S. thermophilus ACA-DC 2 is summarized in Table 1. The phylogenetic analysis was based on 16S rRNA gene sequences and places S. thermophilus ACA-DC 2 in the distinct cluster formed by the S. thermophilus strains and within the salivarius group, as shown in Fig. 2.

Fig. 1
figure 1

Photomicrographs of S. thermophilus ACA-DC 2. The images were obtained with (a) optical microscopy at magnification 1000x for Gram stained cells and (b) transmission electron microscopy of stained cells with 10% (w/v) PTA. Scale bar in (b) corresponds to 1 μm

Table 1 Classification and general features of S. thermophilus strain ACA-DC 2 according to the MIGS recommendations [39]
Fig. 2
figure 2

Phylogenetic tree highlighting the position of S. thermophilus ACA-DC 2 relative to other Streptococcus species. The tree was constructed based on 16S rRNA gene sequences. GenBank accession numbers are presented in parentheses and type strains are indicated with a superscript T (type strains = T). Strains with complete genome sequence are marked with an asterisk. 16S rRNA gene sequences were aligned using MUSCLE [55]. The phylogenetic tree was built by the Maximum Likelihood method within MEGA7 software [56] using the Tamura-Nei substitution model [57]. Lactococcus lactis subsp. lactis NCDO 604T served as the outgroup. Bootstrap values derived after 1,000 replicates. The scale bar indicates an estimated 0.01 nucleotide change per nucleotide position

Genome sequencing information

Genome project history

S. thermophilus ACA-DC 2 is deposited in the ACA-DC culture collection of the Laboratory of Dairy Research, Agricultural University of Athens, Athens, Greece. The strain was selected for sequencing in order to obtain information about its technological and probiotic potential, having as basic aim its application as a starter culture in the production of dairy fermented foods. The project was carried out in 2015 and the genome was sequenced, fully assembled and annotated. The genome sequencing project was registered in the European Nucleotide Archive database under accession number LT604076. The summary of the project information is shown in Table 2.

Table 2 Project information

Growth conditions and genomic DNA preparation

S. thermophilus ACA-DC 2 was grown in M17 medium (Biokar Diagnostics, Beauvais, France). For the isolation of the genomic DNA, 2 ml from an overnight culture incubated at 42 °C were used and the extraction procedure was performed according to the protocol of Pitcher et al. [15]. The purity and the concentration of the extracted DNA were measured with a UV-Vis spectrophotometer (Q5000, Quawell, San Jose, USA) while its integrity was evaluated electrophoretically in a 0.8% agarose gel.

Genome sequencing and assembly

Whole-genome sequencing was performed using the Illumina HiSeq2500 and the PacBio RSII platforms at BaseClear service laboratory for DNA-research (Leiden, The Netherlands). Paired-end sequence reads were generated using the Illumina HiSeq2500 system. FASTQ sequence files were obtained using the Illumina Casava pipeline v1.8.3. Initial quality assessment was based on data passing the Illumina Chastity filtering. Subsequently, reads containing adapters and/or PhiX control signal were removed using an in-house filtering protocol. The second quality assessment was based on the remaining reads using the FASTQC quality control tool v0.10.0 resulting in 4,403,680 reads.

The data collected from the PacBio RSII instrument were processed and filtered using the SMRT Analysis software suite. The Continuous Long Read data were filtered by Read-length (>50), Subread-length (>50) and Read quality (>0.75) resulting in 117,020 reads.

The quality of the Illumina FASTQ sequences was enhanced by trimming off low-quality bases using the program bbduk, which is part of the BBMap suite v34.46. The quality-filtered sequence reads were puzzled into a number of contig sequences. The analysis was performed using ABySS v1.5.1. The contigs were linked and placed into super-scaffolds based on the alignment of the PacBio CLR reads with BLASR [16]. The alignment was further used to estimate the orientation, order and distance between the contigs by the SSPACE-LongRead scaffolder v1.0 [17]. The gapped regions within the super-scaffolds were closed in an automated manner using GapFiller v1.10 [18]. The method takes advantage of the insert size between the Illumina paired-end reads. The assembly resulted in one circular chromosome of 1,731,838 bp.

Genome annotation

Prediction of genes was carried out with the online programs Prodigal [19], MetaGeneAnnotator [20] and FGENESB [21], for comparison and verification of the obtained results. Genome annotation was performed using RAST v2.0 [22]. Annotation anomalies, including pseudogenes, were identified using GenePRIMP [23]. All data acquired were combined and subjected to manual curation. The WebMGA server [24] and the EggNog v4.5 [25] were used for COG annotation, the Phobius web server was used for the identification of genes with transmembrane helices and genes with signal peptides [26] and the Pfam database was used for the identification of genes with Pfam domains [27]. Potential pathogenic features were identified using the MP3 tool [28]. The CRISPRs, the restriction-modification systems and the putative antimicrobial peptides were predicted using the CRISPRFinder web tool [29], the REBASE database [30] and BAGEL3 [31], respectively. The KODON software (Applied Maths NV, Sint-Martens-Latem, Belgium) was utilized for the visualization of synteny among the CRISPR regions of ACA-DC 2 and LMD-9 strains. The EDGAR server [32] was used for whole genome phylogeny and Venn diagram analysis. Circoletto [33] was employed for whole genome alignment among S. thermophilus strains. Finally, the genomic islands were identified through the IslandViewer 3 web-based resource [34].

Genome properties

The complete genome of S. thermophilus ACA-DC 2 consists of one circular chromosome containing 1,731,838 bp. The average GC content of the chromosome is 39.2%. A total of 1,850 genes were predicted after manual curation, including 1,556 protein-coding genes, 70 RNAs (56 tRNAs and 14 rRNAs) and 224 potential pseudogenes (Table 3). A circular map of the genome was generated using the CGView comparison tool [35] as shown in Fig. 3. Function was assigned to 1,182 genes (63.89%), while 1,318 genes (71.24%) had one or more conserved Pfam domains. The distribution of protein-coding genes into COG functional category is shown in Table 4. The analysis revealed that approximately 28.5% of the protein-coding genes do not have any described function.

Table 3 Genome statistics
Fig. 3
figure 3

Circular map of S. thermophilus ACA-DC 2 genome features generated with the CGview tool. From periphery to center: Protein coding genes on forward strand colored by COG category assignment; Genes on forward strand; Protein coding genes on reverse strand colored by COG category assignment; Genes on reverse strand; GC content; GC skew; Genome region in kbp

Table 4 Number of genes associated with general COG functional categories

Insights from the genome sequence

Main genome sequence characteristics

The genome of S. thermophilus ACA-DC 2 is the smallest one described to date among the fully sequenced strains of the species deposited in NCBI and it is approximately 200 kbp smaller than the larger described genome. The majority of potential pseudogenes encode hypothetical proteins, transposases and proteins involved in carbohydrate transport and metabolism. Analysis of the genome for virulence factors with the MP3 tool revealed a number of hits (data not shown). Detailed inspection of these hits with EDGAR demonstrated that several such genes are conserved among S. thermophilus strains indicating that it is rather unlikely to be related to pathogenicity, given the safe history of the species. The high percentage of pseudogenes along with the absence of typical virulence factors for streptococci suggest that the ACA-DC 2 strain evolved through genome decay during the adaptation to the rich in nutrients dairy environment [9, 11].

S. thermophilus ACA-DC 2 carries a complete lactose-galactose operon containing the galR, galK, galT, galE, galM, lacS and lacZ genes (STACADC2_1195-1189) and it is able to ferment lactose and galactose, the latter in a fairly slow rate (data not shown). It has been reported that fermentation of galactose is limited among the strains of S. thermophilus [11]. As mentioned above, several genes responsible for the transport and degradation of sugars, such as fructose, maltose and trehalose, have been identified as pseudogenes in the genome of ACA-DC 2, further supporting the specialization of the bacterium to catabolize lactose.

The proteolytic system of S. thermophilus ACA-DC 2 consists of several genes encoding aminopeptidases, such as pepA (STACADC2_1626), pepC (STACADC2_0202), pepF (STACADC2_0406), pepM (STACADC2_1333), pepN (STACADC2_0892), pepO (STACADC2_1656), pepP (STACADC2_1520), pepQ (STACADC2_0572), pepS (STACADC2_0058), pepT (STACADC2_0971), pepV (STACADC2_0960), and pepX (STACADC2_1446), one oligopeptide opp ABC transporter (STACADC2_1229-1233), four polar amino acid ABC transporters (STACADC2_0780-0782, STACADC2_0992-0995, STACADC2_1355-1358, STACADC2_1431-1433), two symporters for branched-chain amino acids (STACADC2_0872, STACADC2_1160), and two glutamine ABC transporters (STACADC2_0547-0548, STACADC2_1281-1282). Strain ACA-DC 2 lacks a cell wall-associated proteinase (PrtS). Although this gene may be important for optimal growth of S. thermophilus in milk, its absence is of minor significance when co-cultured with a proteolytic  Lactobacillus delbrueckii subsp. bulgaricus strain, since the release of peptides by the latter is beneficial for the growth of S. thermophilus [10, 11].

Similar to other dairy bacteria, S. thermophilus ACA-DC 2 is able to synthesize exopolysaccharides (EPS) that may confer improved viscosity and texture to yogurt [4]. The EPS cluster is flanked by a deoD gene encoding a purine nucleoside phosphorylase (STACADC2_0949) and a pseudogene originally encoding a beta-glucosidase. Four of these genes, namely epsA (STACADC2_0948), epsB (STACADC2_0947), epsC (STACADC2_0946) and epsD (STACADC2_0945) are implicated in the regulation, polymerization, chain length and export of the EPS and are conserved among several S. thermophilus strains [36].

The genome analysis of strain ACA-DC 2 revealed a number of genes known to be responsive to unfavorable conditions prevailing during industrial applications. Among them we identified conserved heat shock genes like grpE, dnaK, dnaJ (STACADC2_0105-0107) and groES, groEL (STACADC2_0179-0180), genes encoding Clp proteases (STACADC2_0071, STACADC2_0315, STACADC2_0526, STACADC2_0544, STACADC2_1391), a proton translocating F0F1-ATPase system (STACADC2_0430-0437) and a P-type calcium pump ATPase (STACADC2_0983). The strain also harbors genes related to oxidative stress, namely a Mn-superoxide dismutase (STACADC2_0657), a glutathione reductase (STACADC2_0362), two thioredoxins (STACADC2_1043, STACADC2_1624), two thioredoxin reductases (STACADC2_1208, STACADC2_1429), a NADH oxidase (STACADC2_1113) and two sulfoxide reductases (STACADC2_1408, STACADC2_1159). Furthermore, the genome carries four putative antimicrobial peptides that need further investigation (STACADC2_0091, STACADC2_1453, STACADC2_1458 and STACADC2_1709).

Two candidate CRISPRs were detected in the chromosome of strain ACA-DC 2. Intriguingly, both CRISPRs contained only one spacer. One CRISPR was found surrounded by cas proteins (STACADC2_0849-0856) while the other was orphan. The CRISPR-cas system of strain ACA-DC 2 exhibited the same organization and high degree of identity to that described previously for strain LMD-9 (Fig. 4) [37]. The two CRISPR-cas systems differed mainly in the csm6 gene, which in the case of strain ACA-DC 2 is a potential pseudogene as well as in csm2 gene that seems to be distinct in the two strains. S. thermophilus LMD-9 carries three CRISPR-cas systems and the system that is similar to that of ACA-DC 2 carries the lowest number (three) of spacers. Combined these findings could indicate low activity or even inactivation of the entire CRISPR-cas system in strain ACA-DC 2. Another possibility that cannot be excluded concerns low exposure of strain ACA-DC 2 to foreign DNA. Of course, any deficiency in the activity of the CRISPR-cas system may be compensated by restriction-modification (RM) systems. Strain ACA-DC 2 carries four putative RM systems according to the REBASE database (data not shown) belonging to RM types I (STACADC2_0642, STACADC2_0645, STACADC2_0648), II (STACADC2_0597-0598), III (STACADC2_0788-0789) and IV (STACADC2_0626).

Fig. 4
figure 4

Synteny plot of the CRISPR loci between S. thermophilus strains ACA-DC 2 and LMD-9. The synteny of the two regions was calculated by the KODON software. In both strains the cas genes are denoted in blue. Gene csm6 in strain ACA-DC 2 is a potential pseudogene and it is denoted in yellow. The pyrD and pyrF genes colored in beige define the upstream and downstream limits of the CRISPR loci. Percentages displayed in the ribbon areas correspond to the % identity among the nucleotide sequences

Comparative genomic analysis, strain specific genomic features and genomic islands

Resolution of phylogenetic trees based on 16S rRNA gene sequences is limited due to high sequence identity, especially for strains of the same species. For this reason, we also performed whole genome phylogeny as implemented in EDGAR, using all available complete genomes of S. thermophilus . The phylogenetic tree produced revealed that S. thermophilus strains could be clustered in two distinct branches, the second of which could be also split in two sub-branches (Fig. 5). Strain ACA-DC 2 formed one of the branches along with strains CNRZ1066, LMG 18311, S9 and CS8. We chose strains ACA-DC 2, JIM 8232 and KLDS 3.1003 as representatives of each branch for comparative genomic analysis (Fig. 5). Whole genome alignments revealed extensive regions of high identity (>98%) among the genomes. However, regions of lower identity (between 80 and 98%) as well as strain specific regions were also identified. Using Venn diagram analysis as implemented in EDGAR, we determined a core genome of 1,303 genes among the three genomes as well as 137, 185 and 236 unique genes for strains ACA-DC 2, KLDS 3.1003 and JIM 8232, respectively.

Fig. 5
figure 5

Comparative genomics of S. thermophilus strains. a Whole genome phylogeny of S. thermophilus strains with complete available genome sequences. The phylogenetic tree was calculated in EDGAR and it is presented as a cladogram ignoring branch length. The strains S. salivarius NCTC 8616 and Lactococcus lactis subsp. cremoris MG1363 were used as outgroups. Colored boxes indicate the three distinct S. thermophilus branches identified. Strains in each colored branch designated in bold were used for further comparative analysis. b Whole genome alignments of S. thermophilus strains KLDS 3.1003, JIM 8232 and ACA-DC 2 using Circoletto. Red and blue ribbons correspond to regions of >98% and 80–98% identity, respectively. White regions correspond to strain specific loci. c Venn diagram analysis of S. thermophilus strains KLDS 3.1003, JIM 8232 and ACA-DC 2, as implemented in EDGAR

The 137 unique genes of strain ACA-DC 2 were found to be involved in diverse functions (Fig. 6). At least some of those genes may be the result of horizontal gene transfer (HGT). HGT acquired genes may play a role in the technological properties of S. thermophilus strains [11]. Another analysis that may also reveal regions of HGT in the bacterial chromosome is the identification of GIs [38]. Twelve integrated GIs were predicted in the genome of S. thermophilus ACA-DC 2 (Fig. 6), containing a total of 213 genes also involved in diverse functions (Fig. 6). Detailed analysis of genes either unique or in the GIs could relate some of them to important technological traits. For example, we determined genes coding for cold shock proteins CspA and CspG (STACADC2_0749-0750), acid resistance locus arl7 (STACADC2_0743), putative bacteriocin peptides (STACADC2_1453 and STACADC2_1458) and a type I RM system (STACADC2_0642, STACADC2_0645, STACADC2_0648). A putative agmatinase gene (STACADC2_0818) that may play a role to protocooperation of S. thermophilus and L. bulgaricus during polyamine metabolism, was also detected in ACA-DC 2 strain [10]. Furthermore, genes implicated in zinc and heavy metals transport (STACADC2_0165-0166, STACADC2_0752), in DNA repair and metabolism (STACADC2_1696, STACADC2_1716, STACADC2_1719, STACADC2_1754) as well as several ribosome binding proteins, were also identified (STACADC2_0137, STACADC2_1568-1569, STACADC2_1667, STACADC2_1669-1671, STACADC2_1675-1695, STACADC2_1717, STACADC2_1732-1733, STACADC2_1752, STACADC2_1755).

Fig. 6
figure 6

Additional genomic features of S. thermophilus ACA-DC 2. a Circular map of the S. thermophilus ACA-DC 2 genome as generated by IslandViewer 3. Highlighted regions correspond to GIs. GIs are colored within the circular map according to the prediction method used: five GIs in orange and eight GIs in blue were predicted with SIGI-HMM and IslandPath-DIMOB, respectively. Twelve integrated GIs are presented on the periphery in red. The black line plot represents the GC content (%) of the genomic sequence. b Distribution of genes in GIs and unique genes of S. thermophilus ACA-DC 2 into COG categories

Conclusions

The genome of S. thermophilus ACA-DC 2 presents characteristics in accordance with its adaptation to the milk environment including a high percentage of pseudogenes and absence of pathogenic features. Our analysis revealed that the strain carries genes involved in lactose and galactose catabolism and protein degradation that may accommodate its growth during milk fermentation. Stress response related genes that may contribute to survival under technological hurdles were also detected. Whole genome phylogeny suggested that S. thermophilus strains may diversify in three phylogenetic clades. Comparative analysis of genomes representative of each clade, including strain ACA-DC 2, revealed a number of unique genes for the latter. Furthermore, certain unique genes or genes belonging to GIs could be related to technological properties important for starter cultures. Theoretically, such genes could have been acquired through HGT. These findings render S. thermophilus ACA-DC 2 an appropriate candidate for use in the production of fermented dairy products.