Background

The genus Streptococcus consists of Gram-positive bacteria and includes numerous clinically significant species that are responsible for a wide variety of infections in humans and animals, with differing manifestations and courses [1]. To date, nearly 129 Streptococcus and 58 Enterococcus species have been identified ([2,3,4] http://www.bacterio.net/streptococcus.html), but these numbers undergo constant modification. Streptococci are capable of colonizing human and animal mucous membranes and are considered opportunistic pathogens that can cause acute infections under particular conditions [5]. Some streptococcal species (e.g. S. pyogenes and S. pneumoniae) are highly virulent and responsible for severe diseases such as pneumonia, necrotizing fasciitis, sepsis and meningitis, while others (S. bovis, S. mutans, S. sanguis, S. agalactiae and S. anginosus) are involved in a number of clinically relevant diseases such as endocarditis, abscesses and other pathological conditions [1, 6, 7]. The genus has undergone considerable taxonomic revisions and is currently divided, on the basis of defined group antigens (A, B, C, E, F, and G), into different groups: GAS (Group A Streptococcus), GBS (Group B Streptococcus), group C Streptococcus, group G Streptococcus, group viridans with subgroups: anginosus, mitis, mutans, salivarius, group bovis [8,9,10].

Enterococci were initially part of the Streptococcus genus. Currently, they are considered a separate genus that forms part of the natural human microbiota. Enterococcal species are commensals of the gastrointestinal tract of humans and animals and, as opportunistic pathogens, can cause acute infections in immunocompromised patients. The Enterococcus genus has been reported as the third most common causative agent of bacteremia and infective endocarditis [11,12,13].

Identification of streptococcal and enterococcal species has been a challenge for decades due to changing taxonomy, name changes and the addition of new species. In routine diagnostic laboratories, phenotypic biochemical methods still play a dominant role. Given the variability of strains and species, the differentiation achieved by these methods is limited compared to methods based on genetic discrimination and may result in incorrect identification in more than 50% of the cases [14]. The rapidly changing taxonomy also means that the phenotypic databases used in routine diagnostics are not kept up to date. If isolates are not identified at the species level, the real impact of individual species, in particular the less frequent ones, is underreported. Accurate identification is highly desirable for precise therapy, for monitoring the spread of infection and its epidemiologic characteristics, and for investigating the progress of disease [14, 15].

In standard diagnostics, phenotypic tests including automated systems such as Vitek 2 (bioMérieux, La Balme Les Grottes, France) or BD Phoenix (BD Diagnostic Systems, Sparks, MD, USA) as well as matrix-assisted laser desorption ionization–time of flight mass spectrometry (MALDI-TOF MS) are used for bacterial identification. Commercially available MALDI-TOF MS systems in particular provide accurate identification for many clinically relevant bacterial species. Nevertheless, the technique has so far failed to differentiate between species of the mitis and bovis groups and other closely related species. Since databases are limited to only some species, further improvements of the Streptococcus and Enterococcus spectra databases seem necessary. Moreover, phenotypic methods are not always reliable enough because of the variable expression of phenotypic characteristics [16,17,18]. Accurate identification at the species level may change the diagnosis and is important for characterizing the pathogenic potential of individual species and for monitoring trends in antimicrobial susceptibility and emerging infections. The ideal method should have a high discriminatory power, allowing identification of closely related species, and at the same time should be relatively simple, inexpensive, rapid and reproducible. Therefore, genetic methods based on PCR or sequencing are good candidates for identification purposes. The identification is based on amplification of a selected nucleic acid target, sequencing and comparison with a reference sequence deposited in a nucleotide database [19, 20].

When polymicrobial samples must be analyzed, it is useful to simultaneously identify species of different genera using a single primer pair. The sequence analysis of the 16S rRNA gene, a highly conserved gene present in all bacteria, can be used for identification at the species level for most bacteria, even those not genetically related, with the same pair of primers [21]. Although this method is widely used and accurate, the high degree of identity of the 16S rRNA gene among the genetically closely related species limits its usefulness for identifying several bacterial species [19, 22, 23].

Next generation sequencing (NGS) has greatly improved microbiological genetic investigations by providing a cost-effective way to characterize bacterial genomes. The main advantage of NGS over Sanger sequencing is the ability to produce millions of reads in a single run. Recently, to overcome the limitations of 16S rRNA gene Sanger sequencing, a method based on NGS of the 16S–23S rRNA region was developed by Sabat and colleagues [24]. This method is based on PCR amplification of the 16S–23S rRNA region followed by amplicon sequencing on the MiSeq platform (Illumina, Inc., San Diego, CA, USA); the resulting reads are de novo assembled into contigs. Species identification is based on an alignment of the contig sequences with the sequences deposited in the reference databases [24]. The method can be used to identify common pathogens directly from patient samples with a high identification potential. It can also be used to identify non-cultured microorganisms and bacterial species in polymicrobial samples or in samples with a DNA concentration too low for direct whole genome sequencing (WGS). However, the main disadvantage of this method is the lack of 16S–23S rRNA reference sequences for many bacterial species, which hinders the proper interpretation of the results [24]. The main aim of this study was to develop a dataset of reference sequences of the 16S–23S rRNA region for clinically relevant streptococcal and enterococcal species. We also compared the identification potential of the NGS-based approach with that of Sanger sequencing of the 16S rRNA, sodA, tuf and rpoB genes, which are used for standard identification of streptococci and enterococci, and determined the cut-off values for genus and species level identification.

Methods

Bacterial isolates

The bacterial strains used in this study are listed in detail in Table 1. The collection included strains from 42 diverse streptococcal and 9 enterococcal species. Some of the strains are deposited in reference microorganism collections such as the Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures (DSMZ), the American Type Culture Collection (ATCC) or the Belgian Coordinated Collection of Microorganisms (BCCM). The other strains were clinical isolates from various human and animal sources from Warsaw (National Medicines Institute, Warsaw, Poland), Pescara (Clinical Microbiology and Virology, Spirito Santo Hospital, Pescara, Italy), and Groningen (University Medical Centre Groningen, The Netherlands).

Table 1 Streptococcus and Enterococcus reference species used for analyses

Genomic DNA extraction

For genomic DNA extraction, the isolates were grown for 18–20 h at 37 °C on blood agar plates in microaerophilic conditions or with 5% CO2. Two strains, S. cremoris (DSM20069) and S. difficilis (ATCC700208) were grown at 30 °C. A full inoculation loop of 10 μl of bacterial colonies was homogenized with a TissueLyser II (Qiagen, Germantown, MD, USA). Total DNA was extracted by enzymatic lysis using the buffers and solutions provided with the DNeasy Blood and Tissue Kit (Qiagen, Germantown, MD, USA) according to manufacturer’s instructions. To obtain an accurate quantification of the extracted genomic DNA for NGS, a fluorometric method specific for duplex DNA, a Qubit dsDNA BR Assay Kit and a Qubit fluorometer 2.0 (Life Technologies, Inc., Eggenstein, Germany) were used according to the manufacturer’s instructions.

PCR amplification and Sanger sequencing of 16S rRNA, sodA, tuf and rpoB genes

All reference strains were identified at the species level by polymerase chain reaction (PCR) and Sanger sequencing of the 16S rRNA, sodA, tuf and rpoB genes. The 16S rRNA gene was amplified using the primers LPW57 (5′-AGTTTGATCCTGGCTCAG-3′) and LPW58 (5′-AGGCCCGGGAACGTATTCAC-3′) as previously described [25]. The PCR program was as follows: initial denaturation for 2 min at 94 °C, followed by 25 cycles of denaturation at 94 °C for 30 s, annealing at 58 °C for 30 s and extension at 72 °C for 60 s. The final extension was for 5 min at 72 °C.

For the sodA gene, the internal fragment, which represents 83% of the gene (430 bp), was amplified with the primers d1 (5′-CCITAYICITAYGAYGCIYTIGARCC-3′) and d2 (5′-ARRTARTAIGCRTGYTCCCAIACRTC-3′) as previously described [26]. The PCR mixtures were initially denatured for 3 min at 95 °C, followed by 35 cycles of denaturation at 95 °C for 30 s, annealing at 40 °C for 60 s and extension at 72 °C for 90 s, with a final extension at 72 °C for 10 min. For some strains, the PCR product was not specific under these conditions and the annealing temperature was increased to 43 °C, 46 °C or 50 °C. For strain DSM9848 (S. adjacens), the aforementioned primers did not yield any amplification product and the primers sodA-F (5′-TRCAYCATGAYAARCACCAT-3′) and sodA-R (5′-ARRTARTAMGCRTGYTCCCARACRTC-3′) were used instead [19]. Amplification of the DNA fragments was performed with predenaturation for 5 min at 94 °C followed by 30 cycles of denaturation at 94 °C for 30 s, annealing at 45 °C for 60 s and extension at 72 °C for 30 s, with a final extension at 72 °C for 5 min.
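Several of the primers above are degenerate, i.e. they contain IUPAC ambiguity codes (R, Y, M) and inosine (I). As an illustration only, the following Python sketch counts how many distinct oligonucleotides such a primer mixture contains. The function names are hypothetical, and the sketch assumes that "I" denotes deoxyinosine incorporated as a single base, so that it does not enlarge the mixture; this is an assumption, not something stated in the text.

```python
# IUPAC ambiguity codes and the number of bases each one stands for.
IUPAC_CHOICES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
    # Assumption: "I" is deoxyinosine, a single synthetic base, so it does
    # not multiply the number of distinct oligonucleotides in the mixture.
    "I": 1,
}


def pool_size(primer: str) -> int:
    """Number of distinct oligonucleotides in a degenerate primer mixture."""
    size = 1
    for symbol in primer.upper():
        size *= IUPAC_CHOICES[symbol]
    return size


def inosine_count(primer: str) -> int:
    """Number of inosine positions in the primer."""
    return primer.upper().count("I")


sod_a_d1 = "CCITAYICITAYGAYGCIYTIGARCC"  # primer d1 from the text
sod_a_d2 = "ARRTARTAIGCRTGYTCCCAIACRTC"  # primer d2 from the text

print(pool_size(sod_a_d1), inosine_count(sod_a_d1))
print(pool_size(sod_a_d2), inosine_count(sod_a_d2))
```

Under these assumptions, d1 corresponds to a mixture of 32 oligonucleotides with 5 inosine positions and d2 to a mixture of 64 with 2 inosine positions. A high level of degeneracy is one plausible reason why low annealing temperatures are used with such primers and why the annealing temperature may need to be raised for individual strains.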

For tuf, an 830-bp portion of the gene was amplified with the primers Tuf-F (5′-CCAATGCCACAAACTCGT-3′) and Tuf-R (5′-CCTGAACCAACAGTACGT-3′) as previously described [20]. The PCR program was as follows: initial denaturation for 2 min at 95 °C, followed by 30 cycles of denaturation at 94 °C for 30 s, annealing at 50 °C for 30 s and extension at 72 °C for 90 s, with a final extension at 72 °C for 10 min. For some strains, the PCR product was not specific under these conditions and the annealing temperature was increased to 53 °C, 56 °C or 59 °C. For strain LMG 12287 (E. porcinus), the aforementioned primers did not yield any amplification product and the primers U1 (5′-AAYATGATIACIGGIGCIGCICARATGGA-3′) and U2 (5′-AYRTTITCICCIGGCATIACCAT-3′) were used instead [27]. Amplification of the DNA fragments was performed with predenaturation for 3 min at 95 °C followed by 35 cycles of denaturation at 95 °C for 30 s, annealing at 55 °C for 30 s and extension at 72 °C for 60 s, with a final extension at 72 °C for 7 min.

The partial rpoB gene (740 bp) was amplified with the primers Strepto F (5′-AARYTIGGMCCTGAAGAAAT-3′) and Strepto R (5′-TGIARTTTRTCATCAACCATGTG-3′) as previously described [28] with slight modifications in the PCR program: initial denaturation for 2 min at 95 °C, followed by 35 cycles of denaturation at 94 °C for 30 s, annealing at 52 °C for 30 s and extension at 72 °C for 60 s, with a final extension at 72 °C for 5 min. For some strains, the PCR product was not specific under these conditions and the annealing temperature was increased to 55 °C.

All PCR products were resolved by electrophoresis using a 2200 TapeStation System (Agilent Technologies, Santa Clara, CA, USA) and then purified using the DNA Clean & Concentrator™-5 purification kit (Zymo Research, Irvine, CA, USA).

For the Sanger sequencing of 16S rRNA, sodA, tuf and rpoB genes, the same primers as for PCR amplification were used. For the 16S rRNA, tuf and rpoB genes a total amount of 200 ng of PCR product was sequenced and for the sodA gene 100 ng.

Next generation sequencing of the 16S–23S rRNA region

Amplification of the 16S–23S rRNA region was performed using primer 16S-27F (5′-AGAGTTTGATCMTGGCTCAG-3′) and primer 23S-2490R (5′-GACATCGAGGTGCCAAAC-3′) as described previously [24]. The PCR program was as follows: initial denaturation for 2 min at 94 °C, followed by 30 cycles of denaturation at 94 °C for 30 s, annealing at 66 °C for 30 s and extension at 72 °C for 120 s, with a final extension at 72 °C for 5 min. The obtained PCR products were purified and the DNA libraries were prepared with the Nextera XT DNA Sample Preparation Kit (Illumina) according to the manufacturer’s instructions. The indexed libraries were pooled and loaded onto an Illumina MiSeq reagent cartridge using the MiSeq reagent kit v3 (600 cycles). The 2 × 300 bp sequencing was run on an Illumina MiSeq platform.

Data analysis

The Sanger sequencing results were analyzed using the Chromas (v. 2.6.2.; Technelysium Pty Ltd., South Brisbane, Australia) software. The obtained sequences were analyzed using nucleotide BLAST (Basic Local Alignment Search Tool, http://www.ncbi.nlm.nih.gov/BLAST/) and aligned to the reference sequences deposited in the GenBank (https://www.ncbi.nlm.nih.gov/nucleotide/) and leBIBI (https://umr5558-bibiserv.univ-lyon1.fr/lebibi/lebibi.cgi) databases. The best and second-best species alignments were analyzed. According to the criteria developed by Sabat et al. in 2017 [24], a bacterial species was assigned when the identity score was 99% or higher and the difference in identity score from the next closest species was ≥0.2%. Therefore, the identification at the species level using Sanger sequencing of the 16S rRNA (1284-bp), sodA (430-bp), tuf (830-bp) and rpoB (740-bp) gene fragments was considered unambiguous for sequences differing by at least 3, 2, 3 and 3 nucleotides, respectively. The identification at the species level using NGS of the whole 16S–23S rRNA region (4.3-kb), 16S rRNA gene (1.5-kb), intergenic region (330-bp) and 23S rRNA gene (2.5-kb) was considered unambiguous for sequences differing by at least 9, 3, 2 and 5 nucleotides, respectively. The sequences were aligned in ClustalW [29] and the phylogenetic trees were constructed using the Neighbor-Joining method [30,31,32]. The tree topologies were compared using the Compare2Trees program [33]. The pairwise comparison of each pair of sequences was obtained using CLC Genomics Workbench (v. 8.1; Qiagen, Germantown, MD, USA), considering deletions as differences.
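The two-part assignment rule described above (identity of at least 99% for the best hit and a margin of at least 0.2 percentage points over the next closest species) can be sketched in Python as follows. This is an illustration, not the software used in the study: the function name, the input format (one best identity score per candidate species) and the small rounding step are assumptions.

```python
def assign_species(hits, min_identity=99.0, min_gap=0.2):
    """Return the assigned species, or None if the result is ambiguous.

    hits: list of (species_name, identity_percent) tuples, with one
    best identity score per candidate species (hypothetical format).
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    best_species, best_identity = ranked[0]

    # Criterion 1: the best hit must reach the identity threshold.
    if best_identity < min_identity:
        return None

    # Criterion 2: the best hit must be separated from the next closest
    # species by at least min_gap percentage points. Rounding guards
    # against floating-point noise at the boundary.
    if len(ranked) > 1:
        gap = round(best_identity - ranked[1][1], 6)
        if gap < min_gap:
            return None

    return best_species


# Alignment scores quoted in the Discussion for the S. tigurinus strain:
# 4223/4261 to S. oralis subsp. tigurinus and 4221/4261 to S. oralis.
tigurinus_hits = [
    ("S. oralis subsp. tigurinus", 100 * 4223 / 4261),
    ("S. oralis", 100 * 4221 / 4261),
]
print(assign_species(tigurinus_hits))
```

For the S. tigurinus example, both identities exceed 99% but differ by only 2 nucleotides (about 0.05 percentage points), so the sketch returns None, which matches the ambiguous outcome reported for this species.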

NGS generated 35,000–350,000 sequencing reads per pure culture, to obtain a minimum coverage of 1000-fold per sample. The fastq files (Illumina MiSeq) with a read length of 300 nucleotides were de novo assembled with the DNASTAR SeqMan NGen software (v. 15.3; DNASTAR, Madison, WI, USA). During read assembly, reads shorter than 250 nucleotides were excluded. The minimum match percentage was 85% or 93% and the mer size was set to 31 nucleotides. After assembly, the mean sample coverage was 6680.50-fold; however, the coverage per sample varied between 1983.38- and 23,643.77-fold. Only runs with a Q30 read quality score of > 80% were accepted. To further determine sequencing errors of the Illumina MiSeq platform, three types of errors were investigated: insertion, deletion and mismatch. If a single nucleotide polymorphism (SNP) variant was identified in the consensus sequence, it was present at a maximum level of 5.36% (at 932-fold coverage). Such SNP variants were regarded as potential sequencing errors and discarded from further analysis. If the assembly resulted in multiple contigs, they were checked for length and quality in order to select the longest main contig with the largest number of assigned reads. Finally, the main contig was exported as a fasta file for use in the subsequent analyses. For all species, the main contig comprising the whole 16S–23S rRNA region was obtained; its length ranged for Streptococcus from 4251 (S. adjacens) to 4732 nucleotides (S. equinus) and for Enterococcus from 4224 (E. cecorum) to 4381 nucleotides (E. faecium). Species identification was based on alignment of the contig sequences with 16S–23S rRNA sequences deposited in the GenBank database using nucleotide BLAST, and the sequences were also compared with the leBIBI database (with the 16S rRNA gene sequence as reference).
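The relationship between read number and coverage can be approximated with a simple calculation. The Python sketch below is a rough estimate only: it assumes that every read has the full length of 300 nucleotides, that all reads map evenly across an amplicon of about 4300 nucleotides (the approximate size of the 16S–23S rRNA region), and that no reads are discarded. The function names are hypothetical.

```python
import math


def expected_coverage(n_reads, read_length, target_length):
    """Mean fold coverage if all reads map evenly across the target."""
    return n_reads * read_length / target_length


def reads_needed(min_coverage, read_length, target_length):
    """Smallest number of reads that reaches the requested mean coverage."""
    return math.ceil(min_coverage * target_length / read_length)


REGION_LENGTH = 4300  # assumed approximate length of the 16S-23S rRNA region
READ_LENGTH = 300     # MiSeq read length used in the study

print(round(expected_coverage(35_000, READ_LENGTH, REGION_LENGTH)))
print(round(expected_coverage(350_000, READ_LENGTH, REGION_LENGTH)))
print(reads_needed(1000, READ_LENGTH, REGION_LENGTH))
```

Under these simplifying assumptions, 35,000 and 350,000 reads correspond to roughly 2442-fold and 24,419-fold coverage, which is of the same order as the per-sample range reported above, and about 14,334 reads would be needed for 1000-fold coverage. Real coverage will differ because reads are filtered by length and are not distributed evenly.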

Nucleotide sequence accession numbers

The 255 sequences for 42 Streptococcus and 9 Enterococcus species were annotated using the NCBI BankIt tool and deposited in the GenBank database (http://www.ncbi.nlm.nih.gov/genbank/) under accession numbers: for the 16S–23S rRNA region, MK330555-MK330596 and MK322658-MK322666; for the 16S rRNA gene, MK330513-MK330554 and MK322649-MK322657; for the sodA gene, MK322556-MK322597 and MK308717-MK308725; for the tuf gene, MK322607-MK322648 and MK322598-MK322606; and for the rpoB gene, MK322514-MK322555 and MK308708-MK308716. The NGS of 16S–23S rRNA region raw reads were deposited in the European Nucleotide Archive (ENA) (https://www.ebi.ac.uk/ena) under study accession number: PRJEB32803 (ERP115525).

Results

Identification potential of Sanger sequencing methods for Streptococcus and Enterococcus species

All strains from the collection were characterized by Sanger sequencing of the 16S rRNA, sodA, tuf and rpoB genes. Identification to the species level was not possible with every target, due to identical or almost identical sequences (Table 2 and Additional file 1: Tables S1-S8) or the lack of some reference sequences in the GenBank (v. 231.0; June 21, 2019) database (Table 1). Therefore, the identification was confirmed by 1 target for 7 streptococcal species, by 2 targets for 10 streptococcal and 1 enterococcal species, by 3 targets for 8 streptococcal and 7 enterococcal species and by 4 targets for the largest group of species (17 Streptococcus and 1 Enterococcus species) (Table 1). Reference sequences for all Streptococcus and Enterococcus species were available only for the 16S rRNA gene.

Table 2 The comparison of indistinguishable pairs or groups of Streptococcus and Enterococcus species after Sanger sequencing of 16S rRNA, sodA, tuf and rpoB genes and NGS of 16S rRNA, 23S rRNA genes, intergenic spacer region and whole 16S–23S rRNA region

Sequence analysis of the 16S–23S rRNA region

The sequence analysis of the 16S–23S rRNA region was performed on 51 strains from our collection representing 42 Streptococcus and 9 Enterococcus species. A search of the GenBank database showed that sequences for the 16S–23S rRNA region were available for 27 Streptococcus species and 6 Enterococcus species, while this study yielded and deposited nucleotide sequences for an additional 15 and 3 species, respectively. Taking into consideration the differences in length of the intergenic spacer located between the 16S and 23S rRNA genes, the average sequence length of the 16S–23S rRNA region was 4346 nucleotides for Streptococcus and 4299 for Enterococcus. The highest identity of the 16S–23S rRNA region among Streptococcus species was found between S. infantis and S. tigurinus, with 99.7% sequence identity (13 nucleotides of difference), while the highest nucleotide difference was found between S. adjacens and S. criceti and equaled 1209 nucleotides (74.4% identity). For Enterococcus, the highest identity was found between E. avium and E. raffinosus, with 98.6% sequence identity (62 nucleotides of difference). The highest nucleotide difference was found between E. cecorum and E. hirae and equaled 431 nucleotides (90.1% identity) (Additional file 1: Tables S9 and S10). We also determined the lengths of the 16S rRNA gene, intergenic spacer region and 23S rRNA gene for all species used in this study (Additional file 1: Table S11).
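The link between a nucleotide difference count and a percent identity can be illustrated with a short Python sketch that compares two already aligned sequences and counts every mismatch and every gap position as a difference, in the spirit of the pairwise comparison described in the Methods. The function names and the toy sequences are hypothetical, and the sketch does not reproduce the exact algorithm of CLC Genomics Workbench.

```python
def count_differences(aligned_a: str, aligned_b: str) -> int:
    """Count mismatches and gap positions between two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A gap ('-') opposite a base counts as a difference, like a mismatch.
    return sum(1 for x, y in zip(aligned_a.upper(), aligned_b.upper()) if x != y)


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences agree."""
    length = len(aligned_a)
    return 100.0 * (length - count_differences(aligned_a, aligned_b)) / length


# Toy alignment (not real data): one deletion and one substitution.
seq_a = "ACGT-ACGTA"
seq_b = "ACGTTACCTA"
print(count_differences(seq_a, seq_b))  # 2
print(percent_identity(seq_a, seq_b))   # 80.0
```

Because the denominator is the alignment length, the same number of differences yields slightly different identity values for regions of different length, which is why both figures are reported side by side in the text.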

To show the relationships between species, phylogenetic trees were constructed. The pairwise overall topological scores computed by Compare2Trees based on Streptococcus 16S rRNA, rpoB, sodA, tuf, and 16S–23S rRNA sequences ranged from 61.7 to 72.4% (Additional file 1: Figure S1). For Enterococcus, the topological similarity between pairs of trees was more variable, with lowest and highest values of 56.4 and 80.6%, respectively. All targets showed that S. cremoris and the group of species S. adjacens, S. durans and S. saccharolyticus are distantly related to the other species. Among Enterococcus species, E. cecorum was distantly related to the other species. The phylogenetic tree of the 16S–23S rRNA region showed clustering similar to that of the dendrogram based on 16S rRNA gene sequencing, but was more discriminative, with unambiguous identification for all species (Additional file 1: Figure S1).

Criteria for assigning Streptococcus and Enterococcus at the species level

We performed the BLAST analysis based on alignment of the 16S–23S rRNA sequences obtained during the current study with those deposited in GenBank (Table 3) using the criteria proposed by Sabat et al. [24]. For the assignment at the species level, we used an identity score > 99% and a difference from the next closest species of ≥0.2%, which corresponds to a difference of at least 9 nucleotides when sequencing the 16S–23S rRNA region. In comparison to sequences already deposited in GenBank, for the great majority of species (Streptococcus, n = 39, Enterococcus, n = 8) those criteria allowed proper identification by the NGS-based approach, except for S. australis, for which the first identification score was 97.4%. For the next 4 species (Streptococcus, n = 3, Enterococcus, n = 1), the first criterion of > 99% identity was fulfilled but the differences from the next closest species ranged from 2 to 7 nucleotides, so the species could not be unambiguously assigned.

Table 3 The Streptococcus and Enterococcus species alignment of 16S–23S rRNA region to GenBank

Intraspecies nucleotide sequence variation of the 16S–23S rRNA region

To show the variability of the 16S–23S rRNA region, the nucleotide sequence variation within Streptococcus and Enterococcus species was determined (Additional file 1: Table S12). The analysis was performed for those species for which at least one nucleotide sequence of the 16S–23S rRNA region could be found in the GenBank database. For almost all species, the length of the 16S–23S rRNA region was the same within a species when the sequences obtained in this study and those deposited in GenBank were compared. The length of the 16S–23S rRNA region differed within the same species only in the case of S. acidominimus and S. equinus. The nucleotide variation within Streptococcus species ranged from 0.07 to 2.74%, with the exception of S. pneumoniae, for which the intraspecies nucleotide variation was 11.65%. For Enterococcus species, the nucleotide variation ranged from 0.02 to 2.67%.

Comparison of identification potential of NGS of the 16S–23S rRNA region to the methods based on Sanger sequencing

For Streptococcus species, NGS of the 16S–23S rRNA region and Sanger sequencing of the tuf and rpoB genes had the highest identification potential, allowing unambiguous identification of 93% of the analyzed species (Table 4). For Enterococcus species, sequencing of the sodA, tuf and rpoB genes allowed identification of all species, while the NGS-based method failed to identify only one enterococcal species (Table 5). For both genera, 16S rRNA gene Sanger sequencing had the lowest identification potential of all the methods used.

Table 4 Summary of the species identification, nucleotide differences range and amount of available reference sequences based on 16S rRNA, sodA, tuf and rpoB genes and 16S–23S rRNA region for Streptococcus genus
Table 5 Summary of the species identification, nucleotide differences range and amount of available reference sequences based on 16S rRNA, sodA, tuf and rpoB genes and 16S–23S rRNA region for Enterococcus genus

The identification potential of 16S rRNA, 23S rRNA genes, intergenic spacer region and 16S–23S rRNA region

We also determined the identification potential of each part of the 16S–23S rRNA region separately. Each fragment alone showed a lower identification potential for Streptococcus species than the whole region (Table 2). The rates of identification to the species level using sequences of the 16S rRNA gene, intergenic region, 23S rRNA gene and whole 16S–23S rRNA region were 64, 71, 86 and 93%, respectively. In the case of Enterococcus, the species identification potential of the intergenic spacer region was as good as that of the whole region (89%) and superior to that of the 16S rRNA and 23S rRNA genes (33 and 78%, respectively).
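The reported percentages can be translated back into numbers of species as a quick consistency check. The Python sketch below is illustrative only; it assumes that each rate is the number of unambiguously identified species divided by the 42 Streptococcus or 9 Enterococcus species studied, rounded to the nearest integer. The function name is hypothetical and the resulting counts are back-calculated, not taken from the tables.

```python
def counts_matching(rate_percent, n_species):
    """Species counts whose rounded identification rate equals rate_percent."""
    return [
        k for k in range(n_species + 1)
        if round(100 * k / n_species) == rate_percent
    ]


# Streptococcus (42 species): 16S rRNA gene, intergenic region,
# 23S rRNA gene and whole 16S-23S rRNA region.
for rate in (64, 71, 86, 93):
    print(rate, counts_matching(rate, 42))

# Enterococcus (9 species): 16S rRNA gene, 23S rRNA gene and
# intergenic spacer / whole region.
for rate in (33, 78, 89):
    print(rate, counts_matching(rate, 9))
```

Under this assumption, the Streptococcus rates correspond to 27, 30, 36 and 39 of 42 species, and the Enterococcus rates to 3, 7 and 8 of 9 species. The values of 39 and 8 for the whole region agree with the species numbers given in the section on assignment criteria.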

Discussion

Because of the clinical significance and challenging taxonomy changes of Streptococcus and Enterococcus species, an accurate identification at the species level is highly desirable to permit a more precise determination of host-pathogen relationships and to better understand pathogenic potential of various streptococcal and enterococcal species. Phenotypic identification of streptococcal and enterococcal species appears to be unsatisfactory, unreliable, and irreproducible [14, 16, 18]. This is a reason for applying genetic methods in standard microbiological diagnostics. If an unknown organism needs to be identified in a clinical sample, 16S rRNA gene sequencing is the method of choice because of the availability of universal primers [34]. The 16S rRNA gene sequencing is an excellent target for most streptococcal and enterococcal species but the differentiation between the species is difficult due to the insufficient heterogeneity within the 16S rRNA gene. Most of the reports show that the discriminatory power of 16S rRNA gene sequencing is very low for closely related Streptococcus and Enterococcus species [1, 20, 35,36,37]. Moreover, some authors claim that accuracy of identification of bacterial species with 16S rRNA gene sequencing is limited by the low quality of the sequences deposited in publicly available databases [38]. The other targeted sequencing methods do have a higher identification potential than 16S rRNA gene sequencing but are limited to only genetically related genera [20].

Within this study, we used a combination of four genetic targets (16S rRNA, sodA, tuf and rpoB) in order to unambiguously confirm the identification at the species level for all Streptococcus and Enterococcus strains tested. The analysis based on only one gene is not recommended because of possible gene duplication, lateral gene transfer or gene loss, which can distort the results [39]. The Compare2Trees data showed that the topology of phylogenetic trees obtained in this study was not very similar. These findings indicated that the genes, even highly conserved rRNA genes, are subject to recombination and that these events may render species identification challenging.

This study showed that NGS of the 16S–23S rRNA region was as discriminative as tuf and rpoB gene sequencing for Streptococcus species. In the case of Enterococcus, sequencing of the sodA, tuf and rpoB genes allowed identification of all species, while the NGS-based method failed to identify only E. casseliflavus. Moreover, NGS of the 16S–23S rRNA region showed the same clustering as the other methods. As NGS of the 16S–23S rRNA region uses universal primers, it is applicable to different, genetically unrelated bacterial genera [24].

The purpose of this study was not only to compare five sequence-based methods for the identification of streptococci and enterococci but primarily to develop streptococcal and enterococcal reference sequence datasets of the 16S–23S rRNA region. NGS of the 16S–23S rRNA region developed by Sabat and colleagues [24] provides the ability to detect microorganisms in samples from mixed polymicrobial colonization and infections that contain commensal microorganisms and the whole persistent microbiome. However, this method currently suffers from a lack of reference sequences in the GenBank database for many bacterial species. Before this study, 16S–23S rRNA sequences were available for 27 clinically relevant Streptococcus and 6 Enterococcus species. Our investigations yielded and deposited 16S–23S rRNA sequences for a further 15 streptococcal and 3 enterococcal species, making identification of Streptococcus and Enterococcus species feasible. Moreover, we determined that for phylogenetically related species, such as those of the mitis group, analysis of the intergenic spacer region alone is not sufficient to precisely identify Streptococcus strains at the species level.

In order to identify strains at the species level, the reference sequence with the highest identity score needs to be found. For several Streptococcus and Enterococcus species, only one or a few reference 16S–23S rRNA sequences can be found during BLAST searches in the GenBank database. In such cases, it is possible that the sequence obtained during a study belongs to a different evolutionary cluster within a species than the reference and that the nucleotide differences between them are high (more than 1%). It is then not possible to assign the bacterial species with an identity score of 99% or higher. During the current study, such an instance was found only for S. australis. If more reference sequences representing evolutionarily diverse lineages are deposited in the genetic sequence databases, species will always be assigned with an identity score above 99%.

NGS of the 16S–23S rRNA region proved to be an excellent tool for identification at the species level for the great majority of Streptococcus and Enterococcus strains. However, there were some problematic cases, especially in the bovis and mitis groups, as these groups have undergone several reclassifications. S. infantarius was alternately classified as S. lutetiensis or S. infantarius, and was finally described as the latter [40]. Moreover, this species is part of the S. bovis/S. equinus complex and is therefore challenging to identify properly [41, 42]. In our study, in the case of S. infantarius, the next closest species was S. equinus, with an alignment to only one, unpublished genome assembly. A similar situation occurred for S. tigurinus, which was at first a subspecies, then a separate species, and in 2016 was again proposed to be classified as S. oralis subsp. tigurinus [43, 44]. As shown in the results, our sequence was aligned to S. oralis subsp. tigurinus (4223/4261) and the next closest species was the sequence of an unpublished S. oralis (4221/4261). For both S. mitis and S. pseudopneumoniae, the next best alignment was to S. pneumoniae. As the problems in accurate identification of the mitis group have been described [45, 46], we believe that an increase in the number of deposited sequences for S. mitis and S. pseudopneumoniae will allow for an unequivocal identification. It is very important to develop a well-curated database in which deposited sequences are verified for proper organism identification. For now, sequences that are not published should not be considered as reference sequences. No previous single study has provided one dataset of reference sequences for the genes commonly used for streptococci and enterococci identification, so usually those sequences cannot be compared. In this study, we have not only deposited such a dataset for 4 commonly used identification targets but also added a package of sequences for a new identification tool with a high identification potential.

As NGS-based techniques allow culture-free detection of a theoretically unlimited number of pathogens, it is necessary to identify the species precisely. For opportunistic pathogens and those not dominating in a sample, accurate identification is what allows the etiological factor of the infection to be determined correctly. Since benchtop sequencers were introduced, NGS has been likely to become a diagnostic tool in microbiological laboratories [47]. NGS of the 16S–23S rRNA region was developed to fill the gap between the conventional methods (culture and PCR) and metagenomics but, as highlighted by Sabat et al., still suffers from the lack of reference sequences for many bacterial species. The development of the Streptococcus and Enterococcus 16S–23S rRNA sequence dataset is a first step towards overcoming this limitation. We are currently working on the development of datasets for further clinically relevant genera.

As a tool for microbial identification, PCR-based methods are superior to NGS-based methods in cost and speed. However, when unknown bacteria need to be identified, it is challenging to choose the appropriate method, as targets such as sodA, tuf or rpoB are genus-specific. The reagent and consumable costs for PCR-based methods combined with Sanger sequencing amount to ~ 10 € per sample, with a turnaround time of 2 days. The costs may be higher if the first choice of method is not correct, and those methods can be applied only to pure cultures. Year by year, NGS techniques become cheaper; currently, the total cost of all reagents and consumables for NGS of the 16S–23S rRNA region amounts to ~ 150 € per sample, with a turnaround time of 6–8 days. With the NGS-based approach, the whole species content can be detected within one sequencer run, so no other methods need to be applied.

The rapid development of DNA sequencing techniques has allowed substantial improvement of the culture-independent identification of microbial pathogens. In addition, advances in DNA sequencing techniques have allowed simultaneous investigation of millions of DNA fragments, enabling rapid identification of all the microorganisms present in a given clinical sample. NGS-based techniques, especially NGS of the 16S rRNA gene, have been successfully applied to the comprehensive analysis of microbiomes not only from healthy people, but also from those associated with many diseases [48,49,50]. As sensitive NGS-based techniques enable accurate detection of the microbiome composition, they could lead to a better understanding of the species content that might modulate growth, virulence, biofilm formation, quorum sensing, and antibiotic resistance [51]. In any case, identification of microbiome constituents at the species or genus level is microbiologically not detailed enough. This is also because microbes are transmitted between hosts and differ in virulence, fitness factors (e.g. tenacity), transmission power, and biological and epidemiological behavior.

Conclusions

In conclusion, our study demonstrated the high reliability of NGS of the 16S–23S rRNA region for the identification of streptococci and enterococci at the species level. The method based on NGS of the 16S–23S rRNA region had one of the highest identification potentials among all the methods used. We have developed a reference dataset of the 16S–23S rRNA region for 42 streptococcal and 9 enterococcal species; therefore, many clinically relevant streptococcal and enterococcal species can now be detected in a clinical sample. All diagnostic laboratories that have access to next generation sequencing will be able to introduce a highly precise, rapid and reliable method for the identification of microorganisms, and the obtained results will facilitate an unambiguous identification of many clinically significant streptococci and enterococci in all samples.