Introduction

Taxonomy plays a prominent role in understanding biology, evolution, and the interplay between environmental conditions and biological populations. Historically, the classification of life forms was based on shared characteristics such as anatomy, developmental processes, and behaviours (phenotype) while today genotypes form the basis for a taxonomic framework due to their uniqueness and the ever-increasing deposit of genetic information (Hugenholtz et al. 2021). Prokaryotes are single cell organisms that have dominated most of life’s history on Earth and have survived many different environmental conditions, endowing extremely diverse genotypes (Nealson and Conrad 1999; Preiner et al. 2020). Among these prokaryotes, the phylum Actinomycetota, of which Streptomyces species represent the largest genus, has proven a treasure trove for natural product discovery and the pharmaceutical industry. Specifically, Streptomyces species are responsible for the production of around 100,000 antibiotic compounds, accounting for 70–80% of all natural bioactive products with applications in the pharmaceutical and agrochemical industry (Alam et al. 2022). However, their secondary metabolites are not limited to antibiotic properties but also possess a huge potential as herbicides, antifungal agents, and antiviral and anticancer drugs (Alam et al. 2022; Del Carratore et al. 2022). Due to the perpetual accumulation of new genomic information and the application potential of Streptomyces secondary metabolites (Lacey and Rutledge 2022), proper classification is of paramount importance.

Conventionally, 16S rRNA gene analysis was the gold standard for bacterial and thus Streptomyces phylogenetics, species delineation, and the identification of new species (Komaki 2022; Hassler et al. 2022). However, it suffers from poor taxonomic resolution at the species level (O’Connell et al. 2022) and is affected by intragenomic heterogeneity, horizontal gene transfer (HGT), and recombination (Hassler et al. 2022). There has been an extensive discussion on the 16S rRNA threshold value for species discrimination, a parameter subject to regular updates. Currently, the threshold used in the literature ranges between 98.2 and 99% (Meier-Kolthoff et al. 2013). However, even when the similarity between two strains exceeds the threshold of 99%, it remains uncertain whether they belong to the same species. For example, some distinct species have identical 16S rRNA gene sequences, which complicates taxonomic assignments (Tamura et al. 2008). As databases hinged almost exclusively on 16S rRNA gene analysis, the reliability of previous classifications reported in the literature could be questioned. This has led to the development of more representative genotyping methods such as multilocus sequence analysis (MLSA), which comprises the analysis of several housekeeping genes and thus significantly increases species resolution in comparison to 16S rRNA analysis. While MLSA has led to reclassifications (Rong et al. 2009), it encountered inconsistencies due to the lack of a common MLSA scheme that would permit overall comparability, and it could not account for HGT biases (Glaeser and Kämpfer 2015).

Decreasing DNA sequencing costs and increasing computational power have in recent years allowed whole genome sequence analysis. This analysis is typically accomplished through the evaluation of the overall genome-related index (OGRI), with the most commonly used parameters being the Average Nucleotide Identity (ANI) and digital DNA-DNA Hybridization (dDDH) (Chun et al. 2018). The latter offers great convenience due to its clear and objective threshold of 70% as species boundary while species delineation thresholds for ANI are species dependent (Tindall et al. 2010). By including as much genetic information as possible in phylogenetic analyses, a higher resolution is obtained, consequently leading to further reclassification of Streptomyces species. Due to their increased power, these whole genome sequence methods are significantly impacting taxonomy today (Zuo et al. 2018). The lack of uniformity in phylogenetic analysis for classification of Streptomyces strains, with for example 16S rRNA analysis or MLSA still occasionally happening without whole genome sequence analysis (Sáez-Nieto et al. 2021; Kurnijasanti and Sudjarwo 2022; Salehghamari et al. 2023; Wang et al. 2023a), complicates taxonomy and introduces uncertainties in public repositories today. Notably, culture collections such as the ATCC heavily depend on the classification accuracy of the associated publications, making misclassifications impactful on studies focused on culture collection data. To illustrate, only 22 Streptomyces species are reported in the ATCC Genome Portal, while there are 3311 Streptomyces species available in their culture collection. In combination with the vastly expanding available genomic information, this might prove to be a recipe for disaster as current misclassifications will be incorporated in related studies and new phylogenetic analyses. 
The issue of misclassification of Streptomyces strains within current databases is substantiated by the review of Komaki (2023) as over the past decade, approximately 34 Streptomyces species have undergone reclassification and were transferred to other genera (Komaki 2023). Additionally, 14 subspecies were reclassified to only four subspecies. Furthermore, a total of 63 species were reclassified as later heterotypic synonyms. Also, many unclassified Streptomyces genomes have been deposited in the NCBI database. To date, little effort has been made to investigate whether they are new species. This leads to loss of information that could be beneficial to for example natural product discovery, but also complicates future classification. This paper therefore assesses the current taxonomy of known Streptomyces strains as well as attempts to classify unclassified Streptomyces species.

Material and methods

Dataset collection

On June 26, 2023, an Excel file was created using a custom-made Python script that retrieved Streptomyces metadata from both NCBI Genome (now NCBI Datasets) and LPSN (List of Prokaryotic names with Standing in Nomenclature) databases (Supplementary Table S1). The script used a comprehensive list of all species obtained from the LPSN database and then selectively filtered out the Streptomyces species. Additional information retrieved from the LPSN database, such as LPSN status and LPSN address, is documented in columns K to O within Supplementary Table S1. Following this, we examined the availability of genomes for these species on NCBI. Atypical genomes, which include those identified as contaminated, and genomes of subspecies were excluded (based on NCBI filters). Often, multiple genome assemblies were available from NCBI per species. To determine the best assembly for our taxonomical analysis, the initial step involved verifying the presence of an NCBI reference genome. In case no reference genome was present, the available genome assemblies were ranked based on their assembly level (i.e., degree of completeness, categorized as complete genomes, chromosomes, scaffolds, or contigs). If several options for the optimal assembly remained available, preference was given to the most recent. If RefSeq assemblies were available, we selected these over GenBank assemblies. Additional genome and assembly information, including accession number and sequence length, is available in columns C to J within Supplementary Table S1. In total, 604 Streptomyces species and their metadata were collected. Out of these, four were manually not retained: three unclassified Streptomyces species catalogued as classified and one Kitasatospora species, bringing the total number of Streptomyces species for data analysis to 600 (Supplementary Table S2). 
For these species, whole genome sequence FASTA files, coding sequence (CDS) FASTA files with annotated genes, and RNA FASTA files were downloaded from the NCBI genome FTP site using their assembly ID with a custom-made Python script.
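The assembly selection rules described above can be illustrated with a minimal Python sketch. It is not the script used in this study: the record keys (accession, is_reference, level, release_date) and the example accessions are hypothetical, and placing the RefSeq preference as the final tie-breaker is an assumption, since the text does not state its position in the ranking.

```python
from datetime import date

# Higher rank = more complete assembly (ordering taken from the text).
LEVEL_RANK = {"Contig": 0, "Scaffold": 1, "Chromosome": 2, "Complete Genome": 3}


def select_best_assembly(assemblies):
    """Return the preferred assembly record for one species.

    Each record is a dict with hypothetical keys:
    accession (str), is_reference (bool), level (str), release_date (date).
    """
    def key(a):
        return (
            a["is_reference"],                  # 1. NCBI reference genome first
            LEVEL_RANK[a["level"]],             # 2. then highest assembly level
            a["release_date"],                  # 3. then most recent release
            a["accession"].startswith("GCF_"),  # 4. RefSeq over GenBank (assumed order)
        )
    return max(assemblies, key=key)


# Made-up candidate records for a single species.
candidates = [
    {"accession": "GCA_000000001.1", "is_reference": False,
     "level": "Scaffold", "release_date": date(2021, 5, 1)},
    {"accession": "GCA_000000002.1", "is_reference": False,
     "level": "Complete Genome", "release_date": date(2019, 3, 10)},
    {"accession": "GCF_000000002.1", "is_reference": False,
     "level": "Complete Genome", "release_date": date(2019, 3, 10)},
]
best = select_best_assembly(candidates)
print(best["accession"])
```

Encoding the rules as a single sort key keeps the priority order explicit and makes it easy to reorder the criteria if a different ranking is preferred.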

Taxonomic analysis of characterized Streptomyces species

For 16S rRNA analysis, a Python script was used to extract the 16S rRNA sequences from the previously acquired RNA FASTA files. RNA FASTA files were unavailable for six genomes (GCA_003719395.1, GCA_016464675.1, GCA_000715685.1, GCA_018114805.1, GCA_000715605.1, GCA_000715635.1) (Supplementary Table S2). Only sequences with a completeness of at least 95% were considered, with a length ranging between 1451 and 1604 nucleotides. Partial sequences with low completeness scores lack the necessary resolving power and can lead to incorrect identification results; they were therefore excluded, reducing the 16S rRNA dataset by 111 species to 483 (Supplementary Table S2). Furthermore, in cases where multiple copies of the 16S rRNA sequence were available, the longest 16S rRNA sequence was selected. ClustalO (Sievers et al. 2011) with default settings and the --percent-id parameter was used to generate a multiple sequence alignment (MSA) for computing the identity matrix among the retained 16S rRNA sequences (Supplementary Table S3). Species with an identity exceeding 99.93% were considered to be part of the same species (Komaki 2022).
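As an illustration of the filtering step, the following Python sketch keeps only 16S rRNA copies within the stated length window and returns the longest one per genome. The function name and the toy sequences are hypothetical; the extraction from FASTA headers is not shown.

```python
MIN_LEN, MAX_LEN = 1451, 1604  # length window stated in the text


def pick_16s(copies):
    """Return the longest 16S rRNA copy within the length window, or None."""
    valid = [seq for seq in copies if MIN_LEN <= len(seq) <= MAX_LEN]
    return max(valid, key=len) if valid else None


# Toy sequences standing in for the 16S copies of one genome.
copies = ["A" * 900, "A" * 1500, "A" * 1526]
chosen = pick_16s(copies)
print(len(chosen) if chosen else "no usable 16S rRNA sequence")
```

A genome for which the function returns None would be dropped from the 16S rRNA dataset.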

MLSA was conducted using five housekeeping genes specifically proposed for the demarcation of Streptomyces species: gyrB, atpD, recA, trpB, and rpoB (Komaki 2022). These gene sequences were extracted from the coding sequence (CDS) FASTA files, based on the annotation headers. However, similar to the RNA FASTA files, the CDS FASTA files were not available in the same six cases (Supplementary Table S2). Only sequences that were not annotated as partial were considered. Among these genes, several were missing or only partially present: atpD in six species, gyrB in ten species, recA in 72 species, trpB in nine species, and rpoB in ten species (Supplementary Table S2). Initially, each housekeeping gene except atpD was individually assessed, and an MSA and identity matrix were constructed for each gene using ClustalO with the same settings as the 16S rRNA analysis (Supplementary Tables S4 to S7). The thresholds for species delineation were set at 99% (gyrB), 99.5% (recA), 99.6% (rpoB), and 97.9% (trpB) (Komaki 2022). Due to insufficient resolution, a species threshold could not be established for atpD. Subsequently, the sequences of the remaining genes were concatenated to form a single comprehensive sequence. Only the 502 species that possessed all five housekeeping genes were taken into consideration (Supplementary Table S2). Again, ClustalO was used, this time without the --percent-id parameter, to construct an MSA and subsequently a distance matrix (Supplementary Table S8). The threshold used for the MLSA was 0.007, where species with an evolutionary distance less than 0.007 belong to the same species (Komaki 2022).
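Note that the single-gene thresholds are expressed as percent identity (higher means more similar), whereas the MLSA threshold is an evolutionary distance (lower means more similar). The Python sketch below makes this difference explicit. The function names are hypothetical, and treating an identity strictly above the threshold as a same-species call is an assumption.

```python
# Percent-identity thresholds per gene and the MLSA distance cutoff, as given in the text.
GENE_THRESHOLDS = {"gyrB": 99.0, "recA": 99.5, "rpoB": 99.6, "trpB": 97.9}
MLSA_MAX_DISTANCE = 0.007


def same_species_by_gene(gene, identity_percent):
    """Single-gene call: identity must exceed the gene-specific threshold."""
    return identity_percent > GENE_THRESHOLDS[gene]


def same_species_by_mlsa(distance):
    """MLSA call: evolutionary distance must be below 0.007."""
    return distance < MLSA_MAX_DISTANCE


# Hypothetical values for one pair of species.
print(same_species_by_gene("trpB", 98.2))  # above the 97.9% trpB threshold
print(same_species_by_gene("rpoB", 98.2))  # below the 99.6% rpoB threshold
print(same_species_by_mlsa(0.005))
```

The same identity value can therefore lead to opposite calls depending on the gene, which is one source of the divergent single-gene results.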

ANI values were calculated on the whole genome FASTA sequences for the original dataset of 600 species, with all species being compared pairwise. FastANI v1.33 was the preferred ANI algorithm being 50–4608 × faster than Average Nucleotide Identity based on BLAST (ANIb), without compromising accuracy within the 80 to 100% ANI identity range (Jain et al. 2018). A matrix file was generated as output by FastANI and was subsequently imported into Excel (Supplementary Table S9). To delineate species, a threshold of 96.7% was used (Hu et al. 2022). Streptomyces caldifontis (GCA_016464675.1) exhibited values around the lower exclusion threshold of 80%, leading to multiple “Not Applicable” (NA) results in the ANI analysis.
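A minimal Python sketch of how such a matrix could be screened for species pairs at or above the threshold is given below. It assumes the matrix file is a tab-separated lower-triangular layout (first line: number of genomes; each following line: genome name followed by the ANI values against the preceding genomes, with NA for missing values); this layout should be checked against the FastANI version in use. The function names and the toy input are hypothetical.

```python
def read_ani_matrix(lines):
    """Parse a lower-triangular ANI matrix into {(name_i, name_j): ani or None}."""
    rows = [ln.strip().split("\t") for ln in lines if ln.strip()]
    n = int(rows[0][0])                 # first line holds the genome count
    body = rows[1:n + 1]
    names = [row[0] for row in body]
    pairs = {}
    for i, row in enumerate(body):
        for j, value in enumerate(row[1:]):
            # "NA" marks comparisons that produced no ANI value
            pairs[(names[i], names[j])] = None if value == "NA" else float(value)
    return pairs


def synonym_candidates(pairs, threshold=96.7):
    """Return genome pairs whose ANI is at or above the species threshold."""
    return sorted(p for p, ani in pairs.items() if ani is not None and ani >= threshold)


# Toy input with made-up genome names and values.
toy = ["3\n", "A.fna\n", "B.fna\t97.10\n", "C.fna\tNA\t84.67\n"]
print(synonym_candidates(read_ani_matrix(toy)))
```

Keeping NA values as None rather than zero avoids confusing "no ANI reported" with "low ANI".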

In total, 409 Streptomyces species for which FastANI, 16S rRNA, and MLSA results were available underwent pairwise analysis (Supplementary Table S2). Exceeded thresholds for each method were listed to indicate species belonging to the same taxonomic group (Supplementary Tables S10 to S17). While at first 16S rRNA was included, its resolution quickly proved insufficient (see “Results” and “Discussion”), thus warranting its exclusion from further analysis. Additionally, a digital DNA-DNA hybridization was conducted with the Genome-to-Genome Distance Calculator (Meier-Kolthoff et al. 2022). Because dDDH analysis is labor-intensive, it was only performed for species pairs exceeding the threshold for FastANI and/or MLSA (Supplementary Tables S18, S19).

Strain reclassification

When the FastANI values exceeded their threshold, the classification of the investigated strains was analyzed in detail and a reclassification based on the Bacterial Code was proposed. To decide between which name had to be taken when two species were identified as synonyms, the guidelines of the International Code of Nomenclature of Prokaryotes (the Bacterial Code) were used (Oren et al. 2023).

Classification of uncharacterized Streptomyces species

Uncharacterized Streptomyces strains were acquired using the NCBI Datasets command-line tool on October 19, 2023. Data from these strains were gathered at three assembly levels: chromosome, complete, and scaffold. For these levels, 27, 204, and 493 genomes were gathered, respectively (Supplementary Table S20). At each of these levels, a FastANI analysis was performed, exclusively comparing uncharacterized species with the 600 previously acquired ones. The threshold value remained at 96.7%. Uncharacterized Streptomyces species that surpassed this threshold, indicating a match with a known species, can be found in Supplementary Table S21. When no hit with a characterized species was found, the strain was submitted to the TYGS server for further identification attempts, encompassing known species beyond the Streptomyces genus (Supplementary Table S22) (Meier-Kolthoff et al. 2022).
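The decision rule for a single uncharacterized genome can be sketched in Python as follows. This is an illustration only: assigning the genome to the characterized species with the highest ANI at or above the threshold is an assumption, and the function name and ANI values are hypothetical.

```python
def classify(hits, threshold=96.7):
    """Return the best-matching characterized species, or None if no hit passes.

    hits maps a characterized species name to its ANI (%) against the query genome.
    A None result means the genome would be forwarded to TYGS.
    """
    above = {species: ani for species, ani in hits.items() if ani >= threshold}
    return max(above, key=above.get) if above else None


# Hypothetical ANI values for two query genomes.
print(classify({"S. mirabilis": 98.1, "S. albidoflavus": 85.3}))
print(classify({"S. mirabilis": 90.2, "S. albidoflavus": 85.3}))
```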

Results

Species delineation method analysis

To ensure correct classification of uncharacterized Streptomyces species, the most appropriate and convenient classification method was investigated. Hence, the Streptomyces genus was first analyzed by 16S rRNA, individual housekeeping genes, MLSA and FastANI analysis for the 409 Streptomyces species for which all necessary information was available (Supplementary Tables S3 to S9). As evidenced, several thresholds for these parameters were exceeded during pairwise comparison of species, suggesting that both strains are actually the same species. Such synonyms were uncovered in 60 instances for 16S rRNA analysis, 56 instances for gyrB, 78 instances for trpB, 32 instances for recA and 43 instances for rpoB housekeeping gene analysis, 38 instances for MLSA, and 44 instances for FastANI analysis (Supplementary Tables S10 to S17). The results revealed a substantial discrepancy between 16S rRNA, MLSA, and FastANI. In 24 out of 60 cases, the threshold was exclusively exceeded for 16S rRNA and not by any other method. This suggested that the 16S rRNA gene did not provide sufficient resolution. As a result, it was excluded from further analysis in this study. However, this outcome was expected, given that the 16S rRNA gene is composed of approximately 1526 nucleotides and a single nucleotide variation would still result in a 99.9345% match (Fig. 1). This value surpasses the threshold of species delineation (99.93%) and has been observed in multiple occasions, as were 100% matches. Similar to the 16S rRNA gene, individual housekeeping genes make it challenging to delineate species as they represent only a snapshot of the genomes.
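The arithmetic behind this observation can be checked with a short Python sketch, assuming a gap-free alignment over the approximate gene length of 1526 nucleotides given above. The function name is hypothetical.

```python
GENE_LENGTH = 1526   # approximate 16S rRNA gene length from the text
THRESHOLD = 99.93    # 16S rRNA species threshold (%) from the text


def identity_percent(mismatches, length=GENE_LENGTH):
    """Percent identity of two aligned sequences with the given number of mismatches."""
    return 100.0 * (length - mismatches) / length


for mismatches in range(3):
    value = identity_percent(mismatches)
    print(mismatches, round(value, 4), value > THRESHOLD)
```

Under these assumptions, only identical sequences or sequences differing by a single nucleotide exceed the threshold; two mismatches already give about 99.87%, which falls below it.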

Fig. 1

Difference in 16S rRNA sequences between Streptomyces globisporus (GCF_003147545.1) and Streptomyces flavovirens (GCF_000720395.1). ClustalO alignment of both sequences identified a single mismatch at position 200 of 1526, representing a 99.9345% similarity which was above the threshold for species demarcation. Conversely, the FastANI value for these species is 84.67%, considerably below the threshold. The figure also denotes the hypervariable regions of the 16S rRNA gene

Due to the inconsistency in results obtained with single gene taxonomy parameters like 16S rRNA and single housekeeping genes, a more in-depth analysis of MLSA and FastANI results was done to pinpoint the most appropriate method. In total, scores for both approaches were calculated for 502 species. For pairwise comparisons where the threshold was exceeded by at least one of the approaches, an additional dDDH analysis was conducted (Supplementary Table S18). Between MLSA and FastANI, only 20 differences in exceeding the threshold were observed (Supplementary Table S19). The dDDH analysis yielded 14 differences with MLSA and 6 differences with FastANI, suggesting that FastANI is the best method to use. This result was unsurprising as both FastANI and dDDH utilize the entire genome for comparison whereas MLSA relies solely on a concatenated sequence of five single housekeeping genes. Differences between FastANI and dDDH are given in Table 1.
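The comparison against dDDH amounts to counting, per method, the species pairs for which the same-species call differs from the dDDH call. A minimal Python sketch is shown below; the function names and input values are hypothetical, and treating values at the FastANI and dDDH thresholds as same-species calls is an assumption.

```python
ANI_MIN, DDDH_MIN, MLSA_MAX = 96.7, 70.0, 0.007  # thresholds from the text


def calls(ani, mlsa_distance, dddh):
    """Return a same-species call (True/False) per method for one genome pair."""
    return {
        "FastANI": ani >= ANI_MIN,
        "MLSA": mlsa_distance < MLSA_MAX,
        "dDDH": dddh >= DDDH_MIN,
    }


def disagreements(pairs, method, reference="dDDH"):
    """Count pairs where `method` and `reference` give different calls."""
    return sum(1 for p in pairs if p[method] != p[reference])


# Hypothetical (ANI %, MLSA distance, dDDH %) values for three genome pairs.
pairs = [calls(97.5, 0.003, 80.0), calls(96.8, 0.004, 69.0), calls(95.0, 0.005, 60.0)]
print(disagreements(pairs, "FastANI"), disagreements(pairs, "MLSA"))
```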

Table 1 Discrepancies between the FastANI and dDDH pairwise alignment of a subset of Streptomyces strains. Values exceeding the parameter threshold are highlighted in bold and point to edge cases close to the threshold. Organisms are considered to belong to the same species above thresholds of 96.7% for FastANI and 70% for dDDH

Taxonomic analysis of previously characterized Streptomyces species

As illustrated above, FastANI has been validated as the preferred method for demarcation on species level regarding the Streptomyces genus. Its ability to compare whole genomes in a high-throughput manner allowed us to investigate Streptomyces species currently delineated in public repositories like LPSN and NCBI datasets prior to classifying unknown species in these databases. With a threshold of 96.7% for Streptomyces species (Hu et al. 2022), 59 out of 600 analyzed species did not seem to be unique (Fig. 2). This adjustment refines our dataset of 600 Streptomyces species back to 541 unique species. An overview of the pairwise alignments exceeding the species delineation threshold, suggesting these species are actually synonyms, is given in Table 2, together with the proposed reclassification based on the guidelines of the International Code of Nomenclature of Prokaryotes (the Bacterial Code) (Oren et al. 2023). While several of these findings have been previously reported by various authors, supporting our results, many new synonyms have been uncovered in our study. We thus propose to amend these errors in public repositories by classifying these species as synonyms to stop the taxonomy sprawl of Streptomyces species, leading to further misclassification and complexities.

Fig. 2

Overview of the Streptomyces species investigated in this study. The genomes of 600 Streptomyces species on public repositories (characterized) appear to be not as unique as advocated, with almost 10% synonyms and 541 unique strains. Of the 724 unclassified Streptomyces species (uncharacterized), 289 could be classified with FastANI, while TYGS proposed classification of 35 strains and the uniqueness of 435 new species

Table 2 Suggested reclassification of Streptomyces strains obtained from NCBI Datasets. FastANI values are given for pairwise alignments exceeding the species demarcation threshold of 96.7%, suggesting these to be the same species. The suggested species name is highlighted in gray. For example, S. sampsonii and S. wadayamensis should be reclassified as S. albidoflavus according to our analysis. In case there was no reference genome used at the time of analysis, the species name is followed by *. Strains marked with † may yield less reliable results; please refer to the discussion section for further details. The taxonomic status indicates whether the suggested name has currently been approached according to the Bacterial Code, or whether the name has been validated. Additionally, references to the literature for the proposed reclassification are given

Classification of uncharacterized Streptomyces species

As FastANI appeared to be the most reliable and appropriate method for Streptomyces species demarcation, this method was exclusively used for the classification of uncharacterized Streptomyces species. At the chromosome level, nine out of the 27 species were assigned to one of the 600 characterized Streptomyces species in our previously acquired dataset, surpassing the 96.7% threshold value (Supplementary Table S21). Among these nine species, six were identified to match with S. mirabilis, while the three remaining uncharacterized Streptomyces species (GCF_900104935.1, GCF_003352805.1, GCF_001040905.1) exhibited matches with three distinct species at chromosome level. At the complete level, we successfully classified 99 out of the 204 species with in total 62 already characterized Streptomyces species. Notably, several uncharacterized strains were found to match with the same known Streptomyces species, for instance, seven were identified as S. albidoflavus. At the scaffold level, 181 out of 493 were matched with 77 characterized Streptomyces species, implicating that again several uncharacterized species were assigned to the same known Streptomyces species. Hence, in total, 289 uncharacterized species were classified as 106 already known species. Regarding the remaining species that could not be classified solely based on the Streptomyces genomes available in our dataset, we attempted to identify the species with the aid of the TYGS server (Supplementary Table S22). We were able to classify an additional 35 species, with 31 belonging to the Streptomyces genus and 4 belonging to other genera: Microbacterium resistens, Kitasatospora cheerisanensis, Embleya scabrispora, and Cytobacillus praedii (Fig. 2). The remaining ones were indicated as “potential new species” by TYGS.

Discussion

Method discussion

Due to missing RNA and CDS FASTA files, housekeeping genes, and insufficient completeness of 16S rRNA, taxonomic assessment based on 16S rRNA, MLSA, and FastANI was only performed for 409 out of 600 Streptomyces genomes available on NCBI Datasets (Supplementary Table S2). Based on this study, methods relying on single gene taxonomy parameters such as 16S rRNA and housekeeping genes did not provide sufficient resolution and showed divergent results. To conduct 16S rRNA taxonomic analysis, it is essential to have the RNA FASTA files accessible on NCBI Datasets with a satisfactory level of completeness (> 95%) to avoid insufficient resolving power (Kim et al. 2012b). These conditions were not met for 117 out of the 600 available Streptomyces genomes because the 16S rRNA sequence was either absent or incomplete, i.e., falling short of the required length (< 95%) or exceeding it (> 105%). The repetitive nature of specific regions within the 16S rRNA gene poses a challenge for short-read sequencing technologies in achieving robust assemblies, often leading to incomplete 16S rRNA gene sequences. In that regard, long-read methods hold a distinct advantage over short-read methods. These technologies can capture the entire length of the 16S rRNA gene, in contrast to being limited to hypervariable regions alone. For example, in Fig. 1, the only mismatch between S. globisporus and S. flavovirens was located in the hypervariable V2 region. While the majority of 16S rRNA phylogenetic studies rely on the V3–V4 region (Klindworth et al. 2013), the V3–V5 has proven to perform poorly for the phylum Actinobacteria (Meier-Kolthoff et al. 2013) which can lead to biased results. Notably, a rigorous threshold of 99.93% was applied in the assessment of the 16S rRNA analysis, surpassing the threshold commonly found in literature, which typically ranges between 98.2 and 99% (Meier-Kolthoff et al. 2013). 
Despite this high threshold value, the threshold was exceeded for 60 out of 409 instances of pairwise comparisons in 16S rRNA whereas for FastANI this was for 44 instances (Supplementary Tables S10 to S12). The observation that in 40% (24/60) of the cases, only the threshold for 16S rRNA was surpassed and not for MLSA or whole genome–based methods highlights the low resolution of 16S rRNA and raises questions about the ongoing use of 16S rRNA. Therefore, we excluded 16S rRNA taxonomic analysis from further analysis.

For the single housekeeping gene analysis and MLSA, a key issue was the lack of these genes in the Streptomyces genomes or their partial availability. Remarkably, the recA gene was in 72 out of the 600 cases only partially present or even lacking (Supplementary Table S2). This could be due to incomplete assembly or annotation mistakes, as recA is a core gene encoding recombinase A which has been described as indispensable for viability (Muth et al. 1997). In total, 502 Streptomyces genomes contained all five housekeeping genes (Supplementary Table S2). The gene atpD was not included in our analysis, as recent literature proved its insufficient resolution for species delineation (Komaki 2022). In contrast, whole genome–based methods are more robust towards the abovementioned issues as for example they are not influenced by annotation mistakes; however, they can also suffer from poor assembly. In 38 out of 409 cases, the taxonomic threshold was exceeded for MLSA (Supplementary Table S17), while this was 44 times with FastANI analysis (Supplementary Table S11). As these numbers were close to each other, we performed for these cases a dDDH analysis to emphasize a more pronounced distinction between gene-based (MLSA) and whole genome–based methods (FastANI and dDDH). As a result, dDDH analysis showed 14 differences in exceeding the threshold with MLSA whereas there were only six differences between dDDH and FastANI results (Supplementary Table S19). Hence, this indicated consistency regarding the whole genome–based methods. Table 1 shows that four of these six discrepancies include Streptomyces albiflaviniger (GCA_016103485.1) while the other two are close to the threshold (69.8% for dDDH and 96.75% for FastANI). Notably, the genome of Streptomyces barkulensis (GCF_002843305.1) consisted of 119 scaffolds while that of S. albiflaviniger (GCA_016103485.1) consisted of 5265 contigs. Both had a relatively low coverage (23 × and 9.9 ×, respectively). 
While the quality of the genomes can have an effect on species delineation, these discrepancies can also be the result of the type of ANI tool used (i.e., ANIb, OrthoANI, ANIm, FastANI, and gANI). Palmer and colleagues have compared 7 ANI methods and noted slight differences in outcome (Palmer et al. 2020). Their conclusions also support our use of FastANI as it is the fastest and thus the most suitable approach to investigate large numbers of pairwise genome comparisons. However, as evidenced by the results in Table 1, for analyses near the inferred ANI threshold of 96.7% for Streptomyces sp., the ANI results should be carefully considered. Such findings should be examined more closely by for example comparing multiple ANI methods against dDDH and/or analyzing phenotypic and chemotaxonomic characteristics for additional insights. Also, species delineation thresholds for ANI are species dependent, and we feel confident that the threshold value used in our assessment is correct as it is often used and recognized for Streptomyces species (Hu et al. 2022; Wang et al. 2023b). Analyzing ANI values of over 90,000 bacterial and archaeal genomes has set a species threshold at ≥ 95% (Murray et al. 2021), meaning that strains with such a score are synonyms. However, in the case of Streptomyces strains in particular, an ANI score of  ≥ 96.7% has been suggested as a more stringent species demarcation (Hu et al. 2022). In literature, threshold values ranging from 95 to 96% are frequently employed for Streptomyces taxonomic analyses (Jain et al. 2018; Park et al. 2021; Nikolaidis et al. 2023; Song et al. 2023).

Despite these few differences between both methods, FastANI has proven to be a reliable and convenient method for the classification of Streptomyces species. In addition, FastANI is preferred over dDDH because the latter is labor-intensive: there is a limit on the number of simultaneous analyses, and it requires more computational power. In the past, taxonomic analysis relied solely on single gene taxonomy parameters because whole genome sequencing was very expensive; today, however, the cost of next- and third-generation sequencing is decreasing markedly. Therefore, taxonomic analysis based on whole genomes should become the gold standard in order to overcome the misclassification challenge (Komaki 2023).

Emended classification of previously characterized Streptomyces species

To provide clarity in the classification of currently uncharacterized Streptomyces species, we first investigated the existing taxonomy of this genus. Although we used a stringent threshold for species delineation with FastANI, this study uncovered multiple synonyms for the first time. With a FastANI threshold of 96.7%, 59 out of 600 analyzed Streptomyces species did not seem to be unique according to our analysis. Table 2 provides a summary of our suggestions for reclassification. We used reference genomes whenever available. However, certain species lacked a reference genome. In such cases, the genomes of these species may contain errors or contamination. For example, between downloading the data and writing this manuscript, we observed an alteration in NCBI Datasets: the S. coelicolor genome used in the analysis, GCF_013317105.1, is now flagged as contaminated. To ensure accurate conclusions, we reran FastANI comparing S. rubrogriseus and S. anthocyanicus with S. coelicolor GCF_000203835.1. The FastANI values remained unchanged, indicating both strains are later heterotypic synonyms of S. coelicolor. Furthermore, S. cellulosae GCF_026339705.1 and S. thermocarboxydus GCF_024760485.1 are now also marked as contaminated. Unfortunately, as there are no uncontaminated genomes available for S. thermocarboxydus, a reanalysis for this comparison was impossible. Additionally, S. diastaticus GCF_010548605.1 is now marked as “unverified source organism,” leaving only S. diastaticus subsp. diastaticus GCF_014648935.1 available as an uncontaminated genome. Reanalyzing the species demarcation with S. ardesiacus yielded a FastANI value of only 81.74%, suggesting indeed a potential discrepancy with the original genome of S. diastaticus GCF_010548605.1. In the meantime, S. griseostramineus GCF_014649635.1 and S. nashvillensis GCF_014650095.1 were reclassified on NCBI as S. griseomycini and S. tanashiensis, respectively, which is in line with our FastANI analysis. These changes happened over the course of several months, underscoring the importance of depositing high-quality data and curating public repositories.

As another remark, we expected that S. canarius and S. corchorusii would also be identified as synonyms as S. canarius and S. olivaceoviridis are synonyms as well as S. olivaceoviridis and S. corchorusii. However, this was not the case (FastANI value of 96.4%). Similarly, S. violaceus and S. violarus were expected to be synonyms but the FastANI value was 96.5%.
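This observation reflects the fact that a pairwise threshold is not a transitive relation: A and B, and B and C, can each pass the threshold while A and C do not. The Python sketch below detects such cases. The function name is hypothetical; the value of 96.4% is taken from the text, whereas the two values of 97.0% are placeholders chosen only to lie above the threshold and are not measured results.

```python
from itertools import combinations


def open_triangles(ani, threshold=96.7):
    """Return (a, b, c) where a-b and b-c pass the threshold but a-c does not."""
    def linked(x, y):
        return ani.get(frozenset((x, y)), 0.0) >= threshold

    species = sorted({s for pair in ani for s in pair})
    found = []
    for b in species:
        others = [s for s in species if s != b]
        for a, c in combinations(others, 2):
            if linked(a, b) and linked(b, c) and not linked(a, c):
                found.append((a, b, c))
    return found


ani = {
    frozenset(("S. canarius", "S. olivaceoviridis")): 97.0,    # placeholder value
    frozenset(("S. olivaceoviridis", "S. corchorusii")): 97.0,  # placeholder value
    frozenset(("S. canarius", "S. corchorusii")): 96.4,         # value from the text
}
print(open_triangles(ani))
```

Pairs flagged in this way are natural candidates for closer inspection with additional methods, as they sit near the species boundary.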

The number of newly discovered synonyms is quite surprising as many other studies on Streptomyces reclassification have been published in recent years. In general, they align well with this study, although some differences are present. Rong et al. (2009) investigated the taxonomy of the S. albidoflavus clade based on MLSA with the concatenated sequences of the atpD, gyrB, recA, rpoB, and trpB housekeeping genes and with DNA-DNA Hybridization (DDH), and proposed the reclassification of S. sampsonii and S. coelicolor to S. albidoflavus. However, we did not observe the same result for S. coelicolor in our FastANI approach. The identification of S. wadayamensis as a later synonym of S. albidoflavus was not reported to date and is unique to this study. In 2010, Rong and Huang proposed several other reclassifications based on MLSA with atpD, gyrB, recA, rpoB, and trpB as well as DDH analysis. They suggested that S. baarnensis is a synonym of S. fimicarius while our data suggest classifying S. baarnensis within the S. griseus clade (Rong and Huang 2010). Various reclassifications of S. baarnensis have been proposed since 2010, also resulting in the classification of this strain as S. griseus, which is in line with our findings. The lack of transparency on these alterations and publications associated with these changes is another example of the complexity of Streptomyces taxonomy. The authors also proposed to classify S. californicus as a synonym of S. puniceus while our FastANI analysis is the first to suggest that S. californicus, S. purpeochromogenes, and S. violaceoruber are all synonyms. This result might hold major implications as Komaki (2022) proposed that S. anthocyanicus is a synonym of S. violaceoruber while we did not come to the same conclusion. Additionally, Rong and Huang suggested reclassifying S. mediolani as a later synonym of S. albovinaceus while no association with S. globisporus had been made. 
Finally, their proposed reclassification of S. alboviridis and S. luridiscabiei as synonyms of S. microflavus is in line with our findings.

In 2012, Kim et al. (2012a) suggested that S. fimicarius belongs to the S. setonii clade based on 16S rRNA and gyrB analysis while S. setonii is now a heterotypic synonym of S. griseus. They also uncovered that S. albovinaceus is a later synonym of S. globisporus. In 2020, Madhaiyan and colleagues suggested that S. griseofuscus should be renamed to S. murinus based on genome-based analyses with dDDH, ANI, and Average Amino Acid Identity (AAI), which is in line with our results (Madhaiyan et al. 2020). The same year, Komaki and Tamura (2020) proposed to classify S. diastaticus subsp. ardesiacus as an independent species (S. ardesiacus sp. nov.) based on 16S rRNA and dDDH analysis; however, our FastANI results suggest that S. diastaticus and S. ardesiacus are synonyms. The next year, Komaki (2021) proposed that S. plicatus, S. geysiriensis, and S. vinaceusdrappus are later heterotypic synonyms of S. rochei, that S. variabilis is a later synonym of S. griseoincarnatus, and that S. libani subsp. libani is the same species as S. nigrescens based on MLSA and dDDH analysis. These findings are in line with our own results.

An extensive study by Li and colleagues in 2022 proposed various reclassifications based on a genome-based study with ANI and dDDH (Li et al. 2022). Most of these findings were in line with our own results, such as the proposed reclassification of S. canarius and S. olivaceoviridis as synonyms of S. corchorusii, S. durhamensis as a synonym of S. filipinensis, S. jietaisiensis as a synonym of S. griseoaurantiacus, S. janthinus and S. violaceus as synonyms of S. violarus, S. nashvillensis as a synonym of S. tanashiensis, S. rubiginosus as a synonym of S. pseudogriseolus, S. pluricolorescens as a synonym of S. rubiginosohelvolus, and S. melanosporofaciens as a synonym of S. antimycoticus. However, the observed similarity between S. niveoruber and S. griseoviridis diverged from our own conclusions. It was nevertheless noted that S. niveoruber is a synonym of S. daghestanicus, which was published one year prior to S. niveoruber, warranting the use of S. daghestanicus as the species name. Finally, Wang et al. (2023b) indicated that S. griseomycini and S. griseostramineus are the same species, with the latter being the later heterotypic synonym of the former, which is also in line with our findings.

As evidenced by these proposed reclassifications, it is high time to clean up the taxonomy of the Streptomyces genus. Several of these synonyms were published years ago and have not been amended in public repositories, leading to erroneous classification of new strains and overcomplicating taxonomy. Furthermore, as our assessment was focused exclusively on Streptomyces species with available genomes on NCBI, we anticipate a need for additional reclassifications. The LPSN database lists 869 child taxa under the genus Streptomyces, yet our FastANI analysis covered only a subset of 600 species whose genomes were publicly available.

Classification of uncharacterized Streptomyces species

Based on our findings, we used FastANI to classify publicly available uncharacterized Streptomyces species sourced from NCBI Datasets. Despite our stringent threshold, we were able to classify 289 of the 724 uncharacterized species (Supplementary Table S21; Fig. 2). Notably, many of these were assigned to the same species. For instance, 20 were classified as S. albidoflavus, 13 as S. parvus, 11 as S. mirabilis, and 10 as S. rochei. In total, the 289 classified genomes correspond to 106 characterized species. To further assess the unclassified species, we first used the TYGS server, which searches for synonyms within and beyond the Streptomyces genus (Supplementary Table S22). Out of the remaining 435 uncharacterized strains, 35 were associated with existing species (Fig. 2). Interestingly, four of these were not found to be Streptomyces species at all: M. resistens, K. cheerisanensis, E. scabrispora, and C. praedii. While K. cheerisanensis and E. scabrispora belong to the family Streptomycetaceae, M. resistens shares only the class Actinomycetia with Streptomyces. C. praedii, on the other hand, belongs to the phylum Bacillota, whereas Streptomyces species are classified under the phylum Actinomycetota. These misclassifications evidence the need for improved curation and classification of genomes deposited in public repositories.
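The assignment step described above can be sketched as a best-hit search over FastANI output, which is tab-separated with the query genome, reference genome, ANI value, number of mapped fragments, and total query fragments per line. This is a minimal illustration, not the pipeline used in this study: the cutoff `ASSUMED_THRESHOLD`, the file names, and all ANI values in `example` are hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed cutoff for illustration only; not the threshold used in this study.
ASSUMED_THRESHOLD = 96.7


def best_hits(fastani_tsv, threshold=ASSUMED_THRESHOLD):
    """Map each query genome to its highest-ANI reference at or above the cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(fastani_tsv), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query, reference, ani = row[0], row[1], float(row[2])
        if ani >= threshold and ani > best.get(query, (None, 0.0))[1]:
            best[query] = (reference, ani)
    return best


# Hypothetical FastANI output: uncharacterized genomes versus type-strain genomes.
example = "\n".join([
    "sp_A.fna\tS_albidoflavus.fna\t98.9\t2100\t2300",
    "sp_A.fna\tS_rochei.fna\t85.1\t1500\t2300",
    "sp_B.fna\tS_albidoflavus.fna\t97.4\t2000\t2250",
    "sp_C.fna\tS_parvus.fna\t95.9\t1900\t2400",
])

hits = best_hits(example)
# Number of uncharacterized genomes assigned to each characterized species.
counts = Counter(reference for reference, _ in hits.values())
# Genomes left unassigned, which would be passed on to a second tool such as TYGS.
unassigned = {"sp_A.fna", "sp_B.fna", "sp_C.fna"} - set(hits)
print(counts, unassigned)
```

Tallying the best hits per reference shows how several uncharacterized genomes can collapse onto one characterized species, while genomes without a hit above the cutoff remain for follow-up analysis.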

The discrepancy between TYGS and our FastANI results, which accounts for the remaining 31 species classified by TYGS, was due to the absence of certain genomes in our own dataset. S. koelreuteriae and S. chengmaiensis were added after our data retrieval from NCBI Datasets, and S. regensis was marked as contaminated. In other cases, the genome matched by TYGS was not found on NCBI Datasets (i.e., S. chrysomallus, S. cinnamonensis, S. sporocinereus, S. fluorescens), or a different strain isolate was used on TYGS (S. scabiei, S. griseus, S. anulatus, S. xinghaiensis). Comparing various isolates of these strains or slightly relaxing the stringent FastANI threshold could bridge the discrepancies between TYGS and FastANI. To further classify uncharacterized Streptomyces strains, a viable strategy might involve initially filtering out all unique species. Various whole-genome sequence-based techniques such as dDDH and AAI might help uncover new synonyms. While ANI was specifically developed to delineate species, AAI is more powerful in associating strains that are more distantly related. Once a more suitable genus has been identified for uncharacterized, unique strains, a FastANI analysis of that genus can be performed analogously to our study.

Finally, our findings emphasize the need for action, urging adjustments in the NCBI database and thorough reexamination of species within the Streptomyces genus. In the meantime, dubious entries should be flagged in public repositories and more stringent data deposition conditions should be enforced. Improving future classifications could be achieved by implementing more comprehensive curation in these databases, thereby avoiding the submission of incomplete data.