Background

Prokaryotes are generally assumed to be the oldest existing form of life on Earth and the primary engines of global biogeochemical processes; they are found in almost all ecosystems [1, 2]. Genome sequencing provides a blueprint for the evolutionary and functional diversity of prokaryotes and improves our understanding of how they interact with one another, their hosts, and their surroundings [3,4,5]. However, what proportion of the bacterial or archaeal cells or taxa on Earth has been genome-sequenced? This basic and seemingly simple question has never been answered.

Since the first bacterial genome was completely sequenced in 1995, more than 200,000 bacterial and archaeal complete or draft genomes have been uploaded to public databases as a result of the development of sequencing technology and the decrease in costs [6, 7]. Meanwhile, due to improvements in sequencing throughput and computational techniques, cultivation-independent recovery of genomes from metagenomes further promotes prokaryotic genome mining [8,9,10]. Interestingly, compared to the exponential accumulation of genomic data, the latest estimate of global prokaryotic operational taxonomic units (OTUs, 16S-V4 regions at 97% sequence identities) is only 0.8–1.6 million, far less than the trillions previously predicted [11, 12]. It is necessary to globally evaluate the proportion of sequenced prokaryotic genomes in environments.

The Earth Microbiome Project (EMP) was founded in 2010 to sample and explore the Earth’s microbial communities at an unprecedented scale [13,14,15]. In this study, we conducted a large-scale sequence alignment between the data released by the EMP and the sequenced bacterial and archaeal genomes in a public database. From these data, we evaluated, for the first time, the current state of prokaryotic genome sequencing across the Earth’s biomes.

Results

High genome-sequenced proportions in different prokaryotic biomes

A representative subset of 10,000 samples covering the different environment types was selected from the 27,751 samples of 97 independent studies released by the EMP [13]. Bcell and BOTU, which represent the genome-sequenced proportions of cells and taxa (at 100%, > 98.6%, or > 97% identities in the 16S-V4 region) in a specific prokaryotic biome, respectively, were evaluated based on the alignment between the 16S rRNA gene sequences of the EMP and the 155,810 RefSeq genome sequences. The results showed that the median Bcell (100%) in the 10,000 samples was 38.1%, and the lower and upper quartiles were 16.4% and 86.3%, respectively (Fig. 1a and Supplementary dataset 1). This finding indicates that the genome information of at least 38% of cells has been reported in more than half of the prokaryotic biomes. The median BOTU (100%) reached 18.8% (9.1–52.6%) (Fig. 1b). Generally, closely related strains with high similarities of 16S rRNA gene sequences (97% or 98.6%) also share high genome similarity [16,17,18]. The median Bcell (98.6%) was 50.1% (28.3–90.6%) whereas the median Bcell (97%) reached 60.4% (40.0–93.0%) across the 10,000 samples. Similarly, the median BOTU (98.6%) was 28.4% (16.4–64.3%), and the median BOTU (97%) increased to 37.6% (24.8–71.4%) (Supplementary Figs. S1, S2).

Fig. 1

Genome-sequenced degree of prokaryotic biomes. a Genome-sequenced proportion of cells. b Genome-sequenced proportion of taxa. OTUs share 100% identities with the sequenced genomes. Based on the analysis of 10,000 EMP samples, each gray point represents a single sample. For the box plots, the middle line indicates the median, the box represents the 25th–75th percentiles, and the error bar indicates the 10th–90th percentiles of observations. Environment types were classified by EMPO; red represents host associated and green represents free living

The genome-sequenced proportion in the prokaryotic biome was closely related to habitat (Fig. 1 and Supplementary dataset 2). Microbial environments are divided into different environment types by the EMP. The EMP ontology (EMPO level 1) classifies microbial environments as free living and host associated, with further subdivision into 17 environment types (EMPO level 3) [13]. We found that the genome-sequenced proportions in host-associated biomes were significantly higher than those in free-living biomes. For the host-associated prokaryotic biomes (5161 samples), the median Bcell (100%) was as high as 68.3% (16.9–95.2%), and the median BOTU (100%) was 40.7% (11.8–69.4%). However, for the free-living prokaryotic biomes (4839 samples), the median Bcell (100%) was only 29.1% (16.2–52.0%), and the median BOTU (100%) was 13.0% (7.7–24.8%). In detail, the median Bcell (100%) in plant corpus, animal corpus, and animal secretions exceeded 95%, and the median BOTU (100%) exceeded 66.7%. Comparatively, the median Bcell (100%) values for plant surface, sediment (non-saline), and hypersaline samples were all less than 10%, and the median BOTU (100%) values for sediment (non-saline) and sediment (saline) samples were less than 5% (Fig. 1). For closely related strains, Bcell and BOTU also showed similar variabilities among different habitats (Supplementary Figs. S1, S2). Despite significant differences, the genome-sequenced proportions were high in most of the prokaryotic biomes.

Furthermore, we found that the genome-sequenced proportion in the prokaryotic biome was significantly negatively correlated with its alpha diversity indices (Supplementary Fig. S3). For both cells and taxa, the prokaryotic biomes with low alpha diversity indices (observed OTUs, Shannon index, Chao1 index, and Faith’s PD value) tended to have a higher degree of genome sequencing. For example, the Pearson correlation coefficients of Bcell (100%) and BOTU (100%) with Shannon indices were −0.62 (p < 0.01) and −0.67 (p < 0.01), respectively.
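As an illustration of the correlation reported above, the following minimal sketch computes a Pearson coefficient between a per-sample diversity index and a per-sample genome-sequenced proportion. The function name and the toy values are hypothetical and are not taken from the study; the sketch does not compute a p value.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-sample values: Shannon index and Bcell (100%).
shannon = [1.0, 2.0, 3.0, 4.0]
b_cell_100 = [0.9, 0.7, 0.5, 0.3]

# The toy data are perfectly linear and decreasing, so r is close to -1.
r = pearson_r(shannon, b_cell_100)
```

A negative coefficient corresponds to the pattern described in the text: samples with higher alpha diversity tend to have lower genome-sequenced proportions.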

Low genome-sequenced proportions of global prokaryotic taxa

A total of 262,011 OTUs were obtained from 10,000 EMP samples through a meta-analysis. We defined the genome-sequenced proportion of all taxa (at 100%, > 98.6%, or > 97% identities in the 16S-V4 region) as POTU and found that the POTU (100%) of the 10,000 samples was only 2.1% (Supplementary dataset 3). The POTU (98.6%) and POTU (97%) values were 6.8% and 12.2%, respectively, and both were also much lower than the corresponding Bcell and BOTU medians. Furthermore, we found that 75.8% of OTUs were present in two or more biome samples. The POTU (100%) value was 0.6% for the OTUs that appeared in only a single sample (401 of 63,459 OTUs), 1.2% for those in 2 to 10 samples (1641 of 134,119 OTUs), 5.4% for those in more than 10 samples (3478 of 64,433 OTUs), 16.2% for those in more than 100 samples (1431 of 8810 OTUs), and 72.5% for those in more than 1000 samples (108 of 149 OTUs) (Fig. 2a). Notably, many prokaryotic taxa could exist in diverse environment types; approximately 21.7% of prokaryotic taxa could exist in two types of environments, and 20.2% of OTUs could exist in three or more types of environments. We found that the taxon genome-sequenced proportion also increased with its distribution extent in different environment types. The POTU (100%) was only 0.6% for prokaryotic OTUs that existed in only one type of environment (932 of 152,229 OTUs), 14.5% for OTUs in five or more types of environments (2645 of 18,230 OTUs), 43.6% for 10 or more types of environments (904 of 2074 OTUs), and 74.6% for 14 or more types of environments (287 of 385 OTUs) (Fig. 2b). A higher genome-sequenced proportion of prokaryotic cosmopolitan OTUs led to a lower POTU than the corresponding BOTU (Fig. 2c).
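The stratification by distribution extent described above can be sketched as follows. This is only an illustrative outline under stated assumptions: each sample is reduced to the set of OTU identifiers it contains, and the function names, bounds, and toy data are hypothetical rather than the authors' implementation.

```python
from collections import Counter
from typing import Iterable, Set


def occupancy(sample_otus: Iterable[Set[str]]) -> Counter:
    """Count, for every OTU, the number of samples in which it occurs."""
    counts: Counter = Counter()
    for otus in sample_otus:
        counts.update(otus)  # each OTU in a sample adds 1 to its occupancy
    return counts


def p_otu_for_occupancy(
    occ: Counter, sequenced: Set[str], low: int, high: float = float("inf")
) -> float:
    """POTU restricted to OTUs found in at least `low` and at most `high` samples."""
    selected = {otu for otu, n in occ.items() if low <= n <= high}
    return len(selected & sequenced) / len(selected) if selected else 0.0


# Hypothetical toy data: one widespread OTU and three single-sample OTUs.
samples = [{"core", "rare_1"}, {"core", "rare_2"}, {"core", "rare_3"}]
sequenced = {"core", "rare_1"}  # OTUs assumed to match a sequenced genome

occ = occupancy(samples)
single = p_otu_for_occupancy(occ, sequenced, low=1, high=1)  # 1 of 3 OTUs
widespread = p_otu_for_occupancy(occ, sequenced, low=2)      # 1 of 1 OTU
```

The same functions apply to environment types instead of samples if each set holds the OTUs observed in one environment type.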

Fig. 2

High genome-sequenced proportion of prokaryotic cosmopolitan taxa. a OTUs that can exist in one or more samples. b OTUs that can exist in one or more environment types. The gray column represents the proportion of OTUs that can exist in one or more samples (environments), and the red column represents the genome-sequenced proportion of OTUs. c Lower POTU than BOTU is caused by a high genome-sequenced proportion of cosmopolitan taxa

Because an OTU was likely to appear in multiple samples, we evaluated the effects of sample quantity on POTU by random sampling. Our results demonstrated that the POTU (100%) displayed an exponential decay trend (R² = 0.992) and eventually stabilized at 2.13% ± 0.03% as the number of samples increased (Fig. 3a). Similarly, the POTU (98.6%) and POTU (97%) values also decreased with increasing sample size and stabilized at approximately 6.8% and 12.2%, respectively (Supplementary Fig. S4). The estimated POTU values based on 10,000 EMP samples were close to the genome-sequenced proportions in all global prokaryotic taxa. We evaluated the changes in POTU as the number of sequenced genomes increased from 2010 to 2019. The results showed that the POTU (100%) increased exponentially (R² = 0.998) by sixfold over the decade. However, it was estimated that it would take at least 25 years for the POTU (100%) to reach 95%. With the increase in sequenced genomes, the POTU (100%) value showed an allometric increase (R² = 0.989), and we determined that the 95% POTU (100%) value required more than 10⁹ sequenced genomes (Supplementary Fig. S5). In addition, the POTU also differed significantly between environments. The POTU (100%) value based on the total host-associated samples was 4.6% whereas the POTU (100%) value for all the free-living samples was only 2.1%. The POTU (100%) values for the animal corpus and plant corpus environments were 28.3% and 23.7%, respectively, whereas the POTU (100%) values for sediment (non-saline), soil (non-saline), and water (non-saline) environments were only 2.3%, 2.9%, and 2.9%, respectively. POTU (98.6%) and POTU (97%) also showed similar patterns (Fig. 3b). Thus, despite the rapid accumulation of prokaryotic genomic information, the genome-sequenced proportion of the global prokaryotic taxa was still fairly low.
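The random-sampling evaluation of POTU against sample number can be sketched as below. It is a simplified illustration, assuming each sample is represented by the set of OTUs it contains; the function names, the fixed seed, and the toy data are hypothetical, and the curve fitting reported in the text is not reproduced here.

```python
import random
import statistics
from typing import Dict, List, Sequence, Set, Tuple


def p_otu(sample_otus: Sequence[Set[str]], sequenced: Set[str]) -> float:
    """Fraction of the distinct OTUs pooled over the samples that are sequenced."""
    pooled: Set[str] = set().union(*sample_otus)
    return len(pooled & sequenced) / len(pooled) if pooled else 0.0


def subsample_curve(
    samples: List[Set[str]],
    sequenced: Set[str],
    sizes: Sequence[int],
    repeats: int = 10,
    seed: int = 0,
) -> Dict[int, Tuple[float, float]]:
    """Mean and standard deviation of POTU over repeated random subsets.

    `repeats` must be at least 2 so that a standard deviation can be computed.
    """
    rng = random.Random(seed)
    curve: Dict[int, Tuple[float, float]] = {}
    for size in sizes:
        values = [p_otu(rng.sample(samples, size), sequenced) for _ in range(repeats)]
        curve[size] = (statistics.mean(values), statistics.stdev(values))
    return curve


# Hypothetical toy data: every sample holds one shared, sequenced OTU and one
# unsequenced OTU unique to that sample, so POTU falls as samples are pooled.
samples = [{"core", f"rare_{i}"} for i in range(8)]
sequenced = {"core"}
curve = subsample_curve(samples, sequenced, sizes=[1, 4, 8])
```

In this toy case POTU equals 1/(k + 1) for k pooled samples, which mimics the decline caused by widespread OTUs being counted once while rare OTUs keep accumulating.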

Fig. 3

Genome-sequenced proportion of prokaryotic taxa from global or different environment types. a As the number of samples increases, the POTU (100%) shows an exponential declining trend and finally stabilizes at 2.1%. A random selection of 1000, 2000…, 9000 samples was performed 10 times for each group to calculate the mean value and standard deviation. b Significant difference of POTU among environment types. The red point is POTU (100%), the blue point is POTU (98.6%), and the orange point is POTU (97%)

The majority of the biomes were occupied by a few predominant taxa with high relative abundances

Our results showed that the top 1% of the prokaryotic taxa (sorted by their percentage of 16S rRNA sequences) accounted for 72.9% of the global prokaryotic biomes (Fig. 4a and Supplementary Fig. S6). These top 1% of taxa always had a high abundance in different environment types (Fig. 4b), which was similar to a recent report on globally dominant soil bacteria [19]. By contrast, the rare taxa with low abundance (total number of sequences < 10) accounted for 59.8% of the total prokaryotic taxa but only 1.2% of the global prokaryotic cells (Supplementary Fig. S7). We found that the number of samples affected the observed proportion of rare taxa among global taxa; as the number of samples increased, this proportion increased gradually (Supplementary Fig. S8). Notably, the genome-sequenced proportion of the top 1% of prokaryotic taxa reached 38.0% whereas that of the 59.8% of prokaryotic taxa with a low abundance was only 0.6% (Fig. 4c and Supplementary Fig. S6). The genome-sequenced proportions of the top 1% of prokaryotic taxa from different environment types exceeded 12% (Fig. 4b). We further selected 1325 highly abundant and widely distributed OTUs that met the following conditions: they occurred in at least 9 environment types and at least 100 samples, and their abundance reached the top 1% in at least 1 environment type (Supplementary dataset 3). These predominant taxa accounted for only 0.5% of the total OTUs but contributed 50.3% of the global prokaryotic biomes. The genome-sequenced proportion was fairly high in these dominant taxa, and the POTU (100%), POTU (98.6%), and POTU (97%) values were 48.2%, 61.7%, and 71.3%, respectively (Supplementary Fig. S9). The majority of biomes were occupied by a few predominant taxa with high genome-sequenced proportions.
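The abundance shares discussed above can be illustrated with a short sketch. It assumes a mapping from each OTU to its total number of 16S rRNA sequences across all samples; the function names, the rounding rule for the top fraction, and the toy counts are hypothetical choices made for this example only.

```python
import math
from typing import Dict, Tuple


def top_fraction_share(total_reads: Dict[str, int], fraction: float = 0.01) -> float:
    """Share of all sequences contributed by the most abundant `fraction` of OTUs."""
    ranked = sorted(total_reads.values(), reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))  # keep at least one OTU
    return sum(ranked[:k]) / sum(ranked)


def rare_taxa_summary(total_reads: Dict[str, int], cutoff: int = 10) -> Tuple[float, float]:
    """Fractions of OTUs and of sequences belonging to OTUs with fewer than `cutoff` sequences."""
    rare = [n for n in total_reads.values() if n < cutoff]
    return len(rare) / len(total_reads), sum(rare) / sum(total_reads.values())


# Hypothetical toy table: one dominant OTU and 99 OTUs with a single sequence each.
total_reads = {"otu_0": 901}
total_reads.update({f"otu_{i}": 1 for i in range(1, 100)})

top_share = top_fraction_share(total_reads)             # dominant OTU: 901 of 1000
rare_taxa, rare_reads = rare_taxa_summary(total_reads)  # 99 of 100 OTUs, 99 of 1000 reads
```

The toy numbers reproduce the qualitative pattern in the text: a very small fraction of taxa carries most sequences, while the many rare taxa carry few.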

Fig. 4

High genome-sequenced proportion of prokaryotic taxa with high abundance. a The top 1% of the prokaryotic taxa account for 72.9% of the global prokaryotic biomes. b The top 1% of the prokaryotic taxa from different environment types accounted for more than 40% with a genome-sequenced proportion greater than 10%. The gray column represents the cellular proportion of the top 1% of the taxa, and the red column represents the POTU (100%). c High genome-sequenced proportion of the top 1%. The red line is POTU (100%), the blue line is POTU (98.6%), and the orange line is POTU (97%)

Culturability altered genome-sequenced preferences among prokaryotes but not environments

We estimated the POTU values of prokaryotes at different taxonomic levels (Supplementary dataset 4), which showed that the POTU values were obviously different among different taxa, and the POTU value of the same taxon also differed significantly among different environment types (Supplementary Figs. S10, S11, S12, S13, S14 and S15). For example, of the 11 phyla with OTU numbers greater than 1%, the highest POTU (100%) value was 5.7% for Actinobacteria, and the lowest POTU (100%) value was 0.04% for Parcubacteria; the difference between them spanned more than 100-fold (Supplementary Fig. S10).

Due to improvements in sequencing throughput and computational techniques, cultivation-independent recovery of genomes from metagenomic data has rapidly developed. In total, 7903 bacterial and archaeal metagenome-assembled genomes (MAGs), which were considered to derive from uncultivated strains, were recovered from massive metagenomic data [8]. We assessed the effect of strain culturability on the current genomic sequencing preferences using these MAGs and 155,810 cultured genomes (Supplementary dataset 5). The results showed that the genome-sequenced proportion of prokaryotes increased by 0.1% after combining these MAGs. Across environment types, the POTU (100%) based on MAGs was highly positively correlated with that based on RefSeq (r = 0.91, p < 0.01) (Supplementary Fig. S16). This result showed that, similar to the RefSeq genomes, the MAGs also showed environmental differences, and the culturability of strains was not the main factor leading to these differences. For the 11 phyla with an OTU number proportion greater than 1%, there was no significant correlation between the POTU (100%) based on the MAGs and that based on RefSeq (p > 0.05) (Supplementary Fig. S16). This indicated that although the recovered MAGs also differed in their coverage of prokaryotic taxa, their taxonomic preferences differed from those of the RefSeq genomes.

Discussion

The genome is the basic resource for understanding the physiology, ecology, and evolution of prokaryotes. More than 200,000 bacterial and archaeal genomes are now available from over two decades of development [3, 6]. These genomes provide important insights into, for example, the roles of microorganisms in industrial processes and the mechanisms by which pathogens cause disease. In this study, we assessed the genome-sequenced proportion of global prokaryotes. We found that the median proportions of the genome-sequenced prokaryotic cells and taxa (at 100% identities in the 16S-V4 region) in global biomes were 38.1% (16.4–86.3%) and 18.8% (9.1–52.6%), respectively. After closely related strains were included, the Bcell (97%) reached at least 50% in 61.9% of the samples, and the BOTU (97%) reached at least 50% in 38.4% of the samples. In addition, the median Bcell (97%) and BOTU (97%) values in host-associated biomes were 85.6% (43.2–98.0%) and 62.8% (9.8–82.3%), respectively, which were significantly higher than those in free-living biomes. Thus, the genetic information of a specific prokaryotic biome may have been reported to a considerable degree.

However, compared to prokaryotic biomes, the genome-sequenced proportion of global prokaryotic OTUs was fairly low. Our results suggest that only 2.1% of the global prokaryotic taxa (at 100% identities in the 16S-V4 region) have been sequenced. More than 75% of prokaryotic OTUs could exist in multiple biomes; the more types of environments in which a prokaryotic OTU can survive, the higher its genome-sequenced proportion tends to be. Prokaryotic biomes are usually composed of a few predominant taxa with a high abundance and many rare taxa with a low abundance [20, 21]. We found that the 0.5% of OTUs that were predominant occupied 50.3% of prokaryotic cell abundance and had a high genome-sequenced proportion (48.2%); however, the 60% of OTUs that were rare accounted for only 1.2% of the global prokaryotic cells and had a low genome-sequenced proportion (0.6%). A large number of rare taxa are considered to be critical components of the Earth’s ecosystem and contain a large pool of functional genes [21, 22]. Therefore, from this perspective, our current understanding of global prokaryotic genomic information remains very limited due to the large number of genome-unsequenced rare taxa, and the exploration of this huge genetic resource is just beginning.

Predominant taxa are considered the priority for isolation, culture, and genome sequencing [19]. We identified 1325 predominant OTUs with a wide distribution, high abundance, and adaptability to a variety of environment types, more than half of which had not been genome-sequenced. In particular, some predominant taxa have received less attention in specific environment types. For example, the top 1% taxa of abundance in plant surfaces and animal surfaces accounted for 81.0% and 79.2% of the global prokaryotic biomes whereas the genome-sequenced proportions of the taxa were only 13.7% and 79.0%, respectively. The POTU (100%) of plant surfaces (leaf or kelp surface biofilms) ranked 8th, but its median Bcell ranked last, reflecting the limited understanding of its predominant taxa.

Currently, most of the sequenced prokaryotic genomes (RefSeq genomes) are from pure cultures, while MAGs are not limited by culturability [8, 9, 23]. We found similar genome-sequenced differences among environment types for RefSeq genomes and MAGs, which indicated that the current imbalance of prokaryotic genome sequencing among environments was more likely due to differences in researchers’ attention than to prokaryotic culturability. The significant differences between RefSeq genomes and MAGs in the genome-sequenced proportions of different taxa suggested that the sequencing preferences caused by culturability did not affect MAGs; nevertheless, MAGs had their own taxonomic sequencing preferences.

The paradigm that only 1% of prokaryotes are culturable has a profound impact on microbial ecology but has recently been debated [24,25,26]. Since the RefSeq genomes are mainly from culturable taxa, and a significant proportion of culturable taxa have not been sequenced, we estimate that the culturable rate of global prokaryotic taxa (> 97% identities) would be higher than the genome-sequenced proportion of 12.2%. Similar to the higher genome-sequenced proportion of the high abundance predominant taxa, predominant taxa should also have a much higher culturability rate than rare taxa; thus, the culturability rate of prokaryotic cells will be much higher than that of taxa. Consequently, our data indicated that the paradigm that only 1% of prokaryotes are culturable is out of date, both for cells and taxa.

Conclusions

This study performed an in-depth analysis of the prokaryotic genome-sequenced proportion in the EMP and comprehensively showed the global-scale genome-sequenced degree for various environment types and different species. Most of the biomes were occupied by a few widespread predominant taxa. Given the high genome-sequenced proportion of predominant taxa, the genetic information of most prokaryotic biomes has been revealed to a high degree. However, due to the large number of rare taxa with unknown genomes, our current understanding of the global prokaryotic genome information remains limited. These results will be helpful for more reasonable and efficient explorations of prokaryotic genomes and will accelerate the comprehensive understanding of microbial ecological functions in different environments.

Methods

Data collection from EMP and RefSeq

The Earth Microbiome Project (EMP) was founded in 2010 to sample the Earth’s microbial communities at an unprecedented scale to advance our understanding of the organizing biogeographic principles that govern microbial community structure on Earth [13,14,15]. A total of 262,011 OTUs, together with their abundance and nucleic acid sequence information, were collected from the EMP release site (ftp://ftp.microbio.me/emp/release1); these OTUs had been obtained and shared by the EMP from 10,000 samples using the Deblur software [27]. Chimera filtering was that applied by the EMP. The NCBI Reference Sequence (RefSeq) database is a curated non-redundant collection of sequences representing complete or draft genomes [28]. We obtained all of the 155,810 bacterial or archaeal genomes collected by the database before July 2019. In addition, 7903 metagenome-assembled genomes (MAGs; 1539 of which contained the 16S rRNA gene) [8] recovered from > 1500 public metagenomes using MetaBAT [29] were also collected to represent uncultivated bacteria and archaea.

Sequence alignment and analysis

Alignment between the EMP OTUs and the 155,810 RefSeq genomes or the 7903 MAGs was performed using BLASTn (E value < 1e–5) [30]. To assess whether the number of OTUs was adequate, we analyzed random subsets of the samples, increasing the number of samples from 1000 to 10,000. The genome-sequenced proportions of cells and taxa (at 100%, > 98.6%, or > 97% identities in the 16S-V4 region) in a specific prokaryotic biome were defined as Bcell and BOTU, respectively. The genome-sequenced proportion of taxa (at 100%, > 98.6%, or > 97% identities in the 16S-V4 region) from subgroup or global biomes was defined as the POTU. The 100% identity represents the most rigorous and accurate match, while 98.6% and 97% identities are the new and traditional criteria for species definitions, respectively [16,17,18]. Briefly, Bcell represents the ratio of the genome-sequenced sequences in a single sample, BOTU represents the ratio of the genome-sequenced OTUs in a single sample, and POTU represents the ratio of the genome-sequenced OTUs in multiple samples.
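The three definitions above can be made concrete with a minimal sketch. It assumes that the alignment step has already been summarized as the best percent identity of each OTU against any genome, and that sequence counts stand in for cells; the function names, the input layout, and the toy values are hypothetical and do not describe the authors' pipeline.

```python
from typing import Dict, Set

# counts[sample][otu] = number of 16S-V4 sequences of that OTU in that sample.
Counts = Dict[str, Dict[str, int]]


def matched_otus(best_identity: Dict[str, float], threshold: float, strict: bool) -> Set[str]:
    """OTUs whose best genome hit meets the identity cutoff.

    strict=False keeps identities >= threshold (used here for the 100% level);
    strict=True keeps identities > threshold (the "> 98.6%" and "> 97%" levels).
    """
    if strict:
        return {otu for otu, ident in best_identity.items() if ident > threshold}
    return {otu for otu, ident in best_identity.items() if ident >= threshold}


def b_cell(sample: Dict[str, int], sequenced: Set[str]) -> float:
    """Bcell: fraction of sequences in one sample from genome-sequenced OTUs."""
    total = sum(sample.values())
    hit = sum(n for otu, n in sample.items() if otu in sequenced)
    return hit / total if total else 0.0


def b_otu(sample: Dict[str, int], sequenced: Set[str]) -> float:
    """BOTU: fraction of the OTUs present in one sample that are genome-sequenced."""
    present = {otu for otu, n in sample.items() if n > 0}
    return len(present & sequenced) / len(present) if present else 0.0


def p_otu(counts: Counts, sequenced: Set[str]) -> float:
    """POTU: fraction of the distinct OTUs pooled over samples that are genome-sequenced."""
    pooled = {otu for sample in counts.values() for otu, n in sample.items() if n > 0}
    return len(pooled & sequenced) / len(pooled) if pooled else 0.0


# Hypothetical toy inputs; OTUs without any hit are simply absent from best_identity.
counts = {
    "s1": {"otu_a": 90, "otu_b": 10},
    "s2": {"otu_a": 50, "otu_c": 25, "otu_d": 25},
}
best_identity = {"otu_a": 100.0, "otu_b": 98.0, "otu_c": 96.0}

seq_100 = matched_otus(best_identity, 100.0, strict=False)  # {"otu_a"}
seq_97 = matched_otus(best_identity, 97.0, strict=True)     # {"otu_a", "otu_b"}
```

In the toy data the abundant OTU is sequenced, so Bcell for sample s1 (0.9) is higher than BOTU for s1 (0.5), which in turn is higher than POTU over both samples (0.25), mirroring the ordering reported in the Results.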

Taxonomic analysis of EMP OTU

The taxonomy of each OTU was analyzed by the Ribosomal Database Project (RDP) Classifier [31] at a 70% confidence threshold. The EMP ontology (EMPO) classified 17 microbial environments (level 3) as free living or host associated (level 1) and saline or non-saline (if free living) or animal or plant (if host associated) (level 2) [13]. Based on the taxonomic results and the EMPO (level 3) for each OTU, we calculated the composition and relative abundance of different levels of taxonomy (phylum, class, order, family, and genus) in different environments.