Background

The rRNA gene approach to microbiome analyses, either based on amplicon or metagenomic sequencing, relies on the tacit assumption that the counts of this marker gene translate into a robust measure or proxy for microbial abundance. However, this assumption is often violated. Sources of error in gene abundance determination can come from analytical procedures such as DNA extraction, PCR amplification, and sequencing itself [1]. But likely as important, systematic biases can be caused by the varying abundance of ribosomal genes in the genomes of microbes [2]. The concern is evident in the dedicated databases that document the variability in ribosomal gene copy number per genome (Rg) among microbes [3]. Interestingly, Rg seems to correlate with a microbe’s life history traits, where fast growth is associated with higher values [4,5,6]. There is also evidence for a certain degree of conservation in Rg within bacterial phylogenetic clades [7]. On this basis, bioinformatic tools have been developed to automatically correct ribosomal gene surveys for Rg [8]. The phylogenetic conservation of Rg, however, seems only conspicuous among closely related microbes [9] and can explain only some 10% of its variability in complex, diverse communities [10]. In some eukaryotes like Saccharomyces cerevisae, Rg is unstable and can vary widely among strains or individuals [11]. Importantly, such corrections would only lead us to a description of community composition in terms of relative abundance of taxon-specific genome copies. But more useful metrics in microbiome community composition analyses are either cell number [7] (i.e., individuals) or biomass contributions by each taxon. Given the close to 9 orders of magnitude spanned by microbial cell biomass, it can be argued that taxon-specific biomass rather than cell number would be a better descriptor of a taxon’s contribution to a community. However, there are still instances where number of cells would be preferred (for example, to gauge dispersal potential, culturability, or susceptibility to deleterious agents like predators or toxicants). In any case, to translate genome numbers to cell numbers, one needs to take into account the level of ploidy, P, the number of copies of the genome present in a cell, where the number of ribosomal operons per cell, Rc, is the product PRg. Surprisingly, P is not typically taken into account, perhaps under the assumption that most microbes, like Escherichia coli, are monoploid [12, 13]. And yet, in bacteria and archaea, P varies far more than Rg [12, 14], and most species examined are oligo- or polyploid, with some containing in excess of 200 genomes copies per cell [15]. If one includes unicellular eukaryotes, the variation can be 4 orders of magnitude [16]. Clearly, ploidy constitutes a very important source of bias for community counts in itself [17], affecting estimates from both amplicon sequencing and shot-gun metagenomics. The variable nature of P could potentially either compound or diminish the effect of Rg variability in determining a cell’s Rc, as it is not known whether P and Rg correlate or vary independently among species; a high P could be associated with low Rg, and vice-versa. Studies on marine protists intended to estimate biomass from 18S counts have shown that Rc correlated linearly with cell volume (Vc) [18] or cell length [19] when plotted on double log scales, indicating an Rc dependence on size.

Here, we posited that perhaps there is constancy among microbes in the need for ribosomal gene content in relation to their cell biomass. In other words, microbial species would be under selection to contain a sufficient but not excessive Rto support the production of their typical cell biomass, Bc, so that Rcwould be proportional to Bc. Assuming cell density to be invariant (around 1.008 g ml−1) [20], Rc would also be proportional to cell volume (Vc).

Methods

Dataset

Values for all parameters were gathered or derived from the literature. In place of Bc, we used cellular volume, Vc, assuming cellular density to be constant (around 1.008 g ml−1 [20]). Cell volumes were either taken from reported direct determinations or derived from literature photomicrographs assuming simple formulae for a variety of fitting three-dimensional shapes (i.e., sphere, cylinder) or combinations thereof as given in Table S1 (see Additional file 1). When a range of volume values was available, we used the average. For Rg, we used values given in rrnDB [3] for the same species or strain. If they were not available, we used literature values or determined it by examination of the strain’s publicly available genome through BLAST. Ploidy was either taken directly from reported values or estimated if cellular DNA content and genome size were known. If P was variable within a species or strain, we used the average level of the range given. Rc values were then derived as the product of P and Rg, although for many protists, Rc was taken directly from experimentally determined values. The annotated input data are gathered in Table S1 (see Additional file 1). The limiting factor to the size of the database was the availability of P determinations, which are quite uncommon. In all, we could analyze 107 cases.

Statistics

Power fits of data were run in Excel as linear regressions of the ln-transformed data pairs using a least-squares model. Statistics are given in Table S2 (see Additional file 2). To test the significance of exponent differences in two separate datasets, we used T tests for the slopes of the linear fits.

Estimation of taxon-specific cell numbers and biovolumes from 16S rRNA counts

In a dataset of rRNA gene taxon-specific frequencies, Fr, assigned to i taxa whose cell volumes, Vc(i), are known, one can directly estimate Rc (i) from Eq. 1 (see the “Results” section). The relative contribution to number of cells by taxon i, Fc(i), is computed as:

$${F}_{c }\left(i\right)=\frac{{F}_{r}(i)}{{R}_{c}(i)\sum \frac{{F}_{r}\left(i\right)}{{R}_{c}\left(i\right)}}$$

And the relative contribution to biovolume, Fv (i), as:

$${F}_{v }\left(i\right)=\frac{{F}_{c}(i){ V}_{c}(i)}{\sum {F}_{c}(i){ V}_{c}(i)}$$

If a determination of the absolute abundance of the total copies of the ribosomal gene for all taxa considered in the sample of origin, Rs, is available (from qPCR, for example, in units of copies per mass, volume or surface sampled), then absolute taxon-specific assignments R(i) can be obtained as the product \({F}_{r}(i)Rs(i).\) From R(i), one can derive absolute values for cells C(i) and biovolume V(i) attributable to each taxon: C(i) = R(i)/Rc (i) and V(i) = C(i)Vc(i). The sums \(\sum C(i)\) and \(\sum V(i)\) estimate the absolute number of cells or biovolume (in µm3), respectively, of the entire set of taxa under consideration.

An alternative to using Vc(i), if those are not exactly known, is to assign rough discrete size ranges to taxa, and to use mean Vc (and Rc) values of the range’s maximum and minimum. We found it advisable to set variable-width size ranges in such a way that within-range variation in resulting Rc values was kept moderate. We used the following cell diameter ranges (in µm): 0.2–0.3, 0.3–0.4, 0.4–0.6, 0.6–0.9, 0.9–1.2, 1.2–1.5, 1.5–2.1, 2.1–2.9, 2.9–4.1, 4.1–5.8, 5.8–8.2, 8.2–11.6, and 11.6–16.4. This set provides within-range variation in Rc of less than 8% in all cases, which is smaller than the uncertainty of our estimates for the normalization constant in Eq. 1 of the “Results” section.

Results

Traits that span orders of magnitude are best evaluated as double logarithmic plots, which can be analyzed by power function fits. In this approach, the hypothesis of proportionality between Vc and Rc we posed should have resulted in a power function fit with an exponent close to unity. Our analysis (Fig. 1) readily dispelled that contention. The fit instead revealed that Rc follows well (R2 = 0.86) a power function of Vwith an exponent significantly lower than unity, and indistinguishable from 2/3 (0.66 ± 0.03; ± SE) across nine orders of magnitude in cell volume. For volumes expressed in µm3,

Fig. 1
figure 1

Relationship between cellular ribosomal gene content (Rc) and cell volume (Vc) in microbes (n = 107), plotted as a log/log graph. The grey line is a power fit with the equation displayed in red type (fit statistics are in Table S2, Additional file 2). Data points belonging to eukaryotes are in orange, those for archaea in yellow, and bacteria in green. For three species, we plotted datasets to highlight intraspecies variability: Synechococcus elongatus (light blue symbols) [28], Colozoum pelagicum (light purple) [19], and Sphaerozoum fuscum [19] (light yellow)

$$R_c\:=\:9.58\;v_c^{0.66}\;\cong9.58\;V_c^{2/3}$$
(1)

where 9.58 ± 1.21 is the estimated normalization constant. One could envision that the scaling relationship may have been artifactually distorted at the low range of Rc, since it cannot physiologically take values < 1. But a reanalysis of the dataset excluding data pairs with Rc \(\le 2\) did not change the fit significantly in exponent or normalization constant (see Additional file 2). We also tested the hypothesis that exponents for a fit of data pairs from eukaryotes (exponent = 0.72 ± 0.05) vs. prokaryotes (0.62 ± 0.05) could be different, but this did not find strong statistical support in a T test comparison (p = 0.20).

Equation 1 can be rewritten as a function of linear cell dimensions using a spherical-equivalent cell diameter, \({D}_{c }^{0}=2 \sqrt[3]{\frac{3 {V}_{c}}{4\pi }}\) so that.

$$R_c\:=\:6.25\;\left(D_c^0\right)^{1.98}\;\cong\;6.25\;\left(D_c^0\right)^2$$
(2)

Thus, Rc scales generally not with the volume but with the surface area of a microbial cell, which for the purpose of this study means that the bias associated with ribosomal gene counts will be size-dependent regardless of our choice of abundance estimator. Ribosomal counts will overestimate large-celled microbes over small-celled ones if one is interested in number of cells, the bias increasing with the square of linear cell size (Eq. 2), a prediction that finds experimental support for specific cases in the literature [21]. In terms of biomass, ribosomal counts will underestimate the contribution of large microorganisms, the bias increasing with the 2/3 power of cellular biovolume (Eq. 1). Whichever the desired measure of abundance, however, the explicit relationship in Fig. 1 provides a means for bias correction in tallies of ribosomal genes, as long as cell biovolume is known from ancillary data for the taxa detected in the microbiome of interest. The correction requires knowledge of neither P nor Rg.

A procedural explanation is given under the “Methods” section, and we provide an example application in Fig. 2 using a dataset of phototrophic bacteria from endolithic microbiomes within intertidal hard carbonate rocks [22], responsible for their micritization and bioerosion [23], and useful here because typical cell volumes could be assigned to all taxa. The differential outcomes are obvious: 16S rRNA counts of large-celled cyanobacterial genera severely underestimate their contribution to biomass but overestimate their contribution in terms of number of cells (see for example, Hyella sp.). The opposite is true for alphaproteobacterial phototrophs (see for example Rhodomicrobium sp.), most of which are small-celled [24]. The distortion is less intense for the Chloroflexi, with intermediate cell size (see Roseiflexus castenholzii, for example).

Fig. 2
figure 2

Estimation of microbial community structure based on experimental ribosomal counts (central column), estimated cell number (left column) and estimated biovolume (right column) in a single, exemplary dataset using allometric corrections based on Eq. 1. The dataset is from Roush et al. [22] and includes the subset of taxonomically assignable phototrophic bacteria from an endolithic microbiome on coastal marine carbonate rocks. Only three exemplary phototrophs are labeled, but full, taxonomically explicit distributional data are in Table S3 (see Additional file 3). For ease of comparison, results are graphically presented as relative frequencies, but absolute scales of areal abundance are indicated on the arrow to the right

We have presented the issue of bias having in mind relative abundance tallies of microbiome members, but proportional tallies have methodological constraints in themselves, because the individual proportions must add up to 1, and thus the relative abundances of taxa are necessarily not independent of each other. There is clear evidence of severely diverging analytical outcomes when both relative and absolute abundance are compared in the same datasets [25, 26]. Commonly, relative proportions or taxa-specific ribosomal copies are converted to absolute abundances with parallel quantification of rRNA gene copies by qPCR, either total copies in the community analyzed or those of particular taxa [16]. We note here that, in view of our results, the latter would require allometric correction, whereas the former would not (as done in the dataset presented in Fig. 2) and is thus a preferable approach. However, we also note that the total number of ribosomal gene copies in a sample is not a good absolute measure of the combined microbial biomass or number of cells present for comparisons among samples, as it will be dependent on their inherent cell-size distribution. Hence, comparisons among samples will only be meaningful if carried out after conversion to biomass or cell numbers, unless the microbial composition of the samples is unchanged.

Discussion

The procedure outlined here requires knowledge of morphological metadata in addition to sequencing counts for each taxon. Unfortunately, cell volume data are not readily available for many taxa, at least in a compiled format, and requires intensive literature searches. In its absence, and as an approximation, using a few discrete cell-size classes instead of exact values yields useful corrected distributions (see Figure S1 in Additional file 4). Yet, an effort to bring microbial size data into a consolidated platform would be desirable in that it would enable the processing of large datasets in an automated, more manageable way.

An additional factor to take into account is the substantial data spread around the fit leading to Eq. 1, which can limit the precision of the correction. An expanded dataset should improve predictive accuracy and perhaps even precision, but some inherent limitations are also at play. P can vary in a single strain with cell cycle [15] and growth conditions [27]. We have included the range of intraspecies variability on the Vc/Rc space in Fig. 1 for the cases of a single strain of Synechococcus elongatus, and of single cells from natural populations of Sphaerozoum fuscum and Colozoum pelagicum. They suggest that a significant proportion of spread can be attributed to biological intraspecies variability, tempering the prospects for improvement with eventually extended datasets. Studies on Synechococcus elongatus [28,29,30] and Saccharomyces cerevisae [31] point to a regulatory interdependency of P with cell size, indicating that natural variations in ploidy may be met by commensurate variation in volume, making this much less of a problem. Additionally, because the data used here were arrived at through several approaches, a dedicated survey based on more consistent analytical procedures may result in tighter fits. Finally, part of the variability detected may have been due to neglecting contributions of organelle ribosomal genes in protists. This is expected to be negligible for large-celled eukaryotes, but perhaps not so much for the smallest of them, in which organelles take up a larger portion of their cell volume. Indeed, some of these pico-eukaryotes contribute disproportionately (by defect in Rc) to the regression’s sum of squares and may have contributed to the somewhat higher exponent in the eukaryote-only fit (Additional file 2). In support of this contention, a re-analysis excluding eukaryotes with Vc < 20 µm3 yields an exponent (0.66 ± 0.06; R2 = 0.68), more in line with that of Eq. 1, showing no trace of statistical difference (T test, p = 0.68) with that of the prokaryote-only fit (Additional file 2). While the dataset does not allow us to differentiate between bacteria and archaea because of the low number of archaeal cases in it, given the substantial biological differentiation between bacteria and archaea, it may be an interesting future exercise.

The preceding discussion on uncertainty in the correction approach should not be taken as grounds for inaction, given that the range of variation in Vc far exceeds that traceable to deviations from the fit, not only among microbes at large, but also in specific settings, and the spectra of microbial size distribution in microbiomes seems to be dynamic. For example, the range of Vc of typical bacterioplankton (excluding phototrophs) in seawater spans 3 orders of magnitude and its spectrum can be modified significantly by factors like grazing [32]. Considering photosynthetic plankton would likely add another 4–5 orders of magnitude in Vc, and the size spectrum of this group is also affected by environmental parameters [33]. In the human gut microbiome, our initial assessments show that microbiome typical bacteria span over at least 4 orders of magnitude in volume.

Beyond the pragmatic uses for community composition corrections, we see it as unlikely that the apparent scaling relationship with cell surface area has no biological meaning. It is tempting to speculate that Rc scales with size to maintain an increasing protein content need. Indeed, protein content scales as a function of cell volume with a similar exponent of 0.70 ± 0.06 (R2 = 0.87; 95% CI 0.64–0.75) [34]. Because the CI of the exponent for protein content per cell and that for Rc in the fit of Fig. 1 [0.72 and 0.61; see (Additional file 2)] overlap, the possibility of a connection to cellular need for proteins cannot be rejected solely on this basis. Indeed, in Synechococcus elongatus in the laboratory, protein content and P (hence also Rc) strongly co-vary with cell volume [30].

Alternatively, and perhaps more trivially, the scaling relationship of Rc with Vc may simply be a reflection of the size scaling of DNA content per cell. In other words, ribosomal genes would follow the trends of DNA content as a whole, just like any other universal gene would. The allometric relationship between DNA content and cell size, however, has not been addressed in the literature or has been addressed incorrectly by neglecting ploidy [34, 35]. We know that genome size scales among bacteria with reported exponents between 0.21 (R2 = 0.60) [34] and 0.35 (R2 = 0.45) [36]. In our dataset, which includes eukaryotes, it does so with an exponent of 0.18 (R2 = 0.34; see Additional file 5). Even when these coefficients of correlation are rather poor, genome size clearly increases much more weakly with Vc than does Rc. Again, this does not take into account P variations to yield actual DNA content per cell; it is the size of one copy of the genome. A portion of our dataset can be used to explore the scaling of DNA content per cell for prokaryotes (n = 60). To this subset, we can add the measured or slightly derived values reported by Shuter et al. [35] (n = 39), excluding those that relied on assumptions of monoploidy. This combined set yields a power scaling fit with R2 = 0.89 and estimated exponent of ¾ (0.75 ± 0.03; Fig. 3).

Fig. 3
figure 3

DNA content scales with cell volume as a power function with an exponent of ¾. Entries are from a subset of those in Table S1 (n = 60, see Additional file 1), and determinations by Shuter et al. [35] (n = 39). Orange points belong to eukaryotic microbes, yellow points belong to archaea, and green points to bacteria. Full statistics for the fit (in red type) are given in Table S2 (Additional file 2)

The difference in scaling exponent between genome size and cell DNA content (0.18–035 vs. 0.75) gauges the importance of P. In fact, in our dataset, P seems to scale with Vc as a power law with an exponent of 0.54 (R2 = 0.69; Additional file 6). This is consistent with the fact that the product of genome size and ploidy yields the cell DNA content, as the exponents of the multipliers (0.18 and 0.54, respectively) roughly add up to the estimated exponent of the product (0.75). That the exponents for DNA content per cell (3/4) and Rc (2/3) are significantly different (T test, p = 0.02), speaks for respective mechanistic drivers that are fundamentally decoupled. In fact, most known allometric laws found in nature scale with exponents that are simple multiples of 1/4 [37]. It would seem that ribosomal genes are, in that sense, unique.

Conclusions

The results presented here uncover surprising basic rules on the composition of microbes, rules that ties them all together, and that far from being self-evident, pose an intellectual challenge to elucidate. In practical terms, this discovery also provides a rather simple approach to deal with biases affecting the use of current omics methodologies for the assessment of microbiome composition, which, given their extensive use in many areas of microbiology and allied sciences, has a large potential for applicability.