Estimating the number of sequencing errors in microbial diversity studies

Environmental and Ecological Statistics

Abstract

Species diversity analysis of microbial communities is an important tool for assessing the health of an ecosystem. The advent of high-throughput genome sequencing techniques has made it possible to process an unprecedented number of RNA sequences. However, many studies report the presence of a significant number of fictitious rare species in datasets generated using these techniques. These species are the product of errors that can occur at any step of the sequence analysis pipeline. The overcount of rare species (especially singletons) affects the estimation of the total number of species, and of the diversity of the community as measured by Shannon’s index. To avoid overestimating these quantities, it is crucial to model the source of error. In this work, we present a new model that treats spurious singletons as false-negative record linkage errors, and compare it with another approach in which spurious singletons are considered for deletion. We discuss the two inferential approaches both through an application to real data and on theoretical grounds. We demonstrate that, while Shannon’s index can differ significantly under the two models, the estimate of the total number of species is equivalent.

References

  • Allen HK, Bunge J, Foster JA, Bayles DO, Stanton TB (2013) Estimation of viral richness from shotgun metagenomes using a frequency count approach. Microbiome 1(1):1–7

  • Barger K, Bunge J (2010) Objective Bayesian estimation for the number of species. Bayesian Anal 5(4):765–785

  • Böhning D (2015) Power series mixtures and the ratio plot with applications to zero-truncated count distribution modelling. Metron 73(2):201–216

  • Böhning D, Kaskasamkul P, van der Heijden PGM (2019) A modification of Chao’s lower bound estimator in the case of one-inflation. Metrika 82(3):361–384

  • Bucci A, Allocca V, Naclerio G, Capobianco G, Divino F, Fiorillo F, Celico F (2015) Winter survival of microbial contaminants in soil: an in situ verification. J Environ Sci 27:131–138

  • Bunge J (2009) Statistical estimation of uncultivated microbial diversity. In: Uncultivated microorganisms, pp 160–178. Springer

  • Bunge J, Böhning D, Allen H, Foster JA (2012a) Estimating population diversity with unreliable low frequency counts. In: Biocomputing 2012: Proceedings of the Pacific symposium. World Sci. Publ, Hackensack, pp 203–212

  • Bunge J, Woodard L, Böhning D, Foster JA, Connolly S, Allen HK (2012b) Estimating population diversity with CatchAll. Bioinformatics 28(7):1045–1047

  • Bunge J, Willis A, Walsh F (2014) Estimating the number of species in microbial diversity studies. Annu Rev Stat Appl 1:427–445

  • Chambers R, Diniz da Silva A (2020) Improved secondary analysis of linked data: a framework and an illustration. J R Stat Soc A 183(1):37–59

  • Chang X, Sun D, He C (2014) Objective Bayesian analysis for a capture-recapture model. Ann Inst Stat Math 66(2):245–278

  • Chao A (1987) Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43(4):783–791

  • Chao A, Bunge J (2002) Estimating the number of species in a stochastic abundance model. Biometrics 58(3):531–539

  • Chiu C-H (2023) A more reliable species richness estimator based on the Gamma-Poisson model. PeerJ 11:e14540

  • Chiu C-H, Chao A (2016) Estimating and comparing microbial diversity in the presence of sequencing errors. PeerJ 4:e1634

  • Coull BA, Agresti A (1999) The use of mixed logit models to reflect heterogeneity in capture-recapture studies. Biometrics 55(1):294–301

  • da Silva CQ (2009) Bayesian analysis to correct false-negative errors in capture-recapture photo-ID abundance estimates. Braz J Prob Stat 23(1):36–48

  • Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460–2461

  • Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R (2011) UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27(16):2194–2200

  • Fonseca VG, Nichols B, Lallias D, Quince C, Carvalho GR, Power DM, Creer S (2012) Sample richness and genetic diversity as drivers of chimera formation in nSSU metagenetic analyses. Nucleic Acids Res 40(9):e66–e66

  • Gontcharova V, Youn E, Wolcott RD, Hollister EB, Gentry TJ, Dowd SE (2010) Black box chimera check (B2C2): a windows-based software for batch depletion of chimeras from bacterial 16S rRNA gene datasets. Open Microbiol J 4:47–52

  • Guimaraes P, Lindrooth RC (2007) Controlling for overdispersion in grouped conditional logit models: a computationally simple application of dirichlet-multinomial regression. Economet J 10(2):439–452

  • Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, Ciulla D, Tabbaa D, Highlander SK, Sodergren E et al (2011) Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res 21(3):494–504

  • Haegeman B, Hamelin J, Moriarty J, Neal P, Dushoff J, Weitz JS (2013) Robust estimation of microbial diversity in theory and in practice. ISME J 7(6):1092–1101

  • Hartmann M, Six J (2023) Soil structure and microbiome functions in agroecosystems. Nat Rev Earth Environ 4(1):4–18

  • Hartmann M, Niklaus PA, Zimmermann S, Schmutz S, Kremer J, Abarenkov K, Lüscher P, Widmer F, Frey B (2014) Resistance and resilience of the forest soil microbiome to logging-associated compaction. ISME J 8(1):226–244

  • Hugerth LW, Andersson AF (2017) Analysing microbial community composition through amplicon sequencing: from sampling to hypothesis testing. Front Microbiol 8:1561

  • Huse SM, Welch DM, Morrison HG, Sogin ML (2010) Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol 12(7):1889–1898

  • Kumar MS, Slud EV, Hehnly C, Zhang L, Broach J, Irizarry RA, Schiff SJ, Paulson JN (2022) Differential richness inference for 16S rRNA marker gene surveys. Genome Biol 23(1):166

  • Ligi T, Oopkaup K, Truu M, Preem J, Nõlvak H, Mitsch WJ, Mander Ü, Truu J (2014) Characterization of bacterial communities in soil and sediment of a created riverine wetland complex using high-throughput 16S rRNA amplicon sequencing. Ecol Eng 72:56–66

  • Link WA, Yoshizaki J, Bailey LL, Pollock KH (2010) Uncovering a latent multinomial: analysis of mark-recapture data with misidentification. Biometrics 66(1):178–185

  • Wu L, Ning D, Zhang B, Li Y, Zhang P, Shan X, Zhang Q, Brown MR, Li Z, Van Nostrand JD et al (2019) Global diversity and biogeography of bacterial communities in wastewater treatment plants. Nat Microbiol 4(7):1183–1195

  • Lukacs PM, Burnham KP (2005) Estimating population size from DNA-based closed capture-recapture data incorporating genotyping error. J Wildl Manag 69(1):396–403

  • Marin JM, Pudlo P, Robert CP, Ryder RJ (2012) Approximate Bayesian Computational methods. Stat Comput 22(6):1167–1180

  • Nijenhuis A, Wilf HS (1978) Combinatorial algorithms: for computers and calculators. Academic Press, New York

  • Øvreås L, Curtis TP (2011) Microbial diversity and ecology. In: Biological diversity: frontiers in measurement and assessment. Oxford University Press, Oxford, pp 221–236

  • Porter TM, Hajibabaei M (2018) Scaling up: a guide to high-throughput genomic approaches for biodiversity analysis. Mol Ecol 27(2):313–338

  • Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ (2011) Removing noise from pyrosequenced amplicons. BMC Bioinformatics 12(1):1–18

  • Reitmeier S, Hitch TCA, Treichel N, Fikas N, Hausmann B, Ramer-Tait AE, Neuhaus K, Berry D, Haller D, Lagkouvardos I et al (2021) Handling of spurious sequences affects the outcome of high-throughput 16S rRNA gene amplicon profiling. ISME Commun 1(1):31

  • Rocchetti I, Bunge J, Böhning D (2011) Population size estimation based upon ratios of recapture probabilities. Ann Appl Stat 5(2):1512–1533

  • Shoemaker WR, Locey KJ, Lennon JT (2017) A macroecological theory of microbial biodiversity. Nat Ecol Evol 1(5):0107

  • Stevick PT, Palsbøll PJ, Smith TD, Bravington MV, Hammond PS (2001) Errors in identification using natural markings: rates, sources, and effects on capture recapture estimates of abundance. Can J Fish Aquat Sci 58(9):1861–1870

  • Stojmenović I (1992) On random and adaptive parallel generation of combinatorial objects. Int J Comput Math 42(3–4):125–135

  • Hong S-H, Bunge J, Jeon S-O, Epstein SS (2006) Predicting microbial species richness. Proc Natl Acad Sci 103(1):117–122

  • Tancredi A, Liseo B (2011) A hierarchical Bayesian approach to record linkage and population size problems. Ann Appl Stat 5(2):1553–1585

  • Tancredi A, Auger-Méthé M, Marcoux M, Liseo B (2013) Accounting for matching uncertainty in two stage capture-recapture experiments using photographic measurements of natural marks. Environ Ecol Stat 20(4):647–665

  • Tang J, Zhang J, Ren L, Zhou Y, Gao J, Luo L, Yang Y, Peng Q, Huang H, Chen A (2019) Diagnosis of soil contamination using microbiological indices: a review on heavy metal pollution. J Environ Manag 242:121–130

  • Tedersoo L, Nilsson RH, Abarenkov K, Jairus T, Sadam A, Saar I, Bahram M, Bechem E, Chuyong G, Kõljalg U (2010) 454 Pyrosequencing and Sanger sequencing of tropical mycorrhizal fungi provide similar results but reveal substantial methodological biases. New Phytol 188(1):291–301

  • Tuoto T, Di Cecco D, Tancredi A (2022) Bayesian analysis of one-inflated models for elusive population size estimation. Biom J 64(5):912–933

  • Urian K, Gorgone A, Read A, Balmer B, Wells RS, Berggren P, Durban J, Eguchi T, Rayment W, Hammond PS (2015) Recommendations for photo-identification methods used in capture-recapture models with cetaceans. Mar Mamm Sci 31(1):298–321

  • Vale RTR, Fewster RM, Carroll EL, Patenaude NJ (2014) Maximum likelihood estimation for model \({M}_{t, \alpha }\) for capture-recapture data with misidentification. Biometrics 70(4):962–971

  • Walsh F, Smith DP, Owens SM, Duffy B, Frey J (2014) Restricted streptomycin use in apple orchards did not adversely alter the soil bacteria communities. Front Microbiol 4:383

  • Wang J-PZ, Lindsay BG (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. J Am Stat Assoc 100(471):942–959

  • Wang X, He CZ, Sun D (2007) Bayesian population estimation for small sample capture-recapture data using noninformative priors. J Stat Plan Inference 137(4):1099–1118

  • Watanabe S (2010) Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res 11(12)

  • Wesson P, Jewell NP, McFarland W, Glymour MM (2023) Evaluating tools for capture-recapture model selection to estimate the size of hidden populations: it works in practice, but does it work in theory? Ann Epidemiol 77:24–30

  • Willis A (2016) Species richness estimation with high diversity but spurious singletons. arXiv preprint arXiv:1604.02598

  • Willis A, Bunge J (2015) Estimating diversity via frequency ratios. Biometrics 71(4):1042–1049

  • Wright JA, Barker RJ, Schofield MR, Frantz AC, Byrom AE, Gleeson DM (2009) Incorporating genotype uncertainty into mark-recapture-type models for estimating abundance using DNA samples. Biometrics 65(3):833–840

  • Xiao X, Wang M, Zhu H, Guo Z, Han X, Zeng P (2017) Response of soil microbial activities and microbial community structure to vanadium stress. Ecotoxicol Environ Saf 142:200–206

  • Yoshizaki J, Brownie C, Pollock KH, Link WA (2011) Modeling misidentification errors that result from use of genetic tags in capture-recapture studies. Environ Ecol Stat 18:27–55

  • Zelterman D (1988) Robust estimation in truncated discrete distributions with application to capture-recapture experiments. J Stat Plan Inference 18(2):225–237

  • Zhang W, Bravington MV, Fewster RM (2019) Fast likelihood-based inference for latent count models using the saddlepoint approximation. Biometrics 75(3):723–733

Author information

Corresponding author

Correspondence to Davide Di Cecco.

Additional information

Handling Editor: Luiz Duczmal.

Appendix: conditional ABC algorithm

In this Appendix we give some details on the ABC rejection sampler conditioned on the number of captures s presented in Sect. 5.2.1. As noted there, to condition on the total number of specimens s, which remains fixed under the MLM, we replace the first two steps of the naive algorithm presented in Sect. 5.2 with the following two:

  1. generate values for \((\theta ,N^*)\) given s and the priors \(\pi (\theta )\) and \(\pi (N^*)\)

  2. generate values \((n_0^*,n_1^*,n_2^*,...)\) conditional on \(N^*\), \(\theta\) and s

In the following three subsections, we detail how the two steps above are carried out for each choice of baseline distribution.

1.1 Poisson

If we consider a \(Poi(\lambda )\) baseline distribution for our MLM, we have

$$\begin{aligned} \sum X_i^* \,|\; \lambda ,N^* \; \sim \; Poi(\lambda N^*), \end{aligned}$$

and, integrating over \(\lambda\) with a \(Gamma(\alpha _{\lambda }, \beta _{\lambda })\) prior (shape \(\alpha _{\lambda }\), rate \(\beta _{\lambda }\)), we have

$$\begin{aligned} \sum X_i^* \,|\; N^* \; \sim \; NegBin\left( \alpha _{\lambda },\frac{\beta _{\lambda }}{N^*+\beta _{\lambda }}\right) . \end{aligned}$$

Given a prior in the family \(\pi (N^*) \propto (N^*){}^{-k}\) defined by k, we can calculate the probability \(P(N^* \,|\, s )\) for a sufficiently large range of values, via

$$\begin{aligned} P\left( N^* \,|\, \textstyle \sum X^*_i = s \right)&\propto P\left( \textstyle \sum X^*_i = s \,|\, N^* \right) \pi (N^*), \nonumber \\&\propto \exp {\left\{ (s-k)\log (N^*) - (\alpha _{\lambda }+s) \log (N^*+\beta _{\lambda }) \right\} }. \end{aligned}$$
(9)

Step 1 then amounts to generating a value for \(N^* | s\) according to (9), and then generating \(\lambda\) from \(P( \lambda \,|\, N^*, s )\), which is the updated Gamma distribution

$$\begin{aligned} Gamma\left( \alpha _{\lambda } + s \,,\, \beta _{\lambda }+N^*\right) . \end{aligned}$$
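As a rough illustration, step 1 under the Poisson baseline could be sketched in Python as follows. The function name, the truncation point `n_max` (standing in for the "sufficiently large range" of values) and the choice to start the grid at 1 are assumptions made for this example, not part of the original algorithm; only the standard library is used.

```python
import math
import random


def sample_step1_poisson(s, k, alpha, beta, n_max, rng):
    """Draw (N*, lambda) given s under a Poisson baseline (illustrative sketch).

    s      : observed total number of specimens
    k      : exponent of the prior pi(N*) proportional to (N*)^(-k)
    alpha  : shape of the Gamma prior on lambda
    beta   : rate of the Gamma prior on lambda
    n_max  : hypothetical truncation point for the support of N*
    """
    grid = list(range(1, n_max + 1))
    # Unnormalised log-weights from Eq. (9).
    logw = [(s - k) * math.log(n) - (alpha + s) * math.log(n + beta) for n in grid]
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(logw)
    weights = [math.exp(lw - m) for lw in logw]
    n_star = rng.choices(grid, weights=weights, k=1)[0]
    # random.gammavariate takes (shape, scale); the posterior rate is beta + N*.
    lam = rng.gammavariate(alpha + s, 1.0 / (beta + n_star))
    return n_star, lam


rng = random.Random(0)
n_star, lam = sample_step1_poisson(s=50, k=1, alpha=1.0, beta=1.0, n_max=500, rng=rng)
```

Working on the log scale keeps the weights finite when s is large; in practice `n_max` would have to be chosen large enough that the truncated tail of (9) is negligible.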

As for step 2 of our scheme, note that the distribution of \((n_0^*,n_1^*,n_2^*,...)\) conditional on \(N^*\) and s does not depend on \(\lambda\). Indeed, it is well known that \(N^*\) independent and identically distributed Poisson variables, conditional on their sum being s, are jointly Multinomial with uniform probabilities:

$$\begin{aligned} \left( X^*_1,...,X^*_{N^*}\,|\, \textstyle \sum X^*_i = s\right) \sim Mult\big (s, (1/N^*,...,1/N^*)\big ). \end{aligned}$$

So, we can generate \((x_1^*,...,x^*_{N^*})\) having fixed sum s directly from the Multinomial above.
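A minimal sketch of this step, assuming the frequency counts \(n_j^*\) are obtained by tabulating how many of the \(N^*\) species receive exactly j specimens. The function names are hypothetical and the uniform Multinomial is drawn by assigning each specimen to a species uniformly at random.

```python
import random
from collections import Counter


def sample_counts_poisson(n_star, s, rng):
    """Draw (x_1, ..., x_{N*}) from Mult(s, (1/N*, ..., 1/N*))."""
    x = [0] * n_star
    for _ in range(s):
        # Each specimen is equally likely to belong to any of the N* species.
        x[rng.randrange(n_star)] += 1
    return x


def frequency_counts(x):
    """Map j -> n_j, the number of species with exactly j specimens (j = 0 included)."""
    return Counter(x)


rng = random.Random(1)
x = sample_counts_poisson(n_star=20, s=50, rng=rng)
freq = frequency_counts(x)
```

Here `freq[0]` plays the role of \(n_0^*\), the number of species that received no specimens in the simulated sample.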

1.2 Geometric

Under a Geo(p) baseline distribution we have

$$\begin{aligned} \sum X_i^* \,|\; p,N^* \sim NegBin(N^*,p). \end{aligned}$$

Then, if we adopt a prior \(p \sim Beta(\alpha _p, \beta _p)\), by integrating out p, we have

$$\begin{aligned} P\left( \sum X_i^*=s \,|\, N^*\right) = \left( {\begin{array}{c}N^*+s-1\\ s\end{array}}\right) \frac{B(\alpha _p+N^*,\beta _p+s)}{B(\alpha _p,\beta _p)}, \end{aligned}$$

where B denotes the Beta function. Thus, we can calculate the probability \(P(N^* \,|\, s)\) for a sufficiently large range of values as:

$$\begin{aligned} P(N^* \,|\, s) \propto \frac{\Gamma (N^*+s) \Gamma (\alpha _p+N^*)}{\Gamma (N^*) \Gamma (\alpha _p+\beta _p+N^*+s) } \cdot \pi (N^*), \end{aligned}$$
(10)

and generate values for \(N^*\) accordingly. Then, we can generate values for p from \(P(p \,|\, N^*,s)\), which is the updated Beta distribution:

$$\begin{aligned} Beta(\alpha _p + N^*, \beta _p+s). \end{aligned}$$
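Step 1 under the Geometric baseline can be sketched along the same lines. The example below evaluates the kernel \(\Gamma (N^*+s)\Gamma (\alpha _p+N^*)/\{\Gamma (N^*)\Gamma (\alpha _p+\beta _p+N^*+s)\}\), which follows from the Beta-integrated marginal above, on a truncated grid. The prior family \(\pi (N^*) \propto (N^*)^{-k}\), the truncation point `n_max` and the function names are assumptions made for illustration.

```python
import math
import random


def log_kernel_geometric(n, s, alpha_p, beta_p, k):
    """Log of the unnormalised P(N* = n | s), assuming pi(N*) proportional to n^(-k)."""
    return (math.lgamma(n + s) + math.lgamma(alpha_p + n)
            - math.lgamma(n) - math.lgamma(alpha_p + beta_p + n + s)
            - k * math.log(n))


def sample_step1_geometric(s, k, alpha_p, beta_p, n_max, rng):
    """Draw (N*, p) given s under a Geometric baseline (illustrative sketch)."""
    grid = list(range(1, n_max + 1))
    logw = [log_kernel_geometric(n, s, alpha_p, beta_p, k) for n in grid]
    m = max(logw)  # stabilise the exponentials
    weights = [math.exp(lw - m) for lw in logw]
    n_star = rng.choices(grid, weights=weights, k=1)[0]
    # Updated Beta distribution for p given N* and s.
    p = rng.betavariate(alpha_p + n_star, beta_p + s)
    return n_star, p


rng = random.Random(0)
n_star, p = sample_step1_geometric(s=50, k=1, alpha_p=1.0, beta_p=1.0, n_max=500, rng=rng)
```

Using `math.lgamma` rather than the Gamma function itself avoids overflow for realistic values of s.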

The distribution of \((n_0^*,n_1^*,n_2^*,...)\) conditional on \(N^*\) and s does not depend on p. In fact, all possible vectors \((x_1^*,...,x^*_{N^*})\) with fixed sum s have the same probability \(\left( {\begin{array}{c}N^*+s-1\\ s\end{array}}\right) ^{-1}\), equal to the reciprocal of the number of possible nonnegative integer \(N^*\)-vectors summing to s (or weak \(N^*\)-compositions of s). As a consequence, step 2 of our scheme can be completed by using an algorithm for random compositions (see, e.g., Nijenhuis and Wilf (1978) or Stojmenović (1992)) to generate a nonnegative integer vector \((x_1^*,...,x^*_{N^*})\) with fixed sum s.
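One simple way to draw a uniformly distributed weak composition is the classical "stars and bars" construction, sketched below. This is offered as an illustration only and is not necessarily the algorithm given in the cited references; the function name is hypothetical.

```python
import random


def random_weak_composition(s, n_star, rng):
    """Draw a uniform random nonnegative integer vector of length n_star summing to s.

    Stars and bars: choose n_star - 1 bar positions among s + n_star - 1 slots
    uniformly at random; the numbers of stars between consecutive bars are the parts.
    """
    total_slots = s + n_star - 1
    bars = sorted(rng.sample(range(total_slots), n_star - 1))
    parts = []
    prev = -1
    for b in bars:
        parts.append(b - prev - 1)  # stars strictly between two consecutive bars
        prev = b
    parts.append(total_slots - prev - 1)  # stars after the last bar
    return parts


rng = random.Random(2)
x = random_weak_composition(s=50, n_star=20, rng=rng)
```

Because every set of bar positions corresponds to exactly one weak composition, sampling the positions uniformly gives each composition the same probability.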

1.3 Negative Binomial

Under a NegBin(r, p) baseline distribution we have:

$$\begin{aligned} P\left( \sum _{i=1}^{N^*} X_i^* = s \,|\, r,p,N^*\right) = \left( {\begin{array}{c}rN^*+s-1\\ s\end{array}}\right) p^{rN^*} (1-p)^s. \end{aligned}$$
(11)

To generate values from the posterior of \((N^*,r,p)\) given s, we can generate values from their (independent) priors and accept them with probability given by (11), which, being a probability mass, lies between 0 and 1.
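This accept/reject step can be sketched as follows. The priors are passed in as a user-supplied function `draw_prior`, and the specific priors used in the usage lines are placeholders chosen purely for illustration; they are assumptions, as are the function names and the cap `max_tries`.

```python
import math
import random


def log_pmf_sum_negbin(s, r, p, n_star):
    """Log of Eq. (11): P(sum of N* i.i.d. NegBin(r, p) variables = s)."""
    size = r * n_star
    # The binomial coefficient with real-valued 'size' is written with Gamma functions.
    return (math.lgamma(size + s) - math.lgamma(s + 1) - math.lgamma(size)
            + size * math.log(p) + s * math.log1p(-p))


def sample_step1_negbin(s, draw_prior, rng, max_tries=100000):
    """Rejection sampler: propose (N*, r, p) from the priors, accept w.p. Eq. (11)."""
    for _ in range(max_tries):
        n_star, r, p = draw_prior(rng)
        if rng.random() < math.exp(log_pmf_sum_negbin(s, r, p, n_star)):
            return n_star, r, p
    return None  # no proposal accepted within max_tries


def example_prior(rng):
    # Placeholder priors, for illustration only.
    return rng.randint(1, 30), rng.uniform(0.5, 3.0), rng.uniform(0.05, 0.95)


rng = random.Random(3)
draw = sample_step1_negbin(s=10, draw_prior=example_prior, rng=rng)
```

The acceptance probability can be very small when s is large, so a sampler of this kind may need many proposals per accepted draw.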

Consider the following result (see, e.g., Guimaraes and Lindrooth 2007):

Proposition 3

Let \(X^*_i \sim NegBin(r,p)\), \(i=1,...,N^*\), be independent, and let \(\sum _{i=1}^{N^*}X^*_i=s\). Then:

$$\begin{aligned} P(x_1^*,...,x^*_{N^*} | s) = \frac{\Gamma (N^* r) \Gamma (s+1)}{\Gamma (s+N^* r)} \prod _{i=1}^{N^*} \frac{\Gamma (x^*_i +r)}{\Gamma (r) \Gamma (x^*_i+1)}. \end{aligned}$$

Then, for the second step, we generate \(x^*_1,...,x^*_{N^*}\) with fixed sum s from the Dirichlet-Multinomial distribution defined in Proposition 3.
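A Dirichlet-Multinomial draw can be obtained through its usual compound representation: draw probabilities from a symmetric Dirichlet with parameter r (here via normalised Gamma variates), then draw a Multinomial with those probabilities. The sketch below assumes this representation; the function names are hypothetical, and the log-probability function simply transcribes the formula of Proposition 3 as a sanity check.

```python
import math
import random


def sample_dirichlet_multinomial(n_star, s, r, rng):
    """Draw (x_1, ..., x_{N*}) with sum s from a symmetric Dirichlet-Multinomial."""
    # Gamma(r, 1) variates are proportional to a Dirichlet(r, ..., r) draw.
    g = [rng.gammavariate(r, 1.0) for _ in range(n_star)]
    x = [0] * n_star
    # random.choices accepts unnormalised weights, so no explicit normalisation is needed.
    for i in rng.choices(range(n_star), weights=g, k=s):
        x[i] += 1
    return x


def dm_log_pmf(x, r):
    """Log-probability of x given its sum, as in Proposition 3."""
    n_star, s = len(x), sum(x)
    out = math.lgamma(n_star * r) + math.lgamma(s + 1) - math.lgamma(s + n_star * r)
    for xi in x:
        out += math.lgamma(xi + r) - math.lgamma(r) - math.lgamma(xi + 1)
    return out


rng = random.Random(4)
x = sample_dirichlet_multinomial(n_star=20, s=50, r=1.5, rng=rng)
```

For very small r the Gamma variates can underflow to zero, so a production implementation would need to guard against an all-zero weight vector.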

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Di Cecco, D., Tancredi, A. Estimating the number of sequencing errors in microbial diversity studies. Environ Ecol Stat 31, 485–507 (2024). https://doi.org/10.1007/s10651-024-00614-w

  • DOI: https://doi.org/10.1007/s10651-024-00614-w
