A computationally efficient approach to estimating species richness and rarefaction curve


In ecological and educational studies, estimators of the total number of species and of the rarefaction curve based on empirical samples are important tools. We propose a new method that estimates both the rarefaction curve and the number of species using a ready-made numerical approach such as quadratic optimization. The proposed algorithm is based on nonparametric empirical Bayes estimation that incorporates an interpolated rarefaction curve through quadratic optimization with linear constraints, building on g-modeling in Efron (Stat Sci 29:285–301, 2014). The algorithm is easy to implement and performs better than existing methods in terms of computational speed and accuracy. Furthermore, we provide a model selection criterion for choosing the tuning parameters in the estimation procedure, and a confidence interval based on asymptotic theory rather than on resampling. We present asymptotic results for our estimator to validate its efficiency theoretically. A broad range of numerical studies, including simulations and real data examples, is also conducted, and the resulting gain is compared with existing methods.
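To make the "interpolated rarefaction curve" concrete, the following is a minimal sketch of the classical individual-based interpolation formula (Hurlbert 1971), not the authors' quadratic-optimization estimator. The function name `rarefaction_curve` and the toy frequency counts are assumptions introduced for illustration only.

```python
from math import comb


def rarefaction_curve(freq_counts, sizes=None):
    """Expected number of species in a random subsample of m individuals.

    freq_counts maps an abundance j to g(j), the number of species observed
    exactly j times. The full sample size is s = sum_j j * g(j).
    """
    s = sum(j * g for j, g in freq_counts.items())
    if sizes is None:
        sizes = range(s + 1)
    curve = {}
    for m in sizes:
        # A species seen j times is missed by a size-m subsample drawn without
        # replacement with probability C(s - m, j) / C(s, j).
        curve[m] = sum(
            g * (1 - comb(s - m, j) / comb(s, j))
            for j, g in freq_counts.items()
        )
    return curve


# Toy data (made up): 3 singletons, 2 doubletons, 1 species seen 5 times.
toy = {1: 3, 2: 2, 5: 1}
curve = rarefaction_curve(toy)
```

With these toy counts the sample size is 12, a subsample of one individual contains exactly one species in expectation, and the full sample returns all six observed species.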


  • Acinas SG, Klepac-Ceraj V, Hunt DE, Pharino C, Ceraj I, Distel DL, Polz MF (2004) Fine-scale phylogenetic architecture of a complex bacterial community. Nature 430(6999):551

  • Baayen RH (2002) Word frequency distributions, vol 18. Springer, Berlin

  • Barger K, Bunge J (2010) Objective Bayesian estimation for the number of species. Bayesian Anal 5(4):765–785

  • Ben-Hamou A, Boucheron S, Ohannessian MI et al (2017) Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli 23(1):249–287

  • Bertram D (1949) Studies on the transmission of cotton rat filariasis: I.—the variability of the intensities of infection in the individuals of the vector, liponyssus bacoti, its causation and its bearing on the problem of quantitative transmission. Ann Trop Med Parasitol 43(3–4):313–332

  • Böhning D (1994) A note on a test for Poisson overdispersion. Biometrika 81(2):418–419

  • Böhning D (1999) Computer-assisted analysis of mixtures and applications: meta-analysis, disease mapping and others, vol 81. CRC Press, Cambridge

  • Böhning D, Schön D (2005) Nonparametric maximum likelihood estimation of population size based on the counting distribution. J R Stat Soc Ser C (Appl Stat) 54(4):721–737

  • Bunge J, Fitzpatrick M (1993) Estimating the number of species: a review. J Am Stat Assoc 88(421):364–373

  • Chee CS, Wang Y (2016) Nonparametric estimation of species richness using discrete k-monotone distributions. Comput Stat Data Anal 93:107–118

  • Colwell RK, Coddington JA (1994) Estimating terrestrial biodiversity through extrapolation. Philos Trans R Soc Lond B Biol Sci 345(1311):101–118

  • Colwell RK, Chao A, Gotelli NJ, Lin SY, Mao CX, Chazdon RL, Longino JT (2012) Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. J Plant Ecol 5(1):3–21

  • Efron B (2014) Two modeling strategies for empirical Bayes estimation. Stat Sci 29:285–301

  • Efron B, Thisted R (1976) Estimating the number of unseen species: how many words did Shakespeare know? Biometrika 63(3):435–447

  • Gnedin A, Hansen B, Pitman J et al (2007) Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws. Probab Surv 4:146–171

  • Good I, Toulmin G (1956) The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43(1–2):45–63

  • Greenshtein E, Itskov T (2018) Application of non-parametric empirical Bayes to treatment of non-response. Stat Sin 28:2189–2208

  • Greenshtein E, Ritov Y et al (2004) Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6):971–988

  • Hasselblad V (1969) Estimation of finite mixtures of distributions from the exponential family. J Am Stat Assoc 64(328):1459–1471

  • Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67

  • Hurlbert SH (1971) The nonconcept of species diversity: a critique and alternative parameters. Ecology 52(4):577–586

  • Jiang W, Zhang CH (2009) General maximum likelihood empirical Bayes estimation of normal means. Ann Stat 37(4):1647–1684

  • Mao CX (2007) Estimating species accumulation curves and diversity indices. Stat Sin 17:761–774

  • Mao CX, Colwell RK (2005) Estimation of species richness: mixture models, the role of rare species, and inferential challenges. Ecology 86(5):1143–1153

  • Mao CX, Colwell RK, Chang J (2005) Estimating the species accumulation curve using mixtures. Biometrics 61(2):433–441

  • Norris JL, Pollock KH (1996) Nonparametric MLE under two closed capture-recapture models with heterogeneity. Biometrics 52:639–649

  • Norris JL, Pollock KH (1998) Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Environ Ecol Stat 5(4):391–402

  • Orlitsky A, Suresh AT, Wu Y (2016) Optimal prediction of the number of unseen species. Proc Natl Acad Sci 113(47):13283–13288

  • Palmer MW (1990) The estimation of species richness by extrapolation. Ecology 71(3):1195–1198

  • Portnoy S et al (1984) Asymptotic behavior of \(m\)-estimators of \(p\) regression parameters when \(p^2/n\) is large. I. consistency. Ann Stat 12(4):1298–1309

  • Sanders HL (1968) Marine benthic diversity: a comparative study. Am Nat 102(925):243–282

  • Shen TJ, Chao A, Lin CF (2003) Predicting the number of new species in further taxonomic sampling. Ecology 84(3):798–804

  • Simar L et al (1976) Maximum likelihood estimation of a compound Poisson process. Ann Stat 4(6):1200–1209

  • Smith W, Grassle JF (1977) Sampling properties of a family of diversity measures. Biometrics 33:283–292

  • Spevack M (1968) A complete and systematic concordance to the works of Shakespeare, vol 1–6. Georg Olms, Hildesheim

  • Ugland KI, Gray JS, Ellingsen KE (2003) The species-accumulation curve and estimation of species richness. J Anim Ecol 72(5):888–897

  • Wang JP (2010) Estimating species richness by a Poisson-compound gamma model. Biometrika 97(3):727–740

  • Wang JPZ, Lindsay BG (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. J Am Stat Assoc 100(471):942–959

  • Wang JP et al (2011) SPECIES: an R package for species richness estimation. J Stat Softw 40(9):1–15



We thank the Editor-in-Chief, Associate Editor and referees. Research of S. Baek was supported in part by the Summer Research Faculty Fellowship (SURFF) from the University of Maryland, Baltimore County. Research of J. Park was supported in part by the New Faculty Startup Fund from Seoul National University and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C1A01100526).

Author information


Corresponding author

Correspondence to Junyong Park.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.



A.1 Proof of Lemma 1


The proof of \(E({\hat{c}})=c\):

The two given conditions, namely \(\mathbf{f} = (\int \{1-(1-p)^m\} d {\hat{H}}_J(p))_{1\le m \le s}\) and each component \(f_m\) of \( \mathbf{f}\) being an unbiased estimator of \( A(m) = \int \{1-(1-p)^m\}dH_J(p)\), are equivalent to

$$\begin{aligned} \sum _{j=1}^J p_j^m \{E({\hat{w}}_j^*) - w_j \} \equiv \sum _{j=1}^J p_j^{m-1} \delta _j =0 \end{aligned}$$

for \(1\le m \le s\), where \(\delta _j =p_j \{E({\hat{w}}_j^*) - w_j \}\). The equivalence holds because \(1-(1-p)^m = \sum _{k=1}^m (-1)^{k+1} {m \atopwithdelims ()k} p^k\), so the functions \(\{1-(1-p)^m\}_{1\le m \le s}\) and the monomials \(\{p^m\}_{1\le m \le s}\) span the same linear space. This can be expressed as

$$\begin{aligned} \mathcal{P} \varvec{\delta } =\mathbf{0}, \end{aligned}$$

where \(\mathcal{P} =(p_{mj})_{1\le m \le s, 1\le j \le J} =(p_j^m)_{0\le m \le s-1, 1\le j \le J}\) is a Vandermonde matrix and \(\varvec{\delta } = (\delta _j)_{1\le j \le J}\). A Vandermonde matrix with distinct \(p_j\) has full rank, so if \( J \le s\), then the rank of \(\mathcal{P}\) is J, which leads to \({\varvec{\delta }} =\mathbf{0}\). Hence \( p_j E({\hat{w}}_j^*) = p_j w_j\), and since \(p_j > 0\), we obtain \(E({\hat{w}}_j^*) = w_j\). Finally, \( E({\hat{c}}) = E(\sum _{j=1}^J {\hat{w}}_j^*) = \sum _{j=1}^J w_j =c\).
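The two algebraic facts used in this part of the proof can be checked with a short sketch: the binomial expansion of \(1-(1-p)^m\) into monomials without a constant term, and the nonzero determinant of a square Vandermonde matrix with distinct points. The support points below are hypothetical, and the helper names are assumptions introduced for illustration.

```python
from fractions import Fraction
from math import comb


def one_minus_power(p, m):
    """Left-hand side: 1 - (1 - p)^m."""
    return 1 - (1 - p) ** m


def monomial_expansion(p, m):
    """Right-hand side: sum_{k=1}^m (-1)^(k+1) C(m, k) p^k (no constant term)."""
    return sum((-1) ** (k + 1) * comb(m, k) * p ** k for k in range(1, m + 1))


def vandermonde_det(points):
    """Determinant of the square Vandermonde matrix (p_j^(m-1)),
    computed with the product formula prod_{i<j} (p_j - p_i)."""
    det = Fraction(1)
    for j in range(len(points)):
        for i in range(j):
            det *= points[j] - points[i]
    return det


# Hypothetical distinct support points p_j in (0, 1); exact rational arithmetic.
support = [Fraction(1, 10), Fraction(1, 4), Fraction(1, 2)]

identity_holds = all(
    one_minus_power(p, m) == monomial_expansion(p, m)
    for p in support
    for m in range(1, 6)
)
det = vandermonde_det(support)
```

Because the determinant is nonzero for distinct points, the only solution of the square system is the zero vector. A repeated support point would make the determinant vanish, which is why distinctness matters.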


The proof of \(\hbox {var}({\hat{c}})=O \left( c \frac{\xi _1}{\xi _M} \right) \):

For a sequence \(m = [\theta s]\), the integer part of \(\theta s\) for some \( 0< \theta <1\), we can take a \(\delta >0 \) such that \(\min _{ p \in [\epsilon _1, \epsilon _2] } \{1-(1-p)^m\} \ge 1-(1-\epsilon _1)^m \ge \delta >0\) by \((\mathbf{A2})\) in Assumption A. By \((\mathbf{A3})\), we only need to consider the first M coordinates, so we define the M-dimensional vectors \(\hat{\mathbf{w}}^*_M =({\hat{w}}_1^*, \ldots , {\hat{w}}_M^*)^{\mathrm{T}}\) and \(\mathbf{1}_M=(1,1,\ldots , 1)^{\mathrm{T}}\) as well as the covariance matrix \({\varvec{\varSigma }}_M = \hbox {var}(\hat{\mathbf{w}}_M^*)\). We also define \( \mathbf{a}_M =(a_1,\ldots , a_M)^{\mathrm{T}}\) with \( a_i = 1-(1-p_i)^m \) for \(1\le i \le M\). Then we have

$$\begin{aligned} \delta {\hat{c}} = \delta \mathbf{1}_M^{\mathrm{T}}\hat{\mathbf{w}}_M^* \le \mathbf{a}_M ^{\mathrm{T}}\hat{\mathbf{w}}_M^* \le \mathbf{1}_M^{\mathrm{T}}\hat{\mathbf{w}}_M^* = {\hat{c}}. \end{aligned}$$

Let \((\mathbf{e}_j, \xi _j)\), \(1\le j \le M\), be the eigenvector and eigenvalue pairs of \({\varvec{\varSigma }}_{M}\), where \( \xi _1 \ge \cdots \ge \xi _M >0 \). We have \(\hbox {var}(f_m) = \mathbf{a}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{a}_M\), where \(f_m=\mathbf{a}_M ^{\mathrm{T}}\hat{\mathbf{w}}_M^*\), and \(\hbox {var}({\hat{c}}) = \mathbf{1}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{1}_M\). Using these together with \(a_i \ge \delta \), which gives \( \mathbf{1}^{\mathrm{T}}_M \mathbf{1}_M \le \frac{1}{\delta ^2} \mathbf{a}^{\mathrm{T}}_M \mathbf{a}_M\), we derive

$$\begin{aligned} \frac{\hbox {var}({\hat{c}})}{\hbox {var}(f_m)} = \frac{\mathbf{1}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{1}_M}{\mathbf{a}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{a}_M} \le \frac{\xi _1 }{\xi _M}\frac{\mathbf{1}^{\mathrm{T}}_M \mathbf{1}_M}{ \mathbf{a}^{\mathrm{T}}_M \mathbf{a}_M } \le \frac{\xi _1}{\xi _{M}} \frac{1}{\delta ^2} =O\left( \frac{\xi _1}{\xi _M}\right) . \end{aligned}$$

Since we have

$$\begin{aligned} \hbox {var}(f_m) \le c \max _{1\le j \le s} \left\{ 1- \frac{{s-m \atopwithdelims ()j}}{ {s \atopwithdelims ()j}}\right\} ^2 \sum _{j=1}^s g(j) \le c \{1-(1-\epsilon _2)^s\} \le c, \end{aligned}$$

we obtain \(\hbox {var}( {\hat{c}}) \le \hbox {var}(f_m) O(\frac{\xi _1}{\xi _M}) =O \left( c \frac{\xi _1}{\xi _M} \right) \).
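The eigenvalue step in the variance bound can be sanity-checked numerically. The sketch below uses a hypothetical 2 × 2 covariance matrix (so M = 2) with made-up values for \(p_i\) and m, and takes \(\delta\) as the smallest \(a_i\). It illustrates the inequality chain only and is not part of the authors' procedure.

```python
from math import sqrt


def eigenvalues_2x2(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2
    radius = sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius


def quad_form(v, a, b, d):
    """v^T Sigma v for Sigma = [[a, b], [b, d]]."""
    return a * v[0] ** 2 + 2 * b * v[0] * v[1] + d * v[1] ** 2


# Hypothetical inputs (assumptions for illustration only).
m = 5
p = [0.05, 0.30]
sigma = (4.0, 1.5, 3.0)  # entries a, b, d of a positive definite matrix

a_vec = [1 - (1 - p_i) ** m for p_i in p]  # a_i = 1 - (1 - p_i)^m
delta = min(a_vec)                         # lower bound on every a_i
ones = [1.0, 1.0]

xi_1, xi_M = eigenvalues_2x2(*sigma)

# var(c_hat) / var(f_m) written as a ratio of quadratic forms.
ratio = quad_form(ones, *sigma) / quad_form(a_vec, *sigma)

# Intermediate and final bounds from the displayed inequality chain.
middle = (xi_1 / xi_M) * sum(x * x for x in ones) / sum(x * x for x in a_vec)
bound = (xi_1 / xi_M) / delta ** 2
```

The ratio of quadratic forms is bounded above by the eigenvalue ratio times \(\mathbf{1}^{\mathrm{T}}\mathbf{1}/\mathbf{a}^{\mathrm{T}}\mathbf{a}\), which in turn is at most the eigenvalue ratio divided by \(\delta ^2\).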

Cite this article

Baek, S., Park, J. A computationally efficient approach to estimating species richness and rarefaction curve. Comput Stat 37, 1919–1941 (2022).


Keywords

  • Nonparametric empirical Bayes
  • Quadratic optimization
  • Rarefaction curve
  • Species richness