Abstract
In ecological and educational studies, estimators of the total number of species and of the rarefaction curve based on empirical samples are important tools. We propose a new method that estimates both the rarefaction curve and the number of species using a ready-made numerical approach, namely quadratic optimization. The proposed algorithm builds on nonparametric empirical Bayes estimation: it incorporates an interpolated rarefaction curve through quadratic optimization with linear constraints, based on the g-modeling of Efron (Stat Sci 29:285–301, 2014). The algorithm is easy to implement and outperforms existing methods in both computational speed and accuracy. Furthermore, we provide a model selection criterion for choosing the tuning parameters in the estimation procedure, and the idea of a confidence interval based on asymptotic theory rather than on resampling. We present asymptotic results that support the efficiency of our estimator theoretically. A broad range of numerical studies, including simulations and real data examples, is also conducted to compare the gains of the proposed method with existing methods.
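For readers unfamiliar with the interpolated rarefaction curve mentioned above, the following minimal sketch evaluates the classical interpolation formula \(\sum_j g(j)\{1-\binom{s-m}{j}/\binom{s}{j}\}\), the same combinatorial term that appears in the Appendix. It is not the estimator proposed in this article. The function name `rarefaction`, the dictionary input format and the example counts are assumptions made only for illustration, with \(g(j)\) taken to be the number of species observed exactly \(j\) times and \(s\) the total number of individuals.

```python
from math import comb


def rarefaction(freq_counts, m):
    """Expected number of species in a random subsample of m individuals.

    freq_counts maps j -> g(j), the number of species observed exactly
    j times (an assumed input format). The sample size s is sum_j j * g(j).
    """
    s = sum(j * g for j, g in freq_counts.items())
    if not 0 <= m <= s:
        raise ValueError("m must lie between 0 and the sample size s")
    # math.comb(n, k) returns 0 when k > n, so species with j > s - m
    # are counted as certainly present in the subsample.
    return sum(g * (1 - comb(s - m, j) / comb(s, j))
               for j, g in freq_counts.items())


# Hypothetical frequency counts: 3 singletons, 2 doubletons, 1 species seen 5 times.
counts = {1: 3, 2: 2, 5: 1}
sample_size = sum(j * g for j, g in counts.items())
curve = [rarefaction(counts, m) for m in range(sample_size + 1)]
```

The curve starts at zero, equals one for a subsample of a single individual, and reaches the observed number of species when \(m = s\).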

References
Acinas SG, Klepac-Ceraj V, Hunt DE, Pharino C, Ceraj I, Distel DL, Polz MF (2004) Fine-scale phylogenetic architecture of a complex bacterial community. Nature 430(6999):551
Baayen RH (2002) Word frequency distributions, vol 18. Springer, Berlin
Barger K, Bunge J (2010) Objective Bayesian estimation for the number of species. Bayesian Anal 5(4):765–785
Ben-Hamou A, Boucheron S, Ohannessian MI et al (2017) Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli 23(1):249–287
Bertram D (1949) Studies on the transmission of cotton rat filariasis: I.—the variability of the intensities of infection in the individuals of the vector, Liponyssus bacoti, its causation and its bearing on the problem of quantitative transmission. Ann Trop Med Parasitol 43(3–4):313–332
Böhning D (1994) A note on a test for Poisson overdispersion. Biometrika 81(2):418–419
Böhning D (1999) Computer-assisted analysis of mixtures and applications: meta-analysis, disease mapping and others, vol 81. CRC Press, Cambridge
Böhning D, Schön D (2005) Nonparametric maximum likelihood estimation of population size based on the counting distribution. J R Stat Soc Ser C (Appl Stat) 54(4):721–737
Bunge J, Fitzpatrick M (1993) Estimating the number of species: a review. J Am Stat Assoc 88(421):364–373
Chee CS, Wang Y (2016) Nonparametric estimation of species richness using discrete k-monotone distributions. Comput Stat Data Anal 93:107–118
Colwell RK, Coddington JA (1994) Estimating terrestrial biodiversity through extrapolation. Philos Trans R Soc Lond B Biol Sci 345(1311):101–118
Colwell RK, Chao A, Gotelli NJ, Lin SY, Mao CX, Chazdon RL, Longino JT (2012) Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. J Plant Ecol 5(1):3–21
Efron B (2014) Two modeling strategies for empirical Bayes estimation. Stat Sci 29:285–301
Efron B, Thisted R (1976) Estimating the number of unseen species: how many words did Shakespeare know? Biometrika 63(3):435–447
Gnedin A, Hansen B, Pitman J et al (2007) Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws. Probab Surv 4:146–171
Good I, Toulmin G (1956) The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43(1–2):45–63
Greenshtein E, Itskov T (2018) Application of non-parametric empirical Bayes to treatment of non-response. Stat Sin 28:2189–2208
Greenshtein E, Ritov Y et al (2004) Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6):971–988
Hasselblad V (1969) Estimation of finite mixtures of distributions from the exponential family. J Am Stat Assoc 64(328):1459–1471
Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67
Hurlbert SH (1971) The nonconcept of species diversity: a critique and alternative parameters. Ecology 52(4):577–586
Jiang W, Zhang CH (2009) General maximum likelihood empirical Bayes estimation of normal means. Ann Stat 37(4):1647–1684
Mao CX (2007) Estimating species accumulation curves and diversity indices. Stat Sin 17:761–774
Mao CX, Colwell RK (2005) Estimation of species richness: mixture models, the role of rare species, and inferential challenges. Ecology 86(5):1143–1153
Mao CX, Colwell RK, Chang J (2005) Estimating the species accumulation curve using mixtures. Biometrics 61(2):433–441
Norris JL, Pollock KH (1996) Nonparametric MLE under two closed capture-recapture models with heterogeneity. Biometrics 52:639–649
Norris JL, Pollock KH (1998) Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Environ Ecol Stat 5(4):391–402
Orlitsky A, Suresh AT, Wu Y (2016) Optimal prediction of the number of unseen species. Proc Natl Acad Sci 113(47):13283–13288
Palmer MW (1990) The estimation of species richness by extrapolation. Ecology 71(3):1195–1198
Portnoy S et al (1984) Asymptotic behavior of \(m\)-estimators of \(p\) regression parameters when \(p^2/n\) is large. I. consistency. Ann Stat 12(4):1298–1309
Sanders HL (1968) Marine benthic diversity: a comparative study. Am Nat 102(925):243–282
Shen TJ, Chao A, Lin CF (2003) Predicting the number of new species in further taxonomic sampling. Ecology 84(3):798–804
Simar L et al (1976) Maximum likelihood estimation of a compound Poisson process. Ann Stat 4(6):1200–1209
Smith W, Grassle JF (1977) Sampling properties of a family of diversity measures. Biometrics 33:283–292
Spevack M (1968) A complete and systematic concordance to the works of Shakespeare, vols 1–6. Georg Olms, Hildesheim
Ugland KI, Gray JS, Ellingsen KE (2003) The species-accumulation curve and estimation of species richness. J Anim Ecol 72(5):888–897
Wang JP (2010) Estimating species richness by a Poisson-compound gamma model. Biometrika 97(3):727–740
Wang JPZ, Lindsay BG (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. J Am Stat Assoc 100(471):942–959
Wang JP et al (2011) SPECIES: an R package for species richness estimation. J Stat Softw 40(9):1–15
Acknowledgements
We thank the Editor-in-Chief, Associate Editor and referees. Research of S. Baek was supported in part by the Summer Research Faculty Fellowship (SURFF) from the University of Maryland, Baltimore County. Research of J. Park was supported in part by the New Faculty Startup Fund from Seoul National University and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C1A01100526).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix
A.1 Proof of Lemma 1
- (i): The proof of \(E({\hat{c}})=c\):
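Let \(\mathbf{f} = (\int \{1-(1-p)^m\} d {\hat{H}}_J(p))_{1\le m \le s}\), whose \(m\)th component is an unbiased estimator of \( A(m) = \int \{1-(1-p)^m\}dH_J(p)\). Because each \(1-(1-p)^m\) is a polynomial in \(p\) of degree \(m\) with no constant term, these \(s\) unbiasedness conditions are equivalent to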
$$\begin{aligned} \sum _{j=1}^J p_j^m \{E({\hat{w}}_j^*) - w_j \} \equiv \sum _{j=1}^J p_j^{m-1} \delta _j =0 \end{aligned}$$
for \(1\le m \le s\), where \(\delta _j =p_j \{E({\hat{w}}_j^*) - w_j \}\). This can be expressed as
$$\begin{aligned} \mathcal{P} \varvec{\delta } =\mathbf{0}, \end{aligned}$$
where \(\mathcal{P} =(p_j^{m-1})_{1\le m \le s, 1\le j \le J}\) is a Vandermonde matrix and \(\varvec{\delta } = (\delta _j)_{1\le j \le J}\). A Vandermonde matrix with distinct nodes \(p_j\) has full rank, so if \( J \le s\), the rank of \(\mathcal{P}\) is J, which leads to \({\varvec{\delta }} =\mathbf{0}\). Hence \( p_j E({\hat{w}}_j^*) = p_j w_j\), which gives \(E({\hat{w}}_j^*) = w_j\) because \(p_j > 0\). Finally, we obtain \( E({\hat{c}}) = E(\sum _{j=1}^J {\hat{w}}_j^*) = \sum _{j=1}^J w_j =c\).
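The rank argument can be illustrated numerically. The sketch below builds the \(s \times J\) matrix with entries \(p_j^{m-1}\) for a hypothetical set of distinct support points and checks that it has rank \(J\), so that \(\mathcal{P} \varvec{\delta } =\mathbf{0}\) admits only the zero solution. The helper name `vandermonde_rows` and the values of `p` and `s` are assumptions chosen for illustration; the second check shows why distinct \(p_j\) are needed.

```python
import numpy as np


def vandermonde_rows(p, s):
    """Return the s x J matrix whose (m, j) entry is p_j**(m-1), m = 1..s."""
    p = np.asarray(p, dtype=float)
    # np.vander with increasing=True has columns p**0, ..., p**(s-1);
    # transposing puts the powers along the rows.
    return np.vander(p, N=s, increasing=True).T


p = np.array([0.05, 0.1, 0.2, 0.4])  # hypothetical distinct support points, J = 4
s = 6                                # hypothetical number of conditions, s >= J

P = vandermonde_rows(p, s)
rank_distinct = np.linalg.matrix_rank(P)

# Full column rank means the only solution of P @ delta = 0 is delta = 0.
delta, *_ = np.linalg.lstsq(P, np.zeros(s), rcond=None)

# With a repeated support point, two columns coincide and the rank drops.
P_repeated = vandermonde_rows([0.1, 0.1, 0.2], s)
rank_repeated = np.linalg.matrix_rank(P_repeated)
```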
- (ii): The proof of \(\hbox {var}({\hat{c}})=O \left( c \frac{\xi _1}{\xi _M} \right) \):
For a sequence \(m = [\theta s]\), the integer part of \(\theta s\) for some \( 0< \theta <1\), we can take a \(\delta >0 \) such that \(\min _{ p \in [\epsilon _1, \epsilon _2] } \{1-(1-p)^m\} \ge (1-(1-\epsilon _1)^m) \ge \delta >0\) from \((\mathbf{A2})\) in Assumption A. Since by (A3) we only need to consider the first M coordinates, we define the M-dimensional vectors \(\hat{\mathbf{w}}^*_M =({\hat{w}}_1^*, \ldots , {\hat{w}}_M^*)^{\mathrm{T}}\) and \(\mathbf{1}_M=(1,1,\ldots , 1)^{\mathrm{T}}\) as well as the covariance matrix \({\varvec{\varSigma }}_M = \hbox {var}(\hat{\mathbf{w}}_M^*)\). We also define \( \mathbf{a}_M =(a_1,\ldots , a_M)^{\mathrm{T}}\) with \( a_i = 1-(1-p_i)^m \) for \(1\le i \le M\). Then we have
$$\begin{aligned} \delta {\hat{c}} = \delta \mathbf{1}_M^{\mathrm{T}}\hat{\mathbf{w}}_M^* \le \mathbf{a}_M ^{\mathrm{T}}\hat{\mathbf{w}}_M^* \le \mathbf{1}_M^{\mathrm{T}}\hat{\mathbf{w}}_M^* = {\hat{c}}. \end{aligned}$$
Let \((\mathbf{e}_j, \xi _j)\), \(1\le j \le M\), be the eigenvectors and eigenvalues of \({\varvec{\varSigma }}_{M}\), where \( \xi _1 \ge \cdots \ge \xi _M >0 \). We have \(\hbox {var}(f_m) = \mathbf{a}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{a}_M\), where \(f_m=\mathbf{a}_M ^{\mathrm{T}}\hat{\mathbf{w}}_M^*\), and \(\hbox {var}({\hat{c}}) = \mathbf{1}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{1}_M\). Using these and \(a_i \ge \delta \), which leads to \( \mathbf{1}^{\mathrm{T}}_M \mathbf{1}_M \le \frac{1}{\delta ^2} \mathbf{a}^{\mathrm{T}}_M \mathbf{a}_M\), we derive
$$\begin{aligned} \frac{\hbox {var}({\hat{c}})}{\hbox {var}(f_m)} = \frac{\mathbf{1}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{1}_M}{\mathbf{a}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{a}_M} \le \frac{\xi _1 }{\xi _M}\frac{\mathbf{1}^{\mathrm{T}}_M \mathbf{1}_M}{ \mathbf{a}^{\mathrm{T}}_M \mathbf{a}_M } \le \frac{\xi _1}{\xi _{M}} \frac{1}{\delta ^2} =O\left( \frac{\xi _1}{\xi _M}\right) . \end{aligned}$$
Since we have
$$\begin{aligned} \hbox {var}(f_m) \le c \max _{1\le j \le s} \left\{ 1- \frac{\binom{s-m}{j}}{\binom{s}{j}}\right\} ^2 \sum _{j=1}^s g(j) \le c \{1-(1-\epsilon _2)^s\} \le c, \end{aligned}$$
we obtain \(\hbox {var}( {\hat{c}}) \le \hbox {var}(f_m) O(\frac{\xi _1}{\xi _M}) =O \left( c \frac{\xi _1}{\xi _M} \right) \).
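The eigenvalue step of part (ii) rests on two standard facts for a symmetric positive definite matrix: \(\mathbf{1}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{1}_M \le \xi _1 \mathbf{1}_M^{\mathrm{T}}\mathbf{1}_M\) and \(\mathbf{a}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{a}_M \ge \xi _M \mathbf{a}_M^{\mathrm{T}}\mathbf{a}_M\). The following sketch checks the resulting chain of bounds on a randomly generated example. The covariance matrix, the support points and the constants `M`, `m` and `eps1` are all hypothetical stand-ins, not quantities taken from the article's simulations.

```python
import numpy as np

rng = np.random.default_rng(0)

M, m, eps1 = 5, 3, 0.1                  # hypothetical dimension, subsample size, lower bound on p
p = rng.uniform(eps1, 0.6, size=M)      # hypothetical support points with p_i >= eps1
a = 1.0 - (1.0 - p) ** m                # a_i = 1 - (1 - p_i)^m
delta = 1.0 - (1.0 - eps1) ** m         # lower bound satisfying a_i >= delta
ones = np.ones(M)

# Random symmetric positive definite stand-in for the covariance matrix.
B = rng.normal(size=(M, M))
Sigma = B @ B.T + 0.1 * np.eye(M)

# eigvalsh returns eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(Sigma)
xi_M, xi_1 = eigenvalues[0], eigenvalues[-1]

ratio = (ones @ Sigma @ ones) / (a @ Sigma @ a)    # var(c_hat) / var(f_m)
middle = (xi_1 / xi_M) * (ones @ ones) / (a @ a)   # Rayleigh quotient bound
bound = (xi_1 / xi_M) / delta**2                   # bound after using a_i >= delta
```

In this example `ratio <= middle <= bound`, which mirrors the inequality chain used to control \(\hbox {var}({\hat{c}})\) by \(\hbox {var}(f_m)\).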
Cite this article
Baek, S., Park, J. A computationally efficient approach to estimating species richness and rarefaction curve. Comput Stat 37, 1919–1941 (2022). https://doi.org/10.1007/s00180-021-01185-1
DOI: https://doi.org/10.1007/s00180-021-01185-1
Keywords
- Nonparametric empirical Bayes
- Quadratic optimization
- Rarefaction curve
- Species richness