## Abstract

In ecological and educational studies, estimators of the total number of species and of the rarefaction curve based on empirical samples are important tools. We propose a new method that estimates both the rarefaction curve and the number of species using a ready-made numerical approach such as quadratic optimization. The key idea of the proposed algorithm is nonparametric empirical Bayes estimation that incorporates an interpolated rarefaction curve through quadratic optimization with linear constraints, based on *g*-modeling in Efron (Stat Sci 29:285–301, 2014). The proposed algorithm is easy to implement and outperforms existing methods in computational speed and accuracy. Furthermore, we provide a model selection criterion for choosing the tuning parameters of the estimation procedure, and a confidence interval based on asymptotic theory rather than on resampling. We present asymptotic results for our estimator to support its efficiency theoretically. A broad range of numerical studies, including simulations and real data examples, is also conducted, and the resulting gains are compared with existing methods.
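
To make the notion of an interpolated rarefaction curve concrete, the following is a minimal sketch of the classical individual-based rarefaction formula of Hurlbert (1971), computed from observed species abundances. It is an illustration only, not the estimator proposed in this article; the function name `rarefaction_curve` and the example abundances are hypothetical.

```python
from math import comb


def rarefaction_curve(abundances, max_m=None):
    """Expected number of distinct species in a random subsample of m individuals.

    Uses Hurlbert's individual-based rarefaction formula for m = 0, ..., max_m.
    `abundances` holds the observed count of each species (hypothetical input).
    """
    counts = [x for x in abundances if x > 0]
    n = sum(counts)        # total number of sampled individuals
    s_obs = len(counts)    # number of observed species
    if max_m is None:
        max_m = n
    if max_m > n:
        raise ValueError("rarefaction interpolates only up to the sample size n")

    curve = []
    for m in range(max_m + 1):
        total = comb(n, m)
        # comb(n - x, m) / total is the probability that a species with
        # x individuals is absent from a subsample of size m.
        expected_missed = sum(comb(n - x, m) for x in counts) / total
        curve.append(s_obs - expected_missed)
    return curve


if __name__ == "__main__":
    # Four species with abundances 5, 3, 1 and 1 (made-up data).
    for m, value in enumerate(rarefaction_curve([5, 3, 1, 1])):
        print(m, round(value, 4))
```

The curve starts at 0 for an empty subsample and reaches the observed number of species at the full sample size; estimating how it continues beyond that point is the extrapolation problem the article addresses.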

## References

Acinas SG, Klepac-Ceraj V, Hunt DE, Pharino C, Ceraj I, Distel DL, Polz MF (2004) Fine-scale phylogenetic architecture of a complex bacterial community. Nature 430(6999):551

Baayen RH (2002) Word frequency distributions, vol 18. Springer, Berlin

Barger K, Bunge J (2010) Objective Bayesian estimation for the number of species. Bayesian Anal 5(4):765–785

Ben-Hamou A, Boucheron S, Ohannessian MI et al (2017) Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli 23(1):249–287

Bertram D (1949) Studies on the transmission of cotton rat filariasis: I. The variability of the intensities of infection in the individuals of the vector, Liponyssus bacoti, its causation and its bearing on the problem of quantitative transmission. Ann Trop Med Parasitol 43(3–4):313–332

Böhning D (1994) A note on a test for Poisson overdispersion. Biometrika 81(2):418–419

Böhning D (1999) Computer-assisted analysis of mixtures and applications: meta-analysis, disease mapping and others, vol 81. CRC Press, Cambridge

Böhning D, Schön D (2005) Nonparametric maximum likelihood estimation of population size based on the counting distribution. J R Stat Soc Ser C (Appl Stat) 54(4):721–737

Bunge J, Fitzpatrick M (1993) Estimating the number of species: a review. J Am Stat Assoc 88(421):364–373

Chee CS, Wang Y (2016) Nonparametric estimation of species richness using discrete k-monotone distributions. Comput Stat Data Anal 93:107–118

Colwell RK, Coddington JA (1994) Estimating terrestrial biodiversity through extrapolation. Philos Trans R Soc Lond B Biol Sci 345(1311):101–118

Colwell RK, Chao A, Gotelli NJ, Lin SY, Mao CX, Chazdon RL, Longino JT (2012) Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. J Plant Ecol 5(1):3–21

Efron B (2014) Two modeling strategies for empirical Bayes estimation. Stat Sci 29:285–301

Efron B, Thisted R (1976) Estimating the number of unseen species: how many words did Shakespeare know? Biometrika 63(3):435–447

Gnedin A, Hansen B, Pitman J et al (2007) Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws. Probab Surv 4:146–171

Good I, Toulmin G (1956) The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43(1–2):45–63

Greenshtein E, Itskov T (2018) Application of non-parametric empirical Bayes to treatment of non-response. Stat Sin 28:2189–2208

Greenshtein E, Ritov Y et al (2004) Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6):971–988

Hasselblad V (1969) Estimation of finite mixtures of distributions from the exponential family. J Am Stat Assoc 64(328):1459–1471

Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67

Hurlbert SH (1971) The nonconcept of species diversity: a critique and alternative parameters. Ecology 52(4):577–586

Jiang W, Zhang CH (2009) General maximum likelihood empirical Bayes estimation of normal means. Ann Stat 37(4):1647–1684

Mao CX (2007) Estimating species accumulation curves and diversity indices. Stat Sin 17:761–774

Mao CX, Colwell RK (2005) Estimation of species richness: mixture models, the role of rare species, and inferential challenges. Ecology 86(5):1143–1153

Mao CX, Colwell RK, Chang J (2005) Estimating the species accumulation curve using mixtures. Biometrics 61(2):433–441

Norris JL, Pollock KH (1996) Nonparametric MLE under two closed capture-recapture models with heterogeneity. Biometrics 52:639–649

Norris JL, Pollock KH (1998) Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Environ Ecol Stat 5(4):391–402

Orlitsky A, Suresh AT, Wu Y (2016) Optimal prediction of the number of unseen species. Proc Natl Acad Sci 113(47):13283–13288

Palmer MW (1990) The estimation of species richness by extrapolation. Ecology 71(3):1195–1198

Portnoy S et al (1984) Asymptotic behavior of \(M\)-estimators of \(p\) regression parameters when \(p^2/n\) is large. I. Consistency. Ann Stat 12(4):1298–1309

Sanders HL (1968) Marine benthic diversity: a comparative study. Am Nat 102(925):243–282

Shen TJ, Chao A, Lin CF (2003) Predicting the number of new species in further taxonomic sampling. Ecology 84(3):798–804

Simar L et al (1976) Maximum likelihood estimation of a compound Poisson process. Ann Stat 4(6):1200–1209

Smith W, Grassle JF (1977) Sampling properties of a family of diversity measures. Biometrics 33:283–292

Spevack M (1968) A complete and systematic concordance to the works of Shakespeare, vol 1–6. Georg Olms, Hildesheim

Ugland KI, Gray JS, Ellingsen KE (2003) The species-accumulation curve and estimation of species richness. J Anim Ecol 72(5):888–897

Wang JP (2010) Estimating species richness by a Poisson-compound gamma model. Biometrika 97(3):727–740

Wang JPZ, Lindsay BG (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. J Am Stat Assoc 100(471):942–959

Wang JP et al (2011) SPECIES: an R package for species richness estimation. J Stat Softw 40(9):1–15

## Acknowledgements

We thank the Editor-in-Chief, Associate Editor and referees. Research of S. Baek was supported in part by the Summer Research Faculty Fellowship (SURFF) from the University of Maryland, Baltimore County. Research of J. Park was supported in part by the New Faculty Startup Fund from Seoul National University and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C1A01100526).

## Ethics declarations

### Conflict of interest

The authors declare that they have no conflict of interest.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendix

### A.1 Proof of Lemma 1

- (*i*) Proof of \(E({\hat{c}})=c\):

The conditions that \(\mathbf{f} = (\int \{1-(1-p)^m\} d {\hat{H}}_J(p))_{1\le m \le s}\) and that \(\mathbf{f}\) is an unbiased estimator of \( A(m) = \int \{1-(1-p)^m\}dH_J(p)\) are equivalent to

$$\begin{aligned} \sum _{j=1}^J p_j^m \{E({\hat{w}}_j^*) - w_j \} \equiv \sum _{j=1}^J p_j^{m-1} \delta _j =0 \end{aligned}$$

for \(1\le m \le s\), where \(\delta _j =p_j \{E({\hat{w}}_j^*) - w_j \}\). The equivalence holds because the polynomials \(1-(1-p)^m\), \(1\le m \le s\), span the same linear space as the monomials \(p^m\), \(1\le m \le s\). The system can be expressed as

$$\begin{aligned} \mathcal{P} \varvec{\delta } =\mathbf{0}, \end{aligned}$$

where \(\mathcal{P} =(p_{mj})_{1\le m \le s, 1\le j \le J} =(p_j^m)_{0\le m \le s-1, 1\le j \le J}\) is a Vandermonde matrix and \(\varvec{\delta } = (\delta _j)_{1\le j \le J}\). A Vandermonde matrix with distinct nodes \(p_j\) has full rank, so if \( J \le s\), the rank of \(\mathcal{P}\) is \(J\), which leads to \({\varvec{\delta }} =\mathbf{0}\). Hence \( p_j E({\hat{w}}_j^*) = p_j w_j\), and therefore \(E({\hat{w}}_j^*) = w_j\) because \(p_j > 0\). Finally, we obtain \( E({\hat{c}}) = E(\sum _{j=1}^J {\hat{w}}_j^*) = \sum _{j=1}^J w_j =c\).
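
As a small worked illustration of the full-rank argument (an added example for the smallest nontrivial case, assuming \(J = s = 2\) and \(p_1 \ne p_2\)), the system reads

$$\begin{aligned} \mathcal{P} \varvec{\delta } = \begin{pmatrix} 1 &{} 1 \\ p_1 &{} p_2 \end{pmatrix} \begin{pmatrix} \delta _1 \\ \delta _2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad \det \mathcal{P} = p_2 - p_1 \ne 0. \end{aligned}$$

The first row gives \(\delta _1 = -\delta _2\), and substituting into the second row gives \((p_2 - p_1)\delta _2 = 0\), so \(\delta _1 = \delta _2 = 0\).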

- (*ii*) Proof of \(\hbox {var}({\hat{c}})=O \left( c \frac{\xi _1}{\xi _M} \right) \):

For a sequence \(m = [\theta s]\), the integer part of \(\theta s\) for some \( 0< \theta <1\), we can take a \(\delta >0 \) such that \(\min _{ p \in [\epsilon _1, \epsilon _2] } \{1-(1-p)^m\} \ge (1-(1-\epsilon _1)^m) \ge \delta >0\) by \((\mathbf{A2})\) in **Assumption A**. Since we only need to consider the first *M* coordinates by (**A3**), we define the *M*-dimensional vectors \(\hat{\mathbf{w}}^*_M =({\hat{w}}_1^*, \ldots , {\hat{w}}_M^*)^{\mathrm{T}}\) and \(\mathbf{1}_M=(1,1,\ldots , 1)^{\mathrm{T}}\) as well as the covariance matrix \({\varvec{\varSigma }}_M = \hbox {var}(\hat{\mathbf{w}}_M^*)\). We also define \( \mathbf{a}_M =(a_1,\ldots , a_M)^{\mathrm{T}}\) with \( a_i = 1-(1-p_i)^m \) for \(1\le i \le M\). Then we have

$$\begin{aligned} \delta {\hat{c}} = \delta \mathbf{1}_M^{\mathrm{T}}\hat{\mathbf{w}}_M^* \le \mathbf{a}_M ^{\mathrm{T}}\hat{\mathbf{w}}_M^* \le \mathbf{1}_M^{\mathrm{T}}\hat{\mathbf{w}}_M^* = {\hat{c}}. \end{aligned}$$

Let \((\mathbf{e}_j, \xi _j)\) be the eigenvectors and eigenvalues of \({\varvec{\varSigma }}_{M}\), where \( \xi _1 \ge \cdots \ge \xi _M >0 \). We have \(\hbox {var}(f_m) = \mathbf{a}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{a}_M\), where \(f_m=\mathbf{a}_M ^{\mathrm{T}}\hat{\mathbf{w}}_M^*\), and \(\hbox {var}({\hat{c}}) = \mathbf{1}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{1}_M\). Using these and \(a_i \ge \delta \), which gives \( \mathbf{1}^{\mathrm{T}}_M \mathbf{1}_M \le \frac{1}{\delta ^2} \mathbf{a}^{\mathrm{T}}_M \mathbf{a}_M\), we derive

$$\begin{aligned} \frac{\hbox {var}({\hat{c}})}{\hbox {var}(f_m)} = \frac{\mathbf{1}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{1}_M}{\mathbf{a}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{a}_M} \le \frac{\xi _1 }{\xi _M}\frac{\mathbf{1}^{\mathrm{T}}_M \mathbf{1}_M}{ \mathbf{a}^{\mathrm{T}}_M \mathbf{a}_M } \le \frac{\xi _1}{\xi _{M}} \frac{1}{\delta ^2} =O\left( \frac{\xi _1}{\xi _M}\right) , \end{aligned}$$

where the first inequality uses \(\mathbf{1}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{1}_M \le \xi _1 \mathbf{1}^{\mathrm{T}}_M \mathbf{1}_M\) and \(\mathbf{a}_M^{\mathrm{T}}{\varvec{\varSigma }}_M \mathbf{a}_M \ge \xi _M \mathbf{a}^{\mathrm{T}}_M \mathbf{a}_M\). Since we have

$$\begin{aligned} \hbox {var}(f_m) \le c \max _{1\le j \le s} \left\{ 1- \frac{\binom{s-m}{j}}{\binom{s}{j}}\right\} ^2 \sum _{j=1}^s g(j) \le c \{1-(1-\epsilon _2)^s\} \le c, \end{aligned}$$

we obtain \(\hbox {var}( {\hat{c}}) \le \hbox {var}(f_m) O(\frac{\xi _1}{\xi _M}) =O \left( c \frac{\xi _1}{\xi _M} \right) \).
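
The eigenvalue bound used in part (*ii*), \(\hbox {var}({\hat{c}})/\hbox {var}(f_m) \le (\xi _1/\xi _M)/\delta ^2\), can be checked numerically on a toy case. The sketch below assumes \(M = 2\) and a made-up covariance matrix; the function names and all numerical values are hypothetical and serve only to illustrate the inequality, not to reproduce any result of the article.

```python
import math


def eigenvalues_sym_2x2(s11, s12, s22):
    """Eigenvalues (largest first) of the symmetric matrix [[s11, s12], [s12, s22]]."""
    mean = 0.5 * (s11 + s22)
    radius = math.hypot(0.5 * (s11 - s22), s12)
    return mean + radius, mean - radius


def quadratic_form(v, s11, s12, s22):
    """Compute v^T Sigma v for a symmetric 2x2 matrix Sigma."""
    return s11 * v[0] ** 2 + 2.0 * s12 * v[0] * v[1] + s22 * v[1] ** 2


def variance_ratio_and_bound(p, m, sigma):
    """Return (1^T Sigma 1 / a^T Sigma a, (xi_1 / xi_M) / delta^2) with a_i = 1 - (1 - p_i)^m."""
    s11, s12, s22 = sigma
    a = [1.0 - (1.0 - p_i) ** m for p_i in p]
    delta = min(a)  # largest delta satisfying a_i >= delta
    xi_1, xi_m = eigenvalues_sym_2x2(s11, s12, s22)
    if xi_m <= 0.0:
        raise ValueError("the covariance matrix must be positive definite")
    ratio = quadratic_form([1.0, 1.0], s11, s12, s22) / quadratic_form(a, s11, s12, s22)
    bound = (xi_1 / xi_m) / delta ** 2
    return ratio, bound


if __name__ == "__main__":
    # Hypothetical support points, subsample size, and covariance entries.
    ratio, bound = variance_ratio_and_bound(p=[0.1, 0.4], m=5, sigma=(2.0, 0.5, 1.0))
    print(ratio, bound, ratio <= bound)
```

The bound is typically loose, which is consistent with its use in the proof as an order-of-magnitude statement rather than a sharp constant.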

## About this article

### Cite this article

Baek, S., Park, J. A computationally efficient approach to estimating species richness and rarefaction curve. *Comput Stat* **37**, 1919–1941 (2022). https://doi.org/10.1007/s00180-021-01185-1

DOI: https://doi.org/10.1007/s00180-021-01185-1

### Keywords

- Nonparametric empirical Bayes
- Quadratic optimization
- Rarefaction curve
- Species richness