Abstract
The species accumulation curve, or collector’s curve, of a population gives the expected number of observed species or distinct classes as a function of sampling effort. Species accumulation curves allow researchers to assess and compare diversity across populations or to evaluate the benefits of additional sampling. Traditional applications have focused on ecological populations, but emerging large-scale applications, for example in DNA sequencing, are orders of magnitude larger and present new challenges. We developed a method to estimate accumulation curves for predicting the complexity of DNA sequencing libraries. This method uses rational function approximations to a classical nonparametric empirical Bayes estimator due to Good and Toulmin [Biometrika, 1956, 43, 45–63]. Here we demonstrate how the same approach can be highly effective in other large-scale applications involving biological data sets. These include estimating microbial species richness, immune repertoire size, and k-mer diversity for genome assembly applications. We show how the method can be modified to address populations containing an effectively infinite number of species where saturation cannot practically be attained. We also introduce a flexible suite of tools implemented as an R package that make these methods broadly accessible.
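As a rough illustration of the classical estimator the method builds on, the sketch below evaluates the Good and Toulmin power series directly: if n_j is the number of classes observed exactly j times, the expected number of new classes after increasing the sample by a fraction t of its original size is the alternating sum of (-1)^(j+1) t^j n_j. This is only the raw series, not the rational function approximation described above and not the R package's interface; the function names are hypothetical, and the series should only be trusted for t up to about 1, since it generally diverges beyond that.

```python
from collections import Counter


def counts_histogram(labels):
    """Return {j: n_j}, where n_j is the number of classes seen exactly j times.

    `labels` is any iterable of hashable class labels (hypothetical input
    format chosen for this sketch).
    """
    per_class = Counter(labels)              # class -> times observed
    return dict(Counter(per_class.values())) # times observed -> number of classes


def good_toulmin_new_classes(hist, t):
    """Raw Good-Toulmin power series for the expected number of new classes.

    `t` is the extra sampling effort relative to the initial sample, so the
    total effort is (1 + t) times the original. Reliable only for t <= 1.
    """
    return sum((-1) ** (j + 1) * (t ** j) * n_j for j, n_j in hist.items())


def accumulation_estimate(hist, t):
    """Observed classes plus the predicted number of new classes at effort 1 + t."""
    observed = sum(hist.values())
    return observed + good_toulmin_new_classes(hist, t)


# Toy sample: a seen 3 times, b and d twice, c and e once.
hist = counts_histogram("aaabbcdde")
print(accumulation_estimate(hist, 0.5))
```

For t above 1 the alternating terms grow without bound, which is the practical motivation for replacing the truncated series with a rational function approximation.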
References
Magurran, A. E. (1988). Ecological Diversity and Its Measurement, 168, Princeton: Princeton University Press
Bunge, J. and Fitzpatrick, M. (1993) Estimating the number of species: A review. J. Am. Stat. Assoc., 88, 364–373
Colwell, R. K., Mao, C. X. and Chang, J. (2004) Interpolating, extrapolating, and comparing incidence-based species accumulation curves. Ecology, 85, 2717–2727
Efron, B. and Thisted, R. (1976) Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63, 435–447
Ionita-Laza, I., Lange, C. and Laird, N. M. (2009) Estimating the number of unseen variants in the human genome. Proc. Natl. Acad. Sci. USA, 106, 5008–5013
Hughes, J. B., Hellmann, J. J., Ricketts, T. H. and Bohannan, B. J. (2001) Counting the uncountable: Statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol., 67, 4399–4406
Laydon, D. J., Melamed, A., Sim, A., Gillet, N. A., Sim, K., Darko, S., Kroll, J. S., Douek, D. C., Price, D. A., Bangham, C. R., et al. (2014) Quantification of HTLV-1 clonality and TCR diversity. PLoS Comput. Biol., 10, e1003646
Gotelli, N. J. and Colwell, R. K. (2001) Quantifying biodiversity: Procedures and pitfalls in the measurement and comparison of species richness. Ecol. Lett., 4, 379–391
Colwell, R. K., Chao, A., Gotelli, N. J., Lin, S.-Y., Mao, C. X., Chazdon, R. L. and Longino, J. T. (2012) Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. J. Plant Ecol., 5, 3–21
Fisher, R. A., Corbet, A. S. and Williams, C. B. (1943) The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol., 12, 42–58
Bulmer, M. (1974) On fitting the Poisson lognormal distribution to species-abundance data. Biometrics, 30, 101–110
Burrell, Q. L. and Fenton, M. R. (1993) Yes, the GIGP really does work–and is workable! J. Am. Soc. Inf. Sci., 44, 61–69
Engen, S. (1978). Stochastic Abundance Models. London: Chapman and Hall
Norris, J. L. and Pollock, K. H. (1998) Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Environ. Ecol. Stat., 5, 391–402
Wang, J.-P. Z. and Lindsay, B. G. (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. J. Am. Stat. Assoc., 100, 942–959
Mao, C. X., Colwell, R. K. and Chang, J. (2005) Estimating the species accumulation curve using mixtures. Biometrics, 61, 433–441
Lindsay, B. G. (1983) The geometry of mixture likelihoods: A general theory. Ann. Stat., 11, 86–94
Wang, J.-P. (2010) Estimating species richness by a Poisson-compound Gamma model. Biometrika, 97, 727–740
Good, I. and Toulmin, G. (1956) The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43, 45–63
Keating, K. A., Quinn, J. F., Ivie, M. A. and Ivie, L. L. (1998) Estimating the effectiveness of further sampling in species inventories. Ecol. Appl., 8, 1239–1249
Daley, T. and Smith, A. D. (2013) Predicting the molecular complexity of sequencing libraries. Nat. Methods, 10, 325–327
Daley, T. and Smith, A. D. (2014) Modeling genome coverage in single-cell sequencing. Bioinformatics, 30, 3159–3165
Wang, J.-P. (2011) SPECIES: An R package for species richness estimation. J. Stat. Softw., 40, 1–15
Mao, C. X. and Lindsay, B. G. (2007) Estimating the number of classes. Ann. Stat., 35, 917–930
Baker, G. and Graves-Morris, P. (1996). Padé Approximants (Encyclopedia of Mathematics and its Applications), 2nd ed., London: Cambridge University Press
Baker, G. A. Jr. (2000) Defects and the convergence of Padé approximants. Acta Appl. Math., 61, 37–52
Daley, T. P. (2014). Non-Parametric Models for Large Capture-Recapture Experiments with Applications to DNA Sequencing. Ph.D. thesis, University of Southern California
Heck, K. L. Jr, van Belle, G. and Simberloff, D. (1975) Explicit calculation of the rarefaction diversity measurement and the determination of sufficient sample size. Ecology, 56, 1459–1461
Hsieh, T. C., Ma, K. H. and Chao, A. (2013). iNEXT online: interpolation and extrapolation [software]. http://chao.stat.nthu.edu.tw/blog/software-download/inext-online/
Bunge, J., Willis, A. and Walsh, F. (2014) Estimating the number of species in microbial diversity studies. Annu. Rev. Stat. Appl., 1, 427–445
Yatsunenko, T., Rey, F. E., Manary, M. J., Trehan, I., Dominguez-Bello, M. G., Contreras, M., Magris, M., Hidalgo, G., Baldassano, R. N., Anokhin, A. P., et al. (2012) Human gut microbiome viewed across age and geography. Nature, 486, 222–227
Meyer, F., Paarmann, D., D’Souza, M., Olson, R., Glass, E. M., Kubal, M., Paczian, T., Rodriguez, A., Stevens, R., Wilke, A., et al. (2008) The metagenomics RAST server–a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9, 386
Britanova, O. V., Putintseva, E. V., Shugay, M., Merzlyak, E. M., Turchaninova, M. A., Staroverov, D. B., Bolotin, D. A., Lukyanov, S., Bogdanova, E. A., Mamedov, I. Z., et al. (2014) Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. J. Immunol., 192, 2689–2698
Wedderburn, L., Patel, A., Varsani, H. and Woo, P. (2001) The developing human immune system: T-cell receptor repertoire of children and young adults shows a wide discrepancy in the frequency of persistent oligoclonal T-cell expansions. Immunology, 102, 301–309
Pevzner, P. A., Tang, H. and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748–9753
Compeau, P. E., Pevzner, P. A. and Tesler, G. (2011) How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol., 29, 987–991
Bradnam, K. R., Fass, J. N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J. A., Chapuis, G., Chikhi, R., et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience, 2, 1–31
Zerbino, D. R. and Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829
Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27, 764–770
Ren, J., Song, K., Deng, M., Reinert, G., Cannon, C. H. and Sun, F. (2015) Inference of markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics, doi: 10.1093/bioinformatics/btv395
Kroes, I., Lepp, P. W. and Relman, D. A. (1999) Bacterial diversity within the human subgingival crevice. Proc. Natl. Acad. Sci. USA, 96, 14547–14552
Robins, H. S., Campregher, P. V., Srivastava, S. K., Wacher, A., Turtle, C. J., Kahsai, O., Riddell, S. R., Warren, E. H. and Carlson, C. S. (2009) Comprehensive assessment of T-cell receptor β-chain diversity in αβ T cells. Blood, 114, 4099–4107
Colwell, R. K. and Coddington, J. A. (1994) Estimating terrestrial biodiversity through extrapolation. Philos. Trans. R. Soc. Lond. B Biol. Sci., 345, 101–118
Cite this article
Deng, C., Daley, T. & Smith, A. Applications of species accumulation curves in large-scale biological data analysis. Quant Biol 3, 135–144 (2015). https://doi.org/10.1007/s40484-015-0049-7
DOI: https://doi.org/10.1007/s40484-015-0049-7