Quantitative Biology, Volume 3, Issue 3, pp 135–144

Applications of species accumulation curves in large-scale biological data analysis

  • Chao Deng
  • Timothy Daley
  • Andrew Smith
Research Article


The species accumulation curve, or collector’s curve, of a population gives the expected number of observed species or distinct classes as a function of sampling effort. Species accumulation curves allow researchers to assess and compare diversity across populations or to evaluate the benefits of additional sampling. Traditional applications have focused on ecological populations, but emerging large-scale applications, for example in DNA sequencing, are orders of magnitude larger and present new challenges. We developed a method to estimate accumulation curves for predicting the complexity of DNA sequencing libraries. This method uses rational function approximations to a classical nonparametric empirical Bayes estimator due to Good and Toulmin [Biometrika, 1956, 43, 45–63]. Here we demonstrate how the same approach can be highly effective in other large-scale applications involving biological data sets. These include estimating microbial species richness, immune repertoire size, and k-mer diversity for genome assembly applications. We show how the method can be modified to address populations containing an effectively infinite number of species where saturation cannot practically be attained. We also introduce a flexible suite of tools implemented as an R package that make these methods broadly accessible.
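As a rough illustration of the classical Good and Toulmin estimator named above, the following Python sketch evaluates its alternating power series directly. It is not the rational function method or the R package described in this article, and the function names (`counts_histogram`, `good_toulmin_new_species`, `expected_distinct`) are hypothetical, chosen only for this example. Given n_j, the number of classes observed exactly j times, the series estimates the expected number of new classes found when sampling effort grows by a relative amount t as the sum over j of (-1)^(j+1) t^j n_j.

```python
from collections import Counter


def counts_histogram(observations):
    """Return {j: n_j}, where n_j is the number of classes seen exactly j times."""
    per_class = Counter(observations)          # class -> number of times observed
    return dict(Counter(per_class.values()))   # frequency j -> number of classes


def good_toulmin_new_species(hist, t):
    """Estimate the number of new classes in an additional sample of relative size t.

    t = 1 means doubling the original sampling effort. This is the raw
    power series only; it is generally unreliable for t > 1.
    """
    return sum((-1) ** (j + 1) * (t ** j) * n_j for j, n_j in hist.items())


def expected_distinct(hist, t):
    """Estimated point on the accumulation curve at (1 + t) times the original effort."""
    observed = sum(hist.values())
    return observed + good_toulmin_new_species(hist, t)


if __name__ == "__main__":
    # Toy sample (illustrative data only): nine observations from five classes.
    sample = list("aaabbcdde")
    hist = counts_histogram(sample)
    for t in (0.0, 0.5, 1.0):
        print(t, expected_distinct(hist, t))
```

The raw series tends to behave erratically when extrapolating beyond twice the original sample size (t > 1), which is the limitation that rational function approximations are intended to address.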


species accumulation curve; accumulation region; rational function approximation; immune repertoire; microbiome diversity; species richness

Supplementary material

40484_2015_49_MOESM1_ESM.pdf (445 kb)



Copyright information

© Higher Education Press and Springer-Verlag GmbH 2015

Authors and Affiliations

  1. Molecular and Computational Biology, University of Southern California, Los Angeles, USA
