Abstract
Many germplasm collections aim to preserve most of the genetic diversity present in a population so that the population could be regenerated, thereby providing genetic resources that help ensure food security. This paper proposes a way to measure how well a germplasm collection achieves this goal. In the most common scenario, little is known about the number and statistical distribution of alleles at each locus, which makes it very difficult to measure the representativeness of an accession. Here, we show how to use samples of allelic diversity at a sample of loci to estimate the representativeness of an accession through the coverage of a sample, with both point and interval estimates. Our approach avoids unrealistic assumptions regarding the number of loci, bounds on the number of alleles, or their frequency distributions. Depending on the sampling scheme of a collection, we distinguish between absolute and relative coverage. We demonstrate the methodology using data from the germplasm collection at the Leibniz Institute of Plant Genetics and Crop Plant Research.
References
Brown AHD (1995) The core collection at the crossroads. In: Hodgkin T, Brown AHD, van Hintum TJL, Morales EAV (eds) Core collections of plant genetic resources. Wiley, Chichester, pp 3–19
Chao A (1981) On estimating the probability of discovering a new species. Ann Stat 9(6):1339–1342
Chao A, Lee SM (1992) Estimating the number of classes via sample coverage. J Am Stat Assoc 87(417):210–217
Chao A, Lee SM (1993) Estimating population size for continuous-time capture-recapture models via sample coverage. Biom J 35(1):29–45
Darwin C (1866) On the origin of species by means of natural selection: or the preservation of favoured races in the struggle for life. John Murray, London
Esty WW (1982) Confidence intervals for the coverage of low coverage samples. Ann Stat 10(1):190–196
Esty WW (1983) A normal limit law for a nonparametric estimator of the coverage of a random sample. Ann Stat 11(3):905–912
Esty W (1985) Estimation of the number of classes in a population and the coverage of a sample. Math Sci 10:41–50
Esty WW (1986) The efficiency of Good’s nonparametric coverage estimator. Ann Stat 14(3):1257–1260
Good IJ (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40(3–4):237–264
Good IJ, Toulmin GH (1956) The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43(1–2):45–63
Harris B (1959) Determining bounds on integrals with applications to cataloging problems. Ann Math Stat 30(2):521–548
Huang SP, Weir B (2001) Estimating the total number of alleles using a sample coverage method. Genetics 159(3):1365–1373
Huang X, Börner A, Röder M, Ganal M (2002) Assessing genetic diversity of wheat (Triticum aestivum L.) germplasm using microsatellite markers. Theor Appl Genet 105(5):699–707
Knott M (1967) Models for cataloguing problems. Ann Math Stat 38(4):1255–1260
Lee SM, Chao A (1994) Estimating population size via sample coverage for closed capture-recapture models. Biometrics 50(1):88–97
Lo SH (1992) From the species problem to a general coverage problem via a new interpretation. Ann Stat 20(2):1094–1109
Nei M (1973) Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci 70(12):3321–3323
Robbins HE (1968) Estimating the total probability of the unobserved outcomes of an experiment. Ann Math Stat 39(1):256–257
Starr N (1979) Linear estimation of the probability of discovering a new species. Ann Stat 7(3):644–652
van Hintum TJL, Brown AHD, Spillane C, Hodgkin T (2000) Core collections of plant genetic resources. IPGRI Technical Bulletin No. 3. International Plant Genetic Resources Institute, Rome
Zhang C-H, Zhang Z (2009) Asymptotic normality of a nonparametric estimator of sample coverage. Ann Stat 37:2582–2595
Acknowledgements
The author would like to thank Dr. Marion Röder, who kindly shared the data set used in Huang et al. (2002).
Author contributions
Carlos Hernandez-Suarez developed the methodology, performed the simulations, and wrote the manuscript.
Ethics declarations
Conflict of interest
The author declares no conflict of interest.
Additional information
Communicated by David Hawksworth.
This article belongs to the Topical Collection: Ex-situ conservation.
Appendix: Proof of properties of the coverage of several populations
1. If we select an individual at random from the population and then select one of its attributes X or Y, this attribute will be included in the sample with respective probabilities \(C_X\) and \(C_Y\). Because the selected attribute is equally likely to be X or Y, the probability that the attribute selected is in the sample is \((C_X + C_Y)/2\).
2. If two populations of sizes \(N_1\) and \(N_2\) are mixed and a sample of size n is taken from the mix, the probability that an individual selected at random from the mixed population is represented in the sample is, by definition, the coverage of the sample. This follows from the fact that a randomly selected individual from the mix of populations belongs to each initial population with respective probabilities \(f = N_1/(N_1 + N_2)\) and \(1-f = N_2/(N_1 + N_2)\). It follows that there is no need to mix both populations as long as each population is sampled in proportion to its size, with sample sizes \(n_1 = nf\) and \(n_2 = n(1-f)\), respectively.
3. Suppose we have two populations, 1 and 2, of sizes \(N_1\) and \(N_2\), respectively, where \(N_1 = N_2\). Suppose we take a sample of size n from each population and let C represent the coverage of the mixed sample of size 2n. By property 2, C can be interpreted as the probability that a random individual selected from the mix of both populations is represented in the sample, i.e., the absolute coverage. Now suppose that the size of population 2 is increased by a factor of k, where \(k > 1\), keeping the relative frequencies of alleles fixed. Clearly, the previous interpretation of the coverage (absolute coverage) no longer holds, because an individual selected randomly from the mixture of populations 1 and 2 is k times more likely to come from population 2. However, if we can guarantee that the individual selected is equally likely to come from either population, then the probability that this individual is already represented in the sample is still C. The restriction that the individual must be equally likely to come from either population defines the relative coverage. It follows that if we have two populations of general sizes \(N_1\) and \(N_2\), \(N_1 \ne N_2\), and take a sample of the same size n from each population, the coverage of the sample mix follows the definition of relative coverage.
4. This property follows directly from the properties of random sampling.
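The distinction between absolute and relative coverage in properties 1 to 3 can be illustrated with a small numerical sketch. It is not the estimation procedure of the paper: it assumes the true allele frequencies are known, that alleles are specific to each population, and that the coverage of a sample is the sum of the population frequencies of the alleles it contains. The function names, allele labels, frequencies, and population sizes below are all hypothetical.

```python
from fractions import Fraction


def coverage(freqs, sample):
    """Sum of the (assumed known) population frequencies of the alleles seen in the sample."""
    seen = set(sample)
    return sum(p for allele, p in freqs.items() if allele in seen)


def pooled_coverage(coverages, weights):
    """Weighted average of per-population (or per-attribute) coverages."""
    total = sum(weights)
    return sum(c * w for c, w in zip(coverages, weights)) / total


# Hypothetical allele frequencies in two populations (illustrative values only).
pop1 = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
pop2 = {"d": Fraction(3, 5), "e": Fraction(2, 5)}

# Hypothetical samples of size n = 3 from each population.
sample1 = ["a", "a", "b"]
sample2 = ["d", "d", "d"]

c1 = coverage(pop1, sample1)  # 1/2 + 3/10 = 4/5
c2 = coverage(pop2, sample2)  # 3/5

# Absolute coverage: a random individual from the mix comes from population i
# with probability proportional to N_i (property 2).
N1, N2 = 100, 300
absolute = pooled_coverage([c1, c2], [N1, N2])

# Relative coverage: the individual is equally likely to come from either
# population (property 3). The same equal-weight average gives (C_X + C_Y)/2
# for two attributes X and Y of one population (property 1).
relative = pooled_coverage([c1, c2], [1, 1])

print(absolute)  # 13/20
print(relative)  # 7/10
```

With equal sample sizes from populations of unequal size, the equal-weight average (7/10 here) is the quantity with a clear interpretation, whereas the size-weighted value (13/20) would apply only if the sample sizes had been proportional to \(N_1\) and \(N_2\).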
Cite this article
Hernandez-Suarez, C. Measuring the representativeness of a germplasm collection. Biodivers Conserv 27, 1471–1486 (2018). https://doi.org/10.1007/s10531-018-1504-3
DOI: https://doi.org/10.1007/s10531-018-1504-3