Introduction

Single nucleotide polymorphisms (SNPs) are biallelic markers that are abundant in most genomes and have a low mutation rate (\(\sim 10^{-9}\) per generation) (Brumfield et al. 2003; Morin et al. 2004). SNPs can be associated with diseases, susceptibility to environmental factors or quantitative trait loci (Erichsen and Chanock 2004; Amos et al. 2008; Casellas et al. 2008; Nickels et al. 2013). In forensic medicine, SNPs can be used to identify individuals from non-invasive samples by means of short amplicons (Sobrino et al. 2005). However, individual identification from degraded samples with fewer than 100 copies of gDNA may lead to genotyping errors (Giardina et al. 2009; von Thaden et al. 2020). Allelic amplification failure, or dropout, is the most common error and is caused by stochastic effects of the PCR reaction (Taberlet and Luikart 1999). To reduce the dropout rate, multiplex pre-amplification or additional replicates per sample can be performed (Bellemain and Taberlet 2004; Sastre et al. 2009). However, both solutions increase the time and cost of genotyping. To reduce genotyping errors in non-invasive samples at no additional cost, we developed a software package (SNP+) that predicts the dropout probability of each SNP from a sample of replicated genotypes. Moreover, two alternative parameterizations were compared by means of a Bayes factor, testing a homogeneous within-SNP dropout probability against a different dropout probability for each allele.

Material and methods

The SNP+ software analyzes each SNP independently, taking as a starting point a vector y of n genotypes ordered by individual and by replicate within individual, \({\mathbf{y}}^{\prime} = [{\mathbf{y}}_{1}^{\prime} \; {\mathbf{y}}_{2}^{\prime} \; \ldots \; {\mathbf{y}}_{m}^{\prime}]\), where m is the number of individuals, n1 is the number of replicates for the first individual, and n = n1 + n2 + … + nm. Assuming two alleles, A and B, the Bayesian joint posterior distribution can be written as

$$p(f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} \mid {\mathbf{y}}) \propto p({\mathbf{y}} \mid f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} )\,p\left( {f_{{\text{A}}} } \right)p(\varepsilon_{{\text{A}}} )p(\varepsilon_{{\text{B}}} ),$$

and focuses on estimating the allele frequency (fA), as well as the dropout probabilities for allele A (εA) and allele B (εB). For a particular genotype yi with possible outcomes AA, AB, BB and missing genotype (miss.), the likelihood is computed as

$$p({\text{y}}_{i} = {\text{ AA}}|f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} ) = p\left( {{\text{AA}}|{\text{AA}}} \right)p\left( {{\text{AA}}} \right) \, + p\left( {{\text{AA}}|{\text{AB}}} \right)p\left( {{\text{AB}}} \right)$$
$$p({\text{y}}_{i} = {\text{AB}}|f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} ) = p\left( {{\text{AB}}|{\text{AB}}} \right)p\left( {{\text{AB}}} \right)$$
$$p({\text{y}}_{i} = {\text{ BB}}|f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} ) = p\left( {{\text{BB}}|{\text{AB}}} \right)p\left( {{\text{AB}}} \right) + p\left( {{\text{BB}}|{\text{BB}}} \right)p\left( {{\text{BB}}} \right)$$
$$p({\text{y}}_{i} = miss.|f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} ) = p\left( {miss.|{\text{AA}}} \right)p\left( {{\text{AA}}} \right) + p\left( {miss.|{\text{AB}}} \right)p\left( {{\text{AB}}} \right) + p\left( {miss.|{\text{BB}}} \right)p\left( {{\text{BB}}} \right),$$

where

$$p\left( {{\text{AA}}|{\text{AA}}} \right) = ({1}{-}\varepsilon_{{\text{A}}} )^{{2}}$$
$$p\left( {{\text{AA}}|{\text{AB}}} \right) = \varepsilon_{{\text{B}}}$$
$$p\left( {{\text{AB}}|{\text{AB}}} \right) = ({1}{-}\varepsilon_{{\text{A}}} )({1}{-}\varepsilon_{{\text{B}}} )$$
$$p\left( {{\text{BB}}|{\text{AB}}} \right) = \varepsilon_{{\text{A}}}$$
$$p\left( {{\text{BB}}|{\text{BB}}} \right) = ({1}{-}\varepsilon_{{\text{B}}} )^{{2}}$$
$$p\left( {miss.|{\text{AA}}} \right) = \varepsilon_{{\text{A}}}^{{2}}$$
$$p\left( {miss.|{\text{AB}}} \right)\, = \,\varepsilon _{{\text{A}}} \varepsilon _{{\text{B}}}$$
$$p\left( {miss.|{\text{BB}}} \right) = \varepsilon_{{\text{B}}}^{{2}}$$

and \(p\left( {{\text{AA}}} \right) = f_{{\text{A}}}^{{2}}\), \(p\left( {{\text{AB}}} \right) = {2}f_{{\text{A}}} \left( {{1} - f_{{\text{A}}} } \right)\), and \(p\left( {{\text{BB}}} \right) = f_{{\text{B}}}^{{2}}\), with \(f_{{\text{B}}} = {1} - f_{{\text{A}}}\). Note that the model assumes that a BB individual cannot be genotyped as AA (the probability of false alleles is zero). Prior distributions for fA, εA and εB were assumed flat between 0 and 1.
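The expressions above can be sketched in a few lines of Python. This is only an illustration of the per-genotype probabilities exactly as written in the text, not the SNP+ implementation; the function name `observed_genotype_probs` and the dictionary layout are hypothetical.

```python
def observed_genotype_probs(f_a, eps_a, eps_b):
    """Probability of each observed outcome (AA, AB, BB, miss) for one genotype.

    f_a   : frequency of allele A
    eps_a : dropout probability of allele A
    eps_b : dropout probability of allele B
    The conditional terms are copied from the equations in the text.
    """
    # True-genotype frequencies under Hardy-Weinberg proportions
    p_aa = f_a ** 2
    p_ab = 2 * f_a * (1 - f_a)
    p_bb = (1 - f_a) ** 2

    return {
        "AA": (1 - eps_a) ** 2 * p_aa + eps_b * p_ab,
        "AB": (1 - eps_a) * (1 - eps_b) * p_ab,
        "BB": eps_a * p_ab + (1 - eps_b) ** 2 * p_bb,
        "miss": eps_a ** 2 * p_aa + eps_a * eps_b * p_ab + eps_b ** 2 * p_bb,
    }
```

With no dropout (εA = εB = 0), the function returns the plain Hardy–Weinberg proportions and a zero probability of a missing genotype.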

For each SNP, the model was solved by a Metropolis–Hastings sampling process (Metropolis et al. 1953) with 500,000 iterations after a burn-in period of 10,000 iterations. Two alternative parameterizations (\(\varepsilon_{{\text{A}}} = \varepsilon_{{\text{B}}}\) vs. \(\varepsilon_{{\text{A}}} \ne \varepsilon_{{\text{B}}}\)) were compared by means of a Bayes factor (Kass and Raftery 1995). The minimum number of within-individual replicates required to predict a reliable genotype with 95% probability was calculated as log(0.05)/log(εA). All these procedures have been implemented in the SNP+ software, available at http://www.casellas.info/software.html.
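As a rough illustration of the sampling step, the following Python sketch runs a random-walk Metropolis sampler over (fA, εA, εB) with flat priors on (0, 1). It is a minimal sketch under stated assumptions and not the SNP+ code: observed genotypes are treated as independent draws summarized by counts, the Gaussian proposal and its step size are arbitrary choices, rounding the replicate formula up to the next integer is an assumption, and all function names are hypothetical.

```python
import math
import random


def log_likelihood(counts, f_a, eps_a, eps_b):
    """Log-likelihood of observed genotype counts, using the expressions in the text."""
    p_aa, p_ab, p_bb = f_a ** 2, 2 * f_a * (1 - f_a), (1 - f_a) ** 2
    probs = {
        "AA": (1 - eps_a) ** 2 * p_aa + eps_b * p_ab,
        "AB": (1 - eps_a) * (1 - eps_b) * p_ab,
        "BB": eps_a * p_ab + (1 - eps_b) ** 2 * p_bb,
        "miss": eps_a ** 2 * p_aa + eps_a * eps_b * p_ab + eps_b ** 2 * p_bb,
    }
    total = 0.0
    for genotype, n in counts.items():
        if n > 0:
            if probs[genotype] <= 0.0:
                return -math.inf
            total += n * math.log(probs[genotype])
    return total


def metropolis(counts, n_iter=5000, burn_in=1000, step=0.05, seed=1):
    """Random-walk Metropolis over (f_a, eps_a, eps_b) with flat priors on (0, 1)."""
    rng = random.Random(seed)
    theta = [0.5, 0.1, 0.1]  # arbitrary starting values (assumption)
    current = log_likelihood(counts, *theta)
    draws = []
    for it in range(burn_in + n_iter):
        proposal = [t + rng.gauss(0.0, step) for t in theta]
        # Flat priors: proposals outside (0, 1) have zero prior density and are rejected
        if all(0.0 < p < 1.0 for p in proposal):
            candidate = log_likelihood(counts, *proposal)
            if rng.random() < math.exp(min(0.0, candidate - current)):
                theta, current = proposal, candidate
        if it >= burn_in:
            draws.append(tuple(theta))
    return draws


def min_replicates(eps, target=0.95):
    """log(1 - target) / log(eps), rounded up to a whole number of replicates."""
    return math.ceil(math.log(1.0 - target) / math.log(eps))
```

Posterior means can then be taken as column averages of the returned draws. The defaults here are kept small for speed; the analysis described in the text used 500,000 iterations after a burn-in of 10,000.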

Results and discussion

The program generates the following delimited text output files:

  1. Summary table of the probability of error, confidence interval, number of replicates, and Bayes factor for all the SNPs (output file example in Fig. 1).

  2. SNP-by-SNP report of dropout probabilities with their confidence intervals, the minimum number of replicates, and the Bayes factor comparing a single dropout probability against two independent dropout probabilities.

  3. Predicted genotypes for each individual and SNP, and probability of error, if any.

  4. Pairwise comparison between individuals and the probability of having an identical genotype.

  5. SNP-by-SNP report of the minor allele frequency (MAF) and probability of identity (PI).
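The text does not state how SNP+ computes MAF and PI. As a hedged illustration, the sketch below uses the standard probability of identity for unrelated individuals at a biallelic locus, PI = fA^4 + (2 fA fB)^2 + fB^4, and multiplies per-SNP values across a panel under the assumption of independent loci; the function names are hypothetical.

```python
def minor_allele_frequency(f_a):
    """Frequency of the rarer of the two alleles."""
    return min(f_a, 1.0 - f_a)


def probability_of_identity(f_a):
    """Standard PI for unrelated individuals at one biallelic locus (assumed formula)."""
    f_b = 1.0 - f_a
    return f_a ** 4 + (2 * f_a * f_b) ** 2 + f_b ** 4


def panel_probability_of_identity(freqs):
    """Product of per-SNP PI values, assuming the SNPs are independent."""
    pi = 1.0
    for f_a in freqs:
        pi *= probability_of_identity(f_a)
    return pi
```

For a SNP with an allele frequency of 0.5, this gives a per-locus PI of 0.375, which is the most informative case for a biallelic marker.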

Fig. 1

Output file generated by the SNP+ software for each analyzed SNP, its alleles, allele-specific or joint dropout probability (and minimum number of replicates to guarantee a 95% genotype probability), and the Bayes factor comparing the models with allele-specific and joint dropout probability

The software has been extensively tested on simulated data with good results. Figure 2 illustrates the ability of SNP+ to detect allele-specific departures in dropout probabilities, as well as the increase in statistical relevance (i.e., Bayes factor) as the estimated dropout differences increase. In the U-shaped scatter plot, a common dropout probability was wrongly favored in fewer than 3% of the simulated data sets, and this percentage fell below 1% with seven replicates per sample (results not shown). Sample size does not change the dropout probability itself, only the accuracy of the dropout probability estimate obtained with SNP+.

Fig. 2

Predicted dropout probabilities on simulated data sets with 50 individuals and 5 replicate genotypes per individual. Genotypes were simulated under Hardy–Weinberg equilibrium (0.5 allele frequency) and dropout probabilities for each allele were 0.1 and 0.2, respectively. The Bayes factor compared the model with allele-specific dropout probabilities against the same dropout probability for both alleles
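The simulation procedure itself is not described in the text. One plausible way to generate comparable data, shown here purely as an assumption, is to draw each individual's true genotype under Hardy–Weinberg proportions and then let each allele copy fail to amplify independently in every replicate; the function name and its defaults (taken from the settings in the caption) are hypothetical.

```python
import random


def simulate_replicates(n_ind=50, n_rep=5, f_a=0.5, eps_a=0.1, eps_b=0.2, seed=0):
    """Return a list of n_ind lists, each holding n_rep observed genotypes."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_ind):
        # True genotype: two allele copies drawn independently (Hardy-Weinberg)
        true_alleles = ["A" if rng.random() < f_a else "B" for _ in range(2)]
        replicates = []
        for _ in range(n_rep):
            # Assumed mechanism: each copy drops out independently with its own probability
            kept = [
                allele
                for allele in true_alleles
                if rng.random() >= (eps_a if allele == "A" else eps_b)
            ]
            if not kept:
                replicates.append("miss")      # both copies failed
            elif set(kept) == {"A", "B"}:
                replicates.append("AB")        # both alleles detected
            else:
                replicates.append(kept[0] * 2)  # only one allele type detected
        data.append(replicates)
    return data
```

Counting the AA, AB, BB and missing calls in the returned data gives the kind of input on which dropout probabilities can then be estimated.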

Moreover, we used SNP+ to evaluate two panels based on Open Array® technology (Thermo Fisher Scientific Inc). We first analyzed 22 fecal samples and 114 hair samples from Iberian brown bears (Ursus arctos) using a 120-SNP panel (data prepared for publication but not submitted). To decrease the cost of the analysis, we selected the 60 SNPs with the lowest dropout probabilities out of the 120 SNPs, and we repeated our analysis with SNP+ using 164 fecal samples and 173 hair samples. All samples were replicated four times, and about 25% and 20% of the fecal and hair samples, respectively, had low-quality DNA (call rate < 25%) and were excluded from both analyses. Figure 3 shows the relative frequencies of the dropout probability for the four studies. The dropout probability was clearly lower after selecting the panel of 60 SNPs for both types of non-invasive samples, averaging 0.05 (studies C and D; 60 SNPs) versus 0.2 (studies A and B; 120 SNPs). In terms of variability and distribution mode, study C, performed after SNP selection, obtained the lowest dropout probabilities. Study B had the highest dropout probability, probably because the hair samples were collected by hair trapping and, therefore, not all samples contained roots or enough hair to yield high-quality DNA. To summarize, SNP+ calculates the dropout probability, the Bayes factor, PI and MAF, and can be used to select the best SNPs for arrays ranging from low to high density, avoiding those SNPs that require many replicates because they are prone to error. Moreover, SNP+ reports the number of replicates needed per sample to reach 95% genotyping reliability for each SNP.
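The panel reduction step described above amounts to ranking SNPs by their estimated dropout probability and keeping the best ones. A minimal sketch, assuming the estimates are available as a mapping from SNP name to dropout probability (the function name and input format are hypothetical, not the SNP+ output format):

```python
def select_lowest_dropout(dropout_by_snp, k):
    """Return the names of the k SNPs with the lowest estimated dropout probability."""
    ranked = sorted(dropout_by_snp.items(), key=lambda item: item[1])
    return [name for name, _ in ranked[:k]]
```

For the case described in the text, `k` would be 60 and the mapping would hold the 120 per-SNP estimates.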

Fig. 3

Histograms showing the relative frequencies (%) of dropout probability in bear samples using the SNP+ software in four cases (“X” = average dropout probability)