SNP+ to predict dropout rates in SNP arrays

Sastre, N.; Mercadé, A.; Casellas, J.

doi:10.1007/s12686-023-01309-3

SNP+ to predict dropout rates in SNP arrays

Methods and Resources
Open access
Published: 08 July 2023

Volume 15, pages 113–116, (2023)
Cite this article

Download PDF

You have full access to this open access article

Conservation Genetics Resources Aims and scope Submit manuscript

SNP+ to predict dropout rates in SNP arrays

Download PDF

N. Sastre¹,
A. Mercadé¹ &
J. Casellas²

902 Accesses
Explore all metrics

Abstract

Genotyping individuals using forensic or non-invasive samples such as hair or fecal samples increases the risk of allelic amplification failure (dropout) due to the low quality and quantity of DNA. One way to decrease genotyping errors is to increase the number of replicates per sample. Here, we have developed the software SNP+ to estimate the dropout probability and the subsequent required number of replicates to obtain the reliable genotype with probability 95%. Moreover, the software predicts the minor allele frequency and compares two competing models assuming equal or allele-specific dropout probabilities by Bayes factor. The software handles data from one SNP to high density arrays (e.g., 100,000 SNPs).

SNP allele calling of Illumina Infinium Omni5-4 data using the butterfly method

Article Open access 12 October 2022

A SNP panel for identification of DNA and RNA specimens

Article Open access 25 January 2018

High-Throughput SNP Genotyping

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Single nucleotide polymorphisms (SNPs) are biallelic markers largely abundant in most genomes and with low mutation rate (~ 10⁻⁹ per generation) (Brumfield et al. 2003; Morin et al. 2004). SNPs can be associated to diseases, susceptibility to environmental factors or quantitative trait locus (Erichsen and Chanock 2004; Amos et al. 2008; Casellas et al. 2008; Nickels et al. 2013). In forensic medicine, SNPs can be useful to identify individuals from non-invasive samples by using short amplicons (Sobrino et al. 2005). However, individual identification using degraded samples with less than 100 copies of gDNA may cause genotyping errors (Giardina et al. 2009; von Thaden et al. 2020). Allelic amplification failure, or dropout is the most common error caused by stochastic effects of the PCR reaction (Taberlet and Luikart 1999). To reduce dropout ratio, multiplex pre-amplification or increased replicates per sample could be performed (Bellemain and Taberlet 2004; Sastre et al. 2009). However, both solutions increase time and cost for genotyping individuals. In order to reduce genotyping errors using non-invasive samples without cost, we decided to develop a software (SNP+) to predict the dropout probability of each SNP from a sample of replicated genotypes. Moreover, two alternative parametrizations were compared by a Bayes factor to check for within-SNP homogeneous dropout probability against different dropout probabilities for each allele.

Material and methods

The SNP+ software analyzes each SNP independently, taking as a starting point a vector y of n genotypes ordered by individual (m) and replicates within individual (y’ = [y’₁ y’₂ … y’_m]), where n₁ is the number of replicates for the first individual, and n = n₁ + n₂ + … + n_m. Assuming two alleles, A and B, the Bayesian joint posterior distribution generalizes to

$$p(f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} \left| {{\mathbf{y}})\sim p({\mathbf{y}}} \right|f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} )p\left( {f_{{\text{A}}} } \right)p(\varepsilon_{{\text{A}}} )p(\varepsilon_{{\text{B}}} ),$$

and focuses on estimating the allele frequency (f_A), as well as the dropout probability for allele A (ε_A) or B (ε_B). Taking a particular genotype y_i with possible outcomes AA, AB, BB and missing genotype (miss.), its Bayesian likelihood is computed as

$$p({\text{y}}_{i} = {\text{ AA}}|f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} ) = p\left( {{\text{AA}}|{\text{AA}}} \right)p\left( {{\text{AA}}} \right) \, + p\left( {{\text{AA}}|{\text{AB}}} \right)p\left( {{\text{AB}}} \right)$$

$$p({\text{y}}_{i} = {\text{AB}}|f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} ) = p\left( {{\text{AB}}|{\text{AB}}} \right)p\left( {{\text{AB}}} \right)$$

$$p({\text{y}}_{i} = {\text{ BB}}|f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} ) = p\left( {{\text{BB}}|{\text{AB}}} \right)p\left( {{\text{AB}}} \right) + p\left( {{\text{BB}}|{\text{BB}}} \right)p\left( {{\text{BB}}} \right)$$

$$p({\text{y}}_{i} = miss.|f_{{\text{A}}} ,\varepsilon_{{\text{A}}} ,\varepsilon_{{\text{B}}} ) = p\left( {miss.|{\text{AA}}} \right)p\left( {{\text{AA}}} \right) + p\left( {miss.|{\text{AB}}} \right)p\left( {{\text{AB}}} \right) + p\left( {miss.|{\text{BB}}} \right)p\left( {{\text{BB}}} \right),$$

where

$$p\left( {{\text{AA}}|{\text{AA}}} \right) = ({1}{-}\varepsilon_{{\text{A}}} )^{{2}}$$

$$p\left( {{\text{AA}}|{\text{AB}}} \right) = \varepsilon_{{\text{B}}}$$

$$p\left( {{\text{AB}}|{\text{AB}}} \right) = ({1}{-}\varepsilon_{{\text{A}}} )({1}{-}\varepsilon_{{\text{B}}} )$$

$$p\left( {{\text{BB}}|{\text{AB}}} \right) = \varepsilon_{{\text{A}}}$$

$$p\left( {{\text{BB}}|{\text{BB}}} \right) = ({1}{-}\varepsilon_{{\text{B}}} )^{{2}}$$

$$p\left( {miss.|{\text{AA}}} \right) = \varepsilon_{{\text{A}}}^{{2}}$$

$$p\left( {miss.|{\text{AB}}} \right)\, = \,\varepsilon _{{\text{A}}} \varepsilon _{{\text{B}}}$$

$$p\left( {miss.|{\text{BB}}} \right) = \varepsilon_{{\text{B}}}^{{2}}$$

And $p\left( {{\text{AA}}} \right) = f_{{\text{A}}}^{{2}} ,p\left( {{\text{AB}}} \right) = {2}f_{{\text{A}}} \left( {{1} - f_{{\text{A}}} } \right)$, and $p\left( {{\text{BB}}} \right) = f_{{\text{B}}}^{{2}}$. Note that the model assumes that a BB individual cannot be genotyped as AA (the probability of false alleles is zero). A priori distributions for f_A, ε_A and ε_B were assumed flat between 0 and 1.

For each SNP, the model was solved by a Metropolis–Hastings sampling process (Metropolis et al. 1953) with 500,000 iterations after a burn-in period of 10,000 iterations. Two alternative parameterizations ($\varepsilon_{{\text{A}}} = \varepsilon_{{\text{B}}} vs.\varepsilon_{{\text{A}}} \ne \varepsilon_{{\text{B}}}$) were compared by Bayes factor (Kass and Raftery 1995). The minimum number of within-individual replicates required to predict the reliable genotype with probability 95% was calculated as log(0.05)/log(ε_A). All these procedures have been implemented in the SNP+ software, available at http://www.casellas.info/software.html.

Results and discussion

The program generates the following text delimited output files:

(1)
Summary table of the probability of error, confidence interval, replications, and Bayes factor of all the SNPs (output file example in Fig. 1).
(2)
SNP-by-SNP report of dropout probabilities with their confidence intervals, minimum number of replicates, Bayer factor comparing a single dropout probability against two independent dropout probabilities.
(3)
Predicted genotypes for each individual and SNP, and probability of error, if any.
(4)
Pairwise comparison between individuals and the probability to have an identical genotype.
(5)
SNP-by-SNP report of the minor allele frequency (MAF) and probability of identity (PI).

The software has been extensively tested on simulated data with appealing results. Figure 2 illustrated SNP+ ability to detect allele-specific departures in dropout probabilities, as well as the increase in statistical relevance (i.e., Bayes factor) as estimated dropout differences increase. The U-shaped scatter plot misidentified a common dropout probability in less than 3% of the simulated data sets, and this percentage reduced below 1% with seven replicates per sample (results not shown). Sample size does not modify the dropout probability but the accuracy of the (dropout probability) estimate we obtain with SNP+.

Moreover, we have used SNP+ to evaluate two panels using Open Array® technology (Thermo Fisher Scientific Inc). We analyzed 22 fecal samples and 114 hair samples from Iberian brown bears (Ursus arctos) using first a 120 SNP panel (data prepared for publication but not submitted). To decrease the cost of the analysis, we selected 60 SNPs out of 120 SNPs with the lowest dropout probabilities, and we repeated our analysis with SNP+ using 164 fecal samples and 173 hair samples. All samples were replicated four times, and about 25% and 20% of low-quality DNA fecal and hair samples, respectively (call rate < 25%) were not included in both analyses. Figure 3 shows the relative frequencies of the dropout probability for the four studies. The dropout probability was clearly low after selecting the panel of 60 SNPs using two types of non-invasive samples, on average 0.05 (studies C and D; 60 SNPs) versus 0.2 (studies A and B; 120 SNPs). In terms of variability and distribution mode, the study that obtains lower dropout probabilities is the study C, after SNP selection. The study with the highest probability of dropout is the study B probably because hair samples were hair-trapping collected and therefore, not all samples contained roots or enough hair quantity to obtain high DNA quality. To summarize, SNP+ calculates the dropout likelihood, the Bayes factor, PI and MAF, and can be used to select the best arrays from low density arrays up to high density arrays, avoiding those SNPs that require many replicates because they lead to error. Moreover, SNP+ shows the number of replicates needed per sample to reach a 95% of genotyping reliability per SNP.

References

Amos CI, Wu X, Broderick P et al (2008) Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nat Genet 40(5):616–622. https://doi.org/10.1038/ng.109
Article CAS PubMed PubMed Central Google Scholar
Bellemain E, Taberlet P (2004) Improved noninvasive genotyping method: application to brown bear (Ursus arctos) faeces. Mol Ecol Notes 4(3):519–522. https://doi.org/10.1111/j.1471-8286.2004.00711.x
Article CAS Google Scholar
Brumfield R, Beerli PA, Nickerson D et al (2003) The utility of single nucleotide polymorphisms in inferences of population history. Trends Ecol Evol 18:249–256. https://doi.org/10.1016/S0169-5347(03)00018-1
Article Google Scholar
Casellas J, Varona L, Muñoz G et al (2008) Empirical Bayes factor analyses of quantitative trait loci for gestation length in Iberian × Meishan F2 sows. Animal 2(2):177–183. https://doi.org/10.1017/S1751731107001085
Article CAS PubMed Google Scholar
Erichsen HC, Chanock SJ (2004) SNPs in cancer research and treatment. Br J Cancer 90(4):747–751. https://doi.org/10.1038/sj.bjc.6601574
Article CAS PubMed PubMed Central Google Scholar
Giardina E, Pietrangeli I, Martone C et al (2009) Whole genome amplification and real-time PCR in forensic casework. BMC Genomics 10:159. https://doi.org/10.1186/1471-2164-10-159
Article CAS PubMed PubMed Central Google Scholar
Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90(430):773–795. https://doi.org/10.1080/01621459.1995.10476572
Article Google Scholar
Metropolis N, Rosenbluth AW, Rosenbluth MN et al (1953) Equation of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092. https://doi.org/10.1063/1.1699114
Article CAS Google Scholar
Morin P, Luikart G, Wayne RK (2004) SNPs in ecology, evolution and conservation. Trends Ecol Evol 19:208–216. https://doi.org/10.1016/j.tree.2004.01.009
Article Google Scholar
Nickels S, Truong T, Hein R et al (2013) Evidence of gene-environment interactions between common breast cancer susceptibility loci and established environmental risk factors. PLoS Genet 9(3):e1003284. https://doi.org/10.1371/journal.pgen.1003284
Article CAS PubMed PubMed Central Google Scholar
Sastre N, Francino O, Lampreave G et al (2009) Sex identification of wolf (Canis lupus) using non-invasive samples. Conserv Genet 10(3):555–558. https://doi.org/10.1007/s10592-008-9565-6
Article CAS Google Scholar
Sobrino B, Brión M, Carracedo A (2005) SNPs in forensic genetics: a review on SNP typing methodologies. Forensic Sci Int 154(2–3):181–194. https://doi.org/10.1016/j.forsciint.2004.10.020
Article CAS PubMed Google Scholar
Taberlet P, Luikart G (1999) Non-invasive genetic sampling and individual identification. Biol J Lin Soc 68(1–2):41–55. https://doi.org/10.1111/j.1095-8312.1999.tb01157.x
Article Google Scholar
von Thaden A, Nowak C, Tiesmeyer A et al (2020) Applying genomic data in wildlife monitoring: development guidelines for genotyping degraded samples with reduced single nucleotide polymorphism (SNP) panels. Mol Ecol Resour 20(3):662. https://doi.org/10.1111/1755-0998.13136
Article Google Scholar

Download references

Acknowledgements

This work was funded by the LoupO project (EFA354/19) of the European Interreg Program V-A Spain-France-Andorra (POCTEFA 2014-2020). We are grateful to two anonymous reviewers for helpful comments on the manuscript.

Funding

Open Access Funding provided by Universitat Autonoma de Barcelona. Funding was provided by Interreg (EFA354/19).

Author information

Authors and Affiliations

Servei Veterinari de Genètica Molecular, Universitat Autònoma de Barcelona, 08193, Bellaterra, Spain
N. Sastre & A. Mercadé
Departament de Ciència Animal i dels Aliments, Universitat Autònoma de Barcelona, 08193, Bellaterra, Spain
J. Casellas

Authors

N. Sastre
View author publications
You can also search for this author in PubMed Google Scholar
A. Mercadé
View author publications
You can also search for this author in PubMed Google Scholar
J. Casellas
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceptualisation: NS, JC, AM. Formal analysis: JC. Methodology: NS, JC. Resources: NS. Validation: NS, JC. Writing—original draft: NS Writing—review and editing: NS, JC, AM.

Corresponding author

Correspondence to J. Casellas.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

For the present study no animal was captured or euthanized.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Sastre, N., Mercadé, A. & Casellas, J. SNP+ to predict dropout rates in SNP arrays. Conservation Genet Resour 15, 113–116 (2023). https://doi.org/10.1007/s12686-023-01309-3

Download citation

Received: 14 November 2022
Accepted: 12 June 2023
Published: 08 July 2023
Issue Date: September 2023
DOI: https://doi.org/10.1007/s12686-023-01309-3

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

SNP+ to predict dropout rates in SNP arrays

Abstract

Similar content being viewed by others

SNP allele calling of Illumina Infinium Omni5-4 data using the butterfly method