Competing analytical strategies of combining associated SNPs for estimating genetic risks

  • Research Article
Journal of Genetics

Abstract

In a genome-wide association study (GWAS) of a complex phenotype, a large number of variants, many with small effect sizes, are found to contribute to the variability of the phenotype. Once such variants have been identified, it is of interest to estimate the risk that they jointly confer. We propose three different strategies for combining the risk single-nucleotide polymorphisms (SNPs) to calculate an allele dosage score. Using simulations, we evaluate these measures with respect to risk prediction accuracy for a binary trait and the proportion of variance explained for a quantitative trait. For a binary trait, an allele dosage score based on the log odds ratio performs marginally better than the other two measures. For a quantitative trait, the measure based on the standardized slope coefficient in the linear regression of the trait on SNP genotypes performs better than the measures using weights proportional to the log P-value and the proportion of variance explained. We demonstrate the utility of these measures using real data on type 2 diabetes and fasting blood sugar level in a south Indian population.

Acknowledgements

The authors are grateful to Prof. V. Mohan and Dr Radha Venkatesan of the Madras Diabetes Research Foundation (MDRF) for providing access to an a priori analysed portion of the CURES data through a mutual collaboration with SG.

Author information

Corresponding author

Correspondence to Saurabh Ghosh.

Additional information

Corresponding editor: Shrish Tiwari

Appendix

Risk prediction for a binary end-point trait

The popular case–control study design usually involves a prefixed number of cases and controls. We demonstrate that such a design is inappropriate for estimating the risks conferred by individual SNPs. The prevalence of a disease is a function of the allele frequencies and the penetrances at the different causal loci. The prevalence of a disease in a population is usually much lower than \(50\%\) (or, more generally, than the proportion of cases in GWAS data). Consequently, a design with equal numbers of cases and controls, or with a case-to-control ratio substantially larger than that in the population, grossly over-represents cases and leads to overestimation of the penetrances. On the other hand, a large sample of individuals chosen at random from a population, irrespective of their affection status, is expected to carry proper information on the disease prevalence in the population.

For a disease trait, define a variable Y for an individual such that \(Y=1\) if the individual is a case and \(Y=0\) if the individual is a control. We assume that the disease locus is biallelic with alleles D and d having frequencies p and \(1-p\), respectively, in the population. Suppose we have genotype data at the disease locus on 2n individuals comprising \(n_1\) cases and \(n_2\) \((=2n-n_1)\) controls. The penetrances at the disease locus are defined as \(f_2=P(\text{case}|DD)\), \(f_1=P(\text{case}|Dd)\) and \(f_0=P(\text{case}|dd)\). With the genotype frequencies in Hardy–Weinberg proportions, the overall disease prevalence is given by \(P(\text{case})=f_2p^2 + 2f_1 p (1-p) + f_0 (1-p)^2\). Using a logistic link function between the disease status and the SNP genotype, we can estimate the penetrance parameters from the data.
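The prevalence formula above can be checked numerically. The following is a minimal sketch in Python; the function name `prevalence` is a hypothetical helper, and the parameter values are the illustrative ones used later in this appendix (\(f_2 = 0.1, f_1 = 0.05, f_0 = 0.01, p = 0.1\)).

```python
def prevalence(f2, f1, f0, p):
    """Overall prevalence P(case) from penetrances and the frequency p of allele D.

    Genotype frequencies are taken as p^2, 2p(1-p), (1-p)^2, as in the text.
    """
    q = 1.0 - p
    return f2 * p**2 + 2.0 * f1 * p * q + f0 * q**2


# Illustrative values taken from the worked example in this appendix.
prev = prevalence(f2=0.1, f1=0.05, f0=0.01, p=0.1)
print(round(prev, 4))
```

With these values the three genotype contributions are 0.001, 0.009 and 0.0081, so most of the prevalence arises from the common Dd and dd genotypes rather than from the rare high-risk DD genotype.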

By Bayes' theorem, the genotype probabilities conditional on disease status are \(P(DD|\text{case})=\frac{f_2p^2}{P(\text{case})},\) \(P(Dd|\text{case})=\frac{2 f_1 p(1-p)}{P(\text{case})},\) \(P(dd|\text{case})=\frac{f_0(1-p)^2}{P(\text{case})},\) and \(P(DD|\text{control})=\frac{(1-f_2)p^2}{P(\text{control})},\) \(P(Dd|\text{control})=\frac{2 (1-f_1) p(1-p)}{P(\text{control})},\) \(P(dd|\text{control})=\frac{(1-f_0)(1-p)^2}{P(\text{control})},\) where \(P(\text{control}) = 1-P(\text{case})\). Let \(n_{case}^{DD}\) and \(n_{control}^{DD}\) denote the numbers of case and control individuals, respectively, who have the genotype DD. The quantities \(n_{case}^{Dd},\) \(n_{case}^{dd},\) \(n_{control}^{Dd}\) and \(n_{control}^{dd}\) are defined similarly. Hence, \(E(n_{case}^{G}) = n_1 \times P(G|\text{case})\) and \(E(n_{control}^{G}) = n_2 \times P(G|\text{control})\), \(G=DD,Dd,dd.\) Since, for a given G, \({n_{case}^{G}}/{n_1}\) is a consistent estimator of \(P(G|\text{case})\), we assume that \(n_{case}^{G} \approx n_1 \times P(G|\text{case})\) for large \(n_1\). By the same argument, for a given G, we assume that \(n_{control}^{G} \approx n_2 \times P(G|\text{control})\) for large \(n_2\).
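As a sketch of how these expected genotype counts could be computed, the Python snippet below evaluates \(n_1 P(G|\text{case})\) and \(n_2 P(G|\text{control})\) for each genotype. The function name `expected_counts` and the dictionary layout are assumptions made for illustration; the parameter values are those of the worked example later in this appendix.

```python
def expected_counts(n1, n2, f2, f1, f0, p):
    """Expected genotype counts among n1 cases and n2 controls.

    Returns two dicts keyed by genotype: n1 * P(G | case) and n2 * P(G | control).
    """
    q = 1.0 - p
    geno_freq = {"DD": p * p, "Dd": 2.0 * p * q, "dd": q * q}
    penetrance = {"DD": f2, "Dd": f1, "dd": f0}
    p_case = sum(penetrance[g] * geno_freq[g] for g in geno_freq)
    p_control = 1.0 - p_case
    cases = {g: n1 * penetrance[g] * geno_freq[g] / p_case for g in geno_freq}
    controls = {
        g: n2 * (1.0 - penetrance[g]) * geno_freq[g] / p_control for g in geno_freq
    }
    return cases, controls


# Design with the number of cases proportional to the prevalence (n1 = 181).
cases, controls = expected_counts(181, 9819, f2=0.1, f1=0.05, f0=0.01, p=0.1)
for g in ("DD", "Dd", "dd"):
    print(g, round(cases[g], 2), round(controls[g], 2))
```

The expected counts within each group sum to \(n_1\) and \(n_2\) by construction, which is a quick sanity check on the conditional probabilities.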

Let X denote the number of D alleles in a genotype, so that \(X=0,1,2\). We model the probability of being a case conditional on the genotype at the disease locus via the logistic link as follows: \(P(\text{case}|X=x) = \frac{\exp \{\alpha +\beta x\}}{1+\exp \{\alpha +\beta x\}}.\) So, \(P(\text{control}|X=x) = \frac{1}{1+\exp \{\alpha +\beta x \}}.\)
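Equivalently, the model is linear on the log-odds scale, which makes the role of \(\beta \) explicit. The restatement below is a direct consequence of the logistic form above, not an additional assumption:

$$\begin{aligned} \log \frac{P(\text{case}|X=x)}{P(\text{control}|X=x)}&= \alpha + \beta x, \\ \frac{P(\text{case}|X=x+1)/P(\text{control}|X=x+1)}{P(\text{case}|X=x)/P(\text{control}|X=x)}&= e^{\beta }. \end{aligned}$$

Thus \(e^{\beta }\) is the odds ratio per additional copy of D, and the model constrains the DD versus Dd and the Dd versus dd odds ratios to be equal.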

Following the above discussion, for a given choice of \(n_1,n_2,\) \(f_2,f_1,f_0\) and p, we take the values of \(n_{case}^{G}\) and \(n_{control}^{G}\) to be \(n_1 P(G|\text{case})\) and \(n_2 P(G|\text{control}),\) respectively, \(G=DD,Dd,dd\). Suppose we model the likelihood L of the genotype data of the cases and controls, \(n_{case}^{G} ( = n_{1}^{G})\) and \(n_{control}^{G} ( = n_{2}^{G}),\) \(G=DD,Dd,dd,\) based on the logistic model presented above. Then the log-likelihood function, denoted by \(\log (L)\), is given by:

$$\begin{aligned} \log (L)= & {} n_{1}^{DD} (\alpha + 2\beta ) + n_{1}^{Dd} (\alpha + \beta ) + n_{1}^{dd} \alpha \\&- (n_{1}^{DD}+n_{2}^{DD}) \log (1+\exp \{\alpha +2\beta \})\\&- (n_{1}^{Dd}+n_{2}^{Dd}) \log (1+\exp \{\alpha +\beta \}) \\&- (n_{1}^{dd}+n_{2}^{dd}) \log (1+\exp \{\alpha \}). \end{aligned}$$
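The log-likelihood above translates directly into code. The sketch below is one possible Python rendering, assuming the counts are stored in lists indexed by \(X = 0, 1, 2\) (that is, dd, Dd, DD); the function name and the illustrative counts (the expected counts for a design with \(n_1 = 181\) and \(n_2 = 9819\) under the parameter values used later in this appendix) are chosen for demonstration only.

```python
import math


def log_likelihood(alpha, beta, cases, controls):
    """Grouped logistic log-likelihood; cases/controls are indexed by X = 0, 1, 2."""
    total = 0.0
    for x in (0, 1, 2):
        eta = alpha + beta * x
        # Cases contribute eta - log(1 + e^eta); controls contribute -log(1 + e^eta).
        total += cases[x] * eta
        total -= (cases[x] + controls[x]) * math.log1p(math.exp(eta))
    return total


# Illustrative counts in the order dd, Dd, DD.
cases = [81.0, 90.0, 10.0]
controls = [8019.0, 1710.0, 90.0]

ll_null = log_likelihood(0.0, 0.0, cases, controls)    # all probabilities 0.5
ll_near = log_likelihood(-4.5, 1.45, cases, controls)  # rough trial values
print(ll_null, ll_near)
```

At \(\alpha = \beta = 0\) every individual has fitted probability 0.5, so the log-likelihood reduces to \(-(n_1+n_2)\log 2\), which gives a simple check of the implementation.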

For ease of exposition, we define the following quantities: \(\text{prob}_2 = \frac{\exp \{\alpha +2\beta \}}{1+\exp \{\alpha +2\beta \}},\) \(\text{prob}_1 = \frac{\exp \{\alpha +\beta \}}{1+\exp \{\alpha +\beta \}},\) \(\text{prob}_0 = \frac{\exp \{\alpha \}}{1+\exp \{\alpha \}}.\) Then the first- and second-order partial derivatives of the log-likelihood are given by:

$$\begin{aligned}\frac{\partial \log (L)}{\partial \alpha } &= (n_{1}^{DD}+n_{1}^{Dd}+n_{1}^{dd}) - (n_{1}^{DD}+n_{2}^{DD})\,\text{prob}_2 \\&\quad - (n_{1}^{Dd}+n_{2}^{Dd})\,\text{prob}_1 - (n_{1}^{dd}+n_{2}^{dd})\,\text{prob}_0\\\frac{\partial \log (L)}{\partial \beta } &= 2 n_{1}^{DD} + n_{1}^{Dd} - 2 (n_{1}^{DD}+n_{2}^{DD})\,\text{prob}_2 \\&\quad - (n_{1}^{Dd}+n_{2}^{Dd})\,\text{prob}_1 \\\frac{{\partial }^2 \log (L)}{\partial {\alpha }^2} &= - (n_{1}^{DD}+n_{2}^{DD})\,\text{prob}_2(1-\text{prob}_2)\\&\quad - (n_{1}^{Dd}+n_{2}^{Dd})\,\text{prob}_1(1-\text{prob}_1) - (n_{1}^{dd}+n_{2}^{dd})\,\text{prob}_0(1-\text{prob}_0)\\\frac{{\partial }^2 \log (L)}{\partial \alpha \, \partial \beta } &= - 2 (n_{1}^{DD}+n_{2}^{DD})\,\text{prob}_2(1-\text{prob}_2) \\&\quad - (n_{1}^{Dd}+n_{2}^{Dd})\,\text{prob}_1(1-\text{prob}_1)\\\frac{{\partial }^2 \log (L)}{\partial {\beta }^2} &= - 4 (n_{1}^{DD}+n_{2}^{DD})\,\text{prob}_2(1-\text{prob}_2) \\&\quad - (n_{1}^{Dd}+n_{2}^{Dd})\,\text{prob}_1(1-\text{prob}_1). \end{aligned}$$

Using the above equations in Fisher's scoring method, we obtain the maximum likelihood estimates (m.l.e.) of \(\alpha \) and \(\beta \). Equivalently, we could use the Newton–Raphson method or the iteratively reweighted least squares method, both of which coincide with Fisher's scoring for the logistic link. Based on the m.l.e. of \(\alpha \) and \(\beta \), we estimate the penetrance parameters \(f_2,f_1,f_0\) as \(P(\text{case}|X=2), P(\text{case}|X=1), P(\text{case}|X=0),\) respectively, using the logistic model discussed above.

For a given choice of \(n_1,n_2,\) \(f_2,f_1,f_0\) and p (and hence of \(n_{1}^{G},n_{2}^{G},\) \(G = DD,Dd,dd\)), one would hope that the estimated penetrances are close to the true values of \(f_2,f_1,f_0\). As an example, consider the penetrance parameters \(f_2 = 0.1, f_1 = 0.05, f_0 = 0.01\) and \(p = 0.1,\) which induce an overall prevalence of 0.018, and a total sample of 10,000 individuals. If we consider a design with equal numbers of cases and controls (i.e., \(n_1=n_2=5000\)) and estimate \(f_2,f_1,f_0\) using the above model, the estimates are \(\hat{f}_2 = 0.93, \hat{f}_1 = 0.73, \hat{f}_0 = 0.36\). Hence, the true penetrances are grossly overestimated, as pointed out earlier. On the other hand, if we take the number of cases to be proportional to the overall prevalence (i.e., \(n_1 = 181,\) \(n_2 = 9819)\), we obtain the estimates \(\hat{f}_2 = 0.16, \hat{f}_1 = 0.04, \hat{f}_0 = 0.01\). These estimates are much closer to the true penetrances than those from the design comprising equal numbers of cases and controls.
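The comparison of the two designs can be reproduced approximately with a short script. The sketch below is not the authors' code: it assumes the expected genotype counts are plugged in as described above, and it fits the logistic model by Newton–Raphson iterations started from \(\alpha = \beta = 0\). All function names are hypothetical.

```python
import math


def expected_counts(n1, n2, f, p):
    """Expected case/control counts for X = 0, 1, 2 copies of D; f = (f0, f1, f2)."""
    q = 1.0 - p
    geno = [q * q, 2.0 * p * q, p * p]  # P(X = 0), P(X = 1), P(X = 2)
    p_case = sum(fx * gx for fx, gx in zip(f, geno))
    cases = [n1 * fx * gx / p_case for fx, gx in zip(f, geno)]
    controls = [n2 * (1.0 - fx) * gx / (1.0 - p_case) for fx, gx in zip(f, geno)]
    return cases, controls


def fit_logistic(cases, controls, max_iter=100, tol=1e-12):
    """Newton-Raphson for P(case | X = x) = expit(alpha + beta * x) on grouped data."""
    alpha, beta = 0.0, 0.0
    for _ in range(max_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x in (0, 1, 2):
            total = cases[x] + controls[x]
            prob = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
            resid = cases[x] - total * prob       # score contribution
            w = total * prob * (1.0 - prob)       # information weight
            g_a += resid
            g_b += x * resid
            h_aa += w
            h_ab += x * w
            h_bb += x * x * w
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * g_a - h_ab * g_b) / det  # solve the 2x2 Newton system
        step_b = (h_aa * g_b - h_ab * g_a) / det
        alpha += step_a
        beta += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return alpha, beta


def estimated_penetrances(n1, n2, f, p):
    """Fitted P(case | X = x) for x = 0, 1, 2, i.e. estimates of (f0, f1, f2)."""
    cases, controls = expected_counts(n1, n2, f, p)
    alpha, beta = fit_logistic(cases, controls)
    return [1.0 / (1.0 + math.exp(-(alpha + beta * x))) for x in (0, 1, 2)]


true_f = (0.01, 0.05, 0.1)  # (f0, f1, f2) from the example above
est_equal = estimated_penetrances(5000, 5000, true_f, p=0.1)
est_prop = estimated_penetrances(181, 9819, true_f, p=0.1)
print("equal design        (f0, f1, f2):", [round(v, 2) for v in est_equal])
print("proportional design (f0, f1, f2):", [round(v, 2) for v in est_prop])
```

Because the additive logistic model has only two parameters, it cannot match three arbitrary penetrances exactly, which is one reason the estimates under the proportional design are close to, but not identical to, the true values.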

Cite this article

Majumdar, A., Ghosh, S. Competing analytical strategies of combining associated SNPs for estimating genetic risks. J Genet 101, 14 (2022). https://doi.org/10.1007/s12041-021-01349-4
