Annotation Regression for Genome-Wide Association Studies with an Application to Psychiatric Genomic Consortium Data


Although genome-wide association studies (GWAS) have been successful at finding thousands of disease-associated genetic variants (GVs), identifying causal variants and elucidating the mechanisms by which genotypes influence phenotypes remain critical open questions. A key challenge is that a large percentage of disease-associated GVs are potential regulatory variants located in noncoding regions, making them difficult to interpret. Recent research efforts focus on going beyond annotating GVs by integrating functional annotation data with GWAS to prioritize GVs. However, the applicability of these approaches is challenged by the high dimensionality and heterogeneity of functional annotation data. Furthermore, existing methods often assume global associations of GVs with annotation data. This strong assumption is susceptible to violations for GVs involved in many complex diseases. To address these issues, we develop a general regression framework, named Annotation Regression for GWAS (ARoG). ARoG is based on a finite mixture of linear regressions model in which GWAS association measures are viewed as responses and functional annotations as predictors. This mixture framework addresses the heterogeneity of effects of GVs by grouping them into clusters, and the high dimensionality of the functional annotations by enabling annotation selection within each cluster. ARoG further employs permutation testing to evaluate the significance of selected annotations. Computational experiments indicate that ARoG can discover distinct associations between disease risk and functional annotations. Application of ARoG to autism and schizophrenia data from the Psychiatric Genomics Consortium led to the identification of GVs that significantly affect interactions of several transcription factors with DNA as potential mechanisms contributing to these disorders.
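To make the modeling idea concrete, the following is a minimal sketch of a finite mixture of linear regressions fitted by the EM algorithm, with an association measure as the response and annotation scores as predictors. It is an illustration under stated assumptions, not the ARoG implementation: the function names, the simulated data, and the Gaussian error model are hypothetical choices made here, and the sketch omits both the within-cluster annotation selection and the permutation testing described above.

```python
import numpy as np


def simulate_two_cluster_data(n=300, noise_sd=0.5, seed=0):
    """Toy data (hypothetical): one annotation whose effect flips sign across two clusters."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)        # hidden cluster of each variant
    X = rng.normal(size=(n, 1))                # stand-in for an annotation score
    slopes = np.where(labels == 0, 2.0, -2.0)  # cluster-specific annotation effect
    y = slopes * X[:, 0] + rng.normal(scale=noise_sd, size=n)
    return X, y, labels


def fit_mixture_of_regressions(X, y, n_clusters=2, n_iter=100, init_resp=None, seed=0):
    """Unpenalized EM for a finite mixture of Gaussian linear regressions.

    Returns mixing weights, per-cluster coefficients (intercept first),
    per-cluster residual standard deviations, and soft cluster memberships.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])      # design matrix with intercept
    if init_resp is None:
        # Random soft memberships; in practice several restarts may be needed.
        resp = rng.dirichlet(np.ones(n_clusters), size=n)
    else:
        resp = np.asarray(init_resp, dtype=float)

    for _ in range(n_iter):
        # M-step: weighted least squares and weighted residual variance per cluster.
        weights = np.clip(resp.mean(axis=0), 1e-12, None)
        betas = np.empty((n_clusters, p + 1))
        sigmas = np.empty(n_clusters)
        for k in range(n_clusters):
            w = resp[:, k]
            # Tiny diagonal term is a numerical safeguard (an assumption of this sketch).
            gram = Xd.T @ (w[:, None] * Xd) + 1e-8 * np.eye(p + 1)
            betas[k] = np.linalg.solve(gram, Xd.T @ (w * y))
            resid_k = y - Xd @ betas[k]
            sigmas[k] = np.sqrt((w * resid_k**2).sum() / max(w.sum(), 1e-12) + 1e-12)

        # E-step: posterior cluster probabilities, computed on the log scale.
        # The constant -0.5*log(2*pi) is dropped because it cancels on normalizing.
        resid = y[:, None] - Xd @ betas.T      # shape (n, n_clusters)
        log_dens = np.log(weights) - np.log(sigmas) - 0.5 * (resid / sigmas) ** 2
        log_dens -= log_dens.max(axis=1, keepdims=True)
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)

    return weights, betas, sigmas, resp
```

Calling `fit_mixture_of_regressions(*simulate_two_cluster_data()[:2])` returns one coefficient vector per cluster, so an annotation can have a positive effect in one group of variants and a negative or null effect in another, which a single global regression would average away. A penalized M-step (for example an \(l_1\) penalty on the coefficients) would be one way to add annotation selection within each cluster; that extension is not shown here.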






This research was supported by the National Institutes of Health Grants HG007019, HG003747, and U54AI117924. The authors thank the editor and two referees for their helpful comments.

Author information



Corresponding author

Correspondence to Sündüz Keleş.



Cite this article

Shin, S., Keleş, S. Annotation Regression for Genome-Wide Association Studies with an Application to Psychiatric Genomic Consortium Data. Stat Biosci 9, 50–72 (2017).



  • Finite mixture of regressions
  • Functional genomic data
  • Genome-wide association studies
  • Integrative analysis
  • Regularized variable selection