Although genome-wide association studies (GWAS) have identified thousands of disease-associated genetic variants (GVs), identifying causal variants and elucidating the mechanisms by which genotypes influence phenotypes remain critical open questions. A key challenge is that a large percentage of disease-associated GVs are potential regulatory variants located in noncoding regions, which makes them difficult to interpret. Recent research efforts go beyond annotating GVs by integrating functional annotation data with GWAS to prioritize GVs. However, the applicability of these approaches is limited by the high dimensionality and heterogeneity of functional annotation data. Furthermore, existing methods often assume global associations of GVs with annotation data, a strong assumption that is likely to be violated for GVs involved in many complex diseases. To address these issues, we develop a general regression framework, named Annotation Regression for GWAS (ARoG). ARoG is based on a finite mixture of linear regressions model in which GWAS association measures are viewed as responses and functional annotations as predictors. This mixture framework addresses the heterogeneity of effects of GVs by grouping them into clusters, and the high dimensionality of the functional annotations by enabling annotation selection within each cluster. ARoG further employs permutation testing to evaluate the significance of selected annotations. Computational experiments indicate that ARoG can discover distinct associations between disease risk and functional annotations. Application of ARoG to autism and schizophrenia data from the Psychiatric Genomics Consortium led to the identification of GVs that significantly affect interactions of several transcription factors with DNA as potential mechanisms contributing to these disorders.
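To make the modelling idea in the abstract concrete, the following is a minimal sketch of a finite mixture of linear regressions fitted by expectation-maximization, with association measures as the response `y` and annotation scores as the predictor matrix `X`. It is not the ARoG implementation: the function name `fit_mixture_regression`, its parameters, and the simulated data are hypothetical, and the per-cluster annotation selection and permutation testing described above are replaced here by plain weighted least squares with a tiny ridge term for numerical stability.

```python
import numpy as np

def fit_mixture_regression(X, y, n_clusters=2, n_iter=200, ridge=1e-6, seed=0):
    """EM for a Gaussian mixture of linear regressions (no sparsity penalty).

    Illustrative assumption: each cluster has its own intercept, slopes,
    and noise standard deviation. Returns (pi, betas, sigmas, resp).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])               # prepend intercept column
    resp = rng.dirichlet(np.ones(n_clusters), size=n)   # random soft assignments
    for _ in range(n_iter):
        # M-step: mixing proportions, then weighted least squares per cluster
        pi = resp.mean(axis=0)
        betas = np.empty((n_clusters, p + 1))
        sigmas = np.empty(n_clusters)
        for k in range(n_clusters):
            w = resp[:, k]
            A = Xd.T @ (w[:, None] * Xd) + ridge * np.eye(p + 1)
            betas[k] = np.linalg.solve(A, Xd.T @ (w * y))
            resid_k = y - Xd @ betas[k]
            sigmas[k] = np.sqrt((w @ resid_k**2) / max(w.sum(), 1e-12) + 1e-12)
        # E-step: posterior cluster probabilities, computed in log space
        resid = y[:, None] - Xd @ betas.T                # shape (n, n_clusters)
        log_dens = (-0.5 * (resid / sigmas) ** 2 - np.log(sigmas)
                    - 0.5 * np.log(2 * np.pi) + np.log(pi + 1e-300))
        log_dens -= log_dens.max(axis=1, keepdims=True)  # stabilise the exponent
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)
    return pi, betas, sigmas, resp

if __name__ == "__main__":
    # Simulated example: two groups of variants whose association measure
    # depends on the first annotation with opposite signs.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 3))
    group = rng.integers(0, 2, size=400)
    slope = np.where(group == 0, 3.0, -3.0)
    y = slope * X[:, 0] + rng.normal(scale=0.3, size=400)
    pi, betas, sigmas, resp = fit_mixture_regression(X, y, n_clusters=2)
    print("mixing proportions:", pi)
    print("per-cluster coefficients (intercept first):")
    print(betas)
```

A single global regression on such data would estimate a slope near zero for the first annotation, which is the failure of the global-association assumption that the mixture formulation is meant to avoid. EM can converge to a local optimum, so in practice several random starts (different `seed` values) would be compared.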
This research was supported by the National Institutes of Health Grants HG007019, HG003747, and U54AI117924. The authors thank the editor and two referees for their helpful comments.
About this article
Cite this article
Shin, S., Keleş, S. Annotation Regression for Genome-Wide Association Studies with an Application to Psychiatric Genomic Consortium Data. Stat Biosci 9, 50–72 (2017). https://doi.org/10.1007/s12561-016-9154-z
- Finite mixture of regressions
- Functional genomic data
- Genome-wide association studies
- Integrative analysis
- Regularized variable selection