Abstract
Most genome-wide association study (GWAS) analyses test the association between single-nucleotide polymorphisms (SNPs) and a single trait or outcome. While valuable second-step analyses of these associations (e.g., calculating genetic correlations between traits) are common, single-step multivariate analyses of GWAS data are rarely performed. This is unfortunate because multivariate analyses can reveal information that is irretrievably obscured in multi-step analyses. One simple example is the distinction between variance common to a set of measures and variance specific to each. Neither a GWAS of sum or factor scores nor a GWAS of the individual measures will deliver a clean picture of the loci associated with each measure’s specific variance. While multivariate GWAS opens up a broad new landscape of feasible and informative analyses, its adoption has been slow, likely because of its heavy computational demands and the difficulty of specifying the models it requires. Here we describe GW-SEM 2.0, which is designed to simplify model specification and overcome the inherent computational challenges associated with multivariate GWAS. In addition, GW-SEM 2.0 allows users to accurately model ordinal items, which are common in behavioral and psychological research, within a GWAS context. This new release enhances computational efficiency, allows users to select the fit function that is appropriate for their analyses, expands compatibility with standard genomic data formats, and outputs results that can be read directly into other standard post-GWAS processing software. To demonstrate GW-SEM’s utility, we conducted (1) a series of GWAS using three substance use frequency items from the UK Biobank, (2) a timing study for several predefined GWAS functions, and (3) a Type I Error rate study.
Our multivariate GWAS analyses emphasize the utility of GW-SEM for identifying novel patterns of associations that vary considerably between genomic loci for specific substances, highlighting the importance of differentiating between substance-specific use behaviors and polysubstance use. The timing studies demonstrate that the analyses take a reasonable amount of time and show the cost of including additional items. The Type I Error rate study demonstrates that p-values from hypothesis tests for genetic associations with latent variable models follow the expected uniform distribution. Taken together, we suggest that GW-SEM may provide substantially deeper insights into the underlying genomic architecture of multivariate behavioral and psychological systems than is currently possible with standard GWAS methods. The current release of GW-SEM 2.0 is available on CRAN (stable release) and GitHub (beta release), and tutorials are available on our GitHub wiki (https://jpritikin.github.io/gwsem/).
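The logic of a Type I Error rate check can be illustrated with a minimal sketch. It is not the GW-SEM procedure: it assumes an ordinary linear regression of a continuous trait on a single SNP as a stand-in for the latent variable models, and the function names, sample size, allele frequency, and number of replicates are all hypothetical. Under the null hypothesis of no genetic effect, roughly 5% of p-values should fall below 0.05, and the empirical distribution of p-values should lie close to the uniform distribution.

```python
import math
import random


def null_pvalue(n, maf, rng):
    """Simulate one SNP with no effect on the trait; return a two-sided p-value."""
    # Genotype: minor-allele count (0, 1, or 2) under Hardy-Weinberg equilibrium.
    g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
    # Phenotype: standard normal noise, independent of genotype (the null model).
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mg, my = sum(g) / n, sum(y) / n
    sxx = sum((gi - mg) ** 2 for gi in g)
    if sxx == 0.0:
        return None  # monomorphic sample: slope is undefined
    sxy = sum((gi - mg) * (yi - my) for gi, yi in zip(g, y))
    beta = sxy / sxx
    rss = sum((yi - my - beta * (gi - mg)) ** 2 for gi, yi in zip(g, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = beta / se
    # Normal approximation to the t distribution (adequate for large n).
    return math.erfc(abs(z) / math.sqrt(2.0))


def ks_distance_to_uniform(pvals):
    """Largest gap between the empirical CDF of pvals and the Uniform(0, 1) CDF."""
    s = sorted(pvals)
    m = len(s)
    return max(max((i + 1) / m - p, p - i / m) for i, p in enumerate(s))


rng = random.Random(2021)  # fixed seed so the sketch is reproducible
draws = (null_pvalue(300, 0.3, rng) for _ in range(1000))
pvals = [p for p in draws if p is not None]

type1 = sum(p < 0.05 for p in pvals) / len(pvals)
ks = ks_distance_to_uniform(pvals)
print(f"empirical Type I error at alpha = 0.05: {type1:.3f}")
print(f"KS distance from the uniform distribution: {ks:.3f}")
```

A KS distance near zero and an empirical rejection rate near the nominal level are what a well-calibrated test should produce; marked departures in either would indicate inflated or deflated test statistics.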
Acknowledgements
The authors express their deepest gratitude to the anonymous reviewers, whose invaluable comments undoubtedly improved the overall quality of this manuscript.
Funding
MCN was supported by NIDA Grant R01-DA018673. JNP was supported by NIDA Grant R25-DA-26119 (PI: Neale).
Ethics declarations
Conflict of interest
Joshua N. Pritikin, Michael C. Neale, Elizabeth C. Prom-Wormley, Shaunna L. Clark, and Brad Verhulst declare that they have no conflicts of interest related to the publication of this article.
Ethical approval
The data used for the demonstration section of this study were obtained from the UK Biobank (Application Number 40967) and involved secondary data analysis. As no identifying information was transferred, the data were not deemed “Human Subjects Data”, and appropriate human subjects waivers were obtained by the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Edited by Sarah Medland.
Cite this article
Pritikin, J.N., Neale, M.C., Prom-Wormley, E.C. et al. GW-SEM 2.0: Efficient, Flexible, and Accessible Multivariate GWAS. Behav Genet 51, 343–357 (2021). https://doi.org/10.1007/s10519-021-10043-1
DOI: https://doi.org/10.1007/s10519-021-10043-1