GW-SEM: A Statistical Package to Conduct Genome-Wide Structural Equation Modeling

  • Original Research
Behavior Genetics

Abstract

Improving the accuracy of phenotyping through the use of advanced psychometric tools will increase the power to find significant associations with genetic variants and expand the range of possible hypotheses that can be tested on a genome-wide scale. Multivariate methods, such as structural equation modeling (SEM), are valuable in the phenotypic analysis of psychiatric and substance use phenotypes, but these methods have not been integrated into standard genome-wide association analyses because fitting a SEM at each single nucleotide polymorphism (SNP) along the genome was hitherto considered too computationally demanding. By developing a method that can efficiently fit SEMs, it is possible to expand the set of models that can be tested. This is particularly necessary in psychiatric and behavioral genetics, where the statistical methods are often handicapped by phenotypes with large components of stochastic variance. Because genome-wide scans produce an enormous amount of data, the statistical methods used to analyze the data are relatively elementary: they do not directly correspond with the rich theoretical development, and they lack the potential to test more complex hypotheses about the measurement of, and interaction between, comorbid traits. In this paper, we present a method to test the association of a SNP with multiple phenotypes or a latent construct on a genome-wide basis using a diagonally weighted least squares (DWLS) estimator for four common SEMs: a one-factor model, a one-factor residuals model, a two-factor model, and a latent growth model. We demonstrate that the DWLS parameters and p-values strongly correspond with the more traditional full information maximum likelihood parameters and p-values. We also present the timing of simulations and power analyses and a comparison with an existing multivariate GWAS software package.

References

  • Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin: rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30(1):97–101

  • Agresti A (2002) Categorical data analysis, 2nd edn. Wiley-Interscience

  • Bock RD, Aitkin M (1981) Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika 46(4):443–459

  • Boker S, Neale M, Maes H, Wilde M, Spiegel M, Brick T, Fox J (2011) OpenMx: an open source extended structural equation modeling framework. Psychometrika 76(2):306–311

  • Boker SM, Neale MC, Maes HH, Wilde MJ, Spiegel M, Brick TR et al (2015) OpenMx 2.3.1 user guide [Computer software manual]

  • Blangero J, Lange K, Almasy L, Williams J, Dyer T, Peterson C (2000) Sequential oligogenic linkage analysis routines (SOLAR). [Computer software manual]

  • Browne MW (1984) Asymptotically distribution-free methods for the analysis of covariance structures. Br J Math Stat Psychol 37:62–83

  • Carragher N, Teesson M, Sunderland M, Newton NC, Krueger RF, Conrod PJ, Slade T (2016) The structure of adolescent psychopathology: a symptom-level analysis. Psychol Med 46(5):981–994. doi:10.1017/S0033291715002470

  • Chin WW (1998) Issues and opinion on structural equation modeling. MIS Q 22(1):vii–xvi

  • Choh AC, Lee M, Kent JW, Diego VP, Johnson W, Curran JE, Dyer TD, Bellis C, Blangero J, Siervogel RM, Towne B, Demerath EW, Czerwinski SA (2014) Gene-by-age effects on BMI from birth to adulthood: the Fels Longitudinal Study. Obesity 22(3):875–881

  • CONVERGE consortium (2015) Sparse whole-genome sequencing identifies two loci for major depressive disorder. Nature 523:588–591. doi:10.1038/nature14659

  • Cross-Disorder Group of the Psychiatric Genomics Consortium (2013) Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet 381(9875):1371–1379. doi:10.1016/S0140-6736(12)62129-1

  • Dahl A, Iotchkova V, Baud A, Johansson A, Gyllensten U, Soranzo N, Marchini J (2016) A multiple-phenotype imputation method for genetic studies. Nat Genet 48:466–472. doi:10.1038/ng.3513

  • DiStefano C, Morgan GB (2014) A comparison of diagonal weighted least squares robust estimation techniques for ordinal data. Struct Equ Model 21(3):425–438

  • Doyle MM, Murphy J, Shevlin M (2016) Competing factor models of child and adolescent psychopathology. J Abnorm Child Psychol 44:1559–1571. doi:10.1007/s10802-016-0129-9

  • Duell EJ, Sala N, Travier N, Munoz X, Boutron-Ruault MC, Clavel-Chapelon F, Gonzalez CA (2012) Genetic variation in alcohol dehydrogenase (adh1a, adh1b, adh1c, adh7) and aldehyde dehydrogenase (aldh2), alcohol consumption and gastric cancer risk in the european prospective investigation into cancer and nutrition (epic) cohort. Carcinogenesis 33(2):361–367. doi:10.1093/carcin/bgr285

  • Duncan SC, Duncan TE, Strycker LA (2006) Alcohol use from ages 9 to 16: a cohort-sequential latent growth model. Drug Alcohol Depend 81(1):71–81. doi:10.1016/j.drugalcdep.2005.06.001

  • Duncan TE, Duncan SC, Alpert A, Hops H, Stoolmiller M, Muthen B (1997) Latent variable modeling of longitudinal and multilevel substance use data. Multivar Behav Res 32(3):275–318. doi:10.1207/s15327906mbr3203

  • Fardo DW, Zhang X, Ding L, He H, Kurowski B, Alexander ES, Mersha TB, Pilipenko V, Kottyan L, Nandakumar K, Martin L (2014) On family-based genome-wide association studies with large pedigrees: observations and recommendations. BMC Proc 8(Suppl 1):S26

  • Ferreira MAR, Purcell SM (2009) A multivariate test of association. Bioinformatics 25(1):132–133. doi:10.1093/bioinformatics/btn563

  • Furlotte NA, Eskin E (2015) Efficient multiple-trait association and estimation of genetic correlation using the matrix-variate linear mixed model. Genetics 200(1):59–68. doi:10.1534/genetics.114.171447

  • Grice JW (2001) Computing and evaluating factor scores. Psychol Methods 6(4):430–450

  • Hyde CL, Nagle MW, Tian C, Chen X, Paciga SA, Wendland JR, Winslow AR (2016) Identification of 15 genetic loci associated with risk of major depression in individuals of european descent. Nat Genet 48(9):1031–1036. doi:10.1038/ng.3623

  • Johnson DR, Creech JC (1983) Ordinal measures in multiple indicator models: a simulation study of categorization error. Am Soc Rev 48:398–407

  • Joreskog KG, Sorbom D (1989) LISREL 7: a guide to the program and applications, 2nd edn. SPSS Inc, Chicago

  • Joreskog KG, Sorbom D (1993) New features in prelis 2. Scientific Software International, Chicago

  • Joreskog KG, Sorbom D (1996) Lisrel 8 users reference guide. Scientific Software International, Chicago

  • Joreskog KG, Sorbom D (1996) LISREL 8 users reference guide. Scientific Software Inc, Mooresville

  • Joreskog KG, Sorbom D (2001) LISREL 8: new statistical features. Scientific Software Inc, Mooresville

  • Kent JW, Peterson CP, Dyer TD, Almasy L, Blangero J (2009) Genome-wide discovery of maternal effect variants. BMC Proc 9(Suppl 7):S19

  • Kessler RC, Chiu WT, Demler O, Walters EE (2005) Prevalence, severity, and comorbidity of twelve-month DSM-IV disorders in the national comorbidity survey replication (NCS-R). Arch Gen Psychiatry 62(6):617–627

  • Klei L, Luca D, Devlin B, Roeder K (2008) Pleiotropy and principal components of heritability combine to increase power for association analysis. Genet Epidemiol 32(1):9–19. doi:10.1002/gepi.20257

  • Krueger RF (1999) The structure of common mental disorders. Arch Gen Psychiatry 56(10):921–926

  • Lai K (2011) Abstract: sample size planning for latent curve models. Multivar Behav Res 46(6):1013. doi:10.1080/00273171.2011.636705

  • Laird NM (2011) Family-based association test (FBAT). Wiley, Hoboken

  • Li CH (2015) Confirmatory factor analysis with ordinal data: comparing robust maximum likelihood and diagonally weighted least squares. Behav Res Methods. doi:10.3758/s13428-015-0619-7

  • Lips EH, Gaborieau V, McKay JD, Chabrier A, Hung RJ, Boffetta P, Brennan P (2010) Association between a 15q25 gene variant, smoking quantity and tobacco-related cancers among 17 000 individuals. Int J Epidemiol 39(2):563–577. doi:10.1093/ije/dyp288

  • Little RJ, Rubin DB (1989) The analysis of social science data with missing values. Sociol Methods Res 18:292–326

  • Liu JZ, Tozzi F, Waterworth DM, Pillai SG, Muglia P, Middleton L, Marchini J (2010) Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nat Genet 42(5):436–440. doi:10.1038/ng.572

  • MacCallum RC, Hong S (1997) Power analysis in covariance structure modeling using GFI and AGFI. Multivar Behav Res 32(2):193–210. doi:10.1207/s15327906mbr3202

  • Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39(7):906–913. doi:10.1038/ng2088

  • McArdle JJ, Boker SM (1990) Rampath path diagram software. Data Transforms Inc, Denver

  • McArdle JJ, McDonald RP (1984) Some algebraic properties of the reticular action model for moment structures. Br J Math Stat Psychol 37:234–251

  • Medland SE, Neale MC (2010) An integrated phenomic approach to multivariate allelic association. Eur J Hum Genet 18(2):233–239. doi:10.1038/ejhg.2009.133

  • Medland SE, Nyholt DR, Painter JN, McEvoy BP, McRae AF, Zhu G, Martin NG (2009) Common variants in the trichohyalin gene are associated with straight hair in Europeans. Am J Hum Genet 85(5):750–755. doi:10.1016/j.ajhg.2009.10.009

  • Mehta PD, Neale MC, Flay BR (2004) Squeezing interval change from ordinal panel data: latent growth curves with ordinal outcomes. Psychol Methods 9(3):301–333

  • Meyer K, Tier B (2012) SNP snappy: a strategy for fast genome-wide association studies fitting a full mixed model. Genetics 190(1):275–277. doi:10.1534/genetics.111.134841

  • Miles J (2003) A framework for power analysis using a structural equation modelling procedure. BMC Med Res Methodol 3:27. doi:10.1186/1471-2288-3-27

  • Mindrila D (2010) Maximum likelihood (ml) and diagonally weighted least squares (DWLS) estimation procedures: a comparison of estimation bias with ordinal and multivariate non-normal data. Int J Digital Soc 1(1):60–66

  • Muhleisen TW, Leber M, Schulze TG, Strohmaier J, Degenhardt F, Treutlein J et al (2014) Genome-wide association study reveals two new risk loci for bipolar disorder. Nat Commun 5:3339. doi:10.1038/ncomms4339

  • Muthen B (1984) A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika 49:115–132

  • Nakamura K, Suwaki H, Matsuo Y, Ichikawa Y, Miyatake R, Iwahashi K (1995) Association between alcoholics and the genotypes of ALDH2, ADH2, ADH3 as well as P-4502E1. Arukoru Kenkyuto Yakubutsu Ison 30:33–42

  • Neale MC (1994) Mx: statistical modeling, 2nd edn. Medical College of Virginia, Richmond

  • Neale MC, Hunter MD, Pritikin JN, Zahery M, Brick TR, Kirkpatrick RM et al (in press) OpenMx 2.0: extended structural equation and statistical modeling. Psychometrika

  • Neale MC, McArdle JJ (2000) Structured latent growth curves for twin data. Twin Res 3(3):165–177

  • Okbay A, Baselmans BML, De Neve J-E, Turley P, Nivard MG, Fontana MA, Cesarini D (2016) Genetic variants associated with subjective well-being, depressive symptoms, and neuroticism identified through genome-wide analyses. Nat Genet 48(6):624–633. doi:10.1038/ng.3552

  • O'Reilly PF, Hoggart CJ, Pomyen Y, Calboli FCF, Elliott P, Jarvelin M-R, Coin LJM (2012) MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS ONE 7(5):e34861. doi:10.1371/journal.pone.0034861

  • Paltoo DN, Rodriguez LL, Feolo M, Gillanders E, Ramos EM, Rutter JL et al; National Institutes of Health Genomic Data Sharing Governance Committees (2014) Data use under the NIH GWAS data sharing policy and future directions. Nat Genet 46(9):934–938. doi:10.1038/ng.3062

  • Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Sham PC (2007) Plink: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575. doi:10.1086/519795

  • R Development Core Team (2008) R: a language and environment for statistical computing [Computer software manual]. Vienna, Austria. http://www.R-project.org (ISBN 3-900051-07-0)

  • Saccone NL, Saccone SF, Hinrichs AL, Stitzel JA, Duan W, Pergadia ML, Bierut LJ (2009) Multiple distinct risk loci for nicotine dependence identified by dense coverage of the complete family of nicotinic receptor subunit (CHRN) genes. Am J Med Genet B Neuropsychiatr Genet 150B(4):453–466. doi:10.1002/ajmg.b.30828

  • Schizophrenia Psychiatric Genome-Wide Association Study (GWAS) Consortium (2011) Genome-wide association study identifies five new schizophrenia loci. Nat Genet 43(10):969–976. doi:10.1038/ng.940

  • Servin B, Stephens M (2007) Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 3(7):e114. doi:10.1371/journal.pgen.0030114

  • Smith DJ, Escott-Price V, Davies G, Bailey MES, Colodro-Conde L, Ward J et al (2016) Genome-wide analysis of over 106 000 individuals identifies 9 neuroticism-associated loci. Mol Psychiatry 21(11):1644. doi:10.1038/mp.2016.177

  • Stephens M (2013) A unified framework for association analysis with multiple related phenotypes. PLoS ONE 8(7):e65245. doi:10.1371/journal.pone.0065245

  • van der Sluis S, Posthuma D, Dolan CV (2013) Tates: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet 9(1):e1003235. doi:10.1371/journal.pgen.1003235

  • Venables WN, Ripley BD (2002) Modern applied statistics with s, 4th edn. Springer, New York (ISBN 0-387-95457-0)

  • Visscher PM, Brown MA, McCarthy MI, Yang J (2012) Five years of gwas discovery. Am J Hum Genet 90(1):7–24. doi:10.1016/j.ajhg.2011.11.029

  • Whitfield JB, Nightingale BN, Bucholz KK, Madden PAF, Heath AC, Martin NG (1998) ADH genotypes and alcohol use and dependence in europeans. Alcoholism 22:1463–1469

  • Wolf EJ, Harrington KM, Clark SL, Miller MW (2013) Sample size requirements for structural equation models: an evaluation of power, bias, and solution propriety. Educ Psychol Meas 73(6):913–934. doi:10.1177/0013164413495237

  • Zhou X, Stephens M (2012) Genome-wide efficient mixed-model analysis for association studies. Nat Genet 44(7):821–824. doi:10.1038/ng.2310

  • Zhou X, Stephens M (2014) Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat Methods 11(4):407–409. doi:10.1038/nmeth.2848

Acknowledgements

An earlier draft of this paper was presented at the 44th meeting of the Behavioral Genetics Association in Charlottesville, Virginia, June 18 to June 21, 2014. We would like to thank the conference attendees for their suggestions to improve the paper.

Funding

This study was supported by NIDA Grants R01DA-025109, R01DA-024413, R01DA-018673 and R25DA-26119.

Author information

Corresponding author

Correspondence to Brad Verhulst.

Ethics declarations

Conflict of interest

Brad Verhulst, Hermine H. Maes, and Michael C. Neale declare that they have no conflict of interest.

Statement of Human and Animal Rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

For this type of study formal consent is not required.

Additional information

Edited by Sarah Medland.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 30 KB)

Appendix 1: syntax and application

A supplementary goal of GW-SEM is to provide a user-friendly set of commands that researchers who are not dedicated data analysts can use effectively. To demystify the process, this section explains the use of each of the principal functions.

The first step in the analysis is to calculate the SNP-invariant covariances. These calculations are conducted using the facCov() function:

facCov( dataset, VarNames, covariates)

where dataset is a data frame in R, VarNames is a list of the variable names of the items, and covariates is a list of covariates. The function returns covariances, weights, and standard errors of all of the variances, covariances, means and thresholds for all of the items and covariates. Because this function runs quickly (even for a relatively large number of items) and is necessary for all subsequent functions, it is called directly by the other functions. Users can also call it themselves to check that their data are properly organized and that none of the variables they plan to include in their analyses has any peculiarities.
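As a rough illustration of what "SNP-invariant" means here, the sketch below computes the item and covariate covariance matrix once with base R's cov(). It does not call facCov() and it omits the weights, standard errors, and thresholds that facCov() returns; the simulated data frame and its column names are invented for the example.

```r
# Illustrative only: simulate three continuous items and one covariate.
set.seed(1)
n <- 500
factorScore <- rnorm(n)
dataset <- data.frame(
  item1 = 0.8 * factorScore + rnorm(n, sd = 0.6),
  item2 = 0.7 * factorScore + rnorm(n, sd = 0.7),
  item3 = 0.6 * factorScore + rnorm(n, sd = 0.8),
  age   = rnorm(n, mean = 40, sd = 10)
)
VarNames   <- c("item1", "item2", "item3")
covariates <- c("age")

# These moments do not involve any SNP, so they are computed a single time
# and can be reused for every SNP in the scan.
snpInvariantCov <- cov(dataset[, c(VarNames, covariates)])
print(round(snpInvariantCov, 2))
```

Only the SNP-item and SNP-covariate covariances change from one SNP to the next, which is why the remaining steps can be run in batches.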

The second step in the analysis is to estimate the SNP-item and SNP-covariate covariances. These calculations are conducted using the snpCovs() function:

snpCovs(FacModelData, vars, covariates, SNPdata, output, zeroOne, runs, inc, start)

where FacModelData is the path to the text file with the item and covariate data, vars is a list of items, covariates is a list of covariates, SNPdata is the path to the text file with the SNP values, output is the prefix for the output files, zeroOne is a logical value indicating whether the first and second thresholds should be fixed at 0 and 1 (freeing up parameters to estimate the mean and the variance following the liability-threshold model; Mehta et al. 2004), runs is the number of batches of SNPs to be analyzed, inc is the number of SNPs included in each batch, and start is the column in the SNP file of the first SNP to be sampled. The output from this function is saved in three separate files, named using the output argument: the covariances, the weights, and the standard errors.
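The interaction of runs, inc, and start can be pictured with a small helper. This is a sketch under an assumption the text does not spell out, namely that batches are consecutive and non-overlapping, so batch r begins at column start + (r - 1) * inc. The function snpBatchColumns() is hypothetical and is not part of GW-SEM.

```r
# Hypothetical helper (not part of GW-SEM): list the SNP-file columns that
# each batch would cover, ASSUMING batches are consecutive and non-overlapping.
snpBatchColumns <- function(runs, inc, start) {
  lapply(seq_len(runs), function(r) {
    first <- start + (r - 1) * inc
    first:(first + inc - 1)
  })
}

batches <- snpBatchColumns(runs = 3, inc = 100, start = 7)
# Under that assumption, the batches cover columns 7-106, 107-206, 207-306.
print(sapply(batches, range))
```

Under this reading, runs * inc SNPs are processed in total, so a job can be split across machines by giving each one a different start value.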

The final step fits the SEM using the gwasDWLS() function:

gwasDWLS(itemData, snpCov, snpWei, VarNames, covariates, runs, output, inc)

where itemData is the path to the text file with the items and covariates, snpCov is the path to the text file with the covariances between the SNPs and the items and covariates (calculated in the previous step), snpWei is the path to the text file with the weights, VarNames is a list of items, covariates is a list of covariates, runs is the number of batches of SNPs to be analyzed, output is the file name for the output file, and inc is the number of SNPs included in each batch. Due to identification restrictions, users must supply at least three items (indicators) for the latent factor. There is no minimum or maximum for the number of covariates that can be included in the analysis. Note that with these two lines of R code (the snpCovs() call and the gwasDWLS() call), it is possible to conduct the one-factor GWAS.
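To make the estimator concrete, the following sketch fits a plain one-factor model by DWLS in base R. It is not the gwasDWLS() implementation: it leaves out the SNP and the covariates, uses a population covariance matrix with invented loadings, and assumes normal-theory sampling variances for the weights. It only shows the general form of the discrepancy function, a weighted sum of squared differences between observed and model-implied moments.

```r
# Population covariance implied by a one-factor model with four items
# (values invented for the illustration; factor variance fixed at 1).
trueLoadings  <- c(0.8, 0.7, 0.6, 0.5)
trueResiduals <- c(0.36, 0.51, 0.64, 0.75)
S <- tcrossprod(trueLoadings) + diag(trueResiduals)

# Unique elements of a covariance matrix (lower triangle incl. diagonal).
vech <- function(m) m[lower.tri(m, diag = TRUE)]

# ASSUMED diagonal weights: inverse normal-theory sampling variances of the
# covariances, var(s_ij) ~ (s_ii * s_jj + s_ij^2) / N.
N <- 1000
w <- 1 / vech((tcrossprod(diag(S)) + S^2) / N)

# DWLS discrepancy: weighted sum of squared differences between observed and
# model-implied moments. Residual variances are kept positive via exp().
dwlsFit <- function(par) {
  lambda  <- par[1:4]
  implied <- tcrossprod(lambda) + diag(exp(par[5:8]))
  sum(w * (vech(S) - vech(implied))^2)
}

fit <- optim(c(rep(0.5, 4), rep(log(0.5), 4)), dwlsFit, method = "BFGS",
             control = list(maxit = 1000, reltol = 1e-12))
estLoadings  <- fit$par[1:4]
estResiduals <- exp(fit$par[5:8])
print(round(estLoadings, 3))
```

Because only the diagonal of the weight matrix is used, no large matrix has to be inverted for each SNP, which is one reason a least squares estimator of this kind is attractive at genome-wide scale.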

The next SEM is the residuals model. The syntax to fit the residuals model is:

snpCovs(FacModelData, vars, covariates, SNPdata, output, zeroOne, runs, inc, start)

resDWLS(itemData, snpCov, snpWei, VarNames, covariates, resids, factor, runs, output, inc)

As can be seen above, the only two arguments that differ from the one-factor model are resids, which is a list of the items to be regressed on the SNPs, and factor, which is a logical value indicating whether the latent factor is also to be regressed on the SNPs. The other arguments operate in exactly the same way as with gwasDWLS(). Further, the snpCovs() call is the same for both gwasDWLS() and resDWLS(), so additional analyses can be conducted with minimal additional steps. Again, at least three items are required in order to provide an identified factor model.

The third model in the package is the two-factor SEM. The syntax to run the two-factor GWAS is:

snpCovs(FacModelData, vars, covariates, SNPdata, output, zeroOne, runs, inc, start)

twofacDWLS(itemData, snpCov, snpWei, f1Names, f2Names, covariates, runs, output, inc)

Again, the snpCovs() call is identical to that of the previous GWAS models, and the only change in arguments from gwasDWLS() to twofacDWLS() is the addition of f1Names and f2Names, which are lists of the variable names that load on Factor 1 and Factor 2, respectively. For generality, these lists need not be mutually exclusive, but at least three items must be specified for each factor, and each factor must have at least one item that does not load on the other factor.
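The item-list rules for the two-factor model are easy to check before starting a long run. The helper below is a hypothetical sketch, not a GW-SEM function, and the item names are invented; it simply encodes the two conditions stated above.

```r
# Hypothetical helper (not part of GW-SEM): check the item-list rules stated
# above before passing f1Names and f2Names to twofacDWLS().
checkTwoFactorItems <- function(f1Names, f2Names) {
  enoughItems  <- length(unique(f1Names)) >= 3 && length(unique(f2Names)) >= 3
  f1HasOwnItem <- length(setdiff(f1Names, f2Names)) >= 1
  f2HasOwnItem <- length(setdiff(f2Names, f1Names)) >= 1
  enoughItems && f1HasOwnItem && f2HasOwnItem
}

# Overlapping lists are allowed as long as each factor keeps a unique item.
checkTwoFactorItems(c("a", "b", "c"), c("c", "d", "e"))  # TRUE
# Identical lists leave neither factor with an item of its own.
checkTwoFactorItems(c("a", "b", "c"), c("a", "b", "c"))  # FALSE
```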

The last model included in the software is the latent growth model (LGM), depicted in Fig. 1d. The syntax for the LGM GWAS is:

snpCovs(FacModelData, vars, covariates, SNPdata, output, zeroOne, runs, inc, start)

growDWLS(itemData, snpCov, snpWei, VarNames, covariates, quadratic, orthogonal, runs, output, inc)

The snpCovs() call is again the same as described above, except that for categorical data the zeroOne argument should be set to TRUE to facilitate the estimation of the LGM. As the LGM is particularly focused on mean and variance changes, this is an important feature of the covariance model. The only change in the growDWLS() function from the gwasDWLS() function is the inclusion of the quadratic and orthogonal arguments. The quadratic argument is a logical value indicating whether to include a latent quadratic growth parameter. The orthogonal argument is a logical value indicating whether to use orthogonal contrasts instead of the standard growth loadings. The other arguments are exactly the same as in the gwasDWLS() function.
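The difference between the two loading schemes can be sketched in base R. The exact loading values used by growDWLS() are not given in the text, so the helper below is hypothetical and uses conventional choices: time scores 0, 1, 2, ... for the standard loadings, and the orthogonal polynomial contrasts from contr.poly() for the orthogonal ones.

```r
# Hypothetical helper (not part of GW-SEM): build a growth loading matrix for
# nTimes measurement occasions. The exact values GW-SEM uses are not given in
# the text, so these are conventional choices, not the package's own.
growthLoadings <- function(nTimes, quadratic = FALSE, orthogonal = FALSE) {
  time <- 0:(nTimes - 1)
  if (orthogonal) {
    polyContrasts <- contr.poly(nTimes)   # orthogonal polynomial contrasts
    slope <- as.numeric(polyContrasts[, 1])
    quad  <- as.numeric(polyContrasts[, 2])
  } else {
    slope <- time                         # "standard" loadings 0, 1, 2, ...
    quad  <- time^2
  }
  loadings <- cbind(intercept = 1, slope = slope)
  if (quadratic) loadings <- cbind(loadings, quadratic = quad)
  loadings
}

standard   <- growthLoadings(4, quadratic = TRUE)
orthogonal <- growthLoadings(4, quadratic = TRUE, orthogonal = TRUE)
print(standard)
print(round(crossprod(orthogonal), 10))  # off-diagonal entries are zero
```

With the standard loadings the slope and quadratic columns are strongly correlated, whereas the orthogonal contrasts keep the growth components uncorrelated by construction.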

Cite this article

Verhulst, B., Maes, H.H. & Neale, M.C. GW-SEM: A Statistical Package to Conduct Genome-Wide Structural Equation Modeling. Behav Genet 47, 345–359 (2017). https://doi.org/10.1007/s10519-017-9842-6

  • DOI: https://doi.org/10.1007/s10519-017-9842-6