Abstract
Proprietary genetic datasets are valuable for boosting the statistical power of genome-wide association studies (GWASs), but their use can restrict investigators from publicly sharing the resulting summary statistics. Although researchers can resort to sharing downsampled versions that exclude restricted data, downsampling reduces power and might change the genetic etiology of the phenotype being studied. These problems are further complicated when using multivariate GWAS methods, such as genomic structural equation modeling (Genomic SEM), that model genetic correlations across multiple traits. Here, we propose a systematic approach to assess the comparability of GWAS summary statistics that include versus exclude restricted data. Illustrating this approach with a multivariate GWAS of an externalizing factor, we assessed the impact of downsampling on (1) the strength of the genetic signal in univariate GWASs, (2) the factor loadings and model fit in multivariate Genomic SEM, (3) the strength of the genetic signal at the factor level, (4) insights from gene-property analyses, (5) the pattern of genetic correlations with other traits, and (6) polygenic score analyses in independent samples. For the externalizing GWAS, although downsampling resulted in a loss of genetic signal and fewer genome-wide significant loci, the factor loadings and model fit, gene-property analyses, genetic correlations, and polygenic score analyses were robust. Given the importance of data sharing for the advancement of open science, we recommend that investigators who generate and share downsampled summary statistics report these analyses as accompanying documentation to support other researchers’ use of the summary statistics.
Introduction
The success of genome-wide association studies (GWASs) depends on sample size (Abdellaoui et al. 2023). Accordingly, genetics researchers increasingly depend on public–private partnerships that pool data collected by academic researchers, national biobanks, and private companies. For example, the company 23andMe Inc. contributed an astonishing 2.5 million observations to a recent GWAS of height (Yengo et al. 2022). However, to protect their interests, private companies place restrictions on the public sharing of GWAS summary statistics and require a potentially lengthy and burdensome application process for researchers to gain access. In some cases, researchers’ institutions are unwilling to agree to the legal terms set by private companies in their material transfer agreements. These restrictions pose a challenge to scientific transparency and slow the pace of genetic discovery.
To address this challenge, researchers can publicly share downsampled GWAS summary statistics that exclude restricted data (Coleman et al. 2020; Lee et al. 2018; Yengo et al. 2022). This is an imperfect solution, as leaving out a large part of the study sample not only reduces power but can also change the genetic etiology of the trait being studied, potentially leading to substantial differences in downstream analyses (Vlaming et al. 2017). For instance, downsampling could influence estimates of genetic correlations with other traits, associations in polygenic score analyses, and insights from bioannotation analyses. We are aware of only one study that investigated the effects of excluding restricted data from a univariate depression GWAS (Coleman et al. 2020), prior to including them in a meta-analysis of mood disorders. The authors examined the robustness of SNP heritability estimates, genetic correlations, and gene identification. Although they identified fewer variants in the downsampled analyses, results were otherwise similar, suggesting that excluding data in their study did not markedly change the genetic etiology of their focal phenotype. However, most of the studies providing downsampled summary statistics have not evaluated their comparability with restricted-data counterparts (Lee et al. 2018; Liu et al. 2019; Wray et al. 2018).
There have been few, if any, systematic investigations of how downsampling affects results from multivariate GWASs. Multivariate GWAS methods, such as genomic structural equation modeling (Genomic SEM; Grotzinger et al. 2019), have become increasingly popular, as there is substantial genetic overlap across psychiatric and behavioral phenotypes. Genomic SEM models the shared genetic architecture among traits with latent factors representing cross-cutting genetic liabilities. Rather than just examining genetic associations with individual phenotypes, Genomic SEM enables the identification of shared genes. As in phenotypic factor analysis, the construct represented by latent factors could be sensitive to the choice of indicator phenotypes used in the factor analysis, or the construct might be fairly robust to this decision (Johnson et al. 2004, 2008). Using downsampled univariate GWAS summary statistics as inputs in Genomic SEM could, therefore, identify a genetic factor structure that occupies a different position in genetic multivariate space. Yet, no studies to our knowledge have examined how downsampling affects multivariate GWAS in the context of Genomic SEM.
Here, we present a systematic approach to assess the comparability of downsampled summary statistics with their full-data counterparts and examine their suitability for typical follow-up analyses. We used externalizing, a latent factor representing a cross-cutting liability to behaviors and disorders characterized by problems with self-regulation, as our model phenotype. A previous multivariate GWAS by the Externalizing Consortium identified several hundred genomic loci associated with an externalizing (EXT) factor, reflecting shared genetic liability among seven indicator phenotypes (Karlsson Linnér et al. 2021): (1) attention-deficit/hyperactivity disorder (ADHD; Demontis et al. 2019), (2) problematic alcohol use (ALCP; Sanchez-Roige et al. 2019), (3) lifetime cannabis use (CANN; Pasman et al. 2018), (4) reverse-coded age at first sexual intercourse (FSEX; Karlsson Linnér et al. 2019), (5) number of sexual partners (NSEX; Karlsson Linnér et al. 2019), (6) general risk tolerance (RISK; Karlsson Linnér et al. 2019), and (7) lifetime smoking initiation (SMOK; Liu et al. 2019). However, the univariate GWASs on two of the seven phenotypes, SMOK and CANN, contain restricted data, which limits public sharing of the summary statistics from this multivariate GWAS (hereafter, the original study on externalizing).
Therefore, we developed the following six steps to investigate the robustness of results to downsampling and applied them to our scenario of assessing the impact of excluding restricted data from the original study on externalizing (Karlsson Linnér et al. 2021). As an initial check, we recommend that authors who generate and share downsampled summary statistics report whether the genetic correlation between the full and downsampled versions is less than unity, which would suggest an imperfect overlap of GWAS coefficients and genetic etiology. The further the genetic correlation between the full and downsampled GWASs on the same trait falls below unity, the more important it is to evaluate the comparability of downsampled analyses.
We recommend that investigators who share downsampled summary statistics generated with multivariate GWAS methods (e.g., Genomic SEM) report all six steps as supporting documentation, while steps 2–3 can be skipped when generating downsampled univariate GWAS:

1. What is the loss of genetic signal in downsampled univariate GWASs (which may later be used as indicator phenotypes in Genomic SEM)?
2. How do the factor loadings and factor model fit differ in multivariate Genomic SEM when the indicator phenotypes are downsampled univariate GWASs?
3. What is the loss of genetic signal at the factor level of multivariate GWAS when the indicator phenotypes are downsampled univariate GWASs?
4. How similar are gene-property analyses when using downsampled GWASs?
5. How similar is the pattern of genetic correlations with other traits when using downsampled GWASs?
6. How much explanatory power is lost when using polygenic scores (PGSs) constructed from downsampled GWASs?
Methods
The code is publicly available here: https://github.com/Camzcamz/EXTminus23andMe, and the GWAS summary statistics on externalizing that excluded restricted data from 23andMe (“EXTminus23andMe”) are available here: https://externalizing.rutgers.edu/requestdata/.

1. What is the loss of genetic signal in downsampled univariate GWASs?
The following five key indicators are useful for evaluating the loss of genetic signal in downsampled univariate GWASs: (1) effective sample size (EffN), (2) heritability, (3) mean χ^{2}, (4) genomic inflation factor, and (5) the LD Score regression attenuation/stratification bias ratio (see formula in Table 1). EffN is a transformation relevant for GWAS on binary traits that converts an unbalanced number of cases and controls into the sample size of an equivalent balanced analysis (i.e., 50% cases). For a meta-analysis of k cohort-level univariate summary statistics, EffN is the sum over cohorts of \({EffN}_{k}=4\times {V}_{k}(1-{V}_{k}){N}_{k}\), where \({V}_{k}\) is the cohort-specific proportion of cases, and \({N}_{k}\) is the total number of cases and controls in cohort k. For GWAS on continuous traits, EffN can be replaced by the total sample size (N). The remaining four key indicators are standard estimates of LD Score regression (version 1.0.1; Bulik-Sullivan et al. 2015).
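The indicators that do not require LD Score regression itself can be illustrated with a short sketch. It is not the code used in this study: the function names are hypothetical, the input is assumed to be per-cohort case and control counts and per-SNP Z statistics, and the ratio is written using the standard LD Score regression definition, (intercept − 1)/(mean χ^{2} − 1), because Table 1 is not reproduced here.

```python
from statistics import mean, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def effective_n(cohorts):
    """Sum 4 * V_k * (1 - V_k) * N_k over cohorts given as (n_cases, n_controls)."""
    total = 0.0
    for n_cases, n_controls in cohorts:
        n_k = n_cases + n_controls      # total sample size of cohort k
        v_k = n_cases / n_k             # proportion of cases in cohort k
        total += 4 * v_k * (1 - v_k) * n_k
    return total

def mean_chi2(z_scores):
    """Mean chi-square statistic, computed from per-SNP Z statistics."""
    return mean(z * z for z in z_scores)

def genomic_inflation(z_scores):
    """Genomic inflation factor: median chi-square over its expected null median."""
    return median(z * z for z in z_scores) / CHI2_1DF_MEDIAN

def attenuation_ratio(ldsc_intercept, mean_chi2_value):
    """LD Score regression ratio (assumed standard definition, see lead-in)."""
    return (ldsc_intercept - 1) / (mean_chi2_value - 1)
```

A balanced cohort leaves the sample size unchanged (`effective_n([(500, 500)])` equals 1000), whereas a cohort with 10% cases is shrunk to 36% of its nominal size.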
In addition to evaluating the loss of genetic signal, we recommend three checks to examine concordance in GWAS coefficients (\(\beta\)), which should preferably be applied to near-independent SNPs (Step 3 explains a standard pruning procedure to find near-independence). If correlated SNPs are included, larger LD blocks will be given more weight. Depending on the power of the full-data GWAS, the checks could be applied only to genome-wide significant hits or could be expanded to a less stringent threshold (say, P < 1 × 10^{–5}). The three checks are to (1) test for sign concordance, (2) inspect for outliers, and (3) run a regression of the downsampled GWAS coefficients on their full-data counterparts (as absolute values).
Sign concordance can be evaluated by reporting the proportion of SNPs that have a concordant direction of effect or by performing a binomial test. The binomial test requires an assumed null hypothesis of the true probability of success, which we set to 99% to make the test sensitive enough to detect minor deviations from near-perfect concordance (100% is too sensitive, as a single discordant observation will reject the null). Power calculations show that 150 independent SNPs provide ≥ 80% power to reject this null even if the true, imperfect concordance is as high as 95%. To detect outliers, we suggest evaluating whether the downsampled GWAS coefficients fall outside the 95% confidence intervals of their full-data counterparts. If outliers are detected, then we recommend adding an extra indicator column to the downsampled summary statistics to allow its users to filter out SNPs with deviating downsampled GWAS coefficients. The regression analysis of the downsampled coefficients on the full-data coefficients should investigate whether (a) the intercept is zero, (b) the regression coefficient is unity (i.e., a diagonal line), and (c) the adjusted coefficient of determination (adj. R^{2}) is high. These checks are applicable to both univariate and multivariate GWAS (thus, also in Step 3). Here, because we did not generate any downsampled univariate summary statistics to be disseminated, we report these checks only for the downsampled multivariate GWAS on externalizing.
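The three checks, and the power of the sign-concordance test, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis code of this study: the function names are hypothetical, the binomial test is assumed to be an exact one-sided test at α = 0.05 (the text does not specify sidedness or α), the 95% confidence interval is taken as ±1.96 standard errors, and the inputs are paired coefficients for the same set of near-independent SNPs.

```python
from math import comb

def sign_test_power(n_snps, null_conc=0.99, true_conc=0.95, alpha=0.05):
    """Exact one-sided binomial power to reject H0: concordance = null_conc."""
    def upper_tail(c, p):
        # P(X >= c) for X ~ Binomial(n_snps, p), X = number of discordant SNPs.
        return sum(comb(n_snps, k) * p**k * (1 - p)**(n_snps - k)
                   for k in range(c, n_snps + 1))
    # Smallest number of discordant SNPs that rejects H0 at level alpha.
    crit = next(c for c in range(n_snps + 2)
                if upper_tail(c, 1 - null_conc) <= alpha)
    return crit, upper_tail(crit, 1 - true_conc)

def concordance_checks(beta_full, se_full, beta_down):
    """Sign concordance, outlier flags, and regression of |down| on |full|."""
    n = len(beta_full)
    # (1) Share of SNPs with the same direction of effect.
    concordant = sum((bf > 0) == (bd > 0)
                     for bf, bd in zip(beta_full, beta_down)) / n
    # (2) Downsampled estimates outside the 95% CI of the full-data estimate.
    outliers = [abs(bd - bf) > 1.96 * se
                for bf, se, bd in zip(beta_full, se_full, beta_down)]
    # (3) Ordinary least squares on absolute values.
    x = [abs(b) for b in beta_full]
    y = [abs(b) for b in beta_down]
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    adj_r2 = 1 - (ss_res / (n - 2)) / (ss_tot / (n - 1))
    return concordant, outliers, intercept, slope, adj_r2
```

The list of outlier flags returned by `concordance_checks` corresponds to the extra indicator column suggested above. Standard errors and P values for the intercept and slope are omitted from the sketch and would come from a regular regression routine.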
To generate a downsampled version of the multivariate GWAS on externalizing, we first downsampled the univariate GWASs of SMOK and CANN by mirroring the meta-analysis protocol of the original study (Karlsson Linnér et al. 2021) while excluding restricted 23andMe data. We then used the five key indicators to assess the loss of genetic signal in the downsampled univariate GWASs. Finally, we estimated genetic correlations among the seven indicator phenotypes in the downsampled analysis using LD Score regression (Bulik-Sullivan et al. 2015) and compared them to genetic correlations among the indicator phenotypes in the original study.
Stable heritability estimates and attenuation ratios across the original and downsampled indicators should yield comparable factor loadings in the downsampled Genomic SEM factor analysis (Step 2), whereas loss of genetic signal, indicated by a decrease in mean χ^{2}, should yield larger standard errors in the factor analysis and loss of statistical power to detect SNP effects in the multivariate GWAS (Step 3).

2. How do the factor loadings and factor model fit differ in Genomic SEM when the indicator phenotypes are downsampled univariate GWASs?
Genomic SEM is a flexible modeling approach that (1) estimates an empirical genetic covariance matrix and sampling covariance matrix from input GWAS summary statistics, and (2) evaluates a set of conventional parameters for structural equation modeling, such as factor loadings and residual variances, to minimize the discrepancy between the model-implied and empirical genetic covariance matrices (Grotzinger et al. 2019). Typically, several alternative models are compared (e.g., a single-factor model versus a two-factor model) followed by multivariate GWAS to estimate SNP effects on each of the factors in the preferred factor solution (Step 3).
To assess the impact of downsampling on the factor loadings and model fit, we suggest forcing the best-fitting factor solution from the Genomic SEM analysis of the full dataset (that includes restricted data) onto the empirical genetic covariance matrix of the downsampled summary statistics, and then evaluating the stability of the factor loadings and factor model fit indicators (e.g., the comparative fit index or the root mean square residual). We do not suggest searching for a better factor solution with the downsampled indicators because the aim is to evaluate whether downsampled analyses are representative of their corresponding versions with restricted data.
Thus, we ran the best-fitting Genomic SEM factor model of the original study (Karlsson Linnér et al. 2021): a single-factor model with seven indicator phenotypes (ADHD, ALCP, CANN, FSEX, NSEX, RISK, and SMOK), using unit variance identification of the factor model without SNP effects. However, in the analysis reported here, the input summary statistics for SMOK and CANN were replaced by downsampled versions (see Step 1). We refer to the original factor model based on analyses with 23andMe data as the EXT factor and the downsampled version as the EXTminus23andMe factor.
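For orientation, the structure that is held fixed in this step can be written out explicitly. The notation below (\(\lambda\), \(\Theta\), \(F\)) is introduced here for illustration only and is not taken from the original study; it is the generic form of a single-factor model with unit variance identification and seven indicators.

```latex
\[
\Sigma(\lambda,\Theta)=\lambda\lambda^{\top}+\Theta,\qquad
\lambda=(\lambda_{1},\ldots,\lambda_{7})^{\top},\qquad
\Theta=\operatorname{diag}(\theta_{1},\ldots,\theta_{7}),\qquad
\operatorname{Var}(F)=1
\]
```

Under this reading, \(\lambda\) collects the seven factor loadings, \(\Theta\) the residual variances, and \(\Sigma\) is the model-implied genetic covariance matrix that is compared with the empirical genetic covariance matrix estimated from the downsampled summary statistics.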

3. What is the loss of genetic signal at the factor level of downsampled multivariate GWAS when the indicator phenotypes are downsampled univariate GWASs?
After conducting a multivariate GWAS on the latent factors in downsampled analyses with Genomic SEM, the loss of genetic signal at the factor level can be assessed by (i) examining the genetic correlation between the respective latent factors of the full and downsampled summary statistics using bivariate LD Score regression (Bulik-Sullivan et al. 2015) and by (ii) estimating the decrease in genetic signal with key indicators (1), (3), and (4) from Step 1. Please note that key indicators (2) and (5) are not used to evaluate the genetic signal of the latent factor because they are not clearly defined (e.g., heritability is defined as a ratio with phenotypic variance as the denominator, which is absent in latent genetic factors).
To generalize the loss of statistical power to identify individual SNP effects, we need to make assumptions about their magnitude. One approach is to compute the squared standardized coefficients,^{Footnote 1} approximated as \({r}^{2}={Z}^{2}/N\), and then evaluate the median among the subset of genome-wide significant SNPs (P < 5 × 10^{–8}) in the downsampled GWAS. Given that statistical power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true, it can be computed as \(1-{CDF}_{\lambda }[{\chi }_{1}^{2}(c)]\), where \({CDF}_{\lambda }\) is the cumulative distribution function for a χ^{2} distribution with 1 degree of freedom and the noncentrality parameter \(\lambda =N{r}^{2}\). The sample size, \(N\), is set to the EffN of the summary statistics being evaluated. The term \({\chi }_{1}^{2}(c)\) is the critical value (~29.7) at the threshold of genome-wide significance (P < 5 × 10^{–8}) for a χ^{2} test with 1 degree of freedom. As a complement, we suggest evaluating the power to detect arbitrary effect-size magnitudes, for which we selected three magnitudes representative of effects reaching genome-wide significance in recent large-scale GWAS (\({r}^{2}=\) 0.003%, 0.004%, or 0.005%). Because power loss is more noticeable at the level of individual SNPs compared to methods that aggregate genetic signal among sets of SNPs or genome-wide, we recommend that researchers interested in following up on individual SNPs use the original and not the downsampled summary statistics for best precision.
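The power formula can be evaluated without a dedicated noncentral χ^{2} routine. The sketch below is illustrative rather than the code used in this study, and its function names are hypothetical. It relies on a standard identity: a noncentral χ^{2} variable with 1 degree of freedom and noncentrality \(\lambda\) is distributed as \((Z+\sqrt{\lambda })^{2}\) for a standard normal \(Z\), so the tail probability reduces to two normal tail probabilities.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def critical_chi2(p_threshold=5e-8):
    """Critical value of a 1-df chi-square test, via the squared normal quantile."""
    return _STD_NORMAL.inv_cdf(p_threshold / 2) ** 2

def gwas_power(r2, n, p_threshold=5e-8):
    """Power = 1 - CDF_lambda[critical value], with lambda = n * r2 and 1 df."""
    root_ncp = sqrt(n * r2)                          # sqrt of the noncentrality
    root_crit = sqrt(critical_chi2(p_threshold))     # sqrt of the critical value
    # (Z + sqrt(lambda))^2 exceeds the critical value in either tail of Z.
    return (_STD_NORMAL.cdf(root_ncp - root_crit)
            + _STD_NORMAL.cdf(-root_ncp - root_crit))
```

Here `r2` is a proportion, so an effect of 0.004% is passed as `0.004 / 100`, and `n` is the EffN of the summary statistics being evaluated. Calling the function with the full-data and downsampled EffN for the same `r2` and taking the difference gives the loss of power in percentage points.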
As in the original study (Karlsson Linnér et al. 2021), we estimated individual SNP effects on the latent EXTminus23andMe factor with Genomic SEM, which we refer to as the EXTminus23andMe summary statistics. We then evaluated the loss of signal at the factor level. We expect the loss of power to be more noticeable at the level of individual loci compared to the follow-up analyses presented below, which aggregate genetic signal across larger sets of SNPs or genome-wide. Lastly, we examined the concordance of GWAS coefficients on the latent factor per the three-check procedure outlined in Step 1.
Because of its accessibility and ease of use, we recommend using FUMA to find near-independent genome-wide significant “lead SNPs”. FUMA conducts conventional linkage-disequilibrium (LD) informed pruning (“clumping”). The default settings are sensible to use in most cases. FUMA computes LD with the publicly available European subsample of the 1000 Genomes Phase 3 reference panel as the default setting (though researchers should depart from this default to match the genetic ancestry of the summary statistics being evaluated). The default settings largely overlap with those of the original study on EXT (importantly, the LD r^{2} threshold of 0.1 to define lead SNPs is identical), though the original study used a larger restricted-access reference panel that combined the 1000 Genomes Phase 3 reference panel with other reference data.

4. How similar are gene-property analyses when using downsampled GWASs?
The biological correspondence of downsampled univariate or multivariate GWAS can be evaluated by comparing the results from the Multi-marker Analysis of GenoMic Annotation (MAGMA) gene-property analyses in the SNP2GENE function of the Functional Mapping and Annotation of Genome-Wide Association Studies software (FUMA, version 1.5.0e; Watanabe et al. 2017), using Spearman rank correlations of point estimates.
As done in the original paper, we ran gene-property analyses on the EXTminus23andMe summary statistics to (1) test 54 tissue-specific gene expression profiles, and (2) test gene expression profiles across 11 brain tissues and developmental stages with reference data from BrainSpan (Allen Institute for Brain Science 2022). We used the default settings of SNP2GENE, which match those used to conduct the gene-based analyses reported in the original study (Tables S3 and S4).
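The rank-correlation comparison used in this and the following steps can be sketched in a few lines. This is an illustrative implementation, not the code of this study: the function names are hypothetical, and the two inputs are assumed to be point estimates for the same annotations (or traits, or outcomes) listed in the same order, once from the full-data and once from the downsampled analysis.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because the statistic depends only on the ordering of the estimates, it captures whether the same annotations rank highest in both analyses, but it is insensitive to a uniform attenuation of effect sizes.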

5. How similar is the pattern of genetic correlations with other traits when using downsampled GWASs?
To assess the convergent and discriminant validity of downsampled multivariate GWAS on latent factors, we can examine potential changes in the pattern of genetic correlation with other traits. If the downsampled analysis tags the same genetic etiology, the confidence intervals of the point estimates should display considerable overlap. The overall pattern can be examined by estimating the rank correlation of the point estimates across traits, whereas the significance of changes to individual genetic correlations can be assessed using a t-test.
The original study estimated genetic correlations between EXT and 91 other traits (Karlsson Linnér et al. 2021). Here, we performed the same analysis for EXTminus23andMe and then examined whether the pattern of genetic overlap was preserved after removing restricted data. Since the summary statistics of some of the 91 traits in the original study include restricted data, we conducted these analyses on the 79 traits with publicly available data.

6. How much explanatory power is lost when using polygenic scores (PGSs) constructed from downsampled GWASs?
Generally, the loss of genetic signal from downsampling will exacerbate the problem of measurement error in PGSs constructed with finite-sample estimates as weights (Becker et al. 2021). Because PGS construction is one of the most common third-party applications of publicly available GWAS summary statistics, we strongly encourage researchers to evaluate the loss of explanatory power in their main PGS analysis before they share downsampled summary statistics with other users. This loss can be evaluated (i) across traits, as indicated by the overall reduction in variance explained (R^{2}/pseudo-R^{2}), and (ii) with the rank correlation of point estimates, which captures the comparability of the overall pattern of polygenic score associations.
Following the original study protocol (Karlsson Linnér et al. 2021), we constructed PGSs in two holdout samples: the Collaborative Study on the Genetics of Alcoholism (COGA; Begleiter 1995; Bucholz et al. 2017; Edenberg 2002; N = 7594) and the National Longitudinal Study of Adolescent to Adult Health (Add Health; Harris et al. 2013; McQueen et al. 2015; N = 5107). We constructed the PGSs from the EXTminus23andMe summary statistics (EXTminus23andMe PGS), adjusted for LD with PRS-CS (version 20 October 2019; Ge et al. 2019), which restricts the PGS to ~1 million HapMap3 SNPs. The default settings are sensible for most standard uses (Bayesian gamma-gamma prior of 1 and 0.5, and 1000 Monte Carlo iterations with 500 burn-in iterations).
We compared the explanatory power of the EXTminus23andMe PGSs with that reported in the original study, first for a phenotypic externalizing factor and then for a set of outcomes related to, or affected by, externalizing behaviors and disorders (e.g., smoking initiation, substance-use disorders, or childhood developmental disorders) (Table S6). Linear regression was applied to continuous outcomes and logistic regression to dichotomous outcomes. We evaluated the incremental R^{2}/pseudo-R^{2} by subtracting the variance explained by a baseline model with only covariates (age, sex, and the first ten genetic principal components) from the variance explained by a model with the covariates and PGS. Confidence intervals were estimated with the percentile bootstrap method (1000 iterations). We then evaluated whether the coefficient estimates of the downsampled EXTminus23andMe PGSs were comparable to the estimates of the PGS of EXT from the original paper.
We are aware of recent suggestions to evaluate the squared (semi)partial correlation instead of the incremental R^{2}/pseudo-R^{2}, but the results of these two alternative approaches are often highly similar (except for certain phenotypes, e.g., height). For comparability with the original study, we retained the incremental R^{2}/pseudo-R^{2} measure.
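For a continuous outcome, the incremental R^{2} and its percentile bootstrap interval can be sketched as below. This is a simplified illustration and not the code of this study: the function names are hypothetical, only linear regression is covered (the pseudo-R^{2} from logistic regression is not implemented), and covariates are assumed to be supplied as one list of numeric values per individual.

```python
import random

def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X (an intercept is added)."""
    n = len(y)
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    # Normal equations (A'A) b = A'y as an augmented matrix, solved by
    # Gauss-Jordan elimination with partial pivoting.
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(p)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    beta = [M[r][p] / M[r][r] for r in range(p)]
    fitted = [sum(b * a for b, a in zip(beta, row)) for row in A]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def incremental_r2(covariates, pgs, y):
    """R^2 of covariates plus PGS minus R^2 of the covariates-only baseline."""
    full = [list(c) + [g] for c, g in zip(covariates, pgs)]
    return r_squared(full, y) - r_squared(covariates, y)

def bootstrap_ci(covariates, pgs, y, n_boot=1000, seed=1):
    """95% percentile bootstrap interval, resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(incremental_r2([covariates[i] for i in idx],
                                    [pgs[i] for i in idx],
                                    [y[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]
```

Running the same functions once with the full-data PGS and once with the downsampled PGS, on the same holdout sample and covariates, gives the two quantities whose difference is reported in percentage points.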
Results

1. What is the loss of genetic signal in downsampled univariate GWASs?
In the initial check of genetic overlap between the full and downsampled summary statistics of the same trait, we found genetic correlations close to, but still significantly less than unity: 0.966 (SE = 0.007) for SMOK and 0.953 (SE = 0.012) for CANN,^{Footnote 2} which motivated us to apply our approach to evaluate the comparability of the downsampled summary statistics to those from the original paper.
The loss of genetic signal was evaluated using the five key indicators. First, downsampling reduced the EffN of the two univariate GWASs on SMOK and CANN by about 47% and 12%, respectively (Table 1), which is a marked reduction with potential downstream consequences. However, downsampling did not meaningfully affect either the heritability estimates or the attenuation/stratification bias ratio, which supports the expectation of a comparable factor structure in the multivariate analysis below. Similarly, downsampling did not meaningfully influence the genetic correlations among the seven indicator phenotypes (Fig. 1, Table S1), which increases the likelihood of obtaining a similar factor structure.
Nevertheless, there was a noticeable loss of genetic signal as measured by mean χ^{2} and the genomic inflation factor. The greatest decrease was observed for the downsampled GWAS on SMOK (Δ mean χ^{2} = 2.06 − 3.15 = −1.09; −34.6%), while the decrease for CANN was less pronounced (−1.3%). Similar decreases were observed for the genomic inflation factor: −25.9% and −1.0% for SMOK and CANN, respectively. The overall stability we observed for the heritability estimates and attenuation ratios suggests that the factor loadings in the downsampled Genomic SEM factor analysis will resemble those of the original paper (Step 2). The decrease in genetic signal in SMOK and CANN should translate into larger standard errors in the factor analysis and loss of statistical power to detect SNP effects in the multivariate GWAS of EXTminus23andMe (Step 3).

2. How do the factor loadings and factor model fit differ in multivariate Genomic SEM when the indicator phenotypes are downsampled univariate GWASs?
The factor loadings, residual variances, and model fit statistics were comparable in the downsampled single-factor solution (Fig. 2; Table S2). Neither the factor loadings nor the residual variances were statistically different from the original estimates (a path diagram of the original estimates was therefore omitted). The largest non-significant difference was observed for the factor loading of the indicator phenotype RISK, which increased from 0.54 (SE = 0.03) to 0.56 (SE = 0.03). A similar-sized, non-significant decrease was observed for CANN: from 0.77 (SE = 0.03) to 0.75 (SE = 0.03). Furthermore, the comparative fit index (CFI) and standardized root mean square residual (SRMR) were similar between the downsampled and original factor models and were within the preregistered thresholds for “good fit” (i.e., CFI > 0.9, and SRMR < 0.08) of the original study. In our example, we obtain close to identical factor loadings and model fit when applying the best-fitting factor solution of the original study to the empirical genetic covariance matrix of the downsampled summary statistics.

3. What is the loss of genetic signal at the factor level of multivariate GWAS when the indicator phenotypes are downsampled univariate GWASs?
We estimated a multivariate GWAS of the EXTminus23andMe factor (see Step 2) (Figures S1). The genetic correlation between the summary statistics from the multivariate GWAS of EXT and EXTminus23andMe was strong but significantly less than unity (r_{g} = 0.978, SE = 0.001), which motivated Steps 4–6. The \(EffN\) of the multivariate GWAS of EXTminus23andMe was 1,045,957 (about 70.1% of that on EXT). The mean χ^{2} of the EXT and EXTminus23andMe factors were 3.12 and 2.37, respectively, corresponding to a 24% decrease. The reduction in the genomic inflation factor was similar (–18%). Thus, there was an appreciable loss of genetic signal in the downsampled GWAS of EXTminus23andMe.
The reduction in mean χ^{2} and genomic inflation factor suggested some loss of power to detect SNP effects. Downsampling reduced the power to detect the median of the squared standardized coefficients among the genome-wide significant SNPs (i.e., median r^{2} = 0.0038%) by 17.8 percentage points (pp), and reduced the power to detect the three assumed effect-size magnitudes (\({r}^{2}=\) 0.003%, 0.004%, or 0.005%) by about 5–45 pp (Figures S2 and S3). Therefore, we recommend that users interested in following up on individual genome-wide significant SNPs associated with externalizing prioritize the version with 23andMe data.
Pruning of the summary statistics to find near-independent lead SNPs (using the FUMA default settings) identified 358 lead SNPs for the downsampled EXTminus23andMe, compared to 842 in the full-sample version. (Note that the number of lead SNPs reported here for EXT differs from the 855 reported in the original study because that study used a restricted-access genetic reference panel and somewhat different settings for the pruning parameters.) Thus, downsampling reduced the number of lead SNPs by 57.5%, which could appear problematic. However, the results of the following three checks of the concordance in coefficients (see Step 1) suggested no strong reason for concern (Figure S4). First, all 842 lead SNPs identified in the full-data version had a consistent direction of effect, meaning the null hypothesis of near-perfect sign concordance (99%) could not be rejected (P = 1). Moreover, there was 100% sign concordance among all 130,176 SNPs with P < 1 × 10^{–5} (in the full-data GWAS). Second, we identified only 21 lead SNPs (out of the 842; 2.5%) for which the downsampled coefficient fell outside the 95% confidence interval of the full-data estimate. Among the 130,176 SNPs, we found 2202 such outliers (1.7%). We marked these SNPs in the disseminated summary statistics, but otherwise interpret their small number as unproblematic for the comparability of the downsampled multivariate GWAS. Third, regression analysis of the downsampled coefficients on the full-data estimates with the 842 lead SNPs found an intercept close to zero (~0.0005, P = 0.045), a regression coefficient statistically different from but still near unity (~0.898, P = 5.24 × 10^{–5}), and a high adjusted R^{2} of 0.86. We found similar results for the 130,176 SNPs (reported in Figure S4). The regression results suggest that downsampling induced some, but not marked, attenuation of the coefficients. Overall, these results demonstrate satisfactory concordance for the downsampled multivariate coefficients.

4. How similar are the gene-property analyses when using downsampled GWASs?
We ran gene-property analyses using MAGMA on the EXTminus23andMe summary statistics. The Spearman rank correlation between the point estimates for the 54 tissue-specific gene expression profiles from the downsampled and the restricted-data multivariate GWAS summary statistics was 0.98, suggesting a comparable pattern of gene-tissue expression (Table S3 and Figure S5). The Spearman rank correlation of the point estimates from the MAGMA gene expression profiles across 11 brain tissues and developmental stages also suggested great similarity (r = 0.98) (Table S4 and Figure S6). Furthermore, the same 14 tissues, and three developmental stages, remained significant after Bonferroni correction in the downsampled analysis (Tables S3 and S4). This evaluation showed that, in the case of EXTminus23andMe, the downsampled gene-property analyses led to similar biological insights as those from the original paper (Karlsson Linnér et al. 2021).

5. How similar is the pattern of genetic correlations with other traits when using downsampled GWASs?
We assessed the pattern of genetic correlations of EXTminus23andMe with other traits and found this pattern to be nearly identical to that of the original study (Spearman r ~ 1) (Fig. 3, Table S5). Furthermore, none of the point estimates were statistically different. Thus, in our scenario, downsampling did not meaningfully impact the genetic correlations with other traits, meaning that researchers interested in such analyses can safely proceed with using the downsampled summary statistics.

6.
How much explanatory power is lost when using polygenic scores (PGSs) constructed from downsampled GWASs?
The downsampled PGS for EXTminus23andMe explained 8.4% and 8.5% of the variance of a phenotypic externalizing factor in Add Health and COGA, respectively, which is 1.9 and 0.5 percentage points (pp) less than in the same analysis in the original study (Table S6). The overall reduction in explanatory power across other outcomes was less pronounced, on average 0.35 pp in Add Health and 0.23 pp in COGA. The largest decrease was observed for lifetime smoking initiation with 2.1 pp and 1.7 pp, followed by lifetime cannabis use with 1.1 pp in Add Health (but only 0.55 pp in COGA), which may be explained by these two indicator phenotypes being most affected by the downsampling. For most other traits, the variance explained by the downsampled PGS was comparable to the original study.
In addition, the Spearman rank correlation of the regression coefficients was 0.996, suggesting great similarity in point estimates (Fig. 4). All the coefficients of the downsampled PGS fell within the confidence intervals of their original study counterparts (Table S6), except those for the phenotypic externalizing factor (in Add Health), lifetime smoking initiation, and lifetime cannabis use (in Add Health). Overall, our downsampled polygenic score results were comparable to those from the original study, meaning that researchers interested in using the downsampled summary statistics to construct PGS for EXTminus23andMe can generally expect similar results. However, we recommend that users be aware of the weaker explanatory power for certain outcomes.
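The two PGS comparisons reported here, the loss in variance explained and the check of whether a downsampled coefficient stays within the original confidence interval, amount to simple arithmetic. The following sketch illustrates them with hypothetical function names and made-up toy values; it assumes a symmetric 95% interval built from the original standard error.

```python
def r2_loss(r2_original_pct, r2_downsampled_pct):
    """Return the loss in variance explained in percentage points (pp)
    and as a fraction of the original value."""
    pp = r2_original_pct - r2_downsampled_pct
    return pp, pp / r2_original_pct


def within_ci(estimate, reference, se_reference, z=1.96):
    """True if `estimate` lies inside the reference estimate's 95% CI."""
    return abs(estimate - reference) <= z * se_reference


# Toy values (made up): original PGS explains 10.0%, downsampled PGS 8.0%.
pp_toy, relative_toy = r2_loss(10.0, 8.0)
print(pp_toy, relative_toy)

# Toy coefficients (made up): original 0.30 with SE 0.02.
print(within_ci(0.28, 0.30, 0.02), within_ci(0.25, 0.30, 0.02))
```

Reporting both the percentage-point loss and the relative loss can help readers judge whether a given reduction matters for an outcome whose baseline variance explained is small.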
Discussion
Unrestricted access to data and results is the cardinal tenet of open science. Here, we propose a systematic approach for researchers disseminating GWAS summary statistics with restricted data removed (i) to evaluate the comparability of downsampled GWAS summary statistics with their restricted-data counterparts, and (ii) to assess the impact of using downsampled univariate summary statistics in multivariate GWAS with Genomic SEM. We examined the loss of genetic signal in downsampled univariate GWAS (Step 1), the change in the factor model loadings and fit (Step 2), and the loss of genetic signal at the factor level of downsampled multivariate GWAS (Step 3), as well as potential changes to gene-property analyses (Step 4), the pattern of genetic correlations with other traits (Step 5), and the explanatory power of polygenic score analyses in independent samples (Step 6).
We applied these steps to the largest available multivariate GWAS of externalizing to evaluate the quality and predictive performance of the results following restricted data removal. We found nearly identical model fit and parameter estimates, genetic correlations with other phenotypes, and polygenic score analyses of externalizing phenotypes in independent samples. As expected, we observed a decrease in power and genetic signal in the downsampled univariate and multivariate summary statistics. Although fewer lead SNPs were identified for EXTminus23andMe compared to EXT, the genes associated with EXT and EXTminus23andMe were similar in terms of region and developmental timing of expression. In the PGS context, EXT and EXTminus23andMe performed similarly well. Therefore, while we suggest that the downsampled summary statistics may be used in analyses related to gene enrichment, genetic correlations, or polygenic scores, the summary statistics with restricted data should be prioritized for gene identification or to follow up on genome-wide significant hits. Prioritizing the restricted data when following up on individual GWAS hits is less of a problem because results for significant SNPs are more likely to be reported in full in the original study.
In our example, removing restricted data did not change the construct that was identified by genetic factor analysis: the genetic correlation between the factor identified without 23andMe data and the factor identified with 23andMe data was near unity, and the factors had highly similar associations with external variables. But this outcome is not guaranteed. Removing restricted data may be more impactful for univariate GWASs prior to their inclusion in meta-analyses and for multivariate GWASs with different indicator phenotypes and model structures. The consistency we observed between EXT and EXTminus23andMe is likely explained by the inclusion of restricted data in only a subset of indicators, with just one of seven summary statistics experiencing a substantive reduction in genetic signal (i.e., a 35% decrease in the mean χ^{2} of SMOK). Had more indicators included 23andMe data, we could have expected greater discrepancies between EXT and EXTminus23andMe.
The issues raised here are also relevant in the context of GWAS meta-analyses. Removing a restricted set of cohort-level summary statistics from a single-phenotype GWAS meta-analysis should mainly affect power if the genetic correlation between the cohort-level summary statistics is close to unity. However, considering that genetic correlations between cohort-level GWASs of the same trait can be substantially less than unity (Levey et al. 2021), removing a large cohort from the meta-analysis can change the genetic etiology of the trait being studied (de Vlaming et al. 2017). Researchers should thus use the approach presented here to examine potential changes in a phenotype’s genetic etiology alongside the expected power reduction after removing a sample from their GWAS meta-analysis. To our knowledge, this has only been done in one meta-analysis (Coleman et al. 2020), where the authors conducted a subset of the steps described in the present study (e.g., changes in heritability, genetic correlations with external variables, and gene enrichment analyses). Therefore, the utility of our systematic approach goes beyond the Genomic SEM context, as some of these steps may apply to other multivariate GWAS implementations.
Providing public summary statistics to the wider research community is crucial to facilitating open science and advancing behavioral and biomedical research. The first step in this process should be to evaluate the comparability of downsampled summary statistics and their restricted data counterparts. Herein, we provide a systematic approach to investigators who resort to sharing downsampled GWAS summary statistics and recommend they report these analyses as accompanying documentation to facilitate open science and data sharing.
Data Availability
The code for EXTminus23andMe is available on the wiki (https://github.com/Camzcamz/EXTminus23andMe/wiki) and the EXTminus23andMe summary statistics are available on the externalizing website (https://externalizing.rutgers.edu/ext23andmesummarystatisticsnowavailable/).
Notes
An approximate measure of variance explained (R^{2}), standardized with respect to the outcome.
Estimated with the chi-square cutoff set to 30, i.e., the default cutoff applied by bivariate LD Score regression when estimating the heritability. To our knowledge, there is no consensus on the best cutoff to use.
References
Abdellaoui A, Yengo L, Verweij KJH, Visscher PM (2023) 15 years of GWAS discovery: realizing the promise. Am J Hum Genet. https://doi.org/10.1016/j.ajhg.2022.12.011
Allen Institute for Brain Science. (2022). BrainSpan atlas of the developing human brain. http://www.brainspan.org/. Accessed 22 Dec 2022
Becker J, Burik CAP, Goldman G, Wang N, Jayashankar H, Bennett M, Belsky DW, Karlsson Linnér R, Ahlskog R, Kleinman A, Hinds DA, Caspi A, Corcoran DL, Moffitt TE, Poulton R, Sugden K, Williams BS, Harris KM, Steptoe A et al (2021) Resource profile and user guide of the polygenic index repository. Nat Hum Behav 5(12):12. https://doi.org/10.1038/s41562021011193
Begleiter H (1995) The collaborative study on the genetics of alcoholism. Alcohol Health Res World 19(3):228–236
Bucholz KK, McCutcheon VV, Agrawal A, Dick DM, Hesselbrock VM, Kramer JR, Kuperman S, Nurnberger JI, Salvatore JE, Schuckit MA, Bierut LJ, Foroud TM, Chan G, Hesselbrock M, Meyers JL, Edenberg HJ, Porjesz B (2017) Comparison of parent, peer, psychiatric, and cannabis use influences across stages of offspring alcohol involvement: evidence from the COGA prospective study. Alcohol Clin Exp Res 41(2):359–368. https://doi.org/10.1111/acer.13293
Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J, Patterson N, Daly MJ, Price AL, Neale BM (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet 47(3):3. https://doi.org/10.1038/ng.3211
Coleman JRI, Gaspar HA, Bryois J, Breen G, Bipolar Disorder Working Group of the Psychiatric Genomics Consortium, Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium (2020) The genetics of the mood disorder spectrum: genome-wide association analyses of more than 185,000 cases and 439,000 controls. Biol Psychiatry 88(2):169–184. https://doi.org/10.1016/j.biopsych.2019.10.015
de Vlaming R, Okbay A, Rietveld CA, Johannesson M, Magnusson PKE, Uitterlinden AG, van Rooij FJA, Hofman A, Groenen PJF, Thurik AR, Koellinger PD (2017) Meta-GWAS accuracy and power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genet 13(1):e1006495. https://doi.org/10.1371/journal.pgen.1006495
Demontis D, Walters RK, Martin J, Mattheisen M, Als TD, Agerbo E, Baldursson G, Belliveau R, Bybjerg-Grauholm J, Bækvad-Hansen M, Cerrato F, Chambert K, Churchhouse C, Dumont A, Eriksson N, Gandal M, Goldstein JI, Grasby KL, Grove J et al (2019) Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder. Nat Genet 51(1):63–75. https://doi.org/10.1038/s4158801802697
Edenberg HJ (2002) The collaborative study on the genetics of alcoholism: an update. Alcohol Res Health 26:214–218
Ge T, Chen CY, Ni Y, Feng YCA, Smoller JW (2019) Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 10(1):1. https://doi.org/10.1038/s41467019097185
Grotzinger AD, Rhemtulla M, de Vlaming R, Ritchie SJ, Mallard TT, Hill WD, Ip HF, Marioni RE, McIntosh AM, Deary IJ, Koellinger PD, Harden KP, Nivard MG, Tucker-Drob EM (2019) Genomic structural equation modelling provides insights into the multivariate genetic architecture of complex traits. Nat Hum Behav 3(5):513–525. https://doi.org/10.1038/s415620190566x
Harris KM, Halpern CT, Haberstick BC, Smolen A (2013) The National Longitudinal Study of Adolescent Health (Add Health) sibling pairs data. Twin Res Hum Genet 16(1):391–398. https://doi.org/10.1017/thg.2012.137
Johnson W, Bouchard TJ, Krueger RF, McGue M, Gottesman II (2004) Just one g: consistent results from three test batteries. Intelligence 32(1):95–107. https://doi.org/10.1016/S01602896(03)00062X
Johnson W, te Nijenhuis J, Bouchard TJ (2008) Still just 1 g: consistent results from five test batteries. Intelligence 36(1):81–95. https://doi.org/10.1016/j.intell.2007.06.001
Karlsson Linnér R, Biroli P, Kong E, Meddens SFW, Wedow R, Fontana MA, Lebreton M, Tino SP, Abdellaoui A, Hammerschlag AR, Nivard MG, Okbay A, Rietveld CA, Timshel PN, Trzaskowski M, de Vlaming R, Zünd CL, Bao Y, Buzdugan L et al (2019) Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences. Nat Genet 51(2):245–257. https://doi.org/10.1038/s4158801803093
Karlsson Linnér R, Mallard TT, Barr PB, Sanchez-Roige S, Madole JW, Driver MN, Poore HE, de Vlaming R, Grotzinger AD, Tielbeek JJ, Johnson EC, Liu M, Rosenthal SB, Ideker T, Zhou H, Kember RL, Pasman JA, Verweij KJH, Liu DJ et al (2021) Multivariate analysis of 1.5 million people identifies genetic associations with traits related to self-regulation and addiction. Nat Neurosci 24(10):10. https://doi.org/10.1038/s41593021009083
Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Linnér RK, Fontana MA, Kundu T, Lee C, Li H, Li R, Royer R, Timshel PN, Walters RK, Willoughby EA et al (2018) Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat Genet 50(8):1112–1121. https://doi.org/10.1038/s4158801801473
Levey DF, Stein MB, Wendt FR, Pathak GA, Zhou H, Aslan M, Quaden R, Harrington KM, Nuñez YZ, Overstreet C, Radhakrishnan K, Sanacora G, McIntosh AM, Shi J, Shringarpure SS, Concato J, Polimanti R, Gelernter J (2021) Bi-ancestral depression GWAS in the Million Veteran Program and meta-analysis in >1.2 million individuals highlight new therapeutic directions. Nat Neurosci 24(7):7. https://doi.org/10.1038/s41593021008602
Liu M, Jiang Y, Wedow R, Li Y, Brazel DM, Chen F, Datta G, Davila-Velderrain J, McGuire D, Tian C, Zhan X, Choquet H, Docherty AR, Faul JD, Foerster JR, Fritsche LG, Gabrielsen ME, Gordon SD, Haessler J et al (2019) Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use. Nat Genet 51(2):237–244. https://doi.org/10.1038/s4158801803075
McQueen MB, Boardman JD, Domingue BW, Smolen A, Tabor J, Killeya-Jones L, Halpern CT, Whitsel EA, Harris KM (2015) The National Longitudinal Study of Adolescent to Adult Health (Add Health) sibling pairs genome-wide data. Behav Genet 45(1):12–23. https://doi.org/10.1007/s1051901496924
Pasman JA, Verweij KJH, Gerring Z, Stringer S, Sanchez-Roige S, Treur JL, Abdellaoui A, Nivard MG, Baselmans BML, Ong JS, Ip HF, van der Zee MD, Bartels M, Day FR, Fontanillas P, Elson SL, de Wit H, Davis LK, MacKillop J et al (2018) GWAS of lifetime cannabis use reveals new risk loci, genetic overlap with psychiatric traits, and a causal influence of schizophrenia. Nat Neurosci 21(9):1161–1170. https://doi.org/10.1038/s4159301802061
Sanchez-Roige S, Palmer AA, Fontanillas P, Elson SL, Adams MJ, Howard DM, Edenberg HJ, Davies G, Crist RC, Deary IJ, McIntosh AM, Clarke TK (2019) Genome-wide association study meta-analysis of the Alcohol Use Disorders Identification Test (AUDIT) in two population-based cohorts. Am J Psychiatry 176(2):107–118. https://doi.org/10.1176/appi.ajp.2018.18040369
Watanabe K, Taskesen E, van Bochoven A, Posthuma D (2017) Functional mapping and annotation of genetic associations with FUMA. Nat Commun 8(1):1826. https://doi.org/10.1038/s41467017012615
Wray NR, Ripke S, Mattheisen M, Trzaskowski M, Byrne EM, Abdellaoui A, Adams MJ, Agerbo E, Air TM, Andlauer TMF, Bacanu SA, Bækvad-Hansen M, Beekman AFT, Bigdeli TB, Binder EB, Blackwood DRH, Bryois J, Buttenschøn HN, Bybjerg-Grauholm J et al (2018) Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat Genet 50(5):668–681. https://doi.org/10.1038/s4158801800903
Yengo L, Vedantam S, Marouli E, Sidorenko J, Bartell E, Sakaue S, Graff M, Eliasen AU, Jiang Y, Raghavan S, Miao J, Arias JD, Graham SE, Mukamel RE, Spracklen CN, Yin X, Chen SH, Ferreira T, Highland HH et al (2022) A saturated map of common genetic variants associated with human height. Nature 610(7933):7933. https://doi.org/10.1038/s4158602205275y
Acknowledgements
This research was conducted by the Externalizing Consortium. The Externalizing Consortium has been supported by the National Institute on Alcohol Abuse and Alcoholism (R01AA015416 – administrative supplement to DMD), and the National Institute on Drug Abuse (R01DA050721 to DMD). Additional funding for investigator effort has been provided by K02AA018755, U10AA008401, P50AA022537 to DMD, R01AA029688, and 28IR0070 to AAP and T29KT0526 and T32IR5226 to NCK and SSR from the Tobacco-Related Disease Research Program (TRDRP), NIDA DP1DA054394 to SSR, R25MH08148216 to NCK, R01HD092548 to KPH, as well as a European Research Council Consolidator Grant (647648 EdGe) to PDK. The content is solely the responsibility of the authors and does not necessarily represent the official views of the above funding bodies. The Externalizing Consortium would like to thank the following groups for making the research possible: 23andMe, Add Health, Vanderbilt University Medical Center’s BioVU, Collaborative Study on the Genetics of Alcoholism (COGA), the Psychiatric Genomics Consortium’s Substance Use Disorders working group, UK10K Consortium, UK Biobank, and Philadelphia Neurodevelopmental Cohort.
Funding
Tobacco-Related Disease Research Program, T29KT0526, R01AA029688, K02AA018755, National Institute on Drug Abuse, R25MH08148216, DP1DA054394, R01HD092548, R01DA050721, European Research Council Consolidator Grant, 647648 EdGe, National Institute on Alcohol Abuse and Alcoholism, R01AA015416
Author information
Contributions
CMW: Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing. HP: Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review and editing. PTT: Conceptualization, Writing – original draft, Writing – review and editing, Visualization. HK: Data curation, Software, Writing – original draft. NSCK: Formal analysis, Writing – original draft, Writing – review and editing. DLC: Validation, Writing – original draft. TTM: Conceptualization, Data curation, Methodology, Supervision. PB: Formal analysis, Writing – review and editing. PDK: Conceptualization, Writing – review and editing. IDW: Conceptualization, Writing – review and editing. SSR: Conceptualization, Writing – review and editing. KPH: Conceptualization, Writing – review and editing. AAP: Conceptualization, Writing – review and editing. DMD: Conceptualization, Writing – review and editing. RKL: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Supervision, Writing – review and editing.
Ethics declarations
Conflict of interest
Camille M. Williams, Holly Poore, Peter T. Tanksley, Hyeokmoon Kweon, Natasia S. Courchesne-Krak, Diego Londono-Correa, Travis T. Mallard, Peter Barr, Philipp D. Koellinger, Irwin D. Waldman, Sandra Sanchez-Roige, K. Paige Harden, Abraham A. Palmer, Danielle M. Dick, and Richard Karlsson Linnér declare that they have no conflict of interest.
Ethical Approval
This study included only secondary data analysis of deidentified data and was approved as “Exempt Human Subjects Research” by the institutional review board (IRB) of Rutgers University (#Pro2022000138). All participants provided written informed consent in the original studies from which these data were drawn. In addition, data collection of each cohort was approved by a review board at each respective institution.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Edited by Sarah Medland.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Williams, C.M., Poore, H., Tanksley, P.T. et al. Guidelines for Evaluating the Comparability of Down-Sampled GWAS Summary Statistics. Behav Genet 53, 404–415 (2023). https://doi.org/10.1007/s1051902310152z
DOI: https://doi.org/10.1007/s1051902310152z