Comparisons of power of statistical methods for gene–environment interaction analyses
Any genome-wide analysis is hampered by reduced statistical power due to multiple comparisons. This is particularly true for interaction analyses, which have lower statistical power than analyses of associations. To assess gene–environment interactions in population settings we have recently proposed a statistical method based on a modified two-step approach, where first genetic loci are selected by their associations with disease and environment, respectively, and subsequently tested for interactions. We have simulated various data sets resembling real world scenarios and compared single-step and two-step approaches with respect to true positive rate (TPR) in 486 scenarios and (study-wide) false positive rate (FPR) in 252 scenarios. Our simulations confirmed that in all two-step methods the two steps are not correlated. In terms of TPR, two-step approaches combining information on gene-disease association and gene–environment association in the first step were superior to all other methods, while preserving a low FPR in over 250 million simulations under the null hypothesis. Our weighted modification yielded the highest power across various degrees of gene–environment association in the controls. An optimal threshold for step 1 depended on the interacting allele frequency and the disease prevalence. In all scenarios, the least powerful method was to proceed directly to an unbiased full interaction model, applying conventional genome-wide significance thresholds. This simulation study confirms the practical advantage of two-step approaches to interaction testing over more conventional one-step designs, at least in the context of dichotomous disease outcomes and other parameters that might apply in real-world settings.
Keywords: Gene–environment interaction; Statistical modeling; False positive rate; True positive rate; Genome-wide testing
With the completion of genome-wide genotyping in many epidemiological studies, opportunities exist for testing gene–environment (G*E) interactions on an unprecedented scale, and on an “agnostic” (hypothesis-free) basis. However, testing for G*E interactions on a genome-wide basis involves multiple comparisons with the associated problems of limiting the overall (genome-wide) false discovery rate.
Several methods have been proposed to improve the power to detect G*E interactions when many thousands of comparisons have been made. First, a restriction to cases only was introduced. Later this approach was combined with a full interaction analysis in cases and controls by a Bayesian shrinkage factor [3, 4]. Global tests for genetic effects or interactions have also been suggested. In 2009, the concept of a two-step method was introduced by Murcray et al. In a first step, the gene–environment association is tested; only genetic variants associated with the exposure at a prespecified alpha level are then tested in a second step, in a conventional model containing the genetic and environmental main effects and an interaction term. In 2011, we proposed an additional module (step 1B), where we perform a classical genome-wide association study (GWAS) separately in the two exposure strata and integrate the signals with the gene–environment associations within the disease strata (step 1A). Independently, Murcray et al. developed a similar, but distinct method of integrating information regarding the gene-disease association.
The power of one-step and two-step methods has recently been compared in the context of a case–control study design, but the relative power of these methods has not been formally compared for detecting different types of G*E interactions, and in different types of epidemiological study designs. In this paper we applied each of the published methods [2, 3, 4, 5, 6, 7, 8], and our modifications of them, to simulated data representing both case–control and population-wide (cross-sectional or cohort) designs, under conditions representing the null hypothesis and alternative hypotheses of varying degrees of G*E interaction. We compared the false positive rate (FPR) as determined by null simulations and the true positive rate (TPR) or statistical power as quantified by the alternative simulations. Furthermore, we explored an optimal threshold for the analysis performed in both modules of step 1.
Materials and methods
Types of interaction
When the data are stratified by the environmental exposure, different kinds of interaction can be discerned, which we name full effect concentration, partial effect concentration, and cross-over interaction. A full effect concentration represents a genetic effect which is only present in the stratum with the environmental exposure. A partial effect concentration is present when the genetic effect is weaker but still present in the unexposed stratum. A cross-over interaction represents opposite genetic effects between the strata of environmental exposure.
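The three definitions above can be sketched as a small classifier on the two stratum-specific genetic odds ratios. This is only an illustration: the function name and the tolerance `tol` (the log-OR magnitude below which an effect is treated as absent) are hypothetical and not taken from the paper.

```python
import math

def interaction_type(or_g_exposed, or_g_unexposed, tol=0.02):
    """Label the interaction pattern from the genetic ORs in E+ and E-.

    tol is a hypothetical tolerance on the log-OR scale; it is an
    assumption for illustration, not a value used in the study.
    """
    b_exp = math.log(or_g_exposed)
    b_unexp = math.log(or_g_unexposed)
    if abs(b_exp - b_unexp) <= tol:
        return "no interaction"
    # Opposite signs with both effects clearly present: cross-over.
    if b_exp * b_unexp < 0 and min(abs(b_exp), abs(b_unexp)) > tol:
        return "cross-over interaction"
    # Genetic effect essentially absent in the unexposed stratum.
    if abs(b_unexp) <= tol:
        return "full effect concentration"
    # Same direction, weaker in the unexposed stratum.
    if abs(b_exp) > abs(b_unexp):
        return "partial effect concentration"
    return "other"
```

For example, genetic ORs of 1.5 in exposed and 1.0 in unexposed subjects would be labelled a full effect concentration, whereas 1.5 and 0.7 would be labelled a cross-over interaction.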
Creating data sets for simulations
Table 1 Parameters entered into the power simulation study
Parameter: Values, given as start … (step) … end
Prevalence of disease (PD): 0.05, 0.1 … (0.1) … 0.6
Number of control subjects [N*(1−PD)]: 1,000 … (1,000) … 9,000
Prevalence of exposure (PE): 0.1 … (0.1) … 0.8
Prevalence of genotype (PG): 0.1 … (0.1) … 0.9
Population attributable risk fraction of exposure (PARFE): 0.1 … (0.1) … 0.5
Population attributable risk fraction of genotype (PARFG): 0 … (0.05) … 0.25
Interaction odds ratio (ORG*E): 1.05, 1.1 … (0.1) … 2.0
OR in controls (ORGE|D−): 0.8^±1, 0.9^±1, 0.95^±1, 1.0
Proportion of corrected alpha level allocated to step 1 (ψ): 0 … (0.1) … 1
The strength of the interaction was controlled by varying ORG*E. The impact of gene–environment association (GEA), i.e. the deviation from gene–environment independence, was simulated by varying the OR in controls (ORGE|D−) or in an additional analysis in the full data set (ORGE). When varying ORGE|D−, the ORGE|D+ was derived as ORGE|D+ = ORGE|D− * ORG*E. All the above parameters refer to the study sample, which we call for simplicity “population”; the source population might differ. From the two contingency tables for the environmental and the genetic effect on the disease and the ORGE|D− and ORGE|D+ all 8 cells of the contingency table of all possible combinations of the three dichotomous variables D, E, and G were created using the quadratic formula. Finally the 8 cells of the contingency table were expanded to 12 cells for coding G in three categories (0, 1, and 2) to model two alleles per locus. The simulated counts within each cell represented numbers of individuals. In general, additive genetic models were applied to model the genotype; but also dominant and recessive models were explored.
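One way a quadratic equation arises in this kind of construction is when a 2×2 table has to be filled in from its two margins and a target odds ratio. The sketch below shows that single building block only. The authors' full derivation of the 8 cells is not reproduced here, and the function names are hypothetical.

```python
import math

def joint_cell(p_row, p_col, odds_ratio):
    """Probability of the (1, 1) cell of a 2x2 table with margins p_row
    and p_col and the given odds ratio, as a root of the quadratic
    (OR-1)*p^2 - [1 + (OR-1)*(p_row+p_col)]*p + OR*p_row*p_col = 0."""
    if math.isclose(odds_ratio, 1.0):
        return p_row * p_col  # independence: no quadratic needed
    a = odds_ratio - 1.0
    s = 1.0 + a * (p_row + p_col)
    disc = s * s - 4.0 * odds_ratio * a * p_row * p_col
    # The root with the minus sign is the one lying inside the valid range.
    return (s - math.sqrt(disc)) / (2.0 * a)

def two_by_two(p_row, p_col, odds_ratio):
    """Return the four cell probabilities (p11, p10, p01, p00)."""
    p11 = joint_cell(p_row, p_col, odds_ratio)
    p10 = p_row - p11
    p01 = p_col - p11
    p00 = 1.0 - p_row - p_col + p11
    return p11, p10, p01, p00

def cross_product_ratio(p11, p10, p01, p00):
    """Odds ratio recovered from the four cells."""
    return (p11 * p00) / (p10 * p01)
```

Calling `cross_product_ratio(*two_by_two(0.3, 0.4, 2.0))` recovers the requested odds ratio up to rounding, which is a convenient check that the cells are mutually consistent.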
Performing the simulations and calculating power
Random simulations of counts in the presence of G*E interactions, i.e. the alternative hypotheses, were performed with 10^3 iterations for each 3-way combination of PD, PE, and PG. The null hypothesis, i.e. absence of an interaction, was simulated with 10^6 iterations using parallel computing on the National Supercomputers HLRB-II SGI Altix 4700, Linux-Cluster, and superMUC at the Leibniz Supercomputer Center in Munich, Germany.
Table 2 Logistic regression models performed to calculate the interaction ORs and p values
(i) D = G + E + G*E: Full interaction model
(ii) E = G in D+: Gene–environment association in cases
(iii) E = G in D−: Gene–environment association in controls or non-cases
(iv) D = G in E+: Gene-disease association in exposed subjects
(v) D = G in E−: Gene-disease association in unexposed subjects
(vi) E = G: Gene–environment association in all subjects combined
(vii) D = G: Gene-disease association in all subjects combined
Corrections for multiple testing assumed that interactions were being examined “agnostically” as part of a genome-wide study performing tests on 651,550 statistically independent loci, as proposed by Dudbridge and Gusnanto. A false positive rate of 0.05 across the whole study was selected, and a Bonferroni correction applied resulting in a corrected alpha level (αcorr). Thus, for one-step methods, if the interaction p value was below αcorr = 0.05/651,550 = 7.67 × 10^−8 the interaction was considered significant. The power was calculated as the proportion of the significant interaction results among the respective iterations. The choice of significance levels in the two-step methods was simulated for various thresholds for α1 = αcorr^ψ ranging from ψ = 0 to 1 by 0.1 (Table 1), where ψ = 0 corresponds to a full interaction model (step 2 only) and ψ = 1 corresponds to a “step 1-only” analysis. The FPR was calculated as αcorr^(1−ψ). As an approximation ψ = 0.5 was used in the first simulation experiments.
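The arithmetic of these thresholds can be sketched as follows. The function names are illustrative, and the second value returned by `split_alpha` is simply the complementary quantity αcorr^(1−ψ) described in the text; treating it as the level left for step 2 is an assumption of this sketch.

```python
N_LOCI = 651_550          # statistically independent loci assumed in the text
STUDY_WIDE_ALPHA = 0.05   # study-wide false positive rate

def corrected_alpha(alpha=STUDY_WIDE_ALPHA, n_tests=N_LOCI):
    """Bonferroni-corrected alpha level (alpha_corr)."""
    return alpha / n_tests

def split_alpha(psi, alpha_corr):
    """Split alpha_corr on the log scale: alpha_1 = alpha_corr**psi and the
    complementary alpha_corr**(1 - psi); their product is alpha_corr."""
    alpha_1 = alpha_corr ** psi
    alpha_rest = alpha_corr ** (1.0 - psi)
    return alpha_1, alpha_rest

def power(p_values, threshold):
    """Proportion of iterations whose p value falls below the threshold."""
    hits = sum(1 for p in p_values if p < threshold)
    return hits / len(p_values)
```

With ψ = 0.5 both values equal the square root of αcorr, roughly 2.8 × 10^−4, and with ψ = 0 the step 1 threshold becomes 1, so every locus proceeds to the full interaction model.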
The full interaction model including cases and controls (“case–control”) comprises both main effects of the genetic and environmental factors and estimates the interaction effect directly by a multiplicative interaction term corresponding to model (i) in Table 2. This model is also used by all two-step approaches to calculate the interaction effect estimate (in step 2).
The case-only analysis (“case-only”) assumes independence of G and E in the controls, i.e. ORGE|D− = 1. In this case the interaction OR equals the OR in the cases (ORGE|D+) following formula (1). Therefore, βEG|D+ and its variance, estimated by model (ii) in Table 2, were used to calculate the interaction p value of the case-only method.
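For dichotomous G and E, the relation ORGE|D+ = ORGE|D− * ORG*E stated earlier can be checked directly on cell counts. The sketch below uses a hypothetical nested-list layout `n[d][e][g]` and made-up counts in which G and E are independent among controls, so that the case-only OR coincides with the interaction OR.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def ge_odds_ratios(n):
    """n[d][e][g] holds counts with D, E and G each coded 0/1.

    Returns (OR_GE in cases, OR_GE in controls, interaction OR), where the
    interaction OR is the ratio of the two stratum-specific ORs.
    """
    or_cases = odds_ratio(n[1][1][1], n[1][1][0], n[1][0][1], n[1][0][0])
    or_controls = odds_ratio(n[0][1][1], n[0][1][0], n[0][0][1], n[0][0][0])
    return or_cases, or_controls, or_cases / or_controls

# Illustrative counts only (not from the study); index order is [d][e][g].
example = [
    [[400, 100], [200, 50]],   # controls: G and E independent (OR = 1)
    [[100, 25], [40, 30]],     # cases: G-E odds ratio of 3
]
```

If the OR among controls departs from 1, the case-only OR no longer equals the interaction OR, which is the source of the false positives discussed for this method.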
Combination of case–control and case-only analysis
Mukherjee and Chatterjee [3, 4] combined the power advantages of the case-only approach and the rigour of the full interaction model by introducing a shrinkage factor that weighs down the case-only estimate in the direction of the case–control (full interaction) estimate in relation to the observed GEA among controls.
In the first approach the variance of the interaction estimate, V(βG*E), is used for weighting, whereas in the second approach this is replaced by the variance in the controls, V(βEG|D−). As our analyses and those by the originators consistently showed that the first approach was more powerful, only this one is shown in the graphs (Mukherjee and Chatterjee 2008). The respective terms were estimated by models (i) and (iii) in Table 2.
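The shrinkage idea can be illustrated with a scalar sketch. The exact expressions used by the cited authors are not reproduced in this text, so the form below (a weight of V/(V + θ²), with θ the log OR of the gene–environment association among controls) is an assumption based on the verbal description above, and all names are hypothetical.

```python
def shrinkage_estimate(beta_full, beta_case_only, theta_controls, weight_variance):
    """Shrink between the full-model and case-only interaction estimates.

    beta_full        log interaction OR from the full interaction model
    beta_case_only   log OR of the G-E association among cases
    theta_controls   log OR of the G-E association among controls
    weight_variance  V(beta_G*E) in the first approach, or V(beta_EG|D-)
                     in the second approach described in the text

    Assumed form: the weight on the case-only estimate is
    V / (V + theta^2), so it approaches 1 when controls show no
    G-E association and 0 when that association is strong.
    """
    k = weight_variance / (weight_variance + theta_controls ** 2)
    return beta_full + k * (beta_case_only - beta_full)
```

With `theta_controls = 0` the function returns the case-only estimate, and as the association among controls grows it moves towards the full-model estimate.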
Two-step approaches proposed by Murcray et al.
Murcray et al. proposed a two-step approach performing a logistic regression for the gene–environment association (model (vi) in Table 2) in the first step and a full interaction analysis (model (i) in Table 2) in the second step. Bonferroni correction was only applied to the number of SNPs being tested in the second step, resulting in less extreme significance thresholds in the full interaction model. For comparability we used the same step 1 thresholds as in our modifications of the method as described below.
In a second publication, Murcray et al. introduced a second GWAS at step 1, testing for the marginal association of the disease with the interacting allele [model (vii) in Table 2]. A weighting factor ‘rho’ then needs to be chosen to partition the step 1 significance level between the two GWASes performed at step 1, when selecting SNPs for inclusion in step 2. The criteria for selecting rho are not specified, although Murcray et al. show that their approach is relatively insensitive to the choice of rho within the range 0.1–0.9; we chose rho = 0.5 for our power comparisons. Common to both Murcray approaches is the estimation of the gene-disease and gene–environment associations in the pooled population and not stratified by environmental exposure, or by disease status, respectively.
Modifications of the Murcray method
“Weighted two-step” method
The full interaction model may be considered as a test of the difference between the stratum-specific beta estimates contributing to the averages at step 1A (βEG|D+ and βEG|D−) or at step 1B (βGD|E+ and βGD|E−). Under the null hypothesis of no interaction (zero difference), the difference will be independent of the unweighted average only if the variances of the two components are equal. Unequal variances might result in situations where PD ≠ 0.5 or PE ≠ 0.5 and would lead to a correlation between the averages (tested at step 1) and the difference (tested at step 2), even in the absence of G*E interaction. Theoretically, therefore, simple Bonferroni correction at step 2 would be insufficiently conservative.
In the case–control study design, the variances of the beta estimates contributing to the averages at step 1A will be similar. However, the variances will not be similar for the beta estimates contributing to the average at step 1B, unless the environmental exposure prevalence is close to 50 %. In the context of a population survey or a case–control study with many more controls than cases, the variances will be unequal at step 1A.
We assessed the extent of the bias (in practice) from using the unweighted averaging approach of Ege 2011 by comparing it with an alternative approach that used inverse variance weighting for averaging in step 1 (termed here the “weighted 2-step” method). By weighting the beta estimates contributing to the step 1A and step 1B averages inversely to their respective variances, the correlation between steps 1 and 2 under the null hypothesis is removed, assuring statistical independence of steps 1 and 2 in the absence of G*E interaction. In both the weighted and the unweighted approach we always included step 1A and step 1B.
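The argument about weighting can be made concrete with a short calculation. Assuming the two stratum-specific beta estimates are independent with variances v1 and v2 (an assumption of this sketch, as are the function names), the covariance between an average w1*b1 + w2*b2 and the difference b1 − b2 is w1*v1 − w2*v2.

```python
def average_weights(var_1, var_2, weighted):
    """Weights for averaging two stratum-specific beta estimates."""
    if weighted:
        inv_1, inv_2 = 1.0 / var_1, 1.0 / var_2
        w1 = inv_1 / (inv_1 + inv_2)   # inverse-variance weighting
    else:
        w1 = 0.5                       # unweighted average
    return w1, 1.0 - w1

def cov_average_difference(var_1, var_2, weighted):
    """Cov(w1*b1 + w2*b2, b1 - b2) for independent b1 and b2."""
    w1, w2 = average_weights(var_1, var_2, weighted)
    return w1 * var_1 - w2 * var_2
```

With inverse-variance weights the covariance is zero for any v1 and v2, whereas the unweighted average gives (v1 − v2)/2, which vanishes only when the two variances are equal.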
False positive rate under the null hypothesis of no interaction
Table 3 Absolute numbers of false and true positive findings across all methods in a population setting (PD = 0.1) at a genome-wide scale. Methods compared: Mukherjee and Chatterjee 2008, Murcray 2009, Murcray 2011, Ege 2011.
These null simulations also revealed that both steps of the two-step methods were uncorrelated (Supplemental Fig. 1) except for the population setting of Ege 2011 with a slight correlation (median of rho = −0.03). The two components of the first step of the two-step methods were only correlated for Murcray 2011 in the case–control setting with a median correlation of rho = 0.28 (Supplemental Fig. 2). When correlating the individual components of the first step to the second step, step 1A and step 2 were slightly correlated for the unweighted approach in the population setting (Ege 2011, median of rho = 0.03, Supplemental Fig. 3).
When simulating GEA in the entire sample and not just in the controls, similar results emerged (Supplemental Fig. 4).
True positive rate (power) in relation to the prevalence of D, E and G
The true positive rate (TPR) was first explored in relation to disease prevalence (Fig. 2, left panel, and Supplemental Table 2). The lowest power was associated with the full interaction model. At low disease prevalences the Ege et al. approach displayed the highest power followed by Murcray et al. and the weighted two-step method. The case-only approach achieved a higher TPR at higher prevalences (PD > 0.3), although with the potential for a high FPR if ORGE|D− ≠ 1 (see above). In the typical case–control scenario (PD ≈ 0.5), the TPR was similar for all four two-step methods, which had a small power advantage over Mukherjee and Chatterjee 2008. The apparent gain in power of Ege et al. in comparison to the other two-step approaches was accompanied by a slightly increased FPR. In order to understand whether the power advantage was maintained after adjustment for the inflated FPR we estimated both TPR and FPR for a genome-wide setting with 651,550 statistically independent loci and 100 truly interacting loci, respectively. The TPR and FPR of Ege 2011 at an alpha of 0.01 corresponded to the respective characteristics of Murcray et al. and the weighted approach at an alpha of 0.3 (see Table 3, figures printed in bold). Thus, the weighted and unweighted approaches have similar power if the study-wide FPR is held constant.
The TPR was also related to different total sample sizes keeping the absolute number of cases constant at 1,000 (Fig. 2, right panel). As the number of controls was increased above 1,000, the TPR of the case-only approach became inferior to that of the two-step approaches. The TPR of Murcray 2009 decreased above 5,000 controls.
The variation of the exposure prevalence in the context of a fixed PARFE and a fixed ORG*E led to various combinations of gene-disease associations in the exposure strata, thereby reflecting the different interaction types (Table included in Fig. 3). The highest power was achieved for a full effect concentration in the population setting and a mild cross-over in the case–control setting (Fig. 3).
The prevalence of the genotype also affected the power: the highest power was achieved for intermediate genotype prevalences in both population and case–control settings by all approaches except Murcray 2009, for which power was highest at low prevalences of genotype in a population setting (Supplemental Fig. 5).
True positive rate (power) in relation to the magnitude and direction of ORG*E
Optimal step 1 threshold
True and false positive rates in relation to GEA in controls
Effect of the underlying genetic model
The issue of multiple comparisons is a major challenge for any genome-wide analysis. In the case of interaction analyses, which tend to have lower statistical power than analyses of marginal associations (main effects), the balance between true positive rate (TPR) and (study-wide) false positive rate (FPR) requires careful evaluation [11, 12]. In the present study we have simulated various data sets resembling real world scenarios (Table 1) and compared our modification of the two-step approach originally proposed by Murcray et al. to all relevant methods currently available for GWIS analyses. We find that, in terms of TPR, the weighted variant of our modification is superior to all other methods when analyzing population based data in the context of GEA, while preserving a low FPR in over 250 million simulations under the null hypothesis. The apparent gain in power of our unweighted variant (Ege 2011) is offset by a slightly increased FPR in the population survey setting, attributable to the correlation between the step 1A and step 2 test statistics arising from unequal stratum sizes at step 1A in the context of a population survey. In case–control studies, both modifications match the performance of other two-step methods, and have a slightly higher TPR than the empirical Bayes one-step method proposed by Mukherjee and Chatterjee. In all scenarios, the least powerful method is to proceed directly to a full interaction model, applying conventional genome-wide significance thresholds.
The susceptibility of the case-only approach to GEA is generally acknowledged and also reflected by our simulations (Fig. 1). Moreover, we note that the power advantage of the case-only method over other more rigorous approaches is removed by including even a modest increase in the number of controls per case (Fig. 2, right panel).
Weighted combination of case-only and full interaction models
The approach by Mukherjee and Chatterjee performs much better than the case-only approach in terms of FPR and much better than the full interaction model in terms of TPR as shown in their recent contribution and by our simulation study. However, its performance in terms of both TPR and FPR falls short of all two-step approaches.
The major advantage of all two-step approaches over the single-step methods lies in reducing the number of tests performed in the full interaction model since the two steps are asymptotically independent, and only the tests performed in the second step require correction for multiple comparisons. This major advantage over all single-step methods is particularly pronounced in a genome-wide scenario. When testing only a few SNPs, or SNPs in high linkage disequilibrium, the advantage is partly lost.
The original Murcray approach is limited by its restriction to a case–control study design and loses power with increasing numbers of controls at a constant step 1 threshold (Fig. 2). Furthermore, the original approach neglects additional information from the genotype-disease association. This has now been addressed by a modification recently published by the same authors. However, the additional information obtained is not fully integrated, but only combined in a trade-off of type 1 error rates at step 1 by an arbitrary partitioning parameter “rho”. Over a broad range of intermediate values of rho, the method seems to be fairly insensitive to the choice of rho.
Modifications of the two-step approach
In order to integrate as much information as possible we have introduced a step 1B, where we calculate the unweighted average of the association estimates of two gene-disease GWASes stratified for the environmental exposure. As the unweighted average induces some correlation between step 1 and step 2 (Supplemental Figs. 1 and 3) it is affected by a slightly increased FPR. Although the integration of step 1A and step 1B via the χ² (df = 2) statistic removed the correlation of steps 1 and 2 almost completely (rho = −0.03), there was still some evidence of an inflated FPR due to incomplete statistical independence between steps 1 and 2. Therefore, in a variant of our earlier approach we integrated the stratified analyses by an inverse-variance weighted average, rendering it perfectly robust against false positive findings (Fig. 1, Table 3) while maintaining a high power throughout a spectrum of various degrees of GEA (Fig. 6). In practical terms the inflated FPR of the unweighted approach led only to one false positive finding at a genome-wide alpha level of 0.05 (Table 3), which is more than compensated by 68 true positive findings. Retrospectively these findings justify our previous analysis of a genome-wide data set. However, the characteristics of the unweighted approach at a genome-wide alpha level of 0.001 were comparable to those of the weighted approach (or Murcray 2011) at a genome-wide alpha level of 0.3 (Table 3), ultimately suggesting that either a more liberal approach or a more progressive significance level should be applied to genome-wide interaction testing.
The two symmetric components of step 1 are uncorrelated (Supplemental Fig. 2) and can thus be integrated by a two degree of freedom test, avoiding the need to choose a partitioning parameter at step 1. A further advantage of this approach is an enhanced power for cross-over interactions with equally weak main genetic and environmental effects, a situation where Murcray 2011 is relatively less powerful (Fig. 3). Moreover, the weighted approach is relatively more powerful in dominant genetic models as compared to Murcray 2011 (Fig. 7).
In the situation we were facing, with a single non-genetic exposure of interest, the environment-disease association was given. However, in situations where a multitude of environmental exposures are present, one might integrate the environment-disease associations as a step 1C and perform a three degree of freedom test at step 1. This extension to a step 1C would be appealing (and symmetrical) in the search for G*G interactions, where the “E” could be one of many thousands of SNPs.
A further advantage of the stratified assessment of the associations in step 1, over the unstratified assessment as proposed in the original two-step approach, is that it takes mutual confounding of genetic and environmental effects into account. In theory, the Murcray methods [6, 8] are prone to confounding at step 1, since the environmental exposure has presumably been selected because of prior knowledge or suspicion of an environment-disease association, leading to a spurious GEA for SNPs that are associated with disease, even if there is no G*E interaction. In practice, confounding may explain the weak correlation of step 1A and step 1B of Murcray 2011 (Supplemental Fig. 2).
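A minimal sketch of such a two degree of freedom combination is given below. It assumes that the step 1A and step 1B statistics are available as z scores that are independent and standard normal under the null hypothesis; the function names are illustrative. For 2 degrees of freedom the upper tail of the χ² distribution has the closed form exp(−x/2), so no statistics library is needed.

```python
import math

def z_score(beta, variance):
    """Wald z score of an averaged beta estimate."""
    return beta / math.sqrt(variance)

def step1_p_value(z_a, z_b):
    """p value of z_a^2 + z_b^2 referred to a chi-square with 2 df.

    Uses the exact survival function exp(-x / 2) of the chi-square
    distribution with 2 degrees of freedom.
    """
    chi2 = z_a ** 2 + z_b ** 2
    return math.exp(-chi2 / 2.0)

def passes_step1(z_a, z_b, alpha_1):
    """True if the locus is carried forward to the full interaction model."""
    return step1_p_value(z_a, z_b) < alpha_1
```

A locus with a moderate signal in both components can pass the step 1 threshold even when neither component alone would, which is one reason no partitioning parameter is needed.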
As a first approximation we selected the step 1 significance threshold as the square root of the genome-wide significance level, i.e. α1 < 2.7 × 10^−4, in order to distribute the power equally between the two steps (ψ = 0.5). The simulation experiment, however, revealed that the optimal threshold depended on PD and PG. Ignoring PG the best choice of ψ would be between 0.5 and 0.6 in a population survey setting (at PD ≈ 0.1) and around 0.8 in a case–control setting (Fig. 5). The finding that the TPR is maximized at ψ = 1 in the case–control setting is in line with a recent commentary by Thomas et al. However, choice of such an extreme partitioning between steps 1 and 2 would imply that the step 1 statistic alone would be taken as evidence of gene–environment interaction, i.e. all variants passing step 1 also pass step 2. Without a supplementary test of the full interaction model, the overall analysis is highly prone to false positives in the context of gene-disease or gene–environment associations in the total population.
The two degree of freedom test we used in this approach is fundamentally different from the two degree of freedom test suggested by Kraft et al. in order to detect genetic effects in an entire population or subsamples characterized by specific exposures. The Kraft 2 df test compares the full interaction model (D = E + G + G*E) to a model containing only the marginal environmental effect (D = E), which equals the sum of the likelihood ratio χ² of the two models D = G in E+ and D = G in E− (models iv and v in Table 2). This differs from our step 1B, which is a 1 df test of the pooled stratum-specific gene-disease associations. The purpose of the Kraft approach is not to distinguish between marginal genetic effects and interactions; inherently, it is less specific as a test for interactions than the methods evaluated in this paper.
Here we have focused on modelling G*E interaction by multiplicative interaction terms in order to render the current state-of-the-art analysis methods comparable. Other modelling strategies, e.g. those based on risk differences, require special analysis techniques that remain to be explored.
This extensive set of simulations, using a range of parameters that might apply in real-world settings, confirms the practical advantage of two-step approaches to interaction testing over more conventional one-step designs, at least in the context of dichotomous disease outcomes. The method can be easily adapted to the assessment of gene–gene interactions, as discussed above. The underlying concept of averaging associations between two potentially interacting variables (E and G) across strata defined by the level of an outcome (D) could, in theory, be extended to genome-wide G*E or G*G interaction analyses for continuous outcomes variables, such as physiological traits or levels of gene expression. However, further developmental work is required to evaluate the detailed application in such circumstances.
This work was supported by the European Commission as part of GABRIEL (A multidisciplinary study to identify the genetic and environmental causes of asthma in the European Community) Contract Number 018996 under the Integrated Program LSH-2004-1.2.5-1. M.J.E received the Stephan-Weiland Fellowship of the GABRIEL consortium.