Introduction

Genetic investigations of coronary artery disease (CAD) aim to identify functional variants to assist with its diagnosis, prognosis or treatment. A comprehensive approach to identifying genetic determinants of CAD must consider the full spectrum of DNA variant sizes and frequencies, ranging from single nucleotide changes to large copy number variations and from rare mutations to common polymorphisms. However, CAD is the terminal manifestation of multiple intermediate disease processes, which individually have genetic and environmental determinants (Figure 1). For genetic research into CAD to be truly comprehensive, experimental methods must identify environmental and genetic factors and their interactions [1, 2].

Figure 1

The pathophysiology of coronary artery disease (CAD) is affected by environmental and genetic factors and their interactions. Pathogenic mechanisms contributing to plaque development and subsequent CAD can be affected both negatively and positively by environmental exposures and genes. Environmental exposures can be either discrete (presence or absence) or continuous. Typically, CAD associated mutations and polymorphisms are found in genes encoding proteins that have key roles in intermediate pathways. Neither the environmental nor genetic lists shown here are comprehensive.

It seems reasonable that the effect of a CAD susceptibility allele could differ depending on the context of other genetic or environmental factors. For instance, is it effective to search for a gene underlying type 2 diabetes mellitus (T2DM) in high performance athletes? Although such athletes may be genetically predisposed to T2DM, their activity levels would probably protect them from expressing the phenotype. However, although gene-gene or gene-environment interactions seem to be an obvious topic for consideration, the analysis of such interactions is not yet routine in genetic studies of CAD. Here, we will focus on interaction types, strategies to detect interactions, potential biases and the statistical issues involved in studying gene-gene and gene-environment interactions in CAD.

Types of interactions

Broadly defined, interactions are differences in the strength of association between a gene and phenotype on the basis of the presence of, absence of or quantitative differences in an additional factor, which could be another genetic variant or an environmental exposure. There are several putative models for gene-environment interactions, including synergy, modification of effects and redundancy (Figure 2). For a gene-gene interaction, the additional factor might be dichotomous, such as carrier versus non-carrier status, or additive, such as zero, one or two copies of the minor allele. For a gene-environment interaction, the additional factor can similarly be dichotomous, such as presence or absence of smoking history, or it can be a continuous variable, such as number of pack-years smoked.

Figure 2

Putative gene-environment interactions. For even the simplest case, a dichotomous genetic risk factor (for example, carriers versus non-carriers) and a dichotomous environmental risk factor (for example, present versus absent), several types of interactions are possible. If both the gene and environment have main effects (odds ratios > 1), and thus could be identified independently, a synergistic interaction would result in an effect size larger than a simple additive effect. A second possibility is that an environmental factor could have no main effect but could modify the effect of a genetic factor that does have a main effect, creating a larger than expected combined effect. The inverse is also possible, in which a modifier gene with no main effect of its own increases the effect size of an environmental risk factor. A fourth possibility is that neither the gene nor the environment has a detectable main effect, and interaction is required to produce a measurable effect. A fifth possibility is for a gene and an environmental factor to have redundant effects, in which case the combination of factors produces no increase in risk. These types of interactions can be extended to include different effect sizes or gene-gene interactions.
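
The five cases described above can be made concrete with a short calculation. The sketch below is illustrative only: the odds ratios are hypothetical values chosen to match each scenario, and the function names are not from the source. It compares the joint odds ratio with what would be expected if the two factors simply multiplied or added. The additive measure uses odds ratios as an approximation to relative risks, which is reasonable only when the outcome is uncommon.

```python
def multiplicative_interaction(or_g, or_e, or_ge):
    """Joint odds ratio divided by the product of the single-factor odds ratios.

    A value of 1 means no interaction on the multiplicative scale.
    """
    return or_ge / (or_g * or_e)


def additive_interaction(or_g, or_e, or_ge):
    """Excess of the joint odds ratio over the additive expectation.

    A value of 0 means no interaction on the additive scale.
    """
    return or_ge - or_g - or_e + 1.0


# Hypothetical odds ratios: (gene only, environment only, gene and environment).
scenarios = {
    "synergy": (1.5, 2.0, 6.0),
    "environment modifies gene effect": (1.5, 1.0, 3.0),
    "gene modifies environment effect": (1.0, 2.0, 4.0),
    "interaction required for any effect": (1.0, 1.0, 3.0),
    "redundancy": (2.0, 2.0, 2.0),
}

for name, (or_g, or_e, or_ge) in scenarios.items():
    mult = multiplicative_interaction(or_g, or_e, or_ge)
    add = additive_interaction(or_g, or_e, or_ge)
    print(f"{name}: multiplicative ratio = {mult:.2f}, additive excess = {add:.2f}")
```

In the redundancy case the multiplicative ratio falls below 1 and the additive excess is negative, reflecting that the combination adds no risk beyond either factor alone.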

Role of interactions in genetic association studies

Recent advances in cost-effective, array-based, high-throughput genotyping platforms have led to a flood of investigations of common single nucleotide polymorphisms (SNPs) in various diseases. Genome-wide association studies (GWASs) have successfully identified genetic determinants of CAD and its component risk factors [3-16]. For instance, several investigations found a region of chromosome 9p21 that was associated with CAD independently of traditional risk factors [3-6]. Furthermore, multiple genetic associations for T2DM [7, 17] and body mass index (BMI) [18] have been discovered. However, most associated loci from GWASs have been reported for lipoprotein traits, including over 30 loci associated with plasma concentrations of low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol and triglyceride [7-16]. The success in finding genetic associations with lipoprotein phenotypes was due to methodological standardization (accuracy and precision) in trait measurement and to evaluation of large sample sizes, allowing detection of relatively subtle effects. Meta-analyses and collaborative consortia with large sample sizes have allowed GWASs to detect risk variants with low minor allele frequencies (< 5%) and small effect sizes (odds ratio of about 1.1 to 1.7) (Box 1); SNP association studies may have already reached their limit to detect clinically or biologically relevant loci with such effect sizes [8, 11, 13, 17, 18].

Box 1. Glossary of statistical terms

Despite recent success in identifying CAD-associated SNPs, much of the genetic component of CAD and its risk factors remains unattributed. Forcing additional genetic markers with small effect sizes into predictive models only marginally improves prediction over traditional risk factors [19]. However, accounting for gene-gene and gene-environment interactions might produce a meaningful increase in the combined effect of the genetic determinants [1, 2]. To ensure a valid assessment of gene-gene and gene-environment interactions, standards are required for sample sizes, accuracy and precision for continuous data, specificity and sensitivity for discrete data and appropriate statistical methods. Phenomics, defined as the comprehensive characterization of phenotype and environmental exposure [20], is also of key importance.

Identification of small effect genetic and environmental factors

So far, most genetic association studies have evaluated effects on intermediate phenotypes or pathogenic mechanisms, which can themselves be considered disease processes. For CAD, these intermediate phenotypes include blood coagulability, hypertension, altered lipid metabolism, cell proliferation and inflammation. When a new gene or locus is discovered, such as the chromosome 9p21 region associated with early CAD [3-6], and its association is subsequently replicated in multiple study samples [21-24], the basis of the association with CAD is assumed to be mediated through a pathogenic pathway [22]. This assumption will guide the design of subsequent functional experiments. Similarly, newly identified environmental determinants might exert their influence through one or even several pathogenic mechanisms and might even help identify previously unappreciated pathways.

Although the effect sizes of SNP associations identified in GWASs of CAD are modest, they are still important because: (i) individual associations can be combined to obtain larger cumulative effects; (ii) genes with small effects in GWASs can point to targets for drug-based or other interventions; (iii) genes with small effects in GWASs might contain rare, large-effect mutations in more severely affected patients; (iv) some GWAS loci with no previous CAD association might unveil new pathways; and (v) the effects of a GWAS locus could be amplified by gene-gene or gene-environment interactions.

These principles can be extended to the study of gene-environment interactions. For instance: (i) individual environmental interactions could be combined to obtain a cumulatively larger effect; (ii) rare extreme environmental exposures may display larger effects on the CAD phenotype than more common or typical environmental variation; and (iii) identification of gene-environment interactions might suggest new hypotheses to evaluate disease-causing mechanisms. These principles could direct the design of future studies of gene-environment interactions in CAD.

Minor versus major alleles as a risk factor for CAD

How do alleles affecting CAD susceptibility arise? DNA mutagenesis could provide a basis for understanding the generation of risk alleles. Several mutagenic mechanisms have been identified [25]. If a DNA error escapes repair and becomes embedded in the genome, it could, by affecting the expression or function of a protein, modify CAD risk either positively or negatively. If the recently mutated allele increases CAD risk, it is possible that genetic drift, inbreeding, pleiotropy, heterozygote advantage or small effects on reproductive fitness could be responsible for the allele reaching appreciable frequencies in the population [26]. For CAD, mortality typically occurs after the reproductive years, thus reducing selection pressure against deleterious alleles. Another possibility is that an environmental change might cause an allele that once had a neutral or beneficial effect to become deleterious.

Alternatively, if the mutated allele is beneficial, reducing CAD risk, one would expect the allele to increase in frequency to become the major allele. If the mutation occurred relatively recently, it is possible the minor allele is gradually becoming more prevalent. Such 'protective' minor alleles, or conversely major alleles that increase CAD risk, are possibly important from a public health perspective, since defining a gene-environment interaction might suggest an environmental intervention with a potentially large impact, due to the high population prevalence of the risk allele.

Analytical detection strategies

Gene-gene and gene-environment investigations have included family-based and population-based samples in retrospective and prospective designs. Statistical methods have included modified regression and chi-squared analyses, as well as statistical classification techniques, such as neural networks, support vector machines or Bayesian networks (Table 1). Although the statistical methods used in GWASs are fairly consistent and include regression and chi-squared analysis [3-8, 10-18, 27-30], the statistical approaches to detect gene-gene and gene-environment interactions are somewhat less standardized at present.

Table 1 Summary of strategies to detect gene-gene and gene-environment interactions

Investigators have tested for association between the cumulative number of risk alleles at multiple independent loci and disease [11, 27, 28]. Absolute allele counts [28] and relative weighting of alleles on the basis of their effect size [11, 27] have both been reported. Although these analyses showed that the alleles were independent and their effects could be added, no interaction between the alleles was measured. Subgroup analyses, in which the strength or effect size of the association is compared between sample subgroups, have substantially less power to detect an association than the original intact sample, increasing the risk of false negative results. For example, assuming 80% power to detect a difference in allele frequencies between cases and controls within one subgroup, the second equally sized subgroup will yield disparate results about 30% of the time just by chance. The clinical trial literature contains many examples of inappropriate subgroup analyses [31], and one excellent review examines the lack of consistency of sex-specific subgroup genetic associations [32].
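
Two calculations in the paragraph above lend themselves to a brief sketch. The first contrasts an unweighted risk-allele count with a weighted score. Weighting each allele by the natural log of its odds ratio is an assumption made here for illustration, as the text says only that alleles were weighted by effect size. The second gives one plausible reading of the "about 30%" figure: if two independent subgroups each have 80% power, the chance that exactly one of them reaches significance is 2 × 0.8 × 0.2. All function names and example values are hypothetical.

```python
import math


def allele_count_score(risk_allele_counts):
    """Unweighted score: total number of risk alleles carried (0, 1 or 2 per locus)."""
    return sum(risk_allele_counts)


def weighted_score(risk_allele_counts, odds_ratios):
    """Weighted score: each risk allele contributes the log of its per-allele odds ratio.

    The log-odds weighting is an illustrative assumption, not the source's stated method.
    """
    return sum(count * math.log(odds_ratio)
               for count, odds_ratio in zip(risk_allele_counts, odds_ratios))


def discordance_probability(power):
    """Probability that exactly one of two independent, equally powered subgroups
    detects a true effect."""
    return 2 * power * (1 - power)


# Hypothetical individual genotyped at three independent loci.
counts = [2, 1, 0]
per_allele_odds_ratios = [1.1, 1.3, 1.7]

print("allele count score:", allele_count_score(counts))
print("weighted score:", round(weighted_score(counts, per_allele_odds_ratios), 3))
print("discordance at 80% power:", round(discordance_probability(0.8), 2))
```

Neither score includes a product term between loci, which is why such analyses test additivity of effects rather than interaction.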

Regression techniques can be modified to test for gene-gene or gene-environment interactions, either by including additional interaction terms in the model or testing association with or without an additional covariate. Careful reviews of regression approaches to study interactions show the multitude and complexity of these techniques [33, 34]. Finally, sophisticated statistical classification techniques, including but not limited to neural networks [29], support vector machines [35] and Bayesian networks [30], are being updated to accommodate analysis of interactions.
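
As a minimal illustration of an interaction term, consider a quantitative trait with a dichotomous genetic factor G and a dichotomous environmental factor E. In the saturated linear model trait = b0 + bG·G + bE·E + bGE·G·E, the coefficients follow directly from the four cell means, and bGE is a difference of differences. The sketch below assumes this simple setting, with hypothetical cell means, and does not represent the more elaborate methods reviewed in [33, 34].

```python
def saturated_interaction_model(cell_means):
    """Coefficients of trait = b0 + bG*G + bE*E + bGE*G*E for binary G and E.

    cell_means maps (G, E) tuples with values 0 or 1 to the mean trait in that cell.
    """
    m00 = cell_means[(0, 0)]
    m10 = cell_means[(1, 0)]
    m01 = cell_means[(0, 1)]
    m11 = cell_means[(1, 1)]
    return {
        "intercept": m00,
        "gene": m10 - m00,                     # gene effect when unexposed
        "environment": m01 - m00,              # exposure effect in non-carriers
        "interaction": m11 - m10 - m01 + m00,  # difference of differences
    }


def predict(coefficients, g, e):
    """Predicted mean trait for a given genotype group g and exposure group e."""
    return (coefficients["intercept"]
            + coefficients["gene"] * g
            + coefficients["environment"] * e
            + coefficients["interaction"] * g * e)


# Hypothetical mean LDL cholesterol (mmol/L) by carrier status and exposure.
means = {(0, 0): 3.0, (1, 0): 3.4, (0, 1): 3.2, (1, 1): 4.2}
coefficients = saturated_interaction_model(means)
for term, value in coefficients.items():
    print(f"{term}: {value:.2f}")
```

With a continuous exposure, such as pack-years smoked, the same product term would be estimated by least squares rather than read off cell means, and a test of bGE = 0 is the test for interaction.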

Multiple comparisons

If N genetic variants are entered into an analysis, N*(N-1)/2 potentially interacting pairs can be constructed. Selecting a priori known functional SNPs, or SNPs with coinciding spatial or temporal expression patterns, is one approach to reduce the number of tests. An alternative approach is first to test for marginal main effects in a primary hypothesis-generating analysis and then to test for interactions between those significant effects in a second analysis in which the nominal level of significance has not been substantially adjusted [33]. In GWASs, permutation testing, control of false discovery rates and Bonferroni correction have been used to determine appropriate significance thresholds. Whatever approach is used, care will be required for selecting the nominal level of significance in gene-gene and gene-environment investigations.
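
The scale of the multiple-testing burden can be sketched as follows. The array size of 500,000 SNPs and the 100 SNPs carried into a second stage are hypothetical figures, and Bonferroni correction is shown only because it is the simplest of the approaches mentioned above.

```python
import math


def n_pairwise_tests(n_variants):
    """Number of distinct variant pairs: N * (N - 1) / 2."""
    return math.comb(n_variants, 2)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Exhaustive pairwise search on a hypothetical 500,000-SNP array.
all_pairs = n_pairwise_tests(500_000)
print("pairs, exhaustive search:", all_pairs)
print("Bonferroni threshold:", bonferroni_threshold(0.05, all_pairs))

# Two-stage approach: only SNPs with a marginal main effect are paired.
stage_two_pairs = n_pairwise_tests(100)
print("pairs, two-stage search:", stage_two_pairs)
print("Bonferroni threshold:", bonferroni_threshold(0.05, stage_two_pairs))
```

Restricting the second stage to SNPs with marginal effects reduces the number of tests by several orders of magnitude, but it cannot detect interactions between variants that have no main effect.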

Potential biases in gene-gene and gene-environment investigations of CAD

Many types of biases can affect gene-gene and gene-environment interactions (Table 2). The accuracy and precision of genotyping technologies render genetic investigations relatively resistant to measurement bias, compared to other sources of potential bias. Unequivocal disease phenotypes, such as myocardial infarction or coronary bypass surgery, are least susceptible to measurement bias. New imaging techniques, such as ultrasound-based intima-media thickness or magnetic resonance imaging (MRI)-based plaque volume calculations, are more susceptible to systematic errors of measurement. Self-reported measures of environmental exposure, such as caloric intake, energy expenditure or alcohol use, are most vulnerable to biases. Strategies to maximize the sensitivity and specificity of environmental factor measurement will improve the likelihood of detecting a significant association signal for interactions with genetic determinants [36].

Table 2 Potential biases in gene-gene and gene-environment investigations of coronary artery disease (CAD)

Study design can affect bias, because prospective cohort studies are generally more resistant to bias than retrospective case-control designs [1]. Survivorship bias and population stratification are less common in prospective studies, assuming a truly representative cross-sectional cohort. Survivorship bias is a potential liability of retrospective studies of CAD, because patients with a fatal first myocardial infarction (up to 30% of cases) cannot be included in future studies. Recall bias, in which the study participant is more likely to remember an environmental exposure if it is associated with a negative outcome, respondent bias, in which patients alter their answers to exposure questions following a negative outcome, and exposure suspicion biases, in which investigators query individuals who have a negative outcome more thoroughly, are all reduced in prospective designs, as long as environmental exposure information is collected from all study participants irrespective of CAD outcomes.

Statistical power

Statistical power increases with the number of study participants and with the size of the effect under study. Factors to be included in power calculations of all genetic investigations include the minor allele frequency, the degree of linkage disequilibrium between the queried marker and the hypothetical disease locus, the genotype error rate and the genetic or phenotypic heterogeneity (Box 2). Fortunately, high-throughput genotyping platforms have a negligible genotype error rate [37]. Correction for multiple comparisons and the measurement error of environmental exposures also influence study power [1, 2]. As a result of the greater accuracy of genotyping compared with the measurement or report of environmental exposures, there is theoretically more power to detect a gene-gene interaction than a gene-environment interaction for the same-sized sample. Studies with inaccurate or imprecise measurement of phenotype or environmental exposure may require up to 20 times larger samples to detect an association signal above background noise [36]. However, the power advantage of gene-gene investigations resulting from their higher measurement accuracy is diminished by the need to correct for multiple comparisons and by the potentially increased complexity of interactions compared with gene-environment investigations.
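
How power depends on sample size, effect size and significance threshold can be sketched with a normal approximation. The example assumes the simplest possible setting, a two-sided comparison of a standardized mean difference between two equally sized groups, and ignores the genetic factors listed in Box 2. The function name and the numerical inputs are illustrative.

```python
from statistics import NormalDist


def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means.

    effect_size is the difference in means divided by the common standard deviation.
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = effect_size * (n_per_group / 2) ** 0.5
    return (standard_normal.cdf(shift - z_critical)
            + standard_normal.cdf(-shift - z_critical))


for n in (100, 400, 1600):
    print(f"n per group = {n}: power = {power_two_group(0.2, n):.3f}")

# A stricter threshold, as used after correction for multiple comparisons,
# lowers power for the same sample.
print(f"alpha = 5e-8, n = 1600: power = {power_two_group(0.2, 1600, alpha=5e-8):.3f}")
```

Power rises with sample size and effect size but levels off as it approaches 1, and it falls sharply when the significance threshold is tightened to account for multiple comparisons.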

Box 2. Factors affecting the statistical power of a study of gene-gene or gene-environment interactions

How large a sample is required for adequate power to find gene-gene and gene-environment interactions? A rule of thumb is that a four-fold increment in sample size is required to test for a multiplicative interaction of two main effects [2, 38]. This may overestimate the sample size requirement, especially if the effect of the interaction is larger than the main effects, but it illustrates the general requirement for a larger sample size when interactions are introduced into hypothesis testing. Given that many previous candidate gene studies, and even many GWASs, were powered to detect only main effects, testing these samples for gene-gene and gene-environment interactions has the potential for false positive and false negative results [2, 3]. Higher-order interactions will require even larger samples to attain suitable power and may not be possible even among the largest current association studies [1].
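
One way to see where a four-fold figure can come from is to compare the sampling variance of a main-effect estimate with that of an interaction estimate. The sketch below assumes a balanced design with two dichotomous factors, equal numbers in each of the four cells and a common residual standard deviation. These assumptions are made for illustration and are not stated in the text or in [2, 38].

```python
def contrast_variance(coefficients, cell_sizes, sigma=1.0):
    """Variance of a linear contrast of independent cell means.

    Each cell mean has variance sigma**2 / n, so the contrast variance is
    sigma**2 * sum(c**2 / n) over the cells.
    """
    return sigma ** 2 * sum(c ** 2 / n for c, n in zip(coefficients, cell_sizes))


# Cell order: (G=1, E=1), (G=1, E=0), (G=0, E=1), (G=0, E=0).
MAIN_EFFECT_OF_GENE = (0.5, 0.5, -0.5, -0.5)  # mean of carriers minus mean of non-carriers
INTERACTION = (1.0, -1.0, -1.0, 1.0)          # difference of differences

total_n = 400
cells = [total_n // 4] * 4

var_main = contrast_variance(MAIN_EFFECT_OF_GENE, cells)
var_interaction = contrast_variance(INTERACTION, cells)
print("variance of main effect estimate:", var_main)
print("variance of interaction estimate:", var_interaction)
print("ratio:", var_interaction / var_main)

# Quadrupling the sample brings the interaction estimate to the precision
# that the main effect had in the original sample.
cells_four_fold = [total_n] * 4
print("interaction variance at 4x sample:", contrast_variance(INTERACTION, cells_four_fold))
```

Because the required sample size scales with the variance of the estimate divided by the squared effect, an interaction that is larger than the main effects needs proportionally less than a four-fold increase, which is consistent with the caveat above.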

Examples of interactions in monogenic CAD

Studies of monogenic susceptibility to CAD have revealed several gene-gene and gene-environment interactions. For instance, age at death from CAD was studied in large Mormon families with familial hypercholesterolemia (FH) attributable to rare heterozygous mutations in the LDLR gene [39, 40]. Carriers of LDLR mutations who lived in the 19th century had survived to the eighth and ninth decades of life, whereas carriers of LDLR mutations who lived in the 20th century died early with CAD, often in the third and fourth decades of life [39, 40]. The most likely explanation for this observation was a healthier environment in past times, including higher physical activity and lower saturated fat consumption compared with the contemporary environment [39, 40]. Similar conclusions were reached with multi-generational studies of FH patients in the Netherlands [41]. Other investigators found that Chinese FH heterozygotes who had immigrated to North America had worse biochemical and clinical phenotypes than carriers of the same LDLR mutations living in China [42]. The difference in disease severity was ascribed to differences in dietary fat consumption; these circumstantial observations strongly suggested that environmental factors, such as diet and activity level, modulated the phenotype of heterozygous FH.

From our personal experience, there are other examples of monogenic illnesses whose severity can be significantly modulated by the environment - mainly diet and activity. For instance, we have seen the severity of expression of the disease phenotype made worse by an adverse environment in patients with hypertriglyceridemia due to apo CII-T [43], with analphalipoproteinemia due to APOA1 Q [-2]X [44], with T2DM due to HNF1A G319S [45] and with metabolic complications and CAD in familial partial lipodystrophy due to LMNA R482Q [46].

Examples of interactions with common SNPs

Although interactions between environment and disease penetrance in rare monogenic disorders are instructive, a much larger potential impact could be seen in common complex CAD susceptibility because of small-effect common SNPs. The effect of the environment might be even more pronounced in patients whose phenotypes are caused by the aggregation of small contributions from many genetic and non-genetic factors. Examples of replicated gene-gene and gene-environment interactions identified in investigations of common SNPs in candidate genes are shown in Table 3. For instance, increased CAD risk has been observed in smokers with null genotypes for glutathione S-transferases, which are involved in the detoxification of carcinogens and products of oxidative stress [47, 48]. Furthermore, smokers who are carriers of at least one APOE E4 allele seem to have significantly higher concentrations of oxidized LDL cholesterol compared with non-carriers, potentially further increasing CAD risk [49, 50]. Humphries and colleagues report a robust association between the -455G>A SNP of the fibrinogen beta chain (FGB) gene and elevated post-exercise fibrinogen levels [51, 52]. Elevated fibrinogen levels may modulate the myocardial infarction risk associated with the Leu34 allele of blood coagulation factor XIII (F13A1), a tetrameric zymogen that protects the fibrin clot from proteolytic degradation [53, 54]. These candidate gene-environment interactions were examined because of plausible biological relationships, but large-scale replications are still required, with careful attention to the issues raised in this article.

Table 3 Examples of replicated gene-gene and gene-environment interactions in CAD

Examples from GWASs

Gene-gene or gene-environment interactions are not yet routinely evaluated in GWASs, but two recent reports include exploratory examinations. Kathiresan and colleagues performed a two-stage GWAS of plasma lipoproteins [11]. The first stage identified over 1,000 associated SNPs in 25 loci (p < 5 × 10^-8) [11]. The second stage analysis re-tested all SNPs using 36 of the significantly associated SNPs from the first stage as covariates in the regression. The number of associated SNPs was reduced to 105 in 7 loci (p < 5 × 10^-8) [11]. All loci identified in the second stage had been identified in the first stage of analysis, suggesting that additional SNPs in known loci - that are not in linkage disequilibrium with the SNPs used as covariates - are associated with lipoprotein traits.

Sabatti and colleagues examined genome-wide gene-environment interactions, with the caveat that the work was under-powered to confidently identify interactions [13]. They examined four dichotomized environmental variables (sex, use of oral contraceptives, BMI over 25 kg/m^2 and gestational age), for which differences in effect size were compared between the two subgroups, and two variables separated into quintiles (birth BMI and early growth), which were tested by regression using an interaction term [13]. At least one interaction SNP was identified (p < 5 × 10^-7) for five out of six environmental variables, although none of the SNPs were in genes with a main effect or with known biological relevance [13].

These findings represent possible novel associations with metabolic CAD risk factors, but replication in larger samples is required. The issues discussed above in relation to study design, power and analytic strategies to detect gene-gene and gene-environment interactions are relevant to these large multi-center population studies, as these studies will form the precedent for future investigations.

Clinical implications

Accounting for gene-gene and gene-environment interactions will probably be important for future strategies of diagnosis, prognosis and management of CAD. For instance, current treatment guidelines for CAD prevention require risk stratification of the patient. CAD risk strata in a currently disease-free patient are calculated using traditional epidemiological risk factors, such as older age, male sex, the presence of cigarette smoking, diabetes, hypertension, dyslipidemia and, in some models, a family history of early CAD. Quantification of the patient's CAD risk using these variables guides the intensity of evidence-based drug treatment of modifiable risk factors, such as hypertension and dyslipidemia. It certainly seems feasible that reliable molecular genetic information can be included in future risk stratification models, improving precision over simply documenting a family history of CAD. Furthermore, combinations of specific genetic variables in the context of specific environmental variables - reflecting both gene-gene and gene-environment interactions - could help to re-stratify an individual between risk strata derived using non-molecular data. Also, given that such environmental factors as diet, activity level, stress, smoking and air quality are known to be important determinants of CAD risk, the first line of cost-effective and safe intervention for an individual with a high genetic risk burden would include modulation of such environmental factors instead of more costly, high-tech approaches, such as gene-based biological therapies.

Conclusions

In the context of GWAS datasets, gene-gene and gene-environment interactions are a new frontier for CAD association studies. GWASs have been extremely successful in identifying individual loci for CAD susceptibility, but the practical limits of sample size and array resolution for the identification of biologically valid loci will soon be reached. As a result of the high prevalence of CAD and the presence of large, multi-center prospective cohort initiatives with genotyping on high-density DNA genotyping arrays, gene-gene and gene-environment interaction studies of CAD will be possible in the future. Rigorous testing for gene-gene and gene-environment interactions should be built into the experimental study design. To ensure that testing for interactions enjoys the same success as GWASs of CAD, precise standards, including suitable sample sizes, reliable methods for measurement of environmental exposures, phenomic characterization and statistical analyses, will be required to minimize both false negative and false positive findings and to allow findings to be compared across samples and reports. The increment in the understanding of CAD susceptibility provided through systematic study and replication of gene-gene and gene-environment interactions will permit a more complete set of tools for diagnosis, disease prediction and prognosis and tailored therapy, perhaps using appropriate environment-based interventions.