Current Epidemiology Reports, Volume 1, Issue 3, pp 130–137

The Impact of GWAS Findings on Cancer Etiology and Prevention

  • Jane C. Figueiredo
  • Daniel O. Stram
  • Christopher A. Haiman
Cancer (G Colditz, Section Editor)

DOI: 10.1007/s40471-014-0017-1

Cite this article as:
Figueiredo, J.C., Stram, D.O. & Haiman, C.A. Curr Epidemiol Rep (2014) 1: 130. doi:10.1007/s40471-014-0017-1


Genome-wide association studies (GWAS) of common genetic variation have contributed immensely to our understanding of inherited susceptibility to cancer. To date, over 400 susceptibility loci have been identified across all cancers. These loci implicate novel as well as established genes and biological pathways, and have reinforced the significance and functionality of non-protein-coding DNA sequence. While the genetic associations for each variant are deemed modest, for some cancers, risk stratification based on aggregate effects may be of value for targeted screening and prevention strategies. Several questions still remain to be answered in order to fully assess the significance of these findings, including a better understanding of the source of the missing heritability. Here we discuss, from studies of common cancers, the impact of GWAS findings on illuminating disease etiology, the potential future utility of polygenic models for screening and prevention, and the future of association studies in the post-genomic era.




The late 2000s marked the first publications of genome-wide association studies (GWAS) in cancer, with the discovery of a limited number of loci for the most common malignancies, including the 8q24 region in colorectal, prostate, and breast cancers [1–3]. Since then, as for other complex diseases, there has been impressive and unprecedented growth in the number of identified common susceptibility loci [4]. Currently, individual studies and meta-analyses have enumerated approximately 40 susceptibility loci for colorectal cancer, >70 for breast cancer, and >100 for prostate cancer [4]. These findings have helped to confirm the importance of polygenic inheritance resulting from multiple low-penetrant alleles in cancer. It is estimated that these loci may explain approximately 18 % of the familial component (i.e., the risk associated with a positive family history) of breast cancer [5•] and 30 % of the familial component of prostate cancer [6•]. While this is encouraging, there remains a large fraction of missing heritability that has yet to be explained for most common cancers [7•]. Whether this missing heritability is due to large numbers of common and/or rare alleles is currently not known; however, both are likely to contribute, and gathering evidence for either will require even larger consortium studies and the incorporation of large-scale next-generation sequencing as the primary platform for variant identification and testing.
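As a rough illustration of how a "fraction of the familial component explained" can be computed, the sketch below uses one common approach, which is an assumption here since the review does not specify the calculation. Loci are taken to combine multiplicatively, the familial relative risk attributable to locus k is λ_k = (p_k r_k² + q_k)/(p_k r_k + q_k)², and the summed log λ_k is compared with the log of the observed familial relative risk λ_0. The allele frequencies, per-allele relative risks, and λ_0 = 2 are hypothetical values chosen only for the example.

```python
import math


def locus_familial_rr(p, r):
    """Familial relative risk (parent-offspring) attributable to one biallelic
    locus under a multiplicative model.
    p = risk-allele frequency, r = per-allele relative risk."""
    q = 1.0 - p
    return (p * r**2 + q) / (p * r + q) ** 2


def fraction_explained(loci, lambda_0):
    """Share of the log familial relative risk (lambda_0) accounted for by the
    given loci, assuming the loci act independently and multiplicatively."""
    total_log_lambda = sum(math.log(locus_familial_rr(p, r)) for p, r in loci)
    return total_log_lambda / math.log(lambda_0)


# Hypothetical (risk-allele frequency, per-allele relative risk) pairs.
hypothetical_loci = [(0.30, 1.20), (0.45, 1.10), (0.10, 1.25), (0.25, 1.08)]

frac = fraction_explained(hypothetical_loci, lambda_0=2.0)
print(f"Fraction of familial relative risk explained: {frac:.1%}")
```

Because each modest-effect locus contributes a λ_k only slightly above 1, many loci are needed before the explained fraction becomes appreciable, which is consistent with the large remaining share described above.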

Inarguably, one of the primary goals of GWAS, relative to the preceding single-variant/candidate-gene approaches, has been to take an unbiased and agnostic view to uncover the genes and biological pathways that contribute (unambiguously) to cancer development. Second has been the expectation that common susceptibility alleles, alone or in combination, would improve population risk stratification and prediction models with family history and/or other previously established risk factors, thereby improving the effectiveness of screening and prevention efforts. In this review, we discuss the success and limitations of GWAS with regard to these goals, with a particular focus on the most common cancers of the breast, prostate, and colorectum, where a large number of risk loci have been revealed (because of available samples in consortia efforts), and the potential utility of risk stratification assessed in a clinical context.

Impact on Cancer Etiology

Importance of Non-protein-Coding Sequence

An important observation is that the vast majority of risk variants (and their proxies) have been located in non-protein-coding sequence, which points to the functional nature of non-coding DNA sequence (as highlighted by ENCODE [8••] and the National Institutes of Health Roadmap Epigenomics Project) and gene regulation rather than protein modification underlying the genetic effects. The etiologic mechanisms at non-coding susceptibility loci have also been informed through the incorporation of epigenetic information (e.g., chromatin histone modifications). For example, functional characterization of 71 breast cancer susceptibility loci showed that of 1,005 single nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD) with the index SNPs (r² ≥ 0.5), only 21 were in exons (and only two predicted as non-benign coding variants); in contrast, there were 76 SNPs in predicted transcription start-site regions and 921 SNPs in putative enhancers at 60 of the 71 breast cancer risk loci [9]. Similarly, in an investigation of 77 prostate cancer susceptibility loci, a large number of variants or their proxies in high LD were located in non-coding regions yet were putatively functional (509 SNPs in enhancers and 20 SNPs in promoters), compared with a relative handful of coding variants [10]. In addition to influencing regulation of genes, there is also growing evidence for the importance of long non-coding RNA [11, 12].

Highlighting Known and Novel Genes and Biological Pathways

Several risk variants are located within or in close proximity to genes in pathways previously implicated in cancer, which has helped to reinforce and clarify biological understanding of these pathways in cancer development. For example, in colorectal cancer, the transforming growth factor-β (TGFβ) signaling pathway plays a critical role in proliferation, differentiation, cell migration, adhesion, and extracellular matrix production, which are key processes in development and progression of tumors [13, 14]. Of the 40 regions identified to date through GWAS, six loci implicate genes involved in TGFβ signaling (Fig. 1), including (rs4939827) SMAD7 [15, 16], (rs4779584) GREM1 [17], (rs10411210) RHPN2, the bone morphogenetic protein genes (rs961253) BMP2 and (rs4444235) BMP4 [15], and (rs4925386) LAMA5 [18], which is required for the production of noggin, a secreted BMP antagonist [19]. Similar corollaries have been noted in breast cancer, with susceptibility loci containing genes involved in mammary development (FGFR2, PTHLH, TBX3), DNA repair (e.g., BRCA2, RAD51B, and MERIT40), hormone signaling (ESR1, NRIP1), and cell cycle regulation (CDKN2A, CDKN2B, CCND1) [20]. In prostate cancer, prominent examples include KLK3 (PSA), the androgen receptor, and MSMB (β-microseminoprotein [MSP]) [21], which is one of the most highly secreted proteins from the prostate and a previously hypothesized biomarker of prostate cancer detection or risk. Laboratory and population-based studies have demonstrated that the risk allele (rs10993994) at the MSMB locus influences CREB binding and is associated with lower MSMB gene expression in tumor cell lines and prostate tissue [22, 23], as well as lower circulating MSP levels in the blood [24]. We have also shown MSP levels to be an independent and highly significant risk factor for prostate cancer in multiple populations, a study that was motivated by the GWAS discovery at this locus [25].
Fig. 1

Genes involved in the transforming growth factor-β (TGFβ) pathway implicated in colorectal cancer risk by genome-wide association study (GWAS)-identified loci. Six loci either within or near the genes circled in red—SMAD7, GREM1, RHPN2, BMP2, BMP4, and LAMA5—are implicated as risk factors for colorectal cancer. Figure provided by Graham Casey

It is important to note that for most of these risk loci, the presumed “affected” genes are implicated on the basis of proximity to the index variant, whereas it is possible that the relevant gene is not in the immediate vicinity. With this caveat in mind, what is interesting is that a number of risk loci harbor genes where the previous link with cancer has been weak. For example, in breast cancer, rs3803662 is just upstream of the gene TNRC9 (TOX3), a high mobility group box family member, which has been implicated in T-cell development in the thymus. The TOX gene family contains a high mobility group box motif, which is found in many eukaryotic proteins that are involved in bending and unwinding of DNA, suggesting a role in chromatin assembly. How a gene involved in T-cell development and chromatin assembly is related to breast cancer carcinogenesis is not entirely known. This is one such example of how GWAS findings may lead to hypotheses to examine new biological links with cancer.

Pleiotropy in Cancer

GWAS have also revealed a shared genetic underpinning for a number of cancers (Fig. 2). Although they still lack validation, these pleiotropic associations [26], defined as a single variant or multiple variants in the same region of the genome being associated with risk of different cancers, indicate shared biological pathways of importance, which is perhaps not entirely surprising given our hypothesis that cancers generally evolve from a common sequence of events (e.g., uncontrolled cell proliferation and altered DNA repair capacity). Examples include genetic variants at chromosome 8q24, which have been associated with prostate, colorectal, breast, bladder, and other cancers [3, 27, 28]. Similarly, genetic variants in and near the telomerase reverse transcriptase (TERT) gene at 5p15, which encodes the catalytic subunit of telomerase, have been associated with glioma, lung, prostate, colorectal, breast, and other cancers [18, 29, 30], emphasizing the importance of cellular aging in cancer development. A third example is the cyclin D1 locus at 11q13, which contains a gene involved in regulating cell cycle progression and which has been associated with susceptibility to prostate and breast cancers [31]. Pleiotropic associations have also been observed between cancer and non-cancer phenotypes; for example, endometrial and prostate cancers and type 2 diabetes at HNF1B on 17q12 [32–34], and prostate cancer and cardiovascular disease at the CDKN2B–CDKN2A gene cluster region at 9p21 [35, 36]. Thus, GWAS have begun to reveal a shared genetic and biological basis between multiple disease phenotypes, which can be used to discover novel relationships between SNPs, phenotypes, and networks of interrelated phenotypes to provide novel mechanistic insights.
Fig. 2

Pleiotropy among different cancers. The risk-associated loci for each cancer are indicated by chromosomal location, and sharing is indicated by colored lines connecting different cancers. CLL chronic lymphocytic leukemia. Reprinted by permission from Macmillan Publishers Ltd: Nature Genetics [78] copyright 2013

Impact on Public Health and Prevention

Unlike somatic cancer genetics, where mutated genes highlight potentially actionable drug targets for a select subset of patients, germline risk variants are of low penetrance and are carried by a large fraction of the population that never develop cancer. Thus, a therapeutic prevention or treatment approach of targeting one of the potentially hundreds of risk loci for cancer is not currently feasible or sensible. Consequently, research on the potential clinical utility of GWAS findings has focused on modeling of genetic risk and incorporation of risk stratification into targeted strategies for screening and prevention.

Currently, managing cancer risk in the general population entails promotion of behaviors that reduce modifiable risk factors and/or routine screening for early detection of lesions with malignant potential. In high-risk populations typically defined by family history, use of selected chemoprevention agents, prophylactic surgery, and/or increased frequency of screening may also be recommended. Given the wide spectrum of lifetime risk in both the average and high-risk populations, it is apparent that additional information is needed beyond family history to tailor recommendations to best balance the risks and benefits associated with different prevention strategies. This raises the question of whether an aggregate set of risk alleles might also improve the effectiveness of currently available prevention strategies through targeting of the appropriate “at risk” population. For the common cancers, this seems encouraging given that the common risk alleles in aggregate can stratify risk 3- to 6-fold when comparing the top 10 % and 1 %, respectively, with the average risk in the population [5•, 6•].
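To see how an aggregate score can produce fold-differences of this order, a minimal sketch follows. It assumes, which the review does not state, that the polygenic log relative risk is normally distributed in the population with standard deviation sigma. Under that assumption, the mean risk in the top fraction α of the score distribution, relative to the population average, is Φ(σ − z₁₋α)/α, where Φ is the standard normal distribution function and z₁₋α its (1 − α) quantile. The value sigma = 0.6 is hypothetical.

```python
from statistics import NormalDist


def top_fraction_relative_risk(sigma, alpha):
    """Mean risk in the top `alpha` fraction of a polygenic score, relative to
    the population average, assuming the log relative risk is normal with
    standard deviation `sigma` (an illustrative modeling assumption)."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha)  # score cutoff in standard-deviation units
    return std_normal.cdf(sigma - z) / alpha


sigma = 0.6  # hypothetical spread of the log relative risk
for alpha in (0.10, 0.01):
    rr = top_fraction_relative_risk(sigma, alpha)
    print(f"top {alpha:.0%} of scores: {rr:.2f}-fold the average risk")
```

As more of the heritable component is captured by known variants, sigma grows and the separation between the tails and the population average widens.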

Tailoring of Screening Strategies

There have been several studies and commentaries on the potential value of genetic risk profiling for screening [37••, 38–41]. In breast cancer, mammographic screening has remained a controversial topic following decades of research, including 10 randomized clinical trials, which showed inconsistent evidence for a beneficial effect on overall mortality [42, 43]. At best, screening appears to have only a small impact in reducing the incidence of advanced cancers [44], with substantive rates of false positives, overdiagnosis, and overtreatment [43, 44]; therefore, tailoring recommendations is a high research priority. Recently, Pashayan et al. [45] evaluated the clinical utility of polygenic risk stratification based on findings from the Collaborative Oncological Gene–environment Study (COGS) in improving the effectiveness and cost effectiveness of screening. Compared with screening women on the basis of age alone (women aged 47–73 years; 10-year absolute risk ≥2.5 %), personalized screening of women aged 35–79 years at the same risk threshold, using a 67-SNP panel, resulted in 24 % fewer women being eligible for screening, while potentially detecting 3 % fewer cases. Such findings are promising. One important remaining issue will be discriminating between subgroups of breast cancer: in particular, estrogen receptor-negative (ER−) disease is associated with a poorer prognosis than estrogen receptor-positive (ER+) disease, and to date there have been only a handful of SNPs specifically associated with ER− disease [46].

In prostate cancer, the value of serum-based prostate-specific antigen (PSA) screening remains controversial. According to randomized clinical trials [47, 48], early detection of prostate cancer by screening may prevent death for a subset of men, but the rates of overdiagnosis and overtreatment are substantial, limiting enthusiasm for PSA screening as a population-based prevention strategy. In a study similar to the above example of breast cancer screening, Pashayan et al. [45] evaluated personalized screening using a genetic risk score based on 72 GWAS-identified prostate cancer genetic variants from COGS, compared with a hypothetical screening program with eligibility based on age alone (men aged 55–79 years; 10-year absolute risk of being diagnosed with prostate cancer ≥2 %). They found that with personalized screening for men aged 45–79 years at the same risk threshold, 19 % fewer men would be eligible for screening, at a cost of 4 % fewer potentially screen-detected cases. Whether targeted screening based on risk profiling may reduce overdiagnosis of indolent disease also needs to be examined. Unfortunately, no risk loci have been identified that can differentiate between advanced and non-advanced disease, and so applying a risk stratification scheme for overall prostate cancer to current screening practices may not reveal individuals who are likely to die from prostate cancer. Further studies are needed to search for variants associated with aggressive prostate cancer and to test the implications of adding polygenic risk profiling to guide decision making in screening for prostate cancer.

For other cancers, more studies are needed to evaluate the impact of risk stratification. This includes colorectal cancer, where colonoscopy with polypectomy has been demonstrated to decrease cancer incidence in both average-risk and high-risk individuals [49, 50]. Currently, the US Preventive Services Task Force recommendations for screening vary only by age and family history, but genomic risk stratification may also be important in further defining individuals in different categories of risk.

Tailoring of Pharmacogenetic Strategies

GWAS hold much promise in advancing our knowledge of the pharmacogenetics of various agents used in cancer treatment. In breast cancer, tamoxifen has become the most widely used drug in managing ER+ breast cancer [51]. Overall risk reductions of 30–80 % for contralateral breast cancer have been reported in women treated with tamoxifen for ER+ breast cancer [52–55]. However, considerable variations in efficacy and toxicity have been observed, and there have been several discussions about weighting these risks and benefits according to age, race, and selected breast cancer risk factors [56]. Recently, genome-wide discoveries have yielded a number of variants in biologically relevant genes (e.g., USP7, TMPRSS3, and SMARCA2) affecting tamoxifen sensitivity [57], which, if validated by other studies, may also contribute to further refinement of potential benefits and risks.

In colorectal cancer, considerable evidence from experimental studies, epidemiologic studies, and clinical trials demonstrates that aspirin and other nonsteroidal anti-inflammatory drugs (NSAIDs) reduce the risk of colorectal neoplasms [58–62]. Routine use of aspirin/NSAIDs for chemoprevention of colorectal cancer is not currently recommended, because of uncertainty about their risk–benefit profile. Although several mechanisms have been shown to mediate the anti-cancer benefit of aspirin/NSAIDs, emerging data suggest that aspirin/NSAIDs may inhibit WNT/beta-catenin signaling, one of the most essential oncogenic pathways in colorectal cancer [63–65]. Recently, the association between regular use of aspirin and colorectal cancer risk has been shown to differ according to the variation at the colorectal cancer susceptibility locus rs6983267 on chromosome 8q24, which has been functionally linked to WNT/beta-catenin signaling activity [66]. In a genome-wide study of 9,000 cases and 9,000 controls, which investigated gene–aspirin interactions, several loci (rs2965667, rs10505806, and rs10283740) in genes involved in WNT signaling were also found to be differentially associated with the beneficial effects of aspirin/NSAID use [67]. Hence, understanding the interplay between genetic markers and aspirin/NSAID use can help to identify subgroups in the population that may preferentially benefit from use of agents for chemoprevention.
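One simple way to ask whether a drug association differs by genotype is to compare odds ratios across genotype strata. The sketch below is an illustration only and is not the method of the cited studies: it computes the aspirin odds ratio within each stratum, takes their ratio, and applies a Wald test on the log scale. All counts are hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio for exposure (e.g., regular aspirin use) from a 2x2 table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


def interaction_test(stratum_a, stratum_b):
    """Compare exposure odds ratios between two genotype strata.
    Each stratum is (exposed cases, exposed controls,
                     unexposed cases, unexposed controls).
    Returns the ratio of odds ratios (a / b) and a two-sided Wald p-value."""
    log_ratio = math.log(odds_ratio(*stratum_a)) - math.log(odds_ratio(*stratum_b))
    # Standard error of a difference of log odds ratios: all eight cells contribute.
    se = math.sqrt(sum(1.0 / count for count in stratum_a + stratum_b))
    z = log_ratio / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return math.exp(log_ratio), p_value


# Hypothetical counts for carriers and non-carriers of a variant.
carriers = (300, 450, 700, 650)
noncarriers = (400, 420, 600, 580)

ratio, p_value = interaction_test(carriers, noncarriers)
print(f"ratio of odds ratios = {ratio:.2f}, p = {p_value:.4f}")
```

In a genome-wide setting the same comparison would be repeated across many variants, usually within a regression model that adjusts for covariates, so the significance threshold has to account for the number of tests.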

Future of Association Studies in the Post-genomic Era

Several questions remain to be answered in order to assess the impact of GWAS in advancing our understanding of etiology and prevention. Many of these questions have to do with the source of the missing heritability that has not yet been explained by GWAS. At least three possible sources have been much discussed in the literature—namely (1) gene by gene interactions; (2) a larger role for rare variants in common diseases; and (3) an extremely polygenic basis of complex disease.

For the first of these scenarios, it has been argued [68] that such interactions imply that traditional family-based estimates of the additive heritability of many traits or diseases (that portion of heritability that would be directly detectable in single-SNP analyses in GWAS) have been consistently overestimated in earlier (pre-GWAS) work, leading to exaggerated expectations of what can be accomplished in GWAS analyses. The second scenario has been discussed in many contexts; it was noted by Pritchard [69], for example, that even quite mild selection against disease-related variants will shift the spectrum of variants that contribute to genetic risk; thus, rare variants contribute much more to total trait or disease heritability than they would under neutral selection. The third scenario is also seriously discussed by many investigators [70–72].

The happiest state of nature from a translational point of view (specifically for risk prediction and screening-based prevention) is one in which the third scenario dominates. While large sample sizes will be required to identify and calibrate the components of a complex polygenic predictor variable, the third scenario implies that steady progress is expected using today’s GWAS platforms (assisted by the imputation of unmeasured variants), so long as sample sizes in conventional GWAS continue to increase. It also implies that today’s GWAS data will remain valuable in translating results to patients in the future, since the variants that are identified in these very large studies will be imputable on the basis of the SNPs that have already been genotyped, i.e., the polygene would be expected to be well predicted using today’s technology. The second scenario (rare variants) may require much more expensive methods (e.g., whole-genome sequencing) to identify the risk variants, and these variants may be very poorly imputed on the basis of today’s GWAS data, so the whole enterprise of detecting variants, estimating effects, developing a risk model, and applying that model to individuals will be much more expensive in the second scenario than in the third. This is further compounded by the observation that rare variants tend to be population specific, so good results in one population would likely not translate to good results in other populations. Even these obstacles are not as great, however, as those expected in the first scenario, where the number of comparisons required to identify complex interactions is vastly greater than currently conceivable sample sizes are likely to support.
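The scale of the multiple-comparison problem for interactions can be made concrete with a short calculation. The sketch below assumes a hypothetical panel of one million SNPs and a simple Bonferroni correction at a family-wise level of 0.05; both choices are illustrative and not taken from the review.

```python
import math


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    return family_alpha / n_tests


n_snps = 1_000_000  # hypothetical number of SNPs tested

single_snp_tests = n_snps
pairwise_tests = math.comb(n_snps, 2)  # every unordered pair of SNPs

print(f"single-SNP tests: {single_snp_tests:,} "
      f"(threshold {bonferroni_threshold(single_snp_tests):.1e})")
print(f"pairwise tests:   {pairwise_tests:,} "
      f"(threshold {bonferroni_threshold(pairwise_tests):.1e})")
```

Moving from single SNPs to all pairs multiplies the number of tests by roughly half a million in this example, and higher-order interactions grow faster still, which is why the required sample sizes escalate so sharply.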

Arguments against either of these first two scenarios playing dominating roles in the missing heritability puzzle for most complex diseases are provided by several lines of evidence. First, since rare SNPs are generally confined to specific racial/ethnic populations, the finding that many associations with common SNPs detectable in one racial/ethnic population are also reproducible in other populations speaks against one of the corollaries of the rare-variant hypothesis [73], which is that many associations that are observed with common variants are “synthetic” associations driven by underlying rare variants. The general consistency of single-SNP associations over different populations also argues against a dominant role for gene×gene interactions, since single-SNP associations that are manifestations of interactions between two (or more) alleles are sensitive to the allele frequencies of both (which tend to vary with the population). Another line of evidence against both of the first two scenarios is provided by the many GWAS-based estimates of complex trait heritability, which have been calculated using the variance components methods of Yang, Visscher, and their colleagues [74–76]. These methods often provide much higher estimates of additive trait heritability due to the common variants measured in GWAS than is evident on the basis of the current set of known hits. While this is encouraging to believers in the GWAS approach, the results of the variance components methods may also be implying that complex trait heritability is dependent upon an extremely large number of common variants, so almost every portion of the genome harbors some variants that affect a given complex trait (of course with very small effect sizes).
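The sensitivity of interaction-driven single-SNP associations to allele frequencies can be illustrated with a toy model. The sketch assumes, purely for illustration, two independent loci A and B and a risk that is multiplied by a factor joint_rr only in people who carry risk alleles at both loci. The marginal relative risk seen for carriers at A is then 1 + f_B (joint_rr − 1), where f_B is the carrier frequency at B, so the single-SNP signal at A changes with the frequency at B. All numbers are hypothetical.

```python
def marginal_rr_locus_a(carrier_freq_b, joint_rr):
    """Marginal relative risk for carriers versus non-carriers at locus A when
    risk is raised by `joint_rr` only in people carrying risk alleles at both
    A and B (pure interaction, loci assumed independent)."""
    return 1.0 + carrier_freq_b * (joint_rr - 1.0)


joint_rr = 3.0  # hypothetical risk multiplier for carrying both variants
for carrier_freq_b in (0.05, 0.20, 0.50):
    rr = marginal_rr_locus_a(carrier_freq_b, joint_rr)
    print(f"carrier frequency at B = {carrier_freq_b:.2f}: marginal RR at A = {rr:.2f}")
```

Under such a model, the same variant at A would show noticeably different single-SNP effects in populations that differ in the frequency of B, which is the pattern that consistent cross-population replication argues against.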

In any event, it seems a certainty that additional progress in unearthing missing heritability will require extremely large sample sizes [77]. This is true whether it is rare variants with large effects or common variants with very small effects, and it is especially true if gene–gene interactions are the principal culprits.


GWAS have made an important leap forward in our understanding of cancer biology by reinforcing the role of known cancer pathways and revealing the previously unknown biological significance of non-coding regions of the genome that now appear to have important effects on nearby genes, as well as on genes at long distance. On the other hand, GWAS have arguably not identified actionable drug targets, although significant functional exploration has yet to be conducted on most of the loci that may in fact open up new pathways. Incorporating polygenic risk stratification into tailoring of prevention strategies may prove to be of great clinical utility, at least for some cancers. Critically important will be discrimination of individuals with greater potential for aggressive subtypes of cancers, which may merit enriched screening and other approaches. Overall, the advantages of personalized screening include improving the efficiency of screening programs, detecting cancer in younger individuals, who tend to have more aggressive forms of the disease, and reducing harms from false positive findings through screening of fewer individuals. With polygenic profiling and risk stratification, a subgroup of the population at low risk of cancer may receive screening at lower frequency. Such tailoring of screening strategies to different risk groups may in turn improve the balance between the benefits and harms of screening. Clearly, implementation of a risk-stratified screening program in a population is much more complex than implementation of a program with eligibility based on age or family history alone. The issue of how to develop a dynamic risk score that incorporates technological advances of the rapidly evolving field of genomics and changes over time in an individual’s non-genetic risk factors will also require further research. Lastly, the cost effectiveness of genome-based risk profiling will also need to be explored in addition to the optimal timing and frequency of use, ethical considerations, and public approval.

Compliance with Ethics Guidelines

Conflict of Interest

J.C. Figueiredo, D.O. Stram, and C.A. Haiman all declare no conflicts of interest.

Human and Animal Rights and Informed Consent

This article does not contain any studies with human or animal subjects performed by any of the authors.

Copyright information

© Springer International Publishing AG 2014

Authors and Affiliations

  • Jane C. Figueiredo (1)
  • Daniel O. Stram (1)
  • Christopher A. Haiman (1)

  1. Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, USA