Background

Prostate cancer (PCA) is the most frequently diagnosed cancer and the 2nd leading cause of cancer-related deaths among men in the United States [1]. The American Cancer Society estimates that 26-29% of all new cancer cases and cancer-related deaths are attributed to PCA cancer. Well established PCA risk factors include older age, black race, and family history of PCA. However, other potential contributors of this disease may include lifestyle and genetic factors as well as imbalances within important biological pathways.

Apoptosis or programmed cell death is one such biological process that moderates cell differentiation, proliferation, death, whole body homeostasis and tumorigenesis [24]. This process is controlled by cell death (e.g. BAD, CASP, BIK) and cell survival proteins (e.g. BCL2, NFκB, AKT3) that induce or block apoptosis, respectively, as summarized in Table 1 [2, 3, 5]. Decreased apoptotic cell death and increased cell proliferation may lead to clonal expansion and tumor growth [2]. Failure to undergo apoptosis permits survival of transformed cells, leading to subsequent genetic alteration, genomic instability, and ultimately a more invasive cancer phenotype [3]. Imbalances in apoptosis-related genes may play an important role in PCA susceptibility as well as disease progression. For example, several independent studies have shown that overexpression of cell survival indicators (e.g., BCL-2, CARD8, IKBKE, PRKCQ, and PIK3CB, AKT3) or down-regulation of cell death markers (e.g., BCL2L14) are associated with more aggressive phenotypes, higher Gleason grade, increased tumor progression, and poor PCA prognosis [619].

Table 1 Selected genes involved in the regulation of apoptosis

There is mounting epidemiological evidence that genetic alterations in apoptosis-related genes play an important role in tumorigenesis. Apoptosis-associated sequence variants, when considered individually, may minimally influence the risk of developing numerous cancers, such as multiple myeloma, squamous cell carcinoma, chronic lymphocytic leukemia, non-Hodgkin lymphoma, colorectal, ovarian, breast, pancreatic, and non-small cell lung [2037]. However, the impact of individual apoptosis-related single nucleotide polymorphisms (SNPs) and their interactions on PCA outcomes remains understudied.

Genome wide association studies (GWAS), involving the evaluation of millions of SNPs within various biological pathways, has resulted in the detection of numerous PCA susceptibility loci [38]. However, most GWAS and PCA epidemiology studies place emphasis on individual SNP effects. Consequently, researchers often ignore the fact that complex diseases such as PCA are governed by complex gene-gene and gene-environment interactions within distinct biological pathways. Consequently, we sought to evaluate millions of interactions among apoptosis SNPs and their joint modifying impact on PCA risk and disease progression.

This report focuses on the impact of complex interactions among 172 apoptosis-related SNPs on PCA among 2,286 European-American male participants of the Cancer Genetic Markers of Susceptibility Study (CGEMS). The exhaustive assessment of all possible two-, three-, and four-way interactions among 172 SNPs, totaling more than 36 million SNP combinations is computationally burdensome. The statistical challenge stems from the prohibitive amount of data needed for multiple hypothesis testing. To address this issue, for the first time, we use statistical epistasis networks (SEN) guided multi-factor dimensionality reduction (MDR) to efficiently pre-process our genetic data, prior to modeling higher order interactions. As previously reported by Hu and co-workers (2011), SEN uses network science to generate a large connected cluster of pairwise interactions embedded within a genetic dataset [39]. The resulting network, consisting of a subset of high susceptibility SNPs, is used to guide SNP interactions using MDR. MDR is a rigorous statistical tool designed, in part, to evaluate main effects and complex interactions in relation to discrete outcomes [4048]. This report may serve as a foundation for researchers interested in a state-of the art bioinformatics technique, namely SEN-guided MDR, which facilitates the logical prioritization of SNPs for gene-gene interaction analyses in relation to PCA susceptibility and prognosis.

Results

Cancer Genetic Markers of Susceptibility (CGEMS) Population Description

CGEMS study participants consisted of middle aged non-Hispanic Caucasian men, ranging in age from 55 to 81. Relative to control subjects, men diagnosed with PCA were more likely to have a family history of the disease (11.4% versus 6.3%), PSA levels ≥ 4 ng/ml (48.5% versus 6.5%), and abnormal DRE exams (60.7% versus 46.2%) [data not shown]. Men with aggressive PCA had both a Gleason score ≥ 7 and tumor stage III/IV. Although non-aggressive PCA cases primarily had a Gleason score < 7 and tumor stage I/II, a small percentage (1.8%) of them had a Gleason score > 6 and a tumor stage I/II. Among cases, there were no significant differences in family history of disease or PSA levels between men diagnosed comparing men with aggressive and non-aggressive disease (data not shown). The vast majority of subjects with aggressive PCA had higher Gleason scores than subjects with non-aggressive PCA, although 9 of the 488 non-aggressive cases (1.08%) had a Gleason score greater than 6.

Single SNP Effects

We examined 172 apoptosis-related sequence variants in relation to PCA risk and disease aggressiveness, as shown in Tables 2 and 3. Age and family history of PCA did not modify the relationship between the sequence variants and the risk of developing PCA or aggressive disease. Inheritance of one or more TNFRSF10B rs1001793 or TNFSF10 rs4894559 A alleles was nominally associated with a 1.23-1.38 fold increase in PCA risk (p ≤ 0.039), as shown in Table 2. Whereas, eight apoptosis-related SNPs were associated with a moderate 18-50% reduction in disease susceptibility [BIK rs4988366 AG & AG + GG; BNIP3L rs10503786 TT & TC + TT; CASP9 rs1052571 TT; IKBKE rs1539243 TC & TC + TT; PRKCE rs608139 CC; PRKCE rs935673 GG & AG + GG; TNFRSF1A rs4149576 AG, GG, & AG + GG; and TNFRSF1A rs4149577 TC & TC + CC] (p ≤ 0.043). However, after applying Bonferonni correction, none of these findings remained statistically significant under the per-allele, dominant and recessive genetic models.

Table 2 Association of apoptosis SNPs with prostate cancer
Table 3 Association of apoptosis SNPs with aggressive prostate cancer

In terms of disease progression, eleven apoptosis SNPs were modestly associated with a 1.28-1.81 fold increase in the risk of developing aggressive PCA [AKT3 rs10157763 TC & TC + TT; AKT3 rs10927067 AG & AG + AA; AKT3 rs12031994 TC & TC + CC;AKT3 rs2125230 AG & AG + AA; AKT3 rs2125231 AG + AA; AKT3 rs2345994 TC & TC + CC; AKT3 rs4132509 AC & AC + AA; BCL2L14 rs885720 AG & AG + AA; CARD8 rs10405717 TC & TC + TT; IKBKE rs11578093 GG & TG + GG; and RAF1 rs13060691 TG] (p ≤ 0.031), as summarized in Table 3. In addition, possession of IKBKE rs1539243 TT and PIK3CB rs500687 CC genotypes were moderately linked with a 44-56% reduction in aggressive PCA susceptibility (p ≤ 0.019). However, none of these loci remained significant after adjusting for multiple comparison bias.

Combined Gene Effects

We investigated the ability of gene-gene interactions to predict PCA outcomes using statistical epistasis networks (SEN) guided multifactor dimensionality reduction (MDR), information gain (IG) measure, and hierarchical interactions graphs (hIG). SEN was used to generate a network to visualize the genetic architecture of PCA outcomes, prior to an exhaustive search of SNP interactions. This approach uses information theory to build an epistasis network from the strongest pairwise SNP interactions. A network, consisting of vertices (i.e., SNPs) joined by edges (i.e., SNP pairs), enabled us to limit our MDR analysis to a subset of the most informative SNPs in relation to PCA outcomes. We were unable to build a statistically significant epistasis network for PCA susceptibility. However, the network for aggressive PCA was topographically significant (p < 0.05), given the results from 1000 permutated data sets. This network consisted of 91 vertices (i.e., individual SNP effects), 80 edges (i.e., pairwise interactions), and 18 total components (i.e., "sub-networks" of vertices and edges). The largest connected component of the aggressive PCA SEN diagram involved 24 vertices and 34 edges, as shown in Figure 1 andAdditional file 1. Consequently, MDR analysis was limited to < 13,000 interactions among 24 SNPs, instead of more than 36 million interactions for 172 SNPs. SEN-guided MDR revealed complex interactions between AKT3 rs12031994-PRKCQ rs571715 as well as AKT3 rs12031994-BID rs366542-PRKCQ rs571715 that were significantly associated with disease aggressiveness (p ≤ 0.009). However, the network and entropy graphs indicated these findings were mainly driven by AKT3 rs12031994. This is evident by the marker's large vertex size (i.e., large main effect) in the SEN graph (Figure 1) and a lack of a strong pairwise interaction between AKT3 rs12031994-PRKCQ rs571715 in the entropy graph. The AKT3-PRKCQ interaction resulted in joint mutual information (I) of only 0.54% relative to the mutual information (I) of 0.87% for AKT3 alone. Therefore, prior to repeating MDR, we removed the AKT3rs12031994 marker. This resulted in the detection of a modest synergistic interaction between AKT3 rs2125230-PRKCQ rs571715 (p = 0.011), as shown in Table 4. Based on MDR, this model was the best two-factor model. It was selected 9 out of 10 times by MDR based on a cross-validation consistency (CVC) score of 90% and predictive accuracy of 56.3%. Furthermore, we observed a higher mutual information for the combined effect of AKT3 rs2125230-PRKCQ rs571715 (I = 0.66%), in contrast to AKT3 rs2125230 alone (I = 0.45%) or PRKCQrs571715 alone (I = 0.21%), as revealed in Figure 2.

Figure 1
figure 1

Visualization of the interaction among apoptosis-related sequence variants using Statistical Epistasis Network Modeling (SEN)-guided MDR. The Statistical Epistasis Network (SEN) was generated using a pairwise interaction strength cut-off or 1%. Overall, the model has 91 vertices (single SNPs), 80 edges (pairwise SNPs), and 18 components ("sub-networks" of vertices and edges). The largest connected component, shown in the center and spanning the entire length of this model, has 24 vertices and 34 edges. Each number in Figure 1 corresponds with a specific apoptosis-related SNP analyzed in the current study, as summarized in Additional file 2.

Table 4 Multi-dimensionality reduction models for apoptosis-related markers in relation to aggressive prostate cancer (PCA)
Figure 2
figure 2

Interaction Entropy model. This graphical model, describes the percent entropy that is explained by each apoptosis-related SNP or pairwise combination within our study population. Positive percent entropy indicates information gain (IG) or synergy; whereas, negative percent indicates redundancy or lack of information gain (IG). Schematic coloration used in the visualization tools represents a continuum from synergy (i.e. non-additive) to redundancy. The colors range from red representing a high degree of synergy (positive information gain (IG)), orange a lesser degree, and gold representing independence and a midway point between synergy and redundancy. On the other hand, green and blue represent redundancy, which is not apparent in the current analysis.

To confirm the validity of the SEN results, MDR analysis was also conducted on all 172 SNPs (Figure 1 andAdditional file 1) as well as 148 loci that were not included in the main component of the network. As expected, we did not detect any significant two-, three-, or four- way interaction models (p ≥ 0.184).

Discussion

Previous genome-wide association studies (GWAS) have identified linkages about 40 PCA loci, including between PCA genetic alterations detected in the 8q24 region, β-microseminoprotein (MSMB), and allele -8 of the microsatellite DG8S737 [38]. These studies have limited their scope to individual SNPs across the entire genome. Such assessments tend to ignore the genetic architecture of PCA that potentially involves complex interactions along key regulatory pathways. For the first time, the current study evaluates complex interactions among 172 apoptosis-related SNPs in relation to PCA risk and disease aggressiveness among 2,286 European-American men using SEN-guided MDR. Specifically, SEN was used to build a topographically significant aggressive PCA epistasis network, prior to evaluating complex interactions. This inferred epistasis network consisted of 24 SNPs and 34 pairwise interactions, and reduced MDR analysis from > 36 million to < 13,000 SNP interactions. Consequently, we observed a non-linear and modest interaction between AKT3 rs2125230-PRKCQ rs571715 in relation to aggressive PCA. This state-of-the-art bioinformatics technique facilitates the logical prioritization of SNPs for gene-gene interaction analyses in relation to complex diseases.

Unfortunately, there are no published reports on the functional consequence of these two intronic SNPs in AKT3 and PRKCQ in relation to mRNA stability/expression protein expression/structure/function or PCA outcomes. However, we speculate that the AKT3 rs2125230 and PRKCQ rs571715 sequence variants, with minor allele frequencies ranging from 14.4-22.9 among men of European descent, may alter transcription regulation, leading to increased mRNA expression. Increased mRNA/protein expression AKT3 rs2125230 and PRKCQ rs571715 may cause: decreased apoptosis, an escape of transformed cells from programmed cell death, increased accumulation of genetic alterations, genomic instability, and ultimately an invasive PCA phenotype. Thus, in vitro and in vivo assays using (short hairpin RNAs) shRNAs or small interfering RNAs (siRNAs) are needed to elucidate the impact of AKT3-PRKCQ genetic alterations on protein expression, apoptosis capacity, and prostate tumorigenesis.

The impact of a non-linear interaction along the AKT3 rs2125230-PRKCQ rs571715 axis in relation to aggressive PCA may be attributed to markers involved in the apoptosis signaling pathway. Overexpression of PRKCQ and AKT3 are associated with invasive cancer phenotypes [619]. In fact, AKT3 is responds to insulin and growth factors, transduces signals including cell death, and is upregulated in androgen-independent PCA cell lines [49]. PRKCQ is also associated with apoptosis. In particular PRKCQ, a protein kinase C (PRKC) family member, promotes cell survival by inactivating BAD (BCL2-associated agonist of cell death), which subsequently results in NFκB activation.

Although AKT3 and PRKCQ are involved in pro-survival pathways, their interaction is not fully understood. However, their interactions with other related protein kinases may offer biological clues on the mechanism by which AKT3 and PRKCQ synergistically influence aggressive PCA. For example, PRKCQ interacts with another AKT family member, AKT1, to activate NFκB [50]. If the AKT3-PRKCQ axis has a similar function as other protein kinases, namely AKT1 and PRKCQ, then these pro-survival markers may synergistically activate NFκB. As a result, activated NFκB may enable the tumor to escape programmed cell death and progress toward an aggressive PCA phenotype.

Numerous observational studies evaluated the impact of one or more apoptosis-related SNPs in relation to cancer outcomes [2037]. Among 15 case-control studies, less than 10% of the sequence variants evaluated in the current study were significantly associated with various tumors, including of the colon, rectal, ovarian, breast, pancreatic, and non-small cell lung cancers [2037]. In particular, modest cancer risk estimates were observed among 8 apoptosis-related SNPs detected in CASP3, CASP8, CASP9, TP53, NFKB2, and NFKBIA. However, some of these studies were limited by a small sample size, small number of analyzed SNPs, or failure to consider the impact of multiple SNPs on disease susceptibility. In the current study, there were 24 SNPs detected in 10 apoptosis-related genes [i.e., AKT3, BIK, BNIP3L, CARD8, CASP9, IKBKE, PRKCE, TNFSF10, and TNFRSF10 (B, D)]. These apoptosis-related SNPs, after adjusting for confounders, were modestly associated with PCA risk and/or aggressive disease. Yet, these findings lost statistical significance after adjusting for multiple hypothesis testing. Main effects were not observed for AKT3 rs2125230 and PRKCQ rs571715 in relation to PCA risk or disease progression in our study set. Wang and co-workers (2009) evaluated interactions among 5 apoptosis-related SNPs, including death receptor 4 (DR4), and pack-years of smoking in relation to bladder cancer using entropy-based MDR [51]. MDR analysis revealed a significant additive interaction between DR4 -397 G > T and smoking on bladder cancer. Unfortunately, this study analyzed a relatively small number of sequence variants in the DR4 apoptosis-related gene; hence, making it difficult to compare study findings. In a post-hoc analysis, we did not observe any interactions among the selected apoptosis-related markers and sources of reactive oxygen species, antioxidants, and anti-inflammatory agents, including cigarette smoking, dietary supplements, aspirin, ibuprofen, and meat-derived carcinogens (data not shown).

We considered the strengths, limitations and future directions of the current study. Although SEN-guided MDR only identified a nominally significant network for disease aggressiveness in the current study, this approach overcomes the computational challenge of detecting all possible two-, three- and four-way SNP combinations involved in PCA progression. SEN, in the current study, was used to prioritize > 10 million possible interactions by focusing on a "sub-network" of informative SNPs in relation to aggressive PCA. Reduction of our genetic data set to the most informative markers improved the feasibility to detect interactions that may have otherwise remained undetected. Unfortunately, the genome-wide association studies (GWAS) database used for the current study did not include apoptosis-related sequence variants (e.g., TNF-308 rs1800629, TNFSF10 rs1131532, BCL2 -936 rs2279115) previously reported in published cancer epidemiology studies [5254]. Future studies in our laboratory will focus on high-throughput targeted sequencing to evaluate the impact of novel and commonly reported sequence variants on PCA susceptibility and disease prognosis. In light of recent GWAS reports, it is tempting to assume that extremely large case-control study sets, involving tens of thousands subjects are required to evaluate millions of SNP interactions in relation to PCA outcomes. However, the current study had adequate statistical power to evaluate individual and SNP combination effects in relation to prostate cancer. In particular, MDR has 80% statistical power to evaluate all possible two-, three-, and four-way gene-gene interactions with as low as 200 cases and 200 controls [55]. MDR remains effective even in the presence of 5% genotyping errors and/or 5% missing data [55]. It is anticipated that 5% of the about 13,000 possible interactions among 24 apoptosis-related SNPs will result in approximately 650 significant relationships due to chance alone. However, MDR coupled with permutation testing adjust for multiple comparison bias. Given the low prediction accuracy affiliated with the interaction between AKT3 and PRKCQ, our study findings require replication within independent study sets. However, recent simulation studies demonstrate that even modest disparities in genotype allele frequencies among study participants of independent study sets may interfere with the capacity to replicate complex interactions [56]. Thus, to ensure reproducibility within future studies, we plan to select study sets with similar genetic architecture (i.e., ancestry identification markers and SNP genotypes) as CGEMS PCA case-control subjects.

Conclusion

In summary, we have identified a marginal interaction between two apoptosis-related SNPs linked to PCA disease aggressiveness using SEN-guided MDR. Future molecular and pre-clinical studies may help to clarify the functionality of these genetic variants and their role in PCA disease progression. Our state-of-the art bioinformatics approach will enable researchers to pre-process millions of SNP interactions using GWAS or cancer consortia genotype data. In order to replicate findings within independent study sets, researchers will have to ensure that comparative sub-populations have the same genetic architecture and ancestry backgrounds. Therefore, the application of SEN-guided MDR may ultimately help researchers identify and validate gene-gene and/or gene-environment interactions to serve as effective cancer prognostication tools.

Methods

Study population

The Prostate, Lung, Colon, and Ovarian (PLCO) Cancer Screening Trial is a randomized, well-designed, and multi-center investigation sponsored and coordinated by the National Cancer Institute (NCI) [57, 58]. Between 1993 and 2001, the PLCO Trial recruited study participants aged 55 to 81 years old to evaluate the effect of screening on disease specific mortality, relative to standard care. Men randomized to the PCA arm of the trial, received annual PSA and DRE exams. For the current study, 2,286 European-American men were included in our study if they had GWAS data available through the Cancer Genetic Markers of Susceptibility (CGEMS) data portal. (http://cgems.cancer.gov/) [38]. The CGEMS study population consists of nationally available genetic data for nearly 500,000 sequence variants. Incident PCA cases (488 non-aggressive and 687 aggressive) were identified through various sources including: screening exams; reports from patients, physicians, or relatives; or linkage with the National Death Index or state cancer registries. Incident PCA cases were pathologically confirmed with aggressive (Gleason score ≥ 7 and tumor stage III/IV) or non-aggressive [Gleason score (< 7) and tumor stage I/II] disease. Controls (n = 1,111) were matched to cases identified based on age, time since initial screening, and year of blood draw using incidence density sampling. All participants signed informed consent documents approved by both the NCI and local institutional review boards. Access to clinical and background data collected through examinations and questionnaires was approved for use by the PLCO.

Gene selection

A panel of 73 apoptosis-associated genes was selected from published cancer epidemiology or molecular biology studies as well as pathway databases and tools, including Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg), Gene Ontology (GO), BioCarta (http://www.biocarta.com), ProteinLounge (http://www.proteinlounge.com), and Ingenuity (http://www.ingenuity.com) [5963]. We searched PubMed for articles using the following keywords: [(prostate OR prostatic) AND (cancer OR neoplasms) AND apoptosis (gene variants OR single nucleotide polymorphisms OR targets) AND epidemiology]. Pathway tools, such as KEGG, BioCarta, ProteinLounge, and Ingenuity were used to visualize gene-gene and protein-protein interactions essential to regulating apoptosis [5961, 63]. The Entrez Gene database of the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) was used to ensure that selected targets were involved in the apoptotic pathway [64].

Criteria for SNP selection and data management

Prior to uploading our initial list (i.e., 73 apoptosis genes) into the CGEMS data portal, we secured the HUGO gene name equivalents for the targets of interest using NCBI Entrez Gene. SNP profile data was available for more than 1197 SNPs [38]. However, upon further investigation, many of these SNPs were eliminated if they were: not located within 2.5 kb of the 5' or 3' end of the gene; or detected within a gene that was not related to apoptosis. Among SNPs that were associated with apoptosis, we focused on gene regions located in coding, promoter, or "near gene" regions. "Near gene" regions were defined as 2.5 kb up- and downstream of the 5' or 3' ends of the selected genes, respectively.

We excluded any sequence variants that had a minor allele frequency (MAF) < 5% as reported in the NCBI Entrez SNP [64]. In addition, one SNP was removed since its genotype frequency distribution among controls deviated substantially from the Hardy-Weinberg Equilibrium, p ≤ 0.005 (n = 1). Following data cleanup, 172 apoptotic SNPs were analyzed among 2,286 men of European descent (687 aggressive cases, 488 non-aggressive cases, and 1,111 controls).

Predicted function of selected SNPs

The SNPinfo (http://snpinfo.niehs.nih.gov/) webserver enabled us to annotate and/or predict the functional consequence of selected apoptosis variants, as summarized in Additional file 2 [65]. This server consists of several pipelines to predict whether alternative alleles of a SNP may alter one or more of the following: transcriptional regulation via transcription factor binding site (TFBS) activity; premature termination of amino-acid sequence (stop codons); the splicing pattern or efficiency at mRNA splice sites, exonic splicing enhancers (ESE) or silencers (ESS); protein structures or properties by changing single amino acids (i.e., non-synonymous SNPs); or mRNA transcription or protein translation by altering microRNA (miRNA) binding sites.

Statistical analysis for single gene effects

Univariate and multivariate analyses were used to evaluate apoptosis associated SNPs among men of European descent in relation to PCA outcomes. To assess whether possession of one or more apoptotic alleles influence the risk of developing PCA, we tested for significant differences in the distribution of homozygous major, heterozygous, or homozygous minor genotypes between cases and controls using the chi-square test of heterogeneity. A case-only analysis was used to examine the relationship between apoptosis-related alleles and aggressive PCA. We evaluated differences in the distribution and inheritance of apoptosis-related genes comparing men with aggressive disease (Gleason score ≥ 7 and tumor stage > 2) to those with non-aggressive disease (i.e., other PCA cases that did not have both Gleason score ≥ 7 and tumor stage > 2). The associations between PCA outcomes and selected polymorphic genes, expressed as odds ratios (ORs) and corresponding 95% confidence intervals (CIs), were estimated using unconditional multivariate logistic regression (LR) models adjusted for age and family history of PCA. LR analyses for genetic variants and PCA outcomes were conducted using the major or common genotype as the referent category. All chi-square test and LR analyses were conducted using SAS 9.2 (SAS Institute Inc., Cary, NC) and SVS software (Golden Helix, Inc., Bozeman, MT). Statistical significance was assessed using a p-value < 0.05. Adjustments for multiple comparisons were made using Bonferroni correction and permutation testing.

Statistical power

We performed calculations to determine the statistical power of our sample to detect significant relationships between apoptosis-related polymorphisms and PCA development. The expected risk estimates of our study were determined by specifying values for a number of parameters, including: an average minor allele frequency of at least 26.0%, NCI's estimate of PCA disease prevalence (19%); statistical power (80%); sample size (1,175 PCA cases and 1,111 disease-free individuals or 687 aggressive PCA cases and 488 non-aggressive cases). We assumed the outcome was in complete linkage disequilibrium with an apoptosis-predisposing variant (r2 = 1.0). Based on the aforementioned parameters, we have > 80% power to detect genetic markers with odds ratios (ORs) of ≥ 1.4 (or 0.71 for protective effects) for PCA risk and ≥ 1.6 (or 0.62 for protective effects) for disease aggressiveness, assuming a codominant model with 1 degree of freedom (df). Power calculations were performed using Power for Genetic Association Version 2 Software [66].

Analysis of gene interactions using multi-factor dimensionality reduction (MDR)

To evaluate the single- and joint- modifying effects of 172 candidate apoptosis-related SNPs within a large dataset, such as CGEMS, is computationally challenging [67, 68]. In order to overcome this problem, open-source and freely available MDR 2.0 software (htpp://http://www.epistasis.org) was used to analyze interactions among apoptotic sequence variants in relation to PCA outcomes [69]. MDR is a data reduction approach, designed to detect and characterize high-order interactions in case-control studies. With MDR, information from various genetic loci is categorized into high-risk or low-risk groups of disease. The resulting one-dimensional multi-locus genotype variable is then evaluated for its ability to classify and predict a disease outcome through cross-validation and permutation testing. MDR uses a 10-fold cross validation to estimate the testing accuracy of a model by leaving out 1/10th of the data as an independent test set. The model is developed on 9/10th of the data and then evaluated on the remaining test set. This process is repeated for each 1/10th of the data and the resulting prediction accuracies are averaged.

In the current study, the model with the greatest cross validation consistency (i.e., CVC ≥ 8/10) and highest prediction accuracy [i.e., Average Testing Accuracy (ATA)] was selected as the best predictor of disease outcome. Accuracy is a function of the percentage of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) as defined as (TP + TN)/(TP + TN + FP + FN). ATAs are averaged across all 10 pieces of the data, in order to provide an estimate of the predictive ability of the loci in relation to the outcome of interest. We used cross-validation consistency (CVC) to measure the degree to which the same best MDR model was selected across the 10 divisions of the data. Models with a CVC of ≥ 8/10 using a 10-fold cross-validation were considered more carefully. If the MDR model met the CVC criteria, we selected models that had the highest ATAs. The combination of CVC and permutation testing were used to control for multiple hypothesis testing. Permutation testing results were statistically significant at the ≤ 0.05 level.

The current version of MDR used in this study enabled the incorporation and adjustment of multiple covariates [70]. To remove the covariate effects [i.e., age-group and family history of PCA (yes or no)], we integrated two sampling methods (i.e., over- and under-sampling). This approach is computationally efficient, since it allows for the adjustment of multiple covariates without significantly increasing computational burden.

Visualization of interaction models using interaction entropy algorithms, hierarchical graphs and statistical epistasis network (SEN)

The interaction entropy algorithm, based on information theory, is a method to verify, visualize, and interpret combination effects identified by parametric (e.g., LR) and non-parametric (e.g., MDR) statistical test[43, 7174]. Jakulin and Brakto (2003) have developed a metric to gauge whether the gain in information (i.e., information gain) about a class variable (i.e., ability to predict disease status) from a combination of two variables provides more information than each variable considered independently [72, 73]. In the current study, measures of interaction were used to build interaction entropy graphs to visualize and interpret interactions among selected apoptosis-related markers in relation to PCA risk and disease progression. Individual and all possible pairwise loci were assigned a mutual information (I) percentage score in relation to disease risk or aggressiveness; whereby typical scores for genetic loci are < 5%. Pairwise SNP combinations were deemed important if the joint mutual information [I = (SNP1, SNP2; disease status)] was greater than the total mutual information of each individual locus considered separately [I (SNP1; disease status) + I (SNP2; disease status)]. Interactions were further visualized using an interaction entropy graph, which uses a color-coding scheme to interpret interactions. Strong interacting factors, coded either red or orange, indicated high and medium levels of synergistic effects on outcomes, respectively. Weak interacting factors, coded either blue or green, denoted high or modest levels of redundancy between markers, respectively. Gold depicted independence and a midway point between synergy and redundancy. Design of interaction entropy graphs was accomplished using the Orange software [75].

An important limitation of any gene-gene interaction analysis method is the combinatorial nature of the problem. Exhaustive analysis of all possible two-, three-, four-way combinations of SNPs creates a computational burden and significant multiple testing problems. We addressed this issue by pre-processing the data using statistical epistasis network (SEN) modeling, which has been described elsewhere [39]. Briefly, this tool uses information theory, as previously described, to develop a network based on the amount of mutual information from SNP1, SNP2, and a class variable (i.e., disease status). Thus, it describes the pairwise interactive effect on a discrete outcome. An epistasis network, or graph, consists of a set of vertices (individual SNPs) and edges (pairwise interactions) that attach the vertices. In the current study, the statistical epistasis networks (SEN) were built by incrementally adding edges if their strengths were greater than a given threshold. Each vertex and each edge connecting two vertices were assigned a weight commensurate with the strength of main and interactive effects, respectively. The weight of a vertex is depicted by its size; whereby, a larger size is indicative of a higher main effect. Similarly, the thickness of the line between SNP pairs is directly proportional to the strength of their interaction.

We used SEN to visually summarize the strongest individual SNP effects and SNP-pairwise interactions. The topology of the graph, consisting of SNPs and their pairwise interactions, was used to guide MDR modeling. This resulted in a reduction in the total number of SNP combinations. For instance, we reduced the gene-gene interaction analysis from 172 to 24 SNPs that resided within the major connected component with the largest number of vertices (i.e., main effects). The derived major component plot was validated by comparing: (1) the number of vertices and edges among the various connected "sub-networks"; and (2) the size of the major component plot of real data versus randomized data that underwent 1000 permutations. The latter compared data using a pairwise interaction strength cut-off of 1% and significance level of 0.05. To ensure that no significant interactions were identified, non-informative SNPs outside the major component plot were also analyzed by MDR. In essence, SEN-guided MDR facilitated logical prioritization of SNPs for gene-gene interaction analysis in relation to PCA outcomes.