Background

Colon cancer is the third most common cancer worldwide and the second leading cause of cancer deaths in Europe and the United States [1,2,3]. Both genetic and epigenetic alterations have been implicated in the development of colon cancer [4]. microRNAs (miRNAs) are small (18–24 nucleotides) noncoding RNAs that act as epigenetic regulators of gene expression. miRNAs regulate genes post-transcriptionally and have been implicated in cancer development, progression, and both response and resistance to chemotherapy [5]. Alterations of miRNA expression have been detected in a broad spectrum of hematological malignancies and solid tumors, including colon cancer [6,7,8,9,10]. Previous studies have established that miRNAs are differentially expressed in tumor and normal tissue [5], and that altered miRNA expression is involved in colon cancer development [6, 11, 12, 13, 14, 37]. For example, miR-144 is significantly associated with colon tumor stages [15]. Therefore, expression changes of miRNAs may regulate important genes in tumor pathogenesis and can be useful for classifying tumors and predicting their outcomes.

However, most studies focus on identifying miRNAs that are differentially expressed between tumor and non-tumor samples or between stages in tumor tissue. Limited work has been conducted on the interactions, or epistasis, between miRNAs and on how this epistasis affects tumor progression. Here, we define epistasis as the situation in which the phenotypic variance is explained by interactions or combinations of (epi-)genetic variants rather than by individual (epi-)genetic variants alone. Epistasis between miRNAs has been reported as an important component in cancer research and drug resistance research. For example, a previous study reported epistasis between miR-155 and miR-146a related to tumor growth [16]. Specifically, that study found that miR-155 deficiency is epistatic to a loss of miR-146a during antitumor immune responses and thus results in varied tumor growth [16].

However, genomic data are usually high dimensional, making it difficult to analyze epistatic interactions using general parameter estimation methods, such as variations of LASSO [17,18,19,20,21,22,23,24] or the empirical Bayesian method [25]. Many other methods have been developed to analyze epistasis on quantitative phenotypes, including a statistical selection method [26] and a combinatorial partitioning method (CPM) for multi-locus epistasis [27]. A multifactor-dimensionality reduction method (MDR) [28] was developed based on CPM, and a GEM model was developed to detect functional epistasis and infer the hierarchical relationships of genes [29]. Neither CPM nor MDR scales up well, so it is impractical to use them on large datasets [27, 28].

Additionally, methods have been proposed to identify epistasis on dichotomous phenotypes, as in case-control studies. These methods include an Epistasis Detector based on the Clustering of relatively Frequent items (EDCF) [30], a Bayesian inference method called Detecting genome-wide Association on Multiple diseases (DAM) [31], and a Multi-SNP Combination Set Detector (MSCD) based on a combinatorial optimization model [32].

Recently, an Empirical Bayesian Elastic Net (EBEN) method was proposed to study epistasis [33]. EBEN estimates unknown parameters efficiently in an over-saturated statistical model, as arises when mining high dimensional genomic data. Therefore, in this study, we use the EBEN method because of three advantages it offers for epistasis analysis: 1) EBEN is scalable on high dimensional data, 2) EBEN can perform a statistical test on the selected features, and 3) EBEN shows a lower false discovery rate (FDR) than LASSO [33].

In order to efficiently identify epistasis that might be masked by strong main effects, we develop a multi-step workflow to find both main and epistatic effects in a unified model. First, we identify the main effects of miRNAs on pathological stages of colon cancer, ignoring epistasis. Second, we generate a corrected phenotype by removing the main effects. Third, we fit an epistasis model on the corrected phenotype to identify epistatic effects alone. Finally, we run a full model that includes both main and epistatic effects, restricted to the significant features previously selected as main and epistatic effects. We apply EBEN as the parameter estimation method in all modeling steps.

Using the multi-step workflow on data from The Cancer Genome Atlas (TCGA) [34], we identify a set of miRNAs with main and epistatic effects on the pathological stages of colon cancer. Many of the miRNAs with main effect that we detect have been reported to be associated with colon cancer in previous experimental studies, and the majority of epistatic miRNAs share common target genes and thus could exert an epistatic effect on the resulting pathological stages. We also find that some of the target genes of the detected miRNAs are associated with colon cancer. Gene Ontology Enrichment Analysis of the experimentally validated targets of the main and epistatic miRNAs shows that these target genes are enriched for biological processes associated with cancer progression.

Methods

In order to efficiently identify the main and epistatic effects of miRNAs on pathological stages, we develop a multi-step workflow based on the Empirical Bayesian Elastic Net method. We use the miRNA profiles and pathological stages of colon cancer as an example to demonstrate our analysis workflow. The data used in this study, the R script pipeline for analyzing the dataset, and the EBEN package can be accessed from GitHub (https://github.com/shilab/EBEN-epistasis) and CRAN (https://cran.r-project.org/web/packages/EBEN/index.html).

Data collection and preprocessing

TCGA [34] provides a dataset that is well suited to evaluating the proposed method, since it offers comprehensive measurements at different layers on the same individuals for a given cancer type, enabling integrative analysis. The miRNA expression profiles from miRNA sequencing (miRNASeq) and the clinical data for colon cancer were downloaded from the TCGA data portal. Specifically, we extract miRNA expression data and pathological stages of 233 samples from the TCGA colon cancer datasets. We then filter out miRNAs with more than 20% missing data, which leaves the expression profiles of 376 miRNAs in 233 samples. We organize the miRNA expression data into a matrix, with each row representing a sample and each column representing a miRNA. We apply inverse quantile normalization to the miRNA expression matrix, mapping the values of each miRNA onto a standard normal distribution, and transpose the matrix for analysis.
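The normalization step can be illustrated with a minimal sketch. It assumes a rank-based inverse normal transform with the offset `(rank - 0.5) / n`, no missing values, and ties broken by input order; the text does not specify these details, and the function name and the expression values are made up for illustration.

```python
from statistics import NormalDist

def inverse_normal_transform(values):
    """Map one miRNA's expression values onto standard-normal quantiles by rank."""
    n = len(values)
    # Rank the observations (1 = smallest); ties follow input order (an assumption)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (rank - 0.5) / n keeps every quantile strictly inside (0, 1)
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]

# Made-up expression values for a single miRNA across five samples
expression = [12.0, 150.0, 3.5, 48.0, 700.0]
z = inverse_normal_transform(expression)
```

The transform preserves the ordering of the samples while removing the heavy right tail that is typical of sequencing counts; in practice it would be applied to each miRNA column separately.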

In this study, we focus on analyzing the impact of individual miRNAs, and of the epistasis between two miRNAs, on the pathological stage of colon cancer. Here, we use the pathological stages, i.e., tumor stages, as a proxy to study cancer progression. According to TCGA, the pathological stage refers to the “classification assigned to a malignancy which allows for the grouping of similar cancer types based on the extent of disease in the primary tumor (T), regional lymph nodes (N), and metastatic sites (M), using criteria from the American Joint Committee on Cancer staging criteria” [34]. We extract the pathological stages of these 233 samples from TCGA. The pathological stages are then transformed into natural log values to rescale the ordinal values of the different pathological stages, so that the variation is more similar across ordinal values.
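As a small illustration of the phenotype transformation, the sketch below assumes that the four main stages are coded as the integers 1 to 4 before taking the natural log and that substages are collapsed into their main stage. Neither the coding nor the label strings are given in the text; they are assumptions for illustration only.

```python
import math

# Assumed numeric coding of pathological stage (not stated in the text)
STAGE_CODE = {"Stage I": 1, "Stage II": 2, "Stage III": 3, "Stage IV": 4}

def log_stage(stage_label):
    """Return the natural log of the assumed ordinal stage code."""
    return math.log(STAGE_CODE[stage_label])

# Made-up stage labels for four samples
stages = ["Stage I", "Stage III", "Stage IV", "Stage II"]
y = [log_stage(s) for s in stages]
```

Under this coding, the log compresses the gap between later stages relative to earlier ones: the step from stage I to II is larger on the log scale than the step from stage III to IV.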

Introduction to the empirical Bayesian elastic net

In our workflow, we use an Empirical Bayesian Elastic Net (EBEN) to model the data. We choose EBEN because it scales well on high-dimensional data: it uses feature filtering to remove unimportant features and the coordinate ascent method to estimate the unknown parameters. The unknown parameters in the EBEN algorithm are μ, β and \( \overset{\sim }{\alpha } \) in the linear model in Eq. (5). μ denotes the mean of the phenotype and is assigned a uniform prior distribution. β is the coefficient matrix in the model of Eq. (5), and is what we aim to estimate for feature selection. β is assigned a two-level prior distribution, with an independent normal distribution at the first level and a generalized Gamma distribution at the second level. The EBEN algorithm introduces two hyper-parameters, \( {\lambda}_1 \) and \( {\lambda}_2 \), and uses cross-validation to determine their optimal values. \( \overset{\sim }{\alpha } \) is defined as \( 1/{\overset{\sim }{\sigma}}^2 \), and \( {\overset{\sim }{\alpha}}_k \) denotes the k-th element of \( \overset{\sim }{\alpha } \). In each cycle of the coordinate ascent method, EBEN adds or deletes features according to the value of \( {\overset{\sim }{\alpha}}_k \) updated between two iterations. If \( {\overset{\sim }{\alpha}}_k \) is finite, feature k is kept in the model; otherwise it is deleted from the model. When the dataset is high dimensional, many \( {\overset{\sim }{\alpha}}_k \) may become infinite under the coordinate ascent method, so their corresponding β are zero and EBEN can drop them from the model quickly. Therefore, EBEN estimates unknown parameters efficiently in an over-saturated statistical model [33], which makes it scalable to high dimensional datasets.

Another reason we prefer EBEN over other Elastic Net or LASSO methods is that those methods usually give only non-zero coefficients for feature selection, without estimating the covariance or performing a statistical test. In contrast, EBEN performs a t-test using the coefficient and the covariance matrix to obtain p-values for the selected features from point estimates [33].

As illustrated in Fig. 1, the EBEN algorithm [33] can be summarized as the following four steps.

  1. Initialize the model parameters and the statistical model. The parameters that need to be initialized are \( \mu, {\sigma}_0^2 \) and \( \overset{\sim }{y} \). μ denotes the mean of the phenotype and is initialized as \( \mu =\frac{\sum_{i=1}^n{y}_i}{n} \). \( \overset{\sim }{y} \) denotes the initial dependent variable and is initialized as \( \overset{\sim }{y}=y-\mu \). \( {\sigma}_0^2 \) denotes the variance of the model and can be initialized as a small number such as \( {\sigma}_0^2=0.1\times \frac{{\overset{\sim }{y}}^T\overset{\sim }{y}}{n} \). After initializing these parameters, we initialize the statistical model with an initial set of features. The initial feature set satisfies \( k=\arg {\max}_i\left\{\left|{x}_i^T\overset{\sim }{y}\right|,\forall i\right\} \), because EBEN starts with the features that have the highest correlations with the dependent variable. Here, n is the number of samples, k denotes the subscripts of features, \( {x}_i \) denotes the vector of feature i, and \( {\alpha}_k \) is a variable calculated from \( {\sigma}_k^2 \).

  2. For the posterior estimate, the posterior distribution of the parameter set θ is given in Eq. (1) and the log posterior distribution of \( {\overset{\sim }{\alpha}}_k \) in Eq. (2), according to the prior distributions [33]. \( {\overset{\sim }{\alpha}}_k \) is the k-th element of \( \overset{\sim }{\alpha } \), and \( {s}_k \) and \( {q}_k \) in Eq. (2) can be derived from C, the covariance matrix of y calculated for the given \( \overset{\sim }{\alpha } \) [33]:

    $$ p\left(\theta |y\right)\propto p\left(y|\mu, \beta, {\sigma}_0^2\right)p\left(\mu \right)p\left({\sigma}_0^2\right)p\left(\beta |{\overset{\sim }{\sigma}}^2\right)p\left({\overset{\sim }{\sigma}}^2|{\lambda}_1,{\lambda}_2\right) $$
    (1)
    $$ L\left({\overset{\sim }{\alpha}}_k\right)=\frac{1}{2}\left[\log \frac{{\overset{\sim }{\alpha}}_k}{{\overset{\sim }{\alpha}}_k+1+{s}_k}+\frac{{\overset{\sim }{q}}_k^2}{{\overset{\sim }{\alpha}}_k+1+{s}_k}\right]-\frac{\lambda_2}{{\overset{\sim }{\alpha}}_k} $$
    (2)
Fig. 1 An overview of the EBEN algorithm. 1) Initialize the model parameters and the statistical model: μ denotes the mean of the phenotype, \( \overset{\sim }{y} \) denotes the initial dependent variable and \( {\sigma}_0^2 \) denotes the variance of the model; obtain the initial features satisfying \( k=\arg {\max}_i\left\{\left|{x}_i^T\overset{\sim }{y}\right|,\forall i\right\} \). Here, k denotes the subscripts of features, \( {x}_i \) denotes the vector of feature i, and \( {\alpha}_k \) is a variable calculated from \( {\sigma}_k^2 \). 2) Update the parameters in the model during iterations. 3) Use a t-test to perform a hypothesis test on the estimated values. 4) Output β, which denotes the significant results, and the covariance matrix

By maximizing \( L\left({\overset{\sim }{\alpha}}_k\right) \), we can derive the optimal estimate of \( {\overset{\sim }{\alpha}}_k \) as in Eq. (3) [33]:

$$ {\overset{\sim }{\alpha}}_k^{\ast }=\left\{\begin{array}{ll}r,& \mathrm{if}\ {q}_k^2-{s}_k>{\lambda}_1+2{\lambda}_2\\ {}\infty, & \mathrm{otherwise}\end{array}\right. $$
(3)

Here, r is calculated from \( {s}_k \), \( {q}_k \), \( {\lambda}_1 \) and \( {\lambda}_2 \). From Eq. (3), \( {\beta}_k \) is zero if \( {\overset{\sim }{\alpha}}_k^{\ast } \) is infinite. During the iterations, the algorithm finds a new \( {\alpha}_k \) according to Eq. (4) [35]:

$$ j=\arg {\max}_k\left\{\varDelta L\left({\alpha}_k^{\ast}\right)=L\left({\alpha}_k^{\ast}\right)-L\left({\alpha}_k^{(n)}\right)\right\} $$
(4)

The parameters in the model are updated through iterations until three convergence criteria are met: i) no finite \( {\alpha}_k \) is output, ii) the change in \( {\alpha}_k \) between two iterations is smaller than a pre-specified value, and iii) the Euclidean norm of the change between two iterations is smaller than a pre-specified value. The algorithm has two hyper-parameters, and EBEN uses cross-validation to determine their optimal values [33, 35].

  3. Use the non-zero coefficients β and the covariance matrix to conduct a t-test, i.e., a hypothesis test on the estimated values.

  4. Output the final β, which denotes the significant results, and the covariance matrix.
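The initialization in step 1 and the keep-or-drop rule of Eq. (3) can be sketched as follows. This is not the EBEN implementation: the function names, the column-wise layout of `X`, and the toy numbers are assumptions for illustration, and the computation of \( {s}_k \), \( {q}_k \) and r from the covariance matrix is omitted.

```python
def initialize(X, y):
    """X: list of feature columns (each a list of n values); y: n phenotype values."""
    n = len(y)
    mu = sum(y) / n                                     # mu = sum(y_i) / n
    y_tilde = [yi - mu for yi in y]                     # y~ = y - mu
    sigma0_sq = 0.1 * sum(v * v for v in y_tilde) / n   # sigma_0^2 = 0.1 * y~^T y~ / n
    # Starting feature: the column with the largest |x_i^T y~|
    scores = [abs(sum(a * b for a, b in zip(col, y_tilde))) for col in X]
    k = max(range(len(X)), key=scores.__getitem__)
    return mu, y_tilde, sigma0_sq, k

def keeps_feature(q_k, s_k, lambda1, lambda2):
    """Eq. (3): alpha~_k is finite (feature kept) only if q_k^2 - s_k > lambda1 + 2*lambda2."""
    return q_k ** 2 - s_k > lambda1 + 2 * lambda2

# Toy data: three candidate features measured on four samples
X = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0], [0.0, 1.0, 0.0, 0.0]]
y = [1.0, 2.0, 3.0, 6.0]
mu, y_tilde, sigma0_sq, k = initialize(X, y)
```

When `keeps_feature` returns `False`, \( {\overset{\sim }{\alpha}}_k^{\ast } \) is infinite and the corresponding \( {\beta}_k \) is zero, which is how features are pruned during the iterations.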

Our analysis workflow based on EBEN

In this study, we use a linear regression model to relate the natural log of the pathological stage to the main effects and pairwise epistasis of miRNAs, with Eq. (5) as our full model:

$$ y=\mu +X{\beta}_m+{X}_i{X}_j{\beta}_e+e $$
(5)

where y denotes the dependent variable (i.e., the natural log of the pathological stage in this study), μ denotes the mean of the phenotype, X is the miRNA expression matrix with dimension n × k, n is the sample size, k is the number of miRNAs, \( {\beta}_m \) is the coefficient that represents the main effect of a miRNA, \( {X}_i \) and \( {X}_j \) denote the expression vectors of two different miRNAs, \( {\beta}_e \) is the coefficient that represents the epistasis between miRNAs i and j, and e is the residual error, which follows a normal distribution with zero mean and variance \( {\sigma}^2 \), \( e\sim N\left(0,{\sigma}^2\right) \). Because EBEN gives estimates of the posterior variances, a t-test is used to determine whether the non-zero coefficients of the selected features are significant.
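To make the structure of Eq. (5) concrete, the following sketch builds the feature columns of the full model: the k main-effect columns followed by one product column \( {X}_i{X}_j \) for every pair i < j. The function name and the toy values are hypothetical, and the published R pipeline may construct these terms differently.

```python
from itertools import combinations
from math import comb

def add_pairwise_terms(X):
    """X: list of n sample rows, each holding k expression values.

    Returns (rows, names): each output row holds the k main-effect values
    followed by the products X_i * X_j for all pairs i < j.
    """
    k = len(X[0])
    pairs = list(combinations(range(k), 2))
    names = [(i,) for i in range(k)] + pairs
    rows = [list(row) + [row[i] * row[j] for i, j in pairs] for row in X]
    return rows, names

# Toy matrix: two samples, three miRNAs (made-up normalized values)
X = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]
rows, names = add_pairwise_terms(X)

# Number of candidate pairwise terms for the 376 miRNAs in this study
n_pairs = comb(376, 2)
```

With 376 miRNAs there are 70,500 candidate pairwise terms against 233 samples, which is why the model is heavily over-saturated and a sparse estimator is needed.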

In order to avoid situations in which main effects dominate and mask epistatic effects, we develop an analysis workflow composed of multiple steps of feature selection and modeling using EBEN. The overall analysis workflow is illustrated in Fig. 2, and can be divided into the following four steps.

  • Step 1: Select features with solely main effect m on the phenotype y. EBEN is used to screen for all main-effect features with p-values smaller than 0.05. Only these significant main-effect features are included in the model of Step 4.

  • Step 2: Derive a corrected phenotype with main effects removed. We remove the main effects from the original phenotype (\( \overrightarrow{y} \)) using the significant main-effect features from Step 1 to generate the corrected phenotype, \( {y}^{\prime }=\overrightarrow{y}-{X}_m^{\prime }{\beta}_m^{\prime } \). \( {X}_m^{\prime } \) represents the significant features selected in Step 1 and \( {\beta}_m^{\prime } \) is the vector of effects for the significant features \( {X}_m^{\prime } \).

  • Step 3: Select features with epistatic effect on the phenotype. The corrected phenotype \( {y}^{\prime } \) is used as the new dependent variable to detect epistasis using EBEN. Significant epistatic effects are again selected at a p-value threshold of 0.05.

  • Step 4: Fit a unified model that estimates both main and epistatic effects. All features with main effect from Step 1 and the epistatic features identified in Step 3 are included in Eq. (5) and estimated by EBEN. In this step, because the covariance matrix includes only the significant main and epistatic effects from Steps 1 and 3, the new p-values, \( {\beta}_m \) and \( {\beta}_e \) differ from the results in Steps 1 and 3. To obtain these values from the same model, we use the new covariance matrix to re-estimate all the features and test whether each of them is significantly associated with the phenotype. Here, the threshold is also set at p < 0.05.
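The data flow of the four steps above can be sketched as a small driver function. The estimator is passed in as `fit`, a placeholder for EBEN whose interface (a dictionary mapping each feature name to a coefficient and a p-value) is an assumption made for this sketch and is not the API of the EBEN R package.

```python
from itertools import combinations

def multi_step(y, columns, fit, alpha=0.05):
    """Four-step main/epistatic workflow around a pluggable estimator.

    columns: dict mapping miRNA name -> list of n expression values.
    fit(y, cols): hypothetical estimator returning {name: (beta, p_value)}.
    """
    # Step 1: main effects only; keep features with p < alpha
    main = {m: bp for m, bp in fit(y, columns).items() if bp[1] < alpha}

    # Step 2: corrected phenotype y' = y - X'_m beta'_m
    y_corr = [yi - sum(beta * columns[m][i] for m, (beta, _) in main.items())
              for i, yi in enumerate(y)]

    # Step 3: pairwise products tested against the corrected phenotype
    pair_cols = {(a, b): [u * v for u, v in zip(columns[a], columns[b])]
                 for a, b in combinations(sorted(columns), 2)}
    epi = {pr: bp for pr, bp in fit(y_corr, pair_cols).items() if bp[1] < alpha}

    # Step 4: refit the selected main and epistatic features jointly on the original y
    selected = {m: columns[m] for m in main}
    selected.update({pr: pair_cols[pr] for pr in epi})
    full = fit(y, selected) if selected else {}
    return {name: bp for name, bp in full.items() if bp[1] < alpha}
```

Any estimator that returns a coefficient and a p-value per feature can be passed as `fit`; keeping the estimator separate from the workflow also makes it easy to change the p-value threshold through `alpha`.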

Fig. 2 The overall workflow of our epistasis analysis based on EBEN. Step 1: Run the simplified model including solely main effects; only the significant features from this step are included in the main effect part of the model at the final step. Step 2: Obtain the corrected phenotype \( {y}^{\prime } \) by removing the main effects of the significant features from the original phenotype y. Step 3: Use the corrected phenotype to infer epistasis; only the features with significant epistatic effect are included in the model in the next step. Step 4: Run the full model, which includes the significant main-effect features from the first step and the significant epistatic features identified in the third step. EBEN is used as the parameter estimation method in Steps 1, 3 and 4. Here, y denotes the trait phenotype, X represents the miRNA expression, \( {\beta}_m \) and \( {\beta}_e \) represent the main effect and the epistatic effect, respectively, μ represents the phenotype mean and e represents the residual error

Results

Using our multi-step analysis workflow, we identify a set of miRNAs with main and epistatic effects, as summarized in Table 1. Many of the miRNAs with main effect have been verified to be up- or down-regulated in colon cancer by previous experimental studies (Table 2), and the majority of epistatic miRNA pairs have common target genes that are associated with colon cancer. The target genes of these miRNAs related to pathological stages of colon cancer have previously been reported to be associated with colon cancer. Further, we use the experimentally validated target genes of the identified miRNAs to conduct GO Enrichment Analysis, and find that these genes are enriched for biological processes related to cancer.

Table 1 The main and epistatic effect miRNAs identified to be associated with pathological stages of colon cancer in our study
Table 2 Our identified miRNAs with main effect that are previously reported to be associated with colon cancer

Regarding the computational cost, the analysis takes approximately 20 h on a computing node with 2 GB of memory per process for the dataset of 376 features and 233 samples used in this study. Most of the computing time is spent training the model, using cross-validation to choose the optimal hyperparameters. Once a model is learned and these hyperparameters are determined, it takes only tens of minutes to run the model on a dataset at this scale.

Main effect

We identify 11 miRNAs with main effect on pathological stages of colon cancer, 6 of which have been verified to be associated with colon cancer in previous experimental studies (Table 2). For example, hsa-let-7c has been found to regulate RAS oncogene expression in human colon cancer and could be involved in the growth of colon cancer cells [36]. In addition, miR-31 is positively correlated with tumor stage in colon cancer [37]. Quantitative real-time PCR experiments find that the most notable oncogenic targets of miR-31 are AXIN1, which is involved in the Wnt signaling pathway, and the forkhead family transcription factors FOXC2 and FOXP3, and that this target gene and the two transcription factors are correlated with tumor stages [6]. Another example is hsa-mir-1249, identified in our study, one of whose target genes is TP53, which encodes tumor protein p53. Mutations in TP53 are frequent alterations in human cancers. TP53 is associated with poor prognosis in colon cancer and is usually mutated in stage IV. TP53 mutations have also been used as biomarkers in clinical settings [38].

Epistatic effect

For the epistasis analysis, we identify 14 pairs of epistatic miRNAs associated with pathological stage in colon cancer. Thirteen of these pairs have more than one common target gene according to three databases: miR2Disease [39], TargetScan [40] and miRDB [41]. Figure 3 presents a network view of the epistatic miRNAs and their target genes. In Fig. 3, solid bold blue lines denote the epistasis between miRNAs, yellow triangles denote the miRNAs, blue dots denote their corresponding target genes, and solid black lines denote the links between miRNAs and target genes. From this network, we can see that many epistatic miRNAs share the same target genes, which implies that the epistatic effect among miRNAs can be reflected by their joint effect on these common genes and potentially on the same pathways.

Fig. 3 Common target genes shared by epistatic miRNAs. The target genes in this network are from TargetScan, miRDB and miR2Disease. The miRNAs are denoted by yellow triangles. Identified epistatic interactions between miRNAs are shown by solid bold blue lines. Common target genes of these epistatic miRNAs are denoted by blue circles. The common target genes and their corresponding miRNAs are linked by solid black lines

For instance, we find that hsa-mir-497 and hsa-mir-7-2 have an epistatic interaction and that they share two common target genes, namely APC and KRAS. These two target genes belong to the Wnt signaling pathway and the EGFR signaling pathway, respectively, both of which are involved in the development of colon cancer [42, 43]. In particular, hsa-mir-7-2 has been verified to negatively regulate its target KLF4 and thereby promote the progression of colon cancer, since KLF4 is a tumor suppressor gene [44]. Recently, hsa-mir-497 has been reported to have lower expression levels and to be associated with progression in colorectal cancer [45]. Furthermore, KLF4 is a common target gene of both hsa-mir-497 and hsa-mir-7-2. Hence, we infer that hsa-mir-497 and hsa-mir-7-2 can affect the progression of colon cancer jointly, in addition to their independent effects on colon cancer.
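Finding the common targets of an epistatic pair amounts to a set intersection over the target lists of the two miRNAs. The sketch below uses a toy target map restricted to miRNA-gene links mentioned in this article; it is not a database export, the real target lists in miR2Disease, TargetScan and miRDB are far larger, and the function name is hypothetical.

```python
# Toy target map limited to miRNA-gene links named in this article
TARGETS = {
    "hsa-mir-497": {"APC", "KRAS", "KLF4"},
    "hsa-mir-7-2": {"APC", "KRAS", "KLF4"},
    "hsa-mir-363": {"PIK3CA", "NRAS"},
    "hsa-mir-22": {"APC"},
}

def shared_targets(mirna_a, mirna_b, targets=TARGETS):
    """Return the target genes common to both miRNAs."""
    return targets[mirna_a] & targets[mirna_b]

# The epistatic pair discussed in the text
common = shared_targets("hsa-mir-497", "hsa-mir-7-2")

# An arbitrary pairing used only for contrast; it is not a reported epistatic pair
none_shared = shared_targets("hsa-mir-363", "hsa-mir-22")
```

Applying the same intersection to every identified pair gives the miRNA-gene edges of a network like the one in Fig. 3.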

Disease associations

In order to understand the associations between our target genes and colon cancer, we query the target genes of all miRNAs with main or epistatic effect against the Online Mendelian Inheritance in Man (OMIM) disease database [46]. OMIM phenotype-gene relationships report 26 genes associated with colon cancer. Of these 26 colon cancer related genes, 15 are target genes of the identified miRNAs associated with the pathological stages of colon cancer (Table 3). For example, PIK3CA is a target gene of hsa-mir-363, and a previous study [47] reports high-frequency mutations of PIK3CA in colon cancer. NRAS is a target gene of six miRNAs identified in our study: hsa-let-7c, hsa-let-7d, hsa-mir-363, hsa-mir-3913, hsa-mir-483 and hsa-mir-874. Another study [48] finds that the mutation of NRAS is related to the suppression of apoptosis in tumor development. APC is a target gene of four miRNAs identified in our study: hsa-mir-22, hsa-mir-3065, hsa-mir-497 and hsa-mir-7-2. Mutation of APC can cause the inherited syndrome familial adenomatous polyposis, which confers a higher risk of colon cancer [49]. Additionally, PIK3CA, NRAS and APC are among the most frequently mutated genes in colon cancer according to [50]. Hence, we propose that the main and epistatic relationships between miRNAs and colon cancer provide evidence that these miRNAs might affect prognosis and patient survival, and that they can be used as biomarkers in future research on colon cancer.

Table 3 Summary of target genes associated with colon cancer according to OMIM database

Gene ontology enrichment analysis

For the miRNAs identified as having main or epistatic effect on the pathological stages of colon cancer, we obtain all their experimentally verified target genes from miR2Disease. We then use these target genes to conduct gene ontology (GO) enrichment analysis [51,52,53] to find their enriched molecular functions and biological processes.

As shown in Fig. 4, these target genes are enriched in biological processes including cell proliferation, cell death and cell division (see Additional file 1). Cell proliferation and cell death are related to tumors, and cell division has been shown to be related to colon cancer [54]. Because tumor growth depends on the combined regulation of cell proliferation, cell death and cell division, cancer progression is possible if cell death is suppressed and cell division and proliferation are promoted [54, 55]. These target genes are also enriched in molecular functions including protein binding and regulatory region DNA binding. Studies have shown that protein binding and regulatory region DNA binding processes are associated with colon cancer [56, 57]. Thus, these target genes may serve as drug targets, as they could block the progression of colon cancer by interfering with protein binding or the regulation of DNA binding.
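For readers unfamiliar with enrichment p-values, the sketch below computes a one-sided hypergeometric (over-representation) p-value for a single GO term. The text does not state which test the cited tools [51,52,53] apply, so this test is an assumption chosen because it is commonly used for this purpose, and all counts in the example are toy numbers rather than results from this study.

```python
from math import comb

def enrichment_p_value(N, K, n, k):
    """One-sided hypergeometric p-value P(X >= k).

    N: genes in the background, K: background genes annotated to the GO term,
    n: genes in the target list, k: target-list genes annotated to the term.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Toy numbers: 10 background genes, 3 annotated, 4 targets, 2 of them annotated
p = enrichment_p_value(N=10, K=3, n=4, k=2)
```

A plot such as Fig. 4 would then show the gene count per category alongside the negative log of the corresponding p-value.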

Fig. 4 Gene Ontology Enrichment Analysis. The x-axis represents each category of molecular functions and biological processes. The y-axis on the left denotes the target gene count for each category and the y-axis on the right denotes the –log (p-value)

Discussion

For the optimization problem in EBEN, we can improve performance by setting the initial random seed or increasing the number of folds in the cross-validation of EBEN. We will further incorporate covariates, such as known and unknown confounders, into the EBEN method. Rather than transforming an ordinal phenotype with the natural log, we will extend EBEN to take ordinal phenotypes directly. The pipeline used in this study can also be adjusted for different datasets, for example by changing the p-value threshold in Steps 1 and 3.

While we use miRNAs in this analysis, we acknowledge that gene expression also plays an important role in colon cancer and leads to changes in tumor pathological stages. We plan to incorporate gene expression into the model, which would allow us to study interactions between miRNAs and genes in a unified statistical model. We believe this strategy can help develop a better understanding of the molecular mechanisms of colon cancer. In addition, germline genetic variation and somatic genetic aberrations contribute significantly to tumorigenesis. Therefore, we plan to include these genetic factors in our model as well.

Conclusion

Changes in miRNA expression are known to be involved in colon cancer development and progression. miRNAs have a critical impact on the etiology and progression of cancer. Clarifying these changes and the epistatic effects among miRNAs could help advance cancer research and treatment.

In this study, we apply a multi-step workflow that enables us to identify both main and pair-wise epistatic effects of miRNAs on pathological stages of colon cancer. In each learning step of the workflow, an Empirical Bayesian Elastic Net method is used to solve the model. It has been demonstrated that EBEN efficiently selects significant features in high dimensional (epi-)genomic datasets. While we use miRNA expression data and pathological stages in our study, this workflow can be used to identify epistasis and main effect in many diseases.

In summary, our study provides a flexible workflow for an integrative analysis of the contribution of genetic and epigenetic factors to phenotypes. Such analysis has potential for biomarker and drug discovery, as well as for improving prognosis prediction. Our study thus provides a reference pipeline for epistasis and main effect analysis in future research that can be extended to various applications.