Background

Discovering important pathways, genes, and gene-gene interactions that account for the phenotype of interest has continued to be a key challenge in genome-wide expression analysis [1]. Under this high-dimensional data setting, single and multiple biomarker (e.g. gene) tests commonly used usually have limited power to detect causal biomarkers associated with the clinical phenotypes. To improve the power, analyses incorporating external biological information have been proposed. For example, gene-based analyses group the single-nucleotide polymorphisms (SNPs) under study into genes, and pathway-based analyses group the genes under study into some biologically meaningful pathways; both types of multiple biomarker analyses have shown to be effective in detecting causal association signals and become increasingly popular. The analyses incorporating external biological information are particularly useful for detecting interaction effects among biomarkers, since the number of interaction effects grows quickly with the number of biomarkers and hence traditional statistical tests lose power.

To identify causal interaction effects of single-nucleotide polymorphisms (SNPs) on a quantitative or disease trait, Fang et al. [2] develop a two-stage grouped sure independence screening (TS-GSIS) procedure using gene-based SNP sets. Simulation studies demonstrate that the performance of TS-GSIS is better than some existing approaches, including the extended SVM [3] and the TS-SIS method [4] without incorporating gene set information. A potential drawback for the TS-GSIS method is that, it is developed in the setting where the groups (gene sets) are non-overlapping and does not pay attention to settings with overlapping groups, which would be encountered in pathway-based analyses where different pathways may involve some common genes. Besides, TS-GSIS is focused specifically on the quantitative/qualitative outcome modeled by linear/logistic regression, and its application to the survival outcome has not been examined.

In this work, we propose the overlapping group screening (OGS) method, which is an extension and improvement of TS-GSIS to accommodate overlapping group structures. Following the latent effect approach of Jacob et al. [5], we decompose the original biomarker effects into a sum of group-specific latent effects, so that the original overlapping group structure can be transformed into a new non-overlapping group structure. The latent effect approach has also been applied by Zeng and Breheny [6], Zhang et al. [7] and Tang et al. [8] to joint selection of genes and genetic pathways.

In addition, to perform association analyses with general types of traits including survival endpoints, OGS employs the sequence kernel association test (SKAT) proposed by Chen et al. [9] as the group screening criterion. SKAT is a supervised, flexible, and computationally efficient regression method to test for association between genetic variants/gene expressions in a region and a quantitative/qualitative/survival trait [10]. In particular, SKAT can quickly compute p-values analytically by fitting the null model only once, and hence can be conveniently applied to genome-wide data. Further, we utilize a data-driven thresholding strategy of Fan et al. [11] for screening candidate biomarkers/features, where we permute randomly the original biomarker data among subjects to decouple the association between the biomarker and outcome data, such that the permuted data follow the null distribution, from which a cut-off value for the SKAT p-value to determine significance can be determined. After screening candidate biomarkers by the SKAT p-values, we apply the Ridge or Lasso penalized regression method [12] to build the prediction model in OGS. The Lasso penalty, in particular, allows for automatic variable selection, which are commonly employed in high-dimensional data such as genome-wide data analysis.

We note that OGS maintains the advantages of TS-GSIS, namely: (i) it can mitigate the issue of co-linearity in regression analyses owing to correlations among biomarkers in the same gene/pathway, and (ii) it can substantially reduce the search space for interaction effects by utilizing the feature grouping structure.

The other objective of this article is to predict survival outcomes based on gene expression profiles, a topic which has received much attention in the recent decade ([13,14,15] and so on). Zhang et al. [16] indicate that one of the main shortcomings of the past studies is the failure to incorporate prior biological information into the prediction model, which may in turn lead to inaccurate prognosis and prediction. The survival prediction based on OGS addresses this problem. Simulation studies demonstrate that the OGS approach not only identifies correctly the causal biological pathways and epistasis, but also improves survival prediction compared with the alternative analyses that ignore the pathway information.

In the real data application, we utilize OGS to select several causal genes and epistasis that are associated with clinical survival outcomes of diffuse large B-cell lymphoma (DLBCL) and non-small-cell lung cancer (NSCLC) patients. In these applications, we combine gene expression profile data with prior pathway information from the KEGG pathway database (for DLBCL) and the Gene Ontology (GO) biological process database (for NSCLC), which are popular public databases providing information on discovered pathways and their involved genes [17]. We use the pathway information available from these two databases to assign genes into groups based on the specific pathways in which they are involved, and conduct survival prediction based on the selected genes and gene-gene interactions.

Motivation

Suppose that there are q genes assigned to G possibly overlapping pathways, namely, a given gene may belong to more than one pathway. The schematic plot in Fig. 1 displays the natural hierarchal structure of genes related to pathways and shows the overlapping pathway structure present in the gene expression data. Each gene can belong to one or multiple pathways. It is of interest to identify genes, as well as their interactions, that are associated with the clinical survival outcome.

Fig. 1
figure 1

The natural hierarchal structure of genes related to pathways with the clinical outcome

Survival model

Let X denote the N × q dimensional covariate matrix of the gene expression profiles with \( \mathrm{X}={\left({\mathrm{x}}_1,\mathrm{L},{\mathrm{x}}_N\right)}^{/}={\left(\begin{array}{ccc}{x}_{11}& \cdots & {x}_{1q}\\ {}\vdots & \ddots & \vdots \\ {}{x}_{N1}& \cdots & {x}_{Nq}\end{array}\right)}_{N\times q} \), where xij denotes the expression level of the j-th gene of the i-th subject. Assume the survival outcome Ti is related to the gene expression covariates xi through a Cox’s regression model. In the Cox’s regression framework, the hazard function at time t for subject i‘s survival given the covariates is modeled as

$$ \lambda \left(t|{\mathbf{x}}_i\right)={\lambda}_0(t)\exp \left({\mathbf{x}}_i^{/}\boldsymbol{\upbeta} \right), $$

where λ0(t) is a non-negative deterministic baseline hazard function and β = (β1, ⋯, βq)/ is the logarithm of the risk ratio. Based on the Cox’s model, the survival function of subject i given his/her expression profile xi is given by \( P\left({T}_i>t|{\mathbf{x}}_i\right)=S\left(t|{\mathbf{x}}_i\right)={S}_0{(t)}^{\exp \left({\mathbf{x}}_i^{/}\boldsymbol{\upbeta} \right)} \) with \( {S}_0(t)=\exp \left[-{\int}_0^t{\lambda}_0(s) ds\right] \) the baseline survival. Usually the survival outcome is subject to censoring, and we use \( {t}_i^{\ast } \) to denote the observed survival time of subject i, and \( {\delta}_i^{\ast } \) is the indicator of whether the survival time of subject i is censored.

In practice, we can check the Cox’s model assumption by existing approaches, such as statistical tests and graphical diagnostics based on the Schoenfeld residuals [18].

Latent effect approach

Incorporating the grouping (pathway) information into the modeling process has the potential to improve the interpretability and the accuracy of the model. When the groups overlap one another, special techniques are required to adequately account for the overlapping grouping information. According to Jacob et al. [5], we decompose the original coefficient vector into a sum of group-specific latent effects, namely, \( \boldsymbol{\upbeta} =\sum \limits_{j=1}^G{\boldsymbol{\upgamma}}^j \), where \( {\boldsymbol{\upgamma}}^j={\left({\gamma}_1^j,\mathrm{L},{\gamma}_q^j\right)}^{/} \) is the latent coefficient vector for group j. Here is a simple example for illustration [6]. Suppose that there are four genes that are involved in the four pathways, P1 = {g1, g2}, P2 = {g2, g3}, P3 = {g1, g3} and P4 = {g3, g4}, the original coefficient β can be decomposed as β = [β1, β2, β3, β4]/

\( =\left[\begin{array}{l}{\gamma}_1^1\\ {}{\gamma}_2^1\\ {}{\gamma}_3^1\\ {}{\gamma}_4^1\end{array}\right]+\left[\begin{array}{l}{\gamma}_1^2\\ {}{\gamma}_2^2\\ {}{\gamma}_3^2\\ {}{\gamma}_4^2\end{array}\right]+\left[\begin{array}{l}{\gamma}_1^3\\ {}{\gamma}_2^3\\ {}{\gamma}_3^3\\ {}{\gamma}_4^3\end{array}\right]+\left[\begin{array}{l}{\gamma}_1^4\\ {}{\gamma}_2^4\\ {}{\gamma}_3^4\\ {}{\gamma}_4^4\end{array}\right] \) \( =\left[\begin{array}{l}{\gamma}_1^1\\ {}{\gamma}_2^1\\ {}0\\ {}0\end{array}\right]+\left[\begin{array}{l}0\\ {}{\gamma}_2^2\\ {}{\gamma}_3^2\\ {}0\end{array}\right]+\left[\begin{array}{l}{\gamma}_1^3\\ {}0\\ {}{\gamma}_3^3\\ {}0\end{array}\right]+\left[\begin{array}{l}0\\ {}0\\ {}{\gamma}_3^4\\ {}{\gamma}_4^4\end{array}\right] \)

$$ =\left[\begin{array}{l}1,0,0,0,1,0,0,0\\ {}0,1,1,0,0,0,0,0\\ {}0,0,0,1,0,1,1,0\\ {}0,0,0,0,0,0,0,1\end{array}\right]{\left[{\gamma}_1^1,{\gamma}_2^1,{\gamma}_2^2,{\gamma}_3^2,{\gamma}_1^3,{\gamma}_3^3,{\gamma}_3^4,{\gamma}_4^4\right]}^{/} $$
$$ =\mathbf{S}\boldsymbol{\upgamma } . $$

Based on the coefficient decomposition, the original regression model can be transformed into a new model, i.e. \( {\mathbf{X}}_{N\times q}{\boldsymbol{\upbeta}}_{q\times 1}={\mathbf{X}}_{N\times q}{\mathbf{S}}_{q\times u}{\boldsymbol{\upgamma}}_{u\times 1}={\overset{\sim }{\mathbf{X}}}_{N\times u}{\boldsymbol{\upgamma}}_{u\times 1} \). Equivalently, this new model can be constructed by duplicating the columns of overlapped variables in the raw design matrix. For the new transformed model, the hazard function for subject i in the Cox’s regression model is re-expressed as

$$ \lambda \left(t|{\overset{\sim }{\mathbf{x}}}_i\right)={\lambda}_0(t)\exp \left({\overset{\sim }{\mathbf{x}}}_i^{/}\boldsymbol{\upgamma} \right). $$

Method (OGS)

We propose the OGS method and apply it to the gene expression profile data with clinical survival trait to detect causal genes and epistasis interactions by incorporating prior pathway information. We standardized all the predictors before performing the OGS approach. The steps of the OGS algorithm are described as follows.

Step1: Based on the latent effect approach, we utilize the overlapping group Cox’s regression model to identify the causal pathways, which can be computed by the R package “grpregOverlap” [6]. We define \( {\hat{\mathrm{M}}}_{main} \) as the selected set of causal pathways, and \( A=\left|{\hat{\mathrm{M}}}_{main}\right| \) as the size of \( {\hat{\mathrm{M}}}_{main} \).

Step 2: Consider gene-gene interaction pairs between gene pairs from one causal pathway or two different causal pathways in \( {\hat{\mathrm{M}}}_{main} \), as well as gene pairs between one pathway in \( {\hat{\mathrm{M}}}_{main} \) and one non-causal pathway outside \( {\hat{\mathrm{M}}}_{main} \). The interaction between two pathways is also termed “cross-talk” of pathways [19]. For groups of gene-gene interaction pairs from each of the candidate pathways or from each two cross-talk pathways, apply the SKAT test to obtain the group-specific significance. Detail about the group-specific SKAT test is given in the next section.

Step 3: We randomly permute the original genotype matrix xi to form the permuted data {Yi, xπ(i)} following the null model, where {π(1), ⋯, π(N)} is a random permutation of the index. Then apply again the SKAT test for each of the pathway interaction groups with the permuted data to obtain the group screening measures (p-values) \( \left\{{p}_1^{\ast },\cdots, {p}_B^{\ast}\right\} \). We adopt \( {C}_{\mathrm{int}}=\min \left\{{p}_1^{\ast },\cdots, {p}_B^{\ast}\right\} \) as a cutoff point to select candidate pathway interactions, i.e.

$$ {\hat{\mathrm{M}}}_{\mathrm{int}}=\left\{b:{p}_b<{C}_{\mathrm{int}},b=1,\cdots, B\right\}, $$

is our selected set of pathway interactions.

Step 4: Apply the penalized Cox’s regression with the Ridge, or Lasso penalty to build the final prediction model based on genes in \( {\hat{\mathrm{M}}}_{main} \) and gene-pair interactions in \( {\hat{\mathrm{M}}}_{\mathrm{int}} \). Note that when applying the Lasso penalty, some of the genes/gene pairs in \( {\hat{\mathrm{M}}}_{main} \)/\( {\hat{\mathrm{M}}}_{\mathrm{int}} \) may be removed since the Lasso penalty can set some of the coefficients exactly to 0, while when applying the Ridge penalty, all of the candidate genes and gene pairs are retained. The penalized Cox’s model with the Ridge and Lasso penalties can be obtained by the R package “glmnet” [12].

Group-specific test (SKAT)

Following Chen et al. [9], the group-specific SKAT statistic under the Cox’s regression model is given as

$$ {Q}_{(k)}={\mathbf{m}}^{/}{\mathbf{R}}_{(k)}{\mathbf{W}}_{(k)}{\mathbf{W}}_{(k)}{\mathbf{R}}_{(k)}^{/}\mathbf{m},k=1,\cdots, \mathrm{B} $$

Here, \( B=A+{C}_2^A+\left(G-A\right)\times A \) is the total number of groups of pathway interaction, m is the vector of martingale residuals estimated from the null model without considering the gene expression data, R(k) = [r(k)ij]N × l,

where l is the number of gene-gene interaction pairs in the pathway interaction group k, r(k)ij is the j-th gene-gene interaction pair of i-th subject in the pathway interaction group k, and W(k) is a diagonal weight matrix that contains the weights of the l interaction pairs in the pathway interaction group k. Suitable weights can improve the testing power [10]. We utilize the penalized Cox’s partial likelihood approach with the Ridge penalty to estimate effect sizes for gene-gene interaction pairs in each pathway interaction group, and take the square root of the absolute estimated coefficients as our weights, i.e.,

$$ {\mathbf{W}}_{(k)}={\left(\begin{array}{ccc}\sqrt{\left|{\overset{\sim }{\boldsymbol{\upbeta}}}_{(k)1}\right|}& & 0\\ {}& \ddots & \\ {}0& & \sqrt{\left|{\overset{\sim }{\boldsymbol{\upbeta}}}_{(k)l}\right|}\end{array}\right)}_{l\times l}, $$

Based on null model without gene covariates, let V =  diag (e1, …, eN) − PP/, where P is an N × v matrix with element pijthe baseline hazard for individual i at ordered failure time \( {t}_{(j)}^{\ast },j=1,\cdots, v \), and ei the cumulative hazard for individual i at observed time \( {t}_i^{\ast } \). Let \( {\Sigma}_{(k)}={\mathbf{W}}_{(k)}{\mathbf{R}}_{(k)}^{/}{\mathbf{VR}}_{(k)}{\mathbf{W}}_{(k)} \) be the covariance matrix of the vector \( {\mathbf{W}}_{(k)}{\mathbf{R}}_{(k)}^{/}\mathbf{m} \) under the null hypothesis of all gene-gene interaction pairs in the pathway interaction group k having null effects. Under the null hypothesis, the SKAT statistic follows a mixture chi-square distribution:

$$ {Q}_{(k)}\sim \sum \limits_{j=1}^l{\lambda}_{(k)j}{\chi}_{1,j}^2, $$

where λ(k)j, j = 1, ⋯, l are the eigenvalues of Σ(k), and \( {\chi}_{1,j}^2 \)‘s are independent 1-df central chi-square random variables.

We use the Davies method [20] to approximate the tail probability (p-value) of the mixture chi-square distributions, which can be computed by R package “CompQuadForm” [21]. In general, the Davies method is accurate [22]. The p-values {p1, ⋯, pB} for the pathway interaction groups serve as our group screening measures; a smaller p-value corresponds to higher significance of the group and hence leads to higher priority to be selected.

Results

In the following simulations, we investigate the performances of the proposed OGS approach in variable selection, estimation, and prediction, and compared them with those from the “Oracle”, “Univariate Selection”, “Ordinary Lasso”, and “TS-GSIS Lasso” methods. The “Oracle” method is based on the underlying true model, which is known in simulations but unknown in real applications. The “Univariate Selection” method selects the genes and gene-pairs one by one via univariate regression, with controlled false discovery rate (< 0.05), and the selected variables are included in a multivariate Cox’s regression model to form the final prediction model. The “Ordinary Lasso” method is the penalized Cox’s regression model with the covariates of gene expressions from all genes and gene-pair interactions and with the Lasso penalty. The “TS-GSIS Lasso” method is essentially proposed by Fang et al. [2], except that we apply the SKAT test to obtain the group-specific significance.

For performance comparison, we obtain the root mean squared error (RMSE) to measure estimation accuracy, defined as

$$ RMSE=\sqrt{\frac{1}{S}\sum \limits_{j=1}^S{\left({\beta}_j-{\hat{\beta}}_j\right)}^2}, $$

where S is full model size including all main and interaction covariates. Over 500 simulations, we report the median value RMSE.M of RMSE over simulations. We also report the following proportions in 500 simulations as performance measures for variable selection: T.model is the proportion where the selected model includes the underlying effective variables, including both the main and interaction terms; Tint.model is the proportion where the selected model includes the underlying effective gene-gene interaction terms; Sen. is the sensitivity, i.e., the proportion of the underlying effective variables being selected; Spe. is the specificity, i.e., the proportion of the underlying ineffective variables not being selected. We also report the median size S.model of the selected model over 500 simulations. For assessing the performance in survival prediction, we report two measures of prediction accuracy: the deviance and c-index proposed by Harrell et al. [23] and smaller deviance/larger c-index corresponds to better prediction accuracy. The median values of deviance and c-index over 500 simulations are reported.

Also, let \( \hat{\boldsymbol{\upbeta}} \) be an estimator of the (penalized) Cox’s regression parameter in a prediction model obtained from the training dataset and \( \left({t}_i^{\ast },{\delta}_i^{\ast },{\mathbf{x}}_i^{\ast}\right) \) the survival and covariate data of subject i in the test data. Define \( {\mathbf{x}}_i^{\ast}\hat{\boldsymbol{\upbeta}} \) as the prognosis index (PI) value for subject i. The prediction accuracy measure of Cox-test is defined as the p-value of PI when PI is used as the covariate in the univariate Cox model for the survival outcome in the test data. A smaller value of Cox-test (p-value) would suggest better prediction accuracy. Similarly, the prediction accuracy measure of LR-test is the p-value of the log-rank test for the null hypothesis of equality of the survival between the “poor” and “good” prognosis groups in the test data, which are formed according to whether the PI value is higher or lower than the median PI value. A smaller LR-test value corresponds to better prediction power.

We consider survival data with a cohort size 500 as the training set, where each subject’s survival time follows the Cox’s proportional hazards model

$$ {\lambda}_0\left(\left.t\right|\mathbf{x}\right)=0.1\cdot \exp \left({\mathbf{x}}^{/}\boldsymbol{\upbeta} \right), $$

with β measuring the log-relative risk with respect to the covariates and the covariates x jointly following a multivariate standard normal distribution with correlation corr(xj, xk) = 0.5|j − k|. The censoring time distribution follows a uniform U(0, 1) distribution. We then generate survival data, independent of the training data, with a cohort of size 100 as the test data to assess the prediction accuracy for different methods.

Simulation setting 1

In this simulation study, the design matrix consists of 5 groups with each group having different group sizes. The group size (number of genes in each pathway) and the overlapping structure (number of genes shared by two overlapping pathways) are shown  in Table 1.

Table 1 Data structure in Simulation 1

For example, pathways 1 and 2 contain 7 and 14 genes, respectively. The two groups contain 18 unique genes, and 3 genes are shared by the two groups. As a result, there are 81 genes (q = 81) and 105 latent effects in this example. Fig. 2 shows the gene indices of the pathways. Pathways 2 and 4 are effective, and genes in each of them have constant latent effects of 4.5 and − 3, respectively. Three types of gene-gene interactions are considered: (1) gene-gene interactions (x⋅8 × x⋅9, x⋅10 × x⋅11, x⋅12 × x⋅13) within pathway 2 with effects(6, 6, 6), (2) gene-gene interaction (x⋅36 × x⋅66, x⋅38 × x⋅68, x⋅40 × x⋅70) across pathways 4 and 5 with effects(−6, −6, −6), and (3) coexistence of interactions (1) and (2). The number of effective genes and gene-pair interactions is 45 or 48 among the total 3321 genes and gene-pairs. We examine performances of different methods under a censoring rate of 50% or 65%.

Fig. 2
figure 2

The gene indices of the pathways considered in Simulation 1

Simulation setting 2

In this simulation study, the design matrix consists of 24 groups with each group having different group sizes, ranging from 3 to 60 (genes). The group size and the overlapping structure are shown in Table 2.

Table 2 Data structure in Simulation 2

For example, pathway 4 contains 6 genes, as group 5 does, and the two groups contain 10 unique genes, and 2 genes are shared by the two groups. As a result, there are 462 genes (q = 462) and 594 latent effects in this example. Fig. 3 shows the gene indices of the pathways. Pathways 1, 7, 13, and 19 are effective, and genes in each of them have constant latent effects of 4.5, − 3, − 3, and 1.5, respectively. As above, three types of gene-gene interactions are considered: (1) gene-gene interactions (x⋅22 × x⋅23, x⋅24 × x⋅25, x⋅26 × x⋅27) within pathway 7 with effects(4, 4, 4), (2) gene-gene interaction (x⋅81 × x⋅101, x⋅82 × x⋅102, x⋅83 × x⋅103) across pathways 13 and 14 with effects(−4, −4, −4), and (3) coexistence of interactions (1) and (2). The number of effective genes and gene-pair interactions is 84 or 87 among the total 106,953 genes and gene-pairs. We examine different methods under a censoring rate of 50% or 65%.

Fig. 3
figure 3

The gene indices of the pathways considered in Simulation 2

Summary of simulation results

From the simulation results shown in Tables 3, 4, 5, 6, 7, and 8, the OGS method using the Lasso penalty outperforms the OGS method using the Ridge penalty. Also, compared to the existing methods, OGS with the Lasso penalty performs substantially better than the Univariate Selection and the TS-GSIS with the Lasso penalty methods in variable selection (T.model, Tint.model, Sen., Spe.), estimation (RMSE.M), and prediction (Deviance, c-index). When the number of groups (pathways) and the group size (number of genes) are smaller (Setting 1) and the censoring rate is relatively lower (50%), the ordinary Lasso also performs well in variable selection and survival prediction; while in other cases, the ordinary Lasso is less competitive than the proposed OGS method with the Lasso penalty in variable selection, estimation, and survival prediction. Comparing Tables 3, 4, and 5, or Tables 6, 7, and 8, we see that the pattern of interactions, namely whether the gene-gene interactions occur within the same pathway or not, does not affect much the performance of the proposed OGS method, in particular for survival prediction.

Table 3 Results of Simulation 1 (1): The performances of OGS compared with other approaches under gene-gene interactions within one pathway
Table 4 Results of Simulation 1 (2): The performances of OGS compared with other approaches under gene-gene interactions across two pathways
Table 5 Results of Simulation 1 (3): The performances of OGS compared with other approaches under coexistence of within- and between-pathway gene-gene interactions
Table 6 Results of Simulation 2 (1): The performances of OGS compared with other approaches under gene-gene interactions within one pathway
Table 7 Results of Simulation 2 (3): The performances of OGS compared with other approaches under gene-gene interactions across two pathways
Table 8 Results of Simulation 2 (3): The performances of OGS compared with other approaches under coexistence of within- and between-pathway gene-gene interactions

The DLBCL analysis

The DLBCL data [24] contain two sets of gene expression data, CHOP and R-CHOP. The CHOP dataset is under a combination chemotherapy with cyclophosphamide, doxorubicin, vincristine and prednisone; R-CHOP is under the current golden standard treatment, the rituxima immunotherapy in addition to the chematherapy in CHOP. The CHOP and R-CHOP datasets consist of censored survival outcomes from 181 and 233 patients, respectively, with gene expression data from the same 3833 genes after the filtering process. The censoring rates are 42% and 74% in the CHOP and R-CHOP datasets, respectively. These two microarray datasets can be downloaded from the R package “bujar” [25]. In our analysis, we divide randomly the patients into 207:207 training/test datasets from the pool of R-CHOP and CHOP datasets. There were no significant differences in clinical survival outcome between subjects in the two datasets.

We apply the proposed OGS approach to the DLBCL data with the prior pathway information obtained from the KEGG pathway database. The following analysis is based on the 451 genes mapped into 165 pathways in the DLBCL data, which result in 101,926 main and two-way interaction covariates.

In Steps 1–3 of the OGS approach, we identify 6 significant pathways and 2 significant cross-talk pathway interactions. In Step 4 of the OGS method, the Cox’s model with the Ridge or Lasso penalty is applied to the training data to establish the final prediction model. In particular, the OGS method with the Lasso penalty leads to a prediction model with 5 main and 10 two-way interaction covariates. The “Univariate Selection” and “Ordinary Lasso” methods are applied directly to the whole 101,926 covariates in the training data to build the prediction models. The “Overlap Lasso” method is obtained by applying the R package “grpregOverlap” [6], which performs group selection among overlapping groups with the Lasso penalty but without considering interactions among features.

Table 9 displays several survival prediction accuracy measures for different approaches in the test data. We see that the OGS method with the Lasso penalty has better performances compared to existing methods in the test data. Fig. 4 displays the Kaplan-Meier survival curves for the “good” (blue) and “poor” (red) prognosis groups in the test data, which are formed according to whether the prognosis index (PI) value is lower or higher than the median PI value (see the Results section for detail). It is seen that that the two survival curves are better separated by the OGS approach than by the existing methods.

Table 9 Results of prediction accuracies of different methods based on DLBCL data
Fig. 4
figure 4

Kaplan-Meier curves for the 207 subjects in the DLBCL with the testing data. Good (blue) and poor (red) groups are identified by the median of the PI’s in the testing dataset

In DLBCL data, we discard 3382 genes that are not mapped into any pathways in the KEGG pathway database based on the latent effect approach. We also perform the other OGS analysis putting the 3382 ungrouped genes together as an additional group. The results from such an analysis are similar to those presented here.

The NSCLC analysis

The NSCLC data of Chen et al. [14] is available from NCBI with accession number GSE4882. The data contain censored survival outcomes from 125 lung cancer patients and their gene expression profiles for 672 genes. The censoring rate is 65%. Following Emura et al. [13], we consider the subset consisting of 485 genes, and, following Chen et al. [14], we divide the patients into 63:62 training/test datasets.

Based on the GO biological process database, prior pathway information for 251 genes mapped into 344 pathways are utilized, which lead to a total number of 31,626 main and two-way interaction covariates. Using the OGS approach, we identify 2 significant pathways but no significant pathway interaction, and the final prediction model obtained by the Lasso method includes main effects from two genes, DUSP6 and LCK. Indeed, the two genes are also included in the five-gene signature by Chen et al. [14], and are found to be strongly associated with lung cancer in other literatures ([26,27,28] and so on).

Table 10 shows the prediction accuracy measures for patients’ survival in the test sample of the NSCLC data, where the measure LR-test_3 is the p-value of the log-rank test for equality of survival distributions among the three prognosis groups divided by the tertiles of the PI values in the test sample. Fig. 5 displays the three Kaplan-Meier survival curves for three prognosis groups (“good”, “medium”, “poor” groups according tertiles of the PI values) in the test sample of the NSCLC data (in this case the LR-test for the two prognosis groups divided by the median PI is less significant. Fig. 6 displays the two Kaplan-Meier survival curves for the two prognosis groups). In all these measures, the OGS method with the Lasso penalty performs better than the Ordinary Lasso.

Table 10 Results of prediction accuracies of different methods based on NSCLC data (using the training and test sets as in Chen et al. [14])
Fig. 5
figure 5

Kaplan-Meier curves for the 62 subjects in the NSCLC testing data. Good (blue), medium (red) and poor (green) groups are identified by the tertile of the PI’s in the test dataset

Fig. 6
figure 6

Kaplan-Meier curves for the 62 subjects in the NSCLC testing data. Good (blue) and poor (red) groups are identified by the median of the PI’s in the test dataset

Besides, we also apply the 10-fold cross-validation method to evaluate the performance of the OGS method for survival prediction in the NSCLC data. In the 10-fold cross-validation process, most of the time the OGS still identifies the same prediction model containing the main effects of DUSP6 and LCK genes. Table 11 shows the performances of the OGS method in the NSCLC data with the performance evaluation based on the 10-fold cross-validation, i.e., the average of the results among 10 folds. We see that the performance patterns are similar to those in Table 10, and the OGS with the Lasso penalty still outperforms the other methods.

Table 11 Results of prediction accuracies of different methods based on NSCLC data with 10-fold cross-validation procedure

In NSCLC data, we discard 234 genes that are not mapped into any pathways in the GO biological process database based on the latent effect approach. The OGS approach for putting the 234 ungrouped genes together as an additional group results in the same prediction model as the one presented above.

Discussion

The OGS procedure can further adjust for confounding covariates (e.g. environmental factors) when all the models involved, including the null model without using gene expression covariate data, further adjust for the confounding variables; see [9, 10] for the SKAT statistics with confounding covariates for quantitative, qualitative and survival traits.

In this article, we consider two-way and multiplicative interactions as a simple way to implement interaction assessments. Examination of higher-order and general forms of interaction is challenging and deserves further research. Besides, the OGS method employs the latent effect approach to deal with the overlapping structure among pathways. This approach requires the gene grouping (pathway) structure to be pre-specified and is restricted to genes that can be assigned to at least one group (pathway). It is interesting to study how these restrictions can be relaxed to improve the performances of gene selection and survival prediction. Yu and Liu [29] propose a procedure for sparse regression incorporating a comprehensive graphical structure (SRIG) among predictors, and we would like to extend the current proposal by employing the SRIG approach.

The idea of group screening procedure we propose can also be applied to detect gene-environment interactions. In the first step, we still apply the overlapping group method to identify the causal pathways \( {\hat{\mathrm{M}}}_{main} \). In the second step, we apply the SKAT test to obtain the group-specific significance, where each of the groups are formed by the interactions between one gene from each of the causal pathways in \( {\hat{\mathrm{M}}}_{main} \) and one environment factor in Z, where Z is the set of environment covariates whose interactions with genes are of interest. In step 3, we select significant gene-environment interactions, where the permutation procedure and the cutoff determination are the same as those in the original OGS, except that now the permeation is applied to the covariate matrix consisting of both gene and environmental covariates. Finally, the penalized regression with the Ridge or Lasso penalty is still applied to build the final prediction model based on the genes in \( {\hat{\mathrm{M}}}_{main} \), the environmental covariates, and the selected gene-environment interactions. We plan to study extensions of the OGS method, including the extension to gene-environment interactions, in our future research.

In this work we focus on survival prediction based on the Cox’s proportional hazards model. In the case where the proportional hazards assumption is not appropriate, an alternative model, such as the proportional odds model, that proves to be appropriate can be used instead in the OGS procedure proposed. The required modification with models alternative to the proportional hazards is quite straightforward. For example, the SKAT statistic involved in OGS can be simply modified by using residuals from the alternative model considered.

Conclusions

It has been a long-lasting interest in the bioinformatics field for detecting the pairwise gene-gene interactions. In this paper we propose an overlapping group screening procedure to identify causal genes and gene-gene interactions efficiently by incorporating prior pathway information, where the pathways involved are allowed to overlap one another. Specifically, we utilize the gene pathway information via the latent effect approach which formally accounts for the possibly overlapping grouping structure. In addition, we utilize the SKAT testing approach to perform powerful screening of main and interaction effects. Simulation and real data studies demonstrate that the new proposal can substantially improve the accuracy of gene and gene-gene interaction selection and hence lead to more accurate survival prediction compared with the common analyses that ignore the pathway information. We provide an R package “OGS” to perform Steps 1–3 of the proposed OGS method, together with the reference manual describing how to perform “OGS” and the code used in our simulation. Please see Additional Files 2 and 3 for detail.

The OGS approach is general in that they can accommodate various types of clinical outcomes and regression models, such as quantitative, qualitative, and survival outcomes modeled by linear, logistic, and Cox’s regression models, respectively. In the paper the OGS approach based on the Cox’s model for gene selection, effect estimation, and survival outcome prediction has been examined. The OGS methods for continuous and binary outcomes based respectively on the linear and logistic regression models are discussed in Additional File 1. The extension of OGS to more flexible models, such as those based on the kernel methods [30], deserves further research and will be studied in our future work.

The importance of gene-gene interactions have been discussed widely in literature. For example, Cordell [31] discussed the need of considering gene-gene interactions in genetic studies of complex diseases. Fang et al. [2] identified and confirmed important gene-gene interactions related to rheumatoid arthritis. We believe that the proposed overlapping group screening (OGS) approach provides an useful tool to this important task in delineating the underlying disease etiology.