Background

Survival prediction from high-dimensional molecular data is an active topic in genomics and precision medicine, especially for cancer studies. Large-scale omics data provide extraordinary opportunities for detecting biomarkers and building accurate prognostic and predictive models. However, such high-dimensional data also introduce statistical and computational challenges. Tibshirani [1, 2] proposed a novel penalized method, the lasso, for variable selection in high-dimensional data, which has attracted considerable attention in modern statistical research. Several penalized methods were subsequently developed, such as the minimax concave penalty (MCP) of Zhang [3, 4] and the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li [5]. These penalization approaches have been widely applied for disease prediction and prognosis using large-scale molecular data [6,7,8,9,10,11].

Furthermore, group structures among molecular variables have been noted in such analyses. For example, genes can be grouped into known biological pathways or functionally similar sets, and genes within the same biological pathway tend to be biologically related and statistically correlated. Incorporating such biological grouping information into statistical modeling can improve the interpretability and efficiency of the models. Several penalization methods have been proposed to utilize grouping information, such as the group lasso [12], sparse group lasso (SGL) [13, 14], group bridge [15], composite MCP [16], composite absolute penalty [17], group exponential lasso [18], group variable selection via the convex log-exp-sum penalty [19], and a doubly sparse approach for group variable selection [20]. Some of these methods perform group-level selection, including or excluding an entire group of variables; others perform bi-level selection, achieving sparsity within each group as well. Huang et al. [21] and Ogutu et al. [22] reviewed these penalization methods for prediction and highlighted some issues for further study.

Ročková and George [23, 24] recently proposed a new Bayesian approach, called the spike-and-slab lasso, for high-dimensional normal linear models using the spike-and-slab mixture double-exponential prior distribution. Based on the Bayesian framework, we have recently incorporated the spike-and-slab mixture double-exponential prior into generalized linear models (GLMs) and Cox survival models, and developed the spike-and-slab lasso GLMs and Cox models for predicting disease outcomes and detecting associated genes [25, 26]. More recently, we have developed the group spike-and-slab lasso GLMs [27] to incorporate biological pathways.

In this article, we develop the group spike-and-slab lasso Cox model (gsslasso Cox) for predicting disease survival outcomes and detecting associated genes by incorporating biological pathway information. We propose an efficient algorithm that fits the model by integrating Expectation-Maximization (EM) steps into the extremely fast cyclic coordinate descent algorithm. The novelty lies in incorporating group or biological pathway information into the spike-and-slab lasso Cox model for predicting disease survival outcomes and detecting associated genes. We evaluated the performance of the proposed method via extensive simulations and comparisons with several commonly used methods, and also applied the proposed procedure to three cancer data sets with thousands of gene expression values and their pathway information. Our results show that the proposed method not only generates powerful prognostic models for survival prediction, but also excels at detecting associated genes.

Methods

The group spike-and-slab lasso Cox models

In the Cox survival model, the outcome for each individual is yi = (ti, di), where ti is the observed survival time and the censoring indicator di takes 1 if ti is uncensored and 0 if it is censored. Denote the true survival time for individual i by Ti; thus di = 1 when Ti = ti, and di = 0 when Ti > ti. The predictor variables include numerous molecular predictors (e.g., gene expression) and some relevant demographic/clinical covariates. We assume that the predictors can be organized into G groups (e.g., biological pathways) based on existing biological knowledge. Note that the groups may overlap; for example, a gene can belong to two or more biological pathways. Following the idea of the overlap group lasso [28,29,30,31], we performed a restructuring step that expands the vector of predictors by replicating each variable in every group in which it appears.
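The restructuring step can be illustrated with a minimal Python sketch (the paper's implementation is in R; the function name and data layout below are illustrative only): each predictor column is copied once for every group that contains it, so the expanded design matrix has one column per (predictor, group) pair.

```python
import numpy as np

def expand_overlap(X, groups):
    """Overlap-group restructuring: replicate each predictor column in
    every group it belongs to.

    X      : (n, m) array of original predictors
    groups : list of lists; groups[g] holds the column indices in group g
    Returns the expanded matrix and, for each expanded column, its
    (original column, group) origin.
    """
    cols, origin = [], []
    for g, idx in enumerate(groups):
        for j in idx:
            cols.append(X[:, j])   # duplicate the column for this group
            origin.append((j, g))
    return np.column_stack(cols), origin
```

After fitting on the expanded matrix, the coefficient of an original predictor is recovered by summing the coefficients of its replicates, as in the overlap group lasso literature.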

The Cox proportional hazards model assumes that the hazard function of survival time T takes the form [32, 33]:

$$ h\left(t|X\right)={h}_0(t)\exp \left( X\beta \right) $$
(1)

where the baseline hazard function h0(t) is unspecified, X and β are the vectors of explanatory variables and coefficients, respectively, and Xβ is the linear predictor, also called the prognostic index.

Fitting the classical Cox model amounts to estimating β by maximizing the partial log-likelihood [34]:

$$ pl\left(\beta \right)=\sum \limits_{i=1}^n{d}_i\log \left(\exp \left({X}_i\beta \right)/\sum \limits_{i\hbox{'}\in R\left({t}_i\right)}\exp \left({X}_{i\hbox{'}}\beta \right)\right) $$
(2)

where R(ti) is the risk set at time ti. In the presence of ties, the partial log-likelihood can be approximated by the Breslow or the Efron methods [35, 36]. The standard algorithm for maximizing the partial log-likelihood is the Newton-Raphson algorithm [32, 37].
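For concreteness, Eq. (2) without ties can be evaluated directly; the following Python sketch (illustrative, not the paper's R code) computes pl(β) with the risk set R(ti) = {i′ : ti′ ≥ ti}.

```python
import numpy as np

def cox_partial_loglik(beta, X, time, status):
    """Cox partial log-likelihood of Eq. (2), assuming no tied event times.

    X      : (n, m) design matrix
    time   : (n,) observed times t_i
    status : (n,) censoring indicators d_i (1 = event)
    """
    eta = X @ beta                      # linear predictors X_i beta
    pl = 0.0
    for i in np.where(status == 1)[0]:  # sum over uncensored individuals
        risk = time >= time[i]          # risk set R(t_i)
        pl += eta[i] - np.log(np.sum(np.exp(eta[risk])))
    return pl
```

At β = 0 each event contributes −log|R(ti)|, which gives a quick sanity check of an implementation.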

For high dimensional and/or correlated data, the classical model fitting is often unreliable. The problem can be solved by using Bayesian hierarchical modeling or penalization approaches [31, 38, 39]. We here propose a Bayesian hierarchical modeling approach, which allows us to simultaneously analyze numerous predictors and more importantly provides an efficient way to incorporate group information. Our hierarchical Cox models employ the spike-and-slab mixture double-exponential (de) prior on the coefficients:

$$ {\beta}_j\mid {\gamma}_j,{s}_0,{s}_1\sim \mathrm{de}\left(0,\left(1-{\gamma}_j\right){s}_0+{\gamma}_j{s}_1\right)=\frac{1}{2\left[\left(1-{\gamma}_j\right){s}_0+{\gamma}_j{s}_1\right]}\exp \left(-\frac{\left|{\beta}_j\right|}{\left(1-{\gamma}_j\right){s}_0+{\gamma}_j{s}_1}\right) $$
(3)

where s0 and s1 are preset scale parameters, small and relatively large, respectively (0 < s0 < s1), inducing strong and weak shrinkage on βj, and γj is an indicator variable taking the value 1 or 0. Equivalently, this prior can be expressed as (1 - γj) de(0, s0) + γj de(0, s1), a mixture of the shrinkage prior de(0, s0) and the weakly informative prior de(0, s1), which are the spike and slab components of the prior distribution, respectively.
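The mixture prior can be sketched numerically as follows (Python, illustrative; the scale values s0 = 0.05 and s1 = 1 are arbitrary example choices, and the 1/(2s) normalizing constant of the standard double-exponential density cancels in the posterior ratios used later):

```python
import numpy as np

def de_density(beta, s):
    """Double-exponential (Laplace) density de(beta | 0, s)."""
    return np.exp(-np.abs(beta) / s) / (2.0 * s)

def mixture_prior_density(beta, gamma, s0=0.05, s1=1.0):
    """Spike-and-slab de prior of Eq. (3): scale s0 when gamma = 0
    (spike, strong shrinkage), s1 when gamma = 1 (slab, weak shrinkage)."""
    scale = (1 - gamma) * s0 + gamma * s1
    return de_density(beta, scale)
```

The spike component concentrates mass near zero, while the slab is comparatively flat, which is what allows small coefficients to be shrunk to zero and large ones to be left nearly unpenalized.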

We incorporate the group structure by placing a group-specific Bernoulli distribution on the indicator variables. For predictors in group g, the indicator variables are assumed to follow the Bernoulli distribution with group-specific probability θg:

$$ {\gamma}_j\mid {\theta}_g\sim \mathrm{Bin}\left({\gamma}_j|1,{\theta}_g\right)={\theta}_g^{\gamma_j}{\left(1-{\theta}_g\right)}^{1-{\gamma}_j} $$
(4)

If group g includes important predictors, the parameter θg will be estimated to be relatively large, implying that other predictors in the group are more likely to be important. The group-specific Bernoulli prior therefore incorporates the biological similarity of genes within the same pathway into the hierarchical model. For the probability parameters, we adopt a beta prior, θg ~ beta(a, b); setting a = b = 1 yields the uniform hyperprior θg ~ U(0, 1), which we use in later sections to illustrate the method. Hereafter, the above hierarchical Cox model is referred to as the group spike-and-slab lasso Cox model.

The EM coordinate descent algorithm

We have developed a fast deterministic algorithm, called the EM coordinate descent algorithm, to fit the spike-and-slab lasso Cox models by estimating the posterior modes of the parameters [26]. The EM coordinate descent algorithm incorporates EM steps into the cyclic coordinate descent procedure for fitting the penalized lasso Cox models, and has been shown to be fast and efficient for analyzing high-dimensional survival data [26]. We here extend the EM coordinate descent algorithm to fit the group spike-and-slab lasso Cox models. We derive the algorithm from the log joint posterior density of the parameters ϑ = (β, γ, θ):

$$ {\displaystyle \begin{array}{l}\log p\left(\beta, \gamma, \theta |t,d\right)\propto \log p\left(t,d|\beta, {h}_0\right)+{\sum}_{j=1}^J\log p\left({\beta}_j|{S}_j\right)\\ {}+{\sum}_{j=1}^J\log p\left({\gamma}_j|{\theta}_g\right)+{\sum}_{g=1}^G\log p\left({\theta}_g\right)\end{array}} $$
(5)

The log-likelihood function, log p(t, d| β, h0), is proportional to the partial log-likelihood pl(β) defined in Eq. (2), or to the Breslow or Efron approximation in the presence of ties [35, 36], if the baseline hazard function h0 is replaced by the Breslow estimator [37, 40]. Therefore, the log joint posterior density can be expressed as

$$ {\displaystyle \begin{array}{l}\log p\left(\beta, \gamma, \theta |t,d\right)\propto pl\left(\beta \right)-{\sum}_{j=1}^J{S}_j^{-1}\left|{\beta}_j\right|+{\sum}_{j=1}^J\left({\gamma}_j\log {\theta}_g+\left(1-{\gamma}_j\right)\log \left(1-{\theta}_g\right)\right)\\ {}\kern8.25em +{\sum}_{g=1}^G\left(\left(a-1\right)\log {\theta}_g+\left(b-1\right)\log \left(1-{\theta}_g\right)\right)\end{array}} $$
(6)

where pl(β) is the partial log-likelihood defined in Eq. (2), and Sj = (1 − γj)s0 + γjs1.

In the EM coordinate descent algorithm, the indicator variables γj are treated as ‘missing values’, and the parameters (β, θ) are estimated by averaging over the posterior distributions of the missing values. In the E-step, the expectation of the log joint posterior density is calculated with respect to the conditional posterior distributions of the missing data. For predictors in group g, the conditional posterior expectation of the indicator variable γj can be derived as

$$ {\displaystyle \begin{array}{l}{p}_j^g=p\left({\gamma}_j=1|{\beta}_j,{\theta}_g,t,d\right)\\ {}=\frac{p\left({\beta}_j|{\gamma}_j=1,{s}_1\right)p\left({\gamma}_j=1|{\theta}_g\right)}{p\left({\beta}_j|{\gamma}_j=0,{s}_0\right)p\left({\gamma}_j=0|{\theta}_g\right)+p\left({\beta}_j|{\gamma}_j=1,{s}_1\right)p\left({\gamma}_j=1|{\theta}_g\right)}\end{array}} $$
(7)

where p(γj = 1| θg) = θg, p(γj = 0| θg) = 1 − θg, p(βj| γj = 1, s1) = de(βj| 0, s1) and p(βj| γj = 0, s0) = de(βj| 0, s0). Therefore, the conditional posterior expectation of \( {S}_j^{-1} \) can be obtained by

$$ E\left({S}_j^{-1}|{\beta}_j\right)=E\left(\frac{1}{\left(1-{\gamma}_j\right){s}_0+{\gamma}_j{s}_1}|{\beta}_j\right)=\frac{1-{p}_j^g}{s_0}+\frac{p_j^g}{s_1} $$
(8)

From Eqs. (7) and (8), we can see that the estimates of pjg and Sj are larger for larger coefficients βj, leading to different amounts of shrinkage for different coefficients.
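Eqs. (7) and (8) involve only scalar operations per coefficient, as in this Python sketch (illustrative; the s0 and s1 values are arbitrary example choices):

```python
import numpy as np

def e_step(beta_j, theta_g, s0=0.05, s1=1.0):
    """E-step of Eqs. (7)-(8): posterior inclusion probability p_j^g and
    the conditional expectation of S_j^{-1} (the lasso penalty factor)."""
    de = lambda b, s: np.exp(-abs(b) / s) / (2.0 * s)  # Laplace density
    slab  = de(beta_j, s1) * theta_g           # p(beta|gamma=1) p(gamma=1)
    spike = de(beta_j, s0) * (1.0 - theta_g)   # p(beta|gamma=0) p(gamma=0)
    p_jg = slab / (slab + spike)               # Eq. (7)
    inv_S = (1.0 - p_jg) / s0 + p_jg / s1      # Eq. (8)
    return p_jg, inv_S
```

A large |βj| pushes pjg toward 1, so the expected penalty factor approaches 1/s1 (weak shrinkage); a near-zero βj pushes it toward 1/s0 (strong shrinkage).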

For the M-step, the parameters (β, θ) are updated by maximizing the expectation of the log joint posterior density, with γj and \( {S}_j^{-1} \) replaced by their conditional posterior expectations. From the log joint posterior density, we can see that β and θ can be updated separately, because the coefficients β are involved only in \( pl\left(\beta \right)-{\sum}_{j=1}^J{S}_j^{-1}\left|{\beta}_j\right| \) and the probability parameters θ only in \( {\sum}_{j=1}^J\left({\gamma}_j\log {\theta}_g+\left(1-{\gamma}_j\right)\log \left(1-{\theta}_g\right)\right)+{\sum}_{g=1}^G\left(\left(a-1\right)\log {\theta}_g+\left(b-1\right)\log \left(1-{\theta}_g\right)\right) \). Therefore, the coefficients β are updated by maximizing the expression:

$$ {Q}_1\left(\beta \right)= pl\left(\beta \right)-{\sum}_{j=1}^J{\widehat{S}}_j^{-1}\left|{\beta}_j\right| $$
(9)

where \( {\widehat{S}}_j^{-1} \) is the conditional posterior expectation of \( {S}_j^{-1} \) derived above. Given the scale parameters Sj, the term \( {\sum}_{j=1}^J{\widehat{S}}_j^{-1}\left|{\beta}_j\right| \) serves as an L1 lasso penalty with \( {\widehat{S}}_j^{-1} \) as the penalty factors, and thus the coefficients can be updated by maximizing Q1(β) using the cyclic coordinate descent algorithm, which is extremely fast and can estimate some coefficients as exactly zero [31, 41]. The probability parameters {θg} are updated by maximizing the expression:

$$ {Q}_2\left(\theta \right)=\sum \limits_{j=1}^J\left[{p}_j^g\log {\theta}_g+\left(1-{p}_j^g\right)\log \left(1-{\theta}_g\right)\right]+{\sum}_{g=1}^G\left(\left(a-1\right)\log {\theta}_g+\left(b-1\right)\log \left(1-{\theta}_g\right)\right) $$
(10)

We can easily obtain:

$$ {\theta}_g=\frac{\sum \limits_{j\in g}{p}_j^g+a-1}{J_g+a+b-2} $$
(11)

where Jg is the number of predictors belonging to group g.
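Under the uniform prior a = b = 1, the update in Eq. (11) reduces to the average inclusion probability within the group. A minimal Python sketch (illustrative):

```python
def update_theta(p_group, a=1.0, b=1.0):
    """M-step update of Eq. (11) for one group:
    theta_g = (sum_{j in g} p_j^g + a - 1) / (J_g + a + b - 2),
    where p_group holds the E-step inclusion probabilities p_j^g
    for the J_g predictors in group g."""
    J_g = len(p_group)
    return (sum(p_group) + a - 1.0) / (J_g + a + b - 2.0)
```

This is what lets information flow within a group: a few clearly non-zero coefficients raise θg, which in turn raises pjg (and lowers the penalty) for the other members of the group at the next E-step.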

In summary, the proposed EM coordinate descent algorithm proceeds as follows:

1) Choose starting values β0 and \( {\theta}_g^0 \); for example, we can initialize β0 = 0 and \( {\theta}_g^0=0.5 \).

2) For t = 1, 2, 3, …,

E-step: Update γj and \( {S}_j^{-1} \) by their conditional posterior expectations.

M-step:

a) Update β using the cyclic coordinate descent algorithm;

b) Update (θ1, ⋯, θG) by Eq. (11).

We assess convergence by the criterion |d(t) − d(t − 1)|/(0.1 + |d(t)|) < ε, where d(t) = −2pl(β(t)) is the estimated deviance at the t-th iteration, and ε is a small value (say 10−5).
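The stopping rule can be written as a small helper (Python sketch; we assume the denominator is 0.1 + |d(t)|, the usual relative-change form of this criterion):

```python
def converged(dev_t, dev_prev, eps=1e-5):
    """Relative-change stopping rule on the deviance d(t) = -2 pl(beta(t)):
    stop when |d(t) - d(t-1)| / (0.1 + |d(t)|) < eps."""
    return abs(dev_t - dev_prev) / (0.1 + abs(dev_t)) < eps
```

The 0.1 in the denominator keeps the criterion well defined when the deviance is near zero.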

Evaluation of predictive performance

Several measures can be used to assess a fitted group lasso Cox model, including the partial log-likelihood (PL), the concordance index (C-index), survival curves, and the survival prediction error [37]. The partial log-likelihood measures the overall quality of a fitted Cox model and is therefore usually used to choose an optimal model [37, 41, 42]. The standard way to evaluate a model is to fit it on one data set and then calculate the above measures on independent data. In the present study we used a variant of cross-validation called pre-validation [31, 43]. The data are randomly split into K subsets of roughly the same size. The model is fitted on (K − 1) subsets, yielding the coefficient estimates \( {\widehat{\beta}}^{\left(-k\right)} \) from the data excluding the k-th subset. The prognostic indices \( {\widehat{\eta}}_{(k)}={X}_{(k)}{\widehat{\beta}}^{\left(-k\right)} \), called the cross-validated or pre-validated prognostic indices, are then calculated for all individuals in the k-th subset. Cycling through all K parts gives cross-validated prognostic indices \( {\widehat{\eta}}_i \) for all individuals, and (\( {t}_i,{d}_i,{\widehat{\eta}}_i \)) is used to compute the measures described above. Because the cross-validated prognostic value for each patient is derived independently of that patient’s observed response, the ‘pre-validated’ data set (\( {t}_i,{d}_i,{\widehat{\eta}}_i \)) can essentially be treated as a ‘new data set’. This procedure provides a valid assessment of the predictive performance of the model [31, 43].
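The pre-validation loop can be sketched as follows (Python, illustrative; `fit_fn` is a placeholder name for any Cox fitter returning a coefficient vector):

```python
import numpy as np

def prevalidated_eta(X, time, status, fit_fn, K=10, seed=1):
    """Pre-validation: for each fold k, fit on the other K-1 folds and
    compute eta_(k) = X_(k) beta_(-k) for the held-out individuals.

    fit_fn(X, time, status) -> coefficient vector (any Cox fitter).
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    fold = rng.permutation(n) % K            # random, roughly equal folds
    eta = np.empty(n)
    for k in range(K):
        held = fold == k
        beta = fit_fn(X[~held], time[~held], status[~held])
        eta[held] = X[held] @ beta           # pre-validated prognostic index
    return eta
```

Because \( {\widehat{\beta}}^{\left(-k\right)} \) never sees fold k, the resulting (ti, di, η̂i) triples behave like an external validation set.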

Moreover, we also evaluated the partial log-likelihood in an alternative way, using the so-called cross-validated partial likelihood (CVPL), defined as [37, 41, 42]:

$$ CVPL={\sum}_{k=1}^K\left[ pl\left({\widehat{\beta}}_{\left(-k\right)}\right)-{pl}_{\left(-k\right)}\left({\widehat{\beta}}_{\left(-k\right)}\right)\right] $$
(12)

where \( {\widehat{\beta}}_{\left(-k\right)} \) is the estimate of β from all the data except the k-th part, \( pl\left({\widehat{\beta}}_{\left(-k\right)}\right) \) is the partial log-likelihood over all the data, and \( {pl}_{\left(-k\right)}\left({\widehat{\beta}}_{\left(-k\right)}\right) \) is the partial log-likelihood excluding part k. By subtracting the partial log-likelihood evaluated on the non-left-out data from that evaluated on the full data, we make efficient use of the death times of the left-out data in relation to the death times of all the data.
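Eq. (12) can be sketched as follows (Python, illustrative; it assumes no tied event times and a generic fitter `fit_fn`, both simplifications of the actual procedure):

```python
import numpy as np

def _pl(beta, X, time, status):
    """Cox partial log-likelihood of Eq. (2), no ties assumed."""
    eta = X @ beta
    return sum(eta[i] - np.log(np.exp(eta[time >= time[i]]).sum())
               for i in np.where(status == 1)[0])

def cvpl(X, time, status, fit_fn, K=10, seed=1):
    """Cross-validated partial likelihood of Eq. (12):
    sum_k [ pl(beta_(-k)) on all data  -  pl_(-k)(beta_(-k)) ]."""
    n = X.shape[0]
    fold = np.random.default_rng(seed).permutation(n) % K
    total = 0.0
    for k in range(K):
        keep = fold != k
        beta = fit_fn(X[keep], time[keep], status[keep])  # fit without fold k
        total += (_pl(beta, X, time, status)
                  - _pl(beta, X[keep], time[keep], status[keep]))
    return total
```

Each summand isolates the contribution of fold k's death times while still ranking them against the full risk sets, which is the point of this construction.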

Selecting optimal scale values

The spike-and-slab double-exponential prior requires two preset scale parameters (s0, s1). Following the previous studies [24,25,26], we set the slab scale s1 to be relatively large (e.g., 1), and consider a sequence of L decreasing values {\( {s}_0^l \)}: \( {s}_1>{s}_0^1>{s}_0^2>\cdots >{s}_0^L>0 \), for the spike scale s0. We then fit L models with scales {\( \left({s}_0^l,{s}_1\right);l=1,\cdots, L \)} and select an optimal model using the method described above. This procedure is similar to the lasso implemented in the widely-used R package glmnet, which quickly fits the lasso Cox models over a grid of values of λ covering its entire range, giving a sequence of models for users to choose from [31, 41].

Implementation and software package

We have implemented the method proposed in this study in the function bmlasso() in our R package BhGLM [44]. The package also includes several functions for summarizing and evaluating predictive performance, such as summary.bh, cv.bh and predict.bh. These functions are very fast, usually taking several minutes to fit and evaluate a model with thousands of variables. The package BhGLM is freely available from https://github.com/nyiuab/BhGLM.

Simulation study and real data analysis

Simulation studies

We assessed the proposed approach by extensive simulations, comparing it with the lasso implemented in the R package glmnet and with several penalization methods that can incorporate group information: sparse group lasso (SGL) in the R package SGL, and overlap group lasso (grlasso), overlap group MCP (grMCP), overlap group SCAD (grSCAD), and overlap group composite MCP (cMCP) in the R package grpregOverlap [45]. Our simulation design was similar to our previous work [26, 27]. We considered five simulation scenarios of different complexity, varying the presence of overlap, the group sizes, the number of non-null groups, and the correlation coefficients (r) (Table 1). In scenarios 2–5, overlapping group structures were considered; to handle them, we duplicated each overlapping predictor into every group it belongs to [28, 30]. In each scenario, we simulated two data sets, using the first as training data to fit the models and the second as test data to evaluate predictive performance. We replicated each simulation 100 times and summarized the results over these replicates. In an additional scenario 6, we varied the effect size of the non-zero coefficient β5 from −2 to 2, with other settings the same as in scenario 2, to examine the profile of the prior scale as the effect size varies.

Table 1 The preset non-zero predictors and their assumed effect values of the different simulation scenarios

Each simulated dataset included n = 500 observations, with a censored survival response yi and a vector of m = 1000 continuous predictors Xi = (xi1, …, xim). We assumed 20 groups, each including about 50 predictors; for example, groups 1 and 2 included variables (x1, …, x50) and (x51, …, x100), respectively. The vector Xi was randomly sampled from the multivariate normal distribution N1000(0, Σ), where the covariance matrix Σ was set to produce the grouped correlation and overlap structures of the different simulation scenarios. Predictors within a group were assumed to be correlated with each other, while predictors in different groups were assumed to be independent. The correlation coefficient r was generally set to 0.5.
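For the non-overlapping case, this correlation structure is a block-diagonal covariance matrix, as in the following Python sketch (illustrative; overlap scenarios are instead handled by the predictor-duplication step described above):

```python
import numpy as np

def block_cov(m=1000, group_size=50, r=0.5):
    """Covariance matrix with correlation r within each non-overlapping
    group of `group_size` predictors and independence across groups."""
    G = m // group_size
    block = np.full((group_size, group_size), r)  # within-group correlation
    np.fill_diagonal(block, 1.0)                  # unit variances
    Sigma = np.zeros((m, m))
    for g in range(G):
        lo = g * group_size
        Sigma[lo:lo + group_size, lo:lo + group_size] = block
    return Sigma
```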

To simulate the censored survival response, following the method of Simon [41], we generated the “true” survival time Ti for each individual from the exponential distribution: \( {T}_i\sim \mathrm{Expon}\left(\exp \left({\sum}_{j=1}^m{x}_{ij}{\beta}_j\right)\right) \) and the censoring time Ci for each individual from the exponential distribution: Ci~Expon(exp(ri)), where ri were randomly sampled from a standard normal distribution. The observed censored survival time ti was set to be the minimum of the “true” survival and censoring times, ti = min(Ti, Ci), and the censoring indicator di was set to be 1 if Ci > Ti and 0 otherwise. Our simulation scenarios resulted in different censoring ratios, but generally below 50%. For all the scenarios, we set eight coefficients to be non-zero and the others to be zero.
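The data-generating steps above can be sketched as follows (Python, illustrative; the paper's simulations were run in R):

```python
import numpy as np

def simulate_survival(n=500, m=1000, beta=None, Sigma=None, seed=1):
    """Simulate censored survival data as described in the text:
    T_i ~ Expon(exp(X_i beta)), C_i ~ Expon(exp(r_i)) with r_i ~ N(0, 1),
    t_i = min(T_i, C_i), and d_i = 1 if C_i > T_i, 0 otherwise."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(m) if beta is None else beta
    X = (rng.multivariate_normal(np.zeros(m), Sigma, size=n)
         if Sigma is not None else rng.normal(size=(n, m)))
    rate_T = np.exp(X @ beta)                 # exponential rates for T_i
    T = rng.exponential(1.0 / rate_T)         # "true" survival times
    C = rng.exponential(1.0 / np.exp(rng.normal(size=n)))  # censoring times
    t = np.minimum(T, C)                      # observed times
    d = (C > T).astype(int)                   # censoring indicators
    return X, t, d
```

Note that NumPy's exponential sampler is parameterized by the scale (mean), hence the reciprocal of the rate.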

Scenario 1: Non-overlap group

In this scenario, the groups were disjoint; there was no overlap among groups. The eight non-zero predictors {x5, x20, x40}, {x210, x220, x240}, {x975, x995} were placed in three groups: groups 1, 5, and 20 (Table 1). Each group contained 50 predictors, as shown below:

Group ID:      1        2         …  5          …  19         20
Group setting: x1−x50   x51−x100  …  x201−x250  …  x901−x950  x951−x1000

Scenario 2: Overlap grouping

In this scenario, an overlapping grouping structure was considered; only the last group was disjoint from the others. For example, five predictors (x46, x47, x48, x49, x50) belonged to both group 1 and group 2. The eight non-zero predictors and their effect sizes were the same as in scenario 1, and the group size was still 50. The overlap structure is shown below:

Group ID:      1        2         3         …  19         20
Group setting: x1−x50   x46−x100  x96−x150  …  x896−x950  x951−x1000

Scenario 3: Varying group sizes

Group size is the number of predictors included in a group; a larger group contains relatively more predictors, and the group size may affect the model fitting. In this scenario, we assumed two groups, groups 1 and 11, containing the non-zero predictors {x1, x2, x3, x4} and {x501, x502, x503, x504}, respectively. Other settings were similar to scenario 2. To investigate the effect of group size on model fitting, we simulated different group sizes as follows:

(1) Only the four non-zero predictors included in groups 1 and 11:

Group ID:      1      2        3         …  11         12         …  19         20
Group setting: x1−x4  x5−x100  x96−x150  …  x501−x504  x505−x600  …  x896−x950  x951−x1000

(2) 20 predictors included in groups 1 and 11:

Group ID:      1       2         3         …  11         12         …  19         20
Group setting: x1−x20  x21−x100  x96−x150  …  x501−x520  x521−x600  …  x896−x950  x951−x1000

(3) 50 predictors included in groups 1 and 11:

Group ID:      1       2         3         …  11         12         …  19         20
Group setting: x1−x50  x46−x100  x96−x150  …  x501−x550  x546−x600  …  x896−x950  x951−x1000

Scenario 4: Varying the number of non-null groups

The true non-zero predictors may be contained in a few groups, while all other groups contain only zero predictors; groups containing non-zero predictors are called non-null groups. The number of non-null groups may also affect the model fitting. To evaluate this effect, we varied the number of non-null groups as follows:

(1) 8 non-null groups containing the non-zero coefficients {x5}, {x55}, {x305}, {x355}, {x505}, {x555}, {x905}, and {x955};

(2) 3 non-null groups containing the non-zero coefficients {x5, x15, x25}, {x355, x365, x375}, and {x905, x915};

(3) 1 non-null group containing the non-zero coefficients {x5, x10, x15, x20, x25, x30, x35, x40}.

The overlap settings were the same as in scenario 2. The group numbers and effect sizes of these non-zero coefficients are shown in Table 1.

Scenario 5: Varying the correlation within group

To evaluate the effect of within-group correlation, we set different correlation coefficients within a group: r = 0.0, 0.5, and 0.7. Other settings were the same as in scenario 2.

Scenario 6: Self-adaptive shrinkage on varying the effect size

A significant feature of the proposed spike-and-slab prior is its self-adaptive shrinkage. To demonstrate this property, we performed an additional simulation study based on scenario 1. We fixed the prior scales (s0, s1) = (0.02, 1) and varied the effect size of the first simulated non-zero predictor (x5) from −2 to 2. We recorded the scale parameters for this non-zero predictor (x5), for a nearby zero-effect predictor (x6), and for the non-zero predictor (x20) with simulated effect size −0.7; these three predictors belong to the same group.

Real data analysis

We applied the proposed gsslasso Cox model to three real datasets: ovarian cancer (OV), lung adenocarcinoma (LUAD), and breast cancer. The whole-genome expression data were downloaded from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) (updated June 2017). First, we cleaned the data to obtain clear survival information and the genes to be used in further analysis; details of the three datasets and the cleaning procedure are described in the paragraphs below. Second, several genome annotation tools were used to construct the pathway information. All genes were mapped to KEGG pathways using the R/Bioconductor packages mygene, clusterProfiler and AnnotationDbi [46]: mygene was used to convert gene names to ENTREZ gene IDs; clusterProfiler was used to obtain pathway/group information from the ENTREZ IDs; and AnnotationDbi was used primarily to create mapping objects that allow easy access from R to underlying annotation databases, such as KEGG in the present study. Using these packages, we mapped the genes into pathways and obtained the group structure for further analysis; only genes included in pathways were used. Third, the proposed method and the penalization approaches used in the simulation study above were applied to the survival data with thousands of genes and pathway/group information. We performed 10-fold cross-validation with 10 replicates to evaluate the predictive values of the models. After model fitting, the genes with non-zero coefficient estimates were taken as the detected genes.

TCGA ovarian cancer dataset (mRNA sequencing data)

This dataset contains mRNA expression data and the relevant clinical outcomes for ovarian cancer (OV) from TCGA. The raw expression dataset includes 304 patients and 20,503 genes after removing duplicated and unknown gene names. The raw clinical data included 586 patients; after cleaning the clinical survival data from several clinical files, 582 patients had clear survival information. Merging the individuals with both gene expression data and survival information yielded 304 patients with 20,503 genes for further analysis. First, we filtered out genes with expression values less than 10. Then, genes with more than 30% zero expression values were removed. Furthermore, we calculated the coefficient of variation (CV) of the expression values for each gene, and kept the genes with CV larger than the 20% quantile. After these steps, 304 patients with 14,265 genes were included in our analysis. The censoring ratio was 39.5%. We mapped these genes to 271 pathways including 4260 genes.
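The three filtering steps, which are applied identically to all three datasets, can be sketched as follows (Python, illustrative; the 'expression values less than 10' rule is interpreted here as a mean-expression cutoff, which is our assumption since the summary statistic is not stated):

```python
import numpy as np

def filter_genes(expr):
    """Sketch of the three gene-filtering steps (expr: patients x genes).
    Returns a boolean mask over genes."""
    keep = expr.mean(axis=0) >= 10               # step 1: low expression
    keep &= (expr == 0).mean(axis=0) <= 0.30     # step 2: > 30% zeros
    cv = expr.std(axis=0) / np.maximum(expr.mean(axis=0), 1e-12)
    keep &= cv > np.quantile(cv[keep], 0.20)     # step 3: low-variability CV
    return keep
```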

TCGA lung adenocarcinoma dataset (mRNA sequencing data)

The raw expression data contain 578 patients and 20,530 genes; after removing duplicated and unknown gene names, 516 patients with 20,501 genes remained for further analysis. The raw clinical data included 521 patients; cleaning for clear survival records left 497 patients. Merging the clinical and expression data gave 491 patients with 20,501 genes for quality control. As for the ovarian cancer dataset, we filtered out genes with expression values less than 10, removed genes with more than 30% zero expression values, and kept the genes whose coefficient of variation (CV) exceeded the 20% quantile. After these steps, 491 patients with 14,143 genes were included in our analysis. The censoring ratio was 68.4%. We mapped these genes to 274 pathways including 4266 genes.

TCGA breast cancer dataset (mRNA sequencing data)

The raw expression data contain 1220 patients and 20,530 genes; after removing duplicated and unknown gene names, 1097 patients with 20,503 genes remained for further analysis. The raw clinical data included 1097 patients; cleaning for clear survival records left 1084 patients. Merging the clinical and expression data gave 1082 patients with 20,503 genes for quality control. Applying the same steps to the breast cancer dataset, we filtered out genes with expression values less than 10, removed genes with more than 30% zero expression values, and kept the genes whose coefficient of variation (CV) exceeded the 20% quantile. After these steps, 1082 patients with 14,077 genes were included in our analysis. The censoring ratio was 86.0%. We mapped these genes to 275 pathways including 4385 genes.

Results

Simulation results

Predictive performance

Tables 2 and 3 summarize the CVPL (cross-validated partial likelihood) and C-index in the test data over 100 replicates for scenarios 1–5. The group spike-and-slab lasso Cox model performed similarly to cMCP and outperformed the other methods under the different simulation scenarios, suggesting that the proposed method performs well even with complex group structures.

Table 2 Estimates of two measures over 100 replicates under simulation scenario 1 and 2
Table 3 Estimates of two measures over 100 replicates for varying group size and varying number of non-null group under simulation scenario 3,4 and 5, respectively

Accuracy of parameter estimates

To evaluate the accuracy of parameter estimation, we summarized the average numbers of non-zero coefficients and the mean absolute errors (MAE) of the coefficient estimates, defined as MAE = \( \sum \left|{\widehat{\beta}}_j-{\beta}_j\right|/m \), in Tables 4 and 5 for the different scenarios. The detected numbers of non-zero coefficients were very close to the preset number of 8, and the MAE values were very small for the proposed method under all scenarios. The group spike-and-slab lasso Cox and cMCP consistently outperformed the other methods in all five scenarios, and the proposed method was slightly better than cMCP. These results suggest that the proposed method yields the fewest false positives and nearly unbiased estimates.

Table 4 Average number of non-zero coefficients and mean absolute error (MAE) of coefficient estimates over 100 simulations for scenario 1 and 2
Table 5 Average number of non-zero coefficients and mean absolute error (MAE) of coefficient estimates over 100 simulations for scenario 3, 4, and 5

The estimates of the coefficients from the group spike-and-slab lasso Cox and the other methods over 100 replicates are shown in Fig. 1 and Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3, Additional file 4: Figure S4, Additional file 5: Figure S5, Additional file 6: Figure S6 and Additional file 7: Figure S7 for the different scenarios. The group spike-and-slab lasso Cox method produced estimates close to the simulated values for all the coefficients. This is expected, because the spike-and-slab prior induces weak shrinkage on larger coefficients and strong shrinkage on zero coefficients. In contrast, the other methods except cMCP applied strong shrinkage to all the coefficients, so that the non-zero coefficients were shrunk toward zero and underestimated relative to their true values. In addition, higher false-positive rates (grey bars) were observed for all methods except the group spike-and-slab lasso Cox and cMCP.

Fig. 1

The parameter estimates averaged over 100 replicates for the group spike-and-slab lasso Cox (gsslasso), lasso, grlasso, grMCP, grSCAD, SGL and cMCP methods for Scenario 1. Blue circles are the simulated non-zero values. Black points and lines represent the estimated values and the interval estimates of the coefficients

The self-adaptive shrinkage feature

To show the self-adaptive shrinkage feature, we performed simulation 6. Figure 2 shows the adaptive shrinkage on the non-zero coefficient x5 as its effect size varies. It clearly shows that the proposed spike-and-slab lasso Cox approach is self-adaptive and flexible, without affecting the nearby zero coefficient (x6) or the non-zero variable (x20) belonging to the same group.

Fig. 2

The adaptive shrinkage amount as the effect size of x5 varies. x20 is a non-zero coefficient with a simulated effect size of −0.7, while x6 is a nearby coefficient with a simulated zero effect size

Real data analysis results

About one third of the genes were mapped to pathways for the above three real datasets. The remaining genes were put together as an additional group. Detailed information on genes shared by different pathways is listed in Additional file 8: S1, S2, and S3, for ovarian cancer, lung cancer and breast cancer, respectively.

The goal of the real data analysis is to build a survival model for predicting the overall survival outcome by integrating gene expression data and pathway information. Prior to fitting the models, we standardized all predictors to a common scale using the covariate() function in the BhGLM package. The prior distribution involves two preset parameters, (s0, s1). In our real data analysis, we fixed the slab scale s1 at 1 and varied the spike scale s0 over the values {k × 0.01; k = 1, …, 9}, leading to 9 models. The optimal spike scale s0 was selected by 10-fold cross-validation with 10 replicates according to the CVPL. Using the optimal s0, we performed the further real data analysis. For comparison, we also analyzed the data using the several existing methods described in the simulation studies.
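The tuning procedure just described can be sketched as follows. This is a hypothetical Python illustration; `toy_cvpl` stands in for the CVPL values that would come from actually fitting the gsslasso Cox model by 10-time 10-fold cross-validation in BhGLM.

```python
# Grid of candidate spike scales: {k * 0.01 ; k = 1, ..., 9}
s0_grid = [round(k * 0.01, 2) for k in range(1, 10)]

def select_spike_scale(cvpl_by_s0):
    """Pick the spike scale s0 maximizing the cross-validated
    partial log-likelihood (CVPL); a larger CVPL means a better fit."""
    return max(cvpl_by_s0, key=cvpl_by_s0.get)

# Toy CVPL values for illustration only (peaked at s0 = 0.05);
# real values come from repeated cross-validated model fits.
toy_cvpl = {s0: -100.0 - 50.0 * abs(s0 - 0.05) for s0 in s0_grid}
best_s0 = select_spike_scale(toy_cvpl)
```

Each candidate s0 defines one model; the model with the largest CVPL is carried forward to the final analysis.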

We performed 10-fold cross-validation with 10 replicates to evaluate the predictive performance of the models. Table 6 summarizes the measures of prognostic performance on the three data sets, using only the genes mapped to pathways. For all data sets, the proposed group spike-and-slab lasso Cox model performed better than the other methods. Additional file 9 shows the corresponding measures when all genes were used; genes not mapped to any pathway were put together as an additional group. The prediction performance of the proposed method was still better than that of the other methods.

Table 6 The measures of the optimal group spike-and-slab lasso (gsslasso) Cox and the lasso Cox models for the TCGA ovarian cancer, lung adenocarcinoma (LUAD) and breast cancer datasets with pathway genes by 10-time 10-fold cross-validation

The pathway enrichment analyses for the detected genes are summarized in Additional file 10: S4-S6. Additional file 11: S7 presents the genes detected by the proposed and existing methods; their standardized effect sizes are also plotted for the three real data sets. Many genes were detected in common by different methods. For the ovarian cancer dataset, the genes CYP2R1 and HLA-DOB detected by the proposed gsslasso method were also detected by both the lasso and cMCP methods. For the lung cancer dataset, several genes detected by the proposed gsslasso method, such as VDAC1, EHHADH, ACAT2, KIT, CCND1, PIK3R1, NRAS, GNPNAT1, and KYNU, were also detected by other methods. For the breast cancer dataset, two genes, HSPA1A and ABCB5, detected by the proposed gsslasso method were also detected by other methods.

We found that most of the genes detected by the proposed method have been associated with cancers in previous studies. HABP2, detected in ovarian cancer, has been associated with familial nonmedullary thyroid cancer [47]. CYP24A1, the main enzyme responsible for the degradation of active vitamin D, plays an important role in many cancer-related cellular processes, and the associations between CYP24A1 polymorphisms and cancer susceptibility have been evaluated in many studies [48]. Keratin 8 (KRT8) plays an essential role in the development and metastasis of multiple human cancers; a recent study suggested that in clear cell renal cell carcinoma and gastric cancer, KRT8 upregulation promotes tumor metastasis and is associated with poor prognosis [49, 50]. E2F7, detected in lung cancer, has been implicated in several cancer studies and might act as an independent prognostic factor for breast cancer, squamous cell carcinoma, and gliomas [51,52,53]. Most of the genes detected by the proposed method in the three real datasets have been found to be associated with different cancers. These results may provide useful information for further studies.

Discussion

Group structures among various features arise naturally in biological and medical research, especially in large-scale omics data. Such grouping information is biologically meaningful and intrinsically encoded in the biological data, so it is desirable to incorporate it into data analysis. Various penalization methods have been designed for such situations [13, 14, 30, 54, 55]. Recently, we developed a novel hierarchical modeling approach, the group spike-and-slab lasso GLMs, to integrate variable group information for gene detection and prognostic prediction [27]. In this study, we extended the method to the Cox proportional hazards model for analyzing censored survival data.

Similar to the group spike-and-slab lasso GLMs, the key to our group spike-and-slab lasso Cox is the group spike-and-slab double-exponential prior. This prior has significant advantages in variable selection and parameter estimation: it induces weak shrinkage on larger coefficients and strong shrinkage on irrelevant coefficients. In contrast, other methods usually apply strong shrinkage to all coefficients, so that non-zero coefficients are over-shrunk and underestimated. The proposed group spike-and-slab prior allows the model to incorporate the biological similarity of genes within the same pathway into the analysis.
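As a sketch, the hierarchical form of such a prior can be written as follows (a standard spike-and-slab lasso formulation shown for orientation; the exact hyperpriors used in our model are given in the Methods, and DE denotes the double-exponential distribution):

\[
\beta_j \mid \gamma_j \sim \mathrm{DE}\left(0,\; (1-\gamma_j)\, s_0 + \gamma_j\, s_1\right), \qquad
\gamma_j \mid \theta_g \sim \mathrm{Bernoulli}(\theta_g), \qquad
\theta_g \sim \mathrm{Uniform}(0, 1),
\]

with \( s_0 < s_1 \), so that \( \gamma_j = 1 \) assigns a coefficient to the weakly shrinking slab and \( \gamma_j = 0 \) to the strongly shrinking spike, while the group-specific probability \( \theta_g \) is shared by all variables in group \( g \).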

The spike-and-slab prior depends on the spike and slab scale parameters. Our previous studies suggested that the slab scale s1 has little influence on model fitting, while the spike scale s0 strongly affects model performance [25, 26]. A slab scale s1 that introduces a weak shrinkage amount helps include relevant variables in the model; therefore, we set s1 = 1 in our analysis. We evaluated the performance of the proposed model on a grid of spike scale s0 values from a reasonable range, e.g., (0, 0.1), and then selected an optimal value using cross-validation. This is a path-following strategy for fast dynamic posterior exploration of the proposed models, similar to the approach of Ročková and George [24, 56]. Additional file 12: Figure S8 a and b show the solution paths under Scenario 2 for the proposed model and the lasso Cox model. Additional file 12: Figure S8 c and d show the profiles of cross-validated partial log-likelihood by 10-fold cross-validation for the proposed model; these profiles help choose the optimal tuning parameters. Similar to the lasso, the spike-and-slab lasso Cox follows a path-following strategy for fast dynamic posterior exploration. However, its solution path is essentially different from that of the lasso. For the lasso Cox model, the number of non-zero coefficients can be small, even zero, if a strong penalty is adopted. In the spike-and-slab lasso Cox model, however, larger coefficients are always included in the model with weak shrinkage, while irrelevant coefficients are removed (grey path in Additional file 12: Figure S8 a).

Another feature of the proposed spike-and-slab prior is bi-level selection: it is capable of selecting important groups as well as important individual variables within those groups. Several methods perform bi-level selection, including the cMCP method [16], SGL [14], and the group exponential lasso [18]. The underlying assumption is that the model is sparse at both the group and individual variable levels. The proposed group spike-and-slab lasso Cox model can efficiently perform bi-level selection. At the group level, the importance of a group is controlled by the group-specific probability θg. Within a group, the spike-and-slab prior performs variable selection by shrinking irrelevant or small-effect coefficients exactly to zero, without affecting the prediction performance.

The extensive simulation studies show that the prediction performance of the proposed method is always slightly better than that of the cMCP method, and significantly better than all the other methods under different scenarios. In the real data analysis, the prediction accuracy of the proposed method incorporating pathway information was only slightly improved compared with the existing methods. This might be mainly due to the complex genetic components involved in the expression data, such as haplotype blocks, subnetworks, and interactions among genes, which the present model, under the linear assumption, may not capture. More sophisticated strategies, such as defining more precise biological grouping information, could potentially enhance prediction accuracy and further improve the models.

There are several possible extensions of the proposed method. For example, it can be extended to incorporate a multi-level group structure, such as a three-level SNP-gene-pathway structure. In addition, the proposed model uses the spike-and-slab mixture double-exponential prior; other priors with a spike at zero and heavier tails, such as the Cauchy distribution (a special case of the Student-t distribution), could be investigated. The theoretical and empirical properties of other priors differ and may yield further interesting results.

Conclusion

Incorporating biological group structure into high-dimensional molecular data analysis can improve the accuracy of disease prediction and the power of gene detection. We propose a new hierarchical Cox model, gsslasso Cox, that incorporates biological pathway information for predicting disease survival outcomes and detecting associated genes. We develop a fast and stable deterministic algorithm to fit the proposed models. Extensive simulation studies and real applications show that, compared with several existing methods, the proposed approach provides more accurate parameter estimation and survival prediction. The proposed method has been implemented in the freely available R package BhGLM.