Background

The proportional hazards (PH) model [1, 2] has been widely used for predicting time-to-event outcomes. The partial likelihood estimation of the PH model, however, is not appropriate for exploring the simultaneous relationship of the high dimensional variables with outcomes. For this reason, variable selection approaches, such as the least absolute shrinkage and selection operator (LASSO) [3], adaptive LASSO (ALASSO) [4], and a specific extension of random survival forests (RSF) [5], have been widely used to select the informative variables in a high dimensional setting [6,7,8,9]. In addition, LASSO, ALASSO, and RSF have been extended to time-to-event endpoints that are subject to censoring. Although these methods are capable of reducing the number of variables in high dimensionality, Fan et al. [10] and Zhao and Li [11] proposed methods, such as the sure independence screening (SIS) [10], the iterative SIS (ISIS) [10], and the principled SIS (PSIS) [11], to expedite computing time and improve estimation accuracy in a ultra-high dimensional setting. Furthermore, Fan et al. define ultra-high dimensionality by the exponential growth of the dimensionality in the sample size (that is, log(p) = O(na) for some a ∈ (0,0.5)) [9].

The SIS or PSIS aims to reduce the ultra-high dimensional space to a manageable subset which encompasses the important variables. We define important variables as those in which the true coefficients are not equal to zero marginally on the survival outcomes, assuming a linear association between the log-hazard function and the variables in the proportional hazard model. SIS or PSIS can select important variables, just as LASSO or ALASSO. When SIS or PSIS is combined with LASSO or ALASSO, however, SIS or PSIS is applied prior to running LASSO or ALASSO in the same way as employed in [9]. To emphasize the processing order among the methods, we broadly classify these methods into two groups: screening approaches which follow the genomic literature (SIS and PSIS) and variable selection methods (LASSO and ALASSO). Fan et al. [9, 10] theoretically proved that SIS is highly likely to choose all the important variables and showed that SIS and its variants are more efficient in excluding unimportant variables than LASSO alone.

We chose LASSO, ALASSO, and RSF due to their wide application in genomic medicine in selecting informative variables in high dimensional settings. Fan et al. [9, 10] showed that LASSO selects too many uninformative variables and thus the authors proposed to combine SIS with LASSO to efficiently exclude uninformative variables while keeping all the informative ones. Thus, the concept of implementing SIS, ISIS, and PSIS in high dimensional space before running ALASSO and RSF was innovative [9,10,11]. Moreover, statisticians who are not familiar with SIS, ISIS, and PSIS may want to consider using these variable methods as alternatives to select important variables and ultimately develop prognostic models with higher predictive accuracy in high dimensional space.

We are interested in identifying important germline single-nucleotide polymorphisms (SNPs) and integrating them into a validated prognostic model of overall survival in patients with metastatic prostate cancer. The data included 498,081 SNPs processed from blood samples from 623 Caucasian patients with prostate cancer after passing quality control. While the abovementioned variable selection approaches have theoretical justification in selecting important SNPs, the comparison and the performance of these combined methods have not been studied in ultra-high dimensional space with time-to-event endpoints. In other words, it is unknown if the combined variable selection methods would perform well in a context of an ultra-high number of variables. Our primary goal is to compare the performance of the methods in predicting a time-to-event outcome via a simulation study with 100,000 (100 k) variables. There is a plethora of variable selection methods, but it is not always apparent which approach is best to use in identifying important variables in ultra-high dimensional space. Thus, another purpose of this paper is to provide applied statisticians with a basic understanding of these combined methods and to make recommendations on how best to use them.

The rest of this paper is organized as follows. We provide an overview of the widely used variable selection methods in the “Methods” section while offering a concise discussion of their advantages and disadvantages. In the “Results” section, we describe the design of the simulation studies and the comparison among the four single variable selection methods and the nine combinations of the variable selection methods. An ultra-high dimensional example is used where the data are analyzed using a combination of methods to identify SNPs that will predict overall survival time for metastatic prostate cancer patients (“Discussion” section). The results of the simulations and recommendations are discussed in the “Conclusion” section.

Methods

Let p be the number of variables. A linear association between the log-hazard function and the p-dimensional variables is assumed in the PH model as follows:

$$ h\left(t|x\right)={h}_0(t){e}^{\boldsymbol{x}\boldsymbol{\beta }} $$
(1)

where h(t| x) is a hazard function of time-to-event t, provided that each of the p variables, x = (x1, x2, …, xp ), is a vector with sample size n, for the xi = (xi1, xi2, …, xin ) for i = 1, …, p. The parameters β = (β1, β2, …, βp) and h0(t) are an arbitrary parameter-free baseline hazard function. We assume that the important variables are associated with the time-to-event based on (1). We consider the variable selection methods in order to only choose important variables of clinical outcomes, as presented in Fig. 1.

Fig. 1
figure 1

Overall diagram of the screening approaches and the variable selection methods

Screening approaches

We refer to SIS, ISIS, and PSIS as screening or variable selection approaches, and these words are used interchangeably. Fan et al. [10] introduced the SIS to decrease the number of variables p from the ultra-high to a reduced subset size m for time-to-event outcomes. They choose m as ≤⌊n/ log(n)⌋ or ⌊n/(4 log(n))⌋ depending on the different variants of SIS. The former is known as “aggressive” SIS, and the latter is named “vanilla” SIS. We consider the aggressive SIS since its performance is shown to be more effective in high dimensional setting [10]. The aggressive SIS starts with randomly splitting the sample into two partitions, n1 and n2, then SIS computes the marginal correlation between a single variable and the survival outcome within each partition. This is called marginal utility and is denoted as uk, 1 and uk, 2 for k = 1, …, p. The utility is obtained by maximizing the partial likelihood of each variable in the following way:

$$ {\displaystyle \begin{array}{c}{u}_{k,1}={\max}_{\beta_k}\left({\sum}_{i\in {n}_1}{\delta}_i{x}_{ik}{\beta}_k-{\sum}_{i\in {n}_1}{\delta}_i\log \left\{{\sum}_{j\in R\left({y}_i\right)}\exp \left({x}_{jk}{\beta}_k\right)\right\}\right),\\ {}{u}_{k,2}={\max}_{\beta_k}\left({\sum}_{i\in {n}_2}{\delta}_i{x}_{ik}{\beta}_k-{\sum}_{i\in {n}_2}{\delta}_i\log \left\{{\sum}_{j\in R\left({y}_i\right)}\exp \left({x}_{jk}{\beta}_k\right)\right\}\right),\end{array}} $$
(2)

in which δi is the censoring indicator, xik is the kth element among the p variables, and R(yi) is the risk set right before the time yi, i.e., R(yi) = {j; yj ≥ yi}. By ordering all variables in increasing order of their corresponding marginal utilities, u(1) < u(2) < … < u(p − 1) < u(p), one selects the top s \( \left(=\left\lfloor \frac{n}{\log (n)}\right\rfloor \right) \) ranked variables for each partition. The details are presented in Fig. 2. Let I1 and I2 be the index sets of the selected s variables and are expressed as

$$ s=\left|{I}_1=\left\{1\le k\le p;{u}_{k,1}\ge {\gamma}_{s,1}\right\}\right|=\left|{I}_2=\left\{1\le k\le p;{u}_{k,2}\ge {\gamma}_{s,2}\right\}\right|, $$
(3)

where γs, 1 and γs, 2 are the cutoff values depending on s for each partition. The asymptotic probability that I1 ∩ I2 contains all the important variables is 1. This is the reason why this method is called sure screening.

Fig. 2
figure 2

Process of the aggressive variant of sure independence screening (SIS)

The SIS is, however, unable to select important variables that are weakly and marginally associated with right-censored outcomes. Furthermore, I1 ∩ I2 is most likely to retain unimportant variables that are highly correlated with the important variables [10]. To reduce the false negative and false positive rates, Fan et al. [10] also proposed ISIS and its aggressive variant. The iterative SIS begins similarly to SIS in the first iteration and derives the first subset, O1 containing variables selected from a variable selection method. Let m1 = |O1| denote the first subset size. The second iteration repeats SIS but with the (p − m1) variables and calculates the conditional utility for each partition given m1 variables. The selection method is again applied to the union set of m variables and O1 from which the second subset O2 is driven. The process runs iteratively until Oj − 1 = Oj, for some j which is the number of iterations or until a pre-specified number of iterations has been reached.

Despite the high probability for ISIS to select the important variables, there is no theoretical justification for choosing the cutoff values γs, 1 and γs, 2. The lack of theoretical explanation motivates Zhao and Li [11] to develop the principled SIS for time-to-event endpoints. Furthermore, Zhao and Li provide the mathematical reasoning for the cutoff value γm by using the standard normal distribution and hence controlling the false positive rate. The PSIS requires the maximum partial likelihood estimates (MPLE) of the coefficients to be written as:

$$ {\widehat{\beta}}_k=\underset{\beta_k}{\mathrm{argmax}}\left({\sum}_{i=1}^n{\delta}_i{x}_{ik}{\beta}_k-{\sum}_{i=1}^n{\delta}_i\log \left\{{\sum}_{j\in R\left({y}_i\right)}\exp \left({x}_{jk}{\beta}_k\right)\right\}\right). $$
(4)

Once the false positive rate qm is chosen, the cutoff value γm will be the quantile of the standard normal distribution, i.e., γm = Φ−1(1 − qm/2), assuming that the asymptotic null distribution of \( {I}_k{\left({\widehat{\beta}}_k\right)}^{1/2}{\widehat{\beta}}_k \) is N(0, 1), where Ik is the information matrix. Hence, the m variables are chosen in the following manner:

$$ m=\left|1\le k\le p;{I}_k{\left({\widehat{\beta}}_k\right)}^{\frac{1}{2}}|{\widehat{\beta}}_k|\ge {\varPhi}^{-1}\left(1-{q}_m/2\right)\ \right|. $$
(5)

Variable selection methods

It is possible that the subset filtered by SIS or PSIS includes not only the important p variables but also several of the unimportant (p − p) variables. To remove the falsely selected variables, we employ some of the most widely used penalized regression methods in current clinical research, such as the LASSO and the ALASSO. Tibshirani [3] proposed LASSO with the L1 penalty while taking the survival analysis framework into consideration. The efficient computational algorithm is conducted in Simon et al. [12] by maximizing a scaled log partial likelihood of the PH model. The penalized maximum likelihood estimator is given by

$$ \widehat{\beta}=\underset{\beta }{\mathrm{argmax}}\left[\frac{2}{n}l\left(\beta \right)-\lambda {P}_{\alpha}\left(W,\beta \right)\right], $$
(6)

where

$$ {P}_{\alpha}\left(W,\beta \right)=\alpha {\left\Vert W\beta \right\Vert}_1+\frac{1}{2}\left(1-\alpha \right){\left\Vert \beta \right\Vert}_2^2. $$
(7)

Note that the R library glmnet scales the log partial likelihood by a factor of 2/n for convenience. LASSO evaluates (6) with a scalar α = 1 and a matrix W = Ip in which Ip is the p-dimensional identity matrix in (7). ALASSO identically plugs in α = 1, but instead, a diagonal matrix, W, with \( 1/\mid \overset{\sim }{\beta}\mid \) as the diagonal entries in which \( \overset{\sim }{\beta }={\left({\overset{\sim }{\beta}}_1,{\overset{\sim }{\beta}}_2,\dots, {\overset{\sim }{\beta}}_p\right)}^{\prime } \) is a vector containing the MPLE of the log partial likelihood l(β). The coefficient estimates are obtained corresponding to the value of λ that minimizes the mean cross-validated partial likelihood error in this paper. To be specific, 10-fold cross-validation was used. Only variables with non-zero coefficient estimates are finally selected by the penalized algorithm.

Unlike the penalized regression approaches, the random survival forest (RSF) is a non-parametric method and an extension of Breiman’s random forests [13] for analyzing right-censored outcomes. These methods are known for their ability to easily deal with nonlinear effects, correlated variables, and variable interactions [13]. We implement minimal depth thresholding, which stems from the idea of observing the most important variables that are located on the upper nodes in a tree. Ishwaran et al. [14] provide the null distribution of the minimal depth of a maximal subtree that is given by

$$ P\left({D}_v=D(S)|v\ \mathrm{is}\ \mathrm{noisy},{l}_0,\dots, {l}_{D(S)-1}\right)=1-{\sum}_{d=0}^{D(S)-1}{\left(1-\frac{1}{p}\right)}^{L_d}\left(1-{\left(1-\frac{1}{p}\right)}^{l_d}\right), $$
(8)

where Dv is the minimal depth for a given variable v, D(S) is the depth of the tree S, ld is the number of nodes at depth d, and \( {L}_d=\sum \limits_{i=0}^{d-1}{l}_i \). According to Ishwaran et al. [14], a variable v is chosen if its forest averaged minimal depth, Dv = dv, is less than or equal to the mean minimal depth of Dv under the null distribution of (8). Thus, we build final models with the variables meeting the (mean) minimal depth thresholding.

Simulation studies

We conducted a series of simulations to compare the performance of the combined variable selection approaches in the ultra-high dimensional setting within the survival framework. We used single variable selection methods as references and assumed a linear relationship between the log-hazard function and the significant variables via the non-zero constant coefficients. In addition, we assumed that the baseline hazard function had a value of 1. We considered p to be six because most articles in the context of the ultra-high dimensional data assumed six or less important variables in their simulation studies as these methods are computationally intensive [10, 15, 16]. Another reason for considering a small number of important variables is that many genome-wide association (GWAS) analysis identified only a few SNPs that are clinically important [17,18,19,20].

We limited p to 100 k, each variable having three classes since they were similar to the SNP genetic data that served as our motivation example. The random samples were generated assuming that the variables are mutually independent and identically distributed (IID). The IID variables were randomly and independently sampled from a multinomial distribution being specified with probabilities for the three classes, (0, 1, 2) as ((1 − q)2, 2q(1 − q), q2), respectively, which followed the Hardy-Weinberg equilibrium (HWE) [21, 22]. In the context of genetic data, q is the minor (risk) allele frequency (MAF) of a SNP and was assumed to represent 0.15 of SNPs. The MAF refers to the frequency of the allele with frequency no more than 50% in a population [22]. The three probabilities were formed given that the pairs of alleles were known to be independent, and hence, there was no deviation from the HWE.

The true regression coefficient values β were randomly chosen from (−1)u(a+| z| ), where u was sampled from a Bernoulli distribution having parameter 0.5, z generated from the standard Gaussian distribution. We followed Fan et al. [9] and set a = 1 and a = 2 for weak and strong signal strength, respectively. We considered two scenarios for the true values of β.

Strong signal:

$$ \boldsymbol{\beta} ={\left(2.527186,2.443898,2.152147,-2.388758,2.156502,-2.003314\right)}^{\prime }, $$

Weak signal:

$$ \boldsymbol{\beta} ={\left(1.036478,-1.073296,-1.250946,1.138729,-1.128361,1.145263\right)}^{\prime }. $$

The true coefficients were multipliers of the first six variables, and hence, the other 99,994 variables had zero coefficients. Failure times were assumed following a Weibull distribution. The hazard function hw(t| x) was the function of the variables from a Weibull distribution, and it is expressed as

$$ {h}_w\left(t|x\right)=\tau {t}^{\tau -1}\times \mathit{\exp}\left\{\tau \left({x}_1{\beta}_1+{x}_2{\beta}_2+{x}_3{\beta}_3+{x}_4{\beta}_4+{x}_5{\beta}_5+{x}_6{\beta}_6\right)\right\}, $$
(9)

where τ was a shape parameter of the Weibull distribution. The failure times were randomly drawn from the Weibull distribution assuming τ=1.5. Also, we set the inverse of exp{τ(x1β1 + x2β2 + x3β3 + x4β4 + x5β5 + x6β6)} as the scale parameter.

The censoring rate C was specified as 20% across the simulation studies. Censored times were assumed to follow a uniform distribution on [0, θc]. For each specific censoring rate, θc was determined using the Newton-Raphson iteration (Appendix (A3) and (A5) of [23]). The failure time, Ti, for i = 1, …, n, was randomly generated from (9), whereas the censoring times Ci, for i = 1, …, n was randomly sampled from uniform distribution. Then, the observed failure times and censoring indicators were ti = min(Ti, Ci) and δi = I(Ti ≤ Ci), respectively. We considered sample sizes of 150 and 300.

Because of the intensive computing time (100 k variables), we generated 300 data sets for each simulation scenario. Single variable selection approaches (SIS, PSIS, LASSO, and ALASSO) were examined as references. In addition, nine combinations of the variable selection methods were considered. These are ISIS-LASSO, SIS-LASSO, PSIS-LASSO, ISIS-ALASSO, SIS-ALASSO, PSIS-ALASSO, ISIS-RSF, SIS-RSF, and PSIS-RSF. It is important to note that it is not possible to run ISIS alone as it works simultaneously with SIS in combination with a variable selection method. We sought to run RSF by itself; however, it had insufficient memory. We used the default maximum number of iteration (five) from the SIS library in R version 0.6 when running the ISIS combinations. To expedite computing time in running RSF, we set the number of trees in the forest and the maximum number of random split points to be 100 and 10, respectively, to increase computing speed. The split points split for a variable. For PSIS, we set the false positive rate at 0.001 (=qm from (5)) over all simulation data sets as this will limit PSIS from selecting hundreds of variables in a reduced subset.

Prognostic models were developed based on the training datasets, whereas independent testing sets were generated from the underlying joint distribution of the variables. The testing sets had much larger sample sizes (n = 2000) than the training sets, as suggested by Hothorn et al. [24]. The above variable selection approaches were compared for their predictive ability using overall performance, discrimination, and calibration. For the overall performance, we picked up Graf et al. \( {R}_{BS}^2 \) [25], which assesses the percentage gain in predictive accuracy compared to the null model (model with no variables) at a single time point. In this paper, \( {R}_{BS}^2 \) was evaluated at 2 years because it was very close to the median overall survival time in our motivation example of prostate cancer. A positive value of \( {R}_{BS}^2 \) means that a prognostic model predicts better relative to the null model, whereas a negative value indicates that the model predicts poorly relative to the null model. Discrimination describes the ability of a prognostic model to distinguish between patients with and patients without the outcome of interest [26]. Discrimination was measured using Harrell’s c-index. A value of 0.5 represents random prediction whereas a value of 1 represents perfect discrimination [26]. Calibration refers to the agreement between the observed survival probability at 2 years versus the predicted probability at 2 years. Patients were divided into quartiles, and the average predicted survival probability was compared with the observed survival probability similar to how it is described by Harrell [26]. The calibration slope was estimated by regressing new survival outcomes on the predicted prognostic index [27]. Lastly, overfitting was evaluated by using the rms library in R using the validated command with 100 bootstrapped samples generated from the original sample.

The high computational burden was alleviated by using shared cluster computers at the University of Iowa Supercomputer Center. The LASSO, ALASSO, and RSF were executed in R using glmnet library version 2.0-5 and randomForestSRC version 2.1.0. We created R codes for PSIS according to the algorithms in [9] because there is no available library for PSIS. We employed R codes from the SIS library (version 0.6) and combined it with ALASSO and RSF in accordance with the techniques described in [28] since SIS can only implement SIS-LASSO or ISIS-LASSO. The R codes were written by the first author and are available on https://duke.box.com/s/q2wj7m4gxv0gnslaw91nooindjjat6cf.

Results

Overall performance

Table 1 shows the means and standard deviations of \( {R}_{BS}^2 \) values evaluated at 2 years out of the 300 simulations for the training sets. The highest \( {R}_{BS}^2(2) \) were observed for PSIS-LASSO and PSIS-ALASSO when n = 150 regardless of the signal strength. For n = 300, and assuming a weak signal among the six variables, LASSO had the highest \( {R}_{BS}^2(2) \) which was 83% (Table 1). This implies that the final model containing the selected variables from LASSO had 83% gain relative to a null model in explaining the survival outcomes at 2 years. When n = 300, and assuming a strong signal among the six variables, PSIS-LASSO and PSIS had \( {R}_{BS}^2(2) \) above 85%, whereas LASSO had negative average \( {R}_{BS}^2(2) \), indicating that the average model size of 108 variables performed poorly compared to the null model.

Table 1 Mean (SD) of \( {R}_{BS}^2(2) \) for the training sets over 300 simulations

We present the mean and standard deviation (SD) of the final model sizes in Table 2. We observed that PSIS, PSIS-LASSO, and PSIS-ALASSO chose an excessive number of uninformative variables given that the number of informative variables was fixed at six. When n = 300, and assuming a strong signal among the six variables, the prediction models that employed LASSO performed poorly compared to the null model. When the sample size was 300 and a strong signal was assumed, the combinations with ISIS chose a small number of variables and their \( {R}_{BS}^2(2) \) values were almost 80% (Table 2).

Table 2 Mean (SD) of final model size for the training sets over 300 simulations

Table 3 presents the mean number of important variables chosen in the final models over the 300 simulations in the training sets. In addition, we present the proportion of the number of unimportant variables selected in each model relative to the size of the final models. When n = 150, PSIS, PSIS-LASSO, PSIS-ALASSO, LASSO, and ALASSO selected four or more important variables on average. These final models chose too many unimportant variables as the proportions of unimportant variables were greater than 85% (Table 3). On the other hand, when n = 300 and the signal strength was weak, ISIS-LASSO, ISIS-ALASSO, and ISIS-RSF selected five or more important variables on average. These combinations however chose a small number of unimportant variables as the largest proportion was only 3.3% for ISIS-RSF (Table 3).

Table 3 Mean of the number of informative features (% of uninformative features) selected in final model over 300 simulations in the training sets

Overfitting and discrimination

We also assessed overfitting using the original samples (training sets). Table 4 presents the mean optimism and corrected c-index over 100 boostrapped samples when the sample size is 300. We observed the lowest mean optimism with SIS combination when the signal was weak, but these approaches led to the smallest corrected c-index. On the other hand, PSIS-LASSO had the highest mean optimism. LASSO, ALASSO, and the ISIS combinations had reasonable c-indexes relative to the mean optimism (Table 4). The same pattern was observed when a strong signal among the important variables was assumed.

Table 4 Mean optimism and corrected c-index over 100 boostrapped samples

The mean and standard deviation values of CH over the 300 simulations in the testing sets are presented in Table 5. We observed that when the signal is weak and n = 150, ALASSO had the highest c-index. On the other hand, prediction models from ISIS-LASSO, ISIS-ALASSO, ISIS-RSF, and ALASSO had the highest average c-index when the sample size was 300 and the signal was weak. When the sample size was 150 and a strong signal was assumed, both LASSO and ALASSO had the highest c-index (Table 5). In contrast, when the sample size was large and a strong signal was assumed, all variable selection approaches had high c-indices. The ISIS combinations had the highest c-index, followed by LASSO and ALASSO-PSIS combinations, respectively.

Table 5 Mean (SD) of CH for the testing sets over 300 simulations

Calibration

The mean and standard deviation of the calibration slopes for the testing sets are displayed in Table 6. Calibration slope identifies a linear association between the observed survival outcomes and the predicted prognostic index. When n = 300, the average calibration slopes from the prognostic models containing variables selected by ISIS-LASSO, ISIS-ALASSO, and ISIS-RSF were close to a value of 1. This implies that these prognostic models were well calibrated. On the other hand, when n = 150, most variable selection approaches had calibration slopes considerably below 1, suggesting overfitting.

Table 6 Mean (SD) of calibration slope for testing set over 300 simulations

The models were also evaluated for their calibration by plotting the observed versus the predicted probabilities for the variable selection approaches that had the calibration slopes closet to 1. Figure 3a–d presents the calibration plots of the observed survival probability versus the predicted probability at 2 years. The observed survival probability was close to the predicted probability when n = 150 and the signal strength is weak (Fig. 3a). We detected similar patterns as the observed survival probability was close to the predicted probability for ISIS-ALASSO when the sample size was 300 with a weak signal (Fig. 3b), for ISIS-LASSO when a strong signal among the variables is assumed and n = 150 (Fig. 3c) and for ISIS-LASSO when a strong signal among the variables is assumed n = 300 (Fig. 3d).

Fig. 3
figure 3

Calibration plots on training set for observed survival probability at 2 years versus predicted survival for a ALASSO with n = 150 and weak signal strength, b ISIS-ALASSO with n = 150 and weak signal strength, c ISIS-LASSO when n = 150 and strong signal strength, and d ISIS-LASSO when n = 300 and strong signal strength

Real-data example

The variable selection methods (SIS, PSIS, LASSO, LASSO, RSF) and the nine combinations were used to identify SNPs that predict overall survival in men with metastatic prostate cancer who participated in a phase III clinical trial (CALGB 90401). Overall survival was defined as the date of death from date of random assignment. We confined our analysis to 623 genetically defined Caucasians men who participated in the GWAS study [29]. Deaths were observed in 94% of the 623 Caucasian men.

Among all the SNPs, 498,081 SNPs were selected after quality control utilizing GenABEL (R package): call rate < 99%, p value of HWE < 1e−08, MAF < 0.05, and removal of non-autosomal SNPs. The missing SNP values were simply replaced with the average of non-missing SNP values. Although this may not be the optimal method for handling missing data, it is an effective ad hoc approach when the proportion of missing SNPs is relatively low. Eighty-five percent of the 498,081 SNPs that passed quality control were complete. The proportion of missing SNPs among 498,081 were 11.9% in 1 SNP, 2% had 2 SNPs, 0.5% had 3 SNPs, 0.2% had 4 SNPs missing, 0.10% had 5 SNPs missing, and 0.4% had more than 5 SNPs missing.

The dataset was randomly split into a 2:1 allocation ratio to training (n = 419) and testing (n = 204) sets, respectively. From the training set, important SNPs were selected using the different variable selection approaches. In all the proportional hazards models, SNPs were considered assuming additive models and were adjusted for risk score based on the predicted survival probability [30]. As expected, there was a slight indication of overfitting in the training set when the c-index from the original sample was compared to the bootstrapped samples. The corrected c-index is presented in Table 7. Three variable selection methods (ISIS-ALASSO, SIS-LASSO, and SIS-ALASSO) selected only the validated risk score. It is worth mentioning that for PSIS, PSIS-RSF, and LASSO, the corrected c-index could not be estimated due to the singularity problem among the predictors. The estimated coefficients of the selected SNPs from the training set were applied to the testing set, and we calculated the c-indices and the 95% confidence intervals. As presented in Table 7, ISIS-ALASSO, SIS-LASSO, and SIS-ALASSO combinations achieved the highest c-index of 0.67.

Table 7 c-indices based on the training and testing sets for the real example

Discussion

Advancement in laboratory technologies has led to ultra-high dimensional data that are now being routinely captured in clinical studies. A critical element of genomic medicine is implementing validated prognostic models for identifying patients with cancer for clinical trial design and/or for optimal therapy. For example, the Decipher signature has been developed for predicting metastasis after radical prostatectomy in men with prostate cancer [31,32,33,34,35]. The Decipher score was developed using random forest and has been independently validated [32,33,34,35]. Another example is the well-known oncotypeDx recurrence score that has been utilized to stratify randomization and to guide treatment in breast cancer patients in the TAILORx trial [36,37,38].

Prognostic models are usually built and validated utilizing common statistical methods and machine learning tools. The challenge in ultra-high dimensional space is not only to reduce the dimensionality of the data, but to keep the important molecular variables in predicting the time-to-event endpoints. Another major challenge which is common in high dimensional data is overfitting, which is identified by the high accuracy for a prognostic model based on a training set, but a low accuracy is observed when the model is evaluated on an external validation dataset.

This article explores the feasibility of combining several variable selection methods in an ultra-high dimensional setting for the purpose of developing prognostic models for time-to-event outcomes. To our knowledge, this is among the first articles that systematically compared the performance of ultra-high dimensional screening with variable selection methods for time-to-event endpoints.

When the sample size was small (n = 150) and the signal strength among the variables was weak, ALASSO outperformed all other approaches having the highest c-index and the calibration slopes closest to 1. Assuming a moderate sample size (n = 300) and a strong signal among the six variables, the combinations with ISIS not only selected all the important variables, but also excluded the unimportant variables. When n = 300 and the signal strength is weak, the combinations with ISIS had similar discriminative abilities compared with ALASSO in terms of high average c-index. The calibration slope, however, for ALASSO (0.732) indicated overfitting as compared with the slope value for ISIS-ALASSO (0.945). On the other hand, when n = 150 with a strong signal strength, ALASSO had the best performance. While the combinations with ISIS or SIS failed to select all the important variables, the combinations with PSIS chose too many unimportant variables.

Our results do not agree with Fan et al. [10], who demonstrated that SIS combined with LASSO is efficient in excluding unimportant variables rather than using only LASSO. This could be due to the fact that different performance measures (median and squared estimation errors) were reported [10]. Another possible explanation may be due to the fact that Fan et al. simulated correlated variables whereas we did not.

Other variable selection methods based on p values have also been used. One of these approaches is based on the false discovery rate (FDR). We have previously demonstrated the feasibility of the sequential use of FDR as a screening method with ISIS and other variable selection methods for predicting binary endpoints [39].

Turning back to our motivating example, the results from the prostate cancer data demonstrate the difficulty in choosing the important SNPs in predicting overall survival. The top three combinations, ISIS-ALASSO, SIS-LASSO, and SIS-ALASSO, outperformed other approaches, but they only selected the validated risk score [29]. The other combinations of variable selection methods had smaller c-indices even if they included more SNPs in addition to the validated risk score. This highlights the importance of validating the selected SNPs. Although external validation is considered the gold standard, model developers may not have access to external data. Other validation procedures, such as bootstrapping, are acceptable approaches [40, 41].

There are some caveats in applying the results from our simulation studies to real data. First, our simulations start with randomly generating categorical variables of which each has an identical probability to select three classes: 0, 1, and 2. The identical probability assumption, however, may not hold in a real problem. The second limitation is due to the independently and identically distributed assumption in generating the variables. Actual SNPs may be correlated with each other, although in our example, we observed zero correlation among SNPs across genes and modest correlations among SNPs within the same genes. Thirdly, we fixed several parameters in the simulations, such as the number of important variables, sample size, MAF, and signal strength, and all of these factors may have an impact on the simulation studies. Lastly, we used the R codes of the COXvarISISscad function in SIS (version 0.6) which was available at the time of the simulation study to combine it with ALASSO and RSF. The results from the aggressive variant of ISIS-LASSO using the latest SIS library (version 0.8-6) may produce different results when the sample size is small and the signal strength is weak. Despite the above limitations, these simulations are valuable to modelers.

Another vital point to consider is the computational time for the combination variable selection methods as it depends on the sample size. For instance, running ISIS-ALASSO and ALASSO for one training set with a sample size of 150 took 16 and 3 min, respectively. When n = 300, the processing time increased to 41 and 7 min for ISIS-ALASSO and ALASSO, respectively. Furthermore, running 500,000 variables over 200 bootstrapped samples was computationally intensive, and for the ISIS-LASSO combination, it took 6 h simply to choose the important variables. Although ISIS is computationally intensive, the combination performed well as indicated in our results.

Based on our simulations, we provide the following guidelines as a trade-off between predictive accuracy, calibration, and overfitting:

  1. a)

    When the sample size is small, and regardless of the strength of the signal among the variables, ALASSO performs well in selecting the important variables and achieving a high c-index.

  2. b)

    When the sample size is small and the signal among the covariate is strong, ALASSO again performs well in terms of having high c-index and high calibration slopes.

  3. c)

    When the sample size is large and a weak signal among the variables is present, ALASSO and the ISIS combinations had the highest performance and were comparable. The computing time, however, is much faster for ALASSO (six times faster than ISIS combination).

  4. d)

    When sample size is large and a strong signal is present, the ISIS combinations are the preferred approaches.

Conclusion

Building and validating prognostic models of clinical outcomes will remain an important research area in 21st genomic medicine. Choosing the appropriate variable selection method for training a model is a critical step in developing a robust prognostic model. Based on the simulation studies, the effective use of ALASSO or a combination of methods, such as ISIS-LASSO and ISIS-ALASSO, allows both for the development of prognostic models with high predictive accuracy and a low risk of overfitting assuming moderate sample sizes.