Background

Due to the limited effectiveness of current clinical diagnoses, expression profiles are utilized for informing variables, which are not only associated with the categories of patients with different survival risks but also consistent with survival time [1]. Commonly, Cox proportional hazards regression analysis is used to seek relevant variables considering the continuity of the patients’ survival outcomes with right censoring [2]. As to small sample data with high dimension, Cox proportional hazards regression has to be combined with methods using dimension reduction or shrinkage such as partial least squares [3] and principal component analysis [4]. However, these approaches only provide a combination of variables. Besides, tree-structured survival analysis [5], random survival forests [6] and that associated with hazards regression [7] are proposed for selection of features associated with survival outcomes. Anyway, these top-down strategies provide so many variable candidates that the real features which may reveal the possible molecular cause of different survival risks are inevitably submerged.

In contrast, univariable hazards regression analyses have been placed firmly in the mainstream. Bottom-up strategies with different constraints such as least-angle regression [8] and sparse kernel [9] are utilized for providing variables associated with survival time. To the best of our knowledge, we are the first to present joint covariate detection [1] that combines significant variables consistent with survival time and associated with the categories of patients. Other than individually significant variables, we concentrate on bottom-up enumeration of feature tuples, each component of which is either individually significant or not. This thought is inspired by Integrative Hypothesis Testing [10], which is used for selecting features differentially expressed between different groups of patients. Unlike Integrative Hypothesis Testing, joint covariate detection is faced with continuous survival time other than labels representing different categories of patients.

In this paper, we further divide the provided feature selection into two steps, i.e., selection of variables associated with survival outcomes and further feature selection for discrimination between patients with different survival risks. In addition, we develop a joint covariate detection tool for survival analysis on tumor expression profiles (i.e. JCDSA), which helps to conveniently select significant features either on a cluster or a workstation, even on a personal computer. Matlab R2012b and Python 3 are utilized as the development platform. miRNA expression data (Level 3) of 548 patients with GBM downloaded from TCGA (http://cancergenome.nih.gov) and the simulated data are considered to be the examples. Compared with the prevailing method named as random survival forests (i.e. RSF), JCDSA shows better experimental results, which demonstrates the effectiveness of our method.

Implementation

In order to elucidate joint covariate detection in brief, a schematic diagram is illustrated in Fig. 1 (Notations: x (i) and β denote the expression levels of sample i and the regression coefficients of the detected variables, respectively. The summation in the denominator is over all subjects in the risk set at ordered survival time t(i), denoted by R(t(i)). \(z_{k}^{0}\) denotes a null statistics by a random rearrangement of survival outcomes. The estimator of the expected number of deaths in high-risk group is denoted by \(\hat {e}_{1i}\), expressed as \(\hat {e}_{1i}\) = \(\frac {n_{1i}d_{i}}{n_{i}}\), where n i and d i represent the number at risk and of deaths at the observation of ordered survival time t(i), n1i denotes the number at risk in high-risk group. The estimator of the variance of d1i on the hypergeometric distribution is defined as \(\hat {v}_{1i}\) = \(\frac {n_{1i}n_{0i}d_{i}(n_{i} - d_{i})}{n_{i}^{2}(n_{i} - 1)}\), where n 0i denotes the number at risk in low-risk group. Q\(_{r}^{0}\) denotes a null statistics by a random rearrangement of survival outcomes). Input data is considered as expression profiles with survival time and censoring states of patients. Output data refers to selected features. Joint covariate detection corresponds to two-step feature selection, i.e., selection of features associated with survival outcomes and selection of features for discriminating between two risk groups.

Fig. 1
figure 1

A schematic diagram to elucidate joint covariate detection

Features associated with survival outcomes

We first consider to select features associated with survival time. A bottom-up enumeration on k-tuple with k variables is made. As to each k-tuple, Cox proportional hazards regression analysis [2] is introduced. By making the maximum partial likelihood estimation on the partial likelihood function, we obtain k estimated regression coefficients on which Wald statistics are made. Furthermore, a permutation test is made on each Wald statistic. The k-tuple with each component corresponding to a significant p value is regarded as a candidate feature associated with survival outcomes. More details can be seen in [1].

Features for discriminating between two risk groups

We then intend to select features for discriminating between low-risk and high-risk group of patients, which conforms to doctors’ daily decision making process. As to each patient, a risk score which is the linear portion of the expression values using the Cox regression coefficients is calculated. A preassigned risk score is utilized as a cut-off value for stratification between high-risk and low-risk group of patients. Log-rank test is made. Furthermore, a permutation test is presented on each tuple, which has been selected to be associated with survival outcomes. The k-tuple with a significant p value is regarded as a candidate feature for discriminating between two risk groups. More details can be also seen in [1].

Brief overview of the software

Our software, which is implemented in Matlab R2012b or other later versions, can work on different computational platforms (e.g., a cluster, a workstation, even a personal computer). Therefore, it contains two parts, i.e., client and server. Selection of features associated with survival outcomes is accomplished by two Matlab m-files (i.e., ’/Client/S1_feature_selection.m’ and ’/Server/S1_feature_selection_on_server.m’). A further selection of features for stratification of patients is fulfilled by a Matlab m-file ’Client/S2_plot_draw.m’. If this program is implemented on a workstation or a personal computer, only the client part is needed. That is to say, users only need to concentrate on two GUIs (i.e., ’/Client/S1_feature_selection.m’ and ’Client/S2_plot_draw.m’) on the client part. Otherwise, the server part is also in demand. Data communications and environment configurations are actualized using Python 3. More details can be seen in the user’s guide on the website: http://bio-nefu.com/resource/jcdsa.

Results

According to the presented two-step feature selection strategy, we first consider selecting features associated with survival outcomes. Figure 2 illustrates this step. Cancer type can be selected or input by clicking the right side arrow if it is not supported in the type list. Other selections in the setting frame can be also made, details of which are listed in user’s guide. Before running at full speed, JCDSA estimates the finishing seconds which helps to make a further decision. After its completion, the result which records p value(s) of each k-tuple is stored in ’/Client/Data/S1’. Figure 3 further illustrates the step of selecting features associated with survival outcomes (i.e., Step 2.1). By setting the threshold of the p value corresponding to permutation test on Wald statistic, features associated with survival outcomes are selected.

Fig. 2
figure 2

Selection of features associated with survival time

Fig. 3
figure 3

Selection of features for discriminating between two risk groups

Using the miRNA expression data (Level 3) of 548 patients with GBM as an example, individually significant miRNAs and significant miRNAs in pairs are listed in Tables 1 and 2, respectively. After making careful comparisons between Tables 1 and 2, we conclude that significant features in high dimension may not be composed of individually significant miRNAs. Taking the significant pair miR-10b and miR-222 as an example, miR-10b is not listed in Table 1, which shows that it is not individually significant. This phenomenon reveals the advantage of using joint covariate detection.

Table 1 Individually significant miRNAs using joint covariate detection (p <=0.001)
Table 2 Significant miRNAs in pairs using joint covariate detection (p <0.001)

Together, Figs. 3, 4 and 5 illustrate the feature selection step for discriminating between two risk groups. In Fig. 3, after choosing the files that represent the original data and the result corresponding to significant features associated with survival time at Step 2.2, the software runs to Step 2.3 and Step 2.4.

Fig. 4
figure 4

Kaplan-Meier analysis

Fig. 5
figure 5

Risk score analysis

As shown in Fig. 4, Kaplan-Meier analysis with parameters derived from log-rank test and Harrell’s concordance index is made for further selection of features, which helps to discriminate between high-risk and low-risk group of patients. Meanwhile, the result of risk score analysis is illustrated in Fig. 5. Correspondingly, results which refer to significant features are stored in ’Client/Data/S2/S2_3’ and ’Client/Data/S2/S2_4’, respectively.

In order to show the effectiveness of our method, we implemented the prevailing method named as random survival forests (i.e. RSF) on the miRNA expression data (Level 3) of 548 patients with GBM for comparison. 1000 binary survival trees were made, with each terminal node containing a minimum of d0=10 unique deaths. We made 1000 permutations on each variable, and obtained the variable importance (VIMP) for each variable. The result is listed in Table 3.

Table 3 Significant miRNAs using random survival forests (VIMP score >=0.001)

After making careful comparisons between Tables 2 and 3, we find that miR-10b is still unimportant, as it is not listed in Table 3. This phenomenon reveals the advantage of using joint covariate detection other than RSF. In fact, the individually significant miR-222 keeps a p=0.0012 corresponding to log-rank test with 10000 rounds of permutation. As to significant pair (i.e., miR-222 and miR-10b), it keeps a p=0.0002 which corresponds to log-rank test with 10000 rounds of permutation. As to miR-10b, it keeps a p=0.285, which is individually insignificant.

We simulated data under 40 independent dimensions, from which we assigned two to be significant. That is, the survival time S is defined as S=exp(−Xβ+ε), where X is the simulated gene expression matrix and β=[0.9,0.1,0.001,...,0.001]40 denotes the coefficient parameter. εN(0,2). The sample size n is 50. The censoring states are generated, and yield 10 percent censoring for the simulated data.

The experimental results on simulated data are listed in Tables 4, 5 and 6, respectively. The significant pair closely associated with simulated survival outcomes are selected out, as shown in Table 5. In contrast, miRNA-alternative 2 which is in absence in Table 4 shows insignificant (p=0.939), and illustrates less important in Table 6. These results demonstrate the effectiveness of our method. The simulated data and full tables corresponding to Tables 4, 5 and 6 can be downloaded on the website: http://bio-nefu.com/resource/jcdsa.

Table 4 Individually significant miRNAs using joint covariate detection on the simulated data (p <=0.05)
Table 5 Significant miRNAs in pairs using joint covariate detection on the simulated data (p <=0.001)
Table 6 Significant miRNAs using random survival forests on the simulated data (VIMP score >=0.001)

In order to show that selected variables are improbable false positive or false negative ones, we repeated the simulations above for 100 times with an enlarged sample size (n=500). The experimental results are illustrated in Fig. 6. Figure 6a denotes the p values (p<1e−3) of the significant pair through 100 times of simulation. However, miRNA-alternative 2 individually shows less important, as illustrated in Fig. 6b. Comparisons between Fig. 6a and b indicate that the significant features are probably not composed of individually significant uni-variables. Figure 6c and d report the number of positive pairs and individuals through 100 times of simulation, respectively. No false negative results are discovered. In Fig. 6c, the maximum number of false positive pair is three, which indicates a small probability of false positive pair 0.0038 (i.e., \(3/C_{40}^{2}\)). As to Fig. 6d, the maximum number of false positive individual is also three; yet, the probability of false positive individual is 0.075 (i.e., 3/40).

Fig. 6
figure 6

Simulation results. a p values of the significant pair through 100 times of simulation. b p values of the significant individual through 100 times of simulation. c The number of positive pairs through 100 times of simulation. d The number of positive individuals through 100 times of simulation

Discussion

There are several states needed to be discussed. First, it is the significant multi-variable other than combinations of individually significant uni-variables that contributes to selection of features not only consistent with survival outcomes but also associated with stratification of patients under different survival risks. This fact has been demonstrated by our experimental results in this paper. Second, components of each significant multi-variable may keep a low correlation. This phenomenon has been discovered when experiments on the simulated data were made. Further evidence is still needed. Third, the correction for multiple hypothesis testing is absent, considering the computational cost of calculating FDR, q value, the adjusted p values, etc. on each pair or each high-dimension tuple of variables. However, simulations are made, which demonstrate the effectiveness of our method.

Conclusion

Our joint covariate detection for survival analysis provides a new viewpoint for selecting variable candidates which are not individually but jointly significant. Following a two-step variable selection strategy, we propose a software (i.e., JCDSA) in order to help users to select features which are not only consistent with survival time but also associated with prognosis risk. JCDSA can be adapted for many categories of cancer. Users can easily operate it and conveniently obtain the experimental results for subsequent biological experimental validation.

Availability and requirements

Project name: JCDSA Project home page: http://bio-nefu.com/resource/jcdsaOperating system(s): Linux, Windows Programming language: Matlab (≥R2012b), Python (≥ 3.0) License: GPL (≥2) Any restrictions to use by non-academics: none