Background

Microarray expression data from cancer tissue samples have the following properties: small sample sizes but large numbers of features, high noise and redundancy, remarkable background differences among samples and features, and nonlinearity [1, 2]. Selecting a parsimonious set of informative genes to build a robust classifier with good generalization performance is one of the most important tasks in the analysis of microarray expression data, as it can help to discover disease mechanisms, improve the precision of clinical diagnoses, and reduce their cost [3].

Gene selection depends on a given evaluation strategy and a defined score. Individual-gene-ranking methods rank genes by comparing only the expression values of the same gene between different classes (a vertical comparison evaluation strategy). This can be far from the truth, as the deregulation of pathways, rather than of individual genes, may be critical in triggering carcinogenesis [4]. If a gene has a remarkable joint effect with other genes, it should be selected as an informative gene, even though it may receive a low rank from an individual-gene-ranking method. This joint effect of genes has been taken into account in the most popular existing algorithms, including top scoring pair (TSP) [5, 6], top scoring triplet (TST) [7], top-scoring ‘N’ (TSN) [8], top scoring genes (TSG) [9] and the doublet method [4]. However, the gene-pair score in TSP, the percentage Δ ij [5, 6], cannot reflect sample size differences. To fully utilize sample size information, TSG introduces chi-square values as the score for gene pairs [9]. TSP and TSG are both pair-wise gene evaluations, which compare the expression values of two different genes within the same sample (a horizontal comparison evaluation strategy), and can help to eliminate the influence of sampling variability due to different subjects [5, 6, 9].

At the level of gene pairs, Merja et al. [10] defined two patterns from a data-driven perspective based on rank data rather than absolute expression: consistent reversal of relative expression and consistent relative expression. This premise allowed the cell types to be organized into their ontogenetic lineage relationships and may reflect regulatory relationships among the genes [10]. Based on absolute expression, the first pattern can be subdivided into a consistent reversal of expression (Pattern I) and a consistent reversal of relative expression (Pattern II) (see Table 1). Similarly, the second pattern can be subdivided into a consistent expression (Pattern III) and a consistent relative expression (Pattern IV). Furthermore, a heterogeneous background expression of samples (Pattern V) and an interaction expression pattern (Pattern VI) can be defined if the influence of sampling variability due to different subjects [9] and paired-gene interactions are considered [11]. Clearly, all twelve genes (G1 ~ G12) in Table 1 should be informative genes from a data-driven perspective. However, individual-gene evaluations, which only detect different expression levels between positive and negative samples, cannot highlight Pattern V or Pattern VI. Pair-wise gene evaluation with vertical comparison can highlight most patterns except Pattern V. Only pair-wise gene evaluation with horizontal comparison can highlight Pattern V, even though it cannot detect most of the other patterns. Therefore, both vertical and horizontal comparisons need to be considered in pair-wise gene evaluation techniques.

Table 1 Six patterns for joint effect of gene pairs in binary-class simulation data

In this paper, we first propose a novel score, relative simplicity (RS), based on information theory. We adopt an integrated evaluation strategy to rank genes one by one, considering not only individual-gene effects but also pair-wise joint effects between the candidate gene and the others. In particular, for pair-wise gene evaluations, vertical comparisons are integrated with horizontal comparisons to detect all six patterns of pair-wise joint effects. Finally, we construct a relative simplicity-based direct classifier (RS-based DC) that selects binary-discriminative informative genes on the training dataset and performs independent tests. Independent testing on nine multi-class tumor gene expression datasets showed that RS-based DC selects fewer informative genes and outperforms the reference models by a large margin, especially on datasets with larger m (the total number of classes), such as Cancers (m = 11) [12] and GCM (m = 14) [13].

Datasets and methods

Datasets

Ten multi-class datasets were used in the previously published TSP [5, 6] and TSG [9] papers. We did not include the Leukemia3 dataset [14] in our study because 65 % of its expression values are zero. The references, sample sizes, numbers of genes, and numbers of classes of the remaining nine datasets are summarized in Table 2. Suppose that a training dataset has n samples and p genes, and that the data are denoted as (Y i , X i,j ), i = 1,2,…, n; j = 1,2,…, p, where X i,j represents the expression value of the j th gene (G j ) in the i th sample, and Y i represents the class label of the i th sample, with Y i ∈{Class1, Class2, …, Class t , …, Class m }, t = 1,2,…,m.

Table 2 Nine multi-class gene expression datasets

Data preprocessing

Adjustment for outliers

Outliers may exist in the datasets. For example, in the Lung1 [16] training set, the expression value X 54,4290 of the 54th sample in gene G4290 is 7396.1, while the average expression value of the other samples in gene G4290 is 80.15 (ranging from 16 to 197). Outliers overstate the differences among classes and need to be adjusted before gene ranking. For gene G j , we define outliers as those values beyond the interval [\( \overline{X}{.}_j-{u}_{\alpha}\sigma {.}_j \), \( \overline{X}{.}_j+{u}_{\alpha}\sigma {.}_j \)]: if \( {X}_{ij}<\overline{X}{.}_j-{u}_{\alpha}\sigma {.}_j \) or \( {X}_{ij}>\overline{X}{.}_j+{u}_{\alpha}\sigma {.}_j \), then X ij is an outlier, where α is the significance level, and \( \overline{X}{.}_j \) and σ.  j represent the average value and standard deviation of X.  j , respectively. We adjust the outliers using the following formula:

$$ {X}_{ij}^{{\prime\prime}}=\left\{\begin{array}{ll}{\overline{X}}_{\hbox{-} i,j}-{u}_{\alpha }{\sigma}_{\hbox{-} i,j}& \mathrm{if}\kern0.5em {X}_{ij} < \overline{X}{.}_j-{u}_{\alpha}\sigma {.}_j\\ {}{\overline{X}}_{\hbox{-} i,j}+{u}_{\alpha }{\sigma}_{\hbox{-} i,j}& \mathrm{if}\kern0.5em {X}_{ij} > \overline{X}{.}_j+{u}_{\alpha}\sigma {.}_j\end{array}\right. $$
(1)

Here \( {\overline{X}}_{\hbox{-} i,j} \) and σ ‐ i,j represent the average value and standard deviation of X.  j computed without X i,j , respectively, and X″ ij is the value of X ij after adjustment. \( \left[{\overline{X}}_{\hbox{-} i,j}-{u}_{\alpha }{\sigma}_{\hbox{-} i,j},{\overline{X}}_{\hbox{-} i,j}+{u}_{\alpha }{\sigma}_{\hbox{-} i,j}\right] \) represents the corresponding distribution interval. We generally set α to 0.05 (u0.05 = 1.96). Adjustment for outliers was applied only to the training set.
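The adjustment in formula (1) can be sketched as follows (a minimal Python sketch, not the authors' code; the function name `adjust_outliers` and the use of the sample standard deviation are our own assumptions):

```python
import statistics

def adjust_outliers(values, u_alpha=1.96):
    """Formula (1): a value beyond the whole-gene mean +/- u_alpha * sd is
    replaced by the leave-one-out mean +/- u_alpha * leave-one-out sd."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample sd; the paper does not specify
    adjusted = list(values)
    for i, x in enumerate(values):
        rest = values[:i] + values[i + 1:]
        m, s = statistics.mean(rest), statistics.stdev(rest)
        if x < mean - u_alpha * sd:
            adjusted[i] = m - u_alpha * s
        elif x > mean + u_alpha * sd:
            adjusted[i] = m + u_alpha * s
    return adjusted
```

For the Lung1 example above, a single extreme value such as 7396.1 among twenty values near 80 would be pulled back to the leave-one-out bound.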

Transforming datasets from multi-class to binary-class with “one versus rest”

Suppose that Y i ∈{Class1, Class2, …, Class t , …, Class m }. We adopt a “one versus rest” (OVR) approach to transform a multi-class training set into binary-class sets. This generates m binary-class datasets, denoted {Class1 vs. non-Class1}, {Class2 vs. non-Class2}, …, {Class t vs. non-Class t }, …, {Class m vs. non-Class m }. In each binary-class training dataset, the Class t samples are positive {+}, and the non-Class t samples are negative {−}.
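The OVR transformation can be sketched in a few lines (a minimal Python sketch; the function name is our own):

```python
def one_versus_rest(labels):
    """For each class t, relabel the samples of class t as '+' and all
    other samples as '-', yielding m binary label vectors."""
    classes = sorted(set(labels))
    return {c: ['+' if y == c else '-' for y in labels] for c in classes}
```

Each of the m binary label vectors is then paired with the unchanged expression matrix to form one binary-class training set.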

Complexity and relative simplicity score

Entropy stands for disorder or uncertainty. For a discrete system with k events, its Shannon entropy is defined as:

$$ H=-{\displaystyle \sum_{i=1}^k\frac{n_i}{N} \log \left(\frac{n_i}{N}\right)} $$
(2)

where n i denotes the count of event i, and N is the total count. Here we use base-2 logarithms. H reflects only the event ratios. Complexity (C), as proposed by Zhang [22], reflects both event ratios and event frequencies:

$$ C=-{\displaystyle \sum_{i=1}^k{n}_i \log \left(\frac{n_i}{N}\right)} $$
(3)

For a given 2 × r contingency table (Table 3), the complexity is the sum of the row complexity (C row) and the column complexity (C column). f +d and f −d (d = 1,…,r) in Table 3 represent the event frequencies.

$$ {C}_{\mathrm{row}}=-{\displaystyle \sum_{d=1}^r{\mathrm{f}}_{+d} \log \left(\frac{{\mathrm{f}}_{+d}}{{\mathrm{f}}_{+}}\right)} - {\displaystyle \sum_{d=1}^r{\mathrm{f}}_{-d} \log \left(\frac{{\mathrm{f}}_{-d}}{{\mathrm{f}}_{-}}\right)} $$
(4)
$$ {C}_{\mathrm{column}}=-{\displaystyle \sum_{d=1}^r\Big({\mathrm{f}}_{+d} \log \left(\frac{{\mathrm{f}}_{+d}}{{\mathrm{f}}_d}\right)+}{\mathrm{f}}_{-d} \log \left(\frac{{\mathrm{f}}_{-d}}{{\mathrm{f}}_d}\right)\Big) $$
(5)
$$ C={C}_{\mathrm{row}}+{C}_{\mathrm{column}} $$
(6)
Table 3 2×r Contingency table

For a contingency table of size 2 × r 1 and another of size 2 × r 2, the complexities are incomparable if r 1 is unequal to r 2. Therefore we introduce a novel score, RS, based on the maximum complexity (Table 4). Table 4 comes directly from Table 3, except that the frequency of each column within the same class is set to be equal.

$$ {C}_{\mathrm{row}\hbox{-} \max }=n \log (r) $$
(7)
$$ {C}_{\mathrm{column}\hbox{-} \max }=-{\mathrm{f}}_{+} \log \left(\frac{{\mathrm{f}}_{+}}{n}\right)-{\mathrm{f}}_{-} \log \left(\frac{{\mathrm{f}}_{-}}{n}\right) $$
(8)
$$ {C}_{\max }={C}_{\mathrm{row}\hbox{-} \max }+{C}_{\mathrm{column}\hbox{-} \max } $$
(9)
$$ RS=\frac{C_{\max }-C}{C_{\max }} $$
(10)
Table 4 2×r Contingency table for maximum complexity
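Under these definitions, the RS score of any 2 × r contingency table follows mechanically from formulas (3)–(10). A minimal Python sketch (the function names are ours, not from the paper):

```python
from math import log2

def _c(n, N):
    # one complexity term: -n * log2(n / N); empty cells contribute nothing
    return 0.0 if n == 0 or N == 0 else -n * log2(n / N)

def rs_score(table):
    """Relative simplicity (formulas 4-10) of a 2 x r contingency table,
    given as [[f_+1, ..., f_+r], [f_-1, ..., f_-r]]."""
    pos, neg = table
    r = len(pos)
    f_plus, f_minus = sum(pos), sum(neg)
    n = f_plus + f_minus
    # C_row (formula 4) and C_column (formula 5)
    c_row = sum(_c(f, f_plus) for f in pos) + sum(_c(f, f_minus) for f in neg)
    c_col = sum(_c(p, p + q) + _c(q, p + q) for p, q in zip(pos, neg))
    c = c_row + c_col
    # C_max (formulas 7-9): each class spread equally across the r columns
    c_max = n * log2(r) + _c(f_plus, n) + _c(f_minus, n)
    return (c_max - c) / c_max
```

A perfectly class-separating table scores 1, and a completely uninformative (uniform) table scores 0, which is what makes RS values comparable across tables of different width r.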

Individual-gene evaluation

For a given gene G j with continuous expression values X.  j in a binary-class training dataset, we partition X.  j into two parts (X.  j  > EP j and X.  j  < EP j ) with an endpoint (EP):

$$ E{P}_j=\left({\overline{X}}_{-j}+{\overline{X}}_{+j}\right)/2 $$
(11)

where \( {\overline{X}}_{-j} \) and \( {\overline{X}}_{+j} \) are the average expression values of X.  j for the negative and positive samples, respectively. We then generate a 2 × 2 contingency table for gene G j (Table 5).

Table 5 2 × 2 contingency table for individual gene

For the individual-gene evaluation of gene G j , we then obtain its RS score, \( R{S}_{G_j} \), according to Table 5 and formula (10).
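Building the 2 × 2 table of Table 5 from one gene's expression vector can be sketched as follows (a minimal Python sketch; the function name and the [greater-than, less-than] column order are our own assumptions):

```python
def gene_table(expr, labels):
    """Table 5 for one gene: the endpoint (formula 11) is the midpoint of
    the positive-class and negative-class mean expression; each class row
    counts values above and below the endpoint."""
    pos = [x for x, y in zip(expr, labels) if y == '+']
    neg = [x for x, y in zip(expr, labels) if y == '-']
    ep = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    def split(vals):
        return [sum(x > ep for x in vals), sum(x < ep for x in vals)]
    return split(pos), split(neg)
```

The resulting pair of rows is exactly the 2 × 2 table to which formula (10) is then applied.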

Pair-wise gene evaluation

Horizontal comparison of gene pairs

For gene pairs G j and G q (j ≠ q) in a binary-class training dataset, we generate a 2 × 2 contingency table (Table 6) for the horizontal comparison with X i,j  > X i,q and X i,j  < X i,q , similar to TSP [2, 3] and TSG [9].

Table 6 2 × 2 contingency table for gene pairs of horizontal comparison

For the horizontal comparison of gene pairs G j and G q , we generate the complexity C hor-Gj-Gq and the maximum complexity C hor-Gj-Gq-max according to Table 6, formula (6), and formula (9).

Vertical comparison of gene pairs

For gene pairs G j and G q (j ≠ q) in a binary-class training dataset, we partition X.  j and X.  q into two parts with endpoint EP j and EP q , respectively. We then generate a 2 × 4 contingency table (Table 7) for the vertical comparison.

Table 7 2 × 4 contingency table for gene pairs of vertical comparison

For the vertical comparison of gene pairs G j and G q , we then generate the complexity C ver-Gj-Gq and the maximum complexity C ver-Gj-Gq-max according to Table 7, formula (6), and formula (9).

RS score of gene pairs

For gene pairs G j and G q in a binary-class training dataset, we generate the pair-wise RS score, RS Gj_Gq , according to formula (12).

$$ R{S}_{Gj\_Gq}=\frac{\left({C}_{hor\hbox{-} Gj\hbox{-} Gq\hbox{-} \max }+{C}_{ver\hbox{-} Gj\hbox{-} Gq\hbox{-} \max}\right)-\left({C}_{hor\hbox{-} Gj\hbox{-} Gq}+{C}_{ver\hbox{-} Gj\hbox{-} Gq}\right)}{{C}_{hor\hbox{-} Gj\hbox{-} Gq\hbox{-} \max }+{C}_{ver\hbox{-} Gj\hbox{-} Gq\hbox{-} \max }} $$
(12)
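Formula (12) pools the horizontal 2 × 2 table (Table 6) and the vertical 2 × 4 table (Table 7) into one score. A minimal Python sketch (function names are ours; the tables are passed in already built):

```python
from math import log2

def _c(n, N):
    # one complexity term: -n * log2(n / N); empty cells contribute nothing
    return 0.0 if n == 0 or N == 0 else -n * log2(n / N)

def _complexity(table):
    # C = C_row + C_column (formulas 4-6) for a 2 x r table
    pos, neg = table
    c_row = sum(_c(f, sum(pos)) for f in pos) + sum(_c(f, sum(neg)) for f in neg)
    c_col = sum(_c(p, p + q) + _c(q, p + q) for p, q in zip(pos, neg))
    return c_row + c_col

def _c_max(table):
    # C_max (formulas 7-9)
    pos, neg = table
    fp, fm = sum(pos), sum(neg)
    return (fp + fm) * log2(len(pos)) + _c(fp, fp + fm) + _c(fm, fp + fm)

def pairwise_rs(hor_table, ver_table):
    """Formula (12): RS of a gene pair from its horizontal and vertical tables."""
    cmax = _c_max(hor_table) + _c_max(ver_table)
    return (cmax - _complexity(hor_table) - _complexity(ver_table)) / cmax
```

Because both complexities are pooled before normalization, a pair that separates the classes in either comparison alone still receives partial credit.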

Integrated individual-gene ranking

For a given gene G j in a binary-class training dataset, the integrated RS score, IRS Gj , can be calculated with formula (13):

$$ IR{S}_{Gj}=R{S}_{Gj}+{\displaystyle \sum_{q=1}^p\left(\frac{R{S}_{Gj}}{R{S}_{Gj}+R{S}_{Gq}}\times R{S}_{Gj\_Gq}\right)},q\ne j $$
(13)

Here, RS Gj represents the vertical comparison of the individual gene; RS Gj_Gq represents the horizontal and vertical comparisons of the gene pair; and \( \frac{R{S}_{Gj}}{R{S}_{Gj}+R{S}_{Gq}} \) represents the weight of G j in the pair-wise comparison. According to IRS Gj , all p genes can be ranked in descending order, recorded as {GRank1, GRank2,…, GRankj ,…, GRankp }. The integrated evaluation process for G j is shown in Fig. 1.

Fig. 1 Integrated evaluation process of G j
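Given precomputed individual and pair-wise RS scores, formula (13) is a weighted sum. A minimal Python sketch (the dict-based inputs and function name are our own illustration):

```python
def integrated_rs(rs_individual, rs_pairs):
    """Formula (13): IRS_Gj = RS_Gj + sum over q != j of
    RS_Gj / (RS_Gj + RS_Gq) * RS_Gj_Gq.
    rs_individual: {gene: RS}; rs_pairs: {(gene_j, gene_q): pairwise RS}."""
    irs = {}
    for g, rs_g in rs_individual.items():
        total = rs_g
        for q, rs_q in rs_individual.items():
            if q != g:
                pair = rs_pairs.get((g, q), rs_pairs.get((q, g), 0.0))
                total += rs_g / (rs_g + rs_q) * pair
        irs[g] = total
    return irs
```

Sorting the returned scores in descending order yields the ranked list {GRank1, …, GRankp}.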

Informative gene selection

The IRS scores provide a list of top-ranked genes. However, combining the top-ranked genes may not produce a top-ranked gene combination, because of redundancy and interaction among genes [23]. Therefore, we used a forward feature selection strategy to select informative gene subsets, along with our RS-based DC classifier and leave-one-out cross-validation (LOOCV) error estimates.

For a given binary-class training dataset with n samples and p ranked genes:

Step 1: Introduce gene GRank1 to get dataset S = (Y i , X i ), i = 1,2,…, n, where X i represents the expression value of gene GRank1 in the i th sample and Y i ∈{+, −} is the class label of the i th sample. Leave out one sample as the validation data (S-validation) and keep the rest as the training data (S-train). First assign {+} to S-validation as a class label, merge S-validation and S-train, and compute RS GRank1(+); then assign {−} to S-validation, merge again, and compute RS GRank1(−). If RS GRank1(+) is larger than RS GRank1(−), the S-validation sample is predicted as positive; otherwise it is predicted as negative. Repeat the prediction for all samples in S to obtain the predicted class labels. Calculate the Matthews correlation coefficient (MCC) according to formula (14) and denote it MCC 1.

$$ MCC=\frac{\left(TP\times TN\right)-\left(FN\times FP\right)}{\sqrt{\left(TP+FN\right)\times \left(TN+FP\right)\times \left(TP+FP\right)\times \left(TN+FN\right)}} $$
(14)

Here TP, TN, FP, FN represent true positives, true negatives, false positives and false negatives, respectively.
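Formula (14) translates directly into code (a minimal Python sketch; the guard for a zero denominator is our own convention):

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, formula (14)."""
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fn * fp) / denom
```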

Step 2: MCCbenchmark = MCC 1.

Step 3: Introduce the next top-ranked gene. Denote the total number of current genes as r, giving dataset S = (Y i , X i,j ), i = 1,2,…, n; j = 1,2,…, r. The network RS score of the r genes is calculated with formula (15).

$$ R{S}_r\hbox{-} net={\displaystyle \sum_{j=1}^r{\displaystyle \sum_{q=1}^rR{S}_{GRankj\_ GRankq}}},q\ne j $$
(15)

Leave out one sample as the validation data (S-validation) and keep the rest as the training data (S-train). First assign {+} to S-validation as a class label, merge S-validation and S-train, and compute RS r -net(+); then assign {−} to S-validation, merge again, and compute RS r -net(−). If RS r -net(+) is larger than RS r -net(−), the S-validation sample is predicted as positive; otherwise it is predicted as negative. Repeat the prediction for all samples in S to obtain the predicted class labels. Calculate the MCC according to formula (14) and denote it MCC r .

Step 4: If MCC r  ≤ MCCbenchmark, delete X.  r (the newly introduced gene); otherwise set MCCbenchmark = MCC r .

Step 5: Repeat Steps 3 and 4 until the top B ranked genes have been successively introduced (our experience suggests that setting the upper bound B = 100 is sufficient).

This procedure generates the informative gene subset for the binary-class dataset (for pseudo-code, see Table 8).
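The greedy loop of Steps 1–5 can be sketched independently of the scoring details (a minimal Python sketch; `loocv_mcc` stands in for the RS-based DC leave-one-out procedure of Steps 1 and 3 and is assumed supplied by the caller):

```python
def forward_select(ranked_genes, loocv_mcc, max_genes=100):
    """Forward selection: introduce ranked genes one at a time and keep a
    gene only if it raises the LOOCV MCC benchmark (Steps 1-5)."""
    selected, benchmark = [], float('-inf')
    for gene in ranked_genes[:max_genes]:
        score = loocv_mcc(selected + [gene])
        if score > benchmark:
            selected.append(gene)
            benchmark = score
    return selected
```

Any scorer with the signature `loocv_mcc(gene_subset) -> float` can be plugged in, which is also how the Entropy-based and χ 2-based DC variants reuse the same selection loop.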

Table 8 Pseudo-code of informative genes selection

Paired votes prediction with RS-based DC

We generate m binary-class training sets, denoted {Class1 vs. non-Class1}, {Class2 vs. non-Class2},…,{Class t vs. non-Class t },…,{Class m vs. non-Class m }, according to our OVR approach, and the corresponding m binary-discriminative informative gene (BDIG) subsets, denoted BDIGClass1, BDIGClass2, …, BDIGClasst , …, BDIGClassm , according to the procedures described from individual-gene evaluation through informative gene selection.

For a test sample with m possible class labels, a paired-vote prediction between Class t and Class w proceeds as follows: we merge the Class t and Class w samples into a new training set with r genes according to {BDIGClasst ∪ BDIGClassw }. We first assign {Class t } to the test sample as a class label, merge the test sample with the new training set, and generate RS r  ‐ net {Class t }; then we assign {Class w } to the test sample as a class label, merge again, and generate RS r  ‐ net {Class w }. If RS r -net {Class t } is larger than RS r -net {Class w }, the test sample belongs to Class t ; else it belongs to Class w . The winner continues with a paired vote against the next class, and the predicted class label of the test sample is the final winner.
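The sequential tournament can be sketched as follows (a minimal Python sketch; `beats(a, b)` stands in for the RS r -net comparison between two candidate labels and is assumed supplied by the caller):

```python
def paired_votes(classes, beats):
    """Sequential paired voting: the winner of each pairwise contest meets
    the next class; the final winner is the predicted label."""
    winner = classes[0]
    for challenger in classes[1:]:
        if not beats(winner, challenger):
            winner = challenger
    return winner
```

Only m−1 pairwise contests are run per test sample, rather than the m(m−1)/2 of a full OVO round-robin.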

After the predictions for all of the testing samples have been obtained, we calculate the test accuracy, expressed as the ratio of the number of correctly classified samples to the total number of samples, for multi-classification.

Results and analysis

Comparison of independent prediction accuracy and the number of informative genes among different models

We used nine reference models, HC-TSP [3], HC-K-TSP [3], DT [24], PAM [25], TSG [9], mRMR-SVM, SVM-RFE-SVM, Entropy-based DC and χ 2-based DC, to evaluate the performance of RS-based DC. Results from the first five models are cited from the corresponding literature, and the results from the latter four models are presented in this paper.

As a feature selection method, mRMR has two evaluation criteria: mutual information difference (MID) and mutual information quotient (MIQ). Here we used MIQ-mRMR, because MIQ is generally more robust than MID [26]. mRMR and SVM-RFE [27] only provide a list of ranked genes; therefore, we adopted the Library for Support Vector Machines (LIBSVM) as a classifier [28] to generate an informative gene subset. LIBSVM supports multi-class classification and is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. We first listed the top 2 % of informative genes according to mRMR or SVM-RFE. Second, we introduced these genes one by one and conducted 10-fold cross-validation on the training sets based on SVM. Third, we selected the genes with the highest cross-validation accuracy as our informative gene subset. Finally, we performed independent predictions using SVM with the informative genes, for the mRMR-SVM and SVM-RFE-SVM models. Four kernel functions, linear, radial basis function (RBF), sigmoid and polynomial, were evaluated in SVM, and the linear kernel produced the best accuracy on the nine datasets. Therefore, we used the linear kernel in this study, unless stated otherwise. Different penalty parameters C (\( C\in \left[{2}^{-5},{2}^{15}\right] \)) were optimized in the different SVM models with the training set. Entropy-based DC and χ 2-based DC use the same modelling process as RS-based DC, except that entropy [29] replaces complexity in Entropy-based DC, and χ 2 replaces RS in χ 2-based DC.

The test accuracy and number of informative genes for the nine multi-class datasets are listed in Table 9. The best models by average accuracy were RS-based DC (91.40 %), χ 2-based DC (89.41 %), TSG (88.99 %), PAM (87.91 %), SVM-RFE-SVM (86.23 %) and HC-K-TSP (85.45 %). Of these six models, χ 2-based DC, TSG and HC-K-TSP showed poor predictive power on the GCM, Cancers and Breast datasets, respectively. PAM selected an unacceptably large number of informative genes (an average of 1450) and also showed poor predictive performance on the Cancers dataset. RS-based DC and SVM-RFE-SVM performed robustly on all nine datasets. Compared with the nine reference models, RS-based DC selected the fewest informative genes (an average of 20.56) and achieved the highest average accuracy with the smallest standard deviation (9 %).

Table 9 Independent test accuracy and the number of informative genes (in parenthesis) among different models

The same modeling process was used for RS-based DC, Entropy-based DC and χ 2-based DC to compare the merits of the defined scores. As mentioned above, RS scores and χ 2 scores utilize sample size information, whereas entropy scores only reflect event ratios. Accordingly, RS-based DC and χ 2-based DC achieve better predictive performance than the Entropy-based DC method.

Comparison of feature selection methods

An excellent feature selection method should perform well with various classifiers. We used four reference feature selection methods, mRMR, SVM-RFE, TSG and HC-K-TSP, to evaluate the performance of RS.

As shown in Table 10, with the informative genes selected by the five feature selection methods, the average independent prediction accuracies of Naïve Bayes (NB) [31] and K-nearest neighbor (KNN) [32] on the nine datasets were clearly improved. Surprisingly, however, the four reference feature selection methods were ineffective with the SVM classifier. This seems to challenge the conventional wisdom that feature selection should improve model performance. RS, by contrast, still performed well with the SVM classifier, upholding the conventional wisdom. For the SVM classifier, on three (Lung1, SRBCT and GCM) of the nine datasets, feature selection produced essentially no improvement, regardless of the feature selection technique. The NB and KNN classifiers did not show this phenomenon as consistently, possibly because SVM is not sensitive to feature dimensionality and can therefore achieve precise predictions without feature selection. RS was the only strategy that was, on average, better than no feature selection when combined with SVM, because its improvement on the Leuk1, Breast and Cancers datasets was large enough, while it only slightly reduced prediction precision on the other datasets. These results indicate that RS is superior to the other four feature selection methods.

Table 10 Test accuracy of different classifiers with informative genes selected by different feature-selection methods

Comparison of generalization performance among different models

Of the nine models in Table 9, PAM had an unacceptable number of informative genes, DT had the lowest average accuracy (76.40 %), HC-TSP was similar to HC-K-TSP, and Entropy-based DC and χ 2-based DC were similar to RS-based DC. Therefore, we selected five typical models, mRMR-SVM, SVM-RFE-SVM, HC-K-TSP, TSG and RS-based DC, for further evaluation of generalization performance by comparing the accuracies of fitting, LOOCV and independent testing. For LIBSVM [28], the LOOCV strategy was used to optimize the penalty parameter C (\( C\in \left[{2}^{-5},{2}^{15}\right] \)) and the gamma parameter γ (\( \gamma \in \left[{2}^{-15},{2}^{3}\right] \)) in the kernel function. Suppose the training set has n samples; for a given combination of C and γ, we leave one out as a validation sample and use the other n−1 as sub-training samples, acquiring the LOOCV accuracy for this parameter combination after predicting n times. Traversing all parameter combinations, we obtain the highest LOOCV accuracy and the corresponding optimal C and γ. The optimal parameters and the training set are used to construct the predictive model. We apply this model to predict the training set and the testing set, obtaining the fitting accuracy and the independent testing accuracy, respectively. In sum, fitting and LOOCV are the internal validation in this paper, and independent testing is the external validation. The results are shown in Figs. 2, 3, 4, 5 and 6.

Fig. 2 Accuracy of mRMR-SVM for fitting, LOOCV and independent test

Fig. 3 Accuracy of SVM-RFE-SVM for fitting, LOOCV and independent test

Fig. 4 Accuracy of HC-K-TSP for fitting, LOOCV and independent test

Fig. 5 Accuracy of TSG for fitting, LOOCV and independent test

Fig. 6 Accuracy of RS-based DC for fitting, LOOCV and independent test

Over-fitting clearly occurred with all five models; average accuracy always decreased monotonically from fitting through LOOCV to independent testing. For the mRMR-SVM and SVM-RFE-SVM models, which require parameter optimization, the gaps between LOOCV average accuracy and test average accuracy were 17.22 % and 12.76 %, respectively. The HC-K-TSP, TSG and RS-based DC models, which adopt a DC core and are parameter-free, tended to generate smaller gaps (5.06 %, 3.08 % and 3.67 %, respectively). For the models requiring parameter optimization, the test accuracy was systematically lower than the LOOCV accuracy on every dataset. For the DC-core models, the test accuracy was even higher than the LOOCV accuracy on some datasets, for example HC-K-TSP on the SRBCT and Cancers datasets, TSG on the Lung1, Leuk2 and Lung2 datasets, and RS-based DC on the Leuk2 and Lung2 datasets.

Could parameter optimization be responsible for SVM’s over-fitting? It could be argued that the informative genes selected by mRMR and SVM-RFE were not the best feature subsets for the mRMR-SVM and SVM-RFE-SVM models, respectively; however, RS performed better than the other four feature selection methods (Table 10). We therefore further compared SVM performance with and without parameter optimization, based on the informative genes selected by RS. As shown in Table 11, parameter optimization considerably improved the fitting and LOOCV accuracy of SVM. For the linear kernel and the RBF kernel, the gaps between LOOCV average accuracy and test average accuracy without parameter optimization were 3.76 % and 1.90 %, respectively, whereas with parameter optimization the gaps were 4.90 % and 9.43 %, respectively. That is, over-fitting in SVM is deepened by parameter optimization.

Table 11 SVM performances with parameters optimization or not based on informative genes selected by RS

Discussion

Outlier adjustment and endpoint selection

A small number of outliers may affect gene ranking by changing the endpoints. Although not all gene expression values fit a normal distribution, the standard deviation of a normal distribution is robust for outlier adjustment when the true distribution is unknown [33]. We compared the independent test accuracies of RS-based DC under different significance levels α (i. no adjustment, ii. α = 0.01, iii. α = 0.05). As shown in Table 12, the significance level α had an evident effect on classification performance, and 0.05 was the most appropriate choice. Endpoint selection is the essence of the binarization procedure for the vertical comparison in gene evaluation. TSG uses the mean of the gene expression values as its endpoint [9]; in this paper, the endpoint defined by formula (11) is based on Fisher’s discriminant principle. We also compared the independent test accuracies of RS-based DC under different endpoint selection approaches. As shown in Table 12, the endpoint selection approach has very little influence on classification performance.

Table 12 Independent test accuracy of RS-based DC with different outlier adjustment and endpoint selection approach

Entropy and complexity

In this study, a novel score, RS, is proposed based on complexity. Complexity and entropy are very similar; the former takes sample size information into account in addition to entropy. Because entropy scores are calculated from percentages, sample size information is not fully utilized. For example, suppose a system contains three white balls and seven black balls; the entropy (H) is 0.88. Now suppose all the counts are multiplied by 10, i.e. 30 white balls and 70 black balls; H is identical to the previous case. The additional information carried by the larger sample size is completely ignored by entropy measures. For Entropy-based DC, we used entropy in place of the complexity used in RS-based DC. The results are shown in Table 9: the same modeling process was conducted for the two models, but Entropy-based DC had poorer predictive performance than RS-based DC. This result shows that the additional information associated with sample size can improve a model’s predictive performance.
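The ball example above can be checked numerically (a minimal Python sketch of formulas (2) and (3); the function names are ours):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (formula 2): depends only on the ratios n_i / N."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def complexity(counts):
    """Zhang's complexity (formula 3): each term is weighted by the raw
    count n_i, so it grows with sample size."""
    n = sum(counts)
    return -sum(c * log2(c / n) for c in counts if c)
```

Entropy of [3, 7] and of [30, 70] is identical (0.88), while complexity scales tenfold, which is exactly the extra sample size information RS exploits.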

Horizontal and vertical evaluation of gene pairs

Background differences between pair-wise genes and among samples are fairly common in microarray expression data and result in very diverse joint-effect patterns. It is difficult to evaluate all of the patterns fairly with a single strategy. As shown in Table 13, a vertical comparison cannot highlight genes G1141 and G4940 in the GCM dataset, and a horizontal comparison cannot highlight genes G6678 and G3330 in the Lung1 dataset. RS, however, highlighted both pairs of genes by integrating the vertical comparison with the horizontal comparison.

Table 13 Horizontal and vertical comparison of gene pairs in real data

Direct classifier

Parameters need to be optimized and adjusted in many classifiers, e.g. the parameters of the kernel function in SVM and the connection weights of neurons in an artificial neural network. This is a primary cause of classifier over-fitting. SVM integrates structural risk minimization, the maximal margin and transductive inference, and should thereby be able to control over-fitting efficiently. SVM-RFE-SVM and mRMR-SVM achieved the highest LOOCV accuracies of the SVM classifiers we tested, 99 % and 98.97 %, respectively; both SVM variants should therefore, in theory, achieve high test accuracy. However, the results were not as good as expected: obvious over-fitting still appeared (see Figs. 2 and 3) and was deepened by parameter optimization (see Table 11).

HC-K-TSP, TSG and RS-based DC models, on the other hand, simultaneously achieved high LOOCV accuracy, high independent test accuracy, and a small gap between them. Test accuracy higher than LOOCV accuracy appeared on different datasets for the three models, excluding the possibility that DC favors a specific dataset. The three models have different defined scores and different feature selection methods, sharing only the DC core; therefore, we believe that the DC plays an important role in effectively controlling over-fitting.

Paired votes based on binary-discriminative informative genes

In most cases, an informative gene can distinguish between just a few classes much more robustly than among all of the classes in a multi-class dataset. Therefore, it is necessary to transform datasets from multi-class to binary-class with a “one versus one” (OVO) or an OVR approach. For an m-class dataset, OVO becomes very complicated, especially with a large m, as it has to build m(m−1)/2 binary classifiers. OVR only needs to build m binary classifiers; however, a serious imbalance between the numbers of positive and negative samples may distort predictions, resulting in non-unique calls. Therefore, we employ paired votes based on binary-discriminative informative genes, integrating OVO with OVR. We first build m binary classifiers with OVR to select m BDIG subsets, then build m−1 binary classifiers with OVO to perform paired votes. For each paired vote between Class t and Class w , the feature subset {BDIGClasst ∪ BDIGClassw } is binary-discriminative and the sample sizes are balanced. Paired votes based on binary-discriminative informative genes build only 2m−1 binary classifiers and achieve robust prediction precision.

Biological relevance of informative genes selected by RS

Do the informative genes selected by RS have any biological relevance for a particular tissue/cancer type? This is particularly relevant considering that even a random set of genes may be a good predictor for defining cancer samples [34]. In our study, we scanned these potentially informative genes against PubMed. Two examples illustrate this: for the Leuk2 dataset, 13 genes out of 12,582 were selected as informative by our method, of which ten are reported in PubMed as tumor related and seven as leukemia related (see Table 14). For the Cancers dataset (prostate, breast, lung, ovary, colorectum, kidney, liver, pancreas, bladder/ureter, and gastroesophagus), 36 genes out of 12,533 were selected as informative, of which 34 are reported in PubMed as tumor related (see Table 15). Clearly, most of the informative genes selected by RS are supported by PubMed references (for the informative genes selected by the RS method on all nine datasets, see Additional file 1).

Table 14 The 10 tumor related genes selected by RS on original training group of Leuk2 dataset
Table 15 The 34 tumor related genes selected by RS on original training group of Cancers dataset

Conclusion

Gene selection and classifier choice are two key issues in the analysis of tumor microarray expression data. Gene selection depends on an evaluation strategy and a defined score. Diverse patterns of gene pairs can be highlighted more fully by integrating a vertical comparison with a horizontal comparison strategy. The RS score and the χ 2 score, which both consider event ratios as well as event frequencies, were superior to the Δ ij score and the entropy score. Parameter optimization is a main cause of classifier over-fitting, and a DC-core classifier can effectively control over-fitting. RS-based DC (for the source code, see Additional file 2), which takes all of the above factors into account, achieved the highest average independent test accuracy, the smallest average number of informative genes, and the best generalization performance. This was confirmed by testing our method on nine benchmark multi-class gene expression datasets, in comparison with nine reference models and four reference feature selection methods.