Background

Feature selection algorithms, which select a subset of the most relevant features for the underlying data mining tasks, are commonly used in combination with classifier construction to analyze “omics” data or data with high-dimensional input variables. The benefits of feature selection include minimizing model over-fitting, improved predictive performance, and computational efficiency. It may also provide insights on potential targets that relate to the fundamental differences among different classes or subtypes of a biological process [1]. Threshold Gradient Descent Regularization (TGDR) [2], one such algorithms, has been explored and implemented by us [35] extensively because it possesses several key advantages, as described in our previous paper [4].

Currently, multi-class classification, where an observation needs to be categorized into more than two classes, has attracted increasing attention in the statistics and bioinformatics literatures [610]. Its popularity may be attributed to the fact that multi-class classification is commonly encountered in real-world applications. For example, multiple classes can represent different tumor types or different responses to a therapy. According to Li et al. [6], multi-class classification can be roughly divided into two types. One type includes classification algorithms that can be directly extended to handle multi-class cases, and the other type includes algorithms that arise from the decomposition of multi-class problems into a series of binary ones.

While a series of binary TGDRs can be easily constructed to accomplish multi-class classification, it is more desirable to extend TGDR directly to the multi-class cases since this approach results in a substantial decrease of the number of classifiers being trained. The major technical difficulty associated with such extension of TGDR involves defining an overall threshold for a feature across different classes, which is not addressed in the original TGDR framework [2, 11]. In our previous work [4], we introduced one approach, referred to as multi-TGDR, for determining the threshold function. We applied the proposed multi-TGDR framework to two real-world data conducted on the Affymetrix HG-U133 Plus 2 platform and demonstrated that multi-TGDR was superior in terms of predictive accuracy and parsimony compared to its binary counterparts (i.e., one-versus-another schema). In this paper, we propose a more general method to determine the threshold function, which allows the threshold function to be class-specific.

Metabolomics is the “…systematic study of the unique chemical fingerprints that specific cellular processes leave behind” [12]. Over the last decade, metabolomics has evolved into a mainstream scientific approach practiced by many laboratories globally. The information conveyed in metabolomics data can provide insight for a variety of applications such as biomarker identification, clinical toxicology, and drug discovery and development [13]. Like other “omics” data, metabolomics data typically has the characteristics of a smaller sample size compared to the number of features (usually hundreds of metabolites after peak alignment). Therefore, it is crucial to implement feature selection. However, metabolomics data analysis is less standardized compared to other “omics” data analysis (e.g., microarray and Next-Generation Sequencing [NGS]) due to its complexity. Consequently, many of the existing feature selection algorithms have not been explored and implemented in metabolomics data analyses. Only a few algorithms have been proposed to specifically analyze mass spectrometry (MS) data [14]. Reviews on feature selection algorithms that may be used in metabolomics data analyses have been reported [1, 15].

Partial Least Square-Discrimination Analysis (PLS-DA) is a very popular multivariate analysis tool, which is commonly used in metabolomics data analyses to identify informative metabolites for many distinct purposes, such as the diagnosis or prognosis of a disease [1619]. Notably, the success of the stand-alone software SIMCAP (http://www.umetrics.com) boosts the prevalence applications of PLS-DA in metabolomics data analyses. As a supervised method, PLS-DA rotates the Principal Component Analysis (PCA) components by using the class membership information to achieve a better separation between the classes of samples. Similar to PCA, the results from PLS-DA are based on some linear combinations of all metabolites or at least the linear combinations of the selected metabolites by naively leaving out the metabolites with small variable influence on the projection (VIP, which is a weighted sum of PLS loadings). This approach not only lacks readily biological interpretation, but also does not provide valid statistics that can be used to evaluate its predictive performance. To obtain such statistics, an extra classifier is desirable in PLS-DA. For example, the study by Student and Fujarewicz [10] obtained the accuracy of PLS-DA by implementing an additional support vector machine (SVM) classifier. Furthermore, absence of predictive rules in PLS-DA makes the results of PLS-DA less practical. This is because in clinical practice, physicians would prefer to a score (e.g., posterior probabilities) to quantify a patient’s status. Therefore, an explicit predictive rule is essential for metabolomics to become a diagnostic tool.

In this paper, we investigate the use of two multi-TGDR approaches to analyze mass spectrometry metabolomics data. Hepatocellular carcinoma (HCC) is the most common type of liver cancer. Most cases of HCC are secondary to either a viral hepatitis (hepatitis B or C) or cirrhosis [20]. Currently, the gold standards for diagnosis (e.g., ultrasonography and alpha-fetoprotein [AFP]) have been reported to lack satisfactory sensitivity and specificity for identifying HCC at early stages [21, 22]. Since metabolomics can monitor the changes in small molecular comprehensively and provide insight on metabolic deregulations systematically [23, 24], researchers are employing metabolomics techniques to elucidate the difference between HCC and cirrhosis [19]. The identification of metabolic profiles for HCC/cirrhosis infected with HBV or HVC may help discriminate between HCC/cirrhosis/normal classes and achieve accurate diagnosis of HCC at early stages. Moreover, the analyses presented in this paper also provide motivation for developing feature selection algorithms specifically for metabolomics data, and for the applications of existing algorithms to metabolomics data.

Methods

The experimental data

The study included 30 patients with cirrhotic liver disease (22 infected with HBV and 8 with HCV, respectively), 70 patient with HCC (39 with HBV infection and 31 with HCV infection), and 31 healthy volunteers recruited in the metabolic profiling study. All of them provided the written informed consent, and the ethics committee of the Jilin University approved upon this study. Detailed descriptions on the study design, experimental procedures, and LC-MS metabolomics data collections were reported in [19].

Pre-processing procedures

Raw data were imported into Databridge (Waters, U.K.) for data format transformation. The resulting NeTCDF files were imported into XCMS software for the peak extraction and alignment. Then the peaks including 384 metabolites (indexed by the combination of m/z and retention time, and their corresponding peak intensities) and 131 samples were exported to an Excel file. The peak intensity values were log transformed so that the distribution of the transformed intensity values for each metabolite was approximately normal. Zeros (corresponding to no peaks) in peak intensity, were replaced by a nominal value (i.e., 0.01) before log transformation, to avoid the creation of missing values. Several other values for replacing zero values (i.e., 0.001, 0.005, 0.02, 0.05) were examined to evaluate if different nominal values would affect the results, and no difference was found. Finally, peak intensity values were further centralized and normalized to have a mean of 0 and a variance of 1. The resulting matrix was used in the two multi-TGDR frameworks for the classification analysis.

Compounds identification was achieved by comparing the accurate mass of compounds from the Human Metabolome Database: HMDB version 3 (http://www.hmdb.ca).

Methods

Here, we omit the description of binary TGDR. Interested readers may refer to the original papers [2, 11] for the detailed descriptions on binary TGDR. We present two multi-class TGDR frameworks with emphasis on the specific modifications made on the overall threshold functions to handle the multi-class problem.

Extension to multi-class classification

In the multi-class cases, we have a set of C-1 binary variables Yci , which are the indicators for class c on subject i (i = 1,…,n, here n is the total number of subjects) i.e., Yci is equal 1 if the ith subject belongs to class c and zero otherwise. C is the number of classes (C ≥ 3) and X1,…,Xn represents the feature values of one specific subject. Notably, Xi is a vector of length G and thus X is an n×G matrix with Xij for the corresponding intensity values of feature j (j = 1,…,G) on subject i. The log-likelihood function is defined as,

R β = i = 1 n c = 1 C - 1 Y ci β c 0 + β c X i T - log 1 + c = 1 C - 1 exp β c 0 + β c X i T
(1)

βc0s (c = 1,…,C-1) are unknown intercepts which are not subject to regularization. βc = (βc1,…, βcG) are the corresponding coefficients for the expression values of metabolites under consideration. In an ‘omics’ experiment, most of those betas are assumed to be zeros, implying the corresponding features are non-informative in explaining the difference across different classes. In the multi-class cases, the threshold functions of every feature (i.e., metabolites in our application) in TGDR need to be redefined across classes. In previous work [4], we proposed an extension of TGDR as described below.

Method 1

Denote Δv as the small positive increment (e.g., 0.01) in ordinary gradient descent search and v k  = k×Δv as the index for the point along the parameter path after k steps. Let β(v k ) denote the parameter estimate of β corresponding to v k . For a fixed threshold 0 ≤τ≤ 1, our proposed TGDR algorithm for multi-class cases is given as follows:

  1. 1.

    Initialize β(0)=0 and v 0 =0.

  2. 2.

    With current estimate β, compute the negative gradient matrix g(v) = - ∂R(β)/∂β with its (c,j) th component as g cj (v).

  3. 3.
    1. a)

      Let f c (v) represent the threshold vector of size G for class c (c=1,..,C-1), with its j th component calculated as

      f cj v = I | g cj v | τ × max g cl v l β c j β c
    2. b)

      Then, the j th-feature specific threshold function was defined as

      f j v = max c f cj
  4. 4.

    Update β(v+Δv) = β(v) - Δv×g(v)×f(v) and update v by v+Δv , where the (c, j) th component of the product of f and g is g cj (v) × f j (v).

  5. 5.

    Steps 2-4 are iterated K times. The number of iteration K is determined by cross validation.

As in binary TGDR, all metabolites are assumed to be non-informative at the initial stage. Parameters τ and k are the tuning parameters, and thus jointly determine the property of the estimated coefficients, including the selection of features and their corresponding magnitudes. τ can be regarded as a threshold because it determines how βs would be updated in each iteration. Two extreme cases include: if τ=0, all coefficients are nonzero for all values of k; and if τ=1, the multi-TGDR increases in the direction of one (if the gradient of the intercept term has the largest absolute value) or two covariates in each iteration. Consequently, the non-zero coefficients are few at the early iterations. With increasing k, increasing number of βs would deviate from zeros until all of them would eventually enter the model. Both τ and K are determined by using cross-validations [25].

In this framework, when one feature is selected in one comparison, it will appear in the rest comparisons even though it may not be informative in those comparisons. This is analogous to the multivariate regression model setting, where the same set of covariates is used for each response even though some of them may not be statistically significant. Alternatively, we may choose to force small estimated coefficients into zeros in the last step. Then, the set of selected features for each comparison becomes different. This framework is referred to as multi-TGDR global herein.

On the other hand, one may argue why not set fj as the minimum of fcjs instead of their maximum. So if, there is no update until one feature has large enough gradients for all C-1 comparison. Therefore, only features which are informative in all comparisons will be chosen, resulting in an optimal feature set that is used to classify all classes simultaneously. This is in conflict with the hypothesis that a good feature set consists of those highly correlated with a class but uncorrelated with other classes, which had been confirmed by Hall [26]. Moreover, the performance of such determination has been proved to be generally less favorable than that of one-versus-another or one-versus-the rest binary ensembles [10].

Method 2

Instead of having an overall threshold function for jth feature, a cth-class specific threshold function for the feature is used to select features. This modification corresponds to the step 3a in the multi-TGDR global framework. Thus, a feature is not necessarily selected in other comparisons when it is updated in one comparison. As a result, different sets of selected features are associated with different classes. This framework is herein referred to as multi-TGDR local. Figure 1 shows the flowcharts of multi-TGDR global and multi-TGDR local, and pinpoints the difference between two frameworks.

Figure 1
figure 1

The flowchart of multi-TGDR. Global: multi-TGDR global; local: multi-TGDR local.

In the above two frameworks, we treat τ as a uniform tuning parameter across classes, which can certainly be relaxed so that τ may take different values for each class, allowing different degrees of regularization for different comparisons. However, for the “omics” data where the number of features is much larger than the number of samples, in our experience τ=1 tends to give the most reasonable results. Firstly, it has the harshest threshold, resulting in the smallest set of selected features. Secondly, the predictive performance may be improved by discarding those non-informative or redundant features.

Bagging and brier score

Bagging [27] procedure was used to discard the possible noise from a single run of multi-TGDR, so that a better model parsimony can be warranted. The benefits of bagging include but are not limited to: avoidance of over-fitting, improvement on prediction, and manageable experimental verification. In many applications, e.g., [10], bootstrap resampling/bagging is mainly used to evaluate the stability of a classifier.

Besides the traditionally used confusion matrix and misclassification rate, the generalized brier score (GBS) [7] was also calculated to evaluate the predictive performance of two multi-TGDR frameworks by absorbing the extra information provided by the estimated posterior probabilities. Additional details on trimming performed on both bagging and brier score for multi-class classifications were discussed in a previous study [4].

Statistical language and packages

The statistical analysis was carried out in the R language version 2.15 (http://www.r-project.org), R codes for multi-TGDR are available upon request.

Results and discussion

Synthesized data

In order to study the empirical performance of both multi-TGDR frameworks, we used the real values for metabolites of the HCC/cirrhosis data (384 metabolites and 131 samples) but assigned the class membership according to pre-determined logit functions f. Specifically, the logit functions for class 2 and 3, having class 1 as reference, were given by following relationship for two synthesized datasets,

Simulation 1

f 2 vs 1 = - 0.1 X 1 - 0.8 X 2 - 0.9 X 3 + 2 X 4 + 1.2 X 5 , f 3 vs 1 = 0.4 X 1 - 0.5 X 2 - 0.8 X 3 + 1.7 X 4 - 1.5 X 5

where the logits for class 2 and 3 depend only on features X1 ~ X5, but differ in the direction and magnitudes of the association.

Simulation 2

f 2 vs 1 = - 0.1 X 1 + 2 X 4 + 1.2 X 5 , f 3 vs 1 = 1.5 X 2 - 0.8 X 3

where the logits for class 2 and 3 are two function with different parameters in the second simulation.

By this means, the true relevant features (i.e., X1X2 X3 X4 X5) are known and performance comparison can be made between multi-TGDRs and PLS-DA. Here, PLS-DA analysis was carried out in the software of SIMCA-P + version 12.0 (http://www.umetrics.com). A feature was eliminated unless it had VIP values larger than 1 in either of the first two PLS components. The results were given in Table 1.

Table 1 The comparison between Multi-TGDR frameworks and PLS-DA using simulated data

In summary, the true relevant features were successfully identified by all methods. The predictive performance of both multi-TGDR frameworks was superior to that of PLS-DA. Even after bagging, the final models for both multi-TGDRs include substantially more features than the true ones, which might indicate more improvement in the multi-TGDR frameworks and other relevant algorithms may exist.

Real data

A metabolomics study was conducted with the objectives of identifying potentially important differential metabolites related to HCC pathogenesis and early diagnosis, and thus providing an explicit predictive rule that can aid a physician’s diagnosis on HCC and cirrhosis. There were 131 subjects (70 with HCC, 30 with cirrhosis, and 31 normal controls, respectively) and 384 metabolites in this study. Additional details on this motivating study have been presented in [19]. Figure 2 outlines the schema of this study.

Figure 2
figure 2

The schema of the study.

Performance of multi-TGDR

In Figure 3, cross-validation scores showed minimal change, especially after k > 500 for both frameworks. So the final iteration number K in both Multi-TGDR global and Multi-TGDR local was chosen as 500. Table 2 presents the results of the two multi-TGDR approaches. Multi-TGDR global selected 45 metabolites with a 0% misclassification rate and a 3.82% 5-fold cross-validation predictive error rate. With the cutoff of bagging frequency fixed at 40%, 30 metabolites were retained in the final model (Model 1_w), which had a 0% misclassification rate and a slight improvement on GBS. On the other hand, Multi-TGDR local selected 48 metabolites with a 0% misclassification rate and a 5.34% CV-5 error rate. After applying Bagging, 29 were identified in the final model (Model 2_w) with a slight decrement in GBS. Interestingly, the metabolites in model 1_w and model 2_w are almost the same (25 overlapped). Model 1_w identified 5 extra metabolites and model 2 _w identified 4 such metabolites. Table 3 presented those overlaps and those extra metabolites identified by specific multi-TGDR framework.

Figure 3
figure 3

The comparison between cross validation (CV)-determined tuning parameter k (the iteration number) in both multi-TGDR frameworks. Global: multi-TGDR global; local: multi-TGDR local.

Table 2 The predictive performance of Multi-TGDR frameworks and PLS-DA
Table 3 The selected metabolites by both multi-TGDR frameworks (the results of model 1_w and model 2_w)

Comparison with PLS-DA analysis

The data had also been analyzed using PLS-DA [19]. There, the potential markers were chosen based on the loading plot of PLS-DA, then evaluated by VIP of the first two components in PLS-DA and further confirmed by t-tests. We compared the selected metabolites by the original analysis with the resulting metabolites from Multi-TGDR frameworks (using the whole data on which the original PLS-DA was conducted), there are only 4 overlaps between multi-TGDR global and PLS-DA, and 5 overlaps between multi-TGDR local and PLS-DA, respectively (indicated as * in Table 3).

In order to compare results obtained from PLS-DA and those from Multi-TGDR, we used the 42 metabolites selected by PLS-DA (as shown on Table 2 in [19]) and considered a naïve Bayes model as a classifier to calculate the posterior probabilities in PLS-DA. The performance of PLS-DA was shown in Table 2. In summary, the metabolites selected by Multi-TGDR have a better predictive performance than those by PLS-DA.

Evaluation on the effect of pre-processing filtering

Moderated t-tests using limma [28] were conducted to identify the differential metabolites between HCC/cirrhosis versus normal to examine the effects of pre-filtering. The cutoff for the false discovery rate (FDR) was chosen as 0.05. There were 94 down-regulated and 104 up-regulated metabolites in the comparison of cirrhosis to normal, and 63 down-regulated and 186 up-regulated metabolites for HCC to normal. In total, 302 unique differentially expressed metabolites were identified by those t-tests. Only 4 metabolites were missing from the final classifier models (i.e., model 1_w and 2_w). We then reran both multi-TGDRs with those 302 differentially expressed metabolites. The corresponding results were shown in Table 2B and Figure 4. From them, we can see the performance of both multi-TGDR on the filtered data decreased but was not substantial. To conclude, pre-filtering may save considerable computational time with marginal impact on predictive performance.

Figure 4
figure 4

The comparison of the selected metabolites by multi-TGDR frameworks on the whole data and on the reduced data (BF > 40% for both data). A. Venn-diagram for multi-TGDR global. B. Venn-diagram for multi-TGDR local. The whole data: without moderated t-test filtering. The reduced data: with t-test filtering and 72 metabolites were filtered out. The metabolites (indexed by m/z values) in red represent those filtered out by moderated t-tests. The metabolites (indexed by m/z values) in purple represent those selected by multi-TGDR framework on the whole data analysis, but excluded by bagging.

Conclusions

Metabolites selected by multi-TGDR may provide biological insight in HCC/cirrhosis. According to the description of those selected metabolites given by the HMDB, some interesting observations were gained. First, furoic acid is a metabolite produced by furfural via oxidation. Furfural is a confirmed animal carcinogen with unknown relevance to humans, and it has been suggested as a substance that produces hepatic cirrhosis [29, 30]. Here, both multi-TGDR versions selected furoic acid, while the coefficients of both comparisons (i.e., HCC versus normal, cirrhosis versus normal) are in opposite directions. A significant decrease of isoxanthopterin has been identified in cancer patients [31], however, the multi-TGDR results show an increase instead. It is well known that careful control of the participants’ intake before a metabolomics experiment is difficult. With that in mind, many of the HCC subjects may have received therapeutic treatments that might increase the level of isoxanthopterin, with residual levels present despite strict diet and intake control during the metabolomics study. In addition, over-dosage of interventions for cancer patients, especially in a developing country like China is possible. Thus the accumulation of isoxanthopterin in HCC patients is possible as a result of long-term over-dosage of relevant drugs. Meanwhile, we don’t exclude the possibility that HCC has its own unique characteristics in terms of isoxanthopterin and is consequently different from other cancer types. Further investigation on the biological explanation of those selected metabolites is definitely warranted. Here, our focus is to present the multi-TGDR frameworks and to demonstrate their applications in metabolomics.

With the aids of a feature selection algorithm like multi-TGDR (an algorithm can provide an explicit predictive rule and compute the posterior probabilities of the class membership), it is possible to design a diagnostic kit to examine the selected metabolites in a clinic setting with higher sensitivity and specificity. This kit would allow discrimination between HCC developed from HCV/HBV infections apart from cirrhosis with HCV/HBV infections, which is highly desirable and of scientific importance. One limitation of our application is that since the proportion of diseased persons in an observational study may not reflect disease prevalence in the population, care must be taken in both model construction and evaluation. To ensure a multi-TGDR model can correctly classify persons in the general population, one approach is to obtain weights based on the ratio between the proportion of diseased persons in the population and that in the study. A comprehensive investigation of these issues is the focus of our future research.

Two extensions of TGDR are proposed here for multi-class classification problems. By training only one classifier, we specifically address sub-optimality associated with dividing multi-class classification into individual binary pairs. The performance of multi-TGDR global has been shown to be excellent by us previously [4] using simulated data and two microarray data sets. Compared to multi-TGDR global, multi-TGDR local has an almost identical predictive performance in the HCC metabolomics data (in both the simulated data and the real data). Additionally, we conducted extra simulations to verify the validity of multi-TGDR local and compared its performance with multi-TGDR global. The results (included in the Additional file 1: Supplementary materials) show that both multi-TGDR algorithms can identify the true relevant features and discard the irrelevant features. Identical predictive performances are also observed even in cases where some of features are highly correlated to the relevant features. Intuitively, we hypothesized that multi-TGDR global should perform better in cases where the classes share more similarity. This entails that the same set of features is shared across different classes, but the magnitudes of the association differ. This may correspond to different stages of a disease. In contrast, multi-TGDR local may be optimal in cases where no similarity of the classes is present. This entails that complete different sets are selected across different classes, which may represent different diseases. Interestingly, the results from the simulations don’t support this hypothesis. Finally, we also examined whether multi-TGDR local is associated with less computation time since it does not need to compute the overall threshold function f j (v). However, with our current experience on the simulations and real-world applications, we found the computational effort of these two approaches to be comparable. This may be due to the fact that compared to the computation of many gradients at each iteration, the computation of maximum on f cj (v) is negligible. One obvious advantage of multi-TGDR local is that it may provide us with information on which feature is related to which class/classes.

To conclude, we recommend the use of the multi-TGDR frameworks for multi-class classifications on “omics” data because they have excellent predictive capacity. The researchers may choose to run both or either of the multi-TGDR frameworks based on their research hypotheses and data type.