Background

In statistical learning, a predictive model is learned from a hypothesis class using a finite number of training samples [1]. The distance between the learned model and the target function is often quantified as the generalization error, which can be divided into an approximation term and an estimation term. The former is determined by the capacity of the hypothesis class, while the latter is related to the finite sample size. Loosely speaking, given a finite training set, a complex hypothesis class reduces the approximation error but increases the estimation error. Therefore, for good generalization performance, it is important to find the right tradeoff between the two terms. Along this line, an intuitive solution is to build a simple predictive model with good training performance [2]. However, the “high dimensionality, small sample size” nature of many biological applications makes it extremely challenging to build a good predictive model: a simple model often fails to fit the training data, but a complex model is prone to overfitting. A commonly used strategy to tackle this dilemma is to simplify the problem itself using domain knowledge. In particular, domain information may be used to divide a learning task into several simpler problems, for which building predictive models with good generalization is feasible.

Domain information has been used to notable effect in biological problems, and there is an abundance of prior work in the fields of bioinformatics, machine learning, and pattern recognition. It is beyond the scope of this article to give a complete review of these areas. Nevertheless, a brief synopsis of some of the main findings most related to this article serves to provide a rationale for incorporating domain information in supervised learning.

Representation of domain information

Although awareness of the importance of utilizing domain information has grown, representing it in a general format usable by most state-of-the-art algorithms is still an open problem [3]. Researchers usually focus on one or several types of application-specific domain information. The various ways of utilizing domain information can be categorized as follows: the choice of attributes or features, generating new examples, incorporating domain knowledge as hints, and incorporating domain knowledge in the learning algorithms [2].

Use of domain information in the choice of attributes can include adding new attributes that appear in conjunction (or disjunction) with given attributes, or selecting attributes that satisfy particular criteria. For example, Lustgarten et al. [4] used the Empirical Proteomics Ontology Knowledge Bases in a pre-processing step to choose only 5% of candidate disease biomarkers from high-dimensional proteomic mass spectra data. The idea of generating new examples with domain information was first proposed by Poggio and Vetter [5]; later, Niyogi et al. [2] showed that the method in [5] is mathematically equivalent to a regularization process. Jing and Ng [6] presented two methods for identifying functional modules from protein-protein interaction (PPI) networks with the aid of Gene Ontology (GO) databases, one of which takes new protein pairs with a high functional relationship extracted from GO and adds them to the PPI data. Incorporating domain information as hints has not been explored in biological applications. It was first introduced by Abu-Mostafa [7], where hints were represented as a set of tests that the target function should satisfy, and an adaptive algorithm was proposed for the resulting constrained optimization.

Incorporating domain information in a learning algorithm has been investigated extensively in the literature. For example, the regularization theory transforms an ill-posed problem into a well-posed problem using prior knowledge of smoothness [8]. Verri and Poggio [9] discussed the regularization framework under the context of computer vision. Considering domain knowledge of transform invariance, Simard et al. [10] introduced the notion of transformation distance represented as a manifold to substitute for Euclidean distance. Schölkopf et al. [11] explored techniques for incorporating transformation invariance in Support Vector Machines (SVM) by constructing appropriate kernel functions. There are a large number of biological applications incorporating domain knowledge via learning algorithms. Ochs reviewed relevant research from the perspective of biological relations among different types of high-throughput data [12].

Data integration

Domain information can be perceived as data obtained from a different view; incorporating domain information is therefore related to the integration of different data sources [13, 14]. Altmann et al. [15, 16] added prediction outcomes from phenotypic models as additional features. English and Butte [13] identified biomarker genes causally associated with obesity from 49 different experiments (microarray, genetics, proteomics and knock-down experiments) across multiple species (human, mouse, and worm), integrated these findings by computing their intersection, and predicted previously unknown obesity-related genes by comparison with a standard gene list. Several researchers have applied ensemble-learning methods to incorporate learning results from domain information. For instance, Lee and Shatkay [17] ranked the potential deleterious effects of single-nucleotide polymorphisms (SNPs) by computing a weighted sum of prediction results, obtained with distinct learning tools, for four major bio-molecular functions: protein coding, splicing regulation, transcriptional regulation, and post-translational modification.

Incorporating domain information as constraints

Domain information could also be treated as constraints in many forms. For instance, Djebbari and Quackenbush [18] deduced prior network structure from the published literature and high-throughput PPI data, and used the deduced seed graph to generate a Bayesian gene-gene interaction network. Similarly, Ulitsky and Shamir [19] seeded a graphical model of gene-gene interaction from a PPI database to detect modules of co-expressed genes. In [6], Gene Ontology information was utilized to construct transitive closure sets from which the PPI network graph could grow. In all these methods, domain information was used to specify constraints on the initial states of a graph.

Domain information could be represented as part of an objective function that needs to be minimized. For example, Tian et al. [20] considered the measure of agreement between a proposed hypergraph structure and two domain assumptions, and encoded them by a network-Laplacian constraint and a neighborhood constraint in the penalized objective function. Daemen et al. [21] calculated a kernel from microarray data and another kernel from targeted proteomics domain information, both of which measure the similarity among samples from two angles, and used their sum as the final kernel function to predict the response to cetuximab in rectal cancer patients. Bogojeska et al. [22] predicted the HIV therapy outcomes by setting the model prior parameter from phenotypic domain information. Anjum et al. [23] extracted gene interaction relationships from scientific literature and public databases. Mani et al. [24] filtered a gene-gene network by the number of changes in mutual information between gene pairs for lymphoma subtypes.

Domain knowledge has been widely used in Bayesian probability models. Ramakrishnan et al. [25] computed the Bayesian posterior probability of a gene’s presence given not only the gene identification label but also its mRNA concentration. Ucar et al. [26] combined ChIP-chip data with motif binding sites, nucleosome occupancy and mRNA expression data within a probabilistic framework for the identification of functional and non-functional DNA binding events, under the assumption that the different data sources are conditionally independent. In [27], Werhli and Husmeier measured the similarity between a given network and biological domain knowledge, and used this similarity to define the prior distribution of the network structure in the form of a Gibbs distribution.

Our contributions

In this article, we present a novel method that uses domain information encoded by a discrete or categorical attribute to restructure a supervised learning problem. To select the proper discrete/categorical attribute to maximally simplify a classification problem, we propose an attribute selection metric based on conditional entropy achieved by a set of optimal classifiers built for the restructured problem space. As finding the optimal solution is computationally expensive if the number of discrete/categorical attributes is large, an approximate solution is proposed using random projections.

Methods

Many learning problems in biology are of high dimension and small sample size. The simplicity of a learning model is thus essential for the success of statistical modeling. However, the representational power of a simple model family may not be enough to capture the complexity of the target function. In many situations, a complex target function may be decomposed into several pieces, each of which can be easily described using simple models. Three binary classification examples are illustrated in Figure 1, where red/blue indicates the positive/negative class. In example (a), the decision boundary that separates the two distinct color regions is a composite of multiple polygonal lines, which suggests that the classification problem in (a) cannot be solved by a simple hypothesis class such as a linear or polynomial model. Similarly, in examples (b) and (c), the decision boundary is so complex that neither a linear nor a polynomial model fits these problems. Nevertheless, if the whole area is split into four sub-regions (the four quadrants marked 1 to 4 in the figure), each problem can be handled by solving every quadrant individually with a simple model. In example (a), the sub-problem defined on each quadrant is linearly separable. Likewise, each quadrant in (b) is suitable for a two-degree polynomial model. Since a linear model can be viewed as a special case of a two-degree polynomial, the four sub-problems in (c) can be solved by a set of two-degree polynomial models. In the three examples, a categorical attribute X3 provides such partition information.

Figure 1
figure 1

Examples of piece-wise separable classification problems. Three binary classification examples are illustrated here, where red/blue indicates positive/negative class. The figure shows that with the help of a categorical attribute X3, the three problems can be solved by simple hypothesis classes such as linear or polynomial models.

Attributes like X3 exist in many biological applications. For instance, leukemia subtype domain knowledge, which can be encoded as a discrete or categorical attribute, may help the prediction of prognosis. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. As depicted in Figure 2, the original problem is split into multiple sub-problems by one or more discrete or categorical attributes. If a proper attribute is selected in the restructuring process, each sub-problem will have a comparatively simpler target function. Our approach is fundamentally different from the decision tree approach [28]: first, the tree-like restructuring process breaks the problem up into multiple, more easily solvable sub-problems rather than making prediction decisions; second, the splitting criterion we propose here is based on the conditional entropy achieved by a categorical attribute together with a hypothesis class, whereas the conditional entropy in decision trees is achieved by an attribute only. The conditional entropy will be discussed in detail later. Our method is also related to feature selection in the sense that it picks categorical attributes according to a metric. However, it differs in that feature selection focuses on the individual discriminant power of an attribute, whereas our method studies the ability of an attribute to increase the discriminant power of all of the remaining attributes. The categorical attributes selected by our method may or may not be selected by traditional feature selection approaches.

Figure 2
figure 2

Restructuring a problem by one or more categorical attributes. Using one or more discrete or categorical attributes, the original problem is split into multiple sub-problems. If a proper attribute is selected in the restructuring process, each sub-problem will have a comparatively simpler target function.

In theory, there is no limit on the number of categorical attributes used in a partition if an infinite data sample is available. In reality, however, the finite sample size limits the number of sub-problems suitable for statistical modeling. In this article, we only consider incorporating one discrete or categorical attribute at a time. Identifying a discrete or categorical attribute that provides a good partition of a problem is nontrivial when the number of such attributes is large; we therefore propose a metric to rank these attributes.
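To make the restructuring idea concrete, the following sketch fits one simple classifier per value of a chosen categorical attribute and routes each test sample to the classifier of its own sub-problem. Python with numpy and scikit-learn is assumed here purely for illustration (the paper's own implementation is in Matlab with libsvm), and the class name and the choice of LinearSVC as the simple base model are illustrative, not part of the original method.

```python
import numpy as np
from sklearn.svm import LinearSVC

class RestructuredClassifier:
    """One simple classifier per value of a chosen discrete/categorical attribute z."""

    def __init__(self, make_model=LinearSVC):
        self.make_model = make_model
        self.models = {}

    def fit(self, X, z, y):
        # Train a separate model on each sub-problem {samples with z == value}.
        for value in np.unique(z):
            idx = (z == value)
            self.models[value] = self.make_model().fit(X[idx], y[idx])
        return self

    def predict(self, X, z):
        # Route each test sample to the model of its own sub-problem.
        # Assumes every z value in the test set was also seen in training.
        y_pred = np.empty(len(X), dtype=int)
        for value, model in self.models.items():
            idx = (z == value)
            if idx.any():
                y_pred[idx] = model.predict(X[idx])
        return y_pred
```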

An attribute selection metric

A discrete or categorical attribute is viewed as having high potential if it provides a partition that greatly reduces the complexity of the learning task, or in other words, the uncertainty of the classification problem. A hypothesis class, such as the linear function family, is assumed beforehand. We therefore quantify the potential using the information gain achieved by a set of optimal classifiers, each of which is built for a sub-problem defined by the discrete or categorical attribute under consideration. Searching for the top-ranked attribute with maximum information gain is equivalent to seeking the one with minimum conditional entropy. In a naive approach, an optimal prediction model would be built for the restructured problem induced by each discrete or categorical attribute, and the attributes would then be compared on that basis. This exhaustive approach is computationally prohibitive when the number of discrete or categorical attributes is large. We instead propose to rank attributes using a metric that can be computed efficiently.

In a classification problem, consider a set of l samples (x, y) drawn from an unknown distribution, where x ∈ ℝ^n and y is the class label. In a k-class learning task, y takes a value from {1, …, k}; in a binary classification problem, y is either 1 or –1. z denotes a discrete or categorical attribute with finitely many unique values. For simplicity, let us assume z takes values from {1, …, q}, which partitions the problem into q sub-problems: the i th sub-problem consists of all the samples for which attribute z takes value i, i ∈ {1, …, q}. Z is the set of all discrete and categorical attributes, z ∈ Z. A hypothesis class M is assumed. We first consider the linear model family; the metric can be generalized to a non-linear hypothesis class using the kernel trick [1].

For a binary classification problem, a linear discriminant function is formulated as f(x) = w^T x + c, where w is the normal vector of the corresponding hyperplane and c is the offset parameter. For a multi-class task, if the one-vs-one method [29] is applied, there are k(k – 1)/2 linear discriminant functions, each of which separates a pair of classes. Because a categorical attribute z divides the problem into q sub-problems, we define a model m for the whole problem as a set of linear discriminant functions on the q sub-problems: for a binary classification problem, m contains q linear discriminant functions; for a multi-class problem, m comprises qk(k – 1)/2 discriminant functions. Model m is specified by a pair of components (w, c), where w is the set of normal vectors of all the discriminant functions in m, and c contains all of the corresponding offset parameters.

The most informative attribute under the context discussed above is defined through the following optimization problem:

z* = arg max_{z ∈ Z} max_{m ∈ M} [H(y) – H(y | z, m)],

which is equivalent to

z* = arg min_{z ∈ Z} min_{m ∈ M} H(y | z, m).

Note that the conditional entropy used here is fundamentally different from the one normally applied in decision trees. The traditional conditional entropy H(y|z) refers to the remaining uncertainty of the class variable y given that the value of an attribute z is known. The conditional entropy used above is conditioned on the information from both attribute z and model m. In other words, the proposed method looks one step further ahead than a decision tree when assessing the data impurity of the sub-problems.
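As a small illustration, consider a sub-problem containing four positive and four negative samples. The decision-tree quantity H(y | z = j) equals 1 bit regardless of how those samples are arranged, but if some hyperplane in the assumed hypothesis class separates the two classes perfectly, H(y | z = j, w, c) drops to 0. The proposed metric therefore rewards partitions whose sub-problems are easy for the chosen model family, not merely partitions whose sub-problems are pure.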

An approximate solution

The above optimization problem cannot be solved without knowledge of the probabilistic distribution of the data. Sample-based solutions may not be useful due to the curse of dimensionality: in a high-dimensional feature space, a finite number of points can easily be separated by a hypothesis class (yielding an infinitesimal conditional entropy), but such a solution is more likely to overfit than to closely match the target function. Taking a different perspective, if a categorical attribute is able to maximally simplify the learning task, the expected impurity value with respect to all possible models within the given hypothesis class should be small. This motivates the following approximation using the expected conditional entropy with respect to a random hyperplane:

z* = arg min_{z ∈ Z} E_w[ min_c H(y | z, (w, c)) ].

The expectation could be estimated by the average over a finite number of trials. Hence, we randomly generate N sets of normal vectors (each set includes q normal vectors for binary-class or qk(k – 1)/2 for multi-class), search for the corresponding best offset for each normal vector, and calculate the average conditional entropy

H̄(y | z) = (1/N) Σ_{i=1}^{N} H(y | z, (w_i, c_i)).     (1)

In the i th random projection, w_i includes all the normal vectors of the linear classifiers, each of which is built on a sub-problem, and c_i does the same for the offsets. According to the definition of conditional entropy, H(y | z, (w_i, c_i)) in (1) is formulated as:

H(y | z, (w_i, c_i)) = Σ_{j=1}^{q} p(z = j | w_i, c_i) H(y | z = j, w_ij, c_ij) = Σ_{j=1}^{q} p(z = j) H(y | z = j, w_ij, c_ij).     (2)

Probability p(z = j) is approximated by the sub-problem size ratio. The last step of the above derivation is based on the fact that the random projections are independent of the size of the sub-problems.

In a binary classification task, z = j denotes the j th sub-problem, and (w_ij, c_ij) specifies the linear discriminant function of the i th random projection on the j th sub-problem. The discriminant function represented by (w_ij, c_ij) divides the j th sub-problem into two parts, Ω+ and Ω–:

Ω+ = {x : w_ij^T x + c_ij ≥ 0},   Ω– = {x : w_ij^T x + c_ij < 0}.

H(y | z = j, w_ij, c_ij) in (2) quantifies the remaining uncertainty of variable y in the j th sub-problem given the learned partition defined by the linear discriminant function with parameters (w_ij, c_ij):

H(y | z = j, w_ij, c_ij) = – p(Ω+) Σ_{y ∈ {1, –1}} p(y | Ω+) log p(y | Ω+) – p(Ω–) Σ_{y ∈ {1, –1}} p(y | Ω–) log p(y | Ω–).     (3)

In the computation of (3), p(Ω+) and p(Ω–) are approximated by the fraction of samples of the j th sub-problem that fall into Ω+ and Ω–. The class-conditional probabilities p(y | Ω+) and p(y | Ω–) are estimated by the proportion of positive/negative samples within Ω+ and Ω–, respectively.

In a multi-class setting, within a sub-problem, instead of two sub-regions (Ω+, Ω–), there are k sub-regions (Ω1, …, Ωk), each of which is the decision region for a class. All the categorical attributes are ranked according to (1).
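A minimal sketch of the estimation procedure for the binary case is given below (Python with numpy assumed; the paper's own implementation is in Matlab). It draws a random normal vector for each sub-problem, searches the best offset as in equation (3), weights the sub-problem entropies by the sub-problem size ratios as in equation (2), and averages over the random projections as in equation (1). The function names and the midpoint-based offset search are illustrative choices, not taken from the original code.

```python
import numpy as np

def split_entropy(scores, y):
    """H(y | the two sides of the split 'scores >= 0'), as in equation (3); y in {1, -1}."""
    def h(labels):
        if labels.size == 0:
            return 0.0
        p = np.mean(labels == 1)
        if p == 0.0 or p == 1.0:
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    pos_side = scores >= 0
    p_plus = np.mean(pos_side)
    return p_plus * h(y[pos_side]) + (1 - p_plus) * h(y[~pos_side])

def best_offset_entropy(Xj, yj, w):
    """Search candidate offsets (midpoints of the sorted projections) for the lowest entropy."""
    proj = Xj @ w
    s = np.sort(proj)
    candidates = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
    return min(split_entropy(proj - c, yj) for c in candidates)

def estimated_conditional_entropy(X, y, z, n_projections=1000, seed=0):
    """Average over random hyperplanes of the size-weighted sub-problem entropies, equations (1)-(2)."""
    rng = np.random.default_rng(seed)
    values, counts = np.unique(z, return_counts=True)
    weights = counts / counts.sum()              # p(z = j): the sub-problem size ratio
    total = 0.0
    for _ in range(n_projections):
        for value, p_j in zip(values, weights):
            idx = (z == value)
            w = rng.standard_normal(X.shape[1])  # a random normal vector for this sub-problem
            total += p_j * best_offset_entropy(X[idx], y[idx], w)
    return total / n_projections
```

All candidate discrete or categorical attributes are then ranked by this estimated value in ascending order, and the attributes with the smallest values are kept as partition candidates.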

Extension to non-linear models

Our proposed metric could be easily extended to non-linear models using the kernel trick [1]. By the dual representation of a linear model, the normal vector is represented as a weighted summation of sample data,

w = Σ_{i=1}^{l} α_i x_i,

where α_i ∈ ℝ is a weight. The linear function is then formulated as:

f(x) = Σ_{i=1}^{l} α_i x_i^T x + c.

Using the kernel trick, the inner product can be replaced by a kernel function K, where K(x_i, x) is the inner product of x_i and x in the reproducing kernel Hilbert space. Therefore, the above linear discriminant function is transformed to

f(x) = Σ_{i=1}^{l} α_i K(x_i, x) + c.     (4)

In our method, given a kernel K, random projections are achieved through α_i.
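The kernelized random projections can be sketched as follows (again in Python/numpy as an assumption; the helper names are hypothetical). Instead of drawing a normal vector w, one draws random dual weights α_i and scores each sample of a sub-problem with f(x) = Σ_i α_i K(x_i, x); these scores are then fed into the same best-offset entropy search used in the linear sketch above.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2) between the rows of A and the rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * sq_dists)

def random_kernel_scores(X_sub, gamma=1.0, rng=None):
    """Scores f(x) = sum_i alpha_i K(x_i, x) for a random draw of the dual weights alpha."""
    rng = rng or np.random.default_rng()
    alpha = rng.standard_normal(X_sub.shape[0])   # random alpha plays the role of a random w
    K = gaussian_kernel(X_sub, X_sub, gamma=gamma)
    return K @ alpha                              # one score per sample in the sub-problem
```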

Results and discussion

We tested our method on three artificial data sets, three cheminformatics data sets and two cancer microarray data sets. The random projection was executed 1000 times for each data set.

Three different kernels were applied in this paper: linear, two-degree polynomial and Gaussian. The latter two kernels have one or more parameters. For the two-degree polynomial kernel, we used the default setting K(u, v) = (u^T v)^2. Choosing a proper parameter γ for the Gaussian kernel K(u, v) = exp(–γ||u – v||^2) is not an easy task. This paper focuses on how to select one (or more) categorical or discrete attribute(s) to divide the original problem into multiple simpler sub-problems; model selection is not the focus of this work. We therefore report results for three Gaussian kernels with different γ values (0.01, 1, and 10) to demonstrate that our restructuring process extends to non-linear models, including the Gaussian kernel.
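For reference, the kernels described above correspond roughly to the following scikit-learn configurations (stated as an assumption, since the original experiments were run with libsvm in Matlab; scikit-learn's SVC wraps libsvm, and its polynomial kernel with gamma = 1 and coef0 = 0 reproduces (u^T v)^2):

```python
from sklearn.svm import SVC

kernels = {
    "linear":   SVC(kernel="linear"),
    "poly2":    SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0),  # K(u, v) = (u^T v)^2
    "rbf_0.01": SVC(kernel="rbf", gamma=0.01),                       # K(u, v) = exp(-0.01 ||u - v||^2)
    "rbf_1":    SVC(kernel="rbf", gamma=1.0),
    "rbf_10":   SVC(kernel="rbf", gamma=10.0),
}
```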

Many prediction problems, such as the learning tasks on the three cheminformatics data sets, have small sample sizes and high dimensionality. Simple models are usually preferred under these circumstances. We applied a linear kernel to these three data sets and analyzed the results from a cheminformaticist’s point of view. For comparison, two-degree polynomial kernels and Gaussian kernels were also used.

The code was written in Matlab using the libsvm package, and can be downloaded from http://cbbg.cs.olemiss.edu/StructureClassifier.zip.

Artificial data sets

Three artificial data sets were generated to test our method using both linear and non-linear models; they are shown in Figure 1. Each artificial data set is generated from four attributes: X1 and X2 are continuous attributes, and X3 and X4 are categorical attributes. The continuous attributes are uniformly distributed. X3 ∈ {1, 2, 3, 4} indexes the four smaller square sub-regions, and X4 ∈ {1, 2} is a random categorical attribute included for comparison. In the experiment, we generated 10 sets for each of Artificial Data 1, 2, and 3. All 10 sets share the same values of attributes X1, X2, and X3, but X4 is random. Average results and standard deviations were computed.

The binary class information is coded by two distinct colors. Categorical attribute X3 provides interesting partitions: the partition in (a) leads to linearly separable sub-problems, while the partitions in (b) and (c) generate nonlinear sub-problems that can be solved using techniques such as SVM with a polynomial kernel. Note that the original problem in (a) is not linearly separable, and the original problems in (b) and (c) are nonlinear and not solvable using a polynomial kernel of degree 2.
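The exact boundaries used in Figure 1 are not specified in the text, but a generator in the same spirit as data set (a) can be sketched as follows (Python/numpy assumed; the per-quadrant rules are illustrative, not the ones used in the paper): each quadrant gets its own linear boundary, so the global problem is only piecewise linearly separable.

```python
import numpy as np

def make_artificial_data_1(n=1000, seed=0):
    """Piecewise linearly separable toy data in the spirit of Figure 1(a)."""
    rng = np.random.default_rng(seed)
    X1 = rng.uniform(-1, 1, n)
    X2 = rng.uniform(-1, 1, n)
    X3 = 1 + (X1 >= 0) + 2 * (X2 >= 0)        # quadrant label in {1, 2, 3, 4}
    X4 = rng.integers(1, 3, n)                # random categorical attribute in {1, 2}
    # A different linear rule in each quadrant (illustrative slopes/intercepts):
    # the labels are linearly separable within every quadrant but not globally.
    rules = {1: (1.0, 0.0), 2: (-1.0, 0.0), 3: (-1.0, 0.0), 4: (1.0, 0.0)}
    y = np.empty(n, dtype=int)
    for q, (a, b) in rules.items():
        idx = (X3 == q)
        y[idx] = np.where(X2[idx] > a * X1[idx] + b, 1, -1)
    return np.column_stack([X1, X2]), X3, X4, y
```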

Next, we assume linear classifiers in (a) and SVMs with a polynomial kernel of degree 2 in (b) and (c). From Tables 1, 2, and 3, we see that the average estimated conditional entropy of X3 is always smaller than that of X4; hence X3 is selected to restructure the problem. We then build both linear and degree-2 polynomial SVM models on the original problems (we call this the baseline method), and linear and degree-2 polynomial models on the restructured problems induced by X3. Significant improvements in both cross-validation (CV) accuracy and test accuracy are achieved using the partitions provided by X3. For comparison, models were also built on the restructured problem produced by X4. X3 outperforms X4 by a comfortable margin, and there is no significant improvement using X4 over the baseline approaches.
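Using the generator above, the comparison between the baseline and the restructured models can be sketched as follows (again illustrative Python/scikit-learn code, with an unshuffled 700/300 split chosen arbitrarily):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X, X3, X4, y = make_artificial_data_1(n=1000, seed=0)
Xtr, Xte, ytr, yte = X[:700], X[700:], y[:700], y[700:]

# Baseline: a single linear model on the whole (piecewise linear) problem.
baseline_cv = cross_val_score(LinearSVC(), Xtr, ytr, cv=5).mean()

def restructured_test_accuracy(z):
    """Fit one linear model per value of z on the training part; report overall test accuracy."""
    z_tr, z_te = z[:700], z[700:]
    correct = 0
    for v in np.unique(z_tr):
        model = LinearSVC().fit(Xtr[z_tr == v], ytr[z_tr == v])
        correct += int((model.predict(Xte[z_te == v]) == yte[z_te == v]).sum())
    return correct / len(yte)

acc_x3 = restructured_test_accuracy(X3)   # expected to be near perfect on this toy data
acc_x4 = restructured_test_accuracy(X4)   # expected to stay close to the baseline
```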

Table 1 Experimental Results of Artificial Data 1 (Fig. 1(a)) with Linear Model.
Table 2 Experimental Results of Artificial Data 2 (Fig. 1(b)) Using Two-degree Polynomial Kernel.
Table 3 Experimental Results of Artificial Data 3 (Fig. 1(c)) Using Two-degree Polynomial Kernel.

Cheminformatics data

We tested our approach on three cheminformatics data sets, biological activity data of glycogen synthase kinase-3β inhibitors, cannabinoid receptor subtypes CB1 and CB2 activity data, and CB1/CB2 selectivity data.

Biological activity prediction of glycogen synthase kinase-3β inhibitors

In the first data set, data samples (IC50 values) were collected from several publications, ranging from subnanomolar to hundreds of micromolar. The biological activities were discretized into binary values, highly active and weakly active, with a cut-off of 100 nM. The aim is to predict biological activity from physicochemical properties and other molecular descriptors of the compounds calculated using the DragonX software [30]. This data set was divided into 548 training samples and 183 test samples. It contains 3225 attributes, 682 of which are categorical. Some discrete attributes contain a large number of values; for a fixed-size training set, some regions generated by a partition using such attributes may contain very few samples (often only 1 or 2), and hence are not suitable for training a classifier. We therefore filtered out attributes with more than 10 unique values.
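A small helper for this filtering step might look as follows (Python/numpy assumed; Z is an illustrative name for the matrix whose columns are the candidate discrete/categorical attributes):

```python
import numpy as np

def usable_categorical_columns(Z, max_values=10):
    """Indices of categorical attributes with at most `max_values` distinct values,
    so that every sub-problem keeps enough samples to train a classifier."""
    return [j for j in range(Z.shape[1]) if len(np.unique(Z[:, j])) <= max_values]
```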

Using a linear kernel, we ranked the categorical attributes based on their estimated conditional entropies. The top 31 attributes (those with the smallest estimated conditional entropies) were viewed as candidate attributes for problem partition. We restructured the learning problem according to each candidate attribute separately and built linear models for each partition. Figure 3 shows the experimental results. Among the 31 attributes, 17 beat the baseline approach in terms of both cross-validation accuracy and test accuracy. The detailed performance values and the names of the attributes are provided in Table 4. For comparison with the linear kernel, the ranking orders of these attributes under two-degree polynomial and Gaussian kernels, together with the corresponding cross-validation and test accuracies, are provided in Table 5. For the Gaussian kernels, we notice a performance improvement for most of the selected attributes under all three tested γ values. The highest performance was achieved when the Bioassay Protocol attribute was selected to restructure the problem. This attribute records the different protocols used during the cheminformatics experiments, and also indicates distinct chemotypes.

Figure 3
figure 3

Experimental results for biological activity prediction of glycogen synthase kinase-3β inhibitors. The categorical attributes were ranked based on their estimated conditional entropies. We chose the 31 attributes with the smallest entropy values as candidates for problem partition, restructured the learning problem according to each candidate attribute separately, and built linear models for each partition. Among the 31 attributes, 17 beat the baseline approach in terms of both cross-validation accuracy and test accuracy.

Table 4 Learning Performance for the Selected Categorical Attributes in Biological Activity Data of Glycogen Synthase Kinase-3β Inhibitors Using Linear Kernel.
Table 5 Performance Comparison for the Selected Categorical Attributes in Biological Activity Data of Glycogen Synthase Kinase-3β Inhibitors Using Two-degree Polynomial Kernel and Gaussian Kernels.

The attribute with the highest cross-validation performance, nCIR, belongs to the constitutional descriptors. Constitutional descriptors reflect the chemical composition of a compound without information on connectivity or geometry. nCIR is the number of circuits, which includes both rings and the larger loops around two or more rings; for instance, naphthalene contains 2 rings and 3 circuits. This attribute can easily distinguish ring-containing structures from linear structures. Many of the selected attributes have names starting with “F0”. They come from the 2D frequency fingerprints, which encode the frequency of specific atom pairs at topological distances from 1 to 10. Among all of the 2D frequency fingerprints, the atom pair “N-N” appeared multiple times; the frequency of this atom pair at topological distances from 2 to 4 could be used to separate the data set. Another important atom pair is “N-O”, which also appeared multiple times in the list. Both atom pairs contain a nitrogen atom, which is highly common in kinase inhibitor structures because it plays a key role in the hydrogen bond interactions between the inhibitor and the kinase. Another selected attribute is the atom-centered fragment H-049, which denotes an H atom attached to any of the C^3(sp3) / C^2(sp2) / C^3(sp2) / C^3(sp) groups, where the superscripts on the carbons stand for the formal oxidation number and the contents of the parentheses for the hybridization state. The hydrogen in an H-049 fragment has negative atomic hydrophobicity and low molecular refractivity [31], so it is less hydrophobic and more hydrophilic. H-049 can be used to separate the data set because kinase inhibitors are usually hydrophilic in order to bind to the protein in the ATP-binding pocket.

Cannabinoid receptor subtypes CB1 and CB2 activity and selectivity prediction

The second and third data sets are for cannabinoid receptor subtypes CB1 and CB2. They were also computed with the DragonX software and have 3225 attributes. The second data set, used to predict activity, was divided into 645 training samples and 275 test samples and contains 683 categorical attributes. The third data set, used to predict selectivity of binding to CB1 vs. CB2, includes 405 training samples, 135 test samples, and 628 categorical attributes. The experimental results are shown in Figures 4 and 5, respectively. We ordered the categorical attributes by their conditional entropy values in ascending order. Note that the model based on the first-ranked attribute always performed better than the baseline approach.

Figure 4
figure 4

Experimental results for cannabinoid receptor subtypes CB1 and CB2 activity prediction. The categorical attributes were ranked based on their estimated conditional entropies, and the top 20 attributes were chosen to partition the problem separately. Linear models were built for each partition. Among the 20 attributes, 8 have better performance than the baseline approach in terms of both cross-validation accuracy and test accuracy.

Figure 5
figure 5

Experimental results for cannabinoid receptor subtypes CB1 and CB2 selectivity prediction. The categorical attributes were ranked based on their estimated conditional entropies, and the top 20 attributes were chosen to partition the problem separately. Linear models were built for each partition. Among the 20 attributes, 5 have better performance than the baseline approach in terms of both cross-validation accuracy and test accuracy.

The classes and descriptions of the attributes that result in better performance than the baseline approach are listed in Tables 6 and 7. The learning performance comparisons with the non-linear kernels are shown in Tables 8 and 9, respectively. For CB activity, six of the eight features (F01[N-O], N-076, nArNO2, B01[N-O], N-073 and nN(CO)2) involve nitrogen, which clearly suggests that nitrogen plays a significant role in classifying the active CB ligands. The input data showed that the values of N-076 and nArNO2 for all the active compounds are 0; hence, it is very likely that any compound with an Ar-NO2 / R–N(–R)–O / RO-NO moiety or a nitro group may not be active. In addition, the majority of the active compounds have F01[N-O] and nN(CO)2 values of 0, so the lack of an N-O or an imide moiety is perhaps a common feature of active CB ligands. Furthermore, the N-073 feature is distributed between 0 and 2 in the active compounds. Hence, the nitrogen atom in the active compounds, if present, may appear in the form of Ar2NH / Ar3N / Ar2N-Al / R..N..R. Its role may include acting as a hydrogen bond acceptor or affecting the polarity of the molecule, which may facilitate ligand binding. For the CB selectivity problem, two features (nDB and nCconj) involve double bonds. Both address the non-aromatic C=C double bond, and their values are primarily distributed between 0-6 and 0-2, respectively, in the selective compounds. The role of this bond, if present, is perhaps to form hydrophobic interactions with the proteins. It is also interesting to note that the nCconj attribute leads to the best test accuracy for both the activity and selectivity data sets. The descriptions of the selected categorical attributes are given in Tables 10 and 11.

Table 6 Learning Performance for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Activity Data Using Linear Model.
Table 7 Learning Performance for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Selectivity Data Using Linear Model.
Table 8 Performance Comparison for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Activity Data Using Two-degree Polynomial Model and Gaussian Models.
Table 9 Performance Comparison for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Selectivity Data Using Two-degree Polynomial Model and Gaussian Models.
Table 10 Descriptions for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Activity Data.
Table 11 Descriptions for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Selectivity Data.

Leukemia gene data

The two leukemia gene data sets used are described in Yeoh et al. [32] and Golub et al. [33], respectively. We applied a linear classifier, an SVM with a two-degree polynomial kernel, and SVMs with Gaussian kernels to these two data sets.

Yeoh’s data [34] comprises gene expression data and two additional categorical attributes, Subtype and Protocol. Subtype indicates the specific genetic subtype of acute lymphoblastic leukemia (ALL), and Protocol denotes distinct therapies. The entire set contains 201 continuous complete remission (CCR) samples and 32 relapse cases (including 27 Heme relapses and 5 additional relapses). We randomly split the data into training and test sets of 174 and 59 samples, respectively. The original data contain 12627 attributes, almost two orders of magnitude more than the training set size; we therefore used the 58 preselected attributes provided in the original paper, plus the two additional categorical attributes, to predict prognosis. Tables 12, 13, and 14 show the experimental results using linear, two-degree polynomial and Gaussian kernels, respectively. The Subtype categorical attribute has a smaller estimated conditional entropy than Protocol and is thus selected to divide the problem. The learning performances of both the linear model and the SVM demonstrate that this is the right choice.

Table 12 Experimental Results of ALL Prognosis Prediction Using Preselected Attribute Sets and Linear Model.
Table 13 Experimental Results of ALL Prognosis Prediction Using Preselected Attribute Sets and Two-degree Polynomial Kernel.
Table 14 Experimental Results of ALL Prognosis Prediction Using Preselected Attribute Sets and Gaussian Kernel.

Golub’s data set [35] includes gene expression data and four categorical attributes: BM/PM, T/B-cell, FAB, and Gender. A random split was used to divide the whole data set into 54 training samples and 18 test samples. Correlation-based Feature Selection [36] was executed beforehand to reduce the attribute dimension from 7133 to 45. The 45 attributes include two categorical attributes, T/B-cell and FAB; FAB denotes one of the most commonly used classification schemata for acute myeloid leukemia (AML). BM/PM and Gender were removed during the feature selection process. The goal is to predict ALL or AML. From Tables 15, 16 and 17, we can see that both T/B-cell and FAB have very small conditional entropy values (possibly because this is an easy learning problem). The T/B-cell categorical attribute was selected to partition the problem.

Table 15 Experimental Results of ALL/AML Prediction Using Attributes Selected by CFS and Linear Model.
Table 16 Experimental Results of ALL/AML Prediction Using Attributes Selected by CFS and Two-degree Polynomial Kernel.
Table 17 Experimental Results of ALL/AML Prediction Using Attributes Selected by CFS and Gaussian Kernel.

Discussions and future work

For choosing a proper partition attribute, we could either select the one with the smallest conditional entropy, or the one with the highest training cross-validation accuracy among multiple candidates. The first strategy worked well for all the data sets: while it may not provide the best-performing partition, it always outperformed the baseline. The second strategy yielded the best answer in most cases, the glycogen synthase kinase-3β inhibitor data being one example, but it failed on the cannabinoid receptor subtypes CB1 and CB2 activity data.

In addition to simplifying the learning problem, the selected categorical attribute may provide an additional perspective for unveiling hidden biological information. For example, the attributes chosen from the cannabinoid receptor subtypes CB1 and CB2 data sets supply useful information for compound design.

Although the restructuring process organizes classifiers in a tree, it is fundamentally different from the splitting process of a standard decision tree: the conditional entropy in the proposed metric depends on a classifier family. In the future, we would like to extend the restructuring process to multiple layers using one or more attributes.

Conclusions

We propose a method of restructuring a supervised learning problem using a discrete or categorical attribute. Such attributes naturally divide the original problem into several non-overlapping sub-problems. With a proper choice of the attribute, the complexity of the learning task is reduced and the prediction performance is enhanced. Selecting a discrete or categorical attribute that maximally simplifies the learning task is a challenging problem: a naive approach requires an exhaustive search for the optimal learning model of each possible restructured problem, and hence is computationally prohibitive. We propose a metric that selects the categorical attribute based on the estimated expected conditional entropy with respect to random projections. The method applies to multi-class and non-linear problems. Experimental results demonstrate the good performance of the proposed approach on several data sets. Future work will develop methods and metrics that extend the approach to efficiently identify multiple categorical attributes for problem restructuring.