Background

In statistical learning, a predictive model is learned from a hypothesis class using a finite number of training samples [1]. The distance between the learned model and the target function is often quantified as the generalization error, which can be divided into an approximation term and an estimation term. The former is determined by the capacity of the hypothesis class, while the latter is related to the finite sample size. Loosely speaking, given a finite training set, a complex hypothesis class reduces the approximation error but increases the estimation error. Therefore, for good generalization performance, it is important to find the right tradeoff between the two terms. Along this line, an intuitive solution is to build a simple predictive model with good training performance [2]. However, the “high dimensionality, small sample size” nature of many biological applications makes it extremely challenging to build a good predictive model: a simple model often fails to fit the training data, but a complex model is prone to overfitting. A commonly used strategy to tackle this dilemma is to simplify the problem itself using domain knowledge. In particular, domain information may be used to divide a learning task into several simpler problems, for which building predictive models with good generalization is feasible.

Domain information has been used to notable effect in biological problems, and there is an abundance of prior work in the fields of bioinformatics, machine learning, and pattern recognition. It is beyond the scope of this article to give a complete review of these areas. Nevertheless, a brief synopsis of some of the main findings most related to this article serves to provide a rationale for incorporating domain information in supervised learning.

Representation of domain information

Although awareness of the importance of utilizing domain information has grown, representing it in a general format usable by most state-of-the-art algorithms is still an open problem [3]. Researchers usually focus on one or several types of application-specific domain information. The various ways of utilizing domain information can be categorized as follows: the choice of attributes or features, generating new examples, incorporating domain knowledge as hints, and incorporating domain knowledge in the learning algorithms [2].

Use of domain information in the choice of attributes can include adding new attributes that appear in conjunction (or disjunction) with given attributes, or selecting attributes that satisfy particular criteria. For example, Lustgarten et al. [4] used the Empirical Proteomics Ontology Knowledge Bases in a pre-processing step to choose only 5% of candidate disease biomarkers from high-dimensional proteomic mass spectra data. The idea of generating new examples with domain information was first proposed by Poggio and Vetter [5]; later, Niyogi et al. [2] showed that the method in [5] is mathematically equivalent to a regularization process. Jing and Ng [6] presented two methods for identifying functional modules from protein-protein interaction (PPI) networks with the aid of Gene Ontology (GO) databases, one of which takes new protein pairs with a high functional relationship extracted from GO and adds them to the PPI data. Incorporating domain information as hints has not been explored in biological applications. It was first introduced by Abu-Mostafa [7], where hints were represented as a set of tests that the target function should satisfy, and an adaptive algorithm was proposed for the resulting constrained optimization.

Incorporating domain information in a learning algorithm has been investigated extensively in the literature. For example, the regularization theory transforms an ill-posed problem into a well-posed problem using prior knowledge of smoothness [8]. Verri and Poggio [9] discussed the regularization framework under the context of computer vision. Considering domain knowledge of transform invariance, Simard et al. [10] introduced the notion of transformation distance represented as a manifold to substitute for Euclidean distance. Schölkopf et al. [11] explored techniques for incorporating transformation invariance in Support Vector Machines (SVM) by constructing appropriate kernel functions. There are a large number of biological applications incorporating domain knowledge via learning algorithms. Ochs reviewed relevant research from the perspective of biological relations among different types of high-throughput data [12].

Data integration

Domain information can be perceived as data obtained from a different view; incorporating domain information is therefore related to the integration of different data sources [13, 14]. Altmann et al. [15, 16] added prediction outcomes from phenotypic models as additional features. English and Butte [13] identified biomarker genes causally associated with obesity from 49 different experiments (microarray, genetics, proteomics and knock-down experiments) across multiple species (human, mouse, and worm), integrated these findings by computing their intersection, and predicted previously unknown obesity-related genes by comparison with a standard gene list. Several researchers have applied ensemble-learning methods to incorporate learning results from domain information. For instance, Lee and Shatkay [17] ranked the potential deleterious effects of single-nucleotide polymorphisms (SNPs) by computing a weighted sum of prediction results, obtained with distinct learning tools, for four major bio-molecular functions: protein coding, splicing regulation, transcriptional regulation, and post-translational modification.

Incorporating domain information as constraints

Domain information could also be treated as constraints in many forms. For instance, Djebbari and Quackenbush [18] deduced prior network structure from the published literature and high-throughput PPI data, and used the deduced seed graph to generate a Bayesian gene-gene interaction network. Similarly, Ulitsky and Shamir [19] seeded a graphical model of gene-gene interaction from a PPI database to detect modules of co-expressed genes. In [6], Gene Ontology information was utilized to construct transitive closure sets from which the PPI network graph could grow. In all these methods, domain information was used to specify constraints on the initial states of a graph.

Domain information could be represented as part of an objective function that needs to be minimized. For example, Tian et al. [20] considered the measure of agreement between a proposed hypergraph structure and two domain assumptions, and encoded them by a network-Laplacian constraint and a neighborhood constraint in the penalized objective function. Daemen et al. [21] calculated a kernel from microarray data and another kernel from targeted proteomics domain information, both of which measure the similarity among samples from two angles, and used their sum as the final kernel function to predict the response to cetuximab in rectal cancer patients. Bogojeska et al. [22] predicted the HIV therapy outcomes by setting the model prior parameter from phenotypic domain information. Anjum et al. [23] extracted gene interaction relationships from scientific literature and public databases. Mani et al. [24] filtered a gene-gene network by the number of changes in mutual information between gene pairs for lymphoma subtypes.

Domain knowledge has been widely used in Bayesian probability models. Ramakrishnan et al. [25] computed the Bayesian posterior probability of a gene’s presence given not only the gene identification label but also its mRNA concentration. Ucar et al. [26] combined ChIP-chip data with motif binding sites, nucleosome occupancy and mRNA expression data within a probabilistic framework for the identification of functional and non-functional DNA binding events, under the assumption that the different data sources are conditionally independent. In [27], Werhli and Husmeier measured the similarity between a given network and biological domain knowledge, and used this similarity to define the prior distribution of the network structure in the form of a Gibbs distribution.

Our contributions

In this article, we present a novel method that uses domain information encoded by a discrete or categorical attribute to restructure a supervised learning problem. To select the proper discrete/categorical attribute to maximally simplify a classification problem, we propose an attribute selection metric based on conditional entropy achieved by a set of optimal classifiers built for the restructured problem space. As finding the optimal solution is computationally expensive if the number of discrete/categorical attributes is large, an approximate solution is proposed using random projections.

Methods

Many learning problems in biology are of high dimension and small sample size. The simplicity of a learning model is thus essential for the success of statistical modeling. However, the representational power of a simple model family may not be enough to capture the complexity of the target function. In many situations, a complex target function may be decomposed into several pieces, each of which can be easily described using simple models. Three binary classification examples are illustrated in Figure 1, where red/blue indicates the positive/negative class. In example (a), the decision boundary that separates the two distinct color regions is a composite of multiple polygonal lines, which suggests that the classification problem in (a) cannot be solved by a simple hypothesis class such as a linear or polynomial model. Similarly, in examples (b) and (c), the decision boundary is so complex that neither a linear nor a polynomial model fits these problems. Nevertheless, if the whole area is split into four sub-regions (the four quadrants marked 1 to 4 in the figure), each problem can be handled by solving every quadrant individually with a simple model. In example (a), the sub-problem defined on each quadrant is linearly separable. Likewise, each quadrant in (b) is suitable for a two-degree polynomial model. Since a linear model can be viewed as a special case of a two-degree polynomial, the four sub-problems in (c) can be solved by a set of two-degree polynomial models. In the three examples, a categorical attribute X3 provides such partition information.

Figure 1
figure 1

Examples of piece-wise separable classification problems. Three binary classification examples are illustrated here, where red/blue indicates positive/negative class. The figure shows that with the help of a categorical attribute X3, the three problems can be solved by simple hypothesis classes such as linear or polynomial models.

Attributes like X3 exist in many biological applications. For instance, leukemia subtype domain knowledge, which can be encoded as a discrete or categorical attribute, may help the prediction of prognosis. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. As depicted in Figure 2, the original problem is split into multiple sub-problems by one or more discrete or categorical attributes. If a proper attribute is selected in the restructuring process, each sub-problem will have a comparatively simpler target function. Our approach is fundamentally different from the decision tree approach [28]: first, the tree-like restructuring process breaks the problem up into multiple, more easily solvable sub-problems rather than making prediction decisions; second, the splitting criterion we propose here is based on the conditional entropy achieved by a categorical attribute together with a hypothesis class, whereas the conditional entropy in decision trees is achieved by an attribute only. The conditional entropy will be discussed in detail later. Our method is also related to feature selection in the sense that it picks categorical attributes according to a metric. However, it differs in that feature selection focuses on the individual discriminant power of an attribute, whereas our method studies the ability of an attribute to increase the discriminant power of all of the remaining attributes. The categorical attributes selected by our method may or may not be selected by traditional feature selection approaches.

Figure 2
figure 2

Restructuring a problem by one or more categorical attributes. Using one or more discrete or categorical attributes, the original problem is split into multiple sub-problems. If a proper attribute is selected in the restructuring process, each sub-problem will have a comparatively simpler target function.

In theory, there is no limit on the number of categorical attributes used in a partition if an infinite data sample is available. In reality, however, the finite sample size limits the number of sub-problems suitable for statistical modeling. In this article, we only consider incorporating one discrete or categorical attribute at a time. Identifying a discrete or categorical attribute that provides a good partition of a problem is nontrivial when the number of such attributes is large; we therefore propose a metric to rank these attributes.
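To make the restructuring idea concrete, the following sketch fits one simple classifier per value of a chosen categorical attribute and routes each test sample to the classifier of its own sub-problem. Python with numpy and scikit-learn is assumed here purely for illustration (the paper's own implementation is in Matlab with libsvm), and the class name and the choice of LinearSVC as the simple base model are illustrative, not part of the original method.

```python
import numpy as np
from sklearn.svm import LinearSVC

class RestructuredClassifier:
    """One simple classifier per value of a chosen discrete/categorical attribute z."""

    def __init__(self, make_model=LinearSVC):
        self.make_model = make_model
        self.models = {}

    def fit(self, X, z, y):
        # Train a separate model on each sub-problem {samples with z == value}.
        for value in np.unique(z):
            idx = (z == value)
            self.models[value] = self.make_model().fit(X[idx], y[idx])
        return self

    def predict(self, X, z):
        # Route each test sample to the model of its own sub-problem.
        # Assumes every z value in the test set was also seen in training.
        y_pred = np.empty(len(X), dtype=int)
        for value, model in self.models.items():
            idx = (z == value)
            if idx.any():
                y_pred[idx] = model.predict(X[idx])
        return y_pred
```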

An attribute selection metric

A discrete or categorical attribute is viewed as having high potential if it provides a partition that greatly reduces the complexity of the learning task, or in other words, the uncertainty of the classification problem. A hypothesis class, such as the linear function family, is assumed beforehand. We therefore quantify the potential using the information gain achieved by a set of optimal classifiers, each of which is built for a sub-problem defined by the discrete or categorical attribute under consideration. Searching for the top-ranked attribute with maximum information gain is equivalent to seeking the one with minimum conditional entropy. In a naive approach, an optimal prediction model would be built for the restructured problem induced by each discrete or categorical attribute, and the attributes would then be compared on that basis. This exhaustive approach is computationally prohibitive when the number of discrete or categorical attributes is large. We instead propose to rank attributes using a metric that can be computed efficiently.

In a classification problem, consider a set of l samples (x, y) drawn from an unknown distribution, where x ∈ ℝ^n and y is the class label. In a k-class learning task, y takes a value from {1, …, k}; in a binary classification problem, y is either 1 or –1. z denotes a discrete or categorical attribute with finitely many unique values. For simplicity, let us assume z takes values from {1, …, q}, which partitions the problem into q sub-problems: the i th sub-problem consists of all the samples for which attribute z takes value i, i ∈ {1, …, q}. Z is the set of all discrete and categorical attributes, z ∈ Z. A hypothesis class M is assumed. We first consider the linear model family; the metric can be generalized to a non-linear hypothesis class using the kernel trick [1].

For a binary classification problem, a linear discriminant function is formulated as f(x) = w^T x + c, where w is the normal vector of the corresponding hyperplane and c is the offset parameter. For a multi-class task, if the one-vs-one method [29] is applied, there are k(k – 1)/2 linear discriminant functions, each of which separates a pair of classes. Because a categorical attribute z divides the problem into q sub-problems, we define a model m for the whole problem as a set of linear discriminant functions on the q sub-problems: for a binary classification problem, m contains q linear discriminant functions; for a multi-class problem, m comprises qk(k – 1)/2 discriminant functions. Model m is specified by a pair of components (w, c), where w is the set of normal vectors of all the discriminant functions in m, and c contains all of the corresponding offset parameters.

The most informative attribute under the context discussed above is defined through the following optimization problem:

z* = arg max_{z ∈ Z} max_{m ∈ M} [H(y) – H(y | z, m)],

which is equivalent to

z* = arg min_{z ∈ Z} min_{m ∈ M} H(y | z, m).

Note that the conditional entropy used here is fundamentally different from the one normally applied in decision trees. The traditional conditional entropy H(y|z) refers to the remaining uncertainty of the class variable y given that the value of an attribute z is known. The conditional entropy used above is conditioned on the information from both attribute z and model m. In other words, the proposed method looks one step further ahead than a decision tree when assessing the data impurity of the sub-problems.
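As a small illustration, consider a sub-problem containing four positive and four negative samples. The decision-tree quantity H(y | z = j) equals 1 bit regardless of how those samples are arranged, but if some hyperplane in the assumed hypothesis class separates the two classes perfectly, H(y | z = j, w, c) drops to 0. The proposed metric therefore rewards partitions whose sub-problems are easy for the chosen model family, not merely partitions whose sub-problems are pure.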

An approximate solution

The above optimization problem cannot be solved without knowledge of the probabilistic distribution of the data. Sample-based solutions may not be useful due to the curse of dimensionality: in a high-dimensional feature space, a finite number of points can easily be separated by a hypothesis class (yielding an infinitesimal conditional entropy), but such a solution is more likely to overfit than to closely match the target function. Taking a different perspective, if a categorical attribute is able to maximally simplify the learning task, the expected impurity value with respect to all possible models within the given hypothesis class should be small. This motivates the following approximation using the expected conditional entropy with respect to a random hyperplane:

z* = arg min_{z ∈ Z} E_w[ min_c H(y | z, (w, c)) ].

The expectation could be estimated by the average over a finite number of trials. Hence, we randomly generate N sets of normal vectors (each set includes q normal vectors for binary-class or qk(k – 1)/2 for multi-class), search for the corresponding best offset for each normal vector, and calculate the average conditional entropy

H̄(y | z) = (1/N) Σ_{i=1}^{N} H(y | z, (w_i, c_i)).     (1)

In the i th random projection, w_i includes all the normal vectors of the linear classifiers, each of which is built on a sub-problem, and c_i does the same for the offsets. According to the definition of conditional entropy, H(y | z, (w_i, c_i)) in (1) is formulated as:

H(y | z, (w_i, c_i)) = Σ_{j=1}^{q} p(z = j | w_i, c_i) H(y | z = j, w_ij, c_ij) = Σ_{j=1}^{q} p(z = j) H(y | z = j, w_ij, c_ij).     (2)

Probability p(z = j) is approximated by the sub-problem size ratio. The last step of the above derivation is based on the fact that the random projections are independent of the size of the sub-problems.

In a binary classification task, z = j denotes the j th sub-problem, and (w_ij, c_ij) specifies the linear discriminant function of the i th random projection on the j th sub-problem. The discriminant function represented by (w_ij, c_ij) divides the j th sub-problem into two parts, Ω+ and Ω–:

Ω+ = {x : w_ij^T x + c_ij ≥ 0},   Ω– = {x : w_ij^T x + c_ij < 0}.

H(y | z = j, w_ij, c_ij) in (2) quantifies the remaining uncertainty of variable y in the j th sub-problem given the learned partition defined by the linear discriminant function with parameters (w_ij, c_ij):

H(y | z = j, w_ij, c_ij) = – p(Ω+) Σ_{y ∈ {1, –1}} p(y | Ω+) log p(y | Ω+) – p(Ω–) Σ_{y ∈ {1, –1}} p(y | Ω–) log p(y | Ω–).     (3)

In the computation of (3), p(Ω+) and p(Ω–) are approximated by the fraction of samples of the j th sub-problem that fall into Ω+ and Ω–. The class-conditional probabilities p(y | Ω+) and p(y | Ω–) are estimated by the proportion of positive/negative samples within Ω+ and Ω–, respectively.

In a multi-class setting, within a sub-problem, instead of two sub-regions (Ω+, Ω–), there are k sub-regions (Ω1, …, Ωk), each of which is the decision region for a class. All the categorical attributes are ranked according to (1).
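A minimal sketch of the estimation procedure for the binary case is given below (Python with numpy assumed; the paper's own implementation is in Matlab). It draws a random normal vector for each sub-problem, searches the best offset as in equation (3), weights the sub-problem entropies by the sub-problem size ratios as in equation (2), and averages over the random projections as in equation (1). The function names and the midpoint-based offset search are illustrative choices, not taken from the original code.

```python
import numpy as np

def split_entropy(scores, y):
    """H(y | the two sides of the split 'scores >= 0'), as in equation (3); y in {1, -1}."""
    def h(labels):
        if labels.size == 0:
            return 0.0
        p = np.mean(labels == 1)
        if p == 0.0 or p == 1.0:
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    pos_side = scores >= 0
    p_plus = np.mean(pos_side)
    return p_plus * h(y[pos_side]) + (1 - p_plus) * h(y[~pos_side])

def best_offset_entropy(Xj, yj, w):
    """Search candidate offsets (midpoints of the sorted projections) for the lowest entropy."""
    proj = Xj @ w
    s = np.sort(proj)
    candidates = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
    return min(split_entropy(proj - c, yj) for c in candidates)

def estimated_conditional_entropy(X, y, z, n_projections=1000, seed=0):
    """Average over random hyperplanes of the size-weighted sub-problem entropies, equations (1)-(2)."""
    rng = np.random.default_rng(seed)
    values, counts = np.unique(z, return_counts=True)
    weights = counts / counts.sum()              # p(z = j): the sub-problem size ratio
    total = 0.0
    for _ in range(n_projections):
        for value, p_j in zip(values, weights):
            idx = (z == value)
            w = rng.standard_normal(X.shape[1])  # a random normal vector for this sub-problem
            total += p_j * best_offset_entropy(X[idx], y[idx], w)
    return total / n_projections
```

All candidate discrete or categorical attributes are then ranked by this estimated value in ascending order, and the attributes with the smallest values are kept as partition candidates.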

Extension to non-linear models

Our proposed metric could be easily extended to non-linear models using the kernel trick [1]. By the dual representation of a linear model, the normal vector is represented as a weighted summation of sample data,

w = Σ_{i=1}^{l} α_i x_i,

where α_i ∈ ℝ is a weight. The linear function is then formulated as:

f(x) = Σ_{i=1}^{l} α_i x_i^T x + c.

Using the kernel trick, the inner product can be replaced by a kernel function K, where K(x_i, x) is the inner product of x_i and x in the reproducing kernel Hilbert space. Therefore, the above linear discriminant function is transformed to

f(x) = Σ_{i=1}^{l} α_i K(x_i, x) + c.     (4)

In our method, given a kernel K, random projections are achieved through α_i.
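The kernelized random projections can be sketched as follows (again in Python/numpy as an assumption; the helper names are hypothetical). Instead of drawing a normal vector w, one draws random dual weights α_i and scores each sample of a sub-problem with f(x) = Σ_i α_i K(x_i, x); these scores are then fed into the same best-offset entropy search used in the linear sketch above.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2) between the rows of A and the rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * sq_dists)

def random_kernel_scores(X_sub, gamma=1.0, rng=None):
    """Scores f(x) = sum_i alpha_i K(x_i, x) for a random draw of the dual weights alpha."""
    rng = rng or np.random.default_rng()
    alpha = rng.standard_normal(X_sub.shape[0])   # random alpha plays the role of a random w
    K = gaussian_kernel(X_sub, X_sub, gamma=gamma)
    return K @ alpha                              # one score per sample in the sub-problem
```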

Results and discussion

We tested our method on three artificial data sets, three cheminformatics data sets and two cancer microarray data sets. The random projection was executed 1000 times for each data set.

Three different kernels were applied in this paper: linear, two-degree polynomial and Gaussian. The latter two kernels have one or more parameters. For the two-degree polynomial kernel, we used the default setting K(u, v) = (u^T v)^2. Choosing a proper parameter γ for the Gaussian kernel K(u, v) = exp(–γ||u – v||^2) is not an easy task. This paper focuses on how to select one (or more) categorical or discrete attribute(s) to divide the original problem into multiple simpler sub-problems; model selection is not the focus of this work. We therefore report results for three Gaussian kernels with different γ values (0.01, 1, and 10) to demonstrate that our restructuring process extends to non-linear models, including the Gaussian kernel.
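For reference, the kernels described above correspond roughly to the following scikit-learn configurations (stated as an assumption, since the original experiments were run with libsvm in Matlab; scikit-learn's SVC wraps libsvm, and its polynomial kernel with gamma = 1 and coef0 = 0 reproduces (u^T v)^2):

```python
from sklearn.svm import SVC

kernels = {
    "linear":   SVC(kernel="linear"),
    "poly2":    SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0),  # K(u, v) = (u^T v)^2
    "rbf_0.01": SVC(kernel="rbf", gamma=0.01),                       # K(u, v) = exp(-0.01 ||u - v||^2)
    "rbf_1":    SVC(kernel="rbf", gamma=1.0),
    "rbf_10":   SVC(kernel="rbf", gamma=10.0),
}
```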

Many prediction problems, such as the learning tasks on the three cheminformatics data sets, have small sample sizes and high dimensionality. Simple models are usually preferred under these circumstances. We applied a linear kernel to these three data sets and analyzed the results from a cheminformaticist’s point of view. For comparison, two-degree polynomial kernels and Gaussian kernels were also used.

The code was written in Matlab using the libsvm package, and can be downloaded from http://cbbg.cs.olemiss.edu/StructureClassifier.zip.

Artificial data sets

Three artificial data sets were generated to test our method using both linear and non-linear models; they are shown in Figure 1. Each artificial data set is generated from four attributes: X1 and X2 are continuous attributes, and X3 and X4 are categorical attributes. The continuous attributes are uniformly distributed. X3 ∈ {1, 2, 3, 4} indexes the four smaller square sub-regions, and X4 ∈ {1, 2} is a random categorical attribute included for comparison. In the experiment, we generated 10 sets for each of Artificial Data 1, 2, and 3. All 10 sets share the same values of attributes X1, X2, and X3, but X4 is random. Average results and standard deviations were computed.

The binary class information is coded by two distinct colors. Categorical attribute X3 provides interesting partitions: the partition in (a) leads to linearly separable sub-problems, while the partitions in (b) and (c) generate nonlinear sub-problems that can be solved using techniques such as SVM with a polynomial kernel. Note that the original problem in (a) is not linearly separable, and the original problems in (b) and (c) are nonlinear and not solvable using a polynomial kernel of degree 2.
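The exact boundaries used in Figure 1 are not specified in the text, but a generator in the same spirit as data set (a) can be sketched as follows (Python/numpy assumed; the per-quadrant rules are illustrative, not the ones used in the paper): each quadrant gets its own linear boundary, so the global problem is only piecewise linearly separable.

```python
import numpy as np

def make_artificial_data_1(n=1000, seed=0):
    """Piecewise linearly separable toy data in the spirit of Figure 1(a)."""
    rng = np.random.default_rng(seed)
    X1 = rng.uniform(-1, 1, n)
    X2 = rng.uniform(-1, 1, n)
    X3 = 1 + (X1 >= 0) + 2 * (X2 >= 0)        # quadrant label in {1, 2, 3, 4}
    X4 = rng.integers(1, 3, n)                # random categorical attribute in {1, 2}
    # A different linear rule in each quadrant (illustrative slopes/intercepts):
    # the labels are linearly separable within every quadrant but not globally.
    rules = {1: (1.0, 0.0), 2: (-1.0, 0.0), 3: (-1.0, 0.0), 4: (1.0, 0.0)}
    y = np.empty(n, dtype=int)
    for q, (a, b) in rules.items():
        idx = (X3 == q)
        y[idx] = np.where(X2[idx] > a * X1[idx] + b, 1, -1)
    return np.column_stack([X1, X2]), X3, X4, y
```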

Next, we assume linear classifiers in (a) and SVMs with a polynomial kernel of degree 2 in (b) and (c). From Tables 1, 2, and 3, we see that the average estimated conditional entropy of X3 is always smaller than that of X4; hence X3 is selected to restructure the problem. We then build both linear and degree-2 polynomial SVM models on the original problems (we call this the baseline method), and linear and degree-2 polynomial models on the restructured problems induced by X3. Significant improvements in both cross-validation (CV) accuracy and test accuracy are achieved using the partitions provided by X3. For comparison, models were also built on the restructured problem produced by X4. X3 outperforms X4 by a comfortable margin, and there is no significant improvement using X4 over the baseline approaches.
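Using the generator above, the comparison between the baseline and the restructured models can be sketched as follows (again illustrative Python/scikit-learn code, with an unshuffled 700/300 split chosen arbitrarily):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X, X3, X4, y = make_artificial_data_1(n=1000, seed=0)
Xtr, Xte, ytr, yte = X[:700], X[700:], y[:700], y[700:]

# Baseline: a single linear model on the whole (piecewise linear) problem.
baseline_cv = cross_val_score(LinearSVC(), Xtr, ytr, cv=5).mean()

def restructured_test_accuracy(z):
    """Fit one linear model per value of z on the training part; report overall test accuracy."""
    z_tr, z_te = z[:700], z[700:]
    correct = 0
    for v in np.unique(z_tr):
        model = LinearSVC().fit(Xtr[z_tr == v], ytr[z_tr == v])
        correct += int((model.predict(Xte[z_te == v]) == yte[z_te == v]).sum())
    return correct / len(yte)

acc_x3 = restructured_test_accuracy(X3)   # expected to be near perfect on this toy data
acc_x4 = restructured_test_accuracy(X4)   # expected to stay close to the baseline
```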

Table 1 Experimental Results of Artificial Data 1 (Fig. 1(a)) with Linear Model.
Table 2 Experimental Results of Artificial Data 2 (Fig. 1(b)) Using Two-degree Polynomial Kernel.
Table 3 Experimental Results of Artificial Data 3 (Fig. 1(c)) Using Two-degree Polynomial Kernel.

Cheminformatics data

We tested our approach on three cheminformatics data sets, biological activity data of glycogen synthase kinase-3β inhibitors, cannabinoid receptor subtypes CB1 and CB2 activity data, and CB1/CB2 selectivity data.

Biological activity prediction of glycogen synthase kinase-3β inhibitors

In the first data set, data samples (IC50 values) were collected from several publications, ranging from subnanomolar to hundreds of micromolar. The biological activities were discretized into binary values, highly active and weakly active, with a cut-off of 100 nM. The aim is to predict biological activity from physicochemical properties and other molecular descriptors of the compounds calculated using the DragonX software [30]. This data set was divided into 548 training samples and 183 test samples. It contains 3225 attributes, 682 of which are categorical. Some discrete attributes contain a large number of values; for a fixed-size training set, some regions generated by a partition using such attributes may contain very few samples (often only 1 or 2), and hence are not suitable for training a classifier. We therefore filtered out attributes with more than 10 unique values.
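A small helper for this filtering step might look as follows (Python/numpy assumed; Z is an illustrative name for the matrix whose columns are the candidate discrete/categorical attributes):

```python
import numpy as np

def usable_categorical_columns(Z, max_values=10):
    """Indices of categorical attributes with at most `max_values` distinct values,
    so that every sub-problem keeps enough samples to train a classifier."""
    return [j for j in range(Z.shape[1]) if len(np.unique(Z[:, j])) <= max_values]
```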

Using a linear kernel, we ranked the categorical attributes based on their estimated conditional entropies. The top 31 attributes (those with the smallest estimated conditional entropies) were viewed as candidate attributes for problem partition. We restructured the learning problem according to each candidate attribute separately and built linear models for each partition. Figure 3 shows the experimental results. Among the 31 attributes, 17 beat the baseline approach in terms of both cross-validation accuracy and test accuracy. The detailed performance values and the names of the attributes are provided in Table 4. For comparison with the linear kernel, the ranking orders of these attributes under two-degree polynomial and Gaussian kernels, together with the corresponding cross-validation and test accuracies, are provided in Table 5. For the Gaussian kernels, we notice a performance improvement for most of the selected attributes under all three tested γ values. The highest performance was achieved when the Bioassay Protocol attribute was selected to restructure the problem. This attribute records the different protocols used during the cheminformatics experiments, and also indicates distinct chemotypes.

Figure 3
figure 3

Experimental results for biological activity prediction of glycogen synthase kinase-3β inhibitors. The categorical attributes were ranked based on their estimated conditional entropies. We chose the 31 attributes with the smallest entropy values as candidates for problem partition, restructured the learning problem according to each candidate attribute separately, and built linear models for each partition. Among the 31 attributes, 17 beat the baseline approach in terms of both cross-validation accuracy and test accuracy.

Table 4 Learning Performance for the Selected Categorical Attributes in Biological Activity Data of Glycogen Synthase Kinase-3β Inhibitors Using Linear Kernel.
Table 5 Performance Comparison for the Selected Categorical Attributes in Biological Activity Data of Glycogen Synthase Kinase-3β Inhibitors Using Two-degree Polynomial Kernel and Gaussian Kernels.

The attribute with the highest cross-validation performance, nCIR, belongs to the constitutional descriptors. Constitutional descriptors reflect the chemical composition of a compound without information on connectivity or geometry. nCIR is the number of circuits, which includes both rings and the larger loops around two or more rings; for instance, naphthalene contains 2 rings and 3 circuits. This attribute can easily distinguish ring-containing structures from linear structures. Many of the selected attributes have names starting with “F0”. They come from the 2D frequency fingerprints, which encode the frequency of specific atom pairs at topological distances from 1 to 10. Among all of the 2D frequency fingerprints, the atom pair “N-N” appeared multiple times; the frequency of this atom pair at topological distances from 2 to 4 could be used to separate the data set. Another important atom pair is “N-O”, which also appeared multiple times in the list. Both atom pairs contain a nitrogen atom, which is highly common in kinase inhibitor structures because it plays a key role in the hydrogen bond interactions between the inhibitor and the kinase. Another selected attribute is the atom-centered fragment H-049, which denotes an H atom attached to any of the C^3(sp3) / C^2(sp2) / C^3(sp2) / C^3(sp) groups, where the superscripts on the carbons stand for the formal oxidation number and the contents of the parentheses for the hybridization state. The hydrogen in an H-049 fragment has negative atomic hydrophobicity and low molecular refractivity [31], so it is less hydrophobic and more hydrophilic. H-049 can be used to separate the data set because kinase inhibitors are usually hydrophilic in order to bind to the protein in the ATP-binding pocket.

Cannabinoid receptor subtypes CB1 and CB2 activity and selectivity prediction

The second and third data sets are for cannabinoid receptor subtypes CB1 and CB2. They were also computed with the DragonX software and have 3225 attributes. The second data set, used to predict activity, was divided into 645 training samples and 275 test samples and contains 683 categorical attributes. The third data set, used to predict selectivity of binding to CB1 vs. CB2, includes 405 training samples, 135 test samples, and 628 categorical attributes. The experimental results are shown in Figures 4 and 5, respectively. We ordered the categorical attributes by their conditional entropy values in ascending order. Note that the model based on the first-ranked attribute always performed better than the baseline approach.

Figure 4
figure 4

Experimental results for cannabinoid receptor subtypes CB1 and CB2 activity prediction. The categorical attributes were ranked based on their estimated conditional entropies, and the top 20 attributes were chosen to partition the problem separately. Linear models were built for each partition. Among the 20 attributes, 8 have better performance than the baseline approach in terms of both cross-validation accuracy and test accuracy.

Figure 5
figure 5

Experimental results for cannabinoid receptor subtypes CB1 and CB2 selectivity prediction. The categorical attributes were ranked based on their estimated conditional entropies, and the top 20 attributes were chosen to partition the problem separately. Linear models were built for each partition. Among the 20 attributes, 5 have better performance than the baseline approach in terms of both cross-validation accuracy and test accuracy.

The classes and descriptions of the attributes that result in better performance than the baseline approach are listed in Tables 6 and 7. The learning performance comparisons with the non-linear kernels are shown in Tables 8 and 9, respectively. For CB activity, six of the eight features (F01[N-O], N-076, nArNO2, B01[N-O], N-073 and nN(CO)2) involve nitrogen, which clearly suggests that nitrogen plays a significant role in classifying the active CB ligands. The input data showed that the values of N-076 and nArNO2 for all the active compounds are 0; hence, it is very likely that any compound with an Ar-NO2 / R–N(–R)–O / RO-NO moiety or a nitro group may not be active. In addition, the majority of the active compounds have F01[N-O] and nN(CO)2 values of 0, so the lack of an N-O or an imide moiety is perhaps a common feature of active CB ligands. Furthermore, the N-073 feature is distributed between 0 and 2 in the active compounds. Hence, the nitrogen atom in the active compounds, if present, may appear in the form of Ar2NH / Ar3N / Ar2N-Al / R..N..R. Its role may include acting as a hydrogen bond acceptor or affecting the polarity of the molecule, which may facilitate ligand binding. For the CB selectivity problem, two features (nDB and nCconj) involve double bonds. Both address the non-aromatic C=C double bond, and their values are primarily distributed between 0-6 and 0-2, respectively, in the selective compounds. The role of this bond, if present, is perhaps to form hydrophobic interactions with the proteins. It is also interesting to note that the nCconj attribute leads to the best test accuracy for both the activity and selectivity data sets. The descriptions of the selected categorical attributes are given in Tables 10 and 11.

Table 6 Learning Performance for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Activity Data Using Linear Model.
Table 7 Learning Performance for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Selectivity Data Using Linear Model.
Table 8 Performance Comparison for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Activity Data Using Two-degree Polynomial Model and Gaussian Models.
Table 9 Performance Comparison for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Selectivity Data Using Two-degree Polynomial Model and Gaussian Models.
Table 10 Descriptions for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Activity Data.
Table 11 Descriptions for the Selected Categorical Attributes in Cannabinoid Receptor Subtypes CB1 and CB2 Selectivity Data.

Leukemia gene data

The two leukemia gene data sets used are described in Yeoh et al. [32] and Golub et al. [33], respectively. We applied a linear classifier, an SVM with a two-degree polynomial kernel, and SVMs with Gaussian kernels to these two data sets.

Yeoh’s data [34] comprises gene expression data and two additional categorical attributes, Subtype and Protocol. Subtype indicates the specific genetic subtype of acute lymphoblastic leukemia (ALL), and Protocol denotes distinct therapies. The entire set contains 201 continuous complete remission (CCR) samples and 32 relapse cases (including 27 Heme relapses and 5 additional relapses). We randomly split the data into training and test sets of 174 and 59 samples, respectively. The original data contain 12627 attributes, almost two orders of magnitude more than the training set size; we therefore used the 58 preselected attributes provided in the original paper, plus the two additional categorical attributes, to predict prognosis. Tables 12, 13, and 14 show the experimental results using linear, two-degree polynomial and Gaussian kernels, respectively. The Subtype categorical attribute has a smaller estimated conditional entropy than Protocol and is thus selected to divide the problem. The learning performances of both the linear model and the SVM demonstrate that this is the right choice.

Table 12 Experimental Results of ALL Prognosis Prediction Using Preselected Attribute Sets and Linear Model.
Table 13 Experimental Results of ALL Prognosis Prediction Using Preselected Attribute Sets and Two-degree Polynomial Kernel.
Table 14 Experimental Results of ALL Prognosis Prediction Using Preselected Attribute Sets and Gaussian Kernel.

Golub’s data set [35] includes gene expression data and four categorical attributes: BM/PM, T/B-cell, FAB, and Gender. A random split was used to divide the whole data set into 54 training samples and 18 test samples. Correlation-based Feature Selection [36] was executed beforehand to reduce the attribute dimension from 7133 to 45. The 45 attributes include two categorical attributes, T/B-cell and FAB; FAB denotes one of the most commonly used classification schemata for acute myeloid leukemia (AML). BM/PM and Gender were removed during the feature selection process. The goal is to predict ALL or AML. From Tables 15, 16 and 17, we can see that both T/B-cell and FAB have very small conditional entropy values (possibly because this is an easy learning problem). The T/B-cell categorical attribute was selected to partition the problem.

Table 15 Experimental Results of ALL/AML Prediction Using Attributes Selected by CFS and Linear Model.
Table 16 Experimental Results of ALL/AML Prediction Using Attributes Selected by CFS and Two-degree Polynomial Kernel.
Table 17 Experimental Results of ALL/AML Prediction Using Attributes Selected by CFS and Gaussian Kernel.

Discussions and future work

For choosing a proper partition attribute, we could either select the one with the smallest conditional entropy, or the one with the highest training cross-validation accuracy among multiple candidates. The first strategy worked well for all the data sets: while it may not provide the best-performing partition, it always outperformed the baseline. The second strategy yielded the best answer in most cases, the glycogen synthase kinase-3β inhibitor data being one example, but it failed on the cannabinoid receptor subtypes CB1 and CB2 activity data.

In addition to simplifying the learning problem, the selected categorical attribute may provide an additional perspective for unveiling hidden biological information. For example, the attributes chosen from the cannabinoid receptor subtypes CB1 and CB2 data sets supply useful information for compound design.

Although the restructuring process organizes classifiers in a tree, it is fundamentally different from the splitting process of a standard decision tree: the conditional entropy in the proposed metric depends on a classifier family. In the future, we would like to extend the restructuring process to multiple layers using one or more attributes.

Conclusions

We propose a method of restructuring a supervised learning problem using a discrete or categorical attribute. Such attributes naturally divide the original problem into several non-overlapping sub-problems. With a proper choice of the attribute, the complexity of the learning task is reduced and the prediction performance is enhanced. Selecting a discrete or categorical attribute that maximally simplifies the learning task is a challenging problem: a naive approach requires an exhaustive search for the optimal learning model of each possible restructured problem, and hence is computationally prohibitive. We propose a metric that selects the categorical attribute based on the estimated expected conditional entropy with respect to random projections. The method applies to multi-class and non-linear problems. Experimental results demonstrate the good performance of the proposed approach on several data sets. Future work will develop methods and metrics that extend the approach to efficiently identify multiple categorical attributes for problem restructuring.