Heuristic algorithms for feature selection under Bayesian models with block-diagonal covariance structure
Abstract
Background
Many bioinformatics studies aim to identify markers, or features, that can be used to discriminate between distinct groups. In problems where strong individual markers are not available, or where interactions between gene products are of primary interest, it may be necessary to consider combinations of features as a marker family. To this end, recent work proposes a hierarchical Bayesian framework for feature selection that places a prior on the set of features we wish to select and on the label-conditioned feature distribution. While an analytical posterior under Gaussian models with block covariance structures is available, the optimal feature selection algorithm for this model remains intractable since it requires evaluating the posterior over the space of all possible covariance block structures and feature-block assignments. To address this computational barrier, in prior work we proposed a simple suboptimal algorithm, 2MNC-Robust, with robust performance across the space of block structures. Here, we present three new heuristic feature selection algorithms.
Results
The proposed algorithms outperform 2MNC-Robust and many other popular feature selection algorithms on synthetic data. In addition, enrichment analysis on real breast cancer, colon cancer, and Leukemia data indicates they also output many of the genes and pathways linked to the cancers under study.
Conclusions
Bayesian feature selection is a promising framework for small-sample high-dimensional data, in particular biomarker discovery applications. When applied to cancer data these algorithms outputted many genes already shown to be involved in cancer as well as potentially new biomarkers. Furthermore, one of the proposed algorithms, SPM, outputs blocks of heavily correlated genes, particularly useful for studying gene interactions and gene networks.
Keywords
Feature selection Bayesian learning Biomarker discovery Heuristic search algorithmsBackground
Many bioinformatics studies aim to identify predictive biomarkers that can be used to establish diagnosis or prognosis, or to predict a drug response [1, 2, 3]. This problem can often be framed as a feature selection task, where the goal is to identify a list of features (molecular biomarkers) that can discriminate between groups of interest based on high-dimensional data from microarray, RNA-seq, or other high-throughput technologies.
Initially, exploratory studies are often conducted on small samples to generate a shortlist of biomarker candidates before a large-sample validation study is performed [4]. However, such studies have too often been unsuccessful at producing reliable and reproducible biomarkers [5]. Biomarker discovery is inherently difficult, given the large number of features, highly complex interactions between genes and gene products, enormous variety of dysfunctions that can occur, and many sources of error in the data. As a result, feature selection algorithms are often implemented without much consideration of the particular demands of the problem. For instance, variants of t-test are perhaps the most widely implemented selection strategies in bioinformatics, but can only detect strong individual features, and fail to take correlations into account.
Given that molecular signaling is often inherently multivariate, there is a need for methods that can account for correlations and extract combinations of features as a marker family. Wrapper methods do this by ranking sets of features according to some objective function, usually the error of a classifier. However, methods based on classifier error are computationally expensive, and may not necessarily produce the best markers; indeed, strong features can be excluded if they are correlated with other strong features. Furthermore, analysis downstream from feature selection may include gene set enrichment analysis, where the hope is to identify known pathways or other biological mechanisms that contain a statistically significant number of genes in the reported gene set, or may involve the development of new pathways and gene networks. We are thus motivated to develop methods that not only select markers useful for discrimination, but select all relevant markers, even individually weak ones.
To address this, in prior work we proposed a hierarchical Bayesian framework for feature selection, labeling features as “good” or “bad”, where good features are those we wish to select, i.e., biomarkers. This framework places a prior on the set of good features and the underlying distribution parameters. Three Gaussian models have been considered. Under independent features, Optimal Bayesian Filtering reports a feature set of a given size with a maximal expected number of truly good features (CMNC-OBF) [6]. Assuming fully dependent good features and independent bad features, 2MNC-DGIB is a fast suboptimal method that ranks features by evaluating all sets of size 2 [7]. Finally, assuming good and bad features are separately dependent, 2MNC-Robust proposes an approximation of the posterior on good features and uses a ranking strategy similar to 2MNC-DGIB to select features [8].
While 2MNC-DGIB has outstanding performance when its assumptions are satisfied [7], it performs poorly when bad features are dependent [9]. On the other hand, CMNC-OBF and 2MNC-Robust have been shown to have robust performance across Bayesian models with block-diagonal covariances [9]. CMNC-OBF is extremely fast and enjoys particularly excellent performance when markers are individually strong with low correlations, but, like all filter methods, may miss weak features that are of interest due to high correlations with strong features [6, 9]. 2MNC-Robust is computationally very manageable and generally improves upon CMNC-OBF in the presence of correlations.
Although CMNC-OBF and 2MNC-Robust are robust to different block-diagonal covariance structures, they do not attempt to detect these underlying structures, and their assumptions and approximations constrain performance. Thus, in this work we propose three new feature selection algorithms that: (1) use an iterative strategy to update the approximate posterior used in 2MNC-Robust, (2) use a novel scoring function inspired by Bayes factors to improve overall rankings, and (3) attempt to actually detect the underlying block structure of the data. We show that these algorithms have comparable computation time to 2MNC-Robust, while outperforming 2MNC-Robust and many other popular feature selection algorithms on a synthetic Bayesian model assuming block-diagonal covariance matrices, and a synthetic microarray data model. Finally, we apply the proposed algorithms and CMNC-OBF to breast cancer, colon cancer, and AML datasets, and perform enrichment analysis on each to address validation.
Feature selection model
We review a hierarchical Bayesian model that serves as a reference for the approximate posterior developed in 2MNC-Robust [8, 9] and will be used in the algorithms we present in the next section.
Consider a binary feature selection problem with class labels y=0,1. Let F be the set of feature indices. Assume features are partitioned into blocks, where features in each block are dependent, but features in different blocks are independent. Assume each block is either good or bad. A good block has different class-conditioned distributions between the two classes, while a bad block has the same distribution in both classes. We denote a partitioning of F to good and bad blocks by P=(P_{ G },P_{ B }), and hereafter call it a feature partition, where P_{ G }={G_{1},⋯,G_{ u }} is the set of u good blocks and P_{ B }={B_{1},⋯,B_{ v }} is the set of v bad blocks. Furthermore, denote the set of all features in good blocks as good features, \(G=\cup _{i=1}^{u} G_{i}\), and denote all features in bad blocks as bad features, \(B=\cup _{j=1}^{v} B_{j}\). Denote the random feature partition by \(\bar {P}=(\bar {P}_{G},\bar {P}_{B})\), the random set of good features by \(\bar {G}\), and the random set of bad features by \(\bar {B}\).
We define \(\pi (P)=\mathrm {P}(\bar {P}=P)\) to be the prior distribution on \(\bar {P}\). Let P be fixed. Let θ^{ P } be the parameter describing the joint feature distribution of P. Since blocks are independent of each other we can write \(\theta ^{P}\,=\,\left [\!\theta ^{G_{1}}_{0},\cdots,\theta ^{G_{u}}_{0},\theta ^{G_{1}}_{1},\cdots,\theta ^{G_{u}}_{1},\theta ^{B_{1}},\cdots,\theta ^{B_{v}}\right ]\), where \(\theta ^{G_{i}}_{y}\) parametrizes class-y features in G_{ i }, and \(\phantom {\dot {i}\!}\theta ^{B_{j}}\) parametrizes features in B_{ j }. Assume \(\theta ^{G_{i}}_{y}\) and \(\phantom {\dot {i}\!}\theta ^{B_{j}}\)’s are independent given P, i.e., \(\pi \left (\theta ^{P}\right)=\prod _{i=1}^{u} \pi \left (\theta ^{G_{i}}_{0}\right) \pi \left (\theta ^{G_{i}}_{1}\right) \prod _{j=1}^{v} \pi \left (\theta ^{B_{j}}\right)\).
In addition, the marginal posterior of a feature set G is Open image in new window , and marginal posterior of a feature f is Open image in new window . Note Open image in new window is different than Open image in new window .
Gaussian model
Here we solve Eq. (1) for jointly Gaussian features. We assume for a block A, \(\theta ^{A}_{y}=\left [\mu ^{A}_{y}, \Sigma ^{A}_{y}\right ]\) and θ^{ A }=[μ^{ A },Σ^{ A }], where \(\mu ^{A}_{y}\) and μ^{ A } are the mean vectors, and \(\Sigma ^{A}_{y}\) and Σ^{ A } are the covariance matrices.
where for a matrix |.| denotes determinant. \(S^{A}_{y}, \kappa ^{A}_{y}, m^{A}_{y}\), and \(\nu ^{A}_{y}\) are hyperparameters, which are assumed given and fixed. \(S^{A}_{y}\) is an |A|×|A| matrix, where for a set |.| denotes cardinality. For a proper prior \(S^{A}_{y}\) is symmetric and positive-definite, and \(\kappa ^{A}_{y}>|A|-1\). If \(\kappa ^{A}_{y}>|A|+1\), then \(E\left (\Sigma ^{A}_{y}\right)=S^{A}_{y}/\left (\kappa ^{A}_{y}-|{A}|-1\right)\). Furthermore, \(m^{A}_{y}\) is an |A|×1 vector describing the average mean of features and for a proper prior we need \(\nu ^{A}_{y}>0\). \(K^{A}_{y}\) and \(L^{A}_{y}\) represent the relative weights of each distribution. For a proper distribution we have \(K^{A}_{y}=|S^{A}_{y}|^{0.5 \kappa ^{A}_{y}}2^{-0.5 \kappa ^{A}_{y} |{A}|} / \Gamma _{|{A}|}\left (0.5 \kappa ^{A}_{y}\right)\) and \(L^{A}_{y}=\left (2\pi /\nu ^{A}_{y}\right)^{-0.5|{A}|}\), where Γ_{ d } denotes the multivariate gamma function.
Methods
Here we describe the set selection methods used. Note we aim to find the set of true good features, rather than the true underlying feature partition. The Maximum Number Correct (MNC) criterion [7] outputs the set maximizing the expected number of correctly labeled features and the Constrained MNC (CMNC) criterion outputs the set with maximum expected number of correctly labeled features constrained to having exactly D selected features, where D is a parameter of the optimization problem [9]. The solution of MNC is {f∈F:π^{∗}(f)>0.5} [7] and the solution of CMNC is picking the top D features with largest π^{∗}(f) [9]. Therefore, both MNC and CMNC require computing π^{∗}(f) for all f∈F, which is not computationally feasible for an arbitrary block structure unless |F| is very small. We review two previously proposed algorithms, OBF and 2MNC-Robust, and then present three new algorithms.
Optimal Bayesian filter
Optimal Bayesian Filter (OBF) assumes all blocks have size one, i.e., all features are independent, and assumes the events \(\{f \in \bar {G}\}\) are independent a priori. In this case π^{∗}(f) can be found in closed form with little computation cost [6, 9]. OBF is optimal under its modeling assumptions. As argued in [9], in the presence of correlation OBF is a robust suboptimal algorithm that can detect individually strong good features, i.e., those whose mean and/or variance is very different between the two classes, but cannot take advantage of correlations to correctly label individually weak good features, those whose mean and variance are similar in both classes.
2MNC-Robust
The 2MNC algorithm [7] suggests approximating π^{∗}(f) using π^{∗}(G) for all sets G such that |G|=2, and picking the top D features. Since finding π^{∗}(G) for all feature partitions where |∪P_{ G }|=2 is typically infeasible, an approximate posterior, \(\tilde {\pi }^{*}(G)\), is proposed [8], where for all G⊆F of size 2,
2MNC-Robust is implementing 2MNC with \(\tilde {\pi }^{*}(f)\). As mentioned before, 2MNC-Robust does not tune itself to the underlying block structure of data.
Recursive marginal posterior inflation
It is easy to show that \(\sum _{f\in F} \tilde {\pi }^{*}(f)=2\) when only sets of size 2 are used to find \(\tilde {\pi }^{*}(f)\). Hence, under MNC criterion one would at most pick 4 good features, implying we underestimate π^{∗}(f) by only using sets of size 2 when \(|\bar {G}|>>2\). REcursive MArginal posterior INflation (REMAIN) aims to sequentially detect good features by rescaling \(\tilde {\pi }^{*}(f)=\sum _{G:f\in G,|G|=2} \tilde {\pi }^{*}(G)\). We initialize REMAIN with the set of all features, F_{ r }=F. Then, REMAIN uses the MArginal posterior INflation (MAIN) algorithm to identify several features as good, removes them from F_{ r }, and feeds MAIN with the truncated F_{ r } to select additional features. This process iterates until MAIN does not output any features. REMAIN is nothing but repetitive calls to MAIN with shrinking feature sets, making MAIN the heart of this algorithm.
Pseudo-code of MAIN is provided in Algorithm 1, where H(G) is the right hand side of Eq. (2). Inputted with a feature set F_{ t }, MAIN finds \(\tilde {\pi }^{*}(f)\) using sets of size 2, and finds the set \(G_{s}=\{f \in F_{t} : \tilde {\pi }^{*}(f)>T_{1} \}\). MAIN adds G_{ s } to \(\tilde {G}\), the set of features in F_{ t } already labeled as good. It then updates F_{ t } to F_{ t }∖G_{ s }, and rescales \(\tilde {\pi }^{*}(f)\) of features f∈F_{ t } so that \(\sum _{f \in F_{t}} \tilde {\pi }^{*}(f)=2\). Note features in \(\tilde {G}\) are used to compute \(\tilde {\pi }^{*}(f)\) for features f∈F_{ t }, but \(\tilde {\pi }^{*}(f)\) of features \(f \in \tilde {G}\) are not used in the scaling of \(\sum _{f \in F_{t}} \tilde {\pi }^{*}(f)=2\). MAIN iterates until \(\tilde {G}=\phi \), or H(G)≤T_{2} for all G⊆F_{ t } with |G|=2.
Not finding new features in MAIN might be due to the remaining good features being weaker and independent of \(\tilde {G}\). Hence, REMAIN removes \(\tilde {G}\) from F_{ r }, and feeds MAIN with the updated F_{ r }. This way, features in \(\tilde {G}\) are not used to compute \(\tilde {\pi }^{*}(f)\) anymore for any feature f∈F_{ r }, thus making it easier to detect weaker good features that are independent of features already selected by REMAIN. Pseudo-code of REMIAN is provided in Algorithm 2.
T_{1} mimics the role of the threshold used in the MNC criterion. Hence, T_{1}∈[ 0, 1]. Recall that by evaluating sets of size 2 we underestimate \(\tilde {\pi }^{*}(f)\) when \(|\bar {G}|>>2\). Therefore, when confident \(|\bar {G}|>>2\), one might opt for smaller values for T_{1} rather than values close to 1. As T_{2} is a threshold over un-normalized posteriors, H(G), extra care must be taken when setting T_{2}. We suggest T_{2}=n for high-dimensional feature selection applications, which is a good rule of thumb based on our simulation results and asymptotic analysis of H(.).
Performance of REMAIN for various values of T_{1}
Objective | n | T_{1}=0.3 | T_{1}=0.2 | T_{1}=0.1 | T_{1}=0.05 | T_{1}=0.01 | T_{1}=0.005 |
---|---|---|---|---|---|---|---|
Marker found | 20 | 0.9 | 1.4 | 2.5 | 4.6 | 15.3 | 26.2 |
Non-marker found | 20 | 2.0 | 3.7 | 9.4 | 23.3 | 164.5 | 403.0 |
Marker found | 100 | 52.5 | 56.2 | 58.2 | 60.0 | 68.9 | 76.1 |
Non-marker found | 100 | 1.6 | 3.3 | 8.8 | 20.2 | 124.7 | 288.5 |
Posterior factor
We propose approximate POsterior FActor-Constrained (POFAC) algorithm as follows: Use \(\tilde {\beta }(\,f)\) to rank features, and pick the top D features. Note that D is a parameter of the algorithm.
Sequential partition mustering
Sequential Partition Mustering (SPM) aims to improve feature selection performance by sequentially detecting good blocks, and adding them to the set of previously selected features. To find a good block, we start with the most significant feature, i.e., the feature with largest \(\tilde {\beta }\), and find the block containing it. Note we do not aim to find the structure of bad blocks, and as soon as we declare no more good features remain the algorithm terminates.
Pseudo-code of SPM is explained in Algorithm 4. Let F_{ t } be the feature set used by SPM initialized to F_{ t }=F. We start with the most significant feature u_{0} and find the block containing it, U. We then update F_{ t } to F_{ t }∖U. If \(\tilde {\beta }(f)<T_{4}\) for all f∈F_{ t }, then SPM declares F_{ t } does not contain any good features and terminates; otherwise, it picks the most significant feature of F_{ t } and iterates. Similar to REMAIN, SPM cannot be forced to output a fixed number of features, but T_{4} can be used to tune SPM to output close to a desired number of features. In addition, t_{1} and t_{2} can be used to avoid picking large blocks, which also affects the output feature set size.
Simulations
We compare the performance of proposed algorithms with many popular feature selection algorithms over a Bayesian setting, and a synthetic microarray model introduced in [13] and extended in [8, 9].
Bayesian simulation
In this simulation we assume |F|=4100 and \(|\bar {G}|=100\). We assume there is 1 good block for each of the following sizes: 10, 20, 30, and 40. We also assume there are 20 bad blocks for each of the following sizes: 5, 10, 15, 20, 50, and 100. We first randomly assign each feature to a block such that the assumed block structure is satisfied, effectively constructing \(\bar {P}\). Afterwards, distribution parameters are randomly drawn from the following NIW prior. For each good block, A, we have \(S^{A}_{0}=S^{A}_{1}=0.5 \times I_{|A| \times |A|}, \kappa ^{A}_{0}=\kappa ^{A}_{1}=|A|+2, m^{A}_{0}=m^{A}_{1}=0\), and \(\nu ^{A}_{0}=\nu ^{A}_{1}=4\), where I is the identity matrix. Also, for a bad block, A, we have S^{ A }=0.5×I_{|A|×|A|},κ^{ A }=|A|+2,m^{ A }=0, and ν^{ A }=4. Given distribution parameters, a stratified sample of size n with equal points in each class is drawn. The following feature selection methods declare the set of good features: t-test, Bhatacharrya Distance (BD), Mutual Information (MI) using the non-parameter method of [14] with spacing parameter m=1, Sequential Forward Search using the bolstered error estimate [15] of Regularized Linear Discriminant Analysis applied to the top 300 features of BD (SFS-RLDA), FOrward selection using Hilbert-Schmidt Independence Criterion (FOHSIC) [16] applied to the top 300 features of BD, CMNC-OBF, 2MNC-Robust, REMAIN, POFAC, and SPM. Note t-test, MI, BD, and CMNC-OBF are filter methods. All methods except REMAIN and SPM output \(|\bar {G}|\) features. CMNC-OBF assumes the events \(\{f \in \bar {G} \}\) are independent and \(P(f \in \bar {G})\) is constant for all f∈F. 2MNC-Robust and REMAIN assume \(\tilde {\pi }(G)\) is uniform over all sets of size 2, and zero otherwise. POFAC assumes \(\tilde {\pi }(G)\) is uniform over all sets of size 1 and 2. Finally, SPM assumes \(\tilde {\pi }(G_{1},G_{2})=0.5\) for all sets G_{1},G_{2}⊆F, and uses the same \(\tilde {\pi }(G)\) of POFAC to compute \(\tilde {\beta }(f)\). Bayesian algorithms use proper priors with hyperparameters of the same form given previously (PP), and Jeffreys non-informative prior (JP), where for each set, A, \(S^{A}_{y}\) and S^{ A } are zero matrices, \(K^{A}_{y}=K^{A}=L^{A}_{y}=L^{A}=1\), and \(\kappa ^{A}_{y}=\kappa ^{A}=\nu ^{A}_{y}=\nu ^{A}=0\). With \(\nu ^{A}_{y}=\nu ^{A}=0\) we do not need to specify \(m^{A}_{y}\) and m^{ A }. We use T_{1}=0.3 and T_{2}=n for REMAIN using both PP and JP. For SPM-PP we set t_{1}=100, t_{2}=0.5, and T_{4}=100n^{2}, which resulted in adequate performance among all sample sizes. When using SPM-JP we use the same t_{1} and T_{4}, but set t_{2}=1 to avoid picking large blocks. This process iterates 1000 times.
Run-time comparison of Bayesian simulation
Alg. | Filter | 2MNC-Robust | REMAIN | POFAC | SPM | SFS-RLDA | FOHSIC |
Time | < 10^{−3} | 1 | 2 | 1.05 | 1.5 | 10 | 15 |
Synthetic microarray simulations
Here an extended version of a synthetic model developed to mimic microarrays is used to generate data. The original model is introduced in [13], and has been extended in [8, 9]. In these models features are markers or non-markers. Markers are either global or heterogeneous. Global markers (GM) are homogeneous within each class. Heterogeneous Markers (HM) compromise c subclasses, where for each specific set of heterogeneous markers, a specific subset of the training sample has a different distribution than markers in class 0, and the remaining sample points have the same distribution as class 0. Markers comprise blocks of size k, where each block in class y is Gaussian with mean μ_{ y } and covariance σ_{ y }Σ_{ y }. Diagonal elements of Σ_{ y } are 1 and non-diagonal elements are ρ_{ y }. The original model of [13] forced ρ_{0}=ρ_{1}. We also have μ_{0} = [ 0,⋯,0]. There are three types of markers according to their mean in class 1: redundant, synergetic, and marginal, with μ_{1} being [ 1,⋯,1],[ 1,1/2,⋯,1/k], and [ 1,0,⋯,0], respectively. Non-markers are either Low Variance (LV) or High Variance (HV). In the original model LV non-markers are independent, each with a Gaussian distribution, N(0,σ_{0}). However, in the extended model of [8, 9], similar to markers in class 0, LV non-markers comprise blocks of size k, where in each block features are jointly Gaussian with mean μ_{0} and covariance σ_{0}Σ_{0}. HV non-markers are independent with marginal distribution pN(0,σ_{0})+(1−p)N(1,σ_{1}), where p is drawn from the uniform distribution over [ 0,1].
For small sample sizes, or cases where correctly labeling good features is more difficult, REMAIN tends to output very few features resulting in very good performance, in contrast to methods that are forced to output \(|\bar {G}|=100\) features. OBF can be implemented with the MNC objective instead of CMNC to enjoy this characteristic of REMAIN. SPM seems to have the most diverse behavior. While it performs inferior to all feature selection algorithms when sample size is very small, it tends to outperform all other methods for larger sample sizes. In order for the quantities used in SPM to be well-defined under JP, sample size must be larger than the block size. Hence, under small samples SPM with JP tends to break good blocks into smaller blocks, thereby losing some of its ability to identify weak good features with strong dependencies, and making it more prone to detecting blocks incorrectly. Also note that we have used the same parameters for SPM across all data models and sample sizes, and performance is expected to improve if t_{1},t_{2} and T_{4} are calibrated each time it is run.
POFAC is an interesting option, enjoying competitive performance across all sample sizes. It outperforms 2MNC-Robust while its computation cost is only slightly larger. CMNC-OBF tends to select individually strong markers, i.e., markers with class 1 mean far from 0. CMNC-OBF performs very similar to BD in this simulation, with their performance graphs almost overlapping.
While correctly labeling more features tends to result in lower classification error, maximizing the average number of correctly labeled features does not necessarily minimize classification error [see Additional file 1]. An example can be seen in the Supplementary, where we examine the prediction error of several popular classifiers with feature selection on the synthetic microarray model [see Additional file 1].
Results
We apply CMNC-OBF, POFAC, REMAIN, and SPM with the same priors used for synthetic microarray simulations to cancer microarray datasets, select the top genes, and perform enrichment analysis. We list the top 5 genes selected by CMNC-OBF, POFAC, and REMAIN. The top 100 genes are provided in the supplementary [see Additional file 1]. REMAIN ranks genes as follows. In each call to MAIN, we rank genes of \(\tilde {G}\) by the order they are added to \(\tilde {G}\), and if several genes are added at once in step 5 of MAIN, they are ranked based on \(\tilde {\pi }^{*}(f)\). In addition, \(\tilde {G}\)’s are ranked by the order they are obtained using consecutive calls to the MAIN subroutine. Note SPM outputs a set of feature blocks, not a feature ranking. Studying blocks of SPM might provide invaluable information about the underlying biological mechanisms of the disease under study, but we leave this for future work.
We perform enrichment analysis using PANTHER [17, 18]. The top 20 enriched pathways are reported in the supplementary [see Additional file 1]. We list their names, number of known genes in each pathway, number of selected genes that belong to the pathway, and the corresponding p-value. Here we only list the top 3 pathways and their p-values. We study if among the top genes and pathways any are already suggested to be involved in the cancer under study. The complete analysis, with references that suggest involvement of the top reported genes and pathways involved in cancer, is provided in the supplementary [see Additional file 1]. Here, we only report the conclusions made in the supplementary based on our literature review [see Additional file 1].
CMNC-OBF tends to find individually strong genes, which are typically those already known to be involved in cancer. Hence, CMNC-OBF tends to give the best enrichment analysis results, but it might not be the best option to find biomarkers that are individually weak, but heavily correlated to strong biomarkers. POFAC and REMAIN tend to find genes that are individually strong or highly correlated to individually strong biomarkers. Hence, they might be very useful for many practical applications, particularly for those where it is desired to target genes directly involved in cancer, or genes directly interacting with them. SPM is specifically designed to find all genes correlated to individually strong biomarkers. Hence, it tends to report large gene sets. Thereby, this algorithm is particularly useful for identifying and hypothesizing which biological functions are affected in the cancer under study.
POFAC and CMNC-OBF require the user to specify the number of genes to select, which we fix to 2000 so that a reasonable number of genes are identified by the pathway enrichment analysis database. On the other hand, REMAIN and SPM cannot take a predetermined number of genes to select. We adjust their thresholds for each dataset so that a reasonable number of genes are selected. We fix T_{2}=n, and tune T_{1},t_{1},t_{2}, and T_{4}.
The following process is used on each dataset. We first remove probes that are not mapped to any genes. We then use OBF and POFAC to rank probes, and use REMAIN to select a subset of probes. If multiple probes are mapped to the same genes, only the probe with the highest rank is retained. This gives the selected genes of REMAIN, and final gene rankings of OBF and POFAC. D=2000 is used to obtain gene sets of CMNC-OBF and POFAC. SPM uses the gene ranking obtained by POFAC with the corresponding \(\tilde {\beta }(.)\), where among probes mapped to the same genes only the probe ranking highest is retained. Running all algorithms, using MATLAB2015b, on a server with 4 XEON E5-4650L processors and 512GB of RAM took about 20 minutes for the breast cancer dataset, and about 70 minutes for each of the colon cancer and AML datasets. For all datasets REMAIN and SPM took about 55% and 25% of the total run-time, respectively.
Breast cancer
Top genes of breast cancer
Rank | CMNC-OBF | REMAIN | POFAC |
---|---|---|---|
1 | DCT | DCT | DCT |
2 | PHTF1 | ZNF192 | PHTF1 |
3 | ZNF227 | ZP2 | MUC5AC |
4 | ZP2 | PCSK6 | HUWE1 |
5 | CEACAM7 | CEACAM7 | MLANA |
Top pathways of breast cancer
Algorithm | Pathway | P-value |
---|---|---|
CMNC-OBF | Gonadotropin-releasing hormone receptor pathway | 4.93E−05 |
p53 pathway | 8.34E−05 | |
Ubiquitin proteasome pathway | 1.26E−04 | |
REMAIN | Ubiquitin proteasome pathway | 3.33E−07 |
Angiogenesis | 3.40E−04 | |
FAS signaling pathway | 7.23E−04 | |
POFAC | CCKR signaling map | 3.82E−05 |
p53 pathway | 2.61E−04 | |
Ubiquitin proteasome pathway | 4.18E−04 | |
SPM | Ubiquitin proteasome pathway | 6.25E−07 |
Integrin signalling pathway | 1.72E−06 | |
Pyrimidine Metabolism | 2.87E−05 |
Due to different properties of these algorithms, different types of biomarkers they tend to pick, and our limited knowledge of cancer pathways, it is natural to obtain different gene sets, p-values, and pathway rankings. However, there is reasonable consistency between the enrichment analysis results. For instance, the ubiquitin proteasome pathway, which is in the top 3 pathways of all algorithms, is shown to be involved in breast cancer. Many of the top 20 pathways are in common between at least 3 algorithms and are shown to be involved in breast cancer. For instance, the gonadotropin-releasing hormone receptor pathway, FAS signaling pathway, P53 pathway, CCKR signaling map, de novo purine biosynthesis, TCA cycle, Cytoskeletal regulation by Rho GTPase, and cell cycle are involved in breast cancer. In addition, many of the top 20 genes are in common between algorithms ranking features. For instance, PHTF1, MUC5AC, ZNF192, PCSK6, and HDGFRP3 are shown to be involved in breast cancer, and some common genes such as DCT, ZP2, and CEACAM7 might be involved in breast cancer.
Colon cancer
Top genes of colon cancer
Rank | CMNC-OBF | REMAIN | POFAC |
---|---|---|---|
1 | CPNE4 | EPHA7 | EPHA7 |
2 | GAGE1,12,4,5,6,7 | NBLA00301 | CPNE4 |
3 | GAGE1,12,2,4,5,6,7,8 | LOC100133920 LOC286297 | LOC100133920 LOC286297 |
4 | S100A7 | PDK4 | SCN7A |
5 | EPHA7 | MYH11 | NBLA00301 |
Top pathways of colon cancer
Algorithm | Pathway | P-value |
---|---|---|
CMNC-OBF | Cadherin signaling pathway | 1.83E−20 |
Wnt signaling pathway | 7.25E−14 | |
Plasminogen activating cascade | 7.67E−05 | |
REMAIN | Cadherin signaling pathway | 2.21E−10 |
Plasminogen activating cascade | 7.95E−05 | |
Wnt signaling pathway | 9.27E−05 | |
POFAC | Ionotropic glutamate receptor pathway | 1.72E−05 |
Metabotropic glutamate receptor group III pathway | 1.04E−02 | |
Nicotinic acetylcholine receptor signaling pathway | 1.70E−02 | |
SPM | Ionotropic glutamate receptor pathway | 2.98E−03 |
Heterotri. G-prot. sig. P.W., Gi alpha & Gs alpha med. P.W. | 4.73E−03 | |
Axon guidance mediated by Slit/Robo | 5.79E−03 |
In the supplementary (Additional file 1) we show: (1) CPNE4, EPHA7, and LOC286297, which are among the top 20 genes of all three algorithms that rank genes, are shown to be involved in colon cancer, (2) many of the top 20 genes in common between two of the gene ranking algorithms, such as the GAGE genes, RYR3, PDK4, and MYH11, are suggested to be involved in colon cancer, and (3) among the common top 20 enriched pathways, the plasminogen activating cascade, blood coagulation, and the beta1 adrenergic receptor signaling pathway are suggested to be involved in colon cancer.
Acute myeloid leukemia
Data obtained in [23, 24, 25] is deposited on GEO with accession number GSE13204, containing gene expression levels of 2096 points. 74 points belong to healthy people, 542 points belong to Acute Myeloid Leukemia (AML) patients, and the remaining points are other subtypes of leukemia. Healthy points comprise class 0 and AML patients comprise class 1. The data is already pre-processed, including a summarization and quantile normalization step. Feature selection algorithms pick the top genes, and enrichment analysis is performed using PANTHER. Here we implement REMAIN with T_{1}=0.05 to obtain 957 genes, and SPM with T_{4}=10^{7}n^{10}, t_{1}=10^{6}, and t_{2}=4, to obtain 522 blocks containing 5172 genes. Although the thresholds of SPM are chosen to be very large, we still pick very many genes. This might imply that many of the genes involved in AML might be individually weak, but highly correlated.
Top genes of AML
Rank | CMNC-OBF | REMAIN | POFAC |
---|---|---|---|
1 | ORM1 ORM2 | LTF | S100A12 |
2 | LTF | CRISP3 | S100A9 |
3 | CRISP3 | ORM1 ORM2 | ORM1 ORM2 |
4 | CHIT1 | CHIT1 | CRISP3 |
5 | DNAH10 | DNAH10 | LTF |
Top pathways of AML
Algorithm | Pathway | P-value |
---|---|---|
CMNC-OBF | Heme biosynthesis | 2.78E−04 |
Pentose phosphate pathway | 7.34E−03 | |
De novo purine biosynthesis | 8.40E−03 | |
REMAIN | Interferon-gamma signaling pathway | 5.86E−04 |
Alzheimer disease-presenilin pathway | 6.11E−03 | |
Inflammation med. by chemokine & cytokine sig. P.W. | 6.22E−03 | |
POFAC | Pentose phosphate pathway | 1.15E−03 |
Formyltetrahydroformate biosynthesis | 1.15E−03 | |
Heme biosynthesis | 1.77E−03 | |
SPM | Ubiquitin proteasome pathway | 1.72E−08 |
T cell activation | 2.07E−05 | |
Inflammation med. by chemokine & cytokine sig. P.W. | 3.00E−05 |
Studying the top 20 genes and pathways in the supplementary (Additional file 1) we see that ORM1, ORM2, LTF, CAMP, LCN2, MMP9, CYP4F3, WT1, and CRISP3 are among the top 20 genes of all gene ranking algorithms, and are shown or suggested to be involved in AML. Among the top pathways in common between all methods, the interferon signaling pathway, and the inflammation mediated by chemokine and cytokine signaling pathway are involved in AML. Many of the top pathways picked by at least 3 methods, such as heme biosynthesis, denovo purine biosynthesis, and T-cell activation are also suggested to be involved in AML.
Discussion
Here we proposed several suboptimal feature selection algorithms outperforming many popular algorithms. However, the ability to correctly detect weaker biomarkers via these suboptimal methods comes at the expense of less intuitive objective functions compared with the optimal solutions. Although the proposed algorithms are more computationally intensive than OBF and 2MNC-Robust, they are still much faster than many popular feature selection algorithms.
While the previously introduced OBF is suitable to find individually strong biomarkers, REMAIN and POFAC find individually strong biomarkers as well as individually weak biomarkers heavily correlated to strong biomarkers, which is useful for many practical applications. For instance, in drug development it might be desirable to target genes directly involved in biological mechanisms of the cancer under study, or target genes strongly correlated to them to indirectly control the behavior of genes directly involved in cancer.
When sample size is small or correlations are not very strong one could use REMAIN to find a small set of high-profile biomarkers. REMAIN cannot be forced to output a predetermined number of features, but for a given fixed dataset its parameters can be tuned to output close to a desired number of features. Note the output of REMAIN is a feature ranking that greatly depends on T_{1}. The larger T_{1} is, the smaller is the output feature ranking. However, the user would have more confidence that declared features are biomarkers. While π^{∗}(f) is a very intuitive quantity to evaluate the quality of a feature, \(\tilde {\pi }^{*}(f)\) obtained by finding \(\tilde {\pi }^{*}(G)\) for sets of size 2 is not as easy to work with.
On the other hand, if sample size is large or correlations are strong, POFAC is a suitable option. POFAC provides a feature ranking based on \(\tilde {\beta }(f)\) and the user specifies how many features to select, similar to CMNC-OBF and 2MNC-Robust. However, \(\tilde {\beta }(f)\) is not as intuitive as π^{∗}(f).
SPM outputs a family of good blocks, which is very useful in studying the interactions between biomarkers, and is very useful to hypothesize about biological mechanisms that are involved in the disease under study. As SPM is designed to pick all biomarkers correlated to strong ones, it typically reports larger feature sets. SPM is very desirable when correlations are large; however, it should only be used when sample size is relatively large. Finally, parameters of SPM can be used to adjust the trade-off between the size of detected blocks, and the minimum desired dependence between biomarkers. However, one cannot intuitively determine what values should be used to achieve a certain point in the trade-off. Trial-and-error can be used on a fixed sample to find the desired parameters, as we did in this paper. One of the main reasons it is not easy to predetermine SPM parameters is their dependence on sample size and the underlying distribution parameters.
Conclusion
The proposed Bayesian framework is indeed promising for biomarker discovery applications. The objective of finding the posterior probability that a feature set is the set of good features does not suffer many of the drawbacks of heuristics used in biomarker discovery. For instance, while t-test only captures differences in the means, OBF can capture differences in variances, 2MNC-Robust, REMAIN, and POFAC take pairwise dependencies into account, and SPM looks at the joint distribution of good blocks. While the proposed suboptimal methods can efficiently take advantage of dependencies to find the set of good features, many heuristics proposed to consider dependencies do not perform well under small samples [13], which is observed in our simulations as well.
While the optimal solution under the general block model is computationally infeasible, the success of proposed suboptimal algorithms shows the Bayesian framework can serve as a foundation to model biomarker discovery problems and develop efficient suboptimal methods. Future work includes studying the properties of the proposed algorithms, for instance their asymptotic properties, further analyzing outputs of the proposed algorithms on real datasets, and exploring the specific applications suitable for each algorithm in greater detail. In addition, prior construction may be used to design prior distributions for SPM to boost its performance under small samples. Finally, while here we have studied SPM’s ability to select all relevant features, it actually outputs blocks of features that appear highly correlated and differentially expressed. In future work we will examine SPM as a block detection algorithm, which has important applications in gene network modeling.
Notes
Acknowledgements
Not applicable.
Funding
The publication costs of this article was funded by the National Science Foundation (CCF-1422631 and CCF-1453563).
Availability of data and materials
The breast cancer, colon cancer, and Leukemia datasets used in this paper are publicly available on Gene Expression Omnibus (GEO) [20] with accession numbers, GSE1456, GSE17538, and GSE13204, respectively. A MATLAB implementation of algorithms is available on Github.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 3, 2018: Selected original research articles from the Fourth International Workshop on Computational Network Biology: Modeling, Analysis, and Control (CNB-MAC 2017): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-3.
Authors’ contributions
AF proposed the main idea and worked on the simulations and manuscript. LAD contributed to the formulation of main idea, and revised the manuscript. Both authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
References
- 1.Ramachandran N, Srivastava S, LaBaer J. Applications of protein microarrays for biomarker discovery. Proteomics Clin Appl. 2008; 2(10-11):1444–59.CrossRefPubMedPubMedCentralGoogle Scholar
- 2.Rifai N, Gillette MA, Carr SA. Protein biomarker discovery and validation: The long and uncertain path to clinical utility. Nat Biotechnol. 2006; 24(8):971–83.CrossRefPubMedGoogle Scholar
- 3.Ilyin SE, Belkowski SM, Plata-Salamán CR. Biomarker discovery and validation: Technologies and integrative approaches. Trends Biotechnol. 2004; 22(8):411–6.CrossRefPubMedGoogle Scholar
- 4.Feng Z, Prentice R, Srivastava S. Research issues and strategies for genomic and proteomic biomarker discovery and validation: A statistical perspective. Pharmacogenomics. 2004; 5(6):709–19.CrossRefPubMedGoogle Scholar
- 5.Diamandis EP. Cancer biomarkers: Can we turn recent failures into success?J Natl Cancer Inst. 2010; 102(19):1462–7.CrossRefPubMedPubMedCentralGoogle Scholar
- 6.Foroughi pour A, Dalton LA. Optimal Bayesian feature filtering. In: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics. Atlanta: ACM: 2015. p. 651–2.Google Scholar
- 7.Foroughi pour A, Dalton LA. Optimal Bayesian feature selection on high dimensional gene expression data. In: Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP). Atlanta: IEEE: 2014. p. 1402–5.Google Scholar
- 8.Foroughi pour A, Dalton LA. Multiple sclerosis biomarker discovery via Bayesian feature selection. In: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Seattle: ACM: 2016. p. 540–1.Google Scholar
- 9.Foroughi pour A, Dalton LA. Robust feature selection for block covariance bayesian models. In: Proceedigns of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New Orleans: IEEE: 2017. p. 2696–700.Google Scholar
- 10.Dalton LA. Optimal Bayesian feature selection. In: Proceedings of 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP). Austin: IEEE: 2013. p. 65–8.Google Scholar
- 11.Murphy KP. Conjugate Bayesian analysis of the Gaussian distribution. Canada: Technical report, University of British Columbia; 2007.Google Scholar
- 12.Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Ann Intern Med. 1999; 130(12):1005–13.CrossRefPubMedGoogle Scholar
- 13.Hua J, Tembe WD, Dougherty ER. Performance of feature-selection methods in the classification of high-dimension data. Pattern Recog. 2009; 42(3):409–24.CrossRefGoogle Scholar
- 14.Beirlant J, Dudewicz EJ, Györfi L, Van der Meulen EC. Nonparametric entropy estimation: An overview. Int J Math Stat Sci. 1997; 6(1):17–39.Google Scholar
- 15.Braga-Neto U, Dougherty ER. Bolstered error estimation. Pattern Recognit. 2004; 37(6):1267–81.CrossRefGoogle Scholar
- 16.Song L, Smola A, Gretton A, Borgwardt KM, Bedo J. Supervised feature selection via dependence estimation. In: Proceedings of the 24th International Conference on Machine Learning. 2007. p. 823–30.Google Scholar
- 17.Mi H, Thomas P. PANTHER pathway: an ontology-based pathway database coupled with data analysis tools In: Nikolsky Y, Bryant J, editors. Protein Networks and Pathway Analysis. Totowa: Humana Press: 2009. p. 123–40. https://doi.org/10.1007/978-1-60761-175-2_7.Google Scholar
- 18.Mi H, Huang X, Muruganujan A, Tang H, Mills C, Kang D, Thomas PD. Panther version 11: Expanded annotation data from gene ontology and reactome pathways, and data analysis tool enhancements. Nucleic Acids Res. 2017; 45(1):183–9.CrossRefGoogle Scholar
- 19.Pawitan Y, Bjöhle J, Amler L, Borg A-L, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw PM, Smeds J, Skoog L, Wedrén S, Bergh J. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 2005; 7(6):953–64.CrossRefGoogle Scholar
- 20.Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002; 30(1):207–10.CrossRefPubMedPubMedCentralGoogle Scholar
- 21.Smith JJ, Deane NG, Wu F, Merchant NB, Zhang B, Jiang A, Lu P, Johnson JC, Schmidt C, Bailey CE, Eschrich S, Kis C, Levy S, Washington MK, Heslin MJ, Coffey RJ, Yeatman TJ, Shyr Y, Beauchamp RD. Experimentally derived metastasis gene expression profile predicts recurrence and death in patients with colon cancer. Gastroenterology. 2010; 138(3):958–68.CrossRefPubMedGoogle Scholar
- 22.Freeman TJ, Smith JJ, Chen X, Washington MK, Roland JT, Means AL, Eschrich SA, Yeatman TJ, Deane NG, Beauchamp RD. Smad4-mediated signaling inhibits intestinal neoplasia by inhibiting expression of β-catenin. Gastroenterology. 2012; 142(3):562–71.CrossRefPubMedGoogle Scholar
- 23.Kohlmann A, Kipps TJ, Rassenti LZ, Downing JR, Shurtleff SA, Mills KI, Gilkes A, Hofmann W-K, Basso G, Dell’Orto MC, Foà R, Chiaretti S, De Vos J, Rauhut S, Papenhausen PR, Hernández JM, Lumbreras E, Yeoh AE, Koay ES, Li R, Liu W-m, Williams PM, Wieczorek L, Haferlach T. An international standardization programme towards the application of gene expression profiling in routine leukaemia diagnostics: the microarray innovations in leukemia study prephase. Br J Haematol. 2008; 142(5):802–7.CrossRefPubMedPubMedCentralGoogle Scholar
- 24.Haferlach T, Kohlmann A, Wieczorek L, Basso G, Kronnie GT, Béné M-C, Vos JD, Hernández JM, Hofmann W-K, Mills KI, Gilkes A, Chiaretti S, Shurtleff SA, Kipps TJ, Rassenti LZ, Yeoh AE, Papenhausen PR, Liu W-M, Williams PM, Foà R. Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: Report from the international microarray innovations in leukemia study group. J Clin Oncol. 2010; 28(15):2529–37.CrossRefPubMedPubMedCentralGoogle Scholar
- 25.Kühnl A, Gökbuget N, Stroux A, Burmeister T, Neumann M, Heesch S, Haferlach T, Hoelzer D, Hofmann W-K, Thiel E, Baldus CD. High BAALC expression predicts chemoresistance in adult B-precursor acute lymphoblastic leukemia. Blood. 2010; 115(18):3737–44.CrossRefPubMedGoogle Scholar
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.