Background

Many of the contemporary databases used in data classification research [1-10] use a considerably large number of data points to represent an object sample. The high dimensional feature vectors that result from these samples often contain intra-class natural variability reflected as noise and irrelevant information [11, 12]. Noise in feature vectors occurs due to inaccurate feature measurements, whereas the irrelevancy of a feature depends on the natural variability and redundancy within the feature vector. Further, the relevance of a feature is application dependent. For example, consider a hypothetical image consisting of image regions that correspond to faces and some other objects. When this image is used in a face recognition application, the relevant pixels are those in the face regions, while the pixels in the remaining regions are irrelevant. In addition, the face regions themselves can contain irrelevant information due to intra-class variability such as occlusions, facial expressions, illumination changes, and pose changes. Natural variability in high dimensional data significantly lowers the performance of all pattern recognition methods. To improve the recognition performance of classification methods, most of the recent effort has gone into compensating for or removing intra-class natural variability from data samples through various feature processing methods.

Dimensionality reduction [13-15] and feature selection [6-9] are two types of feature processing techniques used to automatically improve the quality of data by removing irrelevant information. Dimensionality reduction methods are popular because they reduce the number of features and the noise in a feature vector with the mathematical convenience of feature transformations and projections. However, the assumption of correlations between the features in the data is a core aspect of dimensionality reduction methods that can result in inaccurate feature descriptions. Further, it is not always possible to remove irrelevant information from the original data with a dimensionality reduction approach. Improving the quality of the resulting features using linear, and more recently non-linear, dimensionality reduction methods has been a field of intense research and debate in the recent past [13]. As an alternative to dimensionality reduction, feature selection does not try to improve overall feature quality; instead, it removes irrelevant features from the high dimensional feature vector, thereby improving the performance of classification systems. Feature selection has been an intense field of study in recent years, gaining importance in parallel with dimensionality reduction methods. Feature selection provides an advantage over dimensionality reduction methods because of its ability to distinguish and select the best available features in a data set [6-10, 16]. This means that feature selection methods can be applied both to the original feature vectors and to the feature vectors that result from the application of dimensionality reduction methods. From this point of view, feature selection can be considered an essential component for developing high performance pattern classification systems that use high dimensional data [13, 17]. Since high dimensional feature vectors contain several irrelevant features that reduce the performance of pattern recognition methods, feature selection by itself can be used in most modern data classification methods to combat the issues resulting from the curse of dimensionality [18, 19].

Feature selection problems revolve around the correct selection of a feature subset. In a search-criteria approach, feature selection is reduced to a search problem that detects an optimal feature subset based on the selected criteria. Exhaustive search ensures an optimal solution; however, as dimensionality increases, such a search becomes computationally prohibitive. In the present literature, there exists no other distinct way to optimally select the features without reducing classification performance.

The existing research in feature selection has focused on excluding features that are determined to be most redundant using various search strategies and criteria assessment techniques [20-25]. In this paper, we propose a new method for feature selection based solely on individual feature discriminatory ability, as an alternative to the existing search and criteria driven feature selection methods. The discriminatory ability of each feature is measured by the area of overlap between the inter-class and intra-class distance distributions obtained from feature to feature comparisons. Experimental results of a classification task based on microarray and image databases validate the effectiveness and accuracy of the features obtained by our feature selection method.

Related work

Feature selection methods can be classified into three broad categories: the filter model [26, 27], the wrapper model [28, 29], and hybrid and embedded models [30, 31]. In order to evaluate and select features, filter models exclusively use characteristics of the data, wrapper models use data-mining algorithms, and hybrid models combine characteristics of the data with data-mining algorithms. In general, these feature selection methods consist of three steps: (1) feature subset generation, (2) evaluation, and (3) stopping criteria [32]. The subset generation process arrives at a starting set of features using different types of forward, backward or bidirectional search methods. Some of the most common techniques employed are complete search such as branch and bound [33] and beam search [34], sequential search such as sequential forward selection, sequential backward elimination and bidirectional selection [35], and random search such as random-start hill-climbing and simulated annealing [34]. The generated subset is evaluated for goodness using either an independent or a dependent criterion. Independent criteria are generally used in the filter model; the popular ones are distance, dependency and consistency measures [35-37]. Dependent criteria are generally used in the wrapper model and require tuning of the data-mining algorithms. Wrapper models generally perform better; however, they are computationally expensive and less robust to parameter changes in the data-mining algorithms [38-41]. The goodness of the subsets under a selection criterion is assessed against stopping criteria such as a minimum number of features, an optimal number of iterations, or lower classification error rates.
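
For illustration, the following is a minimal Python sketch of the sequential forward selection strategy mentioned above, written with a user-supplied evaluation criterion; it is not part of the proposed method, and the `criterion` callable and the `max_features` stopping rule are assumptions made for the example.

```python
# Minimal sketch of sequential forward selection (SFS) with a generic
# evaluation criterion: greedily add the feature that most improves the score.
import numpy as np

def sequential_forward_selection(X, y, criterion, max_features=10):
    """X: (n_samples, n_features) array, y: (n_samples,) labels.
    `criterion` is any user-supplied score to maximise (hypothetical example:
    cross-validated accuracy of a classifier on the candidate subset)."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = [(criterion(X[:, selected + [j]], y), j) for j in remaining]
        score, j_best = max(scores)
        if score <= best_score:          # stopping criterion: no improvement
            break
        best_score = score
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```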

It can be noted that in conventional feature selection methods, features or subsets of features are selected based on the rank obtained by evaluating features against a selection criterion, such that the redundancy of features in the training set is minimized. The best performing classification methods that rely on data-mining strategies include feature relevance calculations to select features holistically [20-22]. However, data-mining based solutions result in features that tend to be sensitive to minor changes in the training data. Further, an increase in dimensionality makes the data-mining algorithms computationally intensive, often requiring problem specific optimization techniques. Contrary to data-mining based solutions, criteria driven methods based on filter models are computationally less complex and more robust to minor changes in training data [23-25]. In such methods, the accuracy of the initial selection of subsets using exhaustive forward or backward search of the features [42] significantly impacts the accuracy of the features obtained with a given feature selection criterion. In addition, as pointed out in [28], optimal selection of subsets is intractable and in some problems is NP-hard [43]. Further, variations in the nature of data from one database to another make the optimal selection of an objective function difficult, and a high classification accuracy using the features selected by such methods is not always guaranteed. Because of such deficiencies, hybrids of filter and wrapper models also reflect these problems at various levels of feature selection.

The determination of inter-feature dependency as described by filter and wrapper models lays the foundation of present day feature selection methods. These models arrive at features that are often tuned to suit a classifier using several machine learning strategies at the selection or criteria assessment stage. Some of the recent approaches that attempt to improve the performance of conventional feature selection methods use the ideas of neighborhood margins [44-46] and manifold regularization using SVMs [47]. However, similar to wrapper methods that use specific mining techniques, these recent methods are computationally complex and require additional optimization methods to speed up calculations. In addition, the optimal performance of the selected features on classifiers is highly sensitive to minor changes in training data and tuning parameters. Due to these reasons, the practical applicability and robustness of such methods on large sample high dimensional datasets are questionable.

Conventional feature selection methods apply multiple levels of processing on a given feature vector to find a subset of useful features for classification, using several machine learning techniques and search strategies. The presented work, on the contrary, selects the most discriminating features in a single step of discriminant subset selection. As distinct from the general idea of optimizing feature subsets for classification oriented filter and wrapper models, here we focus on developing an approach to determine relevant features from a training set solely by calculating their individual inter-class discriminatory ability.

Discriminant feature selection based on nearest features

Although not popular in the feature selection literature, perhaps the simplest way to understand the discriminatory nature of a feature in a training set with two classes is to evaluate each feature with a naive Bayes classifier. A low probability of error for an individual feature, as obtained using the Bayesian classifier, would indicate good discriminatory ability and assert the usefulness of the feature.
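
As a rough illustration of this idea (not the procedure proposed later in this paper), the following Python sketch scores each feature by the cross-validated error of a one-dimensional Gaussian naive Bayes classifier; the use of scikit-learn and five-fold cross-validation are assumptions of the example.

```python
# Estimate each feature's discriminatory ability from the error of a
# one-dimensional Gaussian naive Bayes classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def per_feature_bayes_error(X, y, cv=5):
    errors = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        acc = cross_val_score(GaussianNB(), X[:, [j]], y, cv=cv).mean()
        errors[j] = 1.0 - acc            # low error -> discriminative feature
    return errors
```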

A standard approach in the feature selection literature is to directly apply training and selection criteria on the feature values. However, when the natural variability in the data is high and the number of training samples is small, even minor changes in feature values introduce errors in the Bayes probability calculations. Classification methods such as SVM, on the other hand, try to get around this problem by normalising the feature values and by parametric training of the classifiers against several possible changes in feature values. In classifier studies, this essentially shifts the focus from feature values to distance values. Instead of directly optimising the classifier parameters based on feature values, the distance function itself is trained and optimised.

Proposed method

In this work, we attempt to develop a technique of feature selection using the new concept of distance probability distributions. This is a very different concept from that of filter methods, which apply various criteria such as inter-feature distance, Bayes error or correlation measures to determine a set of features having low redundancy. Instead of complicating the feature selection process with different search and filter schemes to remove redundant features and maintain relevant ones, we focus on using all features that are most discriminative and useful for a classifier. Further, rather than looking at feature selection as a problem of finding inter-feature dependencies to reduce the number of features, we treat each feature individually and arrive at features that have the ability to contribute to an improvement in classifier performance.

Suppose there are M classes in a training set of patterns described by a set of J features, with $\omega_{ij}$ as the class label for feature j, where $i \in \{1,\ldots,M\}$ and $j \in \{1,\ldots,J\}$. Let $x_{jk}$ be the j-th feature in the k-th training pattern, which can be used to calculate the inter-class and intra-class distance probability distributions. The intra-class distance $y_j^a$ of the j-th feature within a class of a training set with K samples is equal to $1-e^{-|x_{jk}-x_{j\bar{k}}|}$, where $k \in \{1,\ldots,K\}$, $\bar{k} \in \{1,\ldots,K\}$ and $k \neq \bar{k}$. The inter-class distance $y_j^e$ of a feature $x_{jk}$ belonging to a class $\omega_{ij}$ is equal to $1-e^{-|x_{jk}-\bar{x}_j|}$, where $\bar{x}_j$ is the feature at j belonging to a sample of a class other than that of $x_{jk}$. We represent the set of classes that do not belong to the class $\omega_{ij}$ as $\bar{\omega}_{ij}$. Then the intra-class distance probability distribution of feature j in class $\omega_{ij}$ is $p(y_j^a|\omega_{ij})$ and the corresponding inter-class distance probability distribution is $p(y_j^e|\bar{\omega}_{ij})$. The area of overlap of these distributions can be seen as the probability of error of the feature at j for the class label at i, and represents the discriminatory ability of the feature. Since in practice we deal with samples in discrete form, the probability densities can be represented in discrete form with m bins, and the area of overlap $P(j|i)$ can be written as:

$$P(j|i) = \frac{1}{2}\sum_{m=y_0} p_m\left(y_j^a \mid \omega_{ij}\right)\,dy + \frac{1}{2}\sum_{m=y_0} p_m\left(y_j^e \mid \bar{\omega}_{ij}\right)\,dy$$
(1)
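
For illustration, a minimal Python sketch of the distance-overlap computation for a single feature and class is given below. It assumes the overlap area is approximated by the summed bin-wise minimum of the two normalised distance histograms, which is one common discrete approximation and may differ from the exact discretisation of Eq. (1).

```python
# Overlap area P(j|i) between intra- and inter-class distance distributions
# of one feature, using the 1 - exp(-|a - b|) transform defined above.
import numpy as np

def overlap_area(x_j, labels, cls, bins=20):
    """x_j: 1-D array of feature j over all training samples; labels: class ids."""
    same, other = x_j[labels == cls], x_j[labels != cls]
    # intra-class distances y^a (all unordered pairs within the class)
    ia, ib = np.triu_indices(len(same), k=1)
    y_a = 1.0 - np.exp(-np.abs(same[ia] - same[ib]))
    # inter-class distances y^e (class samples against all other classes)
    y_e = 1.0 - np.exp(-np.abs(same[:, None] - other[None, :])).ravel()
    edges = np.linspace(0.0, 1.0, bins + 1)
    p_a, _ = np.histogram(y_a, bins=edges)
    p_e, _ = np.histogram(y_e, bins=edges)
    p_a = p_a / p_a.sum()
    p_e = p_e / p_e.sum()
    return np.minimum(p_a, p_e).sum()    # 0 = well separated, 1 = fully overlapping
```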

The relative area of overlap of feature among all the classes can be then found as:

$$\hat{P}(j|i) = \frac{P(j|i)}{\min_i P(j|i)}$$
(2)

The minimum area of overlap for feature across different classes can be then calculated as a measure to establish the discriminatory ability of feature:

$$\hat{P}_j = 1 - \min_i \hat{P}(j|i)$$
(3)

Taking the minimum value of $\hat{P}(j|i)$ across the different classes ensures that features that discriminate well for any one of the many classes are retained, and such features can be considered useful for classification. The features are ranked in descending order based on the value of $\hat{P}_j$: a value of 0 forces the feature to take a low rank, while a value of 1 forces the feature to take the top rank. Let R represent the set of $\hat{P}_j$ values arranged in the order of their ranks, each rank representing a feature or group of features. The set R can be used to form a rank based probability distribution by normalising the $\hat{P}_j$ values.

It is well known that almost every ranked distribution of empirical nature originating from realistic data follows a power law distribution. The top ranked features in a ranked distribution often retain most of the information. This effect is observed in different problems and applications, and has formed the basis of the winner-take-all and Pareto principles.

The ranked distribution is formed with $\bar{P}_r = \frac{\hat{P}_j}{\sum_{j=1}^{J}\hat{P}_j}$ representing the normalised value of $\hat{P}_j$ for the feature at j having rank r. The cumulative ranked distribution $c_r$ is obtained as:

$$c_r = \bar{P}_r + c_{r-1}, \quad \text{where } c_1 = 0$$
(4)

The top ranked values of $c_r$ can be used to select the most discriminative set of features. Applying the winner-take-all principle, and along the lines of the 20-80 concept of rank-size distributions, it is logical to assume that the top ranked features carry the maximum amount of discriminative information. The subset of features X having a size $L \in [1,J]$ can be selected from the ranked features based on a selection threshold $\theta$.

$$x_j \in X \quad \forall \; c_r \leq \theta$$
(5)

In other words, the features $x_j$ corresponding to the ranks that fall below the cumulative area threshold $\theta$ are selected to form X with size L. The selection threshold $\theta$ for selecting the top ranked features is determined using the proposed Definition 1.

Definition 1

The selection threshold $\theta$ is equal to the standard deviation $\sigma$ of the distribution of $c_r$, where $\sigma = \sqrt{\frac{1}{N}\sum_{r=1}^{N}\left(c_r - \frac{1}{N}\sum_{r=1}^{N} c_r\right)^2}$.
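
A minimal Python sketch of the ranking, cumulative distribution and threshold-based selection (Eqs. 4-5 and Definition 1) is given below; it assumes the per-feature discriminatory scores $\hat{P}_j$ have already been computed, and it uses a plain cumulative sum rather than the $c_1 = 0$ initialisation of Eq. (4).

```python
# Rank features by their discriminatory score, form the cumulative ranked
# distribution, and select all ranks whose cumulative value is within theta.
import numpy as np

def select_by_threshold(P_hat):
    order = np.argsort(P_hat)[::-1]           # rank features, best first
    P_bar = P_hat[order] / P_hat.sum()        # normalised ranked distribution
    c = np.cumsum(P_bar)                      # cumulative ranked distribution
    theta = c.std()                           # Definition 1: theta = sigma of c_r
    keep = order[c <= theta]                  # original indices of selected features
    return keep, theta
```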

If each feature in X is uncorrelated and independent, there will be few or no redundant features within X, and the selection of X based on discriminatory ability is sufficient to ensure good classification performance. However, in the feature selection problem, there is a chance that the subset of discriminant features contains very similar features, and such features become redundant for improving classification performance. Identifying the independence of the discriminant features ensures the detection of the least redundant features. For two features $\{x_r, x_{r+1}\}$, ranked in order of their $\bar{P}_r$ and $\bar{P}_{r+1}$ values, let $p(x_r)$ and $p(x_{r+1})$ be the probability density functions and $p(x_r, x_{r+1})$ be the joint probability density function, where $r \in [1,L]$ is the rank of a feature in X corresponding to an index j in the original feature space. The features are independent if it can be established that $p(x_r, x_{r+1}) = p(x_r)\,p(x_{r+1})$. This idea of independence testing is utilised in finding an independence score for a feature. The area score between the probability densities $p(x_r, x_{r+1})$ and $p(x_r)\,p(x_{r+1})$ in the discrete domain is calculated as:

$$A_{r,r+1} = \frac{1}{2}\sum_{m=x_0} p_m\left(x_r\right) p_m\left(x_{r+1}\right)\,dx + \frac{1}{2}\sum_{m=x_0} p_m\left(x_r, x_{r+1}\right)\,dx$$
(6)

The independence score $I_r$ of feature $x_r$ with respect to the remaining $L-1$ features in X is determined as:

$$I_r = \frac{1}{L-1}\sum_{r=1}^{L-1} A_{r,r+1}$$
(7)

A value of $I_r = 1$ indicates that $x_r$ is an independent feature in X (or $x_j$ in the feature set, with the j-th feature in the original feature space corresponding to the r-th ranked feature in X), while a low value of $I_r$ indicates that $x_r$ is redundant and should be removed. The independence score $I_r$ corresponding to the feature at j in the sample, along with the discriminatory score $\hat{P}_j$, can be used to select the most independent set of discriminant features.

$$z_s = x_j \quad \forall \; I_r \hat{P}_j \geq \varepsilon$$
(8)

where $\varepsilon = 0.01$ is a small number, and $z_s$ is the set of the most relevant discriminative independent features $x_j$, with $s \leq J$.
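
The following Python sketch illustrates the independence scoring and the final selection of Eq. (8). It assumes the area score compares the joint histogram of two selected features with the product of their marginal histograms, and that the independence score of a feature is its average area score against the other selected features; both are one reading of Eqs. (6) and (7) and may differ from the authors' exact computation.

```python
# Independence scores for the rank-ordered selected features and the final
# epsilon-based filtering of Eq. (8).
import numpy as np

def independence_scores(X_sel, bins=10):
    """X_sel: (n_samples, L) matrix of the rank-ordered selected features."""
    L = X_sel.shape[1]

    def area_score(a, b):
        joint, _, _ = np.histogram2d(a, b, bins=bins)
        joint /= joint.sum()
        prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))  # product of marginals
        return np.minimum(joint, prod).sum()  # 1 when joint == product (independent)

    I = np.empty(L)
    for r in range(L):
        I[r] = np.mean([area_score(X_sel[:, r], X_sel[:, s])
                        for s in range(L) if s != r])
    return I

def final_selection(X_sel, P_hat_sel, eps=0.01):
    I = independence_scores(X_sel)
    return np.where(I * P_hat_sel >= eps)[0]  # keep features with I_r * P_hat_j >= eps
```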

This subset of top ranked features is considered useful for classification. However, the parameters and the nature of the decision boundary imposed by a specific classifier need to be considered before these features can be used for classification. Consider using a nearest neighbour classifier; then the relative importance of a feature $z_s \in X$ can be rated based on the recognition performance obtained by using the individual feature $z_s$ alone for classification. Assuming the independence of features, and using leave-one-out cross validation, the classification accuracy of the s-th feature for the j-th sample in a training set of size J, with $l \leq J$, is found by identifying the class as:

$$w = \arg\min_{l,\, l \neq j} d\left(z_{sj}, z_{sl}\right)$$
(9)

The selected features $z_s$ are then ranked in descending order based on the total number of correct class identifications. The top ranked features represent the most discriminant features, while the lower ranked ones have relatively lower class discriminatory ability when used with a nearest neighbour classifier. Such a ranking identifies the features that respond best to the given classifier.
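
A minimal Python sketch of this per-feature leave-one-out nearest-neighbour ranking (Eq. 9) is given below; the absolute difference is used as the distance d, which is an assumption of the example.

```python
# Rank each selected feature by how many training samples its nearest other
# sample (leave-one-out, 1-NN) assigns to the correct class.
import numpy as np

def rank_by_loo_1nn(X_sel, y):
    """X_sel: (n_samples, L) selected features; y: class labels."""
    n, L = X_sel.shape
    correct = np.zeros(L, dtype=int)
    for s in range(L):
        col = X_sel[:, s]
        d = np.abs(col[:, None] - col[None, :])       # pairwise distances
        np.fill_diagonal(d, np.inf)                   # leave the sample itself out
        nearest = d.argmin(axis=1)
        correct[s] = np.sum(y[nearest] == y)          # correct class identifications
    return np.argsort(correct)[::-1]                  # best feature first
```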

Results and discussion

The role of feature selection methods in a high dimensional pattern classification problem is to select the minimum number of features that maximizes the recognition accuracy. In this section, we demonstrate how the newly proposed selection method performs this task on standard databases used for benchmarking feature selection methods.

Advancements in measurement techniques and computing methodologies have resulted in the use of microarray data in genetics, medicine, and patient diagnosis. The high dimensional feature vectors in microarray data often contain a large number of features that are not useful in the process of classification. The main role of our feature selection method is to identify the gene expressions from a microarray data set that are most useful for classification.

Five benchmark microarray based gene expression databases are used in this study: GLI-85 (also known as GSE4412) [48], GLA-BRA-180 (also known as GDS1962) [49], CLL-SUB-111 (also known as GSE2466) [50], TOX-171 (also known as GDS2261) [51], and SMK-CAN-187 (also known as GSE4115) [52].

Selection threshold and classification

To assess the recognition performance of the proposed feature selection method for the microarray databases listed in Table 1, we randomly select an equal number of samples to form the training and test sets. It should be noted that for all the experiments and results presented in this section, a random split of 50% is used for the individual classes in the databases to form the training and test sets. The average recognition accuracies are reported over 30 repeated random splits. The number of features that have an area of overlap within a specified selection threshold can vary from one database to another. This means that the quality of the features can vary between databases, depending on the level of natural variability within a database. Figure 1 illustrates this observation through the dependence of the normalized number of selected features $z_s$ on the selection threshold. It can be seen that the quality of the features is different for almost every database. Interestingly, all databases apart from SMK-CAN-187 contain less than 3% of features with a relative overlap area smaller than 0.2. This means that the intra-class variability in SMK-CAN-187 is lower than in the other databases, possibly because lung cancer affects several gene expressions distinctively in comparison with the other cancer and toxicology databases.
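
For clarity, a sketch of this evaluation protocol is given below; the scikit-learn stratified split and 1-NN classifier, and the `select_features` placeholder standing in for the proposed selection method, are assumptions of the example.

```python
# 30 repeated random 50/50 class-wise splits; feature selection is applied to
# the training half only and accuracy is averaged over the repetitions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def average_accuracy(X, y, select_features, repeats=30, seed=0):
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=rng)
        idx = select_features(X_tr, y_tr)              # indices of selected features
        clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr[:, idx], y_tr)
        accs.append(clf.score(X_te[:, idx], y_te))
    return float(np.mean(accs))
```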

Table 1 Organization of the databases used in the experiments
Figure 1. Selection threshold versus selected features. The dependence of the number of selected features on the selection threshold for the five gene expression databases.

Figure 2 shows the recognition performance of the presented feature selection method when used with the nearest neighbor classifier. The recognition accuracy is defined as the ratio of the number of correctly identified test samples to the total number of test samples. It can be seen that for all the databases, a selection threshold of 0.3 or less is sufficient to obtain high recognition accuracies. The maximum accuracies are possibly limited by the nature of the classifier and the quality of the best features.

Figure 2. Average recognition performance versus threshold. Average recognition performance of the nearest neighbor classifier used with the newly proposed feature selection method for the five gene expression databases.

Feature ranking and classification

When the relative area of overlap for all the features is small, applying the threshold based selection results in the use of almost all available features for classification. Using the complete set of features in the process of automatic classification is often not a feasible option due to the curse of dimensionality. In such situations, ranking the features and selecting a group of top ranked features can serve both for dimensionality reduction and for selecting the best available features for classification. The simplest and most common approach for selecting the top ranks is an individual search that evaluates each feature separately. Leave-one-out cross-validation is performed on the training set for the individual features that are selected based on a specified value of the selection threshold, and the selected features are ranked based on the recognition error obtained by evaluating each of them individually with a nearest neighbor classifier.

Figure 3 shows the dependence of the recognition accuracies on the number of top ranked features used with a nearest-neighbor classifier. This dependence is illustrated for a maximum of 100 features that all fall below the selection threshold of 0.2 and are ranked based on the least recognition error in the cross validation test. It can be seen that a small number of top-ranked features increases the recognition accuracy to the maximum values observed in Figure 2.

Figure 3. Average recognition performance versus ranked features. Average recognition accuracies obtained by the nearest-neighbor classifier with a selection of up to the top 100 features for the five gene expression databases.

Comparisons

Table 2 shows a comparison of the best accuracies obtained with the top ranked features using three conventional classifiers: nearest neighbor, linear SVM, and naive Bayes. The recognition accuracies shown in Table 2 are the ratio of the number of correctly identified test sample labels to the total number of test samples in the test set, where the process of calculating accuracy is repeated for 30 random selections of the testing and training sets in each of the microarray databases. Such a cross-validation is done to ensure the correctness of the reported accuracy. The accuracy values for each database are reported on the samples from the testing set using the features selected by the proposed method. Overall, it can be seen that all the classifiers perform equally well. It should be noted that in most cases the highest recognition accuracies are obtained with a very small number of features in comparison with the total number of available features. This means that for gene expression databases only very few gene expressions are useful for the process of classification, irrespective of the type of classifier employed.
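
A minimal sketch of such a classifier comparison on a fixed train/test split is given below; the scikit-learn classifiers with default settings are assumptions of the example and need not match the configurations used for the reported numbers.

```python
# Evaluate the same selected features with a 1-NN classifier, a linear SVM
# and Gaussian naive Bayes, returning the test accuracy of each.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB

def compare_classifiers(X_tr, y_tr, X_te, y_te, idx):
    models = {
        "1-NN": KNeighborsClassifier(n_neighbors=1),
        "linear SVM": LinearSVC(),
        "naive Bayes": GaussianNB(),
    }
    return {name: m.fit(X_tr[:, idx], y_tr).score(X_te[:, idx], y_te)
            for name, m in models.items()}
```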

Table 2 The highest recognition accuracies on gene expression databases when selecting features within the top 100 ranked features obtained by three different classifiers

Table 3 shows the performance comparison between the newly presented feature selection method and conventional feature selection methods [53, 54]. The accuracies and features are determined using the same process as described for Table 2. It can be seen that the presented method uses fewer features to achieve higher recognition accuracies, which shows that the presented method selects the features that are useful for recognition more accurately than the conventional methods. The ability of the proposed method to detect fewer features without compromising recognition performance can have a significant impact on the early detection and diagnosis of human diseases (e.g., glioma) using gene expressions. The detection of such features implies that they reflect the set of gene expressions that indicate the incidence of a particular disease. Any significant change in such features is indicative of an abnormality or of belonging to a particular state or condition.

Table 3 Comparison of maximum recognition accuracies on gene-expression databases using up to 100 top ranked features obtained by different feature-selection methods and a nearest neighbor classifier

Conclusion

In this paper, we presented a feature selection method for gene data classification that is based on assessing the discriminatory ability of individual features within a class. The area of overlap between the inter-class and intra-class distance distributions of individual features is identified as a useful measure for feature selection. A common framework to select the most important set of features is provided by applying a selection threshold. The ability of the proposed method to select the most discriminatory features resulted in improved classification performance with a smaller number of features, although the number of features required for achieving high recognition accuracy varies from one database to another. The presented feature selection technique can be used in the automatic identification of cancer-causing genes and would help facilitate the early detection of specific diseases or conditions.