1 Introduction

Artificial intelligence has made significant breakthroughs in recent years, owing to developments in algorithms, computing power, and big data. In particular, machine learning has been very successful due to its outstanding capacity to analyze massive volumes of data automatically. Classification is one of the most important tasks in machine learning, as it allows for the prediction of events in a wide range of applications, from medicine to finance. However, when faced with a high number of irrelevant and/or redundant features, the performance of several of the most popular classification algorithms can deteriorate. This phenomenon is known as the curse of dimensionality and is the reason why dimensionality reduction methods play an important role in preprocessing the data.

Feature selection is one of these dimensionality reduction approaches; it is described as the process of selecting relevant features and rejecting irrelevant or redundant ones. Many noisy and meaningless features are frequently gathered or generated by various sensors and algorithms, and they consume a significant amount of computational resources. Feature selection is therefore critical in the context of machine learning, as it allows for the removal of uninformative features while keeping a small subset of them, thus reducing computational complexity.

Feature selection methods can be classified into three categories based on their relationship to the induction algorithm (Guyon et al., 2008): (i) filters, which are independent of the induction algorithm and use metrics such as mutual information or statistical tests such as \(\chi^2\) to determine the importance of the features; (ii) wrappers, which use the accuracy of the induction algorithm to determine the importance of the features; and (iii) embedded methods, which perform feature selection during training and are usually specific to given learning machines. Furthermore, feature selection approaches are classified as univariate (when they compute the relevance of a single feature to the predictive class) and multivariate (when they take into account the interactions among subsets of features).

Unlike other increasingly popular dimensionality reduction techniques, such as feature extraction based on embeddings or deep neural networks (Salau & Jain, 2019; Kasongo & Sun, 2020), feature selection keeps the original features, and there are various applications where identifying the relevant ones is required. Examples can be found in bioinformatics (for instance, to discover a few important biomolecules that account for the majority of a phenotype (Climente-González et al., 2019)), in decision-making fairness (e.g., to locate the input features employed in the decision process instead of focusing on the fairness of the choice outcomes (Grgic-Hlaca et al., 2018)), or in nanotechnology (for example, to establish the most important experimental conditions and physicochemical features to take into account when making a nanotoxicology risk assessment (Furxhi et al., 2020)). These applications all have one thing in common: they are not pure classification problems. In fact, knowing which features are relevant is just as crucial as achieving an accurate classification, because these features may provide new information about the underlying system.

However, there are numerous feature selection methods to choose from, and most researchers agree that a single best feature selection approach does not exist (Bolón-Canedo et al., 2013). On top of this, new feature selection methods appear every year, which leads us to ask: do we really need so many feature selection methods? Which ones are the best to use for each type of data? In light of these concerns, the purpose of this paper is to examine the most common feature selection approaches in two scenarios, synthetic and real datasets, using random selection as a baseline. Our goal is to analyze whether some methods produce results that are not considerably better than those obtained by randomly selecting a subset of features. In contrast to our previous work (Morán-Fernández & Bolón-Canedo, 2021), in this paper we (a) include seven synthetic datasets, in order to check the behavior of feature selection and random selection when the relevant features are known, (b) examine the effects of including different levels of noise in the inputs, (c) analyze the impact of discretization on feature selection through a case study that varies the number of bins in the Equal-width method, (d) compare the results obtained by applying the rough set attribute reduction method QuickReduct with the Correlation-based Feature Selection method, and (e) present an illustrative example of feature selection over the MNIST dataset.

The remainder of the paper is organized as follows: Section 2 provides some background on the feature selection problem. Section 3 presents the different feature selection methods employed in the study and provides a brief description of the 7 synthetic and 55 real datasets used. Section 4 details the experiments carried out, including several case studies. Finally, Section 5 contains our concluding remarks and proposals for future research.

2 Background

Machine learning researchers face an interesting dilemma when datasets expand in size; to cite Donoho (2000) “our task is to find a needle in a haystack, teasing the relevant information out of a vast pile of glut”. Ultra-high dimensionality necessitates a large amount of memory and a significant training computational cost. Furthermore, what is known as the “curse of dimensionality” undermines generalization abilities. As a result, in a society where huge amounts of data and features are required in a variety of fields, new solutions for dealing with the critical issue of feature selection are urgently needed (Bolón-Canedo et al., 2015).

The initial studies on feature selection date back to the 1960s (Hughes, 1968), but it was not until the 1990s that significant advancements in feature selection for solving machine learning problems were made. Because of its capacity to improve the performance of learning algorithms, feature selection has gained popularity in the field of machine learning, particularly in supervised and unsupervised processes such as clustering, regression, and classification. However, the most widely used feature selection approaches were created years ago, and they currently face significant hurdles that can negatively impact their performance. Feature selection is a difficult task, since for a dataset with m features, the total number of possible feature subsets is \(2^m - 1\); for instance, a dataset with only 30 features already yields more than \(10^9\) candidate subsets, making exhaustive search infeasible.

Furthermore, feature-to-feature correlations are common, and they come in a variety of two-way, three-way, and more complex forms. A weak correlation between two features may become a strong correlation when they are combined with other features. Moreover, the most common types of search in feature selection, such as sequential forward or sequential backward selection, suffer from local convergence issues and significant computational costs. Table 1 shows the computational cost of some of the most popular FS methods.

Table 1 Popular filter methods and their theoretical complexity where n is the number of samples, m is the number of features, \(c+c\) is the double hashing time cost and P(m) is the power-set of conditional features

As can be seen, the most sophisticated methods have quadratic complexity in the number of features, an expensive calculation usually derived from computing the correlation of pairs of features. In this paper we will try to answer the question of whether it is worth paying the price of this expensive computation for better performance results.

3 Methods and materials

3.1 Feature selection techniques

In the classification literature, feature selection approaches have garnered a lot of attention, and they can be divided into three categories based on their interaction with the induction algorithm (Guyon et al., 2008; Shahrjooihaghighi & Frigui, 2021): filters, wrappers, and embedded methods. We chose filter methods over wrapper and embedded methods because we want to avoid the interaction with the classifier. Furthermore, filter methods are a popular choice in the new Big Data environment, owing to their lower computational cost as compared to wrapper or embedded approaches. The seven filters used in the experiments are described below; two of them are univariate (Information Gain and Mutual Information Maximisation) and the other five are multivariate.

  • Correlation-based Feature Selection (CFS) is a simple multivariate filter technique that ranks feature subsets using a heuristic evaluation function based on correlation (Hall, 1999). This function tries to find subsets of features that are correlated with the class but not with one another. The idea is to remove those attributes whose correlation with the class is low (and which are thus considered irrelevant), as well as those that are redundant (highly correlated with one another).

  • The INTERACT (INT) algorithm works on the idea of symmetrical uncertainty (SU) and adds a contribution for consistency (Zhao & Liu, 2009). This method works in two steps. First, features are sorted in descending order according to their value of SU. In the second step, the algorithm starts with the features at the end of the ranking and evaluates them one by one. If a feature’s consistency contribution is below a predetermined threshold, it is deleted; otherwise, it is selected.

  • The Information Gain (IG) filter analyzes a single feature at a time and evaluates it based on its information gain (Hall & Smith, 1998). It produces an ordered ranking of all features, after which a threshold is applied to choose a particular number of them based on that order.

  • The ReliefF algorithm (RelF) (Kononenko, 1994) adds to the original Relief algorithm the ability to deal with noisy, incomplete, and multi-class datasets. This algorithm’s key idea is to estimate the quality of features based on how well their values discriminate between examples that are close to each other.

  • Mutual Information Maximisation (MIM) (Lewis, 1992) obtains a ranking of attributes according to their mutual information with the class and selects the top k features, where k is fixed beforehand or determined by another criterion (a minimal sketch of this ranking scheme is shown after this list).

  • The minimum Redundancy Maximum Relevance (mRMR) (Peng et al., 2005) approach selects features that fulfill two conditions: they are highly relevant to the target class but not redundant among each other. Both the maximum-relevance and minimum-redundancy optimization criteria are based on mutual information.

  • Another feature selection approach based on mutual information is Joint Mutual Information (JMI) (Yang & Moody, 2000), which uses a new criterion to evaluate candidate features. At each step, JMI selects the feature with the highest cumulative sum of joint mutual information with the already selected features and adds it to the subset S, until the desired number of features k is reached.
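
To make the ranking idea concrete, the snippet below is a minimal sketch (not the implementation used in our experiments) of the MIM criterion: each already discretized feature is scored by its mutual information with the class, and the k best-ranked features are kept. The toy data, the helper name mim_ranking, and the use of scikit-learn's mutual_info_score are illustrative assumptions.

```python
# Minimal sketch of MIM: rank features by mutual information with the class.
import numpy as np
from sklearn.metrics import mutual_info_score

def mim_ranking(X, y):
    """Return feature indices sorted by decreasing mutual information with y."""
    scores = np.array([mutual_info_score(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

# Toy example: 6 samples, 3 discrete features; feature 0 mirrors the class exactly
X = np.array([[0, 1, 2],
              [0, 0, 1],
              [1, 1, 0],
              [1, 0, 2],
              [0, 1, 1],
              [1, 0, 0]])
y = np.array([0, 0, 1, 1, 0, 1])

ranking, scores = mim_ranking(X, y)
k = 2
print(ranking[:k])      # indices of the top-k features (feature 0 ranks first)
```

The other ranker filters described above follow the same keep-the-top-k pattern, differing only in the criterion used to score each feature.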

In addition, for case study III (see Section 4.3.3), we will use a method belonging to the family of rough set attribute reduction algorithms:

  • QuickReduct (QR) (Shen & Chouchoulas, 2000; Chouchoulas & Shen, 2001) employs a forward selection approach based on a non-exhaustive hill-climbing search, which may get trapped in local optima and therefore offers no guarantee of global optimality. It evaluates attribute subsets based on rough set dependency values. The objective is to reach a state where the search identifies the highest achievable dependency value for the dataset.

3.2 Synthetic and real datasets

To investigate the effect of feature selection empirically, we used 7 synthetic datasets and 55 real datasets, 17 of which are microarray datasets. Each dataset contains a range of features, some of them binary/discrete and others continuous. Using the Equal-width method, continuous features were discretized into 5 bins, while categorical features were left unchanged.

The synthetic datasets used in this work (Table 2) are designed to cover a variety of issues, such as an increasing number of irrelevant features, redundancy, noise, input variations, and data nonlinearity. These factors heavily influence feature selection methods and complicate their task.

Table 2 Summary of the seven synthetic datasets

We also examined 55 real datasets to draw meaningful findings about the impact of feature selection. We downloaded 38 datasets with at least nine features from the UCI repository (Bache & Lichman, 2013), and added 17 microarray datasets because of their high dimensionality (Morán-Fernández et al., 2017; Remeseiro & Bolón-Canedo, 2019). Tables 3 and 4 summarize key properties of the datasets used in this investigation, such as sample size and number of features and classes.

Table 3 Summary of the 38 real datasets
Table 4 Summary of the 17 DNA microarray datasets

4 Experimental results

The different experiments consist of comparing the application of each of the seven feature selection approaches individually, as well as random selection (represented as ‘Ran’ in the tables/figures), which serves as the comparison baseline. While two of the feature selection methods (CFS and INTERACT) produce a feature subset, the remaining five (IG, ReliefF, MIM, JMI, and mRMR) are ranker methods, requiring a threshold to obtain a subset of features. In this study, we chose to keep the top 10%, 20%, and \(log_2(n)\) most significant features in the ordered ranking, where n is the number of features in a given dataset. Because of the mismatch between dimensionality and sample size in microarray datasets, the thresholds for these datasets were instead set to the top 5%, 10%, and \(log_2(n)\) features. To estimate the error rate, we used \(3\times 5\) cross-validation.
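
As an illustration of how these thresholds translate into numbers of selected features, the following sketch uses our own illustrative rounding convention (the exact rounding applied in the experiments may differ):

```python
# Hypothetical helper showing how the ranker thresholds map to subset sizes.
import numpy as np

def threshold_sizes(n_features, percentages=(0.10, 0.20)):
    """Number of features kept by each percentage threshold and by log2(n)."""
    sizes = {f"{int(p * 100)}%": max(1, round(p * n_features)) for p in percentages}
    sizes["log2"] = int(np.ceil(np.log2(n_features)))
    return sizes

print(threshold_sizes(57))                            # e.g. a UCI dataset with 57 features
print(threshold_sizes(7129, percentages=(0.05, 0.10)))  # e.g. a microarray dataset
```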

The best classifier will not be the same for all datasets, according to the No-Free-Lunch theorem (Wolpert, 1996). As a result, the behavior of the feature selection approaches is evaluated using the classification error obtained with five different classifiers, each belonging to a different family. Two linear classifiers (naive Bayes and a Support Vector Machine with a linear kernel) and three nonlinear classifiers (C4.5, k-Nearest Neighbor with \(k=3\), and Random Forest) were used. The Matlab (2022b) and Weka (3.8) tools were used to run the experiments on a Windows 10 operating system (Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, 16GB RAM). The QuickReduct algorithm was executed using the package for Weka by Scully and Jensen (2011).

4.1 Synthetic datasets

The initial step in determining the efficacy of a feature selection approach should be to use synthetic data, because knowing the optimal features and having the ability to change the experimental conditions allows for more useful conclusions to be drawn. Thus, this section reports the experimental findings obtained by the different feature selection approaches over the seven synthetic datasets, depending on the classifier. To investigate the statistical significance of our classification results, we applied a Friedman test with a Nemenyi post-hoc test to the classification errors. The following figures present the critical difference (CD) diagrams proposed by Demšar (2006), in which groups of methods that are not significantly different (at \(\alpha=0.10\)) are connected. The top line in a critical difference diagram is the axis on which we plot the average ranks of the methods. The axis is oriented so that the lowest (best) ranks are on the right, since we perceive the methods on the right side as better.
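
The sketch below illustrates, on toy error values rather than our actual results, the rank-based analysis behind the CD diagrams: a Friedman test over per-dataset errors followed by the Nemenyi critical difference of Demšar (2006). The method names, error values, and the \(q_\alpha\) constant are placeholders.

```python
# Sketch of the Friedman + Nemenyi analysis behind a CD diagram (toy data only).
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# rows = datasets, columns = feature selection methods (toy example values)
errors = np.array([
    [0.12, 0.15, 0.20],
    [0.08, 0.09, 0.14],
    [0.21, 0.22, 0.25],
    [0.05, 0.07, 0.11],
])
methods = ["CFS", "MIM-10", "Ran-10"]

# Friedman test: are the methods' error distributions distinguishable at all?
stat, p_value = friedmanchisquare(*errors.T)

# Average rank of each method (rank 1 = lowest error on a dataset)
ranks = rankdata(errors, axis=1)
avg_ranks = ranks.mean(axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N)).
# q_alpha must be taken from the Studentized-range table in Demšar (2006);
# the value below is only a placeholder for alpha = 0.10 and k = 3 methods.
k, n_datasets = errors.shape[1], errors.shape[0]
q_alpha = 2.052
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n_datasets))

print(f"Friedman p-value: {p_value:.3f}, CD = {cd:.2f}")
for name, r in zip(methods, avg_ranks):
    print(f"{name}: average rank {r:.2f}")
```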

By working with synthetic datasets, we know what their relevant features are. Therefore, apart from the results obtained by the different feature selection methods and the random selection (Ran), we also present those obtained when only the relevant features are used (labeled in the figures/tables as “Relevant”). Thus, we can see in Fig. 1 that, regardless of the classifier used, the lowest classification errors are obtained when the model is trained with the known relevant features. If we look at the different feature selection methods, INTERACT (INT) seems to be one of the most appropriate for this type of dataset. Besides, if we analyze the results of the univariate methods, MIM obtains results that are competitive with (and sometimes even better than) those of the multivariate methods. This makes it an appropriate choice for scenarios where the computational cost is important. Regarding the ranker threshold that achieves the lowest classification errors, the results vary considerably depending on the classifier, with 10% and the logarithm generally standing out. The synthetic datasets used have between 2 and 7 relevant features, far from the average of 25 features that are selected when using the 20% threshold. In that case, irrelevant and/or redundant features are most likely being included, which makes the classification task more difficult.

Fig. 1
figure 1

Critical difference diagrams showing the ranks after applying feature selection over the seven synthetic datasets. For feature selection methods that require a threshold, the option to keep \(10\%\) of the features is indicated by ‘-10’, the option to keep \(20\%\) is indicated by ‘-20’, and the option ‘-log’ refers to keeping \(log_2(n)\) features

On the other hand, random selection together with the 10% and 20% thresholds shows the worst classification results. However, random selection appears to be competitive against other feature selection methods when selecting \(log_2(n)\) features of the dataset. Thus, and due to the drawbacks of the traditional null hypothesis significance tests pointed out by Benavoli et al. (2017), we have chosen to apply a Bayesian hypothesis test (Kuncheva, 2020) in order to analyze the classification results achieved by “Ran-log” and the ranker methods. This type of study requires a preliminary step, which is defining the Region of Practical Equivalence (rope): if the mean difference between two approaches on a given metric is smaller than a predefined threshold, they are considered practically equivalent. In our case, if the difference in error is less than \(1\%\), we consider two methods as equivalent.
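
As a simplified illustration of the rope idea (not the hierarchical Bayesian test actually applied, which requires sampling the posterior over the differences), the sketch below splits paired error differences into the three regions of interest; all error values are toy numbers.

```python
# Simplified rope illustration: fraction of paired differences in each region.
import numpy as np

rope = 0.01                                   # 1% difference = practical equivalence
err_ran = np.array([0.25, 0.31, 0.18, 0.22, 0.27, 0.30, 0.24])   # random selection (toy)
err_fil = np.array([0.21, 0.30, 0.19, 0.17, 0.26, 0.22, 0.23])   # a filter, e.g. MIM-log (toy)

diff = err_ran - err_fil                      # positive => the filter has lower error
p_filter = np.mean(diff > rope)               # filter wins by more than the rope
p_rope   = np.mean(np.abs(diff) <= rope)      # practically equivalent
p_ran    = np.mean(diff < -rope)              # random selection wins

print(f"P(filter) = {p_filter:.2f}, P(rope) = {p_rope:.2f}, P(Ran) = {p_ran:.2f}")
```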

Fig. 2
figure 2

Simplex graphs for pair comparison of each feature selection method and the baseline random selection (Ran) over the seven synthetic datasets using Bayesian hierarchical tests: random selection (left) and filter method (right)

Table 5 Classification errors obtained by the five classifiers for the seven synthetic datasets tested
Fig. 3
figure 3

Critical difference diagrams showing the ranks after applying feature selection over the 38 real datasets. For feature selection methods that require a threshold, the option to keep \(10\%\) of the features is indicated by ‘-10’, the option to keep \(20\%\) is indicated by ‘-20’, and the option ‘-log’ refers to keeping \(log_2(n)\) features

For the whole benchmark and each pair of methods, we calculate the probability of three possibilities: (i) random selection (Ran) wins over the filter method by a difference greater than the rope, (ii) the filter method wins over random selection by a difference greater than the rope, and (iii) the difference between the outcomes lies inside the rope region. We consider the difference substantial if one of these probabilities is greater than \(95\%\). As a result, Fig. 2 depicts, using simplex graphs, the distribution of differences between each pair of methods. As can be seen, although the CD diagrams showed a slight superiority of random selection with the 20% threshold over the ranker Information Gain (IG), the simplex graphs show that there are no significant differences. In fact, when comparing only these two methods, the probabilities are skewed 75% toward the feature selection method. This may be due to the fact that, in their attempt to compare all the proposed methods at once, the CD diagrams ignore the pairwise comparisons carried out by means of the simplex graphs.

Table 5 displays the classification error obtained by the eight feature selection methods—the seven filters and random selection—over the seven synthetic datasets using the five different classifiers (the lowest classification error obtained for each feature selection method is in bold). As can be seen, the lowest classification errors were obtained by the non-linear classifiers C4.5 and Random Forest. Recall that three of the seven synthetic datasets (XOR-100, Parity3+3 and Madelon) represent non-linear scenarios. Therefore, and taking into account that SVM (with a linear kernel) and naive Bayes are linear classifiers, good results were not expected from them. Furthermore, it can also be clearly observed that random selection obtains the worst classification results, with hardly any differences across the different classifiers.

4.2 Real datasets

In this section we perform experiments on real datasets, to check whether the results are similar to those obtained on synthetic data. For this task, we selected a suite of 38 real datasets. CFS and INTERACT, regardless of the classifier used, appear to be the most suitable feature selection methods for this type of dataset, as shown in Fig. 3. Apart from obtaining good results, these two feature selection methods have an added advantage: they do not require establishing a threshold for the number of features to keep. When it comes to ranker methods (which do need a threshold), a percentage of 20% appears to be the best option overall. Moreover, as with the synthetic datasets, the univariate MIM method achieves good results despite its simplicity, although in this case it only manages to obtain better classification results than the multivariate ReliefF method.

We now compare the results obtained by the tested feature selection methods with the baseline, which we established as random selection (Ran). As expected, random selection is the worst option when using the logarithmic and 10% thresholds. Nevertheless, when we allow random selection to keep more features (20% threshold), it is interesting to see that it becomes competitive with the other methods. Thus, using simplex graphs as we did with the synthetic datasets, Fig. 4 depicts the distribution of differences between random selection (with a 20% threshold) and the ranker methods Information Gain, ReliefF, and MIM (with a 10% threshold). Although random selection with a 20% threshold is not significantly different from several ranker methods, it consistently outperforms them on average. This indicates that the ranker methods (ReliefF, InfoGain, and MIM) are highly dependent on the chosen threshold, so a bad choice of threshold produces results that are similar to randomly choosing 20% of the features. These findings highlight the importance of choosing a proper threshold, which is a difficult task that frequently depends on the problem to solve (and sometimes even on the classifier that is subsequently used).

Fig. 4
figure 4

Simplex graphs for pair comparison of each feature selection method and the baseline random selection (Ran) over the 38 real datasets using Bayesian hierarchical tests: random selection (left) and filter method (right)

Table 6 Classification errors obtained by the five classifiers for the 38 real datasets tested
Fig. 5
figure 5

Critical difference diagram showing the ranks after applying feature selection over the 17 microarray datasets. For feature selection methods that require a threshold, the option to keep \(5\%\) of the features is indicated by ‘-5’, the option to keep \(10\%\) is indicated by ‘-10’, and the option ‘-log’ refers to keeping \(log_2(n)\) features

Fig. 6
figure 6

Simplex graphs for pair comparison of each feature selection method and the baseline random selection (Ran) over the 17 microarray datasets for SVM classifier using Bayesian hierarchical tests: random selection (left) and filter method (right)

Table 6 displays the classification error on the 38 real datasets for each classifier and feature selection method (seven filters and random selection). A total of five classifiers were employed, and the lowest classification errors are marked in boldface. Despite the fact that there are no significant differences among the feature selection methods, it is worth highlighting that Random Forest seems to be the best classifier in this setting.

4.2.1 Microarray datasets

The mismatch between dimensionality and sample size has been seen as a specific issue for machine learning researchers when it comes to DNA microarray classification. Several studies have shown that the majority of genes detected in microarray experiments do not contribute to accurate sample classification (Bolón-Canedo et al., 2014). Feature selection is recommended to avoid the curse of dimensionality by identifying the specific genes that improve classification accuracy.

Figure 5 illustrates the critical difference diagrams for each classification algorithm, based on the same study as for the previous datasets, in order to examine the ranks of the feature selection methods over the 17 DNA microarray datasets. As can be seen, the ideal feature selection strategy varies depending on the classifier. In general, though, we can say that CFS is the best option. Regarding the results obtained by the univariate methods, and unlike with the synthetic and real datasets, the IG method seems to work better than MIM, also achieving results similar to those of other more complex multivariate methods. Among the thresholds used by the ranker algorithms, the one that keeps 5% of the features appears to be the best fit for these high-dimensional datasets.

According to the statistical test findings, random selection gives the worst classification results with the C4.5, NB, 3NN, and Random Forest classifiers, both for the thresholds that retain 5% and 10% of the features and for the logarithm. The results of the SVM reveal a very striking pattern: when the number of features is low (in contrast to the dataset’s initial size), this classifier appears to perform poorly (Miller, 2002). Remember that when the ranker approaches use a threshold to select the top \(log_2(n)\) features, the number of features used to train the model for these datasets is limited to 15 (less than 1% of the original microarray dataset’s features). Figure 6 depicts, using simplex graphs, the distribution of the differences between random selection—with 5% and 10% thresholds—and the ranker methods with the logarithm threshold, just as was done for the real datasets. As can be seen, random selection outperforms the ranker methods that keep the top \(log_2(n)\) features, both on average and in a statistically significant way. These data illustrate once again, and much more clearly in this case, that when utilizing ranker methods an incorrect threshold decision can result in performance comparable to a random selection of features. This is a challenging problem to tackle, because the only way to be sure we are using the right threshold is to try a large number of them and compute the classification performance for each resulting subset of features, which would lead to prohibitive computation times.

Table 7 Classification errors obtained by the five classifiers for the 17 DNA microarray datasets tested

The classification error produced by the five classifiers and the eight feature selection methods over the 17 DNA microarray datasets is shown in Table 7 (the lowest error rates highlighted in bold). As mentioned in Navarro (2011), these results demonstrate the superiority in performance of SVM over other classifiers in this domain.

4.3 Case studies

After presenting the experimental results, and before discussing and analyzing them in detail, we describe several case studies.

4.3.1 Case study I: Dealing with noise in the inputs

There are various scenarios that can obstruct the feature selection process, including the presence of irrelevant and redundant features, attribute interaction, and data noise. Therefore, in this case study we analyze the influence of noise in the inputs for the Led-25 dataset (see Table 2). The LED dataset requires identifying the digit (from 0 to 9) displayed by seven binary features, each corresponding to one LED segment. The Led-25 dataset was created by adding 17 irrelevant features. Different levels of noise in the inputs (10 and 20 percent) were added to make this dataset more challenging. It is worth noting that, because the features are binary, adding noise amounts to flipping the value of the affected relevant features (a minimal sketch of this procedure is shown below).
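
A minimal sketch of how such input noise can be injected into a binary dataset is shown below; the flipping convention and the toy samples are assumptions, not the exact procedure used to build the noisy Led-25 versions.

```python
# Sketch: flip each binary feature value with probability equal to the noise level.
import numpy as np

def add_input_noise(X, noise_level, seed=None):
    """Flip each binary value of X with probability `noise_level`."""
    rng = np.random.default_rng(seed)
    flip_mask = rng.random(X.shape) < noise_level
    return np.where(flip_mask, 1 - X, X)

# Toy binary samples standing in for the seven relevant LED segment features
X = np.array([[1, 1, 1, 0, 1, 1, 1],
              [0, 0, 1, 0, 0, 1, 0],
              [1, 0, 1, 1, 1, 0, 1],
              [1, 0, 1, 1, 0, 1, 1],
              [0, 1, 1, 1, 0, 1, 0]])
X_noisy = add_input_noise(X, noise_level=0.10, seed=0)
```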

Figure 7 depicts the behavior of the feature selection approaches in response to different levels of noise, as measured by the classification error of the SVM and Random Forest classifiers. For the ranker methods, only the thresholds of 10 and 20 percent are shown, since for this dataset the number of features retained by the 20% threshold is the same as that retained by the logarithm. As we would expect, the classification error grows as the level of noise increases, regardless of the feature selection method used. Furthermore, it is interesting to see that, as the noise level in the inputs increases, the difference in classification error between the feature selection methods and random selection is markedly reduced. In fact, when the noise level is 20%, random selection achieves better classification results than several feature selection methods.

This highlights the low robustness of feature selection methods to noise in the inputs: the most resistant methods are ReliefF, mRMR, and JMI, while the subset filters (CFS and INTERACT) and the univariate approach Information Gain are the most affected by noise.

Fig. 7
figure 7

Classification error (%) for LED-25 dataset with different levels of noise (0%, 10% and 20%). Random selection is indicated by a dashed line

4.3.2 Case study II: Influence of discretization

Many feature selection algorithms are designed to handle only discrete data (Bolón-Canedo et al., 2011). To apply these algorithms to numeric features, a common practice is to discretize the data before conducting feature selection. However, the choice of how to group continuous values, the number of intervals to generate, and the positioning of interval cut points on the continuous attribute scale may differ among discretization methods. Among the discretization methods available in the literature, we opted for Equal-width due to its widespread popularity. Equal-width divides the range between the observed minimum \(v_{min}\) and maximum \(v_{max}\) of a feature into b intervals (or bins) of equal width, where b is a user-predefined parameter.
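
A minimal sketch of the Equal-width procedure, under the assumption that bin edges are computed per feature from its observed range, is given below; it is not the Weka/Matlab implementation used in our experiments.

```python
# Sketch of Equal-width discretization: b bins of equal width between v_min and v_max.
import numpy as np

def equal_width_discretize(x, b=5):
    """Map a continuous 1-D feature vector to integer bin indices 0..b-1."""
    v_min, v_max = x.min(), x.max()
    edges = np.linspace(v_min, v_max, b + 1)   # b bins => b + 1 equally spaced edges
    # interior edges act as cut points; the maximum value falls into the last bin
    return np.digitize(x, edges[1:-1])

x = np.array([0.1, 0.25, 0.4, 0.55, 0.7, 0.95])
print(equal_width_discretize(x, b=5))          # -> [0 0 1 2 3 4]
```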

To investigate the impact of discretization, particularly of the Equal-width method, on the feature selection process, we conducted a case study in which we systematically varied the number of bins, exploring the effects of 5, 10, 15, and 20 bins. For this purpose, we chose seven DNA microarray datasets from Table 4, namely 9-tumors, brain-tumor-1, CNS, DLBCL, leukemia-1, SRBCT and TOX-171. Table 8 presents an overview of the classification results obtained. The first observation is that the version using 5 bins yields the most favorable outcomes. Concerning the feature selection methods, ReliefF demonstrates superior performance on average, followed closely by CFS and IG. Among the ranking-based methods, the 5% threshold consistently achieves the lowest errors across the five methods, with ReliefF showing a tie between the 5% and 10% thresholds in this regard.

Table 8 Average of the classification errors obtained by the five classifiers on the 7 microarray datasets for the different feature selection methods and number of bins of the Equal-width discretization method
Table 9 Average of the classification errors obtained by the five classifiers on the 7 microarray datasets and feature selection methods for the different number of bins of the Equal-width discretization method. Lower error rates highlighted in bold

Finally, SVM stands out as the optimal classifier for this particular type of data, as indicated in Table 9. Therefore, we can affirm that, overall, despite the variation in the number of bins and the observed impact of discretization on feature selection, the conclusions drawn in Section 4.2.1 remain consistent.

4.3.3 Case study III: CFS vs the rough set attribute method QuickReduct

Rough Set Theory (Pawlak, 1991; Kopczynski & Grzes, 2022) is a formal mathematical technique that aids in reducing dataset dimensionality by quantifying the information content concerning a specific classification. Within rough set theory, the notion of an attribute reduct is central: it refers to a subset of attributes that, taken together, retain a specific property of the dataset, while each attribute, individually, is essential for this preservation. Thus, in order to analyze another feature selection method that returns a set of features, in this case study we compare the CFS method, explained above, with the QuickReduct method, which belongs to the family of rough set attribute reduction algorithms (a minimal sketch is given below).
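
To make the rough set notions concrete, the sketch below implements the dependency degree \(\gamma_P(D)\) (the fraction of objects whose P-equivalence class is consistent with a single decision) and the greedy QuickReduct loop that it drives. It is an illustrative reimplementation on a toy decision table with discrete attributes, not the Weka package used in the experiments.

```python
# Sketch of rough set dependency and the greedy QuickReduct search it drives.
from collections import defaultdict

def dependency(X, y, attrs):
    """gamma_P(D): fraction of objects whose P-equivalence class is decision-pure."""
    if not attrs:
        return 0.0
    classes = defaultdict(list)
    for i, row in enumerate(X):
        classes[tuple(row[a] for a in attrs)].append(i)
    positive = sum(len(idx) for idx in classes.values()
                   if len({y[i] for i in idx}) == 1)
    return positive / len(X)

def quickreduct(X, y):
    """Forward selection until the reduct reaches the full-set dependency."""
    all_attrs = list(range(len(X[0])))
    target = dependency(X, y, all_attrs)
    reduct, gamma = [], 0.0
    while gamma < target:
        best_attr, best_gamma = None, gamma
        for a in all_attrs:
            if a in reduct:
                continue
            g = dependency(X, y, reduct + [a])
            if g > best_gamma:
                best_attr, best_gamma = a, g
        if best_attr is None:          # no attribute improves the dependency
            break
        reduct.append(best_attr)
        gamma = best_gamma
    return reduct

# Toy decision table: 4 discrete attributes, binary decision
X = [[1, 0, 2, 1], [1, 1, 2, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
y = [1, 0, 0, 1, 1]
print(quickreduct(X, y))    # attribute 1 alone already determines the decision
```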

For the experiments, we selected seven DNA microarray datasets from Table 4. The first part of our analysis presents the classification results obtained by the five previously used classifiers after applying the CFS and QuickReduct feature selection methods, as shown in Table 10. The results demonstrate that in nearly all dataset and classifier scenarios, CFS consistently yields the lowest classification errors, often with a significant difference compared to QuickReduct. A possible reason can be observed in Table 11: QuickReduct selects far fewer features than CFS. The number of features selected by QuickReduct is notably insufficient, given that the microarray datasets used in this case study have between 2308 and 7129 features.

Table 10 Classification errors obtained by the five classifiers for the CFS and QuickReduct feature selection methods and the DNA microarray datasets 9-tumors, brain-tumor-1, CNS, DLBCL, leukemia-1, SRBCT and TOX-171. Lower error rates highlighted in bold
Table 11 Number of features selected by the CFS and QuickReduct methods for the DNA microarray datasets 9-tumors, brain-tumor-1, CNS, DLBCL, leukemia-1, SRBCT and TOX-171
Table 12 Runtimes (in seconds) for the CFS and QuickReduct methods for the DNA microarray datasets 9-tumors, brain-tumor-1, CNS, DLBCL, leukemia-1, SRBCT and TOX-171

However, while CFS exhibits superiority in terms of the achieved classification results, it comes at the expense of a longer execution time, as evidenced in Table 12.

Fig. 8
figure 8

An example of the use of feature selection (ranker methods with the logarithm threshold) for one sample of the class “9” digit. Green dots mark selected features. For the sake of clear visualization, features that correspond to pixels that always lie in the white area are not marked

Fig. 9
figure 9

An example of the use of feature selection (ranker methods with the 10% threshold) for one sample of the class “9” digit. Green dots mark selected features. For the sake of clear visualization, features that correspond to pixels that always lie in the white area are not marked

4.3.4 Case study IV: An illustrative example of feature selection over the MNIST dataset

In this case study, we illustrate the feature selection process over the MNIST dataset (LeCun et al., 1998), using only two easily confusable classes: digits 4 and 9, chosen because of the small differences between them. We thus have 4000 examples per class available. In the original representation, each digit image has \(28\times 28\) gray-level pixels (784 features).

For these experiments, in the case of the ranker methods, we used both the 10% and the logarithm thresholds. Thus, in the following figures we can observe the features selected by the different feature selection methods—marked in green—as well as by random selection, overlaid on the original digit image. As can be seen in Fig. 8, where the logarithm threshold is used for the ranker methods, all feature selection methods select features that capture the part that distinguishes the digit from a 4 (that is, the closed upper loop of the 9). This does not happen in the case of random selection, which fails to define this area: only one of the 10 selected features falls on the representation of the digit. This makes the task of distinguishing digits 4 and 9 really complex. When we select the top 10% of features for the ranking methods—i.e., we are left with 78 features—a greater part of the digit is defined (see Fig. 9). The feature selection methods continue to concentrate features in the area that distinguishes the digit from a 4 (especially ReliefF). Meanwhile, random selection continues to select many features that fall outside the representation of the digit, thus leaving the upper part of the digit 9 undefined. A minimal sketch of this selection process is given below.
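
The following sketch reproduces the spirit of this illustration (it is not our exact pipeline): it keeps only digits 4 and 9, ranks the 784 pixels with a mutual-information score as an MIM-style criterion, and selects the top \(log_2(784) \approx 10\) or the top 10% (78) pixels. The use of scikit-learn's fetch_openml and mutual_info_classif, keeping all available 4/9 examples rather than 4000 per class, and treating raw pixel intensities as discrete values instead of applying 5-bin Equal-width discretization, are simplifying assumptions.

```python
# Sketch: MIM-style pixel selection for the two confusable MNIST classes 4 and 9.
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import mutual_info_classif

# Load MNIST and keep only digits 4 and 9 (the paper uses 4000 examples per class;
# here we simply keep all of them)
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
mask = np.isin(mnist.target, ["4", "9"])
X, y = mnist.data[mask], mnist.target[mask]

# Rank pixels by mutual information with the class, treating raw intensities
# as discrete values (a simplification of the discretization used in the paper)
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

k_log = int(np.ceil(np.log2(X.shape[1])))    # log2(784) -> 10 pixels
k_pct = int(0.10 * X.shape[1])               # 10% of 784 -> 78 pixels
top_log = np.argsort(scores)[::-1][:k_log]
top_pct = np.argsort(scores)[::-1][:k_pct]

# Map selected pixel indices back to (row, col) positions on the 28x28 grid,
# which is how selected features can be overlaid on a digit image as in Figs. 8-9
rows, cols = np.divmod(top_log, 28)
print(list(zip(rows.tolist(), cols.tolist())))
```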

5 Conclusions and future work

The goal of this research is to thoroughly examine the most common approaches in the field of feature selection, make appropriate comparisons, and determine whether some methods are unable to outperform the results obtained by random selection. We tested 62 synthetic and real datasets (including the challenging family of DNA microarray datasets) and found that feature selection is effective in general, and that feature selection approaches are superior to random selection in most circumstances, as expected. In particular, our experiments revealed that:

  • CFS is an excellent choice for any dataset. As a result, when having no knowledge of the specifics of the problem to be solved, we recommend using the CFS method, which has the extra benefit of not requiring the establishment of a threshold. However, if we take into account the computational cost of the feature selection methods used, the univariate filter MIM seems an appropriate choice, since it obtains competitive results compared to other more complex multivariate methods.

  • Regarding the use of different thresholds, 10% seems more appropriate for the synthetic datasets. For real datasets, the 20% criterion for standard datasets (although worse than the subset approaches, which are the winning option for this type of dataset) and the 5% threshold for microarray datasets appear to be more appropriate. Indeed, as our research demonstrates, the threshold selection is crucial when using ranker feature selection methods. For some thresholds, in particular, the outcomes were as poor as if the features had been chosen at random.

  • Despite the fact that the classification results obtained were not significantly different across the feature selection methods used—as discussed in Morán-Fernández et al. (2020)—, we can conclude that Random Forest in the case of synthetic and real datasets, and SVM in the case of microarrays, obtained the best results in terms of classification accuracy in a general way across all datasets used, in line with the conclusions of Fernández-Delgado et al. (2014).

  • With respect to the presence of noise, and as we would have expected, the classification accuracy decreases when the level of noise increases. Besides, the feature selection methods did not prove to be very robust to noise, obtaining classification errors similar to those given by random selection. This highlights the importance of working with quality data.

  • Concerning the impact of discretization on feature selection, and particularly in this study, the choice of 5 bins in the Equal-width method yields the most favorable results.

As previously stated, determining an appropriate threshold for ranker-type approaches is a major issue in feature selection that has yet to be solved. As a result, we plan to test a wider range of thresholds in the future, as well as to establish an automatic threshold for each dataset type. Another interesting line of research would be to develop feature selection methods that are more robust to noise, as well as to test other discretization methods to gain further insights into their potential effects on feature selection.