1 Introduction

Variable Selection (VS), also known as feature selection, is the procedure of selecting a subset of important input variables. The words variable and feature are often used as synonyms, although, strictly speaking, the term “variable” denotes a raw input variable, while “feature” identifies a variable that derives from a pre-processing of the input variables. As far as the VS operation is concerned, the two terms are usually used interchangeably, as the distinction has no influence on the selection algorithms [39].

The interest of researchers and practitioners for VS approaches has increased over the years, due to both increasing diffusion of Machine Learning (ML) techniques and multiplication of data sources and data collection and storage capabilities in any field and discipline. In real world applications, a relatively easier access to large amounts of different data allows the development of more complex and reliable models. However, the multiplication of information sources also increases the difficulty of extracting the most important information conveyed by the collected data. More specifically, when a classification task is faced in relation to a phenomenon or system where poor a-priori knowledge is available, the first problem to address consists of selecting the right input variables for the designed classifier. Selecting an appropriate set of input variables reduces the complexity of the classifier and can also improve the classification accuracy [39]. Nevertheless, the stability of the adopted VS algorithm needs to be taken into careful consideration together with performance and computational efficiency [44].

A VS approach is said to be stable if it leads to the selection of the same (or a similar) subset of variables even when the so-called training dataset, i.e. the set of data on which VS is performed, changes [45]. In several cases of practical interest, especially when dealing with datasets involving many variables and a limited number of samples, traditional VS methods lead to a different solution each time the training subset is modified. This makes the data reduction practically useless for the classifier design, since the selection cannot be replicated when a new dataset becomes available. Moreover, no significant contribution is provided to the achievement of a thorough knowledge of the phenomenon or process under investigation.

Therefore, stability becomes a crucial aspect when the VS goal is twofold, i.e. knowledge extraction from raw data and accurate classification performance, a highly frequent situation in real world applications of ML-based models. A good VS algorithm should not only improve the performance of the model, but also provide stable and reproducible selection results.

The above-mentioned typology of datasets is frequently found in applications where each sample is the outcome of complex, costly and time-consuming analyses, such as, for instance, in the medical field [3] or in material science [74]. Therefore, addressing this issue is of relevant practical importance.

In this paper a fully automatic procedure is proposed, which selects the variables that most affect the target by combining classic VS methods, while ensuring excellent stability without degrading the performance of the implemented model. Notably, the approach discussed here is not a novel VS algorithm, but rather a methodology to ensure the stability of the selection. The novelty of the proposed methodology lies in the combination of existing VS techniques from both the filter and the wrapper categories to extract the most influential variables regardless of the type of initial data, while ensuring that the result does not change when the training dataset is modified.

The main elements and strengths of the proposed concept are:

  • tackling situations characterized by a large number of potential input variables and a small sample size, which are intrinsically unstable, by largely improving VS stability with respect to standard VS approaches;

  • a modular and flexible combination of existing VS techniques, which enables the adoption of a wide variety of VS methods as well as of any modelling algorithm or stability index, and preserves the accuracy of the designed classifier or regression model;

  • an automatic procedure, which does not require a-priori information on the available data and fits well different applications;

  • an affordable computational burden, at least for off-line pre-processing applications.

The paper is an extended version of the paper entitled “A combined approach for enhancing the stability of the variable selection stage in binary classification tasks”, which was presented by the same authors at the International Work-Conference on Artificial Neural Networks (IWANN 2021) [21]. With respect to the previously presented results, multi-class classification and regression tasks are also tackled here, to broaden the applicability range of the proposed approach.

The paper is organized as follows: Sect. 2 provides a literature review regarding VS; Sect. 3 analyses in detail the stability issue; Sect. 4 deals with VS in neural networks applications; in Sect. 5 a description of the proposed algorithm is provided. Some numerical results are presented and discussed in Sect. 6, while Sect. 7 provides some concluding remarks.

2 Background

Within different applications, such as classification [4, 38], clustering [52, 69] and regression [6, 37, 54], VS is considered an important data pre-processing phase. The selection of the most significant variables to be fed as inputs to a ML-based model is crucial to remove variables that are highly correlated to each other (i.e. redundant variables) or irrelevant from an informative point of view [22]. In fact, the presence of redundant and irrelevant variables usually reduces the performance of the implemented model. Moreover, an ideal selection of input variables helps to gain deeper knowledge of the problem, system or phenomenon under consideration.

The selection should consider the following factors:

  • Relevance The set of selected variables should include all or most of the significant information concerning the considered problem.

  • Computational efficiency The number of selected variables should not be too high in order to reduce the computational burden. This element is of utmost importance when dealing with NN-based models: the presence of redundant and irrelevant variables adds noise, increases the number of free parameters and the time required for the network training.

  • Knowledge improvement The ideal selection of variables leads to a deeper understanding of the behavior of the investigated phenomenon.

To sum up, the ideal set of variables should comprise only the most influential variables that are independent from each other, in order to create an accurate, efficient, inexpensive and more easily interpretable model.

In the literature, VS techniques are divided into three main classes: filters, wrappers and embedded approaches [62].

Filters select the best subset of input variables before applying the classifier [20, 64]. The subset is generated by evaluating the association between inputs and outputs and, consequently, the variables are classified on the basis of their relevance to the target through a statistical test [24, 27]. The main advantages of filters are their low computational complexity and their speed. However, being independent from the adopted ML-based model or classifier, they are unable to optimize it. An example of filters is represented by the correlation-based approach, which, firstly, computes the correlation coefficient between each variable and the target. Afterwards, variables are ranked and a subset is selected by including the variables showing the highest values of the correlation coefficient [26]. Other commonly applied filter approaches are the chi-square approach [9] and the Information Gain method [43].

The wrapper VS approach, which was introduced in [47], exploits the performance of the learning machine to extract the subset of variables on the basis of their predictive/classification power. Wrappers consider the model as a black box, and this feature makes them universal, as they can be applied using different kinds of algorithms [25]. An obvious wrapper method is the exhaustive search, also named brute force method, which analyses all the combinations of variables. When the number of input variables is large, this exhaustive approach is not viable. A traditional wrapper method is the Greedy Search strategy [35], which gradually creates the variable subset by adding single variables to, or eliminating them from, an initial set. Greedy search can be applied in two directions: Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) [7]. SFS starts with an empty set of variables, which are iteratively added until a fixed stopping condition on a performance index is met. The accuracy of the learning machine is the standard performance index in classification tasks, but other indexes can also be used, especially in the case of imbalanced datasets [36]. In other words, the search ends when the addition of new variables to the input set does not increase the model performance. Conversely, SBS starts with an input set including all the available variables and removes them one by one. The importance of an input variable is determined by removing it and calculating the performance of the model without that variable among its inputs. The search ends when removing variables from the input set significantly decreases the model performance. SFS is less computationally expensive than SBS, which becomes impracticable when the number of potential input variables is too large [53].
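The SFS loop described above can be sketched as follows. This is only a minimal illustration: the function name `sequential_forward_selection`, the `score` callback (standing for any performance index of the learning machine, e.g. a cross-validated accuracy) and the `min_gain` tolerance are assumptions introduced here and do not come from the cited works.

```python
from typing import Callable, List, Sequence


def sequential_forward_selection(
    n_variables: int,
    score: Callable[[Sequence[int]], float],
    min_gain: float = 0.0,
) -> List[int]:
    """Greedy SFS: add the single best variable until the score stops improving."""
    selected: List[int] = []
    best_score = float("-inf")
    remaining = set(range(n_variables))
    while remaining:
        # Evaluate every candidate subset obtained by adding one more variable.
        trial_scores = {j: score(selected + [j]) for j in remaining}
        # Ties are broken in favour of the lowest variable index.
        best_j = max(sorted(trial_scores), key=trial_scores.get)
        if trial_scores[best_j] <= best_score + min_gain:
            break  # no candidate improves the performance index: stop
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = trial_scores[best_j]
    return selected


def toy_score(subset: Sequence[int]) -> float:
    """Hypothetical index: variables 1 and 3 help, every extra input costs 0.01."""
    useful = {1: 0.30, 3: 0.20}
    return 0.5 + sum(useful.get(j, 0.0) for j in subset) - 0.01 * len(subset)


print(sequential_forward_selection(5, toy_score))  # [1, 3]
```

SBS follows the same pattern in the opposite direction: it starts from the full set and removes, at each step, the variable whose elimination penalizes the score the least.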

Embedded methods perform the variable selection as part of the learning stage and are generally specific to a particular learning machine [33]. Common examples of embedded approaches are Decision Trees (DT) and approaches based on regularization techniques [10]. The main advantage of embedded methods lies in their association with the learning algorithm. Embedded approaches also exploit all the variables to generate a model and then evaluate it to establish the importance of the variables [2].

To sum up, filter VS methods are suitable for very high dimensional datasets, as they are computationally simple, fast and independent from the adopted algorithm. Wrapper methods exploit the learning algorithm as a black box, assessing its performance for VS, but are prone to overfitting and computationally cumbersome. Embedded methods show a lower computational cost with respect to wrappers, but they are specific to a given model or classifier [59].

Recently, several hybrid variable selection approaches have been proposed to jointly exploit the advantages of several VS approaches by overcoming their drawbacks [16, 18].

3 The Stability Issue

Stability is a broad and very general concept, which touches different domains of systems analysis, modelling and control [42, 48, 67, 71], including various classes of neural networks [11, 50], with many works focused on large complex dynamical networks [32, 70]. In the context of VS, the stability concept was introduced in [65], and is defined as the sensitivity of a VS algorithm to variations in the training dataset. Ignoring the stability issue of a VS algorithm can lead to wrong conclusions and to an unreliable design of a ML-based model or classifier [14, 66, 72]. Several papers discuss the fact that using different training sets can lead to the selection of different variable subsets even when applying the same VS algorithm [13, 49]. The investigation of the stability of VS approaches is related to the need to provide users with a quantitative confirmation that VS is reliable and sufficiently robust with respect to variations in the training data [29]. This requirement is particularly important in real-world applications, where an improved knowledge of the phenomenon under investigation is a further significant outcome of the VS. For instance, in monitoring applications [57, 68, 73], the identification of the most relevant input variables highlights the factors to monitor so as to identify and prevent faults and anomalies early, but such identification needs to be definitive when it precedes the setup of the hardware for a monitoring system.

A further interpretation of stability in probabilistic terms can be provided by considering the outcome of VS as a stochastic process. Some of the approaches described in Sect. 2 inherently hold some elements of stochasticity, as they include ML procedures whose outcome might also depend, for instance, on the initial values of internal parameters, which are often randomly selected, as well as on the data used for the training. Moreover, in practical applications, noise and disturbances, which can often be modeled as stochastic processes, affect the measurements of all potential variables. All these factors make it possible that the same VS procedure applied to different portions of the available experimental data leads to the identification of different subsets of relevant variables. Let us define as \(P_i\) the probability that the variable \(x_i\) is selected by a given VS procedure applied on a given dataset \(\mathcal {D}\). In general, the more relevant \(x_i\) is with respect to the target, the higher \(P_i\), but \(P_i\) can also be affected by the above-mentioned noise in the data as well as by the correlation among the potential input variables. Therefore, \(P_i\) depends both on \({{\mathcal {D}}}\) and on the subset \({{\mathcal {S}}} \subset \mathcal {D}\) of the available data on which VS is performed, i.e. \(P_i=P_i \left( \mathcal{D}, \mathcal{S} \right) \). For instance, if two potential input variables are exactly linearly correlated with each other, their individual relevance with respect to the target is identical. A correlation-based filter approach thus identifies both of them, and the removal of one of them is done according to a further criterion, which might not be deterministic, while a sequential wrapper approach, like SBS or SFS, eliminates one of them, but which one depends only on the order in which they are considered.
Roughly speaking, while the issue of selecting an appropriate VS procedure translates into maximizing \(P_i\) for the variables \(x_i\) that actually affect the target and minimizing it for the other ones, the issue of ensuring VS stability translates into making \(P_i\) as insensitive as possible to the selection of \(\mathcal {S}\). For an effective model design, both issues are relevant, but an inappropriate treatment of the first one might seriously compromise the second one, with detrimental effects on the overall results.
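One possible way to make \(P_i\) operational, offered here only as an illustrative assumption and not as part of the proposed method, is to estimate it empirically: the VS procedure is repeated on K randomly drawn subsets \(\mathcal {S}_1, \ldots , \mathcal {S}_K\) of \(\mathcal {D}\), and the frequency with which \(x_i\) is selected is recorded. The symbols K, \(\mathcal {S}_k\), \(\mathcal {V}(\mathcal {S}_k)\) (the set of variables selected on \(\mathcal {S}_k\)) and \(\hat{P}_i\) are introduced only for this sketch:

$$\begin{aligned} \hat{P}_i = \frac{1}{K} \sum _{k=1}^{K} \mathbb {1}\left[ x_i \in \mathcal {V}(\mathcal {S}_k) \right] \end{aligned}$$

Under this reading, a stable procedure tends to yield values of \(\hat{P}_i\) close to 0 or to 1 for every variable, whereas intermediate values flag variables whose selection depends on the particular subset \(\mathcal {S}_k\).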

The VS output can be represented in different forms: a weighted score for each variable, a ranking, or a subset of significant variables, namely a binary vector b, where a unitary entry corresponds to the presence of the associated feature in the subset [46]. Accordingly, stability measures can be classified into three categories: stability by weight, stability by rank and stability by subset. According to the first definition, the similarity between two weighting vectors is computed through Pearson’s correlation coefficient. Similarly, to quantify the similarity between two variable rankings, Spearman’s rank correlation coefficient [63], i.e. the Pearson correlation coefficient between the ranked variables, is evaluated. These two indices lie in the range [-1, 1], where the unitary values mean full direct (1) or inverse (-1) correlation, while the null value corresponds to the absence of correlation. Finally, the similarity between two binary vectors \(b_1\) and \(b_2\) can be estimated employing the Tanimoto distance [30], which is defined as:

$$\begin{aligned} S_B=\frac{\vert b_1 \cdot b_2\vert }{\Vert b_1\Vert + \Vert b_2\Vert - \vert b_1 \cdot b_2\vert } \end{aligned}$$
(1)

where \(\Vert \Vert \) represents the norm of a vector (for a binary vector, the number of its unitary entries), \(\cdot \) the scalar product between two vectors, and \(\vert \vert \) the absolute value of a scalar. The Tanimoto index lies in the range [0, 1], where the unitary value corresponds to two identical vectors, and the null value identifies two completely different vectors.
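As a worked instance of Eq. (1), the following sketch evaluates the index for two binary selection vectors. The function name `tanimoto_index` is hypothetical, and two choices are assumptions of this example: the norm term is taken as the count of selected variables, and two empty selections are treated as identical.

```python
from typing import Sequence


def tanimoto_index(b1: Sequence[int], b2: Sequence[int]) -> float:
    """Tanimoto index of two binary (0/1) selection vectors, following Eq. (1)."""
    if len(b1) != len(b2):
        raise ValueError("vectors must have the same length")
    # Scalar product: number of variables selected in both vectors.
    dot = sum(u * v for u, v in zip(b1, b2))
    # Assumption: for 0/1 vectors the norm term counts the selected variables.
    n1, n2 = sum(b1), sum(b2)
    denom = n1 + n2 - dot
    # Convention chosen here: two empty selections count as identical.
    return 1.0 if denom == 0 else dot / denom


b1 = [1, 1, 0, 1, 0]
b2 = [1, 0, 0, 1, 1]
print(tanimoto_index(b1, b2))  # 2 shared variables / 4 in the union = 0.5
```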

4 The Importance of the Variable Selection Method in Neural Network Applications

NNs are extensively used in real-world applications when highly complex and non-linear tasks are considered. As described in the previous paragraph, the set of variables to be fed as input to a model should include only the most significant variables with a small correlation among each other. In addition, a good selection is recommended, as any unnecessary input worsens the robustness of the system, adds noise and increases computational effort.

For standard parametric mathematical models, the complexity of the VS task can be moderated by the a priori hypothesis on the functional form of the model, also exploiting a physical interpretation of the system or phenomenon under investigation. Conversely, for NNs the generic and highly non-linear nature of the model makes the use of standard VS methods difficult. This complexity increases if the VS procedure is performed in the learning phase. Most NN paradigms cannot efficiently identify redundant and irrelevant variables during the training phase. Moreover, several NN users tend to build the network using all available variables, thinking that their redundancy will lead to a more robust model. On the other hand, skilled NN modelers understand the importance of VS techniques as a preliminary phase of network design. Moreover, the selection of the most suitable VS method is also important: methods that are suitable for linear regression can be inappropriate for a highly non-linear neural model. Wrappers and embedded approaches are usually adopted when the number of training samples is low with respect to the number of input variables. The search strategy should represent a trade-off between the number of considered solutions and the sustainable computational burden, ensuring that the approach selects the most informative but still limited subset of variables. The traditional SBS wrapper method is efficient when training NNs of small dimensions. SFS performs an efficient search when it embeds the redundancy check in the statistical analysis of the variables. Exhaustive search (considering all combinations of variables) is impracticable in most cases, whereas evolutionary search approaches can reach a reasonable compromise by covering an appropriate number of combinations of input variables.
Filter approaches, being fast and model-free, are suitable for applications where independence from a specific NN architecture or computational efficiency is needed. However, simple ranking procedures can detect more variables than needed, as they do not consider the redundancy of variables, and are unsuitable for multivariate NN regression tasks [12, 15].

The approach proposed here is a hybrid method capable of both selecting the most significant variables without degrading the model performance and ensuring the stability of the solutions, thus also leading to a more consolidated knowledge acquisition.

5 The Proposed Method

The proposed algorithm, named Stability Enhanced HYbrid VAriable Selection (STEHYVAS), aims at implementing a stable and efficient VS procedure for ML-based classification and regression tasks and is also applicable to datasets characterized by a relatively low number of samples compared to the number of potential input variables. STEHYVAS combines VS approaches belonging to two of the three previously described categories, i.e. filter and wrapper methods.

For the sake of simplicity, in the following an arbitrary dataset is represented as a matrix, where rows correspond to samples and columns to measured variables for each sample.

A pseudo–code that describes how STEHYVAS works is provided in Algorithm  1. The following description of the method refers to this listing.

Algorithm 1 Pseudo-code of the STEHYVAS procedure

Firstly, the dataset is shuffled and divided into two portions (line 6): 25% of the available data are used as test set to validate the classifier, while the remaining 75% are used to extract the final variable subset through the proposed procedure. The dataset also needs to be preliminarily cleansed (line 7), by eliminating all the rows where at least one variable is missing, to produce a complete dataset [17, 19].
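The splitting and cleansing steps can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names `shuffle_split`, `is_missing` and `drop_incomplete`, the fixed random seed, and the encoding of missing measurements as `None` or NaN are all assumptions of this example.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]


def shuffle_split(rows: Sequence[Row], test_fraction: float = 0.25,
                  seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Shuffle the samples and hold out a fraction of them as test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


def is_missing(value: Optional[float]) -> bool:
    """Assumption: a missing measurement is encoded as None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(rows: Sequence[Row]) -> List[Row]:
    """Keep only the samples (rows) in which every variable is observed."""
    return [row for row in rows if not any(is_missing(v) for v in row)]


# Toy dataset: 8 samples, 2 variables, one missing measurement.
data = [[float(i), float(2 * i)] for i in range(8)]
data[3][1] = None
train, test = shuffle_split(data)   # 75% / 25% split
train = drop_incomplete(train)      # cleansing of the training portion
print(len(train), len(test))
```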

Subsequently, a redundancy analysis is performed to eliminate redundant variables (line 8), which also improves the interpretability of the VS outcome. Redundant variables are detected by applying a method based on the Dominating Set Algorithm (DSA), which derives from graph theory [1, 60]. In this context, two arbitrary variables are considered redundant if their linear correlation coefficient \(\rho > 0.95\) and the associated significance value p-value\(<0.05\), where the definitions of \(\rho \) and p-value derive from classical statistical correlation theory introduced by Pearson at the end of the 19th century [56]. The variables are the vertices of a graph, in which an edge connects each pair of redundant variables. DSA extracts the minimal dominating set of this graph, i.e. the dominating set with the lowest number of vertices: variables of the graph that do not fall inside the minimal dominating set are considered redundant.
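The redundancy analysis can be illustrated with the sketch below, which departs from the description above in two labeled ways: the significance test on the p-value is omitted, and the minimal dominating set is replaced by a greedy approximation (finding a true minimum dominating set is NP-hard in general). The names `pearson`, `redundancy_graph` and `greedy_dominating_set` are hypothetical, and non-constant columns are assumed.

```python
import math
from typing import Dict, List, Sequence, Set


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (columns assumed non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def redundancy_graph(columns: Sequence[Sequence[float]],
                     rho_min: float = 0.95) -> Dict[int, Set[int]]:
    """Vertices are variables; an edge joins two variables with rho > rho_min.
    Simplification: the p-value condition of the text is not checked here."""
    graph: Dict[int, Set[int]] = {j: set() for j in range(len(columns))}
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if pearson(columns[i], columns[j]) > rho_min:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def greedy_dominating_set(graph: Dict[int, Set[int]]) -> List[int]:
    """Greedy approximation: pick the vertex dominating most uncovered vertices."""
    uncovered = set(graph)
    dominating: List[int] = []
    while uncovered:
        best = max(sorted(graph), key=lambda v: len(({v} | graph[v]) & uncovered))
        dominating.append(best)
        uncovered -= {best} | graph[best]
    return sorted(dominating)


x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [2 * v + 1 for v in x0]            # exact linear copy of x0
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]     # noisy copy of x0
x3 = [2.0, 9.0, 1.0, 7.0, 3.0, 5.0]     # unrelated variable
kept = greedy_dominating_set(redundancy_graph([x0, x1, x2, x3]))
print(kept)  # [0, 3]: variables 1 and 2 are flagged as redundant
```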

After the elimination of redundant variables, two VS techniques belonging to the filter category are separately applied. The two VS procedures are applied \(N_F\) times each and, at each iteration, the intersection of the two sets of most significant variables is recorded. The variables that are included in such intersection with a frequency greater than or equal to 80% are selected (lines 10–15).

The reduced dataset (line 16) is then subjected to \(N_W\) different iterations of the two well-known wrapper VS methods: SFS and SBS (lines 18–23). The union of the two solutions is stored at each iteration and the variables which are included in such union with a frequency greater than or equal to 60% are selected. This last subset of variables is the candidate to be the final winner variable subset.

The two frequency values used as thresholds (80% and 60%) have been set considering that filter methods are intrinsically stable, so a higher threshold can be applied to them in order to discard those variables that have fallen into the subset only accidentally, for a certain training set. Furthermore, the filter-based reduction is performed prior to the wrappers because filters are computationally faster; this preliminary reduction of the number of variables thus speeds up the subsequent steps.
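The frequency-based aggregation used in the two stages can be sketched as follows. The function names `frequent_variables`, `filter_stage` and `wrapper_stage`, as well as the toy selections, are hypothetical; only the intersection/union rule and the 80% and 60% thresholds come from the description above.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set, Tuple

Run = Tuple[Set[int], Set[int]]  # the two subsets produced in one iteration


def frequent_variables(selections: Iterable[Set[int]], threshold: float) -> List[int]:
    """Variables occurring in at least a fraction `threshold` of the recorded sets."""
    recorded = list(selections)
    counts = Counter(v for s in recorded for v in s)
    return sorted(v for v, c in counts.items() if c / len(recorded) >= threshold)


def filter_stage(runs: Sequence[Run], threshold: float = 0.8) -> List[int]:
    # At each iteration the INTERSECTION of the two filter outcomes is recorded.
    return frequent_variables((a & b for a, b in runs), threshold)


def wrapper_stage(runs: Sequence[Run], threshold: float = 0.6) -> List[int]:
    # At each iteration the UNION of the SFS and SBS outcomes is recorded.
    return frequent_variables((a | b for a, b in runs), threshold)


# Hypothetical outcomes of 5 iterations (the experiments use 10).
filter_runs = [({0, 1, 2}, {0, 1, 2, 3}), ({0, 1, 2}, {0, 1, 2}),
               ({0, 1, 2}, {0, 1, 4}), ({0, 2, 3}, {0, 1, 2}),
               ({0, 1, 2, 3}, {0, 1, 2})]
wrapper_runs = [({0}, {0, 1}), ({0}, {0, 2}), ({1}, {0}),
                ({0}, {0, 1}), ({0, 2}, {0})]
print(filter_stage(filter_runs))    # [0, 1, 2]
print(wrapper_stage(wrapper_runs))  # [0, 1]
```

Recording the intersection in the filter stage is the more conservative choice, while the union in the wrapper stage keeps any variable that either search direction finds useful; the two thresholds then remove the occasional selections.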

Finally, both the average accuracy of the model and the stability of the approach can be assessed. Similarly to the training dataset, also the test set undergoes a cleansing stage, but such cleansing is done considering only the reduced subset of input variables. The overall procedure is schematically represented in Fig. 1.

Fig. 1 Generic scheme of the proposed method

STEHYVAS is generic and can be executed on any kind of binary or multi-class classifier as well as on a regression model. In addition, the classic VS methods that are used here could be replaced with other algorithms belonging to the same category. Finally, STEHYVAS shows the great advantages of not requiring any a priori information on the available database, being completely automatic and modular and, therefore, adaptable to various problems.

6 Experimental Tests

6.1 Datasets Used in the Experiments

In order to assess the validity of the proposed VS approach, STEHYVAS has been applied to binary classification, multi-class classification and regression tasks, related to different datasets extracted from the widely used UCI Machine Learning Repository [8]. Only one dataset, named Baseball, does not come from the UCI repository, but was downloaded from CNN/Sports. The datasets exploited in the tests are briefly described in the following.

  • Breast Cancer Wisconsin (BCW): the BCW database comes from the Hospital of the University of Wisconsin. Data refer to patients affected by tumours. The binary classification target refers to the benign or malignant nature of the tumour.

  • Cervical (Cerv): The dataset contains several attributes on observed patients, who are classified according to the presence/absence of cervical cancer.

  • Heart Failure clinical records (HF): This dataset includes the medical records of patients who had heart failure, which were collected during their follow-up period.

  • Heart: This dataset is generated to identify the presence or absence of heart disease.

  • Pima Indians Diabetes (PID): This dataset is provided by the National Institute of Diabetes and Digestive and Kidney Diseases and concerns female patients coming from Arizona, who are at least 21 years old. The binary classification target indicates whether the patient tests positive or negative for diabetes.

  • Dermatology (Derm.): This dataset is related to the determination of the type of Erythemato-Squamous Disease. The differential diagnosis of such diseases is a real problem in dermatology: they all share the clinical features of erythema and peeling, with very few differences. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Generally, a biopsy is needed for the diagnosis but, unfortunately, these diseases share many histo-pathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of a similar disease in an initial stage, and its own typical features later on.

  • NewThyroid (Thyroid): This dataset refers to features of patients who are found to suffer or not from hyperthyroidism or hypothyroidism.

  • Wine: This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three considered types of wine.

  • Iris: This is one of the most widely known datasets of the literature on pattern recognition. It includes data from 3 classes, each one represented by 50 samples and referring to a type of iris plant. One class is linearly separable from the other 2; the latter ones are not linearly separable from each other.

  • Facebook metrics: This dataset was collected within a study on the performance metrics of the Facebook page of a renowned cosmetics brand. The input variables include category, page total likes, type, month, hour, weekday and paid, and the target is the measure of the impact of the post in terms of total interactions (sum of likes, comments and shares of the post).

  • Computer Hardware: This dataset is used to forecast the relative CPU performance of a PC on the basis of its features, such as, for instance, model, CPU frequency and memory size.

  • Breast Tissue: This dataset includes electrical impedance measurements of freshly excised tissue samples from the breast. Such measurements constitute the impedance spectrum from where the breast tissue features are computed.

  • Baseball: This dataset contains the 1992 salaries of the Major League Baseball players who played at least one game in both the 1991 and 1992 seasons, excluding pitchers. For each player, some performance measures are given along with four categorical variables that indicate how free each player was to move to other teams. The variable which indicates the salaries is the target to estimate.

The sizes of each of the above-listed databases, namely the total number of samples \(N_{TOT}\) and of variables \(N_{V}\) considered in each dataset, are summarised in Table 1.

Table 1 Overview of the characteristics of the adopted datasets

6.2 Exploited Filter Approaches

As stated in Sect. 5, two filter VS approaches are required in the first stage of STEHYVAS to develop a preliminary selection of potentially relevant input variables.

The first, very simple VS approach is the correlation-based one, which is generic and applicable to any task. In this approach, for each input variable \( x_j\) (\( 1 \le j \le N_V \)), Pearson’s correlation coefficient with respect to the considered target is evaluated. Afterwards, all the variables are ranked considering this value and the mean value of all coefficients is computed. Finally, the variables showing a coefficient value greater than the overall mean value are selected. A further quite simple test, which can however be applied only to binary classification tasks, is based on the evaluation of the so-called Wilcoxon index. This filter algorithm aims at ranking the variables of a given dataset by assessing their significance for separating the two labelled groups of data. Once the variables have been ranked according to their relevance, those that belong to the first half of the ranking are selected. As a criterion, the absolute value of the standardized u-statistic of a two-sample unpaired Wilcoxon test, also known as Mann-Whitney test [61], is adopted. The unpaired two-sample Wilcoxon test is a non-parametric alternative to the unpaired two-sample t-test, which can also be used when data are not normally distributed.
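The two filters can be sketched as follows for a binary target. The code is only an illustration under stated assumptions: the correlation coefficients are compared in absolute value, the u-statistic is standardized through the normal approximation without tie correction, and for an odd number of variables the "first half" is rounded down. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_filter(columns, target) -> List[int]:
    """Keep variables whose |correlation| with the target exceeds the mean one."""
    scores = [abs(pearson(col, target)) for col in columns]  # assumption: abs value
    mean_score = sum(scores) / len(scores)
    return [j for j, s in enumerate(scores) if s > mean_score]


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def standardized_u(x: Sequence[float], labels: Sequence[int]) -> float:
    """|z| of the Mann-Whitney U statistic (normal approximation, no tie correction)."""
    ranks = average_ranks(x)
    n1 = sum(1 for lab in labels if lab == 1)
    n0 = len(labels) - n1
    u1 = sum(r for r, lab in zip(ranks, labels) if lab == 1) - n1 * (n1 + 1) / 2
    return abs(u1 - n0 * n1 / 2) / math.sqrt(n0 * n1 * (n0 + n1 + 1) / 12)


def wilcoxon_filter(columns, labels) -> List[int]:
    """Rank variables by |z| and keep the first half of the ranking."""
    scores = [standardized_u(col, labels) for col in columns]
    ranking = sorted(range(len(columns)), key=lambda j: -scores[j])
    return sorted(ranking[: len(columns) // 2])


labels = [0, 0, 0, 1, 1, 1]
columns = [[1, 2, 3, 4, 5, 6], [1, 4, 2, 5, 3, 6],
           [3, 1, 5, 2, 6, 4], [2, 5, 3, 4, 1, 6]]
print(correlation_filter(columns, labels))  # [0, 1]
print(wilcoxon_filter(columns, labels))     # [0, 1]
```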

The Wilcoxon index-based filter method cannot be applied to multi-class classification and regression tasks. Therefore, in these cases two algorithms belonging to the Relief family are exploited in STEHYVAS as second filter approach: ReliefF is used for multi-class classification and RReliefF is used for regression [58]. The main idea behind all Relief algorithms is to determine the importance of a variable according to its capability of distinguishing between data samples \( \mathbf{x} \in {\mathbb {R}}^{N_V} \) that are near to each other in the problem domain. To this aim, the variable importance is ranked by associating a weight \(w_j\) to the j-th variable \( x_j\) for all \( 1 \le j \le N_V \). Such weights are iteratively computed. Finally, the variables belonging to the upper half of the obtained ranking are selected. A detailed description of the ReliefF and RReliefF methods, which are far less known than the other methods employed in this work, is provided in Appendix A together with some additional details about their set-up within this work.

6.3 Results

In the performed experiments, the iteration parameters are set to the same value, namely \(N_F\)=\(N_W\)=10. Moreover, 10 independent runs of the proposed procedure have been performed on each dataset in order to calculate the stability of the solutions. The stability is measured through the average Tanimoto distance. This distance is compared to the one calculated by running SFS and SBS as the only VS algorithms (also computed on 10 iterations) to demonstrate that they can encounter instability issues and to quantify the improvement obtained by STEHYVAS. Furthermore, only the variables selected with each of the three methods (SFS, SBS, STEHYVAS) are retained in the test set, which is then used to validate the adopted machine learning algorithm.
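The stability assessment over repeated runs can be sketched as below. Averaging over all pairs of runs is an assumption of this example, since the text only states that the average Tanimoto distance is used; the function name `average_pairwise_tanimoto` and the toy subsets are hypothetical. The subsets are represented as sets of variable indices, for which the index of Eq. (1) reduces to the ratio between the sizes of the intersection and of the union.

```python
from itertools import combinations
from typing import Sequence, Set


def average_pairwise_tanimoto(subsets: Sequence[Set[int]]) -> float:
    """Mean Tanimoto index over all pairs of selected subsets (needs >= 2 runs)."""
    def tanimoto(a: Set[int], b: Set[int]) -> float:
        union = a | b
        # Convention chosen here: two empty selections count as identical.
        return 1.0 if not union else len(a & b) / len(union)

    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("at least two runs are required")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outcomes of three independent runs (the experiments use 10).
runs = [{0, 1, 2}, {0, 1, 2}, {0, 1, 3}]
print(average_pairwise_tanimoto(runs))  # (1.0 + 0.5 + 0.5) / 3
```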

In our tests the binary classifiers used are a Bayesian classifier [40, 51] and a Support Vector Machine (SVM)-based classifier [23]. Regarding the multi-class target, we implemented the Error-Correcting Output Codes (ECOC) [5, 28, 31], a technique that allows a multi-class classification problem to be reframed as multiple binary classification problems, allowing the use of native binary classification models to be used directly; in our case we have used the SVM-based classification. The second classifier used in the multi-class tasks is a Neural Networks (NN)-based classifier [34]. A feed-forward fully connected NN has been trained. The final fully connected layer and the subsequent softmax activation function produce the network’s output and predicted labels. The employed activation function is the Rectified Linear Unit (ReLU) [41]. A limited-memory Broyden-Flecter-Goldfarb-Shanno quasi-Newton algorithm (LBFGS) [55] is used with its loss function minimization technique, where the cross-entropy loss is minimized.

For the regression tasks, NNs and SVMs have been used. In particular, a two-layer feedforward NN with fully connected layers is adopted. In this network the hidden layer is formed by 10 neurons that use a ReLU activation function. The last fully connected layer holds one output corresponding to the estimated target. The adopted SVM models employ linear kernel functions and are trained through the well-known Sequential Minimal Optimization algorithm. The purpose of a first set of tests is to demonstrate the efficiency of STEHYVAS in improving the stability of the VS stage independently of the adopted ML algorithm and without negatively affecting the performance of the model itself. Therefore, the focus is not on the selection of the most suitable model.

Moreover, in order to further assess the validity of STEHYVAS, a comparison is proposed with a method recently proposed in [3], where an ensemble approach based on the bagging technique is applied to improve feature selection stability via data variance reduction. This approach, named Bagging Feature Selection (hereafter BaggingFS for the sake of brevity), first applies cross-validation k times, followed by a bootstrap sampling of each training set. Afterwards, a VS algorithm is applied on each fold to select the variables and, finally, stability is evaluated. The method is repeated for each cardinality of the selected variable set. BaggingFS has been developed for classification tasks, and specifically targets datasets with a limited number of samples and a relatively large number of potential input variables, as STEHYVAS does. BaggingFS is considered an adequate term of comparison mostly because, analogously to STEHYVAS, it is not a brand new VS algorithm but aims at improving the stability of existing VS methods, it does not depend on the adopted classifier, and it does not introduce a new metric to quantify stability. In the tests presented here, the VS approach adopted within BaggingFS is ReliefF for classification and RReliefF for regression, and \(k=10\).
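The resampling scheme described for BaggingFS can be sketched in Python as follows. This is only an interpretation of the description given here, not the implementation of [3]: the k applications of cross-validation are read as a k-fold split, the function names are hypothetical, and `mean_gap_filter` is a toy filter that merely stands in for ReliefF.

```python
import random


def mean_gap_filter(X, y, n_selected):
    """Toy filter (hypothetical stand-in for ReliefF): rank variables by the gap between class means."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        means = [sum(r[j] for r, c in zip(X, y) if c == cls) / y.count(cls) for cls in classes]
        scores.append(max(means) - min(means))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:n_selected]


def bagging_fs_subsets(X, y, k, n_selected, select_variables, seed=0):
    """Select one variable subset per fold, each on a bootstrap sample of that fold's training set."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    subsets = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Bootstrap: resample the training indices with replacement.
        boot = [rng.choice(train) for _ in train]
        X_boot = [X[i] for i in boot]
        y_boot = [y[i] for i in boot]
        subsets.append(frozenset(select_variables(X_boot, y_boot, n_selected)))
    return subsets
```

The returned list holds one subset per fold and can be passed to any stability measure; repeating the call for each value of `n_selected` mirrors the repetition over the cardinality of the selected variable set.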

The performance of the classifiers is evaluated through the Overall Accuracy, which is calculated as the ratio between the number of correctly classified test samples and the total number of test samples. The performance on the regression problems is evaluated by computing the Normalised Root Mean Square Error (NRMSE) (Eq. 2) as follows:

$$\begin{aligned} NRMSE=\frac{1}{N_{TOT}} \sum _{i=1}^{N_{TOT}} \sqrt{\frac{({\hat{y}}_i-y_i)^2}{\sigma _y}} \end{aligned}$$
(2)

where \({\hat{y}}_i\) is the estimated value of \(y_i\), and \(\sigma _y\) is the standard deviation of the target.
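Both performance metrics can be computed with a few lines of Python. The sketch below follows Eq. 2 literally; it assumes that \(N_{TOT}\) is the number of evaluated samples and that \(\sigma _y\) is the population standard deviation of the true target values, neither of which is stated explicitly here, and the function names are hypothetical.

```python
import math
import statistics


def overall_accuracy(y_true, y_pred):
    """Ratio between correctly classified samples and the total number of samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def nrmse(y_true, y_pred):
    """Eq. 2 as written: mean over the samples of sqrt((y_hat_i - y_i)^2 / sigma_y)."""
    # Assumption: sigma_y is the population standard deviation of the true targets.
    sigma_y = statistics.pstdev(y_true)
    terms = [math.sqrt((p - t) ** 2 / sigma_y) for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)
```

For instance, with true targets 0, 8, 0, 8 (standard deviation 4) and estimates 4, 8, 0, 4, the four terms of the sum are 2, 0, 0 and 2, which gives an NRMSE of 1.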

The obtained results are shown in Table 2 for the Bayesian and SVM-based binary classifiers, in Table 3 for the ECOC and NN-based multiclass classifiers and, finally, in Table 4 for the SVM-based and NN-based regressors. In particular, the first column refers to the type of classifier or regressor adopted for the tests, the second column indicates the processed dataset, columns 3-7 report the mean accuracy obtained on the test set using all the initial variables and the variables selected by SFS, SBS, STEHYVAS and BaggingFS, respectively, while the last four columns show the Tanimoto stability measure for the four VS approaches.

Table 2 Results on the tested datasets for binary classification tasks
Table 3 Results on the tested datasets for multiclass classification tasks
Table 4 Results on the tested datasets for regression tasks

Finally, the times required to execute the VS algorithms have been measured and are shown in Tables 5, 6 and 7, where average computation times and their standard deviations (in brackets) over 10 runs are reported. The time is expressed in seconds and, concerning the binary classifiers, it is measured on a notebook PC with an Intel Core i9 CPU at 2.9 GHz, a 512 GB SSD and 16 GB of RAM (see Table 5). On the other hand, concerning the multiclass classifiers and the regressors, the time is measured on a workstation with an 8-core AMD Ryzen 7 2700X CPU at 3.70 GHz, a 512 GB SSD and 64 GB of RAM (see Tables 6 and 7).
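A timing protocol of this kind can be reproduced with the Python standard library, as in the following sketch. The function name is hypothetical and the use of the sample standard deviation is an assumption, since the estimator adopted for the tables is not specified.

```python
import statistics
import time


def time_runs(procedure, n_runs=10):
    """Run a callable n_runs times and return (mean, standard deviation) of the durations in seconds."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        procedure()
        durations.append(time.perf_counter() - start)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(durations), statistics.stdev(durations)
```

Here `procedure` would be a zero-argument callable wrapping one complete execution of the VS algorithm on a given dataset.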

Table 5 Average execution time (in seconds) on the tested datasets with implemented ML algorithms for binary classification tasks. Standard deviation is also reported in brackets
Table 6 Average execution time (in seconds) on the tested datasets with implemented ML algorithms in multiclass classification task. Standard deviation is also reported in brackets
Table 7 Average execution time (in seconds) on the tested datasets with implemented ML algorithms in regression task. Standard deviation is also reported in brackets

In general, according to the results shown in the tables, and especially in comparison with single VS approaches, STEHYVAS appreciably improves the VS stability and reduces the computational burden related to model training, while keeping the accuracy comparable to that achieved by the other VS methods.

In more detail, within binary classification tasks, the improvement in terms of stability obtained by STEHYVAS (average 0.92) is marked with respect to the other commonly used VS approaches, where the best competitor, namely SBS, achieves an average stability close to 0.6. In this case, the models using the variables selected by STEHYVAS increase the predictive accuracy as well (+5% compared to SBS). On the other hand, BaggingFS achieves stability values similar to STEHYVAS (average 0.94) and comparable predictive accuracy of the trained classifiers. As far as the model training time is concerned, STEHYVAS outperforms SFS, while there is no clear advantage with respect to SBS. This behavior is related to the tendency of STEHYVAS and SBS to select a smaller number of variables compared to SFS. On the other hand, with respect to BaggingFS, STEHYVAS is much faster, thanks to its first stage based on redundancy analysis and the combination of two filter VS approaches, which limits the number of variables undergoing the following wrapper VS stage, while BaggingFS performs a number of iterations equal to the cardinality of the selected variable set.

The proposed approach also improves stability when handling multi-class classification tasks. Again, STEHYVAS achieves an overall stability value of 0.92, SFS and SBS reach stability values of 0.63 and 0.62, respectively, while BaggingFS achieves an overall stability value of 0.91. The accuracy obtained by the different VS methods is comparable for all datasets apart from the Wine one, on which STEHYVAS markedly outperforms all the other approaches, while BaggingFS shows far worse performance than all the other approaches. In terms of computational time, STEHYVAS drastically outperforms BaggingFS on all the datasets apart from the Wine one, for which the computation time is similar, and is also faster than SBS, while it shows training time values similar to SFS.

Finally, also for regression tasks STEHYVAS appreciably improves selection stability (average 0.91) with respect to both SFS (average ) and SBS (average 0.64), while it is slightly outperformed by BaggingFS, which shows an average stability value of 0.97. However, in this case BaggingFS shows drastically worse performance in terms of NRMSE than all the other methods, among which STEHYVAS slightly outperforms both SFS and SBS. In other words, BaggingFS almost always selects the same variable subset, but this subset does not seem to be the optimal one. Moreover, STEHYVAS is faster than all the other methods in most cases, being outperformed by BaggingFS only for the SVM-based model applied to the Facebook and Comp.HW databases and for the NN-based model applied to the Facebook database. When a NN is used, the average time reduction is 67% and 37% with respect to SBS and SFS, respectively. When an SVM is used as regressor, the training time is more than halved with respect to SBS, whilst SFS shows the best performance on the Baseball database, but requires a far higher computation time on the Facebook and Breast tissue datasets. The considerable time required by STEHYVAS when dealing with the Facebook and Comp.HW databases is explained by the fact that the first stage, based on the combination of two filter VS approaches, eliminates very few variables; therefore, the actual selection process is performed by the second, wrapper-based stage, which is computationally more cumbersome. The inefficiency of filter-based methods on these two databases is confirmed by the poor performance of BaggingFS, which, in these tests, incorporates RReliefF.

6.4 Discussion

The obtained results in terms of performance of the trained models and of the VS show that STEHYVAS ensures excellent stability of the VS and, therefore, also supports knowledge extraction from raw data concerning the considered problem/phenomenon without decreasing the performance of the employed algorithm. This superior capability of STEHYVAS is shown in comparison to both traditional VS methods and the case where VS is not applied (i.e. all the available variables are exploited for the classification task).

In terms of computational costs and related time requirements, STEHYVAS, in most cases, is not far worse than the considered traditional approaches, and in some cases it is even faster. This is mainly due to the combination of some quick preliminary steps, which eliminate irrelevant or redundant variables, and a more cumbersome VS approach, which is applied to an already significantly reduced number of variables.
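The benefit of shrinking the candidate set before a wrapper stage can be quantified with a simple count. The sketch below assumes a greedy forward search that is run until all variables have been added, with one model training per candidate subset; the wrappers used in this work may stop earlier, and the figures of 40 and 20 variables are purely illustrative.

```python
def sfs_evaluations(n_vars):
    """Candidate subsets scored by a greedy forward search run to completion.

    Step 1 scores n_vars candidates, step 2 scores n_vars - 1, and so on,
    for a total of n_vars * (n_vars + 1) / 2 model trainings.
    """
    return sum(n_vars - step for step in range(n_vars))


full = sfs_evaluations(40)     # 820 trainings on the full variable set
reduced = sfs_evaluations(20)  # 210 trainings after halving the variables
```

Under these assumptions, halving the number of candidate variables cuts the number of model trainings by almost a factor of four, which is why fast preliminary steps can compensate for the cost of the subsequent wrapper stage.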

In fact, the main idea behind the development of STEHYVAS arose from the consideration that wrapper approaches usually show better performance than filters, as they involve the learning algorithm, but are more cumbersome from the computational point of view. Such excessive computational burden makes the adoption of wrapper approaches not fully justifiable compared to the achievable advantages in terms of accuracy. Moreover, the variable set that leads to good performance for a given ML approach often provides similar results with other ML algorithms. Therefore, a hybrid approach is preferable, as it provides competitive advantages with respect to pure wrapper methods by also leading to a more effective and often also faster selection of the relevant variable subset.

In particular, in the present case the combination of a preliminary fast VS stage, based on a filter and an embedded approach, and a second stage, where two wrapper methods are applied, is compared to the application of a single wrapper VS approach. Despite the fact that two wrapper variable selection approaches are used in STEHYVAS, the preliminary reduction of the number of variables to consider ensures a reasonable overall duration of the VS procedure, which turns out to be comparable to, and sometimes shorter than, that of the considered wrapper VS procedures. When the STEHYVAS procedure takes longer, this means that the first stage was not able to significantly reduce the number of potential variables to be considered by the second stage, and this might also depend on the considered problem/database, e.g. when there are not many correlated variables.

It is also worth noting that the computational time depends on the adopted ML model for classification or regression, and this is the reason for the very different numerical results obtained in the performed tests even when considering the same dataset and VS algorithm. The training of a Bayesian classifier is known to be faster than that of an SVM-based one. In many applications NNs are preferred to ECOC classifiers or SVM-based regressors for multi-class classification and regression tasks, respectively. However, the direct application of a wrapper VS approach exploiting a NN-based model to a dataset holding a relatively low number of samples compared to the number of potential input variables is more prone to overfitting issues, thus worsening the stability. Therefore, especially in this specific situation, which is not infrequent in real-world applications, it is of utmost importance to reduce the number of variables to consider before implementing a wrapper VS procedure, and the advantages of STEHYVAS in terms of performance and stability are even more evident.

Finally, it must be underlined that the proposed algorithm serves as a pre-processing stage to identify the variables that mostly affect the target. Therefore, it is usually performed once, or very rarely, in the design phase of the ML-based model, and the time spent to execute it is not relevant in the overall economy of the designed system. On the other hand, VS stability is fundamental for both model design and improved knowledge of the phenomenon under consideration, as it ensures that the performed selection includes all and only the variables that are actually significant. This also contributes to enhancing the computational efficiency of the designed model.

7 Conclusions and Future Work

The paper proposes the STEHYVAS approach to improve the stability of traditional VS algorithms when applied to regression and classification tasks. In fact, the main issue of these algorithms lies in their sensitivity to variations of the training set, which is a serious problem, especially when the main objective is knowledge extraction on the considered system or phenomenon.

STEHYVAS was tested on real datasets and the results demonstrate its efficiency in terms of both accuracy and, above all, stability. In fact, STEHYVAS improves the VS stability compared to classical methods, without worsening the accuracy of the model. A further advantage of STEHYVAS lies in the fact that it is automatic, namely it does not require any a priori information on the considered dataset. Moreover, STEHYVAS is fully modular, namely each included VS approach can be replaced by other approaches belonging to the same category. Finally, STEHYVAS demonstrates its versatility within different contexts, namely binary and multi-class classification as well as regression problems.

The main limitation of STEHYVAS lies in its exclusive suitability to offline implementation, as the time required to provide the most influential variables is not compatible with an online solution. Nonetheless, it can be used as a preprocessing step to be performed once and possibly periodically repeated as the data volume grows. This feature is common to most VS approaches. A further limitation is represented by the fact that some parameters, e.g. \(N_F\) and \(N_W\), are fixed and the values proposed here might not be optimal for every possible application.

Future work will focus on the improvement of the algorithm by optimizing its hyperparameters: alternative selection strategies within the early filter-based steps will be assessed, as well as different values of the associated thresholds and of the number of iterations of the various operations (i.e. filter and wrapper applications) at different stages of the algorithm. Furthermore, the stability will be assessed through more sophisticated indexes. Finally, the k-fold cross-validation strategy will be tested instead of splitting the database into training and test sets.