Improving the Stability of the Variable Selection with Small Datasets in Classification and Regression Tasks

Cateni, Silvia; Colla, Valentina; Vannucci, Marco

doi:10.1007/s11063-022-10916-4

Open access
Published: 10 June 2022

Volume 55, pages 5331–5356, (2023)

Neural Processing Letters

Abstract

Within the design of a machine learning-based solution for classification or regression problems, variable selection techniques are often applied to identify the input variables that mainly affect the considered target. The selection of such variables provides several advantages, such as lower complexity of the model and of the learning algorithm, reduced computational time and improved performance. Moreover, variable selection is useful to gain deeper knowledge of the considered problem. High correlation among variables often produces multiple subsets of equally optimal variables, which makes traditional variable selection methods unstable and reduces the confidence in the selected variables. Stability identifies the reproducibility power of the variable selection method. Therefore, high stability is as important as high precision of the developed model. The paper presents an automatic procedure for variable selection in classification (binary and multi-class) and regression tasks, which provides an optimal stability index without requiring any a priori information on data. The proposed approach has been tested on different small datasets, which are unstable by nature, and has achieved satisfactory results.

1 Introduction

Variable Selection (VS), also known as feature selection, is the procedure of selecting a subset of important input variables, where the words variable and feature are often used as synonyms. Strictly speaking, the term “variable” denotes a raw input variable, while “feature” identifies a variable that derives from a pre-processing of input variables. Regarding the VS operation, the two terms are usually used interchangeably, as the distinction has no influence on the selection algorithms [39].

The interest of researchers and practitioners for VS approaches has increased over the years, due to both increasing diffusion of Machine Learning (ML) techniques and multiplication of data sources and data collection and storage capabilities in any field and discipline. In real world applications, a relatively easier access to large amounts of different data allows the development of more complex and reliable models. However, the multiplication of information sources also increases the difficulty of extracting the most important information conveyed by the collected data. More specifically, when a classification task is faced in relation to a phenomenon or system where poor a-priori knowledge is available, the first problem to address consists of selecting the right input variables for the designed classifier. Selecting an appropriate set of input variables reduces the complexity of the classifier and can also improve the classification accuracy [39]. Nevertheless, the stability of the adopted VS algorithm needs to be taken into careful consideration together with performance and computational efficiency [44].

A VS approach is said to be stable if it leads to the selection of the same (or a similar) subset of variables even when the so-called training dataset, i.e. the set of data on which VS is performed, changes [45]. In several cases of practical interest, especially when dealing with datasets involving many variables and a limited number of samples, traditional VS methods lead to a different solution each time the training subset is modified. This makes the data reduction practically useless for the classifier design, due to the lack of replicability when a new dataset becomes available. Moreover, no significant contribution is provided to the achievement of a thorough knowledge of the phenomenon or process under investigation.

Therefore, stability becomes a crucial aspect when the VS goal is twofold, i.e. knowledge extraction from raw data and accurate classification performance, a highly frequent situation in real world applications of ML-based models. A good VS algorithm should not only improve the performance of the model, but also provide stable and reproducible selection results.

The above-mentioned typology of datasets is frequently found in applications where each sample is the outcome of complex, costly and time-consuming analyses, such as, for instance, in the medical field [3] or in material science [74]. Therefore, addressing this issue is of relevant practical importance.

In this paper a fully automatic procedure is proposed, which is capable of selecting the variables that mostly affect the target by combining classic VS methods and by ensuring excellent stability without degrading the performance of the implemented model. Notably, the approach discussed here is not a novel VS algorithm, but rather a methodology to ensure the stability of the selection. The novelty of the proposed methodology lies in the combination of existing VS techniques from both the filter and the wrapper category for extracting the most influential variables regardless of the type of initial data, while ensuring that the result does not change when the training dataset is modified.

The main elements and strengths of the proposed concept are:

tackling situations characterized by a large number of potential input variables and a small sample size, which are intrinsically unstable, by largely improving VS stability with respect to standard VS approaches;
a modular and flexible combination of existing VS techniques, which enables the adoption of a wide variety of VS methods as well as of any modelling algorithm or stability index, and preserves the accuracy of the designed classifier or regression model;
an automatic procedure, which does not require a-priori information on the available data and fits well different applications;
an affordable computational burden, at least for off-line pre-processing applications.

The paper is an extended version of the paper entitled “A combined approach for enhancing the stability of the variable selection stage in binary classification tasks”, which was presented by the same authors at the International Work-Conference on Artificial Neural Networks IWANN 2021 [21]. With respect to the previously presented results, here multi-class classification and regression tasks are tackled, to broaden the applicability range of the proposed approach.

The paper is organized as follows: Sect. 2 provides a literature review regarding VS; Sect. 3 analyses in detail the stability issue; Sect. 4 deals with VS in neural networks applications; in Sect. 5 a description of the proposed algorithm is provided. Some numerical results are presented and discussed in Sect. 6, while Sect. 7 provides some concluding remarks.

2 Background

Within different applications, such as classification [4, 38], clustering [52, 69] and regression [6, 37, 54], VS is considered an important data pre-processing phase. The selection of the most significant variables to be fed as inputs to a ML-based model is crucial to remove variables that are highly correlated to each other (i.e. redundant variables) or irrelevant from an informative point of view [22]. In fact, the presence of redundant and irrelevant variables usually reduces the performance of the implemented model. Moreover, an ideal selection of input variables helps to gain deeper knowledge of the problem, system or phenomenon under consideration.

The selection should consider the following factors:

Relevance The set of selected variables should include all or most of the significant information concerning the considered problem.
Computational efficiency The number of selected variables should not be too high in order to reduce the computational burden. This element is of utmost importance when dealing with NN-based models: the presence of redundant and irrelevant variables adds noise, increases the number of free parameters and the time required for the network training.
Knowledge improvement The ideal selection of variables leads to a deeper understanding of the behavior of the investigated phenomenon.

To sum up, the ideal set of variables should comprise only the most influencing variables that are independent from each other, in order to create an accurate, efficient, inexpensive and more easily interpretable model.

In literature, VS techniques can be divided into three main classes: filters, wrappers and embedded approaches [62].

Filters select the best subset of input variables before applying the classifier [20, 64]. The subset is generated by evaluating the association between inputs and outputs and, consequently, the variables are classified on the basis of their relevance to the target through a statistical test [24, 27]. The main advantages of filters are their low computational complexity and their speed. However, being independent from the adopted ML-based model or classifier, they are unable to optimize it. An example of filters is represented by the correlation-based approach, which, firstly, computes the correlation coefficient between each variable and the target. Afterwards, variables are ranked and a subset is selected by including the variables showing the highest values of the correlation coefficient [26]. Other commonly applied filter approaches are the chi-square approach [9] and the Information Gain method [43].

The wrapper VS approach, which was introduced in [47], exploits the performance of the learning machine to extract the subset of variables on the basis of their predictive/classification power. Wrappers consider the model as a black box, and this feature makes them universal, as they can be applied using different kinds of algorithms [25]. An obvious wrapper method is the exhaustive search, also named brute force method, which analyses all the combinations of variables. When the number of input variables is large, this exhaustive approach is not viable. A traditional wrapper method is the Greedy Search strategy [35], which gradually creates the variables subset by adding or eliminating single variables from an initial set. Greedy search can be applied in two directions: Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) [7]. SFS starts with an empty set of variables, which are iteratively added until a fixed stopping condition on a performance index is met. The accuracy of the learning machine is the standard performance index in classification tasks, but other indexes can also be used, especially in the case of imbalanced datasets [36]. In other words, the search ends when the addition of new variables to the input set does not increase the model performance. Conversely, SBS starts with an input set including all the available variables and removes them one by one. The importance of an input variable is determined by removing it and calculating the model performance without that variable among the inputs. The search ends when removing variables from the input set significantly decreases the model performance. SFS is less computationally expensive than SBS, which becomes impracticable when the number of potential input variables is too large [53].
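The forward direction of the greedy search described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function name, the `min_gain` parameter and the toy performance index `toy_score` are assumptions introduced here, and in practice `score` would train and validate the chosen learning machine on the candidate subset.

```python
from typing import Callable, List, Sequence


def sequential_forward_selection(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedy SFS: add the single best variable until the score stops improving."""
    selected: List[str] = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining:
        # Evaluate every one-variable extension of the current subset.
        trials = [(score(selected + [v]), v) for v in remaining]
        top_score, top_var = max(trials, key=lambda t: t[0])
        # Stop when no candidate improves the performance index.
        if top_score - best_score <= min_gain:
            break
        selected.append(top_var)
        remaining.remove(top_var)
        best_score = top_score
    return selected


# Hypothetical performance index: "a" and "b" are informative,
# and every added variable carries a small complexity penalty.
def toy_score(subset: List[str]) -> float:
    useful = {"a": 0.5, "b": 0.3}
    return sum(useful.get(v, 0.0) for v in subset) - 0.01 * len(subset)


chosen = sequential_forward_selection(["a", "b", "c", "d"], toy_score)
print(chosen)
```

SBS follows the same pattern in reverse, starting from the full set and removing the variable whose elimination degrades the score the least.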

Embedded methods perform the variable selection as part of the learning stage and are generally specific to a particular learning machine [33]. Common examples of embedded approaches are Decision Trees (DT) and approaches based on regularization techniques [10]. The main advantage of embedded methods lies in their association with the learning algorithm. Embedded approaches also exploit all the variables to generate a model and then evaluate it to establish the importance of the variables [2].

To sum up, filter VS methods are suitable to deal with very high dimensional datasets, as they are computationally simple, fast and independent from the adopted algorithm. Wrapper methods exploit the learning algorithm as a black box, assessing its performance for VS, but are prone to overfitting and computationally cumbersome. Embedded methods show a lower computational cost with respect to wrappers, but they are specific to a given model or classifier [59].

Recently, several hybrid variable selection approaches have been proposed to jointly exploit the advantages of several VS approaches by overcoming their drawbacks [16, 18].

3 The Stability Issue

Stability is a broad and very general concept, which touches different domains of systems analysis, modelling and control [42, 48, 67, 71] including various classes of neural networks [11, 50], with many works focused on large complex dynamical networks [32, 70]. In the context of VS, the stability concept was introduced in [65], and is defined as the sensitivity of a VS algorithm to variations in the training dataset. Ignoring the stability issue of a VS algorithm can lead to wrong conclusions and unreliable design of a ML-based model or classifier [14, 66, 72]. Several papers discuss the fact that using different training sets can lead to the selection of different variable subsets even when applying the same VS algorithm [13, 49]. The investigation on the stability of VS approaches is related to the need to provide users with a quantitative confirmation that VS is reliable and sufficiently robust with respect to variations in the training data [29]. This requirement is particularly important in real-world applications, where an improved knowledge of the phenomenon under investigation is a further significant outcome of the VS. For instance, in monitoring applications [57, 68, 73], the identification of the most relevant input variables highlights the factors to monitor in order to identify and prevent faults and anomalies early, but such identification needs to be definitive when it precedes the setup of the hardware for a monitoring system.

A further interpretation of stability in probabilistic terms can be provided by considering the outcome of VS as a stochastic process. Some of the approaches described in Sect. 2 inherently hold some elements of stochasticity, as they include ML procedures whose outcome might also depend, for instance, on the initial values of internal parameters, which are often randomly selected, as well as on the data used for the training. Moreover, in practical applications, noise and disturbances, which can often be modeled as stochastic processes, affect the measurements of all potential variables. All these factors make it possible that the same VS procedure applied to different portions of the available experimental data leads to the identification of different subsets of relevant variables. Let us define as $P_i$ the probability that the variable $x_i$ is selected by a given VS procedure applied on a given dataset $\mathcal {D}$. In general, the more relevant $x_i$ is with respect to the target, the higher $P_i$, but $P_i$ can also be affected by the above-mentioned noise in the data as well as by the correlation among the potential input variables. Therefore, $P_i$ depends both on ${{\mathcal {D}}}$ and on the subset ${{\mathcal {S}}} \subset \mathcal {D}$ of the available data on which VS is performed, i.e. $P_i=P_i \left( \mathcal{D}, \mathcal{S} \right) $. For instance, if two potential input variables are exactly linearly correlated with each other, their individual relevance with respect to the target is identical. In this case a correlation-based filter approach identifies both of them, and the removal of one of them is done according to a further criterion, which might not be deterministic, while a sequential wrapper approach, like SBS or SFS, eliminates one of them, but which one depends only on the order in which they are considered.
Roughly speaking, the issue of selecting an appropriate VS procedure translates into maximizing $P_i$ for the variables $x_i$ that actually affect the target while minimizing it for the other ones, whereas the issue of ensuring VS stability translates into making $P_i$ as independent of and insensitive to the selection of $\mathcal {S}$ as possible. For an effective model design, both issues are relevant, but an inappropriate development of the first task might seriously compromise the second one, with detrimental effects on the overall results.

The VS output can be represented by different evaluation criteria: a weighted score of each variable, a ranking, or a subset of significant variables, namely a binary vector b, where a unit value of an entry corresponds to the presence of the associated feature in the subset [46]. Accordingly, stability measures can be classified into three categories: stability by weight, stability by rank and stability by subset. For stability by weight, the similarity between two weighting vectors is computed through Pearson’s correlation coefficient. Similarly, to quantify the similarity between two variable rankings, Spearman’s rank correlation coefficient [63], i.e. the Pearson correlation coefficient between the ranked variables, is evaluated. These two indices lie in the range [-1, 1], where the unit values mean full direct (1) or inverse (-1) correlation, while the null value corresponds to the absence of correlation. Finally, the similarity between two binary vectors $b_1$ and $b_2$ can be estimated employing the Tanimoto distance [30], which is defined as:

$$S_B=\frac{\vert b_1 \cdot b_2\vert }{\Vert b_1\Vert + \Vert b_2\Vert - \vert b_1 \cdot b_2\vert } \tag{1}$$

where $\Vert \cdot \Vert $ represents the norm of a vector, $\cdot $ the scalar product between two vectors, and $\vert \cdot \vert $ the absolute value of a scalar. The Tanimoto index lies in the range [0, 1], where the unit value corresponds to two identical vectors, and the null value identifies two completely different vectors.
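As a small worked example of Eq. (1), the following sketch computes the Tanimoto index of two binary selection vectors. It assumes that, for 0/1 vectors, the norm in Eq. (1) counts the selected variables (which equals the squared Euclidean norm); this reading, the function name and the convention for two empty selections are assumptions made here for illustration.

```python
from typing import Sequence


def tanimoto_index(b1: Sequence[int], b2: Sequence[int]) -> float:
    """Tanimoto index between two 0/1 selection vectors of equal length."""
    if len(b1) != len(b2):
        raise ValueError("the two vectors must have the same length")
    common = sum(x * y for x, y in zip(b1, b2))  # scalar product b1 . b2
    n1 = sum(b1)  # number of variables selected in b1 (assumed meaning of the norm)
    n2 = sum(b2)  # number of variables selected in b2
    denominator = n1 + n2 - common
    # Assumption: two empty selections are treated as identical.
    return 1.0 if denominator == 0 else common / denominator


# Two runs selecting 3 variables each, 2 of which are shared: 2 / (3 + 3 - 2)
s_b = tanimoto_index([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
print(s_b)
```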

4 The Importance of the Variable Selection Method in Neural Network Applications

NNs are extensively used in real-world applications when highly complex and non-linear tasks are considered. As described in the previous paragraph, the set of variables to be fed as input to a model should include only the most significant variables with a small correlation among each other. In addition, a good selection is recommended, as any unnecessary input worsens the robustness of the system, adds noise and increases computational effort.

For standard parametric mathematical models the complexity of the VS task can be moderated by the a priori hypothesis on the functional form of the model, exploiting also a physical interpretation of the system or phenomenon under investigation. Conversely, for NNs the generic and highly non-linear nature of the model makes the use of standard VS methods difficult. This complexity increases if the VS procedure is performed in the learning phase. Most NN paradigms cannot efficiently identify redundant and irrelevant variables during the training phase. Moreover, several NN users tend to build the network using all available variables, thinking that their redundancy will lead to a more robust model. On the other hand, skilled NN modelers understand the importance of VS techniques in network design as a preliminary phase. The selection of the most suitable VS method is also important: methods that are suitable for linear regression can be inappropriate for a highly non-linear neural model. Wrappers and embedded approaches are usually adopted when the number of training samples is low with respect to the number of input variables. The search strategy should represent a trade-off between the number of considered solutions and the sustainable computational burden, ensuring that the approach selects the most informative but still limited subset of variables. The traditional SBS wrapper method is efficient when training NNs of small dimensions. SFS performs an efficient search when it embeds the redundancy check in the statistical analysis of the variables. Exhaustive search (considering all combinations of variables) is impracticable in most cases, whereas evolutionary search approaches can reach a reasonable compromise by covering an appropriate number of combinations of input variables.
Filter approaches, being fast and model-free, are suitable to applications where independence from a specific NN architecture or computational efficiency is needed. However, simple ranking procedures can detect more variables than needed, without considering the redundancy of variables and are unsuitable to multivariate NN regression tasks [12, 15].

The approach proposed here is a hybrid method capable of both selecting the most significant variables without degrading the model performance and ensuring stability of the solutions, thus also leading to a more consolidated knowledge acquisition.

5 The Proposed Method

The proposed algorithm, named Stability Enhanced HYbrid VAriable Selection (STEHYVAS), aims at implementing a stable and efficient VS procedure for ML-based classification and regression tasks and is also applicable to datasets characterized by a relatively low number of samples compared to the number of potential input variables. STEHYVAS combines VS approaches belonging to two of the three previously described categories, i.e. filter and wrapper methods.

For the sake of simplicity, in the following an arbitrary dataset is represented as a matrix, where rows correspond to samples and columns to measured variables for each sample.

A pseudo-code that describes how STEHYVAS works is provided in Algorithm 1. The following description of the method refers to this listing.

Firstly, the dataset is shuffled and divided into two portions (line 6): 25% of the available data are used as test set to validate the classifier, while the remaining 75% are used to extract the final variables subset through the proposed procedure. The dataset also needs to be preliminarily cleansed (line 7), by eliminating all the rows where at least one variable is missing, to produce a complete dataset [17, 19].
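The split and cleansing steps can be sketched as below. The helper names, the fixed random seed and the representation of missing entries as `None` or NaN are assumptions of this example; the optional `columns` argument restricts the completeness check to a subset of variables, as is done for the test set once the reduced subset of input variables is known.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]


def is_missing(value: Optional[float]) -> bool:
    """A value is missing if it is None or NaN (assumed encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def shuffle_split(
    rows: Sequence[Row], test_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[Row], List[Row]]:
    """Shuffle the rows and return (training part, test part)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def drop_incomplete(
    rows: Sequence[Row], columns: Optional[Sequence[int]] = None
) -> List[Row]:
    """Keep only the rows with no missing value in the inspected columns."""
    kept = []
    for row in rows:
        inspected = range(len(row)) if columns is None else columns
        if not any(is_missing(row[j]) for j in inspected):
            kept.append(row)
    return kept


data = [[float(i), float(i) * 2.0] for i in range(8)]
train, test = shuffle_split(data)
print(len(train), len(test))

dirty = [[1.0, None], [2.0, 3.0], [float("nan"), 1.0]]
print(drop_incomplete(dirty))
```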

Subsequently, a redundancy analysis is performed to eliminate redundant variables (line 8), which also improves the interpretability of the VS outcome. Redundant variables are detected by applying a method based on the Dominating Set Algorithm (DSA), which derives from graph theory [1, 60]. In this context, two arbitrary variables are considered redundant if their linear correlation coefficient satisfies $\rho > 0.95$ and the associated significance value satisfies p-value$<0.05$, where the definitions of $\rho $ and p-value derive from the classical statistical correlation theory introduced by Pearson at the end of the 19th century [56]. The variables are the vertexes of a graph in which redundant pairs are connected. DSA extracts the minimal dominating set of the graph, i.e. the dominating set with the lowest number of vertexes: variables of the graph that do not fall inside the minimal dominating set are considered redundant.
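To make the redundancy analysis more concrete, the sketch below builds the graph of highly correlated variables and extracts a dominating set with a simple greedy heuristic. Two simplifications are assumed here and are not taken from the paper: the p-value check is omitted, and the greedy rule only approximates a minimal dominating set, so it should not be read as the DSA of [1, 60]. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's linear correlation coefficient (0.0 for constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def redundancy_graph(
    columns: Dict[str, Sequence[float]], threshold: float = 0.95
) -> Dict[str, Set[str]]:
    """Connect two variables when their correlation exceeds the threshold."""
    names = list(columns)
    graph: Dict[str, Set[str]] = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(columns[a], columns[b]) > threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def greedy_dominating_set(graph: Dict[str, Set[str]]) -> List[str]:
    """Greedy approximation: pick the vertex covering most undominated ones."""
    undominated = set(graph)
    chosen: List[str] = []
    while undominated:
        best = max(graph, key=lambda v: len(({v} | graph[v]) & undominated))
        chosen.append(best)
        undominated -= {best} | graph[best]
    return chosen


columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of x1, hence redundant
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = greedy_dominating_set(redundancy_graph(columns))
redundant = [name for name in columns if name not in kept]
print(kept, redundant)
```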

After the elimination of redundant variables, two VS techniques belonging to the filter category are separately applied. The two VS procedures are applied $N_F$ times each and, at each iteration, the intersection of the two sets of most significant variables is recorded. The variables that are included in such intersection with a frequency greater than or equal to 80% are selected (lines 10–15).

The reduced dataset (line 16) is then subjected to $N_W$ different iterations of the two well-known wrapper VS methods: SFS and SBS (lines 18–23). The union of the two solutions is stored at each iteration and the variables which are included in such union with a frequency greater than or equal to 60% are selected. This last subset of variables is the candidate to be the final winner variable subset.
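The two frequency-based aggregation steps can be illustrated with the sketch below, where the per-iteration outputs of the filters and wrappers are represented as plain sets of variable names. The function names and the toy selections are hypothetical; only the combination rules (intersection with an 80% threshold for the filters, union with a 60% threshold for the wrappers) follow the description above.

```python
from collections import Counter
from typing import List, Sequence, Set


def combine_runs(
    runs_a: Sequence[Set[str]], runs_b: Sequence[Set[str]], how: str
) -> List[Set[str]]:
    """Combine the outputs of two VS methods iteration by iteration."""
    if how == "intersection":
        return [a & b for a, b in zip(runs_a, runs_b)]
    if how == "union":
        return [a | b for a, b in zip(runs_a, runs_b)]
    raise ValueError("how must be 'intersection' or 'union'")


def frequent_variables(runs: Sequence[Set[str]], min_frequency: float) -> List[str]:
    """Variables appearing in at least min_frequency of the recorded sets."""
    counts = Counter(v for selection in runs for v in selection)
    return sorted(v for v, c in counts.items() if c / len(runs) >= min_frequency)


# Hypothetical outputs of the two filters over five iterations.
filter_1 = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"a", "b", "c"}]
filter_2 = [{"a", "b"}, {"a", "b", "d"}, {"a", "b"}, {"a", "b"}, {"a", "b", "c"}]
after_filters = frequent_variables(
    combine_runs(filter_1, filter_2, "intersection"), 0.8
)

# Hypothetical outputs of SFS and SBS on the reduced dataset.
sfs_runs = [{"a"}, {"a"}, {"a", "b"}, {"a"}, {"b"}]
sbs_runs = [{"a", "b"}, {"a"}, {"a"}, {"b"}, {"b"}]
final_subset = frequent_variables(combine_runs(sfs_runs, sbs_runs, "union"), 0.6)
print(after_filters, final_subset)
```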

The two frequency values used as thresholds (80% and 60%) have been set considering that filter methods are intrinsically stable, so a higher threshold ensures that variables that have only accidentally fallen into the subset for a certain training set are not selected. Furthermore, the filter-based reduction is performed prior to the wrappers because filters are computationally faster than wrappers, so this preliminary reduction of the number of variables speeds up the subsequent steps.

Finally, both the average accuracy of the model and the stability of the approach can be assessed. Similarly to the training dataset, also the test set undergoes a cleansing stage, but such cleansing is done considering only the reduced subset of input variables. The overall procedure is schematically represented in Fig. 1.

STEHYVAS is generic and can be executed on any kind of binary or multi-class classifier as well as on a regression model. In addition, the classic VS methods that are used here could be replaced with other algorithms belonging to the same category. Finally, STEHYVAS shows the great advantages of not requiring any a priori information on the available database, being completely automatic and modular and, therefore, adaptable to various problems.

6 Experimental Tests

6.1 Datasets Used in the Experiments

In order to assess the validity of the proposed VS approach, STEHYVAS has been applied to binary classification, multi-class classification and regression tasks, related to different datasets extracted from the widely used UCI Machine Learning Repository [8]. Only one dataset, named Baseball, is not extracted from the UCI repository but is downloaded from CNN/Sports. The datasets exploited in the tests are briefly described in the following.

Breast Cancer Wisconsin (BCW): the BCW database comes from the Hospital of the University of Wisconsin. Data refer to patients affected by tumours. The binary classification target refers to the benign or malignant nature of the tumour.
Cervical (Cerv): The dataset contains several attributes on observed patients, who are classified according to the presence/absence of cervical cancer.
Heart Failure clinical records (HF): This dataset includes the medical records of patients who had heart failure, which were collected during their follow-up period.
Heart: This dataset is generated to identify the presence or absence of heart disease.
Pima Indians Diabetes (PID): This dataset is provided by the National Institute of Diabetes and Digestive and Kidney Diseases and concerns female patients from Arizona who are at least 21 years old. The binary classification target indicates whether the patient tested positive or negative for diabetes.
Dermatology (Derm.): This dataset is related to the determination of the type of Erythemato-Squamous Disease. The differential diagnosis of such diseases is a real problem in dermatology: they all share the clinical features of erythema and peeling, with very little differences. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Generally, a biopsy is needed for the diagnosis but, unluckily, these diseases share many histo-pathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of a similar disease in an initial stage, and its own typical features later on.
NewThyroid (Thyroid): This dataset refers to features of patients who are found to suffer or not from hyperthyroidism or hypothyroidism.
Wine: This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivations. The analysis determined the quantities of 13 constituents found in each of the three considered types of wines.
Iris: This is one of the most widely known datasets of the literature on pattern recognition. It includes data from 3 classes, each one represented by 50 samples and referring to a type of iris plant. One class is linearly separable from the other 2; the latter ones are not linearly separable from each other.
Facebook metrics: This dataset was collected within a study on the performance metrics of the Facebook page of a renowned cosmetics brand. The input variables include category, page total likes, type, month, hour, weekday and paid, and the target is the measure of the impact of the post in terms of total interactions (sum of likes, comments and shares of the post).
Computer Hardware: This dataset is used to forecast the relative CPU performance of a PC on the basis of its features, such as, for instance, model, CPU frequency and memory size.
Breast Tissue: This dataset includes electrical impedance measurements of freshly excised tissue samples from the breast. Such measurements constitute the impedance spectrum from where the breast tissue features are computed.
Baseball: This dataset contains the 1992 salaries of the Major League Baseball players who played at least one game in both the 1991 and 1992 seasons, excluding pitchers. For each player, some performance measures are given along with four categorical variables that indicate how free each player was to move to other teams. The variable that indicates the salary is the target to estimate.

The sizes of each of the above-listed databases, namely the total number of samples $N_{TOT}$ and of variables $N_{V}$ considered in each dataset, are summarised in Table 1.

Table 1 Overview of the characteristics of the adopted datasets

6.2 Exploited Filter Approaches

As stated in Sect. 5, two filter VS approaches are required in the first stage of STEHYVAS to develop a preliminary selection of potentially relevant input variables.

The first, very simple VS approach is the correlation-based one, which is generic and applicable to any task. In this approach, for each input variable $ x_j$ ($ 1 \le j \le N_V $), Pearson’s correlation coefficient with respect to the considered target is evaluated. Afterwards, all the variables are ranked according to this value and the mean value of all coefficients is computed. Finally, the variables showing a coefficient value greater than the overall mean value are selected. A further quite simple test, which can however be applied only to binary classification tasks, is based on the evaluation of the so-called Wilcoxon index. This filter algorithm aims at ranking the variables of a given dataset by assessing their significance for separating the two labelled groups of data. As a ranking criterion, the absolute value of the standardized u-statistic of a two-sample unpaired Wilcoxon test, also known as Mann-Whitney test [61], is adopted. Once the variables have been ranked according to their relevance, those that belong to the first half of the ranking are selected. The unpaired two-sample Wilcoxon test is a non-parametric alternative to the unpaired two-sample t-test that can also be used when data are not normally distributed.
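Both filters can be sketched in a few lines. The code below is only an illustration under stated assumptions: the correlation filter ranks by the absolute value of the coefficient, the standardized u-statistic uses the normal approximation without tie correction, odd-sized rankings are rounded up when taking the first half, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def correlation_filter(columns: Dict[str, Sequence[float]],
                       target: Sequence[float]) -> List[str]:
    """Select the variables whose |correlation| with the target exceeds the mean."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    mean_score = sum(scores.values()) / len(scores)
    return [name for name, s in scores.items() if s > mean_score]


def midranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def standardized_u(column: Sequence[float], labels: Sequence[int]) -> float:
    """|z| of the Mann-Whitney U statistic for 0/1 labels (no tie correction)."""
    ranks = midranks(column)
    n1 = sum(1 for label in labels if label == 1)
    n0 = len(labels) - n1
    u1 = sum(r for r, label in zip(ranks, labels) if label == 1) - n1 * (n1 + 1) / 2
    return abs(u1 - n1 * n0 / 2) / math.sqrt(n1 * n0 * (n1 + n0 + 1) / 12)


def wilcoxon_filter(columns: Dict[str, Sequence[float]],
                    labels: Sequence[int]) -> List[str]:
    """Rank variables by |z| and keep the first half of the ranking."""
    ranked = sorted(columns, key=lambda name: standardized_u(columns[name], labels),
                    reverse=True)
    return ranked[: math.ceil(len(ranked) / 2)]


labels = [0, 0, 0, 1, 1, 1]
columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1, 4, 2, 5, 3, 6],
    "x3": [2, 5, 4, 1, 6, 3],
    "x4": [6, 1, 3, 2, 5, 4],
}
by_correlation = correlation_filter(columns, labels)
by_wilcoxon = wilcoxon_filter(columns, labels)
print(by_correlation, by_wilcoxon)
```

In this toy dataset only x1 and x2 separate the two classes, and both filters retain exactly those two variables.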

The Wilcoxon index-based filter method cannot be applied to multi-class classification and regression tasks. Therefore, in these cases two algorithms belonging to the Relief family are exploited in STEHYVAS as second filter approach: ReliefF is used for multi-class classification and RReliefF is used for regression [58]. The main idea behind all Relief algorithms is to determine the importance of a variable according to its capability of distinguishing between data samples $ \mathbf{x} \in {\mathbb {R}}^{N_V} $ that are near to each other in the problem domain. To this aim, the variable importance is ranked by associating a weight $w_j$ to the j-th variable $ x_j$ for all $ 1 \le j \le N_V $. Such weights are iteratively computed. Finally, the variables belonging to the upper half of the obtained ranking are selected. A detailed description of the ReliefF and RReliefF methods, which are far less known than the other methods employed in this work, is provided in appendix A together with some additional detail about their set-up within this work.
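The weight update shared by the Relief family can be illustrated with the original two-class Relief, which is simpler than the ReliefF and RReliefF variants actually used in this work. The sketch below is therefore an assumption-laden simplification: it uses a single nearest hit and nearest miss per sample, visits every sample once instead of sampling randomly, and normalises differences by the range of each variable. The function name and the toy data are hypothetical.

```python
from typing import List, Sequence


def relief_weights(samples: Sequence[Sequence[float]],
                   labels: Sequence[int]) -> List[float]:
    """Two-class Relief: reward variables that differ on the nearest miss
    and penalise variables that differ on the nearest hit."""
    n = len(samples)
    n_vars = len(samples[0])
    # Range of each variable, used to normalise differences (1.0 if constant).
    spans = [(max(col) - min(col)) or 1.0 for col in zip(*samples)]

    def diff(j: int, a: Sequence[float], b: Sequence[float]) -> float:
        return abs(a[j] - b[j]) / spans[j]

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(diff(j, a, b) for j in range(n_vars))

    weights = [0.0] * n_vars
    for i, x in enumerate(samples):
        hits = [s for k, s in enumerate(samples) if k != i and labels[k] == labels[i]]
        misses = [s for k, s in enumerate(samples) if labels[k] != labels[i]]
        near_hit = min(hits, key=lambda s: distance(x, s))
        near_miss = min(misses, key=lambda s: distance(x, s))
        for j in range(n_vars):
            weights[j] += (diff(j, x, near_miss) - diff(j, x, near_hit)) / n
    return weights


# Toy data: the first variable separates the classes, the second is noise.
samples = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.8, 0.8], [0.9, 0.2], [1.0, 0.6]]
labels = [0, 0, 0, 1, 1, 1]
w = relief_weights(samples, labels)
print(w)
```

Ranking the variables by decreasing weight and keeping the upper half then yields the selected subset.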

6.3 Results

In the performed experiments, the iteration parameters are assumed to be all identical, namely $N_F = N_W = 10$. Moreover, 10 independent runs of the proposed procedure have been performed on each dataset in order to calculate the stability of the solutions. The stability is measured through the average Tanimoto distance. This distance is compared to the one calculated by running SFS and SBS as the only VS algorithms (also computed on 10 iterations) to demonstrate that they can encounter instability issues and to quantify the improvement obtained by STEHYVAS. Furthermore, the variables selected with the three methods (SFS, SBS, STEHYVAS) are used to reduce the test set (i.e. only the selected variables are kept), which is then used to validate the adopted machine learning algorithm.
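As an illustration of how such a stability value can be obtained from repeated runs, the sketch below averages a pairwise Tanimoto index over all pairs of selected subsets. It assumes the index is the set-based Tanimoto (Jaccard) similarity, where 1 means identical subsets, which is consistent with the stability values close to 1 reported for the most stable methods; the exact definition adopted in the paper is given in an earlier section, and the function names are hypothetical.

```python
from itertools import combinations


def tanimoto(subset_a, subset_b):
    """Tanimoto (Jaccard) similarity between two sets of selected variable indices."""
    a, b = set(subset_a), set(subset_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty selections are identical
    return len(a & b) / len(union)


def average_stability(subsets):
    """Mean pairwise Tanimoto similarity over the subsets selected in repeated runs."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("at least two runs are needed to measure stability")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```

With 10 runs per dataset, `subsets` would hold 10 sets of variable indices and the average would be taken over the 45 resulting pairs.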

In our tests, the binary classifiers used are a Bayesian classifier [40, 51] and a Support Vector Machine (SVM)-based classifier [23]. For the multi-class targets, we implemented Error-Correcting Output Codes (ECOC) [5, 28, 31], a technique that reframes a multi-class classification problem as multiple binary classification problems, so that native binary classification models can be used directly; in our case, SVM-based binary classifiers are used. The second classifier used in the multi-class tasks is a Neural Network (NN)-based classifier [34]. A feed-forward fully connected NN has been trained. The final fully connected layer and the subsequent softmax activation function produce the network’s output and predicted labels. The employed activation function is the Rectified Linear Unit (ReLU) [41]. The cross-entropy loss is minimized through a limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton algorithm [55].
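To make the ECOC reduction mentioned above more concrete, the following sketch shows how the outputs of several binary learners can be decoded into one class label. This section does not state which coding design or decoding rule was used, so the one-vs-all coding matrix and the Hamming-distance decoding below are assumptions chosen for simplicity, and the function names are hypothetical.

```python
def one_vs_all_code_matrix(n_classes):
    """Coding matrix with one row per class and one column per binary learner (+1 / -1)."""
    return [[1 if cls == learner else -1 for learner in range(n_classes)]
            for cls in range(n_classes)]


def ecoc_decode(code_matrix, binary_outputs):
    """Return the class whose codeword is closest (Hamming distance) to the learners' outputs."""
    distances = [
        sum(1 for bit, out in zip(codeword, binary_outputs) if bit != out)
        for codeword in code_matrix
    ]
    # Ties are resolved in favour of the lowest class index.
    return distances.index(min(distances))
```

In a full pipeline each column of the matrix would correspond to one trained binary classifier (an SVM in the experiments described here), and `binary_outputs` would collect their +1 / -1 predictions for a single sample.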

For the regression tasks, NNs and SVMs have been used. In particular, a two-layer feedforward NN with fully connected layers is adopted. In this network the hidden layer is formed by 10 neurons that use a ReLU activation function. The last fully connected layer holds one output corresponding to the estimated target. The adopted SVM models employ linear kernel functions and are trained through the well-known Sequential Minimal Optimization algorithm. The purpose of a first set of tests is to demonstrate the efficiency of STEHYVAS in improving the stability of the VS stage independently of the adopted ML algorithm and without negatively affecting the performance of the model itself. Therefore, the focus is not on the selection of the most suitable model.

Moreover, in order to further assess the validity of STEHYVAS, a comparison is proposed with a method recently proposed in [3], where an ensemble approach based on the bagging technique is applied to improve feature selection stability via data variance reduction. This approach, named Bagging Feature Selection (in the following indicated as BaggingFS for the sake of brevity), first applies cross validation k times, followed by a bootstrap sampling of each training set. Afterwards, a VS algorithm is applied on each fold to select the variables and, finally, stability is evaluated. The method is repeated for each cardinality of the selected variable set. BaggingFS has been developed for classification tasks and, like STEHYVAS, specifically targets datasets with a limited number of samples and a relatively large number of potential input variables. BaggingFS is considered an adequate term of comparison mostly because, analogously to STEHYVAS, it is not a brand new VS algorithm but aims at improving the stability of existing VS methods, it does not depend on the adopted classifier, and it does not introduce a new metric to quantify stability. In the tests presented here, the VS approach adopted within BaggingFS is ReliefF for classification and RReliefF for regression, and $k=10$.
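The resampling scheme summarised above can be sketched as follows. This is a loose reading of the description given here, not the reference implementation of [3]: the fold construction, the bootstrap size (equal to the training set size) and the `select` callback, which stands for any VS algorithm returning variable indices, are all assumptions introduced for illustration.

```python
import random


def bagging_fs_subsets(n_samples, select, k=10, seed=0):
    """Run `select` on a bootstrap sample of each of the k cross-validation training sets.

    `select` is a hypothetical callback: it receives a list of sample indices
    and returns the indices of the variables it selects on those samples.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    subsets = []
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        bootstrap = [rng.choice(train) for _ in train]  # sampling with replacement
        subsets.append(frozenset(select(bootstrap)))
    return subsets
```

The returned list of subsets can then be passed to any set-based stability index to quantify how consistent the selection is across folds.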

The performance of the classifiers is evaluated using as metric the Overall Accuracy, which is calculated as the ratio between the sum of the correctly classified test values and the total number of test values. The performance on the regression problems is evaluated computing the Normalised Root Mean Square Error (NRMSE) (Eq. 2) as follows:

$$\begin{aligned} NRMSE=\frac{1}{N_{TOT}} \sum _{i=1}^{N_{TOT}} \sqrt{\frac{({\hat{y}}_i-y_i)^2}{\sigma _y}} \end{aligned}$$

(2)

where ${\hat{y}}_i$ is the estimated value of $y_i$, and $\sigma _y$ is the standard deviation of the target.
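Both performance metrics can be computed with a few lines of code. The sketch below is a literal transcription of the Overall Accuracy definition and of Eq. 2 as printed, including the division by $\sigma _y$ inside the square root; the function names are hypothetical, and the use of the sample (n-1) standard deviation computed on the true target values is an assumption, since the text does not specify it.

```python
import math


def overall_accuracy(y_true, y_pred):
    """Ratio between correctly classified test samples and all test samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def nrmse(y_true, y_pred):
    """NRMSE transcribed term by term from Eq. 2."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in y_true) / (n - 1))
    return sum(math.sqrt((p - t) ** 2 / sigma_y) for t, p in zip(y_true, y_pred)) / n
```

Both functions take the true and predicted values of the test set as plain sequences of equal length.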

The obtained results are shown in Table 2 for the Bayesian and SVM-based binary classifiers, in Table 3 for the ECOC and NN-based multiclass classifiers and, finally, in Table 4 for the SVM-based and NN-based regressors. In particular, the first column refers to the type of classifier or regressor adopted for the tests, the second column indicates the processed dataset, columns 3-7 report the mean accuracy obtained on the test set using all the initial variables, the variables selected by SFS and by SBS, and the variables selected by STEHYVAS and by BaggingFS, respectively, while the last four columns show the Tanimoto stability measure for the four VS approaches.

Table 2 Results on the tested datasets for binary classification tasks

Table 3 Results on the tested datasets for multiclass classification tasks

Table 4 Results on the tested datasets for regression tasks

Finally, the times required to execute the VS algorithms have been calculated and are shown in Tables 5, 6 and 7, where average computation times and their standard deviations (in brackets) over 10 runs are reported. The time is expressed in seconds. For the binary classifiers, it is measured on a notebook PC with an Intel Core i9 CPU at 2.9 GHz, a 512 GB SSD and 16 GB of RAM (see Table 5). For the multiclass classifiers and the regressors, it is measured on a workstation with an 8-core AMD Ryzen 7 2700X CPU at 3.70 GHz, a 512 GB SSD and 64 GB of RAM (see Tables 6 and 7).

Table 5 Average execution time (in seconds) on the tested datasets with implemented ML algorithms for binary classification tasks. Standard deviation is also reported in brackets

Table 6 Average execution time (in seconds) on the tested datasets with implemented ML algorithms in multiclass classification task. Standard deviation is also reported in brackets

Table 7 Average execution time (in seconds) on the tested datasets with implemented ML algorithms in regression task. Standard deviation is also reported in brackets

In general, according to the results shown in the tables, and especially in comparison with single VS approaches, STEHYVAS turns out to appreciably improve the VS stability and to reduce the computational burden related to model training, while keeping the accuracy of the models comparable to that achieved with other VS methods.

In more detail, within binary classification tasks, the improvement in terms of stability obtained by STEHYVAS (average 0.92) is marked with respect to the other commonly used VS approaches, where the best competitor, namely SBS, achieves an average stability close to 0.6. In this case, the models using the variables selected by STEHYVAS increase the predictive accuracy as well (+5% compared to SBS). On the other hand, BaggingFS achieves stability values similar to STEHYVAS (average 0.94) and comparable predictive accuracy of the trained classifiers. As far as the model training time is concerned, the method outperforms SFS, while there is no clear advantage with respect to SBS. This behavior is related to the tendency of STEHYVAS and SBS to select a smaller number of variables compared to SFS. On the other hand, STEHYVAS is much faster than BaggingFS, thanks to its first stage based on redundancy analysis and the combination of two filter VS approaches, which limits the number of variables undergoing the following wrapper VS stage, while BaggingFS performs a number of iterations equal to the cardinality of the selected variable set.

The proposed approach improves stability also when handling multi-class classification tasks. Again, STEHYVAS achieves an overall stability value of 0.92, SFS and SBS reach stability values of 0.63 and 0.62, respectively, while BaggingFS achieves an overall stability value of 0.91. The accuracy obtained by the different VS methods is comparable for all datasets apart from the Wine one, on which STEHYVAS markedly outperforms all the other approaches, while BaggingFS shows far worse performance than all the other approaches. In terms of computational time, STEHYVAS drastically outperforms BaggingFS on all the datasets apart from the Wine one, for which the computation time is similar, and is also faster than SBS, while it shows training time values similar to SFS.

Finally, also for regression tasks STEHYVAS appreciably improves selection stability (average 0.91) with respect to both SFS (average ) and SBS (average 0.64), while it is slightly outperformed by BaggingFS, which shows an average stability value of 0.97. However, in this case BaggingFS shows drastically worse performance in terms of NRMSE than all the other methods, among which STEHYVAS slightly outperforms both SFS and SBS. In other words, BaggingFS almost always selects the same variable subset, but this subset does not seem to be the optimal one. Moreover, STEHYVAS is faster than all the other methods in most cases, being outperformed by BaggingFS only for the SVM-based model applied to the Facebook and Comp.HW databases and for the NN-based model applied to the Facebook database. The average time reduction when using a NN is 67% and 37% with respect to SBS and SFS, respectively. When an SVM is used as regressor, the training time is more than halved with respect to SBS, whilst SFS shows the best performance on the Baseball database, but requires a far higher computation time on the Facebook and Breast tissue datasets. The considerable time required by STEHYVAS when dealing with the Facebook and Comp.HW databases is explained by the fact that the first stage, based on the combination of two filter VS approaches, eliminates very few variables; therefore, the actual selection process is performed by the second wrapper-based stage, which is computationally more cumbersome. The inefficiency of filter-based methods on these two databases is confirmed by the poor performance of BaggingFS, which, in these tests, incorporates RReliefF.

6.4 Discussion

The obtained results in terms of performance of the trained models and of the VS show that STEHYVAS ensures excellent stability of the VS and, therefore, also supports knowledge extraction from raw data concerning the considered problem/phenomenon without decreasing the performance of the employed algorithm. This superior capability of STEHYVAS is shown in comparison to both traditional VS methods and the case where VS is not applied (i.e. all the available variables are exploited for the classification task).

In terms of computational costs and related time requirements, STEHYVAS, in most cases, is not far worse than the considered traditional approaches, and in some cases it is even faster. This is mainly due to the combination of some quick preliminary steps, which eliminate irrelevant or redundant variables, and a more cumbersome VS approach, which is applied to an already significantly reduced number of variables.

In fact, the main idea behind the elaboration of STEHYVAS arose from the consideration that wrapper approaches usually show better performance than filters, as they involve the learning algorithm, but are more cumbersome from the computational point of view. Such excessive computational burden makes the adoption of wrapper approaches not fully justifiable compared to the achievable advantages in terms of accuracy. Moreover, the variable set that leads to good performance for a given ML approach often provides similar results with other ML algorithms. Therefore, a hybrid approach is preferable, as it provides competitive advantages with respect to pure wrapper methods by also leading to a more effective and often also faster selection of the relevant variable subset.

In particular, in the present case the combination of a preliminary fast VS stage, based on a filter and an embedded approach, and a second stage, where two wrapper methods are applied, is compared to the application of a single wrapper VS approach. Despite the fact that two wrapper variable selection approaches are used in STEHYVAS, the preliminary reduction of the number of variables to be considered ensures a reasonable overall duration of the VS procedure, which turns out to be comparable to, and sometimes shorter than, that of the considered wrapper VS procedures. When the STEHYVAS procedure takes longer, this means that the first stage was not able to significantly reduce the number of potential variables to be considered by the second stage, and this might also depend on the considered problem/database, e.g. there are not many correlated variables.

It is also worth noting that the computational time depends on the ML model adopted for classification or regression, which is the reason for the very different numerical results obtained in the performed tests even when considering the same dataset and VS algorithm. The training of a Bayesian classifier is known to be faster than that of an SVM-based one. In many applications, NNs are preferred to ECOC classifiers and to SVM-based regressors for multi-class classification and regression tasks, respectively. However, the direct application of a wrapper VS approach exploiting a NN-based model on a dataset holding a relatively low number of samples compared to the number of potential input variables is more prone to overfitting issues, which worsens the stability. Therefore, especially in this specific situation, which is not infrequent in real-world applications, it is of utmost importance to reduce the number of variables to consider before implementing a wrapper VS procedure, and the advantages of STEHYVAS in terms of performance and stability are even more evident.

Finally, it must be underlined that the proposed algorithm serves as a pre-processing stage to identify the variables that mostly affect the target. Therefore, it is typically performed once, or very rarely, in the design phase of the ML-based model, and the time spent to execute it is not relevant in the overall economy of the designed system. On the other hand, VS stability is fundamental for both model design and improved knowledge of the phenomenon under consideration, as it ensures that the performed selection includes all and only the variables that are actually significant. This also contributes to enhancing the computational efficiency of the designed model.

7 Conclusions and Future Work

The paper proposes the STEHYVAS approach to improve the stability of traditional VS algorithms when applied to regression and classification tasks. In fact, the main issue of these algorithms lies in their sensitivity to variations of the training set, which is a serious problem, especially when the main objective is knowledge extraction on the considered system or phenomenon.

STEHYVAS was tested on real datasets and the results demonstrate its efficiency in terms of both accuracy and, mostly, stability. In fact, STEHYVAS improves the VS stability compared to classical methods, without worsening the accuracy of the model. A further advantage of STEHYVAS lies in the fact that it is automatic, namely it does not require any a priori information on the considered dataset. Moreover, STEHYVAS is fully modular, namely each included VS approach can be replaced by other approaches belonging to the same category. Finally, STEHYVAS demonstrates its versatility within different contexts, namely binary and multi-class classification as well as regression problems.

The main limitation of STEHYVAS lies in its exclusive suitability for offline implementation, as the time required to provide the most influential variables is not compatible with an online solution. Nonetheless, it can be used as a preprocessing step to be performed once and possibly periodically repeated as the data volume grows. This feature is common to most VS approaches. A further limitation is represented by the fact that some parameters, e.g. $N_F$ and $N_W$, are fixed and the values proposed here might not be optimal for every possible application.

Future work will focus on the improvement of the algorithm by optimizing its hyperparameters: alternative selection strategies within the early filter-based steps will be assessed, as well as different values of the associated thresholds and of the number of iterations of the various operations (i.e. filter and wrapper applications) at different stages of the algorithm. Furthermore, the stability will be assessed through more sophisticated indexes. Finally, the k-fold cross-validation strategy will be tested instead of splitting the database into training and test sets.

References

Akbari Torkestani J, Meybodi MR (2012) Finding minimum weight connected dominating set in stochastic graph based on learning automata. Inform Sciences 200:57–77. https://doi.org/10.1016/j.ins.2012.02.057
Al Janabi KBS, Kadhim R (2018) Data reduction techniques: a comparative study for attribute selection methods. Int J Adv Computer Sci Tech 8(1):1–13
Alelyani S (2021) Stable bagging feature selection on medical data. J Big data 8(1):1–18. https://doi.org/10.1186/s40537-020-00385-8
Ali S, Smith MK (2006) On learning algorithm selection for classification. Appl Soft Comput 6(2):119–138. https://doi.org/10.1016/j.asoc.2004.12.002
Allwein EL, Schapire RE, Singer Y (2001) Reducing multiclass to binary: A unifying approach for margin classifiers. J Mach Learn Res 1(2):113–141
Andersen CM, Bro R (2010) Variable selection in regression-a tutorial. J Chemometr 24(11–12):728–737. https://doi.org/10.1002/cem.1360
Asdaghi F, Soleimani A (2019) An effective feature selection method for web spam detection. Knowl-Based Syst 166:198–206. https://doi.org/10.1016/j.knosys.2018.12.026
Asuncion A, Newman DJ (2007) Uci machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Bahassine S, Madani A, Al-Sarem M et al (2020) Feature selection using an improved chi-square for arabic text classification. J King Saud University - Comp Inf- Sci 32(2):225–231. https://doi.org/10.1016/j.jksuci.2018.05.010
Breiman L (2001) Random forests. Machine Learn 45(1):5–32. https://doi.org/10.1023/A:1010933404324
Cao Q, Long X (2020) New convergence on inertial neural networks with time-varying delays and continuously distributed delays. AIMS Mathematics 5(6):5955–5968. https://doi.org/10.3934/math.2020381
Cateni S, Colla V (2016) The importance of variable selection for neural networks-based classification in an industrial context. Smart Innovation, Systems and Technologies 54:363–370. https://doi.org/10.1007/978-3-319-33747-0_36
Cateni S, Colla V (2016) Improving the stability of sequential forward and backward variables selection. In: Proc. 15th Int. Conf. Intelligent Systems Design and Applications ISDA 2015, p 374–379, https://doi.org/10.1109/ISDA.2015.7489258
Cateni S, Colla V (2016) Improving the stability of wrapper variable selection applied to binary classification. Int J Comput Inf Sys & Ind Manag Appl 8:214–225
Cateni S, Colla V (2016) Variable selection for efficient design of machine learning-based models: Efficient approaches for industrial applications. Commun Comp Inf Sci 629:352–366. https://doi.org/10.1007/978-3-319-44188-7_27
Cateni S, Colla V (2017) A hybrid variable selection approach for nn-based classification in industrial context. Smart Innov. Sys. 69:173–180. https://doi.org/10.1007/978-3-319-56904-8_17
Cateni S, Colla V, Vannucci M (2009) A fuzzy system for combining different outliers detection methods. In: Proc. IASTED Int. Conf. Artificial Intelligence and Applications, AIA 2009, p 87–93
Cateni S, Colla V, Vannucci M (2014) A hybrid feature selection method for classification purposes. In: Proc. UKSim-AMSS 8th European Modelling Symp. Computer Modelling and Simulation, EMS 2014, p 39–44, https://doi.org/10.1109/EMS.2014.44
Cateni S, Colla V, Vannucci M, et al (2014) A procedure for building reduced reliable training datasets from real-world data. In: Proc. IASTED Int. Conf. Artificial Intelligence and Applications, AIA 2014, p 393–399, https://doi.org/10.2316/P.2014.816-010
Cateni S, Colla V, Vannucci M (2017) A fuzzy system for combining filter features selection methods. Int J Fuzzy Syst 19(4):1168–1180. https://doi.org/10.1007/s40815-016-0208-7
Cateni S, Colla V, Vannucci M (2021) A combined approach for enhancing the stability of the variable selection stage in binary classification tasks. Lect. Notes Comput. Sci., vol 12862 LNCS. p 248–259, https://doi.org/10.1007/978-3-030-85099-9_20
Che J, Yang Y, Li L et al (2017) Maximum relevance minimum common redundancy feature selection for nonlinear data. Inform Sci 409–410:68–86. https://doi.org/10.1016/j.ins.2017.05.013
Cristianini N, Shawe-Taylor J (2000) An Introduction To Support Vector Machines And Other Kernel-based Learning Methods. Cambridge University Press, Cambridge
Degenhardt F, Seifert S, Szymczak S (2019) Evaluation of variable selection methods for random forests and omics data sets. Brief Bioinform 20(2):492–503. https://doi.org/10.1093/bib/bbx124
Dhamodharavadhani S, Rathipriya R (2021) Variable selection method for regression models using computational intelligence techniques. In: Research Anthology on Multi-Industry Uses of Genetic Programming and Algorithms. IGI Global, p 742–761, https://doi.org/10.4018/978-1-7998-8048-6.ch037
Eid HF, Hassanien AE, Kim TH, et al (2013) Linear correlation-based feature selection for network intrusion detection model. Communications in Computer and Information Science, vol 381 CCIS. p 240–248, https://doi.org/10.1007/978-3-642-40597-6_21
Ellies-Oury MP, Chavent M, Conanec A et al (2019) Statistical model choice including variable selection based on variable importance: A relevant way for biomarkers selection to predict meat tenderness. Sci Rep-UK 9(1):1–12. https://doi.org/10.1038/s41598-019-46202-y
Escalera S, Pujol O, Radeva P (2010) On the decoding process in ternary error-correcting output codes. IEEE T Pattern Anal 32(1):120–134. https://doi.org/10.1109/TPAMI.2008.266
Fakhraei S, Soltanian-Zadeh H, Fotouhi F (2014) Bias and stability of single variable classifiers for feature ranking and selection. Expert Syst Appl 41(15):6945–6958. https://doi.org/10.1016/j.eswa.2014.05.007
Fligner MA, Verducci JS, Blower PE (2002) A modification of the jaccard-tanimoto similarity index for diverse selection of chemical compounds using binary strings. Technometrics 44(2):110–119. https://doi.org/10.1198/004017002317375064
Fürnkranz J (2002) Round robin classification. J Mach Learn Res 2(4):721–747. https://doi.org/10.1162/153244302320884605
Gao Z, Wang Y, Xiong J et al (2020) Structural balance control of complex dynamical networks based on state observer for dynamic connection relationships. Complexity. https://doi.org/10.1155/2020/5075487
Genuer R, Poggi JM, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recog Lett 31(14):2225–2236. https://doi.org/10.1016/j.patrec.2010.03.014
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. J Mach Learn Res 9:249–256
Gokalp O, Tasci E, Ugur A (2020) A novel wrapper feature selection algorithm based on iterated greedy metaheuristic for sentiment classification. Expert Syst Appl 146:113176. https://doi.org/10.1016/j.eswa.2020.113176
Gupta D, Richhariya B (2018) Entropy based fuzzy least squares twin support vector machine for class imbalance learning. Appl Intell 48:4212–4231. https://doi.org/10.1007/s10489-018-1204-4
Gupta U, Gupta D (2021) Least squares large margin distribution machine for regression. Appl Intell 51:7058–7093. https://doi.org/10.1007/s10489-020-02166-5
Gupta U, Gupta D, Prasad M (2019) Kernel target alignment based fuzzy least square twin bounded support vector machine. In: Proc. 2018 IEEE Symp. Series on Computational Intelligence, SSCI 2018, p 228 – 235, https://doi.org/10.1109/SSCI.2018.8628903
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Hastie T, Tibshirani R, Friedman J (2008) The Elements Of Statistical Learning, 2nd edn. Springer, Berlin
He K, Zhang X, Ren S, et al (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proc. IEEE Int. Conf. Computer Vision ICCV 2015, p 1026–1034, https://doi.org/10.1109/ICCV.2015.123
Huang L, Ma H, Wang J et al (2020) Global dynamics of a filippov plant disease model with an economic threshold of infected-susceptible ratio. J Appl Anal Comput 10(5):2263–2277. https://doi.org/10.11948/20190409
Jadhav S, He H, Jenkins K (2018) Information gain directed genetic algorithm wrapper feature selection for credit rating. Appl Soft Comput 69:541–553. https://doi.org/10.1016/j.asoc.2018.04.033
Kalousis A, Prados J, Hilario M (2005) Stability of feature selection algorithms. In: Proc. 5th IEEE Int. Conf. on Data Mining (ICDM’05), p 8–15, https://doi.org/10.1109/ICDM.2005.135
Kalousis A, Prados J, Hilario M (2007) Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst 12(1):95–116. https://doi.org/10.1007/s10115-006-0040-8
Khaire UM, Dhanalakshmi R (2019) Stability of feature selection algorithm: A review. J King Saud University - Comp Inf- Sci. https://doi.org/10.1016/j.jksuci.2019.06.012
Kohavi R, John GH (1997) Wrappers for feature selection. Artif Intell 97(1–2):273–324. https://doi.org/10.1016/s0004-3702(97)00043-x
Li B, Wang F, Zhao K (2020) Large time dynamics of 2d semi-dissipative boussinesq equations. Nonlinearity 33(5):2481–2501. https://doi.org/10.1088/1361-6544/ab74b1
Loscalzo S, Yu L, Ding C (2009) Consensus group stable feature selection. In: Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, p 567–575, https://doi.org/10.1145/1557019.1557084
Manickam I, Ramachandran R, Rajchakit G et al (2020) Novel lagrange sense exponential stability criteria for time-delayed stochastic cohen-grossberg neural networks with markovian jump parameters: A graph-theoretic approach. Nonlinear Anal-Model 25(5):726–744. https://doi.org/10.15388/namc.2020.25.16775
Manning CD, Raghavan P, Schütze H (2008) Introduction To Information Retrieval. Cambridge University Press, Cambridge
Maugis C, Celeux G, Martin-Magniette ML (2009) Variable selection in model-based clustering: A general variable role modeling. Comput Stat Data An 53(11):3872–3882. https://doi.org/10.1016/j.csda.2009.04.013
May R, Dandy G, Maier H (2011) Review of Input Variable Selection Methods for Artificial Neural Networks. IntechOpen, chap 2. https://doi.org/10.5772/16004
Mehmood T, Liland KH, Snipen L et al (2012) A review of variable selection methods in partial least squares regression. Chemometr Intell Lab 118:62–69. https://doi.org/10.1016/j.chemolab.2012.07.010
Nocedal J, Wright SJ (2006) Numerical Optimization, 2nd edn. Springer, Berlin
Pearson K (1895) Notes on regression and inheritance in the case of two parents. P R Soc London 58:240–242
Peres FAP, Peres TN, Fogliatto FS et al (2019) Fault detection in batch processes through variable selection integrated to multiway principal component analysis. J Process Contr 80:223–234. https://doi.org/10.1016/j.jprocont.2019.06.002
Robnik-Šikonja M, Kononenko I (2003) Theoretical and empirical analysis of relieff and rrelieff. Machine Learn 53(1–2):23–69. https://doi.org/10.1023/A:1025667309714
Rodriguez-Galiano V, Luque-Espinar JA, Chica-Olmo M et al (2018) Feature selection approaches for predictive modelling of groundwater nitrate pollution: An evaluation of filters, embedded and wrapper methods. Sci Total Environ 624:661–672. https://doi.org/10.1016/j.scitotenv.2017.12.152
Sampathkumar E, Walikar HB (1979) The connected domination number of a graph. J Math Phys Sci 13(6):607–613
Siegel S, Castellan NJJ (1988) Nonparametric Statistics For The Behavioral Sciences, 2nd edn. McGraw-Hill, New York
Souza F, Araújo R, Soares S, et al (2010) Variable selection based on mutual information for soft sensors application. In: Proc. 9th Portuguese Conf. on Automatic Control, p 1–6
Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 100(3–4):441–471. https://doi.org/10.2307/1422689
Sun Y, Robinson M, Adams R, et al (2006) Using feature selection filtering methods for binding site predictions. In: Proc. 5th IEEE Int. Conf. Cognitive Informatics (ICCI ’06), p 566–571, https://doi.org/10.1109/COGINF.2006.365547
Turney P (1995) Technical note: bias and the quantification of stability. Machine Learn 20:23–33. https://doi.org/10.1023/A:1022682001417
Vannucci M, Colla V, Sgarbi M, et al (2009) Thresholded neural networks for sensitive industrial classification tasks. Lect. Notes Comput. Sci., vol 5517 LNCS. p 1320–1327, https://doi.org/10.1007/978-3-642-02478-8_165
Wang J, He S, Huang L (2020) Limit cycles induced by threshold nonlinearity in planar piecewise linear systems of node-focus or node-center type. Int J Bifurcat Chaos 30(11):2050160. https://doi.org/10.1142/S0218127420501606
Wang L, Yang C, Sun Y et al (2018) Effective variable selection and moving window hmm-based approach for iron-making process monitoring. J Process Contr 68:86–95. https://doi.org/10.1016/j.jprocont.2018.04.008
Wang S, Zhu J (2008) Variable selection for model-based high dimensional clustering and its application on microarray data. Biometrics 64(2):440–448. https://doi.org/10.1111/j.1541-0420.2007.00922.x
Yan L, Wen Y, Teo KL et al (2020) Construction of regional logistics weighted network model and its robust optimization: Evidence from china. Complexity. https://doi.org/10.1155/2020/2109423
Yu F, Zhang Z, Liu L et al (2020) Secure communication scheme based on a new 5d multistable four-wing memristive hyperchaotic system with disturbance inputs. Complexity. https://doi.org/10.1155/2020/5859273
Yu L, Ding C, Loscalzo S (2008) Stable feature selection via dense feature groups. In: Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, p 803–811, https://doi.org/10.1145/1401890.1401986
Zagaria M, Dimastromatteo V, Colla V (2010) Monitoring erosion and skull profile in blast furnace hearth. Ironmak Steelmak 37(3):229–234. https://doi.org/10.1179/030192309X12595763237003
Zhang Y, Ling C (2018) A strategy to apply machine learning to small datasets in materials science. npj Comp Mater 4(1):1–8. https://doi.org/10.1038/s41524-018-0081-z


Funding

Open access funding provided by Scuola Superiore Sant’Anna within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

TeCIP Institute and Department of Excellence in Robotics & AI, Scuola Superiore Sant’Anna, via Moruzzi, 1, 56127, Pisa, Italy
Silvia Cateni, Valentina Colla & Marco Vannucci

Authors

Silvia Cateni
Valentina Colla
Marco Vannucci

Corresponding author

Correspondence to Valentina Colla.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: The ReliefF and RReliefF Methods

ReliefF and RReliefF are two filter approaches of the Relief family. In the present work they are adopted within STEHYVAS as the second filter method, the first one being the correlation-based method.

ReliefF is suitable for multi-class classification, as it computes the weights of the variables when the target y is a categorical variable. In the initial step $m=0$ of the algorithm, the weights associated with all the variables are null, i.e. $w_j^0 = 0 $ $ \forall j$; then an iterative procedure starts, which considers all the $N_{TR}$ samples included in the training database ${\mathcal {D}}_{TS}$ (for instance, in the present case $N_{TR}= 0.75 N_{TOT}$). At the generic step $m>0$, the algorithm selects one sample $\mathbf{x}_r \in {\mathcal {D}}_{TS}$ and finds its k nearest neighbors. For the generic nearest neighbor $\mathbf{x}_q$, the weight $w_j$ of the j-th variable is updated depending on whether $\mathbf{x}_r$ and $\mathbf{x}_q$ belong to the same class. In particular, the weight $w^m_j$ of $x_j$ at step m is updated as follows:

$$\begin{aligned} w^m_j= {\left\{ \begin{array}{ll} w^{m-1}_j-\frac{\Delta _j(\mathbf{x}_r,\mathbf{x}_q)}{N_{TR}} \cdot d_{rq} &{} \text {if }{} \mathbf{x}_r, \mathbf{x}_q\text { are in the same class } \\ w^{m-1}_j+\frac{P_{yq}}{1-P_{yr}} \cdot \frac{\Delta _j(\mathbf{x}_r,\mathbf{x}_q)}{N_{TR}} \cdot d_{rq} &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

(A1)

where

$P_{yr}$ is the prior probability of the class to which $\mathbf{x}_r$ belongs.
$P_{yq}$ is the prior probability of the class to which $\mathbf{x}_q$ belongs.
$\Delta _j(\mathbf{x}_r,\mathbf{x}_q)$ is computed as:
$$\begin{aligned} \Delta _j(\mathbf{x}_r,\mathbf{x}_q) = {\left\{ \begin{array}{ll} 0 &{} \text {if } x_{r,j}= x_{q,j}\\ 1 &{} \text {if } x_{r,j} \ne x_{q,j} \end{array}\right. } \end{aligned}$$
(A2)
if $x_j$ is a discrete variable, or as:
$$\begin{aligned} \Delta _j ( \mathbf{x}_r,\mathbf{x}_q) = \frac{ \vert \mathrm {x_{r,j} } - \mathrm {x_{q,j} }\vert }{ \max _{\mathbf{x} \in {\mathcal {D}}_{TS}}{ \mathrm {x_j}}- \min _{\mathbf{x} \in {\mathcal {D}}_{TS}}{\mathrm {x_j}}} \end{aligned}$$
(A3)
if $x_j$ is a continuous variable.
$d_{rq}$ is a function measuring the distance between $\mathbf{x}_r$ and $\mathbf{x}_q$, which is computed as follows:
$$\begin{aligned} d_{rq} = \frac{{\tilde{d}}_{rq}}{\sum _{h=1}^{k} { {\tilde{d}}_{rh} }} \text { being } {\tilde{d}}_{rq} = e^{- \left( \frac{rank(r,q)}{\sigma } \right) ^2 } \end{aligned}$$
(A4)

where rank(r, q) is the position of $\mathbf{x}_q$ among the k nearest neighbors of $\mathbf{x}_r$, and $\sigma $ is a positive scaling factor. It is also possible to set $\sigma = \infty $, which implies ${\tilde{d}}_{rh} = 1$ $ \forall h$, and $d_{rq}=\frac{1}{k}$, namely all the neighbors have the same influence.
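
The update rule in (A1) to (A4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the present work: all input variables are taken to be continuous (so only (A3) is needed), the class priors are estimated as empirical class frequencies, and the neighbors are searched with a Manhattan distance on range-normalized variables. The function name `relieff_weights` is hypothetical.

```python
import numpy as np


def relieff_weights(X, y, k=10, sigma=np.inf):
    """Sketch of ReliefF for continuous inputs X (n x p) and class labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                      # constant columns give Delta_j = 0
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))     # assumed: empirical class priors
    ranks = np.arange(1, k + 1)
    d = np.exp(-(ranks / sigma) ** 2)          # rank-based influence, as in (A4)
    d /= d.sum()                               # sigma = inf gives d = 1/k
    w = np.zeros(p)                            # w_j^0 = 0 for all j
    for r in range(n):                         # one step per training sample
        delta = np.abs(X - X[r]) / span        # Delta_j(x_r, x) as in (A3)
        dist = delta.sum(axis=1)               # assumed neighbor-search metric
        dist[r] = np.inf                       # x_r is not its own neighbor
        nearest = np.argsort(dist, kind="stable")[:k]
        for rank, q in enumerate(nearest):
            step = delta[q] / n * d[rank]
            if y[q] == y[r]:
                w -= step                      # same class: first case of (A1)
            else:                              # different class: second case
                w += prior[y[q]] / (1.0 - prior[y[r]]) * step
    return w
```

A variable whose values differ mostly between neighbors of different classes accumulates a positive weight, while a variable that differs equally often within and across classes stays close to zero.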

RReliefF, on the other hand, is used with a continuous target and employs intermediate weights to compute the final weights of the variables. In particular, given two nearest neighbors, let $w_{dy}$ denote the weight of having different values for the target y, $w_{dj}$ the weight of having different values for the variable $x_j$, and $w_{dy\hat{ }dj}$ the weight of having different target values and different values for $x_j$. At the initial step $m=0$, RReliefF sets $w^0_{dy}=w^0_{dj}=w^0_{dy \hat{ } dj}=w^0_j=0$; then an iterative procedure starts, which considers all the $N_{TR}$ samples included in the training database ${\mathcal {D}}_{TS}$. At the generic step $m>0$, the algorithm selects one sample $\mathbf{x}_r \in {\mathcal {D}}_{TS}$, finds its k nearest neighbors and, for each neighbor $\mathbf{x}_q$, updates the intermediate weights as follows:

$$\begin{aligned} w^m_{dy}= & {} w^{m-1}_{dy}+\Delta _y(\mathbf{x}_r,\mathbf{x}_q) \cdot d_{rq} \end{aligned}$$

(A5)

$$\begin{aligned} w^m_{dj}= & {} w^{m-1}_{dj}+\Delta _j(\mathbf{x}_r,\mathbf{x}_q) \cdot d_{rq} \end{aligned}$$

(A6)

$$\begin{aligned} w^m_{dy \hat{ } dj}= & {} w^{m-1}_{dy \hat{ } dj}+\Delta _y(\mathbf{x}_r,\mathbf{x}_q) \cdot \Delta _j(\mathbf{x}_r,\mathbf{x}_q) \cdot d_{rq} \end{aligned}$$

(A7)

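where $\Delta _j(\mathbf{x}_r,\mathbf{x}_q)$ and $d_{rq}$ are defined as for ReliefF, while $\Delta _y(\mathbf{x}_r,\mathbf{x}_q)$ is the normalized difference between the values of the target y associated with $\mathbf{x}_r$ and $\mathbf{x}_q$, named $y_r$ and $y_q$, respectively, and is computed as follows: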

$$\begin{aligned} \Delta _y(\mathbf{x}_r,\mathbf{x}_q)=\frac{\vert y_r-y_q\vert }{\max _{\mathbf{x} \in {\mathcal {D}}_{TS}}{y} - \min _{\mathbf{x} \in {\mathcal {D}}_{TS}}{y}} \end{aligned}$$

(A8)

Finally, RReliefF computes the weight $w_j$ of the variable $x_j$ after fully updating all the intermediate weights, as follows:

$$\begin{aligned} w^m_j=\frac{w^m_{dy \hat{ } dj}}{w^m_{dy}} - \frac{w^m_{dj}-w^m_{dy \hat{ } dj}}{N_{TR}-w^m_{dy}} \end{aligned}$$

(A9)
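
The RReliefF procedure in (A5) to (A9) can be sketched in Python in the same spirit. This is an illustrative sketch only: it assumes continuous input variables, a non-constant target, and a Manhattan distance on range-normalized variables for the neighbor search. The intermediate weight $w_{dj}$ accumulates $\Delta _j$, following the standard RReliefF formulation. The function name `rrelieff_weights` is hypothetical.

```python
import numpy as np


def rrelieff_weights(X, y, k=10, sigma=50.0):
    """Sketch of RReliefF for continuous inputs X (n x p) and continuous target y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    x_span = X.max(axis=0) - X.min(axis=0)
    x_span[x_span == 0] = 1.0                  # constant columns give Delta_j = 0
    y_span = y.max() - y.min()                 # assumed non-zero
    ranks = np.arange(1, k + 1)
    d = np.exp(-(ranks / sigma) ** 2)          # rank-based influence, as in (A4)
    d /= d.sum()
    w_dy = 0.0                                 # all intermediate weights start at 0
    w_dj = np.zeros(p)
    w_dydj = np.zeros(p)
    for r in range(n):                         # one step per training sample
        delta_x = np.abs(X - X[r]) / x_span    # Delta_j as in (A3)
        delta_y = np.abs(y - y[r]) / y_span    # Delta_y as in (A8)
        dist = delta_x.sum(axis=1)             # assumed neighbor-search metric
        dist[r] = np.inf                       # x_r is not its own neighbor
        nearest = np.argsort(dist, kind="stable")[:k]
        for rank, q in enumerate(nearest):
            w_dy += delta_y[q] * d[rank]                   # (A5)
            w_dj += delta_x[q] * d[rank]                   # (A6), with Delta_j
            w_dydj += delta_y[q] * delta_x[q] * d[rank]    # (A7)
    # Final weights as in (A9), with N_TR = n.
    return w_dydj / w_dy - (w_dj - w_dydj) / (n - w_dy)
```

In this sketch a variable that changes together with the target between neighbors receives a larger weight than a variable that changes independently of it.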

As for the parameter values adopted in the implementation of ReliefF and RReliefF in the present work, the scaling factor was set to $\sigma =+\infty $ in ReliefF (i.e. for classification tasks) and to $\sigma =50$ in RReliefF (i.e. for regression tasks), while the number of considered neighbors was set to $k=10$ for both methods.
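
To give a feeling for what these settings imply, the short Python sketch below evaluates the neighbor influence $d_{rq}$ of (A4) for $k=10$ with both values of $\sigma $. The helper name `neighbor_weights` is hypothetical and the snippet only illustrates the formula.

```python
import numpy as np


def neighbor_weights(k, sigma):
    """Influence d_rq of the neighbor of rank 1..k, following (A4)."""
    ranks = np.arange(1, k + 1)
    raw = np.exp(-(ranks / sigma) ** 2)    # unnormalized rank-based weight
    return raw / raw.sum()                 # normalize so the weights sum to 1


uniform = neighbor_weights(10, np.inf)     # setting used for ReliefF
decayed = neighbor_weights(10, 50.0)       # setting used for RReliefF
ratio = decayed[-1] / decayed[0]           # farthest vs. nearest neighbor
```

With $\sigma =+\infty $ every neighbor has influence $1/k = 0.1$. With $\sigma =50$ and $k=10$ the decay is mild: the farthest neighbor keeps roughly 96% of the influence of the nearest one.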

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Cateni, S., Colla, V. & Vannucci, M. Improving the Stability of the Variable Selection with Small Datasets in Classification and Regression Tasks. Neural Process Lett 55, 5331–5356 (2023). https://doi.org/10.1007/s11063-022-10916-4


Accepted: 06 June 2022
Published: 10 June 2022
Issue Date: October 2023
DOI: https://doi.org/10.1007/s11063-022-10916-4
