1 Introduction

Greedy algorithms are currently mainly used to iteratively select a reduced and appropriate number of examples according to some error indicators, and hence to produce surrogate and sparse models (Dutta et al. 2021; De Marchi et al. 2005; Santin and Haasdonk 2017; Wenzel et al. 2021, 2023; Wirtz and Haasdonk 2013). The ambition of this paper is to analyze and extend greedy methods to work in the significantly more challenging case of feature reduction, i.e., as the computational core for feature-ranking schemes in the framework of classification issues.

The importance of this application follows from the fact that, as supervised learning models are usually trained on a reduced number of features, the sparsity enhancement is a crucial issue for statistical learning procedures. The most popular feature reduction procedures include Lasso regression (Tibshirani 1996) and variations of the classical Lasso (Group Lasso (Yuan et al. 2006), Adaptive Lasso (Zou 2006), Adaptive Poisson re-weighted Lasso (Guastavino and Benvenuto 2019), to mention a few), linear Support Vector Machine (SVM) feature ranking (Guyon et al. 2002), Fisher score-based schemes (Duda et al. 2012), methods based on mutual information (Peng et al. 2005), and Relief and its variants (Robnik-Šikonja and Kononenko 2003). Nevertheless, given any classifier, which can in principle be highly non-linear (e.g., a neural network), none of those algorithms is able to capture all the most relevant features for that specific classifier. For instance, Lasso and its generalizations show drawbacks in feature selection ability when there exist non-linear dependence structures (Freijeiro-González et al. 2022).

More generally, all the above-mentioned methods identify an optimal subset of features based on general patterns in the data. Among them, schemes based on fuzzy information may help in taking into account the correlation between features (Yin et al. 2024, 2023). More recently, wrapper and embedded schemes have gained popularity; we refer the reader to Bommert et al. (2020) for a general overview. The former use machine learning algorithms to search for the optimal subset of features by considering all possible feature combinations (Bajer et al. 2020), while for the latter, feature selection is integrated or built into the classifier algorithm (Zebari et al. 2020).

In this paper, we propose the so-called greedy scheme that falls in the class of wrapper feature selection methods, but unlike the classical approaches, such as recursive feature elimination (RFE) or recursive feature augmentation (RFA) (Guyon et al. 2002) and forward step-wise selection (James et al. 2023), our method is fully model-dependent and target-based, meaning that any accuracy score can be maximized during the iterative process. Indeed, given any score and any classifier, the feature-based greedy methods iteratively select the most important feature at each step in a classifier-dependent fashion.

At a more theoretical level, this study investigates the effectiveness of the greedy scheme in terms of the Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis 1971), which is a complexity indicator common to any classifier, such as Feed-forward Neural Networks (FNNs), and is related to the empirical risk (Bartlett and Mendelson 2002). As a particular instance, we further investigate how greedy methods behave for kernel-based classifiers, such as SVMs (Shawe-Taylor and Cristianini 2004), and in doing so we consider a particular complexity score, known as kernel alignment. These theoretical findings are then tested on both synthetic and benchmark datasets, showing that with a greedy feature selection we are able to find a minimal set of features without any accuracy loss. Moreover, we apply greedy methods to a case study concerning the classification and prediction of severe geomagnetic events triggered by solar flares.

Solar flares (Piana et al. 2022) are the most explosive manifestations of the active Sun and the main trigger of space weather (Schwenn 2006). They may be followed by coronal mass ejections (CMEs) (Kahler 1992), which, in turn, may generate geomagnetic storms potentially impacting both space and on-earth technological assets (Gonzalez et al. 1994). Data-driven approaches forecasting these events leverage machine learning algorithms trained against historical archives containing physical features extracted from remote-sensing data such as solar images or time series of physical parameters acquired from in-situ instruments (Bobra and Couvidat 2015; Camporeale et al. 2018; Florios et al. 2018; Guastavino et al. 2023; Telloni et al. 2023). These archives systematically provide a huge amount of descriptors and it is currently well-established that this redundancy of information significantly hampers the prediction performances of the classifiers (Campi et al. 2019). Our feature-based greedy scheme is applied in this context, in order to identify among the features the redundant ones and hopefully to improve the classification performances.

The paper is organized as follows. Section 2 introduces our greedy feature selection scheme, which will be motivated thanks to the theoretical analysis in Subsections 2.1 and 2.2. Section 3 describes the application of greedy feature selection to simulated, benchmark and real datasets. Our conclusions are offered in Sect. 4.

2 Greedy feature ranking schemes

Given a set of examples depending on several features, greedy methods are frequently used to find an optimal subset of examples and, since they can be target-dependent, they have already proved effective for this task (see e.g. Wenzel et al. (2021, 2023, 2024)). Here, instead of focusing on the examples, we turn our attention to the problem of feature selection. To this aim, we consider a binary classification problem with training examples

$$\begin{aligned} \Xi = X \times Y = \{ ({\varvec{x}}_1, y_1), \ldots ,({\varvec{x}}_n, y_n)\}, \end{aligned}$$
(1)

where \({\varvec{x}}_i \in \Omega \subseteq \mathbb {R}^d\) and \(y_i \in \mathbb {R}\). For the particular case of the binary classification setting, we fix \(y_i \in \{-1, +1\}\).

In the machine learning framework, feature reduction is typically performed by means of linear models, and once the features are identified, non-linear methods like neural networks are applied to predict the given task. However, the fact that some specific features could be useful for some classifier does not imply that the same feature is relevant for any classification model, and this is probably the main weakness of current feature reduction methods in this context. Conversely, our feature-based greedy method (see e.g. Temlyakov (2008) for a general overview) will consist in iteratively selecting the most important feature at each step and in agreement with the considered classifier.

To reach this objective, as usually done, we split the initial dataset \(\Xi = X \times Y\) into training and validation sets, respectively denoted by \({\mathcal {X}} \times {\mathcal {Y}}\) and \({\mathfrak {X}} \times {\mathfrak {Y}}\). After the \((k-1)\)-th greedy step, \(X^{(k-1)}\) consists of the \(k-1\) features that have already been selected (without loss of generality, the first \(k-1\)). At the k-th greedy step, on \({\mathcal {X}}^{(k-1)} \times {\mathcal {Y}}^{(k-1)}\) we train \(d-k+1\) models \({{{\mathcal {M}}}}_p\), each using the features \(x_1,\ldots , x_{k-1},x_p\), \(p=k, \ldots , d\). Then, given an accuracy score \(\mu \) (the larger the better), we select the k-th feature as

$$\begin{aligned} x_k = {{\,\textrm{argmax}\,}}_{p=k, \ldots , d} \mu ({{{\mathcal {M}}}}_p({\mathfrak {X}}^{(k-1)}),{\mathfrak {Y}}^{(k-1)}). \end{aligned}$$
(2)

We point out that any model can be used in (2), and this implies a totally target-dependent feature selection, which also accounts for the model used to predict a given task.
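As a minimal sketch of the selection rule (2), the following Python snippet performs one greedy step on a single training/validation split. It assumes a hypothetical nearest-centroid classifier as a stand-in for the generic model \({{{\mathcal {M}}}}_p\) and plain accuracy as the score \(\mu \); both choices, as well as all function names, are illustrative assumptions and not the models or scores used in this paper.

```python
import numpy as np

def fit_centroids(X, y):
    """Hypothetical stand-in for the model M_p: store the mean of each class."""
    return {label: X[y == label].mean(axis=0) for label in (-1, 1)}

def predict_centroids(model, X):
    """Assign each example to the class with the closest centroid."""
    d_neg = np.linalg.norm(X - model[-1], axis=1)
    d_pos = np.linalg.norm(X - model[1], axis=1)
    return np.where(d_pos < d_neg, 1, -1)

def accuracy(y_pred, y_true):
    """Score mu (the larger the better); any other score could be plugged in."""
    return float(np.mean(y_pred == y_true))

def greedy_step(X_tr, y_tr, X_val, y_val, selected):
    """Return the not-yet-selected feature index maximizing the validation score."""
    best_p, best_score = None, -np.inf
    for p in range(X_tr.shape[1]):
        if p in selected:
            continue
        cols = selected + [p]                     # already selected features plus x_p
        model = fit_centroids(X_tr[:, cols], y_tr)
        score = accuracy(predict_centroids(model, X_val[:, cols]), y_val)
        if score > best_score:
            best_p, best_score = p, score
    return best_p, best_score

# Synthetic check: only feature 2 (0-based) determines the label.
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 5))
y = np.where(X[:, 2] > 0.5, 1, -1)
X_tr, y_tr, X_val, y_val = X[:300], y[:300], X[300:], y[300:]
first_feature, first_score = greedy_step(X_tr, y_tr, X_val, y_val, selected=[])
```

Replacing `fit_centroids`, `predict_centroids` and `accuracy` with another classifier and another score leaves `greedy_step` unchanged, which is the sense in which the selection is classifier-dependent and target-dependent.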

In the following we investigate the effects of the proposed scheme in terms of VC dimension and for particular instances of kernel learning theory, while a stopping criterion for the algorithm is discussed later in view of the incoming analysis and trade-off remarks.

2.1 The VC dimension in the greedy framework

We consider the dataset (1), where we now suppose that \(\Omega =\bigotimes _{k=1}^d\Omega ^k\) with \(\Omega ^k=[a_k,b_k] \subset \mathbb {R}\). Given a classifying function \(f:\Omega \longrightarrow Y\) we consider the zero–one loss function

$$\begin{aligned} c({\varvec{x}},y,f)= \frac{1}{2}|f({\varvec{x}})-y|, \end{aligned}$$

which is 0 if \(f({\varvec{x}})=y\) and 1 otherwise. From this loss, we can define the empirical risk

$$\begin{aligned} {\hat{e}}(\Xi ,f)= \frac{1}{n}\sum _{i=1}^{n}{c({\varvec{x}}_i,y_i,f)}. \end{aligned}$$
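For labels in \(\{-1,+1\}\), the zero–one loss and the empirical risk above can be evaluated directly. The snippet below is a small illustration on four hand-made examples; the classifier `f` (the sign of the first feature) is a hypothetical choice made only for this example.

```python
import numpy as np

def zero_one_loss(f_x, y):
    """c(x, y, f) = |f(x) - y| / 2, which is 0 or 1 for labels in {-1, +1}."""
    return 0.5 * np.abs(f_x - y)

def empirical_risk(f, X, y):
    """Mean zero-one loss of the classifier f over the dataset (X, y)."""
    return float(np.mean(zero_one_loss(f(X), y)))

# Hypothetical classifier: sign of the first feature.
f = lambda X: np.where(X[:, 0] >= 0, 1, -1)

X = np.array([[0.5, 1.0], [-1.0, 2.0], [2.0, -1.0], [-0.3, 0.0]])
y = np.array([1, -1, -1, 1])
risk = empirical_risk(f, X, y)   # the last two examples are misclassified
```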

Assuming that \(\Xi \) is sampled from some fixed unknown probability distribution \(p({\varvec{x}},y)\) on \(\Omega \times Y\), we note that the empirical risk is the empirical mean value of the so-called generalization risk

$$\begin{aligned} e(f)= \int _{\Omega \times Y}{c({\varvec{x}},y,f)\,\textrm{d}p({\varvec{x}},y)}, \end{aligned}$$

i.e., it is the mean value of c averaged over all possible test samples generated by \(p({\varvec{x}},y)\), and hence it represents the misclassification probability. However, minimizing the empirical risk does not necessarily correspond to a low generalization risk (refer, e.g., to Schölkopf and Smola (2002, §5) or Vapnik (1998, §5 & §6)). Indeed, statistical learning theory shows that a class of models rich enough to make the empirical risk very small may generalize poorly, so that the generalization capability also depends on the complexity of the class. This general idea can be formalized in different ways, such as via the VC dimension. In order to define it, we need to introduce the concept of shattering. Let \(\Xi _1,\dots ,\Xi _{2^n}\) be all the different datasets obtained by taking all possible configurations of labels assigned to the examples in X. A class \({\mathcal {F}}\) shatters the set X if for every dataset \(\Xi _i\), \(i=1,\dots ,2^n\), there exists a function \(f:\Omega \longrightarrow Y\), \(f\in {\mathcal {F}}\), such that \({\hat{e}}(\Xi _{i},f)=0\).

Definition 1

The VC dimension of a class \({\mathcal {F}}\) of classifying functions is the largest natural number s such that there exists a set X of s examples that can be shattered by \({\mathcal {F}}\). If such s does not exist, then the VC dimension is \(\infty \).
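To make the notion of shattering concrete, the following sketch checks it by brute force for a finite class. The class used here, one-dimensional threshold classifiers (decision stumps) with both orientations on a finite grid of thresholds, is an assumption chosen for illustration and does not appear elsewhere in the paper. It shatters two points but cannot realize the labelling \((+1,-1,+1)\) on three ordered points.

```python
import itertools
import numpy as np

def shatters(classifiers, X):
    """True if every labelling of X in {-1, +1}^n is realized by some classifier."""
    realized = {tuple(int(v) for v in f(X)) for f in classifiers}
    return all(labels in realized
               for labels in itertools.product((-1, 1), repeat=len(X)))

def stump(t, s):
    """1D threshold classifier: +1 where s * (x - t) > 0, and -1 elsewhere."""
    return lambda X: np.where(s * (X - t) > 0, 1, -1)

# Finite grid of thresholds, each with both orientations s = -1 and s = +1.
stumps = [stump(t, s) for t in np.linspace(-1.0, 4.0, 51) for s in (-1, 1)]

two_points = np.array([0.5, 2.5])
three_points = np.array([0.5, 1.5, 2.5])
can_shatter_two = shatters(stumps, two_points)
can_shatter_three = shatters(stumps, three_points)
```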

Let us consider a class \({\mathcal {F}}\) of classifying functions on \(\Omega \) whose VC dimension is \(s<n\). Then, if \(f\in {\mathcal {F}}\) and \(\delta \in (0,1)\), the bound

$$\begin{aligned} e(f)\le {\hat{e}}(\Xi ,f)+C(s,n,\delta ), \end{aligned}$$

holds with probability at least \(1-\delta \), where the so-called capacity term is

$$\begin{aligned} C(s,n,\delta )= \sqrt{\frac{1}{n}\bigg (s\bigg (\log {\frac{2n}{s}}+1\bigg )+\log {\frac{4}{\delta }}\bigg )}. \end{aligned}$$

The generalization risk (and thus the test error) is bounded by the sum of the empirical risk (that is, the training error) and the capacity term of the class, which is monotonically increasing with the VC dimension. If we choose a poor class, we get a low VC dimension but possibly a high empirical risk; this situation is usually called underfitting. On the other hand, by choosing a rich class we can obtain a very small empirical risk, but the VC dimension, and thus the capacity term, is likely to be large; this condition is called overfitting. In the following, our purpose is to study how the VC dimension evolves during the greedy steps. It is natural to guess that the capacity of a classifier increases if the information contained in an added feature is considered.
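The behaviour of the capacity term can be checked numerically. The snippet below simply evaluates \(C(s,n,\delta )\) as defined above; the values of \(n\), \(\delta \) and \(s\) are arbitrary choices made for illustration.

```python
import math

def capacity(s, n, delta):
    """Capacity term C(s, n, delta) for VC dimension s < n and confidence 1 - delta."""
    return math.sqrt((s * (math.log(2 * n / s) + 1) + math.log(4 / delta)) / n)

n, delta = 1000, 0.05
# The term grows with the VC dimension s for a fixed number of examples n.
values = [capacity(s, n, delta) for s in (5, 20, 80)]

# Upper bound on the generalization risk for a hypothetical empirical risk of 0.10.
bound = 0.10 + capacity(5, n, delta)
```

For fixed \(n\) and \(\delta \), the derivative of \(s(\log (2n/s)+1)\) with respect to \(s\) is \(\log (2n/s)\), which is positive for \(s<2n\); this is the monotonicity referred to in the text.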

Definition 2

Let \({\mathcal {F}}\) be a class of binary classifying functions \(f:\Omega \longrightarrow Y\). Letting \({\varvec{e}}_k\) be the k-th canonical basis vector, we define the k-blind class \({\mathcal {F}}^{(k)}\), \(k\in \{1,\dots ,d\}\), \({\mathcal {F}}^{(k)}\subseteq {\mathcal {F}}\) as the class of functions \(f^{(k)}\in {\mathcal {F}}\), \(f^{(k)}:\Omega \longrightarrow Y\), such that

$$\begin{aligned} f^{(k)}({\varvec{x}}) = f^{(k)}({\varvec{x}}+\delta {\varvec{e}}_k), \end{aligned}$$

for any \(\delta \in \mathbb {R}\) such that \({\varvec{x}}+\delta {\varvec{e}}_k\in \Omega \).

For example, consider the class of functions

$$\begin{aligned} {\mathcal {F}}_{W,{\varvec{b}}}:=\{f:\Omega \longrightarrow Y\,|\, f({\varvec{x}})={\tilde{f}}(W{\varvec{x}}+{\varvec{b}})\}, \end{aligned}$$

where \({\tilde{f}}\) is the activation function, W is a \(r\times d\) matrix and \({\varvec{b}}\) is a \(r\times 1\) vector, \(r\ge 1\). Many well-known classifiers are included in \({\mathcal {F}}_{W,{\varvec{b}}}\), such as neural networks and linear models. In this setting, classifiers in \({\mathcal {F}}^{(k)}_{W,{\varvec{b}}}\) can be constructed by restricting to W and \({\varvec{b}}\) such that \(W_{:,k}={\textbf{0}}\), where \(W_{:,k}\) is the k-th column of W, and \(b_k=0\).

Remark 1

As \({\mathcal {F}}^{(k)}\subseteq {\mathcal {F}}\), it trivially follows that \(\textrm{VC}({\mathcal {F}}^{(k)})\le \textrm{VC}({\mathcal {F}})\).

In order to formally prove that by adding a feature in the greedy step the obtained classifier cannot be less expressive (in terms of VC dimension) than the previous one, we introduce two maps:

  • \(\pi _{k}:\Omega \longrightarrow \bigotimes _{\begin{array}{c} i=1\\ i\ne k \end{array}}^d\Omega ^i\), so that \(\pi _{k}({\varvec{x}})=(x_1,\dots , x_{k-1},x_{k+1},\dots ,x_{d})\), which is a projection.

  • \(\iota _{\alpha }:\pi _{k}(\Omega )\longrightarrow \Omega \), \(\alpha \in \Omega ^k\), so that \(\iota _{\alpha }({\varvec{x}})=(x_1,\dots , x_{k-1},\alpha ,x_{k+1},\dots ,x_{d})\), which is injective.

Note that applying \(\iota _{\alpha }\circ \pi _k\) to X has the effect of setting to \(\alpha \) the k-th feature for all the examples.
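The two maps and the k-blind property can be illustrated numerically. In the sketch below, `pi_k` and `iota_alpha` are hypothetical helper names for \(\pi _k\) and \(\iota _\alpha \) (with 0-based column indices), and the classifier is a linear model with \(r=1\) whose weight on the k-th feature is set to zero, which is one assumed way of building a k-blind function.

```python
import numpy as np

def pi_k(X, k):
    """Projection pi_k: drop the k-th feature (0-based column index)."""
    return np.delete(X, k, axis=1)

def iota_alpha(X_red, k, alpha):
    """Injection iota_alpha: re-insert a constant column alpha at position k."""
    return np.insert(X_red, k, alpha, axis=1)

rng = np.random.default_rng(1)
d, k, alpha = 4, 2, 0.3
w = rng.normal(size=d)
w[k] = 0.0                       # zero weight on feature k makes f k-blind
b = 0.1
f = lambda X: np.where(X @ w + b >= 0, 1, -1)

X = rng.uniform(size=(6, d))
X_const = iota_alpha(pi_k(X, k), k, alpha)   # k-th feature set to alpha everywhere

# A k-blind classifier assigns the same labels to X and to iota_alpha(pi_k(X)).
same_labels = np.array_equal(f(X), f(X_const))
```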

Proposition 1

X is shattered by \({\mathcal {F}}^{(k)}\) if and only if \(\iota _{\alpha }(\pi _{k}(X))\) is shattered by \({\mathcal {F}}^{(k)}\).

Proof

No classifier in \({\mathcal {F}}^{(k)}\) can rely on the k-th feature. Precisely, for each \({\varvec{x}}_i\in X\) we can find \(\delta _i\in \mathbb {R}\) so that \({\varvec{x}}_i+\delta _i{\varvec{e}}_k\in \iota _{\alpha }(\pi _{k}(X))\), and by Definition 2 every function in \({\mathcal {F}}^{(k)}\) takes the same value at \({\varvec{x}}_i\) and at \({\varvec{x}}_i+\delta _i{\varvec{e}}_k\). Hence, \({\mathcal {F}}^{(k)}\) shatters X if and only if it shatters \(\iota _{\alpha }(\pi _{k}(X))\). \(\square \)

For any function \(f^{(k)}\in {\mathcal {F}}^{(k)}\) and \(\alpha \in \Omega ^k\), we can define a classifier \(g:\pi _{k}(\Omega )\longrightarrow Y\) such that \(g({\varvec{x}})=f^{(k)}(\iota _\alpha ({\varvec{x}}))\). Denoting by \({\mathcal {G}}\) the class consisting of such functions g, we achieve the following result.

Proposition 2

\(\iota _{\alpha }(\pi _{k}(X))\) is shattered by \({\mathcal {F}}^{(k)}\) if and only if \(\pi _{k}(X)\) is shattered by \({\mathcal {G}}\).

Proof

Assume that \({\mathcal {F}}^{(k)}\) shatters \(\iota _{\alpha }(\pi _{k}(X))\). Note that the shattering does not rely on the k-th feature, which is constant, and therefore this is equivalent to shattering \(\pi _{k}(\iota _{\alpha }(\pi _{k}(X)))=\pi _{k}(X)\) in a lower-dimensional space by means of classifiers g such that \(f^{(k)}=g\circ \pi _{k}\). Finally, by defining \({\varvec{x}}^{(k)}=\pi _{k}({\varvec{x}})\), \({\varvec{x}}\in \iota _{\alpha }(\pi _{k}(X))\), we further obtain \({\varvec{x}}=\iota _{\alpha }({\varvec{x}}^{(k)})\), and therefore \(g({\varvec{x}}^{(k)})=f^{(k)}(\iota _\alpha ({\varvec{x}}^{(k)}))\) for \({\varvec{x}}^{(k)}\in \pi _{k}(X)\), which completes the proof. \(\square \)

Corollary 1

We have that \(\textrm{VC}({\mathcal {G}})\le \textrm{VC}({\mathcal {F}})\).

Proof

By putting together Propositions 1 and 2 we can affirm that X is shattered by \({\mathcal {F}}^{(k)}\) if and only if \(\pi _{k}(X)\) is shattered by \({\mathcal {G}}\). Note that X and \(\pi _{k}(X)\) have the same cardinality, and therefore \(\textrm{VC}({\mathcal {G}})= \textrm{VC}({\mathcal {F}}^{(k)})\). We conclude the proof by virtue of Remark 1. \(\square \)

The results in Corollary 1 formalize the idea that by adding a feature in the greedy step the obtained classifier cannot be less expressive than the previous one. Nevertheless, in this greedy context we face a sort of trade-off that deals with the VC dimension: precisely, a high VC dimension allows the model to fit more complex patterns but may lead to overfitting. Hence, we will discuss later robust stopping criteria for the greedy iterative rule. Now, as a particular case study, we consider SVM classifiers, which are probably the most frequently used ones. Further, since they are based on kernels, other capability measures concerning such classifiers can be straightforwardly studied.

2.2 SVM in the greedy framework

Following the SVM literature, we turn our attention to strictly positive definite kernels \(\kappa : \Omega \times \Omega \longrightarrow \mathbb {R}\) that satisfy

$$\begin{aligned} \int _{\Omega }\int _{\Omega } \kappa ({\varvec{x}},{\varvec{z}}) v({\varvec{x}}) v({\varvec{z}}) d{\varvec{x}} d {\varvec{z}} \ge 0, \quad \forall v \in L_2(\Omega ), \end{aligned}$$

for \({\varvec{x}}, {\varvec{z}} \in \Omega \). Such kernels can then be decomposed via Mercer's theorem as (see e.g. Fasshauer (2007, Theorem 2.2, p. 107) or Mercer (1909)):

$$\begin{aligned} \kappa ({\varvec{x}}, {\varvec{z}}) = \sum _{k \ge 0} \lambda _k \rho _k({\varvec{x}}) \rho _k({\varvec{z}}),\quad {\varvec{x}}, {\varvec{z}}\in \Omega , \end{aligned}$$

where \(\{\lambda _k \}_{k \ge 0}\) are the (non-negative) eigenvalues and \(\{ \rho _k \}_{k \ge 0}\) are the (\(L_2\)-orthonormal) eigenfunctions of the operator \(T: L_2(\Omega ) \longrightarrow L_2(\Omega )\), given by

$$\begin{aligned} T[v]({\varvec{x}}) = \int _{\Omega } \kappa ({\varvec{x}},{\varvec{z}}) v({\varvec{z}}) d{\varvec{z}}. \end{aligned}$$

Mercer's theorem provides a convenient background for introducing feature maps and spaces. Indeed, for Mercer kernels we can interpret the series representation in terms of an inner product in the so-called feature space F, which is a Hilbert space, namely

$$\begin{aligned} \kappa ({\varvec{x}}, {\varvec{z}}) = \langle \Phi ({\varvec{x}}), \Phi ({\varvec{z}}) \rangle _{F},\quad {\varvec{x}}, {\varvec{z}}\in \Omega , \end{aligned}$$

where \(\Phi :\Omega \longrightarrow F\) is a feature map. For a given kernel, the feature map and space are not unique. A possible choice is the map \(\Phi ({\varvec{x}})= \kappa (\cdot , {\varvec{x}})\), which is linked to the characterization of F as a reproducing kernel Hilbert space; see Fasshauer and McCourt (2015); Shawe-Taylor and Cristianini (2004) for further details. Both in the machine learning literature and in approximation theory, radial kernels are very common. They are kernels for which there exists a Radial Basis Function (RBF) \( \varphi : \mathbb {R}_{+} \longrightarrow \mathbb {R}\), where \(\mathbb {R}_{+}= [0,\infty )\), and (possibly) a shape parameter \(\gamma >0\) such that, for all \({\varvec{x}},{\varvec{z}} \in \Omega \),

$$\begin{aligned} \kappa ({\varvec{x}},{\varvec{z}})= \kappa _{\gamma }({\varvec{x}},{\varvec{z}})=\varphi _{\gamma }( ||{\varvec{x}}-{\varvec{z}}||_2)=\varphi (r), \end{aligned}$$

where \(r=||{\varvec{x}}-{\varvec{z}}||_2\). Among all radial kernels, we remark that the Gaussian one is given by

$$\begin{aligned} \kappa ({\varvec{x}},{\varvec{z}})= \kappa _{\gamma }({\varvec{x}},{\varvec{z}})= \textrm{e}^{- \gamma \Vert {\varvec{x}}-{\varvec{z}}\Vert _2^2} =\textrm{e}^{-\gamma r^2}. \end{aligned}$$

In the following, for simplicity, we omit the dependence on \(\gamma \), which is also known as scale parameter in machine learning literature.

With radial kernels as well, SVMs can be used for classification purposes, and several complexity indicators, such as the kernel alignment, can be studied in order to gain a better understanding of the greedy strategy based on SVM, i.e., when the generic classifier in (2) is an SVM function. The notion of kernel alignment was first introduced by Cristianini et al. (2001) and later investigated in e.g. Wang et al. (2015). Other common complexity indicators related to the alignment can be found in Donini and Aiolli (2017). Given two kernels \(\kappa _1\) and \(\kappa _2: \Omega \times \Omega \longrightarrow \mathbb {R}\), the empirical alignment evaluates the similarity between the corresponding kernel matrices. It is given by

$$\begin{aligned} \textsf {A}(X,\textsf {K}_1,\textsf {K}_2) = \dfrac{\left( \textsf {K}_1,\textsf {K}_2\right) _{\textsf {F}}}{\sqrt{||\textsf {K}_1||_{\textsf {F}} ||\textsf {K}_2||_{\textsf {F}}}}, \end{aligned}$$

where \(\textsf {K}_1:=\textsf {K}_1(X)\) and \(\textsf {K}_2:=\textsf {K}_2(X)\) denote the Gram matrices for the kernels \(\kappa _1\) and \(\kappa _2\) on X, respectively, and

$$\begin{aligned} \left( \textsf {K}_1,\textsf {K}_2\right) _{\textsf {F}} = \sum _{i,j=1}^n \kappa _1({\varvec{x}}_i,{\varvec{x}}_j) \kappa _2({\varvec{x}}_i,{\varvec{x}}_j). \end{aligned}$$

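Here and in what follows, consistently with the normalization used below for the target matrix, \(||\textsf {K}||_{\textsf {F}}\) is understood as \(\left( \textsf {K},\textsf {K}\right) _{\textsf {F}}\). The alignment can thus be seen as a similarity score based on the cosine of the angle between the two matrices. For arbitrary matrices, this score ranges between \(-1\) and 1.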

For classification purposes we can define an ideal target matrix as \(\textsf {Y}={\varvec{y}}{\varvec{y}}^{\intercal }\), where \({\varvec{y}}=(y_1,\ldots ,y_n)^{\intercal }\) is the vector of labels. Then the empirical alignment between the kernel matrix \(\textsf {K}\) and the target matrix \(\textsf {Y}\) can be written as:

$$\begin{aligned} \textsf {A}(X,\textsf {K},\textsf {Y})= \dfrac{\left( \textsf {K},\textsf {Y}\right) _{\textsf {F}}}{\sqrt{||\textsf {K}||_{\textsf {F}} ||\textsf {Y}||_{\textsf {F}}}}=\dfrac{\left( \textsf {K},\textsf {Y}\right) _{\textsf {F}}}{n \sqrt{||\textsf {K}||_{\textsf {F}} }}. \end{aligned}$$

Such alignment with the target matrix is an indicator of the classification capacity of a classifier. Indeed, higher alignment scores correspond to a separation of the data with a low bound on the generalization error (Wang et al. 2015).
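The empirical alignment with the target matrix can be computed in a few lines. The sketch below assumes the Gaussian kernel introduced above and reads \(||\textsf {K}||_{\textsf {F}}\) as \(\left( \textsf {K},\textsf {K}\right) _{\textsf {F}}\), which is the reading under which \(\sqrt{||\textsf {Y}||_{\textsf {F}}}=n\); the function names and the four-point dataset are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(X, gamma=1.0):
    """Gram matrix of the Gaussian kernel exp(-gamma * ||x - z||_2^2) on X."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)

def frobenius_inner(A, B):
    """Frobenius inner product (A, B)_F = sum_ij A_ij * B_ij."""
    return float(np.sum(A * B))

def alignment(K1, K2):
    """Empirical alignment: cosine of the angle between K1 and K2."""
    return frobenius_inner(K1, K2) / np.sqrt(
        frobenius_inner(K1, K1) * frobenius_inner(K2, K2))

# Two well-separated clusters with opposite labels.
X = np.array([[0.0], [0.1], [2.0], [2.1]])
y = np.array([1, 1, -1, -1])
n = len(y)

K = gaussian_gram(X, gamma=1.0)
Y = np.outer(y, y)                       # ideal target matrix y y^T
a = alignment(K, Y)

# Equivalent expression with the target norm written explicitly as n.
a_explicit = frobenius_inner(K, Y) / (n * np.sqrt(frobenius_inner(K, K)))
```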

We now prove the following result which will be helpful in understanding our greedy approach.

Theorem 1

Given two kernels \(\kappa _1\) and \(\kappa _2: \Omega \times \Omega \longrightarrow \mathbb {R}\), if \(||\textsf {K}_2||_{\textsf {F}} \ge ||\textsf {K}_1||_{\textsf {F}}\) then \(\textsf {A}(X,\textsf {K}_1,\textsf {Y}) \le \textsf {A}(X,\textsf {K}_2,\textsf {Y})\).

Proof

By hypothesis we have that:

$$\begin{aligned} \textsf {A}(X,\textsf {K}_1,\textsf {Y})&= \dfrac{\left( \textsf {K}_1,\textsf {Y}\right) _{\textsf {F}}}{n \sqrt{||\textsf {K}_1||_{\textsf {F}} }} \le \dfrac{\left( \textsf {K}_1,\textsf {Y}\right) _{\textsf {F}}}{n \sqrt{||\textsf {K}_2||_{\textsf {F}} }}. \end{aligned}$$

Then, by adding and subtracting \({\left( \textsf {K}_2,\textsf {Y}\right) _{\textsf {F}}}\) at the numerator, and thanks to the linearity of the Frobenius inner product, we obtain:

$$\begin{aligned} \textsf {A}(X,\textsf {K}_1,\textsf {Y})&\le \dfrac{\left( \textsf {K}_1,\textsf {Y}\right) _{\textsf {F}}}{n \sqrt{||\textsf {K}_2||_{\textsf {F}} }}\\&= \dfrac{\left( \textsf {K}_1-\textsf {K}_2,\textsf {Y}-\textsf {Y}\right) _{\textsf {F}}}{n \sqrt{||\textsf {K}_2||_{\textsf {F}} }}+\dfrac{\left( \textsf {K}_2,\textsf {Y}\right) _{\textsf {F}}}{n \sqrt{||\textsf {K}_2||_{\textsf {F}} }}\\&= \textsf {A}(X,\textsf {K}_2,\textsf {Y}). \end{aligned}$$

\(\square \)

Considering again Eq. (2), as a corollary of the previous theorem, we have the following result.

Corollary 2

If \(\kappa \) is a non-increasing radial kernel, then

$$\begin{aligned} \textsf {A}(X^{(k)},\textsf {K}(X^{(k)}),\textsf {Y}) \ge \textsf {A}(X^{(k-1)},\textsf {K}(X^{(k-1)}),\textsf {Y}). \end{aligned}$$

Proof

Since \(\varphi : \mathbb {R}_{+} \longrightarrow \mathbb {R}\) is non-increasing, for \({\varvec{x}}, {\varvec{z}} \in \mathbb {R}^{k}\) we obtain

$$\begin{aligned}&\varphi \left( \Vert {\varvec{x}}-{\varvec{z}}\Vert _2\right) = \\&=\varphi (\Vert (x_1,x_2,\ldots ,x_k)- (z_1,z_2,\ldots ,z_k)\Vert _2) \le \\&\le \varphi (\Vert (x_1,x_2,\ldots ,x_{k-1})- (z_1,z_2,\ldots ,z_{k-1})\Vert _2), \end{aligned}$$

which in particular implies that

$$\begin{aligned} \textsf {K}_{ij}(X^{(k-1)}) \ge \textsf {K}_{ij} (X^{(k)})\ge 0, \quad i,j=1, \ldots , n. \end{aligned}$$

Thus, we get

$$\begin{aligned} \Vert \textsf {K}(X^{(k-1)}) \Vert _{\textsf {F}} \ge \Vert \textsf {K}(X^{(k)}) \Vert _{\textsf {F}}, \end{aligned}$$

and hence

$$\begin{aligned} \textsf {A}(X^{(k)},\textsf {K}(X^{(k)}),\textsf {Y}) \ge \textsf {A}(X^{(k-1)},\textsf {K}(X^{(k-1)}),\textsf {Y}). \end{aligned}$$

\(\square \)
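The entrywise inequality between Gram matrices used in the proof can be observed numerically. The following sketch assumes the Gaussian kernel with \(\gamma =1\) and random uniform data (both illustrative assumptions), and compares the Gram matrices built on the first \(k-1\) and the first \(k\) features. The norm computed here is the standard Frobenius norm, which is monotone in the same way as \(\left( \textsf {K},\textsf {K}\right) _{\textsf {F}}\).

```python
import numpy as np

def gaussian_gram(X, gamma=1.0):
    """Gram matrix of the Gaussian kernel exp(-gamma * ||x - z||_2^2) on X."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)

rng = np.random.default_rng(2)
X = rng.uniform(size=(50, 6))

# Frobenius norm of the Gram matrix built on the first k features, k = 1, ..., 6.
norms = [np.linalg.norm(gaussian_gram(X[:, :k]), "fro") for k in range(1, 7)]

# Entrywise comparison between k - 1 = 3 and k = 4 features
# (a small tolerance absorbs floating-point rounding).
K_prev = gaussian_gram(X[:, :3])
K_next = gaussian_gram(X[:, :4])
entrywise_decreasing = bool(np.all(K_next <= K_prev + 1e-12))
```

Adding a feature can only increase the distance between two examples, so a non-increasing radial function can only decrease the corresponding Gram entries.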

The result shown in Corollary 2 formalizes again the fact that at each greedy step, the obtained classifier cannot be less expressive than the previous one. Note that this kind of feature augmentation strategy via greedy schemes shows some similarities with the so-called Variably Scaled Kernels (VSKs), first introduced in Bozzini et al. (2015) and recently applied in the framework of inverse problems, see e.g. Perracchione et al. (2023, 2021). Indeed, both approaches are based on adding features and both are again characterized by a trade-off between the model capacity, which can be characterized by the kernel alignment, and the model accuracy. To achieve a good trade-off between these two factors we need a stopping criterion for the iterative rule shown in (2).

2.3 Stopping criterion

In actual applications, the greedy iterative algorithm should select, at first, the most relevant features, and then, if no relevant features are available, any accuracy score should saturate. Among several scores \( \mu \), a robust one is the so-called True Skill Statistic (TSS), owing to its insensitivity to class imbalance (Bloomfield et al. 2012). Precisely, letting TN, FP, FN and TP denote, respectively, the numbers of true negatives, false positives, false negatives and true positives, the TSS is defined by:

$$\begin{aligned} \textrm{TSS}(\textrm{TN,FP,FN,TP})= \; \textrm{recall}(\textrm{TN,FP,FN,TP}) \nonumber \\ + \; \textrm{specificity}(\textrm{TN,FP,FN,TP})-1, \end{aligned}$$

where

$$\begin{aligned} \textrm{recall}(\textrm{TN,FP,FN,TP}) = \dfrac{\textrm{TP}}{\mathrm{FN+TP}}, \end{aligned}$$
(3)

and

$$\begin{aligned} \textrm{specificity} (\textrm{TN,FP,FN,TP}) = \dfrac{\textrm{TN}}{\mathrm{FP+TN}}. \end{aligned}$$
(4)
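The definitions (3) and (4) translate directly into code. The snippet below also illustrates the insensitivity to class imbalance mentioned above: the confusion-matrix counts are made-up numbers, and multiplying the negative class by ten leaves the TSS unchanged.

```python
def recall(tn, fp, fn, tp):
    """Fraction of positive examples that are correctly classified, Eq. (3)."""
    return tp / (fn + tp)

def specificity(tn, fp, fn, tp):
    """Fraction of negative examples that are correctly classified, Eq. (4)."""
    return tn / (fp + tn)

def tss(tn, fp, fn, tp):
    """True Skill Statistic: recall + specificity - 1, ranging in [-1, 1]."""
    return recall(tn, fp, fn, tp) + specificity(tn, fp, fn, tp) - 1.0

# Hypothetical counts: same per-class rates, different class proportions.
balanced = tss(tn=40, fp=10, fn=5, tp=45)
imbalanced = tss(tn=400, fp=100, fn=5, tp=45)
```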

In order to introduce a stopping criterion, we need to point out that we construct a greedy feature ranking by considering, at each step, q splits of the dataset into training and validation sets. Moreover, we denote by \(\{x_s\}_{s \in J}\) the \( k-1 \) features already selected when the k-th step of the greedy algorithm begins, where J is the set of indices associated with these \( k-1 \) features (\(\text {card}(J)=k-1\)). Then, at the \(k\)-th step of the greedy algorithm, each one of the \(d-k+1\) datasets, composed of the \( k-1 \) selected features and the added one \(x_p\), for each \(p \in I {\setminus } J\), with \(I = \{1,\ldots ,d\}\), is divided into training and validation sets. We denote such training and validation sets by \({\mathcal {X}}_{p,h}^{(k-1)} \times {\mathcal {Y}}_{p,h}^{(k-1)}\) and \({\mathfrak {X}}_{p,h}^{(k-1)} \times {\mathfrak {Y}}_{p,h}^{(k-1)}\), for \(h=1,\dots ,q\). Hence, once the models \({{{\mathcal {M}}}}_{p,h}\), for each \(p \in I \setminus J\) and \(h=1,\dots ,q\), have been trained, the \(k\)-th feature is chosen so that:

$$\begin{aligned} \hspace{-0.3cm} x^* = {{\,\textrm{argmax}\,}}_{p \in I \setminus J} \mu ^{(k)}_{p}, \end{aligned}$$
(5)

with

$$\begin{aligned} \mu ^{(k)}_{p} = \frac{1}{q}\sum _{h=1}^q \mu ({{{\mathcal {M}}}}_{p,h}({\mathfrak {X}}_{p,h}^{(k-1)}),\mathfrak {Y }_{p,h}^{(k-1)}), \end{aligned}$$
(6)

and where \(\mu \) is the TSS score. Finally, the set of selected features is updated to \(\{x_s\}_{s \in J \cup \{ l \}}\), where l is the index such that \(x_l=x^*\), i.e., J is replaced by \(J \cup \{ l \}\).

Letting \(m^{(k)}\) be the average of the TSS scores computed on different folds at the k-th step and \(\sigma ^{(k)}\) the associated standard deviation, we stop the greedy iteration at the k-th step if:

$$\begin{aligned} r^{(k)}=\dfrac{|m^{(k+1)}-m^{(k)}|}{\sqrt{((\sigma ^{(k+1)})^2+(\sigma ^{(k)})^2)} }< \tau , \end{aligned}$$
(7)

where \(\tau >0\) is a given threshold. By doing so, we stop the greedy algorithm when the added feature does not contribute to the accuracy score. In order to better understand this fact, we provide in the following a numerical experiment with synthetic data. When dealing with real data, we might stop the greedy iteration as shown in (7), but then select only the first \(k^*\) features, where \(k^*\) is

$$\begin{aligned} k^* ={{\,\textrm{argmax}\,}}_{j=1,\ldots ,k} m^{(j)}. \end{aligned}$$
(8)
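As a small worked example of (7) and (8), the snippet below applies the stopping rule to a made-up sequence of mean TSS values and standard deviations; the numbers and the helper names are hypothetical and serve only to show the mechanics. Steps are 1-based, as in the text.

```python
import math

def stopping_ratio(m_k, m_next, s_k, s_next):
    """r^(k) in (7): score variation normalized by the combined standard deviation."""
    return abs(m_next - m_k) / math.sqrt(s_next ** 2 + s_k ** 2)

def stop_step(m, sigma, tau):
    """First step k (1-based) such that r^(k) < tau, or None if it never happens."""
    for k in range(1, len(m)):
        if stopping_ratio(m[k - 1], m[k], sigma[k - 1], sigma[k]) < tau:
            return k
    return None

def best_k(m, k):
    """k* in (8): step (1-based) with the largest mean score among the first k."""
    return max(range(1, k + 1), key=lambda j: m[j - 1])

# Hypothetical mean TSS values and standard deviations at steps 1, ..., 5.
m = [0.60, 0.75, 0.90, 0.901, 0.899]
sigma = [0.05, 0.04, 0.03, 0.03, 0.03]

k_stop = stop_step(m, sigma, tau=0.09)   # the score saturates after step 3
k_star = best_k(m, k_stop)
```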
Table 1 List of notations in the greedy algorithm

We refer the reader to Algorithm 1 for the greedy pseudo-code, while the list of symbols and notations is reported in Table 1.

Algorithm 1 Pseudo-code for the greedy feature ranking algorithm.
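The steps described in this section can be sketched in Python as follows. This is not the pseudo-code of Algorithm 1, but a reconstruction from the rules (5)–(8) under several assumptions: a hypothetical nearest-centroid classifier replaces the generic model, the q splits are random hold-out splits with 25% of the data used for validation, and the population standard deviation is used for \(\sigma ^{(k)}\). All function names are illustrative.

```python
import numpy as np

def tss(y_true, y_pred):
    """True Skill Statistic (recall + specificity - 1) for labels in {-1, +1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    return tp / (tp + fn) + tn / (tn + fp) - 1.0

def fit_predict(X_tr, y_tr, X_val):
    """Hypothetical stand-in classifier: nearest class centroid."""
    c_pos = X_tr[y_tr == 1].mean(axis=0)
    c_neg = X_tr[y_tr == -1].mean(axis=0)
    d_pos = np.linalg.norm(X_val - c_pos, axis=1)
    d_neg = np.linalg.norm(X_val - c_neg, axis=1)
    return np.where(d_pos < d_neg, 1, -1)

def mean_std_score(X, y, cols, q, rng, val_frac=0.25):
    """Mean, as in (6), and standard deviation of the TSS over q random splits."""
    n_val = int(val_frac * len(y))
    scores = []
    for _ in range(q):
        perm = rng.permutation(len(y))
        val, tr = perm[:n_val], perm[n_val:]
        y_pred = fit_predict(X[tr][:, cols], y[tr], X[val][:, cols])
        scores.append(tss(y[val], y_pred))
    return float(np.mean(scores)), float(np.std(scores))

def greedy_feature_ranking(X, y, q=5, tau=0.09, seed=0):
    """Return (ranking up to the stopping step, mean scores, first k* features)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    J, m, sigma = [], [], []          # selected indices, mean scores, std devs
    k_stop = d                        # used if criterion (7) never triggers
    while len(J) < d:
        stats = {p: mean_std_score(X, y, J + [p], q, rng)
                 for p in range(d) if p not in J}
        best = max(stats, key=lambda p: stats[p][0])      # rule (5)
        J.append(best)
        m.append(stats[best][0])
        sigma.append(stats[best][1])
        if len(J) >= 2:
            denom = np.sqrt(sigma[-1] ** 2 + sigma[-2] ** 2)
            diff = abs(m[-1] - m[-2])
            r = diff / denom if denom > 0 else (0.0 if diff == 0 else np.inf)
            if r < tau:               # criterion (7): stop at step k = len(J) - 1
                k_stop = len(J) - 1
                break
    k_star = int(np.argmax(m[:k_stop])) + 1               # rule (8)
    return J[:k_stop], m[:k_stop], J[:k_star]

# Synthetic check: the label depends only on features 0 and 1 (0-based).
rng = np.random.default_rng(0)
X = rng.uniform(size=(800, 6))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)
ranking, mean_scores, selected = greedy_feature_ranking(X, y)
```

Since the last trained feature is needed to evaluate \(r^{(k)}\), the loop trains step \(k+1\) before deciding to stop at step \(k\); the feature added at step \(k+1\) is then discarded from the returned ranking.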

3 Numerical experiments

The first numerical experiment aims to show numerically the convergence of the greedy algorithm and the efficacy of the stopping rule. We then test our scheme on a benchmark dataset and compare it with state-of-the-art techniques. Finally, we show an application in the context of space weather, which illustrates how this general method is able to deal with real data and to provide insight into the physical aspects of the problem.

3.1 Experiments with a toy dataset

We first focus on the application of the non-linear SVM greedy technique to a balanced simulated dataset constructed as follows: we consider the set \(X = \{{\varvec{x}}_i\}_{i=1}^n \) of \(n=1000\) random points in dimension \(d=15\) sampled from a uniform distribution over \([0,1)^d\) and the set of corresponding function values \(\{f_{\alpha ,i}=f_{\alpha }({\varvec{x}}_i)\}_{i=1}^n\), where \(f_{\alpha }: [0,1)^d \longrightarrow \mathbb {R}\) is defined as

$$\begin{aligned} \begin{aligned} f_{\alpha }({\varvec{x}})&= e^{x_1^2}+e^{x_2}+3x_3+2\cos {(x_4x_5)}\\&\quad +4x_6^2 +10^{{\alpha }}\sum _{j=7}^{d} x_j, \end{aligned} \end{aligned}$$
(9)

and \(\alpha \in \{-8,-6,-4,-2\}\). Each \(f_{\alpha ,i}\) is then labeled according to a threshold value to obtain the set of outputs \(Y=\{y_i\}\), i.e., \(y_i = 1\) if \(f_{\alpha ,i}\) is greater than the mean value attained by \(f_\alpha \), and \(y_i=-1\) otherwise. From (9) we note that, when \(\alpha \) is lower than \(-4\), only the first 6 features (i.e., \(x_j\) for \(j=1,\dots ,6\)) are meaningful for classification purposes, while the contribution of the remaining ones is negligible. The classifier used in the following is an SVM model for which both the scale parameter of the Gaussian kernel and the box constraint are optimized via standard cross-validation. The results of using such a classifier in the greedy scheme are reported in Table 2. This table contains the greedy ranking of the features \(x_j\), \(j=1,\dots ,d\), and the TSS values obtained at each step by averaging over 7 different validation sets. Letting \(\tau =9\times 10^{-2}\) be the threshold for the stopping criterion in (7), the greedy algorithm selects the features reported in Table 2 above the black solid line. As expected, the algorithm selects only the first six features (the most relevant ones) when \(\alpha \) is small enough (\(\alpha \le -6 \)). Then, as soon as the remaining features become more meaningful, the greedy selection takes more features into account. In this didactic example we report all the TSS values until the end, to emphasise the robustness of our procedure, which correctly identifies the most relevant features.
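A dataset of this kind can be generated as in the following sketch. It assumes that the "mean value attained by \(f_\alpha \)" is the sample mean over the n points and uses an arbitrary random seed, so it does not reproduce the exact data behind Table 2; the function names are illustrative.

```python
import numpy as np

def f_alpha(X, alpha):
    """Function (9); column j - 1 of X holds the feature x_j."""
    x = X.T
    return (np.exp(x[0] ** 2) + np.exp(x[1]) + 3 * x[2]
            + 2 * np.cos(x[3] * x[4]) + 4 * x[5] ** 2
            + 10.0 ** alpha * X[:, 6:].sum(axis=1))

def make_toy_dataset(alpha, n=1000, d=15, seed=0):
    """Uniform points in [0, 1)^d, labeled by thresholding f_alpha at its sample mean."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, d))
    values = f_alpha(X, alpha)
    y = np.where(values > values.mean(), 1, -1)
    return X, y

X, y = make_toy_dataset(alpha=-8)

# For alpha = -8 the features x_7, ..., x_15 change f_alpha by less than 9e-8.
X_zeroed = X.copy()
X_zeroed[:, 6:] = 0.0
max_effect = float(np.max(np.abs(f_alpha(X, -8) - f_alpha(X_zeroed, -8))))
```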

Table 2 Feature ranking for the greedy scheme on the dataset generated as in (9)

3.2 Experiments with a benchmark dataset

As a second test we consider a benchmark dataset and we compare different feature extraction algorithms. We take the Breast Cancer Wisconsin (Diagnostic) dataset (Wolberg et al. 1995), freely available at the UCI repository at https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic. The classification task consists in predicting whether a tumor is malignant or benign on the basis of 569 examples, each made of 30 features computed from digitized images of fine needle aspirates of breast masses. Some of the features are, e.g., the radius, texture, perimeter and area of the cancer mass.

We then compare our greedy feature selection strategy with LASSO, RFE, and random forest selection, as implemented in the Python scikit-learn package. Again, the classifier used in the following is an SVM model for which both the scale parameter of the Gaussian kernel and the box constraint are optimized via standard cross-validation. The results show that the greedy strategy, being tailored to the classifier, identifies very few relevant features, namely only 6. Random forest identifies 17 relevant features, while both RFE and LASSO select 24 features, i.e., almost all of them. The accuracy scores returned by all the groups of selected features on 4 test folds are reported in Table 3. Precisely, we compute the TSS as reference score, the Heidke Skill Score (HSS) (Heidke 1926), precision, recall (see Eq. (3)), specificity (see Eq. (4)), F1 score (which is the harmonic mean of precision and recall), and balanced accuracy (which is the arithmetic mean of recall and specificity). We can observe that the SVM classifier trained with only a few greedily selected features achieves about the same accuracy scores as the SVM trained with all, or almost all (RFE, LASSO), features.
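For reference, the scores listed above can all be computed from the four entries of a binary confusion matrix. The sketch below uses the standard two-class definitions, which we assume coincide with those of Eqs. (3) and (4); the function name `skill_scores` is hypothetical, and no guard against empty classes (zero denominators) is included.

```python
def skill_scores(tp, fn, fp, tn):
    """Skill scores from a binary confusion matrix (standard definitions)."""
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    hss = 2 * (tp * tn - fp * fn) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    )
    return {
        "TSS": recall + specificity - 1,
        "HSS": hss,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "F1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }


scores = skill_scores(tp=40, fn=10, fp=5, tn=45)
```

Note that the TSS and the balanced accuracy carry the same information, since TSS \(= 2\,\cdot \) balanced accuracy \(- 1\).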

These examples already show the ability of the greedy schemes to eliminate redundant information, and hence to find a small subset of features. In the next section, we further consider their application to noisy and real data. Moreover, we test the proposed strategy with other classifiers, such as neural networks.

Table 3 Average scores for the Breast Cancer dataset obtained with SVM using different subsets of features

3.3 Applications to solar physics: geo-effectiveness prediction

We now focus on a significant space weather application, i.e., the prediction of severe geomagnetic events based on the use of in-situ data. More specifically, data-driven methods addressing this task typically utilize features acquired by in-situ instruments at Lagrangian point L1 (i.e., the Lagrangian point between the Sun and the Earth) to forecast a significant decrease of the SYM-H index, i.e., the expression of the geomagnetic disturbance at Earth (Wanliss and Showalter 2006).

3.3.1 The dataset and the models

The dataset we use consists of a collection of solar wind, geomagnetic and energetic indices. In particular, it is composed of \(N=7888320\) examples and \(d=15\) features sampled at each minute from 1 January 2005 to 31 December 2019. Below we summarize the features we use:

  1. B [nT], the magnetic field intensity, and B\(_\textrm{x}\), B\(_\textrm{y}\) and B\(_\textrm{z}\) [nT], its three coordinates.

  2. V [km/s], the velocity of the solar wind, and V\(_\textrm{x}\), V\(_\textrm{y}\) and V\(_\textrm{z}\) [km/s], its three coordinates.

  3. T, the proton temperature, and \(\rho \), the proton density number [cm\(^{-3}\)].

  4. E\(_\textrm{k}\), E\(_\textrm{m}\), E\(_\textrm{t}\), the kinetic, magnetic and total energies.

  5. H\(_\textrm{m}\), the magnetic helicity.

  6. SYM-H [nT], a geomagnetic activity index that quantifies the level of geomagnetic disturbance.

The first ten features are acquired at the Lagrangian point L1 by in-situ instruments, the energies and the magnetic helicity are dimensionless derived quantities, and the SYM-H is measured at Earth. The task considered in what follows consists in identifying the most relevant features for predicting whether a geomagnetic event occurs, i.e., whether the SYM-H is less than \(-50\) nT (label 1), or not (label \(-1\)). The dataset at our disposal is highly unbalanced: the rate of positive events is about 2.5\(\%\). In order to carry out our data analysis, we first need to fix the notation. We denote by \({\tilde{X}} = \{\tilde{{\varvec{x}}_i} \}_{i=1}^N \subseteq \Omega \), where \(\Omega \subseteq \mathbb {R}^d\), the set of input samples and by \({\tilde{Y}}=\{{\tilde{y}}_i \}_{i=1}^N\), with \({\tilde{y}}_i \in \{-1,1\}\), the set of associated labels. The features denoted by \(\tilde{{x}_j}\), \(j=1,\ldots ,d\), represent respectively B, B\(_\mathrm{{x}}\), B\(_\mathrm{{y}}\), B\(_\mathrm{{z}}\), V, V\(_\mathrm{{x}}\), V\(_\mathrm{{y}}\), V\(_\mathrm{{z}}\), T, \(\rho \), E\(_\mathrm{{k}}\), E\(_\mathrm{{m}}\), E\(_\mathrm{{t}}\), H\(_\mathrm{{m}}\) and the SYM-H.

The analysis is performed with data aggregated by hours, i.e., letting \(m=60\), \(n =N/m\) and

$$\begin{aligned} {\varvec{x}}_i=\dfrac{1}{m}\sum _{k=(i-1)m+1}^{im} \tilde{{\varvec{x}}_k}, \quad i=1,\ldots ,n, \end{aligned}$$

we focus on \({X} = \{{{\varvec{x}}_i} \}_{i=1}^n \subseteq \Omega \). Similarly, we define the set of aggregated labels \({Y}=\{{y}_i \}_{i=1}^n\).
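As an illustration, the hourly aggregation can be sketched as a block average over the minute-resolution samples. The sketch assumes consecutive, non-overlapping blocks of \(m\) minutes, which is consistent with \(n=N/m\); the function name `aggregate_hourly` is hypothetical, and the aggregation of the labels is not reproduced here.

```python
import numpy as np


def aggregate_hourly(X_minute, m=60):
    """Average consecutive, non-overlapping blocks of m rows (assumed scheme)."""
    N, d = X_minute.shape
    n = N // m  # number of complete blocks; trailing rows are dropped
    return X_minute[: n * m].reshape(n, m, d).mean(axis=1)


# Toy input: 120 "minutes" of 2 features give 2 aggregated examples.
X_minute = np.arange(240, dtype=float).reshape(120, 2)
X_hourly = aggregate_hourly(X_minute)
```

With \(N=7888320\) and \(m=60\) this scheme would yield \(n=131472\) aggregated examples.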

Given X and Y, the first step of our study consists in using different feature selection approaches to rank the features according to their relevance (see Subsection 3.3.2). After this step, we investigate how these results can be exploited to improve the prediction task (see Subsection 3.3.3). In doing so, we use both SVM and a Feed-forward Neural Network (FNN) in order to predict whether a geo-effective event occurs or not in the next hour. Specifically, the SVM algorithm is trained by performing a randomized and cross-validated search over the hyper-parameters of the model (the regularization parameter \(C\) and the kernel coefficient \(\gamma \)), which are drawn from uniform distributions on \(I_C=[0.1, 1000]\) and \(I_{\gamma }=[0.001, 0.1]\) respectively. The FNN architecture, instead, is characterized by 7 hidden layers. The Rectified Linear Unit (ReLU) function is used to activate the hidden layers, the sigmoid activation function is applied to the output, and the binary cross-entropy is used as loss function. The model is trained over 200 epochs using the Adam optimizer with learning rate equal to 0.001 and a mini-batch size of 64 examples. In order to prevent overfitting, an \(L^2\) regularization with coefficient 0.01 is applied to the first two layers. Further, we make use of an early stopping strategy to select the best epoch with respect to the validation loss.
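A randomized, cross-validated search of this kind can be sketched with scikit-learn as follows. The intervals \(I_C\) and \(I_{\gamma }\) are those stated above, whereas the number of sampled configurations, the number of folds, the scoring function and the helper name `tune_rbf_svm` are assumptions made only for illustration. Balanced accuracy is used as the score because it is an increasing affine function of the TSS.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC


def tune_rbf_svm(X, y, n_iter=20, cv=5, seed=0):
    """Randomized CV search over C and gamma for a Gaussian-kernel SVM."""
    # scipy's uniform(loc, scale) is supported on [loc, loc + scale].
    param_distributions = {
        "C": uniform(loc=0.1, scale=1000 - 0.1),          # I_C = [0.1, 1000]
        "gamma": uniform(loc=0.001, scale=0.1 - 0.001),   # I_gamma = [0.001, 0.1]
    }
    search = RandomizedSearchCV(
        SVC(kernel="rbf"),
        param_distributions,
        n_iter=n_iter,                 # assumed number of sampled configurations
        scoring="balanced_accuracy",   # assumed score, equal to (TSS + 1) / 2
        cv=cv,                         # assumed number of folds
        random_state=seed,
    )
    return search.fit(X, y)


# Toy usage on synthetic data (not the solar wind dataset).
rng = np.random.default_rng(0)
X_toy = rng.random((120, 4))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 1.0, 1, -1)
search = tune_rbf_svm(X_toy, y_toy, n_iter=8, cv=3)
```

After fitting, `search.best_params_` holds the selected pair \((C,\gamma )\) and `search.best_estimator_` the SVM refitted with it.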

3.3.2 Greedy feature selection approaches

In order to apply our greedy strategy efficiently to both SVM and FNN, we first consider a subset \( X_p \) of the original dataset \( X \) with a reduced number of examples: we take \( p= 3333 \) examples. The so-constructed ranking is compared to a state-of-the-art method, i.e., the Lasso feature selection. Precisely, the active set of features returned by Lasso is composed of: B\(_\mathrm{{x}}\), B\(_\mathrm{{y}}\), B\(_\mathrm{{z}}\), V\(_\mathrm{{y}}\), V\(_\mathrm{{z}}\), T, \(\rho \), E\(_\mathrm{{k}}\), E\(_\mathrm{{m}}\), E\(_\mathrm{{t}}\), H\(_\mathrm{{m}}\) and the SYM-H. Note that neither V nor B, which are physically meaningful for the considered task, is selected by cross-validated Lasso.

Table 4 Feature rankings for the greedy schemes on the dataset used for the prediction of geomagnetic solar storms

In Table 4 we report the results of the greedy feature ranking scheme by using SVM and FNN. In this table, the features are ordered according to the greedy selection. In particular, the greedy iteration stops with all the features reported in the table according to (7), but the selected features are only the ones above the bold line, as in (8). We note that only a few features are selected for both SVM and FNN, and this is due to the fact that greedy schemes are model-dependent and hence are able to truly capture the most significant ones. Interestingly, the features extracted as the most prominent ones are indeed those associated with the physical processes involved in the transfer of energy from the CMEs to the Earth’s magnetosphere and, thus, with the likelihood of a CME inducing geomagnetic storms. B\(_\mathrm{{z}}\), i.e., a southward directed interplanetary magnetic field, is indeed required for magnetic reconnection with the Earth’s magnetic field to occur, and thus for the energy carried by the solar wind and/or CMEs to be transferred to the Earth system. In addition, the bulk speed V, or equivalently the radial component of the flow velocity vector V\(_\mathrm{{x}}\), is directly related to the kinetic energy of the solar wind. On the one hand, it is well known that particularly fast particle streams or solar transients can compress the magnetosphere on the sunward side. On the other hand, high levels of magnetic energy (quadratically proportional to the magnetic field intensity) can be converted into thermal energy that heats the Earth’s atmosphere, expanding it. In both cases, it appears evident that the transfer of energy, either kinetic or magnetic or total, enabled by the magnetic reconnection between the interplanetary and terrestrial magnetic fields, disrupts the magnetosphere current system, thus causing geomagnetic disturbances. In conclusion, the extracted features are the physical quantities with the highest expected predictive capability.

We further point out that, in order to extract such features, we make use of a validation set and we do not consider any test set. Therefore, the greedy feature extraction is coherently based on the TSS computed on the validation set. Nevertheless, we are now interested in understanding how the selected features perform in the prediction (on test sets) of the original task and with all examples.
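The validation-driven ranking described above can be sketched as a generic forward greedy loop. This is a minimal sketch under stated assumptions, not the exact scheme of this paper: `score_fn` is a hypothetical callable that trains the chosen classifier (SVM or FNN) on a given subset of feature indices and returns its validation TSS, the loop ranks all \(d\) features, and the stopping and selection rules (7) and (8) are deliberately not reproduced.

```python
def greedy_rank(score_fn, d):
    """Forward greedy ranking of d features driven by a validation score.

    score_fn(subset) is assumed to train the classifier on the features
    indexed by `subset` and to return a validation score such as the TSS.
    """
    selected, history = [], []
    remaining = list(range(d))
    while remaining:
        # Try each remaining feature and keep the one with the best score.
        best_score, best_j = max(
            (score_fn(selected + [j]), j) for j in remaining
        )
        selected.append(best_j)
        remaining.remove(best_j)
        history.append(best_score)
    return selected, history


# Toy score: each feature contributes a fixed, additive amount.
weights = [5, 0, 3, 1]
ranking, history = greedy_rank(lambda s: sum(weights[j] for j in s), d=4)
```

Since the classifier enters only through `score_fn`, the same loop applies unchanged to any model; its cost is of order \(d^2\) trainings, which is one reason to run it on a reduced set of examples.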

Table 5 Average scores obtained with SVM using different subsets of features
Table 6 Average scores obtained with FNN using different subsets of features

3.3.3 Prediction of geomagnetic solar storm events with greedy-selected features

In order to numerically validate our greedy procedure, we compare the performances of SVM and FNN trained with, respectively, all features, the features returned by Lasso, and the greedily selected features. The comparison is performed by computing several scores (reported in Tables 5 and 6) and by averaging over different splits of the test set. For the SVM-based prediction, using the features extracted with the greedy procedure yields a remarkable improvement in all accuracy scores. The performances of the FNN are instead essentially the same independently of the feature selection scheme, but the same accuracy scores are achieved with only a few features selected ad hoc (3 in this case). This points out again that the features extracted by methods such as Lasso might be redundant for the considered classifiers. The improvement in terms of accuracy is remarkable only for SVM classifiers, which are known to be less robust than neural networks to noise, i.e., to redundant information stored in redundant features.

4 Conclusions and future work

We introduced a novel class of feature reduction schemes, namely greedy feature selection algorithms. Their main advantage consists in the fact that they are able to identify the most relevant features for any given classifier. We studied their behavior both analytically and numerically. Analytically, we could conclude that the models constructed in such a way cannot be less expressive than the standard ones (in terms of VC dimension or kernel alignment). Numerically, we showed their efficacy on a problem associated with the prediction of geomagnetic solar storms. As the activity of the Sun is cyclic, work in progress consists in using greedy schemes to study which features are relevant during either high or low activity periods. Finally, as there is a growing interest in physics-informed neural networks (PINN), we should investigate, both theoretically and numerically, which challenges greedy methods could address in this context.