1 Introduction

High-dimensional data sets are now ubiquitous in Machine Learning workflows for data-driven knowledge discovery. For example, in bioinformatics, researchers seek to understand gene expression levels measured with microarray or next-generation sequencing techniques, where each sample consists of over 50,000 measurements [1,2,3,4]. The abundance of features demands the development of feature selection algorithms to improve Machine Learning explainability in classification problems. For high-dimensional data, it is often the case that good classification rates are due to a small fraction of the measured features. Hence, selecting the discriminatory features from large feature sets is essential to understanding the underlying data as well as explaining, interpreting, and trusting predictive models. For example, the discovery of drug therapies revolves around the identification of biomarkers that characterize the biological processes associated with the host immune response to infection by respiratory viruses such as influenza [5]. Additional benefits of feature selection include improved visualization and understanding of data, reduced storage requirements, and faster algorithm training times.

Feature selection can be accomplished in various ways that can be broadly categorized into filter, wrapper, and embedded methods. In a filter method, each variable is ordered based on a score, and a threshold is then used to select the relevant features [6]. Variables are usually ranked using correlation [7, 8] or mutual information [9, 10]. In contrast, a wrapper method uses a model and determines the importance of a feature or a group of features by the generalization performance of the predetermined model [11, 12]. Since evaluating every possible combination of features is an NP-hard problem, heuristics are used to find a subset of features. Wrapper methods are computationally intensive for larger data sets, in which case search techniques like the Genetic Algorithm (GA) [13] or Particle Swarm Optimization (PSO) [14] are used. In embedded methods, the feature selection criteria are incorporated within the model, i.e., the variables are picked during the training process [15]. For example, Iterative Feature Removal (IFR) uses the absolute weight ratios of a Sparse SVM model as a criterion to extract features from high-dimensional biological data [5].

Mathematically, the feature selection problem can be posed as an optimization problem on the \(\ell_{0}\)-norm, i.e., how many predictors are required for a machine learning task. As the minimization of \(\ell_{0}\) is intractable (non-convex and non-differentiable), the \(\ell_{1}\)-norm is used instead as a convex proxy of \(\ell_{0}\) [16]. Note that the \(\ell_{1}\)-norm is not differentiable at 0, but this can be handled by leveraging sub-gradient methods [17, 18]. Since the introduction of the seminal papers [19, 20], the use of the \(\ell_{1}\)-norm has become widespread. For example, the \(\ell_{1}\)-norm has been used for feature selection in the linear [5, 21,22,23,24] as well as the nonlinear regime [25,26,27]. Popularity notwithstanding, note that it is only a proxy for sparsity, can generate small weights (the shrinkage problem) [28, 29], and may admit non-unique solutions [30].

This paper proposes a new embedded variable selection approach called Sparsity-promoted Centroid-Encoder (SCE) to extract features when class labels are available. Our method extends the Centroid-Encoder model [31, 32] by applying an \(\ell_1\)-penalty to a sparsity-promoting layer between the input and the first hidden layer. We evaluate the proposed SCE model on diverse data sets and show that the selected features produce better generalization than other state-of-the-art techniques. As a feature selection tool, SCE uses a single model for the multi-class problem without the need to create multiple one-against-one binary models typical of linear methods, e.g., Lasso [16] or Sparse SVM [24]. The work of [25] also uses a similar sparse layer between the input and the first hidden layer with an Elastic net penalty while minimizing the classification error with a softmax layer; the authors used Theano’s symbolic differentiation [33] to impose sparsity. In contrast, our approach minimizes the Centroid-Encoder loss with an explicit differentiation of the \(\ell_1\) function using the sub-gradient. Unlike DFS, our model can capture intra-class variability by using multiple centroids per class, a property that is beneficial for multi-modal data sets.

1.1 Summary of novelty

Here, we summarize the novel aspects of our work. The performance implications of these innovations are explored via direct comparison with many state-of-the-art algorithms on benchmark data sets from the literature.

  • We propose a novel nonlinear feature selection technique, called Sparsity-promoted Centroid-Encoder (SCE).

  • SCE minimizes the distortion error of each class in the ambient space and, at the same time, uses an \(\ell_1\)-penalty on the sparse layer to discard input features that are not essential to reconstruct the class centroids.

  • One key attribute of SCE is that it can extract informative features by capturing the intra-class variance using multiple centroids per class. This property of SCE distinguishes itself from other neural network-based feature selection techniques, such as LassoNet [34], Concrete Autoencoders [35], FsNet [36], Stochastic Gate [37], which do not model the multi-modal nature of data (data sets whose classes appear to have multiple clusters) during feature selection.

  • SCE requires the solution of a non-convex optimization problem. Training such models has the potential to produce a non-unique set of selected features as a consequence of local minima. We propose a framework to select the most robust features to address this limitation of all non-convex feature selection algorithms.

  • We also address the challenges of minimizing the \(\ell_1\)-norm using stochastic optimization. We empirically show that the hyper-parameters associated with stochastic optimization, such as learning rate and mini-batch size, play a critical role in promoting sparsity using SCE.

The article is organized as follows: In Sect. 2, we review related work on both linear and nonlinear feature selection techniques. In Sect. 3, we present the formulation of Sparsity-promoted Centroid-Encoder (SCE) with analysis to understand the model. In Sect. 4, we present a robust feature selection workflow. Section 5 offers an array of experiments to show the challenges of minimizing the \(\ell_1\)-norm using stochastic optimization. In Sect. 6, we apply SCE to a range of benchmarking data sets taken from the literature and compare it with other state-of-the-art methods. Finally, we present a discussion and possible extensions of our model in Sect. 7.

2 Related work

Feature selection has a long history spread across many fields, including bioinformatics, document classification, data mining, hyperspectral band selection, and computer vision. It is an active research area, and numerous techniques exist to accomplish the task. We describe the literature related to embedded methods, where the selection criteria are part of a model; the model can be either linear or nonlinear.

2.1 Feature selection using linear models

Linear models are widely used in Machine Learning for classification and regression. These models approximate the output as a linear combination of the input variables (features), i.e., \(y \approx f(x) = w^T x + b\), where w and b are the model parameters. From the optimization perspective, a linear model takes the following form: \(\underset{\theta }{\text{minimize}}\;\; l(y,f(x,\theta ))\), where l is a loss function and \(\theta\) is the parameter set. Adding an \(\ell_1\) penalty on the parameter set \(\theta\) yields a feature selector, the least absolute shrinkage and selection operator (Lasso) [16], which takes the form:

$$\begin{aligned} \begin{aligned} \underset{\theta }{\text{minimize}}\;\; l(y,f(x,\theta )) \; + \; \lambda \Vert \theta \Vert _1 \end{aligned} \end{aligned}$$
(1)

where \(\lambda\) is a hyperparameter which controls the sparsity of the model. Since its inception, the model has been used extensively for feature selection on various data sets [21,22,23]. The Elastic net, proposed by Zou et al. [29], combines the Lasso penalty with the Ridge Regression penalty [38] to overcome some limitations of Lasso. The Elastic net is defined as follows:

$$\begin{aligned} \begin{aligned} \underset{\theta }{\text{minimize}}\;\; l(y,f(x,\theta )) \; + \; (1-\alpha )\Vert \theta \Vert _1 \;+ \; \alpha \Vert \theta \Vert ^2_2 \end{aligned} \end{aligned}$$
(2)

where \(\alpha \in [0,1)\) and the term \((1-\alpha )\Vert \theta \Vert _1 + \alpha \Vert \theta \Vert ^2_2\) is known as the elastic net penalty. The Elastic net has been widely applied, e.g., [39,40,41]. Note that both Lasso and Elastic net are convex in the parameter space. See also the following works, which address limitations of Lasso [42,43,44,45,46].
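To make the two penalties concrete, the following is a minimal scikit-learn sketch of Lasso and Elastic net used as feature selectors. It is illustrative only; scikit-learn's `alpha` and `l1_ratio` parameters play the roles of \(\lambda\) and \(\alpha\) in Eqs. 1–2, up to the library's own scaling conventions.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))                      # 100 samples, 500 features
y = X[:, :5] @ np.array([2., -3., 1.5, 4., -2.]) + 0.1 * rng.standard_normal(100)

# Lasso: l1-penalized least squares (Eq. 1 with a squared-error loss)
lasso = Lasso(alpha=0.1).fit(X, y)
lasso_features = np.flatnonzero(lasso.coef_)             # indices with nonzero weights

# Elastic net: combination of l1 and l2 penalties (Eq. 2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
enet_features = np.flatnonzero(enet.coef_)

print(f"Lasso kept {lasso_features.size} features, Elastic net kept {enet_features.size}")
```

Features whose coefficients are driven exactly to zero are discarded; the surviving indices form the selected feature set.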

The Support Vector Machine (SVM) [47] is a state-of-the-art model for classification, regression, and feature selection. SVM-RFE is a linear feature selection method that iteratively removes the least discriminative features until a parsimonious set of predictive features is selected [48]. Arbitrary p-norm separating hyperplanes were proposed by [49]. IFR [5], on the other hand, selects a group of discriminatory features at each iteration and eliminates them from the data set; the process repeats until the accuracy of the model starts to drop significantly. Note that IFR uses Sparse SVM (SSVM), which minimizes the \(\ell_1\)-norm of the model parameters. Lasso, Elastic net, and SVM-based techniques are primarily suitable for binary problems, i.e., a single model cannot handle multiple classes. These models are extended to the multi-class regime by combining several binary one-against-one (OAO) or one-against-all (OAA) models. For example, [24] used 120 Sparse SVM models to select discriminative bands from the Indian Pines data set, which has 16 classes.

Random forest (RF) [50], a decision-tree-based technique, on the other hand, selects features from multi-class data using a single model. The model does not use a Lasso or Elastic net penalty for feature selection; instead, it weighs the importance of each feature by measuring the out-of-bag error. Warda et al. proposed a hybrid feature selection technique (HBAPSO) for breast cancer prediction [51]; the algorithm combines particle swarm optimization (PSO) [14] and the bat algorithm (BA) [52] for feature pruning. The work of Dai et al. uses label correlation and instance correlation with the \(\ell _{2,1}\)-norm to extract features from multi-label data (CMFSS) [53]. More information on feature selection for high-dimensional microarray data can be found in [54].

2.2 Feature selection using deep neural networks

While linear models are fast and convex, they do not capture nonlinear relationships among the input features (unless a kernel trick is applied). Because of their shallow architecture, these models do not learn a high-level representation of the input features. Moreover, there is no natural way to incorporate multi-class data in a single model. Nonlinear models based on deep neural networks overcome these limitations. In this section, we briefly discuss a handful of such models.

Group Lasso [55] was modified to impose sparsity on a group of variables instead of a single variable [26]. The authors applied group sparsity simultaneously on the input and the hidden layers to remove features from the input data and the hidden activations. On MNIST, their algorithm discarded more than 200 features from the input vector while maintaining \(97\%\) accuracy on the test data. On the Forest Cover data set, however, the algorithm used most of the input variables, 52.7 on average out of 54. Deep feature selection (DFS), a multilayer neural network-based feature selection technique, was proposed by [25]. As a sparse regularizer, the authors used the elastic net [29] on the variables of the feature selection layer to induce sparsity; the standard softmax function is used in the output layer for classification. With this setup, the network is trained end-to-end by error backpropagation. Despite the deep architecture, its accuracy is not competitive; experimental results showed that the method did not outperform random forest (RF). Kim et al. proposed a heuristic technique, EP-DNN [56], to assign importance to each feature. Unlike [25, 26], EP-DNN does not use sparsity or group sparsity during training; instead, the importance of a feature is calculated using a backpropagation-like technique after training is done, making it a multi-step process. The authors evaluated the model on only one biological data set. Roy et al., on the other hand, proposed using the ReLU activation to measure the contribution of an input feature toward the hidden activations of the next layer [57]; this approach combines feature selection and training of the deep network in a single step. Unlike the supervised methods [25, 26, 56, 57], Han et al. developed an unsupervised feature selection technique based on the autoencoder architecture [58]. Applying an \(\ell_{2,1}\)-penalty to the weights emanating from each input node, the authors measure the contribution of each feature while reconstructing the input; the model removes the input features with minimal contribution to sample reconstruction. Similarly, Taherkhani et al. proposed an RBM [60]-based feature selection model [59], which discards a feature if the reconstruction error does not increase after setting the corresponding input to zero.

Recently, Balin et al. proposed an end-to-end unsupervised feature selection technique, namely Concrete Autoencoders (CAE) [35]. The authors utilize the concrete random variable, a continuous approximation of a one-hot vector, in the feature selection layer. One of the attractive features of CAE is that its cost function is differentiable, and the model picks a subset of original features by gradually lowering the temperature of the concrete feature selector layer using an annealing scheme. Note that CAE requires the user to specify the number of features to be selected from the data. Stochastic Gates, proposed by [37], incorporates a continuous relaxation of the Bernoulli distribution to approximate the \(\ell_{0}\)-norm. Like CAE, the cost of Stochastic Gates is differentiable, and the model assumes the input features follow a Gaussian distribution. FsNet, proposed by Singh et al. [36] and designed for high-dimensional biological data sets, also uses a concrete feature selection layer along with Diet Networks [61] to reduce the model size. LassoNet [34], on the other hand, uses a skip connection to measure the contribution of a feature using the Lasso penalty and only allows a feature to participate in hidden units if it is still active. Uzma et al. proposed Gene encoder [62], an unsupervised feature selection technique based on a deep architecture. A brief description of these models is given in Table 1.

Table 1 Brief description of the feature selection techniques

3 Sparsity-promoted centroid-encoder

Centroid-Encoder (CE) neural networks are the starting point of our approach [31, 32, 63]. We present a brief overview of CEs and demonstrate how they can be extended to perform nonlinear feature selection.

3.1 Centroid-encoder

The CE neural network is a variation of the autoencoder and can be used for both visualization and classification tasks. Consider a data set with N samples and M classes. The classes are denoted \(C_j, j = 1, \dots , M\), and the indices of the data associated with class \(C_j\) are denoted \(I_j\). We define the centroid of each class as \(c_j=\frac{1}{|C_j|}\sum _{i \in I_j} x^i\), where \(|C_j|\) is the cardinality of class \(C_j\). Unlike an autoencoder, which maps each point \(x^i\) to itself, the CE maps each point \(x^i\) to its class centroid \(c_j\) by minimizing the following cost function over the parameter set \(\theta\):

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{ce}(\theta )=\frac{1}{2N}\sum ^M_{j=1} \sum _{i \in I_j}\Vert c_j-f(x^i; \theta )\Vert ^2_2 \end{aligned} \end{aligned}$$
(3)

The mapping f is composed of a dimension-reducing mapping g (the encoder) followed by a dimension-increasing reconstruction mapping h (the decoder). The output of the encoder can be used as a supervised visualization tool [31, 32], and attaching another layer that maps to the one-hot encoded labels yields a robust classifier [63].
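To make Eq. 3 concrete, the following is a minimal NumPy sketch (illustrative only, not the authors' implementation) that builds the class centroids and evaluates the centroid-encoder loss for an arbitrary mapping f.

```python
import numpy as np

def class_centroids(X, y):
    """Return a dict mapping each class label to its centroid c_j."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def centroid_encoder_loss(f, X, y):
    """Eq. 3: (1/2N) * sum_j sum_{i in I_j} ||c_j - f(x_i)||^2."""
    centroids = class_centroids(X, y)
    targets = np.stack([centroids[c] for c in y])   # centroid target for every sample
    residual = targets - f(X)                       # f maps samples back to input space
    return 0.5 * np.mean(np.sum(residual ** 2, axis=1))

# Example with the identity map standing in for the encoder/decoder composition
X = np.random.randn(20, 5)
y = np.random.randint(0, 3, size=20)
print(centroid_encoder_loss(lambda Z: Z, X, y))
```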

3.2 Sparsity-promoted centroid-encoder for feature selection

The Sparsity-promoted Centroid-Encoder (SCE) is a modification of the centroid-encoder architecture, as shown in Fig. 1. Unlike the centroid-encoder, we do not use a bottleneck architecture, as visualization is not our aim here. The input layer is connected to the first hidden layer via the sparsity-promoting layer (SPL). Each node of the input layer has a weighted one-to-one connection to the corresponding node of the SPL, and the number of nodes in these two layers is the same. The nodes in the SPL do not have any bias or nonlinearity. The SPL is fully connected to the first hidden layer; therefore, the weighted input from the SPL is passed to the hidden layer in the same way as in a standard feedforward network. During training, an \(\ell_1\) penalty is applied to the weights connecting the input layer and the SPL. This sparsity-promoting \(\ell_1\) penalty drives most of the weights to near zero, and the corresponding input nodes/features can be discarded. The purpose of the SPL is therefore to select important features from the original input. Note that we only apply the \(\ell_1\) penalty to the parameters of the SPL.

Denote \(\theta _{spl}\) to be the parameters (weights) of the SPL and \(\theta\) to be the parameters of the rest of the network. The cost function of SCE is given by

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{sce}(\theta )=\frac{1}{2N}\sum ^M_{j=1} \sum _{i \in I_j}\Vert c_j-f(x^i; \theta )\Vert ^2_2 + \lambda \Vert \theta _{spl}\Vert _{1} \end{aligned} \end{aligned}$$
(4)

where \(\lambda\) is the hyperparameter that controls the sparsity. A larger value of \(\lambda\) promotes higher sparsity, resulting in more near-zero weights in the SPL. In other words, \(\lambda\) is a knob that controls the number of features selected from the input data.
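The following PyTorch-style sketch illustrates the architecture and the cost in Eq. 4. It is an illustrative re-implementation rather than the authors' code; the hidden width of 500 and the tanh activation are assumptions. The SPL is a single weight per input feature with no bias or nonlinearity, and the \(\ell_1\) penalty is applied only to those weights.

```python
import torch
import torch.nn as nn

class SparsityCentroidEncoder(nn.Module):
    def __init__(self, d_in, d_hidden=500):
        super().__init__()
        # Sparsity-promoting layer: one weight per input feature, no bias, no nonlinearity
        self.spl = nn.Parameter(torch.ones(d_in))
        # No bottleneck: encoder/decoder keep the hidden width
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, d_in),
        )

    def forward(self, x):
        return self.net(x * self.spl)   # elementwise gating by the SPL weights

def sce_loss(model, x, centroid_targets, lam):
    # Eq. 4: centroid-encoder distortion + l1 penalty on the SPL weights only
    recon = model(x)
    ce_cost = 0.5 * ((centroid_targets - recon) ** 2).sum(dim=1).mean()
    return ce_cost + lam * model.spl.abs().sum()
```

After training, the features are ranked by the absolute values of the SPL weights and thresholded with the cut-off rule of Sect. 3.2.1.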

Like the centroid-encoder, we train the sparsity-promoted centroid-encoder using error backpropagation, which requires the gradient of the cost function in Eq. 4. As the \(\ell_1\) function is not differentiable at 0, we implement this term using the sub-gradient [17]. We train SCE using Scaled Conjugate Gradient Descent [64] on the full training set. Like any neural network-based model, the hyperparameters of SCE need to be tuned for optimum performance; Table 2 lists the ranges of values used in this research. We used a validation set to choose the optimal values. For small-sample-size data sets (high-dimensional biological data), we ran a five-fold cross-validation on the training set to pick the optimum values. A Python implementation of Sparsity-promoted Centroid-Encoder is available in the GitHub repository https://github.com/Tomojit1/SparseCentroid-Encoder.git.
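For the sub-gradient mentioned above, a valid choice at zero is simply zero, so the \(\ell_1\) term contributes \(\lambda\,\mathrm{sign}(w)\) to the gradient of the SPL weights; a one-line NumPy sketch of this (illustrative) choice:

```python
import numpy as np

def l1_subgradient(w_spl, lam):
    # d/dw [lam * |w|] = lam * sign(w) for w != 0; at w = 0 we pick 0 from [-lam, lam]
    return lam * np.sign(w_spl)
```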

Fig. 1

Architecture of the Centroid-Encoder and the Sparsity-promoted Centroid-Encoder. Notice that the Centroid-Encoder uses a bottleneck architecture, which is helpful for visualization. In contrast, the Sparsity-promoted Centroid-Encoder does not use a bottleneck architecture; instead, it employs a sparse layer between the input and the first hidden layer to promote feature sparsity

Table 2 Hyperparameters for sparsity-promoted centroid-encoder

3.2.1 Feature cut-off

The \(\ell_1\)-norm on the sparse layer (SPL) drives many of the weights to near zero. Often hard thresholding or the ratio of two consecutive weights is used to pick the nonzero weights [5]. We take a different approach. After training SCE, we create a sparsity curve from the weights of the sparse layer by arranging the absolute values of the weights in descending order and then finding the elbow of the curve. We measure the distance of each point on the curve to the straight line formed by joining the first and last points of the curve; the point with the largest distance is the position (P) of the elbow. We pick all the features whose absolute weight is greater than that of P. We demonstrate the feature cut-off in Fig. 2 with the high-dimensional SMK_CAN data, which has 19,993 genes per sample. The two panels display the absolute weights of the sparsity-promoting layer in descending order for two runs; the red dot indicates the exact position of P determined by the cut-off algorithm.
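The elbow criterion can be implemented in a few lines; the following is an illustrative sketch consistent with the description above.

```python
import numpy as np

def feature_cutoff(spl_weights):
    """Return indices of features kept by the elbow rule on the sparsity curve."""
    w = np.sort(np.abs(spl_weights))[::-1]          # absolute weights, descending
    x = np.arange(len(w), dtype=float)
    # Line through the first and last points of the curve: a*x + b*y + c = 0
    x1, y1, x2, y2 = x[0], w[0], x[-1], w[-1]
    a, b, c = y2 - y1, x1 - x2, x2 * y1 - x1 * y2
    dist = np.abs(a * x + b * w + c) / np.hypot(a, b)
    P = np.argmax(dist)                              # position of the elbow
    keep = np.abs(spl_weights) > w[P]                # features above the elbow weight
    return np.flatnonzero(keep)
```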

Fig. 2

Absolute weights of the SPL in descending order over two runs on the high-dimensional SMK_CAN data. The experiment is done with \(\lambda = 0.0002\)

3.3 Empirical analysis of SCE

In this section, we present an empirical analysis of our model. The results of feature selection for the digits 5 and 6 from the MNIST set are displayed in Fig. 3. In panel (a), we compare the two terms that contribute to Eq. 4, i.e., the centroid-encoder and \(\ell_1\) costs, for different values of \(\lambda\). As expected, we observe that the CE cost decreases monotonically as \(\lambda\) decreases, while the \(\ell_1\) cost increases. For larger values of \(\lambda\), the model focuses more on minimizing the \(\ell_1\)-norm of the sparse layer, which results in smaller \(\ell_1\) values. In contrast, the model pays more attention to minimizing the CE cost for small \(\lambda\); hence, we observe a smaller CE cost and a higher \(\ell_1\) cost.

Fig. 3

Analysis of SCE. a Change of the two costs over \(\lambda\). b Change of validation accuracy over \(\lambda\). c Sparsity plot of the weight of \(W_{\text{SPL}}\) for \(\lambda = 0.001\). d Same as c but \(\lambda = 0.1\)

Panel (b) of Fig. 3 shows the accuracy on a validation set as a function of \(\lambda\) over nine different values; the validation accuracy reached its peak for \(\lambda = 0.001\). In panels (c) and (d), we plot the magnitudes of the feature weights of the sparse layer in descending order. The sharp decrease in the magnitude of the weights demonstrates the promotion of sparsity by SCE; the model effectively ignores features by setting their weights to approximately zero. Notice that the model produced a sparser solution for \(\lambda = 0.1\), selecting only 32 features compared to the 122 variables chosen for \(\lambda = 0.001\). Figure 4 shows the positions of the selected features, i.e., pixels, on the digits 5 and 6. The intensity of the color represents the feature's importance: dark blue signifies a larger absolute weight, whereas light blue indicates a smaller absolute weight.

Fig. 4

Demonstration of the sparsity of the proposed model on MNIST digits 5 and 6. The digits are shown in white, and the selected pixels are marked in blue; the darkness of the blue indicates the relative importance of the pixel for distinguishing the two digits. We show the selected pixels for two choices of \(\lambda\). Notice that for \(\lambda = 0.1\), the model chose fewer features, whereas it picked more pixels for \(\lambda = 0.001\). \(\lambda\) is the knob that controls the sparsity of the model

Our next analysis shows how SCE extracts informative features from a multi-modal data set, i.e., a data set whose classes appear to have multiple clusters. In this case, one centroid per class may not be optimal, e.g., for the ISOLET data. To this end, we trained SCE using different numbers of centroids per class, where the centroids were determined using the standard k-means algorithm [65, 66]. After the feature selection, we calculated the validation accuracy and plotted it against the number of centroids per class in Fig. 5. The validation accuracy jumped significantly from one centroid to two centroids per class. The increased accuracy indicates that the speech classes are multi-modal, which is further confirmed by the two-dimensional PCA plots of three classes shown in panels (b)-(d).
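A minimal sketch of how such per-class centroids can be obtained with scikit-learn's k-means (the number of centroids per class is a user choice, and the helper below is illustrative): each sample is then mapped to the centroid of its assigned within-class cluster in Eq. 4 instead of the single class mean.

```python
import numpy as np
from sklearn.cluster import KMeans

def per_class_centroid_targets(X, y, n_centroids=2, seed=0):
    """Assign every sample the k-means centroid of its own class cluster."""
    targets = np.empty_like(X, dtype=float)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        km = KMeans(n_clusters=n_centroids, n_init=10, random_state=seed).fit(X[idx])
        # cluster_centers_[labels_] gives each sample its assigned centroid
        targets[idx] = km.cluster_centers_[km.labels_]
    return targets
```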

Fig. 5

Sparsity-promoted Centroid-Encoder for a multi-modal data set. Panel a shows the increase in validation accuracy with the number of centroids per class. Panels b-d show the two-dimensional PCA plots of three speech classes

4 Feature selection workflow using SCE

By design, sparse methods identify a small number of features that accomplish a classification task. If one is interested in all the discriminatory features that can be used to separate multiple classes, one can repeat the process of removing good features. This section describes how SCE can be used iteratively to extract all discriminatory features from a data set; see [5] for an application of this approach to sparse support vector machines.

SCE is a neural network-based model; hence, its training involves non-convex optimization. As a result, multiple runs will produce different solutions, i.e., different feature sets on the same training set, and these features may not be optimal for an unseen test set. To find robust features from a training set, we resort to frequency-based feature pruning. In this strategy, we first divide the entire training set into k folds. On each of these folds, we run SCE and pick the top N (user-selected) features. We repeat the process T times to get \(k\times T\) feature sets. We then count the number of occurrences of each feature and call this number the frequency of the feature. We order the features by frequency and pick the optimum number using a validation set. We present the feature selection workflow in Fig. 6.
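A schematic Python sketch of this frequency-based pruning is given below. The helper `run_sce_top_features`, which stands in for a single SCE training run returning the indices of its top-N features, is hypothetical, and the exact split convention follows the workflow of Fig. 6.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold

def frequency_ranked_features(X, y, run_sce_top_features, k=5, T=10, top_n=50):
    """Rank features by how often they appear in the top-N list across k*T SCE runs."""
    counts = Counter()
    for t in range(T):
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=t)
        for split_idx, _ in skf.split(X, y):
            # one SCE run per training split; returns indices of its top-N features
            top = run_sce_top_features(X[split_idx], y[split_idx], top_n)
            counts.update(int(f) for f in top)
    # Features ordered from high to low frequency; the optimum count is then
    # chosen with a validation set
    return [f for f, _ in counts.most_common()]
```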

Fig. 6

Feature selection workflow using Sparsity-promoted Centroid-encoder: a First, the data set is partitioned into training and validation sets. b We further partition the training set into \(\textit{n}\) splits. c On each of the training splits, we run Sparsity-promoted Centroid-encoder to get \(\textit{n}\) feature sets. d We calculate the occurrence of each feature among the \(\textit{n}\) sets and call it the frequency of the feature. We rank features from high to low frequency to get an ordered set. e Finally, we pick the optimum number of features using a validation set

5 Minimizing 1-norm using stochastic optimization

In this section, we investigate the challenges of minimizing the \(\ell_1\)-norm using stochastic optimization, such as stochastic gradient descent [SGD; 67] and adaptive moment estimation [Adam; 68]. These techniques are beneficial for large-scale machine learning; see [69]. The authors of Group Sparse ANN [26] and DFS [25] used stochastic optimization with an \(\ell_1\)-norm penalty on a neural network architecture to promote feature sparsity. In recent work, Yamada et al. [37] reported that DFS and Group Sparse ANN failed to induce sparsity on several benchmarking feature selection data sets; however, the authors did not investigate the root cause. Note that stochastic optimizers like SGD and Adam require hyper-parameters, e.g., learning rate, mini-batch size, and momentum. Calculating the gradient on random subsamples (mini-batches) of a training set adds noise that may affect \(\ell_1\)-norm minimization. We ran an array of experiments to evaluate the dependence of \(\ell_1\)-norm minimization on these hyper-parameters.

All the experiments in this section use Sparsity-promoted Centroid-Encoder on the MNIST data set; this time we use Adam to optimize the network parameters over mini-batches. We used one hidden layer with 500 hyperbolic tangent ('tanh') units, with the learning rate and \(\lambda\) set to 0.01 and 0.0001, respectively. In Fig. 7, we present the results of the first experiment, which shows the effect of the mini-batch size. The three columns (A, B, and C) show results for specific mini-batch sizes, i.e., 512, 1024, and 5000. For each column, the upper panel shows the positions of the top 200 selected pixels, and the lower panel shows the absolute weights of the sparse layer in descending order.
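Such an experiment can be set up along the following lines. This is an illustrative PyTorch loop rather than the exact code used here; it reuses the `SparsityCentroidEncoder` and `sce_loss` sketches from Sect. 3.2, and `X_train` and `centroid_targets` are assumed to be tensors prepared as described earlier.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_with_adam(X_train, centroid_targets, batch_size, lr=0.01, lam=1e-4, epochs=50):
    """Train SCE with Adam for a given mini-batch size and return the SPL weight magnitudes."""
    model = SparsityCentroidEncoder(d_in=X_train.shape[1], d_hidden=500)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(X_train, centroid_targets),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, cb in loader:
            opt.zero_grad()
            loss = sce_loss(model, xb, cb, lam)
            loss.backward()
            opt.step()
    return model.spl.detach().abs()   # sort these in descending order for the sparsity curve

# Compare the sparsity induced by different mini-batch sizes
# curves = {bs: train_with_adam(X_train, centroid_targets, bs) for bs in (512, 1024, 5000)}
```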

Fig. 7

Effect of the mini-batch size on \(\ell_1\)-norm minimization using SCE for three mini-batch sizes: 512 in (A), 1024 in (B), and 5000 in (C). For each case, the upper panel shows the positions of the selected pixels in a 28 × 28 grid, and the lower panel presents the absolute weights of the sparse layer in descending order

Minimizing the \(\ell_1\)-norm with the smallest mini-batch size (512) does not induce sparsity. Surprisingly, the \(\ell_1\) penalty on the sparse layer puts higher weight on the pixels around the border, ignoring the pixels in the center of the image. The model selects only 8 pixels (colored in teal) for mini-batch size 1024; among them, four pixels reside at the image's border. The positions of the pixels and the sparsity plot improve significantly for mini-batch size 5000, selecting around 300 pixels from the center of the picture. Also notice that the scale of the absolute weights increases with the mini-batch size. We made similar observations with a ReLU activation function, i.e., the relation between the mini-batch size and the sparsity does not change if we switch from tanh to ReLU.

Next, we show the effect of the learning rate/step size on the model's sparsity for \(\lambda = 0.0001\) and a mini-batch size of 5000; Fig. 8 shows the results. For a relatively large step size (0.1), the model did not induce sparsity, and many of the selected pixels (top 200) lie on the border of the image, suggesting the presence of noise. In contrast, the model produces sparse solutions for learning rates of 0.01 and 0.001; in both cases, the selected pixels also reside in the middle of the image.

Fig. 8

Effect of the learning rate on \(\ell_1\)-norm minimization using SCE for three values: 0.1 in (A), 0.01 in (B), and 0.001 in (C). For each case, the upper panel shows the positions of the selected pixels in a 28 × 28 grid, and the lower panel presents the absolute weights of the sparse layer in descending order

Figure 9 shows the results of the next experiment, where we study the effect of the penalty term \(\lambda\) for three values, 0.01 (panel A), 0.001 (panel B), and 0.0001 (panel C), with a fixed mini-batch size of 5000 and a learning rate of 0.01. Notice that the model did not promote sparsity for \(\lambda =0.01, 0.001\). The \(\ell_1\) penalty on the sparse layer selects pixels from all over the image when \(\lambda =0.01\); in contrast, \(\lambda = 0.001\) ignores the middle of the images and picks pixels from the boundary. Interestingly, the selected pixels form a circle. Clearly, these two values of \(\lambda\) do not pick the most informative features from an MNIST image. On the other hand, we see a sparser solution for \(\lambda =0.0001\), selecting around 325 features from the middle of the 28 × 28 grid. The positions of the selected pixels also make sense, as the digits lie in the center of the grid.

The in-depth analysis of this section reveals an essential aspect of stochastic optimization when minimizing the \(\ell_1\)-norm: the hyper-parameters play a crucial role. Smaller mini-batches and higher values of \(\lambda\) do not promote feature sparsity, and the selected features likely contain noise. The learning rate also dictates the sparsity when the other hyper-parameters are kept constant. These challenges can be overcome by carefully tuning the hyper-parameters using a validation set. In summary, minimizing the \(\ell_1\)-norm using stochastic optimization is challenging and requires a careful selection of hyper-parameters to induce feature sparsity. Consequently, we did not use stochastic optimization while training SCE; instead, we used Scaled Conjugate Gradient (SCG) descent [64]. Unlike Adam or SGD, SCG calculates the step size/learning rate at each iteration, removing the effort of tuning that hyperparameter. We also used the entire training set to calculate the gradient, which eliminates the need to tune the mini-batch size and ensures that the model parameters are updated using the true gradient.

Fig. 9

Effect of \(\lambda\) on \(\ell_1\)-norm minimization using SCE for three values: 0.01 in (A), 0.001 in (B), and 0.0001 in (C). For each case, the upper panel shows the positions of the selected pixels in a 28 × 28 grid, and the lower panel presents the absolute weights of the sparse layer in descending order

6 Experimental results

We present a comparative evaluation of our model on various data sets against several other feature selection techniques.

6.1 Experimental details

We used twelve data sets from a variety of domains (image, biology, speech, and sensor; see Table 3) and five neural network-based models in three benchmarking experiments. To this end, we picked the published results from four papers [25, 34, 36, 37] and followed the same experimental methodology described in those papers for an apples-to-apples comparison. This approach permitted a direct comparison with LassoNet, FsNet, Supervised CAE, DFS, and Stochastic Gates using the authors' best results. All three experiments follow the standard workflow outlined below.

Table 3 Descriptions of the data sets used for benchmarking experiments
  • Split each data set into training and test partitions.

  • Run SCE on the training set to extract top \(K\in \{10,16,50\}\) features.

  • Using the top K features, train a one-hidden-layer ANN classifier with H ReLU units to predict the test samples. H is picked using a validation set.

  • Repeat the classification 20 times and report the average accuracy (a sketch of this evaluation loop is given after the list).
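The evaluation loop referenced above can be sketched with scikit-learn's MLP classifier as the one-hidden-layer ANN. The helper `select_top_k_features`, standing in for an SCE run, is hypothetical, and in practice H is tuned on a validation set rather than fixed as below.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def evaluate_top_k(X_train, y_train, X_test, y_test, select_top_k_features,
                   K=50, H=100, repeats=20):
    """Average test accuracy of a one-hidden-layer ReLU ANN on the top-K SCE features."""
    feats = select_top_k_features(X_train, y_train, K)   # indices of the top-K features
    accs = []
    for seed in range(repeats):
        clf = MLPClassifier(hidden_layer_sizes=(H,), activation='relu',
                            max_iter=500, random_state=seed)
        clf.fit(X_train[:, feats], y_train)
        accs.append(clf.score(X_test[:, feats], y_test))
    return np.mean(accs), np.std(accs)
```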

Now, we describe the details of the three experiments.

6.1.1 Experiment 1

The first benchmarking experiment is conducted on six real-world high-dimensional biological data sets, ALLAML, GLIOMA, SMK_CAN, Prostate_GE, GLI_85, and CLL_SUB, to compare SCE with FsNet and Supervised CAE (SCAE). Following the experimental protocol of Singh et al. [36], we randomly partitioned each data set into a 50:50 ratio of train and test and ran SCE on the training set. After that, we calculated the test accuracy using the top \(K=\{10,50\}\) SCE features. We repeated the experiment 20 times and report the mean accuracy. We ran a 5-fold cross-validation on the training set to tune the hyperparameters.

6.1.2 Experiment 2

In the second benchmarking experiment, we compared SCE with LassoNet [34] and Stochastic Gates [37] on six data sets: Mice Protein, COIL20, Isolet, Human Activity, MNIST, and FMNIST. Following the experimental setup of Lemhadri et al., we split each data set into a 70:10:20 ratio of training, validation, and test sets. We ran SCE on the training set to pick the top \(K=50\) features to predict the class labels of the sequestered test set. We used the validation set extensively to tune the hyperparameters.

6.1.3 Experiment 3

In the last benchmark, we used the single-cell GM12878 data, which have separate training, validation, and test sets. SCE is run to select the top \(K=16\) features, and the prediction performance is compared with Deep/Shallow DFS [25] and Lasso. Again, we used the validation set for hyperparameter tuning.

6.2 Results

Now, we discuss the results of the three benchmarking experiments. In Table 4, we present the results of the first experiment, where we compare SCE, SCAE, and FsNet on six high-dimensional biological data sets. Apart from the results using a subset (10 and 50) of features, we also provide the prediction accuracy using all the features. In most cases, feature selection helps improve classification performance. Generally, SCE features perform better than SCAE and FsNet; out of the twelve classification tasks, SCE produces the best result on eight. Notice that the top fifty SCE features give a better prediction rate than the top ten in all cases. Interestingly, the accuracy of SCAE and FsNet drops significantly on SMK_CAN, GLI_85, and CLL_SUB using the top fifty features.

Table 4 Comparison of mean classification accuracy of FsNet, SCAE, and SCE features on six real-world high-dimensional biological data sets

Now, we turn our attention to the results of the second experiment, shown in Table 5. The features of the Sparsity-promoted Centroid-Encoder produce better classification accuracy than LassoNet and STG in all cases; in particular, for Mice Protein, Activity, Isolet, FMNIST, and MNIST, our model improves accuracy by 2.5-4.5%. The results for Stochastic Gates (STG) in [37] are not in tabular form, but a visual comparison of classification accuracy with the top 50 features on ISOLET, COIL20, and MNIST suggests that Stochastic Gates is not more accurate than SCE. For example, using the top 50 features, STG obtains approximately 85% accuracy on ISOLET, while SCE obtains 91.1%; STG obtains about 97% on COIL20, while SCE obtains 99.3%; and on MNIST, STG achieves approximately 91%, while SCE achieves 93.8%. In this experiment, we ran SCE with multiple centroids per class and observed an improved prediction rate compared to one centroid per class on Isolet, Activity, MNIST, and FMNIST. This observation suggests that these classes are multi-modal, which is valuable information in itself. The optimum number of centroids was picked using the validation set.

Table 5 Classification results using LassoNet, STG, and SCE features on six publicly available data sets

In Table 6, we present the results of our last experiment on the single-cell data GM12878. We use the published results for deep feature selection (DFS), shallow feature selection, and Lasso from the work of Li et al. to evaluate SCE. To compare with Li et al., we used the top 16 features to report the mean accuracy on the test samples. We see that the SCE features outperform all the other models. Among all the models, Lasso exhibits the worst performance with an accuracy of \(81.86\%\); this relatively low accuracy is not surprising, given that Lasso is a linear model.

Table 6 Classification accuracies using the top 16 features by various techniques

6.3 Experiment with feature selection workflow

This section presents the results obtained using the feature selection workflow described in Sect. 4. To this end, we used a high-dimensional biological data set, GSE73072, that has proven to be very valuable for understanding the human immune response to respiratory infection. The framework is applied to SCE and Random Forest (RF), a widely used feature selection tool in computational biology. The details of the data set and the experimental results are given below.

GSE73072 is a microarray data set consisting of gene expressions taken from human blood samples as part of multiple clinical challenge studies [70] in which individuals were infected with the respiratory viruses HRV, RSV, H1N1, and H3N2. In our experiment, we excluded the RSV study. Blood samples were taken from the individuals before and after inoculation. RMA normalization [71] is applied to the entire data set, and LIMMA [72] is used to remove the subject-specific batch effect. Each sample is represented by 22,277 probes associated with gene expression. The data are publicly available on the NCBI Gene Expression Omnibus (GEO) with identifier GSE73072.

We conducted an experiment on these data where the goal is to predict the classes control, shedder, and non-shedder at the very early phase of infection, i.e., at the time bin spanning hours 1-8. Controls are the pre-infection samples, whereas shedders and non-shedders are post-infection samples picked from the 1-8 hr time bin; shedders actually disseminate virus, while non-shedders do not. We considered six studies, including two H1N1 (DEE3, DEE4), two H3N2 (DEE2, DEE5), and two HRV (Duke, UVA) studies. We used \(10\%\) of the training samples as a validation set; the training set comprised all the studies except for DEE5, which was held out for testing. We performed a leave-one-subject-out (LOSO) cross-validation on the test set using the features selected from the training set. The validation set is used to find the optimum number of features and to tune model-specific hyper-parameters. As our primary goal is to determine the utility of the proposed feature selection framework, we compared the results with two sets of features for each model: the first set is computed with the framework, and the second set is derived without it.
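The LOSO evaluation on the held-out study can be sketched as follows. The balanced success rate is computed here with scikit-learn's balanced accuracy; the classifier constructor `make_classifier` and the selected feature indices `feat_idx` are placeholders, not the exact pipeline used in this experiment.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import balanced_accuracy_score

def loso_balanced_accuracy(X, y, subjects, make_classifier, feat_idx):
    """Leave-one-subject-out CV on the test study, restricted to the selected features."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        clf = make_classifier().fit(X[train_idx][:, feat_idx], y[train_idx])
        pred = clf.predict(X[test_idx][:, feat_idx])
        scores.append(balanced_accuracy_score(y[test_idx], pred))
    return np.mean(scores), np.std(scores)
```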

The results for this data set are shown in Table 7. Notice that features computed with the framework generalize to the test samples better for both SCE and RF. The framework improved the classification of the DEE5 study by margins of \(17\%\) and \(10\%\) for SCE and RF, respectively. Moreover, the variance of the balanced accuracy decreases for both models. It is noteworthy that, with the framework, the models picked a relatively small number of features, 35 for SCE and 30 for RF, out of 22,277 genes, to achieve relatively high accuracy. Note that the optimal number of features is picked using the validation set. Finally, we point out that SCE features with the framework have a performance benefit over RF.

Table 7 Balanced success rate (BSR) of LOSO cross-validation on the DEE5 test set. The selected features from the training set are used to predict the classes control, shedder, and non-shedder. The best classification result is highlighted in bold

6.4 Analysis of results

The experimental results in Sects. 6.2 and 6.3 show that Sparsity-promoted Centroid-Encoder features often perform better than other state-of-the-art methods on diverse data sets. Here, we analyze the features in more detail to explain the improved performance of our model; we visualize the selected features directly or indirectly to explain and interpret their discriminative quality. To start, we show the positions of the pixels selected from the MNIST images in Fig. 10 over two runs using the whole training set.

Fig. 10

Pixels selected by Sparsity-promoted Centroid-Encoder on MNIST with all ten classes. The figure shows the positions of the selected pixels over two runs (\(\lambda = 0.0002\)) on a 28 × 28 grid. SCE ignores the boundary of the image and picks most of the pixels from the middle, which makes sense as the MNIST digits lie in the center of the 28 × 28 grid. Notice that a significant number of pixels are common across the two runs, and the non-overlapping pixels reside in neighboring locations

The model picks almost the same number of pixels (194 and 198) across the two runs, with 167 overlapping. Most of the selected pixels reside in the middle of the image, which makes sense as the MNIST digits lie in the center of the 28 × 28 grid. Notice that the non-overlapping pixels of the two runs are neighbors, which is expected as neighboring pixels likely contain similar information about the digits.

Now, we turn our attention to a high-dimensional biological data set, ALLAML, which has 7129 genes per sample. The top 10 and 50 SCE features predicted the test samples with an accuracy of more than \(92\%\), better than the accuracy obtained using all the genes. Unlike MNIST, we cannot visualize the selected features directly on a grid, so we use an indirect method to interpret their discriminative power: PCA, a widely used unsupervised linear projection technique, is used to visualize the data using all features vs. the SCE features. Figure 11 presents the three-dimensional visualization of the ALLAML data using all genes and the top 50 SCE genes. The projection with all 7129 genes in panel (a) does not separate the classes entirely. On the other hand, the projection using the top 50 SCE genes separates the two classes better. Notice that the test cases are mapped close to the corresponding training class, which explains why the SCE features produce high test accuracy. These examples not only demonstrate the discriminative quality of the SCE features but also help us to interpret them.
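The projection in panel (b) can be reproduced along these lines (a scikit-learn sketch; `feat_idx` denotes the indices of the top 50 SCE features computed on the training set):

```python
import numpy as np
from sklearn.decomposition import PCA

def project_on_train_pcs(X_train, X_test, feat_idx, n_components=3):
    """Fit PCA on the training samples restricted to the selected features,
    then project both training and test samples onto those components."""
    pca = PCA(n_components=n_components).fit(X_train[:, feat_idx])
    return pca.transform(X_train[:, feat_idx]), pca.transform(X_test[:, feat_idx])
```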

Fig. 11

Three-dimensional embedding using PCA on ALLAML data. a The entire data set is projected on the first three principal components. b First, the data set is partitioned into training and test sets by a 50:50 ratio. After that, the training and test samples are restricted to the top 50 SCE features, which are computed from the training set. The training set is used to calculate the principal components, and then, the training and the test samples are projected on the first three principal components

The classification performance gives a quantitative measure but does not reveal the biological significance of the selected genes. We conducted a literature survey of the top genes selected by sparsity-promoted centroid-encoder on the GM12878 single-cell data and provide a detailed description in the appendix. Some of these genes play an essential role in transcriptional activation, e.g., H4K20ME1 [73], TAF1 [74], H3K27ME3 [75]. Gene H3K27AC [76] plays a vital role in separating active enhancers from inactive ones. Besides that, many of these genes are related to the proliferation of the lymphoblastoid cancer cells, e.g., POL2 [77], NRSF/REST [78], GCN5 [79], PML [80]. This survey indicates the possible biological significance of the selected genes.

7 Discussion and conclusion

In this paper, we proposed a novel feature selection technique, the Sparsity-promoted Centroid-Encoder. Using a standard multi-layer perceptron architecture, the model backpropagates the Centroid-Encoder cost to a feature selection layer that filters out non-discriminating features via \(\ell_1\)-regularization. This setting allows the feature selection to be data-driven, without the need for prior knowledge such as the number of features to be selected or the underlying distribution of the input features. One key attribute of our method is the ability to model intra-class variance with multiple centroids per class. This approach improves the discriminative power of the selected features and offers new information about the data set, i.e., whether the classes are unimodal or multi-modal.

The in-depth empirical analysis of SCE provides interpretability of our model. For example, the interplay between the CE cost and the \(\ell_1\) cost of the SPL explains how the model behaves over different choices of the weighting parameter \(\lambda\). A higher value of \(\lambda\) forces the model to minimize the \(\ell_1\)-norm, allowing the CE cost to increase. On the other hand, a lower \(\lambda\) decreases the CE cost, allowing the \(\ell_1\)-norm to grow. As a result, the model selects fewer features for relatively large values of \(\lambda\), and vice versa, as demonstrated on the MNIST digits. We chose the optimal \(\lambda\) from a wide range of values using the validation set and observed that smaller values work better for our model. We noted that the loss varied smoothly with \(\lambda\), making the selection of an optimal parameter more robust. The \(\ell_1\) penalty on the SPL induces sharp sparsity on the high-dimensional SMK_CAN data without shrinking all the variables, and our feature cut-off technique correctly demarcates the crucial features from the rest.

We established the utility of our feature selection model using several benchmarking experiments involving seven methods. The results span fourteen data sets from various domains, such as image, speech, accelerometer sensors, and biology, providing evidence that SCE features produce better generalization performance than previous state-of-the-art models. We compared SCE with FsNet, which is primarily designed for high-dimensional biological data, and found that our proposed method outperformed it in most cases. SCE also compares favorably with the supervised version of the Concrete Autoencoder (SCAE). Note that FsNet and SCAE use differentiable relaxation techniques, i.e., they employ smooth cost functions. On data sets with more samples than variables, SCE produces better classification results than LassoNet, Stochastic Gates, and DFS.

SCE may employ multiple centroids to capture the variability within a class, improving the prediction rate on unknown test samples. In particular, the prediction rate on ISOLET improved significantly from one centroid to multiple centroids, suggesting that the speech classes are multi-modal; the two-dimensional PCA of the ISOLET classes further confirms this. We also observed an enhanced classification rate on the MNIST, FMNIST, and Activity data with multiple centroids. In contrast, using a single centroid per class performed better on the other data sets (e.g., COIL-20, Mice Protein, GM12878). Hence, apart from producing an improved prediction rate, our model can provide extra information about whether the data are unimodal or multi-modal. This aspect of the sparsity-promoted centroid-encoder distinguishes it from STG, CAE, LassoNet, and DFS, which do not model the multi-modal nature of the data.

We demonstrated the interpretability of the selected features using several examples. The visualization of the MNIST pixels and the PCA projection of the ALLAML data provided a qualitative explanation for the high prediction rate. On the MNIST digits, SCE selected most of the pixels from the central part of the image, ignoring the border, which makes sense as the digits lie in the center of the 28 × 28 grid. On the high-dimensional ALLAML data, the top 50 SCE features showed better class separability in three-dimensional PCA space. Besides the visual explanation, the survey of the sixteen SCE genes of the GM12878 data indicates plausible biological significance.

We also presented a feature selection workflow to determine the optimal number of robust features. Our experimental results showed that the workflow improves the generalization performance of both Random Forest and SCE features, and we expect it to benefit other non-convex methods as well. We also presented a detailed empirical analysis to point out the challenges of inducing feature sparsity using stochastic optimization. Although the study was done using Sparsity-promoted Centroid-Encoder, it is plausible that other neural network-based models exhibit similar behavior, an area of research we hope to explore in the future.

The Sparsity-promoted Centroid-Encoder presented here produced state-of-the-art results on many benchmarking data sets. Nonetheless, we think the performance of the model can potentially be improved further by exploring additional extensions. Currently, the model maps a sample to its class centroid while applying sparsity; the selected features may not be discriminatory if two class centroids are close in the ambient space. Adopting a cost that also explicitly separates the classes may be beneficial. Our model may also be extended to the semi-supervised feature selection regime by combining the centroid-encoder cost with the autoencoder loss. In the future, we will explore these ideas.