1 Introduction

Over the past 10 years, Artificial Neural Networks (ANNs) have become the model of choice for machine learning tasks in many modern applications. Although not completely understood today, the reasons for their success are believed to be mathematical, statistical and computational.

From the point of view of approximation theory, ANNs approximate smooth functions well. For instance, a single hidden layer neural net with a diverging number of neurons is dense in the class of compactly supported continuous functions (Cybenko 1989), and the first error rate derived (Barron 1993) motivates shallow learning (few layers) (Ravishankar et al. 2015; Kostadinov et al. 2018). Some results show that deep learning is superior to shallow learning in the sense that fewer parameters are needed to achieve the same level of accuracy for a class of smooth and compositional functions, in which case deep learning avoids the curse of dimensionality; see Poggio et al. (2017) for a review. Grohs et al. (2019) prove that deep neural networks provide information-theoretically optimal approximation of a very wide range of functions used in signal processing. Chen and Chen (1995) and related papers extend the results to wider classes of functions. Approximation bounds for sparse neural networks, that is, networks with bounded connectivity, have been studied for instance by Bölcskei et al. (2019), who show a link between the degree of connectivity and the complexity of a function class. Adcock et al. (2022) and Adcock and Dexter (2021) show that deep ANNs compare well to benchmark compressed sensing methods, both achieving exponential rates of convergence for analytic functions, using a penalized square-root \(\ell _2\)-loss.

In machine learning, the success of ANNs is huge and can, in part, be attributed to their expressiveness or capacity (ability to fit a wide variety of functions). The very large number of parameters and the layer structure of ANNs make them hard to interpret. ANNs are overparametrized, with multiple distinct settings of the parameters leading to the same prediction, so traditional measures of model complexity based on the number of parameters do not apply. This makes understanding and interpreting the predictions challenging. Yet in scientific applications, one often seeks to do just that. In keeping with Occam’s razor, among all the models with similar predictive capability, the one with the smallest number of features should be selected. Statistically, models with fewer features are not only easier to interpret but can also produce predictors with good statistical properties, because such models disregard useless features that contribute only to higher variance.

Operationally, the model selection paradigm often uses a validation set or cross-validation (in which the data is randomly split, models are built on a training set and predictions are evaluated on a validation set). While conceptually elegant, (cross-)validation sets are of limited use if feature selection is of interest, as they tend to select many irrelevant features (Arlot and Celisse 2010), or if fitting a single model is computationally expensive. ANNs, and in particular deep ANNs, are computationally expensive to fit, so cross-validation is an expensive way of selecting model complexity. Aiming at good predictive performance on a test set, also known as generalization, cross-validation is a poor feature selector as it tends to select too many features. In addition, the quadratic prediction error from cross-validation exhibits an unexpected behavior with models of increasing complexity: as expected, the training error always decreases with increasing number of input features, but while the quadratic prediction error on the test set is at first U-shaped (initially decreasing thanks to decreasing bias, and then increasing due to an excess of variance), it then unexpectedly decreases a second time. This phenomenon, known as double descent, has been empirically observed (Advani et al. 2020; Geiger et al. 2019). For least squares estimation regularized by an \(\ell _2\) ridge penalty (Hoerl and Kennard 1970), double descent has been mathematically described for two-layer ANNs with random first-layer weights by Mei and Montanari (2021) and Hastie et al. (2019). They show that for high signal-to-noise ratio (SNR) and large sample size, high complexity is optimal for the ridgeless limit estimator of the weights, leading to a smooth and more expressive interpolating learner. In other words, interpolation is good and leads to double descent, which on reflection should not be a surprise since the interpolating ANN becomes smoother with increasing number of layers, and therefore predicts better between the interpolated training data. Indeed, with high SNR the signal is almost noiseless, so a smooth function interpolating the training data should perform well for future prediction. In noisy regimes (that is, with low SNR and small sample size), Mei and Montanari (2021) observe that regularization is needed.

In this paper, we present an alternative to the use of a validation set geared towards identifying important features. Specifically, we develop an automatic feature selection method for simultaneous feature extraction and generalization. For ease of exposition, we present our novel method in the context of regression and classification, noting that the ideas can be ported beyond. Our approach exploits ideas from statistical hypothesis testing that directly focus on identifying significant features, without explicitly minimizing the generalization error. Similar ideas percolate through the statistics literature; see for example Johnstone and Silverman (2004), Chen et al. (1999), Tibshirani (1996) with LASSO, and Bühlmann and van de Geer (2011), who propose methods for finding needles in a haystack in linear models. In this context, the optimized criterion is not the prediction error, but the ability to retrieve the needles (i.e., relevant features). Useful criteria include the stringent exact support recovery criterion, and softer criteria such as the false discovery rate (FDR) and true positive rate (TPR).

Of course some regularization methods have already been developed to enforce sparsity on the weights of ANNs. For example, dropout leaves out a certain number of neurons to prevent overfitting, which incidentally can be used to perform feature selection (Hinton et al. 2012; Srivastava et al. 2014). Sparse neuron architectures can be achieved by other means: Mollaysa et al. (2017) enforce sparsity based on the Jacobian, and Li et al. (2016), Lee et al. (2006), Ranzato et al. (2007), Collins and Kohli (2014) and Ma et al. (2019) employ an \(\ell _1\)-based LASSO penalty to induce sparsity. Curci et al. (2021) prune their ANNs based on a metric for neuron importance. Evci et al. (2019) discuss the difficulty of training sparse ANNs. spinn (sparse input neural networks) (Feng and Simon 2019) has a sparsity-inducing penalty and is governed by two hyperparameters chosen on a validation set; its improved version spinn-dropout (the former originally published in 2017) adds a dropout mechanism governed by an additional hyperparameter (Ye and Sun 2018). So spinn-dropout is a mix between \(\ell _1\) and \(\ell _0\) (subset selection) sparsity-inducing methods, similar to the pruning idea (Carreira-Perpinan and Idelbayev 2018; Chao et al. 2020). Sun et al. (2021) propose a Bayesian neural network (BNN) learner. None of these learners have been studied through the prism of a phase transition in the probability of retrieving features.

All of these sparsity inducing methods suffer from two drawbacks: (1) the selection of the penalty parameter(s) is often partly based on a validation set, therefore geared towards good generalization performance, not feature identification, and some hyperparameters are set to arbitrary values; (2) the ability to recover the “right” features has not been quantified through the prism of a phase transition in the probability of support recovery; spinn, spinn-dropout and BNN consider criteria related to FDR and TPR.

This paper is organized as follows. Section 2 presents the theoretical framework and defines our LASSO ANN learner. Section 2.1 defines the statistical model and notation. Section 2.2 reviews the LASSO sparsity paradigm for linear models and extends it to ANNs. Section 2.3 discusses the choice of activation functions. Section 2.4 derives a selection rule for the penalty parameter, a generalization of the universal threshold (Donoho and Johnstone 1994) to non-convex optimization due to the nonlinearity of ANN models. Section 2.5 discusses optimization issues to solve the non-convex high-dimensional and non-differentiable optimization problem. Section 3 evaluates via simulations the ability of our method to exhibit a phase transition in the probability of exact support recovery for the regression task. Section 4 evaluates with a large number of real data sets the ability of our method to perform feature selection and generalization for the classification task. Section 5 summarizes the findings and points to future developments. Proofs and technical details are given in the Appendix.

2 LASSO ANN

2.1 Function estimation model and notation

Suppose n pairs of output–input data \(({{\mathcal {Y}}}, {{\mathcal {X}}})=\{(\textbf{y}_i,\textbf{x}_i) \}_{i=1}^n\) are collected to learn about their association. For example, in some medical applications (see Sect. 4.1), \(\textbf{x}\in {{\mathbb {R}}}^{p_1}\) is an input vector of \(p_1\) gene expressions and \(\textbf{y}\) is any of m cancer types, coded as a one-hot output vector of \({\mathbb R}^{m}\); classification aims at assigning the correct type of cancer given an input vector. In regression, y is a scalar (\(m=1\)), for instance the riboflavin production rate of a bacterium (see Sect. 4.2).

To model their stochastic nature, data can be modeled as realizations from the pair of random vectors \((\textbf{Y},\textbf{X})\). We assume the real-valued response \(\textbf{Y} \in {{\mathbb {R}}}^{m}\) is related to real-valued feature vector \(\textbf{X}\in {\mathbb {R}}^{p_1}\) through the conditional expectation

$$\begin{aligned} {{\mathbb {E}}}[\textbf{Y}\mid \textbf{X}=\textbf{x}] = \mu (\textbf{x}), \end{aligned}$$
(1)

for some unknown function \(\mu : {{\mathbb {R}}}^{p_1} \rightarrow \Gamma \subseteq {{\mathbb {R}}}^m\). In regression, \(\Gamma ={{\mathbb {R}}}\) and in classification, \(\Gamma =\{{\varvec{\pi }}\in ({{\mathbb {R}}}^+)^m: \sum _{k=1}^m \pi _k=1\}\), where \(\pi _k\) is the probability of belonging to class k.

Many learners have been proposed to model the association \(\mu \) between input and output. A recent approach that is attracting considerable attention models \(\mu \) as a standard fully connected ANN with l layers

$$\begin{aligned} \mu _{\varvec{\theta }}(\textbf{x})= S_l \circ \cdots \circ S_1\left( \textbf{x}\right) , \end{aligned}$$
(2)

where \({\varvec{\theta }}\) are the parameters (see (5)) indexing the ANN and, letting \(\textbf{u}=\textbf{x}\) at the first layer, each nonlinear function \(S_k(\textbf{u})=\sigma (\textbf{b}_k + W_k \textbf{u})\) maps the \(p_k\times 1\) vector \(\textbf{u}\) into a \(p_{k+1}\times 1\) latent vector obtained by applying an activation function \(\sigma \) component-wise, for each layer \(k \in \{1,\ldots , l-1\}\). The vectors \(\textbf{b}_k\) are commonly named “biases.” The matrix of weights \(W_k\) is \(p_{k+1} \times p_k\) and \(+\) denotes the broadcasting operation.

The last layer \(k=l\) has two requirements. First we must have \(p_{l+1}=m\) to match the output dimension, so the last function is \(S_l(\textbf{u})=G(\textbf{c}+W_l \textbf{u})\) where \(W_l\) is \(m \times p_l\) and the intercept vector \(\textbf{c}\in {{\mathbb {R}}}^{m}\). Second the function \(G: {{\mathbb {R}}}^{m}\rightarrow \Gamma \) is a link function that maps \({{\mathbb {R}}}^{m}\) into the parameter space \(\Gamma \). Commonly used link functions for classification are

$$\begin{aligned} G(\textbf{u})&=\left( \frac{\exp \{u_1\}}{\sum _{k=1}^{m}\exp \{u_k\}}, \ldots , \frac{\exp \{u_m\}}{\sum _{k=1}^{m}\exp \{u_k\}}\right) ^{\textrm{T}} \end{aligned}$$
(3)
$$\begin{aligned} G(\textbf{u})&=\left( \frac{\exp \{u_1\}}{\sum _{k=1}^{m-1}\exp \{u_k\} +1}, \ldots , \frac{\exp \{u_{m-1}\}}{\sum _{k=1}^{m-1}\exp \{u_k\}+1}, \frac{1}{\sum _{k=1}^{m-1}\exp \{u_k\}+1}\right) ^{\textrm{T}} \end{aligned}$$
(4)

respectively called Softmax and multiclass-Logit. For regression, \(G(u)=u\).
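
For concreteness, here is a minimal numpy sketch of these two link functions; the function names are ours, and the multiclass-Logit version is implemented via the equivalent form that appends a zero score for the reference class.

```python
import numpy as np

def softmax(u):
    # Softmax link (3): maps a score vector in R^m to the probability simplex.
    e = np.exp(u - np.max(u))        # shift for numerical stability (leaves (3) unchanged)
    return e / e.sum()

def multiclass_logit(u):
    # Multiclass-Logit link (4): u holds the m-1 free scores; the last class is the
    # reference, which amounts to appending a zero score and applying softmax.
    return softmax(np.append(u, 0.0))

# identity link for regression: G(u) = u
```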

The parameters indexing the neural network are therefore

$$\begin{aligned} {\varvec{\theta }}=(( W_1, \textbf{b}_1, \ldots , \textbf{b}_{l-1}), (W_2, \ldots , W_l,\textbf{c}))=:({\varvec{\theta }}_1, {\varvec{\theta }}_2) \end{aligned}$$
(5)

for a total of \(\gamma =\sum _{k=1}^l p_{k+1}(p_k+1)\) parameters. The following property is straightforward to prove, but is crucial for our methodology; it is the reason for splitting \({\varvec{\theta }}\) into \({\varvec{\theta }}_1\) and \({\varvec{\theta }}_2\).

Property 1

Assuming the activation function satisfies \(\sigma (0)=0\), then setting \({\varvec{\theta }}_1=\textbf{0}\) implies \(\mu _{\varvec{\theta }}(\textbf{x})\) is the constant function \(\mu (\textbf{x})=G(\textbf{c})\) for all \(\textbf{x} \in {\mathbb {R}}^{p_1}\).

We believe that only a few features in the \(p_1\)-long input vector \(\textbf{x}\) carry information to predict the output. For many medical data sets treated in Sect. 4 for instance, the input is a vector of hundreds of gene expressions, and genetics aims to identify the ones having an effect on the output. So our main goal while estimating \({\varvec{\theta }}\) is to find needles in the haystack by selecting a subset of the \(p_1\) inputs, that is, by setting some entries of \({\varvec{\theta }}_1\) to zero. Feature selection has been extensively studied for linear associations, showing a phase transition between regimes where features can be retrieved with probability near one and regimes where the probability of retrieving the features is essentially zero. Our goal is to investigate such a phase transition with ANN learners to retrieve features in nonlinear associations.

2.2 Sparse estimation

Finding needles amounts to setting to non-zero values some weights corresponding to features in \(\textbf{x}\) that have predictive information. So we seek sparsity in the first layer on the weights \(W_1\). For the other layers, large weights in a layer could compensate for small weights in the next layer, so we bound them by forcing unit \(\ell _2\)-norm; Feng and Simon (2019) and Ye and Sun (2018) instead take the approach of a ridge penalty controlled by an additional hyperparameter fixed to the arbitrary value of 0.0001. We slightly modify the nonlinear terms in (2) and define the \(j\text {th}\) nonlinear function \(S_{k,j}\) in layer k as

$$\begin{aligned} S_{k,j}(\textbf{u})= \left\{ \begin{array}{ll} \sigma \left( \textbf{b}_1^{(j)} + \langle \textbf{w}_1^{(j)}, \textbf{u} \rangle \right) &{} k=1\\ \sigma \left( \textbf{b}_k^{(j)} +\frac{ \langle \textbf{w}_k^{(j)}, \textbf{u} \rangle }{\left\| \textbf{w}_k^{(j)} \right\| _2} \right) &{} 1< k <l \\ G\left( \textbf{c}+ \frac{ \langle \textbf{w}_k^{(j)},\textbf{u} \rangle }{\left\| \textbf{w}_k^{(j)}\right\| _2}\right) &{} k=l \end{array} \right. , \quad j \in \{1,\ldots ,p_{k+1}\}, \nonumber \\ \end{aligned}$$
(6)

where \(\textbf{w}_k^{(j)}\) is the \(j\text {th}\) row of \(W_k\). At the last layer (\(k=l\)), \(\textbf{c}\) plays the role of an intercept.
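
To fix ideas, here is a minimal numpy sketch of a forward pass through the modified architecture (6), using the centered softplus (which satisfies requirement (10) of Sect. 2.3) as example activation. The helper names and the small constant guarding against zero rows are our own implementation details, not part of the model.

```python
import numpy as np

def centered_softplus(u):
    # satisfies (10): sigma(0) = 0 and sigma'(0) = 1/2 > 0
    return np.logaddexp(0.0, u) - np.log(2.0)

def forward(x, Ws, bs, c, G, sigma=centered_softplus):
    # Ws = [W_1, ..., W_l] with W_k of shape (p_{k+1}, p_k); bs = [b_1, ..., b_{l-1}];
    # c is the last-layer intercept and G the link function.
    l = len(Ws)
    u = x
    for k, W in enumerate(Ws, start=1):
        if k == 1:
            u = sigma(bs[0] + W @ u)                      # first layer: penalized, unnormalized
        else:
            rows = np.maximum(np.linalg.norm(W, axis=1), 1e-12)
            z = (W @ u) / rows                            # unit l2-norm rows as in (6)
            u = sigma(bs[k - 1] + z) if k < l else G(c + z)
    return u
```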

Sparsity in the first layer allows interpretability. To enforce sparsity and control overfitting, we take the approach inspired by LASSO of minimizing a compromise between a measure \({{\mathcal {L}}}_n\) of closeness to the data and a measure of sparsity P. Owing to Property 1, we estimate the parameters \({\varvec{\theta }}=({\varvec{\theta }}_1, {\varvec{\theta }}_2)\) defined in (5) by seeking the best local minimum

$$\begin{aligned} \hat{\varvec{\theta }}_\lambda {\in } \arg \min _{ {\varvec{\theta }}\in {{\mathbb {R}}}^\gamma } {{\mathcal {L}}}_n ({{\mathcal {Y}}} , {\mu }_{\varvec{\theta }}( {{\mathcal {X}}})) + \lambda \ P({\varvec{\theta }}_1) \end{aligned}$$
(7)

found by a numerical scheme, where \(\lambda >0\) is the regularization parameter of the procedure and P is a sparsity-inducing penalty (Bach et al. 2011). We stress that our method is driven by a single regularization parameter \(\lambda \), as opposed to other methods that use two or three hyperparameters (Ye and Sun 2018; Feng and Simon 2019; Sun et al. 2021).

Common loss functions between training responses \({{\mathcal {Y}}}\) and predicted values \({\mu }_{\varvec{\theta }}( {{\mathcal {X}}})\) include: for m-class classification, the cross-entropy loss \({{\mathcal {L}}}_n ({{\mathcal {Y}}} , {\mu }_{\varvec{\theta }}( {{\mathcal {X}}}))=-\sum _{i=1}^n \textbf{y}_i^{\textrm{T}}\log \mu _{\varvec{\theta }}(\textbf{x}_i)\), where the \(\log \) function is applied component-wise to the m-long vectors \(\textbf{y}_i\) and \(\mu _{\varvec{\theta }}(\textbf{x}_i)\); for regression, we use \({{\mathcal {L}}}_n ({{\mathcal {Y}}} , {\mu }_{\varvec{\theta }}( {{\mathcal {X}}}))=(\sum _{i=1}^n ( {y}_i- {\mu }_{\varvec{\theta }}( \textbf{x}_i))^2)^{1/2}\) for reasons that will become clear, although its squared version is more often used.
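
In code, these two losses read as follows (a sketch with our own function names):

```python
import numpy as np

def cross_entropy(Y, P):
    # classification loss: Y and P are n x m (one-hot responses, predicted class probabilities)
    return -np.sum(Y * np.log(P))

def sqrt_l2_loss(y, mu):
    # square-root l2-loss for regression (the non-squared version used here)
    return np.linalg.norm(y - mu)
```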

A common sparsity-inducing penalty used by waveshrink (Donoho and Johnstone 1994) and LASSO (Tibshirani 1996) for \(q=1\) and group-LASSO (Yuan and Lin 2006) for \(q=2\) is

$$\begin{aligned} P({\varvec{\theta }}_1)= \sum _{j=1}^{p_1}\Vert \textbf{w}_{1,j}\Vert _q + \sum _{k=1}^{l-1} \Vert \textbf{b}_k \Vert _q , \end{aligned}$$
(8)

where \(\textbf{w}_{1,j}\) is the \(j\text {th}\) column of \(W_1\). The choice \(q=2\) forces the \(j\text {th}\) feature to be either on or off across all neurons. The choice \(q=1\) is more flexible since a feature can be on in one neuron and off in another, so, in the sequel, we use \(q=1\). The reason for penalizing the biases is that the gradient of the loss function with respect to the biases evaluated at zero is zero and that the Hessian is only positive semi-definite (see Appendix C); so a local minimum would not be guaranteed without a penalty on the biases.
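
A short sketch of penalty (8) (our own helper; \(q=1\) is the choice used in the sequel, \(q=2\) gives the group-LASSO variant):

```python
import numpy as np

def penalty(W1, bs, q=1):
    # P(theta_1): l_q norms of the columns of W_1 plus l_q norms of the biases b_1, ..., b_{l-1}
    cols = np.linalg.norm(W1, ord=q, axis=0)
    return cols.sum() + sum(np.linalg.norm(b, ord=q) for b in bs)
```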

ANNs are flexible in the sense that they can fit nonlinear associations. A more rigid and older class of models that has been extensively studied is the class of linear models

$$\begin{aligned} \mu _{\varvec{\theta }}^{\textrm{lin}}(\textbf{x})= c+\sum _{j=1}^{p_1} \beta _j x_j, \end{aligned}$$
(9)

where here the set of parameters \({\varvec{\theta }}=(\beta _1, \ldots , \beta _{p_1}, c)=:({\varvec{\theta }}_1, c)\) is assumed s-sparse, that is only s entries in \({\varvec{\theta }}_1\) are different from zero. Here again, like for \(W_1\) in ANNs, a non-zero entry in \({\varvec{\theta }}_1\) corresponds to an entry in the input vector \(\textbf{x}\) that is relevant to predict the response. For a properly chosen penalty parameter \(\lambda \), LASSO has the remarkable property of retrieving the non-zero entries of \({\varvec{\theta }}_1\) in certain regimes (that depend on n, \(p_1\), SNR, training locations \({{\mathcal {X}}}\) and amount s of sparsity), as studied in the noiseless and noisy scenarios by Candès and Tao (2005), Donoho (2006), Donoho et al. (2011), and Bühlmann and van de Geer (2011), for instance. In particular, the value of \(\lambda \) must bound the sup-norm of the gradient of the empirical loss at zero with high probability when \({\varvec{\theta }}_1=\textbf{0}\) for LASSO to satisfy oracle inequalities. For linear models in wavelet denoising theory (Donoho and Johnstone 1994), this approach leads to an asymptotic minimax property.

Our contribution is to extend the linear methodology to ANNs, and to investigate how well our extension leads to a phase transition to discover nonlinear lower-dimensional structures in the data.

2.3 Choice of activation functions

Since the weights from level two and higher are bounded on the \(\ell _2\)-ball of unit radius (6), we require the activation function \(\sigma \in {{\mathcal {C}}}^2({{\mathbb {R}}})\) to be unbounded. For reasons related to Property 1 and the choice of the hyperparameter \(\lambda \) based on the zero-thresholding function proportional to \(\sigma '(0)\) (see (13) and (14)), we require

$$\begin{aligned} \sigma (0)=0 \quad \textrm{and} \quad \sigma '(0)>0. \end{aligned}$$
(10)

The centered softplus function \(\sigma _\textrm{softplus}(u)=\log (1+\exp (u))-\log (2)\) for example satisfies this requirement. The ReLU (Rectified Linear Unit) function \(\sigma _{\textrm{ReLU}}(u)=\max (u,0)\) does not, because it is not differentiable at zero.

A legitimate question for a statistician is whether ANNs can retrieve interactions between covariates. Projection pursuit models (Friedman and Stuetzle 1981) have this ability, which additive models do not have. Thanks to the property of ANNs of being dense in smooth function spaces, the answer is yes, but a large number of neurons is needed when conventional activation functions like softplus and ReLU are used. With the family of activation functions defined below, a k-way interaction can be written with a sparse ANN.

Definition 1

The smooth activation rescaled dictionary is the collection of activation functions defined by

$$\begin{aligned} \sigma _{M,u_0,k}(u)=\frac{1}{k}\left( f(u)^k-f(0)^k\right) \quad \textrm{with} \quad f(u)=\frac{1}{M}\log (1+\exp \{M (u+u_0)\}) \end{aligned}$$
(11)

indexed by \(M>0, u_0>0, k >0\). For \(u_0=1\) the dictionary is rescaled in the sense that \(\lim _{M\rightarrow \infty }\sigma '_{M,u_0,k}(0)=1\).
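
A numerically stable implementation of the dictionary (11), with a finite-difference check of the rescaling property, could look as follows (a sketch; the default values are illustrative):

```python
import numpy as np

def sigma(u, M=20.0, u0=1.0, k=1.0):
    # sigma_{M,u0,k}(u) = (f(u)^k - f(0)^k) / k with f(u) = log(1 + exp(M (u + u0))) / M
    f = lambda v: np.logaddexp(0.0, M * (np.asarray(v, dtype=float) + u0)) / M
    return (f(u) ** k - f(0.0) ** k) / k

# sigma(0) = 0 by construction, and sigma'(0) is close to u0^{k-1} = 1 for large M:
eps = 1e-6
print(sigma(0.0), (sigma(eps) - sigma(-eps)) / (2 * eps))   # ~ 0.0 and ~ 1.0
```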

Suppose for instance the association is a single two-way interaction, that is, \(\mu (x)=x_i x_j\) for some pair (i, j). Then with 6 neurons we have \(x_i x_j=-1+ \sigma _{\infty ,1,2}(x_i+x_j-1)+ \sigma _{\infty ,1,2}(-x_i-x_j-1)- \sigma _{\infty ,1,2}(x_i-1)- \sigma _{\infty ,1,2}(-x_i-1)- \sigma _{\infty ,1,2}(x_j-1)- \sigma _{\infty ,1,2}(-x_j-1)\), since \(x^2/2=1+ \sigma _{\infty ,1,2}(x-1)+ \sigma _{\infty ,1,2}(-x-1)\). When the ANN model employs both linear and quadratic smooth ReLU (shifted by one), selecting neurons with \(\sigma _{M,1,k}\) with \(k=2\) and M large reveals potential interactions.
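
This identity can be checked numerically with the sigma sketch above, using a large M in place of the limit (our own toy check):

```python
import numpy as np

s2 = lambda u: sigma(u, M=50.0, u0=1.0, k=2.0)   # smooth surrogate for sigma_{inf,1,2}
rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2)
approx = (-1.0 + s2(xi + xj - 1) + s2(-xi - xj - 1)
          - s2(xi - 1) - s2(-xi - 1) - s2(xj - 1) - s2(-xj - 1))
print(approx, xi * xj)   # the two values agree up to the smoothing error of finite M
```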

The proposed activation functions have some basic properties. They satisfy requirements (10). For finite M, \(\sigma _{M,u_0,j}\in {{\mathcal {C}}}^\infty \). For \((u_0,k)=(0,1)\), the family includes two important activation functions: softplus for \(M=1\) and ReLU as M tends to infinity. Using ReLU is prohibited with our method; if one likes the shape of ReLU, then one gets a smooth approximation of ReLU by choosing a large M (say \(M=20\)). Moreover, since \(\lim _{M\rightarrow \infty }\sigma '_{M,u_0,k}(0)=u_0^{k-1}\), choosing \(u_0=1\) scales the activation functions across k’s in the sense that their derivatives at zero are asymptotically (as \(M\rightarrow \infty \)) equal to one for all k; this is a desired property in our methodology since the zero-thresholding functions (defined in Theorem 1 and derived in Theorem 3 below) are proportional to \(\sigma '(0)\). Moreover, since sparsity is of interest, zero is a region where the cost function ought to be smooth for optimization purposes; choosing \(u_0=1\) also keeps the curvature of the loss function bounded at zero since \(\sigma _{M,1,1}''(0)=M/\exp (M)\) while \(\sigma _{M,0,1}''(0)=M/4\) (which reflects that ReLU is not differentiable at zero).

2.4 Selection of penalty \(\lambda \)

The proposed choice of \( \lambda \) is based on Property 1. It shows that fitting a constant function is achieved by choosing \(\lambda \) large enough to set the penalized parameters \( {\varvec{\theta }}_1\) to zero when solving the penalized cost function (7). For convex loss functions and linear models, the quantile universal threshold (Giacobino et al. 2017) achieves this goal with high probability under the null model that the underlying function is constant. This specific value \(\lambda _{\textrm{QUT}}\) has good properties for model selection outside the null model as well (Donoho and Johnstone 1994; Donoho et al. 1995; Bühlmann and van de Geer 2011). The quantile universal threshold has so far been developed and employed for cost functions that are convex in the parameters, hence guaranteeing that any local minimum is also global. The cost function in (7) is not convex for ANN models, so we extend the quantile universal threshold by creating with high probability a local minimum at the sparse point of interest \({\varvec{\theta }}_{1}=\textbf{0}\). This can be achieved thanks to the penalty term \(\lambda \ P({\varvec{\theta }}_{1})\) that is part of the cost function in (7), provided \(\lambda \) is large enough. The following theorem derives an expression for the zero-thresholding function that gives the smallest \(\lambda \) that guarantees a minimum with \(\hat{\varvec{\theta }}_1=\textbf{0}\).

Theorem 1

For given output–input data \(({{\mathcal {Y}}}, {{\mathcal {X}}})\), consider the optimization problem (7) with \(P({\varvec{\theta }}_{1})\) defined in (8) with \(q=1\), activation function \(\sigma \in {{\mathcal {C}}}^2({{\mathbb {R}}})\) and loss function \({{\mathcal {L}}}_n\in {{\mathcal {C}}}^2(\Gamma ^n)\) such that \(\hat{\textbf{c}}= \arg \min _{ \textbf{c}\in {{\mathbb {R}}}^m} {{\mathcal {L}}}_n({{\mathcal {Y}}},G(\textbf{c}))\) exists. Let \( {\varvec{\theta }}^0=(\textbf{0}_{p_2\times p_1}, W_2 , \ldots , W_l,\hat{\textbf{c}})\) with arbitrary values \(W_{k}\) for layers 2 to l. Define \({ g}_0({{\mathcal {Y}}}, {{\mathcal {X}}}, {\varvec{\theta }}^0)=\nabla _{{\varvec{\theta }}_{1}} {{\mathcal {L}}}_n(\mathcal{Y},{\varvec{\mu }}_{{\varvec{\theta }}^0}({{\mathcal {X}}}))\). For any \(\lambda \) larger than the zero-thresholding function \(\lambda _0({{\mathcal {Y}}}, {{\mathcal {X}}})=\sup _{(W_2 \ldots , W_{l})}\Vert { g}_0({{\mathcal {Y}}}, {{\mathcal {X}}}, {\varvec{\theta }}^0) \Vert _\infty \), there is a local minimum to (7) with \((\hat{\varvec{\theta }}_{1,\lambda }, \hat{\textbf{c}}_\lambda )=(\textbf{0}, \hat{\textbf{c}})\).

The proof of Theorem 1 is provided in the appendix; it can be made more general for \(q\ge 1\) using Hölder’s inequality. The estimate \(\hat{\textbf{c}}\) often has a closed form expression; in regression for instance, if the loss function between \({{\mathcal {Y}}}\in {{\mathbb {R}}}^n\) and \(c \textbf{1}\) with \(c\in {{\mathbb {R}}}\) and \(\textbf{1} \in {{\mathbb {R}}}^n\) is \(\mathcal{L}_n({{\mathcal {Y}}},c\textbf{1}) = \Vert {{\mathcal {Y}}}-c \textbf{1}\Vert _2\), then \({{\hat{c}}}=\bar{{\mathcal {Y}}}\), the average of the responses. Based on \(\lambda _0({{\mathcal {Y}}}, {{\mathcal {X}}})\), the following theorem extends the universal threshold to non-convex cost functions.

Theorem 2

Given training inputs \({{\mathcal {X}}}\), define the random set of outputs \({{\mathcal {Y}}}_0\) generated from (1) with \(\mu (\mathcal X)=\mu _{\varvec{\theta }}({{\mathcal {X}}})\) defined in (2) for any activation function satisfying (10) under the null hypothesis \(H_0: {\varvec{\theta }}_{1}=\textbf{0}\), that is \(H_0: \mu _{\varvec{\theta }}=\textbf{c}\) is a constant function. Letting the random variable \(\Lambda =\lambda _0({{\mathcal {Y}}}_0, \mathcal{X})\) and \(F_\Lambda \) be the distribution function of \(\Lambda \), the quantile universal threshold is \(\lambda _\textrm{QUT}=F^{-1}_\Lambda (1-\alpha )\) for a small value of \(\alpha \). It satisfies that

$$\begin{aligned} {{\mathbb {P}}}_{H_0}\left( \text{there exists a local minimum to (7) such that } \mu _{\hat{\varvec{\theta }}_{\lambda _{\textrm{QUT}}}} \text{ is constant}\right) \ge 1-\alpha . \end{aligned}$$
(12)

A proof of Theorem 2 can be found in Giacobino et al. (2017). The law of \(\Lambda \) is unknown but can be easily estimated by Monte Carlo simulation, provided there exists a closed form expression for the zero-thresholding function \(\lambda _0({{\mathcal {Y}}}, {{\mathcal {X}}})=\sup _{(W_2 \ldots , W_{l})}\Vert { g}_0({{\mathcal {Y}}}, {{\mathcal {X}}}, {\varvec{\theta }}^0) \Vert _\infty \). The following theorem states a simple expression for \(\lambda _0(\mathcal{Y}, {{\mathcal {X}}})\) in two important cases: classification and regression.

Theorem 3

Consider a fully connected l-layer ANN employing a differentiable activation function \(\sigma \) and let \(\tau _l=\sqrt{\Pi _{j=3}^l p_{j}}\) for \(l\ge 3\), \(\tau _2=1\), \({{\mathcal {Y}}}_\bullet ={{\mathcal {Y}}}-\textbf{1}_{n}\bar{{\mathcal {Y}}}\) and \(\Vert A\Vert _\infty =\max _{j=1,\ldots ,p} \sum _{i=1}^k |a_{ji}|\) for a \(p\times k\) matrix A.

  • In classification, using the cross-entropy \({{\mathcal {L}}}_n ({{\mathcal {Y}}} , {\mu }_{\varvec{\theta }}( {{\mathcal {X}}}))=-\sum _{i=1}^n \textbf{y}_i^{\textrm{T}}\log \mu _{\varvec{\theta }}(\textbf{x}_i)\) and for the Softmax link function G in (3), we have

    $$\begin{aligned} \lambda _0({{\mathcal {Y}}}, {{\mathcal {X}}}) = \tau _l \sigma '(0)^{l-1} \Vert \mathcal{X}^{\textrm{T}} {{\mathcal {Y}}}_\bullet \Vert _\infty ; \end{aligned}$$
    (13)
  • In regression, for \({{\mathcal {L}}}_n=\Vert {{\mathcal {Y}}} - \mu _{ \varvec{\theta }}({{\mathcal {X}}}) \Vert _2\), we have

    $$\begin{aligned} \lambda _0({{\mathcal {Y}}}, {{\mathcal {X}}}) = \tau _l \sigma '(0)^{l-1} \frac{\Vert {{\mathcal {X}}}^{\textrm{T}} {{\mathcal {Y}}}_\bullet \Vert _\infty }{\Vert \mathcal{Y}_\bullet \Vert _2} . \end{aligned}$$
    (14)

Theorem 2 states that the choice of \(\lambda \) is simply an upper quantile of the random variable \(\Lambda =\lambda _0(\mathcal{Y}_0, {{\mathcal {X}}})\), where \({{\mathcal {Y}}}_0\) is the response generated under the null hypothesis that \({\varvec{\theta }}_1=\textbf{0}\). The upper quantile of \(\Lambda \) can be easily estimated by Monte-Carlo simulation.

In regression and assuming Gaussian errors, the null distribution is \({{\mathcal {Y}}}_0 \sim \textrm{N}(c \textbf{1}, \xi ^2 I_n)\). Both the constant c and \(\xi ^2\) are unknown however, and \(\xi ^2\) is difficult to estimate in high dimension. Fortunately, one observes first that (14) involves only the mean-centered responses \(\mathcal{Y}_\bullet \) and therefore does not depend on c. Second, both numerator and denominator are proportional to \(\xi \). Consequently, \(\Lambda \) is a pivotal random variable in the Gaussian case. Knowledge of c and \(\xi ^2\) is therefore not required to derive our choice of hyperparameter \(\lambda _{\textrm{QUT}}\). This well-known fact, inspired by square-root LASSO (Belloni et al. 2011), motivates the use of \({{\mathcal {L}}}_n=\Vert {{\mathcal {Y}}} - \mu _{ \varvec{\theta }}({{\mathcal {X}}}) \Vert _2\) rather than \({{\mathcal {L}}}_n=\Vert {{\mathcal {Y}}} - \mu _{ \varvec{\theta }}({{\mathcal {X}}}) \Vert _2^2\).
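
A Monte Carlo sketch of \(\lambda _{\textrm{QUT}}\) for regression, using the closed form (14) and the pivotality noted above; the helper names, the default \(\alpha =0.05\) and the number of Monte Carlo samples are our own choices.

```python
import numpy as np

def lambda0_regression(y, X, widths, sigma_prime0):
    # Zero-thresholding function (14); widths = (p_1, ..., p_{l+1}) so l = len(widths) - 1,
    # and tau_l = sqrt(p_3 * ... * p_l) with tau_2 = 1.
    l = len(widths) - 1
    tau = np.sqrt(np.prod(widths[2:l])) if l >= 3 else 1.0
    yc = y - y.mean()
    return tau * sigma_prime0 ** (l - 1) * np.max(np.abs(X.T @ yc)) / np.linalg.norm(yc)

def lambda_qut_regression(X, widths, sigma_prime0, alpha=0.05, n_mc=10000, seed=0):
    # (1 - alpha)-quantile of Lambda = lambda_0(Y_0, X) with Y_0 ~ N(0, I_n):
    # by pivotality, neither c nor xi^2 needs to be known.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    lams = [lambda0_regression(rng.standard_normal(n), X, widths, sigma_prime0)
            for _ in range(n_mc)]
    return np.quantile(lams, 1 - alpha)
```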

In classification, the null distribution is \({{\mathcal {Y}}}_0 \sim \textrm{Multinomial}(n, {\varvec{\pi }}=G(\textbf{c}))\). The constant vector \(\textbf{c}\) is unknown and the random variable \(\Lambda \) with \(\lambda _0\) defined in (13) is not pivotal. Moreover, Holland (1973) proved that no covariance stabilizing transformation exists for the trinomial distribution. So the approach we take is to assume that the training outputs \({{\mathcal {Y}}}\) reflect the class proportions of the future samples for which class prediction is sought. So if \(\hat{\varvec{\pi }}\) are the proportions of classes in the training set, then the null distribution is \({{\mathcal {Y}}}_0 \sim \textrm{Multinomial}(n, {\varvec{\pi }}={\hat{\varvec{\pi }}})\). The quantile universal threshold derived under this null hypothesis is appropriate if future data come from the same distribution, which is a reasonable assumption.
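
The classification analogue, using (13) and null outputs drawn from \(\textrm{Multinomial}(n, \hat{\varvec{\pi }})\) (again a sketch with our own helper names):

```python
import numpy as np

def lambda0_classification(Y, X, widths, sigma_prime0):
    # Zero-thresholding function (13); Y is the n x m one-hot response matrix.
    l = len(widths) - 1
    tau = np.sqrt(np.prod(widths[2:l])) if l >= 3 else 1.0
    Yc = Y - Y.mean(axis=0)
    return tau * sigma_prime0 ** (l - 1) * np.max(np.abs(X.T @ Yc).sum(axis=1))

def lambda_qut_classification(Y, X, widths, sigma_prime0, alpha=0.05, n_mc=10000, seed=0):
    # Null outputs are one-hot rows drawn with the training class proportions pi_hat.
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    pi_hat = Y.mean(axis=0)
    lams = np.empty(n_mc)
    for b in range(n_mc):
        Y0 = np.eye(m)[rng.choice(m, size=n, p=pi_hat)]
        lams[b] = lambda0_classification(Y0, X, widths, sigma_prime0)
    return np.quantile(lams, 1 - alpha)
```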

2.5 Computational cost for LASSO ANN

For a given \(\lambda \), we solve (7) first by steepest descent with a small learning rate, and then employ a proximal method to refine the minimum by exactly setting to zero some entries of \({{\hat{W}}}_{1,\lambda _{\textrm{QUT}}}\) (Beck and Teboulle 2009; Bach et al. 2011).

Solving (7) directly for the prescribed \(\lambda =\lambda _{\textrm{QUT}}\) risks getting trapped at some poor local minimum. Instead, inspired by simulated annealing and warm start, we avoid thresholding too hardly at first and possibly missing important features by solving (7) for an increasing sequence of \(\lambda \)’s tending to \(\lambda =\lambda _{\textrm{QUT}}\), namely \(\lambda _{k+1}=\exp (k)/(1+\exp (k))\lambda _{\textrm{QUT}}\) for \(k\in \{-1,0,\ldots ,4\}\). Taking as initial parameter values the solution corresponding to the previous \(\lambda _k\) leads to a sequence of sparser approximating solutions until solving for \(\lambda _{\textrm{QUT}}\) at the last step. We do not perform multiple starts.
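
An ISTA-style sketch of the warm-start schedule and of a proximal step on the penalized first-layer weights is given below; the training-loop names `backprop`, `theta`, `lr` and `n_steps` are placeholders, not the paper's implementation.

```python
import numpy as np

def soft_threshold(W, t):
    # proximal operator of t * ||.||_1, applied entry-wise: produces exact zeros
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

lambda_qut = 1.0   # e.g. obtained from the Monte Carlo sketch of Sect. 2.4
lambdas = [np.exp(k) / (1 + np.exp(k)) * lambda_qut for k in range(-1, 5)] + [lambda_qut]

# for lam in lambdas:                                    # warm start: reuse the previous solution
#     for _ in range(n_steps):
#         theta = theta - lr * backprop(theta)           # gradient step on the smooth loss
#         theta.W1 = soft_threshold(theta.W1, lr * lam)  # proximal step on W_1 (and biases)
```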

The computational cost is low. It requires solving (7) approximately on the small grid of \(\lambda \)’s tending to \(\lambda _{\textrm{QUT}}\) using the warm start to finally solve (7) precisely for \(\lambda _{\textrm{QUT}}\). Calculating \(\lambda _{\textrm{QUT}}\) is also cost efficient (and highly parallelizable) since it is based on an M-sample Monte Carlo that calculates M gradients \(\{{ g}_0({ y}_k, { X}, {\varvec{\theta }}_0)\}_{k=1}^M\) using backpropagation (Rumelhart et al. 1986) for M Gaussian samples \(\{y_k\}_{k=1}^M\) under \(H_0\); see Theorems 1 and 2 for details. Using V-fold cross-validation instead would require solving  (7) a total of \(V*L\) times, where L is the number of \(\lambda \)’s visited until finding a (hopefully global) minimum to the cross-validation function. Using a validation set reduces complexity by a factor V, at the cost of using data to validate. Instead, our quantile universal threshold approach does not require a validation set.

The phase transition property achieved with LASSO ANN shows its stability in repeatedly identifying the same relevant features over many training sets. Yet, Bastounis et al. (2021a, 2021b) and Colbrook et al. (2022) have proved that stability and generalizability cannot coexist for ANN learners, including LASSO ANN, showing a limitation of the method.

3 Regression simulation study

The regression problem is model (1) for scalar output (\(m=1\)) with Gaussian additive noise of (unknown) standard deviation, here chosen \(\xi =1\). To evaluate the ability to retrieve needles in a haystack, the true association \(\mu \) is written as a sparse ANN that uses only s of the \(p_1\) entries of the input \(\textbf{x}\). We say an association \(\mu \) is s-sparse when it uses only s input entries, that is, \(s=|S|\) where \(S=\{j\mid x_j \text{ carries } \text{ information } \}\) in the association \(\mu \). A sparse ANN learner estimates which inputs are relevant by estimating the support with

$$\begin{aligned} {{\hat{S}}}=\{j \mid \Vert \hat{\textbf{w}}_{1,j}\Vert > \epsilon \}, \end{aligned}$$
(15)

where \(\hat{\textbf{w}}_{1,j}\) is the \(j\text {th}\) column of the estimated weights \({{\hat{W}}}_1\) at the first layer. Likewise for linear model (9), the support is estimated with \({{\hat{S}}}=\{j \mid {\hat{\beta }}_j \ne 0\}\).

Since we employ a precise thresholding algorithm to solve (7), we use \(\epsilon =0\) to determine \({{\hat{S}}}\) in (15); other methods apply a hard thresholding step with a second hyperparameter \(\epsilon \) to get rid of small values. Our method could be improved by using \(\epsilon \) as another hyperparameter, but our aim is to investigate a phase transition with LASSO ANN, so we consider a single hyperparameter \(\lambda \), and show that choosing \(\lambda =\lambda _{\textrm{QUT}}\) leads to a phase transition.
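
In code, the estimated support (15) with \(\epsilon =0\) reads (a small sketch):

```python
import numpy as np

def estimated_support(W1_hat, eps=0.0):
    # input j is selected when the j-th column of the fitted first-layer weights exceeds eps in norm
    norms = np.linalg.norm(W1_hat, axis=0)
    return {j for j in range(W1_hat.shape[1]) if norms[j] > eps}
```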

To quantify the performance of the tested methods, we use four criteria: the probability of exact support recovery \(\textrm{PESR}={{\mathbb {P}}}({{\hat{S}}}=S)\), the true positive rate \(\textrm{TPR}={{\mathbb {E}}}\left( \frac{|S \bigcap {{\hat{S}}}|}{|S|} \right) \), the false discovery rate \(\textrm{FDR}={{\mathbb {E}}}\left( \frac{| \bar{S} \bigcap {{\hat{S}}}|}{|{{\hat{S}}}| \vee 1} \right) \), and the generalization or predictive error \(\textrm{PE}^2={\mathbb E}(\mu (X)-{\hat{\mu }}(X))^2\). Although stringent, the PESR criterion reaches values near one in certain regimes. In fact, a phase transition has been observed for linear models: PESR is near one when the complexity parameter s is small, and PESR suddenly decreases to zero when s becomes larger (Candès and Tao 2005; Donoho 2006). One wonders whether this phenomenon is also present for nonlinear models, which we are investigating below. A high TPR with low FDR is also of interest, but is a criterion less strict than having high PESR.
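
The three support-recovery criteria can be estimated from repeated fits as follows (a sketch; the guards for empty supports are ours):

```python
import numpy as np

def support_metrics(S_true, S_hat_list):
    # Monte Carlo estimates of PESR, TPR and FDR over repeated training sets
    pesr = np.mean([S_hat == S_true for S_hat in S_hat_list])
    tpr = np.mean([len(S_true & S_hat) / max(len(S_true), 1) for S_hat in S_hat_list])
    fdr = np.mean([len(S_hat - S_true) / max(len(S_hat), 1) for S_hat in S_hat_list])
    return pesr, tpr, fdr
```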

We consider five learners: a standard ANN with keras available in TensorFlow (with its optimizer=‘sgd’ option) with no sparsity-inducing mechanism; spinn (sparse input neural networks) (Feng and Simon 2019) with sparsity mechanisms governed by two hyperparameters chosen on a validation set; spinn-dropout (whose Python code was kindly provided to us by the first author) (Ye and Sun 2018) with sparsity-inducing mechanisms (including dropout) governed by three hyperparameters chosen on a validation set; Bayesian neural networks (BNN) (Sun et al. 2021) driven by three hyperparameters; and our LASSO ANN with a sparsity-inducing penalty governed by a single hyperparameter chosen by the QUT principle (and no validation set required).

For LASSO ANN we use two- to four-layer ANNs with the arbitrary choices of \((p_2,p_3,p_4)=(20, 10, 5)\) (small values because the sample size is small, and a decreasing sequence because many practitioners choose such an architecture), activation function \(\sigma _{20,1,1}\) defined in (11) (\(M=20\) so that the approximation error relative to ReLU is negligible compared to the noise level \(\xi \)) and the \(\ell _1\)-LASSO penalty. spinn, spinn-dropout and BNN use ReLU.

The ReLU activation function allows a sparse representation of a linear association (Sect. 3.1) and of the nonlinear absolute value function (Sect. 3.2). With Monte-Carlo simulations to estimate PESR, TPR, FDR and PE in two different settings, we investigate the behavior of these four criteria as a function of the model complexity parameter s, for fixed sample size n and signal-to-noise ratio governed by \((\xi , \theta )\). The first simulation assumes a sparse linear association and compares LASSO ANN to the benchmark square-root LASSO for linear models. The second simulation assumes a sparse nonlinear association. These allegedly simple sparse associations reveal a phase transition in the ability of LASSO ANN to retrieve needles in haystacks in a more coherent way than with the more complex (i.e., more than one hyperparameter) spinn, spinn-dropout and BNN learners.

3.1 Linear associations

The linear model (9) is the most commonly used and studied model, so we investigate in this section how LASSO ANN compares to a state-of-the-art method for linear models, here square-root LASSO (Belloni et al. 2011) (using the slim function in the flare library in R). This allows us to investigate the impact of the loss of convexity for ANNs.

Assuming the linear association is s-sparse, this section compares the ability to retrieve the s relevant input entries assuming either a linear model (the benchmark) or a non-linear model using fully connected ANNs. The aim of the Monte Carlo simulation is to investigate:

  1. a phase transition with LASSO ANN and, if so, how close it is to the phase transition of square-root LASSO which, assuming a linear model, should be difficult to improve upon. We consider two selection rules for \(\lambda \) for square-root LASSO: QUT and using a validation set to minimize the predictive error.

  2. how the quantile universal threshold \(\lambda _{\textrm{QUT}}\) based on (14) performs for LASSO ANN with two, three and four layers.

  3. a phase transition with spinn and spinn-dropout. In an attempt to make them comparable to LASSO, we set their parameter controlling the trade-off between LASSO and group-LASSO to a small value so that their penalty is essentially LASSO’s. Like LASSO, spinn and spinn-dropout use a validation set to tune their hyperparameters. Results with their default values are not as good and are not reported here.

This experimental setting allows various interesting comparisons: linear versus nonlinear models to retrieve a linear model, and model selection- (QUT) versus validation set-based choice of the hyperparameter(s).

We estimate the PESR criterion of the three methods with a Monte-Carlo simulation with 100 repetitions. Each sample is generated from an s-sparse linear model with \(s\in \{0,1,2,\ldots ,16\}\); the sample size is \(n=100\) and the dimension of the input variables is \(p_1=2n\). Donoho and Tanner (2010) studied in the noiseless case the performance of \(\ell _1\)-regularization as a function of \(\delta =n/p_1\) and \(\rho =s/n\) (for us, \(\delta =1/2\) and \(\rho =s/100\)) and found a PESR phase transition. To be close to their setting, we assume the input variables are i.i.d. standard Gaussian with a moderate signal-to-noise ratio: the s non-zero linear coefficients \(\beta _j\) in (9) are all equal to 3 and the standard deviation of the Gaussian noise is \(\xi =1\). ANN models with ReLU fit linear models sparsely. Indeed a two-layer ANN with a single activated neuron with s non-zero entries in the weights \(W_1\) matches the linear function in the convex hull of the data, as stated in the following property.

Property 2

Using the ReLU activation function, an s-sparse linear function restricted to the convex hull of the n data vectors \(\{\textbf{x}_i\}_{i=1,\ldots ,n}\) can be written as a two-layer neural network with a single neuron with a row matrix \(W_1\) with s non-zero entries.

The proof of Property 2 is provided in the appendix. The convex hull includes the n observed covariates which enter the square-root \(\ell _2\)-loss in (7). So the sparsest two-layer ANN model that solves the optimization and that is a linear model in the convex hull of the data has a single neuron. But the ANN fit is no longer linear outside the convex hull, which makes the prediction error PE poor there; we therefore do not report PE for the linear setting since the ANN model will perform poorly for test data outside the convex hull of the training data.
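
For concreteness, one replicate of the linear simulation described above can be generated as follows (placing the s non-zero coefficients in the first s coordinates is our arbitrary choice):

```python
import numpy as np

def simulate_linear(n=100, s=4, beta=3.0, xi=1.0, seed=0):
    # i.i.d. N(0,1) inputs with p_1 = 2n, s non-zero coefficients equal to beta, noise sd xi
    rng = np.random.default_rng(seed)
    p1 = 2 * n
    X = rng.standard_normal((n, p1))
    theta1 = np.zeros(p1)
    theta1[:s] = beta
    y = X @ theta1 + xi * rng.standard_normal(n)
    return X, y, set(range(s))   # the last element returned is the true support S
```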

Figure 1 summarizes the results of the Monte-Carlo simulation. As in Donoho and Tanner (2010), we observe a PESR phase transition. Surprisingly, little is lost with LASSO ANN (red curve) compared to the linear model (black line), showing the good performance of our choice of \(\lambda _{\textrm{QUT}}\) and optimization scheme. The linear model based on a validation set (black dashed line) shows poor performance in terms of PESR, as expected. In summary, LASSO ANN compares surprisingly well to the benchmark linear square-root LASSO with QUT by not losing much in terms of PESR. The other three ANN learners spinn, spinn-dropout and BNN cannot directly be compared to the others since they are governed by more than one hyperparameter, but, while we observe good PESR for large s, their global behavior does not follow the conventional phase transition (that is, a high plateau near one for small s that rapidly drops to zero for larger s); the nonlinear simulation in the next section confirms their non-conventional behaviors. Going back to LASSO ANN, we observe on the right plot of Fig. 1 that using more layers slightly lowers the performance, as expected, but that the choice of \(\lambda _{\textrm{QUT}}\) for more layers still leads to a phase transition.

Fig. 1

Monte-Carlo simulation results for linear association plotting the estimated probability of exact support recovery (PESR). Left plot: the two black curves assume a linear model while the color curves assume an ANN model; the two blue lines (light for spinn and dark for spinn-dropout) and the grey line (BNN) are governed by more than one hyperparameter while the red line (LASSO ANN) is governed by a single hyperparameter; the two continuous lines (black for square-root LASSO linear and red for LASSO ANN) select the hyperparameter with QUT while the dashed lines require a validation set. Right plot: LASSO ANN with 2 to 4 layers with its hyperparameter based on QUT

Fig. 2

Monte-Carlo simulation results for nonlinear association plotting the estimated probability of exact support recovery (PESR, top left), generalization (PE, top right), true positive rate (TPR, bottom left) and false discovery rate (FDR, bottom right). The red curves are for LASSO ANN with its hyperparameter based on QUT with two (continuous) to four layers (dashed). The two blue lines (light for spinn and dark for spinn-dropout) and the grey line (BNN) are governed by more than one hyperparameter selected based on a validation set. The green curve is a standard ANN (without sparsity constraint)

3.2 Nonlinear associations

To investigate a phase transition as a function of s, we consider s-sparse functions of the form \( \mu _{\varvec{\theta }}(\textbf{x})= \sum _{i=1}^h 10\cdot |x_{2i}-x_{2i-1}| \) for \(h \in \{0,1,\ldots ,8 \}\), which corresponds to s needles in a nonlinear haystack with \(s\in \{ 0, 2, \ldots , 16 \}\). Because this association is harder to retrieve than the linear one (due to the non-monotone nature of the absolute value function), the haystack is of size \(p_1=50\) and the training set is of size \(n=500\). This ratio \(\delta =n/p_1=10\) seems to be the limit where needles can be recovered with LASSO ANN. The association \(\mu _{\varvec{\theta }}(\textbf{x})\) is well approximated by a sparse two-layer ANN employing the smooth activation function \(\sigma _{20,1,1}\) and with \(c=10s\), \(\textbf{w}_{2}=(10\cdot \textbf{1}_{h}^{\textrm{T}}, \textbf{0}_{p_2/2-h}^{\textrm{T}}, 10\cdot \textbf{1}_{h}^{\textrm{T}}, \textbf{0}_{p_2/2-h}^{\textrm{T}})\), \(\textbf{b}_{1}=-\textbf{1}_{p_2}\) and

(16)

The columns of \(W_1\) being sparse, a LASSO is more appropriate than a group-LASSO penalty.
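
For concreteness, one replicate of this nonlinear simulation can be generated as follows (the standard Gaussian inputs are an assumption consistent with the setting of Sect. 3.1):

```python
import numpy as np

def simulate_absdiff(n=500, p1=50, h=4, xi=1.0, seed=0):
    # mu(x) = sum_{i=1}^h 10 * |x_{2i} - x_{2i-1}|, i.e. s = 2h needles among p1 inputs
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p1))
    mu = 10.0 * np.abs(X[:, 1:2 * h:2] - X[:, 0:2 * h:2]).sum(axis=1)
    y = mu + xi * rng.standard_normal(n)
    return X, y, set(range(2 * h))   # inputs 0, ..., 2h-1 are the needles
```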

Table 1 Some data characteristics (results for data with \(\dagger \) are plotted in Fig. 3)
Fig. 3

Monte-Carlo simulation results based on four representative data sets, namely Breast, Wine, BCI\(\_\)2240 and Yeoh of Table 1. The left boxplots are the accuracy results and the right boxplots are the number of selected needles. The horizontal red line is the accuracy obtained by always predicting the most frequent class (that is, without looking at the inputs)

Figure 2 reports the estimated PESR, TPR, FDR and PE criteria as a function of the sparsity level s.

We observe that, as for linear models, LASSO ANN (red lines for two to four layers) has a PESR phase transition thanks to a good trade-off between high TPR and low FDR. Moreover, LASSO ANN generalizes better in this setting than the off-the-shelf ANN learner (green lines). The other three ANN learners spinn, spinn-dropout and BNN (light and dark blue, and grey respectively) perform poorly in terms of FDR, but somewhat better in terms of PESR thanks to more than one hyperparameter. The good FDR control of LASSO ANN is striking, in particular at \(s=0\) where its value is near \(\alpha =0.05\), as mathematically expected, proving the effectiveness not only of QUT but also of the optimization algorithm. Finally, as far as generalization is concerned, the sparsity-inducing learners perform better than the conventional ANN learner since the underlying ANN model is indeed sparse. Because LASSO ANN not only selects a sparse model but also shrinks, its predictive performance is not as good as that of spinn and spinn-dropout, whose regularization parameters are selected to generalize well, but LASSO ANN is better than BNN.

3.3 Conclusions of the Monte Carlo simulations

With a single hyperparameter, LASSO ANN has a phase transition for both linear and nonlinear associations and good FDR control, proving the effectiveness of our quantile universal threshold and optimization scheme. The linear simulation reveals that the impact of the loss of convexity is mild with LASSO ANN since we essentially get the same phase transition as with a linear model. The other ANN learners do not have a conventional phase transition and do not control FDR well; yet, with the help of more hyperparameters selected on a validation set, they are sometimes able to generalize well.

4 Application to real data

4.1 Classification data

The characteristics of 26 classification data sets are listed in Table 1, in particular the sample size n, the number of inputs \(p_1\) and the number of classes m. Most inputs are gene expressions, but there are also FFT preprocessed time series and other types of inputs.

We randomly split the data into training (70%) and test (30%) sets, repeating the operation 100 times. Figure 3 reports the results for four data sets chosen for their ratios \(n/p_1\) and their number of classes m (marked with a \(\dagger \) in Table 1). The left boxplots of Fig. 3 report classification accuracy, and the right boxplots report the number \({{\hat{s}}}\) of selected needles. High accuracy along with low \({{\hat{s}}}\) reflects good needles selection. The results of all 26 sets are summarized in the scatter plot of Fig. 4.

We train and test the following learners: LASSO GLM with \(\lambda \) chosen to minimize 10-fold cross-validation (Friedman et al. 2010) in R with glmnet, CART (Breiman et al. 1984) in R with rpart, random forest (Breiman 2001) in R with randomForest, spinn in Python for binary classification (no code for multiclass and for spinn-dropout is available), Bayesian neural networks (BNN), a standard ANN learner in Python with keras and its optimizer=‘adam’ option, and our LASSO ANN with two layers in Python. Random forest is an ensemble learner that combines CARTs; so the comparison between CART and random forest quantifies the ensembling effect, and the comparison between CART and LASSO ANN is fairer since neither is an ensemble learner.

Figure 4 visualizes the accuracy-sparsity trade-off by plotting accuracy versus \(({{\hat{s}}}+1)/(p_1+1)\) on a log-scale, so that both axes are on [0, 1]. Learners with points near (0, 1) offer the best trade-off. We were able to apply CART and BNN to only 22 and 17 data sets, respectively, because of memory issues when \(p_1\) is too large. Among all ANN-based learners (represented with “o”), LASSO ANN is clearly the best.

Fig. 4

Summary of Monte-Carlo results for all data sets of Table 1. The x-axis measures sparsity on a log-scale with \(({{\hat{s}}}+1)/(p_1+1)\) and the y-axis is accuracy. Ideal points are near the top-left corner of the figure

The main lesson of this experiment on real data sets is that LASSO ANN offers a good compromise between high accuracy and low number of selected needles. Yet, linear learners are difficult to beat when \(n/p_1\ll 1\), which corroborates our findings in regression that the sample size must be large to identify nonlinear associations.

4.2 Regression data

Bühlmann et al. (2014) reported genetic data measuring the expression levels of \(p_1= 4088\) genes on \(n= 71\) Bacillus subtilis bacteria. The logarithms of gene expression measurements are known to have some strongly correlated genes, which also makes selection difficult. The output is the riboflavin production rate of the bacteria. This is a high-dimensional setting in the sense that the training set is small compared to the size of the haystack. Generalization is not the goal here, but finding the informative genes; the scientific questions are: what genes affect the riboflavin production rate? Is the association linear?

The ground truth is not known here, but LASSO-zero, a conservative method with low false discovery rate (Descloux and Sardy 2021), selects genes 4003 and 2564. Standard LASSO (using cv.glmnet in R) selects 30 genes including 4003 and 2564. Using \(p_2=20\) neurons, LASSO ANN finds a single active neuron containing 9 non-zero parameters including genes 4003 and 2564. Feng and Simon (2019) report 45 important genes with spinn, and running spinn-dropout 100 times (randomly splitting into \(70\%\) training and \(30\%\) validation) we find an average of 6 genes (in which 4003 and 2564 are rarely present). BNN selects zero genes, in particular because the sample size here is really small. So the answers to the scientific questions are that few genes seem responsible for riboflavin production and that a linear model seems sufficient.

5 Conclusion

For finding needles in a nonlinear haystack, LASSO ANN, with a simple principle to select a single hyperparameter, achieves: (1) a phase transition in the probability of exact support recovery and good control of the false discovery rate; (2) a consistently good trade-off between generalization and a low number of selected needles, whether in regression, binary or multiclass classification, with various \(n/p_1\) ratios. This makes it a good candidate to discover important features without adding many spurious ones. Our empirical findings call for more theory to mathematically predict the regimes indexed by \((n,\textbf{p},s,\xi ,\theta ,\sigma )\) where feature recovery is highly probable. We also introduce a class of rescaled activation functions \(\sigma _{M,u_0,k}\) that can be employed in different neurons.

ANN models are widely used state-of-the-art black boxes. There is a keen interest, especially in scientific applications, in understanding the why of model predictions. Automatic feature selection via sparse encoding provides a path towards such an understanding.

Our work brings sparse encoding with LASSO ANN closer to practical applications. Its coherent PESR behavior and FDR control make it reliable for finding needles in nonlinear haystacks; it could also be used for other ANN tasks requiring sparsity, e.g., sparse auto-encoders or convolutional ANNs (He et al. 2020).

6 Reproducible research

Our codes are available at https://github.com/StatisticsL/ANN-LASSO, for BNN at https://github.com/sylydya/Consistent-Sparse-Deep-Learning-Theory-and-Computation, for SPINN binary classification at https://github.com/jjfeng/spinn.