A phase transition for finding needles in nonlinear haystacks with LASSO artificial neural networks

To fit sparse linear associations, a LASSO sparsity inducing penalty with a single hyperparameter provably allows to recover the important features (needles) with high probability in certain regimes even if the sample size is smaller than the dimension of the input vector (haystack). More recently learners known as artificial neural networks (ANN) have shown great successes in many machine learning tasks, in particular fitting nonlinear associations. Small learning rate, stochastic gradient descent algorithm and large training set help to cope with the explosion in the number of parameters present in deep neural networks. Yet few ANN learners have been developed and studied to find needles in nonlinear haystacks. Driven by a single hyperparameter, our ANN learner, like for sparse linear associations, exhibits a phase transition in the probability of retrieving the needles, which we do not observe with other ANN learners. To select our penalty parameter, we generalize the universal threshold of Donoho and Johnstone (Biometrika 81(3):425–455, 1994) which is a better rule than the conservative (too many false detections) and expensive cross-validation. In the spirit of simulated annealing, we propose a warm-start sparsity inducing algorithm to solve the high-dimensional, non-convex and non-differentiable optimization problem. We perform simulated and real data Monte Carlo experiments to quantify the effectiveness of our approach.


Introduction
Over the past 10 years, Artificial Neural Networks (ANNs) have become the model of choice for machine learning tasks in many modern applications. Although not completely From the point-of-view of approximation theory, ANNs approximate well smooth functions. For instance a single hidden layer neural net with a diverging number of neurons is dense in the class of compactly supported continuous functions (Cybenko 1989) and the first error rate derived (Barron 1993) motivates shallow learning (few layers) (Ravishankar et al. 2015;Kostadinov et al. 2018). Some results show that deep learning is superior to shallow learning in the sense that less parameters are needed to achieve the same level of accuracy for a smoothness and compositional class of functions, in which case deep learning avoids the curse of dimensionality; see Poggio et al. (2017) for a review. Grohs et al. (2019) prove that deep neural networks provide information-theoretically optimal approximation of a very wide range of functions used in signal processing. Chen and Chen (1995) and related papers extend the results to wider classes of functions. Approximation bound of sparse neural network, that is with bounded network connectivity, has been studied for instance by Bölcskei et al. (2019) who show a link between the degree of connectivity and the complexity of a function class. Adcock et al. (2022) and Adcock and Dexter (2021) show that deep ANNs compare well to benchmark compressed sensing methods with both exponential rates of convergence for analytic functions, using a penalized squareroot 2 -loss.
In machine learning, the success of ANNs is huge and, in part, can be attributed to their expressiveness or capacity (ability to fit a wide variety of functions). The very large number of parameters and the layer structure of ANNs make them impossible to interpret. ANNs are overparametrized with multiple distinct settings of the parameters leading to the same prediction. So traditional measures of model complexity based on the number of parameters do not apply. This makes understanding and interpreting the predictions challenging. Yet in scientific applications, one often seeks to do just that. In keeping with Occam's razor, among all the models with similar predictive capability, the one with the smallest number of features should be selected. Statistically, models with fewer features not only are easier to interpret but can produce predictors with good statistical properties because such models disregard useless features that contribute only to higher variance. Operationally, the model selection paradigm often uses a validation set or cross-validation (in which the data is randomly split, models are built on a training set and predictions are evaluated on a validation set). While conceptually elegant, (cross-)validation sets are of limited use if feature selection is of interest (it tends to select many irrelevant features) (Arlot and Celisse 2010), or if fitting a single model is computationally expensive. ANNs and in particular deep ANNs are computationally expensive to fit, so crossvalidation is an expensive way of selecting model complexity. Aiming at good predictive performance on a test set, also known as generalization, cross-validation is a poor feature selector as it tends to select too many features. In addition, quadratic prediction error from cross-validation exhibits an unexpected behavior with models of increasing complexity: as expected, the training error always decreases with increasing number of input features, but while the quadratic prediction error on the test set is at first U-shaped (initially decreasing thanks to decreasing bias, and then increasing due to an excess of variance), it then unexpectedly decreases a second time. This phenomenon known as double descent has been empirically observed (Advani et al. 2020;Geiger et al. 2019). For least squares estimation regularized by an 2 ridge penalty (Hoerl and Kennard 1970), double descent has been mathematically described for two-layer ANNs with random first-layer weights by Mei and Montanari (2021) and Hastie et al. (2019). They show that for high signal-to-noise ratio (SNR) and large sample size, high complexity is optimal for the ridgeless limit estimator of the weights, leading to a smooth and more expressive interpolating learner. In other words, interpolation is good and leads to double descent, which after careful thinking should not be a surprise since the interpolating ANN becomes smoother with increasing number of layers, and therefore better predicts between interpolated raining data. Indeed with high SNR, the signal is almost noiseless, so a smooth interpolating function of the training data shall perform well for future prediction. In noisy regimes (that is with low SNR and small sample size), Mei and Montanari (2021) observe that regularization is needed.
In this paper, we present an alternative to the use of a validation set geared towards identifying important features. Specifically, we develop an automatic feature selection method for simultaneous feature extraction and generalization. For ease of exposition, we present our novel method in the context of regression and classification, noting that the ideas can be ported beyond. Our approach exploits ideas from statistical hypothesis testing that directly focus on identifying significant features, and this without explicitly considering minimizing the generalization error. Similar ideas percolate the statistics literature, see for example Johnstone and Silverman (2004), Chen et al. (1999), Tibshirani (1996) with LASSO, Bühlmann and van de Geer (2011) who propose methods for finding needles in a haystack in linear models. In this context, the optimized criterion is not the prediction error, but is the ability to retrieve the needles (i.e., relevant features). Useful criteria include the stringent exact support recovery criterion, and softer criteria such as the false discovery rate (FDR) and true positive rate (TPR).
Of course some regularization methods have already been developed to enforce sparsity to the weights of ANNs. For example, dropout leaves out a certain number of neurons to prevent overfitting, which incidentally can be used to perform feature selection (Hinton et al. 2012;Srivastava et al. 2014). Sparse neuron architectures can be achieved by other means: Mollaysa et al. (2017) enforce sparsity based on the Jacobian and Li et al. (2016), Lee et al. (2006), Ranzato et al. (2007), Collins and Kohli (2014); Ma et al. (2019) employ 1 -based LASSO penalty to induce sparsity. Curci et al. (2021) prune their ANNs based on a metric for neuron importance. Evci et al. (2019) discuss the difficulty of training sparse ANNs. spinn (sparse input neural networks) (Feng and Simon 2019) have a sparsity inducing penalty and is governed by two hyperparameters chosen on a validation set; its improved version spinn-dropout (the former originally published in 2017) adds a dropout mechanism governed by an additional hyperparameter (Ye and Sun 2018). So spinn-dropout is a mix between 1 and 0 (subset selection) sparsity inducing method, similar to the pruning idea (Carreira-Perpinan and Idelbayev 2018; Chao et al. 2020). Sun et al. (2021) propose a Bayesian neural networks (BNN) learner. None of these learners have been studied through the prism of phase transition in the probability of retrieving features.
All of these sparsity inducing methods suffer from two drawbacks: (1) the selection of the penalty parameter(s) is often partly based on a validation set, therefore geared towards good generalization performance, not feature identification, and some hyperparameters are set to arbitrary values; (2) the ability to recover the "right" features has not been quantified through the prism of a phase transition in the probability of support recovery; spinn, spinn-dropout and BNN consider criteria related to FDR and TPR. This paper is organized as follows. Section 2 presents the theoretical framework and defines our LASSO ANN learner. Section 2.1 defines the statistical model and notation. Section 2.2 reviews the LASSO sparsity paradigm for linear models and extends it to ANNs. Section 2.3 discusses the choice of activation functions. Section 2.4 derives a selection rule for the penalty parameter, a generalization of the universal threshold (Donoho and Johnstone 1994) to non-convex optimization due to the nonlinearity of ANN models. Section 2.5 discusses optimization issues to solve the non-convex high-dimensional and non-differentiable optimization problem. Section 3 evaluates via simulations the ability of our method to exhibit a phase transition in the probability of exact support recovery for the regression task. Section 4 evaluates with a large number of real data sets the ability of our method to perform feature selection and generalization for the classification task. Section 5 summarizes the findings and points to future developments. Proofs and technical details are given in the Appendix.

Function estimation model and notation
Suppose n pairs of ouput-input data (Y, are collected to learn about their association. For example, in some medical applications (see Sect. 4.1), x ∈ R p 1 is an input vector of p 1 gene expressions and y is any of m cancer types that is coded as a one-hot output vector of R m ; classification aims at assigning the correct type of cancer given an input vector. In regression, y is a scalar (m = 1), for instance riboflavin production rate in a bacteria (see Sect. 4.2).
To model their stochastic nature, data can be modeled as realizations from the pair of random vectors (Y, X). We assume the real-valued response Y ∈ R m is related to realvalued feature vector X ∈ R p 1 through the conditional expectation for some unknown function μ : R p 1 → ⊆ R m . In regression, = R and in classification, = {π ∈ (R + ) m : m k=1 π k = 1}, where π k is the probability of belonging to class k.
Many learners have been proposed to model the association μ between input and output. A recent approach that is attracting considerable attention models μ as a standard fully connected ANN with l layers where θ are the parameters (see (5)) indexing the ANN, and letting u = x at the first layer, the nonlinear functions S k (u) = σ (b k +W k u) maps the p k ×1 vector u into a p k+1 ×1 latent vector obtained by applying an activation function σ component-wise, for each layer k ∈ {1, . . . , l − 1}. The vectors b k are commonly named "biases." The matrix of weights W k is p k+1 × p k and the operation + is the broadcasting operation.
The last layer k = l has two requirements. First we must have p l+1 = m to match the output dimension, so the last function is S l (u) = G(c + W l u) where W l is m × p l and the intercept vector c ∈ R m . Second the function G : R m → is a link function that maps R m into the parameter space . Commonly used link functions for classification are respectively called Softmax and multiclass-Logit. For regression, G(u) = u. The parameters indexing the neural network are therefore for a total of γ = l k=1 p k+1 ( p k + 1) parameters. The following property is straightforward to prove, but is crucial for our methodology; it is the reason for splitting θ into θ 1 and θ 2 .

Property 1 Assuming the activation function satisfies
We believe that only a few features in the p 1 -long input vector x carry information to predict the output. For many medical data treated in Sect. 4 for instance, the input is a vector of hundreds of gene expressions, and genetic aims to identify the ones having an effect on the output. So our main goal while estimating θ is to find needles in the haystack by selecting a subset of the p 1 -long inputs by setting some entries of θ 1 to zero. Feature selection has been extensively studied for linear associations, showing a phase transition between regimes where features can be retrieved with probability near one to regimes where the probability of retrieving the features is essentially zero. Our goal is to investigate such a phase transition with ANN learners to retrieve features in nonlinear associations.

Sparse estimation
Finding needles amounts to setting some weights to non-zero values corresponding to features in x that have predictive information. So we seek sparsity in the first layer on the weights W 1 . For the other layers, large weights in a layer could compensate small weights in the next layer, so we bound them by forcing unit 2 -norm; instead, Feng and Simon (2019) and Ye and Sun (2018) take the approach of a ridge penalty controlled by an additional hyperparameter fixed to the arbitrary value of 0.0001. Instead we slightly modify the nonlinear terms in (2) and define the jth nonlinear function S k, j in layer k as where w ( j) k is the jth row of W k . At the last layer (k = l), c plays the role of an intercept.
Sparsity in the first layer allows interpretability. To enforce sparsity and control overfitting, we take the approach inspired by LASSO of minimizing a compromise between a measure L n of closeness to the data and a measure of sparsity P. Owing to Property 1, we estimate the parameters θ = (θ 1 , θ 2 ) defined in (5) by aiming the best local minimum found by a numerical scheme, where λ > 0 is the regularization parameter of the procedure and P is sparsity-inducing penalty (Bach et al. 2011). We stress that our method is driven by a single regularization parameter λ, as opposed to other methods that use two or three hyperparameters (Ye and Sun 2018;Feng and Simon 2019;Sun et al. 2021).
Common loss functions between training responses Y and predicted values μ θ (X ) include: for m-class classification the cross-entropy loss L n (Y, μ θ (X )) = n i=1 y T i log μ θ (x i ), where the log function is applied component-wise to the m-long vectors y i and μ θ (x i ); for regression, we use L n (Y, μ θ (X )) = ( n i=1 (y i − μ θ (x i )) 2 ) 1/2 for reason that will become clear, although its squared version is more often used. A common sparsity-inducing penalty used by waveshrink (Donoho and Johnstone 1994) and LASSO (Tibshirani 1996) for q = 1 and group-LASSO (Yuan and Lin 2006) for q = 2 is where w 1, j is the jth column of W 1 . The choice q = 2 forces the jth feature to be either on or off across all neurons. The former is more flexible since a feature can be on in a neuron and off in another, so, in the sequel, we use q = 1. The reason for penalizing the biases is that the gradient of the loss function with respect to the biases evaluated at zero is zero and that the Hessian is only positive semi-definite (see Appendix C); so a local minimum would not be guaranteed without a penalty on the biases. ANNs are flexible in the sense that they can fit nonlinear associations. A more rigid and older class of models that has been extensively studied is the class of linear models where here the set of parameters θ = (β 1 , . . . , β p 1 , c) =: (θ 1 , c) is assumed s-sparse, that is only s entries in θ 1 are different from zero. Here again, like for W 1 in ANNs, a nonzero entry in θ 1 corresponds to an entry in the input vector x that is relevant to predict the response. For a properly chosen penalty parameter λ, LASSO has the remarkable property of retrieving the non-zero entries of θ 1 in certain regimes (that depend on n, p 1 , SNR, training locations X and amount s of sparsity), as studied in the noiseless and noisy scenarios by Candès and Tao (2005), Donoho (2006), Donoho et al. (2011), and Bühlmann and van de Geer (2011), for instance. In particular, the value of λ must bound the sup-norm of the gradient of the empirical loss at zero with high probability when θ 1 = 0 for LASSO to satisfy oracle inequalities. For linear models in wavelet denoising theory (Donoho and Johnstone 1994), this approach leads to an asymptotic minimax property.
Our contribution is to extend the linear methodology to ANNs, and to investigate how well our extension leads to a phase transition to discover nonlinear lower-dimensional structures in the data.

Choice of activation functions
Since the weights from level two and higher are bounded on the 2 -ball of unit radius (6), we require the activation function σ ∈ C 2 (R) to be unbounded. For reasons related to Property 1 and the choice of the hyperparameter λ based on the zero-thresholding function proportional to σ (0) (see (13) and (14)), we require σ (0) = 0 and σ (0) > 0. (10) The centered softplus function σ softplus (u) = log(1 + exp(u)) − log(2) for example satisfies this requirement. The ReLU (Rectified Linear Unit) function σ ReLU (u) = max(u, 0) does not because not differentiable at zero. A legitimate question for a statistician is to ask whether ANNs can retrieve interactions between covariates. Projection pursuit models (Friedman and Stuetzle 1981) have this ability, which additive models do not have. Thanks to ANNs property of being dense in smooth function spaces, the answer is yes, but with a large number of neurons when conventional activation functions like softplus and ReLU are used. With the family of activation functions now defined, a k-way interaction can be written with a sparse ANN.

Definition 1 The smooth activation rescaled dictionary is the collection of activation functions defined by
indexed by M > 0, u 0 > 0, k > 0. For u 0 = 1 the dictionary is rescaled in the sense that lim M→∞ σ M,u 0 ,k (0) = 1.
Suppose for instance the association is a single two-way interaction, that is μ(x) = x i x j for some pair (i, j), then with 6 neurons we have . When the ANN model employs both linear and quadratic smooth ReLU (shifted by one), selecting neurons with σ M,1,k with k = 2 and M large reveals potential interactions. The proposed activation functions have some basic properties. They satisfy requirements (10). For finite M, σ M,u 0 , j ∈ C ∞ . For (u 0 , k) = (0, 1), the family includes two important activation functions: softplus for M = 1 and ReLU as M tends to infinity. Using ReLU is prohibited with our method; if one likes the shape of ReLU, then one gets a smooth approximation of ReLU choosing a large M (say M = 20). Moreover since lim M→∞ σ M,u 0 ,k (0) = u k−1 0 , choosing u 0 = 1 scales activation functions across k's in the sense that their derivatives at zero are asymptotically (as M → ∞) equal to one for all k; this is a desired property in our methodology since the zero-thresholding functions (defined in Theorem 1 and derived in Theorem 3 below) are proportional to σ (0). Moreover, since sparsity is of interest, zero is a region where the cost function ought to be smooth for optimization purposes; hence choosing u 0 = 1 also makes the smoothness of the loss function bounded at zero since σ M,1,1 (0) = M/ exp(M) while σ M,0,1 (0) = M/4 (which reflects that ReLU is not differentiable at zero).

Selection of penalty
The proposed choice of λ is based on Property 1. It shows that fitting a constant function is achieved by choosing λ large enough to set the penalized parameters θ 1 to zero when solving the penalized cost function (7). For convex loss functions and linear models, the quantile universal threshold (Giacobino et al. 2017) achieves this goal with high probability under the null model that the underlying function is constant. This specific value λ QUT has good properties for model selection outside the null model as well (Donoho and Johnstone 1994;Donoho et al. 1995;Bühlmann and van de Geer 2011). The quantile universal threshold has so far been developed and employed for cost functions that are convex in the parameters, hence guaranteeing that any local minimum is also global. The cost function in (7) is not convex for ANN models, so we extend the quantile universal threshold by creating with high probability a local minimum at the sparse point of interest θ 1 = 0. This can be achieved thanks to the penalty term λ P(θ 1 ) that is part of the cost function in (7), provided λ is large enough. The following theorem derives an expression for the zero-thresholding function that gives the smallest λ that guarantees a minimum withθ 1 = 0.
Theorem 1 For given output-input data (Y, X ), consider the optimization problem (7) with P(θ 1 ) defined in (8) with q = 1, activation function σ ∈ C 2 (R) and loss function L n ∈ C 2 ( n ) such thatĉ = arg min c∈R m L n (Y, G(c)) exists. Let The proof of Theorem 1 is provided in the appendix; it can be made more general for q ≥ 1 using Hölder's inequality. The estimateĉ often has a closed form expression; in regression for instance, if the loss function between Y ∈ R n and c1 with c ∈ R and 1 ∈ R n is L n (Y, c1) = Y − c1 2 , then c =Ȳ, the average of the responses. Based on λ 0 (Y, X ), the following theorem extends the universal threshold to nonconvex cost functions.
Theorem 2 Given training inputs X , define the random set of outputs Y 0 generated from (1) with μ(X ) = θ (X ) defined in (2) for any activation function satisfying (10) under the null hypothesis H 0 : θ 1 = 0, that is H 0 : μ θ = c is a constant function. Letting the random variable = λ 0 (Y 0 , X ) and F be the distribution function of , the quantile universal threshold is λ QUT = F −1 (1 − α) for a small value of α. It satisfies that A proof of Theorem 2 can be found in Giacobino et al. (2017). The law of is unknown but can be easily estimated by Monte Carlo simulation, provided there exists a closed form expression for the zero-thresholding function The following theorem states a simple expression for λ 0 (Y, X ) in two important cases: classification and regression.

Theorem 3 Consider a fully connected l-layer ANN employing a differentiable activation function σ and let
Theorem 2 states that the choice of λ is simply an upper quantile of the random variable = λ 0 (Y 0 , X ), where Y 0 is the distribution of the response under the null distribution that θ 1 = 0. The upper quantile of can be easily estimated by Monte-Carlo simulation.
In regression and assuming Gaussian errors, the null distribution is Y 0 ∼ N(c1, ξ 2 I n ). Both the constant c and ξ 2 are unknown however, and ξ 2 is difficult to estimate in high dimension. Fortunately, one observes first that (14) involves only the mean centered responses Y • and therefore do not dependent on c. Second, both numerator and denominator are proportional to ξ . Consequently, is a pivotal random variable in the Gaussian case. Knowledge of c and ξ 2 are therefore not required to derive our choice of hyperparameter λ QUT . This well-known fact inspired by square-root LASSO (Belloni et al. 2011) motivates the use of L n = Y−μ θ (X ) 2 rather than L n = Y − μ θ (X ) 2 2 . In classification, the null distribution is Y 0 ∼ Multinomial (n, π = G(c)). The constant vector c is unknown and the random variable with λ 0 defined in (13) is not pivotal. Moreover Holland (1973) proved no covariance stabilizing transformation exists for the trinomial distribution. So the approach we take is to assume the training outputs Y reflect the proportion of classes in future samples seeking class prediction. So ifπ are the proportions of classes in the training set, then the null distribution is Y 0 ∼ Multinomial(n, π = π). The quantile universal threshold derived under this null hypothesis is appropriate if future data come from the same distribution, which is a reasonable assumption.

Computational cost for LASSO ANN
For a given λ, we solve (7) first by steepest descent with a small learning rate, and then employ a proximal method to refine the minimum by exactly setting to zero some entries ofŴ 1,λ QUT (Beck and Teboulle 2009;Bach et al. 2011).
Solving (7) directly for the prescribed λ = λ QUT risks getting trapped at some poor local minimum. Instead, inspired by simulated annealing and warm start, we avoid thresholding too hardly at first and possibly missing important features by solving (7) for an increasing sequence of λ's tending to λ = λ QUT , namely λ k+1 = exp(k)/(1 + exp(k))λ QUT for k ∈ {−1, 0, . . . , 4}. Taking as initial parameter values the solution corresponding to the previous λ k leads to a sequence of sparser approximating solutions until solving for λ QUT at the last step. We do not perform multiple starts.
The computational cost is low. It requires solving (7) approximately on the small grid of λ's tending to λ QUT using the warm start to finally solve (7) precisely for λ QUT . Calculating λ QUT is also cost efficient (and highly parallelizable) since it is based on an M-sample Monte Carlo that calculates M gradients {g 0 (y k , X , θ 0 )} M k=1 using backpropagation (Rumelhart et al. 1986) for M Gaussian samples {y k } M k=1 under H 0 ; see Theorems 1 and 2 for details. Using V -fold cross-validation instead would require solving (7) a total of V * L times, where L is the number of λ's visited until finding a (hopefully global) minimum to the cross-validation function. Using a validation set reduces complexity by a factor V , at the cost of using data to validate. Instead, our quantile universal threshold approach does not require a validation set.
The phase transition property achieved with LASSO ANN shows its stability to repeatedly identify the same relevant features over many training sets. Yet, Bastounis et al. (2021a, b) and Colbrook et al. (2022) have proved that stability and generability do not exist together for ANN as well as LASSO ANN learners, showing some limitation of the method.

Regression simulation study
The regression problem is model (1) for scalar output (m = 1), Gaussian additive noise and (unknown) standard deviation, here chosen ξ = 1. To evaluate the ability to retrieve needles in a haystack, the true associations μ is written as sparse ANNs that uses only s of the p 1 entries in the inputs x. We say an association μ is s-sparse when it uses only s input entries, that is s = |S| where S = {j | x j carries information} in the association μ. A sparse ANN learner estimates which inputs are relevant by estimating the support witĥ whereŵ 1, j is the jth column of the estimated weightsŴ 1 at the first layer. Likewise for linear model (9), the support is estimated withŜ = { j |β j = 0}.
Since we employ a precise thresholding algorithm to solve (7), we use = 0 to determineŜ in (15); other methods apply a hard thresholding step with a second hyperparameter to get rid of small values. Our method could be improved by using as another hyperparameter, but our aim is to investigate a phase transition with LASSO ANN, so we consider a single hyperparameter λ, and show that choosing λ = λ QUT leads to a phase transition.
To quantify the performance of the tested methods, we use four criteria: the probability of exact support recovery PESR = P(Ŝ = S), the true positive rate TPR = E

|S Ŝ | |Ŝ|∨1
, and the generalization or predictive error PE 2 = E(μ(X ) −μ(X )) 2 . Although stringent, the PESR criterion reaches values near one in certain regimes. In fact, a phase transition has been observed for linear models: PESR is near one when the complexity parameter s is small, and PESR suddenly decreases to zero when s becomes larger (Candès and Tao 2005;Donoho 2006). One wonders whether this phenomenon is also present for nonlinear models, which we are investigating below. A high TPR with low FDR is also of interest, but is a criterion less strict than having high PESR. We consider five learners: a standard ANN with keras available in TensorFlow (with its optimizer='sgd' option) with no sparsity inducing mechanism; spinn (sparse input neural networks) (Feng and Simon 2019) with sparsity mechanisms governed by two hyperparameters chosen on a validation set; spinn-dropout (which Python code was kindly provided to us by the first author) (Ye and Sun 2018) with sparsity inducing mechanisms (including dropout) governed by three hyperparameters chosen on a validation set; Bayesian neural networks (BNN) (Sun et al. 2021) driven by three hyperparameters; and our LASSO ANN with a sparsity inducing penalty governed by a single hyperparameter chosen by the same QUT principle (and no validation set required).
For LASSO ANN we use two to four-layer ANNs with the arbitrary choices of ( p 2 , p 3 , p 4 ) = (20, 10, 5) (small values because the sample size is small, and decreasing sequence because many practitioners choose such an architecture), activation function σ 20,1,1 defined in (11) (M = 20 to have an approximation error of ReLU negligible compared to the noise level ξ ) and the 1 -LASSO penalty. spinn, spinn-dropout and BNN use ReLU.
The ReLU activation function allows to sparsely write a linear association (Sect. 3.1) and the nonlinear absolute value function (Sect. 3.2). With Monte-Carlo simulations to estimate PESR, TPR, FDR and PE in two different settings, we investigate the behavior of these four criteria as a function of the model complexity parameter s, for fixed sample size n and signal to noise ratio governed by (ξ, θ ). The first simulation assumes a sparse linear association and compares LASSO ANN to the benchmark square-root LASSO for linear models. The second simulation assumes a sparse nonlinear association. These allegedly simple sparse associations reveal a phase transition in the ability of LASSO ANN to retrieve needles in haystacks in a more coherent way than with the more complex (i.e., more than one hyperparameter) spinn, spinn-dropout and BNN learners.

Linear associations
The linear model (9) is the most commonly used and studied model, so we investigate in this section how LASSO ANN compares to a state-of-the-art method for linear models, here square-root LASSO (Belloni et al. 2011) (using the slim function in the flare library in R). This allows to investigate the impact of the loss of convexity for ANNs.
Assuming the linear association is s-sparse, this section compares the ability to retrieve the s relevant input entries assuming either a linear model (the benchmark) or a nonlinear model using fully connected ANNs. The aim of the Monte Carlo simulation is to investigate: 1. a phase transition with LASSO ANN and if so, how close it is to the phase transition of square-root LASSO which, assuming a linear model, should be difficult to improve upon. We consider two selection rules for λ for squareroot LASSO: QUT and using a validation set to minimize the predictive error. 2. how the quantile universal threshold λ QUT based on (14) performs for LASSO ANN with two, three and four layers. 3. a phase transition with spinn and spinn-dropout.
In an attempt to make them comparable to LASSO, we set their parameter controlling the trade-off between LASSO and group-LASSO to a small value so that their penalty is essentially LASSO's. Like LASSO, spinn and spinn-dropout use a validation set to tune their hyperparameters. Results with their default values are not as good and not reported here.
This experimental setting allows various interesting comparisons: linear versus nonlinear models to retrieve a linear model, and model selection-(QUT) versus validation setbased choice of the hyperparameter(s).
We estimate the PESR criterion of the three methods with a Monte-Carlo simulation with 100 repetitions. Each sample is generated from an s-sparse linear model with s ∈ {0, 1, 2, . . . , 16}, the sample size is n = 100 from and the dimension of input variables is p 1 = 2n. Donoho and Tanner (2010) studied in the noiseless case the performance of 1 -regularization as a function of δ = n/ p 1 and ρ = s/n (for us, δ = 1/2 and ρ = s/100) and found a PESR phase transition. To be close to their setting, we assume the input variables are i.i.d. standard Gaussian with a moderate signalto-noise ratio: the s non-zero linear coefficients β j in (9) are all equal to 3 and the standard deviation of the Gaussian noise is ξ = 1. ANN models with ReLU fits linear models sparsely. Indeed a two-layer ANN with a single activated neuron with s non-zero entries in the weights W 1 matches the linear function in the convex hull of the data, as stated in the following property.

n can be written as a two-layer neural network with a single neuron with a row matrix W 1 with s non-zero entries.
The proof of Property 2 is provided in the appendix. The convex hull includes the n observed covariates which enter the square-root 2 -loss in (7). So the sparsest two-layer ANN model that solves the optimization and that is a linear model in the convex hull of the data has a single neuron. But the ANN fit is no longer linear outside the convex hull, which makes prediction error PE poor outside the convex hull range of the data; we therefore do not report PE for the linear model since the ANN model will have poor performance for test data outside the convex hull of the training data. Figure 1 summarizes the results of the Monte-Carlo simulation. As in Donoho and Tanner (2010), we observe a PESR phase transition. Surprisingly, little is lost with LASSO ANN (red curve) compared to linear model (black line), showing the good performance of our choice of λ QUT and optimization scheme. The linear model based on a validation set (black dashed line) shows poor performance in terms of PESR, as expected. In summary, LASSO ANN compares surprisingly well to the benchmark linear square-root LASSO with QUT by not losing much in terms of PESR. The other three ANNs learners spinn, spinn-dropout and BNN cannot directly be compared to the other since governed by more than one hyperparameter, but, while we observe good PESR for s large, their global behavior does not follow the conventional phase transition (that is, no high plateau near one for small s and rapidly dropping down to zero with larger s); the nonlinear simulation in the next section confirms their non-conventional behaviors. Going back to LASSO ANN, we observe on the right plot of Fig. 1 that using more layers slightly lowers the performance, as expected, but that the choice of λ QUT for more layers still leads to a phase transition.

Nonlinear associations
To investigate a phase transition as a function of s, we consider s-sparse functions of the form {0, 1, . . . , 8}, which corresponds to s needles in a nonlinear haystack with s ∈ {0, 2, . . . , 16}. Because this association is harder to retrieve than the linear one (due to the non-monotone nature of the absolute value function), the haystack is of size p 1 = 50 and the training set is of size n = 500. This ratio δ = n/ p 1 = 10 seems to be the limit where needles can be recovered with LASSO ANN. The association μ θ (x) is well approximated by a sparse twolayer ANN employing the smooth activation function σ 20,1,1 and with c = 10s, The columns of W 1 being sparse, a LASSO is more appropriate than a group-LASSO penalty. Figure 2 reports the estimated PESR, TPR, FDR and PE criteria as a function of the sparsity level s.
We observe that, as for linear models, LASSO ANN (red lines for two to four layers) has a PESR phase transition thanks to a good trade-off between high TPR and low FDR. Moreover LASSO ANN generalizes better in this setting than the off-the-shelf ANN learner (green lines). The other three ANNs learners spinn , spinn-dropout and BNN (light and dark blue, and grey respectively) perform poorly is terms of FDR, but somewhat better in terms of PESR thanks to more than one hyperparameter. The good FDR control of LASSO ANN is striking, in particular at s = 0 where its value is near α = 0.05, as mathematically expected, proving the effectiveness of not only QUT but also of the optimization algorithm. Finally, as far as generalization is concerned, the sparsity inducing learners perform better than the conventional ANN learner since the underlying ANN model is indeed sparse. Because LASSO ANN not only selects a sparse model but  also shrinks, its predictive performance is not as good as with spinn and spinn-dropout which regularization parameters are selected to generalize well, but LASSO ANN is better than BNN.

Conclusions of the Monte Carlo simulations
With a single hyperparameter, LASSO ANN has a phase transition for both linear and nonlinear associations and a good FDR control, proving the effectiveness of our quantile universal threshold and optimization scheme. The linear simulation reveals that the impact of the loss of convexity is mild with LASSO ANN since we essentially get the same phase transition as with a linear model. The other ANN learners do not have a conventional phase transition and do not control FDR well; yet, with the help of more hyperparameters selected based on a validation set, they are able to sometimes generalize well.

Classification data
The characteristics of 26 classification data sets are listed in Table 1, in particular the sample size n, the number of inputs p 1 and the number of classes m. Most inputs are gene expressions, but there are also FFT preprocessed time series and other types of inputs. We randomly split the data into training (70%) and test (30%) sets, repeating the operation 100 times. Figure 3 reports the results for four data sets chosen for their ratios n/ p 1 and their number of classes m (marked with a † in Table 1). The left boxplots of Fig. 3 report classification accuracy, and the right boxplots report the numberŝ of selected needles. High accuracy along with lowŝ reflects good needles selection. The results of all 26 sets are summarized in the scatter plot of Fig. 4.
We train and test the following learners: LASSO GLM with λ chosen to minimize 10-fold cross validation (Friedman et al. 2010) in R with glmnet, CART (Breiman  Table 1. The left boxplots are the accuracy results and the right boxplots are the number of selected needles. The horizontal red line is the accuracy by always predicting the most frequent class (that is, without looking at the inputs) et al. 1984) in R with rpart, random forest (Breiman 2001) in R with randomForest, spinn in Python for binary classification (no code for multiclass and for spinn-dropout available), Bayesian neural networks (BNN), standard ANN learner in Python with keras and its optimizer='adam' option, and our LASSO ANN with two layers in Python. Random forest is an ensemble learner that combines CARTs; so the comparison between CART and random forest quantifies the ensembling effect, and the comparison between CART and LASSO ANN is more fair since both are no ensemble learners. Figure 4 visualizes the accuracy-sparsity trade-off by plotting accuracy versus (ŝ + 1)/( p 1 + 1) on a log-scale, so that both axes are on [0, 1]. Learners with points near (0, 1) offer the best trade-off. We were able to apply CART and BNN to only 22 and 17 data sets, respectively, because of memory issues when p 1 is too large. Among all ANN-based learners (represented with "o"), LASSO ANN is clearly the best.
The main lesson of this experiment on real data sets is that LASSO ANN offers a good compromise between high accuracy and low number of selected needles. Yet, linear learners are difficult to beat when n/ p 1 1, which corroborates our findings in regression that the sample size must be large to identify nonlinear associations. Bühlmann et al. (2014) reported genetic data measuring the expression levels of p 1 = 4088 genes on n = 71 Bacillus subtilis bacteria. The logarithms of gene expression  Table 1. The x-axis measures sparsity on a log-scale with (ŝ + 1)/( p 1 + 1) and the y-axis is accuracy. Ideal points are near the top-left corner of the figure measurements are known to have some strongly correlated genes, which also makes selection difficult. The output is the riboflavin production rate of the bacteria. This is a highdimensional setting in the sense that the training set is small compared to the size of the haystack. Generalization is not the goal here, but finding the informative genes; the scientific questions are: what genes affect the riboflavin production rate? Is the association linear? The ground truth is not known here, but LASSO-zero, a conservative method with low false discovery rate (Descloux and Sardy 2021), selects genes 4003 and 2564. Standard LASSO (using cv.glmnet in R) selects 30 genes including 4003 and 2564. Using p 2 = 20 neurons, LASSO ANN finds a single active neuron containing 9 non-zero parameters including genes 4003 and 2564. Feng and Simon (2019) reports 45 important genes with spinn, and running spinn-dropout 100 times (randomly splitting into 70% training and 30% validating) we find an average of 6 genes (in which 4003 and 2564 are rarely present). BNN selects zero genes, in particular because the sample size here is really small. So the answers to the scientific questions are that few genes seem responsible for riboflavin production and that a linear model seems sufficient.

Conclusion
For finding needles in a nonlinear haystack, LASSO ANN, with a simple principle to select a single hyperparameter, achieves: (1) a phase transition in the probability of exact sup-port recovery and controls well the false discovery rate; (2) a consistent good trade-off between generalization and low number of selected needles whether in regression, binary or multiclass classification with various n/ p 1 ratios. This makes it a good candidate to discover important features without adding many spurious ones. Our empirical findings call for more theory to mathematically predict the regimes indexed by (n, p, s, ξ, θ, σ ) where feature recovery is highly probable. We also introduce a class of rescaled activation functions σ M,u 0 ,k that can be employed in different neurons.
ANN models are widely used state-of-the-art black boxes. There is a keen interest, especially in scientific applications, to understand the why of model predictions. Sparse encoding automatic feature selection provides a path towards such an understanding.
Our work makes sparse encoding with LASSO ANN closer to practical applications. Its coherent PESR behavior and FDR control make it reliable for finding needles in nonlinear haystacks, but could also be used for other ANN tasks requiring sparsity, e.g., sparse auto-encoding or convolutional ANN (He et al. 2020).

Funding Open access funding provided by University of Geneva
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.
Since the loss function l is twice differentiable with respect to W 1 around θ 0 , applying Taylor's theorem we have If we assume that (λ − supθ g 0 (Y, X , θ 0 ) ∞ ) > C > 0 (since we normalize layers 2 to l, the supremum is finite), it follows that the regularized loss satisfies Thus the cost function (7) with our choice of λ indeed has a local minimum at θ 0 .

C Hessian matrix
The notation is that W k;:i be the i-th column of W k and W k;h: is the h-th row of W k . We consider a 3-layer network for two tasks: • Regression with the square-root 2 -loss L n (Y, μ θ (X )) = n k=1 (y k − (c + w 3 σ (b 2 + W 2 σ (b 1 + W 1 x k )))) 2 =: n k=1 G 2 k .
• Classification with cross-entropy loss  W 1 x k )) .

D Proof of Property 2
Let U ⊂ R be any subset of R. For all u ∈ U, choosing b = max u∈U −u = − min(U), we have u = σ ReLU (u + b) − b. Consider the data matrix X , u = X β ∈ R n and b = − min i=1,...,n u i < ∞. Then for any point x ∈ R p 1 in the convex hull of the n data vectors {x i } i=1,...,n , we have b + x T β = n i=1 λ i (b + x T i β) > 0, so a linear function μ lin θ (x) = β 0 + x T β = β 0 + σ ReLU (x T β + b) − b can be written as an ANN with a single neuron.