1 Introduction

Nowadays, there is a growing interest in merging techniques from the areas of Statistics and Machine Learning to leverage the advantages offered by each approach. One illustrative example of this convergence is found in the area of Statistical Item Response Theory, particularly in the context of computerized adaptive tests [1, 2]. Yan et al. [3] and, later, Ueno and Songmuang [4] introduced decision trees as an alternative to computerized adaptive tests. Subsequently, Delgado-Gómez et al. [5] established a mathematical equivalence between these two techniques, enabling the real-time administration of computerized adaptive tests with computationally intensive item selection criteria. More recently, researchers have explored the application of neural networks in this field [6, 7].

Recent studies highlight the synergies emerging between the fields of Statistics and Neural Networks [8, 9]. Representing statistical models through neural networks provides them with the flexibility and optimization capabilities inherent in neural networks. In a previous pilot study, Laria et al. [10] showed how the least absolute shrinkage and selection operator (lasso) algorithm can be represented as a neural network framework. Other works have also been published relating neural networks and lasso [11,12,13]. However, these works have focused on obtaining sparse representations of neural networks including lasso elements. Conversely, linking neural networks with statistical models enhances the interpretability of the neural models [14]. Such synergies have manifested across various statistical domains, including regression, dimensionality reduction, time series analysis, and quality control [15].

This article focuses on the adaptation of the widely used lasso algorithm to the context of neural networks. To this end, Sect. 2 provides an overview of the essential features of the lasso algorithm to lay the foundation for understanding its neural version. Section 3 contains the contribution of this work. It extends the mathematical formulation proposed by Laria et al. [10] and redefines the optimization process for both linear and logistic regression. Specifically, three novel optimization procedures are developed. The first one, called standard neural lasso, estimates the weights of the network in the usual way in which neural networks are trained. The second one, called restricted neural lasso, aims at reproducing the lasso by making certain weights non-trainable. Finally, a third algorithm, called voting neural lasso, redefines the way in which the lasso is formulated with the aim of improving its performance. Section 4 presents a series of experiments designed to assess the performance of the neural versions and compare them with their statistical counterpart, utilizing both real and simulated datasets. Finally, Sect. 5 concludes the article with a discussion of the obtained results and outlines future research directions.

2 The lasso

In the following, the lasso algorithm is briefly presented, highlighting the elements most relevant to our proposal. Hereafter, the lasso algorithm will be referred to as statistical lasso to differentiate it from its neural versions throughout the article.

2.1 Formulation

Let \((\varvec{x}_i, y_i)\), \(i=1, \dots , N\), be a set containing N observations where \(\varvec{x}_i \in \mathbb {R}^p \) represents the predictors, and \(y_i \in \mathbb {R}\) are the associated responses. It is assumed that the predictors are standardized and the responses are centered, i.e.,

$$\begin{aligned} \sum _{i=1}^N x_{ij} = 0, \quad \sum _{i=1}^N x_{ij}^2 = 1, \quad \sum _{i=1}^N y_i = 0, \quad \text {for } j=1,2, \dots , p \end{aligned}$$
(1)

The lasso technique was introduced for generalized linear models in the supervised context by Tibshirani [16]. It is formulated as the following optimization problem

$$\begin{aligned} \hspace{3pt}\underset{\varvec{\beta }}{{{\,\textrm{argmin}\,}}}\, \mathcal {R}( \varvec{y}, \varvec{X} \varvec{\beta }) + \lambda \bigl \Vert \varvec{\beta }\bigr \Vert _1 \end{aligned}$$
(2)

where \(\varvec{X}\) is the (standardized) matrix that contains the observations as rows, \(\varvec{y}\) is the vector with the corresponding labels, \(\varvec{\beta }\) is the vector containing the weights of the regression, and \(\lambda \bigl \Vert \varvec{\beta }\bigr \Vert _1\) is a penalization term. \(\mathcal {R} (\varvec{y}, \varvec{X} \varvec{\beta })\) represents the error term. In this work, we will focus on linear and logistic regression. For linear regression, the error term is given by

$$\begin{aligned} \mathcal {R}_{Lin}(\varvec{y}, \varvec{X} \varvec{\beta })=\frac{1}{N}\sum _{i=1}^N (y_i-\textbf{x}_i^t \varvec{\beta })^2 \end{aligned}$$
(3)

while the error term for the logistic regression is given by:

$$\begin{aligned} \mathcal {R}_{Log}(\varvec{y}, \varvec{X} \varvec{\beta })= \frac{1}{N} \sum _{i=1}^N\left[ \log (1+e^{\textbf{x}_i^t \varvec{\beta }})- y_i \textbf{x}_i^t \varvec{\beta }\right] \end{aligned}$$
(4)

2.2 Optimization

Given a fixed \(\lambda \), the values of \(\varvec{\beta }\) are estimated using coordinate descent. As an example, the coordinate descent update for the \(j^{th}\) coefficient in the linear regression case is given by

$$\begin{aligned} \hat{\beta }_j= \mathcal {S}_{\lambda } \left( \frac{1}{N} \langle \textbf{X}_j, \textbf{r}_j \rangle \right) \end{aligned}$$
(5)

where \(\textbf{X}_j\) is the \(j^{th}\) column of matrix \(\textbf{X}\), the \(i^{th}\) component of \(\textbf{r}_j\) is obtained by

$$\begin{aligned} \textbf{r}_j(i)=y_i - \sum _{k \ne j} x_{ik} \hat{\beta }_k \end{aligned}$$
(6)

and \(\mathcal {S}_{\lambda }\) is the soft-thresholding operator defined by

$$\begin{aligned} \mathcal {S}_{\lambda }(x)={{\,\textrm{sign}\,}}(x)(|x|-\lambda )_{+} \end{aligned}$$
(7)

The optimal value of \(\lambda \) is obtained through a k-fold cross-validation. A more detailed discussion of the lasso optimization can be found in the book by Hastie et al. [17]. A schematic representation of the lasso optimization algorithm is shown in the upper panel of Fig. 3.
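As an illustration, the coordinate descent updates (5)-(7) can be sketched in a few lines of NumPy. This is a didactic sketch, not the glmnet implementation; the function names `soft_threshold` and `lasso_cd` are ours, and glmnet-style column scaling \(\frac{1}{N}\sum _{i=1}^N x_{ij}^2 = 1\) with centered responses is assumed so that the update of Eq. (5) applies directly.

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lambda(x) = sign(x) * (|x| - lambda)_+ ,  Eq. (7)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the lasso, Eqs. (5)-(6).

    Assumes columns of X scaled so that (1/N) * sum_i x_ij^2 = 1
    and y centered."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual r_j leaves predictor j out, Eq. (6)
            r_j = y - X @ beta + X[:, j] * beta[j]
            # coordinate update of Eq. (5)
            beta[j] = soft_threshold(X[:, j] @ r_j / N, lam)
    return beta
```

For an orthogonal design this reduces to soft-thresholding the univariate least-squares coefficients, which makes the shrinkage and variable-selection behavior of the operator easy to verify by hand.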

3 The neural lasso

In this section, the formulation and optimization of the neural lasso is presented.

3.1 Formulation

Here, we introduce the neural adaptation of the lasso, commencing with the mathematical formulation for linear regression and subsequently extending it to encompass logistic regression.

Linear regression

When the error term is given by the mean squared error (MSE), lasso can be characterized as the neural network shown in Fig. 1. In this case, the loss function is given by

Fig. 1 Neural representation of lasso for linear regression

$$\begin{aligned} \begin{aligned} \mathcal {L}(\varvec{w})&= \dfrac{1}{N} \sum _{i=1}^{N} \Biggl ( y_i - \gamma \sum _{j=1}^{p} x_{ij} w_j \Biggr )^2 + \ell _{1} \sum _{j=1}^p \vert w_j \vert \\&= \dfrac{1}{N}\Vert \textbf{y}- \gamma \textbf{X} \varvec{w}\Vert ^{2}_{2} + \ell _{1} \Vert \varvec{w}\Vert _1 \end{aligned} \end{aligned}$$
(8)

where \((\varvec{w},\gamma )\) are the parameters of the network, and \(\ell _1\) is a regularization hyper-parameter. Notice that, by making \(\varvec{\beta }=\gamma \varvec{w}\) and \(\lambda =\frac{\ell _{1}}{\gamma }\), Eq. (8) is equivalent to Eq. (2) using MSE as error term.
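This equivalence can be checked numerically. The following NumPy sketch (function names are ours) evaluates the network loss of Eq. (8) and the lasso objective of Eqs. (2)-(3); under \(\varvec{\beta }=\gamma \varvec{w}\) and \(\lambda =\ell _{1}/\gamma \) the two values coincide for any \(\gamma > 0\).

```python
import numpy as np

def neural_lasso_loss(w, gamma, X, y, l1):
    # Eq. (8): MSE of the one-layer network plus an l1 penalty on w
    resid = y - gamma * (X @ w)
    return np.mean(resid ** 2) + l1 * np.sum(np.abs(w))

def statistical_lasso_loss(beta, X, y, lam):
    # Eq. (2) with the MSE error term of Eq. (3)
    resid = y - X @ beta
    return np.mean(resid ** 2) + lam * np.sum(np.abs(beta))
```

For instance, with \(\gamma = 2\) and \(\ell _1 = 0.3\), the neural loss at \(\varvec{w}\) equals the statistical loss at \(\varvec{\beta } = 2\varvec{w}\) with \(\lambda = 0.15\).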

It is important to note that, unlike the statistical lasso, neural network optimization does not set the weights exactly to zero. Therefore, it is necessary to establish a condition that determines which weights are zero after each training epoch, and to set them to this value. To do this, we calculate the derivative of the loss function defined in Eq. (8) with respect to \(w_j\)

$$\begin{aligned} \dfrac{\partial \mathcal {L}(\varvec{w})}{\partial w_j} =\dfrac{-2 \gamma }{N} \sum _{i=1}^{N} \Biggl ( y_i - \gamma \sum _{k=1}^{p} x_{ik} w_{k} \Biggr ) x_{ij} +\ell _1 s_j \end{aligned}$$
(9)

where the term \(s_j\) is the subgradient defined by

$$\begin{aligned} s_j =\left\{ \begin{array}{ll} 1 &{} w_j >0\\ -1 &{} w_j <0\\ \left[ -1,1\right] &{} w_j=0 \end{array} \right. . \end{aligned}$$
(10)

Equation (9) can be rewritten as

$$\begin{aligned} \dfrac{\partial \mathcal {L}(\varvec{w})}{\partial w_j}&=\dfrac{-2 \gamma }{N} \Biggl ( \sum _{i=1}^{N} y_i x_{ij}- \gamma \sum _{i=1}^{N} x_{ij} \sum _{k\ne j} x_{ik}w_k \nonumber \\&\quad -\gamma w_j \sum _{i=1}^{N} x_{ij}^2 \Biggl ) + \ell _1 s_j \end{aligned}$$
(11)

and, equivalently, in vector form

$$\begin{aligned} \dfrac{\partial \mathcal {L}(\varvec{w})}{\partial w_j}=\dfrac{-2 \gamma }{N} \Bigl ( \textbf{X}_j^t \textbf{y}- \gamma \textbf{X}_j^t \textbf{X} \varvec{w}^*_j -\gamma w_j \Bigr )+\ell _1 s_j \end{aligned}$$
(12)

where \(\textbf{X}_j^t\) is the transpose of the \(j^{th}\) column of matrix \(\textbf{X}\) (containing the observations as rows) and \(\varvec{w}^*_j\) is the vector \(\varvec{w}\) with the \(j^{th}\) component equal to 0. To obtain the above expression, it has been taken into account that \(\sum _{i=1}^N x_{ij}^2=1\) since the predictors are standardized, as stated in Eq. (1).

Equating the derivative to 0 leads to

$$\begin{aligned} w_j=\dfrac{\dfrac{2}{N} \gamma \textbf{X}_j^t \Biggl ( \textbf{y}- \gamma \textbf{X} \varvec{w}^*_j \Biggr ) - \ell _1 s_j }{\dfrac{2}{N} \gamma ^2} \end{aligned}$$
(13)

from which it follows that

$$\begin{aligned} \small w_j^{op} = \left\{ \begin{array}{ll} \frac{\dfrac{2}{N} \gamma \textbf{X}_j^t \Biggl ( \textbf{y}- \gamma \textbf{X} \varvec{w}^*_j \Biggr ) - \ell _1 }{\dfrac{2}{N} \gamma ^2} &{} \text { if } \dfrac{2}{N}\gamma \textbf{X}_j^t \Biggl ( \textbf{y}- \gamma \textbf{X} \varvec{w}^*_j \Biggr ) > \ell _1 \\ \frac{\dfrac{2}{N} \gamma \textbf{X}_j^t \Biggl ( \textbf{y}- \gamma \textbf{X} \varvec{w}^*_j \Biggr ) + \ell _1 }{\dfrac{2}{N} \gamma ^2}, &{} \text { if } \dfrac{2}{N}\gamma \textbf{X}_j^t \Biggl ( \textbf{y}- \gamma \textbf{X} \varvec{w}^*_j \Biggr ) < -\ell _1 \\ 0 &{} \text { if } \left| \dfrac{2}{N} \gamma \textbf{X}_j^t \Biggl ( \textbf{y}- \gamma \textbf{X} \varvec{w}^*_j \Biggr ) \right| \le \ell _1 \end{array} \right. \nonumber \\ \end{aligned}$$
(14)
Fig. 2 Neural representation of lasso for logistic regression

Note that unlike lasso, which requires all three updates from Eq. (14), neural lasso only relies on the last condition to zero out weights. This is because the weight updates happen implicitly during network training. In essence, after each training epoch, the network assesses whether any weights can be set to zero by verifying if the last condition of Eq. (14) is met based on the current estimates. This distinction becomes particularly important in the context of logistic regression, as we will see later on.
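The post-epoch pruning step can be sketched as follows (NumPy; the function name is ours): each weight is tested against the last condition of Eq. (14), with the remaining weights held at their current estimates, and zeroed when the condition holds.

```python
import numpy as np

def zero_out_weights(w, gamma, X, y, l1):
    """Post-epoch pruning: set w_j = 0 when the last condition of
    Eq. (14) holds at the current estimates."""
    N = X.shape[0]
    w = w.copy()
    for j in range(X.shape[1]):
        w_star = w.copy()
        w_star[j] = 0.0  # w with the j-th component set to 0
        # left-hand side of the last condition of Eq. (14)
        grad_j = (2.0 / N) * gamma * (X[:, j] @ (y - gamma * (X @ w_star)))
        if abs(grad_j) <= l1:
            w[j] = 0.0
    return w
```

In the actual training loop this check would run after every epoch, so that weights drifting near zero are snapped to exactly zero, mimicking the sparsity of the statistical lasso.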

Logistic regression

As shown below, the optimization problem for the logistic case is formulated by

$$\begin{aligned} \underset{\varvec{\beta }}{{{\,\textrm{argmin}\,}}}\ \dfrac{1}{N} \sum _{i=1}^N \left[ \log (1+e^{\textbf{x}_i^t \varvec{\beta }+ \beta _0})- y_i \left( \textbf{x}_i^t \varvec{\beta }+ \beta _0 \right) \right] + \lambda \bigl \Vert \varvec{\beta }\bigr \Vert _1 \end{aligned}$$
(15)

This problem can be characterized by the neural network shown in Fig. 2.

Note that the linear activation of the output layer has been replaced by a sigmoid. In addition, the MSE has been replaced by the binary cross-entropy function whose formula is given by

$$\begin{aligned} -\dfrac{1}{N}\sum _{i=1}^N \left[ y_i \log \hat{y}_i +(1-y_i) \log (1-\hat{y}_i)\right] \end{aligned}$$
(16)

Therefore, the loss function of the network is given by

$$\begin{aligned} \begin{aligned} \mathcal {L}(\varvec{w})&=-\dfrac{1}{N} \sum _{i=1}^N \Biggl ( y_i \log \left( \dfrac{1}{1+e^{-\gamma x_i^t \varvec{w}- b_0} } \right) \\&\quad + (1-y_i) \log \left( 1-\dfrac{1}{1+e^{-\gamma x_i^t \varvec{w}- b_0 }}\right) \Biggr ) + \ell _{1} \bigl \Vert \varvec{w}\bigr \Vert _1 \end{aligned} \end{aligned}$$
(17)

It can be seen that Eq. (17) is equivalent to Eq. (15) as follows. Focusing on the error term of Eq. (17):

$$\begin{aligned} \mathcal {R}(\varvec{y}, \varvec{X} \varvec{w})&= -\dfrac{1}{N} \sum _{i=1}^N \Biggl ( y_i \log \left( \dfrac{1}{1+e^{-\gamma \varvec{x}_i^t \varvec{w}- b_0}} \right) + (1-y_i) \log \left( \dfrac{1}{1+e^{\gamma \varvec{x}_i^t \varvec{w}+ b_0}}\right) \Biggr ) \\ &= -\dfrac{1}{N} \sum _{i=1}^N \left( -y_i \log (1+ e^{-\gamma \varvec{x}_i^t\varvec{w}- b_0}) - (1-y_i)\log (1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}) \right) \\ &= \dfrac{1}{N} \sum _{i=1}^{N} \left( y_i \log (1+ e^{-\gamma \varvec{x}_i^t\varvec{w}- b_0}) + \log (1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}) - y_i \log (1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}) \right) \\ &= \dfrac{1}{N}\sum _{i=1}^N \left( y_i \log \left( \dfrac{1+e^{-\gamma \varvec{x}_i^t \varvec{w}- b_0}}{1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}}\right) + \log (1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}) \right) \\ &= \dfrac{1}{N}\sum _{i=1}^N \left( y_i \log \left( e^{-\gamma \varvec{x}_i^t\varvec{w}- b_0} \right) + \log (1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}) \right) \\ &= \dfrac{1}{N}\sum _{i=1}^N \left( \log (1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}) - y_i (\gamma \varvec{x}_i^t\varvec{w}+ b_0 ) \right) \end{aligned}$$

Therefore, (17) becomes

$$\begin{aligned} \mathcal {L}(\varvec{w})= \dfrac{1}{N}\sum _{i=1}^N \left( \log (1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}) - y_i (\gamma \varvec{x}_i^t\varvec{w}+ b_0) \right) + \ell _1 \Vert \varvec{w}\Vert _1 \end{aligned}$$
(18)

Defining, as above, \(\varvec{\beta }= \gamma \varvec{w}\), \(\lambda =\ell _1/\gamma \), formulation (17) is equivalent to formulation (15).

Similar to the linear case, it is necessary to establish a mechanism that makes the weights associated with the non-significant variables equal to 0. Taking the derivative of the loss function in equation (18)

$$\begin{aligned} \dfrac{\partial \mathcal {L}(\varvec{w})}{\partial w_j} = \dfrac{1}{N} \sum _{i=1}^N \left( \dfrac{ \gamma x_{ij} e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}}{1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}}-y_i\gamma x_{ij} \right) +\ell _1 s_j \end{aligned}$$
(19)

Unfortunately, in contrast to the linear case, it is not possible to isolate the vector \(\varvec{w}\). The problem is, therefore, approached from a different perspective.

Rearranging and equating the above equation to zero

$$\begin{aligned} \dfrac{\gamma }{N} \sum _{i=1}^N \left( \dfrac{ e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}}{1+e^{\gamma \varvec{x}_i^t\varvec{w}+ b_0}}-y_i \right) x_{ij} +\ell _1 s_j = 0 \end{aligned}$$
(20)

which is equivalent to

$$\begin{aligned} \dfrac{\gamma }{\ell _1 N} \sum _{i=1}^N \left( y_i-\dfrac{ 1}{1+e^{-\gamma \varvec{x}_i^t\varvec{w}- b_0}} \right) x_{ij} = s_j \end{aligned}$$
(21)

Following Simon et al. [18], this is satisfied for \(w_j=0\) if

$$\begin{aligned} \dfrac{\gamma }{\ell _1 N} \sum _{i=1}^N \left( y_i-\dfrac{ 1}{1+e^{-\gamma \varvec{x}_i^t\varvec{w}_j^* - b_0} } \right) x_{ij} = s_j \end{aligned}$$
(22)

where \(\varvec{w}_j^*\) is the vector \(\varvec{w}\) with the \(j^{th}\) component equal to 0. Therefore,

$$\begin{aligned} \left| \dfrac{\gamma }{\ell _1 N} \sum _{i=1}^N \left( y_i-\dfrac{1}{1+e^{-\gamma \varvec{x}_i^t\varvec{w}_j^* - b_0}} \right) x_{ij} \right| = \left| s_j\right| \le 1 \end{aligned}$$
(23)

Rearranging gives

$$\begin{aligned} \left| \dfrac{\gamma }{N} \sum _{i=1}^N \left( y_i-\dfrac{ 1}{1+e^{-\gamma \varvec{x}_i^t\varvec{w}_j^* - b_0}} \right) x_{ij} \right| \le \ell _1 \end{aligned}$$
(24)

which vectorially can be written as

$$\begin{aligned} \left| \dfrac{ \gamma }{N} \textbf{X}_j^t \Biggl ( \textbf{y}- \sigma \left( \gamma \textbf{X} \varvec{w}_j^* + {\textbf {b}} \right) \Biggr ) \right| \le \ell _1 \end{aligned}$$
(25)

where \(\sigma (x)=1/(1+e^{-x})\) is the sigmoid activation function and \( {\textbf {b}}\) is the N-dimensional vector whose all components are equal to \(b_0\).

It is important to highlight that the method through which neural lasso determines whether a weight should be set to zero differs from the approach used by the statistical lasso. The latter uses a quadratic approximation of the error term since it also needs to have an explicit expression of the update of the non-zero weights. Neural lasso only needs to know which weights are zero since the update of the non-zero weights is implicitly performed during the training of the network.
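Analogously to the linear case, the logistic zeroing rule of Eq. (25) can be sketched in NumPy (function names are ours):

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoid sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-z))

def zero_out_weights_logistic(w, gamma, b0, X, y, l1):
    """Set w_j = 0 whenever condition (25) holds at the current estimates."""
    N = X.shape[0]
    w = w.copy()
    for j in range(X.shape[1]):
        w_star = w.copy()
        w_star[j] = 0.0  # w with the j-th component set to 0
        # left-hand side of Eq. (25)
        lhs = abs((gamma / N) * (X[:, j] @ (y - sigmoid(gamma * (X @ w_star) + b0))))
        if lhs <= l1:
            w[j] = 0.0
    return w
```

Note that, in line with the discussion above, this routine only decides which weights are zero; the non-zero weights themselves are left to be updated implicitly by the network training.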

3.2 Optimization

A key aspect to address is the estimation of neural lasso weights. In this section, we introduce three novel optimization algorithms, each illustrated schematically in the respective lower panels of Fig. 3.

Fig. 3 Statistical lasso and neural lasso algorithms

Typically, when working with neural networks, their architecture is determined through cross-validation, and weight estimation is done through simple hold-out validation. In simple hold-out validation, the available data is split into a training set for weight estimation and a validation set for assessing network performance independently. The final network is the one with weights that minimize the validation error. However, since neural lasso has a predefined network layout, it only needs to estimate its weights using simple hold-out validation. This optimization approach will be referred to as standard neural lasso.

However, standard neural lasso may have a drawback compared to statistical lasso in how it estimates weights. Statistical lasso benefits from cross-validation, which utilizes all available data to estimate the error, whereas standard neural lasso uses only a subset of the data because it relies on simple hold-out validation. To bridge this gap, we introduce a second algorithm called restricted neural lasso. In a first step, \(\gamma \) is fixed at 1 and set as a non-trainable parameter, the hyper-parameter \(\ell _1\) is set to one of the \(\lambda \) values considered by statistical lasso during its optimization, and cross-validation is performed to select the \(\ell _1\) value that minimizes the cross-validation error. In a second step, the algorithm estimates the weights using the optimal \(\ell _1\) value with \(\gamma \) still fixed to 1. Assuming the network layout is correct, the performance of this optimization method should be practically identical to that obtained with statistical lasso.
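For a fixed \(\ell _1\), the weight-estimation phase of the restricted neural lasso can be sketched as follows. This is a didactic NumPy version using plain subgradient descent rather than the authors' PyTorch/Adam setup; in PyTorch, freezing \(\gamma \) would correspond to setting `requires_grad=False` on that parameter. The function name is ours.

```python
import numpy as np

def train_restricted(X, y, l1, lr=0.05, epochs=500):
    """Restricted neural lasso sketch: gamma is fixed at 1 (non-trainable)
    and only w is updated by (sub)gradient descent on the loss of Eq. (8)."""
    N, p = X.shape
    w = np.zeros(p)
    gamma = 1.0  # frozen weight, not updated during training
    for _ in range(epochs):
        # subgradient of Eq. (8) with respect to w
        grad = (-2.0 * gamma / N) * (X.T @ (y - gamma * (X @ w))) + l1 * np.sign(w)
        w -= lr * grad
        # post-epoch pruning via the last condition of Eq. (14)
        for j in range(p):
            w_star = w.copy()
            w_star[j] = 0.0
            if abs((2.0 / N) * gamma * (X[:, j] @ (y - gamma * (X @ w_star)))) <= l1:
                w[j] = 0.0
    return w
```

The surrounding cross-validation loop would simply call this routine for each candidate \(\ell _1\) on each training fold and keep the value with the smallest average validation error.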

Additionally, a third optimization approach emerged during this work, termed voting neural lasso. It combines all the optimization approaches discussed above. It employs the cross-validation design of restricted neural lasso and statistical lasso but does not search for the hyper-parameter \(\lambda \) that minimizes the average validation error in the k-fold scenarios. Instead, for each of the k settings, it selects the \(\lambda \) value that yields the smallest validation error, similar to standard neural lasso. A variable is considered to be significant when it has been selected in most of the k settings. In a second phase, it estimates the weights of only these significant variables without considering the penalty term. It is important to note that this approach is not a relaxed lasso [19].
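The voting scheme can be sketched as follows (NumPy; names ours). Here `fit_fold` is a hypothetical stand-in for training a standard neural lasso on one fold and returning the weight vector at that fold's best \(\ell _1\); the second phase refits the selected variables by unpenalized least squares.

```python
import numpy as np

def voting_neural_lasso(X, y, fit_fold, k=5):
    """Voting neural lasso sketch. `fit_fold(X_tr, y_tr, X_val, y_val)` is
    assumed to return the weight vector minimizing that fold's validation
    error. A variable is significant when selected in most of the k folds."""
    N, p = X.shape
    idx = np.arange(N)
    votes = np.zeros(p, dtype=int)
    for f in range(k):
        val = idx[f::k]                # every k-th observation as validation
        tr = np.setdiff1d(idx, val)    # remaining observations as training
        w = fit_fold(X[tr], y[tr], X[val], y[val])
        votes += (w != 0).astype(int)  # one vote per selected variable
    selected = votes > k // 2          # majority vote across folds
    w_final = np.zeros(p)
    if selected.any():
        # second phase: unpenalized least-squares refit on selected variables
        w_final[selected] = np.linalg.lstsq(X[:, selected], y, rcond=None)[0]
    return w_final, selected
```

Unlike the relaxed lasso, the second phase here fits only the majority-voted variables and discards the penalty entirely, so no shrinkage is applied to the surviving coefficients.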

In summary, we consider three optimization algorithms with distinct purposes. Standard neural lasso employs the conventional procedure for training neural networks. Restricted neural lasso mimics statistical lasso to establish a connection between Statistics and Machine Learning. Voting neural lasso offers a new way of estimating weights.

For the standard neural lasso and the voting neural lasso, the network is initialized with \(\gamma =1\) and \(\ell _1 = \max _j \left| \frac{2}{N} \textbf{X}_j^t \textbf{y}\right| \) for the linear case, and \(\ell _1= \max _j \left| \frac{1}{N} \textbf{X}_j^t ( \textbf{y}- \sigma (0)) \right| \) for the logistic case. In addition, in this article, the Adam optimization algorithm is used to adjust the weights [20].
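These initial values follow from evaluating the zeroing conditions (14) and (25) at \(\varvec{w}=\varvec{0}\), \(\gamma =1\), \(b_0=0\): they are the smallest penalties that zero out every weight at the start. A small sketch (function name ours):

```python
import numpy as np

def l1_init(X, y, logistic=False):
    # Smallest l1 that zeroes every weight at w = 0, per condition (14)
    # (linear) or condition (25) with sigma(0) = 1/2 (logistic)
    N = X.shape[0]
    if logistic:
        return np.max(np.abs(X.T @ (y - 0.5)) / N)
    return np.max(np.abs(2.0 * (X.T @ y) / N))
```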

4 Experimental Results

To assess the effectiveness of our approach, we conducted three experiments. The first two are centered on the linear case, with the initial experiment employing simulated data and the second utilizing various real datasets. These two experiments are complemented with a third one aiming to evaluate the proposed method in the logistic case using real data.

4.1 Experiment 1: Linear case, Simulated data

In the first study, the data were simulated according to the model \( y= \textbf{X} \varvec{\beta } + \epsilon \), where \(\textbf{X}\) is the matrix containing the observations as rows, \(\epsilon _i \sim N(0,1)\), and

$$ \beta =[1\,2\,3\,4\,\underbrace{0\, \ldots \, 0}_{p-4}]$$

Moreover, the predictors were simulated from a centered normal distribution with correlations \(\rho _{ij}=0.5^{|i-j|}\) for \(1 \le i <j \le p\). In addition, the predictor columns were randomly rearranged to avoid possible positional effects.
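This data-generating process can be sketched as follows (NumPy; the function name and seed handling are ours):

```python
import numpy as np

def simulate(N, p, rho=0.5, seed=0):
    """Experiment 1 data generation: X ~ N(0, Sigma) with
    Sigma_ij = rho^|i-j|, y = X beta + eps, beta = [1, 2, 3, 4, 0, ..., 0],
    and predictor columns randomly permuted."""
    rng = np.random.default_rng(seed)
    # AR(1)-style correlation matrix Sigma_ij = rho^|i-j|
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
    beta = np.zeros(p)
    beta[:4] = [1, 2, 3, 4]
    y = X @ beta + rng.standard_normal(N)  # eps_i ~ N(0, 1)
    perm = rng.permutation(p)              # avoid positional effects
    return X[:, perm], y, beta[perm]
```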

Table 1 Results obtained for the linear scenario with synthetic data and 50 training observations

In order to test the performance of the different algorithms, training sets for \(p \in \{20,100,200\}\) with sample size N equal to 50 were generated. For each of the three scenarios, a repeated validation was performed with 100 runs. In each repetition, a test set comprising 1000 observations was generated. Our performance metrics included the MSE on the test set, precision (the percentage of correctly identified non-significant variables), and recall (the percentage of correctly identified significant variables). The number k of folds was set to five for the statistical lasso, restricted neural lasso, and voting neural lasso algorithms. In contrast, the standard neural lasso employed 20% of the training data as a validation set. It is worth noting that the non-neural versions of the analyses were conducted using the glmnet R package [21], while the neural versions were implemented in PyTorch [22]. The results obtained from these experiments are presented in Table 1.

This table clearly indicates that the standard neural lasso yields notably poorer performance compared to its non-neural counterpart. As noted above, this discrepancy arises because the standard neural lasso relies on a relatively small validation subset. Additionally, the performance of the statistical lasso and the restricted neural lasso is nearly identical, confirming that the network design is accurate. Remarkably, the best results are achieved by the voting neural lasso algorithm, which outperforms all three previous approaches.

It is important to note that the standard neural lasso approach, despite its poor performance here, has the advantage that the estimation of the regularization parameter is performed continuously. This differs from the statistical lasso, which considers only the predefined values of a grid, and it prevents the lasso performance from being affected by a poor choice of the values included in the grid. On the other hand, it requires the validation set to be sufficiently representative, a condition that was not satisfied in the previous experiment. To illustrate this issue, Table 2 shows the results obtained when the previous experiment is repeated for p equal to 20 with 15,000 training observations instead of 50. It is observed that in this new scenario the standard neural lasso improves its performance significantly. In addition, the restricted neural lasso performs almost identically to the statistical lasso, as expected, and the voting neural lasso once again achieves the best performance, with recall and precision of 1.

Table 2 Results obtained for the linear scenario with synthetic data and 15,000 training observations

4.2 Experiment 2: Linear case, Real data

We further evaluated our proposed technique using five distinct real datasets, including three from the University of California-Irvine (UCI) repository and two proprietary datasets. These datasets are as follows:

\(\circ \) UCI White wine quality [23]. This dataset comprises 4898 observations and was created to predict the quality of Portuguese “Vinho Verde" using 11 predictors. In each repetition, the training set included 4000 training observations, with the test set consisting of 898 observations.

\(\circ \) UCI Boston housing [24]. Comprising 506 observations, this dataset is focused on predicting the median value of owner-occupied homes in Boston, with the aid of 11 predictors. Each repetition employed a training set with 400 training observations and a test set with 106 observations.

\(\circ \) UCI Abalone [25]. Collected to predict the age of abalones from physical measurements, this dataset contains 4177 observations, each with nine attributes. In each repetition, the training set encompassed 3342 training observations, and the test set contained 1935 observations.

\(\circ \) Suicide attempt severity. This database contains information on the severity of 349 suicide attempts, as measured by the Beck suicide intent scale [26]. The predictors are derived from the 30 items of the Barratt impulsiveness scale [27]. In each repetition, the training set comprised 200 training observations, while the test set comprised 149 instances.

\(\circ \) Attention Deficit Hyperactivity Disorder (ADHD). This dataset encompasses responses from 59 mothers of children with ADHD to the Behavior Rating Inventory of Executive Function-2, containing 63 items [28]. It features two possible dependent variables, quantifying the levels of inattention and hyperactivity in children, as measured by the ADHD rating scale [29]. For each repetition, the training set comprises 47 observations, and the validation set contains 12 observations.

As with the previous experiment, 100 repeated validations are performed, the number k of folds is set to five, and the validation set contains 20% of the training data. The results, presented in Table 3, reinforce the conclusions drawn from the synthetic data experiment. Specifically, we observe that the voting neural lasso achieves an MSE comparable to that of the statistical lasso but with the added benefit of using significantly fewer predictors. Conversely, the standard neural lasso exhibits the poorest performance. Additionally, it is evident that the statistical lasso and restricted neural lasso yield nearly identical results.

Table 3 Results obtained for the linear scenario with real data
Table 4 Results obtained for the logistic scenario with real data

4.3 Experiment 3: Logistic case, Real data

This last experiment is intended to test the performance of the neural lasso in the logistic scenario. For this purpose, three databases obtained from the UCI repository and one proprietary database are used. A brief description of these databases is given below.

\(\circ \) UCI Wisconsin Breast cancer [30]. This dataset is composed of 569 observations. Each observation has 30 predictors and a dependent variable indicating whether the predictors were obtained from a malignant tumor. The training set was made up of 445 observations, while the test set consisted of 124.

\(\circ \) UCI Spam [31]. This dataset is made up of 4601 instances. Each of them contains 57 predictors and one dependent variable indicating whether the email was spam. The training set consisted of 3975 observations, while the test set comprised 626.

\(\circ \) UCI Ionosphere [32]. This database is composed of 351 instances with 34 predictors and a dependent variable indicating whether the radar signal passed through the ionosphere. The training set was made up of 299 observations, while the test set consisted of 52.

\(\circ \) Suicidal Behaviour [33]. This database consists of 700 observations. Each contains 106 predictors, consisting of responses to items of various scales, and a dependent variable indicating whether the respondent had recently made a suicide attempt.

The experimental setup mirrored that of the previous sections (k-fold cross-validation with k equal to five, 100 repetitions, and a validation set comprising 20% of the training data). The results obtained are shown in Table 4.

Results obtained for the logistic case are similar to those obtained in the linear scenario and presented in the previous two sections. The best results are achieved by the voting neural lasso in three of the four settings; a significantly lower accuracy than that of the statistical lasso is obtained only on the spam dataset. It is also observed that the restricted neural lasso and the statistical lasso obtain equivalent results, which again shows the convergence of the neural technique with the statistical one. A small difference with respect to the results achieved previously is that the standard neural lasso obtains better results than the statistical lasso in two settings (Cancer and Ionosphere).

5 Conclusions

In this article, we successfully implemented the lasso algorithm using neural networks. Specifically, we defined the network architecture and compared three novel optimization algorithms for weight estimation.

The first developed algorithm was the standard neural lasso. The benefit of this approach is that the search for the optimal value of the regularization parameter is performed continuously, unlike the lasso, which uses a grid search. This prevents the parameter estimate from depending on the values considered in the grid and, therefore, the lasso performance from degrading due to a poor choice of these values. However, because this first approach relies on a simple hold-out validation rather than cross-validation, it requires the validation set to be sufficiently representative.

The previous algorithm, despite introducing an improvement when a representative validation set is available, did not closely reproduce the lasso, which was our initial objective. Therefore, we developed the restricted neural lasso, which reproduces the lasso algorithm almost identically. This task was not straightforward because, in neural networks, cross-validation is normally used only to define the topology of the network and, in our case, the topology was already predefined. To replicate the lasso with the neural version, it was necessary to make one of the weights non-trainable. As shown in all the results, the performance obtained by this second approach and the statistical lasso are practically identical. That is, this second approach achieved the initial objective of obtaining the neural version of the lasso and built the basis for future work on non-linear versions.

Furthermore, we introduced a novel algorithm based on majority voting, which considers the significance of variables across the cross-validation scenarios. This third algorithm significantly outperformed the widely used statistical lasso. Notably, the voting neural lasso consistently achieved lower error rates and improved variable selection in both linear and logistic cases. These results were obtained across diverse training sets, encompassing observations ranging from as few as 47 to as many as 4000, with the number of predictors varying from 9 to 200.

These findings open up new avenues for future research. One direction could involve developing neural versions of other shrinkage techniques, such as the elastic net, or extending these algorithms to nonlinear versions leveraging the flexibility of neural networks. Additionally, while the development of the voting neural lasso was based on simple cross-validation, exploring the use of repeated validation or repeated cross-validation, along with the computation of confidence intervals, could lead to a more robust algorithm.