1 Introduction

Feedforward neural networks (FNNs) have been studied comprehensively and used widely across many fields of science because they can approximate complex nonlinear mappings and model various natural and artificial phenomena. Many researchers have investigated the universal approximation capabilities of standard multi-layer FNN architectures, including [24, 25, 38]. Real-data applications of neural networks are based on a finite training set. [28] proved that a finite number of distinct observations can be learned with zero error by a single-hidden-layer feedforward neural network (SLFN) with at most the same finite number of hidden neurons and almost any nonlinear activation function.

Although the SLFN is one of the most popular FNNs, the learning algorithms used to train it are slow since its parameters are tuned by iterative procedures, and these parameters may get stuck at a local minimum [12]. To deal with these drawbacks, [33] proposed the extreme learning machine (ELM), a fast learning algorithm based on the SLFN. The applications of ELM are primarily regression, classification and clustering tasks [29, 34]. The main distinction of ELM from traditional learning algorithms for neural networks is that it does not determine the input weights iteratively. In the ELM algorithm, the learning parameters (input weights and biases) of the hidden layer units are assigned randomly, and the output weights are obtained by a generalized inverse operation. In the last step of the training process of ELM, a linear or non-linear transformation is applied to the output weights. This procedure leads to notable benefits, including faster learning speed, promising generalization capability and less human intervention [30, 31]. Since ELM was proposed by [33], it has been successfully applied to many real-life problems (see [7, 35,36,37, 40, 49, 50, 55, 56, 58, 60, 61]). Although ELM has shown remarkable success on many real data sets, it may give poor results in terms of stability, generalization performance and sparsity depending on the nature of the underlying data set. Multicollinearity occurs when the columns of the hidden layer output matrix are moderately or highly correlated with one another. Multicollinearity makes the ELM results unstable and adversely affects the generalization performance. Also, ELM may suffer from overfitting or underfitting because of the interference of redundant information when the number of hidden neurons is not properly selected [42].

Regularization methods have been proposed to deal with the drawbacks of basic ELM. In particular, \(\ell _{1}\) and \(\ell _{2}\) norm-based regularization methods improve the underlying model with respect to multicollinearity, overfitting and sparsity. [23] proposed an \(\ell _{2}\) norm-based regularization method called ridge regression to handle multicollinearity in linear models. Although ridge regression improves the model in the case of multicollinearity, it does not carry out automatic variable selection. [53] introduced the lasso, an \(\ell _{1}\) norm-based regularization method, as an alternative to ridge regression. The lasso performs automatic variable selection as a consequence of the nature of the \(\ell _{1}\) norm. [69] proposed the elastic net, which employs a linear combination of \(\ell _{1}\) and \(\ell _{2}\) norm regularizations. The elastic net tends to group highly correlated variables and improves on the lasso in the case of multicollinearity. [4] proposed a modified version of the lasso, called the square-root lasso, by modifying the form of the lasso objective function. The square-root lasso carries out automatic variable selection in a similar way to the lasso, while being pivotal in the sense that it does not rely on knowledge or a pre-estimate of the standard deviation of the noise. The square-root lasso also does not rely on a normality assumption on the noise. Hence, the square-root lasso potentially outperforms the lasso in variable selection.

Recently, many algorithms have been proposed as variants of basic ELM to improve the performance of the method depending on the structure of the underlying data set. The majority of these works are \(\ell _{1}\) and \(\ell _{2}\) norm-based regularization methods. [48] proposed the pruned-ELM, a systematic and automated method for the ELM classifier, which builds an appropriate classifier network by measuring the relevance of the hidden nodes through statistical methods. [17] introduced the error-minimized extreme learning machine, which updates the output weights incrementally during the growth of the network to determine the number of hidden nodes automatically. [32] examined the ELM for classification from the standpoint of standard optimization methods and extended the ELM to the support vector network, a specific type of generalized SLFN. [11] proposed the regularized ELM, based on the structural risk minimization principle and weighted least squares, to tackle the drawbacks of ELM in cases of heteroskedasticity, outliers and over-fitting. [43] introduced the optimally pruned ELM (OP-ELM) to reduce the effect of irrelevant variables on the ELM predictions. The OP-ELM uses the leave-one-out (LOO) validation method to find the best number of neurons. Since the computation of the LOO error is based on the PRESS statistic, the OP-ELM is sensitive to high correlations among variables. To cope with this situation, [44] proposed a double-regularized ELM called the Tikhonov regularized ELM (TROP-ELM) to improve numerical stability with more efficient pruning of neurons than OP-ELM. [42] proposed a regularized ELM for regression problems using the ridge, lasso and elastic net methods. The idea of the proposed method is to prune the irrelevant nodes from a large number of initially determined hidden nodes, hence automating the architectural design of ELM. [39] proposed the ridge regression ELM, which applies ridge regression in the computation of the output weights to handle multicollinearity in the data set. [16] proposed the \(\ell _{1}\) regularized ELM, which uses lasso-type regularization in the computation of the output weights to promote sparsity and, hence, improve the performance of the ELM. [2] proposed a knowledge-based ELM formulation that incorporates nonlinear prior knowledge in the form of implications into the ELM. [3] considered the primal form of optimization-based ELM [32], whose solution is obtained by solving an absolute value equation problem with a functional iterative method. [41] proposed the \(\ell _{1}-\ell _{2}\)-ELM, a unified form of ELM that combines the grouping property of the \(\ell _{2}\) norm and the sparsity property of the \(\ell _{1}\) norm to control the complexity of the network and deal with over-fitting. [5] proposed the implicit Lagrangian twin random vector functional-link networks, formulated as an unconstrained minimization problem, for better generalization performance in ELM. [62] observed that the regularization parameter selection criterion affects the performance of the ridge regression ELM and investigated its performance under various regularization parameter selection criteria. [9] proposed the bootstrapped lasso as a regularization and resampling method to select the neurons most relevant to the model. [57] proposed an \(\ell _{2,1}\)-norm-based regularization solved by a difference-of-convex-functions program to obtain a more robust and stable model by reducing the influence of outliers.
[21] proposed a robust regularized extreme learning machine that uses the asymmetric Huber loss function to reduce the effect of noise and outliers in the data. [22] proposed a regularization-based implicit Lagrangian twin extreme learning machine in the primal, formulated as a pair of unconstrained convex minimization problems in which the regularization term is added to follow the structural risk minimization principle. [47] proposed a novel online sequential algorithm for ELM with \(\ell _{2,1}\)-norm regularization to deal with real-time sequential data. [64] proposed the Liu-lasso ELM as a regularization and variable selection method and compared its performance with the ridge, Liu, lasso and elastic net regularized ELM methods. [66] developed a novel elastic net regularized ELM algorithm based on the over-relaxed alternating direction method of multipliers, which reduces the model training time by passing the results of the previous iteration to the next iteration. [26] proposed a least squares ELM with derivative characteristics in which the weights and biases of the network are specified by a twice least squares method to improve the regression accuracy of the network. [65] proposed a combination of ridge and Liu regressions in a unified form in the ELM context as a remedy for the disadvantages of the basic ELM algorithm. [59] established a connection between the variance and the weights of the ELM for regression problems and proposed maximum likelihood-based estimates for the model parameters. [67] proposed a regularized functional extreme learning machine that uses a regularization function instead of a preset regularization parameter to select appropriate regularization parameters adaptively.

In this study, we present the square-root lasso ELM (SQRTL-ELM), an enhanced ELM based on the pivotal recovery of signals in the hidden layer output matrix. The SQRTL-ELM computes the output weights by the square-root lasso to achieve better stability and generalization capability than the other regularized ELM methods.

The rest of the paper is organized as follows: A brief review of the algorithms is given in Sect. 2. The details and computational properties of the proposed method are presented in Sect. 3. In Sect. 4, an experimental study is carried out to examine the performances of all the methods on real data sets. Finally, the study is concluded in Sect. 5.

2 A Review of Algorithms

2.1 Extreme Learning Machine (ELM)

In this section, the ELM algorithm is briefly reviewed. The ELM is a special SLFN in which the weight matrix between the input layer and the hidden layer is chosen randomly. Hence, the estimation problem of ELM reduces to the solution of a set of linear equations, and the output layer weights can be estimated by least squares [27, 34]. As a result of this training process, ELM is a faster learning algorithm than alternatives such as backpropagation for training SLFNs.

Consider N distinct training samples \(\left( \textbf{x}_{i},\textbf{t}_{i}\right) \), where \(\textbf{x}_{i}\in \mathbb {R}^{n}\) is the extracted feature vector with n input features and \(\textbf{t}_{i}\in \mathbb {R}^{p}\) is the target vector. The mathematical model for the SLFN with L hidden nodes and activation function \(g\left( x\right) \) is

$$\begin{aligned} \sum _{i=1}^{L}\varvec{\beta }_{i}^{T}g\left( \textbf{w}_{i}\cdot \textbf{x}_{j}+b_{i}\right) =\textbf{t}_{j},\ j=1,2,\ldots ,N, \end{aligned}$$
(1)

where \(\textbf{w}_{i}\in \mathbb {R}^{n}\) is the weight vector that connects the input neurons to the ith hidden neuron, \(\varvec{\beta }_{i}\) is the weight vector that connects the ith hidden neuron to the output neurons, \(b_{i}\) is the bias of the ith hidden neuron, and \(\textbf{w}_{i}\cdot \textbf{x}_{j}\) denotes the inner product of the vectors \(\textbf{w}_{i}\) and \(\textbf{x}_{j}\). The structure of basic ELM is given in Fig. 1.

The matrix form corresponding to Eq. (1) is written as

$$\begin{aligned} \textbf{H}\varvec{\beta }=\textbf{T}, \end{aligned}$$

where

$$\begin{aligned} \textbf{H}=\begin{bmatrix}g\left( \textbf{w}_{1}\cdot \textbf{x}_{1}+b_{1}\right) &{} \cdots &{} g\left( \textbf{w}_{L}\cdot \textbf{x}_{1}+b_{L}\right) \\ \vdots &{} \ddots &{} \vdots \\ g\left( \textbf{w}_{1}\cdot \textbf{x}_{N}+b_{1}\right) &{} \cdots &{} g\left( \textbf{w}_{L}\cdot \textbf{x}_{N}+b_{L}\right) \end{bmatrix}_{N\times L}, \end{aligned}$$
(2)
$$\begin{aligned} \varvec{\beta }=\begin{bmatrix}\varvec{\beta }_{1}^{T}\\ \vdots \\ \varvec{\beta }_{L}^{T} \end{bmatrix}_{L\times p}\quad \text { and }\quad \textbf{T}=\begin{bmatrix}\textbf{t}_{1}^{T}\\ \vdots \\ \textbf{t}_{N}^{T} \end{bmatrix}_{N\times p}. \end{aligned}$$
(3)

The \(\textbf{H}\) matrix in Eq. (2) is called the hidden layer output matrix of the neural network. The weight matrix \(\varvec{\beta }\) can then be obtained analytically by the least squares approach

$$\begin{aligned} \hat{\varvec{\beta }}_{\textrm{ELM}}=\textbf{H}^{+}\textbf{T}, \end{aligned}$$
(4)

where \(\textbf{H}^{+}\) is the Moore-Penrose generalized inverse of the matrix \(\textbf{H}\), which yields the minimum \(\ell _{2}\)-norm least squares solution.
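The training procedure described above amounts to a few lines of linear algebra. The following NumPy sketch (an illustrative re-implementation of ours, not the C++/R code used in Sect. 4; the function names and the uniform initialization range are assumptions) draws the hidden-layer parameters at random, builds \(\textbf{H}\) as in Eq. (2) with a sigmoidal activation, and computes the output weights by the Moore-Penrose pseudoinverse as in Eq. (4).

```python
import numpy as np

def elm_fit(X, T, L=50, rng=None):
    """Basic ELM: random hidden layer followed by least-squares output weights.

    X : (N, n) input matrix, T : (N, p) target matrix, L : number of hidden nodes.
    Returns the hidden-layer parameters (W, b) and the output weights beta.
    """
    rng = np.random.default_rng(rng)
    N, n = X.shape
    W = rng.uniform(-1.0, 1.0, size=(n, L))   # input-to-hidden weights w_i (assumed range)
    b = rng.uniform(-1.0, 1.0, size=L)        # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # sigmoidal hidden layer output matrix, Eq. (2)
    beta = np.linalg.pinv(H) @ T              # Moore-Penrose solution, Eq. (4)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```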

Fig. 1 The structure of ELM with one output

2.2 Regularized ELM

If the \(\textbf{H}\) matrix suffers from multicollinearity, the stability and generalization capability of the ELM are adversely affected since, in that case, the ELM solution either cannot be obtained or will be unstable [39, 64]. Several ELM variants have been proposed to deal with multicollinearity in the hidden layer output matrix. Consider the setup for a single-output regression model

$$\begin{aligned} \textbf{T}=\textbf{H}\varvec{\beta }+\varvec{\varepsilon } \end{aligned}$$
(5)

where \(\varvec{\varepsilon }\) is the noise vector whose elements have mean zero and constant variance \(\sigma ^{2}\). The ELM based on ordinary least squares (OLS) estimation is formulated as the minimization of the loss function

$$\begin{aligned} \hat{\varvec{\beta }}_{\textrm{ELM}}=\textrm{argmin}_{\varvec{\beta }}\left[ \left\| \textbf{H}\varvec{\beta }-\textbf{T}\right\| _{2}^{2}\right] , \end{aligned}$$
(6)

which has a closed-form solution

$$\begin{aligned} \hat{\varvec{\beta }}_{\textrm{ELM}}=\left( \textbf{H}^{T}\textbf{H}\right) ^{-1}\textbf{H}^{T}\textbf{T} \end{aligned}$$

if \(\textbf{H}\) has full column rank. In the case of multicollinearity, \(\hat{\varvec{\beta }}_{\textrm{ELM}}\) may suffer from a large variance and yield unsatisfactory estimates [63].

[39] proposed the ridge regression ELM (RIDGE-ELM) as the solution to the problem

$$\begin{aligned} \hat{\varvec{\beta }}_{\mathrm {R-ELM}}=\textrm{argmin}_{\varvec{\beta }}\left[ \frac{1}{2N}\left\| \textbf{H}\varvec{\beta }-\textbf{T}\right\| _{2}^{2}+\lambda \left\| \varvec{\beta }\right\| _{2}^{2}\right] , \end{aligned}$$
(7)

where \(\lambda >0\) is the tuning parameter of RIDGE-ELM. The problem in Eq. (7) has a closed-form solution as

$$\begin{aligned} \hat{\varvec{\beta }}_{\mathrm {R-ELM}}=\left( \textbf{H}^{T}\textbf{H} +\lambda \textbf{I}\right) ^{-1}\textbf{H}^{T}\textbf{T}, \end{aligned}$$

where \(\textbf{I}\) is the \(L\times L\) identity matrix. Introducing the regularization term reduces the variance of the estimates of \(\varvec{\beta }\) at the cost of some bias. The main benefit of the RIDGE-ELM is that it ensures the stability of the model in the case of multicollinearity in the \(\textbf{H}\) matrix. [44] proposed the LASSO-ELM by replacing the \(\ell _{2}\) regularization with \(\ell _{1}\) regularization in the criterion in Eq. (7) as

$$\begin{aligned} \hat{\varvec{\beta }}_{\mathrm {L-ELM}}=\textrm{argmin}_{\varvec{\beta }}\left[ \frac{1}{2N}\left\| \textbf{H}\varvec{\beta }-\textbf{T}\right\| _{2}^{2}+\lambda \left\| \varvec{\beta }\right\| _{1}\right] . \end{aligned}$$
(8)

The LASSO-ELM aims to improve the performance of ELM by pruning some hidden nodes using an \(\ell _{1}\)-type regularization term. However, the LASSO-ELM selects at most \(\min \left( N,L\right) \) hidden nodes, which is a limiting property for a variable selection method [41, 42]. To deal with this limitation of the LASSO-ELM in variable selection, [42] proposed to use a convex combination of \(\ell _{1}\) and \(\ell _{2}\) regularization, which leads to the ENET-ELM as

$$\begin{aligned} \hat{\varvec{\beta }}_{\mathrm {ENET-ELM}}=\textrm{argmin}_{\varvec{\beta }}\left[ \frac{1}{2N}\left\| \textbf{H}\varvec{\beta }-\textbf{T}\right\| _{2}^{2}+\lambda \left( \alpha \left\| \varvec{\beta }\right\| _{1}+\left( 1-\alpha \right) \left\| \varvec{\beta }\right\| _{2}^{2}\right) \right] \end{aligned}$$
(9)

where \(\alpha \) is a tuning parameter that controls the trade-off between the RIDGE-ELM (\(\alpha =0\)) and the LASSO-ELM (\(\alpha =1\)). The ENET-ELM performs much like the LASSO-ELM when the \(\ell _{1}\) term dominates, while removing the erratic behavior arising from high correlations. However, tuning the second parameter \(\alpha \) increases the computational cost of the ENET-ELM.
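For concreteness, the three output-weight estimators above can be sketched as follows (our own illustration, not the implementation used in the experiments; scikit-learn's coordinate-descent solvers are used for the \(\ell _{1}\)-based criteria, and its penalty parametrization may differ from Eqs. (7)-(9) by constant factors).

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

def ridge_elm_weights(H, T, lam):
    """RIDGE-ELM output weights via the closed form stated after Eq. (7)."""
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ T)

def lasso_elm_weights(H, t, lam):
    """LASSO-ELM output weights for a single-output target t, cf. Eq. (8)."""
    return Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(H, t).coef_

def enet_elm_weights(H, t, lam, alpha=0.5):
    """ENET-ELM output weights, cf. Eq. (9); alpha mixes the l1 and l2 penalties."""
    return ElasticNet(alpha=lam, l1_ratio=alpha, fit_intercept=False,
                      max_iter=10000).fit(H, t).coef_
```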

3 The Proposed Algorithm: SQRTL-ELM

Despite the attractive properties of the LASSO-ELM and ENET-ELM, their estimates depend on the standard deviation \(\sigma \) of the noise in the model (5). Considering this issue, [4] proposed the square-root LASSO (SQRTL) to provide pivotal recovery of sparse signals within the scope of regularized regression methods. The SQRTL is pivotal in terms of its regularization level because the regularization level does not depend on the standard deviation of the noise. With this pivotal recovery property, the SQRTL can be used to enhance the performance of the ELM. Therefore, we propose the square-root LASSO-ELM (SQRTL-ELM) by applying \(\ell _{1}\)-type regularization to the square root of the loss function in (6):

$$\begin{aligned} \hat{\varvec{\beta }}_{\mathrm {SQRTL-ELM}}=\textrm{argmin}_{\varvec{\beta }}\left[ \frac{1}{2N}\left\| \textbf{H}\varvec{\beta }-\textbf{T}\right\| _{2}+\lambda \left\| \varvec{\beta }\right\| _{1}\right] . \end{aligned}$$
(10)

In the SQRTL-ELM, the standard deviation \(\sigma \) becomes a constant factor that drops out of the minimization. This property enables the derivation of pivotal choices of the tuning parameter, since there is no need to estimate \(\sigma \) as part of the formula for the tuning parameter. Hence, the SQRTL-ELM can potentially outperform the LASSO-ELM and ENET-ELM methods when the tuning parameter value is properly selected.
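To see why \(\sigma \) drops out, consider the score of the square-root loss at the true coefficient vector \(\varvec{\beta }^{*}\) of model (5); the following is a brief heuristic for the single-output case with homoscedastic Gaussian noise, in the spirit of the square-root lasso analysis of [4]:

$$\begin{aligned} \nabla _{\varvec{\beta }}\left\| \textbf{H}\varvec{\beta }-\textbf{T}\right\| _{2}\Big |_{\varvec{\beta }=\varvec{\beta }^{*}}=-\frac{\textbf{H}^{T}\varvec{\varepsilon }}{\left\| \varvec{\varepsilon }\right\| _{2}}=-\frac{\textbf{H}^{T}\left( \varvec{\varepsilon }/\sigma \right) }{\left\| \varvec{\varepsilon }/\sigma \right\| _{2}}. \end{aligned}$$

The distribution of \(\varvec{\varepsilon }/\sigma \), and hence of the score, does not depend on \(\sigma \), so a penalty level \(\lambda \) chosen to dominate this score can be set without knowing or estimating \(\sigma \). By contrast, the score of the squared loss in Eq. (8) is proportional to \(\textbf{H}^{T}\varvec{\varepsilon }\), whose magnitude scales with \(\sigma \).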

3.1 Computational Properties of SQRTL-ELM

The SQRTL-ELM method given by problem (10) does not have a closed-form solution because of the nature of the \(\ell _{1}\) norm. [4] proposed an interior-point method and a first-order method to solve problem (10), using the fact that the SQRTL is equivalent to a conic programming problem for which strong duality holds. [51] proposed the scaled lasso, which yields coefficient estimates equivalent to those of the SQRTL, together with an alternating minimization algorithm based on solving a sequence of LASSO problems. Each LASSO problem can be solved efficiently by the coordinate descent algorithm [18]. Also, the computational cost of solving the LASSO problem can be reduced substantially by using strong rules for screening variables out of the model [54]. Therefore, we use a modified version of the alternating minimization algorithm, in which the design matrix of the SQRTL is replaced by the hidden layer output matrix, to obtain the SQRTL-ELM solution in Eq. (10).
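The alternating minimization can be sketched as follows (an illustrative NumPy/scikit-learn sketch with our own naming, not the C++/R implementation used in the experiments): the noise scale and the output weights are updated in turn, and each lasso step is solved by coordinate descent.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sqrtl_elm_weights(H, t, lam, tol=1e-6, max_iter=100):
    """Scaled-lasso style alternating minimization for the SQRTL-ELM problem (10).

    H : (N, L) hidden layer output matrix, t : (N,) single-output target,
    lam : pivotal penalty level, e.g. the universal level of Eq. (11).
    """
    N = H.shape[0]
    sigma = np.linalg.norm(t) / np.sqrt(N)      # initial noise-scale estimate
    beta = np.zeros(H.shape[1])
    for _ in range(max_iter):
        # Lasso step: the l1 penalty is rescaled by the current noise estimate,
        # which is what keeps the overall penalty level pivotal in sigma.
        beta = Lasso(alpha=sigma * lam, fit_intercept=False,
                     max_iter=10000).fit(H, t).coef_
        sigma_new = max(np.linalg.norm(t - H @ beta) / np.sqrt(N), 1e-12)
        if abs(sigma_new - sigma) < tol:        # noise scale has converged
            return beta, sigma_new
        sigma = sigma_new
    return beta, sigma
```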

The selection of the tuning parameter \(\lambda \) is critical for the SQRTL-ELM method to give successful results in predicting the target vector. In this paper, we use three different techniques for selecting the tuning parameter, each with its own theoretical background.

(\(\lambda _{1}\)):

The first technique is k-fold cross-validation (CV) which is a commonly used technique to estimate the prediction error [15, 19, 45, 68]. In the k-fold cross-validation procedure, the data set is split into k subsets of approximately equal size and the model is trained and tested k times. In each replication, one of the subsets is used as testing data and the rest \(k-1\) subsets serve as training data. Then, we combine these k estimates of the prediction error to obtain the cross-validation error. We select the tuning parameter that gives the model with minimum cross-validation error.

(\(\lambda _{2}\)):

The second technique is to use the universal penalty level [13],

$$\begin{aligned} \lambda _{2}=\sqrt{2\log \left( L\right) /N}. \end{aligned}$$
(11)

[51] used the universal penalty level in the scope of the scaled LASSO estimator. [8] studied the theoretical properties of the universal penalty level.

(\(\lambda _{3}\)):

The third technique is the quantile-based penalty proposed by [52]. Let \(Q\left( u\right) \) be the negative u-quantile function of the standard normal distribution. The penalty level in this approach is determined by

$$\begin{aligned} \lambda _{3}=\sqrt{2/N}\cdot Q\left( k/L\right) \end{aligned}$$
(12)

where k is the solution of the equation \(k=Q^{4}\left( k/L\right) +2Q^{2}\left( k/L\right) \). [52] investigated the theoretical properties of the quantile-based penalty and provided oracle inequalities for prediction bounds based on this penalty level.
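Both the universal and quantile-based penalty levels depend only on N and L and can be computed directly. The following sketch (ours) evaluates Eq. (11) and solves the fixed-point equation for k in Eq. (12) numerically with a root finder, which is an implementation choice rather than a prescription of [52]; no cross-validation loop is needed for either level.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def universal_penalty(N, L):
    """Universal penalty level of Eq. (11)."""
    return np.sqrt(2.0 * np.log(L) / N)

def quantile_penalty(N, L):
    """Quantile-based penalty level of Eq. (12).

    Q(u) is the negative u-quantile of the standard normal, i.e. -Phi^{-1}(u),
    and k solves k = Q(k/L)^4 + 2 Q(k/L)^2.
    """
    Q = lambda u: -norm.ppf(u)
    f = lambda k: k - (Q(k / L) ** 4 + 2.0 * Q(k / L) ** 2)
    k = brentq(f, 1e-8, L / 2.0)    # f < 0 near zero and f > 0 at k = L/2
    return np.sqrt(2.0 / N) * Q(k / L)
```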

Algorithm 1 summarizes the details of the implementation of the SQRTL-ELM method. As seen from Algorithm 1, the CV technique selects the best model from a group of candidate models, while the universal and quantile-based penalty techniques are based on a single model corresponding to a single tuning parameter. Therefore, the computational cost of the CV technique is higher than that of the other techniques.

Algorithm 1 SQRTL-ELM

4 Performance Evaluation of SQRTL-ELM

In this section, a performance comparison is carried out on several benchmark regression data sets to verify the effectiveness of the proposed SQRTL-ELM method. For this purpose, the Auto MPG, Strikes, Delta Ailerons, Concrete and Yacht data sets are retrieved from the UCI machine learning repository [14], and the Sleep data set from [1]. Each data set is split into training and testing sets in proportions of 70% and 30%, respectively, and all the inputs and outputs of each data set are normalized into the interval \(\left[ -1,1\right] \) before the experiments. The properties of the data sets are summarized in Table 1.

Table 1 The properties of benchmark data sets used in the study

All the experiments are carried out on an Intel Core i5 CPU 2.40 GHz with 8 GB RAM. All the numerical computations of the algorithms are performed in C++, and the algorithms are implemented as R functions. Each experiment is repeated 50 times to reduce the effect of randomness. Also, the number of hidden nodes is fixed at 50 for the experiments, and the sigmoidal activation function is used for all the data sets.

The accuracy of the regularized ELM algorithms on the testing set critically depends on the right choice of the tuning parameters [6]. When the value of the tuning parameter is very small, the algorithms tend to produce an over-fitted model. In the experiments, the tuning parameter \(\lambda \) is selected from the grid \(\left\{ 2^{-10},2^{-9},\ldots ,2^{13},2^{14}\right\} \), following [20, 46], and five-fold cross-validation is conducted to determine the best tuning parameter value for all the regularized ELM models. Also, the SQRTL-ELM models are fitted using the universal and quantile-based tuning parameters detailed in Sect. 3 to compare these techniques with the cross-validation approach. Moreover, since the ENET-ELM has two tuning parameters, \(\alpha \) is fixed at 0.5 to fit the ENET-ELM model.
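The experimental protocol described above can be summarized by the following sketch (illustrative helper functions of our own; data loading and the model-fitting routines from the earlier sketches are assumed).

```python
import numpy as np

def minmax_scale(Z, lo=-1.0, hi=1.0):
    """Normalize each (non-constant) column of Z into [lo, hi]."""
    zmin, zmax = Z.min(axis=0), Z.max(axis=0)
    return lo + (hi - lo) * (Z - zmin) / (zmax - zmin)

def train_test_split_70_30(X, y, rng=None):
    """Random 70%/30% split used in the experiments."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    cut = int(0.7 * len(X))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Candidate tuning parameters for the cross-validated methods: 2^-10, ..., 2^14.
lambda_grid = 2.0 ** np.arange(-10, 15)
```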

To compare the performance of the algorithms, the mean of the root mean squared error (RMSE) over the replications is computed for the training and testing sets, and the mean of the optimal \(\lambda \) values is reported. The standard deviations of the training and testing RMSE for each method are given to evaluate the stability of the methods. The number of hidden layer nodes is reported in order to compare the sparsity levels of the models. Also, the reduction rates are computed as \(\text {RR}=100\times \frac{\text {RMSE(competitor)}-\text {RMSE(SQRTL-ELM)}}{\text {RMSE(competitor)}}\) to quantify the percentage gain from using the proposed method. Moreover, the mean computation time of each algorithm is given as a measure of computational efficiency. All the criteria are reported in Table 2.

Table 2 Performance comparison of the algorithms in terms of training and testing RMSE

The results in Table 2 are summarized as follows:

  • In Table 2, \(\lambda _{1}\), \(\lambda _{2}\) and \(\lambda _{3}\) denote the tuning parameter selected by cross-validation, the universal penalty technique and the quantile-based technique, respectively. Table 2 shows that the basic ELM yields the best training RMSE values, followed by the SQRTL-ELM with at least one tuning parameter selection technique. The SQRTL-ELM outperforms the basic ELM and the regularized ELM methods in terms of testing RMSE, with at least one tuning parameter selection technique, for all data sets except Yacht, for which the ELM has a better testing RMSE. However, the SQRTL-ELM with the universal and quantile-based penalties is superior to the other regularized ELM algorithms for the Yacht data set. The cases where the SQRTL-ELM performs better in terms of testing RMSE are presented in bold in Table 2.

  • Fig. 2 shows, on a log scale, the reduction rates gained by the SQRTL-ELM with the universal and quantile-based penalties, respectively, compared to the other regularization methods. Considering Fig. 2 together with Table 2, the SQRTL-ELM with the universal and quantile-based penalties achieves considerable reduction rates compared to the competing models.

  • According to Table 2, the universal penalty outperforms the other penalty levels for the Auto MPG, Delta Ailerons, Concrete and Sleep data sets, while the cross-validation-based tuning parameter is superior for the Strikes data set. Although the quantile-based penalty is dominated by the other penalty levels, it still outperforms at least one of the other regularized ELM methods in terms of testing RMSE on every data set except Strikes.

  • Considering the sparsity results in Table 2, the lasso and elastic net give sparser models than the SQRTL-ELM at all of its penalty levels, except for the Sleep data, for which the cross-validation-based SQRTL-ELM outperforms the ENET-ELM in terms of both sparsity and testing RMSE. When the sparsity results are considered together with the testing RMSE results, it can be stated that the SQRTL-ELM makes a trade-off between model complexity and testing RMSE. In other words, the SQRTL-ELM sacrifices some sparsity to achieve a more precise RMSE value on the testing set.

  • Table 2 also presents the standard deviations of the testing RMSE values to examine the stability of the SQRTL-ELM. According to Table 2, the standard deviations obtained from the SQRTL-ELM are, in general, slightly larger than those of the other regularized ELM methods. Also, the standard deviations of the testing RMSE values obtained with the universal and quantile-based penalties for the SQRTL-ELM are better than those obtained with the cross-validation-based penalty level.

  • The training time of the basic ELM is lower than that of the regularized ELM algorithms for all the data sets. The SQRTL-ELM with universal or quantile-based tuning has a lower training time than the RIDGE-ELM. Among the models tuned by cross-validation, the LASSO-ELM and ENET-ELM methods, which give smaller node sizes than the SQRTL-ELM, have lower training times than the other algorithms. Also, the SQRTL-ELM with the universal penalty yields a lower training time than with the cross-validation and quantile-based penalties.

Overall, the SQRTL-ELM generally outperforms the basic ELM and the regularized ELM methods in terms of testing RMSE. However, it is slightly worse than the other methods in terms of the standard deviations of the testing RMSE. Moreover, although the training time of the SQRTL-ELM is higher than that of the basic ELM, the training times of the universal and quantile-based SQRTL-ELM are comparable to those of the other regularization methods.

Fig. 2 Reduction rates in testing RMSE obtained by SQRTL-ELM based on universal (left) and quantile-based (right) penalties

Fig. 3 Testing errors of ELM, RIDGE-ELM, LASSO-ELM, ENET-ELM and SQRTL-ELM (\(\lambda _{1},\lambda _{2},\lambda _{3}\)) algorithms for Auto MPG data set

Fig. 4 Testing errors of ELM, RIDGE-ELM, LASSO-ELM, ENET-ELM and SQRTL-ELM (\(\lambda _{1},\lambda _{2},\lambda _{3}\)) algorithms for Delta Ailerons data set

Fig. 5 Testing errors of ELM, RIDGE-ELM, LASSO-ELM, ENET-ELM and SQRTL-ELM (\(\lambda _{1},\lambda _{2},\lambda _{3}\)) algorithms for Concrete data set

Figures 3, 4 and 5 show the testing errors of all the algorithms for the Auto MPG, Delta Ailerons and Concrete data sets, respectively. The errors are obtained from the models fitted at the optimal parameter values. Figures 3, 4 and 5 also illustrate the stability of the algorithms: the smaller the spread of the testing errors, the more stable the algorithm. When the ranges of the errors in Figs. 3, 4 and 5 are analyzed, the algorithms show similar stability for the Delta Ailerons data set, while at least one of the penalty levels of the SQRTL-ELM gives better stability for the Auto MPG and Concrete data sets.

To compare the performance of the algorithms statistically, the non-parametric Friedman test is conducted together with the Nemenyi post hoc tests [10] on the testing RMSE values of the data sets. Based on the Friedman test results, the differences between the RMSE scores of the fifty trials are found to be statistically significant (\(p<0.05\) for all data sets). The Nemenyi (z) test statistics with the corresponding \(p\)-values are reported in Table 3. For the cases where the SQRTL-ELM performs better, the \(p\)-values that indicate significant differences in testing RMSE at the 0.05 level are shown in bold.
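A comparison of this kind can be reproduced with standard tools. The following sketch assumes the per-trial testing RMSE values are arranged as a matrix with one column per algorithm; it uses scipy's Friedman test and the Nemenyi-Friedman post hoc test from the scikit-posthocs package, which reports pairwise \(p\)-values rather than the z statistics given in Table 3.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

def compare_algorithms(rmse_matrix, names):
    """rmse_matrix : (n_trials, n_algorithms) testing RMSE values (50 trials here)."""
    cols = [rmse_matrix[:, j] for j in range(rmse_matrix.shape[1])]
    stat, p = friedmanchisquare(*cols)
    print(f"Friedman chi-square = {stat:.3f}, p-value = {p:.4f}")
    # Pairwise Nemenyi post hoc p-values; rows and columns follow `names`.
    posthoc = sp.posthoc_nemenyi_friedman(rmse_matrix)
    posthoc.columns = posthoc.index = names
    return posthoc
```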

Considering the results in Table 3, the following interpretations for SQRTL-ELM with cross-validation, universal and quantile-based penalties can be presented:

  • SQRTL-ELM is statistically superior to ELM for Auto MPG, Strikes and Sleep data sets.

  • SQRTL-ELM is statistically superior to RIDGE-ELM for Concrete and Yacht data sets.

  • SQRTL-ELM is statistically superior to LASSO-ELM and ENET-ELM for Auto MPG, Delta Ailerons, Concrete and Yacht data sets.

As a result, the SQRTL-ELM with the cross-validation, universal and quantile-based penalties can statistically outperform the ELM, RIDGE-ELM, LASSO-ELM and ENET-ELM in terms of testing performance for at least one data set. Considering Tables 2 and 3 together, the SQRTL-ELM, with at least one tuning parameter selection technique, generally provides a better mean testing RMSE than the compared regularized ELM methods RIDGE-ELM, LASSO-ELM and ENET-ELM.

Table 3 Post hoc results on testing RMSE for each data set

5 Conclusions

In this paper, we propose a novel regularized ELM method called SQRTL-ELM to improve on the basic ELM and its regularized extensions. The proposed method uses the pivotal recovery property of the square-root lasso to handle drawbacks of the basic ELM such as instability, poor generalizability and overfitting. The experimental studies on well-known benchmark data sets show that the SQRTL-ELM generally improves the testing RMSE of the basic ELM and, depending on the data set, outperforms its variants RIDGE-ELM, LASSO-ELM and ENET-ELM, at the cost of slightly larger standard deviations of the testing RMSE. The amount of improvement in testing RMSE and the sparsity level vary according to the tuning parameter technique applied. The training time of the SQRTL-ELM is higher than that of the basic ELM and is comparable to the training times of the regularized ELM methods, depending on the underlying data set. Also, the SQRTL-ELM with the universal and quantile-based penalties outperforms the SQRTL-ELM with cross-validation in terms of training time.

Consequently, as a novel regularized ELM algorithm, the SQRTL-ELM is applicable and useful for regression tasks in data-driven studies.