1 Introduction

Feed-forward neural networks (FNNs) have been seen as powerful tools in machine learning due to their adaptability to complex learning problems. However, difficulties arise in choosing parameters such as the learning rate, momentum, period, stopping criteria, input weights, biases, and so on in FNNs. Therefore, Huang et al. (2006) proposed a learning algorithm called the extreme learning machine (ELM), which overcomes slow learning speed and overfitting. The main idea behind ELM is to generate the main network parameters, namely the input weights and biases, randomly and to train a single layer feed-forward network (SLFN) by solving a classic linear system. This idea gives ELM extra speed as well as improvements in learning and generalization performance.

In recent years, ELM has been attracting considerable attention from researchers and is widely used in real-world applications. There are many studies published on ELM in different research areas to demonstrate the performance of ELM or to improve ELM for the applied area to obtain accurate results. Some of them are as follows: telecommunication for developing a robust and precise indoor positioning system (IPS) (Zou et al. 2016) and for the evaluation of intrusion detection mechanisms (Ahmad et al. 2018), neuroscience for concept drift learning (Mirza and Lin 2016), for discriminating preictal and interictal brain states in intracranial EEG (Song and Zhang 2016), for pathological brain detection (Lu et al. 2017), robotics for building an effective prediction model of input displacement of a gripper (Petković et al. 2016), for determining the inverse kinematics solutions of a robotic manipulator (Zhou et al. 2018), astronomy for contour detection using Cassini ISS images (Yang et al. 2020), for developing a prediction model for the ionospheric propagation factor M(3000)F2 (Bai et al. 2020), psychology for attention deficit hyperactivity disorder using functional brain MRI (Qureshi et al. 2017), geology for mapping mineral prospectivity (Chen and Wu 2017), education for designing an English classroom teaching quality evaluation model (Wang et al. 2017), biology for hepatocellular carcinoma nuclei grading (Li et al. 2017), chemistry for short-term wind speed prediction (Chen et al. 2019), mathematics for solving ordinary differential equations (Yang et al. 2018), physics for short-term photovoltaic power generation forecasting (Tang et al. 2016), economics for gold price prediction (Weng et al. 2020) and credit score classification (Kuppili et al. 2020), energy for prediction of photovoltaic power (Zhou et al. 2020) and for a resource optimization model (Han et al. 2021), environmental engineering for modeling qualitative and quantitative parameters of groundwater (Poursaeid et al. 2021), computer science for an intrusion detection system (Al-Yaseen et al. 2017), for dimension reduction (Kasun et al. 2016), for short-term load forecasting (Zeng et al. 2017), automation for traffic sign recognition (Huang et al. 2017) and for malware hunting (Jahromi et al. 2020). ELM is also used in healthcare for the classification of COVID-19 pneumonia infection from normal chest CT scans (Khan et al. 2021; Turkoglu 2021; Murugan and Goel 2021), for developing a cloud computing-based framework for breast cancer diagnosis (Lahoura et al. 2021), for brain tumor detection (Özyurt et al. 2020), engineering for predicting the compressive strength of concrete with partial replacements for cement (Shariati et al. 2020), for predicting the thermal conductivity of soil (Kardani et al. 2021) and for wheat yield prediction (Liu et al. 2022).

Although it has many applications in real life, ELM also has aspects that are inadequate or need improvement. Many studies have been carried out to eliminate these shortcomings; some of the improvement areas are as follows: the structure of the hidden layer output matrix (Huynh et al. 2008; Ding et al. 2017), heteroscedasticity (Deng et al. 2009), outliers (Chen et al. 2017; Deng et al. 2009; Xu et al. 2016), over-fitting (Chen et al. 2017; Deng et al. 2009; Zhang et al. 2017), feature selection (Miche et al. 2010, 2011; Martínez-Martínez et al. 2011; Fakhr et al. 2015; He et al. 2017) and multicollinearity (Miche et al. 2011; Martínez-Martínez et al. 2011; Toh 2008; Li and Niu 2013; Su et al. 2018; Nóbrega and Oliveira 2019; Yıldırım and Özkale 2019, 2020; Cancelliere et al. 2015). In this study, we focus on multicollinearity in ELM and seek to answer the question of how ELM can achieve better stability and generalization performance when multicollinearity is present.

Multicollinearity, defined as linear dependencies among the predictors in linear regression, has serious effects on the ordinary least squares (OLS) estimator, producing estimates with large variance that may lie far from the true parameters. Although classical linear models relying on OLS-type estimators, such as linear discriminant analysis or the linear regression model, have been widely used in practical applications such as dimension reduction on big data (Reddy et al. 2020) and engineering (Kaluri et al. 2021), these estimates are unstable and show worse generalization performance in the presence of multicollinearity. Biased estimation methods, also known as shrinkage estimation methods, have been proposed to overcome the negative effects of multicollinearity on the OLS estimator. One of the most well-established biased estimators for dealing with multicollinearity in linear regression is the ridge estimator (a.k.a. Tikhonov regularization, \(L_{2}\)-norm regularization) proposed by Hoerl and Kennard (1970). In ridge estimation, Hoerl and Kennard (1970) sought parameter estimates with a smaller mean square error by adding an \(L_{2}\)-norm penalty, multiplied by a constant, to the objective function of OLS. This constant, known as the tuning parameter, affects the performance of the ridge estimator. Although the ridge estimator is the best-known method for multicollinearity, its biggest problem is the lack of precision in tuning parameter selection. In addition, since the ridge estimator is not linear in this tuning parameter, the tuning parameter is difficult to choose. Therefore, the Liu estimator was proposed by Liu (1993) as an alternative to the ridge estimator; it also has a tuning parameter affecting the performance of the model. The Liu estimator can deal with multicollinearity and provides an easier and more solid way to select its tuning parameter. Subsequently, many biased estimators have been proposed. One of them is the two-parameter estimator proposed by Özkale and Kaçıranlar (2007), which was later called the OK estimator by Gruber (2012) and Özkale et al. (2022). In proposing this estimator, Özkale and Kaçıranlar (2007) took advantage of the idea that combining two estimators naturally yields a new estimator with the advantages of both.
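For reference, these three estimators have well-known closed forms; written in standard linear regression notation with design matrix \({\textbf{X}}\), response \({\textbf{y}}\) and OLS estimator \(\widehat{\varvec{\beta }}_{\textrm{OLS}}\) (a brief reminder rather than a derivation), they are

$$\begin{aligned} \widehat{\varvec{\beta }}_{\text {ridge}}(k)&=\left( {\textbf{X}}^{T}{\textbf{X}}+k{\textbf{I}}\right) ^{-1}{\textbf{X}}^{T}{\textbf{y}},\quad k>0, \\ \widehat{\varvec{\beta }}_{\text {Liu}}(d)&=\left( {\textbf{X}}^{T}{\textbf{X}}+{\textbf{I}}\right) ^{-1}\left( {\textbf{X}}^{T}{\textbf{y}}+d\,\widehat{\varvec{\beta }}_{\textrm{OLS}}\right) ,\quad 0<d<1, \\ \widehat{\varvec{\beta }}_{\text {OK}}(k,d)&=\left( {\textbf{X}}^{T}{\textbf{X}}+k{\textbf{I}}\right) ^{-1}\left( {\textbf{X}}^{T}{\textbf{y}}+kd\,\widehat{\varvec{\beta }}_{\textrm{OLS}}\right) ,\quad k>0,~0<d<1, \end{aligned}$$

so the OK estimator reduces to the ridge estimator when \(d=0\) and to the Liu estimator when \(k=1\).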

1.1 Problem and some of the existing solutions

As in linear regression, in the context of ELM, multicollinearity also arises between the columns of the hidden layer output matrix (i.e., nodes) and causes instability and poor generalization performance of the ELM (Li and Niu 2013). Studies have been and continue to be carried out to eliminate the negative effects of multicollinearity on ELM. Toh (2008) proposed a new approach based on ridge regression with a sigmoid activation function to obtain the minimum error for SLFNs in the classification field. Deng et al. (2009) developed a novel algorithm called regularized ELM to deal with heteroscedasticity, outliers and multicollinearity and to obtain better generalization performance. Miche et al. (2010, 2011) proposed the optimally pruned ELM (OP-ELM) and the Tikhonov regularized optimally pruned ELM (TROP-ELM), based on the \(L_{1}\)-norm for sparsity and on both \(L_{1}\) and \(L_{2}\) norms for both sparsity and stability, respectively. Martínez-Martínez et al. (2011) developed a unified solution via ridge (\(L_{2}\)-norm), elastic net and lasso (\(L_{1}\)-norm) methods. Li and Niu (2013) proposed ELM based on ridge and almost unbiased ridge estimators (RR-ELM and AUR-ELM) with an appropriate selection method for the ridge tuning parameter. Yu et al. (2013) proposed a new approach based on TROP-ELM and pairwise distance calculation to deal with the missing data problem in ELM. Shao et al. (2015) proposed an "automatic regularized ELM with leave-one-out cross-validation" based on ridge regression to investigate the randomness performance of ELM. Cao et al. (2016) presented a novel approach based on a stacked sparse denoising autoencoder with ridge regression to achieve more stable performance with comparable processing time in classification and regression applications. Luo et al. (2016) developed a unified framework of ELM using both the \(L_{1}\)-norm and \(L_{2}\)-norm for regression and multi-class classification problems. Yu et al. (2018) proposed a dual adaptive regularized online sequential extreme learning machine, called DA-ROS-ELM, to improve the performance of network intrusion detection. Wang and Li (2019) developed an ELM algorithm with regularization via an \(L_{0}\)-based broken adaptive ridge (BAR) penalty on a Cox-type model, with advantages such as avoiding some assumptions of classical survival models and achieving reasonable computation time. Yan et al. (2020) proposed a kernel ridge ELM algorithm using the artificial bee colony algorithm to determine the appropriate parameter for insurance fraud problems. Guo (2020) proposed a regularized ELM algorithm based on the elastic net to keep a balance between system stability and the solution's sparsity. Jiao et al. (2021) presented an optimized regularized extreme learning machine algorithm based on the conjugate gradient (called CG-RELM) for estimating the state of charge.

Although Ridge-ELM is frequently used in the literature, there is no general rule regarding the selection of the ridge tuning parameter. On the other hand, the selection of the ridge tuning parameter plays a critical role in the performance of ridge-type ELM algorithms, affecting both the training and testing performance and the speed of the algorithm. Also, there is no single selection method that consistently provides reasonable performance. Therefore, Yıldırım and Özkale (2019) proposed some alternative approaches for the selection of the ridge tuning parameter for RR-ELM (a.k.a. Ridge-ELM), based on the Akaike information criterion (AIC), Bayesian information criterion (BIC) and cross-validation (CV), and presented a comprehensive comparison. Furthermore, selecting the ridge tuning parameter is not easy because the ridge estimator is a nonlinear function of it. Therefore, Yıldırım and Özkale (2020) proposed a novel algorithm called L-ELM (a.k.a. Liu-ELM) as an alternative to the RR-ELM algorithm, which provided more stable and generalizable results than competitors such as ELM, RR-ELM, AUR-ELM and OP-ELM.

1.2 Contribution

Both the ridge and Liu estimators have their own characteristic properties for dealing with multicollinearity. Depending on the problem and data structure, these estimators can outperform each other. Although the ridge and Liu estimation methods have been adapted to ELM to alleviate the multicollinearity problem, estimator adaptations that outperform RR-ELM and Liu-ELM under multicollinearity can still be made. Building on the idea of using both estimators in a unified form, we consider a new method, an alternative to RR-ELM and Liu-ELM, that provides more insightful and better results in terms of learning capability, stability and generalization performance. Our novel method is based on the two-parameter (a.k.a. OK) estimator originally proposed by Özkale and Kaçıranlar (2007) in the linear regression field. The key features of the proposed algorithm are summarized as follows:

  • The proposed algorithm presents a unified form of RR-ELM and Liu-ELM, improving ELM and its variants RR-ELM and Liu-ELM by yielding more stable and generalizable results thanks to the joint effect of the k and d tuning parameters in the model.

  • The proposed algorithm gives a regularization method that can easily be adapted to any other model for dealing with multicollinearity and irrelevant features.

  • The proposed algorithm depends on two tuning parameters, one of which provides better generalization performance while the other provides better shrinkage.

  • The proposed algorithm can be easily integrated into any system and algorithm to provide solutions for both classification and regression studies in the context of ELM.

1.3 Organization

The rest of the paper is structured as follows. We present the review of related studies including the preliminary ELM and its variants in Sect. 2. In Sect. 3, the details of our proposed method are described. Experimental results and findings are given in Sect. 4. The conclusions are summarized in Sect. 5.

2 Review of related studies

The ELM, introduced by Huang et al. (2006) to make network training possible without tuning any parameters, was proposed as an alternative to gradient-descent based algorithms such as back-propagation for SLFNs. The idea was noteworthy because of its speed. The ELM algorithm searches for the output weights providing the minimum training error and the minimum weight norm after randomly assigning the network parameters, namely the input weights and biases. As a result of this random assignment, the output weights can be obtained by solving a classic linear system in the output layer. Using approaches such as least squares and the Moore–Penrose inverse at the solution stage gives ELM advantages such as faster learning, less need for human intervention, a lower chance of reaching local optima and mostly reasonable generalization performance. Table 1 summarizes the features/key findings and challenges of ELM in the context of regularized ELM. In this section, we summarize the preliminary ELM, RR-ELM and Liu-ELM.

2.1 The preliminary ELM

A classic SLFN can be expressed as

$$\begin{aligned} \sum \limits _{i=1}^{\theta }\varvec{\beta }_{i}f\left( \mathbf {\delta } _{i}.{\textbf{x}}_{j}+b_{i}\right) ={\textbf{t}}_{j},~j=1,2,\ldots ,N, \end{aligned}$$
(1)

where \(\left( {\textbf{x}}_{j}^{T},{\textbf{t}}_{j}^{T}\right) \), \(j=1,\ldots ,N\), are N distinct training patterns with \({\textbf{x}}_{j}\in R^{p}\) the input and \({\textbf{t}}_{j}\in R^{m}\) the m-dimensional network output, \(\mathbf {\delta }_{i}\) are the input weights, \(b_{i}\) are the biases, \(\theta \) is the number of hidden neurons, \(f\left( .\right) \) is the activation function and \(\varvec{ \beta }_{i}\) are the output weights (Huang et al. 2004, 2006). The basic SLFN structure is given in Fig. 1.

The matrix form of Eq. (1) can be written as:

$$\begin{aligned} {\textbf{H}}\varvec{\beta }={\textbf{T}} \end{aligned}$$
(2)

where

$$\begin{aligned} {\textbf{H}}=\left[ \begin{array}{ccc} f\left( \mathbf {\delta }_{1}.{\textbf{x}}_{1}+b_{1}\right) &{} \ldots &{} f\left( \mathbf {\delta }_{\theta }.{\textbf{x}}_{1}+b_{\theta }\right) \\ \vdots &{} \ldots &{} \vdots \\ f\left( \mathbf {\delta }_{1}.{\textbf{x}}_{N}+b_{1}\right) &{} \ldots &{} f\left( \mathbf {\delta }_{\theta }.{\textbf{x}}_{N}+b_{\theta }\right) \end{array} \right] _{N\times \theta } \end{aligned}$$

is the hidden layer output matrix, and \(\varvec{\beta }_{\left( \theta \times m\right) }=\left( \varvec{\beta } _{1},\ldots ,\varvec{\beta } _{\theta }\right) ^{T}\) and \( {\varvec{T}}_{\left( N\times m\right) }=\left( {\textbf{t}}_{1},\ldots ,{\textbf{t}}_{N}\right) ^{T}\) are the output weight and target output matrices, respectively. Here, m corresponds to the number of output layer neurons, which commonly equals the number of target variables and is fixed at 1 in most practical applications.

In order to get the solution of Eq. (2), the following objective function is minimized:

$$\begin{aligned} \left( {\textbf{H}}\varvec{\beta -}{\textbf{T}}\right) ^{T}\left( {\textbf{H}} \varvec{\beta -}{\textbf{T}}\right) . \end{aligned}$$
(3)

The minimizer of the objective function in Eq. (3) (i.e., the estimator of \(\varvec{\beta }\)) can be found analytically as

$$\begin{aligned} \widehat{\varvec{\beta }}_\textrm{ELM}={\textbf{H}}^{+}{\textbf{T}} \end{aligned}$$
(4)

where \({\textbf{H}}^{+}\) is the Moore–Penrose inverse of matrix \({\textbf{H}}\) (Huang et al. 2006). Some popular ways to calculate the Moore-Penrose inverse are the orthogonal projection method, iterative methods and singular value decomposition (Rao et al. 1971; Schott 2005). According to the orthogonal projection method, \({\textbf{H}}^{+}\) is calculated via \({\textbf{H}}^{T}\left( \textbf{HH}^{T}\right) ^{-1}\) if \({\textbf{H}}\) is full row rank, else \( {\textbf{H}}^{+}=\left( {\textbf{H}}^{T}{\textbf{H}}\right) ^{-1}{\textbf{H}}^{T}\) if \({\textbf{H}}\) is full column rank.
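As an illustration of this training procedure, a minimal R sketch is given below; the data objects X and T_mat, the function names and the number of hidden neurons are placeholders rather than the setup of the later experiments. It builds the hidden layer output matrix with a sigmoidal activation and computes the output weights via the Moore–Penrose inverse as in Eq. (4).

```r
# Minimal ELM sketch: X is an N x p input matrix, T_mat an N x m target matrix
library(MASS)   # provides ginv(), a Moore-Penrose inverse

elm_train <- function(X, T_mat, theta = 50) {
  p     <- ncol(X)
  N     <- nrow(X)
  delta <- matrix(runif(p * theta, -1, 1), nrow = p)   # random input weights
  b     <- runif(theta, -1, 1)                          # random biases
  # Hidden layer output matrix H (N x theta) with sigmoid activation
  H    <- 1 / (1 + exp(-(X %*% delta + matrix(b, N, theta, byrow = TRUE))))
  beta <- ginv(H) %*% T_mat                             # Eq. (4): beta = H^+ T
  list(delta = delta, b = b, beta = beta, H = H)
}

elm_predict <- function(model, Xnew) {
  H <- 1 / (1 + exp(-(Xnew %*% model$delta +
                        matrix(model$b, nrow(Xnew), length(model$b), byrow = TRUE))))
  H %*% model$beta
}
```

Returning H as part of the fitted model makes it straightforward to reuse the same hidden layer in the regularized variants discussed next.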

Table 1 Literature survey on regularized ELM

2.2 ELM based on ridge and Liu regression

Although the solution given by Eq. (4) is obtained quickly, it has drawbacks in situations such as multicollinearity, under which the stability and generalization performance may weaken. Ridge-based ELM, defined by Li and Niu (2013), minimizes the objective function

$$\begin{aligned} \left( {\textbf{H}}\varvec{\beta -}{\textbf{T}}\right) ^{T}\left( {\textbf{H}} \varvec{\beta -}{\textbf{T}}\right) +k\varvec{\beta }^{T} \varvec{\beta }. \end{aligned}$$
(5)

Using simple algebra, the closed-form solution of Eq. (5) is found as

$$\begin{aligned} \widehat{\varvec{\beta }}_{\text {RR-ELM}}^{(k)}=\left( {\textbf{H}}^{T}{\textbf{H}}+k{\textbf{I}}\right) ^{-1}{\textbf{H}}^{T}{\textbf{T}},~k\ge 0 \end{aligned}$$

where k is the ridge tuning parameter (Yıldırım and Özkale 2020).

By minimizing the objective function

$$\begin{aligned} \left( {\textbf{H}}\varvec{\beta -}{\textbf{T}}\right) ^{T}\left( {\textbf{H}} \varvec{\beta -}{\textbf{T}}\right) +\left( d\widehat{\varvec{\beta }} _\textrm{ELM}-\varvec{\beta }\right) ^{T}\left( d\widehat{\varvec{\beta }} _\textrm{ELM}-\varvec{\beta }\right) , \end{aligned}$$

Yıldırım and Özkale (2020) introduced Liu-ELM as

$$\begin{aligned} \widehat{\varvec{\beta }}_\mathrm{Liu-ELM}^{d}=\left( {\textbf{H}}^{T}{\textbf{H}} +I\right) ^{-1}\left( {\textbf{H}}^{T}{\textbf{T}}+ d\widehat{\varvec{\beta }} _\textrm{ELM}\right) ,~0<d<1 \end{aligned}$$

where \(0<d<1\) is called the Liu tuning parameter. The properties of Liu-ELM are studied by Yıldırım and Özkale (2020).
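Assuming the hidden layer output matrix H, the target matrix T_mat and the unregularized solution beta_elm from the sketch in Sect. 2.1, both closed forms can be computed directly. The following R fragment is only a sketch of these formulas, not the original implementation:

```r
# RR-ELM: (H'H + k I)^{-1} H'T
rr_elm <- function(H, T_mat, k) {
  solve(t(H) %*% H + k * diag(ncol(H)), t(H) %*% T_mat)
}

# Liu-ELM: (H'H + I)^{-1} (H'T + d * beta_elm), 0 < d < 1
liu_elm <- function(H, T_mat, d, beta_elm) {
  solve(t(H) %*% H + diag(ncol(H)), t(H) %*% T_mat + d * beta_elm)
}

beta_elm <- MASS::ginv(H) %*% T_mat            # ELM solution, Eq. (4)
beta_rr  <- rr_elm(H, T_mat, k = 0.1)          # example tuning values
beta_liu <- liu_elm(H, T_mat, d = 0.5, beta_elm)
```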

3 A new type of ELM based on a unified ridge and Liu idea

RR-ELM and Liu-ELM each have their own advantages in improving stability and generalization performance. Starting from the idea of combining both estimators, we propose a new estimator named OK-ELM. For this, we use the objective function

$$\begin{aligned} \left( {\textbf{H}}\varvec{\beta -}{\textbf{T}}\right) ^{T}\left( {\textbf{H}} \varvec{\beta -}{\textbf{T}}\right) +k\left( d\varvec{\hat{ \beta }}_\textrm{ELM}-\varvec{\beta }\right) ^{T}\left( d\varvec{\hat{\beta }} _\textrm{ELM}-\varvec{\beta }\right) \end{aligned}$$
(6)
Fig. 1 The basic ELM structure

which was originally the idea of Özkale and Kaçıranlar (2007) in linear regression; the resulting estimator was later called the OK estimator by Gruber (2012). The objective function in Eq. (6) looks for an estimator of \(\varvec{\beta }\) that minimizes \(\left( {\textbf{H}}\varvec{\beta -}{\textbf{T}}\right) ^{T}\left( {\textbf{H}}\varvec{\beta -}{\textbf{T}}\right) \) within an equivalence class of estimators of \(\varvec{\beta }\) that are at equal distance from \(d\varvec{\hat{\beta }}_\textrm{ELM}\); it is a general form of the Liu-ELM objective through the constant k, reducing to it when \(k=1\). By minimizing the objective function in Eq. (6), we get

$$\begin{aligned} \widehat{\varvec{\beta }}_\mathrm{OK-ELM}=\left( {\textbf{H}}^{T}{\textbf{H}}+k{\textbf{I}}\right) ^{-1}\left( {\textbf{H}}^{T}{\textbf{T}}+k d\varvec{\hat{\beta }} _\textrm{ELM}\right) \end{aligned}$$

where \(0<d<1\) and \(k>0\) are the tuning parameters. \(\widehat{\varvec{ \beta }}_\mathrm{OK-ELM}\) has some statistical properties:

  • The OK-ELM enjoys the computational advantages of ELM. To see this, we define the augmented matrices

    $$\begin{aligned} {\varvec{{\tilde{H}}}}=\genfrac(){0.0pt}0{{\textbf{H}}}{\sqrt{k}{\textbf{I}}_{\theta }},~ {\varvec{{\tilde{T}}}}=\genfrac(){0.0pt}0{{\textbf{T}}}{\sqrt{k} d\varvec{\hat{\beta }} _\textrm{ELM}} \end{aligned}$$

    where \({\textbf{I}}_{\theta }\) is the identity matrix of dimension \(\theta \). This implies that the OK-ELM is obtained by using prior information on \(\varvec{\beta }\) in the form of the linear stochastic restriction \(\sqrt{k}d \varvec{\hat{\beta }}_\textrm{ELM}=\sqrt{k}\varvec{\beta } +\varvec{\varepsilon } ^{*}\), where \( \varvec{\varepsilon } ^{*}\) is a random vector with mean 0 and the same variance–covariance matrix as that of the output weight vector. The optimal solution \(\widehat{\varvec{\beta }}_{\text {OK-ELM}}\) based on the augmented form of the linear system corresponds to the minimizer of the objective function

    $$\begin{aligned} ({\varvec{{\tilde{H}}}}\varvec{\beta -}{\varvec{{\tilde{T}}}})^{T}({\varvec{{\tilde{H}}}}\varvec{\beta -}{\varvec{{\tilde{T}}}}). \end{aligned}$$
  • It is a convex combination of RR-ELM and ELM (Özkale 2013; Gruber 2012):

    $$\begin{aligned} \widehat{\varvec{\beta }}_{\text {OK-ELM}}=d\varvec{\hat{\beta }}_\textrm{ELM}+\left( 1-d\right) \widehat{\varvec{\beta }}_{\text {RR-ELM}}^{(k)} \end{aligned}$$

This convex combination shows that the tuning parameter d controls the respective contributions of ELM and RR-ELM: as d approaches 1, ELM contributes more and RR-ELM less, whereas as d approaches 0, RR-ELM contributes more than ELM. Thus, d acts as the proportion of contribution between ELM and RR-ELM. As is common in experimental settings, we perform a grid search over the related parameters for OK-ELM; the main goal is to obtain the optimal parameter combination yielding the minimum testing error. The details of the computing algorithm for OK-ELM are explained in Fig. 2, and an illustrative sketch is given below.
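The following minimal R sketch of OK-ELM reuses H, T_mat, beta_elm and rr_elm from the previous sketches; the grids k_grid and d_grid and the train/test objects are illustrative placeholders. It shows the closed form, the convex-combination identity above and the grid search over (k, d):

```r
# OK-ELM: (H'H + k I)^{-1} (H'T + k d * beta_elm)
ok_elm <- function(H, T_mat, k, d, beta_elm) {
  solve(t(H) %*% H + k * diag(ncol(H)), t(H) %*% T_mat + k * d * beta_elm)
}

beta_ok  <- ok_elm(H, T_mat, k = 0.1, d = 0.5, beta_elm)
# Convex-combination form (holds when H has full column rank)
beta_ok2 <- 0.5 * beta_elm + (1 - 0.5) * rr_elm(H, T_mat, k = 0.1)
all.equal(c(beta_ok), c(beta_ok2))    # TRUE up to numerical tolerance

# Grid search for the (k, d) pair minimizing the testing RMSE (cf. Fig. 2);
# H_train, T_train, H_test, T_test, beta_elm_train, k_grid, d_grid are assumed
best <- list(rmse = Inf)
for (k in k_grid) {
  for (d in d_grid) {
    beta <- ok_elm(H_train, T_train, k, d, beta_elm_train)
    err  <- sqrt(mean((H_test %*% beta - T_test)^2))
    if (err < best$rmse) best <- list(k = k, d = d, rmse = err)
  }
}
```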

Fig. 2 The flowchart of the OK-ELM algorithm

4 Experimental procedure and results

In this section, a comparative analysis is presented to measure the performance of the proposed algorithm (OK-ELM) against its competitors, including ELM (Huang et al. 2006), RR-ELM (Li and Niu 2013) and Liu-ELM (Yıldırım and Özkale 2020), on twelve different regression benchmark data sets collected from the UCI repository (Asuncion and Newman 2007). The descriptions of these data sets are summarized in Table 2. The sigmoidal activation function \(f(\delta ,b,X)=1/\left( 1+e^{-\left( \delta X+b\right) }\right) \) is used for all data sets. The number of hidden layer neurons \(\left( \theta \right) \) is set to 50, 100 and 150 for all algorithms. The experiments have been conducted on the R software platform and all code related to the algorithms has been written from scratch. In order to eliminate the effect of data scale, each attribute of the data sets has been standardized to zero mean and unit variance using the formula:

$$\begin{aligned} {\textbf{x}}=\left( \frac{x_{i}-\overline{x}}{sd\left( {\textbf{x}}\right) } \right) . \end{aligned}$$
Table 2 The properties of data sets used in this study

To evaluate the generalization performance effectively, we used a fivefold CV approach. For each fold, forty trials were carried out, and the mean of each metric over all trials is reported together with its standard deviation. As the performance metric, we used the root mean square error (RMSE), which is defined as

$$\begin{aligned} \text {RMSE}=\sqrt{\frac{1}{N}\sum _{j=1}^{N}\left( {\textbf{o}}_{j}-{\textbf{t}} _{j}\right) ^{2}} \end{aligned}$$
(7)

where \(\left( {\textbf{o}}_{j}-{\textbf{t}} _{j}\right) \) corresponds to the error between the actual and the network output values. The values of the tuning parameters have significant effects on the performance of the algorithms based on the ridge and Liu estimators. In order to observe the effect of the tuning parameters, the selection process for the ridge and Liu tuning parameters has been carried out in the same way and over the same ranges for all data sets. The ridge tuning parameter \(\left( k\right) \) and the Liu tuning parameter \(\left( d\right) \) are, respectively, selected via CV within the following ranges:

$$\begin{aligned} k&\in \left[ 10^{-5},10^{-4},\ldots ,10^{-2},0.02,\ldots ,0.1,0.2,\ldots ,1,1.5,2,\ldots ,5\right] \\ d&\in \left[ 10^{-5},10^{-4},\ldots ,10^{-2},0.02,0.03,\ldots ,0.1,0.15,0.2,\ldots ,1\right] . \end{aligned}$$

In each fold, the k and d parameters minimizing the testing CV error are determined as optimal for the corresponding fold. This process is repeated for all trials, and the mean values of the d and k parameters over all folds are given in Table 3. Table 3 summarizes the performance of each algorithm in terms of RMSE and its standard deviation at the optimal tuning parameters for all data sets. Besides, the norm values with their standard deviations are presented in Table 4 to investigate the effect of the proposed algorithm in terms of shrinkage performance. Based on the results in Table 3, we also compute the reduction rate (RR) to obtain the performance percentage of OK-ELM over the other methods and give it in Figs. 3 and 4. The RR is calculated for both the testing RMSE and the standard deviation of the RMSE as

$$\begin{aligned} \text {RR}=\frac{\left( \text {any algorithm}\right) -\left( \text {OK-ELM}\right) }{\left( \text {any algorithm}\right) }\times 100 \end{aligned}$$
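The evaluation protocol can be summarized by the short R sketch below; it assumes the helpers elm_train and elm_predict from Sect. 2.1 and a univariate target y, and shows one pass of the fivefold CV together with the RMSE and RR calculations (the forty trials per fold reported in the tables would simply wrap this loop):

```r
X     <- scale(X)                                 # zero mean, unit variance per attribute
folds <- sample(rep(1:5, length.out = nrow(X)))   # fivefold CV assignment

rmse <- function(o, t) sqrt(mean((o - t)^2))      # Eq. (7)

cv_rmse <- numeric(5)
for (f in 1:5) {
  tr          <- folds != f
  model       <- elm_train(X[tr, , drop = FALSE], y[tr], theta = 50)
  pred        <- elm_predict(model, X[!tr, , drop = FALSE])
  cv_rmse[f]  <- rmse(pred, y[!tr])
}
c(mean = mean(cv_rmse), sd = sd(cv_rmse))

# Reduction rate (RR) of OK-ELM over any competing algorithm
reduction_rate <- function(competitor, ok_elm_value) {
  (competitor - ok_elm_value) / competitor * 100
}
```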
Table 3 The RMSE performance comparison of algorithms

We obtain the following conclusions:

  • What stands out in Table 3 and Fig. 3 is that there is a clear trend of decreasing testing RMSE for the proposed algorithm (OK-ELM) compared to ELM, RR-ELM and Liu-ELM regardless of the node number. Only for the Forest data set is OK-ELM worse than the others in terms of testing RMSE. As the node number increases, the RR of OK-ELM over ELM, RR-ELM and Liu-ELM increases, and its superiority over ELM is remarkable. It can therefore be concluded that the proposed algorithm is more generalizable than its competitors.

  • Table 3 and Fig. 4 show that the changes in the standard deviation of the testing RMSE vary depending on the node number. When the node number is 50, OK-ELM is the best except for the Auto Price, Body Fat, Machine CPU and Forest data sets, where RR-ELM is the best. No single biased ELM method is best when the number of nodes is 100 or 150 in terms of the standard deviation of the testing RMSE: depending on the data set, one biased ELM method may be better than another. There are data sets where RR-ELM or Liu-ELM is better than OK-ELM, whereas OK-ELM, RR-ELM and Liu-ELM are all consistently better than ELM for all data sets and node numbers in terms of the standard deviation of the testing RMSE.

  • Table 4 shows that OK-ELM outperforms ELM in the sense of having a smaller norm and a smaller standard deviation of the norm for all data sets and node numbers. The OK-ELM algorithm provides a smaller norm and a smaller standard deviation of the norm than all other algorithms for the Abalone, Auto Mpg, Fish and Yacht data sets regardless of the node number. For the remaining data sets, the performance of OK-ELM relative to RR-ELM or Liu-ELM depends on the node number in terms of the norm and its standard deviation. In all data sets except Auto Price and Servo, as the number of nodes increases, the norm value of OK-ELM becomes better than those of RR-ELM and Liu-ELM; that is, OK-ELM is better than RR-ELM and Liu-ELM when the number of nodes is large. The results validate that the OK-ELM algorithm gives a smaller norm of the coefficients (i.e., satisfactory shrinkage performance) than the ELM, RR-ELM and Liu-ELM algorithms, especially when the node number is large. Except for the Auto Price and Servo data, with 150 nodes OK-ELM outperforms RR-ELM and Liu-ELM in terms of the standard deviation of the norm in all other data sets.

  • Figures 5, 6 and 7 show the testing errors of all four algorithms for the Abalone, Servo and Strikes data sets, respectively. The errors have been retrieved from the testing results of a random cross-validation run at the optimal parameter values, which are approximately equal to the mean values given in Table 3. Figures 5, 6 and 7 also reflect the stability performance: the more stable the algorithm, the more homogeneous the spread of errors around zero. When the range of errors is examined in Figs. 5, 6 and 7, it is observed that RR-ELM and Liu-ELM have almost the same stability performance, while ELM is the worst and OK-ELM is the best. In terms of stability, the OK-ELM algorithm is more stable around zero than its competitors.

Table 4 The comparison of algorithms in terms of coefficient norm values
Fig. 3 Comparison of testing RMSE in terms of RR

Fig. 4 Comparison of standard deviation of testing RMSE in terms of RR

Fig. 5 Testing errors for Abalone data for 150 nodes

Fig. 6 Testing errors for Servo data for 100 nodes

Fig. 7 Testing errors for Strikes data for 50 nodes

As mentioned before, the tuning parameters of RR-ELM, Liu-ELM and OK-ELM have an effect on the performance of each algorithm. In this study, we carried out a comprehensive grid search in the experiments. To investigate the effect of each parameter, we give the performance change of RR-ELM, Liu-ELM and OK-ELM as a function of their own tuning parameters. Using an optimal value of each tuning parameter, the change in testing performance depending on the other tuning parameter is given in Figs. 8 and 9 for the Abalone and Fish data sets with 150 and 50 nodes, respectively. To investigate the performance change of OK-ELM and RR-ELM, the Liu tuning parameter is fixed at the optimal value, which is approximately equal to the mean value given in Table 3. A similar process is repeated for OK-ELM and Liu-ELM by fixing the ridge tuning parameter. It can be seen in Figs. 8 and 9 that the tuning parameters significantly affect the performance of the algorithms in the training process. OK-ELM can outperform both RR-ELM and Liu-ELM if the tuning parameters are properly tuned. For a particular data set, the breaking points can give useful insights for determining the optimal range of each tuning parameter.

Fig. 8 The change of tuning parameter for Abalone data

Fig. 9 The change of tuning parameter for Fish data

5 Conclusions

In this paper, we proposed a novel algorithm based on the combination of the ridge and Liu algorithms in order to deal with the multicollinearity problem in the context of ELM. The main advantage of the proposed algorithm is that it enjoys the properties of both the ridge and Liu algorithms and presents an alternative that is easily adaptable to other systems and algorithms for solving both regression and classification problems. Based on the experimental results, the proposed algorithm can outperform its competitors in terms of testing RMSE and stability when the (k, d) parameters are appropriately selected.

The newly proposed OK-ELM method has three main limitations:

  • Because it depends on two tuning parameters, it is computationally time-consuming

  • It cannot be used in high-dimensional ELM settings

  • It does not select nodes

As a future research direction, an effective estimation method for ELM that can perform variable selection may be proposed for high-dimensional data.

6 Future studies

The main shortcomings of this study are the lack of deterministic selection methods for the tuning parameters of the proposed method and its inapplicability to high-dimensional settings. In future work, we will focus on high-dimensional issues to provide more effective algorithms for the machine learning field, especially ELM.