Introduction

Neural networks are powerful models for data mining and information engineering that can learn feature-based classification models and nonlinear prediction models from data. Training neural networks (NNs) requires optimizing a highly non-convex landscape with many local minima and saddle points. Kernel-based alternatives such as the support vector machine (SVM) [7, 11, 17], on the other hand, lead to well-posed convex problems, which is one of the main reasons for their success over the last few decades. Kernel methods, however, fail to scale effectively to large datasets because they must compute pairwise kernel values over the complete dataset. The main benefit of a simple NN architecture is that it can reach an acceptable solution in one-hundredth (or even one-millionth) of the time taken by larger, more complex models, while still maintaining near-optimal performance [37]. A single hidden layer feed-forward NN (SLFN) can approximate a nonlinear function with arbitrary precision [26]. SLFNs have been widely applied to various classification problems [19, 22, 24, 33]. Moreover, one of the most popular categories of NN is the feed-forward NN with random weights, popularized by Pao et al. [32], who introduced the random vector functional link (RVFL) network [5, 8, 30]. In RVFL, the inputs can be connected directly to the outputs, leading to excellent generalization performance. The weights between the input and hidden layers can also be generated randomly [16, 45]. Zhang and Suganthan [45] extended this line of work by performing a comprehensive evaluation of RVFL networks. They tested RVFL with several popular activation functions and found that the hardlim and sign activation functions significantly degrade its performance. They also suggested that the bias used in RVFL can be treated as a tunable configuration for specific problems. Li et al. [25] proposed a novel SVM+ that uses the learning using privileged information (LUPI) paradigm [40]. They further suggested a kernelized version of that model for nonlinear data and, using the QP solver of MATLAB R2014b to solve the resulting QP problem, proposed MAT-SVM+. Inspired by the work of Li et al. [25], Zhang and Wang [48] embedded LUPI in the RVFL and proposed a novel RVFL+ model; they further applied the kernel trick and proposed kernel RVFL+ (KRVFL+). Both RVFL+ and KRVFL+ are analogous to teacher–student interaction [39] in the human learning process. Xu et al. [43] proposed a novel kernel-based RVFL model (K-RVFL) for learning spatiotemporal dynamic processes; owing to the expressiveness embedded in the kernel, K-RVFL can handle complex processes better.

Kernel ridge regression (KRR) [36] has gained the attention of researchers over the last few decades due to its non-iterative learning approach. It has been widely used to solve a variety of classification [23, 35] and regression [29, 42] problems. KRR is computationally fast since it adopts equality constraints rather than inequality constraints and therefore only solves a system of linear equations. Several variants of KRR have been proposed to improve its classification performance, for example, Zhang et al. [46], Chang et al. [6] and Zhang and Suganthan [47].
However, the extreme learning machine (ELM) [21, 27] has recently grown in popularity because of its high generalization performance and low computational cost [2, 14, 15, 18, 20, 38]. Peng et al. [34] proposed a novel discriminative graph regularized extreme learning machine (GELM) to improve the classification ability of ELM; due to its closed-form solution, the outputs of GELM can be obtained efficiently. In theory, conventional ELM does not guarantee convergence; the corrected convergence results have been shown and proved by Igelnik and Pao [22] and Wang and Wan [41]. The RVFL has been a very efficient and powerful model for classification and regression tasks. Despite its high computational efficiency and high generalization ability, it has been observed that, because of the randomly selected weights and hidden layer biases, RVFL requires many nodes to accomplish satisfactory performance. Recently, 1-norm regularization has gained tremendous popularity among researchers [13, 44] since it results in sparse outputs. A sparse output indicates that most of the elements in the output matrix are zero; hence, the decision surface can be obtained with fewer hidden nodes than in conventional models. Moreover, these sparse models are easily implementable [1]. An influential contribution in this direction is the 1-norm SVM developed by Mangasarian [28], whose solution is computed by treating its exterior penalty problem as an unconstrained convex minimization problem and solving it with the Newton–Armijo algorithm. Hence, inspired by the work of Mangasarian [28] and the recent significant literature on 1-norm regularization, this paper proposes a novel 1-norm random vector functional link (1N RVFL) network for binary classification. The key innovations of this work are:

  1. A novel 1N RVFL classifier is proposed using two different activation functions.

  2. Due to the incorporation of the 1-norm in the proposed model, sparse outputs are generated, which is a very useful property for classification problems.

  3. 1N RVFL produces a classifier that is based on a smaller number of input features; in other words, the method reduces the number of neurons required in the hidden layer.

  4. Experiments on real-world datasets are carried out to demonstrate the classification ability of 1N RVFL compared to other models.

The advantages and limitations of a few related classifiers are tabulated in Table 1.

Table 1 Pros and cons of a few related models

The remainder of the paper is structured as follows. Section “Mathematical background” gives a brief mathematical description of a few related models, viz., ELM, KRR, GLTRVFL and RVFL. Section “Proposed 1-norm random vector functional link (1N RVFL)” presents the formulation and description of the proposed 1N RVFL model. Section “Simulation and analysis of results” reports the numerical experiments and the comparative analysis with ELM, KRR, RVFL, K-RVFL and GLTRVFL. Section “Conclusion” concludes the paper.

Mathematical background

Consider \(x_{i} \in R^{n}\), for \(i = 1,2,3, \ldots ,m\), to be an \(n\)-dimensional input vector and let \(D\) be the \(m \times n\) matrix of training examples. \(y_{i} \in \{ + 1, - 1\}\), for \(i = 1,2,3, \ldots ,m\), is the class label of the \(i{\text{th}}\) example; hence \(y = (y_{1} , \ldots ,y_{m} )^{t}\) is the vector of class labels, and the corresponding diagonal matrix is \(Y = {\text{diag}}(y)\). Let \(A\) and \(B\) denote the matrices of samples with class labels \(+ 1\) and \(- 1\), of dimensions \(m_{1} \times n\) and \(m_{2} \times n\), respectively.

The ELM model

Let \(\beta = (\beta_{1} , \ldots ,\beta_{T} )^{t}\), where \(T = L + n\), be the weight vector (WV) to the output neuron, with \(L\) denoting the number of hidden layer nodes. Let \(h_{l} (x_{i} ) = G(a_{l} ,b,x_{i} )\), for \(l = 1, \ldots ,L\) and \(i = 1,2,3, \ldots ,m\), be the output of the activation function \(G(.,.,.)\) of the \(l{\text{th}}\) hidden layer neuron with respect to the \(i{\text{th}}\) training sample. Here \(a_{l} = (a_{l1} , \ldots ,a_{ln} )^{t}\) indicates the WV and \(b\) represents the bias of the hidden layer nodes.

The output equation for ELM [21] can be expressed as:

$$ y = H\beta , $$
(1)

The hidden layer output matrix can be formulated as \(H = \left[ {h_{1} (D)\; \ldots \;h_{L} (D)} \right]\), i.e.,

$$ H = \begin{bmatrix} h_{1} (x_{1} ) & \cdots & h_{L} (x_{1} ) \\ \vdots & \ddots & \vdots \\ h_{1} (x_{m} ) & \cdots & h_{L} (x_{m} ) \end{bmatrix}, $$

\(\beta\) represents the solution in the primal space, which can be computed as:

$$ \beta = H^{\dag } y, $$
(2)

where \( H^{\dag }\) is the Moore–Penrose inverse of \(H\). Now, the final classifier of ELM may be expressed as:

$$ f(x) = {\text{sign}}\left( {h(x)^{t} \beta } \right). $$
(3)
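For illustration, the following minimal NumPy sketch mirrors the ELM training rule in Eqs. (1)–(3): random hidden weights and biases, a pseudo-inverse solve for \(\beta\), and a sign decision. The sigmoid activation, the uniform weight range and the function names are illustrative assumptions, not the exact settings of the reported experiments (which were run in MATLAB).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(D, y, n_hidden, seed=0):
    """Basic ELM: random hidden-layer weights, pseudo-inverse output weights (Eq. (2))."""
    rng = np.random.default_rng(seed)
    n = D.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n, n_hidden))   # random input-to-hidden weights a_l (assumed range)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)        # random hidden biases
    H = sigmoid(D @ A + b)                           # hidden-layer output matrix H
    beta = np.linalg.pinv(H) @ y                     # beta = H^+ y
    return A, b, beta

def elm_predict(D, A, b, beta):
    return np.sign(sigmoid(D @ A + b) @ beta)        # Eq. (3)
```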

The KRR model

The primal problem of KRR [36] may be defined as:

$$ \begin{gathered} min\;\frac{C}{2}\left\| w \right\|^{2} + \frac{1}{2}\left\| \psi \right\|^{2} , \hfill \\ {\text{s.t.}},\;y - \varphi (x)^{t} w = \psi , \hfill \\ \end{gathered} $$
(4)

where w is the unknown weight vector and \(\psi\) is the slack variable; y is the output vector and \(\varphi (x)\) indicates the feature mapping of the input x.

The Lagrangian of (4) may be formulated as:

$$ min\;\frac{C}{2}\left\| w \right\|^{2} + \frac{1}{2}\left\| \psi \right\|^{2} - \ell^{t} \left\{ {y - \varphi (x)^{t} w - \psi } \right\}, $$
(5)

where \(\ell\) is the Lagrangian multiplier.

Now, setting the derivatives of Eq. (5) with respect to the primal variables to zero and applying the KKT conditions, the dual form may be obtained as:

$$ L = - \frac{1}{2C}\ell^{t} \varphi (x)\varphi (x)^{t} \ell - \frac{1}{2}\ell^{t} I\ell - y^{t} \ell , $$
(6)

where \(I\) is the identity matrix with the appropriate dimension.

For a new input example \(x \in \Re^{n}\), the KRR classifier may be generated as:

$$ f(x) = {\text{sign}}\left\{ { - \frac{1}{C}\varphi (x)^{t} \alpha } \right\}. $$
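Eliminating the multipliers from (4)–(6) leads to the familiar closed-form kernel ridge regression solution, which the sketch below implements with a Gaussian kernel; the particular kernel parameterization \(\exp (-\|x - z\|^{2} /2\mu^{2})\) and the function names are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(X1, X2, mu):
    """One common Gaussian kernel form: K(x, z) = exp(-||x - z||^2 / (2 mu^2))."""
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-sq / (2.0 * mu**2))

def krr_fit_predict(X_train, y_train, X_test, C, mu):
    """Closed-form KRR classifier: f(x) = sign(k(x)^t (K + C I)^{-1} y)."""
    K = gaussian_kernel(X_train, X_train, mu)
    alpha = np.linalg.solve(K + C * np.eye(len(y_train)), y_train)
    return np.sign(gaussian_kernel(X_test, X_train, mu) @ alpha)
```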

The RVFL model

RVFL [31] is a type of SLFN that randomly generates the weights to the hidden layer nodes and fixes them without tuning them iteratively.

The regularized version of RVFL can be expressed as

$$ \min \left\| {y - \Omega \beta } \right\|^{2} + C\left\| \beta \right\|^{2} , $$
(7)

where \(\Omega = [H\quad D]\). Now, by differentiating (7) with respect to \(\beta\) and further equating it to zero we obtain,

$$ \beta = \left( {\Omega^{t} \Omega + CI} \right)^{ - 1} \Omega^{t} y. $$
(8)

For any new instance \(x\), the classification function of RVFL can be generated as,

$$ f(x) = {\text{sign}}\;\;\left({[h(x)\quad x\;]\beta } \right) $$
(9)

where, \(h(x) = \left[ {h_{1} (x) \ldots h_{L} (x)} \right]\).
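A short NumPy sketch of the regularized RVFL in Eqs. (7)–(9) follows; the uniform range of the random weights and the default sine activation are illustrative assumptions rather than the settings of the reported experiments.

```python
import numpy as np

def rvfl_train(D, y, L, C, act=np.sin, seed=0):
    """Regularized RVFL (Eq. (8)): beta = (Omega^t Omega + C I)^{-1} Omega^t y."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    A = rng.uniform(-1.0, 1.0, size=(n, L))   # fixed random hidden weights (not tuned)
    b = rng.uniform(-1.0, 1.0, size=L)        # fixed random hidden biases
    Omega = np.hstack([act(D @ A + b), D])    # Omega = [H  D]: hidden outputs plus direct links
    beta = np.linalg.solve(Omega.T @ Omega + C * np.eye(L + n), Omega.T @ y)
    return A, b, beta

def rvfl_predict(D, A, b, beta, act=np.sin):
    return np.sign(np.hstack([act(D @ A + b), D]) @ beta)   # Eq. (9)
```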

The GLTRVFL model

Recently Borah and Gupta [3] proposed a generalized Lagrangian RVFL model called GLTRVFL. The primal problems of GLTRVFL are:

$$ \begin{gathered} \min \;\frac{1}{2}\left\| {\Omega_{1} \beta_{1} } \right\|_{2}^{2} + \frac{{C_{1} }}{2}\psi^{t} \psi , \hfill \\ {\text{s.t.}}\;\;\Omega_{2} \beta_{1} \ge e - \psi , \hfill \\ \end{gathered} $$
(10)

and

$$ \begin{gathered} \min \;\frac{1}{2}\left\| {\Omega_{2} \beta_{2} } \right\|_{2}^{2} + \frac{{C_{2} }}{2}\zeta^{t} \zeta , \hfill \\ {\text{s.t.}}\;\;\Omega_{1} \beta_{2} \ge e - \zeta . \hfill \\ \end{gathered} $$
(11)

The duals of (10) and (11) are then formed, and the Newton iterative technique is applied to them, giving the iterative schemes:

$$ \nabla_{1} (\alpha_{1}^{i} ) = \left( {\left( {\Omega_{2} \left( {\Omega_{1}^{t} \Omega_{1} } \right)^{ - 1} \Omega_{2}^{t} + \frac{I}{{C_{1} }}} \right)\alpha_{1}^{i} - e} \right) - \left( {\left( {\Omega_{2} \left( {\Omega_{1}^{t} \Omega_{1} } \right)^{ - 1} \Omega_{2}^{t} + \frac{I}{{C_{1} }}} \right)\alpha_{1}^{i} - \gamma_{1} \alpha_{1}^{i} e} \right) \\$$
(12)

and

$$ \nabla_{2} (\alpha_{2}^{i} ) = \left( {\left( {\Omega_{1} \left( {\Omega_{2}^{t} \Omega_{2} } \right)^{ - 1} \Omega_{1}^{t} + \frac{I}{{C_{2} }}} \right)\alpha_{2}^{i} - e} \right) - \left( {\left( {\Omega_{1} \left( {\Omega_{2}^{t} \Omega_{2} } \right)^{ - 1} \Omega_{1}^{t} + \frac{I}{{C_{2} }}} \right)\alpha_{2}^{i} - \gamma_{2} \alpha_{2}^{i} e} \right). $$
(13)

Proposed 1-norm random vector functional link (1N RVFL)

The 1-norm RVFL with absolute loss is proposed in this section as a regularized classification model that yields a robust representation. Moreover, motivated by the study of Mangasarian [28], the proposed 1N RVFL model is formulated by treating the dual exterior penalty problem as an unconstrained convex minimization problem and solving it with the Newton–Armijo algorithm. The proposed formulation leads to a simple and rapidly converging iterative solution for the binary classification problem.

Consider the regularized formulation of RVFL as:

$$ \mathop {\min }\limits_{{\beta \in \Re^{l} \;}} \;\left\| {y - \Omega \beta } \right\|_{1} + \;C\left\| \beta \right\|_{1} , $$
(14)

where \(C > 0\) is the trade-off parameter. Following the same procedure as Mangasarian [28] and Balasundaram and Gupta [1], Eq. (14) can be rewritten as a linear programming problem by setting:

$$ \begin{gathered} \beta = s - t\;{\text{and}}\;y - \Omega \beta = p - q, \hfill \\ {\text{where}}\;s,t \ge 0 \in \Re^{l} \;{\text{and}}\;p,q \ge 0 \in \Re^{m} . \hfill \\ \end{gathered} $$
(15)

Substituting (15) into (14), the linear programming form of RVFL in the primal can be written as:

$$ \begin{gathered} \mathop {\min }\limits_{s,t,p,q} \quad d_{l}^{t} (s + t) + Cd_{m}^{t} (p + q) \hfill \\ {\text{s.t.}}\;\;\Omega (s - t) - (p - q) = y,\;\;s,t,p,q \ge 0, \hfill \\ \end{gathered} $$
(16)

where \(d_{l}\) and \(d_{m}\) are vectors of ones of dimensions \(l\) and \(m\), respectively. The optimization problem (16) is easily solvable using the optimization toolbox in MATLAB.
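As an illustration of this direct route, the sketch below hands (16) to an off-the-shelf LP solver with the variables stacked as \([s;\,t;\,p;\,q]\); SciPy's `linprog` is used here purely as an example of such a toolbox, and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def l1_rvfl_via_lp(Omega, y, C):
    """Solve the linear program (16) directly; variables stacked as z = [s; t; p; q] >= 0."""
    m, l = Omega.shape
    c = np.concatenate([np.ones(l), np.ones(l), C * np.ones(m), C * np.ones(m)])
    A_eq = np.hstack([Omega, -Omega, -np.eye(m), np.eye(m)])   # Omega (s - t) - (p - q) = y
    res = linprog(c, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * l + 2 * m), method="highs")
    s, t = res.x[:l], res.x[l:2 * l]
    return s - t                                               # beta = s - t
```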

However, solving (16) directly increases the number of unknowns and constraints and hence the problem size. It is therefore recommended to solve instead its dual exterior penalty problem, an unconstrained minimization problem in \(m\) variables whose solution can be obtained by the Newton–Armijo technique.

Proposition 1: ([1, 28]) Consider the primal linear programming problem:

$$ \begin{gathered} \mathop {\min }\limits_{{(x,y) \in R^{n + l} }} \quad e^{t} x + f^{t} y \hfill \\ {\text{s.t.}}\;\;Px + Qy \ge b,\;Sx + Ny = h,\;x \ge 0, \hfill \\ \end{gathered} $$
(17)

is solvable, where \(e \in R^{n}\), \(f \in R^{l}\), \(P \in R^{m \times n}\), \(Q \in R^{m \times l}\), \(b \in R^{m}\), \(S \in R^{k \times n}\), \(N \in R^{k \times l}\) and \(h \in R^{k}\).

Therefore, the dual penalty optimization problem may be defined as:

$$ \mathop {\min }\limits_{{w,v \in R^{m + k} }} \quad \phi \left( { - b^{t} w - h^{t} v} \right) + \frac{1}{2}\left( {\left\| {\left( {P^{t} w + S^{t} v - e} \right)_{ + } } \right\|_{2}^{2} + \left\| {Q^{t} w + N^{t} v - f} \right\|_{2}^{2} + \left\| {\left( { - w} \right)_{ + } } \right\|_{2}^{2} } \right). $$
(18)

Equation (18) is solvable for all \(\phi > 0\). Furthermore, there exists \(\overline{\phi } > 0\) such that, for every \(\phi \in (0,\overline{\phi }]\), a solution \((w,v)\) of (18) leads to the primal solution:

$$ x = \frac{1}{\phi }\left( {P^{t} w + S^{t} v - e} \right)_{ + } ,\quad y = \frac{1}{\phi }\left( {Q^{t} w + N^{t} v - f} \right). $$
(19)

Now, following Proposition 1, the dual exterior penalty problem [1] of (16) may be obtained as:

$$ \mathop {\min }\limits_{{w \in R^{m} }} \quad - \phi y^{t} w + \frac{1}{2}\left( {\left\| {\left( {\Omega^{t} w - d_{l} } \right)_{ + } } \right\|_{2}^{2} + \left\| {\left( { - \Omega^{t} w - d_{l} } \right)_{ + } } \right\|_{2}^{2} + \left\| {\left( { - w - Cd_{m} } \right)_{ + } } \right\|_{2}^{2} + \left\| {\left( {w - Cd_{m} } \right)_{ + } } \right\|_{2}^{2} } \right), $$
(20)

where \(\phi\) is the penalty parameter, and (20) is solvable for any \(\phi > 0\). Additionally, there exists \(\overline{\phi } > 0\) such that, for any \(\phi \in (0,\overline{\phi }]\), the solution \(w\) of (20) yields:

$$ \begin{aligned} s & = \frac{1}{\phi }\left( {\Omega^{t} w - d_{l} } \right)_{ + } ,\;\;\;\;t = \frac{1}{\phi }\left( { - \Omega^{t} w - d_{l} } \right)_{ + } \\ p & = \frac{1}{\phi }\left( { - w - Cd_{m} } \right)_{ + } \;{\text{and}}\;q = \frac{1}{\phi }\left( {w - Cd_{m} } \right)_{ + } . \\ \end{aligned} $$
(21)
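The following small sketch shows how the primal variables, and hence the sparse output weights \(\beta = s - t\), could be recovered from a solution \(w\) of (20) using Eq. (21); the helper names are illustrative.

```python
import numpy as np

def plus(v):
    """Plus function (v)_+ = max(v, 0), applied componentwise."""
    return np.maximum(v, 0.0)

def recover_primal(w, Omega, C, phi):
    """Recover (s, t, p, q) from a solution w of (20) via Eq. (21); beta = s - t."""
    d_l = np.ones(Omega.shape[1])
    d_m = np.ones(Omega.shape[0])
    s = plus(Omega.T @ w - d_l) / phi
    t = plus(-Omega.T @ w - d_l) / phi
    p = plus(-w - C * d_m) / phi
    q = plus(w - C * d_m) / phi
    return s - t, p, q
```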

The unconstrained minimization problem of (20) can be solved by Newton-Armijo iterative technique.


Here \(\nabla L(w)\) is the gradient of \(L( \cdot )\), expressed as:

$$ \nabla L(w) = - \phi y + \Omega \left( {\Omega^{t} w - d_{l} } \right)_{ + } - \Omega \left( { - \Omega^{t} w - d_{l} } \right)_{ + } - \left( { - w - Cd_{m} } \right)_{ + } + \left( {w - Cd_{m} } \right)_{ + } , $$
(22)

which is not differentiable; hence, the ordinary second-order derivative of \(L( \cdot )\) does not exist. However, its “generalized Hessian” can be formed for \(w \in R^{m}\) as:

$$ \begin{aligned} \nabla^{2} L(w) & = \Omega \,{\text{diag}}\left( {\left( {\Omega^{t} w - d_{l} } \right)_{*} + \left( { - \Omega^{t} w - d_{l} } \right)_{*} } \right)\Omega^{t} + {\text{diag}}\left( {\left( { - w - Cd_{m} } \right)_{*} + \left( {w - Cd_{m} } \right)_{*} } \right) \\ & = \Omega \,{\text{diag}}\left( {\left( {|\Omega^{t} w| - d_{l} } \right)_{*} } \right)\Omega^{t} + {\text{diag}}\left( {\left( {|w| - Cd_{m} } \right)_{*} } \right), \\ \end{aligned} $$
(23)

where Eq. (23) uses the identity:

$$ (\alpha - 1)_{*} + ( - \alpha - 1)_{*} = (\left| \alpha \right| - 1)_{*} , $$
(24)

where \({\text{diag}}( \cdot )\) denotes a diagonal matrix and \(( \cdot )_{*}\) denotes the componentwise step function, i.e., the generalized derivative of the plus function \(( \cdot )_{ + }\). The “generalized Hessian” is useful when solving unconstrained smooth optimization problems and leads to a unique solution via the Newton iteration:

$$ \left( {\nabla^{2} L\left( {w^{i} } \right)} \right)\left( {w^{i + 1} - w^{i} } \right) = - \nabla L\left( {w^{i} } \right). $$
(25)

However, \(\nabla^{2} L(w)\) is positive semi-definite and might get ill-conditioned.

Remark 1

To avoid ill-conditioning in (25), a very small positive number \(\tau > 0\) multiplied by an identity matrix \(I\) of appropriate dimension is added to the generalized Hessian; that is, \(\nabla^{2} L + \tau I\) is used.

In this work, the optimization problem (20) is solved using the Newton method without the Armijo step for simplicity. This means that \(w^{i + 1}\) at the \((i + 1){\text{th}}\) iteration is obtained by solving,

$$ \left( {\nabla^{2} L\left( {w^{i} } \right) + \tau I} \right)\left( {w^{i + 1} - w^{i} } \right) = - \nabla L\left( {w^{i} } \right), $$
(26)

where \(i = 0,\;1\;, \ldots\)

$$ \begin{aligned} & \left( {\Omega \,{\text{diag}}\left( {\left( {|\Omega^{t} w^{i} | - d_{l} } \right)_{*} } \right)\Omega^{t} + {\text{diag}}\left( {\left( {|w^{i} | - Cd_{m} } \right)_{*} } \right) + \tau I} \right)\left( {w^{i + 1} - w^{i} } \right) \\ & \quad = - \left( { - \phi y + \Omega \left( {\Omega^{t} w^{i} - d_{l} } \right)_{ + } - \Omega \left( { - \Omega^{t} w^{i} - d_{l} } \right)_{ + } - \left( { - w^{i} - Cd_{m} } \right)_{ + } + \left( {w^{i} - Cd_{m} } \right)_{ + } } \right). \\ \end{aligned} $$
(27)

We can determine the value of \(w\) by solving the iterative scheme in (27).
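A compact sketch of the scheme (26)–(27), using the gradient (22) and the generalized Hessian (23) with the regularizer \(\tau I\) of Remark 1, is given below; the starting point \(w^{0} = 0\), the parameter values and the stopping rule are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np

def plus(v):
    return np.maximum(v, 0.0)       # (v)_+

def step(v):
    """(v)_*: componentwise step function, the generalized derivative of (v)_+."""
    return (v > 0).astype(float)

def newton_1n_rvfl(Omega, y, C, phi=1e-2, tau=1e-4, max_iter=50, tol=1e-6):
    """Sketch of the Newton scheme (26)-(27) for the exterior penalty problem (20)."""
    m, l = Omega.shape
    d_l, d_m = np.ones(l), np.ones(m)
    w = np.zeros(m)                                              # assumed starting point
    for _ in range(max_iter):
        grad = (-phi * y
                + Omega @ plus(Omega.T @ w - d_l)
                - Omega @ plus(-Omega.T @ w - d_l)
                - plus(-w - C * d_m)
                + plus(w - C * d_m))                             # Eq. (22)
        hess = (Omega @ np.diag(step(np.abs(Omega.T @ w) - d_l)) @ Omega.T
                + np.diag(step(np.abs(w) - C * d_m)))            # Eq. (23)
        delta = np.linalg.solve(hess + tau * np.eye(m), -grad)   # Eq. (26)
        w = w + delta
        if np.linalg.norm(delta) < tol:
            break
    return w
```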


Simulation and analysis of results

This section investigates the performance of the 1N RVFL model in comparison with ELM, KRR, RVFL, K-RVFL and GLTRVFL on classification problems over several real-world benchmark datasets. All simulations are performed in the MATLAB 2008b environment on a desktop computer with 4 GB of RAM, a 64-bit Windows 7 OS and an Intel i5 processor running at 3.20 GHz. No external optimization toolbox was required to solve the optimization problems of the reported models.

Zhang and Suganthan [45] suggested that the hardlim and sign activation functions generally degrade the overall performance of the RVFL algorithm. To select the best activation function, we performed experiments on a few real-world datasets using different activation functions for the proposed 1N RVFL, viz., hardlim, multiquadric, radial basis function (RBF), triangular basis (tribas), sigmoid, sine and ReLU. The average ranks are shown in Table 2; based on these, the best and second-best activation functions, i.e., ReLU and sine, were selected and used in the experiments for ELM, RVFL, GLTRVFL and the proposed 1N RVFL. Let us consider \(x\) as an input. The two activation functions are defined as:

  (a) ReLU: \(\phi (x) = \max (0,x)\)

  (b) Sine: \(\phi (x) = \sin (x)\)

Table 2 Average rank using different activation functions for 1N RVFL (best average rank is bolded)

where \(\phi (x)\) represents the output function for the input sample \(x\).
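For completeness, the two activation functions, written so that they can be plugged into the RVFL sketch given earlier, are simply:

```python
import numpy as np

def relu(x):
    """ReLU activation: phi(x) = max(0, x)."""
    return np.maximum(0.0, x)

def sine(x):
    """Sine activation: phi(x) = sin(x)."""
    return np.sin(x)
```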

The tests were performed using the tenfold cross-validation technique: the sample set is split into 10 equal subsamples, each subsample is used once for testing while the remaining subsamples are used for training, and the procedure is repeated 10 times so that every subsample is used for testing once [4]. For computational convenience, the input data are split into two parts, where 30% of the data are used for training and the remaining 70% for testing. To validate the efficacy of the proposed 1-norm RVFL model, its performance was compared with the ELM, KRR, RVFL, K-RVFL and GLTRVFL models on several real-world benchmark datasets.

Since a large value of \(L\) increases the computational time of ELM [1], the optimum value of the parameter \(L\) for ELM, KRR, RVFL and GLTRVFL is selected from the set \(\{ 20,\,50,\,100,\,200,\,500,\,1000\}\). For KRR, RVFL, K-RVFL and GLTRVFL, the optimum value of \(C\) is obtained from \(\{ 10^{ - 5} , \ldots ,10^{5} \}\). The Gaussian kernel is used while implementing KRR, with the kernel parameter \(\mu\) chosen from \(\{ 2^{ - 5} , \ldots ,2^{5} \}\). In the proposed 1N RVFL, the optimum values of the two parameters \(C\) and \(L\) are chosen from \(\{ 10^{ - 5} , \ldots ,10^{5} \}\) and \(\{ 20,\,50,\,100,\,200,\,500,\,1000,\,2000\}\), respectively. The statistics of the datasets considered in the experiments are tabulated in Table 3, where \(S\) indicates the number of samples and \(N\) the total number of attributes.

Table 3 Information of the datasets

All the experimental datasets are collected from the UCI machine learning repository [12]. The numerical experiments on the various datasets are performed after normalization of the data. The raw data are normalized as \(\overline{r}_{ij} = \frac{{r_{ij} - r_{j}^{\min } }}{{r_{j}^{\max } - r_{j}^{\min } }}\), where \(r_{j}^{\max } = \mathop {\max }\nolimits_{i = 1, \ldots ,m} (r_{ij} )\) and \(r_{j}^{\min } = \mathop {\min }\nolimits_{i = 1, \ldots ,m} (r_{ij} )\) denote the maximum and minimum values, respectively, of the \(j{\text{th}}\) attribute over all input samples, and \(\overline{r}_{ij}\) represents the normalized value of \(r_{ij}\).
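The normalization step can be written compactly as below (a sketch assuming no attribute is constant over the data):

```python
import numpy as np

def min_max_normalize(R):
    """Column-wise min-max scaling to [0, 1], matching r_bar_ij = (r_ij - r_j_min) / (r_j_max - r_j_min)."""
    r_min, r_max = R.min(axis=0), R.max(axis=0)
    return (R - r_min) / (r_max - r_min)
```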

For all the datasets, the optimum parameters are obtained using the tenfold cross-validation method. The total numbers of training and testing samples, the optimum parameter values and the classification accuracies of the models are shown in Table 4. Comparable or better performance indicates the efficacy and applicability of the proposed model. Additionally, the ranks based on classification accuracy on each dataset are exhibited in Table 5 for each reported classifier.

Table 4 Classification accuracies obtained by the classifiers on the real-world dataset with optimum parameters
Table 5 Average ranks of ELM, KRR, RVFL, K-RVFL, GLTRVFL and 1N RVFL based on the accuracy

Friedman test with Nemenyi statistics for classifier comparison

To compare the performance of the reported algorithms with the proposed algorithm, we perform the non-parametric Friedman test [9] on the average ranks of the models tabulated in Table 5. The lowest average rank of the proposed 1N RVFL sine reflects the efficiency of the model. The Friedman statistic under the null hypothesis may be obtained as:

$$ \begin{aligned} \chi_{F}^{2} & = \frac{12 \times 23}{{10 \times (10 + 1)}}\left[ \left( 7.6522^{2} + \;\,5.8478^{2} + 5.1957^{2} \right. \right.\\ &+ 7.1522^{2} \; + 5.4348^{2} + \;4.543^{2} \\ & + \left. {5.5652^{2} + 5.3478^{2} + 5.7391^{2} + 2.5217^{2} } \right) \\ & - \left. {\frac{{10 \times (10 + 1)^{2} }}{4}} \right] = 43.7819, \\ F_{F} & = \frac{(23 - 1) \times 43.7819}{{23(10 - 1) - \;43.7819}}\; = \;5.9013. \\ \end{aligned} $$

\(F_{F}\) is distributed according to the \(F\)-distribution with \(((10 - 1),(10 - 1) \times (23 - 1)) = (9,\,198)\) degrees of freedom. For the level of significance \(\alpha = 0.10\), the critical value (CV) of \(F(9,\;198)\) is \({1}{\text{.927}}\), which is less than \(F_{F}\). Hence, the null hypothesis can be rejected. Now, the Nemenyi post hoc test is performed to compare the methods pairwise. Here, the critical difference (CD) is:

$$ {\text{CD}} = 2.92 \times \sqrt {\frac{10 \times (10 + 1)}{{6 \times 23}}} = 2.607,\quad {\text{where}}\;q_{\alpha } = 2.92\;{\text{for}}\;\alpha = 0.10. $$
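The statistic and the critical difference above can be reproduced from the average ranks in Table 5 with a few lines of code; the sketch below assumes \(q_{\alpha } = 2.92\) for ten classifiers at \(\alpha = 0.10\), as used in the text.

```python
import numpy as np

def friedman_nemenyi(avg_ranks, n_datasets, q_alpha=2.92):
    """Friedman chi-square statistic, its F approximation and the Nemenyi critical difference."""
    k = len(avg_ranks)                       # number of classifiers
    N = n_datasets                           # number of datasets
    chi2 = (12 * N / (k * (k + 1))) * (np.sum(np.square(avg_ranks)) - k * (k + 1) ** 2 / 4)
    f_f = (N - 1) * chi2 / (N * (k - 1) - chi2)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
    return chi2, f_f, cd

# Average ranks of the 10 classifiers over the 23 datasets (Table 5)
ranks = [7.6522, 5.8478, 5.1957, 7.1522, 5.4348, 4.543, 5.5652, 5.3478, 5.7391, 2.5217]
print(friedman_nemenyi(ranks, 23))           # approx. (43.78, 5.90, 2.607)
```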

Figure 1 shows the statistical comparison of classifiers against each other based on the Nemenyi test. Groups of classifiers that are not significantly different (at \(\alpha = 0.10\)) are connected. One can notice from the figure that the 1N RVFL sine model is significantly better than ELM ReLU, ELM sine, KRR, RVFL ReLU, GLTRVFL ReLU, GLTRVFL sine and 1N RVFL ReLU models. However, despite showing a better average rank than K-RVFL, 1N RVFL sine is not significantly different from K-RVFL.

Fig. 1 Statistical comparison of classifiers against each other based on the Nemenyi test

The training and testing times (in seconds) of the reported models are shown in Tables 6 and 7, respectively. It can be observed that 1N RVFL is computationally less efficient than RVFL despite showing better generalization performance.

Table 6 Training time (in seconds) for the reported models
Table 7 Testing time (in seconds) for the reported models

Win/Tie/Loss test

The win/tie/loss statistical analysis [10] is used to further validate the efficacy of the best proposed model, i.e., 1N RVFL sine. The outcomes are exhibited in the last row of Table 4. For example, the second column shows the comparison between 1N RVFL sine and ELM ReLU: 1N RVFL sine wins in 20 cases, ties in none and loses in 3 cases. Similarly, the third column shows the comparison between 1N RVFL sine and ELM sine: 1N RVFL sine wins in 18 cases, ties in none and loses in 5 cases. Similar conclusions can be drawn from the other columns. As can be observed from the last row of Table 4, the proposed method attains the best classification accuracy in most situations, indicating that 1N RVFL sine outperforms the other algorithms.

The parameter insensitivity plots of the proposed models are presented in Figs. 2, 3 and 4 for the Habarman, Vehicle1 and Yeast3 datasets. One can notice from Figs. 2, 3 and 4 that the proposed models are not very sensitive to the user-defined parameters \(C\) and \(L\).

Fig. 2 Parameter insensitivity performance on user-specified parameters (\(C\), \(L\)) of 1N RVFL ReLU and sine models for the Habarman dataset

Fig. 3 Parameter insensitivity performance of 1N RVFL ReLU and sine models on user-specified parameters (\(C\), \(L\)) for the Vehicle1 dataset

Fig. 4 Parameter insensitivity performance of 1N RVFL ReLU and sine models on user-specified parameters (\(C\), \(L\)) for the Yeast3 dataset

Moreover, to reveal the sparseness of the proposed model, the average number of “actually” contributing nodes is portrayed in Fig. 5. A lower number of non-zero components indicates that the model is sparse. The Ecoli2, New thyroid1 and Habarman datasets are used to check whether the proposed 1N RVFL solution method results in the least number of hidden nodes when determining the decision function. Hence, the degree of sparseness is determined for each pair of (\(C\), \(L\)). It can be observed from Fig. 5 that 1N RVFL always leads to sparse solutions.

Fig. 5 Number of actually contributing nodes with user-defined parameters for 1N RVFL with ReLU and sine additive nodes for (a) Ecoli2, (b) New thyroid1 and (c) Habarman datasets

Conclusion

In this work, a novel 1N RVFL has been proposed for solving binary classification problems. The solutions are obtained by solving the dual exterior penalty problem as an unconstrained minimization problem using the Newton–Armijo technique. The basic advantage of the proposed 1N RVFL is that it produces many coefficients with zero values, which results in a sparse output. The 1-norm RVFL is also a robust model, since the absolute loss grows only linearly with the error, so extreme values (outliers) do not inflate the cost as severely as a squared loss would. Extensive experiments on several classification datasets using the reported models show that the proposed 1-norm RVFL performs better than ELM, KRR, RVFL, K-RVFL and GLTRVFL. The good generalization ability of the proposed model implies its usability and efficiency. 1N RVFL is useful for classification problems with very high-dimensional input spaces. However, the major limitation of 1N RVFL is that it is computationally less efficient than RVFL. Future work will extend this model to multiclass classification problems using the one-versus-rest or one-versus-one procedure, which can be fruitful for various real-life classification problems such as face classification, character recognition, plant species recognition and others.