1-Norm random vector functional link networks for classification problems

This paper presents a novel random vector functional link (RVFL) formulation, called the 1-norm RVFL (1N RVFL) network, for solving binary classification problems. The solution to the optimization problem of 1N RVFL is obtained by solving its exterior dual penalty problem using a Newton technique. The 1-norm makes the model robust and delivers sparse outputs, which is the fundamental advantage of this model. A sparse output means that most of the elements in the output matrix are zero; hence, the decision function can be obtained with fewer hidden nodes than the conventional RVFL model requires. 1N RVFL produces a classifier that is based on a smaller number of input features; to put it another way, this method suppresses neurons in the hidden layer. Statistical analyses have been carried out on several real-world benchmark datasets. The proposed 1N RVFL is evaluated with two activation functions, viz., ReLU and sine. The classification accuracies of 1N RVFL are compared with those of the extreme learning machine (ELM), kernel ridge regression (KRR), RVFL, kernel RVFL (K-RVFL) and generalized Lagrangian twin RVFL (GLTRVFL) networks. The experimental results, with comparable or better accuracy, indicate the effectiveness and usability of 1N RVFL for solving binary classification problems.


Introduction
Neural networks are powerful models for data mining and information engineering which can learn from data to construct feature-based classification models and nonlinear prediction models. Training neural networks (NNs) requires the optimization of a highly non-convex landscape with several local minima and saddle points. Alternative kernel-based methods such as the support vector machine (SVM) [7,11,17], on the other hand, produce well-posed convex problems, which is one of the main reasons for their success over the last few decades. Kernel methods, however, fail to scale effectively to large datasets because one of their main tasks is to compute pairwise kernel values over the complete dataset. The main benefit of a simple NN architecture is that it enables an acceptable level of solution to be achieved in one-hundredth (or even one-millionth) of the time taken by larger, more complex models while maintaining high optimality [37]. A single hidden layer feed-forward NN (SLFN) can handle a function that contains non-linearity with arbitrary precision [26]. SLFNs have been widely implemented in various problems associated with classification [19,22,24,33]. Moreover, one of the most popular categories of NN is the feed-forward NN with random weights, which was popularized by Pao et al.
[32] in their work. They introduced the novel random vector functional link (RVFL) networks [5,8,30]. In RVFL, it is possible to connect inputs and outputs directly, leading to excellent generalization performance. The weights between the input and the hidden layers can also be generated randomly [16,45]. Zhang and Suganthan [45] further extended the study by performing a comprehensive evaluation of RVFL networks. They ran RVFL with several popular activation functions and discovered that the hardlim and sign activation functions significantly degrade the performance of the algorithm. They also suggested that the bias used in RVFL can be a tunable configuration for specific problems. Li et al. [25] proposed a novel SVM+ which uses the learning using privileged information (LUPI) [40] procedure. They further suggested a kernelized version of that model for non-linear data processing. They also used the QP solver from MATLAB R2014b as a starting point to solve the QP problem and proposed MAT-SVM+. Inspired by the work of Li et al. [25], Zhang and Wang [48] embedded LUPI in the RVFL and proposed a novel RVFL+ model. They further used the kernel trick in the model and proposed a kernel RVFL+ (KRVFL+). Both RVFL+ and KRVFL+ are analogous to teacher-student interaction [39] in the human learning process. Xu et al.
[43] proposed a novel kernel-based RVFL model (K-RVFL) for learning the spatiotemporal dynamic process. Because of the complexity embedded in the kernel, K-RVFL can handle complex processes better. Kernel ridge regression (KRR) [36] has gained the attention of researchers over the last few decades due to its non-iterative learning approach. It has been widely used to solve a variety of classification [23,35] and regression [29,42] problems. KRR is computationally fast since it adopts equality constraints rather than inequality constraints and solves a system of linear equations. Several variants of KRR have been proposed to improve its classification performance, for example, by Zhang et al. [46], Chang et al. [6], Zhang and Suganthan [47] and others. However, the recently growing popularity of the extreme learning machine (ELM) [21,27] stems from its high generalization performance at low computational cost [2,14,15,18,20,38]. Peng et al. [34] proposed a novel discriminative graph regularized extreme learning machine (GELM) model to improve the classification ability of ELM. Due to its closed-form solution, the outputs of GELM can be obtained efficiently. In theory, the conventional ELM does not guarantee convergence; the correct convergence results have been shown and proved in Igelnik and Pao [22] and Wang and Wan [41]. The RVFL has been a very efficient and powerful model for tasks related to classification and regression. Despite the high computational efficiency and high generalization ability of RVFL, it was observed that, because of the randomly selected weights and hidden layer biases, it requires many nodes to accomplish satisfactory performance. Recently, 1-norm regularization has gained tremendous popularity among researchers [13,44] since it results in sparse outputs. A sparse output means that most of the elements in the output matrix are zero; hence, the decision surface can be obtained with fewer hidden nodes than conventional models require. Also, these sparse models are easily implementable [1]. An influential contribution in this direction is the 1-norm SVM developed by Mangasarian [28]. The solution of the 1-norm SVM is computed by solving its exterior penalty problem as an unconstrained convex minimization problem using the Newton-Armijo algorithm. Hence, inspired by the work of Mangasarian [28] and the recent significant literature on 1-norm regularization, this paper proposes the 1-norm RVFL (1N RVFL) network. The advantages and limitations of a few related classifiers are tabulated in Table 1.
The remaining paper is structured as follows. Section "Mathematical background" gives a brief mathematical description of a few related models, viz., ELM, KRR, GLTRVFL and RVFL. Section "Proposed 1-norm random vector functional link (1N RVFL)" presents the formulation and description of the proposed 1N RVFL model. In Section "Simulation and analysis of results", the numerical experiments and comparative analysis with ELM, KRR, RVFL, K-RVFL and GLTRVFL are undertaken. In Section "Conclusion", we conclude the paper.

The ELM model
Let β = (β_1, ..., β_T)^t, where T = L + n, be the weight vector (WV) to the output neuron, with L indicating the number of hidden layer nodes. Let h_l(x_i) = G(a_l, b, x_i), for l = 1, ..., L and i = 1, 2, 3, ..., m, be the output of the activation function G(., ., .) of the l-th hidden layer neuron with respect to the i-th training sample, where a_l = (a_l1, ..., a_lm)^t indicates the WV and b represents the bias of the hidden layer nodes.
The output equation for ELM [21] can be expressed as Hβ = y, where H = [h_l(x_i)] is the m × L hidden layer output matrix and y is the vector of targets.
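This closed-form training pipeline can be sketched as follows. This is not the authors' code; the function names, the ReLU hidden layer and the uniform weight initialization are illustrative choices. It builds H from random weights and biases and solves Hβ = y in the least-squares sense with the Moore-Penrose pseudoinverse:

```python
import numpy as np

def elm_train(X, y, L=50, seed=0):
    """ELM sketch: random hidden layer, readout via Moore-Penrose pseudoinverse."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))  # random input weights a_l
    b = rng.uniform(-1.0, 1.0, size=L)                # random biases
    H = np.maximum(0.0, X @ A + b)                    # hidden outputs h_l(x_i), ReLU
    beta = np.linalg.pinv(H) @ y                      # beta = H^dagger y
    return A, b, beta

def elm_predict(X, A, b, beta):
    """Final classifier f(x) = sign(h(x) beta)."""
    return np.sign(np.maximum(0.0, X @ A + b) @ beta)
```

Note that the random weights A and b are drawn once and never tuned; only the output weights β are fitted.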
β represents the solution in the primal space, which can be computed as β = H†y, where H† is the Moore-Penrose inverse of H. The final classifier of ELM may then be expressed as f(x) = sign(h(x)β).

The KRR model

The primal problem of KRR [36] may be defined as min_{w,ψ} (1/2) w^t w + (C/2) ψ^t ψ subject to ϕ(x_i)^t w = y_i − ψ_i, i = 1, ..., m, where w is the unknown weight vector, ψ is the slack variable, y is the output vector and ϕ(x) indicates the feature mapping function of the input x. Forming the Lagrangian of (4) with multiplier vector α, equating its derivatives to zero and applying the KKT conditions, the dual form may be obtained as α = (K + I/C)^{-1} y, where K = ϕ(X)ϕ(X)^t is the kernel matrix and I is the identity matrix of appropriate dimension. For a new input example x ∈ R^n, the KRR classifier may be generated as f(x) = sign(K(x, X)(K + I/C)^{-1} y).

The RVFL model

RVFL [31] is a type of SLFN that randomly generates the weights of the hidden layer nodes and keeps them fixed, without tuning them iteratively.
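The KRR dual solution described above can be sketched in a few lines, assuming a Gaussian kernel K(x, z) = exp(−μ||x − z||²); the function names and default parameter values are illustrative:

```python
import numpy as np

def gaussian_kernel(X1, X2, mu=1.0):
    """K(x, z) = exp(-mu * ||x - z||^2), computed pairwise."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-mu * d2)

def krr_train(X, y, C=10.0, mu=1.0):
    """Dual solution alpha = (K + I/C)^(-1) y via a linear solve."""
    K = gaussian_kernel(X, X, mu)
    return np.linalg.solve(K + np.eye(len(y)) / C, y)

def krr_predict(Xnew, X, alpha, mu=1.0):
    """Classifier f(x) = sign(K(x, X) alpha)."""
    return np.sign(gaussian_kernel(Xnew, X, mu) @ alpha)
```

The solve is on equality-constrained (linear-system) form, which is what makes KRR non-iterative and fast relative to inequality-constrained QP solvers.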
The regularized version of RVFL can be expressed as min_β (1/2)||y − Gβ||² + (1/2C)||β||², where G = [H D] concatenates the hidden layer output matrix H and the direct-link matrix D of input features. Now, by differentiating (7) with respect to β and further equating it to zero, we obtain β = (G^t G + I/C)^{-1} G^t y. For any new instance x, the classification function of RVFL can be generated as f(x) = sign(g(x)β), where g(x) = [h(x) x] is the concatenation of the hidden layer outputs and the input features.
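A minimal sketch of this regularized RVFL training, under the assumptions (for illustration only) that the direct-link matrix D is simply the input matrix X and that the hidden nodes use a sine activation:

```python
import numpy as np

def rvfl_train(X, y, L=30, C=10.0, seed=0):
    """Regularized RVFL sketch: G = [H D] with direct links D = X,
    ridge solution beta = (G^t G + I/C)^(-1) G^t y."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))  # random hidden weights
    b = rng.uniform(-1.0, 1.0, size=L)                # random biases
    G = np.hstack([np.sin(X @ A + b), X])             # sine hidden nodes + direct links
    beta = np.linalg.solve(G.T @ G + np.eye(G.shape[1]) / C, G.T @ y)
    return A, b, beta

def rvfl_predict(X, A, b, beta):
    """f(x) = sign([h(x) x] beta)."""
    return np.sign(np.hstack([np.sin(X @ A + b), X]) @ beta)
```

The only difference from the ELM sketch is the direct input-output links stacked into G, which is what the text identifies as the source of RVFL's improved generalization.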

The GLTRVFL model
Recently, Borah and Gupta [3] proposed a generalized Lagrangian twin RVFL model called GLTRVFL. Its pair of primal problems is given in (10) and (11). The dual formulations of (10) and (11) may be expressed in a generalized form as (12); after forming the duals, we apply the Newton iterative technique, which can be expressed as (13).

Proposed 1-norm random vector functional link (1N RVFL)
The 1-norm RVFL with absolute loss is suggested in this section as a standardized classification model, resulting in a robust representation of the model. Moreover, motivated by the study of Mangasarian [28], the proposed 1N RVFL model is formulated using the Newton-Armijo algorithm, which treats the dual exterior penalty problem as an unconstrained convex minimization problem. The proposed formulation leads to an iterative solution for the binary classification problem that is simple and rapidly converging.
Consider the regularized formulation of RVFL as (14), where C > 0 is the trade-off parameter. By using the same procedure as Mangasarian [28] and Balasundaram and Gupta [1], Eq. (14) can be rewritten in linear programming form as (15). Substituting (15) into (14) gives the linear programming RVFL in primal form (16), where d_l and d_m are ones vectors of dimensions l and m, respectively. The optimization problem (16) is easily solvable using the optimization toolbox in MATLAB.
However, because the number of unknowns and constraints, and thus the problem size, increases, it is recommended to solve its dual exterior penalty problem as an unconstrained minimization problem in m variables, whose solution can be obtained by the Newton-Armijo technique. Proposition 1 ([1, 28]): Consider the primal linear programming problem (17). Its dual exterior penalty problem may be defined as (18). Equation (18) is solvable for all φ > 0. Furthermore, there exists φ̄ > 0 such that for every φ ∈ (0, φ̄], the solution (w, v) of (18) yields a solution of (17). Now, following Proposition 1, the dual penalty optimization problem [1] of (19) may be obtained as (20), where φ is the penalty parameter; (20) is solvable for φ > 0, and a solution exists for any φ > 0. The unconstrained minimization problem (20) can be solved by the Newton-Armijo iterative technique.
Here ∇L(w) is the gradient of L(·), expressed by (22), which is not differentiable; hence the second-order derivative of L(·) does not exist. However, its "generalized Hessian" can be formed for w ∈ R^m as (23). Equation (23) satisfies the equality (24), where diag(·) indicates a diagonal matrix. The generalized Hessian is useful when solving such unconstrained optimization problems and leads to a unique solution.
However, ∇²L(w) is only positive semi-definite and might become ill-conditioned.

Remark 1
To avoid ill-conditioning of (25), a very small positive number τ > 0, multiplied by an identity matrix I of appropriate dimension, is added to (25); therefore, ∇²L + τI is used.
In this work, the optimization problem (24) is solved using the Newton method without the Armijo step, for simplicity. This means that w^{i+1} at the (i+1)-th iteration is obtained by finding the solution of (26), where i = 0, 1, ...
We can determine the value of w by solving the iterative scheme in (27).
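The generalized-Hessian Newton scheme without the Armijo step can be sketched on a representative piecewise-quadratic objective. The objective below is illustrative and not the paper's exact penalty problem, but it has the same plus-function structure, so the non-differentiable gradient, the diag-based generalized Hessian of (23)-(24) and the τI regularization of Remark 1 all appear as described:

```python
import numpy as np

def newton_plus(A, c, tau=1e-3, iters=50, tol=1e-8):
    """Generalized-Hessian Newton sketch for the plus-function objective
    f(w) = 0.5*||(A w - c)_+||^2 + 0.5*tau*||w||^2  (illustrative form).
    Gradient: A^t (A w - c)_+ + tau*w (not differentiable at kinks).
    Generalized Hessian: A^t diag(indicator of active rows) A + tau*I."""
    m, n = A.shape
    w = np.zeros(n)
    for _ in range(iters):
        r = A @ w - c
        grad = A.T @ np.maximum(r, 0.0) + tau * w   # gradient of f
        if np.linalg.norm(grad) < tol:
            break
        D = np.diag((r > 0).astype(float))          # diag indicator, as in (24)
        Hess = A.T @ D @ A + tau * np.eye(n)        # generalized Hessian + tau*I
        w = w - np.linalg.solve(Hess, grad)         # full Newton step, no Armijo
    return w
```

Each full Newton step exactly minimizes the quadratic piece selected by the current active set, which is why the iteration typically terminates in very few steps for convex piecewise-quadratic objectives of this kind.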

Simulation and analysis of results
This section investigates the performance of the 1-norm RVFL model in comparison to ELM, KRR, RVFL, K-RVFL and GLTRVFL for classification problems on several real-world benchmark datasets. All simulations are performed in the MATLAB 2008b environment on a desktop computer with 4 GB of RAM, 64-bit Windows 7 OS and an Intel i5 processor running at 3.20 GHz. No external optimization toolbox was required to solve the optimization problems of the reported models.
Zhang and Suganthan [45] suggested that the hardlim and sign activation functions generally degrade the overall performance of the RVFL algorithm. To select the best activation function, we performed experiments on a few real-world datasets using different activation functions for the proposed 1N RVFL, viz., hardlim, multiquadric, radial basis function (RBF), triangular basis (tribas), sigmoid, sine and ReLU. The average ranks are shown in Table 2; based on them, the best and second-best activation functions, i.e., ReLU and sine, were used in the experiments for ELM, RVFL, GLTRVFL and the proposed 1N RVFL. Let x be an input. The two activation functions are defined as (a) ReLU: φ(x) = max(0, x) and (b) Sine: φ(x) = sin(x). The tests were performed using the tenfold cross-validation technique: the sample is split into 10 equal subsamples, one subsample is used for testing while the remainder is used for training, and this procedure runs 10 times until every subsample has been used for testing once [4]. For computational convenience, the input data is split into two parts, where 30% of the data is used for training and the other 70% for testing. To validate the efficacy of the proposed 1-norm RVFL model, its performance was compared with the ELM, KRR, RVFL, K-RVFL and GLTRVFL models on several real-world benchmark datasets.
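The two selected activation functions, and the random hidden-layer map they are applied to, can be written directly (a sketch; the function names are illustrative):

```python
import numpy as np

def relu(x):
    """(a) ReLU: phi(x) = max(0, x)."""
    return np.maximum(0.0, x)

def sine(x):
    """(b) Sine: phi(x) = sin(x)."""
    return np.sin(x)

def hidden_features(X, A, b, act):
    """Random hidden-layer map: act(X A + b), with A and b drawn
    once at random and kept fixed, as in RVFL-type models."""
    return act(X @ A + b)
```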
Since a large value of L in ELM leads to an increase in computational time [1], for ELM, KRR, RVFL and GLTRVFL the optimum value of the parameter L is selected from the set {20, 50, 100, 200, 500, 1000}. For KRR, RVFL, K-RVFL and GLTRVFL, the optimum value of C is obtained from {10^-5, ..., 10^5}. The Gaussian kernel is selected when implementing KRR, with the kernel parameter μ chosen from {2^-5, ..., 2^5}. For the proposed 1N RVFL, the optimum values of the two parameters C and L are chosen from {10^-5, ..., 10^5} and {20, 50, 100, 200, 500, 1000, 2000}, respectively. The statistics of the datasets considered in the experiments are tabulated in Table 3, where S indicates the number of samples and N the total number of attributes.
All the experimental datasets are collected from the UCI machine learning repository [12]. The numerical experiments on the various datasets are performed after normalization of the data. The raw data is normalized as r̄_ij = (r_ij − r_j^min) / (r_j^max − r_j^min), where r_j^max = max_{i=1,...,m}(r_ij) and r_j^min = min_{i=1,...,m}(r_ij) denote the maximum and minimum values, respectively, of the j-th attribute over all input samples, and r̄_ij represents the normalized value of r_ij. For all datasets, the number of training and test samples and the optimum parameters are obtained using the tenfold cross-validation method. The total numbers of training and testing samples, the optimum parameter values and the classification accuracies of the models are shown in Table 4. Comparable or better performance indicates the efficacy and applicability of the proposed model. Additionally, the ranks based on classification accuracy on each dataset are exhibited in Table 5 for each reported classifier.
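The min-max normalization above is straightforward to implement column-wise (a sketch; the function name is illustrative):

```python
import numpy as np

def min_max_normalize(R):
    """Column-wise min-max scaling: r_ij -> (r_ij - r_j^min) / (r_j^max - r_j^min),
    mapping every attribute into [0, 1]."""
    rmin = R.min(axis=0)   # r_j^min per attribute
    rmax = R.max(axis=0)   # r_j^max per attribute
    return (R - rmin) / (rmax - rmin)
```

In practice the training-set minima and maxima would be reused to scale the test set, so that no test information leaks into training.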

Friedman test with Nemenyi statistics for classifier comparison
To compare the performance of the reported algorithms with the proposed algorithm, we perform the non-parametric Friedman test [9] on the average ranks of the models tabulated in Table 5. The lowest average rank, obtained by the proposed 1N RVFL sine, reflects the efficiency of the model. The Friedman test statistic for the null hypothesis was computed, with a value of .607 for q_α = 0.10.
Figure 1 shows the statistical comparison of the classifiers against each other based on the Nemenyi test. Groups of classifiers that are not significantly different (at α = 0.10) are connected. One can notice from the figure that the 1N RVFL sine model is significantly better than the ELM ReLU, ELM sine, KRR, RVFL ReLU, GLTRVFL ReLU, GLTRVFL sine and 1N RVFL ReLU models. However, despite showing a better average rank than K-RVFL, 1N RVFL sine is not significantly different from K-RVFL.
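For reference, the Nemenyi post-hoc test declares two classifiers significantly different when their average ranks differ by more than the critical difference CD = q_α · sqrt(k(k+1)/(6N)), where k is the number of classifiers and N the number of datasets. A small helper (illustrative):

```python
import math

def nemenyi_cd(q_alpha, k, N):
    """Nemenyi critical difference CD = q_alpha * sqrt(k (k + 1) / (6 N)):
    two classifiers differ significantly if their average ranks over the
    N datasets differ by more than CD (k = number of classifiers)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))
```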
The training and testing times (in seconds) of the reported models are shown in Tables 6 and 7, respectively. It can be observed that 1N RVFL is computationally less efficient than RVFL, despite showing better generalization performance.

Win/Tie/Loss test
The win/tie/loss statistical analysis [10] is used to further validate the efficacy of the best-proposed model, i.e., 1N RVFL sine. The outcomes are exhibited in the last row of Table 4. For example, the second column shows the comparison between 1N RVFL sine and ELM ReLU: 1N RVFL sine wins in 20 cases, ties in no case and loses in 3 cases. Similarly, the third column shows the comparison between 1N RVFL sine and ELM sine: 1N RVFL sine wins in 18 cases, ties in no case and loses in 5 cases. Similar conclusions can be derived from the other columns. As can be observed from the last row of Table 4, the proposed method has the best classification accuracy in most situations, indicating that 1N RVFL sine outperforms the other algorithms.
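Counting wins, ties and losses between two models amounts to a per-dataset comparison of their accuracies (a sketch; the function name is illustrative):

```python
def win_tie_loss(acc_a, acc_b):
    """Per-dataset comparison of two models' accuracy lists: counts of
    datasets where model A wins, ties, or loses against model B."""
    win = sum(a > b for a, b in zip(acc_a, acc_b))
    tie = sum(a == b for a, b in zip(acc_a, acc_b))
    loss = sum(a < b for a, b in zip(acc_a, acc_b))
    return win, tie, loss
```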
The parameter insensitivity plots of the proposed models are presented in Figs. 2, 3 and 4. Moreover, to reveal the sparseness of the proposed model, the average number of "actually" contributing nodes is portrayed in Fig. 5; a lower number of non-zero components indicates a sparser model. The Ecoli2, New thyroid1 and Habarman datasets are used to examine whether the proposed 1N RVFL solution method yields the least number of hidden nodes when determining the decision function. Hence, the degree of sparseness for each pair (C, L) is determined. It can be observed from Fig. 5 that 1N RVFL always leads to sparse solutions.
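The number of "actually" contributing nodes can be measured by thresholding the magnitude of the output weights; the tolerance eps below is an assumed choice, not one specified by the paper:

```python
import numpy as np

def contributing_nodes(beta, eps=1e-6):
    """Count output weights whose magnitude exceeds a small tolerance eps
    (assumed threshold); the remaining weights are treated as exactly
    zero, i.e. their hidden nodes do not contribute to the classifier."""
    return int(np.sum(np.abs(beta) > eps))
```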

Conclusion
In this work, a novel 1N RVFL has been proposed for solving binary classification problems. The solutions are obtained by solving the dual, using the Newton-Armijo technique, as an unconstrained minimization problem. The basic advantage of the proposed 1N RVFL is that it produces many coefficients with zero values, which results in a sparse output. The 1-norm RVFL is a robust model since it considers only the absolute value of the error, so extreme values do not inflate the cost the way a squared loss would. Extensive experiments on several classification datasets show that the proposed 1-norm RVFL performs better than ELM, KRR, RVFL, K-RVFL and GLTRVFL. The good generalization ability of the proposed model implies its usability and efficiency. 1N RVFL is useful for classification problems with very high dimensional input spaces. However, the major limitation of 1N RVFL is that it is computationally less efficient than RVFL. Future work will develop this model for solving multiclass classification problems using the one-versus-rest or one-versus-one procedure, which can be fruitful for various real-life classification problems such as face classification, character recognition, plant species recognition and others.

Fig. 1 Statistical comparison of classifiers against each other based on the Nemenyi test

Fig. 4 Parameter insensitivity performance of 1N RVFL ReLU and sine models on the user-specified parameters (C, L) for the Yeast3 dataset (Figs. 2, 3 and 4 correspond to the Habarman, Vehicle 1 and Yeast3 datasets, respectively). One can notice from Figs. 2, 3 and 4 that the proposed models are not very sensitive to the user-defined parameters C and L.

Fig. 5 Number of actually contributing nodes with user-defined parameters for 1N RVFL with ReLU and sine additive nodes on the a Ecoli2, b New thyroid1 and c Habarman datasets

Table 1
Pros and cons of a few related models

3. 1N RVFL produces a classifier that is based on a smaller number of input features; to put it another way, this method suppresses the required number of neurons in the hidden layer.
4. Experiments on real-world datasets have been conducted to demonstrate the classification ability of 1N RVFL compared to other models.

Table 2
Average rank using different activation functions for 1N RVFL (best average rank is bolded)

Table 4
Classification accuracies obtained by the classifiers on the real-world dataset with optimum parameters