1 Introduction

The support vector machine (SVM) is a celebrated binary classifier built upon a direct practical application of statistical learning theory. The practicality and strong theoretical backing of SVMs have enabled them to be successfully applied in a diverse variety of real-world domains, ranging from bioinformatics (Yin et al., 2015) and scene classification (Subasi, 2013) to power applications (Hao & Lewin, 2010). SVM has also attracted many researchers and inspired the development of new classifiers, such as the generalised eigenvalue proximal support vector machine (GEPSVM) (Mangasarian & Wild, 2005) and the twin support vector machine (TSVM) (Khemchandani & Chandra, 2007), each of which has given rise to a substantial body of research.

TSVM showed that a two-hyperplane rule is better suited to XOR (cross-planes) datasets, without the use of kernel methods. Moreover, by following a divide-and-conquer strategy and solving an individual quadratic programming problem (QPP) for each class, it halves the number of constraints per problem and is therefore around four times faster than the conventional SVM, while delivering comparable performance.

Nevertheless, TSVM has its own set of limitations, such as the need for matrix inversion, loss of sparsity, and sensitivity to outliers due to the L\(_2\)-norm. Each of these problems has attracted researchers' attention and spawned its own line of research.

TSVM adopts a two-hyperplane rule to classify samples, which results in better accuracy than SVM. However, the inconsistency between its training and prediction processes leads to sub-optimal results in certain cases. Shao et al. (2014) incorporated the two-hyperplane prediction rule into their optimisation to address this issue. Similar to TSVM, their model minimises the respective class variances while simultaneously maximising the projection difference between the classes.

Tian et al. (2013) presented an interesting extension of TSVM that avoids matrix inversion, is sparse, and treats the linear and non-linear classifiers uniformly. The first two sets of constraints in their optimisation force the respective class patterns to lie within \(\epsilon\)-bands; the formulation corresponds to the L\(_{1}\)-norm as \(\epsilon\) tends to zero. The last set of constraints models the hinge loss to allow optimal separation from the opposite-class patterns. Since the data points are modelled implicitly in the objective, matrix inversion is avoided, while their explicit appearance in the constraints endows the model with sparsity as well as uniformity. Further, fast SVM solvers can be adopted with little modification to speed up training.

All the aforementioned models in some way implement the hinge loss function, which is unstable in the presence of noise. Traditionally, this problem has been dealt with by replacing the hinge loss with the pinball loss. The pinball loss maximises the quantile distance between the classes, in contrast to the hinge loss, which maximises the distance between the closest points in the convex hulls of the two classes. In other words, the pinball loss grants noise resilience by penalising correctly classified samples by a small amount, thereby keeping the within-class scatter as small as permissible. However, the pinball loss trades away the sparsity of the classifier in exchange for this noise resilience. To regain sparsity, several pinball loss variants have been proposed, most notably the \(\epsilon\)-pinball loss and the truncated pinball loss. They succeed in equipping the classifier with both noise resilience and sparsity, but only by increasing the number of variables, constraints and parameters of the model; the result is a more complicated optimisation with a much higher time complexity.

In this research, we focus on a set of critical problems in SVM whose solution can make it better suited to real-world applications: achieving noise resilience while maintaining sparsity, and being more time-efficient than the current state-of-the-art pinball loss models.

Taking inspiration from TSVM's proximal classification rule and through a careful redesign of the SVM optimisation, we address the problems above, and our model compares favourably with the pinball loss models.

The rest of the paper is organised as follows. Related work is reviewed in Sect. 2. Our proposed model, PN-SVM, is introduced in Sect. 3, which also discusses the motivation behind the model and its connection with existing approaches. The experimental results are reported in Sect. 4, and Sect. 5 concludes the paper.

2 Related work

In this paper, we denote class A as the set of samples with positive labels and class B as the set of samples with negative labels, i.e. \(A = \{(x_{i}, y_{i} = 1)\}\) and \(B = \{ (x_{i}, y_{i} = -1) \}\), \(i=1, 2,\ldots , m_{1} + m_{2}\), \(D = [A;B]\), where \(x_i \in X \subset\) \(R^{N}\). The labels are \(y_i = +1\) for \(i=1,2,\ldots , m_{1}\) and \(y_i = -1\) for \(i=m_{1}+1, m_{1}+2,\ldots , n=m_{1}+m_{2}\). For simplicity we write \(Y=diag(y_1,\ldots ,y_{m_{1}}, y_{m_{1}+1},\ldots ,y_{m_{1}+m_{2}})\). Note that the numbers of samples in class A and class B are denoted by \(m_{1}\) and \(m_{2}\), respectively.

2.1 Support vector machines

SVM, a state-of-the-art binary classification algorithm, is backed by the structural risk minimisation principle of statistical learning theory (Vapnik, 1999). For a binary classification dataset D, SVM seeks, from a family of separating hyperplanes, the one that maximises the width of the separation between the two classes. Consider the optimal hyperplane \(J(w,b): w^Tx + b = 0\), where \(b \in R\) is the bias term and \(w \in R^N\) is the normal vector of \(J(w,b)\). The optimal hyperplane J can be obtained by solving the following optimisation problem:

$$\begin{aligned} \begin{aligned}&\underset{w,b,q}{min} \quad \frac{1}{2}\left\Vert w\right\Vert ^{2}_{2} \,\, + \,\, ce^{T}q,\\&s.t. \quad Y(Dw \,\, + \,\,eb) + q \,\, \ge e , \\&\,\, \quad \quad q \,\, \ge \,\, 0. \end{aligned} \end{aligned}$$
(1)

Here, minimising the first term (the L\(_2\)-norm regularisation) maximises the margin, i.e. the separation between the classes, and e is a vector of ones of appropriate dimension. The second term is the sum of the errors q of samples that fall inside the margin or are misclassified. The parameter c controls the relative importance of the two terms in the objective. The set of constraints requires each data point's projection to be at least unit distance from the hyperplane on the correct side; when this is violated, the corresponding error variable takes the minimum value needed to satisfy the constraint, yielding a soft-margin hyperplane. For more details we refer the reader to the original SVM paper (Cortes & Vapnik, 1995).

The dual solution is generally sparse, implying that SVM needs only a fraction of the total data points to determine the optimal hyperplane; this property is called sparsity. Several research works aim to identify these data points in advance: since the classifier depends on these points only, the training set can be reduced, thereby making the model faster (Jung & Kim, 2013; Nalepa & Kawulok, 2019).
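For illustration only, Eq. (1) can be transcribed directly for MATLAB's quadprog (Optimization Toolbox). The sketch below assumes the data matrix D (n \(\times\) N), the label vector y (so that \(Y = diag(y)\)) and the parameter c are already in the workspace; production SVM solvers typically work on the dual or use decomposition methods instead.

```matlab
% Soft-margin SVM primal (1): decision variables z = [w; b; q]
[n, N] = size(D);                              % n samples, N features
H = blkdiag(eye(N), 0, zeros(n));              % quadratic term: (1/2)*||w||_2^2
f = [zeros(N+1, 1); c*ones(n, 1)];             % linear term: c*e'*q
Aineq = [-(y.*D), -y, -eye(n)];                % encodes Y*(D*w + e*b) + q >= e
bineq = -ones(n, 1);                           % (y.*D scales rows by labels; R2016b+)
lb = [-inf(N+1, 1); zeros(n, 1)];              % q >= 0
z = quadprog(H, f, Aineq, bineq, [], [], lb, []);
w = z(1:N);  b = z(N+1);                       % q = z(N+2:end) are the slacks
```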

2.2 Support vector machines with pinball loss function

Huang et al. (2014) proposed the support vector machine with pinball loss (Pin-SVM), taking into consideration two problems with SVM, namely sensitivity to noise and instability with respect to resampling. Their idea is to penalise the correctly classified samples through a parameter \(\frac{1}{\tau }\), so as to keep them close to the separating hyperplane while maximising the margin as in SVM. The Pin-SVM formulation is as follows:

$$\begin{aligned} \begin{aligned}&\underset{w,b,q}{min} \quad \frac{\left\Vert w\right\Vert _{2}^{2}}{2} \,\,+\,\, ce^{T}q, \\&s.t. \quad Y(Dw \,\, + \,\,eb) + q \,\, \ge e, \\&\,\,\quad \quad Y(Dw \,\, + \,\,eb) - \frac{q}{\tau } \,\, \le e. \\ \end{aligned} \end{aligned}$$
(2)

Here c and \(\tau\) are model parameters. As is evident from the above set of constraints, Pin-SVM loses sparsity in exchange for noise insensitivity and resampling robustness. The issue of sparsity is handled by introducing an \(\epsilon\) zone, which is described in the next subsection.
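The same transcription used for Eq. (1) also works for Eq. (2). The hedged sketch below reuses the conventions of the previous listing (D, y, c) plus the pinball parameter tau; note that the inequality block doubles to 2n rows, which is the source of the extra training cost discussed later.

```matlab
% Pin-SVM primal (2): same variables z = [w; b; q], but 2n inequality rows
[n, N] = size(D);
H = blkdiag(eye(N), 0, zeros(n));
f = [zeros(N+1, 1); c*ones(n, 1)];
Aineq = [-(y.*D), -y, -eye(n);                 % Y*(D*w + e*b) + q     >= e
          (y.*D),  y, -(1/tau)*eye(n)];        % Y*(D*w + e*b) - q/tau <= e
bineq = [-ones(n, 1); ones(n, 1)];
z = quadprog(H, f, Aineq, bineq);              % note: no sign constraint on q
w = z(1:N);  b = z(N+1);
```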

2.3 Modified support vector machine with pinball loss

Rastogi et al. (2018), motivated by the properties of the \(\epsilon\)-Pin-SVM, built a data-driven, asymmetric \(\epsilon\)-zone Pin-SVM termed the \((\epsilon _{1},\epsilon _{2})\) modified pinball support vector machine (Mod-Pin-SVM). Note that we consider version 2 of their formulation here; we refer the reader to Rastogi et al. (2018) for more details.

$$\begin{aligned} \begin{aligned}&\underset{w,b,q,\epsilon _{2}}{min} \quad \frac{\left\Vert w\right\Vert _{2}^{2}}{2}\,\,+\,\, c_{1}\epsilon _{2} \,\,+\,\, c_{2}e^{T}q, \\&s.t. \quad Y(Dw \,\, + \,\,eb) + \left( \frac{q + \epsilon _{1}}{\tau _{1}}\right) \,\, \ge e, \\&\,\,\quad \quad Y(Dw \,\, + \,\,eb) - \left( \frac{q+\epsilon _{2}}{\tau _{2}}\right) \,\, \le e, \\&\,\, \quad \quad q \,\, \ge \,\, 0, \,\, \quad \epsilon _{2} \,\, \ge \,\, 0. \end{aligned} \end{aligned}$$
(3)

Here \(c_{1}\), \(c_{2}\), \(\tau _{1}\), \(\tau _{2}\) and \(\epsilon _{1}\) are the model parameters. The optimisation encourages the data points to lie in their respective \(\epsilon\) zones, where the error variable q is zero, so the obtained solution is naturally sparse. The reported results show that their model achieves sparsity while retaining noise insensitivity and robustness to resampling. The downside is the large number of model parameters and an increased training time due to the added constraints and variables.

2.4 Twin support vector machines

Twin support vector machines (TSVM) find a pair of hyperplanes such that each hyperplane is proximal to the patterns of its own class while the opposite-class patterns are at least unit distance away. TSVM solves two quadratic programming problems of the following form:

$$\begin{aligned} \begin{aligned}&\underset{w_{1},b_{1},q_{1}}{min} \quad \frac{1}{2}\left\Vert Aw_{1}\,\,+\,\,e_{1}b_1\right\Vert ^{2}_{2} \,\,+\,\,c_{1}e_{2}^{T}q_{1},\\&s.t. \quad -(Bw_1\,\,+\,\,e_{2}b_1) \,\,+\,\,q_{1} \ge e_2,\\&\quad \quad q_{1} \ge 0. \end{aligned} \end{aligned}$$
(4)

For brevity, we omit the problem for the second hyperplane. Here \(q_1\) is the error vector, and the parameter \(c_1\) controls the trade-off between the squared L\(_2\)-norm term and the sum of errors. Each QPP involves only about half of the constraints, so TSVM is roughly four times faster than the conventional SVM (Khemchandani & Chandra, 2007).

TSVM loses sparsity because of the two loss functions (L\(_{2}\) loss and hinge loss) applied to the classes. Moreover, for large datasets the required matrix inversion can be problematic, so a small regularisation term \(\epsilon I\) (i.e. \(H^{T}H\,+\,\epsilon I\)) is added to keep the matrices invertible. Once the class-proximal hyperplanes \(J_{1}(x) = w_{1}^Tx\,\,+\,\, b_{1}\) and \(J_{2}(x) = w_{2}^Tx\,\,+\,\, b_{2}\) are found, the test data can be annotated according to the rule:

$$\begin{aligned} y = sign(abs(J_{2}(x))\,\,-\,\,abs(J_{1}(x))). \end{aligned}$$
(5)
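In code, rule (5) is a one-liner; the sketch below assumes the two hyperplane parameters \((w_{1},b_{1})\) and \((w_{2},b_{2})\) have already been obtained and that the test samples are stored row-wise in a matrix X.

```matlab
% TSVM decision rule (5): X holds test samples row-wise
J1 = X*w1 + b1;                 % signed values under hyperplane J1
J2 = X*w2 + b2;                 % signed values under hyperplane J2
yhat = sign(abs(J2) - abs(J1)); % assign to the class whose plane is closer
```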

To avoid the matrix inversion step in the dual, many alternatives have been put forward (Chen et al., 2020; Shao et al., 2011). One solution is to replace the L\(_2\)-norm with the L\(_1\)-norm. A further reason for using the L\(_1\)-norm in the primal objective of TSVM is that the L\(_2\)-norm is much more strongly influenced by outliers and noise; the L\(_1\)-norm is more robust from both experimental and statistical points of view (Li et al., 2016; Kwak, 2008).

Moreover, the use of the L\(_1\)-norm can make the twin solution sparse. A technique for solving L\(_1\)-norm problems has been applied in twin support vector clustering (Wang et al., 2015); it has been shown to converge to the optimal solution, but it requires more iterations to do so, which increases the overall complexity of the model.

2.5 L\(_1\)-norm twin support vector machines

Peng et al. (2016), motivated by the problems caused by the L\(_2\)-norm, introduced an L\(_1\)-norm based TSVM (L\(_1\)-TSVM). The optimisation for the first hyperplane is as follows:

Hyperplane 1 (\(J_{1}\)):

$$\begin{aligned} \begin{aligned}&\underset{w_{1},b_{1},q_{1}^{+},q_{1}^{-},q_{2}}{min} \quad \frac{1}{2}( \left\Vert w_{1}\right\Vert _{2}^{2} \,\,+\,\, b_{1}^{2} ) \, + \, c_{1}e_1^{T}( q_{1}^{+} \, + \, q_{1}^{-} )\,\, +\,\, c_{2}e_2^{T}q_{2}, \\&s.t. \quad Aw_{1}\,+\,e_{1}b_{1} = ( q_{1}^{+} \, - \, q_{1}^{-} ), \\&\quad \quad -(Bw_{1}\,+\,e_{2}b_{1}) \,\, + \,\, q_{2} \ge e_{2}, \\&\quad \quad q_{1}^{\pm } \ge 0\,, \,\, q_{2} \,\ge \, 0. \end{aligned} \end{aligned}$$
(6)

This formulation avoids the matrix inversion step, and the solution obtained is partly sparse. The empirical observations reported in Peng et al. (2016) show that the model becomes less sensitive to outliers with the introduction of the L\(_1\)-norm. The drawback of the L\(_1\)-norm TSVM is that it is highly dependent on its parameters. Moreover, each of its two optimisation problems (of which only Eq. (6) is shown) has n constraints, so the time advantage of the original TSVM is lost; in fact, it requires almost twice the training time of the conventional SVM. Overall, it succeeds in introducing partial sparsity into TSVM while avoiding the matrix inversion step.

2.6 Twin parametric margin support vector machine

Peng (2011), inspired by the TSVM framework, introduced the twin parametric-margin support vector machine (TPMSVM). The main optimisation of the model is as follows:

$$\begin{aligned} \begin{aligned}&\underset{w_{1},b_{1},q_{1}}{min} \quad \frac{1}{2}\left\Vert w_{1}\right\Vert _{2}^{2} \,\,+\,\, \frac{v_{1}}{m_{2}}e^{T}_{2}(Bw_{1}\,\,+\,\,e_{2}b_{1}) \,\,+\,\, \frac{c_{1}}{m_{1}}e_{2}^{T}q_{1},\\&s.t. \quad (Aw_{1}\,\,+\,\,e_{1}b_{1}) \,\,+\,\, q_{1} \,\,\ge \,\, 0,\\&\quad \quad q_{1} \,\,\ge \,\, 0. \end{aligned} \end{aligned}$$
(7)

Here \(v_{1}\) and \(c_{1}\) are the model parameters. For brevity, we omit the equations for the second hyperplane. The constraints require the class A samples to lie in the positive half-space, while minimising the second term in the objective pushes the class B samples into the negative half-space, as far from the hyperplane as possible. Notice that in this formulation class A appears in the constraints and class B in the objective when calculating the hyperplane for class A. TPMSVM obtains marginal hyperplanes, whereas TSVM obtains proximal ones. An advantage of this formulation is that it requires no matrix inversion while enjoying a time complexity similar to that of TSVM. The model is mainly applicable in scenarios where heteroscedastic noise is prevalent.

2.7 Twin-support vector machines with pinball loss

Xu et al. (2016) introduced the pinball loss into twin support vector machines (Pin-TSVM) to make them less sensitive to noise. They take TPMSVM as the building block of their work and modify it to incorporate the pinball loss function. Their paper shows how Pin-TSVM extends TPMSVM; we refer the reader to Peng (2011) and Xu et al. (2016) for more details. The optimisation of Pin-TSVM is as follows:

$$\begin{aligned} \begin{aligned}&\underset{w,b,q}{min} \quad \frac{1}{2}\left\Vert w\right\Vert _{2}^{2}\,\,+\,\, \frac{v_{1}}{m_{2}}(Bw\,+\,e_{2}b) \,\,+\,\, \frac{c_{1}}{m_{1}}(e_{1}^{T}q), \\&s.t. \quad (Aw \,\, + \,\,e_{1}b) + q \,\, \ge e_{1}*0, \\&\,\,\quad \quad (Aw \,\, + \,\,e_{1}b) - (\frac{q}{\tau _{1}}) \,\, \le e_{1}*0. \\ \end{aligned} \end{aligned}$$
(8)

For brevity, we omit the equations for the second hyperplane. Among the pinball loss models, Pin-TSVM is faster in terms of training time when there is no class imbalance, but its solution is not sparse.

3 Proposed model

3.1 Motivation

The conventional SVM has proved effective in many practical applications owing to the structural risk minimisation (SRM) principle, which avoids overfitting.

However, SVM has cubic training complexity, which translates into a huge training time for large datasets. TSVM is around four times faster than SVM due to the reduction in the number of constraints, but it still needs to solve two QPPs, as its decision rule dictates. Moreover, it is sensitive to outliers and needs to invert matrices owing to the squared loss used on the proximal class. The L\(_{1}\)-norm TSVM addresses these problems by replacing the L\(_{2}\)-norm with the L\(_{1}\)-norm, but it manages to be only partly sparse. This model served as the starting point of our research, namely to further improve sparsity within the TSVM framework. Surprisingly, the design of PN-SVM turned out to be very similar to that of the pinball loss models instead, and the proposed model is approximately four times (or more) faster than Pin-SVM and its descendants. To sum up, PN-SVM not only addresses the noise instability problem in the SVM framework but is also sparse, faster, and less complicated than the Pin-SVM based models.

3.2 Formulation

3.2.1 Linear model

The mathematical formulation of our proposed model is as follows:

Hyperplane 1 (\(J_{1}\)):

$$\begin{aligned} \begin{aligned}&\underset{w_{1},b_{1},q_{1},q_{2}}{min} \quad \frac{1}{2}\left\Vert w_{1}\right\Vert ^{2}_{2} \,\,+\,\, \frac{v_{1}}{m_{1}}e_{1}^{T}(Aw_{1}\,+\,e_{1}b_{1}) \,\,+\,\, \frac{c_{1}}{m_1 + m_2}(e_{1}^{T}q_{1}\,+\,e_{2}^{T}q_{2}), \\&\quad s.t. (Aw_{1}\,+\,e_{1}b_{1}) \,\,+\,\, q_{1}\,\, \ge \,\, 0*e_{1}, \\&\quad \quad -(Bw_{1}\,+\,e_{2}b_{1}) \,\,+\,\, q_{2}\,\, \ge \,\, \epsilon *e_{2},\\&\quad \quad \quad q_{1} \,\,\ge \,\,0, \,\, q_{2}\,\, \ge \,\,0. \end{aligned} \end{aligned}$$
(9)

Hyperplane 2 (\(J_{2}\)):

$$\begin{aligned} \begin{aligned}&\underset{w_{2},b_{2},q_{1}^{'},q_{2}^{'}}{min} \quad \frac{1}{2}\left\Vert w_{2}\right\Vert ^{2}_{2} \,\,+\,\, \frac{v_{1}}{m_{2}}e_{2}^{T}(Bw_{2}\,+\,e_{2}b_{2}) \,\,+\,\, \frac{c_{1}}{m_1 + m_2}(e_{1}^{T}q_{1}^{'}\,+\,e_{2}^{T}q_{2}^{'}), \\&s.t. \quad (Bw_{2}\,+\,e_{2}b_{2}) \,\,+\,\, q_{1}^{'}\,\, \ge \,\, 0*e_{2}, \\&\quad \quad -(Aw_{2}\,+\,e_{1}b_{2}) \,\,+\,\, q_{2}^{'}\,\, \ge \,\, \epsilon *e_{1},\\&\quad \quad \quad q_{1}^{'} \,\,\ge \,\,0, \,\, q_{2}^{'}\,\, \ge \,\,0. \end{aligned} \end{aligned}$$
(10)

Consider the problem for Hyperplane 1 (Eq. 9). The second term in the objective forces the hyperplane to stay close to class A, so that the mean of the class A projections is minimised. The first constraint requires the class A projections to lie above the hyperplane; if this is violated, \(q_{1}\) (the slack vector) takes the minimum value needed to satisfy the constraint. Similarly, the class B projections must be at least \(\epsilon\) away from the hyperplane, with \(q_{2}\) the corresponding slack vector for class B. Minimising the first term together with the third adds the SRM principle to the model and avoids overfitting. Notice that Eq. (9) without the second term is simply the bounding-hyperplane problem of SVM for class A. The minimisation of the mean of the class A projections in the objective pulls the hyperplane towards class A. The hyperplane remains a bounding one as long as the sum of training errors is emphasised more than the mean of the projections, i.e. \(c_1 > v_1\). If \(c_1>>v_1\), i.e. the second term carries very little weight, the hyperplanes of Eqs. (9) and (10) become parallel and the model converges to the conventional SVM. Minimising the mean allows the bounding hyperplanes to reduce the scatter of each class individually. Since this is similar to penalising the correctly classified samples by some amount, it points towards an analogy with the pinball loss, which is popularly used to equip SVMs with noise robustness.

For hyperplane 1, only the class A points that lie on the wrong side of their hyperplane and the class B points that fall inside its \(\epsilon\)-margin contribute to the training error; hence only a few points contribute to building the model for class A, which yields sparsity. The same holds for class B.

Eq. (9) can be written compactly as follows:

$$\begin{aligned} \begin{aligned}&\underset{w_{1},b_{1},q}{min} \quad \frac{1}{2}\left\Vert w_{1}\right\Vert ^{2}_{2} \,\,+\,\, \frac{v_1}{m_{1}}e_{1}^{T}(Aw_{1}\,+\,e_{1}b_{1})\,\,+\,\, \frac{c_{1}}{m_1 + m_2}e^{T}q, \\&s.t. \quad Y*(D_{1}w_{1}\,+\,eb_{1}) \,\,+\,\, q\,\, \ge \,\, p_{1}, \\&\quad \quad \quad q \,\,\ge \,\,0 \,. \end{aligned} \end{aligned}$$
(11)

Here \(q = [q_{1}\,; \,q_{2}]\), \(e = [e_{1}\,; \,e_{2}]\), \(D_{1}=[A;B]\) and \(p_{1}=[0*e_{1}\,; \,\epsilon *e_{2}]\).
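For concreteness, the compact primal (11) can be handed straight to a QP solver. The MATLAB sketch below is only meant to make the roles of \(D_{1}\), \(p_{1}\) and q explicit; it assumes A, B and the parameters \(v_{1}\), \(c_{1}\) and \(\epsilon\) (named eps1 to avoid shadowing MATLAB's built-in eps) are available, and stacks the decision variables as z = [w1; b1; q].

```matlab
% Compact primal (11): decision variables z = [w1; b1; q], q = [q1; q2]
[m1, N] = size(A);  m2 = size(B, 1);  n = m1 + m2;
D1 = [A; B];
y  = [ones(m1, 1); -ones(m2, 1)];                   % so that Y = diag(y)
p1 = [zeros(m1, 1); eps1*ones(m2, 1)];              % p1 = [0*e1; eps*e2]
H  = blkdiag(eye(N), 0, zeros(n));                  % (1/2)*||w1||_2^2
f  = [(v1/m1)*sum(A, 1)'; v1; (c1/n)*ones(n, 1)];   % mean-projection and error terms
Aineq = [-(y.*D1), -y, -eye(n)];                    % Y*(D1*w1 + e*b1) + q >= p1
bineq = -p1;
lb = [-inf(N+1, 1); zeros(n, 1)];                   % q >= 0
z  = quadprog(H, f, Aineq, bineq, [], [], lb, []);
w1 = z(1:N);  b1 = z(N+1);
```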

For calculating the dual of Eq. (11), we write the Lagrangian \(L_{1}\) as follows:

$$\begin{aligned} \begin{aligned}&L_{1}(w_{1},b_{1},q) \,\, = \,\,\frac{1}{2}\left\Vert w_{1}\right\Vert ^{2}_{2} \,\,+\,\, \frac{v_{1}}{m_{1}}e_{1}^{T}(Aw_{1}\,+\,e_{1}b_{1}) \,\,+\,\, \frac{c_{1}}{m_{1}+m_{2}}e^{T}q\,\ \\&-\,\, \alpha _{1}^{T}(Y*(D_{1}w_{1}\,+\,eb_{1})\,+\,q\,-\,p_{1}) \,\,-\,\, \beta _{1}^{T}q.\\ \end{aligned} \end{aligned}$$
(12)

Here \(\alpha _{1}\) and \(\beta _{1}\) are the Lagrange multipliers. The Karush-Kuhn-Tucker (KKT) necessary and sufficient conditions for Eq. (11) are given by:

$$\begin{aligned}&\frac{\partial L_{1} }{\partial w_{1}}\,\,=\,\, \frac{v_{1}}{m_{1}}A^{T}e_{1} \,+\, w_{1} \,-\,(Y*D_{1})^{T}\alpha _{1}\,=\,0\,\,\Rightarrow \,\, w_{1} \,=\, (Y*D_{1})^{T}\alpha _{1} \,-\, \frac{v_{1}}{m_{1}}A^{T}e_{1}, \end{aligned}$$
(13)
$$\begin{aligned}&\frac{\partial L_{1} }{\partial b_{1}}\,\,=\,\,\frac{ v_{1}}{m_{1}}e_{1}^{T}e_{1}\,-\,(Y*e)^{T}\alpha _{1}\,=\,0 \,\,\Rightarrow \alpha _{1}^{T}Y\,=\,v_{1}, \end{aligned}$$
(14)
$$\begin{aligned}&\frac{\partial L_{1} }{\partial q}\,\,=\,\, \frac{c_1}{m_{1}+m_{2}}e \,-\, \alpha _{1} -\,\beta _{1}\,=\,0\,\, \Rightarrow \,\, \frac{c_1}{m_{1}+m_{2}}e\,=\,\alpha _{1}\,+\,\beta _{1}, \end{aligned}$$
(15)
$$\begin{aligned}&\alpha _{1}^{T}(Y*(D_{1}w_{1}\,+\,eb_{1})\,+\,q\,-\,p_{1})\,=\,0,\, \alpha _{1} \,\ge \, 0, \end{aligned}$$
(16)
$$\begin{aligned}&\beta _{1}^{T}q\,=\,0,\,\,\beta _{1}\,\ge \,0. \end{aligned}$$
(17)

Substituting Eqs. (13)–(17) into Eq. (12), we obtain the dual of the primal problem (11) as follows:

$$\begin{aligned} \begin{aligned}&\; \underset{\alpha _{1}}{min}\quad \frac{\alpha _{1}^{T}Q_{1}\alpha _{1}}{2} \,+\, F_{1}^{T}\alpha _{1}, \\&\quad s.t. \quad 0 \,\le \, \alpha _{1} \, \le \, \frac{c_{1}}{m_{1}+m_{2}}e,\\&\quad \quad \quad \alpha _{1}^{T}Y\,=\,v_{1}. \end{aligned} \end{aligned}$$
(18)

Here \(Q_{1} = (Y*D_{1})(Y*D_{1})^{T}\) and \(F_{1} = -(\frac{v_{1}}{m_1}(Y*D_{1})A^{T}e_{1}\,+\,p_{1})\).

Once the optimal \(\alpha _{1}\) is obtained, \(w_{1}\) follows from Eq. (13), and \(b_{1}\) for the required hyperplane is recovered by substituting \(w_{1}\) into Eq. (16) as:

$$\begin{aligned} \begin{aligned}&b_{1} = \frac{1}{m_{1}+m_{2}}e^{T}(diag(Y*p_{1}) - D_{1}w_{1}).\\ \end{aligned} \end{aligned}$$
(19)

Note that the diag() function here returns the diagonal elements of a square matrix as a column vector. Hence the required hyperplane is \(J_{1}(w_{1},b_{1}) : w_{1}^Tx + b_{1} = 0\).

Similarly, the hyperplane for class B can be obtained by solving the following dual problem:

$$\begin{aligned} \begin{aligned}&\underset{\alpha _{2}}{min} \,\,\,\, \frac{\alpha _{2}^{T}Q_{2}\alpha _{2}}{2} \,+\, (F_{2})^{T}\alpha _{2}, \\&\quad s.t. \quad 0 \,\le \, \alpha _{2} \, \le \, \frac{c_{1}}{m_{1}+m_{2}}e,\\&\quad \quad \quad \alpha _{2}^{T}{\bar{Y}}\,=\,v_{1}. \end{aligned} \end{aligned}$$
(20)

Here \(Q_{2} = ({\bar{Y}}*D_{2})({\bar{Y}}*D_{2})^{T}\) and \(F_{2} = -(\frac{v_{1}}{m_2}({\bar{Y}}*D_{2})B^{T}e_{2}\,+\,p_{2})\).

\(p_{2}=[0*e_{2}; \epsilon *e_{1}]\), \({\bar{Y}} = -Y\), and we redefine \(D_{2}=[B;A]\).

$$\begin{aligned} \begin{aligned}&w_{2} \,=\, ({\bar{Y}}*D_{2})^{T}\alpha _{2} \,-\, \frac{v_{1}}{m_{2}}B^{T}e_{2},\\&b_{2} = \frac{1}{m_{1}+m_{2}}e^{T}(diag({\bar{Y}}*p_{2}) - D_{2}w_{2}).\\ \end{aligned} \end{aligned}$$
(21)

Hence the required hyperplane for class B is \(J_{2}(w_{2},b_{2}) : w_{2}^{T}x + b_{2} = 0\). A test sample x can then be annotated according to the rule in Eq. (5).

3.2.2 Nonlinear model

In this section we extend the linear model to the nonlinear case by working directly with kernel matrices. The nonlinear model solves the following optimisation problems:

$$\begin{aligned} \begin{aligned}&\underset{w_{1},b_{1},q}{min} \quad \frac{1}{2}\left\Vert w_{1}\right\Vert ^{2}_{2} \,\,+\,\, \frac{v_{1}}{m_{1}}e_{1}^{T}(K(A,D_{1})w_{1}\,+\,e_{1}b_{1})\,\,+\,\, \frac{c_{1}}{m_{1}+m_{2}}e^{T}q, \\&s.t. \quad Y*(K(D_{1},D_{1})w_{1}\,+\,eb_{1}) \,\,+\,\, q\,\, \ge \,\, p_{1}, \\&\quad \quad \quad q \,\,\ge \,\,0 \,. \end{aligned} \end{aligned}$$
(22)
$$\begin{aligned} \begin{aligned}&\underset{w_{2},b_{2},q'}{min} \quad \frac{1}{2}\left\Vert w_{2}\right\Vert ^{2}_{2} \,\,+\,\, \frac{v_{1}}{m_{2}}e_{2}^{T}(K(B,D_{2})w_{2}\,+\,e_{2}b_{2})\,\,+\,\, \frac{c_{1}}{m_{1}+m_{2}}e^{T}q',\\&s.t. \quad {\bar{Y}}*(K(D_{2},D_{2})w_{2}\,+\,eb_{2}) \,\,+\,\, q'\,\, \ge \,\, p_{2}, \\&\quad \quad \quad q' \,\,\ge \,\,0. \end{aligned} \end{aligned}$$
(23)

Note that we have used the rectangular kernel technique; as a consequence, the linear and nonlinear formulations are not uniform.

Similar to the linear case we can obtain the dual of the primal problem Eq. (22) which is as follows:

$$\begin{aligned} \begin{aligned}&\underset{\alpha _{1}}{min} \,\,\,\, \frac{\alpha _{1}^{T}\hat{Q_{1}}\alpha _{1}}{2} \,+\, {\hat{F}}^{T}\alpha _{1}, \\&\quad s.t. \quad 0 \,\le \, \alpha _{1} \, \le \, \frac{c_{1}}{m_{1}+m_{2}}e,\\&\quad \quad \quad \alpha _{1}^{T}Y\,=\,v_{1}. \end{aligned} \end{aligned}$$
(24)

Here \(\hat{Q_{1}} = (Y*\hat{D_{1}})(Y*\hat{D_{1}})^{T}\), \({\hat{F}} = -(\frac{v_{1}}{m_1}(Y*\hat{D_{1}})K(A,D_{1})^{T}e_{1}\,+\,p_{1})\) and \(\hat{D_{1}}=K(D_{1},D_{1})\).

Solving the above Eq. (24), we get:

$$\begin{aligned} \begin{aligned}&w_{1} = (Y*\hat{D_{1}})^{T}\alpha _{1} - \frac{v_{1}}{m_{1}}K(A,D_{1})^{T}e_{1},\\&b_{1} = \frac{1}{m_1\,+\,m_2}e^{T}(diag(Y*p_{1})-\hat{D_{1}}w_{1}). \end{aligned} \end{aligned}$$
(25)

Similarly we can obtain the dual of the primal problem Eq. (23) which is as follows:

$$\begin{aligned} \begin{aligned}&\underset{\alpha _{2}}{min} \,\,\,\, \frac{\alpha _{2}^{T}\hat{Q_{2}}\alpha _{2}}{2} \,+\, \hat{F_{2}}^{T}\alpha _{2}, \\&\quad s.t. \quad 0 \,\le \, \alpha _{2} \, \le \, \frac{c_{1}}{m_{1}+m_{2}}e,\\&\quad \quad \quad \alpha _{2}^{T}{\bar{Y}}\,=\,v_{1}. \end{aligned} \end{aligned}$$
(26)

Here \(\hat{Q_{2}} = ({\bar{Y}}*\hat{D_{2}})({\bar{Y}}*\hat{D_{2}})^{T}\), \(\hat{F_{2}} = -(\frac{v_{1}}{m_2}({\bar{Y}}*\hat{D_{2}})K(B,D_{2})^{T}e_{2}\,+\,p_{2})\) and \(\hat{D_{2}}=K(D_{2},D_{2})\).

Solving the above Eq. (26) we get:

$$\begin{aligned} \begin{aligned}&w_{2} = ({\bar{Y}}*\hat{D_{2}})^{T}\alpha _{2} - \frac{v_{1}}{m_{2}}K(B,D_{2})^{T}e_{2},\\&b_{2} = \frac{1}{m_1\,+\,m_2}e^{T}(diag({\bar{Y}}*p_{2})-\hat{D_{2}}w_{2}). \end{aligned} \end{aligned}$$
(27)

A test sample x can be annotated according to the kernelised version of rule Eq. (5), which is as follows:

$$\begin{aligned} y = sign( abs(K(x,D)*w_{2}\,+\,b_{2}) - abs(K(x,D)*w_{1}\,+\,b_{1}) ). \end{aligned}$$
(28)
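Below is a minimal sketch of rule (28) for a single test row x, assuming a Gaussian kernel with parameter rho and that \((w_{1},b_{1})\) and \((w_{2},b_{2})\) were trained against the kernel matrices built from \(D_{1}=[A;B]\) and \(D_{2}=[B;A]\), respectively.

```matlab
% Kernelised decision rule (28) for one test sample x (1 x N row vector)
k1 = exp(-rho*sum((D1 - x).^2, 2))';    % K(x, D1), D1 = [A; B]  (R2016b+ expansion)
k2 = exp(-rho*sum((D2 - x).^2, 2))';    % K(x, D2), D2 = [B; A]
yhat = sign(abs(k2*w2 + b2) - abs(k1*w1 + b1));
```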

3.3 Algorithm

The algorithm for the proposed model is as follows:

(Algorithm listings: figure a and figure b.)
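Since the algorithm listings are given as figures, we also provide a compact MATLAB sketch of the linear training procedure, packaging the dual solve of Eq. (18) for both hyperplanes. The function name pnsvm_linear_train and its structure are ours and purely illustrative; the small ridge on the Gram matrix, in particular, is an added numerical safeguard and not part of the formulation.

```matlab
function [w1, b1, w2, b2] = pnsvm_linear_train(A, B, v1, c1, eps1)
% Sketch of linear PN-SVM training via the duals (18) and (20).
[w1, b1] = solve_one(A, B, v1, c1, eps1);      % hyperplane J1, proximal to class A
[w2, b2] = solve_one(B, A, v1, c1, eps1);      % hyperplane J2, proximal to class B (by symmetry)
end

function [w, b] = solve_one(Xp, Xn, v1, c1, eps1)
% Xp: own-class patterns (treated with label +1), Xn: opposite-class patterns
mp = size(Xp, 1);  mn = size(Xn, 1);  n = mp + mn;
D  = [Xp; Xn];
y  = [ones(mp, 1); -ones(mn, 1)];
p  = [zeros(mp, 1); eps1*ones(mn, 1)];
G  = y .* D;                                   % rows of D scaled by labels, i.e. Y*D
Q  = G*G' + 1e-8*eye(n);                       % Gram matrix; tiny ridge for stability (our choice)
F  = -((v1/mp)*(G*sum(Xp, 1)') + p);           % sum(Xp,1)' plays the role of A'*e1
% bounds 0 <= alpha <= c1/n and equality sum_i y_i*alpha_i = v1
alpha = quadprog(Q, F, [], [], y', v1, zeros(n, 1), (c1/n)*ones(n, 1));
w  = G'*alpha - (v1/mp)*sum(Xp, 1)';           % Eq. (13) / (21)
b  = mean(y.*p - D*w);                         % Eq. (19) / (21)
end
```

A test sample is then labelled with the rule of Eq. (5), e.g. yhat = sign(abs(X*w2 + b2) - abs(X*w1 + b1)).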

3.4 Connection with existing approaches

In this section, we briefly describe how our proposed model shares its connection with the existing approaches.

3.4.1 L\(_1\)-norm TSVM versus PN-SVM

In the L\(_1\)-norm TSVM, the first set of constraints is an equality (Eq. 6), so that the hyperplane is centred on class A. In our case, we do not keep the hyperplane centred on class A; instead, we keep it close to the boundary of class A while minimising the class A projections. This modification not only makes our model sparse but also allows the hyperplane to act as a marginal/bounding or even a proximal hyperplane.

3.4.2 SVM versus PN-SVM

Our proposed model has a formulation similar to that of SVM; the differences are the margin and the second term introduced into the optimisation Eq. (9). If the mean of the class A projections is ignored, Eq. (9) is simply the bounding-hyperplane problem for class A in SVM. This change in the optimisation makes the model capture the variance of the respective class, hence the name PN-SVM.

3.4.3 Pin-SVM versus PN-SVM

The introduction of the pinball loss removes sparsity from SVM. To bring it back, \(\epsilon\) zone-based Pin-SVM models (Rastogi et al., 2018) were developed, which in turn makes the pinball loss models complicated. Explicitly adding the mean of the class projections to the objective makes this process much simpler and more time-efficient. Along similar lines, our model can be regarded as an improved, sparse version of Pin-TSVM.

3.4.4 MCM versus PN-SVM

Minimal complexity machines (MCM) (Jayadeva, 2015) also share homology with the Pin-SVM models. Our model can be viewed as a two-hyperplane extension of MCM, as it minimises the scatter of the two classes separately. However, it does not bear the same time-efficiency relationship to MCM that TSVM bears to SVM; this can be a subject of further research.

3.4.5 TPMSVM versus PN-SVM

TPMSVM extends the idea of the parametric-margin SVM (Hao, 2010) by solving two independent QPPs. It does not implement the SRM principle, which leads to sub-optimal results in certain cases. Like TPMSVM, PN-SVM also models marginal hyperplanes, but it is not limited to them: with a suitable choice of parameters it can also model proximal hyperplanes. Moreover, PN-SVM takes the variance of the respective classes into account while implementing the SRM principle and maintaining sparsity. In TPMSVM the mean of the opposite class appears in the objective to model separation from the opposite-class patterns, whereas in PN-SVM the mean of the respective class is used to model proximity to its own class patterns. On these grounds, PN-SVM can also be regarded as an improved version of TPMSVM. For more details on TSVM-based models we refer the reader to a recently published survey (Tanveer et al., 2022).

3.5 Computation complexity

In this section, we compare the computational cost of the proposed PN-SVM with that of the related algorithms. During the training phase, like the conventional SVM, the proposed model needs to find the coefficients of n constraints when solving each dual QPP; since it solves two such problems, its cost is about twice that of SVM, i.e. \(2{\mathcal {O}}(n^{3})\). TSVM optimises a pair of smaller QPPs with approximately n/2 constraints each, giving a computational complexity of \(2{\mathcal {O}}(\frac{n^{3}}{8})\), roughly four times faster than SVM, although it incurs some extra cost for inverting a pair of matrices. TPMSVM likewise only needs to solve two small QPPs and has complexity \(2{\mathcal {O}}(\frac{n^{3}}{8})\). The pinball loss models have 2n constraints and therefore a cost of order \({\mathcal {O}}({(2n)^{3}})\), except for Pin-TSVM, whose cost is approximately \(2{\mathcal {O}}({n^{3}})\) provided the dataset is balanced (i.e. \(m_1=m_2\)). Note that PN-SVM is therefore theoretically four times faster than Pin-SVM.

During the testing phase, since both PN-SVM and SVM are sparse, their computational cost depends only on the non-zero coefficients (the support vectors, sv), that is, \({\mathcal {O}}(sv)\). Generally \(sv<<n\), so due to sparsity these models are faster in the testing phase. For TSVM and the other non-sparse models, the testing complexity is simply \({\mathcal {O}}(n)\).

4 Experiments

The experiments are performed on well-known, diverse datasets using ten-fold cross-validation in MATLAB (Matlab, 2012) version 9.4 under a Microsoft Windows environment on a machine with a 3.40 GHz i7 CPU and 16 GB RAM. The optimal values of the user-defined parameters of the different models are obtained by fine-tuning on a validation set built from ten percent of the training data; once the parameters are chosen, the validation set is returned to the training data for retraining. All datasets are normalised to the range \([-1,1]\). All comparing models are implemented in MATLAB, and the quadprog() function is used for all of them to ensure a uniform comparison. For all two-hyperplane models, the parameters of both hyperplanes are kept the same. For measuring the sparsity of the sparse models we use the Gini index; Hurley and Rickard (2009) show that the Gini index is well suited to evaluating sparsity. Note that the lower the Gini index, the better the sparsity of the model. TSVM, TPMSVM, Pin-TSVM and Pin-SVM are non-sparse, so their Gini index is reported as 1 without any calculation. Training time is measured in seconds with MATLAB's tic-toc; for uniformity's sake, we measure only the time taken by the quadprog call of each algorithm. The best testing accuracy and Gini index are highlighted in bold, and training accuracy is reported to check for overfitting. We use the Gaussian kernel (\(K(x_1,x_2) = e^{-\rho \left\Vert x_1-x_2\right\Vert ^{2}}\)) for all kernelised models, with \(\rho\) tuned in the range [0.1 : 0.1 : 1].
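For reference, the two standard preprocessing pieces mentioned above, feature-wise scaling to \([-1,1]\) and the Gaussian kernel, can be written in a few lines of MATLAB; this is only a generic sketch, not the authors' exact code.

```matlab
% Feature-wise min-max scaling of a data matrix X (n x N) to [-1, 1]
Xn = 2*(X - min(X, [], 1)) ./ (max(X, [], 1) - min(X, [], 1)) - 1;   % constant features need care

% Gaussian kernel matrix between row-wise sample sets X1 and X2
sqd = sum(X1.^2, 2) + sum(X2.^2, 2)' - 2*(X1*X2');   % squared pairwise distances
K   = exp(-rho*sqd);
```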

For our proposed model the setting of the parameters is very important. We tune \(v_{1}\) and \(c_{1}\) in the ranges [0.1 : 0.1 : 0.9] and [1 : 1 : 10], respectively. For brevity, we select the same values for the second hyperplane; however, the parameters of the two hyperplanes can be tuned separately for better accuracy. Note that \(c_{1}\) should be sufficiently larger than \(v_{1}\) (i.e. \(c_{1} > v_{1}\)).

For all our experiments we fix \(\epsilon =0.1\), simply to avoid a trivial solution to the optimisation.

4.1 Illustration on toy dataset

To show how our proposed model works, we plot it on a synthetic two-dimensional dataset. The synthetic data are generated using MATLAB's mvnrnd function. We take the mean of class A as \([0.1, -1.8]^{T}\) and the mean of class B as \([-0.1, 1.8]^{T}\); the covariance matrices are [0.3, 0.2; 0.2, 0.3] and \([0.3, -0.2; -0.2, 0.3]\) for the two classes, respectively, and each class contains 200 samples. Referring to Fig. 1, it is easy to observe the differences among the various two-hyperplane classifiers. Our proposed model does not run through the centre of the respective class; rather, it runs along the margin of the class such that the sum of the projections of that class is minimised. Most of the samples lie on one side of the hyperplane, but the trade-off between the second and third terms in Eq. (9) forces some noisy samples to lie on the opposite side of the hyperplane (note that we have kept the parameter of the third term higher than that of the second term).
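The toy data of Fig. 1 can be regenerated in a few lines; the means, covariances and sample sizes below are exactly those stated above, while the random seed is our own addition for reproducibility.

```matlab
rng(1);                                        % seed added here only for reproducibility
muA = [ 0.1, -1.8];   SA = [0.3,  0.2;  0.2, 0.3];
muB = [-0.1,  1.8];   SB = [0.3, -0.2; -0.2, 0.3];
A = mvnrnd(muA, SA, 200);                      % 200 class A samples
B = mvnrnd(muB, SB, 200);                      % 200 class B samples
X = [A; B];
y = [ones(200, 1); -ones(200, 1)];
```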

The separation can be made sharper by significantly increasing the value of \(c_{1}\), which allows the hyperplane to focus more on the separation between the classes. In fact, if \(c_{1}>>v_{1}\), the hyperplanes become parallel and PN-SVM effectively behaves as the conventional SVM.

Figure 2 shows how increasing the value of \(c_{1}\) changes the hyperplanes and the results of PN-SVM; it is therefore important to select the parameters correctly.

Fig. 1 Panels, from top left to bottom right: TSVM, TPMSVM, Pin-TSVM, and PN-SVM

Fig. 2 Plots showing the effect of \(c_{1}\) on PN-SVM

4.2 Illustration on noise-corrupted datasets

Figure 3 shows the effect of noise on the classification accuracy of the two comparable models PN-SVM and Pin-SVM. For a clearer comparison, the natural log of the testing accuracy is reported. Datasets are corrupted with noise generated by MATLAB's mvnrnd function, using the dataset mean and covariance as its inputs; a given percentage of the samples is then corrupted with this noise. The plots show that PN-SVM behaves approximately like Pin-SVM in the presence of noise; moreover, a comparison of their optimisations also points towards a functional analogy.
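A sketch of this corruption procedure is given below; the text does not specify the exact mixing, so the additive form and the helper name corrupt_with_noise are assumptions made for illustration.

```matlab
function Xc = corrupt_with_noise(X, frac)
% Corrupt a fraction of the samples with multivariate Gaussian noise whose
% mean and covariance are estimated from the data (additive form assumed).
n     = size(X, 1);
idx   = randperm(n, round(frac*n));            % rows to corrupt
noise = mvnrnd(mean(X, 1), cov(X), numel(idx));
Xc    = X;
Xc(idx, :) = Xc(idx, :) + noise;
end
```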

Fig. 3 Comparison between PN-SVM and Pin-SVM for varying percentages of noise on the datasets

4.3 Datasets

The datasets are selected with diversity of size in mind. The characteristics of the datasets (instances \(\times\) features) are described alongside the results. The datasets are obtained from the UCI machine learning repository (Asuncion & Newman, 2007) and Gunnar Rätsch's repository (Diethe, 2015).

4.4 Results for linear models

Table 1 shows the linear results for our proposed PN-SVM against the comparing models. PN-SVM needs more training time than TSVM and SVM, but it offers comparable performance and in many cases higher accuracy. Unlike TSVM, PN-SVM introduces sparsity into its framework, comparable to that of SVM and in many cases superior.

In comparison with the pinball loss models, i.e. Pin-SVM, Pin-TSVM and Mod-Pin-SVM, the time efficiency of PN-SVM is clearly visible; since the pinball loss adds another set of constraints, those models are slower. Our model is not the first attempt to introduce sparsity into pinball-loss based models, but it can be regarded as a better and simpler one in terms of training time, accuracy and sparsity, with superior sparsity compared with Mod-Pin-SVM. PN-SVM inherits sparsity from the SVM framework, while noise resilience is granted by minimising the mean of the respective class projections.

Table 1 Linear results on datasets

4.5 Results for NDC datasets

To clearly demonstrate the effect of dataset size on the training time of the linear classifiers, we conduct experiments on the NDC datasets (Musicant, 1998). The characteristics of these datasets are described in Table 2 as samples \(\times\) dimension \(\times\) imbalance, where imbalance is defined as the ratio of the number of positively labelled samples to the number of negatively labelled samples. Table 2 shows the linear results for PN-SVM against the comparing models. The dimension of the NDC datasets is fixed (at 32) while the sample size is varied, and the model parameters are fixed for all datasets so as to study how the training time changes with increasing data size. As the imbalance ratio increases, Pin-TSVM becomes more time-consuming; indeed, Fig. 4 shows that its curve grows the fastest among all the models. It is also clear that the pinball loss models are very heavy on training time compared with our proposed PN-SVM. Among SVM, TSVM and PN-SVM, TSVM is the fastest, followed by SVM and then PN-SVM. It is therefore important to improve upon the time requirements of the pinball models.

Table 2 Linear results on NDC datasets
Fig. 4 Time comparison on NDC datasets

4.6 Results for non-linear models

Table 3 shows the kernelised results for our proposed PN-SVM against the comparing models. Due to the higher computational requirements of the pinball models, we experiment on medium-sized datasets. PN-SVM produces comparable results and achieves superior sparsity. As in the linear case, its training time is higher than that of SVM and TSVM but still lower than that of the pinball models.

Table 3 Non-linear results on datasets

4.7 Statistical analysis

To evaluate the results further, we use a ranking criterion. Ranks are allotted to each algorithm according to its performance with respect to a metric on a particular dataset: the best model is ranked 1, the second best is ranked 2, and so on; algorithms with identical results share the same rank. The mean rank, obtained by averaging the ranks over all datasets for a given metric, is reported in Tables 4 and 5. The proposed PN-SVM achieves the best rank among the comparing algorithms for both the linear and the non-linear classifiers.
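The ranking scheme can be computed as below; acc is a hypothetical (#datasets \(\times\) #algorithms) matrix of test accuracies, and ties receive the same (dense) rank, matching the description above.

```matlab
% acc: hypothetical (#datasets x #algorithms) matrix of test accuracies
[nd, na] = size(acc);
ranks = zeros(nd, na);
for i = 1:nd
    [~, ~, r] = unique(-acc(i, :));   % unique sorts ascending, so the highest accuracy gets rank 1
    ranks(i, :) = r(:)';              % ties share the same (dense) rank
end
meanRank = mean(ranks, 1);            % values of the kind reported in Tables 4 and 5
```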

Table 4 Ranks with linear classifiers for datasets
Table 5 Ranks with non-linear classifiers for datasets

5 Conclusions

In this paper, we have proposed the PN-SVM model for binary classification, which effectively addresses the problems of sparsity and noise resilience by reformulating the SVM optimisation framework to minimise the average projections of the respective classes. The results on benchmark datasets demonstrate the efficacy of the proposed model against the comparing approaches. Furthermore, the proposed model is faster and much simpler than the pinball loss models. Since PN-SVM exhibits homology to SVM, fast SVM-type approaches can be adapted to it with little modification. Finally, it would be interesting to develop an extension of the model for multiclass classification.