1 Introduction

A support vector machine (SVM) (Bradley and Mangasarian 2000; Cortes and Vapnik 1995; Liu et al. 2002; Tian and Huang 2000; Vapnik 1995) plays a critical role in data classification and regression analysis. It operates under the constraint that the two support planes are parallel, and maximum-margin classification is implemented by solving quadratic programming problems (QPPs). Because it is based on structural risk minimization and the Vapnik–Chervonenkis dimension, SVM has good generalization performance. SVM has been extensively used in many practical problems, such as image classification (Song et al. 2002), scene classification (Yin et al. 2015), fault diagnosis (Muralidharan et al. 2014) and bioinformatics (Subasi 2013).

The advantages of SVM are remarkable, but deep learning and shallow approaches such as random forests give competitive results on several application domains. In particular, deep learning may outperform SVM when handling large samples, and SVM may incur a heavier computational burden than random forests on the same samples, so SVM still has considerable room for improvement. There are two main shortcomings: exclusive-OR (XOR) problems cannot be handled smoothly (Mangasarian and Wild 2006), and the QPPs suffer from high computational complexity (Chang and Lin 2011; Deng et al. 2012). To alleviate these problems, Mangasarian and Wild proposed a proximal support vector machine via generalized eigenvalues (GEPSVM), built on the concept of the proximal support vector machine (PSVM) (Fung and Mangasarian 2001), for binary classification problems (Mangasarian and Wild 2006). According to the geometric interpretation of GEPSVM, the numerator should be as small as possible and the denominator as large as possible to minimize the objective function value. GEPSVM relaxes the requirement of PSVM that the planes be parallel and can solve the XOR problem smoothly. Moreover, GEPSVM finds two nonparallel planes by solving a pair of generalized eigenvalue problems instead of complex QPPs, which reduces the computation time and improves the generalization ability over that of PSVM (Mangasarian and Wild 2006). These advantages of GEPSVM have played an important role in subsequent improvements (Guarracino et al. 2007; Shao et al. 2013, 2014; Ye and Ye 2009). However, it should be noted that GEPSVM and its variants are sensitive to outliers because the squared L2-norm distance exaggerates the effect of outliers through the square operation (Kwak 2008), which reduces classification performance. Outliers are defined as data points that deviate significantly from the majority of the data points or that do not follow the regular distribution of the data (Wang et al. 2014b). In view of this limitation, many studies have sought to improve the robustness of machine learning models by using the L1-norm distance (Gao et al. 2011; Li et al. 2015a, b; Wang et al. 2014a, b; Ye et al. 2016, 2017). To promote robustness, Li et al. (2015a) reformulated the optimization problems of a nonparallel proximal support vector machine using the L1-norm distance (L1-NPSVM). To solve the formulated objective, a gradient ascent (GA) iterative algorithm was proposed, which is simple to execute but may not guarantee the optimality of the solution, owing both to the need to introduce a non-convex surrogate function and to the difficulty of selecting the step size (Kwak 2014). Wang et al. (2014a) optimized Fisher linear discriminant analysis (LDA) by using the L1-norm distance instead of the conventional L2-norm distance; this optimized LDA is denoted LDA-L1. The L1-norm distance makes LDA-L1 robust to outliers, and LDA-L1 does not suffer from the small-sample-size and rank-limit problems of traditional LDA. Nevertheless, LDA-L1 also relies on a gradient ascent iterative algorithm, which suffers from the difficulty of choosing the step size.

As a successful improvement of GEPSVM, Jayadeva et al. proposed the twin support vector machine (TWSVM) (Jayadeva and Chandra 2007) based on the concept of GEPSVM. TWSVM solves two QPPs (each relatively small compared with the single QPP of standard SVM) in place of the generalized eigenvalue problems (Mangasarian and Wild 2006). As TWSVM inherits the advantages of GEPSVM, it can handle the XOR problem smoothly. Research on TWSVM is still in its infancy, and many improved methods have been developed based on the concept of TWSVM, such as smooth TWSVM (Kumar and Gopal 2008), localized TWSVM (LCTSVM) (Ye et al. 2011b), twin bounded SVM (TBSVM) (Shao et al. 2011), and robust TWSVM (R-TWSVM) (Qi et al. 2013a). Ye et al. (2011a) introduced a regularization technique for optimizing TWSVM and proposed a feature selection method for TWSVM via this technique (RTWSVM), which is a convex programming problem, to overcome the possible singularity problem and improve the generalization ability. Kumar and Gopal (2009) reformulated the optimization problems of TWSVM by replacing the inequality constraints with equalities, modifying the primal QPPs in a least-squares sense, and proposed a least-squares version of TWSVM (LSTSVM). The solutions of LSTSVM follow directly from solving two systems of linear equations rather than two QPPs. Therefore, LSTSVM handles large samples effectively without any external optimization, and its computational cost is much lower than that of TWSVM. Qi et al. (2013b) optimized TWSVM by exploiting the structural information of the data, which may contain useful prior domain knowledge for training the classifier, and proposed a new structural TWSVM (S-TWSVM). S-TWSVM utilizes two hyperplanes to decide the category of new data, and each model considers the structural information of only one class. Each plane is closer to one of the two classes and as far away as possible from the other class. This allows S-TWSVM to fully exploit prior knowledge to directly improve its generalization ability.

It is worth noting that TWSVM and its variants are also sensitive to outliers. The L1-norm distance is more robust to outliers than the squared L2-norm distance in distance metric learning (Cayton and Dasgupta 2006; Ke and Kanade 2005; Li et al. 2015a; Lin et al. 2015; Pang et al. 2010; Wang et al. 2012, 2014a; Zhong and Zhang 2013). Using the L1-norm distance is a simple and effective way to reduce the impact of outliers (Li et al. 2015b; Wang et al. 2014a) and can improve the generalization ability and flexibility of the model, as with L1-NPSVM and LDA-L1. Following the same motivation as these prior studies, we propose replacing the squared L2-norm distance in TWSVM with the robust L1-norm distance; the resulting classifier is called L1-TWSVM. L1-TWSVM seeks two nonparallel optimal planes by solving two QPPs. Its optimization goal is to minimize the intra-class distance while maximizing the inter-class distance. Moreover, L1-TWSVM seamlessly integrates the merits of TWSVM with those of the robust L1-norm distance metric, which improves classification performance and robustness. In summary, this paper makes the following contributions: (1) An iterative algorithm is presented to solve the L1-norm distance optimization problems. The iterative optimization technique is simple and convenient to implement, and we theoretically prove that the objective function value of L1-TWSVM is reduced at each iteration, so convergence to a local optimal solution is guaranteed. (2) In L1-TWSVM, the conventional squared L2-norm distance is replaced by the more robust L1-norm distance, which reduces the effect of outliers; L1-TWSVM can efficiently decrease the impact of outliers even when the proportion of outliers is large. (3) The proposed method is compared with relevant algorithms (SVM, GEPSVM, TWSVM, LSTSVM and L1-NPSVM) on both synthetic and UCI datasets. Extensive experimental results confirm that L1-TWSVM and L1-NPSVM effectively reduce the effect of outliers, which improves the generalization ability and flexibility of the model. (4) The proposed method can be conveniently extended to other improved methods based on TWSVM.

The remainder of this paper is organized as follows. Section 2 briefly introduces GEPSVM and TWSVM. Section 3 proposes L1-TWSVM, discusses its feasibility and presents the theoretical analysis. All the experimental results are shown in Sect. 4, and conclusions are presented in Sect. 5.

2 Related works

In this paper, all vectors are column vectors unless a superscript T, which denotes transposition, is present. We use bold uppercase letters to represent matrices and bold lowercase letters to represent vectors. The vectors \( {\mathbf{e}}_{1} \) and \( {\mathbf{e}}_{2} \) are column vectors of ones of appropriate dimensions, and \( {\mathbf{I}} \) denotes an identity matrix of appropriate dimension. We consider a binary classification problem in the \( n \)-dimensional real space \( R^{n} \), with the dataset denoted by \( {\mathbf{T}} = \left\{ {\left( {{\mathbf{x}}_{j}^{\left( i \right)} ,y_{i} } \right)\left| {i = 1,2,\;j = 1,2, \ldots ,m_{i} } \right.} \right\} \), where \( {\mathbf{x}}_{j}^{\left( i \right)} \in R^{n} \), \( y_{i} \in \{ - 1,\;1\} \), and \( {\mathbf{x}}_{j}^{\left( i \right)} \) denotes the j-th sample of the i-th class. The matrix \( {\mathbf{A}} = \left( {{\mathbf{a}}_{1}^{\left( 1 \right)} ,{\mathbf{a}}_{2}^{\left( 1 \right)} , \ldots ,{\mathbf{a}}_{{m_{1} }}^{\left( 1 \right)} } \right)^{T} \) of size \( m_{1} \times n \) collects the data points of Class 1 (Class +1), while the matrix \( {\mathbf{B}} = \left( {{\mathbf{b}}_{1}^{\left( 2 \right)} ,{\mathbf{b}}_{2}^{\left( 2 \right)} , \ldots ,{\mathbf{b}}_{{m_{2} }}^{\left( 2 \right)} } \right)^{T} \) of size \( m_{2} \times n \) collects the data points of Class 2 (Class -1); together, \( {\mathbf{A}} \) and \( {\mathbf{B}} \) represent all the data points, \( m_{1} \) is the number of positive samples, \( m_{2} \) is the number of negative samples, and \( m_{1} + m_{2} = m \). In the following, we review two well-known nonparallel proximal classifiers: GEPSVM (Mangasarian and Wild 2006) and TWSVM (Jayadeva and Chandra 2007).
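As a concrete illustration of this notation, the following minimal NumPy sketch (the names `X`, `y` and `split_classes` are ours, not from the paper) splits a labeled sample matrix into the class matrices \( {\mathbf{A}} \) and \( {\mathbf{B}} \) used throughout the rest of the paper.

```python
import numpy as np

def split_classes(X, y):
    """Split a labeled sample matrix into the class matrices A and B.

    X : (m, n) array of samples; y : (m,) array of labels in {+1, -1}.
    Returns A (m1 x n, Class +1) and B (m2 x n, Class -1).
    """
    A = X[y == 1]
    B = X[y == -1]
    return A, B
```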

2.1 GEPSVM

GEPSVM is an excellent classifier for binary classification problems and is widely used for pattern classification problems. The primary aim of GEPSVM is to find two nonparallel proximal planes

$$ {\mathbf{x}}^{T} {\mathbf{w}}_{1} + b_{1} = 0, \, {\mathbf{x}}^{T} {\mathbf{w}}_{2} + b_{2} = 0 $$
(1)

where \( {\mathbf{w}}_{1} , \, {\mathbf{w}}_{2} \in R^{n} \) and \( b_{1} , \, b_{2} \in R \). The geometric interpretation of GEPSVM is that each plane is closer to one of the two classes and as far away as possible from the other class. This produces the following two optimization problems of GEPSVM:

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}}_{1} ,b_{1} }} \frac{{||{\mathbf{Aw}}_{1} + {\mathbf{e}}_{1} b_{1} ||_{2}^{2} + \delta ||\left( {{\mathbf{w}}_{1}^{T} \;b_{1} } \right)^{T} ||_{2}^{2} }}{{||{\mathbf{Bw}}_{1} + {\mathbf{e}}_{2} b_{1} ||_{2}^{2} }} $$
(2)
$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}}_{2} ,b_{2} }} \frac{{||{\mathbf{Bw}}_{2} + {\mathbf{e}}_{2} b_{2} ||_{2}^{2} + \delta ||\left( {{\mathbf{w}}_{2}^{T} \;b_{2} } \right)^{T} ||_{2}^{2} }}{{||{\mathbf{Aw}}_{2} + {\mathbf{e}}_{1} b_{2} ||_{2}^{2} }} $$
(3)

where \( || \cdot ||_{2} \) denotes the L2-norm, \( \delta ||\left( {{\mathbf{w}}_{1}^{T} \;b_{1} } \right)^{T} ||_{2}^{2} \) is a Tikhonov regularization term, and \( \delta \) is a regularization factor. The regularization terms are introduced to address the possible singularity when solving the generalized eigenvalue problems, which improves the stability of GEPSVM. Then, optimization problems (2) and (3) become

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{1} }} \frac{{{\mathbf{z}}_{1}^{T} {\mathbf{Ez}}_{1} }}{{{\mathbf{z}}_{1}^{T} {\mathbf{Fz}}_{1} }} $$
(4)
$$ \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{2} }} \frac{{{\mathbf{z}}_{2}^{T} {\mathbf{Lz}}_{2} }}{{{\mathbf{z}}_{2}^{T} {\mathbf{Mz}}_{2} }} $$
(5)

where \( {\mathbf{H}} = \left( {{\mathbf{A}}\;\;{\mathbf{e}}_{1} } \right) \) and \( {\mathbf{G}} = \left( {{\mathbf{B}}\;\;{\mathbf{e}}_{2} } \right) \) are augmented matrices, \( {\mathbf{z}}_{1} = \left( {{\mathbf{w}}_{1}^{T} \;b_{1} } \right)^{T} \) and \( {\mathbf{z}}_{2} = \left( {{\mathbf{w}}_{2}^{T} \;b_{2} } \right)^{T} \) are augmented vectors, \( {\mathbf{E}} = {\mathbf{H}}^{T} {\mathbf{H}} + \delta {\mathbf{I}} \), \( {\mathbf{L}} = {\mathbf{G}}^{T} {\mathbf{G}} + \delta {\mathbf{I}} \), \( {\mathbf{F}} = {\mathbf{G}}^{T} {\mathbf{G}} \), and \( {\mathbf{M}} = {\mathbf{H}}^{T} {\mathbf{H}} \).

\( {\mathbf{E}}, \, {\mathbf{F}} \) and \( {\mathbf{L}}, \, {\mathbf{M}} \) are symmetric matrices in \( R^{{\left( {n + 1} \right) \times \left( {n + 1} \right)}} \). The objective functions in (4) and (5) are Rayleigh quotients (Parlett 1998) and therefore have well-known extremal properties. In particular, the solutions of (4) and (5) are obtained by solving the generalized eigenvalue problems

$$ {\mathbf{Ez}}_{1} = \lambda_{1} {\mathbf{Fz}}_{1} , \, {\mathbf{z}}_{1} \ne 0 $$
(6)
$$ {\mathbf{Lz}}_{\text{2}} = \lambda_{2} {\mathbf{Mz}}_{\text{2}} , \, {\mathbf{z}}_{\text{2}} \ne 0 $$
(7)

where the minimum of (4) is attained at an eigenvector corresponding to the smallest eigenvalue \( \lambda_{1} \) of (6). Consequently, if \( {\mathbf{z}}_{1} \) denotes the eigenvector corresponding to \( \lambda_{1} \), then the augmented vector \( {\mathbf{z}}_{1} = \left( {{\mathbf{w}}_{1}^{T} \;b_{1} } \right)^{T} \) determines the plane \( {\mathbf{x}}^{T} {\mathbf{w}}_{1} + b_{1} = 0 \), which is close to the data points of Class 1. Similarly, the augmented vector \( {\mathbf{z}}_{2} = \left( {{\mathbf{w}}_{2}^{T} \;b_{2} } \right)^{T} \) determines the plane \( {\mathbf{x}}^{T} {\mathbf{w}}_{2} + b_{2} = 0 \), which is close to the data points of Class 2.
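For concreteness, the following NumPy/SciPy sketch (our illustration, not the authors' code; the function names are assumptions) builds \( {\mathbf{H}} \) and \( {\mathbf{G}} \), forms the matrices in (4)–(5), and extracts the eigenvectors of (6)–(7) associated with the smallest generalized eigenvalues.

```python
import numpy as np
from scipy.linalg import eig

def gepsvm_planes(A, B, delta=1e-4):
    """Sketch of GEPSVM: solve the generalized eigenvalue problems (6)-(7).

    A: (m1, n) Class 1 samples, B: (m2, n) Class 2 samples,
    delta: Tikhonov regularization factor.
    Returns z1 = (w1; b1) and z2 = (w2; b2) defining the two planes.
    """
    m1, m2 = A.shape[0], B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])          # H = (A  e1)
    G = np.hstack([B, np.ones((m2, 1))])          # G = (B  e2)
    I = np.eye(H.shape[1])
    E, F = H.T @ H + delta * I, G.T @ G           # problem (4)
    L, M = G.T @ G + delta * I, H.T @ H           # problem (5)

    def smallest_eigvec(P, Q):
        # Generalized eigenproblem P z = lambda Q z; keep the eigenvector
        # associated with the smallest (real) eigenvalue.
        vals, vecs = eig(P, Q)
        idx = np.argmin(vals.real)
        return vecs[:, idx].real

    z1 = smallest_eigvec(E, F)                    # plane close to Class 1
    z2 = smallest_eigvec(L, M)                    # plane close to Class 2
    return z1, z2
```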

2.2 TWSVM

In this section, after the brief review of GEPSVM, we introduce TWSVM, an improved version of GEPSVM. To obtain the two planes, TWSVM solves two convex QPPs rather than the pair of generalized eigenvalue problems used by GEPSVM (Mangasarian and Wild 2006). The two objective functions of TWSVM are expressed as follows:

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{w}}_{1} ,b_{1} }} \frac{1}{2}||{\mathbf{Aw}}_{\text{1}} + {\mathbf{e}}_{\text{1}} b_{1} ||_{2}^{2} + c_{1} {\mathbf{e}}_{2}^{T} {\mathbf{q}}_{1} \\ & s.t. \, - \left( {{\mathbf{Bw}}_{\text{1}} + {\mathbf{e}}_{2} b_{1} } \right) + {\mathbf{q}}_{1} \ge {\mathbf{e}}_{2} , \, {\mathbf{q}}_{1} \ge 0 \\ \end{aligned} $$
(8)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{w}}_{2} ,b_{2} }} \frac{1}{2}||{\mathbf{Bw}}_{2} + {\mathbf{e}}_{2} b_{2} ||_{2}^{2} + c_{2} {\mathbf{e}}_{1}^{T} {\mathbf{q}}_{2} \\ & s.t. \, \left( {{\mathbf{Aw}}_{2} + {\mathbf{e}}_{1} b_{2} } \right) + {\mathbf{q}}_{2} \ge {\mathbf{e}}_{1} , \, {\mathbf{q}}_{2} \ge 0 \\ \end{aligned} $$
(9)

where \( || \cdot ||_{2} \) denotes the L2-norm, \( {\mathbf{q}}_{1} \) and \( {\mathbf{q}}_{2} \) are slack vectors, and \( c_{1} \) and \( c_{2} \) are nonnegative penalty coefficients; they balance the positive and negative samples, respectively, and can mitigate the problem of sample imbalance in TWSVM. Note that in TWSVM the distance is measured by the squared L2-norm, which is likely to exaggerate the effect of outliers through the square operation. The optimization strategy of TWSVM is that points of the same class should be clustered as compactly as possible around their plane while lying as far as possible from the data of the other class, which minimizes the objective function. By solving formulas (8) and (9), we obtain two nonparallel planes:

$$ {\mathbf{x}}^{T} {\mathbf{w}}_{1} + b_{1} = 0, \, {\mathbf{x}}^{T} {\mathbf{w}}_{2} + b_{2} = 0 $$
(10)

A new data point \( {\mathbf{x}} \) is assigned to Class 1 or Class 2 depending on its proximity to each of the two nonparallel planes. We can obtain the corresponding Wolfe dual problems of formulas (8) and (9):

$$ \begin{aligned} & \mathop {\hbox{max} }\limits_{{\varvec{\upalpha}}} {\mathbf{e}}_{2}^{T} {\varvec{\upalpha}} - \frac{1}{2}{\varvec{\upalpha}}^{T} {\mathbf{G}}\left( {{\mathbf{H}}^{T} {\mathbf{H}}} \right)^{ - 1} {\mathbf{G}}^{T} {\varvec{\upalpha}} \\ & s.t. \, 0 \le {\varvec{\upalpha}} \le c_{1} {\mathbf{e}}_{2} \\ \end{aligned} $$
(11)
$$ \begin{aligned} & \mathop {\hbox{max} }\limits_{{\varvec{\upbeta}} } {\mathbf{e}}_{1}^{T} {\varvec{\upbeta}} - \frac{1}{2}{\varvec{\upbeta}}^{T} {\mathbf{H}}\left( {{\mathbf{G}}^{T} {\mathbf{G}}} \right)^{ - 1} {\mathbf{H}}^{T} {\varvec{\upbeta}} \\ & s.t. \, 0 \le {\varvec{\upbeta}} \le c_{2} {\mathbf{e}}_{1} \\ \end{aligned} $$
(12)

where \( {\varvec{\upalpha}} \in R^{{m_{2} }} \) and \( {\varvec{\upbeta}} \in R^{{m_{1} }} \) are Lagrange multipliers. The two nonparallel planes can then be derived from \( {\varvec{\upalpha}} \) and \( {\varvec{\upbeta}} \):

$$ \begin{aligned} {\mathbf{z}}_{1} & = \left( {{\mathbf{w}}_{1}^{T} \;b_{1} } \right)^{T} = - \left( {{\mathbf{H}}^{T} {\mathbf{H}}} \right)^{ - 1} {\mathbf{G}}^{T} {\varvec{\upalpha}} \\ {\mathbf{z}}_{2} & = \left( {{\mathbf{w}}_{2}^{T} \;b_{2} } \right)^{T} = \left( {{\mathbf{G}}^{T} {\mathbf{G}}} \right)^{ - 1} {\mathbf{H}}^{T} {\varvec{\upbeta}} \\ \end{aligned} $$
(13)

Note that the matrices \( {\mathbf{H}}^{T} {\mathbf{H}} \) and \( {\mathbf{G}}^{T} {\mathbf{G}} \) in Eq. (13) may be singular or ill-conditioned. To prevent this, a regularization term \( \varepsilon {\mathbf{I}} \) is introduced, where \( \varepsilon \) is a positive scalar small enough to preserve the structure of the data (Jayadeva and Chandra 2007; Mangasarian and Wild 2006). Because \( {\mathbf{H}}^{T} {\mathbf{H}} + \varepsilon {\mathbf{I}} \) and \( {\mathbf{G}}^{T} {\mathbf{G}} + \varepsilon {\mathbf{I}} \) are positive definite, their inverses \( \left( {{\mathbf{H}}^{T} {\mathbf{H}} + \varepsilon {\mathbf{I}}} \right)^{ - 1} \) and \( \left( {{\mathbf{G}}^{T} {\mathbf{G}} + \varepsilon {\mathbf{I}}} \right)^{ - 1} \) do not suffer from the singularity problem.
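As an illustration of this computation, the sketch below (our own code, not the authors'; the simple projected-gradient routine is merely a stand-in for any box-constrained QP solver, and all function names are assumptions) solves the dual (11) and recovers \( {\mathbf{z}}_{1} \) through the regularized form of Eq. (13); the second plane follows analogously from (12).

```python
import numpy as np

def solve_box_qp(Q, e, c, iters=2000):
    """Maximize e^T a - 0.5 a^T Q a subject to 0 <= a <= c by projected gradient.

    A deliberately simple stand-in for a dedicated QP solver.
    """
    a = np.zeros_like(e)
    step = 1.0 / (np.linalg.eigvalsh(Q).max() + 1e-12)   # 1 / Lipschitz constant
    for _ in range(iters):
        a = np.clip(a + step * (e - Q @ a), 0.0, c)      # ascent step + projection
    return a

def twsvm_plane1(A, B, c1=1.0, eps=1e-6):
    """Sketch of the first TWSVM plane via the dual (11) and Eq. (13)."""
    m1, m2 = A.shape[0], B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])                 # H = (A  e1)
    G = np.hstack([B, np.ones((m2, 1))])                 # G = (B  e2)
    HtH_inv = np.linalg.inv(H.T @ H + eps * np.eye(H.shape[1]))  # regularized inverse
    alpha = solve_box_qp(G @ HtH_inv @ G.T, np.ones(m2), c1)     # dual (11)
    z1 = -HtH_inv @ G.T @ alpha                          # Eq. (13): z1 = (w1; b1)
    return z1[:-1], z1[-1]                               # w1, b1
```

In practice the projected-gradient helper can be replaced by any off-the-shelf QP solver; only the construction of the dual matrix and the recovery step (13) are specific to TWSVM.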

3 Efficient and robust TWSVM based on L1-norm distance

TWSVM has become a hotspot in the research of data classification due to its good classification performance. However, in TWSVM, the distance is measured by the L2-norm. It is well known that the squared L2-norm distance is sensitive to outliers, which implies that abnormal observations may affect the solution obtained by TWSVM. In the literature (Ding et al. 2006; Gao 2008; Kwak 2008; Li et al. 2015a; Nie et al. 2015; Wright et al. 2009), the L1-norm distance is usually considered as a robust alternative to the L2-norm distance for improving the generalization ability and flexibility of the model. Motivated by the basic idea of L1-norm-based modeling, we propose a robust classifier based on the L1-norm distance metric, which replaces the squared L2-norm distance in the distance metric learning objective functions in formulas (8) and (9), thereby leading to the following optimization problems:

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{w}}_{1} ,b_{1} }} \frac{1}{2}||{\mathbf{Aw}}_{\text{1}} + {\mathbf{e}}_{\text{1}} b_{1} ||_{1} + c_{1} {\mathbf{e}}_{2}^{T} {\mathbf{q}}_{1} \\ & s.t. \, - \left( {{\mathbf{Bw}}_{\text{1}} + {\mathbf{e}}_{2} b_{1} } \right) + {\mathbf{q}}_{1} \ge {\mathbf{e}}_{2} , \, {\mathbf{q}}_{1} \ge 0 \\ \end{aligned} $$
(14)
$$ \begin{aligned} &\mathop {\hbox{min} }\limits_{{{\mathbf{w}}_{2} ,b_{2} }} \frac{1}{2}||{\mathbf{Bw}}_{2} + {\mathbf{e}}_{2} b_{2} ||_{1} + c_{2} {\mathbf{e}}_{1}^{T} {\mathbf{q}}_{2} \hfill \\ & s.t. \, \left( {{\mathbf{Aw}}_{2} + {\mathbf{e}}_{1} b_{2} } \right) + {\mathbf{q}}_{2} \ge {\mathbf{e}}_{1} , \, {\mathbf{q}}_{2} \ge 0 \hfill \\ \end{aligned} $$
(15)

where \( || \cdot ||_{1} \) denotes the L1-norm. In a solution that minimizes these objective functions, each plane is as close as possible to one of the two classes and as far as possible from the other class. Formulas (14) and (15) are optimization problems with non-smooth L1-norm objectives and linear inequality constraints; the iterative procedure developed below finds a local optimal solution, from which we obtain two nonparallel planes:

$$ {\mathbf{x}}^{T} {\mathbf{w}}_{1} + b_{1} = 0, \, {\mathbf{x}}^{T} {\mathbf{w}}_{2} + b_{2} = 0 $$
(16)

The original problems in formulas (14) and (15) can be optimized in the following forms:

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{w}}_{1} ,b_{1} }} \frac{1}{2}\left( {\sum\limits_{i = 1}^{{m_{1} }} {\frac{{\left( {{\mathbf{a}}_{i}^{T} {\mathbf{w}}_{1} + e_{1}^{i} b_{1} } \right)^{2} }}{{d_{i} }}} } \right) + c_{1} {\mathbf{e}}_{2}^{T} {\mathbf{q}}_{1} \\ & s.t. \, - \left( {{\mathbf{Bw}}_{1} + {\mathbf{e}}_{2} b_{1} } \right) + {\mathbf{q}}_{1} \ge {\mathbf{e}}_{2} , \, {\mathbf{q}}_{1} \ge 0 \\ \end{aligned} $$
(17)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{w}}_{2} ,b_{2} }} \frac{1}{2}\left( {\sum\limits_{j = 1}^{{m_{2} }} {\frac{{\left( {{\mathbf{b}}_{j}^{T} {\mathbf{w}}_{2} + e_{2}^{j} b_{2} } \right)^{2} }}{{d_{j} }}} } \right) + c_{2} {\mathbf{e}}_{1}^{T} {\mathbf{q}}_{2} \\ & s.t. \, \left( {{\mathbf{Aw}}_{2} + {\mathbf{e}}_{1} b_{2} } \right) + {\mathbf{q}}_{2} \ge {\mathbf{e}}_{1} , \, {\mathbf{q}}_{2} \ge 0 \\ \end{aligned} $$
(18)

where \( d_{i} = \left| {{\mathbf{a}}_{i}^{T} {\mathbf{w}}_{1} + e_{1}^{i} b_{1} } \right| \ne 0 \) and \( d_{j} = \left| {{\mathbf{b}}_{j}^{T} {\mathbf{w}}_{2} + e_{2}^{j} b_{2} } \right| \ne 0 \), and \( e_{1}^{i} \), \( e_{2}^{j} \) denote the i-th and j-th elements of \( {\mathbf{e}}_{1} \) and \( {\mathbf{e}}_{2} \), respectively. It is difficult to solve formulas (17) and (18) directly because the weights \( d_{i} \) and \( d_{j} \) involve absolute values of the unknowns, which makes the optimization of the objectives intractable. To solve these problems, we propose an iterative convex optimization strategy. The basic idea is to iteratively update the augmented vector \( {\mathbf{z}}_{1} \) until the difference between the objective values of (17) in two successive iterations is less than a fixed threshold (0.001); \( {\mathbf{z}}_{1} \) is then a local minimum solution. Assume that \( {\mathbf{z}}_{1}^{p} \) is the solution at iteration \( p \). Then, the solution \( {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} \) at iteration \( p + 1 \) is defined as the solution to the following problems:

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{1} }} \frac{1}{2}\left( {\sum\limits_{i = 1}^{{m_{1} }} {\frac{{\left( {{\mathbf{h}}_{i}^{T} {\mathbf{z}}_{1} } \right)^{2} }}{{d_{1i} }}} } \right) + c_{1} {\mathbf{e}}_{2}^{T} {\mathbf{q}}_{1} \\ & s.t. \, - {\mathbf{Gz}}_{1} + {\mathbf{q}}_{1} \ge {\mathbf{e}}_{2} , \, {\mathbf{q}}_{1} \ge 0 \\ \end{aligned} $$
(19)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{ 2} }} \frac{1}{2}\left( {\sum\limits_{j = 1}^{{m_{2} }} {\frac{{\left( {{\mathbf{g}}_{j}^{T} {\mathbf{z}}_{2} } \right)^{2} }}{{d_{2j} }}} } \right) + c_{2} {\mathbf{e}}_{1}^{T} {\mathbf{q}}_{2} \\ & s.t. \, {\mathbf{Hz}}_{2} + {\mathbf{q}}_{2} \ge {\mathbf{e}}_{1} , \, {\mathbf{q}}_{2} \ge 0 \\ \end{aligned} $$
(20)

where \( d_{1i} = \left| {{\mathbf{h}}_{i}^{T} {\mathbf{z}}_{1}^{p} } \right| \), \( d_{2 \, j} = \left| {{\mathbf{g}}_{j}^{T} {\mathbf{z}}_{2}^{p} } \right| \), \( {\mathbf{h}}_{i}^{T} = \left( {{\mathbf{a}}_{i}^{T} \, e_{1}^{i} } \right) \), and \( {\mathbf{g}}_{j}^{T} = \left( {{\mathbf{b}}_{j}^{T} \;e_{2}^{j} } \right) \). Then, formulas (19) and (20) are rewritten as

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{1} }} \frac{1}{2}{\mathbf{z}}_{1}^{T} {\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{Hz}}_{1} + c_{1} {\mathbf{e}}_{2}^{T} {\mathbf{q}}_{1} \\ & s.t. \, - {\mathbf{Gz}}_{1} + {\mathbf{q}}_{1} \ge {\mathbf{e}}_{2} , \, {\mathbf{q}}_{1} \ge 0 \\ \end{aligned} $$
(21)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{2} }} \frac{1}{2}{\mathbf{z}}_{2}^{T} {\mathbf{G}}^{T} {\mathbf{D}}_{2} {\mathbf{Gz}}_{2} + c_{2} {\mathbf{e}}_{1}^{T} {\mathbf{q}}_{2} \\ & s.t. \, {\mathbf{Hz}}_{2} + {\mathbf{q}}_{2} \ge {\mathbf{e}}_{1} , \, {\mathbf{q}}_{2} \ge 0 \\ \end{aligned} $$
(22)

where \( {\mathbf{D}}_{1} = diag\left( {1/d_{11} ,1/d_{12} , \ldots ,1/d_{{1m_{1} }} } \right) \) and \( {\mathbf{D}}_{2} = diag\left( {1/d_{21} ,1/d_{22} , \ldots ,1/d_{{2m_{2} }} } \right) \) are diagonal matrices.

We rewrite the problems (21) and (22) with the following equivalent formulation,

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{1} }} \frac{1}{2}\left\| {{\mathbf{Hz}}_{1} } \right\|_{1} + c_{1} {\mathbf{e}}_{2}^{T} {\mathbf{q}}_{1} \\ & s.t. \, - {\mathbf{Gz}}_{1} + {\mathbf{q}}_{1} \ge {\mathbf{e}}_{2} , \, {\mathbf{q}}_{1} \ge 0 \\ \end{aligned} $$
(23)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{2} }} \frac{1}{2}\left\| {{\mathbf{Gz}}_{2} } \right\|_{1} + c_{2} {\mathbf{e}}_{1}^{T} {\mathbf{q}}_{2} \\ & s.t. \, {\mathbf{Hz}}_{2} + {\mathbf{q}}_{2} \ge {\mathbf{e}}_{1} , \, {\mathbf{q}}_{2} \ge 0 \\ \end{aligned} $$
(24)

The weighted problem (17) (equivalently, (21)) is a convex QPP with linear inequality constraints and can be solved through its dual. Its Lagrange function is constructed as follows:

$$ \begin{aligned} L_{1} \left( {{\mathbf{w}}_{1} ,b_{1} ,{\mathbf{q}}_{1} ,{\varvec{\upalpha}} ,{\varvec{\upbeta}} } \right) & = \frac{1}{2}\left( {{\mathbf{Aw}}_{1} + {\mathbf{e}}_{1} b_{1} } \right)^{T} {\mathbf{D}}_{1} \left( {{\mathbf{Aw}}_{1} + {\mathbf{e}}_{1} b_{1} } \right) \\ & \quad + \;c_{1} {\mathbf{e}}_{2}^{T} {\mathbf{q}}_{1} - {\varvec{\upalpha}}^{T} \left( { - \left( {{\mathbf{Bw}}_{\text{1}} + {\mathbf{e}}_{2} b_{1} } \right) + {\mathbf{q}}_{1} - {\mathbf{e}}_{2} } \right) - {\varvec{\upbeta}}^{T} {\mathbf{q}}_{1} \\ \end{aligned} $$
(25)

where \( {\varvec{\upalpha}} = \left( {\alpha_{1} , \alpha_{2} , \alpha_{3} , \ldots , \alpha_{{m_{2} }} } \right)^{T} \) and \( {\varvec{\upbeta}} = \left( \beta_{1} , \beta_{2} , \beta_{3} , \ldots , \beta_{{m_{1} }} \right)^{T} \) are Lagrange multipliers with \( {\varvec{\upalpha}} \ge 0, \, {\varvec{\upbeta}} \ge 0 \). Setting the partial derivatives of the Lagrange function \( L_{1} \) with respect to \( {\mathbf{w}}_{1} \), \( b_{1} \) and \( {\mathbf{q}}_{1} \) to zero, together with the primal feasibility and complementarity conditions, gives the Karush–Kuhn–Tucker (KKT) conditions:

$$ \frac{\partial L}{{\partial {\mathbf{w}}_{1} }} = {\mathbf{A}}^{T} {\mathbf{D}}_{1} \left( {{\mathbf{Aw}}_{1} + {\mathbf{e}}_{1} b_{1} } \right) + {\mathbf{B}}^{T} {\varvec{\upalpha}} = 0 $$
(26)
$$ \frac{\partial L}{{\partial b_{1} }} = {\mathbf{e}}_{1}^{T} {\mathbf{D}}_{1} \left( {{\mathbf{Aw}}_{1} + {\mathbf{e}}_{1} b_{1} } \right) + {\mathbf{e}}_{2}^{T} {\varvec{\upalpha}} = 0 $$
(27)
$$ \frac{\partial L}{{\partial {\mathbf{q}}_{1} }} = c_{1} {\mathbf{e}}_{2} - {\varvec{\upalpha}} - {\varvec{\upbeta}} = 0 $$
(28)
$$ - \left( {{\mathbf{Bw}}_{\text{1}} + {\mathbf{e}}_{2} b_{1} } \right) + {\mathbf{q}}_{1} \ge {\mathbf{e}}_{2} , \, {\mathbf{q}}_{1} \ge 0 $$
(29)
$$ {\varvec{\upalpha}}^{T} \left( { - \left( {{\mathbf{Bw}}_{\text{1}} + {\mathbf{e}}_{2} b_{1} } \right) + {\mathbf{q}}_{1} - {\mathbf{e}}_{2} } \right) = 0, \, {\varvec{\upbeta}}^{T} {\mathbf{q}}_{1} = 0 $$
(30)

We can obtain \( 0 \le {\varvec{\upalpha}} \le c_{1} {\mathbf{e}}_{2} \) from Eq. (28) since \( {\varvec{\upalpha}} \ge 0, \, {\varvec{\upbeta}} \ge 0 \). Next, Eqs. (26) and (27) are combined:

$$ \left( {{\mathbf{A}}\;\;{\mathbf{e}}_{1} } \right)^{T} {\mathbf{D}}_{1} \left( {{\mathbf{A}}\;\;{\mathbf{e}}_{1} } \right)\left( {{\mathbf{w}}_{1}^{T} \;b_{1} } \right)^{T} + \left( {{\mathbf{B}}\;\;{\mathbf{e}}_{2} } \right)^{T} {\varvec{\upalpha}} = 0 $$
(31)

We have previously defined matrices (\( {\mathbf{H}} \), \( {\mathbf{G}} \)) and augmented vectors (\( {\mathbf{z}}_{1} \), \( {\mathbf{z}}_{\text{2}} \)). With these notations, the solution of \( {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} \) can be obtained based on the conditions above:

$$ {\mathbf{H}}^{T} {\mathbf{D}}_{1}^{p} {\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} + {\mathbf{G}}^{T} {\varvec{\upalpha}} = 0 $$
(32)

Equation (32) is equivalent to Eq. (33):

$$ {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} = - \left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1}^{p} {\mathbf{H}}} \right)^{ - 1} {\mathbf{G}}^{T} {\varvec{\upalpha}} $$
(33)

In Eq. (33), the inverse matrix \( \left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1}^{p} {\mathbf{H}}} \right)^{ - 1} \) must be calculated to obtain \( {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} \). However, \( {\mathbf{H}}^{T} {\mathbf{D}}_{1}^{p} {\mathbf{H}} \) is a positive semi-definite matrix that may be ill-conditioned in some situations, which can yield an inaccurate or unstable solution. In real applications, we can use the methods described in Jayadeva and Chandra (2007) and Mangasarian and Wild (2006): a regularization term \( \varepsilon {\mathbf{I}} \), where \( \varepsilon \) is a small perturbation, is introduced to address this problem. \( {\mathbf{H}}^{T} {\mathbf{D}}_{1}^{p} {\mathbf{H}} + \varepsilon {\mathbf{I}} \) is positive definite and does not suffer from the singularity problem, and the inverse matrix \( \left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1}^{p} {\mathbf{H}}} \right)^{ - 1} \) is approximately replaced by \( \left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1}^{p} {\mathbf{H}} + \varepsilon {\mathbf{I}}} \right)^{ - 1} \). Therefore, we derive the final solution for \( {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} \):

$$ {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} = - \left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1}^{p} {\mathbf{H}} + \varepsilon {\mathbf{I}}} \right)^{ - 1} {\mathbf{G}}^{T} {\varvec{\upalpha}} $$
(34)

Similarly,

$$ {\mathbf{z}}_{2}^{{\left( {p + 1} \right)}} = \left( {{\mathbf{G}}^{T} {\mathbf{D}}_{2}^{p} {\mathbf{G}} + \varepsilon {\mathbf{I}}} \right)^{ - 1} {\mathbf{H}}^{T} {\varvec{\upbeta}} $$
(35)

Augmented vectors \( {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} \) and \( {\mathbf{z}}_{2}^{{\left( {p + 1} \right)}} \) are substituted into Lagrange function (25) separately. Under KKT conditions, the original optimization problems (14) and (15) can be transformed into Wolfe dual problems:

$$ \begin{aligned} & \mathop {\hbox{max} }\limits_{{\varvec{\upalpha}}} {\mathbf{e}}_{2}^{T} {\varvec{\upalpha}} - \frac{1}{2}{\varvec{\upalpha}}^{T} {\mathbf{G}}\left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{H}}} \right)^{ - 1} {\mathbf{G}}^{T} {\varvec{\upalpha}} \\ & s.t. \, 0 \le {\varvec{\upalpha}} \le c_{1} {\mathbf{e}}_{2} \\ \end{aligned} $$
(36)
$$ \begin{aligned} & \mathop {\hbox{max} }\limits_{{\varvec{\upbeta}} } {\mathbf{e}}_{1}^{T} {\varvec{\upbeta}} - \frac{1}{2}{\varvec{\upbeta}}^{T} {\mathbf{H}}\left( {{\mathbf{G}}^{T} {\mathbf{D}}_{2} {\mathbf{G}}} \right)^{ - 1} {\mathbf{H}}^{T} {\varvec{\upbeta}} \hfill \\ & s.t. \, 0 \le {\varvec{\upbeta}} \le c_{2} {\mathbf{e}}_{1} \hfill \\ \end{aligned} $$
(37)

We can obtain the Lagrange multipliers \( {\varvec{\upalpha}} \in R^{{m_{2} \times 1}} \) and \( {\varvec{\upbeta}} \in R^{{m_{1} \times 1}} \) by solving the dual problems and then substitute \( {\varvec{\upalpha}} \) and \( {\varvec{\upbeta}} \) into Eqs. (34) and (35), respectively. The weight vectors \( {\mathbf{w}}_{1} , \, {\mathbf{w}}_{2} \) and biases \( b_{1} , \, b_{2} \) are thereby obtained; that is, we acquire the two nonparallel planes in (16).

A new point \( {\mathbf{x}} \in {\text{R}}^{\text{n}} \) is assigned to Class 1 or Class 2 according to which of the two nonparallel planes given by (16) it lies closest to, using the decision function

$$ f\left( {\mathbf{x}} \right) = \arg \mathop {\hbox{min} }\limits_{i = 1,2} \frac{{\left| {{\mathbf{x}}^{T} {\mathbf{w}}_{i} + b_{i} } \right|}}{{\left\| {{\mathbf{w}}_{i} } \right\|}} $$
(38)

Here, \( \left| \cdot \right| \) is the absolute value operation.
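A direct vectorized implementation of the decision rule (38) might look as follows (a sketch; the function name is ours).

```python
import numpy as np

def predict(X, w1, b1, w2, b2):
    """Assign each row of X to Class 1 (+1) or Class 2 (-1) by the rule in Eq. (38)."""
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)   # distance to plane 1
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)   # distance to plane 2
    return np.where(d1 <= d2, 1, -1)                # the closer plane decides
```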

Because the objective function in (14) contains the non-smooth L1-norm term, the iterative scheme yields a local optimal solution \( {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} \) of the problem. Note that in Eq. (33), \( {\mathbf{D}}_{1} \) depends on the unknown solution \( {\mathbf{z}}_{1} \); it can therefore be viewed as a latent variable of the objective in (14) and handled by alternating optimization with the same iterative algorithm. We calculate \( {\mathbf{D}}_{1}^{p} \) from the solution \( {\mathbf{z}}_{1}^{p} \) obtained in the previous iteration, use it to update \( {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} \), and increase \( p \) until the difference between the objective values of two successive iterations is less than a fixed threshold. Moreover, proper initialization can effectively expedite the convergence of the algorithm. In practice, we solve formulas (8) and (9) to obtain the initial solutions, which works very well empirically in our experiments. The iterative procedure of L1-TWSVM is summarized in Algorithm 1.

[Algorithm 1: The iterative procedure of L1-TWSVM]
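A minimal NumPy sketch of the iterative procedure for the first plane is given below (our reading of Eqs. (34) and (36); the projected-gradient box solver stands in for any QP solver, and all names are ours, not from the paper). The second plane is obtained analogously from Eqs. (35) and (37).

```python
import numpy as np

def solve_box_qp(Q, e, c, iters=2000):
    """Maximize e^T a - 0.5 a^T Q a s.t. 0 <= a <= c (projected gradient)."""
    a = np.zeros_like(e)
    step = 1.0 / (np.linalg.eigvalsh(Q).max() + 1e-12)
    for _ in range(iters):
        a = np.clip(a + step * (e - Q @ a), 0.0, c)
    return a

def l1_twsvm_plane1(A, B, c1=1.0, eps=1e-6, tol=1e-3, max_iter=50):
    """Sketch of Algorithm 1 for the first L1-TWSVM plane.

    Repeats: build D1 from the current z1, solve the reweighted dual (36),
    update z1 via Eq. (34), and stop when the change in the objective of (23)
    falls below tol.
    """
    m1, m2 = A.shape[0], B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])               # H = (A  e1)
    G = np.hstack([B, np.ones((m2, 1))])               # G = (B  e2)
    I = np.eye(H.shape[1])

    # Initialization: the (unweighted) TWSVM solution of (8), as suggested in the text.
    HtH_inv = np.linalg.inv(H.T @ H + eps * I)
    z1 = -HtH_inv @ G.T @ solve_box_qp(G @ HtH_inv @ G.T, np.ones(m2), c1)

    def objective(z):
        # Objective of (23): 0.5*||H z||_1 + c1 * e2^T max(0, e2 + G z)
        return 0.5 * np.abs(H @ z).sum() + c1 * np.maximum(0.0, 1.0 + G @ z).sum()

    prev = objective(z1)
    for _ in range(max_iter):
        d1 = np.maximum(np.abs(H @ z1), 1e-8)          # guard against division by zero
        D1 = np.diag(1.0 / d1)                         # reweighting matrix D1
        S_inv = np.linalg.inv(H.T @ D1 @ H + eps * I)  # regularized inverse, Eq. (34)
        alpha = solve_box_qp(G @ S_inv @ G.T, np.ones(m2), c1)   # dual (36)
        z1 = -S_inv @ G.T @ alpha                      # update step, Eq. (34)
        cur = objective(z1)
        if abs(prev - cur) < tol:                      # convergence test (0.001)
            break
        prev = cur
    return z1[:-1], z1[-1]                             # w1, b1
```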

Algorithm 1 is an efficient iterative algorithm for solving the optimization problem defined by formula (14). Each updating step decreases the value of the objective function, and the convergence of the algorithm is guaranteed by Theorem 1. To prove this, we first introduce Lemma 1.

Lemma 1

For any nonzero vectors \( {\mathbf{u}} \) and \( {\mathbf{u}}^{p} \) of the same dimension, the following inequality holds:

$$ \left\| {\mathbf{u}} \right\|_{1} - \frac{{\left\| {\mathbf{u}} \right\|_{1}^{2} }}{{2\left\| {{\mathbf{u}}^{p} } \right\|_{1} }} \le \left\| {{\mathbf{u}}^{p} } \right\|_{1} - \frac{{\left\| {{\mathbf{u}}^{p} } \right\|_{1}^{2} }}{{2\left\| {{\mathbf{u}}^{p} } \right\|_{1} }} $$
(40)

Proof

Starting with the inequality \( \left( {\sqrt {\mathbf{v}} - \sqrt {{\mathbf{v}}^{p} } } \right)^{2} \ge 0 \), we have

$$ \begin{aligned} & \left( {\sqrt {\mathbf{v}} - \sqrt {{\mathbf{v}}^{p} } } \right)^{2} \ge 0 \Rightarrow {\mathbf{v}} - 2\sqrt {{\mathbf{vv}}^{p} } + {\mathbf{v}}^{p} \ge 0 \hfill \\ & \Rightarrow \sqrt {\mathbf{v}} - \frac{{\mathbf{v}}}{{2\sqrt {{\mathbf{v}}^{p} } }} \le \frac{{\sqrt {{\mathbf{v}}^{p} } }}{2} \Rightarrow \sqrt {\mathbf{v}} - \frac{{\mathbf{v}}}{{2\sqrt {{\mathbf{v}}^{p} } }} \le \sqrt {{\mathbf{v}}^{p} } - \frac{{{\mathbf{v}}^{p} }}{{2\sqrt {{\mathbf{v}}^{p} } }} \hfill \\ \end{aligned} $$
(41)

By replacing \( {\mathbf{v}} \) and \( {\mathbf{v}}^{p} \) in (41) with \( \left\| {\mathbf{u}} \right\|_{1}^{2} \) and \( \left\| {{\mathbf{u}}^{p} } \right\|_{1}^{2} \), respectively, we obtain (40). □
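As a quick numeric sanity check of inequality (40) (our own illustration, not part of the original proof), one can draw random nonzero vectors and confirm that the relation always holds.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    u, up = rng.standard_normal(5), rng.standard_normal(5)
    lhs = np.abs(u).sum() - np.abs(u).sum() ** 2 / (2 * np.abs(up).sum())
    rhs = np.abs(up).sum() - np.abs(up).sum() ** 2 / (2 * np.abs(up).sum())
    assert lhs <= rhs + 1e-12   # inequality (40)
```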

Theorem 1

Algorithm 1 monotonically decreases the objective of problem (23) in each iteration.

Proof

First, we rewrite the problem in (39) with the following equivalent formulation:

$$ {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} = \arg \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{1} }} \frac{1}{2}{\mathbf{z}}_{1}^{T} {\mathbf{H}}^{T} {\mathbf{D}}_{1}^{p} {\mathbf{Hz}}_{1} + c_{1} {\mathbf{e}}_{2}^{T} \hbox{max} \left( {0,{\mathbf{e}}_{2} + {\mathbf{Gz}}_{1} } \right) $$
(42)

That is,

$$ {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} = \arg \mathop {\hbox{min} }\limits_{{{\mathbf{z}}_{1} }} \frac{1}{2}\left( {{\mathbf{Hz}}_{1} } \right)^{T} {\mathbf{D}}_{1}^{p} {\mathbf{Hz}}_{1} + c_{1} {\mathbf{e}}_{2}^{T} \hbox{max} \left( {0,{\mathbf{e}}_{2} + {\mathbf{Gz}}_{1} } \right) $$
(43)

Thus, in the \( \left( {p + 1} \right) \)-th iteration, according to (39) in Algorithm 1, we have

$$ \begin{aligned} & \frac{1}{2}\left( {{\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} } \right)^{T} {\mathbf{D}}_{1}^{p} \left( {{\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} } \right) + c_{1} {\mathbf{e}}_{2}^{T} \hbox{max} \left( {0, \, {\mathbf{e}}_{2} + {\mathbf{Gz}}_{1}^{{\left( {p + 1} \right)}} } \right) \\ & \le \frac{1}{2}\left( {{\mathbf{Hz}}_{1}^{p} } \right)^{T} {\mathbf{D}}_{1}^{p} \left( {{\mathbf{Hz}}_{1}^{p} } \right) + c_{1} {\mathbf{e}}_{2}^{T} \hbox{max} \left( {0, \, {\mathbf{e}}_{2} + {\mathbf{Gz}}_{1}^{p} } \right) \\ \end{aligned} $$
(44)

Substituting \( {\mathbf{u}} \) and \( {\mathbf{u}}^{p} \) in (40) with \( {\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} \) and \( {\mathbf{Hz}}_{1}^{p} \), respectively, leads to

$$ \left\| {{\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} } \right\|_{1} - \frac{{\left\| {{\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} } \right\|_{1}^{2} }}{{2\left\| {{\mathbf{Hz}}_{1}^{p} } \right\|_{1} }} \le \left\| {{\mathbf{Hz}}_{1}^{p} } \right\|_{1} - \frac{{\left\| {{\mathbf{Hz}}_{1}^{p} } \right\|_{1}^{2} }}{{2\left\| {{\mathbf{Hz}}_{1}^{p} } \right\|_{1} }} $$
(45)

Applying (40) in the same way to each scalar \( {\mathbf{h}}_{i}^{T} {\mathbf{z}}_{1} \) and summing over \( i \), the following inequality also holds:

$$ \sum\limits_{i = 1}^{{m_{1} }} {\left( {\left| {{\mathbf{h}}_{i}^{T} {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} } \right| - \frac{{\left( {{\mathbf{h}}_{i}^{T} {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} } \right)^{2} }}{{2\left| {{\mathbf{h}}_{i}^{T} {\mathbf{z}}_{1}^{p} } \right|}}} \right)} \le \sum\limits_{i = 1}^{{m_{1} }} {\left( {\left| {{\mathbf{h}}_{i}^{T} {\mathbf{z}}_{1}^{p} } \right| - \frac{{\left( {{\mathbf{h}}_{i}^{T} {\mathbf{z}}_{1}^{p} } \right)^{2} }}{{2\left| {{\mathbf{h}}_{i}^{T} {\mathbf{z}}_{1}^{p} } \right|}}} \right)} $$
(46)

Rewriting (46) in matrix form gives (47):

$$ \begin{aligned} & \left\| {{\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} } \right\|_{1} - \frac{1}{2}\left( {{\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} } \right)^{T} {\mathbf{D}}_{1}^{P} \left( {{\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} } \right) \\ & \le \left\| {{\mathbf{Hz}}_{1}^{p} } \right\|_{1} - \frac{1}{2}\left( {{\mathbf{Hz}}_{1}^{p} } \right)^{T} {\mathbf{D}}_{1}^{P} \left( {{\mathbf{Hz}}_{1}^{p} } \right) \\ \end{aligned} $$
(47)

Combining inequalities (44) and (47), we obtain

$$ \begin{aligned} & \left\| {{\mathbf{Hz}}_{1}^{{\left( {p + 1} \right)}} } \right\|_{1} + c_{1} {\mathbf{e}}_{2}^{T} \hbox{max} \left( {0, \, {\mathbf{e}}_{2} + {\mathbf{Gz}}_{1}^{{\left( {p + 1} \right)}} } \right) \\ & \le \left\| {{\mathbf{Hz}}_{1}^{p} } \right\|_{1} + c_{1} {\mathbf{e}}_{2}^{T} \hbox{max} \left( {0, \, {\mathbf{e}}_{2} + {\mathbf{Gz}}_{1}^{p} } \right) \\ \end{aligned} $$
(48)

Since the objective of problem (23) is bounded below by 0 and, by inequality (48), does not increase from one iteration to the next, Algorithm 1 converges. That is, the objective value of (23) decreases in each iteration until the algorithm converges. □

Theorem 2

Algorithm 1 converges to a local minimal solution to problem (23).

Proof

The Lagrange function of problem (23) is as follows,

$$ L_{2} \left( {{\mathbf{z}}_{1} ,{\mathbf{q}}_{1} } \right) = \frac{1}{2}\left\| {{\mathbf{Hz}}_{1} } \right\|_{1} + c_{1} {\mathbf{e}}_{2}^{T} {\mathbf{q}}_{1} - {\varvec{\upalpha}}^{T} \left( { - {\mathbf{Gz}}_{1} + {\mathbf{q}}_{1} - {\mathbf{e}}_{2} } \right) - {\varvec{\upbeta}}^{T} {\mathbf{q}}_{1} $$
(49)

where \( {\varvec{\upalpha}} \) and \( {\varvec{\upbeta}} \) are the vectors of Lagrange multipliers. Taking the derivatives of \( L_{2} \left( {{\mathbf{z}}_{1} ,{\mathbf{q}}_{1} } \right) \) w.r.t. \( {\mathbf{z}}_{1} \) and \( {\mathbf{q}}_{1} \), respectively, and setting them to zero, we obtain the KKT condition of problem (23) as follows:

$$ {\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{Hz}}_{1} + {\mathbf{G}}^{T} {\varvec{\upalpha}} = 0, \, c_{1} {\mathbf{e}}_{2} - {\varvec{\upalpha}} - {\varvec{\upbeta}} = 0 $$
(50)

In each iteration of Algorithm 1, we find the optimal \( {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} \) for problem (39); hence, the converged solution of Algorithm 1 satisfies the KKT condition of that problem. Next, we define the Lagrange function of problem (39) in Algorithm 1 as follows:

$$ L_{3} \left( {{\mathbf{z}}_{1} ,{\mathbf{q}}_{1} } \right) = \frac{1}{2}{\mathbf{z}}_{1}^{T} {\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{Hz}}_{1} + c_{1} {\mathbf{e}}_{2}^{T} {\mathbf{q}}_{1} - {\varvec{\upalpha}}^{T} \left( { - {\mathbf{Gz}}_{1} + {\mathbf{q}}_{1} - {\mathbf{e}}_{2} } \right) - {\varvec{\upbeta}}^{T} {\mathbf{q}}_{1} $$
(51)

Taking the derivatives of \( L_{3} \left( {{\mathbf{z}}_{1} ,{\mathbf{q}}_{1} } \right) \) w.r.t. \( {\mathbf{z}}_{1} \) and \( {\mathbf{q}}_{1} \), respectively, and setting them to zero yields

$$ {\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{Hz}}_{1} + {\mathbf{G}}^{T} {\varvec{\upalpha}} = 0, \, c_{1} {\mathbf{e}}_{2} - {\varvec{\upalpha}} - {\varvec{\upbeta}} = 0 $$
(52)

According to the definition of \( {\mathbf{D}}_{1} \) in Algorithm 1, the equivalence between (50) and (52) holds when Algorithm 1 converges. This implies that the converged solution \( {\mathbf{z}}_{1}^{{\left( {p + 1} \right)}} \) of Algorithm 1 satisfies (50) (the KKT condition of the problem in (23)) and is a local minimum solution to problem (23). □

In the next section, we evaluate the validity and robustness of L1-TWSVM through experiments, demonstrating its classification performance on synthetic datasets and UCI datasets (Bache and Lichman 2013; Chen et al. 2011).

4 Experimental results

To evaluate the classification performance and robustness of L1-TWSVM, and to demonstrate that the L1-norm distance can alleviate the effect of outliers and noise in most cases, L1-TWSVM is compared with five related algorithms [SVM (Vapnik 1995), GEPSVM (Mangasarian and Wild 2006), TWSVM (Jayadeva and Chandra 2007), LSTSVM (Kumar and Gopal 2009) and L1-NPSVM (Li et al. 2015a)] on fifteen commonly used datasets selected from the UCI repository. L1-TWSVM and L1-NPSVM are iterative algorithms and require initial solutions to be specified; good initialization is critical for success but non-trivial. Considering that these two algorithms are designed to correct the planes of TWSVM and GEPSVM, which may be non-optimal due to the effect of outliers, we set their initial solutions to the solutions of TWSVM and GEPSVM, respectively. Moreover, for L1-TWSVM and L1-NPSVM, we terminate the iterative procedures when the difference between the objective values of two successive iterations is less than 0.001. The experimental environment consists of a Windows 10 operating system, an Intel(R) Core(TM) i5-5200u quad-core processor (2.2 GHz) and 4 GB of RAM, and the six classification algorithms are implemented in MATLAB 7.1. All experimental parameters are selected by 10-fold cross-validation (Ding et al. 2013; Ye et al. 2012): each dataset is divided into ten subsets, each of which serves as testing data in turn, with the remaining nine subsets used as training data. The testing accuracy is the average of the results of N runs for each dataset (in this experiment, N = 10). In addition, the experimental datasets contain only two classes of data (Class 1 and Class 2), and all sample data are normalized to the interval \( \left( { - 1,1} \right) \) to reduce the differences between the characteristics of different samples. Because the experimental parameters may influence the classification performance, the parameters \( c_{1} \) and \( c_{2} \) are searched over \( \{ 2^{i} \left| {i = - 7, - 6, - 5, \ldots ,7} \right.\} \) and the parameter \( \varepsilon \) over \( \{ 10^{i} \left| {i = - 10, - 9, - 8, \ldots ,10} \right.\} \), with the best values chosen by the 10-fold cross-validation described above.
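The parameter search described above can be outlined as follows (a simplified sketch with our own placeholder names; `train_and_score` stands for training any of the six classifiers on one fold split and returning its test accuracy).

```python
import numpy as np
from itertools import product

def ten_fold_cv_grid(X, y, train_and_score, n_folds=10, seed=0):
    """Grid-search c1, c2 in {2^-7,...,2^7} and eps in {10^-10,...,10^10} by 10-fold CV."""
    c_grid = [2.0 ** i for i in range(-7, 8)]
    eps_grid = [10.0 ** i for i in range(-10, 11)]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)

    best_params, best_acc = None, -np.inf
    for c1, c2, eps in product(c_grid, c_grid, eps_grid):
        accs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            accs.append(train_and_score(X[train], y[train], X[test], y[test],
                                        c1=c1, c2=c2, eps=eps))
        if np.mean(accs) > best_acc:
            best_params, best_acc = (c1, c2, eps), np.mean(accs)
    return best_params, best_acc
```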

4.1 Experiments on synthetic datasets

To examine the performance of L1-TWSVM, we conducted experiments on an XOR dataset called Cross-plane (60 × 2), which contains 20 positive samples and 40 negative samples. This dataset is generated by perturbing points that originally lie on two intersecting planes (lines), where each plane corresponds to one class. The two-dimensional dataset contains two classes (positive and negative) whose covariance matrices are \( \left( {1,0.9576;0.9576,1} \right) \) and \( \left( {1, - 0.9067; - 0.9067,1} \right) \), respectively, and whose mean vectors are \( \left( {4.39,11.6062} \right) \) and \( \left( {8.15,11.4137} \right) \), respectively. Outliers tend to influence the classification performance, and this influence is measured to evaluate the stability of the algorithms. Here, two extra outliers (data points that deviate significantly from the remaining data points) are added to the Cross-plane dataset (yielding Cross-plane 1 (62 × 2)) to assess the robustness of the six algorithms: one outlier with coordinates \( \left( {17,5} \right)^{T} \) is generated in the positive class, and another with coordinates \( \left( { - 5, - 1} \right)^{T} \) is generated in the negative class, as shown in Fig. 1. The classification results of each classifier on the Cross-plane 1 dataset are given in Fig. 2a–f.

Fig. 1
figure 1

XOR datasets with outliers

Fig. 2
figure 2

The classification results on the Cross-plane 1 datasets. Annotation: The red line is the optimal plane of the “Circle” sample, while the blue line is the optimal plane of the “Square” sample. a By SVM, b By GEPSVM, c By TWSVM, d By LSTSVM, e By L1-NPSVM, f By L1-TWSVM (Color figure online)

Traditional distance metric learning methods (such as SVM, GEPSVM, TWSVM and LSTSVM) often formulate their objectives using the squared L2-norm distance, so they can be highly influenced by outlying data points. Under the squared L2-norm, every point contributes its squared distance to the objective, and when the squared distances of a few far-away points dominate the sum over the remaining data points, the resulting measurements become inappropriate for the dataset. Such far-away points are the outliers, which deviate significantly from the rest of the data. According to Fig. 2, compared with L1-TWSVM, the other competing algorithms misclassify more points (more points of Class 1 lie closer to the blue separating plane of Class 2, or more points of Class 2 lie closer to the red separating plane of Class 1). The accuracies of the six algorithms (SVM, GEPSVM, TWSVM, LSTSVM, L1-NPSVM and L1-TWSVM) are 34.05, 74.34, 73.08, 67.86, 75.84 and 77.56%, respectively. Thus, L1-TWSVM achieves the highest classification accuracy after the introduction of outliers, which may be attributed to the use of the robust L1-norm distance in TWSVM. When outliers appear in the dataset, the squared L2-norm distance may let large distances dominate the sum in GEPSVM, TWSVM and LSTSVM, which easily leads to biased results, whereas the L1-norm distance greatly reduces the influence of outliers. The performance of SVM is the worst among the six algorithms, which indicates that SVM cannot deal with the XOR dataset effectively. These results validate the practicability and feasibility of L1-TWSVM.

4.2 Experiments on UCI datasets

To solve the L1-norm optimization problem, we developed an iterative method that is simple and convenient to implement. We also theoretically showed that the objective function value of L1-TWSVM is reduced in each step of the iteration. The objective function values of L1-TWSVM monotonically decrease as the iteration number increases until converging to fixed values (Fig. 3a–f); the algorithm can quickly converge within approximately six iterations. The horizontal axis represents the number of iterations, and the vertical axis represents the value of the objective function.

Fig. 3
figure 3

The objective function values of L1-TWSVM monotonically decrease as the iteration number increases on six datasets. a On Tae datasets, b On Pidd datasets, c On Monks3 datasets, d On Housingdata datasets, e On Sonar datasets, f On Cancer datasets

To further evaluate the effectiveness and practicality of L1-TWSVM, it is compared with the relevant algorithms (SVM, GEPSVM, TWSVM, LSTSVM and L1-NPSVM) on fifteen commonly used datasets selected from the UCI repository. Noise is one of the criteria used for evaluating the robustness of an algorithm: if the accuracy changes only slightly as the noise level increases, the algorithm has good robustness to noise.

To imitate outlier data samples, we corrupt the training samples using a noise matrix \( {\mathbf{N}}_{0} \) whose elements are i.i.d. (independent and identically distributed) standard Gaussian variables (mean 0, standard deviation 1) (Wang et al. 2015), and then execute the training procedures on the corrupted training set \( {\mathbf{T}} + \sigma {\mathbf{N}}_{0} \), where \( \sigma = k\left\| {\mathbf{T}} \right\|_{F} /\left\| {{\mathbf{N}}_{0} } \right\|_{F} \) and \( k \) is a given noise factor (a short sketch of this corruption step is given after the table list below). In this paper, we set \( k = 0.1 \). Table 1 lists the accuracies of the six algorithms on the original datasets, Table 2 lists the accuracies on the fifteen datasets after 10% Gaussian noise is introduced, and Table 3 lists the accuracies after 20% Gaussian noise is introduced. To further test the convergence of L1-TWSVM, the average numbers of iterations used for training are listed in the three tables for each experiment. In addition, the P values are obtained from paired t tests comparing each algorithm to L1-TWSVM; an asterisk (*) indicates a significant difference from L1-TWSVM, corresponding to a P value of less than 0.05. The highest accuracy is shown in bold, and Std denotes the standard deviation, a metric that quantifies the amount of variation or dispersion of a set of values. Detailed results are given in the following tables:

Table 1 Experimental results of six algorithms on the original datasets
Table 2 Experimental results of six algorithms on datasets into which 10% Gaussian noise has been introduced
Table 3 Experimental results of six algorithms on datasets into which 20% Gaussian noise has been introduced
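The noise-injection scheme above can be reproduced with a few lines of NumPy (our sketch; `T` denotes the training sample matrix and `k` the noise factor).

```python
import numpy as np

def corrupt_with_gaussian_noise(T, k=0.1, seed=0):
    """Return T + sigma * N0 with sigma = k * ||T||_F / ||N0||_F (k = 0.1 gives 10% noise)."""
    rng = np.random.default_rng(seed)
    N0 = rng.standard_normal(T.shape)                  # i.i.d. standard Gaussian noise
    sigma = k * np.linalg.norm(T, 'fro') / np.linalg.norm(N0, 'fro')
    return T + sigma * N0
```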

We performed paired t tests comparing L1-TWSVM to the related algorithms. The P value of each test is the probability of observing the measured difference, or a greater one, between the accuracy values of the two algorithms under the null hypothesis that their accuracy distributions are identical. Hence, the smaller the P value, the less likely it is that the observed difference arose from algorithms with the same accuracy distribution. In this study, we set the threshold for the P value to 0.05. For instance, the P value of the test comparing L1-TWSVM and TWSVM on the Ticdata dataset is 0.00021681, and that on the Monks3 dataset is 0.0097, both of which are less than 0.05; therefore, we conclude that L1-TWSVM and TWSVM have different accuracies on these datasets and that L1-TWSVM significantly outperforms TWSVM. In addition, on the Monks1, Monks2, Tae and Sonar datasets, the performance differences between L1-TWSVM and the other algorithms (except for SVM) are statistically insignificant. Finally, by examining the experimental results more carefully, we also notice that although TWSVM and LSTSVM outperform L1-TWSVM in terms of accuracy on some datasets (such as Heart, Ionodata and Housingdata), the corresponding P values are higher than 0.05; that is, the accuracies of TWSVM and LSTSVM are not significantly different from that of L1-TWSVM. Taken together, these observations indicate that the classification performance of our method is at least comparable, and in several cases superior, to those of the competing methods.

Based on the data in Table 1, we observe the following. First, the accuracy of L1-TWSVM is comparable to those of the other competing algorithms in most cases and higher in some scenarios, which indicates that L1-TWSVM offers competitive classification performance. Second, according to the columns corresponding to L1-TWSVM in Table 1, L1-TWSVM converges rapidly, within approximately seven iterations, except on the Ticdata dataset (13 iterations); as guaranteed by Theorem 1, L1-TWSVM gradually converges to a local optimal solution.

According to the experimental results in the three tables, regardless of whether Gaussian noise is introduced, the accuracy of our method is comparable to or better than those of the other methods. The performance degradation of our method is very small when 10% and 20% Gaussian noise is introduced, and its accuracy remains superior to that of TWSVM. In addition, the accuracies of L1-NPSVM and L1-TWSVM change little compared to the other competing methods, especially when Gaussian noise is introduced. This may be attributed to the embedding of the L1-norm distance, which makes them more robust to outliers than the other methods, and further demonstrates that the L1-norm distance is useful for data classification, especially for samples with outliers.

According to the three tables, the iteration numbers of L1-TWSVM increase slightly after Gaussian noise is introduced; however, L1-TWSVM still converges within a limited number of iterations. Furthermore, the experimental results reveal a high computational cost: the training time of L1-TWSVM is the longest. The reasons are as follows: (1) L1-TWSVM, like TWSVM, requires the solution of two QPPs, whose time complexity is no more than \( m^{3} /4 \) when \( {\mathbf{A}} \) and \( {\mathbf{B}} \) contain the same number of samples. (2) The time complexity of computing the two inverse matrices \( \left( {{\mathbf{H}}^{T} {\mathbf{D}}_{1} {\mathbf{H}} + \varepsilon {\mathbf{I}}} \right)^{ - 1} \) and \( \left( {{\mathbf{G}}^{T} {\mathbf{D}}_{2} {\mathbf{G}} + \varepsilon {\mathbf{I}}} \right)^{ - 1} \) is approximately \( 2n^{2} \). (3) As L1-TWSVM is an iterative algorithm, it must compute the solutions repeatedly; in each iteration, it solves two QPPs and computes two inverse matrices and two diagonal matrices \( {\mathbf{D}}_{1} , \, {\mathbf{D}}_{2} \), where the time complexities of calculating \( {\mathbf{D}}_{1} \) and \( {\mathbf{D}}_{2} \) during the learning process are \( m_{1} \times (d + 1) \) and \( m_{2} \times (d + 1) \), respectively. Therefore, the total time complexity of solving problem (14) is \( l\left( {m^{3} /4 + 2n^{2} + m\left( {d + 1} \right)} \right) \), where \( l \) is the number of iterations. The iterative procedure gives L1-TWSVM a higher computational cost than the other five methods in most cases; fortunately, it surpasses them in accuracy and has good robustness.

Although the accuracy improvements of our method over the other compared methods on the original datasets are modest, the performance degradation of the proposed method is very small (less than 3.2%) when Gaussian noise is introduced. According to the experimental results in the three tables, the performances of all the methods degrade when Gaussian noise is introduced; however, the degradation of our method is much smaller than those of the other competing methods. This clearly indicates the robustness of our improved method to outliers and empirically confirms the validity of our strategy of using the robust L1-norm distance for distance metric learning.

From Table 2, although the accuracy of L1-TWSVM is higher than that of TWSVM on the Heart dataset, the P value of the corresponding test is 0.8717, which is higher than 0.05; this means that the accuracy of TWSVM is not significantly different from that of L1-TWSVM. The same holds for the Monks1, Monks2, Pidd and other datasets. However, the P value for L1-TWSVM versus TWSVM on the Monks3 dataset is 0.0246, which is less than 0.05, so the accuracies of L1-TWSVM and TWSVM differ significantly there; the same is true for the Ticdata dataset. Besides, the performance degradation of TWSVM after Gaussian noise is introduced is larger than that of L1-TWSVM, which indicates that the robustness of L1-TWSVM is superior to that of TWSVM. In a very similar way, we can compare the other algorithms (SVM, GEPSVM and LSTSVM) with L1-TWSVM. Note that the accuracy of L1-NPSVM changes little after Gaussian noise is introduced, but L1-TWSVM still outperforms L1-NPSVM in accuracy.

To further examine the classification performance of the improved method (L1-TWSVM) and to show that the L1-norm distance is useful for data classification, especially for datasets with outliers, eight original datasets (Monks2, Ticdata, Monks1, Pidd, Pimadata, Tae, Heart and Monks3) are chosen from the fifteen commonly used datasets. Figure 4a shows the classification accuracies of the six algorithms on the eight original datasets, and Fig. 4b, c shows the accuracies when 10 and 20% Gaussian noise, respectively, is introduced. According to Fig. 4, the accuracy of our improved method is comparable to those of the other compared methods on the original datasets, whereas the improvements achieved by our method on the contaminated datasets (with 10 and 20% Gaussian noise) are large in most cases. This indicates that the classification performance of L1-TWSVM is superior, further validates the practicability of utilizing the L1-norm distance in TWSVM, and shows that the proposed method is more effective and robust against outlier samples than traditional squared L2-norm distance metric learning approaches.

Fig. 4
figure 4

Comparison of six algorithms on eight datasets with respect to classification accuracy. a Without Gaussian noise, b Introduce 10% Gaussian noise, c Introduce 20% Gaussian noise

To evaluate the robustness of L1-TWSVM, we designed three test schemes (Test 1, Test 2 and Test 3) on the Heart and Pidd datasets separately, into which 0, 10 and 20% Gaussian noise is introduced, respectively. As illustrated in Fig. 5, L1-TWSVM is superior to the other competing algorithms in most cases; in particular, on Test 3 (Fig. 5a), the accuracy of L1-TWSVM is the highest, while on Test 1 and Test 2 its accuracies are comparable to those of TWSVM. In brief, our proposed method performs consistently well across the three tests, which demonstrates that L1-TWSVM can improve the classification performance and is useful for data classification. Figure 5a, b show similar experimental results, so Fig. 5b further verifies the practicability of L1-TWSVM in alleviating the effect of noise. These observations are consistent with those made in the previous experiments.

Fig. 5
figure 5

The classification performances of six algorithms. a On Heart datasets, b On Pidd datasets

To further confirm the robustness to noise of L1-TWSVM, Fig. 6 vividly illustrates the performance comparison of the six algorithms on nine datasets, where 0, 10 and 20% Gaussian noise has been introduced. In terms of robustness, L1-TWSVM and L1-NPSVM obtain the best results among all competing methods, which indicates that the utilization of the L1-norm distance can alleviate the negative influence of outliers and make the model stronger. However, the accuracy of L1-TWSVM is higher than that of L1-NPSVM, which firmly demonstrates that our method is more effective and robust against outlier samples. Furthermore, this provides more evidence of the effectiveness of the robust L1-norm distance in metric learning and verifies the correctness of our improved method.

Fig. 6
figure 6

The performance comparison of six algorithms on nine datasets. a By SVM, b By GEPSVM, c By TWSVM, d By LSTSVM, e By L1-NPSVM, f By L1-TWSVM

5 Conclusions and future work

We propose an efficient and robust TWSVM classifier based on the L1-norm distance metric for binary classification, denoted L1-TWSVM, which makes full use of the robustness of the L1-norm distance to noise and outliers. As the new objective function contains a non-smooth L1-norm term, it is challenging to solve the optimization problem directly. In view of this, we develop a simple and effective iterative algorithm for solving the L1-norm optimization problems that is easy to implement, and we prove that it converges to a local optimal solution of the proposed objective. Moreover, extensive experimental results indicate that L1-TWSVM effectively suppresses the negative effects of outliers to some extent and improves the generalization ability and flexibility of the model. Nevertheless, L1-TWSVM still needs to compute two QPPs iteratively to obtain its solutions; according to the experimental results above, its computational cost is the highest among the related algorithms under the same settings, which makes it difficult to handle large data samples efficiently. In summary, L1-TWSVM has better classification performance and robustness than the other algorithms, especially when Gaussian noise is introduced.

There are three future directions for this research. First, we would like to find a better way to decrease the computational cost of L1-TWSVM so that it will be able to handle larger samples. Second, we would like to extend L1-TWSVM to a kernel version to deal with nonlinear tasks. Third, L1-TWSVM is only effective for binary classification problems at present; it is promising to extend L1-TWSVM to multi-category classification and study the application of multi-class L1-TWSVM to real-world problems.