
A bilateral-truncated-loss based robust support vector machine for classification problems


Abstract

Support vector machine (SVM) is sensitive to outliers or noise in the training dataset. Fuzzy SVM (FSVM) and the bilateral-weighted FSVM (BW-FSVM) can partly overcome this shortcoming by assigning different fuzzy membership degrees to different training samples. However, setting the fuzzy membership degrees of the training samples is a difficult task. To avoid setting fuzzy membership degrees, this paper starts from the BW-FSVM model and constructs a bilateral-truncated-loss based robust SVM (BTL-RSVM) model for classification problems with noise. Based on its equivalent model, we theoretically analyze why BTL-RSVM is more robust than SVM and BW-FSVM. To solve the proposed BTL-RSVM model, we propose an iterative algorithm based on the concave–convex procedure and the Newton–Armijo algorithm. A set of experiments is conducted on ten real-world benchmark datasets to test the robustness of BTL-RSVM. Statistical tests of the experimental results indicate that, compared with SVM, FSVM and BW-FSVM, the proposed BTL-RSVM can significantly reduce the effects of noise and provide superior robustness.
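For intuition about the robustness claim, note that the case analysis in the appendix lets the bilateral truncated loss be written compactly as \(\mathrm{robust}(z)=1+\min (\left| {1-z} \right| _+ ,\left| {1+z} \right| _+ )\) in terms of the decision value \(z={{\mathbf w}}^T\varphi ({{\mathbf x}})+b\); unlike the hinge loss, it is bounded above by 2, so a single noisy sample cannot dominate the training objective. The Python snippet below is only a rough sketch of this comparison; the function names and the grid of decision values are illustrative and not taken from the paper.

```python
import numpy as np

def hinge(t):
    """Positive-part function |t|_+ = max(t, 0)."""
    return np.maximum(t, 0.0)

def bilateral_truncated_loss(z):
    """Compact form of the bilateral truncated loss, 1 + min(|1 - z|_+, |1 + z|_+),
    read off from the case analysis in the appendix (an assumption of this sketch)."""
    return 1.0 + np.minimum(hinge(1.0 - z), hinge(1.0 + z))

z = np.linspace(-10.0, 10.0, 9)
print(bilateral_truncated_loss(z))  # bounded: every value lies in [1, 2]
print(hinge(1.0 - z))               # the standard hinge loss grows without bound
```

The offset of 1 matches Eq. (47) in the appendix, where the bilateral truncated loss is shown to upper-bound the 0-1 misclassification error.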



Acknowledgments

The work presented in this paper is supported by the National Science Foundation of China (61273295), the Major Project of the National Social Science Foundation of China (11&ZD156), and the Open Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Chinese Ministry of Education (93K-17-2009-K04).

Author information


Corresponding author

Correspondence to Xiaowei Yang.

Additional information

Communicated by V. Loia.

Appendix

Proof of Proposition 1

The loss function \(L({{\mathbf w}},b,m,{{\mathbf x}})\) can be rewritten as follows:

$$\begin{aligned} L({{\mathbf w}},b,m,{{\mathbf x}})&= m\left| {1-({{\mathbf w}}^T\varphi ({{\mathbf x}})+b)} \right| _+ +(1-m)\left| {1+{{\mathbf w}}^T\varphi ({{\mathbf x}})+b} \right| _+ +1 \nonumber \\ &= m\left( \left| {1-({{\mathbf w}}^T\varphi ({{\mathbf x}})+b)} \right| _+ -\left| {1+{{\mathbf w}}^T\varphi ({{\mathbf x}})+b} \right| _+ \right) +\left| {1+{{\mathbf w}}^T\varphi ({{\mathbf x}})+b} \right| _+ +1. \end{aligned}$$
(42)

If \(\left| {1-({{\mathbf w}}^T\varphi ({{\mathbf x}})+b)} \right| _+ \ge \left| {1+{{\mathbf w}}^T\varphi ({{\mathbf x}})+b} \right| _+ \), then \(L({{\mathbf w}},b,m,{{\mathbf x}})\ge 1+\left| {1+{{\mathbf w}}^T\varphi ({{\mathbf x}})+b} \right| _+ \). If \(\left| {1-({{\mathbf w}}^T\varphi ({{\mathbf x}})+b)} \right| _+ <\left| {1+{{\mathbf w}}^T\varphi ({{\mathbf x}})+b} \right| _+ \), then \(L({{\mathbf w}},b,m,{{\mathbf x}})\ge 1+\left| {1-({{\mathbf w}}^T\varphi ({{\mathbf x}})+b)} \right| _+ \). By the definition of the bilateral truncated loss function \(\mathrm{robust}({{\mathbf w}},b,{{\mathbf x}})\), in both cases \(L({{\mathbf w}},b,m,{{\mathbf x}})\ge \mathrm{robust}({{\mathbf w}},b,{{\mathbf x}})\) holds.
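A minimal numerical sanity check of this bound, written in terms of the decision value \(z={{\mathbf w}}^T\varphi ({{\mathbf x}})+b\) and the compact form \(\mathrm{robust}(z)=1+\min (\left| {1-z} \right| _+, \left| {1+z} \right| _+ )\) implied by the case analysis in the proof of Proposition 2 below (the grid and the helper names are illustrative assumptions, not part of the paper):

```python
import numpy as np

def hinge(t):
    return np.maximum(t, 0.0)  # |t|_+

def bwfsvm_loss(z, m):
    """The loss L(w, b, m, x) of Eq. (42), expressed via z = w^T phi(x) + b."""
    return m * hinge(1.0 - z) + (1.0 - m) * hinge(1.0 + z) + 1.0

def robust_loss(z):
    """Bilateral truncated loss, 1 + min(|1 - z|_+, |1 + z|_+)."""
    return 1.0 + np.minimum(hinge(1.0 - z), hinge(1.0 + z))

z = np.linspace(-5.0, 5.0, 501)            # decision values
for m in np.linspace(0.0, 1.0, 21):        # membership degrees in [0, 1]
    assert np.all(bwfsvm_loss(z, m) >= robust_loss(z) - 1e-12)
print("L(w, b, m, x) >= robust(w, b, x) on the whole grid (Proposition 1).")
```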

Proof of Proposition 2

In the following proof, \(z={{\mathbf w}}^T\varphi ({{\mathbf x}})+b\). The graphs of the functions \(f(z)=| {1-z} |_+ \) and \(f(z)=| {1+z} |_+ \) are shown in Fig. 2:

Fig. 2  The graphs of the functions \(f(z)=| {1-z} |_+ \) and \(f(z)=| {1+z} |_+ \)

From Fig. 2, we know that

$$\begin{aligned} \left| {1-z} \right| _+ \ge 1\ge \left| {1+z} \right| _+, \end{aligned}$$
(43)

or

$$\begin{aligned} \left| {1-z} \right| _+ \le 1\le \left| {1+z} \right| _+ . \end{aligned}$$
(44)

If (43) holds, then \(\mathrm{robust}({{\mathbf w}},b,{{\mathbf x}})=1+\left| {1+{{\mathbf w}}^T\varphi ({{\mathbf x}})+b} \right| _+ \). If (44) holds, then \(\mathrm{robust}({{\mathbf w}},b,{{\mathbf x}})=1+\left| {1-{{\mathbf w}}^T\varphi ({{\mathbf x}})-b} \right| _+ \). From the proof procedure of Proposition 1, we know

$$\begin{aligned} \min \limits _{0\le m\le 1} L({{\mathbf w}},b,m,{{\mathbf x}})=\mathrm{robust}({{\mathbf w}},b,{{\mathbf x}}). \end{aligned}$$
(45)

Let \(\mathrm{ErrorClass}({{\mathbf w}},b,{{\mathbf x}},y)\) be a misclassification error function,

$$\begin{aligned} \mathrm{ErrorClass}({{\mathbf w}},b,{{\mathbf x}},y)=\left\{ \begin{array}{ll} 1, & y({{\mathbf w}}^T\varphi ({{\mathbf x}})+b)<0 \\ 0, & y({{\mathbf w}}^T\varphi ({{\mathbf x}})+b)\ge 0 \end{array} \right. . \end{aligned}$$
(46)

From the definition of \(\mathrm{robust}({{\mathbf w}},b,{{\mathbf x}})\) and (46), the following inequality holds:

$$\begin{aligned} {\mathrm{robust}}({{\mathbf w}},b,{{\mathbf x}})\ge 1\ge {\mathrm{ErrorClass}}({{\mathbf w}},b,{{\mathbf x}},y), \end{aligned}$$
(47)

which shows that the bilateral truncated loss function \(\mathrm{robust}({{\mathbf w}},b,{{\mathbf x}})\) in the optimization problem (14) is an upper bound on the misclassification error function.
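The same compact form gives a direct numerical check that the bilateral truncated loss dominates the 0-1 loss for either label (a minimal sketch; all names and the grid are illustrative, not from the paper):

```python
import numpy as np

def hinge(t):
    return np.maximum(t, 0.0)  # |t|_+

def robust_loss(z):
    """Bilateral truncated loss, 1 + min(|1 - z|_+, |1 + z|_+)."""
    return 1.0 + np.minimum(hinge(1.0 - z), hinge(1.0 + z))

def error_class(z, y):
    """Misclassification indicator of Eq. (46), with z = w^T phi(x) + b."""
    return (y * z < 0.0).astype(float)

z = np.linspace(-5.0, 5.0, 501)
for y in (-1.0, 1.0):
    assert np.all(robust_loss(z) >= error_class(z, y))   # inequality (47)
print("robust(w, b, x) >= ErrorClass(w, b, x, y) for both labels.")
```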

Proof of Theorem 1

Define

$$\begin{aligned}&F_{\mathrm{rob}} ({{\mathbf w}},b)=\frac{1}{2}{{\mathbf w}}^T{{\mathbf w}}+C\sum \limits _{i=1}^l {\mathrm{robust}({{\mathbf w}},b,{{\mathbf x}}_i )} , \end{aligned}$$
(48)
$$\begin{aligned}&F_L ({{\mathbf w}},b,{{\mathbf m}})=\frac{1}{2}{{\mathbf w}}^T{{\mathbf w}}+C\sum \limits _{i=1}^l {L({{\mathbf w}},b,m_i, {{\mathbf x}}_i )}, \end{aligned}$$
(49)
$$\begin{aligned}&({{\mathbf w}}_r, b_r )=\arg \min \limits _{w,b} F_{\mathrm{rob}} ({{\mathbf w}},b), \end{aligned}$$
(50)
$$\begin{aligned}&({{\mathbf w}}_L, b_L, {{\mathbf m}}_L)=\arg \min \limits _{w,b} \min \limits _{0\le m\le 1} F_L ({{\mathbf w}},b,{{\mathbf m}}), \end{aligned}$$
(51)
$$\begin{aligned}&{{\mathbf m}}_r =\arg \min \limits _{0\le m\le 1} F_L ({{\mathbf w}}_r, b_r, {{\mathbf m}}). \end{aligned}$$
(52)

From Proposition 2, we know the following two equalities hold:

$$\begin{aligned}&F_{\mathrm{rob}} ({{\mathbf w}}_r, b_r )=\min \limits _{0\le {{\mathbf m}}\le 1} F_L ({{\mathbf w}}_r, b_r, {{\mathbf m}}), \end{aligned}$$
(53)
$$\begin{aligned}&\min \limits _{0\le {{\mathbf m}}\le 1} F_L ({{\mathbf w}}_L, b_L ,{{\mathbf m}})=F_{\mathrm{rob}} ({{\mathbf w}}_L, b_L ). \end{aligned}$$
(54)

Considering that \(({{\mathbf w}}_r, b_r )\) and \(({{\mathbf w}}_L ,b_L )\) are the optimal solutions of the optimization problems \(\min \nolimits _{w,b} F_{\mathrm{rob}} ({{\mathbf w}},b)\) and \(\min \nolimits _{w,b} \min \nolimits _{0\le m\le 1} F_L ({{\mathbf w}},b,{{\mathbf m}})\), respectively, we have:

$$\begin{aligned}&F_{\mathrm{rob}} ({{\mathbf w}}_L, b_L )\ge \min \limits _{w,b} F_{\mathrm{rob}} ({{\mathbf w}},b),\end{aligned}$$
(55)
$$\begin{aligned}&\min \limits _{0\le {{\mathbf m}}\le 1} F_L ({{\mathbf w}}_r, b_r, {{\mathbf m}})\ge \min \limits _{{{\mathbf w}},b} \min \limits _{0\le {{\mathbf m}}\le 1} F_L ({{\mathbf w}},b,{{\mathbf m}}). \end{aligned}$$
(56)

Based on (50), (53), and (56), we can obtain

$$\begin{aligned} \min \limits _{w,b} F_{\mathrm{rob}} ({{\mathbf w}},b)\ge \min \limits _{{{\mathbf w}},b} \min \limits _{0\le {{\mathbf m}}\le 1} F_L ({{\mathbf w}},b,{{\mathbf m}}). \end{aligned}$$
(57)

Based on (51), (54), and (55), we have

$$\begin{aligned} \min \limits _{w,b} \min \limits _{0\le {{\mathbf m}}\le 1} F_L ({{\mathbf w}},b,{{\mathbf m}})\ge \min \limits _{w,b} F_{\mathrm{rob}} ({{\mathbf w}},b). \end{aligned}$$
(58)

Comparing (57) with (58) yields

$$\begin{aligned}&\min \limits _{{{\mathbf w}},b} \min \limits _{0\le m\le 1} \frac{1}{2}{{\mathbf w}}^T{{\mathbf w}}+C\sum \limits _{i=1}^l {L({{\mathbf w}},b,m_i, {{\mathbf x}}_i )} \\&\quad =\min \limits _{{{\mathbf w}},b} \frac{1}{2}{{\mathbf w}}^T{{\mathbf w}}+C\sum \limits _{i=1}^l {\mathrm{robust}({{\mathbf w}},b,{{\mathbf x}}_i )} . \end{aligned}$$

From the following inequality

$$\begin{aligned} F_{\mathrm{rob}} ({{\mathbf w}}_r, b_r )&= F_L ({{\mathbf w}}_r, b_r ,{{\mathbf m}}_r )\ge F_L ({{\mathbf w}}_L, b_L, {{\mathbf m}}_L )\\&= F_{\mathrm{rob}} ({{\mathbf w}}_L, b_L )\ge F_{\mathrm{rob}} ({{\mathbf w}}_r, b_r ), \end{aligned}$$

we know that the optimal solutions \(({{\mathbf w}}_r, b_r )\) and \(({{\mathbf w}}_L, b_L )\) of the optimization problems (13) and (14) with respect to \(({{\mathbf w}},b)\) are interchangeable.

Proof of Theorem 2

Based on the decision values \(z_i =\sum \nolimits _{j=1}^l {\alpha _j K({{\mathbf x}}_j, {{\mathbf x}}_i )} +b\), we divide the training samples into seven sets \(U_1 =\{ i\mid \vert z_i -1\vert \le h \}\), \(U_2 =\{ i\mid \vert z_i +1\vert \le h \}\), \(B_1 =\{ i\mid h<z_i <1-h \}\), \(B_2 =\{ i\mid \vert z_i \vert \le h \}\), \(B_3 =\{ i\mid -1+h<z_i <-h \}\), \(N_1 =\{ i\mid z_i >1+h \}\) and \(N_2 =\{ i\mid z_i <-1-h \}\), which are illustrated in Fig. 3. Let \(n_{U_1}\), \(n_{U_2}\), \(n_{B_1}\), \(n_{B_2}\), \(n_{B_3}\), \(n_{N_1}\) and \(n_{N_2}\) denote the numbers of training samples located in the sets \(U_1\), \(U_2\), \(B_1\), \(B_2\), \(B_3\), \(N_1\) and \(N_2\), respectively, and assume the training samples are ordered according to these sets. \({{\mathbf I}}_{U_1}\) denotes the \(l\times l\) diagonal matrix whose first \(n_{U_1}\) diagonal elements are 1 and whose remaining elements are zeros. \({{\mathbf I}}_{U_2}\) (\({{\mathbf I}}_{B_1}\), \({{\mathbf I}}_{B_2}\), \({{\mathbf I}}_{B_3}\), \({{\mathbf I}}_{N_1}\) and \({{\mathbf I}}_{N_2}\)) denotes the \(l\times l\) diagonal matrix whose first \(n_{U_1}\) (\(n_{U_1} +n_{U_2}\), \(n_{U_1} +n_{U_2} +n_{B_1}\), \(n_{U_1} +n_{U_2} +n_{B_1} +n_{B_2}\), \(n_{U_1} +n_{U_2} +n_{B_1} +n_{B_2} +n_{B_3}\), \(n_{U_1} +n_{U_2} +n_{B_1} +n_{B_2} +n_{B_3} +n_{N_1}\)) diagonal elements are zeros, whose following \(n_{U_2}\) (\(n_{B_1}\), \(n_{B_2}\), \(n_{B_3}\), \(n_{N_1}\) and \(n_{N_2}\)) elements are 1, and whose remaining elements are zeros. Similarly, \({{\mathbf e}}_{U_1}\) denotes the \(l\times 1\) vector whose first \(n_{U_1}\) elements are 1 and whose remaining elements are zeros, and \({{\mathbf e}}_{U_2}\) (\({{\mathbf e}}_{B_1}\), \({{\mathbf e}}_{B_2}\), \({{\mathbf e}}_{B_3}\), \({{\mathbf e}}_{N_1}\) and \({{\mathbf e}}_{N_2}\)) denotes the \(l\times 1\) vector whose first \(n_{U_1}\) (\(n_{U_1} +n_{U_2}\), \(n_{U_1} +n_{U_2} +n_{B_1}\), \(n_{U_1} +n_{U_2} +n_{B_1} +n_{B_2}\), \(n_{U_1} +n_{U_2} +n_{B_1} +n_{B_2} +n_{B_3}\), \(n_{U_1} +n_{U_2} +n_{B_1} +n_{B_2} +n_{B_3} +n_{N_1}\)) elements are zeros, whose following \(n_{U_2}\) (\(n_{B_1}\), \(n_{B_2}\), \(n_{B_3}\), \(n_{N_1}\) and \(n_{N_2}\)) elements are 1, and whose remaining elements are zeros. A small sketch of this bookkeeping is given after Fig. 3.

Fig. 3  The partitioned seven sets
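To make the bookkeeping concrete, the sketch below builds the seven index sets and the corresponding indicator matrices and vectors from a vector of decision values. It works with boolean masks in the original sample order, which is equivalent, up to a permutation of the samples, to the reordered construction above; the toy decision values, the value of \(h\) and all names are illustrative assumptions.

```python
import numpy as np

def partition_sets(z, h):
    """Index masks for the seven sets U1, U2, B1, B2, B3, N1, N2 of the proof."""
    return {
        "U1": np.abs(z - 1.0) <= h,
        "U2": np.abs(z + 1.0) <= h,
        "B1": (z > h) & (z < 1.0 - h),
        "B2": np.abs(z) <= h,
        "B3": (z > -1.0 + h) & (z < -h),
        "N1": z > 1.0 + h,
        "N2": z < -1.0 - h,
    }

def indicators(masks):
    """Diagonal indicator matrices I_S and indicator vectors e_S for each set S."""
    I = {name: np.diag(mask.astype(float)) for name, mask in masks.items()}
    e = {name: mask.astype(float) for name, mask in masks.items()}
    return I, e

# Toy decision values and smoothing parameter (illustrative only).
z = np.array([1.02, 0.4, -0.03, -0.6, -0.98, 1.7, -2.1])
h = 0.1
masks = partition_sets(z, h)
I, e = indicators(masks)
assert sum(m.sum() for m in masks.values()) == len(z)  # the seven sets partition the samples
```

With the samples kept in their original order, each \({{\mathbf I}}_S\) is simply the diagonal matrix of the mask for the set \(S\) and each \({{\mathbf e}}_S\) is the mask itself, which is all that the gradient and Hessian expressions below require.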

From

$$\begin{aligned} \frac{\partial G_1^*(z_i)}{\partial z_i }=\left\{ \begin{array}{ll} -1, & z_i <1-h \\ \frac{z_i -(1+h)}{2h}, & \left| {z_i -1} \right| \le h \\ 0, & z_i >1+h \end{array} \right. , \end{aligned}$$
(59)

and

$$\begin{aligned} \frac{\partial H_1^*(z_i)}{\partial z_i}=\left\{ \begin{array}{ll} 1, & z_i >-1+h \\ \frac{z_i +(1+h)}{2h}, & \left| {z_i +1} \right| \le h \\ 0, & z_i <-1-h \end{array} \right. , \end{aligned}$$
(60)

we can obtain the first- and second-order partial derivatives of \(J({\varvec{\alpha }},b)\) with respect to \({\varvec{\alpha }}\) and \(b\) as follows:

$$\begin{aligned} \frac{\partial J({\varvec{\alpha }},b)}{\partial {\varvec{\alpha }}}&= {{\mathbf K}}{\varvec{\alpha }}+C\sum \limits _{i=1}^l \left( \frac{\partial G_1^*(z_i )}{\partial z_i }+\frac{\partial H_1^*(z_i )}{\partial z_i }\right) {{\mathbf K}}_i +C\sum \limits _{i=1}^l \lambda _i^t {{\mathbf K}}_i \nonumber \\ &= {{\mathbf K}}{\varvec{\alpha }}+C\left( \frac{{{\mathbf K}}({{\mathbf I}}_{U_1 } +{{\mathbf I}}_{U_2 })({{\mathbf K}}{\varvec{\alpha }}+b{{\mathbf e}})}{2h}+\frac{{{\mathbf K}}({{\mathbf I}}_{U_2 } -{{\mathbf I}}_{U_1 })(1+h){{\mathbf e}}}{2h}\right) \nonumber \\ &\quad +C{{\mathbf K}}({{\mathbf I}}_{U_1 } +{{\mathbf I}}_{N_1 } -{{\mathbf I}}_{U_2 } -{{\mathbf I}}_{N_2 }){{\mathbf e}}+C{{\mathbf K}}{\varvec{\lambda }}^t, \end{aligned}$$
(61)
$$\begin{aligned} \frac{\partial J({\varvec{\alpha }},b)}{\partial b}&= \delta b+C\sum \limits _{i=1}^l \left( \frac{\partial G_1^*(z_i )}{\partial z_i }+\frac{\partial H_1^*(z_i )}{\partial z_i }\right) +C\sum \limits _{i=1}^l \lambda _i^t \nonumber \\ &= \delta b+C\left( \frac{({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })^T({{\mathbf K}}{\varvec{\alpha }}+b{{\mathbf e}})}{2h}+\frac{({{\mathbf e}}_{U_2 } -{{\mathbf e}}_{U_1 })^T(1+h){{\mathbf e}}}{2h}\right) \nonumber \\ &\quad +C({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{N_1 } -{{\mathbf e}}_{U_2 } -{{\mathbf e}}_{N_2 })^T{{\mathbf e}}+C({\varvec{\lambda }}^t)^T{{\mathbf e}}, \end{aligned}$$
(62)
$$\begin{aligned} \frac{\partial ^2J({\varvec{\alpha }},b)}{\partial {\varvec{\alpha }}^2}&= {{\mathbf K}}+C\frac{{{\mathbf K}}({{\mathbf I}}_{U_1} +{{\mathbf I}}_{U_2}){{\mathbf K}}}{2h},\end{aligned}$$
(63)
$$\begin{aligned} \frac{\partial ^2J({\varvec{\alpha }},b)}{\partial {\varvec{\alpha }}\partial b}&= C\frac{{{\mathbf K}}({{\mathbf e}}_{U_1} +{{\mathbf e}}_{U_2} )}{2h}, \end{aligned}$$
(64)
$$\begin{aligned} \frac{\partial ^2J({\varvec{\alpha }},b)}{\partial b^2}&= \delta +C\frac{({{\mathbf e}}_{U_1} +{{\mathbf e}}_{U_2})^T({{\mathbf e}}_{U_1}+{{\mathbf e}}_{U_2})}{2h}, \end{aligned}$$
(65)

where \({{\mathbf e}}\) is the \(l\times 1\) vector whose elements are all 1, and the vector \({\varvec{\lambda }}^t\) is composed of the \(\lambda _i^t\) arranged according to the order of the training samples in the sets \(U_1\), \(U_2\), \(B_1\), \(B_2\), \(B_3\), \(N_1\) and \(N_2\), i.e., \({\varvec{\lambda }}^t=( {{\varvec{\lambda }}_{U_1 }^t, {\varvec{\lambda }}_{U_2 }^t, {\varvec{\lambda }}_{B_1 }^t, {\varvec{\lambda }}_{B_2 }^t, {\varvec{\lambda }}_{B_3 }^t, {\varvec{\lambda }}_{N_1 }^t, {\varvec{\lambda }}_{N_2 }^t })^T\).

From (61)–(65), we can obtain the Hessian matrix and the gradient of the objective function \(J({\varvec{\alpha }},b)\) as follows:

$$\begin{aligned} {\mathbf H}=\left( \begin{array}{ll} \delta +\frac{C({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })^T({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })}{2h} & \frac{C({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })^T{{\mathbf K}}}{2h}\\ \frac{C{{\mathbf K}}({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })}{2h} & {{\mathbf K}}+\frac{C{{\mathbf K}}({{\mathbf I}}_{U_1 } +{{\mathbf I}}_{U_2 }){{\mathbf K}}}{2h} \end{array}\right) , \end{aligned}$$
(66)

and

$$\begin{aligned} \nabla =\left( \begin{array}{l} \frac{\partial J({\varvec{\alpha }},b)}{\partial b} \\ \frac{\partial J({\varvec{\alpha }},b)}{\partial {\varvec{\alpha }}} \end{array}\right) ={\mathbf H}\left( \begin{array}{l} b\\ {\varvec{\alpha }} \end{array}\right) +\left( \begin{array}{l} \frac{C({{\mathbf e}}_{U_2 } -{{\mathbf e}}_{U_1 })^T(1+h){{\mathbf e}}}{2h}+C({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{N_1 } -{{\mathbf e}}_{U_2 } -{{\mathbf e}}_{N_2 })^T{{\mathbf e}} +C({\varvec{\lambda }}^t)^T{{\mathbf e}} \\ \frac{C{{\mathbf K}}({{\mathbf I}}_{U_2 } -{{\mathbf I}}_{U_1 })(1+h){{\mathbf e}}}{2h} +C{{\mathbf K}}({{\mathbf I}}_{U_1 } +{{\mathbf I}}_{N_1 } -{{\mathbf I}}_{U_2 } -{{\mathbf I}}_{N_2 }){{\mathbf e}}+C{{\mathbf K}}{\varvec{\lambda }}^t \end{array}\right) . \end{aligned}$$
(67)

For any nonzero vector \((b\;\;{\varvec{\alpha }}^T)\in R^{l+1}\),

$$\begin{aligned}&(b\;\;{\varvec{\alpha }}^T){\mathbf H}\left( \begin{array}{l} b \\ {\varvec{\alpha }} \end{array}\right) =(b\;\;{\varvec{\alpha }}^T) \left( \begin{array}{ll} \delta +\frac{C({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })^T({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })}{2h} & \frac{C({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })^T{{\mathbf K}}}{2h} \\ \frac{C{{\mathbf K}}({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })}{2h} & {{\mathbf K}}+\frac{C{{\mathbf K}}({{\mathbf I}}_{U_1 } +{{\mathbf I}}_{U_2 }){{\mathbf K}}}{2h} \end{array}\right) \left( \begin{array}{l} b \\ {\varvec{\alpha }} \end{array}\right) \\&\quad =b^2\delta +b^2C\frac{({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })^T({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })}{2h} +bC{\varvec{\alpha }}^T\frac{{{\mathbf K}}({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })}{2h} +bC\frac{({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })^T{{\mathbf K}}}{2h}{\varvec{\alpha }}+{\varvec{\alpha }}^T\left( {{\mathbf K}}+\frac{C{{\mathbf K}}({{\mathbf I}}_{U_1 } +{{\mathbf I}}_{U_2 }){{\mathbf K}}}{2h}\right) {\varvec{\alpha }} \\&\quad =b^2\delta +{\varvec{\alpha }}^T{{\mathbf K}}{\varvec{\alpha }}+C\left( b^2\frac{({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })^T({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })}{2h} +b{\varvec{\alpha }}^T\frac{{{\mathbf K}}({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })}{2h}+b\frac{({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })^T{{\mathbf K}}}{2h}{\varvec{\alpha }}+{\varvec{\alpha }}^T\frac{{{\mathbf K}}({{\mathbf I}}_{U_1 } +{{\mathbf I}}_{U_2 }){{\mathbf K}}}{2h}{\varvec{\alpha }}\right) \\&\quad =b^2\delta +{\varvec{\alpha }}^T{{\mathbf K}}{\varvec{\alpha }}+\frac{C}{2h}\left( b({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })+({{\mathbf I}}_{U_1 } +{{\mathbf I}}_{U_2 }){{\mathbf K}}{\varvec{\alpha }}\right) ^T \left( b({{\mathbf e}}_{U_1 } +{{\mathbf e}}_{U_2 })+({{\mathbf I}}_{U_1 } +{{\mathbf I}}_{U_2 }){{\mathbf K}}{\varvec{\alpha }}\right) >0. \end{aligned}$$

Therefore, the optimization problem (31) is a strictly convex QP problem.
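To connect Theorem 2 with the Newton–Armijo iteration mentioned in the abstract, the sketch below assembles \({\mathbf H}\) as in (66), confirms its positive definiteness numerically via a Cholesky factorization, and performs one Newton step with Armijo backtracking on a stand-in objective. The toy kernel matrix, the choice of \(U_1\) and \(U_2\), the stand-in objective and the step-size constants are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hessian(K, eU1, eU2, IU1, IU2, C, h, delta):
    """Hessian H of Eq. (66), with the (b, alpha) ordering used in the proof."""
    eU = eU1 + eU2
    top = np.concatenate(([delta + C * (eU @ eU) / (2 * h)], C * (eU @ K) / (2 * h)))
    bottom = np.hstack(((C * (K @ eU) / (2 * h))[:, None],
                        K + C * (K @ (IU1 + IU2) @ K) / (2 * h)))
    return np.vstack((top[None, :], bottom))

def newton_armijo_step(theta, grad, H, J, rho=0.5, sigma=1e-4):
    """One Newton step with an Armijo backtracking line search (illustrative)."""
    d = np.linalg.solve(H, -grad)          # Newton direction; H is PD by Theorem 2
    t = 1.0
    while J(theta + t * d) > J(theta) + sigma * t * grad @ d:
        t *= rho                           # shrink the step until the Armijo rule holds
    return theta + t * d

# Toy problem: random positive definite kernel matrix and arbitrary U1/U2 indicators.
rng = np.random.default_rng(0)
l, C, h, delta = 6, 1.0, 0.1, 1.0
A = rng.standard_normal((l, l))
K = A @ A.T + 1e-3 * np.eye(l)             # positive definite kernel matrix
mask_U1, mask_U2 = np.zeros(l), np.zeros(l)
mask_U1[:2], mask_U2[2:4] = 1.0, 1.0
H = hessian(K, mask_U1, mask_U2, np.diag(mask_U1), np.diag(mask_U2), C, h, delta)
np.linalg.cholesky(H)                      # succeeds only if H is positive definite

# Stand-in objective J(theta) = 0.5 theta^T H theta (not the paper's J), theta = (b, alpha).
J = lambda th: 0.5 * th @ H @ th
theta = rng.standard_normal(l + 1)
theta = newton_armijo_step(theta, H @ theta, H, J)  # H @ theta is the stand-in gradient
```

In the actual algorithm the objective would be \(J({\varvec{\alpha }},b)\) and the gradient would be \(\nabla \) from (67); the strict convexity established above guarantees that the Newton direction is well defined.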


Cite this article

Yang, X., Han, L., Li, Y. et al. A bilateral-truncated-loss based robust support vector machine for classification problems. Soft Comput 19, 2871–2882 (2015). https://doi.org/10.1007/s00500-014-1448-9
