
Nonlinear optimization and support vector machines

  • Invited Survey

Abstract

Support Vector Machines (SVMs) form one of the most important classes of machine learning models and algorithms, and have been successfully applied in various fields. Nonlinear optimization plays a crucial role in SVM methodology, both in defining the machine learning models and in designing convergent and efficient algorithms for large-scale training problems. In this paper we present the convex programming problems underlying SVM, focusing on supervised binary classification. We analyze the most important and most widely used optimization methods for SVM training problems, and we discuss how the properties of these problems can be exploited in the design of useful algorithms.



Author information

Corresponding author

Correspondence to Marco Sciandrone.

Appendices

Appendix A: Proof of existence and uniqueness of the optimal hyperplane

The idea underlying the proof of existence and uniqueness of the optimal hyperplane is based on the following steps:

  • for each separating hyperplane \(H(w,b)\), there exists a separating hyperplane \(H({\hat{w}},{\hat{b}})\) such that

    $$\begin{aligned} {1\over {\Vert w\Vert }}\le \rho (w,b)\le {1\over {\Vert {\hat{w}}\Vert }}; \end{aligned}$$
  • the above condition implies that problem (2), i.e.,

    $$\begin{aligned} \begin{array}{ll} \displaystyle {\max \limits _{w\in \mathfrak {R}^n,b\in \mathfrak {R}}}&{}\quad \rho (w,b)\\ \mathrm{s.t.}&{}\quad w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\qquad \forall x^j\in B \end{array} \end{aligned}$$

    admits a solution provided that the following problem

    $$\begin{aligned} \begin{array}{ll}\displaystyle {\max \limits _{w\in \mathfrak {R}^n,b\in \mathfrak {R}}}&{}\quad {{1}\over {\Vert w\Vert }}\\ \mathrm{s.t.}&{}\quad w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\qquad \forall x^j\in B\end{array} \end{aligned}$$
    (54)

    admits a solution;

  • problem (54) is obviously equivalent to

    $$\begin{aligned} \begin{array}{ll}\displaystyle {\min \limits _{w\in \mathfrak {R}^n,b\in \mathfrak {R}}}&{}\quad \Vert w\Vert ^2\\ \mathrm{s.t.}&{}\quad w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\qquad \forall x^j\in B;\end{array} \end{aligned}$$
    (55)
  • then we prove that (55) admits a unique solution, which is also the unique solution of (2).

Lemma 1

Let \(H({\hat{w}},{\hat{b}})\) be a separating hyperplane. Then

$$\begin{aligned} \rho ({\hat{w}},{\hat{b}})\ge {1\over {\Vert {\hat{w}}\Vert }}. \end{aligned}$$

Proof

Since

$$\begin{aligned} |{\hat{w}}^Tx^\ell +{\hat{b}}|\ge 1,\quad \forall x^\ell \in A\cup B, \end{aligned}$$

it follows

$$\begin{aligned} \rho ({\hat{w}},{\hat{b}})=\min \limits _{x^\ell \in A\cup B}\left\{ {{|{\hat{w}}^Tx^\ell +{\hat{b}}|} \over {\Vert {\hat{w}}\Vert }}\right\} \ge {1\over {\Vert {\hat{w}}\Vert }}. \end{aligned}$$

\(\square \)
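
A quick numerical check of Lemma 1 can be done with a few lines of NumPy. The sketch below (made-up data and an arbitrarily chosen separating hyperplane, not material from the paper) computes \(\rho ({\hat{w}},{\hat{b}})\) as the smallest distance of the points of \(A\cup B\) from the hyperplane and compares it with \(1/\Vert {\hat{w}}\Vert \).

```python
# Illustration of Lemma 1 on made-up data (not from the paper).
import numpy as np

A = np.array([[2.0, 2.0], [3.0, 1.5]])    # points labelled +1
B = np.array([[0.0, 0.0], [-1.0, 0.5]])   # points labelled -1

w_hat = np.array([1.0, 1.0])              # a separating hyperplane H(w_hat, b_hat):
b_hat = -2.0                              # w_hat^T x + b_hat >= 1 on A, <= -1 on B

X = np.vstack([A, B])
rho = np.min(np.abs(X @ w_hat + b_hat)) / np.linalg.norm(w_hat)
print(rho, 1.0 / np.linalg.norm(w_hat))   # rho(w_hat, b_hat) >= 1 / ||w_hat||
```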

Lemma 2

Given any separating hyperplane \(H({\hat{w}},{\hat{b}})\), there exists a separating hyperplane \(H({\bar{w}},{\bar{b}})\) such that

$$\begin{aligned} \rho ({\hat{w}},{\hat{b}})\le \rho ({\bar{w}},{\bar{b}})={1\over {\Vert {\bar{w}}\Vert }}. \end{aligned}$$
(56)

Moreover there exist two points \(x^+\in A\) and \(x^-\in B\) such that

$$\begin{aligned} \begin{array}{l} {\bar{w}}^Tx^++{\bar{b}}=1\\ {\bar{w}}^Tx^-+{\bar{b}}=-1 \end{array} \end{aligned}$$
(57)

Proof

Let \({\hat{x}}^i\in A\) and \({\hat{x}}^j\in B\) be the closest points to \(H({\hat{w}},{\hat{b}})\), that is, the two points such that

$$\begin{aligned} \begin{array}{ll} {\hat{d}}_i&{}=\displaystyle {{{|{\hat{w}}^T{\hat{x}}^i+{\hat{b}}|}\over {\Vert {\hat{w}}\Vert }}\le {{|{\hat{w}}^Tx^i+{\hat{b}}|}\over {\Vert {\hat{w}}\Vert }}},\quad \forall x^i\in A\\ {\hat{d}}_j&{}=\displaystyle {{{|{\hat{w}}^T{\hat{x}}^j+{\hat{b}}|}\over {\Vert {\hat{w}}\Vert }}\le {{|{\hat{w}}^Tx^j+{\hat{b}}|}\over {\Vert {\hat{w}}\Vert }}},\quad \forall x^j\in B \end{array} \end{aligned}$$
(58)

from which it follows

$$\begin{aligned} \rho ({\hat{w}},{\hat{b}})=\min \{{\hat{d}}_i,{\hat{d}}_j\}\le {1\over 2}({\hat{d}}_i+{\hat{d}}_j)={{{\hat{w}}^T({\hat{x}}^i- {\hat{x}}^j)}\over {2\Vert {\hat{w}}\Vert }}. \end{aligned}$$
(59)

Let us consider the numbers \(\alpha \) and \(\beta \) such that

$$\begin{aligned} \begin{array}{ll}\alpha {\hat{w}}^T{\hat{x}}^i+\beta &{}=1\\ \alpha {\hat{w}}^T{\hat{x}}^j+\beta &{}=-1\end{array} \end{aligned}$$
(60)

that is, the numbers

$$\begin{aligned} \alpha ={2\over {{\hat{w}}^T({\hat{x}}^i-{\hat{x}}^j)}},\qquad \beta =-\,{{{\hat{w}}^T ({\hat{x}}^i+{\hat{x}}^j)}\over {{\hat{w}}^T({\hat{x}}^i-{\hat{x}}^j)}}. \end{aligned}$$

It can be easily verified that \(0<\alpha \le 1\). We will show that the hyperplane \(H({\bar{w}},{\bar{b}}) \equiv H(\alpha {\hat{w}},\beta )\) is a separating hyperplane for the sets A and B, and it is such that (56) holds. Indeed, using (58), we have

$$\begin{aligned} \begin{array}{ll}{\hat{w}}^Tx^i&{}\ge {\hat{w}}^T{\hat{x}}^i,\quad \forall x^i\in A\\ {\hat{w}}^Tx^j&{}\le {\hat{w}}^T{\hat{x}}^j,\quad \forall x^j\in B. \end{array} \end{aligned}$$

As \(\alpha >0\), we can write

$$\begin{aligned} \begin{array}{ll} \alpha {\hat{w}}^Tx^i+\beta \ge \alpha {\hat{w}}^T{\hat{x}}^i+\beta =&{}1, \phantom {-1}\quad \forall x^i\in A\\ \alpha {\hat{w}}^Tx^j+\beta \le \alpha {\hat{w}}^T{\hat{x}}^j+\beta =&{}-1,\quad \forall x^j\in B \end{array} \end{aligned}$$
(61)

from which we get that \({\bar{w}}\) and \({\bar{b}}\) satisfy (1), and hence that \(H({\bar{w}},{\bar{b}})\) is a separating hyperplane for the sets A and B.

Furthermore, taking into account (61) and the value of \(\alpha \), we have

$$\begin{aligned} \rho ({\bar{w}},{\bar{b}})=\min \limits _{x^\ell \in A\cup B}\left\{ {{|\bar{w}^Tx^\ell +{\bar{b}}|} \over {\Vert {\bar{w}}\Vert }}\right\} ={1\over {\Vert {\bar{w}}\Vert }}={1\over {\alpha \Vert {\hat{w}}\Vert }}= {{{\hat{w}}^T({\hat{x}}^i-{\hat{x}}^j)}\over {2\Vert {\hat{w}}\Vert }}. \end{aligned}$$

Condition (56) follows from the above equality and (59). Using (60) we obtain that (57) holds with \(x^+={\hat{x}}^i\) and \(x^-={\hat{x}}^j\). \(\square \)
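
The rescaling used in the proof of Lemma 2 can also be reproduced numerically. The sketch below (same made-up data as above; an illustration only, not the authors' code) computes \(\alpha \) and \(\beta \) from (60) and checks that the rescaled hyperplane satisfies (56) and (57).

```python
# Sketch of the rescaling (60) used in Lemma 2, on the made-up data above.
import numpy as np

A = np.array([[2.0, 2.0], [3.0, 1.5]])
B = np.array([[0.0, 0.0], [-1.0, 0.5]])
w_hat, b_hat = np.array([1.0, 1.0]), -2.0

# closest points of A and B to H(w_hat, b_hat), as in (58)
xi = A[np.argmin(np.abs(A @ w_hat + b_hat))]
xj = B[np.argmin(np.abs(B @ w_hat + b_hat))]

alpha = 2.0 / (w_hat @ (xi - xj))                      # solves (60)
beta = -(w_hat @ (xi + xj)) / (w_hat @ (xi - xj))
w_bar, b_bar = alpha * w_hat, beta

print(w_bar @ xi + b_bar, w_bar @ xj + b_bar)          # (57): +1 and -1
X = np.vstack([A, B])
rho_bar = np.min(np.abs(X @ w_bar + b_bar)) / np.linalg.norm(w_bar)
print(rho_bar, 1.0 / np.linalg.norm(w_bar))            # (56): the two coincide
```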

Proposition 4

The following problem

$$\begin{aligned} \begin{array}{ll}\mathrm{min}\quad &{}\Vert w\Vert ^2\\ \mathrm{s.t.}\quad &{}w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\quad \forall x^j\in B\end{array} \end{aligned}$$
(62)

admits a unique solution \((w^\star ,b^\star )\).

Proof

Let \({{\mathcal {F}}}\) be the feasible set, that is,

$$\begin{aligned} {{\mathcal {F}}}=\{(w,b)\in \mathfrak {R}^n\times \mathfrak {R}:~w^Tx^i+b\ge 1,\forall x^i\in A,~w^Tx^j+b\le -1, \forall x^j\in B\}. \end{aligned}$$

Given any \((w_o,b_o)\in {{\mathcal {F}}}\), let us consider the level set

$$\begin{aligned} {{\mathcal {L}}}_o=\{(w,b)\in {{\mathcal {F}}}:~\Vert w\Vert ^2\le \Vert w_o\Vert ^2\}. \end{aligned}$$

The set \({{\mathcal {L}}}_o\) is closed, and we will show that it is also bounded. To this aim, assume by contradiction that there exists an unbounded sequence \(\{(w_k,b_k)\}\) belonging to \({{\mathcal {L}}}_o\). Since \(\Vert w_k\Vert \le \Vert w_o\Vert ,\forall k\), we must have \(|b_k|\rightarrow \infty \). For any k we can write

$$\begin{aligned} \begin{array}{ll}w_k^Tx^i+b_k&{}\ge 1,\phantom {-}\quad \forall x^i\in A\\ w_k^Tx^j+b_k&{}\le -1,\quad \forall x^j\in B\end{array} \end{aligned}$$

and hence, as \(|b_k|\rightarrow \infty \), the constraints force \(\Vert w_k\Vert \rightarrow \infty \) (if \(b_k\rightarrow -\infty \) the first group of constraints requires \(w_k^Tx^i\rightarrow +\infty \) for every \(x^i\in A\), while if \(b_k\rightarrow +\infty \) the second group requires \(w_k^Tx^j\rightarrow -\infty \) for every \(x^j\in B\)), so that for k sufficiently large we have \(\Vert w_k\Vert ^2 >\Vert w_o\Vert ^2\), and this contradicts the fact that \(\{(w_k,b_k)\}\) belongs to \({{\mathcal {L}}}_o\). Thus \({{\mathcal {L}}}_o\) is a compact set.

Weierstrass’s theorem implies that the function \(\Vert w\Vert ^2\) admits a minimum \((w^\star ,b^\star )\) on \({{\mathcal {L}}}_o\), and hence on \(\mathcal{F}\). As a consequence, \((w^\star ,b^\star )\) is a solution of (62).

In order to prove that \((w^\star ,b^\star )\) is the unique solution, assume by contradiction that there exists a pair \(({\bar{w}},{\bar{b}})\in {{\mathcal {F}}}\), \(({\bar{w}},{\bar{b}})\ne (w^\star ,b^\star )\), such that \(\Vert {\bar{w}}\Vert ^2=\Vert w^\star \Vert ^2\). Suppose first that \({\bar{w}}\ne w^\star \). The set \({{\mathcal {F}}}\) is convex, so that

$$\begin{aligned} \lambda (w^\star ,b^\star )+(1-\lambda )({\bar{w}},{\bar{b}})\in {{\mathcal {F}}},\quad \forall \lambda \in [0,1]. \end{aligned}$$

Since \(\Vert w\Vert ^2\) is a strictly convex function, for any \(\lambda \in (0,1)\) it follows

$$\begin{aligned} \Vert \lambda w^\star +(1-\lambda ){\bar{w}}\Vert ^2<\lambda \Vert w^\star \Vert ^2+(1-\lambda )\Vert {\bar{w}}\Vert ^2. \end{aligned}$$

Taking \(\lambda =1/2\), which corresponds to considering the pair \(({\tilde{w}},{\tilde{b}})\equiv \left( {1\over 2}w^\star +{1\over 2}{\bar{w}},\,{1\over 2}b^\star +{1\over 2}{\bar{b}}\right) \), we have \(({\tilde{w}},{\tilde{b}})\in {{\mathcal {F}}}\) and

$$\begin{aligned} \Vert {\tilde{w}}\Vert ^2<{1\over 2}\Vert w^\star \Vert ^2+{1\over 2}\Vert \bar{w}\Vert ^2=\Vert w^\star \Vert ^2, \end{aligned}$$

and this contradicts the fact that \((w^\star ,b^\star )\) is a global minimum. Therefore, we must have \({\bar{w}}\equiv w^\star \).

Assume \(b^\star >{\bar{b}}\) (the case \(b^\star <{\bar{b}}\) is analogous), and consider the point \({\hat{x}}^i\in A\) such that

$$\begin{aligned} w{^\star }^T{\hat{x}}^i+b^\star =1 \end{aligned}$$

(the existence of such a point follows from (57) of Lemma 2). We have

$$\begin{aligned} 1=w{^\star }^T{\hat{x}}^i+b^\star ={\bar{w}}^T{\hat{x}}^i+b^\star >{\bar{w}}^T{\hat{x}}^i+ {\bar{b}} \end{aligned}$$

and this contradicts the fact that \({\bar{w}}^Tx^i+{\bar{b}}\ge 1,~\forall x^i\in A\). As a consequence, we must have \({\bar{b}}\equiv b^\star \), and hence the uniqueness of the solution is proved. \(\square \)

Proposition 5

Let \((w^\star ,b^\star )\) be the solution of (62). Then, \((w^\star ,b^\star )\) is the unique solution of the following problem

$$\begin{aligned} \begin{array}{ll}\mathrm{max}\quad &{}\rho (w,b)\\ \mathrm{s.t.}\quad &{}w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\quad \forall x^j\in B\end{array} \end{aligned}$$
(63)

Proof

We observe that \((w^\star ,b^\star )\) is the unique solution of the problem

$$\begin{aligned} \begin{array}{ll}\mathrm{max}\quad &{}{1\over {\Vert w\Vert }}\\ \mathrm{s.t.}\quad &{}w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\quad \forall x^j\in B.\end{array} \end{aligned}$$

Lemmas 1 and 2 imply that, for any separating hyperplane \(H(w,b)\), we have

$$\begin{aligned} {1\over {\Vert w\Vert }}\le \rho (w,b)\le {1\over {\Vert w^\star \Vert }} \end{aligned}$$

and hence, for the separating hyperplane \(H(w^\star ,b^\star )\) we obtain \(\rho (w^\star , b^\star )=\displaystyle {1\over {\Vert w^\star \Vert }}\), which implies that \(H(w^\star ,b^\star )\) is the optimal separating hyperplane. \(\square \)
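
In practice the optimal hyperplane of Propositions 4 and 5 is obtained by handing problem (62) to any convex QP solver. The sketch below uses cvxpy on the made-up data above (cvxpy is just one possible choice and is assumed to be installed); after solving (62) it verifies that \(\rho (w^\star ,b^\star )=1/\Vert w^\star \Vert \), as stated in Proposition 5.

```python
# Sketch: solving problem (62) with an off-the-shelf convex solver (cvxpy),
# on made-up data; any QP solver would do.
import cvxpy as cp
import numpy as np

A = np.array([[2.0, 2.0], [3.0, 1.5]])
B = np.array([[0.0, 0.0], [-1.0, 0.5]])

w = cp.Variable(2)
b = cp.Variable()
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                  [A @ w + b >= 1, B @ w + b <= -1])
prob.solve()

w_star, b_star = w.value, b.value
X = np.vstack([A, B])
rho_star = np.min(np.abs(X @ w_star + b_star)) / np.linalg.norm(w_star)
print(rho_star, 1.0 / np.linalg.norm(w_star))   # equal: the margin is maximized
```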

Appendix B: The Wolfe dual and its properties

Consider the convex problem

$$\begin{aligned} \begin{array}{ll} \min &{} f(x)\\ \text{ s.t. } &{} g(x)\le 0\\ &{} h(x)=0 \end{array} \end{aligned}$$
(64)

with \(f:\mathfrak {R}^n\rightarrow \mathfrak {R}\) convex and continuously differentiable, \(g:\mathfrak {R}^n\rightarrow \mathfrak {R}^m\) convex and continuously differentiable, and \(h:\mathfrak {R}^n\rightarrow \mathfrak {R}^p\) affine. Then its Wolfe dual is

$$\begin{aligned} \begin{array}{ll} \displaystyle \max \limits _{x,\lambda ,\mu } &{} L(x,\lambda ,\mu )\\ \text{ s.t. } &{} \nabla _xL(x,\lambda ,\mu )= 0\\ &{} \lambda \ge 0, \end{array} \end{aligned}$$
(65)

where \(L(x,\lambda ,\mu ) = f(x)+\lambda ^Tg(x)+\mu ^Th(x)\).

Proposition 6

Let \(x^*\) be a global solution of problem (64) with KKT multipliers \((\lambda ^*,\mu ^*)\). Then \((x^*,\lambda ^*,\mu ^*)\) is a solution of problem (65) and there is zero duality gap, i.e., \(f(x^*) = L(x^*,\lambda ^*,\mu ^*)\).

Proof

The point \((x^*,\lambda ^*,\mu ^*)\) is clearly feasible for problem (65) since it satisfies the KKT conditions of problem (64). Furthermore, by complementarity (\((\lambda ^*)^Tg(x^*)=0\)) and feasibility (\(h(x^*)=0\))

$$\begin{aligned} L(x^*,\lambda ^*,\mu ^*) = f(x^*)+(\lambda ^*)^Tg(x^*)+(\mu ^*)^Th(x^*)=f(x^*) \end{aligned}$$

so that there is zero duality gap. Furthermore, for any \(\lambda \ge 0\), \(\mu \in \mathfrak {R}^p\), by the feasibility of \(x^*\), we have

$$\begin{aligned} L(x^*,\lambda ^*,\mu ^*) = f(x^*)\ge f(x^*)+\lambda ^Tg(x^*)+\mu ^Th(x^*) = L(x^*,\lambda ,\mu ). \end{aligned}$$
(66)

By the convexity assumptions on f and g, the nonnegativity of \(\lambda \) and by the linearity of h, we get that \(L(\cdot ,\lambda ,\mu )\) is a convex function in x and hence, for any feasible \((x,\lambda ,\mu )\), we can write

$$\begin{aligned} L(x^*,\lambda ,\mu )\ge L(x,\lambda ,\mu )+\nabla _x L(x,\lambda ,\mu )^T(x^*-x) = L(x,\lambda ,\mu ), \end{aligned}$$
(67)

where the last equality derives from the constraints of problem (65). By combining (66) and (67), we get

$$\begin{aligned} L(x^*,\lambda ^*,\mu ^*)\ge L(x,\lambda ,\mu ) \text{ for } \text{ all } (x,\lambda ,\mu ) \text{ feasible } \text{ for } \text{ problem } (65) \end{aligned}$$

and hence \((x^*,\lambda ^*,\mu ^*)\) is a global solution of problem (65). \(\square \)
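
Proposition 6 can be checked numerically on a small instance: solve a convex program, read the KKT multipliers off the solver, and verify that \(L(x^*,\lambda ^*,\mu ^*)=f(x^*)\). The sketch below does this with cvxpy's dual variables on a made-up convex quadratic problem; it is an illustration under these assumptions, not part of the paper.

```python
# Sketch: zero duality gap of Proposition 6 on a made-up convex QP.
import cvxpy as cp
import numpy as np

Q = np.array([[2.0, 0.0], [0.0, 1.0]])            # f(x) = 1/2 x^T Q x + c^T x
c = np.array([-1.0, -1.0])
G, d = np.array([[1.0, 1.0]]), np.array([1.0])    # g(x) = Gx - d <= 0
E, e = np.array([[1.0, -1.0]]), np.array([0.2])   # h(x) = Ex - e = 0

x = cp.Variable(2)
ineq = G @ x - d <= 0
eq = E @ x - e == 0
cp.Problem(cp.Minimize(0.5 * cp.quad_form(x, Q) + c @ x), [ineq, eq]).solve()

x_s, lam, mu = x.value, ineq.dual_value, eq.dual_value
f_s = 0.5 * x_s @ Q @ x_s + c @ x_s
L_s = f_s + lam @ (G @ x_s - d) + mu @ (E @ x_s - e)
print(f_s, L_s)   # equal up to solver tolerance: zero duality gap
```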

A stronger result can be proved when the primal problem is a convex quadratic programming problem defined by (6).

Proposition 7

Let \(f(x)={{1}\over {2}}x^TQx+c^Tx\), and suppose that the matrix Q is symmetric and positive semidefinite. Let \(({\bar{x}},{\bar{\lambda }})\) be a solution of Wolfe’s dual (7). Then, there exists a vector \(x^\star \) (not necessarily equal to \({\bar{x}}\)) such that

  1. (i)

    \(Q(x^\star -{\bar{x}})=0\);

  2. (ii)

    \(x^\star \) is a solution of problem (6); and

  3. (iii)

    \(x^\star \) is a global minimum of (6) with associated multipliers \({\bar{\lambda }}\).

Proof

First, we show that in this case problem (7) is a convex quadratic programming problem. In the quadratic case, problem (7) becomes:

$$\begin{aligned} \max \limits _{x,\lambda }&\frac{1}{2} x^TQx+c^Tx+\lambda ^T(Ax-b) \end{aligned}$$
(68)
$$\begin{aligned}&Qx+c+A^T\lambda =0 \end{aligned}$$
(69)
$$\begin{aligned}&\lambda \ge 0. \end{aligned}$$
(70)

Multiplying the constraints (69) by \(x^T\) we get

$$\begin{aligned} x^TQx+c^Tx+x^TA^T\lambda = 0, \end{aligned}$$

which implies that the objective function (68) can be rewritten as

$$\begin{aligned} \max -\frac{1}{2} x^TQx-\lambda ^Tb = -\min \frac{1}{2} x^TQx+\lambda ^Tb , \end{aligned}$$

which shows that problem (68)–(70) is actually a convex quadratic optimization problem. For this problem the KKT conditions are necessary and sufficient for global optimality. Denoting by v the multipliers of the equality constraints (69) and by z the multipliers of the constraints (70), there must exist multipliers v and z such that the following conditions hold:

$$\begin{aligned}&Q{\bar{x}}-Qv=0 \end{aligned}$$
(71)
$$\begin{aligned}&b-Av-z =0\end{aligned}$$
(72)
$$\begin{aligned}&z^T{\bar{\lambda }} = 0\end{aligned}$$
(73)
$$\begin{aligned}&z\ge 0\end{aligned}$$
(74)
$$\begin{aligned}&Q{\bar{x}}+c+A^T{\bar{\lambda }} =0\end{aligned}$$
(75)
$$\begin{aligned}&{\bar{\lambda }}\ge 0. \end{aligned}$$
(76)

The expression of z can be derived from constraint (72) and substituted into (73) and (74), which implies:

$$\begin{aligned}&Av- b \le 0 \end{aligned}$$
(77)
$$\begin{aligned}&{\bar{\lambda }}^T(Av-b) = 0. \end{aligned}$$
(78)

Furthermore by subtracting (71) from (75), we get

$$\begin{aligned} Qv+c+A^T{\bar{\lambda }} =0. \end{aligned}$$
(79)

By combining (79), (78), (77) and (76) we get that the pair \((v,{\bar{\lambda }})\) satisfies the KKT conditions of problem (6); hence, setting \(x^\star =v\), we obtain the thesis, taking into account that point (i) follows from (71). \(\square \)
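
For a concrete check of Proposition 7, one can solve the Wolfe dual (68)–(70) directly (after the substitution above it is a concave quadratic problem) and compare its solution with a direct solution of the primal. The sketch below does this with cvxpy on a made-up problem in which \(Q\) is positive definite, so that point (i) forces \(x^\star ={\bar{x}}\); it is a hedged illustration, not the authors' implementation.

```python
# Sketch: Wolfe dual (68)-(70) of a made-up strictly convex QP
#   min 1/2 x^T Q x + c^T x   s.t.   A x <= b.
import cvxpy as cp
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])    # symmetric positive definite
c = np.array([-1.0, -2.0])
Am = np.array([[1.0, 1.0], [-1.0, 0.0]])
bm = np.array([1.0, 0.0])

# Wolfe dual in the reduced form derived above: max -1/2 x^T Q x - b^T lambda
x, lam = cp.Variable(2), cp.Variable(2)
dual = cp.Problem(cp.Maximize(-0.5 * cp.quad_form(x, Q) - bm @ lam),
                  [Q @ x + c + Am.T @ lam == 0, lam >= 0])
dual.solve()

# direct solution of the primal QP for comparison
xp = cp.Variable(2)
primal = cp.Problem(cp.Minimize(0.5 * cp.quad_form(xp, Q) + c @ xp), [Am @ xp <= bm])
primal.solve()

print(x.value, xp.value)          # coincide, since Q is nonsingular (point (i))
print(dual.value, primal.value)   # equal optimal values: zero duality gap
```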

Appendix C: Kernel characterization

Proposition 8

Let \(K:X\times X\rightarrow \mathfrak {R}\) be a symmetric function. Function K is a kernel if and only if the \(l\times l\) matrix

$$\begin{aligned} \left( K(x^i,x^j)\right) _{i,j=1}^l= \left( \begin{array}{ccc} K(x^1,x^1)&{}\ldots &{}K(x^1,x^l)\\ &{}\vdots &{}\\ K(x^l,x^1)&{}\ldots &{}K(x^l,x^l) \end{array} \right) \end{aligned}$$

is positive semidefinite for any set of training vectors \(\{x^1,\ldots ,x^l\}\).

Proof

  • necessity. Symmetry of the matrix derives from the symmetry of the function K. To prove positive semidefiniteness we look at the quadratic form, for any \(v\in \mathfrak {R}^l\):

    $$\begin{aligned} v^TKv&=\displaystyle \sum \limits _{i=1}^l\sum \limits _{j=1}^lv_iv_jK(x^i,x^j) =\sum _{i=1}^l\sum \limits _{j=1}^lv_iv_j\langle \phi (x^i),\phi (x^j)\rangle \\&=\left\langle \sum \limits _{i=1}^lv_i \phi (x^i), \sum _{j=1}^lv_j\phi (x^j)\right\rangle \\&=\displaystyle \langle z,z\rangle \ge 0, \end{aligned}$$
    where \(z=\sum \limits _{i=1}^lv_i\phi (x^i)\);
  • sufficiency. Assume

    $$\begin{aligned} \left( \begin{array}{ccc} K(x^1,x^1)&{}\ldots &{}K(x^1,x^l)\\ &{}\vdots &{}\\ K(x^l,x^1)&{}\ldots &{}K(x^l,x^l) \end{array} \right) \succeq 0 \end{aligned}$$
    (80)

    We need to prove that there exists a linear space \({{\mathcal {H}}}\), a function \(\phi :X\rightarrow {{\mathcal {H}}}\) and a scalar product \(\langle \cdot ,\cdot \rangle \) defined on \({{\mathcal {H}}}\) such that \(K(x,y) = \langle \phi (x),\phi (y)\rangle \) for all \(x,y\in X\). Consider the linear space

    $$\begin{aligned} \displaystyle {{\mathcal {H}}} = lin\left\{ K(\cdot ,y)\,:\, y\in X\right\} \end{aligned}$$

    with the generic element \(f(\cdot )\)

    $$\begin{aligned} \displaystyle f = \sum \limits _{i=1}^m\alpha _iK(\cdot ,x^i) \end{aligned}$$

    for any \(m\in N\), with \(\alpha _i\in \mathfrak {R}\) for \(i=1,\ldots ,m\). Given two elements \(f,g\in {{\mathcal {H}}}\), with \(g(\cdot ) = \sum _{j=1}^{m^\prime }\beta _jK(\cdot ,x^j)\), define the function \(\rho :{{\mathcal {H}}}\times {{\mathcal {H}}}\rightarrow \mathfrak {R}\) as

    $$\begin{aligned} \rho (f,g)=\sum \limits _{i=1}^m\sum \limits _{j=1}^{m^\prime }\alpha _i\beta _j K(x^i,x^j) \end{aligned}$$

    It can be shown that the function \(\rho \) is a scalar product in the space \({{\mathcal {H}}}\), by showing that the following properties hold:

    1. (i)

      \(\rho (f,g) = \rho (g,f)\)

    2. (ii)

      \(\rho (f^1+f^2,g) = \rho (f^1,g)+\rho (f^2,g)\)

    3. (iii)

      \(\rho (\lambda f,g) = \lambda \rho (f,g) \)

    4. (iv)

      \(\rho (f,f)\ge 0\) and \(\rho (f,f)=0\) implies \(f=0\)

    The first three properties are a consequence of the definition of \(\rho \) and can be easily verified. We need to show property (iv). First, we observe that, given \(f^1,\ldots ,f^p\) in \({{\mathcal {H}}}\), the matrix with elements \(\rho _{st} = \rho (f^s,f^t)\) is symmetric (thanks to property (i)) and positive semidefinite. Indeed,

    $$\begin{aligned} \displaystyle \sum \limits _{i=1}^p\sum \limits _{j=1}^p\gamma _i\gamma _j\rho _{ij} =\sum _{i=1}^p\sum \limits _{j=1}^p\gamma _i\gamma _j\rho (f^i,f^j) =\rho \left( \sum \limits _{i=1}^p\gamma _if^i,\sum _{j=1}^p\gamma _jf^j\right) \ge 0 \end{aligned}$$

    This implies in turn that all principal minors have nonnegative determinant. Consider any \(2\times 2\) principal minor, with elements \(\rho _{ij} = \rho (f^i,f^j)\). The nonnegativity of the determinant and the symmetry of the matrix imply

    $$\begin{aligned}&\rho (f^i,f^i)\rho (f^j,f^j)- \rho (f^i,f^j) \rho (f^j,f^i )\\&=\rho (f^i,f^i)\rho (f^j,f^j)- \rho (f^i,f^j)^2\ge 0 \end{aligned}$$

    so that

    $$\begin{aligned} \rho (f^i,f^j)^2\le \rho (f^i,f^i)\rho (f^j,f^j) \end{aligned}$$
    (81)

    We note that, setting \(m^\prime =1\) and \(g(\cdot ) = K(\cdot ,x)\), f(x) can be written as

    $$\begin{aligned} f(x) = \displaystyle \sum \limits _{i=1}^m\alpha _iK(x,x^i) = \rho (K(\cdot ,x),f) \end{aligned}$$

    with \(K(\cdot ,x)\in {{\mathcal {H}}}\). Furthermore, for any \(x,y \in X\), we get

    $$\begin{aligned} \rho (K(\cdot ,x),K(\cdot ,y)) = K(x,y) \end{aligned}$$

    Using (81) with \(f^i = K(\cdot ,x)\) and \(f^j = f\) we get

    $$\begin{aligned} f(x)^2 = \rho (K(\cdot ,x),f)^2\le \rho (K(\cdot ,x),K(\cdot ,x))\,\rho (f,f) = K(x,x)\,\rho (f,f) \end{aligned}$$

    which, together with (80), implies both that \(\rho (f,f)\ge 0\) and that, if \(\rho (f,f)=0\), then \(f(x)^2\le 0\) for all \(x\in X\), and hence \(f=0\). \(\square \)
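
Proposition 8 is easy to probe numerically for a specific kernel: build the Gram matrix on a sample of points and check that its smallest eigenvalue is nonnegative up to rounding. The sketch below does this for the Gaussian kernel \(K(x,y)=\exp (-\gamma \Vert x-y\Vert ^2)\) on random made-up data; of course it verifies the condition on one sample only, whereas the proposition requires it for every finite set of points.

```python
# Sketch: checking positive semidefiniteness of a Gaussian-kernel Gram matrix
# on random made-up data (one sample only; Proposition 8 asks this for all samples).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))   # l = 30 points in R^5
gamma = 0.5

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)      # Gram matrix (K(x^i, x^j))_{i,j=1}^l

print(np.linalg.eigvalsh(K).min())  # >= 0 up to rounding: K is positive semidefinite
```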


Cite this article

Piccialli, V., Sciandrone, M. Nonlinear optimization and support vector machines. 4OR-Q J Oper Res 16, 111–149 (2018). https://doi.org/10.1007/s10288-018-0378-2
