
Nonlinear optimization and support vector machines

  • Invited Survey

Abstract

Support Vector Machines (SVMs) form one of the most important classes of machine learning models and algorithms, and have been successfully applied in various fields. Nonlinear optimization plays a crucial role in SVM methodology, both in defining the machine learning models and in designing convergent and efficient algorithms for large-scale training problems. In this paper we present the convex programming problems underlying SVM, focusing on supervised binary classification. We analyze the most important and most widely used optimization methods for SVM training problems, and we discuss how the properties of these problems can be exploited in the design of useful algorithms.



Author information

Corresponding author

Correspondence to Marco Sciandrone.

Appendices

Appendix A: Proof of existence and uniqueness of the optimal hyperplane

The idea underlying the proof of existence and uniqueness of the optimal hyperplane is based on the following steps:

  • for each separating hyperplane \(H(w,b)\), there exists a separating hyperplane \(H({\hat{w}},{\hat{b}})\) such that

    $$\begin{aligned} {1\over {\Vert w\Vert }}\le \rho (w,b)\le {1\over {\Vert {\hat{w}}\Vert }}; \end{aligned}$$
  • the above condition implies that problem (2), i.e.,

    $$\begin{aligned} \begin{array}{ll} \displaystyle {\max \limits _{w\in \mathfrak {R}^n,b\in \mathfrak {R}}}&{}\quad \rho (w,b)\\ \mathrm{s.t.}&{}\quad w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\qquad \forall x^j\in B \end{array} \end{aligned}$$

    admits a solution provided that the following problem

    $$\begin{aligned} \begin{array}{ll}\displaystyle {\max \limits _{w\in \mathfrak {R}^n,b\in \mathfrak {R}}}&{}\quad {{1}\over {\Vert w\Vert }}\\ \mathrm{s.t.}&{}\quad w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\qquad \forall x^j\in B\end{array} \end{aligned}$$
    (54)

    admits a solution;

  • problem (54) is obviously equivalent to

    $$\begin{aligned} \begin{array}{ll}\displaystyle {\min \limits _{w\in \mathfrak {R}^n,b\in \mathfrak {R}}}&{}\quad \Vert w\Vert ^2\\ \mathrm{s.t.}&{}\quad w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\qquad \forall x^j\in B;\end{array} \end{aligned}$$
    (55)
  • then we prove that (55) admits a unique solution, which is also the unique solution of (2).

Lemma 1

Let \(H({\hat{w}},{\hat{b}})\) be a separating hyperplane. Then

$$\begin{aligned} \rho ({\hat{w}},{\hat{b}})\ge {1\over {\Vert {\hat{w}}\Vert }}. \end{aligned}$$

Proof

Since

$$\begin{aligned} |{\hat{w}}^Tx^\ell +{\hat{b}}|\ge 1,\quad \forall x^\ell \in A\cup B, \end{aligned}$$

it follows

$$\begin{aligned} \rho ({\hat{w}},{\hat{b}})=\min \limits _{x^\ell \in A\cup B}\left\{ {{|{\hat{w}}^Tx^\ell +{\hat{b}}|} \over {\Vert {\hat{w}}\Vert }}\right\} \ge {1\over {\Vert {\hat{w}}\Vert }}. \end{aligned}$$

\(\square \)
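
A quick numerical check of Lemma 1 can be done with a few lines of NumPy. The sketch below (made-up data and an arbitrarily chosen separating hyperplane, not material from the paper) computes \(\rho ({\hat{w}},{\hat{b}})\) as the smallest distance of the points of \(A\cup B\) from the hyperplane and compares it with \(1/\Vert {\hat{w}}\Vert \).

```python
# Illustration of Lemma 1 on made-up data (not from the paper).
import numpy as np

A = np.array([[2.0, 2.0], [3.0, 1.5]])    # points labelled +1
B = np.array([[0.0, 0.0], [-1.0, 0.5]])   # points labelled -1

w_hat = np.array([1.0, 1.0])              # a separating hyperplane H(w_hat, b_hat):
b_hat = -2.0                              # w_hat^T x + b_hat >= 1 on A, <= -1 on B

X = np.vstack([A, B])
rho = np.min(np.abs(X @ w_hat + b_hat)) / np.linalg.norm(w_hat)
print(rho, 1.0 / np.linalg.norm(w_hat))   # rho(w_hat, b_hat) >= 1 / ||w_hat||
```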

Lemma 2

Given any separating hyperplane \(H({\hat{w}},{\hat{b}})\), there exists a separating hyperplane \(H({\bar{w}},{\bar{b}})\) such that

$$\begin{aligned} \rho ({\hat{w}},{\hat{b}})\le \rho ({\bar{w}},{\bar{b}})={1\over {\Vert {\bar{w}}\Vert }}. \end{aligned}$$
(56)

Moreover there exist two points \(x^+\in A\) and \(x^-\in B\) such that

$$\begin{aligned} \begin{array}{l} {\bar{w}}^Tx^++{\bar{b}}=1\\ {\bar{w}}^Tx^-+{\bar{b}}=-1 \end{array} \end{aligned}$$
(57)

Proof

Let \({\hat{x}}^i\in A\) and \({\hat{x}}^j\in B\) be the closest points to \(H({\hat{w}},{\hat{b}})\), that is, the two points such that

$$\begin{aligned} \begin{array}{ll} {\hat{d}}_i&{}=\displaystyle {{{|{\hat{w}}^T{\hat{x}}^i+{\hat{b}}|}\over {\Vert {\hat{w}}\Vert }}\le {{|{\hat{w}}^Tx^i+{\hat{b}}|}\over {\Vert {\hat{w}}\Vert }}},\quad \forall x^i\in A\\ {\hat{d}}_j&{}=\displaystyle {{{|{\hat{w}}^T{\hat{x}}^j+{\hat{b}}|}\over {\Vert {\hat{w}}\Vert }}\le {{|{\hat{w}}^Tx^j+{\hat{b}}|}\over {\Vert {\hat{w}}\Vert }}},\quad \forall x^j\in B \end{array} \end{aligned}$$
(58)

from which it follows

$$\begin{aligned} \rho ({\hat{w}},{\hat{b}})=\min \{{\hat{d}}_i,{\hat{d}}_j\}\le {1\over 2}({\hat{d}}_i+{\hat{d}}_j)={{{\hat{w}}^T({\hat{x}}^i- {\hat{x}}^j)}\over {2\Vert {\hat{w}}\Vert }}. \end{aligned}$$
(59)

Let us consider the numbers \(\alpha \) and \(\beta \) such that

$$\begin{aligned} \begin{array}{ll}\alpha {\hat{w}}^T{\hat{x}}^i+\beta &{}=1\\ \alpha {\hat{w}}^T{\hat{x}}^j+\beta &{}=-1\end{array} \end{aligned}$$
(60)

that is, the numbers

$$\begin{aligned} \alpha ={2\over {{\hat{w}}^T({\hat{x}}^i-{\hat{x}}^j)}},\qquad \beta =-\,{{{\hat{w}}^T ({\hat{x}}^i+{\hat{x}}^j)}\over {{\hat{w}}^T({\hat{x}}^i-{\hat{x}}^j)}}. \end{aligned}$$

It can be easily verified that \(0<\alpha \le 1\). We will show that the hyperplane \(H({\bar{w}},{\bar{b}}) \equiv H(\alpha {\hat{w}},\beta )\) is a separating hyperplane for the sets A and B, and it is such that (56) holds. Indeed, using (58), we have

$$\begin{aligned} \begin{array}{ll}{\hat{w}}^Tx^i&{}\ge {\hat{w}}^T{\hat{x}}^i,\quad \forall x^i\in A\\ {\hat{w}}^Tx^j&{}\le {\hat{w}}^T{\hat{x}}^j,\quad \forall x^j\in B. \end{array} \end{aligned}$$

As \(\alpha >0\), we can write

$$\begin{aligned} \begin{array}{ll} \alpha {\hat{w}}^Tx^i+\beta \ge \alpha {\hat{w}}^T{\hat{x}}^i+\beta =&{}1, \phantom {-1}\quad \forall x^i\in A\\ \alpha {\hat{w}}^Tx^j+\beta \le \alpha {\hat{w}}^T{\hat{x}}^j+\beta =&{}-1,\quad \forall x^j\in B \end{array} \end{aligned}$$
(61)

from which we get that \({\bar{w}}\) and \({\bar{b}}\) satisfy (1), and hence that \(H({\bar{w}},{\bar{b}})\) is a separating hyperplane for the sets A and B.

Furthermore, taking into account (61) and the value of \(\alpha \), we have

$$\begin{aligned} \rho ({\bar{w}},{\bar{b}})=\min \limits _{x^\ell \in A\cup B}\left\{ {{|\bar{w}^Tx^\ell +{\bar{b}}|} \over {\Vert {\bar{w}}\Vert }}\right\} ={1\over {\Vert {\bar{w}}\Vert }}={1\over {\alpha \Vert {\hat{w}}\Vert }}= {{{\hat{w}}^T({\hat{x}}^i-{\hat{x}}^j)}\over {2\Vert {\hat{w}}\Vert }}. \end{aligned}$$

Condition (56) follows from the above equality and (59). Using (60) we obtain that (57) holds with \(x^+={\hat{x}}^i\) and \(x^-={\hat{x}}^j\). \(\square \)
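
The rescaling used in the proof of Lemma 2 can also be reproduced numerically. The sketch below (same made-up data as above; an illustration only, not the authors' code) computes \(\alpha \) and \(\beta \) from (60) and checks that the rescaled hyperplane satisfies (56) and (57).

```python
# Sketch of the rescaling (60) used in Lemma 2, on the made-up data above.
import numpy as np

A = np.array([[2.0, 2.0], [3.0, 1.5]])
B = np.array([[0.0, 0.0], [-1.0, 0.5]])
w_hat, b_hat = np.array([1.0, 1.0]), -2.0

# closest points of A and B to H(w_hat, b_hat), as in (58)
xi = A[np.argmin(np.abs(A @ w_hat + b_hat))]
xj = B[np.argmin(np.abs(B @ w_hat + b_hat))]

alpha = 2.0 / (w_hat @ (xi - xj))                      # solves (60)
beta = -(w_hat @ (xi + xj)) / (w_hat @ (xi - xj))
w_bar, b_bar = alpha * w_hat, beta

print(w_bar @ xi + b_bar, w_bar @ xj + b_bar)          # (57): +1 and -1
X = np.vstack([A, B])
rho_bar = np.min(np.abs(X @ w_bar + b_bar)) / np.linalg.norm(w_bar)
print(rho_bar, 1.0 / np.linalg.norm(w_bar))            # (56): the two coincide
```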

Proposition 4

The following problem

$$\begin{aligned} \begin{array}{ll}\mathrm{min}\quad &{}\Vert w\Vert ^2\\ \mathrm{s.t.}\quad &{}w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\quad \forall x^j\in B\end{array} \end{aligned}$$
(62)

admits a unique solution \((w^\star ,b^\star )\).

Proof

Let \({{\mathcal {F}}}\) be the feasible set, that is,

$$\begin{aligned} {{\mathcal {F}}}=\{(w,b)\in \mathfrak {R}^n\times \mathfrak {R}:~w^Tx^i+b\ge 1,\forall x^i\in A,~w^Tx^j+b\le -1, \forall x^j\in B\}. \end{aligned}$$

Given any \((w_o,b_o)\in {{\mathcal {F}}}\), let us consider the level set

$$\begin{aligned} {{\mathcal {L}}}_o=\{(w,b)\in {{\mathcal {F}}}:~\Vert w\Vert ^2\le \Vert w_o\Vert ^2\}. \end{aligned}$$

The set \({{\mathcal {L}}}_o\) is closed, and we will show that it is also bounded. To this aim, assume by contradiction that there exists an unbounded sequence \(\{(w_k,b_k)\}\) belonging to \({{\mathcal {L}}}_o\). Since \(\Vert w_k\Vert \le \Vert w_o\Vert ,\forall k\), we must have \(|b_k|\rightarrow \infty \). For any k we can write

$$\begin{aligned} \begin{array}{ll}w_k^Tx^i+b_k&{}\ge 1,\phantom {-}\quad \forall x^i\in A\\ w_k^Tx^j+b_k&{}\le -1,\quad \forall x^j\in B\end{array} \end{aligned}$$

and hence, as \(|b_k|\rightarrow \infty \), the constraints force \(\Vert w_k\Vert \rightarrow \infty \) (if \(b_k\rightarrow -\infty \) the first group of constraints requires \(w_k^Tx^i\rightarrow +\infty \) for every \(x^i\in A\), while if \(b_k\rightarrow +\infty \) the second group requires \(w_k^Tx^j\rightarrow -\infty \) for every \(x^j\in B\)), so that for k sufficiently large we have \(\Vert w_k\Vert ^2 >\Vert w_o\Vert ^2\), and this contradicts the fact that \(\{(w_k,b_k)\}\) belongs to \({{\mathcal {L}}}_o\). Thus \({{\mathcal {L}}}_o\) is a compact set.

Weierstrass’s theorem implies that the function \(\Vert w\Vert ^2\) admits a minimum \((w^\star ,b^\star )\) on \({{\mathcal {L}}}_o\), and hence on \(\mathcal{F}\). As a consequence, \((w^\star ,b^\star )\) is a solution of (62).

In order to prove that \((w^\star ,b^\star )\) is the unique solution, assume by contradiction that there exists a pair \(({\bar{w}},{\bar{b}})\in {{\mathcal {F}}}\), \(({\bar{w}},{\bar{b}})\ne (w^\star ,b^\star )\), such that \(\Vert {\bar{w}}\Vert ^2=\Vert w^\star \Vert ^2\). Suppose first that \({\bar{w}}\ne w^\star \). The set \({{\mathcal {F}}}\) is convex, so that

$$\begin{aligned} \lambda (w^\star ,b^\star )+(1-\lambda )({\bar{w}},{\bar{b}})\in {{\mathcal {F}}},\quad \forall \lambda \in [0,1]. \end{aligned}$$

Since \(\Vert w\Vert ^2\) is a strictly convex function, for any \(\lambda \in (0,1)\) it follows

$$\begin{aligned} \Vert \lambda w^\star +(1-\lambda ){\bar{w}}\Vert ^2<\lambda \Vert w^\star \Vert ^2+(1-\lambda )\Vert {\bar{w}}\Vert ^2. \end{aligned}$$

Taking \(\lambda =1/2\), which corresponds to considering the pair \(({\tilde{w}},{\tilde{b}})\equiv \left( {1\over 2}w^\star +{1\over 2}{\bar{w}},\,{1\over 2}b^\star +{1\over 2}{\bar{b}}\right) \), we have \(({\tilde{w}},{\tilde{b}})\in {{\mathcal {F}}}\) and

$$\begin{aligned} \Vert {\tilde{w}}\Vert ^2<{1\over 2}\Vert w^\star \Vert ^2+{1\over 2}\Vert \bar{w}\Vert ^2=\Vert w^\star \Vert ^2, \end{aligned}$$

and this contradicts the fact that \((w^\star ,b^\star )\) is a global minimum. Therefore, we must have \({\bar{w}}\equiv w^\star \).

Assume \(b^\star >{\bar{b}}\) (the case \(b^\star <{\bar{b}}\) is analogous), and consider the point \({\hat{x}}^i\in A\) such that

$$\begin{aligned} w{^\star }^T{\hat{x}}^i+b^\star =1 \end{aligned}$$

(the existence of such a point follows from (57) of Lemma 2). We have

$$\begin{aligned} 1=w{^\star }^T{\hat{x}}^i+b^\star ={\bar{w}}^T{\hat{x}}^i+b^\star >{\bar{w}}^T{\hat{x}}^i+ {\bar{b}} \end{aligned}$$

and this contradicts the fact that \({\bar{w}}^Tx^i+{\bar{b}}\ge 1,~\forall x^i\in A\). As a consequence, we must have \({\bar{b}}\equiv b^\star \), and hence the uniqueness of the solution is proved. \(\square \)

Proposition 5

Let \((w^\star ,b^\star )\) be the solution of (62). Then, \((w^\star ,b^\star )\) is the unique solution of the following problem

$$\begin{aligned} \begin{array}{ll}\mathrm{max}\quad &{}\rho (w,b)\\ \mathrm{s.t.}\quad &{}w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\quad \forall x^j\in B\end{array} \end{aligned}$$
(63)

Proof

We observe that \((w^\star ,b^\star )\) is the unique solution of the problem

$$\begin{aligned} \begin{array}{ll}\mathrm{max}\quad &{}{1\over {\Vert w\Vert }}\\ \mathrm{s.t.}\quad &{}w^Tx^i+b\ge 1,\phantom {-}\quad \forall x^i\in A\\ &{}w^Tx^j+b\le -1,\quad \forall x^j\in B.\end{array} \end{aligned}$$

Lemmas 1 and 2 imply that, for any separating hyperplane \(H(w,b)\), we have

$$\begin{aligned} {1\over {\Vert w\Vert }}\le \rho (w,b)\le {1\over {\Vert w^\star \Vert }} \end{aligned}$$

and hence, for the separating hyperplane \(H(w^\star ,b^\star )\) we obtain \(\rho (w^\star , b^\star )=\displaystyle {1\over {\Vert w^\star \Vert }}\), which implies that \(H(w^\star ,b^\star )\) is the optimal separating hyperplane. \(\square \)
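
In practice the optimal hyperplane of Propositions 4 and 5 is obtained by handing problem (62) to any convex QP solver. The sketch below uses cvxpy on the made-up data above (cvxpy is just one possible choice and is assumed to be installed); after solving (62) it verifies that \(\rho (w^\star ,b^\star )=1/\Vert w^\star \Vert \), as stated in Proposition 5.

```python
# Sketch: solving problem (62) with an off-the-shelf convex solver (cvxpy),
# on made-up data; any QP solver would do.
import cvxpy as cp
import numpy as np

A = np.array([[2.0, 2.0], [3.0, 1.5]])
B = np.array([[0.0, 0.0], [-1.0, 0.5]])

w = cp.Variable(2)
b = cp.Variable()
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                  [A @ w + b >= 1, B @ w + b <= -1])
prob.solve()

w_star, b_star = w.value, b.value
X = np.vstack([A, B])
rho_star = np.min(np.abs(X @ w_star + b_star)) / np.linalg.norm(w_star)
print(rho_star, 1.0 / np.linalg.norm(w_star))   # equal: the margin is maximized
```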

Appendix B: The Wolfe dual and its properties

Consider the convex problem

$$\begin{aligned} \begin{array}{ll} \min &{} f(x)\\ \text{ s.t. } &{} g(x)\le 0\\ &{} h(x)=0 \end{array} \end{aligned}$$
(64)

with \(f:\mathfrak {R}^n\rightarrow \mathfrak {R}\) convex and continuously differentiable, \(g:\mathfrak {R}^n\rightarrow \mathfrak {R}^m\) convex and continuously differentiable, and \(h:\mathfrak {R}^n\rightarrow \mathfrak {R}^p\) affine. Then its Wolfe dual is

$$\begin{aligned} \begin{array}{ll} \displaystyle \max \limits _{x,\lambda ,\mu } &{} L(x,\lambda ,\mu )\\ \text{ s.t. } &{} \nabla _xL(x,\lambda ,\mu )= 0\\ &{} \lambda \ge 0, \end{array} \end{aligned}$$
(65)

where \(L(x,\lambda ,\mu ) = f(x)+\lambda ^Tg(x)+\mu ^Th(x)\).

Proposition 6

Let \(x^*\) be a global solution of problem (64) with KKT multipliers \((\lambda ^*,\mu ^*)\). Then \((x^*,\lambda ^*,\mu ^*)\) is a solution of problem (65) and there is zero duality gap, i.e., \(f(x^*) = L(x^*,\lambda ^*,\mu ^*)\).

Proof

The point \((x^*,\lambda ^*,\mu ^*)\) is clearly feasible for problem (65) since it satisfies the KKT conditions of problem (64). Furthermore, by complementarity (\((\lambda ^*)^Tg(x^*)=0\)) and feasibility (\(h(x^*)=0\))

$$\begin{aligned} L(x^*,\lambda ^*,\mu ^*) = f(x^*)+(\lambda ^*)^Tg(x^*)+(\mu ^*)^Th(x^*)=f(x^*) \end{aligned}$$

so that there is zero duality gap. Furthermore, for any \(\lambda \ge 0\), \(\mu \in \mathfrak {R}^p\), by the feasibility of \(x^*\), we have

$$\begin{aligned} L(x^*,\lambda ^*,\mu ^*) = f(x^*)\ge f(x^*)+\lambda ^Tg(x^*)+\mu ^Th(x^*) = L(x^*,\lambda ,\mu ). \end{aligned}$$
(66)

By the convexity assumptions on f and g, the nonnegativity of \(\lambda \) and by the linearity of h, we get that \(L(\cdot ,\lambda ,\mu )\) is a convex function in x and hence, for any feasible \((x,\lambda ,\mu )\), we can write

$$\begin{aligned} L(x^*,\lambda ,\mu )\ge L(x,\lambda ,\mu )+\nabla _x L(x,\lambda ,\mu )^T(x^*-x) = L(x,\lambda ,\mu ), \end{aligned}$$
(67)

where the last equality derives from the constraints of problem (65). By combining (66) and (67), we get

$$\begin{aligned} L(x^*,\lambda ^*,\mu ^*)\ge L(x,\lambda ,\mu ) \text{ for } \text{ all } (x,\lambda ,\mu ) \text{ feasible } \text{ for } \text{ problem } (65) \end{aligned}$$

and hence \((x^*,\lambda ^*,\mu ^*)\) is a global solution of problem (65). \(\square \)
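
Proposition 6 can be checked numerically on a small instance: solve a convex program, read the KKT multipliers off the solver, and verify that \(L(x^*,\lambda ^*,\mu ^*)=f(x^*)\). The sketch below does this with cvxpy's dual variables on a made-up convex quadratic problem; it is an illustration under these assumptions, not part of the paper.

```python
# Sketch: zero duality gap of Proposition 6 on a made-up convex QP.
import cvxpy as cp
import numpy as np

Q = np.array([[2.0, 0.0], [0.0, 1.0]])            # f(x) = 1/2 x^T Q x + c^T x
c = np.array([-1.0, -1.0])
G, d = np.array([[1.0, 1.0]]), np.array([1.0])    # g(x) = Gx - d <= 0
E, e = np.array([[1.0, -1.0]]), np.array([0.2])   # h(x) = Ex - e = 0

x = cp.Variable(2)
ineq = G @ x - d <= 0
eq = E @ x - e == 0
cp.Problem(cp.Minimize(0.5 * cp.quad_form(x, Q) + c @ x), [ineq, eq]).solve()

x_s, lam, mu = x.value, ineq.dual_value, eq.dual_value
f_s = 0.5 * x_s @ Q @ x_s + c @ x_s
L_s = f_s + lam @ (G @ x_s - d) + mu @ (E @ x_s - e)
print(f_s, L_s)   # equal up to solver tolerance: zero duality gap
```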

A stronger result can be proved when the primal problem is a convex quadratic programming problem defined by (6).

Proposition 7

Let \(f(x)={{1}\over {2}}x^TQx+c^Tx\), and suppose that the matrix Q is symmetric and positive semidefinite. Let \(({\bar{x}},{\bar{\lambda }})\) be a solution of Wolfe’s dual (7). Then, there exists a vector \(x^\star \) (not necessarily equal to \({\bar{x}}\)) such that

  1. (i)

    \(Q(x^\star -{\bar{x}})=0\);

  2. (ii)

    \(x^\star \) is a solution of problem (6); and

  3. (iii)

    \(x^\star \) is a global minimum of (6) with associated multipliers \({\bar{\lambda }}\).

Proof

First, we show that in this case problem (7) is a convex quadratic programming problem. In the quadratic case, problem (7) becomes:

$$\begin{aligned} \max \limits _{x,\lambda }&\frac{1}{2} x^TQx+c^Tx+\lambda ^T(Ax-b) \end{aligned}$$
(68)
$$\begin{aligned}&Qx+c+A^T\lambda =0 \end{aligned}$$
(69)
$$\begin{aligned}&\lambda \ge 0. \end{aligned}$$
(70)

Multiplying the constraints (69) by \(x^T\) we get

$$\begin{aligned} x^TQx+c^Tx+x^TA^T\lambda = 0, \end{aligned}$$

which implies that the objective function (68) can be rewritten as

$$\begin{aligned} \max -\frac{1}{2} x^TQx-\lambda ^Tb = -\min \frac{1}{2} x^TQx+\lambda ^Tb , \end{aligned}$$

which shows that problem (68)–(70) is actually a convex quadratic optimization problem. For this problem the KKT conditions are necessary and sufficient for global optimality. Denoting by v the multipliers of the equality constraints (69) and by z the multipliers of the constraints (70), there must exist multipliers v and z such that the following conditions hold:

$$\begin{aligned}&Q{\bar{x}}-Qv=0 \end{aligned}$$
(71)
$$\begin{aligned}&b-Av-z =0\end{aligned}$$
(72)
$$\begin{aligned}&z^T{\bar{\lambda }} = 0\end{aligned}$$
(73)
$$\begin{aligned}&z\ge 0\end{aligned}$$
(74)
$$\begin{aligned}&Q{\bar{x}}+c+A^T{\bar{\lambda }} =0\end{aligned}$$
(75)
$$\begin{aligned}&{\bar{\lambda }}\ge 0. \end{aligned}$$
(76)

The expression of z can be derived from constraint (72) and substituted into (73) and (74), which implies:

$$\begin{aligned}&Av- b \le 0 \end{aligned}$$
(77)
$$\begin{aligned}&{\bar{\lambda }}^T(Av-b) = 0. \end{aligned}$$
(78)

Furthermore by subtracting (71) from (75), we get

$$\begin{aligned} Qv+c+A^T{\bar{\lambda }} =0. \end{aligned}$$
(79)

By combining (79), (78), (77) and (76) we get that the pair \((v,{\bar{\lambda }})\) satisfies the KKT conditions of problem (6); hence, setting \(x^\star =v\), we obtain the thesis, taking into account that point (i) follows from (71). \(\square \)
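
For a concrete check of Proposition 7, one can solve the Wolfe dual (68)–(70) directly (after the substitution above it is a concave quadratic problem) and compare its solution with a direct solution of the primal. The sketch below does this with cvxpy on a made-up problem in which \(Q\) is positive definite, so that point (i) forces \(x^\star ={\bar{x}}\); it is a hedged illustration, not the authors' implementation.

```python
# Sketch: Wolfe dual (68)-(70) of a made-up strictly convex QP
#   min 1/2 x^T Q x + c^T x   s.t.   A x <= b.
import cvxpy as cp
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])    # symmetric positive definite
c = np.array([-1.0, -2.0])
Am = np.array([[1.0, 1.0], [-1.0, 0.0]])
bm = np.array([1.0, 0.0])

# Wolfe dual in the reduced form derived above: max -1/2 x^T Q x - b^T lambda
x, lam = cp.Variable(2), cp.Variable(2)
dual = cp.Problem(cp.Maximize(-0.5 * cp.quad_form(x, Q) - bm @ lam),
                  [Q @ x + c + Am.T @ lam == 0, lam >= 0])
dual.solve()

# direct solution of the primal QP for comparison
xp = cp.Variable(2)
primal = cp.Problem(cp.Minimize(0.5 * cp.quad_form(xp, Q) + c @ xp), [Am @ xp <= bm])
primal.solve()

print(x.value, xp.value)          # coincide, since Q is nonsingular (point (i))
print(dual.value, primal.value)   # equal optimal values: zero duality gap
```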

Appendix C: Kernel characterization

Proposition 8

Let \(K:X\times X\rightarrow \mathfrak {R}\) be a symmetric function. Function K is a kernel if and only if the \(l\times l\) matrix

$$\begin{aligned} \left( K(x^i,x^j)\right) _{i,j=1}^l= \left( \begin{array}{ccc} K(x^1,x^1)&{}\ldots &{}K(x^1,x^l)\\ &{}\vdots &{}\\ K(x^l,x^1)&{}\ldots &{}K(x^l,x^l) \end{array} \right) \end{aligned}$$

is positive semidefinite for any set of training vectors \(\{x^1,\ldots ,x^l\}\).

Proof

  • necessity. Symmetry of the matrix derives from the symmetry of the function K. To prove positive semidefiniteness we look at the quadratic form, for any \(v\in \mathfrak {R}^l\):

    $$\begin{aligned} v^TKv&=\displaystyle \sum \limits _{i=1}^l\sum \limits _{j=1}^lv_iv_jK(x^i,x^j) =\sum _{i=1}^l\sum \limits _{j=1}^lv_iv_j\langle \phi (x^i),\phi (x^j)\rangle \\&=\left\langle \sum \limits _{i=1}^lv_i \phi (x^i), \sum _{j=1}^lv_j\phi (x^j)\right\rangle \\&=\displaystyle \langle z,z\rangle \ge 0, \end{aligned}$$
    where \(z=\sum \limits _{i=1}^lv_i\phi (x^i)\);
  • sufficiency. Assume

    $$\begin{aligned} \left( \begin{array}{ccc} K(x^1,x^1)&{}\ldots &{}K(x^1,x^l)\\ &{}\vdots &{}\\ K(x^l,x^1)&{}\ldots &{}K(x^l,x^l) \end{array} \right) \succeq 0 \end{aligned}$$
    (80)

    We need to prove that there exists a linear space \({{\mathcal {H}}}\), a function \(\phi :X\rightarrow {{\mathcal {H}}}\) and a scalar product \(\langle \cdot ,\cdot \rangle \) defined on \({{\mathcal {H}}}\) such that \(K(x,y) = \langle \phi (x),\phi (y)\rangle \) for all \(x,y\in X\). Consider the linear space

    $$\begin{aligned} \displaystyle {{\mathcal {H}}} = lin\left\{ K(\cdot ,y)\,:\, y\in X\right\} \end{aligned}$$

    with the generic element \(f(\cdot )\)

    $$\begin{aligned} \displaystyle f = \sum \limits _{i=1}^m\alpha _iK(\cdot ,x^i) \end{aligned}$$

    for any \(m\in N\), with \(\alpha _i\in \mathfrak {R}\) for \(i=1,\ldots ,m\). Given two elements \(f,g\in {{\mathcal {H}}}\), with \(g(\cdot ) = \sum _{j=1}^{m^\prime }\beta _jK(\cdot ,x^j)\), define the function \(\rho :{{\mathcal {H}}}\times {{\mathcal {H}}}\rightarrow \mathfrak {R}\) as

    $$\begin{aligned} \rho (f,g)=\sum \limits _{i=1}^m\sum \limits _{j=1}^{m^\prime }\alpha _i\beta _j K(x^i,x^j) \end{aligned}$$

    It can be shown that the function \(\rho \) is a scalar product in the space \({{\mathcal {H}}}\), by showing that the following properties hold:

    1. (i)

      \(\rho (f,g) = \rho (g,f)\)

    2. (ii)

      \(\rho (f^1+f^2,g) = \rho (f^1,g)+\rho (f^2,g)\)

    3. (iii)

      \(\rho (\lambda f,g) = \lambda \rho (f,g) \)

    4. (iv)

      \(\rho (f,f)\ge 0\) and \(\rho (f,f)=0\) implies \(f=0\)

    The first three properties are a consequence of the definition of \(\rho \) and can be easily verified. We need to show property (iv). First, we observe that, given \(f^1,\ldots ,f^p\) in \({{\mathcal {H}}}\), the matrix with elements \(\rho _{st} = \rho (f^s,f^t)\) is symmetric (thanks to property (i)) and positive semidefinite. Indeed,

    $$\begin{aligned} \displaystyle \sum \limits _{i=1}^p\sum \limits _{j=1}^p\gamma _i\gamma _j\rho _{ij} =\sum _{i=1}^p\sum \limits _{j=1}^p\gamma _i\gamma _j\rho (f^i,f^j) =\rho \left( \sum \limits _{i=1}^p\gamma _if^i,\sum _{j=1}^p\gamma _jf^j\right) \ge 0 \end{aligned}$$

    This implies in turn that all principal minors have nonnegative determinant. Consider any \(2\times 2\) principal minor, with elements \(\rho _{ij} = \rho (f^i,f^j)\). The nonnegativity of the determinant and the symmetry of the matrix imply

    $$\begin{aligned}&\rho (f^i,f^i)\rho (f^j,f^j)- \rho (f^i,f^j) \rho (f^j,f^i )\\&=\rho (f^i,f^i)\rho (f^j,f^j)- \rho (f^i,f^j)^2\ge 0 \end{aligned}$$

    so that

    $$\begin{aligned} \rho (f^i,f^j)^2\le \rho (f^i,f^i)\rho (f^j,f^j) \end{aligned}$$
    (81)

    We note that, setting \(m^\prime =1\) and \(g(\cdot ) = K(\cdot ,x)\), f(x) can be written as

    $$\begin{aligned} f(x) = \displaystyle \sum \limits _{i=1}^m\alpha _iK(x,x^i) = \rho (K(\cdot ,x),f) \end{aligned}$$

    with \(K(\cdot ,x)\in {{\mathcal {H}}}\). Furthermore, for any \(x,y \in X\), we get

    $$\begin{aligned} \rho (K(\cdot ,x),K(\cdot ,y)) = K(x,y) \end{aligned}$$

    Using (81) with \(f^i = K(\cdot ,x)\) and \(f^j = f\) we get

    $$\begin{aligned} f(x)^2 = \rho (K(\cdot ,x),f)^2\le \rho (K(\cdot ,x),K(\cdot ,x))\,\rho (f,f) = K(x,x)\,\rho (f,f) \end{aligned}$$

    which, together with (80), implies both that \(\rho (f,f)\ge 0\) and that, if \(\rho (f,f)=0\), then \(f(x)^2\le 0\) for all \(x\in X\), and hence \(f=0\). \(\square \)
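
Proposition 8 is easy to probe numerically for a specific kernel: build the Gram matrix on a sample of points and check that its smallest eigenvalue is nonnegative up to rounding. The sketch below does this for the Gaussian kernel \(K(x,y)=\exp (-\gamma \Vert x-y\Vert ^2)\) on random made-up data; of course it verifies the condition on one sample only, whereas the proposition requires it for every finite set of points.

```python
# Sketch: checking positive semidefiniteness of a Gaussian-kernel Gram matrix
# on random made-up data (one sample only; Proposition 8 asks this for all samples).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))   # l = 30 points in R^5
gamma = 0.5

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)      # Gram matrix (K(x^i, x^j))_{i,j=1}^l

print(np.linalg.eigvalsh(K).min())  # >= 0 up to rounding: K is positive semidefinite
```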


Cite this article

Piccialli, V., Sciandrone, M. Nonlinear optimization and support vector machines. 4OR-Q J Oper Res 16, 111–149 (2018). https://doi.org/10.1007/s10288-018-0378-2
