
Support Vector Machine

Abstract

The support vector machine is a method for classification and regression that draws an optimal boundary in the space of covariates (of dimension p) when the samples \((x_1, y_1), \ldots , (x_N, y_N)\) are given. It maximizes the minimum, over \(i = 1, \ldots , N\), of the distances between \(x_i\) and the boundary. By softening the notion of a margin, the method generalizes to samples that cannot be separated by a surface. Additionally, by using a general kernel in place of the inner product, we can formulate the problem mathematically and obtain the optimal solution even when the boundary is not a surface. In this chapter, we consider only the two-class case and focus on the core part. Although omitted here, the theory of support vector machines also applies to regression and to classification with more than two classes.

Figs. 9.1–9.6 (not reproduced here)

Notes

  1. We say that v is an upper bound of a set \(S\subseteq {\mathbb R}\) if \(u\le v\) for every \(u\in S\), and we call the minimum of the upper bounds of S the supremum of S, which we write as \(\sup S\). For example, the maximum does not exist for \(S=\{x\in {\mathbb R}|0\le x<1\}\), but \(\sup S=1\). Similarly, we define the lower bounds of S and their maximum (the infimum of S), which we write as \(\inf S\).


Appendix: Proof of Propositions


Proposition 23

The distance between a point \((x,y)\in {\mathbb R}^2\) and a line \(l: aX+bY+c=0\), \(a,b\in {\mathbb R}\) is given by

$$\displaystyle \frac{|ax+by+c|}{\sqrt{a^2+b^2}}\ .$$

Proof

Let \((x_0,y_0)\) be the foot of the perpendicular from \((x,y)\) to l. The normal \(l'\) of l through \((x,y)\) can be written as

$$l': \frac{X-x_0}{a}=\frac{Y-y_0}{b}=t$$

for some t (Fig. 9.6). Since \((x_0,y_0)\) and \((x,y)\) are on l and \(l'\), respectively, we have

$$\left\{ \begin{array}{l} \displaystyle ax_0+by_0+c=0\\ \displaystyle \frac{x-x_0}{a}=\frac{y-y_0}{b}=t \end{array} \right. \ .$$

Eliminating \((x_0,y_0)\) via \(x_0=x-at\) and \(y_0=y-bt\) gives \(a(x-at)+b(y-bt)+c=0\), so \(t=(ax+by+c)/(a^2+b^2)\). Thus, the distance is

$$\sqrt{(x-x_0)^2+(y-y_0)^2}=\sqrt{(a^2+b^2)t^2}=\frac{|ax+by+c|}{\sqrt{a^2+b^2}}\ .$$
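The formula can also be checked numerically. The following R snippet is an illustrative sketch with an arbitrarily chosen line and point; it compares the distance computed via the perpendicular foot with the closed-form expression.

    # Sketch: numerical check of the point-to-line distance formula
    a <- 3; b <- -4; c <- 12                   # line 3X - 4Y + 12 = 0
    x <- 2; y <- 5                             # an arbitrary point
    t <- (a * x + b * y + c) / (a^2 + b^2)
    x0 <- x - a * t; y0 <- y - b * t           # foot of the perpendicular
    sqrt((x - x0)^2 + (y - y0)^2)              # 0.4
    abs(a * x + b * y + c) / sqrt(a^2 + b^2)   # 0.4, the same value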

Proposition 25

$$\left\{ \begin{array}{lll} \alpha _i=0&{}\Longleftarrow &{}y_i(\beta _0+x_i\beta )> 1\\ 0<\alpha _i<C&{}\Longrightarrow &{}y_i(\beta _0+x_i\beta )=1\\ \alpha _i=C&{}\Longleftarrow &{}y_i(\beta _0+x_i\beta )< 1 \end{array} \right. $$

Proof

When \(\alpha _i=0\), applying (9.20), (9.17) and (9.14) in this order, we have

$$\alpha _i=0 \Longrightarrow \mu _i=C>0 \Longrightarrow \epsilon _i=0 \Longrightarrow y_i(\beta _0+x_i\beta )\ge 1\ .$$

When \(0<\alpha _i<C\), from (9.17), (9.20), we have \(\epsilon _i=0\). Moreover, applying (9.16), we have

$$0<\alpha _i<C \Longrightarrow y_i(\beta _0+x_i\beta )-(1-\epsilon _i)=0 \Longrightarrow y_i(\beta _0+x_i\beta )=1\ .$$

When \(\alpha _i=C\), from (9.15), we have \(\epsilon _i\ge 0\). Moreover, applying (9.16), we have

$$\alpha _i=C \Longrightarrow y_i(\beta _0+x_i\beta )-(1-\epsilon _i)=0 \Longrightarrow y_i(\beta _0+x_i\beta )\le 1\ .$$

Furthermore, from (9.16), we have \(y_i(\beta _0+x_i\beta )> 1 \Longrightarrow \alpha _i=0\). On the other hand, applying (9.14), (9.17) and (9.20) in this order, we have

$$y_i(\beta _0+x_i\beta )< 1 \Longrightarrow \epsilon _i>0 \Longrightarrow \mu _i=0 \Longrightarrow \alpha _i=C\ .$$

Exercises 75–87

We define the distance between a point \((u,v)\in {\mathbb R}^2\) and a line \(aU+bV+c=0,\ a,b\in {\mathbb R}\) by

$$\displaystyle \frac{|au+bv+c|}{\sqrt{a^2+b^2}}\ .$$

For \(\beta _0\in {\mathbb R}\) and \(\beta \in {\mathbb R}^p\) with \(\Vert \beta \Vert _2=1\), when the samples \((x_1,y_1),\ldots ,(x_N,y_N)\in {\mathbb R}^p\times \{-1,1\}\) satisfy the separability condition \(y_1(\beta _0+x_1\beta ),\ldots ,y_N(\beta _0+x_N\beta )\ge 0\), the support vector machine is formulated as the problem of finding the \((\beta _0,\beta )\) that maximizes \(M:=\min _i y_i(\beta _0+x_i\beta )\), the minimum over i of the distances between the \(x_i\) (row vectors) and the surface \(\beta _0+X\beta =0\).

  1. 75.

    We extend the support vector machine problem to that of finding \((\beta _0,\beta )\in {\mathbb R}\times {\mathbb R}^p\) and \(\epsilon _i\ge 0\), \(i=1,\ldots ,N\), that maximize M, for a given constant \(\gamma \ge 0\), under the constraints \(M\ge 0\), \(\displaystyle \sum _{i=1}^N\epsilon _i\le \gamma \), and

    $$y_i(\beta _0+x_i\beta )\ge M(1-\epsilon _i) ,\ i=1,\ldots ,N\ .$$
    1. (a)

      What can we say about the locations of the samples \((x_i,y_i)\) when \(\epsilon _i=0\), \(0<\epsilon _i<1\), \(\epsilon _i=1\), and \(1<\epsilon _i\)?

    2. (b)

      Suppose that, for any \(\beta _0\) and \(\beta \), we have \(y_i(\beta _0+x_i\beta )<0\) for at least r samples i. Show that if \(\gamma \le r\), then no solution exists. Hint: \(\epsilon _i>1\) for such an i.

    3. (c)

      The larger \(\gamma \) is, the larger M becomes. Why?

  2. 76.

    We wish to obtain \(\beta \in {\mathbb R}^p\) that minimizes \(f_0(\beta )\) under \(f_j(\beta )\le 0\), \(j=1,\ldots ,m\). If such a solution exists, we denote the minimum value by \(f^*\). Consider the following two relations

    $$\begin{aligned} \sup _{\alpha \ge 0}L(\alpha ,\beta )= & {} \left\{ \begin{array}{l@{\quad }l} f_0(\beta ),&{}f_j(\beta )\le 0\ ,\ j=1,\ldots ,m\\ +\infty &{}\mathrm{Otherwise}, \end{array} \right. \end{aligned}$$
    (9.25)
    $$\begin{aligned} f^*:= & {} \inf _\beta \sup _{\alpha \ge 0}L(\alpha ,\beta )\ge \sup _{\alpha \ge 0}\inf _\beta L(\alpha ,\beta ), \end{aligned}$$
    (9.26)

    under

    $$L(\alpha ,\beta ):=f_0(\beta )+\sum _{j=1}^m \alpha _jf_j(\beta )$$

    for \(\alpha =(\alpha _1,\ldots ,\alpha _m)\in {\mathbb R}^m\). Moreover, suppose \(p=2\) and \(m=1\). For

    $$\begin{aligned} L(\alpha ,\beta ):=\beta _1+\beta _2+\alpha (\beta _1^2+\beta _2^2-1)\ , \end{aligned}$$
    (9.27)

    show that the equality holds in the inequality (9.26).

  3. 77.

    Suppose that \(f_0,f_1,\ldots ,f_m: {\mathbb R}^p\rightarrow {\mathbb R}\) are convex and differentiable at \(\beta =\beta ^*\). It is known that \(\beta ^*\in {\mathbb R}^p\) is an optimal solution of \(\displaystyle \min \{f_0(\beta )\mid f_i(\beta )\le 0,\ i=1,\ldots ,m\}\) if and only if there exist \(\alpha _i\ge 0\), \(i=1,\ldots ,m\) such that

    $$\begin{aligned} f_i(\beta ^*)\le 0 ,\ i=1,\ldots ,m \end{aligned}$$
    (9.28)

    and the following two conditions (the KKT conditions) are met:

    $$\begin{aligned}&\alpha _if_i(\beta ^*)=0 ,\ i=1,\ldots ,m \end{aligned}$$
    (9.29)
    $$\begin{aligned}&\displaystyle \nabla f_0(\beta ^*)+\sum _{i=1}^m \alpha _i\nabla f_i(\beta ^*)=0 . \end{aligned}$$
    (9.30)

    In this problem, we consider the sufficiency.

    1. (a)

      If \(f: {\mathbb R}^p\rightarrow {\mathbb R}\) is convex and differentiable at \(x=x_0\in {\mathbb R}^p\), then

      $$\begin{aligned} f(x)\ge f(x_0)+\nabla f(x_0)^T(x-x_0) \end{aligned}$$
      (9.31)

      for each \(x\in {\mathbb R}^p\). From this fact, show that \(f_0(\beta ^*)\le f_0(\beta )\) for arbitrary \(\beta \in {\mathbb R}^p\) that satisfies (9.28). Hint: Use (9.29) and (9.30) once, (9.31) twice, and \(f_1(\beta )\le 0,\ldots ,f_m(\beta )\le 0\) once.

    2. (b)

      For (9.27), find the conditions that correspond to (9.28)–(9.30).

  4. 78.

    If we remove the condition \(\Vert \beta \Vert _2=1\) in Problem 9.4 and regard \(\beta _0/M,\beta /M\) as \(\beta _0\) and \(\beta \), then the problem reduces to finding \(\beta _0,\beta ,\epsilon _i\), \(i=1,\ldots ,N\) that minimize

    $$\begin{aligned} L_P:=\frac{1}{2}\Vert \beta \Vert ^2_2+C\sum _{i=1}^N\epsilon _i-\sum _{i=1}^N\alpha _i\{ y_i(\beta _0+x_i\beta )-(1-\epsilon _i)\}-\sum _{i=1}^N\mu _i\epsilon _i\ , \end{aligned}$$
    (9.32)

    where \(C>0\) is the cost parameter, the last two terms incorporate the constraints, and \(\alpha _i,\mu _i\ge 0\), \(i=1,\ldots ,N\) are the Lagrange multipliers. Show that the KKT conditions (9.28)–(9.30) are the following:

    $$\begin{aligned}&\displaystyle \sum _{i=1}^N\alpha _iy_i=0 \end{aligned}$$
    (9.33)
    $$\begin{aligned}&\displaystyle \beta =\sum _{i=1}^N\alpha _iy_ix_i\in {\mathbb R}^p \end{aligned}$$
    (9.34)
    $$\begin{aligned}&C-\alpha _i-\mu _i=0 \end{aligned}$$
    (9.35)
    $$\begin{aligned}&\alpha _i[y_i(\beta _0+x_i\beta )-(1-\epsilon _i)]=0 \end{aligned}$$
    (9.36)
    $$\begin{aligned}&\mu _i\epsilon _i=0 \end{aligned}$$
    (9.37)
    $$\begin{aligned}&y_i(\beta _0+x_i\beta )-(1-\epsilon _i)\ge 0 \end{aligned}$$
    (9.38)
    $$\begin{aligned}&\epsilon _i\ge 0 \end{aligned}$$
    (9.39)
  5. 79.

    Show that the dual problem of \(L_P\) in (9.32) is given by

    $$\begin{aligned} L_D:=\sum _{i=1}^N\alpha _i-\frac{1}{2}\sum _{i=1}^N\sum _{j=1}^N\alpha _i\alpha _jy_iy_jx_i^Tx_j\ , \end{aligned}$$
    (9.40)

    where \(\alpha \) ranges over (9.33) and

    $$\begin{aligned} 0\le \alpha _i\le C\ . \end{aligned}$$
    (9.41)

    Moreover, how is \(\beta \) obtained from such an \(\alpha \)?

  6. 80.

    Show the following:

    $$\left\{ \begin{array}{lll} \alpha _i=0&{}\Longleftarrow &{}y_i(\beta _0+x_i\beta )> 1\\ 0<\alpha _i<C&{}\Longrightarrow &{}y_i(\beta _0+x_i\beta )=1\\ \alpha _i=C&{}\Longleftarrow &{}y_i(\beta _0+x_i\beta )< 1 \end{array} \right. \ . $$
  7. 81.

    We wish to obtain the value of \(\beta _0\) from \(y_i(\beta _0+x_i\beta )=1\), which holds for at least one i.

    1. (a)

      Show that \(\alpha _1=\cdots =\alpha _N=0\) and \(y_i(\beta _0+x_i\beta )=1\) imply \(\beta _0=y_i, i=1, \ldots , N\).

    2. (b)

      Suppose that (\(\alpha _i=0\) or \(\alpha _i=C\)) and \(y_i(\beta _0+x_i\beta )\not =1\) for each i, and let \(\epsilon _*:=\min _i \epsilon _i\). Show that \(L_P\) decreases when we replace \(\epsilon _i\) by \(\epsilon _i-\epsilon _*\) for each i and \(\beta _0\) by \(\beta _0+y_i\epsilon _*\), which means that no optimal solution can be obtained under this assumption. Hint: \(y_i=\pm 1\Longleftrightarrow y_i^2=1\)

    3. (c)

      Show that \(y_i(\beta _0+x_i\beta )=1\) for at least one i.

  8. 82.

    In order to input the dual problem (9.40), (9.33), and (9.41) into a quadratic programming solver, we specify \(D_{ mat}\in {\mathbb R}^{N\times N}\), \(A_{ mat}\in {\mathbb R}^{m\times N}\), \(d_{ vec}\in {\mathbb R}^N\), and \(b_{ vec}\in {\mathbb R}^m\) (\(m\ge 1\)) such that

    $$L_D=-\frac{1}{2}\alpha ^TD_{ mat}\alpha +d_{ vec}^T\alpha $$
    $$A_{ mat}\alpha \ge b_{ vec}\ ,$$

    where the first meq of the m constraints \(A_{ mat}\alpha \ge b_{ vec}\), \(\alpha \in {\mathbb R}^N\), are equalities and the last \(m-meq\) are inequalities. If we define

    $$b_{ vec}:=[0,-C,\ldots ,-C,0,\ldots ,0]^T,$$

    what are \(D_{ mat}\), \(A_{ mat}\), \(d_{ vec}\), meq? Moreover, fill in the blanks below and execute the result.

    (figure k: R code listing with blanks, not shown)
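    As a reference for that listing, the following is a hedged sketch of how such a function (called svm.1 in Problem 85) might be written with solve.QP from the quadprog package; it is not the book's code. The small diagonal term eps, added so that \(D_{ mat}\) is positive definite, and the toy data are assumptions made for illustration.

    library(quadprog)

    svm.1 <- function(X, y, C) {
      eps <- 1e-4                                # ridge so that Dmat is positive definite
      N <- nrow(X)
      Dmat <- outer(y, y) * (X %*% t(X)) + eps * diag(N)
      dvec <- rep(1, N)
      # Constraints A_mat alpha >= b_vec: first the equality sum_i y_i alpha_i = 0,
      # then N rows -alpha_i >= -C, then N rows alpha_i >= 0
      Amat <- t(rbind(y, -diag(N), diag(N)))     # solve.QP expects the transpose of A_mat
      bvec <- c(0, rep(-C, N), rep(0, N))
      alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
      beta <- colSums(alpha * y * X)             # beta = sum_i alpha_i y_i x_i
      index <- which(alpha > eps & alpha < C - eps)
      beta.0 <- mean(y[index] - X[index, , drop = FALSE] %*% beta)
      list(alpha = alpha, beta = beta, beta.0 = beta.0)
    }

    # Toy usage (assumed data: two roughly separable clusters)
    set.seed(1)
    X <- rbind(matrix(rnorm(40), 20, 2) + 2, matrix(rnorm(40), 20, 2) - 2)
    y <- c(rep(1, 20), rep(-1, 20))
    fit <- svm.1(X, y, C = 10)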
  9. 83.

    Let V be a vector space. We define a kernel \(K(x,y)\) w.r.t. \(\phi : {\mathbb R}^p\rightarrow V\) as the inner product of \(\phi (x)\) and \(\phi (y)\) for \((x,y)\in {\mathbb R}^p\times {\mathbb R}^p\). For example, for the degree-d polynomial kernel \(K(x,y)=(1+x^Ty)^d\), if \(d=1\) and \(p=2\), then the mapping is

    $$((x_1,x_2), (y_1,y_2))\mapsto 1\cdot 1+x_1y_1+x_2y_2=(1,x_1,x_2)^T(1,y_1,y_2)\ .$$

    In this case, we regard the map \(\phi \) as \((x_1,x_2)\mapsto (1,x_1,x_2)\). What is \(\phi \) for \(p=2\), \(d=2\)? Write an R function K.poly(x,y) that realizes the polynomial kernel with \(d=2\).
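    For the second part, one possible definition (a sketch; the name K.poly comes from the problem statement) is:

    # Sketch: degree-2 polynomial kernel K(x, y) = (1 + x^T y)^2
    K.poly <- function(x, y) (1 + sum(x * y))^2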

  10. 84.

    Let V be a vector space over \(\mathbb R\).

    1. (a)

      Suppose that V is the set of continuous functions on [0, 1]. Show that \(\displaystyle \int _0^1 f(x)g(x)dx\), \(f,g\in V\), defines an inner product on V.

    2. (b)

      For vector space \(V:={\mathbb R}^p\), show that \((1+x^Ty)^2\), \(x,y \in {\mathbb R}^p\) is not an inner product of V.

    3. (c)

      Write an R function K.linear(x,y) for the standard inner product.

    Hint: Check the definition of an inner product: for \(a,b,c\in V\) and \(\alpha \in {\mathbb R}\), \(\langle a+b,c\rangle =\langle a,c\rangle +\langle b,c\rangle \); \(\langle a,b\rangle =\langle b,a\rangle \); \(\langle \alpha a,b\rangle =\alpha \langle a,b\rangle \); \(\langle a,a\rangle =\Vert a\Vert ^2\ge 0\).
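    For part (c), a minimal sketch of the requested function (the name K.linear is from the problem statement) is:

    # Sketch: the standard inner product on R^p used as a kernel
    K.linear <- function(x, y) sum(x * y)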

  11. 85.

    In the following, using \(\phi : {\mathbb R}^p\rightarrow V\), we replace \(x_i\in {\mathbb R}^p\), \(i=1,\ldots ,N\), with \(\phi (x_i)\in V\). Thus, \(\beta \) is expressed as \(\beta =\sum _{i=1}^N\alpha _iy_i\phi (x_i)\in V\), and the inner product \(\langle x_i,x_j\rangle \) in \(L_D\) is replaced by the inner product of \(\phi (x_i)\) and \(\phi (x_j)\), i.e., \(K(x_i,x_j)\). Once we extend the vector space, the boundary \(\phi (X)\beta +\beta _0=0\), i.e., \(\sum _{i=1}^N\alpha _iy_iK(X,x_i)+\beta _0=0\), is not necessarily a surface. Modify the function svm.1 of Problem 82 as follows.

    1. (a)

      Add argument K to the definition.

    2. (b)

      Replace sum(X[,i]*X[,j]) with K(X[i,],X[j,]).

    3. (c)

      Replace beta in return with alpha.

    Then, execute the function svm.2 by filling in the blanks.

    (figure l: R code listing with blanks, not shown)
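    As a reference, a hedged sketch of what the modified function svm.2 might look like is given below. It follows the svm.1 sketch of Problem 82, builds the Gram matrix from an arbitrary kernel K, and returns alpha instead of beta; the eps ridge is again an assumption, not the book's code.

    library(quadprog)

    svm.2 <- function(X, y, C, K) {
      eps <- 1e-4
      N <- nrow(X)
      Kmat <- matrix(0, N, N)                    # Gram matrix K(x_i, x_j)
      for (i in 1:N) for (j in 1:N) Kmat[i, j] <- K(X[i, ], X[j, ])
      Dmat <- outer(y, y) * Kmat + eps * diag(N)
      dvec <- rep(1, N)
      Amat <- t(rbind(y, -diag(N), diag(N)))
      bvec <- c(0, rep(-C, N), rep(0, N))
      alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
      index <- which(alpha > eps & alpha < C - eps)
      # beta_0 from a sample with 0 < alpha_i < C: y_i (beta_0 + sum_j alpha_j y_j K(x_j, x_i)) = 1
      beta.0 <- mean(sapply(index, function(i) y[i] - sum(alpha * y * Kmat[, i])))
      list(alpha = alpha, beta.0 = beta.0)
    }

    # The boundary is then sum_i alpha_i y_i K(x, x_i) + beta_0 = 0, e.g., with K = K.poly.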
  12. 86.
    1. (a)

      Execute the support vector machine with \(\gamma =1\) and \(C=100\).

      (figure m: R code listing with blanks, not shown)
    2. (b)

      Use the tune command to find the optimal C and \(\gamma \) over \(C=0.1,1,10,100,1000\) and \(\gamma = 0.5,1,2,3,4\) via cross-validation.

      (figure n: R code listing with blanks, not shown)
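    A hedged sketch of how parts (a) and (b) might be run with the e1071 package follows; the data frame df with a factor response y is an assumption made for illustration, not the book's data.

    library(e1071)

    # (a) radial-kernel support vector machine with gamma = 1 and cost C = 100
    fit <- svm(y ~ ., data = df, kernel = "radial", gamma = 1, cost = 100)
    summary(fit)

    # (b) cross-validation over C and gamma via tune
    tuned <- tune(svm, y ~ ., data = df, kernel = "radial",
                  ranges = list(cost = c(0.1, 1, 10, 100, 1000),
                                gamma = c(0.5, 1, 2, 3, 4)))
    summary(tuned)
    tuned$best.parameters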
  13. 87.

    A support vector machine works even when more than two classes exist. In fact, the package e1071 runs even if we give no information about the number of classes. Fill in the blanks and execute it.

    (figure o: R code listing with blanks, not shown)
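    A hedged sketch using the three-class iris data (an assumed example, not the book's listing):

    library(e1071)

    # svm handles more than two classes automatically (one-versus-one internally)
    fit <- svm(Species ~ ., data = iris, kernel = "radial", gamma = 1, cost = 10)
    table(predicted = predict(fit, iris), true = iris$Species)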


Copyright information

© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Suzuki, J. (2020). Support Vector Machine. In: Statistical Learning with Math and R. Springer, Singapore. https://doi.org/10.1007/978-981-15-7568-6_9


  • DOI: https://doi.org/10.1007/978-981-15-7568-6_9


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-7567-9

  • Online ISBN: 978-981-15-7568-6
