Global resolution of the support vector machine regression parameters selection problem with LPCC

Original Paper
EURO Journal on Computational Optimization

Abstract

Support vector machine regression is a robust data-fitting method that minimizes the sum of the residuals in excess of the insensitive tube, and is therefore less sensitive to changes in the data near the regression hyperplane. Two design parameters, the insensitive tube size (\(\varepsilon _\mathrm{e}\)) and the weight (\(C_\mathrm{e}\)) assigned to the regression error, which trades it off against the norm of the support vector, are selected by the user to obtain better forecasts. The global training and validation parameter selection procedure for support vector machine regression can be formulated as a bi-level optimization model, which is equivalently reformulated as a linear program with linear complementarity constraints (LPCC). We propose a rectangle search global optimization algorithm to solve this LPCC. The algorithm exhausts the invariancy regions on the parameter plane (the \((C_\mathrm{e},\varepsilon _\mathrm{e})\)-plane) without explicitly identifying the edges of the regions. The algorithm is tested on synthetic and real-world support vector machine regression problems with up to hundreds of data points, and its efficiency is compared with that of several other approaches. The global optimal parameter obtained in this way provides an important benchmark for any other selection of the parameters.


Notes

  1. The first-order subscript indicates the role of the mathematical expression: subscript d, data; subscript s, support vector machine regression; subscript e, exogenous parameter to the support vector machine regression.

  2. By exhausting all the invariancy regions, we mean that for every invariancy region, at least one \((C_\mathrm{e},\varepsilon _\mathrm{e})\) point within the region is chosen and the associated restricted linear program is solved.

  3. According to the numerical experiments reported in Ferris and Munson (2004), the semismooth method outperforms the interior-point method (Ferris and Munson 2002), specifically in solving large-scale SVM classification problems.

  4. Lower-hyperplane: \({y_\mathrm{d}}_j = ({\mathbf{x}_\mathbf{d}^\mathbf{j}})^T{\mathbf{w}_\mathbf{s}}^f+b_\mathrm{s}^f-\varepsilon _\mathrm{e}\).

  5. Upper-hyperplane: \({y_\mathrm{d}}_j = ({\mathbf{x}_\mathbf{d}^\mathbf{j}})^T{\mathbf{w}_\mathbf{s}}^f+b_\mathrm{s}^f+\varepsilon _\mathrm{e}\).

  6. A rectangle is defined by four bounds: upper and lower bounds of \(C_\mathrm{e}\) and \(\varepsilon _\mathrm{e}\).

  7. If there is no previous interval, there is no need to run the Add-In.

  8. Partitioning at the endpoints is also a theoretically valid strategy. We choose to partition at the midpoints rather than the endpoints to avoid the loss of information due to arithmetic imprecision at the dividing line. The drawback is that most of the rectangles are not eliminated in the first stage but have to be passed to the second stage.

  9. Recall that we solve an \({\mathbb {RQCP}}\) when the \({\mathbb {RLP}}\) or the \({\mathbb {LCP}}\) fails to be solved.

  10. Because the \(b_\mathrm{s}\) of SVM regression is not unique, we select a \(b_\mathrm{s}\) minimizing the absolute residual over the 45 validation data points. See Sect. 2.3 for the selection of \(b_\mathrm{s}\).

  11. A finer grid might give us a local grid-searched parameter that coincides with the global optimal parameter, since finding the global optimum is relatively easy compared to verifying it. Note that the grids commonly used in SVM parameter selection are logarithmic grids (Fung and Mangasarian 2004) of base 2 or 10, i.e., \(2^i\) or \(10^j\), where i and j range from some negative integer to some positive integer; a sketch of such a grid is given after these notes.

  12. We use \(tol = 10^{-14}\).
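
For concreteness, the following Python sketch builds the kind of logarithmic grid mentioned in note 11 for a \((C_\mathrm{e},\varepsilon _\mathrm{e})\) grid search. The bases and exponent ranges are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of a logarithmic parameter grid (assumed exponent ranges).
C_grid = [2.0 ** i for i in range(-5, 6)]        # base-2 grid: 2^-5, ..., 2^5
eps_grid = [10.0 ** j for j in range(-3, 2)]     # base-10 grid: 10^-3, ..., 10^1
candidate_pairs = [(C, eps) for C in C_grid for eps in eps_grid]
```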

References

  • Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79

  • Bard JF, Moore JT (1990) A branch and bound algorithm for the bilevel programming problem. SIAM J Sci Stat Comput 11(2):281–292

  • Bemporad A, Morari M, Dua V, Pistikopoulos EN (2002) The explicit linear quadratic regulator for constrained systems. Automatica 38(1):3–20

  • Billups SC (1995) Algorithm for complementarity problems and generalized equations. PhD thesis, University of Wisconsin Madison

  • Burges CJC, Crisp DJ (1999) Uniqueness of the SVM solution. In NIPS’99, pp 223–229

  • Byrd RH, Nocedal J, Waltz RA (2006) Knitro: an integrated package for nonlinear optimization. In: Di Pillo G, Roma M (eds) Large-scale nonlinear optimization. Nonconvex optimization and its applications, vol 83. Springer, US, pp 35–59

  • Carrizosa E, Morales DR (2013) Supervised classification and mathematical optimization. Comput Oper Res 40(1):150–165

  • Carrizosa E, Martín-Barragán B, Morales DR (2014) A nested heuristic for parameter tuning in support vector machines. Comput Oper Res 43:328–334

  • Cawley GC, Talbot NLC (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 11:2079–2107

  • Columbano S, Fukuda K, Jones CN (2009) An output-sensitive algorithm for multi-parametric LCPs with sufficient matrices. In CRM proceedings and lecture notes, vol 48

  • De Luca T, Facchinei F, Kanzow C (1996) A semismooth equation approach to the solution of nonlinear complementarity problems. Math Program 75:407–439

  • Demiriz A, Bennett KP, Breneman CM, Embrechts MJ (2001) Support vector machine regression in chemometrics. In: Proceedings of the 33rd symposium on the interface of computing science and statistics

  • Facchinei F, Soares J (1997) A new merit function for nonlinear complementarity problems and a related algorithm. SIAM J Optim 7(1):225–247

  • Facchinei F, Pang J-S (2003) Finite-dimensional variational inequalities and complementarity problems II. Springer, New York

  • Ferris MC, Munson TS (2002) Interior-point methods for massive support vector machines. SIAM J Optim 13(3):783–804

  • Ferris MC, Munson TS (2004) Semismooth support vector machines. Math Program 101:185–204

  • Floudas CA, Gounaris C (2009) A review of recent advances in global optimization. J Glob Optim 45:3–38

  • Fung GM, Mangasarian OL (2004) A feature selection newton method for support vector machine classification. Comput Optim Appl 28(2):185–202

  • Ghaffari-Hadigheh A, Romanko O, Terlaky T (2010) Bi-parametric convex quadratic optimization. Optim Methods Softw 25:229–245

  • Gumus ZH, Floudas CA (2001) Global optimization of nonlinear bilevel programming problems. J Glob Optim 20:1–31

  • IBM ILOG CPLEX Optimizer (2010) http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/

  • Hu J, Mitchell JE, Pang J-S, Bennett KP, Kunapuli G (2008) On the global solution of linear programs with linear complementarity constraints. SIAM J Optim 19:445–471

  • Kecman V (2005) Support vector machines—an introduction. In: Wang L (ed) Support vector machines: theory and applications. Studies in fuzziness and soft computing, vol 177. Springer, Berlin, pp 1–48

  • Keerthi SS, Lin C-J (2003) Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput 15(7):1667–1689

  • Kunapuli G (2008) A bilevel optimization approach to machine learning. PhD thesis, Rensselaer Polytechnic Institute

  • Kunapuli G, Bennett KP, Hu J, Pang J-S (2008) Classification model selection via bilevel programming. Optim Methods Softw 23(4):475–489

  • Kunapuli G, Bennett KP, Hu J, Pang J-S (2006) Model selection via bilevel programming. In: Proceedings of the IEEE international joint conference on neural networks

  • Lee Y-C, Pang J-S, Mitchell JE (2015) An algorithm for global solution to bi-parametric linear complementarity constrained linear programs. J Glob Optim 62(2):263–297

  • Mangasarian OL, Musicant DR (1998) Successive overrelaxation for support vector machines. IEEE Trans Neural Netw 10:1032–1037

  • Ng AY (1997) Preventing “overfitting” of cross-validation data. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann, Menlo Park, pp 245–253

  • Schittkowski K (2005) Optimal parameter selection in support vector machines. J Ind Manag Optim 1(4):465–476

  • Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge

  • Tondel P, Johansen TA, Bemporad A (2003) An algorithm for multi-parametric quadratic programming and explicit mpc solutions. Automatica 39(3):489–497

  • Vapnik V, Golowich SE, Smola AJ (1997) Support vector method for function approximation, regression estimation and signal processing. In: Mozer M, Jordan MI, Petsche T (eds) Advances in neural information processing systems, vol 9. Proceedings of the 1996 neural information processing systems conference (NIPS 1996). MIT Press, Cambridge, pp 281–287

  • Yu B (2011) A branch and cut approach to linear programs with linear complementarity constraints. PhD thesis, Rensselaer Polytechnic Institute

Acknowledgments

We thank the reviewers for their thorough comments and suggestions, which allowed us to significantly improve the completeness and the presentation of this work. Lee and Pang were supported in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0151. Mitchell was supported in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0260. Pang was supported by the National Science Foundation under Grant Number CMMI-1333902. Mitchell was supported by the National Science Foundation under Grant Number CMMI-1334327. Lee’s new affiliation after August 1, 2015 will be the Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Hsinchu, Taiwan.

Author information

Corresponding author

Correspondence to Yu-Ching Lee.

Appendices

Appendix A: Semismooth method for SVR

Algorithm 12

Damped Newton method (Algorithm 2 in [17])

Step 0: Initialization:

Let \(\mathbf {a}^0 \in {\mathbb {R}}^n\), \(\rho \ge 0\), \(p \ge 2\), and \(\sigma \in (0, \frac{1}{2})\) be given. Set \(k = 0\). Set \(tol\) (see Footnote 12).

Step 1: Termination:

If \(g(\mathbf {a}^k) := \frac{1}{2} \parallel \Phi (\mathbf {a}^k) \parallel _2^2 \le tol \), stop.

Step 2: Direction Generation:

Otherwise, let \(\mathbf {H}^k \in \partial _B \Phi (\mathbf {a}^k)\), and calculate \(\mathbf {d}^k \in {\mathbb {R}}^n\) solving the Newton system:

$$\begin{aligned} \mathbf {H}^k\mathbf {d}^k = - \Phi (\mathbf {a}^k). \end{aligned}$$
(31)

If either (31) is unsolvable or the descent condition

$$\begin{aligned} \nabla g(\mathbf {a}^k)^T\mathbf {d}^k < - \rho \parallel \mathbf {d}^k \parallel _2^p \end{aligned}$$
(32)

is not satisfied, then set

$$\begin{aligned} \mathbf {d}^k = - \nabla g(\mathbf {a}^k). \end{aligned}$$
(33)

Step 3: Line Search: Choose \(t^k = 2^{-i_k}\), where \(i_k\) is the smallest integer such that

$$\begin{aligned} g\big ( \mathbf {a}^k + 2^{-i_k}\mathbf {d}^k \big ) \le g(\mathbf {a}^k) + \sigma 2^{-i_k} \nabla g (\mathbf {a}^k)^T \mathbf {d}^k. \end{aligned}$$

Step 4: Update: Let \(\mathbf {a}^{k+1} := \mathbf {a}^k + t^k\mathbf {d}^k \) and \(k := k + 1\). Go to Step 1. \(\square \)
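
To make the iteration concrete, the following Python sketch implements the steps of Algorithm 12 for a generic semismooth system \(\Phi (\mathbf {a}) = \mathbf {0}\). It is a minimal sketch, not the authors' implementation: the caller supplies \(\Phi \) and a routine returning an element of \(\partial _B\Phi \), the gradient of the merit function is taken as \(\mathbf {H}^T\Phi (\mathbf {a})\) (valid for the Fischer–Burmeister reformulation), and the default parameter values are illustrative.

```python
import numpy as np

def damped_newton(phi, b_subdiff_element, a0, rho=1e-8, p=2.1, sigma=1e-4,
                  tol=1e-14, max_iter=200, max_halvings=50):
    """Damped Newton iteration for the semismooth system phi(a) = 0.

    phi(a)               -> residual vector Phi(a) in R^n
    b_subdiff_element(a) -> an n-by-n element H of the B-subdifferential of Phi at a
    """
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        F = phi(a)
        g = 0.5 * float(F @ F)              # merit function g(a) = 0.5 ||Phi(a)||^2
        if g <= tol:                        # Step 1: termination
            return a
        H = b_subdiff_element(a)            # Step 2: H^k in partial_B Phi(a^k)
        grad_g = H.T @ F                    # gradient of the merit function
        try:
            d = np.linalg.solve(H, -F)      # Newton direction from (31)
            if grad_g @ d >= -rho * np.linalg.norm(d) ** p:
                d = -grad_g                 # descent test (32) fails: use (33)
        except np.linalg.LinAlgError:
            d = -grad_g                     # (31) unsolvable: use (33)
        t, slope = 1.0, float(grad_g @ d)   # Step 3: line search with t = 2^{-i}
        for _ in range(max_halvings):
            Ft = phi(a + t * d)
            if 0.5 * float(Ft @ Ft) <= g + sigma * t * slope:
                break
            t *= 0.5
        a = a + t * d                       # Step 4: update and return to Step 1
    return a
```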

The B-subdifferential used in Step 2 of Algorithm 12 is obtained using the following theorem.

Theorem 13

(Theorem 5 in Ferris and Munson (2004)) Let \(\mathbf {U}: {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\) be continuously differentiable. Then

$$\begin{aligned} \partial _B\Phi (\mathbf {a}) \subseteq \{ D_a + D_bU'(\mathbf {a})\}, \end{aligned}$$

where \(D_a \in {\mathbb {R}}^{n \times n}\) and \(D_b \in {\mathbb {R}}^{n\times n}\) are diagonal matrices with entries defined as follows:

  1.

    For all \(i \in \mathbb {I}\): If \(\parallel (a_i, U_i(\mathbf {a})) \parallel \ne 0\), then

    $$\begin{aligned} (D_a)_{ii} = 1 - \frac{a_i}{\parallel (a_i, U_i(\mathbf {a})) \parallel }, \qquad (D_b)_{ii} = 1 - \frac{U_i(\mathbf {a})}{\parallel (a_i, U_i(\mathbf {a})) \parallel }; \end{aligned}$$
    (34)

    otherwise

    $$\begin{aligned} ((D_a)_{ii}, (D_b)_{ii} ) \in \{ (1- \kappa , 1-\gamma ) \in {\mathbb {R}}^2 \mid \parallel (\kappa , \gamma ) \parallel \le 1\}. \end{aligned}$$
    (35)
  2.

    For all \(i\in \mathbb {E}\):

    $$\begin{aligned} (D_a)_{ii} = 0, \qquad (D_b)_{ii} = 1. \end{aligned}$$

\(\square \)

If \(\parallel (a_i, U_i(\mathbf {a})) \parallel \ne 0\), \(\Phi (\mathbf {a})\) is differentiable at \(\mathbf {a}\) and formula (34) computes the exact Jacobian. On the other hand, if \(\parallel (a_i, U_i(\mathbf {a})) \parallel = 0\) occurs at the \(i\)th complementarity, the \(\kappa \) and \(\gamma \) appearing in (35) are computed as suggested in Facchinei and Pang (2003):

$$\begin{aligned} \kappa = \frac{v_i}{\sqrt{v_i^2 + (U'v)_i^2}}, \quad \text{ and } \quad \gamma = \frac{(U'v)_i}{\sqrt{v_i^2 + (U'v)_i^2}}, \end{aligned}$$
(36)

where \(\mathbf {v} \in {\mathbb {R}}^n\) is a vector of the user’s choice whose \(i\)th element is nonzero.

To compute \(\kappa \) and \(\gamma \) in (36), we can choose \(\mathbf {v} = \mathbf {1}\) and let \(h = U'(\mathbf {a})\), which is a constant matrix for the SVR complementarity system and therefore does not require updating. The computation of \(\kappa \) and \(\gamma \) then simplifies to

$$\begin{aligned} \kappa = \frac{1}{\sqrt{1 + (hv)_i^2}},\quad \text{ and } \quad \gamma = \frac{(hv)_i}{\sqrt{1 + (hv)_i^2}}, \end{aligned}$$

where the index \(i\) is the same as defined in (36).
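
As an illustration of Theorem 13 and the simplification above, the following Python sketch assembles one element \(D_a + D_bU'(\mathbf {a})\) of the B-subdifferential for an affine map \(U(\mathbf {a}) = M\mathbf {a} + \mathbf {q}\), hard-coding the choice \(\mathbf {v} = \mathbf {1}\). The function name, the affine form of \(U\), and the index-set arguments are assumptions made for the sketch, not the authors' code.

```python
import numpy as np

def fb_b_subdifferential_element(a, M, q, comp_idx, eq_idx):
    """Return one element D_a + D_b * U'(a) of partial_B Phi(a) for U(a) = M a + q."""
    n = a.size
    U = M @ a + q
    Da = np.zeros(n)
    Db = np.zeros(n)
    Mv = M @ np.ones(n)                      # (U'v)_i with the choice v = 1
    for i in comp_idx:                       # complementarity rows (i in I)
        r = np.hypot(a[i], U[i])
        if r != 0.0:                         # differentiable case, formula (34)
            Da[i] = 1.0 - a[i] / r
            Db[i] = 1.0 - U[i] / r
        else:                                # degenerate case, formulas (35)-(36)
            s = np.hypot(1.0, Mv[i])
            Da[i] = 1.0 - 1.0 / s            # 1 - kappa
            Db[i] = 1.0 - Mv[i] / s          # 1 - gamma
    for i in eq_idx:                         # equation rows (i in E)
        Da[i], Db[i] = 0.0, 1.0
    return np.diag(Da) + np.diag(Db) @ M     # D_a + D_b U'(a)
```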

In our experiments, we skip the safeguards (32) and (33) to save running time, because the system (31) is always solvable and convergence is obtained in most cases with the initial point \(\mathbf {a}^0 = \mathbf {0}\). For the rare cases in which the termination condition \(g(\mathbf {a}^k) \le tol\) in Step 1 cannot be fulfilled, we restart from a different initial point.

Appendix B: Steps 2–4 of Algorithm 8

Step 2. Search on the vertical line \(C_e = \underline{C}\).

Initialize the set \({\mathbb {G}}rouping{\mathbb {VS}}et^{left} = {\mathbb {G}}rouping{\mathbb {VS}}et^{lu}\).

  1. 2a:

    For each piece corresponding to a member of \({\mathbb {G}}rouping{\mathbb {VS}}et^{left}\), solve (20) subject to \(C_e = \underline{C}\) to obtain invariancy intervals. Let \(\big [\varepsilon _{min}, \,\varepsilon _{max}\big ]\) be the largest of these intervals. Apply the Add-In when repeating. Increase \(countLeft\) by 1.

  2. 2b:

    Solve \({\mathbb {LCP}}_{SVR}^f\) at \((\underline{C},\varepsilon _{min})\) and obtain the grouping vectors set.

  3. 2c:

    Replace \({\mathbb {G}}rouping{\mathbb {VS}}et^{left}\) by the set of the grouping vectors obtained in 2b. If any members of \({\mathbb {G}}rouping{\mathbb {VS}}et^{left}\) are not in the set \({\mathbb {G}}rouping{\mathbb {VF}}ound\), add them to the latter set.

  4. 2d:

    If the objective value of \({\mathbb {RLP}}\) is smaller than LeastUpperBound, update LeastUpperBound.

  5. 2e:

    If \(\varepsilon _{min}\) is greater than \(\underline{\varepsilon }\), let \(newStarting = (\underline{C}\,,\, \varepsilon _{min}-deviation)\).

  6. 2f:

    Solve \({\mathbb {LCP}}_{SVR}^f\) at newStarting and obtain the grouping vectors set. Do as in 2c–2d.

  7. 2g:

    Repeat 2a–2d until \(\varepsilon _{min} = \underline{\varepsilon }\). (A sketch of this sweep, which also covers Steps 3 and 4, follows this list.)
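
The following Python sketch outlines this vertical-line sweep; Steps 3 and 4 below are the same sweep with the fixed coordinate changed. The helper routines largest_invariancy_interval (solving (20) over the pieces of the current grouping vectors) and solve_lcp_svr (solving \({\mathbb {LCP}}_{SVR}^f\) and returning the new grouping vectors together with the \({\mathbb {RLP}}\) objective value) are hypothetical placeholders for routines described elsewhere in the paper, and the Add-In of note 7 is omitted.

```python
def sweep_vertical_line(C_fixed, eps_lower, grouping_vset_start, grouping_v_found,
                        least_upper_bound, largest_invariancy_interval,
                        solve_lcp_svr, deviation=1e-8):
    """Sweep the line C_e = C_fixed downward in eps_e (Steps 2a-2g).

    Grouping vectors are assumed to be represented as tuples so that they
    can be collected in Python sets.
    """
    grouping_vset = set(grouping_vset_start)
    count = 0
    while True:
        # 2a: over the pieces of the current grouping vectors, solve (20) with
        #     C_e fixed and keep the largest invariancy interval found.
        eps_min, _eps_max = largest_invariancy_interval(grouping_vset, C_fixed)
        count += 1
        # 2b-2d: solve LCP_SVR^f at (C_fixed, eps_min), record new grouping
        #        vectors, and update the incumbent with the RLP objective value.
        new_vset, rlp_value = solve_lcp_svr(C_fixed, eps_min)
        grouping_vset = set(new_vset)
        grouping_v_found |= grouping_vset
        least_upper_bound = min(least_upper_bound, rlp_value)
        # 2g: stop once the bottom of the line has been reached.
        if eps_min <= eps_lower:
            break
        # 2e-2f: restart just below eps_min and repeat the updates of 2c-2d.
        new_vset, rlp_value = solve_lcp_svr(C_fixed, eps_min - deviation)
        grouping_vset = set(new_vset)
        grouping_v_found |= grouping_vset
        least_upper_bound = min(least_upper_bound, rlp_value)
    return grouping_v_found, least_upper_bound, count
```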

Step 3. Search on the horizontal line \(\varepsilon _e = \underline{\varepsilon }\).

Initialize the set \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom} = {\mathbb {G}}rouping{\mathbb {VS}}et^{ul}\).

  1. 3a:

    For each piece corresponding to a member of \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom}\), solve (19) subject to \(\varepsilon _e = \underline{\varepsilon }\) to obtain invariancy intervals. Let \(\big [C_{min}, \,C_{max}\big ]\) be the largest of these intervals. Apply the Add-In when repeating. Increase \(countBottom\) by 1.

  2. 3b:

    Solve \({\mathbb {LCP}}_{SVR}^f\) at \((C_{min},\underline{\varepsilon })\) and obtain the grouping vectors set.

  3. 3c:

    Replace \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom}\) by the set of the grouping vectors obtained in 3b. If any members of \({\mathbb {G}}rouping{\mathbb {VS}}et^{bottom}\) are not in the set \({\mathbb {G}}rouping{\mathbb {VF}}ound\), add them to the latter set.

  4. 3d:

    If the objective value of \({\mathbb {RLP}}\) is smaller than LeastUpperBound, update LeastUpperBound.

  5. 3e:

    If \(C_{min}\) is greater than \(\underline{C}\), let \(newStarting = (C_{min}-deviation\,,\, \underline{\varepsilon })\).

  6. 3f:

    Solve \({\mathbb {LCP}}_{SVR}^f\) at newStarting and obtain the grouping vectors set. Do as in 3c–3d.

  7. 3g:

    Repeat 3a–3d until \(C_{min} = \underline{C}\).

Step 4. Search on the vertical line \(C_e = \bar{C}\).

Initialize the set \({\mathbb {G}}rouping{\mathbb {VS}}et^{right} = {\mathbb {G}}rouping{\mathbb {VS}}et^{uu}\).

  1. 4a:

    For each piece corresponding to a member of \({\mathbb {G}}rouping{\mathbb {VS}}et^{right}\), solve (20) subject to \(C_e = \bar{C}\) to obtain invariancy intervals. Let \(\big [\varepsilon _{min}, \,\varepsilon _{max}\big ]\) be the largest of these intervals. Apply the Add-In when repeating. Increase \(countRight\) by 1.

  2. 4b:

    Solve \({\mathbb {LCP}}_{SVR}^f\) at \((\bar{C},\varepsilon _{min})\) and obtain the grouping vectors set.

  3. 4c:

    Replace \({\mathbb {G}}rouping{\mathbb {VS}}et^{right}\) by the set of the grouping vectors obtained in 4b. If any members of \({\mathbb {G}}rouping{\mathbb {VS}}et^{right}\) are not in the set \({\mathbb {G}}rouping{\mathbb {VF}}ound\), add them to the latter set.

  4. 4d:

    If the objective value of \({\mathbb {RLP}}\) is smaller than LeastUpperBound, update LeastUpperBound.

  5. 4e:

    If \(\varepsilon _{min}\) is greater than \(\underline{\varepsilon }\), let \(newStarting = (\bar{C}, \varepsilon _{min}-deviation)\).

  6. 4f:

    Solve \({\mathbb {LCP}}_{SVR}^f\) at newStarting and obtain the grouping vectors set. Do as in 4c–4d.

  7. 4g:

    Repeat 4a–4d until \(\varepsilon _{min} = \underline{\varepsilon }\).

Appendix C: Cases 2–6 of Algorithm 10

[Figure: Cases 2–6 of Algorithm 10]

About this article

Cite this article

Lee, YC., Pang, JS. & Mitchell, J.E. Global resolution of the support vector machine regression parameters selection problem with LPCC. EURO J Comput Optim 3, 197–261 (2015). https://doi.org/10.1007/s13675-015-0041-z
