
On support vector machines under a multiple-cost scenario


Abstract

The support vector machine (SVM) is a powerful tool for binary classification, known to attain excellent misclassification rates. However, many real-world classification problems, such as those found in medical diagnosis, churn or fraud prediction, involve misclassification costs which may differ between the two classes. Moreover, it may be hard for the user to provide precise values for such misclassification costs, whereas it may be much easier to identify acceptable misclassification rates. In this paper we propose a novel SVM model in which misclassification costs are considered by incorporating performance constraints in the problem formulation. Specifically, our aim is to seek the hyperplane with maximal margin yielding misclassification rates below given threshold values. Such a maximal-margin hyperplane is obtained by solving a convex quadratic problem with linear constraints and integer variables. The reported numerical experience shows that our model gives the user control over the misclassification rate in one class (possibly at the expense of an increased misclassification rate in the other class) and is feasible in terms of running times.



Acknowledgements

This research is supported by Fundación BBVA, and by Projects FQM329 and P11-FQM-7603 (Junta de Andalucía, Spain) and MTM2015-65915-R (Ministerio de Economía y Competitividad, Spain), the last three co-funded with EU ERDF funds. The authors are grateful for this support.

Author information

Corresponding author

Correspondence to Sandra Benítez-Peña.

Appendix A: Derivation of the CSVM

This section details the steps needed to build the CSVM formulation. Suppose we are given the mixed-integer quadratic model

$$\begin{aligned} \min _{\omega , \beta , \xi ,z}&\omega ^\top \omega + {C_+\sum \limits _{i \in I : y_i =+1} \xi _i + C_-\sum \limits _{i \in I : y_i =-1} \xi _i }&\\ \text {s.t.}&y_i(\omega ^\top x_i + \beta ) \ge 1 - \xi _i,&i \in I \\&\xi _i \ge 0&i \in I\\&y_j(\omega ^\top x_j + \beta ) \ge 1 - M_1(1-z_j),&j \in J \\&z_j \in \{ 0,1\}&j \in J\\&{\hat{p}_\ell \ge {p_{0}^*}_{\ell }}&\ell \in L. \end{aligned}$$

Hence, the problem above can be rewritten as

$$\begin{aligned} \begin{array}{lllllll} \min _{{z}} &{} &{} &{} {\min _{{\omega },\beta ,{\xi }} }&{} {{\omega }^\top {\omega } + {C_+\sum \limits _{i \in I : y_i =+1} \xi _i + C_-\sum \limits _{i \in I : y_i =-1} \xi _i } }\\ \text{ s.t. } &{} z_j \in \{0,1\} &{} j \in J &{} {\text{ s.t. }} &{} {y_i \left( {\omega }^\top {x}_i + \beta \right) \ge 1 - \xi _i} &{} {i \in I}\\ &{} {\hat{p}_\ell \ge {p_{0}^*}_{\ell }} &{} \ell \in L &{} &{}y_j\left( \omega ^\top x_j + \beta \right) \ge 1 - M_1(1-z_j),&{} j \in J \\ &{} &{} &{} &{} { \xi _i \ge 0} &{} {i \in I }. \end{array} \end{aligned}$$
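Before deriving the dual, the following minimal sketch illustrates how the primal mixed-integer quadratic model above could be set up with an off-the-shelf MIQP solver. It is not the authors' implementation: the function name csvm_miqp, the choice of Gurobi's Python interface, and the concrete form assumed for the performance constraints (per-class lower bounds on the fraction of points of \(J\) that are correctly classified, expressed through the variables \(z_j\)) are illustrative assumptions only.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def csvm_miqp(X_I, y_I, X_J, y_J, C_plus, C_minus, M1, p0_plus, p0_minus):
    """Sketch of the primal MIQP; (X_I, y_I) are the soft-margin observations (set I),
    (X_J, y_J) the observations entering the performance constraints (set J)."""
    n_I, d = X_I.shape
    n_J = X_J.shape[0]
    m = gp.Model("csvm_sketch")

    w = m.addVars(d, lb=-GRB.INFINITY, name="w")       # hyperplane coefficients omega
    beta = m.addVar(lb=-GRB.INFINITY, name="beta")     # intercept
    xi = m.addVars(n_I, lb=0.0, name="xi")             # slack variables, i in I
    z = m.addVars(n_J, vtype=GRB.BINARY, name="z")     # z_j = 1 forces x_j to be well classified

    # Objective: omega' omega plus class-dependent costs on the slacks
    m.setObjective(
        gp.quicksum(w[k] * w[k] for k in range(d))
        + C_plus * gp.quicksum(xi[i] for i in range(n_I) if y_I[i] == +1)
        + C_minus * gp.quicksum(xi[i] for i in range(n_I) if y_I[i] == -1),
        GRB.MINIMIZE)

    # Soft-margin constraints on I
    for i in range(n_I):
        m.addConstr(float(y_I[i]) * (gp.quicksum(w[k] * X_I[i, k] for k in range(d)) + beta)
                    >= 1 - xi[i])

    # Big-M constraints on J
    for j in range(n_J):
        m.addConstr(float(y_J[j]) * (gp.quicksum(w[k] * X_J[j, k] for k in range(d)) + beta)
                    >= 1 - M1 * (1 - z[j]))

    # Assumed form of the performance constraints: the per-class fraction of
    # correctly classified points of J must reach the thresholds p0_plus, p0_minus
    J_plus = [j for j in range(n_J) if y_J[j] == +1]
    J_minus = [j for j in range(n_J) if y_J[j] == -1]
    if J_plus:
        m.addConstr(gp.quicksum(z[j] for j in J_plus) >= p0_plus * len(J_plus))
    if J_minus:
        m.addConstr(gp.quicksum(z[j] for j in J_minus) >= p0_minus * len(J_minus))

    m.optimize()
    return np.array([w[k].X for k in range(d)]), beta.X
```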

We first derive the dual for the linear case and then show how the kernel trick applies. As a preliminary step, we regard the variables z as fixed. With z fixed, the inner problem can be rewritten as:

$$\begin{aligned} \begin{array}{llll} {\min _{{\omega },\beta ,{\xi }} }&{} {{\omega }^\top {\omega } + {C_+\sum \limits _{i \in I : y_i =+1} \xi _i + C_-\sum \limits _{i \in I : y_i =-1} \xi _i } }\\ {\text{ s.t. }} &{} {y_i \left( {\omega }^\top {x}_i + \beta \right) \ge 1 - \xi _i} &{} {i \in I}\\ &{}y_j\left( \omega ^\top x_j + \beta \right) \ge 1,&{} j \in J : z_j =1 \\ &{}y_j\left( \omega ^\top x_j + \beta \right) \ge 1 - M_1,&{} j \in J : z_j = 0\\ &{} { \xi _i \ge 0} &{} {i \in I }. \end{array} \end{aligned}$$

Since \(M_1\) is a large number, the fourth set of constraints is always satisfied and can therefore be removed. Denoting \(\{j \in J : z_j =1\}\) by \(J(z)\), we obtain

$$\begin{aligned} \begin{array}{llll} {\min _{{\omega },\beta ,{\xi }} }&{} {{\omega }^\top {\omega } + {C_+\sum \limits _{i \in I : y_i =+1} \xi _i + C_-\sum \limits _{i \in I : y_i =-1} \xi _i } }\\ {\text{ s.t. }} &{} {y_i \left( {\omega }^\top {x}_i + \beta \right) \ge 1 - \xi _i} &{} {i \in I}\\ &{}y_j\left( \omega ^\top x_j + \beta \right) \ge 1,&{} j \in J(z) \\ &{} { \xi _i \ge 0} &{} {i \in I }. \end{array} \end{aligned}$$

Hence, we can build the Lagrangian

$$\begin{aligned} \mathcal {L}(\omega ,\beta ,\xi )= & {} {{\omega }^\top {\omega } + {C_+\sum \limits _{i \in I : y_i =+1} \xi _i + C_-\sum \limits _{i \in I : y_i =-1} \xi _i }} \\&- \sum \limits _{s\in I} \lambda _s (y_s(\omega ^\top x_s+\beta ) -1 + \xi _s) \\&- \sum \limits _{t\in J(z)} \mu _t (y_t(\omega ^\top x_t+\beta ) -1) - \sum _{i' \in I} \delta _{i'} \xi _{i'} \end{aligned}$$

The KKT conditions are, therefore

$$\begin{aligned} \begin{array}{llllll} \dfrac{\partial \mathcal {L}}{\partial \omega } = 0 &{} \Rightarrow &{} {\omega } &{} = &{} \sum \limits _{s \in I} (\lambda _s/2) y_s {x}_s+ \sum \limits _{t \in J(z)} (\mu _t/2) y_t {x}_t \\ \dfrac{\partial \mathcal {L}}{\partial \beta } = 0 &{} \Rightarrow &{} 0 &{} = &{} \sum \limits _{s \in I} \lambda _s y_s + \sum \limits _{t \in J(z)} \mu _t y_t \\ \dfrac{\partial \mathcal {L}}{\partial \xi _i} = 0 &{} \Rightarrow &{} 0 &{} = &{} -\lambda _i -\delta _i + C_+ &{} i \in I:y_i =+1\\ \dfrac{\partial \mathcal {L}}{\partial \xi _i} = 0 &{} \Rightarrow &{} 0 &{} = &{} -\lambda _i -\delta _i + C_- &{} i \in I:y_i =-1\\ &{} &{} 0 &{} \le &{} \lambda _{i} &{} i \in I\\ &{} &{} 0 &{} \le &{} \mu _t &{} t \in J(z)\\ &{} &{} 0 &{} \le &{} \delta _{i} &{} i \in I \end{array} \end{aligned}$$

Note that, without loss of generality, we can rescale the multipliers and replace \(\lambda _s/2\) and \(\mu _t/2\) by \(\lambda _s\) and \(\mu _t\), respectively. The condition \({\partial \mathcal {L}}/{\partial \beta } = 0\) then reads

$$\begin{aligned} 0 = \sum \limits _{s \in I} 2\lambda _s y_s + \sum \limits _{t \in J(z)} 2\mu _t y_t, \end{aligned}$$

which can be simplified to

$$\begin{aligned} 0 = \sum \limits _{s \in I} \lambda _s y_s + \sum \limits _{t \in J(z)} \mu _t y_t, \end{aligned}$$

as stated. Similarly, the condition \({\partial \mathcal {L}}/{\partial \xi _i} = 0\) becomes

$$\begin{aligned} 0 = -2\lambda _i - \delta _i + C_+, \quad i \in I:y_i=+1 \end{aligned}$$

and

$$\begin{aligned} 0 = -2\lambda _i - \delta _i + C_-, \quad i \in I:y_i=-1. \end{aligned}$$

Furthermore, since these results must be equivalent to those obtained had the previously removed constraints been kept, we have \(\mu _t = 0\) when \(z_t=0\) and \(\mu _t \ge 0\) when \(z_t=1\), for \(t \in J\). This can be summarized as \(0 \le \mu _t \le M_2 z_t,\quad t \in J\), where \(M_2\) is a sufficiently large constant. Also, as usual, \(\delta _i\) can be eliminated by adding

$$\begin{aligned} 0 \le \lambda _i \le C_+/2, \quad i \in I:y_i=+1 \end{aligned}$$

and

$$\begin{aligned} 0 \le \lambda _i \le C_-/2, \quad i \in I:y_i=-1, \end{aligned}$$

which follow since \(\delta _i \ge 0\). The KKT conditions therefore become:

$$\begin{aligned} \begin{array}{llll} {\omega } &{} = &{} \sum \limits _{s \in I} \lambda _s y_s {x}_s+ \sum \limits _{t \in J} \mu _t y_t {x}_t \\ 0 &{} = &{} \sum \limits _{s \in I} \lambda _s y_s + \sum \limits _{t \in J} \mu _t y_t \\ 0 &{} \le &{} \lambda _s \le C_+/2 &{} s\in I:y_s=+1 \\ 0 &{} \le &{} \lambda _s \le C_-/2 &{} s\in I:y_s=-1 \\ 0 &{} \le &{} \mu _t \le {M_2} z_t&{} t \in J. \end{array} \end{aligned}$$

Note that \(J(z)\) has been replaced by \(J\) throughout, by virtue of the previous observation.

Thus, substituting the previous expressions into the second optimization problem, its partial dual can be computed, yielding

$$\begin{aligned} \begin{array}{lllll} \min \limits _{{z}} &{} &{} &{} {\min \limits _{{\lambda },{\mu }, \beta , {\xi }} } &{} \left( \sum \limits _{s \in I} \lambda _s y_s {x}_s+ \sum \limits _{t \in J} \mu _t y_t {x}_t \right) ^\top \left( \sum \limits _{s \in I} \lambda _s y_s {x}_s+ \sum \limits _{t \in J} \mu _t y_t {x}_t \right) \\ &{} &{} &{} &{} {+ \, \, C_+\sum \limits _{i \in I : y_i =+1} \xi _i + C_-\sum \limits _{i \in I : y_i =-1} \xi _i } \\ \text{ s.t. } &{} z_j \in \{0,1\} &{} j \in J &{} {\text{ s.t. }} &{} {y_i \left( \left( \sum \limits _{s \in I} \lambda _s y_s {x}_s+ \sum \limits _{t \in J} \mu _t y_t {x}_t\right) ^\top {x}_i + \beta \right) }\\ &{}&{}&{}&{}{\ge 1 - \xi _i} \quad {i \in I}\\ &{} {\hat{p}_\ell \ge {p_{0}^*}_{\ell }} &{} \ell \in L &{} &{} { y_j\left( \left( \sum \limits _{s \in I} \lambda _s y_s {x}_s+ \sum \limits _{t \in J} \mu _t y_t {x}_t\right) ^\top {x}_j + \beta \right) }\\ &{}&{}&{}&{} {\ge 1 -{M_1}(1-z_j)} \quad {j \in J} \\ &{} &{} &{} &{} { \xi _i \ge 0} \quad {i \in I }\\ &{} &{} &{} &{} {\sum \limits _{i \in I} \lambda _i y_i + \sum \limits _{j \in J} \mu _j y_j = 0}\\ &{} &{} &{} &{} {{ 0 \le \lambda _i \le C_+/2} \quad {i \in I:y_i=+1 }}\\ &{} &{} &{} &{} {{ 0 \le \lambda _i \le C_-/2} \quad {i \in I:y_i=-1 }}\\ &{} &{} &{} &{} { 0 \le \mu _j \le {M_2}z_j} \quad {j \in J}. \end{array} \end{aligned}$$

Finally, since this problem depends on the observations only through inner products, the kernel trick can be applied and Problem (CSVM) is obtained.
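As a small illustration of this last step, the decision rule implied by the dual representation above can be evaluated using kernel values only. The sketch below assumes an RBF kernel and takes the multipliers \(\lambda\), \(\mu\) and the intercept \(\beta\) as given; the function names and arguments are illustrative and not taken from the paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def csvm_decision(x, X_I, y_I, lam, X_J, y_J, mu, beta, gamma=1.0):
    # f(x) = sum_s lam_s y_s K(x_s, x) + sum_t mu_t y_t K(x_t, x) + beta
    s_I = sum(lam[s] * y_I[s] * rbf_kernel(X_I[s], x, gamma) for s in range(len(y_I)))
    s_J = sum(mu[t] * y_J[t] * rbf_kernel(X_J[t], x, gamma) for t in range(len(y_J)))
    return s_I + s_J + beta

# Predicted label: the sign of the decision value, e.g.
# y_hat = np.sign(csvm_decision(x_new, X_I, y_I, lam, X_J, y_J, mu, beta))
```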


Cite this article

Benítez-Peña, S., Blanquero, R., Carrizosa, E. et al. On support vector machines under a multiple-cost scenario. Adv Data Anal Classif 13, 663–682 (2019). https://doi.org/10.1007/s11634-018-0330-5
