Abstract
The support vector machine (SVM) is a powerful tool for binary classification, known to attain excellent misclassification rates. On the other hand, many real-world classification problems, such as those found in medical diagnosis or churn and fraud prediction, involve misclassification costs that may differ between the two classes. However, it may be hard for the user to provide precise values for such misclassification costs, whereas it may be much easier to identify acceptable misclassification rates. In this paper we propose a novel SVM model in which misclassification costs are taken into account by incorporating performance constraints into the problem formulation. Specifically, our aim is to seek the hyperplane with maximal margin that yields misclassification rates below given threshold values. Such a maximal-margin hyperplane is obtained by solving a convex quadratic problem with linear constraints and integer variables. The reported numerical experience shows that our model gives the user control over the misclassification rate in one class (possibly at the expense of an increase in the misclassification rate for the other class) and is feasible in terms of running times.
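As a side illustration (not part of the original article), the thresholded-rates idea described above can be sketched in a few lines of Python; the function names, toy labels, and threshold values below are hypothetical:

```python
# Hypothetical sketch: per-class misclassification rates checked against
# user-chosen thresholds, as in a performance-constrained classifier.
# Labels are +1 / -1, as usual in binary SVM classification.

def class_error_rates(y_true, y_pred):
    """Return the misclassification rate of each class present in y_true."""
    errors = {1: 0, -1: 0}
    counts = {1: 0, -1: 0}
    for yt, yp in zip(y_true, y_pred):
        counts[yt] += 1
        if yt != yp:
            errors[yt] += 1
    return {c: errors[c] / counts[c] for c in (1, -1) if counts[c] > 0}

def meets_thresholds(rates, thresholds):
    """True iff every class's misclassification rate is within its threshold."""
    return all(rates[c] <= thresholds[c] for c in rates)

# Toy data: 4 positive and 4 negative instances.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, 1, 1]
rates = class_error_rates(y_true, y_pred)          # {1: 0.25, -1: 0.5}
ok = meets_thresholds(rates, {1: 0.3, -1: 0.3})    # False: class -1 too high
```

A classifier satisfying the paper's performance constraints would, by construction, make `meets_thresholds` return `True` on the sample used to impose the constraints.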
Acknowledgements
This research is supported by Fundación BBVA, and by Projects FQM329 and P11-FQM-7603 (Junta de Andalucía, Spain) and MTM2015-65915-R (Ministerio de Economía y Competitividad, Spain). The last three are co-funded with EU ERDF funds. The authors are thankful for such support.
Appendix A: Derivation of the CSVM
In this section we detail the steps to build the CSVM formulation. To that end, suppose that we are given the mixed-integer quadratic model
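The display equation that belongs here did not survive in this version of the text. As a rough sketch only (notation, index sets, and constraint ordering are assumptions, not taken verbatim from the original), a CSVM-type primal consistent with the surrounding discussion would read:

```latex
\begin{aligned}
\min_{\beta,\,\beta_0,\,\xi,\,z}\quad
  & \tfrac{1}{2}\,\beta^{\top}\beta + C\sum_{i\in I}\xi_i \\
\text{s.t.}\quad
  & y_i\,(\beta^{\top}x_i+\beta_0) \ge 1-\xi_i,               && i\in I,\\
  & \xi_i \ge 0,                                              && i\in I,\\
  & \textstyle\sum_{j\in J_\ell} z_j \ge \kappa_\ell\,|J_\ell|, && \ell=\pm 1,\\
  & y_j\,(\beta^{\top}x_j+\beta_0) \ge 1 - M_1(1-z_j),         && j\in J,\\
  & z_j\in\{0,1\},                                            && j\in J,
\end{aligned}
```

where $z_j$ indicates whether instance $j$ of the performance sample $J$ is counted as correctly classified and $\kappa_\ell$ is the required rate for class $\ell$.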
Hence, the problem above can be rewritten as
We first develop the expression of the dual for the linear case and then show how the kernel trick applies. As a preliminary step, we consider the variables z as fixed. With z fixed, the inner problem can be rewritten as:
Since \(M_1\) is a large number, the fourth set of constraints is always satisfied when \(z_j=0\), so those constraints can be removed. Also, denoting \(\{j \in J : z_j =1\}\) by J(z), we obtain
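The reduced problem displayed at this point is missing from this version. As a hedged sketch under the same assumed notation (training sample $I$, performance sample $J$), the inner problem for fixed $z$ would be:

```latex
\begin{aligned}
\min_{\beta,\,\beta_0,\,\xi}\quad
  & \tfrac{1}{2}\,\beta^{\top}\beta + C\sum_{i\in I}\xi_i \\
\text{s.t.}\quad
  & y_i\,(\beta^{\top}x_i+\beta_0) \ge 1-\xi_i, && i\in I,\\
  & \xi_i \ge 0,                                && i\in I,\\
  & y_j\,(\beta^{\top}x_j+\beta_0) \ge 1,        && j\in J(z).
\end{aligned}
```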
Hence, we can build the Lagrangian
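The Lagrangian itself is not preserved here. Under the assumed notation, with multipliers \(\lambda_s \ge 0\) for the margin constraints on \(I\), \(\mu_t \ge 0\) for those on \(J(z)\), and \(\delta_i \ge 0\) for \(\xi_i \ge 0\), a sketch consistent with the derivation that follows is:

```latex
\mathcal{L}(\beta,\beta_0,\xi;\lambda,\mu,\delta)
= \tfrac{1}{2}\,\beta^{\top}\beta + C\sum_{i\in I}\xi_i
- \sum_{s\in I}\lambda_s\bigl[y_s(\beta^{\top}x_s+\beta_0)-1+\xi_s\bigr]
- \sum_{t\in J(z)}\mu_t\bigl[y_t(\beta^{\top}x_t+\beta_0)-1\bigr]
- \sum_{i\in I}\delta_i\,\xi_i .
```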
The KKT conditions are, therefore
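The stationarity conditions displayed at this point are missing. With the assumed Lagrangian above, a sketch of them would be:

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \beta} = 0
  &\;\Longrightarrow\;
  \beta = \sum_{s\in I}\lambda_s y_s x_s + \sum_{t\in J(z)}\mu_t y_t x_t,\\
\frac{\partial \mathcal{L}}{\partial \beta_0} = 0
  &\;\Longrightarrow\;
  \sum_{s\in I}\lambda_s y_s + \sum_{t\in J(z)}\mu_t y_t = 0,\\
\frac{\partial \mathcal{L}}{\partial \xi_i} = 0
  &\;\Longrightarrow\;
  C - \lambda_i - \delta_i = 0, \qquad i\in I.
\end{aligned}
```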
Note that we can replace, without loss of generality, \(\lambda _s/2\) and \(\mu _t/2\) by \(\lambda _s\) and \(\mu _t\), respectively. Then, from the condition \({\partial \mathcal {L}}/{\partial \beta } = 0\) we have
that can be simplified to
as stated. In addition, the condition \({\partial \mathcal {L}}/{\partial \xi _i} = 0\) is transformed into
and
Furthermore, since these results must coincide with those obtained had we kept the previously removed constraints, we have \(\mu _t = 0\) when \(z_t=0, \quad t \in J\) and \(\mu _t \ge 0\) when \(z_t=1, \quad t \in J\). This can be summarized as \(0 \le \mu _t \le M_2z_t,\quad t \in J\). Also, as usual, \(\delta _i\) is eliminated by adding
and
since we know that \(\delta _i \ge 0\). Therefore, the KKT conditions become:
Note that we have replaced every occurrence of J(z) by J, by virtue of the observation above.
Thus, substituting the previous expressions into the inner optimization problem, its partial dual can be calculated, yielding
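The dual problem displayed here is missing from this version. Under the assumed notation, a sketch of the partial dual in the linear case would be:

```latex
\begin{aligned}
\max_{\lambda,\,\mu}\quad
  & \sum_{s\in I}\lambda_s + \sum_{t\in J}\mu_t
    - \frac{1}{2}\Bigl\|\sum_{s\in I}\lambda_s y_s x_s
    + \sum_{t\in J}\mu_t y_t x_t\Bigr\|^2 \\
\text{s.t.}\quad
  & \sum_{s\in I}\lambda_s y_s + \sum_{t\in J}\mu_t y_t = 0,\\
  & 0 \le \lambda_s \le C, && s\in I,\\
  & 0 \le \mu_t \le M_2 z_t, && t\in J.
\end{aligned}
```

Expanding the squared norm produces only inner products of the form \(x_u^{\top}x_v\); replacing each of them by a kernel evaluation \(K(x_u,x_v)\) gives the kernelized form.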
Finally, since this problem depends on the observations only through inner products, we can apply the kernel trick, obtaining Problem (CSVM).
Cite this article
Benítez-Peña, S., Blanquero, R., Carrizosa, E. et al. On support vector machines under a multiple-cost scenario. Adv Data Anal Classif 13, 663–682 (2019). https://doi.org/10.1007/s11634-018-0330-5
Keywords
- Constrained classification
- Misclassification costs
- Mixed integer quadratic programming
- Sensitivity/specificity trade-off
- Support vector machines