1 Introduction

Classification is an important topic in many disciplines, and numerous classification algorithms have been developed over the years. Mathematical programming represents one of the major categories of classification methods. The most prominent mathematical programming-based classifier is the support vector machine (SVM) (Vapnik, 1982, 1995, 2000; Vapnik and Chapelle, 2000; Christopher and Burges, 1998; Cristianini and Shawe-Taylor, 2000; Campadelli et al, 2005; Scholkopf and Smola, 2002). Linear programming (LP) is another well-known optimization model, with a simple strategy of obtaining the best linear separating function (Bradley et al, 1999; Garcia-Palomares and Manzanilla-Salaza, 2012; Freed and Glover, 1981).

In recent years, Multiple Criteria Linear Programming (MCLP) (Shi, 2010) and Multiple Criteria Quadratic Programming (MCQP) (Peng et al, 2008a, b), two related optimization techniques, have been proposed and studied (Li et al, 2008; Zhang et al, 2009; Kou et al, 2005, 2009, 2012). The basic idea of multiple criteria mathematical programming models is to maximize the external distances between groups and minimize the internal distances within each group. Although the MCQP model has been proven effective and scalable to massive problems (Li et al, 2008), its performance drops when a dataset is highly imbalanced, because it is designed to maximize the total accuracy and hence the boundary b is skewed toward the minority class.

Cost-sensitive learning is a commonly used approach to the class imbalance problem (Soda, 2011; He et al, 2009; Thai-Nghe et al, 2010; Chawla et al, 2004; Sun et al, 2009). The concept of cost-sensitive classification comes from the recognition that different types of misclassification have different costs in real-life applications (Lomax and Vadera, 2013; Ting, 2002; Masnadi-Shirazi et al, 2015; Lee et al, 2012; Yue and Cakanyildiri, 2010; Zhao et al, 2011; Wang et al, 2014a, b; Cao et al, 2013; Min et al, 2011; Tsai et al, 2009; Shi et al, 2013). For example, in medical diagnosis and fraud detection, the cost of leaving a disease or a fraud undetected is much higher than the cost of a false alarm.

The goal of this paper is to propose a cost-sensitive Multiple Criteria Quadratic Programming (CS-MCQP) model that improves the performance of MCQP on imbalanced class distributions by introducing the cost of misclassifications and the cost of imbalanced data. The CS-MCQP model also adds \( \tilde{\beta } \) and \( \tilde{\alpha } \), which measure the distance from a correctly classified and a misclassified negative class (majority class) element \( A_{i} \) to \( b \), respectively, in an attempt to increase the overall accuracy. An experiment is designed to evaluate the proposed model by comparing it with the MCQP model and other well-known classifiers on imbalanced datasets from the UCI repository. The results show that the CS-MCQP model not only outperforms the MCQP model and the popular classifiers (i.e., C4.5, SVM, Naïve Bayes, locally weighted learning (LWL), RBF network, MLP and Logistic), but also achieves better performance than cost-sensitive SVM, preprocessing techniques, and ensemble and hybrid methods, especially on overlapping imbalanced data.

The contributions of this study to the theory and practice of OR are twofold: (1) we present a cost-sensitive optimization classifier that addresses imbalanced classification by introducing the cost of misclassifications and the cost of imbalanced data; (2) the experimental results show that the proposed model is significantly better than the other twenty-three classifiers in the experiments, which indicates that the CS-MCQP model provides a new tool for imbalanced classification, especially on overlapping imbalanced data.

The rest of this paper is organized as follows. Section 2 briefly discusses related work. Section 3 presents the CS-MCQP model. Section 4 describes the experimental design, Section 5 reports the results, and Section 6 concludes the paper.

This paragraph lists the notations. \( E[\theta ,\widetilde{\theta }] \) denotes the mathematical expectation of \( \theta \) and \( \widetilde{\theta } \). A vector \( X \in R^{r} \) is an r-dimensional vector. \( e \) denotes a vector of ones and \( 0 \) denotes a vector of zeros. The transpose of a vector is denoted by a superscript T. The p-norm of a vector is \( \left\| X \right\|_{p} = \left( {\sum\limits_{i = 1}^{r} {\left| {x_{i} } \right|^{p} } } \right)^{{\frac{1}{p}}} \). The inner product of vectors X and Y is denoted by \( X^{T} Y = \sum\limits_{i = 1}^{r} {x_{i} y_{i} } \). A matrix \( M \in R^{n \times r} \) is a real n × r-dimensional matrix. \( M_{i} \) denotes the i-th row of M and \( M_{j} \) denotes the j-th column of M. \( M^{T} \) is the transpose of M. The identity matrix is denoted by I. The positive class is the minority class and the negative class is the majority class.

2 Related work

Classification is one of the most extensively studied areas in data mining, machine learning and artificial intelligence (Pavlidis et al, 2012; Chang, 2013; Hwang et al, 2014; Martens and Provost, 2014; Peng et al, 2008b; Wang et al, 2014; Yang and Wu, 2006; Kou et al, 2009; Barros et al, 2012; Ferri et al, 2009). Well-known classification algorithms include decision tree induction, artificial neural networks, Bayesian classification and SVM (Vapnik, 1982, 1995, 2000; Garcia-Palomares and Manzanilla-Salaza, 2012; Lomax and Vadera, 2013). While standard classification algorithms were designed without paying special attention to imbalanced class distributions, the class imbalance problem is ubiquitous across fields. For example, unqualified products represent a small percentage of a production batch, cancerous cases constitute a small portion of medical examinations, and fraud is an uncommon phenomenon in credit card transactions. Since many classification algorithms presume balanced class distributions and often generate classification models that favor the majority class, their performance is greatly affected by imbalanced datasets. In recent years, imbalanced classification has become one of the key problems in data mining and machine learning.

2.1 Imbalanced classification

Many solutions have been proposed to deal with the class imbalance problem; they can be categorized into three major groups: preprocessing, ensembles and cost-sensitive learning (López et al, 2013). Preprocessing techniques handle the imbalanced classification problem by reducing skewed distributions with resampling methods, including undersampling, oversampling and hybrid approaches. Ensembles of classifiers are often combined with resampling techniques to target the class imbalance problem. For comprehensive reviews of classification approaches for imbalanced data, please refer to He and Garcia (2009) and López et al (2013).

In real-life imbalanced classification problems, the misclassification cost of the minority class is higher than that of the majority class. For example, in medical diagnosis, the cost of leaving a disease undetected is much higher than the cost of a false alarm. Cost-sensitive methods assign higher costs to misclassifications of the minority class and lower costs to misclassifications of the majority class to increase the accuracy of the minority class.

The cost can be defined in many ways, such as an imbalance ratio (Li et al, 2008), misclassification costs (Thai-Nghe et al, 2010), test costs (Lomax and Vadera, 2013) or a penalty coefficient (Masnadi-Shirazi et al, 2015). There are two main categories of cost-sensitive algorithms: cost-sensitive sampling (Sun et al, 2007) and modifications of standard classification algorithms (Liu and Zhou, 2006). The sampling approach resamples the original data by assigning different weights that reflect the costs of misclassification, as in Ting (2002). The latter introduces costs into standard classification algorithms such as decision trees (Lomax and Vadera, 2013) and neural networks (Liu and Zhou, 2006).

2.2 Classification based on optimization model

Optimization-based classification techniques have developed rapidly in data mining (Vapnik, 1982, 1995, 2000; Christopher and Burges, 1998; Campadelli et al, 2005; Scholkopf and Smola, 2002; Bradley et al, 1999; Garcia-Palomares and Manzanilla-Salaza, 2012; Freed and Glover, 1981; Li et al, 2008; Zhang et al, 2009; Kou et al, 2005, 2009). Multiple Criteria Linear Programming (MCLP) and Multiple Criteria Quadratic Programming (MCQP) have been employed in credit card risk analysis, network intrusion detection and VIP e-Mail behavior analysis (Shi, 2010; Li et al, 2008; Zhang et al, 2009; Kou et al, 2005, 2009). The basic idea of multiple criteria optimization is to minimize the overlapping distances between the observations and the hyperplane and to maximize the total distances from the observations to the hyperplane; the Pareto solution defines the classification boundary. MCQP models can obtain the separating hyperplane directly from the Karush–Kuhn–Tucker (KKT) conditions, which makes them suitable for large-scale classification (Peng et al, 2008b). However, the performance of the MCQP model drops when a dataset is imbalanced because the objectives are to maximize the sum of the distances to the hyperplane and to minimize the total errors, so the hyperplane is "pushed" toward the minority class. If the imbalance ratio is very high, the MCQP model may classify all minority examples as the majority class.

Given this background, we introduce a cost matrix as a penalty coefficient into the MCQP model to keep the hyperplane farther away from the minority class, and consequently decrease the number of misclassified minority instances and the total cost.

3 Cost-sensitive multi-criteria quadratic programming model

3.1 CS-MCQP model

The classification process of multi-criteria mathematical programming can be introduced by a matrix. We use a row vector \( A_{i} = (a_{i1} ,a_{i2} , \ldots ,a_{ir} ) \in R^{r} \) of the \( n \times r \) matrix \( A = (A_{1},\,A_{2} ,\,\ldots,\,A_{n} )^{T} \) to represent one record in a dataset, where \( n \) is the number of records and \( r \) is the number of attributes. A scalar \( b \) can be set as a boundary to separate the positive class \( I_{ + } \) and the negative class \( I_{ - } \). Let \( X = (x_{1} ,x_{2} , \ldots ,x_{r} )^{T} \in R^{r} \) be a vector of real numbers to be determined; the classification function can then be presented as follows:

$$ \left\{ \begin{aligned} A_{i} X < b, \, \forall A_{i} \in I_{ + } \hfill \\ A_{i} X \ge b, \, \forall A_{i} \in I_{ - } \hfill \\ \end{aligned} \right. $$
(3.1)

The inequalities in (3.1) can be converted to equations in (3.2) by adding the following variables. Let \( \beta_{i} \) be the distance from an element \( A_{i} \) to \( b \) when \( A_{i} \) is correctly classified as positive. Let \( \tilde{\beta }_{i} \) be the distance from \( A_{i} \) to \( b \) when \( A_{i} \) is correctly classified as negative. To deal with linearly inseparable data, we introduce \( \alpha_{i} \) and \( \tilde{\alpha }_{i} \). If an element \( A_{i} \) belongs to the negative class and is misclassified, then let \( \tilde{\alpha }_{i} \) be the distance from \( A_{i} \) to \( b \). If an element \( A_{i} \) belongs to the positive class and is misclassified, then let \( \alpha_{i} \) be the distance from \( A_{i} \) to \( b \).

$$ \left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {A_{i} X = b + \alpha_{i} - \beta_{i} ,} & {A_{i} \in I_{ + } } \\ \end{array} } \\ {\begin{array}{*{20}c} {A_{i} X = b - \tilde{\alpha }_{i} + \tilde{\beta }_{i} ,} & {A_{i} \in I_{ - } } \\ \end{array} } \\ \end{array} } \right. $$
(3.2)
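To make these distance variables concrete, the following minimal numpy sketch (our own illustration with a made-up toy dataset, not part of the original model) computes \( \alpha_{i} ,\beta_{i} ,\tilde{\alpha }_{i} ,\tilde{\beta }_{i} \) from the boundary rule in (3.1) and verifies that every record satisfies (3.2):

```python
import numpy as np

# Toy data: rows of A are records; y[i] = +1 for the positive class I+, -1 for I-.
# X and b are assumed to be given for this illustration.
A = np.array([[0.2, 0.1],
              [0.9, 0.8],
              [0.4, 0.5],
              [0.7, 0.2]])
y = np.array([+1, +1, -1, -1])
X = np.array([1.0, 1.0])
b = 1.0

proj = A @ X                                              # A_i X for every record
alpha = np.zeros(len(y)); beta = np.zeros(len(y))         # positive-class distances
alpha_t = np.zeros(len(y)); beta_t = np.zeros(len(y))     # negative-class distances

for i in range(len(y)):
    if y[i] == +1:               # positive class: correct if A_i X < b (rule 3.1)
        if proj[i] < b:
            beta[i] = b - proj[i]        # correctly classified, distance to b
        else:
            alpha[i] = proj[i] - b       # misclassified, distance to b
    else:                        # negative class: correct if A_i X >= b
        if proj[i] >= b:
            beta_t[i] = proj[i] - b
        else:
            alpha_t[i] = b - proj[i]

# Every record now satisfies (3.2):
#   A_i X = b + alpha_i - beta_i        for A_i in I+
#   A_i X = b - alpha~_i + beta~_i      for A_i in I-
assert np.allclose(proj[y == +1], b + alpha[y == +1] - beta[y == +1])
assert np.allclose(proj[y == -1], b - alpha_t[y == -1] + beta_t[y == -1])
```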

Let \( \alpha_{i} = \widetilde{{\alpha_{i} }} = 0 \) for all correctly classified elements and \( \beta_{i} = \widetilde{\beta }_{i} = 0 \) for all misclassified elements; the parameters then satisfy \( \alpha_{i} \tilde{\alpha }_{i} = 0,\alpha_{i} \tilde{\beta }_{i} = 0,\tilde{\alpha }_{i} \beta_{i} = 0,\beta_{i} \tilde{\beta }_{i} = 0 \). Adopting the concepts proposed in Peng et al (2008b), we introduce two adjusted hyperplanes \( b + \delta \) and \( b - \delta \), where \( \delta \) is a given scalar. Equations (3.2) can then be reformulated as the medium (3.3), strong (3.4) and weak (3.5) models:

(Medium model)

$$ \left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {A_{i} X = b + \alpha_{i} - \beta_{i} ,} & {A_{i} \in I_{ + } } \\ \end{array} } \\ {\begin{array}{*{20}c} {A_{i} X = b - \tilde{\alpha }_{i} + \tilde{\beta }_{i} ,} & {A_{i} \in I_{ - } } \\ \end{array} } \\ \end{array} } \right. $$
(3.3)

(Strong model)

$$ \left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {A_{i} X = b + \delta + \alpha_{i} - \beta_{i} ,} & {A_{i} \in I_{ + } } \\ \end{array} } \\ {\begin{array}{*{20}c} {A_{i} X = b - \delta - \tilde{\alpha }_{i} + \tilde{\beta }_{i} ,} & {A_{i} \in I_{ - } } \\ \end{array} } \\ \end{array} } \right. $$
(3.4)

(Weak model)

$$ \left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {A_{i} X = b - \delta + \alpha_{i} - \beta_{i} ,} & {A_{i} \in I_{ + } } \\ \end{array} } \\ {\begin{array}{*{20}c} {A_{i} X = b + \delta - \tilde{\alpha }_{i} + \tilde{\beta }_{i} ,} & {A_{i} \in I_{ - } } \\ \end{array} } \\ \end{array} } \right. $$
(3.5)

The relationships among the three models are given in the following Proposition 1.

Proposition 1

(i) A feasible solution of the strong model is a feasible solution of the medium and weak models.

(ii) A feasible solution of the medium model is a feasible solution of the weak model.

(iii) If a point \( A_{i} \) is classified into a given class by the strong model, then it must be classified into the same class by the medium and weak models.

(iv) If a point \( A_{i} \) is classified into a given class by the medium model, then it must be classified into the same class by the weak model.

Proof

Let \( F_{S} \) be the feasible region of the strong model, \( F_{M} \) the feasible region of the medium model and \( F_{W} \) the feasible region of the weak model. Assume that \( X_{S}^{*} \) is a feasible solution of the strong model, \( X_{M}^{*} \) a feasible solution of the medium model and \( X_{W}^{*} \) a feasible solution of the weak model. Without loss of generality, we only prove the case where \( A_{i} \) is correctly classified and belongs to \( I_{ + } \); the other cases can be proven in the same way.

For any \( X_{S}^{*} \in F_{S} \), there exist \( \alpha_{i} \) and \( \bar{\beta }_{i} \) such that \( A_{i} X_{S}^{*} = b - \delta + \alpha_{i} - \bar{\beta }_{i} \). This is equivalent to \( A_{i} X_{S}^{*} = b + \alpha_{i} - (\bar{\beta }_{i} + \delta ) \). Let \( \beta_{i} = \bar{\beta }_{i} + \delta \); then \( A_{i} X_{S}^{*} = b + \alpha_{i} - \beta_{i} \). Thus, \( X_{S}^{*} \in F_{M} \). Therefore, \( F_{S} \subseteq F_{M} \subseteq F_{W} \), and (i) and (ii) are true. (iii) and (iv) follow from (i) and (ii). The proof is completed.

To simplify the model, we divide the parameters \( X,b,\alpha_{i} ,\tilde{\alpha }_{i} ,\beta_{i} ,\tilde{\beta }_{i} \) by \( \delta \). The equations can be rewritten as a single constraint:

$$ Y( < A,X > - eb) = \delta^{\prime}e - \alpha - \tilde{\alpha } + \beta + \tilde{\beta } $$

where the n × n diagonal matrix Y contains "+1" or "−1" entries, indicating that the corresponding record \( A_{i} \in I_{ + } \) or \( A_{i} \in I_{ - } \). The value of \( \delta^{\prime} \) is "+1," "0" or "−1," corresponding to the strong, medium or weak model, respectively.

Now, we introduce the costs of misclassification (\( C_{\alpha } \), \( C_{{\tilde{\alpha }}} \)): \( C_{\alpha } \) is the cost of misclassifying an element \( A_{i} \) that belongs to the positive class, and \( C_{{\tilde{\alpha }}} \) is the cost of misclassifying an element \( A_{i} \) that belongs to the negative class. Thus, a cost-sensitive multi-criteria programming model can be formulated as (3.6): (CS-MCQP model)

$$ \begin{array}{*{20}c} {\text{Min}} & {\frac{1}{2}} \\ \end{array} \left\| x \right\|_{s}^{s} + W_{\alpha } \left[ {\sum\limits_{i = 1}^{n} {\frac{{\left| {\alpha_{i} } \right|^{p} }}{{C_{\alpha } }}} + \sum\limits_{i = 1}^{n} {\frac{{\left| {\tilde{\alpha }_{i} } \right|^{p} }}{{C_{{\tilde{\alpha }}} }}} } \right] - W_{\beta } \left[ {\sum\limits_{i = 1}^{n} {\left| {\beta_{i} } \right|^{q} } } \right] $$
(3.6)
$$ \begin{array}{*{20}c} {\rm S.t.} & {Y( < A,X > - eb)} \\ \end{array} = \delta^{\prime}e - \alpha - \tilde{\alpha } + \beta $$

where the weights satisfy \( W_{\alpha } + W_{\beta } = 1 \), \( W_{\alpha } > 0 \) and \( W_{\beta } > 0 \). The values of \( W_{\alpha } \) and \( W_{\beta } \) can be optimized by cross-validation or predefined by decision makers.

Different values of \( s \), \( p \) and \( q \) lead to different forms of the CS-MCQP model. Without loss of generality, suppose \( s = 2 \), \( p = 2 \) and \( q = 1 \), and add \( \frac{{W_{b} }}{2}b^{2} \) for model convexity (Fung, 2003); the model then becomes (3.7):

$$ \begin{array}{*{20}c} {\rm Min} & {\frac{1}{2}} \\ \end{array} \left\| x \right\|^{2} + W_{\alpha } \left[ {\sum\limits_{i = 1}^{n} {\frac{{\left| {\alpha_{i} } \right|^{2} }}{{C_{\alpha } }}} + \sum\limits_{i = 1}^{n} {\frac{{\left| {\tilde{\alpha }_{i} } \right|^{2} }}{{C_{{\tilde{\alpha }}} }}} } \right] - W_{\beta } \sum\limits_{i = 1}^{n} {\left| {\beta_{i} } \right|} + \frac{{W_{b} }}{2}b^{2} $$
(3.7)
$$ \begin{array}{*{20}c} {\rm S.t.} & {Y( < A,X > - eb)} \\ \end{array} = \delta^{\prime}e - \alpha - \tilde{\alpha } + \beta $$

We divide this model into two sub-models, each of which considers the misclassified records of one class. The solution of each sub-model represents the margin of that class. If the misclassified records belong to \( I_{ + } \), the CS-MCQP model becomes (3.8):

(Model 1)

$$ \begin{array}{*{20}c} {\text{Min}} & {\frac{1}{2}} \\ \end{array} \left\| x \right\|^{2} + \frac{{W_{\alpha } }}{2}\sum\limits_{i = 1}^{n} {\frac{{\left| {\alpha_{i} } \right|^{2} }}{{C_{\alpha } }}} - W_{\beta } \sum\limits_{i = 1}^{n} {\left| {\beta_{i} } \right|} + \frac{{W_{b} }}{2}b^{2} $$
(3.8)
$$ \begin{array}{*{20}c} {\rm S.t.} & {Y( < A,X > - eb)} \\ \end{array} = \delta^{\prime}e - \alpha + \beta $$

To simplify the computation, let \( \eta_{i} = \alpha_{i} - \beta_{i} \). According to the previous definitions, \( \eta_{i} = \alpha_{i} \) for all misclassified records and \( \eta_{i} = - \beta_{i} \) for all correctly classified records, so (3.8) becomes (3.9):

$$ \begin{array}{*{20}c} {\text{Min}} & {\frac{1}{2}} \\ \end{array} \left\| x \right\|^{2} + \frac{{W_{\alpha } }}{2}\sum\limits_{i = 1}^{n} {\frac{{\left| {\eta_{i} } \right|^{2} }}{{C_{\alpha } }}} + W_{\beta } \sum\limits_{i = 1}^{n} {\left| {\eta_{i} } \right|} + \frac{{W_{b} }}{2}b^{2} $$
(3.9)
$$ \begin{array}{*{20}c} {{\text{S}} . {\text{t}} .} & {Y( < A,X > - eb)} \\ \end{array} = \delta^{{\prime }} e - \eta $$

Using the Lagrangian function and the Wolfe dual theorem (Wolfe, 1961), we obtain:

$$ L(X,b,\eta ,\theta ) = \frac{1}{2}\left\| x \right\|^{2} + \frac{{W_{\alpha } }}{2}\sum\limits_{i = 1}^{n} {\frac{{\left| {\eta_{i} } \right|^{2} }}{{C_{\alpha } }}} + W_{\beta } \sum\limits_{i = 1}^{n} {\left| {\eta_{i} } \right|} + \frac{{W_{b} }}{2}b^{2} - \theta^{T} (Y( < A,X > - eb) - \delta^{\prime}e + \eta ) $$

and the gradient function is (3.10):

$$ \begin{array}{*{20}c} {\nabla_{X} L(X,b,\eta ,\theta ) = X - A^{T} Y\theta = 0,} \\ {\nabla_{b} L(X,b,\eta ,\theta ) = W_{b} b - e^{T} Y\theta = 0,} \\ {\nabla_{\eta } L(X,b,\eta ,\theta ) = \frac{{W_{\alpha } }}{{C_{\alpha } }}\eta + W_{\beta } e = \theta .} \\ \end{array} $$
(3.10)

We can solve the equation set and obtain the solution:

$$ \theta^{\prime} = [Y(AA^{T} + \frac{1}{{W_{b} }}ee^{T} )Y + \frac{{C_{\alpha } }}{{W_{\alpha } }}I]^{ - 1} (\delta^{\prime} + \frac{{W_{\beta } C_{\alpha } }}{{W_{\alpha } }})e $$
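This closed form can be evaluated with a single linear solve. The following numpy sketch (our own illustration; the default weight values are placeholders, not values from the paper) computes the multiplier for a given cost parameter \( C \):

```python
import numpy as np

def solve_theta(A, y, C, W_alpha=0.5, W_beta=0.5, W_b=1.0, delta=1.0):
    """Sketch of the closed-form multiplier
    theta = [Y(AA^T + (1/W_b)ee^T)Y + (C/W_alpha)I]^{-1} (delta' + W_beta*C/W_alpha) e."""
    n = A.shape[0]
    Y = np.diag(y.astype(float))      # n x n diagonal label matrix (+1 / -1 entries)
    e = np.ones(n)
    G = Y @ (A @ A.T + np.outer(e, e) / W_b) @ Y + (C / W_alpha) * np.eye(n)
    rhs = (delta + W_beta * C / W_alpha) * e
    return np.linalg.solve(G, rhs)    # linear solve avoids forming the explicit inverse
```

Passing \( C_{\alpha } \) gives \( \theta^{\prime} \); the same expression with \( C_{{\tilde{\alpha }}} \) in its place gives the multiplier of the second sub-model derived below.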

On the other hand, if the misclassified records belong to \( I_{ - } \), the CS-MCQP model becomes (3.11):

(Model 2)

$$ \begin{array}{*{20}c} {\rm Min} & {\frac{1}{2}} \\ \end{array} \left\| x \right\|^{2} + \frac{{W_{\alpha } }}{2}\sum\limits_{i = 1}^{n} {\frac{{\left| {\tilde{\alpha }_{i} } \right|^{2} }}{{C_{{\tilde{\alpha }}} }}} - W_{\beta } \sum\limits_{i = 1}^{n} {\left| {\tilde{\beta }_{i} } \right|} + \frac{{W_{b} }}{2}b^{2} $$
(3.11)
$$ \begin{array}{*{20}c} {\rm S.t.} & {Y( < A,X > - eb)} \\ \end{array} = \delta^{\prime}e - \tilde{\alpha } + \tilde{\beta } $$

We define \( \tilde{\eta }_{i} = \tilde{\alpha }_{i} - \tilde{\beta }_{i} \) and repeat the previous process. Model 2 then becomes (3.12):

$$ \begin{array}{*{20}c} {\rm Min} & {\frac{1}{2}} \\ \end{array} \left\| x \right\|^{2} + \frac{{W_{\alpha } }}{2}\sum\limits_{i = 1}^{n} {\frac{{\left| {\tilde{\eta }_{i} } \right|^{2} }}{{C_{{\tilde{\alpha }}} }}} + W_{\beta } \sum\limits_{i = 1}^{n} {\left| {\tilde{\eta }_{i} } \right|} + \frac{{W_{b} }}{2}b^{2} $$
(3.12)
$$ \begin{array}{*{20}c} {\rm S.t.} & {Y( < A,X > - eb)} \\ \end{array} = \delta^{\prime}e - \tilde{\eta } $$

And the solution is:

$$ \theta^{\prime\prime} = [Y(AA^{T} + \frac{1}{{W_{b} }}ee^{T} )Y + \frac{{C_{{\tilde{\alpha }}} }}{{W_{\alpha } }}I]^{ - 1} (\delta^{\prime} + \frac{{W_{\beta } C_{{\tilde{\alpha }}} }}{{W_{\alpha } }})e $$

Therefore, the solution of the CS-MCQP model is \( \theta = E[\theta^{\prime},\theta^{\prime\prime}] \), where the weights of \( \theta^{\prime} \) and \( \theta^{\prime\prime} \) in this expectation are computed from their respective cost ratios.

3.2 The existence of solution and relationship between MCQP and CS-MCQP

This section proves the existence of a solution and presents the relationship between the MCQP and CS-MCQP models. The solution of CS-MCQP always exists because the matrix \( Y(AA^{T} + \frac{1}{{W_{b} }}ee^{T} )Y + \frac{{C_{\alpha } }}{{W_{\alpha } }}I \) is invertible, and CS-MCQP is equivalent to MCQP when \( C_{\alpha } = C_{{\tilde{\alpha }}} \).

Theorem 1

Let \( W_{\alpha } ,W_{\beta } ,W_{b} \) be positive real numbers and suppose that \( \alpha ,\tilde{\alpha },\beta \) are not identically zero. Then a solution of CS-MCQP exists.

Proof

Let the matrix \( H = Y[A,\, - W_{b}^{{ - \frac{1}{2}}} e] \); then \( HH^{T} = Y(AA^{T} + \frac{1}{{W_{b} }}ee^{T} )Y \). Since \( HH^{T} \) is a Gram matrix, it is positive semidefinite, and \( \frac{{C_{\alpha } }}{{W_{\alpha } }}I \) is a positive diagonal matrix, so \( Y(AA^{T} + \frac{1}{{W_{b} }}ee^{T} )Y + \frac{{C_{\alpha } }}{{W_{\alpha } }}I \) is positive definite and therefore invertible. This proves the existence of the solution for Model 1; the other models can be proved in the same way.

Theorem 2

If \( C_{\alpha } = C_{{\tilde{\alpha }}} \) in CS-MCQP, then CS-MCQP is equivalent to MCQP.

Proof

If the norms of CS-MCQP are chosen as \( s = 2 \), \( p = 2 \), \( q = 1 \) and \( C_{\alpha } = C_{{\tilde{\alpha }}} \), then the model is

$$ \begin{array}{*{20}c} {\rm Min} & {\frac{1}{2}} \\ \end{array} \left\| x \right\|^{2} + \frac{{\hat{W}_{\alpha } }}{{C_{\alpha } }}\left[\sum\limits_{i = 1}^{n} {\left| {\alpha_{i} } \right|^{2} } + \sum\limits_{i = 1}^{n} {\left| {\tilde{\alpha }_{i} } \right|^{2} } \right] - W_{\beta } \sum\limits_{i = 1}^{n} {\left| {\beta_{i} } \right|} $$

Since \( \alpha_{i} \tilde{\alpha }_{i} = 0 \), we obtain \( \left\| {\alpha + \tilde{\alpha }} \right\|^{2} = \sum\limits_{i = 1}^{n} {\left| {\alpha_{i} } \right|^{2} } + \sum\limits_{i = 1}^{n} {\left| {\tilde{\alpha }_{i} } \right|^{2} } \). Let \( W_{\alpha } = \frac{{\hat{W}_{\alpha } }}{{C_{\alpha } }} \); then the model is

$$ \begin{array}{*{20}c} {\rm Min} & {\frac{1}{2}} \\ \end{array} \left\| x \right\|^{2}\,+\,W_{\alpha } \left[\sum\limits_{i = 1}^{n} {\left| {\alpha_{i} } \right|^{2} } + \sum\limits_{i = 1}^{n} {\left| {\tilde{\alpha }_{i} } \right|^{2} } \right] - W_{\beta } \sum\limits_{i = 1}^{n} {\left| {\beta_{i} } \right|} $$

which can be transformed into the MCQP model in this case. The proof is completed.

Different costs are used as penalty coefficients in optimization-based classifiers such as SVM and MCQP. In Li et al (2008), the MCLP-based model assigns the misclassification coefficient according to the ratio of the majority class size to the minority class size. Furthermore, if \( C_{\alpha } \ne C_{{\tilde{\alpha }}} \), the solution is drawn from the sub-models (3.9) and (3.11). The cost-sensitive CS-MCQP is therefore a generalization of MCQP, and its solution can be optimized by adjusting the weights of the variables in the objective function.

4 Experiments

This section compares the CS-MCQP model with well-known classifiers and cost-sensitive classifiers using 26 imbalanced datasets from the UCI machine learning repository (Lichman, 2013). The results are verified using parametric and nonparametric statistical tests, as suggested by Beyan and Fisher (2015). The characteristics of imbalanced datasets that are suited to the proposed model are discussed at the end of this section.

The cost matrix (Table 1) shows the misclassification costs. Cost-sensitive learning aims to decrease the error rate of the minority class by assigning it higher costs. Since correctly classified observations do not change the total cost, it is common practice (He and Garcia, 2009; Thai-Nghe et al, 2010; Lomax and Vadera, 2013; Masnadi-Shirazi et al, 2015) to set the costs of correct classifications (true positives and true negatives) to zero to simplify the calculation.

Table 1 Cost matrix

The minority class is defined as the positive class, which is the class of most concern in imbalanced learning. The cost of misclassifying minority records is normally higher than that of misclassifying majority records. The cost ratio \( {\text{CR}} = C_{{\tilde{\alpha }}} /C_{\alpha } \) means that misclassifying a minority record is CR times more costly than misclassifying a majority record. If the cost ratio equals 1, i.e., \( C_{\alpha } = C_{{\tilde{\alpha }}} \), the model reduces to the MCQP model (Theorem 2).
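As a small worked example of how the entries of Table 1 enter the evaluation (the counts and cost values below are made up for illustration), the total cost of a set of predictions reduces to the two misclassification terms, since correct classifications cost zero:

```python
# Hypothetical per-record misclassification costs (Table 1 entries; correct
# classifications cost zero).  A missed minority record (false negative) is
# assumed here to be ten times as costly as a false alarm (false positive).
cost_fn, cost_fp = 10.0, 1.0
FN, FP = 3, 25                 # hypothetical confusion-matrix counts
total_cost = FN * cost_fn + FP * cost_fp
print(total_cost)              # 55.0
```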

4.1 Datasets

Table 2 describes the characteristics of the 26 imbalanced UCI datasets (López et al, 2013; Beyan and Fisher, 2015; Seiffert et al, 2010; Fernández et al, 2008, 2009) used in the experiment. The imbalance ratio (IR) is the ratio of the number of records in the majority class to that in the minority class. Since CS-MCQP is designed for binary classification, multi-class datasets were converted to binary datasets using the one-class-versus-rest and some-classes-versus-others schemes (Thai-Nghe et al, 2010). Records with missing values were removed.

Table 2 Summary of the datasets
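A minimal preprocessing sketch, under our own assumptions (the file name, column name and chosen positive label are placeholders, not taken from the paper), for turning a multi-class UCI table into a binary imbalanced problem as described above:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("glass.csv")             # hypothetical file and column names
df = df.dropna()                          # remove records with missing values

# One class versus rest: the rare class becomes the positive (minority) class.
positive_label = 7                        # placeholder choice of positive class
y = np.where(df["Type"] == positive_label, 1, -1)
A = df.drop(columns=["Type"]).to_numpy()

imbalance_ratio = (y == -1).sum() / (y == 1).sum()   # IR = majority size / minority size
print(A.shape, imbalance_ratio)
```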

4.2 Performance measures

The following performance measures are commonly used to evaluate classification algorithms. Accuracy is the percentage of correctly classified records: \( {\text{ACC}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}} + {\text{TN}}}} \). The true positive (TP) rate, true negative (TN) rate, false positive (FP) rate and false negative (FN) rate are single-class evaluation indexes: \( {\text{TP}}_{\text{rate}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} \); \( {\text{TN}}_{\text{rate}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}} \); \( {\text{FP}}_{\text{rate}} = \frac{\text{FP}}{{{\text{TN}} + {\text{FP}}}} \); \( {\text{FN}}_{\text{rate}} = \frac{\text{FN}}{{{\text{TP}} + {\text{FN}}}} \). The area under the ROC curve (AUC) shows the tradeoff between the TP rate and the FP rate: \( {\text{AUC}} = \frac{{1 + {\text{TP}}_{\text{rate}} - {\text{FP}}_{\text{rate}} }}{2} \). The geometric mean (GeoMean) evaluates the average performance on both classes: \( {\text{GeoMean}} = \sqrt {\frac{\text{TN}}{{{\text{FP}} + {\text{TN}}}} \times \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}} \).

Traditional classification metrics such as accuracy and error rate cannot provide adequate information when dealing with highly imbalanced data. For example, if a binary dataset has 1% minority observations, a classifier that classifies all observations as the majority class has 99% accuracy and a 1% error rate. Therefore, the AUC and GeoMean, which are among the most widely used performance measures in binary imbalanced classification (Ferri et al, 2009; He and Garcia, 2009; Thai-Nghe et al, 2010; Chawla et al, 2004; Sun et al, 2009), are used in the experiment.
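These measures follow directly from the confusion-matrix counts. The small sketch below (ours, not from the paper) implements them, including the single-point AUC formula given above; the toy counts reproduce the 1%-minority example just mentioned:

```python
import numpy as np

def imbalance_metrics(TP, FN, FP, TN):
    tpr = TP / (TP + FN)                  # true positive rate (sensitivity)
    tnr = TN / (TN + FP)                  # true negative rate (specificity)
    fpr = FP / (TN + FP)
    fnr = FN / (TP + FN)
    acc = (TP + TN) / (TP + FN + FP + TN)
    auc = (1 + tpr - fpr) / 2             # single-point AUC formula used in the text
    geo_mean = np.sqrt(tpr * tnr)
    return {"ACC": acc, "TPR": tpr, "TNR": tnr, "FPR": fpr, "FNR": fnr,
            "AUC": auc, "GeoMean": geo_mean}

# 1% minority, every record predicted as majority: ACC = 0.99 but AUC = 0.5, GeoMean = 0.
print(imbalance_metrics(TP=0, FN=10, FP=0, TN=990))
```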

4.3 CS-MCQP algorithm

The algorithm for the CS-MCQP model is summarized as follows (a code sketch illustrating these steps is given after the pseudocode):

Input: The dataset \( A \); cost ratio.

Output: Average classification accuracy; true positive rate; true negative rate; false positive rate; false negative rate; GeoMean; total costs; AUC.

(1) \( A \) is randomly partitioned into ten folds, \( A_{1} ,A_{2} , \ldots ,A_{10} \); //in iteration \( i \), partition \( A_{i} \) is reserved as the testing set and the remaining partitions are used as the training set.

(2) for each partition \( i \) do;//training and testing is performed ten times

(3) Compute \( \theta \) using training dataset; \( W_{\alpha } ,W_{\beta } ,W_{b} \) are chosen by cross-validation;

(4) Compute \( X = A^{T} Y\theta \) and \( b = \frac{1}{{W_{b} }}e^{T} Y\theta \); //from the gradient equations in (3.10);

(5) Use partition \( A_{i} \) for testing;

(6) end for

(7) Compute the overall numbers of correct and incorrect classifications over the ten folds;

(8) Return performance indexes.

END
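The following compact end-to-end sketch follows the pseudocode above under our own simplifying assumptions (fixed weights instead of cross-validated ones, a single sub-model multiplier rather than the expectation of \( \theta^{\prime} \) and \( \theta^{\prime\prime} \), synthetic data, and a placeholder cost value). It illustrates the data flow only and is not intended to reproduce the reported results.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def fit_cs_mcqp(A, y, C, W_alpha=0.5, W_beta=0.5, W_b=1.0, delta=1.0):
    """One sub-model of CS-MCQP in closed form (a sketch based on (3.9)-(3.10))."""
    n = A.shape[0]
    Y = np.diag(y.astype(float))
    e = np.ones(n)
    G = Y @ (A @ A.T + np.outer(e, e) / W_b) @ Y + (C / W_alpha) * np.eye(n)
    theta = np.linalg.solve(G, (delta + W_beta * C / W_alpha) * e)
    X = A.T @ Y @ theta                   # step (4) of the pseudocode
    b = e @ Y @ theta / W_b               # from the stationarity condition in (3.10)
    return X, b

def predict(A, X, b):
    # Rule (3.1): A_i X < b -> positive class (+1), otherwise negative class (-1)
    return np.where(A @ X < b, 1, -1)

# Synthetic imbalanced data, for illustration only.
rng = np.random.default_rng(0)
A_min = rng.normal(loc=2.0, size=(20, 3))      # minority / positive class
A_maj = rng.normal(loc=0.0, size=(200, 3))     # majority / negative class
A = np.vstack([A_min, A_maj])
y = np.array([1] * 20 + [-1] * 200)

C = 10.0                                       # placeholder cost parameter (grid-searched in the paper)
tp = tn = fp = fn = 0
for train, test in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(A, y):
    X, b = fit_cs_mcqp(A[train], y[train], C=C)
    pred = predict(A[test], X, b)
    tp += np.sum((pred == 1) & (y[test] == 1));  fn += np.sum((pred == -1) & (y[test] == 1))
    tn += np.sum((pred == -1) & (y[test] == -1)); fp += np.sum((pred == 1) & (y[test] == -1))

print("TPR =", tp / (tp + fn), "TNR =", tn / (tn + fp))
```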

The experiment compares the CS-MCQP model with two optimization-based classifiers (MCQP and SVM), six popular classifiers (C4.5, Naive Bayes, Logistic regression, RBF network, Multilayer Perceptron and LWL), a cost-sensitive optimization-based classifier (CS-SVM), data preprocessing techniques (SMOTE, OSS, CNN), ensembles (Adaboost M1, Bagging) and hybrid methods (SMOTE ENN and SMOTE Tomek Link) for imbalanced classification. Table 3 summarizes the classification methods used in the experiments.

Table 3 Classification methods used in experiments

Since the datasets used in the experiments do not provide cost ratios, the two cost-sensitive algorithms choose the best cost ratio from \( \{ 1,2,10,50,100\} \), which is a common practice in previous studies (Thai-Nghe et al, 2010; Weiss, 2010).

The C4.5, SVM, CS-SVM, Naïve Bayes, Logistic, RBF network, MLP and LWL were implemented using Weka (Hall et al, 2009), and the CS-MCQP was implemented in MATLAB R2014a.

5 Results

For each classifier, the tenfold cross-validation results on the AUC and GeoMean measures are summarized in Tables 4 and 5, respectively. The bold numbers denote the best result for a measure on a dataset. The following observations can be made: (1) CS-MCQP achieves the best average AUC and GeoMean among the twenty-four classifiers across the 26 imbalanced datasets; (2) the optimization-based classifiers (MCQP and SVM) have the lowest average AUC and poor GeoMean performance, and their cost-sensitive versions improve the performance dramatically; (3) data preprocessing can improve the average performance of classifiers; (4) oversampling methods such as SMOTE consistently perform better than undersampling methods such as OSS and CNN; (5) hybrid methods, which combine oversampling and undersampling, obtain better performance than a single preprocessing technique; (6) ensemble methods did not dramatically improve the base classifiers' performance compared with preprocessing and cost-sensitive methods.

Table 4 Results of the twenty-four classifiers in terms of the AUC
Table 5 Results of the twenty-four classifiers in terms of the GeoMean

5.1 Statistical test

To validate the results statistically, Student's t test and the Wilcoxon signed-rank test were conducted at the 0.05 significance level; the results are reported in the following subsections. The null hypothesis is that all classifiers included in the tests perform the same and that the observed differences among them are due to chance.

5.1.1 Student’s test

Student’s t test determines whether the means of two datasets significantly differ from each other by calculating a p value, which tells the likelihood of the two values are from the same population (Beyan and Fisher, 2015). It is used to verify whether the differences among the classifiers found in the experiments are statistically significant. Each of the classifiers obtained the best performance from each category is compared with CS-MCQP at the significance level of 0.05. For the AUC measure, SVM, MLP Adaboost M1, SMOTE and SMOTE ENN are the best algorithms from the five categories. For the GeoMean, MCQP, Naïve Bayes, Adaboost M1, SMOTE and SMOTE Tomek Link are the best algorithms from the five categories. For simplicity, we use “+” symbol presents that the algorithm in row is significantly better than another method in column, “−” means the contrary and “=” indicates two methods perform the same. The p values are in brackets.

The results in Tables 6 and 7 show that CS-MCQP performs significantly better than the other methods in terms of both AUC and GeoMean. The p value for the comparison between CS-MCQP and SMOTE ENN is 0.048, which indicates that CS-MCQP performs better than SMOTE ENN as well as the other four methods.

Table 6 Student’s t test for representative classifiers based on AUC and CS-MCQP
Table 7 Student’s t test for representative classifiers based on GeoMean and CS-MCQP

5.1.2 Wilcoxon signed-rank test

The Wilcoxon signed-rank test (Wilcoxon, 1945) is used to check whether the performance differences between two algorithms are significant. The signed ranks are calculated as \( R^{ + } = \sum\limits_{{d_{i} > 0}} {{\text{rank}}(d_{i} )} + \frac{1}{2}\sum\limits_{{d_{i} = 0}} {{\text{rank}}(d_{i} )} \) and \( R^{ - } = \sum\limits_{{d_{i} < 0}} {{\text{rank}}(d_{i} )} + \frac{1}{2}\sum\limits_{{d_{i} = 0}} {{\text{rank}}(d_{i} )} \), where \( d_{i} \) is the difference between the performance scores of the two classifiers on the i-th dataset.
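The signed ranks \( R^{ + } \) and \( R^{ - } \) defined above can be computed directly, or through scipy's implementation, whose "zsplit" option splits zero differences between the two sums as in the formula (a sketch with placeholder scores, not the paper's data):

```python
import numpy as np
from scipy import stats

score_a = np.array([0.91, 0.88, 0.95, 0.90, 0.87, 0.93])   # e.g., CS-MCQP per-dataset AUC
score_b = np.array([0.85, 0.84, 0.90, 0.90, 0.83, 0.91])   # a competing classifier

d = score_a - score_b
ranks = stats.rankdata(np.abs(d))                            # ranks of |d_i|
R_plus  = ranks[d > 0].sum() + 0.5 * ranks[d == 0].sum()
R_minus = ranks[d < 0].sum() + 0.5 * ranks[d == 0].sum()

# scipy uses min(R+, R-) as the test statistic; zero differences are split ("zsplit").
stat, p_value = stats.wilcoxon(score_a, score_b, zero_method="zsplit")
print(R_plus, R_minus, p_value)
```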

Tables 8 and 9 show that CS-MCQP performs significantly better than the other five algorithms in terms of AUC and GeoMean over the 26 imbalanced datasets; all p values of the Wilcoxon signed-rank tests are lower than 0.05.

Table 8 Wilcoxon signed-rank test for representative classifiers based on AUC and CS-MCQP
Table 9 Wilcoxon signed-rank test for representative classifiers based on GeoMean and CS-MCQP

5.2 Intrinsic characteristics of datasets suited for CS-MCQP

As López et al (2013) discussed, the intrinsic characteristics of datasets have a great influence on classification with imbalanced data. This section analyzes the effect of different data characteristics (noise, small disjuncts and overlapping) on the CS-MCQP model and summarizes the intrinsic characteristics of datasets that are appropriate for the proposed model. The t-SNE (t-distributed Stochastic Neighbor Embedding) method (Maaten and Hinton, 2008) is used to project the high-dimensional datasets onto a plane for visualization. The red and blue dots represent the positive class and the negative class, respectively.
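Visualizations in the spirit of Figures 1-15 can be produced with scikit-learn's t-SNE; the sketch below is ours, with synthetic data standing in for a loaded UCI dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Synthetic stand-in for a loaded dataset: A is the n x r feature matrix,
# y is +1 for the positive (minority) class and -1 for the negative (majority) class.
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(0.0, 1.0, (200, 8)), rng.normal(1.5, 1.0, (20, 8))])
y = np.array([-1] * 200 + [1] * 20)

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(A)

plt.scatter(embedding[y == -1, 0], embedding[y == -1, 1], c="blue", s=8, label="negative (majority)")
plt.scatter(embedding[y == 1, 0],  embedding[y == 1, 1],  c="red",  s=8, label="positive (minority)")
plt.legend()
plt.title("t-SNE projection of an imbalanced dataset")
plt.show()
```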

5.2.1 Noisy data

In imbalanced binary classification, noisy data refer to data objects of one class that appear among the data of the other class (López et al, 2013); such noisy points lie away from both the minority and the majority class data. This scenario has a greater impact on the minority class because it has fewer data objects. Six representative datasets from the experiments with this feature are presented in Figures 1, 2, 3, 4, 5 and 6.

Figure 1 Glass

Figure 2 Yeast 02579 vs 368

Figure 3 Newthyroid 1

Figure 4 Pageblocks 13 vs 4

Figure 5 Echo

Figure 6 Pageblock 2 vs 3

5.2.2 Small disjunct

Small disjuncts are small clusters representing underrepresented subconcepts (López et al, 2013) and are a major source of errors (Weiss, 2010). When they are combined with class imbalance, the situation is even worse because it is hard to tell whether a small cluster is a small disjunct or noise. Four datasets used in the experiments have this feature (Figures 7, 8, 9, 10).

Figure 7 Hepatitis

Figure 8 Zoo_3

Figure 9 Shuttle 2 vs 5

Figure 10 Ecoli 0_1_3_7 vs 2_6

The results show that the performance of all classifiers included in the experiments drops sharply on such data, and that the best classifier on small-disjunct imbalanced data is Naïve Bayes.

5.2.3 Overlapping

In this class of datasets, the borderline region contains similar amounts of instances from both classes. Figures 11, 12, 13, 14 and 15 show the datasets used in the experiments with this characteristic. The classification results presented in Tables 4 and 5 demonstrate that CS-MCQP performs better on average than the other classifiers, and indicate that the optimization-based cost-sensitive classifiers (CS-SVM and CS-MCQP) are robust on overlapping imbalanced data.

Figure 11 Yeast 0359 vs 78

Figure 12 Yeast 4

Figure 13 Ecoli 0_3_4_6 vs 5

Figure 14 Ecoli 0_1_4_6 vs 5

Figure 15 Yeast 5

In general, the CS-MCQP model can improve the accuracy of the minority class by using the cost matrix as a penalty coefficient to minimize minority-class errors, because the different costs push the boundary away from the minority class. Figure 16 illustrates this situation and captures the overlapping characteristic of data that is appropriate for the CS-MCQP model.

Figure 16 Imbalanced data suitable for the CS-MCQP model

6 Conclusions

Although the Multiple Criteria Quadratic Programming (MCQP) model has been proven to be an effective and scalable classification method, its performance degrades quickly as the imbalance ratio increases.

This paper proposed a cost-sensitive MCQP (CS-MCQP) model to improve the MCQP model on imbalanced data. The proposed model extends the MCQP model by introducing the cost of misclassifications and the cost of imbalanced data. It maximizes the external distance between groups and minimizes the internal distance, weighted by cost coefficients. In addition, in an attempt to increase classification accuracy, \( \tilde{\beta } \) and \( \tilde{\alpha } \), which measure the distance from a correctly classified and a misclassified negative (majority) class element \( A_{i} \) to \( b \), respectively, were added to the CS-MCQP model. The existence of a solution and the relationship between MCQP and CS-MCQP were also proved.

The CS-MCQP model was then compared with twenty-three popular classifiers, ensemble and data preprocessing methods using 26 public imbalanced datasets from the UCI machine learning repository. The results show that CS-MCQP achieves the best average AUC and GeoMean among the twenty-four methods. To validate the results statistically, Student's t test and the Wilcoxon signed-rank test were conducted at the 0.05 significance level. Both tests indicate that CS-MCQP performs significantly better than the other algorithms in terms of the AUC and GeoMean measures. Furthermore, we analyzed the effect of noise, small disjuncts and overlapping on the proposed model and concluded that the CS-MCQP model achieves better performance on data with the overlapping characteristic than on noisy or small-disjunct data.