Abstract
Multiple Criteria Quadratic Programming (MCQP), a mathematical programming-based classification method, has been developed recently and proved to be effective and scalable. However, its performance degrades when learning from imbalanced data. This paper proposes a cost-sensitive MCQP (CS-MCQP) model by introducing the cost of misclassifications into the MCQP model. Empirical tests were designed to compare the proposed model with MCQP and a selection of classifiers on 26 imbalanced datasets from the UCI repository. The results indicate that the CS-MCQP model not only performs better than the optimization-based models (MCQP and SVM), but also outperforms the selected classifiers, ensemble, preprocessing and hybrid methods on imbalanced datasets in terms of the AUC and GeoMean measures. To validate the results statistically, Student’s t test and the Wilcoxon signed-rank test were conducted and show that the superiority of CS-MCQP is statistically significant at the 0.05 significance level. In addition, we analyze the effect of noisy, small disjunct and overlapping data properties on the proposed model and conclude that the CS-MCQP model achieves better performance on imbalanced data with overlapping features than on noisy and small disjunct data.
1 Introduction
Classification is an important topic in various disciplines, and many classification algorithms have been developed over the years. Mathematical programming represents one of the major categories of classification methods. The most prominent mathematical programming-based classifier is the support vector machine (SVM) (Vapnik, 1982, 1995, 2000; Vapnik and Chapelle, 2000; Burges, 1998; Cristianini and Shawe-Taylor, 2000; Campadelli et al, 2005; Scholkopf and Smola, 2002). Linear programming (LP) is another well-known optimization model, with a simple strategy of obtaining the best linear separating function (Bradley et al, 1999; Garcia-Palomares and Manzanilla-Salazar, 2012; Freed and Glover, 1981).
In recent years, Multiple Criteria Linear Programming (MCLP) (Shi, 2010) and Multiple Criteria Quadratic Programming (MCQP) (Peng et al, 2008a, b), two related optimization techniques, have been proposed and studied (Li et al, 2008; Zhang et al, 2009; Kou et al, 2005, 2009, 2012). The basic idea of multiple criteria mathematical programming models is to maximize the external distances between groups and minimize the internal distances within each group. Though the MCQP model has been proved to be effective and scalable to massive problems (Li et al, 2008), its performance drops when a dataset is highly imbalanced because it is designed to maximize the total accuracy, and hence the boundary b is skewed toward the minority class.
Cost-sensitive learning is a commonly used approach to deal with the class imbalance problem (Soda, 2011; He et al, 2009; Thai-Nghe et al, 2010; Chawla et al, 2004; Sun et al, 2009). The concept of cost-sensitive classification comes from the recognition that different types of misclassification have varying costs in real-life applications (Lomax and Vadera, 2013; Ting, 2002; Masnadi-Shirazi et al, 2015; Lee et al, 2012; Yue and Cakanyildirim, 2010; Zhao et al, 2011; Wang et al, 2014a, b; Cao et al, 2013; Min et al, 2011; Tsai et al, 2009; Shi et al, 2013). For example, in medical diagnosis and fraud detection, the cost of having a disease or fraud undetected is much higher than the cost of having a false alarm.
The goal of this paper is to propose a cost-sensitive Multiple Criteria Quadratic Programming (CS-MCQP) model to improve the performance of MCQP when the class distributions are imbalanced by introducing the cost of misclassifications and the cost of imbalanced data. The CS-MCQP model also adds \( \tilde{\beta } \) and \( \tilde{\alpha } \), which, respectively, measure the distance from a correctly classified and a misclassified negative class (majority class) element \( A_{i} \) to \( b \), as an attempt to increase the overall accuracy. An experiment is designed to evaluate the proposed model by comparing it with the MCQP model and other well-known classifiers using imbalanced datasets from the UCI repository. The results show that the CS-MCQP model not only outperforms the MCQP model and the popular classifiers (i.e., C4.5, SVM, Naïve Bayes, locally weighted learning (LWL), RBF network, MLP and Logistic), but also achieves better performance than the cost-sensitive SVM, preprocessing techniques, ensemble and hybrid methods, especially on overlapping imbalanced data.
The contributions of this study to the theory and practice of OR are twofold: (1) it presents a cost-sensitive optimization classifier that deals with imbalanced classification by introducing the cost of misclassifications and the cost of imbalanced data; (2) it shows that the proposed model is significantly better than the other twenty-three classifiers in the experiments, which indicates that the CS-MCQP model provides a new tool for imbalanced classification, especially on overlapping imbalanced data.
The rest of this paper is organized as follows. Section 2 briefly discusses related work. Section 3 presents the CS-MCQP model. Section 4 describes the details of the experiments, Section 5 reports the results and Section 6 concludes the paper.
This paragraph lists the notations. \( E[\theta ,\widetilde{\theta }] \) denotes the mathematical expectation of \( \theta \) and \( \widetilde{\theta } \). A vector \( X \in R^{r} \) is an r-dimensional vector. \( e \) denotes a vector of ones and \( 0 \) denotes a vector of zeros. A transposed vector is denoted by the superscript T. The p-norm of a vector is \( \left\| X \right\|_{p} = \left( {\sum\limits_{i = 1}^{r} {\left| {x_{i} } \right|^{p} } } \right)^{{\frac{1}{p}}} \). The inner product of vectors X and Y is denoted by \( X^{T} Y = \sum\limits_{i = 1}^{r} {x_{i} y_{i} } \). A matrix \( M \in R^{n \times r} \) is a real n × r-dimensional matrix. \( M_{i} \) denotes the i-th row of M and \( M_{j} \) denotes the j-th column of M. \( M^{T} \) is the transpose of M. The identity matrix is denoted by I. The positive class is the minority class and the negative class is the majority class.
2 Related work
Classification is one of the most extensively studied areas in data mining, machine learning and artificial intelligence (Pavlidis et al, 2012; Chang, 2013; Hwang et al, 2014; Martens and Provost, 2014; Peng et al, 2008b; Wang et al, 2014; Yang and Wu, 2006; Kou et al, 2009; Barros et al, 2012; Ferri et al, 2009). Well-known classification algorithms include decision tree induction, artificial neural networks, Bayesian classification and SVM (Vapnik, 1982, 1995, 2000; Garcia-Palomares and Manzanilla-Salazar, 2012; Lomax and Vadera, 2013). While standard classification algorithms were designed without paying special attention to imbalanced class distributions, the class imbalance problem is ubiquitous in various fields. For example, unqualified products represent a small percentage of a production batch; cancerous cases constitute a small portion of medical examinations; and fraud is an uncommon phenomenon in credit card transactions. Since many classification algorithms presume balanced class distributions and often generate classification models that favor the majority class, their performance is greatly affected by imbalanced datasets. In recent years, imbalanced classification has become one of the key problems in data mining and machine learning.
2.1 Imbalanced classification
Many solutions have been proposed to deal with the class imbalance problem; they can be categorized into three major groups: preprocessing, ensemble and cost-sensitive learning (López et al, 2013). Preprocessing techniques handle the imbalanced classification problem by reducing skewed distributions using resampling methods, including undersampling, oversampling and hybrid approaches. Ensembles of classifiers are often combined with resampling techniques to target the class imbalance problem. For comprehensive reviews of classification approaches for imbalanced data, please refer to He and Garcia (2009) and López et al (2013).
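As a concrete illustration of the resampling idea, the following is a minimal random-undersampling sketch in Python; the function name and the ±1 label convention are illustrative, not from the paper:

```python
import random

def random_undersample(records, labels, seed=0):
    # Minimal random-undersampling sketch: drop majority-class
    # records until both classes have the same size. Labels are
    # assumed to be +1 (minority) and -1 (majority).
    rng = random.Random(seed)
    pos = [r for r, y in zip(records, labels) if y == 1]
    neg = [r for r, y in zip(records, labels) if y == -1]
    if len(neg) > len(pos):
        neg = rng.sample(neg, len(pos))
    data = [(r, 1) for r in pos] + [(r, -1) for r in neg]
    rng.shuffle(data)
    return [r for r, _ in data], [y for _, y in data]

# 10 minority and 90 majority records -> 10 of each after sampling
X, y = random_undersample(list(range(100)), [1] * 10 + [-1] * 90)
```

Oversampling methods such as SMOTE instead synthesize new minority records; the hybrid methods in the experiments combine both directions.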
In real-life imbalanced classification problems, the misclassification cost of the minority class is higher than that of the majority class. For example, in medical diagnosis, the cost of having a disease undetected is much higher than the cost of having a false alarm. Cost-sensitive methods assign higher costs to misclassifications of the minority class and lower costs to misclassifications of the majority class to increase the accuracy of the minority class.
The cost can be defined in many ways, such as the imbalance ratio (Li et al, 2008), misclassification costs (Thai-Nghe et al, 2010), test costs (Lomax and Vadera, 2013) or a penalty coefficient (Masnadi-Shirazi et al, 2015). There are two main categories of cost-sensitive algorithms: cost-sensitive sampling (Sun et al, 2007) and modification of standard classification algorithms (Liu and Zhou, 2006). The sampling approach resamples the original data by assigning different weights that reflect the costs of misclassification (Ting, 2002). The latter introduces costs into standard classification algorithms such as decision trees (Lomax and Vadera, 2013) and neural networks (Liu and Zhou, 2006).
2.2 Classification based on optimization model
Optimization-based classification technology has developed rapidly in data mining (Vapnik, 1982, 1995, 2000; Burges, 1998; Campadelli et al, 2005; Scholkopf and Smola, 2002; Bradley et al, 1999; Garcia-Palomares and Manzanilla-Salazar, 2012; Freed and Glover, 1981; Li et al, 2008; Zhang et al, 2009; Kou et al, 2005, 2009). Multiple Criteria Linear Programming (MCLP) and Multiple Criteria Quadratic Programming (MCQP) have been employed in credit card risk analysis, network intrusion detection and VIP e-Mail behavior analysis (Shi, 2010; Li et al, 2008; Zhang et al, 2009; Kou et al, 2005, 2009). The basic idea of multiple criteria optimization is to minimize the overlapping distances between observations and the hyperplane and to maximize the total distances from observations to the hyperplane. The Pareto solution is the classification boundary. MCQP models can obtain the separating hyperplane directly from the Karush–Kuhn–Tucker (KKT) conditions, which makes them suitable for large-scale classification (Peng et al, 2008b). However, the performance of the MCQP model drops when a dataset is imbalanced because the objectives are to maximize the sum of the distances to the hyperplane and to minimize the total errors, and thus the hyperplane will be “pushed” toward the minority class. If the imbalance ratio is very high, the MCQP model may classify all minority examples as the majority class.
Given this background, we introduce a cost matrix as a penalty coefficient into the MCQP model to keep the hyperplane away from the minority class and consequently decrease the number of misclassified minority instances and the total cost.
3 Cost-sensitive multi-criteria quadratic programming model
3.1 CS-MCQP model
The classification process of multi-criteria mathematical programming can be introduced with a matrix. We use a row vector \( A_{i} = (a_{i1} ,a_{i2} , \ldots ,a_{ir} ) \in R^{r} \) of the \( n \times r \) matrix \( A = (A_{1},\,A_{2} ,\,\ldots,\,A_{n} )^{T} \) to represent one record in a dataset, where \( n \) is the number of records and \( r \) is the number of attributes. A scalar \( b \) can be set as a boundary to separate the positive class \( I_{+} \) and the negative class \( I_{-} \). Let \( X = (x_{1} ,x_{2} , \ldots ,x_{r} )^{T} \in R^{r} \) be a vector of real numbers to be determined; the classification function can then be presented as follows:
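The classification function (3.1) and its equation form (3.2) are rendered as display equations in the original. A plausible reconstruction, consistent with the variable definitions that follow and with the sign pattern used in the proof of Proposition 1 (where a correctly classified \( A_i \in I_+ \) lies on the side \( A_iX \le b \)), is:

```latex
\begin{aligned}
&A_{i}X \le b, \quad A_{i} \in I_{+}, \\
&A_{i}X \ge b, \quad A_{i} \in I_{-},
\end{aligned} \tag{3.1}
\qquad
\begin{aligned}
&A_{i}X = b + \alpha_{i} - \beta_{i}, \quad A_{i} \in I_{+}, \\
&A_{i}X = b - \tilde{\alpha}_{i} + \tilde{\beta}_{i}, \quad A_{i} \in I_{-}.
\end{aligned} \tag{3.2}
```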
The inequalities in (3.1) can be converted to equations in (3.2) by adding the following variables. Let \( \beta_{i} \) be the distance from an element \( A_{i} \) to \( b \) when \( A_{i} \) is correctly classified as positive. Let \( \tilde{\beta }_{i} \) be the distance from \( A_{i} \) to \( b \) when \( A_{i} \) is correctly classified as negative. To deal with linearly inseparable data, we introduce \( \alpha_{i} \) and \( \tilde{\alpha }_{i} \). If an element \( A_{i} \) belongs to the negative class and is misclassified, then let \( \tilde{\alpha }_{i} \) be the distance from \( A_{i} \) to \( b \). If an element \( A_{i} \) belongs to the positive class and is misclassified, then let \( \alpha_{i} \) be the distance from \( A_{i} \) to \( b \).
Let \( \alpha_{i} = \widetilde{{\alpha_{i} }} = 0 \) for all correctly classified elements and \( \beta_{i} = \widetilde{\beta }_{i} = 0 \) for all misclassified elements; the parameters satisfy \( \alpha_{i} \tilde{\alpha }_{i} = 0,\alpha_{i} \tilde{\beta }_{i} = 0,\tilde{\alpha }_{i} \beta_{i} = 0,\beta_{i} \tilde{\beta }_{i} = 0 \). Adopting the concepts proposed in Peng et al (2008b), we introduce two adjusted hyperplanes \( b + \delta \) and \( b - \delta \), where \( \delta \) is a given scalar; equations (3.2) can then be reformulated as the medium (3.3), strong (3.4) and weak (3.5) models:
(Medium model)
(Strong model)
(Weak model)
The relationships among the three models are given in the following Proposition 1.
Proposition 1
(i) A feasible solution of the strong model is a feasible solution of the medium and weak models.
(ii) A feasible solution of the medium model is a feasible solution of the weak model.
(iii) If a point \( A_{i} \) is classified as a given class by the strong model, then it must be in the same class by the medium and weak models.
(iv) If a point \( A_{i} \) is classified as a given class by the medium model, then it must be in the same class by the weak model.
Proof
Let \( F_{S} \) be the feasible area of the strong model, \( F_{M} \) the feasible area of the medium model and \( F_{W} \) the feasible area of the weak model. Assume that \( X_{S}^{*} \) is a feasible solution of the strong model, \( X_{M}^{*} \) a feasible solution of the medium model and \( X_{W}^{*} \) a feasible solution of the weak model. Without loss of generality, we only prove the case where \( A_{i} \) is correctly classified and belongs to \( I_{ + } \); the other cases can be proven in the same way.
For \( \forall X_{S}^{*} \in F_{S} \), there exist \( \alpha_{i} \) and \( \bar{\beta }_{i} \) such that \( A_{i} X_{i}^{*} = b - \delta + \alpha_{i} - \bar{\beta }_{i} \). This is equivalent to \( A_{i} X_{i}^{*} = b + \alpha_{i} - (\bar{\beta }_{i} + \delta ) \). Let \( \beta_{i} = \bar{\beta }_{i} + \delta \); then \( A_{i} X_{i}^{*} = b + \alpha_{i} - \beta_{i} \). Thus, \( X_{S}^{*} \in F_{M} \), so \( F_{S} \subseteq F_{M} \); a similar argument shows \( F_{M} \subseteq F_{W} \). Hence \( F_{S} \subseteq F_{M} \subseteq F_{W} \), and (i) and (ii) are true. (iii) and (iv) follow from (i) and (ii). The proof is completed.
To simplify the model, we divide the parameters \( X,b,\alpha_{i} ,\tilde{\alpha }_{i} ,\beta_{i} ,\tilde{\beta }_{i} \) by \( \delta \). The equations can be rewritten as a single constraint:
where the \( n \times n \) diagonal matrix Y contains “+1” or “−1,” indicating that the corresponding record \( A_{i} \in I_{ + } \) or \( A_{i} \in I_{ - } \). The values of \( \delta^{\prime} \) are “+1,” “−1” and “0,” corresponding to the strong, medium and weak models, respectively.
Now, we introduce the costs of misclassification (\( C_{\alpha } \), \( C_{{\tilde{\alpha }}} \)). \( C_{\alpha } \) is the cost of misclassifying an element \( A_{i} \) that belongs to the positive class and \( C_{{\tilde{\alpha }}} \) is the cost of misclassifying an element \( A_{i} \) that belongs to the negative class. Thus, a cost-sensitive multi-criteria programming model can be formulated as (3.6): (CS-MCQP model)
where the weights satisfy \( W_{\alpha } + W_{\beta } = 1 \), \( W_{\alpha } > 0 \) and \( W_{\beta } > 0 \). The values of \( W_{\alpha } \) and \( W_{\beta } \) can be optimized by cross-validation or predefined by decision makers.
Different values of \( s \), \( p \) and \( q \) will lead to different forms of the CS-MCQP model. Without loss of generality, suppose \( s = 2 \), \( p = 2 \), \( q = 1 \); adding \( \frac{{W_{b} }}{2}b^{2} \) for model convexity (Fung, 2003) gives model (3.7):
We divide this model into two sub-models and each sub-model considers misclassified records in one class. The solution of each sub-model represents the margin of that class. If the misclassified records belong to \( I_{ + } \), the CS-MCQP model becomes (3.8):
(Model 1)
To simplify the computation, let \( \eta_{i} = \alpha_{i} - \beta_{i} \). According to the previous definitions, \( \eta_{i} = \alpha_{i} \) for all misclassified records and \( \eta_{i} = - \beta_{i} \) for all correctly classified records; then (3.8) becomes (3.9):
Using the Lagrangian function and the Wolfe dual theorem (Wolfe, 1961), we obtain:
and the gradient function is (3.10):
We can solve the equation set and obtain the solution:
On the other hand, if the misclassified points belong to \( I_{ - } \), the CS-MCQP model becomes (3.11):
(Model 2)
We define \( \eta_{i} = \tilde{\alpha }_{i} - \tilde{\beta }_{i} \) and repeat the previous process. Model 2 becomes (3.12):
And the solution is:
Therefore, the solution of the CS-MCQP model is \( \theta = E[\theta^{\prime},\theta^{\prime\prime}] \). The weight coefficients of \( \theta^{\prime} \) and \( \theta^{\prime\prime} \) are computed from the cost ratios of \( \theta^{\prime} \) and \( \theta^{\prime\prime} \), respectively.
3.2 The existence of solution and relationship between MCQP and CS-MCQP
This section proves the existence of a solution and presents the relationship between the MCQP and CS-MCQP models. The solution of CS-MCQP always exists since the matrix \( Y(AA^{T} + \frac{1}{{W_{b} }}ee^{T} )Y + \frac{{C_{\alpha } }}{{W_{\alpha } }}I \) is invertible. CS-MCQP is equivalent to MCQP if \( C_{\alpha } = C_{{\tilde{\alpha }}} \).
Theorem 1
Let \( W_{\alpha } ,W_{\beta } ,W_{b} \) be positive real numbers. Suppose that \( \alpha ,\tilde{\alpha },\beta \) are not identically zero. Then, a solution of CS-MCQP exists.
Proof
Let \( H = Y[A, - W_{b}^{ - 1/2} e] \) be the matrix \( A \) augmented with the column \( - W_{b}^{ - 1/2} e \) and multiplied by \( Y \); then \( HH^{T} = Y(AA^{T} + \frac{1}{{W_{b} }}ee^{T} )Y \). Since \( HH^{T} \) is a Gram matrix, it is positive semidefinite. Adding the positive diagonal term \( \frac{{C_{\alpha } }}{{W_{\alpha } }}I \) gives \( Y(AA^{T} + \frac{1}{{W_{b} }}ee^{T} )Y + \frac{{C_{\alpha } }}{{W_{\alpha } }}I \), which is positive definite and therefore invertible. The proof for Model 1 is complete; the other models can be proved similarly.
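Theorem 1 can be checked numerically. The sketch below builds the matrix of the proof from random data and verifies that it is positive definite, hence invertible; the sizes and parameter values are illustrative assumptions:

```python
import numpy as np

# Numerical check of Theorem 1 on random data: the matrix
#   M = Y (A A^T + (1/W_b) e e^T) Y + (C_alpha / W_alpha) I
# is positive definite, hence invertible.
rng = np.random.default_rng(0)
n, r = 30, 5
A = rng.standard_normal((n, r))            # n records, r attributes
Y = np.diag(rng.choice([-1.0, 1.0], n))    # diagonal label matrix
e = np.ones((n, 1))
W_b, W_alpha, C_alpha = 1.0, 0.6, 2.0

M = Y @ (A @ A.T + (e @ e.T) / W_b) @ Y + (C_alpha / W_alpha) * np.eye(n)
min_eig = np.linalg.eigvalsh(M).min()      # > 0: positive definite
M_inv = np.linalg.inv(M)                   # the inverse exists
```

Note that the Gram part alone has rank at most r + 1 < n, so the diagonal term \( \frac{C_{\alpha}}{W_{\alpha}}I \) is what guarantees invertibility.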
Theorem 2
If \( C_{\alpha } = C_{{\tilde{\alpha }}} \) in CS-MCQP, then CS-MCQP is equivalent to MCQP.
Proof
With \( s = 2 \), \( p = 2 \), \( q = 1 \) and \( C_{\alpha } = C_{{\tilde{\alpha }}} \), the model is
Since \( \alpha_{i} \tilde{\alpha }_{i} = 0 \), we can obtain \( \left\| \alpha \right\|^{2} = \sum\limits_{i = 1}^{n} {\left| {\alpha_{i} } \right|^{2} } + \sum\limits_{i = 1}^{n} {\left| {\tilde{\alpha }_{i} } \right|^{2} } \). Let \( W_{\alpha } = \frac{{\hat{W}_{\alpha } }}{{C_{\alpha } }} \); then the model is
and it can be transformed into MCQP in this case.
Different costs are used as penalty coefficients in optimization-based classification, such as SVM and MCQP. In Li et al (2008), the MCLP-based model assigns the misclassification coefficients by the ratio of the sizes of the majority and minority classes. Furthermore, if \( C_{\alpha } \ne C_{{\tilde{\alpha }}} \), the solution is drawn from sub-models (3.9) and (3.11). The cost-sensitive CS-MCQP is thus a generalization of MCQP, and the solution can be optimized by adjusting the weights of the variables in the objective function.
4 Experiments
This section compares the CS-MCQP model with some well-known classifiers and cost-sensitive classifiers using 26 imbalanced datasets from the UCI machine learning repository (Lichman, 2013). The results are then verified using parametric and nonparametric statistical tests as suggested by Beyan and Fisher (2015). The characteristics of imbalanced datasets that are suitable for the proposed model are discussed at the end of this section.
The cost matrix (Table 1) shows the misclassification costs. Cost-sensitive learning aims to decrease the error rate of the minority class by assigning higher costs to it. Since correctly classified observations do not change the total cost, it is common practice (He and Garcia, 2009; Thai-Nghe et al, 2010; Lomax and Vadera, 2013; Masnadi-Shirazi et al, 2015) to set the costs of correct classifications (true positive and true negative) to zero to simplify the calculation.
The minority class is defined as the positive class, which is the class of greatest concern in imbalanced learning. The cost of misclassifying minority records is normally higher than that of misclassifying majority records. The cost ratio (CR \( = C_{{\tilde{\alpha }}} /C_{\alpha } \)) means that misclassifying a minority record is \( C_{{\tilde{\alpha }}} /C_{\alpha } \) times more costly than misclassifying a majority record. If the cost ratio equals 1, i.e., \( C_{\alpha } = C_{{\tilde{\alpha }}} \), the model reduces to the MCQP model (Theorem 2).
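Under such a cost matrix, the total cost of a classification outcome depends only on the two error counts, since correct classifications cost zero. A minimal sketch, with illustrative argument names:

```python
def total_cost(fn, fp, c_minority, c_majority):
    # Table-1 style cost matrix: correct classifications (TP, TN)
    # cost zero, so only the two error counts contribute.
    # c_minority is the cost of missing a minority (positive)
    # record; c_majority is the cost of a false alarm.
    return fn * c_minority + fp * c_majority

# Cost ratio CR = 10: a missed minority record is ten times as
# costly as a misclassified majority record.
cost = total_cost(fn=3, fp=20, c_minority=10, c_majority=1)
# → 50
```

With CR = 10, the three missed minority records here outweigh most of the twenty false alarms, which is exactly the pressure that moves the boundary away from the minority class.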
4.1 Datasets
Table 2 describes the characteristics of the 26 imbalanced UCI datasets (López et al, 2013; Beyan and Fisher, 2015; Seiffert et al, 2010; Fernández et al, 2008; Fernández et al, 2009) used in the experiment. The imbalance ratio (IR) is the ratio of the number of records of the majority class to that of the minority class. Since the CS-MCQP is designed for binary classification, multi-class datasets were converted to binary datasets using the one-class-versus-rest and some-classes-versus-others schemes (Thai-Nghe et al, 2010). Records with missing values were removed.
4.2 Performance measures
The following performance measures are commonly used to evaluate classification algorithms. Accuracy is the percentage of correctly classified records: \( {\text{ACC}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}} + {\text{TN}}}} \). The true positive (TP-rate), true negative (TN-rate), false positive (FP-rate) and false negative (FN-rate) rates are one-class evaluation indexes: \( {\text{TP}}_{\text{rate}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}; \) \( {\text{TN}}_{\text{rate}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}}; \) \( {\text{FP}}_{\text{rate}} = \frac{\text{FP}}{{{\text{TN}} + {\text{FP}}}}; \) \( {\text{FN}}_{\text{rate}} = \frac{\text{FN}}{{{\text{TP}} + {\text{FN}}}} \). The area under the ROC curve (AUC) reflects the tradeoff between the TP-rate and FP-rate: \( {\text{AUC}} = \frac{{1 + {\text{TP}}_{\text{rate}} - {\text{FP}}_{\text{rate}} }}{2} \). The geometric mean (GeoMean) evaluates the average performance of a classifier on both classes: \( {\text{GM}} = \sqrt {{\text{TN}}_{\text{rate}} \times {\text{TP}}_{\text{rate}} } \).
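These measures can be computed directly from the confusion-matrix counts. The sketch below follows the formulas above, including the paper's single-point AUC formula; the function name is illustrative:

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    # Measures of Section 4.2 from confusion-matrix counts.
    # Note: AUC here is the single-point formula
    # (1 + TPR - FPR) / 2 used in the paper, not a full
    # ROC-curve integration.
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    fpr = fp / (tn + fp)
    return {
        "ACC": (tp + tn) / (tp + fn + fp + tn),
        "TPR": tpr,
        "TNR": tnr,
        "AUC": (1 + tpr - fpr) / 2,
        "GM": math.sqrt(tpr * tnr),
    }

m = imbalance_metrics(tp=8, fn=2, fp=10, tn=80)
# ACC = 0.88, TPR = 0.8, AUC ≈ 0.844
```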
The traditionally used classification metrics such as accuracy and error rate cannot provide adequate information when dealing with highly imbalanced data. For example, if a binary data set has 1% minority observations, a classifier that classifies all data observations as majority class will have 99% accuracy and 1% error rate. Therefore, the AUC and GeoMean, which are among the most widely used performance measures in binary imbalanced classifications (Ferri et al, 2009; He and Garcia, 2009; Thai-Nghe et al, 2010; Chawla et al, 2004; Sun et al, 2009), are used in the experiment.
4.3 CS-MCQP Algorithm
The algorithm for the CS-MCQP model is summarized as follows:
Input: The dataset \( A \); cost ratio.
Output: Average classification accuracy; true positive rate; true negative rate; false positive rate; false negative rate; GeoMean; total costs; AUC.
(1) \( A \) is randomly partitioned into ten folds, \( A_{1} ,A_{2} , \ldots ,A_{10} \); // in iteration \( i \), partition \( A_{i} \) is reserved as the testing set and the remaining partitions are used as the training set
(2) for each partition \( i \) do; // training and testing are performed ten times
(3) Compute \( \theta \) using the training set; \( W_{\alpha } ,W_{\beta } ,W_{b} \) are chosen by cross-validation;
(4) Compute \( X = A^{T} Y\theta \) and \( b = - \frac{1}{{W_{b} }}ee^{T} \theta \); // by the gradient function of (3.10)
(5) Test on partition \( A_{i} \);
(6) end for
(7) Compute the overall numbers of correct and incorrect classifications over the ten folds;
(8) Return the performance indexes.
END
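The tenfold procedure above can be sketched as a generic cross-validation harness. Here `train_fn` and `predict_fn` are placeholders for the CS-MCQP training and scoring steps, and the toy 1-D threshold classifier at the end is purely illustrative:

```python
import random

def ten_fold_indices(n, seed=0):
    # Step (1): randomly partition record indices into ten folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def cross_validate(X, y, train_fn, predict_fn):
    # Steps (2)-(7) for a classifier given as a fit/predict pair.
    tp = fn = fp = tn = 0
    for fold in ten_fold_indices(len(X)):
        test = set(fold)
        tr = [i for i in range(len(X)) if i not in test]
        model = train_fn([X[i] for i in tr], [y[i] for i in tr])
        for i in fold:
            pred = predict_fn(model, X[i])
            if y[i] == 1:
                tp, fn = (tp + 1, fn) if pred == 1 else (tp, fn + 1)
            else:
                tn, fp = (tn + 1, fp) if pred == -1 else (tn, fp + 1)
    return tp, fn, fp, tn

# Toy example: threshold at the midpoint of the two class means;
# records with value >= 90 form the minority (positive) class.
def fit(Xs, ys):
    pos = [x for x, c in zip(Xs, ys) if c == 1]
    neg = [x for x, c in zip(Xs, ys) if c == -1]
    return (sum(pos) / max(len(pos), 1) + sum(neg) / max(len(neg), 1)) / 2

tp, fn, fp, tn = cross_validate(
    list(range(100)), [-1] * 90 + [1] * 10,
    fit, lambda thr, x: 1 if x >= thr else -1)
```

Step (7) of the algorithm then turns these four pooled counts into the performance indexes of Section 4.2.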
The experiment compares the CS-MCQP model with two optimization-based classifiers (MCQP and SVM), six popular classifiers (i.e., C4.5, Naive Bayes, Logistic regression, RBF network, Multilayer Perceptron and LWL), cost-sensitive optimization-based classifier (CS-SVM), data preprocessing techniques (SMOTE, OSS, CNN), ensembles (Adaboost M1, Bagging) and hybrid methods (SMOTE ENN and SMOTE Tomek Link) for imbalanced classification. Table 3 summarizes the classification methods used in the experiments.
Since the datasets used in the experiments do not provide cost ratios, the two cost-sensitive algorithms choose the best cost ratio from \( \{ 1,2,10,50,100\} \), which is a common practice in previous studies (Thai-Nghe et al, 2010; Weiss, 2010).
The C4.5, SVM, CS-SVM, Naïve Bayes, Logistic, RBF network, MLP and LWL were implemented using Weka (Hall et al, 2009), and the CS-MCQP was implemented in MATLAB R2014a.
5 Results
For each classifier, the tenfold cross-validation results on the AUC and GeoMean measures are summarized in Tables 4 and 5, respectively. The bold numbers denote the best result for a measure on a dataset. The following observations can be made based on the results: (1) the CS-MCQP achieves the best average AUC and GeoMean among the twenty-four classifiers across the 26 imbalanced datasets; (2) the optimization-based classifiers (MCQP and SVM) have the lowest average AUC and poor GeoMean performance, and their cost-sensitive versions improve the performance dramatically; (3) data preprocessing can improve the average performance of classifiers; (4) oversampling methods such as SMOTE consistently perform better than undersampling methods such as OSS and CNN; (5) hybrid methods that combine oversampling and undersampling obtain better performance than a single preprocessing technique; (6) ensemble methods did not dramatically improve the base classifiers’ performance compared to preprocessing and cost-sensitive methods.
5.1 Statistical test
To validate the results statistically, Student’s t test and the Wilcoxon signed-rank test were conducted at the significance level of 0.05 and are reported in the following subsections. The null hypothesis is that all classifiers included in the tests perform the same and the observed differences among the classifiers are the result of chance.
5.1.1 Student’s t test
Student’s t test determines whether the means of two sets of scores differ significantly from each other by calculating a p value, which gives the likelihood that the two samples come from the same population (Beyan and Fisher, 2015). It is used to verify whether the differences among the classifiers found in the experiments are statistically significant. The classifier with the best performance in each category is compared with CS-MCQP at the significance level of 0.05. For the AUC measure, SVM, MLP, Adaboost M1, SMOTE and SMOTE ENN are the best algorithms from the five categories. For the GeoMean, MCQP, Naïve Bayes, Adaboost M1, SMOTE and SMOTE Tomek Link are the best algorithms from the five categories. For simplicity, the “+” symbol indicates that the algorithm in the row is significantly better than the method in the column, “−” means the contrary and “=” indicates that the two methods perform the same. The p values are given in brackets.
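The paired t statistic underlying this comparison can be computed as follows; the scores are illustrative, not taken from Tables 6 or 7:

```python
import math

def paired_t(scores_a, scores_b):
    # Paired Student's t statistic over per-dataset scores of two
    # classifiers (a minimal sketch; the decision compares |t|
    # with the critical value t_{0.05, n-1}).
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)

# Illustrative AUC scores of two classifiers on four datasets.
t = paired_t([0.91, 0.88, 0.95, 0.90], [0.85, 0.84, 0.89, 0.86])
# → about 8.66
```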
The results in Tables 6 and 7 show that the CS-MCQP performs significantly better than the other methods in terms of both AUC and GeoMean. The p values for the comparison between CS-MCQP and SMOTE ENN are 0.048, which indicates that CS-MCQP performs better than SMOTE ENN as well as the other four methods.
5.1.2 Wilcoxon signed-rank test
The Wilcoxon signed-rank test (Wilcoxon, 1945) is used to check whether the performance differences between two algorithms are significant. The signed-rank sums can be calculated as: \( R^{ + } = \sum\limits_{{d_{i} > 0}} {{\text{rank}}(d_{i} )} + \frac{1}{2}\sum\limits_{{d_{i} = 0}} {{\text{rank}}(d_{i} )} ; \) \( R^{ - } = \sum\limits_{{d_{i} < 0}} {{\text{rank}}(d_{i} )} + \frac{1}{2}\sum\limits_{{d_{i} = 0}} {{\text{rank}}(d_{i} )} , \) where \( d_{i} \) is the difference between the performance scores of the two classifiers on the i-th dataset.
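The sums R+ and R− can be computed directly from this definition, including the even split of zero-difference ranks; the differences below are illustrative:

```python
def signed_rank_sums(diffs):
    # R+ and R- as defined above: rank the |d_i| in ascending
    # order (average ranks for ties), sum the ranks of positive
    # and negative differences, and split zero-difference ranks
    # evenly between the two sums.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # 1-based average rank
        i = j + 1
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    zeros = sum(r for d, r in zip(diffs, ranks) if d == 0)
    return r_plus + zeros / 2, r_minus + zeros / 2

# d_i = AUC(CS-MCQP) - AUC(competitor) on five datasets
# (illustrative values, not from Table 8).
r_plus, r_minus = signed_rank_sums([0.03, 0.01, -0.02, 0.05, 0.0])
# → R+ = 11.5, R- = 3.5
```

A large imbalance between R+ and R− (here 11.5 versus 3.5) is what drives the test toward rejecting the null hypothesis.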
Tables 8 and 9 show that the CS-MCQP performs significantly better than the other five algorithms in terms of AUC and GeoMean among the 26 imbalanced datasets based on the Wilcoxon signed-rank tests with all of the p-values lower than 0.05.
5.2 Intrinsic characteristics of datasets suited for CS-MCQP
As López et al (2013) discussed, the intrinsic characteristics of datasets have a great influence on classification with imbalanced data. This section analyzes the effect of different characteristics of datasets (i.e., noisy, small disjunct and overlapping) on the CS-MCQP model and summarizes the intrinsic characteristics of datasets that are appropriate for the proposed model. The t-SNE (t-distributed Stochastic Neighbor Embedding) method (Maaten and Hinton, 2008) is used to visualize high-dimensional data by projecting the datasets onto a plane. The red and blue dots represent the positive class and negative class, respectively.
5.2.1 Noisy data
In imbalanced binary classification, noisy data refer to data objects of one class that appear among the data of the other class (López et al, 2013), lying away from the bulk of both the minority and majority class data. This scenario has a greater impact on the minority class because it has fewer data objects. Six representative datasets from the experiments with this feature are presented in Figures 1, 2, 3, 4, 5 and 6.
5.2.2 Small disjunct
Small disjuncts are small clusters of underrepresented subconcepts (López et al, 2013) and are a major source of errors (Weiss, 2010). When combined with class imbalance, the situation is even worse because it is hard to tell whether a small cluster is a small disjunct or noise. Four datasets used in the experiments have this feature (Figures 7, 8, 9, 10).
The results show that the performance of all classifiers included in the experiments drops sharply on such data, and the best classifier tested on small disjunct and imbalanced data is Naïve Bayes.
5.2.3 Overlapping
In this class of datasets, the borderline region includes similar amounts of instances from both classes. Figures 11, 12, 13, 14 and 15 show the datasets used in the experiments with this characteristic. The classification results presented in Tables 4 and 5 demonstrate that the CS-MCQP performs better on average than the other classifiers. The classification results indicate that optimization-based cost-sensitive classifiers (CS-SVM and CS-MCQP) are robust on overlapping and imbalanced data.
In general, the CS-MCQP model can improve the accuracy of the minority class by using the cost matrix as a penalty coefficient to minimize minority class errors, because different costs can push the boundary away from the minority class. Figure 16 presents this situation and captures the overlapping characteristic of data that is appropriate for the CS-MCQP model.
6 Conclusions
Though the Multiple Criteria Quadratic Programming (MCQP) model has been proved to be an effective and scalable classification method, its performance degrades quickly as the imbalance ratio increases.
This paper proposed a cost-sensitive MCQP (CS-MCQP) model to improve the MCQP model for imbalanced data. The proposed model extends the MCQP model by introducing the cost of misclassifications and the cost of imbalanced data. It maximizes the external distance between groups and minimizes the internal distance with cost coefficients. In addition, as an attempt to increase classification accuracy, \( \tilde{\beta } \) and \( \tilde{\alpha } \), which measure the distance from a correctly classified and a misclassified negative (majority) class element \( A_{i} \) to \( b \), respectively, were added to the CS-MCQP model. The existence of a solution and the relationship between MCQP and CS-MCQP were also proved.
The CS-MCQP model was then compared with twenty-four popular classifiers, ensemble and data-preprocessing methods using 26 public imbalanced datasets from the UCI machine learning repository. The results show that the CS-MCQP achieves the best average AUC and GeoMean measures among the compared methods. To validate the results statistically, Student's t test and the Wilcoxon signed-rank test were conducted at the significance level 0.05. Both tests indicate that the CS-MCQP performs significantly better than the other algorithms in terms of the AUC and GeoMean measures. Furthermore, we analyzed the effect of noisy, small disjunct and overlapping data on the proposed model and concluded that the CS-MCQP model achieves better performance on data with the overlapping property than on noisy and small disjunct data.
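As a sketch of this testing procedure, paired scores of two classifiers on the same datasets can be compared with `scipy.stats` (the AUC values below are hypothetical placeholders, not the paper's actual results):

```python
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical paired AUC scores of two classifiers on eight datasets
# (illustrative placeholders, not the paper's measured results).
auc_a = [0.91, 0.88, 0.95, 0.84, 0.90, 0.87, 0.93, 0.89]
auc_b = [0.86, 0.85, 0.90, 0.80, 0.88, 0.84, 0.90, 0.85]

t_stat, t_p = ttest_rel(auc_a, auc_b)   # paired Student's t test
w_stat, w_p = wilcoxon(auc_a, auc_b)    # Wilcoxon signed-rank test

alpha = 0.05
print(f"paired t test: p = {t_p:.4f}, significant: {t_p < alpha}")
print(f"Wilcoxon test: p = {w_p:.4f}, significant: {w_p < alpha}")
```

The two tests complement each other: the t test assumes roughly normal score differences, while the nonparametric Wilcoxon test only uses the ranks of the differences, which is why studies typically report both.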
References
Atkeson C, Moore A and Schaal S (1996). Locally weighted learning. Artificial Intelligence Review 11(1):11–73.
Barros RC, Basgalupp MP and de Carvalho ACPLF (2012). A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42(2):291–312.
Beyan C and Fisher R (2015). Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognition 48(5):1653–1672.
Bradely PS, Fayyad UM and Mangasarian OL (1999). Mathematical programming for data mining: formulations and challenges. Informs Journal On Computing 11(3): 217–238.
Breiman L (1996). Bagging predictors. Machine Learning 24(2):123–140.
Campadelli P, Casiraghi E and Valentini G (2005). Support vector machines for candidate nodules classification. Neurocomputing 68:281–288.
Cao P, Zhao DZ and Zaiane O (2013). Measure oriented cost-sensitive SVM for 3D nodule detection. In: 35th Annual International Conference of the IEEE EMBS, Osaka, Japan, pp 3–7.
Chang CT (2013). On product classification with various membership functions and binary behavior. Journal of the Operational Research Society 65(1):141–150.
Chawla NV, Bowyer KW, Hall LO and Kegelmeyer WP (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357.
Chawla NV, Japkowicz N and Kotcz A (2004). Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6(1):1–6.
Burges CJC (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2):121–167.
Cristianini N and Shawe-Taylor J (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P and Witten I (2009). The WEKA data mining software: an update. SIGKDD Explorations 11(1):10–18.
Hart PE (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory 14(3):515–516.
He H and Garcia EA (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9):1263–1284.
Hwang K, Lee K, Lee C and Park S (2014). Multi-class classification using a signomial function. Journal of the Operational Research Society 66(3):434–449.
Fernández A, García S, Jesus MJD and Herrera F (2008) A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets and Systems 159(18):2378–2398.
Fernández A, Jesus MJD and Herrera F (2009). Hierarchical fuzzy rule based classification systems with genetic rule selection for imbalanced data-sets. International Journal of Approximate Reasoning 50(3): 561–577.
Ferri C, Hernández-Orallo J and Modroiu R (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters 30(1):27–38.
Freed N and Glover F (1981). Simple but powerful goal programming models for discriminant problems. European Journal of Operational Research 7(1): 44–60.
Freund Y and Schapire RE (1996). Experiments with a new boosting algorithm. In: Thirteenth International Conference on Machine Learning, San Francisco, pp. 148–156.
Fung G (2003). Machine learning and data mining via mathematical programming-based support vector machines. PhD thesis, The University of Wisconsin-Madison.
Garcia-Palomares UM and Manzanilla-Salazar O (2012). Novel linear programming approach for building a piecewise nonlinear binary classifier with a priori accuracy. Decision Support Systems 52(3):717–728.
Kou G, Peng Y, Chen Z and Shi Y (2005). Discovering credit cardholders’ behavior by multiple criteria linear programming. Annals of Operations Research 135(1):261–274.
Kou G, Peng Y, Chen Z and Shi Y (2009). Multiple criteria mathematical programming for multi-class classification and application in network intrusion detection. Information Sciences 179(4):371–381.
Kou G, Lu Y, Peng Y and Shi Y (2012). Evaluation of classification algorithms using MCDM and rank correlation. International Journal of Information Technology and Decision Making 11(1):197–225.
Kubat M and Matwin S (1997). Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the 14th international conference on machine learning (ICML’97), pp 179–186.
Li AH, Shi Y and He J (2008). MCLP-based methods for improving ‘Bad’ catching rate in credit cardholder behavior analysis. Applied Soft Computing 8:1259–1265.
Lichman M (2013) UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Liu XY and Zhou ZH (2006). Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering 18(1):63–77.
Lomax S and Vadera S (2013). A survey of cost-sensitive decision tree induction algorithms. ACM Computing Surveys 45(2):1–35.
López V, Fernández A, García S, Palade V and Herrera F (2013). An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Information Sciences 250(1):113–141.
Maaten LVD and Hinton G (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9(11):2579–2605.
Martens D and Provost F (2014). Explaining data-driven document classifications. MIS Quarterly. 38(1):73–99.
Masnadi-Shirazi H, Vasconcelos N and Iranmehr A (2015). Cost-sensitive support vector machines. Journal of Machine Learning Research (in press). arXiv:1212.0975v2.
Min F, He HP, Qian YH and Zhu W (2011). Test-cost-sensitive attribute reduction. Information Sciences 181(22):4928–4942.
Pavlidis NG, Tasoulis DK, Adams NM, Hand DJ (2012) Adaptive consumer credit classification. Journal of the Operational Research Society 63(12):1645–1654
Peng Y, Kou G, Shi Y and Chen Z (2008a). A descriptive framework for the field of data mining and knowledge discovery. International Journal of Information Technology and Decision Making 7(4):639–682.
Peng Y, Kou G, Shi Y and Chen Z (2008b). A Multi-criteria convex quadratic programming model for credit data analysis. Decision Support Systems 44(4):1016–1030.
Scholkopf B and Smola AJ (2002). Learning with kernels. Cambridge: MIT Press.
Seiffert C, Khoshgoftaar TH, Van Hulse J and Napolitano A (2010) RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans 40(1):185–209.
Shi Y (2010) Multiple criteria optimization-based data mining methods and applications: a systematic survey. Knowledge and Information Systems 24(3):369–391.
Shi YH, Gao Y, Wang RL, Zhang Y and Wang D (2013). Transductive cost-sensitive lung cancer image classification. Applied Intelligence 38(1):16–28.
Soda P (2011) A multi-objective optimisation approach for class imbalance learning. Pattern Recognition 44(8):1801–1810.
Sun A, Lim EP and Liu Y (2009). On strategies for imbalanced text classification using SVM: a comparative study. Decision Support Systems 48(1):191–201.
Sun Y, Kamel MS, Wong AKC and Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40(12):3358–3378.
Thai-Nghe N, Gantner Z and Schmidt-Thieme L (2010). Cost-sensitive learning methods for imbalanced data. In: Proceedings of IEEE IJCNN 2010, IEEE CS, Barcelona, pp 1–8.
Ting KM (2002). An instance weighting method to induce cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering 14(3):659–665.
Tomek I (1976). Two modifications of CNN. IEEE Transactions on Systems, Man and Cybernetics 6(11):769–772.
Tsai C, Chang L and Chiang H (2009) Forecasting of ozone episode days by cost-sensitive neural network methods. Science of the Total Environment 407(6):2124–2135.
Vapnik VN (1982). Estimation of dependences based on empirical data. Springer-Verlag: New York (Russian original: Nauka, Moscow, 1979).
Vapnik VN (1995). The nature of statistical learning theory. Springer: New York.
Vapnik VN (2000). The nature of statistical learning theory, 2nd edn. Springer: New York.
Vapnik VN and Chapelle O (2000) Bounds on error expectation for support vector machines. Neural Computation. 12(9):2013–2036.
Wang G, Sun J and Ma J (2014a). Sentiment classification: the contribution of ensemble learning. Decision Support Systems 57:77–93.
Wang J, Zhao P and Steven CHH (2014b). Cost-sensitive online classification. IEEE Transactions on Knowledge and Data Engineering 26(10):2425–2438.
Weiss GM (2010) The impact of small disjuncts on classifier learning. In Stahlbock R, Crone SF, Lessmann S (eds.) Data mining: annals of information systems. Springer:Berlin, vol. 8, pp. 193–226.
Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1(6):80–83.
Wilson D (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics 2(3): 408–421.
Wolfe P (1961). A duality theorem for nonlinear programming. Quarterly Journal of Applied Mathematics 19(3):239–244.
Yang Q and Wu X (2006) 10 challenging problems in data mining research. International Journal of Information Technology and Decision Making 5(4):597–604.
Yue WT and Cakanyildirim M (2010). A cost-based analysis of intrusion detection system configuration under active or passive response. Decision Support Systems 50(1):21–31.
Zhao HM, Sinha AP and Bansal G (2011) An extended tuning method for cost-sensitive regression and forecasting. Decision Support Systems 51(3):372–383.
Zhang JL, Shi Y and Zhang P (2009). Several multi-criteria programming methods for classification. Computers & Operations Research 36(3):823–836.
Acknowledgments
This work was supported in part by grants from the National Natural Science Foundation of China #71325001.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Cite this article
Chao, X., Peng, Y. A cost-sensitive multi-criteria quadratic programming model for imbalanced data. J Oper Res Soc (2017). https://doi.org/10.1057/s41274-017-0233-4