Clustering and Weighted Scoring in Geometric Space Support Vector Machine Ensemble for Highly Imbalanced Data Classification
Abstract
Learning from imbalanced datasets is a challenging task for standard classification algorithms. In general, there are two main approaches to solving the problem of imbalanced data: algorithm-level and data-level solutions. This paper deals with the second approach. In particular, it proposes a new method for calculating the weighted score function used in the integration phase of a multiple classifier system. The presented research includes an experimental evaluation over multiple open-source, highly imbalanced datasets, comparing the proposed algorithm with three other approaches in the context of six performance measures. Comprehensive experimental results show that the proposed algorithm achieves better performance measures than the other ensemble methods for highly imbalanced datasets.
Keywords
Imbalanced data · Ensemble of classifiers · Class imbalance · Decision boundary · Scoring function
1 Introduction
The goal of supervised classification is to build a mathematical model of a real-life problem using a labeled dataset. This model is used to assign a class label to each newly recognized object, which, in general, does not belong to the training set. The individual classification model is called a base classifier. Ensemble methods are a widely used approach to improve upon the capabilities of base classifiers by building a more stable and accurate ensemble of classifiers (EoC) [23, 28]. In general, the procedure for building an EoC consists of three steps: generation, selection, and integration [18]. An imbalanced data problem occurs when the prior probabilities of the classes in a given dataset differ considerably. There are many real-life problems in which we deal with imbalanced data [11, 25], e.g., network intrusion detection [2, 14], source code fault detection [8], or, in general, fraud detection [1].
There exist two main approaches to solving the problem of imbalanced data: data-level [9, 26] and algorithm-level solutions [29]. EoC is one approach to imbalanced data classification that improves classification performance compared to single models and is highly competitive and robust to imbalanced data [10, 13, 16]. Going beyond simple voting in the EoC integration phase is one of the directions for solving the imbalanced data problem [15]. Therefore, this article concerns the calculation of a weighted scoring function to be applied in the weighted voting process.
In the process of EoC generation, we use the K-Means clustering algorithm [5] separately for each class label. The base linear classifier – a Support Vector Machine [7] – is trained on each combination of clusters (one cluster from each class). The weighted scoring function takes into account the distance of a classified object from the decision boundary and from the cluster centroids used to learn the given base classifier. Regardless of the number of learning objects in a given cluster, the cluster centroid can always be determined, so the proposed method for determining the scoring function is insensitive to the number of objects defining the cluster. As shown in the article, this makes the proposed approach useful for imbalanced data. The main contributions of this work are:
- A proposal of a new weighted scoring function that uses the location of the cluster centroids and the distance to the decision boundary.
- The proposition of an algorithm that uses clustering and the proposed function.
- A new experimental setup to compare the proposed method with other algorithms on highly imbalanced datasets.
The paper is structured as follows: Sect. 2 introduces the basic concepts of EoC and presents the proposed algorithm. In Sect. 3, the experiments that were carried out are described, while the results and discussion appear in Sect. 4. Finally, we conclude the paper in Sect. 5.
2 Clustering and Weighted Scoring
2.1 Basic Notation
In recent years, the issue of calculating the weights in the voting rule has been considered many times. The article [30] presents an approach in which the weights are combined with local confidence. A classifier trained on a subset of the training data should influence the resulting ensemble only within the region this subset spans. The problem of generalizing majority voting was studied in [12]. The authors use a probability estimate calculated as the percentage of correctly classified validation objects within geometric constraints. Functionally independent regions are considered separately. A significant improvement in classification quality was observed when using the proposed algorithm, although domain knowledge is needed to provide a proper division; the authors work with retinal images and classify over anatomical regions.
The weights of base classifiers have also been considered in the context of interval-valued fuzzy sets [6]. The upper weight of a base classifier refers to the situation in which that base classifier was correct while the other classifiers also produced the correct prediction, whereas the lower weight describes the situation in which that base classifier made errors while the other classifiers did not.
In [22], weights are determined for each class label separately over the entire validation dataset, which can improve the performance of the resulting integrated classifier.
In contrast, this article proposes an algorithm that assigns weights not to base classifiers but to recognized objects. The weight of each object depends on its location in the feature space and is therefore determined by a score function calculated in the geometric space.
2.2 Proposed Method
Fig. 1. Schema for calculating the score function for the object \(x\).
Algorithm 1 presents the pseudocode of the proposed approach to EoC with clustering and weighted scoring in the geometric space. Algorithm 1 assumes a dichotomous division of the learning set into class labels; highly imbalanced binary datasets of this type were used in the experimental research.
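Since the pseudocode itself is not reproduced here, the following minimal sketch (Python with scikit-learn, matching the experimental implementation described in Sect. 3) illustrates how the described pipeline can be realized: K-Means clustering per class, one linear SVM per pair of clusters, and weighted voting based on the geometric quantities named above. The class name `ClusteringWeightedScoring` and the concrete weight formula in `_weights` are illustrative assumptions, not the authors' reference code; the exact score function is defined by Algorithm 1.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.cluster import KMeans
from sklearn.svm import SVC


class ClusteringWeightedScoring(BaseEstimator, ClassifierMixin):
    """Sketch of a CWS-style ensemble: K-Means per class, one linear SVM per
    pair of clusters, and weighted voting with geometrically derived weights."""

    def __init__(self, n_clusters=2, random_state=None):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)  # binary problems, as assumed in the paper
        # Generation phase: cluster every class label separately with K-Means [5].
        members = {}
        for c in self.classes_:
            idx = np.flatnonzero(y == c)
            km = KMeans(n_clusters=min(self.n_clusters, len(idx)),
                        random_state=self.random_state).fit(X[idx])
            members[c] = [(idx[km.labels_ == k], km.cluster_centers_[k])
                          for k in range(km.n_clusters)]
        # One linear SVM per combination of a cluster from each class.
        self.ensemble_ = []
        for idx0, cent0 in members[self.classes_[0]]:
            for idx1, cent1 in members[self.classes_[1]]:
                train = np.concatenate([idx0, idx1])
                clf = SVC(kernel="linear", gamma="scale").fit(X[train], y[train])
                self.ensemble_.append((clf, cent0, cent1))
        return self

    def _weights(self, clf, cent0, cent1, X):
        # Geometric quantities named in Sect. 2.2: distance of the object to the
        # decision boundary and to the two centroids used to train this base
        # classifier.  The ratio below (closer clusters -> larger weight) is only
        # an assumed stand-in for the score function of Algorithm 1.
        d_boundary = np.abs(clf.decision_function(X)) / np.linalg.norm(clf.coef_)
        d_centroids = (np.linalg.norm(X - cent0, axis=1)
                       + np.linalg.norm(X - cent1, axis=1))
        return d_boundary / (d_centroids + 1e-12)

    def predict(self, X):
        X = np.asarray(X)
        votes = np.zeros((len(X), len(self.classes_)))
        for clf, cent0, cent1 in self.ensemble_:
            w = self._weights(clf, cent0, cent1, X)
            pred = np.searchsorted(self.classes_, clf.predict(X))
            votes[np.arange(len(X)), pred] += w  # weighted vote per object
        return self.classes_[votes.argmax(axis=1)]
```

With the default of two clusters per class, the pool contains four base classifiers, which corresponds to the configuration used in the experiments (Sect. 3).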
3 Experimental Set-Up
The experimental evaluation conducted to verify the method proposed in this work was based on 30 highly imbalanced datasets from the KEEL repository [3]. The datasets selected for the study are characterized by an imbalance ratio (the proportion between the minority and majority class) ranging from 1:9 up to 1:40. In addition, due to the preliminary nature of the conducted research, the pool of datasets includes only binary classification problems.
The basis of the division methodology was Stratified K-Fold Cross-Validation with \(k=5\), necessary to ensure the presence of minority class patterns in each of the analyzed training subsets. Statistical tests, both paired and rank tests, were carried out using the Wilcoxon test with the significance level \(\alpha =0.05\) [4]. The following four classification models were compared:
- (svc) Support Vector Machine – the base experimental model with scaled gamma and a linear kernel [21].
- (cws) Clustering and Weighted Scoring – EoC with the pool diversified by pairs of clusters and integrated geometrically by the rules introduced in Sect. 2.
- (cmv) Clustering and Majority Vote – EoC identical to cws but integrated using the majority vote [24].
- (csa) Clustering and Support Accumulation – EoC identical to cws and cmv but integrated using the support accumulation rule [27].
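For comparison, the two reference integration rules can be sketched as follows, reusing the pool of `(clf, cent0, cent1)` triples built by the `ClusteringWeightedScoring.fit` sketch from Sect. 2.2 (an assumed representation, not the authors' code). Support accumulation is shown here as summing Platt-scaled probability estimates [21], which requires `SVC(..., probability=True)`; the authors' exact implementation may differ.

```python
import numpy as np

def majority_vote(pool, classes, X):
    # cmv: every base classifier casts one equal vote for its predicted class.
    preds = np.stack([np.searchsorted(classes, clf.predict(X)) for clf, *_ in pool])
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=len(classes))
    return classes[counts.argmax(axis=0)]

def support_accumulation(pool, classes, X):
    # csa: sum the per-class supports (here, probability estimates) of all members.
    supports = sum(clf.predict_proba(X) for clf, *_ in pool)
    return classes[supports.argmax(axis=1)]
```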
Table 1. Results achieved by the analyzed methods for the balanced accuracy score metric. In the row of indices below each result, the numbers identify the methods (column numbers) from which the given method is statistically significantly better according to the paired Wilcoxon test; \(_{-}\) denotes no significant difference and \(_{all}\) denotes significance against all other methods.

Dataset | IR | SVC (1) | CWS (2) | CMV (3) | CSA (4)
---|---|---|---|---|---
glass-0-4-vs-5 | 1:9 | \(0.738\pm 0.160\) | \(0.938\pm 0.125 \) | \(0.696\pm 0.247\) | \(0.856\pm 0.174\) |
\(_{-}\) | \(_{3}\) | \(_{-}\) | \(_{3}\) | ||
ecoli-0-1-4-7-vs-5-6 | 1:12 | \(0.867\pm 0.076\) | \(0.839\pm 0.060\) | \(0.713\pm 0.063\) | \(0.856\pm 0.025\) |
\(_{3}\) | \(_{3}\) | \(_{-}\) | \(_{3}\) | ||
ecoli-0-6-7-vs-5 | 1:10 | \(0.890\pm 0.103\) | \(0.915\pm 0.061\) | \(0.710\pm 0.056\) | \(0.882\pm 0.108\) |
\(_{-}\) | \(_{3}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-1-vs-2-3-5 | 1:9 | \(0.880\pm 0.083\) | \(0.871\pm 0.115\) | \(0.692\pm 0.086\) | \(0.780\pm 0.142\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-3-4-6-vs-5 | 1:9 | \(0.845\pm 0.118\) | \(0.890\pm 0.085\) | \(0.720\pm 0.081\) | \(0.879\pm 0.092\) |
\(_{-}\) | \(_{3}\) | \(_{-}\) | \(_{-}\) | ||
yeast-0-3-5-9-vs-7-8 | 1:9 | \(0.537\pm 0.081\) | \(0.597\pm 0.129\) | \(0.549\pm 0.078\) | \(0.539\pm 0.080\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli4 | 1:16 | \(0.570\pm 0.140\) | \(0.753\pm 0.088\) | \(0.619\pm 0.149\) | \(0.500\pm 0.000\) |
\(_{-}\) | \(_{1, 4}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-1-4-7-vs-2-3-5-6 | 1:11 | \(0.796\pm 0.120\) | \(0.790\pm 0.098\) | \(0.666\pm 0.130\) | \(0.756\pm 0.085\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-3-4-7-vs-5-6 | 1:9 | \(0.766\pm 0.097\) | \(0.744\pm 0.114\) | \(0.829\pm 0.057\) | \(0.779\pm 0.139\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
shuttle-c2-vs-c4 | 1:20 | \(1.000\pm 0.000\) | \(1.000\pm 0.000\) | \(0.756\pm 0.026\) | \(1.000\pm 0.000\) |
\(_{3}\) | \(_{3}\) | \(_{-}\) | \(_{3}\) | ||
yeast-2-vs-8 | 1:23 | \(0.774\pm 0.120\) | \(0.774\pm 0.120\) | \(0.600\pm 0.093\) | \(0.650\pm 0.093\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-4-6-vs-5 | 1:9 | \(0.839\pm 0.118\) | \(0.867\pm 0.070\) | \(0.611\pm 0.103\) | \(0.875\pm 0.093\) |
\(_{3}\) | \(_{3}\) | \(_{-}\) | \(_{3}\) | ||
yeast-2-vs-4 | 1:9 | \(0.667\pm 0.093\) | \(0.693\pm 0.070\) | \(0.634\pm 0.068\) | \(0.627\pm 0.039\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-6-7-vs-3-5 | 1:9 | \(0.853\pm 0.077\) | \(0.835\pm 0.101\) | \(0.728\pm 0.185\) | \(0.807\pm 0.102\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-1-4-6-vs-5 | 1:13 | \(0.829\pm 0.125\) | \(0.863\pm 0.077\) | \(0.631\pm 0.143\) | \(0.829\pm 0.063\) |
\(_{-}\) | \(_{3}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-2-3-4-vs-5 | 1:9 | \(0.875\pm 0.75\) | \(0.859\pm 0.119\) | \(0.745\pm 0.100\) | \(0.804\pm 0.129\) |
\(_{3}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
glass-0-6-vs-5 | 1:11 | \(0.639\pm 0.195\) | \(0.924\pm 0.103\) | \(0.744\pm 0.186\) | \(0.766\pm 0.113\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-2-6-7-vs-3-5 | 1:9 | \(0.860\pm 0.115\) | \(0.833\pm 0.092\) | \(0.628\pm 0.079\) | \(0.785\pm 0.123\) |
\(_{3}\) | \(_{3}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-3-4-vs-5 | 1:9 | \(0.822\pm 0.123\) | \(0.886\pm 0.090\) | \(0.761\pm 0.143\) | \(0.911\pm 0.055\) |
\(_{3}\) | \(_{3}\) | \(_{-}\) | \(_{3}\) | ||
glass4 | 1:15 | \(0.554\pm 0.093\) | \(0.914\pm 0.071\) | \(0.640\pm 0.124\) | \(0.568\pm 0.139\) |
\(_{-}\) | \(_{all}\) | \(_{-}\) | \(_{-}\) | ||
glass5 | 1:23 | \(0.544\pm 0.087\) | \(0.882\pm 0.121\) | \(0.704\pm 0.188\) | \(0.737\pm 0.162\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
glass-0-1-5-vs-2 | 1:9 | \(0.500\pm 0.000\) | \(0.500\pm 0.000\) | \(0.601\pm 0.159\) | \(0.507\pm 0.086\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast-0-2-5-6-vs-3-7-8-9 | 1:9 | \(0.509\pm 0.021\) | \(0.581\pm 0.101\) | \(0.532\pm 0.066\) | \(0.504\pm 0.010\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast3 | 1:8 | \(0.630\pm 0.035\) | \(0.500\pm 0.042\) | \(0.632\pm 0.050\) | \(0.701\pm 0.045\) |
\(_{2}\) | \(_{-}\) | \(_{-}\) | \(_{all}\) | ||
ecoli-0-1-vs-5 | 1:11 | \(0.880\pm 0.093\) | \(0.932\pm 0.061\) | \(0.895\pm 0.123\) | \(0.864\pm 0.090\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
shuttle-c0-vs-c4 | 1:14 | \(1.000\pm 0.000\) | \(1.000\pm 0.000\) | \(0.785\pm 0.030\) | \(0.992\pm 0.090\) |
\(_{3}\) | \(_{3}\) | \(_{-}\) | \(_{3}\) | ||
yeast6 | 1:41 | \(0.500\pm 0.000\) | \(0.528\pm 0.055\) | \(0.500\pm 0.000\) | \(0.500\pm 0.000\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast4 | 1:28 | \(0.500\pm 0.000\) | \(0.510\pm 0.020\) | \(0.500\pm 0.000\) | \(0.500\pm 0.000\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast-0-2-5-7-9-vs-3-6-8 | 1:9 | \(0.704\pm 0.099\) | \(0.840\pm 0.158\) | \(0.669\pm 0.072\) | \(0.578\pm 0.067\) |
\(_{4}\) | \(_{4}\) | \(_{4}\) | \(_{-}\) | ||
vowel0 | 1:10 | \(0.767\pm 0.119\) | \(0.719\pm 0.129\) | \(0.786\pm 0.079\) | \(0.787\pm 0.079\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) |
To limit the number of presented tables and preserve the readability of the analysis, each EoC is constructed by dividing each class into two clusters, thus building a pool of four members. In the case of data as strongly imbalanced as those from the selected databases, often only a few (literally four or five) minority class objects remain in a single fold, so a larger number of clusters would treat almost every minority object as a separate cluster.
The whole experimental evaluation was performed in Python, using the scikit-learn API [20] to implement the cws method, and is publicly available in a git repository\(^{1}\). As metrics for the conducted analysis, due to the imbalanced nature of the classification problem, three aggregate measures (balanced accuracy score, F1-score, and G-mean) and the three base measures constituting their calculation (precision, recall, and specificity) were applied, using their implementations included in the stream-learn package [17].
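The condensed sketch below reproduces this evaluation protocol, with scikit-learn and SciPy equivalents standing in for the stream-learn [17] metric implementations used in the published experiments (an assumption made only to keep the example self-contained); the `evaluate` helper and the positive-label handling are illustrative.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import StratifiedKFold

def evaluate(model, X, y, pos_label=1, k=5, seed=0):
    """Stratified k-fold CV returning the six measures used in the paper."""
    scores = {m: [] for m in ("bac", "f1", "gmean", "precision", "recall", "specificity")}
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    neg_label = np.setdiff1d(np.unique(y), [pos_label])[0]
    for train, test in skf.split(X, y):
        y_pred = model.fit(X[train], y[train]).predict(X[test])
        y_true = y[test]
        rec = recall_score(y_true, y_pred, pos_label=pos_label, zero_division=0)
        spe = recall_score(y_true, y_pred, pos_label=neg_label, zero_division=0)
        scores["bac"].append(balanced_accuracy_score(y_true, y_pred))
        scores["f1"].append(f1_score(y_true, y_pred, pos_label=pos_label, zero_division=0))
        scores["gmean"].append(np.sqrt(rec * spe))  # G-mean = sqrt(recall * specificity)
        scores["precision"].append(precision_score(y_true, y_pred,
                                                   pos_label=pos_label, zero_division=0))
        scores["recall"].append(rec)
        scores["specificity"].append(spe)
    return {m: np.mean(v) for m, v in scores.items()}

# Paired Wilcoxon test between two methods over per-dataset scores, e.g.:
# stat, p = wilcoxon(bac_cws, bac_cmv); the difference is significant if p < 0.05.
```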
4 Experimental Evaluation
For readability of the analysis, the full results of the experiment, along with the relations between the algorithms resulting from the conducted paired tests, are presented only for the balanced accuracy score (Table 1) and recall (Table 2) metrics. The mean ranks achieved for all six metrics are summarized in Table 3.
As may be observed, for the aggregate metrics (the results are consistent for both balanced accuracy and G-mean, with F1-score showing a slightly smaller scale of differences), the use of majority voting (cmv) for an EoC diversified by clustering often leads to a deterioration of classification quality, even with respect to a single base classifier. Integration by support accumulation (csa) performs slightly better, owing to taking into account the certainty (support) of each classifier's decision, but it still ignores their areas of competence. The use of the areas of competence present in the cws method allows for a substantial improvement in classification results, often leading to a statistically significant advantage. The primary source of this advantage is a significant improvement in the recall metric.
Table 2. Results achieved by the analyzed methods for the recall metric. The indices below each result follow the same convention as in Table 1.

Dataset | IR | SVC (1) | CWS (2) | CMV (3) | CSA (4)
---|---|---|---|---|---
glass-0-4-vs-5 | 1:9 | \(0.600\pm 0.374\) | \( 1.000\pm 0.0\) | \(0.500\pm 0.447\) | \(0.800\pm 0.244\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-1-4-7-vs-5-6 | 1:12 | \(0.800\pm 0.178\) | \(0.760\pm 0.80\) | \(0.680\pm 0.097\) | \(0.800\pm 0.000\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-6-7-vs-5 | 1:10 | \(0.800\pm 0.187\) | \(0.900\pm 0.122\) | \(0.450\pm 0.100\) | \(0.800\pm 0.187\) |
\(_{-}\) | \(_{3}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-1-vs-2-3-5 | 1:9 | \(0.800\pm 0.178\) | \(0.760\pm 0.233\) | \(0.430\pm 0.218\) | \(0.620\pm 0.271\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-3-4-6-vs-5 | 1:9 | \(0.750\pm 0.273\) | \(0.850\pm 0.200\) | \(0.500\pm 0.158\) | \(0.850\pm 0.200\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast-0-3-5-9-vs-7-8 | 1:9 | \(0.080\pm 0.160\) | \(0.200\pm 0.252\) | \(0.100\pm 0.154\) | \(0.080\pm 0.160\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli4 | 1:16 | \(0.150\pm 0.300\) | \(0.550\pm 0.244\) | \(0.250\pm 0.316\) | \(0.000\pm 0.000\) |
\(_{-}\) | \(_{1, 4}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-1-4-7-vs-2-3-5-6 | 1:11 | \(0.687\pm 0.289\) | \(0.720\pm 0.231\) | \(0.540\pm 0.257\) | \(0.680\pm 0.263\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-3-4-7-vs-5-6 | 1:9 | \(0.640\pm 0.265\) | \(0.600\pm 0.219\) | \(0.840\pm 0.79\) | \(0.680\pm 0.271\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
shuttle-c2-vs-c4 | 1:20 | \(1.000\pm 0.000\) | \(1.000\pm 0.000\) | \(1.000\pm 0.000\) | \(1.000\pm 0.000\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast-2-vs-8 | 1:23 | \(0.550\pm 0.244\) | \(0.550\pm 0.244\) | \(0.200\pm 0.187\) | \(0.300\pm 0.187\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-4-6-vs-5 | 1:9 | \(0.750\pm 0.273\) | \(0.800\pm 0.187\) | \(0.300\pm 0.187\) | \(0.850\pm 0.200\) |
\(_{-}\) | \(_{3}\) | \(_{-}\) | \(_{3}\) | ||
yeast-2-vs-4 | 1:9 | \(0.335\pm 0.186\) | \(0.396\pm 0.146\) | \(0.276\pm 0.149\) | \(0.256\pm 0.83\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-6-7-vs-3-5 | 1:9 | \(0.740\pm 0.146\) | \(0.780\pm 0.203\) | \(0.490\pm 0.361\) | \(0.640\pm 0.185\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-1-4-6-vs-5 | 1:13 | \(0.700\pm 0.244\) | \(0.850\pm 0.200\) | \(0.350\pm 0.339\) | \(0.850\pm 0.200\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-2-3-4-vs-5 | 1:9 | \(0.800\pm 0.187\) | \(0.900\pm 0.122\) | \(0.600\pm 0.200\) | \(0.850\pm 0.122\) |
\(_{3}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
glass-0-6-vs-5 | 1:11 | \(0.300\pm 0.399\) | \(0.900\pm 0.200\) | \(0.600\pm 0.374\) | \(0.800\pm 0.244\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-2-6-7-vs-3-5 | 1:9 | \(0.760\pm 0.224\) | \(0.770\pm 0.203\) | \(0.310\pm 0.215\) | \(0.610\pm 0.261\) |
\(_{3}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
ecoli-0-3-4-vs-5 | 1:9 | \(0.800\pm 0.187\) | \(0.900\pm 0.122\) | \(0.650\pm 0.254\) | \(0.900\pm 0.122\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
glass4 | 1:15 | \(0.233\pm 0.200\) | \(0.933\pm 0.133\) | \(0.300\pm 0.266\) | \(0.200\pm 0.266\) |
\(_{-}\) | \(_{all}\) | \(_{-}\) | \(_{-}\) | ||
glass5 | 1:23 | \(0.200\pm 0.400\) | \(0.900\pm 0.200\) | \(0.500\pm 0.316\) | \(0.600\pm 0.374\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
glass-0-1-5-vs-2 | 1:9 | \(0.000\pm 0.000\) | \(0.000\pm 0.000\) | \(0.633\pm 0.335\) | \(0.633\pm 0.335\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast-0-2-5-6-vs-3-7-8-9 | 1:9 | \(0.031\pm 0.40\) | \(0.197\pm 0.252\) | \(0.074\pm 0.147\) | \(0.011\pm 0.21\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast3 | 1:8 | \(0.264\pm 0.69\) | \(0.190\pm 0.69\) | \(0.412\pm 0.104\) | \(0.430\pm 0.85\) |
\(_{-}\) | \(_{-}\) | \(_{1}\) | \(_{1, 2}\) | ||
ecoli-0-1-vs-5 | 1:11 | \(0.800\pm 0.187\) | \(0.900\pm 0.122\) | \(0.800\pm 0.244\) | \(0.750\pm 0.158\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
shuttle-c0-vs-c4 | 1:14 | \(1.000\pm 0.000\) | \(1.000\pm 0.000\) | \(0.661\pm 0.179\) | \(0.984\pm 0.19\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast6 | 1:41 | \(0.000\pm 0.000\) | \(0.057\pm 0.114\) | \(0.000\pm 0.000\) | \(0.000\pm 0.000\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast4 | 1:28 | \(0.000\pm 0.000\) | \(0.020\pm 0.400\) | \(0.000\pm 0.000\) | \(0.000\pm 0.000\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) | ||
yeast-0-2-5-7-9-vs-3-6-8 | 1:9 | \(0.421\pm 0.189\) | \(0.711\pm 0.310\) | \(0.351\pm 0.135\) | \(0.160\pm 0.135\) |
\(_{4}\) | \(_{4}\) | \(_{4}\) | \(_{-}\) | ||
vowel0 | 1:10 | \(0.589\pm 0.284\) | \(0.489\pm 0.288\) | \(0.611\pm 0.185\) | \(0.600\pm 0.183\) |
\(_{-}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) |
Table 3. Mean ranks achieved by the analyzed methods for all metrics included in the evaluation. The indices below each rank follow the same convention as in Table 1.

Metric | SVC (1) | CWS (2) | CMV (3) | CSA (4)
---|---|---|---|---
Balanced accuracy | 2.500 | 3.250 | 1.867 | 2.383 |
\(_{-}\) | \(_{all}\) | \(_{-}\) | \(_{-}\) | |
F1-score | 2.733 | 3.217 | 1.767 | 2.283 |
\(_{3}\) | \(_{3, 4}\) | \(_{-}\) | \(_{-}\) | |
G-mean | 2.467 | 3.283 | 1.900 | 2.350 |
\(_{-}\) | \(_{all}\) | \(_{-}\) | \(_{-}\) | |
Precision | 3.000 | 2.783 | 1.900 | 2.317 |
\(_{3, 4}\) | \(_{3}\) | \(_{-}\) | \(_{-}\) | |
Recall | 2.333 | 3.350 | 1.917 | 2.400 |
\(_{-}\) | \(_{all}\) | \(_{-}\) | \(_{-}\) | |
Specificity | 2.817 | 2.017 | 2.533 | 2.633 |
\(_{2}\) | \(_{-}\) | \(_{-}\) | \(_{-}\) |
The design of a recognition algorithm dedicated to imbalanced data is almost always based on the calibration of factors measurable by the base classification metrics. As in the case of the cws method, we try to increase recall so that, despite the inevitable reduction of precision or specificity, we still gain a statistically significant advantage in the chosen aggregate metric, selected to reflect the cost of misclassification relevant to the application. In this respect, the cws method turns out to be much better than the other methods of EoC integration with a pool diversified by clusters, and it allows a statistically significant improvement over the base method in the case of highly imbalanced data.
5 Conclusions
This paper presented a new clustering and weighted scoring algorithm dedicated to constructing EoC. We proposed that the scoring function should take into account the distance from the decision boundary of each base classifier and from the cluster centroids used to learn this classifier. In the proposed weighted scoring function, the distance to the decision boundary and the sum of the distances to the centroids have the same weight. The proposed approach is applicable to imbalanced datasets because each cluster centroid can be calculated regardless of the number of objects in the cluster.
Comprehensive experiments were carried out on thirty highly imbalanced datasets. The obtained results show that the proposed algorithm outperforms the other algorithms in the context of statistical tests and several performance measures. In particular, for the balanced accuracy and G-mean measures, the method proposed in this paper is statistically significantly better than all other methods used in the experiments.
Possible future work includes: other distance measures for calculating the distance to cluster centroids, the impact of using heterogeneous base classifiers in the proposed method of building EoC, or another score-weighting method. In particular, we suggest that the distance from the decision boundary could be scaled, or that weighting factors for the distance of the object from the decision boundary and the distances from the cluster centroids could be introduced. Additionally, weights could be assigned to cluster centroids depending on the number of objects used to determine these centroids.
Acknowledgements
This work was supported by the Polish National Science Centre under the grant No. 2017/25/B/ST6/01750 as well as by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wroclaw University of Science and Technology.
References
- 1. Abdallah, A., Maarof, M.A., Zainal, A.: Fraud detection system: a survey. J. Netw. Comput. Appl. 68, 90–113 (2016)
- 2. Abdulhammed, R., Faezipour, M., Abuzneid, A., AbuMallouh, A.: Deep and machine learning approaches for anomaly-based intrusion detection of imbalanced network traffic. IEEE Sens. Lett. 3(1), 1–4 (2018)
- 3. Alcalá-Fdez, J., et al.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Logic Soft Comput. 17, 255–287 (2011)
- 4. Alpaydin, E.: Introduction to Machine Learning. MIT Press, Cambridge (2014)
- 5. Basu, S., Banerjee, A., Mooney, R.: Semi-supervised clustering by seeding. In: Proceedings of the 19th International Conference on Machine Learning, ICML 2002. Citeseer (2002)
- 6. Burduk, R.: Classifier fusion with interval-valued weights. Pattern Recogn. Lett. 34(14), 1623–1629 (2013)
- 7. Cao, X., Wu, C., Yan, P., Li, X.: Linear SVM classification using boosting HOG features for vehicle detection in low-altitude airborne videos. In: 2011 18th IEEE International Conference on Image Processing (ICIP), pp. 2421–2424. IEEE (2011)
- 8. Choraś, M., Pawlicki, M., Kozik, R.: Recognizing faults in software related difficult data. In: Rodrigues, J.M.F., et al. (eds.) ICCS 2019. LNCS, vol. 11538, pp. 263–272. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22744-9_20
- 9. Fotouhi, S., Asadi, S., Kattan, M.W.: A comprehensive data level analysis for cancer diagnosis on imbalanced data. J. Biomed. Inform. 90, 103089 (2019)
- 10. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(4), 463–484 (2011)
- 11. Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., Bing, G.: Learning from class-imbalanced data: review of methods and applications. Expert Syst. Appl. 73, 220–239 (2017)
- 12. Hajdu, A., Hajdu, L., Jonas, A., Kovacs, L., Toman, H.: Generalizing the majority voting scheme to spatially constrained voting. IEEE Trans. Image Process. 22(11), 4182–4194 (2013)
- 13. Klikowski, J., Ksieniewicz, P., Woźniak, M.: A genetic-based ensemble learning applied to imbalanced data classification. In: Yin, H., Camacho, D., Tino, P., Tallón-Ballesteros, A.J., Menezes, R., Allmendinger, R. (eds.) IDEAL 2019. LNCS, vol. 11872, pp. 340–352. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-33617-2_35
- 14. Kozik, R., Choras, M., Keller, J.: Balanced efficient lifelong learning (B-ELLA) for cyber attack detection. J. UCS 25(1), 2–15 (2019)
- 15. Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Prog. Artif. Intell. 5(4), 221–232 (2016). https://doi.org/10.1007/s13748-016-0094-0
- 16. Krawczyk, B., Woźniak, M., Schaefer, G.: Cost-sensitive decision tree ensembles for effective imbalanced classification. Appl. Soft Comput. 14, 554–562 (2014)
- 17. Ksieniewicz, P., Zyblewski, P.: Stream-learn – open-source Python library for difficult data stream batch analysis. arXiv preprint arXiv:2001.11077 (2020)
- 18. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, Hoboken (2004)
- 19. Mao, S., Jiao, L., Xiong, L., Gou, S., Chen, B., Yeung, S.K.: Weighted classifier ensemble based on quadratic form. Pattern Recogn. 48(5), 1688–1706 (2015)
- 20. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
- 21. Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers, pp. 61–74. MIT Press (1999)
- 22. Rahman, A.F.R., Alam, H., Fairhurst, M.C.: Multiple classifier combination for character recognition: revisiting the majority voting system and its variations. In: Lopresti, D., Hu, J., Kashi, R. (eds.) DAS 2002. LNCS, vol. 2423, pp. 167–178. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45869-7_21
- 23. Rokach, L.: Pattern Classification Using Ensemble Methods, vol. 75. World Scientific, Singapore (2010)
- 24. Ruta, D., Gabrys, B.: Classifier selection for majority voting. Inf. Fusion 6(1), 63–81 (2005)
- 25. Sun, Y., Wong, A.K., Kamel, M.S.: Classification of imbalanced data: a review. Int. J. Pattern Recognit. Artif. Intell. 23(04), 687–719 (2009)
- 26. Szeszko, P., Topczewska, M.: Empirical assessment of performance measures for preprocessing moments in imbalanced data classification problem. In: Saeed, K., Homenda, W. (eds.) CISIM 2016. LNCS, vol. 9842, pp. 183–194. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45378-1_17
- 27. Wozniak, M.: Hybrid Classifiers: Methods of Data, Knowledge, and Classifier Combination, vol. 519. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40997-4
- 28. Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17 (2014)
- 29. Zhang, C., et al.: Multi-imbalance: an open-source software for multi-class imbalance learning. Knowl.-Based Syst. 174, 137–143 (2019)
- 30. Sultan Zia, M., Hussain, M., Arfan Jaffar, M.: A novel spontaneous facial expression recognition using dynamically weighted majority voting based ensemble classifier. Multimedia Tools Appl. 77(19), 25537–25567 (2018). https://doi.org/10.1007/s11042-018-5806-y