
1 Introduction

The goal of supervised learning is to build a data model capable of mapping inputs x to outputs y with good generalization ability, given a labeled set of input-output pairs \(\mathcal{D}=\{(x_i, y_i)\}_{i=1}^N\), where \(\mathcal{D}\) is the training set and N is the number of training examples. Usually, each of the training inputs \(x_i\) is a d-dimensional vector of numerical and nominal values, the so-called features that characterize a given example, but \(x_i\) might as well be a complex structured object such as an image, a time series or an email message. Similarly, the type of the output variable can in principle be anything, but in most cases it is either continuous, \(y_i \in \mathbb{R}\), or nominal, \(y_i \in \mathbb{C}\), where, for an m-class problem, \(\mathbb{C}=\{c_1,...,c_m\}\). In the former case we speak of a regression problem, in the latter of a classification problem [10, 22]. Classification problems are very common in real-world scenarios, and machine learning is widely used to solve them in areas such as fraud detection [6, 24], image recognition [17, 26], cancer treatment [3] or classification of DNA microarrays [19].

In many cases, classification tasks involve more than two classes, forming so-called multi-class problems. This characteristic often imposes difficulties on the machine learning algorithm, as some solutions were designed strictly for binary problems and may not be applicable to such scenarios. What is more, problems where multiple classes are present are often characterized by greater complexity than binary tasks, as the decision boundaries between classes tend to overlap, which can lead a given classifier to build a poor-quality model. Usually, it is simply easier to build a model that distinguishes between only two classes than to consider a multi-class problem. One approach to overcoming these challenges is to use binarization strategies that reduce the task to multiple binary classification subproblems - in theory, of lower complexity - that can be solved separately by dedicated models, the so-called base learners [2, 11, 13, 14]. The most commonly used binarization strategies are One-Vs-All (OVA) [25], One-Vs-One (OVO) [12, 16] and Error-Correcting Output Codes (ECOC) [9], the last of which is a general framework for the binary decomposition of multi-class problems.

In this paper, we focus on the performance of the aforementioned binarization strategies in the context of multi-class imbalanced problems. We aim to determine whether there are statistically significant differences among the performances of these methods, provided that the most suitable aggregation scheme is used for a given problem, and if so, whether those differences can be nullified by improving the quality of the base learners within each binarization method with sampling algorithms. The main contributions of this work are:

  • an exhaustive experimental study on the classification of multi-class imbalanced data with the use of OVA, OVO and ECOC binarization strategies.

  • a comparative study of the aforementioned approaches with regard to a number of base classifiers and aggregation schemes for each of them.

  • a study on the performance of the binarization strategies with the sampling algorithms used to boost the quality of their base classifiers.

The rest of this paper is organized as follows. In Sect. 2, an overview of the binarization strategies used in the experiments is given. In Sect. 3, the experimental framework set-up is presented, including the classification and sampling algorithms, performance measures and datasets used in the study. The empirical analysis of the obtained results is carried out in Sect. 4. In Sect. 5, we make our concluding remarks.

2 Decomposition Strategies for Multi-classification

The underlying idea behind binarization strategies is to tackle multi-class problems with binary classifiers using a divide-and-conquer strategy [13]. A transformation like this is often performed with the expectation that the resulting binary subproblems will have lower complexity than the original multi-class problem. One of the drawbacks of such an approach is the necessity to combine the individual responses of the base learners into the final output of the decision system. What is more, building a dedicated model for each of the binary subproblems significantly increases the cost of building a decision system in comparison to tackling the same problem with a single classifier. However, the magnitude of this problem varies greatly depending on the chosen binarization strategy as well as the number of classes under consideration and the size of the training set itself. In this study, we focus on the most common binarization strategies: OVA, OVO, and ECOC.

2.1 One-Vs-All Strategy

The OVA binarization strategy divides an m-class problem into m binary problems. In this strategy, m binary classifiers are trained, each responsible for distinguishing instances of a given class from the instances of all other classes. During the validation phase, the test pattern is presented to each of the binary models, and the model that gives a positive output indicates the output class of the decision system. This approach can potentially result in ambiguously labeled regions of the input space, so some tie-breaking technique is usually required [13, 22].

While relatively simple, the OVA binarization strategy is often preferred over more complex methods, provided that the best available binary classifiers are used as the base learners [25]. However, in this strategy, the whole training set is used to train each of the base learners, which dramatically increases the cost of building a decision system with respect to a single multi-class classifier. Another issue is that each of the binary subproblems is likely to suffer from the aforementioned class imbalance problem [13, 22].
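
The study does not spell out the implementation of the decomposition itself; the following is a minimal sketch of the OVA scheme using scikit-learn's OneVsRestClassifier, with a placeholder dataset and base learner rather than the actual experimental set-up.

```python
# A minimal OVA sketch; the dataset and base learner are illustrative
# placeholders, not the exact configuration used in this study.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# One binary model per class; ties are resolved by the highest decision value.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ova.fit(X_train, y_train)
print(len(ova.estimators_))        # m binary classifiers
print(ova.score(X_test, y_test))
```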

2.2 One-Vs-One Strategy

The OVO binarization strategy divides an m-class problem into \(\frac{m\times (m-1)}{2}\) binary problems. In this strategy, each binary classifier is responsible for distinguishing instances of a different pair of classes \((c_i, c_j)\). The training set for each of the binary classifiers consists only of instances of the two classes forming a given pair, while the instances of the remaining classes are discarded. During the validation phase, the test pattern is presented to each of the binary models. The output of a model, given by \(r_{ij} \in [0,1]\), is the confidence of the binary classifier discriminating classes i and j in favour of the former class. If the classifier does not provide it, the confidence for the latter class is computed as \(r_{ji}=1-r_{ij}\) [12, 13, 22, 29]. The class with the highest aggregated confidence is considered the output class of the decision system. Similarly to the OVA strategy, this approach can also result in ambiguities [22].

Although the number of base learners in this strategy is of order \(m^2\), the growth in the number of learning tasks is compensated by the reduction of the learning set for each of the individual problems, as demonstrated in [12]. One also has to keep in mind that in this method each of the base classifiers is trained using only the instances of two classes, which renders its output meaningless for instances of the remaining classes. Usually, the assumption is that the base learner will make a correct prediction within its domain of expertise [13].
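
As an illustration of the aggregation of pairwise confidences described above (the weighted voting scheme listed in Sect. 2.4), the sketch below fills the confidence matrix \(r_{ij}\), completes it with \(r_{ji}=1-r_{ij}\) and sums the confidences per class; the confidence values are made up for the example.

```python
# Illustrative weighted-voting aggregation over a pairwise confidence matrix.
# The r_ij values below are made up; in practice they come from the
# m*(m-1)/2 binary classifiers, each scoring a single test pattern.
import numpy as np

m = 3
R = np.zeros((m, m))
# r_ij = confidence of the classifier for pair (i, j) in favour of class i.
R[0, 1], R[0, 2], R[1, 2] = 0.8, 0.6, 0.3
# Complete the lower triangle with r_ji = 1 - r_ij.
upper = np.triu_indices(m, k=1)
R[(upper[1], upper[0])] = 1.0 - R[upper]

scores = R.sum(axis=1)             # weighted voting: sum of confidences per class
print(scores, scores.argmax())     # predicted class index
```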

2.3 Error-Correcting Output Codes Strategy

The ECOC binarization strategy is a general framework for the binary decomposition of multi-class problems. In this strategy, each class is assigned a unique binary string of length n, called a code word. Next, n binary classifiers are trained, one for each bit in the string. During the training phase, for an example from class i, the desired output of a given classifier is specified by the corresponding bit in the code word for this class. This process can be visualized by an \(m\times n\) binary code matrix. As an example, Table 1 shows a 15-bit error-correcting output code for a five-class problem, constructed using the exhaustive technique [9]. During the validation phase, the test pattern is presented to each of the binary models, and a binary code word is formed from their responses. The class whose code word is nearest, according to the Hamming distance, to the code word formed from the base learners' responses indicates the output class of the decision system.

Table 1. A 15-bit error-correcting output code for a five class problem.

In contrast to the OVA and OVO strategies, the ECOC method does not have a predefined number of binary models used to solve a given multi-class problem. This number is determined purely by the algorithm one chooses to generate the ECOC code matrix. A measure of the quality of an error-correcting code is the minimum Hamming distance between any pair of code words. If the minimum Hamming distance is l, then the code can correct at least \(\lfloor \frac{l-1}{2}\rfloor \) single-bit errors.
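
The decoding step can be illustrated with a short sketch: given a code matrix and the vector of binary responses for a test pattern, the class whose code word minimizes the Hamming distance is returned. The code matrix below is a small made-up example, not the 15-bit code of Table 1.

```python
# Hamming-distance decoding for ECOC with a made-up code matrix.
import numpy as np

code_matrix = np.array([            # rows: classes, columns: binary learners
    [0, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
])
responses = np.array([0, 1, 0, 1, 1, 1, 1])   # code word formed by the base learners

hamming = (code_matrix != responses).sum(axis=1)
print(hamming, hamming.argmin())   # distances per class and the predicted class
```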

2.4 Aggregation Schemes for Binarization Techniques

For the binarization techniques mentioned above, an aggregation method is necessary to combine the responses of the ensemble of base learners. In the case of the ECOC binarization strategy, this aggregation method is embedded in it. An exhaustive comparative study of various aggregation methods for both the OVA and OVO binarization strategies has been carried out in [13]. For our experimental study, the implementations of the following methods for the OVA and OVO decomposition schemes have been used:

  • OVA
    1. Maximum Confidence Strategy;
    2. Dynamically Ordered One-Vs-All.
  • OVO
    1. Voting Strategy;
    2. Weighted Voting Strategy;
    3. Learning Valued Preference for Classification;
    4. Decision Directed Acyclic Graph.

For the ECOC strategy, exhaustive codes were used to generate the code matrix if the number of classes m in the problem under consideration satisfied \(3 \le m \le 7\). Otherwise, random codes were used, as implemented in [23].
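
For reference, a sketch of how an exhaustive code matrix can be generated for small m - every non-trivial class dichotomy becomes one column, giving \(2^{m-1}-1\) columns, as in [9]. The helper function below is ours, for illustration only.

```python
# Sketch of an exhaustive ECOC code matrix for small m: one column per
# non-trivial class dichotomy, 2**(m-1) - 1 columns in total.
from itertools import product
import numpy as np

def exhaustive_code_matrix(m):
    columns = []
    for bits in product([0, 1], repeat=m):
        # Keep one representative of each complementary pair and skip the
        # constant column by fixing the first bit to 1 and excluding all-ones.
        if bits[0] == 1 and not all(b == 1 for b in bits):
            columns.append(bits)
    return np.array(columns).T      # shape (m, 2**(m-1) - 1)

M = exhaustive_code_matrix(5)
print(M.shape)                      # (5, 15), matching the 15-bit code in Table 1
```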

3 Experimental Framework

In this section, the set-up of the experimental framework used for the study is presented. The classification and sampling algorithms used to carry out the experiments are described in Sect. 3.1. Next, the performance measure used to evaluate the built models is presented in Sect. 3.2. Section 3.3 covers the statistical tests used to compare the obtained results. Finally, Sect. 3.4 describes the benchmark datasets used in the experiments.

3.1 Classification Algorithms Used for the Study

One of the goals of the empirical study was to ensure the diversity of the classifiers used as base learners for the binarization strategies. A brief description of the algorithms used is given in the remainder of this section.

  • Naïve Bayes [22] is a simple model that assumes the features are conditionally independent given the class label. In practice, even when this assumption does not hold, it often performs fairly well.

  • k-Nearest Neighbors (k-NN) [22] is a non-parametric classifier that uses a chosen distance metric to find the k points in the training set that are nearest to the test input x and returns the most common class among those points as the estimate.

  • Classification and Regression Tree (CART) [22] models are defined by recursively partitioning the input space and defining a local model in each resulting region.

  • Support Vector Machines (SVM) [27] map the original input space into a high-dimensional feature space via the so-called kernel trick. In the new feature space, the optimal separating hyperplane with maximal margin is determined in order to minimize an upper bound on the expected risk instead of the empirical risk.

  • Logistic Regression [22] generalizes linear regression to (binary) classification, yielding the so-called Binomial Logistic Regression. A further generalization to Multi-Class Logistic Regression is often achieved via the OVA approach.

During the building phase, for each of the aforementioned base classifiers, an exhaustive search over specified hyperparameter values was performed in an attempt to build the best possible data model for a given problem - the hyperparameter values used in the experiments are shown in Table 2. Furthermore, various sampling methods were used to boost the performance of the base learners, namely SMOTE [7], Borderline SMOTE [15], SMOTEENN [4] and SMOTETomek [5]. All of the experiments were conducted using the Python programming language and libraries from the SciPy ecosystem (statistical tests and data manipulation) as well as scikit-learn (classifier implementations and feature engineering) and imbalanced-learn (sampling algorithm implementations) [18, 23, 28].
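
The exact training pipeline is not reproduced here; the sketch below merely illustrates how a base learner could be combined with internal SMOTE sampling and tuned by an exhaustive grid search inside an OVO decomposition, using scikit-learn and imbalanced-learn. The parameter grid is a placeholder and does not correspond to Table 2.

```python
# Illustrative sketch only: SMOTE + base learner wrapped in an OVO scheme,
# tuned by exhaustive grid search; the parameter grid is a placeholder.
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

base = Pipeline([
    ("smote", SMOTE(random_state=42)),   # internal sampling of each binary subproblem
    ("svm", SVC()),
])
ovo = OneVsOneClassifier(GridSearchCV(
    base,
    param_grid={"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.1]},
    scoring=make_scorer(geometric_mean_score),
    cv=3,
))
# ovo.fit(X_train, y_train); ovo.predict(X_test)   # data not shown here
```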

Table 2. Hyperparameter specification for the base learners used in the experiments.

3.2 Performance Measures

Model evaluation is a crucial part of an experimental study, even more so when dealing with imbalanced problems. In the presence of imbalance, evaluation metrics that focus on overall performance, such as overall accuracy, tend to ignore minority classes, because as a group they do not contribute much to the general performance of the system. To our knowledge, there is at the moment no consensus as to which metric should be used in imbalanced data scenarios, although several solutions have been suggested [20, 21]. Our goal was to pick a robust metric that ensures reliable evaluation of the decision system in the presence of strong class imbalance and at the same time is capable of handling multi-class problems. The Geometric Mean Score (G-Mean) is a proven metric that meets both of these conditions - it focuses only on the recall of each class and aggregates the recalls multiplicatively across all classes:

$$\begin{aligned} \text {G-Mean} = \left( \prod _{i=1}^{m} r_i\right) ^{1/m}, \end{aligned}$$
(1)

where \(r_i\) represents the recall for the \(i\)-th class and m represents the number of classes.
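
As a concrete check of Eq. (1), the per-class recalls can be obtained with scikit-learn and aggregated multiplicatively; a short sketch with made-up labels:

```python
# G-Mean from per-class recalls, matching Eq. (1); labels are made up.
import numpy as np
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0, 2]

recalls = recall_score(y_true, y_pred, average=None)   # one recall per class
g_mean = np.prod(recalls) ** (1.0 / len(recalls))
print(recalls, g_mean)
```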

3.3 Statistical Tests

Non-parametric tests were used to provide statistical support for the analysis of the results, as suggested in [8]. Specifically, the Wilcoxon Signed-Ranks Test was applied as a non-parametric statistical procedure for pairwise comparisons. Furthermore, the Friedman Test was used to check for statistically significant differences between all of the binarization strategies, while the Nemenyi Test was used for post-hoc comparisons and to obtain and visualize critical differences between models. A fixed significance level of \(\alpha = 0.05\) was used for all comparisons.
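
A sketch of how the global and pairwise comparisons could be run with SciPy is given below; the per-dataset G-Mean vectors are made-up placeholders, and the Nemenyi post-hoc test is not part of SciPy (it is typically taken from a separate package such as scikit-posthocs).

```python
# Sketch of the statistical comparisons with SciPy; the per-dataset G-Mean
# scores below are made-up placeholders, one value per benchmark dataset.
from scipy.stats import friedmanchisquare, wilcoxon

ova  = [0.61, 0.72, 0.55, 0.80, 0.67]
ovo  = [0.65, 0.75, 0.58, 0.82, 0.70]
ecoc = [0.60, 0.70, 0.52, 0.79, 0.66]

stat, p = friedmanchisquare(ova, ovo, ecoc)   # global comparison of the three strategies
print(p < 0.05)

stat, p = wilcoxon(ova, ovo)                  # pairwise comparison, e.g. pure vs. sampled variant
print(p < 0.05)
```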

3.4 Datasets

The benchmark datasets used to conduct the research were obtained from the KEEL dataset repository [1]. The set of benchmark datasets was specially selected to ensure the robustness of the study and includes data with a varying number of instances, number and type of attributes, and class imbalance ratio. The characteristics of the datasets used in the experiments are shown in Table 3 - for each dataset, it includes the number of instances (#Inst.), the number of attributes (#Atts.), the number of real, integer and nominal attributes (respectively #Real., #Int., and #Nom.), the number of classes (#Cl.) and the distribution of classes (#Dc.). All numerical features were normalized, and categorical attributes were encoded using so-called one-hot encoding.
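
The preprocessing step can be expressed, for instance, as a ColumnTransformer; the sketch below assumes min-max scaling for the normalization (the paper only states that numerical features were normalized) and uses made-up column indices.

```python
# Illustrative preprocessing sketch: min-max normalization of numerical
# features and one-hot encoding of nominal ones. Column indices are made up.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_cols = [0, 1, 2]            # placeholder indices of real/integer attributes
nominal_cols = [3, 4]               # placeholder indices of nominal attributes

preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
])
# X_prep = preprocess.fit_transform(X)   # X not shown here
```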

4 Experimental Study

In this section, the results of the experimental study are presented. Table 4 shows the results for the best variant of each binarization strategy on the benchmark datasets without internal sampling. As we can see, in this case the OVO strategy outperformed the other two methods. The Friedman Test returned a p-value of 0.008, pointing to a statistically significant difference between the results of these methods. However, the Nemenyi Test revealed a statistically significant difference only between the OVO and ECOC methods. The results obtained for each binarization strategy and the critical differences for the post-hoc tests are visualized in Fig. 1 and Fig. 2, respectively.

Table 3. Summary description of the datasets.
Table 4. G-mean results for tested binarization strategies without sampling.

Table 5 shows the results for the binarization strategies after enhancing the performance of the base learners with sampling methods. Although the results are visibly better than those obtained using the pure binarization schemes, the hierarchy seems to be preserved, with OVO outperforming the other two techniques. This is confirmed by the Friedman Test returning a p-value of 0.006, pointing to a statistically significant difference, and by the Nemenyi Test revealing a statistically significant difference only between the OVO and ECOC strategies. These results seem to be consistent with the study carried out in [11], which points out that the OVO approach confronts a smaller subset of instances and is therefore less likely to obtain highly imbalanced training sets during binarization. The results obtained for each binarization strategy with the use of internal sampling algorithms and the critical differences for the post-hoc tests are visualized in Fig. 3 and Fig. 4, respectively.

The Wilcoxon Signed-Ranks Test was performed to determine whether there is a statistically significant difference between the pure variant of each strategy and its variant enhanced with sampling algorithms. As shown in Table 6, in every case, the use of sampling algorithms to internally enhance the base models significantly improved the overall performance of the binarization strategy.

Table 5. G-mean results for tested binarization strategies with sampling.
Fig. 1. G-mean results for tested binarization strategies without sampling.
Fig. 2. Critical differences for Nemenyi Test for tested binarization strategies without sampling.
Fig. 3. G-mean results for tested binarization strategies with sampling.
Fig. 4. Critical differences for Nemenyi Test for tested binarization strategies with sampling.

Table 6. Wilcoxon Signed-Ranks Test comparing binarization strategy variants with and without internal sampling. \(R^+\) corresponds to the sum of the ranks for the pure binarization strategy and \(R^-\) for the variant with sampling.

5 Concluding Remarks

In this paper, we carried out an extensive comparative experimental study of the One-Vs-All, One-Vs-One, and Error-Correcting Output Codes binarization strategies in the context of imbalanced multi-class classification problems. We have shown that one can reliably boost the performance of all of the binarization schemes with relatively simple sampling algorithms, which was confirmed by a thorough statistical analysis. Another conclusion from this work is that data preprocessing methods are able to partially mitigate the quality differences among the strategies; however, a statistically significant difference among the obtained results persists, and OVO binarization appears to be the most robust of the three - this conclusion confirms the results of previous studies carried out in this field.