Abstract
Many real classification problems are characterized by a strong disturbance of the prior probability, which leads most classification algorithms to favor the majority classes. The action most often taken to deal with this problem is oversampling of the minority class, e.g., with the smote algorithm. The following work proposes instead to modify the support-domain decision boundary of an individual binary classifier, similarly to the fusion of classifier ensembles performed by the Fuzzy Templates method, in order to deal with imbalanced data classification without introducing any repeated or artificial patterns into the training set. The proposed solution has been tested in computer experiments, whose results show its potential in imbalanced data classification.
1 Introduction
The basic and most important element of any artificial intelligence application is the decision module, most often a trained pattern recognition model [4]. The development of such a solution requires an algorithm capable of building knowledge from the specific type of training data available.
In the case where the training samples are only a set of unlabeled patterns, for example when gathering groups of objects by cluster analysis, we are dealing with the problem of unsupervised learning. In most situations, however, we are not interested in identifying groups in the data set. The goal is rather to assign new, previously unseen objects to classes that are already known, whose properties can be learned from existing patterns. This type of learning is called supervised learning, and this specific task is classification [17].
In real classification problems, it is relatively rare for each class of a training set to be represented evenly. A significant disturbance in the proportions between classes is widely studied in the literature under the name of imbalanced data classification [5, 7].
Solutions for such problems are usually divided into three groups [9]. The first consists of built-in methods that modify the algorithm's principles or its decision process to take into consideration the disturbed prior probability of the problem [19, 24]. The second group, which is also the most popular in the literature and in applications, is based on data preprocessing aiming to balance the class counts in the training set. The most common solutions of this type are under- [20] and oversampling [18], together with methods for generating synthetic patterns such as smote [6, 21, 22] or adasyn [1, 8, 25]. The third group consists of hybrid methods [23], mainly drawing on the achievements of ensemble learning, using a pool of diversified base classifiers [12, 13] and a properly constructed, imbalance-aware decision principle [10, 11].
The following work proposes a practical method from the built-in group of solutions, modifying the support-domain decision boundary of a fuzzy classifier. It does so using knowledge acquired from the support vectors obtained on the training set by an already built model, similarly to the propositions of Fuzzy Templates [15, 16]. The second section describes how to adapt them to work with a single classification model and how to modify this approach into the proposed Standard Decision Boundary algorithm. The third section contains the design of the computer experiments, which are carried out and summarized in the fourth section, and the fifth one focuses on the overall conclusions drawn from the research.
2 Methods
The construction of a classification method is most often considered in the feature space of a problem, in which the decision boundary of the classifier is drawn. However, the boundary may also be modified in the space of supports obtained by the model, which is the subject of the method proposed in this article.
Regular Decision Boundary (rdb). The fitting algorithm of every fuzzy classifier does not provide only a bare prediction, but also calculates the complementary (adding up to one) probabilities of belonging to each of the problem classes, which constitute the support vector of a predicted sample [3]. The classifier's decision, in the most popular approach, is taken in favor of the class for which the highest support was obtained [14].
Restricting the classification problem to binary tasks, one may express such a decision rule by the most straightforward equation of a straight line:

\(y = x,\)

where the x-axis represents support for the negative class and the y-axis support for the positive class. In the following paper this rule will be referred to as the Regular Decision Boundary (rdb), and it is illustrated in Fig. 1.
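This rule can be illustrated with a minimal sketch, assuming scikit-learn's GaussianNB as the fuzzy base classifier (the model used in the experiments of Sect. 3); the toy data and variable names are illustrative only:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Toy binary problem: 40 negative (class 0) and 40 positive (class 1) patterns.
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

clf = GaussianNB().fit(X, y)
supports = clf.predict_proba(X)  # columns: [negative, positive], rows sum to 1

# Regular Decision Boundary: predict positive iff the positive support
# exceeds the negative support, i.e. the point lies above the line y = x.
rdb_pred = (supports[:, 1] > supports[:, 0]).astype(int)

# This coincides with the classifier's default argmax decision.
assert np.array_equal(rdb_pred, clf.predict(X))
```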
Fuzzy Templates Decision Boundary (ftdb). A commonly observed phenomenon occurring in classification models built on an imbalanced training set is a general tendency to favor the majority class [5]. The support obtained for this class receives a particular bonus, caused directly by its increased prior probability.
One of the possible counteractions to this phenomenon may be the modification of the decision rule in the support domain. Solutions of this type are quite common in the construction of fusers for the needs of classifier ensembles [15]. One such approach is the proposition of Fuzzy Templates, introducing the Decision Profile, i.e., the matrix of support vectors obtained for all patterns from the training set by each classifier from the available pool [16]. To produce a prediction, the algorithm determines class centroids of the obtained supports, and the final decision is based on the Nearest Mean principle.
In the case of a single fuzzy classifier, in contrast to the ensemble-based Decision Profiles, each of the complementary support vectors obtained for the training set must, by definition, lie on a diagonal of the support space perpendicular to the Regular Decision Boundary. An attempt to employ the Fuzzy Templates approach in a single classification model may be described by the equation of a straight line parallel to the Regular Decision Boundary, but passing through the point determined by the mean support values calculated separately for the patterns of both training set classes:

\(y = x + \mu _2 - \mu _1,\)
where \(\mu _1\) and \(\mu _2\) are the mean supports of the two classes. For the purpose of the following paper this rule will be referred to as the Fuzzy Templates Decision Boundary (ftdb), and its example is illustrated in Fig. 2a.
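A sketch of this shifted boundary, under the assumption that \(\mu _1\) and \(\mu _2\) are the mean supports obtained for the patterns of the respective classes (a hypothetical reading of the statistics; the toy data and scikit-learn's GaussianNB are illustrative only):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Imbalanced toy problem: 90 negative (majority) vs 10 positive patterns.
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(1.5, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

clf = GaussianNB().fit(X, y)
s = clf.predict_proba(X)  # s[:, 0] negative support, s[:, 1] positive support

# Class-wise mean supports on the training set: mean negative support over
# negative patterns, mean positive support over positive patterns.
mu_1 = s[y == 0, 0].mean()
mu_2 = s[y == 1, 1].mean()

# FTDB: a line parallel to y = x, shifted to pass through (mu_1, mu_2),
# so predict positive iff s_pos > s_neg + (mu_2 - mu_1).
ftdb_pred = (s[:, 1] > s[:, 0] + (mu_2 - mu_1)).astype(int)
```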
Standard Decision Boundary (sdb). The Fuzzy Templates method is an additional, simple classifier, supplementing any fuzzy classification algorithm with a model learned from its answers. It is based on the calculation of a basic statistical measure (the mean value) and its inclusion in the final prediction of the hierarchical ensemble. The following work proposes an enhancement of this approach, including into the decision process also basic knowledge about the distribution of supports obtained by the base classifier, by using the standard deviation measure.
This approach still assumes that the decision boundary goes through the point of mean supports, but its gradient is additionally modified. The gradient depends directly on the ratio between the standard deviations, so the boundary also passes through the point designated as the difference between the expected values of the distribution and the standard deviations vector. The proposed decision boundary may be represented by the equation:

\(y = \mu _2 + \frac{\sigma _2}{\sigma _1}(x - \mu _1),\)
where \(\sigma _1\) and \(\sigma _2\) are the standard deviations of the supports of both classes. Due to the employment of both statistical measures calculated for the needs of standard normalization, this rule will be referred to as the Standard Decision Boundary (sdb), and its example is illustrated in Fig. 2b.
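The modified gradient can be sketched analogously to the ftdb case, again under the assumption that the statistics are computed class-wise on the training-set supports (a hypothetical reading; toy data for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Imbalanced toy problem: 90 negative (majority) vs 10 positive patterns.
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(1.5, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

clf = GaussianNB().fit(X, y)
s = clf.predict_proba(X)  # s[:, 0] negative support, s[:, 1] positive support

# Standardization statistics of the supports, computed per class.
mu_1, sigma_1 = s[y == 0, 0].mean(), s[y == 0, 0].std()
mu_2, sigma_2 = s[y == 1, 1].mean(), s[y == 1, 1].std()

# SDB: a line through (mu_1, mu_2) with gradient sigma_2 / sigma_1,
# so predict positive iff s_pos > mu_2 + (sigma_2 / sigma_1) * (s_neg - mu_1).
sdb_pred = (s[:, 1] > mu_2 + (sigma_2 / sigma_1) * (s[:, 0] - mu_1)).astype(int)
```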
Supposition. Intuition suggests that the changes in the prediction method implemented by both the ftdb and sdb models should increase the precision of the obtained decisions, although the linear nature of the used decision boundary, in the presence of such a tendency, must simultaneously lead to a worsening of the recall metric. Using aggregate information about class distributions in the decision rule, while ignoring the prior probabilities of the training set, may result in an increase in the overall quality of predictions on imbalanced data. Thus, if the proposed method obtains significantly better results in aggregate metrics, such as the F1-score, balanced accuracy score or geometric mean score, it will be considered promising.
3 Design of Experiments
Datasets. The problems considered in the experiments are directly expressed by the selection of datasets meeting specific conditions. For the purposes of the conducted experimental evaluation, it was decided to use data with a high degree of imbalance, exceeding the 1:9 ratio, and with relatively low dimensionality (up to 20 features). An appropriate collection is contained in the keel data repository [2]. A summary of the datasets selected for testing, supplemented with information on the imbalance ratio and the number of features and patterns, is presented in Table 1.
Compared Approaches. The basis of the considerations taken in this work are the differences between approaches to drawing a decision boundary in the support space and the effectiveness of this type of solution in imbalanced data classification problems. For the purposes of evaluation, the three methods presented in Sect. 2 have been supplemented with a preprocessing method, being a state-of-the-art solution for this type of problem. Due to the very large imbalance ratio, it is often impossible to apply the smote algorithm (with default parameterization it requires at least 5 minority class examples in the learning set); therefore random oversampling was chosen. The full list of compared algorithms is as follows:
1. RDB—Regular Decision Boundary used in the Gaussian Naive Bayes classifier,
2. ROS-RDB—Regular Decision Boundary used in the Gaussian Naive Bayes classifier trained on datasets with a randomly oversampled minority class,
3. FTDB—Fuzzy Templates Decision Boundary used in the Gaussian Naive Bayes classifier,
4. SDB—Standard Decision Boundary used in the Gaussian Naive Bayes classifier.
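The random oversampling used for the ros-rdb variant can be sketched as follows; the helper function and toy data are illustrative and not the implementation from the paper's repository:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly drawn minority-class patterns until both classes
    have the same count as the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=deficit, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# Toy 1:9 imbalanced set: 9 majority patterns, 1 minority pattern.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 9 + [1])
Xb, yb = random_oversample(X, y)
# After oversampling, both classes count 9 patterns.
assert (yb == 0).sum() == (yb == 1).sum() == 9
```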
Evaluation Methodology and Metrics Used. During the experimental evaluation, stratified 5-fold cross-validation was used, with an additional ten-fold replication of the results for the non-deterministic ros-rdb algorithm. Both the paired tests between the quality of classifiers on individual datasets and the ranking tests, used for an overall assessment of the relations between them, were carried out using the Wilcoxon test at a 5% significance level. Due to the imbalanced nature of the considered problems, it was decided to assess the quality of solutions using the precision and recall metrics, supplemented with the aggregate F1-score, balanced accuracy score and geometric mean score metrics. The full source code of the performed tests, along with the method implementations and a full report of results, is located in a publicly available Git repositoryFootnote 1.
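The evaluation protocol can be sketched as follows, assuming scikit-learn's cross-validation and metric utilities and a synthetic dataset standing in for the keel collection; the geometric mean score is computed manually as the geometric mean of per-class recalls:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, balanced_accuracy_score, recall_score

# Synthetic imbalanced problem (approx. 9:1) standing in for a KEEL dataset.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

scores = {"f1": [], "bac": [], "gmean": []}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in skf.split(X, y):
    y_pred = GaussianNB().fit(X[train], y[train]).predict(X[test])
    scores["f1"].append(f1_score(y[test], y_pred))
    scores["bac"].append(balanced_accuracy_score(y[test], y_pred))
    # Geometric mean of the per-class recalls.
    recalls = recall_score(y[test], y_pred, average=None)
    scores["gmean"].append(np.sqrt(recalls.prod()))

means = {k: np.mean(v) for k, v in scores.items()}
```

The per-fold scores collected here are the kind of input that the Wilcoxon paired and ranking tests would then compare across methods.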
4 Experimental Evaluation
4.1 Results
Scores and Paired Tests. Table 2 contains the results achieved by each of the considered algorithms for the aggregate F1-score metric. The ros-rdb method, being a typical approach to dealing with imbalanced data using a single model, performs the worst in the pool: it not only does not improve on the rdb results, but also often leads to statistically significantly worse ones. The ftdb method, although only sporadically, leads to a significant improvement over rdb, never achieving results significantly inferior to it. Definitely the best in this comparison is the sdb method proposed in this paper, which in eleven cases is statistically significantly better than each of the other methods, and in fourteen cases better than rdb.
For both the precision metric and the other aggregate measures (balanced accuracy score and geometric mean score), the observations are identical to those drawn from the F1-score, so the relevant result tables are not attached directly to the article, while remaining publicly available in the repository indicated in the previous section.
Aggregate metrics such as the F1-score allow drawing some binding conclusions, but do not give a full picture. As expected, with the recall metric (Table 3), the ftdb and sdb algorithms show some deterioration relative to both the base method and the ros-rdb approach. This difference is statistically significant, however, only once for ftdb and twice for sdb.
Rank Tests. The final comparison of the considered solutions was carried out by ranking tests, included in Table 4. The ros-rdb method obtains a small but statistically significant advantage in the ranking over all other methods for the recall metric, but in all other measures it stands out very negatively, which suggests its overall uselessness in the considered task of highly imbalanced data classification. If the goal of counteracting the tendency to favor the majority class in prediction (stated as the basic problem in the classification of imbalanced data) is to equalize the impact of both classes, then, on the example of the considered datasets, the ros method must be rejected, because it leads to the reverse tendency. In the case of precision and each of the aggregate metrics, the same statistically significant relation is observed: the rdb method is better than ros-rdb, the ftdb method is better than both rdb methods, and the sdb method proposed in this paper is significantly better than all the competitors in the considered pool of solutions.
5 Conclusions
The following paper, considering the binary classification of imbalanced data, proposed the application of the Fuzzy Templates method to the construction of a support-domain decision boundary for a single model, in order to balance the impact of classes of different counts on the prediction of the decision system. The proposal was further developed to use both standard normalization measures, introducing the Standard Decision Boundary method. Both solutions were tested in computer experiments on a highly imbalanced dataset collection and compared to both the base method and a state-of-the-art preprocessing method.
Both proposed solutions seem to improve the quality of imbalanced data classification in relation to the regular support-domain decision boundary without, in contrast to oversampling, leading to an overweighting of the prediction towards the minority class. The modification of Fuzzy Templates in the form of the Standard Decision Boundary is also more effective than the simple use of a class support prototype and may be considered a recommendable solution to the problem of binary classification of imbalanced data. Due to the promising results achieved for individual models, future work will attempt to generalize the sdb method to classifier ensembles.
References
Aditsania, A., Adiwijaya, Saonard, A.L.: Handling imbalanced data in churn prediction using ADASYN and backpropagation algorithm. In: Proceeding - 2017 3rd International Conference on Science in Information Technology: Theory and Application of IT for Education, Industry and Society in Big Data Era, ICSITech 2017 (2017)
Alcalá-Fdez, J., et al.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Logic Soft Comput. 17, 255–287 (2011)
del Amo, A., Montero, J., Cutello, V.: On the principles of fuzzy classification. In: Annual Conference of the North American Fuzzy Information Processing Society - NAFIPS (1999)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F.: Learning from Imbalanced Data Sets. Springer, Cham (2018)
Fernández, A., García, S., Herrera, F., Chawla, N.V.: SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018)
Ganganwar, V.: An overview of classification algorithms for imbalanced datasets. Int. J. Emerg. Technol. Adv. Eng. 2(4), 42–47 (2012)
He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the International Joint Conference on Neural Networks (2008)
Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Prog. Artif. Intell. 5(4), 221–232 (2016)
Ksieniewicz, P.: Undersampled majority class ensemble for highly imbalanced binary classification. In: Second International Workshop on Learning with Imbalanced Domains: Theory and Applications, pp. 82–94 (2018)
Ksieniewicz, P.: Combining Random Subspace approach with smote oversampling for imbalanced data classification. In: Pérez García, H., Sánchez González, L., Castejón Limas, M., Quintián Pardo, H., Corchado Rodríguez, E. (eds.) HAIS 2019. LNCS (LNAI), vol. 11734, pp. 660–673. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29859-3_56
Ksieniewicz, P., Woźniak, M.: Imbalanced data classification based on feature selection techniques. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A.J. (eds.) IDEAL 2018. LNCS, vol. 11315, pp. 296–303. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03496-2_33
Ksieniewicz, P., Wozniak, M., Torgo, L., Krawczyk, B., Branco, P., Moniz, N.: Dealing with the task of imbalanced, multidimensional data classification using ensembles of exposers. In: Proceedings of Machine Learning Research (2017)
Kuncheva, L.: Fuzzy Classifier Design, vol. 49. Springer, Heidelberg (2000). https://doi.org/10.1007/978-3-7908-1850-5
Kuncheva, L.I., Bezdek, J.C., Duin, R.P.: Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recogn. 34(2), 299–314 (2001)
Kuncheva, L.I., Bezdek, J.C., Sutton, M.A.: On combining multiple classifiers by fuzzy templates. In: Annual Conference of the North American Fuzzy Information Processing Society - NAFIPS (1998)
Mitchell, T.M.: The Discipline of Machine Learning. Machine Learning (2006)
Moreo, A., Esuli, A., Sebastiani, F.: Distributional random oversampling for imbalanced text classification. In: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (2016)
Ohsaki, M., Wang, P., Matsuda, K., Katagiri, S., Watanabe, H., Ralescu, A.: Confusion-matrix-based kernel logistic regression for imbalanced data classification. IEEE Trans. Knowl. Data Eng. 29(9), 1806–1819 (2017)
Prusa, J., Khoshgoftaar, T.M., Dittman, D.J., Napolitano, A.: Using random undersampling to alleviate class imbalance on tweet sentiment data. In: Proceedings - 2015 IEEE 16th International Conference on Information Reuse and Integration, IRI 2015 (2015)
Rodriguez-Torres, F., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F.: Deterministic oversampling methods based on SMOTE. J. Intell. Fuzzy Syst. 36(5), 4945–4955 (2019)
Wang, Q., Luo, Z.H., Huang, J.C., Feng, Y.H., Liu, Z.: A novel ensemble method for imbalanced data learning: bagging of extrapolation-SMOTE SVM. Comput. Intell. Neurosci. (2017)
Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17 (2014)
Xu, Y., Yang, Z., Zhang, Y., Pan, X., Wang, L.: A maximum margin and minimum volume hyper-spheres machine with pinball loss for imbalanced data classification. Knowl.-Based Syst. 95, 75–85 (2016)
Zhang, Y.: Deep generative model for multi-class imbalanced learning. ProQuest Dissertations and Theses (2018)
Acknowledgements
This work was supported by the Polish National Science Centre under the grant No. 2017/27/B/ST6/01325 as well as by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wroclaw University of Science and Technology.
© 2020 Springer Nature Switzerland AG
Ksieniewicz, P. (2020). Standard Decision Boundary in a Support-Domain of Fuzzy Classifier Prediction for the Task of Imbalanced Data Classification. In: Krzhizhanovskaya, V.V., et al. Computational Science – ICCS 2020. ICCS 2020. Lecture Notes in Computer Science(), vol 12140. Springer, Cham. https://doi.org/10.1007/978-3-030-50423-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-50422-9
Online ISBN: 978-3-030-50423-6