
1 Introduction

The basis and most important element of any artificial intelligence application is the decision module, most often a trained pattern recognition model [4]. The development of such a solution requires an algorithm capable of building knowledge from the specific type of training data available.

When the training samples are only a set of unlabeled patterns, used for example to gather groups of objects through cluster analysis, we are dealing with the problem of unsupervised learning. In most situations, however, we are not interested in identifying groups in the data set. The goal is rather to assign new objects, seen for the first time, to classes whose properties can be learned from existing, labeled patterns. This type of learning is called supervised learning, and this specific task is classification [17].

In real classification problems, it is relatively rare for each class of a training set to be represented evenly. A significant disturbance in the proportions between classes is widely studied in the literature under the name of imbalanced data classification [5, 7].

Solutions for such problems are usually divided into three groups [9]. The first are built-in methods that try to modify the algorithm's principles or its decision process to take into consideration the disturbed prior probability of the problem [19, 24]. The second group, which is also the most popular in literature and applications, is based on data preprocessing aiming to balance the class counts in the training set. The most common solutions of this type are under- [20] and oversampling [18], together with methods for generating synthetic patterns such as smote [6, 21, 22] or adasyn [1, 8, 25]. The third group consists of hybrid methods [23], mainly building on the achievements of ensemble learning, using a pool of diversified base classifiers [12, 13] and a properly constructed, imbalance-aware decision principle [10, 11].

The following work proposes a practical method from the built-in group of solutions, modifying the support-domain decision boundary of a fuzzy classifier. It uses the knowledge acquired from the support vectors obtained on the training set by an already built model, similarly to the proposition of Fuzzy Templates [15, 16]. The second section describes how to adapt them to work with a single classification model and how to modify this approach into the proposed Standard Decision Boundary algorithm. The third section contains the design of the computer experiments, carried out and summarized in the fourth section, and the fifth one focuses on the overall conclusions drawn from the research.

2 Methods

The construction of a classification method is most often considered in the feature space of a problem, in which the decision boundary of the classifier is drawn. However, the boundary may also be modified in the space of supports obtained by the model, which is the subject of the method proposed in this article.

Regular Decision Boundary (rdb). Every fuzzy classifier provides not only a bare prediction but also the complementary (adding up to one) probabilities of belonging to each of the problem classes, which constitute the support vector of a predicted sample [3]. In the most popular approach, the classifier's decision is taken in favor of the class for which the highest support was obtained [14].

Restricting the classification problem to binary tasks, one may express such a decision rule by the most straightforward equation of a straight line:

$$\begin{aligned} y = x, \end{aligned}$$
(1)

where the x-axis represents support for the negative class and the y-axis support for the positive class. In the following paper, this rule will be referred to as the Regular Decision Boundary (rdb), and it is illustrated in Fig. 1.

Fig. 1. Illustration of a Regular Decision Boundary (rdb).
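For intuition, this rule can be stated as a one-line check on the support vector. The snippet below is a minimal sketch (not code from the paper): since binary supports are complementary, choosing the class with the highest support is equivalent to testing whether a sample lies above the line y = x.

```python
import numpy as np

def rdb_predict(supports):
    """supports: array of shape (n_samples, 2) whose columns hold the
    negative-class (x) and positive-class (y) supports."""
    supports = np.asarray(supports)
    # y > x is equivalent to argmax over a complementary support vector
    return (supports[:, 1] > supports[:, 0]).astype(int)

print(rdb_predict([[0.3, 0.7], [0.8, 0.2]]))  # [1 0]
```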

Fuzzy Templates Decision Boundary (ftdb). A commonly observed phenomenon in classification models built on an imbalanced training set is the general tendency to favor the majority class [5]. The support obtained for it receives a particular bonus, caused directly by the increased prior probability.

One of the possible counteractions to this phenomenon may be the modification of the decision rule in the support domain. Solutions of this type are quite common in the construction of fusers for the needs of classifier ensembles [15]. One such approach is the proposition of Fuzzy Templates, introducing the Decision Profile, a matrix of support vectors obtained for all patterns from the training set by each classifier from the available pool [16]. To produce a prediction, the algorithm determines class centroids of the obtained supports, and the final decision is based on the Nearest Mean principle.

In the case of a single fuzzy classifier, in contrast to the ensemble products of Decision Profiles, each of the complementary support vectors obtained for the training set must, by definition, lie on the diagonal of the support space perpendicular to the Regular Decision Boundary. An attempt to employ the Fuzzy Templates approach in a single classification model may therefore be described by the equation of a straight line parallel to the Regular Decision Boundary, but passing through a point determined by the mean support values calculated separately for the patterns of both training set classes:

$$\begin{aligned} y = x+\mu _2-\mu _1, \end{aligned}$$
(2)

where \(\mu _1\) and \(\mu _2\) are the mean supports of the respective classes. For the purpose of the following paper, this rule will be referred to as the Fuzzy Templates Decision Boundary (ftdb), and its example is illustrated in Fig. 2a.
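A minimal sketch of this rule follows. Since the paper does not fix the exact estimator, the sketch assumes that \(\mu _1\) is the mean negative-class support computed over negative training patterns and \(\mu _2\) the mean positive-class support over positive ones; the helper names are illustrative only.

```python
import numpy as np

def ftdb_fit(train_supports, y_train):
    """Learn the boundary offset mu2 - mu1 from training supports."""
    s, y = np.asarray(train_supports), np.asarray(y_train)
    mu1 = s[y == 0, 0].mean()  # mean support of the negative class
    mu2 = s[y == 1, 1].mean()  # mean support of the positive class
    return mu2 - mu1

def ftdb_predict(supports, offset):
    s = np.asarray(supports)
    # positive class when y > x + (mu2 - mu1), cf. Eq. (2)
    return (s[:, 1] > s[:, 0] + offset).astype(int)
```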

Fig. 2. Illustration of trainable decision boundaries: (a) Fuzzy Templates Decision Boundary (ftdb), (b) Standard Decision Boundary (sdb).

Standard Decision Boundary (sdb). The Fuzzy Templates method is in effect an additional, simple classifier, supplementing any fuzzy classification algorithm with a model learned from its answers. It is based on the calculation of a basic statistical measure (the mean value) and its inclusion in the final prediction of the hierarchical ensemble. The following work proposes an enhancement of this approach by also including in the decision process basic knowledge about the distribution of supports obtained by the base classifier, using the standard deviation measure.

This approach still assumes that the decision boundary passes through the point of mean supports, but its slope is additionally modified. The slope depends directly on the ratio between the standard deviations, so the boundary also passes through the point obtained by subtracting the vector of standard deviations from the vector of the distributions' expected values. The proposed decision boundary may be expressed by the following equation:

$$\begin{aligned} y = \frac{\sigma _2 (x-\mu _1)}{\sigma _1} +\mu _2, \end{aligned}$$
(3)

where \(\sigma _1\) and \(\sigma _2\) are the standard deviations of the supports of both classes. Due to the employment of both statistical measures calculated for the needs of standard normalization, this rule will be referred to as the Standard Decision Boundary (sdb), and its example is illustrated in Fig. 2b.
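The sketch below extends the ftdb helper to Eq. (3) under the same labeling assumptions; again, these names and estimator choices are illustrative rather than the paper's reference implementation.

```python
import numpy as np

def sdb_fit(train_supports, y_train):
    """Estimate (mu1, mu2, sigma1, sigma2) from training supports."""
    s, y = np.asarray(train_supports), np.asarray(y_train)
    neg = s[y == 0, 0]  # negative-class supports of negative patterns
    pos = s[y == 1, 1]  # positive-class supports of positive patterns
    return neg.mean(), pos.mean(), neg.std(), pos.std()

def sdb_predict(supports, mu1, mu2, sigma1, sigma2):
    s = np.asarray(supports)
    # positive class when y > sigma2 * (x - mu1) / sigma1 + mu2, cf. Eq. (3);
    # a degenerate sigma1 == 0 would need special handling in practice
    return (s[:, 1] > sigma2 * (s[:, 0] - mu1) / sigma1 + mu2).astype(int)
```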

Supposition. Intuition suggests that the changes in the prediction method implemented by both the ftdb and sdb models should increase the precision of the obtained decisions, although the linear nature of the used decision boundary, in the presence of such a tendency, must simultaneously lead to a deterioration of the recall metric. Using aggregate information about class distributions in the decision rule, while ignoring the prior probabilities of the training set, may result in an increase in the overall quality of predictions on imbalanced data. Therefore, if the proposed method obtains significantly better results in aggregate metrics, such as the F1-score, balanced accuracy score, or geometric mean score, it will be considered promising.

3 Design of Experiments

Datasets. The problems considered in the experiments are directly expressed by the selection of datasets meeting specific conditions. For the purposes of the conducted experimental evaluation, it was decided to use data with a high degree of imbalance, exceeding a 1:9 ratio, and relatively low dimensionality (up to 20 features). An appropriate collection is contained in the keel data repository [2]. A summary of the datasets selected for testing, supplemented with information on the imbalance ratio and the number of features and patterns, is presented in Table 1.

Table 1. Overview of imbalanced classification datasets selected for experimental evaluation.

Compared Approaches. The basis of the considerations taken up in this work are the differences between approaches to drawing a decision boundary in the support space and the effectiveness of such solutions in imbalanced data classification problems. For the purposes of evaluation, the three methods presented in Sect. 2 have been supplemented with a preprocessing method that is a state-of-the-art solution for this type of problem. Due to the very large imbalance ratio, it is often impossible to apply the smote algorithm (with default parameterization it requires at least 5 minority class examples in the learning set), so random oversampling was chosen instead. The full list of compared algorithms, illustrated by the sketch after this list, is as follows:

  1. rdb: Regular Decision Boundary used in the Gaussian Naive Bayes classifier,

  2. ros-rdb: Regular Decision Boundary used in the Gaussian Naive Bayes classifier trained on datasets with a randomly oversampled minority class,

  3. ftdb: Fuzzy Templates Decision Boundary used in the Gaussian Naive Bayes classifier,

  4. sdb: Standard Decision Boundary used in the Gaussian Naive Bayes classifier.
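The following self-contained sketch shows one plausible way to assemble these four setups with scikit-learn and imbalanced-learn, reusing the rdb/ftdb/sdb helpers sketched in Sect. 2; the synthetic data merely stands in for a keel dataset and is not part of the original study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from imblearn.over_sampling import RandomOverSampler

# Synthetic stand-in for a highly imbalanced keel dataset (roughly 1:9)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
test_supports = clf.predict_proba(X_test)    # complementary support vectors
train_supports = clf.predict_proba(X_train)

pred_rdb = rdb_predict(test_supports)                                       # 1. rdb
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
pred_ros = GaussianNB().fit(X_ros, y_ros).predict(X_test)                   # 2. ros-rdb
pred_ftdb = ftdb_predict(test_supports, ftdb_fit(train_supports, y_train))  # 3. ftdb
pred_sdb = sdb_predict(test_supports, *sdb_fit(train_supports, y_train))    # 4. sdb
```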

Evaluation Methodology and Metrics Used. During the experimental evaluation, stratified 5-fold cross-validation was used, with an additional ten-time replication of the results for the non-deterministic ros-rdb algorithm. Both paired tests between the quality of classifiers on individual datasets and ranking tests, used for a general assessment of the relations between them, were carried out using the Wilcoxon test at a 5% significance level. Due to the imbalanced nature of the considered problems, it was decided to assess the quality of solutions using the precision and recall metrics, supplemented with the aggregate F1-score, balanced accuracy score, and geometric mean score. The full source code of the performed tests, along with the method implementations and a full report of results, is located in a publicly available Git repository (Footnote 1).
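A minimal sketch of this protocol is given below, assuming X, y and the prediction helpers from the previous sketches. It compares rdb and sdb with the F1-score over the folds; the balanced accuracy and geometric mean scores (from sklearn.metrics and imblearn.metrics) would be used analogously.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1_rdb, f1_sdb = [], []
for train_idx, test_idx in skf.split(X, y):
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])
    f1_rdb.append(f1_score(y[test_idx], rdb_predict(proba)))
    params = sdb_fit(clf.predict_proba(X[train_idx]), y[train_idx])
    f1_sdb.append(f1_score(y[test_idx], sdb_predict(proba, *params)))

# Paired Wilcoxon signed-rank test at the 5% significance level; note it
# raises an error when all per-fold differences are exactly zero
stat, p = wilcoxon(f1_rdb, f1_sdb)
print(f"mean F1: rdb={np.mean(f1_rdb):.3f}, sdb={np.mean(f1_sdb):.3f}, p={p:.3f}")
```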

4 Experimental Evaluation

4.1 Results

Scores and Paired Tests. Table 2 contains the results achieved by each of the considered algorithms for the aggregate F1-score metric. The ros-rdb method, a typical approach to dealing with imbalanced data using a single model, looks the worst in the pool: not only does it fail to improve on the rdb results, but it also often leads to statistically significantly worse ones. The ftdb method, although only sporadically, leads to a significant improvement over rdb, never achieving results significantly inferior to it. Definitely the best in this competition is the sdb method proposed in this paper, which in eleven cases is statistically significantly better than each of the other methods, and in fourteen cases better than rdb.

Table 2. Results achieved by the analyzed methods for all considered datasets with the F1-score metric. Bold values indicate the best classifier in the competition, and the numbers below the scores indicate which classifiers are significantly worse than the one in the given column.
Table 3. Results achieved by the analyzed methods for all considered datasets with the recall metric. Bold values indicate the best classifier in the competition, and the numbers below the scores indicate which classifiers are significantly worse than the one in the given column.

For both the precision metric and the other aggregate measures (balanced accuracy score and geometric mean score), the observations are identical to those drawn from the F1-score, so the relevant result tables are not attached directly to the article, while remaining publicly available in the repository indicated in the previous section.

Aggregate metrics such as the F1-score allow some binding conclusions to be drawn, but do not give a full picture. As expected, with the recall metric (Table 3), the ftdb and sdb algorithms show some deterioration relative to both the base method and the ros-rdb approach. This difference is statistically significant, however, only once for ftdb and twice for sdb.

Rank Tests. The final comparison of the considered solutions was carried out using the ranking tests included in Table 4. The ros-rdb method obtains a small but statistically significant advantage in the ranking over all other methods for the recall metric, yet for all other measures it stands out very negatively, which suggests its overall uselessness in the considered task of highly imbalanced data classification. If the goal of counteracting the tendency to favor the majority class in prediction (stated as the basic problem in the classification of imbalanced data) is to equalize the impact of both classes, then, on the example of the considered datasets, the ros method must be rejected because it leads to the reverse tendency. In the case of precision and each of the aggregate metrics, the same statistically significant relation is observed: the rdb method is better than ros-rdb, the ftdb method is better than both of them, and the sdb method proposed in this paper is significantly better than all competitors in the considered pool of solutions.

Table 4. Results for mean ranks according to all considered metrics.

5 Conclusions

The present paper, considering the binary classification of imbalanced data, proposed the application of the Fuzzy Templates method to the construction of a support-domain decision boundary for a single model, in order to balance the impact of classes of different counts on the prediction of the decision system. The proposal was further developed to use both standard normalization measures, introducing the Standard Decision Boundary method. Both solutions were tested in computer experiments on a collection of highly imbalanced datasets and compared with both the base method and a state-of-the-art preprocessing method.

Both proposed solutions seem to improve the quality of imbalanced data classification in relation to the regular support-domain decision boundary and, in contrast to oversampling, do so without leading to an overweighting of predictions towards the minority class. The modification of the use of Fuzzy Templates in the form of the Standard Decision Boundary is also more effective than the simple use of a class support prototype and may be considered a recommendable solution to the problem of binary classification of imbalanced data. Due to the promising results achieved for individual models, future work will attempt to generalize the sdb method to classifier ensembles.