1 Introduction

In recent years, class imbalance learning (CIL) has become one of the hot topics in the field of machine learning [1]. CIL has been widely applied in various real-world applications, including disease classification [2], software defect detection [3], biological data analysis [4], bankruptcy prediction [5], etc. The so-called class imbalance problem means that, in the training data, the instances belonging to one class greatly outnumber those belonging to the other classes. The problem tends to emphasize the performance on the majority class while ignoring the minority class.

There exist three major techniques to implement CIL: 1) data-level approaches, 2) algorithmic-level methods and 3) ensemble learning strategies. The data-level approach, also called resampling, addresses the CIL problem by re-balancing the data distribution [6,7]. It contains oversampling, which generates new minority instances, and undersampling, which removes majority instances. The algorithmic-level method adapts to class imbalance by modifying the original supervised learning algorithm. It mainly contains cost-sensitive learning [8] and the decision threshold-moving strategy [9,10]. Cost-sensitive learning designates different training costs for the instances belonging to different classes to emphasize the minority class, while decision threshold-moving tunes the biased decision boundary from the minority class region towards the majority class region. As for ensemble learning, it integrates either a data-level algorithm or an algorithmic-level method into a popular ensemble learning paradigm to promote the quality of CIL [11,12]. Among these CIL techniques, decision threshold-moving is relatively flexible and effective; however, it also faces a challenge, i.e., it is difficult to select an appropriate threshold.

In this study, we focus on a popular supervised learning algorithm named Gaussian Naive Bayes (GNB) [13], which also tends to be hurt by a skewed data distribution. First, we analyze in theory why GNB tends to be hurt by an imbalanced data distribution. Then, we explain why adopting several popular CIL techniques can repair this bias. Finally, based on the particle swarm optimization (PSO) algorithm, we propose an optimal decision threshold-moving algorithm for GNB named GNB-ODTM. Experimental results on eight class imbalance data sets indicate the effectiveness and superiority of the proposed algorithm.

2 Methods

2.1 Gaussian Naive Bayes Classifier

GNB is a variant of Naive Bayes (NB) [14] that is designed to deal with data in continuous space. Like NB, GNB has a strong theoretical basis. GNB assumes that, in each class, the instances satisfy a Gaussian distribution in each feature dimension, i.e., for an instance xi, we have:

$$ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\,e^{-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}} $$
(1)

where μy and σy2 denote the mean and variance of all instances belonging to class y, respectively, and P(xi | y) represents the conditional probability of xi given class y. As the prior probabilities P(y) and P(~y) are known, the posterior probabilities P(y | xi) and P(~y | xi) can be calculated as,

$$ P(y \mid x_i) = \frac{P(x_i \mid y)P(y)}{P(x_i \mid y)P(y) + P(x_i \mid \sim y)P(\sim y)} $$
(2)
$$ P(\sim y \mid x_i) = \frac{P(x_i \mid \sim y)P(\sim y)}{P(x_i \mid y)P(y) + P(x_i \mid \sim y)P(\sim y)} $$
(3)

We expect the classification boundary to correspond to P(xi | y) = P(xi | ~y). However, if the data set is imbalanced (supposing P(y) << P(~y)), then to guarantee P(y | xi) = P(~y | xi), i.e., P(xi | y)P(y) = P(xi | ~y)P(~y), the real classification boundary must correspond to the condition P(xi | y) >> P(xi | ~y). For example, if P(y) = 0.1 and P(~y) = 0.9, the boundary lies where P(xi | y) = 9P(xi | ~y). That means the classification boundary is pushed deep into the region of the minority class y. This explains why a skewed data distribution hurts the performance of GNB.
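To make the shift concrete, the following minimal Python sketch evaluates Eqs. (1) and (2) for a hypothetical one-dimensional, two-class problem; all statistics (means, variances, priors) are illustrative values rather than quantities from the paper.

```python
import math

def gaussian_likelihood(x, mu, sigma2):
    """P(x | y) per Eq. (1): univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def posterior_y(x, mu_y, sigma2_y, prior_y, mu_n, sigma2_n, prior_n):
    """P(y | x) per Eq. (2), Bayes' rule for two classes (n denotes ~y)."""
    num = gaussian_likelihood(x, mu_y, sigma2_y) * prior_y
    return num / (num + gaussian_likelihood(x, mu_n, sigma2_n) * prior_n)

# At x = 0 the two class-conditional densities are equal, yet the skewed
# prior P(y) = 0.1 drags the posterior of the minority class y down to 0.1.
print(posterior_y(0.0, mu_y=0.0, sigma2_y=1.0, prior_y=0.1,
                  mu_n=0.0, sigma2_n=1.0, prior_n=0.9))  # -> 0.1
```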

To repair the bias, data-level approaches change P(y) or P(~y) to make P(y) = P(~y); cost-sensitive learning designates a high cost C1 for class y and a low cost C2 for class ~y to make P(y)C1 = P(~y)C2; and the decision threshold-moving strategy adds a positive value λ to compensate the posterior probability of class y.
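The threshold-moving repair can then be sketched as follows, assuming binary labels and a scalar compensation λ (the function name and the values are illustrative):

```python
def predict_with_moving(p_minority, lam):
    """Return 1 (minority class y) if the lambda-compensated posterior wins."""
    return 1 if p_minority + lam > 1.0 - p_minority else 0

# With lam = 0.4, an instance with P(y | x) = 0.35 is now assigned to the
# minority class, effectively moving the boundary back towards class ~y.
print(predict_with_moving(0.35, lam=0.4))  # -> 1
```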

2.2 Optimal Decision Threshold-Moving Strategy

As we know, decision threshold-moving is an effective and efficient strategy to address the CIL problem. However, it also faces the challenge of how to designate an appropriate moving threshold λ. Some previous works adopt an empirical value [9] or a trial-and-error method [10] to designate the value of λ, but they ignore the specific data distribution, causing over-moving or under-moving phenomena.

To address the problem above, we present an adaptive strategy for searching for the most appropriate moving threshold. The strategy is based on particle swarm optimization (PSO) [15], a population-based stochastic optimization technique inspired by the social behavior of bird flocking. During the optimization process of PSO, each particle dynamically changes its position and velocity by recalling its own historical optimal position (pbest) and observing the position of the globally optimal particle (gbest). In each round, the velocity and position of each particle are updated by:

$$ \begin{cases} v_{id}^{k+1} = v_{id}^k + c_1 \times r_1 \times (\text{pbest} - x_{id}^k) + c_2 \times r_2 \times (\text{gbest} - x_{id}^k) \\ x_{id}^{k+1} = x_{id}^k + v_{id}^{k+1} \end{cases} $$
(4)

where \(v_{id}^k\) and \(v_{id}^{k + 1}\) represent the velocities of the dth dimension of the ith particle in the kth and the (k + 1)st rounds, while \(x_{id}^k\) and \(x_{id}^{k + 1}\) denote the corresponding positions. c1 and c2 are two nonnegative constants called acceleration factors, while r1 and r2 are two random variables in the range [0, 1]. In this study, the size of the particle swarm and the number of search rounds are both set to 50, and c1 and c2 are both set to 1. Meanwhile, the position x is restricted to the range [0, 1], considering that the upper limit of a posterior probability is 1, and the velocity v is restricted between −1 and 1.
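Under these settings, one update of Eq. (4) can be sketched in Python as follows (names follow the equation; the pbest and gbest values are placeholders):

```python
import random

def pso_step(x, v, pbest, gbest, c1=1.0, c2=1.0):
    """One velocity/position update following Eq. (4), with the stated clamps."""
    r1, r2 = random.random(), random.random()
    v_new = v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v_new = max(-1.0, min(1.0, v_new))     # velocity restricted to [-1, 1]
    x_new = max(0.0, min(1.0, x + v_new))  # position restricted to [0, 1]
    return x_new, v_new

# One illustrative step with placeholder personal/global best positions.
x, v = pso_step(x=0.8, v=0.0, pbest=0.3, gbest=0.25)
```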

As for the fitness function, it should be directly associated with classification performance. It is well known that in CIL, accuracy is not an appropriate performance evaluation metric; thus we adopt a widely used CIL evaluation metric called G-mean as the fitness function, which is described as follows,

$$ {\text{G-mean}} = \sqrt {{\text{TPR}} \times {\text{TNR}}} $$
(5)

where TPR and TNR indicate the classification accuracy on the positive class and the negative class, respectively.
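A minimal implementation of Eq. (5), assuming 0/1 labels with 1 denoting the positive (minority) class, is:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """G-mean per Eq. (5): geometric mean of TPR and TNR."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)  # accuracy on the positive class
    tnr = np.mean(y_pred[y_true == 0] == 0)  # accuracy on the negative class
    return np.sqrt(tpr * tnr)

print(g_mean([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))  # sqrt(0.5 * 2/3) ~= 0.577
```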

2.3 Description About GNB-ODTM Algorithm

Combining GNB with the optimization strategy presented above, we propose an optimal decision threshold-moving algorithm for GNB named GNB-ODTM. The workflow of the GNB-ODTM algorithm is briefly described as follows; a minimal code sketch is given after the procedure:

Algorithm: GNB-ODTM.

Input: A skewed binary-class training set Φ, a binary-class testing set Ψ.

Output: An optimal moving threshold λ*, the G-mean value on the testing set Ψ.

Procedure:

1) Train a GNB classifier on Φ;

2) Calculate the posterior probabilities of each instance in Φ, and thereby calculate the original G-mean value on Φ;

3) Call the PSO algorithm and use the training set Φ to find the optimal moving threshold λ*;

4) Adopt the trained GNB classifier to calculate the posterior probabilities of each instance in Ψ;

5) Tune the posterior probabilities in Ψ by the recorded λ*;

6) Calculate the G-mean value on the testing set Ψ by using the tuned posterior probabilities.
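The whole procedure can be sketched end-to-end in Python under stated assumptions: scikit-learn's GaussianNB stands in for the GNB classifier, a bare-bones PSO implements the search of Eq. (4) over the scalar threshold λ, and a synthetic imbalanced data set replaces Φ and Ψ; this is an illustrative sketch, not the authors' exact implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def g_mean(y_true, y_pred):
    tpr = np.mean(y_pred[y_true == 1] == 1)   # accuracy on the minority class
    tnr = np.mean(y_pred[y_true == 0] == 0)   # accuracy on the majority class
    return np.sqrt(tpr * tnr)

def predict_moved(proba_pos, lam):
    # Step 5): compensate the minority-class posterior by lambda before deciding.
    return (proba_pos + lam > 1.0 - proba_pos).astype(int)

def pso_search(fitness, n_particles=50, n_rounds=50, c1=1.0, c2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.random(n_particles)               # candidate lambdas in [0, 1]
    v = np.zeros(n_particles)                 # velocities, clamped to [-1, 1]
    pbest = x.copy()
    pbest_fit = np.array([fitness(xi) for xi in x])
    gbest = pbest[np.argmax(pbest_fit)]
    for _ in range(n_rounds):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        v = np.clip(v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x), -1.0, 1.0)
        x = np.clip(x + v, 0.0, 1.0)
        fit = np.array([fitness(xi) for xi in x])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = x[better], fit[better]
        gbest = pbest[np.argmax(pbest_fit)]
    return gbest

# Synthetic imbalanced data standing in for the training set Phi and testing set Psi.
rng = np.random.default_rng(42)
def make_data(n_maj, n_min):
    X = np.vstack([rng.normal(0.0, 1.0, (n_maj, 2)),
                   rng.normal(1.5, 1.0, (n_min, 2))])
    return X, np.array([0] * n_maj + [1] * n_min)

X_train, y_train = make_data(900, 100)
X_test, y_test = make_data(300, 33)

gnb = GaussianNB().fit(X_train, y_train)                       # step 1)
p_train = gnb.predict_proba(X_train)[:, 1]                     # step 2)
lam_star = pso_search(
    lambda lam: g_mean(y_train, predict_moved(p_train, lam)))  # step 3)
p_test = gnb.predict_proba(X_test)[:, 1]                       # steps 4)-5)
print(lam_star, g_mean(y_test, predict_moved(p_test, lam_star)))  # step 6)
```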

From the procedure described above, it is not difficult to observe that, in comparison with empirical moving threshold setting, the proposed GNB-ODTM algorithm must be more time-consuming, as it needs to conduct an iterative PSO optimization procedure. However, the time complexity can be decreased by assigning a small number of iterations and a small population size, which is also helpful for reducing the risk of overfitting the classification model. Moreover, we note that the GNB-ODTM algorithm is self-adaptive, which means it is not restricted by the data distribution and can adapt to any distribution type without exploring it explicitly.

3 Results and Discussions

3.1 Description About the Used Data Sets

We collected 8 class imbalance data sets from the UCI machine learning repository, which is available at http://archive.ics.uci.edu/ml/datasets.php. Detailed information about these data sets is presented in Table 1. Specifically, these data sets have also been used in our previous work on class imbalance learning [16].

Table 1. Description about the used data sets

3.2 Analysis About the Results

We compared our proposed algorithm with GNB [13], GNB-SMOTE [7], CS-GNB [8], GNB-THR [9] and GNB-OTHR [10] in our experiments. All parameters of PSO have been designated in Sect. 2. In addition, to guarantee the impartiality of the experimental comparison, we adopted external 10-fold cross-validation and randomly conducted it 10 times, reporting the average G-mean as the final result.

Table 2 shows the comparative results of the various algorithms, where on each data set the best result is highlighted in boldface.

From the results in Table 2, we observe:

1) In comparison with the original GNB, associating it with resampling, cost-sensitive learning or decision threshold-moving techniques can all promote classification performance on imbalanced data sets. The results again indicate the necessity of adopting CIL techniques to address the imbalanced classification problem.

2) It is difficult to compare the quality of resampling and cost-sensitive learning, as each of them performs better on some of the data sets. GNB-SMOTE performs better on abalone9, pageblocks5, cardiotocographyC5 and cardiotocographyN3, while CS-GNB produces better results on the remaining data sets.

3) Although GNB-THR significantly outperforms the original GNB model, it is obviously worse than several other algorithms. This indicates the unreliability of setting the moving threshold by an empirical approach.

4) We believe the proposed GNB-ODTM algorithm is successful, as it has produced the best result on nearly all data sets except pageblocks2345 and cardiotocographyN3. In addition, we observe that on most data sets the performance promotion from adopting the proposed algorithm is remarkable, which should be attributed to its distribution self-adaption. Although the proposed GNB-ODTM algorithm has a higher time complexity than several other algorithms, it is still an excellent alternative for processing imbalanced data classification problems.

Table 2. G-mean performance of various comparable algorithms on 8 data sets

4 Concluding Remarks

In this study, we focused on a specific class imbalance learning technique, the decision threshold-moving strategy. A common problem with this technique is that it generally lacks adaptation to the data distribution, which causes unreliable classification results. Specifically, in the context of the Gaussian Naive Bayes classification model, we presented a robust decision threshold-moving strategy and proposed a novel CIL algorithm called GNB-ODTM. The experimental results have indicated the effectiveness and superiority of the proposed algorithm.

The contributions of this paper are two-fold, described as follows:

1) In the context of the Gaussian Naive Bayes classifier, we analyze the hazard of a skewed data distribution in theory, and indicate the rationality of several popular CIL techniques;

2) Based on the particle swarm optimization technique, we propose a robust decision threshold-moving algorithm which can adapt to various data distributions.

The work was supported by the Natural Science Foundation of Jiangsu Province of China under grant No. BK20191457.