1 Introduction

When supervised learning models are trained on data generated from skewed class distributions, i.e., suffer from the class imbalance problem, their performance on the minority class can degrade significantly, even though they may have outstanding performance in terms of overall error rate or accuracy. In extreme cases, the model may ignore the minority class altogether and always predict the majority class. Class imbalance is inherent in many real-world applications, e.g., medical diagnosis [28, 39], fraud detection [2, 32, 36, 41] or sentiment classification [19, 29], and can even lead to discrimination and unfairness [15,16,17,18, 20, 21, 40].

Over the years, a large body of work has been proposed for tackling the class imbalance problem. Following [46], these works can be categorized into: (i) data-level approaches, (ii) model-based approaches, and (iii) cost-sensitive approaches. Each category has its own strengths and limitations. For instance, data-level approaches may discard useful information when restoring balance across the class distributions. Model-based approaches are typically designed and implemented for specific models and are therefore applicable only in limited settings. Finally, cost-sensitive methods require a misclassification cost matrix as input, thereby introducing additional parameters.

Here, we focus on cost-sensitive classification methods with boosting. We have chosen cost-sensitive boosting for three main reasons: (i) boosting is able to minimize the training error while avoiding overfitting [43], (ii) boosting is a popular learning method employed in many classification systems [33], and (iii) by re-weighting the data distribution, boosting preserves more information compared to sampling methods [46], the prevalent type of data-level methods. However, most cost-sensitive boosting methods require a fixed misclassification cost matrix provided by the user [11, 35, 46, 47]. To define such a matrix, grid search is often performed to find the best costs for the dataset at hand, a tedious and costly process. In many cases, as we also show in our experiments, grid search does not lead to an optimal selection of misclassification costs. Additionally, keeping the costs fixed during model training may lead to suboptimal learning outcomes.

Fig. 1

Decision boundaries of AdaBoost and the two variants of the proposed AdaCC on the same imbalanced toy dataset of 5 blue and 20 red instances. The dot size is proportional to the weight allocated by each learner to the particular instance (with the exception of the last column that depicts the final ensemble), making clear that AdaCC assigns higher weight to minority class instances compared to those of the majority class

In this work, we propose a new parameter-free cost-sensitive boosting approach for classification problems with high class imbalance. The proposed method, named AdaCC, standing for Cumulative Cost-Sensitive Boosting, alleviates the need to set a fixed misclassification cost matrix as an input parameter by leveraging the cumulative costs of the model up to the current boosting round. As we show in Sect. 4, the proposed method has proven upper bounds for the training error. We propose two variants of the method, AdaCC1 and AdaCC2, which differ in terms of the employed data re-weighting scheme.

We carry out a comprehensive experimental study on 27 real-world datasets and compare our method with 12 state-of-the-art cost-sensitive boosting methods as well as 5 non-cost-sensitive class-imbalance methods. Our results demonstrate the superior performance of AdaCC over the state of the art in terms of AUC, balanced accuracy, geometric mean, and recall. Notably, the performance improvements are more pronounced on the minority class. This makes our method suitable for tasks where false negatives are particularly costly, e.g., medical diagnosis, fraud detection, fairness-aware machine learning, etc.

Figure 1 illustrates a binary imbalanced toy dataset where the blue points (#5) belong to the minority class and the red points (#20) to the majority class. We compare the learning behaviour of the proposed method with that of AdaBoost by observing the decision boundaries of the weak learners on the toy dataset as well as the ensemble boundary at the end of the training process (rightmost figure). Due to the low dimensionality of the dataset and for illustration purposes, we use a small number of \(T=5\) weak learners. Figure 1 demonstrates how the weighted data distribution is affected by each weak learner as well as the decision boundary of the final ensembles. The first 5 columns correspond to the 5 weak learners, while the size of the dots corresponds to the weight that each instance receives per round. The last column illustrates the decision boundary of the ensemble model. We note that the final AdaCC model fits the data distribution better than AdaBoost, allocating a “proper” part of the feature space to the minority class.

The rest of the paper is organized as follows: Related work is summarized in Sect. 2. Basic concepts are described in Sect. 3. Our approach is introduced in Sect. 4. Evaluation setup and experimental results are presented in Sects. 5 and 6, respectively. Finally, Sect. 7 concludes our work and identifies interesting directions for future research.

2 Related work

Methods for dealing with class imbalance can be organised in three broad categories [46]: (i) data-level, (ii) model-based and (iii) cost-sensitive methods.

Data-level methods operate at the dataset level, i.e., they modify the data distribution before model training, making these methods universally applicable. In [22], the authors investigate the problem of class imbalance and the impact of re-sampling methods under the inter-dependencies of class distribution skewness, data complexity, data volume and employed models. In [30], the authors propose a combination of under- and over-sampling to equalize class distributions and measure the model’s performance using lift analysis. The impact of over-sampling and under-sampling under the cost curve performance metric has been explored in [8]. The authors conclude that under-sampling is significantly more effective than over-sampling for C4.5 classifiers. In [3], the authors propose SMOTE, a method that augments the minority class by interpolating new instances in local neighborhoods. In [19], the authors propose text augmentation techniques, such as distortion and semantic similarity, to increase the representation of the minority class.

Although re-sampling approaches are simple and easy to use, they come with disadvantages. For example, over-sampling may fail to “boost” existing rare cases and adds no additional information to the dataset [8, 46]. Under-sampling, on the other hand, can deteriorate performance by removing important information from the majority class [3]. Finally, augmentation methods can amplify and propagate noise [19], leading to overall performance deterioration.

Model-based methods tackle class imbalance during training either by employing a mechanism which aims to identify rare patterns or by optimizing for a balanced-performance-aware metric. SMOTEBoost [4] combines SMOTE [3] and AdaBoost [42] to deal with class imbalance by augmenting the minority class in each boosting round. A similar line of work is RUSBoost [44], which combines AdaBoost and random under-sampling of the majority class in each boosting round. DataBoost-IM [12] locates the hard-to-learn instances from both positive and negative classes during the training phase of AdaBoost and, based on these instances, generates synthetic data for augmentation at the end of each boosting round. Class imbalance-sensitive pruning of decision trees has been presented in [53]. The work in [50] uses kernel alignment to optimize the decision boundary of an SVM. In [24], a class posterior re-balancing framework has been proposed to reduce imbalance while retaining classification certainty. Over recent years, hybrid methods have also been proposed. In [49], the authors employ multi-set feature learning to learn discriminant features from the constructed multi-set and combine the sets with a generative adversarial network technique such that each subset has a distribution similar to the original dataset. In [51], the authors propose a combination of different techniques, such as under-/over-sampling, data transformations, misclassification costs and ensemble learning, to deal with class imbalance.

The main disadvantage of model-based methods is that the inductive bias of the selected model can raise issues on an imbalanced dataset, e.g., the data fragmentation problem of decision trees [14]. Additionally, they typically rely on assumptions regarding the underlying data properties or are tailored to specific classification algorithms, which makes their application to new domains and datasets difficult.

Cost-sensitive methods do not optimize for overall accuracy. Instead, they try to minimize the overall misclassification costs. This class of algorithms is divided into three sub-categories [46]: (i) weighting the data space, (ii) making a specific classifier cost-sensitive, and (iii) using the Bayes risk theory to assign each instance to the class with the lowest risk. The first sub-category aims to alter the data distribution by employing a misclassification cost matrix such that errors on minority class instances induce a higher loss. The very first method in this line of work is AdaCost [11]. Over the years many variations of AdaCost have been introduced, such as: CSB1 [47], CSB2 [47], RareBoost [23], AdaC1 [46], AdaC2 [46], AdaC3 [46] and CGAda [25,26,27], which differ in the following main aspects: training data weight assignments, weight update rules, and decision rules. Except for RareBoost, all the aforementioned methods in this category require user parameters for the misclassification costs. An overview of these methods can be seen in Table 3. The second sub-category of cost-sensitive methods aims to make a specific classifier cost-sensitive. In [34], the authors propose AdaMEC, a boosting classifier that uses the misclassification costs only to set thresholds on the decision boundary of AdaBoost, in contrast to the previous methods which use the misclassification costs to change the data distribution in each boosting round. CGAda and AdaMEC have also been extended in [35], namely CGAda-Cal. and AdaMEC-Cal., by calibrating the models’ scores using the Platt scaling technique [37]. In [38], a cost-sensitive k-NN classifier is introduced to tackle class imbalance by using a modified distance function which takes into consideration the misclassification cost matrix. In [31], a misclassification cost matrix is used to define a cost-sensitive splitting criterion in decision trees, while in [1] the authors take into account the misclassification costs to determine the pruning criterion of a decision tree. The third sub-category uses the Bayes risk theory to assign each instance to the class with the lowest risk. Few works have been proposed in this direction, e.g., in [7] the authors swap the class labels of the leaves to minimize the misclassification cost.

For evaluation purposes, we select all the aforementioned cost-sensitive boosting methods since they are related to our contribution. In contrast to our proposed approach, however, the aforementioned cost-sensitive boosting methods assume that the misclassification costs for each class are known in advance (except RareBoost). For many applications/datasets these costs might not be available, and a costly grid search has to be performed to estimate them; however, in many cases, even grid search does not lead to optimal misclassification costs. Instead, the two variants of our approach are parameter-free and leverage the cumulative behavior of AdaBoost to dynamically adjust the misclassification costs per boosting round. Hence, our methods are applicable to any imbalanced dataset without any prior domain knowledge.

3 Preliminaries

For the sake of clarity, in Table 1 we briefly describe the employed notations. We assume a set of instances \(D=\{(x_1,y_1),\ldots ,(x_n,y_n)\}\) consisting of n independent and identically distributed samples drawn from the joint distribution \(P(A, y)\), where A denotes the feature space and y is the class attribute. For simplicity, we assume the class is binary with \(y \in \{+1, -1\}\). We denote by \(D_+\) (\(D_-\)) the set of instances belonging to the positive (negative, respectively) class. We also assume that the positive class is the minority, i.e., \(|D_+| \ll |D_-|\). It holds that \(|D_+| + |D_-| = n\).

Table 1 Notations

Standard classification models treat instances of different classes equally and the performance of the induced classifier (see confusion matrix in Table 2) is measured in terms of the overall error rate (ER) as: \(ER = (FP + FN)/(TP + TN + FP + FN)\). However, when the class distribution is skewed, the overall error rate is not a good indicator of the model’s performance across all classes, but rather of the performance on the majority class. For example, on a dataset where only 1% of the instances belong to the minority class, a classifier that always predicts the majority class achieves an error rate of just 1% while misclassifying every minority instance. In such a case, more appropriate performance metrics should be employed (see an overview in Table 5).

Cost-sensitive models tackle the class imbalance problem by placing more emphasis on the minority class through appropriate costs [14, 35, 47]. Each sample \(x \in D\) is associated with a typically fixed misclassification cost vector \(\textbf{C}=\langle C_+,C_-\rangle \): samples in \(D_+\) receive the cost value \(C_+\) and samples in \(D_-\) the cost value \(C_-\), where \(C_+ > C_-\) and \(C_+,C_-\in [0,\infty )\). The costs denote the misclassification costs for each class and are employed by the cost-sensitive learner during the training phase to “force” the learner to also learn minority instances. The costs, however, need to be set manually by the user, thus requiring prior domain knowledge, or to be selected via grid search [14, 46].

Table 2 Confusion matrix

Boosting and AdaBoost: Boosting is an ensemble learning technique which trains a sequence of T weak learners in order to create a strong learner. The sequential training introduces dependencies between the weak learners, with each learner learning from the mistakes of its predecessor.

AdaBoost [42], one of the most popular boosting algorithms (see Algorithm 1), adjusts in each iteration \(t = 1,\ldots ,T\) (the so-called boosting round t) the data distribution \(D^t\) based on the mistakes of the current learner \(h_t\) in order to focus in the next round \(t+1\) on the misclassified instances. In particular, the weights of the instances for the next round are updated as follows:

$$\begin{aligned} D^{t+1}(i) = \frac{D^t(i)\exp {(-\alpha _ty_ih_t(x_i))}}{Z_t} \end{aligned}$$
(1)

The parameter \(\alpha _t\) denotes the weight of the weak learner \(h_t\) in the final classification decision and is based on the error rate of the weak learner \(h_t\):

$$\begin{aligned} \alpha _t = \frac{1}{2}\log \left( \frac{\sum \limits _{i,y_i=h_t(x_i)} D^t(i)}{\sum \limits _{i,y_i\ne h_t(x_i)} D^t(i)}\right) \end{aligned}$$
(2)

The parameter \(Z_t\) is a normalization factor which is used at the end of each boosting round to make \(D^{t+1}\) a probability distribution:

$$\begin{aligned} Z_t = \sum \limits _{i=1}^n D^t(i)\exp \left( -\alpha _ty_ih_t(x_i)\right) \end{aligned}$$
(3)

The final model is a weighted combination of the weak learners:

$$\begin{aligned} H(x)=\text {sign}\left( \sum \limits _{t=1}^T \alpha _th_t(x)\right) \end{aligned}$$
(4)
Algorithm 1 AdaBoost
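Since Algorithm 1 is referenced throughout but its pseudocode is not reproduced here, the following is a minimal Python sketch of standard AdaBoost with decision stumps, following Eqs. (1)–(4). The function and variable names (adaboost_train, stumps, alphas) are illustrative and not taken from the authors’ implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=200):
    """Train AdaBoost with decision stumps; y is expected to be in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                  # line 1 of Algorithm 1: uniform weights
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        err = np.sum(D[pred != y])
        if err == 0 or err >= 0.5:           # stop if the weak learner is perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)    # Eq. (2)
        D = D * np.exp(-alpha * y * pred)        # Eq. (1), numerator
        D /= D.sum()                             # normalization factor Z_t, Eq. (3)
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote of the weak learners, Eq. (4)."""
    scores = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(scores)
```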

Cost-sensitive boosting approaches extend AdaBoost for class imbalance by changing the following components: (i) weight initialization (recall that in AdaBoost all instances receive the same weight during initialization—line 1 of Algorithm 1), (ii) distribution reweighting (for AdaBoost the update is according to Eqs. (1) and (2)), and (iii) voting schema (for AdaBoost voting is according to Eq. (4)). A detailed overview of the cost-sensitive methods and how they implement the aforementioned (i)–(iii) aspects is presented in Table 3. CGAda [25,26,27] employs the misclassification cost matrix only for initializing the weight distribution at the first boosting round and proceeds as standard AdaBoost thereafter. AdaCost (\(\beta _2\)) [11], AdaC1-C3 [46] and CSB1/2 [47] incorporate the misclassification cost matrix to change the data distribution in each boosting round. AdaMEC [34] and RareBoost [23] differ from the other cost-sensitive methods: In particular, AdaMEC does not use costs to change the data distribution but rather shifts the decision boundary of AdaBoost to minimize the total expected loss. RareBoost does not rely on misclassification costs; instead of a single parameter \(\alpha \) (see Eq. (2)), it employs two different parameters, \(\alpha ^+\) and \(\alpha ^-\), for positive and negative predictions, respectively, to update the weight distribution as well as the voting schema. RareBoost requires that \(TP > FP\); if this assumption does not hold, the algorithm’s performance deteriorates [46]. CGAda-Cal. [35] and AdaMEC-Cal. [35] are not shown in Table 3 since calibration, through Platt scaling, is applied to the trained CGAda and AdaMEC models, respectively.

Table 3 An overview of cost-sensitive boosting methods w.r.t. cost assignment, initialization, distribution update and ensemble decision rule

4 AdaCC: cumulative cost-sensitive boosting

Instead of assuming a fixed misclassification cost matrix, AdaCC dynamically adjusts the misclassification costs in each boosting round based on the performance of the model up to that round, i.e., the performance of the partial ensemble (Sect. 4.1). This way, in each boosting round AdaCC boosts the class with the highest misclassification rate. These costs are then used to update the data distribution for the next round. There are two ways to incorporate the costs in the update formula (for AdaBoost the update formula is shown in Eq. (1)): inside or outside the exponent, resulting in two variants, AdaCC1 (Sect. 4.2) and AdaCC2 (Sect. 4.3), respectively.

The toy example in Fig. 1 demonstrates how our approach “pays extra attention” to the minority class errors: in particular, we observe that AdaBoost, AdaCC1 and AdaCC2 misclassify the minority class (blue points) during the first boosting round \(t=1\); however, our methods assign higher weights to the minority examples in the next boosting rounds in contrast to AdaBoost, which leads to substantially different decision boundaries in the subsequent boosting rounds and in the final ensemble.

4.1 Cumulative misclassification costs

Let \(t \in [1,T]\) be the current boosting round, where T is a user-defined parameter indicating the number of boosting rounds. Let \(H_{1:t}(x) = \text {sign}(\sum _{j=1}^t \alpha _jh_j(x))\) be the partial ensemble up to round t. We monitor the cumulative error of the partial ensemble and in particular, the cumulative false positive rate (FPR) and the cumulative false negative rate (FNR) defined as follows:

$$\begin{aligned} \textrm{FNR}_{1:t}&= \frac{\sum \limits _{i,x_i \in D_+} {\mathbb {I}}\left\{ \text {sign}\left( \sum \limits _{j=1}^t \alpha _jh_j(x_i)\right) \ne y_i \right\} }{|D_+|} \nonumber \\ \textrm{FPR}_{1:t}&= \frac{\sum \limits _{i,x_i\in D_-} {\mathbb {I}}\left\{ \text {sign}\left( \sum \limits _{j=1}^t \alpha _jh_j(x_i)\right) \ne y_i\right\} }{|D_-|} \end{aligned}$$
(5)

where \({\mathbb {I}}\{\cdot \}\) is the indicator function that returns 1 if the condition within is true and 0 otherwise. The term \(\textrm{FNR}_{1:t}\) corresponds to the error of the partial ensemble in the positive class (\(D_+\)); likewise, \(\textrm{FPR}_{1:t}\) refers to the error in the negative class (\(D_-\)).

Based on the cumulative error rates, we define the cumulative misclassification costs below in order to “bias” the weighting process for the next round towards the class with the highest misclassification rate (on the current boosting round):

$$\begin{aligned} C^{t}(x_i)= {\left\{ \begin{array}{ll} 1 + \textrm{FNR}_{1:t}, &\text {if } h_t(x_i) \ne y_i,\ y_i = +,\ \textrm{FNR}_{1:t} > \textrm{FPR}_{1:t}\\ 1 + \textrm{FPR}_{1:t}, &\text {if } h_t(x_i) \ne y_i,\ y_i = -,\ \textrm{FNR}_{1:t} < \textrm{FPR}_{1:t}\\ 1, &\text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

where \(h_t\) is the weak learner at round t. In particular, for any misclassified instance \(x_i\), we increase its weight using the cumulative FPR or FNR values based on its class-membership.

The costs are therefore dynamically adjusted based on the partial ensemble’s cumulative behavior and the predictions of the current weak learner. In contrast to other methods that assume fixed misclassification costs through the boosting rounds, our method is not only parameter-free but it also dynamically detects which class might require extra weighting at each round. We should highlight that the cumulative misclassification costs aim to boost the class with the highest misclassification rate and not individual examples. Nonetheless, the cumulative misclassification costs affect the weights of the instances since they are used to update the data distribution. In what follows, and when it is clear from the context, we simplify the notation of \(C^t(x_i)\) as \(C^t_i\).
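To make the above concrete, the following is a minimal sketch of how the cumulative rates of Eq. (5) and the costs of Eq. (6) could be computed. Here, partial_scores is assumed to hold \(\sum _{j=1}^t \alpha _jh_j(x_i)\) for every training instance (see the bookkeeping discussed in Sect. 4.2), and all names are illustrative rather than the authors’ implementation.

```python
import numpy as np

def cumulative_rates(partial_scores, y):
    """Eq. (5): FNR/FPR of the partial ensemble H_{1:t} on the training set."""
    pred = np.sign(partial_scores)
    pos, neg = (y == 1), (y == -1)
    fnr = np.mean(pred[pos] != y[pos])       # error on the minority (positive) class
    fpr = np.mean(pred[neg] != y[neg])       # error on the majority (negative) class
    return fnr, fpr

def cumulative_costs(partial_scores, weak_pred, y):
    """Eq. (6): per-instance cost C^t_i used for the next re-weighting step."""
    fnr, fpr = cumulative_rates(partial_scores, y)
    C = np.ones(len(y))
    miss = (weak_pred != y)                  # mistakes of the current weak learner h_t
    if fnr > fpr:
        C[miss & (y == 1)] = 1.0 + fnr       # boost misclassified positives
    elif fpr > fnr:
        C[miss & (y == -1)] = 1.0 + fpr      # boost misclassified negatives
    return C
```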

The two variants AdaCC1 (Sect. 4.2) and AdaCC2 (Sect. 4.3) are presented next.

4.2 AdaCC1

The first proposed algorithm, AdaCC1, modifies the weight update formula of AdaBoost (Eq. (1)) using the cumulative costs \(C^{t}_i\) (Eq. (6)) as follows:

$$\begin{aligned} D^{t+1}(i) = \frac{D^t(i)\exp {\left( -C^{t}_i\alpha _t y_i h_t(x_i)\right) }}{Z_t} \end{aligned}$$
(7)

The normalization factor \(Z_t\) (for Adaboost shown in Eq. (3)), in round t, is also updated to take the extra weighting factor into account:

$$\begin{aligned} Z_t = \sum \limits _{i=1}^n D^t(i)\exp {(-C^t_i\alpha _ty_ih_t(x_i))} \end{aligned}$$
(8)

Error analysis: By unravelling Eq. (7), the following holds:

$$\begin{aligned} D^{t+1}(i)&= D^1(i)\times \frac{\exp {\left( -C^{1}_i\alpha _1y_ih_1(x_i)\right) }}{Z_1}\times \cdots \times \frac{\exp {\left( -C^{t}_i\alpha _ty_ih_t(x_i)\right) }}{Z_t} \nonumber \\&= \frac{D^1(i)\exp {\left( -\sum \limits _{j=1}^tC^{j}_i \alpha _j y_ih_j(x_i) \right) }}{\prod \limits _{j=1}^t Z_j} \end{aligned}$$
(9)

The upper bound of the training error of the final ensemble H(x) can be expressed as:

$$\begin{aligned} Pr_{i\sim D^1} [H(x_i) \ne y_i] \le \sum \limits _{i=1}^n D^1(i)\exp {\left( -\sum \limits _{t=1}^T C^{t}_i \alpha _t y_ih_t(x_i) \right) } = \prod \limits _{t=1}^T Z_t \end{aligned}$$
(10)
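For completeness, here is a brief sketch of the standard argument behind this bound, assuming all \(\alpha _t \ge 0\) (cf. Eq. (13)). Since \(C^t_i = 1\) whenever \(h_t\) classifies \(x_i\) correctly and \(C^t_i \ge 1\) otherwise, it holds that \(C^t_i\alpha _ty_ih_t(x_i) \le \alpha _ty_ih_t(x_i)\) for every round t. Hence, whenever \(H(x_i) \ne y_i\), i.e., \(y_i\sum _{t}\alpha _th_t(x_i) \le 0\), we have

$$\begin{aligned} {\mathbb {I}}\{H(x_i) \ne y_i\} \le \exp \left( -y_i\sum \limits _{t=1}^T \alpha _th_t(x_i)\right) \le \exp \left( -\sum \limits _{t=1}^T C^{t}_i \alpha _t y_ih_t(x_i)\right) \end{aligned}$$

Summing over i with weights \(D^1(i)\) gives the middle term of Eq. (10), and the right-hand side \(\prod _{t=1}^T Z_t\) follows from Eq. (9) together with \(\sum _i D^{T+1}(i)=1\).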

Therefore, the objective in each boosting round is to find the \(\alpha _t\) that minimizes \(Z_t\). Since \(Z_t\) is the weight summation of correctly and incorrectly classified instances at round t, following the same argumentation as in [43, 46], Eq. (8) can be bounded as:

$$\begin{aligned} \sum \limits _{i=1}^n D^t(i)\exp {\left( - C^t_i \alpha _t y_i h_t(x_i) \right) } \le \sum \limits _{i=1}^n D^t(i)\left( \frac{1-C^t_iy_ih_t(x_i)}{2}\exp {(\alpha _t)} + \frac{1+C^t_iy_ih_t(x_i)}{2}\exp {(-\alpha _t)} \right) \end{aligned}$$
(11)

By differentiating Eq. (11) w.r.t. \(\alpha _t\) and setting it to zero, we can estimate \(\alpha _t\) as follows:

$$\begin{aligned}&\frac{\partial }{\partial \alpha _t} \Bigg ( \sum \limits _{i=1}^n D^t(i) \left( \frac{1-C^t_iy_ih_t(x_i)}{2}\exp {(\alpha _t)}\right) + \sum \limits _{i=1}^n D^t(i)\left( \frac{1+C^t_iy_ih_t(x_i)}{2}\exp {(-\alpha _t)} \right) \Bigg ) = 0 \Rightarrow \nonumber \\&e^{\alpha _t}\sum \limits _{i=1}^n D^t(i)\left( \frac{1-C^t_iy_ih_t(x_i)}{2}\right) = e^{-\alpha _t}\sum \limits _{i=1}^n D^t(i)\left( \frac{1+C^t_iy_ih_t(x_i)}{2}\right) \Rightarrow \nonumber \\&\alpha _t = \frac{1}{2}\log \left( \frac{\sum \limits _{i=1}^n D^t(i)(1+C^t_iy_ih_t(x_i))}{\sum \limits _{i=1}^n D^t(i)(1-C^t_iy_ih_t(x_i))} \right) = \frac{1}{2}\log \left( \frac{1 + \sum \limits _{i,y_i=h_t(x_i)}^n C_i^tD^t(i) - \sum \limits _{i,y_i\ne h_t(x_i)}^n C_i^tD^t(i)}{1 - \sum \limits _{i,y_i=h_t(x_i)}^n C_i^tD^t(i) + \sum \limits _{i,y_i\ne h_t(x_i)}^n C_i^tD^t(i)} \right) \end{aligned}$$
(12)

To ensure that \(\alpha _t\) is non-negative, the following condition should hold, otherwise the iteration process terminates:

$$\begin{aligned} \sum \limits _{i,y_i=h_t(x_i)} C^t_iD^t(i) > \sum \limits _{i,y_i\ne h_t(x_i)} C^t_iD^t(i) \end{aligned}$$
(13)
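As an illustration, a single AdaCC1 boosting round can be sketched as follows (Eqs. (7), (8), (12) and (13)). Here, costs corresponds to the vector \(C^t_i\) from Eq. (6), and the function name is ours, not the authors’ implementation.

```python
import numpy as np

def adacc1_round(D, y, weak_pred, costs):
    """One AdaCC1 re-weighting step: returns alpha_t and the next distribution D^{t+1}."""
    correct = np.sum(costs[weak_pred == y] * D[weak_pred == y])
    wrong = np.sum(costs[weak_pred != y] * D[weak_pred != y])
    if correct <= wrong:                             # Eq. (13): stop if alpha_t would be negative
        return None, D
    alpha = 0.5 * np.log((1 + correct - wrong) / (1 - correct + wrong))   # Eq. (12)
    D_next = D * np.exp(-alpha * costs * y * weak_pred)                   # Eq. (7), numerator
    D_next /= D_next.sum()                           # normalization factor Z_t, Eq. (8)
    return alpha, D_next
```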

Time complexity: We derive the time complexity of our approach building upon the complexity of AdaBoost (cf. Algorithm 1). AdaBoost's complexity is \(O(T\cdot (f + n))\), where T is the number of boosting rounds, O(f) is the complexity of a weak learner (for decision stumps it is \(O(n\cdot m)\) for training and O(n) for testing, where m is the number of features and n the number of instances [45]), and O(n) is the complexity of the weight update of the instances. Our only (computational) addition to the algorithm is the calculation of the cumulative errors (Eq. (5)). This computation can be reduced to O(n) by maintaining a vector \(\textbf{o}\) of size n over the boosting rounds which averages the decision outcomes of the weak learners in each boosting round. Note that the vector \(\textbf{o}\) is updated in each round based on the current weak learner’s predictions (on the training set). By doing this, we avoid spending \(O(t \cdot f)\) in each boosting round t, i.e., we avoid the prediction time of the partial ensemble (on the training set) in each boosting round. Therefore, the complexity of AdaCC1 is: \(O(T\cdot (f + 2n)) \Rightarrow O(T\cdot (f + n))\), since 2 is a constant.
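A minimal sketch of this bookkeeping is shown below; the illustrative variable margins plays the role of the vector \(\textbf{o}\), here storing the running weighted sum \(\sum _{j\le t}\alpha _jh_j(x_i)\) (its sign, and hence Eq. (5), is the same as that of the partial ensemble).

```python
import numpy as np

def update_margins(margins, alpha, weak_pred):
    """O(n) update of the running ensemble scores sum_j alpha_j * h_j(x_i)."""
    return margins + alpha * weak_pred

# Usage inside the boosting loop (margins is initialized as np.zeros(n)):
#   margins = update_margins(margins, alpha, weak_pred)
#   fnr, fpr = cumulative_rates(margins, y)   # Eq. (5) in O(n), no re-prediction needed
```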

4.3 AdaCC2

The second proposed algorithm, AdaCC2, modifies the weight update formula of AdaBoost (Eq. (1)) using the cumulative costs (Eq. (6)) as follows:

$$\begin{aligned} D^{t+1}(i) = \frac{D^t(i)C^{t}_i\exp {(-\alpha _ty_ih_t(x_i)})}{Z_t} \end{aligned}$$
(14)

Similarly to AdaCC1, the normalization factor \(Z_t\) is also updated to ensure \(D^{t+1}\) is still a probability distribution:

$$\begin{aligned} Z_t = \sum \limits _{i=1}^n D^t(i) C^t_i\exp {(-\alpha _ty_ih_t(x_i))} \end{aligned}$$
(15)

Error analysis: Following the same reasoning used to unravel Eq. (7) for AdaCC1, by unravelling Eq. (14) we obtain the following:

$$\begin{aligned} D^{t+1}(i) = \frac{D^1(i)\prod \limits _{j=1}^t C^{j}_i\exp {(-\alpha _j y_i h_j(x_i))}}{\prod \limits _{j=1}^t Z_j} \end{aligned}$$
(16)

Similarly to AdaCC1, the upper bound of the training error of the final ensemble H(x) is given by:

$$\begin{aligned} Pr_{i\sim D^1} [H(x_i) \ne y_i] \le \sum \limits _{i=1}^n D^1(i) \prod \limits _{t=1}^T C^{t}_i\exp {(-\alpha _t y_i h_t(x_i))} = \prod \limits _{t=1}^TZ_t \end{aligned}$$
(17)

Following a rationale similar to that of AdaCC1 (Eqs. (11) and (12)), the \(\alpha _t\) that minimizes \(Z_t\) is given by:

$$\begin{aligned} \alpha _t = \frac{1}{2}\log \left( \frac{\sum \limits _{i,y_i=h_t(x_i)}^n C_i^tD^t(i) }{\sum \limits _{i,y_i\ne h_t(x_i)}^n C_i^tD^t(i)} \right) \end{aligned}$$
(18)

To ensure that \(\alpha _t\) is non-negative, the same condition as in Eq. (13) for AdaCC1 should hold, otherwise the iteration process terminates.

Time complexity: AdaCC2 has the same time complexity as AdaCC1 since their only difference pertains to the weight estimation.
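For comparison with AdaCC1, a sketch of the corresponding AdaCC2 step (Eqs. (14), (15) and (18)) is given below; as before, the names are illustrative only. The only difference is that the cost multiplies the weight outside the exponent and, consequently, the form of \(\alpha _t\).

```python
import numpy as np

def adacc2_round(D, y, weak_pred, costs):
    """One AdaCC2 re-weighting step: the cost acts as a factor outside the exponent."""
    correct = np.sum(costs[weak_pred == y] * D[weak_pred == y])
    wrong = np.sum(costs[weak_pred != y] * D[weak_pred != y])
    if correct <= wrong:                                   # same stopping condition as Eq. (13)
        return None, D
    alpha = 0.5 * np.log(correct / wrong)                  # Eq. (18)
    D_next = D * costs * np.exp(-alpha * y * weak_pred)    # Eq. (14), numerator
    D_next /= D_next.sum()                                 # normalization factor Z_t, Eq. (15)
    return alpha, D_next
```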

5 Evaluation setup

We compare our proposed AdaCC1 and AdaCC2 against 12 state-of-the-art cost-sensitive boosting approaches (Sect. 5.2) as well as 3 data-level methods (SMOTE, Random Over-Sampling and Random Under-Sampling) and 2 model-based methods (SMOTEBoost and RUSBoost) using suitable class-imbalance performance evaluation metrics (Sect. 5.1). We have experimented with a large number of real-world datasets (27), exhibiting various characteristics in terms of class imbalance, dimensionality and cardinality. An overview of the datasets is provided in Table 4. We have used the same pre-processing on all datasets whenever categorical attributes were present, i.e., one-hot encoding. All datasets were stored as numpy arrays. In addition, all the classification methods employed in this paper were trained on the exact same pre-processed data. The goal of our evaluation is twofold: to compare the different methods in terms of their predictive performance for both classes (Sect. 6.1), and to analyze and compare the internal behavior of our methods with the other approaches in order to understand/explain our methods’ superior performance (Sect. 6.2).

For our experiments, we use decision stumps, i.e., decision trees of depth 1, as weak learners for all methods. Regarding the number of weak learners T, we experiment with different values \(T \in [25, 50, 100, 200]\). For the predictive performance experiments (Sect. 6.1), we report on the average of 10 \(\times \) 5-fold cross-validation. These results are also used for the Friedman significance test with Bonferroni correction, which validates significance across multiple datasets and methods [5]. For the experiments on the internal behavior (Sect. 6.2), we do not perform any split; rather, we train on the complete datasets. By using the entire datasets for training, we avoid fluctuating values which can make the internal analysis of our methods misleading.
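The evaluation protocol for the predictive-performance experiments can be sketched as follows, using scikit-learn's RepeatedStratifiedKFold for the 10 \(\times \) 5-fold cross-validation; this is our reconstruction of the setup, not the authors’ original code.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def cross_validate(model_factory, X, y, metric):
    """Average a metric over 10 repetitions of 5-fold cross-validation."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        model = model_factory()                  # fresh, untrained model for each fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(metric(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores)
```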

Table 4 Datasets

5.1 Performance metrics

Due to the imbalanced nature of the learning problem, we report on AUC, balanced accuracy, f1-score, gmean, TNR, and TPR. Following similar logic to [6], we also use a combined overall performance measure (OPM), which averages the aforementioned metrics, since no algorithm outperforms the others on all datasets and metrics. All metrics (except AUC, which employs the confidence scores of the predictions) can be derived from the confusion matrix of Table 2 as shown in Table 5.

Table 5 Performance metrics

Due to the large number of datasets, we cannot report on each individual dataset; therefore, similarly to [6, 35, 52], we omit individual dataset results and report on the average across all datasets.
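For reference, the metrics of Table 5 and the combined OPM can be computed from the confusion matrix as sketched below; AUC additionally requires the confidence scores. This is an illustrative reconstruction with our own function name.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """Metrics of Table 5 plus AUC and the combined OPM (their average)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
    tpr = tp / (tp + fn)                       # recall on the minority (positive) class
    tnr = tn / (tn + fp)                       # specificity on the majority class
    metrics = {
        "AUC": roc_auc_score(y_true, y_score), # uses confidence scores, not hard labels
        "bal_acc": (tpr + tnr) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "gmean": np.sqrt(tpr * tnr),
        "TNR": tnr,
        "TPR": tpr,
    }
    metrics["OPM"] = np.mean(list(metrics.values()))
    return metrics
```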

5.2 Competitors and parameter selection

Our main competitors are 12 cost-sensitive boosting methods, namely, AdaCost (\(\beta _2\)) [11], AdaC1 [46], AdaC2 [46], AdaC3 [46], AdaMEC [34], AdaMEC-Cal. [35], CGAda [25,26,27], CGAda-Cal. [35], CSB1 [47], CSB2 [47], and RareBoost [23]. We also employ the vanilla AdaBoost [42] to show the differences between cost-sensitive and standard boosting methods. The methods (including ours) are summarized in terms of their key characteristics in Table 3 (as already mentioned, AdaMEC-Cal. and CGAda-Cal. are excluded since they are the post-processed versions of AdaMEC and CGAda, respectively). Except for AdaBoost, RareBoost and our AdaCC1 and AdaCC2 methods, all other methods need to be initialized with the misclassification cost matrix [\(C_+, C_-\)]. As already discussed, finding the right costs is a tedious task requiring domain/dataset knowledge. To this end, we follow the suggestion of [35, 46] to use grid search for selecting the best class ratio for the misclassification costs. In particular, for each dataset, we perform grid search over a variety of different class ratios, namely with \(C_+ = 1.0\) and by varying \(C_-\) in the range \([0.1, 1.0]\) with step 0.1. We select the class ratio which achieves the best f1-score, as suggested by [35, 46]. Grid search is performed on each fold (on the training set) and each value of \(T \in [25, 50, 100, 200]\); therefore, for all 10 iterations and for each different fold, the competitors are fine-tuned.
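A sketch of this cost grid search is given below: \(C_+\) is fixed to 1.0, \(C_-\) varies from 0.1 to 1.0 with step 0.1, and the ratio achieving the best f1-score on the training data is selected. The callback train_and_predict stands for training a given cost-sensitive booster with the candidate costs and returning its training-set predictions; it is a placeholder of ours, not a function of any library.

```python
import numpy as np
from sklearn.metrics import f1_score

def grid_search_costs(train_and_predict, X_train, y_train):
    """Select the negative-class cost C_- that maximizes f1 on the training set."""
    best_cn, best_f1 = None, -np.inf
    for cn in np.arange(0.1, 1.01, 0.1):       # C_+ = 1.0, C_- in [0.1, 1.0] with step 0.1
        y_pred = train_and_predict(X_train, y_train, c_pos=1.0, c_neg=cn)
        score = f1_score(y_train, y_pred)
        if score > best_f1:
            best_cn, best_f1 = cn, score
    return 1.0, best_cn
```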

We have combined the three data-level methods with a decision tree classifier. We augmented the minority class until the class imbalance was eliminated, i.e., both classes had the same number of instances. For under-sampling, we removed instances from the majority class until both classes had the same number of instances. For the model-based methods, we used the default parameters, e.g., for SMOTEBoost we set \(k=5\); for both SMOTEBoost and RUSBoost we varied the number of weak learners as before.
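Assuming the commonly used imbalanced-learn package (an assumption on our side; the paper does not name the implementation), the data-level baselines combined with a decision tree could look as follows.

```python
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier

samplers = {
    "SMOTE": SMOTE(k_neighbors=5),     # interpolate new minority instances
    "ROS": RandomOverSampler(),        # duplicate minority instances until balance
    "RUS": RandomUnderSampler(),       # drop majority instances until balance
}

def fit_data_level_baseline(name, X_train, y_train):
    """Re-balance the training set to a 1:1 class ratio, then fit a decision tree."""
    X_res, y_res = samplers[name].fit_resample(X_train, y_train)
    return DecisionTreeClassifier().fit(X_res, y_res)
```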

In addition, we evaluate the impact of the cumulative misclassification costs (Eq. (6)), which allow us to dynamically adjust the costs based on the performance of the partial ensemble and are central to our approach. To this end, we compare AdaCC1 and AdaCC2 with their non-cumulative counterparts, denoted by AdaN-CC1 and AdaN-CC2, respectively. The only difference is that the non-cumulative versions do not take into consideration the cumulative error of the partial ensemble, but rather rely on each individual weak learner to estimate the misclassification costs for the next round. More concretely, the partial ensemble up to round t, i.e., \(\sum \limits _{j=1}^t \alpha _jh_j(x)\) in Eq. (5), is replaced by the corresponding weak learner in round t, i.e., \(h_t(x)\).

6 Experiments

We split the experiments into two categories: (i) predictive performance (Sect. 6.1) and (ii) internal analysis (Sect. 6.2). In the first category, we compare the predictive performance of our methods against the other cost-sensitive boosting competitors using the metrics from Sect. 5.1. Although the aim of this work is to compare cost-sensitive boosting methods, we also highlight in Table 10 the performance of data-level methods such as SMOTE [3] (where the number of neighbors \(k=5\)), Random Under-Sampling (RUS) and Random Over-Sampling (ROS), combined with decision tree classifiers. Also, we employ boosting class-imbalance methods such as SMOTEBoost [4] and RUSBoost [44]. In the second category, we show how our method differs from the others by examining the internal behavior of each method.

6.1 Predictive performance

In this section, we begin by comparing the performance of our method against the employed competitors. We continue by comparing AdaCC with its non-cumulative counterpart AdaN-CC. Note that the performance results, in terms of different evaluation metrics shown in Tables 6 and 7, are averaged over all datasets. Afterwards, we report on the ranking of each method based on the datasets. Finally, we report on the statistical significance of our results.

AdaCC versus competitors: We begin our analysis with the main competitors in Table 6. AdaCC1 and AdaCC2 are the best in terms of balanced accuracy, gmean, recall (TPR) and OPM (AdaCC1 is also best in AUC). AdaMEC-Cal. follows with a [1.27–1.77%] relative decrease in OPM (its difference to AdaCC2 is very small), a [3.44–3.57%] relative decrease in balanced accuracy and a [4.78–5.16%] relative decrease in gmean compared to our best performing method (AdaCC2). The fourth best performing method is CGAda-Cal. with a [1.54–2.39%] relative decrease in OPM, a [4.13–4.49%] decrease in balanced accuracy and a [5.82–6.48%] relative decrease in gmean compared to our best performing method (AdaCC2). In terms of balanced accuracy, gmean and recall, AdaCC1 and AdaCC2 have the best performance. A closer look at the TPR and TNR scores shows that our approaches achieve the best performance for the minority class (higher TPR), while maintaining a moderate performance for the majority class (TNR close to average).

Table 6 Results for various evaluation metrics
Table 7 Results for various evaluation metrics for the comparison of AdaCC1/2 versus AdaN-CC1/2

As expected, AdaBoost, which does not tackle imbalance, achieves the highest TNR but lowest TPR. The cost-sensitive competitors are able to produce higher TPR scores than AdaBoost, but still fail to learn the minority class effectively, e.g., AdaC1, AdaC2 and AdaC3 produce [73.3–78.25%] balanced accuracy, [65.44–75.4%] gmean and [61.04–68.21%] TPR scores, which are significantly lower than those of our methods.

The competitive performance of AdaMEC-Cal. and CGAda-Cal. is mainly driven by their high TNR, as their recall remains low. AdaMEC-Cal.’s relative difference in recall is [13.87–17.28%] lower than our approaches, and for CGAda-Cal. the relative difference is [15.7–17.98%] lower. RareBoost also calls for special mention as it performs poorly on the minority class but achieves the second best TNR scores. Its outlying behavior is probably related to its strong assumption that \(TP > FP\), which cannot always be ensured.

The obtained results indicate that the cost-sensitive boosting competitors produce higher balanced accuracy than AdaBoost, but they fail to outperform our methods, as indicated by the balanced accuracy, gmean, recall, and AUC metrics. In addition, some competitors such as AdaCost, AdaC2, AdaC3, CSB1, and CSB2 do not improve their performance for higher values of T, in contrast to other competitors. One possible reason for the sub-optimal performance of the competitors might be non-optimal misclassification cost selection by the grid search. Our methods avoid this by dynamically adjusting the misclassification costs in each boosting round based on the cumulative behavior of the model.

Cumulative versus non-cumulative: We continue by comparing our methods, AdaCC1 and AdaCC2, with their non-cumulative counterparts, namely AdaN-CC1 and AdaN-CC2, in Table 7. By comparing AdaCC1 to AdaN-CC1 we observe a relative decrease of [16–17.47%] in balanced accuracy, [42.29–56.36%] in gmean, [37.79–63.47%] and [10.01–19.94%] in AUC. There are also high (relative) differences between AdaCC2 and AdaN-CC2. These differences highlight the superiority of the cumulative costs in the reweighting procedure on each boosting round versus the non-cumulative costs.

Ranking: We also report on the ranks based on balanced accuracy across the methods in Table 8, for \(T=200\) (tables for \(T \in [25, 50, 100]\) are included in the “Appendix”). Note that Table 8 contains floats instead of integers because in many datasets some methods produced the same balanced accuracy score.

There are some interesting observations from this table. AdaCC1 and AdaCC2 are the best and second-best in ranks with an average rank of 2.06 and 3.19, respectively, in contrast to the competitors; however, methods such as AdaMEC-Cal. and CGAda-Cal. also rank well. Furthermore, the last row of Table 8 shows the number of datasets for which each method achieved the best performance. Our approaches, AdaCC1 and AdaCC2, won on 10 and 8 datasets, respectively, while for the majority of datasets AdaCC1 or AdaCC2 were the best or second best methods. Similar behavior can also be observed for other values of T, where AdaCC1 and AdaCC2 achieve the best ranking scores, e.g., AdaCC1 achieves the best ranking for \(T \in [25,50,100]\) with values 2.30, 2.33 and 2.11, respectively, and AdaCC2 achieves the second best ranking with scores 2.70, 2.41 and 2.52.

In Table 10, we also compare non-cost-sensitive methods with our approach. We have used three well-known data-level methods, namely SMOTE, Random Over-Sampling (ROS) and Random Under-Sampling (RUS), combined with a decision tree classifier, and also two model-based boosting methods, namely SMOTEBoost and RUSBoost. As we can see, AdaCC performs better than the other methods in terms of balanced accuracy, gmean, AUC and OPM. It is also visible that RUSBoost is able to maintain extremely high TPR scores; however, it under-performs in terms of TNR, in contrast to AdaCC, which maintains both TNR and TPR at high levels. Interestingly, by comparing Tables 6 and 10, we can observe that the non-cost-sensitive methods are able to outperform several cost-sensitive methods.

Statistical significance: Finally, for the comparison of the cost-sensitive methods we have performed the Friedman test (\(p < 0.05\)) with the Bonferroni correction [5] for comparing multiple methods across multiple datasets. The results can be seen in Table 9, in which non-significant values are highlighted in bold. As we see, AdaCC1 and AdaCC2 are not significantly different from each other across the various values of T, but they are significantly different from the other competitors. One interesting observation is that for high T, AdaMEC-Cal. and CGAda-Cal. are able to produce results similar to our methods.

Table 8 Comparative balanced accuracy ranks across the entire set of methods and datasets (smaller values are better) for \(T=200\)
Table 9 Friedman test: p-values for all competitors
Table 10 Results for various evaluation metrics for non-cost-sensitive class-imbalance methods

6.2 Internal analysis

We begin the internal analysis by comparing our methods, AdaCC1 and AdaCC2, with their corresponding non-cumulative versions, namely AdaN-CC1 and AdaN-CC2, introduced in Sect. 5.2. Then, we compare our methods with the competitors w.r.t. in-training instance re-weighting, \(\alpha \) estimation, feature importance, confidence scores and decision boundaries (similar to the toy example in Fig. 1).

Cumulative versus non-cumulative costs: In Fig. 2 we compare AdaCC1/2 and AdaN-CC1/2 on the TPR and TNR values per boosting round (averaged over the datasets, \(T=200\)). Figure 2a shows the in-training TPR scores over the boosting rounds. It is clear that the cumulative versions, i.e., AdaCC1 and AdaCC2, are by far better and more stable than the non-cumulative ones, AdaN-CC1 and AdaN-CC2. Figure 2b shows the in-training TNR scores over the boosting rounds. The non-cumulative versions are better than the cumulative ones. However, they exhibit high fluctuation as they rely on point-in-time estimates of the misclassification costs (i.e., based on individual weak learners) compared to the cumulative methods, which rely on cumulative estimates (i.e., based on the partial ensemble). These experiments demonstrate the importance of the cumulative misclassification cost estimation for the stability of the model. Also, in terms of predictive performance, we have seen (cf. Table 7) that the non-cumulative methods, AdaN-CC1 and AdaN-CC2, produce significantly worse results than AdaCC1 and AdaCC2.

Fig. 2
figure 2

Cumulative vs non-cumulative misclassification cost estimation (left:TPR, right:TNR)

Fig. 3
figure 3

In-training behavior over the boosting rounds (for \(T=200\))

Model performance analysis: The experiments thus far demonstrate the superior behavior of AdaCC1 and AdaCC2 compared to state-of-the-art cost-sensitive boosting approaches. Hereafter, we explain this behavior through additional experiments on the internal behavior of the models, assessed by: (i) positive (minority) class weight assignments over the boosting rounds (Fig. 3a), (ii) alpha values over the boosting rounds (Fig. 3b), (iii) in-training balanced error over the boosting rounds (Fig. 3c), (iv) feature importance (Fig. 4) on a given dataset (mammography), (v) confidence scores (Fig. 5), and (vi) decision boundaries (Fig. 6). Moreover, AdaMEC, AdaMEC-Cal. and CGAda-Cal. are omitted from these experiments (except the decision boundary analysis). The reason is that AdaMEC is built on top of a trained AdaBoost model, by shifting its decision boundary towards the target class, while AdaMEC-Cal. and CGAda-Cal. are calibrated versions of AdaMEC and CGAda.

In-training analysis: For the in-training analysis, we set \(T=200\) and show the behavior of each method per boosting round. The weights of the minority class over the boosting rounds are shown in Fig. 3a; as we can see, AdaCC1 and AdaCC2 behave differently from the competitors by starting with very high weights during the first boosting rounds, which afterwards converge to 0.5. The other methods increase the positive weights gradually over the rounds. Our methods tackle the class imbalance problem during the early boosting rounds by assigning cumulative misclassification costs to the minority class and then reduce these costs (dynamically) as soon as the TPR scores are close to the TNR scores.

In terms of \(\alpha \) values, which control how much the weak learners contribute to the final ensemble (Fig. 3b), the methods depict a similar behavior with \(\alpha \) decreasing over the boosting rounds. A notable exception is RareBoost, which utilizes positive and negative \(\alpha \) values to estimate the weight distribution per round; thus, it is expected for its \(\alpha \) values to fluctuate. Our methods do not differ from the other competitors (excluding RareBoost); weak learners in the early boosting rounds (e.g., \(t <10\)) are more influential to the final outcomes (higher \(\alpha \) values).

In Fig. 3c the in-training balanced error over the boosting rounds is shown. As we can see, our methods achieve the lowest error. Moreover, AdaCC1 and AdaCC2 reduce the balanced error faster than any other method, and converge after a sufficient number of boosting rounds. The abrupt reduction of the error is directly related to the rapid increase of the positive weights in the initial boosting rounds.

Fig. 4
figure 4

Feature importance of mammography dataset (the higher, the more important the feature)

Feature importance: In Fig. 4 we illustrate the feature importance for each method on the mammography dataset. We have selected this dataset since it has low dimensionality (6 features) and a high class imbalance ratio (1:42). Figure 4 shows the importance of each feature that is employed by each method to make a decision (weights are normalized to be a distribution). Note that each weak learner is a decision stump, which means that it selects only one feature for splitting the dataset. The feature importance is measured as follows: each ensemble consists of T weak learners and each weak learner is trained on a different data distribution. Since we have employed decision stumps (decision trees of depth 1), each weak learner uses only one split; therefore, it uses only one feature. Based on the data distributions provided by its updating strategy, the weak learners of AdaCost do not use some features according to the splitting criterion. In addition, some models (e.g., AdaCost) may terminate their boosting rounds earlier than others based on their stopping criterion, which can lead to some features being ignored. Although the feature importance does not indicate which method is the best, it clearly shows that each method utilizes the features differently based on its weighting strategy, e.g., AdaCC1 relies more on features 4 and 5 and less on features 1 and 3 compared to AdaCC2.
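One plausible reading of this computation, counting for each ensemble how often every feature is chosen by its decision stumps and normalizing the counts to a distribution, is sketched below; whether the counts are additionally weighted by \(\alpha _t\) is not specified above, so the unweighted variant is an assumption on our side.

```python
import numpy as np

def stump_feature_importance(stumps, n_features):
    """Fraction of weak learners (decision stumps) that split on each feature."""
    counts = np.zeros(n_features)
    for h in stumps:
        counts[h.tree_.feature[0]] += 1    # a fitted depth-1 sklearn tree splits on exactly one feature (root node)
    return counts / counts.sum()           # normalize the counts to a distribution
```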

Confidence analysis: In Fig. 5, we compare the confidence scores of the different methods for two ensemble sizes, \(T=25\) (Fig. 5a) and \(T=200\) (Fig. 5b), and separate them into three categories: positive (left), negative (middle) and overall (right) confidence scores. Note that misclassified instances have confidence scores less than 0 on the x-axis (values closer to 0 on the x-axis indicate lower confidence in the predictions, correct or wrong). Also, the area under the line in the range \([-1, 0]\) on the x-axis shows the proportion of misclassified instances.

Fig. 5
figure 5

Effect of boosting rounds on the confidence scores (left:positive class, middle:negative class, right:overall)

At a first look at the overall confidence scores, we see that AdaCC1 and AdaCC2 produce low misclassification rates, as the area under the line in the \([-1,0]\) range of the x-axis is small. However, other methods achieve similar results. Therefore, we need to analyze the confidence scores of each class separately, since the minority (positive) class is overshadowed by the majority (negative) class. As expected, AdaBoost has the highest misclassification confidence score in the positive (minority) class since it effectively learns only the negative (majority) class. AdaCC1 and AdaCC2 have the lowest misclassification confidence scores for \(T=25\), and reduce them even further as the number of weak learners increases, i.e., for \(T=200\). For the negative class, our approaches are able to reduce the misclassified confidence scores as the number of weak learners increases. Other competitors are able to reduce the positive misclassification confidence scores; however, their misclassification confidence scores (area under the line) for the negative (majority) class increase, e.g., CSB1, AdaC3. This highlights once more that the ability to adjust the weights during training is crucial to maintain good predictive performance across both classes. Note that figures for intermediate values of T are included in the “Appendix”; they depict this gradual behavior.

Fig. 6
figure 6

Decision boundaries of methods on the same imbalanced toy dataset of 10 blue and 30 red instances. Dot size is proportional to the weight allocated by each weak learner to the particular instance, making clear how each method assigns weights to minority class instances compared to those of the majority class

An interesting observation is that the cost-sensitive methods become less confident in their correctly classified instances (for both classes) as the number of weak learners increases. It seems that, as they learn more, their mistakes are reduced, but they also become less confident in their correct decisions.

Decision boundary analysis: Finally, we generate an imbalanced dataset similar to the toy dataset in Fig. 1 of 40 instances (30 red and 10 blue) with 2 features (for better visualization). We train each method on the same dataset and then show the decision boundaries learned from the training set. Since the dataset has only two features \(x_1\) and \(x_2\), we use a small number of weak learners (\(T=5\)). In Fig. 6 we show the decision boundaries of all methods and how each method changes the weight distribution over the boosting rounds.

As we can see, AdaBoost gives more emphasis to the majority (red) class, as it tunes for overall classification accuracy. AdaCC1 and AdaCC2 behave similarly on this particular dataset by properly partitioning the space, giving emphasis to the minority class without deteriorating the performance on the majority class (2 blue misclassified points versus 4 red misclassified points). AdaMEC and AdaMEC-Cal. cannot find good misclassification costs through grid search; therefore, their behavior is similar to AdaBoost (grid search selects \(C_N = 1\) as the best cost, which makes them behave like AdaBoost). The misclassification cost selection of the competitors is based on the performance of the final ensemble, while our methods dynamically adapt their misclassification costs in each boosting round. CGAda, CSB1, CSB2 and AdaC2 partition the space to allow higher recall scores; however, they misclassify 12 red points. AdaCost, AdaC1 and AdaC3 perform even worse by misclassifying 19 red points. Interestingly, RareBoost partitions the space in a safe way, e.g., it correctly classifies 5 blue points and the majority class.

7 Conclusions and future work

In this work we present a novel strategy for cost-sensitive boosting that exploits the cumulative behavior of the model to dynamically balance the misclassification costs on each boosting round.

Existing approaches require a user-defined fixed misclassification cost matrix as input. In most cases this results in additional hyperparameters which need to be optimized jointly with the basic parameters, e.g., using grid search. As grid search does not ensure a good cost selection, it might hurt the model’s overall predictive performance. Our methods’ ability to produce consistent improvements in different measures, e.g., [0.3–28.56%] for AUC, [3.4–21.4%] for balanced accuracy, [4.8–45%] for gmean and [7.4–85.5%] for recall, indicates the general applicability of our method. The high recall scores demonstrate that our method is especially helpful for domains in which low recall has a disastrous impact. Moreover, we have shown the superior performance of such cumulative models compared to their non-cumulative counterparts, in terms of both predictive performance and model stability. Finally, our method comes with theoretical guarantees w.r.t. the training error and reduces the need for hyper-parameter optimization.

In the future, we will consider multi-class extensions of our method. Furthermore, we plan to investigate our method’s application to the supervised online learning task. Our method’s ability to dynamically adjust the misclassification costs makes it suitable for such a task, in contrast to a recent online cost-sensitive boosting extension of AdaC2 [48].