1 Introduction

Machine learning classifiers are quickly becoming a tool of choice in application areas ranging from finance to robotics and medicine. This is largely owing to the growing availability of labeled training data and declining computing costs. When applied correctly, machine learning classifiers have the potential to improve safety and efficiency and reduce costs. However, many of the most important domains, such as those related to health and safety, are limited by the problem of class imbalance. In binary classification, class imbalance occurs when the prior probability of one class (referred to as the minority class) is significantly lower than the prior probability of the other class (the majority class).

The induction of binary classifiers on imbalanced training data results in a predictive bias toward the majority class and has been associated with poor performance during application (Branco et al., 2016). Detailed empirical studies have demonstrated that class imbalance exacerbates the difficulty of learning accurate predictive models from complex data involving class overlap, sub-concepts, non-parametric distributions, etc. (He & Garcia, 2009; Stefanowski, 2016).

Traditional methods of improving the predictive performance of classification models trained on imbalanced data involve resampling (randomly undersampling the majority class, randomly oversampling the minority class, or generating additional synthetic minority samples) or cost-adjustment (Branco et al., 2016). Synthetic minority sampling methods, such as SMOTE and its derivatives (Chawla et al., 2002; Han et al., 2005; He et al., 2008; Barua et al., 2012; Bellinger et al., 2016), generate synthetic minority samples to balance the training set. Generation-based methods of this nature are widely applied because they are classifier independent and can reduce the risk of overfitting.

In addition to increasing the difficulty of learning, imbalanced training data often arises in sensitive application domains that exhibit asymmetric misclassification costs (Wallace & Dahabreh, 2012). For example, in medicine, misclassifying benign cases as cancerous (false positives) can have negative consequences in terms of mental anguish and additional tests. Whilst false positives should be kept to a minimum, misclassifying a cancerous case as benign (a false negative) can significantly increase costs in terms of delayed treatment and premature death. In domains of this nature, additional effort must be made to induce a classifier with good predictive performance on the minority class.

To achieve satisfactory performance on sensitive imbalanced domains with asymmetric misclassification costs, the resampling strategy ought to prioritize high recall whilst having minimal impact on precision. In this work, we propose a refinement of the CCR algorithm (Koziarski & Wożniak, 2017) that utilizes a radial-based (RB) approach to calculating class potential in order to satisfy this objective. Specifically, CCR is a resampling algorithm that cleanses majority class training samples and randomly generates synthetic minority samples in the regions around the minority class. Whilst this technique has been shown to improve the recall of the induced classifier, the specific resampling strategy employed may limit the improvement in recall and risks harming precision. To improve upon this, we propose the RB-CCR resampling algorithm. It focuses the generation process on sub-regions of the data-space that satisfy the user-specified class potential targets. The ability to do this gives the user better control over the precision-recall trade-off, enabling, for example, higher recall on domains for which this is critical.

We empirically compare RB-CCR to CCR and the state-of-the-art resampling methods on 57 benchmark datasets with 9 classifiers. Our empirical results show that resampling with RB-CCR can be exploited to control the precision-recall trade-off in a domain-appropriate way. On average, RB-CCR outperforms the state-of-the-art alternatives in terms of AUC and G-mean.

The main contributions of this paper can be summarized as follows:

  • Proposition of the RB-CCR resampling algorithm, which employs the radial-based approach to calculating class potential, so that a classifier trained on the modified data achieves higher recall with less impact on precision.

  • Analysis of the impact of the sampling region on the algorithm's behavior and performance.

  • Showing that the proposed method can outperform the original CCR algorithm.

  • Experimental evaluation of the proposed approach based on diverse benchmark datasets and a detailed comparison with the state-of-the-art approaches.

The paper is organized as follows. The next section discusses the related work and situates RB-CCR with respect to the state of the art in imbalanced binary classification. Section 3 provides the details of CCR and RB-CCR, demonstrates resampling with RB-CCR, and contrasts its run-time complexity with that of CCR. In Sect. 4, we describe the experimental setup and report the results along with our analysis. Finally, Sect. 5 includes our concluding remarks and a discussion of future work.

2 Related work

Imbalance ratio (IR) (García et al., 2012) is defined as the ratio between the number of majority and minority class observations. A moderate to high IR (typically greater than 10 : 1) can pose a significant challenge to learning a sufficiently accurate classifier across all classes. This is particularly the case when it is combined with other adverse data properties, such as class overlap, sparsity, complex clustering, and noise (He & Garcia, 2009; Napierala & Stefanowski, 2012). In such cases, the classifier is at great risk of becoming biased towards the majority class (He & Garcia, 2009), and/or overfitting the training data (Chen et al., 2008). Problems of this nature are a focus of intense research (Chawla et al., 2002; Bunkhumpornpat et al., 2009; Kubat & Matwin, 1997).

Measuring the quality of a model on imbalanced data requires some attention. It is well-known that using classic metrics, such as accuracy and error rate, on imbalanced datasets can cause misleading interpretations of the efficacy of the model (Jeni et al., 2013). As a result, the imbalanced learning community has shifted to using metrics such as precision, recall (sensitivity), specificity, G-mean, the \(F_{\beta }\) score, and AUC (Kubat et al., 1997; Krawczyk, 2016). More recently, however, it has been noted that the widely used \(F_{\beta }\) score and AUC can be sub-optimal for evaluating performance on imbalanced data. Brzezinski et al. (2019) demonstrated that the \(F_{\beta }\) score is usually more biased towards the majority class than AUC and G-mean. The flaws of the \(F_{\beta }\) score are also discussed in a study by Hand and Christen (2018), in which the authors suggest that, to make a fair comparison, precision and recall have to be weighted separately for each problem, depending on the imbalance ratio. Alternatively, the authors of Davis and Goadrich (2006) argue that ROC curves, and AUC by extension, can present an overly optimistic view of an algorithm's performance if there is a large skew.
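For reference, writing P for precision, R for recall (sensitivity), and S for specificity, the two combined metrics most relevant to the later discussion are defined as

$$\begin{aligned} \text {G-mean} = \sqrt{R \cdot S}, \qquad F_{\beta } = \frac{(1 + \beta ^{2}) \cdot P \cdot R}{\beta ^{2} \cdot P + R}. \end{aligned}$$

These are the standard definitions; G-mean weighs performance on both classes symmetrically, while \(F_{\beta }\) combines precision and recall only, which helps explain the majority-class bias noted above.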

Classification strategies to deal with imbalanced data can be divided into three main groups (López et al., 2012): inbuilt mechanisms, data-level methods, and hybrid methods.

2.1 Inbuilt mechanisms

In this approach, existing classification algorithms are adapted to imbalanced problems by ensuring balanced accuracy for instances from both classes. Two of the most popular research directions are the following. The first is one-class classification (Japkowicz et al., 1995), where the goal is to learn the minority class decision boundaries; because of the frequently assumed regular, closed shape of the decision borders, it is adequate for the clusters created by minority classes (Krawczyk et al., 2014). The second comprises algorithms employing kernel functions (Mathew et al., 2018) or splitting criteria in decision trees (Li et al., 2018), as well as cost-sensitive methods employing different forms of the loss function (Khan et al., 2018), in which the algorithm assigns a higher misclassification cost to instances from the minority class (Krawczyk et al., 2014; López et al., 2012; He & Garcia, 2009; Zhou & Liu, 2006). Unfortunately, such methods can cause a reverse bias towards the minority class. Also worth noting are methods based on ensemble classification (Woźniak et al., 2014), such as SMOTEBoost (Chawla et al., 2003) and AdaBoost.NC (Wang et al., 2010), or the Multi-objective Genetic Programming-Based Ensemble (Bhowan et al., 2012).

2.2 Data-level methods

This work focuses on data preprocessing, which reduces the imbalance ratio by decreasing the number of majority observations (undersampling) or increasing the number of minority observations (oversampling). After applying such preprocessing, the data can be classified using traditional learning algorithms. The most straightforward approaches to dealing with imbalanced data are Random Oversampling (ROS) and Random Undersampling (RUS). When applying ROS, new minority class instances are generated by duplicating randomly chosen minority instances. This procedure can create small, dense clusters of replicated minority objects, leading to overfitting. The most recognized data-level method is the SMOTE (Chawla et al., 2002) algorithm. It reduces the risk of overfitting by generating synthetic minority instances via random interpolation in-between existing minority objects.
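To make the interpolation mechanism concrete, the following minimal sketch (in Python, with helper names of our own choosing) generates a single SMOTE-style synthetic sample; in practice, library implementations such as SMOTE from the imbalanced-learn package can be used instead.

```python
import numpy as np

def smote_sample(x_i, x_neighbor, rng):
    """Generate one synthetic sample on the segment between a minority
    observation and one of its minority-class nearest neighbors."""
    alpha = rng.uniform(0.0, 1.0)  # random interpolation coefficient
    return x_i + alpha * (x_neighbor - x_i)

rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])
x_neighbor = np.array([2.0, 3.0])
synthetic = smote_sample(x_i, x_neighbor, rng)
```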

The well-studied limitations of SMOTE have inspired many new synthetic oversampling techniques, such as (Pérez-Ortiz et al., 2016; Bellinger et al., 2018). The most significant shortcomings of SMOTE are that it assumes a homogeneous minority class cluster, and that it does not consider the majority objects in the neighborhood when generating synthetic objects. In cases where the minority class forms many small disjointed clusters, SMOTE may cause an increase in class overlap, and thus in the complexity of the classification problem (Krawczyk et al., 2019). Numerous methods have been proposed to address these weaknesses by considering both classes during generation, or as a post-hoc cleaning step.

Safe-level SMOTE (Bunkhumpornpat et al., 2009) and LN-SMOTE (Maciejewski & Stefanowski, 2011) are specifically designed to reduce the risk of introducing noisy synthetic observations inside the majority class region. Other SMOTE alternatives aim to focus the generation process on challenging regions of the dataspace. Borderline-SMOTE (Han et al., 2005), for example, focuses the process of synthetic observation generation on the instances close to the class boundary, and ADASYN (He et al., 2008) prioritizes the difficult instances. The SWIM (Sharma et al., 2018) method uses the Mahalanobis distance to determine the best position for synthetic samples, taking into account the existing samples from both classes. Radial-Based Oversampling (RBO) (Koziarski et al., 2019) is a method that employs potential estimation to generate new minority objects using radial basis functions. The Combined Cleaning and Resampling (CCR) (Koziarski & Wożniak, 2017) method combines two techniques—cleaning the decision border around minority objects and guided synthetic oversampling.

RUS preprocesses the data by randomly removing majority class samples. It is conceptually simple but risks removing important objects from the majority class, which can cause the induced classifier to underfit less dense majority class clusters. Guided undersampling approaches aim to avoid this by analyzing the minority and majority class instances in the local neighborhood. Edited Nearest Neighbor (ENN), for example, removes a majority example if its set of three nearest neighbors does not include at least one other majority object. Radial-Based Undersampling, on the other hand, employs the concept of mutual class potential to direct undersampling (Koziarski, 2020b). Koziarski (2020a) also introduced the Synthetic Minority Undersampling Technique (SMUTE), which leverages the interpolation of nearby instances, previously introduced in the oversampling setting by SMOTE.
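As an illustration, the sketch below applies RUS and the ENN cleaning rule using the imbalanced-learn library on a synthetic dataset; the dataset parameters are arbitrary and chosen only for demonstration.

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours

# synthetic imbalanced dataset (roughly 9:1 class ratio)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# RUS: randomly remove majority samples until the classes are balanced
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)

# ENN: remove samples inconsistent with their 3-nearest-neighborhood
X_enn, y_enn = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
```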

2.3 Hybrid methods

Data preprocessing methods can be combined with inbuilt classification methods for imbalanced learning. Galar et al. (2011) proposed to hybridize under- and oversampling with an ensemble of classifiers. This approach allows the data to be independently processed for each of the base models. It is also worth mentioning SMOTEBoost, which is based on a combination of the SMOTE algorithm and the boosting procedure (Chawla et al., 2003). In addition, the Combined Synthetic Oversampling and Undersampling Technique (CSMOUTE) integrates SMOTE oversampling with SMUTE undersampling (Koziarski, 2020a).

3 Radial-based combined cleaning and resampling

In this paper, we propose an extension to the original CCR (Koziarski & Wożniak, 2017) algorithm that refines its sampling procedure. In short, CCR is an energy-based oversampling algorithm that relies on spherical regions, centered around the minority class observations, to designate areas in which synthetic minority observations should be generated. These spherical regions expand iteratively, with the rate of expansion inversely proportional to the number of neighboring observations belonging to the majority class. While computationally efficient and conceptually simple, using spherical regions to model the areas designated for oversampling has two limitations. First, it enforces a constant rate of expansion of the sphere in every direction, regardless of the exact positions of the majority neighbors. Second, it does not utilize the information about the neighboring minority class observations. We propose a novel sampling procedure that addresses these issues by refining the original spherical regions. In the remainder of this section, we describe the proposed sampling procedure and its integration with the CCR algorithm.

3.1 Guided sampling procedure

We base the proposed sampling procedure on the notion of class potential, previously used in the imbalanced data setting by Krawczyk et al. (2019). The potential function is a real-valued function that, at a given point x in space, measures the cumulative closeness to a given collection of observations \({\mathcal {X}}\). More formally, using a Gaussian radial basis function with a spread \(\gamma\), a potential function can be defined as

$$\begin{aligned} \Phi (x, {\mathcal {X}}, \gamma ) = \sum _{i=1}^{|{\mathcal {X}}|}{e^{-\left( \frac{\Vert {\mathcal {X}}_i - x \Vert _2}{\gamma }\right) ^{2}}}. \end{aligned}$$
(1)

Of particular interest in the imbalanced data oversampling task is the potential computed with respect to either the collection of majority class observations \({\mathcal {X}}_{maj}\) (majority class potential) or the collection of minority class observations \({\mathcal {X}}_{min}\) (minority class potential). Such class potential can be regarded as a measure reflecting the degree of certainty we assign to x being a member of either the majority or the minority class. It can also be used to model the regions of interest in which oversampling is to be conducted, as previously demonstrated in the Radial-Based Oversampling (RBO) (Krawczyk et al., 2019) and Sampling With the Majority (SWIM) (Bellinger et al., 2020) algorithms. SMOTE and its derivatives define the regions of interest as the lines connecting nearby minority observations, and the probability of sampling within any given region of interest is typically uniform. Alternatively, using class potential, as proposed here, offers an informationally richer framework. First of all, by using the majority class potential, we can leverage the information about the position of majority observations, which is not used by SMOTE. Secondly, when using potential, we are not constrained to sampling from within a set of lines. Rather, we can sample smoothly from the space around the minority observations. Moreover, the sampling region is non-linear, which enables it to better adapt to the underlying data distribution.
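A direct implementation of Eq. (1) is straightforward; the sketch below (a minimal Python version, with a function name of our own choosing) computes the class potential at a point x with respect to a collection of observations:

```python
import numpy as np

def class_potential(x, X, gamma):
    """Gaussian-RBF class potential of point x with respect to the
    collection of observations X (one observation per row), Eq. (1)."""
    distances = np.linalg.norm(X - x, axis=1)  # Euclidean distances
    return float(np.exp(-(distances / gamma) ** 2).sum())
```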

To reiterate, the drawbacks of the original CCR algorithm are that the sphere expansion procedure progresses at a constant rate in every direction, regardless of the exact positions of the majority neighbors, and that it does not utilize the information about the positions of neighboring minority class observations. Intuitively, neither of these is the desired behavior, since it can lead to a lower than expected expansion in the direction of minority observation clusters and a higher than expected expansion in the direction of majority observation clusters. While, in theory, an obvious modification that could address these issues would be to exchange the spheres used by CCR for more flexible shapes, such as ellipsoids, and adjust the expansion step accordingly, in practice it is not clear how the latter could be achieved. Instead, we propose to exploit the efficiency of first defining the sphere around the minority observation and then partitioning it into sub-regions, based on the class potential, to more effectively guide sample generation.

The proposed strategy partitions a given sphere into three target regions, low (L), equal (E), and high (H), based on the class potential. Synthetic samples are generated in a user-specified target region by randomly generating candidates with uniform probability throughout the sphere; a random subset of the candidates falling within the target region is then added to the training set. The target region and the number of samples are specified as parameters of the algorithm. A more detailed formulation of the proposed strategy is presented in Algorithm 1, and an illustration of the sphere partitioning procedure is presented in Fig. 1.
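The following sketch illustrates one plausible reading of the guided sampling step, reusing the class_potential helper defined above. The equal-thirds split of ranked candidates into L/E/H is our simplifying assumption for illustration; Algorithm 1 in the paper defines the exact partitioning.

```python
import numpy as np

def guided_sample(x_i, r_i, X_min, gamma, n_samples, region,
                  n_candidates=100, rng=None):
    """Sketch of guided sampling inside the sphere around x_i.
    The split of ranked candidates into equal thirds (L/E/H) is an
    illustrative assumption, not the paper's exact partition rule."""
    if rng is None:
        rng = np.random.default_rng()
    m = x_i.shape[0]
    # draw candidates uniformly inside the m-dimensional ball of radius r_i:
    # random direction, radius proportional to u^(1/m)
    directions = rng.normal(size=(n_candidates, m))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = r_i * rng.uniform(size=(n_candidates, 1)) ** (1.0 / m)
    candidates = x_i + directions * radii
    # rank candidates by minority class potential (ascending)
    potentials = np.array([class_potential(c, X_min, gamma) for c in candidates])
    order = np.argsort(potentials)
    low, equal, high = np.array_split(order, 3)
    pool = {"L": low, "E": equal, "H": high, "LEH": order}[region]
    return candidates[rng.choice(pool, size=n_samples, replace=True)]
```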

Fig. 1

An example of a sphere generated around a specific minority observation, partitioned into three regions: high potential (H), indicated with a green color, equal potential (E), indicated with a yellow color, and low potential (L), indicated with a red color. Note that the shape of the regions aligns with that of the produced potential field, indicated with a contour plot (Color figure online)

The CCR algorithm generates samples with uniform probability from within the entire sphere. Alternatively, Fig. 1 illustrates how RB-CCR divides the original sphere into three regions (L, E, H). The regions are defined according to the shape of the globally calculated minority class potential. Subsequent to the partitioning, sample generation can be restricted to a specific region. Intuitively, samples in the high potential regions can be regarded as having a higher probability of coming from the underlying minority class distribution than samples in the low potential regions. This, to some extent, parallels different variants of SMOTE, such as Borderline-SMOTE (Han et al., 2005) or Safe-Level-SMOTE (Bunkhumpornpat et al., 2009), which focus on different types of observations to guide the sampling process. However, contrary to the SMOTE variants, RB-CCR provides the flexibility to choose an appropriate sampling region for the target data within a single framework.

3.2 Integrating guided sampling with the CCR algorithm

We begin with a brief description of the original CCR algorithm, as described in Koziarski et al. (2020), where a more in-depth discussion of the design choices can be found. The algorithm consists of two main steps: first, cleaning the neighborhood of the minority observations, and second, selectively oversampling in the produced, cleaned regions. After describing the original algorithm, we discuss how it can be integrated with the proposed guided sampling procedure.

3.2.1 Cleaning the minority neighborhoods

The first step of the proposed approach is cleaning the minority class neighborhoods of majority observations. This is achieved via an energy-based approach, in which spherical regions are designated for cleaning. The size of the regions is constrained by the presence of majority neighbors and is determined in an iterative procedure, during which the spheres expand up to the point of depleting the allocated energy budget. More formally, for a given minority observation denoted by \(x_i\), the current radius of the associated sphere denoted by \(r_i\), a function returning the number of majority observations inside a sphere centered around \(x_i\) with radius r denoted by \(f_n(r)\), a target radius denoted by \(r_i'\), and \(f_n(r_i') = f_n(r_i) + 1\), we define the energy change caused by the expansion from \(r_i\) to \(r_i'\) as

$$\begin{aligned} \Delta e = - (r_i' - r_i) \cdot f_n(r_i'). \end{aligned}$$
(2)

During the sphere expansion procedure, the radius of a given sphere increases up to the point of completely depleting the energy, with the cost of expansion increasing after each encountered majority observation. Finally, the majority observations inside the sphere are pushed out to its outskirts. The whole process is illustrated in Fig. 2.
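A minimal sketch of the energy-based radius computation implied by Eq. (2) is given below; the function name and structure are our own, and edge cases of the original implementation may differ.

```python
import numpy as np

def sphere_radius(x_i, X_maj, energy):
    """Expand the sphere around minority observation x_i until the energy
    budget is depleted; each majority observation passed increases the
    per-unit cost of further expansion, following Eq. (2)."""
    distances = np.sort(np.linalg.norm(X_maj - x_i, axis=1))
    radius, remaining = 0.0, energy
    for k, d in enumerate(distances):
        cost = (d - radius) * (k + 1)  # cost of expanding past neighbor k
        if cost > remaining:
            return radius + remaining / (k + 1)
        remaining -= cost
        radius = d
    # energy left after passing all majority observations (assumed rate)
    return radius + remaining / (len(distances) + 1)
```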

Fig. 2

An illustration of the sphere creation for an individual minority observation (in the center) surrounded by majority observations (in red). The sphere expands at a normal cost until it reaches a majority observation, at which point the cost of further expansion increases (depicted by blue orbits with an increasingly darker color). Finally, after the expansion, the majority observations within the sphere are pushed outside (in green). Source: Koziarski et al. (2020) (Color figure online)

3.2.2 Selectively oversampling the minority class

After the cleaning stage is completed, new synthetic minority observations are generated in the produced spherical regions. The number of synthetic observations generated around a given minority observation is inversely proportional to the sphere's radius, calculated in the previous step. More formally, for a given minority observation denoted by \(x_i\), the radius of the associated sphere denoted by \(r_i\), the vector of all calculated radii denoted by r, the collection of majority observations denoted by \({\mathcal {X}}_{maj}\), the collection of minority observations denoted by \({\mathcal {X}}_{min}\), and assuming that the oversampling is performed up to the point of achieving a balanced class distribution, we define the number of synthetic observations to be generated around \(x_i\) as

$$\begin{aligned} g_i = \left\lfloor \frac{r_i^{-1}}{\sum _{k = 1}^{|{\mathcal {X}}_{min}|}{r_k^{-1}}} \cdot (|{\mathcal {X}}_{maj}| - |{\mathcal {X}}_{min}|) \right\rfloor . \end{aligned}$$
(3)

This procedure can be interpreted as weighing the difficult observations more heavily, similar to the technique used in ADASYN (He et al., 2008). The difficulty of an observation is determined based on the proximity of the nearest majority observations: minority observations with nearby majority neighbors will have a constrained sphere radius, which results in a higher allocation of the produced synthetic observations.
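Eq. (3) translates directly into a few lines of code; the sketch below (the helper name is hypothetical) computes the per-observation allocation from the vector of radii:

```python
import numpy as np

def synthetic_counts(radii, n_maj, n_min):
    """Number of synthetic observations per minority sample, Eq. (3):
    inversely proportional to the sphere radius, balancing classes overall."""
    inverse = 1.0 / np.asarray(radii, dtype=float)
    weights = inverse / inverse.sum()
    return np.floor(weights * (n_maj - n_min)).astype(int)
```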

3.2.3 Combining guided sampling with CCR

The proposed sampling strategy can easily be integrated into the original CCR algorithm. Instead of the original sampling within the whole sphere, RB-CCR uses the guided sampling strategy described in the previous section. The initial steps of RB-CCR are the same as in CCR; specifically, these are the sphere radius calculation, the translation of majority observations, and the calculation of the number of synthetic observations generated for each minority observation. We present the pseudocode of the proposed RB-CCR algorithm in Algorithm 2. It should be noted that, except for the addition of the guided sampling procedure, the algorithm is presented as it was previously proposed in Koziarski et al. (2020).


The behavior of the proposed algorithm changes depending on the choice of its three major hyperparameters: the RBF spread \(\gamma\), the energy used for sphere expansion, and the sampling region. The impact of \(\gamma\) is illustrated in Fig. 3. As can be seen, \(\gamma\) regulates the smoothness of the potential shape, with low values of \(\gamma\) producing a less regular contour, conditioned mainly on the position of minority neighbors located in close proximity. On the contrary, higher \(\gamma\) values produce a smoother potential, less prone to overfitting, with a smaller number of distinguishable clusters. Secondly, the value of energy affects the radius of the produced spheres, which controls the size of the sampling regions and the range of translations, as illustrated in Fig. 4. It is worth noting that as the energy approaches zero, the algorithm degenerates to random oversampling. The choice of energy is also highly dependent on the dimensionality of the data. It has to be scaled to the number of features a given dataset contains, with higher dimensional datasets requiring higher energy to achieve a similar sphere expansion. Finally, the choice of the sampling region determines how the generated samples align with the minority class potential, as demonstrated in Fig. 5. Sampling in all of the available regions (LEH) is equivalent to the original CCR algorithm; it completely ignores the potential and uses the whole sphere as the region of interest. Sampling in region E constrains samples to areas with class potential approximately equal to that of the real minority observations. Sampling in region H pushes the generated observations towards areas of the data space estimated to have a higher minority class potential. This can be interpreted as focusing the sampling process on generating samples that are safer and better resemble the original minority observations. The opposite is true for sampling in region L. This is further illustrated on a simplified dataset in Fig. 6.

Fig. 3

Visualization of the impact of the \(\gamma\) parameter on the shape of the minority class potential

Fig. 4

Visualization of the impact of the energy parameter on the sphere radius and the corresponding region in which synthetic minority observations (indicated by a dark outline) are generated. Note that the majority observations within the sphere are pushed outside during the cleaning step

Fig. 5

An example of the impact of the choice of sampling region on the distribution of generated minority observations. The baseline case, equivalent to sampling in all of the possible regions (LEH), is compared with sampling in the high (H), equal (E), and low (L) potential regions. Note that the distribution of generated observations aligns with the shape of the potential field

Fig. 6

Comparison of CCR and RB-CCR with different sampling regions on a simplified dataset

Finally, it is worth discussing how RB-CCR compares to other oversampling algorithms. An illustration of the differences between several popular methods is presented in Fig. 7, with a highly imbalanced dataset characterized by a disjoint minority class distribution used as a benchmark. As can be seen, compared to the SMOTE-based approaches, RB-CCR tends to introduce lower class overlap, which can occur for SMOTE when dealing with disjoint distributions, the presence of noise, or outliers. RBO avoids sampling in the majority class regions; however, it produces very conservative and highly clustered samples, which can cause the classifier to overfit in a manner similar to random oversampling. RB-CCR avoids this risk of overfitting with larger regions of interest. Moreover, the larger regions enable a greater reduction in the classifier's bias towards the majority class. The energy parameter facilitates the control of this behavior, with higher values of energy leading to less conservative sampling. Information provided by the class potential is used to fine-tune the shape of the regions of interest within the sphere, enabling better control of the sampling.

Fig. 7

A comparison of the data distribution after oversampling with different algorithms on a highly imbalanced dataset with disjoint minority class distributions. SMOTE introduces a high degree of class overlap; Borderline-SMOTE (Bord) solves the problem only partially, still introducing some overlap while entirely omitting oversampling around selected observations. RBO does not produce artificial overlap, but at the same time it is very conservative during sampling, in particular within the originally overlapping regions. CCR and RB-CCR produce a distribution that leads to a higher bias towards the minority class, both due to synthesizing observations around all of the instances and due to the conducted translation of majority observations, while minimizing class overlap. Compared to CCR, RB-CCR produces more constrained samples based on the underlying potential

3.3 Computational complexity analysis

Let us denote the total number of observations by n, the number of majority and minority observations by, respectively, \(n_{maj}\) and \(n_{min}\), the number of features by m, and the number of candidate samples used in a single sampling step of Algorithm 1 by c. As previously described in Koziarski et al. (2020), the original CCR algorithm can be divided into three steps: calculating the sphere radii, cleaning the majority observations inside the spheres, and synthesizing new observations, with each step performed iteratively for every minority observation. The same applies to RB-CCR, for which only the complexity of the third step differs from that of CCR.

  • As described in Koziarski et al. (2020), the first step consists of a) calculating a distance vector, b) sorting said vector, and c) calculating the resulting radius. Combined, these operations have complexity equal to \({\mathcal {O}}((m + \log {n})n^2)\).

  • As described in Koziarski et al. (2020), the second step, cleaning the majority observations inside the spheres, has complexity equal to \({\mathcal {O}}(mn)\).

  • Finally, the third step, synthesizing new observations, consists of (a) calculating the proportion of samples generated for a given observation \(g_i\), with complexity equal to \({\mathcal {O}}(n_{min})\) (Koziarski et al., 2020), and (b) sampling the synthetic observations. In the case of the original CCR algorithm, as discussed in Koziarski et al. (2020), this sub-step consists of \(n_{maj} - n_{min}\) operations of sampling a random observation inside the sphere, each with complexity equal to \({\mathcal {O}}(m)\), leading to a total complexity of the third step of CCR that can be simplified to \({\mathcal {O}}(mn)\). On the other hand, the sampling used by RB-CCR has a higher complexity due to the chosen guided strategy. In particular, when considering the procedure described in Algorithm 1, its complexity is dominated by the potential calculation for all of the candidate samples. The potential calculation, defined in Eq. (1), when computed with respect to the collection of minority class observations \({\mathcal {X}}_{min}\), consists of \(n_{min}\) summations and \(n_{min}\) RBF function computations, with the latter having complexity equal to \({\mathcal {O}}(m)\). As a result, a single computation of the minority class potential has a complexity that can be simplified to \({\mathcal {O}}(mn)\). The whole sampling step, which requires c potential function computations per minority observation, therefore has a total complexity equal to \({\mathcal {O}}(cmnn_{min})\), which can be simplified to \({\mathcal {O}}(cmn^2)\).

As can be seen, for the original CCR algorithm, the complexity is dominated by the first step and is equal to \({\mathcal {O}}((m + \log {n})n^2)\). On the other hand, in the case of RB-CCR, both the first and the third step influence the total complexity of the algorithm, which is equal to \({\mathcal {O}}((cm + \log {n})n^2)\).

4 Experimental study

To empirically evaluate the usefulness of the proposed RB-CCR algorithm, we conducted a series of experiments, the aim of which was to answer the following research questions:

  1. RQ1

    Is it possible to improve the original CCR algorithm's performance by focusing resampling in specific regions?

  2. RQ2

    Are the trends displayed by the RB-CCR consistent across different classification algorithms and performance metrics? Is it possible to control the behavior of the algorithm by a proper choice of parameters?

  3. RQ3

    How does RB-CCR compare with state-of-the-art reference methods, and how does the choice of classification algorithm affect that comparison?

4.1 Set-up

4.1.1 Datasets

We based our experiments on 57 binary datasets taken from the KEEL repository (Alcalá-Fdez et al., 2011), the details of which, namely their names, imbalance ratios (IR), and numbers of contained samples and features, are presented in Table 1. We employed a dataset selection procedure previously used in Koziarski (2020b); that is, we excluded datasets for which an AUC greater than 0.85 was achieved with a linear SVM without any resampling. Prior to resampling and classification, all datasets were preprocessed: categorical features were converted to integers first, and afterward all of the features were normalized to zero mean and unit variance.

Table 1 Summary of the characteristics of datasets used throughout the experimental study

4.1.2 Classification algorithms

During the conducted experiments, we considered classification with a total of 9 different algorithms: CART decision tree, k-nearest neighbors classifier (KNN), support vector machine with linear (L-SVM), RBF (R-SVM) and polynomial (P-SVM) kernels, logistic regression (LR), Naive Bayes (NB), and multi-layer perceptron with ReLU (R-MLP) and linear (L-MLP) activation functions in the hidden layer. We considered a relatively high number of classification algorithms to examine how the choice of base learner affects the usefulness of RB-CCR. Implementations of all of the classification algorithms were taken from the scikit-learn library (Pedregosa et al., 2011), and their default parameters were used.

4.1.3 Reference methods

In addition to the original CCR algorithm, we compared the performance of RB-CCR with several over- and undersampling strategies, namely: SMOTE (Chawla et al., 2002), Borderline-SMOTE (Bord) (Han et al., 2005), Neighborhood Cleaning Rule (NCL) (Laurikkala, 2001), and SMOTE combined with Tomek links (SMOTE+TL) (Tomek, 1976) or with the Edited Nearest Neighbor rule (SMOTE+EN) (Wilson, 1972). The hyperparameters of each resampling method were tuned individually for each dataset. The SMOTE variants considered values of the neighborhood size k in {1, 3, 5, 7, 9}. In addition to k, the Bord method considered values of its m neighborhood parameter in {5, 10, 15}. For NCL, we considered values of its neighborhood size k in {1, 3, 5, 7}. Finally, for all methods in which the resampling ratio was an inherent parameter, resampling was performed up to the point of achieving a balanced class distribution. Implementations of all of the reference methods were taken from the imbalanced-learn library (Lemaitre et al., 2017).
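For concreteness, the parameter grids described above could be instantiated with imbalanced-learn roughly as follows (a sketch: the class and parameter names are imbalanced-learn's, while the grid structure mirrors the values listed above):

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.combine import SMOTETomek, SMOTEENN

resampler_grids = {
    "SMOTE": [SMOTE(k_neighbors=k) for k in (1, 3, 5, 7, 9)],
    "Bord": [BorderlineSMOTE(k_neighbors=k, m_neighbors=m)
             for k in (1, 3, 5, 7, 9) for m in (5, 10, 15)],
    "NCL": [NeighbourhoodCleaningRule(n_neighbors=k) for k in (1, 3, 5, 7)],
    "SMOTE+TL": [SMOTETomek(smote=SMOTE(k_neighbors=k))
                 for k in (1, 3, 5, 7, 9)],
    "SMOTE+EN": [SMOTEENN(smote=SMOTE(k_neighbors=k))
                 for k in (1, 3, 5, 7, 9)],
}
```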

4.1.4 Performance metrics

We utilize 6 performance metrics for classifier evaluation: precision, recall, and specificity of the predictions, and the combined metrics AUC, F-measure, and G-mean. This set of metrics is standard in the imbalanced classification literature and provides a diverse perspective on model performance. Precision, recall, and specificity provide insight into the class-specific errors that are to be expected from each algorithm. The combined metrics, AUC, F-measure, and G-mean, provide a more holistic perspective on performance by taking into account the trade-off between the performance on the majority and minority classes. As mentioned in the related work, AUC and F-measure have previously been criticized in the context of imbalanced learning. Nonetheless, we include them, as they remain standard benchmarks in the literature and provide an orthogonal perspective on performance.
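All six metrics can be computed from the predictions and scores with scikit-learn and imbalanced-learn; a minimal sketch:

```python
from sklearn.metrics import (precision_score, recall_score,
                             roc_auc_score, f1_score)
from imblearn.metrics import geometric_mean_score, specificity_score

def evaluate(y_true, y_pred, y_score):
    """Compute the six performance metrics used in the study."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "specificity": specificity_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
        "F-measure": f1_score(y_true, y_pred),
        "G-mean": geometric_mean_score(y_true, y_pred),
    }
```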

4.1.5 Evaluation procedure

To ensure the stability of the results, we used \(5\times 2\) cross-validation (Alpaydin, 1999) in all of the experiments. Furthermore, during the parameter selection for the resampling algorithms, we used an additional 3-fold cross-validation on the training partition of the data, with AUC as the optimization criterion.
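A sketch of the evaluation loop, assuming scikit-learn's RepeatedStratifiedKFold as the \(5\times 2\) splitter (X, y, and the loop body are placeholders):

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# 5x2 CV: 2 stratified folds, repeated 5 times (Alpaydin, 1999)
outer_cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
inner_folds = 3  # 3-fold CV on the training partition for parameter selection

for train_idx, test_idx in outer_cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ...select resampler parameters by maximizing AUC over inner_folds,
    # resample X_train/y_train, fit the classifier, evaluate on the test fold
```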

4.1.6 Statistical analysis

To assess the statistical significance of the results, we used two types of statistical tests. First, we used a one-sided Wilcoxon signed-rank test in the direct comparisons between the original CCR algorithm and the proposed RB-CCR algorithm. Secondly, when simultaneously comparing multiple methods, we used the Friedman test combined with Shaffer's post-hoc procedure. In all cases, unless p-values are specified, the results are reported at the significance level \(\alpha = 0.10\).
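Both tests are available in standard Python libraries; for instance, the one-sided Wilcoxon signed-rank test on paired per-dataset scores can be run with SciPy (a sketch with placeholder arrays):

```python
from scipy.stats import wilcoxon

# paired per-dataset scores, e.g. AUC for RB-CCR vs. CCR (placeholders)
stat, p_value = wilcoxon(rb_ccr_scores, ccr_scores, alternative="greater")
rb_ccr_better = p_value < 0.10  # significance level alpha = 0.10
```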

4.1.7 Implementation and reproducibility

To ensure the reproducibility of the results, we made the following publicly available: the implementation of the algorithm, code sufficient to reproduce all of the described experiments, the statistical tests, and all of the figures presented in this paper, as well as the partitioning of the data into folds and the raw results. All of the above can be accessed online.Footnote 1

4.2 Evaluation of the choice of sampling region on the algorithm's performance

In the first stage of the conducted experimental analysis, we examined the suitability of sampling in specific regions. We compared the performance of four variants of the RB-CCR algorithm, in which sampling was performed only in the low potential region (L), only in the approximately equal potential region (E), or only in the high potential region (H), as well as the variant in which sampling was performed in all of the regions (LEH), which is equivalent to the original CCR algorithm. In all of the cases, we selected the energy parameter from {0.5, 1.0, 2.5, 5.0, ..., 100.0}. Furthermore, except for LEH sampling, we also selected the value of \(\gamma\) from {0.5, 1.0, 2.5, 5.0, 10.0}. We present a summary of the regions achieving the highest average rank for every classifier and metric combination in Table 2. Detailed p-values for the conducted experiments can be found in Appendix 1.

Table 2 A summary of sampling strategies that achieved highest average rank for a given classifier and metric combination

Several observations can be made based on the presented results. First of all, the observed performance was consistent across the classification algorithms with respect to precision, recall, and specificity, at least when comparing sampling in the H region with the remaining variants: in the case of precision, sampling exclusively in the H region produced, on average, the best performance when combined with 7 out of 9 considered classifiers, with the remaining two being NB and P-SVM. Furthermore, in the case of specificity, this behavior was observed for 8 out of 9 classifiers, once again with the exception of P-SVM. Finally, the reverse was true in the case of recall, where sampling in the H region gave the worst average rank for 8 out of 9 considered classifiers. All of the trends mentioned above were also statistically significant in the majority of cases. This indicates that the guided sampling approach has a non-random influence on the algorithm's performance and its bias towards the majority class, particularly when comparing sampling in the H region with the other variants, which is desirable behavior. Furthermore, from a general resampling perspective, this suggests that if the problem domain requires high precision or specificity, it is beneficial to focus sampling in the H region. On the other hand, if high recall is required, sampling in the L or E region is usually preferred.

However, the sampling region's impact on the combined metrics is less clear in the general case. Although the baseline variant of CCR, that is, LEH sampling, achieved the best average rank in only 1 out of 27 cases (for the combination of P-SVM and F-measure), there was usually either a complete lack of significance, meaning that there were no statistically significant differences between any of the sampling strategies, or partial significance, meaning that only some of the variants displayed statistically significant differences. Importantly, when comparing with LEH sampling, there was a statistically significant improvement with respect to all of the combined metrics for a single classifier, LR, and with respect to a single metric for the combination of L-MLP and F-measure, as well as the combination of P-SVM and G-mean. In all of the above cases, the best performing strategy was sampling in the H region. Nevertheless, for the remaining combinations of classification algorithms and performance metrics there was no clearly dominant strategy, even when at least partial significance was observed. All of the above leads to the conclusion that while sampling in specific regions has a non-random impact that is consistent across the classification algorithms with respect to direction (focusing sampling in the H region leads to statistically significantly better precision and specificity, and worse recall), the trade-off between them, observed via the combined metrics, varies depending on both the classifier and the dataset.

4.3 Comparison of CCR and RB-CCR

We have empirically demonstrated that no single sampling region is optimal for all datasets, classification algorithms, and performance metrics. This is consistent with the current state of knowledge, particularly the "no free lunch" theorem, according to which the choice of sampling strategy strongly depends on the dataset characteristics. We therefore considered the approach in which we treat the sampling region as a parameter of the algorithm and adjust it on a per-dataset and per-classifier basis. To this end, we conducted two comparisons.

First of all, we considered an idealized variant of RB-CCR, in which the region giving the best performance, chosen only from {L, E, H}, was selected individually for each dataset based on the test set results. Importantly, sampling in the LEH region was not included in the selection of available regions. This approach can be treated as an upper bound on the performance that could be achieved by restricting sampling to a specific region. This variant of RB-CCR was compared with the original CCR algorithm, with the results presented in Table 3. As can be seen, by constraining sampling to a specific region, we were able to achieve improved performance for almost every considered dataset, regardless of the choice of classifier or performance metric.

Table 3 Comparison of the original CCR algorithm with an idealized variant of RB-CCR, for which the sampling region giving the best performance was chosen individually for each dataset

Secondly, we conducted a comparison between the original CCR algorithm and RB-CCR with the sampling region chosen from {L, E, H, LEH} using cross-validation. The results of this comparison are presented in Table 4. As can be seen, when adjusting the sampling region individually for each dataset, we were able to achieve a statistically significant improvement in performance on at least one of the combined metrics for 7 out of 9 classifiers. This improvement was observed more often in the case of G-mean and AUC, and only in two cases for F-measure, which can be explained by the fact that AUC was used as the optimization criterion during cross-validation, and AUC and G-mean tend to be more correlated with each other than with F-measure. We hypothesize that the flexibility offered by the class potential regions in RB-CCR enables the samples to be generated in areas that have the greatest positive impact on the metric being optimized. The results presented in Table 2, where focusing on the high potential regions produces a significant improvement in precision and specificity, seem to support this hypothesis. Thus, using F-measure as an optimization criterion for models trained with RB-CCR would have the opposite effect to using AUC (i.e., it would produce better precision, specificity, and F-measure, since these are related, at the expense of recall, AUC, and G-mean).

Table 4 Comparison of the original CCR algorithm with RB-CCR using cross-validation to select resampling regions

The results of both of the above experiments indicate that, in principle, constraining sampling to a specific region can yield a clear performance improvement compared to the baseline approach. Using cross-validation to choose the optimal region for every case is a suitable selection strategy, resulting in a statistically significant performance improvement in most cases. Still, it falls short of the performance of the idealized variant, which indicates that either a better parameter selection strategy, more suited to imbalanced datasets, or a specific heuristic for choosing the sampling region could further improve the proposed method's overall performance.

4.4 Comparison of RB-CCR with the reference methods

In the next stage of the conducted experiments, we compared RB-CCR with several over- and undersampling reference methods. We present the average ranks achieved by all of the methods, as well as the statistical significance of the comparisons, in Table 5. Furthermore, we present a visualization of the average ranks achieved by the specific methods with respect to different performance metrics in Fig. 8. First of all, as can be seen, the general trend was that RB-CCR achieved the best recall at the expense of precision and specificity, which held true for all of the classification algorithms. As in the previous experiments, this had a varying impact on the combined metrics depending on their exact choice: when F-measure was considered, it led to statistically significantly worse performance than the reference methods. However, at the same time, it improved the performance with respect to AUC and G-mean: RB-CCR achieved the highest average rank in 17 out of 18 cases, with the only exception being AUC for the L-MLP classifier, for which it achieved the second-best rank. The results of this comparison were also statistically significant in the majority of cases: for all of the classifiers when compared to the baseline case with no resampling, Bord, and NCL; for 5 out of 9 classifiers when compared to SMOTE and SMOTE+TL; and in a single case, for NB, when compared to SMOTE+EN, which was the second-best performer. The differences between the results measured using F-measure, AUC, and G-mean can be attributed to the previously discussed bias of F-measure towards majority class performance: since RB-CCR is heavily skewed towards recall at the cost of precision, it is natural that a metric weighted more heavily towards precision produces worse performance. Still, the observed results indicate the high usefulness of the proposed RB-CCR algorithm compared to the reference methods whenever a higher cost is assigned to the misclassification of minority observations, as is the case with AUC and G-mean.

Table 5 A comparison of RB-CCR with the reference methods, with average ranks presented, and the methods for which RB-CCR achieved a statistically significantly better performance indicated with a + sign, and statistically significantly worse performance with a – sign
Fig. 8

A visualization of the average ranks achieved by the individual methods with respect to different performance metrics

Finally, in the last stage of the conducted experiments, we compared different combinations of classification and resampling algorithms to establish their relative usefulness. We present the average ranks observed for the different combined metrics in Tables 6, 7, and 8, separately for each metric. As can be seen, when F-measure was considered, RB-CCR was outperformed by the reference methods, which achieved their best performance when combined with either R-MLP or R-SVM, as did RB-CCR. However, when AUC and G-mean were considered, the combination of algorithms that achieved the highest average rank for both metrics was RB-CCR with L-MLP. Besides L-MLP, the top-performing classifiers were R-MLP, R-SVM, and LR, in that order, all achieving their best performance when combined with RB-CCR. Overall, the presented rankings indicate the importance of the performance improvement due to the resampling method for any given classification algorithm. From that point of view, out of the statistically significant improvements presented previously in Table 5, the most important were those achieved for R-MLP and R-SVM, for which RB-CCR achieved statistically significantly better performance than all of the resamplers except SMOTE+EN. On the other hand, it is worth noting that the linear methods, that is, L-MLP, LR, and L-SVM, achieved relatively high performance, occupying 3 out of the 5 spots for the highest performing classification algorithms, while at the same time showing fewer statistically significant improvements due to using RB-CCR compared to the reference methods. This may suggest the importance of further work aimed particularly at improving the performance of RB-CCR for linear methods, which seem to be particularly well predisposed to the classification of imbalanced datasets.

Table 6 Average ranks achieved by the specific combinations of classification and resampling algorithms, with AUC used as the performance metric
Table 7 Average ranks achieved by the specific combinations of classification and resampling algorithms, with F-measure used as the performance metric
Table 8 Average ranks achieved by the specific combinations of classification and resampling algorithms, with G-mean used as the performance metric

4.5 Lessons learned

Based on the described results of the conducted experiments, we will now attempt to answer the research questions raised at the beginning of this section.

RQ1: Is it possible to improve the original CCR algorithm's performance by focusing resampling in specific regions?

We demonstrated that using RB-CCR leads to statistically significantly better performance than CCR for most of the considered classification algorithms when the sampling region is determined using cross-validation. Selecting the sampling region individually for each dataset and treating it as another hyperparameter was crucial to achieving that performance improvement in most cases. Finally, we also demonstrated that in almost every case sampling in a specific region leads to better performance than unguided sampling within the whole sphere, indicating that choosing the optimal sampling region remains a major challenge that cross-validation solves only partially.

RQ2: Are the trends displayed by the RB-CCR consistent across different classification algorithms and performance metrics? Is it possible to control the behavior of the algorithm by a proper choice of parameters?

The behavior of RB-CCR was consistent with respect to precision, specificity, and recall, with sampling solely within the H region improving precision and specificity at the expense of recall, and sampling within either the L or E region having the opposite effect. As a result, it is possible to control the algorithm's bias towards the specific classes by a proper choice of the sampling region. However, the performance with respect to AUC, F-measure, and G-mean was less consistent, indicating that the choice of the sampling region yielding the optimal trade-off between precision and recall is both dataset- and classifier-specific.

RQ3: How does RB-CCR compare with state-of-the-art reference methods, and how does the choice of classification algorithm affect that comparison?

RB-CCR, on average, outperforms all of the considered reference methods with respect to recall, AUC, and G-mean, and underperforms with respect to precision, specificity, and F-measure, with statistically significant differences for the majority of methods. This indicates that RB-CCR is a suitable choice whenever performance on the minority class is the main consideration, which is usually the case in imbalanced data classification tasks. Finally, a more significant improvement in performance due to using RB-CCR was observed for non-linear classification algorithms. Combined with the fact that the linear methods, in general, achieved favorable performance on the considered imbalanced datasets, this might indicate the need for further work focused specifically on improving the results for this type of classifier.

5 Conclusions

In this work, we proposed the Radial-Based Combined Cleaning and Resampling algorithm (RB-CCR). We hypothesized that refining the resampling procedure employed by CCR could garner additional performance gains. RB-CCR uses the concept of class potential to divide the dataspace around each minority instance into sampling regions characterized by high, equal, or low class potential. Resampling is then restricted to the sub-regions with the specified characteristics, determined by cross-validation or user specification. Our results show that, in terms of the precision-recall trade-off, this is superior to uniformly resampling around the minority class instances.

Our empirical assessment utilized 57 benchmark binary datasets, 9 classification algorithms, and 5 state-of-the-art sampling techniques. The results, measured over 5-times 2-fold cross-validation, show that sampling the high potential region with RB-CCR generally produces significantly better precision and specificity, with less impact on recall, than CCR. Thus, RB-CCR achieves a better balance in the precision-recall trade-off. Moreover, on average, RB-CCR outperforms the considered reference methods with respect to recall, AUC, and G-mean.

Future work may focus on designing a better region selection method than cross-validation, including a strategy for picking regions individually for each observation, which cannot be done using cross-validation. Another potential direction is adjusting the RB-CCR algorithm to linear classifiers, which generally achieve good performance but are least affected by the choice of resampler, and likely require a more drastic shift in the synthetic observation distribution to display a significant change in classifier behavior.