RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification

Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and asymmetric misclassification costs. In such cases, the classification model must achieve a high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods ignore the local joint distribution of the data or correct it as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data-space for synthetic oversampling. The sub-region used for oversampling can be specified as an input parameter to meet domain-specific needs or be automatically selected via cross-validation. Our $5\times2$ cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally outperforms the state-of-the-art resampling methods in terms of AUC and G-mean.


Introduction
Machine learning classifiers are quickly becoming a tool of choice in application areas ranging from finance to robotics and medicine. This is largely owing to the growth in the availability of labeled training data and declining computing costs. When applied correctly, machine learning classifiers have the potential to improve safety and efficiency and reduce costs. However, many of the most important domains, such as those related to health and safety, are limited by the problem of class imbalance. In binary classification, class imbalance occurs when the prior probability of one class (referred to as the minority class) is significantly lower than the prior probability of the other class (the majority class).
The induction of binary classifiers on imbalanced training data results in a predictive bias toward the majority class and has been associated with poor performance during application [1]. Detailed empirical studies have demonstrated that class imbalance exacerbates the difficulty of learning accurate predictive models from complex data involving class overlap, sub-concepts, non-parametric distributions, etc. [2,3].

arXiv:2105.04009v1 [cs.LG] 9 May 2021
Traditional methods of improving the predictive performance of classification models trained on imbalanced data involve resampling (randomly undersampling the majority class, randomly oversampling the minority class, or generating additional synthetic minority samples) or cost-adjustment [1]. Synthetic minority sampling methods, such as SMOTE and its derivatives [4,5,6,7,8], generate synthetic minority samples to balance the training set. Generation-based methods of this nature are widely applied because they are classifier independent and can reduce the risk of overfitting.
In addition to elevating the learning challenge, in many cases, imbalanced training data results from sensitive application domains that exhibit asymmetric misclassification cost [9]. For example, in medicine, misclassifying benign cases as cancerous (false positive) can have negative consequences in terms of mental anguish and additional tests. Whilst false positives should be kept to a minimum, misclassifying a cancerous case as benign (false negative) can significantly increase cost in terms of delayed treatment and premature death. In domains of this nature, additional effort must be made to induce a classifier with good predictive performance on the minority class.
To achieve satisfactory performance on sensitive imbalanced domains with asymmetric misclassification costs, the resampling strategy ought to prioritize high recall whilst having minimal impact on precision. In this work, we propose a refinement to the CCR algorithm [10] that utilizes the radial-based (RB) approach to calculate the class potential to satisfy this objective. Specifically, CCR is a resampling algorithm that cleanses majority class training samples and randomly generates synthetic minority samples in the regions around the minority class. Whilst this technique has been shown to improve the recall of the induced classifier, the specific resampling strategy employed may limit the improvement in recall and risks harming the precision. To improve upon this, we propose the RB-CCR resampling algorithm. It focuses the generation processes in sub-regions of the data-space that satisfy the user-specified class potential targets. The ability to do this gives the user better control over the precision-recall trade-off. This, for example, enables higher recall on domains for which this is critical.
We empirically compare RB-CCR to CCR and the state-of-the-art resampling methods on 57 benchmark datasets with 9 classifiers. Our empirical results show that resampling with RB-CCR can be exploited to control the precision-recall trade-off in a domain-appropriate way. On average, RB-CCR outperforms the state-of-the-art alternatives in terms of AUC and G-mean.
The main contributions of this paper can be summarized as follows:
• Proposal of the RB-CCR resampling algorithm, which employs the radial-based approach to calculating the class potential, so that a classifier trained on the modified data improves recall with less impact on precision.
• Analysis of the impact of the sampling region on the algorithm's behavior and performance.
• Demonstration that the proposed method can outperform the original CCR algorithm.
• Experimental evaluation of the proposed approach based on diverse benchmark datasets and a detailed comparison with the state-of-the-art approaches.
The paper is organized as follows. The next section discusses the related work and situates RB-CCR with respect to the state of the art in imbalanced binary classification. Section 3 provides the details of CCR and RB-CCR, demonstrates resampling with RB-CCR, and contrasts its run-time complexity with that of CCR. In Section 4, we describe the experimental setup and report the results along with our analysis. Finally, Section 5 includes our concluding remarks and a discussion of future work.

Related work
Imbalance ratio (IR) [11] is defined as the ratio between the number of majority and minority class observations. A moderate to high IR (typically greater than 10:1) can pose a significant challenge to learning a sufficiently accurate classifier across all classes. This is particularly the case when it is combined with other adverse data properties, such as class overlap, sparsity, complex clustering, and noise [2,12]. In such cases, the classifier is at great risk of becoming biased towards the majority class [2] and/or overfitting the training data [13]. Problems of this nature are a focus of intense research [4,14,15].
Measuring the quality of a model on imbalanced data requires some attention. It is well-known that using classic metrics, such as accuracy and error rate, on imbalanced datasets can cause misleading interpretations of the efficacy of the model [16]. As a result, the imbalanced learning community has shifted to metrics such as precision, recall (sensitivity), specificity, G-mean, the $F_\beta$ score, and AUC [17,18]. More recently, however, it has been noted that the widely used $F_\beta$ score and AUC can be sub-optimal for evaluating performance on imbalanced data. Brzeziński et al. [19] demonstrated that the $F_\beta$ score is usually more biased towards the majority class than AUC and G-mean. The flaws of the $F_\beta$ score are also discussed in a study by Hand and Christen [20].
Safe-level SMOTE [14] and LN-SMOTE [37] are specifically designed to reduce the risk of introducing noisy synthetic observations inside the majority class region. Other SMOTE alternatives aim to focus the generation process on challenging regions of the dataspace. Borderline-SMOTE [38], for example, focuses the process of synthetic observation generation on the instances close to the class boundary, and ADASYN [6] prioritizes the difficult instances. The SWIM [39] method uses the Mahalanobis distance to determine the best position for synthetic samples, taking into account the existing samples from both classes. Radial-Based Oversampling (RBO) [40] is a method that employs potential estimation to generate new minority objects using radial basis functions. The Combined Cleaning and Resampling (CCR) [10] method combines two techniques: cleaning the decision border around minority objects and guided synthetic oversampling.
RUS preprocesses the data by randomly removing majority class samples. It is conceptually simple but risks removing important objects from the majority class, which can cause the induced classifier to underfit less dense majority class clusters. Guided undersampling approaches aim to avoid this by analyzing the minority and majority class instances in the local neighborhood. The Edited Nearest Neighbor rule, for example, removes majority examples if their set of three nearest neighbors does not include at least one other majority object. Radial-Based Undersampling, on the other hand, employs the concept of mutual class potential to direct undersampling [41]. Koziarski introduced the Synthetic Minority Undersampling Technique (SMUTE), which leverages the concept of interpolation of nearby instances, previously introduced in the oversampling setting by SMOTE [42].
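The neighborhood-based cleaning rule described above can be sketched in a few lines of numpy. The following is an illustrative implementation of the three-nearest-neighbor criterion exactly as stated in the text (function and variable names are ours, not from any reference implementation):

```python
import numpy as np

def enn_clean(X_maj, X_min, k=3):
    """Remove each majority sample whose k nearest neighbors (among all
    other samples) contain no other majority sample."""
    X = np.vstack([X_maj, X_min])
    labels = np.array([1] * len(X_maj) + [0] * len(X_min))  # 1 = majority
    keep = []
    for i in range(len(X_maj)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                     # exclude the sample itself
        nn = np.argsort(d)[:k]            # indices of the k nearest neighbors
        if np.any(labels[nn] == 1):       # at least one majority neighbor
            keep.append(i)
    return X_maj[keep]
```

A majority sample surrounded exclusively by minority samples is treated as noise and removed, while majority samples inside dense majority clusters are retained.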
Hybrid methods. Data preprocessing methods can be combined with in-built classification methods for imbalanced learning. Galar et al. proposed to hybridize under- and oversampling with an ensemble of classifiers [43]. This approach allows the data to be independently processed for each of the base models. It is also worth mentioning SMOTEBoost, which is based on a combination of the SMOTE algorithm and the boosting procedure [31]. In addition, the Combined Synthetic Oversampling and Undersampling Technique (CSMOUTE) integrates SMOTE oversampling with SMUTE undersampling [42].

Radial-Based Combined Cleaning and Resampling
In this paper, we propose an extension of the original CCR [10] algorithm that refines its sampling procedure. In short, CCR is an energy-based oversampling algorithm that relies on spherical regions, centered around the minority class observations, to designate areas in which synthetic minority observations should be generated. These spherical regions expand iteratively, with the rate of expansion inversely proportional to the number of neighboring observations belonging to the majority class. While computationally efficient and conceptually simple, using spherical regions to model the areas designated for oversampling has two limitations. First of all, it enforces a constant rate of expansion of the sphere in every direction, regardless of the majority neighbors' exact positions. Secondly, it does not utilize the information about the neighboring minority class observations. To address these issues, we propose a novel sampling procedure that refines the original spherical regions. In the remainder of this section, we describe the proposed sampling procedure and its integration with the CCR algorithm.

Guided sampling procedure
We base the proposed sampling procedure on the notion of class potential, previously used in the imbalanced data setting by Krawczyk et al. [36]. The potential function is a real-valued function that, at a given point in space x, measures the cumulative closeness to a given collection of observations $\mathcal{X}$. More formally, using a Gaussian radial basis function with a spread $\gamma$, a potential function can be defined as

$\Phi(x, \mathcal{X}, \gamma) = \sum_{i=1}^{|\mathcal{X}|} e^{-\left(\frac{\lVert x_i - x \rVert}{\gamma}\right)^2}.$ (1)

Of particular interest in the imbalanced data oversampling task is the potential computed with respect to either the collection of majority class observations X_maj (majority class potential) or the collection of minority class observations X_min (minority class potential). Such class potential can be regarded as a measure reflecting the degree of certainty we assign to x being a member of either the majority or the minority class. It can also be used to model the regions of interest in which oversampling is to be conducted, as previously demonstrated in the Radial-Based Oversampling (RBO) [36] and Sampling With the Majority (SWIM) [44] algorithms. SMOTE and its derivatives define the regions of interest as the lines connecting nearby minority observations, and the probability of sampling within any given region of interest is typically uniform. Alternatively, using the class potential, as proposed here, offers an informationally richer framework. First of all, by using the majority class potential, we can leverage the information about the position of majority observations, which is not used by SMOTE. Secondly, when using the potential, we are not constrained to sampling from within a set of lines; rather, we can sample smoothly from the space around the minority observations. Moreover, the sampling region is non-linear, which enables it to better adapt to the underlying data distribution.
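To make Equation 1 concrete, the class potential can be computed directly; the following is a minimal numpy sketch (the function name is ours, not from the reference implementation):

```python
import numpy as np

def class_potential(x, X, gamma):
    """Cumulative Gaussian-RBF closeness of point x to the collection X
    (Equation 1); X is an (n, m) array of observations, x an (m,) point."""
    distances = np.linalg.norm(X - x, axis=1)
    return np.sum(np.exp(-(distances / gamma) ** 2))
```

Passing the majority observations X_maj yields the majority class potential, and passing X_min the minority class potential; larger values of γ spread each observation's contribution over a wider area, smoothing the potential field.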
To reiterate, the drawbacks of the original CCR algorithm are that the sphere expansion procedure progresses at a constant rate in every direction, regardless of the exact position of the majority neighbors, and that it does not utilize the information about the position of neighboring minority class observations. Intuitively, neither of these is the desired behavior, since it can lead to a lower than expected expansion in the direction of minority observation clusters and a higher than expected expansion in the direction of majority observation clusters. While, in theory, an obvious modification that could address these issues would be to replace the spheres used by CCR with more flexible shapes, such as ellipsoids, and adjust the expansion step accordingly, in practice it is not clear how the latter could be achieved. Alternatively, we propose to exploit the efficiency of first defining the sphere around the minority observation, and then partitioning it into sub-regions based on the class potential, to more effectively guide sample generation.
The proposed strategy partitions a given sphere into three target regions, low (L), equal (E), and high (H), based on the class potential. Synthetic samples are generated in a user-specified target region by randomly generating candidates with uniform probability throughout the sphere; a random subset of the candidates that fall in the target region is then added to the training set. The target region and the number of samples are specified as parameters of the algorithm. A more detailed formulation of the proposed strategy is presented in Algorithm 1, and an illustration of the sphere partitioning procedure is presented in Figure 1.
The CCR algorithm generates samples with uniform probability from within the entire sphere. By contrast, as Figure 1 illustrates, RB-CCR divides the original sphere into three regions (L, E, H), defined according to the shape of the globally calculated minority class potential. Subsequent to the partitioning, sample generation can be restricted to a specific region. Intuitively, samples in the high potential regions can be regarded as having a higher probability of coming from the underlying minority class distribution than samples in the low potential regions. This, to some extent, parallels different variants of SMOTE, such as Borderline-SMOTE [5] or Safe-Level-SMOTE [14], which focus on different types of observations to guide the sampling process. However, contrary to the SMOTE variants, RB-CCR provides the flexibility to choose an appropriate sampling region for the target data within a single framework.
Algorithm 1 Guided sampling procedure
Input: sampling seed x, sampling radius r, collection of minority observations X_min
Parameters: radial basis function spread γ, sampling region from which returned samples will be drawn, number of candidates c used for potential range estimation, number of returned candidate samples n
Output: collection of synthetic minority observations S located in the sampling region around x

function sample(x, r, X_min, γ, region, c, n):
    S ← ∅
    for i ← 1 to c do
        C_i ← random sample inside an x-centered sphere with radius r
        Z_i ← Φ(C_i, X_min, γ)
    end for
    estimate bound_L and bound_H, the boundaries of the equal potential region, from Z
    for i ← 1 to c do
        if Z_i ≤ bound_L then
            reg_i ← L {i-th candidate in the low potential region}
        else if Z_i ≥ bound_H then
            reg_i ← H {i-th candidate in the high potential region}
        else
            reg_i ← E {i-th candidate in the equal potential region}
        end if
        if reg_i = region then
            add C_i to S
        end if
    end for
    S ← n samples randomly selected with replacement from S
    return S

Figure 1: An example of a sphere generated around a specific minority observation, partitioned into three regions: high potential (H), indicated with a green color, equal potential (E), indicated with a yellow color, and low potential (L), indicated with a red color. Note that the shape of the regions aligns with that of the produced potential field, indicated with a contour plot.
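A compact numpy sketch of Algorithm 1 might look as follows. Note that the exact way bound_L and bound_H are estimated from the c candidates is not specified above, so the 25th/75th percentiles of the candidate potentials are used here purely as an assumption; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def potential(x, X, gamma):
    # Gaussian-RBF class potential (Equation 1)
    return np.sum(np.exp(-(np.linalg.norm(X - x, axis=1) / gamma) ** 2))

def sample_region(x, r, X_min, gamma, region="E", c=100, n=10):
    """Generate n synthetic samples in the requested potential region
    (L, E or H) of the x-centered sphere with radius r."""
    m = len(x)
    # candidates drawn uniformly inside the sphere: random direction,
    # radius scaled by u^(1/m) to keep the density uniform in m dimensions
    v = rng.normal(size=(c, m))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    radii = r * rng.uniform(size=(c, 1)) ** (1.0 / m)
    C = x + v * radii
    Z = np.array([potential(ci, X_min, gamma) for ci in C])
    # assumption: equal-potential band between the 25th and 75th percentiles
    bound_L, bound_H = np.percentile(Z, [25, 75])
    reg = np.where(Z <= bound_L, "L", np.where(Z >= bound_H, "H", "E"))
    pool = C[reg == region]
    if len(pool) == 0:                      # fall back to the whole sphere
        pool = C
    idx = rng.integers(len(pool), size=n)   # sample with replacement
    return pool[idx]
```

Restricting `region` to "H" keeps the returned samples in the high-potential part of the sphere, while "LEH"-style unrestricted sampling would reduce to the original CCR behavior.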

Integrating guided sampling with the CCR algorithm
We begin with a brief description of the original CCR algorithm, as described in [45], where a more in-depth discussion of the design choices can be found. The algorithm consists of two main steps: cleaning the neighborhood of the minority observations, and selectively oversampling in the produced, cleaned regions. After describing the original algorithm, we discuss how it can be integrated with the proposed guided sampling procedure.

Figure 2: The sphere expands at a normal cost until it reaches a majority observation, at which point the cost of further expansion increases (depicted by blue orbits with an increasingly darker color). Finally, after the expansion, the majority observations within the sphere are pushed outside (in green). Source: [45].
Cleaning the minority neighborhoods. The first step of the proposed approach is cleaning the minority class neighborhoods of majority observations. This is achieved via an energy-based approach, in which spherical regions are designated for cleaning. The size of the regions is constrained by the presence of majority neighbors and is determined in an iterative procedure, during which spheres expand up to the point of depleting the allocated energy budget. More formally, for a given minority observation denoted by $x_i$, the current radius of the associated sphere denoted by $r_i$, a function returning the number of majority observations inside a sphere centered around $x_i$ with radius $r$ denoted by $f_n(r)$, and a target radius $r'_i$ such that $f_n(r'_i) = f_n(r_i) + 1$, we define the energy change caused by the expansion from $r_i$ to $r'_i$ as

$\Delta e = -(r'_i - r_i) \cdot f_n(r'_i).$ (2)

During the sphere expansion procedure, the radius of a given sphere increases up to the point of completely depleting the energy, with the cost increasing after each encountered majority observation. Finally, the majority observations inside the sphere are pushed out to its outskirts. The whole process is illustrated in Figure 2.
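The sphere expansion step can be sketched directly from Equation 2. In the snippet below, the handling of any energy left over after passing the farthest majority neighbor is our assumption; the rest follows the described procedure:

```python
import numpy as np

def sphere_radius(x_i, X_maj, energy):
    """Radius of the cleaning sphere around minority observation x_i:
    the sphere expands until the energy budget is depleted, with the
    expansion cost growing after each majority observation it passes
    (Equation 2)."""
    d = np.sort(np.linalg.norm(X_maj - x_i, axis=1))
    e, r, n_r = float(energy), 0.0, 0
    for d_j in d:
        n_r += 1
        delta_e = -(d_j - r) * n_r        # cost of expanding from r to d_j
        if e + delta_e > 0:
            r, e = d_j, e + delta_e       # absorb x_j, keep expanding
        else:
            return r + e / n_r            # spend the remaining energy
    # assumption: leftover energy expands past the farthest neighbor
    return r + e / max(n_r, 1)
```

For example, with majority observations at distances 1 and 3 and an energy budget of 2, the sphere reaches the first neighbor at cost 1 and then spends the remaining unit of energy at the doubled rate, stopping at radius 1.5.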
Selectively oversampling the minority class. After the cleaning stage is completed, new synthetic minority observations are generated in the produced spherical regions. The proportion of synthetic observations generated around a given minority observation is inversely proportional to the sphere's radius calculated in the previous step. More formally, for a given minority observation denoted by $x_i$, the radius of the associated sphere denoted by $r_i$, the vector of all calculated radii denoted by $r$, the collection of majority observations denoted by X_maj, the collection of minority observations denoted by X_min, and assuming that oversampling is performed up to the point of achieving a balanced class distribution, we define the number of synthetic observations to be generated around $x_i$ as

$g_i = \left\lfloor \frac{r_i^{-1}}{\sum_{r_k \in r} r_k^{-1}} \cdot (|X_{maj}| - |X_{min}|) \right\rfloor.$ (3)

This procedure can be interpreted as weighing the difficult observations more heavily, similar to the technique used in ADASYN [6]. The difficulty of an observation is determined based on the proximity of the nearest majority observations: minority observations with nearby majority neighbors will have a constrained sphere radius, which results in a higher allocation of the produced synthetic observations.
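Equation 3 amounts to distributing the required n_maj − n_min synthetic samples proportionally to the inverse radii, as in this short sketch (names are ours):

```python
import numpy as np

def sample_allocation(radii, n_maj, n_min):
    """Number of synthetic observations generated around each minority
    observation (Equation 3): weights proportional to the inverse of the
    associated sphere radius, so tighter spheres receive more samples."""
    inv = 1.0 / np.asarray(radii, dtype=float)
    return np.floor(inv / inv.sum() * (n_maj - n_min)).astype(int)
```

A minority observation whose sphere was constrained by nearby majority neighbors (small radius) thus receives a larger share of the synthetic samples than one in a safe, sparse region.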
Combining guided sampling with CCR. The proposed sampling strategy can easily be integrated into the original CCR algorithm. Instead of the original sampling within the whole sphere, RB-CCR uses the guided sampling strategy described in the previous section. The initial steps of RB-CCR are the same as in CCR, namely the sphere radius calculation, the translation of majority observations, and the calculation of the number of synthetic observations generated for each minority observation. We present the pseudocode of the proposed RB-CCR algorithm in Algorithm 2. It should be noted that, except for the addition of the guided sampling procedure, the algorithm is presented as it was previously proposed in [45].
The behavior of the proposed algorithm changes depending on the choice of its three major hyperparameters: the RBF spread γ, the energy used for sphere expansion, and the sampling region. The impact of γ is illustrated in Figure 3. As can be seen, γ regulates the smoothness of the potential shape, with low values of γ producing a less regular contour, conditioned mainly on the position of minority neighbors located in close proximity. On the contrary, higher values of γ produce a smoother potential that is less prone to overfitting, with a smaller number of distinguishable clusters.

Algorithm 2 Radial-Based Combined Cleaning and Resampling
Input: collection of majority observations X_maj and collection of minority observations X_min
Parameters: energy budget for expansion of each sphere, radial basis function spread γ, sampling region from which returned samples will be drawn, number of candidates c used for potential range estimation
Output: collection of translated majority observations X_maj and synthetic minority observations S

function RB-CCR(X_maj, X_min, energy, γ, region, c):
    S ← ∅
    t ← zero translation vector for every majority observation
    for all minority observations x_i in X_min do
        e ← energy; r_i ← 0; n_r ← 0
        for all majority observations x_j in X_maj do
            d_j ← distance between x_i and x_j
        end for
        sort X_maj with respect to d
        for all majority observations x_j in X_maj do
            n_r ← n_r + 1
            ∆e ← −(d_j − r_i) · n_r
            if e + ∆e > 0 then
                r_i ← d_j
                e ← e + ∆e
            else
                r_i ← r_i + e / n_r
                break
            end if
        end for
        for all majority observations x_j in X_maj do
            if d_j < r_i then
                t_j ← t_j + (r_i − d_j) · (x_j − x_i) / d_j {push x_j to the sphere's outskirts}
            end if
        end for
    end for
    X_maj ← X_maj + t
    for all minority observations x_i in X_min do
        g_i ← number of synthetic observations for x_i, computed with Equation 3
        add sample(x_i, r_i, X_min, γ, region, c, g_i) to S
    end for
    return X_maj, S

Secondly, the value of energy affects the radius of the produced spheres, which controls the size of the sampling regions and the range of translations, as illustrated in Figure 4. It is worth noting that as the energy approaches zero, the algorithm degenerates to random oversampling. The choice of energy is also highly dependent on the dimensionality of the data: it has to be scaled to the number of features a given dataset contains, with higher-dimensional datasets requiring higher energy to achieve a similar sphere expansion. Finally, the choice of the sampling region determines how the generated samples align with the minority class potential, as demonstrated in Figure 5. Sampling in all of the available regions (LEH) is equivalent to the original CCR algorithm: it completely ignores the potential and uses whole spheres as the regions of interest. Sampling in region E constrains samples to areas with class potential approximately equal to the class potential of real minority observations. Sampling in region H pushes the generated observations towards areas of the data space estimated to have a higher minority class potential, which can be interpreted as focusing the sampling process on generating samples that are safer and better resemble the original minority observations. The opposite is true for sampling in region L. This is further illustrated on a simplified dataset in Figure 6. Finally, it is worth discussing how RB-CCR compares to other oversampling algorithms. An illustration of the differences between several popular methods is presented in Figure 7, with a highly imbalanced dataset characterized by a disjoint minority class distribution used as a benchmark.
As can be seen, when compared to the SMOTE-based approaches, RB-CCR tends to introduce lower class overlap, which can occur for SMOTE when dealing with disjoint distributions, the presence of noise, or outliers. Borderline-SMOTE solves the problem only partially, still introducing some overlap while at the same time completely omitting to oversample around selected observations. RBO does not produce artificial overlap, but it is very conservative during sampling, in particular within originally overlapping regions; its highly clustered samples can cause the classifier to overfit in a manner similar to random oversampling. RB-CCR avoids the risk of overfitting with larger regions of interest, which also enable a greater reduction in the classifier's bias towards the majority class; the energy parameter facilitates the control of this behavior, with higher values of energy leading to less conservative sampling. CCR and RB-CCR produce a distribution that leads to a higher bias towards the minority class, both due to synthesizing observations around all of the instances and due to the conducted translation of majority observations, while minimizing class overlap. Compared to CCR, RB-CCR uses the information provided by the class potential to fine-tune the shape of the regions of interest within the sphere, which enables better control of the sampling and produces more constrained samples based on the underlying potential.

Computational complexity analysis
Let us define the total number of observations by n, the number of majority and minority observations by n_maj and n_min, respectively, the number of features by m, and the number of candidate samples used in a single sampling step of Algorithm 1 by c. As previously described in [45], the original CCR algorithm can be divided into three steps: calculating the sphere radii, cleaning the majority observations inside the spheres, and synthesizing new observations, with each of the steps done iteratively for every minority observation. The same applies to RB-CCR, for which only the complexity of the third step differs from that of CCR.
• As described in [45], the first step consists of a) calculating a distance vector, b) sorting said vector, and c) calculating the resulting radius. Combined, these operations have complexity equal to O((m + log n)n^2).
• As described in [45], the second step, cleaning the majority observations inside the spheres, has complexity equal to O(mn).
• Finally, the third step, synthesizing new observations, consists of a) calculating the proportion of samples generated for a given observation, g_i, with complexity equal to O(n_min) [45], and b) sampling the synthetic observations. In the case of the original CCR algorithm, as discussed in [45], this sub-step consists of n_maj − n_min operations of sampling a random observation inside the sphere, each with complexity equal to O(m), leading to a total complexity of the third step of CCR that can be simplified to O(mn). On the other hand, the sampling used by RB-CCR has a higher complexity due to the chosen guided strategy. In particular, when considering the procedure described in Algorithm 1, its complexity is dominated by the potential calculation for all of the candidate samples. The potential calculation, defined in Equation 1, when computed with respect to the collection of minority class observations X_min, consists of n_min summations and n_min RBF function computations, with the latter having complexity equal to O(m). As a result, a single computation of the minority class potential has a complexity that can be simplified to O(mn). The whole sampling step, which requires c potential function computations per minority observation, therefore has a total complexity equal to O(cmnn_min), which can be simplified to O(cmn^2).
As can be seen, for the original CCR algorithm, the complexity is dominated by the first step and is equal to O((m + log n)n^2). On the other hand, in the case of RB-CCR, both the first and the third step influence the total complexity of the algorithm, which is equal to O((cm + log n)n^2).

Experimental study
To empirically evaluate the usefulness of the proposed RB-CCR algorithm, we conducted a series of experiments, the aim of which was to answer the following research questions: RQ1 Is it possible to improve the original CCR algorithm's performance by focusing resampling in specific regions?
RQ2 Are the trends displayed by RB-CCR consistent across different classification algorithms and performance metrics? Is it possible to control the behavior of the algorithm by a proper choice of its parameters?
RQ3 How does RB-CCR compare with state-of-the-art reference methods, and how does the choice of classification algorithm affect that comparison?

Set-up
Datasets. We based our experiments on 57 binary datasets taken from the KEEL repository [46], the details of which, namely their names, imbalance ratios (IR), and numbers of contained samples and features, are presented in Table 1. We employed a dataset selection procedure previously used in [41]; that is, we excluded datasets for which an AUC greater than 0.85 was achieved with a linear SVM without any resampling. Prior to resampling and classification, all datasets were preprocessed: categorical features were converted to integers first, and afterward all of the features were normalized to zero mean and unit variance.
Classification algorithms. During the conducted experiments, we considered classification with a total of 9 different algorithms: CART decision tree, k-nearest neighbors classifier (KNN), support vector machine with linear (L-SVM), RBF (R-SVM) and polynomial (P-SVM) kernels, logistic regression (LR), Naive Bayes (NB), and multi-layer perceptron with ReLU (R-MLP) and linear (L-MLP) activation functions in the hidden layer. We considered a relatively high number of classification algorithms to examine how the choice of base learner affects the usefulness of RB-CCR. Implementations of all of the classification algorithms were taken from the scikit-learn library [47], and their default parameters were used.
Reference methods. In addition to the original CCR algorithm, we compared the performance of RB-CCR with several over- and undersampling strategies, namely: SMOTE [4], Borderline-SMOTE (Bord) [5], Neighborhood Cleaning Rule (NCL) [48], SMOTE combined with Tomek links (SMOTE+TL) [49], and SMOTE combined with the Edited Nearest Neighbor rule (SMOTE+EN) [50]. The hyperparameters of each resampling method were tuned individually for each dataset. The SMOTE variants considered values of the neighborhood size k in {1, 3, 5, 7, 9}. In addition to k, the Bord method considered values of the m neighborhood in {5, 10, 15}. For NCL, we considered values of its neighborhood size k in {1, 3, 5, 7}. Finally, for all methods in which the resampling ratio was an inherent parameter, resampling was performed up to the point of achieving a balanced class distribution. Implementations of all of the reference methods were taken from the imbalanced-learn library [51].
Performance metrics. We utilize 6 performance metrics for classifier evaluation: precision, recall, and specificity of the predictions, and the combined metrics AUC, F-measure, and G-mean. This set of metrics is standard in the imbalanced classification literature and provides a diverse perspective on model performance. Precision, recall, and specificity provide insight into the class-specific errors that are to be expected from each algorithm. The combined metrics, AUC, F-measure, and G-mean, provide a more holistic perspective on performance by taking into account the trade-off between the performance on the majority and minority classes. As mentioned in the related work, AUC and F-measure have previously been criticized in the context of imbalanced learning. Nonetheless, we include them as they remain standard benchmarks in the literature and provide an orthogonal perspective on performance.
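For reference, the threshold-based metrics used here follow directly from the confusion matrix; a minimal sketch (with the minority class encoded as 1):

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Precision, recall, specificity, G-mean and F-measure computed from
    binary predictions; the minority (positive) class is encoded as 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    g_mean = np.sqrt(recall * specificity)   # balances both class accuracies
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return dict(precision=precision, recall=recall,
                specificity=specificity, g_mean=g_mean, f_measure=f1)
```

G-mean penalizes any large gap between recall and specificity, which is why it is less biased towards the majority class than accuracy or the F-measure.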
Evaluation procedure. To ensure the stability of the results, we used 5 × 2 cross-validation [52] in all of the experiments. Furthermore, during parameter selection for the resampling algorithms, we used an additional 3-fold cross-validation on the training partition of the data, with AUC as the optimization criterion.
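The protocol amounts to nested cross-validation: an outer 5×2 split for evaluation and an inner 3-fold split that picks the resampler configuration maximizing AUC. The function names and the duck-typed `fit_resample` interface below are our own scaffolding, not the paper's code:

```python
import copy
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

def inner_cv_auc(sampler, clf, X, y, seed=0):
    """Mean AUC of a (resampler, classifier) pair under 3-fold CV on training data."""
    aucs = []
    for tr, va in StratifiedKFold(n_splits=3, shuffle=True, random_state=seed).split(X, y):
        X_res, y_res = copy.deepcopy(sampler).fit_resample(X[tr], y[tr])
        model = clone(clf).fit(X_res, y_res)
        aucs.append(roc_auc_score(y[va], model.predict_proba(X[va])[:, 1]))
    return float(np.mean(aucs))

def five_by_two_eval(candidate_samplers, clf, X, y, seed=0):
    """5x2 CV: on each training half, pick the sampler maximizing the inner AUC,
    refit on the whole half, and score on the held-out half."""
    outer = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=seed)
    test_aucs = []
    for tr, te in outer.split(X, y):
        best = max(candidate_samplers,
                   key=lambda s: inner_cv_auc(s, clf, X[tr], y[tr], seed))
        X_res, y_res = copy.deepcopy(best).fit_resample(X[tr], y[tr])
        model = clone(clf).fit(X_res, y_res)
        test_aucs.append(roc_auc_score(y[te], model.predict_proba(X[te])[:, 1]))
    return float(np.mean(test_aucs))
```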
Statistical analysis. To assess the statistical significance of the results, we used two types of statistical tests. First, we used a one-sided Wilcoxon signed-rank test for the direct comparison between the original CCR algorithm and the proposed RB-CCR algorithm. Second, when simultaneously comparing multiple methods, we used the Friedman test combined with Shaffer's post-hoc procedure. In all cases, unless p-values are explicitly specified, the results are reported at the significance level α = 0.10.
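Both base tests are available in SciPy (Shaffer's post-hoc procedure is not, and typically requires a dedicated package); a sketch on illustrative, made-up AUC values:

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

# Illustrative per-dataset AUC values (not results from the paper).
rb_ccr_auc = np.array([0.91, 0.88, 0.95, 0.83, 0.90, 0.87, 0.92])
ccr_auc    = np.array([0.89, 0.88, 0.93, 0.84, 0.87, 0.85, 0.90])
smote_auc  = np.array([0.88, 0.86, 0.92, 0.82, 0.86, 0.84, 0.89])

# One-sided Wilcoxon signed-rank test: is RB-CCR better than CCR?
stat, p = wilcoxon(rb_ccr_auc, ccr_auc, alternative="greater")
significant = p < 0.10  # significance level used in the paper

# Friedman test across several methods at once (one array of scores per method).
f_stat, f_p = friedmanchisquare(rb_ccr_auc, ccr_auc, smote_auc)
```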
Implementation and reproducibility. To ensure the reproducibility of the results, we made the following publicly available: the implementation of the algorithm, code sufficient to reproduce all of the described experiments, the statistical tests, and all of the figures presented in this paper, as well as the partitioning of the data into folds and the raw results. All of the above can be accessed at 1 .

The results of the sampling region comparison are presented in Table 2. Detailed p-values for the conducted experiments can also be found in Appendix A.
Several observations can be made based on the presented results. First of all, the observed performance was consistent across the classification algorithms with respect to precision, recall, and specificity, at least when comparing sampling in the H region with the remaining variants: in the case of precision, sampling exclusively in the H region produced, on average, the best performance when combined with 7 out of 9 considered classifiers, with the remaining two being NB and P-SVM. Furthermore, in the case of specificity, this behavior was observed for 8 out of 9 classifiers, once again with the exception of P-SVM. Finally, the reverse was true in the case of recall, where sampling in the H region gave the worst average rank for 8 out of 9 considered classifiers. All of the trends mentioned above were also statistically significant in the majority of cases. This indicates that the guided sampling approach has a non-random influence on the algorithm's performance and its bias towards the majority class, particularly when comparing sampling in the H region with the other variants, which is desirable behavior. Furthermore, from a general resampling perspective, this suggests that if the problem domain requires high precision or specificity, it is beneficial to focus sampling in the H region. On the other hand, if a high recall is required, sampling in the L or E region is usually preferred.
However, the sampling region's impact on the combined metrics is less clear in the general case. Although the baseline variant of CCR, that is, LEH sampling, achieved the best average rank in only 1 out of 27 cases (for the combination of P-SVM and F-measure), there was usually either a complete lack of significance, meaning that there were no statistically significant differences between any of the sampling strategies, or partial significance, meaning that only some of the variants displayed statistically significant differences. Importantly, when compared with LEH sampling, there was a statistically significant improvement with respect to all of the combined metrics for a single classifier, LR; and for a single metric for the combination of L-MLP and F-measure, as well as the combination of P-SVM and G-mean. In all of the above cases, the best-performing strategy was sampling in the H region. Nevertheless, for the remaining combinations of classification algorithms and performance metrics there was no clearly dominant strategy, even when at least partial significance was observed. All of the above leads to the conclusion that while sampling in specific regions has a non-random impact that is consistent across the classification algorithms with respect to direction (focusing sampling in the H region leading to statistically significantly better precision and specificity, and worse recall), the trade-off between them, which can be observed using the combined metrics, varies depending on both the classifier and the dataset.
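To make the region vocabulary concrete, the following sketch labels candidate points around a minority instance by a Gaussian-RBF class potential and splits them into equal-sized L/E/H thirds. This is an illustrative reading of the sub-regions, not the paper's exact definition; the kernel width and the splitting rule are assumptions.

```python
import numpy as np

def rbf_potential(points, anchors, gamma=1.0):
    """Gaussian RBF class potential of `anchors`, evaluated at each of `points`."""
    d = np.linalg.norm(points[:, None, :] - anchors[None, :, :], axis=2)
    return np.exp(-(d / gamma) ** 2).sum(axis=1)

def split_regions(candidates, minority, gamma=1.0):
    """Rank candidate points by minority-class potential and split them into
    low/equal/high thirds -- an illustrative L/E/H partition."""
    phi = rbf_potential(candidates, minority, gamma)
    order = np.argsort(phi)            # ascending potential
    low, equal, high = np.array_split(order, 3)
    return {"L": candidates[low], "E": candidates[equal], "H": candidates[high]}
```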

Comparison of CCR and RB-CCR
We have empirically demonstrated that no single sampling region is optimal for all datasets, classification algorithms, and performance metrics. This is consistent with the current state of knowledge, in particular the "no free lunch" theorem, according to which the choice of sampling strategy strongly depends on the dataset characteristics. Instead, we considered an approach in which we treat the sampling region as a parameter of the algorithm and adjust it on a per-dataset and per-classifier basis. To this end, we conducted two comparisons.
First of all, we considered an idealized variant of RB-CCR. The region giving the best performance, chosen only from {L, E, H}, was selected individually for each dataset based on the test set results. Importantly, sampling in the LEH region was not included in the set of available regions. This approach can be treated as an upper bound on the performance that could be achieved by restricting sampling to a specific region. Once again, this variant of RB-CCR was compared with the original CCR algorithm, with the results presented in Table 3. As can be seen, by constraining sampling to a specific region, we were able to achieve improved performance for almost every considered dataset, regardless of the choice of classifier or performance metric.
Secondly, we conducted a comparison between the original CCR algorithm and RB-CCR with the sampling region chosen from {L, E, H, LEH} using cross-validation. The results of this comparison are presented in Table 4. As can be seen, when adjusting the sampling region individually for each dataset, we were able to achieve a statistically significant improvement in performance for at least one of the combined metrics for 7 out of 9 classifiers. This improvement was observed more often in the case of G-mean and AUC, and only in two cases for F-measure, which can be explained by the fact that AUC was used as the optimization criterion during cross-validation, and AUC and G-mean tend to be more correlated than F-measure. We hypothesize that the flexibility offered by the class potential regions in RB-CCR enables the samples to be generated in areas that have the greatest positive impact on the metric being optimized. The results presented in Table 2, where focusing on the high potential regions produced a significant improvement in precision and specificity, seem to support this hypothesis. Thus, using F-measure as an optimization criterion for models trained with RB-CCR would have the opposite effect to AUC (i.e., it would produce better precision, specificity, and F-measure, since these are related, at the expense of recall, AUC, and G-mean). The results of both of the above experiments indicate that, in principle, constraining sampling to a specific region can yield a clear performance improvement compared to the baseline approach. Using cross-validation to choose the sampling region for each case is a suitable strategy, resulting in a statistically significant performance improvement in most cases. Still, it falls short of the performance of the idealized variant.
This indicates that either a better parameter selection strategy, more suited to imbalanced datasets, or a dedicated heuristic for choosing the sampling region could further improve the proposed method's overall performance.
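The cross-validated region selection described above can be sketched generically. Here `resample` stands for a hypothetical RB-CCR callable `(X, y, region) -> (X_res, y_res)` (not an actual API), and the 3-fold CV with AUC follows the evaluation procedure described earlier:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def select_region(resample, clf, X, y, regions=("L", "E", "H", "LEH"), seed=0):
    """Pick the sampling region by 3-fold CV with AUC as the criterion.
    `resample` is a hypothetical callable (X, y, region) -> (X_res, y_res)."""
    def cv_auc(region):
        aucs = []
        for tr, va in StratifiedKFold(n_splits=3, shuffle=True, random_state=seed).split(X, y):
            X_res, y_res = resample(X[tr], y[tr], region)
            model = clone(clf).fit(X_res, y_res)
            aucs.append(roc_auc_score(y[va], model.predict_proba(X[va])[:, 1]))
        return np.mean(aucs)
    return max(regions, key=cv_auc)
```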

Comparison of RB-CCR with the reference methods
We compared RB-CCR with several over- and undersampling reference methods in the next stage of the conducted experiments. We present the average ranks achieved by all of the methods, as well as the statistical significance of the comparison, in Table 5. Furthermore, we present a visualization of the average ranks achieved by the specific methods with respect to the different performance metrics in Figure 8. First of all, as can be seen, the general trend was that RB-CCR achieved the best recall at the expense of precision and specificity, which held true for all of the classification algorithms. As in the previous experiments, this had a varying impact on the combined metrics depending on their exact choice: when F-measure was considered, it led to statistically significantly worse performance than the reference methods. However, at the same time, it improved the performance with respect to AUC and G-mean: RB-CCR achieved the highest average rank in 17 out of 18 cases, with the only exception being AUC for the L-MLP classifier, for which it achieved the second-best rank. The results of this comparison were also statistically significant in the majority of cases: for all of the classifiers when compared to the baseline case with no resampling, Bord, and NCL; for 5 out of 9 classifiers when compared to SMOTE and SMOTE+TL; and in the single case of NB when compared to SMOTE+EN, which was the second-best performer. The differences between the results measured using F-measure, AUC, and G-mean can be attributed to the previously discussed bias of F-measure towards majority class performance: since RB-CCR is heavily skewed towards recall at the cost of precision, it is natural that using a metric weighted more heavily towards precision produces worse performance.
Still, the observed results indicate the high usefulness of the proposed RB-CCR algorithm compared to the reference methods whenever a higher cost is assigned to the misclassification of minority observations, as is the case with AUC and G-mean.
Finally, in the last stage of the conducted experiments, we compared different combinations of classification and resampling algorithms to establish their relative usefulness. We present the average ranks observed for the combined metrics in Tables 6 through 8, separately for each metric. As can be seen, when F-measure was considered, RB-CCR was outperformed by the reference methods, which achieved the best performance when combined with either R-MLP or R-SVM, as was also the case for RB-CCR. However, when AUC and G-mean were considered, the combination of algorithms that achieved the highest average rank for both of those metrics was RB-CCR and L-MLP. Besides L-MLP, the top-performing classifiers were R-MLP, R-SVM, and LR, in that order, all achieving their best performance when combined with RB-CCR. Overall, the presented rankings indicate the importance of the performance improvement that the resampling method can provide for any given classification algorithm. From that point of view, of the statistically significant improvements presented previously in Table 5, the most important were those achieved for R-MLP and R-SVM, for which RB-CCR achieved statistically significantly better performance than all of the resamplers except SMOTE+EN. On the other hand, it is worth noting that the linear methods, that is, L-MLP, LR, and L-SVM, achieved relatively high performance, populating 3 out of the 5 spots for the highest-performing classification algorithms, while at the same time showing fewer statistically significant improvements due to using RB-CCR when compared to the reference methods. This may suggest the importance of further work aimed particularly at improving the performance of RB-CCR for linear methods, which seem to be particularly predisposed to the classification of imbalanced datasets.

Lessons learned
Based on the described results of the conducted experiments, we will now attempt to answer the research questions raised at the beginning of this section.

RQ1: Is it possible to improve the original CCR algorithm's performance by focusing resampling in the specific regions?
We demonstrated that using RB-CCR leads to statistically significantly better performance than CCR for most of the considered classification algorithms when the sampling region is determined using cross-validation. Selecting the sampling region individually for each dataset, treating it as another hyperparameter, was crucial to achieving that performance improvement in most cases. Finally, we also demonstrated that in almost every case sampling in a specific region leads to better performance than unguided sampling within the whole sphere, although the gap between the cross-validated and the idealized variant indicates that choosing the optimal sampling region remains a major challenge that cross-validation solves only partially.
RQ2: Are the trends displayed by RB-CCR consistent across different classification algorithms and performance metrics? Is it possible to control the behavior of the algorithm by a proper choice of parameters?
The behavior of RB-CCR was consistent with respect to precision, specificity, and recall, with sampling solely within the H region improving precision and specificity at the expense of recall, and sampling within either the L or E region having the opposite effect. As a result, it is possible to control the algorithm's bias towards the specific classes by a proper choice of the sampling region. However, the performance with respect to AUC, F-measure, and G-mean was less consistent, indicating that the choice of sampling region yielding the optimal trade-off between precision and recall is both dataset- and classifier-specific.
RQ3: How does RB-CCR compare with state-of-the-art reference methods, and how does the choice of classification algorithm affect that comparison?
RB-CCR, on average, outperforms all of the considered reference methods with respect to recall, AUC, and G-mean, and underperforms with respect to precision, specificity, and F-measure, with statistically significant differences for the majority of methods. This indicates that RB-CCR is a suitable choice whenever the performance on the minority class is the main consideration, which is usually the case in imbalanced data classification tasks. Finally, a more significant improvement in performance due to using RB-CCR was observed for non-linear classification algorithms. Combined with the fact that linear methods, in general, achieved favorable performance on the considered imbalanced datasets, this might indicate the need for further work focused specifically on improving the results for this type of classifier.

Conclusions
In this work, we proposed the Radial-Based Combined Cleaning and Resampling algorithm (RB-CCR). We hypothesized that refining the resampling procedure employed by CCR could garner additional performance gains. RB-CCR uses the concept of class potential to divide the dataspace around each minority instance into sampling regions characterized by high, equal, or low class potential. Resampling is then restricted to the sub-regions with the specified characteristics, determined by cross-validation or user specification. Our results show that this is superior, in terms of the precision-recall trade-off, to uniformly resampling around the minority class instances.
Our empirical assessment utilized 57 benchmark binary datasets, 9 classification algorithms, and 5 state-of-the-art sampling techniques. The results, measured over 5×2-fold cross-validation, show that sampling the high potential region with RB-CCR generally produces significantly better precision and specificity, with less impact on recall, than CCR. Thus, RB-CCR achieves a better balance in the precision-recall trade-off. Moreover, on average RB-CCR outperforms the considered reference methods with respect to recall, AUC, and G-mean.
Future work may focus on designing a better region selection method than cross-validation, including a strategy for picking regions individually for each observation, which cannot be done using cross-validation. Another potential direction is adjusting the RB-CCR algorithm to linear classifiers, which generally achieve good performance but are least affected by the choice of resampler, and likely require a more drastic shift in the synthetic observation distribution to display a significant change in classifier behavior.
A Detailed p-values observed during sampling region comparison

B Examination of the impact of energy parameter
In addition to the sampling region, another hyperparameter that can have a significant influence on the performance of RB-CCR is its energy, which regulates the size of the sampling regions and the extent of translation. To assess the exact impact of energy on the algorithm's behavior, we conducted an experiment in which we measured the change in performance for different energy values, averaged across all of the datasets. First of all, as can be seen, the choice of energy has, on average, a clear impact on precision, specificity, and recall, with the first two decreasing monotonically as energy grows, and the last one increasing monotonically. This is relevant because it indicates that CCR already has an inbuilt mechanism for controlling the precision-recall trade-off; as a result, the performance improvement displayed by RB-CCR cannot be explained solely by providing such a mechanism, but rather by providing a more optimal trade-off (with respect to the combined metrics).
Furthermore, as can be seen, the value of energy for which RB-CCR achieves the best average performance depends on the choice of classifier and metric. In the case of F-measure, the best performance is observed for the minimal energy, when the precision-to-recall ratio is the highest. This is another empirical confirmation of the claim made in [19], according to which F-measure tends to be more biased towards majority class performance. More importantly, in the case of AUC and G-mean, both of which tend to be highly correlated, two types of behavior can be observed. First of all, in the case of the linear models, that is, LR, L-SVM, and L-MLP, the best average performance was observed for energy values in {0.5, 1.0, 2.5, 5.0}, with little to no difference between those values. Secondly, in the case of the remaining classifiers, the optimal performance was observed around an energy value of 5.0, with both a decrease and an increase of energy negatively affecting the performance. Considering the fact that as the energy decreases the method's behavior more closely resembles random oversampling, this seems to indicate that the expected performance gain due to using RB-CCR is highest for non-linear methods, capable of producing more complex decision boundaries. Finally, regardless of the choice of classifier, from a practical standpoint the observed results also suggest that an energy value of 5.0 is a sensible default.
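To illustrate how an energy budget can regulate sphere size, the sketch below expands a sphere around a minority instance until the budget is spent, with the marginal expansion cost growing by one unit per majority point already inside. This is a simplified reading of CCR's mechanism in that spirit, not its exact cost function:

```python
import numpy as np

def sphere_radius(center, majority, energy):
    """Expand a sphere around `center` until the energy budget runs out; each
    majority point absorbed into the sphere makes further expansion costlier
    (a simplified reading of the CCR energy mechanism)."""
    center = np.asarray(center, dtype=float)
    majority = np.asarray(majority, dtype=float)
    dists = np.sort(np.linalg.norm(majority - center, axis=1))
    radius, budget, cost_per_unit = 0.0, float(energy), 1
    for d in dists:
        step_cost = (d - radius) * cost_per_unit
        if step_cost > budget:
            return radius + budget / cost_per_unit
        budget -= step_cost
        radius = d
        cost_per_unit += 1  # one more majority point now inside the sphere
    return radius + budget / cost_per_unit
```

With no majority points nearby, the radius grows linearly with energy; a dense majority neighborhood slows the growth, which matches the intuition that energy trades precision for recall.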