1 Introduction

The problem of imbalanced data classification is one of the important subareas of machine learning research. This is dictated by the fact that most real-world decision problems are of this nature, i.e., they exhibit a significant disparity between the object counts of the different classes. In the case of a two-class classification task, to which this work is devoted, the relationship is straightforward: the class with a large number of objects is called the majority class, and the class with a small number of instances is called the minority class. The critical point here is that, as a rule, the cost of misclassifying objects of the minority class is much higher than the cost of errors made on majority class instances. We should be aware that in the case of multiclass problems, such a relation is no longer obvious, and minority-majority relations may occur only between some pairs of classes. Moreover, a given class may be a majority class with respect to some classes and simultaneously a minority class with respect to others. This problem will not be considered in this paper, and we will focus on the binary classification task. For imbalanced data, it is observed that canonical classifier learning methods that do not explicitly optimize the chosen quality criterion tend to produce models biased towards the majority class. However, even in cases in which we can provide a learning criterion that accounts for the different costs of errors for each class, the problem is obtaining such information, e.g., in the form of a loss matrix, from the user (Branco et al., 2016). The question is whether the disparity between classes is indeed the most significant difficulty, or whether other factors affecting the difficulty of the data may be crucial, since it is not difficult to imagine a decision problem that, despite a significant disparity between classes, is easy to classify using traditional methods (see Fig. 1).

Fig. 1 Easy classification task for imbalance ratio 1:1000

It seems that difficulties related to the classification of imbalanced data should be sought in the characteristics of the conditional probability distributions of classes. Napierala and Stefanowski (2016) noticed that minority class objects in particular tend to form small, disconnected clusters, which, combined with the small number of objects in the minority class, causes additional difficulty in correctly learning the models. One may find several taxonomies related to evaluating the difficulty of classifying minority class objects. As a rule, one distinguishes between the objects that do not present problems with correct classification (usually known as safe examples) and the remaining instances, known as unsafe. A popular taxonomy determines the difficulty of a minority class object using the number of minority class objects found among its five nearest neighbors. This leads to a division into:

  • Examples lying near each other can be considered as safe ones.

  • Instances close to the border between classes, where examples of different classes may overlap, are identified as borderline examples.

  • Groups of a few examples of a class in areas of other classes are known as rare examples.

  • Outliers are completely isolated examples of a concrete class.

Figure 2 shows a sample illustration of different types of minority class objects.

Fig. 2 Examples of different minority class instances: “S” stands for safe, “B” for borderline, “R” for rare, and “O” for outliers
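The five-neighbor taxonomy described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names are hypothetical, and the thresholds (4 to 5 same-class neighbors for safe, 2 to 3 for borderline, 1 for rare, 0 for outlier) are an assumption following the convention commonly used with this taxonomy, since the text does not state them.

```python
import math

def same_class_neighbor_counts(X, y, k=5):
    """For every object, count how many of its k nearest neighbors share its class."""
    counts = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances to all other objects, sorted from nearest to farthest.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbors = [j for _, j in dists[:k]]
        counts.append(sum(1 for j in neighbors if y[j] == yi))
    return counts

def minority_type(n_same):
    """Map a same-class neighbor count (out of 5) to a type; thresholds are assumed."""
    if n_same >= 4:
        return "safe"
    if n_same >= 2:
        return "borderline"
    if n_same == 1:
        return "rare"
    return "outlier"

# Toy data: a compact minority cluster, a majority cluster, and one isolated
# minority object placed inside the majority cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.6),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5), (12, 12),
     (10.4, 10.6)]
y = [1] * 6 + [0] * 6 + [1]

counts = same_class_neighbor_counts(X, y, k=5)
types = [minority_type(c) for c, label in zip(counts, y) if label == 1]
```

In this toy set, the six clustered minority objects have only minority neighbors, while the isolated one has only majority neighbors.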

This paper presents the Local Neighborhood Encodings (LNE) algorithm, a hybrid algorithm for preprocessing imbalanced data. LNE uses both oversampling and undersampling methods. The intensity of these methods is set separately for each fraction of minority and majority class objects and is chosen according to the neighborhood type of the class instance, defined as the number of same-class objects among the nearest neighbors of the given object. An evolutionary algorithm is used to solve this optimization task.

In brief, the most important contributions of this work are:

  • Proposition of LNE - a hybrid preprocessing method for imbalanced data.

  • Formulation of an optimization task and representation for a selected evolutionary algorithm.

  • Analysis of the effect of classification task characteristics such as imbalance ratio or size of each object type fractions on the quality of the proposed method.

  • Discussion on the data set characteristics returned by LNE.

  • Experimental evaluation of the quality of the LNE for different types of classifiers and the impact of using a simple proxy classification model required for feature value estimation in the optimization algorithm.

  • Experimental evaluation of the proposed approach based on diverse benchmark datasets and a detailed comparison with the state-of-the-art approaches.

  • Discussion on an ablation study to demonstrate how the various components of the LNE affect its quality.

2 Related work

The problem of imbalanced data is encountered in a significant fraction of real-world problems and in recent years, the main focus has been on the analysis of tabular data. However, a growing number of works indicate that this problem is also very important for image classification since many of the datasets are characterized by this type of property, making generalizations of deep models difficult (Johnson & Khoshgoftaar, 2019; Kim et al., 2021).

The preference for the majority class by classification models can be mitigated by modifying the training data to bring the minority class size in line with the majority class size or by modifying the classifier’s learning or decision-making process to account for class disparity and thus increase the sensitivity of decisions towards the minority class.

We may divide the techniques used to deal with data imbalance into two categories. In the first approach (also called algorithm-level solutions), the problem of data imbalance is considered at the stage of classifier learning, e.g., by accounting for the different costs of errors between classes in the learning criterion. In contrast, the second approach (data-level solutions) does not interfere with the learning of the classifier itself but modifies the learning set before the learning process starts to compensate for the differences in the number of examples from each class.

In this paper, we will mainly focus on the second approach. Hence, let us characterize the most important techniques associated with it. In general, these methods try to remove objects from the majority class (undersampling) or increase the size of the minority class (oversampling). In this regard, it should be noted that these actions may be purely random, or the process of removing objects of the majority class or adding instances of the minority class may result from an analysis of the distributions of the different fractions of objects.

Randomized preprocessing methods are easy to implement and have low computational complexity. However, it is essential to realize that in some cases, they may have an adverse effect on the dataset. Random Undersampling (RUS) can lead to the rejection of instances that are important for creating the correct class boundary or that lie in specific subregions of the target class. On the other hand, random oversampling can lead to the replication of noisy instances and thus inappropriately affect the actual class distribution. The most straightforward sampling-based approach to data imbalance is Random Oversampling (ROS). Using ROS, new objects are generated by replicating randomly selected existing objects. The disadvantage of this approach is that it leads to the clustering of minority objects in small areas where the original instances were located. This can be a problem for some classifiers, especially those that tend to overfit. An interesting approach is the oversampling method proposed by Lee (2000), which produces noisy replicas of minority class objects while keeping the majority class unchanged.
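To make the two randomized strategies concrete, the following minimal sketch shows ROS as replication of randomly selected minority objects and RUS as random removal of majority objects. The function names and the toy data are hypothetical and serve only as an illustration.

```python
import random

def random_oversample(minority, n_new, rng):
    """Return the minority set extended with n_new randomly chosen replicas."""
    return list(minority) + [rng.choice(minority) for _ in range(n_new)]

def random_undersample(majority, n_remove, rng):
    """Return the majority set with n_remove randomly chosen objects discarded."""
    return rng.sample(majority, len(majority) - n_remove)

rng = random.Random(0)
minority = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4)]
majority = [(float(i), float(i % 7)) for i in range(30)]

oversampled = random_oversample(minority, 7, rng)      # 3 originals + 7 replicas
undersampled = random_undersample(majority, 20, rng)   # 10 of the 30 objects kept
```

Note that every object added by ROS coincides with an existing minority object, which is exactly the source of the clustering effect mentioned above.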

Although randomized methods often have good results, it should be noted that many authors try to develop strategies that add synthetic minority examples or remove majority instances in a guided manner.

For nearly two decades, the most popular method for creating new instances of a minority class has been the SMOTE algorithm (Fernández et al., 2018). It involves randomly generating objects between minority class instances. Although SMOTE works well in many practical applications, it has been noted that it can lead to a change in the distribution of the minority class, resulting in overfitting of the classifier. Therefore, several modifications have been proposed, specifically aiming to ensure that minority class objects are not generated in areas potentially belonging to the majority class. For instance, Borderline-SMOTE (Han et al., 2005) generates synthetic minority class objects near the decision boundary. In contrast, Safe-Level-SMOTE (Bunkhumpornpat et al., 2009) and LN-SMOTE (Maciejewski & Stefanowski, 2011) avoid generating minority class instances in areas where objects belonging to the majority class dominate. Other popular methods include ADASYN (He et al., 2008), which generates synthetic objects by taking into account which areas are difficult to classify and therefore increases the number of minority class instances generated in those areas. Also, Sáez et al. (2016) proposed that the fraction of objects subject to oversampling for each particular problem should be chosen according to their difficulty type, while the rest should be left unmodified.
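The core interpolation step of SMOTE can be illustrated with the following simplified sketch. It is not a reference implementation: the function name, the default neighborhood size, and the toy data are assumptions, and practical implementations handle details (such as per-object sampling quotas) that are omitted here.

```python
import math
import random

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority point and one of its k nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("at least two minority points are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Other minority points ordered by distance to the base point.
        others = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority, n_new=10, k=2)
```

Because every synthetic point is a convex combination of two minority points, the generated objects stay inside the region spanned by the minority class, which may overlap with the majority class; this is what the guided variants listed above try to avoid.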

It is also worth noting RBO (Radial-Based Oversampling) (Koziarski et al., 2017), which estimates the mutual distribution of minority and majority class objects using potential functions to select the area in which minority class objects are generated. Koziarski also proposed the Potential Anchoring algorithm (Koziarski, 2021a), which likewise uses potential functions to ensure invariant probability distributions across classes during the resampling process. On the other hand, CCR (Combined Cleaning and Resampling) (Koziarski & Woźniak, 2017) combines two techniques: cleaning the area around minority class objects by removing majority class objects from their neighborhood, and generating synthetic minority class objects in that area.

As mentioned, RUS carries the risk of removing important objects from the majority class, which may lead to constructing a classifier that ignores less dense clusters of the majority class. Several methods have been proposed to avoid this tendency. They try to analyze the local mutual distributions of the minority and majority classes during undersampling. For example, ENN (Edited Nearest Neighbor) removes majority examples based on the homogeneity of a given sample’s neighborhood. RBU (Radial-Based Undersampling) (Koziarski, 2020) employs a previously introduced concept of mutual class potential. SMUTE (Synthetic Majority Undersampling Technique) utilizes the data interpolation used in SMOTE to reduce the number of observations from the majority class (Koziarski, 2021b).

Many methods try to combine both techniques, i.e., over- and undersampling. For example, CSMOUTE (Combined Synthetic Oversampling and Undersampling Technique) (Koziarski, 2021b) combines the mentioned SMUTE method with SMOTE. On the other hand, Galar et al. proposed a technique for combining under- and oversampling with the classifier ensemble (Galar et al., 2012). Other authors combine data preprocessing techniques with algorithm-level solutions, mainly based on techniques having their roots in classifier ensembles built on perturbed learning sets. An overview of these methods can be found in Fernández et al. (2018).

Some works treat guided data preprocessing as an optimization task. Kim et al. (2016) proposed a hybrid method using a clustering technique and genetic algorithms (GA) based on the artificial neural network model to balance an imbalanced data distribution. Barandela et al. (2005) employed a genetic algorithm to balance the data distribution and perform feature selection simultaneously. On the other hand, Khoshgoftaar et al. (2010) proposed using an evolutionary algorithm for the undersampling task. García and Herrera also used an evolutionary algorithm to propose a family of undersampling techniques called EUS (Evolutionary Undersampling) (García & Herrera, 2009). Later this concept was extended by Wojciechowski, who applied the multicriteria optimization algorithm NSGA-2 (Deb et al., 2002) to the undersampling task (Wojciechowski, 2021). A multicriteria optimization approach was additionally considered by Węgier et al. in the tasks of building ensembles (Węgier et al., 2022) and providing interpretable decision trees (Węgier et al., 2023). Hualong et al. (2013) used ant colony optimization to improve imbalanced DNA microarray data classification performance. Metaheuristic algorithms are also used successfully for oversampling. Examples include GenSample (Karia et al., 2019) and ACO Resampling (Li et al., 2020), which uses Ant Colony Optimization.

When using data balancing methods, one must also ask what degree of balancing one wants to achieve. Most works try to reach an equal number of minority and majority class instances. However, for some classifiers, such as decision trees, it has been shown that this approach does not give the best results (Weiss & Provost, 2003). Moreover, Khoshgoftaar et al. have shown experimentally that when the imbalance ratio is large, the balancing process should be stopped for IR values between 2:1 and 3:1 (Khoshgoftaar et al., 2007). Although this characteristic is well known, most authors seem to ignore it in their studies.

A more comprehensive review of work related to data balancing methods, as well as of the so-called algorithm-level solutions not discussed here, can be found in the review papers (Branco et al., 2016; Krawczyk, 2016).

3 Local neighborhood encodings

Our approach is inspired by the categorization of observation types proposed by Napierała and Stefanowski (2016), and a further study by Sáez et al. (2016), in which selective oversampling of different observation types was analyzed. In a nutshell, the proposed categorization was based on the five-nearest-neighbor taxonomy presented in the previous sections. The aforementioned study by Sáez et al. later used this categorization in an approach in which only the observations of specific types are used for oversampling. They used the same division into safe, borderline, rare and outlier observations, and exhaustively evaluated all 16 (\(2^4\)) combinations of different minority observation types that can be used. They were able to experimentally demonstrate that limiting oversampling to specific observation types can improve the performance of the imbalanced data classification algorithms.

We extend the idea of selective resampling. First, we take advantage of the fact that some previous studies have shown that a combination of oversampling and undersampling can improve performance compared to an approach using only one (Koziarski, 2021a, 2021b). For this reason, we will combine these techniques and consider the intensity of each process as the parameter being optimized. Second, the optimization process will also determine the oversampling intensity of each type of minority class observation. Moreover, based on the observations pointed out, among others, by Khoshgoftaar et al. (2007), the imbalance ratio of the final set will also be treated as a parameter to be determined in our algorithm.

To implement the above ideas we designed an approach based on the evolutionary optimization of a Local Neighborhood Encoding, a real-valued vector of numbers encoding the number of observations from specific types that will be either created via oversampling or removed via undersampling. We present the high-level overview of the approach in Fig. 3, and a detailed pseudo code in Algorithm 1.

Fig. 3 Schematic drawing of the proposed approach. First, the original dataset is divided into bags based on the number of same-class nearest neighbors, and afterwards into several cross-validation folds. Then, an evolutionary algorithm is used to optimize Local Neighborhood Encodings coding the number of observations with a specific number of same-class nearest neighbors that will be over- and undersampled. Finally, LNE resamples the original dataset

Algorithm 1 Local neighborhood encodings

Let us first describe how we will encode the strength of resampling for each observation type. It is worth noting that there is a major practical distinction between over- and undersampling, as the former is unbounded. In principle, we can generate synthetic observations ad infinitum. On the other hand, in the case of undersampling, there is a clear bound equal to the number of the original observations. Because of that, the two will have to be encoded differently. Specifically, the approach we propose is based on coding all of the information required to produce the resampling counts (that is, the numbers of observations of specific types that will be either over- or undersampled) as a \(2k + 1\) element vector, with k being the parameter describing the size of the neighborhood used for type calculation:

$$\begin{aligned} \left[ r,o_1,...,o_k,u_1,...,u_k \right] \end{aligned}$$

The first element r of this vector encodes the strength of oversampling, i.e., the total number of observations generated via oversampling will be equal to \(n_{over} = r \cdot r_{max} \cdot \left( |\mathcal {X}_{maj} |- |\mathcal {X}_{min} |\right)\), with \(r_{max}\) being a hyperparameter bounding the number of oversampled observations (which in practice we will set to be fairly high, i.e., equal to 5, to allow for an oversampling strength search within this bound), and \(\mathcal {X}_{maj}\) and \(\mathcal {X}_{min}\) being the collections of majority and minority observations, respectively. The next k elements \(o_1, o_2,..., o_k\) encode the relative strength of oversampling from particular observation types, with the resulting number of observations generated based on a specific observation type m defined as \(c^o_m = n_{over} \cdot \frac{o_m}{\sum _{l = 1}^{k}{o_l}}\). Finally, the last k elements \(u_1, u_2,..., u_k\) encode the proportion of majority class observations of type m that will be removed during undersampling. Note that with such an approach, all elements of the encoding vector can be bounded within the range [0; 1].
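The decoding of an encoding vector into resampling counts, as defined by the formulas above, can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name, the argument `maj_type_counts` (the number of majority observations of each type), the rounding of counts to integers, and the handling of an all-zero oversampling block are assumptions.

```python
def decode_encoding(encoding, k, n_maj, n_min, maj_type_counts, r_max=5.0):
    """Turn [r, o_1..o_k, u_1..u_k] into per-type over- and undersampling counts."""
    if len(encoding) != 2 * k + 1:
        raise ValueError("encoding must have 2k + 1 elements")
    r = encoding[0]
    o = encoding[1:k + 1]
    u = encoding[k + 1:]

    # n_over = r * r_max * (|X_maj| - |X_min|)
    n_over = r * r_max * (n_maj - n_min)

    # c^o_m = n_over * o_m / sum_l o_l (assumed to be zero if all o_l are zero)
    total = sum(o)
    if total > 0:
        over_counts = [int(round(n_over * o_m / total)) for o_m in o]
    else:
        over_counts = [0] * k

    # u_m is the proportion of majority observations of type m to remove.
    under_counts = [int(round(u_m * c)) for u_m, c in zip(u, maj_type_counts)]
    return over_counts, under_counts

# Toy example with k = 2 types, 100 majority and 20 minority observations.
over_counts, under_counts = decode_encoding(
    encoding=[0.2, 1.0, 3.0, 0.5, 0.0],
    k=2, n_maj=100, n_min=20, maj_type_counts=[40, 60],
)
```

In this toy example \(n_{over} = 0.2 \cdot 5 \cdot 80 = 80\), split 1:3 between the two minority types, while half of the 40 majority observations of the first type are removed.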

Given such an encoding scheme, we can formulate the resampling process as an optimization procedure of the encoding for the given dataset. Proposed solutions will be evaluated on the cross-validation folds, with the encoding used to obtain observation type-dependent resampling counts for a specific training fold. Based on the resulting counts, resampling will be performed on the training fold, an estimator fitted on it, and the performance evaluated on the test fold. The final performance score will be the average of scores across folds. Specifically, we will evaluate the quality of a given encoding using \(3\times 2\) cross-validation (Raschka, 2018), and any desired target metric as an optimization criterion. While, in principle, various oversampling, undersampling, and optimization algorithms can be used, in this paper we will employ SMOTE (Chawla et al., 2002) for oversampling (with random observations from a currently considered observation type used as the starting points), random undersampling as the undersampling algorithm, and Differential Evolution (Price et al., 2006) as the optimization algorithm. Once the optimization procedure is finished, we will use the resulting encoding to resample the whole training dataset.
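The optimization loop can be illustrated with a compact Differential Evolution sketch over the unit hypercube. Several parts are assumptions made only for illustration: the text does not specify the DE variant, so a basic rand/1/bin scheme with in-place replacement is used; the population size, F, and CR values are arbitrary; and `toy_fitness` is a placeholder standing in for the mean cross-validation score obtained after resampling with a given encoding.

```python
import random

def differential_evolution_unit_box(fitness, dim, pop_size=20, generations=50,
                                    f=0.8, cr=0.9, seed=0):
    """Maximize `fitness` over [0, 1]^dim with a basic DE/rand/1/bin scheme."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    history = [max(scores)]  # best score after initialization and each generation
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = list(pop[i])
            for d in range(dim):
                if d == j_rand or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial[d] = min(1.0, max(0.0, value))  # clip to [0, 1]
            trial_score = fitness(trial)
            # Greedy selection: the trial replaces its parent only if not worse.
            if trial_score >= scores[i]:
                pop[i], scores[i] = trial, trial_score
        history.append(max(scores))
    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best], history

# Placeholder objective; in LNE this would be the averaged cross-validation score.
target = [0.3, 0.7, 0.5]

def toy_fitness(vector):
    return -sum((v - t) ** 2 for v, t in zip(vector, target))

best, best_score, history = differential_evolution_unit_box(toy_fitness, dim=3)
```

Because every fitness call in the real setting involves resampling and fitting an estimator on each fold, the number of evaluations (population size times generations) dominates the running time.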

Finally, it is worth noting that one advantage of the algorithm being formulated in such a way is that the encodings are, to some extent, interpretable: they describe the overall degree of balancing (with the dataset being either balanced completely, only partially, or overbalanced, with the old minority class becoming the new majority), the proclivity towards either over- or undersampling, and the focus of the resampling on specific types of observations. In particular, the last one can be viewed as desirable since the idea of focusing the resampling on specific observation types is present in a number of existing approaches, such as Borderline-SMOTE (Han et al., 2005) and Safe-Level-SMOTE (Bunkhumpornpat et al., 2009). However, since these approaches are by design contradictory, there is little justification to prefer any of them a priori, outside the context of a specific dataset. An analysis of the encodings produced during the optimization for a larger body of datasets could shed some light on the trends associated with focusing the resampling on specific observation types: this idea will be revisited in Sect. 4.5.

4 Experimental study

To evaluate the proposed approach’s usefulness and properties, we conducted an experimental study. It aims to answer the following research questions:

  1. RQ1: How does LNE compare with state-of-the-art resampling strategies?

  2. RQ2: What design choices are responsible for the performance of LNE?

  3. RQ3: Can LNE be sped up using less computationally expensive proxy estimators?

  4. RQ4: Can the solutions found by LNE provide any generalizable insights into imbalanced data resampling?

4.1 Set-up

Data. The conducted experimental study was based on the binary imbalanced datasets provided in the KEEL repository (Alcalá-Fdez et al., 2011), with a total of 60 datasets used. Their details are presented in Table 1. In addition to the imbalance ratio (IR), the number of samples and the number of features, for each dataset we computed the data difficulty index (DI) (Koziarski, 2021a) using \(m = 5\) nearest neighbors, which is a [0; 1] bounded function measuring the difficulty of a given dataset. Prior to resampling and classification, each dataset was preprocessed: categorical features were encoded as integers, and afterwards all features were standardized by removing the mean and scaling to unit variance.

Table 1 Summary of the characteristics of datasets used throughout the experimental study

Classification. Four different classification algorithms were used throughout the experimental study: CART decision tree, k-nearest neighbors classifier (KNN), support vector machine (SVM) and multi-layer perceptron (MLP). The implementations of the classification algorithms provided in the scikit-learn machine learning library (Pedregosa et al., 2011) were utilized. The hyperparameters of the classification algorithms are presented in Table 2.

Table 2 Parameters of the classification and the sampling algorithms used throughout the experimental study

Reference resampling methods. We considered several other state-of-the-art resampling strategies. We based our choice on a recent ranking constructed by Kovács (2019), out of which we selected the following best-performing methods: SMOTE (Chawla et al., 2002), Polynomial Fitting SMOTE (pf-SMOTE) (Gazzah et al., 2008), Oversampling with Rejection (Lee) (Lee et al., 2015), Synthetic Minority Oversampling Based on Sample Density (SMOBD) (Cao et al., 2011), Partially Guided Oversampling (G-SMOTE) (Sandhan & Choi, 2014), Learning Vector Quantization-based SMOTE (LVQ-SMOTE) (Nakamura et al., 2013), Assembled SMOTE (A-SMOTE) (Zhou et al., 2013) and SMOTE combined with Tomek Links (SMOTE-TL) (Batista et al., 2004). The implementations of the reference methods provided in the smote-variants library (Kovács, 2019) were utilized. The hyperparameters of the resampling algorithms are presented in Table 2. For all of the reference methods we adjusted the proportion of oversampling using \(3\times 2\) cross-validation, selecting values of the oversampling proportion from \(\{0.1, 0.2, 0.5, 1.0, 2.0, 5.0\}\), with 1.0 indicating resampling up to the point of achieving balanced class distributions.

Evaluation. For every dataset, we reported the results averaged over the \(5\times 2\) cross-validation folds (Alpaydin, 1999).

Let us define the metrics used. First, we define the confusion matrix, which summarizes the number of instances from each class classified correctly or incorrectly as the remaining classes (see Table 3).

Table 3 Confusion matrix for two-class classification task

On the basis of the confusion matrix, we may define

$$\begin{aligned} recall=\frac{ TP}{TP+FN} \end{aligned}$$
(1)

that is also known as sensitivity.

$$\begin{aligned} precision= & {} \frac{TP}{TP+FP} \end{aligned}$$
(2)
$$\begin{aligned} specificity= & {} \frac{TN}{TN+FP} \end{aligned}$$
(3)

Throughout the experimental study we reported the values of AUC, balanced accuracy (BAC), G-mean, and \(F_\beta\) score (F-beta):

$$\begin{aligned} G\text{-}mean= & {} \sqrt{precision \times recall} \end{aligned}$$
(4)
$$\begin{aligned} BAC= & {} \frac{sensitivity+specificity}{2} \end{aligned}$$
(5)
$$\begin{aligned} F_{\beta }= & {} \frac{(\beta ^2+1) \times {Precision} \times {Recall} }{\beta ^2 \times {Precision} + {Recall} } \end{aligned}$$
(6)

The parameter \(\beta\) can be tuned for different trade-offs between precision and recall. Nevertheless, using such a metric could be dangerous unless \(\beta\) is set appropriately: Brzezinski et al. (2018) showed that an inappropriate parameter setting for the \(F_{\beta}\) score may favor the majority class in the imbalanced data classification task. During the experiments, \(\beta\) was chosen individually for each dataset and set equal to its imbalance ratio (Stapor et al., 2021). In our experiments we also used the area under the precision-recall curve (AUC), which is computed on predicted probabilities.
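As a worked example, the threshold-based metrics defined above can be computed directly from the four confusion matrix entries. The sketch below is illustrative only: the function name and the example counts are hypothetical, G-mean is taken as the geometric mean (the square root of the product) of precision and recall, and degenerate cases with zero denominators are deliberately not handled.

```python
import math

def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute recall, precision, specificity, G-mean, BAC and F-beta
    from confusion matrix counts (minority class treated as positive)."""
    recall = tp / (tp + fn)            # also known as sensitivity
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(precision * recall)
    bac = (recall + specificity) / 2
    b2 = beta ** 2
    f_beta = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "g_mean": g_mean,
        "bac": bac,
        "f_beta": f_beta,
    }

# Hypothetical counts: 10 minority and 90 majority test objects (IR = 9).
m = imbalance_metrics(tp=8, fn=2, fp=4, tn=86, beta=1.0)
m_ir = imbalance_metrics(tp=8, fn=2, fp=4, tn=86, beta=9.0)  # beta set to the IR
```

With \(\beta = 1\) the F-beta score weighs precision and recall equally, whereas with \(\beta\) equal to the imbalance ratio the score moves towards recall (0.8 in this example), reflecting the higher cost of missed minority objects.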

Implementation and reproducibility. The experiments described in this paper were implemented in the Python programming language. Complete code, sufficient to repeat the experiments, as well as complete results in a CSV format, and tables showing average performance on each dataset, separately for every classifier and performance metric, were made publicly available (see Footnote 1). We used the scikit-learn (Pedregosa et al., 2011) library to implement the experimental protocol, performance metrics, and classifiers.

4.2 Comparison with reference methods

We began the experimental analysis by comparing the proposed approach to several state-of-the-art reference resampling methods. Average ranks, together with the results of a statistical comparison using the Friedman test combined with Shaffer’s post-hoc, reported at a significance level \(\alpha = 0.05\), are presented in Table 4. As can be seen, when compared with respect to the BAC, G-mean, and F-beta metrics, the proposed LNE approach outperformed the reference methods in every case, usually achieving statistically significantly better results, demonstrating the general usefulness of the proposed approach. However, it is worth mentioning that we observed no statistically significant differences when the comparison was made using AUC (under the precision-recall curve). It is not entirely clear what caused this behavior. One possible explanation is that the results computed with a metric using probability scores instead of binarized predictions are generally more stable and less susceptible to data imbalance. Another possible explanation is that probability-based metrics are more difficult to optimize and/or more susceptible to overfitting, reducing the final performance of the model.

Table 4 Average ranks of the evaluated methods calculated for all of the considered datasets

To better illustrate the performance of the proposed approach we also conducted pairwise comparisons on individual datasets, comparing the results on all 10 cross-validation folds using Wilcoxon rank-sum test at a significance level \(\alpha = 0.10\). The number of datasets for which LNE achieved statistically significantly better or worse results is presented in Fig. 4. As can be seen, there is a clear discrepancy depending on the performance metric chosen: F-beta (which should penalize the predictions biased towards the majority class most heavily) led to the most wins, BAC and G-mean to a medium amount, and AUC to the fewest, with the number of losses low in every case.

Fig. 4 Win-loss-tie comparison between individual reference methods and LNE, with the number of datasets on which LNE achieved statistically significantly better performance denoted with green, statistically significantly worse with red, and no statistically significant differences with yellow (Color figure online)

4.3 Ablation study

Having established that the proposed LNE approach can outperform the reference methods, we next tried to examine what specific design choices led to this outperformance. We performed an ablation study to compare LNE with its two different, simplified variants:

  • LNE\(_{NRS}\), that is, LNE with no ratio selection, for which the balancing was always performed up to achieving a perfectly balanced distribution (or, in other words, for which the optimization of the balancing ratio was disabled).

  • LNE\(_{NTS}\) that is LNE with no type selection: a variant in which the resampling was based on all available observation types, and only the balancing ratio was optimized.

Notably, while LNE\(_{NRS}\) differed only in how the resampling counts were assigned and still used \(2k + 1\) values as the encoding, the LNE\(_{NTS}\) variant was encoded using only two values: the oversampling and undersampling ratios.

Note that, because of the long training time of the MLP classifier and the resulting high computational cost, we excluded it from this comparison.

The average ranks obtained when comparing the ablation variants with the baseline LNE were as follows: LNE: 1.83, LNE\(_{NRS}\): 2.20, and LNE\(_{NTS}\): 1.96. To determine whether the rank differences are statistically significant, the Friedman test with Shaffer’s post-hoc was performed, and the following results were obtained: LNE vs. LNE\(_{NRS}\) (p-value < 0.001), LNE vs. LNE\(_{NTS}\) (p-value = 0.015). As can be seen, the baseline LNE using both mechanisms (type and ratio selection) achieved statistically significantly better results than the ablation variants. The discrepancy was largest in comparison to the ablation variant with no ratio selection, indicating that ratio selection was the major contributor to the outperformance of LNE over the reference algorithms. In contrast, type selection made a smaller contribution: we suspect this is because the variant with no type selection had only two optimizable parameters, thus being much less prone to overfitting. Nevertheless, type selection can still be used as a vehicle for meta-analysis of the algorithm’s behavior, which will be discussed in Sect. 4.5. Finally, it is worth noting that we aggregated the results over all classifiers and performance metrics to better illustrate the general trends; these trends were not consistent for every particular classifier and performance metric combination.

4.4 Using proxy estimator during optimization

So far, we have established that LNE tends to outperform the reference methods. However, this comes at a high computational cost, especially for classifiers based on a computationally expensive training procedure, such as MLPs. Note that this cost is incurred when evaluating every offspring in every iteration of the optimization algorithm. The question is whether it is possible to reduce this cost, e.g., by using a proxy classifier that does not require many computational resources to train. Such an approach would use the proxy classifier only in the optimization stage, while the final model would be trained using the optimized parameters and the target classification model. In summary, the question in this case is whether the results would remain competitive compared to the reference methods.

To test this hypothesis, we conducted an experiment using the CART decision tree as the proxy classifier (i.e., CART was used during the optimization procedure), and used the obtained encoding to fit the final MLP model. We chose CART based on its training time, but in principle, any other classification algorithm could be used. The results are presented in Table 5. As can be seen, when compared to the original results in Table 4, the performance degraded for all of the performance metrics except F-beta: for BAC and G-mean, LNE still achieved statistically significantly better results than some of the reference approaches, but their number was considerably lower. For AUC, on the other hand, using the proxy estimator actually produced statistically significantly worse results than some of the reference methods. No noticeable changes were observed for F-beta. Overall, this leads to the conclusion that while the behavior depended on the choice of the performance metric, in most cases using a proxy estimator tends to decrease the performance of LNE, even though it reduces the training time.
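The proxy-estimator workflow can be summarized by the following minimal sketch. It is not the LNE implementation: the evolutionary search is replaced here by a plain loop over a given list of candidate encodings, a held-out validation split is assumed for scoring, and all callables (`resample`, `fit_proxy`, `fit_final`, `score`) are hypothetical placeholders supplied by the caller.

```python
def optimize_with_proxy(candidates, resample, fit_proxy, fit_final, score,
                        train, valid):
    """Select an encoding with a cheap proxy model, then fit the target model once.

    candidates : iterable of candidate encodings (assumption: given up front,
                 instead of being evolved as in LNE)
    resample   : callable (train, encoding) -> resampled training data
    fit_proxy  : callable (data) -> cheap model (e.g., a decision tree)
    fit_final  : callable (data) -> expensive target model (e.g., an MLP)
    score      : callable (model, valid) -> value to maximize
    """
    best_encoding, best_score = None, float("-inf")
    for encoding in candidates:
        # The cheap proxy model is trained once per candidate encoding.
        proxy_model = fit_proxy(resample(train, encoding))
        current = score(proxy_model, valid)
        if current > best_score:
            best_encoding, best_score = encoding, current
    # The expensive target model is trained only once, on the selected encoding.
    final_model = fit_final(resample(train, best_encoding))
    return best_encoding, final_model
```

The design point is that the number of expensive fits drops from one per evaluated candidate to one in total, at the price of selecting the encoding with a model that may prefer a different resampling than the target classifier.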

Table 5 Average ranks of the evaluated methods calculated for all of the considered datasets in the case of the proxy estimator, with CART used in the optimization process and MLP used as a final classifier

4.5 Analysis of the obtained encodings

Finally, having established the general usability of the proposed approach, we can proceed with the question of whether the solutions found by LNE provide any generalizable insights into imbalanced data resampling. We performed a meta-analysis of 9600 solution vectors obtained during the conducted experiments (for 60 different datasets, 10 cross-validation folds per dataset, four classification algorithms, and four optimization criteria).

We began with the analysis of global resampling properties, that is, trends regarding the level of balancing obtained during the resampling process and the preference towards either over- or undersampling. We present the distributions of IR values on datasets after resampling in Fig. 5, with IR = 1 corresponding to the case in which the resulting resampled dataset was perfectly balanced and IR < 1 corresponding to the case in which the old minority class became the new majority class. Analogously, we show the distributions of the adjusted oversampling-to-undersampling (O/U) ratios in Fig. 6, defined as O/U ratio = \(\frac{\# oversampled + 1}{\# undersampled + 1}\) (with the denominator incremented by 1 to avoid division by zero, and the numerator incremented by 1 to preserve the property of the O/U ratio being equal to 1 when the numbers of oversampled and undersampled observations are the same).
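The two global quantities can be computed as in the sketch below. The adjusted O/U ratio follows the definition given above; for the IR after resampling we assume the usual convention of dividing the number of (original) majority observations by the number of (original) minority observations, which is consistent with IR < 1 meaning that the old minority class became the larger one. The function names are illustrative.

```python
def adjusted_ou_ratio(n_oversampled, n_undersampled):
    """Adjusted O/U ratio: (#oversampled + 1) / (#undersampled + 1)."""
    return (n_oversampled + 1) / (n_undersampled + 1)


def ir_after_resampling(n_majority, n_minority, n_oversampled, n_undersampled):
    """IR after resampling, assuming IR = #majority / #minority.

    Oversampling adds observations to the minority class and
    undersampling removes observations from the majority class.
    """
    return (n_majority - n_undersampled) / (n_minority + n_oversampled)


# Illustrative numbers: 900 majority and 100 minority observations,
# 600 minority observations created and 200 majority observations removed.
print(ir_after_resampling(900, 100, 600, 200))  # 700 / 700
print(adjusted_ou_ratio(600, 200))              # 601 / 201
```

Both increments in the O/U ratio matter: with no resampling at all the ratio is 1 rather than undefined, and equal amounts of over- and undersampling also give exactly 1.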

Fig. 5 Distributions of IR values after resampling. Note log scale

Fig. 6 Distributions of adjusted O/U ratios. Note log scale

Several observations can be made based on the presented results. Firstly, the choice of the optimization criterion had a visible influence on the overall trends, more so than the choice of the classification algorithm. In the case of IR after resampling, for BAC and G-mean, the output datasets after resampling were, on average, roughly balanced; for F-beta overbalancing occurred, meaning that the old minority class became the new majority; and the opposite was the case for AUC, for which the resampling was least intense. A similar observation can be made for the adjusted O/U ratio. While for all of the classification algorithm and performance metric combinations, the average O/U ratio was greater than one, indicating that the preferred mode of operation was complementing strong oversampling with weak undersampling, the oversampling-to-undersampling ratio was the highest for AUC and the lowest for F-beta. However, it should be noted that we observed a high variance for both IR after resampling and O/U ratio, indicating the optimal choice is highly dataset specific. Still, on average, the results suggest that supplementing oversampling with weaker undersampling can be beneficial. However, exact resampling strength and the ratio of oversampling-to-undersampling should be selected individually for a given dataset.

It is also worth considering whether the LNE’s preference for a particular target IR is due to the characteristics of the measures used. Brzeziński et al. (2020) presented the distributions of selected metrics with respect to IR. Based on that analysis, it appears that G-mean and BAC retain their shape (i.e., the distribution of possible values) regardless of IR, which may be the reason that their use as an optimization criterion leads to reasonably balanced class distributions. In the case of the F-beta metric, where the value of \(\beta\) was chosen proportionally to IR, one might suspect that, since this metric treats recall as \(\beta\) times more important than precision, the model will naturally force the overrepresentation of minority class instances in the training set.

However, it should also be recalled that the results obtained are characterized by a large value of standard deviation, so it seems only fair to conclude that the target IR value should be chosen individually for each task, taking into account the preferred metric, as well as the classification model used.

The second question that we tried to address was whether some simple dataset characteristics could be used to predict the preferred resampling properties a priori without needing explicit evaluation on the target dataset. To this end we examined the relationship between the IR before and after the resampling (with Pearson correlation coefficients and p-values presented in Table 6, and scatterplots of the two variables in Fig. 7), as well as the number of observations and the IR after resampling (with correlation coefficients in Table 7, and scatterplots in Fig. 8).
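For reference, the correlation analysis amounts to computing the Pearson coefficient on log-transformed values, as in the minimal sketch below. The base of the logarithm is an assumption (base 10 is used here); it does not affect the coefficient, since changing the base only rescales both variables. The p-values reported in the tables are not reproduced by this sketch, and the function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def log_pearson(xs, ys):
    """Pearson correlation between log10(xs) and log10(ys); inputs must be positive."""
    return pearson([math.log10(x) for x in xs], [math.log10(y) for y in ys])


# Made-up values: IR before resampling vs. IR after resampling.
print(log_pearson([2.0, 10.0, 50.0, 100.0], [1.5, 1.2, 0.9, 0.8]))
```

A negative coefficient in this setting means that datasets with a higher original IR tended to end up with a lower IR after resampling.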

Fig. 7 Scatterplots of log IR values before and after resampling, with regression lines fitted

Table 6 Pearson correlation coefficients and p-values between log IR before and after resampling

Fig. 8 Scatterplots of log # of samples and log IR after resampling, with regression lines fitted

Table 7 Pearson correlation coefficients and p-values between log # of samples and log IR after resampling

As can be seen, for BAC, G-mean, and F-beta used as the optimization criteria, we observed weak-to-medium correlations, statistically significant in every case; less clear trends were observed in the case of AUC. We can conclude that while some dataset characteristics can be used as predictors of the resampling parameters, they are not correlated strongly enough to be used instead of traditional parameter selection. Still, some trends are visible: for instance, both the dataset size and the original IR tend to be negatively correlated with the resulting IR, meaning that larger and/or more imbalanced datasets tend to be resampled with lower strength. Note that we considered additional input (such as the number of features) and output (such as the resulting O/U ratio) variables, but observed either relations weaker than in the case of IR before and after resampling, or none at all. We did not include them all in the paper for brevity, but they are provided together with the algorithm’s implementation.

Finally, in the last stage of the conducted analysis, we proceeded with the question of what observation types tend to be favored during the resampling. We began by evaluating the average proportion of observations belonging to different observation types across all datasets and cross-validation folds (with the proportion calculated on the training partition). The results are presented in Table 8. As can be seen, in the case of the minority class all observation types were represented, on average, in roughly similar proportions (with the highest proportion assigned to observations with no same class neighbors, i.e., the more difficult examples). This was not the case for the majority class, for which instances with all neighbors belonging to the same class heavily dominated. We expanded on this by calculating the percentage of datasets for which at least a single observation from a given type was present, with the results presented in Table 9. As can be seen, some observation types were sparsely represented in the considered datasets; this might affect the results of the further analysis and needs to be taken into account.
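The observation types discussed here can be illustrated with the following minimal sketch, which counts, for every observation, how many of its k nearest neighbors share its class, and then computes the proportion of each type within a class. The use of Euclidean distance, the brute-force neighbor search, and the function names are assumptions made for illustration; k = 4 matches the neighborhood size referred to in this section.

```python
import math


def same_class_neighbor_counts(points, labels, k=4):
    """For each point, count how many of its k nearest neighbors share its label."""
    counts = []
    for i, p in enumerate(points):
        # Brute-force search: Euclidean distances to all other points.
        others = sorted((math.dist(p, q), j)
                        for j, q in enumerate(points) if j != i)
        counts.append(sum(labels[j] == labels[i] for _, j in others[:k]))
    return counts


def type_proportions(counts, labels, target_class, k=4):
    """Proportion of target_class observations of each type 0..k."""
    own = [c for c, y in zip(counts, labels) if y == target_class]
    return [own.count(t) / len(own) for t in range(k + 1)]


# Toy one-dimensional data: four majority points and one isolated minority point.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
labels = [0, 0, 0, 0, 1]
counts = same_class_neighbor_counts(points, labels, k=2)
print(counts, type_proportions(counts, labels, target_class=1, k=2))
```

In the toy data the single minority point has no same class neighbors, so it falls into type 0, the most difficult type.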

Table 8 Mean and standard deviations of the proportion of observations with k nearest neighbors belonging to the same class
Table 9 Percentage of datasets with at least a single observation for which k nearest neighbors belong to the same class

Next, we calculated the average proportion of observations of different types created or discarded due to either oversampling or undersampling via encodings generated by LNE. To calculate this average proportion, we only considered the datasets for which at least a single observation from said type was present (so that the datasets without specific observation types do not bias the results). Additionally, as previously mentioned, some observation types were underrepresented, so we normalized the results for each dataset in one of two ways. First, by the number of observations from a given (minority or majority) class: in the case of oversampling this value was equal to \(\frac{\# oversampled}{\# minority}\), whereas for undersampling, it was \(\frac{\# undersampled}{\# majority}\). This normalization was introduced to standardize the results across the datasets with different numbers of minority and majority class observations. Second, by the number of observations from the specific type present in the dataset. This type of normalization took into account the fact that some observation types were underrepresented.
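The two normalizations can be sketched as follows for a single dataset. Here `per_type_counts[t]` is assumed to hold the number of observations created (for oversampling) or discarded (for undersampling) around observations of type t; the function names and the example numbers are hypothetical.

```python
def normalize_by_class(per_type_counts, n_class):
    """Divide the per-type counts by the size of the whole (minority or majority) class."""
    return [c / n_class for c in per_type_counts]


def normalize_by_type(per_type_counts, type_sizes):
    """Divide each per-type count by the number of original observations of that type.

    Types absent from the dataset yield None, so that they can be excluded
    when averaging over datasets.
    """
    return [c / s if s > 0 else None
            for c, s in zip(per_type_counts, type_sizes)]


# Illustrative minority class of 100 observations with three types,
# of sizes 60, 5, and 0, around which 30, 10, and 0 observations were created.
print(normalize_by_class([30, 10, 0], 100))
print(normalize_by_type([30, 10, 0], [60, 5, 0]))
```

The example shows why the two views can disagree: the first type attracts the most oversampling in absolute terms, but relative to its size the second, much rarer type is oversampled four times as heavily.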

The results are presented in Table 10. First of all, it should be noted that the variance of the results was fairly high in every considered case, making the observed results reliable only to an extent. However, some trends can be observed: in the case of oversampling, when normalized by the number of minority observations, there was a monotonic trend, with the proportion of oversampling around the observations with no same class neighbors being the highest. However, this is partially due to the fact that they were most represented in the original data: when normalized by the number of observations of individual types, the most preferred type was that with a single same class neighbor in the 4-NN neighborhood, closely followed by that with two same class neighbors, which roughly corresponds to the rare and borderline instances in the standard taxonomy used by various SMOTE variants. Analogously, when normalized by the number of majority instances, undersampling seems to be heavily focused on the observations with all same class neighbors. However, when normalized by the number of observations from specific types, this trend reverses, and observations with all same class neighbors become the least preferred for undersampling. The overall trend is that while the variance is very high, indicating high per-dataset variability, an equivalent of rare and borderline minority observations tends to be favored for oversampling and safe majority observations for undersampling.

Table 10 Mean and standard deviations of the proportion of observations of different types (with k denoting the number of nearest neighbors from the same class) created or discarded due to either oversampling (O) or undersampling (U)

We also tried to answer the question of whether the composition of the original dataset, that is, the number of observations of a given type present in the data before resampling, affects the preference towards particular types during oversampling. We focused specifically on oversampling because of the greater diversity of observation types in the data. We computed the average proportions as before, normalized by the number of observations from individual types, but separately for the datasets in which a given observation type was present in a large proportion (\(\ge 0.3\)) in the original data. The results are presented in Table 11. Lower values of K indicate the datasets with a large proportion of less certain minority class observations, i.e., difficult datasets. While, once again, the variance of the results was high, we can observe a general trend of the more difficult datasets (\(K \in \{0, 1\}\)) producing encodings focusing resampling on observations other than outliers (\(k \ne 0\)), and less difficult datasets (\(K \in \{2, 3, 4\}\)) focusing on outliers (\(k = 0\)). This seems to indicate that with high baseline certainty resampling tends to focus on unsafe observations (with the hypothesis being that certain regions are represented well enough, and the borderline regions can be the focus of boosting), and with low baseline certainty on safe observations (since the high confidence regions of predictions have yet to be established).

Table 11 Mean and standard deviations of the proportion of observations of different types (with k denoting the number of nearest neighbors from the same class) created due to oversampling, normalized by the number of observations from individual types, computed only on the datasets for which the proportion of minority observations from type \(K \ge 0.3\)

Finally, having introduced the categorization into datasets consisting of a large proportion of original observations from a given type, we also examined whether there are any visible differences between the different types of datasets concerning the global properties (O/U ratio and IR after resampling). This comparison is presented in Table 12. Similar to the results of the previous analysis, the split between more (\(K \in \{0, 1\}\)) and less (\(K \in \{2, 3, 4\}\)) difficult datasets was visible here as well: specifically, more difficult datasets tended to favor using more undersampling than oversampling, and to resample to a lesser degree than the less difficult datasets.

Table 12 Mean and standard deviations of the adjusted O/U ratio and IR after resampling, computed only on the datasets for which the proportion of minority observations from type \(K \ge 0.3\)

5 Conclusions

This paper proposed Local Neighborhood Encoding, a novel technique for resampling imbalanced data, combining oversampling and undersampling in an evolutionary algorithm-based procedure that optimizes the proportion of resampling performed around different types of observations. The conducted experimental study showed that LNE significantly outperforms standard resampling algorithms. In addition, the ablation study showed that dynamic selection of resampling strength is the main factor in good LNE performance. Experiments with proxy estimators, a strategy that involves using a less computationally intensive classifier in the encoding optimization process, demonstrated that in some cases, especially when using performance metrics such as F-beta, it is possible to preserve the original performance of LNE while reducing training time.

Finally, utilizing the interpretability of the encodings, we conducted a meta-analysis of the solutions produced by LNE on a large set of benchmark datasets. While there was a significant variance in the obtained results, suggesting that dataset-specific tuning is still required, some common trends have been observed:

  • A combination of oversampling and undersampling was the preferred strategy, with strong oversampling combined with relatively weak undersampling.

  • The optimal strength of resampling was strongly dependent on the chosen performance metric: BAC and G-mean preferred approximately balanced distributions, while metrics such as F-beta favored overbalancing.

  • The general characteristics of the datasets before resampling can be, to some extent, used to predict the properties of the resampling (such as the oversampling-to-undersampling ratio or resampling strength); however, specific tuning of these parameters is still required to achieve the optimal performance.

  • Produced solutions, on average, tended to prioritize rare and borderline observations during oversampling, and unsafe examples during undersampling. Similar observations were made in the work mentioned earlier on selective oversampling, where the most common object fractions for oversampling were borderline and rare (Sáez et al., 2016).

  • However, when taking into account the original distribution of the observation types, for more difficult datasets there was a tendency to produce encodings focusing oversampling on observations other than outliers, and for less difficult datasets on outliers.

LNE is an effective resampling strategy that can be used when the dataset size is relatively small and/or computational resources are not limited. Its interpretability can also be used to gain insight into existing resampling strategies, as it allows the removal of errors introduced by methods such as Borderline-SMOTE and Safe-Level-SMOTE, the search for which becomes part of the optimization process. Possible future research directions include scaling the approach to larger dataset sizes and exploring the idea of using proxy estimators in more depth. The use of proxy classifiers in place of dataset difficulty scores should also be considered. However, this would require confirmation of the hypothesis that simplifying data distributions in preprocessing positively affects the quality of final classification models.