Local neighborhood encodings for imbalanced data classification

Koziarski, Michał; Woźniak, Michał

doi:10.1007/s10994-024-06563-6

Open access
Published: 10 June 2024

Machine Learning

Michał Koziarski¹ &
Michał Woźniak¹

Abstract

This paper proposes Local Neighborhood Encodings (LNE), a hybrid data preprocessing method dedicated to balancing skewed class distributions. The proposed LNE algorithm uses both over- and undersampling methods. The intensity of the methods is chosen separately for each fraction of minority and majority class objects. It is selected depending on the type of neighborhoods of objects of a given class, understood as the number of neighbors from the same class closest to a given object. The process of selecting the over- and undersampling intensities is treated as an optimization problem for which an evolutionary algorithm is used. The quality of the proposed method was evaluated through computer experiments. Compared with state-of-the-art (SOTA) resampling strategies, LNE shows very good results. In addition, an experimental analysis of the algorithm's behavior was performed, i.e., the determination of data preprocessing parameters depending on the selected characteristics of the decision problem, as well as the type of classifier used. An ablation study was also performed to evaluate the influence of components on the quality of the obtained classifiers. We also examine how the way the objective function is evaluated in the evolutionary algorithm influences classification quality. In the considered task, the objective function is de facto not deterministic and its value is subject to estimation. Hence, it was important from the point of view of computational efficiency to investigate the possibility of using a so-called proxy classifier for quality assessment, i.e., a classifier of low computational complexity used during optimization, while the final model is learned using a different classifier. The proposed data preprocessing method achieves high quality compared to SOTA methods; however, it should be noted that it requires significantly more computational effort. Nevertheless, it can be successfully applied in cases where no very restrictive model-building time constraints are imposed.

1 Introduction

The problem of imbalanced data classification is one of the important subareas of machine learning research. This is because most real-world decision problems are of this nature, i.e., they exhibit a significant disparity between the object counts of the different classes. In the case of the two-class classification task to which this work is devoted, the relationship is straightforward: the class with a large number of objects is called the majority class, and the class with a small number of instances is called the minority class. The critical point here is that, as a rule, the cost of incorrectly classifying objects of the minority class is much higher than that of errors made on majority class instances. We should be aware that in the case of multiclass problems, such a relation is no longer obvious, and minority-majority relations may occur only between some pairs of classes. Moreover, a given class may be a majority class with respect to selected classes and simultaneously a minority class with respect to others. This problem will not be considered in this paper, and we will focus on the binary classification task. For imbalanced data, it is observed that canonical classifier learning methods that do not explicitly optimize a suitable quality criterion prefer models biased towards the majority class. However, even in cases in which we can provide a learning criterion that accounts for the different cost of errors for each class, the problem is obtaining such information, e.g., in the form of a loss matrix, from the user (Branco et al., 2016). The question is whether the disparity between classes is indeed the most significant difficulty or whether other factors affecting the difficulty of the data may be crucial, since it is not difficult to imagine a decision problem that, despite a significant disparity between classes, is easy to classify using traditional methods (see Fig. 1).

It seems that difficulties related to the classification of imbalanced data should be sought in the characteristics of the conditional probability distributions of classes. Napierala and Stefanowski (2016) noticed that minority class objects, in particular, tend to form small, disconnected clusters, which, combined with the small number of objects in the minority class, causes additional difficulty in correctly learning the models. One may find several taxonomies related to evaluating the difficulty of classifying minority class objects. As a rule, one distinguishes between the fraction of objects that do not present problems with correct classification (usually known as safe examples) and the remaining instances, known as unsafe. A popular taxonomy determines the difficulty of a minority class object using the number of minority class objects found among its five nearest neighbors. This leads to a division into:

Examples lying near each other can be considered as safe ones.
Instances close to the border between classes, where examples of different classes may overlap, are identified as borderline examples.
Groups of a few examples of a class in areas of other classes are known as rare examples.
Outliers are completely isolated examples of a concrete class.

Figure 2 shows a sample illustration of different types of minority class objects.
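The taxonomy above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the neighbor search is brute force, and the thresholds (4 to 5 same-class neighbors for safe, 2 to 3 for borderline, 1 for rare, 0 for outlier) are an assumption following the convention commonly used with five nearest neighbors.

```python
import math


def same_class_neighbor_counts(X, y, k=5):
    """For each observation, count how many of its k nearest neighbors share its label."""
    counts = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Euclidean distances to every other observation (brute force, for clarity).
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbors = [j for _, j in dists[:k]]
        counts.append(sum(1 for j in neighbors if y[j] == yi))
    return counts


def observation_type(count):
    """Map a same-class neighbor count (k = 5) to a type name.

    The thresholds are an assumption, not taken from the text above.
    """
    if count >= 4:
        return "safe"
    if count >= 2:
        return "borderline"
    if count == 1:
        return "rare"
    return "outlier"
```

Applied to the minority class only, `observation_type` yields the four categories listed above; the same neighbor counts can also be computed for majority class objects.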

This paper will present the Local Neighborhood Encodings (LNE) algorithm, a hybrid algorithm for preprocessing imbalanced data. LNE uses both oversampling and undersampling methods. The intensity of these methods is separate for each fraction of minority and majority class objects and is chosen according to the neighborhood type of the class instance, defined as the number of neighbors of the same class closest to the given object. An evolutionary algorithm is used to solve this optimization task.

In brief, the most important contributions of this work are:

Proposition of LNE - a hybrid preprocessing method for imbalanced data.
Formulation of an optimization task and representation for a selected evolutionary algorithm.
Analysis of the effect of classification task characteristics such as imbalance ratio or size of each object type fractions on the quality of the proposed method.
Discussion on the data set characteristics returned by LNE.
Experimental evaluation of the quality of the LNE for different types of classifiers and the impact of using a simple proxy classification model required for feature value estimation in the optimization algorithm.
Experimental evaluation of the proposed approach based on diverse benchmark datasets and a detailed comparison with the state-of-the-art approaches.
Discussion on an ablation study to demonstrate how the various components of the LNE affect its quality.

2 Related work

The problem of imbalanced data is encountered in a significant fraction of real-world problems and in recent years, the main focus has been on the analysis of tabular data. However, a growing number of works indicate that this problem is also very important for image classification since many of the datasets are characterized by this type of property, making generalizations of deep models difficult (Johnson & Khoshgoftaar, 2019; Kim et al., 2021).

The preference for the majority class by classification models can be mitigated by modifying the training data to bring the minority class size in line with the majority class size or by modifying the classifier’s learning or decision-making process to account for class disparity and thus increase the sensitivity of decisions towards the minority class.

We may divide the techniques used to deal with data imbalance into two categories. In the first approach (also called algorithm-level solutions), the problem of data imbalance is considered at the stage of learning the classifier, e.g., by considering the different cost of error between different classes in the learning criterion. In contrast, the second approach (data-level solutions) does not interfere with the learning of the classifier itself but modifies the learning set before the learning process starts to compensate for the differences in the number of examples from each class.

In this paper, we will mainly focus on the second approach. Hence let us characterize the most important techniques associated with them. In general, these methods try to remove objects from the majority class (undersampling) or increase the size of the minority class (oversampling). In this regard, it should be noted that these actions may be purely random, or the process of removing objects of the majority class or adding instances of the minority class may result from an analysis of the distributions of the different fractions of objects.

Randomized preprocessing methods are easy to implement and have low computational complexity. However, it is essential to realize that in some cases, they may have an adverse effect on the dataset. Random Undersampling (RUS) can lead to the rejection of instances that are important for creating the correct class boundary or that lie in specific subregions of the target class. On the other hand, random oversampling can lead to the reallocation of noisy instances and thus inappropriately affect the actual class distribution. The most straightforward sampling-based approach to data imbalance is Random Oversampling (ROS). Using ROS, new objects are generated by replicating randomly selected existing objects. The disadvantage of this approach is that it leads to the clustering of minority objects in small areas where the original instances were located. This can be a problem for some classifiers, especially those that tend to overfit. An interesting approach is the oversampling method proposed by Lee (Lee, 2000), which produces noisy replicas of minority class objects while keeping the majority class unchanged.

Although randomized methods often have good results, it should be noted that many authors try to develop strategies that add synthetic minority examples or remove majority instances in a guided manner.

For nearly two decades, the most popular method for creating new instances of a minority class has been the SMOTE algorithm (Fernández et al., 2018). It involves randomly generating objects between minority class instances. Although SMOTE works well in many practical applications, it has been noted that it can lead to a change in the distribution of the minority class, resulting in overfitting of the classifier. Therefore, several modifications have been proposed, mostly aimed at ensuring that minority class objects are not generated in areas potentially belonging to the majority class. For instance, Borderline-SMOTE (Han et al., 2005) generates synthetic minority class objects near the decision boundary. In contrast, Safe-Level-SMOTE (Bunkhumpornpat et al., 2009) and LN-SMOTE (Maciejewski & Stefanowski, 2011) avoid generating minority class instances in areas where objects belonging to the majority class dominate. Other popular methods include ADASYN (He et al., 2008), which generates synthetic objects by taking into account which areas are difficult to classify and therefore increases the number of minority class instances generated in those areas. Also, Sáez et al. (2016) proposed that the fraction of objects subject to oversampling for each particular problem should be chosen according to their difficulty type, while the rest should be left unmodified.
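The core interpolation step of SMOTE mentioned above can be illustrated as follows. This is a simplified sketch rather than a full SMOTE implementation: the function name `smote_sample` is hypothetical, and the minority class neighbors of the seed point are assumed to have been found beforehand.

```python
import random


def smote_sample(seed_point, minority_neighbors, rng):
    """Create one synthetic point on the segment between a seed and a random neighbor."""
    neighbor = rng.choice(minority_neighbors)
    gap = rng.random()  # interpolation coefficient drawn from [0, 1)
    return [s + gap * (n - s) for s, n in zip(seed_point, neighbor)]


rng = random.Random(0)
synthetic = smote_sample([0.0, 0.0], [[2.0, 4.0], [1.0, 1.0]], rng)
```

Because each synthetic point lies on a segment between two existing minority objects, the generated objects stay within the region spanned by the original minority class, which is the source of the distribution shift discussed above.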

It is also worth noting RBO (Radial-Based Oversampling) (Koziarski et al., 2017), which estimates the mutual distribution of minority and majority class objects using potential functions to select the area of minority class object generation. Koziarski also proposed the Potential Anchoring algorithm (Koziarski, 2021a), which also uses potential functions to ensure invariant probability distributions across classes during the resampling process. On the other hand, CCR (Combined Cleaning and Resampling) (Koziarski & Woźniak, 2017) combines two techniques: cleaning the area around minority class objects by removing majority class objects from their neighborhood, and generating synthetic minority class objects in that area.

As mentioned, RUS carries the risk of removing important objects from the majority class, which may lead to constructing a classifier that ignores less dense clusters of the majority class. Several methods have been proposed to avoid this tendency. They try to analyze the local mutual distributions of the minority and majority classes during undersampling. For example, ENN (Edited Nearest Neighbor) removes majority examples if it finds the homogeneity of a given sample neighborhood. RBU (Radial-Based Undersampling) (Koziarski, 2020) employs a previously introduced concept of mutual class potential. SMUTE (Synthetic Minority Undersampling Technique) utilizes the data interpolation used in SMOTE to reduce the number of observations from the majority class (Koziarski, 2021b).

Many methods try to combine both techniques, i.e., over- and undersampling. For example, CSMOUTE (Combined Synthetic Oversampling and Undersampling Technique) (Koziarski, 2021b) combines the mentioned SMUTE method with SMOTE. On the other hand, Galar et al. proposed a technique for combining under- and oversampling with the classifier ensemble (Galar et al., 2012). Other authors combine data preprocessing techniques with algorithm-level solutions, mainly based on techniques having their roots in classifier ensembles built on perturbed learning sets. An overview of these methods can be found in Fernández et al. (2018).

Some of the works try to treat the guided data preprocessing as an optimization task. Kim et al. (2016) proposed a hybrid method using a clustering technique and genetic algorithms (GA) based on the artificial neural networks model to balance imbalanced data distribution. Barandela et al. (2005) employed a genetic algorithm to balance data distribution and perform feature selection simultaneously. On the other hand, Khoshgoftaar et al. (2010) proposed using an evolutionary algorithm for the undersampling task. García and Herrera also used an evolutionary algorithm to propose a family of undersampling techniques called EUS (Evolutionary Undersampling) (García & Herrera, 2009). Later this concept was extended by Wojciechowski, who applied the multicriteria optimization algorithm NSGA-2 (Deb et al., 2002) to the undersampling task (Wojciechowski, 2021). A multicriteria optimization approach was additionally considered by Węgier et al. in the tasks of building ensembles (Węgier et al., 2022) and providing interpretable decision trees (Węgier et al., 2023). Hualong et al. (2013) used ant colony optimization to improve imbalanced DNA microarray data classification performance. Metaheuristic algorithms are also used successfully for oversampling. Examples include GenSample (Karia et al., 2019), based on a genetic algorithm, and ACO Resampling (Li et al., 2020), which uses Ant Colony Optimization.

When using data balancing methods, one must also ask what degree of balancing one wants to achieve. Most works try to reach an equal number of minority and majority class instances. However, for some classifiers, such as decision trees, it has been shown that this approach does not give the best results (Weiss & Provost, 2003). Moreover, Khoshgoftaar et al. have shown experimentally that when the imbalance ratio is large, the balancing process should be stopped for IR values between 2:1 and 3:1 (Khoshgoftaar et al., 2007). Although this characteristic is well known, most authors seem to ignore it in their studies.

A more comprehensive review of work on data balancing methods, as well as of the so-called algorithm-level solutions not discussed here, can be found in the review papers (Branco et al., 2016; Krawczyk, 2016).

3 Local neighborhood encodings

Our approach is inspired by the categorization of observation types proposed by Napierała and Stefanowski (Napierala & Stefanowski, 2016), and a further study by Sáez et al. (2016), in which selective oversampling of different observation types was analyzed. In a nutshell, the proposed categorization was based on the five-nearest-neighbor taxonomy presented in Sect. 1. The aforementioned study by Sáez et al. later used this categorization in an approach in which only the observations of specific types are used for oversampling. They used the same division into safe, borderline, rare and outlier observations, and exhaustively evaluated all 16 ($2^4$) combinations of different minority observation types that can be used. They were able to experimentally demonstrate that limiting oversampling to specific observation types can improve the performance of the imbalanced data classification algorithms.

We extend the idea of selective resampling. First, we take advantage of the fact that some previous studies have shown that a combination of oversampling and undersampling can improve performance compared to an approach using only one (Koziarski, 2021a, 2021b). For this reason, we will combine these techniques and consider the intensity of each process as the parameter being optimized. Second, the optimization process will also determine the oversampling intensity of each type of minority class observation. Moreover, based on the observations pointed out, among others, by Khoshgoftaar et al. (2007), the imbalance ratio of the final set will also be treated as a parameter to be determined in our algorithm.

To implement the above ideas we designed an approach based on the evolutionary optimization of a Local Neighborhood Encoding, a real-valued vector of numbers encoding the number of observations from specific types that will be either created via oversampling or removed via undersampling. We present the high-level overview of the approach in Fig. 3, and a detailed pseudo code in Algorithm 1.

Let us first describe how we will encode the strength of resampling for each observation type. It is worth noting that there is a major practical distinction between the over- and undersampling, as the former is unbounded. In principle, we can generate synthetic observations ad infinitum. On the other hand, in the case of undersampling, there is a clear bound equal to the number of the original observations. Because of that, the two of them will have to be encoded differently. Specifically, the approach we propose is based on coding all of the information required to produce the resampling counts (that is the numbers of observations from specific types that will be either over- or undersampled) as a $2k + 1$ element vector, with k being the parameter describing the size of neighborhood used for type calculation.

$$\begin{aligned} \left[ r,o_1,...,o_k,u_1,...,u_k \right] \end{aligned}$$

The first element r of this vector encodes the strength of oversampling, i.e., the total number of observations generated via oversampling will be equal to $n_{over} = r \cdot r_{max} \cdot \left( |\mathcal {X}_{maj} |- |\mathcal {X}_{min} |\right)$, with $r_{max}$ being a hyperparameter bounding the number of oversampled observations (which in practice we will set to be fairly high, i.e., equal to 5, to allow for an oversampling strength search within this bound), and $\mathcal {X}_{maj}$ and $\mathcal {X}_{min}$ being the collections of majority and minority observations, respectively. The next k elements $o_1, o_2,..., o_k$ encode the relative strength of oversampling from particular observation types, with the resulting number of observations generated based on a specific observation type m defined as $c^o_m = n_{over} \cdot \frac{o_m}{\sum _{l = 1}^{k}{o_l}}$. Finally, the last k elements $u_1, u_2,..., u_k$ encode, for each type m, the proportion of majority class observations of that type that will be removed during undersampling. Note that with such an approach, all elements of the encoding vector can be bounded within the range [0; 1].
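To make the decoding concrete, the sketch below turns an encoding vector into per-type resampling counts using the two formulas above. It is an illustration under stated assumptions, not the authors' code: the function name `decode_encoding` and the input `maj_type_counts` (the number of majority observations of each type) are hypothetical, rounding to the nearest integer is an assumption, and so is the choice to skip oversampling when all weights $o_m$ are zero.

```python
def decode_encoding(encoding, k, n_maj, n_min, maj_type_counts, r_max=5.0):
    """Translate [r, o_1..o_k, u_1..u_k] into per-type resampling counts."""
    if len(encoding) != 2 * k + 1:
        raise ValueError("encoding must have 2k + 1 elements")
    r = encoding[0]
    o = encoding[1:k + 1]
    u = encoding[k + 1:]

    # Total number of synthetic minority observations: r * r_max * (|X_maj| - |X_min|).
    n_over = r * r_max * (n_maj - n_min)

    # Split n_over across observation types proportionally to the weights o_m.
    o_sum = sum(o)
    if o_sum > 0:
        over_counts = [round(n_over * o_m / o_sum) for o_m in o]
    else:
        over_counts = [0] * k  # assumption: no oversampling if all weights are zero

    # u_m is the fraction of majority observations of type m to remove.
    under_counts = [round(u_m * c) for u_m, c in zip(u, maj_type_counts)]
    return over_counts, under_counts


over, under = decode_encoding(
    [0.2, 0.5, 0.5, 1.0, 0.0, 0.0, 0.0, 0.5, 0.25, 0.0, 1.0],
    k=5, n_maj=100, n_min=20, maj_type_counts=[4, 10, 16, 30, 40],
)
```

In this toy example, $r = 0.2$ and $r_{max} = 5$ give $n_{over} = 0.2 \cdot 5 \cdot (100 - 20) = 80$, so the dataset would end up exactly balanced before undersampling; values of r above 0.2 would overbalance it.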

Given such an encoding scheme, we can formulate the resampling process as an optimization procedure of the encoding for the given dataset. Proposed solutions will be evaluated on the cross-validation folds, with the encoding used to obtain observation type-dependent resampling counts for a specific training fold. Based on the resulting counts, resampling will be performed on the training fold, an estimator fitted on it, and the performance evaluated on the test fold. The final performance score will be the average of scores across folds. Specifically, we will evaluate the quality of a given encoding using $3\times 2$ cross-validation (Raschka, 2018), and any desired target metric as an optimization criterion. While, in principle, various oversampling, undersampling, and optimization algorithms can be used, in this paper we will employ SMOTE (Chawla et al., 2002) for oversampling (with random observations from a currently considered observation type used as the starting points), random undersampling as the undersampling algorithm, and Differential Evolution (Price et al., 2006) as the optimization algorithm. Once the optimization procedure is finished, we will use the resulting encoding to resample the whole training dataset.
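The evaluation of a single candidate encoding described above can be sketched as follows. This is a simplified, dependency-free illustration: the fold generator is a basic stratified split rather than a library routine, and `resample`, `fit_predict` and `metric` are hypothetical caller-supplied callables standing in for the resampling step, the estimator, and the target metric. An optimizer such as Differential Evolution would then search for the encoding maximizing the returned score.

```python
import random


def stratified_two_fold(y, rng):
    """Split indices into two folds, keeping class proportions roughly equal."""
    folds = ([], [])
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        half = len(idx) // 2
        folds[0].extend(idx[:half])
        folds[1].extend(idx[half:])
    return folds


def evaluate_encoding(encoding, X, y, resample, fit_predict, metric,
                      n_repeats=3, seed=0):
    """Average the metric over n_repeats x 2 stratified cross-validation folds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        fold_a, fold_b = stratified_two_fold(y, rng)
        for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
            # Resampling is applied to the training fold only.
            X_tr, y_tr = resample([X[i] for i in train],
                                  [y[i] for i in train], encoding)
            y_pred = fit_predict(X_tr, y_tr, [X[i] for i in test])
            scores.append(metric([y[i] for i in test], y_pred))
    return sum(scores) / len(scores)
```

Keeping the test fold untouched by resampling matters here: evaluating on resampled data would reward encodings that merely make the test fold easier.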

Finally, it is worth noting that one advantage of the algorithm being formulated in such a way is that the encodings are, to some extent, interpretable: they describe not only the overall degree of balancing (with the dataset being either balanced completely, only partially, or overbalanced, with the old minority class becoming the new majority), but also the proclivity towards either over- or undersampling, and towards focusing the resampling on specific types of observations. In particular, the last one can be viewed as desirable since the idea of focusing the resampling on specific observation types is present in a number of existing approaches, such as Borderline-SMOTE (Han et al., 2005) and Safe-Level-SMOTE (Bunkhumpornpat et al., 2009). However, while these approaches are out of necessity contradictory, there is little justification to prefer any of them a priori, outside the context of a specific dataset. An analysis of the encodings produced during the optimization for a larger body of datasets could shed some light on the trends associated with focusing the resampling on specific observation types: this idea will be later revisited in Sect. 4.5.

4 Experimental study

To evaluate the proposed approach’s usefulness and properties, we conducted an experimental study. It aims to answer the following research questions:

RQ1:
How does LNE compare with state-of-the-art resampling strategies?
RQ2:
What design choices are responsible for the performance of LNE?
RQ3:
Can LNE be sped up using less computationally expensive proxy estimators?
RQ4:
Can the solutions found by LNE provide any generalizable insights into imbalanced data resampling?

4.1 Set-up

Data. The experimental study was based on the binary imbalanced datasets provided in the KEEL repository (Alcalá-Fdez et al., 2011), with a total of 60 datasets used. Their details are presented in Table 1. In addition to the imbalance ratio (IR), the number of samples and the number of features, for each dataset we computed the data difficulty index (DI) (Koziarski, 2021a) using $m = 5$ nearest neighbors, which is a [0; 1]-bounded function measuring the difficulty of a given dataset. Prior to resampling and classification, each dataset was preprocessed: categorical features were encoded as integers, and afterwards all features were standardized by removing the mean and scaling to unit variance.

Table 1 Summary of the characteristics of datasets used throughout the experimental study

Classification. Four different classification algorithms were used throughout the experimental study: CART decision tree, k-nearest neighbors classifier (KNN), support vector machine (SVM) and multi-layer perceptron (MLP). The implementations of the classification algorithms provided in the scikit-learn machine learning library (Pedregosa et al., 2011) were utilized. The hyperparameters of the classification algorithms are presented in Table 2.

Table 2 Parameters of the classification and the sampling algorithms used throughout the experimental study

Reference resampling methods. We considered several other state-of-the-art resampling strategies. We based our choice on a recent ranking constructed by Kovács (2019), out of which we selected the following best-performing methods: SMOTE (Chawla et al., 2002), Polynomial Fitting SMOTE (pf-SMOTE) (Gazzah et al., 2008), Oversampling with Rejection (Lee) (Lee et al., 2015), Synthetic Minority Oversampling Based on Sample Density (SMOBD) (Cao et al., 2011), Partially Guided Oversampling (G-SMOTE) (Sandhan & Choi, 2014), Learning Vector Quantization-based SMOTE (LVQ-SMOTE) (Nakamura et al., 2013), Assembled SMOTE (A-SMOTE) (Zhou et al., 2013) and SMOTE combined with Tomek Links (SMOTE-TL) (Batista et al., 2004). The implementations of the reference methods provided in the smote-variants library (Kovács, 2019) were utilized. The hyperparameters of the resampling algorithms are presented in Table 2. For all of the reference methods we adjusted the proportion of oversampling using $3\times 2$ cross-validation, selecting values of oversampling proportion from $\{0.1, 0.2, 0.5, 1.0, 2.0, 5.0\}$, with 1.0 indicating resampling up to the point of achieving balanced class distributions.

Evaluation. For every dataset, we reported the results averaged over the $5\times 2$ cross-validation folds (Alpaydin, 1999).

Let us define the metrics used. First, we define the confusion matrix, which summarizes the number of instances from each class that were classified correctly or incorrectly (see Table 3).

Table 3 Confusion matrix for two-class classification task

On the basis of the confusion matrix, we may define

$$\begin{aligned} recall=\frac{ TP}{TP+FN} \end{aligned}$$

(1)

that is also known as sensitivity.

$$\begin{aligned} precision= & {} \frac{TP}{TP+FP} \end{aligned}$$

(2)

$$\begin{aligned} specificity= & {} \frac{TN}{TN+FP} \end{aligned}$$

(3)

Throughout the experimental study we reported the values of AUC, balanced accuracy (BAC), G-mean, and $F_\beta$ score (F-beta):

$$\begin{aligned} G\text {-}mean= & {} \sqrt{precision \times recall} \end{aligned}$$

(4)

$$\begin{aligned} BAC= & {} \frac{sensitivity+specificity}{2} \end{aligned}$$

(5)

$$\begin{aligned} F_{\beta }= & {} \frac{(\beta ^2+1) \times {Precision} \times {Recall} }{\beta ^2 \times {Precision} + {Recall} } \end{aligned}$$

(6)

The parameter $\beta$ can be tuned for different trade-offs between both components. Nevertheless, using such a metric can be dangerous, because $\beta$ has to be set appropriately. Brzezinski et al. (2018) showed that inappropriate parameter setting for the $F_{\beta }$ score may favor the majority class in the imbalanced data classification task. During the experiments, $\beta$ was chosen individually for each dataset and set equal to its imbalance ratio (Stapor et al., 2021). In our experiments we also used the area under the precision-recall curve (AUC), which is computed on predicted probabilities.
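As a worked example, the threshold-based metrics defined above can be computed directly from the four confusion matrix entries. This is a minimal sketch with a hypothetical function name; it takes G-mean as the geometric mean of precision and recall, i.e., the square root of their product, and it does not guard against empty classes (zero denominators).

```python
import math


def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute recall, precision, specificity, G-mean, BAC and F-beta."""
    recall = tp / (tp + fn)          # also known as sensitivity
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "g_mean": math.sqrt(precision * recall),
        "bac": (recall + specificity) / 2,
        "f_beta": (beta**2 + 1) * precision * recall
                  / (beta**2 * precision + recall),
    }


# Toy confusion matrix: 10 minority and 90 majority test observations.
metrics = imbalance_metrics(tp=8, fn=2, fp=4, tn=86, beta=2.0)
```

Following the protocol described above, `beta` would be set to the imbalance ratio of the given dataset, which weights recall more heavily as the imbalance grows.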

Implementation and reproducibility. The experiments described in this paper were implemented in the Python programming language. Complete code, sufficient to repeat the experiments, as well as complete results in a CSV format, and tables showing average performance on each dataset, separately for every classifier and performance metric, were made publicly available (see Footnote 1). We used the scikit-learn (Pedregosa et al., 2011) library to implement the experimental protocol, performance metrics, and classifiers.

4.2 Comparison with reference methods

We began the experimental analysis by comparing the proposed approach to several state-of-the-art reference resampling methods. Average ranks, together with the results of a statistical comparison using the Friedman test combined with Shaffer's post-hoc analysis, reported at a significance level $\alpha = 0.05$, are presented in Table 4. As can be seen, when compared with respect to the BAC, G-mean, and F-beta metrics, the proposed LNE approach outperformed the reference methods in every case, usually achieving statistically significantly better results, demonstrating the general usefulness of the proposed approach. However, it is worth mentioning that we observed no statistically significant differences when the comparison was made using AUC (under the precision-recall curve). It is not entirely clear what caused this behavior. One possible explanation is that the results computed with a metric using probability scores instead of binarized predictions are generally more stable and less susceptible to data imbalance. Another possible explanation is that probability-based metrics are more difficult to optimize and/or more susceptible to overfitting, reducing the final performance of the model.
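For readers unfamiliar with rank-based comparisons, the average ranks underlying such a table can be computed as sketched below. The function name and the scores are made up purely for illustration and are not results from this study; the statistical tests themselves are not shown.

```python
def average_ranks(scores):
    """Mean rank of each method over datasets (rank 1 = best score).

    scores[d][m] is the score of method m on dataset d, higher being better.
    Tied methods share the average of the ranks they occupy.
    """
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda m: -row[m])
        pos = 0
        while pos < n_methods:
            end = pos
            # Extend the group while the next method has the same score.
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # average of 1-based ranks pos+1..end+1
            for m in order[pos:end + 1]:
                totals[m] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


# Three hypothetical methods evaluated on three hypothetical datasets.
ranks = average_ranks([[0.9, 0.8, 0.7], [0.6, 0.6, 0.5], [0.7, 0.9, 0.8]])
```

A lower average rank indicates a method that tends to score better across datasets, regardless of the absolute scale of the metric on each dataset.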

Table 4 Average ranks of the evaluated methods calculated for all of the considered datasets

To better illustrate the performance of the proposed approach we also conducted pairwise comparisons on individual datasets, comparing the results on all 10 cross-validation folds using the Wilcoxon rank-sum test at a significance level $\alpha = 0.10$. The number of datasets for which LNE achieved statistically significantly better or worse results is presented in Fig. 4. As can be seen, there is a clear discrepancy depending on the performance metric chosen: F-beta (which should penalize predictions biased towards the majority class most heavily) led to the most wins, BAC and G-mean to a medium number, and AUC to the fewest, with the number of losses low in every case.

4.3 Ablation study

Having established that the proposed LNE approach can outperform the reference methods, we next tried to examine what specific design choices led to this outperformance. We performed an ablation study to compare LNE with its two different, simplified variants:

LNE$_{NRS}$, that is LNE with no ratio selection, for which the balancing was always performed up to achieving a perfectly balanced distribution (or, in other words, for which the optimization of the balancing ratio was disabled).
LNE$_{NTS}$, that is LNE with no type selection: a variant in which the resampling was based on all available observation types, and only the balancing ratio was optimized.

Notably, LNE$_{NRS}$ differed only in assigning the resampling counts and still used $2k + 1$ values as the encoding, whereas the LNE$_{NTS}$ variant was encoded using only two values: the oversampling and undersampling ratios.

Note that we excluded the MLP classifier from this comparison because of its long training time.

The average ranks obtained when comparing the ablation variants with the baseline LNE were as follows: LNE: 1.83, LNE$_{NRS}$: 2.20, and LNE$_{NTS}$: 1.96. To determine whether the rank differences are statistically significant, the Friedman test with Shaffer's post-hoc analysis was performed, and the following results were obtained: LNE vs. LNE$_{NRS}$ (p-value < 0.001), LNE vs. LNE$_{NTS}$ (p-value = 0.015). As can be seen, the baseline LNE using both mechanisms (type and ratio selection) achieved statistically significantly better results than the ablation variants. The discrepancy was largest in comparison to the ablation variant with no ratio selection, indicating that ratio selection was the major contributor to the outperformance of LNE over the reference algorithms. In contrast, type selection was a minor contributor: we suspect this is because the variant with no type selection had only two optimizable parameters, thus being much less prone to overfitting. Nevertheless, type selection can potentially be used as a vehicle for meta-analysis of the algorithm's behavior, which will be discussed in Sect. 4.5. Finally, it is worth noting that we aggregated the results over all classifiers and performance metrics to better illustrate the general trends. However, these trends did not hold for every particular combination of classifier and performance metric.

4.4 Using proxy estimator during optimization

So far, we have established that LNE tends to outperform the reference methods. However, this comes at a high computational cost, especially for classifiers with a computationally expensive training procedure, such as MLPs. Note that this cost is incurred when evaluating each offspring in each iteration of the optimization algorithm. The question is whether it is possible to reduce this cost, e.g., by using a proxy classifier that does not require too many computational resources to train. Such an approach would use the proxy classifier only in the optimization stage, whereas the final model would be trained using the optimized parameters and the target classification model. In summary, the question is whether the results would remain competitive compared to the reference methods.

To test this hypothesis, we conducted an experiment using the CART decision tree as the proxy classifier (i.e., CART was used during the optimization procedure). We then used the obtained encoding to fit the final MLP model. We chose CART based on its training time, but in principle, any other classification algorithm could be used. The results are presented in Table 5. As can be seen, when compared to the original results in Table 4, the performance degraded for all of the performance metrics except F-beta: for BAC and G-mean, LNE still achieved statistically better results than some of the reference approaches, but their number was significantly lower. On the other hand, for AUC, using the proxy estimator actually produced statistically significantly worse results than some of the reference methods. No noticeable changes were observed for F-beta. Overall, this leads to the conclusion that, while the behavior depended on the choice of the performance metric, in most cases using a proxy estimator tends to decrease the performance of LNE, even though it reduces the training time.
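The two-stage workflow described above can be sketched as follows. This is a schematic illustration under stated assumptions, not the LNE implementation: a plain random search stands in for the evolutionary algorithm, the encoding is simplified to a short list of values in [0, 1], and `resample`, `fit_proxy`, `fit_target`, and `score` are hypothetical callables supplied by the user (e.g., a CART trainer as `fit_proxy` and an MLP trainer as `fit_target`).

```python
import random


def optimize_with_proxy(train, valid, resample, fit_proxy, fit_target, score,
                        n_candidates=20, encoding_length=2, seed=0):
    """Search for an encoding with a cheap proxy, then fit the target once."""
    rng = random.Random(seed)
    best_encoding, best_score = None, float("-inf")
    for _ in range(n_candidates):
        # Random search is a stand-in for the evolutionary optimizer.
        encoding = [rng.random() for _ in range(encoding_length)]
        # Only the cheap proxy model is trained inside the search loop.
        proxy_model = fit_proxy(resample(train, encoding))
        candidate_score = score(proxy_model, valid)
        if candidate_score > best_score:
            best_encoding, best_score = encoding, candidate_score
    # The expensive target model is trained a single time, using the
    # encoding selected with the proxy.
    final_model = fit_target(resample(train, best_encoding))
    return best_encoding, final_model
```

The point of the design is that the cost of the target classifier is paid once rather than once per evaluated candidate; as reported above, the encoding selected with the proxy is not guaranteed to suit the target model equally well.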

Table 5 Average ranks of the evaluated methods calculated for all of the considered datasets in the case of the proxy estimator, with CART used in the optimization process and MLP used as a final classifier


4.5 Analysis of the obtained encodings

Finally, having established the general usability of the proposed approach, we can proceed with the question of whether the solutions found by LNE provide any generalizable insights into imbalanced data resampling. We performed a meta-analysis of 9600 solution vectors obtained during the conducted experiments (for 60 different datasets, 10 cross-validation folds per dataset, four classification algorithms, and four optimization criteria).

We began with the analysis of global resampling properties, that is, trends regarding the level of balancing obtained during the resampling process and the preference towards resampling via either over- or undersampling. We present the distributions of IR values on datasets after resampling in Fig. 5, with IR = 1 corresponding to the case in which the resulting resampled dataset was perfectly balanced and IR < 1 corresponding to the case in which the old minority class became the new majority class. Analogously, we show the distributions of the adjusted oversampling-to-undersampling (O/U) ratios in Fig. 6, defined as O/U ratio = $\frac{\# oversampled + 1}{\# undersampled + 1}$ (with the denominator incremented by 1 to avoid division by zero, and the numerator incremented by 1 to preserve the property of the O/U ratio being equal to 1 when the numbers of oversampled and undersampled observations are the same).
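The two quantities can be computed as in the short sketch below. The adjusted O/U ratio follows the definition given in the text; the IR after resampling assumes the usual definition of IR as the number of majority observations divided by the number of minority observations, with the classes labeled as they were before resampling. The function names are hypothetical.

```python
def adjusted_ou_ratio(n_oversampled, n_undersampled):
    """Adjusted O/U ratio: (#oversampled + 1) / (#undersampled + 1)."""
    return (n_oversampled + 1) / (n_undersampled + 1)


def ir_after_resampling(n_majority, n_minority, n_oversampled, n_undersampled):
    """IR after resampling, assuming IR = #majority / #minority.

    Oversampling adds minority observations and undersampling removes
    majority observations; values below 1 mean that the original
    minority class has become the larger one.
    """
    return (n_majority - n_undersampled) / (n_minority + n_oversampled)


print(adjusted_ou_ratio(0, 0))                  # no resampling at all
print(ir_after_resampling(900, 100, 300, 500))  # 400 majority vs. 400 minority
```

With equal counts of created and discarded observations the adjusted ratio equals 1, including the degenerate case in which no resampling is performed.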

Several observations can be made based on the presented results. Firstly, the choice of the optimization criterion had a visible influence on the overall trends, more so than the choice of the classification algorithm. In the case of IR after resampling, for BAC and G-mean, the output datasets after resampling were, on average, roughly balanced; for F-beta overbalancing occurred, meaning that the old minority class became the new majority; and the opposite was the case for AUC, for which the resampling was least intense. A similar observation can be made for the adjusted O/U ratio. For all of the classification algorithm and performance metric combinations, the average O/U ratio was greater than one, indicating that the preferred mode of operation was complementing strong oversampling with weak undersampling; at the same time, the oversampling-to-undersampling ratio was the highest for AUC and the lowest for F-beta. However, it should be noted that we observed a high variance for both IR after resampling and the O/U ratio, indicating that the optimal choice is highly dataset-specific. Still, on average, the results suggest that supplementing oversampling with weaker undersampling can be beneficial, although the exact resampling strength and the oversampling-to-undersampling ratio should be selected individually for a given dataset.

It is also worth considering whether LNE’s preference for a particular target IR is due to the characteristics of the measures used. In Brzeziński et al. (2020), the distributions of selected metrics are presented with respect to IR. Based on that analysis, it appears that G-mean and BAC retain their shape (i.e., the distribution of possible values) regardless of IR, which may be the reason that their use as an optimization criterion leads to reasonably balanced class distributions. In the case of F-beta, where the weight of recall was chosen proportionally to IR, one might suspect that, since this metric treats recall as $\beta$ times more important than precision, the optimization will naturally force the overrepresentation of minority class instances in the training set.

However, it should also be recalled that the results obtained are characterized by a large value of standard deviation, so it seems only fair to conclude that the target IR value should be chosen individually for each task, taking into account the preferred metric, as well as the classification model used.

The second question that we tried to address was whether some simple dataset characteristics could be used to predict the preferred resampling properties a priori without needing explicit evaluation on the target dataset. To this end we examined the relationship between the IR before and after the resampling (with Pearson correlation coefficients and p-values presented in Table 6, and scatterplots of the two variables in Fig. 7), as well as the number of observations and the IR after resampling (with correlation coefficients in Table 7, and scatterplots in Fig. 8).
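A minimal sketch of this kind of analysis is given below: the Pearson correlation coefficient is computed between the logarithms of two positive quantities (for instance, IR before and after resampling). It uses only the standard library and omits the p-value computation; the function names and the toy values are hypothetical and do not come from the reported experiments.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def log_log_correlation(before, after):
    """Correlation between log-transformed positive values."""
    return pearson([math.log(v) for v in before],
                   [math.log(v) for v in after])


# Toy values (invented): a higher original IR paired with a lower
# resulting IR gives a negative coefficient.
print(log_log_correlation([2.0, 5.0, 10.0, 40.0], [1.2, 1.1, 0.9, 0.8]))
```

The logarithmic transformation keeps a few very large IR values or dataset sizes from dominating the coefficient.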

Table 6 Pearson correlation coefficients and p-values between log IR before and after resampling


Table 7 Pearson correlation coefficients and p-values between log # of samples and log IR after resampling


As can be seen, for either BAC, G-mean or F-beta used as the optimization criteria, we observed weak-to-medium level correlations, statistically significant in every case; less clear trends were observed in the case of AUC. We can conclude that while some dataset characteristics can be used as a predictor of the resampling parameters, they are not correlated strongly enough to be used instead of a traditional parameter selection. Still, some trends are visible: for instance, both the dataset size and the original IR tend to be negatively correlated with the resulting IR, meaning that larger and/or more imbalanced datasets tend to be resampled with lower strength. Note that we considered additional input (such as the number of features) and output (such as the resulting O/U ratio) variables, but observed either relations weaker than in the case of IR before and after resampling, or none at all. We did not include them all in the paper for brevity, but they were provided together with the algorithm’s implementation.

Finally, in the last stage of the conducted analysis, we proceeded with the question of what observation types tend to be favored during the resampling. We began by evaluating the average proportion of observations belonging to different observation types across all datasets and cross-validation folds (with the proportion calculated on the training partition). The results are presented in Table 8. As can be seen, in the case of the minority class, all observation types were represented, on average, in roughly similar proportions (with the highest proportion assigned to observations with no same-class neighbors, i.e., the more difficult examples). This was not the case for the majority class, for which instances with all neighbors belonging to the same class heavily dominated. We expanded on this by calculating the percentage of datasets in which at least a single observation of a given type was present, with the results presented in Table 9. As can be seen, some observation types were sparsely represented in the considered datasets. The main takeaway is that this sparsity might affect the results of the further analysis and needs to be taken into account.
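The notion of an observation type used here can be illustrated with a brute-force sketch that counts, for each observation, how many of its nearest neighbors share its class label, and then reports the proportion of each type within one class. This is not the authors' implementation: the Euclidean distance, the default of four neighbors, and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def same_class_neighbor_counts(X, y, k=4):
    """For each observation, count same-class points among its k nearest
    neighbors (Euclidean distance, the observation itself excluded)."""
    counts = []
    for i, xi in enumerate(X):
        distances = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbors = [j for _, j in distances[:k]]
        counts.append(sum(1 for j in neighbors if y[j] == y[i]))
    return counts


def type_proportions(X, y, label, k=4):
    """Proportion of observations of class `label` having 0, 1, ..., k
    same-class neighbors."""
    counts = same_class_neighbor_counts(X, y, k)
    of_class = [c for c, yi in zip(counts, y) if yi == label]
    if not of_class:
        raise ValueError("no observations with the requested label")
    histogram = Counter(of_class)
    return [histogram[t] / len(of_class) for t in range(k + 1)]


# Toy one-dimensional data: four majority points and one isolated
# minority point, analysed with k = 2.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
y = [0, 0, 0, 0, 1]
print(type_proportions(X, y, label=1, k=2))
print(type_proportions(X, y, label=0, k=2))
```

The quadratic cost of the brute-force search is acceptable only for small datasets; a spatial index would be the natural replacement for larger ones.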

Table 8 Mean and standard deviations of the proportion of observations with k nearest neighbors belonging to the same class


Table 9 Percentage of datasets with at least a single observation for which k nearest neighbors belong to the same class


Next, we calculated the average proportion of observations of different types created or discarded due to either oversampling or undersampling via encodings generated by LNE. To calculate this average proportion, we only considered the datasets for which at least a single observation of the given type was present (so that the datasets without specific observation types do not bias the results). Additionally, as previously mentioned, some observation types were underrepresented, so we normalized the results for each dataset in one of two ways. First, by the number of observations from a given (minority or majority) class: in the case of oversampling this value was equal to $\frac{\# oversampled}{\# minority}$, whereas for undersampling it was $\frac{\# undersampled}{\# majority}$. This normalization was introduced to standardize the results across datasets with different numbers of minority and majority class observations. Second, by the number of observations of the specific type present in the dataset. This type of normalization took into account the fact that some observation types were underrepresented.
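The two normalizations can be sketched as follows, assuming that the number of observations created (or discarded) around each observation type is already known. The function names and the toy counts are hypothetical; types that are absent from a dataset are returned as `None` so that they can be skipped when averaging, mirroring the exclusion described above.

```python
def normalize_by_class(resampled_per_type, n_class):
    """Divide per-type resampling counts by the size of the whole class,
    e.g. #oversampled / #minority for oversampling."""
    return [count / n_class for count in resampled_per_type]


def normalize_by_type(resampled_per_type, present_per_type):
    """Divide per-type resampling counts by the number of original
    observations of that type; absent types yield None."""
    return [
        count / present if present > 0 else None
        for count, present in zip(resampled_per_type, present_per_type)
    ]


# Toy counts (invented): 20 minority observations split over three types.
created = [10, 5, 0]
present = [5, 0, 15]
print(normalize_by_class(created, n_class=20))
print(normalize_by_type(created, present))
```

The first normalization makes datasets of different sizes comparable, while the second reveals preferences that would otherwise be hidden by how frequent each type is in the original data.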

The results are presented in Table 10. First of all, it should be noted that the variance of the results was fairly high in every considered case, making the observed results reliable only to an extent. However, some trends can be observed: in the case of oversampling, when normalized by the number of minority observations, there was a monotonic trend, with the proportion of oversampling around the observations with no same-class neighbors being the highest. However, this is partially due to the fact that they were most represented in the original data: when normalized by the number of observations of individual types, the most preferred type was that with a single same-class neighbor in the 4-NN neighborhood, closely followed by that with two same-class neighbors, which roughly corresponds to the rare and borderline instances in the standard taxonomy used by various SMOTE variants. Analogously, when normalized by the number of majority instances, undersampling seems to be heavily focused on the observations with all same-class neighbors. However, when normalized by the number of observations of specific types, this trend reverses, and observations with all same-class neighbors become the least preferred for undersampling. The overall trend is that while the variance is very high, indicating high per-dataset variability, an equivalent of rare and borderline minority observations tends to be favored for oversampling and safe majority observations for undersampling.

Table 10 Mean and standard deviations of the proportion of observations of different types (with k denoting the number of nearest neighbors from the same class) created or discarded due to either oversampling (O) or undersampling (U)


We also tried to answer the question of whether the composition of the original dataset, that is, the number of observations of a given type present in the data before resampling, affects the preference towards particular types during oversampling. We focused specifically on oversampling because of the greater diversity of minority observation types in the data. We computed the average proportions as before, normalized by the number of observations of individual types, but separately for each observation type, using only the datasets in which that type was present in a large proportion ($\ge 0.3$) in the original data. The results are presented in Table 11. Lower values of K indicate datasets with a large proportion of less certain minority class observations, i.e., difficult datasets. While, once again, the variance of the results was high, we can observe a general trend of the more difficult datasets ($K \in \{0, 1\}$) producing encodings focusing resampling on observations other than outliers ($k \ne 0$), and less difficult datasets ($K \in \{2, 3, 4\}$) focusing on outliers ($k = 0$). This seems to indicate that with high baseline certainty resampling tends to focus on unsafe observations (with the hypothesis being that certain regions are represented well enough, and the borderline regions can be the focus of boosting), and with low baseline certainty on safe observations (since the high confidence regions of predictions have yet to be established).
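The dataset grouping used in this analysis can be sketched as a simple filter over per-dataset type proportions. The threshold of 0.3 comes from the text; the function name, the dictionary layout, and the toy proportions are assumptions made for illustration.

```python
def datasets_with_dominant_type(proportions, K, threshold=0.3):
    """Names of datasets in which minority observations of type K make up
    at least `threshold` of the minority class.

    proportions: dict mapping a dataset name to a list of proportions,
    one entry per observation type 0..k.
    """
    return [
        name for name, per_type in proportions.items()
        if per_type[K] >= threshold
    ]


# Toy proportions (invented) for three datasets and types 0..4.
toy = {
    "a": [0.5, 0.2, 0.1, 0.1, 0.1],
    "b": [0.1, 0.3, 0.2, 0.2, 0.2],
    "c": [0.2, 0.2, 0.2, 0.2, 0.2],
}
print(datasets_with_dominant_type(toy, K=0))
print(datasets_with_dominant_type(toy, K=1))
```

Under this rule a dataset may belong to several groups at once, or to none, because more than one type (or no type) can reach the threshold.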

Table 11 Mean and standard deviations of the proportion of observations of different types (with k denoting the number of nearest neighbors from the same class) created due to oversampling, normalized by the number of observations from individual types, computed only on the datasets for which the proportion of minority observations from type $K \ge 0.3$


Finally, having introduced the categorization into datasets consisting of a large proportion of original observations of a given type, we also examined whether there are any visible differences between the different types of datasets concerning the global properties (O/U ratio and IR after resampling). This comparison is presented in Table 12. Similar to the results of the previous analysis, the split between more ($K \in \{0, 1\}$) and less ($K \in \{2, 3, 4\}$) difficult datasets was visible here as well: specifically, more difficult datasets tended to favor using more undersampling than oversampling, and to resample to a lesser degree than the less difficult datasets.

Table 12 Mean and standard deviations of the adjusted O/U ratio and IR after resampling, computed only on the datasets for which the proportion of minority observations from type $K \ge 0.3$


5 Conclusions

This paper proposed Local Neighborhood Encoding, a novel technique for resampling imbalanced data, combining oversampling and undersampling in an evolutionary algorithm-based procedure that optimizes the proportion of resampling performed around different types of observations. The conducted experimental study showed that LNE significantly outperforms standard resampling algorithms. In addition, the conducted ablation study showed that dynamic selection of resampling strength is the main factor in good LNE performance. The experiments with proxy estimators, a strategy that involves using a less computationally intensive classifier in the encoding optimization process, demonstrated that in some cases, especially when using performance metrics such as F-beta, it is possible to preserve the original performance of LNE while reducing training time.

Finally, utilizing the interpretability of the encodings, we conducted a meta-analysis of the solutions produced by LNE on a large set of benchmark datasets. While there was a significant variance in the obtained results, suggesting that dataset-specific tuning is still required, some common trends have been observed:

A combination of oversampling and undersampling was the preferred strategy, with strong oversampling combined with reasonably weak undersampling.
The optimal strength of resampling was strongly dependent on the chosen performance metric: BAC and G-mean preferred approximately balanced distributions, while metrics such as F-beta favored overbalancing.
The general characteristics of the datasets before resampling can be, to some extent, used to predict the properties of the resampling (such as the oversampling-to-undersampling ratio or resampling strength); however, specific tuning of these parameters is still required to achieve the optimal performance.
Produced solutions, on average, tended to prioritize rare and borderline observations during oversampling and unsafe examples during undersampling. Similar observations were made in the work mentioned earlier on selective oversampling, where the most common object fractions for oversampling were borderline and rare (Sáez et al., 2016).
However, when taking into account the original distribution of the observation types, for more difficult datasets there was a tendency to produce encodings focusing oversampling on observations other than outliers, and for less difficult datasets on outliers.

LNE is an efficient oversampling strategy that can be used when the dataset size is relatively small and/or computational resources are not limited. Its interpretability can also be used to gain insight into existing resampling strategies, as it allows the removal of errors introduced by methods such as Borderline-SMOTE and Safe-Level-SMOTE, the search for which becomes part of the optimization process. Possible future research directions include scaling the approach to larger dataset sizes and exploring the idea of using proxy estimators in more depth. The use of proxy classifiers in place of dataset difficulty scores should also be considered. However, this would require confirmation of the hypothesis that simplifying data distributions in preprocessing positively affects the quality of final classification models.

Data availability

All data used in this work is publicly available as reported in Sect. 4.1.

Code availability

All code used in this work is publicly available as reported in Sect. 4.1.

Notes

https://github.com/michalkoziarski/LocalNeighborhoodEncodings

References

Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., & Herrera, F. (2011). KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing, 17, 255–287.
Alpaydin, E. (1999). Combined 5 $\times$ 2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11(8), 1885–1892.
Barandela, R., Hernández, J. K., Sánchez, J. S., & Ferri, F. J. (2005). Imbalanced training set reduction and feature selection through genetic optimization. In CCIA (pp. 215–222).
Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20–29.
Branco, P., Torgo, L., & Ribeiro, R. P. (2016). A survey of predictive modeling on imbalanced domains. ACM Computing Surveys, 49(2), 1–50.
Brzezinski, D., Stefanowski, J., Susmaga, R., & Szczęch, I. (2018). Visual-based analysis of classification measures and their properties for class imbalanced problems. Information Sciences, 462, 242–261.
Brzeziński, D., Stefanowski, J., Susmaga, R., & Szczęch, I. (2020). On the dynamics of classification measures for imbalanced and streaming data. IEEE Transactions on Neural Networks and Learning Systems, 31(8), 2868–2878.
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia conference on knowledge discovery and data mining (pp. 475–482). Springer.
Cao, Q., Wang, S. Z. (2011). Applying over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning. In 2011 International conference on information management, innovation management and industrial engineering (vol. 2, pp. 543–548). IEEE.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Philip Kegelmeyer, W. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182–197.
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). Learning from imbalanced data sets. Springer.
Fernández, A., García, S., Herrera, F., & Chawla, N. V. (2018). SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61(1), 863–905.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463–484.
García, S., & Herrera, F. (2009). Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary Computation, 17(3), 275–306.
Gazzah, S., Amara, N.E.B. (2008). New oversampling approaches based on polynomial fitting for imbalanced data sets. In 2008 The 8th IAPR international workshop on document analysis systems (pp. 677–684). IEEE.
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing (pp. 878–887). Springer.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the international joint conference on neural networks, 2008, part of the IEEE world congress on computational intelligence, 2008, Hong Kong, China, June 1-6, 2008 (pp. 1322–1328).
Yu, H., Ni, J., & Zhao, J. (2013). ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data. Neurocomputing, 101, 309–318.
Johnson, J., & Khoshgoftaar, T. (2019). Survey on deep learning with class imbalance. Journal of Big Data, 6, 27.
Karia, V., Zhang, W., Naeim, A., & Ramezani, R. (2019). GenSample: A genetic algorithm for oversampling in imbalanced datasets.
Khoshgoftaar, T. M., Seiffert, C., Hulse, J. V., Napolitano, A., & Folleco, A. (2007). Learning with limited minority class data. In 6th International conference on machine learning and applications (ICMLA 2007) (pp. 348–353).
Khoshgoftaar, T. M., Seliya, N., & Drown, D. J. (2010). Evolutionary data analysis for the class imbalance problem. Intelligent Data Analysis, 14(1), 69–88.
Kim, H.-J., Jo, N.-O., & Shin, K.-S. (2016). Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction. Expert Systems with Applications, 59, 226–234.
Kim, Y., Lee, Y., & Jeon, M. (2021). Imbalanced image classification with complement cross entropy. Pattern Recognition Letters, 151, 33–40.
Kovács, G. (2019). An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets. Applied Soft Computing, 83, 105662.
Kovács, G. (2019). smote-variants: A Python implementation of 85 minority oversampling techniques. Neurocomputing, 366, 352–354.
Koziarski, M. (2021). CSMOUTE: Combined synthetic oversampling and undersampling technique for imbalanced data classification. In 2021 International joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.
Koziarski, M., Krawczyk, B., & Woźniak, M. (2017). Radial-based approach to imbalanced data oversampling. In International conference on hybrid artificial intelligence systems (pp. 318–327). Springer.
Koziarski, M. (2020). Radial-based undersampling for imbalanced data classification. Pattern Recognition, 102, 107262.
Koziarski, M. (2021). Potential Anchoring for imbalanced data classification. Pattern Recognition, 120, 108114.
Koziarski, M., & Woźniak, M. (2017). CCR: Combined cleaning and resampling algorithm for imbalanced data classification. International Journal of Applied Mathematics and Computer Science, 27(4), 727–736.
Krawczyk, B. (2016). Learning from imbalanced data: Open challenges and future directions. Progress in Artificial Intelligence, 5, 04.
Lee, J., Kim, N., Lee, J. -H. (2015). An over-sampling technique with rejection for imbalanced class learning. In Proceedings of the 9th international conference on ubiquitous information management and communication (pp. 1–6).
Lee, S. S. (2000). Noisy replication in skewed binary classification. Computational Statistics & Data Analysis, 34(2), 165–191.
Li, M., Xiong, A., Wang, L., Deng, S., & Ye, J. (2020). ACO resampling: Enhancing the performance of oversampling methods for class imbalance classification. Knowledge-Based Systems, 196, 105818.
Maciejewski, T., & Stefanowski, J. (2011). Local neighbourhood extension of SMOTE for mining imbalanced data. In Proceedings of the IEEE symposium on computational intelligence and data mining 2011, part of the IEEE symposium series on computational intelligence 2011, April 11-15, 2011, Paris, France (pp. 104–111).
Nakamura, M., Kajiwara, Y., Otsuka, A., & Kimura, H. (2013). LVQ-SMOTE-learning vector quantization based synthetic minority over-sampling technique for biomedical data. Biodata Mining, 6(1), 16.
Napierala, K., & Stefanowski, J. (2016). Types of minority class examples and their influence on learning classifiers from imbalanced data. Journal of Intelligent Information Systems, 46(3), 563–597.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
Price, K., Storn, R. M., & Lampinen, J. A. (2006). Differential evolution: A practical approach to global optimization. Springer Science & Business Media.
Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808.
Sáez, J. A., Krawczyk, B., & Woźniak, M. (2016). Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets. Pattern Recognition, 57, 164–178.
Sandhan, T., Choi, J. Y. (2014). Handling imbalanced datasets by partially guided hybrid sampling for pattern recognition. In 2014 22nd international conference on pattern recognition (pp. 1449–1453). IEEE.
Stapor, K., Ksieniewicz, P., García, S., & Woźniak, M. (2021). How to design the fair experimental classifier evaluation. Applied Soft Computing, 104, 107219.
Węgier, W., Koziarski, M., & Woźniak, M. (2023). Optimized hybrid imbalanced data sampling for decision tree training. In Proceedings of the companion conference on genetic and evolutionary computation (pp. 339–342).
Węgier, W., Koziarski, M., & Woźniak, M. (2022). Multicriteria classifier ensemble learning for imbalanced data. IEEE Access, 10, 16807–16818.
Weiss, G. M., & Provost, F. (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19(1), 315–354.
Wojciechowski, S. (2021). Multi-objective evolutionary undersampling algorithm for imbalanced data classification. In Computational science–ICCS 2021: 21st international conference, Krakow, Poland, June 16-18, proceedings, part III (pp. 118–127). Berlin, Heidelberg: Springer-Verlag.
Zhou, B., Yang, C., Guo, H., Hu, J. (2013). A quasi-linear SVM combined with assembled SMOTE for imbalanced data classification. In The 2013 international joint conference on neural networks (IJCNN) (pp. 1–7). IEEE.


Acknowledgements

This work was supported by the Polish National Science Centre under the grant No. 2019/35/B/ST6/04442 as well as the PLGrid Infrastructure.

Funding

This work was supported by the Polish National Science Centre under the grant No. 2019/35/B/ST6/04442.

Author information

Authors and Affiliations

Department of Systems and Computer Networks, Wrocław University of Science and Technology, Wybrzeże Wyspiańskiego 27, 50–370, Wrocław, Poland
Michał Koziarski & Michał Woźniak


Contributions

M.K. conceived of the presented idea, implemented the algorithm, planned and carried out the experiments, and analysed the results. M.W. contributed to the analysis of the results. Both authors discussed the results, drew the conclusions, and contributed to the final manuscript, helping with writing, reviewing and editing.

Corresponding author

Correspondence to Michał Koziarski.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable

Consent to participate

Not applicable

Consent for publication

Not applicable

Additional information

Editors: Nuno Moniz, Paula Branco, Luís Torgo, Nathalie Japkowicz, Shuo Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Koziarski, M., Woźniak, M. Local neighborhood encodings for imbalanced data classification. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06563-6


Received: 05 July 2022
Revised: 14 March 2024
Accepted: 26 April 2024
Published: 10 June 2024
DOI: https://doi.org/10.1007/s10994-024-06563-6
