Abstract
Mitigating the impact of class-imbalanced datasets on classifiers poses a challenge to the machine learning community. Conventional classifiers do not perform well, as they are habitually biased toward the majority class. Among existing solutions, the synthetic minority oversampling technique (SMOTE) has shown great potential, aiming to improve the dataset rather than the classifier. However, SMOTE still needs improvement because it oversamples every minority instance equally. Based on the consensus that instances far from the borderline contribute less to classification, a refined method for oversampling borderline minority instances (OBMI) is proposed in this paper using a two-stage Tomek link-finding procedure. In the oversampling stage, pairs of between-class instances nearest to each other are first found to form Tomek links. Then, the minority instances in these Tomek links are extracted as base instances. Finally, new minority instances are generated, each linearly interpolated between a base instance and one minority neighbor of that base instance. To address the overlap caused by oversampling, in the cleaning stage, Tomek links are employed again to remove the borderline instances from both classes. OBMI is compared with ten baseline methods on 17 benchmark datasets. The results show that it performs better on most of the selected datasets in terms of the F1-score and G-mean. Statistical analysis also indicates its higher-level Friedman ranking.
Introduction
Classification of imbalanced data has become an important research topic in machine learning [1]. Imbalanced data exhibit a skewed distribution; namely, the number of instances in the minority class is much smaller than that in the majority class. This situation often occurs in many applications, such as medical diagnosis [2], credit risk evaluation [3], and behavioral analysis [4]. Standard classifiers produce an inductive bias toward the majority class because the minority class contributes less to the minimization of the objective function. From a learning point of view, however, the minority class represents a more important pattern that deserves more attention. It also implies a higher cost when the minority class is not well classified. For example, the cost of misdiagnosing a healthy person is quite different from that of misdiagnosing a sick patient [5]. Therefore, improving minority class prediction is a key problem in imbalanced classification.
Many methods have been proposed to solve the aforementioned problem. These methods can be roughly categorized into three groups: data-level, algorithm-level, and hybrid methods [6, 7]. The data-level group uses resampling techniques to achieve data rebalancing [8]. It involves oversampling the minority class, undersampling the majority class, and hybrid sampling. The algorithm-level group places more emphasis on the minority class by modifying existing algorithms. It mainly includes cost-sensitive learning and ensemble methods. Cost-sensitive learning improves performance by penalizing the misclassification of the minority class. The ensemble method splits the majority class into several subsets, each with a size equal to the number of instances in the minority class [9]. The hybrid group combines data-level methods and algorithm-level methods. For example, the hybrid classifier ensemble [10] not only assigns higher weights to misclassified instances, but also performs density-based undersampling to obtain multiple subsets following a balanced distribution between classes.
Among these methods, algorithm-level methods modify existing algorithms to adapt them to imbalanced data, which requires a deep understanding of the specific algorithms and cost functions [11]. In contrast, data-level methods, also called resampling methods, are the most commonly used in imbalanced classification because they can be regarded as preprocessing steps and are classifier independent [12, 16]. This advantage makes them very flexible and enables them to be applied with any classification algorithm [13]. In resampling, undersampling changes the class distribution by reducing the majority instances, which may lead to the loss of useful information. The authors of [14, 15] have also shown that undersampling is less effective than oversampling, especially when the dataset is small. Therefore, this paper focuses on oversampling strategies.
Oversampling provides an effective solution to imbalanced classification by compensating for the insufficiency of minority class information. The synthetic minority oversampling technique (SMOTE) is the most well-known oversampling method [17], where the new instances are generated by linearly interpolating between a selected minority instance and one of its k-nearest minority neighbors [16]. In recent decades, SMOTE has been greatly developed and has derived many variants [18].
Although SMOTE can achieve a better balance in the number of instances between classes, when used in isolation, it may obtain results that are not as good as they could be [19]. This is because SMOTE ignores the natural distribution of instances due to its equal oversampling. Most of the classifiers attempt to learn the borderline of classes as exactly as possible in the training process. The borderline instances are more apt to be misclassified than those far from the borderline, and thus deserve more attention [19,20,21,22].
Borderline-SMOTE determines the borderline instances by the k-nearest neighbors (KNN) rule [20]. It divides the minority instances into three categories: safe, danger, and noise. A noise instance is one whose KNNs are all majority instances; a safe instance is one for which more than half of the KNNs are minority instances; and a danger instance is one for which more than half of the KNNs are majority instances. Only the danger instances are treated as borderline instances and oversampled. ADASYN adopts a strategy similar to that of Borderline-SMOTE to identify the borderline instances: the minority instances with more majority neighbors among their KNNs are assigned higher weights and thus greater chances of being oversampled [14, 16].
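The safe/danger/noise partition described above can be sketched in plain Python. This is an illustrative implementation of the KNN-based rule only, not the authors' code; the function and variable names are our own, and Euclidean distance is assumed.

```python
import math

def classify_minority(minority, majority, k=5):
    """Label each minority instance 'safe', 'danger', or 'noise' by the
    share of majority points among its k nearest neighbors, following
    the Borderline-SMOTE categorization (a sketch, not the original code)."""
    # Pool all instances with a class tag; minority points come first,
    # so index i in `points` matches index i in `minority`.
    points = [(p, 'min') for p in minority] + [(p, 'maj') for p in majority]
    labels = []
    for i, u in enumerate(minority):
        # Sort every other instance by distance to u.
        dists = sorted(
            (math.dist(u, q), cls)
            for j, (q, cls) in enumerate(points) if j != i
        )
        n_maj = sum(1 for _, cls in dists[:k] if cls == 'maj')
        if n_maj == k:
            labels.append('noise')    # all k neighbors are majority
        elif n_maj >= k / 2:
            labels.append('danger')   # borderline: these get oversampled
        else:
            labels.append('safe')
    return labels
```

With k=5 this marks a minority point as danger when 3 or 4 of its neighbors are majority, as noise when all 5 are, and as safe otherwise, matching the "more than half" wording above.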
However, both Borderline-SMOTE and ADASYN fail to find all the borderline instances due to the ill-posed parameter k [23]. Figure 1 gives an example involving only Borderline-SMOTE. When \(k=3\), the minority instance P is identified as noise; when \(k=5\), P is a danger instance; and when \(k=7\), P is a safe instance. From Fig. 1, it can be seen that the setting of different k values significantly affects the determination of the borderline instances. An inappropriate k value can also degrade the quality of newly synthesized instances. Therefore, establishing a more effective determination mechanism for borderline instances has become a key issue in oversampling methods.
The Tomek link is an enhancement of the KNN rule and is defined as a pair of minimally distant nearest neighbors of opposite classes [24]. If two instances form a Tomek link, then either one of them is noise or both are near the borderline [40]. According to Batista et al. [25], the Tomek link can be used as an undersampling technique or as a postprocessing cleaning step. If used as an undersampling technique, only the instances from the majority class are removed. If used as a postprocessing cleaning step, both instances are removed.
In this paper, the use of the Tomek link will be extended in dealing with the class imbalance problem. A refined method called oversampling borderline minority instances (OBMI) is proposed. It consists of two stages. In the first stage, the borderline minority instances are located and oversampled via a Tomek link-finding procedure. In the second stage, Tomek links are employed again for overlap reduction. The merits of the proposed method are as follows:
-
It can determine informative borderline instances and avoids the difficulty of choosing the parameter k, unlike Borderline-SMOTE or ADASYN. Additionally, it is well-posed: once the distance function is selected, the result obtained is unique.
-
It is lightweight and simple to implement. There is no complicated iteration; it can be implemented directly from its definition. In some real-time scenarios, the simpler model should be preferred (Occam’s razor).
-
Experimental evaluation with ten well-known oversampling methods (i.e., SMOTE [17], Borderline-SMOTE [20], ADASYN [14], SMOTE-TL [25], SMOTE-IPF [19], kmeans-SMOTE [50], NaNSMOTE [51], RBO [52], SyMProD [54], and NaNBDOS [55]) demonstrates its effectiveness and competitiveness.
The rest of the paper is organized as follows. “Related works” section reviews the related work. “The proposed method” section presents the proposed OBMI in two separate stages: the oversampling stage and the cleaning stage. The corresponding algorithms are also described in detail. In “Experiment on 2D toy datasets” and “Experiment on benchmark datasets” sections, OBMI is evaluated by experiments both on synthetic datasets and on benchmark datasets. Furthermore, nonparametric statistical tests are also performed. “Conclusion” section gives the concluding remarks with a brief discussion.
Related works
There have been extensive studies on oversampling methods [18, 26, 27]. Each of them has its specific capabilities. This section will summarize those related to this work from three aspects: oversampling for borderline instances, oversampling with postprocessing cleaning, and several other well-known methods.
Oversampling for borderline instances
Chawla et al. [17] proposed the classical SMOTE as an alternative to random oversampling. SMOTE generates new synthetic instances by linear interpolations instead of random duplications, which can improve the quality of data fed to subsequent classifiers. Recently, many different extensions have been proposed with the objective of improving some of the capabilities of SMOTE [28]. One common idea is to calculate which minority instances are the best candidates for oversampling.
Borderline-SMOTE was proposed as an enhancement of SMOTE, in which only the borderline minority instances are oversampled [20]. It builds on the premise that the instances near the borderline are more apt to be misclassified than those far from the borderline and, thus, more important for classification. To obtain the borderline instances, the ratio of the majority instances in the neighbors of each minority instance is calculated. Similar to Borderline-SMOTE, the essential idea of ADASYN is to use a weighted distribution for different minority instances according to their level of difficulty in learning [14]. More synthetic instances are generated for minority instances that are harder to learn compared to those that are easier to learn.
Nguyen et al. [22] presented a novel method to achieve borderline oversampling. The borderline region is approximated by the support vectors obtained after training a standard support vector machine (SVM) classifier. New instances are created along the lines joining each support vector of the minority class with a number of its KNNs using interpolation or extrapolation. Tian et al. [29] adopted a similar strategy to oversample minority instances identified as support vectors. For the majority class instances, middle score k-means (MSK) is presented to decompose them into several clusters. Subsequently, these clusters are used to build an SVM ensemble with the oversampled minority instances. Wang et al. [31] proposed another ensemble method called bagging of extrapolation Borderline-SMOTE SVM (BEBS) for addressing the class imbalance problem. BEBS screens the informative borderline instances by SVM in advance, and then the bagging mechanism is applied to promote generalization.
Zhu et al. [30] proposed a position characteristic-aware interpolation oversampling algorithm (PAIO). In PAIO, the minority instances are divided into three groups: the inland group, the borderline group, and the trapped group with different position characteristics, by using NBDOS clustering. Then, the different oversampling strategies are applied to these groups to alleviate the overfitting. Tao et al. [16] proposed adaptive weighted oversampling based on density peak clustering with heuristic filtering. The size of each identified subcluster is adaptively determined according to its own size and density. The minority instances within each subcluster are oversampled based on their probabilities inversely proportional to their distances to the majority class and their densities to generate more synthetic minority instances for borderline and sparser instances.
Wang et al. [15] proposed a local distribution-based adaptive minority oversampling method (LAMO) to address the class imbalance problem. LAMO identifies the informative borderline minority instances as sampling seeds according to the neighbor relations and class distribution. Next, LAMO generates synthetic instances around seeds via a Gaussian mixture model. Revathi et al. [32] combined noise reduction with oversampling to emphasize the importance of borderline instances. For noise reduction, it decides whether one instance should be kept or removed by a calculated propensity score. For oversampling, only the borderline instances located in the danger region are selected as sampling seeds. Chen et al. [33] introduced the relative density to measure the local density of each minority instance. A self-adaptive robust SMOTE (RSMOTE) was further proposed, which distinguishes the borderline instances and the safe instances by the relative density. The number of synthetic instances for each base instance is reweighted according to its chaotic level. Leng et al. [55] proposed NaNBDOS (short for borderline oversampling via natural neighbor search) to improve class-imbalance learning. It adaptively assigns dynamic sampling weights to the base instances. More synthetic instances are generated around the base instances with more minority natural neighbors. This strategy is closely related to the data complexity and can maintain the original distribution to a certain extent.
Oversampling with postprocessing cleaning
Data cleaning after oversampling is considered a good strategy. There are several early methods, including SMOTE-TL [25], SMOTE-ENN [49], and SMOTE-RSB [35]. SMOTE-TL [25] removes the Tomek links from the balanced dataset to alleviate overlap. SMOTE-ENN [49] employs the edited nearest neighbor (ENN) rule to remove noisy instances. It can extend the limitation of defining noises to some extent, and further provides more in-depth cleaning of data compared to SMOTE-TL. Based on rough set theory, SMOTE-RSB [35] eliminates any synthetic instance that does not belong to the lower approximation of the minority class, considering these instances in the borderline region as noisy and not useful for classification.
Sáez et al. [19] proposed an extension of SMOTE through an iterative-partitioning filter (IPF). SMOTE-IPF has the ability to control the noise introduced by the balancing between classes produced by SMOTE and to make the class boundaries more regular. Tao et al. [16] proposed a novel adaptive weighted oversampling method based on density peak clustering. It can simultaneously address between-class and within-class imbalances. Moreover, a heuristic filtering strategy is developed inspired by PSO evolution, in which the position updating mechanism is used to iteratively move the possibly overlapping minority instances away from the majority class. Koziarski et al. [35] proposed a multiclass combined cleaning and resampling (MC-CCR) algorithm. MC-CCR utilizes an energy-based approach to model the regions suitable for oversampling and combines it with a simultaneous cleaning operation. It is less affected by interclass relationship information loss than the traditional multiclass decomposition strategies.
Li et al. [21] proposed a filtering-based oversampling method called SMOTE-NaN-DE. Minority instances are first generated to balance the datasets, and then the natural neighbor is applied to detect noisy and borderline instances. Furthermore, differential evolution is used to optimize and change the positions of found instances instead of eliminating them. Tao et al. [36] proposed SVDDDPCO for handling imbalanced and overlapped data, which combines SVDD (support vector data description) and DPC (density peak clustering). It first utilizes SVDD to generate the class boundary and clean up the potential overlapped or noisy instances. Then, DPC is used to cluster the minority instances to address the within-class imbalance. Zhang et al. [37] proposed a SMOTE variant with reverse KNNs. After oversampling, the noisy instances are identified based on probability density but not local neighborhood information. Finally, these training instances with relevant probability densities higher than a given threshold are removed.
Other well-known oversampling methods
Douzas et al. [50] developed a simple and effective oversampling method called kmeans-SMOTE, which is based on k-means clustering and SMOTE. It can avoid noise generation and can overcome imbalances between and within classes. In kmeans-SMOTE, a high ratio of minority instances is used as an indicator that a cluster is a safe area. Furthermore, the average distance among a cluster’s minority instances is used to identify sparse areas. In general, those sparse clusters are assigned higher sampling weights. Koziarski et al. [52] introduced a new algorithm called radial-based oversampling (RBO) for imbalanced and noisy data. Instead of relying on neighborhood search, RBO uses radial-based functions to determine local distributions of both minority and majority instances. Moreover, calculating joint potentials is employed to calibrate classification difficulty in a specific area. Thus, the oversampling process is guided so that it can better allocate synthetic instances without increasing overlap. Kunakorntum et al. [54] presented a new oversampling technique, namely synthetic minority based on probabilistic distribution (SyMProD), to handle class-imbalance datasets. SyMProD normalizes data using a Z-score and creates synthetic instances that cover the minority class distribution, avoiding the noise generation. Additionally, it reduces the possibilities of overlapping classes and overgeneralization problems.
In SMOTE and its extensions, the choice of the neighbor parameter k is challenging. To address this issue, Li et al. [51] proposed a novel approach by embedding natural neighbors in SMOTE (NaNSMOTE). The random difference between a selected base instance and one of its natural neighbors is used to generate new synthetic instances. NaNSMOTE has an adaptive k value related to the data complexity. The instances closer to class centers have more neighbors to improve the generalization of oversampling results. In contrast, the borderline instances have fewer neighbors to reduce the error of newly generated instances. In a recent practical application, Jovanovic et al. [53] combined the firefly algorithm with SMOTE to detect credit card fraud. This new strategy was named the group search firefly algorithm (GSFA), and its effectiveness was confirmed in comparative experiments with other competitors.
The proposed method
There is one common consensus in oversampling: the instances far from the borderline may not provide a substantial contribution to the classification ability of the model [28]. This paper proposes oversampling borderline minority instances (OBMI) to improve oversampling effectiveness. OBMI consists of two stages: the oversampling stage and the cleaning stage. Next, the details of the two-stage procedure are explained, and the corresponding algorithms are described.
Notations and terms
Several notations and terms used in this paper are as follows.
-
|S| represents the cardinality of set S.
-
\(S_{min}=\{u_1, u_2, \ldots , u_i, \ldots , u_{|S_{min}|}\}\) is a minority dataset.
-
\(S_{maj}=\{v_1, v_2, \ldots , v_j, \ldots , v_{|S_{maj}|}\}\) is a majority dataset.
-
\(\textrm{dist}(u_i, v_j)=||u_i-v_j||_2\) stands for the Euclidean distance between \(u_i\) and \(v_j\).
-
A base instance indicates a sampling seed selected from the minority class.
-
K is a KNN parameter used for generating new instances, generally in the oversampling process.
-
k is a kNN parameter used for the calculation of local density, specifically in Borderline-SMOTE and ADASYN.
Tomek link
Given two instances \(u_i \in S_{min}\) and \(v_j \in S_{maj}\), the pair \((u_i, v_j)\) is called a Tomek link if there is no other instance \(w_t\) such that \(\textrm{dist}(u_i, w_t)<\textrm{dist}(u_i, v_j)\) or \(\textrm{dist}(w_t, v_j)<\textrm{dist}(u_i, v_j)\) [38], i.e.,

$$\begin{aligned} \nexists \, w_t: \ \textrm{dist}(u_i, w_t)<\textrm{dist}(u_i, v_j) \ \text {or} \ \textrm{dist}(w_t, v_j)<\textrm{dist}(u_i, v_j). \end{aligned}$$
(1)
A procedure for finding Tomek links was developed straightforwardly from this definition [39]. Step 1: For each \(u_i \in S_{min}\), find the nearest instance \(v_p \in S_{maj}\). The link set \(L_{12}=\{(u_i, v_p) \,|\, u_i \in S_{min}, v_p \in S_{maj}\}\) is saved. Step 2: For each \(v_j \in S_{maj}\), find the nearest instance \(u_q \in S_{min}\). The link set \(L_{21}=\{(u_q, v_j) \,|\, u_q \in S_{min}, v_j \in S_{maj}\}\) is saved. Step 3: The set of Tomek links \(T = L_{12} \cap L_{21}\) is finally obtained.
Actually, the above procedure can be simplified: in Step 2, the search can start from only the instances in \(L_{12}\) instead of all the instances in \(S_{maj}\). As shown in Fig. 2, for four instances \(u_1, u_2, u_3, u_4 \in S_{min}\), the two nearest instances \(v_1, v_2 \in S_{maj}\) are first found separately. Then, starting from \(v_1\) and \(v_2\), \(u_2\) and \(u_3\) are found, and two Tomek links \((u_2, v_1)\) and \((u_3, v_2)\) are obtained.
The algorithm for finding Tomek links (Find-TLs for short) is depicted in Algorithm 1. If two instances can form a Tomek link, then they are near the borderline with a high probability [40]. Taking advantage of the ability to locate borderline instances, the Tomek links are employed to oversample the borderline minority instances in “Oversampling stage” section and then perform a postprocessing cleaning step in “Cleaning stage” section.
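The two-step search can be sketched in Python as follows. This is an illustrative reading of the Find-TLs procedure (mutual opposite-class nearest neighbors), assuming Euclidean distance and list-of-tuples inputs; the function names are our own, not from the paper.

```python
import math

def find_tomek_links(s_min, s_maj):
    """Sketch of the Find-TLs procedure: for each minority point find its
    nearest majority point (Step 1), then keep only the pairs where the
    nearest-neighbor relation is mutual (Steps 2-3). Returns index pairs."""
    def nearest(p, pool):
        # Index of the point in `pool` closest to p.
        return min(range(len(pool)), key=lambda j: math.dist(p, pool[j]))

    links = []
    for i, u in enumerate(s_min):
        j = nearest(u, s_maj)               # Step 1: nearest majority neighbor
        if nearest(s_maj[j], s_min) == i:   # Step 2 restricted to L12, as above
            links.append((i, j))            # Step 3: mutual pair -> Tomek link
    return links
```

Note that Step 2 only ever examines majority points reached in Step 1, matching the simplification described in the text.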
Oversampling stage
In the OBMI oversampling stage, a Tomek link-finding procedure is first employed to identify the borderline instances. Then, oversampling is performed on the identified minority instances, and new instances are generated near the borderline. As a result, a balance between classes is achieved.
Given a minority dataset \(S_{min}\) and a majority dataset \(S_{maj}\), a set T of Tomek links is obtained by the Find-TLs algorithm (Algorithm 1). Furthermore, the borderline minority instances are extracted from T and placed into a candidate set \(R^{'}\) of base instances. Thus, an oversampling factor \(\delta \) can be calculated as follows:

$$\begin{aligned} \delta = \frac{|S_{maj}| - |S_{min}|}{|R^{'}|}. \end{aligned}$$
(2)
If \(\delta < 1\), the number of instances in \(R^{'}\) is greater than the difference between the numbers of instances in \(S_{maj}\) and \(S_{min}\). In this case, only a part of the instances from \(R^{'}\) are selected as base instances. These base instances are then stored in a set R for generating new instances. Notably, there is a one-to-one correspondence between the new instances and the base instances.
Otherwise, all instances in \(R^{'}\) are stored in the base instance set R. A synthesis coefficient \(\gamma \) is then calculated as follows:

$$\begin{aligned} \gamma = \left[ \frac{|S_{maj}| - |S_{min}|}{|R|} \right] , \end{aligned}$$
(3)

where \(\gamma \) is an integer and "[ ]" represents the rounding operation. The synthesis coefficient \(\gamma \) indicates the number of new instances to be generated for each base instance. It should be noted that Eq. (3) is not simply obtained by rounding Eq. (2), since R is a proper subset of \(R^{'}\) when \(\delta < 1\).
For each base instance \(u_i \in R\), its KNNs in \(S_{min}\) are calculated. Then, one (\(\delta <1\)) or \(\gamma \) (\(\delta \ge 1\)) of the KNNs is selected to generate one or \(\gamma \) new synthetic instances. Assuming that one nearest neighbor \(u_t\) is selected, a new instance \(u_{new}\) can be generated by linear interpolation:

$$\begin{aligned} u_{new} = u_i + \textrm{random}(0, 1) \times (u_t - u_i), \end{aligned}$$
(4)
where \(\textrm{random}(0, 1)\) denotes a random number between 0 and 1.
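The generation step can be sketched as follows. This is an illustrative Python rendering under the assumption that \(\gamma\) distinct neighbors are drawn per base instance; the names are ours, and the exact neighbor-selection policy in Algorithm 2 may differ.

```python
import math
import random

def oversample_bmis(base, s_min, gamma, K=5):
    """For each base (borderline) minority instance, interpolate toward
    randomly chosen members of its K nearest minority neighbors:
    u_new = u + random(0, 1) * (u_t - u)."""
    new_instances = []
    for u in base:
        # K nearest minority neighbors of the base instance, excluding itself.
        neighbors = sorted((p for p in s_min if p != u),
                           key=lambda p: math.dist(u, p))[:K]
        # Draw gamma distinct neighbors (an assumption of this sketch).
        for u_t in random.sample(neighbors, min(gamma, len(neighbors))):
            lam = random.random()
            new = tuple(a + lam * (b - a) for a, b in zip(u, u_t))
            new_instances.append(new)
    return new_instances
```

Because the interpolation coefficient lies in [0, 1), every synthetic point falls on the segment between the base instance and the chosen neighbor, i.e., near the borderline.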
An informal description of the oversampling stage is depicted in Algorithm 2 (Oversample-BMIs for short). To make it easier to follow, the algorithm is expressed in a high-level pseudocode, with detailed comments.
Figure 3 shows an illustrative example of oversampling BMIs. The original dataset consists of eight minority instances and 17 majority instances (Fig. 3a). Using the Find-TLs algorithm, four Tomek links, \((u_1, v_1)\), \((u_2, v_2)\), \((u_3, v_3)\), and \((u_4, v_4)\), are obtained (Fig. 3b). Then, a base instance set \(R=\{u_1, u_2, u_3, u_4\}\) is constructed by extracting the minority instances from the obtained Tomek links. Subsequently, the number of synthetic instances for each base instance is calculated by Eq. (3). Finally, eight new minority instances are generated, two for each base instance (Fig. 3c).
Cleaning stage
After the oversampling stage, a balanced dataset is obtained. New synthetic instances are generated near the borderline with a high probability. It is also a re-emphasis on the claim that the borderline instances are critical for estimating the optimal decision boundary [22]. However, one problem with oversampling is that the degree of overlap between classes increases, since the synthetic minority instances might invade the majority class space [41].
To alleviate the overlap, in the second stage of OBMI, a postprocessing cleaning step is performed. The Find-TLs algorithm is called again. Both instances in the obtained Tomek links are removed. It is worth noting that the Tomek links obtained in this stage and those obtained in the oversampling stage are very different because of the participation of newly generated instances. In the oversampling stage (Fig. 3b), four Tomek links are identified and used to generate new instances. In the cleaning stage (Fig. 4a), five Tomek links are found: \((u_1, v_1)\), \((u_{11}, v_5)\), \((u_{13}, v_2)\), \((u_3, v_3)\), and \((u_{16}, v_4)\). A total of ten instances will be removed from the balanced dataset. In particular, the majority instance \(v_4\) will be regarded as noise. The treatment using Tomek links for cleaning can play a role in noise reduction to a certain extent. Due to pairwise deletion, the cleaned dataset is still balanced. Moreover, the overlap between classes is significantly alleviated (Fig. 4b), which helps the classifier learn an optimized classification model.
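The cleaning stage can be sketched as follows. This is an illustrative, self-contained Python version (mutual opposite-class nearest neighbors as the Tomek link test, Euclidean distance); the function name is ours.

```python
import math

def clean_tomek_links(s_min, s_maj):
    """Cleaning-stage sketch: detect Tomek links on the balanced data and
    remove BOTH endpoints, so the cleaned dataset stays balanced while
    the between-class overlap shrinks."""
    def nearest(p, pool):
        return min(range(len(pool)), key=lambda j: math.dist(p, pool[j]))

    # Mutual opposite-class nearest-neighbor pairs (Tomek links).
    links = [(i, nearest(u, s_maj)) for i, u in enumerate(s_min)]
    links = [(i, j) for i, j in links if nearest(s_maj[j], s_min) == i]

    drop_min = {i for i, _ in links}
    drop_maj = {j for _, j in links}
    return ([u for i, u in enumerate(s_min) if i not in drop_min],
            [v for j, v in enumerate(s_maj) if j not in drop_maj])
```

Pairwise deletion is what preserves the balance obtained in the oversampling stage: one minority and one majority instance are removed per link.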
Experiment on 2D toy datasets
In determining borderline instances, Borderline-SMOTE does not perform well because of its dependence on the neighbor parameter k used for calculating local density (see Fig. 1), while the proposed OBMI is expected to provide a stable solution. In this section, a comparative experiment is conducted to intuitively illustrate the difference between them. Two 2D toy datasets are created: Wave, with 214 minority and 586 majority instances, and Circle, with 162 minority and 638 majority instances. Figures 5 and 6 show the experimental results on Wave and Circle.
Borderline-SMOTE determines different minority instances as borderline instances as the value of k changes (Figs. 5b–e and 6b–e). Correspondingly, it can be seen that oversampling occurs in different borderline regions. OBMI, in contrast, obtains a unique oversampling result (Figs. 5f, 6f). Additionally, the oversampled instances spread over the entire borderline region, unlike the multiple separate clusters produced by Borderline-SMOTE. On Wave, 368 new instances are generated, and 18 Tomek links are identified (Fig. 5g). On Circle, 484 new instances are generated, and 23 Tomek links are identified (Fig. 6g). The instances cleaned by Tomek links account for only a small proportion of the oversampled instances, so the cleaning step does not waste excessive computation. Figures 5h and 6h show the final results.
This section also provides a visualization of the classification boundaries using the Scikit-learn toolkit [43], as shown in Fig. 7. Four oversampled datasets are used: Wave1 (corresponding to Fig. 5c), Circle1 (Fig. 6c), Wave2 (Fig. 5h), and Circle2 (Fig. 6h). A support vector machine (SVM) and a multilayer perceptron (MLP) are employed as the classification algorithms, with default settings. As Fig. 7 shows, a dataset oversampled by Borderline-SMOTE and one oversampled by OBMI induce very different classification models, even though they derive from the same original dataset.
Experiment on benchmark datasets
In this section, the performance of the proposed OBMI is evaluated on 17 benchmark datasets by comparing it with ten well-known methods.
Datasets and metrics
Seventeen binary-class imbalanced datasets are selected from the KEEL repository [44]. Detailed information about these datasets can be found in Table 1. n is the total number of instances (from 129 to 5472), D is the number of features (from 3 to 18), and \(\textrm{IR} = \frac{n^-}{n^+}\) is the imbalance ratio, varying from 1.82 to 32.78. For convenience, a key is used to abbreviate the name of each dataset.
In a confusion matrix (Table 2), FP is the number of negative (majority) instances that are predicted to be positive (minority), and the rest can be understood similarly. Based on the confusion matrix, several metrics are given to assess the class imbalance problem.
The F1-score is the harmonic mean of precision and recall, mainly involving the positive instances. The G-mean is the geometric mean of recall and specificity, involving both positive and negative instances. They have the property of better characterizing quality when data are imbalanced. They are used as assessment metrics in this experiment.
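The two metrics follow directly from the confusion matrix entries. A short Python sketch (standard definitions; the function name is ours):

```python
import math

def f1_and_gmean(tp, fn, fp, tn):
    """Compute the two assessment metrics from a binary confusion matrix:
    F1 = harmonic mean of precision and recall;
    G-mean = geometric mean of recall and specificity."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity, i.e., minority-class TPR
    specificity = tn / (tn + fp)   # majority-class TNR
    f1 = 2 * precision * recall / (precision + recall)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean
```

Because recall and specificity enter the G-mean symmetrically, a classifier that ignores the minority class (recall near 0) scores poorly even when overall accuracy is high, which is why these metrics suit imbalanced data.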
Methods used for comparison
Three classical oversampling methods, i.e., SMOTE [17], Borderline-SMOTE [20], and ADASYN [14], are used as baselines. Since the proposed OBMI has a postprocessing cleaning step, it is also compared with two other methods with cleaning steps, SMOTE-TL [25] and SMOTE-IPF [19]. Five recent methods (kmeans-SMOTE [50], NaNSMOTE [51], RBO [52], SyMProD [54], and NaNBDOS [55]) serve as state-of-the-art references. The parameters of these methods are listed in Table 3.
In the parameter settings, \(K=5\) denotes that, in the synthesis phase, five nearest neighbors are considered and some of them are used for linear interpolation with the base instance, while \(k=5\) means that, in the local density calculation phase, five nearest neighbors are employed to determine whether a minority instance is a "danger" instance (for Borderline-SMOTE) or how many new instances should be generated around a base instance (for ADASYN). In ADASYN, \(\beta \in [0, 1]\) specifies the desired balance level, and \(d_{th}\) is a preset threshold for the maximum tolerated imbalance ratio. In SMOTE-IPF, \(iters\) is the number of consecutive iterations for the stop criterion, p represents the number of subsets obtained by splitting the current training dataset, and r is the proportion of noise instances removed in each iteration. In kmeans-SMOTE, the number of nearest neighbors is selected from \(\{3, 6\}\), and an exponent \(de\) used for density computation is set to the number of features. In NaNSMOTE, the number of nearest neighbors is set to five, the same as in SMOTE. In RBO, a parameter selection for \(\gamma \) is performed, considering the values in \(\{0.001, 0.01, \ldots , 10.0\}\). Furthermore, the probability of early stopping (pes) is set to 0.001, the number of iterations to 5000, and the step size to 0.0001. In SyMProD, NT is the noise threshold used to detect and remove noisy data, CT is the cutoff threshold used to filter minority instances out of the majority region, K1 is the number of nearest neighbors used to compare the closeness factor between groups, and M is the number of nearest neighbors used to generate synthetic instances. All SyMProD parameters are set to their default values.
Although the proposed method also has a parameter K, which is commonly used for linear interpolation in oversampling methods, one fact needs to be emphasized: OBMI does not require any parameters in the process of determining the borderline instances, which is an essential difference from Borderline-SMOTE and ADASYN. In addition, an SVM with an RBF kernel (\(C=1000, \gamma =0.001\)) and an MLP trained with conjugate gradient are employed as the postclassifiers.
Results and discussion
Following fivefold cross-validation [45], each dataset used in this experiment is first randomly divided into 5 disjoint folds of approximately equal size. Each fold in turn serves as the test set for the model induced from the remaining four folds. The average results with standard deviations are reported in Tables 4, 5, 6 and 7, where the highest scores are marked in bold, the symbol "+" denotes that OBMI outperforms the comparison method, "=" indicates that both obtain the same score, and "-" indicates that OBMI performs worse than the comparison method.
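For reference, the two metrics reported in Tables 4, 5, 6 and 7 can be computed from the confusion counts as in the sketch below. It assumes the minority class is labeled 1 (a labeling convention assumed here, not stated per dataset):

```python
import numpy as np

def f1_and_gmean(y_true, y_pred):
    """F1-score and G-mean for binary labels, minority class = 1."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # TPR (sensitivity)
    specificity = tn / (tn + fp) if tn + fp else 0.0   # TNR
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    gmean = np.sqrt(recall * specificity)              # balances both classes
    return f1, gmean
```

Because G-mean multiplies the per-class accuracies, a classifier that ignores the minority class scores zero, which is why both metrics are preferred over plain accuracy for imbalanced data.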
From Tables 4, 5, 6 and 7, the following observations can be made:
(1) In terms of the F1-score with SVM, OBMI achieves the highest score on four datasets, i.e., iri0, pbl0, s2v4, and gla5. Specifically, compared with SMOTE, Borderline-SMOTE, ADASYN, SMOTE-TL, SMOTE-IPF, kmeans-SMOTE, NaNSMOTE, RBO, NaNBDOS, and SyMProD, it wins on 12, 11, 14, 11, 10, 12, 10, 8, 10, and 11 datasets, respectively. Especially on the "gla5" dataset, OBMI has an average testing F1-score of 89.33%, which is much higher than that obtained by the other nine methods (NaNBDOS excepted).
(2) In terms of G-mean with SVM, OBMI obtains the best results on six datasets, i.e., iri0, hab, nth2, gla4, gla5, and yea5. It performs better than SMOTE, Borderline-SMOTE, ADASYN, SMOTE-TL, kmeans-SMOTE, NaNSMOTE, RBO, NaNBDOS, and SyMProD on 10, 12, 13, 9, 12, 12, 10, 11, and 14 datasets, respectively. Compared with SMOTE-IPF, OBMI wins on eight datasets with a win–loss ratio of 50%. It is also comparable to the other methods on the datasets where it does not achieve the highest scores.
(3) In terms of the F1-score with MLP, OBMI achieves the highest score on seven datasets, i.e., gla1, iri0, gla0, g03v46, nth2, s2v4, and yea5. It outperforms SMOTE, Borderline-SMOTE, ADASYN, SMOTE-TL, SMOTE-IPF, kmeans-SMOTE, NaNSMOTE, RBO, NaNBDOS, and SyMProD on 14, 10, 15, 11, 11, 10, 10, 8, 12, and 10 datasets, respectively. More remarkably, on the "g03v46" dataset, OBMI obtains an average F1-score of 90.05%, much higher than the results obtained by the other ten methods.
(4) In terms of G-mean with MLP, OBMI obtains the best results on six datasets, i.e., gla1, iri0, gla0, veh1, g03v46, and s2v4. In pairwise comparisons, it performs better than SMOTE, Borderline-SMOTE, ADASYN, SMOTE-TL, kmeans-SMOTE, NaNSMOTE, RBO, NaNBDOS, and SyMProD on 12, 13, 10, 9, 10, 10, 9, 15, and 13 datasets, respectively. SMOTE-IPF obtains the best results on eight datasets, two more than OBMI. Nevertheless, in the one-to-one comparison, SMOTE-IPF has a 7:8 win–loss ratio against OBMI.
Experimental results on benchmark datasets show that OBMI is generally better than SMOTE, Borderline-SMOTE, ADASYN, SMOTE-TL, kmeans-SMOTE, NaNSMOTE, RBO, NaNBDOS, and SyMProD in terms of both F1-score and G-mean. Additionally, compared with SMOTE-IPF, OBMI achieves similar or better performance on most of the selected datasets. Overall, OBMI shows a promising ability to improve imbalanced classification.
Nonparametric statistical test and analysis
To present a thorough comparison among these methods, a nonparametric statistical test is also conducted. The Friedman test and Bonferroni–Dunn post hoc procedure [46,47,48] are employed to differentiate the performance of the comparative methods on 17 datasets. The proposed OBMI is used as the control method and compared with other methods by multiple comparison tests. The test results are listed in Table 8, along with the mean scores of each method on different metrics.
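The average Friedman ranks of the kind reported in Table 8 are obtained by ranking the methods within each dataset and averaging the ranks. The sketch below (an illustrative reimplementation that omits tie handling for brevity, unlike a full Friedman test) also computes the Friedman chi-square statistic:

```python
import numpy as np

def friedman(scores):
    """scores: (n_datasets, n_methods), higher is better.
    Returns average ranks (1 = best) and the Friedman chi-square
    statistic; ties are not averaged in this sketch."""
    n, k = scores.shape
    ranks = np.empty_like(scores, dtype=float)
    for i in range(n):
        order = (-scores[i]).argsort()        # methods sorted best-first
        ranks[i, order] = np.arange(1, k + 1)  # assign ranks 1..k
    avg = ranks.mean(axis=0)
    stat = 12 * n / (k * (k + 1)) * (np.sum(avg ** 2) - k * (k + 1) ** 2 / 4)
    return avg, stat
```

The statistic is then compared against a chi-square distribution with \(k-1\) degrees of freedom, and the Bonferroni–Dunn procedure adjusts the pairwise comparisons against the control method.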
Table 8 shows that OBMI achieves the top Friedman ranking on the first three metrics (F1-score with SVM, G-mean with SVM, and F1-score with MLP). On the last metric (G-mean with MLP), OBMI ranks second, just behind SMOTE-IPF. Remarkably, in terms of mean scores, it outperforms all of the comparison methods.
Figure 8 visualizes the metric scores with a combination of boxplots and swarmplots. It displays the distribution of scores based on a five-number summary, i.e., minimum, first quartile, median, third quartile, and maximum. The relatively small dispersion of OBMI's scores suggests that it is robust. Overall, OBMI can effectively improve oversampling by synthesizing high-quality new instances that better assist the classifier.
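The five-number summary underlying each boxplot can be computed directly; a minimal sketch (using numpy's default linear quantile interpolation, which plotting libraries may vary from):

```python
import numpy as np

def five_number_summary(scores):
    """Minimum, first quartile, median, third quartile, maximum --
    the statistics a boxplot displays."""
    return np.percentile(scores, [0, 25, 50, 75, 100])
```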
Conclusion
Inspired by the critical importance of borderline instances in classification, OBMI is proposed to oversample the borderline minority instances for the class imbalance problem. OBMI is essentially a two-stage Tomek link-finding procedure. In the oversampling stage, the minority instances in the Tomek links are used as sampling seeds to balance the dataset. In the cleaning stage, both instances in the Tomek links are removed for overlap reduction between classes. Unlike classic Borderline-SMOTE and ADASYN, OBMI is well-posed and parameter-free in determining the borderline instances.
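The link-finding step at the heart of both stages can be sketched as follows. This is an illustrative O(n²) reimplementation of the Tomek link definition, not the authors' code:

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, of opposite-class instances
    that are each other's nearest neighbor, i.e., Tomek links."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # exclude self-distances
    nn = D.argmin(axis=1)                # nearest neighbor of each instance
    return [(i, int(nn[i])) for i in range(len(X))
            if nn[nn[i]] == i and y[i] != y[nn[i]] and i < nn[i]]
```

In the oversampling stage, the minority members of these pairs would serve as base instances; in the cleaning stage, both members of each pair would be removed.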
The experimental results show that OBMI has clear superiority over the remaining methods in terms of the F1-score. It obtains similar or better G-means on most of the selected benchmark datasets. This indicates that OBMI not only emphasizes the importance of the minority instances, but also takes the majority instances into account. The statistical test also confirms its higher-level Friedman ranking.
Despite the promising performance, the proposed OBMI also has some limitations. The Tomek link represents a rigid rule, and oversampling with it concentrates on the borderline instances, so the overall data distribution may be ignored. In addition, the neighbor parameter K involved in linear interpolation is still inherited from SMOTE (as in other SMOTE extensions), which restricts the oversampler from finding natural neighbors.
In future work, a fuzzy rule that extends the Tomek link to k-nearest-opposed pairs will be employed to improve the selection of borderline instances. Moreover, the theory of natural neighbors will be introduced to enhance the neighbor relationship; linear interpolation will then be performed between the base instance and one of its natural neighbors so that the data distribution is taken into account. The proposed method can also be extended to multiclass problems, although this requires an ingenious decomposition strategy. In-depth research on these issues would be an attractive advancement for class imbalance learning.
Data Availability
The data that support the findings of this study are openly available in the KEEL repository at https://sci2s.ugr.es/keel/imbalanced.php.
References
Lu Y, Cheung YM, Tang YY (2019) Bayes imbalance impact index: a measure of class imbalanced data set for classification problem. IEEE Trans Neural Netw Learn Syst 31(9):3525–3539
Wang Q, Zhou Y, Zhang W, Tang Z, Chen X (2020) Adaptive sampling using self-paced learning for imbalanced cancer data pre-diagnosis. Expert Syst Appl 152:113334
Shen F, Zhao X, Kou G, Alsaadi EE (2021) A new deep learning ensemble credit risk evaluation model with an improved synthetic minority oversampling technique. Appl Soft Comput 98:106852
Azaria A, Richardson A, Kraus S, Subrahmanian VS (2014) Behavioral analysis of insider threat: a survey and bootstrapped prediction in imbalanced data. IEEE Trans Comput Soc Syst 1(2):135–155
Xu Z, Shen D, Nie T, Kou Y (2020) A hybrid sampling algorithm combining M-SMOTE and ENN based on random forest for medical imbalanced data. J Biomed Inform 107:103465
Guo H, Li Y, Shang J, Gu M, Huang Y, Gong B (2017) Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 73:220–239
Raghuwanshi BS, Shukla S (2019) Class imbalance learning using under bagging based kernelized extreme learning machine. Neurocomputing 329:172–187
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Jimenez-Castaño C, Alvarez-Meza A, Orozco-Gutierrez A (2020) Enhanced automatic twin support vector machine for imbalanced data classification. Pattern Recognit 107:107442
Yang K, Yu Z, Wen X, Cao W, Chen CP, Wong HS, You J (2019) Hybrid classifier ensemble for imbalanced data. IEEE Trans Neural Netw Learn Syst 31(4):1387–1400
Vuttipittayamongkol P, Elyan E (2020) Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci 509:47–70
Liu CL, Hsieh PY (2019) Model-based synthetic sampling for imbalanced data. IEEE Trans Knowl Data Eng 32(8):1543–1556
Branco P, Torgo L, Ribeiro RP (2016) A survey of predictive modeling on imbalanced domains. ACM Comput Surv (CSUR) 49(2):1–50
He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the 2008 IEEE international joint conference on neural networks (IJCNN). IEEE, pp 1322–1328
Wang X, Xu J, Zeng T, Jing L (2021) Local distribution-based adaptive minority oversampling for imbalanced data classification. Neurocomputing 422:200–213
Tao X, Li Q, Guo W, Ren C, He Q, Liu R, Zhou J (2020) Adaptive weighted over-sampling for imbalanced datasets based on density peaks clustering with heuristic filtering. Inf Sci 519:43–73
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Fernández A, Garcia S, Herrera F, Chawla NV (2018) SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905
Sáez JA, Luengo J, Stefanowski J, Herrera F (2015) SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203
Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Proceedings of the international conference on intelligent computing. Springer, pp 878–887
Li J, Zhu Q, Wu Q, Zhang Z, Gong Y, He Z, Zhu F (2021) SMOTE-NaN-DE: addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution. Knowl-Based Syst 223:107056
Nguyen HM, Cooper EW, Kamei K (2011) Borderline over-sampling for imbalanced data classification. Int J Knowl Eng Soft Data Paradig. 3(1):4–21
Barua S, Islam MM, Yao X, Murase K (2012) MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425
Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6:769–772
Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29
Kovács G (2019) An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets. Appl Soft Comput 83:105662
Elreedy D, Atiya AF (2019) A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance. Inf Sci 505:32–64
Maldonado S, Vairetti C, Fernandez A, Herrera F (2022) FW-SMOTE: a feature-weighted oversampling approach for imbalanced classification. Pattern Recognit 124:108511
Tian J, Gu H, Liu W (2011) Imbalanced classification using support vector machine ensemble. Neural Comput Appl 20(2):203–209
Zhu T, Lin Y, Liu Y (2020) Improving interpolation-based oversampling for imbalanced data learning. Knowl-Based Syst 187:104826
Wang Q, Luo Z, Huang J, Feng Y, Liu Z (2017) A novel ensemble method for imbalanced data learning: bagging of extrapolation-SMOTE SVM. Comput Intell Neurosci 2017:1827016
Revathi M, Ramyachitra D (2021) A modified borderline smote with noise reduction in imbalanced datasets. Wirel Pers Commun 121(3):1659–1680
Chen B, Xia S, Chen Z, Wang B, Wang G (2021) RSMOTE: a self-adaptive robust SMOTE for imbalanced problems with label noise. Inf Sci 553:397–428
Ramentol E, Caballero Y, Bello R, Herrera F (2012) SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst 33(2):245–265
Koziarski M, Woźniak M, Krawczyk B (2020) Combined cleaning and resampling algorithm for multi-class imbalanced data with label noise. Knowl-Based Syst 204:106223
Tao X, Chen W, Zhang X, Guo W, Qi L, Fan Z (2021) SVDD boundary and DPC clustering technique-based oversampling approach for handling imbalanced and overlapped data. Knowl-Based Syst 234:107588
Zhang A, Yu H, Huan Z, Yang X, Zheng S, Gao S (2022) SMOTE-RkNN: a hybrid re-sampling method based on SMOTE and reverse k-nearest neighbors. Inf Sci 595:70–88
Pereira RM, Costa YMG, Silla CN Jr (2020) MLTL: a multi-label approach for the Tomek Link undersampling algorithm. Neurocomputing 383:95–105
Sklansky J, Michelotti L (1980) Locally trained piecewise linear classifiers. IEEE Trans Pattern Anal Mach Intell 2:101–111
Guzmán-Ponce A, Sánchez JS, Valdovinos RM, Marcial-Romero JR (2021) DBIG-US: a two-stage under-sampling algorithm to face the class imbalance problem. Expert Syst Appl 168:114301
Luengo J, Fernández A, García S, Herrera F (2011) Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput 15(10):1909–1936
Zheng Z, Cai Y, Li Y (2015) Oversampling method for imbalanced classification. Comput Inform 34(5):1017–1037
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Triguero I, González S, Moyano JM, García S, Alcalá-Fdez J, Luengo J, Fernández A, Jesús MJ, Sánchez L, Herrera F (2017) KEEL 3.0: an open source software for multi-stage analysis in data mining. Int J Comput Intell Syst 10:1238–1249
Wong TT, Yeh PY (2019) Reliable accuracy estimates from k-fold cross validation. IEEE Trans Knowl Data Eng 32(8):1586–1594
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
García S, Fernández A, Luengo J, Herrera F (2009) A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Comput 13(10):959–977
García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064
Verbiest N, Ramentol E, Cornelis C, Herrera F (2014) Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection. Appl Soft Comput 22:511–517
Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci 465:1–20
Li J, Zhu Q, Wu Q, Fan Z (2021) A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors. Inf Sci 565:438–455
Koziarski M, Krawczyk B, Woźniak M (2019) Radial-based oversampling for noisy imbalanced data classification. Neurocomputing 343:19–33
Jovanovic D, Antonijevic M, Stankovic M, Zivkovic M, Tanaskovic M, Bacanin N (2022) Tuning machine learning models using a group search firefly algorithm for credit card fraud detection. Mathematics 10(13):2272
Kunakorntum I, Hinthong W, Phunchongharn P (2020) A synthetic minority based on probabilistic distribution (SyMProD) oversampling for imbalanced datasets. IEEE Access 8:114692–114704
Leng Q, Guo J, Jiao E, Meng X, Wang C (2023) NanBDOS: adaptive and parameter-free borderline oversampling via natural neighbor search for class-imbalance learning. Knowl-Based Syst 274:110665
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under grants 61602056, 61772249, 61976027, the Foundation of Liaoning Province Education Administration under grant JYTMS20230819, and the PhD Startup Foundation of Liaoning Technical University under grant 21-1043.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The work described was original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. The paper is approved by all authors for publication.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Leng, Q., Guo, J., Tao, J. et al. OBMI: oversampling borderline minority instances by a two-stage Tomek link-finding procedure for class imbalance problem. Complex Intell. Syst. 10, 4775–4792 (2024). https://doi.org/10.1007/s40747-024-01399-y