Abstract
Mitigating the impact of class-imbalance data on classifiers is a challenging task in machine learning. SMOTE is a well-known method for tackling this task by modifying the class distribution and generating synthetic instances. However, most SMOTE-based methods focus on the phase of data selection, while few consider the phase of data generation. This paper proposes a hypersphere-constrained generation mechanism (HS-Gen) to improve synthetic minority oversampling. Unlike the linear interpolation commonly used in SMOTE-based methods, HS-Gen generates a minority instance in a hypersphere rather than on a straight line. This mechanism expands the distribution range of minority instances with significant randomness and diversity. Furthermore, HS-Gen is equipped with a noise prevention strategy that adaptively shrinks the hypersphere by determining whether new instances fall into the majority class region. HS-Gen can be regarded as an oversampling optimization mechanism and flexibly embedded into SMOTE-based methods. We conduct comparative experiments by embedding HS-Gen into the original SMOTE, Borderline-SMOTE, ADASYN, k-means SMOTE, and RSMOTE. Experimental results show that the embedded versions generate higher-quality synthetic instances than the original ones. Moreover, on these oversampled datasets, conventional classifiers (C4.5 and Adaboost) obtain significant performance improvements in terms of F1 measure and G-mean.
Introduction
Classification of imbalanced data is a challenging task in machine learning [49]. For a two-class dataset, imbalance means that the number of minority instances is much smaller than that of majority instances [17]. This may degrade the performance of conventional classifiers, since they are biased towards the majority class and are not competent enough to classify the minority class [16, 29]. Nevertheless, the minority class deserves more attention in practical applications, such as software defect prediction [30], medical diagnosis [52] and financial distress prediction [38]. Therefore, how to improve the recognition of the minority class has become a key issue.
In the past few decades, many methods have been proposed to deal with the problem of imbalanced classification [13]. These methods can be divided into four categories [54]: data-level method [5], algorithm-level method [48], cost-sensitive learning [42] and ensemble learning [15]. The data-level method employs the resampling technique to transform the class-imbalanced dataset into a balanced one [46]. It changes the data distribution to reduce the impact of class imbalance [35]. Algorithm-level methods modify traditional classification algorithms to learn the minority class more accurately. Cost-sensitive learning uses a cost matrix to adjust the penalties for different errors, which can emphasize the importance of the minority class. Ensemble learning solves the imbalanced problem by highlighting incorrectly classified instances in each iteration and combining the classification results from several weak classifiers.
Among them, data resampling is independent of classification models and can be flexibly combined with any classifier [23, 41]. It includes oversampling of the minority class [3, 32] and undersampling of the majority class [25, 37]. Oversampling generates reasonable minority instances to balance the dataset [22, 33, 55], which avoids the risk, inherent in undersampling, of discarding important information [16]. Random oversampling is one of the earliest oversampling methods. It generates new minority instances by simple duplication, which may lead to overly specific decision regions [21, 27].
To overcome the potential overfitting caused by simple replication, the Synthetic Minority Oversampling TEchnique (SMOTE) was proposed to help the classifier generalize better to testing data [7, 13]. SMOTE generates a new minority instance by linearly interpolating between a selected base instance of the minority class and one of its k-nearest minority neighbors. In recent decades, SMOTE has been greatly developed and has spawned many variants [13], such as Borderline-SMOTE [18], ADASYN [19], Safe-Level-SMOTE [6], etc.
In fact, the SMOTE-based methods can be decomposed into two phases, i.e., the phase of data selection and the phase of data generation. In the phase of data selection, the base instances are selected from the original minority instances with prior knowledge. In the phase of data generation, linear interpolation is employed to generate new instances for most SMOTE-based methods.
For linear interpolation, however, there are two shortcomings that need to be noticed.
-
It limits synthetic instances to the line connecting two selected minority instances, which means that the generated instances are less informative. This is the main reason why it cannot overcome the class imbalance problem within the minority class [28].
-
It may generate noise when the base instance and its neighbors belong to different clusters, leading to severe overlapping between classes, as shown in Fig. 1a.
Inspired by the above shortcomings, we propose a hypersphere-constrained generation mechanism (HS-Gen for short) to improve synthetic minority oversampling for imbalanced classification. Different from the linear interpolation of SMOTE, HS-Gen generates new minority instances in a hypersphere rather than on a line, which expands the distribution of minority instances with significant randomness and diversity. Furthermore, to prevent the generation of noise, the hypersphere initially determined by the base instance and one of its neighbors is adaptively shrunk until no majority instance falls inside it (as shown in Fig. 1b). The proposed method is expected to help the classifier generalize better.
The rest of the paper is organized as follows. In the next section, we give a brief review of the related work. In the subsequent section, we describe the process of HS-Gen and illustrate its difference from linear interpolation. The corresponding algorithm is further described with detailed comments. Then the proposed HS-Gen is evaluated by experiments on KEEL imbalanced datasets, and comparisons with several similar or related algorithms are performed. In the final section, we summarize the main contributions and discuss future research.
Related works
The baseline oversampling methods
SMOTE generates the new minority instances in the feature space to balance the data distribution [7]. The number of new instances depends on the oversampling ratio \(N\%\). In the phase of data selection, if \(N<100\), SMOTE randomly selects \(N\%\) of the minority instances as the base instances. Otherwise, each minority instance is regarded as a base instance.
In the phase of data generation, SMOTE generates each synthetic instance by linear interpolation between a base instance and one of its k-nearest neighbors. Specifically, for a base instance \(x_i\), if one of its k-nearest neighbors \(x_j\) is randomly selected, a new instance \(x_{\textrm{lin}}\) will be generated as follows:

$$x_{\textrm{lin}} = x_i + \beta (x_j - x_i), \quad (1)$$

where \(\beta \in (0,1)\) is a random number. Obviously, \(x_{\textrm{lin}}\) lies on the line segment connecting \(x_i\) and \(x_j\).
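The linear interpolation above can be sketched in a few lines (a minimal NumPy illustration; the function name and toy vectors are ours, not the paper's):

```python
import numpy as np

def smote_interpolate(x_i, x_j, rng=None):
    """Generate one synthetic instance on the segment between x_i and x_j,
    i.e. x_lin = x_i + beta * (x_j - x_i) with beta drawn from (0, 1)."""
    if rng is None:
        rng = np.random.default_rng()
    beta = rng.uniform(0.0, 1.0)          # random interpolation weight
    return x_i + beta * (x_j - x_i)

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 6.0])
x_lin = smote_interpolate(x_i, x_j, np.random.default_rng(0))
# x_lin lies on the line segment connecting x_i and x_j
```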
In the phase of data selection, SMOTE chooses base instances blindly, which may even backfire in many cases. SMOTE only considers the closeness between the minority instances, ignoring the distribution of the majority instances. This further exacerbates the learning difficulty of noisy and borderline instances. Some improvement has been achieved by taking the data characteristics into account and creating minority instances only near the class borderline.
Borderline-SMOTE [18] observes that borderline instances are more likely to be misclassified than instances far from the borderline. It identifies the borderline minority instances according to a density evaluation, which divides the minority instances into three categories, namely “safe” instances, “danger” instances, and “noises”. A minority instance is identified as “noise” if its k-nearest neighbors are all majority instances. For a “safe” instance, fewer than half of its k-nearest neighbors are majority instances, whereas for a “danger” instance, at least half (but not all) of its k-nearest neighbors are majority instances. In Borderline-SMOTE, only the “danger” instances are considered to lie on the borderline and are oversampled.
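The density evaluation above can be sketched as follows (our own brute-force illustration; the function name, the choice of k = 3, and the toy data are assumptions for the example):

```python
import numpy as np

def categorize_minority(X_min, X_maj, k=3):
    """Label each minority instance 'safe', 'danger', or 'noise' from the
    number of majority instances among its k nearest neighbors, as in the
    Borderline-SMOTE density evaluation (brute-force neighbor search)."""
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([False] * len(X_min) + [True] * len(X_maj))
    labels = []
    for i, x in enumerate(X_min):
        dist = np.linalg.norm(X_all - x, axis=1)
        dist[i] = np.inf                       # exclude the instance itself
        m = int(is_maj[np.argsort(dist)[:k]].sum())
        if m == k:
            labels.append("noise")             # all neighbors are majority
        elif m >= k / 2:
            labels.append("danger")            # borderline: oversampled
        else:
            labels.append("safe")
    return labels

# toy data: one isolated minority point, a minority cluster, two borderline points
X_min = np.array([[0.0, 0.0], [100.0, 100.0], [101.0, 100.0],
                  [100.0, 101.0], [50.0, 50.0], [50.0, 50.2]])
X_maj = np.array([[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0],
                  [50.1, 50.0], [50.0, 50.1]])
labels = categorize_minority(X_min, X_maj, k=3)
```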
The ADAptive SYNthetic oversampling (ADASYN) [19] employs another density-based data selection strategy. It weights a base instance according to the proportion of the majority instances in k-nearest neighbors of the base instance. More synthetic instances are generated for minority instances that are harder to learn compared to those minority instances that are easier to learn.
The recent oversampling methods
In 2018, Bellinger et al. [2] pointed out that the generative bias of SMOTE is not appropriate for the large class of learning problems that conform to the manifold property. Further, they proposed a general framework for manifold-based synthetic oversampling. Douzas et al. [10] employed k-means to assign more synthetic instances to sparse minority clusters, which alleviates within-class imbalance. The conditional version of generative adversarial networks (GAN) was used to generate minority instances for imbalanced datasets [11].
In 2019, Yan et al. [50] proposed a three-way decision model considering the differences in the cost of selecting base instances. The model uses a constructive covering algorithm to divide the minority instances into several covers and chooses the base instances according to the pattern of cover distribution on minority instances. Susan et al. [39] developed a three-step intelligent pruning of majority and minority instances. This method first uses particle swarm optimization to find globally optimum solutions in the search space for the intelligent undersampling technique. Then it oversamples the minority instances by SMOTE-based methods, which is further followed by the intelligent undersampling of the minority instances. Xie et al. [47] proposed generative learning by adopting the Gaussian mixed model to fit the distribution of the original dataset and generating new instances based on the distribution. Douzas et al. [12] proposed G-SMOTE that generates new instances within a geometric space around the base instance.
In 2020, Pan et al. [31] proposed Adaptive-SMOTE and Gaussian oversampling to improve data distribution. Adaptive-SMOTE adaptively selects groups of “Inner” and “Danger” data from the minority instances to generate the new instances. Gaussian oversampling adopts dimension reduction to thin the tails of the Gaussian distribution. LR-SMOTE [22] generates new instances close to the center of the minority instance, avoiding the generation of outlier instances or changing the data distribution. Guan et al. [17] proposed SMOTE-WENN that employs a weighted edited nearest neighbor rule to clean unsafe instances after oversampling. Tarawneh et al. [40] proposed SMOTEFUNA, which utilizes the furthest neighbor of a base instance to generate new instances. Ye et al. [51] utilized the Laplacian eigenmaps to find an optimal dimensional space, where the data can be well separated. Also, the disadvantage of SMOTE-based methods being prone to noise was amended to a certain extent.
In 2021, NaNSMOTE proposed by Li et al. [24] replaces KNN with natural nearest neighbors, which have an adaptive k related to the data complexity. RSMOTE [8] re-weights the number of new instances generated by each base instance according to the chaotic level and distinguishing characteristics of relative density. Bernardo et al. [4] proposed a very fast continuous SMOTE. SMOTE-NaN-DE [23] uses an error detection technique based on natural neighbors and differential evolution to optimize noisy and borderline instances after oversampling. RCSMOTE [36] employs a classification scheme to identify minority instances that are proper for oversampling and then generates new instances within a calculated safe range.
The proposed method
The SMOTE-based method can be divided into two phases, the phase of data selection and the phase of data generation. Most existing methods work on the former, dedicating themselves to finding optimal base instances. For the latter, linear interpolation is commonly used, leading to a lack of diversity in new instances. This paper strives to improve the phase of data generation by a hypersphere constraint instead of linear interpolation.
Generation with hypersphere constraint
Let \({\textbf{R}}^d\) stand for the d-dimensional Euclidean space. Assume that a base instance \(x_i\) and one of its neighbors \(x_k\) have been selected during the phase of data selection. Next, a new instance will be generated in a hypersphere C whose diameter is the line segment connecting \(x_i\) and \(x_k\). Accordingly, the radius r of C can be calculated as,

$$r = \frac{\left\| x_i - x_k \right\|}{2}. \quad (2)$$
The center \(x_m\) of C is located at the midpoint of the line segment connecting \(x_i\) and \(x_k\),

$$x_m = \frac{x_i + x_k}{2}. \quad (3)$$
Thus, we can generate a new instance \(x_{\textrm{syn}}\) within the hypersphere C, i.e.,

$$x_{\textrm{syn}} = x_m + r \cdot {\varvec{v}}, \quad (4)$$
where \({\varvec{v}} = \{ {v^1}, \ldots , {v^i}, \ldots , {v^d} \}\) is a d-dimensional vector. \(v^i \in (-1, 1)\) is a random number and \({\left\| {\varvec{v}} \right\| ^2} < 1\), which ensures that \(x_{\textrm{syn}}\) is generated in a hypersphere with the center \(x_m\) and the radius r. Figure 2 shows the difference between hypersphere generation and linear interpolation. In linear interpolation, a new instance \(x_{\textrm{lin}}\) (marked with a blue triangle) is on the connecting line of \(x_i\) and \(x_k\). In hypersphere generation, a new instance \(x_{\textrm{syn}}\) (marked with a green diamond) will be generated at any position within C.
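The hypersphere generation above can be sketched as follows (a minimal NumPy illustration; we draw \({\varvec{v}}\) by rejection sampling, which satisfies \(v^i \in (-1,1)\) and \(\left\| {\varvec{v}} \right\| < 1\) as required, though rejection becomes inefficient in high dimensions):

```python
import numpy as np

def sample_in_hypersphere(x_i, x_k, rng=None):
    """Generate a synthetic instance inside the hypersphere whose diameter
    is the segment x_i--x_k (the basic HS-Gen step, before noise prevention)."""
    if rng is None:
        rng = np.random.default_rng()
    x_m = (x_i + x_k) / 2.0                      # center: midpoint of the segment
    r = np.linalg.norm(x_i - x_k) / 2.0          # radius: half the distance
    while True:                                  # rejection-sample v with ||v|| < 1
        v = rng.uniform(-1.0, 1.0, size=len(x_i))
        if np.dot(v, v) < 1.0:
            return x_m + r * v

x_i = np.array([0.0, 0.0, 0.0])
x_k = np.array([2.0, 0.0, 0.0])
x_syn = sample_in_hypersphere(x_i, x_k, np.random.default_rng(1))
# x_syn lies strictly inside the sphere centered at (1, 0, 0) with radius 1
```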
Noise prevention strategy
One problem deserves attention. Similar to linear interpolation, the initial hypersphere generation proposed in “Generation with hypersphere constraint” section may still generate noise. As shown in Figs. 1a and 3a, the line connecting \(x_i\) and \(x_k\) crosses the majority class region when \(x_i\) and \(x_k\) belong to different clusters. A potential synthetic instance would then be generated within the majority class region. Obviously, it is noise and will aggravate the overlap between classes.
To avoid this situation, we introduce a noise prevention strategy (NPS for short) before generating new minority instances. For \(x_i\) and \(x_k\), NPS determines whether any majority instances fall into the initial hypersphere C. The determination is simple and can be denoted as,

$$\left\| y_j - x_m \right\| < r. \quad (5)$$
If some majority instance \(y_j\) conforms to Eq. (5), it means that \(y_j\) is in C. In Fig. 3b, NPS will determine eight such majority instances and put them into a candidate set CS. Then, NPS finds the instance \(y_p \in CS\) nearest to \(x_i\), i.e.,

$$y_p = \mathop {\arg \min }\limits_{y_j \in CS} \left\| y_j - x_i \right\|. \quad (6)$$
Next, NPS takes the line connecting \(x_i\) and \(y_p\) as the diameter to reconstruct a hypersphere \(C_1\). Note that the coverage of \(C_1\) may exceed the coverage of C. We still need to determine whether another majority instance falls into \(C_1\), as shown in Fig. 3b. This is an iterative process that needs to be executed repeatedly until none of the majority instances fall into a newly constructed hypersphere \(C_n\). For example, in Fig. 3c, \(C_2\) is finally obtained. Subsequently, a new instance is randomly generated in \(C_2\).
HS-GEN algorithm
We present the hypersphere generation with NPS in Algorithm 1. For convenience, we call it HS-Gen. HS-Gen takes a base instance \(x_i\) and one of its neighbors \(x_k\) as input. An initial hypersphere C is constructed from \(x_i\) and \(x_k\). Then, HS-Gen determines whether a majority instance \(y_j\) falls into C. If so, HS-Gen puts \(y_j\) into the set CS. This determination process is accomplished in Step 3–Step 8. Next, HS-Gen checks whether CS is an empty set. If so, HS-Gen goes to Step 15 and generates new instances within C. If not, it indicates a potential risk that the new instance may be generated among the majority instances, thus becoming noise. At this moment, NPS is executed to prevent the synthetic instance from invading the majority region (Step 9–Step 14). If a majority instance \(y_p\) nearest to \(x_i\) is found, \(y_p\) is taken as \(x_k\) in the next iteration, and the process goes back to Step 2. The iteration continues until CS is empty, and a new synthetic instance \(x_{\textrm{syn}}\) is generated within the finalized hypersphere (Step 15–Step 16).
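The procedure can be sketched end-to-end as follows (our own illustration of the described loop: for simplicity the candidate check scans all majority instances rather than a fixed number of nearest ones, and all names and the toy data are ours):

```python
import numpy as np

def hs_gen(x_i, x_k, X_maj, rng=None):
    """Shrink the hypersphere until no majority instance falls inside it,
    then sample one synthetic minority instance within it."""
    if rng is None:
        rng = np.random.default_rng()
    while True:
        x_m = (x_i + x_k) / 2.0
        r = np.linalg.norm(x_i - x_k) / 2.0
        # candidate set CS: majority instances strictly inside the sphere
        cs = np.where(np.linalg.norm(X_maj - x_m, axis=1) < r)[0]
        if len(cs) == 0:
            break                                # hypersphere is clean
        # y_p: the candidate nearest to the base instance x_i
        y_p = X_maj[cs[np.argmin(np.linalg.norm(X_maj[cs] - x_i, axis=1))]]
        x_k = y_p                                # rebuild the sphere on x_i--y_p
    while True:                                  # sample v with ||v|| < 1
        v = rng.uniform(-1.0, 1.0, size=len(x_i))
        if np.dot(v, v) < 1.0:
            return x_m + r * v

x_i = np.array([0.0, 0.0])
x_k = np.array([4.0, 0.0])
X_maj = np.array([[3.0, 0.0], [2.0, 1.0]])       # two majority intruders
x_syn = hs_gen(x_i, x_k, X_maj, np.random.default_rng(7))
```

Each shrinking step strictly reduces the radius (the chosen \(y_p\) lies inside the current sphere but on the boundary of the rebuilt one), so the loop terminates.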
![Algorithm 1: HS-Gen](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs40747-022-00938-9/MediaObjects/40747_2022_938_Figa_HTML.png)
To increase diversity, we generate new synthetic instances by the proposed HS-Gen instead of linear interpolation. However, the region of hypersphere generation covers the line connecting \(x_i\) and \(x_k\), so a new instance can still be generated on the line. In this case, the hypersphere generation degenerates to linear interpolation. This degeneration probability can be approximated using P as follows,

$$P = {\left( {\frac{1}{{\sqrt{2} \times {{10}^q}}}} \right)^{d - 1}}, \quad (7)$$

where q is the number of decimal places for each dimension, and d is the dimensionality of the dataset. P is the degeneration probability of generating new instances within the hypercube that is inscribed in the hypersphere. Each dimension \(v^i\) of \({\varvec{v}}\) is a random value within \((- \frac{{\sqrt{2} }}{2},\frac{{\sqrt{2} }}{2})\), assuming that \(v^i\) obeys a uniform distribution. It can be seen from Eq. (7) that P decreases exponentially as d increases. Accordingly, in a high-dimensional space, the probability that a new instance is located on the line connecting \(x_i\) and \(x_k\) is very small.
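As a quick sanity check of this degeneracy argument, a Monte Carlo sketch of the continuous case (our own illustration): a point drawn uniformly in the hypersphere essentially never lies exactly on the diameter in double precision.

```python
import numpy as np

rng = np.random.default_rng(42)
d, n, on_line = 5, 10_000, 0
for _ in range(n):
    while True:                                  # v uniform in the unit d-ball
        v = rng.uniform(-1.0, 1.0, size=d)
        if np.dot(v, v) < 1.0:
            break
    # in v-coordinates the diameter is a fixed axis direction; v lies on it
    # only if every remaining component is exactly zero
    if np.all(v[1:] == 0.0):
        on_line += 1
# in double precision the degenerate (collinear) case is essentially never observed
```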
Experiments
In this section, we first embedded HS-Gen into three baseline oversampling methods and two state-of-the-art oversampling methods. The baseline methods are SMOTE [7], BDSMOTE (Borderline-SMOTE) [18], and ADASYN [19]. The state-of-the-art methods include KSMOTE (k-means SMOTE) [10] and RSMOTE [8]. The embedded versions are called HS-SMOTE, HS-BDSMOTE, HS-ADASYN, HS-KSMOTE, and HS-RSMOTE, respectively. Then, we compared the embedded versions with the original ones both on a 2D toy dataset and on 15 benchmark datasets. Furthermore, a test on a real scenario is also conducted. The neighbor parameter \(k_1\) used for synthesis is set to 5, while the determination parameter \(k_2\) is set to 7.
Experiment on a 2D toy dataset
We create a 2D toy dataset to intuitively compare the embedded versions with the original ones, as shown in Fig. 4. The original dataset includes 196 majority instances and 14 minority instances. Figure 4a–e shows the results oversampled by SMOTE, BDSMOTE, ADASYN, KSMOTE, and RSMOTE. Figure 4f–j shows the results by HS-SMOTE, HS-BDSMOTE, HS-ADASYN, HS-KSMOTE, and HS-RSMOTE.
The results show that the instances generated by HS-Gen are significantly different from those by linear interpolation. Intuitively, the new instances generated by HS-Gen are more diverse and random. Remarkably, unlike linear interpolation, HS-Gen can avoid noise and class overlap to a certain extent.
Experiment on benchmark datasets
Assessment metric
Accuracy is the most commonly used evaluation metric for traditional binary classification problems. However, it can be deceiving in a specific situation and is highly sensitive to changes in data distribution [20]. For example, if a dataset contains 1% of minority instances and 99% of majority instances, we can obtain an accuracy of 99% by classifying all instances as majority instances. However, this treatment is clearly not advisable when the minority class represents the crucial pattern.
F1 [26] and G-mean [34] provide a good solution to assess the class-imbalance problem, which are based on the confusion matrix (Table 1). TP is the number of correctly classified minority instances, FP is the number of misclassified majority instances, FN is the number of misclassified minority instances, and TN is the number of correctly classified majority instances.
Precision and Recall reflect the recognition rate of the positive instances from different perspectives, while Specificity reflects the recognition rate of the negative instances. F1 is the harmonic mean for both the Precision and Recall of the positive class. G-mean is the geometric mean of the Recall and Specificity.
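These metrics can be computed directly from the confusion-matrix entries of Table 1 (the function name and the toy counts below are ours):

```python
import math

def f1_and_gmean(tp, fp, fn, tn):
    """F1 and G-mean from the confusion matrix, with the minority class
    taken as the positive class (as in Table 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # a.k.a. sensitivity / true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean

f1, g = f1_and_gmean(tp=40, fp=10, fn=20, tn=130)
# precision = 40/50 = 0.8, recall = 40/60 = 2/3, specificity = 130/140 = 13/14
```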
Datasets
To evaluate the performance of the proposed HS-Gen, we conducted a comparative experiment on 15 benchmark datasets. These datasets are from KEEL [1], and their details are listed in Table 2. \(IR = \frac{{{N_1}}}{{{N_2}}}\) denotes the imbalanced ratio, where \(N_1\) is the number of majority instances and \(N_2\) is the number of minority instances. d is the dimensionality (number of features) of a dataset. n is the number of total instances.
Results and analysis
Each dataset is randomly divided into 5 disjoint folds with approximately the same number of instances by adopting 5-fold cross-validation [45]. Then, each fold is used to test the model induced from the other 4 folds. The average results with standard deviations are reported in Tables 3, 4, 5, 6 and 7.
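The protocol above can be sketched as follows (a minimal NumPy illustration; note that oversampling must be applied only to the four training folds, never to the held-out fold, so that no synthetic copies leak into the test set):

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Split n instances into 5 disjoint folds of approximately equal size."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, 5)

folds = five_fold_indices(23)
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    # oversample X[train_idx] here, fit the classifier, evaluate on X[test_idx]
```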
Table 3 shows the comparison of HS-SMOTE with the original SMOTE. When C4.5 is selected as the classifier, HS-SMOTE obtains higher F1 than SMOTE on nine datasets (G1, G06, V0, N2, S0, P0, Y4, G4, S24). In particular, HS-SMOTE outperforms SMOTE by 8.65% on the “N2” dataset. In terms of G-mean, HS-SMOTE performs a bit worse than SMOTE. When Adaboost is used as the classifier, HS-SMOTE obtains higher F1 than SMOTE on eleven datasets (Wi, V2, G06, N2, S0, G6, E3, P0, Y4, G4, S24), with a similar G-mean.
From Table 4, it can be seen that HS-Gen imposes a positive effect on BDSMOTE. When C4.5 is selected as the classifier, HS-BDSMOTE obtains higher F1 on ten datasets (G1, Wi, V2, G06, V0, S0, E3, P0, G4, E4). In terms of G-mean, HS-BDSMOTE also performs better than BDSMOTE on ten datasets (G1, Wi, V2, G06, V0, S0, E3, Y4, G4, E4). When Adaboost is selected as the classifier, HS-BDSMOTE obtains higher F1 than BDSMOTE on ten datasets (G1, E1, Wi, G06, V0, G6, E3, P0, Y4, E4). In terms of G-mean, HS-BDSMOTE obtains better results on seven datasets (G1, V2, G06, N2, P0, G4, E4).
The comparison of HS-ADASYN with ADASYN is provided in Table 5. When C4.5 is selected as the classifier, HS-ADASYN achieves higher F1 on thirteen datasets (except G1 and Wi). HS-ADASYN has at least 10% higher F1 than the original ADASYN on four datasets (G6, P0, S24, E4). In terms of G-mean, HS-ADASYN performs better than ADASYN on eleven datasets (E1, V2, G06, N2, S0, G6, P0, Y4, G4, E4, S24). When Adaboost is selected as the classifier, HS-ADASYN obtains higher F1 than ADASYN on twelve datasets (E1, Wi, V2, G06, V0, S0, G6, E3, P0, Y4, E4, S24). More remarkably, HS-ADASYN outperforms ADASYN by 12.02% on the “G6” dataset. In terms of G-mean, HS-ADASYN gets better results than ADASYN on eleven datasets (E1, Wi, V2, G06, S0, G6, E3, Y4, G4, E4, S24).
Table 6 shows the comparison of HS-KSMOTE with the original KSMOTE. The results indicate a significant improvement in terms of F1 and G-mean. When C4.5 is selected as the classifier, HS-KSMOTE achieves higher F1 than the original KSMOTE on 14 datasets (except N2). Notably, in terms of G-mean, HS-KSMOTE obtains the highest values on all datasets. When Adaboost is selected as the classifier, HS-KSMOTE has higher F1 than KSMOTE on nine datasets (Wi, G06, N2, S0, G6, E3, P0, Y4, G4). In terms of G-mean, it also achieves the highest values on eleven datasets (E1, Wi, V2, G06, N2, S0, P0, Y4, G4, E4, S24).
The comparative results of HS-RSMOTE and RSMOTE are provided in Table 7. It should be noted that the dataset “s24” cannot be used to evaluate RSMOTE because the number of minority instances is too small to be split into two clusters. When C4.5 is selected as the classifier, HS-RSMOTE obtains higher F1 than RSMOTE on twelve datasets (G1, Wi, V2, G06, V0, S0, G6, E3, P0, Y4, G4, E4). In terms of G-mean, HS-RSMOTE also performs better than RSMOTE on ten datasets (G1, Wi, V2, G06, V0, G6, E3, G4, E4). When Adaboost is selected as the classifier, HS-RSMOTE achieves the highest F1 on eight datasets (G1, E1, Wi, G06, N2, P0, G4, E4). In terms of G-mean, HS-RSMOTE outperforms RSMOTE on eleven datasets (G1, Wi, G06, V0, N2, G6, Y4, G4, E4).
As mentioned previously, HS-Gen can be decomposed into two modules, the hypersphere generation module (HG) and the noise prevention module (NPS). We perform ablation experiments to evaluate the effect of each module individually. SMOTE was used as the reference method. Figure 5 presents the experimental result. It shows that the combined effect of the two modules outperforms the effect of a single module on most of the datasets.
From Tables 4, 5, 6 and 7, we can see that the oversampling methods embedded with HS-Gen outperform the original sampling methods using linear interpolation on most datasets, although on the remaining few datasets the performance is not satisfactory. Linear interpolation restricts new instances to the line connecting two candidate instances, which results in a lack of diversity in new instances. Because HS-Gen employs the hypersphere generation mechanism, new instances have more significant diversity and randomness. This merit provides opportunities for the synthesis of high-quality instances. On the other side of the coin, however, this randomness may cause the newly synthesized instances to exceed the original distribution range of the minority class, which can negatively affect the subsequent classifier.
In addition, we also give the statistical results in Fig. 6. The median and mean of F1 obtained by the methods of embedding HS-Gen are impressively higher than the original methods. In terms of G-mean, the embedded methods outperform the original methods in nine of ten groups. In the remaining group, HS-SMOTE also achieves a similar score to SMOTE. Furthermore, we conduct a non-parametric hypothesis test for strengthening pairwise comparisons. Wilcoxon’s signed-ranks test is used to detect significant differences between the behavior of the embedded version and the original method [9, 43]. Let \(d_i\) be the difference between the measure scores of two methods on ith out of N datasets. The differences are ranked according to their absolute values. Then, we can compute \({T} = \min ({R^ + },{R^ - })\), where \({R^ + } = \sum \nolimits _{{d_i} > 0} {\textrm{rank}({d_i})} + \frac{1}{2}\sum \nolimits _{{d_i} = 0} {\textrm{rank}({d_i})}\) and \({R^ - } = \sum \nolimits _{{d_i} < 0} {\textrm{rank}({d_i})} + \frac{1}{2}\sum \nolimits _{{d_i} = 0} {\textrm{rank}({d_i})} \).
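The statistic \(T\) described above can be computed as follows (our own implementation of the described procedure, with toy scores):

```python
import numpy as np

def wilcoxon_T(scores_a, scores_b):
    """Wilcoxon signed-ranks statistic T = min(R+, R-): rank |d_i| in
    ascending order (average ranks over ties), and split the ranks of
    zero differences evenly between R+ and R-, as described in the text."""
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    order = np.argsort(np.abs(d), kind="stable")
    ranks = np.empty(len(d))
    ranks[order] = np.arange(1, len(d) + 1)
    for val in np.unique(np.abs(d)):           # average ranks over ties in |d|
        mask = np.abs(d) == val
        ranks[mask] = ranks[mask].mean()
    r_plus = ranks[d > 0].sum() + ranks[d == 0].sum() / 2.0
    r_minus = ranks[d < 0].sum() + ranks[d == 0].sum() / 2.0
    return min(r_plus, r_minus)

# toy per-dataset scores for two methods (ours, for illustration only)
T = wilcoxon_T([0.9, 0.8, 0.7, 0.6], [0.85, 0.82, 0.7, 0.5])
```

The computed T is then compared against the exact critical value for the chosen confidence level and number of datasets, as done in the next paragraph.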
According to the table of exact critical values for the Wilcoxon’s test [53], for a confidence level of \(\alpha = 0.10\) and \(N = 15\) datasets (\(N = 14\) for RSMOTE), we can get that the difference between the methods is significant if \(T < 30\) (\(T < 25\) for RSMOTE). The detailed results are shown in Table 8.
In general, the experimental results of the embedded methods are better than those of the original methods, which indicates that HS-Gen is more effective than linear interpolation. As can be seen from Tables 4, 5, 6 and 7, the embedded versions have clear advantages over the original methods. This indicates that HS-Gen produces better-quality synthetic instances than linear interpolation for borderline or weighted oversampling. Overall, it is verified that HS-Gen can improve the recognition rate of the classifier on minority instances while taking the majority instances into account.
Experiment on a real scenario
In this subsection, we use the diagnosis of breast cytology to demonstrate the applicability of the proposed method to medical diagnosis. The dataset was originally from Dr. Wolberg’s clinical cases [44] and has now been collected in the UCI machine learning repository [14]. It contains 681 instances (18 instances with missing values removed), where each instance has 10 features. The details of these features are shown in Table 9.
Two class labels are “benign” and “malignant” with 442 and 239 instances respectively. We divide this dataset into 5 disjoint folds and report the statistical results in Fig. 7. It can be seen that when C4.5 is used as the classifier, these oversampling methods embedded with HS-Gen achieve the highest median and mean both on F1 and G-mean. When AdaBoost is used as the classifier, HS-SMOTE obtains a higher median and similar mean compared with SMOTE. We should also see that HS-BDSMOTE is not as good as the original BDSMOTE. Nevertheless, in the remaining comparisons, the embedded versions, including HS-ADASYN, HS-KSMOTE, and HS-RSMOTE, outperform their respective original versions. The applicability test on the real scenario can illustrate the ability of the proposed method to deal with practical problems.
Conclusion
This paper proposes a hypersphere-constrained generation mechanism (HS-Gen) to improve oversampling for the class imbalance problem. Unlike the existing oversampling methods that commonly work in the phase of data selection, HS-Gen focuses on the phase of data generation. It differs from linear interpolation in two aspects. First, it generates a new minority instance in a hypersphere determined by a base instance and one of its neighbors. Second, it introduces NPS, a noise prevention strategy, to prevent the synthetic instance from invading the majority region.
Experiments on benchmark datasets show that the methods embedded with HS-Gen achieve higher F1 and G-mean than the original methods on most datasets. This implies that HS-Gen can mitigate the impact of class-imbalance data on classifiers to a certain extent. It should be noted that HS-Gen is only a generation mechanism, and its applicability across imbalance ratios or dimensionalities mainly depends on the specific oversampling method. In particular, when the original oversampling method is unavailable on some datasets, the embedded version will not work either.
By dedicating itself to improving the way data are generated, our work provides a new research direction for the study of oversampling methods in imbalanced learning. Although the proposed HS-Gen is heuristic, experimental evaluations have proven its effectiveness. In the future, we plan to conduct in-depth research on the impact of data generation mechanisms on the quality of synthetic samples at a theoretical level.
References
Alcalá-Fdez J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Multiple Valued Log Soft Comput 2011:255–287
Bellinger C, Drummond C, Japkowicz N (2018) Manifold-based synthetic oversampling with manifold conformance estimation. Mach Learn 107(3):605–637
Barua S, Islam MM, Yao X, Murase K (2012) MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425
Bernardo A, Della Valle E (2021) VFC-SMOTE: very fast continuous synthetic minority oversampling for evolving data streams. Data Min Knowl Disc 35(6):2679–2713
Branco P, Torgo L, Ribeiro RP (2019) Pre-processing approaches for imbalanced distributions in regression. Neurocomputing 343:76–99
Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-level-smote: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 475–482
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Chen B, Xia S, Chen Z, Wang B, Wang G (2021) RSMOTE: a self-adaptive robust SMOTE for imbalanced problems with label noise. Inf Sci 553:397–428
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on \(k\)-means and SMOTE. Inf Sci 465:1–20
Douzas G, Bacao F (2018) Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst Appl 91:464–471
Douzas G, Bacao F (2019) Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Inf Sci 501:118–135
Fernández A, Garcia S, Herrera F, Chawla NV (2018) SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905
Frank A, Asuncion A (2010) UCI machine learning repository (Online). http://archive.ics.uci.edu/ml
Gao X, Ren B, Zhang H, Sun B, Li J, Xu J, Li K (2020) An ensemble imbalanced classification method based on model dynamic selection driven by data partition hybrid sampling. Expert Syst Appl 160:113660
García V, Sánchez JS, Marqués AI, Florencia R, Rivera G (2020) Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Syst Appl 158:113026
Guan H, Zhang Y, Xian M, Cheng HD, Tang X (2021) SMOTE-WENN: solving class imbalance and small sample problems by oversampling and distance scaling. Appl Intell 51(3):1394–1409
Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing. Springer, Berlin, pp 878–887
He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE, pp 1322–1328
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Kang Q, Chen X, Li S, Zhou M (2016) A noise-filtered under-sampling scheme for imbalanced classification. IEEE Trans Cybern 47(12):4263–4274
Liang XW, Jiang AP, Li T, Xue YY, Wang GT (2020) LR-SMOTE—an improved unbalanced data set oversampling based on \(k\)-means and SVM. Knowl Based Syst 196:105845
Li J, Zhu Q, Wu Q, Zhang Z, Gong Y, He Z, Zhu F (2021) SMOTE-NaN-DE: addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution. Knowl Based Syst 223:107056
Li J, Zhu Q, Wu Q, Fan Z (2021) A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors. Inf Sci 565:438–455
Lin WC, Tsai CF, Hu YH, Jhang JS (2017) Clustering-based undersampling in class-imbalanced data. Inf Sci 409:17–26
Lipton ZC, Elkan C, Naryanaswamy B (2014) Optimal thresholding of classifiers to maximize F1 measure. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, Berlin, pp 225–239
Liu XY, Wu J, Zhou ZH (2008) Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern Part B (Cybern) 39(2):539–550
Li Y, Wang Y, Li T, Li B, Lan X (2021) SP-SMOTE: a novel space partitioning based synthetic minority oversampling technique. Knowl-Based Syst 228:107269
Mullick SS, Datta S, Dhekane SG, Das S (2020) Appropriateness of performance indices for imbalanced data classification: an analysis. Pattern Recogn 102:107197
Pang Y, Peng L, Chen Z, Yang B, Zhang H (2019) Imbalanced learning based on adaptive weighting and Gaussian function synthesizing with an application on Android malware detection. Inf Sci 484:95–112
Pan T, Zhao J, Wu W, Yang J (2020) Learning imbalanced datasets based on SMOTE and Gaussian distribution. Inf Sci 512:1214–1233
Pérez-Ortiz M, Gutiérrez PA, Tino P, Hervás-Martínez C (2015) Oversampling the minority class in the feature space. IEEE Trans Neural Netw Learn Syst 27(9):1947–1961
Pradipta GA, Wardoyo R, Musdholifah A, Sanjaya INH (2021) Radius-SMOTE: a new oversampling technique of minority samples based on radius distance for learning from imbalanced data. IEEE Access 9:74763–74777
Puthiya Parambath S, Usunier N, Grandvalet Y (2014) Optimizing \(F\)-measures by cost-sensitive classification. Adv Neural Inf Process Syst 27:2123–2131
Raghuwanshi BS, Shukla S (2020) SMOTE based class-specific extreme learning machine for imbalanced learning. Knowl-Based Syst 187:104814
Soltanzadeh P, Hashemzadeh M (2021) RCSMOTE: range-controlled synthetic minority over-sampling technique for handling the class imbalance problem. Inf Sci 542:92–111
Sun B, Chen H, Wang J, Xie H (2018) Evolutionary under-sampling based bagging ensemble method for imbalanced data classification. Front Comput Sci 12(2):331–350
Sun J, Li H, Fujita H, Fu B, Ai W (2020) Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting. Inf Fus 54:128–144
Susan S, Kumar A (2019) SSOMaj-SMOTE-SSOMin: three-step intelligent pruning of majority and minority samples for learning from imbalanced datasets. Appl Soft Comput 78:141–149
Tarawneh AS, Hassanat AB, Almohammadi K, Chetverikov D, Bellinger C (2020) SMOTEFUNA: synthetic minority over-sampling technique based on furthest neighbour algorithm. IEEE Access 8:59069–59082
Tsai CF, Lin WC, Hu YH, Yao GT (2019) Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Inf Sci 477:47–54
Wang Z, Wang B, Cheng Y, Li D, Zhang J (2019) Cost-sensitive fuzzy multiple kernel learning for imbalanced problem. Neurocomputing 366:178–193
Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
Wolberg WH, Mangasarian OL (1990) Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci 87(23):9193–9196
Wong TT, Yeh PY (2019) Reliable accuracy estimates from \(k\)-fold cross validation. IEEE Trans Knowl Data Eng 32(8):1586–1594
Wu F, Jing XY, Shan S, Zuo W, Yang JY (2017) Multiset feature learning for highly imbalanced data classification. In: 31st AAAI conference on artificial intelligence, pp 1583–1589
Xie Y, Peng L, Chen Z, Yang B, Zhang H, Zhang H (2019) Generative learning for imbalanced data using the Gaussian mixed model. Appl Soft Comput 79:439–451
Xu Y, Zhang Y, Zhao J, Yang Z, Pan X (2019) KNN-based maximum margin and minimum volume hyper-sphere machine for imbalanced data classification. Int J Mach Learn Cybern 10(2):357–368
Yang L, Cheung YM, Yuan YT (2019) Bayes imbalance impact index: a measure of class imbalanced data set for classification problem. IEEE Trans Neural Netw Learn Syst 31(9):3525–3539
Yan YT, Wu ZB, Du XQ, Chen J, Zhao S, Zhang YP (2019) A three-way decision ensemble method for imbalanced data oversampling. Int J Approx Reason 107:1–16
Ye X, Li H, Imakura A, Sakurai T (2020) An oversampling framework for imbalanced classification based on Laplacian eigenmaps. Neurocomputing 399:107–116
Yuan X, Xie L, Abouelenien M (2018) A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data. Pattern Recogn 77:160–172
Zar JH (1999) Biostatistical analysis, 5th edn. Pearson Education Inc., Upper Saddle River
Zhu Y, Yan Y, Zhang Y, Zhang Y (2020) EHSO: evolutionary hybrid sampling in overlapping scenarios for imbalanced learning. Neurocomputing 417:333–346
Zhu Z, Wang Z, Li D, Zhu Y, Du W (2018) Geometric structural ensemble learning for imbalanced problems. IEEE Trans Cybern 50(4):1617–1629
Acknowledgements
Funding was provided by National Natural Science Foundation of China (Grant Nos. 61602056, 61976027), Liaoning Revitalization Talents Program (Grant No. XLYC2008002), Scientific Research Project of Liaoning Provincial Committee of Education (Grant No. LJKMZ20221498), PhD Startup Foundation of Liaoning Technical University (Grant No. 21-1043).
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
He, Z., Tao, J., Leng, Q. et al. HS-Gen: a hypersphere-constrained generation mechanism to improve synthetic minority oversampling for imbalanced classification. Complex Intell. Syst. 9, 3971–3988 (2023). https://doi.org/10.1007/s40747-022-00938-9