HS-Gen: a hypersphere-constrained generation mechanism to improve synthetic minority oversampling for imbalanced classification

Mitigating the impact of class-imbalanced data on classifiers is a challenging task in machine learning. SMOTE is a well-known method that tackles this task by modifying the class distribution through the generation of synthetic instances. However, most SMOTE-based methods focus on the phase of data selection, while few consider the phase of data generation. This paper proposes a hypersphere-constrained generation mechanism (HS-Gen) to improve synthetic minority oversampling. Unlike the linear interpolation commonly used in SMOTE-based methods, HS-Gen generates a minority instance in a hypersphere rather than on a straight line. This mechanism expands the distribution range of minority instances with significant randomness and diversity. Furthermore, HS-Gen is equipped with a noise prevention strategy that adaptively shrinks the hypersphere by determining whether new instances would fall into the majority class region. HS-Gen can be regarded as an oversampling optimization mechanism and can be flexibly embedded into SMOTE-based methods. We conduct comparative experiments by embedding HS-Gen into the original SMOTE, Borderline-SMOTE, ADASYN, k-means SMOTE, and RSMOTE. Experimental results show that the embedded versions can generate higher-quality synthetic instances than the original ones. Moreover, on these oversampled datasets, the conventional classifiers (C4.5 and AdaBoost) obtain significant performance improvements in terms of F1 measure and G-mean.


Introduction
Classification of imbalanced data is a challenging task in machine learning [49]. For a two-class dataset, imbalance means that the number of minority instances is much smaller than the number of majority instances [17]. This can degrade the performance of conventional classifiers, since they are biased towards the majority class and classify the minority class poorly [16,29]. Nevertheless, the minority class deserves more attention in practical applications, such as software defects [30], medical diagnosis [52] and financial distress prediction [38]. Therefore, how to improve the recognition of the minority class has become a key issue.
Author contact: Qiangkui Leng, qkleng@126.com
In the past few decades, many methods have been proposed to deal with the problem of imbalanced classification [13]. These methods can be divided into four categories [54]: data-level method [5], algorithm-level method [48], cost-sensitive learning [42] and ensemble learning [15]. The data-level method employs the resampling technique to transform the class-imbalanced dataset into a balanced one [46]. It changes the data distribution to reduce the impact of class imbalance [35]. Algorithm-level methods modify traditional classification algorithms to learn the minority class more accurately. Cost-sensitive learning uses a cost matrix to adjust the penalties for different errors, which can emphasize the importance of the minority class. Ensemble learning solves the imbalanced problem by highlighting incorrectly classified instances in each iteration and combining the classification results from several weak classifiers. Among them, the data resampling technology is independent of classification models and can be flexibly combined with any classifiers [23,41]. It includes oversampling of the minority class [3,32] and undersampling of the majority class [25,37]. Oversampling is the process of generating reasonable minority instances to balance the dataset [22,33,55], which avoids the problem of undersampling sometimes losing important information [16]. Random oversampling is one of the earliest oversampling methods. It generates new minority instances by simple duplication, which may lead to more specific decision-making regions [21,27].
To overcome the potential overfitting caused by simple replication, the Synthetic Minority Oversampling TEchnique (SMOTE) was proposed to help the classifier generalize better on the testing data [7,13]. SMOTE generates a new minority instance by linearly interpolating between a selected base instance of the minority class and one of its k-nearest minority neighbors. In recent decades, SMOTE has been extensively developed and has inspired many variants [13], such as Borderline-SMOTE [18], ADASYN [19], and Safe-Level-SMOTE [6].
In fact, the SMOTE-based methods can be decomposed into two phases, i.e., the phase of data selection and the phase of data generation. In the phase of data selection, the base instances are selected from the original minority instances with prior knowledge. In the phase of data generation, linear interpolation is employed to generate new instances for most SMOTE-based methods.
Linear interpolation, however, has two shortcomings.
- It limits synthetic instances to the line segment connecting two selected minority instances, which means that the generated instances are less informative. This is the main reason why it cannot overcome the class imbalance problem within the minority class [28].
- It may generate noise when the base instance and its neighbor belong to different clusters, leading to severe overlap between classes, as shown in Fig. 1a.
Motivated by the above shortcomings, we propose a hypersphere-constrained generation mechanism (HS-Gen for short) to improve synthetic minority oversampling for imbalanced classification. Unlike the linear interpolation of SMOTE, HS-Gen generates new minority instances in a hypersphere rather than on a line, which expands the distribution of minority instances with significant randomness and diversity. Furthermore, to prevent the generation of noise, the hypersphere initially determined by the base instance and one of its neighbors is shrunk as needed so that no majority instance falls inside it (as shown in Fig. 1b). The proposed method is expected to help the classifier generalize better.
The rest of the paper is organized as follows. In the next section, we give a brief review of the related work. In the subsequent section, we describe the process of HS-Gen and illustrate how it differs from linear interpolation. The corresponding algorithm is then described with detailed comments. Next, we evaluate the proposed HS-Gen through experiments on KEEL imbalanced datasets and compare it with several similar or related algorithms. In the final section, we summarize the main contribution and discuss future research.

Related works
The baseline oversampling methods
SMOTE generates new minority instances in the feature space to balance the data distribution [7]. The number of new instances depends on the oversampling ratio N%. In the phase of data selection, if N < 100, SMOTE randomly selects N% of the minority instances as the base instances. Otherwise, each minority instance is regarded as a base instance.
In the phase of data generation, SMOTE generates each synthetic instance by linear interpolation between a base instance and one of its k-nearest neighbors. Specifically, for a base instance $x_i$, if one of its k-nearest neighbors $x_j$ is randomly selected, a new instance $x_{lin}$ is generated as follows:
$$x_{lin} = x_i + \beta \, (x_j - x_i) \qquad (1)$$
where $\beta \in (0, 1)$ is a random number. Obviously, $x_{lin}$ lies on the line segment connecting $x_i$ and $x_j$.
In the phase of data selection, SMOTE chooses base instances blindly, which may even backfire in many cases. SMOTE only considers the closeness between minority instances and ignores the distribution of the majority instances, which further exacerbates the difficulty of learning noisy and borderline instances. Taking the data characteristics into account, some improvements have been achieved by creating minority instances only near the class borderline.
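The linear interpolation step described above can be illustrated with a minimal sketch. The function name `smote_interpolate` and the list-based point representation are assumptions made for illustration only; they are not taken from the original SMOTE implementation.

```python
import random


def smote_interpolate(x_i, x_j, rng=random):
    """Return a point on the segment between base instance x_i and neighbor x_j."""
    # rng.random() draws from [0, 1); the text uses the open interval (0, 1),
    # and the difference is negligible for illustration purposes.
    beta = rng.random()
    return [a + beta * (b - a) for a, b in zip(x_i, x_j)]


# Usage: the synthetic point always stays on the line through x_i and x_j.
rng = random.Random(0)
x_lin = smote_interpolate([0.0, 0.0], [2.0, 4.0], rng)
```

Because every synthetic point is a convex combination of two existing points, no amount of resampling can place an instance off the connecting segment, which is the limitation the hypersphere mechanism targets.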
Borderline-SMOTE [18] builds on the observation that borderline instances are more likely to be misclassified than instances far from the borderline. It identifies the borderline minority instances according to a density evaluation, which divides the minority instances into three categories, namely "safe" instances, "danger" instances, and "noise". A minority instance is identified as "noise" if its k-nearest neighbors are all majority instances. For a "safe" instance, fewer than half of its k-nearest neighbors are majority instances. A "danger" instance is one for which at least half, but not all, of its k-nearest neighbors are majority instances. In Borderline-SMOTE, only the "danger" instances are considered to lie on the borderline and are oversampled.
The ADAptive SYNthetic oversampling (ADASYN) [19] employs another density-based data selection strategy. It weights a base instance according to the proportion of majority instances among its k-nearest neighbors. As a result, more synthetic instances are generated for minority instances that are harder to learn than for those that are easier to learn.
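The weighting idea behind ADASYN can be sketched as follows. This is a simplified illustration rather than the reference implementation: the function name `adasyn_allocation`, the rounding of counts, and the handling of the case where no minority point has majority neighbors are all assumptions.

```python
import math


def adasyn_allocation(minority, majority, k, n_new):
    """Split n_new synthetic instances across minority points by neighborhood difficulty."""
    points = list(minority) + list(majority)
    is_majority = [0] * len(minority) + [1] * len(majority)
    ratios = []
    for i, x in enumerate(minority):
        # k nearest neighbors of x among all other points, as (distance, flag) pairs
        neighbors = sorted(
            (math.dist(x, p), is_majority[j])
            for j, p in enumerate(points)
            if j != i
        )[:k]
        # proportion of majority instances among the k nearest neighbors
        ratios.append(sum(flag for _, flag in neighbors) / k)
    total = sum(ratios)
    if total == 0:
        # assumption of this sketch: nothing is allocated when no point is "hard"
        return [0] * len(minority)
    return [round(n_new * r / total) for r in ratios]


# Usage: the minority point surrounded by majority points receives more instances.
counts = adasyn_allocation(
    minority=[(0.0, 0.0), (5.0, 5.0)],
    majority=[(5.0, 6.0), (6.0, 5.0), (5.5, 5.5), (20.0, 20.0)],
    k=3,
    n_new=10,
)
```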

The recent oversampling methods
In 2018, Bellinger et al. [2] pointed out that the generative bias of SMOTE is not appropriate for the large class of learning problems that conform to the manifold property. Further, they proposed a general framework for manifold-based synthetic oversampling. Douzas et al. [10] employed k-means to assign more synthetic instances to sparse minority clusters, which alleviates within-class imbalance. The conditional version of generative adversarial networks (GAN) was used to generate minority instances for imbalanced datasets [11].
In 2019, Yan et al. [50] proposed a three-way decision model that considers the differences in the cost of selecting base instances. The model uses a constructive covering algorithm to divide the minority instances into several covers and chooses the base instances according to the pattern of cover distribution on minority instances. Susan et al. [39] developed a three-step intelligent pruning of majority and minority instances. This method first uses particle swarm optimization to find globally optimum solutions in the search space for the intelligent undersampling technique. Then it oversamples the minority instances by SMOTE-based methods, which is further followed by the intelligent undersampling of the minority instances. Xie et al. [47] proposed generative learning, which adopts a Gaussian mixture model to fit the distribution of the original dataset and generates new instances based on that distribution. Douzas et al. [12] proposed G-SMOTE, which generates new instances within a geometric region around the base instance.
In 2020, Pan et al. [31] proposed Adaptive-SMOTE and Gaussian oversampling to improve data distribution. Adaptive-SMOTE adaptively selects groups of "Inner" and "Danger" data from the minority instances to generate the new instances. Gaussian oversampling adopts dimension reduction to thin the tails of the Gaussian distribution. LR-SMOTE [22] generates new instances close to the center of the minority instances, which avoids generating outliers or changing the data distribution. Guan et al. [17] proposed SMOTE-WENN, which employs a weighted edited nearest neighbor rule to clean unsafe instances after oversampling. Tarawneh et al. [40] proposed SMOTEFUNA, which utilizes the furthest neighbor of a base instance to generate new instances. Ye et al. [51] utilized Laplacian eigenmaps to find an optimal dimensional space where the data can be well separated. In addition, the tendency of SMOTE-based methods to generate noise was mitigated to a certain extent.
In 2021, NaNSMOTE, proposed by Li et al. [24], replaces KNN with natural nearest neighbors, which have an adaptive k related to the data complexity. RSMOTE [8] re-weights the number of new instances generated by each base instance according to chaotic level and distinguishing characteristics of relative density. Bernardo et al. [4] proposed a very fast continuous SMOTE. SMOTE-NaN-DE [23] uses an error detection technique based on natural neighbors and differential evolution to optimize noisy and borderline instances after oversampling. RCSMOTE [36] employs a classification scheme to identify minority instances that are suitable for oversampling and then generates new instances within a calculated safe range.

The proposed method
A SMOTE-based method can be divided into two phases: the phase of data selection and the phase of data generation. Most existing methods work on the former and are dedicated to finding optimal base instances. For the latter, linear interpolation is commonly used, which leads to a lack of diversity among new instances. This paper aims to improve the phase of data generation by replacing linear interpolation with a hypersphere constraint.

Generation with hypersphere constraint
Let $R^d$ stand for the d-dimensional Euclidean space. Assume that a base instance $x_i$ and one of its neighbors $x_k$ have been selected during the phase of data selection. A new instance is then generated in a hypersphere $C$ whose diameter is the line segment connecting $x_i$ and $x_k$. Accordingly, the radius $r$ of $C$ can be calculated as
$$r = \frac{\lVert x_i - x_k \rVert_2}{2} \qquad (2)$$
The center $x_m$ of $C$ is located at the midpoint of the segment connecting $x_i$ and $x_k$,
$$x_m = \frac{x_i + x_k}{2} \qquad (3)$$
Thus, we can generate a new instance $x_{syn}$ within the hypersphere $C$, i.e.,
$$x_{syn} = x_m + r \, v \qquad (4)$$
where $v$ is a random vector with $\lVert v \rVert_2 < 1$, which ensures that $x_{syn}$ is generated in a hypersphere with center $x_m$ and radius $r$. Figure 2 shows the difference between hypersphere generation and linear interpolation. In linear interpolation, a new instance $x_{lin}$ (marked with a blue triangle) lies on the line connecting $x_i$ and $x_k$. In hypersphere generation, a new instance $x_{syn}$ (marked with a green diamond) can be generated at any position within $C$.
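The generation step described above can be sketched in a few lines: the center is the midpoint of the two selected instances, the radius is half their distance, and the new point is the center plus the radius times a random vector of norm below one. How that random vector is drawn is an assumption of this sketch; here it is sampled uniformly over the unit ball using a normalized Gaussian direction and a radial factor, which is one common technique and not necessarily the sampling used by the authors. The name `hypersphere_generate` is hypothetical.

```python
import math
import random


def hypersphere_generate(x_i, x_k, rng=random):
    """Draw one synthetic point inside the hypersphere whose diameter is x_i--x_k."""
    d = len(x_i)
    center = [(a + b) / 2 for a, b in zip(x_i, x_k)]
    radius = math.dist(x_i, x_k) / 2
    # Random direction: a Gaussian vector scaled to unit length.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    # Random length: U**(1/d) gives uniform density over the ball's volume.
    scale = rng.random() ** (1.0 / d) / norm
    v = [c * scale for c in g]  # norm of v is below 1
    return [m + radius * c for m, c in zip(center, v)]


# Usage: generated points spread over the whole ball, not only the diameter.
rng = random.Random(1)
x_syn = hypersphere_generate([0.0, 0.0, 0.0], [2.0, 0.0, 0.0], rng)
```

Any other rule that keeps the norm of the random vector below one would satisfy the same constraint; the choice only changes how the points are distributed inside the hypersphere.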

Noise prevention strategy
One problem deserves attention. Like linear interpolation, the initial hypersphere generation proposed in the "Generation with hypersphere constraint" section may still generate noise. As shown in Figs. 1a and 3a, the line connecting $x_i$ and $x_k$ crosses the majority class region when $x_i$ and $x_k$ belong to different clusters. A synthetic instance may then be generated within the majority class region. Such an instance is noise and aggravates the overlap between classes.
To avoid this situation, we introduce a noise prevention strategy (NPS for short) that is applied before generating new minority instances. For $x_i$ and $x_k$, NPS determines whether any majority instances fall into the initial hypersphere $C$. The check is simple and can be written as
$$\lVert y_j - x_m \rVert_2 < r \qquad (5)$$
If some majority instance $y_j$ satisfies Eq. (5), then $y_j$ is in $C$. In Fig. 3b, NPS finds eight such majority instances and puts them into a candidate set $CS$. Then, NPS finds the instance $y_p \in CS$ nearest to $x_i$, i.e.,
$$y_p = \arg\min_{y_j \in CS} \lVert y_j - x_i \rVert_2 \qquad (6)$$
Next, NPS takes the segment connecting $x_i$ and $y_p$ as the diameter to construct a new hypersphere $C_1$. Note that $C_1$ may cover regions outside $C$, so we still need to determine whether another majority instance falls into $C_1$, as shown in Fig. 3b. This process is repeated until no majority instance falls into the newly constructed hypersphere $C_n$. For example, in Fig. 3c, $C_2$ is finally obtained. Subsequently, a new instance is randomly generated in $C_2$.
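The iterative shrinking described above can be sketched as a loop that rebuilds the hypersphere until no majority instance lies inside it. The function name `shrink_hypersphere`, the strict inequality in the membership check, and the extra distance guard are assumptions of this sketch; the guard is mathematically redundant and is only there so that floating-point rounding cannot cause an endless loop.

```python
import math


def shrink_hypersphere(x_i, x_k, majority):
    """Return (center, radius) of a hypersphere anchored at x_i with no majority point inside."""
    other = x_k
    while True:
        center = [(a + b) / 2 for a, b in zip(x_i, other)]
        diameter = math.dist(x_i, other)
        radius = diameter / 2
        # Candidate set: majority points strictly inside the current hypersphere.
        # Any interior point is closer to x_i than the current diameter, so the
        # second condition only filters out rounding artifacts on the boundary.
        candidates = [
            y for y in majority
            if math.dist(y, center) < radius and math.dist(y, x_i) < diameter
        ]
        if not candidates:
            return center, radius
        # The majority point nearest to x_i replaces the neighbor in the next round.
        other = min(candidates, key=lambda y: math.dist(y, x_i))


# Usage: two majority points sit inside the initial hypersphere and force a shrink.
center, radius = shrink_hypersphere(
    [0.0, 0.0], [4.0, 0.0], [[2.0, 0.5], [1.0, 0.2], [10.0, 10.0]]
)
```

Since the distance from the base instance to the chosen endpoint strictly decreases in every round and the majority set is finite, the loop terminates. A synthetic point can then be drawn inside the returned hypersphere.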

HS-Gen algorithm
Fig. 3 Illustration of NPS
We present the hypersphere generation with NPS in Algorithm 1. For convenience, we call it HS-Gen. HS-Gen takes a base instance $x_i$ and one of its neighbors $x_k$ as input. An initial hypersphere $C$ is constructed from $x_i$ and $x_k$. Then, HS-Gen determines whether a majority instance $y_j$ falls into $C$. If so, HS-Gen puts $y_j$ into the set $CS$. This check is carried out in Step 3-Step 8. Next, HS-Gen checks whether $CS$ is empty. If so, HS-Gen goes to Step 15 and generates a new instance within $C$. If not, there is a risk that the new instance may be generated among the majority instances and thus become noise. In this case, NPS is executed to prevent the synthetic instance from invading the majority region (Step 9-Step 14). The majority instance $y_p$ nearest to $x_i$ is found and takes the role of $x_k$ in the next iteration, and the process returns to Step 2. The iteration continues until $CS$ is empty, and a new synthetic instance $x_{syn}$ is then generated within the final hypersphere (Step 15-Step 16).
To increase the diversity, we generate new synthetic instances by the proposed HS-Gen instead of linear interpolation. However, the region of hypersphere generation covers the line segment connecting $x_i$ and $x_k$, so a new instance can still be generated on that segment. In this case, the hypersphere generation degenerates to linear interpolation. The probability of this degeneration can be approximated by P as follows,

Algorithm 1 HS-Gen algorithm
where q is the number of decimal places for each dimension, and d is the dimensionality of the dataset. P is the degenerate probability of generating new instances within the hypercube that is inscribed in the hypersphere. Each dimension v i of v is a random value within (− 2 ), assuming that v i obeys a uniform distribution. It can be seen from Eq. (7) that P decreases exponentially as d increases. Accordingly, in a high-dimensional space, the probability that a new instance locates on the line connecting x i and x k is very small.

Experiments
In this section, we first embed HS-Gen into three baseline oversampling methods and two state-of-the-art oversampling methods. The baseline methods are SMOTE [7], BDSMOTE (Borderline-SMOTE) [18], and ADASYN [19]. The state-of-the-art methods are KSMOTE (k-means SMOTE) [10] and RSMOTE [8]. The embedded versions are called HS-SMOTE, HS-BDSMOTE, HS-ADASYN, HS-KSMOTE, and HS-RSMOTE, respectively. Then, we compare the embedded versions with the original ones on both a 2D toy dataset and 15 benchmark datasets. Furthermore, a test on a real scenario is also carried out. The neighbor parameter $k_1$ used for synthesis is set to 5, while $k_2$ used for determination is set to 7.

Experiment on a 2D toy dataset
We create a 2D toy dataset to intuitively compare the embedded versions with the original ones, as shown in Fig. 4. The original dataset includes 196 majority instances and 14 minority instances. The results show that the instances generated by HS-Gen are significantly different from those by linear interpolation. Intuitively, the new instances generated by HS-Gen are more diverse and random. Remarkably, unlike linear interpolation, HS-Gen can avoid noise and class overlap to a certain extent.

Assessment metric
Accuracy is the most commonly used evaluation metric for traditional binary classification problems. However, it can be deceiving in certain situations and is highly sensitive to changes in data distribution [20]. For example, if a dataset contains 1% minority instances and 99% majority instances, we can obtain an accuracy of 99% by classifying all instances as majority instances. However, this treatment is clearly not advisable when the minority class represents the crucial pattern. The F1 measure [26] and G-mean [34], which are based on the confusion matrix (Table 1), are better suited to assessing class-imbalance problems. TP is the number of correctly classified minority instances, FP is the number of misclassified majority instances, FN is the number of misclassified minority instances, and TN is the number of correctly classified majority instances.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (8)$$
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (9)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (10)$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (11)$$
$$\text{G-mean} = \sqrt{\text{Recall} \times \text{Specificity}} \qquad (12)$$
Precision and Recall reflect the recognition rate of the positive instances from different perspectives, while Specificity reflects the recognition rate of the negative instances. F1 is the harmonic mean of the Precision and Recall of the positive class. G-mean is the geometric mean of the Recall and Specificity.
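As a small worked example of the two metrics, the following sketch computes F1 (harmonic mean of Precision and Recall) and G-mean (geometric mean of Recall and Specificity) from confusion-matrix counts. The function name `f1_and_gmean` and the convention of returning 0 when a denominator is zero are assumptions of this sketch.

```python
import math


def f1_and_gmean(tp, fp, fn, tn):
    """Compute F1 and G-mean with the minority class treated as positive."""
    # Zero denominators are mapped to 0.0 (a convention chosen for this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# The 1% / 99% example from the text: predicting "majority" for everything
# reaches 99% accuracy, yet both metrics drop to zero.
always_majority = f1_and_gmean(tp=0, fp=0, fn=1, tn=99)

# A classifier that also recognizes the minority class scores well on both.
balanced = f1_and_gmean(tp=8, fp=2, fn=2, tn=88)
```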

Datasets
To evaluate the performance of the proposed HS-Gen, we conducted a comparative experiment on 15 benchmark datasets. These datasets are from KEEL [1], and their details are listed in Table 2. $IR = N_1 / N_2$ denotes the imbalance ratio, where $N_1$ is the number of majority instances and $N_2$ is the number of minority instances. d is the dimensionality (number of features) of a dataset, and n is the total number of instances.

Results and analysis
Each dataset is randomly divided into 5 disjoint folds with approximately the same number of instances, following 5-fold cross-validation [45]. Each fold is then used to test the model induced from the other 4 folds. The average results with standard deviations are reported in Tables 3, 4, 5, 6 and 7. Table 3 shows the comparison of HS-SMOTE with the original SMOTE. When C4.5 is selected as the classifier, HS-SMOTE obtains a higher F1 than SMOTE on nine datasets (G1, G06, V0, N2, S0, P0, Y4, G4, S24). In particular, HS-SMOTE outperforms SMOTE by 8.65% on the "N2" dataset. In terms of G-mean, HS-SMOTE performs slightly worse than SMOTE. When Adaboost is used as the classifier, HS-SMOTE obtains a higher F1 than SMOTE on eleven datasets (Wi, V2, G06, N2, S0, G6, E3, P0, Y4, G4, S24), with a similar G-mean.
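The fold construction described above can be sketched as follows. Stratifying the folds by class, so that every fold keeps some minority instances, is an assumption of this sketch and is not stated in the text; the helper name `five_fold_indices` is likewise hypothetical.

```python
import random


def five_fold_indices(labels, n_folds=5, seed=0):
    """Split instance indices into disjoint folds, dealing each class out in turn."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Round-robin assignment keeps the class ratio similar in every fold.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


# Usage: each fold serves once as the test set; the other four form the training set.
labels = [0] * 90 + [1] * 10
folds = five_fold_indices(labels)
for test_fold in folds:
    held_out = set(test_fold)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # An oversampler would be applied to train_idx only (a common practice,
    # assumed here), so that synthetic points never leak into the test fold.
```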
As mentioned previously, HS-Gen can be decomposed into two modules, the hypersphere generation module (HG) and the noise prevention module (NPS). We perform ablation experiments to evaluate the effect of each module individually. SMOTE was used as the reference method. Figure 5 presents the experimental result. It shows that the combined effect of the two modules outperforms the effect of a single module on most of the datasets.
From Tables 4, 5, 6 and 7, we can see that the oversampling methods embedded with HS-Gen outperform the original methods using linear interpolation on most datasets, although the performance on the remaining few datasets is not satisfactory. Linear interpolation restricts new instances to the line segment connecting two candidate instances, which results in a lack of diversity among new instances. Because HS-Gen employs the hypersphere generation mechanism, new instances have greater diversity and randomness, which provides opportunities for synthesizing high-quality instances. On the other hand, this randomness can cause newly synthesized instances to fall outside the original distribution range of the minority class, which negatively affects the subsequent classifier.
In addition, we give the statistical results in Fig. 6. The median and mean of F1 obtained by the methods embedding HS-Gen are higher than those of the original methods. In terms of G-mean, the embedded methods outperform the original methods in nine of ten groups. In the remaining group, HS-SMOTE achieves a score similar to that of SMOTE. Furthermore, we conduct a non-parametric hypothesis test to strengthen the pairwise comparisons. Wilcoxon's signed-ranks test is used to detect significant differences between the behavior of the embedded version and the original method [9,43]. Let $d_i$ be the difference between the measure scores of the two methods on the i-th of N datasets. The differences are ranked according to their absolute values. Then, we can compute
$$R^+ = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \qquad R^- = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i)$$
$$T = \min(R^+, R^-)$$
According to the table of exact critical values for Wilcoxon's test [53], for a significance level of α = 0.10 and N = 15 datasets (N = 14 for RSMOTE), the difference between the methods is significant if T < 30. The test results are given in Table 8.
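The statistic T used in this test can be computed with a short sketch. It follows a convention that is common in the signed-ranks literature and is assumed here: differences are ranked by absolute value with average ranks for ties, ranks of zero differences are split evenly between the positive and negative sums, and T is the smaller of the two sums. The function name `wilcoxon_t` is hypothetical.

```python
def wilcoxon_t(scores_a, scores_b):
    """Return T = min(R+, R-) for paired scores of two methods."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        # Find the run of tied absolute differences starting at pos.
        end = pos
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        average_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1
    zero_share = 0.5 * sum(r for r, d in zip(ranks, diffs) if d == 0)
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0) + zero_share
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0) + zero_share
    return min(r_plus, r_minus)


# Usage with made-up integer scores (not results from the paper):
# differences are 1, -2, 3, 0, so R+ = 6.5, R- = 3.5 and T = 3.5.
t_value = wilcoxon_t([5, 3, 8, 6], [4, 5, 5, 6])
```

The resulting T is compared against the tabulated critical value for the given N and significance level.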
In general, the experimental results of the embedded methods are better than those of the original methods, which indicates that HS-Gen is more effective than linear interpolation. As can be seen from Tables 4, 5, 6 and 7, the embedded versions have clear advantages over the original methods. This indicates that HS-Gen produces better-quality synthetic instances than linear interpolation for borderline or weighted oversampling. Overall, the results verify that HS-Gen can improve the classifier's recognition rate for minority instances while still taking the majority instances into account.

Experiment on a real scenario
In this subsection, we use the diagnosis of breast cytology to demonstrate the applicability of the proposed method to medical diagnosis. The dataset was originally from Dr. Wolberg's clinical cases [44] and has now been collected in the UCI machine learning repository [14]. It contains 681 instances (18 instances with missing values removed), where each instance has 10 features. The details of these features are shown in Table 9. The two class labels are "benign" and "malignant", with 442 and 239 instances respectively. We divide this dataset into 5 disjoint folds and report the statistical results in Fig. 7. It can be seen that when C4.5 is used as the classifier, the oversampling methods embedded with HS-Gen achieve the highest median and mean on both F1 and G-mean. When AdaBoost is used as the classifier, HS-SMOTE obtains a higher median and a similar mean compared with SMOTE. We should also note that HS-BDSMOTE is not as good as the original BDSMOTE. Nevertheless, in the remaining comparisons, the embedded versions, including HS-ADASYN, HS-KSMOTE, and HS-RSMOTE, outperform their respective original versions. The applicability test on the real scenario illustrates the ability of the proposed method to deal with practical problems.
Fig. 7 Statistical results of diagnosis of breast cytology

Conclusion
This paper proposes a hypersphere-constrained generation mechanism (HS-Gen) to improve oversampling for the class imbalance problem. Unlike the existing oversampling methods that commonly work in the phase of data selection, HS-Gen focuses on the phase of data generation. It differs from linear interpolation in two aspects. First, it generates a new minority instance in a hypersphere determined by a base instance and its one neighbor. Second, it introduces NPS, a noise prevention strategy, to prevent the synthetic instance from invading the majority region.
Experiments on benchmark datasets show that the embedded methods with HS-Gen achieve higher F1 and G-mean than the original methods on most datasets. It implies that HS-Gen can mitigate the impact of class-imbalance data on classifiers to a certain extent. It should be noted that HS-Gen is only a generation mechanism, and its applicability to imbalanced ratio or dimensionality mainly depends on the specific oversampling method. In particular, when the original oversampling method is unavailable on some datasets, the embedded version will not work either.
By focusing on improving the way data are generated, our work provides a new research idea for the study of oversampling methods in imbalanced learning. Although the proposed HS-Gen is heuristic, the experimental evaluations support its effectiveness. In the future, we plan to conduct in-depth theoretical research on the impact of data generation mechanisms on the quality of synthetic samples.