Introduction

Imbalanced data describes the situation in which the number of “Negative” (majority class) instances is disproportionately larger than the number of instances marked as “Positive” (minority class). Without any treatment, this can negatively impact the performance of learning classifiers trained on such data, as these classifiers will most likely interpret the minority instances as outliers or anomalies [1]. Mainstream classification algorithms are typically designed to maximise predictive accuracy or minimise classification error, under the assumption that the distribution of instances between classes is relatively balanced. This leads to a strong tendency for these classifiers to predict the majority class, resulting in high False Negative Rates (FNR) [2, 3].

The treatment and handling of imbalanced data in the current literature can be grouped into three main approaches: cost-sensitive learning, ensemble-based methods, and re-sampling techniques. The target outcome of re-sampling techniques is a more balanced dataset for training, obtained by either random or synthetic means. One variation of re-sampling is over-sampling, the most basic form of which is Random Over-sampling. Over many years of research, a wide variety of synthetic over-sampling approaches has also emerged. Synthetic instances, as the name suggests, are not exact replications of the original minority instances. They broaden the decision region relative to random over-sampling, which reduces the likelihood of model overfitting, improves the False Negative Rate and enhances the performance of learning classifiers [4]. However, these algorithms have recently been shown to have reduced effectiveness, as they typically generate minority instances along a linear path at the “feature level” [5], preventing learning classifiers from obtaining a holistic view of the entire decision region of the minority class [6].

In our previous work, we illustrated how performance improvements can be obtained by introducing a strategy which optimises for diversity while protecting the integrity of the minority data distribution. We denote this algorithm Density-based Clustering Diversity Over-sampling (DBCDO, previously named CDO) [7]. The algorithm generates robust synthetic minority instances in two steps. A density-based clustering method is first applied to identify the density distribution of the minority instances. Synthetic data generation is then performed for each cluster through the NOAH algorithm to encourage diversity.

In this paper, we aim to extend the cluster-based diversity algorithm by incorporating a neighbourhood-based clustering algorithm. The proposed method is named Neighbourhood-based Clustering Diversity Over-sampling (NBCDO). This paper also enhances the parameter selection for the original DADO and DIWO algorithms to achieve a more optimal result, and conducts a comparison with alternative over-sampling approaches. With the two proposed cluster-based diversity algorithms, we demonstrate significant improvement in handling extremely imbalanced scenarios compared to existing methods in the literature.

Related Work

SMOTE is a well-known synthetic over-sampling technique in the literature [5]. Synthetic minority class instances are generated by randomly selecting one of the specified k-nearest neighbours of a minority sample and applying a multiplier drawn from a uniform distribution on (0, 1). This results in a “synthetic” instance which sits on the line segment between the two minority points. SMOTE improved the performance of classifiers trained on imbalanced datasets by expanding the decision regions housing nearby minority instances, whereas basic random over-sampling tends to narrow these decision regions [5].
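To make the interpolation step concrete, the following is a minimal sketch of SMOTE-style generation; the function name and parameter defaults are ours for illustration, not from [5].

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=np.random.default_rng(0)):
    """Minimal SMOTE-style interpolation: each synthetic point lies on the
    line segment between a minority instance and one of its k nearest
    minority neighbours."""
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]      # k nearest neighbour indices

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))               # random minority instance
        j = rng.choice(neighbours[i])              # random neighbour of it
        gap = rng.uniform(0, 1)                    # multiplier from U(0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```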

Recent studies have identified that traditional over-sampling methods (e.g. SMOTE) are limited by their tendency to generate synthetic instances that sometimes extend into the input region of the majority class, which negatively impacts the performance of the subsequently built learning classifier [6, 8]. These studies recognise the dual importance of maintaining the integrity of the minority sample region and boosting the diversity of the minority class data. Subsequent studies have been conducted to address this challenge.

ECO-ensemble is a cluster-based synthetic over-sampling ensemble method [9]. Its underlying concept is to identify suitable over-sampling cluster regions with an Evolutionary Algorithm (EA) in order to obtain an optimised ensemble. The SMOTE-Simple Genetic Algorithm (SMOTE-SGA) method was proposed to enhance diversity within the generated dataset [10]. It addresses the over-generalisation problem in SMOTE by determining which instances to over-sample and how many synthetic instances to create from each selected instance (the sampling rate).

MAHAKIL was proposed with the purpose of generating more diverse synthetic instances [6]. Inspired by the Chromosomal Theory of Inheritance, it pairs minority instances with previously generated synthetic instances to create new instances. Its measure of diversity is based on the Mahalanobis Distance, and it utilises the core concepts of inheritance and genetic algorithms. The underlying idea is to create unique synthetic minority instances from two relatively distant parent instances, such that the offspring differ from their parents (i.e. the existing minority class).

SMOM was proposed as a k-NN-based synthetic minority over-sampling algorithm [11]. Its advantage over traditional k-NN-based over-sampling algorithms is that it prevents the minority data region from being over-generalised by considering the density of both the minority and majority data space. This is achieved by computing selection weights that quantify the adverse impact on all other classes if synthetic instances are generated along a particular neighbourhood direction: low weights are assigned to directions that would result in over-generalisation. The key steps are the use of neighbourhood-based clustering (NBDOS) to identify outstanding and trapped instances, the computation of selection weights for trapped instances (outstanding instances have equal weights in all directions), and the generation of synthetic instances based on these selection weights.

In 2018, Sampling With the Majority (SWIM) was proposed [8]. It generates synthetic minority instances based on the distribution of majority class instances, making it effective against extremely imbalanced data. In 2021, a diversity-based sampling method with drop-in functionality was proposed to evaluate diversity; it uses a greedy algorithm to identify and discard subsets that share the most similarity [12].

KMEANS-SMOTE [13] is a data-level over-sampling method introduced in 2018 which combines the k-means clustering algorithm with SMOTE. It seeks to address the shortcomings of SMOTE by targeting safe areas that would benefit from the generation of synthetic instances, over-sampling safe regions within the decision boundary. The authors also noted that the attractiveness of KMEANS-SMOTE comes from the universal availability and proven effectiveness of both underlying algorithms.
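For reference, a minimal usage sketch with the KMeansSMOTE implementation in the imbalanced-learn library; the toy dataset is a placeholder, and on some datasets the cluster_balance_threshold parameter may need tuning before any safe cluster is found.

```python
from collections import Counter

from imblearn.over_sampling import KMeansSMOTE
from sklearn.datasets import make_classification

# Toy imbalanced dataset: roughly a 95:5 majority-to-minority ratio.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# KMeansSMOTE clusters the full dataset first, then applies SMOTE only
# inside clusters judged "safe" (sufficiently dominated by the minority class).
X_res, y_res = KMeansSMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```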

In 2020, Minority Clustering SMOTE (MC-SMOTE) [14] was introduced. It aims to mitigate the occurrence of sample-dense and sample-sparse regions after the synthetic data generation process by applying k-means to the minority data points and populating synthetic instances between the resulting clusters. The authors applied MC-SMOTE to wind-turbine fault detection and concluded that it outperformed SMOTE.

In recent years, the challenges of imbalanced data classification have also appeared in imaging data. Generative Adversarial Networks (GANs) have attracted much attention from researchers due to their ability to model complex datasets across many different fields. A recent survey by Sampath et al. [15] categorises existing GAN-based techniques into three main groups (image-level, object-level, and pixel-level imbalances) and gives readers an understanding of how GANs are used to address the issue of imbalanced datasets.

Most recently, Diversity-based Average Distance Over-sampling (DADO) and Diversity-based Instance-Wise Over-sampling (DIWO) have been proposed to promote diversity [16]. The objective of both techniques is to generate well-diversified synthetic instances close to the minority class instances. DADO ensures diversity in the region among minority class instances and performs better when the minority instances are compact, so that the immediate surrounding area lies within the minority space. DIWO takes the contrasting approach, keeping synthetic instances clustered closely around the actual minority class instances; it performs better when the minority instances are widely distributed and the surrounding area does not sit within the minority space.

In our recent paper, we proposed a synthetic sampling method named Density-based Clustering Diversity Over-sampling (DBCDO) [7]. It combines the advantages of both DADO and DIWO by analysing the density distribution of the minority instances using DBSCAN, a density-based clustering approach, and then maximising diversity during generation.

In this paper, we aim to expand our proposed clustering algorithm with an alternative: the neighbourhood-based clustering algorithm (NBDOS). NBDOS considers both minority and majority density, instead of only minority density. It focuses on identifying clusters of outstanding instances, which allows us to classify all remaining minority instances outside these clusters as trapped instances. Once these instances are identified, DADO is applied to the clusters of outstanding instances and DIWO is applied to the trapped instances.

Methodology

Cluster-Based Diversity Over-Sampling (CDO)

In this section, we describe our synthetic data generation method, Cluster-based Diversity Over-sampling (CDO). CDO aims to improve the robustness of synthetic data generation by integrating the advantages of both DADO and DIWO. The choice between them is predicated on the density distribution of the minority instances: DADO is applied to compact clusters, and DIWO to dispersed, isolated instances. We start with a brief overview of our proposed implementation of CDO using DBSCAN (identified as DBCDO), followed by NBDOS (identified as NBCDO).

DBSCAN was originally chosen as our preferred clustering method as it is more efficient at identifying arbitrarily shaped clusters than partition-based or hierarchical clustering methods [17]. DBSCAN, first introduced in 1996 [17], is a non-parametric density-based clustering algorithm. It simultaneously performs two functions: it groups instances which are within proximity of each other, and it identifies points in low-density areas (points whose nearest neighbours are relatively far away). This makes DBSCAN robust to outliers. Another notable advantage of DBSCAN is that it allows the desired level of similarity to be selected through its hyper-parameters.
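As a concrete illustration, the minority instances can be clustered with scikit-learn's DBSCAN; the eps value below matches the configuration reported in our parameter-selection section, while min_samples and the data are assumed placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# X_min: minority-class instances only; DBSCAN is applied to these alone.
X_min = np.random.default_rng(0).random((50, 2))

labels = DBSCAN(eps=0.05, min_samples=3).fit_predict(X_min)

clustered = X_min[labels != -1]   # dense groups   -> handled by DADO
isolated  = X_min[labels == -1]   # noise points   -> handled by DIWO
```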

Neighbourhood-based clustering (NBDOS), which discovers clusters of outstanding instances, was introduced as part of SMOM [11]. It aims to distinguish outstanding instances (minority instances which are clustered closely) from trapped instances (minority instances which are dispersed, isolated and sometimes located within majority regions). In our setting, the advantage of NBDOS lies in the fact that it operates on the entire data space, both minority and majority, to uncover minority instances lying in dense clusters or spread dispersedly. This differs from DBSCAN, which is applied solely to the minority data to detect instances above a certain density threshold. As a consequence, NBDOS is more sensitive to hyper-parameter selection, which impacts the identification of “soft core”, “outstanding” and “trapped” instances, especially when the number of minority instances is small. In extremely imbalanced data spaces, this can result in situations where all instances are classified as “trapped”.

The CDO procedure is shown in Algorithm 1. A clustering algorithm is applied in Step 2; the two choices, DBSCAN and NBDOS, are given in Algorithms 2 and 3, respectively. It is worth observing that DBSCAN only requires the minority instances as input, whereas NBDOS requires the entire data space (Algorithm 3, lines 3 and 4). After clusters are obtained, CDO applies the following synthetic sampling process: if a minority instance does not belong to any cluster, DIWO is applied to it; if minority instances belong to a cluster, DADO is performed on all instances within that cluster (Algorithm 1, lines 9 and 12). This dispatch is sketched below.
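The following structural sketch summarises the control flow only, not the full detail of Algorithm 1; dado and diwo stand for the over-sampling routines of [16].

```python
import numpy as np

def cdo(X_min, labels, n_per_unit, dado, diwo):
    """CDO dispatch (sketch of Algorithm 1). labels holds one cluster label
    per minority instance, with -1 meaning the instance is in no cluster."""
    labels = np.asarray(labels)
    synthetic = []
    # Instances that belong to a cluster are over-sampled together with DADO.
    for c in set(labels) - {-1}:
        synthetic.append(dado(X_min[labels == c], n_per_unit))
    # Instances outside any cluster are treated individually with DIWO.
    for x in X_min[labels == -1]:
        synthetic.append(diwo(x, X_min, n_per_unit))
    return np.vstack(synthetic) if synthetic else np.empty((0, X_min.shape[1]))
```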

Diversity Optimisation

The proposed algorithm for diversity optimisation and generation of synthetic instances is an extended form of the NOAH algorithm [18], shown in Algorithm 4. Algorithm 4 consists of three stages and requires the following input parameters: population size (n), number of generations to optimise the objective function (g), number of instances remaining in the population after bound adaptation (r), percentage improvement of the bound (v) and, finally, the stopping criterion for diversity maximisation (c). If the population diversity does not improve for c generations, the diversity maximisation is considered converged; the whole algorithm terminates if the bound does not improve for c generations. To optimise the objective function, Algorithm 4 incorporates a Genetic Algorithm (GA), a widely used evolutionary algorithm: mutation and crossover are used to create new instances, and instances whose objective function values are better than the bound value (b) are kept (Algorithm 4, lines 5 and 14). For DADO, the objective function (f) is the average distance from all instances in the minority class; for DIWO, it is the distance to each individual instance.

We also made the following updates to the DADO and DIWO algorithms, with the aim of further promoting diversity within the synthetic data. For DADO, the population size was initially set to the oversampling size + 50; it has since been updated to the oversampling size + 1. The intuition behind this modification is that a lower population size encourages a more diverse synthetic sample generation process. The DIWO boundary was initially set to the minimum and maximum of the “isolated” minority instances’ data space; it has since been updated to the minimum and maximum of the entire minority instance data space. The broader generation region promotes more diverse synthetic samples. A sketch of one GA generation step under these settings follows.
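In the sketch below, the arithmetic crossover, Gaussian mutation, and the “lower objective value is better” convention are our assumptions for illustration, not a verbatim transcription of Algorithm 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_generation(pop, f, bound, lo, hi, n_offspring=50, sigma=0.05):
    """One GA generation: crossover plus mutation, keeping only offspring whose
    objective value beats the current bound b (cf. Algorithm 4, lines 5 and 14)."""
    kept = []
    for _ in range(20 * n_offspring):               # cap attempts if bound is tight
        i, j = rng.integers(len(pop), size=2)
        w = rng.uniform(0, 1, size=pop[i].shape)
        child = w * pop[i] + (1 - w) * pop[j]       # arithmetic crossover
        child += rng.normal(0, sigma, child.shape)  # Gaussian mutation
        child = np.clip(child, lo, hi)              # stay inside the generation region
        if f(child) < bound:                        # keep only if better than bound
            kept.append(child)
        if len(kept) == n_offspring:
            break
    return np.asarray(kept)
```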

Diversity-Based Selection

The preferred measure of diversity is the Solow-Polasky measure. A diversity measure should satisfy three main properties: (1) monotonicity in variety; (2) monotonicity in distance; (3) twinning. The first property implies that the diversity measure increases, or at least does not decrease, when an element not already present in the set is added. The second property requires that the diversity of a set S should not be smaller than that of another set S′ if all pairs in S are at least as remote as all pairs in S′. The third property ensures the diversity measure remains unchanged when an element already in the set is added again. The Solow-Polasky measure can be expressed as Eq. (1), where \(d({s}_{i},{s}_{j})\) denotes the Euclidean distance between elements of the set S and \(M={\left[{m}_{ij}\right]}\) with \({m}_{ij}={e}^{-d({s}_{i},{s}_{j})}\). The diversity measure is then computed as the sum of all entries of the inverse matrix \({M}^{-1}\).

$$D\left(S\right)=\sum {M}^{-1},\quad {m}_{ij}={e}^{-d\left({s}_{i},{s}_{j}\right)}$$
(1)
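Eq. (1) can be computed directly; the following sketch assumes Euclidean distance, as stated above.

```python
import numpy as np

def solow_polasky(S):
    """Solow-Polasky diversity, Eq. (1): the sum of all entries of M^{-1},
    where m_ij = exp(-d(s_i, s_j)) with Euclidean distance d."""
    d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    M = np.exp(-d)                    # m_ii = 1 on the diagonal
    return np.linalg.inv(M).sum()

# Sanity check: spreading the same number of points out increases diversity.
compact = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
spread  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
assert solow_polasky(spread) > solow_polasky(compact)
```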

To obtain the best diversity amongst all the instances, the ideal approach would be to evaluate every possible subset. However, this is computationally infeasible. As an alternative, we use a greedy approach which filters out the instances that contribute least to the diversity of the dataset. The contribution of an instance is defined as the difference in diversity of the dataset with and without that instance. As proven in [18], this difference can be expressed as:

$$\sum {M}^{-1}-\sum {A}^{-1}=\frac{1}{\overline{c}}{\left(\sum \overline{b}+\overline{c}\right)}^{2}$$
(2)

where A is the matrix of the set without that particular instance, \(M = \left[ {\begin{array}{*{20}c} A & b \\ {b^T } & c \\ \end{array} } \right],\; M^{ - 1} = \left[ {\begin{array}{*{20}c} {\overline{A}} & {\overline{b}} \\ {\overline{b}^T } & {\overline{c}} \\ \end{array} } \right]\), c and \(\overline{c}\) are scalars, b and \(\overline{b}\) are vectors, and \(b^{T}\) and \({\overline{b}}^{T}\) are their transposes.
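In practice, Eq. (2) lets the greedy filter read each instance's contribution directly off \({M}^{-1}\). A sketch follows, assuming Euclidean distance and applying the partition to each index in turn by symmetric permutation.

```python
import numpy as np

def least_contributing(S):
    """Index of the instance whose removal loses the least diversity.
    For each i, Eq. (2) gives its contribution as (sum(b_bar) + c_bar)^2 / c_bar,
    where c_bar = M^{-1}[i, i] and b_bar is row i of M^{-1} without the diagonal."""
    d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    Minv = np.linalg.inv(np.exp(-d))
    contrib = [(np.delete(Minv[i], i).sum() + Minv[i, i]) ** 2 / Minv[i, i]
               for i in range(len(S))]
    return int(np.argmin(contrib))   # greedy step: drop this instance
```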

[Algorithms 1–4 appear here as figures a–d in the original.]

Validation of Synthetic Dataset

Evaluation Method

The learning classifiers used to evaluate the generated data are Naïve Bayes (NB), Decision Tree (DT), k-Nearest Neighbour (KNN), Support Vector Machine (SVM), and Random Forest (RF). We chose KNN and RF because their model assumptions make them sensitive to imbalanced data [19]. DT is selected because its decision regions are directly influenced by re-sampling methods [20]. SVM with a radial kernel is effective at classifying classes that are not linearly separable.

We measure the performance of the classifiers on test data using AUC, F1-score, G-means, and PR-AUC as classification accuracy is not an appropriate measure for imbalanced data.

To calculate the F1-score (5), we need recall and precision, defined in (3) and (4). Recall is the proportion of correctly predicted positive instances to all instances in the positive class. Precision is the proportion of correctly predicted positive instances to all predicted positive instances.

$$\mathrm{Recall}=\frac{TP}{TP+FN}$$
(3)
$$\mathrm{Precision}=\frac{TP}{TP+FP}$$
(4)
$$\mathrm{F1}=\frac{2\times \mathrm{Recall}\times \mathrm{Precision}}{\mathrm{Recall}+\mathrm{Precision}}$$
(5)

The Receiver Operating Characteristic (ROC) curve is a technique to summarise the performance of a classifier over the trade-off between recall and the False Positive Rate (FPR), defined in (6).

$$FPR=\frac{FP}{FP+TN}$$
(6)

where FP stands for False Positive, the number of negative-class instances incorrectly predicted as positive.

AUC, the area under the ROC curve, is a suitable measure for classifiers’ performance, especially in the situation of imbalanced data, and is independent of the decision boundary [5, 21]. PR-AUC denotes the area under the Precision Recall curve.

The G-means (7) is the geometric mean of the True Positive Rate (TPR), which is the same as Recall in (3), and the True Negative Rate (TNR), which is \(1-FPR\).

$$G-means=\sqrt{TPR\times TNR}$$
(7)
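For completeness, the four measures can be computed as in the following sketch, which assumes scikit-learn; G-means is assembled from the confusion matrix, and PR-AUC is taken as the average precision.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, roc_auc_score)

def imbalance_metrics(y_true, y_pred, y_score):
    """AUC, F1, G-means and PR-AUC for a binary problem (positive = minority)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)                       # recall, Eq. (3)
    tnr = tn / (tn + fp)                       # 1 - FPR, cf. Eq. (6)
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "F1": f1_score(y_true, y_pred),
        "G-means": np.sqrt(tpr * tnr),         # Eq. (7)
        "PR-AUC": average_precision_score(y_true, y_score),
    }
```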

Synthetic Dataset

To examine our proposed methods under different scenarios, four 2-dimensional datasets are created. Each of the four datasets is prepared in two versions, with an Imbalance Ratio (IR) of 10% and 5%, respectively. These datasets are used in our initial experiments to assist in hyper-parameter selection. Table 1 provides a summary of these datasets (DS1-4). The number of clusters varies across datasets, ranging from 0 (randomly distributed data points) in DS3 to 5 in DS1. For each of the four synthetic datasets, instances are randomly divided into training and test datasets with a 75:25 split. DBCDO and NBCDO are used to balance the training datasets, learning classifiers are fitted on the balanced training datasets, and the performance of the constructed classifiers is assessed on the test datasets. Performance measures (AUC, F1, G-means, and PR-AUC) are computed for the best performing classifier. The above process is repeated 30 times; a sketch of this protocol follows.
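In the sketch below, the over-sampler stands for DBCDO or NBCDO, Random Forest is one of the evaluated classifiers, and only AUC is computed for brevity; all three are placeholders for the full experimental setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def run_protocol(X, y, oversample, n_runs=30):
    """75:25 split, balance the training set only, fit, and score on the
    untouched test set; repeated n_runs times as in the experiments."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed)
        X_bal, y_bal = oversample(X_tr, y_tr)      # e.g. DBCDO or NBCDO
        clf = RandomForestClassifier(random_state=seed).fit(X_bal, y_bal)
        scores.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return scores
```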

Table 1 Synthetic datasets characteristics

Parameter Selection

The distance measures chosen for the objective function and the diversity measure follow the optimal choices identified experimentally in [16]: the Euclidean distance (\({D}_{Eu}\)) is used for DADO, and the Canberra distance (\({D}_{c}\)) for DIWO.

$${D}_{Eu}\left(x,y\right)=\sqrt{{\sum }_{i}{\left({x}_{i}-{y}_{i}\right)}^{2}}$$
(8)
$${D}_{c}\left(x,y\right)={\sum }_{i}\frac{\left|{x}_{i}-{y}_{i}\right|}{\left|{x}_{i}\right|+|{y}_{i}|}$$
(9)
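Both measures are available in SciPy and can be checked against Eqs. (8) and (9).

```python
import numpy as np
from scipy.spatial.distance import canberra, euclidean

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])

# Eq. (8): Euclidean distance, chosen for DADO.
assert np.isclose(euclidean(x, y), np.sqrt(((x - y) ** 2).sum()))

# Eq. (9): Canberra distance, chosen for DIWO.
assert np.isclose(canberra(x, y), (np.abs(x - y) / (np.abs(x) + np.abs(y))).sum())
```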

Our next step is to determine the optimal hyper-parameters for DBSCAN and NBDOS. Based on previous work [7], the optimal configuration for DBSCAN is Epsilon (eps) of 0.05 with Border Point (p) set to “True”. For NBDOS, a total of four hyper-parameters require configuration. nTh (“minimum points per cluster”) is set to 5, given our goal of targeting extremely imbalanced datasets. rTh is set to 0.5 to relax the stringent condition for selecting soft core instances. For k1 and k2, an extensive study on the synthetic datasets was conducted over the combinations \({k}_{1} \in \{4, 6, 8\}\) and \({k}_{2} \in \{5, 6, 7, 8\}\). The results suggest that the optimal values for k1 and k2 are 8 and 6, respectively. Nevertheless, it is important to note that these parameter selections are sensitive to the distribution of majority and minority instances in a given dataset, which could limit their effectiveness.

Synthetic Experiment Results

A comparison between DBCDO and NBCDO was conducted using the four synthetic datasets, and the performance of each algorithm was evaluated. In total, there are 16 results, shown in Table 2. NBCDO performs better than DBCDO in 10 out of 16 cases across the evaluation metrics. This is especially so for DS1, 3 and 4, where there is low to medium variance in the minority space. NBCDO does not outperform DBCDO on DS2, where there is high variance in the minority space. The likely explanation is that when the minority data space has high variance, it is more likely to be surrounded by majority data points. This may cause NBCDO to categorise the entire minority data space (which also houses majority class instances) as isolated instances, thereby impacting the learning process.

Table 2 Performance results of mean and standard error for each measure across synthetic datasets

Graphical Representation

To provide graphical representations, we created a synthetic dataset with two clusters and a 10% imbalance ratio in the training data, with a balanced ratio in the testing data. A comparison between the two clustering algorithms, DBSCAN and NBDOS, is shown in Fig. 1. DBSCAN identified 3 clusters, whereas NBDOS accurately identified 2. This can be attributed to the additional information about the majority distribution utilised during the clustering process. Because NBDOS identifies the actual clusters more accurately than DBSCAN, the data generation region produced once DADO is applied is more representative.

Fig. 1

Plots for clustering methods on minority data points

In Fig. 2, synthetic datasets are generated by DBCDO and NBCDO (the two cluster-based diversity methods) and four comparable methods (DB-SMOTE, MAHAKIL, KMEANS-SMOTE and MC-SMOTE). We observe that the regions of synthetically generated instances for the cluster-based diversity algorithms, MAHAKIL and KMEANS-SMOTE are relatively similar. However, the cluster-based diversity algorithms stand out for covering all the data points of the minority test data with the narrowest region. MAHAKIL created synthetic data points between the two clusters, which occupies a larger region and could result in over-generalisation and a higher False Positive Rate. The region for DB-SMOTE does not cover all of the minority test data points, which could result in a higher False Negative Rate. The region for MC-SMOTE also has many synthetic data points generated outside of the clusters, which is by design.

Fig. 2

Plots for synthetic datasets generation region

Validation of Real-Life Dataset

We validate the proposed CDO algorithm against an assortment of 10 imbalanced datasets with varying dimensions. The datasets and their characteristics are described in Table 3, where “Ratio” indicates the original proportion of majority to minority instances. To replicate scenarios with low and extremely low imbalance ratios, we reduce each dataset to a 5% imbalance ratio and, separately, to an absolute count of 10 minority instances.

Table 3 Real-world data description

The data within each real-world dataset are randomly divided into train and test datasets using a 75:25 split. This process is repeated for 30 iterations, resulting in 30 unique variations of training datasets and accompanying test datasets for each of the 10 real-world datasets. After this initialisation step, we apply our proposed methods (NBCDO and DBCDO) alongside existing methods in the literature, namely DB-SMOTE, MAHAKIL, MC-SMOTE and KMEANS-SMOTE, to evaluate algorithm performance. Six learning classifiers (GLM, NB, DT, KNN, SVM, NN) are then constructed on each of the training datasets (n = 30). Subsequently, the trained classifiers are applied to the test datasets.

For each real-world dataset, the best performing classifier is selected before computing the mean and standard error of the performance measures (F1, AUC, PR-AUC and G-means). Additionally, we examine the statistical significance of differences in the performance measures obtained from all comparable methods using a non-parametric statistical test, the Mann–Whitney test.

Experimental Results

The mean and standard error (stated in parenthesis) of our proposed method (DBCDO, NBCDO) and its comparable methods (DB-SMOTE, MAHAKIL, KMEANS-SMOTE and MC-SMOTE) are presented in Table 4 (5% imbalanced ratio) and Table 5 (10 minority instances).

Table 4 Performance results of mean and standard error across datasets with 5% imbalance levels
Table 5 Performance results of mean and standard error across datasets with 10 minority instances

Looking at the performance metrics for the 5% imbalance ratio (Table 4), across all evaluation metrics and datasets, DBCDO and NBCDO each have the highest mean 10 times, followed by DB-SMOTE 7 times, MAHAKIL and MC-SMOTE 6 times each, and KMEANS-SMOTE 3 times. In total, the cluster-based diversity algorithms outperformed the comparable algorithms 20 out of 40 times.

Looking at the performance metrics for 10 minority instances (Table 5), across all evaluation metrics and datasets, DBCDO has the highest mean 10 times, followed by NBCDO 9 times, MAHAKIL 7 times, DB-SMOTE and MC-SMOTE 6 times each, and KMEANS-SMOTE 3 times. In total, the cluster-based diversity algorithms outperformed the comparable algorithms 19 out of 40 times.

The Mann–Whitney test is performed for each pairing of the 6 comparable algorithms. Tables 6 and 7 display the results, where each value is the number of times the specified method is statistically better than its comparable methods.

Table 6 Performance results of Mann–Whitney test across datasets with 5% imbalance levels
Table 7 Performance results of Mann–Whitney test across datasets with 10 minority instances

Table 6 reports results on the 5% imbalanced datasets. MC-SMOTE statistically outperforms its comparable algorithms 48 times across all datasets and evaluation metrics, followed by NBCDO 40 times, DBCDO 39 times, MAHAKIL 32 times, DB-SMOTE 27 times and KMEANS-SMOTE 25 times. Specifically, NBCDO has the best statistical performance on AUC and PR-AUC, whereas MC-SMOTE has the best statistical performance on F1 and G-means.

Table 7 reports results on the datasets with 10 minority instances. DBCDO statistically outperforms its comparable algorithms 61 times across all datasets and evaluation metrics, followed by NBCDO 42 times, MC-SMOTE 41 times, MAHAKIL 33 times, DB-SMOTE 17 times and KMEANS-SMOTE 14 times. Specifically, DBCDO has the best statistical performance on F1, G-means and PR-AUC, whereas NBCDO has the best statistical performance on AUC.

Discussions

As shown in the results for mean comparison (Tables 4 and 5), both cluster-based diversity methods (NBCDO and DBCDO) outperformed their comparable methods.

Cluster-based diversity methods outperformed DB-SMOTE because they consider the data space distribution and generate diverse instances within the boundaries of the identified data generation region, whereas DB-SMOTE creates synthetic instances by linear interpolation. We also observed that the cluster-based diversity algorithms perform better than MAHAKIL when minority instances are sparse (i.e. when a dataset is reduced to 10 minority data points). This can be attributed to the nature of the MAHAKIL algorithm, which only performs well when the minority data distribution is convex and there is a sufficient number of minority instances [16].

Cluster-based diversity methods also outperform KMEANS-SMOTE. This can be explained by the limitation of KMEANS-SMOTE at high imbalance levels: it performs clustering at the dataset level and generates synthetic data within each cluster based on selected k-nearest neighbours. When there are only a handful of minority class instances within a cluster, the synthetic data points generated by SMOTE will be relatively similar, lowering diversity.

Comparing NBCDO and DBCDO using their mean performance (Tables 4 and 5), we conclude that NBCDO performs better when the dataset has few dimensions (e.g. DS 7), whereas DBCDO performs better at higher dimensionality (e.g. DS 1, 6 and 10). This observation is consistent with NBCDO being based on neighbourhood (distance-based) clustering, which performs better when the feature set is small, and DBCDO being based on density-based clustering, which performs better when the feature set is large.

An evaluation of the overall statistical performance (Tables 6 and 7) shows that most of these results echo our findings from the comparison of mean performance. We found that cluster-based diversity methods perform better on extremely imbalanced datasets through special, individualised treatment of isolated instances, relative to existing clustering methods, which tend to group them into a specific cluster. This also validates our hypothesis that diversity is more important when minority instances are sparse.

Although most evaluation metrics indicate the cluster-based diversity methods as the best-performing, two metrics, F1 and G-means, favour MC-SMOTE at the 5% imbalance level. It is worth highlighting that as the imbalance becomes more extreme (e.g. 10 minority instances), the performance edge of MC-SMOTE over the cluster-based diversity methods dissipates. A possible explanation is that as the imbalance problem becomes more prominent, a data point is more likely to be treated as “isolated” and thereby not grouped into a cluster with other minority instances. In contrast to MC-SMOTE, which has a strong tendency to group minority instances into clusters, cluster-based diversity methods assess data points individually and help ensure that isolated data points are correctly identified. This minimises the likelihood of introducing noise (errors) at the start of the subsequent synthetic data generation process.

Conclusions

In this study, we propose a new cluster-based diversity re-sampling method named NBCDO, with the aim of complementing our previously introduced density-based clustering diversity algorithm (DBCDO). In contrast to DBCDO, which uses DBSCAN as the underlying clustering algorithm, clustering for NBCDO is performed with the recent clustering algorithm NBDOS, which considers the data distribution of both the minority and majority data space when identifying clusters. NBCDO first utilises NBDOS to identify clusters and isolated instances. It then uses this information to create synthetic samples while incorporating diversity optimisation to promote diversity within each generation region. The two cluster-based diversity methods, DBCDO (based on DBSCAN) and NBCDO (based on NBDOS), are evaluated together with comparable methods on 10 real-world datasets with ≤ 5% imbalance ratio and, in most cases, they have been found to be statistically superior to the comparable methods.

More importantly, this paper highlights the versatility of NOAH, our diversity optimisation algorithm. When paired with either clustering algorithm (DBSCAN or NBDOS), empirical results show that it outperforms comparable methods in most cases. We attribute its superior performance to its ability to identify the minority space for synthetic data generation and to obtain an optimal spread of generated instances via the genetic algorithm.

For future work, we may incorporate other typologies of clustering algorithms, such as centroid-based clustering (k-means) and distribution-based clustering (e.g. Gaussian mixtures), in conjunction with our diversity optimisation algorithm, NOAH. This would allow us to further test the validity of the NOAH algorithm. Additionally, the implementation of NBCDO is based on a fixed hyper-parameter configuration derived from our synthetic experiments, a one-size-fits-all approach applied to each dataset regardless of its characteristics. Future work could instead pre-determine an optimal hyper-parameter configuration tailored to each specific dataset. Finally, given the ability of NBDOS to draw accurate and specific decision boundaries around each minority instance, we would like to extend this algorithm to the multi-class classification problem, where the class overlapping issue is more severe and complex, thereby requiring a more sophisticated clustering algorithm before the over-sampling process commences.