Keywords

1 Introduction

The imbalanced classification problem has caught growing attention from many fields. In the field of computational design optimization, product parameters are modified to generate digital prototypes and the performances are usually evaluated by numerical simulations which often require minutes to hours of computation time. Here, some parameter variations (minority number of designs) would result in valid and producible geometries but violate given constraints in the final steps of the optimization. Under this circumstance, performing proper imbalanced classification algorithms on the design parameters could save computation time. In the imbalanced learning domain, many techniques have proven to be efficient in handling imbalanced datasets, including resampling techniques and algorithmic-level approaches [5, 9, 15], where the former aims to produce balanced datasets and the latter aims to make classical classification algorithms appropriate for handling imbalanced datasets. The resampling techniques are standard techniques in imbalance learning since they are simple and easily configurable and can be used in synergy with other learning algorithms [4]. The main idea of most oversampling approaches is to introduce more information to the original dataset by creating synthetic samples. However, very few studies consider the idea of introducing additional attributes to the imbalanced dataset.

The anomaly detection problem can be considered as an extreme case of the class imbalance problem. In this paper, we propose to improve the imbalanced classification with some anomaly detection techniques. We propose to introduce the outlier score, which is an important indicator to evaluate whether a sample is an outlier [2], as an additional attribute of the original imbalanced datasets. Apart from this, we also introduce the four types of samples (safe, borderline, rare and outlier), which have been emphasized in many studies [14, 16], as another additional attribute. In our experiments, we consider four scenarios, i.e. four different combinations using the additional attributes and performing resampling techniques. The results of our experiments demonstrate that introducing the two proposed additional attributes can improve the imbalanced classification performance in most cases. Further study shows that this performance improvement is mainly contributed by a more accurate classification in the overlapping region of the two classes (majority and minority classes).

The remainder of this paper is organized as follows. In Sect. 2, the research related to our work is presented, also including the relevant background knowledge on four resampling approaches, outlier score and the four types of samples. In Sect. 3, the experimental setup is introduced in order to understand how the results are generated. Section 4 gives the results and further discussion of our experiments. Section 5 concludes the paper and outlines further research.

2 Related Work

As mentioned in the Introduction, we propose to introduce two additional attributes into the imbalanced datasets in order to gain more information on the data characteristics and improve the classification performance. Introducing additional attributes can be regarded as a data preprocessing method, which is independent of resampling techniques and algorithmic-level approaches, and can also be combined with these two approaches. In this section, the background knowledge related to our experiment is given, including resampling techniques (Sect. 2.1), the definition of four types of samples in the imbalance learning domain (Sect. 2.3) and the outlier score (Sect. 2.2).

2.1 Resampling Techniques

In the following, we introduce two oversampling techniques (SMOTE and ADASYN) and two undersampling techniques (NCL and OSS).

Oversampling Techniques. The synthetic minority oversampling technique (SMOTE) is the most famous resampling technique [3]. SMOTE produces synthetic minority samples based on the randomly chosen minority samples and their K-nearest neighbours. The new synthetic sample can be generated by interpolation between the selected minority sample and one of its K-nearest neighbours. The main improvement in the adaptive synthetic (ADASYN) sampling technique is that the samples which are harder to learn are given higher importance and will be oversampled more often in ADASYN [7].

Undersampling Techniques. One-Sided Selection (OSS) is an undersampling technique which combines Tomek Links and the Condensed Nearest Neighbour (CNN) Rule [4, 11]. In OSS, noisy and borderline majority samples are removed with so-called Tomek links [17]. The safe majority samples which have limited contribution for building the decision boundary are then removed with CNN. Neighbourhood Cleaning Rule (NCL) emphasizes the quality of the retained majority class samples after data cleaning [12]. The cleaning process is first performed by removing ambiguous majority samples through Wilson’s Edited Nearest Neighbour Rule (ENN) [19]. Then, the majority samples which have different labels from their three nearest neighbours are removed. Apart from this, if a minority sample has different labels from its three nearest neighbours, then the three neighbours are removed.

2.2 Four Types of Samples in the Imbalance Learning Domain

Napierala and Stefanowski proposed to analyse the local characteristics of minority examples by dividing them into four different types: safe, borderline, rare examples and outliers [14]. The identification of the type of an example can be done through modeling its k-neighbourhood. Considering that many applications involve both nominal and continuous attributes, the HVDM metric (Appendix A) is applied to calculate the distance between different examples. Given the number of neighbours k (odd), the label to a minority example can be assigned through the ratio of the number of its minority neighbours to the total number of neighbours (\(R_{\frac{min}{all}}\)) according to Table 1. The label for a majority example can be assigned in a similar way.

Table 1. Rules to assign the four types of minority examples.

2.3 Outlier Score

Many algorithms have been developed to deal with anomaly detection problems and the experiments in this paper are mainly performed with the nearest-neighbour based local outlier score (LOF). Local outlier factor (LOF), which indicates the degree of a sample being an outlier, was first introduced by Breunig et al. in 2000 [2]. The LOF of an object depends on its relative degree of isolation from its surrounding neighbours. Several definitions are needed to calculate the LOF and are summarized in the following Algorithm 1.

figure a

According to the definition of LOF, a value of approximately 1 indicates that the local density of data point \(X_i\) is similar to its neighbours. A value below 1 indicates that data point \(X_i\) locates in a relatively denser area and does not seem to be an anomaly, while a value significantly larger than 1 indicates that data point \(X_i\) is alienated from other points, which is most likely an outlier.

3 Experimental Setup

In this paper, we propose to introduce the four types of samples and the outlier score as additional attributes of the original imbalanced dataset, where the former can be expressed as \(R_{\frac{min}{all}}\) (Table 1) and the latter can be calculated through Python library PyOD [20].

The experiments reported in this paper are based on 7 two-class imbalanced datasets, including 6 imbalanced benchmark datasets (given in Table 2) and a 2D imbalanced chess dataset, which is commonly used for visualising the effectiveness of the selected techniques in the imbalanced learning domain [4]. Imbalance ratio (IR) is the ratio of the number of majority class samples to the number of minority class samples. For each dataset, we consider four scenarios, whether to perform resampling techniques on the original datasets and whether to perform resampling techniques on the datasets with additional attributes. For each scenario of each dataset, we repeat the experiments 30 times with different random seeds. After that, the paired t-tests were performed on each of the 30 performance metric values to test if there is significant difference between the results of each scenario on a 5% significance level. Each collected dataset is divided into 5 stratified folds (for cross-validation) and only the training set is oversampled, where the stratified fold is to ensure that the imbalance ratio in the training set is consistent with the original dataset and only oversampling the training set is to avoid over-optimism problem [15]. Our code is available on Github (https://github.com/FayKong/PPSN2020) for the convenience of reproducing the main results and figure.

Table 2. Information on benchmark datasets [1].

In this paper, we evaluate the performance through several different measures, including Area Under the ROC Curve (AUC), precision, recall, F-Measure (F1) and Geometric mean (Gmean) [13]. These performance measures can be calculated as follows.

$$\begin{aligned} \begin{aligned} AUC&= \frac{1+ TP_{rate} - FP_{rate}}{2}, \quad TP_{rate} = \frac{TP}{TP + FN}, \quad FP_{rate} = \frac{FP}{FP + TN};\\ GM&= \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{FP + TN}}; \quad FM = \frac{(1 + \beta )^2 \times Recall \times Precision}{\beta ^2 \times Recall + Precision}\\ Recall&= TP_{rate} = \frac{TP}{TP + FN}, \quad Precision = \frac{TP}{TP + FP}, \quad \beta = 1; \end{aligned} \end{aligned}$$

where TPFNFPTN indicate True Positives, False Negatives, False Negatives and True Negatives in the standard confusion matrix for binary classification.

4 Experimental Results and Discussion

Like other studies [7, 13], we also use SVM and Decision Tree as the base classifiers in our experiments to compare the performance of the proposed method and the existing methods. Please note that we did not tune the hyperparameters for the classification algorithms and the resampling techniques [9]. The experimental results with the two additional attributes (four types of samples and LOF score) are presented in Table 3. We can observe that introducing outlier score and four types of samples as additional attributes can significantly improve the imbalanced classification performance in most cases. For 5 out of 7 datasets (2D chess dataset, glass1, yeast4, wine quality and page block), only introducing additional attributes (with no resampling) gives better results than performing resampling techniques.

Table 3. Experimental results with SVM and Decision Tree. “Add = YES” means we introduce the two additional attributes to the original datasets; gray cells indicate that the proposed method (Add = YES) significantly outperforms the existing methods (Add = NO); “—” means that TP + FN = 0 or TP + FP = 0 and the performance metric cannot be computed.

According to our experimental setup, we notice that introducing the outlier score focuses on dealing with the minority samples since the outlier score indicates the degree of a sample being an outlier. Meanwhile, introducing four types of samples (safe, borderline, rare and outlier) puts emphasis on separating the overlapping region and safe region. The visualisation of different scenarios for the 2D chess dataset is given in Fig. 1 in order to further study the reason for the performance improvement.

From both the experimental results in Table 3 and the visualisation in Fig. 1, we can conclude that, for the 2D chess dataset, the experiment with the two additional attributes outperforms the experiment with the classical resampling technique SMOTE. The figure also illustrates that the proposed method has a better ability to handle samples in the overlapping region.

Fig. 1.
figure 1

[top left]. Original imbalanced 2D chess dataset. [top right]. Classification performance for original chess dataset. The red-circled points indicate the misclassified points. [bottom left]. Classification performance for SMOTE-sampled chess dataset. [bottom right]. Classification performance for chess dataset with additional attributes. (Color figure online)

Apart from the visualisation, the feature importance (with Decision Tree) is also analysed in order to get an additional insight into the usefulness of the new attributes. Detailed importance score of each attribute is shown in Table 4. According to the feature importance analysis, we can conclude that the introduced “four types of samples” attribute plays an important role in the decision tree classification process for all datasets in our experiment. For 3 out of 7 datasets, the introduced “outlier score” attribute provides useful information during the classification process. The conclusions above show that the two introduced attributes are actually used in the decision process and the “four types of samples” attribute is more important than the “outlier score” attribute.

Table 4. Feature importance analysis with Decision Tree. The higher the “score” is, the more the feature contributes during the classification; “org” indicates the original attribute while “add” indicates the added attribute; grey cells indicate the three most useful attributes (after adding the two proposed attributes) in the decision tree classification process.

5 Conclusions and Future Research

In this paper, we propose to introduce additional attributes to the original imbalanced datasets in order to improve the classification performance. Two additional attributes, namely four types of samples and outlier score, and the resampling techniques (SMOTE, ADASYN, NCL and OSS) are considered and experimentally tested on seven imbalanced datasets. According to our experimental results, two main conclusions can be derived:

  1. 1)

    In most cases, introducing these two additional attributes can improve the class imbalance classification performance. For some datasets, only introducing additional attributes gives better classification results than only performing resampling techniques.

  2. 2)

    An analysis of the experimental results also illustrates that the proposed method has a better ability to handle samples in the overlapping region.

In this paper, we only validate our idea with four resampling techniques and seven benchmark datasets. As future work, other anomaly detection techniques, such as the clustering-based local outlier score (CBLOF) [8] and histogram-based outlier score (HBOS) [6] could be included in the analysis. Future work could also consider an extension of this research for engineering datasets [10], especially for the design optimization problems mentioned in our Introduction. Detailed analysis of the feature importance and how the proposed method affects the classification performance in the overlapping region would also be worth studying.