
1 Introduction

With the growing interest of knowledge researchers and the increasing number of proposed solutions, the imbalanced data problem has become one of the most significant and challenging issues of recent years. The main reason for this particular attention given to underrepresented data is the fundamental importance of untypical instances. Considering medical diagnosis, it is obvious that failing to recognize a patient who suffers from a rare disease might lead to serious and irreversible consequences. Apart from this example, there are numerous domains in which imbalanced class distribution occurs, such as [14, 21, 25]: fraudulent credit card transactions, detecting oil spills from satellite images, network intrusion detection, financial risk analysis, text categorization and information filtering. Indeed, the wide range of problem occurrences increases its significance and explains the effort put into finding effective solutions.

Initially, the major cause of classifier performance degradation was identified merely with an insufficient number of examples representing the minority class. However, recent comprehensive studies on the nature of imbalanced data revealed that there are other factors contributing to this undesirable effect [7, 14, 21]. Small disjuncts [13], class overlapping [15] and the presence of noise as well as outliers [25] were considered the most meaningful difficulties. Despite the many suggested solutions (discussed briefly in the next section), there are still open issues, particularly regarding the flexibility of proposed methods and the tuning of their parameters. We decided to focus on the data–level approach as it is classifier–independent and therefore more universal. However, it is worth mentioning that besides this kind of concept, there are also numerous algorithm–level and cost–sensitive methods [11].

Although many re–sampling methods have been proposed to deal with the imbalanced data problem, only a few incorporate rough set theory. The standard rough sets approach was developed by Pawlak (1926–2006) in 1982 [16]. Objects characterized by the same information (identical values of the provided attributes) are treated as indiscernible instances [17, 18, 23]. Hence, the idea of an indiscernibility relation is introduced. Since real–life problems are often vague and contain inconsistent information, the rough (not precise) set concept can be replaced by a pair of precise concepts, namely the lower and upper approximations. We claim that this methodology can be very useful both in the preprocessing step and in the cleaning phase of the algorithm (see [3]), especially in its extended version, which allows continuous attribute values by involving a similarity relation. Therefore, we propose the adjusted VIS_RST algorithm, dedicated to both qualitative and quantitative data. What is more, new mechanisms for careful oversampling are introduced.

2 Related Works

In this paper we focus only on the data–level methods addressing the imbalanced data problem, as declared in the previous section. This category consists of classifier–independent solutions which transform the original data set into a less complex and more balanced distribution using techniques such as oversampling and undersampling [11, 21]. The major algorithm representing this group is the Synthetic Minority Oversampling Technique (SMOTE) [6]; its core generation rule is recalled below. Since the approach of generating new minority samples based on the similarities between existing minority examples became a very successful and powerful method, it has been an excellent inspiration for researchers to develop numerous extensions and improvements. Some of them properly reduce the consequences of SMOTE drawbacks, such as over–generalization and variance [11]. In preparing the overview of related techniques, two main subjects were considered. Firstly, methods which handle additional difficulty factors are discussed. Secondly, we show the applications of rough set notions to the imbalanced data problem.
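For reference, the core SMOTE rule creates a synthetic sample by interpolating between a minority example \(x_i\) and one of its k nearest minority-class neighbours \(x_{nn}\); the notation below is ours:

$$x_{new} = x_i + \delta \cdot (x_{nn} - x_i), \qquad \delta \in [0,1] \text{ chosen at random}.$$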

2.1 SMOTE–Based Methods Dealing with Complex Imbalanced Data Distribution

Highly imbalanced data sets, especially those characterized by a complex distribution, require dedicated methods. Even such a groundbreaking algorithm as SMOTE turns out to be insufficiently effective in some specific domains. Indeed, recent research revealed that dividing minority data into categories that reflect their local characteristics is a proper direction of development. The main reason for this conclusion is the nature of real–world data. Assuming that minority class instances placed in relatively homogeneous regions of the feature space are named safe, we should consider the fact that such safe examples are uncommon in real–life data sets [21]. In order to deal with complex imbalanced data distributions, many sophisticated methods have been proposed:

  • MSMOTE (Modified Synthetic Minority Oversampling Technique) [12] - the strategy of generating new samples is adapted to the local neighbourhood of the instance. Safe objects are processed similarly as in the standard SMOTE technique. For border instances only the nearest neighbour is chosen. Latent noise representatives are not used for generating new samples.

  • Borderline-SMOTE [9] - method that strengthens the borderline area of the minority class, where class overlapping occurs. Only borderline examples are used in processing.

  • SMOTE-ENN [2] - technique combining oversampling with an additional cleaning step. The standard SMOTE algorithm is enriched by the Edited Nearest Neighbour (ENN) rule, which removes examples from both classes whenever they are misclassified by their three nearest neighbours.

  • Selective Preprocessing of Imbalanced Data (SPIDER) [15] - method that identifies noisy minority data using the k–NN technique and continues processing in a way depending on the selected option: weak, relabel or strong. The chosen option determines whether only minority class examples are amplified or majority objects are also relabeled. After oversampling, noisy representatives of the majority class are removed.

  • Safe–Level SMOTE [5] - the algorithm applies k–NN technique to obtain the safe levels of minority class samples. New synthetic instances are created only in safe regions in order to improve prediction performance of classifiers.

2.2 Rough Sets Solutions for Imbalanced Data Problem

The occurrence of noisy and borderline examples in real–domain data sets is a fact that needs to be acknowledged in most cases. Hence, the relevance of methods dealing with these additional difficulties should be emphasized. Rough set notions appear as a promising approach to reduce data distribution complexity and to understand hidden patterns in data. Before describing existing algorithms based on the rough sets approach, basic concepts of this theory are introduced.

Let U denote a finite non-empty set of objects, to be called the universe. The fundamental assumption of the rough set approach is that objects from the set U described by the same information are indiscernible. This main concept is the source of the notion referred to as the indiscernibility relation \(IND \subseteq U \times U\), defined on the set U. The indiscernibility relation IND is an equivalence relation on U. As such, it induces a partition of U into indiscernibility classes. Let \([x]_{IND} = \{ y \in U : (x,y) \in IND\}\) be an indiscernibility class, where \(x \in U\). For any subset X of the set U it is possible to define the following characteristics [18] (a computational sketch follows the list):

  • the lower approximation of a set X is the set of all objects that can be classified with certainty as members of X with respect to IND : 

    $$\{x \in U : [x]_{IND} \subseteq X\}, \qquad (1)$$
  • the boundary region of a set X is the set of all objects that can be classified neither as certain members of X nor as certain members of its complement with respect to IND : 

    $$\{x \in U : [x]_{IND} \cap X \ne \varnothing \;\&\; [x]_{IND} \nsubseteq X \}. \qquad (2)$$
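To make these notions concrete, the following minimal Python sketch (ours, not part of the original method) computes indiscernibility classes for a toy decision table and derives the lower approximation (1) and boundary region (2) of a set X:

```python
from collections import defaultdict

def indiscernibility_classes(objects, attributes):
    """Group object identifiers by identical values on the given attributes."""
    groups = defaultdict(set)
    for obj_id, values in objects.items():
        groups[tuple(values[a] for a in attributes)].add(obj_id)
    return list(groups.values())

def lower_approximation(X, ind_classes):
    """Eq. (1): objects whose whole indiscernibility class is contained in X."""
    return {x for c in ind_classes if c <= X for x in c}

def boundary_region(X, ind_classes):
    """Eq. (2): objects whose indiscernibility class overlaps X but is not contained in it."""
    return {x for c in ind_classes if (c & X) and not c <= X for x in c}

# toy decision table: four objects described by attributes a and b, X = "positive" objects
objects = {1: {"a": 0, "b": 1}, 2: {"a": 0, "b": 1},
           3: {"a": 1, "b": 0}, 4: {"a": 1, "b": 1}}
X = {1, 3}
classes = indiscernibility_classes(objects, ["a", "b"])
print(lower_approximation(X, classes))   # {3}
print(boundary_region(X, classes))       # {1, 2}
```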

The best-known preprocessing methods directly utilizing rough set theory are the following:

  • filtering techniques (relabel and remove) [22] - depending on the method: majority class examples belonging to the boundary region (defined by the rough sets) are either relabeled or removed,

  • SMOTE–RSB\(_{*}\) [19] - method combining SMOTE algorithm with rough set theory by introducing additional cleaning phase of processing.

3 Proposed Algorithm VISROT - Versatile Improved SMOTE Based on Rough Set Theory

Comprehensive studies on the imbalanced data problem and analysis of existing solutions revealed that there are many open issues, and the need for a more general approach dealing with a wide range of different data characteristics is still present [21]. Since most real–world data sets have a complex distribution, researchers should pay particular attention to a careful choice of oversampling strategy [14]. In [20] two main types of SMOTE–based preprocessing algorithms are specified:

  • change–direction methods - new instances are created only in specific areas of the input space (especially close to relatively large clusters of positive examples),

  • filtering–based techniques - SMOTE algorithm integrated with additional cleaning and filtering methods that aim to create more regular class boundaries.

The authors of this categorization claim that the first group may suffer from noisy and borderline instances; the necessity of an additional cleaning phase was indicated. Since our VIS_RST [3] algorithm meets this requirement but is not directly suitable for quantitative data, we decided to improve the existing approach and enable processing of attributes of any type.

Generalizing the code to both qualitative and quantitative data involved many adjustments and the handling of specific cases. The main modification concerns the usage of a weaker similarity concept instead of the strict indiscernibility relation [10]. We applied the HVDM distance metric [26] as a generator of the similarity measure.
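The HVDM (Heterogeneous Value Difference Metric) combines a normalized absolute difference on numeric attributes with a value-difference metric on nominal ones. The sketch below is a simplified version of that metric (missing-value handling and other refinements of [26] are omitted); the helper name hvdm_factory and its argument layout are our own:

```python
import numpy as np

def hvdm_factory(X, y, nominal_mask):
    """Build a simplified HVDM distance function for mixed-type data.

    X            : 2-D array-like of attribute values (nominal columns may hold codes or strings)
    y            : 1-D array-like of class labels
    nominal_mask : sequence of booleans, True for nominal attributes
    """
    X = np.asarray(X, dtype=object)
    y = np.asarray(y)
    classes = np.unique(y)
    sigmas, vdm_tables = {}, {}
    for a in range(X.shape[1]):
        col = X[:, a]
        if nominal_mask[a]:
            table = {}
            for v in set(col):
                mask = col == v
                # P(class = c | attribute a has value v), for every class c
                table[v] = np.array([(y[mask] == c).mean() for c in classes])
            vdm_tables[a] = table
        else:
            sigmas[a] = float(np.std(col.astype(float))) or 1.0  # avoid division by zero

    def hvdm(u, v):
        total = 0.0
        for a in range(len(u)):
            if nominal_mask[a]:
                pu, pv = vdm_tables[a][u[a]], vdm_tables[a][v[a]]
                d = np.sqrt(np.sum((pu - pv) ** 2))   # normalized value-difference metric
            else:
                d = abs(float(u[a]) - float(v[a])) / (4.0 * sigmas[a])
            total += d * d
        return np.sqrt(total)

    return hvdm
```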

The flexibility of the algorithm is obtained by two approaches dedicated to different types of problems. Analysis of the local neighbourhood of each example enables the evaluation of the complexity of the data distribution. Based on the studies from [15] we assume that the occurrence of 30% of borderline and noisy instances indicates that the problem is difficult. Identification of these specific examples is performed by applying the k–NN algorithm. Continuing the categorization introduced in the VIS algorithm [4], we distinguish three types of objects, namely SAFE, DANGER and NOISE. SAFE examples are relatively easy to recognize; they are the main representatives of the minority class. DANGER instances are placed in the area surrounding class boundaries; they typically overlap with majority class examples. NOISE instances are rare, probably incorrect, individuals located in areas occupied by majority class objects. The mechanism of categorization into the mentioned groups is described below and illustrated by a short code sketch after Definition 1.

Let \(DT=(U,A \cup \{d\})\) be a decision table, where U is a non-empty set of objects, A is a set of condition attributes and d is a decision attribute with \(V_{d}=\{+,- \}\). The following rules enable labeling the minority data \(X_{d=+} =\{x \in U: d(x)=+\} \):

Definition 1

Let \(k>0\) be a given natural number. Let \(x \in X_{d=+}\) be an object from the minority class. We define \(Label: X_{d=+} \rightarrow \{NOISE, DANGER, SAFE\}\) as follows:

  • \(Label(x)=NOISE\) if and only if all of the k nearest neighbors of x represent the majority class \(X_{d=-} =\{x \in U: d(x)=-\} ,\)

  • \(Label(x)=DANGER\) if and only if half or more than half of the k nearest neighbors of x belong to the majority class \(X_{d=-}\) or the nearest neighbour of x is a majority class representative,

  • \(Label(x)=SAFE\) if and only if more than half of the k nearest neighbors represent the same class as the example under consideration and the nearest neighbour of x is a minority class representative.
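A direct transcription of Definition 1 in Python might look as follows; it assumes that distances are supplied by a function such as the HVDM sketch above, and the name label_minority is ours:

```python
def label_minority(x_index, X, y, k, dist, minority="+"):
    """Assign NOISE / DANGER / SAFE to the minority object X[x_index] following Definition 1.

    dist is any distance function over rows of X, e.g. the HVDM sketch above.
    """
    # indices of all other objects, sorted by distance to x (closest first)
    others = sorted((i for i in range(len(X)) if i != x_index),
                    key=lambda i: dist(X[x_index], X[i]))
    neighbours = others[:k]

    majority_hits = sum(1 for i in neighbours if y[i] != minority)
    nearest_is_majority = y[neighbours[0]] != minority

    if majority_hits == k:                               # all k neighbours are majority
        return "NOISE"
    if majority_hits >= k / 2 or nearest_is_majority:    # half or more, or nearest is majority
        return "DANGER"
    return "SAFE"                                        # more than half minority, nearest minority
```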

The explained approach involves three modes of processing of DT. None of them creates new samples using NOISE examples. The first one is defined below; a sketch of the generation step shared by the modes follows it:

Definition 2

HighComplexity mode: \(DT \longmapsto DT_{balanced}\)

  • DANGER: the number of objects is doubled by creating one new example at half of the distance along the line segment between the DANGER object and one of its k nearest neighbors. For nominal attributes, the values describing the object under consideration are replicated,

  • SAFE: assuming that these concentrated instances provide specific and easy-to-learn patterns that enable proper recognition of minority samples, plenty of new data is created by interpolation between a SAFE object and one of its k nearest neighbors. Nominal attributes are determined by a majority vote of the k nearest neighbors' features.
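The generation step shared by the modes can be sketched as follows; the gap parameter and the nominal_value callback are our way of expressing the DANGER (half-distance, copied nominal values) and SAFE (full interpolation, majority vote) variants, not names used by the original algorithm:

```python
import random
from collections import Counter

def synthesize(x, neighbour, numeric_mask, gap, nominal_value):
    """Create one synthetic minority sample between x and a chosen neighbour.

    gap           : interpolation factor in (0, 1]; 0.5 places the sample halfway
                    towards the neighbour (DANGER objects in HighComplexity mode)
    nominal_value : callable returning the value of a nominal attribute, e.g.
                    copy from x or a majority vote over the k nearest neighbours
    """
    new = []
    for a, (xa, na) in enumerate(zip(x, neighbour)):
        if numeric_mask[a]:
            new.append(xa + gap * (na - xa))   # linear interpolation for numeric attributes
        else:
            new.append(nominal_value(a))       # nominal attributes handled separately
    return new

def majority_vote(neighbours):
    """Nominal-attribute helper: most frequent value among the k neighbours."""
    return lambda a: Counter(n[a] for n in neighbours).most_common(1)[0][0]

# DANGER object in HighComplexity mode: one sample at half the distance, nominal values copied from x
# new = synthesize(x, random.choice(knn), numeric_mask, gap=0.5, nominal_value=lambda a: x[a])
# SAFE object: full random interpolation, nominal values by majority vote of the k neighbours
# new = synthesize(x, random.choice(knn), numeric_mask, gap=random.random(), nominal_value=majority_vote(knn))
```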

The second option is applied when most of the examples belong to relatively homogeneous areas:

Definition 3

LowComplexity mode: \(DT \longmapsto DT_{balanced}\)

  • DANGER: most of the synthetic samples are generated in these borderline areas, since numerous majority class representatives may have a greater impact on classifier learning when there are not enough minority examples. Hence, many new examples are created closer to the object under consideration. One of the k nearest neighbors is chosen for each new sample when determining the values of numeric features. Values of nominal attributes are obtained by a majority vote of the k nearest neighbors' features,

  • SAFE: there is no need to significantly increase the number of instances in these safe areas. Only one new object per existing minority SAFE instance is generated. Numeric attributes are handled by interpolation with one of the k nearest neighbors. For the nominal features, the new sample has the same attribute values as the object under consideration.

The third option is specified as follows (a sketch of the mode selection is given after this definition):

Definition 4

noSAFE mode: \(DT \longmapsto DT_{balanced}\)

  • DANGER: all of the synthetic objects are created in the area surrounding class boundaries. This particular solution is selected in the case of an especially complex data distribution which does not include any SAFE samples. The absence of SAFE elements indicates that most of the examples are labeled as DANGER (there are no homogeneous regions). Since only DANGER and NOISE examples are available, only generating new instances in the neighborhood of DANGER objects can provide a sufficient number of minority samples.
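The choice between the three modes can be summarised with the following sketch, based on the 30% difficulty threshold assumed earlier; the exact precedence of the checks in VISROT may differ, so treat this only as an illustration:

```python
def choose_mode(labels, difficulty_threshold=0.3):
    """Pick a processing mode from the minority labels (cf. Definitions 2-4).

    labels: list of 'SAFE' / 'DANGER' / 'NOISE' strings produced for the
            minority class, e.g. by label_minority above.
    """
    if "SAFE" not in labels:
        return "noSAFE"                      # Definition 4: no homogeneous regions at all
    hard = sum(l in ("DANGER", "NOISE") for l in labels) / len(labels)
    return "HighComplexity" if hard >= difficulty_threshold else "LowComplexity"
```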

Omitting NOISE examples in the oversampling phase is explained by the idea of keeping the data distribution complexity as low as possible. Generating new synthetic samples from objects surrounded only by majority class representatives may introduce more inconsistencies. However, there is no guarantee that objects labeled as NOISE are truly the effect of errors; they may be merely outliers which appear untypical because no other similar objects are provided in the imbalanced data set [21]. Hence, we do not remove any of these instances, but we also do not create new examples similar to them.

Even when examples considered as noise are excluded from the oversampling process, generating new samples by combining the features of two chosen instances may still contribute to the creation of noisy examples. Thus some filtering and cleaning mechanisms are advisable [20]. In order to avoid introducing additional inconsistencies, we propose a technique of supervised preprocessing. The main idea of this approach is based on the lower approximation. After obtaining the threshold, the algorithm identifies newly created objects that do not belong to the lower approximation of the minority class. The correctness of each element is verified iteratively (by means of the similarity relation rather than the strict indiscernibility relation). The expected number of new samples is assured by an increased limit of generated objects. The proposed solution consists of the steps described in the provided algorithm; a sketch of the cleaning test is given below the algorithm.

[Figure a: pseudocode of the proposed VISROT algorithm]
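A minimal sketch of the similarity-based lower-approximation test used in the cleaning phase, assuming a distance function and threshold as in the previous steps; the exact membership test applied by VISROT may differ in detail:

```python
def clean_synthetic(synthetic, majority, dist, threshold):
    """Keep only synthetic samples that fall into the similarity-based lower
    approximation of the minority class: a sample similar to any majority
    object (distance not exceeding the threshold) is rejected."""
    return [s for s in synthetic
            if all(dist(s, m) > threshold for m in majority)]
```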

Generating redundant instances in Step VI protects against filtering out too many positive synthetic samples in the cleaning phase. The method of determining the number of additional objects should be evaluated in further research; in particular, the impact on computing performance needs to be investigated. We suggest that this number should be related to the complexity of the specific problem under consideration.

4 Experiments

The results of the experimental study are presented in this section. We decided to compare our algorithm with five oversampling methods considered successful in many domains. All of these techniques are described in Sect. 2. The widely used C4.5 decision tree was chosen as the classifier, since it is one of the most effective data–mining methods [7]. A very important parameter of the k–NN processing, namely k, was set to 5, as this value has been shown to be suitable for a wide range of problems [8]. The HVDM metric was applied to measure the distances between objects, because it properly handles both quantitative and qualitative data [26].

Six data sets were selected to perform the described experiments. They are highly imbalanced real–life data sets obtained from the UCI repository [24]. All of them were first divided into training and test partitions to ensure the correctness of the five-fold cross-validation results. We used the partitioned data available on the KEEL website [1]. The analyzed data sets are presented in Table 1.

Table 1. Characteristics of evaluated data sets

The existence of a boundary region, defined by the rough set notions, is emphasized in order to verify the impact of data inconsistencies on the performance of classifiers combined with particular preprocessing techniques.
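For readers who wish to reproduce a comparable protocol, the sketch below shows a generic five-fold cross-validated AUC evaluation. It is only an approximation of the setup used here: scikit-learn's DecisionTreeClassifier (a CART-style tree) stands in for C4.5, stratified folds replace the ready-made KEEL partitions, and labels are assumed to be encoded as 0/1 with 1 denoting the minority class:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def evaluate(X, y, oversample, n_splits=5, seed=1):
    """Mean AUC over five folds; oversampling is applied to the training folds only."""
    aucs = []
    for train, test in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        X_bal, y_bal = oversample(X[train], y[train])        # e.g. VISROT, SMOTE, ...
        clf = DecisionTreeClassifier(random_state=seed).fit(X_bal, y_bal)
        scores = clf.predict_proba(X[test])[:, 1]            # probability of class 1 (minority)
        aucs.append(roc_auc_score(y[test], scores))
    return float(np.mean(aucs))
```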

Table 2 presents the results of the experiments. The area under the ROC curve (AUC) was used to evaluate classifier performance. This measure captures the trade-off between sensitivity (the percentage of positive instances correctly classified) and the percentage of negative examples misclassified.
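For a single crisp classifier output, this trade-off is often summarised with the standard approximation used in the imbalanced-learning literature (the text does not state which exact formula was applied, so this is given only for orientation):

$$AUC = \frac{1 + TP_{rate} - FP_{rate}}{2}.$$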

Table 2. Classification results for the selected UCI data sets - comparison of the proposed algorithm VISROT with five other techniques and with classification without a preprocessing step (noPRE).

The VISROT algorithm introduced in this paper was compared with five other preprocessing techniques whose performance was measured in [19].

The results revealed that the proposed method outperforms the other algorithms in two cases (glass5, ecoli01_vs_5), one of which has a non-empty boundary region. For two data sets VISROT achieves results similar to the most effective techniques. In the remaining two cases applying the VISROT approach was slightly less beneficial than SMOTE and SMOTE–ENN or Safe–Level SMOTE and SMOTE–RSB\(_{*}\). The experiments showed that the proposed algorithm is suitable for dealing with real–life complex data distributions, even highly imbalanced ones.

5 Conclusions and Future Research

In this paper we introduced a new preprocessing method dedicated to both quantitative and qualitative attributes in imbalanced data problems. The described approach considers significant difficulties that lead to the misclassification of many minority class samples. Since an insufficient number of examples representing the positive class is not the only reason for performance degradation, other factors were also considered. In particular, the occurrence of sub–regions, noise and class overlapping were examined, as they indicate high data complexity. The performed experiments confirm that oversampling preceded by an analysis of the local neighborhood of positive instances is a proper approach. Moreover, the need for an additional cleaning step that removes inconsistencies is emphasized. The VISROT results showed that rough set notions can be successfully applied to imbalanced data problems.

We suggest that the proposed algorithm should be adjusted to handle Big Data problems in future research. The value of the minimal allowed distance defining the weakened lower approximation rule (the threshold) can also be investigated.