Dynamic affinity-based classification of multi-class imbalanced data with one-versus-one decomposition: a fuzzy rough set approach
Abstract
Class imbalance occurs when data elements are unevenly distributed among classes, which poses a challenge for classifiers. The core focus of the research community has been on binary-class imbalance, although there is a recent trend toward the general case of multi-class imbalanced data. The IFROWANN method, a classifier based on fuzzy rough set theory, stands out for its performance in two-class imbalanced problems. In this paper, we consider its extension to multi-class data by combining it with one-versus-one decomposition. The latter transforms a multi-class problem into two-class subproblems. Binary classifiers are applied to these subproblems, after which their outcomes are aggregated into one prediction. We enhance the integration of IFROWANN in the decomposition scheme in two steps. First, we propose an adaptive weight setting for the binary classifier, addressing the varying characteristics of the subproblems. We call this modified classifier IFROWANN\({{\mathcal {W}}_{\mathrm{IR}}}\). Second, we develop a new dynamic aggregation method, called WV–FROST, that combines the predictions of the binary classifiers with the global class affinity before making a final decision. In a meticulous experimental study, we show that our complete proposal outperforms the state-of-the-art on a wide range of multi-class imbalanced datasets.
Keywords: Imbalanced data · Multi-class classification · One-versus-one · Fuzzy rough set theory
1 Introduction
This paper focuses on the challenge of class imbalance for classification problems, which occurs when the elements of a dataset are unevenly distributed among the classes. Such class imbalance poses a challenge to traditional classifiers, such that specific methods able to deal with the imbalance need to be employed instead [31, 55]. In a two-class scenario, the imbalance ratio (IR), the ratio between the number of majority and minority class examples [48], is used to identify this type of dataset. The most straightforward cause of the performance degradation is the misclassification of minority class examples. The recognition of these examples can be ignored in favor of majority instances when considering two common criteria: the maximization of both accuracy and model generalization. With regard to the former, a good classification performance on majority classes can easily overshadow a very poor recognition of minority instances. For the latter, the regions of the problem space with few minority examples may be discarded in the learning process. Recent studies have also shown that the problem of imbalanced classes usually occurs in combination with various intrinsic characteristics of the data [42] that impose additional learning restrictions. Among them, we stress the overlap between classes [2, 27] and the presence of small disjuncts and noisy data [54].
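For concreteness, the IR of a two-class dataset can be computed as follows; this minimal sketch is ours (the function name is not taken from the paper):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Imbalance ratio of a two-class label sequence:
    majority class size divided by minority class size."""
    counts = Counter(labels)
    assert len(counts) == 2, "the IR is defined here for two-class data"
    minority, majority = sorted(counts.values())
    return majority / minority
```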
In this paper, we consider the general problem of multi-class imbalance, whereas many previous works have been limited to the binary imbalanced case. Multi-class imbalanced data is encountered in real-life applications, like microarray research [68], protein classification [71], medical diagnosis [7], activity recognition [24], target detection [52] and video mining [25].
When aiming to solve any classification problem, it is clear that the higher the number of classes, the harder it becomes to correctly determine the output label for a query instance. This is mainly due to the overlap between the different classes in the dataset, which increases as more classes are interrelated. One simple yet effective way to address this task is to apply a divide-and-conquer methodology. Such methods are known as decomposition strategies [44], in which the original problem is divided into several easier-to-solve binary subproblems. A different classifier is devoted to distinguishing between each pair of classes, and then, in the testing phase, the outputs of all classifiers are aggregated to make the final decision [20]. The difficulty in addressing the multi-class problem is therefore shifted from the classifier itself to the combination stage. Among the proposed decomposition strategies, the one-versus-one (OVO) setting has been shown to outperform the one-versus-all (OVA) setting for imbalanced data (e.g. [16]). One problem related to decomposition schemes is the question of classifier competence [19]. In the OVO setting, this issue refers to the fact that the outputs of all classifiers are taken into account equally when extracting a final prediction, although some of them were not trained to discern the real class of the instance and will usually not provide any relevant information. This can hinder the prediction performance and should be considered when developing a method based on OVO decomposition.
The work of [51] proposed a powerful classifier for two-class imbalanced data based on fuzzy rough set theory [12], a mathematical theory that allows vagueness and indiscernibility in data to be modeled. This method, called IFROWANN, was shown to outperform other state-of-the-art methods. Its limitation is that it was set up as a binary classifier and cannot directly deal with more than two classes.

IFROWANN \({{\mathcal {W}}_{\mathrm{IR}}}\): the fuzzy rough component of the IFROWANN method requires the specification of a weighting scheme. The original study in [51] showed that the optimal choice depends on the IR of the two-class problem under consideration. In an OVO decomposition, the IR can greatly differ among the binary subproblems. We therefore propose an adaptive version of IFROWANN, called IFROWANN\({{\mathcal {W}}_{\mathrm{IR}}}\), that dynamically chooses its weight setting based on the IR of each binary problem at hand. We demonstrate the necessity of an adaptive weight choice in our experiments.

WV–FROST: the second original contribution (and main novelty) of this paper is a new approach to deal with the classifier competence issue in an OVO ensemble. Each classifier in the OVO decomposition provides local information, that is, it only discerns between two possible classes. The reduction to two classes results in a loss of information. This is somewhat counteracted by aggregating over all classifiers to obtain a final prediction, as done in existing OVO aggregation schemes. We propose a further performance enhancement by explicitly including two global measures in the decision procedure. In this way, we aim to optimally use all information contained in the dataset. Both global summary terms are based on fuzzy rough set theory, as the binary classifiers are. When classifying an instance, the summary terms evaluate its global affinity with all candidate classes, complementing the local information provided by the OVO classifiers. Our new aggregation method is called weighted voting with fuzzy rough summary terms (WV–FROST).
We use 18 datasets from various application domains in our experiments. In a first stage, we demonstrate the advantage of our adaptive weighting scheme for IFROWANN in the OVO setting. Secondly, we show that our affinity-based design WV–FROST outperforms the earlier dynamic approaches from [21, 22] as well as partially constructed models using no binarization step or only one of the two summary terms. Finally, our complete method FROVOCO is experimentally shown to outperform the state-of-the-art in multi-class imbalanced classification, in particular C4.5-OVO combined with preprocessing [16], AdaBoost.NC [61] and C4.5 with Mahalanobis Distance Oversampling (MDO) [1].
The remainder of this paper is organized as follows. In Sect. 2, we recall the proposed solutions for the classification of imbalanced data in both the binary and multi-class settings, as well as the OVO decomposition scheme and its existing traditional and dynamic aggregation methods. Section 3 describes the original IFROWANN method from [51] and provides the necessary background on fuzzy rough set theory. The original contributions of this work, IFROWANN\({{\mathcal {W}}_{\mathrm{IR}}}\) and WV–FROST, are presented in Sect. 4. Our proposal is carefully evaluated in Sects. 5–7 and shown to outperform the state-of-the-art. Finally, our conclusions and future work are formulated in Sect. 8.
2 Classification approaches for imbalanced data
Despite its common occurrence and its strong impact on applications, the problem of imbalanced classes has not been properly solved by standard machine learning algorithms. Indeed, methods that perform well on standard classification problems do not necessarily achieve the best performance for imbalanced datasets [15]. The main issue is that they assume equal distributions among classes or the same misclassification cost for all classes.
Traditionally, the focus of class imbalance research has been on binary problems, where one class is considerably larger than the other (Sect. 2.1). However, datasets with more than two classes can be imbalanced as well and the attention of the research community has shifted to this more general setting in recent years. This paper focuses on classification problems with more than two classes, for which we apply the OVO decomposition scheme [30]. This strategy and its aggregation mechanisms are described in Sect. 2.2, including some remarks on the dynamic classifier selection procedure for the OVO scheme. Finally, Sect. 2.3 discusses some relevant solutions to deal with multi-class imbalanced data.
2.1 Binary-class imbalance
When the goal is to boost the global performance on both the minority and majority classes, special mechanisms must be applied together with the classifiers. The procedures to address imbalanced classification in two-class problems can be categorized into three groups [40, 42]: data level solutions that rebalance the training set [4], algorithmic level solutions that adapt the learning stage toward the minority classes [3] and cost-sensitive solutions which consider different misclassification costs with respect to the class distribution [11].
Among these methodologies, the advantage of the data level solutions is that they are more versatile, since their use is independent of the selected classifier. Three possible schemes can be applied: undersampling of the majority class, oversampling of the minority class and combinations of these two. The simplest approach, random undersampling, removes instances from the majority class until the class distribution is more balanced. A downside is that important majority class examples may be ignored. The random oversampling alternative makes exact copies of existing minority instances. The drawback here is that this method can increase the likelihood of overfitting [4].
More sophisticated algorithms have been proposed based on the generation of synthetic samples, often inspired by the SMOTE oversampling method [6]. The core idea is to form new minority class examples by interpolating between several minority class examples that lie close together. This expands the clusters of the minority class and strengthens the borderline areas between classes.
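The interpolation step at the heart of SMOTE can be sketched as follows; this toy implementation (function and parameter names are ours) generates a single synthetic example from a list of minority instances given as coordinate tuples:

```python
import random

def smote_sample(minority, k=5, seed=0):
    """Create one synthetic minority example by interpolating between a
    randomly chosen minority instance and one of its k nearest minority
    neighbours (the core idea behind SMOTE)."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # sort the remaining minority instances by squared Euclidean distance to base
    by_dist = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )
    neighbour = rng.choice(by_dist[:k])
    gap = rng.random()  # interpolation factor in [0, 1]
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
```

Since the synthetic point lies on the segment between two existing minority instances, it always falls inside the convex hull of the minority class.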
2.2 The one-versus-one scheme (OVO)
The use of decomposition strategies in multi-class classification is of great interest to the research community [20, 44]. This scheme simplifies the original problem into binary-class subsets, following a divide-and-conquer paradigm. Evidently, boundaries between two classes are easier to learn than those in the general case, where multiple classes are more likely to overlap. Therefore, the critical step is moved toward the decision process, in which the confidence degrees of all binary classifiers must be aggregated in order to output a single class.
Several combination strategies to derive a class prediction from the score matrix R(x) of pairwise confidence degrees have been proposed in the literature. Two of the most intuitive aggregation methods are the simple voting strategy (VOTE) proposed in [17] and the weighted voting strategy (WV) from [35]. In the former setting, each binary classifier casts a vote for its predicted class. The class that receives the most votes is predicted. In the weighted alternative, each classifier assigns a confidence to the two classes that it handles. The class with the largest total confidence is the final prediction. More advanced methods are pairwise coupling (PC, [30]), a decision directed acyclic graph (DDAG, [50]), learning valued preference for classification (LVPC, [33, 34]), the non-dominance criterion (ND, [14]), a binary tree of classifiers (BTC, [13]), nesting OVO (NEST, [38, 39]) and probability estimates by pairwise coupling (PE, [65]). We refer the reader to the review in [20] for clear descriptions of these methods.
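The two baseline aggregations can be sketched as follows, assuming a score matrix R in which R[i][j] holds the confidence for class i returned by the classifier trained on the pair (i, j); this illustration is ours, not the authors' implementation:

```python
def vote(R):
    """VOTE: each pairwise classifier votes for the class it assigns the
    higher confidence; the class with most votes is predicted."""
    m = len(R)
    votes = [sum(1 for j in range(m) if j != i and R[i][j] > R[j][i])
             for i in range(m)]
    return max(range(m), key=votes.__getitem__)

def weighted_vote(R):
    """WV: sum the confidences each classifier assigns to a class and
    predict the class with the largest total."""
    m = len(R)
    totals = [sum(R[i][j] for j in range(m) if j != i) for i in range(m)]
    return max(range(m), key=totals.__getitem__)
```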

Dynamic-OVO (DynOVO) [21]. The WV scheme is used as a basis. Prior to its computation, the score matrix is filtered by considering only those classifiers whose classes are in the neighborhood of the query instance. A neighborhood of size \(3\cdot m\) is used, with m the number of classes.

Distance relative competence weighted approach (DRCW) [22]. This methodology consists of carrying out a dynamic adaptation of the score matrix. It alters the confidence degrees by assigning a higher weight to those classifiers whose predicted classes are in the neighborhood of the query instance, setting the final prediction to \(\mathop {{{\mathrm{arg\,max}}}}\nolimits _{i = 1,\ldots ,m} \displaystyle \sum \nolimits _{1 \le j \ne i \le m} (r_{ij}\cdot w_{ij})\), where \(w_{ij}\) is the relative classifier competence computed as \(w_{ij} = \frac{d^2_j}{d^2_i + d^2_j}\), with \(d_i\) the average distance of the k neighbors of the ith class to the query instance.
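The DRCW weighting can be sketched as follows, given the score matrix R as above and the average distance \(d_i\) of the query's k neighbors from each class i (a sketch of ours, following the formula in the text):

```python
def drcw_predict(R, avg_dists):
    """DRCW: weight each confidence R[i][j] by the relative competence
    w_ij = d_j^2 / (d_i^2 + d_j^2), where avg_dists[i] is the average
    distance of the k neighbors of class i to the query; predict the
    class maximizing the weighted sum."""
    m = len(R)
    def score(i):
        return sum(
            R[i][j] * (avg_dists[j] ** 2 / (avg_dists[i] ** 2 + avg_dists[j] ** 2))
            for j in range(m) if j != i
        )
    return max(range(m), key=score)
```

Classes whose neighbors lie close to the query (small \(d_i\)) thus see their confidences up-weighted relative to distant classes.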
2.3 Multiclass imbalance
The scenario of imbalanced classification with more than two classes imposes a strong restriction on the correct recognition of the different concepts present in the data [16]. This is not only due to the larger number of boundaries to consider. All data characteristics that complicate binary imbalanced classification [42] are further accentuated when more than two classes are present. The occurrence of multi-minority and multi-majority classes, the dependency among these classes (including overlapping) and relations between same-class examples are possibly the main causes of performance degradation in this case.
The three solution groups listed in Sect. 2.1 are designed for two-class problems, and their extension to the multi-class scenario is not straightforward. On the one hand, data level solutions (preprocessing) are not directly applicable, as the search space is increased. On the other hand, algorithmic level solutions become more complicated since there can be more than one minority class. Several alternatives have been developed to address this task [16]. We emphasize three different schemes: two approaches acting at the data level, namely OVO with preprocessing [16] and MDO [1], and a third one considering the use of ensembles for multi-class imbalanced learning, like the AdaBoost.NC method [61].
 1.
First, the original multi-class problem is divided into simpler binary subproblems by means of a decomposition strategy [20], for instance with the OVO scheme. In this way, the skewed class distribution is somewhat mitigated, as the sizes of two given classes can be similar.
 2.
Then, for each subproblem, any instance preprocessing technique for two-class imbalanced datasets can be applied. In this paper, we use the SMOTE method from [6]. After this step, every binary dataset is processed by a classifier. Recall that in the learning stage only instances from the two classes that the classifier is responsible for are taken into account.
 3.
Finally, when a new instance is presented, every individual classifier is fired to provide the confidence degrees for the two classes it is responsible for. These values are then aggregated using one of the schemes discussed in Sect. 2.2.
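The decomposition step of this pipeline can be sketched as follows (a toy illustration with names of our choosing; the preprocessing and classifier training of steps 2 and 3 would then operate on each returned subset):

```python
from itertools import combinations

def ovo_subproblems(X, y):
    """Split a multi-class training set into one two-class subset per
    pair of labels: the OVO decomposition step."""
    classes = sorted(set(y))
    subs = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, lab in enumerate(y) if lab in (a, b)]
        subs[(a, b)] = ([X[i] for i in idx], [y[i] for i in idx])
    return subs
```

For m classes this yields m(m−1)/2 binary subproblems, each seeing only the instances of its two classes.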
AdaBoost.NC ensemble. The binary version of this method was proposed in [62] and incorporates negative correlation learning. It is based on the AdaBoost algorithm and extends it by introducing diversity between the constructed classifiers. The instance weights are not solely used to better recognize misclassified elements in later iterations, but also to enhance the diversity. AdaBoost.NC was extended to handle more than two classes in [61]. The authors noted that the application of random oversampling is required to improve the recognition of minority instances. To avoid increasing the training time, we incorporate this instruction in our experiments by a modified initialization of the ensemble weights, giving a higher significance to smaller classes. AdaBoost.NC is an important baseline against which to measure the performance of new methods in multi-class imbalanced learning. We note that the use of ensembles for multi-class imbalanced learning has been evaluated in [28, 67] as well, albeit in conjunction with feature selection. In the study of [70], ensembles for binary imbalanced classification were used within the OVO decomposition, showing competitive results with AdaBoost.NC.
3 The IFROWANN algorithm
In this second preliminary section, we recall the original IFROWANN classification method, a classifier using fuzzy rough sets for binary imbalanced classification. We provide the necessary background on fuzzy rough set theory (Sect. 3.1) and describe the classification model itself (Sect. 3.2).
3.1 Fuzzy rough set theory
Fuzzy rough set theory [12] is a mathematical tool dealing with two distinct types of uncertainty in data, namely vagueness and incompleteness. It was developed by integrating fuzzy set theory [69] into rough set theory [49]. Rough sets approximate a concept C described by an incomplete feature set in two ways. The lower approximation contains elements certainly belonging to C, while the upper approximation consists of elements possibly belonging to it. When these two sets are equal, there is no uncertainty in the data. In every other case, C cannot be described conclusively based on the observed features and can only be approximated. A limitation of rough set theory is that it requires discretization of any real-valued features in order to obtain useful results. The extension to fuzzy rough set theory addresses this issue: by measuring the similarity between instances with fuzzy relations, the discretization requirement is removed. Fuzzy rough set theory has been used in many machine learning applications, see [58] for a recent review.
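To make the approximation operators concrete, the following sketch of ours computes the lower and upper approximation memberships of one instance to a concept, using the common Łukasiewicz implicator and t-norm with the classical inf/sup aggregations; the OWA-based variant used by IFROWANN replaces these inf and sup operators by ordered weighted averages:

```python
def lower_upper(similarities, memberships):
    """Fuzzy rough approximations of a concept C for one instance x.
    similarities[i] = R(x, y_i), the fuzzy similarity of x to training
    instance y_i; memberships[i] = C(y_i), the membership of y_i to C.
    Lower: inf over the Lukasiewicz implicator min(1, 1 - R + C).
    Upper: sup over the Lukasiewicz t-norm max(0, R + C - 1)."""
    lower = min(min(1.0, 1 - r + c) for r, c in zip(similarities, memberships))
    upper = max(max(0.0, r + c - 1) for r, c in zip(similarities, memberships))
    return lower, upper
```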
 1.
Sort the elements in V in descending order. Let \(S = \langle s_1, \ldots , s_n \rangle \) be this sorted sequence, where \(s_i\) is the ith largest value in V.
 2.
Compute the OWA aggregation of V as \({{\mathrm{OWA}}}^W(V) = \displaystyle \sum \nolimits _{i=1}^n w_i s_i\).
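The two steps above can be sketched in a few lines (an illustration of ours):

```python
def owa(values, weights):
    """OWA aggregation: the weights are applied to the values sorted in
    descending order, not to the values in their original positions."""
    assert len(values) == len(weights)
    return sum(w * s for w, s in zip(weights, sorted(values, reverse=True)))
```

Note that the weight vector controls the behavior of the operator: the vector (1, 0, …, 0) recovers the maximum, (0, …, 0, 1) the minimum, and uniform weights the ordinary mean.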
3.2 The classification model
In [51], a classification method for two-class imbalanced data based on fuzzy rough set theory was proposed. It is an extension of the fuzzy rough nearest neighbor classifier (FRNN, [36]), modified to deal with class imbalance. To classify an instance x, FRNN computes its membership degree to the fuzzy rough lower and upper approximations of each class. The score for a class is set to the average membership degree of x to its approximations. The membership degrees are computed by expressions (1) and (3).
4 FROVOCO: a novel algorithm for multi-class imbalanced problems

In Sect. 4.1, we describe our IFROWANN\({\mathcal {W}}_{\mathrm{IR}}\) binary classifier. We propose a new adaptive weighting scheme that selects appropriate weights depending on the IR of the pair of classes at hand.

In Sect. 4.2, we describe our new OVO aggregation scheme: weighted voting with fuzzy rough summary terms (WV–FROST). We introduce two global summary terms, measuring the affinity of a test instance with the possible classes in two ways. We combine these terms with the WV aggregation scheme to enhance the performance of the latter. WV–FROST deals with the classifier competence issue in an OVO decomposition in a global way.

As a summary, we present a flowchart of our full proposal FROVOCO in Sect. 4.3.
4.1 Binary classifier within OVO: IFROWANN\({\mathcal {W}}_{\mathrm{IR}}\)
In the first stage of our experimental study (Sect. 6.1), we will evaluate the performance of IFROWANN in the OVO schemes discussed in Sect. 2.2. For each pair of classes, we apply an IFROWANN classifier to discern between them. The smallest class of the two is used as positive class. The original method solely yields class predictions, while the construction of the OVO score matrix requires the output of class confidence scores. To this end, when classifying x and applying IFROWANN to classes \(C_1\) and \(C_2\), we set the score for \(C_1\) to \(\frac{\underline{C_1}(x)}{\underline{C_1}(x)+\underline{C_2}(x)}\) and that for \(C_2\) to \(\frac{\underline{C_2}(x)}{\underline{C_1}(x)+\underline{C_2}(x)}\).
An important question with respect to IFROWANN regards the choice of the weighting scheme. As indicated above, the original study put forward two good candidate schemes: \({\mathcal {W}}_{e}\) and \({\mathcal {W}}_{\gamma }\). It was shown that \({\mathcal {W}}_{e}\) performs well for mildly imbalanced data with an IR up to 9, while for higher imbalance \({\mathcal {W}}_{\gamma }\) obtained better results. An IR of 9 is traditionally used (e.g. [41, 57]) as a threshold above which datasets are considered highly imbalanced. In a binarization scheme, the IR can be different for each pair of classes. It may therefore not be prudent to decide on the weighting scheme of IFROWANN beforehand, but rather choose this based on the imbalance between the two classes. In our experiments, we therefore evaluate three separate settings, two where the weighting scheme is the same in all IFROWANN classifiers (either \({\mathcal {W}}_{e}\) or \({\mathcal {W}}_{\gamma }\)) and a third one where \({\mathcal {W}}_{e}\) is used when the IR between a pair of classes is at most 9 and \({\mathcal {W}}_{\gamma }\) otherwise. We denote the third scheme as \({\mathcal {W}}_{\mathrm{IR}}\) and the corresponding classifier as IFROWANN\({\mathcal {W}}_{\mathrm{IR}}\).
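The \({\mathcal {W}}_{\mathrm{IR}}\) rule amounts to a simple threshold test on the pairwise IR, sketched below with names of our choosing:

```python
IR_THRESHOLD = 9  # boundary between mild and high imbalance (cf. [41, 57])

def select_weighting(n_class1, n_class2):
    """W_IR rule: use the W_e weight vector when the pairwise IR is at
    most 9, and W_gamma for highly imbalanced class pairs."""
    ir = max(n_class1, n_class2) / min(n_class1, n_class2)
    return "W_e" if ir <= IR_THRESHOLD else "W_gamma"
```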
We want to note that it would also be possible to extend IFROWANN to a multiclass classifier without applying a binarization step. However, due to the definition of the fuzzy rough approximation operators (2–6), this corresponds largely to an OVA aggregation, which has been shown in previous studies (e.g. [16]) to not perform well compared to an OVO setting. Indeed, preliminary results for this evaluation (Sect. 6.2) showed that a multiclass version of IFROWANN does not perform at the same level as the integration of the classifier in binarization schemes. As this path seemed less promising, we do not pursue it further in this paper.
4.2 New OVO aggregation scheme: WV–FROST
We present a new OVO aggregation scheme, called weighted voting with fuzzy rough summary terms (WV–FROST). It enhances the WV method with a global evaluation represented by two summary terms. This can be regarded as an alternative for traditional dynamic classifier selection models. Whereas those methods only take into account the locality of the input instance, we propose to counteract the information loss resulting from the dataset reduction to class pairs by the inclusion of global measures in the decision procedure. The addition of these terms deals with the problem of classifier competence, as the aggregation not only relies on the binary classification performance, but uses the global information available in the dataset as well.

Positive affinity term: we compute the membership degree of an instance to each class, based on the full training set, by means of the fuzzy rough approximation operators.

Negative affinity term: each class C can be represented by a vector containing the expected membership degrees of an instance of class C to all classes. For a test instance, such a signature vector can be constructed as well. We penalize the distance from an instance to a class based on these vectors.
4.3 Overview of the FROVOCO proposal
 1.
In the decomposition phase, x is sent to all IFROWANN\({\mathcal {W}}_{\mathrm{IR}}\) classifiers, each trained on a pair of classes.
 2.
Class confidence scores are obtained from the IFROWANN\({\mathcal {W}}_{\mathrm{IR}}\) classifiers.
 3.
These scores are grouped into a score matrix.
 4.
WV–FROST is applied to aggregate the score matrix to the vector containing the values \(AV_x(C)\) for all classes C (Fig. 2). Instance x is assigned to the class corresponding to the largest \(AV_x(\cdot )\) value.
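Leaving aside the precise definitions of the affinity terms (given in Sect. 4.2), the final decision step can be sketched as follows; the equal weighting of the three components is a simplifying assumption of ours, meant only to make the arg max rule concrete:

```python
def frovoco_predict(wv_scores, pos_affinity, neg_affinity):
    """Sketch of the final FROVOCO decision: per class, combine the
    weighted-voting score with a positive-affinity bonus and a
    negative-affinity penalty, then predict the arg max.  The actual
    combination used by WV-FROST is specified in the paper."""
    m = len(wv_scores)
    av = [wv_scores[i] + pos_affinity[i] - neg_affinity[i] for i in range(m)]
    return max(range(m), key=av.__getitem__)
```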
5 Experimental setup
In this section, we lay out the details of our study, specifying the datasets, the evaluation measures, the tests used in the statistical analysis and the stateoftheart methods to which we compare our proposal. The experimental study is conducted in Sects. 6–7.
5.1 Datasets
The 18 multi-class imbalanced datasets used in our experimental study

Dataset            ID   # inst  # feat      m   Min IR  Av. IR   Max IR   Class distribution
Automobile         aut     159  25 (15/10)  6    1.04    4.90    16.00    3/20/48/46/29/13
Balance            bal     625   4 (4/0)    3    1.00    4.25     5.88    288/49/288
Cleveland          cle     297  13 (13/0)   5    1.03    3.87    12.62    164/55/36/35/13
Contraceptive      con    1473   9 (9/0)    3    1.23    1.55     1.89    629/333/511
Dermatology        der     358  34 (34/0)   6    1.00    2.17     5.55    111/60/71/48/48/20
Ecoli              eco     336   7 (7/0)    8    1.00   15.27    71.50    143/77/2/2/35/20/5/52
Glass              gla     214   9 (9/0)    6    1.09    3.60     8.44    70/76/17/13/9/29
Led7digit          led     500   7 (7/0)   10    1.00    1.16     1.54    45/37/51/57/52/52/47/57/53/49
Lymphography       lym     148  18 (3/15)   4    1.33   18.30    40.50    2/81/61/4
Newthyroid         new     215   5 (5/0)    3    1.17    3.48     5.00    150/35/30
Pageblocks         pag    5472  10 (10/0)   5    1.32   31.65   175.46    4913/329/28/87/115
Satimage           sat    6435  36 (36/0)   6    1.01    1.73     2.45    1533/703/1358/626/707/1508
Shuttle            shu  58,000   9 (9/0)    7    1.30  561.92  4558.60    45,586/49/171/8903/3267/10/13
Thyroid            thy    7200  21 (21/0)   3    2.22   20.16    40.16    166/368/6666
Wine               win     178  13 (13/0)   3    1.20    1.30     1.48    59/71/48
Winequality-red    wqr    1599  11 (11/0)   6    1.07   18.83    68.10    10/53/681/638/199/18
Winequality-white  wqw    4898  11 (11/0)   7    1.07   61.08   439.60    20/163/1457/2198/880/175/5
Yeast              yea    1484   8 (8/0)   10    1.08   11.65    92.60    244/429/463/44/51/163/35/30/20/5
5.2 Evaluation measures
5.3 Statistical analysis
We list average results taken over the group of 18 datasets and, for the final comparison in Sect. 7.2, the results on each individual dataset as well. We combine this with an appropriate statistical analysis, applying nonparametric statistical tests as recommended by e.g. [10, 26]. For the comparison between two methods, we use the Wilcoxon signed-ranks test [63]. Its null hypothesis is that the two methods have an equivalent performance. In order to find sufficient evidence to reject the null hypothesis, the absolute differences in results of the two methods are ranked. The smallest absolute difference is assigned rank 1 and the largest rank n, with n the number of observations (18 in our study). In a comparison of ‘Method 1 versus Method 2’, the positive differences are in favor of the first method, the negative differences in favor of the second. The ranks of the positive differences are summed up to \(R^+\) and those of the negative differences to \(R^-\). We report these two values, together with the p value of the test. When it is smaller than the significance level \(\alpha \), the null hypothesis is rejected and the first method is concluded to perform significantly better than the second.
We also perform multiple comparisons, that is, we take a group of methods and determine whether any significant performance differences can be found among them. In this case, we use the Friedman test [18] in combination with the Holm post hoc procedure [32]. The null hypothesis of the Friedman test is that all methods under consideration perform equivalently. When it is rejected, the post hoc procedure is applied to detect where the significant differences can be found. The Friedman test is based on a ranking procedure and the method with the lowest rank is concluded to have the overall best performance. It is used as a control method to which the remaining methods are compared in the Holm procedure. When the p values of these comparisons are lower than \(\alpha \), it is concluded that the control method outperforms the other method with statistical significance. For these multiple comparisons, we list the Friedman ranks of all methods, the p value of the Friedman test (\(p_{\mathrm{Friedman}}\)) and the adjusted p values of the post hoc procedure (\(p_{\mathrm{Holm}}\)).
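The computation of \(R^+\) and \(R^-\) can be sketched as follows; for brevity, this illustration of ours drops zero differences (as the test does) and omits the average-rank treatment of tied absolute differences that the full test applies:

```python
def wilcoxon_rank_sums(results_a, results_b):
    """R+ and R- of the Wilcoxon signed-ranks test: rank the nonzero
    paired differences by absolute value (rank 1 = smallest) and sum the
    ranks of positive and negative differences separately.
    Ties in |difference| are not averaged here (simplification)."""
    diffs = [a - b for a, b in zip(results_a, results_b) if a != b]
    ranked = sorted(diffs, key=abs)
    r_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    r_minus = sum(rank for rank, d in enumerate(ranked, start=1) if d < 0)
    return r_plus, r_minus
```

In practice one would use a library routine (e.g. `scipy.stats.wilcoxon`) that also handles ties and returns the p value.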
5.4 Structure of experiments and method parameters

Section 6: we first conduct an internal comparison of our proposal, in order to clearly show the strength of the separate components IFROWANN\({\mathcal {W}}_{\mathrm{IR}}\) and WV–FROST.

Section 7: we show the benefits of WV–FROST over other dynamic aggregation methods. Finally, we compare our full method FROVOCO to the three state-of-the-art methods in multi-class imbalanced classification recalled in Sect. 2.3. In AdaBoost.NC [61], we set the penalty strength \(\lambda \) to 2, as done in earlier work (e.g. [16, 61]). The number of classifiers in the ensemble was set to 10, which is lower than the value used in these referenced studies. In a preliminary evaluation, we observed that this value provides better average results on our selected datasets. It has been used in ensembles for imbalanced data in earlier studies as well (e.g. [42]). For the OVO combination with preprocessing, we use the SMOTE-C4.5 classifier with the same parameter settings as [16]. The use of decision tree learners like C4.5 in ensembles has been highlighted in [53]. Finally, we include the MDO preprocessing method in combination with C4.5 as well. We do not use any cost-sensitive classification method, as an appropriate definition of the cost matrix is usually not readily available [55] and domain experts are required for its specification [72].
6 Experimental evaluation of IFROWANN\({\mathcal {W}}_{\mathrm{IR}}\) and WV–FROST

Section 6.1: we evaluate the performance of the IFROWANN method in existing OVO schemes, using the three weighting schemes listed in Sect. 4.1. We clearly show the benefit of our novel IFROWANN\({\mathcal {W}}_{\mathrm{IR}}\) method.

Section 6.2: we compare WV–FROST to related partially constructed aggregation models and show the advantages of the full proposal.
6.1 Evaluation of IFROWANN\({\mathcal {W}}_{\mathrm{IR}}\)
Results of the integration of IFROWANN in the OVO setting with the traditional aggregation methods (best result per row in bold)

AvgAcc

Method  \({\mathcal {W}}_e\)  \({\mathcal {W}}_{\gamma }\)  \({\mathcal {W}}_{\mathrm{IR}}\)
VOTE  \(69.4460 \pm 19.6554\)  \(61.6819 \pm 20.7889\)  \(\mathbf{70.4035 \pm 20.5736}\)
WV  \(69.4460 \pm 19.6554\)  \(63.1538 \pm 21.6848\)  \(\mathbf{71.4921 \pm 19.5685}\)
PC  \(69.4959 \pm 19.6182\)  \(62.6440 \pm 21.0826\)  \(\mathbf{71.3500 \pm 19.1160}\)
DDAG  \(69.9674 \pm 19.6337\)  \(59.4896 \pm 21.0593\)  \(\mathbf{71.1942 \pm 19.9944}\)
LVPC  \(58.8251 \pm 20.3807\)  \(61.2524 \pm 20.2090\)  \(\mathbf{63.2167 \pm 19.5304}\)
ND  \(69.5269 \pm 19.5921\)  \(58.5540 \pm 23.3192\)  \(\mathbf{70.2541 \pm 21.2630}\)
BTC  \(69.4497 \pm 19.7924\)  \(59.0936 \pm 21.4492\)  \(\mathbf{70.5591 \pm 20.5641}\)
NEST  \(69.5686 \pm 19.6920\)  \(56.2790 \pm 23.1138\)  \(\mathbf{70.0260 \pm 21.2530}\)
PE  \(69.4785 \pm 19.7172\)  \(62.1607 \pm 21.3814\)  \(\mathbf{71.1413 \pm 19.6598}\)

MAUC

Method  \({\mathcal {W}}_e\)  \({\mathcal {W}}_{\gamma }\)  \({\mathcal {W}}_{\mathrm{IR}}\)
VOTE  \(0.8566 \pm 0.1227\)  \(0.8208 \pm 0.1379\)  \(\mathbf{0.8613 \pm 0.1204}\)
WV  \(\mathbf{0.8921 \pm 0.1120}\)  \(0.8910 \pm 0.1025\)  \(0.8895 \pm 0.1120\)
PC  \(\mathbf{0.8958 \pm 0.1062}\)  \(0.8935 \pm 0.1058\)  \(0.8939 \pm 0.1067\)
DDAG  \(0.8070 \pm 0.1243\)  \(0.7366 \pm 0.1384\)  \(\mathbf{0.8143 \pm 0.1274}\)
LVPC  \(\mathbf{0.8932 \pm 0.1110}\)  \(0.8919 \pm 0.1043\)  \(0.8910 \pm 0.1107\)
ND  \(0.8779 \pm 0.1065\)  \(0.8457 \pm 0.1229\)  \(\mathbf{0.8794 \pm 0.1092}\)
BTC  \(0.8035 \pm 0.1259\)  \(0.7341 \pm 0.1413\)  \(\mathbf{0.8105 \pm 0.1304}\)
NEST  \(0.8572 \pm 0.1228\)  \(0.8208 \pm 0.1380\)  \(\mathbf{0.8613 \pm 0.1204}\)
PE  \(\mathbf{0.8952 \pm 0.1065}\)  \(0.8875 \pm 0.1089\)  \(0.8927 \pm 0.1077\)
The benefit of scheme \({\mathcal {W}}_{\mathrm{IR}}\) over \({\mathcal {W}}_e\) and \({\mathcal {W}}_{\gamma }\) is clear. This is particularly reflected in the average accuracy measure, where substantial differences can be observed. For each aggregation method, \({\mathcal {W}}_{\mathrm{IR}}\) attains the highest average accuracy. The results of \({\mathcal {W}}_e\) are in turn better than those of \({\mathcal {W}}_{\gamma }\). This can be explained based on the description of the datasets in Table 1 and the conclusions drawn from Fig. 1. Indeed, when computing the pairwise IR between classes, these values are often found to be below 9, a situation in which \({\mathcal {W}}_e\) yields better results than \({\mathcal {W}}_{\gamma }\). Considering the MAUC, smaller differences between the three alternatives are observed. In most cases, \({\mathcal {W}}_e\) still outperforms \({\mathcal {W}}_{\gamma }\). For five aggregation methods, \({\mathcal {W}}_{\mathrm{IR}}\) yields the highest MAUC value; in the cases where it does not, the differences with the best performing scheme are small.
We conclude that, when fixing the weighting scheme, \({\mathcal {W}}_e\) yields better results than \({\mathcal {W}}_{\gamma }\), but using our adaptive scheme \({\mathcal {W}}_{\mathrm{IR}}\) further improves the performance. The largest improvement is made for the average accuracy measure, showing that \({\mathcal {W}}_{\mathrm{IR}}\) is particularly beneficial for obtaining correct classifications. The power of separating between class pairs, evaluated by the MAUC, is more or less comparable for \({\mathcal {W}}_e\) and \({\mathcal {W}}_{\mathrm{IR}}\). Settling on \({\mathcal {W}}_{\mathrm{IR}}\) as weighting scheme and taking both evaluation measures into account, we select the WV procedure as favored aggregation method: it attains the highest average accuracy, is among the highest MAUC values, and its robustness has been demonstrated in [20, 35]. In the next section, we further improve the results of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV by replacing the WV step with our new proposal WV–FROST.
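The adaptive idea behind \({\mathcal {W}}_{\mathrm{IR}}\) can be sketched as a per-subproblem rule based on the IR threshold of 9 discussed above. The following is a simplified illustration, not the paper's exact definition; the function name and returned scheme labels are placeholders for the weight vectors defined in Sect. 4.1:

```python
def choose_weighting_scheme(n_majority, n_minority, ir_threshold=9.0):
    """Illustrative per-subproblem selection: use the W_e weights for
    mildly imbalanced class pairs and switch to W_gamma once the
    imbalance ratio (IR) of the pair reaches the threshold of 9."""
    ir = n_majority / n_minority
    return "W_gamma" if ir >= ir_threshold else "W_e"
```

In an OVO decomposition, this rule would be applied separately to each binary subproblem, so differently imbalanced class pairs receive different weight vectors.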
6.2 Evaluation of WV–FROST
In this section, we address two questions:
1. Does WV–FROST improve the performance of WV?
2. Do partially constructed models yield similar or better results?
To this end, we compare the following methods:
- IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV: the best performing version in Sect. 6.1, the integration of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\) in a traditional OVO setup with WV aggregation.
- \({mem}\): no binarization; the score of a class C is set to the value \({mem}(x,C)\) and the class with the highest score is predicted.
- \({mem}\)-\({mse}_n\): no binarization; the score of class C is \({mem}(x,C) - \frac{1}{m}\,{mse}_n(x,C)\).
- IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV-\({mem}\): the WV-\({mem}\) step is similar to WV–FROST, but only includes one global summary term. The value \(V_x(C)\) is replaced by \(\frac{V_x(C) + {mem}(x,C)}{2}\).
- IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV-\({mse}_n\): also includes only one global summary term. It replaces \(V_x(C)\) by \(V_x(C) - \frac{1}{m}\,{mse}_n(x,C)\).
- IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV–FROST: our complete proposal.
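Assuming WV–FROST combines the two global summary terms exactly as the partial models above suggest, its class scores can be sketched as follows. Function name and the input values are illustrative:

```python
def wv_frost_scores(V, mem, mse_n):
    """Sketch of WV-FROST class scores: the binary WV score V_x(C) is
    averaged with the global affinity mem(x, C) and penalized by the
    normalized mse_n(x, C) / m term. All inputs are lists of length m,
    the number of classes."""
    m = len(V)
    return [(v + a) / 2.0 - e / m for v, a, e in zip(V, mem, mse_n)]

# Predict the class with the highest score (illustrative values only)
scores = wv_frost_scores([0.6, 0.3, 0.1], [0.5, 0.4, 0.2], [0.1, 0.3, 0.5])
prediction = max(range(len(scores)), key=lambda c: scores[c])
```

Setting the \({mem}\) term to \(V_x(C)\) recovers the WV-\({mse}_n\) variant, and dropping the \({mse}_n\) term recovers WV-\({mem}\), which is how the partial models in the comparison are obtained.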
Table 3: Results of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV–FROST and partially constructed versions (best values in bold)

Method  AvgAcc  MAUC
IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV  \(71.4921 \pm 19.5685\)  \(0.8895 \pm 0.1120\)
\({mem}\)  \(67.6477 \pm 18.8233\)  \(0.8810 \pm 0.1069\)
\({mem}\)-\({mse}_n\)  \(69.2093 \pm 18.3121\)  \(0.8958 \pm 0.1022\)
IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV-\({mem}\)  \(71.5351 \pm 19.3465\)  \(0.8946 \pm 0.1065\)
IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV-\({mse}_n\)  \(71.8868 \pm 19.2429\)  \(0.8984 \pm 0.0987\)
IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV–FROST  \(\mathbf{72.6327 \pm 19.3379}\)  \(\mathbf{0.9018 \pm 0.0982}\)
Table 3 shows the dominance of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV–FROST over all partially constructed models for both evaluation measures. Considering these results in more detail, IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV–FROST outperforms the traditional setting IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV on 12 out of the 18 datasets for the average accuracy and on 11 out of 18 for the MAUC. First, placing the results of the models \({mem}\) and IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV-\({mem}\) next to each other, the benefit of the binarization step becomes clear, in particular in the evaluation by the average accuracy. The relatively high MAUC value of \({mem}\) indicates that the fuzzy rough membership degrees form a good tool to separate between pairs of classes. However, this does not necessarily imply correct classification results, as only pairwise comparisons between classes are used in the MAUC evaluation. This is reflected in the clearly inferior average accuracy of \({mem}\). This method corresponds to the most straightforward extension of IFROWANN to a multi-class setting without any binarization. By also incorporating the comparison between pairs of classes in IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV-\({mem}\), more accurate predictions can be made. This was already noted in Sect. 4.1 and formed part of our motivation not to further pursue a direct extension of the IFROWANN method without binarization. Second, comparing \({mem}\) to \({mem}\)-\({mse}_n\) and IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV-\({mem}\) to IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV–FROST shows the improvement obtained by including the \({mse}_n\) term.
The difference in performance between IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV–FROST and IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV-\({mse}_n\) shows that it is not sufficient to solely include the \({mse}_n\) measure: both fuzzy rough summary terms carry complementary information needed to improve the baseline performance of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV. In a statistical comparison of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV–FROST to IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)-WV using the Wilcoxon test, the p values for the average accuracy and MAUC results were 0.16736 (\(R^+=118.0\), \(R^-=53.0\)) and 0.08866 (\(R^+=113.0\), \(R^-=40.0\)), respectively.
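The \(R^+\) and \(R^-\) statistics reported above are the signed-rank sums of the Wilcoxon test over the paired per-dataset results. A minimal, pure-Python sketch of how they are obtained (the sample values are hypothetical, not the paper's results; zero differences are dropped and ties in \(|d|\) are not rank-averaged):

```python
def wilcoxon_ranks(a, b):
    """Signed-rank sums R+ and R- for paired samples: rank the absolute
    differences, then sum the ranks of positive and negative
    differences separately."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    r_plus = r_minus = 0.0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            r_plus += rank
        else:
            r_minus += rank
    return r_plus, r_minus

# Hypothetical per-dataset scores for two methods
r_plus, r_minus = wilcoxon_ranks([72.1, 68.4, 33.8, 47.4, 97.2, 77.3],
                                 [71.5, 67.6, 29.0, 48.8, 95.4, 71.1])
```

The test statistic is \(\min(R^+, R^-)\), and the p value follows from the exact distribution of this statistic (or a normal approximation for larger samples).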
7 Experimental evaluation of FROVOCO

We compare the WV–FROST aggregation within FROVOCO to existing dynamic aggregation approaches, showing that WV–FROST outperforms the alternatives, which justifies its inclusion in FROVOCO (Sect. 7.1). As a final step, we compare FROVOCO to three state-of-the-art classifiers for multi-class imbalanced data (Sect. 7.2).
7.1 WV–FROST versus other dynamic approaches
In this section, we compare WV–FROST to the existing dynamic aggregation approaches DynOVO [21] and DRCW [22]. For each of these three methods, we use IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\) as the base classifier within the OVO scheme. The combination of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\) with WV–FROST corresponds to our full FROVOCO method.
Table 4: Results of FROVOCO and the combination of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\) with the two other dynamic aggregation methods (best values in bold)

Method  AvgAcc  MAUC
IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)+DynOVO  \(71.7930 \pm 19.8270\)  \(0.7894 \pm 0.1151\)
IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\)+DRCW  \(70.8782 \pm 20.1452\)  \(0.8916 \pm 0.1097\)
FROVOCO  \(\mathbf{72.6327 \pm 19.3379}\)  \(\mathbf{0.9018 \pm 0.0982}\)
Table 5: Pairwise statistical comparisons by means of the Wilcoxon test, accompanying the results of Table 4 (A: average accuracy, M: MAUC)

Comparison  \(R^+\)  \(R^-\)  p
A: WV–FROST versus DynOVO  100.0  71.0  0.52773
A: WV–FROST versus DRCW  129.0  42.0  0.05994
M: WV–FROST versus DynOVO  171.0  0.0  \(7.63 \times 10^{-6}\)
M: WV–FROST versus DRCW  100.0  53.0  0.26595
7.2 Comparison with the state-of-the-art
Table 6: Full average accuracy and MAUC results for the three state-of-the-art classifiers and our FROVOCO proposal (Ada: AdaBoost.NC, SMT: SMOTE-C4.5-WV, MDO: MDO-C4.5, FR: FROVOCO)

Data  AvgAcc (Ada, SMT, MDO, FR)  MAUC (Ada, SMT, MDO, FR)
aut  79.9444  80.6444  76.4778  77.1556  0.9370  0.9299  0.8928  0.9633 
bal  65.8900  55.2701  56.4819  78.8514  0.8609  0.5901  0.6768  0.8854 
cle  26.8750  26.1917  29.0417  33.7833  0.5834  0.5772  0.5610  0.6981 
con  47.9522  51.7246  48.7521  47.4449  0.6669  0.6560  0.6529  0.6485 
der  94.6845  96.2096  95.3709  97.1553  0.9857  0.9861  0.9727  0.9966 
eco  76.2654  71.4609  71.1481  77.2723  0.9162  0.8990  0.8556  0.9304 
gla  71.5516  75.1885  63.0437  67.0694  0.9246  0.9204  0.8534  0.9325 
led  54.3621  63.5466  64.1728  64.7918  0.7640  0.9134  0.8780  0.9189 
lym  72.4355  72.2222  73.2093  86.2401  0.7847  0.7717  0.8231  0.9108 
new  94.7222  91.3889  90.4444  91.1111  0.9972  0.9563  0.9276  0.9981 
pag  91.9105  89.2398  83.7520  90.0399  0.9876  0.9739  0.9446  0.9736 
sat  87.5570  85.2928  84.7142  89.4955  0.9817  0.9619  0.9214  0.9817 
shu  98.4803  96.8439  91.3154  91.8527  0.9911  0.9979  0.9600  0.9987 
thy  99.4186  99.2688  97.9360  66.4500  0.9998  0.9965  0.9894  0.8494 
win  94.7579  95.2698  92.9881  98.2143  0.9818  0.9788  0.9482  1.0000 
wqr  39.6884  34.1986  31.8371  43.7544  0.7581  0.7495  0.6432  0.8342 
wqw  47.6684  39.3455  39.3391  47.8895  0.7856  0.7772  0.6811  0.8309 
yea  49.0789  51.8083  53.3381  58.8178  0.8279  0.8472  0.7693  0.8810 
Mean  71.8468  70.8397  69.0757  72.6327  0.8741  0.8602  0.8306  0.9018 
Table 7: Results of the Friedman test and Holm post hoc procedure, with FROVOCO as the control method (ranking positions in parentheses)

Method  AvgAcc rank  \(p_{\mathrm{Holm}}\)  MAUC rank  \(p_{\mathrm{Holm}}\)
FROVOCO  1.8333 (1)  –  1.4444 (1)  – 
AdaBoost.NC  2.3333 (2)  0.245278  2.1667 (2)  0.09329 
SMOTE-C4.5-WV  2.5556 (3)  0.18658  2.7222 (3)  0.00597
MDO-C4.5  3.2778 (4)  0.002367  3.6667 (4)  0.000001
\(p_{\mathrm{Friedman}}\)  0.008617  0.000003 
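The reported \(p_{\mathrm{Friedman}}\) values can be recovered from the average ranks alone. A self-contained sketch in pure Python, using the closed-form chi-square tail for three degrees of freedom (valid here because \(k=4\) methods are compared); running it on the average accuracy ranks reproduces 0.008617 up to the rounding of the ranks:

```python
import math

def friedman_test_k4(avg_ranks, n):
    """Friedman statistic from the average ranks of k = 4 methods over
    n datasets, with the p-value taken from a chi-square distribution
    with k - 1 = 3 degrees of freedom (closed form for df = 3 only)."""
    k = len(avg_ranks)
    chi2 = 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                       - k * (k + 1) ** 2 / 4.0)
    # survival function of the chi-square distribution with df = 3
    p = (math.erfc(math.sqrt(chi2 / 2))
         + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))
    return chi2, p

# Average accuracy ranks from the table above, n = 18 datasets
chi2, p = friedman_test_k4([1.8333, 2.3333, 2.5556, 3.2778], 18)
```

The Holm procedure then orders the pairwise p values of the control method against the others and multiplies them by the number of remaining hypotheses, which is why the adjusted values in the table are not necessarily monotone in the ranks.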
Table 8: Pairwise statistical comparisons of FROVOCO with the state-of-the-art by means of the Wilcoxon test (A: average accuracy, M: MAUC; W/L: number of datasets won/lost by FROVOCO)

Comparison  W/L  \(R^+\)  \(R^-\)  p
A: FROVOCO versus AdaBoost.NC  11/7  108.0  63.0  0.32714
A: FROVOCO versus SMOTE-C4.5-WV  12/6  116.0  55.0  0.19638
A: FROVOCO versus MDO-C4.5  16/2  148.0  23.0  0.004746
M: FROVOCO versus AdaBoost.NC  15/3  139.0  32.0  0.018234
M: FROVOCO versus SMOTE-C4.5-WV  15/3  149.0  22.0  0.004006
M: FROVOCO versus MDO-C4.5  16/2  155.0  16.0  0.0012894
It cannot be shown that the average accuracy of our method is significantly better than that of AdaBoost.NC or SMOTE-C4.5-WV, although, as stated above, the computed ranks indicate that our method is best. The explanation can be found in Table 6 for the thy dataset. On this single dataset, our proposal performs very poorly compared to the others (as do all other OVO versions using IFROWANN). This difference is assigned the highest rank in favor of the competing methods, which is why no statistical significance in the average accuracy can be detected.
As noted in Sect. 5, the two evaluation measures capture complementary performance information. The average accuracy solely focuses on the number of hits and misses, while the MAUC takes the confidence of the predictions of a classifier into account. Based on the analysis presented in Table 8, we can conclude that the prediction confidences of our proposal are significantly more reliable than those of its competitors.
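The complementarity of the two measures is easy to see from their definitions: average accuracy is the mean of the per-class recalls, while the MAUC of Hand and Till [29] averages pairwise AUCs computed from the class scores. A compact sketch of both (function names are ours; classes are assumed to be encoded as score-vector indices):

```python
def avg_acc(y_true, y_pred):
    """Average accuracy: the mean of the per-class recalls, so each
    class weighs equally regardless of its size."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(1 for i in idx if y_pred[i] == c) / len(idx))
    return sum(recalls) / len(classes)

def pair_auc(scores, y_true, ci, cj):
    # AUC of class ci against class cj, using the scores assigned to ci
    pos = [s[ci] for s, y in zip(scores, y_true) if y == ci]
    neg = [s[ci] for s, y in zip(scores, y_true) if y == cj]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mauc(scores, y_true, classes):
    """Hand-and-Till MAUC: mean over all class pairs of the averaged
    pairwise AUCs A(i, j) = (A(i|j) + A(j|i)) / 2."""
    m = len(classes)
    total = sum((pair_auc(scores, y_true, classes[a], classes[b])
                 + pair_auc(scores, y_true, classes[b], classes[a])) / 2
                for a in range(m) for b in range(a + 1, m))
    return 2.0 * total / (m * (m - 1))
```

A classifier can rank classes well in every pairwise comparison (high MAUC) and still assign the wrong final label (low average accuracy), which is exactly the behavior observed for the \({mem}\) model in Sect. 6.2.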
Effect of the IR When we combine the results in Table 6 with the IR information from Table 1, it can be observed that the datasets on which FROVOCO performs suboptimally are mainly those with a high average IR. In particular, these are the page-blocks, shuttle and thyroid datasets, for which the high average IR is due to the presence of one very large majority class. When we investigated these results in more detail, we observed that the accuracy of our proposal on the single majority class is notably lower than that obtained by the other methods, which resulted in its low average accuracy value. Both the fact that there is only one majority class and its absolute volume explain why FROVOCO fails in this situation. Firstly, the method boosts the performance on the minority classes to such a degree that the classification of the single majority class is negatively affected; the combined effect of all minority classes against the single majority class leads to a decrease in average accuracy. Secondly (and probably most importantly), when the majority class is very large, the OWA-based fuzzy rough approximation operators are hindered in their recognition ability. In particular, weight vector (9), which is used when the IR between a pair of classes is high, loses some of its strength when the absolute size of one of these classes is very high. The weight vector becomes very long and the weights on the nonzero positions almost flatten out to an average, due to their linearly increasing character and the condition that all weights should sum to one. As a result, its desirable characteristics are lost.
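The flattening effect can be illustrated numerically. As a simplified stand-in for weight vector (9), whose precise form is given earlier in the paper, consider a linearly increasing vector normalized to sum to one:

```python
def linear_increasing_weights(p):
    """Simplified linearly increasing OWA weight vector of length p:
    w_i proportional to i, normalized so that the weights sum to 1."""
    total = p * (p + 1) / 2.0
    return [i / total for i in range(1, p + 1)]

# The gap between consecutive weights is 2 / (p * (p + 1)); for a very
# large class it vanishes, so the vector becomes locally almost uniform
# and the ordered aggregation approaches a plain average.
step_small = linear_increasing_weights(10)[1] - linear_increasing_weights(10)[0]
step_large = linear_increasing_weights(5000)[1] - linear_increasing_weights(5000)[0]
```

For \(p = 10\) the step between consecutive weights is about 0.018, while for \(p = 5000\) it drops below \(10^{-7}\), which matches the observation that the OWA operator loses its desirable characteristics on a massive majority class.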
On the winequality-white dataset, which has an average IR of 61.08, our method obtains the best average accuracy. This does not contradict our previous statement, since this dataset contains two majority classes whose sizes do not differ greatly. Moreover, the size of the largest class is considerably smaller than in the page-blocks, shuttle and thyroid datasets.
We would like to note that our fuzzy rough approach depends strongly on the similarity relation. When the overlap between classes is high (based on this relation), we can expect it to be more difficult for our method to adequately discern them. In such a situation, some confusion between classes is likely to occur, with classification errors as the result.
In general, as demonstrated by the results in Table 6 and the analysis in Tables 7 and 8, our method outperforms its competitors, because it combines local (OVO decomposition) and global (WV–FROST) views of the data, thereby improving the recognition of difficult classes. Only in the particular case where a single massive majority class is present may the user prefer AdaBoost.NC in the classification process. For the reasons discussed in this paragraph, our method is less suited to handle this one specific type of problem, a situation that can easily be checked by the user before applying a multi-class imbalanced classifier.
8 Conclusion
The IFROWANN method, which is based on fuzzy rough set theory, is a powerful classifier for binary imbalanced data. In this work, we studied its extension to the multi-class imbalanced setting by combining it with the OVO decomposition scheme. First, we showed that its performance in existing OVO aggregation schemes is boosted by incorporating a newly proposed adaptive weighting scheme, resulting in our method IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\). Second, we proposed a new aggregation scheme WV–FROST that further improves the results of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\) within the OVO setting. WV–FROST enhances the information extracted from the binary classifiers by including two global summary terms. Both are based on fuzzy rough set theory, yielding a natural synergy with the fuzzy rough base classifiers, and their global character deals with the non-competent classifier issue encountered in OVO decompositions. Our experiments allow us to conclude that our complete proposal FROVOCO, the combination of IFROWANN-\({\mathcal {W}}_{\mathrm{IR}}\) in the OVO setting with WV–FROST, outperforms the state-of-the-art in multi-class imbalanced classification.
As future work, we propose to investigate the wider applicability of the WV–FROST step. In this paper, both the classifier within the OVO step and the summary terms in WV–FROST are based on fuzzy rough set theory. It will be interesting to verify whether the strength of WV–FROST transfers to settings where a different internal classifier is used in the OVO procedure. Furthermore, we could easily replace the WV part by any of the other existing OVO aggregation methods and validate whether our proposed summary terms yield similar improvements in those cases. The performance of WV–FROST on balanced datasets remains to be evaluated as well.
Footnotes
1. If the classifier provides both confidence degrees, one must ensure that they are normalized such that \(r_{ij} + r_{ji} = 1\).
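A minimal sketch of this normalization (the function name is ours):

```python
def normalize_pair(r_ij, r_ji):
    """Rescale the two confidence degrees of a binary classifier so
    that they sum to one, as the footnote requires."""
    s = r_ij + r_ji
    return r_ij / s, r_ji / s
```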
Acknowledgements
The research of Sarah Vluymans is funded by the Special Research Fund (BOF) of Ghent University. This work was partially supported by the Spanish Ministry of Science and Technology under the Projects TIN2014-57251-P and TIN2015-68454-R, and by the Andalusian Research Plans P11-TIC-7765 and P12-TIC-2958. Yvan Saeys is an ISAC Marylou Ingram Scholar.
References
1. Abdi L, Hashemi S (2016) To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans Knowl Data Eng 28(1):238–251
2. Alshomrani S, Bawakid A, Shim S, Fernández A, Herrera F (2015) A proposal for evolutionary fuzzy systems using feature weighting: dealing with overlapping in imbalanced datasets. Knowl Based Syst 73:1–17
3. Barandela R, Sánchez J, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recog 36(3):849–851
4. Batista G, Prati R, Monard MC (2004) A study of the behaviour of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29
5. Britto AS Jr, Sabourin R, de Oliveira LES (2014) Dynamic selection of classifiers—a comprehensive review. Pattern Recog 47(1):3665–3680
6. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
7. Chen Y (2016) An empirical study of a hybrid imbalanced-class DT–RST classification procedure to elucidate therapeutic effects in uremia patients. Med Biol Eng Comput 54(6):983–1001
8. Cornelis C, Verbiest N, Jensen R (2010) Ordered weighted average based fuzzy rough sets. In: Yu J, Greco S, Lingras P, Wang G, Skowron A (eds) Rough set and knowledge technology. Springer, Berlin, pp 78–85
9. D'eer L, Verbiest N, Cornelis C, Godo L (2015) A comprehensive study of implicator–conjunctor-based and noise-tolerant fuzzy rough sets: definitions, properties and robustness analysis. Fuzzy Sets Syst 275:1–38
10. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
11. Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: Fayyad U, Chaudhuri S, Madigan D (eds) Proceedings of the 5th international conference on knowledge discovery and data mining (KDD'99). ACM, New York, pp 155–164
12. Dubois D, Prade H (1990) Rough fuzzy sets and fuzzy rough sets. Int J Gen Syst 17(2–3):191–209
13. Fei B, Liu J (2006) Binary tree of SVM: a new fast multiclass training and classification algorithm. IEEE Trans Neural Netw 17(3):696–704
14. Fernández A, Calderon M, Barrenechea E, Bustince H, Herrera F (2010a) Solving multi-class problems with linguistic fuzzy rule based classification systems based on pairwise learning and preference relations. Fuzzy Sets Syst 161(23):3064–3080
15. Fernández A, García S, Luengo J, Bernado-Mansilla E, Herrera F (2010b) Genetics-based machine learning for rule induction: state of the art, taxonomy and comparative study. IEEE Trans Evol Comput 14(6):913–941
16. Fernández A, López V, Galar M, Del Jesus MJ, Herrera F (2013) Analysing the classification of imbalanced datasets with multiple classes: binarization techniques and ad-hoc approaches. Knowl Based Syst 42:97–110
17. Friedman JH (1996) Another approach to polychotomous classification. Tech rep, Department of Statistics, Stanford University. http://wwwstat.stanford.edu/~jhf/ftp/poly.ps.Z
18. Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701
19. Fürnkranz J, Hüllermeier E, Vanderlooy S (2009) Binary decomposition methods for multipartite ranking. In: Buntine W, Grobelnik M, Mladenić D, Shawe-Taylor J (eds) Machine learning and knowledge discovery in databases. ECML PKDD 2009. Lecture Notes in Computer Science, vol 5781. Springer, Berlin, Heidelberg
20. Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F (2011) An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes. Pattern Recog 44(8):1761–1776
21. Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F (2013) Dynamic classifier selection for one-vs-one strategy: avoiding non-competent classifiers. Pattern Recog 46(12):3412–3424
22. Galar M, Fernández A, Barrenechea E, Herrera F (2015) DRCW-OVO: distance-based relative competence weighting combination for one-vs-one strategy in multi-class problems. Pattern Recog 48(1):28–42
23. Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F (2016) Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets. Inf Sci 354:178–196
24. Gao X, Chen Z, Tang S, Zhang Y, Li J (2016) Adaptive weighted imbalance learning with application to abnormal activity recognition. Neurocomputing 173:1927–1935
25. Gao Z, Zhang L, Chen M, Hauptmann A, Zhang H, Cai A (2014) Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset. Multimed Tools Appl 68(3):641–657
26. García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064
27. García V, Mollineda RA, Sánchez JS (2008) On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal Appl 11(3–4):269–280
28. Haixiang G, Yijing L, Yanan L, Xiao L, Jinling L (2016) BPSO-Adaboost-KNN ensemble learning algorithm for multi-class imbalanced data classification. Eng Appl Artif Intell 49:176–193
29. Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186
30. Hastie T, Tibshirani R (1998) Classification by pairwise coupling. Ann Stat 26(2):451–471
31. He H, Garcia E (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
32. Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70
33. Huhn J, Hüllermeier E (2009) FR3: a fuzzy rule learner for inducing reliable classifiers. IEEE Trans Fuzzy Syst 17(1):138–149
34. Hüllermeier E, Brinker K (2008) Learning valued preference structures for solving classification problems. Fuzzy Sets Syst 159(18):2337–2352
35. Hüllermeier E, Vanderlooy S (2010) Combining predictions in pairwise classification: an optimal adaptive voting strategy and its relation to weighted voting. Pattern Recog 43(1):128–142
36. Jensen R, Cornelis C (2011) Fuzzy-rough nearest neighbour classification and prediction. Theor Comput Sci 412(42):5871–5884
37. Kuncheva L, Bezdek J, Duin R (2001) Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recog 34(2):299–314
38. Liu B, Hao Z, Yang X (2007) Nesting algorithm for multi-classification problems. Soft Comput 11(4):383–389
39. Liu B, Hao Z, Tsang ECC (2008) Nesting one-against-one algorithm based on SVMs for pattern classification. IEEE Trans Neural Netw 19(12):2044–2052
40. López V, Fernández A, Moreno-Torres JG, Herrera F (2012) Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst Appl 39(7):6585–6608
41. López V, Fernández A, Del Jesus M, Herrera F (2013a) A hierarchical genetic fuzzy system based on genetic programming for addressing classification with highly imbalanced and borderline datasets. Knowl Based Syst 38:85–104
42. López V, Fernández A, García S, Palade V, Herrera F (2013b) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141
43. López V, Fernández A, Herrera F (2014) On the importance of the validation technique for classification with imbalanced datasets: addressing covariate shift when data is skewed. Inf Sci 257:1–13
44. Lorena AC, Carvalho AC, Gama JM (2008) A review on the combination of binary classifiers in multiclass problems. Artif Intell Rev 30(1–4):19–37
45. Mahalanobis P (1936) On the generalized distance in statistics. Proc Natl Inst Sci (Calcutta) 2:49–55
46. Martínez-Munoz G, Hernández-Lobato D, Suárez A (2009) An analysis of ensemble pruning techniques based on ordered aggregation. IEEE Trans Pattern Anal Mach Intell 31(2):245–259
47. Moreno-Torres JG, Sáez JA, Herrera F (2012) Study on the impact of partition-induced dataset shift on k-fold cross-validation. IEEE Trans Neural Netw Learn Syst 23(8):1304–1312
48. Orriols-Puig A, Bernado-Mansilla E (2009) Evolutionary rule-based systems for imbalanced datasets. Soft Comput 13(3):213–225
49. Pawlak Z (1982) Rough sets. Int J Comput Inf Sci 11(5):341–356
50. Platt JC, Cristianini N, Shawe-Taylor J (2000) Large margin DAGs for multiclass classification. In: Solla S, Leen T, Müller K (eds) Advances in neural information processing systems. MIT Press, Cambridge, pp 547–553
51. Ramentol E, Vluymans S, Verbiest N, Caballero Y, Bello R, Cornelis C, Herrera F (2015) IFROWANN: imbalanced fuzzy-rough ordered weighted average nearest neighbor classification. IEEE Trans Fuzzy Syst 23(5):1622–1637
52. Razakarivony S, Jurie F (2016) Vehicle detection in aerial imagery: a small target detection benchmark. J Vis Commun Image Represent 34:187–203
53. Rokach L (2016) Decision forest: twenty years of research. Inf Fusion 27:111–125
54. Sáez JA, Luengo J, Stefanowski J, Herrera F (2015) SMOTE–IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203
55. Sun Y, Wong AKC, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern Recog Artif Intell 23(4):687–719
56. Verbiest N, Ramentol E, Cornelis C, Herrera F (2014) Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection. Appl Soft Comput 22:511–517
57. Villar P, Fernández A, Carrasco R, Herrera F (2012) Feature selection and granularity learning in genetic fuzzy rule-based classification systems for highly imbalanced datasets. Int J Uncertain Fuzz 20(03):369–397
58. Vluymans S, D'eer L, Saeys Y, Cornelis C (2015) Applications of fuzzy rough set theory in machine learning: a survey. Fundam Inform 142(1–4):53–86
59. Vluymans S, Sánchez Tarragó D, Saeys Y, Cornelis C, Herrera F (2016) Fuzzy rough classifiers for class imbalanced multi-instance data. Pattern Recog 53:36–45
60. Vriesmann LM, Britto AS Jr, Oliveira LES, Koerich AL, Sabourin R (2015) Combining overall and local class accuracies in an oracle-based method for dynamic ensemble selection. In: Proceedings of the 2015 international joint conference on neural networks (IJCNN). IEEE, pp 1–7
61. Wang S, Yao X (2012) Multiclass imbalance problems: analysis and potential solutions. IEEE Trans Syst Man Cybern Part B 42(4):1119–1130
62. Wang S, Chen H, Yao X (2010) Negative correlation learning for classification ensembles. In: Proceedings of the 2010 international joint conference on neural networks (IJCNN). IEEE, pp 1–8
63. Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
64. Woods K (1997) Combination of multiple classifiers using local accuracy estimates. IEEE Trans Pattern Anal Mach Intell 19:405–410
65. Wu TF, Lin CJ, Weng RC (2004) Probability estimates for multi-class classification by pairwise coupling. J Mach Learn Res 5:975–1005
66. Yager R (1988) On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Trans Syst Man Cybern 18(1):183–190
67. Yijing L, Haixiang G, Xiao L, Yanan L, Jinling L (2016) Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowl Based Syst 94:88–104
68. Yu H, Hong S, Yang X, Ni J, Dan Y, Qin B (2013) Recognition of multiple imbalanced cancer types based on DNA microarray data using ensemble classifiers. BioMed Res Int 2013:1–13
69. Zadeh LA (1965) Fuzzy sets. Inform Control 8(3):338–353
70. Zhang Z, Krawczyk B, García S, Rosales-Pérez A, Herrera F (2016) Empowering one-vs-one decomposition with ensemble learning for multi-class imbalanced data. Knowl Based Syst 106:251–263
71. Zhao X, Li X, Chen L, Aihara K (2008) Protein classification with imbalanced data. Proteins: Struct Funct Bioinform 70(4):1125–1132
72. Zhou Z, Liu X (2010) On multi-class cost-sensitive learning. Comput Intell 26(3):232–257