, Volume 39, Issue 2, pp 335-373,
Open Access This content is freely available online to anyone, anywhere at any time.
Date: 30 Dec 2011

BRACID: a comprehensive approach to learning rules from imbalanced data

Abstract

In this paper we consider induction of rule-based classifiers from imbalanced data, where one class (a minority class) is under-represented in comparison to the remaining majority classes. The minority class is usually of primary interest. However, most rule-based classifiers are biased towards the majority classes and they have difficulties with correct recognition of the minority class. In this paper we discuss sources of these difficulties related to data characteristics or to an algorithm itself. Among the problems related to the data distribution we focus on the role of small disjuncts, overlapping of classes and presence of noisy examples. Then, we show that standard techniques for induction of rule-based classifiers, such as sequential covering, top-down induction of rules or classification strategies, were created with the assumption of balanced data distribution, and we explain why they are biased towards the majority classes. Some modifications of rule-based classifiers have been already introduced, but they usually concentrate on individual problems. Therefore, we propose a novel algorithm, BRACID, which more comprehensively addresses the issues associated with imbalanced data. Its main characteristics includes a hybrid representation of rules and single examples, bottom-up learning of rules and a local classification strategy using nearest rules. The usefulness of BRACID has been evaluated in experiments on several imbalanced datasets. The results show that BRACID significantly outperforms the well known rule-based classifiers C4.5rules, RIPPER, PART, CN2, MODLEM as well as other related classifiers as RISE or K-NN. Moreover, it is comparable or better than the studied approaches specialized for imbalanced data such as generalizations of rule algorithms or combinations of SMOTE + ENN preprocessing with PART. Finally, it improves the support of minority class rules, leading to better recognition of the minority class examples.