Chapter

Data Mining and Knowledge Discovery Handbook

pp 875-886

Data Mining for Imbalanced Datasets: An Overview

  • Nitesh V. Chawla, Department of Computer Science and Engineering, University of Notre Dame

Summary

A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years have brought increased interest in applying machine learning techniques to difficult “real-world” problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing datasets, and the performance measures more appropriate for mining imbalanced datasets.
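
To illustrate why predictive accuracy can mislead on imbalanced data, the following is a minimal sketch rather than anything taken from the chapter itself. The toy labels (10 positives among 1000 examples), the always-majority "classifier", and the helper names `confusion_counts`, `accuracy`, `precision`, and `recall` are all assumptions made for illustration.

```python
# Hypothetical toy data: 10 minority-class (positive) examples, 990 majority.
y_true = [1] * 10 + [0] * 990

# An assumed trivial classifier that always predicts the majority class.
y_pred = [0] * len(y_true)


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all examples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp, tn, fn):
    """Fraction of predicted positives that are truly positive (0.0 if none predicted)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fp, tn, fn):
    """Fraction of true positives that were found (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


counts = confusion_counts(y_true, y_pred)
print("accuracy :", accuracy(*counts))
print("precision:", precision(*counts))
print("recall   :", recall(*counts))
```

In this sketch the classifier scores 99% accuracy while finding none of the minority-class examples, so its recall is zero. That gap is the reason measures such as precision and recall, or ROC analysis, are usually preferred when classes are skewed.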

Key words

imbalanced datasets, classification, sampling, ROC, cost-sensitive measures, precision and recall