Boosting with Data Generation: Improving the Classification of Hard to Learn Examples


Abstract

An ensemble of classifiers consists of a set of individually trained classifiers whose predictions are combined to classify new instances. In particular, boosting is an ensemble method in which the performance of weak classifiers is improved by focusing on "hard examples", i.e., instances that are difficult to classify. Recent studies have indicated that boosting algorithms are applicable to a broad spectrum of problems with great success. However, boosting algorithms frequently suffer from over-emphasizing the hard examples, leading to poor training and test set accuracies. Moreover, the knowledge acquired from such hard examples may be insufficient to improve the overall accuracy of the ensemble. This paper describes a new algorithm, DataBoost, that addresses these problems through data generation. In the DataBoost method, hard examples are identified during each iteration of the boosting algorithm. These hard examples are then used to generate synthetic training data, which is added to the original training set for further training. The paper reports results of this approach on ten data sets, using both decision trees and neural networks as base classifiers. The experiments show promising results in terms of overall accuracy.
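To make the idea concrete, the following is a minimal, hypothetical sketch of boosting combined with data generation. It assumes an AdaBoost-style reweighting loop on 1-D data with a decision stump standing in for the paper's base learners, and it jitters the features of high-weight ("hard") examples to create synthetic instances. The abstract does not specify DataBoost's actual generation procedure, so the Gaussian-jitter scheme, the `hard_frac`, `n_synth`, and `noise` parameters, and the function names are all illustrative assumptions.

```python
import math
import random

def train_stump(data, weights):
    """Fit a weighted 1-D decision stump (threshold classifier).
    A stand-in for the decision-tree / neural-network base learners."""
    best_err, best_t, best_sign = float("inf"), None, 1
    for t in sorted({x for x, _ in data}):
        for sign in (1, -1):
            err = sum(w for (x, y), w in zip(data, weights)
                      if sign * (1 if x >= t else -1) != y)
            if err < best_err:
                best_err, best_t, best_sign = err, t, sign
    t, sign = best_t, best_sign
    return (lambda x: sign * (1 if x >= t else -1)), best_err

def databoost_sketch(data, rounds=5, hard_frac=0.25, n_synth=8, noise=0.05):
    """Illustrative sketch: each boosting round, the highest-weight
    examples seed synthetic training data that is appended before the
    next base classifier is trained (parameters are assumptions)."""
    data = list(data)
    weights = [1.0 / len(data)] * len(data)
    ensemble = []
    for _ in range(rounds):
        clf, err = train_stump(data, weights)
        err = min(max(err, 1e-9), 1 - 1e-9)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, clf))
        # Standard AdaBoost reweighting: misclassified examples gain weight.
        weights = [w * math.exp(-alpha * y * clf(x))
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        # "Hard" examples = those carrying the largest current weight.
        k = max(1, int(hard_frac * len(data)))
        ranked = sorted(zip(weights, data), key=lambda p: p[0], reverse=True)
        hard = [xy for _, xy in ranked[:k]]
        # Generate synthetic examples by jittering hard ones
        # (an assumed scheme; the paper's procedure may differ).
        synth = [(x + random.gauss(0, noise), y)
                 for x, y in (random.choice(hard) for _ in range(n_synth))]
        data += synth
        weights += [1.0 / len(data)] * len(synth)
        total = sum(weights)
        weights = [w / total for w in weights]
    def predict(x):
        return 1 if sum(a * c(x) for a, c in ensemble) >= 0 else -1
    return predict
```

A toy run on linearly separable 1-D points (label +1 above 0.5, -1 below) yields an ensemble whose weighted vote recovers the boundary, while the synthetic points concentrate extra training mass near the hard region.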