Abstract
Ensembles of classifiers have successfully been used to improve overall predictive accuracy in many domains. In particular, boosting, which focuses on hard-to-learn examples, has proven useful for difficult learning problems. In a two-class imbalanced data set, the number of examples of the majority class is much higher than that of the minority class. This implies that, during training, the predictive accuracy of a traditional boosting ensemble against the minority class may be poor. This paper introduces an approach to address this shortcoming through the generation of synthetic examples, which are added to the original training set. In this way, the ensemble is able to focus not only on hard examples, but also on rare examples. The experimental results, when applying our DataBoost-IM algorithm to eleven data sets, indicate that it surpasses a benchmark individual classifier as well as a popular boosting method, when evaluated in terms of the overall accuracy, the G-mean and the F-measures.
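The rebalancing idea in the abstract can be illustrated with a minimal sketch: generate synthetic minority-class points and append them to the training set before the ensemble is built. The interpolation scheme below is a SMOTE-style illustration under assumed toy data, not the paper's actual DataBoost-IM generation procedure, which works differently.

```python
import numpy as np

def generate_synthetic(minority, n_new, rng):
    """Create n_new synthetic points by interpolating between randomly
    chosen pairs of minority-class examples (a SMOTE-style sketch;
    DataBoost-IM's own generation scheme is not reproduced here)."""
    idx_a = rng.integers(0, len(minority), size=n_new)
    idx_b = rng.integers(0, len(minority), size=n_new)
    t = rng.random((n_new, 1))  # interpolation weight per new point
    return minority[idx_a] + t * (minority[idx_b] - minority[idx_a])

rng = np.random.default_rng(0)
# hypothetical imbalanced toy data: 100 majority vs. 8 minority points
majority = rng.normal(0.0, 1.0, size=(100, 2))
minority = rng.normal(3.0, 0.5, size=(8, 2))

# generate enough synthetic examples to balance the two classes
synthetic = generate_synthetic(minority, n_new=92, rng=rng)
X = np.vstack([majority, minority, synthetic])
y = np.concatenate([np.zeros(100), np.ones(8 + 92)])
print(int(y.sum()), len(y) - int(y.sum()))  # 100 100
```

The rebalanced set `(X, y)` would then be handed to a boosting ensemble, so that the learner's weight updates are no longer dominated by the majority class.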
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Viktor, H.L., Guo, H. (2004). Multiple Classifier Prediction Improvements against Imbalanced Datasets through Added Synthetic Examples. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A.C., de Ridder, D. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR/SPR 2004. Lecture Notes in Computer Science, vol 3138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27868-9_107
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22570-6
Online ISBN: 978-3-540-27868-9