Abstract
Ensembles of classifiers have successfully been used to improve overall predictive accuracy in many domains. In particular, boosting, which focuses on hard-to-learn examples, has proven useful for difficult learning problems. In a two-class imbalanced data set, the number of examples of the majority class is much higher than that of the minority class. This implies that, during training, the predictive accuracy of a traditional boosting ensemble against the minority class may be poor. This paper introduces an approach to address this shortcoming through the generation of synthetic examples, which are added to the original training set. In this way, the ensemble is able to focus not only on hard examples, but also on rare examples. The experimental results, when applying our DataBoost-IM algorithm to eleven data sets, indicate that it surpasses a benchmark individual classifier as well as a popular boosting method, when evaluated in terms of the overall accuracy, the G-mean and the F-measures.
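The rebalancing idea in the abstract can be illustrated with a minimal sketch: generate synthetic minority-class points and append them to the training set before the ensemble is built. The interpolation scheme below is a SMOTE-style illustration under assumed toy data, not the paper's actual DataBoost-IM generation procedure, which works differently.

```python
import numpy as np

def generate_synthetic(minority, n_new, rng):
    """Create n_new synthetic points by interpolating between randomly
    chosen pairs of minority-class examples (a SMOTE-style sketch;
    DataBoost-IM's own generation scheme is not reproduced here)."""
    idx_a = rng.integers(0, len(minority), size=n_new)
    idx_b = rng.integers(0, len(minority), size=n_new)
    t = rng.random((n_new, 1))  # interpolation weight per new point
    return minority[idx_a] + t * (minority[idx_b] - minority[idx_a])

rng = np.random.default_rng(0)
# hypothetical imbalanced toy data: 100 majority vs. 8 minority points
majority = rng.normal(0.0, 1.0, size=(100, 2))
minority = rng.normal(3.0, 0.5, size=(8, 2))

# generate enough synthetic examples to balance the two classes
synthetic = generate_synthetic(minority, n_new=92, rng=rng)
X = np.vstack([majority, minority, synthetic])
y = np.concatenate([np.zeros(100), np.ones(8 + 92)])
print(int(y.sum()), len(y) - int(y.sum()))  # 100 100
```

The rebalanced set `(X, y)` would then be handed to a boosting ensemble, so that the learner's weight updates are no longer dominated by the majority class.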
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Viktor, H.L., Guo, H. (2004). Multiple Classifier Prediction Improvements against Imbalanced Datasets through Added Synthetic Examples. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A.C., de Ridder, D. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR/SPR 2004. Lecture Notes in Computer Science, vol 3138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27868-9_107
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22570-6
Online ISBN: 978-3-540-27868-9