Abstract
Many natural processes generate some observations more frequently than others. Such processes yield imbalanced class distributions, which bias classifiers toward the majority class because most classifiers assume a roughly balanced class distribution. To address class imbalance, a number of data preprocessing techniques have been proposed over the years; they can be broadly categorized into over-sampling and under-sampling methods. The neighborhood cleaning rule (NCL) proposed by Laurikkala is among the most popular under-sampling methods. In this paper, we augment the original NCL algorithm by using the CHC evolutionary algorithm to clean unwanted samples, instead of the simple nearest neighbor-based cleaning used in NCL. We call the augmented algorithm NCL+. The performance of NCL+ is compared to that of NCL on 9 imbalanced datasets using 11 different classifiers. Experimental results show noticeable accuracy improvements by NCL+ over NCL. Moreover, NCL+ is also compared to a popular over-sampling method, the synthetic minority over-sampling technique (SMOTE), and is found to offer better results as well.
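For readers unfamiliar with nearest neighbor-based cleaning, the following is a minimal sketch of an NCL-style rule for the binary case, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`k_nearest`, `ncl_style_clean`), the Euclidean distance, and the choice of k = 3 are assumptions made for this example, and the CHC-based cleaning used in NCL+ is not shown.

```python
import math
from collections import Counter


def k_nearest(index, points, k=3):
    """Return indices of the k points closest to points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]


def ncl_style_clean(points, labels, minority, k=3):
    """Return the sorted indices of samples kept after cleaning (hypothetical helper).

    Assumed rule, binary case only:
      * a majority sample that its k nearest neighbors vote against is removed;
      * for a minority sample that its neighbors vote against, the majority
        samples among those neighbors are removed instead.
    Minority samples are never removed.
    """
    to_remove = set()
    for i, label in enumerate(labels):
        neighbors = k_nearest(i, points, k)
        predicted = Counter(labels[j] for j in neighbors).most_common(1)[0][0]
        if predicted == label:
            continue  # neighbors agree with the sample's label: nothing to clean
        if label != minority:
            to_remove.add(i)
        else:
            to_remove.update(j for j in neighbors if labels[j] != minority)
    return [i for i in range(len(labels)) if i not in to_remove]


# Toy data: four majority points, three minority points, and one majority
# point (index 7) sitting inside the minority cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (10.5, 10.5)]
labels = ["maj", "maj", "maj", "maj", "min", "min", "min", "maj"]

kept = ncl_style_clean(points, labels, minority="min")
print(kept)  # [0, 1, 2, 3, 4, 5, 6]
```

In this toy example only the majority point embedded in the minority cluster is discarded, which is the kind of local, neighbor-driven decision that the abstract contrasts with an evolutionary search over which samples to keep.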
References
http://archive.ics.uci.edu (2014)
Al Abdouli, N.O.: Handling the Class Imbalance Problem in Binary Classification. Master’s thesis, Masdar Institute of Science and Technology, Abu Dhabi, UAE (2014)
Alan, J.B., Ryutaro, T., Hoan, N.: A hybrid pansharpening approach and multi-scale object-based image analysis for mapping diseased pine and oak trees. International Journal of Remote Sensing 34, 6969–6982 (2013)
Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., Garcia, S., Sanchez, L., Herrera, F.: KEEL data-mining software tool: Data set repository. Journal of Multiple-Valued Logic and Soft Computing 17, 255–287 (2011)
Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6, 20–29 (2004)
Cano, J., Herrera, F., Lozano, M.: Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study. IEEE Transactions on Evolutionary Computation 7, 561–575 (2003)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
Eshelman, L.J.: The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination. In: Proc. 1st Workshop on Foundations of Genetic Algorithms. pp. 265–283 (1990)
Faisal, M.A., Aung, Z., Williams, J., Sanchez, A.: Data-stream-based intrusion detection system for advanced metering infrastructure in smart grid: A feasibility study. IEEE Systems Journal (2014), in press
Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27, 861–874 (2006)
Fernández, A., García, S., Jesus, M., Herrera, F.: A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets and Systems 159, 2378–2398 (2008)
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics Part C 42, 463–484 (2011)
He, H., Bai, Y., Garcia, E., Li, S.: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: Proc. 2008 International Joint Conference on Neural Networks. pp. 1322–1328 (2008)
Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Computational Intelligence 20, 18–36 (2004)
Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Proc. 8th Conference on AI in Medicine in Europe. pp. 63–66 (2001)
Liu, N., Woon, W.L., Aung, Z., Afshari, A.: Handling class imbalance in customer behavior prediction. In: Proc. 2014 IEEE International Conference on Collaboration Technologies and Systems. pp. 100–103 (2014)
Lokanayaki, K., Malathi, A.: Data preprocessing for liver dataset using SMOTE. International Journal of Advanced Research in Computer Science and Software Engineering 3, 559–562 (2013)
Mladenić, D., Grobelnik, M.: Feature selection for unbalanced class distribution and naive Bayes. In: Proc. 16th International Conference on Machine Learning. pp. 258–267 (1999)
Napierała, K., Stefanowski, J., Wilk, S.: Learning from imbalanced data in presence of noisy and borderline examples. In: Proc. 7th International Conference on Rough Sets and Current Trends in Computing. pp. 158–167 (2010)
Perera, K.S., Neupane, B., Faisal, M.A., Aung, Z., Woon, W.L.: A novel ensemble learning-based approach for click fraud detection in mobile advertising. In: Proc. 2013 International Conference on Mining Intelligence and Knowledge Exploration. Lecture Notes in Computer Science, vol. 8284, pp. 370–382 (2013)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers (1993)
Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms. Machine Learning 38, 257–286 (2000)
Yen, S.J., Lee, Y.S.: Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In: Proc. 2006 International Conference on Intelligent Computing. pp. 731–740 (2006)
Yoon, K., Kwek, S.: An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In: Proc. 5th International Conference on Hybrid Intelligent Systems. pp. 303–308 (2005)
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abdouli, N.O.A., Aung, Z., Woon, W.L., Svetinovic, D. (2015). Tackling Class Imbalance Problem in Binary Classification using Augmented Neighborhood Cleaning Algorithm. In: Kim, K. (eds) Information Science and Applications. Lecture Notes in Electrical Engineering, vol 339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46578-3_98
DOI: https://doi.org/10.1007/978-3-662-46578-3_98
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46577-6
Online ISBN: 978-3-662-46578-3
eBook Packages: Engineering (R0)