Tackling Class Imbalance Problem in Binary Classification using Augmented Neighborhood Cleaning Algorithm

  • Conference paper
Information Science and Applications

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 339)

Abstract

Many natural processes generate some observations more frequently than others. These processes produce imbalanced distributions, which bias classifiers toward the majority class because most classifiers assume a normal distribution. To address class imbalance, a number of data preprocessing techniques, broadly categorized into over-sampling and under-sampling methods, have been proposed over the years. The neighborhood cleaning rule (NCL) proposed by Laurikkala is among the most popular under-sampling methods. In this paper, we augment the original NCL algorithm by cleaning the unwanted samples with the CHC evolutionary algorithm instead of the simple nearest-neighbor-based cleaning used in NCL. We name our augmented algorithm NCL+. The performance of NCL+ is compared to that of NCL on 9 imbalanced datasets using 11 different classifiers. Experimental results show noticeable accuracy improvements by NCL+ over NCL. Moreover, NCL+ is also compared to a popular over-sampling method, the synthetic minority over-sampling technique (SMOTE), and is found to offer better results as well.
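
For readers unfamiliar with the baseline, the nearest-neighbor-based cleaning that NCL performs can be sketched roughly as follows. This is a simplified illustration for the binary case, not the authors' code: the function names, the choice of k = 3, and the toy data are assumptions made here, and the CHC-based cleaning that distinguishes NCL+ is not shown.

```python
import math
from collections import Counter


def k_nearest(points, i, k=3):
    """Indices of the k nearest neighbours of points[i] (Euclidean), excluding i."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    return [j for _, j in sorted(dists)[:k]]


def neighborhood_cleaning(points, labels, minority, k=3):
    """Return the indices kept after a simplified NCL pass (hypothetical helper)."""
    to_remove = set()
    for i, label in enumerate(labels):
        nbrs = k_nearest(points, i, k)
        predicted = Counter(labels[j] for j in nbrs).most_common(1)[0][0]
        if predicted == label:
            continue  # sample agrees with its neighbourhood, nothing to clean
        if label != minority:
            # Majority sample misclassified by its neighbours: treat it as noise.
            to_remove.add(i)
        else:
            # Minority sample misclassified by its neighbours: remove the
            # majority-class neighbours that crowd it. Minority data is never removed.
            to_remove.update(j for j in nbrs if labels[j] != minority)
    return [i for i in range(len(labels)) if i not in to_remove]


# Toy data (assumed): a majority cluster, a minority cluster, and one
# majority point sitting inside the minority cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5), (5.5, 5.5)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 0]
kept = neighborhood_cleaning(points, labels, minority=1)
print(kept)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Only majority-class samples are ever discarded, which is what makes this an under-sampling method; SMOTE, by contrast, adds synthetic minority samples.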

References

  1. http://archive.ics.uci.edu (2014)

  2. http://sci2s.ugr.es/keel/datasets.php (2014)

  3. Al Abdouli, N.O.: Handling the Class Imbalance Problem in Binary Classification. Master’s thesis, Masdar Institute of Science and Technology, Abu Dhabi, UAE (2014)

  4. Alan, J.B., Ryutaro, T., Hoan, N.: A hybrid pansharpening approach and multi-scale object-based image analysis for mapping diseased pine and oak trees. International Journal of Remote Sensing 34, 6969–6982 (2013)

  5. Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., Garcia, S., Sanchez, L., Herrera, F.: KEEL data-mining software tool: Data set repository. Journal of Multiple-Valued Logic and Soft Computing 17, 255–287 (2011)

  6. Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6, 20–29 (2004)

  7. Cano, J., Herrera, F., Lozano, M.: Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study. IEEE Transactions on Evolutionary Computation 7, 561–575 (2003)

  8. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)

  9. Eshelman, L.J.: The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination. In: Proc. 1st Workshop on Foundations of Genetic Algorithms. pp. 265–283 (1990)

  10. Faisal, M.A., Aung, Z., Williams, J., Sanchez, A.: Data-stream-based intrusion detection system for advanced metering infrastructure in smart grid: A feasibility study. IEEE Systems Journal (2014), in press

  11. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27, 861–874 (2006)

  12. Fernández, A., García, S., del Jesus, M.J., Herrera, F.: A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets and Systems 159, 2378–2398 (2008)

  13. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics Part C 42, 463–484 (2011)

  14. He, H., Bai, Y., Garcia, E., Li, S.: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: Proc. 2008 International Joint Conference on Neural Networks. pp. 1322–1328 (2008)

  15. Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Computational Intelligence 20, 18–36 (2004)

  16. Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Proc. 8th Conference on AI in Medicine in Europe. pp. 63–66 (2001)

  17. Liu, N., Woon, W.L., Aung, Z., Afshari, A.: Handling class imbalance in customer behavior prediction. In: Proc. 2014 IEEE International Conference on Collaboration Technologies and Systems. pp. 100–103 (2014)

  18. Lokanayaki, K., Malathi, A.: Data preprocessing for liver dataset using SMOTE. International Journal of Advanced Research in Computer Science and Software Engineering 3, 559–562 (2013)

  19. Mladenić, D., Grobelnik, M.: Feature selection for unbalanced class distribution and naive Bayes. In: Proc. 16th International Conference on Machine Learning. pp. 258–267 (1999)

  20. Napierała, K., Stefanowski, J., Wilk, S.: Learning from imbalanced data in presence of noisy and borderline examples. In: Proc. 7th International Conference on Rough Sets and Current Trends in Computing. pp. 158–167 (2010)

  21. Perera, K.S., Neupane, B., Faisal, M.A., Aung, Z., Woon, W.L.: A novel ensemble learning-based approach for click fraud detection in mobile advertising. In: Proc. 2013 International Conference on Mining Intelligence and Knowledge Exploration. Lecture Notes in Computer Science, vol. 8284, pp. 370–382 (2013)

  22. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers (1993)

  23. Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms. Machine Learning 38, 257–286 (2000)

  24. Yen, S.J., Lee, Y.S.: Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In: Proc. 2006 International Conference on Intelligent Computing. pp. 731–740 (2006)

  25. Yoon, K., Kwek, S.: An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In: Proc. 5th International Conference on Hybrid Intelligent Systems. pp. 303–308 (2005)

Author information

Corresponding author

Correspondence to Zeyar Aung.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Abdouli, N.O.A., Aung, Z., Woon, W.L., Svetinovic, D. (2015). Tackling Class Imbalance Problem in Binary Classification using Augmented Neighborhood Cleaning Algorithm. In: Kim, K. (eds) Information Science and Applications. Lecture Notes in Electrical Engineering, vol 339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46578-3_98

  • DOI: https://doi.org/10.1007/978-3-662-46578-3_98

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-46577-6

  • Online ISBN: 978-3-662-46578-3

  • eBook Packages: Engineering (R0)
