
ARCID: A New Approach to Deal with Imbalanced Datasets Classification

  • Safa Abdellatif
  • Mohamed Ali Ben Hassine
  • Sadok Ben Yahia
  • Amel Bouzeghoub
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10706)

Abstract

Classification is one of the most fundamental and well-known tasks in data mining. Class imbalance is among the most challenging issues encountered when performing classification, i.e., when the number of instances belonging to the class of interest (minority class) is much lower than that of the other classes (majority classes). The class imbalance problem has become increasingly prominent as machine learning algorithms are applied to real-world tasks such as medical diagnosis, text classification, and fraud detection. Standard classifiers may yield very good results on the majority classes; however, they perform poorly on the minority classes, since they assume a relatively balanced class distribution and equal misclassification costs. To overcome this problem, we propose in this paper a novel associative classification algorithm called Association Rule-based Classification for Imbalanced Datasets (ARCID). This algorithm aims to extract significant knowledge from imbalanced datasets by emphasizing the information extracted from the minority classes, without drastically impacting the predictive accuracy of the classifier. Experiments on five datasets from the UCI repository have been conducted with respect to four assessment measures. The results show that ARCID outperforms standard algorithms and is very competitive with Fitcare, a class-imbalance-insensitive algorithm.
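The pitfall described above can be illustrated with a minimal sketch (not the ARCID algorithm itself): on a skewed dataset, a trivial classifier that always predicts the majority class scores high on overall accuracy yet never detects the class of interest. The labels and the 95:5 split below are hypothetical.

```python
# Hypothetical imbalanced labels: 95 majority ("neg") vs. 5 minority ("pos").
y_true = ["neg"] * 95 + ["pos"] * 5
# A trivial majority-class predictor ignores the minority class entirely.
y_pred = ["neg"] * 100

# Overall accuracy: fraction of correct predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority-class recall: fraction of "pos" instances actually detected.
true_pos = sum(t == "pos" and p == "pos" for t, p in zip(y_true, y_pred))
minority_recall = true_pos / y_true.count("pos")

print(accuracy)         # 0.95 -- looks excellent
print(minority_recall)  # 0.0  -- the class of interest is never found
```

This is why imbalance-aware evaluation uses per-class measures (recall, F-measure, G-mean) rather than plain accuracy, and why ARCID emphasizes rules drawn from the minority classes.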

Keywords

Associative classification · Imbalanced datasets · Machine learning · Data mining

References

  1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB 1994, Santiago de Chile, Chile, 12–15 September 1994, pp. 487–499 (1994)
  2. Ali, K., Manganaris, S., Srikant, R.: Partial classification using association rules. In: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-1997), Newport Beach, California, USA, 14–17 August 1997, pp. 115–118 (1997)
  3. Antonie, M., Zaïane, O.R.: An associative classifier based on positive and negative rules. In: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD 2004, Paris, France, 13 June 2004, pp. 64–69 (2004)
  4. Bekkar, M., Djemaa, H.K., Alitouche, T.A.: Evaluation measures for models assessment over imbalanced data sets. J. Inf. Eng. Appl. 3(10), 2–4 (2013)
  5. Bowyer, K.W., Chawla, N.V., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. CoRR abs/1106.1813 (2011). http://arxiv.org/abs/1106.1813
  6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
  7. Cerf, L., Gay, D., Selmaoui-Folcher, N., Crémilleux, B., Boulicaut, J.: Parameter-free classification in multi-class imbalanced data sets. Data Knowl. Eng. 87, 109–129 (2013)
  8. Cohen, W.W.: Fast effective rule induction. In: Proceedings of the Twelfth International Conference on Machine Learning, pp. 115–123 (1995)
  9. Gasmi, G., Yahia, S.B., Nguifo, E.M., Slimani, Y.: IGB: a new informative generic base of association rules. In: Ho, T.B., Cheung, D., Liu, H. (eds.) PAKDD 2005. LNCS, vol. 3518, pp. 81–90. Springer, Heidelberg (2005). https://doi.org/10.1007/11430919_11
  10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
  11. Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005). https://doi.org/10.1007/11538059_91
  12. Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min. Knowl. Discov. 8(1), 53–87 (2004)
  13. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968)
  14. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced data. Stat. Anal. Data Min.: ASA Data Sci. J. 2(5–6), 412–426 (2009)
  15. Holmes, J.H.: Differential negative reinforcement improves classifier system learning rate in two-class problems with unequal base rates. In: Genetic Programming, pp. 635–642 (1998)
  16. Hu, B., Dong, W.: A study on cost behaviors of binary classification measures in class-imbalanced problems. CoRR abs/1403.7100 (2014)
  17. Japkowicz, N., Myers, C., Gluck, M., et al.: A novelty detection approach to classification. In: IJCAI, vol. 1, pp. 518–523 (1995)
  18. Kubat, M., Holte, R.C., Matwin, S.: Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 30(2–3), 195–215 (1998)
  19. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, Tennessee, USA, 8–12 July 1997, pp. 179–186 (1997)
  20. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-1998), New York City, New York, USA, 27–31 August 1998, pp. 80–86 (1998)
  21. Merz, C.: UCI repository of machine learning databases (1996). http://www.ics.uci.edu/~mlearn/MLRepository.html
  22. Mitchell, T.M.: Machine Learning. McGraw Hill Series in Computer Science. McGraw-Hill (1997)
  23. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, Burlington (1993)
  24. Quinlan, J.R., Cameron-Jones, R.M.: FOIL: a midterm report. In: Brazdil, P.B. (ed.) ECML 1993. LNCS, vol. 667, pp. 1–20. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-56602-3_124
  25. Rijsbergen, C.J.V.: Information Retrieval. Butterworth, London (1979)
  26. Sasirekha, D., Punitha, A.: A comprehensive analysis on associative classification in medical datasets. Indian J. Sci. Technol. 8(33), 3–5 (2015)
  27. Thabtah, F., Cowling, P., Peng, Y.: Multiple label classification rules approach. J. Knowl. Inf. Syst. 9, 109–129 (2006)
  28. Tomek, I.: An experiment with the edited nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 6, 448–452 (1976)
  29. Yang, Q., Wu, X.: 10 challenging problems in data mining research. Int. J. Inf. Technol. Decis. Mak. 5(4), 597–604 (2006)

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Safa Abdellatif (1)
  • Mohamed Ali Ben Hassine (1)
  • Sadok Ben Yahia (1)
  • Amel Bouzeghoub (2)
  1. University of Tunis El Manar, Faculty of Sciences of Tunis, LIPAH-LR11ES14, El Manar, Tunis, Tunisia
  2. Institut Mines-TELECOM, TELECOM SudParis, UMR CNRS Samovar, Évry Cedex, France
