
EF_Unique: An Improved Version of Unsupervised Equal Frequency Discretization Method

  • Research Article - Computer Engineering and Computer Science
Arabian Journal for Science and Engineering

Abstract

Discretization is an important data preprocessing technique used in data mining and knowledge discovery processes. The purpose of discretization is to transform or partition continuous values into discrete ones. In this manner, many data mining classification algorithms can be applied to discrete data more concisely and meaningfully than to continuous data, resulting in better performance. In this study, an improved version of the unsupervised equal frequency (EF) discretization method, EF_Unique, is proposed to enhance the performance of discretization. The proposed EF_Unique discretization method is based on the unique values of the attribute to be discretized. To test the success of the proposed method, 17 benchmark datasets from the UCI repository and four data mining classification algorithms were used, namely Naïve Bayes, C4.5, k-nearest neighbor, and support vector machine. The experimental results of the proposed EF_Unique discretization method were compared with those obtained using well-known discretization methods: unsupervised equal width (EW), EF, and supervised entropy-based ID3 (EB-ID3). The results show that the proposed EF_Unique discretization method outperformed the EW, EF, and EB-ID3 discretization methods in 43, 41, and 27 of the 68 benchmark tests, respectively.
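The abstract only outlines the idea, so the following Python sketch is a rough illustration of the contrast it describes: classical equal-frequency binning versus a variant whose cut points are derived from the distinct (unique) values of the attribute. The function names, the bin-count parameter k, and the quantile-based cut-point rule are illustrative assumptions for this sketch, not the EF_Unique algorithm as published in the paper.

```python
import numpy as np

def equal_frequency_cuts(values, k):
    """Classical EF discretization: choose k-1 cut points so that each of the
    k bins holds roughly the same number of observations."""
    values = np.asarray(values, dtype=float)
    return np.quantile(values, [i / k for i in range(1, k)])

def ef_unique_cuts(values, k):
    """Hypothetical EF_Unique-style variant (an assumption for illustration):
    cut points are computed from the distinct values of the attribute, so
    heavily repeated values cannot dominate or straddle a bin boundary."""
    uniques = np.unique(np.asarray(values, dtype=float))
    return np.quantile(uniques, [i / k for i in range(1, k)])

def discretize(values, cuts):
    """Map each continuous value to the index of the interval it falls in."""
    return np.digitize(np.asarray(values, dtype=float), cuts)

if __name__ == "__main__":
    x = [1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
    print(equal_frequency_cuts(x, 3))  # boundaries pulled toward the repeated 1.0s
    print(ef_unique_cuts(x, 3))        # boundaries based on {1, 2, 3, 7, 8, 9}
    print(discretize(x, ef_unique_cuts(x, 3)))
```

On skewed attributes such as x above, the classical quantile-based cuts land inside the run of repeated 1.0 values, while the unique-value variant spreads the bins over the observed distinct values; this is the kind of effect the abstract attributes to basing discretization on unique values.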



Author information

Corresponding author

Correspondence to Mehmet Hacibeyoglu.


Cite this article

Hacibeyoglu, M., Ibrahim, M.H. EF_Unique: An Improved Version of Unsupervised Equal Frequency Discretization Method. Arab J Sci Eng 43, 7695–7704 (2018). https://doi.org/10.1007/s13369-018-3144-z

