Abstract
Discretization is an important data preprocessing technique in data mining and knowledge discovery. Its purpose is to transform or partition continuous values into discrete ones. Many data mining classification algorithms can then be applied to the discrete data more concisely and meaningfully than to continuous data, resulting in better performance. In this study, an improved version of the unsupervised equal frequency (EF) discretization method, EF_Unique, is proposed to enhance discretization performance. The proposed EF_Unique method bases its discretization on the unique values of the attribute to be discretized. To test the success of the proposed method, 17 benchmark datasets from the UCI repository and four data mining classification algorithms were used: Naïve Bayes, C4.5, k-nearest neighbor, and support vector machine. The experimental results of the proposed EF_Unique discretization method were compared with those of well-known discretization methods: unsupervised equal width (EW), EF, and supervised entropy-based ID3 (EB-ID3). The results show that the proposed EF_Unique method outperformed the EW, EF, and EB-ID3 discretization methods in 43, 41, and 27 of the 68 benchmark tests, respectively.
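To make the contrast concrete, the sketch below implements classic equal-frequency (EF) binning next to a unique-value-driven variant. The EF function is the standard quantile-based method; `ef_unique` is only a plausible reading of the abstract's description ("based on the unique values of the attribute"), not the authors' exact algorithm: it takes the quantile cut points over the distinct values, so heavily repeated values cannot dominate a bin.

```python
import numpy as np

def equal_frequency(values, n_bins):
    """Classic unsupervised equal-frequency (EF) discretization:
    cut points are quantiles of the raw values, so each bin holds
    roughly the same number of observations."""
    values = np.asarray(values, dtype=float)
    probs = np.linspace(0, 1, n_bins + 1)[1:-1]   # interior quantile levels
    cuts = np.quantile(values, probs)
    # Map each value to the index of the bin it falls into.
    return np.searchsorted(cuts, values, side="right")

def ef_unique(values, n_bins):
    """Hypothetical sketch of a unique-value-based EF variant:
    the same quantile cuts, but computed over the attribute's
    distinct values, so duplicates do not skew the intervals."""
    values = np.asarray(values, dtype=float)
    uniq = np.unique(values)
    probs = np.linspace(0, 1, n_bins + 1)[1:-1]
    cuts = np.quantile(uniq, probs)
    return np.searchsorted(cuts, values, side="right")
```

On an attribute such as `[1, 1, 1, 1, 2, 3, 4, 5, 6, 7]`, plain EF places its cut at the median of all observations (pulled low by the repeated 1s), while the unique-value variant cuts at the median of the distinct values `[1..7]`, yielding different interval boundaries.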
Hacibeyoglu, M., Ibrahim, M.H. EF_Unique: An Improved Version of Unsupervised Equal Frequency Discretization Method. Arab J Sci Eng 43, 7695–7704 (2018). https://doi.org/10.1007/s13369-018-3144-z