Abstract
Discretization is an important data preprocessing technique in data mining and knowledge discovery. Its purpose is to transform or partition continuous values into discrete ones. Many data mining classification algorithms can then be applied to the discrete data more concisely and meaningfully than to continuous data, resulting in better performance. In this study, an improved version of the unsupervised equal frequency (EF) discretization method, EF_Unique, is proposed to enhance discretization performance. The proposed EF_Unique method bases its discretization on the unique values of the attribute to be discretized. To test the success of the proposed method, 17 benchmark datasets from the UCI repository and four data mining classification algorithms were used: Naïve Bayes, C4.5, k-nearest neighbor, and support vector machine. The experimental results of the proposed EF_Unique discretization method were compared with those of well-known discretization methods: unsupervised equal width (EW), EF, and supervised entropy-based ID3 (EB-ID3). The results show that the proposed EF_Unique method outperformed the EW, EF, and EB-ID3 discretization methods in 43, 41, and 27 of the 68 benchmark tests, respectively.
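To make the contrast concrete, the sketch below implements classic equal-frequency (EF) binning next to a unique-value-driven variant. The EF function is the standard quantile-based method; `ef_unique` is only a plausible reading of the abstract's description ("based on the unique values of the attribute"), not the authors' exact algorithm: it takes the quantile cut points over the distinct values, so heavily repeated values cannot dominate a bin.

```python
import numpy as np

def equal_frequency(values, n_bins):
    """Classic unsupervised equal-frequency (EF) discretization:
    cut points are quantiles of the raw values, so each bin holds
    roughly the same number of observations."""
    values = np.asarray(values, dtype=float)
    probs = np.linspace(0, 1, n_bins + 1)[1:-1]   # interior quantile levels
    cuts = np.quantile(values, probs)
    # Map each value to the index of the bin it falls into.
    return np.searchsorted(cuts, values, side="right")

def ef_unique(values, n_bins):
    """Hypothetical sketch of a unique-value-based EF variant:
    the same quantile cuts, but computed over the attribute's
    distinct values, so duplicates do not skew the intervals."""
    values = np.asarray(values, dtype=float)
    uniq = np.unique(values)
    probs = np.linspace(0, 1, n_bins + 1)[1:-1]
    cuts = np.quantile(uniq, probs)
    return np.searchsorted(cuts, values, side="right")
```

On an attribute such as `[1, 1, 1, 1, 2, 3, 4, 5, 6, 7]`, plain EF places its cut at the median of all observations (pulled low by the repeated 1s), while the unique-value variant cuts at the median of the distinct values `[1..7]`, yielding different interval boundaries.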
Hacibeyoglu, M., Ibrahim, M.H. EF_Unique: An Improved Version of Unsupervised Equal Frequency Discretization Method. Arab J Sci Eng 43, 7695–7704 (2018). https://doi.org/10.1007/s13369-018-3144-z