Statistical Discretization of Continuous Attributes Using Kolmogorov-Smirnov Test
Unsupervised discretization methods apply simple rules to continuous attributes, and their low time complexity is dominated by the sorting step. Supervised discretization algorithms, by contrast, take class labels into account to achieve higher accuracy. Supervised discretization of continuous features faces two significant challenges. First, noisy class labels reduce the effectiveness of the discretization. Second, on large-scale datasets the computational cost of supervised algorithms is high, so the running time is dominated by the discretization stage rather than by sorting. To address these challenges, we devise a statistical unsupervised method named SUFDA. SUFDA produces discrete intervals by decreasing the differential entropy of the normal distribution, with low time complexity and high accuracy. The results show that our unsupervised method is more effective than other discretization baselines on large-scale datasets.
Keywords: Discretization, Kolmogorov-Smirnov, Data mining, Data reduction, Naïve Bayes
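The abstract names two statistical quantities without defining them: the differential entropy of a normal distribution and the Kolmogorov-Smirnov statistic. The sketch below is not the SUFDA algorithm, which the abstract does not specify. It only illustrates, under stated assumptions, that splitting a clearly bimodal attribute into two intervals lowers the normal differential entropy of each part, and how a one-sample Kolmogorov-Smirnov statistic against a fitted normal can be computed. The function names and the sample values are hypothetical and chosen for illustration.

```python
import math
from statistics import NormalDist, fmean, pstdev


def normal_differential_entropy(sigma: float) -> float:
    """Differential entropy (in nats) of a normal distribution: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def ks_statistic_vs_normal(values: list[float]) -> float:
    """One-sample KS statistic between the empirical CDF and a normal fitted to the sample."""
    xs = sorted(values)
    n = len(xs)
    # Assumption: the sample has at least two distinct values, so sigma > 0.
    fitted = NormalDist(fmean(xs), pstdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Largest gap between the empirical step function and the fitted CDF.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Hypothetical attribute values forming two well-separated clusters.
low = [1.0, 1.2, 0.9, 1.1, 0.8]
high = [9.0, 9.3, 8.8, 9.1, 9.2]
whole = low + high

h_whole = normal_differential_entropy(pstdev(whole))
h_parts = [normal_differential_entropy(pstdev(part)) for part in (low, high)]

print("entropy of whole attribute:", h_whole)
print("entropy of each interval:  ", h_parts)
print("KS statistic, whole vs. fitted normal:", ks_statistic_vs_normal(whole))
```

Because the entropy of a normal distribution depends only on its standard deviation, any split that yields tighter intervals lowers it. How SUFDA selects cut points and where it applies the Kolmogorov-Smirnov test is described in the paper itself, not here.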