Statistical Discretization of Continuous Attributes Using Kolmogorov-Smirnov Test

  • Hadi Mohammadzadeh Abachi
  • Saeid Hosseini
  • Mojtaba Amiri Maskouni
  • Mohammadreza Kangavari
  • Ngai-Man Cheung
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10837)

Abstract

Unsupervised discretization methods apply simple rules to continuous attributes, and their low time complexity is dominated by the sorting step. Supervised discretization algorithms, by contrast, take class labels into account to achieve high accuracy. Supervised discretization of continuous features faces two significant challenges. First, noisy class labels degrade the effectiveness of discretization. Second, because supervised algorithms are computationally expensive on large-scale datasets, the running time is dominated by the discretization stage rather than by sorting. To address these challenges, we devise a statistical unsupervised method named SUFDA. SUFDA aims to produce discrete intervals by decreasing the differential entropy of the normal distribution, with low time complexity and high accuracy. The results show that our unsupervised method is more effective than other discretization baselines on large-scale datasets.
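The abstract names two statistical ingredients, the Kolmogorov-Smirnov test and the differential entropy of the normal distribution, without spelling out how SUFDA combines them. The following is a minimal sketch of those two quantities only, not the authors' algorithm. The function names and the sample data are hypothetical, and fitting the normal from the same sample is an assumption made here for illustration.

```python
import math
from statistics import NormalDist, fmean, pstdev


def normal_differential_entropy(sigma):
    """Differential entropy (in nats) of N(mu, sigma^2): 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def ks_statistic_vs_normal(values):
    """One-sample KS statistic D between the empirical CDF of `values`
    and a normal distribution fitted to the same values.

    Assumption: parameters are estimated from the sample itself, so the
    standard KS critical values do not strictly apply to D.
    Requires at least two distinct values (sigma must be positive).
    """
    xs = sorted(values)
    n = len(xs)
    dist = NormalDist(fmean(xs), pstdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Hypothetical attribute values: two well-separated clusters.
attribute = [1.0, 1.2, 0.9, 1.1, 1.3, 7.8, 8.1, 8.0, 7.9, 8.2]
low_cluster = [v for v in attribute if v < 4.0]

d_all = ks_statistic_vs_normal(attribute)
h_all = normal_differential_entropy(pstdev(attribute))
h_low = normal_differential_entropy(pstdev(low_cluster))
```

In this toy data, the normal fitted to one cluster has a smaller standard deviation than the normal fitted to the whole attribute, so `h_low` is smaller than `h_all`. That is the sense in which splitting an attribute into intervals can decrease the differential entropy of the fitted normal distribution.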

Keywords

Discretization · Kolmogorov-Smirnov · Data mining · Data reduction · Naïve Bayes

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Hadi Mohammadzadeh Abachi (1)
  • Saeid Hosseini (1, 2)
  • Mojtaba Amiri Maskouni (1)
  • Mohammadreza Kangavari (1)
  • Ngai-Man Cheung (2)

  1. Iran University of Science and Technology, Tehran, Iran
  2. Singapore University of Technology and Design, Singapore, Singapore
