The Importance of Discretization Methods in Machine Learning Applications: A Case Study of Predicting ICU Mortality

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1339)

Abstract

Data preprocessing is one of the most crucial stages in data analysis. While developing novel machine learning techniques has been the main focus of research in data science, less attention has been given to data preprocessing. This paper examines the importance of discretization as a preprocessing step and shows that it helps achieve better classification performance than using continuous attributes. We examine the performance of multiple parametric and non-parametric discretization methods in conjunction with several machine learning classifiers for the problem of predicting Intensive Care Unit (ICU) mortality. Our results demonstrate the significance of discretizing the input attributes in this problem. Using discretized data achieved a classification accuracy and F1 score of 89.19% and 0.38, respectively, while using continuous attributes achieved a classification accuracy and F1 score of 86.19% and 0.08, respectively. These results demonstrate that discretizing continuous attributes prior to applying machine learning models could significantly enhance performance.
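As a rough illustration of what discretizing a continuous attribute involves, the sketch below bins one numeric feature in two common unsupervised ways. The abstract does not name the specific methods evaluated, so equal-width and equal-frequency binning are used here purely as assumptions for illustration; the function names and the heart-rate values are hypothetical and are not taken from the paper or its dataset.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior cut points that place roughly the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def discretize(values, edges):
    """Replace each continuous value with the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical heart-rate readings (illustrative only, not data from the paper).
heart_rate = [52, 60, 61, 63, 70, 72, 75, 88, 110, 150]

width_bins = discretize(heart_rate, equal_width_edges(heart_rate, 3))
freq_bins = discretize(heart_rate, equal_frequency_edges(heart_rate, 3))

print(width_bins)
print(freq_bins)
```

On this skewed sample, equal-width binning puts seven of the ten readings into the first bin, while equal-frequency binning spreads them 3/3/4. The resulting bin indices would then be passed to a classifier in place of the raw values.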

Keywords

  • Discretization
  • Classification
  • Mortality prediction


Corresponding author

Correspondence to Seif Eldawlatly.

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Elhilbawi, H., Eldawlatly, S., Mahdi, H. (2021). The Importance of Discretization Methods in Machine Learning Applications: A Case Study of Predicting ICU Mortality. In: Hassanien, AE., Chang, KC., Mincong, T. (eds) Advanced Machine Learning Technologies and Applications. AMLTA 2021. Advances in Intelligent Systems and Computing, vol 1339. Springer, Cham. https://doi.org/10.1007/978-3-030-69717-4_23
