Big Data Discretization

Chapter in: Big Data Preprocessing

Abstract

Data discretization transforms continuous numerical data into discrete, bounded values that are easier for humans to interpret and more manageable for a wide range of machine learning methods. With the advent of Big Data, a new wave of large-scale datasets dominated by continuous features has arrived in industry and academia. However, standard discretizers do not scale well to huge sets of continuous points, so novel distributed discretization solutions are needed. In this chapter, we review the most relevant contributions to this field in the literature. We begin by enumerating the early proposals for parallel discretization. Then, we present some distributed solutions capable of scaling to large datasets. We finish with a study of the discretization methods capable of dealing with Big Data streams.
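
As a concrete illustration of what discretization does, the following minimal sketch maps continuous values to interval indices using equal-width binning. Equal-width binning is chosen here only because it is the simplest scheme; it is not one of the distributed methods the chapter reviews, and the function names `equal_width_cut_points` and `discretize` are hypothetical helpers introduced for this example.

```python
from bisect import bisect_right


def equal_width_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points that split [min, max] evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in."""
    # bisect_right counts how many cut points are <= v, which is the bin index.
    return [bisect_right(cut_points, v) for v in values]


values = [0.0, 1.5, 3.2, 4.8, 7.1, 9.9, 12.0]
cuts = equal_width_cut_points(values, n_bins=3)
labels = discretize(values, cuts)
print(cuts)    # interior cut points
print(labels)  # one bin index per input value
```

Each continuous value is replaced by one of three bounded labels (0, 1, or 2). Computing the cut points requires a pass over the whole attribute to find its range, which hints at why single-machine discretizers become a bottleneck on very large datasets.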

Notes

  1. If the points are in array format, a loop is used to evaluate them; otherwise, a distributed map function is used instead.

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Big Data Discretization. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_7

  • DOI: https://doi.org/10.1007/978-3-030-39105-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-39104-1

  • Online ISBN: 978-3-030-39105-8

  • eBook Packages: Computer Science, Computer Science (R0)
