Introduction

Big Data Preprocessing

Abstract

We live in a world where data is generated from a myriad of sources, and it is very cheap to collect and store such data. However, the real benefit lies not in the data itself but in the algorithms that are capable of processing it within a tolerable elapsed time and extracting valuable knowledge from it. The term “Big Data” has spread rapidly in the fields of data mining and business intelligence. This new scenario can be defined in terms of those problems that cannot be addressed effectively or efficiently using the standard computing resources currently available. We must emphasize that Big Data does not just imply large volumes of data but also the need for scalability, i.e., ensuring a response in an acceptable elapsed time. Therefore, the use of Big Data Analytics tools provides very significant advantages to both industry and academia. In this chapter we provide an introduction to Big Data and its problems. Next we discuss a new topic, Big Data Analytics, which refers to the application of machine learning techniques to Big Data problems. We then define data preprocessing and describe the different techniques used to improve the quality of data. We finish with an introduction to the state of Big Data streaming.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Introduction. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_1

  • DOI: https://doi.org/10.1007/978-3-030-39105-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-39104-1

  • Online ISBN: 978-3-030-39105-8

  • eBook Packages: Computer Science, Computer Science (R0)
