Abstract
We live in a world where data is generated from a myriad of sources, and it is really cheap to collect and storage such data. However, the real benefit is not related to the data itself, but with the algorithms that are capable of processing such data in a tolerable elapsed time, and to extract valuable knowledge from it. The term “Big Data” has spread rapidly in the framework of data mining and business intelligence. This new scenario can be defined by means of those problems that cannot be effectively or efficiently addressed using the standard computing resources that we currently have. We must emphasize that Big Data does not just imply large volumes of data but also the necessity for scalability, i.e., to ensure a response in an acceptable elapsed time. Therefore, the use of Big Data Analytics tools provides very significant advantages to both industry and academia. In this chapter we provide an introduction to Big Data and its problems. Next we discuss about a new topic, namely Big Data Analytics, referred to the application of machine learning techniques to Big Data problems. Then we continue with a definition of data preprocessing and the different techniques used to improve the quality of data. We finish with an introduction to the state of Big Data streaming.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Agrawal, D., Das, S., & Abbadi, A. E. (2011). Big data and cloud computing: Current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology (pp. 530–533). New York: ACM.
Aha, D. W., Kibler, D., & Albert, M. K. (1999). Instance-based learning algorithms. Machine Learning, 6(1), 37–66.
Al-Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., & Ayyash, M. (2015). Internet of things: A survey on enabling technologies, protocols, and applications. IEEE Communications Surveys & Tutorials, 17(4), 2347–2376.
Apache Flink. (2019). http://flink.apache.org/
Apache Storm. (2019). https://storm.apache.org/
Bello-Orgaz, G., Jung, J. J., & Camacho, D. (2016). Social big data: Recent achievements and new challenges. Information Fusion, 28, 45–59.
Chen, H., Chiang, R. H. L., Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly: Management Information Systems, 36(4), 1165–1188.
Choi, T.-M., Chan, H. K., & Yue, X. (2017). Recent development in big data analytics for business operations and risk management. IEEE Transactions on Cybernetics, 47(1), 81–92.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Dean, J., & Ghemawat, S. (2010). MapReduce: A flexible data processing tool. Communications of the ACM, 53(1), 72–77.
Fernández, A., del Río, S., López, V., Bawakid, A., del Jesús, M. J., Benítez, J. M. et al. (2014). Big data with cloud computing: An insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380–409.
Frénay, B., & Verleysen, M.: Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5), 845–869.
Gaber, M. M. (2012). Advances in data stream mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 79–85.
Gama, J. (2010). Knowledge discovery from data streams. London: Chapman and Hall/CRC.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM computing Surveys, 46(4), 44.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–144.
García, S., Luengo, J., & Herrera, F. (2015). Data Preprocessing in Data Mining. Berlin: Springer.
García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data preprocessing: Methods and prospects. Big Data Analytics, 1, 9.
García-Gil, D., Luengo, J., García, S., & Herrera, F. (2019). Enabling smart data: Noise filtering in big data classification. Information Sciences, 479, 135–152.
Hall, M. A. (1999). Correlation-based feature selection for machine learning. Hamilton: Department of Computer Science, Waikato University.
Härdle, W., Horng-Shing Lu, H., & Shen, X. (2018). Handbook of big data analytics. Berlin: Springer.
Hu, H., Wen, Y., Chua, T.-S., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652–687.
Iafrate, F. (2014). A journey from big data to smart data. Advances in Intelligent Systems and Computing, 261, 25–33.
Jolliffe, I. (2011). Principal Component Analysis. Berlin: Springer.
Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning spark: Lightning-fast big data analytics (1st ed.). Sebastopol: O’Reilly Media.
Lin, J. (2013). MapReduce is good enough? If all you have is a hammer, throw away everything that’s not a nail! Big Data, 1(1), 28–37.
Liu, H., Hussain, F., Tan, C. L., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6(4), 393–423.
Liu, H., & Motoda, H. (2002). On issues of instance selection. Data Mining and Knowledge Discovery, 6(2), 115–130 (2002)
López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113–141.
Masud, M. M., Chen, Q., Gao, J., Khan, L., Han, J., & Thuraisingham, B. (2010). Classification and novel class detection of data streams in a dynamic feature space. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 337–352). Berlin: Springer.
Philip Chen, C. L., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 275, 314–347.
Pyle, D. (1999). Data preparation for data mining. San Francisco: Morgan Kaufmann.
Ramalingeswara Rao, T., Mitra, P., Bhatt, R., & Goswami, A. (2018). The big data system, components, tools, and technologies: A survey. Knowledge and Information Systems, 60, 1165–1245.
Ramírez-Gallego, S., Fernández, A., García, S., Chen, M., & Herrera, F. (2018). Big data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce. Information Fusion, 42, 51–61.
Ramírez-Gallego, S., Krawczyk, B., García, S., Woźniak, S., & Herrera, F. (2017). A survey on data preprocessing for data stream mining: Current status and future directions. Neurocomputing, 239, 39–57.
Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010). The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (pp. 1–10). Piscataway: IEEE.
Triguero, I., García-Gil, D., Maillo, J., Luengo, J., García, S., & Herrera, F. (2019). Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(2), e1289.
Wang, H., & Wang, S. (2010). Mining incomplete survey data through classification. Knowledge and Information Systems, 24(2), 221–233.
Watson, H. J., & Wixom, B. H. (2007). The current state of business intelligence. Computer, 40(9), 96–99.
Webb, G. I. (2014). Contrary to popular belief incremental discretization can be sound, computationally efficient and extremely useful for streaming data. In 2014 IEEE International Conference on Data Mining (pp. 1031–1036). Piscataway: IEEE.
White, T. (2012). Hadoop: The Definitive Guide. Sebastopol: O’Reilly Media.
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.
Wu, X., Zhu, X., Wu, G.-Q., & Ding, W. (2014). Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, 26(1), 97–107.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., et al. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (p. 2). Berkeley: USENIX Association.
Zaki, M. J., & Meira, W. Jr. (2014). Data mining and analysis: Fundamental concepts and algorithms. New York: Cambridge University Press.
Zliobaite, I., & Gabrys, B. (2014). Adaptive preprocessing for streaming data. IEEE Transactions on Knowledge and Data Engineering, 26(2), 309–321.
Author information
Authors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Introduction. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_1
Download citation
DOI: https://doi.org/10.1007/978-3-030-39105-8_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8
eBook Packages: Computer ScienceComputer Science (R0)