Introduction

Luengo, Julián; García-Gil, Diego; Ramírez-Gallego, Sergio; García, Salvador; Herrera, Francisco

doi:10.1007/978-3-030-39105-8_1

Abstract

We live in a world where data is generated from a myriad of sources, and it is very cheap to collect and store such data. However, the real benefit lies not in the data itself but in the algorithms that are capable of processing it in a tolerable elapsed time and extracting valuable knowledge from it. The term “Big Data” has spread rapidly in the framework of data mining and business intelligence. This new scenario can be defined by those problems that cannot be effectively or efficiently addressed using the standard computing resources that we currently have. We must emphasize that Big Data does not just imply large volumes of data but also the necessity for scalability, i.e., ensuring a response in an acceptable elapsed time. Therefore, the use of Big Data Analytics tools provides very significant advantages to both industry and academia. In this chapter we provide an introduction to Big Data and its problems. Next, we discuss a new topic, namely Big Data Analytics, which refers to the application of machine learning techniques to Big Data problems. We then continue with a definition of data preprocessing and the different techniques used to improve the quality of data. We finish with an introduction to the state of Big Data streaming.

Author information

Authors and Affiliations

Department of Computer Science and AI, University of Granada, Granada, Spain
Julián Luengo, Diego García-Gil, Salvador García & Francisco Herrera
DOCOMO Digital España, Madrid, Madrid, Spain
Sergio Ramírez-Gallego

About this chapter

Cite this chapter

Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Introduction. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_1

DOI: https://doi.org/10.1007/978-3-030-39105-8_1
Published: 17 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8
eBook Packages: Computer Science, Computer Science (R0)
