Abstract
Changes in human behavior over the past decade have led to tremendous growth in the amount of data generated. Researchers have proposed various definitions of big data and discussed its different characteristics. In the present study, we emphasize the less-studied areas of big data, one of which is big data preprocessing. Extracting valuable information from big data broadly involves the following phases: acquisition and storage, data preprocessing, data mining, and, finally, analysis of the data. The contribution of this paper is to show that generating valuable information from big data depends less on choosing an advanced or novel algorithm than on acquiring relevant data and on the preprocessing phase. The preprocessing phase plays a significant role in generating valuable data, which serves as important input for decision-making. Finally, this paper gives a brief survey and analysis of big data preprocessing techniques used to handle imperfect data, data size reduction, and imbalanced data. It also discusses, from a theoretical standpoint, the problems associated with the various phases and suggests future directions for researchers.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kotiyal, B., Pathak, H. (2022). Big Data Preprocessing Phase in Engendering Quality Data. In: Tomar, A., Malik, H., Kumar, P., Iqbal, A. (eds) Machine Learning, Advances in Computing, Renewable Energy and Communication. Lecture Notes in Electrical Engineering, vol 768. Springer, Singapore. https://doi.org/10.1007/978-981-16-2354-7_7
DOI: https://doi.org/10.1007/978-981-16-2354-7_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2353-0
Online ISBN: 978-981-16-2354-7
eBook Packages: Energy, Energy (R0)