Big Data Preprocessing Phase in Engendering Quality Data

  • Conference paper
Machine Learning, Advances in Computing, Renewable Energy and Communication

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 768)

Abstract

Changes in human behavior over the past decade have led to tremendous growth in the amount of data generated. Researchers have proposed various definitions of big data and discussed its different characteristics. In the present study, we emphasize the less-studied areas of big data, one of which is big data preprocessing. Extracting valuable information from big data broadly involves three phases, namely acquisition and storage, data preprocessing, and data mining, followed by analysis of the data. The contribution of this paper is to show that generating valuable information from big data depends less on choosing an advanced or novel algorithm than on acquiring relevant data and on the preprocessing phase. The preprocessing phase plays a significant role in generating valuable data, which in turn serves as a strong input to decision-making. Finally, this paper gives a brief survey and analysis of big data preprocessing techniques used to handle imperfect data, reduce data size, and deal with imbalanced data. It also discusses, from a theoretical standpoint, the problems associated with the various phases and suggests future directions for researchers.
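The three preprocessing concerns named in the abstract (imperfect data, data-size reduction, and imbalanced data) can be illustrated with a minimal sketch. This is not the authors' method: mean imputation, low-variance feature removal, and random oversampling are simple stand-ins chosen for illustration, and the function names and toy data below are hypothetical.

```python
import random
from statistics import mean, pvariance


def impute_missing(rows):
    """Imperfect data: replace None with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def drop_low_variance(rows, threshold=0.0):
    """Data reduction: keep only columns whose population variance exceeds the threshold."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if pvariance([r[j] for r in rows]) > threshold]
    return [[r[j] for j in keep] for r in rows], keep


def random_oversample(rows, labels, seed=0):
    """Imbalanced data: duplicate random minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: column 1 is constant, one value is missing, class 1 is the minority.
X = [[1.0, 5.0, 2.0], [2.0, 5.0, None], [3.0, 5.0, 4.0], [4.0, 5.0, 6.0]]
y = [0, 0, 0, 1]

X_imputed = impute_missing(X)                              # fills the missing entry
X_reduced, kept_columns = drop_low_variance(X_imputed)     # drops the constant column
X_balanced, y_balanced = random_oversample(X_reduced, y)   # equalizes class counts
```

On data at big data scale, each step would normally run on a distributed framework rather than on in-memory lists; the sketch only shows the order and intent of the steps.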

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kotiyal, B., Pathak, H. (2022). Big Data Preprocessing Phase in Engendering Quality Data. In: Tomar, A., Malik, H., Kumar, P., Iqbal, A. (eds) Machine Learning, Advances in Computing, Renewable Energy and Communication. Lecture Notes in Electrical Engineering, vol 768. Springer, Singapore. https://doi.org/10.1007/978-981-16-2354-7_7

  • DOI: https://doi.org/10.1007/978-981-16-2354-7_7

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-2353-0

  • Online ISBN: 978-981-16-2354-7

  • eBook Packages: Energy, Energy (R0)
