Big Data Preprocessing for Modern World: Opportunities and Challenges

  • Andrea Prakash
  • Narem Navya
  • Jayapandian Natarajan
Conference paper
Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 26)


Big data is an often misunderstood business term in the modern world. With multiple devices connected to the internet and a democratization of available technologies, data is being generated at an almost exponential rate. This data is generated in large quantities, at high speed, and belongs to myriad categories. Coupled with advances in storage and processing hardware, it is now possible to derive insights from these larger volumes of data effectively. Algorithms and models transform the data into understandable and usable insights. The data mining steps, however, require data that has been cleaned and structured to a large extent. This is achieved using various algorithms, processes and applications known collectively as data pre-processing techniques. This article reviews the various data pre-processing techniques from a big data point of view.


Keywords: Big data, Preprocessing, Data analytics, Data cleaning, Distributed computing, Hadoop file system, Noisy data



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Andrea Prakash (1), corresponding author
  • Narem Navya (1)
  • Jayapandian Natarajan (1)
  1. Department of Computer Science and Engineering, CHRIST (Deemed to Be University), Bangalore, India
