A case study for performance analysis of big data stream classification using spark architecture

  • Original Article
  • International Journal of System Assurance Engineering and Management

Abstract

A wide variety of data is being produced at very high speed across different sectors. With the widespread deployment of computing devices, the volume of information has grown rapidly in recent decades. A key benefit of big data is that large data sets enable machine learning techniques to obtain more accurate results. As the amount of data explodes, it raises both challenges and opportunities for data analytics research in the data mining domain. Massively parallel databases provide not only storage mechanisms but also compute platforms, and their extra capacity can be used to run algorithms and move data in-memory to solve such problems. However, big data streams have distinctive characteristics, such as high dimensionality, sparsity, volume, and velocity. These characteristics pose serious difficulties for classification when traditional data stream classification methods are employed. For very large collections of data, selecting features effectively and then classifying the data is important for discovering patterns. Recent feature selection strategies use optimization methods to pick a subset of important features that yields good classification results. Therefore, in this case study, feature selection is performed using Dragonfly Moth Search (DMS) optimization. Classification is carried out in two phases, offline and online, using master and slave nodes with a stacked autoencoder (SAE) in the Spark architecture. The performance of the DMS-SAE method is evaluated in terms of accuracy, sensitivity, and specificity.
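The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes a binary classification task with labels 0 and 1, where 1 is the positive class, and uses the standard confusion-matrix definitions. The function names and the toy labels are hypothetical and are not taken from the article, which may compute the metrics differently (for example, per class on multi-class data).

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn), assuming binary labels with 1 as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1  # positive sample predicted positive
        elif t == 0 and p == 0:
            tn += 1  # negative sample predicted negative
        elif t == 0 and p == 1:
            fp += 1  # negative sample predicted positive
        else:
            fn += 1  # positive sample predicted negative
    return tp, tn, fp, fn


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity (true positive rate), and specificity (true negative rate)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical toy labels, for illustration only (not data from the article).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = evaluate(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy is useful on imbalanced streams, where a classifier can reach high accuracy while missing most samples of the minority class.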

Author information

Corresponding author

Correspondence to B. Srivani.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report. We certify that the submission is original work and is not under review at any other publication.

About this article

Cite this article

Srivani, B., Sandhya, N. & Rani, B.P. A case study for performance analysis of big data stream classification using spark architecture. Int J Syst Assur Eng Manag 15, 253–266 (2024). https://doi.org/10.1007/s13198-022-01703-4

  • DOI: https://doi.org/10.1007/s13198-022-01703-4
