Abstract
Huge volumes of varied data are being produced at very high speed across many sectors. Owing to the wide deployment of computing devices, the volume of information has grown rapidly in recent decades. A key benefit of big data is that large data sets enable machine learning techniques to obtain more accurate results. As the amount of data explodes, it raises both challenges and opportunities for data analytics research in the data mining domain. Massively parallel databases provide not only storage mechanisms but also compute platforms; their extra capacity makes it possible to run algorithms inside the database and to move the data in-memory to solve problems. However, big data streams have distinctive characteristics, such as high dimensionality, sparsity, volume and velocity. These characteristics pose serious difficulties for classification when traditional data stream classification methods are employed. For huge collections of data, effectively selecting features and then classifying the data is important for finding patterns. Recent feature selection strategies use optimization methods to pick a subset of important features that yields good classification results. Therefore, in this case study, feature selection is performed using Dragonfly Moth Search (DMS) optimization. Classification is carried out in two phases, an offline phase and an online phase, using master and slave nodes with a stacked autoencoder (SAE) in the Spark architecture. The performance of the DMS-SAE method is evaluated in terms of accuracy, sensitivity and specificity.
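For readers unfamiliar with the three evaluation metrics named above, the following is a minimal sketch of how accuracy, sensitivity and specificity are conventionally computed from a binary confusion matrix. It is not the authors' implementation: the names `ConfusionCounts`, `confusion_counts`, `accuracy`, `sensitivity` and `specificity` are hypothetical, and the sketch assumes binary labels with `1` as the positive class.

```python
import math
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # positives predicted as positive
    tn: int  # negatives predicted as negative
    fp: int  # negatives predicted as positive
    fn: int  # positives predicted as negative


def confusion_counts(y_true, y_pred, positive=1):
    """Tally the confusion matrix; assumes binary labels of equal length."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return ConfusionCounts(tp=tp, tn=tn, fp=fp, fn=fn)


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a class is absent from the data.
    return numerator / denominator if denominator else math.nan


def accuracy(c):
    """Fraction of all samples classified correctly."""
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


def sensitivity(c):
    """True positive rate: TP / (TP + FN)."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c):
    """True negative rate: TN / (TN + FP)."""
    return _ratio(c.tn, c.tn + c.fp)


if __name__ == "__main__":
    # Illustrative labels only; not data from the study.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
    counts = confusion_counts(y_true, y_pred)
    print(accuracy(counts), sensitivity(counts), specificity(counts))
```

Reporting sensitivity and specificity alongside accuracy is useful because accuracy alone can look high on imbalanced streams even when one class is rarely detected.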
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report. We certify that the submission is original work and is not under review at any other publication.
About this article
Cite this article
Srivani, B., Sandhya, N. & Rani, B.P. A case study for performance analysis of big data stream classification using spark architecture. Int J Syst Assur Eng Manag 15, 253–266 (2024). https://doi.org/10.1007/s13198-022-01703-4
DOI: https://doi.org/10.1007/s13198-022-01703-4