A case study for performance analysis of big data stream classification using spark architecture

Srivani, B.; Sandhya, N.; Rani, B. Padmaja

doi:10.1007/s13198-022-01703-4

Original Article
Published: 02 July 2022

Volume 15, pages 253–266, (2024)

International Journal of System Assurance Engineering and Management

B. Srivani¹,
N. Sandhya² &
B. Padmaja Rani³

Abstract

A wide variety of data is being produced at very high speed across different sectors. Owing to the widespread deployment of computing devices, the volume of information has grown rapidly in recent decades. A key benefit of big data is that a large data set enables machine learning techniques to obtain more accurate results. As the amount of data explodes, it raises both challenges and opportunities for data analytics research in the data mining domain. Massively parallel databases provide not only storage mechanisms but also compute platforms, and their extra capacity can be used to run algorithms close to the data and to move the data in-memory to solve problems. However, big data streams have distinctive characteristics, such as high dimensionality, sparsity, volume and velocity. These characteristics pose serious problems for classification when traditional data stream classification methods are employed. For huge collections of data, effectively selecting the features and then classifying the data is important for discovering patterns. Recent feature selection strategies use optimization methods to pick a subset of important features that yields good classification results. Therefore, in this case study, feature selection is performed using Dragonfly Moth Search (DMS) optimization. Classification is carried out in two phases, an offline phase and an online phase, using master and slave nodes with a stacked autoencoder (SAE) in the Spark architecture. The performance of the DMS-SAE method is evaluated using accuracy, sensitivity and specificity.
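The three evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes a binary classification setting and uses the standard textbook definitions of accuracy, sensitivity (true positive rate) and specificity (true negative rate); the function names, the example labels and the choice of positive class are hypothetical and are not taken from the article, which may compute these metrics differently (for example, averaged per class).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ConfusionCounts:
    """Binary confusion-matrix counts (names are illustrative)."""
    tp: int  # positives predicted as positive
    fp: int  # negatives predicted as positive
    tn: int  # negatives predicted as negative
    fn: int  # positives predicted as negative


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> ConfusionCounts:
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return ConfusionCounts(tp=tp, fp=fp, tn=tn, fn=fn)


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all samples classified correctly."""
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of actual positives that were detected (recall)."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of actual negatives that were correctly rejected."""
    return _ratio(c.tn, c.tn + c.fp)


# Hypothetical labels, for illustration only (not data from the article).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy is useful when classes are imbalanced, since accuracy alone can look high even if one class is rarely predicted correctly.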


Author information

Authors and Affiliations

Research Scholar, CSE Department, JNTUH, Hyderabad, India
B. Srivani
CSE Department, VNRVJIET, Hyderabad, India
N. Sandhya
CSE Department, JNTUCEH, Hyderabad, India
B. Padmaja Rani

Corresponding author

Correspondence to B. Srivani.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report. We certify that the submission is original work and is not under review at any other publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Srivani, B., Sandhya, N. & Rani, B.P. A case study for performance analysis of big data stream classification using spark architecture. Int J Syst Assur Eng Manag 15, 253–266 (2024). https://doi.org/10.1007/s13198-022-01703-4

Received: 02 November 2021
Revised: 06 May 2022
Accepted: 04 June 2022
Published: 02 July 2022
Issue Date: January 2024
DOI: https://doi.org/10.1007/s13198-022-01703-4
