Anonymizing Big Data Streams Using In-memory Processing: A Novel Model Based on One-time Clustering

Shamsinejad, Elham; Banirostam, Touraj; Pedram, Mir Mohsen; Rahmani, Amir Masoud

doi:10.1007/s11265-024-01920-z

Published: 25 May 2024

Volume 96, pages 333–356 (2024)

Journal of Signal Processing Systems

Abstract

Preserving privacy in big data is a critical challenge for data mining and data analysis. Existing methods that anonymize big data streams with k-anonymity algorithms can suffer from high data loss, low data quality, and identity disclosure. In this paper, we propose a novel model for anonymizing big data streams using in-memory processing. The model uses the Spark framework to parallelize the anonymization process and a one-time clustering algorithm that avoids multiple iterations and allocates the data to optimal clusters. We evaluate the performance and effectiveness of the model on a real-world dataset and compare it with three popular k-anonymity algorithms: CRUE, Mean-Shift, and DBSCAN. The results show that the model has the lowest data loss and the highest data quality across different data sizes and k-values. The model is scalable, robust, adaptable, and flexible, and it can provide better data for data mining and data analysis while protecting data privacy and preventing data disclosure.
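
To make the idea of clustering records once and then generalizing them concrete, the following is a minimal sketch in plain Python. It is not the authors' algorithm and does not use Spark: it assumes a simple stand-in technique (sort the records, cut them into groups of at least k in a single pass, and replace each attribute by its range within the group). The record layout (age, hours per week), the function names, and the loss measure are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical quasi-identifiers for illustration: (age, hours_per_week)
Record = Tuple[int, int]


def one_pass_groups(records: List[Record], k: int) -> List[List[Record]]:
    """Assign every record to exactly one group of size >= k in a single pass."""
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    ordered = sorted(records)  # similar records end up adjacent
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(groups[-1]) < k:
        # Fold an undersized tail into the previous group so k-anonymity holds.
        tail = groups.pop()
        groups[-1].extend(tail)
    return groups


def generalize(group: List[Record]) -> Tuple[Tuple[int, int], ...]:
    """Replace each attribute by the (min, max) range observed in the group."""
    columns = list(zip(*group))
    return tuple((min(c), max(c)) for c in columns)


def information_loss(groups: List[List[Record]], records: List[Record]) -> float:
    """Size-weighted average of normalized range widths (0 means no loss)."""
    columns = list(zip(*records))
    # Global span per attribute; fall back to 1 to avoid division by zero.
    spans = [(max(c) - min(c)) or 1 for c in columns]
    total = 0.0
    for group in groups:
        ranges = generalize(group)
        width = sum((hi - lo) / span for (lo, hi), span in zip(ranges, spans))
        total += (width / len(spans)) * len(group)
    return total / len(records)
```

Calling `one_pass_groups(records, k)` and then `generalize` on each group yields one generalized tuple per group, and `information_loss` gives a rough value between 0 and 1 for comparing different values of k. A real streaming deployment would apply such a step per micro-batch or window, which this sketch leaves out.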

Data Availability

The data that were generated and analyzed during the current study are available in the UCI Machine Learning Repository. The data are licensed under the CC BY-SA 4.0 license, and can be freely downloaded and used for research purposes: https://archive.ics.uci.edu/ml/datasets/adult.

Acknowledgements

The authors thank the anonymous reviewers and the editor for their useful comments and suggestions.

Funding

This research received no external funding.

Author information

Authors and Affiliations

Department of Computer Engineering, Central Tehran Branch, Islamic Azad University, Tehran, Iran
Elham Shamsinejad & Touraj Banirostam
Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi University, Tehran, Iran
Mir Mohsen Pedram
Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Amir Masoud Rahmani

Authors

Elham Shamsinejad
Touraj Banirostam
Mir Mohsen Pedram
Amir Masoud Rahmani

Contributions

E.SH., T.B., and M.M.P. conceived and designed the study, collected and analyzed the data, and drafted the manuscript. A.M.R. critically revised the manuscript and gave final approval of the version to be published. All authors reviewed the manuscript.

Corresponding author

Correspondence to Touraj Banirostam.

Ethics declarations

Ethical Approval

This study did not involve human or animal subjects, and therefore did not require ethical approval.

Consent to Participate

Not applicable.

Consent to Publish

Not applicable.

Competing Interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shamsinejad, E., Banirostam, T., Pedram, M.M. et al. Anonymizing Big Data Streams Using In-memory Processing: A Novel Model Based on One-time Clustering. J Sign Process Syst 96, 333–356 (2024). https://doi.org/10.1007/s11265-024-01920-z

Received: 13 February 2024
Revised: 30 April 2024
Accepted: 14 May 2024
Published: 25 May 2024
Issue Date: July 2024
DOI: https://doi.org/10.1007/s11265-024-01920-z
