
Anonymizing Big Data Streams Using In-memory Processing: A Novel Model Based on One-time Clustering

Journal of Signal Processing Systems

Abstract

Big data privacy preservation is a critical challenge for data mining and data analysis. Existing methods for anonymizing big data streams using k-anonymity algorithms may cause high data loss, low data quality, and identity disclosure. In this paper, we propose a novel model for anonymizing big data streams using in-memory processing. The model uses the Spark framework to parallelize the anonymization process and a one-time clustering algorithm to avoid multiple iterations and allocate the data to optimal clusters. We evaluate the performance and effectiveness of the model using a real-world dataset and compare it with three popular k-anonymity algorithms: CRUE, Mean-Shift, and DBSCAN. The results show that the proposed model has the lowest data loss and the highest data quality for different data sizes and k-values. The model is scalable, robust, adaptable, and flexible. It can provide better data for data mining and data analysis while protecting data privacy and preventing data disclosure.
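
As a rough illustration of the k-anonymity idea the abstract refers to, the following minimal Python sketch groups records in a single pass and then generalizes each value to its group's range. It is not the authors' algorithm and does not use Spark: the function names (one_pass_clusters, generalize, is_k_anonymous, range_loss), the sort-and-cut grouping rule, the range-width loss measure, and the sample ages are all assumptions made for this example.

```python
from collections import Counter


def one_pass_clusters(records, k):
    """Sort once, then cut into consecutive groups of k records (no iteration).

    Hypothetical grouping rule for illustration only.
    """
    ordered = sorted(records)
    n_groups = max(len(ordered) // k, 1)
    groups = [ordered[i * k:(i + 1) * k] for i in range(n_groups)]
    # Leftover records (fewer than k) join the last group.
    groups[-1].extend(ordered[n_groups * k:])
    return groups


def generalize(groups):
    """Replace every value by the (min, max) range of its group."""
    return [(min(g), max(g)) for g in groups for _ in g]


def is_k_anonymous(released, k):
    """True if every released value is shared by at least k records."""
    return all(count >= k for count in Counter(released).values())


def range_loss(released, domain_min, domain_max):
    """Assumed loss measure: average range width relative to the full domain."""
    width = domain_max - domain_min
    return sum(hi - lo for lo, hi in released) / (len(released) * width)


ages = [23, 25, 31, 37, 38, 44, 52, 59]  # made-up sample values
released = generalize(one_pass_clusters(ages, k=3))
print(released)
print(is_k_anonymous(released, k=3))
print(range_loss(released, min(ages), max(ages)))
```

In this sketch a larger k yields wider ranges and therefore a higher value of the assumed loss measure, which is the trade-off between privacy and data quality that the abstract describes. If fewer than k records are supplied, the single resulting group is undersized and is_k_anonymous returns False.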

Data Availability

The data that were generated and analyzed during the current study are available in the UCI Machine Learning Repository. The data are licensed under the CC BY-SA 4.0 license, and can be freely downloaded and used for research purposes: https://archive.ics.uci.edu/ml/datasets/adult.

Acknowledgements

The authors thank the anonymous reviewers and the editor for their useful comments and suggestions.

Funding

This research received no external funding.

Author information

Contributions

E.SH., T.B., and M.M.P. conceived and designed the study, collected and analyzed the data, and drafted the manuscript. A.M.R. critically revised the manuscript and gave final approval of the version to be published. All authors reviewed the manuscript.

Corresponding author

Correspondence to Touraj Banirostam.

Ethics declarations

Ethical Approval

This study did not involve human or animal subjects, and therefore did not require ethical approval.

Consent to Participate

Not applicable.

Consent to Publish

Not applicable.

Competing Interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shamsinejad, E., Banirostam, T., Pedram, M.M. et al. Anonymizing Big Data Streams Using In-memory Processing: A Novel Model Based on One-time Clustering. J Sign Process Syst 96, 333–356 (2024). https://doi.org/10.1007/s11265-024-01920-z

  • DOI: https://doi.org/10.1007/s11265-024-01920-z
