Parallel Fuzzy C-Means Clustering Based Big Data Anonymization Using Hadoop MapReduce

Wireless Personal Communications

Abstract

The amount of data on the internet is steadily growing due to recent technological advancements in cyber-physical-social systems, sensor networks, and communication technologies. Many information scientists, policymakers, and decision-makers are attempting to explore this vast amount of data to support critical decisions and planned business moves. The growing volume of big data also increases privacy risks and data breaches. Proper data management is essential for all organizations that handle sensitive information and large volumes of data. Data anonymization is a promising method for protecting individual privacy, but it can result in significant information loss. Recently, data anonymization based on data mining techniques has shown significant improvement in data utility. However, when applied to big data, clustering-based anonymization has serious scalability issues, and cluster formation on large data sets is time-consuming. This paper proposes the Parallel Fuzzy C-Means Clustering based Anonymization Algorithm (FCMCAA), which uses the Hadoop MapReduce framework to ensure the privacy of large volumes of structured data. The results demonstrate that the algorithm performs better in terms of F-measure and classification accuracy, reaching 91% accuracy. It is also scalable and able to handle huge volumes of structured data while maintaining a high level of privacy with minimum information loss.
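The abstract does not spell out how FCMCAA is implemented, so the following is only a minimal sketch of how one iteration of standard fuzzy c-means can be split into a map step and a reduce step. It is not the authors' algorithm and does not use Hadoop: the function names (`memberships`, `map_partition`, `combine`, `fcm_iteration`), the fuzzifier default `m=2.0`, and the use of in-memory lists as stand-ins for data splits are all assumptions made for illustration.

```python
from functools import reduce

def memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in each cluster (sums to 1)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
             for center in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exp for dk in dists) for dj in dists]

def map_partition(partition, centers, m=2.0):
    """Map step (assumed layout): partial weighted sums for one data split."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]   # sum of u^m * x per cluster
    den = [0.0] * len(centers)             # sum of u^m per cluster
    for point in partition:
        for j, u in enumerate(memberships(point, centers, m)):
            w = u ** m
            den[j] += w
            for d in range(dim):
                num[j][d] += w * point[d]
    return num, den

def combine(a, b):
    """Reduce step: add two partial results element-wise."""
    num = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [x + y for x, y in zip(a[1], b[1])]
    return num, den

def fcm_iteration(partitions, centers, m=2.0):
    """One center update computed from independently processed splits."""
    num, den = reduce(combine, (map_partition(p, centers, m) for p in partitions))
    # Keep the old center if a cluster received no weight at all.
    return [[v / d for v in row] if d > 0 else list(old)
            for row, d, old in zip(num, den, centers)]

if __name__ == "__main__":
    splits = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 10.0], [10.0, 11.0]]]
    centers = [[1.0, 1.0], [9.0, 9.0]]
    for _ in range(20):
        centers = fcm_iteration(splits, centers)
    print(centers)
```

Because each split only contributes additive partial sums, the splits can be processed independently and merged in any order, which is the property that makes this kind of clustering a natural fit for a MapReduce-style framework. An anonymization step built on the resulting clusters would be a separate stage and is not shown here.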

Data Availability

The data that support the findings of this study are openly available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

Funding

No funding was received for this work.

Author information

Contributions

All the authors contributed equally to this work.

Corresponding author

Correspondence to Josephine Usha Lawrance.

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the author(s).

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lawrance, J.U., Jesudhasan, J.V.N. & Thampiraj Rittammal, J.B. Parallel Fuzzy C-Means Clustering Based Big Data Anonymization Using Hadoop MapReduce. Wireless Pers Commun (2024). https://doi.org/10.1007/s11277-024-11101-7

  • DOI: https://doi.org/10.1007/s11277-024-11101-7
