Abstract
The amount of data on the internet is growing steadily due to recent advances in cyber-physical-social systems, sensor networks, and communication technologies. Many information scientists, policymakers, and decision-makers are attempting to explore this vast amount of data to support critical decisions and planned business moves. The growing volume of big data also increases the risk of privacy violations and data breaches. Proper data management is essential for all organizations that handle sensitive information and large volumes of data. Data anonymization is a promising method for protecting individual privacy, but it often results in significant information loss. Recently, data anonymization based on data mining techniques has shown significant improvement in data utility. However, when applied to big data, clustering-based anonymization has serious scalability issues, and cluster formation on large data sets is time-consuming. This paper proposes the Parallel Fuzzy C-Means Clustering based Anonymization Algorithm (FCMCAA), built on the Hadoop MapReduce framework, to ensure the privacy of large volumes of structured data. The results show that the algorithm performs better in terms of F-measure and classification accuracy, reaching 91% accuracy. It is also scalable and able to handle huge volumes of structured data while maintaining a high level of privacy with minimal information loss.
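To make the idea of parallel fuzzy c-means clustering followed by anonymization more concrete, the following is a minimal single-machine sketch. It is not the authors' FCMCAA implementation: it uses the standard fuzzy c-means membership and center update rules, imitates the map and reduce phases with plain Python functions instead of Hadoop, and assumes a simple anonymization step in which each record's numeric quasi-identifiers are replaced by the center of its highest-membership cluster. All function names, parameters, and the toy data are hypothetical.

```python
# Illustrative sketch only: standard fuzzy c-means (FCM) arranged in a
# map/reduce style. Names and the anonymization rule are assumptions,
# not taken from the paper.

def memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in every cluster (sums to 1)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
             for center in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

def map_partition(points, centers, m=2.0):
    """'Map' phase: partial weighted sums for one data split."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]   # sum of u^m * x per cluster
    den = [0.0] * len(centers)             # sum of u^m per cluster
    for x in points:
        for i, u_i in enumerate(memberships(x, centers, m)):
            w = u_i ** m
            den[i] += w
            for k in range(dim):
                num[i][k] += w * x[k]
    return num, den

def reduce_partials(partials):
    """'Reduce' phase: merge the partial sums and emit new centers.
    Assumes every cluster receives some membership mass."""
    new_centers = []
    for i in range(len(partials[0][1])):
        total = sum(den[i] for _, den in partials)
        dim = len(partials[0][0][i])
        new_centers.append(
            [sum(num[i][k] for num, _ in partials) / total for k in range(dim)])
    return new_centers

def fcm(splits, centers, m=2.0, iters=50, tol=1e-6):
    """Iterate map and reduce until the centers stop moving."""
    for _ in range(iters):
        new = reduce_partials([map_partition(s, centers, m) for s in splits])
        shift = max(abs(a - b) for old, cur in zip(centers, new)
                    for a, b in zip(old, cur))
        centers = new
        if shift < tol:
            break
    return centers

def anonymize(points, centers, m=2.0):
    """Assumed anonymization step: replace each record with the center of
    the cluster in which it has the highest membership."""
    out = []
    for x in points:
        u = memberships(x, centers, m)
        best = max(range(len(centers)), key=lambda i: u[i])
        out.append(list(centers[best]))
    return out

# Toy data: two splits stand in for two HDFS input splits.
splits = [[[1.0, 1.0], [1.2, 0.8]], [[9.0, 9.0], [8.8, 9.2]]]
centers = fcm(splits, [[0.0, 0.0], [10.0, 10.0]])
anonymized = anonymize([x for s in splits for x in s], centers)
```

In this arrangement only the per-cluster partial sums travel from the map phase to the reduce phase, so the amount of data exchanged per iteration depends on the number of clusters and attributes rather than on the number of records, which is the usual motivation for expressing fuzzy c-means in MapReduce.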
Data Availability
The data that support the findings of this study are openly available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.
Funding
No funding was received for this work.
Author information
Contributions
All the authors contributed equally to this work.
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the author(s).
About this article
Cite this article
Lawrance, J.U., Jesudhasan, J.V.N. & Thampiraj Rittammal, J.B. Parallel Fuzzy C-Means Clustering Based Big Data Anonymization Using Hadoop MapReduce. Wireless Pers Commun (2024). https://doi.org/10.1007/s11277-024-11101-7
DOI: https://doi.org/10.1007/s11277-024-11101-7