Parallel Fuzzy C-Means Clustering Based Big Data Anonymization Using Hadoop MapReduce

Lawrance, Josephine Usha; Jesudhasan, Jesu Vedha Nayahi; Thampiraj Rittammal, Jerald Beno

doi:10.1007/s11277-024-11101-7

Published: 14 May 2024

Wireless Personal Communications

Josephine Usha Lawrance ORCID: orcid.org/0000-0003-3420-0978¹,
Jesu Vedha Nayahi Jesudhasan² &
Jerald Beno Thampiraj Rittammal³

Abstract

The amount of data on the internet is growing steadily due to recent advances in cyber-physical-social systems, sensor networks, and communication technologies. Information scientists, policymakers, and decision-makers are exploring this data to support critical decisions and planned business moves. The growth of big data also increases privacy risks and data breaches, so proper data management is essential for every organization that handles sensitive information and large volumes of data. Data anonymization is a promising method for protecting individual privacy, but it can result in significant information loss. Recently, data anonymization based on data mining techniques has shown significant improvement in data utility. However, when applied to big data, clustering-based anonymization has serious scalability issues, and cluster formation on large data sets is time-consuming. This paper proposes the Parallel Fuzzy C-Means Clustering based Anonymization Algorithm (FCMCAA), which uses the Hadoop MapReduce framework to protect the privacy of large volumes of structured data. The results demonstrate that the algorithm performs better in terms of F-measure and classification accuracy, yielding 91% accuracy. It is also scalable and can handle huge volumes of structured data while maintaining a high level of privacy with minimal information loss.
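
To make the idea in the abstract concrete, the following is a minimal sketch of how a fuzzy c-means iteration can be split into a map step and a reduce step, followed by a simple centroid-replacement anonymization. It is not the authors' FCMCAA: the function names (`memberships`, `map_split`, `combine`, `fcm`, `anonymize`), the fuzzifier default `m=2.0`, the stopping rule, and the choice to replace each record with its highest-membership cluster center are all assumptions made for illustration. The sketch runs in a single process in plain Python rather than on Hadoop, and it does not enforce a minimum cluster size, so by itself it gives no k-anonymity guarantee.

```python
from functools import reduce

def memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
             for center in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

def map_split(split, centers, m=2.0):
    """Map step: per-cluster partial sums (sum of u^m * x, sum of u^m) for one split."""
    dim = len(centers[0])
    partial = [([0.0] * dim, 0.0) for _ in centers]
    for point in split:
        for k, u_k in enumerate(memberships(point, centers, m)):
            w = u_k ** m
            vec, weight = partial[k]
            partial[k] = ([v + w * x for v, x in zip(vec, point)], weight + w)
    return partial

def combine(a, b):
    """Reduce step: add the partial sums of two splits, cluster by cluster."""
    return [([x + y for x, y in zip(va, vb)], wa + wb)
            for (va, wa), (vb, wb) in zip(a, b)]

def fcm(splits, centers, m=2.0, iterations=50, tol=1e-6):
    """Iterate map and reduce until the centers move by less than tol."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        # On a real cluster each split would be handled by a separate mapper.
        partials = [map_split(s, centers, m) for s in splits]
        totals = reduce(combine, partials)
        # Keep the old center if a cluster received no weight at all.
        new_centers = [[v / w for v in vec] if w > 0.0 else old
                       for (vec, w), old in zip(totals, centers)]
        shift = max(abs(a - b) for old, new in zip(centers, new_centers)
                    for a, b in zip(old, new))
        centers = new_centers
        if shift < tol:
            break
    return centers

def anonymize(records, centers, m=2.0):
    """Replace each record with the (rounded) center of its strongest cluster."""
    out = []
    for point in records:
        u = memberships(point, centers, m)
        best = max(range(len(centers)), key=lambda k: u[k])
        out.append(tuple(round(c, 2) for c in centers[best]))
    return out

# Hypothetical toy data: two splits of two-dimensional quasi-identifiers.
splits = [[(1.0, 1.0), (1.2, 0.8)], [(8.0, 8.0), (8.2, 7.8)]]
centers = fcm(splits, [(0.0, 0.0), (10.0, 10.0)])
released = anonymize([p for s in splits for p in s], centers)
```

The key design point is that the center update only needs two sums per cluster, so each split can be processed independently and the partial sums merged afterwards. In this toy run, records that fall in the same cluster are released with identical values, which is the property clustering-based anonymization relies on.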

Data Availability

The data that support the findings of this study are openly available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

Funding

No funding was received for this work.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, SRM Institute of Science and Technology, Tiruchirappalli, Tamil Nadu, India
Josephine Usha Lawrance
Department of Computer Science and Engineering, Anna University Regional Campus–Tirunelveli, Tirunelveli, Tamil Nadu, India
Jesu Vedha Nayahi Jesudhasan
Department of Electronics and Communication Engineering, Arunachala College of Engineering for Women, Manavilai, Tamil Nadu, India
Jerald Beno Thampiraj Rittammal

Contributions

All the authors contributed equally to this work.

Corresponding author

Correspondence to Josephine Usha Lawrance.

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the author(s).

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lawrance, J.U., Jesudhasan, J.V.N. & Thampiraj Rittammal, J.B. Parallel Fuzzy C-Means Clustering Based Big Data Anonymization Using Hadoop MapReduce. Wireless Pers Commun (2024). https://doi.org/10.1007/s11277-024-11101-7

Accepted: 12 April 2024
Published: 14 May 2024
DOI: https://doi.org/10.1007/s11277-024-11101-7
