Skip to main content
Log in

Differential identifiability clustering algorithms for big data analysis

  • Research Paper
  • Published:
Science China Information Sciences Aims and scope Submit manuscript

Abstract

Individual privacy preservation has become an important issue with the development of big data technology. The definition of ρ-differential identifiability (DI) precisely matches the legal definitions of privacy, which can provide an easy parameterization approach for practitioners so that they can set privacy parameters based on the privacy concept of individual identifiability. However, differential identifiability is currently only applied to some simple queries and achieved by Laplace mechanism, which cannot satisfy complex privacy preservation issues in big data analysis. In this paper, we propose a new exponential mechanism and composition properties of differential identifiability, and then apply differential identifiability to k-means and k-prototypes algorithms on MapReduce framework. DI k-means algorithm uses the usual Laplace mechanism and composition properties for numerical databases, while DI k-prototypes algorithm uses the new exponential mechanism and composition properties for mixed databases. The experimental results show that both DI k-means and DI k-prototypes algorithms satisfy differential identifiability.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Mendes R, Vilela J P. Privacy-preserving data mining: methods, metrics, and applications. IEEE Access, 2017, 5: 10562–10582

    Article  Google Scholar 

  2. Yu S. Big privacy: challenges and opportunities of privacy study in the age of big data. IEEE Access, 2016, 4: 2751–2763

    Article  Google Scholar 

  3. Dinur I, Nissim K. Revealing information while preserving privacy. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, San Diego, 2003. 202–210

  4. Dwork C, Nissim K. Privacy-preserving datamining on vertically partitioned databases. In: Advances in Cryptology - CRYPTO 2004. Berlin: Springer, 2004. 528–544

  5. Blum A, Dwork C, McSherry F, et al. Practical privacy: the SuLQ framework. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Baltimore, 2005. 128–138

  6. Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. In: Theory of Cryptography. Berlin: Springer, 2006. 265–284

  7. Dwork C. Differential privacy. In: Automata, Languages, and Programming. Berlin: Springer, 2006. 1–12

    Google Scholar 

  8. Kifer D, Machanavajjhala A. No free lunch in data privacy. In: Proceedings of ACM SIGMOD International Conference on Management of Data, Athens, 2011. 193–204

  9. Cormode J. Personal privacy vs population privacy: learning to attack anonymization. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, 2011. 1253–1261

  10. Lee J, Clifton C. Differential identifiability. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, 2012. 1041–1049

  11. Dwork C. A firm foundation for private data analysis. Commun ACM, 2011, 54: 86–95

    Article  Google Scholar 

  12. Li L, Dong Y, Wang J. Differential privacy data protection method based on clustering. In: Proceedings of Cyber-Enabled Distributed Computing and Knowledge Discovery, Nanjing, 2017. 11–16

  13. Zhao Z, Shang T, Liu J W, et al. Clustering algorithm for privacy preservation on MapReduce. In: Proceedings of the 4th International Conference on Cloud Computing and Security, Haikou, 2018. 622–632

  14. Bugata P, Drotar P. On some aspects of minimum redundancy maximum relevance feature selection. Sci China Inf Sci, 2020, 63: 112103

    Article  MathSciNet  Google Scholar 

  15. Samarati P, Sweeney L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proceedings of IEEE Symposium on Security and Privacy, Oakland, 1998. 1–19

  16. Samarati P. Protecting respondents identities in microdata release. IEEE Trans Knowl Data Eng, 2001, 13: 1010–1027

    Article  Google Scholar 

  17. Sweeney L. k-anonymity: a model for protecting privacy. Int J Unc Fuzz Knowl Based Syst, 2002, 10: 557–570

    Article  MathSciNet  Google Scholar 

  18. Machanavajjhala A, Gehrke J, Kifer D, et al. l-diversity: privacy beyond k-anonymity. In: Proceedings of International Conference on Data Engineering, Atlanta, 2006. 24–35

  19. Li N, Li T, Venkatasubramanian S. t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of International Conference on Data Engineering, Istanbul, 2007. 106–115

  20. Li N H, Li T C, Venkatasubramanian S. Closeness: a new privacy measure for data publishing. IEEE Trans Knowl Data Eng, 2010, 22: 943–956

    Article  Google Scholar 

  21. Li N H, Qardaji W, Su D, et al. Membership privacy: a unifying framework for privacy definitions. In: Proceedings of ACM SIGSAC Conference on Computer and Communications Security, New York, 2013. 889–900

  22. Backes M, Berrang P, Humbert M, et al. Membership privacy in microRNA-based studies. In: Proceedings of ACM SIGSAC Conference on Computer and Communication, Vienna, 2016. 319–330

  23. Aggarwal G, Feder T, Kenthapadi K, et al. Achieving anonymity via clustering. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Chicago, 2006. 153–162

  24. Byun J W, Kamra A, Bertino E, et al. Efficient k-anonymization using clustering techniques. In: Proceedings of International Conference on Database Systems for Advanced Applications, Bangkok, 2007. 188–200

  25. Nissim K, Raskhodnikova S, Smith A. Smooth sensitivity and sampling in private data analysis. In: Proceedings of ACM Symposium on Theory of Computing, San Diego, 2007. 75–84

  26. Jafer Y, Matwin S, Sokolova M. Using feature selection to improve the utility of differentially private data publishing. Procedia Comput Sci, 2014, 37: 511–516

    Article  Google Scholar 

  27. Huang Z. Clustering large data sets with mixed numeric and categorical values. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 1997. 21–34

  28. Huang Y. Understanding of Big Data: Big Data Processing and Programming Practices. Beijing: Machinery Industry Press, 2014

    Google Scholar 

  29. McSherry F, Talwar K. Mechanism design via differential privacy. In: Proceedings of IEEE Annual Symposium on Foundations of Computer Science, Providence, 2007. 94–103

  30. Dwork C, Rothblum G N, Vadhan S. Boosting and differential privacy. In: Proceedings of IEEE Annual Symposium on Foundations of Computer Science, Las Vegas, 2010. 51–60

  31. Bkakria A, Cuppens-Boulahia N, Cuppens F. Linking differential identifiability with differential privacy. In: Proceedings of International Conference on Information and Communications Security, Cham, 2018. 232–247

  32. McSherry F. Privacy integrated queries:an extensible platform for privacy-preserving data analysis. Commun ACM, 2010, 53: 89–97

    Article  Google Scholar 

  33. Xiong J, Ren J, Chen L, et al. Enhancing privacy and availability for data clustering in intelligent electrical service of IoT. IEEE Internet Things J, 2019, 6: 1530–1540

    Article  Google Scholar 

  34. Zhang Y, Liu N, Wang S. A differential privacy protecting k-means clustering algorithm based on contour coefficients. Plos One, 2018, 13: e0206832

    Article  Google Scholar 

  35. Gao Z, Sun Y, Cui X, et al. Privacy-preserving hybrid k-means. Int J Data Warehous Min, 2018, 14: 1–17

    Article  Google Scholar 

  36. Ren J, Xiong J, Cui X, et al. DPLk-means: a novel differential privacy k-means mechanism. In: Proceedings of IEEE International Conference on Data Science in Cyberspace, Shenzhen, 2017. 133–139

  37. Su D, Cao J, Li N, et al. Differentially private k-means clustering and a hybrid approach to private optimization. ACM Trans Priv Secur, 2017, 20: 1–33

    Article  Google Scholar 

Download references

Acknowledgements

This work was supported by National Key Research and Development Program of China (Grant No. 2016YFC1000307) and National Natural Science Foundation of China (Grant Nos. 61971021, 61571024).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Tao Shang.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Shang, T., Zhao, Z., Ren, X. et al. Differential identifiability clustering algorithms for big data analysis. Sci. China Inf. Sci. 64, 152101 (2021). https://doi.org/10.1007/s11432-020-2910-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11432-020-2910-1

Keywords

Navigation