Differential identifiability clustering algorithms for big data analysis

Shang, Tao; Zhao, Zheng; Ren, Xujie; Liu, Jianwei

doi:10.1007/s11432-020-2910-1

Differential identifiability clustering algorithms for big data analysis

Research Paper
Published: 31 March 2021

Volume 64, article number 152101, (2021)
Cite this article

Science China Information Sciences Aims and scope Submit manuscript

Tao Shang¹,
Zheng Zhao²,
Xujie Ren¹ &
…
Jianwei Liu¹

115 Accesses
6 Citations
Explore all metrics

Abstract

Individual privacy preservation has become an important issue with the development of big data technology. The definition of ρ-differential identifiability (DI) precisely matches the legal definitions of privacy, which can provide an easy parameterization approach for practitioners so that they can set privacy parameters based on the privacy concept of individual identifiability. However, differential identifiability is currently only applied to some simple queries and achieved by Laplace mechanism, which cannot satisfy complex privacy preservation issues in big data analysis. In this paper, we propose a new exponential mechanism and composition properties of differential identifiability, and then apply differential identifiability to k-means and k-prototypes algorithms on MapReduce framework. DI k-means algorithm uses the usual Laplace mechanism and composition properties for numerical databases, while DI k-prototypes algorithm uses the new exponential mechanism and composition properties for mixed databases. The experimental results show that both DI k-means and DI k-prototypes algorithms satisfy differential identifiability.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Outlier-eliminated k-means clustering algorithm based on differential privacy preservation

Article 11 July 2016

Differential Privacy Protection Algorithm for Data Clustering Center

A High-Availability K-modes Clustering Method Based on Differential Privacy

References

Mendes R, Vilela J P. Privacy-preserving data mining: methods, metrics, and applications. IEEE Access, 2017, 5: 10562–10582
Article Google Scholar
Yu S. Big privacy: challenges and opportunities of privacy study in the age of big data. IEEE Access, 2016, 4: 2751–2763
Article Google Scholar
Dinur I, Nissim K. Revealing information while preserving privacy. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, San Diego, 2003. 202–210
Dwork C, Nissim K. Privacy-preserving datamining on vertically partitioned databases. In: Advances in Cryptology - CRYPTO 2004. Berlin: Springer, 2004. 528–544
Blum A, Dwork C, McSherry F, et al. Practical privacy: the SuLQ framework. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Baltimore, 2005. 128–138
Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. In: Theory of Cryptography. Berlin: Springer, 2006. 265–284
Dwork C. Differential privacy. In: Automata, Languages, and Programming. Berlin: Springer, 2006. 1–12
Google Scholar
Kifer D, Machanavajjhala A. No free lunch in data privacy. In: Proceedings of ACM SIGMOD International Conference on Management of Data, Athens, 2011. 193–204
Cormode J. Personal privacy vs population privacy: learning to attack anonymization. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, 2011. 1253–1261
Lee J, Clifton C. Differential identifiability. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, 2012. 1041–1049
Dwork C. A firm foundation for private data analysis. Commun ACM, 2011, 54: 86–95
Article Google Scholar
Li L, Dong Y, Wang J. Differential privacy data protection method based on clustering. In: Proceedings of Cyber-Enabled Distributed Computing and Knowledge Discovery, Nanjing, 2017. 11–16
Zhao Z, Shang T, Liu J W, et al. Clustering algorithm for privacy preservation on MapReduce. In: Proceedings of the 4th International Conference on Cloud Computing and Security, Haikou, 2018. 622–632
Bugata P, Drotar P. On some aspects of minimum redundancy maximum relevance feature selection. Sci China Inf Sci, 2020, 63: 112103
Article MathSciNet Google Scholar
Samarati P, Sweeney L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proceedings of IEEE Symposium on Security and Privacy, Oakland, 1998. 1–19
Samarati P. Protecting respondents identities in microdata release. IEEE Trans Knowl Data Eng, 2001, 13: 1010–1027
Article Google Scholar
Sweeney L. k-anonymity: a model for protecting privacy. Int J Unc Fuzz Knowl Based Syst, 2002, 10: 557–570
Article MathSciNet Google Scholar
Machanavajjhala A, Gehrke J, Kifer D, et al. l-diversity: privacy beyond k-anonymity. In: Proceedings of International Conference on Data Engineering, Atlanta, 2006. 24–35
Li N, Li T, Venkatasubramanian S. t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of International Conference on Data Engineering, Istanbul, 2007. 106–115
Li N H, Li T C, Venkatasubramanian S. Closeness: a new privacy measure for data publishing. IEEE Trans Knowl Data Eng, 2010, 22: 943–956
Article Google Scholar
Li N H, Qardaji W, Su D, et al. Membership privacy: a unifying framework for privacy definitions. In: Proceedings of ACM SIGSAC Conference on Computer and Communications Security, New York, 2013. 889–900
Backes M, Berrang P, Humbert M, et al. Membership privacy in microRNA-based studies. In: Proceedings of ACM SIGSAC Conference on Computer and Communication, Vienna, 2016. 319–330
Aggarwal G, Feder T, Kenthapadi K, et al. Achieving anonymity via clustering. In: Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Chicago, 2006. 153–162
Byun J W, Kamra A, Bertino E, et al. Efficient k-anonymization using clustering techniques. In: Proceedings of International Conference on Database Systems for Advanced Applications, Bangkok, 2007. 188–200
Nissim K, Raskhodnikova S, Smith A. Smooth sensitivity and sampling in private data analysis. In: Proceedings of ACM Symposium on Theory of Computing, San Diego, 2007. 75–84
Jafer Y, Matwin S, Sokolova M. Using feature selection to improve the utility of differentially private data publishing. Procedia Comput Sci, 2014, 37: 511–516
Article Google Scholar
Huang Z. Clustering large data sets with mixed numeric and categorical values. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 1997. 21–34
Huang Y. Understanding of Big Data: Big Data Processing and Programming Practices. Beijing: Machinery Industry Press, 2014
Google Scholar
McSherry F, Talwar K. Mechanism design via differential privacy. In: Proceedings of IEEE Annual Symposium on Foundations of Computer Science, Providence, 2007. 94–103
Dwork C, Rothblum G N, Vadhan S. Boosting and differential privacy. In: Proceedings of IEEE Annual Symposium on Foundations of Computer Science, Las Vegas, 2010. 51–60
Bkakria A, Cuppens-Boulahia N, Cuppens F. Linking differential identifiability with differential privacy. In: Proceedings of International Conference on Information and Communications Security, Cham, 2018. 232–247
McSherry F. Privacy integrated queries:an extensible platform for privacy-preserving data analysis. Commun ACM, 2010, 53: 89–97
Article Google Scholar
Xiong J, Ren J, Chen L, et al. Enhancing privacy and availability for data clustering in intelligent electrical service of IoT. IEEE Internet Things J, 2019, 6: 1530–1540
Article Google Scholar
Zhang Y, Liu N, Wang S. A differential privacy protecting k-means clustering algorithm based on contour coefficients. Plos One, 2018, 13: e0206832
Article Google Scholar
Gao Z, Sun Y, Cui X, et al. Privacy-preserving hybrid k-means. Int J Data Warehous Min, 2018, 14: 1–17
Article Google Scholar
Ren J, Xiong J, Cui X, et al. DPLk-means: a novel differential privacy k-means mechanism. In: Proceedings of IEEE International Conference on Data Science in Cyberspace, Shenzhen, 2017. 133–139
Su D, Cao J, Li N, et al. Differentially private k-means clustering and a hybrid approach to private optimization. ACM Trans Priv Secur, 2017, 20: 1–33
Article Google Scholar

Download references

Acknowledgements

This work was supported by National Key Research and Development Program of China (Grant No. 2016YFC1000307) and National Natural Science Foundation of China (Grant Nos. 61971021, 61571024).

Author information

Authors and Affiliations

School of Cyber Science and Technology, Beihang University, Beijing, 100083, China
Tao Shang, Xujie Ren & Jianwei Liu
School of Electronic and Information Engineering, Beihang University, Beijing, 100083, China
Zheng Zhao

Authors

Tao Shang
View author publications
You can also search for this author in PubMed Google Scholar
Zheng Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Xujie Ren
View author publications
You can also search for this author in PubMed Google Scholar
Jianwei Liu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Tao Shang.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Shang, T., Zhao, Z., Ren, X. et al. Differential identifiability clustering algorithms for big data analysis. Sci. China Inf. Sci. 64, 152101 (2021). https://doi.org/10.1007/s11432-020-2910-1

Download citation

Received: 18 January 2020
Accepted: 18 April 2020
Published: 31 March 2021
DOI: https://doi.org/10.1007/s11432-020-2910-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Differential identifiability clustering algorithms for big data analysis

Abstract

Access this article

Similar content being viewed by others

Outlier-eliminated k-means clustering algorithm based on differential privacy preservation

Differential Privacy Protection Algorithm for Data Clustering Center

A High-Availability K-modes Clustering Method Based on Differential Privacy

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Differential identifiability clustering algorithms for big data analysis

Abstract

Access this article

Similar content being viewed by others

Outlier-eliminated k-means clustering algorithm based on differential privacy preservation

Differential Privacy Protection Algorithm for Data Clustering Center

A High-Availability K-modes Clustering Method Based on Differential Privacy

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation