Abstract
Differentially private synthetic data generation is one of the most effective methods for privacy-preserving data release. However, existing schemes still suffer from high computational complexity and cannot directly handle attributes with large domain sizes when synthesizing high-dimensional data. To close this gap, we propose synthetic data generation for differential privacy using maximum weight matching (DPMWM), a method for automatically synthesizing high-dimensional tabular data with large domain sizes under differential privacy. Specifically, DPMWM uses differentially private maximum weight matching to select low-dimensional marginals and then automatically synthesizes multiple records based on the selected marginals. Experimental results show that DPMWM outperforms the state of the art in accuracy for counting queries and classification tasks on datasets with larger domain sizes.
Notes
- 1. When \(m_i\) is a 1-way marginal, \({\text {dom}}\left( m_i\right) \) is the domain size of a single attribute. When \(m_i\) is a 2-way marginal, such as \(m_i=(V_1,V_2)\), \({\text {dom}}\left( m_i\right) ={\text {dom}}\left( V_1\right) \cdot {\text {dom}}\left( V_2\right) \).
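As a minimal sketch of the note above, the domain size of a marginal can be computed as the product of the domain sizes of its attributes. The function name `marginal_domain_size`, the attribute labels, and the domain sizes 4 and 7 are illustrative assumptions and do not come from the paper.

```python
from math import prod


def marginal_domain_size(marginal, attr_domain_sizes):
    """Return dom(m): the product of the domain sizes of the attributes in m."""
    return prod(attr_domain_sizes[attr] for attr in marginal)


# Hypothetical domain sizes for two attributes V1 and V2.
attr_domain_sizes = {"V1": 4, "V2": 7}

one_way = marginal_domain_size(("V1",), attr_domain_sizes)       # dom(V1)
two_way = marginal_domain_size(("V1", "V2"), attr_domain_sizes)  # dom(V1) * dom(V2)
print(one_way, two_way)
```

For a 1-way marginal the product has a single factor, so both cases in the note are covered by the same expression.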
A Proof of Lemma
Proof
Assume that a dataset D contains n records, and consider two attributes a and b. Denote the frequencies of the values of attribute a by \(\left\{ a_1,a_2,\ldots \right\} \) and those of attribute b by \(\left\{ b_1,b_2,\ldots \right\} \). For the 2-way marginal of (a, b), denote the frequencies of the joint distribution by \(\left\{ c_{11},c_{12},\ldots \right\} \).
The metric w(a, b) is
If we add a user with value \(u\) for \(a\) and \(v\) for \(b\), then
The sensitivity is given by
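The derivation above uses the following identities, which hold because the frequencies of each attribute sum to n, that is, \(\sum _{i} a_i=\sum _{j} b_j=n\): \(\sum _{i \ne u} a_i=n-a_u\), \(\sum _{j \ne v}b_j=n-b_v\) and \(\sum _{i \ne u,j \ne v} a_{i} b_{j}=(n-a_u)(n-b_v)\).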
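The three summation identities used in the proof can be checked numerically. The sketch below treats \(a_i\) and \(b_j\) as value counts over n records; the toy records, the helper name `rest_sums`, and the choice of \(u\) and \(v\) are assumptions made only for illustration.

```python
from collections import Counter


def rest_sums(records, u, v):
    """Return (sum_{i!=u} a_i, sum_{j!=v} b_j, sum_{i!=u, j!=v} a_i * b_j)."""
    a_counts = Counter(r[0] for r in records)  # counts a_i of attribute a
    b_counts = Counter(r[1] for r in records)  # counts b_j of attribute b
    sum_a = sum(c for i, c in a_counts.items() if i != u)
    sum_b = sum(c for j, c in b_counts.items() if j != v)
    sum_ab = sum(
        a_counts[i] * b_counts[j]
        for i in a_counts
        for j in b_counts
        if i != u and j != v
    )
    return sum_a, sum_b, sum_ab


# Hypothetical toy dataset: each record is (value of a, value of b).
records = [(0, 1), (0, 0), (1, 1), (2, 0), (1, 1), (0, 2)]
n = len(records)
u, v = 0, 1
a_u = sum(1 for r in records if r[0] == u)
b_v = sum(1 for r in records if r[1] == v)

sum_a, sum_b, sum_ab = rest_sums(records, u, v)
assert sum_a == n - a_u
assert sum_b == n - b_v
assert sum_ab == (n - a_u) * (n - b_v)
```

The third identity follows from the first two because the double sum over \(i \ne u\) and \(j \ne v\) factorizes into the product of the two single sums.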
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, M., Ye, X., Deng, H. (2024). Synthetic Data Generation for Differential Privacy Using Maximum Weight Matching. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14489. Springer, Singapore. https://doi.org/10.1007/978-981-97-0798-0_8
DOI: https://doi.org/10.1007/978-981-97-0798-0_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0797-3
Online ISBN: 978-981-97-0798-0
eBook Packages: Computer Science, Computer Science (R0)