Towards Improving Privacy of Synthetic DataSets

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12703)


Recent growth in domain-specific applications of machine learning can be attributed to the availability of realistic public datasets. Real-world datasets often contain sensitive information about users, which makes them hard to share freely with other stakeholders and researchers because of regulatory and compliance requirements. Synthesising datasets from real data by leveraging generative techniques is gaining popularity. However, the privacy analysis of these datasets is still an open research question. In this work, we address this gap by investigating the privacy issues of generated datasets from both the attacker's and the auditor's point of view. We propose an instance-level Privacy Score (PS) for each synthetic sample, obtained by measuring the memorisation coefficient \(\boldsymbol{\alpha_{m}}\) per sample. Leveraging PS, we empirically show that the accuracy of membership inference attacks on synthetic data drops significantly. PS is a model-agnostic, post-training measure, which not only gives data sharers guidance about the privacy properties of a given sample but also helps third-party data auditors run privacy checks without access to model internals. We tested our method on two real-world datasets and show that PS-based filtering reduces attack accuracy.
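The idea of scoring each synthetic sample after training and filtering on that score can be sketched as follows. This is a minimal illustration only: the abstract does not define \(\boldsymbol{\alpha_{m}}\), so the distance-ratio proxy used here, the function names `privacy_scores` and `filter_by_score`, the holdout set, and the threshold of 0.3 are all assumptions made for the example and are not the authors' method.

```python
import math


def nearest_distance(point, records):
    """Euclidean distance from `point` to its closest record."""
    return min(math.dist(point, r) for r in records)


def privacy_scores(synthetic, train, holdout):
    """Hypothetical per-sample score in [0, 1].

    Assumed proxy, not the paper's alpha_m: a synthetic sample that sits
    much closer to a training record than to any holdout record may have
    been memorised, so it receives a low score.
    """
    scores = []
    for s in synthetic:
        d_train = nearest_distance(s, train)
        d_hold = nearest_distance(s, holdout)
        total = d_train + d_hold
        # A sample that coincides with both sets is treated as neutral.
        scores.append(0.5 if total == 0 else d_train / total)
    return scores


def filter_by_score(synthetic, scores, threshold=0.3):
    """Keep only samples whose score is at least `threshold` (assumed value)."""
    return [s for s, ps in zip(synthetic, scores) if ps >= threshold]


# Toy data: the first synthetic point is an exact copy of a training record.
train = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(0.0, 1.0), (1.0, 0.0)]
synthetic = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.1)]

scores = privacy_scores(synthetic, train, holdout)
released = filter_by_score(synthetic, scores)
print(scores)
print(released)
```

Because the score needs only the samples themselves, a check of this kind could in principle be run by an auditor who never sees the generator, which is the model-agnostic property the abstract describes. A realistic version would have to handle mixed-type tabular features and choose the threshold empirically.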


Privacy preserving synthetic data · Generative Adversarial Networks · Privacy audit



Copyright information

© Springer Nature Switzerland AG 2021

Authors and Affiliations

  1. UCD School of Computing, Dublin, Ireland
  2. Tenable Network Security, Paris, France
