Skip to main content

On the Privacy Guarantees of Synthetic Data: A Reassessment from the Maximum-Knowledge Attacker Perspective

Part of the Lecture Notes in Computer Science book series (LNISA,volume 11126)

Abstract

Generating synthetic data for the dissemination of individual information in a privacy-preserving way is an approach that is often presented as superior to other statistical disclosure control techniques. The reason for such claim is straightforward at first glance: since all records disseminated are synthetic and not actual observed values, no individual can reasonably claim to face a privacy threat. Thus, and if the synthesizer used is good enough, synthetic data will potentially always offer a high level of information with low disclosure risk attached. Building on recent advances in the literature regarding the conceptualization of an intruder, this paper aims at challenging this claim by reassessing the privacy guarantees of synthetic data. Using the concept of a maximum-knowledge intruder, we demonstrate that synthetic data can in fact be always expressed as a re-arrangement of the original data and that, as a result, they may lead to configurations where disclosure risk may be higher than for non-synthetic disclosure control approaches. We illustrate the application of these results by an empirical example.

Keywords

  • Statistical disclosure control
  • Synthetic data
  • Maximum-knowledge attacker

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-319-99771-1_5
  • Chapter length: 16 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   59.99
Price excludes VAT (USA)
  • ISBN: 978-3-319-99771-1
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   79.99
Price excludes VAT (USA)

Notes

  1. 1.

    Using these notations, \( o_{ij} \) is the rank of attribute j in original record i and \( s_{lj}^{m} \) is the rank of attribute j in synthetic record l of the mth synthetic data set.

  2. 2.

    The two other attributes are not shown here due to space constraints but their reverse-mapped versions can be displayed in exactly the same way.

References

  1. Domingo-Ferrer, J., Muralidhar, K.: New directions in anonymization: permutation paradigm, verifiability by subjects and intruders, transparency to users. Inf. Sci. 337, 11–24 (2016)

    CrossRef  Google Scholar 

  2. Domingo-Ferrer, J., Ricci, S., Soria-Comas, J.: Disclosure risk assessment via record linkage by a maximum-knowledge attacker. In: 13th Annual International Conference on Privacy, Security and Trust-PST 2015, Izmir, Turkey, September 2015

    Google Scholar 

  3. Domingo-Ferrer, J., Sánchez, D., Rufian-Torrell, G.: Anonymization of nominal data based on semantic marginality. Inf. Sci. 242, 35–48 (2013)

    CrossRef  Google Scholar 

  4. Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-0326-5

    CrossRef  MATH  Google Scholar 

  5. Drechsler, J., Bender, S., Rässler, S.: Comparing fully and partially synthetic datasets for statistical disclosure control in the German IAB establishment panel. Trans. Data Priv. 1, 105–130 (2008)

    MathSciNet  Google Scholar 

  6. Hu, J., Reiter, J.P., Wang, Q.: Disclosure risk evaluation for fully synthetic categorical data. In: Domingo-Ferrer, J. (ed.) PSD 2014. LNCS, vol. 8744, pp. 185–199. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11257-2_15

    CrossRef  Google Scholar 

  7. Hundepool, A., et al.: Statistical Disclosure Control. Wiley, Hoboken (2012)

    CrossRef  Google Scholar 

  8. Muralidhar, K., Domingo-Ferrer, J.: Rank-based record linkage for re-identification risk assessment. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds.) PSD 2016. LNCS, vol. 9867, pp. 225–236. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45381-1_17

    CrossRef  Google Scholar 

  9. Muralidhar, K., Domingo-Ferrer, J.: Microdata masking as permutation. In: UNECE/EUROSTAT Work Session on Statistical Data Confidentiality, Helsinki, Finland, October 2015

    Google Scholar 

  10. Muralidhar, K., Sarathy, R.: A comparison of multiple imputation and data perturbation for masking numerical variables. J. Off. Stat. 22, 507–524 (2006)

    Google Scholar 

  11. Muralidhar, K., Sarathy, R., Domingo-Ferrer, J.: Reverse mapping to preserve the marginal distributions of attributes in masked microdata. In: Domingo-Ferrer, J. (ed.) PSD 2014. LNCS, vol. 8744, pp. 105–116. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11257-2_9

    CrossRef  Google Scholar 

  12. Reiter, J.P., Wang, Q., Zhang, B.: Bayesian estimation of disclosure risks in multiply imputed, synthetic data. J. Priv. Confid. 6(1), 17–33 (2014). Article no. 2

    Google Scholar 

  13. Reiter, J.P.: Satisfying disclosure restrictions with synthetic data sets. J. Off. Stat. 18, 531–544 (2002)

    Google Scholar 

  14. Reiter, J.P.: Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study. J. Roy. Stat. Soc. Ser. A 168, 185–205 (2005)

    CrossRef  MathSciNet  Google Scholar 

  15. Rubin, D.B.: Discussion: statistical disclosure control limitation. J. Off. Stat. 9, 462–468 (1993)

    Google Scholar 

  16. Ruiz, N.: On some consequences of the permutation paradigm for data anonymization: centrality of permutation matrices, universal measures of disclosure risk and information loss, evaluation by dominance. Inf. Sci. 430–431, 620–633 (2018)

    CrossRef  MathSciNet  Google Scholar 

  17. Ruiz, N.: A general cipher for individual data anonymization. Inf. Sci. (2017, under review). (https://arxiv.org/abs/1712.02557)

  18. Soria-Comas, J., Domingo-Ferrer, J.: A non-parametric model for accurate and provably private synthetic data sets. In: Proceedings of International Conference on Availability, Reliability and Security-ARES 2017, Article no. 3. ACM (2017)

    Google Scholar 

  19. Willenborg, L., De Waal, T.: Elements of Statistical Disclosure Control. Springer, New York (2001). https://doi.org/10.1007/978-1-4613-0121-9

    CrossRef  MATH  Google Scholar 

Download references

Acknowledgments and Disclaimer

The following funding sources are gratefully acknowledged by the third author: European Commission (project H2020-700540 “CANVAS”), Government of Catalonia (ICREA Acadèmia Prize) and Spanish Government (projects TIN2014-57364-C2-1-R “SmartGlacis” and TIN2015-70054-REDC). The views in this paper are the authors’ own and do not necessarily reflect the views of UNESCO or any of the funders.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Nicolas Ruiz .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Verify currency and authenticity via CrossMark

Cite this paper

Ruiz, N., Muralidhar, K., Domingo-Ferrer, J. (2018). On the Privacy Guarantees of Synthetic Data: A Reassessment from the Maximum-Knowledge Attacker Perspective. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science(), vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-99771-1_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99770-4

  • Online ISBN: 978-3-319-99771-1

  • eBook Packages: Computer ScienceComputer Science (R0)