Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata

  • Conference paper
Privacy in Statistical Databases (PSD 2022)

Abstract

Most statistical agencies release randomly selected samples of Census microdata, usually with sample fractions under 10% and with other forms of statistical disclosure control (SDC) applied. An alternative to SDC is data synthesis, which has been attracting growing interest, yet there is no clear consensus on how to measure the associated utility and disclosure risk of the data. The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible.

This paper follows on from previous work by the authors which mapped synthetic Census data on a risk-utility (R-U) map. The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions, thereby identifying the sample fraction which has equivalent utility and risk to the synthetic data. Three commonly used data synthesis packages are compared with some interesting results. Further work is needed in several directions but the methodology looks very promising.
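
The core idea of the framework, locating the sample fraction whose score matches that of the synthetic data, can be sketched as a simple interpolation. The sketch below is illustrative only: the function name, the choice of linear interpolation, and every numeric value are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def equivalent_sample_fraction(fractions, sample_scores, synthetic_score):
    """Interpolate the sample fraction whose score equals synthetic_score.

    Assumption (hypothetical): the score, whether utility or risk, rises
    strictly with the sample fraction, so it can be inverted by linear
    interpolation. Scores outside the observed range are clamped to the
    smallest or largest fraction.
    """
    fractions = np.asarray(fractions, dtype=float)
    sample_scores = np.asarray(sample_scores, dtype=float)
    if fractions.shape != sample_scores.shape:
        raise ValueError("fractions and sample_scores must have equal length")
    if np.any(np.diff(sample_scores) <= 0):
        raise ValueError("sample_scores must be strictly increasing")
    return float(np.interp(synthetic_score, sample_scores, fractions))


# Made-up illustrative values, not results from the paper.
fractions = [0.01, 0.05, 0.10]
utilities = [0.40, 0.60, 0.70]
equiv = equivalent_sample_fraction(fractions, utilities, synthetic_score=0.65)
print(f"equivalent sample fraction: {equiv:.3f}")
```

The same function could be applied separately to a utility score and to a risk score, giving one equivalent fraction for each.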

Notes

  1. Note that, for calculating the risk and utility, the sample data was treated in the same way as the synthetic data, namely by comparing against the original data. However, for simplicity, in the metric descriptions only synthetic data is mentioned.

  2. The project code is available here: https://github.com/clairelittle/psd2022-comparing-utility-risk.

  3. This is a strong assumption, which has the benefit of dominating most other scenarios; the one possible exception is a presence detection attack. However, for Census data, presence detection is vacuous, and the response knowledge assumption is sound by definition.

  4. We recognise that averaging different utility metrics may not be optimal and in future work we will consider an explicitly multi-objective approach to utility optimisation.

  5. Standard deviation not included for clarity as this was generally small, <0.01.

References

  1. Benedetto, G., Stinson, M.H., Abowd, J.M.: The creation and use of the SIPP synthetic beta. Technical report, November, U.S. Census Bureau (2018)

  2. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and regression trees. Wadsworth International Group, Belmont, California (1984). https://doi.org/10.1201/9781315139470

  3. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324

  4. Caiola, G., Reiter, J.P.: Random forests for generating partially synthetic, categorical data. Trans. Data Privacy 3(1), 27–42 (2010)

  5. Camino, R., Hammerschmidt, C., State, R.: Generating multi-categorical samples with generative adversarial networks. arXiv preprint arXiv:1807.01202 (2018)

  6. Chen, H., Jajodia, S., Liu, J., Park, N., Sokolov, V., Subrahmanian, V.S.: Fake tables: using GANs to generate functional dependency preserving tables with bounded real data. In: IJCAI, pp. 2074–2080 (2019). https://doi.org/10.24963/ijcai.2019/287

  7. Dankar, F.K., Ibrahim, M.K., Ismail, L.: A multi-dimensional evaluation of synthetic data generators. IEEE Access 10, 11147–11158 (2022). https://doi.org/10.1109/ACCESS.2022.3144765

  8. Drechsler, J., Reiter, J.P.: Sampling with synthesis: a new approach for releasing public use census microdata. J. Am. Stat. Assoc. 105(492), 1347–1357 (2010). https://doi.org/10.1198/jasa.2010.ap09480

  9. Drechsler, J., Reiter, J.P.: An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243 (2011). https://doi.org/10.1016/j.csda.2011.06.006

  10. Duncan, G.T., Keller-McNulty, S.A., Stokes, S.L.: Database security and confidentiality: examining disclosure risk vs. data utility through the R-U confidentiality map. Technical report, National Institute of Statistical Sciences (2004)

  11. Elliot, M.: Final report on the disclosure risk associated with the synthetic data produced by the SYLLS team. Technical report, University of Manchester (2014). http://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/reports/2015-02%20-Report%20on%20disclosure%20risk%20analysis%20of%20synthpop%20synthetic%20versions%20of%20LCF_%20final.pdf

  12. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014). https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf

  13. Hittmeir, M., Ekelhart, A., Mayer, R.: Utility and privacy assessments of synthetic data for regression tasks. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 5763–5772 (2019). https://doi.org/10.1109/BigData47090.2019.9005476

  14. Hundepool, A., et al.: Statistical Disclosure Control. Wiley Series in Survey Methodology. Wiley, Hoboken (2012). https://doi.org/10.1002/9781118348239

  15. Joshi, C.: Generative adversarial networks (GANs) for synthetic dataset generation with binary classes (2019). https://datasciencecampus.ons.gov.uk/projects/generative-adversarial-networks-gans-for-synthetic-dataset-generation-with-binary-classes/

  16. Kaloskampis, I., Joshi, C., Cheung, C., Pugh, D., Nolan, L.: Synthetic data in the civil service. Significance 17(6), 18–23 (2020). https://doi.org/10.1111/1740-9713.01466

  17. Karr, A.F., Kohnen, C.N., Oganian, A., Reiter, J.P., Sanil, A.P.: A framework for evaluating the utility of data altered to protect confidentiality. Am. Stat. 60(3), 224–232 (2006). https://doi.org/10.1198/000313006X124640

  18. Kinney, S.K., Reiter, J.P., Reznek, A.P., Miranda, J., Jarmin, R.S., Abowd, J.M.: Towards unrestricted public use business microdata: the synthetic longitudinal business database. Int. Stat. Rev. 79(3), 362–384 (2011). https://doi.org/10.1111/j.1751-5823.2011.00153.x

  19. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015). https://doi.org/10.1038/nature14539

  20. Little, C., Elliot, M., Allmendinger, R., Samani, S.S.: Generative adversarial networks for synthetic data generation: a comparative study. In: Joint UNECE/Eurostat Expert Meeting on Statistical Data Confidentiality (2021). https://unece.org/sites/default/files/2021-12/SDC2021_Day2_Little_AD.pdf

  21. Little, R.J.A.: Statistical analysis of masked data. J. Off. Stat. 9(2), 407–426 (1993)

  22. Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: theory meets practice on the map. In: 2008 IEEE 24th International Conference on Data Engineering, pp. 277–286 (2008). https://doi.org/10.1109/ICDE.2008.4497436

  23. Minnesota Population Center: Integrated Public Use Microdata Series, International: Version 7.3 [dataset]. IPUMS, Minneapolis (2020). https://doi.org/10.18128/D020.V7.3

  24. Nixon, M.P., Barrientos, A.F., Reiter, J.P., Slavković, A.: A latent class modeling approach for generating synthetic data and making posterior inferences from differentially private counts (2022). https://doi.org/10.48550/ARXIV.2201.10545

  25. Nowok, B., Raab, G.M., Dibben, C.: Synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74(11), 1–26 (2016). https://doi.org/10.18637/jss.v074.i11

  26. Nowok, B., Raab, G.M., Dibben, C.: Providing bespoke synthetic data for the UK longitudinal studies and other sensitive data with the synthpop package for R. Stat. J. IAOS 33(3), 785–796 (2017). https://doi.org/10.3233/SJI-150153

  27. Office for National Statistics, Census Division, University of Manchester, Cathie Marsh Centre for Census and Survey Research: Census 1991: Individual Sample of Anonymised Records for Great Britain (SARs) (2013). https://doi.org/10.5255/UKDA-SN-7210-1

  28. Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018). https://doi.org/10.14778/3231751.3231757

  29. Ping, H., Stoyanovich, J., Howe, B.: DataSynthesizer: privacy-preserving synthetic datasets. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management. ACM, New York (2017). https://doi.org/10.1145/3085504.3091117

  30. Pistner, M., Slavković, A., Vilhuber, L.: Synthetic data via quantile regression for heavy-tailed and heteroskedastic data. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 92–108. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_7

  31. Raab, G.M., Nowok, B., Dibben, C.: Guidelines for producing useful synthetic data (2017). https://doi.org/10.48550/ARXIV.1712.04078

  32. Raghunathan, T.E., Reiter, J.P., Rubin, D.B.: Multiple imputation for statistical disclosure limitation. J. Off. Stat. 19(1), 1–16 (2003)

  33. Rankin, D., Black, M., Bond, R., Wallace, J., Mulvenna, M., Epelde, G.: Reliability of supervised machine learning using synthetic data in health care: model to preserve privacy for data sharing. JMIR Med. Inform. 8(7), e18910 (2020). https://doi.org/10.2196/18910

  34. Reiter, J.: Using CART to generate partially synthetic public use microdata. J. Off. Stat. 21(3), 441–462 (2005)

  35. Reiter, J.P.: Satisfying disclosure restrictions with synthetic data sets. J. Off. Stat. 18(4), 531 (2002)

  36. Reiter, J.P.: Inference for partially synthetic, public use microdata sets. Surv. Methodol. 29(2), 181–188 (2003)

  37. Reiter, J.P.: Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study. J. R. Stat. Soc. Ser. A Stat. Soc. 168(1), 185–205 (2005). https://doi.org/10.1111/j.1467-985X.2004.00343.x

  38. Rubin, D.B.: Statistical disclosure limitation. J. Off. Stat. 9(2), 461–468 (1993)

  39. Snoke, J., Raab, G.M., Nowok, B., Dibben, C., Slavkovic, A.: General and specific utility measures for synthetic data. J. Roy. Stat. Soc. Ser. A Stat. Soc. 181(3), 663–688 (2018). https://doi.org/10.1111/rssa.12358

  40. Taub, J., Elliot, M.: The synthetic data challenge. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality (2019). https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/SDC2019_S3_UK_Synthethic_Data_Challenge_Elliot_AD.pdf

  41. Taub, J., Elliot, M., Pampaka, M., Smith, D.: Differential correct attribution probability for synthetic data: an exploration. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 122–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_9

  42. Taub, J., Elliot, M., Sakshaug, J.W.: The impact of synthetic data generation on data utility with application to the 1991 UK samples of anonymised records. Trans. Data Priv. 13(1), 1–23 (2020)

  43. Therneau, T., Atkinson, E., Ripley, B.: Package ‘rpart’ (2019). https://cran.r-project.org/package=rpart

  44. Wang, L., Chen, W., Yang, W., Bi, F., Yu, F.R.: A state-of-the-art review on image synthesis with generative adversarial networks. IEEE Access 8, 63514–63537 (2020). https://doi.org/10.1109/ACCESS.2020.2982224

  45. Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems, Vancouver, Canada, vol. 32 (2019). https://proceedings.neurips.cc/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf

  46. Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. 42(4), 1–41 (2017). https://doi.org/10.1145/3134428

  47. Zhao, Z., Kunar, A., Birke, R., Chen, L.Y.: CTAB-GAN: effective table data synthesizing. In: Proceedings of 13th Asian Conference on Machine Learning, vol. 157, pp. 97–112. PMLR (2021). https://proceedings.mlr.press/v157/zhao21a.html

Acknowledgement

The authors wish to acknowledge IPUMS International and the statistical offices that provided the underlying data making this research possible: Statistics Canada; Bureau of Statistics, Fiji; National Institute of Statistics, Rwanda; and the Office for National Statistics, UK.

Author information

Corresponding author

Correspondence to Mark Elliot.

Appendices

Appendix A

A brief summary of the Census microdata:

Canada 2011: Subsetted on the province of Manitoba, containing 32,149 records (3.47% of the total available dataset, which was a 2.78% sample of the 2011 Census). Downloaded from IPUMS [23], courtesy of Statistics Canada.

Fiji 2007: The entire 10% sample (n = 84,323) of the 2007 Fiji Census. Downloaded from IPUMS [23], courtesy of the Bureau of Statistics, Fiji.

Rwanda 2012: Subsetted on the Karongi region, containing 31,455 records (3.03% of the total available dataset, which was a 10% sample of the 2012 Census). Downloaded from IPUMS [23], courtesy of the National Institute of Statistics, Rwanda.

UK 1991: Subsetted on the region of West Midlands, containing 104,267 records (9.34% of total, a 2% sample of the 1991 Individual Sample of Anonymised Records for the British Census). Downloaded from UK Data Service [27].

Appendix B

Summary of TCAP key/target variables. The six key variables are listed together; the first 3 were used in the case of 3 keys, first 4 for 4 keys, etc.

Canada 2011: For target variables (RELIG, CITIZEN and TENURE) the key variables were: AGE, SEX, MARST (marital status), MINORITY (part of a visible minority), EMPSTAT (labour force status), BPL (birthplace).

Fiji 2007: For target variables (RELIGION, WORKTYPE and TENURE) the key variables were: PROVINCE (of residence), AGE, SEX, MARST (marital status), ETHNIC (ethnicity), CLASSWKR (employment status).

Rwanda 2012: For target variables (RELIGION, EMPSECTOR and OWNERSH (tenure)) the key variables were: AGE, SEX, MARST (marital status), CLASSWK (employment status), URBAN (urban/rural area), BPL (birthplace).

UK 1991: For target variables (LTILL (long-term illness), FAMTYPE and TENURE) the key variables were: AREAP, AGE, SEX, MSTATUS (marital status), ETHGROUP (ethnic group), ECONPRIM (economic status).
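
The nesting of key sets described above (the first 3 keys, then the first 4, and so on) can be written as prefixes of the ordered key list. This is a minimal sketch using the UK 1991 keys as listed in this appendix. The function name is hypothetical, and the range of 3 to 6 keys is an assumption read from "etc." rather than something stated explicitly.

```python
# Key variables for the UK 1991 data, in the order given in Appendix B.
UK_KEYS = ["AREAP", "AGE", "SEX", "MSTATUS", "ETHGROUP", "ECONPRIM"]


def nested_key_sets(keys, smallest=3):
    """Map each key count k to the first k keys, for k = smallest..len(keys).

    Hypothetical helper: the 3..6 range is an assumption for illustration.
    """
    if not 1 <= smallest <= len(keys):
        raise ValueError("smallest must be between 1 and len(keys)")
    return {k: keys[:k] for k in range(smallest, len(keys) + 1)}


key_sets = nested_key_sets(UK_KEYS)
for k, subset in key_sets.items():
    print(k, subset)
```

Because each set is a prefix of the next, adding a key can only make the assumed intruder knowledge more specific.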

Appendix C

Description of regression models used to calculate the confidence interval overlap (CIO). For each dataset two logistic regressions were performed using marital status and housing tenure as the targets (a binary target was created). Eight predictors were used; these were the same for both models, with tenure or marital status removed from the predictors when it was the target:

Canada Predictors: ABIDENT (aboriginal identity), AGE, CLASSWK, DEGREE, EMPSTAT, SEX, URBAN, TENURE/MARST.

Fiji Predictors: AGE, CLASSWKR, ETHNIC, RELIGION, EDATTAIN (educational level attained), SEX, PROVINCE, TENURE/MARST.

Rwanda Predictors: AGE, DISAB1, EDCERT (highest educational qualification), CLASSWK, LIT (languages spoken), RELIG, SEX, TENURE/MARST.

UK Predictors: AGE, ECONPRIM, ETHGROUP, LTILL, QUALNUM, SEX, SOCLASS, TENURE/MSTATUS.
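
This section does not spell out the CIO formula. The sketch below assumes the interval-overlap measure of Karr et al. [17], computed for one regression coefficient from its confidence interval in the original data and in the synthetic (or sample) data. The function name and the example intervals are hypothetical.

```python
def ci_overlap(lo_orig, hi_orig, lo_syn, hi_syn):
    """Overlap of two confidence intervals, averaged over both interval lengths.

    Assumed definition (after Karr et al. [17]):
        0.5 * (overlap / len_orig + overlap / len_syn)
    where overlap = min(upper bounds) - max(lower bounds).
    Identical intervals give 1.0; disjoint intervals give a negative value.
    """
    if hi_orig <= lo_orig or hi_syn <= lo_syn:
        raise ValueError("each interval needs upper bound > lower bound")
    overlap = min(hi_orig, hi_syn) - max(lo_orig, lo_syn)
    return 0.5 * (overlap / (hi_orig - lo_orig) + overlap / (hi_syn - lo_syn))


# Made-up intervals for a single coefficient, for illustration only.
print(ci_overlap(0.0, 2.0, 1.0, 3.0))
```

One value per coefficient would then typically be averaged across the coefficients of each model. Some implementations also truncate negative values to zero, which is a design choice not shown here.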

Appendix D

Fig. 2. Risk-Utility map plotting the mean synthetic data and sample fraction results for Fiji 2007, Canada 2011 and Rwanda 2012 Census data.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Little, C., Elliot, M., Allmendinger, R. (2022). Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_17

  • DOI: https://doi.org/10.1007/978-3-031-13945-1_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13944-4

  • Online ISBN: 978-3-031-13945-1

  • eBook Packages: Computer Science (R0)
