Abstract
Most statistical agencies release randomly selected samples of Census microdata, usually with sample fractions under 10% and with other forms of statistical disclosure control (SDC) applied. An alternative to SDC is data synthesis, which has been attracting growing interest, yet there is no clear consensus on how to measure the associated utility and disclosure risk of the data. The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible.
This paper follows on from previous work by the authors which mapped synthetic Census data on a risk-utility (R-U) map. The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions, thereby identifying the sample fraction which has equivalent utility and risk to the synthetic data. Three commonly used data synthesis packages are compared with some interesting results. Further work is needed in several directions but the methodology looks very promising.
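The equivalence idea described above can be illustrated with a small sketch. It assumes that a single score (utility or risk) has already been computed for samples at several fractions and for the synthetic data, and that linear interpolation between neighbouring fractions is acceptable. The function name `equivalent_fraction` and the input layout are hypothetical and are not taken from the project code.

```python
def equivalent_fraction(sample_scores, synthetic_score):
    """Estimate the sample fraction whose score equals the synthetic score.

    sample_scores: list of (fraction, score) pairs for the sampled data.
    Returns the linearly interpolated fraction, or None when the synthetic
    score lies outside the range covered by the samples.
    """
    pts = sorted(sample_scores)  # order by sample fraction
    for (f0, s0), (f1, s1) in zip(pts, pts[1:]):
        lo, hi = min(s0, s1), max(s0, s1)
        if lo <= synthetic_score <= hi:
            if s1 == s0:  # flat segment: any fraction fits, return the smaller
                return f0
            # Linear interpolation between the two neighbouring fractions.
            return f0 + (synthetic_score - s0) * (f1 - f0) / (s1 - s0)
    return None
```

Applying the same function once to the utility scores and once to the risk scores would give the two equivalent fractions that can then be compared.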
Notes
1. Note that, for calculating the risk and utility, the sample data was treated in the same way as the synthetic data, namely by comparing against the original data. However, for simplicity, only synthetic data is mentioned in the metric descriptions.
2. The project code is available here: https://github.com/clairelittle/psd2022-comparing-utility-risk.
3. This is a strong assumption, which has the benefit of then dominating most other scenarios; the one possible exception is a presence detection attack. However, for Census data, presence detection is vacuous, and the response knowledge assumption is sound by definition.
4. We recognise that averaging different utility metrics may not be optimal, and in future work we will consider an explicitly multi-objective approach to utility optimisation.
5. Standard deviation is not included, for clarity, as it was generally small (<0.01).
References
Benedetto, G., Stinson, M.H., Abowd, J.M.: The creation and use of the SIPP synthetic beta. Technical report, November, U.S. Census Bureau (2018)
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and regression trees. Wadsworth International Group, Belmont, California (1984). https://doi.org/10.1201/9781315139470
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Caiola, G., Reiter, J.P.: Random forests for generating partially synthetic, categorical data. Trans. Data Privacy 3(1), 27–42 (2010)
Camino, R., Hammerschmidt, C., State, R.: Generating multi-categorical samples with generative adversarial networks. arXiv preprint arXiv:1807.01202 (2018)
Chen, H., Jajodia, S., Liu, J., Park, N., Sokolov, V., Subrahmanian, V.S.: Fake tables: using GANs to generate functional dependency preserving tables with bounded real data. In: IJCAI, pp. 2074–2080 (2019). https://doi.org/10.24963/ijcai.2019/287
Dankar, F.K., Ibrahim, M.K., Ismail, L.: A multi-dimensional evaluation of synthetic data generators. IEEE Access 10, 11147–11158 (2022). https://doi.org/10.1109/ACCESS.2022.3144765
Drechsler, J., Reiter, J.P.: Sampling with synthesis: a new approach for releasing public use census microdata. J. Am. Stat. Assoc. 105(492), 1347–1357 (2010). https://doi.org/10.1198/jasa.2010.ap09480
Drechsler, J., Reiter, J.P.: An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243 (2011). https://doi.org/10.1016/j.csda.2011.06.006
Duncan, G.T., Keller-McNulty, S.A., Stokes, S.L.: Database security and confidentiality: examining disclosure risk vs. data utility through the R-U confidentiality map. Technical report, National Institute of Statistical Sciences (2004)
Elliot, M.: Final report on the disclosure risk associated with the synthetic data produced by the SYLLS team. Technical report, University of Manchester (2014). http://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/reports/2015-02%20-Report%20on%20disclosure%20risk%20analysis%20of%20synthpop%20synthetic%20versions%20of%20LCF_%20final.pdf
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014). https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
Hittmeir, M., Ekelhart, A., Mayer, R.: Utility and privacy assessments of synthetic data for regression tasks. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 5763–5772 (2019). https://doi.org/10.1109/BigData47090.2019.9005476
Hundepool, A., et al.: Statistical Disclosure Control. Wiley Series in Survey Methodology. Wiley, Hoboken (2012). https://doi.org/10.1002/9781118348239
Joshi, C.: Generative adversarial networks (GANs) for synthetic dataset generation with binary classes (2019). https://datasciencecampus.ons.gov.uk/projects/generative-adversarial-networks-gans-for-synthetic-dataset-generation-with-binary-classes/
Kaloskampis, I., Joshi, C., Cheung, C., Pugh, D., Nolan, L.: Synthetic data in the civil service. Significance 17(6), 18–23 (2020). https://doi.org/10.1111/1740-9713.01466
Karr, A.F., Kohnen, C.N., Oganian, A., Reiter, J.P., Sanil, A.P.: A framework for evaluating the utility of data altered to protect confidentiality. Am. Stat. 60(3), 224–232 (2006). https://doi.org/10.1198/000313006X124640
Kinney, S.K., Reiter, J.P., Reznek, A.P., Miranda, J., Jarmin, R.S., Abowd, J.M.: Towards unrestricted public use business microdata: the synthetic longitudinal business database. Int. Stat. Rev. 79(3), 362–384 (2011). https://doi.org/10.1111/j.1751-5823.2011.00153.x
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015). https://doi.org/10.1038/nature14539
Little, C., Elliot, M., Allmendinger, R., Samani, S.S.: Generative adversarial networks for synthetic data generation: a comparative study. In: Joint UNECE/Eurostat Expert Meeting on Statistical Data Confidentiality (2021). https://unece.org/sites/default/files/2021-12/SDC2021_Day2_Little_AD.pdf
Little, R.J.A.: Statistical analysis of masked data. J. Off. Stat. 9(2), 407–426 (1993)
Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: theory meets practice on the map. In: 2008 IEEE 24th International Conference on Data Engineering, pp. 277–286 (2008). https://doi.org/10.1109/ICDE.2008.4497436
Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.3 [dataset]. IPUMS: IPUMs Census Data, Minneapolis (2020). https://doi.org/10.18128/D020.V7.2
Nixon, M.P., Barrientos, A.F., Reiter, J.P., Slavković, A.: A latent class modeling approach for generating synthetic data and making posterior inferences from differentially private counts (2022). https://doi.org/10.48550/ARXIV.2201.10545
Nowok, B., Raab, G.M., Dibben, C.: Synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74(11), 1–26 (2016). https://doi.org/10.18637/jss.v074.i11
Nowok, B., Raab, G.M., Dibben, C.: Providing bespoke synthetic data for the UK longitudinal studies and other sensitive data with the synthpop package for R. Stat. J. IAOS 33(3), 785–796 (2017). https://doi.org/10.3233/SJI-150153
Office for National Statistics, Census Division, University of Manchester, Cathie Marsh Centre for Census and Survey Research: Census 1991: Individual Sample of Anonymised Records for Great Britain (SARs) (2013). https://doi.org/10.5255/UKDA-SN-7210-1
Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018). https://doi.org/10.14778/3231751.3231757
Ping, H., Stoyanovich, J., Howe, B.: DataSynthesizer: privacy-preserving synthetic datasets. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management. ACM, New York (2017). https://doi.org/10.1145/3085504.3091117
Pistner, M., Slavković, A., Vilhuber, L.: Synthetic data via quantile regression for heavy-tailed and heteroskedastic data. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 92–108. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_7
Raab, G.M., Nowok, B., Dibben, C.: Guidelines for producing useful synthetic data (2017). https://doi.org/10.48550/ARXIV.1712.04078
Raghunathan, T.E., Reiter, J.P., Rubin, D.B.: Multiple imputation for statistical disclosure limitation. J. Off. Stat. 19(1), 1–16 (2003)
Rankin, D., Black, M., Bond, R., Wallace, J., Mulvenna, M., Epelde, G.: Reliability of supervised machine learning using synthetic data in health care: model to preserve privacy for data sharing. JMIR Med. Inform. 8(7), e18910 (2020). https://doi.org/10.2196/18910
Reiter, J.: Using CART to generate partially synthetic public use microdata. J. Off. Stat. 21(3), 441–462 (2005)
Reiter, J.P.: Satisfying disclosure restrictions with synthetic data sets. J. Off. Stat. 18(4), 531 (2002)
Reiter, J.P.: Inference for partially synthetic, public use microdata sets. Surv. Methodol. 29(2), 181–188 (2003)
Reiter, J.P.: Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study. J. R. Stat. Soc. Ser. A Stat. Soc. 168(1), 185–205 (2005). https://doi.org/10.1111/j.1467-985X.2004.00343.x
Rubin, D.B.: Statistical disclosure limitation. J. Off. Stat. 9(2), 461–468 (1993)
Snoke, J., Raab, G.M., Nowok, B., Dibben, C., Slavkovic, A.: General and specific utility measures for synthetic data. J. Roy. Stat. Soc. Ser. A Stat. Soc. 181(3), 663–688 (2018). https://doi.org/10.1111/rssa.12358
Taub, J., Elliot, M.: The synthetic data challenge. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality (2019). https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/SDC2019_S3_UK_Synthethic_Data_Challenge_Elliot_AD.pdf
Taub, J., Elliot, M., Pampaka, M., Smith, D.: Differential correct attribution probability for synthetic data: an exploration. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 122–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_9
Taub, J., Elliot, M., Sakshaug, J.W.: The impact of synthetic data generation on data utility with application to the 1991 UK samples of anonymised records. Trans. Data Privacy 13(1), 1–23 (2020)
Therneau, T., Atkinson, E., Ripley, B.: Package ‘rpart’ (2019). https://cran.r-project.org/package=rpart
Wang, L., Chen, W., Yang, W., Bi, F., Yu, F.R.: A state-of-the-art review on image synthesis with generative adversarial networks. IEEE Access 8, 63514–63537 (2020). https://doi.org/10.1109/ACCESS.2020.2982224
Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems, Vancouver, Canada, vol. 32 (2019). https://proceedings.neurips.cc/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. 42(4), 1–41 (2017). https://doi.org/10.1145/3134428
Zhao, Z., Kunar, A., Birke, R., Chen, L.Y.: CTAB-GAN: effective table data synthesizing. In: Proceedings of 13th Asian Conference on Machine Learning, vol. 157, pp. 97–112. PMLR (2021). https://proceedings.mlr.press/v157/zhao21a.html
Acknowledgement
The authors wish to acknowledge IPUMS International and the statistical offices that provided the underlying data, making this research possible: Statistics Canada; Bureau of Statistics, Fiji; National Institute of Statistics, Rwanda; and the Office for National Statistics, UK.
Appendices
Appendix A
A brief summary of the Census microdata:
Canada 2011: Subset to the province of Manitoba, containing 32,149 records (3.47% of the total available dataset, which was a 2.78% sample of the 2011 Census). Downloaded from IPUMS [23], courtesy of Statistics Canada.
Fiji 2007: The entire 10% sample (n = 84,323) of the 2007 Fiji Census. Downloaded from IPUMS [23], courtesy of the Bureau of Statistics, Fiji.
Rwanda 2012: Subset to the Karongi region, containing 31,455 records (3.03% of the total available dataset, which was a 10% sample of the 2012 Census). Downloaded from IPUMS [23], courtesy of the National Institute of Statistics, Rwanda.
UK 1991: Subset to the West Midlands region, containing 104,267 records (9.34% of the total available dataset, the 1991 Individual Sample of Anonymised Records, which was a 2% sample of the British Census). Downloaded from the UK Data Service [27].
Appendix B
Summary of TCAP key/target variables. The six key variables are listed together; the first 3 were used in the case of 3 keys, first 4 for 4 keys, etc.
Canada 2011: For target variables (RELIG, CITIZEN and TENURE) the key variables were: AGE, SEX, MARST (marital status), MINORITY (part of a visible minority), EMPSTAT (labour force status), BPL (birthplace).
Fiji 2007: For target variables (RELIGION, WORKTYPE and TENURE) the key variables were: PROVINCE (of residence), AGE, SEX, MARST (marital status), ETHNIC (ethnic group), CLASSWKR (employment status).
Rwanda 2012: For target variables (RELIGION, EMPSECTOR and OWNERSH (tenure)) the key variables were: AGE, SEX, MARST (marital status), CLASSWK (employment status), URBAN (urban/rural area), BPL (birthplace).
UK 1991: For target variables (LTILL (long-term illness), FAMTYPE and TENURE) the key variables were: AREAP, AGE, SEX, MSTATUS (marital status), ETHGROUP (ethnic group), ECONPRIM (economic status).
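The way key and target variables enter a TCAP-style calculation can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the metric definition is not reproduced in this section, so the retention rule (keep only synthetic records whose key combination maps to a single target value) and the choice to score unmatched key combinations as 0 are assumptions, and the function name `tcap` is hypothetical.

```python
from collections import Counter, defaultdict


def tcap(original, synthetic, keys, target):
    """TCAP-style score; rows are dicts, keys is a list of column names."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    # Tally target values per key combination in each dataset.
    syn_targets = defaultdict(Counter)
    for row in synthetic:
        syn_targets[key_of(row)][row[target]] += 1
    orig_targets = defaultdict(Counter)
    for row in original:
        orig_targets[key_of(row)][row[target]] += 1

    scores = []
    for row in synthetic:
        k = key_of(row)
        # Assumption: keep only synthetic records whose keys pin down the target.
        if len(syn_targets[k]) != 1:
            continue
        matches = orig_targets.get(k)
        if not matches:
            # Assumption: no original record shares these keys, score 0.
            scores.append(0.0)
            continue
        # Share of key-matched original records that also match on the target.
        scores.append(matches[row[target]] / sum(matches.values()))
    return sum(scores) / len(scores) if scores else 0.0
```

With the Canada key list above held in a Python list, the 3-key case would pass the first three entries (AGE, SEX, MARST) as `keys`, the 4-key case the first four, and so on.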
Appendix C
Description of the regression models used to calculate the confidence interval overlap (CIO). For each dataset, two logistic regressions were performed, one with marital status and one with housing tenure as the target (a binary target was created in each case). Eight predictors are listed; these were the same for both models, with whichever of tenure or marital status was the target removed from the predictors:
Canada Predictors: ABIDENT (aboriginal identity), AGE, CLASSWK, DEGREE, EMPSTAT, SEX, URBAN, TENURE/MARST.
Fiji Predictors: AGE, CLASSWKR, ETHNIC, RELIGION, EDATTAIN (educational level attained), SEX, PROVINCE, TENURE/MARST.
Rwanda Predictors: AGE, DISAB1, EDCERT (highest educational qualification), CLASSWK, LIT (languages spoken), RELIG, SEX, TENURE/MARST.
UK Predictors: AGE, ECONPRIM, ETHGROUP, LTILL, QUALNUM, SEX, SOCLASS, TENURE/MSTATUS.
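Once the two logistic regressions have been fitted to the original and the synthetic data, the overlap of the coefficient confidence intervals can be computed along the following lines. This is a minimal sketch assuming the interval overlap formula of Karr et al. (see the reference list), 95% normal-approximation intervals, and no clipping of negative values; the exact variant used in the paper is not stated in this section, and the function names are hypothetical.

```python
from statistics import mean

Z_95 = 1.96  # normal quantile for a 95% confidence interval


def ci_overlap(est_orig, se_orig, est_syn, se_syn, z=Z_95):
    """Interval overlap for one coefficient, given estimates and standard errors."""
    lo_o, hi_o = est_orig - z * se_orig, est_orig + z * se_orig
    lo_s, hi_s = est_syn - z * se_syn, est_syn + z * se_syn
    # Length of the intersection; negative when the intervals are disjoint.
    inter = min(hi_o, hi_s) - max(lo_o, lo_s)
    # Average the overlap relative to each interval's own width.
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_s - lo_s))


def mean_cio(orig, syn, z=Z_95):
    """Mean overlap over coefficients; each argument maps name -> (estimate, se)."""
    return mean(ci_overlap(*orig[name], *syn[name], z=z) for name in orig)
```

A value of 1 indicates identical intervals, and values at or below 0 indicate intervals that do not overlap. The per-coefficient values could be averaged within each model and then across the two models, although that aggregation is also an assumption here.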
Appendix D
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Little, C., Elliot, M., Allmendinger, R. (2022). Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_17
DOI: https://doi.org/10.1007/978-3-031-13945-1_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13944-4
Online ISBN: 978-3-031-13945-1
eBook Packages: Computer Science, Computer Science (R0)