Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata

Little, Claire; Elliot, Mark; Allmendinger, Richard

doi:10.1007/978-3-031-13945-1_17

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13463)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

Most statistical agencies release randomly selected samples of Census microdata, usually with sample fractions under 10% and with other forms of statistical disclosure control (SDC) applied. An alternative to SDC is data synthesis, which has been attracting growing interest, yet there is no clear consensus on how to measure the associated utility and disclosure risk of the data. The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible.

This paper follows on from previous work by the authors which mapped synthetic Census data on a risk-utility (R-U) map. The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions, thereby identifying the sample fraction which has equivalent utility and risk to the synthetic data. Three commonly used data synthesis packages are compared with some interesting results. Further work is needed in several directions but the methodology looks very promising.
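The comparison described above can be illustrated with a minimal Python sketch. It assumes that a single score (utility or risk) has already been computed for samples at several fractions and for the synthetic data, and that the score increases strictly with the sample fraction. The function name `equivalent_fraction` and the use of linear interpolation are illustrative assumptions, not the authors' implementation.

```python
from bisect import bisect_left


def equivalent_fraction(fractions, sample_scores, synthetic_score):
    """Interpolate the sample fraction whose score equals synthetic_score.

    fractions       -- sample fractions in increasing order, e.g. [0.01, 0.05, 0.10]
    sample_scores   -- score of each sample (assumed strictly increasing)
    synthetic_score -- score of the synthetic dataset on the same metric

    Returns None when the synthetic score lies outside the range covered
    by the samples, since no equivalent fraction can then be interpolated.
    """
    if len(fractions) != len(sample_scores) or len(fractions) < 2:
        raise ValueError("need at least two (fraction, score) pairs")
    if any(b <= a for a, b in zip(sample_scores, sample_scores[1:])):
        raise ValueError("sample_scores must be strictly increasing")
    if not sample_scores[0] <= synthetic_score <= sample_scores[-1]:
        return None

    # Index of the first sample score that is >= the synthetic score.
    i = bisect_left(sample_scores, synthetic_score)
    if sample_scores[i] == synthetic_score:
        return fractions[i]

    # Linear interpolation between the two neighbouring samples.
    f0, f1 = fractions[i - 1], fractions[i]
    s0, s1 = sample_scores[i - 1], sample_scores[i]
    return f0 + (f1 - f0) * (synthetic_score - s0) / (s1 - s0)
```

Calling the function once with utility scores and once with risk scores would give two equivalent fractions per synthetic dataset, which could then be compared on an R-U map.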

Notes

1.
Note that, for calculating the risk and utility, the sample data was treated in the same way as the synthetic data, namely by comparing against the original data. However, for simplicity, in the metric descriptions only synthetic data is mentioned.
2.
The project code is available here: https://github.com/clairelittle/psd2022-comparing-utility-risk.
3.
This is a strong assumption, which has the benefit of then dominating most other scenarios; the one possible exception is a presence detection attack. However, for Census data, presence detection is vacuous, and the response knowledge assumption is sound by definition.
4.
We recognise that averaging different utility metrics may not be optimal and in future work we will consider an explicitly multi-objective approach to utility optimisation.
5.
The standard deviation is not included, for clarity, as it was generally small (<0.01).

References

Benedetto, G., Stinson, M.H., Abowd, J.M.: The creation and use of the SIPP synthetic beta. Technical report, November, U.S. Census Bureau (2018)
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and regression trees. Wadsworth International Group, Belmont, California (1984). https://doi.org/10.1201/9781315139470
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Caiola, G., Reiter, J.P.: Random forests for generating partially synthetic, categorical data. Trans. Data Privacy 3(1), 27–42 (2010)
Camino, R., Hammerschmidt, C., State, R.: Generating multi-categorical samples with generative adversarial networks. arXiv preprint arXiv:1807.01202 (2018)
Chen, H., Jajodia, S., Liu, J., Park, N., Sokolov, V., Subrahmanian, V.S.: Fake tables: using GANs to generate functional dependency preserving tables with bounded real data. In: IJCAI, pp. 2074–2080 (2019). https://doi.org/10.24963/ijcai.2019/287
Dankar, F.K., Ibrahim, M.K., Ismail, L.: A multi-dimensional evaluation of synthetic data generators. IEEE Access 10, 11147–11158 (2022). https://doi.org/10.1109/ACCESS.2022.3144765
Drechsler, J., Reiter, J.P.: Sampling with synthesis: a new approach for releasing public use census microdata. J. Am. Stat. Assoc. 105(492), 1347–1357 (2010). https://doi.org/10.1198/jasa.2010.ap09480
Drechsler, J., Reiter, J.P.: An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243 (2011). https://doi.org/10.1016/j.csda.2011.06.006
Duncan, G.T., Keller-McNulty, S.A., Stokes, S.L.: Database security and confidentiality: examining disclosure risk vs. data utility through the R-U confidentiality map. Technical report, National Institute of Statistical Sciences (2004)
Elliot, M.: Final report on the disclosure risk associated with the synthetic data produced by the SYLLS team. Technical report, University of Manchester (2014). http://hummedia.manchester.ac.uk/institutes/cmist/archive-publications/reports/2015-02%20-Report%20on%20disclosure%20risk%20analysis%20of%20synthpop%20synthetic%20versions%20of%20LCF_%20final.pdf
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014). https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
Hittmeir, M., Ekelhart, A., Mayer, R.: Utility and privacy assessments of synthetic data for regression tasks. In: 2019 IEEE International Conference on Big Data (Big Data), pp. 5763–5772 (2019). https://doi.org/10.1109/BigData47090.2019.9005476
Hundepool, A., et al.: Statistical Disclosure Control. Wiley Series in Survey Methodology. Wiley, Hoboken (2012). https://doi.org/10.1002/9781118348239
Joshi, C.: Generative adversarial networks (GANs) for synthetic dataset generation with binary classes (2019). https://datasciencecampus.ons.gov.uk/projects/generative-adversarial-networks-gans-for-synthetic-dataset-generation-with-binary-classes/
Kaloskampis, I., Joshi, C., Cheung, C., Pugh, D., Nolan, L.: Synthetic data in the civil service. Significance 17(6), 18–23 (2020). https://doi.org/10.1111/1740-9713.01466
Karr, A.F., Kohnen, C.N., Oganian, A., Reiter, J.P., Sanil, A.P.: A framework for evaluating the utility of data altered to protect confidentiality. Am. Stat. 60(3), 224–232 (2006). https://doi.org/10.1198/000313006X124640
Kinney, S.K., Reiter, J.P., Reznek, A.P., Miranda, J., Jarmin, R.S., Abowd, J.M.: Towards unrestricted public use business microdata: the synthetic longitudinal business database. Int. Stat. Rev. 79(3), 362–384 (2011). https://doi.org/10.1111/j.1751-5823.2011.00153.x
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015). https://doi.org/10.1038/nature14539
Little, C., Elliot, M., Allmendinger, R., Samani, S.S.: Generative adversarial networks for synthetic data generation: a comparative study. In: Joint UNECE/Eurostat Expert Meeting on Statistical Data Confidentiality (2021). https://unece.org/sites/default/files/2021-12/SDC2021_Day2_Little_AD.pdf
Little, R.J.A.: Statistical analysis of masked data. J. Off. Stat. 9(2), 407–426 (1993)
Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: theory meets practice on the map. In: 2008 IEEE 24th International Conference on Data Engineering, pp. 277–286 (2008). https://doi.org/10.1109/ICDE.2008.4497436
Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.3 [dataset]. IPUMS: IPUMs Census Data, Minneapolis (2020). https://doi.org/10.18128/D020.V7.2
Nixon, M.P., Barrientos, A.F., Reiter, J.P., Slavković, A.: A latent class modeling approach for generating synthetic data and making posterior inferences from differentially private counts (2022). https://doi.org/10.48550/ARXIV.2201.10545
Nowok, B., Raab, G.M., Dibben, C.: Synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74(11), 1–26 (2016). https://doi.org/10.18637/jss.v074.i11
Nowok, B., Raab, G.M., Dibben, C.: Providing bespoke synthetic data for the UK longitudinal studies and other sensitive data with the synthpop package for R. Stat. J. IAOS 33(3), 785–796 (2017). https://doi.org/10.3233/SJI-150153
Office for National Statistics, Census Division, University of Manchester, Cathie Marsh Centre for Census and Survey Research: Census 1991: Individual Sample of Anonymised Records for Great Britain (SARs) (2013). https://doi.org/10.5255/UKDA-SN-7210-1
Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018). https://doi.org/10.14778/3231751.3231757
Ping, H., Stoyanovich, J., Howe, B.: DataSynthesizer: privacy-preserving synthetic datasets. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management. ACM, New York (2017). https://doi.org/10.1145/3085504.3091117
Pistner, M., Slavković, A., Vilhuber, L.: Synthetic data via quantile regression for heavy-tailed and heteroskedastic data. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 92–108. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_7
Raab, G.M., Nowok, B., Dibben, C.: Guidelines for producing useful synthetic data (2017). https://doi.org/10.48550/ARXIV.1712.04078
Raghunathan, T.E., Reiter, J.P., Rubin, D.B.: Multiple imputation for statistical disclosure limitation. J. Off. Stat. 19(1), 1–16 (2003)
Rankin, D., Black, M., Bond, R., Wallace, J., Mulvenna, M., Epelde, G.: Reliability of supervised machine learning using synthetic data in health care: model to preserve privacy for data sharing. JMIR Med. Inform. 8(7), e18910 (2020). https://doi.org/10.2196/18910
Reiter, J.: Using CART to generate partially synthetic public use microdata. J. Off. Stat. 21(3), 441–462 (2005)
Reiter, J.P.: Satisfying disclosure restrictions with synthetic data sets. J. Off. Stat. 18(4), 531 (2002)
Reiter, J.P.: Inference for partially synthetic, public use microdata sets. Surv. Methodol. 29(2), 181–188 (2003)
Reiter, J.P.: Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study. J. R. Stat. Soc. Ser. A Stat. Soc. 168(1), 185–205 (2005). https://doi.org/10.1111/j.1467-985X.2004.00343.x
Rubin, D.B.: Statistical disclosure limitation. J. Off. Stat. 9(2), 461–468 (1993)
Snoke, J., Raab, G.M., Nowok, B., Dibben, C., Slavkovic, A.: General and specific utility measures for synthetic data. J. Roy. Stat. Soc. Ser. A Stat. Soc. 181(3), 663–688 (2018). https://doi.org/10.1111/rssa.12358
Taub, J., Elliot, M.: The synthetic data challenge. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality (2019). https://unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2019/mtg1/SDC2019_S3_UK_Synthethic_Data_Challenge_Elliot_AD.pdf
Taub, J., Elliot, M., Pampaka, M., Smith, D.: Differential correct attribution probability for synthetic data: an exploration. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 122–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_9
Taub, J., Elliot, M., Sakshaug, J.W.: The impact of synthetic data generation on data utility with application to the 1991 UK samples of anonymised records. Trans. Data Priv. 13(1), 1–23 (2020)
Therneau, T., Atkinson, E., Ripley, B.: Package ‘rpart’ (2019). https://cran.r-project.org/package=rpart
Wang, L., Chen, W., Yang, W., Bi, F., Yu, F.R.: A state-of-the-art review on image synthesis with generative adversarial networks. IEEE Access 8, 63514–63537 (2020). https://doi.org/10.1109/ACCESS.2020.2982224
Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems, Vancouver, Canada, vol. 32 (2019). https://proceedings.neurips.cc/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. 42(4), 1–41 (2017). https://doi.org/10.1145/3134428
Zhao, Z., Kunar, A., Birke, R., Chen, L.Y.: CTAB-GAN: effective table data synthesizing. In: Proceedings of 13th Asian Conference on Machine Learning, vol. 157, pp. 97–112. PMLR (2021). https://proceedings.mlr.press/v157/zhao21a.html

Acknowledgement

The authors wish to acknowledge IPUMS International and the statistical offices that provided the underlying data making this research possible: Statistics Canada; Bureau of Statistics, Fiji; National Institute of Statistics, Rwanda; and the Office for National Statistics, UK.

Author information

Authors and Affiliations

School of Social Sciences, University of Manchester, Manchester, M13 9PL, UK
Claire Little & Mark Elliot
Alliance Manchester Business School, University of Manchester, Manchester, M13 9PL, UK
Richard Allmendinger

Corresponding author

Correspondence to Mark Elliot.

Editor information

Editors and Affiliations

Universitat Rovira i Virgili, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer
Télécom SudParis, Palaiseau, France
Maryline Laurent

Appendices

Appendix A

A brief summary of the Census microdata:

Canada 2011: Subsetted on the province of Manitoba, containing 32,149 records (3.47% of the total available dataset, which was a 2.78% sample of the 2011 Census). Downloaded from IPUMS [23], courtesy of Statistics Canada.

Fiji 2007: The entire 10% sample (n = 84,323) of the 2007 Fiji Census. Downloaded from IPUMS [23], courtesy of the Bureau of Statistics, Fiji.

Rwanda 2012: Subsetted on the Karongi region, containing 31,455 records (3.03% of the total available, a 10% sample of the 2012 Census). Downloaded from IPUMS [23], courtesy of the National Institute of Statistics, Rwanda.

UK 1991: Subsetted on the region of West Midlands, containing 104,267 records (9.34% of total, a 2% sample of the 1991 Individual Sample of Anonymised Records for the British Census). Downloaded from UK Data Service [27].

Appendix B

Summary of the targeted correct attribution probability (TCAP) key/target variables. The six key variables are listed together; the first 3 were used in the case of 3 keys, the first 4 for 4 keys, etc.

Canada 2011: For target variables (RELIG, CITIZEN and TENURE) the key variables were: AGE, SEX, MARST (marital status), MINORITY (part of a visible minority), EMPSTAT (labour force status), BPL (birthplace).

Fiji 2007: For target variables (RELIGION, WORKTYPE and TENURE) the key variables were: PROVINCE (of residence), AGE, SEX, MARST (marital status), ETHNIC (ethnic group), CLASSWKR (employment status).

Rwanda 2012: For target variables (RELIGION, EMPSECTOR and OWNERSH (tenure)) the key variables were: AGE, SEX, MARST (marital status), CLASSWK (employment status), URBAN (urban/rural area), BPL (birthplace).

UK 1991: For target variables (LTILL (long-term illness), FAMTYPE and TENURE) the key variables were: AREAP, AGE, SEX, MSTATUS (marital status), ETHGROUP (ethnic group), ECONPRIM (economic status).
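The key selection rule above, together with a simplified attribution calculation, can be sketched in Python as follows. This is only an illustration under stated assumptions: `select_keys` and `mean_cap` are hypothetical helper names, records are assumed to be plain dictionaries, and `mean_cap` computes a basic correct attribution probability (the share of key-matching synthetic records that carry the original record's target value), which is not necessarily the exact TCAP calculation used in the paper.

```python
from collections import Counter, defaultdict

# Key variables for Canada 2011, in the order listed above.
CANADA_KEYS = ["AGE", "SEX", "MARST", "MINORITY", "EMPSTAT", "BPL"]


def select_keys(ordered_keys, n_keys):
    """Return the first n_keys key variables from the ordered list."""
    if not 1 <= n_keys <= len(ordered_keys):
        raise ValueError("n_keys must be between 1 and the number of keys")
    return ordered_keys[:n_keys]


def mean_cap(original, synthetic, keys, target):
    """Simplified correct attribution probability, averaged over original records.

    For each original record, look up the synthetic records that share its
    key values and take the proportion that also share its target value.
    Original records with no key match in the synthetic data score 0
    (an assumption made for this sketch).
    """
    if not original:
        raise ValueError("original must contain at least one record")

    # Count target values within each combination of key values.
    target_counts = defaultdict(Counter)
    for row in synthetic:
        target_counts[tuple(row[k] for k in keys)][row[target]] += 1

    total = 0.0
    for row in original:
        matches = target_counts.get(tuple(row[k] for k in keys))
        if matches:
            total += matches[row[target]] / sum(matches.values())
    return total / len(original)
```

For example, `select_keys(CANADA_KEYS, 4)` yields AGE, SEX, MARST and MINORITY, which could then be passed as `keys` with `target="TENURE"`.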

Appendix C

Description of the regression models used to calculate the confidence interval overlap (CIO). For each dataset, two logistic regressions were performed, using marital status and housing tenure as the targets (a binary target was created). Eight predictors were used; these were the same for both models (with tenure/marital status removed accordingly):

Canada Predictors: ABIDENT (aboriginal identity), AGE, CLASSWK, DEGREE, EMPSTAT, SEX, URBAN, TENURE/MARST.

Fiji Predictors: AGE, CLASSWKR, ETHNIC, RELIGION, EDATTAIN (educational level attained), SEX, PROVINCE, TENURE/MARST.

Rwanda Predictors: AGE, DISAB1, EDCERT (highest educational qualification), CLASSWK, LIT (languages spoken), RELIG, SEX, TENURE/MARST.

UK Predictors: AGE, ECONPRIM, ETHGROUP, LTILL, QUALNUM, SEX, SOCLASS, TENURE/MSTATUS.
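Assuming that CIO denotes the confidence interval overlap measure of Karr et al. (2006, in the reference list), the comparison of one regression coefficient between the original and synthetic data can be sketched in Python as below. The helper names, the 95% Wald interval (estimate ± 1.96 standard errors) and the choice to score non-overlapping intervals as 0 are assumptions made for illustration; the fitting of the logistic regressions themselves is not shown.

```python
def wald_interval(estimate, std_error, z=1.96):
    """Approximate 95% confidence interval for a regression coefficient."""
    return (estimate - z * std_error, estimate + z * std_error)


def ci_overlap(orig_ci, synth_ci):
    """Confidence interval overlap for one coefficient.

    The length of the intersection is expressed as a fraction of each
    interval's own length, and the two fractions are averaged.
    Intervals that do not overlap score 0 in this sketch.
    """
    lo_o, hi_o = orig_ci
    lo_s, hi_s = synth_ci
    overlap = min(hi_o, hi_s) - max(lo_o, lo_s)
    if overlap <= 0:
        return 0.0
    return 0.5 * (overlap / (hi_o - lo_o) + overlap / (hi_s - lo_s))


def mean_ci_overlap(orig_cis, synth_cis):
    """Average the overlap across all coefficients of a model."""
    pairs = list(zip(orig_cis, synth_cis))
    if not pairs:
        raise ValueError("need at least one pair of intervals")
    return sum(ci_overlap(o, s) for o, s in pairs) / len(pairs)
```

A value of 1 indicates identical intervals, and values near 0 indicate that the synthetic data would lead to a different conclusion about that coefficient.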

Appendix D

About this paper

Cite this paper

Little, C., Elliot, M., Allmendinger, R. (2022). Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_17

DOI: https://doi.org/10.1007/978-3-031-13945-1_17
Published: 14 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13944-4
Online ISBN: 978-3-031-13945-1
eBook Packages: Computer Science, Computer Science (R0)
