Some Clarifications Regarding Fully Synthetic Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11126)

Abstract

In recent years there has been some confusion about the circumstances under which datasets generated with the synthetic data approach should be considered fully synthetic, and about which estimator to use to obtain valid variance estimates from the synthetic data. This paper aims to provide some guidance to resolve this confusion. It reviews the different approaches for generating synthetic datasets and discusses their similarities and differences. It also presents the different variance estimators that have been proposed for analyzing synthetic data. Based on two simulation studies, the advantages and limitations of the different estimators are discussed. The paper concludes with some general recommendations on how to judge which synthesis strategy and which variance estimator are most suitable in a given situation.

Author information

Corresponding author

Correspondence to Jörg Drechsler.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Drechsler, J. (2018). Some Clarifications Regarding Fully Synthetic Data. In: Domingo-Ferrer, J., Montes, F. (eds) Privacy in Statistical Databases. PSD 2018. Lecture Notes in Computer Science, vol 11126. Springer, Cham. https://doi.org/10.1007/978-3-319-99771-1_8

  • DOI: https://doi.org/10.1007/978-3-319-99771-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99770-4

  • Online ISBN: 978-3-319-99771-1

  • eBook Packages: Computer Science, Computer Science (R0)