Information preserving regression-based tools for statistical disclosure control

  • Øyvind Langsrud
Article

Abstract

This paper presents a unified framework for regression-based statistical disclosure control for microdata. A basic method, known as information preserving statistical obfuscation (IPSO), produces synthetic data that preserve variances, covariances and fitted values. The data are then generated conditionally according to the multivariate normal distribution. Generalizations of the IPSO method are described in the literature, and these methods aim to generate data more similar to the original data. This paper describes these methods in a concise and interpretable way that is close to an efficient implementation. Decomposing the residual data into orthogonal scores and corresponding loadings is an essential part of the framework. Both QR decomposition (Gram–Schmidt orthogonalization) and singular value decomposition (principal components) may be used. Within this framework, new and generalized methods are presented. In particular, a method is described by which the correlations to the original principal component scores can be controlled exactly. It is shown that a suggested method of random orthogonal matrix masking can be implemented without generating an orthogonal matrix. Generalized methodology for hierarchical categories is presented within the context of microaggregation. Some information can then be preserved at the lowest level and more information at higher levels. The presented methodology is also applicable to tabular data. One possibility is to replace the content of primary and secondary suppressed cells with generated values. It is proposed to replace suppressed cell frequencies with decimal numbers, and it is argued that this can be a useful method.
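The basic IPSO idea summarized above (keep the fitted values, replace the residuals by simulated ones with the same cross-product) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the paper's implementation: the function name `ipso_sketch`, its arguments, and the choice to add an intercept column are hypothetical, and the data are assumed to have more rows than regressors plus response variables, with full column rank.

```python
import numpy as np


def ipso_sketch(y, x, rng):
    """Hypothetical IPSO-style generator (illustrative only).

    y   : (n, q) array of confidential response variables
    x   : (n, p) array of regressors released unchanged
    rng : numpy.random.Generator
    Returns synthetic responses with the same fitted values and the
    same residual cross-product matrix as y.
    """
    n = y.shape[0]
    x1 = np.column_stack([np.ones(n), x])      # assumed: model includes an intercept
    qx, _ = np.linalg.qr(x1)                   # orthonormal basis for the column space of x1

    fitted = qx @ (qx.T @ y)                   # least-squares fitted values
    resid = y - fitted                         # original residuals
    _, loadings = np.linalg.qr(resid)          # resid = scores @ loadings (QR decomposition)

    sim = rng.standard_normal(y.shape)         # simulated multivariate normal data
    sim -= qx @ (qx.T @ sim)                   # remove the part explained by x1
    sim_scores, _ = np.linalg.qr(sim)          # orthonormal simulated scores, orthogonal to x1

    # New residuals share the original loadings, so their cross-product is unchanged.
    return fitted + sim_scores @ loadings
```

Because the simulated scores are orthonormal and orthogonal to the regressors, regressing the output on `x` reproduces the original coefficient estimates, and the sample means and covariance matrix of the output match those of `y` up to floating-point error.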

Keywords

Microdata anonymization, Synthetic data, Microaggregation, Hybrid microdata, Cell suppression, Official statistics

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Statistics Norway, Oslo, Norway
