Statistical Disclosure Control for Data Privacy Using Sequence of Generalised Linear Models

  • Min Cherng Lee
  • Robin Mitra
  • Emmanuel Lazaridis
  • An Chow Lai
  • Yong Kheng Goh
  • Wun-She Yap (corresponding author)
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9722)

Abstract

When releasing data for public use, statistical agencies seek to reduce the risk of disclosure while preserving the utility of the released data. Common approaches such as adding random noise, top-coding variables and swapping data values distort the relationships in the original data. To balance disclosure risk against data utility, we consider the synthetic data approach in this paper: we release multiply imputed, partially synthetic data sets that comprise original data values, with values at high disclosure risk replaced by synthetic values. To generate such synthetic data, we introduce a new variant of the factored regression model proposed by Lee and Mitra in 2016. In addition, we propose a new algorithm for identifying the original data values that need to be replaced with synthetic data. With our proposed methods, data privacy can be preserved, since it is difficult to identify an individual when the released synthetic data differ from the original data. Moreover, valid inferences about the data can be made using simple combining rules, which take into account the uncertainty due to the presence of synthetic values. To evaluate the performance of our proposed methods in terms of the risk of disclosure and the utility of the released synthetic data, we conduct an experiment on a data set taken from the 1987 National Indonesia Contraceptive Prevalence Survey.
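The abstract does not spell out the combining rules. As a minimal sketch, assuming they are the standard rules for partially synthetic data given by Reiter (2003, listed in the references), an analyst computes a point estimate and its variance estimate on each of the m released data sets and then pools them. The function name and the example numbers below are hypothetical and are not taken from the paper.

```python
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Pool per-data-set results from m partially synthetic data sets.

    Assumes the combining rules of Reiter (2003); `estimates` holds the
    point estimates q_i and `variances` the variance estimates u_i.
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variances")

    q_bar = mean(estimates)    # pooled point estimate
    b_m = variance(estimates)  # between-data-set variance (divisor m - 1)
    u_bar = mean(variances)    # average within-data-set variance

    # Total variance for partial synthesis: u_bar + b_m / m
    t_p = u_bar + b_m / m

    # Degrees of freedom of the reference t distribution
    if b_m > 0:
        df = (m - 1) * (1 + u_bar / (b_m / m)) ** 2
    else:
        df = float("inf")
    return q_bar, t_p, df


# Hypothetical results from m = 3 synthetic data sets
q_bar, t_p, df = combine_partially_synthetic([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(q_bar, t_p, df)
```

Under this assumption the total variance adds only b_m / m to the average within-data-set variance, which is smaller than the between-imputation term used when combining multiply imputed data sets with missing values.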

Keywords

Disclosure control, Data privacy, Multiple imputation, Generalised linear models

References

  1. Albert, J.H., Chib, S.: Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc. 88(422), 669–679 (1993)
  2. Fienberg, S.E., McIntyre, J.: Data swapping: variations on a theme by Dalenius and Reiss. In: Domingo-Ferrer, J., Torra, V. (eds.) PSD 2004. LNCS, vol. 3050, pp. 14–29. Springer, Heidelberg (2004)
  3. Fuller, W.A.: Masking procedures for microdata disclosure limitation. J. Official Stat. 9(2), 383–406 (1993)
  4. Drechsler, J., Reiter, J.P.: Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data. In: Domingo-Ferrer, J., Saygın, Y. (eds.) PSD 2008. LNCS, vol. 5262, pp. 227–238. Springer, Heidelberg (2008)
  5. Lim, T.-S., Loh, W.-Y., Shih, Y.-S.: A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach. Learn. 40(3), 203–228 (2000)
  6. Lee, M.C., Mitra, R.: Multiply imputing missing values in data sets with measurement scales using a sequence of generalised linear models. Comput. Stat. Data Anal. 95, 24–38 (2016)
  7. Little, R.J.A.: Statistical analysis of masked data. J. Official Stat. 9(2), 407–426 (1993)
  8. Reiter, J.P.: Inference for partially synthetic, public use microdata sets. Surv. Method. 29, 181–188 (2003)
  9. Reiter, J.P.: Using CART to generate partially synthetic public use microdata. J. Official Stat. 21(3), 441–461 (2005)
  10. Reiter, J.P., Mitra, R.: Estimating risks of identification disclosure in partially synthetic data. J. Priv. Confidentiality 1(1), 99–110 (2009)
  11. Rubin, D.B.: Statistical disclosure limitation. J. Official Stat. 9(2), 461–468 (1993)
  12. Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC, London (1997)
  13. Willenborg, L., de Waal, T.: Elements of Statistical Disclosure Control. Lecture Notes in Statistics. Springer, New York (2001)
  14. Woo, M.-J., Reiter, J.P., Oganian, A., Karr, A.F.: Global measures of data utility for microdata masked for disclosure limitation. J. Priv. Confidentiality 1(1), 111–124 (2009)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Min Cherng Lee (1)
  • Robin Mitra (2)
  • Emmanuel Lazaridis (3)
  • An Chow Lai (1)
  • Yong Kheng Goh (1)
  • Wun-She Yap (1, corresponding author)

  1. Universiti Tunku Abdul Rahman, Kajang, Malaysia
  2. Southampton Statistical Sciences Research Institute, University of Southampton, Southampton, UK
  3. National Institute for Cardiovascular Outcomes Research, University College London, London, UK
