Statistical disclosure control via sufficiency under the multiple linear regression model
In this article we show, under the normal multiple linear regression model, how synthetic data can be generated using the principle of sufficiency. An advantage of this approach is that if the regression model assumed by the synthetic data producer is correctly specified, then the synthetic data have the same joint distribution as the original data, and therefore one can use standard regression methodology and software to analyze the synthetic data. If the same regression model used to generate the synthetic data is also used for data analysis, and the data are analyzed using standard regression methodology, then the synthetic data yield inference identical to that obtained from the original data. We also study the effects of overfitting or underfitting the linear regression model. We show that even if the data producer overspecifies the regression model when creating the synthetic data, the synthetic data will still have the same distribution as the original data, and hence valid inference can be obtained. However, if the data producer underspecifies the linear regression model, then one cannot expect to obtain valid inference from the synthetic data. The disclosure risk of the proposed method relative to a standard synthetic data method is also examined.
Keywords: Conditional distribution; statistical disclosure control; sufficient statistics; synthetic data
AMS Subject Classification: 62B05; 62J05; 62D05
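As an illustration of the sufficiency idea described in the abstract, the following is a minimal sketch and not necessarily the authors' algorithm. It assumes the normal model y = Xβ + ε with ε ~ N(0, σ²I) and a full-column-rank design matrix X, for which the least squares estimate and the residual sum of squares (RSS) are jointly sufficient. Given those statistics, the residual vector is uniformly distributed on a sphere of radius √RSS in the orthogonal complement of the column space of X, so a synthetic response can be drawn by rescaling projected Gaussian noise. The function names `synthesize_response` and `ols_summary` and the simulated data are hypothetical, introduced only for this example.

```python
import numpy as np


def synthesize_response(X, y, rng):
    """Draw a synthetic response from the conditional distribution of y
    given the OLS sufficient statistics (beta_hat, RSS).

    Assumes X has full column rank and the normal linear model holds.
    """
    n = X.shape[0]
    # Thin QR: columns of Q form an orthonormal basis of col(X).
    Q, _ = np.linalg.qr(X)
    fitted = Q @ (Q.T @ y)               # equals X @ beta_hat
    rss = np.sum((y - fitted) ** 2)      # residual sum of squares
    # Project fresh Gaussian noise onto the orthogonal complement of col(X),
    # then normalize so it is uniform on the unit sphere of that subspace.
    z = rng.standard_normal(n)
    e = z - Q @ (Q.T @ z)
    e /= np.linalg.norm(e)
    return fitted + np.sqrt(rss) * e


def ols_summary(X, y):
    """Return the OLS coefficient vector and residual sum of squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return beta, rss


# Simulated data, purely illustrative.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)

y_syn = synthesize_response(X, y, rng)
beta_orig, rss_orig = ols_summary(X, y)
beta_syn, rss_syn = ols_summary(X, y_syn)
```

In this sketch the synthetic response differs from the original record by record, yet refitting the same regression reproduces the original coefficient estimates and RSS up to floating-point error. This is consistent with the abstract's claim that an analyst using the same model obtains identical inference from the synthetic data.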