Journal of Statistical Theory and Practice, Volume 12, Issue 1, pp 100–110

Statistical disclosure control via sufficiency under the multiple linear regression model

  • Martin Daniel Klein
  • Gauri Sankar Datta


In this article we show, under the normal multiple linear regression model, how synthetic data can be generated using the principle of sufficiency. An advantage of this approach is that if the regression model assumed by the synthetic data producer is correctly specified, then the synthetic data have the same joint distribution as the original data, and therefore one can use standard regression methodology and software to analyze the synthetic data. If the same regression model used to generate the synthetic data is also used for data analysis, and the data are analyzed using standard regression methodology, then the synthetic data yield identical inference to that of the original data. We also study the effects of overfitting or under-fitting the linear regression model. We show that even if the data producer overspecifies the regression model when creating the synthetic data, the synthetic data will still have the same distribution as the original data, and hence valid inference can be obtained. However, if the data producer underspecifies the linear regression model, then one cannot expect to obtain valid inference from the synthetic data. The disclosure risk of the proposed method relative to a standard synthetic data method is also examined.
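The abstract does not spell out the sampling algorithm, so the following is only a minimal sketch of one standard way to draw from the conditional distribution of the response given the sufficient statistics, under stated assumptions. It assumes the model y = Xβ + ε with ε ~ N(0, σ²I), a fixed full-column-rank design matrix X that is released unchanged, and the sufficient statistic (β̂, RSS). Under those assumptions, the residual vector given (β̂, RSS) is uniformly distributed on a sphere of radius √RSS in the orthogonal complement of the column space of X, so a synthetic response can be built from the fitted values plus a rescaled random residual direction. The function names `synthesize_via_sufficiency` and `ols` are hypothetical and are not taken from the article.

```python
import numpy as np


def synthesize_via_sufficiency(X, y, rng):
    """Draw y* given (beta_hat, RSS) under the normal linear model (illustrative sketch)."""
    n, _ = X.shape
    # Orthonormal basis for the column space of X (assumes full column rank).
    Q, _ = np.linalg.qr(X)
    fitted = Q @ (Q.T @ y)                # X @ beta_hat
    rss = np.sum((y - fitted) ** 2)       # residual sum of squares
    # Random direction in the orthogonal complement of col(X):
    # project a standard normal vector off col(X) and normalize it.
    z = rng.standard_normal(n)
    r = z - Q @ (Q.T @ z)
    direction = r / np.linalg.norm(r)
    # Same fitted values, fresh residual direction, same residual length.
    return fitted + np.sqrt(rss) * direction


def ols(X, y):
    """Ordinary least squares estimate and residual sum of squares."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    return beta_hat, rss


# Simulated "original" data; the coefficients below are arbitrary choices.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)

y_syn = synthesize_via_sufficiency(X, y, rng)
beta_orig, rss_orig = ols(X, y)
beta_syn, rss_syn = ols(X, y_syn)
```

In this sketch, fitting the same regression to `y_syn` reproduces the least squares estimate and residual sum of squares of the original data up to floating-point error, which is the sense in which standard regression output would agree between the two data sets, while the individual synthetic responses differ from the original ones.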


Conditional distribution; statistical disclosure control; sufficient statistics; synthetic data

AMS Subject Classification

62B05 62J05 62D05 


Copyright information

© Grace Scientific Publishing, 20 Middlefield Ct, Greensboro, NC 27455 2018

Authors and Affiliations

  1. Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, USA
  2. Department of Statistics, University of Georgia, Athens, USA
