Abstract
Likelihood-based finite sample inference for singly imputed synthetic data generated via posterior predictive sampling is developed in this paper for multivariate normal and multiple linear regression models. Currently available methodology for drawing valid inference on population parameters using synthetic data is based on concepts of multiple imputation for missing data, and therefore requires the release of multiple synthetic datasets. The methodology developed in this paper demonstrates that, contrary to the usual belief, valid inference about meaningful model parameters can indeed be drawn based on a singly imputed synthetic dataset under the multivariate normal and multiple linear regression models, by fully utilizing the model structure.
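As a rough illustration of what "singly imputed synthetic data generated via posterior predictive sampling" means, the sketch below handles the one-dimensional normal case. It is not the authors' procedure: the noninformative prior p(mu, sigma^2) proportional to 1/sigma^2, the function name `synthesize_normal`, and the simulated "original" data are all assumptions made for this example. It covers only the data-generation step, not the likelihood-based inference developed in the paper.

```python
import random
import statistics


def synthesize_normal(original, rng):
    """Draw one synthetic dataset by posterior predictive sampling.

    Assumes a univariate normal model with the noninformative prior
    p(mu, sigma^2) proportional to 1/sigma^2 (an assumption of this sketch).
    """
    n = len(original)
    xbar = statistics.fmean(original)
    ss = sum((x - xbar) ** 2 for x in original)  # sum of squared deviations

    # sigma^2 | data  ~  ss / chi-square(n - 1);
    # a chi-square(k) variate is Gamma(shape=k/2, scale=2).
    chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
    sigma2_star = ss / chi2

    # mu | sigma^2, data  ~  N(xbar, sigma^2 / n)
    mu_star = rng.gauss(xbar, (sigma2_star / n) ** 0.5)

    # Synthetic records: n independent draws from N(mu_star, sigma2_star).
    sd_star = sigma2_star ** 0.5
    return [rng.gauss(mu_star, sd_star) for _ in range(n)]


rng = random.Random(0)
original = [rng.gauss(50.0, 10.0) for _ in range(200)]  # stand-in for confidential data
synthetic = synthesize_normal(original, rng)  # a single synthetic release
```

Because the parameters are themselves drawn from the posterior before the records are generated, the synthetic sample carries more variability than the original sample, which is why treating it as if it were the original data is generally not valid without an adjusted analysis.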
Additional information
Disclaimer. This article is released to inform interested parties of ongoing research and to encourage discussion. The views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.
About this article
Cite this article
Klein, M., Sinha, B. Inference for Singly Imputed Synthetic Data Based on Posterior Predictive Sampling under Multivariate Normal and Multiple Linear Regression Models. Sankhya B 77, 293–311 (2015). https://doi.org/10.1007/s13571-015-0100-8
DOI: https://doi.org/10.1007/s13571-015-0100-8
Keywords and phrases
- Maximum likelihood estimator
- Pivot
- Posterior predictive sampling
- Single imputation
- Statistical disclosure control