The Quasi-Multinomial Synthesizer for Categorical Data

  • Jingchen Hu
  • Nobuaki Hoshino
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11126)

Abstract

We present a new synthesizer for categorical data based on the Quasi-Multinomial distribution. The Quasi-Multinomial distribution has a tuning parameter that lets a Quasi-Multinomial synthesizer control the balance between the utility and the disclosure risks of synthetic data. We build the Quasi-Multinomial synthesizer on a popular categorical data synthesizer, the Dirichlet process mixture of products of multinomial distributions. We develop and present general sampling methods and an algorithm for the Quasi-Multinomial synthesizer, and we illustrate its balance of utility and disclosure risks by synthesizing a sample from the American Community Survey.
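
To make the role of a tuning parameter concrete, here is a minimal Python sketch of one quasi-multinomial probability mass function, assuming the parametrization p(n_1, ..., n_J) = n!/(n_1! ... n_J!) · Π_j θ_j (θ_j + n_j β)^(n_j − 1) / (1 + nβ)^(n − 1) with the θ_j positive and summing to 1. The function names, the symbols θ and β, and this exact parametrization are assumptions made for illustration; the paper's own notation and sampling algorithm may differ and are not reproduced here. Under this form, β = 0 recovers the ordinary multinomial distribution.

```python
from math import factorial, prod


def quasi_multinomial_pmf(counts, theta, beta):
    """Probability of a count vector under the assumed quasi-multinomial form.

    counts: non-negative integers (n_1, ..., n_J)
    theta:  positive cell probabilities summing to 1 (assumed symbol)
    beta:   non-negative tuning parameter (assumed symbol); 0 gives the multinomial
    """
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! ... n_J!), kept as an exact integer.
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    # Each cell contributes theta_j * (theta_j + n_j * beta) ** (n_j - 1);
    # an empty cell (n_j = 0) contributes exactly 1.
    kernel = prod(t * (t + c * beta) ** (c - 1) for c, t in zip(counts, theta))
    return coef * kernel / (1 + n * beta) ** (n - 1)


def compositions(n, k):
    """Yield every way to split n items across k ordered cells."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


if __name__ == "__main__":
    theta = (0.5, 0.3, 0.2)
    for beta in (0.0, 0.1, 0.5):
        total = sum(quasi_multinomial_pmf(c, theta, beta) for c in compositions(5, 3))
        print(f"beta={beta}: probabilities sum to {total:.6f}")
```

Enumerating all count vectors is only feasible for small n and few cells; it is used here purely to check that the assumed form is a proper distribution for each value of β.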

Keywords

Bayesian, Dirichlet process, Microdata, Quasi-Multinomial, Synthetic

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Vassar College, Poughkeepsie, USA
  2. Kanazawa University, Kanazawa, Japan
