Synthetic Data via Quantile Regression for Heavy-Tailed and Heteroskedastic Data
Abstract
Privacy protection of confidential data is a fundamental problem faced by many government organizations and research centers. It is further complicated when data have complex structures or variables with highly skewed distributions. The statistical community addresses general privacy concerns by introducing different techniques that aim to decrease disclosure risk in released data while retaining their statistical properties. However, methods for complex data structures have received insufficient attention. We propose producing synthetic data via quantile regression to address privacy protection of heavy-tailed and heteroskedastic data. We address some shortcomings of the previously proposed use of quantile regression as a synthesis method and extend the work into cases where data have heavy tails or heteroskedastic errors. Using a simulation study and two applications, we show that there are settings where quantile regression performs as well as or better than other commonly used synthesis methods on the basis of maintaining good data utility while simultaneously decreasing disclosure risk.
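The abstract does not spell out the synthesis algorithm, so the following is only a minimal sketch of one common way to synthesize a variable from fitted conditional quantiles: fit a quantile regression at each level on a grid, then for every record draw a uniform level and take the fitted value at the nearest grid level. The function names, the quantile grid, the simulated data, and the brute-force fitting routine are all assumptions made for illustration and are not taken from the paper; in practice a dedicated solver such as the quantreg package in R would be used instead.

```python
import random
from itertools import combinations


def pinball_loss(residual, tau):
    """Check (pinball) loss at quantile level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def fit_quantile_line(xs, ys, tau):
    """Fit y = a + b * x at quantile level tau by brute force.

    With a single covariate, a minimizer of the total pinball loss can be
    found among the lines through two data points, so every pair is scored.
    This is O(n^3) and only suitable for small illustrative samples.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a valid regression line
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        loss = sum(pinball_loss(y - (a + b * x), tau) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]


def synthesize(xs, ys, taus, rng):
    """Replace each y with a draw from its estimated conditional quantiles.

    Assumption: the uniform draw is snapped to the nearest level in `taus`,
    so synthetic values never go beyond the outermost fitted quantiles.
    """
    fits = [fit_quantile_line(xs, ys, tau) for tau in taus]
    synthetic = []
    for x in xs:
        u = rng.random()
        k = min(range(len(taus)), key=lambda idx: abs(taus[idx] - u))
        a, b = fits[k]
        synthetic.append(a + b * x)
    return synthetic


def student_t(rng, df=3):
    """Heavy-tailed noise: a Student t draw built from Gaussian draws."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return z / (chi2 / df) ** 0.5


# Simulated confidential data (hypothetical): the noise scale grows with x
# (heteroskedastic) and the noise itself is heavy-tailed.
rng = random.Random(0)
xs = [rng.uniform(1.0, 10.0) for _ in range(60)]
ys = [1.0 + 2.0 * x + 0.5 * x * student_t(rng) for x in xs]

taus = [k / 10 for k in range(1, 10)]  # quantile grid 0.1, 0.2, ..., 0.9
synthetic = synthesize(xs, ys, taus, rng)
```

Because each quantile level gets its own slope, the spread of the synthetic values can widen with x without a parametric variance model, which is the property that makes the approach attractive for heteroskedastic data. A finer grid, or interpolation between adjacent fitted quantiles, would reproduce the tails more faithfully than this coarse nine-point grid.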
Notes
Acknowledgments
The Synthetic LBD data were accessed through the Synthetic Data Server at Cornell University, which is funded through NSF Grants SES-1042181 and BCS-0941226 and a grant from the Alfred P. Sloan Foundation. Access to the Synthetic LBD is described at https://www2.vrdc.cornell.edu/news/synthetic-data-server/step-1-requesting-access-to-sds/. Use of the Integrated Census Microdata Project at the University of Essex was facilitated by Gillian Raab.