Synthetic Data via Quantile Regression for Heavy-Tailed and Heteroskedastic Data

  • Michelle Pistner
  • Aleksandra Slavković
  • Lars Vilhuber
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11126)

Abstract

Privacy protection of confidential data is a fundamental problem faced by many government organizations and research centers. It is further complicated when data have complex structures or variables with highly skewed distributions. The statistical community addresses general privacy concerns by introducing different techniques that aim to decrease disclosure risk in released data while retaining their statistical properties. However, methods for complex data structures have received insufficient attention. We propose producing synthetic data via quantile regression to address privacy protection of heavy-tailed and heteroskedastic data. We address some shortcomings of the previously proposed use of quantile regression as a synthesis method and extend the work into cases where data have heavy tails or heteroskedastic errors. Using a simulation study and two applications, we show that there are settings where quantile regression performs as well as or better than other commonly used synthesis methods on the basis of maintaining good data utility while simultaneously decreasing disclosure risk.

Notes

Acknowledgments

The Synthetic LBD data were accessed through the Synthetic Data Server at Cornell University, which is funded through NSF Grants SES-1042181 and BCS-0941226 and a grant from the Alfred P. Sloan Foundation. Access to the Synthetic LBD is described at https://www2.vrdc.cornell.edu/news/synthetic-data-server/step-1-requesting-access-to-sds/. Use of the Integrated Census Microdata Project at the University of Essex was facilitated by Gillian Raab.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Statistics, The Pennsylvania State University, University Park, USA
  2. Economics Department, Cornell University, Ithaca, USA
