v-Dispersed Synthetic Data Based on a Mixture Model with Constraints

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8744)

Abstract

In this paper, a new approach is proposed for generating synthetic microdata that reduces attribute disclosure for continuous variables. First, we define a metric of attribute disclosure called v-dispersion. This metric quantifies risk based on the volume of the multidimensional confidence regions for the original data values. Next, we describe a method that satisfies the requirements of v-dispersion. The method is based on a mixture model with constraints on the parameters of the components’ spread. Experiments with real data show that the proposed approach compares very favorably with other methods of disclosure limitation for continuous microdata in terms of utility and risk.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Oganian, A. (2014). v-Dispersed Synthetic Data Based on a Mixture Model with Constraints. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_16

  • DOI: https://doi.org/10.1007/978-3-319-11257-2_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11256-5

  • Online ISBN: 978-3-319-11257-2

  • eBook Packages: Computer Science (R0)
