v-Dispersed Synthetic Data Based on a Mixture Model with Constraints

Oganian, Anna

doi:10.1007/978-3-319-11257-2_16

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8744)

Abstract

This paper proposes a new approach to generating synthetic microdata that reduces attribute disclosure for continuous variables. First, we define a metric of attribute disclosure called v-dispersion, which quantifies risk based on the volume of the multidimensional confidence regions for the original data values. Next, we describe a method that satisfies the requirements of v-dispersion. The method is based on a mixture model with constraints on the parameters that govern the spread of its components. Experiments with real data show that the proposed approach compares very favorably, in terms of both utility and risk, with other methods of disclosure limitation for continuous microdata.
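
To make the link between "volume of a confidence region" and "constraints on component spread" concrete, the following is a minimal sketch and not the paper's definition or algorithm. It assumes a single bivariate Gaussian component, uses the standard closed-form area of its confidence ellipse, and shows one hypothetical way to enforce a lower bound on that area by scaling the covariance. The function names, the threshold `v_min`, and the uniform-scaling rule are all illustrative assumptions.

```python
import math


def confidence_ellipse_area(cov, level=0.95):
    """Area of the `level` confidence ellipse of a 2-D Gaussian with covariance `cov`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # With 2 degrees of freedom the chi-square quantile has a closed form.
    chi2_quantile = -2.0 * math.log(1.0 - level)
    return math.pi * chi2_quantile * math.sqrt(det)


def enforce_min_area(cov, v_min, level=0.95):
    """Hypothetical spread constraint: scale `cov` uniformly so the
    confidence ellipse has area at least `v_min`."""
    area = confidence_ellipse_area(cov, level)
    if area >= v_min:
        return [list(row) for row in cov]
    # In 2-D the area is linear in a uniform scale factor applied to `cov`.
    scale = v_min / area
    return [[scale * x for x in row] for row in cov]


if __name__ == "__main__":
    tight = [[0.01, 0.0], [0.0, 0.01]]  # a very concentrated component
    widened = enforce_min_area(tight, v_min=1.0)
    print(confidence_ellipse_area(tight), confidence_ellipse_area(widened))
```

In this toy setting, a component fitted too tightly around a few original records yields a small confidence region, and widening its covariance until the region reaches the assumed threshold is one simple way to picture a constraint on spread.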

Author information

Authors and Affiliations

National Center for Health Statistics, 3311 Toledo Rd, Hyattsville, MD, 20782, U.S.A.
Anna Oganian

Editor information

Editors and Affiliations

Department of Computer Engineering and Mathematics, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007, Tarragona, Catalonia
Josep Domingo-Ferrer

Cite this paper

Oganian, A. (2014). v-Dispersed Synthetic Data Based on a Mixture Model with Constraints. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_16

DOI: https://doi.org/10.1007/978-3-319-11257-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11256-5
Online ISBN: 978-3-319-11257-2
eBook Packages: Computer Science, Computer Science (R0)