Abstract
This paper proposes a new approach to generating synthetic microdata that reduces attribute disclosure for continuous variables. First, we define a metric of attribute disclosure called v-dispersion, which quantifies risk by the volume of the multidimensional confidence regions for the original data values. Next, we describe a method that satisfies the requirements of v-dispersion; it is based on a mixture model with constraints on the parameters that govern the components' spread. Experiments with real data show that the proposed approach compares very favorably with other disclosure limitation methods for continuous microdata in terms of both utility and risk.
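The abstract does not give the formal definition of v-dispersion or the exact form of the constraints, so the following is only a rough illustration of the underlying idea, not the paper's method. It assumes that each mixture component is Gaussian, that its confidence region is the ellipsoid with squared Mahalanobis radius `radius_sq`, and that a spread constraint can be imitated by inflating a component's covariance after fitting until the region reaches a minimum volume `v_min`. The function names and the post-hoc inflation step are hypothetical; a constraint enforced during estimation would behave differently.

```python
import math


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def ellipsoid_volume(cov, radius_sq):
    """Volume of {x : (x - mu)^T cov^{-1} (x - mu) <= radius_sq}."""
    d = len(cov)
    unit_ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return unit_ball * radius_sq ** (d / 2) * math.sqrt(determinant(cov))


def inflate_to_min_volume(cov, radius_sq, v_min):
    """Assumed stand-in for a spread constraint: scale cov just enough
    that the confidence region has volume at least v_min."""
    d = len(cov)
    volume = ellipsoid_volume(cov, radius_sq)
    if volume <= 0.0:
        raise ValueError("covariance must be positive definite")
    if volume >= v_min:
        return [row[:] for row in cov]
    # Scaling cov by s multiplies the volume by s ** (d / 2).
    s = (v_min / volume) ** (2.0 / d)
    return [[s * x for x in row] for row in cov]


# Two-dimensional example: for a bivariate Gaussian, the 95% region has
# squared radius -2 * ln(0.05), the chi-square quantile with 2 degrees of freedom.
radius_sq_95 = -2.0 * math.log(0.05)
tight_cov = [[0.04, 0.01], [0.01, 0.09]]
safe_cov = inflate_to_min_volume(tight_cov, radius_sq_95, v_min=2.0)
```

In this sketch a tightly concentrated component, which would let an intruder bound the original values within a small region, is widened until its region reaches the chosen volume, while components that are already dispersed are left unchanged.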
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Oganian, A. (2014). v-Dispersed Synthetic Data Based on a Mixture Model with Constraints. In: Domingo-Ferrer, J. (eds) Privacy in Statistical Databases. PSD 2014. Lecture Notes in Computer Science, vol 8744. Springer, Cham. https://doi.org/10.1007/978-3-319-11257-2_16
DOI: https://doi.org/10.1007/978-3-319-11257-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11256-5
Online ISBN: 978-3-319-11257-2
eBook Packages: Computer Science, Computer Science (R0)