Abstract
In this paper we propose a new scheme for statistical disclosure limitation that can be classified as a hybrid method of protection, that is, a method that combines properties of perturbative and synthetic methods. The approach is based on model-based clustering followed by synthesis of the records within each cluster. The novelty is that the clustering and synthesis methods have been chosen to fit each other with a view to reducing information loss. The model-based clustering seeks clusters within which the data distribution is approximately normal, so that a multivariate normal synthesizer can be used for the local synthesis of data. In this way, some of the non-normal characteristics of the data are captured by the clustering, and a simple synthesizer for normal data suffices within each cluster. Our method is shown to be effective when compared to other disclosure limitation strategies.
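The local synthesis step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the cluster labels have already been produced by some model-based clustering step, and it only shows how each cluster could be replaced by draws from a multivariate normal fitted to that cluster. The function names (`mean_and_cov`, `cholesky`, `synthesize_by_cluster`) and the toy data are hypothetical and chosen for illustration; only the Python standard library is used.

```python
import math
import random
from collections import defaultdict


def mean_and_cov(rows):
    """Sample mean vector and covariance matrix (divisor n - 1) of a list of records."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov


def cholesky(a):
    """Lower-triangular L with L L^T = a; assumes a is symmetric positive definite."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def synthesize_by_cluster(records, labels, seed=0):
    """Replace every cluster by as many draws from N(mean, cov) as it has records.

    Assumption: each cluster holds more records than attributes and is not
    degenerate, so that its sample covariance matrix is positive definite.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for rec, lab in zip(records, labels):
        clusters[lab].append(rec)
    synthetic = []
    for lab, rows in clusters.items():
        mu, cov = mean_and_cov(rows)
        low = cholesky(cov)
        d = len(mu)
        for _ in rows:
            # x = mu + L z with z standard normal has mean mu and covariance cov
            z = [rng.gauss(0.0, 1.0) for _ in range(d)]
            x = [mu[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
            synthetic.append((lab, x))
    return synthetic


# Toy data: two well-separated groups; the labels are assumed to be given.
records = [[1.0, 2.0], [1.5, 1.8], [0.8, 2.4], [1.2, 2.1],
           [8.0, 9.0], [8.4, 9.5], [7.7, 8.8], [8.1, 9.9]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
synthetic = synthesize_by_cluster(records, labels, seed=42)
```

In this sketch the released file keeps the cluster sizes of the original data, while the individual records are synthetic. How the clusters are formed, and how the number of clusters is selected, is left to the clustering step and is outside the scope of the example.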
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Oganian, A., Domingo-Ferrer, J. (2012). Hybrid Microdata via Model-Based Clustering. In: Domingo-Ferrer, J., Tinnirello, I. (eds) Privacy in Statistical Databases. PSD 2012. Lecture Notes in Computer Science, vol 7556. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33627-0_9
DOI: https://doi.org/10.1007/978-3-642-33627-0_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33626-3
Online ISBN: 978-3-642-33627-0
eBook Packages: Computer Science, Computer Science (R0)