Hybrid Microdata via Model-Based Clustering

Oganian, Anna; Domingo-Ferrer, Josep

doi:10.1007/978-3-642-33627-0_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7556)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

In this paper we propose a new scheme for statistical disclosure limitation that can be classified as a hybrid method of protection, that is, a method that combines properties of perturbative and synthetic methods. The approach is based on model-based clustering followed by synthesis of the records within each cluster. The novelty is that the clustering and synthesis methods have been chosen to complement each other so as to reduce information loss. The model-based clustering tries to obtain clusters such that the within-cluster data distribution is approximately normal; a multivariate normal synthesizer can then be used for the local synthesis of data. In this way, some of the non-normal characteristics of the data are captured by the clustering, so that a simple synthesizer for normal data can be used within each cluster. Our method is shown to be effective when compared to other disclosure limitation strategies.
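The local synthesis step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that cluster labels have already been produced by some model-based clustering step (not shown here), and the names `mean_and_cov`, `cholesky`, and `synthesize_by_cluster`, as well as the small diagonal jitter, are hypothetical choices made for this illustration. Each record is replaced by a draw from a multivariate normal distribution fitted to its own cluster.

```python
import math
import random
from collections import defaultdict


def mean_and_cov(rows):
    """Sample mean vector and unbiased covariance matrix of a list of rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


def cholesky(a, jitter=1e-9):
    """Lower-triangular L with L L^T = a + jitter * I.

    The jitter is an assumption of this sketch; it keeps the factorization
    from failing on near-singular covariance matrices.
    """
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(max(s + jitter, jitter))
            else:
                low[i][j] = s / low[j][j]
    return low


def synthesize_by_cluster(records, labels, seed=0):
    """Replace each record by a multivariate normal draw fitted to its cluster.

    `labels[i]` is the (assumed, externally computed) cluster of `records[i]`.
    The output keeps the original record order.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)

    synthetic = [None] * len(records)
    for idxs in groups.values():
        rows = [records[i] for i in idxs]
        if len(rows) < 2:
            raise ValueError("each cluster needs at least two records")
        mean, cov = mean_and_cov(rows)
        low = cholesky(cov)
        d = len(mean)
        for i in idxs:
            # x = mean + L z with z ~ N(0, I) has covariance L L^T.
            z = [rng.gauss(0.0, 1.0) for _ in range(d)]
            synthetic[i] = [mean[r] + sum(low[r][k] * z[k] for k in range(r + 1))
                            for r in range(d)]
    return synthetic
```

Calling `synthesize_by_cluster(records, labels)` returns one synthetic record per original record. Because the fit is done per cluster, differences between clusters (for example, distinct cluster means) are carried into the synthetic data even though each local synthesizer is a plain normal model.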

Author information

Authors and Affiliations

Department of Mathematical Sciences, Georgia Southern University, P.O. Box 8093, Statesboro, GA, 30460-8093, U.S.A.
Anna Oganian
Department of Computer Engineering and Maths, Universitat Rovira i Virgili, UNESCO Chair in Data Privacy, Av. Països Catalans 26, E-43007, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer

Editor information

Editors and Affiliations

Departament d’Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili UNESCO Chair in Data Privacy, Av. Països Catalans 26, E-43007, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer
Dipartimento di Ingegneria Elettrica, Elettronica e delle Telecomunicazioni, di tecnologie Chimiche, Automatica e modelli Matematici, Università degli Studi di Palermo, Viale delle Scienze, edificio 9, 90128, Palermo, Italy
Ilenia Tinnirello

About this paper

Cite this paper

Oganian, A., Domingo-Ferrer, J. (2012). Hybrid Microdata via Model-Based Clustering. In: Domingo-Ferrer, J., Tinnirello, I. (eds) Privacy in Statistical Databases. PSD 2012. Lecture Notes in Computer Science, vol 7556. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33627-0_9

DOI: https://doi.org/10.1007/978-3-642-33627-0_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33626-3
Online ISBN: 978-3-642-33627-0
eBook Packages: Computer Science, Computer Science (R0)