Hybrid Microdata via Model-Based Clustering

  • Anna Oganian
  • Josep Domingo-Ferrer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7556)


In this paper we propose a new scheme for statistical disclosure limitation that can be classified as a hybrid method of protection, that is, a method that combines properties of perturbative and synthetic methods. The approach is based on model-based clustering followed by synthesis of the records within each cluster. The novelty is that the clustering and synthesis methods have been chosen to fit each other, with a view to reducing information loss. The model-based clustering tries to obtain clusters such that the within-cluster data distribution is approximately normal; a multivariate normal synthesizer can then be used for the local synthesis of data. In this way, some of the non-normal characteristics of the data are captured by the clustering, so that a simple synthesizer for normal data can be used within each cluster. Our method is shown to be effective when compared to other disclosure limitation strategies.
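The within-cluster synthesis step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that cluster labels have already been produced by some model-based clustering step (for example a Gaussian mixture fitted by EM), that every cluster holds at least two records, and that each cluster's sample covariance is positive definite. The helper names `mean_and_cov`, `cholesky`, and `synthesize` are hypothetical, and only the Python standard library is used so that the block stays self-contained.

```python
import math
import random
from collections import defaultdict


def mean_and_cov(rows):
    """Sample mean vector and unbiased covariance matrix (needs >= 2 rows)."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mu, cov


def cholesky(a, ridge=1e-9):
    """Lower-triangular L with L L^T = a + ridge * I.

    The small ridge is an assumption of this sketch; it guards against a
    nearly singular covariance matrix.
    """
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] + ridge - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def synthesize(records, labels, rng):
    """Replace every record by a draw from N(mu_c, Sigma_c) of its cluster c."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    synthetic = [None] * len(records)
    for idxs in clusters.values():
        mu, cov = mean_and_cov([records[i] for i in idxs])
        L = cholesky(cov)
        d = len(mu)
        for i in idxs:
            # x = mu + L z with z ~ N(0, I) has covariance L L^T.
            z = [rng.gauss(0.0, 1.0) for _ in range(d)]
            synthetic[i] = [mu[r] + sum(L[r][k] * z[k] for k in range(r + 1))
                            for r in range(d)]
    return synthetic


# Toy data: two well-separated groups with known labels (illustrative only).
rng = random.Random(0)
records = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
           + [[rng.gauss(10, 2), rng.gauss(-5, 1)] for _ in range(50)])
labels = [0] * 50 + [1] * 50
synthetic = synthesize(records, labels, rng)
```

Because each cluster is synthesized from its own mean and covariance, features of the overall distribution such as multimodality are carried by the cluster structure rather than by the synthesizer, which is the division of labour the abstract describes.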

Keywords and Phrases

Statistical disclosure limitation (SDL), hybrid SDL methods, mixture models, model-based clustering, expectation-maximization (EM) algorithm





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Anna Oganian (1)
  • Josep Domingo-Ferrer (2)
  1. Department of Mathematical Sciences, Georgia Southern University, Statesboro, U.S.A.
  2. Department of Computer Engineering and Maths, Universitat Rovira i Virgili, UNESCO Chair in Data Privacy, Tarragona, Spain
