The VLDB Journal, Volume 15, Issue 4, pp 355–369

Efficient multivariate data-oriented microaggregation

  • Josep Domingo-Ferrer
  • Antoni Martínez-Ballesté
  • Josep Maria Mateo-Sanz
  • Francesc Sebé
Special Issue Paper

Abstract

Microaggregation is a family of methods for statistical disclosure control (SDC) of microdata (records on individuals and/or companies), that is, for masking microdata so that they can be released while preserving the privacy of the underlying individuals. The principle of microaggregation is to aggregate original database records into small groups prior to publication. Each group should contain at least k records to prevent disclosure of individual information, where k is a constant value preset by the data protector. Recently, microaggregation has been shown to be useful for achieving k-anonymity, in addition to being a good masking method. Optimal microaggregation (with minimum within-groups variability loss) can be computed in polynomial time for univariate data. Unfortunately, for multivariate data it is an NP-hard problem. Several heuristic approaches to microaggregation have been proposed in the literature. Heuristics yielding groups with fixed size k tend to be more efficient, whereas data-oriented heuristics yielding variable group size tend to result in lower information loss. This paper presents new data-oriented heuristics which improve on the trade-off between computational complexity and information loss and are thus usable for large datasets.
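
To make the grouping principle concrete, the following is a minimal sketch of a simple fixed-size heuristic for univariate data. It is not one of the heuristics presented in the paper and it is not the optimal univariate algorithm. It assumes that records are sorted by value, cut into consecutive groups of k records (with the last group absorbing any leftover records), and that each value is then replaced by its group mean. The function name `microaggregate_fixed_size` and the use of the within-group sum of squared errors as the information-loss measure are illustrative assumptions.

```python
from statistics import mean


def microaggregate_fixed_size(values, k):
    """Illustrative fixed-size univariate microaggregation (hypothetical helper).

    Returns the masked values in the original record order, together with the
    within-group sum of squared errors (SSE), used here as a simple
    information-loss measure.
    """
    n = len(values)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k records")

    # Sort record indices by value so that similar records share a group.
    order = sorted(range(n), key=lambda i: values[i])

    n_groups = n // k
    masked = [0.0] * n
    sse = 0.0
    for g in range(n_groups):
        start = g * k
        # The last group absorbs leftover records, so every group has
        # between k and 2k - 1 members.
        end = n if g == n_groups - 1 else start + k
        members = order[start:end]
        centroid = mean(values[i] for i in members)
        for i in members:
            masked[i] = centroid  # publish the group mean, not the raw value
            sse += (values[i] - centroid) ** 2
    return masked, sse
```

Calling `microaggregate_fixed_size(incomes, 3)` on a hypothetical list `incomes` would return masked values in which every published value is shared by at least three records. A data-oriented heuristic would instead let the group sizes vary between k and 2k - 1 wherever that lowers the SSE.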

Keywords

Statistical databases, Privacy, Anonymity, Statistical disclosure control, Microaggregation, Microdata protection

Copyright information

© Springer-Verlag 2006

Authors and Affiliations

  • Josep Domingo-Ferrer (1)
  • Antoni Martínez-Ballesté (1)
  • Josep Maria Mateo-Sanz (2)
  • Francesc Sebé (1)
  1. Department of Computer Engineering & Maths, Rovira i Virgili University of Tarragona, Tarragona, Catalonia
  2. Statistics Group, Rovira i Virgili University of Tarragona, Tarragona, Catalonia
