International Journal of Information Security, Volume 11, Issue 4, pp 253–267

Efficient microaggregation techniques for large numerical data volumes

Regular Contribution

Abstract

The conflicting requirements of data privacy and data analysis have fostered the development of statistical disclosure control techniques. In this context, microaggregation is one of the most frequently used methods, since it offers a good trade-off between simplicity and quality. Unfortunately, most of the currently available microaggregation algorithms were devised to work with small datasets, while the size of current databases is constantly increasing. The usual way to tackle this problem is to partition large data volumes into smaller fragments that available algorithms can process in reasonable time, but this solution comes at the cost of lower quality. In this paper, we revisit the computational needs of microaggregation and show that it can be reduced to two steps: sorting the dataset with respect to a vantage point, and performing a set of k-nearest neighbors searches. From this new point of view, we propose three new efficient, quality-preserving microaggregation algorithms based on k-nearest neighbors search techniques. We compare our approaches with the most significant strategies presented in the literature using three very large real datasets. Experimental results show that our proposals outperform previous techniques by keeping a better balance between performance and the quality of the anonymized dataset.
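The two-step view described in the abstract (sort with respect to a vantage point, then run k-nearest neighbors searches) can be illustrated with a minimal sketch. This is not one of the three algorithms proposed in the paper; it is a generic fixed-size heuristic written under stated assumptions: the dataset centroid is used as the vantage point, records are visited farthest-first, and the k-nearest neighbors search is a brute-force scan rather than an index-based search. The function name `microaggregate` and its return shape are hypothetical.

```python
import math


def microaggregate(records, k):
    """Illustrative sketch only: group records into clusters of at least k
    and replace each record by its group centroid.

    Assumptions (not taken from the paper): the vantage point is the dataset
    centroid, and the k-NN search is a brute-force scan.
    """
    n = len(records)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k records")
    dim = len(records[0])

    # Assumed vantage point: the centroid of the whole dataset.
    vantage = [sum(r[d] for r in records) / n for d in range(dim)]

    # Step 1: sort record indices by distance to the vantage point, farthest first.
    order = sorted(range(n),
                   key=lambda i: math.dist(records[i], vantage),
                   reverse=True)

    unassigned = set(range(n))
    groups = []
    for seed in order:
        if seed not in unassigned:
            continue
        if len(unassigned) < 2 * k:
            # Too few records left for two groups: they form the last group.
            groups.append(sorted(unassigned))
            break
        # Step 2: brute-force search for the k-1 nearest unassigned neighbors.
        candidates = unassigned - {seed}
        nearest = sorted(candidates,
                         key=lambda j: math.dist(records[j], records[seed]))[:k - 1]
        group = [seed] + nearest
        groups.append(group)
        unassigned.difference_update(group)

    # Replace every record by the centroid of its group.
    masked = [None] * n
    for group in groups:
        centroid = tuple(sum(records[i][d] for i in group) / len(group)
                         for d in range(dim))
        for i in group:
            masked[i] = centroid
    return masked, groups


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2),
            (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
    masked, groups = microaggregate(data, k=3)
    for original, protected in zip(data, masked):
        print(original, "->", protected)
```

In this sketch the brute-force scan makes the whole procedure quadratic in the number of records; replacing that scan with a spatial index is the kind of change the k-nearest neighbors formulation makes possible.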

Keywords

Microaggregation; Statistical disclosure control; Large data volumes; k-nearest neighbors search

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  1. Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
  2. Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain
  3. CA Labs, CA Technologies, Cornellà de Llobregat, Spain
