Enhancing aggregation phase of microaggregation methods for interval disclosure risk minimization

Data Mining and Knowledge Discovery

Abstract

Microaggregation is a masking mechanism for protecting confidential data in a public release. The technique can produce a k-anonymous dataset in which data records are partitioned into groups of at least k members. In each group, a representative centroid is computed by aggregating the group members and is published instead of the original records. In a conventional microaggregation algorithm, each centroid is the simple arithmetic mean of the group members. This naïve formulation does not consider the proximity of the published values to the original ones, so an intruder may be able to guess the original values. This paper proposes a disclosure-aware aggregation model, in which published values are computed at a given distance from the original ones to attain a more protected and more useful published dataset. Empirical results show that the proposed method achieves a better trade-off between disclosure risk and information loss than other similar anonymization techniques.
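
The conventional aggregation step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's method or any specific published algorithm: it assumes a single numerical attribute, forms groups by sorting and cutting into consecutive runs of k records, and merges a short trailing group into its predecessor. The function names microaggregate and sse and the sample data are hypothetical.

```python
from statistics import fmean


def microaggregate(values, k):
    """Replace each value by the arithmetic mean of its group (size >= k)."""
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k records")
    # Sort record indices by value so that similar records share a group.
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Cut the sorted order into consecutive groups of k records.
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Merge a short trailing group into the previous one to keep >= k members.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    published = [0.0] * len(values)
    for group in groups:
        # Conventional aggregation: the centroid is the simple group mean.
        centroid = fmean(values[i] for i in group)
        for i in group:
            published[i] = centroid
    return published


def sse(original, published):
    """Sum of squared errors, used here as a simple information-loss proxy."""
    return sum((x - y) ** 2 for x, y in zip(original, published))


# Hypothetical sample data (assumption, not taken from the paper).
incomes = [21.0, 23.0, 22.0, 40.0, 41.0, 45.0, 90.0]
masked = microaggregate(incomes, k=3)
```

With this sample, the first three records are all published as 22.0, which lies within 1 unit of each original value. That closeness is the proximity issue the abstract points out for mean-based centroids.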

Notes

  1. The measures are discussed in Sect. 2.3 with more details.

  2. For simplicity, we define \(Var(X)=\sigma ^2_X=1/n \sum _{i=1}^{n}(x_i-\mu _{X})^2\) where X is a set of n equally likely values \(x_i\) with \(\mu _{X}=Mean(X)\).

  3. We review some general-purpose \(\textit{DR}\) and \(\textit{IL}\) measures only for the continuous data type, which is the one addressed in this paper. Variants of the measures for other data types can be found in Hundepool et al. (2012).

  4. It is also known as identity disclosure or re-identification risk.

  5. Interval disclosure is a special case of attribute disclosure for continuous datasets.

  6. The heuristic can be simply extended to consider each attribute separately; however, our experiments show no significant improvement that would justify the additional cost.

  7. These methods are described in Sect. 3.

  8. Note that in Table 2, MDAV-DA usually performs better for \(k=5\) than for the other aggregation levels.

  9. In fact, we select the trade-off point whose \(\textit{DR}\) is closest to, but greater than, the value of MDAV-DA, to allow a greater (potential) decrease of \(\textit{IL}\) for the methods.

  10. An illustrative example is presented in Fig. 1.
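
The variance definition in Note 2 and the notion of interval disclosure in Note 5 can be illustrated with a short sketch. The function population_variance follows Note 2 directly. The function interval_disclosure_rate is an assumption made for illustration only: it counts a record as disclosed when its original value lies within p percent of the attribute's standard deviation of the published value. The paper's own DR measure is discussed in its Sect. 2.3 and is not reproduced here; the names and sample data below are hypothetical.

```python
from math import sqrt


def population_variance(xs):
    """Var(X) = (1/n) * sum((x_i - mu_X)^2), as defined in Note 2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


def interval_disclosure_rate(original, published, p):
    """Fraction of records whose original value falls inside an interval
    centred on the published value (interval form is an assumption)."""
    # Half-width of the interval: p percent of the attribute's std deviation.
    half_width = (p / 100.0) * sqrt(population_variance(original))
    hits = sum(1 for x, y in zip(original, published) if abs(x - y) <= half_width)
    return hits / len(original)


# Hypothetical data: two groups of four records, each published as its mean.
original = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
published = [3.5, 3.5, 3.5, 3.5, 6.5, 6.5, 6.5, 6.5]

variance = population_variance(original)                  # 4.0
rate = interval_disclosure_rate(original, published, 50)  # 0.5
```

Here the standard deviation is 2.0, so with p = 50 the interval half-width is 1.0, and four of the eight original values fall inside the interval around their published centroid.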

References

  • Askari M, Safavi-Naini R, Barker K (2012) An information theoretic privacy and utility measure for data sanitization mechanisms. In: Proceedings of the second ACM conference on data and application security and privacy, ACM, New York, NY, CODASPY, pp 283–294

  • Batet M, Erola A, Sánchez D, Castellà-Roca J (2013) Utility preserving query log anonymization via semantic microaggregation. Inf Sci 242:49–63

  • Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517

  • Brand R (2003) Microdata protection through noise addition. In: Domingo-Ferrer J (ed) Inference control in statistical databases. Lecture notes in computer science. Springer, Berlin, pp 97–116

  • Brand R, Domingo-Ferrer J, Mateo-Sanz J (2002) Reference data sets to test and compare SDC methods for protection of numerical microdata. European Project IST-2000-25069 CASC, http://neon.vb.cbs.nl/casc

  • Burridge J (2003) Information preserving statistical obfuscation. Stat Comput 13(4):321–327

  • Aggarwal CC, Yu PS (2008) Privacy-preserving data mining: models and algorithms. Springer, Boston

  • Defays D, Nanopoulos P (1993) Panels of enterprises and confidentiality: the small aggregates method. In: Proceedings of the 1992 symposium on design and analysis of longitudinal surveys, pp 195–204

  • Domingo-Ferrer J, Torra V (2001a) Disclosure protection methods and information loss for microdata. Confidentiality, disclosure and data access: theory and practical applications for statistical agencies, pp 91–110

  • Domingo-Ferrer J, Torra V (2001b) A quantitative comparison of disclosure control methods for microdata. Confidentiality, disclosure and data access: theory and practical applications for statistical agencies, pp 111–134

  • Domingo-Ferrer J, Torra V (2005) Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min Knowl Discov 11(2):195–212

  • Domingo-Ferrer J, Rebollo-Monedero D (2009) Measuring risk and utility of anonymized data using information theory. In: Proceedings of the EDBT/ICDT Workshops, ACM, New York, NY, EDBT/ICDT, pp 126–130

  • Domingo-Ferrer J, Mateo-Sanz JM, Torra V (2001) Comparing SDC methods for microdata on the basis of information loss and disclosure risk. In: Pre-proceedings of ETK-NTTS, vol 2, pp 807–826

  • Domingo-Ferrer J, Martínez-Ballesté A, Mateo-Sanz JM, Sebé F (2006a) Efficient multivariate data-oriented microaggregation. VLDB J 15(4):355–369

  • Domingo-Ferrer J, Solanas A, Martinez-Balleste A (2006b) Privacy in statistical databases: k-anonymity through microaggregation. In: Proceedings of international conference on granular computing, IEEE, pp 774–777

  • Domingo-Ferrer J, Sebé F, Solanas A (2008) An anonymity model achievable via microaggregation. In: Secure data management, Springer, Heidelberg, pp 209–218

  • Drud AS (1994) CONOPT a large-scale GRG code. ORSA J Comput 6(2):207–216

  • Fayyoumi E, Oommen BJ (2010) A survey on statistical disclosure control and micro-aggregation techniques for secure statistical databases. Softw Pract Exp 40(12):1161–1188

  • Hansen S, Mukherjee S (2003) A polynomial algorithm for optimal univariate microaggregation. IEEE Trans Knowl Data Eng 15(4):1043–1044

  • Heaton B (2012) New record ordering heuristics for multivariate microaggregation. PhD thesis, Nova Southeastern University

  • Herranz J, Matwin S, Nin J, Torra V (2010) Classifying data from protected statistical datasets. Comput Secur 29(8):875–890

  • Herranz J, Nin J, Solé M (2012a) Kd-trees and the real disclosure risks of large statistical databases. Inf Fusion 13(4):260–273

  • Herranz J, Nin J, Solé M (2012b) More hybrid and secure protection of statistical data sets. IEEE Trans Dependable Secur Comput 9(5):727–740

  • Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Nordholt ES, Spicer K, De Wolf PP (2006) Cenex SDC handbook on statistical disclosure control, version 1.01

  • Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Nordholt ES, Spicer K, De Wolf PP (2012) Statistical disclosure control. Wiley, Chichester

  • Kim JJ (1986) A method for limiting disclosure in microdata based on random noise and transformation. In: Proceedings of the ASA section on survey research methodology, pp 303–308

  • Laszlo M, Mukherjee S (2005) Minimum spanning tree partitioning algorithm for microaggregation. IEEE Trans Knowl Data Eng 17(7):902–911

  • Li Y, Zhu S, Wang L, Jajodia S (2002) A privacy-enhanced microaggregation method. In: Eiter T, Schewe KD (eds) Foundations of information and knowledge systems. Lecture notes in computer science. Springer, Berlin, pp 148–159

  • Lin JL, Chang PC, Liu JYC, Wen TH (2010) Comparison of microaggregation approaches on anonymized data quality. Expert Syst Appl 37(12):8161–8165

  • López A (2011) Effect of microaggregation on regression results: an application to Spanish innovation data. Empirical Econ Lett 10(12):1265–1272

  • Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M (2007) L-diversity: privacy beyond k-anonymity. ACM Trans Knowl Discov From Data (TKDD) 1(1):1–52

  • Mateo-Sanz J, Sebé F, Domingo-Ferrer J (2004) Outlier protection in continuous microdata masking. In: Domingo-Ferrer J, Torra V (eds) Privacy in statistical databases. Lecture notes in computer science. Springer, Berlin, pp 201–215

  • Mateo-Sanz J, Domingo-Ferrer J, Sebé F (2005) Probabilistic information loss measures in confidentiality protection of continuous microdata. Data Min Knowl Discov 11(2):181–193

  • Moore Jr RA (1996) Controlled data-swapping techniques for masking public use microdata sets. Tech. Rep. 96-04, Statistical Research Division Report Series, US Bureau of the Census, Washington D.C

  • Mortazavi R, Jalili S (2014) Fast data-oriented microaggregation algorithm for large numerical datasets. Knowl Based Syst 67:195–205

  • Mortazavi R, Jalili S (2015) Preference-based anonymization of numerical datasets by multi-objective microaggregation. Inf Fusion 25:85–104

  • Mortazavi R, Jalili S, Gohargazi H (2013) Multivariate microaggregation by iterative optimization. Appl Intell 39(3):529–544

  • Navarro-Arribas G, Torra V (2009) Towards microaggregation of log files for Web usage mining in B2C e-commerce. In: Fuzzy information processing society (NAFIPS), IEEE, pp 1–6

  • Navarro-Arribas G, Torra V (2012) Information fusion in data privacy: a survey. Inf Fusion 13(4):235–244

  • Nin J, Herranz J, Torra V (2008) On the disclosure risk of multivariate microaggregation. Data Knowl Eng 67(3):399–412

  • Oganian A, Domingo-Ferrer J (2001) On the complexity of optimal microaggregation for statistical disclosure control. Stat J U N Econ Com Eur 18(4):345–354

  • Oganian A, Karr AF (2006) Combinations of SDC methods for microdata protection. In: Privacy in Statistical Databases, Springer, Heidelberg, pp 102–113

  • Pagliuca D, Seri G (1999) Some results of individual ranking method on the system of enterprise accounts annual survey. Report, Esprit SDC Project, Deliverable MI-3 D

  • Schmid M, Schneeweiss H, Küchenhoff H (2007) Estimation of a linear regression under microaggregation with the response variable as a sorting variable. Statis Neerl 61(4):407–431

  • Solanas A (2008) Privacy protection with genetic algorithms. In: Yang A, Shan Y, Bui L (eds) Success in evolutionary computation, studies in computational intelligence. Springer, Berlin, pp 215–237

  • Solanas A, Sebé F, Domingo-Ferrer J (2008) Micro-aggregation-based heuristics for p-sensitive k-anonymity: one step beyond. In: Proceedings of the 2008 international workshop on privacy and anonymity in information society, ACM, pp 61–69

  • Solé M, Muntés-Mulero V, Nin J (2012) Efficient microaggregation techniques for large numerical data volumes. Int J Inf Secur 11(4):253–267

  • Sweeney L (2002) k-Anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst 10(5):557–570

  • Torra V (2005) Fuzzy c-means for fuzzy hierarchical clustering. In: The 14th IEEE international conference on fuzzy systems, IEEE, pp 646–651

  • Truta TM, Vinay B (2006) Privacy protection: p-sensitive k-anonymity property. In: Proceedings 22nd international conference on data engineering workshops, IEEE, pp 94–94

  • Willenborg LC, De Waal T (2001) Elements of statistical disclosure control, vol 155. Springer, New York

  • Winkler WE (2004) Re-identification methods for masked microdata. In: Privacy in statistical databases, Springer, Berlin, pp 216–230

  • Yancey W, Winkler W, Creecy R (2002) Disclosure risk assessment in perturbative microdata protection. In: Domingo-Ferrer J (ed) Inference control in statistical databases. Lecture notes in computer science. Springer, Berlin, pp 135–152

Acknowledgments

This research is partially supported by ITRC (Iran Telecommunication Research Center) under Contract No. 12200/500.

Author information

Corresponding author

Correspondence to Saeed Jalili.

Additional information

Responsible editor: Kristian Kersting.

About this article

Cite this article

Mortazavi, R., Jalili, S. Enhancing aggregation phase of microaggregation methods for interval disclosure risk minimization. Data Min Knowl Disc 30, 605–639 (2016). https://doi.org/10.1007/s10618-015-0432-z

  • DOI: https://doi.org/10.1007/s10618-015-0432-z
