Enhancing data utility in differential privacy via microaggregation-based \(k\)-anonymity

  • Regular Paper
The VLDB Journal

Abstract

The data anonymization literature often contrasts the “old” \(k\)-anonymity model with the “new” differential privacy model, which offers more robust privacy guarantees. Yet it is often overlooked that the utility of the anonymized results provided by differential privacy is quite limited, either because of the amount of noise that must be added to the output or because utility can only be guaranteed for restricted types of queries. In contrast, \(k\)-anonymity mechanisms make no assumptions about the uses of the anonymized data and focus on preserving data utility from a general perspective. In this paper, we show that a synergy between differential privacy and \(k\)-anonymity can be found: \(k\)-anonymity can help improve the utility of differentially private responses to arbitrary queries. We devote special attention to improving the utility of differentially private published data sets. Specifically, we show that the amount of noise required to fulfill \(\varepsilon\)-differential privacy can be reduced if the noise is added to a \(k\)-anonymous version of the data set, where \(k\)-anonymity is reached through a specially designed microaggregation of all attributes. As a result of this noise reduction, the general analytical utility of the anonymized output increases. The theoretical benefits of our proposal are illustrated in a practical setting with an empirical evaluation on three data sets.
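The intuition behind the noise reduction can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm of the paper: it assumes a single numerical attribute bounded in a range of width `delta`, a naive sort-and-cut grouping (the hypothetical helper `microaggregate`), and cluster membership that is treated as fixed, which is precisely the issue the specially designed microaggregation has to handle. Under those assumptions, changing one record moves a cluster mean by at most `delta / k`, so the Laplace scale needed for a mean is `k` times smaller than for a raw value.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Scale of the Laplace noise that masks a query of given sensitivity."""
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """One Laplace(0, scale) draw, as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def microaggregate(values, k):
    """Naive univariate microaggregation (illustrative assumption):
    sort, cut into consecutive groups of k, merge leftovers into the
    last group, and return one mean (centroid) per group."""
    ordered = sorted(values)
    n_groups = max(1, len(ordered) // k)
    groups = [ordered[i * k:(i + 1) * k] for i in range(n_groups)]
    groups[-1].extend(ordered[n_groups * k:])
    return [sum(g) / len(g) for g in groups]


# Hypothetical parameters: attribute range width, privacy budget, group size.
delta, epsilon, k = 100.0, 1.0, 5

raw_scale = laplace_scale(delta, epsilon)       # noise scale for a raw value
agg_scale = laplace_scale(delta / k, epsilon)   # noise scale for a group mean

rng = random.Random(0)
data = [rng.uniform(0.0, delta) for _ in range(20)]
centroids = microaggregate(data, k)
noisy_centroids = [c + laplace_noise(agg_scale, rng) for c in centroids]
```

With these parameters, `raw_scale` is 100.0 and `agg_scale` is 20.0. The price is that each released value is a group mean rather than an individual value, so the gain in noise has to be weighed against the information lost by aggregation.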

Acknowledgments

This work was partly supported by the Government of Catalonia under grant 2009 SGR 1135, by the Spanish Government through projects TIN2011-27076-C03-01 “CO-PRIVACY,” TIN2012-32757 “ICWT,” IPT2012-0603-430000 “BallotNext” and CONSOLIDER INGENIO 2010 CSD2007-00004 “ARES,” and by the European Commission under FP7 projects “DwB” and “Inter-Trust.” The second author is partially supported as an ICREA Acadèmia researcher by the Government of Catalonia. The authors are with the UNESCO Chair in Data Privacy, but they are solely responsible for the views expressed in this paper, which do not necessarily reflect the position of UNESCO nor commit that organization.

Author information

Corresponding author

Correspondence to David Sánchez.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 192 KB)

About this article

Cite this article

Soria-Comas, J., Domingo-Ferrer, J., Sánchez, D. et al. Enhancing data utility in differential privacy via microaggregation-based \(k\)-anonymity. The VLDB Journal 23, 771–794 (2014). https://doi.org/10.1007/s00778-014-0351-4

  • DOI: https://doi.org/10.1007/s00778-014-0351-4
