The VLDB Journal, Volume 23, Issue 5, pp 771–794

Enhancing data utility in differential privacy via microaggregation-based \(k\)-anonymity

  • Jordi Soria-Comas
  • Josep Domingo-Ferrer
  • David Sánchez
  • Sergio Martínez
Regular Paper

Abstract

It is not uncommon in the data anonymization literature to contrast the “old” \(k\)-anonymity model with the “new” differential privacy model, which offers more robust privacy guarantees. Yet, it is often disregarded that the utility of the anonymized results provided by differential privacy is quite limited, either because of the amount of noise that needs to be added to the output or because utility can only be guaranteed for a restricted class of queries. This is in contrast with \(k\)-anonymity mechanisms, which make no assumptions about the uses of the anonymized data and focus on preserving data utility from a general perspective. In this paper, we show that a synergy between differential privacy and \(k\)-anonymity can be found: \(k\)-anonymity can help improve the utility of differentially private responses to arbitrary queries. We devote special attention to improving the utility of differentially private published data sets. Specifically, we show that the amount of noise required to fulfill \(\varepsilon \)-differential privacy can be reduced if noise is added to a \(k\)-anonymous version of the data set, where \(k\)-anonymity is reached through a specially designed microaggregation of all attributes. As a result of this noise reduction, the general analytical utility of the anonymized output is increased. The theoretical benefits of our proposal are illustrated in a practical setting with an empirical evaluation on three data sets.
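The intuition behind the noise reduction can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names, the univariate sort-based grouping, and the example values are all assumptions made for illustration. It relies only on a standard fact: if an attribute is bounded to a range of width Δ, changing one record moves a single value by at most Δ, but moves the mean of a fixed group of k records by at most Δ/k, so the Laplace noise scale needed for that mean is k times smaller. The sensitivity analysis of the full microaggregated data set given in the paper is not reproduced here.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Scale b of Laplace(0, b) noise for a query with the given L1 sensitivity."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def microaggregate(values, k):
    """Illustrative univariate microaggregation (an assumption, not the paper's method).

    Sorts the values, splits them into consecutive groups of k records
    (the last group absorbs the remainder), and replaces every value by
    the mean of its group, so each output value is shared by >= k records.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need at least k records and k >= 1")
    ordered = sorted(values)
    n_groups = len(ordered) // k
    out = []
    for g in range(n_groups):
        start = g * k
        end = start + k if g < n_groups - 1 else len(ordered)
        group = ordered[start:end]
        mean = sum(group) / len(group)
        out.extend([mean] * len(group))
    return out


def sample_laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_group_mean(group, value_range, epsilon, rng):
    """Mean of a fixed group plus Laplace noise calibrated to sensitivity range/k."""
    k = len(group)
    scale = laplace_scale(value_range / k, epsilon)
    return sum(group) / k + sample_laplace(scale, rng)
```

For example, with a hypothetical attribute range of 100 and ε = 1, `laplace_scale(100, 1.0)` gives the scale for a single raw value, while `laplace_scale(100 / 5, 1.0)` gives the five times smaller scale for the mean of a fixed group of k = 5 records.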

Keywords

  • Privacy-preserving data publishing
  • Differential privacy
  • \(k\)-Anonymity
  • Microaggregation
  • Data utility

Notes

Acknowledgments

This work was partly supported by the Government of Catalonia under grant 2009 SGR 1135, by the Spanish Government through projects TIN2011-27076-C03-01 “CO-PRIVACY,” TIN2012-32757 “ICWT,” IPT2012-0603-430000 “BallotNext” and CONSOLIDER INGENIO 2010 CSD2007-00004 “ARES,” and by the European Commission under FP7 projects “DwB” and “Inter-Trust.” The second author is partially supported as an ICREA Acadèmia researcher by the Government of Catalonia. The authors are with the UNESCO Chair in Data Privacy, but they are solely responsible for the views expressed in this paper, which do not necessarily reflect the position of UNESCO nor commit that organization.

Supplementary material

Supplementary material 1: 778_2014_351_MOESM1_ESM.pdf (PDF, 192 KB)

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Jordi Soria-Comas (1)
  • Josep Domingo-Ferrer (1)
  • David Sánchez (1)
  • Sergio Martínez (1)

  1. Department of Computer Science and Mathematics, UNESCO Chair in Data Privacy, Universitat Rovira i Virgili, Tarragona, Spain
