Differentially private multidimensional data publishing

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Organizations collect data about individuals for many reasons, such as service improvement. To mine the collected data for useful information, these organizations commonly publish it to data analysts, research institutes, or simply the general public. The quality of the published data significantly affects the accuracy of data analysis and, in turn, decision making at the corporate level. In this study, we explore privacy-preserving data publishing, i.e., publishing high-quality data without compromising the privacy of the individuals whose data are being published. Syntactic privacy models, such as k-anonymity, impose syntactic privacy requirements and make certain assumptions about an adversary’s background knowledge. To address this shortcoming, we adopt differential privacy, a rigorous privacy model that is independent of any adversary’s knowledge and insensitive to the underlying data. The published data should preserve individuals’ privacy, yet remain useful for analysis. To maintain data utility, we propose DiffMulti, a workload-aware and differentially private algorithm that employs multidimensional generalization. We devise an efficient implementation of the proposed algorithm and use a real-life data set for experimental analysis. We evaluate the performance of our method in terms of data utility, efficiency, and scalability. Compared to closely related existing methods, DiffMulti significantly improves data utility, in some cases by orders of magnitude.
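
For readers unfamiliar with the privacy model named in the abstract, the standard definition of \(\epsilon\)-differential privacy can be stated as follows. The symbols \(\mathcal{A}\), \(D\), \(D'\), and \(S\) are introduced here for illustration only; the paper's own notation and formal statement may differ. A randomized algorithm \(\mathcal{A}\) gives \(\epsilon\)-differential privacy if, for all data sets \(D\) and \(D'\) that differ in at most one record and for every set of possible outputs \(S \subseteq \mathrm{Range}(\mathcal{A})\),

\[
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in S].
\]

A smaller \(\epsilon\) forces the two output distributions closer together, which is why the guarantee does not depend on what an adversary already knows about the remaining records.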

Notes

  1. https://www.popdata.bc.ca/.

  2. Unless chosen randomly, defining a fixed generalization function \(\phi\) is a non-trivial task. The domain of \(\phi\) is as large as the cardinality of the input data set. Moreover, the codomain of \(\phi\) is a set of d-dimensional regions, each bounded by either an interval or a value from the generalization hierarchy. Our proposed algorithm effectively partitions the regions to maintain data utility.
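
To make the idea in Note 2 concrete, the following minimal sketch shows one possible fixed generalization function that maps each record to a d-dimensional region, here with d = 2: a numerical attribute bounded by an interval and a categorical attribute bounded by a value from a generalization hierarchy. The attribute names, the cut points in `AGE_CUTS`, and the hierarchy in `JOB_HIERARCHY` are hypothetical and are not taken from the paper. The sketch does not implement DiffMulti's partitioning and offers no privacy guarantee on its own.

```python
from bisect import bisect_right

# Hypothetical generalization hierarchy: leaf value -> parent value.
JOB_HIERARCHY = {
    "Engineer": "Professional",
    "Lawyer": "Professional",
    "Painter": "Artist",
    "Dancer": "Artist",
}

# Hypothetical cut points splitting Age into [0, 30), [30, 60), [60, 120).
AGE_CUTS = [0, 30, 60, 120]


def generalize_age(age):
    """Return the half-open interval (lo, hi) from AGE_CUTS that contains age."""
    if not AGE_CUTS[0] <= age < AGE_CUTS[-1]:
        raise ValueError(f"age {age} is outside the supported domain")
    i = bisect_right(AGE_CUTS, age)
    return (AGE_CUTS[i - 1], AGE_CUTS[i])


def phi(record):
    """Map a record (age, job) to a 2-dimensional region (interval, parent value)."""
    age, job = record
    return (generalize_age(age), JOB_HIERARCHY[job])


def group_by_region(records):
    """Count how many records fall into each generalized region."""
    counts = {}
    for record in records:
        region = phi(record)
        counts[region] = counts.get(region, 0) + 1
    return counts
```

With this fixed mapping, every record is assigned to exactly one region, so a data set is summarized by one count per region. Choosing the cut points and hierarchy levels well is the non-trivial part the note refers to.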

Acknowledgements

The research is supported in part by the Discovery Grants (356065-2013) from the Natural Sciences and Engineering Research Council of Canada (NSERC), Canada Research Chairs Program (950-230623), Research Incentive Funds (R15046 and R15048) from Zayed University, Research Grants (61272306) from the National Natural Science Foundation of China (NSFC), and Research Grants (LY17F020004) from the Zhejiang Natural Science Foundation of China (ZJNSF). The work was partially completed while Benjamin C. M. Fung was visiting the Department of Computer Science at Hong Kong Baptist University.

Author information

Corresponding author

Correspondence to Benjamin C. M. Fung.

About this article

Cite this article

Al-Hussaeni, K., Fung, B.C.M., Iqbal, F. et al. Differentially private multidimensional data publishing. Knowl Inf Syst 56, 717–752 (2018). https://doi.org/10.1007/s10115-017-1132-3

  • DOI: https://doi.org/10.1007/s10115-017-1132-3
