Knowledge and Information Systems, Volume 56, Issue 3, pp 717–752

Differentially private multidimensional data publishing

  • Khalil Al-Hussaeni
  • Benjamin C. M. Fung
  • Farkhund Iqbal
  • Junqiang Liu
  • Patrick C. K. Hung
Regular Paper

Abstract

Organizations collect data about individuals for various reasons, such as service improvement. To mine the collected data for useful information, these organizations commonly publish it to data analysts, research institutes, or simply the general public. The quality of published data significantly affects the accuracy of the data analysis and thus affects decision making at the corporate level. In this study, we explore the research area of privacy-preserving data publishing, i.e., publishing high-quality data without compromising the privacy of the individuals whose data are being published. Syntactic privacy models, such as k-anonymity, impose syntactic privacy requirements and make certain assumptions about an adversary’s background knowledge. To address this shortcoming, we adopt differential privacy, a rigorous privacy model that is independent of any adversary’s knowledge and insensitive to the underlying data. The published data should preserve individuals’ privacy, yet remain useful for analysis. To maintain data utility, we propose DiffMulti, a workload-aware and differentially private algorithm that employs multidimensional generalization. We devise an efficient implementation of the proposed algorithm and use a real-life data set for experimental analysis. We evaluate the performance of our method in terms of data utility, efficiency, and scalability. When compared to closely related existing methods, DiffMulti significantly improved data utility, in some cases by orders of magnitude.
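
As a rough illustration of the general idea named in the abstract (releasing counts over generalized groups under differential privacy), the following is a minimal sketch and not the DiffMulti algorithm itself, whose details are not given on this page. It assumes the standard Laplace mechanism, a fixed data-independent set of generalized cells, and neighboring data sets that differ by adding or removing one record (so the histogram has sensitivity 1). All names (`laplace_noise`, `noisy_partition_counts`, `age_band`) are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_partition_counts(records, generalize, domain, epsilon, rng=None):
    """Return Laplace-perturbed counts for every generalized cell in `domain`.

    Assumption: `domain` is fixed in advance (independent of the data), and
    `generalize` maps each record to exactly one cell of `domain`.
    """
    rng = rng or random.Random()
    # Start from the full domain so that empty cells also receive noise.
    counts = {cell: 0 for cell in domain}
    for record in records:
        counts[generalize(record)] += 1
    # Adding or removing one record changes one cell count by 1,
    # so noise with scale 1/epsilon is used for each cell (assumed setting).
    scale = 1.0 / epsilon
    return {cell: c + laplace_noise(scale, rng) for cell, c in counts.items()}

def age_band(record):
    # Hypothetical generalization: map an exact age to a coarse range.
    return "[0-40)" if record["age"] < 40 else "[40-120)"

if __name__ == "__main__":
    data = [{"age": 23}, {"age": 37}, {"age": 58}]
    print(noisy_partition_counts(data, age_band, ["[0-40)", "[40-120)"], epsilon=1.0))
```

A smaller `epsilon` gives stronger privacy but noisier counts. Coarser cells hold more records each, so the same noise costs less in relative terms.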

Keywords

  • Data sharing
  • Privacy protection
  • Differential privacy
  • Multidimensional generalization

Notes

Acknowledgements

The research is supported in part by the Discovery Grants (356065-2013) from the Natural Sciences and Engineering Research Council of Canada (NSERC), Canada Research Chairs Program (950-230623), Research Incentive Funds (R15046 and R15048) from Zayed University, Research Grants (61272306) from the National Natural Science Foundation of China (NSFC), and Research Grants (LY17F020004) from the Zhejiang Natural Science Foundation of China (ZJNSF). The work was partially completed while Benjamin C. M. Fung was visiting the Department of Computer Science at Hong Kong Baptist University.

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2017

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
  2. School of Information Studies, McGill University, Montreal, Canada
  3. College of Technological Innovation, Zayed University, Abu Dhabi, UAE
  4. School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou, China
  5. Faculty of Business and Information Technology, University of Ontario Institute of Technology, Oshawa, Canada
