IHP: improving the utility in differential private histogram publication

Abstract

Differential privacy (DP) is a promising tool for preserving privacy during data publication, as it provides strong theoretical privacy guarantees in the face of adversaries with arbitrary background knowledge. A histogram, as the result of a set of count queries, serves as a core statistical tool for reporting data distributions and is in fact viewed as the fundamental method for many other statistical analyses, such as range queries; it is thus an important form of data publishing. In this paper, we consider the scenario of publishing sensitive histogram data under a differential privacy scheme. Existing work in this field has justified that, compared to directly applying DP techniques (i.e., injecting noise) over the counts in histogram bins, grouping bins before noise injection is more effective (i.e., yields higher utility), as it introduces much less error over the sanitized histogram given the same privacy budget. However, state-of-the-art works have not unveiled how the overall utility of a sanitized histogram is affected by the balance of the privacy budget distributed between the grouping and noise-injection phases. In this work, we conduct a theoretical study of how the probability of obtaining better groups can be improved, such that the overall error introduced in the sanitized histogram can be further reduced, which directly leads to higher utility for the sanitized histograms. In particular, we show that the probability of achieving better grouping is affected by two factors, namely the privacy budget assigned to grouping and the normalized utility function used for selecting groups. Motivated by that, we propose a new DP histogram publishing scheme, namely Iterative Histogram Partition, in which we carefully assign the privacy budget between the grouping and injection phases based on our theoretical study. We also theoretically prove that \(\epsilon \)-differential privacy can be achieved under our new scheme.
Moreover, we show that, under the same privacy budget, our scheme exhibits smaller errors in the sanitized histograms compared with state-of-the-art methods. We also extend the model to multi-dimensional histogram publication. Finally, an empirical study over four real-world datasets justifies that our scheme achieves the lowest error among a series of state-of-the-art baseline methods.
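To illustrate the grouping idea referred to above, here is a minimal sketch (not the IHP algorithm itself; the histogram, groups, and budget below are illustrative): bins with similar counts are merged, one noisy sum is released per group, and each bin inherits the noisy group mean, trading a small approximation error for much less Laplace noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def direct_laplace(hist, eps):
    # Baseline: inject Laplace(1/eps) noise into every bin
    # (each count query has sensitivity 1).
    return hist + rng.laplace(scale=1.0 / eps, size=hist.size)

def group_then_noise(hist, groups, eps):
    # Grouping: release one noisy sum per group and spread it over
    # the group's bins, so the per-bin noise shrinks with group size,
    # at the cost of a small approximation error within each group.
    out = np.empty(hist.size)
    for g in groups:
        noisy_sum = hist[g].sum() + rng.laplace(scale=1.0 / eps)
        out[g] = noisy_sum / len(g)
    return out

# Illustrative histogram: two runs of near-uniform bins.
hist = np.array([10.0, 11.0, 9.0, 50.0, 52.0, 48.0])
groups = [[0, 1, 2], [3, 4, 5]]  # bins with similar counts grouped together
eps = 0.1
err_direct = np.abs(direct_laplace(hist, eps) - hist).sum()
err_grouped = np.abs(group_then_noise(hist, groups, eps) - hist).sum()
```

Averaged over many runs, the grouped version's total L1 error is substantially smaller here, which is exactly the effect a grouping-based scheme exploits when it spends part of the budget on finding good groups.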



Notes

  1. In the sequel, we may use the terms “cluster” and “group” interchangeably to facilitate the discussion.

  2. In the standard form, \(D_2\) is in fact obtained by either inserting or deleting a single record from \(D_1\). Modifying a record cannot be viewed as neighboring, as it in fact consists of both a delete and an insert operation.

  3. In the following, we shall use 2-dimensional data cube and 2-dimensional histogram interchangeably.


Acknowledgements

The work is supported by the National Natural Science Foundation of China (Nos. 61672408 and 61472298), the Director Fund of PSRPC, the Fundamental Research Funds for the Central Universities (No. JB181505), the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JM6073), and the China 111 Project (No. B16037).

Author information


Corresponding author

Correspondence to Hui Li.


Appendices

Appendix A: Proof of Theorem 3

First of all, as the utility function solely depends on the grouping algorithm’s error metric, given \(A_1\) and \(A_2\) following the same error metric, it is obvious that \(u_1(D,r)=u_2(D,r)\) and \(\Delta u_1=\Delta u_2\).

Proof of first feature

\(\forall m\in [1,\ell ]\), as \(u_1(D,r)=u_2(D,r)\) and \(\Delta u_1=\Delta u_2\), then \(\phi _{2,m}/\phi _{1,m}=\epsilon _2/\epsilon _1=\alpha \).

Therefore, \(\phi _{1,m}<\phi _{2,m}\) as \(\epsilon _2>\epsilon _1\).

As a result,

$$\begin{aligned} Pr[r_m\in \mathbf {H}_2^{\prime }]= & {} \frac{\exp (\phi _{2,m})}{\sum \limits _{j=1}^\ell {\exp (\phi _{2,j})}}=\frac{\exp (\alpha \phi _{1,m})}{\sum _{j=1}^\ell {\exp (\alpha \phi _{1,j})}} \end{aligned}$$

Similarly,

$$\begin{aligned} Pr[r_m\in \mathbf {H}_1^{\prime }]= & {} \frac{\exp (\phi _{1,m})}{{\sum _{j=1}^\ell {\exp (\phi _{1,j})}}}. \end{aligned}$$

Then,

$$\begin{aligned}&Pr[r_m\in \mathbf {H}_2^{\prime }]>Pr[r_m\in \mathbf {H}_1^{\prime }]\\&\quad \Leftrightarrow \frac{\sum _{j = 1}^{m - 1} \left( \exp \left( \alpha \phi _{1,j} - (\alpha - 1)\phi _{1,m}\right) - \exp \left( \phi _{1,j}\right) \right) }{\sum _{j = m + 1}^\ell \left( \exp \left( \phi _{1,j}\right) - \exp \left( \alpha \phi _{1,j} - (\alpha - 1)\phi _{1,m}\right) \right) }< 1\\&\quad \Leftrightarrow \sum \limits _{j = 1,j \ne m}^\ell \exp \left( \alpha \phi _{1,j} - (\alpha - 1)\phi _{1,m}\right)< \sum \limits _{j = 1,j \ne m}^\ell \exp \left( \phi _{1,j}\right) \\&\quad \Leftrightarrow \frac{\sum _{j = 1}^\ell \exp \left( \alpha \phi _{1,j}\right) - \exp \left( \alpha \phi _{1,m}\right) }{\exp \left( (\alpha - 1)\phi _{1,m}\right) }< \sum \limits _{j = 1}^\ell \exp \left( \phi _{1,j}\right) - \exp \left( \phi _{1,m}\right) \\&\quad \Leftrightarrow \sum \limits _{j = 1}^\ell \exp \left( \alpha \phi _{1,j}\right) < \sum \limits _{j = 1}^\ell \exp \left( \phi _{1,j}\right) \times \exp \left( (\alpha - 1)\phi _{1,m}\right) \\&\quad \Leftrightarrow \phi _{1,m} > \frac{\ln \left( \sum \nolimits _{j = 1}^\ell \exp \left( \alpha \phi _{1,j}\right) \right) - \ln \left( \sum \nolimits _{j = 1}^\ell \exp \left( \phi _{1,j}\right) \right) }{\alpha - 1} \end{aligned}$$
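The threshold in the last line can be checked numerically. In the following sketch the scores \(\phi _{1,j}\) and the ratio \(\alpha =\epsilon _2/\epsilon _1\) are illustrative, not taken from the paper: candidates whose score exceeds the threshold gain selection probability when the grouping budget grows by a factor \(\alpha \), and the others lose probability.

```python
import math

def exp_mech_probs(phi):
    # Exponential-mechanism selection probabilities for scores phi,
    # i.e. Pr[r_m] = exp(phi_m) / sum_j exp(phi_j).
    z = sum(math.exp(p) for p in phi)
    return [math.exp(p) / z for p in phi]

phi1 = [0.9, 0.6, 0.3, 0.1]       # hypothetical scores eps1 * u / Delta u
alpha = 3.0                       # hypothetical ratio eps2 / eps1
phi2 = [alpha * p for p in phi1]  # scaling the budget scales every score

p1 = exp_mech_probs(phi1)
p2 = exp_mech_probs(phi2)

# Threshold from the derivation: candidate m gains probability under
# the larger budget exactly when phi1[m] exceeds this value.
t = (math.log(sum(math.exp(alpha * p) for p in phi1))
     - math.log(sum(math.exp(p) for p in phi1))) / (alpha - 1)
```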

Proof of second feature

Given that

$$\begin{aligned}&\phi _{1,m}>\phi _{1,p}>\frac{\ln (\sum _{j=1}^\ell \exp (\alpha \phi _{1,j})/\sum _{j=1}^\ell \exp (\phi _{1,j}))}{\alpha -1},\\&\phi _{1,m}> \phi _{1,p}\\&\quad \Rightarrow \sum \limits _{j = 1}^{m - 1} {\left( {\exp \left( {\alpha \phi _{1,j} - \left( {\alpha - 1} \right) \phi _{1,m}} \right) - \exp \left( {\phi _{1,j}} \right) } \right) } \\&\qquad - \sum \limits _{j = m + 1}^\ell {\left( {\exp \left( {\phi _{1,j}} \right) - \exp \left( {\alpha \phi _{1,j} - \left( {\alpha - 1} \right) \phi _{1,m}} \right) } \right) } \\&\qquad> \sum \limits _{j = 1}^{p - 1} {\left( {\exp \left( {\alpha \phi _{1,j} - \left( {\alpha - 1} \right) \phi _{1,p}} \right) - \exp \left( {\phi _{1,j}} \right) } \right) }\\&\qquad - \sum \limits _{j = p + 1}^\ell {\left( {\exp \left( {\phi _{1,j}} \right) - \exp \left( {\alpha \phi _{1,j} - \left( {\alpha - 1} \right) \phi _{1,p}} \right) } \right) } \\&\quad \Rightarrow Pr[r_m\in \mathbf {H}_2^{\prime }]-Pr[r_m\in \mathbf {H}_1^{\prime }]>Pr[r_p\in \mathbf {H}_2^{\prime }]-Pr[r_p\in \mathbf {H}_1^{\prime }] \end{aligned}$$

\(\square \)

Appendix B: Proof of Theorem 4

First of all, as the privacy budgets of \(A_1\) and \(A_2\) are the same, the difference between \(\phi _{1,m}\) and \(\phi _{2,m}\) lies in \(u(D,r)/\Delta u\).

Proof of first feature

Given that \(\forall m\in [1,k]\), \(\phi _{2,m}>\phi _{1,m}\), as \(\epsilon _1=\epsilon _2\), then \(\frac{u_2(D,r_m)}{\Delta u_2}>\frac{u_1(D,r_m)}{\Delta u_1}\). We denote by \(\alpha =\max {\frac{\phi _{2,m}}{\phi _{1,m}}}\), then

$$\begin{aligned} Pr[r_m\in \mathbf {H}_2^{\prime }] =\frac{\exp (\phi _{2,m})}{\sum _{j=1}^\ell {\exp (\phi _{2,j})}}\ge \frac{\exp (\alpha \phi _{1,m})}{\sum _{j=1}^\ell {\exp (\alpha \phi _{1,j})}} \end{aligned}$$

Then,

$$\begin{aligned}&Pr[r_m\in \mathbf {H}_2^{\prime }]>Pr[r_m\in \mathbf {H}_1^{\prime }]\\&\quad \Leftrightarrow \frac{\sum _{j = 1}^{m - 1} \left( \exp \left( \alpha \phi _{1,j} - (\alpha - 1)\phi _{1,m}\right) - \exp \left( \phi _{1,j}\right) \right) }{\sum _{j = m + 1}^\ell \left( \exp \left( \phi _{1,j}\right) - \exp \left( \alpha \phi _{1,j} - (\alpha - 1)\phi _{1,m}\right) \right) }< 1\\&\quad \Leftrightarrow \sum \limits _{j = 1,j \ne m}^\ell \exp \left( \alpha \phi _{1,j} - (\alpha - 1)\phi _{1,m}\right)< \sum \limits _{j = 1,j \ne m}^\ell \exp \left( \phi _{1,j}\right) \\&\quad \Leftrightarrow \frac{\sum _{j = 1}^\ell \exp \left( \alpha \phi _{1,j}\right) - \exp \left( \alpha \phi _{1,m}\right) }{\exp \left( (\alpha - 1)\phi _{1,m}\right) }< \sum \limits _{j = 1}^\ell \exp \left( \phi _{1,j}\right) - \exp \left( \phi _{1,m}\right) \\&\quad \Leftrightarrow \sum \limits _{j = 1}^\ell \exp \left( \alpha \phi _{1,j}\right) < \sum \limits _{j = 1}^\ell \exp \left( \phi _{1,j}\right) \times \exp \left( (\alpha - 1)\phi _{1,m}\right) \\&\quad \Leftrightarrow \phi _{1,m} > \frac{\ln \left( \sum \nolimits _{j = 1}^\ell \exp \left( \alpha \phi _{1,j}\right) \right) - \ln \left( \sum \nolimits _{j = 1}^\ell \exp \left( \phi _{1,j}\right) \right) }{\alpha - 1} \end{aligned}$$

Proof of second feature

This part is identical to the second part of the proof of Theorem 3.    \(\square \)

Appendix C: Proof of Lemma 1

In the algorithm we have

$$\begin{aligned} err\left( H \right) = \sum \limits _{i = 1}^\ell \sum \limits _{B_j \in H_i} \left| B_j \cdot c - \bar{H}_i \right| + \frac{\ell }{\epsilon }, \end{aligned}$$

where \(\bar{H}_i\) denotes the mean count of the bins in group \(H_i\). Consider a partition \(H\) with counts \(\left\{ x_1, x_2, \ldots , x_n \right\} \) and let \(H'\) differ from \(H\) in a single bin by 1. Then,

$$\begin{aligned} \sum \limits _i \left| x'_i - \frac{\sum \nolimits _i x'_i}{n} \right| \le \sum \limits _i \left| x_i - \frac{\sum \nolimits _i x_i}{n} \right| + 2. \end{aligned}$$

Since a cluster can be bisected at most \(h\) times, a single-record difference in \(H\) affects at most \(h\) bisections; hence the sensitivity of our algorithm is \(2h\). \(\square \)
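The per-bisection bound, that a single-record change shifts the sum of absolute deviations from the mean by at most 2, can be probed empirically. A small sketch with randomly generated counts (the sizes and value ranges are illustrative):

```python
import random

def dev_from_mean(xs):
    # Sum of absolute deviations of the counts from their mean --
    # the per-group error term appearing in err(H).
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs)

random.seed(1)
worst = 0.0
for _ in range(1000):
    xs = [random.randint(0, 20) for _ in range(5)]
    ys = list(xs)
    ys[random.randrange(5)] += 1  # neighboring histogram: one bin gains one record
    worst = max(worst, abs(dev_from_mean(ys) - dev_from_mean(xs)))
```

The observed `worst` never exceeds 2: the changed bin's deviation moves by at most \(1+1/n\) and each of the other \(n-1\) deviations by at most \(1/n\), which sums to 2.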

Appendix D: Proof of Theorem 6

The InitialBiHis algorithm employs the exponential mechanism to split the original histogram into two sub-histograms. According to Lemma 1, its sensitivity is 2. Since the exponential mechanism is applied to select the output, it is easy to see that InitialBiHis satisfies \({\epsilon _1}\)-differential privacy; as the derivation is straightforward, we omit the detailed formula here.

ClusterSplit then applies the exponential mechanism over two disjoint parts of the original data. According to Lemma 1, ClusterSplit's sensitivity is \(2h\). Therefore, by the Parallel Composability property (see Corollary 2), ClusterSplit satisfies \({\epsilon _2}\)-differential privacy.

In the final step, we employ the Laplace mechanism to compute the noisy count of each bin. According to the definition of neighboring histograms, this step's sensitivity is 1, so the noise injection in this step satisfies \({\epsilon _3}\)-differential privacy.

Finally, by the Sequential Composability property (see Corollary 1), Algorithm 1 satisfies \(\left( {{\epsilon _1} + {\epsilon _2} + {\epsilon _3}} \right) \)-differential privacy. \(\square \)
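As a sketch of the privacy accounting above (the 0.2/0.3/0.5 split is hypothetical, chosen only for illustration; the paper derives the actual allocation from its theoretical study):

```python
import numpy as np

# Hypothetical budget split across the three phases of Algorithm 1.
eps_total = 1.0
eps1 = 0.2 * eps_total          # InitialBiHis: exponential mechanism
eps2 = 0.3 * eps_total          # ClusterSplit: exponential mechanism on disjoint parts
eps3 = eps_total - eps1 - eps2  # Laplace noise on the final bin counts

# Final phase: Laplace mechanism with sensitivity 1, since neighboring
# histograms differ by a single record in a single bin.
rng = np.random.default_rng(0)
bins = np.array([12.0, 7.0, 30.0])
noisy = bins + rng.laplace(scale=1.0 / eps3, size=bins.size)

# Sequential composition: the budgets of sequentially executed
# mechanisms add up, so the pipeline as a whole is eps_total-DP.
```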

Appendix E: Proof of Theorem 7

Sanitization is applied in Lines 7 and 8 of the algorithm.

Line 7 strictly follows the Exponential Mechanism and guarantees \(\epsilon _1\)-differential privacy in each iteration, according to Theorem 2. As there are \(d\) sequentially executed iterations in total, Line 7 guarantees \((d*\epsilon _1)\)-differential privacy over all iterations.

Line 8 is performed sequentially in \(d\) iterations. According to Theorem 6, in each iteration ihp achieves \(\frac{\epsilon _2}{d}\)-differential privacy. Based on Corollary 1, the \(d\) iterations together achieve \(\epsilon _2\)-differential privacy.

Putting Lines 7 and 8 together, as they are sequentially executed in the algorithm, mihp achieves \((d*\epsilon _1+\epsilon _2)\)-differential privacy according to Corollary 1. \(\square \)


About this article


Cite this article

Li, H., Cui, J., Meng, X. et al. IHP: improving the utility in differential private histogram publication. Distrib Parallel Databases 37, 721–750 (2019). https://doi.org/10.1007/s10619-018-07255-6


Keywords

  • Differential privacy
  • Data publication
  • Histogram