## Abstract

Differential privacy (DP) is a promising tool for privacy-preserving data publication, as it provides strong theoretical privacy guarantees in the face of adversaries with arbitrary background knowledge. The histogram, as the result of a set of count queries, is a core statistical tool for reporting data distributions and serves as the foundation of many other statistical analyses, such as range queries; it is therefore an important form of data publishing. In this paper, we consider the scenario of publishing sensitive histogram data under a differential privacy scheme. Existing work in this field has shown that, compared with directly applying DP techniques (i.e., injecting noise) to the counts in histogram bins, grouping bins before noise injection is more effective (i.e., yields higher utility), as it introduces much less error in the sanitized histogram under the same privacy budget. However, state-of-the-art works have not unveiled how the overall utility of a sanitized histogram is affected by the way the privacy budget is divided between the grouping and noise injection phases. In this work, we conduct a theoretical study of how the probability of obtaining better groups can be improved, so that the overall error introduced in the sanitized histogram is further reduced, which directly leads to higher utility. In particular, we show that the probability of achieving better grouping is affected by two factors, namely the privacy budget assigned to grouping and the normalized utility function used for selecting groups. Motivated by this, we propose a new DP histogram publishing scheme, namely Iterative Histogram Partition, in which we carefully divide the privacy budget between the grouping and injection phases based on our theoretical study. We also theoretically prove that \(\epsilon \)-differential privacy is achieved by our new scheme.
Moreover, we show that, under the same privacy budget, our scheme exhibits smaller errors in the sanitized histograms than state-of-the-art methods. We also extend the model to multi-dimensional histogram publication. Finally, an empirical study over four real-world datasets confirms that our scheme achieves the least error among a series of state-of-the-art baseline methods.
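To make the contrast between direct noise injection and grouping-based publication concrete, the following is a minimal sketch (this is not the paper's IHP algorithm; the function names, the fixed grouping, and the single-budget treatment are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_histogram(counts, epsilon):
    # Direct approach: add independent Laplace noise with scale
    # 1/epsilon to every bin (the sensitivity of a histogram is 1).
    counts = np.asarray(counts, dtype=float)
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

def grouped_laplace_histogram(counts, groups, epsilon):
    # Grouping approach: bins in one group share a single noisy
    # average, so per-bin noise variance shrinks with the group
    # size, at the cost of approximation error inside the group.
    counts = np.asarray(counts, dtype=float)
    out = np.empty_like(counts)
    for idx in groups:  # idx: list of bin indices forming one group
        noisy_sum = counts[idx].sum() + rng.laplace(scale=1.0 / epsilon)
        out[idx] = noisy_sum / len(idx)
    return out
```

With similar bin counts in a group, the grouped estimate has lower expected error per bin; with dissimilar counts, the within-group approximation error dominates, which is why the quality of the grouping step matters.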


## Notes

- 1.
In the sequel, we use the terms “cluster” and “group” interchangeably to facilitate the discussion.

- 2.
In the standard form, \(D_2\) is in fact obtained by either inserting or deleting a single record from \(D_1\). Modifying a record does not yield a neighboring dataset, as it in fact consists of both a delete and an insert operation.

- 3.
In the following, we shall use 2-dimensional data cube and 2-dimensional histogram interchangeably.


## Acknowledgements

The work is supported by the National Natural Science Foundation of China (Nos. 61672408 and 61472298), the Director Fund of PSRPC, the Fundamental Research Funds for the Central Universities (No. JB181505), the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JM6073) and the China 111 Project (No. B16037).


## Appendices

### Appendix A: Proof of Theorem 3

First of all, as the utility function solely depends on the grouping algorithm’s error metric, and \(A_1\) and \(A_2\) follow the same error metric, it is obvious that \(u_1(D,r)=u_2(D,r)\) and \(\Delta u_1=\Delta u_2\).

### Proof of first feature

\(\forall m\in [1,\ell ]\), as \(u_1(D,r)=u_2(D,r)\) and \(\Delta u_1=\Delta u_2\), then \(\phi _{2,m}/\phi _{1,m}=\epsilon _2/\epsilon _1=\alpha \).

Therefore, \(\phi _{1,m}<\phi _{2,m}\) as \(\epsilon _2>\epsilon _1\).

As a result,

Similarly,

Then,

### Proof of second feature

Given that

\(\square \)

### Appendix B: Proof of Theorem 4

First of all, as the privacy budgets of \(A_1\) and \(A_2\) are the same, the difference between \(\phi _{1,m}\) and \(\phi _{2,m}\) lies in \(u(D,r)/\Delta u\).

### Proof of first feature

Given that \(\forall m\in [1,k]\), \(\phi _{2,m}>\phi _{1,m}\), and \(\epsilon _1=\epsilon _2\), it follows that \(\frac{u_2(D,r_m)}{\Delta u_2}>\frac{u_1(D,r_m)}{\Delta u_1}\). Let \(\alpha =\max_m {\frac{\phi _{2,m}}{\phi _{1,m}}}\); then

Then,

### Proof of second feature

This part is identical to the corresponding part of the proof of Theorem 3. \(\square \)

### Appendix C: Proof of Lemma 1

In the algorithm we have

Consider a partition *H* with counts \(\left\{ {{x_1},{x_2},\ldots ,{x_n}} \right\} \) and let \(H'\) differ from *H* in a single bin by 1. Then,

As a cluster can be bisected at most *h* times, a single record difference in *H* affects at most *h* bisections; hence the sensitivity of our algorithm is 2*h*. \(\square \)
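The per-bisection bound can be checked numerically. The sketch below assumes the cluster error metric is the sum of absolute deviations from the cluster mean (a common choice in DP histogram grouping; the paper's exact metric is not reproduced in this excerpt), under which changing a single count by 1 changes the error by at most 2:

```python
def cluster_error(counts):
    # Assumed error metric: sum of absolute deviations from the
    # cluster mean. Changing one count by 1 shifts the mean by 1/n,
    # so the total change is bounded by (1 + 1/n) + (n-1)/n = 2.
    mean = sum(counts) / len(counts)
    return sum(abs(c - mean) for c in counts)

def neighboring_error_gap(counts, bin_index, delta=1):
    # Error difference between a cluster and its neighbor that
    # differs by `delta` in a single bin.
    neighbor = list(counts)
    neighbor[bin_index] += delta
    return abs(cluster_error(counts) - cluster_error(neighbor))
```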

### Appendix D: Proof of Theorem 6

The *InitialBiHis* algorithm employs the exponential mechanism to split the original histogram into two sub-histograms; according to Lemma 1, its sensitivity is 2. As the exponential mechanism is applied to select the output, *InitialBiHis* satisfies \({\epsilon _1}\)-differential privacy. Since the argument is straightforward, we omit the detailed formula here.

Then *ClusterSplit* applies the exponential mechanism over two disjoint parts of the original data; according to Lemma 1, its sensitivity is 2*h*. Therefore, by the parallel composability property (see Corollary 2), *ClusterSplit* satisfies \({\epsilon _2}\)-differential privacy.
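The exponential mechanism these steps rely on can be sketched generically as follows (the utility function and sensitivity are caller-supplied assumptions; for *ClusterSplit* the sensitivity would be the 2*h* from Lemma 1):

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity):
    # Sample candidate r with probability proportional to
    # exp(epsilon * u(D, r) / (2 * sensitivity)), following the
    # McSherry-Talwar exponential mechanism.
    scores = [epsilon * utility(r) / (2.0 * sensitivity) for r in candidates]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    pick = random.random() * sum(weights)
    acc = 0.0
    for r, w in zip(candidates, weights):
        acc += w
        if pick <= acc:
            return r
    return candidates[-1]
```

For a split-point selection, `candidates` would be the possible bisection positions and `utility` the (negated) grouping error each position induces.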

In the final step we employ the Laplace mechanism to compute the noisy count of each bin. According to the definition of neighboring histograms, this step's sensitivity is 1, so the noise injection satisfies \({\epsilon _3}\)-differential privacy.

Finally, by the sequential composability property (see Corollary 1), Algorithm 1 satisfies \(\left( {{\epsilon _1} + {\epsilon _2} + {\epsilon _3}} \right) \)-differential privacy. \(\square \)

### Appendix E: Proof of Theorem 7

Sanitization is applied in Lines 7 and 8 of the algorithm.

Line 7 strictly follows the exponential mechanism and, according to Theorem 2, guarantees \(\epsilon _1\)-differential privacy in each iteration. As there are altogether *d* sequentially executed iterations, Line 7 guarantees \((d\cdot \epsilon _1)\)-differential privacy over all iterations.

Line 8 is performed sequentially over *d* iterations. According to Theorem 6, in each iteration IHP achieves \(\frac{\epsilon _2}{d}\)-differential privacy. Based on Corollary 1, the *d* iterations together achieve \(\epsilon _2\)-differential privacy.

Putting Lines 7 and 8 together, as they are executed sequentially in the algorithm, mIHP achieves \((d\cdot \epsilon _1+\epsilon _2)\)-differential privacy according to Corollary 1. \(\square \)


## About this article

### Cite this article

Li, H., Cui, J., Meng, X. *et al.* IHP: improving the utility in differential private histogram publication.
*Distrib Parallel Databases* **37**, 721–750 (2019). https://doi.org/10.1007/s10619-018-07255-6


### Keywords

- Differential privacy
- Data publication
- Histogram