Correcting Finite Sampling Issues in Entropy l-diversity

Stammler, Sebastian; Katzenbeisser, Stefan; Hamacher, Kay

doi:10.1007/978-3-319-45381-1_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9867)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

In statistical disclosure control (SDC), anonymized versions of a database table are obtained via generalization and suppression to reduce the risk of de-anonymization attacks, ideally with minimal loss of utility. This amounts to an optimization problem in which a measure of the remaining diversity is to be improved. The feasible solutions are those that fulfill some privacy criterion, e.g., entropy l-diversity. It is known in statistics that the naive computation of an entropy via the Shannon formula systematically underestimates the true entropy and thus influences the resulting equivalence classes. In this contribution we implement an asymptotically unbiased estimator for the Shannon entropy and apply it to three test databases. Our results reveal systematic miscalculations in previous computations; we show that an unbiased estimator can increase the utility of the data without compromising privacy.
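
The underestimation mentioned in the abstract can be illustrated with a small Python sketch. It is not taken from the paper: the function names are hypothetical, and the setting (a fair two-valued attribute sampled `n_samples` times) is an assumption chosen so that the expectation of the naive estimate can be computed exactly rather than simulated.

```python
from math import comb, log


def plugin_entropy(counts):
    """Naive Shannon entropy in nats, computed from raw counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts if c > 0)


def expected_plugin_entropy(p, n_samples):
    """Exact expectation of the naive entropy for a two-valued attribute.

    Enumerates every split (k, n_samples - k) of the sample and weights
    its naive entropy by the binomial probability of observing that split.
    """
    return sum(
        comb(n_samples, k) * p**k * (1 - p) ** (n_samples - k)
        * plugin_entropy([k, n_samples - k])
        for k in range(n_samples + 1)
    )


true_entropy = log(2)  # entropy of a fair two-valued attribute, in nats
for n in (4, 16, 64):
    # The expected naive estimate stays below the true value and
    # approaches it as the sample grows.
    print(n, expected_plugin_entropy(0.5, n), true_entropy)
```

In this toy setting the gap between the expected naive estimate and the true entropy shrinks as the number of observations grows, which is why small equivalence classes are affected most.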

Notes

1. Note that typically the logarithm to base two is used. However, any base will do, as we can simply rescale all equations and inequalities. We use the natural logarithm in the subsequent parts because the estimator we use is more easily derived in nats than in bits.
2. Often written $\hat{p}(s)$ to indicate that these are not the “real” probabilities, but rather stem from a statistical model or concrete data.
3. Frequencies are just the counts divided by the total number of observations, namely $\hat{p}(s)=\frac{\hat{n}(s)}{\sum _\sigma \hat{n}(\sigma )}$.
4. Note that G(0) is never used but marks the start of the inductive definition. $\gamma \approx 0.577215\dots $ is Euler’s constant.
5. With growing counts, the systematic bias tends to vanish in the statistical sense.
6. Cf. the 1000 Genomes Project, http://www.1000genomes.org.
7. The values of l and the sizes used can be seen in Fig. 2.
8. $G(2)/\log (2) \approx 1.053, G(4)/\log (4) \approx 1.0072, G(6)/\log (6) \approx 1.0025$.
9. We therefore recommend carefully analyzing what one wants to achieve: with such a small privacy guarantee ($l < 2$) one could consequently just release the raw data.
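
The quantities in the notes above can be tied together in a short Python sketch. Two things in it are assumptions that this excerpt does not spell out: the estimator is taken to be Grassberger's $\hat{H} = \ln N - \frac{1}{N}\sum _s \hat{n}(s)\, G(\hat{n}(s))$ from the cited preprint, and an equivalence class is treated as entropy l-diverse when its estimated entropy is at least $\ln l$. All function names are hypothetical.

```python
from math import log

EULER_GAMMA = 0.5772156649015329


def g_from_sum(m):
    """G(m) for m >= 1, using G(2n) = -gamma - ln 2 + sum_{k<n} 2/(2k+1)
    and G(2n+1) = G(2n)."""
    if m < 1:
        raise ValueError("G is only needed for positive counts")
    return -EULER_GAMMA - log(2) + sum(2.0 / (2 * k + 1) for k in range(m // 2))


def naive_entropy(counts):
    """Shannon formula applied to the frequencies n(s) / N, in nats."""
    total = sum(counts)
    return log(total) - sum(c * log(c) for c in counts if c > 0) / total


def corrected_entropy(counts):
    """Assumed Grassberger-style estimate: ln n(s) is replaced by G(n(s))."""
    total = sum(counts)
    return log(total) - sum(c * g_from_sum(c) for c in counts if c > 0) / total


def is_entropy_l_diverse(counts, l, estimator):
    """Assumed criterion: estimated entropy of the class is at least ln l."""
    return estimator(counts) >= log(l)


counts = [3, 1]  # sensitive-value counts in one hypothetical equivalence class
print(naive_entropy(counts), corrected_entropy(counts))
print(is_entropy_l_diverse(counts, 2, naive_entropy),
      is_entropy_l_diverse(counts, 2, corrected_entropy))
```

The correction does not move every class in the same direction: by Note 8, G(n) is slightly larger than the logarithm for small even counts, so a class such as `[2, 2]` receives a slightly lower estimate than with the naive formula.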

References

1. Antal, L., Shlomo, N., Elliot, M.: Measuring disclosure risk with entropy in population based frequency tables. In: Domingo-Ferrer [5], pp. 62–78
2. Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: The complexity of approximating the entropy. SIAM J. Comput. 35(1), 132–150 (2005)
3. Brickell, J., Shmatikov, V.: The cost of privacy: destruction of data-mining utility in anonymized data publishing. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 70–78. ACM (2008)
4. Craig, D.W., Goor, R.M., Wang, Z., Paschall, J., Ostell, J., Feolo, M., Sherry, S.T., Manolio, T.A.: Assessing and managing risk when sharing aggregate genetic variant data. Nat. Rev. Genet. 12(10), 730–736 (2011). http://dx.doi.org/10.1038/nrg3067
5. Domingo-Ferrer, J. (ed.): PSD 2014. LNCS, vol. 8744. Springer, Heidelberg (2014)
6. Gionis, A., Tassa, T.: k-anonymization with minimal loss of information. IEEE Trans. Knowl. Data Eng. 21(2), 206–219 (2009)
7. Goeman, J.J., Solari, A.: Multiple hypothesis testing in genomics. Stat. Med. 33(11), 1946–1978 (2014)
8. Grassberger, P.: Entropy estimates from insufficient samplings. arXiv:physics/0307138 (2008)
9. Grassberger, P.: Finite sample corrections to entropy and dimension estimates. Phys. Lett. A 128(6), 369–373 (1988)
10. Hamacher, K.: Using lisp macro-facilities for transferable statistical tests. In: 9th European Lisp Symposium (accepted, 2016)
11. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 279–288. ACM (2002)
12. Kohlmayer, F., Prasser, F., Eckert, C., Kemper, A., Kuhn, K.: Flash: efficient, stable and optimal $k$-anonymity. In: Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pp. 708–717, September 2012
13. Kohlmayer, F., Prasser, F., Kuhn, K.A.: The cost of quality: implementing generalization and suppression for anonymizing biomedical data with minimal information loss. J. Biomed. Inform. 58, 37–48 (2015)
14. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-anonymity. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, pp. 25–25. IEEE (2006)
15. Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: IEEE 23rd International Conference on Data Engineering, ICDE 2007, pp. 106–115. IEEE (2007)
16. Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: $l$-diversity: privacy beyond $k$-anonymity. ACM Trans. Knowl. Discov. Data (TKDD) 1(1), 3 (2007)
17. MacKay, D.: Information Theory, Inference, and Learning Algorithms, 2nd edn. Cambridge University Press, Cambridge (2004)
18. Narayanan, A., Shmatikov, V.: Myths and fallacies of “personally identifiable information”. Commun. ACM 53(6), 24–26 (2010). http://doi.acm.org/10.1145/1743546.1743558
19. Nergiz, M.E., Atzori, M., Clifton, C.: Hiding the presence of individuals from shared databases. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD 2007, pp. 665–676. ACM, New York (2007). http://doi.acm.org/10.1145/1247480.1247554
20. Ohm, P.: Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Rev. 57, 1701 (2009)
21. Prasser, F., Kohlmayer, F., Lautenschläger, R., Kuhn, K.A.: ARX - a comprehensive tool for anonymizing biomedical data. In: Proceedings of the AMIA 2014 Annual Symposium, Washington D.C., USA, November 2014
22. Roldán, É.: Estimating the Kullback-Leibler divergence. In: Irreversibility and Dissipation in Microscopic Systems, pp. 61–85. Springer International Publishing, Cham (2014)
23. Schürmann, T.: Bias analysis in entropy estimation. J. Phys. A: Math. Gen. 37(27), L295 (2004)
24. Schürmann, T.: A note on entropy estimation. Neural Comput. 27(10), 2097–2106 (2015)
25. Siegel, S.: Non-parametric Statistics for the Behavioral Sciences. McGraw-Hill, New York (1956)
26. Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E.: A comparison of blocking methods for record linkage. In: Domingo-Ferrer [5], pp. 253–268
27. Sweeney, L.: Achieving $k$-anonymity privacy protection using generalization and suppression. Int. J. Uncertainty, Fuzziness Knowl. Based Syst. 10(5), 571–588 (2002)
28. Sweeney, L.: $k$-anonymity: a model for protecting privacy. Int. J. Uncertainty, Fuzziness Knowl. Based Syst. 10(5), 557–570 (2002)
29. Weil, P., Hoffgaard, F., Hamacher, K.: Estimating sufficient statistics in co-evolutionary analysis by mutual information. Comput. Biol. Chem. 33(6), 440–444 (2009)

Acknowledgements

The research reported in this paper has been supported by the German Federal Ministry of Education and Research (BMBF) [and by the Hessian Ministry of Science and the Arts] within CRISP (www.crisp-da.de).

We also thank the ARX-Team for their helpful support in using the API and understanding some inner workings of the framework.

Author information

Authors and Affiliations

Technische Universität Darmstadt, 64287, Darmstadt, Germany
Sebastian Stammler, Stefan Katzenbeisser & Kay Hamacher

Corresponding author

Correspondence to Kay Hamacher.

Editor information

Editors and Affiliations

Universitat Rovira i Virgili, Tarragona, Spain
Josep Domingo-Ferrer
University of Zagreb, Zagreb, Croatia
Mirjana Pejić-Bach

A. Efficient Computation of the G(n)

The inductive definition (4) of the G(n) can be rewritten as

$$\begin{aligned} G(2n) = -\gamma -\ln 2 + \sum _{k=0}^{n-1} \frac{2}{2k+1} \end{aligned}$$

Recall that $G(2n+1) := G(2n)$. This equation can be rewritten in terms of the harmonic numbers $H_n := \sum _{k=1}^{n} 1/k$ as

$$\begin{aligned} G(2n) = -\gamma -\ln 2 + 2 H_{2n-1} - H_{n-1} \end{aligned}$$
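
The step behind this rewriting is to write the sum over odd reciprocals as the full harmonic sum up to $2n-1$ minus its even terms, with the convention $H_0 = 0$:

$$\begin{aligned} \sum _{k=0}^{n-1} \frac{1}{2k+1} = \sum _{j=1}^{2n-1} \frac{1}{j} - \sum _{k=1}^{n-1} \frac{1}{2k} = H_{2n-1} - \frac{1}{2} H_{n-1} \end{aligned}$$

Multiplying by 2 gives the term $2 H_{2n-1} - H_{n-1}$. As a check, for $n=1$ both sides equal 1.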

The harmonic numbers can also be expressed in terms of the digamma function $\psi $ as $H_{n-1} = \psi (n) + \gamma $. The digamma function has the asymptotic series

$$\begin{aligned} \psi (x) = \ln x - \frac{1}{2x} - \sum _{k=1}^{\infty } \frac{B_{2k}}{2k\,x^{2k}}, \end{aligned}$$

using the Bernoulli numbers $B_n$. Combining the last two equations yields the asymptotic expansion for G(2n):

$$\begin{aligned} G(2n) = \ln (2n) + \frac{1}{24\,n^2} - \frac{7}{960\,n^4} + \frac{31}{8064\,n^6} + \mathscr {O} \left( \frac{1}{n^8}\right) \end{aligned}$$
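
The intermediate steps can be written out as follows, using $B_2 = 1/6$, $B_4 = -1/30$ and $B_6 = 1/42$ for the first terms of the series:

$$\begin{aligned} G(2n) &= -\gamma -\ln 2 + 2\bigl (\psi (2n) + \gamma \bigr ) - \bigl (\psi (n) + \gamma \bigr ) = 2\psi (2n) - \psi (n) - \ln 2, \\ \psi (x) &= \ln x - \frac{1}{2x} - \frac{1}{12\,x^2} + \frac{1}{120\,x^4} - \frac{1}{252\,x^6} + \mathscr {O} \left( \frac{1}{x^8}\right) . \end{aligned}$$

The logarithms combine to $2\ln (2n) - \ln n - \ln 2 = \ln (2n)$, the $1/(2x)$ terms cancel, and the remaining coefficients are

$$\begin{aligned} \frac{1}{n^2}:\; -\frac{1}{24} + \frac{1}{12} = \frac{1}{24}, \qquad \frac{1}{n^4}:\; \frac{1}{960} - \frac{1}{120} = -\frac{7}{960}, \qquad \frac{1}{n^6}:\; -\frac{1}{8064} + \frac{1}{252} = \frac{31}{8064}. \end{aligned}$$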

This expansion is accurate to double precision for $n\ge 50$. For smaller values we used a precomputed table. Note that, besides the logarithm, it suffices to calculate a single division $1/n^2$ and then proceed with nested multiplication.
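
A minimal Python sketch of this evaluation scheme follows. It is an illustration under assumptions, not the authors' implementation: the function names and the table layout are hypothetical, and the table is filled from the finite sum rather than taken from the paper.

```python
from math import log

EULER_GAMMA = 0.5772156649015329


def g_even_exact(n):
    """G(2n) from the finite sum; needs n additions and divisions."""
    return -EULER_GAMMA - log(2) + sum(2.0 / (2 * k + 1) for k in range(n))


def g_even_asymptotic(n):
    """G(2n) from the expansion: one logarithm, one division 1/n^2,
    then nested (Horner) multiplication for the correction terms."""
    x = 1.0 / (n * n)
    return log(2 * n) + x * (1.0 / 24 + x * (-7.0 / 960 + x * (31.0 / 8064)))


# Hypothetical precomputed table holding G(0), G(2), ..., G(98).
SMALL_N_TABLE = [g_even_exact(n) for n in range(50)]


def grassberger_g(m):
    """G(m) for an integer m >= 0, using G(2n+1) = G(2n)."""
    n = m // 2
    return SMALL_N_TABLE[n] if n < 50 else g_even_asymptotic(n)


for n in (50, 100, 1000):
    # Difference between the finite sum and the expansion for n >= 50.
    print(n, g_even_exact(n) - g_even_asymptotic(n))
```

The finite sum costs time linear in n for every call, whereas the expansion has constant cost, which matters when G is evaluated for every count in every candidate equivalence class.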

Cite this paper

Stammler, S., Katzenbeisser, S., Hamacher, K. (2016). Correcting Finite Sampling Issues in Entropy l-diversity. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science, vol 9867. Springer, Cham. https://doi.org/10.1007/978-3-319-45381-1_11

DOI: https://doi.org/10.1007/978-3-319-45381-1_11
Published: 31 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45380-4
Online ISBN: 978-3-319-45381-1
eBook Packages: Computer Science, Computer Science (R0)
