Abstract
In statistical disclosure control (SDC), anonymized versions of a database table are obtained via generalization and suppression to reduce the risk of de-anonymization attacks, ideally with minimal loss of utility. This amounts to an optimization problem in which a measure of the remaining diversity needs to be optimized. The feasible solutions are those that fulfill given privacy criteria, e.g., entropy l-diversity. It is well known in statistics that the naive computation of an entropy via the Shannon formula systematically underestimates the (real) entropy and thus influences the resulting equivalence classes. In this contribution we implement an asymptotically unbiased estimator for the Shannon entropy and apply it to three test databases. Our results reveal systematic miscalculations in previous computations; we show that an unbiased estimator can increase the utility of the data without compromising privacy.
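The underestimation mentioned above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the uniform test distribution, the sample size, the number of trials, and all function names are assumptions chosen for illustration only.

```python
import math
import random
from collections import Counter


def naive_entropy(samples):
    """Plug-in Shannon entropy in nats, computed from observed frequencies."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())


def mean_naive_entropy(num_symbols=8, sample_size=20, trials=2000, seed=0):
    """Average the plug-in estimate over many small samples from a uniform source.

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    estimates = [
        naive_entropy([rng.randrange(num_symbols) for _ in range(sample_size)])
        for _ in range(trials)
    ]
    return sum(estimates) / trials


# True entropy of a uniform distribution over 8 symbols, in nats.
TRUE_ENTROPY = math.log(8)
MEAN_NAIVE = mean_naive_entropy()

if __name__ == "__main__":
    print(f"true entropy:        {TRUE_ENTROPY:.4f} nats")
    print(f"mean naive estimate: {MEAN_NAIVE:.4f} nats")
```

With only 20 observations spread over 8 symbols, every single plug-in estimate stays below the true value, so the average does as well; this is the finite sampling effect the abstract refers to.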
Notes
- 1.
Note that the logarithm to base two is typically used. However, any base will do, as we can simply rescale all equations and inequalities. We use the natural logarithm in the subsequent parts because the estimator we use is more easily derived in nats than in bits.
- 2.
Often written \(\hat{p}(s)\) to indicate that these are not the “real” probabilities, but rather stem from a statistical model or concrete data.
- 3.
Frequencies are just the counts divided by the total number of observations, namely \(\hat{p}(s)=\frac{\hat{n}(s)}{\sum _\sigma \hat{n}(\sigma )}\).
- 4.
Note that G(0) is never used but marks the start of the inductive definition. \(\gamma \approx 0.577215\dots \) is Euler’s constant.
- 5.
With growing counts the systematic bias tends to vanish in the statistical sense.
- 6.
Cf. the 1000 Genomes Project, http://www.1000genomes.org.
- 7.
The values of l and the sizes used are shown in Fig. 2.
- 8.
\(G(2)/\log (2) \approx 1.053, G(4)/\log (4) \approx 1.0072, G(6)/\log (6) \approx 1.0025\).
- 9.
We therefore recommend carefully analyzing what one wants to achieve: with such a small privacy guarantee (\(l < 2\)) one could consequently just release the raw data.
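The quantities in notes 3, 4 and 8 can be checked numerically. The sketch below assumes the inductive definition given by Grassberger (cited in the references): \(G(0) = -\gamma - \log 2\), \(G(2n+2) = G(2n) + \frac{2}{2n+1}\) and \(G(2n+1) = G(2n)\), together with the estimator \(\hat{H} = \log N - \frac{1}{N}\sum _s \hat{n}(s)\, G(\hat{n}(s))\). These formulas and all function names are assumptions of this sketch, since definition (4) itself is not reproduced here.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant, see note 4


def grassberger_g(n):
    """G(n) from the assumed inductive definition.

    G(0) = -gamma - log 2, G(2k+2) = G(2k) + 2/(2k+1), G(2k+1) = G(2k).
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    value = -EULER_GAMMA - math.log(2)
    for k in range(n // 2):  # odd n reuses the value of the even n - 1
        value += 2.0 / (2 * k + 1)
    return value


def naive_entropy(counts):
    """Plug-in entropy in nats from the counts of one equivalence class."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)


def corrected_entropy(counts):
    """Entropy estimate in nats with log(n) replaced by G(n) (assumed form)."""
    total = sum(counts)
    return math.log(total) - sum(c * grassberger_g(c) for c in counts if c > 0) / total


if __name__ == "__main__":
    for n in (2, 4, 6):  # ratios quoted in note 8
        print(n, grassberger_g(n) / math.log(n))
    counts = [3, 1, 1]  # hypothetical sensitive-value counts
    print(naive_entropy(counts), corrected_entropy(counts))
```

The ratios printed for n = 2, 4, 6 can be compared directly with the values quoted in note 8.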
References
Antal, L., Shlomo, N., Elliot, M.: Measuring disclosure risk with entropy in population based frequency tables. In: Domingo-Ferrer [5], pp. 62–78
Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: The complexity of approximating the entropy. SIAM J. Comput. 35(1), 132–150 (2005)
Brickell, J., Shmatikov, V.: The cost of privacy: destruction of data-mining utility in anonymized data publishing. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 70–78. ACM (2008)
Craig, D.W., Goor, R.M., Wang, Z., Paschall, J., Ostell, J., Feolo, M., Sherry, S.T., Manolio, T.A.: Assessing and managing risk when sharing aggregate genetic variant data. Nat. Rev. Genet. 12(10), 730–736 (2011). http://dx.doi.org/10.1038/nrg3067
Domingo-Ferrer, J. (ed.): PSD 2014. LNCS, vol. 8744. Springer, Heidelberg (2014)
Gionis, A., Tassa, T.: k-anonymization with minimal loss of information. IEEE Trans. Knowl. Data Eng. 21(2), 206–219 (2009)
Goeman, J.J., Solari, A.: Multiple hypothesis testing in genomics. Stat. Med. 33(11), 1946–1978 (2014)
Grassberger, P.: Entropy estimates from insufficient samplings. arXiv:physics/0307138 (2008)
Grassberger, P.: Finite sample corrections to entropy and dimension estimates. Phys. Lett. A 128(6), 369–373 (1988)
Hamacher, K.: Using lisp macro-facilities for transferable statistical tests. In: 9th European Lisp Symposium (accepted, 2016)
Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 279–288. ACM (2002)
Kohlmayer, F., Prasser, F., Eckert, C., Kemper, A., Kuhn, K.: Flash: efficient, stable and optimal \(k\)-anonymity. In: Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pp. 708–717, September 2012
Kohlmayer, F., Prasser, F., Kuhn, K.A.: The cost of quality: implementing generalization and suppression for anonymizing biomedical data with minimal information loss. J. Biomed. Inform. 58, 37–48 (2015)
LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-anonymity. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, pp. 25–25. IEEE (2006)
Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: IEEE 23rd International Conference on Data Engineering, ICDE 2007, pp. 106–115. IEEE (2007)
Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: \(l\)-diversity: privacy beyond \(k\)-anonymity. ACM Trans. Knowl. Discov. Data (TKDD) 1(1), 3 (2007)
MacKay, D.: Information Theory, Inference, and Learning Algorithms, 2nd edn. Cambridge University Press, Cambridge (2004)
Narayanan, A., Shmatikov, V.: Myths and fallacies of “personally identifiable information”. Commun. ACM 53(6), 24–26 (2010). http://doi.acm.org/10.1145/1743546.1743558
Nergiz, M.E., Atzori, M., Clifton, C.: Hiding the presence of individuals from shared databases. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD 2007, pp. 665–676. ACM, New York (2007). http://doi.acm.org/10.1145/1247480.1247554
Ohm, P.: Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Rev. 57, 1701 (2009)
Prasser, F., Kohlmayer, F., Lautenschläger, R., Kuhn, K.A.: ARX - a comprehensive tool for anonymizing biomedical data. In: Proceedings of the AMIA 2014 Annual Symposium, Washington D.C., USA, November 2014
Roldán, É.: Estimating the Kullback-Leibler divergence. In: Irreversibility and Dissipation in Microscopic Systems, pp. 61–85. Springer International Publishing, Cham (2014)
Schürmann, T.: Bias analysis in entropy estimation. J. Phys. A: Math. Gen. 37(27), L295 (2004)
Schürmann, T.: A note on entropy estimation. Neural Comput. 27(10), 2097–2106 (2015)
Siegel, S.: Non-parametric Statistics for the Behavioral Sciences. McGraw-Hill, New York (1956)
Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E.: A comparison of blocking methods for record linkage. In: Domingo-Ferrer [5], pp. 253–268
Sweeney, L.: Achieving \(k\)-anonymity privacy protection using generalization and suppression. Int. J. Uncertainty, Fuzziness Knowl. Based Syst. 10(5), 571–588 (2002)
Sweeney, L.: \(k\)-anonymity: a model for protecting privacy. Int. J. Uncertainty, Fuzziness Knowl. Based Syst. 10(5), 557–570 (2002)
Weil, P., Hoffgaard, F., Hamacher, K.: Estimating sufficient statistics in co-evolutionary analysis by mutual information. Comput. Biol. Chem. 33(6), 440–444 (2009)
Acknowledgements
The research reported in this paper has been supported by the German Federal Ministry of Education and Research (BMBF) [and by the Hessian Ministry of Science and the Arts] within CRISP (www.crisp-da.de).
We also thank the ARX-Team for their helpful support in using the API and understanding some inner workings of the framework.
A Efficient Computation of the G(n)
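The inductive definition (4) of the G(n) can be rewritten as
\[ G(2n) = -\gamma - \log 2 + \sum _{k=0}^{n-1} \frac{2}{2k+1}. \]
Remember that \(G(2n+1) := G(2n)\). This equation can simply be rewritten in terms of the harmonic numbers \(H_n := \sum _{k=1}^{n} 1/k\) as
\[ G(2n) = -\gamma - \log 2 + 2 H_{2n} - H_n. \]
The harmonic numbers can also be expressed in terms of the digamma function \(\psi \) as \(H_{n-1} = \psi (n) + \gamma \). The digamma function has the asymptotic series
\[ \psi (x) \sim \log x - \frac{1}{2x} - \sum _{k=1}^{\infty } \frac{B_{2k}}{2k\, x^{2k}} \]
using the Bernoulli numbers \(B_n\). Combining the last two equations yields the asymptotic expansion for G(2n):
\[ G(2n) \sim \log (2n) + \sum _{k=1}^{\infty } \frac{B_{2k}\left( 1 - 2^{1-2k}\right) }{2k\, n^{2k}} = \log (2n) + \frac{1}{24 n^2} - \frac{7}{960 n^4} + \frac{31}{8064 n^6} - \dots \]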
This expansion is accurate to double precision for \(n\ge 50\). For smaller values we used a precomputed table. Note that, besides the logarithm, it suffices to calculate a single division \(1/n^2\) and then proceed with nested multiplication.
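The scheme described above (a table for small arguments, an expansion in \(1/n^2\) evaluated by nested multiplication otherwise) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the truncated expansion \(G(2n) \approx \log (2n) + \frac{1}{24 n^2} - \frac{7}{960 n^4} + \frac{31}{8064 n^6} - \frac{127}{30720 n^8}\), whose coefficients were derived here from the digamma series, and the choice of four terms and all names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

# Assumed coefficients B_{2k} * (1 - 2^(1 - 2k)) / (2k) for k = 1..4.
COEFFS = (1 / 24, -7 / 960, 31 / 8064, -127 / 30720)

TABLE_LIMIT = 50  # below this the expansion is not used


def g_even_direct(n):
    """G(2n) by direct summation: -gamma - log 2 + sum_{k<n} 2/(2k+1)."""
    return -EULER_GAMMA - math.log(2) + sum(2.0 / (2 * k + 1) for k in range(n))


def g_even_asymptotic(n):
    """G(2n) from the truncated expansion, using one division and Horner's rule."""
    x = 1.0 / (n * n)
    acc = 0.0
    for coeff in reversed(COEFFS):
        acc = (acc + coeff) * x  # nested multiplication in powers of 1/n^2
    return math.log(2 * n) + acc


# Precomputed table for small arguments: entry n holds G(2n).
G_EVEN_TABLE = [g_even_direct(n) for n in range(TABLE_LIMIT)]


def grassberger_g(count):
    """G(count) for any non-negative count, using G(2n+1) = G(2n)."""
    n = count // 2
    return G_EVEN_TABLE[n] if n < TABLE_LIMIT else g_even_asymptotic(n)


if __name__ == "__main__":
    for n in (50, 100, 1000):
        print(n, g_even_direct(n), g_even_asymptotic(n))
```

Evaluating the polynomial in nested form keeps the cost at one logarithm, one division and a handful of multiply-add steps per call, which matters when G is evaluated for every count in every candidate equivalence class.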
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Stammler, S., Katzenbeisser, S., Hamacher, K. (2016). Correcting Finite Sampling Issues in Entropy l-diversity. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science(), vol 9867. Springer, Cham. https://doi.org/10.1007/978-3-319-45381-1_11
DOI: https://doi.org/10.1007/978-3-319-45381-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45380-4
Online ISBN: 978-3-319-45381-1
eBook Packages: Computer Science, Computer Science (R0)