
Correcting Finite Sampling Issues in Entropy l-diversity

  • Conference paper
  • In: Privacy in Statistical Databases (PSD 2016)
  • Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9867)

Abstract

In statistical disclosure control (SDC), anonymized versions of a database table are obtained via generalization and suppression to mitigate de-anonymization attacks, ideally with minimal loss of utility. This amounts to an optimization problem in which a measure of remaining diversity needs to be improved. The feasible solutions are those that fulfill a given privacy criterion, e.g., entropy l-diversity. It is known in statistics that the naive computation of an entropy via the Shannon formula systematically underestimates the (real) entropy and thus influences the resulting equivalence classes. In this contribution we implement an asymptotically unbiased estimator for the Shannon entropy and apply it to three test databases. Our results reveal the systematic miscalculations made previously; we show that with an unbiased estimator one can increase the utility of the data without compromising privacy.
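The underestimation of the naive (plug-in) Shannon entropy on finite samples can be illustrated with a small simulation. The following Python sketch is purely illustrative and not the paper's experimental setup: it draws samples of increasing size from an assumed toy source, a uniform distribution over 16 symbols, and averages the plug-in estimate, which stays below the true value ln(16) and approaches it only as the sample grows.

```python
import math
import random

def plugin_entropy(counts):
    """Naive (plug-in) Shannon entropy in nats, computed from symbol counts."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

# Toy illustration (assumed setup, not the paper's data): draw samples from a
# uniform source over 16 symbols, whose true entropy is ln(16) ~ 2.773 nats,
# and average the plug-in estimate over many repetitions.
random.seed(0)
k, true_h = 16, math.log(16)
for sample_size in (20, 50, 200, 1000):
    runs = []
    for _ in range(500):
        counts = [0] * k
        for _ in range(sample_size):
            counts[random.randrange(k)] += 1
        runs.append(plugin_entropy(counts))
    mean = sum(runs) / len(runs)
    print(f"N = {sample_size:4d}: mean plug-in entropy {mean:.3f} vs. true {true_h:.3f}")
```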

Notes

  1. Note that typically the logarithm to base two is used. However, any base will do, as we can simply rescale all equations and inequalities. We will use the natural logarithm in the subsequent parts, as the estimator we use is more easily derived in nats rather than bits.

  2. Often written \(\hat{p}(s)\) to indicate that these are not the “real” probabilities, but rather stem from a statistical model or concrete data.

  3. Frequencies are just the counts divided by the total number of observations, namely \(\hat{p}(s)=\frac{\hat{n}(s)}{\sum _\sigma \hat{n}(\sigma )}\).

  4. Note that G(0) is never used but marks the start of the inductive definition. \(\gamma \approx 0.577215\dots \) is Euler’s constant.

  5. With growing counts, the systematic bias tends to vanish in the statistical sense.

  6. Cf. the 1000 Genomes Project, http://www.1000genomes.org.

  7. The values of l and the sizes used can be seen in Fig. 2.

  8. \(G(2)/\log (2) \approx 1.053, G(4)/\log (4) \approx 1.0072, G(6)/\log (6) \approx 1.0025\); these values can be reproduced with the sketch following these notes.

  9. We therefore recommend carefully analyzing what one wants to achieve: with such a small privacy guarantee (\(l < 2\)) one could consequently just release the raw data.
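The ratios quoted in note 8, and the vanishing bias referred to in note 5, can be reproduced from the closed form of G given in Appendix A. The following Python sketch assumes only that closed form; the function name G and the numerical value of Euler's constant are our own additions.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant gamma (note 4)

def G(n):
    """Closed form from Appendix A: G(2m) = -gamma - ln 2 + sum_{k=0}^{m-1} 2/(2k+1),
    together with G(2m+1) := G(2m)."""
    m = n // 2
    return -EULER_GAMMA - math.log(2) + sum(2.0 / (2 * k + 1) for k in range(m))

# Reproduces the ratios from note 8; they approach 1 for growing counts,
# which is the vanishing bias mentioned in note 5.
for n in (2, 4, 6):
    print(f"G({n})/log({n}) = {G(n) / math.log(n):.4f}")
```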


Acknowledgements

The research reported in this paper has been supported by the German Federal Ministry of Education and Research (BMBF) [and by the Hessian Ministry of Science and the Arts] within CRISP (www.crisp-da.de).

We also thank the ARX-Team for their helpful support in using the API and understanding some inner workings of the framework.

Author information

Correspondence to Kay Hamacher.

A Efficient Computation of the G(n)

The inductive definition (4) of the G(n) can be rewritten as

$$\begin{aligned} G(2n) = -\gamma -\ln 2 + \sum _{k=0}^{n-1} \frac{2}{2k+1} \end{aligned}$$

Remember that \(G(2n+1) := G(2n)\). This equation can simply be rewritten in terms of the Harmonic numbers \(H_n := \sum _{k=1}^{n} 1/k\) as

$$\begin{aligned} G(2n) = -\gamma -\ln 2 + 2 H_{2n-1} - H_{n-1} \end{aligned}$$

The Harmonic numbers can also be expressed in terms of the Digamma function \(\psi \) as \(H_{n-1} = \psi (n) + \gamma \). The Digamma function has the asymptotic series

$$\begin{aligned} \psi (x) = \ln x - \frac{1}{2x} - \sum _{k=1}^{\infty } \frac{B_{2k}}{2k\,x^{2k}}, \end{aligned}$$

using the Bernoulli numbers \(B_n\). Combining the last two equations yields the asymptotic expansion for G(2n):

$$\begin{aligned} G(2n) = \ln (2n) + \frac{1}{24\,n^2} - \frac{7}{960\,n^4} + \frac{31}{8064\,n^6} + \mathscr {O} \left( \frac{1}{n^8}\right) \end{aligned}$$

This expansion is accurate to double precision for \(n\ge 50\). For smaller values we used a precomputed table. Note that, besides the logarithm, it suffices to calculate a single division \(1/n^2\) and then proceed with nested multiplication.
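A minimal Python sketch of the scheme just described (table lookup for small arguments, asymptotic expansion with a single division and nested Horner multiplication for large ones) could look as follows. The function names, the table size of 100 entries, and the switching point (argument 2n with \(n \ge 50\), per the accuracy statement above) are our own illustrative choices.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant gamma

def _G_closed(n):
    """Closed form from the appendix: G(2m) = -gamma - ln 2 + sum_{k<m} 2/(2k+1),
    with G(2m+1) := G(2m)."""
    m = n // 2
    return -EULER_GAMMA - math.log(2) + sum(2.0 / (2 * k + 1) for k in range(m))

# Precomputed table for small arguments; G(0) is never used (note 4).
_G_TABLE = [_G_closed(n) for n in range(100)]

def G(n):
    """G(n) via table lookup for small n and the asymptotic expansion
    G(2m) = ln(2m) + 1/(24 m^2) - 7/(960 m^4) + 31/(8064 m^6) + O(m^-8)
    for larger n, evaluated with one division followed by nested multiplication."""
    if n < len(_G_TABLE):
        return _G_TABLE[n]
    m = n // 2                           # uses G(2m+1) = G(2m)
    x = 1.0 / (m * m)                    # the single division 1/m^2
    corr = ((31.0 / 8064.0 * x - 7.0 / 960.0) * x + 1.0 / 24.0) * x
    return math.log(2 * m) + corr
```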


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Stammler, S., Katzenbeisser, S., Hamacher, K. (2016). Correcting Finite Sampling Issues in Entropy l-diversity. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science, vol. 9867. Springer, Cham. https://doi.org/10.1007/978-3-319-45381-1_11


  • DOI: https://doi.org/10.1007/978-3-319-45381-1_11


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45380-4

  • Online ISBN: 978-3-319-45381-1
