Correcting Finite Sampling Issues in Entropy l-diversity

  • Sebastian Stammler
  • Stefan Katzenbeisser
  • Kay Hamacher
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9867)


In statistical disclosure control (SDC) anonymized versions of a database table are obtained via generalization and suppression to reduce de-anonymization attacks, ideally with minimal utility loss. This amounts to an optimization problem in which a measure of remaining diversity needs to be improved. The feasible solutions are those that fulfill some privacy criteria, e.g., the entropy l-diversity. In the statistics it is known that the naive computation of an entropy via the Shannon formula systematically underestimates the (real) entropy and thus influences the resulting equivalence classes. In this contribution we implement an asymptotically unbiased estimator for the Shannon entropy and apply it to three test databases. Our results show previously performed systematic miscalculations; we show that by an unbiased estimator one can increase the utility of the data without compromising privacy.


Anonymity l-diversity Finite sampling Statistics Information theory 



The research reported in this paper has been supported by the German Federal Ministry of Education and Research (BMBF) [and by the Hessian Ministry of Science and the Arts] within CRISP (

We also thank the ARX-Team for their helpful support in using the API and understanding some inner workings of the framework.


  1. 1.
    Antal, L., Shlomo, N., Elliot, M.: Measuring disclosure risk with entropy in population based frequency tables. In: Domingo-Ferrer [5], pp. 62–78Google Scholar
  2. 2.
    Batu, T., Dasgupta, S., Kumar, R., Rubinfeld, R.: The complexity of approximating the entropy. SIAM J. Comput. 35(1), 132–150 (2005)MathSciNetCrossRefzbMATHGoogle Scholar
  3. 3.
    Brickell, J., Shmatikov, V.: The cost of privacy: destruction of data-mining utility in anonymized data publishing. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 70–78. ACM (2008)Google Scholar
  4. 4.
    Craig, D.W., Goor, R.M., Wang, Z., Paschall, J., Ostell, J., Feolo, M., Sherry, S.T., Manolio, T.A.: Assessing and managing risk when sharing aggregate genetic variant data. Nat. Rev. Genet. 12(10), 730–736 (2011). CrossRefGoogle Scholar
  5. 5.
    Domingo-Ferrer, J. (ed.): PSD 2014. LNCS, vol. 8744. Springer, Heidelberg (2014)zbMATHGoogle Scholar
  6. 6.
    Gionis, A., Tassa, T.: k-anonymization with minimal loss of information. IEEE Trans. Knowl. Data Eng. 21(2), 206–219 (2009)CrossRefzbMATHGoogle Scholar
  7. 7.
    Goeman, J.J., Solari, A.: Multiple hypothesis testing in genomics. Stat. Med. 33(11), 1946–1978 (2014)MathSciNetCrossRefGoogle Scholar
  8. 8.
    Grassberger, P.: Entropy estimates from insufficient samplings arXiv:physics/0307138 (2008)
  9. 9.
    Grassberger, P.: Finite sample corrections to entropy and dimension estimates. Phys. Lett. A 128(6), 369–373 (1988)MathSciNetCrossRefGoogle Scholar
  10. 10.
    Hamacher, K.: Using lisp macro-facilities for transferable statistical tests. In: 9th European Lisp Symposium (accepted, 2016)Google Scholar
  11. 11.
    Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 279–288. ACM (2002)Google Scholar
  12. 12.
    Kohlmayer, F., Prasser, F., Eckert, C., Kemper, A., Kuhn, K.: Flash: efficient, stable and optimal \(k\)-anonymity. In: Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Confernece on Social Computing (SocialCom), pp. 708–717, September 2012Google Scholar
  13. 13.
    Kohlmayer, F., Prasser, F., Kuhn, K.A.: The cost of quality: implementing generalization and suppression for anonymizing biomedical data with minimal information loss. J. Biomed. Inform. 58, 37–48 (2015)CrossRefGoogle Scholar
  14. 14.
    LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-anonymity. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, pp. 25–25. IEEE (2006)Google Scholar
  15. 15.
    Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: IEEE 23rd International Conference on Data Engineering, ICDE 2007, pp. 106–115. IEEE (2007)Google Scholar
  16. 16.
    Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: \(l\)-diversity: privacy beyond \(k\)-anonymity. ACM Trans. Knowl. Discov. Data (TKDD) 1(1), 3 (2007)CrossRefGoogle Scholar
  17. 17.
    MacKay, D.: Information Theory, Inference, and Learning Algorithms, 2nd edn. Cambridge University Press, Cambridge (2004)Google Scholar
  18. 18.
    Narayanan, A., Shmatikov, V.: Myths and fallacies of “personally identifiable information”. Commun. ACM 53(6), 24–26 (2010). CrossRefGoogle Scholar
  19. 19.
    Nergiz, M.E., Atzori, M., Clifton, C.: Hiding the presence of individuals from shared databases. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD 2007, pp. 665–676. ACM, New York (2007).
  20. 20.
    Ohm, P.: Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Rev. 57, 1701 (2009)Google Scholar
  21. 21.
    Prasser, F., Kohlmayer, F., Lautenschläger, R., Kuhn, K.A.: ARX - a comprehensive tool for anonymizing biomedical data. In: Proceedings of the AMIA 2014 Annual Symposium, Washington D.C., USA, November 2014Google Scholar
  22. 22.
    Roldán, É.: Estimating the Kullback-Leibler divergence. In: Irreversibility and Dissipation in Microscopic Systems, pp. 61–85. Springer International Publishing, Cham (2014)Google Scholar
  23. 23.
    Schürmann, T.: Bias analysis in entropy estimation. J. Phys. A: Math. Gen. 37(27), L295 (2004)MathSciNetCrossRefzbMATHGoogle Scholar
  24. 24.
    Schürmann, T.: A note on entropy estimation. Neural Comput. 27(10), 2097–2106 (2015)CrossRefGoogle Scholar
  25. 25.
    Siegel, S.: Non-parametric Statistics for the Behavioral Sciences. McGraw-Hill, New York (1956)zbMATHGoogle Scholar
  26. 26.
    Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E.: A comparison of blocking methods for record linkage. In: Domingo-Ferrer [5], pp. 253–268Google Scholar
  27. 27.
    Sweeney, L.: Achieving \(k\)-anonymity privacy protection using generalization and suppression. Int. J. Uncertainty, Fuzziness Knowl. Based Syst. 10(5), 571–588 (2002)MathSciNetCrossRefzbMATHGoogle Scholar
  28. 28.
    Sweeney, L.: \(k\)-anonymity: a model for protecting privacy. Int. J. Uncertainty, Fuzziness Knowl. Based Syst. 10(5), 557–570 (2002)MathSciNetCrossRefzbMATHGoogle Scholar
  29. 29.
    Weil, P., Hoffgaard, F., Hamacher, K.: Estimating sufficient statistics in co-evolutionary analysis by mutual information. Comput. Biol. Chem. 33(6), 440–444 (2009)MathSciNetCrossRefGoogle Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Sebastian Stammler
    • 1
  • Stefan Katzenbeisser
    • 1
  • Kay Hamacher
    • 1
  1. 1.Technische Universität DarmstadtDarmstadtGermany

Personalised recommendations