
Putting Statistical Disclosure Control into Practice: The ARX Data Anonymization Tool

  • Fabian Prasser
  • Florian Kohlmayer

Abstract

The sharing of sensitive personal data has become a core element of biomedical research. To protect privacy, a broad spectrum of techniques must be implemented, including data anonymization. In this article, we present ARX, an anonymization tool for structured data that supports a wide range of methods for statistical disclosure control by providing (1) models for analyzing re-identification risks, (2) risk-based anonymization, (3) syntactic privacy criteria, such as k-anonymity, ℓ-diversity, t-closeness and δ-presence, (4) methods for automated and manual evaluation of data utility, and (5) an intuitive coding model using generalization, suppression and microaggregation. ARX is highly scalable and can anonymize datasets with several million records on commodity hardware. Moreover, it offers a comprehensive graphical user interface with wizards and visualizations that guide users through different aspects of the anonymization process. ARX is not just a toolbox but a fully-fledged application: all implemented methods have been harmonized and integrated with each other. It is well understood that balancing privacy and data utility requires user feedback. To facilitate this interaction, ARX is highly configurable and provides various methods for exploring the solution space.
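The abstract describes ARX both as an end-user application and as a library with an application programming interface. To make the described workflow concrete, the sketch below shows how a simple k-anonymity run might look with ARX's public Java API, following the style of the examples published at arx.deidentifier.org. The toy dataset, the age hierarchy, and the class name KAnonymityExample are illustrative assumptions; in particular, addPrivacyModel and setSuppressionLimit follow more recent ARX releases, while versions contemporary with this article exposed the same functionality under addCriterion and setMaxOutliers.

import org.deidentifier.arx.ARXAnonymizer;
import org.deidentifier.arx.ARXConfiguration;
import org.deidentifier.arx.ARXResult;
import org.deidentifier.arx.AttributeType;
import org.deidentifier.arx.AttributeType.Hierarchy;
import org.deidentifier.arx.AttributeType.Hierarchy.DefaultHierarchy;
import org.deidentifier.arx.Data;
import org.deidentifier.arx.Data.DefaultData;
import org.deidentifier.arx.criteria.KAnonymity;

import java.util.Arrays;
import java.util.Iterator;

public class KAnonymityExample {
    public static void main(String[] args) throws Exception {
        // Toy dataset: header row followed by records (illustrative values)
        DefaultData data = Data.create();
        data.add("age", "gender", "zipcode");
        data.add("34", "male",   "81667");
        data.add("45", "female", "81675");
        data.add("66", "male",   "81925");
        data.add("70", "female", "81931");

        // Generalization hierarchy for the quasi-identifier "age"
        DefaultHierarchy age = Hierarchy.create();
        age.add("34", "<50",  "*");
        age.add("45", "<50",  "*");
        age.add("66", ">=50", "*");
        age.add("70", ">=50", "*");

        // Attribute roles: "age" is a quasi-identifier with a hierarchy,
        // "gender" is kept unmodified, "zipcode" is removed as directly identifying
        data.getDefinition().setAttributeType("age", age);
        data.getDefinition().setAttributeType("gender", AttributeType.INSENSITIVE_ATTRIBUTE);
        data.getDefinition().setAttributeType("zipcode", AttributeType.IDENTIFYING_ATTRIBUTE);

        // Require 2-anonymity and allow at most 10% of records to be suppressed
        ARXConfiguration config = ARXConfiguration.create();
        config.addPrivacyModel(new KAnonymity(2));
        config.setSuppressionLimit(0.1d);

        // Search the solution space and print the transformed output
        ARXAnonymizer anonymizer = new ARXAnonymizer();
        ARXResult result = anonymizer.anonymize(data, config);
        Iterator<String[]> it = result.getOutput().iterator();
        while (it.hasNext()) {
            System.out.println(Arrays.toString(it.next()));
        }
    }
}

The same configuration object accepts further privacy models (for example ℓ-diversity or t-closeness on attributes declared as sensitive), which is how the criteria listed in the abstract can be combined within a single anonymization run.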

Keywords

Application Program Interface · Data Utility · Utility Measure · Sensitive Attribute · Pruning Strategy

Notes

Acknowledgements

The authors would like to express their appreciation to Klaus A. Kuhn for his many helpful and insightful comments and suggestions.


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Biomedical Informatics, Technische Universität München, München, Germany
