
Anonymizing Machine Learning Models

  • Conference paper
  • In: Data Privacy Management, Cryptocurrencies and Blockchain Technology (DPM 2021, CBT 2021)

Abstract

There is a known tension between the need to analyze personal data to drive business and the need to preserve the privacy of data subjects. Many data protection regulations, including the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), set out strict restrictions and obligations on the collection and processing of personal data. Moreover, machine learning models themselves can be used to derive personal information, as demonstrated by recent membership and attribute inference attacks. Anonymized data, however, is exempt from the obligations set out in these regulations. It is therefore desirable to be able to create models that are anonymized, thus also exempting them from those obligations, in addition to providing better protection against attacks.

Learning on anonymized data typically results in a significant degradation in accuracy. In this work, we propose a method that achieves better model accuracy by using the knowledge encoded within the trained model to guide the anonymization process and minimize its impact on the model's accuracy, a process we call accuracy-guided anonymization. We demonstrate that by focusing on the model's accuracy rather than on generic information-loss measures, our method outperforms state-of-the-art k-anonymity methods in terms of the achieved utility, in particular for high values of k and large numbers of quasi-identifiers.
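To make the idea concrete, the following is a minimal sketch of one way accuracy-guided anonymization could be realized. It is an illustration under simplifying assumptions (numeric quasi-identifiers, scikit-learn available), not the implementation evaluated in this paper, and the helper name accuracy_guided_anonymize is hypothetical.

    # Illustrative sketch of accuracy-guided k-anonymization (not the paper's exact code).
    # Assumes numeric quasi-identifiers and uses scikit-learn for the partitioning step.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def accuracy_guided_anonymize(X_train, qi_indices, target_model, k=100):
        """Generalize quasi-identifier columns so that every group contains at
        least k records, partitioning records according to the trained model's
        own predictions rather than a generic information-loss measure."""
        X_anon = X_train.copy()
        qi = X_train[:, qi_indices]

        # Labels for the partitioning step come from the model whose accuracy
        # we want to preserve.
        y_pred = target_model.predict(X_train)

        # A decision tree with min_samples_leaf=k produces groups of at least k
        # records that the target model treats (almost) identically.
        partitioner = DecisionTreeClassifier(min_samples_leaf=k, random_state=0)
        partitioner.fit(qi, y_pred)
        leaves = partitioner.apply(qi)

        # Within each group, replace the quasi-identifier values with a common
        # representative (here, the group mean), making the records
        # indistinguishable on their quasi-identifiers.
        for leaf in np.unique(leaves):
            members = leaves == leaf
            X_anon[np.ix_(members, qi_indices)] = qi[members].mean(axis=0)
        return X_anon

A model retrained on X_anon with the original labels would then serve as the released, anonymized model.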

We also demonstrate that our approach prevents membership inference attacks as well as, and sometimes better than, approaches based on differential privacy, while avoiding some of their drawbacks, such as complexity, performance overhead and model-specific implementations. In addition, since our approach does not modify the training algorithm itself, it can even be applied to “black-box” models, where the data owner does not have full control over the training process, or within complex machine learning pipelines in which it may be difficult to replace existing learning algorithms with new ones. This makes model-guided anonymization a legitimate substitute for such methods and a practical approach to creating privacy-preserving models.
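As a rough illustration of this "black-box" property, and continuing the hypothetical sketch above, the anonymization is purely a data preprocessing step, so the existing training call is left untouched (whereas, for example, DP-SGD replaces the optimizer itself):

    # Hypothetical end-to-end use of the sketch above; X_train, y_train and
    # qi_indices are assumed to be defined as before.
    from sklearn.base import clone
    from sklearn.ensemble import RandomForestClassifier

    # Train the initial model with the existing, unmodified pipeline.
    initial_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Anonymize the training data, guided by that model.
    X_anon = accuracy_guided_anonymize(X_train, qi_indices, initial_model, k=100)

    # Retrain with the same learning algorithm on the anonymized data; nothing
    # in the training code itself changes.
    released_model = clone(initial_model).fit(X_anon, y_train)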


Notes

  1. https://ec.europa.eu/info/law/law-topic/data-protection/data-protection-eu_en.

  2. https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201720180AB375.

  3. https://www.europarl.europa.eu/thinktank/en/document.html?reference=EPRS_STU(2020)641530.

  4. https://archive.ics.uci.edu/ml/datasets/adult.

  5. https://www.lendingclub.com/info/download-data.action.

  6. https://github.com/sam-fletcher/Smooth_Random_Trees.

  7. https://github.com/facebookresearch/pytorch-dp.

  8. https://archive.ics.uci.edu/ml/datasets/nursery.

References

  1. Abadi, M., et al.: Deep learning with differential privacy. In: Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318 (2016)

  2. Bagdasaryan, E., Shmatikov, V.: Differential privacy has disparate impact on model accuracy. In: Advances in Neural Information Processing Systems, pp. 15453–15462 (2019)

  3. Domingo-Ferrer, J., Torra, V.: A critique of k-anonymity and some of its enhancements. In: 3rd International Conference on Availability, Reliability and Security (ARES), pp. 990–993 (2008). https://doi.org/10.1109/ARES.2008.97

  4. Emam, K.E., Dankar, F.K.: Protecting privacy using k-anonymity. J. Am. Med. Inform. Assoc. 15(5), 627–637 (2008)

  5. Fletcher, S., Islam, M.Z.: Differentially private random decision forests using smooth sensitivity. Expert Syst. Appl. 78(1), 16–31 (2017)

  6. Fredrikson, M., Jha, S., Ristenpart, T.: Model inversion attacks that exploit confidence information and basic countermeasures. In: CCS (2015)

  7. Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., Ristenpart, T.: Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing. In: USENIX Security Symposium, pp. 17–32 (2014)

  8. Ghinita, G., Karras, P., Kalnis, P., Mamoulis, N.: Fast data anonymization with low information loss. In: Very Large Databases (2007)

  9. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NIPS Deep Learning and Representation Learning Workshop (2015)

  10. Huda, M.N., Yamada, S., Sonehara, N.: Recent Progress in Data Engineering and Internet Technology. Lecture Notes in Electrical Engineering, vol. 156. Springer, Heidelberg (2013)

  11. Iwuchukwu, T., DeWitt, D.J., Doan, A., Naughton, J.F.: K-anonymization as spatial indexing: toward scalable and incremental anonymization. In: IEEE 23rd International Conference on Data Engineering (2007)

  12. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: SIGKDD, Edmonton, Alberta (2002)

  13. Jayaraman, B., Evans, D.: Evaluating differentially private machine learning in practice. In: Proceedings of the 28th USENIX Conference on Security Symposium, pp. 1895–1912. USENIX Association, Berkeley (2019)

  14. Kazim, E., Denny, D.M.T., Koshiyama, A.: AI auditing and impact assessment: according to the UK Information Commissioner’s Office. AI Ethics 1, 301–310 (2021)

  15. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-anonymity. In: 22nd International Conference on Data Engineering (2006)

  16. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Workload-aware anonymization techniques for large-scale datasets. ACM Trans. Database Syst. 33(3), 1–47 (2008)

  17. Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: IEEE 23rd International Conference on Data Engineering, pp. 106–115 (2007)

  18. Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1(1), 3-es (2007)

  19. Malle, B., Kieseberg, P., Weippl, E., Holzinger, A.: The right to be forgotten: towards machine learning on perturbed knowledge bases. In: Buccafurri, F., Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds.) CD-ARES 2016. LNCS, vol. 9817, pp. 251–266. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45507-5_17

  20. Melis, L., Song, C., Cristofaro, E.D., Shmatikov, V.: Exploiting unintended feature leakage in collaborative learning. In: IEEE Symposium on Security and Privacy, pp. 691–706 (2019)

  21. Narayanan, A., Shmatikov, V.: How to break anonymity of the Netflix Prize dataset (2006). https://arxiv.org/abs/cs/0610105

  22. Nasr, M., Shokri, R., Houmansadr, A.: Machine learning with membership privacy using adversarial regularization. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 634–646. ACM, New York (2018). https://doi.org/10.1145/3243734.3243855

  23. Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I., Talwar, K.: Semi-supervised knowledge transfer for deep learning from private training data. In: ICLR (2017). https://arxiv.org/abs/1610.05755

  24. Salem, A., Zhang, Y., Humbert, M., Berrang, P., Fritz, M., Backes, M.: ML-Leaks: model and data independent membership inference attacks and defenses on machine learning models. In: Network and Distributed Systems Security Symposium, San Diego, CA, USA (2019). https://doi.org/10.14722/ndss.2019.23119

  25. Senavirathne, N., Torra, V.: On the role of data anonymization in machine learning privacy. In: 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 664–675. IEEE Computer Society, Los Alamitos, CA, USA (2020)

  26. Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: IEEE Symposium on Security and Privacy, San Jose, CA, USA, pp. 3–18 (2017)

  27. Sánchez, D., Martínez, S., Domingo-Ferrer, J.: How to avoid reidentification with proper anonymization (2018). https://arxiv.org/abs/1808.01113

  28. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10, 557–570 (2002)

  29. Veale, M., Binns, R., Edwards, L.: Algorithms that remember: model inversion attacks and data protection law. Philos. Trans. R. Soc. A 376, 20180083 (2018). https://doi.org/10.1098/rsta.2018.0083


Author information


Correspondence to Abigail Goldsteen.


Appendices

A Datasets and Quasi-identifiers

Table 3 describes the datasets used for evaluation. Table 4 presents the attributes used as quasi-identifiers in the different runs.

Table 3. Datasets used for evaluation
Table 4. Quasi-identifiers used for evaluation

B Attack Model

Figure 6 depicts the attack model employed for membership inference, trained with 50% members and 50% non-members. Here n denotes the number of target classes in the attacked model, f(x) is the logit (for NN) or class probability (for RF) of each class, and y is the one-hot encoded true label. This architecture was adapted from [22] and chosen empirically because it yielded better attack accuracy for the tested datasets and models.

Fig. 6. Attack model
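A minimal version of such an attack model can be sketched as follows. This is an illustration under assumptions (scikit-learn, arbitrary hidden-layer sizes) rather than the exact architecture of Fig. 6: it concatenates the attacked model's per-class scores f(x) with the one-hot label y and trains a small binary classifier on a 50/50 mix of members and non-members, as described above. The helper name train_attack_model is hypothetical.

    # Illustrative membership-inference attack model (not the exact Fig. 6 architecture).
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def one_hot(labels, n_classes):
        out = np.zeros((len(labels), n_classes))
        out[np.arange(len(labels)), labels] = 1.0
        return out

    def train_attack_model(f_members, y_members, f_nonmembers, y_nonmembers, n_classes):
        """f_* are the attacked model's outputs f(x): logits (NN) or class
        probabilities (RF), with shape (num_records, n_classes)."""
        # Attack features are [f(x), one-hot(y)]; the attack label is
        # 1 for training-set members and 0 for non-members (50/50 split).
        X_attack = np.vstack([
            np.hstack([f_members, one_hot(y_members, n_classes)]),
            np.hstack([f_nonmembers, one_hot(y_nonmembers, n_classes)]),
        ])
        z_attack = np.concatenate([
            np.ones(len(f_members)),
            np.zeros(len(f_nonmembers)),
        ])
        # Hidden-layer sizes here are illustrative, not taken from the paper.
        attack = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
        attack.fit(X_attack, z_attack)
        return attack

Attack accuracy on held-out members and non-members close to 50% then indicates little membership leakage.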


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Goldsteen, A., Ezov, G., Shmelkin, R., Moffie, M., Farkash, A. (2022). Anonymizing Machine Learning Models. In: Garcia-Alfaro, J., Muñoz-Tapia, J.L., Navarro-Arribas, G., Soriano, M. (eds) Data Privacy Management, Cryptocurrencies and Blockchain Technology. DPM 2021, CBT 2021. Lecture Notes in Computer Science, vol 13140. Springer, Cham. https://doi.org/10.1007/978-3-030-93944-1_8


  • DOI: https://doi.org/10.1007/978-3-030-93944-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93943-4

  • Online ISBN: 978-3-030-93944-1

  • eBook Packages: Computer Science, Computer Science (R0)
