
Anonymizing Machine Learning Models

  • Conference paper
  • In: Data Privacy Management, Cryptocurrencies and Blockchain Technology (DPM 2021, CBT 2021)

Abstract

There is a known tension between the need to analyze personal data to drive business and the need to preserve the privacy of data subjects. Many data protection regulations, including the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), set out strict restrictions and obligations on the collection and processing of personal data. Moreover, machine learning models themselves can be used to derive personal information, as demonstrated by recent membership and attribute inference attacks. Anonymized data, however, is exempt from the obligations set out in these regulations. It is therefore desirable to be able to create models that are anonymized, thus also exempting them from those obligations, in addition to providing better protection against attacks.

Learning on anonymized data typically results in a significant degradation in accuracy. In this work, we propose a method that achieves better model accuracy by using the knowledge encoded within the trained model to guide the anonymization process and minimize its impact on the model's accuracy, a process we call accuracy-guided anonymization. We demonstrate that by focusing on the model's accuracy rather than on generic information-loss measures, our method outperforms state-of-the-art k-anonymity methods in terms of the achieved utility, in particular for high values of k and large numbers of quasi-identifiers.
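To make the idea concrete, the following is a minimal sketch of one way accuracy-guided anonymization could be realized. It is an illustration under simplifying assumptions (numeric quasi-identifiers, scikit-learn available), not the implementation evaluated in this paper, and the helper name accuracy_guided_anonymize is hypothetical.

    # Illustrative sketch of accuracy-guided k-anonymization (not the paper's exact code).
    # Assumes numeric quasi-identifiers and uses scikit-learn for the partitioning step.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def accuracy_guided_anonymize(X_train, qi_indices, target_model, k=100):
        """Generalize quasi-identifier columns so that every group contains at
        least k records, partitioning records according to the trained model's
        own predictions rather than a generic information-loss measure."""
        X_anon = X_train.copy()
        qi = X_train[:, qi_indices]

        # Labels for the partitioning step come from the model whose accuracy
        # we want to preserve.
        y_pred = target_model.predict(X_train)

        # A decision tree with min_samples_leaf=k produces groups of at least k
        # records that the target model treats (almost) identically.
        partitioner = DecisionTreeClassifier(min_samples_leaf=k, random_state=0)
        partitioner.fit(qi, y_pred)
        leaves = partitioner.apply(qi)

        # Within each group, replace the quasi-identifier values with a common
        # representative (here, the group mean), making the records
        # indistinguishable on their quasi-identifiers.
        for leaf in np.unique(leaves):
            members = leaves == leaf
            X_anon[np.ix_(members, qi_indices)] = qi[members].mean(axis=0)
        return X_anon

A model retrained on X_anon with the original labels would then serve as the released, anonymized model.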

We also demonstrate that our approach prevents membership inference attacks as well as, and sometimes better than, approaches based on differential privacy, while avoiding some of their drawbacks, such as complexity, performance overhead and model-specific implementations. In addition, since our approach does not modify the training algorithm itself, it can even be applied to “black-box” models, where the data owner does not have full control over the training process, or within complex machine learning pipelines in which it may be difficult to replace existing learning algorithms with new ones. This makes model-guided anonymization a legitimate substitute for such methods and a practical approach to creating privacy-preserving models.
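As a rough illustration of this "black-box" property, and continuing the hypothetical sketch above, the anonymization is purely a data preprocessing step, so the existing training call is left untouched (whereas, for example, DP-SGD replaces the optimizer itself):

    # Hypothetical end-to-end use of the sketch above; X_train, y_train and
    # qi_indices are assumed to be defined as before.
    from sklearn.base import clone
    from sklearn.ensemble import RandomForestClassifier

    # Train the initial model with the existing, unmodified pipeline.
    initial_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Anonymize the training data, guided by that model.
    X_anon = accuracy_guided_anonymize(X_train, qi_indices, initial_model, k=100)

    # Retrain with the same learning algorithm on the anonymized data; nothing
    # in the training code itself changes.
    released_model = clone(initial_model).fit(X_anon, y_train)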


Notes

  1. https://ec.europa.eu/info/law/law-topic/data-protection/data-protection-eu_en.

  2. https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201720180AB375.

  3. https://www.europarl.europa.eu/thinktank/en/document.html?reference=EPRS_STU(2020)641530.

  4. https://archive.ics.uci.edu/ml/datasets/adult.

  5. https://www.lendingclub.com/info/download-data.action.

  6. https://github.com/sam-fletcher/Smooth_Random_Trees.

  7. https://github.com/facebookresearch/pytorch-dp.

  8. https://archive.ics.uci.edu/ml/datasets/nursery.

References

  1. Abadi, M., et al.: Deep learning with differential privacy. In: Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318 (2016)

  2. Bagdasaryan, E., Shmatikov, V.: Differential privacy has disparate impact on model accuracy. In: Advances in Neural Information Processing Systems, pp. 15453–15462 (2019)

  3. Domingo-Ferrer, J., Torra, V.: A critique of k-anonymity and some of its enhancements. In: 3rd International Conference on Availability, Reliability and Security (ARES), pp. 990–993 (2008). https://doi.org/10.1109/ARES.2008.97

  4. Emam, K.E., Dankar, F.K.: Protecting privacy using k-anonymity. J. Am. Med. Inform. Assoc. 15(5), 627–637 (2008)

  5. Fletcher, S., Islam, M.Z.: Differentially private random decision forests using smooth sensitivity. Expert Syst. Appl. 78(1), 16–31 (2017)

  6. Fredrikson, M., Jha, S., Ristenpart, T.: Model inversion attacks that exploit confidence information and basic countermeasures. In: CCS (2015)

  7. Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., Ristenpart, T.: Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing. In: USENIX Security Symposium, pp. 17–32 (2014)

  8. Ghinita, G., Karras, P., Kalnis, P., Mamoulis, N.: Fast data anonymization with low information loss. In: Very Large Databases (2007)

  9. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NIPS Deep Learning and Representation Learning Workshop (2015)

  10. Huda, M.N., Yamada, S., Sonehara, N.: Recent Progress in Data Engineering and Internet Technology. Lecture Notes in Electrical Engineering, vol. 156. Springer, Heidelberg (2013)

  11. Iwuchukwu, T., DeWitt, D.J., Doan, A., Naughton, J.F.: K-anonymization as spatial indexing: toward scalable and incremental anonymization. In: IEEE 23rd International Conference on Data Engineering (2007)

  12. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: SIGKDD, Edmonton, Alberta (2002)

  13. Jayaraman, B., Evans, D.: Evaluating differentially private machine learning in practice. In: Proceedings of the 28th USENIX Conference on Security Symposium, pp. 1895–1912. USENIX Association, Berkeley (2019)

  14. Kazim, E., Denny, D.M.T., Koshiyama, A.: AI auditing and impact assessment: according to the UK Information Commissioner’s Office. AI Ethics 1, 301–310 (2021)

  15. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-anonymity. In: 22nd International Conference on Data Engineering (2006)

  16. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Workload-aware anonymization techniques for large-scale datasets. ACM Trans. Database Syst. 33(3), 1–47 (2008)

  17. Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: IEEE 23rd International Conference on Data Engineering, pp. 106–115 (2007)

  18. Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1(1), 3-es (2007)

  19. Malle, B., Kieseberg, P., Weippl, E., Holzinger, A.: The right to be forgotten: towards machine learning on perturbed knowledge bases. In: Buccafurri, F., Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds.) CD-ARES 2016. LNCS, vol. 9817, pp. 251–266. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45507-5_17

  20. Melis, L., Song, C., Cristofaro, E.D., Shmatikov, V.: Exploiting unintended feature leakage in collaborative learning. In: IEEE Symposium on Security and Privacy, pp. 691–706 (2019)

  21. Narayanan, A., Shmatikov, V.: How to break anonymity of the Netflix Prize dataset (2006). https://arxiv.org/abs/cs/0610105

  22. Nasr, M., Shokri, R., Houmansadr, A.: Machine learning with membership privacy using adversarial regularization. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 634–646. ACM, New York (2018). https://doi.org/10.1145/3243734.3243855

  23. Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I., Talwar, K.: Semi-supervised knowledge transfer for deep learning from private training data. In: ICLR (2017). https://arxiv.org/abs/1610.05755

  24. Salem, A., Zhang, Y., Humbert, M., Berrang, P., Fritz, M., Backes, M.: ML-Leaks: model and data independent membership inference attacks and defenses on machine learning models. In: Network and Distributed Systems Security Symposium, San Diego, CA, USA (2019). https://doi.org/10.14722/ndss.2019.23119

  25. Senavirathne, N., Torra, V.: On the role of data anonymization in machine learning privacy. In: 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 664–675. IEEE Computer Society, Los Alamitos, CA, USA (2020)

  26. Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: IEEE Symposium on Security and Privacy, San Jose, CA, USA, pp. 3–18 (2017)

  27. Sánchez, D., Martínez, S., Domingo-Ferrer, J.: How to avoid reidentification with proper anonymization (2018). https://arxiv.org/abs/1808.01113

  28. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10, 557–570 (2002)

  29. Veale, M., Binns, R., Edwards, L.: Algorithms that remember: model inversion attacks and data protection law. Philos. Trans. R. Soc. A 376, 20180083 (2018). https://doi.org/10.1098/rsta.2018.0083


Author information


Correspondence to Abigail Goldsteen.


Appendices

A Datasets and Quasi-identifiers

Table 3 describes the datasets used for evaluation. Table 4 presents the attributes used as quasi-identifiers in the different runs.

Table 3. Datasets used for evaluation
Table 4. Quasi-identifiers used for evaluation

B Attack Model

Figure 6 depicts the attack model employed for membership inference, trained with 50% members and 50% non-members. Here n denotes the number of target classes in the attacked model, f(x) is the logit (for NN) or class probability (for RF) of each class, and y is the one-hot encoded true label. This architecture was adapted from [22] and chosen empirically because it yielded better attack accuracy for the tested datasets and models.

Fig. 6. Attack model
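A minimal version of such an attack model can be sketched as follows. This is an illustration under assumptions (scikit-learn, arbitrary hidden-layer sizes) rather than the exact architecture of Fig. 6: it concatenates the attacked model's per-class scores f(x) with the one-hot label y and trains a small binary classifier on a 50/50 mix of members and non-members, as described above. The helper name train_attack_model is hypothetical.

    # Illustrative membership-inference attack model (not the exact Fig. 6 architecture).
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def one_hot(labels, n_classes):
        out = np.zeros((len(labels), n_classes))
        out[np.arange(len(labels)), labels] = 1.0
        return out

    def train_attack_model(f_members, y_members, f_nonmembers, y_nonmembers, n_classes):
        """f_* are the attacked model's outputs f(x): logits (NN) or class
        probabilities (RF), with shape (num_records, n_classes)."""
        # Attack features are [f(x), one-hot(y)]; the attack label is
        # 1 for training-set members and 0 for non-members (50/50 split).
        X_attack = np.vstack([
            np.hstack([f_members, one_hot(y_members, n_classes)]),
            np.hstack([f_nonmembers, one_hot(y_nonmembers, n_classes)]),
        ])
        z_attack = np.concatenate([
            np.ones(len(f_members)),
            np.zeros(len(f_nonmembers)),
        ])
        # Hidden-layer sizes here are illustrative, not taken from the paper.
        attack = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
        attack.fit(X_attack, z_attack)
        return attack

Attack accuracy on held-out members and non-members close to 50% then indicates little membership leakage.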


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Goldsteen, A., Ezov, G., Shmelkin, R., Moffie, M., Farkash, A. (2022). Anonymizing Machine Learning Models. In: Garcia-Alfaro, J., Muñoz-Tapia, J.L., Navarro-Arribas, G., Soriano, M. (eds) Data Privacy Management, Cryptocurrencies and Blockchain Technology. DPM 2021, CBT 2021. Lecture Notes in Computer Science, vol 13140. Springer, Cham. https://doi.org/10.1007/978-3-030-93944-1_8


  • DOI: https://doi.org/10.1007/978-3-030-93944-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93943-4

  • Online ISBN: 978-3-030-93944-1

  • eBook Packages: Computer Science, Computer Science (R0)
