Have You Forgotten? A Method to Assess if Machine Learning Models Have Forgotten Data

  • Conference paper
  • In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020 (MICCAI 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12261)

Abstract

In the era of deep learning, aggregation of data from several sources is a common approach to ensuring data diversity. Let us consider a scenario where several providers contribute data to a consortium for the joint development of a classification model (hereafter the target model), but now one of the providers decides to leave. This provider requests not only that their data (hereafter the query dataset) be removed from the databases but also that the model ‘forget’ their data. In this paper, we address for the first time the challenging question of whether data have been forgotten by a model. We assume knowledge of the query dataset and of the distribution of a model’s outputs. We establish statistical methods that compare the target’s outputs with the outputs of models trained with different datasets. We evaluate our approach on several benchmark datasets (MNIST, CIFAR-10 and SVHN) and on a cardiac pathology diagnosis task using data from the Automated Cardiac Diagnosis Challenge (ACDC). We hope to encourage studies on what information a model retains and to inspire extensions in more complex settings.
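
To make the abstract's idea concrete, below is a minimal sketch, assuming a PyTorch classifier, of what comparing a target model's outputs with those of a model trained on different data can look like: both models are evaluated on the query dataset, a scalar statistic (here, the maximum softmax probability) is collected per sample, and the two resulting distributions are compared with a two-sample Kolmogorov-Smirnov test (cf. [6]). The names target_model, reference_model and query_loader, and the choice of per-sample statistic, are illustrative assumptions and not the paper's exact procedure.

```python
import numpy as np
import torch
from scipy.stats import ks_2samp


def output_statistics(model, loader, device="cpu"):
    """Collect one scalar per query sample: the maximum softmax probability."""
    model.eval()
    stats = []
    with torch.no_grad():
        for x, _ in loader:  # assumes (input, label) batches
            probs = torch.softmax(model(x.to(device)), dim=1)
            stats.append(probs.max(dim=1).values.cpu().numpy())
    return np.concatenate(stats)


def compare_on_query_data(target_model, reference_model, query_loader):
    """Compare the two models' output distributions on the query dataset.

    Returns the two-sample Kolmogorov-Smirnov statistic and p-value. A small
    distance is consistent with the target behaving like a model that never
    saw the query data ('forgotten'); a large distance suggests otherwise.
    Illustration only: the paper's calibrated decision rule may differ.
    """
    s_target = output_statistics(target_model, query_loader)
    s_reference = output_statistics(reference_model, query_loader)
    return ks_2samp(s_target, s_reference)
```

Working with scalar output statistics keeps the comparison one-dimensional, which avoids the difficulty of estimating statistical overlap in high-dimensional spaces mentioned in note 4.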

Notes

  1. Knowledge of the task (e.g. detecting the presence of a pathology in cardiac MRI images) implies knowledge of the domain \(\mathbbm{D}\) (e.g. the space of cardiac MRI images). Without this assumption, \(D^*\) can be anything, rendering the problem intractable.

  2. We do not cover here the different task of making models and data more private, e.g. by means of differential privacy. We point readers to surveys such as [9, 10] and to a recent (but not the only) example application in healthcare [14].

  3. In fact, overlap will frequently occur in the real world. For example, datasets collected by different vendors can overlap if they collaborate with the same hospital or if a patient has visited several hospitals. Our method has been designed to address this challenging aspect of overlap.

  4. As mentioned previously, we cannot measure statistical overlap between \(D^*\) and \(D_C\) (or \(D_Q\)) since \(D^*\) is unknown. Furthermore, measuring statistical overlap in high-dimensional spaces is non-trivial [2, 7], and hence we avoid it.

References

  1. Barillot, C., et al.: Federating distributed and heterogeneous information sources in neuroimaging: the neurobase project. Stud. Health Technol. Inf. 120, 3 (2006)

  2. Belghazi, M.I., et al.: Mutual information neural estimation. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 531–540. PMLR, Stockholmsmässan, Stockholm Sweden, 10–15 July 2018

  3. Bernard, O., et al.: Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Trans. Med. Imaging 37(11), 2514–2525 (2018)

  4. Carlini, N., Liu, C., Erlingsson, U., Kos, J., Song, D.: The secret sharer: evaluating and testing unintended memorization in neural networks. In: Proceedings of the 28th USENIX Conference on Security Symposium, SEC 2019, pp. 267–284. USENIX Association, Berkeley, CA, USA (2019)

  5. Cherubin, G., Chatzikokolakis, K., Palamidessi, C.: F-BLEAU: fast black-box leakage estimation, February 2019. http://arxiv.org/abs/1902.01350

  6. Feller, W.: On the Kolmogorov-Smirnov limit theorems for empirical distributions. In: Schilling, R., Vondraček, Z., Woyczyński, W. (eds.) Selected Papers I, pp. 735–749. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16859-3_38

  7. Glazer, A., Lindenbaum, M., Markovitch, S.: Learning high-density regions for a generalized Kolmogorov-Smirnov test in high-dimensional data. In: NIPS (2012)

  8. Golatkar, A., Achille, A., Soatto, S.: Eternal sunshine of the spotless net: selective forgetting in deep networks (2019)

  9. Gong, M., Xie, Y., Pan, K., Feng, K., Qin, A.K.: A survey on differentially private machine learning [review article]. IEEE Comput. Intell. Mag. 15(2), 49–64 (2020)

  10. Ji, Z., Lipton, Z.C., Elkan, C.: Differential privacy and machine learning: a survey and review (2014)

  11. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)

  12. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  13. Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated learning: challenges, methods, and future directions. IEEE Signal Process. Mag. 37(3), 50–60 (2020)

  14. Li, W., et al.: Privacy-preserving federated brain tumour segmentation. In: Suk, H.I., Liu, M., Yan, P., Lian, C. (eds.) MLMI 2019. LNCS, vol. 11861, pp. 133–141. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32692-0_16

  15. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning (2011)

  16. Pyrgelis, A., Troncoso, C., De Cristofaro, E.: Under the hood of membership inference attacks on aggregate location time-series, February 2019. http://arxiv.org/abs/1902.07456

  17. Roy, A.G., Siddiqui, S., Pölsterl, S., Navab, N., Wachinger, C.: BrainTorrent: a peer-to-peer environment for decentralized federated learning, May 2019. http://arxiv.org/abs/1905.06731

  18. Sablayrolles, A., Douze, M., Schmid, C., Jégou, H.: Déjà Vu: an empirical evaluation of the memorization properties of ConvNets, September 2018. http://arxiv.org/abs/1809.06396

  19. Sheller, M.J., Reina, G.A., Edwards, B., Martin, J., Bakas, S.: Multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018. LNCS, vol. 11383, pp. 92–104. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11723-8_9

  20. Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18, May 2017. https://doi.org/10.1109/SP.2017.41

  21. Torralba, A., Efros, A.: Unbiased look at dataset bias. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1521–1528 (2011)

Acknowledgment

This work was supported by the University of Edinburgh through a PhD studentship. This work was partially supported by the Alan Turing Institute under EPSRC grant EP/N510129/1. S.A. Tsaftaris acknowledges the support of the Royal Academy of Engineering and the Research Chairs and Senior Research Fellowships scheme, and in part the support of the Industrial Centre for AI Research in digital Diagnostics (iCAIRD), which is funded by Innovate UK on behalf of UK Research and Innovation (UKRI) [project number: 104690] (https://icaird.com/).

Author information

Correspondence to Xiao Liu.


Electronic supplementary material

Supplementary material 1 (PDF, 109 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, X., Tsaftaris, S.A. (2020). Have You Forgotten? A Method to Assess if Machine Learning Models Have Forgotten Data. In: Martel, A.L., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. MICCAI 2020. Lecture Notes in Computer Science, vol. 12261. Springer, Cham. https://doi.org/10.1007/978-3-030-59710-8_10

  • DOI: https://doi.org/10.1007/978-3-030-59710-8_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59709-2

  • Online ISBN: 978-3-030-59710-8

  • eBook Packages: Computer Science, Computer Science (R0)
