Have You Forgotten? A Method to Assess if Machine Learning Models Have Forgotten Data

  • Conference paper
  • In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020 (MICCAI 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12261)

Abstract

In the era of deep learning, aggregation of data from several sources is a common approach to ensuring data diversity. Let us consider a scenario where several providers contribute data to a consortium for the joint development of a classification model (hereafter the target model), but now one of the providers decides to leave. This provider requests not only that their data (hereafter the query dataset) be removed from the databases but also that the model ‘forget’ their data. In this paper, we address for the first time the challenging question of whether data have been forgotten by a model. We assume knowledge of the query dataset and of the distribution of a model’s outputs. We establish statistical methods that compare the target’s outputs with the outputs of models trained with different datasets. We evaluate our approach on several benchmark datasets (MNIST, CIFAR-10 and SVHN) and on a cardiac pathology diagnosis task using data from the Automated Cardiac Diagnosis Challenge (ACDC). We hope to encourage studies on what information a model retains and to inspire extensions in more complex settings.
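
To make the abstract's idea concrete, below is a minimal sketch, assuming a PyTorch classifier, of what comparing a target model's outputs with those of a model trained on different data can look like: both models are evaluated on the query dataset, a scalar statistic (here, the maximum softmax probability) is collected per sample, and the two resulting distributions are compared with a two-sample Kolmogorov-Smirnov test (cf. [6]). The names target_model, reference_model and query_loader, and the choice of per-sample statistic, are illustrative assumptions and not the paper's exact procedure.

```python
import numpy as np
import torch
from scipy.stats import ks_2samp


def output_statistics(model, loader, device="cpu"):
    """Collect one scalar per query sample: the maximum softmax probability."""
    model.eval()
    stats = []
    with torch.no_grad():
        for x, _ in loader:  # assumes (input, label) batches
            probs = torch.softmax(model(x.to(device)), dim=1)
            stats.append(probs.max(dim=1).values.cpu().numpy())
    return np.concatenate(stats)


def compare_on_query_data(target_model, reference_model, query_loader):
    """Compare the two models' output distributions on the query dataset.

    Returns the two-sample Kolmogorov-Smirnov statistic and p-value. A small
    distance is consistent with the target behaving like a model that never
    saw the query data ('forgotten'); a large distance suggests otherwise.
    Illustration only: the paper's calibrated decision rule may differ.
    """
    s_target = output_statistics(target_model, query_loader)
    s_reference = output_statistics(reference_model, query_loader)
    return ks_2samp(s_target, s_reference)
```

Working with scalar output statistics keeps the comparison one-dimensional, which avoids the difficulty of estimating statistical overlap in high-dimensional spaces mentioned in note 4.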

Notes

  1. Knowledge of the task (e.g. detecting the presence of a pathology in cardiac MRI images) implies knowledge of the domain \(\mathbbm{D}\) (e.g. the space of cardiac MRI images). Without this assumption, \(D^*\) can be anything, rendering the problem intractable.

  2. We do not cover here the different task of making models and data more private, e.g. by means of differential privacy. We point readers to surveys such as [9, 10] and to a recent (but not the only) example application in healthcare [14].

  3. In fact, overlap will frequently occur in the real world. For example, datasets collected by different vendors can overlap if they collaborate with the same hospital or if a patient has visited several hospitals. Our method has been designed to address this challenging aspect of overlap.

  4. As mentioned previously, we cannot measure statistical overlap between \(D^*\) and \(D_C\) (or \(D_Q\)) since \(D^*\) is unknown. Furthermore, measuring statistical overlap in high-dimensional spaces is non-trivial [2, 7], and hence we avoid it.

References

  1. Barillot, C., et al.: Federating distributed and heterogeneous information sources in neuroimaging: the neurobase project. Stud. Health Technol. Inf. 120, 3 (2006)

  2. Belghazi, M.I., et al.: Mutual information neural estimation. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 531–540. PMLR, Stockholmsmässan, Stockholm Sweden, 10–15 July 2018

  3. Bernard, O., et al.: Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Trans. Med. Imaging 37(11), 2514–2525 (2018)

  4. Carlini, N., Liu, C., Erlingsson, U., Kos, J., Song, D.: The secret sharer: evaluating and testing unintended memorization in neural networks. In: Proceedings of the 28th USENIX Conference on Security Symposium, SEC 2019, pp. 267–284. USENIX Association, Berkeley, CA, USA (2019)

  5. Cherubin, G., Chatzikokolakis, K., Palamidessi, C.: F-BLEAU: fast black-box leakage estimation, February 2019. http://arxiv.org/abs/1902.01350

  6. Feller, W.: On the Kolmogorov-Smirnov limit theorems for empirical distributions. In: Schilling, R., Vondraček, Z., Woyczyński, W. (eds.) Selected Papers I, pp. 735–749. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16859-3_38

  7. Glazer, A., Lindenbaum, M., Markovitch, S.: Learning high-density regions for a generalized Kolmogorov-Smirnov test in high-dimensional data. In: NIPS (2012)

  8. Golatkar, A., Achille, A., Soatto, S.: Eternal sunshine of the spotless net: selective forgetting in deep networks (2019)

  9. Gong, M., Xie, Y., Pan, K., Feng, K., Qin, A.K.: A survey on differentially private machine learning [review article]. IEEE Comput. Intell. Mag. 15(2), 49–64 (2020)

  10. Ji, Z., Lipton, Z.C., Elkan, C.: Differential privacy and machine learning: a survey and review (2014)

  11. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)

  12. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  13. Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated learning: challenges, methods, and future directions. IEEE Signal Process. Mag. 37(3), 50–60 (2020)

  14. Li, W., et al.: Privacy-preserving federated brain tumour segmentation. In: Suk, H.I., Liu, M., Yan, P., Lian, C. (eds.) MLMI 2019. LNCS, vol. 11861, pp. 133–141. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32692-0_16

  15. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning (2011)

  16. Pyrgelis, A., Troncoso, C., De Cristofaro, E.: Under the hood of membership inference attacks on aggregate location time-series, February 2019. http://arxiv.org/abs/1902.07456

  17. Roy, A.G., Siddiqui, S., Pölsterl, S., Navab, N., Wachinger, C.: BrainTorrent: a peer-to-peer environment for decentralized federated learning, May 2019. http://arxiv.org/abs/1905.06731

  18. Sablayrolles, A., Douze, M., Schmid, C., Jégou, H.: Déjà Vu: an empirical evaluation of the memorization properties of ConvNets, September 2018. http://arxiv.org/abs/1809.06396

  19. Sheller, M.J., Reina, G.A., Edwards, B., Martin, J., Bakas, S.: Multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018. LNCS, vol. 11383, pp. 92–104. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11723-8_9

  20. Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18, May 2017. https://doi.org/10.1109/SP.2017.41

  21. Torralba, A., Efros, A.: Unbiased look at dataset bias. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1521–1528 (2011)

Acknowledgment

This work was supported by the University of Edinburgh through a PhD studentship. This work was partially supported by the Alan Turing Institute under EPSRC grant EP/N510129/1. S.A. Tsaftaris acknowledges the support of the Royal Academy of Engineering and the Research Chairs and Senior Research Fellowships scheme, and in part the support of the Industrial Centre for AI Research in digital Diagnostics (iCAIRD), which is funded by Innovate UK on behalf of UK Research and Innovation (UKRI) [project number: 104690] (https://icaird.com/).

Author information

Correspondence to Xiao Liu.


Electronic supplementary material

Supplementary material 1 (PDF, 109 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, X., Tsaftaris, S.A. (2020). Have You Forgotten? A Method to Assess if Machine Learning Models Have Forgotten Data. In: Martel, A.L., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. MICCAI 2020. Lecture Notes in Computer Science, vol. 12261. Springer, Cham. https://doi.org/10.1007/978-3-030-59710-8_10

  • DOI: https://doi.org/10.1007/978-3-030-59710-8_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59709-2

  • Online ISBN: 978-3-030-59710-8

  • eBook Packages: Computer Science, Computer Science (R0)
