
Data minimization for GDPR compliance in machine learning models

  • Original Research
  • Published in: AI and Ethics

Abstract

The EU General Data Protection Regulation (GDPR) and the California Privacy Rights Act (CPRA) mandate the principle of data minimization, which requires that only data necessary to fulfill a certain purpose be collected. However, it can often be difficult to determine the minimal amount of data required, especially in complex machine learning models such as deep neural networks. We present a first-of-its-kind method to reduce the amount of personal data needed to perform predictions with a machine learning model, by removing or generalizing some of the input features of the runtime data. Drawing on knowledge distillation approaches, our method uses the knowledge encoded within the model to produce a generalization that has little to no impact on its accuracy. We show that, in some cases, less data can be collected while preserving the exact same level of model accuracy as before, and that if a small deviation in accuracy is allowed, the input features can be generalized even further. We also demonstrate that when the features are collected dynamically, the generalizations can be improved further still. This method enables organizations to truly minimize the amount of data they collect, thus fulfilling the data minimization requirement set out in the regulations.
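
The implementation accompanying the paper is released in the open-source ai-privacy-toolkit (see note 4 below). The following is not that implementation but a minimal, hypothetical sketch of the core idea on synthetic data: given a trained model, replace a numeric runtime feature with increasingly coarse value ranges and keep the coarsest generalization whose held-out accuracy stays within a chosen tolerance of the baseline. All names and data here are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's algorithm): greedily coarsen a
# numeric feature by binning and keep the coarsest generalization whose
# accuracy loss on held-out data stays within a tolerance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the paper evaluates on datasets such as adult and loan.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))

def generalize(X_in, feature, n_bins):
    """Replace a numeric feature with the midpoint of its value range (bin)."""
    edges = np.quantile(X_train[:, feature], np.linspace(0, 1, n_bins + 1))
    mids = (edges[:-1] + edges[1:]) / 2
    Xg = X_in.copy()
    bin_idx = np.clip(np.digitize(Xg[:, feature], edges[1:-1]), 0, n_bins - 1)
    Xg[:, feature] = mids[bin_idx]
    return Xg

TOLERANCE = 0.01  # maximum allowed drop in accuracy
feature = 0
for n_bins in (2, 4, 8, 16):  # try the coarsest generalization first
    acc = accuracy_score(
        y_test, model.predict(generalize(X_test, feature, n_bins)))
    if baseline - acc <= TOLERANCE:
        print(f"feature {feature}: collect as one of {n_bins} ranges "
              f"(accuracy {acc:.3f} vs. baseline {baseline:.3f})")
        break
```

In the same spirit, categorical features could be coarsened along a generalization hierarchy (e.g., country to region), and a feature whose coarsest acceptable generalization is a single range need not be collected at all.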


Notes

  1. https://ec.europa.eu/info/law/law-topic/data-protection/data-protection-eu_en.

  2. https://www.caprivacy.org/annotated-cpra-text-with-ccpa-changes/.

  3. https://www.europarl.europa.eu/RegData/etudes/STUD/2020/641530/EPRS_STU(2020)641530_EN.pdf.

  4. https://github.com/IBM/ai-privacy-toolkit.

  5. https://archive.ics.uci.edu/ml/datasets/adult.

  6. https://archive.ics.uci.edu/ml/datasets/nursery.

  7. https://www.lendingclub.com/info/download-data.action.

  8. https://scikit-learn.org/stable/.


Author information


Corresponding author

Correspondence to Abigail Goldsteen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A Experimental setup

A.1 Features

List of features used in adult dataset: age, workclass, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country.

List of features used in nursery dataset: parents, has_nurs, form, children, housing, finance, health, social.

List of features used in GSS dataset: work_status, marital_status, children, age, gender, race, happiness, x_rated.

List of features used in Loan dataset: loan_amount, funded_amount, investor_funds, term, interest_rate, installment, grade, sub_grade, emp_length, home_ownership, annual_income, verification_status, pymnt_plan, purpose, zip_code, dti, delinq_2yrs, inq_last_6mths, mths_since_last_delinq, mths_since_last_record, open_acc, pub_rec, revol_bal, revol_util, total_acc, initial_list_status, out_prncp, out_prncp_inv, total_rec_int, total_rec_late_fee, last_pymnt_amnt, collections_12_mths_ex_med, policy_code, application_type, acc_now_delinq, chargeoff_within_12_mths, delinq_amnt, tax_liens, hardship_flag, disbursement_method, year, earliest_cr_year, region.

A.2 Pre-processing

Pre-processing applied to adult dataset: the native-country feature was transformed to the following areas: Euro_1 (Italy, Holand-Netherlands, Germany, France), Euro_2 (Yugoslavia, South, Portugal, Poland, Hungary, Greece), SE_Asia (Vietnam, Thailand, Philippines, Laos, Cambodia), UnitedStates (United-States), LatinAmerica (Trinadad&Tobago, Puerto-Rico, Outlying-US(Guam-USVI-etc), Nicaragua, Mexico, Jamaica, Honduras, Haiti, Guatemala, Dominican-Republic), China (Taiwan, Hong, China), BritishCommonwealth (Scotland, Ireland, India, England, Canada), SouthAmerica (Peru, El-Salvador, Ecuador, Columbia), Other (Japan, Iran, Cuba), Unknown (?).

Pre-processing applied to loan dataset: the addr_state feature was transformed to the following regions: west (CA, OR, UT, WA, CO, NV, AK, MT, HI, WY, ID), south_west (AZ, TX, NM, OK), south_east (GA, NC, VA, FL, KY, SC, LA, AL, WV, DC, AR, DE, MS, TN), mid_west (IL, MO, MN, OH, WI, KS, MI, SD, IA, NE, IN, ND), north_east (CT, NY, PA, NJ, RI, MA, MD, VT, NH, ME). For the label column, the following loan_status values were considered a bad loan: 'Charged Off'; 'Default'; 'Does not meet the credit policy. Status: Charged Off'; 'In Grace Period'; 'Late (16–30 days)'; 'Late (31–120 days)'. All other values were considered good.

In all four datasets, missing values were filled with 0 or 'NA' according to the feature type (numeric or categorical), and all categorical features were one-hot encoded.
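
For concreteness, a minimal pandas sketch of this pre-processing, with abridged mappings and column names ('native-country', 'addr_state', 'loan_status') assumed from the descriptions above; the paper does not publish its pipeline in this form.

```python
import pandas as pd

# Abridged category-to-region mappings; the full mappings are listed above.
COUNTRY_TO_AREA = {
    "Italy": "Euro_1", "Germany": "Euro_1", "Poland": "Euro_2",
    "Vietnam": "SE_Asia", "United-States": "UnitedStates",
    "Mexico": "LatinAmerica", "Taiwan": "China",
    "India": "BritishCommonwealth", "Peru": "SouthAmerica",
    "Japan": "Other", "?": "Unknown",
}
STATE_TO_REGION = {
    "CA": "west", "TX": "south_west", "FL": "south_east",
    "IL": "mid_west", "NY": "north_east",
}
BAD_LOAN_STATUSES = {
    "Charged Off", "Default",
    "Does not meet the credit policy. Status: Charged Off",
    "In Grace Period", "Late (16-30 days)", "Late (31-120 days)",
}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    if "native-country" in df.columns:   # adult dataset
        df["native-country"] = df["native-country"].map(COUNTRY_TO_AREA)
    if "addr_state" in df.columns:       # loan dataset
        df["region"] = df["addr_state"].map(STATE_TO_REGION)
        df = df.drop(columns=["addr_state"])
    if "loan_status" in df.columns:      # loan dataset: derive the binary label
        df["bad_loan"] = df["loan_status"].isin(BAD_LOAN_STATUSES).astype(int)
        df = df.drop(columns=["loan_status"])
    # Fill missing values by feature type: 0 for numeric, 'NA' for categorical.
    for col in df.columns:
        df[col] = df[col].fillna(
            0 if pd.api.types.is_numeric_dtype(df[col]) else "NA")
    # One-hot encode all categorical features.
    return pd.get_dummies(df)
```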

A.3 Dataset division

Each dataset was divided into four subsets: 40% for training the target model, 30% for training the generalizer model, 20% as test data for optimizing the generalization, and 10% for final validation.
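
The paper does not give the splitting code; one way such a 40/30/20/10 partition could be realized, applying scikit-learn's train_test_split successively on stand-in data, is:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

# 40% of the total for the target model; the remaining 60% is split below.
X_target, X_rest, y_target, y_rest = train_test_split(
    X, y, train_size=0.40, random_state=0)
# 30% of the total = half of the remaining 60%, for the generalizer model.
X_gen, X_rest, y_gen, y_rest = train_test_split(
    X_rest, y_rest, train_size=0.50, random_state=0)
# Split the last 30% into 20% (optimizing the generalization) and 10% (validation).
X_opt, X_val, y_opt, y_val = train_test_split(
    X_rest, y_rest, train_size=2 / 3, random_state=0)
```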


About this article


Cite this article

Goldsteen, A., Ezov, G., Shmelkin, R. et al. Data minimization for GDPR compliance in machine learning models. AI Ethics 2, 477–491 (2022). https://doi.org/10.1007/s43681-021-00095-8

