
Data minimization for GDPR compliance in machine learning models

  • Original Research
  • Published in: AI and Ethics

Abstract

The EU General Data Protection Regulation (GDPR) and the California Privacy Rights Act (CPRA) mandate the principle of data minimization, which requires that only data necessary to fulfill a certain purpose be collected. However, it can often be difficult to determine the minimal amount of data required, especially in complex machine learning models such as deep neural networks. We present a first-of-its-kind method to reduce the amount of personal data needed to perform predictions with a machine learning model, by removing or generalizing some of the input features of the runtime data. Drawing on knowledge distillation approaches, our method uses the knowledge encoded within the model to produce a generalization that has little to no impact on its accuracy. We show that, in some cases, less data can be collected while preserving the exact same level of model accuracy as before, and that if a small deviation in accuracy is allowed, the input features can be generalized even further. We also demonstrate that when the features are collected dynamically, the generalizations can be improved further still. This method enables organizations to truly minimize the amount of data they collect, thus fulfilling the data minimization requirement set out in the regulations.
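
The implementation accompanying the paper is released in the open-source ai-privacy-toolkit (see note 4 below). The following is not that implementation but a minimal, hypothetical sketch of the core idea on synthetic data: given a trained model, replace a numeric runtime feature with increasingly coarse value ranges and keep the coarsest generalization whose held-out accuracy stays within a chosen tolerance of the baseline. All names and data here are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's algorithm): greedily coarsen a
# numeric feature by binning and keep the coarsest generalization whose
# accuracy loss on held-out data stays within a tolerance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the paper evaluates on datasets such as adult and loan.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))

def generalize(X_in, feature, n_bins):
    """Replace a numeric feature with the midpoint of its value range (bin)."""
    edges = np.quantile(X_train[:, feature], np.linspace(0, 1, n_bins + 1))
    mids = (edges[:-1] + edges[1:]) / 2
    Xg = X_in.copy()
    bin_idx = np.clip(np.digitize(Xg[:, feature], edges[1:-1]), 0, n_bins - 1)
    Xg[:, feature] = mids[bin_idx]
    return Xg

TOLERANCE = 0.01  # maximum allowed drop in accuracy
feature = 0
for n_bins in (2, 4, 8, 16):  # try the coarsest generalization first
    acc = accuracy_score(
        y_test, model.predict(generalize(X_test, feature, n_bins)))
    if baseline - acc <= TOLERANCE:
        print(f"feature {feature}: collect as one of {n_bins} ranges "
              f"(accuracy {acc:.3f} vs. baseline {baseline:.3f})")
        break
```

In the same spirit, categorical features could be coarsened along a generalization hierarchy (e.g., country to region), and a feature whose coarsest acceptable generalization is a single range need not be collected at all.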


Notes

  1. https://ec.europa.eu/info/law/law-topic/data-protection/data-protection-eu_en.

  2. https://www.caprivacy.org/annotated-cpra-text-with-ccpa-changes/.

  3. https://www.europarl.europa.eu/RegData/etudes/STUD/2020/641530/EPRS_STU(2020)641530_EN.pdf.

  4. https://github.com/IBM/ai-privacy-toolkit.

  5. https://archive.ics.uci.edu/ml/datasets/adult.

  6. https://archive.ics.uci.edu/ml/datasets/nursery.

  7. https://www.lendingclub.com/info/download-data.action.

  8. https://scikit-learn.org/stable/.


Author information


Corresponding author

Correspondence to Abigail Goldsteen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A Experimental setup

A.1 Features

List of features used in adult dataset: age, workclass, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country.

List of features used in nursery dataset: parents, has_nurs, form, children, housing, finance, health, social.

List of features used in GSS dataset: work_status, marital_status, children, age, gender, race, happiness, x_rated.

List of features used in Loan dataset: loan_amount, funded_amount, investor_funds, term, interest_rate, installment, grade, sub_grade, emp_length, home_ownership, annual_income, verification_status, pymnt_plan, purpose, zip_code, dti, delinq_2yrs, inq_last_6mths, mths_since_last_delinq, mths_since_last_record, open_acc, pub_rec, revol_bal, revol_util, total_acc, initial_list_status, out_prncp, out_prncp_inv, total_rec_int, total_rec_late_fee, last_pymnt_amnt, collections_12_mths_ex_med, policy_code, application_type, acc_now_delinq, chargeoff_within_12_mths, delinq_amnt, tax_liens, hardship_flag, disbursement_method, year, earliest_cr_year, region.

A.2 Pre-processing

Pre-processing applied to adult dataset: the native-country feature was transformed to the following areas: Euro_1 (Italy, Holand-Netherlands, Germany, France), Euro_2 (Yugoslavia, South, Portugal, Poland, Hungary, Greece), SE_Asia (Vietnam, Thailand, Philippines, Laos, Cambodia), UnitedStates (United-States), LatinAmerica (Trinadad&Tobago, Puerto-Rico, Outlying-US(Guam-USVI-etc), Nicaragua, Mexico, Jamaica, Honduras, Haiti, Guatemala, Dominican-Republic), China (Taiwan, Hong, China), BritishCommonwealth (Scotland, Ireland, India, England, Canada), SouthAmerica (Peru, El-Salvador, Ecuador, Columbia), Other (Japan, Iran, Cuba), Unknown (?).

Pre-processing applied to loan dataset: the addr_state feature was transformed to the following regions: west (CA, OR, UT, WA, CO, NV, AK, MT, HI, WY, ID), south_west (AZ, TX, NM, OK), south_east (GA, NC, VA, FL, KY, SC, LA, AL, WV, DC, AR, DE, MS, TN), mid_west (IL, MO, MN, OH, WI, KS, MI, SD, IA, NE, IN, ND), north_east (CT, NY, PA, NJ, RI, MA, MD, VT, NH, ME). For the label column, the following loan_status values were considered a bad loan: 'Charged Off'; 'Default'; 'Does not meet the credit policy. Status: Charged Off'; 'In Grace Period'; 'Late (16–30 days)'; 'Late (31–120 days)'. All other values were considered good.

In all four datasets, missing values were filled with 0 or 'NA' according to the feature type (numeric or categorical), and all categorical features were one-hot encoded.
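
For concreteness, a minimal pandas sketch of this pre-processing, with abridged mappings and column names ('native-country', 'addr_state', 'loan_status') assumed from the descriptions above; the paper does not publish its pipeline in this form.

```python
import pandas as pd

# Abridged category-to-region mappings; the full mappings are listed above.
COUNTRY_TO_AREA = {
    "Italy": "Euro_1", "Germany": "Euro_1", "Poland": "Euro_2",
    "Vietnam": "SE_Asia", "United-States": "UnitedStates",
    "Mexico": "LatinAmerica", "Taiwan": "China",
    "India": "BritishCommonwealth", "Peru": "SouthAmerica",
    "Japan": "Other", "?": "Unknown",
}
STATE_TO_REGION = {
    "CA": "west", "TX": "south_west", "FL": "south_east",
    "IL": "mid_west", "NY": "north_east",
}
BAD_LOAN_STATUSES = {
    "Charged Off", "Default",
    "Does not meet the credit policy. Status: Charged Off",
    "In Grace Period", "Late (16-30 days)", "Late (31-120 days)",
}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    if "native-country" in df.columns:   # adult dataset
        df["native-country"] = df["native-country"].map(COUNTRY_TO_AREA)
    if "addr_state" in df.columns:       # loan dataset
        df["region"] = df["addr_state"].map(STATE_TO_REGION)
        df = df.drop(columns=["addr_state"])
    if "loan_status" in df.columns:      # loan dataset: derive the binary label
        df["bad_loan"] = df["loan_status"].isin(BAD_LOAN_STATUSES).astype(int)
        df = df.drop(columns=["loan_status"])
    # Fill missing values by feature type: 0 for numeric, 'NA' for categorical.
    for col in df.columns:
        df[col] = df[col].fillna(
            0 if pd.api.types.is_numeric_dtype(df[col]) else "NA")
    # One-hot encode all categorical features.
    return pd.get_dummies(df)
```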

A.3 Dataset division

Each dataset was divided into four subsets: 40% for training the target model, 30% for training the generalizer model, 20% as test data for optimizing the generalization, and 10% for final validation.
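
The paper does not give the splitting code; one way such a 40/30/20/10 partition could be realized, applying scikit-learn's train_test_split successively on stand-in data, is:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

# 40% of the total for the target model; the remaining 60% is split below.
X_target, X_rest, y_target, y_rest = train_test_split(
    X, y, train_size=0.40, random_state=0)
# 30% of the total = half of the remaining 60%, for the generalizer model.
X_gen, X_rest, y_gen, y_rest = train_test_split(
    X_rest, y_rest, train_size=0.50, random_state=0)
# Split the last 30% into 20% (optimizing the generalization) and 10% (validation).
X_opt, X_val, y_opt, y_val = train_test_split(
    X_rest, y_rest, train_size=2 / 3, random_state=0)
```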


About this article


Cite this article

Goldsteen, A., Ezov, G., Shmelkin, R. et al. Data minimization for GDPR compliance in machine learning models. AI Ethics 2, 477–491 (2022). https://doi.org/10.1007/s43681-021-00095-8

