DO NOT DISTURB? Classifier Behavior on Perturbed Datasets

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10410)


Exponential trends in data generation are presenting today’s organizations, economies and governments with challenges never encountered before, especially in the field of privacy and data security. One crucial trade-off regulators are facing regards the simultaneous need for publishing personal information for the sake of statistical analysis and Machine Learning in order to increase quality levels in areas like medical services, while at the same time protecting the identity of individuals. A key European measure will be the introduction of the General Data Protection Regulation (GDPR) in 2018, giving customers the ‘right to be forgotten’, i.e. having their data deleted on request. As this could lead to a competitive disadvantage for European companies, it is important to understand which effects deletion of significant data points has on the performance of ML techniques. In a previous paper we introduced a series of experiments applying different algorithms to a binary classification problem under anonymization as well as perturbation. In this paper we extend those experiments by multi-class classification and introduce outlier-removal as an additional scenario. While the results of our previous work were mostly in-line with our expectations, our current experiments revealed unexpected behavior over a range of different scenarios. A surprising conclusion of those experiments is the fact that classification on an anonymized dataset with outliers removed in beforehand can almost compete with classification on the original, un-anonymized dataset. This could soon lead to competitive Machine Learning pipelines on anonymized datasets for real-world usage in the marketplace.


Machine learning Knowledge bases Right to be forgotten Perturbation K-anonymity SaNGreeA Information loss Cost weighing vector Multi-class classification Outlier analysis Variance-sensitive analysis 


  1. 1.
    Aggarwal, C.C.: On k-anonymity and the curse of dimensionality. In: Proceedings of the 31st International Conference on Very Large Data Bases VLDB, pp. 901–909 (2005)Google Scholar
  2. 2.
    Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., Zhu, A.: Approximation algorithms for k-anonymity. J. Priv. Technol. (JOPT) (2005)Google Scholar
  3. 3.
    Brain, D., Webb, G.: On the effect of data set size on bias and variance in classification learning. In: Proceedings of the Fourth Australian Knowledge Acquisition Workshop, pp. 117–128. University of New South Wales (1999)Google Scholar
  4. 4.
    Campan, A., Truta, T.M.: Data and structural k-anonymity in social networks. In: Bonchi, F., Ferrari, E., Jiang, W., Malin, B. (eds.) PInKDD 2008. LNCS, vol. 5456, pp. 33–54. Springer, Heidelberg (2009). doi: 10.1007/978-3-642-01718-6_4 CrossRefGoogle Scholar
  5. 5.
    Ciriani, V., De Capitani di Vimercati, S., Foresti, S., Samarati, P.: \(\kappa \)-anonymity. In: Yu, T., Jajodia, S. (eds.) Secure Data Management in Decentralized Systems. Advances in Information Security, vol. 33, pp. 323–353. Springer, Boston (2007)Google Scholar
  6. 6.
    Duchi, J.C., Jordan, M.I., Wainwright, M.J.: Privacy aware learning. J. ACM (JACM) 61(6), 38 (2014)MathSciNetCrossRefzbMATHGoogle Scholar
  7. 7.
    Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). doi: 10.1007/978-3-540-79228-4_1 CrossRefGoogle Scholar
  8. 8.
    Holzinger, A., Plass, M., Holzinger, K., Crişan, G.C., Pintea, C.-M., Palade, V.: Towards interactive machine learning (iML): applying ant colony algorithms to solve the traveling salesman problem with the human-in-the-loop approach. In: Buccafurri, F., Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds.) CD-ARES 2016. LNCS, vol. 9817, pp. 81–95. Springer, Cham (2016). doi: 10.1007/978-3-319-45507-5_6 CrossRefGoogle Scholar
  9. 9.
    Holzinger, A.: Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Inform. (BRIN) 3(2), 119–131 (2016). SpringerCrossRefGoogle Scholar
  10. 10.
    Holzinger, A.: Introduction to machine learning & knowledge extraction (make). Mach. Learn. Knowl. Extract. 1(1), 1–20 (2017)CrossRefGoogle Scholar
  11. 11.
    Kieseberg, P., Malle, B., Frhwirt, P., Weippl, E., Holzinger, A.: A tamper-proof audit and control system for the doctor in the loop. Brain Inform. 3(4), 269–279 (2016)CrossRefGoogle Scholar
  12. 12.
    Lee, H., Kim, S., Kim, J.W., Chung, Y.D.: Utility-preserving anonymization for health data publishing. BMC Med. Inform. Decis. Making 17(1), 104 (2017)CrossRefGoogle Scholar
  13. 13.
    LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-anonymity. In: Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006), p. 25. IEEE (2006)Google Scholar
  14. 14.
    Li, J., Liu, J., Baig, M., Wong, R.C.-W.: Information based data anonymization for classification utility. Data Knowl. Eng. 70(12), 1030–1045 (2011)CrossRefGoogle Scholar
  15. 15.
    Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: IEEE 23rd International Conference on Data Engineering (ICDE 2007), pp. 106–115. IEEE (2007)Google Scholar
  16. 16.
    Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Disc. Data (TKDD) 1(1), 1–52 (2007)CrossRefGoogle Scholar
  17. 17.
    Majeed, A., Ullah, F., Lee, S.: Vulnerability-and diversity-aware anonymization of personally identifiable information for improving user privacy and utility of publishing data. Sensors 17(5), 1–23 (2017)CrossRefGoogle Scholar
  18. 18.
    Malle, B., Kieseberg, P., Weippl, E., Holzinger, A.: The right to be forgotten: towards machine learning on perturbed knowledge bases. In: Buccafurri, F., Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds.) CD-ARES 2016. LNCS, vol. 9817, pp. 251–266. Springer, Cham (2016). doi: 10.1007/978-3-319-45507-5_17 CrossRefGoogle Scholar
  19. 19.
    Nergiz, M.E., Clifton, C.: Delta-presence without complete world knowledge. IEEE Trans. Knowl. Data Eng. 22(6), 868–883 (2010)CrossRefGoogle Scholar
  20. 20.
    Samarati, P.: Protecting respondents identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 1010–1027 (2001)CrossRefGoogle Scholar
  21. 21.
    Simpson, E.H.: Measurement of diversity. Nature 163, 688 (1949)CrossRefzbMATHGoogle Scholar
  22. 22.
    Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertaint. Fuzziness Knowl. Based Syst. 10(5), 571–588 (2002)MathSciNetCrossRefzbMATHGoogle Scholar
  23. 23.
    Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertaint. Fuzziness Knowl. Based Syst. 10(05), 557–570 (2002)MathSciNetCrossRefzbMATHGoogle Scholar
  24. 24.
    Wimmer, H., Powell, L..: A comparison of the effects of K-anonymity on machine learning algorithms, pp. 1–9 (2014)Google Scholar
  25. 25.
    Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D.: Understanding data augmentation for classification: when to warp? In: 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1–6. IEEE (2016)Google Scholar

Copyright information

© IFIP International Federation for Information Processing 2017

Authors and Affiliations

  1. 1.Holzinger Group HCI-KDD, Institute for Medical Informatics, Statistics and DocumentationMedical University GrazGrazAustria
  2. 2.SBA Research gGmbHWienAustria

Personalised recommendations