Not a Free Lunch, But a Cheap One: On Classifiers Performance on Anonymized Datasets

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12840)

Abstract

The problem of protecting datasets from the disclosure of confidential information, while keeping the published data useful for analysis, has recently gained momentum. To address this problem, anonymization techniques such as \(k\)-anonymity, \(\ell\)-diversity, and \(t\)-closeness have been used to generate anonymized datasets for training classifiers. While these techniques provide an effective means to generate anonymized datasets, an understanding of how their application affects the performance of classifiers is currently missing. Such knowledge would enable data owners and analysts to select the most appropriate classification algorithm and training parameters, so as to satisfy strict privacy requirements while minimizing the loss of accuracy. In this study, we perform extensive experiments to verify how classifiers' performance changes when they are trained on an anonymized dataset rather than the original one, and we evaluate the impact of classification algorithms, dataset properties, and anonymization parameters on classifiers' performance.
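To make the privacy notions named in the abstract concrete, the following minimal sketch checks whether an already generalized table satisfies \(k\)-anonymity and distinct \(\ell\)-diversity. It is an illustration under stated assumptions, not the code used in the paper: the function names, the toy records, and the choice of the "distinct values" variant of \(\ell\)-diversity are all assumptions made here.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    groups = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values.

    This is the simple "distinct" variant of l-diversity (an assumption of
    this sketch; other variants exist).
    """
    groups = equivalence_classes(records, quasi_identifiers)
    return all(
        len({record[sensitive] for record in group}) >= l
        for group in groups.values()
    )


# Hypothetical, already generalized toy table.
toy = [
    {"age": "30-39", "zip": "537**", "disease": "flu"},
    {"age": "30-39", "zip": "537**", "disease": "cancer"},
    {"age": "40-49", "zip": "538**", "disease": "flu"},
    {"age": "40-49", "zip": "538**", "disease": "flu"},
]

print(is_k_anonymous(toy, ["age", "zip"], k=2))
print(is_l_diverse(toy, ["age", "zip"], "disease", l=2))
```

In this toy table every combination of quasi-identifier values is shared by two records, so it is 2-anonymous, but the second group carries a single sensitive value and therefore fails distinct 2-diversity.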

Notes

  1. https://scikit-learn.org.

  2. The code used for our experiments is available at https://github.com/minaalishahi/classifiersperformance.

  3. https://archive.ics.uci.edu/ml/datasets/.

Acknowledgement

This work has been supported by H2020 EU funded project SECREDAS [GA #783119].

Author information

Corresponding author

Correspondence to Mina Alishahi.

Appendix

Tables 5, 6, 7, and 8 report the Holm scores of the classifiers with respect to accuracy, precision, recall, and F1-score, respectively. A higher score indicates better performance of the classification algorithm on the corresponding metric.

Table 5. Classifier accuracy scores.
Table 6. Classifier precision scores.
Table 7. Classifier recall scores.
Table 8. Classifier F1-score scores.
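The name "Holm score" suggests Holm's step-down procedure for multiple comparisons. How the scores in Tables 5 to 8 are derived is not stated in this section, so the sketch below shows only the standard Holm correction applied to a list of p-values. The function name `holm_reject`, the significance level, and the example p-values are assumptions made for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans, in the original order of p_values, telling
    which null hypotheses are rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            # Stop at the first failure: all larger p-values are retained.
            break
    return rejected


# Hypothetical p-values from pairwise classifier comparisons.
example = [0.01, 0.04, 0.03, 0.005]
print(holm_reject(example))
```

With these four p-values, the two smallest (0.005 and 0.01) pass their thresholds of 0.0125 and about 0.0167, while 0.03 exceeds 0.025, so the procedure stops and retains the remaining hypotheses.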

Figures 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16 show the performance of classifiers trained on the anonymized Credit, Absent, and Optic datasets for different values of \(k\), \(\ell\), and \(t\).

Fig. 5. Accuracy on Credit.
Fig. 6. Precision on Credit.
Fig. 7. Recall on Credit.
Fig. 8. F1-score on Credit.
Fig. 9. Accuracy on Absent.
Fig. 10. Precision on Absent.
Fig. 11. Recall on Absent.
Fig. 12. F1-score on Absent.
Fig. 13. Accuracy on Optic.
Fig. 14. Precision on Optic.
Fig. 15. Recall on Optic.
Fig. 16. F1-score on Optic.

Copyright information

© 2021 IFIP International Federation for Information Processing

About this paper

Cite this paper

Alishahi, M., Zannone, N. (2021). Not a Free Lunch, But a Cheap One: On Classifiers Performance on Anonymized Datasets. In: Barker, K., Ghazinour, K. (eds) Data and Applications Security and Privacy XXXV. DBSec 2021. Lecture Notes in Computer Science, vol 12840. Springer, Cham. https://doi.org/10.1007/978-3-030-81242-3_14

  • DOI: https://doi.org/10.1007/978-3-030-81242-3_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81241-6

  • Online ISBN: 978-3-030-81242-3

  • eBook Packages: Computer Science (R0)
