
Data sanitization against label flipping attacks using AdaBoost-based semi-supervised learning technology

Abstract

The label flipping attack is a special poisoning attack mounted in an adversarial environment. This study designs a novel label noise processing framework whose core is AdaSSL, a semi-supervised label correction algorithm based on AdaBoost. AdaSSL effectively improves the label quality of the training data and, in turn, the classification performance of the model. On five real UCI datasets, six classic machine learning algorithms (NB, LR, SVM, DT, KNN and MLP) were used as base classifiers. At noise levels from 0% to 20%, we evaluated the classification performance of these classifiers under the entropy-based label flipping attack, with and without the AdaSSL defense. The experimental results show that AdaSSL effectively improves the robustness of the classifiers against label flipping attacks. Unlike the most advanced semi-supervised defense algorithms in the literature, AdaSSL requires no additional datasets. At a noise ratio of 10%, AdaSSL significantly outperforms state-of-the-art label noise defense techniques.
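As a rough illustration of the idea described above (not the authors' exact AdaSSL procedure), the sketch below mounts a 10% label flipping attack on a hypothetical 1-D two-cluster dataset, fits a plain AdaBoost ensemble of decision stumps on the poisoned labels, and then relabels every sample with the ensemble's weighted-majority vote. The dataset, stump learner, and all helper names are assumptions introduced for this example.

```python
import math
import random

random.seed(0)

# Hypothetical toy data standing in for the UCI datasets: class 0 clusters
# near x = 0.0, class 1 near x = 1.0.
X = ([random.gauss(0.0, 0.15) for _ in range(100)]
     + [random.gauss(1.0, 0.15) for _ in range(100)])
y_true = [0] * 100 + [1] * 100

def flip_labels(labels, rate):
    """Label flipping attack: invert the labels of a `rate` fraction of samples."""
    flipped = list(labels)
    for i in random.sample(range(len(flipped)), int(rate * len(flipped))):
        flipped[i] = 1 - flipped[i]
    return flipped

def stump_fit(X, y, w):
    """Weighted decision stump: pick the threshold/direction with least weighted error."""
    best_err, best_t, best_s = None, None, None
    for t in X:
        for s in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (1 if s * xi > s * t else 0) != yi)
            if best_err is None or err < best_err:
                best_err, best_t, best_s = err, t, s
    return best_t, best_s

def adaboost_fit(X, y, rounds=5):
    """Plain AdaBoost (Freund & Schapire style) over decision stumps."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        t, s = stump_fit(X, y, w)
        pred = [1 if s * xi > s * t else 0 for xi in X]
        err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard the log below
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, s))
        # Upweight misclassified samples, downweight correct ones, renormalize.
        w = [wi * math.exp(-alpha if p == yi else alpha)
             for wi, p, yi in zip(w, pred, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def adaboost_predict(ensemble, xi):
    score = sum(a * (1 if s * xi > s * t else -1) for a, t, s in ensemble)
    return 1 if score > 0 else 0

def accuracy(a, b):
    return sum(p == q for p, q in zip(a, b)) / len(a)

# Attack: flip 10% of the training labels.
y_noisy = flip_labels(y_true, 0.10)

# Sanitization sketch: fit on the poisoned labels, then replace each label
# with the ensemble's weighted-majority prediction.
ensemble = adaboost_fit(X, y_noisy, rounds=5)
y_corrected = [adaboost_predict(ensemble, xi) for xi in X]

print("label accuracy before correction:", accuracy(y_noisy, y_true))
print("label accuracy after correction:", accuracy(y_corrected, y_true))
```

Because the flipped labels are a random minority, the boosted weighted-majority vote is dominated by stumps fit to the clean majority, so relabeling by the ensemble recovers most poisoned labels on this toy data; the paper's AdaSSL additionally uses a semi-supervised split between trusted and suspect samples, which this sketch omits.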





Acknowledgements

This research was partly supported by the Key R&D and promotion projects of Henan Province (Technological research) (Grant No. 212102210143), the Key Science and Technology Project of Xinjiang Production and Construction Corps (Grant No. 2018AB017) and the Key Research, Development, and Dissemination Program of Henan Province (Science and Technology for the People) (Grant No. 182207310002).

Author information

Contributions

Hongpo Zhang contributed to the conception of the study; Ning Cheng performed the experiment; Hongpo Zhang contributed significantly to analysis and manuscript preparation; Ning Cheng performed the data analyses and wrote the manuscript; Zhanbo Li helped perform the analysis with constructive discussions.

Corresponding authors

Correspondence to Hongpo Zhang or Zhanbo Li.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cheng, N., Zhang, H. & Li, Z. Data sanitization against label flipping attacks using AdaBoost-based semi-supervised learning technology. Soft Comput 25, 14573–14581 (2021). https://doi.org/10.1007/s00500-021-06384-y


Keywords

  • Label noise detection
  • Machine learning
  • AdaBoost algorithm
  • Semi-supervised