A real-valued label noise cleaning method based on ensemble iterative filtering with noise score

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Real-world data inevitably contain noise for a variety of reasons. In regression tasks, noisy labels interfere with the construction of an accurate model, degrading prediction accuracy. Methods for handling continuous label noise remain limited compared with class noise cleaning techniques. To address this gap, we propose a novel noise filter that cleans instances with real-valued label noise by combining several filtering strategies. First, an iterative filtering process is carried out, which avoids using potentially noisy examples in each new filtering iteration. Second, we develop a noise score to assess the noise level of each detected noisy instance: the higher the noise score, the more likely the instance is noisy. Finally, an ensemble filtering scheme is implemented; fusing the detections of different models makes the identification of noisy examples more reliable. The validity of the proposed method is verified through extensive experiments. We discuss the selection of the best hyperparameters and compare the developed method with several state-of-the-art noise filters on public regression datasets. The outcomes show that our method not only achieves a good balance between eliminating noisy samples and retaining clean samples but also outperforms all the compared methods, especially at higher noise levels. A case study of temperature prediction in an electric arc furnace further suggests that training a domain-related regressor on a dataset preprocessed with the proposed noise filter greatly improves prediction accuracy.


Data availability

The data used in the experiments mainly come from the KEEL and UCI repositories; please refer to references [38] and [39] for details.

References

  1. Kang Z, Pan H, Hoi SCH et al (2020) Robust graph learning from noisy data. IEEE Trans Cybernet 50(5):1833–1843

  2. Sáez JA, Corchado E (2019) KSUFS: a novel unsupervised feature selection method based on statistical tests for standard and big data problems. IEEE Access 7:99754–99770

  3. Frenay B, Verleysen M (2014) Classification in the presence of label noise: a survey. IEEE Trans Neural Netw Learn Syst 25(5):845–869

  4. Zhu X, Wu X (2004) Class noise vs. attribute noise: a quantitative study. Artif Intell Rev 22:177–210

  5. Sáez JA, Galar M, Luengo J et al (2014) Analyzing the presence of noise in multi-class problems: alleviating its influence with the one-vs-one decomposition. Knowl Inf Syst 38:179–206

  6. Gamberger D, Lavrac N, Dzeroski S (1996) Noise elimination in inductive concept learning: a case study in medical diagnosis. In: proceedings of the 7th international workshop on algorithmic learning theory, pp 199–212

  7. García S, Luengo J, Herrera F (2016) Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowl Based Syst 98:1–29

  8. Sáez JA, Galar M, Luengo J et al (2016) INFFC: an iterative class noise filter based on the fusion of classifiers with noise sensitivity control. Inform Fusion 27:19–32

  9. Luengo J, Shim SO, Alshomrani S et al (2018) CNC-NOS: class noise cleaning by ensemble filtering and noise scoring. Knowl Based Syst 140:27–49

  10. Nematzadeh Z, Ibrahim R, Selamat A (2020) Improving class noise detection and classification performance: a new two-filter CNDC model. Appl Soft Comput 94:106428

  11. Li C, Sheng VS, Jiang L et al (2016) Noise filtering to improve data and model quality for crowdsourcing. Knowl Based Syst 107:96–103

  12. Jeatrakul P, Wong KW, Fung CC (2010) Data cleaning for classification using misclassification analysis. J Adv Comput Intell Intell Inf 14:297–302

  13. Algan G, Ulusoy I (2020) Image classification with deep learning in the presence of noisy labels: a survey. Knowl Based Syst 215:106771

  14. Wang Y, Liu W, Ma X, et al (2018) Iterative learning with open-set noisy labels. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8688–8696

  15. Tanaka D, Ikami D, Yamasaki T et al (2018) Joint optimization framework for learning with noisy labels. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5552–5560

  16. Yu X, Han B, Yao J et al (2019) How does disagreement help generalization against label corruption? In: international conference on machine learning, pp 7164–7173

  17. Kordos M, Blachnik M (2012) Instance selection with neural networks for regression problems. In: international conference on artificial neural networks, pp 263–270

  18. Martín J, Sáez JA, Corchado E (2021) On the regressand noise problem: model robustness and synergy with regression-adapted noise filters. IEEE Access 9:145800–145816

  19. González AA, Pastor JFD, Rodríguez JJ et al (2016) Instance selection for regression by discretization. Expert Syst Appl 54:340–350

  20. González AA, Pastor JFD, Rodríguez JJ et al (2016) Instance selection for regression: adapting DROP. Neurocomputing 201:66–81

  21. Verbaeten S, Van Assche A (2003) Ensemble methods for noise elimination in classification problems. In: Multiple classifier systems. Springer, Berlin, pp 317–325

  22. Khoshgoftaar TM, Rebours P (2007) Improving software quality prediction by noise filtering techniques. J Comput Sci Technol 22:387–396

  23. Gamberger D, Lavrac N, Dzeroski S (2000) Noise detection and elimination in data preprocessing: experiments in medical domains. Appl Artif Intell 14(2):205–223

  24. Berghout T, Mouss LH, Kadri O et al (2020) Aircraft engines remaining useful life prediction with an adaptive denoising online sequential extreme learning machine. Eng Appl Artif Intel 96:103936

  25. Lv M, Hong Z, Chen L et al (2020) Temporal multi-graph convolutional network for traffic flow prediction. IEEE Trans Intell Transp Syst 22(6):3337–3348

  26. Ge L, Wu K, Zeng Y et al (2020) Multi-scale spatiotemporal graph convolution network for air quality prediction. Appl Intell 51:3491–3505

  27. Shine P, Scully T, Upton J et al (2019) Annual electricity consumption prediction and future expansion analysis on dairy farms using a support vector machine. Appl Energy 250:1110–1119

  28. Kara F, Aslantaş K, Çiçek A (2016) Prediction of cutting temperature in orthogonal machining of AISI 316L using artificial neural network. Appl Soft Comput 38:64–74

  29. Wang RY, Storey VC, Firth CP (1995) A framework for analysis of data quality research. IEEE Trans Knowl Data Eng 7:623–640

  30. Fernandez JMM, Cabal VA, Montequin VR et al (2008) Online estimation of electric arc furnace tap temperature by using fuzzy neural networks. Eng Appl Artif Intel 21(7):1001–1012

  31. Brodley CE, Friedl MA (1999) Identifying mislabeled training data. J Artif Intell Res 11(1):131–167

  32. Sun J, Zhao F, Wang C et al (2007) Identifying and correcting mislabeled training instances. In: proceedings of the future generation communication and networking, pp 244–250

  33. Tomek I (1976) An experiment with the edited nearest-neighbor rule. IEEE Trans Syst Man Cybernet 6(6):448–452

  34. Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6:37–66

  35. Jiang G, Wang W, Qian Y et al (2021) A unified sample selection framework for output noise filtering: an error-bound perspective. J Mach Learn Res 22:1–66

  36. González AA, Blachnik M, Kordos M et al (2016) Fusion of instance selection methods in regression tasks. Inform Fusion 30:69–79

  37. Angelova A, Abu-Mostafa Y, Perona P (2005) Pruning training sets for learning of object categories. In: proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 494–501

  38. Fdez JA, Fernandez A, Luengo J et al (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Logic Soft Comput 17(2–3):255–287

  39. Dheeru D, Graff C (2017) UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Accessed 2017

  40. Zhao L, Gkountouna O, Pfoser D (2019) Spatial auto-regressive dependency interpretable learning based on spatial topological constraints. ACM Trans Spat Algorithms Syst 5(3):1–28

  41. Acı CI, Akay MF (2015) A hybrid congestion control algorithm for broadcast-based architectures with multiple input queues. J Supercomput 71:1907–1931

  42. Zhou F, Claire Q, King RD (2014) Predicting the geographical origin of music. In: proceedings of the IEEE international conference on data mining, pp 1115–1120

  43. Kaya H, Tüfekci P, Uzun E (2019) Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS. Turk J Electr Eng Comput Sci 27(6):4783–4796

  44. Moro S, Rita P, Vala B (2016) Predicting social media performance metrics and evaluation of the impact on brand building: a data mining approach. J Bus Res 69(9):3341–3351

  45. Vergara A, Vembu S, Ayhan T et al (2012) Chemical gas sensor drift compensation using classifier ensembles. Sens Actuators B Chem 166:320–329

  46. Lujan IR, Fonollosa J, Vergara A et al (2014) On the calibration of sensor arrays for pattern recognition using the minimal number of experiments. Chemom Intell Lab Syst 130:123–134

  47. Hoseinzade E, Haratizadeh S (2019) CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Syst Appl 129:273–285

  48. Rafiei MH, Adeli H (2016) A novel machine learning model for estimation of sale prices of real estate units. J Constr Eng Manag 142(2):04015066

  49. Vito SDE, Massera E, Piga M et al (2008) On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sens Actuators B Chem 129(2):750–757

  50. Fanaee TH, Gama J (2014) Event labeling combining ensemble detectors and background knowledge. Prog Artif Intell 2(2):113–127

  51. Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1):489–501

  52. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  53. García S, Fernández A, Luengo J et al (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inform Sci 180(10):2044–2064

  54. Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70

  55. Hay T, Visuri VV, Aula M et al (2020) A review of mathematical process models for the electric arc furnace process. Steel Res Int 92(3):2000395

  56. Li C, Mao Z (2022) Generative adversarial network–based real-time temperature prediction model for heating stage of electric arc furnace. Trans Inst Meas Control 44(8):1669–1684

  57. Yuan P, Wang F, Mao Z (2006) Endpoint prediction of EAF based on G-SVM. J Iron Steel Res Int 18(10):7–10

  58. Fernandez JMM, Menendez C, Ortega FA et al (2009) A smart modelling for the casting temperature prediction in an electric arc furnace. Int J Comput Math 86(7):1182–1193

  59. Sismanis P (2019) Prediction of productivity and energy consumption in a consteel furnace using data-science models. In: proceedings of the 22nd international conference on business information systems, pp 85–99

Acknowledgements

This work was supported by the Key Program of National Natural Science Foundation of China (No. 51634002) and National Natural Science Foundation of China (No. 61773101).

Author information

Corresponding author

Correspondence to Zhizhong Mao.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, C., Mao, Z. & Jia, M. A real-valued label noise cleaning method based on ensemble iterative filtering with noise score. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02137-z

  • DOI: https://doi.org/10.1007/s13042-024-02137-z
