Data Mining and Knowledge Discovery, Volume 28, Issue 2, pp 265–303

Ensemble-based noise detection: noise ranking and visual performance evaluation

Article

Abstract

Noise filtering is most frequently used in data preprocessing to improve the accuracy of induced classifiers. The focus of this work is different: we aim at detecting noisy instances for improved data understanding, data cleaning and outlier identification. The paper is composed of three parts. The first part presents an ensemble-based noise ranking methodology for explicit noise and outlier identification, named NoiseRank, which was successfully applied to a real-life medical problem, as confirmed by domain expert evaluation. The second part is concerned with quantitative performance evaluation of noise detection algorithms on data with randomly injected noise. A methodology for visual performance evaluation of noise detection algorithms in the precision-recall space, named Viper, is presented and compared to standard evaluation practice. The third part presents the implementation of the NoiseRank and Viper methodologies in a web-based platform for composition and execution of data mining workflows. This implementation makes the developed approaches publicly accessible, supports repeatability and sharing of the presented experiments, and allows web services to be included so that new noise detection algorithms can be incorporated into the proposed noise detection and performance evaluation workflows.
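
The ideas summarized above (noise injected at random, an ensemble of detectors whose votes rank suspicious instances, and evaluation in terms of precision and recall) can be illustrated with a minimal Python sketch. This is not the NoiseRank or Viper implementation: all function names are hypothetical, and the simple nearest-neighbour detectors on one-dimensional data are stand-ins assumed purely for illustration.

```python
import random
from collections import Counter


def inject_label_noise(labels, rate, rng):
    """Flip the labels of a random fraction of instances (illustrative helper).

    Returns the noisy label list and the set of indices that were flipped.
    """
    n = len(labels)
    flipped = set(rng.sample(range(n), int(round(rate * n))))
    classes = sorted(set(labels))
    noisy = list(labels)
    for i in flipped:
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy, flipped


def knn_disagrees(points, labels, i, k):
    """True if the majority label of the k nearest neighbours differs from label i."""
    dists = sorted((abs(points[j] - points[i]), j)
                   for j in range(len(points)) if j != i)
    majority = Counter(labels[j] for _, j in dists[:k]).most_common(1)[0][0]
    return majority != labels[i]


def rank_by_votes(points, labels, ks=(3, 5, 7)):
    """Rank instances by how many ensemble members flag them as noisy.

    Each value of k acts as one (stand-in) detector; instances with no
    votes are omitted, and ties are ordered by instance index.
    """
    votes = {i: sum(knn_disagrees(points, labels, i, k) for k in ks)
             for i in range(len(points))}
    flagged = [(i, v) for i, v in votes.items() if v > 0]
    return sorted(flagged, key=lambda t: (-t[1], t[0]))


def precision_recall(detected, true_noisy):
    """Precision and recall of detected noise against the injected noise."""
    detected, true_noisy = set(detected), set(true_noisy)
    tp = len(detected & true_noisy)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(true_noisy) if true_noisy else 0.0
    return precision, recall


if __name__ == "__main__":
    # Two well-separated one-dimensional clusters, one per class.
    points = list(range(20)) + list(range(100, 120))
    clean = [0] * 20 + [1] * 20
    noisy, injected = inject_label_noise(clean, 0.1, random.Random(0))
    ranking = rank_by_votes(points, noisy)
    # Treat instances flagged by at least two of the three detectors as noise.
    detected = [i for i, v in ranking if v >= 2]
    print(ranking)
    print(precision_recall(detected, injected))
```

Changing the vote threshold trades precision against recall, which is the kind of trade-off a plot in the precision-recall space makes visible for each detection algorithm.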

Keywords

Noise detection · Ensembles · Noise ranking · Precision-recall evaluation

Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
  2. Rudjer Bošković Institute, Zagreb, Croatia
