Journal of Intelligent Information Systems, Volume 22, Issue 1, pp 89–109

Identifying and Handling Mislabelled Instances

  • Fabrice Muhlenbach
  • Stéphane Lallich
  • Djamel A. Zighed
Article

DOI: 10.1023/A:1025832930864

Cite this article as:
Muhlenbach, F., Lallich, S. & Zighed, D.A. Journal of Intelligent Information Systems (2004) 22: 89. doi:10.1023/A:1025832930864

Abstract

Data mining and knowledge discovery aim to produce useful and reliable models from data. Unfortunately, some databases contain noisy data, which degrade the generalization of the models. An important source of noise is mislabelled training instances. We propose a new approach that improves classification accuracy through a preliminary filtering procedure. An example is considered suspect when, in its neighbourhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database as a whole. Such suspect examples in the training data can be removed or relabelled. The filtered training set is then provided as input to learning algorithms. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final algorithm, show that removal gives better results than relabelling. Removal maintains the generalization error rate when 0 to 20% of noise is introduced on the class label, especially when the classes are well separable. Finally, the proposed filtering method is compared to the relaxation relabelling scheme.

supervised learning, mislabelled data, geometrical neighbourhood, filtering, removing instances, relabelling instances
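
The filtering idea in the abstract can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the article: the neighbourhood is the k nearest neighbours under Euclidean distance (a stand-in for the geometrical graph), and "significantly greater" is checked with a plain one-sided binomial test against the class proportion in the rest of the data. The names `find_suspects`, `remove_suspects`, `k` and `alpha` are hypothetical.

```python
import math


def binom_upper_tail(successes, trials, p):
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, i) * p**i * (1 - p) ** (trials - i)
        for i in range(successes, trials + 1)
    )


def find_suspects(X, y, k=5, alpha=0.05):
    """Return indices of examples whose neighbourhood does not contain
    significantly more same-class examples than the class prior predicts.

    Assumption: k-NN neighbourhood and a binomial test, chosen for brevity;
    the article defines the neighbourhood with a geometrical graph.
    """
    n = len(X)
    class_counts = {}
    for label in y:
        class_counts[label] = class_counts.get(label, 0) + 1

    suspects = []
    for i in range(n):
        # k nearest neighbours of example i, excluding i itself.
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        same = sum(1 for j in neighbours if y[j] == y[i])

        # Proportion of the example's class among the other n - 1 examples.
        prior = (class_counts[y[i]] - 1) / (n - 1)

        # A large p-value means the same-class proportion in the neighbourhood
        # is not significantly above the prior, so the example is suspect.
        if binom_upper_tail(same, len(neighbours), prior) > alpha:
            suspects.append(i)
    return suspects


def remove_suspects(X, y, suspects):
    """Return the training set with the suspect examples removed."""
    drop = set(suspects)
    kept = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in kept], [y[i] for i in kept]
```

The filtered set returned by `remove_suspects` would then be passed to the final learner (1-NN in the experiments described above). With a small `k` the test has little power, so a clean example close to a mislabelled one may also be flagged; relabelling instead of removing would replace the label of each suspect rather than dropping it.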

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Fabrice Muhlenbach (1)
  • Stéphane Lallich (1)
  • Djamel A. Zighed (1)
  1. ERIC Laboratory, Lumière University (Lyon 2), Bron Cedex, France
