Journal of Intelligent Information Systems, Volume 22, Issue 1, pp 89–109

Identifying and Handling Mislabelled Instances

  • Fabrice Muhlenbach
  • Stéphane Lallich
  • Djamel A. Zighed

Abstract

Data mining and knowledge discovery aim to produce useful and reliable models from data. Unfortunately, some databases contain noisy data that degrade the models' ability to generalize. An important source of noise is mislabelled training instances. We propose a new approach that improves classification accuracy through a preliminary filtering procedure. An example is considered suspect when, in its neighbourhood as defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database as a whole. Such suspect examples can be removed from the training data or relabelled. The filtered training set is then provided as input to learning algorithms. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final algorithm, show that removal gives better results than relabelling. Removal maintains the generalization error rate when 0 to 20% noise is introduced on the class labels, especially when the classes are well separable. Finally, the proposed filtering method is compared to the relaxation relabelling scheme.
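The filtering rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which geometrical graph or which statistical test is used, so the relative neighbourhood graph, the one-sided binomial test, the threshold `alpha`, and all function names below are assumptions chosen for illustration.

```python
from math import comb, dist


def relative_neighbourhood_graph(points):
    """Adjacency sets of a relative neighbourhood graph (assumed graph choice).

    i and j are linked when no third point k is strictly closer to both
    of them than they are to each other.
    """
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            blocked = any(
                max(d[i][k], d[j][k]) < d[i][j]
                for k in range(n) if k != i and k != j
            )
            if not blocked:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, m) * p**m * (1 - p)**(n - m) for m in range(k, n + 1))


def suspect_instances(points, labels, alpha=0.1):
    """Indices whose same-class neighbour count is NOT significantly
    greater than expected from the global class proportion.

    The binomial test and alpha are illustrative assumptions.
    """
    n = len(points)
    adj = relative_neighbourhood_graph(points)
    prior = {c: labels.count(c) / n for c in set(labels)}
    suspects = []
    for i in range(n):
        same = sum(labels[j] == labels[i] for j in adj[i])
        p_value = binom_upper_tail(same, len(adj[i]), prior[labels[i]])
        if p_value > alpha:  # not significant, so the instance is suspect
            suspects.append(i)
    return suspects


# Toy data: two well-separated groups; index 2 carries a flipped label.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
          (20, 0), (21, 0), (22, 0), (23, 0), (24, 0)]
labels = ["a", "a", "b", "a", "a", "b", "b", "b", "b", "b"]

suspects = suspect_instances(points, labels)

# Removal option: drop the suspect instances before training (e.g. 1-NN).
filtered = [(p, y) for i, (p, y) in enumerate(zip(points, labels))
            if i not in suspects]
```

With only a few neighbours per point, a test of this kind can rarely reach significance, so the sketch tends to flag more instances than the flipped one on small toy data. The relabelling option mentioned in the abstract would replace the label of each suspect instance instead of dropping it.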

Keywords

supervised learning, mislabelled data, geometrical neighbourhood, filtering, removing instances, relabelling instances



Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Fabrice Muhlenbach (1)
  • Stéphane Lallich (1)
  • Djamel A. Zighed (1)

  1. ERIC Laboratory, Lumière University (Lyon 2), Bron Cedex, France
