Improving Classification by Removing or Relabeling Mislabeled Instances

Part of the Lecture Notes in Computer Science book series (LNAI, volume 2366)

Abstract

It is common for a database to contain noisy data, and an important source of noise is mislabeled training instances. We present a new approach that improves classification accuracy in this situation by means of a preliminary filtering procedure. An example is considered suspect when, in its neighborhood as defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database as a whole. Suspect examples in the training data can then be removed or relabeled, and the filtered training set is provided as input to the learning algorithm. Our experiments on ten benchmarks from the UCI Machine Learning Repository, using 1-NN as the final classifier, show that removal gives better results than relabeling. Removal maintains the generalization error rate when between 0 and 20% class noise is introduced, especially when the classes are well separated.
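To make the filtering criterion concrete, here is a minimal sketch in Python. It is not the authors' implementation: a k-nearest-neighbor graph stands in for the geometrical neighborhood graph (the keywords suggest structures such as the Delaunay triangulation or a relative neighborhood graph), a one-sided binomial test is assumed as the significance criterion, and all function and parameter names (filter_suspect, k, alpha) are illustrative.

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier

def filter_suspect(X, y, k=5, alpha=0.05):
    """Return a boolean mask of training examples to keep (True = not suspect)."""
    y = np.asarray(y)
    # Stand-in for the geometrical graph: each point's k nearest neighbors.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                  # column 0 is the point itself
    keep = np.ones(len(y), dtype=bool)
    for i in range(len(y)):
        neighbors = idx[i, 1:]                           # drop self
        same = int(np.sum(y[neighbors] == y[i]))         # same-class neighbors
        p0 = float(np.mean(y == y[i]))                   # global class proportion
        # Suspect when the local same-class proportion is NOT significantly
        # greater than the global proportion p0 (one-sided binomial test).
        if binomtest(same, n=k, p=p0, alternative='greater').pvalue > alpha:
            keep[i] = False
    return keep

# Usage sketch (X_train, y_train, X_test, y_test assumed to exist):
# keep = filter_suspect(X_train, y_train)
# clf = KNeighborsClassifier(n_neighbors=1).fit(X_train[keep], y_train[keep])
# print(clf.score(X_test, y_test))
```

Under the same assumptions, the relabeling variant would reassign each suspect example to the majority class among its neighbors instead of dropping it; the experiments reported above favor removal.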

Keywords

  • Delaunay Triangulation
  • Neighborhood Graph
  • Geometrical Graph
  • Neighbor Rule
  • Class Noise

References

  1. V. Barnett and T. Lewis. Outliers in Statistical Data. Wiley, Norwich, 1984.

  2. R. J. Beckman and R. D. Cook. Outlier...s. Technometrics, 25:119–149, 1983.

  3. C. L. Blake and C. J. Merz. UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science [http://www.ics.uci.edu/~mlearn/MLRepository.html], 1998.

  4. C. E. Brodley and M. A. Friedl. Identifying and eliminating mislabeled training instances. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 799–805, Portland, OR, 1996. AAAI Press.

  5. C. E. Brodley and M. A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.

  6. G. H. John. Robust decision trees: removing outliers from data. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 174–179, Montréal, Québec, 1995. AAAI Press.

  7. E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications. The VLDB Journal, 8(3):237–253, February 2000.

  8. C. Largeron. Reconnaissance des formes par relaxation: un modèle d’aide à la décision. PhD thesis, Université Lyon 1, 1991.

  9. F. Muhlenbach, S. Lallich, and D. A. Zighed. Amélioration d’une classification par filtrage des exemples mal étiquetés. ECA, 1(4):155–166, 2001.

  10. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.

  11. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

  12. I. Tomek. An experiment with the edited nearest neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, 6(6):448–452, 1976.

  13. G. Toussaint. The relative neighborhood graph of a finite planar set. Pattern Recognition, 12:261–268, 1980.

  14. D. L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, 2(3):408–421, 1972.

  15. D. R. Wilson and T. R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38:257–286, 2000.

  16. D. A. Zighed, S. Lallich, and F. Muhlenbach. Séparabilité des classes dans R^p. In Actes des 8èmes Rencontres de la SFC, pages 356–363, 2001.

  17. D. A. Zighed and M. Sebban. Sélection et validation statistique de variables et de prototypes. In M. Sebban and G. Venturini, editors, Apprentissage automatique. Hermès Science, 1999.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lallich, S., Muhlenbach, F., Zighed, D.A. (2002). Improving Classification by Removing or Relabeling Mislabeled Instances. In: Hacid, M.-S., Raś, Z.W., Zighed, D.A., Kodratoff, Y. (eds) Foundations of Intelligent Systems. ISMIS 2002. Lecture Notes in Computer Science (LNAI), vol 2366. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48050-1_3

  • DOI: https://doi.org/10.1007/3-540-48050-1_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43785-7

  • Online ISBN: 978-3-540-48050-1
