Artificial Intelligence Review, Volume 22, Issue 3, pp. 177–210

Class Noise vs. Attribute Noise: A Quantitative Study

  • Xingquan Zhu
  • Xindong Wu

Abstract

Real-world data is never perfect and often suffers from corruptions (noise) that may affect interpretations of the data, models created from the data, and decisions made based on the data. Noise can reduce system performance in terms of classification accuracy, the time needed to build a classifier, and the size of the classifier. Accordingly, most existing learning algorithms integrate various approaches to enhance their ability to learn in noisy environments, but noise can still have a serious negative impact. A more reasonable solution may be to employ preprocessing mechanisms that handle noisy instances before a learner is formed. Unfortunately, little research has been conducted to systematically explore the impact of noise, especially from the noise-handling point of view. This has made various noise-processing techniques less effective, particularly when dealing with noise introduced in attributes. In this paper, we present a systematic evaluation of the effect of noise in machine learning. Instead of adopting a unified theory of noise to evaluate its impact, we differentiate noise into two categories, class noise and attribute noise, and analyze their impacts on system performance separately. Because class noise has been widely addressed in existing research, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions for handling attribute noise. Our conclusions can guide interested readers in enhancing data quality by designing various noise-handling mechanisms.
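The distinction drawn in the abstract, between class noise (corrupted labels) and attribute noise (corrupted feature values), can be illustrated with a minimal sketch. This is not the authors' code; the corruption models below (random label flipping, and replacing attribute values with draws from the attribute's observed range) and the toy nearest-centroid classifier are common illustrative assumptions:

```python
import random

def add_class_noise(y, rate, labels, rng):
    """Class noise: flip each label to a different, randomly chosen class
    with probability `rate`."""
    return [rng.choice([c for c in labels if c != yi]) if rng.random() < rate else yi
            for yi in y]

def add_attribute_noise(X, rate, rng):
    """Attribute noise: with probability `rate`, replace a feature value with
    a value drawn uniformly from that attribute's observed range (one common
    corruption model; not necessarily the one used in the paper)."""
    cols = list(zip(*X))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[rng.uniform(lo[j], hi[j]) if rng.random() < rate else v
             for j, v in enumerate(row)]
            for row in X]

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Train a trivial nearest-centroid classifier and report test accuracy."""
    centroids = {}
    for c in set(y_train):
        rows = [x for x, yi in zip(X_train, y_train) if yi == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return sum(predict(x) == yi for x, yi in zip(X_test, y_test)) / len(y_test)

if __name__ == "__main__":
    rng = random.Random(0)
    # two well-separated Gaussian blobs as a toy two-class dataset
    X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
         + [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(200)])
    y = [0] * 200 + [1] * 200
    X_test, y_test = X[::2], y[::2]
    X_tr, y_tr = X[1::2], y[1::2]
    for name, Xn, yn in [
        ("clean", X_tr, y_tr),
        ("40% class noise", X_tr, add_class_noise(y_tr, 0.4, [0, 1], rng)),
        ("40% attribute noise", add_attribute_noise(X_tr, 0.4, rng), y_tr),
    ]:
        print(name, round(nearest_centroid_accuracy(Xn, yn, X_test, y_test), 3))
```

Injecting the two noise types separately, as above, is what allows their impacts on classification accuracy to be measured independently, which is the methodological core of the study.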

Keywords: attribute noise, class noise, machine learning, noise impacts



Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Xingquan Zhu (1)
  • Xindong Wu (1)

  1. Department of Computer Science, University of Vermont, USA
