
Class Noise vs. Attribute Noise: A Quantitative Study

Artificial Intelligence Review

Abstract

Real-world data is never perfect and often suffers from corruption (noise) that can affect interpretations of the data, models built from the data, and decisions based on the data. Noise can reduce system performance in terms of classification accuracy, the time needed to build a classifier, and the size of the classifier. Accordingly, most existing learning algorithms integrate various approaches to improve their ability to learn in noisy environments, but noise can still have a serious negative impact. A more reasonable solution might be to apply preprocessing mechanisms that handle noisy instances before a learner is formed. Unfortunately, little research has systematically explored the impact of noise, especially from the noise-handling point of view. This has made various noise processing techniques less significant, particularly when dealing with noise introduced in attributes. In this paper, we present a systematic evaluation of the effect of noise in machine learning. Instead of adopting a unified theory of noise to evaluate its impact, we divide noise into two categories, class noise and attribute noise, and analyze their impact on system performance separately. Because class noise has been widely addressed in existing research, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise in different attributes, and possible solutions for handling attribute noise. Our conclusions can guide interested readers in enhancing data quality by designing noise handling mechanisms.
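The distinction between the two noise categories can be made concrete with a small sketch. The abstract does not describe the experimental protocol, so the code below is only an illustration under stated assumptions: the function names `add_class_noise` and `add_attribute_noise`, the uniform replacement scheme, and the toy weather-style data are all hypothetical and are not taken from the paper.

```python
import random


def add_class_noise(labels, classes, rate, rng):
    """Return a copy of `labels` where each label is replaced by a
    different class with probability `rate` (assumed uniform scheme)."""
    noisy = []
    for y in labels:
        if rng.random() < rate:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy


def add_attribute_noise(rows, column, values, rate, rng):
    """Return a copy of `rows` where the attribute at index `column` is
    replaced by a different value with probability `rate`.
    Class labels are not touched."""
    noisy = []
    for row in rows:
        row = list(row)  # copy so the clean data stays intact
        if rng.random() < rate:
            row[column] = rng.choice([v for v in values if v != row[column]])
        noisy.append(row)
    return noisy


# Hypothetical toy data: two nominal attributes, label depends on the first.
rng = random.Random(0)
rows = [["sunny", "hot"], ["rainy", "mild"],
        ["sunny", "mild"], ["rainy", "hot"]] * 25
labels = ["no" if r[0] == "rainy" else "yes" for r in rows]

noisy_labels = add_class_noise(labels, ["yes", "no"], 0.2, rng)
noisy_rows = add_attribute_noise(rows, 0, ["sunny", "rainy", "overcast"], 0.2, rng)

flipped = sum(a != b for a, b in zip(labels, noisy_labels))
changed = sum(a[0] != b[0] for a, b in zip(rows, noisy_rows))
print(f"labels flipped: {flipped}, attribute values changed: {changed}")
```

Keeping the two corruptions in separate functions mirrors the separation the abstract argues for: a learner can be trained on `noisy_labels` with clean `rows`, or on `noisy_rows` with clean `labels`, so that the effect of each noise type on accuracy can be measured independently.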




Cite this article

Zhu, X., Wu, X. Class Noise vs. Attribute Noise: A Quantitative Study. Artificial Intelligence Review 22, 177–210 (2004). https://doi.org/10.1007/s10462-004-0751-8
