Class Noise vs. Attribute Noise: A Quantitative Study

Zhu, Xingquan; Wu, Xindong

doi:10.1007/s10462-004-0751-8

Published: November 2004

Volume 22, pages 177–210 (2004)

Artificial Intelligence Review

Abstract

Real-world data is never perfect and often suffers from corruption (noise) that can affect interpretations of the data, models built from the data, and decisions based on the data. Noise can reduce system performance in terms of classification accuracy, the time needed to build a classifier, and the size of the classifier. Accordingly, most existing learning algorithms integrate various approaches to improve their ability to learn from noisy environments, but noise can still have serious negative effects. A more reasonable solution might be to employ preprocessing mechanisms that handle noisy instances before a learner is built. Unfortunately, little research has systematically explored the impact of noise, especially from the noise-handling point of view. This has made various noise processing techniques less significant, specifically when dealing with noise introduced in attributes. In this paper, we present a systematic evaluation of the effect of noise in machine learning. Instead of adopting a unified theory of noise to evaluate its impact, we divide noise into two categories, class noise and attribute noise, and analyze their impact on system performance separately. Because class noise has been widely addressed in existing research, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise in different attributes, and possible solutions for handling attribute noise. Our conclusions can guide interested readers in improving data quality by designing noise-handling mechanisms.
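
The distinction between the two noise categories can be illustrated with a minimal Python sketch. It assumes nominal attributes and a simple random-replacement corruption scheme applied independently to each instance at a given noise rate. The function names `add_class_noise` and `add_attribute_noise`, the toy data, and the 20% rate are hypothetical choices made for illustration. They are not taken from the paper, whose actual corruption procedure is not described in this abstract.

```python
import random


def add_class_noise(labels, classes, rate, rng):
    """Class noise: with probability `rate`, replace an instance's label
    with a different class drawn uniformly from `classes`."""
    noisy = []
    for y in labels:
        if rng.random() < rate:
            # Draw only from the other classes, so a corrupted label is always wrong.
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy


def add_attribute_noise(rows, attr_index, domain, rate, rng):
    """Attribute noise: with probability `rate`, replace the value of one
    attribute (column `attr_index`) with a value drawn uniformly from `domain`.
    The draw may return the original value, so the effective corruption
    rate can be lower than `rate`. Labels are left untouched."""
    noisy = []
    for row in rows:
        row = list(row)  # copy, so the input rows are not modified
        if rng.random() < rate:
            row[attr_index] = rng.choice(domain)
        noisy.append(row)
    return noisy


# Toy data (assumed for illustration): two nominal attributes and a binary class.
rng = random.Random(0)
rows = [["sunny", "hot"], ["rainy", "mild"], ["overcast", "cool"]] * 100
labels = ["no", "yes", "yes"] * 100

noisy_labels = add_class_noise(labels, ["yes", "no"], 0.2, rng)
noisy_rows = add_attribute_noise(rows, 0, ["sunny", "rainy", "overcast"], 0.2, rng)
```

Keeping the two corruptions in separate functions mirrors the separation described in the abstract: the impact of each noise type on a learner can then be measured independently, and attribute noise can be applied to one attribute at a time to compare attributes.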

Author information

Authors and Affiliations

Department of Computer Science, University of Vermont, VT, USA
Xingquan Zhu & Xindong Wu

About this article

Cite this article

Zhu, X., Wu, X. Class Noise vs. Attribute Noise: A Quantitative Study. Artificial Intelligence Review 22, 177–210 (2004). https://doi.org/10.1007/s10462-004-0751-8

Issue Date: November 2004
DOI: https://doi.org/10.1007/s10462-004-0751-8
