Data Mining and Knowledge Discovery, Volume 12, Issue 2–3, pp 275–308

Bridging Local and Global Data Cleansing: Identifying Class Noise in Large, Distributed Data Datasets



To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set-oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and the classifiers learned from the major set are used to identify noise in the minor set. Such a scheme has two obvious drawbacks: (1) as the underlying data volume keeps growing, loading the major set into memory for inductive learning becomes either physically impossible or time-consuming; and (2) for multiple or distributed datasets, downloading data from other sites can be either technically infeasible or prohibited (for security or privacy reasons). These approaches are therefore severely limited in conducting effective global data cleansing on large, distributed datasets.

In this paper, we propose a solution that bridges local and global analysis for noise cleansing. More specifically, the proposed approach identifies and eliminates mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we either use the distributed datasets directly or partition a large dataset into subsets, each of which is regarded as a local subset and is small enough to be processed by an induction algorithm in one pass to construct a local model for noise identification. We construct good rules from each subset and use these rules to evaluate the whole dataset. For a given instance I_k, two error count variables record the number of times it has been identified as noise across all data subsets. An instance with higher error counts has a higher probability of being a mislabeled example. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach.
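
The partition-and-vote idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: a nearest-centroid classifier stands in for the "good rules" induced from each subset, the partitioning is round-robin, the two error counters are split into a local count (the model trained on the instance's own subset disagrees with its label) and a global count (a model from another subset disagrees), and "non-objection" is read as "every local model flags the instance". The names `train_centroids`, `predict`, and `flag_noise` are hypothetical.

```python
from collections import defaultdict


def train_centroids(subset):
    """Stand-in local learner (assumption): one mean vector per class label."""
    sums, counts = {}, defaultdict(int)
    for features, label in subset:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, features))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def flag_noise(data, n_subsets, scheme="majority"):
    """Return indices of instances flagged as likely mislabeled.

    data: list of (feature_tuple, label) pairs.
    scheme: "majority" or "non_objection" (interpretations are assumptions).
    """
    # Round-robin partition: instance k belongs to subset k % n_subsets.
    subsets = [data[i::n_subsets] for i in range(n_subsets)]
    models = [train_centroids(s) for s in subsets]

    local_err = [0] * len(data)   # disagreements from the instance's own subset
    global_err = [0] * len(data)  # disagreements from all other subsets
    for k, (features, label) in enumerate(data):
        home = k % n_subsets
        for j, model in enumerate(models):
            if predict(model, features) != label:
                if j == home:
                    local_err[k] += 1
                else:
                    global_err[k] += 1

    flagged = []
    for k in range(len(data)):
        votes = local_err[k] + global_err[k]
        if scheme == "majority":
            noisy = votes > n_subsets / 2   # more than half of the models disagree
        else:
            noisy = votes == n_subsets      # every model disagrees
        if noisy:
            flagged.append(k)
    return flagged
```

In this sketch only the local models and the per-instance counters need to be shared between sites; the raw subsets never have to be loaded into one place, which is the property the abstract emphasizes for large or distributed data.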


Keywords: data cleansing, class noise, machine learning



Copyright information

© Springer Science+Business Media, Inc 2005

Authors and Affiliations

  1. Department of Computer Science, University of Vermont, Burlington, USA