Noisy Data Set Identification

  • Luís Paulo F. Garcia
  • André C. P. L. F. de Carvalho
  • Ana C. Lorena
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8073)


Real data are often corrupted by noise, which can originate from errors in data collection, storage and processing. Noise hampers the induction of Machine Learning models from data: the predictive or descriptive performance of the models can be impaired and their training time increased. Moreover, these models can become overly complex in order to accommodate such errors. Thus, identifying and reducing noise in a data set may benefit the learning process. In this paper, we investigate the use of data complexity measures to identify the presence of noise in a data set. This identification can support the decision on whether noise reduction techniques need to be applied.
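As a rough illustration of the idea, and not the procedure used in the paper, the sketch below computes one well-known data complexity measure, Fisher's discriminant ratio (usually called F1 in the complexity-measure literature), on a synthetic two-class data set before and after randomly flipping 20% of the labels. The data, the noise rate, and the helper names `fisher_ratio` and `flip_labels` are assumptions made for this example only.

```python
import random
from statistics import mean, pvariance


def fisher_ratio(X, y):
    """Largest per-feature value of (mu0 - mu1)^2 / (var0 + var1), two classes."""
    best = 0.0
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        denom = pvariance(a) + pvariance(b)
        if denom > 0:
            best = max(best, (mean(a) - mean(b)) ** 2 / denom)
    return best


def flip_labels(y, rate, rng):
    """Return a copy of y with a fraction `rate` of binary labels flipped."""
    noisy = list(y)
    for i in rng.sample(range(len(y)), int(rate * len(y))):
        noisy[i] = 1 - noisy[i]
    return noisy


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
X += [[rng.gauss(3, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [0] * 200 + [1] * 200

f1_clean = fisher_ratio(X, y)
f1_noisy = fisher_ratio(X, flip_labels(y, 0.2, rng))
print(f"F1 clean: {f1_clean:.2f}  F1 with 20% label noise: {f1_noisy:.2f}")
```

In this toy setting the ratio drops sharply once labels are flipped, because each class then mixes points from both clusters. A drop of this kind is the sort of signal a complexity measure could provide about the presence of noise.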


Noisy data; Noise identification; Data complexity measures





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Luís Paulo F. Garcia (1)
  • André C. P. L. F. de Carvalho (1)
  • Ana C. Lorena (2)
  1. Computer Science Department, Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
  2. Institute of Science and Technology, Federal University of São Paulo, São José dos Campos, Brazil
