A First Approach in the Class Noise Filtering Approaches for Fuzzy Subgroup Discovery
Noise in data is a common and largely unavoidable problem: errors frequently occur during the data collection and data preparation stages of Data Mining applications, and they have several negative consequences. The performance of models built under such circumstances depends heavily on the quality of the training data. Hence, problems containing noise are complex, and accurate solutions are often difficult to achieve without specialized techniques. Subgroup discovery, a particular supervised learning field, has so far overlooked the analysis of noise and its impact on the descriptions obtained. In this paper, the impact of noise on subgroup discovery is analyzed in a complete experimental study that applies recent filtering techniques at several class noise levels. Specifically, the analysis is performed with the FuGePSD algorithm, a state-of-the-art subgroup discovery (SD) algorithm based on genetic programming and fuzzy logic.
Keywords: Subgroup discovery, Class noise, Noise filters
Supported by the Spanish Ministry of Economy and Competitiveness under project TIN2012-33856 (FEDER Funds), the Spanish Ministry of Science and Technology under projects TIN2011-28488 and TIN2010-15055, and the Regional Projects P10-TIC-6858 and P12-TIC-2958.