Abstract
In any knowledge discovery process, the value of the extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by the massive growth in the scale of data observed in recent years, are no exception. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances and is known to be a highly disruptive feature of data. Another common imperfection is the presence of missing values, which deserve special attention because they have a critical impact on the learning process: most learners assume that the data is complete. In this Big Data era, however, the massive growth in the scale of data poses a challenge to the traditional techniques created to tackle noise and missing values, as they struggle to cope with such large amounts of data.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this chapter
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Imperfect Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_6
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8