Abstract
In any knowledge discovery process, the value of the extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by the massive growth in the scale of data observed in recent years, are no exception. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances and is known to be a highly disruptive feature of data. Another common imperfection is the presence of missing values, which deserve special attention because they have a critical impact on the learning process: most learners assume that the data is complete. In this Big Data era, however, the massive growth in the scale of data poses a challenge to the traditional techniques created to tackle noise and missing values, as they struggle to cope with such large amounts of data.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this chapter
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Imperfect Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_6
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8