
Imperfect Big Data

Chapter in the book Big Data Preprocessing

Abstract

In any knowledge discovery process, the value of the extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by the massive growth in the scale of data observed in recent years, follow the same rule. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances and is known to be a very disruptive feature of the data. Another alteration present in data is the presence of missing values. These deserve special attention because they have a critical impact on the learning process: most learners assume that the data is complete. However, in the Big Data era, the massive growth in the scale of data poses a challenge to the traditional proposals created to tackle noise and missing values, as they struggle to cope with such large amounts of data.
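To make the missing-value problem concrete, the following is a minimal sketch, not taken from the chapter, of column-wise mean imputation: one of the simplest ways to complete a dataset before handing it to a learner that assumes complete data. It assumes NumPy is available; the function name mean_impute and the toy matrix are purely illustrative.

    # Minimal sketch (assumes NumPy): replace each missing entry (NaN)
    # with the mean of its column, so downstream learners see a complete matrix.
    import numpy as np

    def mean_impute(X):
        """Return a copy of X with NaN entries replaced by column means."""
        X = np.asarray(X, dtype=float).copy()
        col_means = np.nanmean(X, axis=0)            # per-column mean, ignoring NaNs
        nan_rows, nan_cols = np.where(np.isnan(X))   # positions of missing values
        X[nan_rows, nan_cols] = col_means[nan_cols]
        return X

    # Toy data with two missing entries
    X = [[1.0, 2.0],
         [np.nan, 4.0],
         [3.0, np.nan]]
    print(mean_impute(X))
    # [[1. 2.]
    #  [2. 4.]
    #  [3. 3.]]

Simple univariate imputation like this scales trivially to large data, but richer strategies (k-nearest-neighbor or model-based imputation) usually preserve more of the data's structure at a higher computational cost, which is precisely the trade-off that becomes difficult at Big Data scale.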



Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Imperfect Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_6

  • DOI: https://doi.org/10.1007/978-3-030-39105-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-39104-1

  • Online ISBN: 978-3-030-39105-8

  • eBook Packages: Computer Science (R0)
