Abstract
The term “big data” has become a buzzword in recent years. It refers to the ability to collect and store huge amounts of information, resulting in large databases and data repositories. This also holds for industrial applications: in a production process, for instance, it is possible to install many sensors and record data at a very high temporal resolution. The amount of information grows rapidly, but insight into the production process does not necessarily grow with it. This is where machine learning or statistics needs to enter, because sophisticated algorithms are required to identify the relevant parameters, for example those that drive the quality of the product. However, is data quality still an issue? It is clear that with small amounts of data, single outliers or extreme values can affect the algorithms or statistical methods. Can “big data” overcome this problem? In this article we focus on some specific problems in the regression context and show that even if many parameters are measured, poor data quality can severely influence the prediction performance of the methods.
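The claim that a small share of bad values can distort a regression fit can be illustrated with a toy example. The sketch below is not the chapter's method: it uses a single predictor, made-up data, and the Theil-Sen estimator (median of pairwise slopes) as a simple stand-in for the robust regression methods the chapter refers to. All names and numbers are assumptions chosen for illustration.

```python
from itertools import combinations
from statistics import mean, median


def ols_slope(x, y):
    """Ordinary least-squares slope for simple linear regression."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def theil_sen_slope(x, y):
    """Theil-Sen slope: the median of the slopes of all point pairs."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return median(slopes)


# Hypothetical sensor readings: 100 observations on the line y = 2x + 1.
x = list(range(100))
clean_y = [2.0 * v + 1.0 for v in x]

# Corrupt 5% of the readings (an assumed failure mode, e.g. a faulty sensor).
dirty_y = list(clean_y)
for i in range(95, 100):
    dirty_y[i] = -500.0

slope_clean = ols_slope(x, clean_y)          # true slope recovered
slope_dirty = ols_slope(x, dirty_y)          # pulled far away from 2
slope_robust = theil_sen_slope(x, dirty_y)   # stays at the true slope

print(f"OLS, clean data:       {slope_clean:.3f}")
print(f"OLS, 5% outliers:      {slope_dirty:.3f}")
print(f"Theil-Sen, 5% outliers: {slope_robust:.3f}")
```

In this toy setting, five corrupted values out of one hundred are enough to remove almost all of the least-squares slope, while the median-based estimate is unaffected. Adding more observations with the same contamination rate would not repair the least-squares fit, which is the point the abstract raises about massive data sets.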
Copyright information
© 2023 The Author(s), under exclusive license to Springer-Verlag GmbH, DE, part of Springer Nature
About this chapter
Cite this chapter
Filzmoser, P., Mazak-Huemer, A. (2023). Massive Data Sets – Is Data Quality Still an Issue?. In: Vogel-Heuser, B., Wimmer, M. (eds) Digital Transformation. Springer Vieweg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-65004-2_11
DOI: https://doi.org/10.1007/978-3-662-65004-2_11
Publisher Name: Springer Vieweg, Berlin, Heidelberg
Print ISBN: 978-3-662-65003-5
Online ISBN: 978-3-662-65004-2
eBook Packages: Computer Science, Computer Science (R0)