Massive Data Sets – Is Data Quality Still an Issue?

Filzmoser, Peter; Mazak-Huemer, Alexandra

doi:10.1007/978-3-662-65004-2_11

Abstract

The term “big data” has become a buzzword in recent years. It refers to the ability to collect and store huge amounts of information, resulting in large databases and data repositories. This also holds for industrial applications: in a production process, for instance, many sensors can be installed and data recorded at very high temporal resolution. The amount of information grows rapidly, but insight into the production process does not necessarily grow with it. This is where machine learning or statistics needs to enter, because sophisticated algorithms are required to identify the relevant parameters, for example those that drive the quality of the product. However, is data quality still an issue? It is clear that with small amounts of data, single outliers or extreme values can affect algorithms and statistical methods. Can “big data” overcome this problem? In this article we focus on some specific problems in the regression context and show that, even if many parameters are measured, poor data quality can severely degrade the prediction performance of the methods.
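
The abstract's central question, whether sheer data volume compensates for poor data quality in regression, can be illustrated with a small simulation. The following is a minimal sketch and not the chapter's own experiment: the linear model, the 10% contamination rate, the "sensor stuck at zero" failure mode, and the helper names `fit_line` and `simulate` are all assumptions chosen for illustration.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # array([intercept, slope])


def simulate(n, contamination, rng):
    """Draw n points from y = 1 + 2x + noise, plus a contaminated copy of y.

    Assumed failure mode: a fraction of the readings is replaced by 0.0,
    as if a sensor were stuck at zero.
    """
    x = rng.uniform(0.0, 10.0, size=n)
    y_clean = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)
    bad = rng.random(n) < contamination
    y_bad = y_clean.copy()
    y_bad[bad] = 0.0
    return x, y_clean, y_bad


rng = np.random.default_rng(0)
true_slope = 2.0
slope_errors = {}  # n -> (error on clean data, error on contaminated data)

for n in (10_000, 1_000_000):
    x, y_clean, y_bad = simulate(n, contamination=0.10, rng=rng)
    err_clean = abs(fit_line(x, y_clean)[1] - true_slope)
    err_bad = abs(fit_line(x, y_bad)[1] - true_slope)
    slope_errors[n] = (err_clean, err_bad)
    print(f"n={n:>9,}  slope error clean={err_clean:.4f}  contaminated={err_bad:.4f}")
```

Under these assumptions, the slope error on clean data shrinks as n grows, while the error on contaminated data stays at roughly the same level, because the bias is governed by the fraction of bad readings rather than by the sample size.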

Author information

Authors and Affiliations

Computational Statistics, Institute of Statistics and Mathematical Methods in Economics, TU Wien, Vienna, Austria
Peter Filzmoser
Institute of Business Informatics - Software Engineering, Johannes Kepler University (JKU) Linz, Linz, Austria
Alexandra Mazak-Huemer

Corresponding author

Correspondence to Alexandra Mazak-Huemer.

Editor information

Editors and Affiliations

Technische Universität München, Garching b. München, Bayern, Germany
Birgit Vogel-Heuser
Johannes Kepler University Linz, Linz, Oberösterreich, Austria
Manuel Wimmer

About this chapter

Cite this chapter

Filzmoser, P., Mazak-Huemer, A. (2023). Massive Data Sets – Is Data Quality Still an Issue?. In: Vogel-Heuser, B., Wimmer, M. (eds) Digital Transformation. Springer Vieweg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-65004-2_11

DOI: https://doi.org/10.1007/978-3-662-65004-2_11
Published: 03 February 2023
Publisher Name: Springer Vieweg, Berlin, Heidelberg
Print ISBN: 978-3-662-65003-5
Online ISBN: 978-3-662-65004-2
eBook Packages: Computer Science, Computer Science (R0)
