Massive Data Sets – Is Data Quality Still an Issue?

Chapter in: Digital Transformation

Abstract

The term “big data” has become a buzzword in recent years. It refers to the ability to collect and store huge amounts of information, resulting in large databases and data repositories. This also holds for industrial applications: in a production process, for instance, many sensors can be installed and data recorded at very high temporal resolution. The amount of information grows rapidly, but insight into the production process does not necessarily grow with it. This is where machine learning or, say, statistics needs to enter, because sophisticated algorithms are required to identify the relevant parameters, for example those that drive the quality of the product. However, is data quality still an issue? It is clear that with small amounts of data, single outliers or extreme values can affect the algorithms or statistical methods. Can “big data” overcome this problem? In this article we focus on some specific problems in the regression context and show that even if many parameters are measured, poor data quality can severely degrade the prediction performance of the methods.
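The question raised in the abstract, whether more data alone can compensate for poor data quality in regression, can be illustrated with a small simulation. The following is a minimal sketch and not the authors' experiment: the coefficient vector, the 5% contamination rate, the placement of the outliers, and the function names `ols_coefficients` and `coefficient_errors` are all assumptions chosen for illustration. It fits ordinary least squares to clean data and to data with a fixed share of bad leverage points, at two sample sizes.

```python
import numpy as np


def ols_coefficients(X, y):
    """Least-squares fit; an intercept column is prepended to X."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def coefficient_errors(n, contamination=0.05, seed=0):
    """Return (error on clean data, error on contaminated data) for sample size n."""
    rng = np.random.default_rng(seed)
    # Hypothetical true model: intercept followed by five slopes.
    true_beta = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 3.0])
    X = rng.normal(size=(n, 5))
    y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.5, size=n)

    # Contaminate a fixed share of rows with bad leverage points:
    # an extreme value in the first predictor paired with a response
    # that contradicts the linear model.
    X_bad, y_bad = X.copy(), y.copy()
    n_out = int(contamination * n)
    X_bad[:n_out, 0] = 10.0
    y_bad[:n_out] = -20.0

    # Euclidean distance between estimated and true coefficients.
    err_clean = np.linalg.norm(ols_coefficients(X, y) - true_beta)
    err_bad = np.linalg.norm(ols_coefficients(X_bad, y_bad) - true_beta)
    return err_clean, err_bad


results = {n: coefficient_errors(n) for n in (1_000, 100_000)}
for n, (err_clean, err_bad) in results.items():
    print(f"n={n:>7}: clean error={err_clean:.4f}, contaminated error={err_bad:.4f}")
```

Under these assumptions, the error on clean data should shrink as the sample grows, while the error on contaminated data should stay roughly the same size, because the outliers grow in number along with the sample. A robust estimator would be the natural counterpart to compare against; none is included here to keep the sketch dependency-free.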

Author information

Corresponding author

Correspondence to Alexandra Mazak-Huemer.

Copyright information

© 2023 The Author(s), under exclusive license to Springer-Verlag GmbH, DE, part of Springer Nature

About this chapter

Cite this chapter

Filzmoser, P., Mazak-Huemer, A. (2023). Massive Data Sets – Is Data Quality Still an Issue?. In: Vogel-Heuser, B., Wimmer, M. (eds) Digital Transformation. Springer Vieweg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-65004-2_11

  • DOI: https://doi.org/10.1007/978-3-662-65004-2_11

  • Publisher Name: Springer Vieweg, Berlin, Heidelberg

  • Print ISBN: 978-3-662-65003-5

  • Online ISBN: 978-3-662-65004-2

  • eBook Packages: Computer Science, Computer Science (R0)
