
Data Pre-processing


Abstract

Data preprocessing techniques generally refer to the addition, deletion, or transformation of the training set data. Preprocessing data is a crucial step prior to modeling since data preparation can make or break a model’s predictive ability. To illustrate general preprocessing techniques, we begin by introducing a cell segmentation data set (Section 3.1). This data set contains common predictor problems such as skewness, outliers, and missing values. Sections 3.2 and 3.3 review predictor transformations for single predictors and multiple predictors, respectively. In Section 3.4 we discuss several approaches for handling missing data. Other preprocessing steps may include removing (Section 3.5), adding (Section 3.6), or binning (Section 3.7) predictors, all of which must be done carefully so that predictive information is not lost or erroneous information is added to the data. The Computing section (3.8) provides R syntax for the previously described preprocessing steps. Exercises are provided at the end of the chapter to solidify concepts.
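The chapter's examples are written in R, but the flavor of two of the steps above, imputing missing values (Section 3.4) and centering and scaling predictors (Section 3.2), can be sketched in a few lines. The following Python snippet is an illustrative sketch, not taken from the chapter; the function name `impute_and_standardize` is our own.

```python
import numpy as np

def impute_and_standardize(X):
    """Median-impute missing values, then center and scale each column.

    A minimal sketch of two common preprocessing steps; the chapter's
    Computing section performs the equivalent operations in R.
    """
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        mask = np.isnan(col)
        if mask.any():
            col[mask] = np.nanmedian(col)   # simple median imputation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0                 # guard against constant columns
    return (X - mu) / sigma

X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 400.0],
              [4.0, 600.0]])
Z = impute_and_standardize(X)
print(np.allclose(Z.mean(axis=0), 0.0))     # True: columns are centered
```

As the chapter cautions, even simple steps like these must be fit on the training set alone and then applied to new samples, so that no information leaks from the test set.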


Notes

  1.

    The individual data points can be found on the journal web site or in the R package AppliedPredictiveModeling. See the Computing section at the end of this chapter.

  2.

    The original authors included several “status” features that are binary representations of other features in the data set. We excluded these from the analysis in this chapter.

  3.

    Some readers familiar with Box and Cox (1964) will know that this transformation was developed for outcome data while Box and Tidwell (1962) describe similar methods for transforming a set of predictors in a linear model. Our experience is that the Box–Cox transformation is more straightforward, less prone to numerical issues, and just as effective for transforming individual predictor variables.
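    As a concrete illustration (not from the chapter, which works in R), the Box and Cox (1964) family transforms a strictly positive variable as y* = (y^λ − 1)/λ for λ ≠ 0 and y* = log(y) for λ = 0, with λ estimated by maximum likelihood. The sketch below applies it to a simulated right-skewed predictor via `scipy.stats.boxcox`; the simulated data are our own, purely illustrative.

```python
import numpy as np
from scipy import stats

# Simulate a single right-skewed, strictly positive predictor
# (Box-Cox requires positive values).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# scipy estimates lambda by maximum likelihood when it is not supplied.
transformed, lam = stats.boxcox(x)

print(stats.skew(x) > stats.skew(transformed))  # True: skewness is reduced
```

    For log-normal data the estimated λ lands near zero, recovering the log transformation, which is consistent with the chapter's point that Box–Cox resolves skewness in individual predictors with little manual tuning.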

  4.

    Section 20.5 discusses model extrapolation—where the model predicts samples outside of the mainstream of the training data. Another concept is the applicability domain of the model, which is the population of samples that can be effectively predicted by the model.

  5.

    A preprocessed version of these data can also be found in the caret package and is used in later chapters.

  6.

    http://archive.ics.uci.edu/ml/index.html.

References

  • Abdi H, Williams L (2010). “Principal Component Analysis.” Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433–459.


  • Austin P, Brunner L (2004). “Inflation of the Type I Error Rate When a Continuous Confounding Variable Is Categorized in Logistic Regression Analyses.” Statistics in Medicine, 23(7), 1159–1178.


  • Bone R, Balk R, Cerra F, Dellinger R, Fein A, Knaus W, Schein R, Sibbald W (1992). “Definitions for Sepsis and Organ Failure and Guidelines for the Use of Innovative Therapies in Sepsis.” Chest, 101(6), 1644–1655.


  • Box G, Cox D (1964). “An Analysis of Transformations.” Journal of the Royal Statistical Society, Series B (Methodological), 26(2), 211–252.


  • Box G, Tidwell P (1962). “Transformation of the Independent Variables.” Technometrics, 4(4), 531–550.


  • Everitt B, Landau S, Leese M, Stahl D (2011). Cluster Analysis. Wiley.


  • Forina M, Casale M, Oliveri P, Lanteri S (2009). “CAIMAN brothers: A Family of Powerful Classification and Class Modeling Techniques.” Chemometrics and Intelligent Laboratory Systems, 96(2), 239–245.


  • Geladi P, Manley M, Lestander T (2003). “Scatter Plotting in Multivariate Data Analysis.” Journal of Chemometrics, 17(8–9), 503–511.


  • Giuliano K, DeBiasio R, Dunlay R, Gough A, Volosky J, Zock J, Pavlakis G, Taylor D (1997). “High–Content Screening: A New Approach to Easing Key Bottlenecks in the Drug Discovery Process.” Journal of Biomolecular Screening, 2(4), 249–259.


  • Hill A, LaPan P, Li Y, Haney S (2007). “Impact of Image Segmentation on High–Content Screening Data Quality for SK–BR-3 Cells.” BMC Bioinformatics, 8(1), 340.


  • Jerez J, Molina I, Garcia-Laencina P, Alba R, Ribelles N, Martin M, Franco L (2010). “Missing Data Imputation Using Statistical and Machine Learning Methods in a Real Breast Cancer Problem.” Artificial Intelligence in Medicine, 50, 105–115.


  • Kuiper S (2008). “Introduction to Multiple Regression: How Much Is Your Car Worth?” Journal of Statistics Education, 16(3).


  • Mente S, Lombardo F (2005). “A Recursive–Partitioning Model for Blood–Brain Barrier Permeation.” Journal of Computer–Aided Molecular Design, 19(7), 465–481.


  • Myers R (1994). Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston, MA, second edition.


  • Saar-Tsechansky M, Provost F (2007b). “Handling Missing Values When Applying Classification Models.” Journal of Machine Learning Research, 8, 1625–1657.


  • Serneels S, Nolf ED, Espen PV (2006). “Spatial Sign Pre-processing: A Simple Way to Impart Moderate Robustness to Multivariate Estimators.” Journal of Chemical Information and Modeling, 46(3), 1402–1409.


  • Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman R (2001). “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics, 17(6), 520–525.




Copyright information

© 2013 Springer Science+Business Media New York


Cite this chapter

Kuhn, M., Johnson, K. (2013). Data Pre-processing. In: Applied Predictive Modeling. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6849-3_3
