
Data Modeling III – Making Models More Selective

Part of the book series: Use R

Abstract

In this chapter, we cover another important aspect of modeling: the danger of incorporating too much data. This point is a rather sensitive one: while too little information will not yield a very powerful model, too much information can achieve quite the opposite effect – it can destroy the model's ability to give us any insight at all. Thus, much of our modeling effort is geared toward achieving a fine balance: identifying just the right amount of data that is useful for our model and weeding out the information that does not carry any value. In other words, we want to make sure our models are selective and admit only information that is useful for the decision-making process.

Notes

  1. In regression, using two predictors with a correlation of one results in standard errors (and hence p-values) that are unreliable. The mathematical reason is that the two predictors are linearly dependent, so the design matrix is rank deficient and the cross-product matrix X'X cannot be inverted. Inverting X'X is the core step in estimating the regression coefficients, b = (X'X)^(-1) X'y.
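     A minimal sketch of this failure mode in R, using simulated data (all names here are hypothetical):

         set.seed(1)
         x1 <- rnorm(100)
         x2 <- 2 * x1                 # correlation with x1 is exactly one
         y  <- 1 + x1 + rnorm(100)

         fit <- lm(y ~ x1 + x2)
         summary(fit)                 # x2 is reported as NA: aliased with x1

         ## Root cause: t(X) %*% X is singular and cannot be inverted
         X <- cbind(1, x1, x2)
         solve(t(X) %*% X)            # error: the system is exactly singular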

  2. For a set of k different predictor variables, there exist 2^k possible models (ignoring all possible interaction terms), so, as we have 24 different variables in this case, there are as many as 2^24 = 16,777,216 different models!
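     The count is easy to verify directly in R:

         k <- 24
         2^k                             # 16777216
         format(2^k, big.mark = ",")     # "16,777,216" candidate models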

  3. Stepwise regression is not an optimization method; it is a heuristic that systematically eliminates poor variable choices, but it does not guarantee the best possible model.
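     A sketch of a stepwise search using R's built-in step() function, assuming a hypothetical data frame dat with response y and many candidate predictors:

         full <- lm(y ~ ., data = dat)    # all candidate predictors
         null <- lm(y ~ 1, data = dat)    # intercept-only starting point

         ## AIC-guided search in both directions -- a heuristic, not an
         ## exhaustive evaluation of all 2^k candidate models
         best <- step(null, scope = formula(full), direction = "both",
                      trace = FALSE)
         summary(best)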

  4. We transform all of the 25 variables from Table 5.3 to the logarithmic scale, since a close inspection reveals that every single variable shows a heavy right skew and clearly nonlinear patterns, similar to those observed in Figure 5.13.
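     A sketch of such a transformation, assuming a hypothetical data frame dat whose columns are all numeric and strictly positive (log() is undefined for non-positive values):

         dat_log <- as.data.frame(lapply(dat, log))   # log-transform every column

         ## Alternatively, transform on the fly inside the model formula
         ## (x1, x2 are placeholder names, not the variables of Table 5.3)
         fit <- lm(log(y) ~ log(x1) + log(x2), data = dat)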

Author information

Correspondence to Wolfgang Jank.

Copyright information

© 2011 Springer Science+Business Media, LLC

Cite this chapter

Jank, W. (2011). Data Modeling III – Making Models More Selective. In: Business Analytics for Managers. Use R. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-0406-4_5
