Abstract
In this chapter, we cover another important aspect of modeling: the danger of incorporating too much data. This point is a rather sensitive one: while too little information will not yield a very powerful model, too much information can achieve quite the opposite effect – it can destroy the model’s ability to give us any insight at all. Thus, much of our modeling effort will be geared toward striking a fine balance: identifying just the right amount of data that is useful for our model while weeding out the information that carries no value. In other words, we want to make sure our models are selective and admit only information that is useful for the decision-making process.
Notes
- 1.
In regression, using two predictors with a correlation of one results in standard errors (and hence p-values) that are unreliable. The mathematical reason is that the two predictors are linearly dependent, so the design matrix is singular and cannot be inverted. Matrix inversion is one of the core tools underlying the estimation of the regression model.
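A minimal sketch of this problem, using hypothetical simulated data (the variable names are illustrative, not the chapter's): when one predictor is an exact linear function of another, R's `lm()` detects the singular design matrix and drops the redundant column, reporting an `NA` coefficient.

```r
# Hypothetical data: x2 is an exact linear function of x1,
# so cor(x1, x2) equals one.
set.seed(1)
x1 <- rnorm(100)
x2 <- 2 * x1 + 3                      # perfectly correlated with x1
y  <- 1 + 0.5 * x1 + rnorm(100)

fit <- lm(y ~ x1 + x2)
# R drops the aliased column: the coefficient for x2 is NA.
# Other software may instead report wildly inflated standard errors.
coef(fit)
```

With correlations close to (but below) one, the matrix remains technically invertible, but standard errors blow up in the same way, which is why near-collinearity is just as dangerous in practice.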
- 2.
For a set of k different predictor variables, there exist 2^k possible models (ignoring all possible interaction terms), so, as we have 24 different variables in this case, there are as many as 2^24 = 16,777,216 different models!
- 3.
Stepwise regression is not an optimization method; it is a heuristic that systematically eliminates poor variable choices, but it does not guarantee the absolutely best possible model.
- 4.
We transform all 25 variables from Table 5.3 to the logarithmic scale. The reason is that a close inspection reveals that every single variable shows a heavy right skew and patterns that are not linear at all, similar to the patterns observed in Figure 5.13.
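A small sketch, with hypothetical simulated data, of why the log transform helps with heavy right skew: a log-normal variable is strongly right-skewed on the raw scale and roughly symmetric after taking logs.

```r
# Hypothetical right-skewed variable (log-normal by construction).
set.seed(2)
x     <- rlnorm(1000, meanlog = 0, sdlog = 1)
log_x <- log(x)

# Sample skewness (third standardized moment); large and positive
# for the raw variable, near zero after the log transform.
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew(x)
skew(log_x)
```

In practice one would apply `log()` to each column of the data frame (e.g. `log_df <- log(df)` when all variables are strictly positive) before refitting the regression.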
© 2011 Springer Science+Business Media, LLC
Jank, W. (2011). Data Modeling III – Making Models More Selective. In: Business Analytics for Managers. Use R. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-0406-4_5
Print ISBN: 978-1-4614-0405-7
Online ISBN: 978-1-4614-0406-4