Abstract
In this chapter, we cover another important aspect of modeling: the danger of incorporating too much data. This point is a rather sensitive one: while too little information will not yield a very powerful model, too much information can achieve quite the opposite effect – it can destroy the model’s ability to give us any insight at all. Thus, much of our modeling effort will be geared toward striking a fine balance: identifying just the right amount of data that is useful for our model while weeding out the information that carries no value. In other words, we want to make sure our models are selective and admit only information that is useful for the decision-making process.
Notes
- 1.
In regression, using two predictors with a correlation of one results in standard errors (and hence p-values) that are unreliable. The mathematical reason is that the two predictors are linearly dependent, so the design matrix is singular and cannot be inverted. Matrix inversion is one of the core tools underlying the estimation of the regression model.
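A minimal sketch of this problem, using hypothetical simulated data (the variable names are illustrative, not the chapter's): when one predictor is an exact linear function of another, R's `lm()` detects the singular design matrix and drops the redundant column, reporting an `NA` coefficient.

```r
# Hypothetical data: x2 is an exact linear function of x1,
# so cor(x1, x2) equals one.
set.seed(1)
x1 <- rnorm(100)
x2 <- 2 * x1 + 3                      # perfectly correlated with x1
y  <- 1 + 0.5 * x1 + rnorm(100)

fit <- lm(y ~ x1 + x2)
# R drops the aliased column: the coefficient for x2 is NA.
# Other software may instead report wildly inflated standard errors.
coef(fit)
```

With correlations close to (but below) one, the matrix remains technically invertible, but standard errors blow up in the same way, which is why near-collinearity is just as dangerous in practice.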
- 2.
For a set of k different predictor variables, there exist 2^k possible models (ignoring all possible interaction terms), so, as we have 24 different variables in this case, there are as many as 2^24 = 16,777,216 different models!
- 3.
Stepwise regression is not an optimization method; it is a heuristic that systematically eliminates poor variable choices, but it does not guarantee the absolutely best possible model.
- 4.
We transform all 25 variables from Table 5.3 to the logarithmic scale. The reason is that a close inspection reveals that every single variable shows a heavy right skew and patterns that are not linear at all, similar to the patterns observed in Figure 5.13.
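A small sketch, with hypothetical simulated data, of why the log transform helps with heavy right skew: a log-normal variable is strongly right-skewed on the raw scale and roughly symmetric after taking logs.

```r
# Hypothetical right-skewed variable (log-normal by construction).
set.seed(2)
x     <- rlnorm(1000, meanlog = 0, sdlog = 1)
log_x <- log(x)

# Sample skewness (third standardized moment); large and positive
# for the raw variable, near zero after the log transform.
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew(x)
skew(log_x)
```

In practice one would apply `log()` to each column of the data frame (e.g. `log_df <- log(df)` when all variables are strictly positive) before refitting the regression.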
© 2011 Springer Science+Business Media, LLC
Jank, W. (2011). Data Modeling III – Making Models More Selective. In: Business Analytics for Managers. Use R. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-0406-4_5
Print ISBN: 978-1-4614-0405-7
Online ISBN: 978-1-4614-0406-4