Abstract
In the previous chapter, we discussed situations where we had only one independent variable (X) and evaluated its relationship with a dependent variable (Y). This chapter goes beyond that and deals with the analysis of situations where we have more than one X (predictor) variable, using a technique called multiple regression. As with simple regression, the objective here is to specify mathematical models that can describe the relationship between Y and more than one X and that can be used to predict the outcome at given values of the predictors. As we did in Chap. 14, we focus on linear models.
Notes
- 1.
In fact, we would have a plane or hyperplane, since we have multiple dimensions. We will use the term line in this text for simplicity.
- 2.
For those of you who know what this means, you would need to invert a matrix by hand!
- 3.
If we accept the null hypothesis, we would typically abandon formal statistical analysis, since we have accepted that "the X's as a group (or, the X's collectively) do not provide us with predictive value about Y"; in which case, what more can be said?
- 4.
At this point we would drop X2 from consideration and repeat the regression without the X2 data. However, leaving it in the model/equation for the moment allows several salient points about the methodology to be made more effectively over the next several pages. We explicitly discuss this issue later.
- 5.
We are making an analogy to R. That is, imagine that a "unit of information" is equivalent to .01 of R. There are 100 units of information about Y, labeled 1–100. Obviously, if an X, or group of X's, provided all 100 units of information, it would be equivalent to having an R of 1.
- 6.
In SPSS and JMP, we can enter a column of data as, for example, M and F, for the two sexes. However, we advise the reader not to do so, for the richness of the output is greater when we convert the letters to 0's and 1's.
- 7.
We would, in general, not be pleased to have 12 X's and n = (only) 15. This is true even though all 12 X's are extremely unlikely to enter the stepwise regression. There is too much opportunity to "capitalize on chance" and find variables showing up as significant when they really are not. This possibility is a criticism of the stepwise regression technique and is discussed further in "Improving the User Experience through Practical Data Analytics," by Fritz and Berger, Morgan Kaufmann, page 259.
- 8.
JMP and SPSS include some options for "directions" or "methods" when performing stepwise regression. Forward is equivalent to stepwise, except that once a variable is included, it cannot be removed. Remove is stepwise in reverse; that is, the initial equation contains all the variables, and each step removes the least significant one (not available in JMP). Backward is similar to remove, although we cannot reintroduce a variable once it is removed from the equation. JMP also has mixed, a procedure that alternates between forward and backward. The authors recommend stepwise and, while preferring it, are not strongly against remove. We are not certain why anyone would prefer either forward or backward; these two processes remove the "guarantee" that all non-significant variables (using p = .10, usually) are deleted from the model/equation.
- 9.
While standardized coefficients provide an indication of the relative importance of the variables in a stepwise regression, this would not necessarily be the case in a "regular" multiple regression. This is because there can be large amounts of multicollinearity in a regular multiple regression, while this element is eliminated to a very large degree in the stepwise process.
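As an aside on the "directions" discussed in Note 8: base R can automate this kind of search with its step() function, which alternates forward and backward moves (comparable to JMP's mixed direction). Note, however, that step() selects terms by AIC rather than by p-values, so its result can differ from the p-value-based procedure shown in the appendix. A minimal sketch, assuming the rating data frame and rating_model object constructed in the appendix:

```r
# Start from the intercept-only model and let step() add/drop terms by AIC.
rating_none <- lm(y ~ 1, data = rating)
rating_step <- step(rating_none,
                    scope = formula(rating_model),  # largest model considered
                    direction = "both")             # alternate forward/backward
summary(rating_step)
```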
Electronic Supplementary Material
Example 15.5 (XLS 59 kb)
Appendix
Example 15.8 Faculty Ratings using R
To analyze the faculty ratings example, we can import the data as we have done previously or create our own in R.
> x1 <- c(1, 4, 4, 2, 4, 4, 4, 5, 4, 3, 4, 4, 3, 4, 3)
> x2 <- c(4, 4, 3, 3, 4, 4, 4, 5, 3, 3, 3, 3, 3, 3, 4)
> x3 <- c(4, 3, 4, 4, 4, 3, 5, 5, 4, 4, 4, 4, 3, 3, 4)
> x4 <- c(4, 4, 4, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3, 3, 2)
⋮
> x12 <- c(3, 2, 1, 2, 3, 2, 3, 2, 2, 3, 1, 2, 2, 2, 1)
> y <- c(4, 4, 3, 3, 4, 4, 4, 5, 4, 3, 4, 3, 3, 3, 4)
> rating <- data.frame(x1, x2, x3, x4, …, x12, y)
First, let’s see how we perform a multiple regression analysis. The functions used are the ones we already know:
> rating_model <- lm(y~x1+x2+x3+x4+…+x12, data=rating)
> summary(rating_model)

Call:
lm(formula = y~x1+x2+x3+x4+…+x12, data=rating)

Residuals:
       1        2        3        4        5        6
 0.01552 -0.10636 -0.01592 -0.04003  0.14890 -0.02140
       7        8        9       10       11       12
-0.04565  0.01751 -0.06493  0.05061  0.21131 -0.22315
      13       14       15
 0.02642  0.07319 -0.02603
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.40784    0.84199  -0.484    0.676
x1           0.26856    0.19360   1.387    0.300
x2           0.01166    0.31473   0.037    0.974
x3           0.31028    0.21674   1.432    0.289
x4           0.02993    0.43669   0.069    0.952
x5          -0.17622    0.16670  -1.057    0.401
x6           0.20136    0.42008   0.479    0.679
x7           0.05440    0.14016   0.388    0.735
x8           0.09736    0.24867   0.392    0.733
x9           0.17106    0.14630   1.169    0.363
x10          0.27376    0.19890   1.376    0.303
x11          0.10341    0.32860   0.315    0.783
x12          0.00783    0.38118   0.021    0.985
Residual standard error: 0.2705 on 2 degrees of freedom
Multiple R-squared: 0.9726, Adjusted R-squared: 0.8079
F-statistic: 5.906 on 12 and 2 DF, p-value: 0.1538
Our model is obtained as follows:
> rating_model

Call:
lm(formula = y~x1+x2+x3+x4+…+x12, data=rating)

Coefficients:
(Intercept)          x1          x2          x3          x4          x5
   -0.40784     0.26856     0.01166     0.31028     0.02993    -0.17622
         x6          x7          x8          x9         x10         x11
    0.20136     0.05440     0.09736     0.17106     0.27376     0.10341
        x12
    0.00783
There are different ways a stepwise regression can be performed in R. Here, we demonstrate a semi-automated procedure that uses the p-value as the selection criterion. Unlike other software packages, R requires us to select which variable will be included or excluded at each step. First, we create a model that contains only the intercept (denoted "1" in R) and none of the independent variables:
> rating_none <- lm(y~1, data=rating)
Then, using the add1() or drop1() functions, we can add or remove single terms from the model. This is done as follows:
> add1(rating_none, formula(rating_model), test="F")
Single term additions

Model:
y ~ 1
       Df Sum of Sq    RSS     AIC F value    Pr(>F)
<none>              5.3333 -13.511
x1      1    0.5178 4.8155 -13.043  1.3978  0.258258
x2      1    3.7984 1.5349 -30.194 32.1717 7.643e-05 ***
x3      1    0.9496 4.3837 -14.452  2.8161  0.117186
x4      1    0.1786 5.1548 -12.022  0.4503  0.513918
x5      1    0.2976 5.0357 -12.372  0.7683  0.396645
x6      1    2.7083 2.6250 -22.145 13.4127  0.002869 **
x7      1    0.1190 5.2143 -11.850  0.2968  0.595116
x8      1    2.8161 2.5172 -22.773 14.5434  0.002151 **
x9      1    0.3592 4.9741 -12.557  0.9388  0.350278
x10     1    2.9207 2.4126 -23.410 15.7378  0.001609 **
x11     1    3.9062 1.4271 -31.286 35.5839 4.705e-05 ***
x12     1    0.0160 5.3173 -11.556  0.0392  0.846154
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Next, we select the variable with the smallest p-value (in this case, X11) and introduce it into our intercept-only model:
> rating_best <- lm(y~1+x11, data=rating)
> add1(rating_best, formula(rating_model), test="F")
Single term additions

Model:
y ~ 1 + x11
       Df Sum of Sq     RSS     AIC F value  Pr(>F)
<none>              1.42708 -31.286
x1      1   0.15052 1.27656 -30.958  1.4149 0.25724
x2      1   0.47429 0.95279 -35.346  5.9735 0.03093 *
x3      1   0.22005 1.20703 -31.798  2.1877 0.16488
x4      1   0.10665 1.32043 -30.451  0.9693 0.34430
x5      1   0.00125 1.42584 -29.299  0.0105 0.92013
x6      1   0.02708 1.40000 -29.574  0.2321 0.63861
x7      1   0.11905 1.30804 -30.593  1.0922 0.31659
x8      1   0.68192 0.74517 -39.033 10.9814 0.00618 **
x9      1   0.04419 1.38289 -29.758  0.3835 0.54732
x10     1   0.05887 1.36821 -29.918  0.5164 0.48616
x12     1   0.00453 1.42256 -29.334  0.0382 0.84834
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We keep doing this until there are no significant variables left:
> rating_best <- lm(y~1+x11+x8, data=rating)
> add1(rating_best, formula(rating_model), test="F")
Single term additions

Model:
y ~ 1 + x11 + x8
       Df Sum of Sq     RSS     AIC F value Pr(>F)
<none>              0.74517 -39.033
x1      1  0.011724 0.73344 -37.271  0.1758 0.6831
x2      1  0.156982 0.58818 -40.581  2.9358 0.1146
x3      1  0.072753 0.67241 -38.574  1.1902 0.2986
x4      1  0.024748 0.72042 -37.540  0.3779 0.5512
x5      1  0.012667 0.73250 -37.290  0.1902 0.6712
x6      1  0.020492 0.72468 -37.451  0.3110 0.5882
x7      1  0.001921 0.74325 -37.072  0.0284 0.8691
x9      1  0.007752 0.73742 -37.190  0.1156 0.7402
x10     1  0.049515 0.69565 -38.064  0.7830 0.3952
x12     1  0.009649 0.73552 -37.228  0.1443 0.7113
Since all the remaining variables are non-significant, we stop the stepwise process and, using X8 and X11, fit our final model:
> rating_final <- lm(y~x8+x11, data=rating)
> rating_final

Call:
lm(formula = y~x8+x11, data=rating)

Coefficients:
(Intercept)           x8          x11
     0.9209       0.3392       0.6011
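The fitted equation is thus Ŷ = 0.9209 + 0.3392·X8 + 0.6011·X11. As a quick sketch of how the final model would be used for prediction (the new ratings below are hypothetical values chosen for illustration):

```r
# Predicted overall rating for a hypothetical faculty member
# who scored 4 on question 8 and 4 on question 11.
new_faculty <- data.frame(x8 = 4, x11 = 4)
predict(rating_final, newdata = new_faculty)
# By hand, with the rounded coefficients: 0.9209 + 0.3392*4 + 0.6011*4 ≈ 4.68
```

predict() uses the full-precision coefficients stored in rating_final, so its answer may differ very slightly from the hand calculation.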
Copyright information
© 2018 Springer International Publishing AG
Cite this chapter
Berger, P.D., Maurer, R.E., Celli, G.B. (2018). Multiple Linear Regression. In: Experimental Design. Springer, Cham. https://doi.org/10.1007/978-3-319-64583-4_15
DOI: https://doi.org/10.1007/978-3-319-64583-4_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64582-7
Online ISBN: 978-3-319-64583-4
eBook Packages: Mathematics and Statistics (R0)