Abstract
Fitting covariate and response data to a line is referred to as linear regression. In this chapter, we first introduce the least squares method for a single covariate (single regression) and then extend it to multiple covariates (multiple regression). Then, based on the statistical notion of estimating parameters from data, we find the distribution of the coefficients (estimates) obtained via the least squares method. From this distribution, we derive confidence intervals for the estimates and a test of whether each of the true coefficients is zero. Moreover, we present a method for finding redundant covariates that may be removed. Finally, we consider obtaining a confidence interval for the response of new data outside of the data set used for the estimation. Linear regression underlies many other problems and plays a significant role in machine learning.
Notes
 1.
We often refer to this matrix as the hat matrix.
 2.
We write \(\hat{\xi }-\gamma \le {\xi }\le \hat{\xi }+\gamma \) as \(\xi =\hat{\xi }\pm \gamma \), where \(\hat{\xi }\) is an unbiased estimator of \(\xi \).
Appendices
Appendix: Proof of Propositions
Proposition 12
Two Gaussian random variables are independent if and only if their covariance is zero.
Proof
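Let \(X\sim N(\mu _X,\sigma _X^2)\) and \(Y\sim N(\mu _Y,\sigma _Y^2)\), and let \(E[\cdot ]\) be the expectation operation. If we let
$$\rho :=\frac{E[(X-\mu _X)(Y-\mu _Y)]}{\sqrt{E[(X-\mu _X)^2]\,E[(Y-\mu _Y)^2]}} \tag{2.23}$$
and define the independence of X and Y by the property \(f_X(x)f_Y(y)=f_{XY}(x,y)\) for all \(x,y\in {\mathbb R}\), where
$$f_X(x)=\frac{1}{\sqrt{2\pi }\sigma _X}\exp \left\{ -\frac{(x-\mu _X)^2}{2\sigma _X^2}\right\} ,\quad f_Y(y)=\frac{1}{\sqrt{2\pi }\sigma _Y}\exp \left\{ -\frac{(y-\mu _Y)^2}{2\sigma _Y^2}\right\} ,$$
$$f_{XY}(x,y)=\frac{1}{2\pi \sigma _X\sigma _Y\sqrt{1-\rho ^2}}\exp \left\{ -\frac{1}{2(1-\rho ^2)}\left[ \left( \frac{x-\mu _X}{\sigma _X}\right) ^2-2\rho \frac{(x-\mu _X)(y-\mu _Y)}{\sigma _X\sigma _Y}+\left( \frac{y-\mu _Y}{\sigma _Y}\right) ^2\right] \right\} ,$$
then \(\rho =0 \Longrightarrow f_{XY}(x,y)=f_X(x)f_Y(y)\). On the other hand, if \(f_{XY}(x,y)=f_X(x)f_Y(y)\), then we can write the numerator of \(\rho \) in (2.23) as follows:
$$E[(X-\mu _X)(Y-\mu _Y)]=\int _{-\infty }^\infty \int _{-\infty }^\infty (x-\mu _X)(y-\mu _Y)f_{XY}(x,y)\,dx\,dy=\int _{-\infty }^\infty (x-\mu _X)f_X(x)\,dx\int _{-\infty }^\infty (y-\mu _Y)f_Y(y)\,dy=0,$$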
which means that \(\rho =0 \Longleftarrow f_{XY}(x,y)=f_X(x)f_Y(y)\).
Proposition 13
The eigenvalues of H and \(I-H\) are only zeros and ones, and the dimensions of the eigenspaces of H and \(I-H\) with eigenvalues one and zero, respectively, are both \(p+1\), while the dimensions of the eigenspaces of H and \(I-H\) with eigenvalues of zero and one, respectively, are both \(N-p-1\).
Proof
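Using Proposition 4, from \(H=X(X^TX)^{-1}X^T\) and \({{\,\mathrm{rank}\,}}(X)=p+1\), we have
$${{\,\mathrm{rank}\,}}(H)\le \min \{ {{\,\mathrm{rank}\,}}(X(X^TX)^{-1}),{{\,\mathrm{rank}\,}}(X^T)\} \le {{\,\mathrm{rank}\,}}(X)=p+1\ .$$
On the other hand, from Proposition 4 and \(HX=X,\ {{\,\mathrm{rank}\,}}(X)=p+1\), we have
$${{\,\mathrm{rank}\,}}(HX)\le \min \{ {{\,\mathrm{rank}\,}}(H),{{\,\mathrm{rank}\,}}(X)\} \Longrightarrow {{\,\mathrm{rank}\,}}(H)\ge {{\,\mathrm{rank}\,}}(HX)={{\,\mathrm{rank}\,}}(X)=p+1\ .$$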
Therefore, we have \({{\,\mathrm{rank}\,}}(H)=p+1\). Moreover, from \(HX=X\), the columns of X form a basis of the image of H and are eigenvectors of H for an eigenvalue of one. Since the dimension of the image of H is \(p+1\), the dimension of the kernel is \(N-p-1\) (the eigenspace of an eigenvalue of zero). Moreover, for an arbitrary \(x\in {\mathbb R}^{N}\), we have \((I-H)x=0\Longleftrightarrow Hx=x\) and \((I-H)x=x\Longleftrightarrow Hx=0\), which means that the eigenspaces of H for eigenvalues of zero and one are the same as the eigenspaces of \(I-H\) for eigenvalues of one and zero, respectively.
Exercises 1–18

1.
For given \(x_1, \ldots ,x_N, y_1, \ldots ,y_N\in {\mathbb R}\), let \(\hat{\beta }_0,\hat{\beta }_1\) be the \(\beta _0,\beta _1\in {\mathbb R}\) that minimize \(L:=\displaystyle \sum _{i=1}^N(y_i-\beta _0-\beta _1x_i)^2\). Show the following equations, where \(\bar{x}\) and \(\bar{y}\) are defined by \(\displaystyle \frac{1}{N}\sum _{i=1}^Nx_i\) and \(\displaystyle \frac{1}{N}\sum _{i=1}^Ny_i\).

(a)
\(\hat{\beta }_0+\hat{\beta }_1\bar{x}=\bar{y}\)

(b)
Unless \(x_1= \cdots =x_N\),
$$ \hat{\beta }_1 =\frac{\displaystyle \sum _{i=1}^N(x_i-\bar{x})(y_i-\bar{y})}{\displaystyle \sum _{i=1}^N (x_i-\bar{x})^2} $$
Hint: Item (a) is obtained from \(\displaystyle \frac{\partial L}{\partial \beta _0}=0\). For (b), substitute (a) into \(\displaystyle \frac{\partial L}{\partial \beta _1}=-2\sum _{i=1}^Nx_i(y_i-\beta _0-\beta _1x_i)=0\) and eliminate \(\beta _0\). Then, solve it w.r.t. \({\beta }_1\) first and obtain \({\beta }_0\) later.
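The two equations of Problem 1 can be checked numerically. The following is a minimal sketch in R; the helper name `simple_ls` and the five data points are illustrative assumptions, not part of the exercise.

```r
# Closed-form least squares for a single covariate.
# simple_ls is a hypothetical helper name chosen for this sketch.
simple_ls <- function(x, y) {
  x_bar <- mean(x)
  y_bar <- mean(y)
  # Slope from Problem 1(b); requires that the x_i are not all equal.
  beta_1 <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
  # Intercept from Problem 1(a): beta_0 + beta_1 * x_bar = y_bar.
  beta_0 <- y_bar - beta_1 * x_bar
  c(beta_0 = beta_0, beta_1 = beta_1)
}

# Made-up data for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- simple_ls(x, y)
print(fit)
print(coef(lm(y ~ x)))  # built-in least squares, for comparison
```

Both printed vectors should agree up to rounding, since `lm` minimizes the same sum of squares.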

2.
We consider the line l with the intercept \(\hat{\beta }_0\) and slope \(\hat{\beta }_1\) obtained in Problem 1. Find the intercept and slope of the shifted line \(l'\) from the data \(x_1-\bar{x}, \ldots ,x_N-\bar{x}\), and \(y_1-\bar{y}, \ldots ,y_N-\bar{y}\). How do we obtain the intercept and slope of l from those of the shifted line \(l'\)?
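A numerical experiment can suggest the answer before it is derived. The sketch below uses simulated data (the seed, sample size, and true coefficients are arbitrary assumptions) and fits both the original and the centered data with `lm`.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
N <- 50
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N)        # simulated data; coefficients are assumptions

fit <- coef(lm(y ~ x))           # line l

xc <- x - mean(x)                # centered covariate
yc <- y - mean(y)                # centered response
fit_c <- coef(lm(yc ~ xc))       # shifted line l'

print(fit)
print(fit_c)

# Intercept of l recovered from the slope of l' and the sample means.
beta_0_rec <- mean(y) - fit_c[2] * mean(x)
print(unname(beta_0_rec))
```

In this experiment the slope is unchanged by centering and the intercept of \(l'\) vanishes up to rounding, so l is recovered by shifting \(l'\) back by \((\bar{x},\bar{y})\).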

3.
We wish to visualize the relation between the lines \(l,l'\) in Problem 2. Fill Blanks (1) and (2) below and draw the graph.

4.
Let m, n be positive integers. Suppose that the matrix \(A\in {\mathbb R}^{m\times m}\) can be written by \(A=B^TB\) for some \(B\in {\mathbb R}^{n\times m}\).

(a)
Show that \(Az=0\Longleftrightarrow Bz=0\) for arbitrary \(z\in {\mathbb R}^m\). Hint: Use \(Az=0\Longrightarrow z^TB^TBz=0 \Longrightarrow \Vert Bz\Vert ^2=0\).

(b)
Show that the ranks of A and B are equal. Hint: Because the kernels of A and B are equal, so are the dimensions (ranks) of the images.

In the following, the leftmost column of \(X\in {\mathbb R}^{N\times (p+1)}\) consists of all ones.

5.
For each of the following cases, show that \(X^TX\) is not invertible.

(a)
\(N<p+1\)

(b)
\(N\ge p+1\) and different columns are equal in X
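Both cases can be illustrated with small matrices. In the sketch below, the matrices `X_a` and `X_b` and the helper `rank_of` are made-up examples; the numerical rank is taken from R's QR decomposition.

```r
# Numerical rank via the QR decomposition (rank_of is a hypothetical helper).
rank_of <- function(A) qr(A)$rank

# Case (a): N < p + 1 (here N = 2 rows and p + 1 = 3 columns).
X_a <- cbind(1, c(0.5, -1.2), c(2.0, 0.3))

# Case (b): N >= p + 1, but the second and third columns are equal.
X_b <- cbind(1, c(1, 2, 3, 4), c(1, 2, 3, 4))

rank_a <- rank_of(t(X_a) %*% X_a)
rank_b <- rank_of(t(X_b) %*% X_b)
print(c(rank_a, rank_b))  # both are below p + 1 = 3
```

A \(3\times 3\) matrix of rank below three is singular, which agrees with Problem 4(b): the rank of \(X^TX\) equals the rank of X.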

In the following, the rank of \(X\in {\mathbb R}^{N\times (p+1)}\) is \(p+1\).

6.
We wish to obtain \(\beta \in {\mathbb R}^{p+1}\) that minimizes \(L:=\Vert y-X\beta \Vert ^2\) from \(X\in {\mathbb R}^{N\times (p+1)}\), \(y\in {\mathbb R}^N\), where \(\Vert \cdot \Vert \) denotes \(\displaystyle \sqrt{\sum _{i=1}^Nz_i^2}\) for \(z=[z_1,\ldots ,z_N]^T\).

(a)
Let \(x_{i,j}\) be the (i, j)th element of X. Show that the partial derivative of \(\displaystyle \frac{L}{2}=\frac{1}{2}\sum _{i=1}^N\left( y_i-\sum _{j=0}^px_{i,j}\beta _j\right) ^2\) w.r.t. \(\beta _j\) is the jth element of \(-X^Ty+X^TX\beta \).
Hint: The jth element of \(X^Ty\) is \(\displaystyle \sum _{i=1}^Nx_{i,j}y_i\), the (j, k)th element of \(X^TX\) is \(\displaystyle \sum _{i=1}^Nx_{i,j}x_{i,k}\), and the jth element of \(X^TX\beta \) is \(\displaystyle \sum _{k=0}^p\sum _{i=1}^Nx_{i,j}x_{i,k}\beta _k\).

(b)
Find \(\beta \in {\mathbb R}^{p+1}\) such that \(\displaystyle \frac{\partial L}{\partial \beta }=0\). In the sequel, we denote this value by \(\hat{\beta }\).
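The stationarity condition of Problem 6 can be solved numerically as a linear system. The sketch below assumes simulated data (seed, sizes, and true coefficients are arbitrary) and solves \(X^TX\beta =X^Ty\) with `solve`, which is valid when the rank of X is \(p+1\).

```r
set.seed(1)                                    # arbitrary seed
N <- 100
p <- 2
X <- cbind(1, matrix(rnorm(N * p), ncol = p))  # leftmost column of ones
beta <- c(1, 2, -1)                            # assumed true coefficients
y <- as.vector(X %*% beta + rnorm(N))

# Solve the linear system X^T X b = X^T y for b.
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Gradient expression from 6(a), evaluated at beta_hat; it should vanish.
grad <- -t(X) %*% y + t(X) %*% X %*% beta_hat

print(as.vector(beta_hat))
print(max(abs(grad)))
```

Solving the system directly avoids forming \((X^TX)^{-1}\) explicitly, which is a design choice of this sketch rather than something the exercise requires.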

7.
Suppose that the random variable \(\hat{\beta }\) is obtained via the procedure in Problem 6, where we assume that \(X\in {\mathbb R}^{N\times (p+1)}\) is given and \(y\in {\mathbb R}^N\) is generated by \(X\beta +\epsilon \) with unknown constants \(\beta \in {\mathbb R}^{p+1}\) and \(\sigma ^2>0\) and random variable \(\epsilon \sim N(0,\sigma ^2 I)\).

(a)
Show \(\hat{\beta }=\beta +(X^TX)^{-1}X^T{\epsilon }\).

(b)
Show that the average of \(\hat{\beta }\) coincides with \(\beta \), i.e., \(\hat{\beta }\) is an unbiased estimator.

(c)
Show that the covariance matrix of \(\hat{\beta }\) is \(E(\hat{\beta }-\beta )(\hat{\beta }-\beta )^T=\sigma ^2(X^TX)^{-1}\).

8.
Let \(H:=X(X^TX)^{-1}X^T\in {\mathbb R}^{N\times N}\) and \(\hat{y}:=X\hat{\beta }\). Show the following equations.

(a)
\(H^2=H\)

(b)
\((I-H)^2=I-H\)

(c)
\(HX=X\)

(d)
\(\hat{y}=Hy\)

(e)
\(y-\hat{y}=(I-H)\epsilon \)

(f)
\(\Vert y-\hat{y}\Vert ^2=\epsilon ^T(I-H)\epsilon \)
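The six identities of Problem 8 can be verified numerically on simulated data, where the noise \(\epsilon \) is known. In the sketch below, the helper `close_to`, the seed, the sizes, and the true coefficients are assumptions made for illustration.

```r
set.seed(1)                                    # arbitrary seed
N <- 20
p <- 3
X <- cbind(1, matrix(rnorm(N * p), ncol = p))
eps <- rnorm(N)                                # noise, known in a simulation
y <- as.vector(X %*% c(1, 2, 3, 4)) + eps      # assumed true coefficients

H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix
I <- diag(N)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
y_hat <- as.vector(X %*% beta_hat)

# TRUE when two arrays agree up to rounding (hypothetical helper).
close_to <- function(A, B) max(abs(A - B)) < 1e-8

checks <- c(
  a = close_to(H %*% H, H),
  b = close_to((I - H) %*% (I - H), I - H),
  c = close_to(H %*% X, X),
  d = close_to(y_hat, H %*% y),
  e = close_to(y - y_hat, (I - H) %*% eps),
  f = close_to(sum((y - y_hat)^2), t(eps) %*% (I - H) %*% eps)
)
print(checks)
```

A numerical check is not a proof, but a failing entry would point to an algebra mistake.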

9.
Prove the following statements.

(a)
The dimension of the image (the rank) of H is \(p+1\). Hint: We assume that the rank of X is \(p+1\).

(b)
H has eigenspaces of eigenvalues of zero and one, and their dimensions are \(N-p-1\) and \(p+1\), respectively. Hint: The number of columns N in H is the sum of the dimensions of the image and kernel.

(c)
\(I-H\) has eigenspaces of eigenvalues of zero and one, and their dimensions are \(p+1\) and \(N-p-1\), respectively. Hint: For an arbitrary \(x\in {\mathbb R}^{N}\), we have \((I-H)x=0\Longleftrightarrow Hx=x\) and \((I-H)x=x\Longleftrightarrow Hx=0\).
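The eigenvalue counts in Problem 9 can be inspected with `eigen`. The sketch below uses a small simulated design matrix (seed and sizes are arbitrary assumptions) and counts eigenvalues numerically close to zero and one.

```r
set.seed(1)                                    # arbitrary seed
N <- 10
p <- 2
X <- cbind(1, matrix(rnorm(N * p), ncol = p))
H <- X %*% solve(t(X) %*% X) %*% t(X)

# H and I - H are symmetric, so symmetric = TRUE gives real eigenvalues.
ev_H  <- eigen(H, symmetric = TRUE)$values
ev_IH <- eigen(diag(N) - H, symmetric = TRUE)$values

tol <- 1e-8
counts <- c(
  H_one   = sum(abs(ev_H - 1) < tol),
  H_zero  = sum(abs(ev_H) < tol),
  IH_one  = sum(abs(ev_IH - 1) < tol),
  IH_zero = sum(abs(ev_IH) < tol)
)
print(counts)          # expected pattern: p + 1, N - p - 1, N - p - 1, p + 1
print(sum(diag(H)))    # the trace of H equals its rank
```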

10.
Using the fact that \(P(I-H)P^T\) becomes a diagonal matrix such that the first \(N-p-1\) and last \(p+1\) diagonal elements are ones and zeros, respectively, for an orthogonal P, show the following.

(a)
\(RSS:=\epsilon ^T(I-H)\epsilon =\sum _{i=1}^{N-p-1}v_i^2\), where \(v:=P\epsilon \). Hint: Because P is orthogonal, we have \(P^TP=I\). Substitute \(\epsilon =P^{-1}v=P^Tv\) into the definition of RSS and find that the diagonal elements of \(P(I-H)P^T\) are the N eigenvalues. In particular, \(I-H\) has \(p+1\) and \(N-p-1\) eigenvalues of zero and one, respectively.

(b)
\(Evv^T=\sigma ^2\tilde{I}\). Hint: Use \(Evv^T=P(E\epsilon \epsilon ^T)P^T\).

(c)
\(RSS/\sigma ^2\sim \chi ^2_{N-p-1}\)
(\(\chi ^2\) distribution with \(N-p-1\) degrees of freedom). Hint: Find the statistical properties from (a) and (b).
Use the fact that the independence of Gaussian random variables is equivalent to their covariance matrix being diagonal, without proving it.
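A Monte Carlo experiment gives a rough sanity check of the claim in (c). The sketch below assumes arbitrary values for the seed, N, p, \(\sigma \), and the number of repetitions r, and compares the sample mean and variance of \(RSS/\sigma ^2\) with those of a \(\chi ^2\) distribution with \(N-p-1\) degrees of freedom (mean \(N-p-1\), variance \(2(N-p-1)\)).

```r
set.seed(1)                                    # arbitrary seed
N <- 30
p <- 2
sigma <- 2                                     # assumed noise level
r <- 2000                                      # number of repetitions

X <- cbind(1, matrix(rnorm(N * p), ncol = p))
H <- X %*% solve(t(X) %*% X) %*% t(X)
M <- diag(N) - H                               # I - H

# RSS = eps^T (I - H) eps = ||(I - H) eps||^2 because I - H is idempotent.
stat <- replicate(r, {
  eps <- rnorm(N, sd = sigma)
  sum((M %*% eps)^2) / sigma^2
})

df <- N - p - 1
print(c(sample_mean = mean(stat), chisq_mean = df))
print(c(sample_var = var(stat), chisq_var = 2 * df))
```

The sample moments only approximate the theoretical ones; agreement within Monte Carlo error is what the experiment can show.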

11.

(a)
Show that \(E(\hat{\beta }-\beta )(y-\hat{y})^T=0\). Hint: Use \((\hat{\beta }-\beta )(y-\hat{y})^T=(X^TX)^{-1}X^T\epsilon \epsilon ^T(I-H)\) and \(E\epsilon \epsilon ^T=\sigma ^2 I\).

(b)
Let \(B_0, \ldots ,B_p\) be the diagonal elements of \((X^TX)^{-1}\). Show that \((\hat{\beta }_i-\beta _i)/(\sqrt{B_i}\sigma )\) and \(RSS/\sigma ^2\) are independent for \(i=0,1, \ldots ,p\). Hint: Since RSS is a function of \(y-\hat{y}\), the problem reduces to independence between \(y-\hat{y}\) and \(\hat{\beta }-\beta \). Because they are Gaussian, it is sufficient to show that the covariance is zero.

(c)
Let \(\displaystyle \hat{\sigma }:=\sqrt{\frac{RSS}{N-p-1}}\) (the residual standard error, an estimate of \(\sigma \)), and \(SE(\hat{\beta }_i):=\hat{\sigma } \sqrt{B_i} \) (an estimate of the standard error of \(\hat{\beta }_i\)). Show that
$$\displaystyle \frac{\hat{\beta }_i-\beta _i}{SE(\hat{\beta }_i)}\sim t_{N-p-1},\ i=0,1, \ldots ,p$$
(t distribution with \(N-p-1\) degrees of freedom). Hint: Derive
$$\frac{\hat{\beta }_i-\beta _i}{SE(\hat{\beta }_i)}=\frac{\hat{\beta }_i-\beta _i}{\sigma \sqrt{B_i}}\bigg /\sqrt{\frac{RSS}{\sigma ^2}\bigg /(N-p-1)}$$
and show that the right-hand side follows a t distribution.

(d)
When \(p=1\), find \(B_0\) and \(B_1\), letting \((x_{1,1}, \ldots ,x_{N,1})=(x_1, \ldots ,x_N)\). Hint: Derive
$$(X^TX)^{-1}= \frac{1}{\displaystyle \sum _{i=1}^N (x_i-\bar{x})^2} \left[ \begin{array}{cc} \displaystyle \frac{1}{N}\sum _{i=1}^Nx_i^2& -\bar{x}\\ -\bar{x}& 1 \end{array}\right] $$
Use the fact that independence of Gaussian random variables \(U_1, \ldots ,U_m\), \(V_1, \ldots ,V_n\) is equivalent to their covariance matrix of size \(m\times n\) being a zero matrix, without proving it.

12.
We wish to test the null hypothesis \(H_0: \beta _i=0\) versus its alternative \(H_1: \beta _i\not =0\). For \(p=1\), we construct the following procedure using the fact that under \(H_0\),
$$t=\frac{\hat{\beta }_i-0}{SE(\hat{\beta }_i)}\sim t_{N-p-1}\ ,$$
where the function pt(x,m) returns the value of \(\displaystyle \int _{-\infty }^{x} f_m(t)dt\) by default (the upper tail \(\displaystyle \int _{x}^\infty f_m(t)dt\) is obtained with lower.tail=FALSE), and \(f_m\) is the probability density function of a t distribution with m degrees of freedom.
Examine the outputs using the \(\texttt {lm}\) function in the R language.
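One way to carry out such a test for \(p=1\) is sketched below. It is an illustration under assumed simulated data (generated with \(\beta _1=0\)), not necessarily the procedure the exercise refers to; the variable names are chosen for this sketch, and \(B_0,B_1\) follow the formula in Problem 11(d).

```r
set.seed(1)                       # arbitrary seed
N <- 100
x <- rnorm(N)
y <- rnorm(N)                     # simulated under H0: beta_1 = 0

x_bar <- mean(x)
y_bar <- mean(y)
beta_1 <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
beta_0 <- y_bar - beta_1 * x_bar

RSS <- sum((y - beta_0 - beta_1 * x)^2)
sigma_hat <- sqrt(RSS / (N - 1 - 1))          # p = 1

# Diagonal elements of (X^T X)^{-1} for p = 1 (Problem 11(d)).
B_0 <- (sum(x^2) / N) / sum((x - x_bar)^2)
B_1 <- 1 / sum((x - x_bar)^2)

se <- sigma_hat * sqrt(c(B_0, B_1))
t_val <- c(beta_0, beta_1) / se
# Two-sided p-value: twice the upper tail beyond |t|.
p_val <- 2 * pt(abs(t_val), df = N - 2, lower.tail = FALSE)

print(cbind(estimate = c(beta_0, beta_1), se = se, t = t_val, p = p_val))
tab <- summary(lm(y ~ x))$coefficients        # same quantities from lm
print(tab)
```

The four columns computed by hand should match the coefficient table reported by `summary(lm(y ~ x))`.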

13.
The following procedure repeats estimating \(\hat{\beta }_1\) 1000 times (\(r=1000\)) and draws a histogram of \(\hat{\beta }_1/SE(\hat{\beta }_1)\), where beta.1/se.1 is computed each time from the data and accumulated in the vector T of size r.
Replace y=rnorm(N) with y=0.1*x+rnorm(N) and execute it. Furthermore, explain the difference between the two graphs.
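An experiment of this kind might look like the following sketch. It is an assumption-laden illustration rather than the procedure the exercise refers to: the helper `t_sim`, the seed, and the use of `summary(lm(...))` to obtain the t value are choices made here, and the vectors are named `T0` and `T1` to avoid masking R's `T`.

```r
set.seed(1)                       # arbitrary seed
N <- 100
r <- 1000

# Repeat the fit r times and collect beta_1 / SE(beta_1) each time
# (t_sim is a hypothetical helper; slope is the true beta_1).
t_sim <- function(slope) {
  replicate(r, {
    x <- rnorm(N)
    y <- slope * x + rnorm(N)
    summary(lm(y ~ x))$coefficients["x", "t value"]
  })
}

T0 <- t_sim(0)                    # y = rnorm(N)
T1 <- t_sim(0.1)                  # y = 0.1 * x + rnorm(N)

# Histograms with the t density (N - 2 degrees of freedom) overlaid.
hist(T0, breaks = 30, freq = FALSE, main = "true slope 0", xlab = "t")
curve(dt(x, df = N - 2), add = TRUE, col = "blue")
hist(T1, breaks = 30, freq = FALSE, main = "true slope 0.1", xlab = "t")
curve(dt(x, df = N - 2), add = TRUE, col = "blue")

print(c(mean_T0 = mean(T0), mean_T1 = mean(T1)))
```

With a true slope of zero the histogram should follow the overlaid density, while a nonzero slope shifts the histogram away from it, because the statistic is then no longer centered at zero.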

14.
Suppose that each element of \(W\in {\mathbb R}^{N\times N}\) is 1/N, thus \(\displaystyle \bar{y}=\frac{1}{N}\sum _{i=1}^N y_i=Wy\) for \(y=[y_1,\ldots ,y_N]^T\).

(a)
Show that \(HW=W\) and \((I-H)(H-W)=0\). Hint: Because each column of W is an eigenvector of eigenvalue one in H, we have \(HW=W\).

(b)
Show that \(ESS:=\Vert \hat{y}-\bar{y}\Vert ^2=\Vert (H-W)y\Vert ^2\) and \(TSS:=\Vert {y}-\bar{y}\Vert ^2=\Vert (I-W)y\Vert ^2\).

(c)
Show that \(RSS=\Vert (I-H)\epsilon \Vert ^2=\Vert (I-H)y\Vert ^2\) and ESS are independent. Hint: The covariance matrix of \((I-H)\epsilon \) and \((H-W)y\) is that of \((I-H)\epsilon \) and \((H-W)\epsilon \). Evaluate the covariance matrix \(E(I-H)\epsilon \epsilon ^T(H-W)\). Then, use (a).

(d)
Show that \(\Vert (I-W)y\Vert ^2=\Vert (I-H)y\Vert ^2+\Vert (H-W)y\Vert ^2\), i.e., \(TSS=RSS+ESS\). Hint: \((I-W)y=(I-H)y+(H-W)y\).
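The identities in (a) and the decomposition in (d) can be confirmed numerically. The sketch below assumes simulated data with an arbitrary seed, sizes, and true coefficients.

```r
set.seed(1)                                    # arbitrary seed
N <- 50
p <- 2
X <- cbind(1, matrix(rnorm(N * p), ncol = p))  # leftmost column of ones
y <- as.vector(X %*% c(1, 2, -1) + rnorm(N))   # assumed true coefficients

H <- X %*% solve(t(X) %*% X) %*% t(X)
W <- matrix(1 / N, N, N)                       # every element is 1/N
I <- diag(N)

RSS <- sum(((I - H) %*% y)^2)
ESS <- sum(((H - W) %*% y)^2)
TSS <- sum(((I - W) %*% y)^2)

print(c(TSS = TSS, RSS_plus_ESS = RSS + ESS))
print(max(abs(H %*% W - W)))                   # HW = W up to rounding
print(max(abs((I - H) %*% (H - W))))           # (I - H)(H - W) = 0 up to rounding
```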

In the following, we assume that \(X\in {\mathbb R}^{N\times p}\) does not contain a vector of size N of all ones in the leftmost column.

15.
Given \(X\in {\mathbb R}^{N\times p}\) and \(y \in {\mathbb R}^{N}\), we refer to
$$R^2=\frac{ESS}{TSS}=1-\frac{RSS}{TSS}$$
as the coefficient of determination. For \(p=1\), suppose that we are given \(x=[x_1, \ldots ,x_N]^T\).

(a)
Show that \(\hat{y}-\bar{y}=\hat{\beta }_1(x-\bar{x})\). Hint: Use \(\hat{y}_i=\hat{\beta }_0+\hat{\beta }_1 {x}_i\) and Problem 1(a).

(b)
Show that \(\displaystyle R^2=\frac{\hat{\beta }_1^2\Vert x-\bar{x}\Vert ^2}{\Vert y-\bar{y}\Vert ^2}\).

(c)
For \(p=1\), show that the value of \(R^2\) coincides with the square of the correlation coefficient. Hint: Use \(\displaystyle \Vert x-\bar{x}\Vert ^2=\sum _{i=1}^N(x_i-\bar{x})^2\) and Problem 1(b).

(d)
The following function computes the coefficient of determination.
Let N=100 and m=1, and execute x=matrix(rnorm(m*N),ncol=m); y=rnorm(N); R2(x,y).
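A function with the interface `R2(x,y)` might be written as in the sketch below. The body is an assumption of this sketch (it prepends a column of ones to x and uses \(R^2=1-RSS/TSS\)); only the call in the last lines follows the exercise, and the seed is arbitrary.

```r
# Coefficient of determination of y regressed on the columns of x.
# The body is an illustrative assumption; x does not contain the column of ones.
R2 <- function(x, y) {
  y <- as.vector(y)
  X <- cbind(1, as.matrix(x))                  # prepend the column of ones
  beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # least squares estimate
  y_hat <- as.vector(X %*% beta_hat)
  RSS <- sum((y - y_hat)^2)
  TSS <- sum((y - mean(y))^2)
  1 - RSS / TSS
}

set.seed(1)                                    # arbitrary seed
N <- 100
m <- 1
x <- matrix(rnorm(m * N), ncol = m)
y <- rnorm(N)
print(R2(x, y))
print(as.numeric(cor(x, y))^2)                 # squared correlation, for m = 1
```

For m=1 the two printed values should agree, which is the statement of part (c).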

16.
The coefficient of determination expresses how well the covariates explain the response variable, and its maximum value is one. When we evaluate how redundant a covariate is given the other covariates, we often use VIFs (variance inflation factors)
$$VIF:=\frac{1}{1-R^2_{X_j|X_{-j}}}\ ,$$
where \(R^2_{X_j|X_{-j}}\) is the coefficient of determination of the jth covariate in \(X\in {\mathbb R}^{N\times p}\) given the other \(p-1\) covariates (\(y\in {\mathbb R}^N\) is not used). The larger the VIF value, the better the covariate is explained by the other covariates (the minimum value is one), which means that the collinearity is strong. Install the R package MASS and compute the VIF values for each variable in the Boston data set by filling the blank. (Simply execute the following).
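A VIF computation along these lines can be sketched as follows. This is an illustration, not the listing with the blank: it assumes the MASS package is installed, treats every column of `Boston` as a covariate, and obtains each \(R^2\) from `summary(lm(...))$r.squared`.

```r
library(MASS)                     # provides the Boston data set

X <- as.matrix(Boston)            # all columns treated as covariates (assumption)
p <- ncol(X)

# For each j, regress column j on the remaining columns and convert R^2 to VIF.
vif <- sapply(seq_len(p), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif) <- colnames(X)

print(round(vif, 3))
```

Covariates with a large value are well explained by the others and are candidates for removal.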

17.
We can compute the prediction value \(x_*\hat{\beta }\) for each \(x_*\in {\mathbb R}^{p+1}\) (the row vector whose first value is one), using the estimate \(\hat{\beta }\).

(a)
Show that the variance of \(x_*\hat{\beta }\) is \(\sigma ^2 x_*(X^TX)^{-1}x_*^T\). Hint: Use \(V(\hat{\beta })=\sigma ^2(X^TX)^{-1}\).

(b)
If we let \(SE(x_*\hat{\beta }):=\hat{\sigma }\sqrt{x_*(X^TX)^{-1}x_*^T}\), show that
$$\frac{x_*\hat{\beta }-x_*\beta }{SE(x_*\hat{\beta })}\sim t_{N-p-1}\ ,$$
where \(\displaystyle \hat{\sigma }=\sqrt{RSS/(N-p-1)}\).

(c)
The actual value of y can be expressed by \(y_*:=x_*\beta +\epsilon \). Thus, the variance of \(y_*-x_*\hat{\beta }\) is larger than that of \(x_*\hat{\beta }\) by \(\sigma ^2\). Show that
$$\frac{x_*\hat{\beta }-y_*}{\hat{\sigma }\sqrt{1+x_*(X^TX)^{-1}x_*^T}}\sim t_{N-p-1}\ .$$

18.
From Problem 17, we have
$$x_*^T\hat{\beta }\pm t_{N-p-1}(\alpha /2)\hat{\sigma } \sqrt{x_*^T(X^TX)^{-1}x_*}$$
$${y_*}\pm t_{N-p-1}(\alpha /2)\hat{\sigma } \sqrt{1+x_*^T(X^TX)^{-1}x_*}$$
(the confidence and prediction intervals, respectively), where f is the probability density function of the t distribution with \(N-p-1\) degrees of freedom and \(t_{N-p-1}(\alpha /2)\) is the value t such that \(\alpha /2=\int _{t}^\infty f(u)du\). Suppose that \(p=1\). We wish to draw the confidence and prediction intervals in red and blue, respectively, for \(x_* \in {\mathbb R}\). For the confidence interval, we expressed the upper and lower limits by red and blue solid lines, respectively, executing the procedure below. For the prediction interval, define the function g(x) and overlay the upper and lower dotted lines in red and blue on the same graph.
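Both bands can be drawn for \(p=1\) as in the sketch below. It is an illustration under assumptions, not the procedure the exercise refers to: the data are simulated, \(\alpha =0.05\), both intervals are centered at the fitted value, and the helpers `center` and `half_width` are names chosen here (the latter covers both cases through its `extra` argument, 0 for confidence and 1 for prediction).

```r
set.seed(1)                                    # arbitrary seed
N <- 100
p <- 1
x <- rnorm(N)
y <- 1 + x + rnorm(N)                          # simulated data (assumption)

X <- cbind(1, x)
XtX_inv <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y
RSS <- sum((y - X %*% beta_hat)^2)
sigma_hat <- sqrt(RSS / (N - p - 1))

alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = N - p - 1)    # upper alpha/2 point

# Fitted value at x_star (hypothetical helper).
center <- function(x_star) as.numeric(c(1, x_star) %*% beta_hat)

# Half-width of the band: extra = 0 (confidence), extra = 1 (prediction).
half_width <- function(x_star, extra) {
  xs <- c(1, x_star)
  t_crit * sigma_hat * sqrt(extra + as.numeric(t(xs) %*% XtX_inv %*% xs))
}

grid <- seq(-3, 3, length.out = 61)
ctr <- sapply(grid, center)
conf_hw <- sapply(grid, half_width, extra = 0)
pred_hw <- sapply(grid, half_width, extra = 1)

plot(x, y, xlim = c(-3, 3))
abline(beta_hat[1], beta_hat[2])
lines(grid, ctr + conf_hw, col = "red")            # confidence, upper
lines(grid, ctr - conf_hw, col = "blue")           # confidence, lower
lines(grid, ctr + pred_hw, col = "red", lty = 3)   # prediction, upper
lines(grid, ctr - pred_hw, col = "blue", lty = 3)  # prediction, lower
```

The prediction band is wider everywhere because of the additional 1 under the square root, and both bands are narrowest near \(\bar{x}\).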
© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Suzuki, J. (2020). Linear Regression. In: Statistical Learning with Math and R. Springer, Singapore. https://doi.org/10.1007/978-981-15-7568-6_2
DOI: https://doi.org/10.1007/978-981-15-7568-6_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7567-9
Online ISBN: 978-981-15-7568-6
eBook Packages: Computer Science (R0)