Appendix: Proof of Propositions
Proposition 18
For covariates \(x_1,\ldots ,x_N\) with responses \(z_1,\ldots ,z_N\), the negative log-likelihood \(\displaystyle -\sum _{i=1}^N\log f(z_i|x_i,\gamma )\) of \(\gamma \in {\mathbb R}^{p+1}\) is
$$\begin{aligned} \frac{N}{2}\log 2\pi \sigma ^2 +\frac{1}{2\sigma ^2}\Vert z-X\beta \Vert ^2 -\frac{1}{\sigma ^2}(\gamma -\beta )^TX^T(z-X\beta )+\frac{1}{2\sigma ^2}(\gamma -\beta )^TX^TX(\gamma -\beta ) \end{aligned}$$
(5.15)
for an arbitrary \(\beta \in {\mathbb R}^{p+1}\).
Proof
In fact, for \(u\in {\mathbb R}\) and \(x\in {\mathbb R}^{p+1}\), we have that
$$\begin{aligned} \log f(u|x,\gamma ) &= -\frac{1}{2}\log 2\pi \sigma ^2-\frac{1}{2\sigma ^2}(u-x\gamma )^2\\ (u-x\gamma )^2 &= \{(u-x\beta )-x(\gamma -\beta )\}^2\\ &= (u-x\beta )^2-2(\gamma -\beta )^Tx^T(u-x\beta )+(\gamma -\beta )^Tx^Tx(\gamma -\beta )\\ \log f(u|x,\gamma ) &= -\frac{1}{2}\log 2\pi \sigma ^2 -\frac{1}{2\sigma ^2}(u-x\beta )^2 +\frac{1}{\sigma ^2}(\gamma -\beta )^Tx^T(u-x\beta )-\frac{1}{2\sigma ^2}(\gamma -\beta )^Tx^Tx(\gamma -\beta ) \end{aligned}$$
and, if we sum over \((x,u)=(x_1,z_1),\ldots ,(x_N,z_N)\), we can write
$$\begin{aligned} -\sum _{i=1}^N\log f(z_i|x_i,\gamma ) &= \frac{N}{2}\log 2\pi \sigma ^2 +\frac{1}{2\sigma ^2}\Vert z-X\beta \Vert ^2\\ &\quad -\frac{1}{\sigma ^2}(\gamma -\beta )^TX^T(z-X\beta )+\frac{1}{2\sigma ^2}(\gamma -\beta )^TX^TX(\gamma -\beta )\ , \end{aligned}$$
where we have used \(z=[z_1,\ldots ,z_N]^T\) and \(\displaystyle \Vert z-X\beta \Vert ^2=\sum _{i=1}^N(z_i-x_i\beta )^2,\ \displaystyle X^TX=\sum _{i=1}^Nx_i^Tx_i,\ X^T(z-X\beta )=\sum _{i=1}^Nx_i^T(z_i-x_i\beta )\).
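The identity (5.15) is easy to verify numerically. The following sketch, written in Python with NumPy as an illustrative choice (all concrete sizes and names are mine), compares the negative log-likelihood of \(\gamma \) with the right-hand side of (5.15) for randomly generated data and an arbitrary \(\beta \).

```python
import numpy as np

# Numerical check of the identity (5.15): minus the log-likelihood of gamma
# equals its expansion around an arbitrary beta.  Illustrative values only.
rng = np.random.default_rng(0)
N, p, sigma2 = 50, 3, 1.5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # rows x_1, ..., x_N
gamma = rng.normal(size=p + 1)                               # parameter of interest
beta = rng.normal(size=p + 1)                                # arbitrary beta
z = X @ gamma + rng.normal(scale=np.sqrt(sigma2), size=N)    # responses z_1, ..., z_N

# Left-hand side: -sum_i log f(z_i | x_i, gamma)
lhs = N / 2 * np.log(2 * np.pi * sigma2) + np.sum((z - X @ gamma) ** 2) / (2 * sigma2)

# Right-hand side: the expansion (5.15) around beta
r = z - X @ beta
d = gamma - beta
rhs = (N / 2 * np.log(2 * np.pi * sigma2) + r @ r / (2 * sigma2)
       - d @ X.T @ r / sigma2 + d @ X.T @ X @ d / (2 * sigma2))

print(np.isclose(lhs, rhs))   # expected: True
```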
Proposition 19
Let k(S) be the cardinality of S. Then, we have
$$E[\log \hat{\sigma }^2(S)]=\log \sigma ^2(S)-\frac{k(S)+2}{N}+O\left( \frac{1}{N^2}\right) \ .$$
Proof
Let \(m\ge 1\) and let \(V_1,\ldots ,V_m\) be independent \(N(0,1)\) variables, so that \(U:=V_1^2+\cdots +V_m^2\sim \chi ^2_m\). For \(i=1,\ldots ,m\), we have that
$$\begin{aligned} Ee^{tV_i^2} &= \int _{-\infty }^\infty e^{tv_i^2}\frac{1}{\sqrt{2\pi }}e^{-v_i^2/2}dv_i= \int _{-\infty }^\infty \frac{1}{\sqrt{2\pi }} \exp \left\{ -\frac{(1-2t)v_i^2}{2}\right\} dv_i= (1-2t)^{-1/2}\\ Ee^{tU} &= \int _{-\infty }^\infty \cdots \int _{-\infty }^\infty e^{t(v_1^2+\cdots +v_{m}^2)}\prod _{i=1}^m\frac{1}{\sqrt{2\pi }}e^{-v_i^2/2}\, dv_1\cdots dv_{m} =(1-2t)^{-m/2}\ , \end{aligned}$$
which means that for \(n=1,2,\ldots \),
$$\begin{aligned} EU^n = \frac{d^nEe^{tU}}{dt^n}\bigg |_{t=0}=m(m+2)\cdots (m+2n-2) \ , \end{aligned}$$
(5.16)
where \(\displaystyle Ee^{tU}=1+tE[U]+\frac{t^2}{2}E[U^2]+\cdots \) has been used. Moreover, from the Taylor expansion, we have that
$$\begin{aligned} E\left[ \log \frac{U}{m}\right] = E\left( \frac{U}{m}-1\right) -\frac{1}{2}E\left( \frac{U}{m}-1\right) ^2+\cdots \ . \end{aligned}$$
(5.17)
If we apply (5.16) with \(n=1,2\), i.e., \(EU=m\) and \(EU^2=m(m+2)\), then the first and second terms of (5.17) are zero and
$$-\frac{1}{2m^2}(EU^2-2mEU+m^2)=-\frac{1}{2m^2}\{m(m+2)-2m^2+m^2\}=-\frac{1}{m}\ ,$$
respectively.
Next, we show that each term in (5.17) of order \(n\ge 3\) is at most \(O(1/m^2)\). From the binomial theorem and (5.16), we have that
$$\begin{aligned} E(U-m)^n = \sum _{j=0}^n \left( \begin{array}{c} n\\ j \end{array} \right) EU^j(-m)^{n-j} = \sum _{j=0}^n(-1)^{n-j} \left( \begin{array}{c} n\\ j \end{array} \right) m^{n-j}m(m+2)\cdots (m+2j-2)\ . \end{aligned}$$
(5.18)
If we regard
$$m^{n-j}m(m+2)\cdots (m+2j-2)$$
as a polynomial in m, the coefficients of the highest-degree term and of the term of degree \(n-1\) are one and \(2\{1+2+\cdots +(j-1)\}=j(j-1)\), respectively. Hence, the coefficients of the degree-\(n\) and degree-\((n-1)\) terms in (5.18) are
$$\displaystyle \sum _{j=0}^n\left( \begin{array}{c} n\\ j \end{array} \right) (-1)^j= \sum _{j=0}^n\left( \begin{array}{c} n\\ j \end{array} \right) (-1)^j1^{n-j}=(-1+1)^n= 0$$
and
$$ \sum _{j=0}^n \left( \begin{array}{c} n\\ j \end{array} \right) (-1)^jj(j-1)= \sum _{j=2}^n \frac{n!}{(n-j)!(j-2)!} (-1)^{j-2}= n(n-1) \sum _{i=0}^{n-2} \left( \begin{array}{c} n-2\\ i \end{array} \right) (-1)^{i}=0\ , $$
respectively. Thus, we have shown that for \(n\ge 3\),
$$E\left( \frac{U}{m}-1\right) ^n=O\left( \frac{1}{m^2}\right) \ .$$
Finally, from \(\displaystyle \frac{RSS(S)}{\sigma ^2(S)}= \frac{N\hat{\sigma }^2(S)}{{\sigma }^2(S)}\sim \chi ^2_{N-k(S)-1}\) and (5.17), if we apply \(m=N-k(S)-1\), then we have that
$$\log \frac{N}{N-k(S)-1}=\frac{k(S)+1}{N-k(S)-1}+O\left( \frac{1}{(N-k(S)-1)^2}\right) =\frac{k(S)+1}{N}+O\left( \frac{1}{N^2}\right) ,$$
$$E\left[ \log \left( \frac{\hat{\sigma }^2(S)}{N-k(S)-1}\bigg /\frac{{\sigma }^2(S)}{N}\right) \right] =-\frac{1}{N-k(S)-1}+O\left( \frac{1}{N^2}\right) =-\frac{1}{N}+O\left( \frac{1}{N^2}\right) $$
and
$$ E\left[ \log \frac{\hat{\sigma }^2(S)}{\sigma ^2(S)}\right] =-\frac{1}{N}-\frac{k(S)+1}{N}+O\left( \frac{1}{N^2}\right) =-\frac{k(S)+2}{N}+O\left( \frac{1}{N^2}\right) \ . $$
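A Monte Carlo experiment can make Proposition 19 concrete. The sketch below (Python/NumPy, an illustrative choice) takes \(S\) to be the full set of \(p\) covariates, so that \(\sigma ^2(S)=\sigma ^2\) and \(k(S)=p\), and compares the simulated mean of \(\log \hat{\sigma }^2(S)\) with \(\log \sigma ^2(S)-(k(S)+2)/N\); all concrete values are hypothetical.

```python
import numpy as np

# Monte Carlo check of Proposition 19.  S is taken to be the full covariate set,
# so sigma^2(S) = sigma^2 and k(S) = p.  Values and names are illustrative.
rng = np.random.default_rng(1)
N, p, sigma2 = 100, 4, 2.0
beta = rng.normal(size=p + 1)
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
k = p                                              # k(S)

logs = []
for _ in range(10000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    logs.append(np.log(rss / N))                   # log of sigma_hat^2(S) = RSS(S)/N

print(np.mean(logs))                               # simulated E[log sigma_hat^2(S)]
print(np.log(sigma2) - (k + 2) / N)                # log sigma^2(S) - (k(S)+2)/N
```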
Exercises 40–48
In the following, we define
$$X= \left[ \begin{array}{c} {x_1}\\ \vdots \\ x_N \end{array} \right] \in {\mathbb R}^{N\times (p+1)} ,\ y= \left[ \begin{array}{c} y_1\\ \vdots \\ y_N \end{array} \right] \in {\mathbb R}^N ,\ z= \left[ \begin{array}{c} z_1\\ \vdots \\ z_N \end{array} \right] \in {\mathbb R}^N ,\ \beta = \left[ \begin{array}{c} \beta _0\\ \beta _1\\ \vdots \\ \beta _p \end{array} \right] \in {\mathbb R}^{p+1}\ , $$
where \(x_1,\ldots ,x_N\) are row vectors. We assume that \(X^TX\) is invertible and denote by \(E[\cdot ]\) the expectation w.r.t.
$$f(y|x,\beta ):=\frac{1}{\sqrt{2\pi \sigma ^2}}\exp \left\{ -\frac{\Vert y-x\beta \Vert ^2}{2\sigma ^2}\right\} \ .$$
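Readers who want to experiment with the exercises may find a small data-generating sketch convenient. The following Python/NumPy code (an illustrative choice of language; the true \(\beta \), sizes, and seed are hypothetical) draws data from the model above and computes the least squares estimate and the maximum likelihood estimate of \(\sigma ^2\) that Exercise 40 refers to.

```python
import numpy as np

# Draw (X, y) from the Gaussian linear model defined above and compute the
# least squares / maximum likelihood estimates; all concrete values are hypothetical.
rng = np.random.default_rng(0)
N, p, sigma2 = 200, 5, 1.0
beta = np.arange(p + 1, dtype=float)                        # hypothetical true beta
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # first column for the intercept
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                # least squares solution (Exercise 40(a))
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / N            # MLE of sigma^2 (Exercise 40(b))
print(beta_hat)
print(sigma2_hat)
```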
40. For \(X \in {\mathbb R}^{N\times (p+1)}\) and \(y\in {\mathbb R}^N\), show each of the following.
(a) If the variance \(\sigma ^2>0\) is known, the \(\beta \in {\mathbb R}^{p+1}\) that maximizes \(\displaystyle l:=\sum _{i=1}^N \log f(y_i|x_i,\beta )\) coincides with the least squares solution. Hint:
$${l}=-\frac{N}{2}\log (2\pi \sigma ^2)-\frac{1}{2\sigma ^2}\Vert y-X\beta \Vert ^2$$
(b) If both \(\beta \in {\mathbb R}^{p+1}\) and \(\sigma ^2>0\) are unknown, the maximum likelihood estimate of \(\sigma ^2\) is given by
$$\hat{\sigma }^2=\frac{1}{N}\Vert y-X\hat{\beta }\Vert ^2\ .$$
Hint: If we partially differentiate l with respect to \(\sigma ^2\) and set the derivative to zero, we have
$$\displaystyle \frac{\partial l}{\partial \sigma ^2}= -\frac{N}{2\sigma ^2}+\frac{\Vert y-X\beta \Vert ^2}{2(\sigma ^2)^2} =0\ .$$
(c) For probability density functions f, g over \(\mathbb R\), the Kullback-Leibler divergence is nonnegative, i.e.,
$$\displaystyle D(f\Vert g):=\int _{-\infty }^\infty f(x)\log \frac{f(x)}{g(x)}dx\ge 0$$
41. Let \(f^N(y|x,\beta ):=\prod _{i=1}^Nf(y_i|x_i,\beta )\). By showing (a) through (d), prove
$$J=\frac{1}{N}E(\nabla l)^2=-\frac{1}{N}E\nabla ^2 l$$
(a) \(\displaystyle {\nabla l}=\frac{\nabla f^N(y|x,\beta )}{f^N(y|x,\beta )}\)
(b) \(\displaystyle \int \nabla f^N(y|x,\beta )dy=0\)
(c) \(E\nabla l=0\)
(d) \(\nabla E[\nabla l]= E[\nabla ^2 l]+E[(\nabla l)^2]\)
42. Let \(\tilde{\beta }\in {\mathbb R}^{p+1}\) be an arbitrary unbiased estimate of \(\beta \). By showing (a) through (c), prove the Cramér-Rao inequality
$$V(\tilde{\beta })\ge (NJ)^{-1}$$
(a) \(E[(\tilde{\beta }-\beta )(\nabla l)^T]=I\)
(b) The covariance matrix of the vector of size \(2(p+1)\) obtained by combining \(\tilde{\beta }-\beta \) and \(\nabla l\) is
$$ \left[ \begin{array}{c@{\quad }c} V(\tilde{\beta })&{}I\\ I&{}NJ \end{array} \right] $$
(c) Both sides of
$$ \left[ \begin{array}{c@{\quad }c} V(\tilde{\beta })-(NJ)^{-1}&{}0\\ 0&{}NJ \end{array} \right] = \left[ \begin{array}{c@{\quad }c} I&{}-(NJ)^{-1}\\ 0&{}I \end{array} \right] \left[ \begin{array}{c@{\quad }c} V(\tilde{\beta })&{}I\\ I&{}NJ \end{array} \right] \left[ \begin{array}{c@{\quad }c} I&{}0\\ -(NJ)^{-1}&{}I \end{array} \right] $$
are nonnegative definite.
43. By showing (a) through (c), prove \(E\Vert X(\tilde{\beta }-\beta )\Vert ^2\ge \sigma ^2(p+1)\).
(a) \(E[(\tilde{\beta }-\beta )^T\nabla {l}]=p+1\)
(b) \(E\Vert X(X^TX)^{-1}\nabla l\Vert ^2=(p+1)/\sigma ^2\)
(c) \(\{E(\tilde{\beta }-\beta )^T\nabla {l}\}^2\le E\Vert X(X^TX)^{-1}\nabla l\Vert ^2E\Vert X(\tilde{\beta }-\beta )\Vert ^2\). Hint: For random variables \(U,V\in {\mathbb R}^m\) (\(m\ge 1\)), prove \(\{E[U^TV]\}^2\le E[\Vert U\Vert ^2]E[\Vert V\Vert ^2]\) (Schwarz's inequality).
44. Prove the following statements.
(a) For covariates \(x_1,\ldots ,x_N\), if we obtain the responses \(z_1,\ldots ,z_N\), then the negative log-likelihood \(\displaystyle -\sum _{i=1}^N\log f(z_i|x_i,\gamma )\) of the parameter \(\gamma \in {\mathbb R}^{p+1}\) is
$$\frac{N}{2}\log 2\pi \sigma ^2 +\frac{1}{2\sigma ^2}\Vert z-X\beta \Vert ^2 -\frac{1}{\sigma ^2}(\gamma -\beta )^TX^T(z-X\beta )+\frac{1}{2\sigma ^2}(\gamma -\beta )^TX^TX(\gamma -\beta ) $$
for an arbitrary \(\beta \in {\mathbb R}^{p+1}\).
(b) If we take the expectation of (a) w.r.t. \(z_1,\ldots ,z_N\), it is
$$\frac{N}{2}\log (2\pi \sigma ^2e)+\frac{1}{2\sigma ^2}\Vert X(\gamma -\beta )\Vert ^2\ .$$
(c) If we choose an estimate \(\gamma \) of \(\beta \), the minimum value of (b) on average is
$$\frac{N}{2}\log (2\pi \sigma ^2e)+\frac{1}{2}(p+1)$$
and this minimum is attained by the least squares method.
(d) Instead of using all p covariates, we choose \(0\le k\le p\) of them. Minimizing
$$\frac{N}{2}\log (2\pi \sigma _k^2e)+\frac{1}{2}(k+1)$$
w.r.t. k is equivalent to minimizing \({N}\log \sigma ^2_k+k\) w.r.t. k, where \(\sigma ^2_k\) is the minimum variance when we choose k covariates.
45. By showing (a) through (f), prove
$$E\log \frac{\hat{\sigma }^2(S)}{\sigma ^2} =-\frac{1}{N}-\frac{k(S)+1}{N}+O\left( \frac{1}{N^2}\right) =-\frac{k(S)+2}{N}+O\left( \frac{1}{N^2}\right) \ .$$
Use the fact that the \(n\)-th moment of \(U\sim \chi ^2_m\) is
$$EU^n=m(m+2)\cdots (m+2n-2)\ $$
without proving it.
(a) \(\displaystyle E\log \frac{U}{m}=E\left( \frac{U}{m}-1\right) -\frac{1}{2}E\left( \frac{U}{m}-1\right) ^2+\cdots \)
(b) \(\displaystyle E\left( \frac{U}{m}-1\right) =0\) and \(\displaystyle E\left( \frac{U}{m}-1\right) ^2=\frac{2}{m}\)
(c) \(\displaystyle \sum _{j=0}^n(-1)^{n-j} \left( \begin{array}{c} n\\ j \end{array} \right) =0 \)
(d) If we regard \( \displaystyle E(U-m)^n= \sum _{j=0}^n(-1)^{n-j} \left( \begin{array}{c} n\\ j \end{array} \right) m^{n-j}m(m+2)\cdots (m+2j-2) \) as a polynomial in m, the sum of the terms of degree n is zero. Hint: Use (c).
(e) The sum of the terms of degree \(n-1\) is zero. Hint: Derive that the coefficient of degree \(n-1\) is \(2\{1+2+\cdots +(j-1)\}=j(j-1)\) for each j and that \( \displaystyle \sum _{j=0}^n \left( \begin{array}{c} n\\ j \end{array} \right) (-1)^jj(j-1)=0 \).
(f) \(\displaystyle E\log \left( \frac{\hat{\sigma }^2(S)}{N-k(S)-1}\bigg /\frac{{\sigma }^2}{N}\right) =-\frac{1}{N}+O\left( \frac{1}{N^2}\right) \)
46. The following procedure produces the AIC value. Fill in the blanks and execute the procedure.
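One possible shape for such a procedure, sketched in Python under the assumption that the AIC of a subset of \(k\) covariates is \(N\log \hat{\sigma }^2+2k\) (the usual form for a Gaussian linear model); the function name, data, and loop over subsets are illustrative choices rather than the text's.

```python
import numpy as np
from itertools import combinations

# Illustrative AIC computation and best-subset search for a Gaussian linear model.
# Assumes AIC = N * log(sigma_hat^2) + 2k for a subset of k covariates.
def AIC(X, y, subset):
    N = len(y)
    Xs = np.hstack([np.ones((N, 1)), X[:, list(subset)]])   # intercept + chosen covariates
    beta_hat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta_hat) ** 2)
    return N * np.log(rss / N) + 2 * len(subset)

# Hypothetical data, only to make the sketch runnable.
rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=N)

best = min(((AIC(X, y, s), s) for r in range(p + 1) for s in combinations(range(p), r)),
           key=lambda t: t[0])
print(best)   # (smallest AIC value, covariate subset attaining it)
```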
47. Instead of AIC, we consider the criterion that minimizes the following quantity (BIC, Bayesian Information Criterion):
$$N\log \hat{\sigma }^2+k\log N$$
Replace the associated lines of the AIC procedure above, and name the function BIC. For the same data, execute BIC. Moreover, construct a procedure to choose the covariate set that maximizes
$$AR^2:=1-\frac{RSS/(N-k-1)}{TSS/(N-1)}$$
(adjusted coefficient of determination) and name the function AR2. For the same data, execute AR2.
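Under the same assumptions as the AIC sketch above (and with the same caveat that the helper and data names are mine), the BIC and the adjusted coefficient of determination of a subset could be computed as follows; choosing a covariate set then amounts to minimizing BIC, or maximizing AR2, over subsets exactly as before.

```python
import numpy as np

# Companion to the AIC sketch: BIC and adjusted R^2 for one covariate subset.
# The formulas follow Exercise 47; helper and argument names are illustrative.
def fit_rss(X, y, subset):
    N = len(y)
    Xs = np.hstack([np.ones((N, 1)), X[:, list(subset)]])
    beta_hat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta_hat) ** 2)

def BIC(X, y, subset):
    N, k = len(y), len(subset)
    return N * np.log(fit_rss(X, y, subset) / N) + k * np.log(N)

def AR2(X, y, subset):
    N, k = len(y), len(subset)
    rss = fit_rss(X, y, subset)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1 - (rss / (N - k - 1)) / (tss / (N - 1))
```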
48. We wish to visualize the k that minimizes AIC and BIC. Fill in the blanks and execute the procedure.
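One way to produce such a visualization, again as a sketch with hypothetical data and with matplotlib assumed as the plotting library: for each subset size \(k\), record the smallest AIC and BIC over all subsets of that size and plot the two curves against \(k\).

```python
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations

# For each subset size k, record the best (smallest) AIC and BIC over all
# subsets of that size, then plot both curves against k.  Hypothetical data.
def rss_of(X, y, subset):
    N = len(y)
    Xs = np.hstack([np.ones((N, 1)), X[:, list(subset)]])
    beta_hat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta_hat) ** 2)

rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=N)

ks = range(p + 1)
aic_k, bic_k = [], []
for k in ks:
    best_rss = min(rss_of(X, y, s) for s in combinations(range(p), k))
    aic_k.append(N * np.log(best_rss / N) + 2 * k)
    bic_k.append(N * np.log(best_rss / N) + k * np.log(N))

plt.plot(ks, aic_k, marker="o", label="AIC")
plt.plot(ks, bic_k, marker="s", label="BIC")
plt.xlabel("k (number of covariates)")
plt.ylabel("criterion value")
plt.legend()
plt.show()
```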