The Geometry of the GLM

For simplicity, consider the bivariate GLM

$$ {y}_t={\beta}_0+{\beta}_1{x}_t+{u}_t $$

and note that the conditional expectation of \( y_t \) given \( x_t \) is

$$ \mathrm{E}\left({y}_t|{x}_t\right)={\beta}_0+{\beta}_1{x}_t. $$

The model may be plotted in the \( y,x \) plane as density functions about the mean.

Specifically, in Fig. 1.1 we have plotted the conditional mean of \( y \) given \( x \) as the straight line. Given the abscissa \( x \), however, the dependent variable is not constrained to lie on the line. Instead, it is thought to be a random variable defined over the vertical line rising over the abscissa. Thus, for given \( x \) we can, in principle, observe a \( y \) that can range anywhere over the vertical axis. This being the conceptual framework, we would therefore not be surprised if in plotting a given sample in \( y,x \) space we obtain the disposition of Fig. 1.2. In particular, even if the pairs \( \left\{\left({y}_t,{x}_t\right):t=1,2,\dots, T\right\} \) have been generated by the process pictured in Fig. 1.1, there is no reason why plotting the sample will not give rise to the configuration of Fig. 1.2. A plot of the sample is frequently referred to as a scatter diagram. The least squares procedure is simply a method for determining a line through the scatter diagram such that, for a given abscissa (\( x \)), the square of the \( y \) distance between the corresponding point and the line is minimized.

Fig. 1.1 A fitted regression line with confidence intervals

Fig. 1.2 A fitted regression line

In Fig. 1.2 the sloping line is the hypothetical estimate induced by OLS. As such it represents an estimate of the unknown parameters in the conditional mean function. The vertical lines are the vertical distances (\( y \) distances) between the following two points: first, given an \( x \) that lies in the sample set of observations there corresponds a \( y \) that lies in the sample set of observations; second, given this same \( x \) there corresponds a \( y \) that lies on the sloping line. It is the sum of the squares of the distances between all such pairs of points (such that the \( x \) component lies in the set of sample observations) that the OLS procedure seeks to minimize. In terms of the general results in the preceding discussion this is accomplished by taking

$$ {\widehat{\beta}}_0=\overline{y}-{\widehat{\beta}}_1\overline{x},\kern1em {\widehat{\beta}}_1=\frac{s_{yx}}{s_{xx}}, $$

where

$$ {s}_{yx}=\frac{1}{T}\sum \limits_{t=1}^T\left({y}_t-\overline{y}\right)\left({x}_t-\overline{x}\right),\kern1em {s}_{xx}=\frac{1}{T}\sum \limits_{t=1}^T{\left({x}_t-\overline{x}\right)}^2, $$

$$ \overline{y}=\frac{1}{T}\sum {y}_t,\kern1em \overline{x}=\frac{1}{T}\sum {x}_t. $$
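
As a concrete numerical sketch (the data below are made up for illustration), the estimators can be computed directly from these sample moments:

```python
import numpy as np

# Hypothetical sample for illustration (T = 6 observations).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

T = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Sample moments as defined above.
s_yx = ((y - y_bar) * (x - x_bar)).sum() / T
s_xx = ((x - x_bar) ** 2).sum() / T

beta1_hat = s_yx / s_xx                 # slope estimate s_yx / s_xx
beta0_hat = y_bar - beta1_hat * x_bar   # intercept: y_bar - beta1_hat * x_bar

# The same estimates follow from a standard least squares routine.
check = np.polyfit(x, y, 1)  # returns [slope, intercept]
assert np.allclose([beta1_hat, beta0_hat], check)
```

Note that only the sums \( \sum x_t \), \( \sum y_t \), \( \sum x_t^2 \), and \( \sum x_ty_t \) are ever needed, which is the point made below.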

The \( y \) distance referred to above is

$$ {y}_t-{\widehat{\beta}}_0-{\widehat{\beta}}_1{x}_t=\left({y}_t-\overline{y}\right)-{\widehat{\beta}}_1\left({x}_t-\overline{x}\right), $$

the square of which is

$$ {\left({y}_t-\overline{y}\right)}^2-2{\widehat{\beta}}_1\left({y}_t-\overline{y}\right)\left({x}_t-\overline{x}\right)+{\widehat{\beta}}_1^2{\left({x}_t-\overline{x}\right)}^2. $$

Notice, incidentally, that to carry out an OLS estimation scheme we need only the sums and cross products of the observations. Notice also that the variance of the slope coefficient is

$$ \mathrm{Var}\left({\widehat{\beta}}_1\right)=\frac{\sigma^2}{\sum_{t=1}^T{\left({x}_t-\overline{x}\right)}^2}. $$

Consequently, if we could design the sample by choosing the \( x \) coordinates, we could further reduce the variance of the resulting estimator by choosing the \( x \)'s so as to make \( {\sum}_{t=1}^T{\left({x}_t-\overline{x}\right)}^2 \) as large as possible. In fact, it can be shown that if the phenomenon under study is such that the \( x \)'s are constrained to lie in the interval \( \left[a,b\right] \) and we can choose the design of the sample, we should choose half the \( x \)'s at \( a \) and half the \( x \)'s at \( b \). In this fashion we minimize the variance of \( {\widehat{\beta}}_1 \). Intuitively, and in terms of Figs. 1.1 and 1.2, the interpretation of this result is quite clear. By concentrating on two widely separated points in the \( x \) space we induce maximal discrimination between a straight line and a more complicated curve. If we focus on two \( x \) points that are very close together, our power to discriminate is very limited, since over a sufficiently small interval all curves "look like straight lines." By taking half of the observations at one end point and half at the other, we maximize the "precision" with which we fix these two ordinates of the conditional mean function and thus fix the slope coefficient by the operation

$$ {\widehat{\beta}}_1=\frac{{\overline{y}}^{(2)}-{\overline{y}}^{(1)}}{b-a}. $$

Above, \( {\overline{y}}^{(2)} \) is the mean of the \( y \) observations corresponding to the \( x \)'s chosen at \( b \), and \( {\overline{y}}^{(1)} \) is the mean of the \( y \) observations corresponding to the \( x \)'s chosen at \( a \).
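
This endpoint design can be checked numerically; the sketch below (with arbitrary values for \( a \), \( b \), \( T \), and the parameters) confirms both the variance advantage and the two-means formula for \( {\widehat{\beta}}_1 \):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, T, beta0, beta1, sigma = 0.0, 10.0, 20, 1.0, 2.0, 1.0

# Endpoint design: half the x's at a, half at b.
x_end = np.repeat([a, b], T // 2)
# An evenly spread design over [a, b], for comparison.
x_even = np.linspace(a, b, T)

# Var(beta1_hat) = sigma^2 / sum (x_t - x_bar)^2, so it is smaller
# when the x's are pushed to the endpoints of the interval.
def slope_var(x, sigma2=sigma**2):
    return sigma2 / ((x - x.mean()) ** 2).sum()

assert slope_var(x_end) < slope_var(x_even)

# With the endpoint design the slope estimate is just the difference
# of the two group means of y divided by b - a.
y = beta0 + beta1 * x_end + sigma * rng.standard_normal(T)
y1 = y[x_end == a].mean()   # mean of the y's observed at a
y2 = y[x_end == b].mean()   # mean of the y's observed at b
beta1_hat = (y2 - y1) / (b - a)

# This agrees exactly with the OLS slope s_yx / s_xx.
s_yx = ((y - y.mean()) * (x_end - x_end.mean())).mean()
s_xx = ((x_end - x_end.mean()) ** 2).mean()
assert np.isclose(beta1_hat, s_yx / s_xx)
```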

In the multivariate context a pictorial representation is difficult; nonetheless a geometric interpretation in terms of vector spaces is easily obtained. The columns of the matrix of explanatory variables, \( X \), are by assumption linearly independent. Let us initially agree that we deal with observations that are centered about their respective sample means. Since we have, by construction, \( n \) such vectors, they span an \( n \)-dimensional subspace of the \( T \)-dimensional Euclidean space \( {\mathrm{\mathbb{R}}}_T \). We observe that \( X{\left({X}^{\prime }X\right)}^{-1}{X}^{\prime} \) is the matrix representation of a projection of \( {\mathrm{\mathbb{R}}}_T \) into itself. We recall that a projection is a linear idempotent transformation of a space into itself, i.e., if \( P \) represents a projection operator and \( {y}_1,{y}_2\in {\mathrm{\mathbb{R}}}_T \), \( c \) being a real constant, then

$$ P\left({cy}_1+{y}_2\right)= cP\left({y}_1\right)+P\left({y}_2\right),\kern1em P\left[P\left({y}_1\right)\right]=P\left({y}_1\right), $$

where \( P(y) \) is the image of \( y\in {\mathrm{\mathbb{R}}}_T \) under \( P \).
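
Both defining properties can be checked numerically for the matrix \( X{\left({X}^{\prime }X\right)}^{-1}{X}^{\prime} \) (a minimal sketch with an arbitrary full-column-rank \( X \)):

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 8, 3
X = rng.standard_normal((T, n))           # full column rank (with prob. 1)
P = X @ np.linalg.inv(X.T @ X) @ X.T      # the projection matrix

y1, y2 = rng.standard_normal(T), rng.standard_normal(T)
c = 2.5

# Linearity: P(c*y1 + y2) = c*P(y1) + P(y2)
assert np.allclose(P @ (c * y1 + y2), c * (P @ y1) + P @ y2)
# Idempotence: P(P(y1)) = P(y1), equivalently P P = P
assert np.allclose(P @ (P @ y1), P @ y1)
assert np.allclose(P @ P, P)
```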

We also recall that a projection divides the space \( {\mathrm{\mathbb{R}}}_T \) into two subspaces, say \( S_1 \) and \( S_2 \), where \( S_1 \) is the range of the projection, i.e.,

$$ {S}_1=\left\{z:z=P(y),\kern0.5em y\in {\mathrm{\mathbb{R}}}_T\right\}, $$

while \( S_2 \) is the null space of the projection, i.e.,

$$ {S}_2=\left\{y:P(y)=0,\kern0.5em y\in {\mathrm{\mathbb{R}}}_T\right\}. $$

We also recall that any element of \( {\mathrm{\mathbb{R}}}_T \) can be written uniquely as the sum of two components, one from \( S_1 \) and one from \( S_2 \).

The subspace \( S_2 \) is also referred to as the orthogonal complement of \( S_1 \), i.e., if \( {y}_1\in {S}_1 \) and \( {y}_2\in {S}_2 \) their inner product vanishes. Thus, \( {y}_1^{\prime }{y}_2=0 \).

The application of these concepts to the regression problem makes the mechanics of estimation quite straightforward. What we do is to project the vector of observations \( y \) on the subspace of \( {\mathrm{\mathbb{R}}}_T \) spanned by the (linearly independent) columns of the matrix of observations on the explanatory variables, \( X \). The matrix of the projection is

$$ X{\left({X}^{\prime }X\right)}^{-1}{X}^{\prime }, $$

which is an idempotent matrix of rank \( n \). The projection onto the orthogonal complement of its range is represented by the matrix

$$ I-X{\left({X}^{\prime }X\right)}^{-1}{X}^{\prime }. $$

It then follows immediately that we can write

$$ y=\widehat{y}+\widehat{u}, $$

where

$$ \widehat{y}=X{\left({X}^{\prime }X\right)}^{-1}{X}^{\prime }y $$

is an element of the range of the projection defined by the matrix \( X{\left({X}^{\prime }X\right)}^{-1}{X}^{\prime} \), while

$$ \widehat{u}=\left[I-X{\left({X}^{\prime }X\right)}^{-1}{X}^{\prime}\right]y $$

is an element of its orthogonal complement. Thus, mechanically, we have decomposed \( y \) into \( \widehat{y} \), which lies in the space spanned by the columns of \( X \), and \( \widehat{u} \), which lies in a subspace orthogonal to it.
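
A minimal numerical sketch of this decomposition (with simulated data) confirms that \( \widehat{y} \) and \( \widehat{u} \) add up to \( y \) and are orthogonal:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 10, 3
X = rng.standard_normal((T, n))
y = rng.standard_normal(T)

P = X @ np.linalg.inv(X.T @ X) @ X.T  # projection onto the column space of X
M = np.eye(T) - P                     # projection onto the orthogonal complement

y_hat = P @ y   # fitted values: element of the range of P
u_hat = M @ y   # residuals: element of the orthogonal complement

assert np.allclose(y, y_hat + u_hat)    # y = y_hat + u_hat
assert np.isclose(y_hat @ u_hat, 0.0)   # their inner product vanishes
assert np.allclose(X.T @ u_hat, 0.0)    # residuals orthogonal to X itself
```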

While the mechanics of regression become clearer in the vector space context above, it must be remarked that the context in which we studied the general linear model is by far the richer one in interpretation and implications.

A Measure of Correlation Between a Scalar and a Vector

In the discussion to follow we shall draw an interesting analogy between the GLM and certain aspects of multivariate, and more particularly multivariate normal, distributions. To fix notation, let

$$ x\sim N\left(\mu, \Sigma \right) $$

and partition

$$ x=\left(\begin{array}{l}{x}^1\\ {}{x}^2\end{array}\right),\kern1em \mu =\left(\begin{array}{l}{\mu}^1\\ {}{\mu}^2\end{array}\right),\kern1em \Sigma =\left[\begin{array}{cc}{\Sigma}_{11}& {\Sigma}_{12}\\ {}{\Sigma}_{21}& {\Sigma}_{22}\end{array}\right] $$

such that \( x^1 \) has \( k \) elements, \( x^2 \) has \( n-k \), \( \mu \) has been partitioned conformably with \( x \), \( \Sigma_{11} \) is \( k\times k \), \( \Sigma_{22} \) is \( \left(n-k\right)\times \left(n-k\right) \), \( \Sigma_{12} \) is \( k\times \left(n-k\right) \), etc.

We recall that the conditional mean of \( x^1 \) given \( x^2 \) is simply

$$ \mathrm{E}\left({x}^1|{x}^2\right)={\mu}^1+{\Sigma}_{12}{\Sigma}_{22}^{-1}\left({x}^2-{\mu}^2\right). $$
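
As a sketch, the conditional mean formula above can be evaluated for a hypothetical three-variate normal (the numbers in \( \mu \) and \( \Sigma \) below are made up for illustration):

```python
import numpy as np

# Hypothetical 3-variate normal: x^1 is the first component (k = 1),
# x^2 collects the remaining two components.
mu = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

k = 1
mu1, mu2 = mu[:k], mu[k:]
S12 = Sigma[:k, k:]          # Sigma_12
S22 = Sigma[k:, k:]          # Sigma_22

def cond_mean(x2):
    """E(x^1 | x^2) = mu^1 + Sigma_12 Sigma_22^{-1} (x^2 - mu^2)."""
    return mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)

# At x^2 = mu^2 the conditional mean reduces to mu^1.
assert np.allclose(cond_mean(mu2), mu1)
```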

If \( k=1 \), then \( x^1 = x_1 \) and

$$ \mathrm{E}\left({x}_1|{x}^2\right)={\mu}_1+{\sigma}_{1\cdot }{\Sigma}_{22}^{-1}\left({x}^2-{\mu}^2\right)={\mu}_1-{\sigma}_{1\cdot }{\Sigma}_{22}^{-1}{\mu}^2+{\sigma}_{1\cdot }{\Sigma}_{22}^{-1}{x}^2. $$

But, in the GLM we also have that

$$ \mathrm{E}\left(y|x\right)={\beta}_0+\sum \limits_{i=1}^n{\beta}_i{x}_i $$

so that if we look upon \( {\left(y,{x}_1,{x}_2,\dots, {x}_n\right)}^{\prime} \) as having a jointly normal distribution, we can think of the "systematic part" of the GLM above as the conditional mean (function) of \( y \) given the \( x_i \), \( i=1,2,\dots, n \).

In this context, we might wish to define what is to be meant by the correlation coefficient between a scalar and a vector. The maximal correlation between \( x_i \) and linear combinations of the elements of \( x^2 \) is termed the multiple correlation coefficient and is denoted by

$$ {R}_{i\cdot k+1,k+2,\dots, n}. $$

We now proceed to derive an expression for the multiple correlation coefficient in terms of the elements of Σ. To do so we require two auxiliary results. Partition

$$ \Sigma =\left[\begin{array}{cc}{\Sigma}_{11}& {\Sigma}_{12}\\ {}{\Sigma}_{21}& {\Sigma}_{22}\end{array}\right] $$

conformably with \( x \) and let \( \sigma_{i\cdot} \) be the \( i \)th row of \( \Sigma_{12} \). We have:

It is now simple to prove:

The ratio of the conditional to the unconditional variance of \( x_i \) (given \( x^2 \)) is given by

$$ \frac{\sigma_{ii}-{\sigma}_{i\cdot }{\Sigma}_{22}^{-1}{\sigma}_i^{\prime }}{\sigma_{ii}}=1-{R}_{i\cdot k+1,k+2,\dots, n}^2. $$

Thus, \( {R}_{i\cdot k+1,k+2,\dots, n}^2 \) measures the relative reduction in the variance of \( x_i \) between its marginal and conditional distributions (given \( {x}_{k+1},{x}_{k+2},\dots, {x}_n \)).
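
As a numerical sketch (the covariance matrix below is illustrative, not from the text), the variance-reduction identity can be verified directly:

```python
import numpy as np

# Illustrative covariance matrix; x_i is the first variable, x^2 the rest.
Sigma = np.array([[2.0, 0.8, 0.6],
                  [0.8, 1.0, 0.2],
                  [0.6, 0.2, 1.5]])

sigma_ii = Sigma[0, 0]
sigma_i = Sigma[0, 1:]        # i-th row of Sigma_12
S22 = Sigma[1:, 1:]           # Sigma_22

# Squared multiple correlation: sigma_i. S22^{-1} sigma_i.' / sigma_ii.
R2 = sigma_i @ np.linalg.solve(S22, sigma_i) / sigma_ii

# Conditional variance of x_i given x^2.
cond_var = sigma_ii - sigma_i @ np.linalg.solve(S22, sigma_i)

# Ratio of conditional to unconditional variance equals 1 - R^2.
assert np.isclose(cond_var / sigma_ii, 1.0 - R2)
assert 0.0 <= R2 <= 1.0
```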

The analogy between these results and those encountered earlier in the chapter is now quite obvious. In that context, the role of \( x_i \) is played by the dependent variable, while the role of \( x^2 \) is played by the bona fide explanatory variables. If the data matrix is

$$ X=\left(e,\kern0.5em {X}_1\right), $$

where \( X_1 \) is the matrix of observations on the bona fide explanatory variables, then

$$ \frac{1}{T}{X}_1^{\prime}\left(I-\frac{e{e}^{\prime }}{T}\right)y $$

plays the role of \( \sigma_{i\cdot} \). In the above, \( y \) is the vector of observations on the dependent variable; thus, the quantity above is the vector of sample covariances between the explanatory and the dependent variables. Similarly,

$$ \frac{1}{T}{X}_1^{\prime}\left(I-\frac{e{e}^{\prime }}{T}\right){X}_1 $$

is the sample covariance matrix of the explanatory variables. The vector of residuals is analogous to the quantity \( {x}_i-{\gamma}^{\prime }{x}^2 \), and Assertion A.1 corresponds to the statement that the vector of residuals in the regression of \( y \) on \( X \) is orthogonal to \( X \), a result given in Eq. (1.21). Assertion A.2 is analogous to the result in Proposition 1. Finally, the (square of the) multiple correlation coefficient is analogous to the (unadjusted) coefficient of determination of multiple regression. Thus, recall from Eq. (1.26) that

$$ {\displaystyle \begin{array}{l}{R}^2=1-\frac{{\widehat{u}}^{\prime}\widehat{u}}{y^{\prime}\left(I-\frac{e{e}^{\prime }}{T}\right)y}\\ {}=\frac{y^{\prime}\left(I-\frac{e{e}^{\prime }}{T}\right){X}_1{\left[{X}_1^{\prime}\left(I-\frac{e{e}^{\prime }}{T}\right){X}_1\right]}^{-1}{X}_1^{\prime}\left(I-\frac{e{e}^{\prime }}{T}\right)y}{y^{\prime}\left(I-\frac{e{e}^{\prime }}{T}\right)y},\end{array}} $$
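
The equality of these two expressions for \( R^2 \) can be checked on simulated data (a sketch; the data-generating numbers are arbitrary, and \( e \) denotes the \( T \)-vector of ones):

```python
import numpy as np

rng = np.random.default_rng(3)
T, n = 30, 2
X1 = rng.standard_normal((T, n))          # bona fide explanatory variables
y = 1.0 + X1 @ np.array([2.0, -1.0]) + rng.standard_normal(T)

e = np.ones(T)
X = np.column_stack([e, X1])              # data matrix X = (e, X1)
M0 = np.eye(T) - np.outer(e, e) / T       # centering matrix I - e e'/T

# Residuals from the regression of y on X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

tss = y @ M0 @ y                          # total (centered) sum of squares

# First form: R^2 = 1 - u'u / y'M0 y.
R2_a = 1.0 - (u_hat @ u_hat) / tss

# Second form: the quadratic form in the centered X1 and y.
A = M0 @ X1
R2_b = (y @ A @ np.linalg.solve(A.T @ A, A.T @ y)) / tss

assert np.isclose(R2_a, R2_b)
```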

which is the sample analog of the (square of the) multiple correlation coefficient between \( y \) and \( {x}_1,{x}_2,\dots, {x}_n \),

$$ {R}_{y\cdot {x}_1,{x}_2,\dots, {x}_n}^2=\frac{\sigma_{y\cdot }{\Sigma}_{xx}^{-1}{\sigma}_{y\cdot}^{\prime }}{\sigma_{yy}}, $$

where

$$ \Sigma =\mathrm{Cov}(z)=\left[\begin{array}{ll}{\sigma}_{yy}& {\sigma}_{y\cdot}\\ {}{\sigma}_{y\cdot}^{\prime }& {\Sigma}_{xx}\end{array}\right],\kern1em z=\left(\begin{array}{l}y\\ {}x\end{array}\right),\kern1em x={\left({x}_1,{x}_2,\dots, {x}_n\right)}^{\prime }, $$

i.e., it is the “covariance matrix” of the “joint distribution” of the dependent and bona fide explanatory variables.