A review of Professor C. R. Rao’s contributions in Statistics is available in Kumar et al. (2010). It shows the important role played by Professor Rao in the founding of the Indian Econometric Society, which recently held its 54th Annual Meeting in Jammu, India, attended by over 400 econometricians from all parts of India and abroad. Rao helped the late Prof. P. C. Mahalanobis, his colleague and mentor at the Indian Statistical Institute in Calcutta, in formulating the economic policies of free India after independence in 1947. He promoted sound statistical training for economists and encouraged spending resources on good economic data collection. The journal Econometric Theory paid tribute to Rao by printing a 70-page interview with him in Bera (2003), which includes detailed lists of Rao’s 458 publications and the various honors he has received. The Government of India acknowledged Rao’s leadership in helping to improve Indian economic data collection and policy, a worthy activity for a potential Nobelist in Economics.
Instead of revisiting Kumar et al. (2010) in detail, this paper provides a short technical review, providing hands-on explanations of Rao’s ideas with the help of R software absent in the earlier reviews and retrospectives. We hope to encourage young Econometricians to find inspiration and new insights from Rao’s publications in their own research. Rao’s ideas are explained anew with references to over 20 R packages, which should allow students to get a quick start with practical applications, even though some packaged examples might be from other sciences.
The input code snippets in R software and the outputs produced by R are distinctly highlighted in the sequel, so that the reader can copy and paste the input and compare her locally produced output with the output reported here. Since the regression model is the bread and butter of applied econometrics, we use regressions to illustrate some ideas, even though some of them can be more simply explained using the sample mean.
The usual linear regression model in matrix notation is
$$\begin{aligned} y=\beta _{0}+\beta _{1}x_{1}+ \cdots +\beta _{p}x_{p}+\varepsilon = X\beta +\varepsilon , \end{aligned}$$
(1)
where we have T observations, X is the \(T\times (p+1)\) matrix of data on all regressor variables including the first column of ones to represent the intercept \(\beta _{0}\), y is a \(T\times 1\) vector of data on the dependent variable, \(\beta\) is a \((p+1)\times 1\) vector of regression coefficients, and \(\varepsilon\) is a \(T\times 1\) vector of unknown true errors.
For example, we let y be the stopping distance (’dist’) of a car (in feet), and \(x_1\) be the speed of the car in miles per hour (’speed’), using Ezekiel’s data called ‘cars’ always available in R. Our first R input code is:
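A minimal sketch, assuming base R’s `head`, `tail`, and `cbind`, that produces the display described next:

```r
# view the first three and the last three rows of the built-in cars data
# side by side: columns 1-2 from the top, columns 3-4 from the bottom
cbind(head(cars, 3), tail(cars, 3))
```

Because `cbind` keeps the row names of its first argument, the printed row labels are 1–3.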
The following output shows the first three rows of the data in the first two columns and the last three rows in the last two columns. R does not print the trailing (last) three row numbers, 48–50, to the screen.
Note that ‘cars’ data in R has \(T=50\) with y for a (50\(\times 1\)) vector of stopping distances (2, 10, 4, ...,93, 120, 85). We need to insert the first column of T ones in our (50\(\times 2\)) X matrix for the intercept. The second column of X has \(x_1\) representing car speeds. An R function for linear models (lm) creates an R object containing all regression results. We name the R object ‘reg’ in the following code.
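A minimal sketch of that code, using base R’s `lm` and `summary` (the intercept column of ones is inserted automatically by the formula interface):

```r
# fit the regression of stopping distance on speed; store all results in 'reg'
reg <- lm(dist ~ speed, data = cars)
summary(reg)   # coefficients, standard errors, t values, residual std. error
```

The summary output can then be converted to a LaTeX table such as Table 1 (e.g., with a package like `stargazer`; the original typesetting tool is not named in the text).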
The output of the above code is in a form suitable to produce the following (Latex) Table 1, where standard errors of regression coefficients are in parentheses under the coefficient values.
Table 1 Regression of stopping distance on speed by standard R method

Now we discuss the theory of linear regression to set up the notation needed for explaining Rao’s contributions, including the Cramér-Rao lower bound on variance in the regression context. We have the following semi-parametric (when normality is not assumed) probabilistic structure for y and the errors in matrix notation:
$$\begin{aligned} y = X\beta +\varepsilon ,\quad E(\varepsilon )=0, \quad E(\varepsilon \varepsilon ^\prime )=\sigma ^{2} I_{T} , \end{aligned}$$
(2)
where \(I_{T}\) is a
\(T\times T\) identity matrix, suggesting homoscedasticity or constant variances and zero covariances among errors. Note that
\(E(y)=X\beta\) is also the conditional mean \(E(y \vert X)\). If we further assume that (a) the X matrix of regressors has full column rank, (b) all columns of X are uncorrelated with the errors, and (c) y (and \(\varepsilon\)) are multivariate Normal, \(y\sim N(X\beta , \sigma ^{2}I_{T})\), then the usual t tests and F tests on the coefficients and the overall model are available. In particular, let the t-th error \(\varepsilon _{t}\) have the following Normal density:
$$\begin{aligned} f(\varepsilon _{t})= (2\pi )^{-1/2}\sigma ^{-1} \exp \{-\varepsilon _{t}^{2}/(2\sigma ^{2})\}, \end{aligned}$$
(3)
where
\(\varepsilon _{t}\) is the t-th element of the vector
(\(y-X\beta\)) and is known only when
\(\beta\) is known. Hence, the density
\(f(\varepsilon _{t})\) is a function of data on y, X at time t, given specified numerical values of \(\beta\) and \(\sigma ^{2}\). The density is usually written as: \(f(y_{t}, X_{t} \vert \beta\), \(\sigma ^{2}\)), where we assume that the numerical values of all items after the vertical bar are known. The joint density for all T observations is the product of such densities for all t from
\(t=1\) to \(t=T\):
$$\begin{aligned} f_\mathrm{joint}(y, X \vert \beta ,\sigma ^{2})=\prod ^{T}_{t=1}f( \varepsilon _{t}), \end{aligned}$$
(4)
where \(f(\varepsilon _{t})\) is defined in (3) and does depend on \(\beta\) through \(\varepsilon =(y-X\beta )\).
R. A. Fisher re-interpreted the same joint density function \(f_\mathrm{joint}(\cdot )\) as the likelihood \(f_{lkhd} (\beta , \sigma ^{2} \vert y, X)\). Its log is called the log-likelihood (LL) function. It is useful when the object is to find the unknown parameters from the given data. For the regression model (2), we do not assume the form of the density of \(\varepsilon\), only that we know its mean and variance-covariance matrix. A quasi-log-likelihood function can be defined from the assumptions of (2), by pretending (hence ‘quasi’) that the errors have the same mean and variance as the corresponding Normal without actually being Normal.
Now the quasi-LL and the usual LL by expanding the product in
(4) after using (3) are given by
$$\begin{aligned} {LL}= -(T/2) \log (2\pi \sigma ^{2}) - \varepsilon ^\prime \varepsilon /(2\sigma ^2) , \end{aligned}$$
(5)
where
$$\begin{aligned} \varepsilon ^\prime \varepsilon =(y-X\beta )^\prime (y-X\beta )=y^\prime y-2\beta ^\prime X^\prime y+\beta ^\prime X^\prime X\beta . \end{aligned}$$
(6)
Now we assume that \(\sigma ^2\) is known, to simplify the following derivations, and minimize \(\varepsilon ^\prime \varepsilon\) with respect to \(\beta\). The extension to the case where \(\sigma ^2\) is also unknown, obtained by maximizing the LL with respect to \(\sigma ^2\), is straightforward and available in textbooks. The matrix derivative of
\(2\beta ^\prime X^\prime y\) with respect to \(\beta\) is
\(2X^\prime y\), and the derivative of the quadratic form \(\beta ^\prime X^\prime X\beta\) is \(2X^\prime X\beta\). Setting the derivative equal to zero, we get the first-order condition (FOC) from calculus given by \(-2X^\prime y +2X^\prime X\beta =0\). Upon canceling the 2 and solving for \(\beta\), we have the ordinary least squares (OLS) estimator b as the solution:
$$\begin{aligned} b = (X^\prime X)^{-1}X^\prime y. \end{aligned}$$
(7)
Let us use the ‘cars’ data to illustrate a numerical implementation of the above formula.
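A sketch of that implementation in base R, using the matrix operators `t`, `%*%`, and `solve`:

```r
# construct y and X from the cars data; the column of ones is for the intercept
y <- cars$dist
X <- cbind(1, cars$speed)
# OLS estimator b = (X'X)^{-1} X'y of Eq. (7), computed from first principles
b <- solve(t(X) %*% X) %*% (t(X) %*% y)
b
```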
The output below of the above code shows that the OLS estimator b from first principles using the formula of Eq. (7) agrees with the R output in Table 1.
Score Function and Score Equation
We have verified that R software correctly estimates the OLS estimator derived by matrix algebra formulas. Next we verify that the OLS formula also maximizes the log-likelihood (LL). First, we define the score vector \(g^*\) as the derivative of the LL function:
$$\begin{aligned} g^*= (\partial \,{LL} /\partial \beta ) =2X^\prime (y -X\beta ) /(2\sigma ^2), \end{aligned}$$
(8)
evaluated at the true parameter values. The first-order condition (FOC) for maximization of the LL sets \(g^*=0\); this is also known as the score equation. (At the true parameter values the score has zero expectation, \(E(g^*)=0\).) We can cancel the 2 from the numerator and denominator of (8) and write the FOC as:
$$\begin{aligned} X^\prime (y -X\beta )/\sigma ^2 =0=(1/\sigma ^2) X^\prime \varepsilon . \end{aligned}$$
(9)
Now the LL-maximizing solution, upon setting the left-hand side equal to zero, is the same b defined above in Eq. (7):
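A sketch verifying this numerically on the cars data:

```r
# OLS solution of Eq. (7)
y <- cars$dist
X <- cbind(1, cars$speed)
b <- solve(t(X) %*% X, t(X) %*% y)
# the score X'(y - Xb) should vanish (up to rounding error) at the solution
score <- t(X) %*% (y - X %*% b)
round(score, 8)
```

The factor \(1/\sigma^2\) in Eq. (9) is positive and does not affect where the score is zero, so it is omitted.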
Fisher Information Matrix
The Fisher information matrix is defined as the variance-covariance matrix of the score vector (the expectation of the outer product of the score vector with itself):
$$\begin{aligned} I_{nf}=E(g^* g^{*\prime }) = (1/\sigma ^4) X^\prime E(\varepsilon \varepsilon ^\prime ) X =(1/\sigma ^2) X^\prime X, \end{aligned}$$
(10)
where we have used \(E(\varepsilon \varepsilon ^\prime )=\sigma ^{2} I_{T}\) from Eq. (2).
The second-order condition for a maximum from calculus is that the matrix of second-order partial derivatives should be negative definite. It can be stated in terms of the Fisher information matrix based on the second-order partials as:
$$\begin{aligned} I_{nf}=-E\left[ \frac{\partial ^2 LL}{\partial \beta \,\partial \beta ^\prime }\right] . \end{aligned}$$
(11)
Since the Fisher information matrix is the expectation of an outer product of the score vector with itself, it is non-negative definite. Thus the ML estimator of \(\beta\) in Eq. (7) satisfies both the first- and second-order conditions for a maximum. We use the usual estimate \(s^2\) of \(\sigma ^2\), the residual sum of squares (RSS) divided by the degrees of freedom (\(\hbox {df}=T-p-1\)).
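A sketch of that computation in base R:

```r
# residuals from the OLS fit of Eq. (7)
y <- cars$dist
X <- cbind(1, cars$speed)
b <- solve(t(X) %*% X, t(X) %*% y)
res <- y - X %*% b
RSS <- sum(res^2)           # residual sum of squares
df <- nrow(X) - ncol(X)     # T - p - 1 = 48 degrees of freedom
s2 <- RSS / df              # unbiased estimate of sigma^2
s2
```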
In the above code, s2 equals RSS/df.
Note that this output agrees with the square of the ‘Residual Std. Error’ of 15.380 reported in Table 1. Any slight discrepancy is due to rounding in the table; R retains superior numerical accuracy internally.
The estimated information matrix for our cars example is \((1/s^2) X^\prime X\), with the R output given next.
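A sketch of the estimated information matrix, plugging the estimate \(s^2\) into Eq. (10):

```r
y <- cars$dist
X <- cbind(1, cars$speed)
b <- solve(t(X) %*% X, t(X) %*% y)
s2 <- sum((y - X %*% b)^2) / (nrow(X) - ncol(X))   # estimate of sigma^2
Inf_mat <- (1 / s2) * t(X) %*% X                   # estimated Fisher information
Inf_mat
```

The inverse of this matrix gives the familiar variance-covariance matrix \(s^2 (X^\prime X)^{-1}\) of the OLS coefficients.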
Having set up the basic notation, we are ready to discuss one foundational contribution by Rao.