9.1. Introduction

We will adopt the same notations as in the previous chapters. Lower-case letters x, y, … will denote real scalar variables, whether mathematical or random. Capital letters X, Y, … will be used to denote real matrix-variate mathematical or random variables, whether square or rectangular matrices are involved. A tilde will be placed on top of letters such as \(\tilde {x},\tilde {y},\tilde {X},\tilde {Y}\) to denote variables in the complex domain. Constant matrices will for instance be denoted by A, B, C. A tilde will not be used on constant matrices unless the point is to be stressed that the matrix is in the complex domain. The determinant of a square matrix A will be denoted by |A| or det(A) and, in the complex case, the absolute value or modulus of the determinant of A will be denoted as |det(A)|. When matrices are square, their order will be taken as p × p, unless specified otherwise. When A is a full rank matrix in the complex domain, then AA* is Hermitian positive definite, an asterisk designating the complex conjugate transpose of a matrix. Additionally, dX will indicate the wedge product of all the distinct differentials of the elements of the matrix X. Letting the p × q matrix X = (x ij) where the x ij’s are distinct real scalar variables, \(\mathrm {d}X=\wedge _{i=1}^p\wedge _{j=1}^q\mathrm {d}x_{ij}\). For the complex matrix \(\tilde {X}=X_1+iX_2,\ i=\sqrt {-1}\), where X 1 and X 2 are real, \(\mathrm {d}\tilde {X}=\mathrm {d}X_1\wedge \mathrm {d}X_2\).

The requisite theory for the study of Principal Component Analysis has already been introduced in Chap. 1, namely, the problem of optimizing a real quadratic form that is subject to a constraint. We shall formulate the problem with respect to a practical situation consisting of selecting the most “relevant” variables in a study. Suppose that a scientist would like to devise a “good health” index in terms of certain indicators. After selecting a random sample of individuals belonging to a population that is homogeneous with respect to a variety of factors, such as age group, racial background and environmental conditions, she managed to secure measurements on p = 15 variables, including for instance, x 1: weight, x 2: systolic pressure, x 3: blood sugar level, and x 4: height. She now faces a quandary as there is an excessive number of variables and some of them may not be relevant to her investigation. A methodology is thus required for discarding the unimportant ones. As a result, the number of variables will be reduced and the interpretation of the results, possibly facilitated. So, what might be the most pertinent variables in any such study? If all the observations of a particular variable x j are concentrated around a certain value, μ j, then that variable is more or less predetermined. As an example, suppose that the height of the individuals comprising a study group is close to 1.8 meters and that, on the other hand, the weight measurements are comparatively spread out. On account of this, while height is not a particularly consequential variable in connection with this study, weight is. Accordingly, we can utilize the criterion: the larger the variance of a variable, the more relevant this variable is. Let the p × 1 vector X, X′ = (x 1, …, x p), encompass all the variables on which measurements are available. Let the covariance matrix associated with X be Σ, that is, Cov(X) = Σ. Since individual variables are special cases of linear functions, we may consider linear functions such as u = a 1 x 1 + ⋯ + a p x p = A′X = X′A, a prime designating a transpose, where

$$\displaystyle \begin{aligned} A=\begin{bmatrix}a_1\\ \vdots\\ a_p\end{bmatrix}\ \mbox{ and }\ X=\begin{bmatrix}x_1\\ \vdots\\ x_p\end{bmatrix}. \end{aligned}$$
(9.1.1)

Then,

$$\displaystyle \begin{aligned} \mathrm{Var}(u)=\mathrm{Var}(A'X)=\mathrm{Var}(X'A)=A'\varSigma A. \end{aligned} $$
(9.1.2)
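
As a quick numerical illustration of (9.1.2), the following sketch checks that the variance of u = A′X estimated from simulated data agrees with A′ΣA. The covariance matrix and the coefficient vector used here are illustrative choices of ours, not quantities taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) covariance matrix and coefficient vector
Sigma = np.array([[3.0, -1.0, 0.0],
                  [-1.0, 3.0, 1.0],
                  [0.0, 1.0, 3.0]])
A = np.array([0.5, -0.5, 1.0])

# Theoretical variance of u = A'X from (9.1.2)
var_theory = A @ Sigma @ A

# Empirical check with a large simulated sample
X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
var_empirical = (X @ A).var(ddof=1)

print(var_theory, var_empirical)   # the two values should be close
```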

9.2. Principal Components

As will be explained further, the central objective in connection with the derivation of principal components consists of maximizing A′ΣA. Such an exercise would indeed prove meaningless unless some constraint is imposed on A, considering that, for an arbitrary vector A, the minimum of A′ΣA occurs at zero and the maximum, at +∞, Σ = E[X − E(X)][X − E(X)]′ being either positive definite or positive semi-definite. Since Σ is symmetric and non-negative definite, its eigenvalues, denoted by λ 1 ≥ λ 2 ≥⋯ ≥ λ p ≥ 0, are real. Moreover, Σ being symmetric, there exists an orthonormal matrix P, PP′ = I, P′P = I, such that

$$\displaystyle \begin{aligned} P'\varSigma P=\varLambda=\mathrm{diag}(\lambda_1,\ldots,\lambda_p) \end{aligned}$$
(9.2.1)

and

$$\displaystyle \begin{aligned} \varSigma=P\varLambda P'=\lambda_1 P_1P_1^{\prime}+\cdots+\lambda_p P_pP_p^{\prime}\ , \end{aligned} $$
(9.2.2)

where P 1, …, P p constitute the columns of P, P i denoting a normalized eigenvector corresponding to λ i, i = 1, …, p; this is expounded for instance in Mathai and Haubold (2017a). Note that all real symmetric matrices, including those having repeated eigenvalues, can be diagonalized. Since the optimization problem is pointless when A is arbitrary, the search for an optimum shall be confined to vectors A such that A′A = 1, that is, vectors lying on the unit sphere in \(\Re ^p\). Without any loss of generality, the coefficients of the linear function can be selected so that the Euclidean norm of the coefficient vector is unity, in which case a minimum and a maximum will both exist. Hence, the problem can be restated as follows:

$$\displaystyle \begin{aligned} \mbox{Maximize } A'\varSigma A\mbox{ subject to }A'A=1. \end{aligned}$$
(i)

We will resort to the method of Lagrangian multipliers to optimize A′ΣA subject to the constraint A′A = 1. Let

$$\displaystyle \begin{aligned} \phi_1=A'\varSigma A-\lambda(A'A-1) \end{aligned}$$
(ii)

where λ is a Lagrangian multiplier. Differentiating ϕ 1 with respect to A and equating the result to a null vector (vector/matrix derivatives are discussed in Chap. 1, as well as in Mathai 1997), we have the following:

$$\displaystyle \begin{aligned} \frac{\partial\phi_1}{\partial A}=O\Rightarrow 2\varSigma A-2\lambda A=O\Rightarrow \varSigma A=\lambda A. \end{aligned}$$
(iii)

On premultiplying (iii) by A′, we have

$$\displaystyle \begin{aligned} A'\varSigma A=\lambda A'A=\lambda.{}\end{aligned} $$
(9.2.3)

In order to obtain a non-null solution for A in (iii), the coefficient matrix Σ − λI has to be singular or, equivalently, its determinant has to be zero, that is, |Σ − λI| = 0, which implies that λ is an eigenvalue of Σ, A being the corresponding eigenvector. Thus, it follows from (9.2.3) that the maximum of the quadratic form A′ΣA, subject to A′A = 1, is the largest eigenvalue of Σ:

$$\displaystyle \begin{aligned} \max_{A'A=1}[A'\varSigma A]=\lambda_1=\mbox{ the largest eigenvalue of}\ \varSigma. \end{aligned}$$

Similarly,

$$\displaystyle \begin{aligned} \min_{A'A=1}[A'\varSigma A]=\lambda_p=\mbox{ the smallest eigenvalue of}\ \varSigma.{}\end{aligned} $$
(9.2.4)
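
The extremal property in (9.2.4) is easy to check numerically: no unit vector produces a quadratic form exceeding λ 1 or falling below λ p. A minimal sketch, with an illustrative symmetric positive definite matrix of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[3.0, -1.0, 0.0],
                  [-1.0, 3.0, 1.0],
                  [0.0, 1.0, 3.0]])   # illustrative covariance matrix

eigvals = np.linalg.eigvalsh(Sigma)   # eigenvalues in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]

# Evaluate A' Sigma A over many random unit vectors A
A = rng.standard_normal((3, 10_000))
A /= np.linalg.norm(A, axis=0)
forms = np.einsum('ij,ik,kj->j', A, Sigma, A)

print(lam_min <= forms.min(), forms.max() <= lam_max)   # True True
```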

Since Var(A′X) = A′ΣA = λ, the largest variance associated with a linear combination A′X wherein the vector A is normalized is equal to λ 1, the largest eigenvalue of Σ, and letting A 1 be the normalized (\(A_1^{\prime }A_1=1\)) eigenvector corresponding to λ 1, \(u_1=A_1^{\prime }X\) will be that linear combination of X having the maximum variance. Thus, u 1 is called the first principal component; it is the linear function of X having the maximum variance. Although normalized, the vector A 1 is not unique, as \((-A_1)'(-A_1)\) is also equal to one. In order to ensure uniqueness, we will require that the first nonzero element of A 1, as well as that of the other p − 1 normalized eigenvectors, be positive. Recall that since Σ is a real symmetric matrix, the λ j’s are real and so are the corresponding eigenvectors. Consider the second largest eigenvalue λ 2 and determine the associated normalized eigenvector A 2; then \(u_2=A_2^{\prime }X\) will be the second principal component. Since the matrix P that diagonalizes Σ into the diagonal matrix of its eigenvalues is orthonormal, the normalized eigenvectors A 1, A 2, …, A p are necessarily orthogonal to each other, which means that the corresponding principal components \(u_1=A_1^{\prime }X, u_2=A_2^{\prime }X, \ldots ,u_p=A_p^{\prime }X\) will be uncorrelated. Let us see whether uncorrelated normalized eigenvectors could be obtained by making use of the above procedure. When constructing A 2, we can impose an additional condition to the effect that A′X should be uncorrelated with \(A_1^{\prime }X\), \(A_1^{\prime }\varSigma A_1\) being equal to λ 1, the largest eigenvalue of Σ. The covariance between A′X and \(A_1^{\prime }X\) is

$$\displaystyle \begin{aligned} \mathrm{Cov}(A'X,~A_1^{\prime}X)=A'\mathrm{Cov}(X)A_1=A'\varSigma A_1=A_1^{\prime}\varSigma A.{}\end{aligned} $$
(9.2.5)

Hence, we may require that \(A'\varSigma A_1=A_1^{\prime }\varSigma A=0\). However,

$$\displaystyle \begin{aligned} 0=A'\varSigma A_1=A'(\varSigma A_1)=A'\lambda_1 A_1=\lambda_1A'A_1\Rightarrow A'A_1=0. \end{aligned}$$
(iv)

Observe that λ 1 > 0, noting that Σ would be a null matrix if its largest eigenvalue were equal to zero, in which case no optimization problem would remain to be solved. Consider

$$\displaystyle \begin{aligned} \phi_2=A'\varSigma A-2\mu_1(A'\varSigma A_1-0)-\mu_2(A'A-1) \end{aligned}$$
(v)

where μ 1 and μ 2 are the Lagrangian multipliers. Now, differentiating ϕ 2 with respect to A and equating the result to a null vector, we have the following:

$$\displaystyle \begin{aligned} \frac{\partial \phi_2}{\partial A}=O\Rightarrow 2\varSigma A-2\mu_1(\varSigma A_1)-2\mu_2A=O. \end{aligned}$$
(vi)

Premultiplying (vi) by \(A_1^{\prime }\) yields

$$\displaystyle \begin{aligned} A_1^{\prime}\varSigma A-\mu_1 A_1^{\prime}\varSigma A_1-\mu_2A_1^{\prime}A=0\Rightarrow 0-\mu_1\lambda_1-0=0\Rightarrow \mu_1=0, \end{aligned}$$
(vii)

which entails that the added condition of uncorrelatedness with \(u_1=A_1^{\prime }X\) is superfluous. When determining A j, we could require that A′X be uncorrelated with the principal components \(u_1=A_1^{\prime }X,\ldots ,u_{j-1}=A_{j-1}^{\prime }X\); however, as it turns out, these conditions become redundant when optimizing A′ΣA subject to A′A = 1. Thus, after determining the normalized eigenvector A j (whose first nonzero element is positive) corresponding to the j-th largest eigenvalue of Σ, we form the j-th principal component, \(u_j=A_j^{\prime }X\), which will necessarily be uncorrelated with the preceding principal components. The question that arises at this juncture is whether all of u 1, …, u p are needed or whether a subset thereof would suffice. For instance, we could interrupt the computations when the variance of the principal component \(u_j=A_j^{\prime }X\), namely λ j, falls below a predetermined threshold, in which case we can regard the remaining principal components, u j+1, …, u p, as unimportant and omit the associated calculations. In this way, a reduction in the number of variables is achieved as the original number of variables p is reduced to j < p principal components. However, this reduction in the number of variables could be viewed as a compromise since the new variables are linear functions of all the original ones and so, may not be as interpretable in a real-life situation. Other drawbacks will be considered in the next section. Observe that since Var(u j) = λ j, j = 1, …, p, the fraction

$$\displaystyle \begin{aligned} \nu_1=\frac{\lambda_1}{\sum_{j=1}^p\lambda_j}=\mbox{ the proportion of the total variation accounted for by }\ u_1,{}\end{aligned} $$
(9.2.6)

and letting r < p,

$$\displaystyle \begin{aligned} \nu_r=\frac{\sum_{j=1}^r\lambda_j}{\sum_{j=1}^p\lambda_j}=\mbox{ the proportion of total variation accounted for by }\ u_1,\ldots,u_r.{}\end{aligned} $$
(9.2.7)

If ν 1 = 0.7, the first principal component u 1 accounts for 70% of the total variation in the original variables, that is, 70% of the total variation is due to u 1. If r = 3 and ν 3 = 0.99, then the first three principal components jointly account for 99% of the total variation. This proportion of the total variation can also serve as a stopping rule for the determination of the principal components. For example, when ν r as defined in (9.2.7) is, say, greater than or equal to 0.95, we may interrupt the determination of the principal components beyond u r.
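
The whole construction, namely the eigenvalues in decreasing order, the normalized eigenvectors with the sign convention that the first nonzero entry be positive, and the proportions ν r of (9.2.6)–(9.2.7), can be sketched as follows. The function name and the illustrative covariance matrix are ours, not the text's.

```python
import numpy as np

def principal_components(Sigma):
    """Return the eigenvalues in decreasing order and the matrix whose
    columns are the corresponding normalized eigenvectors A_1,...,A_p,
    each with its first nonzero entry made positive."""
    lam, P = np.linalg.eigh(Sigma)        # ascending eigenvalues
    order = np.argsort(lam)[::-1]
    lam, P = lam[order], P[:, order]
    for j in range(P.shape[1]):           # sign convention for uniqueness
        k = np.flatnonzero(np.abs(P[:, j]) > 1e-12)[0]
        if P[k, j] < 0:
            P[:, j] = -P[:, j]
    return lam, P

Sigma = np.array([[3.0, -1.0, 0.0],
                  [-1.0, 3.0, 1.0],
                  [0.0, 1.0, 3.0]])       # illustrative covariance matrix
lam, A = principal_components(Sigma)
nu = np.cumsum(lam) / lam.sum()           # proportions as in (9.2.7)
print(lam)          # variances of u_1,...,u_p
print(nu)           # stop once nu_r exceeds a chosen threshold, e.g. 0.95
```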

Example 9.2.1

Even though reducing the number of variables is a main objective of Principal Component Analysis, for illustrative purposes, we will consider a case involving three variables, that is, p = 3. Compute the principal components associated with the following covariance matrix:
$$\displaystyle V=\begin{bmatrix}\ \ 3&-1&\ \ 0\\ -1&\ \ 3&\ \ 1\\ \ \ 0&\ \ 1&\ \ 3\end{bmatrix}.$$

Solution 9.2.1

Let us verify that V is positive definite. The leading minors of V = V′ being
$$\displaystyle |3|=3>0,\ \ \begin{vmatrix}3&-1\\ -1&3\end{vmatrix}=8>0,\ \ |V|=21>0,$$
V > O. Let us compute the eigenvalues of V . Consider the equation |V − λI| = 0 ⇒ (3 − λ)[(3 − λ)2 − 1] − (3 − λ) = 0 ⇒ (3 − λ)[(3 − λ)2 − 2] = 0 \(\Rightarrow (3-\lambda )(3-\lambda -\sqrt {2})(3-\lambda +\sqrt {2})=0\). Hence the eigenvalues are \(\lambda _1=3+\sqrt {2}, ~\lambda _2=3,~ \lambda _3=3-\sqrt {2}\). Let us compute an eigenvector corresponding to \(\lambda _1=3+\sqrt {2}\). Consider (V − λ 1 I)X = O, that is,

(i)

There are three linear equations involving the x j’s in (i). Since the matrix V − λ 1 I is singular, we need only consider any two of these three linear equations and solve them to obtain a solution. Let the first equation be \(-\sqrt {2}x_1-x_2=0\Rightarrow x_2=-\sqrt {2}x_1\), and the second one be \(-x_1-\sqrt {2}x_2+x_3=0\Rightarrow -x_1+2x_1+x_3=0\Rightarrow x_3=-x_1\). Now, one solution is

Observe that X 1 also satisfies the third equation in (i) and that

Thus, the first principal component, denoted by u 1, is

$$\displaystyle \begin{aligned} u_1=A_1^{\prime}X=\frac{1}{2}[x_1-\sqrt{2}x_2-x_3].\end{aligned}$$
(ii)

Now, consider an eigenvector corresponding to λ 2 = 3. (V − λ 2 I)X = O provides the first equation: − x 2 = 0 ⇒ x 2 = 0, the third equation giving − x 1 + x 3 = 0 ⇒ x 1 = x 3. Therefore, one solution for X is

so that the second principal component is

$$\displaystyle \begin{aligned} u_2=\frac{1}{\sqrt{2}}[x_1+x_3].\end{aligned}$$
(iii)

It can readily be checked that \(A_2^{\prime }VA_2=3=\lambda _2\). Let us now obtain an eigenvector corresponding to the eigenvalue \(\lambda _3=3-\sqrt {2}\). Consider the linear system (V − λ 3)X = O. The first equation is \(\sqrt {2}x_1-x_2=0\Rightarrow x_2=\sqrt {2}x_1\), and the third one gives \(x_2+\sqrt {2}x_3=0\Rightarrow x_2=-\sqrt {2}x_3\). One solution is

The third principal component is then

$$\displaystyle \begin{aligned} u_3=\frac{1}{2}[x_1+\sqrt{2}x_2-x_3].\end{aligned}$$
(iv)

It is easily verified that \(A_3^{\prime }VA_3=\lambda _3=3-\sqrt {2}\). As well,

$$\displaystyle \begin{aligned} \mathrm{Var}(u_1)&=\frac{1}{4}[\mathrm{Var}(x_1)+2\,\mathrm{Var}(x_2)+\mathrm{Var}(x_3)-2\sqrt{2}\,\mathrm{Cov}(x_1,x_2)\\ &\ \ \ \ \ \ -2\,\mathrm{Cov}(x_1,x_3)+2\sqrt{2}\,\mathrm{Cov}(x_2,x_3)]\\ &=\frac{1}{4}[3+2\times 3+3-2\sqrt{2}(-1)-2(0)+2\sqrt{2}(1)]\\ &=\frac{1}{4}[3\times 4+4\sqrt{2}]=3+\sqrt{2}=\lambda_1. \end{aligned} $$

Similar calculations will confirm that Var(u 2) = 3 = λ 2 and \(\mathrm {Var}(u_3)=3-\sqrt {2}=\lambda _3\). Now, consider the covariance between u 1 and u 2:

$$\displaystyle \begin{aligned} \mathrm{Cov}(u_1,u_2)&=\frac{1}{2\sqrt{2}}[\mathrm{Var}(x_1)+\mathrm{Cov}(x_1,x_3)-\sqrt{2}\,\mathrm{Cov}(x_1,x_2)-\sqrt{2}\,\mathrm{Cov}(x_2,x_3)\\ &\ \ \ \ \ \ \ \ -\mathrm{Cov}(x_3,x_1)-\mathrm{Var}(x_3)]\\ &=\frac{1}{2\sqrt{2}}[\,3+0+\sqrt{2}-\sqrt{2}-0-3\,]=0.\end{aligned} $$

It can be likewise verified that Cov(u 1, u 3) = 0 and Cov(u 2, u 3) = 0. Note that u 1, u 2, and u 3 respectively account for 49.1%, 33.3% and 17.6% of the total variation. As none of these proportions is negligibly small, all three principal components are deemed relevant and, in this instance, it is not indicated to reduce the number of the original variables. Although we still end up with as many variables, the u j’s, j = 1, 2, 3, are uncorrelated, which was not the case for the original variables.
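
The computations of Example 9.2.1 can be replicated numerically. A short sketch, using the covariance matrix V given above, recovers the eigenvalues 3 ± √2 and 3 and confirms that the principal components are uncorrelated with variances λ j:

```python
import numpy as np

V = np.array([[3.0, -1.0, 0.0],
              [-1.0, 3.0, 1.0],
              [0.0, 1.0, 3.0]])

lam, P = np.linalg.eigh(V)          # ascending eigenvalues
lam, P = lam[::-1], P[:, ::-1]      # reorder: largest first
print(lam)                          # approx [4.4142, 3.0, 1.5858], i.e. 3+sqrt(2), 3, 3-sqrt(2)

# Covariance matrix of the principal components u = P'X is P'VP = diag(lam)
Cov_u = P.T @ V @ P
print(np.round(Cov_u, 10))          # diagonal matrix with the eigenvalues
print(lam / lam.sum())              # proportions: about 0.49, 0.33, 0.18
```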

9.3. Issues to Be Mindful of when Constructing Principal Components

Since variances and covariances are expressed in units of measurement, principal components also depend upon the scale on which measurements on the individual variables are made. If we change the units of measurement, the principal components will differ. Suppose that x j is multiplied by a real scalar constant d j, j = 1, …, p, where some of the d j’s are not equal to 1. This is equivalent to changing the units of measurement of some of the variables. Let

$$\displaystyle \begin{aligned} Y=DX \mbox{ with } D=\mathrm{diag}(d_1,\ldots,d_p), \end{aligned}$$
(9.3.1)

and consider the linear functions A′X and A′Y . Then,

$$\displaystyle \begin{aligned} \mathrm{Var}(A'X)=A'\varSigma A,~ \mathrm{Var}(A'Y)=A'\mathrm{Var}(DX)A=A'D\varSigma D A.{}\end{aligned} $$
(9.3.2)

Since the eigenvalues of Σ and DΣD differ, so will the corresponding principal components, and if the original variables are measured in various units of measurements, it would be advisable to attempt to standardize them. Letting R denote the correlation matrix which is scale-invariant, observe that the covariance matrix Σ = Σ 1 R Σ 1 where Σ 1 = diag(σ 1, …, σ p), \(\sigma _j^2\) being the variance of x j, j = 1, …, p. Thus, if the original variables are scaled by the inverses of their associated standard deviations, that is, \(\frac {x_j}{\sigma _j}\) or equivalently, via the transformation \(\varSigma _1^{-1}X\), the resulting covariance matrix is \(\mathrm {Cov}(\varSigma _1^{-1}X)=R,\) the correlation matrix. Accordingly, constructing the principal components by making use of R instead of Σ, will mitigate the issue stemming from the scale of measurement.
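
A small sketch of this standardization, assuming an illustrative covariance matrix with unequal variances: the correlation matrix is obtained as \(\varSigma_1^{-1}\varSigma\varSigma_1^{-1}\), and the principal components are then extracted from R rather than Σ.

```python
import numpy as np

Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 9.0, -2.0],
                  [0.5, -2.0, 1.0]])       # illustrative, unequal variances

sd = np.sqrt(np.diag(Sigma))               # sigma_1,...,sigma_p
Sigma1_inv = np.diag(1.0 / sd)
R = Sigma1_inv @ Sigma @ Sigma1_inv        # correlation matrix, unit diagonal

lam_S = np.linalg.eigvalsh(Sigma)[::-1]
lam_R = np.linalg.eigvalsh(R)[::-1]
print(lam_S)    # scale-dependent eigenvalues of Sigma
print(lam_R)    # scale-free eigenvalues of R; their sum equals p
```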

If λ 1, …, λ p are the eigenvalues of Σ, then \(\lambda _1^k,\ldots ,\lambda _p^k\) will be the eigenvalues of Σ k. Moreover, Σ and Σ k will share the same eigenvectors. Note that the collection \((\lambda _1^k,\ldots ,\lambda _p^k)\) will be well separated compared to the set (λ 1, …, λ p) when the λ i’s are distinct and greater than one. Hence, in some instances, it might be preferable to construct principal components by making use of Σ k for an appropriate value of k instead of Σ. Observe that, in certain situations, it is not feasible to provide a physical interpretation of the principal components, which are linear functions of the original x 1, …, x p. Nonetheless, they can at times be informative by pointing to the average of certain variables (for example, u 2 in the previous numerical example) or by eliciting contrasts between two sets of variables (for example, u 3 in the previous numerical example, which opposes x 3 to x 1 and x 2).

To illustrate how the eigenvalues of a correlation matrix are determined, we revisit Example 9.2.1 wherein the variances of the x j’s or the diagonal elements of the covariance matrix are all equal to 3. Thus, in this case, the correlation matrix is \(R=\frac {1}{3}V \), and the eigenvalues of R will be \(\frac {1}{3}\) times the eigenvalues of V . However, the normalized eigenvectors will remain identical to those of V , and therefore the principal components will not change. We now consider an example wherein the diagonal elements of the covariance matrix are different.

Example 9.3.1

Let X′ = (x 1, x 2, x 3) where the x j’s are real scalar random variables. Compute the principal components resulting from R, the correlation matrix of X, where
$$\displaystyle R=\begin{bmatrix}1&0&\frac{1}{\sqrt{6}}\\ 0&1&-\frac{1}{\sqrt{6}}\\ \frac{1}{\sqrt{6}}&-\frac{1}{\sqrt{6}}&1\end{bmatrix}.$$

Solution 9.3.1

Let us compute the eigenvalues of R. Consider the equation

$$\displaystyle \begin{aligned} |R-\lambda I|=0&\Rightarrow (1-\lambda)\Big[(1-\lambda)^2-\frac{1}{6}\Big]+\frac{1}{\sqrt{6}}\Big[-\frac{1}{\sqrt{6}}(1-\lambda)\Big]=0\\ &\Rightarrow (1-\lambda)\Big[(1-\lambda)^2-\frac{1}{3}\Big]=0.\end{aligned} $$

Thus, the eigenvalues are \(\lambda _1=1+\frac {1}{\sqrt {3}},~\lambda _2=1,~\lambda _3=1-\frac {1}{\sqrt {3}}\). Let us compute an eigenvector corresponding to \(\lambda _1=1+\frac {1}{\sqrt {3}}\). Consider the system (R − λ 1 I)X = O. The first equation is

$$\displaystyle \begin{aligned} -\frac{1}{\sqrt{3}}x_1+\frac{1}{\sqrt{6}}x_3=0\Rightarrow x_1=\frac{1}{\sqrt{2}}x_3, \end{aligned}$$

the second equation being

$$\displaystyle \begin{aligned} -\frac{1}{\sqrt{3}}x_2-\frac{1}{\sqrt{6}}x_3=0\Rightarrow x_2=-\frac{1}{\sqrt{2}}x_3. \end{aligned}$$

Let us take \(x_3=\sqrt {2}\). Then, an eigenvector corresponding to λ 1 is

Let us verify that the third equation is also satisfied by X 1 or A 1. This is the case since \(\frac {1}{\sqrt {6}}+\frac {1}{\sqrt {6}}-\frac {\sqrt {2}}{\sqrt {3}}=0\), and the first principal component is indeed \(u_1=\frac {1}{2}[x_1-x_2+\sqrt {2}x_3]\). As well, the variance of u 1 is equal to λ 1:

$$\displaystyle \begin{aligned} \mathrm{Var}(u_1)&=\frac{1}{4}[\mathrm{Var}(x_1)+\mathrm{Var}(x_2)+2\mathrm{Var}(x_3)-2\mathrm{Cov}(x_1,x_2)\\ &\ \ \ \ \ \ \,+2\sqrt{2}\mathrm{Cov}(x_1,x_3)-2\sqrt{2}\mathrm{Cov}(x_2,x_3)]\\ &=\frac{1}{4}\Big[1+1+2+\frac{2\sqrt{2}}{\sqrt{6}}+\frac{2\sqrt{2}}{\sqrt{6}}\Big]=1+\frac{1}{\sqrt{3}}=\lambda_1.\end{aligned} $$

Let us compute an eigenvector corresponding to the eigenvalue λ 2 = 1. Consider the linear system (R − λ 2 I)X = O. The first equation, \(\frac {1}{\sqrt {6}}x_3=0, \) gives x 3 = 0 and the second one also yields x 3 = 0; as for the third one, \(\frac {x_1}{\sqrt {6}}-\frac {x_2}{\sqrt {6}}=0\Rightarrow x_1=x_2\). Letting x 1 = 1, an eigenvector is

Thus, the second principal component is \(u_2=\frac {1}{\sqrt {2}}[x_1+x_2]\). In the case of λ 3, we consider the system (R − λ 3 I)X = O whose first equation is

$$\displaystyle \begin{aligned} \frac{1}{\sqrt{3}}x_1+\frac{1}{\sqrt{6}}x_3=0\Rightarrow x_1=-\frac{1}{\sqrt{2}}x_3,\end{aligned}$$

the second one being

$$\displaystyle \begin{aligned} \frac{x_2}{\sqrt{3}}+\frac{x_3}{\sqrt{6}}=0\Rightarrow x_2=\frac{1}{\sqrt{2}}x_3. \end{aligned}$$

Letting \(x_3=\sqrt {2}\), an eigenvector corresponding to λ 3 is

For the matrix [A 1, A 2, A 3] to be uniquely determined, we may multiply A 3 by − 1 so that the first nonzero element is positive. Hence, the third principal component is \(u_3=\frac {1}{2}[x_1-x_2-\sqrt {2}x_3]\). As was shown in the case of u 1, it can also be verified that \(\mathrm {Var}(u_2)=1=\lambda _2,~ \mathrm {Var}(u_3)=\lambda _3=1-\frac {1}{\sqrt {3}}\), Cov(u 1, u 2) = 0, Cov(u 1, u 3) = 0 and Cov(u 2, u 3) = 0. We note that

$$\displaystyle \begin{aligned} \frac{\lambda_1}{\lambda_1+\lambda_2+\lambda_3}&=\frac{1+1/\sqrt{3}}{3}\approx 0.525,\\ \frac{\lambda_1+\lambda_2}{\lambda_1+\lambda_2+\lambda_3}&=\frac{2+1/\sqrt{3}}{3}\approx 0.859.\end{aligned} $$

Thus, almost 53% of the total variation is accounted for by the first principal component and nearly 86% of the total variation is due to the first two principal components.

9.4. The Vector of Principal Components

Observe that the determinant of a matrix is the product of its eigenvalues and that its trace is the sum of its eigenvalues, that is, |Σ| = λ 1λ p and tr(Σ) = λ 1 + ⋯ + λ p. As previously pointed out, the determinant of a covariance matrix corresponds to Wilks’ concept of generalized variance. Let us consider the vector of principal components. The principal components are \(u_j=A_j^{\prime }X,\) with \( \mathrm {Var}(u_j)=A_j^{\prime }\varSigma A_j=\lambda _j,~j=1,\ldots ,p, \) and Cov(u i, u j) = 0 for all ij. Thus,

$$\displaystyle \begin{aligned} \mathrm{Cov}(U)=\mathrm{diag}(\lambda_1,\ldots,\lambda_p)=\varLambda,\quad U'=(u_1,\ldots,u_p). \end{aligned}$$
(9.4.1)

The determinant of the covariance matrix is the product λ 1λ p and its trace, the sum λ 1 + ⋯ + λ p. Hence the following result:

Theorem 9.4.1

Let X be a p × 1 real vector whose associated covariance matrix is Σ. Let the principal components of X be denoted by \(u_j=A_j^{\prime }X\) with \(\mathrm {Var}(u_j)=A_j^{\prime }\varSigma A_j=\lambda _j= j\) -th largest eigenvalue of Σ, and U′ = (u 1, …, u p) with Cov(U) ≡ Σ u . Then, |Σ u| = |Σ| =  product of the eigenvalues, λ 1λ p , and tr(Σ u) = tr(Σ) =  sum of the eigenvalues, λ 1 + ⋯ + λ p . Observe that the determinant as well as the eigenvalues and the trace are invariant with respect to orthonormal transformations or rotations of the coordinate axes.

Let A = [A 1, A 2, …, A p]. Then

$$\displaystyle \begin{aligned} \varSigma A=A\varLambda,~ A'A=I,~ A'\varSigma A=\varLambda.{}\end{aligned} $$
(9.4.2)

Note that U′ = (u 1, …, u p) where u 1, …, u p are linearly independent. We have not assumed any distribution for the p × 1 vector X so far. If the p × 1 vector X has a p-variate nonsingular Gaussian distribution, that is, X ∼ N p(O, Σ), Σ > O, then

$$\displaystyle \begin{aligned} u_j\sim N_1(0,~\lambda_j),~ U\sim N_p(O,~\varLambda).{}\end{aligned} $$
(9.4.3)

Since Principal Components involve variances or covariances, which are free of any location parameter, we may take the mean value vector to be a null vector without any loss of generality. Then, E[u j] = 0 by assumption, \(\mathrm {Var}(u_j)=A_j^{\prime }\varSigma A_j=\lambda _j,~j=1,\ldots ,p\) and Cov(U) = Λ = diag(λ 1, …, λ p). Accordingly, \(\frac {\lambda _1}{\lambda _1+\cdots +\lambda _p}\) is the proportion of the total variance accounted for by the largest eigenvalue λ 1, where λ 1 is equal to the variance of the first principal component. Similarly, \(\frac {\lambda _1+\cdots +\lambda _r}{\lambda _1+\cdots +\lambda _p}\) is the proportion of the total variance due to the first r principal components, r ≤ p.
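
The invariances stated in Theorem 9.4.1 are easy to confirm numerically. In the sketch below (the covariance matrix is an illustrative choice of ours), U = A′X has covariance matrix Λ, and its determinant and trace coincide with those of Σ.

```python
import numpy as np

Sigma = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])        # illustrative covariance matrix

lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]             # decreasing order

Sigma_u = A.T @ Sigma @ A                  # covariance matrix of U = A'X
print(np.round(Sigma_u, 10))               # diag(lambda_1,...,lambda_p)
print(np.linalg.det(Sigma_u), np.linalg.det(Sigma))   # equal: product of the eigenvalues
print(np.trace(Sigma_u), np.trace(Sigma))             # equal: sum of the eigenvalues
```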

Example 9.4.1

Let X ∼ N 3(μ, Σ), Σ > O, where
$$\displaystyle \mu=\begin{bmatrix}\ \ 1\\ \ \ 0\\ -1\end{bmatrix}\ \mbox{ and }\ \varSigma=\begin{bmatrix}2&\ \ 0&\ \ 1\\ 0&\ \ 2&-1\\ 1&-1&\ \ 3\end{bmatrix}.$$

Derive the densities of the principal components of X.

Solution 9.4.1

Let us determine the eigenvalues of Σ. Consider the equation
$$\displaystyle |\varSigma-\lambda I|=0\Rightarrow (2-\lambda)[(2-\lambda)(3-\lambda)-1]-(2-\lambda)=0\Rightarrow (2-\lambda)(\lambda-1)(\lambda-4)=0,$$
so that the eigenvalues are λ 1 = 4, λ 2 = 2 and λ 3 = 1.

Let us compute an eigenvector corresponding to λ 1 = 4. Consider the linear system (Σ − λ 1 I)X = O, whose first equation gives − 2x 1 + x 3 = 0 or \(x_1=\frac {1}{2}x_3\), the second equation, − 2x 2 − x 3 = 0, yielding \(x_2=-\frac {1}{2}x_3\). Letting x 3 = 2, one solution is

Thus, the first principal component is \(u_1=A_1^{\prime }X=\frac {1}{\sqrt {6}}[x_1-x_2+2x_3]\) with \(E[u_1]=\frac {1}{\sqrt {6}}[1-0-2]=-\frac {1}{\sqrt {6}}\) and

$$\displaystyle \begin{aligned} \mathrm{Var}(u_1)&=\frac{1}{6}[\mathrm{Var}(x_1)+\mathrm{Var}(x_2)+4\mathrm{Var}(x_3)-2\mathrm{Cov}(x_1,x_2)\\ & \ \ \ \ \ \ +4\mathrm{Cov}(x_1,x_3)-4\mathrm{Cov}(x_2,x_3)]\\ &=\frac{1}{6}[2+2+4\times 3-2(0)+4(1)-4(-1)]=4=\lambda_1.\end{aligned} $$

Since u 1 is a linear function of normal variables, u 1 has the following Gaussian distribution:

$$\displaystyle \begin{aligned} u_1\sim N_1\Big(\!\!-\frac{1}{\sqrt{6}}\,,\,4\Big). \end{aligned}$$

Let us compute an eigenvector corresponding to λ 2 = 2. Consider the system (Σ − λ 2 I)X = O, whose first and second equations give x 3 = 0, the third equation x 1 − x 2 + x 3 = 0 yielding x 1 = x 2. Hence, one solution is

and the second principal component is \(u_2=A_2^{\prime }X=\frac {1}{\sqrt {2}}[x_1+x_2]\) with \(E[u_2]=\frac {1}{\sqrt {2}}[1+0]=\frac {1}{\sqrt {2}}\). Let us verify that the variance of u 2 is λ 2 = 2:

$$\displaystyle \begin{aligned} \mathrm{Var}(u_2)=\frac{1}{2}[\mathrm{Var}(x_1)+\mathrm{Var}(x_2)+2\mathrm{Cov}(x_1,x_2)]=\frac{1}{2}[2+2+2(0)]=2=\lambda_2.\end{aligned}$$

Hence, u 2 has the following real univariate normal distribution:

$$\displaystyle \begin{aligned} u_2\sim N_1\Big(\frac{1}{\sqrt{2}}\,,\,2\Big). \end{aligned}$$

We finally construct an eigenvector associated with λ 3 = 1. Consider the linear system (Σ − λ 3 I)X = O. In this case, the first equation is x 1 + x 3 = 0 ⇒ x 1 = −x 3 and the second one is x 2 − x 3 = 0 or x 2 = x 3. Let x 3 = 1. One eigenvector is

In order to ensure the uniqueness of the matrix [A 1, A 2, A 3], we may multiply A 3 by − 1 to ensure that the first nonzero element is positive. Then, the third principal component is \(u_3=\frac {1}{\sqrt {3}}[x_1-x_2-x_3]\) with \(E[u_3]=\frac {1}{\sqrt {3}}[1-0+1]=\frac {2}{\sqrt {3}}\). The variance of u 3 is indeed equal to λ 3 as

$$\displaystyle \begin{aligned} \mathrm{Var}(u_3)&=\frac{1}{3}[\mathrm{Var}(x_1)+\mathrm{Var}(x_2)+\mathrm{Var}(x_3)-2\mathrm{Cov}(x_1,x_2)\\ &\ \ \ \ \ \ -2\mathrm{Cov}(x_1,x_3)+2\mathrm{Cov}(x_2,x_3)]\\ & =\frac{1}{3}[2+2+3-2(0)-2(1)+2(-1)]=1.\end{aligned} $$

Thus,

$$\displaystyle \begin{aligned} u_3\sim N_1\Big(\frac{2}{\sqrt{3}}\,,\,1\Big). \end{aligned}$$

It can easily be verified that Cov(u 1, u 2) = 0, Cov(u 1, u 3) = 0 and Cov(u 2, u 3) = 0. Accordingly, letting U′ = (u 1, u 2, u 3), we have U ∼ N 3(ν, Λ) with \(\nu'=(-\frac{1}{\sqrt{6}},\,\frac{1}{\sqrt{2}},\,\frac{2}{\sqrt{3}})\) and Λ = diag(4, 2, 1).

As well, ΣA j = λ j A j, j = 1, 2, 3, that is, ΣA 1 = 4A 1, ΣA 2 = 2A 2 and ΣA 3 = A 3.

This completes the computations.

9.4.1. Principal components viewed from differing perspectives

Let X be a p × 1 real vector whose covariance matrix is Cov(X) = Σ > O. Assume that E(X) = O. Then, X′Σ −1 X = c > 0 is commonly referred to as an ellipsoid of concentration, centered at the origin of the coordinate system, with X′X being the square of the distance between the origin and a point X on the surface of this ellipsoid. A principal axis of this ellipsoid is defined when this squared distance has a stationary point. The stationary points are determined by optimizing X′X subject to X′Σ −1 X = c > 0. Consider

$$\displaystyle \begin{aligned} w&=X'X-\lambda(X'\varSigma^{-1}X-c), \ \frac{\partial w}{\partial X}=O\Rightarrow 2X-2\lambda \varSigma^{-1}X=O\\ &\Rightarrow \varSigma^{-1}X=\frac{1}{\lambda}X\Rightarrow \varSigma X=\lambda X\end{aligned} $$
(i)

where λ is a Lagrangian multiplier. It is seen from (i) that λ is an eigenvalue of Σ and X is the corresponding eigenvector. Letting λ 1 ≥ λ 2 ≥⋯ ≥ λ p > 0 be the eigenvalues and A 1, …, A p be the corresponding eigenvectors, A 1, …, A p give the principal axes of this ellipsoid of concentration. It follows from (i) that

$$\displaystyle \begin{aligned} c=A_j^{\prime}\varSigma^{-1}A_j=\frac{1}{\lambda_j}A_j^{\prime}A_j\Rightarrow A_j^{\prime}A_j=\lambda_jc. \end{aligned}$$

Thus, the length of the j-th principal axis is \(2\sqrt {A_j^{\prime }A_j}=2\sqrt {\lambda _j\,c}\).

As another approach, consider a plane passing through the origin. The equation of this plane will be β′X = 0 where β is a p × 1 constant vector and X is the p × 1 vector of coordinates. Without any loss of generality, let β′β = 1. Let the p × 1 vector Y  be a point in the Euclidean space. The distance between this point and the plane is then |β′Y |. Letting Y  be a random point such that E(Y ) = O and Cov(Y ) = Σ > O, the expected squared distance from this point to the plane is E[β′Y ]2 = E[β′YY′β] = β′E(YY′)β = β′Σβ. Accordingly, the (p − 1)-dimensional planar manifold of closest fit to the point Y  is that plane whose coefficient vector β is such that β′Σβ is minimized subject to β′β = 1. This, once again, leads to the eigenvalue problem encountered in principal component analysis.
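
Both characterizations can be checked numerically: the j-th principal axis of the concentration ellipsoid X′Σ −1 X = c has length 2√(λ j c), and the plane β′X = 0 of closest expected fit has as its normal the eigenvector associated with the smallest eigenvalue, since that vector minimizes β′Σβ on the unit sphere. A minimal sketch with an illustrative Σ:

```python
import numpy as np

Sigma = np.array([[3.0, -1.0, 0.0],
                  [-1.0, 3.0, 1.0],
                  [0.0, 1.0, 3.0]])        # illustrative covariance matrix
c = 2.0                                    # level of the ellipsoid X' Sigma^{-1} X = c

lam, P = np.linalg.eigh(Sigma)             # ascending eigenvalues
axis_lengths = 2.0 * np.sqrt(lam * c)      # lengths of the principal axes, 2*sqrt(lambda_j c)
print(axis_lengths)

# Plane of closest fit: beta minimizing beta' Sigma beta subject to beta'beta = 1
beta = P[:, 0]                             # eigenvector of the smallest eigenvalue
rng = np.random.default_rng(2)
B = rng.standard_normal((3, 5000))
B /= np.linalg.norm(B, axis=0)             # random unit vectors for comparison
print(lam[0], np.einsum('ij,ik,kj->j', B, Sigma, B).min())  # random directions never do better
```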

9.5. Sample Principal Components

When X ∼ N p(O, Σ), Σ > O, the maximum likelihood estimator of Σ is \(\hat {\varSigma }=\frac {1}{n}S\) where n > p is the sample size and S is the sample sum of products matrix, as defined for instance in Mathai and Haubold (2017b). In general, whether Σ is nonsingular or singular and X is p-variate Gaussian or not, we may take \(\hat {\varSigma }\) as an estimate of Σ. The sample eigenvalues and eigenvectors are then available from \(\hat {\varSigma }\beta =k \beta \) where β ≠ O is a p × 1 eigenvector corresponding to the eigenvalue k of \(\hat {\varSigma }=\frac {1}{n}S\). For a non-null β, \((\hat {\varSigma }-kI)\beta =O\Rightarrow \hat {\varSigma }-k I\) is singular or \(|\hat {\varSigma }-k I|=0\) and k is a solution of

$$\displaystyle \begin{aligned} |\hat{\varSigma}-k I|=0\Rightarrow (\hat{\varSigma}-k_j I)B_j=O,~ ~ j=1,\ldots,p.{}\end{aligned} $$
(9.5.1)

We need only consider the normalized eigenvectors B j such that \(B_j^{\prime }B_j=1,~ j=1,\ldots ,p\). If the eigenvalues of Σ are distinct, it can be shown that the eigenvalues of \(\hat {\varSigma }\) are also distinct, k 1 > k 2 > ⋯ > k p almost surely. Note that even though \(B_j^{\prime }B_j=1\), B j is not unique as one also has \((-B_j)'(-B_j)=1\). Thus, we require that the first nonzero element of B j be positive to ensure uniqueness. Since \(\hat {\varSigma }\) is a real symmetric matrix, its eigenvalues k 1, …, k p are real, and so are the corresponding normalized eigenvectors B 1, …, B p. As well, there exists a full set of orthonormal eigenvectors \(B_1,\ldots ,B_p,~ B_j^{\prime }B_j=1,~ B_j^{\prime }B_i=0,~ i\ne j,~ j=1,\ldots ,p,\) such that for B = (B 1, …, B p),

$$\displaystyle \begin{aligned} B'\hat{\varSigma}B=K=\mathrm{diag}(k_1,\ldots,k_p),~ \hat{\varSigma}B=B K \mbox{ and } \hat{\varSigma}=k_1B_1B_1^{\prime}+\cdots+k_pB_pB_p^{\prime}.{}\end{aligned} $$
(9.5.2)

Also, observe that \(\frac {k_1}{k_1+\cdots +k_p}\) is the proportion of the total variation in the data which is accounted for by the first principal component. Similarly, \(\frac {k_1+\cdots +k_r}{k_1+\cdots +k_p}\) is the proportion of the total variation due to the first r principal components, r ≤ p.
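
A sketch of the sample computation described above: starting from a data matrix whose columns are the observation vectors X 1, …, X n (simulated here purely for illustration), form the deviation matrix, S, \(\hat{\varSigma}=\frac{1}{n}S\), and then the sample eigenvalues k j and coefficient vectors B j with the usual sign convention.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.multivariate_normal([1.0, 0.0, -1.0],
                            [[2.0, 0.0, 1.0],
                             [0.0, 2.0, -1.0],
                             [1.0, -1.0, 3.0]], size=n).T   # p x n sample matrix

xbar = X.mean(axis=1, keepdims=True)       # sample average vector
Xd = X - xbar                              # deviation matrix
S = Xd @ Xd.T                              # sample sum of products matrix
Sigma_hat = S / n                          # estimate of Sigma

k, B = np.linalg.eigh(Sigma_hat)
k, B = k[::-1], B[:, ::-1]                 # k_1 >= k_2 >= ... with matching B_j
for j in range(p):                         # first nonzero element positive
    i = np.flatnonzero(np.abs(B[:, j]) > 1e-12)[0]
    if B[i, j] < 0:
        B[:, j] = -B[:, j]

print(k)                                   # sample variances of the principal components
print(np.cumsum(k) / k.sum())              # proportions of the total variation
```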

Example 9.5.1

Let X be a 3 × 1 vector of real scalar random variables, X′ = [x 1, x 2, x 3], with E[X] = μ and covariance matrix Cov(X) = Σ > O where both μ and Σ are unknown. Let the following observation vectors be a simple random sample of size 5 from this population:

Compute the principal components of X from an estimate of Σ that is based on those observations.

Solution 9.5.1

An estimate of Σ is \(\hat {\varSigma }=\frac {1}{n}S\) where n is the sample size and S is the sample sum of products matrix. In this case, n = 5. We first compute S. To this end, we determine the sample average vector, the sample matrix, the deviation matrix and finally S. The sample average is \(\bar {X}=\frac {1}{n}[X_1+X_2+X_3+X_4+X_5]\) and the sample matrix is X = [X 1, X 2, X 3, X 4, X 5]. The matrix of sample means is \(\bar {\mathbf {X}}=[\bar {X},\bar {X},\bar {X},\bar {X},\bar {X}]\). The deviation matrix is \({\mathbf {X}}_d=\mathbf {X}-\bar {\mathbf {X}}\) and the sample sum of products matrix \(S={\mathbf {X}}_d{\mathbf {X}}_d^{\prime }=[\mathbf {X}-\bar {\mathbf {X}}][\mathbf {X}-\bar {\mathbf {X}}]'\). Based on the given random sample, these quantities are

An estimate of Σ is \(\hat {\varSigma }=\frac {1}{n}S=\frac {1}{5}S\). However, since the eigenvalues of \(\hat {\varSigma }\) are those of S multiplied by \(\frac {1}{5}\) and the normalized eigenvectors of \(\hat {\varSigma }\) and S will then be identical, we will work with S. The eigenvalues of S are available from the equation |S − λI| = 0. That is, \((4-\lambda )[(4-\lambda )(6-\lambda )-4]=0\Rightarrow (4-\lambda )[\lambda ^2-10\lambda +20]=0\Rightarrow \lambda _1=5+\sqrt {5},~\lambda _2=4,~\lambda _3=5-\sqrt {5}\). An eigenvector corresponding to \(\lambda _1=5+\sqrt {5}\) can be determined from the system (S − λ 1 I)X = O wherein the first equation gives x 1 = 0 and the third one yields \(2x_2+(1-\sqrt {5})x_3=0\). Taking x 3 = 2, \(x_2=-(1-\sqrt {5})\), and it is easily verified that these values also satisfy the second equation. Thus, an eigenvector, denoted by Y 1, is the following:

so that the first principal component is \(u_1=\frac {1}{\sqrt {10-2\sqrt {5}}}[(\sqrt {5}-1)x_2+2x_3]\). Let us verify that the variance of u 1 equals λ 1:

$$\displaystyle \begin{aligned} \mathrm{Var}(u_1)&=\frac{1}{10-2\sqrt{5}}[(\sqrt{5}-1)^2\mathrm{Var}(x_2)+4\mathrm{Var}(x_3)+4(\sqrt{5}-1)\mathrm{Cov}(x_2,x_3)]\\ &=\frac{1}{10-2\sqrt{5}}[(6-2\sqrt{5})(4)+4(6)+4(\sqrt{5}-1)(2)]=\frac{20(5+\sqrt{5})}{20}\\ &=5+\sqrt{5}=\lambda_1.\end{aligned} $$

An eigenvector corresponding to λ 2 = 4 is available from (S − λ 2 I)X = O. The first equation reduces to 0 = 0 and thus imposes no restriction, while the second and third equations yield x 3 = 0 and x 2 = 0. Taking x 1 = 1, an eigenvector corresponding to λ 2 = 4, denoted by Y 2, is

Thus, the second principal component is u 2 = x 1 and its variance is Var(x 1) = 4. An eigenvector corresponding to \(\lambda _3=5-\sqrt {5}\) can be determined from the linear system (S − λ 3 I)X = O whose first equation yields x 1 = 0, the third one giving \(2x_2+(1+\sqrt {5})x_3=0\). Taking x 3 = 2, \(x_2=-(1+\sqrt {5})\), and it is readily verified that these values satisfy the second equation as well. Then, an eigenvector, denoted by Y 3, is

In order to select the matrix [A 1, A 2, A 3] uniquely, we may multiply A 3 by − 1 so that the first nonzero element is positive. Thus, the third principal component is \(u_3=\frac {1}{\sqrt {10+2\sqrt {5}}}[(1+\sqrt {5})x_2-2x_3]\) and, as the following calculations corroborate, its variance is indeed equal to λ 3:

$$\displaystyle \begin{aligned} \mathrm{Var}(u_3)&=\frac{1}{10+2\sqrt{5}}[(1+\sqrt{5})^2\mathrm{Var}(x_2)+4\mathrm{Var}(x_3)-4(1+\sqrt{5})\mathrm{Cov}(x_2,x_3)]\\ &=\frac{1}{2(5+\sqrt{5})}[4(6+2\sqrt{5})+4(6)-4(1+\sqrt{5})(2)]=\frac{20}{5+\sqrt{5}}=5-\sqrt{5}=\lambda_3.\end{aligned} $$

Additionally,

$$\displaystyle \begin{aligned} \frac{\lambda_1}{\lambda_1+\lambda_2+\lambda_3}=\frac{5+\sqrt{5}}{14}\approx 0.52, \end{aligned}$$

that is, approximately 52% of the total variance is due to u 1 or nearly 52% of the total variation in the data is accounted for by the first principal component. Also observe that

That is,

which completes the computations.

9.5.1. Estimation and evaluation of the principal components

If X ∼ N p(O, Σ), Σ > O, then the maximum likelihood estimator of Σ is \(\hat {\varSigma }=\frac {1}{n}S\), where n is the sample size and S is the sample sum of products matrix. How can one evaluate the eigenvalues and eigenvectors in order to construct the principal components? One method consists of solving the polynomial equation |Σ − λI| = 0 for the population eigenvalues λ 1, …, λ p, or \(|\hat {\varSigma }-k I|=0\) for obtaining the sample eigenvalues k 1 ≥ k 2 ≥⋯ ≥ k p. Direct evaluation of the k j’s by solving the determinantal equation is not difficult when p is small. However, for large values of p, one has to resort to mathematical software or some iterative process. We will illustrate such an iterative process for the population values λ 1 > λ 2 > ⋯ > λ p and the corresponding normalized eigenvectors A 1, …, A p (using our notations). Let us assume that the eigenvalues are distinct. Consider the following equation for determining the eigenvalues and eigenvectors:

$$\displaystyle \begin{aligned} \varSigma A_j=\lambda_j A_j,~ j=1,\ldots,p.{}\end{aligned} $$
(9.5.3)

Take any initial p-vector W o that is not orthogonal to A 1, the eigenvector corresponding to the largest eigenvalue λ 1. If W o is orthogonal to A 1, then, in the iterative process, W j will not reach A 1. Let \(Y_o=\frac {1}{\sqrt {W_o^{\prime }W_o}}W_o\) be the normalized W 0. Consider the equation

$$\displaystyle \begin{aligned} \varSigma Y_{j-1}=W_j, \mbox{ that is, } \varSigma Y_0=W_1,~ \varSigma Y_1=W_2,\ldots,~ \ Y_j=\frac{1}{\sqrt{(W_j^{\prime}W_j)}}W_j, ~j=1, \ldots\ .{}\end{aligned} $$
(9.5.4)

Halt the iteration when W j approximately agrees with W j−1, that is, when W j converges to some W which will then be λ 1 A 1 or, equivalently, Y j converges to A 1. At each stage, compute the quadratic form \(\delta _j= Y_j^{\prime }\varSigma Y_j\) and make sure that δ j is increasing. Suppose that W j converges to some W for certain values of j or as j →∞. At that stage, the equation is ΣY = W where \(Y=\frac {1}{\sqrt {W'W}}W\), the normalized W. Then, the equation is \(\varSigma Y=\sqrt {W'W}Y\). In other words, \(\sqrt {W'W}=\lambda _1\) and Y = A 1, which are the largest eigenvalue of Σ and the corresponding eigenvector A 1. That is,

$$\displaystyle \begin{aligned} \lim_{j\to\infty}\sqrt{W_j^{\prime}W_j}&=\lambda_1\\ \lim_{j\to\infty}\Bigg[\frac{1}{\sqrt{(W_j^{\prime}W_j)}}W_j\Bigg]&=\lim_{j\to\infty}Y_j=A_1.\end{aligned} $$

The rate of convergence in (9.5.4) depends upon the ratio \(\frac {\lambda _2}{\lambda _1}\). If λ 2 is close to λ 1, then the convergence will be very slow. Hence, the larger the difference between λ 1 and λ 2, the faster the convergence. It may thus be advantageous to raise Σ to a suitable power m and initiate the iteration on this Σ m so that the gap between the resulting eigenvalues \(\lambda _1^m\) and \(\lambda _2^m\) is magnified accordingly, \(\lambda _j^m,~j=1,\ldots ,p,\) being the eigenvalues of Σ m. If an eigenvalue is equal to one, then Σ must first be multiplied by a constant so that all the resulting eigenvalues exceed one and become well separated when raised to the power m, which will not affect the eigenvectors. As well, observe that Σ and Σ m, m = 1, 2, …, share the same eigenvectors even though the eigenvalues are λ j and \(\lambda _j^m\), j = 1, …, p, respectively; in other words, the normalized eigenvectors remain the same. After obtaining the largest eigenvalue λ 1 and the corresponding normalized eigenvector A 1, consider

$$\displaystyle \begin{aligned} \varSigma_2=\varSigma-\lambda_1A_1A_1^{\prime}, ~\varSigma=\lambda_1A_1A_1^{\prime}+\lambda_2A_2A_2^{\prime}+\cdots+ \lambda_pA_pA_p^{\prime}, \end{aligned}$$

where A j is the column eigenvector corresponding to λ j so that \(A_jA_j^{\prime }\) is a p × p matrix for j = 1, 2, …, p. Now, carry out the iteration on Σ 2, as was previously done on Σ. This will produce λ 2 and the corresponding normalized eigenvector A 2. Note that λ 2 is the largest eigenvalue of Σ 2. Next, consider

$$\displaystyle \begin{aligned} \varSigma_3=\varSigma_2-\lambda_2A_2A_2^{\prime}\end{aligned}$$

and continue this process until all the required eigenvectors are obtained. Similarly, for small p, the sample eigenvalues k 1 ≥ k 2 ≥⋯ ≥ k p are available by solving the equation \(|\hat {\varSigma }-kI|=0\); otherwise, the sample eigenvalues and eigenvectors can be computed by applying the iterative process described above with Σ replaced by \(\hat {\varSigma }\), λ j interchanged with k j and B j substituted for the eigenvector A j, j = 1, …, p.
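
The iterative scheme of (9.5.3)–(9.5.4), together with the deflation step \(\varSigma_2=\varSigma-\lambda_1A_1A_1^{\prime}\), can be sketched as follows. The function names are ours, and the stopping rule (a simple tolerance on successive normalized vectors) is one reasonable choice among several.

```python
import numpy as np

def largest_eigenpair(Sigma, tol=1e-12, max_iter=10_000, seed=0):
    """Power iteration as in (9.5.4): W_j = Sigma Y_{j-1}, Y_j = W_j / ||W_j||."""
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal(Sigma.shape[0])
    Y /= np.linalg.norm(Y)                 # Y_0, almost surely not orthogonal to A_1
    for _ in range(max_iter):
        W = Sigma @ Y
        Y_new = W / np.linalg.norm(W)
        if np.linalg.norm(Y_new - Y) < tol:
            break
        Y = Y_new
    lam = float(Y_new @ Sigma @ Y_new)     # converges to the dominant eigenvalue
    return lam, Y_new

def all_eigenpairs(Sigma):
    """Repeatedly extract the dominant pair and deflate: Sigma_{j+1} = Sigma_j - lam_j A_j A_j'."""
    Sj = Sigma.copy()
    pairs = []
    for _ in range(Sigma.shape[0]):
        lam, A = largest_eigenpair(Sj)
        pairs.append((lam, A))
        Sj = Sj - lam * np.outer(A, A)
    return pairs

Sigma = np.array([[3.0, -1.0, 0.0],
                  [-1.0, 3.0, 1.0],
                  [0.0, 1.0, 3.0]])        # illustrative covariance matrix
for lam, A in all_eigenpairs(Sigma):
    print(round(lam, 6), np.round(A, 6))   # eigenvalues in decreasing order with eigenvectors
```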

Example 9.5.2

Principal component analysis is especially called for when the number of variables involved is large. For the sake of illustration, we will take p = 2. Consider the following 2 × 2 covariance matrix
$$\displaystyle \varSigma=\begin{bmatrix}\ \ 2&-1\\ -1&\ \ 1\end{bmatrix}.$$

Construct the principal components associated with this matrix while working with the symbolic representations of the eigenvalues λ 1 and λ 2 rather than their actual values.

Solution 9.5.2

Let us first evaluate the eigenvalues of Σ in the customary fashion. Consider

$$\displaystyle \begin{aligned} |\varSigma-\lambda I|=0&\Rightarrow (2-\lambda)(1-\lambda)-1=0\Rightarrow \lambda^2-3\lambda+1=0 \\ &\Rightarrow \lambda_1=\frac{1}{2}[3+\sqrt{5}],\ \lambda_2=\frac{1}{2}[3-\sqrt{5}].\end{aligned} $$
(i)

Let us compute an eigenvector corresponding to the largest eigenvalue λ 1. Then,

(ii)

Since the system is singular, we need only solve one of these equations. Letting x 2 = 1 in (ii), x 1 = (1 − λ 1). For illustrative purposes, we will complete the remaining steps with the general parameters λ 1 and λ 2 rather than their numerical values. Hence, one eigenvector is
$$\displaystyle C=\begin{bmatrix}1-\lambda_1\\ 1\end{bmatrix}.$$

Thus, the principal components are the following:

$$\displaystyle \begin{aligned} u_1&=\frac{1}{\Vert C\Vert}C'X=\frac{1}{\sqrt{1+(1-\lambda_1)^2}}\{(1-\lambda_1)x_1+x_2\}\\ u_2&=\frac{1}{\sqrt{1+(1-\lambda_2)^2}}\{(1-\lambda_2)x_1+x_2\}.\end{aligned} $$

Let us verify that Var(u 1) = λ 1 and Var(u 2) = λ 2. Note that

$$\displaystyle \begin{aligned} \Vert C\Vert^2=1+(1-\lambda_1)^2=\lambda_1^2-2\lambda_1+2 =(\lambda_1^2-3\lambda_1+1)+\lambda_1+1=\lambda_1+1, \end{aligned}$$
(iii)

given the characteristic equation (i). However,

$$\displaystyle \begin{aligned} \mathrm{Var}(C'X)&=\mathrm{Var}((1-\lambda_1)x_1+x_2)\\&=(1-\lambda_1)^2\mathrm{Var}(x_1)+\mathrm{Var}(x_2) +2(1-\lambda_1)\mathrm{Cov}(x_1,x_2)\\ &=2(1-\lambda_1)^2+1-2(1-\lambda_1)\\ &=2\lambda_1^2-2\lambda_1+1=2(\lambda_1^2-3\lambda_1+1) +4\lambda_1-1=4\lambda_1-1,\end{aligned} $$
(iv)

given (i), the characteristic (or starting) equation. We now have Var(C′X) = Var((1 − λ 1)x 1 + x 2) = 4λ 1 − 1 from (iv) and ∥C∥2 = λ 1 + 1 from (iii). Hence, we must have Var(C′X) = ∥C∥2 λ 1 = [1 + (1 − λ 1)2]λ 1 = (λ 1 + 1)λ 1 from (iii). Since

$$\displaystyle \begin{aligned} \lambda_1(\lambda_1+1)=\lambda_1^2+\lambda_1 =(\lambda_1^2-3\lambda_1+1)+4\lambda_1-1=4\lambda_1-1, \end{aligned}$$

agrees with (iv), the result is verified for u 1. Moreover, on replacing λ 1 by λ 2, we have Var(u 2) = λ 2.

9.5.2. L1 and L2 norm principal components

For any p × 1 vector Y , Y ′ = (y 1, …, y p), where y 1, …, y p are real quantities, the L2 and L1 norms are respectively defined as follows:

$$\displaystyle \begin{aligned} \Vert Y\Vert_2&=(y_1^2+\cdots+y_p^2)^{\frac{1}{2}}=[Y'Y]^{\frac{1}{2}}\Rightarrow \Vert Y\Vert_2^2=Y'Y \end{aligned} $$
(9.5.5)
$$\displaystyle \begin{aligned} \Vert Y\Vert_1&=|y_1|+|y_2|+\cdots+|y_p|=\sum_{j=1}^p|y_j|.\end{aligned} $$
(9.5.6)

In Sect. 9.5, we set up the principal component analysis on the basis of the sample sum of products matrix, assuming a p-variate real population, not necessarily Gaussian, having μ as its mean value vector and Σ > O as its covariance matrix. Then, we considered X j, j = 1, …, n, iid vector random variables, that is, a simple random sample of size n from this population such that E[X j] = μ and Cov(X j) = Σ > O, j = 1, …, n. We denoted the sample matrix by X = [X 1, …, X n], the sample average by \(\bar {X}=\frac {1}{n}[X_1+\cdots +X_n]\), the matrix of sample means by \(\bar {\mathbf {X}}=[\bar {X},\ldots ,\bar {X}]\) and the sample sum of products matrix by \(S=[\mathbf {X}-\bar {\mathbf {X}}][\mathbf {X}-\bar {\mathbf {X}}]'\). If the population mean value vector is the null vector, that is, μ = O, then we can take S = XX′. For convenience, we will consider this case. For determining the sample principal components, we then maximized

$$\displaystyle \begin{aligned} A'\mathbf{XX}'A \mbox{ subject to }A'A=1 \end{aligned}$$

where A is an arbitrary constant vector that will result in the coefficient vector of the principal components. This can also be stated in terms of maximizing the square of an L2 norm subject to A′A = 1, that is,

$$\displaystyle \begin{aligned} \max_{A'A=1}\Vert A'X\Vert_2^2 .\end{aligned} $$
(9.5.7)

When carrying out statistical analyses, it turns out that the L2 norm is more sensitive to outliers than the L1 norm. If we utilize the L1 norm, the problem corresponding to (9.5.7) is

$$\displaystyle \begin{aligned} \max_{A'A=1} \Vert A'X\Vert_1 .\end{aligned} $$
(9.5.8)

Observe that when μ = O, the sum of products matrix can be expressed as

$$\displaystyle \begin{aligned} S=\mathbf{XX}'=\sum_{j=1}^nX_jX_j^{\prime},~ \mathbf{X}=[X_1,\ldots,X_n], \end{aligned}$$

where X j is the j-th column of X or j-th sample vector. Then, the initial optimization problem can be restated as follows:

$$\displaystyle \begin{aligned} \max_{A'A=1}\Vert A'X\Vert_2^2=\max_{A'A=1}\sum_{j=1}^nA'X_jX_j^{\prime}A =\max_{A'A=1}\sum_{j=1}^n\Vert A'X_j\Vert_2^2\,, \end{aligned} $$
(9.5.9)

and the corresponding L1 norm optimization problem can be formulated as follows:

$$\displaystyle \begin{aligned} \max_{A'A=1}\Vert A'X\Vert_1=\max_{A'A=1}\sum_{j=1}^n\Vert A'X_j\Vert_1.{}\end{aligned} $$
(9.5.10)

We have obtained exact analytical solutions for the coefficient vector A in (9.5.9); however, this is not possible when attempting to optimize (9.5.10) subject to A′A = 1. Thankfully, iterative procedures are available in this case.
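
One simple iterative scheme for (9.5.10), in the spirit of the fixed-point algorithm of Kwak (2008), is sketched below; the details should be treated as our assumption rather than as a statement of that paper's exact procedure. The iteration alternately sets the signs s j = sign(A′X j) and updates A ∝ Σ j s j X j, which does not decrease Σ j |A′X j| and converges to a local maximizer.

```python
import numpy as np

def l1_pc_direction(X, max_iter=500, seed=0):
    """Approximate arg max over unit vectors a of sum_j |a' X_j| by a
    sign/update fixed-point iteration. X is p x n with columns X_1,...,X_n
    (assumed centered)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(X.shape[0])
    a /= np.linalg.norm(a)
    for _ in range(max_iter):
        s = np.sign(a @ X)                 # signs of a'X_j
        s[s == 0] = 1.0
        a_new = X @ s
        a_new /= np.linalg.norm(a_new)
        if np.allclose(a_new, a):
            break
        a = a_new
    return a

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 200))          # illustrative data, p = 3, n = 200
X -= X.mean(axis=1, keepdims=True)         # center the columns
a = l1_pc_direction(X)
print(a, np.abs(a @ X).sum())              # fitted direction and its L1 objective value
```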

Let us consider a generalization of the basic principal component analysis. Let W be a p × m, m < p, matrix of full rank m. Then, the general problem in L2 norm optimization is the following:

$$\displaystyle \begin{aligned} \max_{W'W=I}\mathrm{tr}(W'SW)=\max_{W'W=I}\sum_{j=1}^n\Vert W'X_j\Vert_2^2{}\end{aligned} $$
(9.5.11)

where I is the m × m identity matrix, the corresponding L1 norm optimization problem being

$$\displaystyle \begin{aligned} \max_{W'W=I}\sum_{j=1}^n\Vert W'X_j\Vert_1.{}\end{aligned} $$
(9.5.12)

In Sect. 9.4.1, we have considered a dual problem of minimization for constructing principal components. The dual problems of minimization corresponding to (9.5.11) and (9.5.12) can be stated as follows:

$$\displaystyle \begin{aligned} \min_{W'W=I}\sum_{j=1}^n\Vert X_j-WW'X_j\Vert_2^2{}\end{aligned} $$
(9.5.13)

with respect to the L2 norm, and

$$\displaystyle \begin{aligned} \min_{W'W=I}\sum_{j=1}^n\Vert X_j-WW'X_j\Vert_1{}\end{aligned} $$
(9.5.14)

with respect to the L1 norm. The form appearing in (9.5.13) suggests that the construction of principal components by making use of the L2 norm is also connected to general model building, Factor Analysis and related topics. The general mathematical problem pertaining to the basic structure in (9.5.13) is referred to as low-rank matrix factorization. Readers interested in such statistical or mathematical problems may refer to the survey article, Shi et al. (2017). L1 norm optimization problems are for instance discussed in Kwak (2008) and Nie and Huang (2016).
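
The dual formulation (9.5.13) says that the best m-dimensional subspace, in the L2 sense, is the one spanned by the leading eigenvectors: projecting each X j onto the columns of W and measuring the residual Σ j ∥X j − WW′X j∥² gives exactly the sum of the discarded eigenvalues of S. A quick numerical check with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, m = 5, 100, 2
X = rng.standard_normal((p, n))
X -= X.mean(axis=1, keepdims=True)
S = X @ X.T                                # sum of products matrix

lam, P = np.linalg.eigh(S)
lam, P = lam[::-1], P[:, ::-1]
W = P[:, :m]                               # p x m, columns = leading eigenvectors, W'W = I

residual = ((X - W @ (W.T @ X)) ** 2).sum()
print(residual, lam[m:].sum())             # the two numbers agree
```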

9.6. Distributional Aspects of Eigenvalues and Eigenvectors

Let us examine the distributions of the variances of the sample principal components and the coefficient vectors in the sample principal components. Let A be a p × p orthonormal matrix whose columns are denoted by A 1, …, A p, so that \(A_j^{\prime }A_j=1,~A_i^{\prime }A_j=0,~i\ne j,\) or AA′ = I, A′A = I. Let \(u_j=A_j^{\prime }X\) be the sample principal components for j = 1, …, p. Let the p × 1 vector X have a nonsingular normal density with the null vector as its mean value vector, that is, X ∼ N p(O, Σ), Σ > O. We can assume that the Gaussian distribution has a null mean value vector without any loss of generality since we are dealing with variances and covariances, which are free of any location parameter. Consider a simple random sample of size n from this normal population. Letting S denote the sample sum of products matrix, the maximum likelihood estimate of Σ is \(\hat {\varSigma }=\frac {1}{n}S\). Given that S has a Wishart distribution having n − 1 degrees of freedom, the density of \(\hat {\varSigma }\), denoted by \(f(\hat {\varSigma })\), is given by

$$\displaystyle \begin{aligned} f(\hat{\varSigma})=\frac{n^{\frac{mp}{2}}}{|2\varSigma|{}^{\frac{m}{2}}\varGamma_p(\frac{m}{2})}|\hat{\varSigma}|{}^{\frac{m}{2}-\frac{p+1}{2}}\mathrm{e}^{-\frac{n}{2}\mathrm{tr}({\varSigma}^{-1}\hat{\varSigma})}, \end{aligned}$$

where Σ is the population covariance matrix and m = n − 1, n being the sample size. Let k 1, …, k p be the eigenvalues of \(\hat {\varSigma }\); it was shown that k 1 ≥ k 2 ≥⋯ ≥ k p > 0 are actually the variances of the sample principal components. Letting B j, j = 1, …, p, denote the coefficient vectors of the sample principal components and B = (B 1, …, B p), it was established that \(B'\hat {\varSigma }B=\mathrm {diag}(k_1,\ldots ,k_p)\equiv D\). The first nonzero component of B j is required to be positive for j = 1, …, p, so that B be uniquely determined. Then, the joint density of B and D is available by transforming \(\hat {\varSigma }\) in terms of B and D. Let the joint density of B and D and the marginal densities of D and B be respectively denoted by g(D, B), g 1(D) and g 2(B). In light of the procedures and results presented in Chap. 8 or in Mathai (1997), we have the following joint density:

$$\displaystyle \begin{aligned} g(D,B)\mathrm{d}D\wedge\mathrm{d}B&=\frac{n^{\frac{mp}{2}}}{|2\varSigma|{}^{\frac{m}{2}}\varGamma_p(\frac{m}{2})}\Big\{\prod_{j=1}^pk_j\Big\}^{\frac{m}{2}-\frac{p+1}{2}}\Big\{\prod_{i<j}(k_i-k_j)\Big\}\\ &\ \ \ \ \times \mathrm{e}^{-\frac{n}{2}\mathrm{tr}((B'\varSigma^{-1}B)D)}h(B)\mathrm{d}D\end{aligned} $$
(9.6.1)

where h(B) is the differential element corresponding to B, which is given in Chap. 8 and in Mathai (1997). The marginal densities of D and B are not explicitly available in a convenient form for a general Σ. In that case, they can only be expressed in terms of hypergeometric functions of matrix argument and zonal polynomials. See, for example, Mathai et al. (1995) for a discussion of zonal polynomials. However, when Σ = I, the joint and marginal densities are available in closed forms. They are

$$\displaystyle \begin{aligned} g(D,B)\mathrm{d}D\wedge\mathrm{d}B&=\frac{n^{\frac{mp}{2}}}{2^{\frac{mp}{2}}\varGamma_p(\frac{m}{2})}\Big\{\prod_{j=1}^pk_j\Big\}^{\frac{m}{2}-\frac{p+1}{2}}\Big\{\prod_{i<j}(k_i-k_j)\Big\}\mathrm{e}^{-\frac{n}{2}(k_1+\cdots+k_p)}h(B)\mathrm{d}D, \end{aligned} $$
(9.6.2)
$$\displaystyle \begin{aligned} g_1(D)&=\frac{n^{\frac{mp}{2}}}{2^{\frac{mp}{2}}\varGamma_p(\frac{m}{2})}\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\frac{p}{2})} \Big\{\prod_{j=1}^pk_j\Big\}^{\frac{m}{2}-\frac{p+1}{2}} \mathrm{e}^{-\frac{n}{2}(k_1+\cdots+k_p)}\Big\{\prod_{i<j}(k_i-k_j)\Big\}, \end{aligned} $$
(9.6.3)
$$\displaystyle \begin{aligned} g_2(B)\mathrm{d}B&=\frac{\varGamma_p(\frac{p}{2})}{\pi^{\frac{p^2}{2}}}h(B),\ BB'=I=B'B,~ k_1>k_2> \cdots >k_p>0. \end{aligned} $$
(9.6.4)

Given the representation of the joint density g(D, B) appearing in (9.6.2), it is readily seen that D and B are independently distributed. Similar results are obtainable for Σ = σ 2 I where σ 2 > 0 is a real scalar quantity and I is the identity matrix.

Example 9.6.1

Show that the function g 1(D) given in (9.6.3) is a density for p = 2, n = 6, m = 5, and derive the density of k 1 for this special case.

Solution 9.6.1

We have \(\frac {p+1}{2}=\frac {3}{2},~ \frac {m}{2}-\frac {3}{2}=1\). As g 1(D) is manifestly nonnegative, it suffices to show that the total integral equals 1. When p = 2 and n = 6, the constant part of (9.6.3) is the following:

$$\displaystyle \begin{aligned} \frac{n^{\frac{mp}{2}}}{2^{\frac{mp}{2}}\varGamma_p(\frac{m}{2})}\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\frac{p}{2})}&=\frac{6^5}{2^5\varGamma_2(\frac{5}{2})}\frac{\pi^2}{\varGamma_2(1)} \\ &=\frac{3^5}{\sqrt{\pi}\varGamma(\frac{5}{2})\varGamma(2)}\frac{\pi^2}{\sqrt{\pi}\varGamma(1)\sqrt{\pi}} \\&=\frac{3^5}{\sqrt{\pi}(\frac{3}{2})(\frac{1}{2})\sqrt{\pi}} \frac{\pi^2}{\sqrt{\pi}\sqrt{\pi}} =4(3^4). \end{aligned} $$
(i)

As for the integral part, we have

$$\displaystyle \begin{aligned} \int_{k_1=0}^{\infty}&\int_{k_2=0}^{k_1}k_1k_2(k_1-k_2)\mathrm{e}^{-3(k_1+k_2)}\mathrm{d}k_1\wedge\mathrm{d}k_2\\ &=\int_{k_1=0}^{\infty}k_1^2\mathrm{e}^{-3k_1}\Big[\int_{k_2=0}^{k_1}k_2\mathrm{e}^{-3k_2}\mathrm{d}k_2\Big]\mathrm{d}k_1-\int_{k_1=0}^{\infty}k_1\mathrm{e}^{-3k_1} \Big[\int_{k_2=0}^{k_1}k_2^2\mathrm{e}^{-3k_2}\mathrm{d}k_2\Big]\mathrm{d}k_1\\ &=\int_{k_1=0}^{\infty}\Big(k_1^2\mathrm{e}^{-3k_1}\Big[-\frac{k_1}{3}\mathrm{e}^{-3k_1}+\frac{1}{3^2}(1-\mathrm{e}^{-3k_1})\Big]\\ &\ \ \ \ \qquad \ \ \ -k_1\mathrm{e}^{-3k_1}\Big[-\frac{k_1^2}{3}\mathrm{e}^{-3k_1}-\frac{2k_1}{3^2}\mathrm{e}^{-3k_1}+\frac{2}{3^3}(1-\mathrm{e}^{-3k_1})\Big]\Big)\mathrm{d}k_1 \end{aligned} $$
(ii)
$$\displaystyle \begin{aligned} &=-\frac{1}{3}[\varGamma(4)6^{-4}]+\frac{1}{9}[\varGamma(3)3^{-3}]-\frac{1}{3^2}[\varGamma(3)6^{-3}]\\ &\ \ \ +\frac{1}{3}[\varGamma(4)6^{-4}]+\frac{2}{3^2}[\varGamma(3)6^{-3}]-\frac{2}{3^3}[\varGamma(2)3^{-2}]+\frac{2}{3^3}[\varGamma(2)6^{-2}]\\ &=\frac{1}{3^2}[\varGamma(3)6^{-3}]+\frac{2}{3^3}[\varGamma(2)6^{-2}]+\frac{1}{3^2}[\varGamma(3)3^{-3}]-\frac{2}{3^3}[\varGamma(2)3^{-2}]\\ &=\Big[\frac{1}{4}+\frac{2}{4}+2-2\Big]\frac{1}{3^5}=\frac{1}{4(3^4)}. \end{aligned} $$
(iii)

The product of (i) and (iii) being equal to 1, this establishes that g 1(D) is indeed a density function for p = 2 and n = 6. Moreover, once k 2 has been integrated out, the integrand appearing in (ii), multiplied by the constant obtained in (i), yields the marginal density of k 1. Denoting the marginal density of k 1 by g 11(k 1), after some simplifications, we have

$$\displaystyle \begin{aligned} g_{11}(k_1)=4(3^4)\Big\{\Big(\frac{k_1^2}{3^2}+\frac{2k_1}{3^3}\Big)\mathrm{e}^{-6k_1}+\Big(\frac{k_1^2}{3^2}-\frac{2k_1}{3^3}\Big)\mathrm{e}^{-3k_1}\Big\},~0<k_1<\infty, \end{aligned}$$

and zero elsewhere. Similarly, given g 1(D), the marginal density of k 2 is obtained by integrating out k 1 from k 2 to ∞. This completes the computations.
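
A numerical check of this example: with the constant 4(3 4) obtained in (i), the joint density for p = 2, n = 6 integrates to one over 0 < k 2 < k 1 < ∞, and the stated marginal g 11(k 1) also integrates to one. A small sketch using scipy:

```python
import numpy as np
from scipy.integrate import dblquad, quad

const = 4 * 3**4     # normalizing constant computed in (i)

# Joint density of (k1, k2) for p = 2, n = 6, on 0 < k2 < k1 < infinity
joint = lambda k2, k1: const * k1 * k2 * (k1 - k2) * np.exp(-3 * (k1 + k2))
total, _ = dblquad(joint, 0, np.inf, 0, lambda k1: k1)
print(total)         # approximately 1

# Marginal density of k1 as stated in the solution
g11 = lambda k1: const * ((k1**2 / 9 + 2 * k1 / 27) * np.exp(-6 * k1)
                          + (k1**2 / 9 - 2 * k1 / 27) * np.exp(-3 * k1))
print(quad(g11, 0, np.inf)[0])   # approximately 1
```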

9.6.1. The distributions of the largest and smallest eigenvalues

One can determine the distribution of the j-th largest eigenvalue k j and thereby, the distribution of the largest one, k 1, as well as that of the smallest one, k p. Actually, many authors have been investigating the problem of deriving the distributions of the eigenvalues of a central Wishart matrix and of matrix-variate type-1 beta and type-2 beta distributions. Some of the test statistics discussed in Chap. 6 can also be treated as eigenvalue problems involving real and complex type-1 beta matrices. Let k 1 > k 2 > ⋯ > k p > 0 be the eigenvalues of \(\hat {\varSigma }=\frac {1}{n}S\), S being the Wishart matrix generated from a p-variate real Gaussian sample. Then, as given in (9.6.3), the joint density of k 1, …, k p, denoted by g 1(D), is the following:

$$\displaystyle \begin{aligned} g_1(D)\mathrm{d}D=\frac{n^{\frac{mp}{2}}}{2^{\frac{mp}{2}}\varGamma_p(\frac{m}{2})}\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\frac{p}{2})}\Big\{\prod_{j=1}^p k_j^{\frac{m}{2}-\frac{p+1}{2}}\Big\}\mathrm{e}^{-\frac{n}{2}(k_1+\cdots +k_p)}\Big\{\prod_{i<j}(k_i-k_j)\Big\}\mathrm{d}D{}\end{aligned} $$
(9.6.5)

where m = n − 1 is the number of degrees of freedom, n being the sample size. More generally, we may consider a p × p real matrix-variate gamma distribution having α as its shape parameter and aI, a > 0, as its scale parameter matrix, I denoting the identity matrix. Noting that the eigenvalues of aS are equal to the eigenvalues of S multiplied by the scalar quantity a, we may take a = 1 without any loss of generality. Let g(S)dS denote the density of the resulting gamma matrix and g(D) be g(S) expressed in terms of λ 1 > λ 2 > ⋯ > λ p > 0, the eigenvalues of S. Then, given the Jacobian of the transformation S → D specified in (9.6.7), the joint density of λ 1, …, λ p, denoted by g 1(D) is obtained as

$$\displaystyle \begin{aligned} g(D)\mathrm{d}S=\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\alpha)\varGamma_p(\frac{p}{2})}\Big\{\prod_{j=1}^p\lambda_j^{\alpha-\frac{p+1}{2}}\Big\}\mathrm{e}^{-(\lambda_1+\cdots +\lambda_p)}\Big\{\prod_{i<j}(\lambda_i-\lambda_j)\Big\}\mathrm{d}D\equiv g_1(D)\mathrm{d}D{}\end{aligned} $$
(9.6.6)

where dD = dλ 1 ∧… ∧dλ p. The corresponding p × p complex matrix-variate gamma distribution will have real eigenvalues, also denoted by λ 1 > ⋯ > λ p > 0, the matrix \(\tilde {S}\) being Hermitian positive definite in this case. Then, in light of the relationship (9.6a.2) between the differential elements of \(\tilde {S}\) and D, the joint density of λ 1, …, λ p, denoted by \(\tilde {g}_1(D)\), is given by

$$\displaystyle \begin{aligned} \tilde{g}(D)\mathrm{d}\tilde{S}=\frac{\pi^{p(p-1)}}{\tilde{\varGamma}_p(\alpha)\tilde{\varGamma}_p(p)}\Big\{\prod_{j=1}^p\lambda_j^{\alpha-p}\Big\}\mathrm{e}^{-(\lambda_1+\cdots +\lambda_p)}\Big\{\prod_{i<j}(\lambda_i-\lambda_j)^2\Big\}\mathrm{d}D\equiv \tilde{g}_1(D)\mathrm{d}D \end{aligned} $$
(9.6a.1)

where \(\tilde {g}(D)\) is \(\tilde {g}(\tilde {S})\) expressed in terms of D, \(\tilde {g}(\tilde {S})\) denoting the density function of a matrix-variate gamma random variable whose shape parameter and scale parameter matrix are α and I, respectively.

As explained for instance in Mathai (1997), when S and \(\tilde {S}\) are the p × p gamma matrices in the real and complex domains, the integration over the Stiefel manifold yields

$$\displaystyle \begin{aligned} \mathrm{d}S=\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\frac{p}{2})}\Big\{\prod_{i<j}(\lambda_i-\lambda_j)\Big\}\mathrm{d}D{}\end{aligned} $$
(9.6.7)

and

$$\displaystyle \begin{aligned} \mathrm{d}\tilde{S}=\frac{\pi^{p(p-1)}}{\tilde{\varGamma}_p(p)}\Big\{\prod_{i<j}(\lambda_i-\lambda_j)^2\Big\}\mathrm{d}D,{} \end{aligned} $$
(9.6a.2)

respectively. When endeavoring to derive the marginal density of λ j for any fixed j, the difficulty arises from the factor ∏i<j(λ i − λ j). So, let us first attempt to simplify this factor.

9.6.2. Simplification of the factor ∏i<j(λ i − λ j)

It is well known that one can write ∏i<j(λ i − λ j) as a Vandermonde determinant which, incidentally, has been utilized in connection with nonlinear transformations in Mathai (1997). That is,

$$\displaystyle \begin{aligned} \prod_{i<j}(\lambda_i-\lambda_j)=|A|,\ \ A=(a_{ij}),{}\end{aligned} $$
(9.6.8)

where the (i, j)-th element \(a_{ij}=\lambda _i^{p-j}\) for all i and j. We could consider a cofactor expansion of the determinant, |A|, consisting of expanding it in terms of the elements and their cofactors along any row. In this case, it would be advantageous to do so along the i-th row as the cofactors would then be free of λ i and the coefficients of the cofactors would only be powers of λ i. However, we would then lose the symmetry with respect to the elements of the cofactors in this instance. Since symmetry is required for the procedure to be discussed, we will consider the general expansion of a determinant, that is,

$$\displaystyle \begin{aligned} |A|=\sum_K(-1)^{\rho_K}a_{1k_1}a_{2k_2}\cdots a_{pk_p}=\sum_K(-1)^{\rho_K}\lambda_1^{p-k_1}\lambda_2^{p-k_2}\cdots \lambda_p^{p-k_p}{}\end{aligned} $$
(9.6.9)

where K = (k 1, …, k p) and (k 1, …, k p) is a given permutation of the numbers (1, 2, …, p). Defining ρ K as the number of transpositions needed to bring (k 1, …, k p) into the natural order (1, 2, …, p), \((-1)^{\rho _K}\) will correspond to a − sign for the corresponding term if ρ K is odd, the sign being otherwise positive. For example, for p = 3, the possible permutations are (1, 2, 3), (1, 3, 2), (2, 3, 1), (2, 1, 3), (3, 1, 2), (3, 2, 1), so that there are 3! = 6 terms. For the sequence (1, 2, 3), k 1 = 1, k 2 = 2 and k 3 = 3; for the sequence (1, 3, 2), k 1 = 1, k 2 = 3 and k 3 = 2, and so on, ∑K representing the sum of all such terms multiplied by \((-1)^{\rho _K}\). For (1, 2, 3), ρ K = 0 corresponding to a plus sign; for (1, 3, 2), ρ K = 1 corresponding to a minus sign, and so on. Other types of expansions of |A| could also be utilized. As it turns out, the general expansion given in (9.6.9) is the most convenient one for deriving the marginal densities.
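As a small numerical illustration of the expansion (9.6.9), the signed sum over all permutations may be compared with the product ∏i<j(λ i − λ j) for p = 3 and an arbitrary choice of λ 1 > λ 2 > λ 3 > 0; the sketch below assumes that numpy is available and is included only as a check.

```python
# Numerical check of (9.6.9) for p = 3: the signed permutation sum equals the
# Vandermonde product prod_{i<j}(lambda_i - lambda_j).
from itertools import permutations
import numpy as np

lam = np.array([3.0, 2.0, 0.5])      # arbitrary lambda_1 > lambda_2 > lambda_3 > 0
p = len(lam)

def sign(perm):
    # (-1)^{rho_K}, computed from the number of inversions of the permutation
    inv = sum(1 for i in range(p) for j in range(i + 1, p) if perm[i] > perm[j])
    return (-1) ** inv

perm_sum = sum(sign(K) * np.prod([lam[i] ** (p - K[i]) for i in range(p)])
               for K in permutations(range(1, p + 1)))
vandermonde = np.prod([lam[i] - lam[j] for i in range(p) for j in range(i + 1, p)])
print(perm_sum, vandermonde)         # both equal 3.75 for this choice of lam
```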

In the complex case,

$$\displaystyle \begin{aligned} \prod_{i<j}(\lambda_i-\lambda_j)^2=|A|{}^2=|AA'|=|A'A| \end{aligned}$$

where

$$\displaystyle \begin{aligned} A'=(\alpha_1,\alpha_2,\ldots ,\alpha_p),\ \ \alpha_j=(\lambda_j^{p-1},\lambda_j^{p-2},\ldots ,\lambda_j,\,1)',\ j=1,\ldots ,p. \end{aligned}$$
(i)

Observe that α j only contains λ j and that \(A'A=\alpha _1\alpha _1^{\prime }+\cdots +\alpha _p\alpha _p^{\prime }\), so that for instance the (i, j)-th element in \(\alpha _1\alpha _1^{\prime }\) is \(\lambda _1^{2p-(i+j)}\). Accordingly, the (i, j)-th element of \(\alpha _1\alpha _1^{\prime }+\cdots +\alpha _p\alpha _p^{\prime }\) is \(\sum _{r=1}^p\lambda _r^{2p-(i+j)}\). Thus, letting \(B=A'A=(b_{ij}),\ b_{ij}=\sum _{r=1}^p\lambda _r^{2p-(i+j)}\). Now, consider the expansion (9.6.9) of |B|, that is,

$$\displaystyle \begin{aligned} \prod_{i<j}(\lambda_i-\lambda_j)^2=|B|=|A'A|=\sum_K(-1)^{\rho_K}b_{1k_1}b_{2k_2}\cdots b_{pk_p} \end{aligned}$$
(ii)

where K = (k 1, …, k p) and (k 1, …, k p) is a permutation of the sequence (1, 2, …, p). Note that

$$\displaystyle \begin{aligned} b_{1k_1}&=\lambda_1^{2p-(1+k_1)}+ \lambda_2^{2p-(1+k_1)}+\cdots +\lambda_p^{2p-(1+k_1)}\\ b_{2k_2}&=\lambda_1^{2p-(2+k_2)}+\lambda_2^{2p-(2+k_2)} +\cdots +\lambda_p^{2p-(2+k_2)}\\ &\ \, \vdots\\ b_{pk_p}&=\lambda_1^{2p-(p+k_p)}+\lambda_2^{2p-(p+k_p)}+\cdots +\lambda_p^{2p-(p+k_p)}.\end{aligned} $$
(iii)

Let us write

$$\displaystyle \begin{aligned} b_{1k_1}b_{2k_2}\cdots b_{pk_p}=\sum_{r_1,\ldots ,r_p}\lambda_1^{r_1}\cdots \lambda_p^{r_p}. \end{aligned}$$
(iv)

Then,

$$\displaystyle \begin{aligned} \prod_{i<j}(\lambda_i-\lambda_j)^2=|B|=\sum_K(-1)^{\rho_K}\Big[\sum_{r_1,\ldots ,r_p}\lambda_1^{r_1}\cdots \lambda_p^{r_p}\Big]{} \end{aligned} $$
(9.6a.3)

where the r j’s, j = 1, …, p, are nonnegative integers. We may now express the joint density of the eigenvalues in a systematic way.
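Similarly, for a small p one may verify numerically the identity ∏i<j(λ i − λ j)2 = |A′A| underlying (9.6a.3); the brief sketch below assumes numpy and is included only as an illustration.

```python
# Check that prod_{i<j}(lambda_i - lambda_j)^2 = det(A'A) for the Vandermonde matrix A.
import numpy as np

lam = np.array([3.0, 2.0, 0.5])
p = len(lam)
A = np.vander(lam)                   # (i, j) entry is lam_i**(p - j), j = 1, ..., p
lhs = np.prod([(lam[i] - lam[j]) ** 2 for i in range(p) for j in range(i + 1, p)])
rhs = np.linalg.det(A.T @ A)
print(lhs, rhs)                      # should agree up to floating-point error
```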

9.6.3. The distributions of the eigenvalues

The joint density of λ 1, …, λ p as specified in (9.6.6) and (9.6a.1) can be expressed as follows. In the real case,

$$\displaystyle \begin{aligned} f(D)\mathrm{d}D&=\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\frac{p}{2})\varGamma_p(\alpha)} \Big(\prod_{j=1}^p\lambda_j^{\alpha-\frac{p+1}{2}}\Big)\mathrm{e}^{-(\lambda_1+\cdots +\lambda_p)}\Big(\sum_K(-1)^{\rho_K}\lambda_1^{p-k_1}\cdots \lambda_p^{p-k_p}\Big)\mathrm{d}D\\ {} &=\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\frac{p}{2})\varGamma_p(\alpha)}\sum_K(-1)^{\rho_K}(\lambda_1^{m_1}\cdots \lambda_p^{m_p})\,\mathrm{e}^{-(\lambda_1+\cdots +\lambda_p)}\mathrm{d}D \end{aligned} $$
(9.6.10)

with

$$\displaystyle \begin{aligned} m_j=\alpha-\frac{p+1}{2}+p-k_j. \end{aligned}$$
(i)

In the complex case, the joint density is

$$\displaystyle \begin{aligned} \tilde{f}(D)\mathrm{d}D&=\frac{\pi^{p(p-1)}}{\tilde{\varGamma}_p(p)\tilde{\varGamma}_p(\alpha)}\sum_K(-1)^{\rho_K}\sum_{r_1,\ldots ,r_p}(\lambda_1^{\alpha-p+r_1}\cdots \lambda_p^{\alpha-p+r_p})\,\mathrm{e}^{-(\lambda_1+\cdots +\lambda_p)}\mathrm{d}D\\ {} &=\frac{\pi^{p(p-1)}}{\tilde{\varGamma}_p(p)\tilde{\varGamma}_p(\alpha)}\sum_K(-1)^{\rho_K}\sum_{r_1,\ldots ,r_p}(\lambda_1^{m_1}\cdots \lambda_p^{m_p})\,\mathrm{e}^{-(\lambda_1+\cdots +\lambda_p)}\mathrm{d}D \end{aligned} $$
(9.6a.4)

with m j = α − p + r j and r j as defined in (9.6a.3). For convenience, we will use the same symbol m j for both the real and complex cases; however, in the real case, \(m_j=\alpha -\frac {p+1}{2}+p-k_j \) and, in the complex case, m j = α − p + r j.

The distributions of the largest, smallest and j-th largest eigenvalues were considered by many authors. Earlier works mostly dealt with eigenvalue problems associated with testing hypotheses on the parameters of one or more real Gaussian populations. In such situations, a one-to-one function of the likelihood ratio statistics could be expressed in terms of the eigenvalues of a real type-1 beta distributed matrix-variate random variable. In a series of papers, Pillai constructed the distributions of the seven largest eigenvalues in the type-1 beta distribution and produced percentage points as well; the reader may refer for example to Pillai (1964) and the references therein. In a series of papers including Khatri (1964), Khatri addressed the distributions of eigenvalues in the real and complex domains. In a series of papers, Krishnaiah and his co-authors dealt with various distributional aspects of eigenvalues, see for instance Krishnaiah et al. (1973). Clemm et al. (1973) computed upper percentage points of the distribution of the eigenvalues of the Wishart matrix. James (1964) considered the eigenvalue problem of different types of matrix-variate random variables and determined their distributions in terms of functions of matrix arguments and zonal polynomials. In a series of papers, Davis dealt with the distributions of eigenvalues by setting up and solving systems of differential equations, see for example Davis (1972). Edelman (1991) discussed the distributions and moments of the smallest eigenvalue of Wishart type matrices. Johnstone (2001) examined the distribution of the largest eigenvalue in Principal Component Analysis. Recently, Chiani (2014) and James and Lee (2021) discussed the distributions of the eigenvalues of Wishart matrices. The methods employed in these papers lead to representations of the distributions of eigenvalues in terms of Pfaffians of skew-symmetric matrices, incomplete gamma functions, multiple integrals, functions of matrix argument and zonal polynomials, ratios of determinants, solutions of differential equations, and so on. None of those methods yield tractable forms for the distribution or density functions of eigenvalues.

In the next subsections, we provide, in explicit form, the exact distribution of the j-th largest eigenvalue of a general real or complex matrix-variate gamma type matrix for any j, either as finite sums when a certain quantity is a positive integer or as products of infinite series in the general non-integer case. These include, for instance, the distributions of the largest and smallest eigenvalues as well as the joint distributions of several of the largest or smallest eigenvalues, and the results readily apply to the real and complex Wishart distributions.

9.6.4. Density of the smallest eigenvalue λ p in the real matrix-variate gamma case

We will initially examine the situation where \(m_j=\alpha -\frac {p+1}{2}+p-k_j\), the exponent appearing in (9.6.10) for the real matrix-variate gamma case, is a positive integer. We will integrate out λ 1, …, λ p−1 to obtain the marginal density of λ p. Since m j is a positive integer, we can integrate by parts. For instance,

$$\displaystyle \begin{aligned} \int_{\lambda_1=\lambda_2}^{\infty}\lambda_1^{m_1}\mathrm{e}^{-\lambda_1}\mathrm{d}\lambda_1&=[-\lambda_1^{m_1}\mathrm{e}^{-\lambda_1}]_{\lambda_2}^{\infty}+[-m_1\lambda_1^{m_1-1}\mathrm{e}^{-\lambda_1}]_{\lambda_2}^{\infty}+\cdots +[-m_1!\,\mathrm{e}^{-\lambda_1}]_{\lambda_2}^{\infty}\\ &=\sum_{\mu_1=0}^{m_1}\frac{m_1!}{(m_1-\mu_1)!}\,\lambda_2^{m_1-\mu_1}\mathrm{e}^{-\lambda_2},\end{aligned} $$
(i)

and integrating λ 1, …, λ j−1 gives the following:

$$\displaystyle \begin{aligned} &\int_{\lambda_1}\int_{\lambda_2}\cdots \int_{\lambda_{j-1}}\lambda_1^{m_1}\cdots \lambda_{j-1}^{m_{j-1}}\mathrm{e}^{-(\lambda_1+\cdots +\lambda_{j-1})}\mathrm{d}\lambda_1\wedge\ldots \wedge\mathrm{d}\lambda_{j-1}\end{aligned} $$
$$\displaystyle \begin{aligned} &=\sum_{\mu_1=0}^{m_1}\frac{m_1!}{(m_1-\mu_1)!}\sum_{\mu_2=0}^{m_1-\mu_1+m_2}\frac{(m_1-\mu_1+m_2)!}{2^{\mu_2+1}(m_1-\mu_1+m_2-\mu_2)!}\ \cdots \\ &\ \ \ \times\!\!\!\!\!\!\!\!\!\!\sum_{\mu_{j-1}=0}^{m_1-\mu_1+\cdots +m_{j-1}}\!\!\!\!\ \frac{(m_1-\mu_1+\cdots +m_{j-1})!}{(j-1)^{\mu_{j-1}+1}(m_1-\mu_1+\cdots +m_{j-1}-\mu_{j-1})!}\lambda_j^{m_1-\mu_1+\cdots +m_{j-1}-\mu_{j-1}}\mathrm{e}^{-(j-1)\lambda_j}\\ &\equiv\phi_{j-1}(\lambda_j).\end{aligned} $$
(ii)
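The integration-by-parts formula in (i) is easily verified numerically; the following sketch, which assumes that scipy is available and in which the values m 1 = 4 and λ 2 = 1.3 are arbitrary choices, compares the finite sum with a direct quadrature.

```python
# Check of (i): int_{c}^{inf} x^m e^{-x} dx = sum_{mu=0}^{m} m!/(m-mu)! c^(m-mu) e^(-c).
from math import factorial, exp
from scipy.integrate import quad
import numpy as np

m, c = 4, 1.3                        # m plays the role of m_1, c that of lambda_2
finite_sum = exp(-c) * sum(factorial(m) / factorial(m - mu) * c ** (m - mu)
                           for mu in range(m + 1))
integral, _ = quad(lambda x: x ** m * np.exp(-x), c, np.inf)
print(finite_sum, integral)          # the two values should agree
```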

Hence, the following result:

Theorem 9.6.1

When \(m_j=\alpha -\frac {p+1}{2}+p-k_j\) is a positive integer, where m j is defined in (9.6.10), the marginal density of the smallest eigenvalue λ p of the p × p real gamma distributed matrix with parameters (α, I), denoted by f 1p(λ p), is the following:

$$\displaystyle \begin{aligned} f_{1p}&(\lambda_p)\mathrm{d}\lambda_p \\&=c_K\,\phi_{p-1}(\lambda_p)\ \lambda_p^{m_p}\,\mathrm{e}^{-\lambda_p}\,\mathrm{d}\lambda_p\\ &=c_K\sum_{\mu_1=0}^{m_1}\frac{m_1!}{(m_1-\mu_1)!}\sum_{\mu_2=0}^{m_1-\mu_1+m_2}\frac{(m_1-\mu_1+m_2)!}{2^{\mu_2+1}(m_1-\mu_1+m_2-\mu_2)!}\ \cdots \sum_{\mu_{p-1}=0}^{m_1-\mu_1+\cdots +m_{p-1}}\\ {} &\times\frac{(m_1-\mu_1+\cdots +m_{p-1})!}{(p-1)^{\mu_{p-1}+1}(m_1-\mu_1+\cdots +m_{p-1}-\mu_{p-1})!}\lambda_{p}^{m_1-\mu_1+\cdots +m_{p-1}-\mu_{p-1}+m_p}\mathrm{e}^{-p\lambda_p}\mathrm{d}\lambda_p \end{aligned} $$
(9.6.11)

for 0 < λ p < ∞, where

$$\displaystyle \begin{aligned} c_K=\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\frac{p}{2})\varGamma_p(\alpha)}\sum_K(-1)^{\rho_K},\end{aligned}$$

ϕ j−1(λ j) is specified in (ii) and K = (k 1, …, k p) is defined in (9.6.9).
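As an illustration of Theorem 9.6.1, take \(p=2\) and \(\alpha=\frac{5}{2}\), an arbitrary choice for which \(m_j=\alpha-\frac{3}{2}+2-k_j=3-k_j\) is a positive integer for every permutation K. The permutations K = (1, 2) and K = (2, 1) yield (m 1, m 2) = (2, 1) and (1, 2), respectively, and the constant \(\pi^{\frac{p^2}{2}}/[\varGamma_p(\frac{p}{2})\varGamma_p(\alpha)]\) reduces to \(\frac{4}{3}\), so that the theorem gives

$$\displaystyle \begin{aligned} f_{12}(\lambda_2)&=\frac{4}{3}\Big[\Big(\sum_{\mu_1=0}^{2}\frac{2!}{(2-\mu_1)!}\lambda_2^{2-\mu_1}\Big)\mathrm{e}^{-\lambda_2}\,\lambda_2\,\mathrm{e}^{-\lambda_2}-\Big(\sum_{\mu_1=0}^{1}\frac{1!}{(1-\mu_1)!}\lambda_2^{1-\mu_1}\Big)\mathrm{e}^{-\lambda_2}\,\lambda_2^{2}\,\mathrm{e}^{-\lambda_2}\Big]\\ &=\frac{4}{3}\big[(\lambda_2^2+2\lambda_2+2)\lambda_2-(\lambda_2+1)\lambda_2^2\big]\mathrm{e}^{-2\lambda_2}=\frac{4}{3}(\lambda_2^2+2\lambda_2)\,\mathrm{e}^{-2\lambda_2},\ \ 0<\lambda_2<\infty, \end{aligned}$$

which agrees with the marginal density obtained by integrating λ 1 out of (9.6.6) directly and integrates to one.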

In the complex p × p matrix-variate gamma case, r j is as defined in (9.6a.3) and the expression for ϕ j−1(λ j) given in (ii) remains the same with the exception that m j in the complex case is m j = α − p + r j. Then, in the complex domain, the density of λ p, denoted by \(\tilde {f}_{1p}(\lambda _p)\), is the following:

Theorem 9.6a.1

When m j = α − p + r j is a positive integer, where r j is defined in (9.6a.3), the density of the smallest eigenvalue of the complex matrix-variate gamma distribution is the following:

$$\displaystyle \begin{aligned} \tilde{f}_{1p}&(\lambda_p)\mathrm{d}\lambda_p\\&=\tilde{c}_K\,\phi_{p-1}(\lambda_p)\,\lambda_p^{m_p}\,\mathrm{e}^{-\lambda_p}\,\mathrm{d}\lambda_p\\ &=\tilde{c}_K\sum_{\mu_1=0}^{m_1}\frac{m_1!}{(m_1-\mu_1)!}\sum_{\mu_2=0}^{m_1-\mu_1+m_2}\frac{(m_1-\mu_1+m_2)!}{2^{\mu_2+1}(m_1-\mu_1+m_2-\mu_2)!}\ \cdots\!\!\!\!\! \sum_{\mu_{p-1}=0}^{m_1-\mu_1+\cdots +m_{p-1}}\\ {} &\times\frac{(m_1-\mu_1+\cdots +m_{p-1})!}{(p-1)^{\mu_{p-1}+1}(m_1-\mu_1+\cdots +m_{p-1}-\mu_{p-1})!}\lambda_{p}^{m_1-\mu_1+\cdots +m_{p-1}-\mu_{p-1}+m_p}\mathrm{e}^{-p\lambda_p} \mathrm{d}\lambda_p \end{aligned} $$
(9.6a.5)

for 0 < λ p < ∞, where

$$\displaystyle \begin{aligned} \tilde{c}_K=\frac{\pi^{p(p-1)}}{\tilde{\varGamma}_p(p)\tilde{\varGamma}_p(\alpha)}\sum_K(-1)^{\rho_K}\sum_{r_1,\ldots ,r_p}.\end{aligned}$$

Note 9.6.1

In the complex Wishart case, α = m where m is the number of degrees of freedom, which is a positive integer. Hence, Theorem 9.6a.1 gives the final result in the general case for that distribution. One can obtain the joint density of the p − j + 1 smallest eigenvalues from ϕ j−1(λ j) as defined in (ii), both in the real and complex cases. If the scale parameter matrix of the gamma distribution is of the form aI where a > 0 is a real scalar and I is the identity matrix, then the distributions of the eigenvalues can also be obtained from the proposed procedure since, for any square matrix B, the eigenvalues of aB are the aν j’s, where the ν j’s are the eigenvalues of B. In the case of real Wishart distributions originating from a sample of size n from a p-variate Gaussian population whose covariance matrix is the identity matrix, the λ j’s are multiplied by the constant \(a=\frac {n}{2}\). If the Wishart matrix is not an estimator obtained from the sample values, the λ j’s should only be multiplied by \(a=\frac {1}{2}\) in the real case, no multiplicative constant being necessary in the complex Wishart case.

9.6.5. Density of the largest eigenvalue λ 1 in the real matrix-variate gamma case

First, consider the case where \(m_j=\alpha -\frac {p+1}{2}+p-k_j\), as defined in (9.6.10) for the real case, is a positive integer. One has to integrate out λ p, …, λ 2 in order to obtain the marginal density of λ 1. As the initial step, consider the integration with respect to λ p, that is,

$$\displaystyle \begin{aligned} \mbox{Step 1 integral :}&\int_{\lambda_p=0}^{\lambda_{p-1}}\lambda_p^{m_p}\mathrm{e}^{-\lambda_p}\mathrm{d}\lambda_p\\ &\ \ \ =[-\lambda_p^{m_p}\mathrm{e}^{-\lambda_p}]_0^{\lambda_{p-1}}+\cdots +[-m_p!\,\mathrm{e}^{-\lambda_p}]_0^{\lambda_{p-1}}\\ &\ \ \ =m_p!-\sum_{\nu_p=0}^{m_p}\frac{m_p!}{(m_p-\nu_p)!}\lambda_{p-1}^{m_p-\nu_p}\mathrm{e}^{-\lambda_{p-1}}.\end{aligned} $$
(i)

Now, multiplying each term by \(\lambda _{p-1}^{m_{p-1}}\mathrm {e}^{-\lambda _{p-1}}\) and integrating by parts, we have the second step integral:

$$\displaystyle \begin{aligned} &\mbox{Step 2 integral }\\ &=m_p!\int_{\lambda_{p-1}=0}^{\lambda_{p-2}}\lambda_{p-1}^{m_{p-1}}\mathrm{e}^{-\lambda_{p-1}}\mathrm{d}\lambda_{p-1}-\sum_{\nu_p=0}^{m_p}\frac{m_p!}{(m_p-\nu_p)!}\int_{\lambda_{p-1}=0}^{\lambda_{p-2}}\lambda_{p-1}^{m_p-\nu_p+m_{p-1}}\mathrm{e}^{-2\lambda_{p-1}}\mathrm{d}\lambda_{p-1}\\ &=m_p!\,m_{p-1}!-m_p!\sum_{\nu_{p-1}=0}^{m_{p-1}}\frac{m_{p-1}!}{(m_{p-1}-\nu_{p-1})!}\lambda_{p-2}^{m_{p-1}-\nu_{p-1}}\mathrm{e}^{-\lambda_{p-2}}\\ &-\sum_{\nu_p=0}^{m_p}\frac{m_p!}{(m_p-\nu_p)!}\frac{(m_p-\nu_p+m_{p-1})!}{2^{m_p-\nu_p+m_{p-1}+1}}\\ &+\!\sum_{\nu_p=0}^{m_p}\frac{m_p!}{(m_p-\nu_p)!}\!\!\!\!\sum_{\nu_{p-1}=0}^{m_p-\nu_p+m_{p-1}}\!\!\!\! \frac{(m_p-\nu_p+m_{p-1})!}{2^{\nu_{p-1}+1}(m_p-\nu_p+m_{p-1}-\nu_{p-1})!}\lambda_{p-2}^{m_p-\nu_p+m_{p-1}-\nu_{p-1}}\mathrm{e}^{-2\lambda_{p-2}}.\end{aligned} $$
(ii)

At the j-th step of integration, there will be \(2^j\) terms of which \(\frac {2^j}{2}=2^{j-1}\) will be positive and \(2^{j-1}\) will be negative. All the terms at the j-th step can be generated by \(2^j\) sequences of zeros and ones. Each sequence (each row) consists of j positions with a zero or one occupying each position. Depending on whether the number of ones in a sequence is odd or even, the corresponding term will start with a minus or a plus sign, respectively. The following are the sequences, listed in lexicographic order, where the last column indicates the sign of the term:

$$\displaystyle \begin{array}{cccccc} 0&0&\cdots&0&0&\quad+\\ 0&0&\cdots&0&1&\quad-\\ 0&0&\cdots&1&0&\quad-\\ 0&0&\cdots&1&1&\quad+\\ &&\vdots&&&\quad\vdots\\ 1&1&\cdots&1&1&\quad(-1)^{j} \end{array} $$

All the terms at the j-th step can be written down by using the following rules: (1): If the first entry in a sequence (row) is zero, then the corresponding factor in the term is m p!; (2): If the first entry in a sequence is 1, then the corresponding factor in the term is \(\sum _{\nu _p=0}^{m_p}\frac {m_p!}{(m_p-\nu _p)!}\) or this sum multiplied by \(\lambda _{p-1}^{m_p-\nu _p}{\mathrm{e}}^{-\lambda _{p-1}}\) if this 1 is the last entry in the sequence; (3): If the r-th and (r − 1)-th entries in the sequence both equal zero, then the corresponding factor in the term is \(m_{p-r+1}!\); (4): If the r-th entry in the sequence is zero and the (r − 1)-th entry is 1, then the corresponding factor in the term is \(\frac {(n_{r-1}+m_{p-r+1})!}{(\eta _{r-1}+1)^{n_{r-1}+m_{p-r+1}+1}}\), where n r−1 is the argument of the denominator factorial and \((\eta _{r-1})^{\nu _{p-r+2}+1}\) is the factor in the denominator corresponding to the (r − 1)-th entry; (5): If the r-th entry in the sequence is 1 and the (r − 1)-th entry is zero, then the corresponding factor in the term is \(\sum _{\nu _{p-r+1}=0}^{m_{p-r+1}}\frac {m_{p-r+1}!}{(m_{p-r+1}-\nu _{p-r+1})!}\) or this sum multiplied by \(\lambda _{p-r}^{m_{p-r+1}-\nu _{p-r+1}}{\mathrm{e}}^{-\lambda _{p-r}}\) if this 1 is the last entry in the sequence; (6): If the r-th and (r − 1)-th entries in the sequence are both equal to 1, then the corresponding factor in the term is \(\sum _{\nu _{p-r+1}=0}^{n_{r-1}+m_{p-r+1}}\frac {(n_{r-1}+m_{p-r+1})!}{(\eta _{r-1}+1)^{\nu _{p-r+1}+1}(n_{r-1}+m_{p-r+1}-\nu _{p-r+1})!}\) or this sum multiplied by \(\lambda _{p-r}^{n_{r-1}+m_{p-r+1}-\nu _{p-r+1}}{\mathrm{e}}^{-{(\eta _{r-1}+1)\lambda _{p-r}}}\) if this 1 is the last entry in the sequence, where n r−1 and η r−1 are defined in rule (4). By applying the above rules, let us write down the terms at step 3, that is, j = 3. The sequences are then the following:

$$\displaystyle \begin{array}{cccc} 0&0&0&+\\ 0&0&1&-\\ 0&1&0&-\\ 0&1&1&+\\ 1&0&0&-\\ 1&0&1&+\\ 1&1&0&+\\ 1&1&1&- \end{array} $$
(iii)

The corresponding terms in the order of the sequences are the following:

$$\displaystyle \begin{aligned} &\mbox{Step 3 integral }\\ &=m_p!\,m_{p-1}!\,m_{p-2}!-m_p!\,m_{p-1}!\sum_{\nu_{p-2}=0}^{m_{p-2}}\frac{m_{p-2}!}{(m_{p-2}-\nu_{p-2})!}\lambda_{p-3}^{m_{p-2}-\nu_{p-2}}\mathrm{e}^{-\lambda_{p-3}}\\ &\ \ \ \ -m_p!\sum_{\nu_{p-1}=0}^{m_{p-1}}\frac{m_{p-1}!}{(m_{p-1}-\nu_{p-1})!}\frac{(m_{p-1}-\nu_{p-1}+m_{p-2})!}{2^{m_{p-1}-\nu_{p-1}+m_{p-2}+1}}\\ &\ \ \ \ +m_p!\sum_{\nu_{p-1}=0}^{m_{p-1}}\frac{m_{p-1}!}{(m_{p-1}-\nu_{p-1})!}\sum_{\nu_{p-2}=0}^{m_{p-1}-\nu_{p-1}+m_{p-2}}\frac{(m_{p-1}-\nu_{p-1}+m_{p-2})!}{2^{\nu_{p-2}+1}(m_{p-1}-\nu_{p-1}+m_{p-2}-\nu_{p-2})!}\\ &\ \ \ \ \qquad\qquad\qquad\qquad\qquad\qquad\qquad\times \lambda_{p-3}^{m_{p-1}-\nu_{p-1}+m_{p-2}-\nu_{p-2}}\mathrm{e}^{-2\lambda_{p-3}}\\ &\ \ \ \ -\sum_{\nu_p=0}^{m_p}\frac{m_p!}{(m_p-\nu_p)!}\frac{(m_p-\nu_p+m_{p-1})!}{2^{m_p-\nu_p+m_{p-1}+1}}\,m_{p-2}!\\ &\ \ \ \ +\sum_{\nu_p=0}^{m_p}\frac{m_p!}{(m_p-\nu_p)!}\frac{(m_p-\nu_p+m_{p-1})!}{2^{m_p-\nu_p+m_{p-1}+1}}\sum_{\nu_{p-2}=0}^{m_{p-2}}\frac{m_{p-2}!}{(m_{p-2}-\nu_{p-2})!}\lambda_{p-3}^{m_{p-2}-\nu_{p-2}}\mathrm{e}^{-\lambda_{p-3}}\\ &\ \ \ \ +\sum_{\nu_p=0}^{m_p}\frac{m_p!}{(m_p-\nu_p)!}\sum_{\nu_{p-1}=0}^{m_p-\nu_p+m_{p-1}}\frac{(m_p-\nu_p+m_{p-1})!}{2^{\nu_{p-1}+1}(m_p-\nu_p+m_{p-1}-\nu_{p-1})!}\\ &\ \ \ \ \qquad\qquad\qquad\qquad\qquad\qquad\qquad\times\frac{(m_p-\nu_p+m_{p-1}-\nu_{p-1}+m_{p-2})!}{3^{m_p-\nu_p+m_{p-1}-\nu_{p-1}+m_{p-2}+1}}\\ &\ \ \ \ -\sum_{\nu_p=0}^{m_p}\frac{m_p!}{(m_p-\nu_p)!}\sum_{\nu_{p-1}=0}^{m_p-\nu_p+m_{p-1}}\frac{(m_p-\nu_p+m_{p-1})!}{2^{\nu_{p-1}+1}(m_p-\nu_p+m_{p-1}-\nu_{p-1})!}\\ &\ \ \ \ \ \ \ \ \times\sum_{\nu_{p-2}=0}^{m_p-\nu_p+m_{p-1}-\nu_{p-1}+m_{p-2}} \frac{(m_p-\nu_p+m_{p-1}-\nu_{p-1}+m_{p-2})!}{3^{\nu_{p-2}+1}(m_p-\nu_p+m_{p-1}-\nu_{p-1}+m_{p-2}-\nu_{p-2})!}\\ &\ \ \ \ \qquad\qquad\qquad\qquad\qquad\qquad\qquad\times \lambda_{p-3}^{m_p-\nu_p+m_{p-1}-\nu_{p-1}+m_{p-2}-\nu_{p-2}}\mathrm{e}^{-3\lambda_{p-3}}.\end{aligned} $$
(iv)

The terms in (iv) are to be multiplied by \(\lambda _{p-3}^{m_{p-3}}\mathrm {e}^{-\lambda _{p-3}}\) to obtain the final result if we are stopping at that stage, that is, if p = 4. The terms in (iv) can be verified by multiplying the step 2 integral (ii) by \(\lambda _{p-2}^{m_{p-2}}\mathrm {e}^{-\lambda _{p-2}}\) and then integrating the resulting expression term by term. Denoting the sum of the terms at the j-th step by ψ j(λ p−j), the density of the largest eigenvalue λ 1, denoted by f 11(λ 1), is the following:

Theorem 9.6.2

When \(m_j=\alpha -\frac {p+1}{2}+p-k_j\) is a positive integer and ψ j(λ p−j) is as defined in the preceding paragraph, where k j is given in (9.6.10), the density of λ 1 in the real matrix-variate gamma case, denoted by f 11(λ 1), is the following:

$$\displaystyle \begin{aligned} f_{11}(\lambda_1)\mathrm{d}\lambda_1=\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\alpha)\varGamma_p(\frac{p}{2})}\sum_K\,(-1)^{\rho_K}\,\psi_{p-1}(\lambda_1)\,\lambda_1^{m_1}\,\mathrm{e}^{-\lambda_1}\,\mathrm{d}\lambda_1,\ 0<\lambda_1<\infty,{}\end{aligned} $$
(9.6.12)

where it is assumed that m j is a positive integer.
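Continuing with the illustrative case \(p=2\), \(\alpha=\frac{5}{2}\) used after Theorem 9.6.1, the step 1 integral gives \(\psi_1(\lambda_1)=m_2!-\sum_{\nu_2=0}^{m_2}\frac{m_2!}{(m_2-\nu_2)!}\lambda_1^{m_2-\nu_2}\mathrm{e}^{-\lambda_1}\), and (9.6.12) then yields

$$\displaystyle \begin{aligned} f_{11}(\lambda_1)&=\frac{4}{3}\Big\{\big[1-(\lambda_1+1)\mathrm{e}^{-\lambda_1}\big]\lambda_1^{2}\,\mathrm{e}^{-\lambda_1}-\big[2-(\lambda_1^2+2\lambda_1+2)\mathrm{e}^{-\lambda_1}\big]\lambda_1\,\mathrm{e}^{-\lambda_1}\Big\}\\ &=\frac{4}{3}\big[(\lambda_1^2-2\lambda_1)\,\mathrm{e}^{-\lambda_1}+(\lambda_1^2+2\lambda_1)\,\mathrm{e}^{-2\lambda_1}\big],\ \ 0<\lambda_1<\infty, \end{aligned}$$

which coincides with the result of integrating λ 2 out of (9.6.6) directly and integrates to one.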

In the corresponding complex case, the procedure is parallel and the expression for ψ j(λ p−j) remains the same, except that m j will then be equal to α − p + r j, where r j is defined in (9.6a.3). Assuming that m j is a positive integer and letting the density in the complex case be denoted by \(\tilde {f}_{11}(\lambda _1)\), we have the following result:

Theorem 9.6a.2

Letting m j = α − p + r j be a positive integer and ψ j(λ p−j) have the same representation as in the real case except that m j = α − p + r j, in the complex case, the density of λ 1, denoted by \(\tilde {f}_{11}(\lambda _1)\), is the following:

$$\displaystyle \begin{aligned} \tilde{f}_{11}(\lambda_1)\mathrm{d}\lambda_1=\frac{\pi^{p(p-1)}}{\tilde{\varGamma}_p(p)\tilde{\varGamma}_p(\alpha)}\sum_K(-1)^{\rho_K} \sum_{r_1,\ldots ,r_p}\psi_{p-1}(\lambda_1)\,\lambda_1^{m_1}\,\mathrm{e}^{-\lambda_1}\ \mathrm{d}\lambda_1,\ 0<\lambda_1<\infty. \end{aligned} $$
(9.6a.6)

Note 9.6.2

One can also compute the density of the j-th eigenvalue λ j from Theorems 9.6.1 and 9.6.2. For obtaining the density of λ j, one has to integrate out λ 1, …, λ j−1 and λ p, λ p−1, …, λ j+1, the resulting expressions being available from the (j − 1)-th step when integrating λ 1, …, λ j−1 and from the (p − j)-th step when integrating λ p, λ p−1, …, λ j+1.

9.6.6. Density of the largest eigenvalue λ 1 in the general real case

By general case, it is meant that \(m_j=\alpha -\frac {p+1}{2}+p-k_j\) is not a positive integer. In the real Wishart case, m j will then be a half-integer; however, in the general gamma case, α can be any real number greater than \(\frac {p-1}{2}\). In this general case, we will expand the exponential part and then integrate term by term. That is,

$$\displaystyle \begin{aligned} \mbox{Step 1 integral: }&\int_{\lambda_p=0}^{\lambda_{p-1}}\lambda_p^{m_p}\mathrm{e}^{-\lambda_p}\mathrm{d}\lambda_p=\sum_{\nu_p=0}^{\infty}\frac{(-1)^{\nu_p}}{\nu_p!}\int_{\lambda_p=0}^{\lambda_{p-1}}\lambda_p^{m_p+\nu_p}\mathrm{d}\lambda_p\\ &=\sum_{\nu_p=0}^{\infty}\frac{(-1)^{\nu_p}}{\nu_p!}\frac{1}{(m_p+\nu_p+1)}\lambda_{p-1}^{m_p+\nu_p+1}.\end{aligned} $$
(i)

Continuing this process, we have

$$\displaystyle \begin{aligned} \mbox{Step }\ j\ \mbox{integral }&=\sum_{\nu_p=0}^{\infty}\frac{(-1)^{\nu_p}}{\nu_p!}\frac{1}{m_p+\nu_p+1}\sum_{\nu_{p-1}=0}^{\infty}\frac{(-1)^{\nu_{p-1}}}{\nu_{p-1}!}\frac{1}{m_p+\nu_p+m_{p-1}+\nu_{p-1}+2}\\ &\ \ \ \cdots \!\!\sum_{\nu_{p-j+1}=0}^{\infty}\!\!\frac{(-1)^{\nu_{p-j+1}}}{\nu_{p-j+1}!}\frac{\lambda_{p-j}^{m_p+\nu_p+\cdots +m_{p-j+1}+\nu_{p-j+1}+j}}{m_p+\nu_p+\cdots +m_{p-j+1}+\nu_{p-j+1}+j}\\&\equiv\varDelta_j(\lambda_{p-j}).\end{aligned} $$
(ii)
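The series representation underlying these steps may be checked numerically; the sketch below assumes that scipy is available, with the values m = 1.7 and c = 2.3 being arbitrary choices, and compares the step 1 series with a direct quadrature of \(\int_0^{c}\lambda^{m}\mathrm{e}^{-\lambda}\mathrm{d}\lambda\).

```python
# Check of the step 1 expansion: int_0^c x^m e^{-x} dx
#   = sum_{nu >= 0} (-1)^nu / nu! * c^(m+nu+1) / (m+nu+1),  for non-integer m.
from math import factorial
from scipy.integrate import quad
import numpy as np

m, c = 1.7, 2.3                      # m plays the role of m_p, c that of lambda_{p-1}
series = sum((-1) ** nu / factorial(nu) * c ** (m + nu + 1) / (m + nu + 1)
             for nu in range(60))
integral, _ = quad(lambda x: x ** m * np.exp(-x), 0, c)
print(series, integral)              # the two values should agree to high accuracy
```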

Then, in the general real case, the density of λ 1, denoted by f 21(λ 1), is the following:

Theorem 9.6.3

When \(m_j=\alpha -\frac {p+1}{2}+p-k_j\) is not a positive integer, where k j is as specified in (9.6.10), the density of the largest eigenvalue λ 1 in the general real matrix-variate gamma case, denoted by f 21(λ 1), is given by

$$\displaystyle \begin{aligned} f_{21}(\lambda_1)\mathrm{d}\lambda_1=\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\frac{p}{2})\varGamma_p(\alpha)}\sum_K(-1)^{\rho_K}\varDelta_{p-1}(\lambda_{1})\,\lambda_1^{m_1}\mathrm{e}^{-\lambda_1}\mathrm{d}\lambda_1,\ 0<\lambda_1<\infty,{}\end{aligned} $$
(9.6.13)

where Δ j(λ p−j) is defined in (ii).

The corresponding density of λ 1 in the general situation of the complex matrix-variate gamma distribution is given in the next theorem. Observe that in the complex Wishart case, m j is an integer and hence there is no general case to consider.

Theorem 9.6a.3

When m j = α − p + r j is not a positive integer, where r j is as defined in (9.6a.3), the density of λ 1 in the complex case, denoted by \(\tilde {f}_{21}(\lambda _1)\), is given by

$$\displaystyle \begin{aligned} \tilde{f}_{21}(\lambda_1)\mathrm{d}\lambda_1=\frac{\pi^{p(p-1)}}{\tilde{\varGamma}_p(p)\tilde{\varGamma}_p(\alpha)}\sum_K(-1)^{\rho_K}\sum_{r_1,\ldots ,r_p}\varDelta_{p-1}(\lambda_{1})\,\lambda_1^{m_1}\mathrm{e}^{-\lambda_1}\mathrm{d}\lambda_1,\ 0<\lambda_1<\infty, \end{aligned} $$
(9.6a.7)

where Δ j(λ p−j) has the representation specified in (ii), except that m j = α − p + r j.

9.6.7. Density of the smallest eigenvalue λ p in the general real case

Once again, ‘general case’ is understood to mean that \(m_j=\alpha -\frac {p+1}{2}+p-k_j\) is not a positive integer, where k j is defined in (9.6.10). For the real Wishart distribution, ‘general case’ corresponds to m j being a half-integer. In order to determine the density of the smallest eigenvalue, we will integrate out λ 1, …, λ p−1. We initially evaluate the following integral:

$$\displaystyle \begin{aligned} \mbox{Step 1 integral: }&\int_{\lambda_1=\lambda_2}^{\infty}\lambda_1^{m_1}\mathrm{e}^{-\lambda_1}\mathrm{d}\lambda_1=\varGamma(m_1+1)-\int_{\lambda_1=0}^{\lambda_2}\lambda_1^{m_1}\mathrm{e}^{-\lambda_1}\mathrm{d}\lambda_1\\ &=\varGamma(m_1+1)-\sum_{\mu_1=0}^{\infty}\frac{(-1)^{\mu_1}}{\mu_1!}\frac{1}{m_1+\mu_1+1}\lambda_2^{m_1+\mu_1+1}.\end{aligned} $$
(i)

The second step consists of integrating out λ 2 from the expression obtained in (i) multiplied by \(\lambda _2^{m_2}\mathrm {e}^{-\lambda _2}:\)

$$\displaystyle \begin{aligned} &\mbox{Step 2 integral:}\\ &\varGamma(m_1+1)\int_{\lambda_2=\lambda_3}^{\infty}\lambda_2^{m_2}\mathrm{e}^{-\lambda_2}\mathrm{d}\lambda_2-\sum_{\mu_1=0}^{\infty}\frac{(-1)^{\mu_1}}{\mu_1!}\frac{1}{m_1+\mu_1+1}\int_{\lambda_2=\lambda_3}^{\infty}\lambda_2^{m_1+\mu_1+m_2+1}\mathrm{e}^{-\lambda_2}\mathrm{d}\lambda_2\\ &=\varGamma(m_1+1)\varGamma(m_2+1)-\varGamma(m_1+1)\sum_{\mu_2=0}^{\infty}\frac{(-1)^{\mu_2}}{\mu_2!}\frac{1}{m_2+\mu_2+1}\lambda_3^{m_2+\mu_2+1}\\ &\ \ \ \ -\sum_{\mu_1=0}^{\infty}\frac{(-1)^{\mu_1}}{\mu_1!(m_1+\mu_1+1)}\varGamma(m_1+\mu_1+m_2+2)\\ &\ \ \ \ +\sum_{\mu_1=0}^{\infty}\frac{(-1)^{\mu_1}}{\mu_1!}\frac{1}{m_1+\mu_1+1}\sum_{\mu_2=0}^{\infty}\frac{(-1)^{\mu_2}}{\mu_2!} \frac{\lambda_3^{m_1+\mu_1+m_2+\mu_2+2}}{m_1+\mu_1+m_2+\mu_2+2}.\end{aligned} $$
(ii)

A pattern is now seen to emerge. At step j, there will be \(2^j\) terms, of which \(2^{j-1}\) will start with a plus sign and \(2^{j-1}\) will start with a minus sign. All the terms at the j-th step are available from the \(2^j\) sequences of zeros and ones provided in Sect. 9.6.5. The terms can be written down by utilizing the following rules: (1): If the sequence starts with a zero, then the corresponding factor in the term is Γ(m 1 + 1); (2): If the sequence starts with a 1, then the corresponding factor in the term is \(\sum _{\mu _1=0}^{\infty }\frac {(-1)^{\mu _1}}{\mu _1!}\frac {1}{m_1+\mu _1+1}\) or this series multiplied by \(\lambda _2^{m_1+\mu _1+1}\) if this 1 is the last entry in the sequence; (3): If the r-th entry in the sequence is a zero and the (r − 1)-th entry in the sequence is also zero, then the corresponding factor in the term is Γ(m r + 1); (4): If the r-th entry in the sequence is a zero and the (r − 1)-th entry in the sequence is a 1, then the corresponding factor in the term is Γ(n r−1 + m r + 1) where n r−1 is the denominator factor in the (r − 1)-th factor excluding the factorial; (5): If the r-th entry in the sequence is 1 and the (r − 1)-th entry is zero, then the corresponding factor in the term is \(\sum _{\mu _r=0}^{\infty }\frac {(-1)^{\mu _r}}{\mu _r!}\frac {1}{m_r+\mu _r+1}\) or this series multiplied by \(\lambda _{r+1}^{m_r+\mu _r+1}\) if this 1 happens to be the last entry in the sequence; (6): If the r-th entry in the sequence is 1 and the (r − 1)-th entry is also 1, then the corresponding factor in the term is \(\sum _{\mu _r=0}^{\infty }\frac {(-1)^{\mu _r}}{\mu _r!}\frac {1}{n_{r-1}+m_r+\mu _r+1}\) or this series multiplied by \(\lambda _{r+1}^{n_{r-1}+m_r+\mu _r+1}\) if this 1 is the last entry in the sequence, where n r−1 is the factor appearing in the denominator of the (r − 1)-th factor excluding the factorial.

These rules enable one to write down all the terms at any step. For example, for j = 3, that is, at the third step, the terms are generated by the eight step 3 sequences listed in (iii) of Sect. 9.6.5. The terms corresponding to these sequences, taken in the same order, are the following:

$$\displaystyle \begin{aligned} &\mbox{Step 3 integral}\\ &=\varGamma(m_1+1)\varGamma(m_2+1)\varGamma(m_3+1)\\ &\ \ \ -\varGamma(m_1+1)\varGamma(m_2+1)\sum_{\mu_3=0}^{\infty}\frac{(-1)^{\mu_3}}{\mu_3!}\frac{1}{m_3+\mu_3+1}\lambda_4^{m_3+\mu_3+1}\\ &\ \ \ -\varGamma(m_1+1)\sum_{\mu_2=0}^{\infty}\frac{(-1)^{\mu_2}}{\mu_2!(m_2+\mu_2+1)}\varGamma(m_2+\mu_2+m_3+2)\\ &\ \ \ +\varGamma(m_1+1)\sum_{\mu_2=0}^{\infty}\frac{(-1)^{\mu_2}}{\mu_2!}\frac{1}{m_2+\mu_2+1}\sum_{\mu_3=0}^{\infty} \frac{(-1)^{\mu_3}}{\mu_3!} \frac{\lambda_4^{m_2+\mu_2+m_3+\mu_3+2}}{m_2+\mu_2+m_3+\mu_3+2}\\ &\ \ \ -\sum_{\mu_1=0}^{\infty}\frac{(-1)^{\mu_1}}{\mu_1!(m_1+\mu_1+1)}\varGamma(m_1+\mu_1+m_2+2)\varGamma(m_3+1)\\ &\ \ \ +\sum_{\mu_1=0}^{\infty}\frac{(-1)^{\mu_1}}{\mu_1!(m_1+\mu_1+1)}\varGamma(m_1+\mu_1+m_2+2)\sum_{\mu_3=0}^{\infty}\frac{(-1)^{\mu_3}}{\mu_3!}\frac{\lambda_4^{m_3+\mu_3+1}}{(m_3+\mu_3+1)}\end{aligned} $$
$$\displaystyle \begin{aligned} &+\sum_{\mu_1=0}^{\infty}\frac{(-1)^{\mu_1}}{\mu_1!(m_1+\mu_1+1)}\sum_{\mu_2=0}^{\infty}\frac{(-1)^{\mu_2}}{\mu_2!(m_1+\mu_1+m_2+\mu_2+2)}\\ &\ \ \ \ \times\varGamma(m_1+\mu_1+m_2+\mu_2+m_3+3)\\ &-\sum_{\mu_1=0}^{\infty}\frac{(-1)^{\mu_1}}{\mu_1!(m_1+\mu_1+1)}\sum_{\mu_2=0}^{\infty}\frac{(-1)^{\mu_2}}{\mu_2!(m_1+\mu_1+m_2+\mu_2+2)}\\ &\ \ \ \ \times\sum_{\mu_3=0}^{\infty}\frac{(-1)^{\mu_3}}{\mu_3!} \frac{\lambda_4^{m_1+\mu_1+m_2+\mu_2+m_3+\mu_3+3}}{(m_1+\mu_1+m_2+\mu_2+m_3+\mu_3+3)}.\end{aligned} $$
(iii)

Then, the general expression at step j, denoted by w j(λ j+1), is the following:

$$\displaystyle \begin{aligned} w_j(\lambda_{j+1})&=\varGamma(m_1+1)\cdots \varGamma(m_j+1)-\cdots +(-1)^j\sum_{\mu_1=0}^{\infty}\frac{(-1)^{\mu_1}}{\mu_1!(m_1+\mu_1+1)}\ \cdots \\ &\ \ \ \ \ \ \ \ \ \times \sum_{\mu_j=0}^{\infty}\frac{(-1)^{\mu_j}}{\mu_j!} \frac{\lambda_{j+1}^{m_1+\mu_1+\cdots +m_j+\mu_j+j}}{(m_1+\mu_1+\cdots +m_j+\mu_j+j)}.\end{aligned} $$
(iv)

Theorem 9.6.4

The density of λ p for the general real matrix-variate gamma distribution, denoted by f 2p(λ p), is the following:

$$\displaystyle \begin{aligned} f_{2p}(\lambda_p)\mathrm{d}\lambda_p=\frac{\pi^{\frac{p^2}{2}}}{\varGamma_p(\frac{p}{2})\varGamma_p(\alpha)}\sum_K(-1)^{\rho_K}\,w_{p-1}(\lambda_p)\,\lambda_p^{m_p}\,\mathrm{e}^{-\lambda_p}\,\mathrm{d}\lambda_p,\ 0<\lambda_p<\infty,{}\end{aligned} $$
(9.6.14)

where w j(λ j+1) is defined in (iv).
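As a numerical illustration of Theorem 9.6.4, one may compare, for p = 2 and a non-integer shape parameter such as α = 2.3, the closed form (9.6.14) with the value obtained by integrating λ 1 out of the joint density (9.6.6) by quadrature; the sketch below assumes that scipy is available and the parameter values and the evaluation point 0.7 are arbitrary.

```python
# Numerical check of Theorem 9.6.4 for p = 2, alpha = 2.3 (scale parameter matrix I).
from math import factorial, gamma, exp, sqrt, pi
from itertools import permutations
from scipy.integrate import quad
import numpy as np

p, alpha = 2, 2.3
# pi^{p^2/2} / (Gamma_p(p/2) Gamma_p(alpha)) evaluated for p = 2
const = sqrt(pi) / (gamma(alpha) * gamma(alpha - 0.5))

def w1(lam, m1, terms=80):
    # w_1(lambda) = Gamma(m1+1) - sum_{mu>=0} (-1)^mu/mu! lambda^(m1+mu+1)/(m1+mu+1)
    s = sum((-1) ** mu / factorial(mu) * lam ** (m1 + mu + 1) / (m1 + mu + 1)
            for mu in range(terms))
    return gamma(m1 + 1) - s

def f2p_theorem(lam2):
    total = 0.0
    for K in permutations((1, 2)):
        rho = 0 if K == (1, 2) else 1                 # number of transpositions
        m1, m2 = (alpha - (p + 1) / 2 + p - k for k in K)
        total += (-1) ** rho * w1(lam2, m1) * lam2 ** m2 * exp(-lam2)
    return const * total

def f2p_direct(lam2):
    a = alpha - (p + 1) / 2
    val, _ = quad(lambda l1: l1 ** a * (l1 - lam2) * np.exp(-l1), lam2, np.inf)
    return const * lam2 ** a * exp(-lam2) * val

print(f2p_theorem(0.7), f2p_direct(0.7))              # the two values should agree
```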

The corresponding distribution of λ p for a general complex matrix-variate gamma distribution, denoted by \(\tilde {f}_{2p}(\lambda _p)\), is the following:

Theorem 9.6a.4

In the general complex case, in which instance m j = α − p + r j is not a positive integer, r j being as defined in (9.6a.3), the density of the smallest eigenvalue λ p, denoted by \(\tilde {f}_{2p}(\lambda _p)\), is given by

$$\displaystyle \begin{aligned} \tilde{f}_{2p}(\lambda_p)\mathrm{d}\lambda_p=\frac{\pi^{p(p-1)}}{\tilde{\varGamma}_p(p)\tilde{\varGamma}_p(\alpha)}\sum_K(-1)^{\rho_K} \sum_{r_1,\ldots ,r_p}w_{p-1}(\lambda_p)\,\lambda_p^{m_p}\,\mathrm{e}^{-\lambda_p}\,\mathrm{d}\lambda_p,\ 0<\lambda_p<\infty,\ \end{aligned} $$
(9.6a.8)

where w j(λ j+1) has the representation given in (iv) above for the real case, except that m j = α − p + r j.

Note 9.6.3

In the complex Wishart case, α = m where m > p − 1 is the number of degrees of freedom, which is a positive integer. Hence, in this instance, one would simply apply Theorem 9.6a.1. It should also be observed that one can integrate out λ 1, …, λ j−1 by using the procedure described in Theorem 9.6.4 and integrate out λ p, …, λ j+1 by employing the procedure provided in Theorem 9.6.3, and thus derive the density of λ j or the joint density of any set of successive λ j’s. In a similar manner, one can obtain the density of λ j or the joint density of any set of successive λ j’s in the complex domain by making use of the procedures outlined in Theorems 9.6a.3 and 9.6a.4.