1 Introduction

This paper applies model selection criteria, especially AIC and BIC, to the problem of choosing a sufficient number of principal components to retain, applying the concepts of Sclove [13] to this particular problem.

2 Background

Other researchers have considered the problem of choosing the number of principal components. For example, Bai et al. [6] examined the asymptotic consistency of the criteria AIC and BIC for determining the number of significant principal components in high-dimensional problems. The focus here is not necessarily on high-dimensional problems.

To begin the discussion, we first give a short review of the relevant portions of multivariate statistical analysis, as may be found in textbooks such as Anderson [5] or Johnson and Wichern [9].

2.1 Sample Quantities

Suppose we have a multivariate sample \({{\mathbf{x}}}_1, {{\mathbf{x}}}_2 , \; \dots , \; {{\mathbf{x}}}_n \) of n p-dimensional random vectors,

$$\begin{aligned} {\mathbf{x}}_i= (x_{1i}, x_{2i}, \ldots , x_{pi})',\quad i = 1, 2, \ldots , n. \end{aligned}$$

The transpose (\('\)) means that we are thinking of the vectors as column vectors. The sample mean vector is

$$\begin{aligned} \bar{{\mathbf{x}} } =\sum _{i=1}^n \, {\mathbf{x}}_i / n. \end{aligned}$$

The \(p \times p\) sample covariance matrix is

$$\begin{aligned} S =\sum _{i=1}^n \, ({\mathbf{x}}_i -\bar{{\mathbf{x}} }) ( {\mathbf{x}}_i -\bar{{\mathbf{x}} } )' / (n-1) . \end{aligned}$$
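As a concrete illustration, here is a minimal sketch of these sample quantities in NumPy; the data matrix is hypothetical, with rows as observations and columns as the p variables.

```python
# Sample mean vector and (n-1)-divisor sample covariance matrix for an
# n x p data matrix X whose rows are the observations x_1, ..., x_n.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # hypothetical data: n = 100, p = 5

x_bar = X.mean(axis=0)                         # sample mean vector (length p)
centered = X - x_bar
S = centered.T @ centered / (X.shape[0] - 1)   # p x p sample covariance matrix

# np.cov with rowvar=False uses the same (n - 1) divisor.
assert np.allclose(S, np.cov(X, rowvar=False))
```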

2.2 Population Quantities and Principal Components

The sample covariance matrix \({\mathbf{S}}\) estimates the true covariance matrix \( {\varvec{\Sigma }} \) of the random variables

$$\begin{aligned} X_1, X_2, \ldots , X_p. \end{aligned}$$

That is,

$$\begin{aligned} {\varvec{\Sigma }} = [\sigma _{uv}]_{u,v = 1,2,\ldots ,p}, \end{aligned}$$

where

$$\begin{aligned} \sigma _{uv} = {{\mathcal{C}}}[X_u, X_v], \end{aligned}$$

the covariance of \(X_u\) and \(X_v.\) In particular, \({{\mathcal{C}}}[X_v, X_v] ={\mathcal {V}}[X_v], \) the variance of \(X_v. \)

The principal components of \({\varvec{\Sigma }}\) are defined as uncorrelated linear combinations of maximal variance. A linear combination, say LC, of the p variables is \({\mathbf{a'X}},\) that is

$$\begin{aligned} LC = \; {\mathbf{a'X}} \; = \; a_1 X_1 \; + \; a_2 X_2 \; + \cdots + \; a_p X_p. \end{aligned}$$

Here the vector \( \, {\mathbf{a}} \, \) is a vector of scalars \(\, a_1, a_2, \ldots , \; a_p{:} \)

$$\begin{aligned} {\mathbf{a}}' \; = \; (a_1 \; a_2 \; \ldots \; a_p). \end{aligned}$$

These are the coefficients in the linear combination. Such linear combinations are called variates.

We have

$$\begin{aligned} {\mathcal {V}}[ LC] \; = \; {\mathcal {V}}[{\mathbf{a'X}}] \; = \; {\mathbf{a}}'{\varvec{\Sigma }} {\mathbf{a}}. \end{aligned}$$

This is estimated as \({\mathbf{a'Sa}}, \) which is to be maximized over \({\mathbf{a}}.\) The derivative is

$$\begin{aligned} \partial {\mathbf{a'Sa}}/\partial {\mathbf{a}} = 2{\mathbf{Sa}}. \end{aligned}$$

A constraint is required for meaningful maximization. A reasonable such constraint is \({\mathbf{a'a}} = 1,\) which is equivalent to the length of \({\mathbf{a}} , \) the quantity \(\sqrt{{\mathbf{a'a}}},\) being equal to 1.

The Lagrangian function incorporating the constraint is

$$\begin{aligned} L({\mathbf{S}},{\mathbf{a}}; \lambda ) = {\mathbf{a'Sa}} + \lambda (1 - {\mathbf{a'a}} ). \end{aligned}$$

The partial derivatives are

$$\begin{aligned} \partial L / \partial {\mathbf{a}} = 2{\mathbf{Sa}} - 2 \lambda {\mathbf{a}} \end{aligned}$$

and

$$\begin{aligned} \partial L / \partial \lambda = \partial [ \lambda ( 1-{\mathbf{a'a}} ) ] / \partial \lambda = 1 - {\mathbf{a'a}} . \end{aligned}$$

Setting these equal to zero gives the simultaneous linear equations

$$\begin{aligned} {\mathbf{Sa}} =\lambda {\mathbf{a}}, \quad {\mathbf{a'a}} = 1. \end{aligned}$$

The first is the equation

$$\begin{aligned} {\mathbf{Sa}} -\lambda {\mathbf{a}}={\mathbf{0}}, \end{aligned}$$

the zero vector. This is the homogeneous equation

$$\begin{aligned} ({\mathbf{S}} - \lambda {\mathbf{I}} ) {\mathbf{a}} ={\mathbf{0}}. \end{aligned}$$

For nontrivial solutions, we must have det\(( {\mathbf{S}} - \lambda {\mathbf{I}} ) = 0. \) This is a polynomial equation of degree p in \(\lambda \); denote the roots by \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _p. \) These are the eigenvalues. Their sum is the trace of \({\mathbf{S}};\) their product is the determinant of \({\mathbf{S}}. \)
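A short sketch of this eigenvalue computation in NumPy (continuing the hypothetical data above) confirms the trace and determinant identities numerically.

```python
# Eigenvalues of the sample covariance matrix, sorted so that
# lambda_1 >= lambda_2 >= ... >= lambda_p; their sum is trace(S) and
# their product is det(S).
import numpy as np

rng = np.random.default_rng(0)
S = np.cov(rng.normal(size=(100, 5)), rowvar=False)

eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigvalsh is for symmetric matrices

assert np.isclose(eigvals.sum(), np.trace(S))
assert np.isclose(eigvals.prod(), np.linalg.det(S))
```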

The corresponding eigenequations are

$$\begin{aligned} {\mathbf{Sa}}_j = \lambda _j {\mathbf{a}}_j, \quad j = 1, 2, \ldots , p. \end{aligned}$$

The j-th PC (principal component), \(C_j,\) is the linear combination of the form

$$\begin{aligned} C_j ={\mathbf{a}}_j' {\mathbf{x}} = a_{1j}x_1 + a_{2j} x_2+ \cdots + a_{pj}x_p, \end{aligned}$$

where \({\mathbf{a}}_j' =(a_{1j}, a_{2j}, \ldots , \, a_{pj}). \) That is to say, for \(j = 1, 2, \ldots , p,\) the value of the j-th PC for Individual i is \({\mathbf{a}}_j' {\mathbf{x}}_i, \; i = 1, 2, \ldots , n. \)
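A minimal sketch of computing the PC scores for all individuals at once (hypothetical data; the variables are centered at the sample mean so that the score variances equal the eigenvalues):

```python
# PC score of individual i on component j: a_j' (x_i - x_bar).  With the
# eigenvectors as the columns of A, all scores are obtained as (X - x_bar) @ A.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

eigvals, A = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

scores = (X - x_bar) @ A               # row i, column j: value of PC_j for individual i
# The sample variance of the j-th score column equals lambda_j.
assert np.allclose(scores.var(axis=0, ddof=1), eigvals)
```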

The equations for the PCs in terms of the Xs are PC\(_j = {\mathbf{a}}'_j {\mathbf{X}}, \; j = 1, 2, \ldots , p. \) Let \({\mathbf{C}}\) be the p-vector of PCs. Then \({\mathbf{C}} \; = \; {{\mathbf{A}}}'{} {\mathbf{X}}, \) where \({\mathbf{A}} \; = \; [{\mathbf{a}}_{\mathbf{1}} \, {\mathbf{a}}_{\mathbf{2}} \, \dots \, {\mathbf{a}}_{\mathbf{p}} ] \) is the matrix whose columns are the eigenvectors. The inverse relation is

$$\begin{aligned} {\mathbf{X}} = {{\mathbf{A}}}'^{-1} \, {\mathbf{C}}= {\mathbf{L}} {\mathbf{C}} , \end{aligned}$$

where

$$\begin{aligned} {\mathbf{L}} = {{\mathbf{A}}}'^{-1}, \end{aligned}$$

and \({\mathbf{L}}\) is the matrix of loadings of the \(X_v\) on the PCs \(C_j.\) Actually, \({\mathbf{A}} \) is an orthogonal matrix (its columns are of unit length and are pairwise orthogonal), so \({\mathbf{A}}^{-1} = {{\mathbf{A}}}'.\) Thus \({\mathbf{L}} = \; {\mathbf{A}}. \) So

$$\begin{aligned} {\mathbf{X}} = {{\mathbf{A}}}'^{-1} \, {\mathbf{C}} = {\mathbf{A}}{} {\mathbf{C}}. \end{aligned}$$

Letting \( {\mathbf{a}}^{(v)'} \, \) be the v-th row of the matrix \({\mathbf{A}},\) that is

$$\begin{aligned} {\mathbf{a}}^{(v)'} = (a_{v1}, a_{v2}, \ldots , a_{vp} ), \end{aligned}$$

we have

$$\begin{aligned} X_v = \; a_{v1}C_1 + a_{v2}C_2 + \cdots + a_{vp}C_p. \end{aligned}$$

In terms of the first k PCs, this is

$$\begin{aligned} X_v = a_{v1}C_1 + a_{v2}C_2 + \cdots + a_{vk}C_k + \varepsilon _v,\quad (*) \end{aligned}$$

where the error \(\varepsilon _v\) is

$$\begin{aligned} \varepsilon _v = a_{v \,k+1}C_{k+1} + a_{v \,k+2}C_{k+2} + \cdots + a_{vp}C_p. \end{aligned}$$

The covariance matrix can be represented as

$$\begin{aligned} {\mathbf{S}} = \sum _{j=1}^p \, \lambda _j {\mathbf{a}}_j {\mathbf{a}}_j'. \end{aligned}$$

Correspondingly, the best rank k approximation to \({\mathbf{S}}\) is

$$\begin{aligned} {\mathbf{S}}^{(k) } = \sum _{j=1}^k \, \lambda _j {\mathbf{a}}_j {\mathbf{a}}_j'. \end{aligned}$$

Recall that a covariance matrix is symmetric and positive semi-definite, so its eigenvalues are non-negative.
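The spectral decomposition and its rank-k truncation can be checked with a short sketch (hypothetical data again):

```python
# S = sum_j lambda_j a_j a_j' (spectral decomposition); truncating the sum
# at k terms gives the best rank-k approximation S^(k).
import numpy as np

rng = np.random.default_rng(0)
S = np.cov(rng.normal(size=(100, 5)), rowvar=False)

eigvals, A = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

S_full = sum(lam * np.outer(a, a) for lam, a in zip(eigvals, A.T))
assert np.allclose(S_full, S)          # all p terms reproduce S exactly

k = 2
S_k = sum(lam * np.outer(a, a) for lam, a in zip(eigvals[:k], A.T[:k]))  # rank-k approximation
```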

2.3 Ad Hoc Procedures for Determining an Appropriate Number of PCs

2.3.1 Procedure Based on the Average Eigenvalue

The average eigenvalue is

$$\begin{aligned} \bar{\lambda } = \sum _{j=1}^p \lambda _j / p. \end{aligned}$$

One rule for the number of PCs to retain is to retain those for which the eigenvalues are greater than \(\bar{\lambda }. \) When \({\mathbf{S}}\) is taken to be the sample correlation matrix, the trace is p and the average eigenvalue \(\bar{\lambda }\) is 1.
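A one-line sketch of this rule, with hypothetical eigenvalues of a 5-variable correlation matrix:

```python
# Average-eigenvalue rule: retain the PCs whose eigenvalues exceed the mean
# eigenvalue (the mean is 1 when S is a correlation matrix).
import numpy as np

eigvals = np.array([2.0, 1.2, 0.9, 0.6, 0.3])  # hypothetical eigenvalues (they sum to p = 5)
n_retain = int(np.sum(eigvals > eigvals.mean()))
print(n_retain)                                # retains 2 PCs here
```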

2.3.2 Procedure Based on Retaining a Prescribed Portion of the Total Variance

Another procedure is to retain a number of PCs sufficient to account for, say, 90% of the total variance, trace \({\mathbf{S}} = \sum _{j=1}^p \, \lambda _j. \) Of course the figure of ninety percent is somewhat arbitrary, and it would be nice to have a more objective criterion.
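A sketch of this rule with the same hypothetical eigenvalues:

```python
# Proportion-of-variance rule: retain the smallest k whose leading eigenvalues
# account for at least a prescribed fraction (here 90%) of trace(S).
import numpy as np

eigvals = np.array([2.0, 1.2, 0.9, 0.6, 0.3])   # hypothetical eigenvalues
cum_prop = np.cumsum(eigvals) / eigvals.sum()   # cumulative proportion of total variance
k = int(np.argmax(cum_prop >= 0.90) + 1)        # first k reaching 90%
print(cum_prop.round(2), k)                     # 90% is first reached at k = 4 here
```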

2.3.3 Procedure Based on the Dropoff of the Eigenvalues

Another procedure is to plot \(\lambda _1, \lambda _2, \ldots , \; \lambda _p\) against \(1, 2, \ldots , p. \) One then looks for an elbow in the curve and retains a number of PCs corresponding to the point before the leveling off of the curve, if it does indeed take an elbow shape. Such a plot is called a scree plot, “scree” being the rock debris that accumulates at the foot of a cliff or mountain slope.
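A scree plot is easily produced; the sketch below uses Matplotlib and the same hypothetical eigenvalues.

```python
# Scree plot: eigenvalues against component number; look for the elbow where
# the curve levels off.
import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([2.0, 1.2, 0.9, 0.6, 0.3])  # hypothetical eigenvalues
plt.plot(np.arange(1, len(eigvals) + 1), eigvals, marker="o")
plt.xlabel("Component number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```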

3 AIC and BIC for the Number of PCs

Let us see what a Gaussian model would imply. The maximized likelihood for the model (*) approximating the p variables in terms of k PCs is \(|\hat{{\varvec{\Sigma }}}_k |^{-n/2} \, C(n,p,k ), \) where C(n, p, k) is a constant depending upon n, p, and k, and \(|\hat{{\varvec{\Sigma }}}_k|\) denotes the determinant of the estimated residual covariance matrix \(\hat{{\varvec{\Sigma }}}_k.\)

The determinant of the covariance matrix is the product of the eigenvalues,

$$\begin{aligned} |{\varvec{\Sigma }}| = \Pi _{j=1}^p \, \lambda _j. \end{aligned}$$

For a model based on the first k PCs, this is

$$\begin{aligned} \Pi _{j=1}^k \, \lambda _j. \end{aligned}$$

The determinant of the residual covariance is \(\Pi _{j=k+1}^p \lambda _j. \) The model-selection criterion AIC, Akaike’s information criterion [2,3,4], is based on an estimate of the cross-entropy of each of the K proposed models with an operating (true) model.

The Bayesian information criterion BIC [12] is based on a large-sample estimate of the posterior probability \(pp_k\) of Model \(k, \; k = 1, 2, \ldots , K. \, \)

More precisely, BIC\(_k \) is an approximation to \(\, -2 \ln pp_k.\) These model-selection criteria (MSCs) are thus smaller-is-better criteria and take the form

$$\begin{aligned} MSC_k= -2 \, \text{ln max } L_k + a(n) m_k, \quad k = 1, 2, \ldots , K, \end{aligned}$$

where \(L_k\) is the likelihood for Model \(k, \, a(n) = \ln n\) for BIC\(_k, \; a(n) = 2\) (not depending upon n) for AIC\(_k,\) and \(m_k\) is the number of independent parameters in Model \(k.\, \) Relative to BIC, AIC tends to favor models with a larger number of parameters. Note that

$$\begin{aligned} pp_k \; \approx \; C \exp (-\text{ BIC}_k/2), \end{aligned}$$

where C is a constant. Thus BIC values can be converted to a scale of 0 to 1: exponentiate \(-\text{BIC}_k/2\) for each model, sum the resulting values, and divide each value by the sum.
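A minimal sketch of this conversion, with hypothetical BIC values for three competing models:

```python
# Convert BIC values to approximate posterior probabilities:
# exponentiate -BIC_k / 2 and divide each value by the sum.
import numpy as np

bic = np.array([210.3, 204.1, 206.7])          # hypothetical BIC values for K = 3 models
w = np.exp(-(bic - bic.min()) / 2)             # subtracting the minimum avoids underflow
post_prob = w / w.sum()                        # approximate posterior probabilities (sum to 1)
print(post_prob.round(3))
```

Subtracting the minimum BIC before exponentiating changes only the constant C and leaves the normalized probabilities unchanged.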

For the PC model,

$$\begin{aligned} -2 \, \text{ ln max } \, L_k = n \, \text{ ln } \, \Pi _{j=k+1}^p \, \lambda _j = n \, \sum _{j=k+1}^p \, \ln \lambda _j. \end{aligned}$$

The criteria can be written as

$$\begin{aligned} \text{MSC}_k = \text{Deviance}_k + \text{ Penalty}_k, \end{aligned}$$

where \(\text {Deviance}_k = -2 \ln \max L_k\) is a measure of lack of fit and \(\text{ Penalty}_k = \; a(n) m_k.\) Inclusion of an additional PC is justified if the criterion value decreases, that is, if MSC\(_{k+1} < \text{MSC}_k. \) For PCs, this is

$$\begin{aligned} n \sum _{j=k+2}^{p} \text{ ln } \lambda _j + (k+1)a(n) < n \sum _{j=k+1}^p \, \text{ ln } \lambda _j + k \, a(n). \end{aligned}$$

This is

$$\begin{aligned} a(n) < n \ln \lambda _{k+1} = \ln (\lambda _{k+1}^n), \end{aligned}$$

or

$$\begin{aligned} \exp [a(n)] < \lambda _{k+1}^n, \end{aligned}$$

or

$$\begin{aligned} \lambda _{k+1} > \exp [a(n)/n]. \end{aligned}$$

Thus for AIC, inclusion of the additional PC\(_{k+1} \) is justified if \(\lambda _{k+1}\) is greater than \(\exp (2/n). \)

For BIC, inclusion of an additional PC\(_{k+1} \) is justified if \( \lambda _{k+1} >\exp (\ln n / n) = \; [\exp (\ln n)]^{1/n} = n^{1/n},\) which tends to 1 for large n. So this is in approximate agreement with the average eigenvalue rule for correlation matrices, which states that one should retain dimensions with eigenvalues larger than 1.
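The following sketch applies this retention rule (with the hypothetical eigenvalues used earlier); the function name `n_components` is illustrative.

```python
# Retain PC_{k+1} while lambda_{k+1} > exp[a(n)/n], with a(n) = 2 for AIC
# and a(n) = ln n for BIC; both thresholds tend to 1 as n grows.
import numpy as np

def n_components(eigvals, n, criterion="bic"):
    """Number of leading eigenvalues exceeding the AIC/BIC threshold."""
    a_n = np.log(n) if criterion == "bic" else 2.0
    threshold = np.exp(a_n / n)
    return int(np.sum(np.sort(eigvals)[::-1] > threshold))

eigvals = np.array([2.0, 1.2, 0.9, 0.6, 0.3])         # hypothetical eigenvalues
print(n_components(eigvals, n=100, criterion="aic"))  # threshold exp(2/100) ~ 1.020 -> 2
print(n_components(eigvals, n=100, criterion="bic"))  # threshold 100**(1/100) ~ 1.047 -> 2
```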

4 Example

Here we consider a sample from the LA Heart Study; see, e.g., [8]. The sample consists of \(n = 100 \) men. The variables include Age, Systolic blood pressure, Diastolic blood pressure, Weight, Height, and Coronary Incident, a binary variable indicating whether or not the individual had a coronary incident during the course of the study. (Data on the same variables for another 100 men are also given in Dixon and Massey’s book; results can be compared and contrasted between the two samples.) Here we focus on the first five variables. Minitab statistical software was used for the analysis.

Table 1 shows the lower-triangular portion of the correlation matrix for the five variables; Table 2 shows the resulting PCs.

Table 1 Correlation matrix of 5 variables–LA heart data
Table 2 PCs of heart data

4.1 Principal Component Analysis in the Example

Note that an eigenvector can be multiplied by \(-1,\) changing the signs of all its elements. Below, this is done with PC1 so that SYS and DIAS have positive loadings. Interpretations (BPtotal, SIZE, AGE, OVERWT, BPdiff) are given below the eigenvectors. The interpretations are based on which loadings are large and which are small. Taking .6 as a cut-off point, in PC1, SYS and DIAS have loadings above this, while the other variables have loadings less than this (in fact, less than .4), so PC1 can be interpreted as an index of total BP. In PC2, WT and HT have large loadings with the same sign, so PC2 can be interpreted as SIZE (Table 3).

Table 3 PC1 is multiplied by \(-1\)

As above, denote the eigensystem by

$$\begin{aligned} (\lambda _v, {\varvec{a}}_v), \, v = 1, 2, \ldots , p. \, \end{aligned}$$

Then the eigensystem equations are

$$\begin{aligned} {\mathbf{S}} \, {\varvec{a}}_v \; = \; \lambda _v \, {\varvec{a}}_v, \, v = 1, 2, \ldots , p. \, \end{aligned}$$

Here \({\mathbf{S}}\) is taken to be the correlation matrix. Let \({\mathbf{1}}_v' \; = \; ( 0 \; 0 \cdots \; 1 \cdots \; 0 ), \) the vector with 1 in the v-th position and zeroes elsewhere. The covariance between a variable \(X_v\) and a PC \(C_u\) is \({\mathcal{C}}[X_v, \, C_u \,] = \, {\mathcal{C}}[{\mathbf{1}}_v' {\varvec{X}}, {\varvec{a}}_u' \, {\varvec{X}}] = {\mathbf{1}}_v' {\varvec{\Sigma }} \, {\varvec{a}}_u \; = \; {\mathbf{1}}_v' \, \lambda _u \, {\varvec{a}}_u \; = \; \lambda _u a_{uv}, \, \) where \( \, a_{uv} \, \) is the v-th element of the vector \({\varvec{a}}_u. \, \) The correlation is Corr\( [X_v, \, C_u \,] = \; {\mathcal{C}}[X_v, \, C_u \,] / {SD}[X_v] \, {SD}[ C_u \, ] \; = \; \lambda _u \, a_{uv} \, / \, \sigma _v \, \sqrt{\lambda _u} \; = \; \sqrt{\lambda _u} \, a_{uv} \, / \, \sigma _v. \, \) When the correlation matrix is used, \(\sigma _v = 1, \) and this correlation is \(\sqrt{\lambda _u} \, a_{uv}. \) A correlation of size greater than .6 corresponds to more than 36% of variance explained. The variable \(X_v \, \) has a correlation greater than .6 in size with the component \(C_u\) if its loading in \(C_u, \) the value \(a_{uv}, \) is greater than \(.6 / \sqrt{\lambda _u}\) in size. These values are appended to the table below. Loadings larger than this cut point are in boldface. (The cut-off of .6 is somewhat arbitrary; one might use, for example, a cut-off of .5.)
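A sketch of these computations (hypothetical data; with a correlation matrix, the variable-component correlations are simply the loadings scaled by the square roots of the eigenvalues):

```python
# For a correlation matrix, Corr[X_v, C_u] = sqrt(lambda_u) * a_{uv}, so a
# correlation above .6 in size corresponds to a loading above .6 / sqrt(lambda_u).
import numpy as np

rng = np.random.default_rng(0)
R = np.corrcoef(rng.normal(size=(100, 5)), rowvar=False)  # hypothetical correlation matrix

eigvals, A = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

corr_X_C = A * np.sqrt(eigvals)        # entry (v, u): correlation of X_v with C_u
cutoffs = 0.6 / np.sqrt(eigvals)       # per-component loading cut points
large = np.abs(A) > cutoffs            # loadings whose correlations exceed .6 in size
```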

One can also focus on the pattern of loadings within the different PCs for interpretation of the PCs. To reiterate:

  1. PC1:

    SYS and DIAS have large loadings with the same sign; we interpret PC1 as BPindex or BPtotal.

  2. PC2:

    WT and HT have large loadings of the same sign; we interpret PC2 as the man’s SIZE.

  3. PC3:

    Only AGE has a large loading; we interpret PC3 as AGE.

  4. PC4:

    WT and HT have large loadings with opposite signs; we interpret PC4 as OVERWEIGHT.

  5. PC5:

    SYS and DIAS have large loadings with opposite signs; we interpret PC5 as BPdrop.

I continue to marvel at how readily interpretable the PCs are, even without using a factor analysis model and rotation (Table 4).

Table 4 Loadings corresponding to correlations \( > .6 \) are boldface

4.2 Employing the Criteria in the Example

Table 5 shows the eigenvalues and the results according to the various criteria. According to the rule based on the average eigenvalue, a dimension is retained if its eigenvalue is greater than 1 (for a correlation matrix). For BIC, the k-th PC is retained if \(n \, \ln \, \lambda _k > a(n), \) where \(a(n) = \ln n .\) Here, \(n = 100 \) and \( \ln n = \ln 100 \approx 4.61.\) For AIC, the k-th PC is retained if \(n \ln \lambda _k > 2. \) In this example, the methods agree on retaining \(k = 2\) PCs.

I feel that I should remark that, although only two PCs are retained by the criteria, the fourth and fifth PCs do have simple and interesting interpretations. It is just that they do not improve the fit very much.

Table 5 Estimating the number of PCs by various methods

5 Discussion

The focus here has been on determining the number of dimensions needed to represent a complex of variables adequately.

5.1 Regression on Principal Components

Given a response variable Y and explanatory variables \(X_1, X_2, \ldots , X_p,\) one may transform the Xs to their principal components, as this may aid in the interpretation of the results of the regression. In such regression on principal components (see, e.g., [10]), however, one should not necessarily eliminate the principal components with small eigenvalues, as they may still be strongly related to the response variable. The Bayesian information criterion is

$$\begin{aligned} BIC_k = - 2 LL_k + m_k \ln n, \end{aligned}$$

for alternative models indexed by \( k = 1, 2,\ldots , K,\) where \(LL_k\) is the maximum log likelihood for Model k and \(m_k\) is the number of independent parameters in Model k. For linear regression models with Gaussian-distributed errors, BIC takes the form

$$\begin{aligned} BIC_k = n \ln MSE_k + m_k \ln n \end{aligned}$$

where \(MSE_k\) is the maximum likelihood estimate of the error variance of Model k, that is, the residual sum of squares with divisor n. With p explanatory variables, there are \(2^p\) alternative models (including the model where no explanatory variables are used and the fitted value of Y is simply \(\bar{y}). \) It would usually seem wise to evaluate all \(2^p\) models using \(BIC_k\) rather than reducing the number of principal components by examining the explanatory variables alone.
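A sketch of this all-subsets evaluation on hypothetical principal-component scores; the helper `bic_subset` is illustrative.

```python
# Score every subset of PC regressors by BIC_k = n ln(MSE_k) + m_k ln(n),
# where MSE_k is the residual sum of squares divided by n.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
C = rng.normal(size=(n, p))                    # hypothetical principal-component scores
y = 1.0 + 0.5 * C[:, 0] - 0.8 * C[:, 3] + rng.normal(scale=0.5, size=n)

def bic_subset(cols):
    Z = np.column_stack([np.ones(n)] + [C[:, j] for j in cols])  # intercept + chosen PCs
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    mse = np.mean((y - Z @ beta) ** 2)         # ML estimate of the error variance (divisor n)
    m = Z.shape[1] + 1                         # regression coefficients plus the error variance
    return n * np.log(mse) + m * np.log(n)

subsets = [c for r in range(p + 1) for c in itertools.combinations(range(p), r)]
best = min(subsets, key=bic_subset)
print(best)                                    # subset of PCs with the smallest BIC
```

This evaluates all \(2^p\) subsets in relation to the response Y, rather than screening the PCs by their eigenvalues alone.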

5.2 Some Related Recent Literature

Various applications involving the choice of the number of principal components appear in the recent literature; the method presented here could be applied in such settings. A good book on the topic of model selection and testing is Bhatti et al. [7]. In recent years econometricians have examined the problems of diagnostic testing, specification testing, semiparametric estimation, and model selection. In addition, researchers have considered whether to use model testing or model selection procedures to decide upon the models that best fit a particular dataset. That book explores both issues with application to various regression models, including arbitrage pricing theory models. Along the lines of model-selection criteria, the book references, e.g., Schwarz [12], the foundational paper for BIC.

Next we mention some recent papers which show applications of model selection in various research areas.

One such paper is Xu et al. [14], an application of principal component analysis and other methods to water quality assessment in a lake basin in China.

Another is Omuya et al. [11], on feature selection for classification using principal component analysis.

As mentioned, a particularly interesting application of principal component analysis is in regression and logistic regression. We have noted the paper by Massy [10] on using principal components in regression. Another is Aguilera et al. [1], on using principal components in logistic regression.

6 Conclusions

The information criteria AIC and BIC have been applied here to the choice of the number of principal components to represent a dataset. The results have been compared and contrasted with criteria such as retaining those principal components which explain more than an average amount of the total variance.