Using Model Selection Criteria to Choose the Number of Principal Components

The use of information criteria, especially AIC (Akaike’s information criterion) and BIC (Bayesian information criterion), for choosing an adequate number of principal components is illustrated.


Introduction
This paper applies model selection criteria, especially AIC and BIC, to the problem of choosing a sufficient number of principal components to retain. It applies the concepts of Sclove [13] to this particular problem.

Background
Other researchers have considered the problem of the choice of the number of principal components. For example, Bai et al. [6] examined the asymptotic consistency of the criteria AIC and BIC for determining the number of significant principal components in high-dimensional problems. The focus here is not necessarily on high-dimensional problems.
To begin the discussion here, we first give a short review of some general background on the relevant portions of multivariate statistical analysis, such as may be obtained from textbooks such as Anderson [5] or Johnson and Wichern [9].

Sample Quantities
Suppose we have a multivariate sample $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ of $n$ $p$-dimensional random vectors, $\mathbf{x}_i' = (x_{i1}, x_{i2}, \ldots, x_{ip})$. The transpose ($'$) means that we are thinking of the vectors as column vectors. The sample mean vector is $\bar{\mathbf{x}} = n^{-1} \sum_{i=1}^n \mathbf{x}_i$. The $p \times p$ sample covariance matrix is $\mathbf{S} = n^{-1} \sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'$.
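As a concrete sketch, these sample quantities can be computed as follows (NumPy, with illustrative data; the divisor $n$ may be replaced by $n-1$ for the unbiased version):

```python
import numpy as np

# n = 4 observations of p = 3 variables; row i is the observation vector x_i'
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 3.0, 2.0],
              [4.0, 2.0, 5.0]])
n, p = X.shape

xbar = X.mean(axis=0)        # sample mean vector
D = X - xbar                 # deviations from the mean
S = D.T @ D / n              # p x p sample covariance matrix (divisor n)
```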

Population Quantities and Principal Components
The sample covariance matrix $\mathbf{S}$ estimates the true covariance matrix $\boldsymbol{\Sigma}$ of the random variables $X_1, X_2, \ldots, X_p$. That is, $\mathbf{S}$ estimates $\boldsymbol{\Sigma} = [\sigma_{uv}]$, where $\sigma_{uv}$ is the covariance of $X_u$ and $X_v$. In particular, $\sigma_{vv} = \operatorname{Var}(X_v)$.

The principal components of $\mathbf{X} = (X_1, X_2, \ldots, X_p)'$ are defined as uncorrelated linear combinations of maximal variance. A linear combination, say LC, of the $p$ variables is $\mathbf{a}'\mathbf{X}$, that is, $\mathrm{LC} = \mathbf{a}'\mathbf{X} = a_1 X_1 + a_2 X_2 + \cdots + a_p X_p$. Here the vector $\mathbf{a}$ is a vector of scalars $a_1, a_2, \ldots, a_p$; these are the coefficients in the linear combination. Such linear combinations are called variates.
We have $\operatorname{Var}(\mathrm{LC}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$. This is estimated as $\mathbf{a}'\mathbf{S}\mathbf{a}$. This is to be maximized over $\mathbf{a}$. The derivative with respect to $\mathbf{a}$ is $2\mathbf{S}\mathbf{a}$. A constraint is required for meaningful maximization. A reasonable such constraint is $\mathbf{a}'\mathbf{a} = 1$, which is equivalent to the length of $\mathbf{a}$, the quantity $\sqrt{\mathbf{a}'\mathbf{a}}$, being equal to 1.
The Lagrangian function incorporating the constraint is $\mathcal{L}(\mathbf{a}, \lambda) = \mathbf{a}'\mathbf{S}\mathbf{a} - \lambda(\mathbf{a}'\mathbf{a} - 1)$. The partial derivatives are $\partial \mathcal{L}/\partial \mathbf{a} = 2\mathbf{S}\mathbf{a} - 2\lambda\mathbf{a}$ and $\partial \mathcal{L}/\partial \lambda = -(\mathbf{a}'\mathbf{a} - 1)$. Setting these equal to zero gives simultaneous equations. The first is the equation $\mathbf{S}\mathbf{a} - \lambda\mathbf{a} = \mathbf{0}$, the zero vector. This is the homogeneous equation $(\mathbf{S} - \lambda\mathbf{I})\mathbf{a} = \mathbf{0}$. For nontrivial solutions, we must have $\det(\mathbf{S} - \lambda\mathbf{I}) = 0$. This is a polynomial equation of degree $p$ in $\lambda$; denote the roots by $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$. These are the eigenvalues. Their sum is the trace of $\mathbf{S}$; their product is the determinant of $\mathbf{S}$.
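In practice the eigenvalues are found numerically rather than by solving the degree-$p$ polynomial. A minimal sketch, assuming NumPy is available (the matrix S below is illustrative):

```python
import numpy as np

S = np.array([[4.0, 2.0, 0.6],
              [2.0, 3.0, 0.5],
              [0.6, 0.5, 1.0]])       # an illustrative covariance matrix

# eigh is the routine for symmetric matrices; it returns ascending eigenvalues
lam, A = np.linalg.eigh(S)
lam, A = lam[::-1], A[:, ::-1]        # reorder so lambda_1 >= ... >= lambda_p

# the sum of the eigenvalues is the trace; their product is the determinant
print(lam.sum(), np.trace(S))         # equal up to rounding (both 8.0 here)
```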
Journal of Statistical Theory and Applications (2021) 20:450-461

The corresponding eigenequations are $\mathbf{S}\mathbf{a}_j = \lambda_j \mathbf{a}_j$. The $j$-th PC (principal component), $C_j$, is the linear combination of the form $C_j = \mathbf{a}_j'\mathbf{X}$, where $\mathbf{a}_j' = (a_{1j}, a_{2j}, \ldots, a_{pj})$. That is to say, for $j = 1, 2, \ldots, p$, the value of the $j$-th PC for individual $i$ is $\mathbf{a}_j'\mathbf{x}_i$, $i = 1, 2, \ldots, n$. The equations for the PCs in terms of the $X$s are $\mathrm{PC}_j = \mathbf{a}_j'\mathbf{X}$, $j = 1, 2, \ldots, p$. Let $\mathbf{C}$ be the $p$-vector of PCs. Then $\mathbf{C} = \mathbf{A}'\mathbf{X}$, where $\mathbf{A}$ is the matrix whose columns are the eigenvectors $\mathbf{a}_j$. The inverse relation is $\mathbf{X} = (\mathbf{A}')^{-1}\mathbf{C}$, where $\mathbf{A}$ is the matrix of loadings of the $X_v$ on the PCs $C_j$. Actually, $\mathbf{A}$ is an orthonormal matrix (its columns are of length one and are pairwise orthogonal), so $\mathbf{A}^{-1} = \mathbf{A}'$. Thus $(\mathbf{A}')^{-1} = \mathbf{A}$. So $\mathbf{X} = \mathbf{A}\mathbf{C}$. Letting $\mathbf{a}_{(v)}'$ be the $v$-th row of the matrix $\mathbf{A}$, that is, $\mathbf{a}_{(v)}' = (a_{v1}, a_{v2}, \ldots, a_{vp})$, we have $X_v = \mathbf{a}_{(v)}'\mathbf{C} = \sum_{j=1}^p a_{vj} C_j$. In terms of the first $k$ PCs, this is $X_v = \sum_{j=1}^k a_{vj} C_j + \epsilon_v$, (*) where the error $\epsilon_v$ is $\epsilon_v = \sum_{j=k+1}^p a_{vj} C_j$. The covariance matrix can be represented as $\boldsymbol{\Sigma} = \sum_{j=1}^p \lambda_j \mathbf{a}_j \mathbf{a}_j'$. Correspondingly, the best rank-$k$ approximation to $\boldsymbol{\Sigma}$ is $\sum_{j=1}^k \lambda_j \mathbf{a}_j \mathbf{a}_j'$. Recall that for a positive semi-definite symmetric matrix such as a covariance matrix, the eigenvalues are non-negative.
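The spectral representation and its rank-$k$ truncation are easy to verify numerically. A sketch, using an illustrative covariance matrix:

```python
import numpy as np

S = np.array([[4.0, 2.0, 0.6],
              [2.0, 3.0, 0.5],
              [0.6, 0.5, 1.0]])                   # illustrative covariance matrix
lam, A = np.linalg.eigh(S)
lam, A = lam[::-1], A[:, ::-1]                    # descending order; column j is a_j

# spectral representation: S = sum_j lambda_j a_j a_j'
S_full = sum(lam[j] * np.outer(A[:, j], A[:, j]) for j in range(len(lam)))

# best rank-k approximation: keep only the first k terms
k = 1
S_k = sum(lam[j] * np.outer(A[:, j], A[:, j]) for j in range(k))
```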

Procedure Based on the Average Eigenvalue
The average eigenvalue is $\bar{\lambda} = p^{-1}\sum_{j=1}^p \lambda_j$. One rule for the number of PCs to retain is to retain those whose eigenvalues are greater than $\bar{\lambda}$. When $\mathbf{S}$ is taken to be the sample correlation matrix, the trace is $p$ and the average eigenvalue $\bar{\lambda}$ is 1.
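A sketch of the average-eigenvalue rule (the eigenvalues below are illustrative; for a correlation matrix the average is 1):

```python
import numpy as np

lam = np.array([2.6, 1.1, 0.6, 0.4, 0.3])   # illustrative eigenvalues, sum = 5 = p
lam_bar = lam.mean()                        # average eigenvalue (1 for a correlation matrix)
n_retain = int((lam > lam_bar).sum())       # retain PCs with above-average eigenvalues
print(n_retain)                             # 2
```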

Procedure Based on Retaining a Prescribed Portion of the Total Variance
Another procedure is to retain a number of PCs sufficient to account for, say, 90% of the total variance, $\operatorname{tr}\boldsymbol{\Sigma} = \sum_{j=1}^p \lambda_j$. Of course the figure of ninety percent is somewhat arbitrary, and it would be desirable to have a somewhat more objective criterion.
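This rule can be sketched as follows (illustrative eigenvalues):

```python
import numpy as np

lam = np.array([2.6, 1.1, 0.6, 0.4, 0.3])    # illustrative eigenvalues
frac = np.cumsum(lam) / lam.sum()            # cumulative proportion of total variance
k = int(np.argmax(frac >= 0.90)) + 1         # smallest k accounting for at least 90%
print(k)                                     # 4
```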

Procedure Based on the Dropoff of the Eigenvalues
Another procedure is to plot $\lambda_1, \lambda_2, \ldots, \lambda_p$ against $1, 2, \ldots, p$. One then looks for an elbow in the curve and retains a number of PCs corresponding to the point before the leveling off of the curve, if it does indeed take an elbow shape. Such a plot is called a scree plot, "scree" being the rock debris at the foot of a cliff.
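A scree plot can be produced as follows (a sketch using matplotlib; the eigenvalues are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                        # non-interactive backend; plot goes to a file
import matplotlib.pyplot as plt

lam = np.array([2.6, 1.1, 0.6, 0.4, 0.3])    # illustrative eigenvalues
j = np.arange(1, len(lam) + 1)

plt.plot(j, lam, marker="o")
plt.xlabel("component number $j$")
plt.ylabel("eigenvalue $\\lambda_j$")
plt.title("Scree plot")
plt.savefig("scree.png")
```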

AIC and BIC for the Number of PCs
Let us see what a Gaussian model would imply. The maximized likelihood for the model (*) approximating the $p$ variables in terms of $k$ PCs is $(2\pi)^{-np/2}\,|\hat{\boldsymbol{\Sigma}}_k|^{-n/2}\,C(n, p, k)$, where $C(n, p, k)$ is a constant depending upon $n$, $p$, and $k$, and $|\hat{\boldsymbol{\Sigma}}_k|$ denotes the determinant of the residual covariance matrix $\hat{\boldsymbol{\Sigma}}_k$.
The determinant of the covariance matrix is the product of the eigenvalues, $|\boldsymbol{\Sigma}| = \prod_{j=1}^p \lambda_j$. For the model based on the first $k$ PCs, the determinant of the residual covariance matrix is $|\hat{\boldsymbol{\Sigma}}_k| = \prod_{j=k+1}^p \lambda_j$. The model-selection criterion AIC (Akaike's information criterion) [2][3][4] is based on an estimate of the cross-entropy of each of $K$ proposed models with a null model.
The Bayesian information criterion BIC [12] is based on a large-sample estimate of the posterior probability $pp_k$ of Model $k$, $k = 1, 2, \ldots, K$.
More precisely, $\mathrm{BIC}_k$ is an approximation to $-2 \ln pp_k$. These model-selection criteria (MSCs) are thus smaller-is-better criteria and take the form $\mathrm{MSC}_k = -2\ln\max L_k + a(n)\,m_k$, where $L_k$ is the likelihood for Model $k$, $a(n) = \ln n$ for $\mathrm{BIC}_k$, $a(n) = 2$ (not depending upon $n$) for $\mathrm{AIC}_k$, and $m_k$ is the number of independent parameters in Model $k$. Relative to BIC, AIC tends to favor models with a larger number of parameters. Note that $pp_k \approx C \exp(-\mathrm{BIC}_k/2)$, where $C$ is a constant. Thus BIC values can be converted to a scale of 0 to 1: exponentiate $-\mathrm{BIC}_k/2$ for each model, sum these values over $k$, and divide each by the sum.
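The conversion of BIC values to approximate posterior probabilities can be sketched as follows (the BIC values are illustrative):

```python
import numpy as np

bic = np.array([210.4, 205.1, 208.9])     # illustrative BIC values for K = 3 models

# pp_k is proportional to exp(-BIC_k / 2); subtracting the minimum first
# avoids underflow and cancels in the normalization
w = np.exp(-(bic - bic.min()) / 2)
pp = w / w.sum()
print(pp.argmax())    # 1: the model with the smallest BIC gets the largest probability
```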
For the PC model, $-2\ln\max L_k = n \ln |\hat{\boldsymbol{\Sigma}}_k| + \text{const} = n \sum_{j=k+1}^p \ln \lambda_j + \text{const}$. The criteria can be written as $\mathrm{MSC}_k = \mathrm{Deviance}_k + \mathrm{Penalty}_k$, where $\mathrm{Deviance}_k = -2\ln\max L_k$ is a measure of lack of fit and $\mathrm{Penalty}_k = a(n)\,m_k$. Inclusion of an additional PC is justified if the criterion value decreases, that is, if $\mathrm{MSC}_{k+1} < \mathrm{MSC}_k$. For PCs, this is $n \sum_{j=k+2}^p \ln \lambda_j + a(n)\,m_{k+1} < n \sum_{j=k+1}^p \ln \lambda_j + a(n)\,m_k$, or $a(n)(m_{k+1} - m_k) < n \ln \lambda_{k+1}$, or, with $m_{k+1} - m_k = 1$, $\ln \lambda_{k+1} > a(n)/n$, or $\lambda_{k+1} > \exp(a(n)/n)$.
Thus for AIC, inclusion of the additional PC $C_{k+1}$ is justified if $\lambda_{k+1}$ is greater than $\exp(2/n)$. For BIC, inclusion of an additional PC $C_{k+1}$ is justified if $\lambda_{k+1} > \exp(\ln n/n) = n^{1/n}$, which tends to 1 for large $n$. So this is in approximate agreement with the average-eigenvalue rule for correlation matrices, which states that one should retain dimensions with eigenvalues larger than 1.
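The resulting retention rule can be sketched as follows, taking the threshold as $\exp(a(n)/n)$ with $a(n) = 2$ for AIC and $a(n) = \ln n$ for BIC (the eigenvalues and $n$ are illustrative):

```python
import numpy as np

def n_pcs_retained(lam, n, criterion="BIC"):
    """Count leading eigenvalues exceeding the threshold exp(a(n)/n),
    with a(n) = 2 for AIC and a(n) = ln n for BIC.
    lam: eigenvalues of the correlation matrix, sorted in descending order."""
    a = 2.0 if criterion == "AIC" else np.log(n)
    threshold = np.exp(a / n)          # equals n**(1/n) for BIC
    k = 0
    for value in lam:
        if value > threshold:
            k += 1
        else:
            break
    return k

lam = [2.6, 1.1, 1.03, 0.17, 0.1]      # illustrative eigenvalues, sum = 5 = p
print(n_pcs_retained(lam, n=100, criterion="AIC"))   # 3
print(n_pcs_retained(lam, n=100, criterion="BIC"))   # 2
```

Note that the BIC threshold $n^{1/n}$ exceeds the AIC threshold $\exp(2/n)$ once $\ln n > 2$, so BIC retains no more components than AIC for $n > e^2 \approx 7.4$.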

Example
Here we consider a sample from the LA Heart Study. See, e.g., [8]. The sample is n = 100 men. The variables include Age, Systolic blood pressure, Diastolic blood pressure, Weight, Height, and Coronary Incident, a binary variable indicating whether or not the individual had a coronary incident during the course of the study. (Data on the same variables for another 100 men are also given in Dixon and Massey's book; results can be compared and contrasted between the two samples.) Here we focus on the first five variables. Minitab statistical software was used for the analysis. Table 1 shows the lower-triangular portion of the correlation matrix for the five variables (Table 2).

Principal Component Analysis in the Example
Note that an eigenvector can be multiplied by −1, changing the signs of all its elements. Below, this is done with PC1 so that SYS and DIAS have positive loadings. Interpretations (BPtotal, SIZE, AGE, OVERWT, BPdiff) are given below the eigenvectors. The interpretations are based on which loadings are large and which are small. Taking .6 as a cut-off point, in PC1, SYS and DIAS have loadings above this, while the other variables have loadings less than this (in fact, less than .4), so PC1 can be interpreted as an index of total BP. In PC2, WT and HT have large loadings with the same sign, so PC2 can be interpreted as SIZE (Table 3).
As above, denote the eigensystem by $(\lambda_j, \mathbf{a}_j)$, $j = 1, 2, \ldots, p$.

Then the eigensystem equations are $\mathbf{S}\mathbf{a}_j = \lambda_j \mathbf{a}_j$, $j = 1, 2, \ldots, p$.
Here $\mathbf{S}$ is taken to be the correlation matrix. Let $\mathbf{e}_v' = (0\;0\;\cdots\;1\;\cdots\;0)$, the vector with 1 in the $v$-th position and zeroes elsewhere. The covariance between a variable $X_v = \mathbf{e}_v'\mathbf{X}$ and a PC $C_u = \mathbf{a}_u'\mathbf{X}$ is $\mathbf{e}_v'\boldsymbol{\Sigma}\mathbf{a}_u = \lambda_u \mathbf{e}_v'\mathbf{a}_u = \lambda_u a_{vu}$, so the corresponding correlation is $\sqrt{\lambda_u}\,a_{vu}/\sigma_v$. When the correlation matrix is used, $\sigma_v = 1$, and this correlation is $\sqrt{\lambda_u}\,a_{vu}$. A correlation of size greater than .6 corresponds to 36% of variance explained. The variable $X_v$ has a correlation higher than .6 with the component $C_u$ if its loading in $C_u$, the value $a_{vu}$, is greater than $.6/\sqrt{\lambda_u}$. These values are appended to the table below. Loadings larger than this cut point are in boldface. (The cut-off of .6 is somewhat arbitrary; one might use, for example, a cut-off of .5.) One can also focus on the pattern of loadings within the different PCs for interpretation of the PCs. To reiterate: I continue to marvel at how readily interpretable the PCs are. And this is even without using a factor-analysis model and rotation (Table 4). Table 5 shows the eigenvalues and the results according to the various criteria. According to the rule based on the average eigenvalue, a dimension is retained if its eigenvalue is greater than 1 (for a correlation matrix). For BIC, the $k$-th PC is retained if $n \ln \lambda_k > a(n)$, where $a(n) = \ln n$. Here, $n = 100$ and $\ln n = \ln 100 \approx 4.61$.
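The loading cut-offs $.6/\sqrt{\lambda_u}$ can be computed directly; a sketch with illustrative eigenvalues:

```python
import numpy as np

lam = np.array([2.6, 1.1, 1.03, 0.17, 0.1])   # illustrative eigenvalues (correlation matrix)
r_cut = 0.6                                   # required variable-component correlation

# corr(X_v, C_u) = sqrt(lambda_u) * a_vu, so the correlation exceeds r_cut
# exactly when the loading a_vu exceeds r_cut / sqrt(lambda_u)
loading_cut = r_cut / np.sqrt(lam)
print(np.round(loading_cut, 3))
```

Components with small eigenvalues require loadings larger than 1, which is impossible for unit-length eigenvectors, so no variable can correlate above the cut-off with such a component.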

Discussion
The focus here has been on determining the number of dimensions needed to represent a complex of variables adequately.

Regression on Principal Components
Given a response variable $Y$ and explanatory variables $X_1, X_2, \ldots, X_p$, one may transform the $X$s to their principal components, as this may aid in the interpretation of the results of the regression. In such regression on principal components (see, e.g., [10]), however, one should not necessarily eliminate the principal components with small eigenvalues, as they may still be strongly related to the response variable. The Bayesian information criterion is $\mathrm{BIC}_k = -2 LL_k + m_k \ln n$ for alternative models indexed by $k = 1, 2, \ldots, K$, where $LL_k$ is the maximum log likelihood for Model $k$ and $m_k$ is the number of independent parameters in Model $k$. For linear regression models with Gaussian-distributed errors, BIC takes the form $\mathrm{BIC}_k = n \ln \mathrm{MSE}_k + m_k \ln n$, where $\mathrm{MSE}_k$ is the maximum likelihood estimate (with divisor $n$) of the error variance of Model $k$. With $p$ explanatory variables, there are $2^p$ alternative models (including the model in which no explanatory variables are used and the fitted value of $Y$ is simply $\bar{y}$). It would usually seem wise to evaluate all $2^p$ models using $\mathrm{BIC}_k$ rather than to reduce the number of principal components by looking only at the explanatory variables.
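An all-subsets evaluation of this kind can be sketched as follows (simulated data; the helper bic_of_subset is hypothetical, implementing $\mathrm{BIC}_k = n \ln \mathrm{MSE}_k + m_k \ln n$):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                               # illustrative predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)    # X_2 is irrelevant

def bic_of_subset(cols):
    """BIC = n ln MSE + m ln n, MSE being the MLE (divisor n) of the error variance."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    mse = np.mean((y - Z @ beta) ** 2)
    m = Z.shape[1] + 1                    # regression coefficients plus sigma^2
    return n * np.log(mse) + m * np.log(n)

# evaluate all 2^p subsets of the predictors (here, these could be the PCs)
subsets = [c for k in range(p + 1) for c in combinations(range(p), k)]
best = min(subsets, key=bic_of_subset)
print(best)
```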

Some Related Recent Literature
Various applications from the recent literature involving choosing the number of principal components include the following; the method presented here could possibly be applied in these settings. For example, a good book on the topic of model selection and testing, covering all aspects, is Bhatti et al. [7]. In recent years econometricians have examined the problems of diagnostic testing, specification testing, semiparametric estimation, and model selection. In addition, researchers have considered whether to use model testing and model selection procedures to decide upon the models that best fit a particular dataset. This book explores both issues with application to various regression models, including arbitrage pricing theory models. Along the lines of model-selection criteria, the book references, e.g., Schwarz [12], the foundational paper for BIC.
Next we mention some recent papers which show applications of model selection in various research areas.
One such paper is Xu et al. [14], an application of principal components analysis and other methods to water quality assessment in a lake basin in China. Another is Omuya et al. [11], on feature selection for classification using principal component analysis.
As mentioned, a particularly interesting application of principal components analysis is in regression and logistic regression. We have mentioned the paper by Massy [10] on using principal components analysis in regression. Another is Aguilera et al. [1] on using principal components in logistic regression.

Conclusions
The information criteria AIC and BIC have been applied here to the choice of the number of principal components to represent a dataset. The results have been compared and contrasted with criteria such as retaining those principal components which explain more than an average amount of the total variance.