Abstract
Multivariate statistical methods beyond correlation and covariance analysis include principal component analysis (PCA), factor analysis, discriminant analysis, classification, and various regression methods. Because of the increasing use of neural networks in recent decades, some classical statistical methods are now used less frequently. PCA, however, has remained in continuous use in both statistical data analysis and machine learning because of its versatility. At the same time, PCA has been extended in several directions, including nonlinear PCA, kernel PCA, and PCA for discrete data.
This chapter presents an overview of PCA and its applications to geosciences. The mathematical formulation of PCA is kept to a minimum; instead, the presentation emphasizes data analytics and innovative uses of PCA in geosciences. More applications of PCA are presented in Chap. 10.
It is the mark of a truly educated person to be deeply moved by statistics.
Oscar Wilde or Bernard Shaw (attribution unclear)
References
Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433–459.
Ferguson, J. (1994). Introduction to linear algebra in geology. London, UK: Chapman & Hall.
Hindlet, F., Ma, Y. Z., & Hass, A. (1991). Statistical analysis on AVO data. In Proceedings of EAEG, C028, pp. 264–265, Florence, Italy.
Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). New York: Springer.
Ma, Y. Z. (2011). Lithofacies clustering using principal component analysis and neural network: applications to wireline logs. Mathematical Geosciences, 43(4), 401–419.
Ma, Y. Z., & Gomez, E. (2015). Uses and abuses in applying neural networks for predicting reservoir properties. Journal of Petroleum Science and Engineering, 133, 66–75. https://doi.org/10.1016/j.petrol.2015.05.006.
Ma, Y. Z., & Zhang, Y. (2014). Resolution of happiness-income paradox. Social Indicators Research, 119(2), 705–721. https://doi.org/10.1007/s11205-013-0502-9.
Ma, Y. Z., et al. (2014, April). Identifying hydrocarbon zones in unconventional formations by discerning Simpson's paradox. Paper SPE 169496 presented at the SPE Western and Rocky Regional Conference.
Ma, Y. Z., Moore, W. R., Gomez, E., Luneau, B., Kaufman, P., Gurpinar, O., & Handwerger, D. (2015). Wireline log signatures of organic matters and lithofacies classifications for shale and tight carbonate reservoirs. In Y. Z. Ma & S. Holditch (Eds.), Handbook of unconventional resource (pp. 151–171). Waltham: Gulf Professional Publishing/Elsevier.
Prensky, S. E. (1984). A Gamma-ray log anomaly associated with the Cretaceous-Tertiary boundary in the Northern Green River Basin, Wyoming. In B. E. Law (Ed.), Geological characteristics of low-permeability Upper Cretaceous and Lower Tertiary Rocks in the Pinedale Anticline Area, Sublette County, Wyoming, USGS Open-File 84-753, pp. 22–35. https://pubs.usgs.gov/of/1984/0753/report.pdf. Accessed 6 Aug 2017.
Richman, M. (1986). Rotation of principal components. Journal of Climatology, 6(3), 293–335.
Appendices
Appendix 5.1 Introduction to PCA with an Example
PCA transforms data to a new coordinate system such that the greatest variance of the projected data lies on the first coordinate, the second greatest variance on the second coordinate, and so on. From the geometric point of view, PCA can be interpreted as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. The variance of each PC is related to the length of the axis of the ellipsoid that the PC represents. To find the axes of the ellipsoid, the correlation or covariance matrix of the data is constructed (see Chap. 4), and the eigenvalues and corresponding eigenvectors of that matrix are calculated. Each of the mutually orthogonal eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This transforms the correlation or covariance matrix into a diagonalized matrix, with the diagonal elements representing the variance along each axis. The proportion of the variance that each eigenvector represents is the ratio of its eigenvalue to the sum of all eigenvalues.
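The whole construction can be condensed into a few lines of linear algebra. Below is a minimal sketch in Python/NumPy; the 3-variable random data matrix is purely illustrative, not from the chapter.

```python
import numpy as np

# Illustrative data matrix: k = 3 variables (rows), n = 200 samples (columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))

# Correlation matrix of the row variables (see Chap. 4)
C = np.corrcoef(X)

# Eigen-decomposition; eigh handles symmetric matrices such as C
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Sort the eigenpairs from the largest to the smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Proportion of the variance that each eigenvector represents
print(eigenvalues / eigenvalues.sum())
```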
Consider a data matrix, X, in which k rows represent different variables and n columns represent different samples of the variables (an example is given later). The principal components of X can be calculated by

\( P = E^t X \)  (5.2)

where P is the k × n matrix of all the PCs, E^t is the transpose of the k × k matrix of eigenvectors, and X is the k × n input data matrix. Notice that the data matrix is generally not square, and the number of variables, k, is generally smaller than the number of samples (observations), n. Therefore, obtaining the principal components, P, reduces to finding the eigenvector matrix, E.
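In code, Eq. 5.2 is a single matrix product. A sketch, assuming the eigenvectors are stored as the columns of E:

```python
import numpy as np

def principal_components(X, E):
    """Eq. 5.2: P = E^t X.

    X: k x n data matrix (variables in rows, samples in columns).
    E: k x k matrix with one eigenvector per column.
    Returns the k x n matrix of principal components.
    """
    return E.T @ X
```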
The covariance matrix, C, of the data matrix is

\( C = \frac{1}{n} X X^t \)  (5.3)

where C is of size k × k and each variable in X is assumed centered (its mean removed). If the input variables are standardized to unit standard deviation, C is the correlation matrix.
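A sketch of both matrices, assuming the mean of each variable is removed before forming the product (the 1/n scaling follows the population convention used above):

```python
import numpy as np

def covariance_matrix(X):
    # X: k x n data matrix, variables in rows, samples in columns
    Xc = X - X.mean(axis=1, keepdims=True)   # center each variable
    return (Xc @ Xc.T) / X.shape[1]          # k x k covariance matrix

def correlation_matrix(X):
    # Standardizing each variable to unit standard deviation turns
    # the covariance matrix into the correlation matrix
    S = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    return (S @ S.T) / X.shape[1]            # k x k correlation matrix
```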
The correlation or covariance matrix is positive (semi)definite (this concept is further discussed in Chaps. 16 and 17), implying that it can be eigen-decomposed with orthogonal eigenvectors and expressed as

\( C E = v E \)  (5.4)

where E is an eigenvector of the matrix C, and v is the eigenvalue associated with the eigenvector E.
After diagonalizing the correlation or covariance matrix, the values on the diagonal elements are its eigenvalues. The directions of the orthogonal axes are the eigenvectors. Several algorithms are available to solve Eq. 5.4 (Ferguson 1994; Jolliffe 2002; Abdi and Williams 2010). It is straightforward to compute PCs from Eq. 5.2 after obtaining the eigenvectors and eigenvalues.
Depending on the number of variables in the data matrix and their correlation structure, some of the PCs may have zero or very small eigenvalues. Thus, the number of meaningful PCs, p, is smaller than the number of variables, k. This is the basis for using PCA to compress data. When there is no correlation among the input variables, PCA has no ability to compress the data.
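Compression then amounts to keeping only the p leading PCs; a sketch (the 95% threshold is an illustrative choice, not a rule from the text):

```python
import numpy as np

def compress(X, E, eigenvalues, threshold=0.95):
    """Keep the leading PCs that jointly represent `threshold` of the variance.

    X: k x n data matrix; E: k x k eigenvector matrix (one eigenvector
    per column), with `eigenvalues` sorted in descending order.
    """
    cumulative = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    p = int(np.searchsorted(cumulative, threshold)) + 1   # number of PCs kept
    return E[:, :p].T @ X                                 # p x n compressed data
```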
A5.1.1 Introductory Example
This example uses data extracted and simplified from a survey published in Ma and Zhang (2014). It is a small dataset designed to illustrate PCA, not intended as a thorough study of the heights of family members. In this dataset, the two variables are the heights of 10 men and the heights of their partners. Therefore, we have 2 variables and 10 samples (Table 5.2). It is straightforward to convert such a table into its data matrix:

\( X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1,10} \\ x_{21} & x_{22} & \cdots & x_{2,10} \end{bmatrix} \)  (5.5)

where the first row contains the men's heights and the second row their partners' heights.
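Table 5.2 is not reproduced here; as a stand-in, the sketch below assembles a hypothetical 2 × 10 matrix with roughly the stated means and spreads (the individual heights are simulated, not the surveyed values):

```python
import numpy as np

# Hypothetical stand-in for Table 5.2: row 0 = men's heights (m),
# row 1 = partners' heights (m); simulated, NOT the actual survey data
rng = np.random.default_rng(42)
men = 1.761 + 0.0435 * rng.standard_normal(10)
partners = 1.649 + 0.0255 * rng.standard_normal(10)
X = np.vstack([men, partners])   # 2 x 10 data matrix: 2 variables, 10 samples
print(X.shape)                   # (2, 10)
```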
A5.1.2 Standardizing Data
The average of the ten men's heights is 1.761 m, and the average of their partners' heights is 1.649 m. The standard deviation of the men's heights is 0.0434626 m, and the standard deviation of their partners' heights is 0.0254755 m. Each of the two variables can be standardized by

\( S_i = \frac{X_i - m_i}{\sigma_i} \)  (5.6)

where S_i is the standardized counterpart of the input variable X_i, m_i is its mean, and σ_i is its standard deviation. Table 5.3 gives the standardized version of Table 5.2; the standardized counterpart of the data matrix (Eq. 5.7) contains the values listed in Table 5.3.
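The same standardization in NumPy (a sketch; np.std defaults to the population convention, which appears consistent with the statistics quoted above, though the sample convention with ddof=1 would work equally well):

```python
import numpy as np

def standardize(X):
    # Subtract each variable's mean and divide by its standard deviation,
    # row by row (variables in rows, samples in columns), as in Eq. 5.6
    m = X.mean(axis=1, keepdims=True)
    s = X.std(axis=1, keepdims=True)   # population standard deviation (ddof=0)
    return (X - m) / s
```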
A5.1.3 Computing Correlation Matrix
Only two variables are in the example, and their correlation coefficient is approximately 0.723, so the correlation matrix is

\( C = \begin{bmatrix} 1 & 0.723 \\ 0.723 & 1 \end{bmatrix} \)  (5.8)
A5.1.4 Finding Eigenvectors and Eigenvalues
Eigenvalues and eigenvectors are not unique for a correlation or covariance matrix. To obtain a unique solution, it is common to impose a condition, such as requiring the sum of the squares of the elements in each eigenvector to equal 1. From linear algebra (see, e.g., Ferguson 1994), we can find the eigenvalues of the matrix C in Eq. 5.8. The two eigenvalues are 1 plus the correlation coefficient and 1 minus the correlation coefficient:

\( v_1 = 1 + r = 1.723, \qquad v_2 = 1 - r = 0.277 \)  (5.9)

The two corresponding eigenvectors are

\( e_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad e_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix} \)  (5.10)
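For a 2 × 2 correlation matrix, these results follow directly from the characteristic polynomial:

\( \det(C - vI) = (1 - v)^2 - r^2 = 0 \quad \Rightarrow \quad v_1 = 1 + r, \quad v_2 = 1 - r \)

Substituting each eigenvalue back into Eq. 5.4 gives eigenvectors proportional to \( [1\ \ 1]^t \) and \( [1\ \ -1]^t \); the factor \( 1/\sqrt{2} \) comes from the unit-norm condition.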
A5.1.5 Finding the Principal Components
The principal components can be calculated from Eq. 5.2 (but using the standardized data matrix):

\( P = E^t S = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} S \)  (5.11)

so that the first PC is \( (s_1 + s_2)/\sqrt{2} \) and the second PC is \( (s_1 - s_2)/\sqrt{2} \), where s_1 and s_2 are the standardized men's and partners' heights.
The full values of the two PCs are listed in Table 5.4.
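Numerically, the two PCs follow from one matrix product; a sketch, where S stands for the standardized 2 × 10 data matrix of Table 5.3:

```python
import numpy as np

# Unit-norm eigenvectors of Eq. 5.10, stored as columns
E = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2)

def pcs(S):
    # Eq. 5.11: PC1 = (s1 + s2)/sqrt(2), PC2 = (s1 - s2)/sqrt(2),
    # where s1 and s2 are the standardized rows of S
    return E.T @ S
```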
A5.1.6 Basic Analytics in PCA
PCA is based on the linear correlations among the input variables; these correlations impact the eigenvalues, the relative share of information represented by different PCs, and the contributions of the original variables to the PCs. Table 5.5 lists the relationships between the input variables and the two PCs for the above example. Figure 5.6 gives graphic displays of their relationships. More advanced analytics using PCA for geoscience applications are presented in the main text.
It is worth noting again the nonuniqueness of the eigenvectors from Eq. 5.4. In the above example, if we impose the condition that the sum of the squares of the elements in each eigenvector equal 2, instead of 1, the two eigenvectors in Eq. 5.10 become \( {e}_1^t=\left[1\kern0.75em 1\right] \) and \( {e}_2^t=\left[1\kern0.5em -1\right] \). The resulting PCs will have different values, but they are simply proportional: if all the values in Table 5.4 are multiplied by \( \sqrt{2} \), the two PCs become those of the eigenvectors \( {e}_1^t=\left[1\kern0.75em 1\right] \) and \( {e}_2^t=\left[1\kern0.5em -1\right] \). Figure 5.6c shows the PCs corresponding to the eigenvectors \( {e}_1^t=\frac{1}{\sqrt{2}\ }\left[1\kern0.75em 1\right] \) and \( {e}_2^t=\frac{1}{\sqrt{2}\ }\left[1\kern0.5em -1\right] \), and Fig. 5.6d shows the PCs corresponding to the eigenvectors \( {e}_1^t=\left[1\kern0.75em 1\right] \) and \( {e}_2^t=\left[1\kern0.5em -1\right] \). In applications, using PCs from differently constrained eigenvectors implies a slightly different calibration. For example, in the case shown in Fig. 5.2g, the cutoff values should be multiplied by \( \sqrt{2} \) when the PCs are obtained with the sum of the squares of the eigenvector elements equal to 2.
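The proportionality discussed above is easy to verify numerically; a sketch with illustrative standardized data:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((2, 10))          # stand-in standardized data matrix

E_unit = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # squares sum to 1
E_two = np.array([[1.0, 1.0], [1.0, -1.0]])                # squares sum to 2

# The two sets of PCs differ only by the constant factor sqrt(2)
assert np.allclose(E_two.T @ S, np.sqrt(2) * (E_unit.T @ S))
```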