Abstract

Multivariate statistical methods beyond correlation and covariance analysis include principal component analysis (PCA), factor analysis, discriminant analysis, classification, and various regression methods. Because of the increasing use of neural networks in recent decades, some classical statistical methods are now less frequently used. However, PCA has remained in continuous use in both statistical data analysis and machine learning because of its versatility. At the same time, PCA has seen several extensions, including nonlinear PCA, kernel PCA, and PCA for discrete data.

This chapter presents an overview of PCA and its applications to geosciences. The mathematical formulation of PCA is reduced to a minimum; instead, the presentation emphasizes the data analytics and innovative uses of PCA for geosciences. More applications of PCA are presented in Chap. 10.

It is the mark of a truly educated person to be deeply moved by statistics.

Oscar Wilde or Bernard Shaw (attribution is not clear)

References

  • Abdi, H., & Williams, L. J. (2010). Principal component analysis (Statistics & data mining series) (Vol. 2, pp. 433–459). Wiley.

  • Ferguson, J. (1994). Introduction to linear algebra in geology. London, UK: Chapman & Hall.

  • Hindlet, F., Ma, Y. Z., & Hass, A. (1991). Statistical analysis on AVO data. In Proceedings of EAEG, C028:264–265, Florence, Italy.

  • Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). New York: Springer.

  • Ma, Y. Z. (2011). Lithofacies clustering using principal component analysis and neural network: Applications to wireline logs. Mathematical Geosciences, 43(4), 401–419.

  • Ma, Y. Z., & Gomez, E. (2015). Uses and abuses in applying neural networks for predicting reservoir properties. Journal of Petroleum Science and Engineering, 133, 66–65. https://doi.org/10.1016/j.petrol.2015.05.006.

  • Ma, Y. Z., & Zhang, Y. (2014). Resolution of happiness-income paradox. Social Indicators Research, 119(2), 705–721. https://doi.org/10.1007/s11205-013-0502-9.

  • Ma, Y. Z., et al. (2014, April). Identifying hydrocarbon zones in unconventional formations by discerning Simpson’s paradox. Paper SPE 169496 presented at the SPE Western and Rocky Regional Conference.

  • Ma, Y. Z., Moore, W. R., Gomez, E., Luneau, B., Kaufman, P., Gurpinar, O., & Handwerger, D. (2015). Wireline log signatures of organic matters and lithofacies classifications for shale and tight carbonate reservoirs. In Y. Z. Ma & S. Holditch (Eds.), Handbook of unconventional resource (pp. 151–171). Waltham: Gulf Professional Publishing/Elsevier.

  • Prensky, S. E. (1984). A Gamma-ray log anomaly associated with the Cretaceous-Tertiary boundary in the Northern Green River Basin, Wyoming. In B. E. Law (Ed.), Geological characteristics of low-permeability Upper Cretaceous and Lower Tertiary Rocks in the Pinedale Anticline Area, Sublette County, Wyoming, USGS Open-File 84-753, pp. 22–35. https://pubs.usgs.gov/of/1984/0753/report.pdf. Accessed 6 Aug 2017.

  • Richman, M. (1986). Rotation of principal components. International Journal of Climatology, 6(3), 293–335.

Appendices

1.1 Appendix 5.1 Introduction to PCA with an Example

PCA transforms the data to a new coordinate system such that the greatest variance of any projection of the data lies along the first coordinate, the second greatest variance along the second coordinate, and so on. From a geometric point of view, PCA can be interpreted as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component (PC). The variance of each PC is related to the length of the axis of the ellipsoid that the PC represents. To find the axes of the ellipsoid, the correlation or covariance matrix of the data is constructed (see Chap. 4), and the eigenvalues and corresponding eigenvectors of this matrix are calculated. Each of the mutually orthogonal eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This transforms the correlation or covariance matrix into a diagonal matrix whose diagonal elements represent the variance along each axis. The proportion of the variance that each eigenvector represents is the ratio of its eigenvalue to the sum of all eigenvalues.

Consider a data matrix, X, in which k rows represent different variables and n columns represent different samples of the variables (an example will be given later). The principal components of X can be calculated by

$$ \boldsymbol{P}={\boldsymbol{E}}^{\mathbf{t}}\boldsymbol{X} $$
(5.2)

where P is the matrix of all the PCs, of size k × n; E^t is the transpose of the matrix of eigenvectors, of size k × k; and X is the input data matrix, of size k × n. Notice that the data matrix is generally not square, and the number of variables, k, is generally smaller than the number of samples (observations), n. Therefore, obtaining the principal components, P, amounts to finding the eigenvector matrix, E.

The covariance matrix, C, of the data matrix is

$$ \boldsymbol{C}={\boldsymbol{XX}}^{\boldsymbol{t}} $$
(5.3)

where C is of size k × k; a normalization by the number of samples, n, is implied. If the input variables are standardized to zero mean and unit standard deviation, C is the correlation matrix.

The correlation or covariance matrix is positive (semi)definite (this concept is further discussed in Chaps. 16 and 17), implying that they can be eigen-decomposed with orthogonal eigenvectors, and expressed as

$$ \boldsymbol{CE}=v\boldsymbol{E}\kern1em \mathrm{or}\kern1em \left(\boldsymbol{C}-v\boldsymbol{I}\right)\boldsymbol{E}=\boldsymbol{0} $$
(5.4)

where E is an eigenvector of the matrix C, and v is the eigenvalue associated with the eigenvector E.

After diagonalizing the correlation or covariance matrix, its diagonal elements are the eigenvalues, and the directions of the orthogonal axes are the eigenvectors. Several algorithms are available to solve Eq. 5.4 (Ferguson 1994; Jolliffe 2002; Abdi and Williams 2010). Once the eigenvectors and eigenvalues are obtained, it is straightforward to compute the PCs from Eq. 5.2.

Depending on the number of variables in the data matrix and their correlation structure, some of the PCs may have zero or very small eigenvalues. Thus, the number of meaningful PCs, p, is smaller than the number of variables, k. This underlies the use of PCA for data compression. When there is no correlation among the input variables, PCA cannot compress the data.
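
The workflow of Eqs. 5.2, 5.3 and 5.4 can be summarized in a short numerical sketch. The Python code below is a minimal illustration under the conventions of this appendix (variables in rows, samples in columns); it is not the author's implementation, and the function name and the use of NumPy's eigh routine are choices made only for this sketch.

import numpy as np

def pca_from_data(data):
    # data: array of shape (k, n), with k variables in rows and n samples in columns.
    data = np.asarray(data, dtype=float)
    n = data.shape[1]

    # Standardize each variable (row) to zero mean and unit standard deviation (Eq. 5.6).
    s = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)

    # Correlation matrix (Eq. 5.3, with the 1/n normalization written explicitly).
    corr = s @ s.T / n

    # Eigen-decomposition (Eq. 5.4); eigh returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]            # sort by decreasing variance
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Principal components, P = E^t S (Eq. 5.2), one PC per row.
    pcs = eigenvectors.T @ s

    # Proportion of the total variance represented by each PC.
    explained = eigenvalues / eigenvalues.sum()
    return pcs, eigenvalues, eigenvectors, explained

Keeping only the rows of pcs whose explained variance is non-negligible is the data compression discussed above.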

1.2 A5.1.1 Introductory Example

This example uses data extracted and simplified from a survey published in Ma and Zhang (2014). It is a small dataset designed to illustrate PCA, not intended for a thorough study of the heights of family members. The two variables are the heights of 10 men and the heights of their partners; thus, we have 2 variables and 10 samples (Table 5.2). It is straightforward to convert such a table into its data matrix:

$$ \boldsymbol{X}=\left[\begin{array}{ccc}1.68& 1.71\kern1em \dots & 1.83\\ {}1.63& 1.60\kern1em \dots & 1.67\end{array}\right] $$
(5.5)
Table 5.2 Heights of ten men and their partners, in meters

1.3 A5.1.2 Standardizing Data

The average of the ten men’s heights is 1.761 m, and the average of their partners’ heights is 1.649 m. The standard deviation of the men’s heights is 0.0434626 m and that of their partners’ heights is 0.0254755 m. Each of the two variables can be standardized by

$$ {S}_i=\left({X}_i-{m}_i\right)/{\sigma}_i $$
(5.6)

where S i is the standardized counterpart of the input variable X i, m i is its mean, and σ i is its standard deviation. Table 5.3 gives the standardized version of Table 5.2. Thus, the standardized counterpart of the data matrix is:

$$ \boldsymbol{S}=\left[\begin{array}{ccc}-1.86& -1.17\kern1em \dots & 1.59\\ {}-0.75& -1.92\kern1em \dots & 0.82\end{array}\right] $$
(5.7)
Table 5.3 Standardized heights of ten men and their partners (dimensionless, rounded to 2 decimals)
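
For Eq. 5.6, a small helper such as the following can be used. This is only a sketch: the arrays heights_men and heights_partner are placeholders for the ten values of Table 5.2 (not reproduced in full here), and the use of the population standard deviation (denominator n) is an assumption, since the appendix does not state which denominator was used.

import numpy as np

def standardize(x):
    # Eq. 5.6: subtract the mean and divide by the standard deviation.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()    # population standard deviation (ddof=0) assumed

# Hypothetical usage once the ten values of Table 5.2 are filled in:
# S = np.vstack([standardize(heights_men), standardize(heights_partner)])   # Eq. 5.7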

1.4 A5.1.3 Computing Correlation Matrix

There are only two variables in this example; their correlation coefficient is approximately 0.723, and the correlation matrix is:

$$ \boldsymbol{C}={\boldsymbol{SS}}^{\boldsymbol{t}}=\left[\begin{array}{cc}1& 0.723\\ {}0.723& 1\end{array}\right] $$
(5.8)
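
With the standardized rows stacked into the 2 × 10 matrix S of Eq. 5.7, Eq. 5.8 can be evaluated as sketched below. The 1/n normalization is written explicitly here, and applying NumPy's corrcoef to the raw heights would give the same matrix; the function name is chosen for this sketch only.

import numpy as np

def correlation_matrix(S):
    # C = S S^t / n for standardized data S of shape (k, n); see Eqs. 5.3 and 5.8.
    n = S.shape[1]
    return S @ S.T / n

# For the Table 5.2 data this yields approximately [[1, 0.723], [0.723, 1]];
# np.corrcoef(heights_men, heights_partner) returns the same matrix.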

1.5 A5.1.4 Finding Eigenvectors and Eigenvalues

Eigenvalues and eigenvectors are not unique for a correlation or covariance matrix. To obtain a unique solution, it is common to impose a condition such as requiring the sum of the squares of the elements in each eigenvector to equal 1. From linear algebra (see, e.g., Ferguson 1994), we can find the eigenvalues of the matrix C in Eq. 5.8. The two eigenvalues are 1 plus the correlation coefficient and 1 minus the correlation coefficient:

$$ {\displaystyle \begin{array}{c}{v}_1=1+0.723=1.723\kern2.25em \mathrm{and}\\ {}{v}_2=1-0.723=0.277\end{array}} $$
(5.9)

The two corresponding eigenvectors are

$$ {\displaystyle \begin{array}{c}{e}_1^t=\left[1\kern1.25em 1\right]/\sqrt{2}\kern2em \mathrm{and}\\ {}{e}_2^t=\left[1\kern0.5em -1\right]/\sqrt{2}\end{array}} $$
(5.10)
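
The eigenvalues and unit-norm eigenvectors in Eqs. 5.9 and 5.10 can be checked numerically; the short verification below is not part of the original text.

import numpy as np

C = np.array([[1.0, 0.723],
              [0.723, 1.0]])                   # correlation matrix of Eq. 5.8
eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigenvalues returned in ascending order

print(eigenvalues)    # [0.277 1.723], i.e., 1 - 0.723 and 1 + 0.723 (Eq. 5.9)
print(eigenvectors)   # columns are [1, -1]/sqrt(2) and [1, 1]/sqrt(2), up to sign (Eq. 5.10)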

1.6 A5.1.5 Finding the Principal Components

The principal components can be calculated from Eq. 5.2 (but using the standardized data matrix):

$$ {\displaystyle \begin{array}{c}\mathrm{PC}1={e}_1^t\boldsymbol{S}=\frac{1}{\sqrt{2}\ }\left[1\kern1.50em 1\right]\left[\begin{array}{ccc}-1.86& \dots & 1.59\\ {}-0.75& \dots & 0.82\end{array}\right]=\frac{1}{\sqrt{2}\ }\left[\begin{array}{ccc}-2.61& \dots & 2.41\end{array}\right]\\ {}\mathrm{PC}2={e}_2^t\boldsymbol{S}=\frac{1}{\sqrt{2}\ }\left[\begin{array}{cc}1& -1\end{array}\right]\left[\begin{array}{ccc}-1.86& \dots & 1.59\\ {}-0.75& \dots & 0.82\end{array}\right]=\frac{1}{\sqrt{2}\ }\left[\begin{array}{ccc}-1.11& \dots & 0.77\end{array}\right]\end{array}} $$

The full values of the two PCs are listed in Table 5.4.

Table 5.4 Principal components of the heights in Table 5.2 (rounded to two decimals)
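
The projection itself is one line of matrix algebra. In the sketch below, S is assumed to be the full 2 × 10 standardized matrix of Table 5.3, which is not reproduced here.

import numpy as np

E = np.column_stack([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)   # eigenvectors of Eq. 5.10 as columns

def principal_components(E, S):
    # P = E^t S (Eq. 5.2); the rows of the result are PC1 and PC2 of Table 5.4.
    return E.T @ S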

1.7 A5.1.6 Basic Analytics in PCA

PCA is based on the linear correlations among the input variables; these correlations impact the eigenvalues, the relative representation of the information by different PCs, and the contributions of the original variables to the PCs. Table 5.5 lists the relationships among the input variables and the two PCs for the above example, and Fig. 5.6 displays these relationships graphically. More advanced analytics using PCA for geoscience applications are presented in the main text.

Table 5.5 Summary of PCA for the height example: correlations among the two input variables and their PCs
Fig. 5.6 (a) Crossplot between the 2 height variables. (b) Same as (a) but overlain with the first PC of the PCA from the 2 height variables. (c) Crossplot between the two PCs in Table 5.4. (d) Crossplot between the two PCs (PC1_1.4 and PC2_1.4) proportional to the PCs in Table 5.4, using eigenvectors \( {e}_1^t=\left[1\kern0.75em 1\right] \) and \( {e}_2^t=\left[1\kern0.5em -1\right] \). Color legends in (c) and (d) are the same as in (b); they are the “names” of the 10 men in Table 5.2
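
Table 5.5 can be reproduced numerically by correlating each standardized input variable with each PC. The helper below is a sketch that assumes S and pcs are the 2 × 10 arrays from the previous snippets; for standardized variables, the correlation between variable i and PC j also equals the i-th element of eigenvector j multiplied by the square root of eigenvalue j.

import numpy as np

def variable_pc_correlations(S, pcs):
    # Correlation matrix of the stacked rows: the two input variables first, then the two PCs.
    # The off-diagonal block contains the variable-to-PC correlations summarized in Table 5.5.
    return np.corrcoef(np.vstack([S, pcs]))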

It is worth noting again the nonuniqueness of the eigenvectors from Eq. 5.4. In the above example, if we require the sum of the squares of the elements in each eigenvector to equal 2 instead of 1, the two eigenvectors in Eq. 5.10 become \( {e}_1^t=\left[1\kern0.75em 1\right] \) and \( {e}_2^t=\left[1\kern0.5em -1\right] \). The resulting PCs have different values, but they are simply proportional to the original ones. For example, if all the values in Table 5.4 are multiplied by \( \sqrt{2} \), the two PCs become the results of the eigenvectors \( {e}_1^t=\left[1\kern0.75em 1\right] \) and \( {e}_2^t=\left[1\kern0.5em -1\right] \). Figure 5.6c shows the PCs corresponding to the eigenvectors \( {e}_1^t=\frac{1}{\sqrt{2}\ }\left[1\kern0.75em 1\right] \) and \( {e}_2^t=\frac{1}{\sqrt{2}\ }\left[1\kern0.5em -1\right] \), and Fig. 5.6d shows the PCs corresponding to the eigenvectors \( {e}_1^t=\left[1\kern0.75em 1\right] \) and \( {e}_2^t=\left[1\kern0.5em -1\right] \). In applications, using PCs from differently constrained eigenvectors implies a slightly different calibration. For example, in the case shown in Fig. 5.2g, the cutoff values should be multiplied by \( \sqrt{2} \) when the PCs are obtained with the sum of the squares of the elements in each eigenvector equal to 2.
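
The proportionality described in this paragraph is easy to confirm in code; the sketch below assumes pcs holds the unit-norm PC scores of Table 5.4.

import numpy as np

def rescale_pcs(pcs, sum_of_squares=2.0):
    # Eigenvectors whose squared elements sum to sum_of_squares (instead of 1)
    # produce PC scores multiplied by sqrt(sum_of_squares); for 2 this is sqrt(2).
    return np.sqrt(sum_of_squares) * np.asarray(pcs)

# Any cutoff calibrated on the unit-norm PCs (e.g., in Fig. 5.2g) must be scaled
# by the same sqrt(2) factor when these rescaled PCs are used.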

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this chapter

Ma, Y.Z. (2019). Principal Component Analysis. In: Quantitative Geosciences: Data Analytics, Geostatistics, Reservoir Characterization and Modeling. Springer, Cham. https://doi.org/10.1007/978-3-030-17860-4_5

  • Print ISBN: 978-3-030-17859-8

  • Online ISBN: 978-3-030-17860-4
