The observation data collected from continuous industrial processes usually fall into two main categories: process data and quality data, and the corresponding industrial data analysis applies multivariate statistical techniques to these two types of data. Process data are collected by the distributed control system (DCS) in real time with frequent sampling (the basic sampling period is usually 1 s). For example, there are five typical variables in the process industries: temperature, pressure, flow rate, liquid level, and composition. Among them, temperature, pressure, flow rate, and liquid level are process variables. In contrast, real-time quality measurements are generally difficult to acquire due to the limitation of quality sensors. Usually, the quality data are obtained by taking samples for laboratory tests, and their sampling frequency is much lower than that of process data. For example, product composition, viscosity, molecular weight distribution, and other quality-related parameters need to be obtained through various analytical instruments in the laboratory, such as composition analyzers, gel permeation chromatography (GPC), or mass spectrometry.

Process data and quality data belong to two different observation spaces, so the corresponding statistical analysis methods are divided into two categories: methods for a single observation space and methods for multiple observation spaces. This book introduces the basic multivariate statistical techniques from this perspective of observation spaces. This chapter focuses on the analysis methods for a single observation space, including the PCA and FDA methods. The core of these methods lies in a spatial projection oriented to different needs, such as sample dispersion or multi-class sample separation. This projection extracts the necessary and effective features while achieving dimensional reduction. The next chapter focuses on the multivariate statistical analysis methods between two observation spaces, specifically PLS, CCA, and CVA. These methods aim at maximizing the correlation between variables in different observation spaces, and thereby achieve feature extraction and dimensional reduction.

2.1 Principal Component Analysis

As modern industrial production systems become larger and more complex, the stored historical data not only have high dimensionality but also exhibit strong coupling and correlation between the process variables. This makes it impractical to monitor so many process variables at the same time. Therefore, a reasonable method is needed that minimizes the loss of information contained in the original variables while reducing the dimension of the monitored variables. If a small number of independent variables can accurately reflect the operating status of the system, the operators can monitor these few variables to control the entire production process.

Principal component analysis (PCA) is one of the most widely used multivariate statistical algorithms (Pan et al. 2008). It is mainly used to monitor process data with high dimensionality and strong linear correlation. It decomposes the high-dimensional process variables into a few independent principal components and then establishes a model. The extracted features constitute the principal component subspace (PCS) of the PCA algorithm, and this space contains most of the variation in the system. The remaining features constitute the residual subspace, which mainly contains the noise and interference during the monitoring process and a small amount of system variation information (Wiesel and Hero 2009). By combining variables, the PCA algorithm is able to overcome the overlapping information caused by multiple correlations and simultaneously achieve dimensional reduction of high-dimensional data. It also highlights the main features in the PCS while removing the noise and some unimportant features.

2.1.1 Mathematical Principle of PCA

Suppose a data matrix \(\boldsymbol{X}\in \mathcal {R}^{n\times m}\), where m is the number of variables and n is the number of observations of each variable. The matrix \(\boldsymbol{X}\) can be decomposed into the sum of outer products of k vectors (Wang et al. 2016; Gao 2013):

$$\begin{aligned} \boldsymbol{X} = {\boldsymbol{t}_1}\boldsymbol{p}_1^\mathrm {T}+ {\boldsymbol{t}_2}\boldsymbol{p}_2^\mathrm {T}+ \cdots + {\boldsymbol{t}_k}\boldsymbol{p}_k^\mathrm {T}, \end{aligned}$$
(2.1)

where \(\boldsymbol{t}_{i} \) is the score vector, also called the principal component of the matrix \(\boldsymbol{X}\), and \( \boldsymbol{p}_{i} \) is the eigenvector corresponding to the principal component, also called the load vector. Then (2.1) can also be written in matrix form:

$$\begin{aligned} \boldsymbol{X} = \boldsymbol{T}{\boldsymbol{P}^\mathrm {T}}. \end{aligned}$$
(2.2)

Among them, \( \boldsymbol{T}=[\boldsymbol{t}_{1},\boldsymbol{t}_{2},\ldots ,\boldsymbol{t}_{k}] \) is called the score matrix and \( \boldsymbol{P}=[\boldsymbol{p}_{1},\boldsymbol{p}_{2},\ldots ,\boldsymbol{p}_{k}] \) is called the load matrix. The score vectors are orthogonal to each other,

$$\begin{aligned} \boldsymbol{t}_i^\mathrm {T}{\boldsymbol{t}_j} = 0,i \ne j. \end{aligned}$$
(2.3)

The following relationships exist between load vectors:

$$\begin{aligned} \left\{ \begin{array}{l} \boldsymbol{p}_i^\mathrm {T}{\boldsymbol{p}_j} = 0,i \ne j\\ \boldsymbol{p}_i^\mathrm {T}{\boldsymbol{p}_j} = 1,i = j \end{array} \right. \end{aligned}$$
(2.4)

That is, the load vectors are also mutually orthogonal, and each load vector has unit length.

Multiplying both sides of (2.2) on the right by the load vector \( \boldsymbol{p}_{i} \) and combining with (2.4), we obtain

$$\begin{aligned} {\boldsymbol{t}_i} = \boldsymbol{X}{\boldsymbol{p}_i}. \end{aligned}$$
(2.5)

Equation (2.5) shows that each score vector \( \boldsymbol{t}_{i} \) is the projection of the original data \(\boldsymbol{X}\) onto the direction of the corresponding load vector \( \boldsymbol{p}_{i} \). The length of the score vector \( \boldsymbol{t}_{i} \) reflects the coverage of the original data \(\boldsymbol{X}\) along the direction of \( \boldsymbol{p}_{i} \): the longer \( \boldsymbol{t}_{i} \), the greater the coverage or range of variation of the data matrix \(\boldsymbol{X}\) along \( \boldsymbol{p}_{i} \) (Han 2012). The score vectors are arranged in descending order of length:

$$\begin{aligned} \left\| {{\boldsymbol{t}_1}} \right\|> \left\| {{\boldsymbol{t}_2}} \right\|> \left\| {{\boldsymbol{t}_3}} \right\|> \cdots > \left\| {{\boldsymbol{t}_k}} \right\| . \end{aligned}$$
(2.6)

The load vector \( \boldsymbol{p}_{1} \) represents the direction in which the data matrix \(\boldsymbol{X}\) varies most, and the load vector \( \boldsymbol{p}_{2} \), orthogonal to \( \boldsymbol{p}_{1} \), represents the direction of the second largest variation of \(\boldsymbol{X}\). Similarly, the load vector \( \boldsymbol{p}_{k} \) represents the direction in which \(\boldsymbol{X}\) varies least. When most of the variance is captured by the first r load vectors and the variance associated with the remaining \(m-r\) load vectors is almost zero, the latter can be omitted. Then the data matrix \(\boldsymbol{X}\) is decomposed into the following form:

$$\begin{aligned} \begin{aligned} \boldsymbol{X}&= {\boldsymbol{t}_1}\boldsymbol{p}_1^\mathrm {T}+ {\boldsymbol{t}_2}\boldsymbol{p}_2^\mathrm {T}+ \cdots + {\boldsymbol{t}_r}\boldsymbol{p}_r^\mathrm {T}+ \boldsymbol{E}\\&= \hat{\boldsymbol{X}} + \boldsymbol{E} = \boldsymbol{T}{\boldsymbol{P}^\mathrm {T}} +\boldsymbol{E}, \end{aligned} \end{aligned}$$
(2.7)

where \(\hat{\boldsymbol{X}}\) is the principal component matrix and \(\boldsymbol{E}\) is the residual matrix, whose information mainly stems from measurement noise. PCA divides the original data space into the principal component subspace (PCS) and the residual subspace (RS). These two subspaces are orthogonal and complementary to each other. The principal component subspace mainly reflects the variation caused by normal data, while the residual subspace mainly reflects the variation caused by noise and interference.

PCA calculates the optimal load vector \(\boldsymbol{p}\) by solving the optimization problem:

$$\begin{aligned} J = \max _{\boldsymbol{p}\ne \boldsymbol{0}}\frac{\boldsymbol{p}^\mathrm {T}\boldsymbol{X}^\mathrm {T}\boldsymbol{X} \boldsymbol{p}}{\boldsymbol{p}^\mathrm {T}\boldsymbol{p}}. \end{aligned}$$
(2.8)

The number r of principal components is generally determined by the cumulative percent variance (\(\mathrm {CPV}\)). Applying eigenvalue decomposition or singular value decomposition to the covariance matrix of \(\boldsymbol{X}\) yields all the eigenvalues \(\lambda _i\). The \(\mathrm {CPV}\) is defined as follows:

$$\begin{aligned} \mathrm{{CPV}} = \frac{{\sum \limits _{i = 1}^r {{\lambda _i}} }}{{\sum \limits _{i = 1}^m {{\lambda _i}} }}. \end{aligned}$$
(2.9)

Generally, r is chosen as the smallest number of principal components for which the \(\mathrm {CPV}\) value reaches 85%.
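To make the above procedure concrete, the following minimal Python sketch (assuming only NumPy; the function name `pca_fit` and mean-centering preprocessing are illustrative choices, not prescribed by the text) performs the eigenvalue decomposition of the covariance matrix, selects r by the CPV criterion (2.9), and forms the decomposition (2.7).

```python
import numpy as np

def pca_fit(X, cpv_threshold=0.85):
    """Minimal PCA sketch: center X (n x m), eigendecompose the covariance
    matrix, and keep the first r load vectors by the CPV criterion (2.9)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each variable (assumed preprocessing)
    S = Xc.T @ Xc / (Xc.shape[0] - 1)             # sample covariance matrix (m x m)
    eigvals, eigvecs = np.linalg.eigh(S)          # eigh: suited to symmetric matrices
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cpv = np.cumsum(eigvals) / np.sum(eigvals)    # cumulative percent variance
    r = int(np.searchsorted(cpv, cpv_threshold) + 1)
    P = eigvecs[:, :r]                            # load matrix (m x r)
    T = Xc @ P                                    # score matrix (n x r), Eq. (2.5)
    X_hat = T @ P.T                               # principal component part, Eq. (2.7)
    E = Xc - X_hat                                # residual matrix
    return P, T, eigvals, r, E
```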

2.1.2 PCA Component Extraction Algorithm

There are two algorithms to implement PCA component extraction. Algorithm 1 is based on the singular value decomposition (SVD) of the covariance matrix, and Algorithm 2 obtains each principal component based on the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm, developed by H. Wold first for PCA and later for PLS (Wold 1992). NIPALS gives more numerically accurate results than the SVD of the covariance matrix, but is slower to compute.

Algorithm 1: PCA component extraction based on SVD
Algorithm 2: PCA component extraction based on NIPALS
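A minimal sketch of the NIPALS iteration for extracting principal components one at a time is given below (assuming a mean-centered data matrix; the function name, starting rule, and convergence tolerance are illustrative choices, not from the algorithm listings).

```python
import numpy as np

def nipals_pca(Xc, r, tol=1e-8, max_iter=500):
    """NIPALS sketch: extract r principal components from the centered
    data matrix Xc (n x m) one at a time, deflating after each component."""
    X = np.array(Xc, dtype=float)
    n, m = X.shape
    T = np.zeros((n, r))
    P = np.zeros((m, r))
    for i in range(r):
        t = X[:, np.argmax(np.var(X, axis=0))].copy()   # start from the column with the largest variance
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)                        # regress X on t to get the load vector
            p /= np.linalg.norm(p)                       # normalize to unit length, Eq. (2.4)
            t_new = X @ p                                # project X onto p to get the score vector, Eq. (2.5)
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, i], P[:, i] = t, p
        X -= np.outer(t, p)                              # deflate: remove the extracted component
    return T, P
```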
Fig. 2.1 Two-dimensional raw random data

Fig. 2.2 Visualization of the principal axes of variation and the confidence ellipse of the original data

The PCA dimensional reduction is illustrated with simple two-dimensional random data. Figure 2.1 shows the original random data samples in the two-dimensional space. Figure 2.2 visualizes the principal axes and the confidence ellipse of the original data. The green ray gives the direction of the largest variance of the original data, and the black ray shows the direction of the second largest variance.

PCA projects the original data \(\boldsymbol{X}\) from the two-dimensional space into a one-dimensional subspace along the direction of maximum variance. The dimensional reduction result is shown in Fig. 2.3.

Fig. 2.3 Dimensional reduction results
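As a rough illustration of Figs. 2.1–2.3, the following sketch generates correlated two-dimensional random data, computes the two principal axes, and projects the samples onto the first axis; the data-generating parameters and the random seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated two-dimensional random data (arbitrary illustrative parameters)
n = 200
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.8 * x1 + rng.normal(0.0, 0.4, n)
X = np.column_stack([x1, x2])

# Principal axes: eigenvectors of the sample covariance matrix
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(eigvals)[::-1]
p1, p2 = eigvecs[:, order[0]], eigvecs[:, order[1]]   # largest and second-largest variance directions

# One-dimensional representation: projection onto the first principal axis, Eq. (2.5)
t1 = Xc @ p1
print("variance captured by p1: %.1f%%" % (100 * eigvals[order[0]] / eigvals.sum()))
```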

2.1.3 PCA-Based Fault Detection

PCA can be applied to all kinds of data analysis problems, such as exploration and visualization of high-dimensional data sets, data compression, data preprocessing, dimensional reduction, removal of data redundancy, and denoising. When it is applied to the field of FDD, the detection process is divided into offline modeling and online monitoring.

(1) Offline modeling: use the training data to construct a principal component analysis model and calculate the monitored statistics, such as \(\mathrm {SPE}\) and \(\mathrm {T}^{2}\), and their control limits;

(2) Online monitoring: when a new sample vector \(\boldsymbol{x}\) is obtained, it is decomposed into its projections on the PCS and the RS (Zhang et al. 2017),

    $$\begin{aligned} \begin{aligned} \boldsymbol{x}&= {\hat{\boldsymbol{x}}} + {\tilde{\boldsymbol{x}}}\\ {\hat{\boldsymbol{x}}}&= \boldsymbol{P}{\boldsymbol{P}^\mathrm {T}}\boldsymbol{x}\\ {\tilde{\boldsymbol{x}}}&= (\boldsymbol{I} - \boldsymbol{P}{\boldsymbol{P}^\mathrm {T}})\boldsymbol{x}, \end{aligned} \end{aligned}$$
    (2.18)

    where \( \hat{\boldsymbol{x}} \) is the projection of the sample \(\boldsymbol{x}\) onto the PCS and \( \tilde{\boldsymbol{x}} \) is its projection onto the RS. Calculate the statistics of the new sample \(\boldsymbol{x}\): \(\mathrm {SPE}\) (1.12) on the RS and \(\mathrm {T}^{2}\) (1.9) on the PCS, respectively. Compare the statistics of the new sample with the control limits obtained from the training data. If a statistic of the new sample exceeds its control limit, a fault has occurred; otherwise, the system is in normal operation.

\( \hat{\boldsymbol{x}} \) and \( \tilde{\boldsymbol{x}} \) are not only orthogonal (\( \hat{\boldsymbol{x}}^{\mathrm {T}}\tilde{\boldsymbol{x}} = 0 \)) but also statistically independent (\( \mathbb {E}\left( \hat{\boldsymbol{x}}^{\mathrm {T}}\tilde{\boldsymbol{x}}\right) = 0 \)). Hence, the PCA algorithm has natural advantages for process monitoring. The flowchart of PCA-based fault detection is shown in Fig. 2.4. In general, a fault detection process based on multivariate statistical analysis is similar to that of PCA; only the statistical model and the statistical indices differ.

Fig. 2.4 PCA-based fault detection
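The offline/online procedure above can be sketched as follows. Since the analytical control limits referenced from Chap. 1 (Eqs. (1.9) and (1.12)) are not reproduced here, the sketch uses empirical quantiles of the training statistics as control limits; the function name `pca_monitor` and the confidence level are illustrative assumptions.

```python
import numpy as np

def pca_monitor(X_train, X_new, r, alpha=0.99):
    """Sketch of PCA-based fault detection: offline modeling on normal
    training data, then T^2 and SPE statistics for new samples.
    Control limits are empirical alpha-quantiles of the training statistics,
    standing in for the analytical limits of Chap. 1."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0, ddof=1)
    Z_train = (X_train - mu) / sigma              # standardize the training data
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z_train.T))
    order = np.argsort(eigvals)[::-1]
    P = eigvecs[:, order[:r]]                     # load matrix spanning the PCS
    lam = eigvals[order[:r]]                      # retained eigenvalues

    def statistics(X):
        Z = (X - mu) / sigma
        T = Z @ P                                 # scores in the PCS
        t2 = np.sum(T**2 / lam, axis=1)           # Hotelling T^2 statistic
        residual = Z - T @ P.T                    # projection onto the RS
        spe = np.sum(residual**2, axis=1)         # squared prediction error
        return t2, spe

    t2_tr, spe_tr = statistics(X_train)
    t2_lim, spe_lim = np.quantile(t2_tr, alpha), np.quantile(spe_tr, alpha)
    t2_new, spe_new = statistics(np.atleast_2d(X_new))
    fault = (t2_new > t2_lim) | (spe_new > spe_lim)   # exceeding either limit raises an alarm
    return fault, (t2_new, t2_lim), (spe_new, spe_lim)
```

In practice the control limits would be computed from the distribution-based expressions cited above; the empirical quantiles are used here only to keep the sketch self-contained.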

2.2 Fisher Discriminant Analysis

Industrial processes are heavily instrumented, and large amounts of data are collected online and stored in computer databases. A lot of data are usually collected during out-of-control operation. When the data collected during an out-of-control operation have previously been diagnosed, they can be classified into separate categories, where each category is related to a specific fault. When the data have not been diagnosed before, cluster analysis can help diagnose the operation during which the data were collected, and the data can be assigned to a new category accordingly. If hyperplanes can separate the classes of data, as shown in Fig. 2.5, these separating planes define the boundaries of each fault region. Once a fault is detected from an online observation, it can be diagnosed by determining the fault region in which the observation lies. Assuming that the detected fault is represented in the database, the fault can be correctly diagnosed in this way.

Fig. 2.5 Two-dimensional comparison of FDA and PCA

2.2.1 Principle of FDA

Fisher discriminant analysis (FDA), a dimensionality reduction technique that has been extensively studied in the pattern classification domain, takes into account the information between classes. For fault diagnosis, data collected from the plant during specific fault operations are categorized into classes, where each class contains data representing a particular fault. FDA is a classical linear dimensionality reduction technique that is optimal in maximizing the separation between these classes. The main idea of FDA is to project the data from a high-dimensional space into a lower dimensional space while ensuring that the projection maximizes the scatter between classes and minimizes the scatter within each class. This means that, after projection into the low-dimensional space, data of the same class are clustered together, while different classes are kept far apart.

The training data for all classes are stacked into \(\boldsymbol{X} \in \mathcal {R}^{n \times m}\), where n and m are the numbers of observations and measurement variables, respectively. To understand FDA, it is first necessary to define several matrices: the total scatter matrix, the intra-class (within-class) scatter matrix, and the inter-class (between-class) scatter matrix. The total scatter matrix is

$$\begin{aligned} \begin{aligned} \boldsymbol{S}_{t}=\sum _{i=1}^{n}\left( \boldsymbol{x}(i)-\bar{\boldsymbol{x}}\right) \left( \boldsymbol{x}(i)-\bar{\boldsymbol{x}}\right) ^{\mathrm {T}}, \end{aligned} \end{aligned}$$
(2.19)

where \(\boldsymbol{x}(i)\) represents the vector of measurement variables for the i-th observation and \(\bar{\boldsymbol{x}}\) is the total mean vector.

$$\begin{aligned} \begin{aligned} \bar{\boldsymbol{x}}=\frac{1}{n} \sum _{i=1}^{n} \boldsymbol{x}(i). \end{aligned} \end{aligned}$$
(2.20)

The within-class scatter matrix for class j is

$$\begin{aligned} \begin{aligned} \boldsymbol{S}_{j}=\sum _{\boldsymbol{x}(i) \in \mathcal {X}_{j}}\left( \boldsymbol{x}(i)-\bar{\boldsymbol{x}}_{{j}}\right) \left( \boldsymbol{x}(i)-\bar{\boldsymbol{x}}_{{j}}\right) ^{\mathrm {T}}, \end{aligned} \end{aligned}$$
(2.21)

where \(\mathcal {X}_{j}\) is the set of vectors \(\boldsymbol{x}(i)\) which belong to the class j and \(\bar{\boldsymbol{x}}_{{j}}\) is the mean vector for class j:

$$\begin{aligned} \begin{aligned} \bar{\boldsymbol{x}}_{{j}}=\frac{1}{n_{j}} \sum _{\boldsymbol{x}(i) \in \mathcal {X}_{j}} \boldsymbol{x}(i), \end{aligned} \end{aligned}$$
(2.22)

where \(n_j\) is the number of observations in the j-th class. The intra-class scatter matrix is

$$\begin{aligned} \begin{aligned} \boldsymbol{S}_{w}=\sum _{j=1}^{p} \boldsymbol{S}_{j}, \end{aligned} \end{aligned}$$
(2.23)

where p is the number of classes. The inter-class scatter matrix is

$$\begin{aligned} \begin{aligned} \boldsymbol{S}_{b}=\sum _{j=1}^{p} n_j \left( \bar{\boldsymbol{x}}_{{j}}-\bar{\boldsymbol{x}}\right) \left( \bar{\boldsymbol{x}}_{{j}}-\bar{\boldsymbol{x}}\right) ^{\mathrm {T}}. \end{aligned} \end{aligned}$$
(2.24)

It is obvious that the following relationship always holds:

$$\begin{aligned} \begin{aligned} \boldsymbol{S}_{t}=\boldsymbol{S}_{b}+\boldsymbol{S}_{w}. \end{aligned} \end{aligned}$$
(2.25)

Maximizing the inter-class scatter means that the sample centers of different classes are as far apart as possible after projection (\(\max \boldsymbol{v}^\mathrm {T}\boldsymbol{S}_b \boldsymbol{v}\)). Minimizing the intra-class scatter is equivalent to making the projected sample points of the same class cluster together as much as possible (\(\min \boldsymbol{v}^\mathrm {T}\boldsymbol{S}_w \boldsymbol{v},\; |\boldsymbol{S}_w| \ne 0\)), where \(\boldsymbol{v} \in \mathcal {R}^{m}\).

The optimal FDA projection vector \(\boldsymbol{w}\) is obtained by

$$\begin{aligned} \begin{aligned} J=\max _{\boldsymbol{w} \ne 0} \frac{\boldsymbol{w}^{\mathrm {T}} \boldsymbol{S}_{b} \boldsymbol{w}}{\boldsymbol{w}^{\mathrm {T}} \boldsymbol{S}_{w} \boldsymbol{w}}. \end{aligned} \end{aligned}$$
(2.26)

Both the numerator and the denominator contain the projection vector \(\boldsymbol{w}\). Since \(\boldsymbol{w}\) and \(\alpha \boldsymbol{w},\; \alpha \ne 0\), have the same effect, we can set \(\boldsymbol{w}^{\mathrm {T}} \boldsymbol{S}_{w} \boldsymbol{w} =1\); then the optimization objective (2.26) becomes

$$\begin{aligned} \begin{aligned} J&=\max _{\boldsymbol{w}} {\boldsymbol{w}^{\mathrm {T}} \boldsymbol{S}_{b} \boldsymbol{w}}\\ \mathrm{s.t.}&\quad \boldsymbol{w}^{\mathrm {T}} \boldsymbol{S}_{w} \boldsymbol{w} =1. \end{aligned} \end{aligned}$$
(2.27)

First, consider the optimization of the first FDA vector \( \boldsymbol{w}_1\). Solve (2.27) by the Lagrange multiplier method:

$$ L(\boldsymbol{w}_1,\lambda _1)={\boldsymbol{w}_1^{\mathrm {T}} \boldsymbol{S}_{b} \boldsymbol{w}_1}-\lambda _1(\boldsymbol{w}_1^{\mathrm {T}} \boldsymbol{S}_{w} \boldsymbol{w}_1 -1) $$

Find the partial derivative of L with respect to \(\boldsymbol{w}_1\).

$$ \frac{\partial L}{\partial \boldsymbol{w}_1} = 2\boldsymbol{S}_b\boldsymbol{w}_1 - 2\lambda _1 \boldsymbol{S}_w \boldsymbol{w}_1 $$

Setting this derivative to zero shows that the first FDA vector is an eigenvector \(\boldsymbol{w}_1\) of the generalized eigenvalue problem

$$\begin{aligned} \begin{aligned} \boldsymbol{S}_{b} \boldsymbol{w} _ 1 =\lambda _{1} \boldsymbol{S}_{w} \boldsymbol{w}_{1} \;\rightarrow \; \boldsymbol{S}_{w}^{-1}\boldsymbol{S}_{b} \boldsymbol{w} _ 1 =\lambda _{1} \boldsymbol{w}_{1}. \end{aligned} \end{aligned}$$
(2.28)

Finding the first FDA vector thus boils down to finding the eigenvector \(\boldsymbol{w}_1\) corresponding to the largest eigenvalue of the matrix \(\boldsymbol{S}_{w}^{-1}\boldsymbol{S}_{b}\).

The second FDA vector is chosen such that the inter-class scatter is maximized while the intra-class scatter is minimized over all axes perpendicular to the first FDA vector, and the same holds for the remaining FDA vectors. The k-th FDA vector is obtained from

$$\begin{aligned} \boldsymbol{S}_{w}^{-1}\boldsymbol{S}_{b} \boldsymbol{w} _ k =\lambda _{k} \boldsymbol{w}_{k}, \end{aligned}$$

where \(\lambda _1\ge \lambda _2\ge \cdots \ge \lambda _{p-1}\), and \(\lambda _k\) indicates the degree of overall separability among the classes obtained by projecting the data onto \(\boldsymbol{w}_k\).

When \(\boldsymbol{S}_w\) is invertible, the FDA vectors can be computed from the generalized eigenvalue problem. This is almost always the case as long as the number of observations n is significantly larger than the number of measurements m (which holds in practice). If \(\boldsymbol{S}_w\) is not invertible, PCA can be used to project the data onto \(m_1\) dimensions before performing FDA, where \(m_1\) is the number of non-zero eigenvalues of the covariance matrix \(\boldsymbol{S}_t\).

The first FDA vector is the eigenvector associated with the largest eigenvalue, the second FDA vector is the eigenvector associated with the second largest eigenvalue, and so on. A large eigenvalue \(\lambda _k\) indicates that, when the data are projected onto the associated eigenvector \(\boldsymbol{w}_k\), there is a large overall separation of the class means relative to the class variances, and thus a large degree of separation among the classes along the direction of \(\boldsymbol{w}_k\). Since the rank of \(\boldsymbol{S}_b\) is less than p, at most \(p - 1\) eigenvalues are non-zero, and FDA provides a useful ordering of the eigenvectors only in these directions.
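A minimal sketch of computing the FDA vectors from labeled data is given below (NumPy-based; the function name `fda_fit` and the small ridge term added to \(\boldsymbol{S}_w\) to keep the inversion well posed are assumptions, not part of the text).

```python
import numpy as np

def fda_fit(X, y, a=None):
    """FDA sketch: build S_w and S_b from labeled data X (n x m), y (n,),
    and return the leading FDA vectors W_a = [w_1, ..., w_a]."""
    X = np.asarray(X, dtype=float)
    classes = np.unique(y)
    p, m = len(classes), X.shape[1]
    a = a if a is not None else p - 1             # at most p - 1 useful directions
    x_bar = X.mean(axis=0)                        # total mean vector, Eq. (2.20)
    S_w = np.zeros((m, m))
    S_b = np.zeros((m, m))
    for c in classes:
        Xj = X[y == c]
        xj_bar = Xj.mean(axis=0)                  # class mean vector, Eq. (2.22)
        D = Xj - xj_bar
        S_w += D.T @ D                            # within-class scatter, Eqs. (2.21), (2.23)
        d = (xj_bar - x_bar).reshape(-1, 1)
        S_b += len(Xj) * (d @ d.T)                # between-class scatter, Eq. (2.24)
    # Generalized eigenvalue problem S_b w = lambda S_w w, Eq. (2.28);
    # a tiny ridge term is added as a safeguard in case S_w is ill-conditioned (assumption)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w + 1e-8 * np.eye(m), S_b))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:a]], eigvals.real[order[:a]]
```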

When FDA is used for pattern classification, the dimensionality reduction is applied to the data of all classes at the same time. Denote \(\boldsymbol{W}_{a}= \left[ \boldsymbol{w}_{1}, \boldsymbol{w}_{2}, \ldots , \boldsymbol{w}_{a}\right] \in \mathcal {R}^{m \times a}\). The discriminant function can be derived as

$$\begin{aligned} \begin{aligned} g_{j}(\boldsymbol{x})=&-\frac{1}{2}\left( \boldsymbol{x}-\bar{\boldsymbol{x}}_{{j}}\right) ^{\mathrm {T}} \boldsymbol{W}_{a}\left( \frac{1}{n_{j}-1} \boldsymbol{W}_{a}^{\mathrm {T}} \boldsymbol{S}_{j} \boldsymbol{W}_{a}\right) ^{-1} \boldsymbol{W}_{a}^{\mathrm {T}}\left( \boldsymbol{x}-\bar{\boldsymbol{x}}_{{j}}\right) +\ln \left( p_{j}\right) \\&-\frac{1}{2} \ln \left[ {\text {det}}\left( \frac{1}{n_{j}-1} \boldsymbol{W}_{a}^{\mathrm {T}} \boldsymbol{S}_{j} \boldsymbol{W}_{a}\right) \right] , \end{aligned} \end{aligned}$$
(2.29)

where \(p_{j}\) is the prior probability of class j, and a new observation \(\boldsymbol{x}\) is assigned to the class with the largest discriminant value \(g_{j}(\boldsymbol{x})\). FDA can also be used to detect faults by defining an additional class of data on top of the fault classes, i.e., data collected under normal operating conditions. The reliability of fault detection using (2.29) depends on the similarity between the data from normal operating conditions and the fault class data in the training set. Fault detection using FDA will yield small missed detection rates for known fault classes when a transformation \(\boldsymbol{W}_{a}\) exists such that the data from normal operating conditions can be reasonably separated from the other fault classes.
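The discriminant function (2.29) can be evaluated as in the following sketch; \(\boldsymbol{W}_{a}\) can be taken from the `fda_fit` sketch above, and the class priors are estimated from the training class frequencies (an assumption, since the text does not specify them).

```python
import numpy as np

def fda_discriminant(X_train, y_train, W_a, x_new):
    """Sketch of the FDA discriminant function (2.29): score each class j
    for a new sample x_new and return the class with the largest g_j."""
    classes = np.unique(y_train)
    n = len(y_train)
    scores = {}
    for c in classes:
        Xj = X_train[y_train == c]
        n_j = len(Xj)
        xj_bar = Xj.mean(axis=0)
        D = Xj - xj_bar
        S_j = D.T @ D                                   # class scatter matrix, Eq. (2.21)
        C_j = (W_a.T @ S_j @ W_a) / (n_j - 1)           # projected class covariance
        z = W_a.T @ (x_new - xj_bar)                    # projected deviation from the class mean
        prior = n_j / n                                 # class prior from training frequency (assumption)
        g = (-0.5 * z @ np.linalg.solve(C_j, z)
             + np.log(prior)
             - 0.5 * np.log(np.linalg.det(C_j)))        # Eq. (2.29)
        scores[c] = g
    return max(scores, key=scores.get), scores
```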

2.2.2 Comparison of FDA and PCA

As two classical techniques for dimensionality reduction of a single data set, PCA and FDA exhibit similar properties in many respects. The optimization problems of PCA and FDA, formulated mathematically in (2.8) and (2.26), respectively, can also be written as

$$\begin{aligned} J_\mathrm{PCA} = \max _{\boldsymbol{w}\ne \boldsymbol{0}}\frac{\boldsymbol{w}^\mathrm {T}\boldsymbol{S}_t \boldsymbol{w}}{\boldsymbol{w}^\mathrm {T}\boldsymbol{w}} \end{aligned}$$
(2.30)
$$\begin{aligned} J_\mathrm{FDA} = \max _{\boldsymbol{w}\ne \boldsymbol{0}}\frac{\boldsymbol{w}^\mathrm {T}\boldsymbol{S}_t \boldsymbol{w}}{\boldsymbol{w}^\mathrm {T}\boldsymbol{S}_w \boldsymbol{w}} \end{aligned}$$
(2.31)

In the special case \(\boldsymbol{S}_w = a\boldsymbol{I},\;a\ne 0\), their optimal projection vectors are identical. This occurs if the data of each class can be described by a uniformly distributed ball (i.e., without a dominant direction), even if these balls have different sizes. The difference between the two techniques only arises when the data of some class appear elongated. Such elongated shapes occur in highly correlated data sets, for example, data collected from industrial processes. Thus, when FDA and PCA are applied to the same process data, the FDA vectors and the PCA loading vectors are significantly different. The different objectives of (2.30) and (2.31) show that FDA has superior performance to PCA in distinguishing among fault classes.

Figure 2.5 illustrates a difference between PCA and FDA. The first FDA vector and the first PCA loading vector are almost perpendicular. PCA maps the entire data set onto the coordinate axes that represent the data most compactly; the mapping does not use any class information in the data. Therefore, although the entire data set is more conveniently represented after PCA (reducing the dimensionality while minimizing the loss of information), it may become more difficult to classify. As the figure shows, the projections of the red and blue classes overlap in the PCA direction but are separated in the FDA direction. The two sets of data become easier to distinguish after the FDA mapping (they can be distinguished in low dimensions, which greatly reduces the amount of computation).

To illustrate the difference between PCA and FDA more clearly, the following numerical example of binary classification is given:

$$ \begin{aligned} \boldsymbol{x}_1&=[5+0.05\boldsymbol{\mu }(0,1); 3.2+0.9\boldsymbol{\mu }(0,1)] \in R^{2\times 100}\\ \boldsymbol{x}_2&=[5.1+0.05\boldsymbol{\mu }(0,1); 3.2+0.9\boldsymbol{\mu }(0,1)]\in R^{2\times 100}\\ \boldsymbol{X}&=[\boldsymbol{x}_1,\boldsymbol{x}_2]\in R^{2\times 200}, \end{aligned} $$

where \(\boldsymbol{\mu }(0,1) \in R^{1\times 100}\) is a uniformly distributed random vector on [0, 1]. \(\boldsymbol{X}\) consists of data from two modes, and its FDA and PCA projections are shown in Fig. 2.6. The distribution of the data within each class is somewhat elongated. The linear transformation of the data onto the first FDA vector separates the two classes better than the linear transformation onto the first PCA loading vector.

Fig. 2.6 Two-dimensional data projection comparison of FDA and PCA
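For reference, a self-contained sketch of this example is given below. For the two-class case the first FDA vector has the closed form \(\boldsymbol{w}_1 \propto \boldsymbol{S}_w^{-1}(\bar{\boldsymbol{x}}_1-\bar{\boldsymbol{x}}_2)\); the random-number generator and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two classes of 100 samples each, generated as in the numerical example above
x1 = np.column_stack([5.0 + 0.05 * rng.random(100), 3.2 + 0.9 * rng.random(100)])
x2 = np.column_stack([5.1 + 0.05 * rng.random(100), 3.2 + 0.9 * rng.random(100)])
X = np.vstack([x1, x2])

# First PCA loading vector: dominant eigenvector of the covariance of the pooled data
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
p1 = eigvecs[:, np.argmax(eigvals)]

# First FDA vector for two classes: w proportional to S_w^{-1} (mean_1 - mean_2)
m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
S_w = (x1 - m1).T @ (x1 - m1) + (x2 - m2).T @ (x2 - m2)
w1 = np.linalg.solve(S_w, m1 - m2)
w1 /= np.linalg.norm(w1)

print("first PCA loading vector:", np.round(p1, 3))  # close to the elongated (second) coordinate
print("first FDA vector:", np.round(w1, 3))          # close to the class-separating (first) coordinate
```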

Both PCA and FDA can be used to classify the original data after dimensionality reduction. PCA is an unsupervised method, i.e., it uses no class labels; after dimensionality reduction, unsupervised algorithms such as K-means or self-organizing maps are needed for classification. FDA is a supervised method: it first reduces the dimensionality of the training data and then finds a linear discriminant function. The similarities and differences between FDA and PCA can be summarized as follows.

1. Similarities

   (1) Both are used for dimensionality reduction;

   (2) Both assume a Gaussian distribution.

2. Differences

   (1) FDA is a supervised dimensionality reduction method, while PCA is unsupervised;

   (2) FDA can reduce the dimensionality to at most the number of classes minus one (\(p-1\)), while PCA has no such restriction;

   (3) FDA relies more on the class means; if the discriminative information lies mainly in the variance, its performance will not be as good as that of PCA;

   (4) FDA may overfit the data.