The previous chapters described the mathematical principles and algorithms of multivariate statistical methods, as well as the monitoring procedures used when they are applied to fault diagnosis. To validate the effectiveness of data-driven multivariate statistical analysis methods for fault diagnosis, corresponding fault monitoring experiments are necessary. This chapter therefore introduces two simulation platforms: the Tennessee Eastman (TE) process simulation system and the fed-batch penicillin fermentation process simulation system. Both are widely used as test platforms for process monitoring, fault classification, and fault identification in industrial processes. The experiments based on PCA, CCA, PLS, and FDA are carried out on the TE simulation platform.

4.1 Tennessee Eastman Process

The original TE industrial process control problem was proposed by Downs and Vogel in 1993. It serves as an open and challenging test problem for control-related topics, including multivariable controller design, optimization, adaptive and predictive control, nonlinear control, estimation and identification, process monitoring and diagnosis, and education. The TE process model is based on an actual chemical process and has been widely used as a benchmark in control and monitoring research. Figure 4.1 shows the flow diagram of the TE process with its five major units: reactor, condenser, compressor, vapor-liquid separator, and stripper. Four gaseous reactants, \(A,\; C,\; D,\; \text{and}\; E\), are fed to the process, and the feeds also contain a small amount of inert gas \(B\). The process yields three liquid products, \(G\), \(H\), and \(F\), where \(F\) is the by-product. The reactions are

$$\begin{aligned} A(g)+C(g)+D(g)&\rightarrow G(liq),&\text {product}\; G \\ A(g)+C(g)+E(g)&\rightarrow H(liq),&\text {product}\; H \\ A(g)+E(g)&\rightarrow F(liq),&\text {by-product} \\ 3D(g)&\rightarrow 2F(liq),&\text {by-product} \end{aligned}$$

Briefly, the TE process provides two data modules: the XMV module, containing 12 manipulated variables (XMV(1)-XMV(12): \(x_{23}-x_{34}\)), and the XMEAS module, consisting of 22 process measured variables (XMEAS(1)-XMEAS(22): \(x_{1}-x_{22}\)) and 19 component measured variables (XMEAS(23)-XMEAS(41): \(x_{35}-x_{53}\)), as listed in Tables 4.1 and 4.2.
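The index convention above can be captured in a small lookup. The following is a minimal sketch, assuming only the numbering stated in the text; the function name `te_variable_index` is hypothetical and is not part of the TE simulator code.

```python
def te_variable_index(module: str, k: int) -> int:
    """Map a TE tag such as XMEAS(35) or XMV(3) to i in x_i (1-based).

    Follows the numbering given in the text:
      XMEAS(1)-XMEAS(22)  -> x_1  .. x_22  (process measurements)
      XMV(1)-XMV(12)      -> x_23 .. x_34  (manipulated variables)
      XMEAS(23)-XMEAS(41) -> x_35 .. x_53  (component measurements)
    """
    if module == "XMEAS" and 1 <= k <= 22:
        return k
    if module == "XMV" and 1 <= k <= 12:
        return 22 + k
    if module == "XMEAS" and 23 <= k <= 41:
        return k + 12
    raise ValueError(f"unknown TE variable {module}({k})")


if __name__ == "__main__":
    for tag in [("XMEAS", 1), ("XMV", 1), ("XMEAS", 23), ("XMEAS", 41)]:
        print(tag, "-> x_%d" % te_variable_index(*tag))
```

When the 53 variables are stored as columns of a zero-based array, the column position is the returned index minus one.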

Fig. 4.1 Tennessee Eastman process

Table 4.1 Monitoring variables in the TE process(\(x_{1}-x_{34}\))
Table 4.2 Monitoring variables in the TE process(\(x_{35}-x_{53}\))
Table 4.3 Disturbances for the TE process

The simulation code used in this book is available online at http://depts.washington.edu/control/LARRY/TE/download.html, where both the code and the data sets can be downloaded. The Simulink simulator makes it easy to set the operating mode, measurement noise, sampling time, and fault magnitudes when generating data, which is very helpful for data-driven process monitoring studies. The 21 artificial disturbances in the TE process, which are treated as faulty operations in the fault diagnosis problem, are listed in Table 4.3. The TE data consist of a training set and a testing set, each of which includes 22 kinds of data generated under different simulated operating conditions. Each kind of data contains sampled measurements of 53 observed variables.

In the data set available at the link above, d00.dat to d21.dat are the training sets and d00_te.dat to d21_te.dat are the testing sets. d00.dat and d00_te.dat contain samples collected under normal operating conditions. The training samples in d00.dat come from a 25 h simulation run and comprise 500 observations. The test samples in d00_te.dat come from a 48 h simulation run and comprise 960 observations. d01.dat–d21.dat (for training) and d01_te.dat–d21_te.dat (for testing) are sampled under different faults, and the number in each file name corresponds to the fault type.

All testing data sets are obtained from 48 h simulation runs with the fault introduced at 8 h. Each contains 960 observations, of which the first 160 are collected under normal operation. It is worth pointing out that the data sets generated by Leoand et al. (2001) are widely accepted for process monitoring and fault diagnosis research. The data sets are smoothed, filtered, and normalized. The monitored variables are \(x_{1}-x_{53}\).
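The timing figures quoted above are mutually consistent: 500 observations over 25 h and 960 observations over 48 h both correspond to one sample every 3 min, so a fault introduced at 8 h falls after the 160th sample. The sketch below checks this arithmetic and builds a per-sample fault label vector for a testing set. The function names are hypothetical, and the 0/1 label encoding is an assumption made for illustration.

```python
def sampling_interval_minutes(duration_h: float, n_samples: int) -> float:
    """Sampling interval implied by a run length (hours) and a sample count."""
    return duration_h * 60.0 / n_samples


def fault_labels(n_samples: int = 960, n_normal: int = 160) -> list:
    """Per-sample labels for a TE testing set: 0 = normal, 1 = faulty.

    The first n_normal samples precede the fault (assumed encoding).
    """
    return [0] * n_normal + [1] * (n_samples - n_normal)


if __name__ == "__main__":
    dt = sampling_interval_minutes(48, 960)   # minutes per sample
    onset = int(8 * 60 / dt)                  # samples recorded before the fault
    labels = fault_labels()
    print(dt, onset, sum(labels))
```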

4.2 Fed-Batch Penicillin Fermentation Process

Fed-batch fermentation processes are widely used in the pharmaceutical industry, where maximizing the yield is usually the main goal. Compared with continuous operation, batch operation is characterized by strong nonlinearity, non-stationary conditions, batch-to-batch variability, and strongly time-varying conditions. These features make the yield difficult to predict. Consequently, fault detection, classification, and identification are more difficult for batch/fed-batch processes than for the continuous TE process.

The model of the fed-batch penicillin fermentation process is described by Birol et al. (2002):

$$\begin{aligned} X=&f(X,S,C_L,H,T)\\ S=&f(X,S,C_L,H,T)\\ C_L=&f(X,S,C_L,H,T)\\ P=&f(X,S,C_L,H,T,P)\\ CO_2=&f(X,H,T)\\ H=&f(X,H,T), \end{aligned}$$

where \(X,\; S,\; C_L,\; P, \;CO_2,\; H \;\text {and}\; T\) are biomass concentration, substrate concentration, dissolved oxygen concentration, penicillin concentration, carbon dioxide concentration, hydrogen ion concentration for pH \(\left( \left[ H^+\right] \right) \), and temperature, respectively. The corresponding detailed mathematical model is given in Birol et al. (2002).

A research group at the Illinois Institute of Technology has developed PenSim V2.0, a dynamic simulator of penicillin production based on an unstructured model. It has been used as a benchmark for statistical process monitoring studies of batch/fed-batch reaction processes. The flow chart of the fermentation process is depicted in Fig. 4.2. The fermentation unit consists of a fermentation reactor and a coil-based heat exchange unit. The pH and temperature are automatically controlled by two PID controllers that adjust the flow rates of acid/base and cold/hot water, respectively. In fed-batch mode, the glucose substrate is fed continuously into the fermentation reactor under open-loop operation.

Fig. 4.2 Flow chart of the penicillin fermentation process

Table 4.4 Variables in penicillin fermentation process

Fourteen variables are considered in the PenSim V2.0 model, as shown in Table 4.4: 5 input variables (1–4, 14) and 9 process variables (5–13). Since variables 11–13 are not measured online in industry, only the remaining 11 variables are monitored here.
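The variable selection just described amounts to dropping three indices from the list of fourteen. The following is a minimal sketch using only the variable numbers stated above; the constant names are hypothetical, and the physical meaning of each number is given in Table 4.4.

```python
# Variable numbers follow Table 4.4 (1-based).
ALL_VARIABLES = list(range(1, 15))        # 14 variables in total
INPUT_VARIABLES = [1, 2, 3, 4, 14]        # 5 input variables
PROCESS_VARIABLES = list(range(5, 14))    # 9 process variables (5-13)
NOT_MEASURED_ONLINE = {11, 12, 13}        # not available online in industry

# Keep every variable that can be measured online.
MONITORED = [v for v in ALL_VARIABLES if v not in NOT_MEASURED_ONLINE]

if __name__ == "__main__":
    print(len(MONITORED), MONITORED)
```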

4.3 Fault Detection Based on PCA, CCA, and PLS

This section tests the effectiveness of several multivariate statistical methods on the TE process. Faults in the standard TE data set are introduced at the 160th sample. For comparison purposes, the normal operation data d00_te are used to train the statistical model, and the faulty operation data d01_te-d21_te are used to test the model and detect faults. In the experiments with the PCA and PLS methods, the process variable matrix X consists of the process variables (XMEAS (1–22)) and the manipulated variables (XMV (1–11)). XMEAS (35) is used as the quality variable matrix Y for PLS. In the CCA experiment, the process variables (XMEAS (1–22)) are used as one data set and the manipulated variables (XMV (1–11)) as the other.

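The fault detection rate (FDR) and false alarm rate (FAR) are defined as follows, where \(J\) is the monitoring statistic, \(J_{th}\) is its threshold, and \(f \ne 0\) (\(f = 0\)) indicates that a fault is (is not) present: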

$$\begin{aligned} \begin{aligned} \mathrm {FDR}&=\frac{\mathrm{No. of\; samples} (J> J_{th}|f \ne 0)}{\mathrm{total\; samples} (f \ne 0)}\times 100 \\ \mathrm {FAR}&=\frac{\mathrm{No. of\; samples} (J > J_{th}|f = 0)}{\mathrm{total\; samples }(f = 0)}\times 100. \end{aligned} \end{aligned}$$
(4.1)
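Equation (4.1) translates directly into sample counts. The sketch below is one way to compute both rates for a sequence of monitoring statistics; the function name `fdr_far` and its argument layout are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def fdr_far(J: Sequence[float], J_th: float,
            faulty: Sequence[bool]) -> Tuple[float, float]:
    """Return (FDR, FAR) in percent, following Eq. (4.1).

    J      : monitoring statistic for each sample
    J_th   : threshold; an alarm is raised when J > J_th
    faulty : True where a fault is present (f != 0), False otherwise
    """
    if len(J) != len(faulty):
        raise ValueError("J and faulty must have the same length")
    n_fault = sum(1 for f in faulty if f)
    n_normal = len(faulty) - n_fault
    detected = sum(1 for j, f in zip(J, faulty) if f and j > J_th)
    false_alarms = sum(1 for j, f in zip(J, faulty) if not f and j > J_th)
    # A rate is undefined (NaN) when its denominator is empty.
    fdr = 100.0 * detected / n_fault if n_fault else float("nan")
    far = 100.0 * false_alarms / n_normal if n_normal else float("nan")
    return fdr, far


if __name__ == "__main__":
    J = [0.5, 2.0, 0.1, 0.3, 3.0, 4.0, 0.2, 5.0]
    faulty = [False] * 4 + [True] * 4
    print(fdr_far(J, 1.0, faulty))
```

For a TE testing set, `faulty` would be False for the first 160 samples and True for the remaining 800.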

The experiment and model parameters are set as follows. The number of principal components in PCA is determined by a cumulative contribution of 90%. The number of components in PLS is set to 6. The \(\mathrm{T}^2\) and \(\mathrm Q\) statistics are used to monitor process faults. Note that for CCA, (3.18) and (3.19) are used as the monitoring indices, and the corresponding monitoring results are slightly different. For the 21 fault types, the FDRs of PCA, CCA, and PLS, based on control limits at the \(99\%\) confidence level, are shown in Table 4.5. The results show that the multivariate statistical methods considered in this section (PCA, CCA, and PLS) can accurately detect the significant process faults.
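To make the PCA part of this setup concrete, the following NumPy sketch selects the number of components by the 90% cumulative contribution rule and computes the \(\mathrm{T}^2\) and \(\mathrm Q\) statistics. It is an illustration under stated assumptions, not the code used for Table 4.5. In particular, the 99% control limits are taken here as empirical quantiles of the training statistics, which is a simplification swapped in for the theoretical control limits of the earlier chapters. The function names and the dictionary layout are hypothetical.

```python
import numpy as np


def pca_statistics(model, X):
    """T^2 and Q (squared prediction error) for each row of X."""
    Z = (X - model["mean"]) / model["std"]         # scale with training statistics
    T = Z @ model["P"]                             # scores in the principal subspace
    T2 = np.sum(T ** 2 / model["lam"], axis=1)     # Hotelling-type T^2
    resid = Z - T @ model["P"].T                   # part not explained by the model
    Q = np.sum(resid ** 2, axis=1)
    return T2, Q


def fit_pca_monitor(X_train, cum_var=0.90, conf=0.99):
    """Fit a PCA monitoring model on normal data (rows = samples).

    Assumes no column of X_train is constant (nonzero standard deviation).
    """
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    Z = (X_train - mean) / std
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    ratio = np.cumsum(eigvals) / eigvals.sum()
    # Smallest k whose cumulative contribution reaches cum_var.
    k = min(int(np.searchsorted(ratio, cum_var)) + 1, len(eigvals))
    model = {"mean": mean, "std": std, "P": eigvecs[:, :k], "lam": eigvals[:k]}
    T2, Q = pca_statistics(model, X_train)
    # ASSUMPTION: empirical quantiles stand in for the theoretical limits.
    model["T2_lim"] = float(np.quantile(T2, conf))
    model["Q_lim"] = float(np.quantile(Q, conf))
    return model
```

A sample is flagged as faulty when either statistic exceeds its limit; feeding the resulting alarms and the known fault onset into Eq. (4.1) gives the FDR and FAR.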

Table 4.5 FDRs of PCA, CCA and PLS

Figures 4.3, 4.4, and 4.5 show the monitoring results based on the PCA, CCA, and PLS models for the typical faults IDV(1), IDV(16), and IDV(20), respectively. Here, the black line is the statistic calculated from the real-time data, and the red line is the threshold for normal operation obtained from the offline model.

Table 4.5 shows that CCA achieves better detection for certain fault types, such as IDV(10), IDV(16), IDV(19), and IDV(20). The monitoring results for faults IDV(16) and IDV(20) are shown in Figs. 4.4 and 4.5. Why does CCA show better detection capability than the other two methods for certain faults? The answer lies in how the process variable matrix X is set up for the three methods. In contrast to PCA and PLS, CCA splits its X-space directly into two parts and extracts the latent variables by examining the correlation between these two parts, so the latent variables extracted by CCA can better characterize the changes in the process.

4.4 Fault Classification Based on FDA

To further test the effectiveness of fault classification, the 161st to 700th samples of the 21 fault data sets and the normal data set are used to train the FDA model. The 701st to 960th samples are used to test the FDA model and its classification ability. FDA, described earlier in this book, is a classical method for validating the classification effect and identifying fault types. The following distance metric is introduced to further quantify the difference between faults:

$$\begin{aligned} \mathrm{D}_2=\left\| \mathrm{FDA}_i - \mathrm{FDA}_j\right\| , \end{aligned}$$

where \(\mathrm{FDA}_i\) denotes the FDA feature vector of the \(i\)th fault.
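The sample ranges and the distance index above can be sketched as follows. Two points are assumptions made for illustration: the norm is taken to be Euclidean, since the text does not specify it, and each fault is represented by a single feature vector supplied by the caller, since how that vector is formed from the FDA projection is not stated here. The function names are hypothetical.

```python
import math


def split_fault_samples(samples):
    """Split one 960-sample data set into FDA training and testing parts.

    samples[0] is sample 1, so samples 161-700 are samples[160:700]
    and samples 701-960 are samples[700:960].
    """
    return samples[160:700], samples[700:960]


def d2_matrix(features):
    """Pairwise D_2 index between per-fault feature vectors.

    ASSUMPTION: the norm in the definition of D_2 is Euclidean.
    """
    n = len(features)
    return [[math.dist(features[i], features[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    train, test = split_fault_samples(list(range(1, 961)))
    print(len(train), len(test))
    print(d2_matrix([[0.0, 0.0], [3.0, 4.0]]))
```

Row \(i\) of the returned matrix gives the distance from fault \(i\) to every other fault, which is the kind of curve a plot of the \(\mathrm{D}_2\) index would show for each fault.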

Fig. 4.3 PCA, CCA, and PLS monitoring results for IDV(1)

Fig. 4.4 PCA, CCA, and PLS monitoring results for IDV(16)

Fig. 4.5 PCA, CCA, and PLS monitoring results for IDV(20)

The simulation results are shown in Fig. 4.6. The 22 kinds of data (the normal operation and 21 faulty operations) can be roughly divided into two major categories. The first contains the faults that differ significantly from the others, namely IDV(2) (line with \(\lozenge \)), IDV(6) (line with \(*\)), and IDV(18) (line with \(\circ \)). The second is the set of faults whose characteristics are relatively close to each other.

Fig. 4.6 \(\mathrm{D}_2\) index for different faults

The faults IDV(1), IDV(2), IDV(6), and IDV(20) are analyzed further. The FDA results for fault classification are shown in Fig. 4.7. The \(\mathrm{D}_2\) indices of these faults differ considerably, as the classification results clearly illustrate. Conversely, certain faults differ very little in their \(\mathrm{D}_2\) indices. For example, faults IDV(4), IDV(11), and IDV(14) have similar \(\mathrm{D}_2\) indices, as shown in Fig. 4.8. These faults are difficult to classify accurately with the FDA model, as shown in Fig. 4.9.

Fig. 4.7 FDA identification results for faults 1, 2, 6, and 20

Fig. 4.8 \(\mathrm{D}_2\) indices for faults 4, 11, and 14

Fig. 4.9 FDA identification results for faults 4, 11, and 14

4.5 Conclusions

Two simulation platforms are introduced for verifying statistical monitoring methods, and several experiments based on the traditional methods (PCA, PLS, CCA, and FDA) are carried out. These basic experiments illustrate the characteristics of the methods and their fault detection performance. Many improved methods have been proposed to overcome the shortcomings of the original multivariate statistical analysis methods. Each method has its own conditions and scope of application, and no single method completely outperforms the others. Furthermore, data-based fault detection methods need to be combined with the actual monitored object, and existing methods need to be improved according to knowledge of that object and its characteristics. This book therefore focuses on fault detection (discrimination) strategies for batch processes and strongly nonlinear systems.