Ensuring the safety of industrial systems requires not only detecting faults but also locating them so that they can be eliminated. The previous chapters have discussed fault detection and identification methods; fault traceability is an equally important issue in industrial systems. This chapter and Chap. 14 address fault inference and root tracking based on the probabilistic graphical model. This model explores the internal linkages of system variables both quantitatively and qualitatively, so it avoids the bottleneck of multivariate statistical models, which lack a clear mechanistic interpretation. The extracted features or principal components of a multivariate statistical model are linear or nonlinear combinations of system variables and have no physical meaning. Hence, the multivariate statistical model is good at fault detection and identification, but not at fault root tracking.

A Bayesian network (BN) can estimate and predict the potentially harmful factors of a general system, but its structure learning has some deficiencies when applied to complex systems, such as a complex training mechanism and intricate variable causalities. To simplify the network structure, many assumptions have to be presupposed, which inevitably causes a loss of generality. Usually, a generative model (linear or nonlinear) is built to explain the data generating process, i.e., the causalities. A variety of causal discovery methods have been proposed recently to find the causalities (Hyvärinen et al. 2010; Hong et al. 2017). The most classical method is the linear non-Gaussian acyclic model (LiNGAM) (Shimizu et al. 2010), in which the full structure of the BN is identifiable without pre-specifying a causal order of the variables. An improved LiNGAM method was proposed to estimate the causal order of variables without any prior structure knowledge and to provide better statistical performance (Shimizu et al. 2011). The nonlinear causality of a pair of variables is discovered in Johnson and Bhattacharyya (2015), but the proposed method shows a limitation when dealing with multivariate variables.

The above approaches exploit the complexity of the marginal and conditional probability distributions in one way or another. Although a large number of methods for bivariate causal discovery have been proposed over the last few years, their practical performance has not been studied systematically. These methods have yet to be applied to actual industrial systems, which usually do not satisfy the linear and bivariate assumptions. To address these issues, this chapter proposes a more generalized multivariate post-nonlinear acyclic causal model for complex industrial processes. The proposed model, named the Bayesian Causal Network (BCN), can easily find multivariate causalities. It yields a more compact structure that is consistent with the process mechanism, compared with the traditional BN structure. In addition, it avoids the complex learning mechanism of the traditional BN, so it is easier to implement without compromising accuracy.

13.1 Construction of Bayesian Causal Network

It is known that there are many ways to describe system characteristics from observational data and expert knowledge, such as the graph model (Hipel et al. 2011), the neural network model (Li et al. 2016), and the fuzzy model (Jiang et al. 2015). The graph model is composed of nodes and edges that describe the system structure and the causal relationships among variables. It provides an effective method for studying various systems, especially complex systems. The Bayesian network, a typical graph model, is the main method for handling knowledge representation and uncertainty based on probability theory. It builds the causality and probability within the process components and the system variables from prior knowledge and process data. BN learning consists of structure learning and parameter learning: structure learning determines the causalities among system variables, and parameter learning reveals the quantitative relationships of these causalities. Bayesian networks have been applied to fault diagnosis, financial analysis, automatic target recognition, military applications, and many other areas (Zhu et al. 2017).

13.1.1 Description of Bayesian Network

The Bayesian network, also known as a belief network or directed acyclic graphical model, is a probabilistic graphical model first proposed by Judea Pearl in 1985 (Pearl 1986). It is an uncertainty processing model that simulates the causal relationships in human reasoning, and its network topology is a directed acyclic graph (DAG). The nodes in the DAG represent random variables, including observable variables, hidden variables, unknown parameters, etc. Variables or propositions that are believed to have a causal relationship (or that are not conditionally independent) are connected by arrows; in other words, an arrow connecting two nodes indicates that the two random variables are causally related or not conditionally independent. If two nodes are connected by a single arrow, one node is the “cause” and the other is the “effect”, and a conditional probability value describes the degree of causality quantitatively.

For example, assume that node A directly affects node B, i.e., \(A\rightarrow B\). The arrow from A to B establishes a directed arc (A, B) from node A to node B, and its weight (connection strength) is determined by the conditional probability P(B|A). In short, a BN is formed by drawing the random variables in a directed graph according to whether they are conditionally independent. It usually uses circles to represent the random variables (nodes) and arrows to represent the conditional dependencies. Figure 13.1 gives a simple Bayesian network (Ishak et al. 2011).
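As a minimal numerical illustration of this arc weight, the joint distribution of two binary nodes factorizes as \(P(A,B)=P(A)P(B|A)\); the probability values in the sketch below are hypothetical:

```python
import numpy as np

# Hypothetical two-node network A -> B with binary states {0, 1}
p_a = np.array([0.7, 0.3])            # prior P(A)
p_b_given_a = np.array([[0.9, 0.1],   # row: P(B | A=0)
                        [0.2, 0.8]])  # row: P(B | A=1)

# The arc weight P(B|A) combines with P(A) into the joint P(A, B)
joint = p_a[:, None] * p_b_given_a
marginal_b = joint.sum(axis=0)        # P(B), obtained by summing out A
```

The joint table sums to one, and summing out A recovers the marginal of B, which is the basic computation every BN inference routine builds on.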

Fig. 13.1
figure 1

Bayesian network example

13.1.2 Establishing Multivariate Causal Structure

Model-based causal discovery assumes a generative model to explain the data generating process. When prior knowledge about the data model is unavailable, the assumed model should be sufficiently general to approximate the real data generating process. Furthermore, the model should be identifiable, so that it can distinguish causes from effects. A nonlinear multivariable system generally possesses the following three characteristics (Chen et al. 2018):

  1. The multivariate causalities are usually nonlinear.

  2. The final target variable is affected by its cause variables and by noise that is independent of the causes.

  3. Sensors or measurements may introduce nonlinear distortions into the observed values of the variables.

To discover the multivariate causality in complex industrial systems, a more generalized multivariate post-nonlinear acyclic causal model with inner additive noise is proposed. The model takes the form of a graph, analogous to a Bayesian network structure. Assume that there is a DAG representing the relationships among multiple observed variables. Mathematically, the generating process of \(\boldsymbol{X}_i\) is

$$\begin{aligned} \boldsymbol{X}_{i}=f_{i,2}(f_{i,1}(\boldsymbol{PA}_{i} )+\boldsymbol{e}_{i} ), \end{aligned}$$
(13.1)

where the observed variables \(\boldsymbol{X}_{i}\), \(i=\{1,2,\ldots ,n\}\), are arranged in a causal order, such that no later variable causes any earlier variable. \(\boldsymbol{PA}_{i}\) is the direct cause of \(\boldsymbol{X}_{i}\), \(f_{i,1}\) denotes the nonlinear effect of this cause, and \(f_{i,2}\) denotes the invertible post-nonlinear distortion in variable \(\boldsymbol{X}_{i}\). \(\boldsymbol{e}_{i}\) is the independent disturbance, a continuous-valued random variable with a non-Gaussian distribution of non-zero variance. Model (13.1) satisfies the aforementioned three characteristics: function \(f_{i,1}\) accounts for the nonlinear effect of the causes \(\boldsymbol{PA}_{i}\); \(\boldsymbol{e}_{i}\) is the noise introduced during the transmission from \(\boldsymbol{PA}_{i}\) to \(\boldsymbol{X}_{i}\); and the invertible function \(f_{i,2}\) reflects the nonlinear distortion caused by the sensor or measurement.

Randomly select a pair of variables \(\boldsymbol{X}_{i}\) and \(\boldsymbol{X}_{j}\), \(i,j=\{1,2,\ldots ,n\}\), and assume that the pair \((\boldsymbol{X}_{i},\boldsymbol{X}_{j})\) has the causal relation \(\boldsymbol{X}_{i}\rightarrow \boldsymbol{X}_{j}\). Its data generating process can be described by the generative model

$$\begin{aligned} \boldsymbol{X}_{j}=f_{j,2} (f_{j,1} ( \boldsymbol{X}_{i} )+\boldsymbol{e}_{j} ), \end{aligned}$$
(13.2)

where \(\boldsymbol{e}_{j}\) is independent from \(\boldsymbol{X}_{i}\). Define \(\boldsymbol{s}_{i} \triangleq f_{j,1}(\boldsymbol{X}_{i})\) and \(\boldsymbol{s}_{j} \triangleq \boldsymbol{e}_{j}\); then \(\boldsymbol{s}_{i}\) is independent from \(\boldsymbol{s}_{j}\).

Rewrite the generating process \(\boldsymbol{X}_{i}\rightarrow \boldsymbol{X}_{j}\) as follows:

$$\begin{aligned} \begin{aligned} \boldsymbol{X}_{i}&=f_{j,1}^{-1}(\boldsymbol{s}_{i} ),\\ \boldsymbol{X}_{j}&=f_{j,2}(\boldsymbol{s}_{i} +\boldsymbol{s}_{j} ).\end{aligned}\end{aligned}$$
(13.3)

\(\boldsymbol{X}_{i}\) and \(\boldsymbol{X}_{j}\) in (13.3) are post-nonlinear (PNL) mixtures of the independent sources \(\boldsymbol{s}_{i}\) and \(\boldsymbol{s}_{j}\). The PNL mixing model can be seen as a special case of the general nonlinear independent component analysis (ICA) model, so a nonlinear ICA method is used to solve problem (13.3).

In general, there are two possible causal relations between any two random variables \(\boldsymbol{X}_{i}\) and \(\boldsymbol{X}_{j}\): \(\boldsymbol{X}_{i}\rightarrow \boldsymbol{X}_{j}\) or \(\boldsymbol{X}_{j}\rightarrow \boldsymbol{X}_{i}\). The correct relation is identified by judging which one satisfies the assumed model (13.2). If the causal relation is \(\boldsymbol{X}_{i}\rightarrow \boldsymbol{X}_{j}\) (i.e., \(\boldsymbol{X}_{i}\) and \(\boldsymbol{X}_{j}\) satisfy model (13.2)), we can invert the data generating process (13.2) to recover the disturbance \(\boldsymbol{e}_{j}\), which is expected to be independent from \(\boldsymbol{X}_{i}\). Two steps are used to examine the possible causal relationships between variables.

In the first step, recover the disturbance \(\boldsymbol{e}_{j}\) corresponding to the assumed causal relation \(\boldsymbol{X}_{i}\rightarrow \boldsymbol{X}_{j}\) using constrained nonlinear ICA. If this causal relation holds, there exist nonlinear functions \(f_{j,2}^{-1}\) and \(f_{j,1}\) such that

$$\begin{aligned} \boldsymbol{e}_{j} =f_{j,2}^{-1}(\boldsymbol{X}_{j})-f_{j,1}(\boldsymbol{X}_{i}), \end{aligned}$$
(13.4)

where \(\boldsymbol{e}_{j}\) is independent from \(\boldsymbol{X}_{i}\). Nonlinear ICA is therefore performed using the structure in Fig. 13.2, and the outputs of the system are

$$\begin{aligned} \begin{aligned} \boldsymbol{Y}_{i}&=\boldsymbol{X}_{i},\\ \boldsymbol{Y}_{j}&=\boldsymbol{e}_{j} =g_j(\boldsymbol{X}_{j} )-g_i(\boldsymbol{X}_{i}). \end{aligned} \end{aligned}$$
(13.5)

The nonlinearities \(g_i\) and \(g_j\) are modeled by multilayer perceptrons (MLPs), and the parameters of \(g_i\) and \(g_j\) are learned by making \(\boldsymbol{Y}_{i}\) and \(\boldsymbol{Y}_{j}\) as independent as possible, i.e., by minimizing the mutual information between \(\boldsymbol{Y}_{i}\) and \(\boldsymbol{Y}_{j}\),

$$\begin{aligned} I(\boldsymbol{Y}_{i},\boldsymbol{Y}_{j})=H(\boldsymbol{Y}_{i})+H(\boldsymbol{Y}_{j})-H(\boldsymbol{Y}), \end{aligned}$$
(13.6)

where \(H(\boldsymbol{Y})\) is the joint entropy of \(\boldsymbol{Y}=(\boldsymbol{Y}_{i},\boldsymbol{Y}_{j})^{T}\),

$$\begin{aligned} \begin{aligned} H(\boldsymbol{Y})&=-\mathbb {E}\left[ \log p_Y(\boldsymbol{Y})\right] \\&=-\mathbb {E}\left[ \log p_X(\boldsymbol{X})-\log \left| \boldsymbol{J}\right| \right] \\&=H(\boldsymbol{X})+\mathbb {E}\left[ \log \left| \boldsymbol{J}\right| \right] . \end{aligned} \end{aligned}$$
(13.7)

The joint density of \(\boldsymbol{Y}=(\boldsymbol{Y}_{i},\boldsymbol{Y}_{j})^{\mathrm {T}}\) is \(p_Y(\boldsymbol{Y})=p_X (\boldsymbol{X})/|\boldsymbol{J}|\). \(\boldsymbol{J}\) is the Jacobian matrix of the transformation from \((\boldsymbol{X}_{i},\boldsymbol{X}_{j})\) to \((\boldsymbol{Y}_{i},\boldsymbol{Y}_{j})\), i.e.,

$$\begin{aligned} \begin{aligned} \boldsymbol{J}&=\frac{\partial (\boldsymbol{Y}_{i},\boldsymbol{Y}_{j} )}{\partial (\boldsymbol{X}_{i},\boldsymbol{X}_{j} )},\\ |\boldsymbol{J}|&=\left| \begin{matrix} 1&{}0\\ -g'_i&{}g'_j \end{matrix}\right| =\left| g'_j\right| . \end{aligned} \end{aligned}$$
(13.8)

Substituting (13.7) and (13.8) into (13.6), we have

$$\begin{aligned} I(\boldsymbol{Y}_{i},\boldsymbol{Y}_{j})&=H(\boldsymbol{Y}_{i})+H(\boldsymbol{Y}_{j})-\mathbb {E}[\log |\boldsymbol{J}|]-H(\boldsymbol{X})\end{aligned}$$
(13.9)
$$\begin{aligned}&=-\mathbb {E}\left[ \log p_{\boldsymbol{Y}_{i}}(\boldsymbol{Y}_{i})\right] -\mathbb {E}\left[ \log p_{\boldsymbol{Y}_{j}}(\boldsymbol{Y}_{j})\right] -\mathbb {E}\left[ \log \left| g'_j\right| \right] -H(\boldsymbol{X}), \end{aligned}$$
(13.10)

where \(H(\boldsymbol{X})\) does not depend on the parameters of \(g_i\) and \(g_j\) and can be treated as a constant. The minimization problem (13.10) is solved by gradient-descent methods; the details of the optimization are omitted.

Fig. 13.2
figure 2

Constrained nonlinear ICA system used to verify if the causal relation \(\boldsymbol{X}_{i}\rightarrow \boldsymbol{X}_{j}\) holds

In the second step, verify whether the estimated disturbance \(\boldsymbol{Y}_{j}\) is independent from the assumed cause \(\boldsymbol{Y}_{i}\) using a statistical independence test. A kernel-based statistical test is adopted with the significance level \(=0.01\) (Giga 2014). Denote the test statistic as \(test_{i\rightarrow j}\). If \(test_{i\rightarrow j}> test_{j\rightarrow i}\), then \(\boldsymbol{Y}_{i}\) and \(\boldsymbol{Y}_{j}\) are not independent, i.e., \(\boldsymbol{X}_{i}\rightarrow \boldsymbol{X}_{j}\) does not hold. Repeat the above procedure with \(\boldsymbol{X}_{i}\) and \(\boldsymbol{X}_{j}\) exchanged to verify whether \(\boldsymbol{X}_{j}\rightarrow \boldsymbol{X}_{i}\) holds. If \(test_{i\rightarrow j}<test_{j\rightarrow i}\), it is concluded that \(\boldsymbol{X}_{i}\) causes \(\boldsymbol{X}_{j}\); \(g_i\) and \(g_j\) then provide estimates of \(f_{j,1}\) and \(f_{j,2}^{-1}\), respectively.

For a complex system with n process variables, following the test sequence \(\boldsymbol{X}_1\rightarrow \boldsymbol{X}_2\), \(\boldsymbol{X}_1\rightarrow \boldsymbol{X}_3,\ldots , \boldsymbol{X}_{n-1}\rightarrow \boldsymbol{X}_{n}\), N groups of statistics should be tested,

$$\begin{aligned} N=(n-1)+(n-2)+ \cdots +1=\frac{n(n-1)}{2}. \end{aligned} $$
(13.11)

The total computation is directly proportional to \(2\times N\), since each pair is tested in both directions. As the number of variables increases, the amount of computation increases accordingly. The test statistics measured in the forward order and in the reverse order are stored as

$$\begin{aligned} \begin{aligned} \boldsymbol{A}&=[test_{ {X_1}\rightarrow {X_2}},test_{{X_1}\rightarrow {X_3}},\ldots ,test_{ {X_{n-1}}\rightarrow {X_n}}],\\ \boldsymbol{B}&=[test_{{X_2}\rightarrow {X_1}},test_{ {X_3}\rightarrow {X_1}},\ldots ,test_{ {X_n}\rightarrow {X_{n-1}}}]. \end{aligned} \end{aligned}$$
(13.12)

By comparing the corresponding elements of the vectors \(\boldsymbol{A}\) and \(\boldsymbol{B}\), the causal direction of each pair of variables is determined by the smaller statistic. Once the causality of all variables has been found through the above cyclic search, the results are integrated into a DAG.
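The pairwise scan of (13.12) can be sketched as follows. For brevity, the constrained-ICA step is replaced by a cubic-polynomial regression (to recover a residual) followed by an HSIC-style dependence score; this surrogate only suits additive-noise pairs and is a hypothetical simplification of the full procedure:

```python
import numpy as np
from itertools import combinations

def dep_stat(u, v, sigma=1.0):
    """Gaussian-kernel HSIC-style dependence score between two samples."""
    u = (u - u.mean()) / u.std()
    v = (v - v.mean()) / v.std()
    U, V = u.reshape(-1, 1), v.reshape(-1, 1)
    n = len(u)
    K = np.exp(-(U - U.T) ** 2 / (2 * sigma ** 2))
    L = np.exp(-(V - V.T) ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

def direction_stat(cause, effect):
    """Fit effect = poly3(cause) + residual; score residual vs. cause."""
    residual = effect - np.polyval(np.polyfit(cause, effect, 3), cause)
    return dep_stat(cause, residual)

def build_edges(data):
    """data: dict name -> samples. For each of the n(n-1)/2 pairs, keep
    the direction with the smaller statistic, as in (13.12)."""
    edges = []
    for i, j in combinations(sorted(data), 2):
        t_ij = direction_stat(data[i], data[j])   # test X_i -> X_j
        t_ji = direction_stat(data[j], data[i])   # test X_j -> X_i
        edges.append((i, j) if t_ij < t_ji else (j, i))
    return edges

rng = np.random.default_rng(2)
x1 = rng.normal(size=400)
x2 = x1 ** 3 + 0.3 * rng.uniform(-1, 1, size=400)  # X1 causes X2
edges = build_edges({"X1": x1, "X2": x2})
```

Because the true mechanism here is exactly cubic, the residual of the correct direction is (nearly) the independent noise, while the reverse regression leaves a structured residual, so the smaller statistic picks out the true edge.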

13.1.3 Network Parameter Learning

The multivariate causality model provides a framework similar to the Bayesian network for finding the internal structure of complex systems. Its graphical structure expresses the causal interactions and direct/indirect relations as a probabilistic network, and its parameters represent the intensity of the complex inter-relationships among the cause-effect variables.

Consider a finite set \(\boldsymbol{U}=\{\boldsymbol{X}_1,\ldots ,\boldsymbol{X}_{n}\}\) of discrete random variables, where each variable \(\boldsymbol{X}_{i}\) may take on several discrete states from a finite set. A Bayesian network is an annotated directed acyclic graph that encodes a joint probability distribution over \(\boldsymbol{U}\). Formally, the Bayesian network for \(\boldsymbol{U}\) is constructed as a pair \(\boldsymbol{B}=<\boldsymbol{G},\boldsymbol{\varTheta }>\). \(\boldsymbol{G}\) is a directed acyclic graph whose vertices correspond to the random variables \(\boldsymbol{X}_1,\ldots ,\boldsymbol{X}_{n}\). \(\boldsymbol{\varTheta }\) is the parameter set that quantifies the network, with \(\theta _{ijk}=p(x_i^k|pa_i^j)\) and \(\sum _k\theta _{ijk}=1\), where \(x_i^k\) is the kth discrete state of \(\boldsymbol{X}_{i}\) and \(pa_i^j\) is the jth configuration of the complete parent set \(\boldsymbol{PA}_{i}\) of \(\boldsymbol{X}_{i}\) in \(\boldsymbol{G}\). Every variable \(\boldsymbol{X}_{i}\) is conditionally independent of its non-descendants given its parents (Markov condition). The joint probability distribution over the set \(\boldsymbol{U}\) is

$$\begin{aligned} \begin{aligned} P_B (\boldsymbol{X}_1,\ldots ,\boldsymbol{X}_{n} )=\prod _{i=1}^nP_B(\boldsymbol{X}_{i}|\boldsymbol{PA}_{i})=\prod _{i=1}^n\theta _{\boldsymbol{X}_{i}|\boldsymbol{PA}_{i}}. \end{aligned} \end{aligned}$$
(13.13)

The parameters of the causality Bayesian network are mainly learned from statistical analysis of the sample data. The maximum likelihood estimation (MLE) method is one of the most classical and effective algorithms for parameter learning.

Given a data set \(D=\{D_1,\ldots ,D_N\}\) covering all Bayesian network nodes, the goal of parameter learning is to find the most probable values of \(\boldsymbol{\varTheta }\). These values best explain the data set D, which is quantified by the log-likelihood function \(\log p{(D|\theta )}\), denoted \(L_D(\theta )\). Assume that all samples are drawn independently from the underlying distribution. According to the conditional independence assumptions, we have

$$\begin{aligned} \begin{aligned} L_D(\theta )=\log \prod _{i=1}^n\prod _{j=1}^{q_i}\prod _{k=1}^{r_i}\theta _{ijk}^{n_{ijk}}, \end{aligned} \end{aligned}$$
(13.14)

where \(q_i\) is the number of configurations of the parent nodes \(pa_i^j\) and \(r_i\) is the number of states of node \(\boldsymbol{X_i}\). \(n_{ijk}\) indicates how many samples of D contain both \(x_i^k\) and \(pa_i^j\). If the data set D is complete, the MLE method can be described as a constrained optimization problem,

$$\begin{aligned} \begin{aligned}&\max L_D(\theta ),\\&\mathrm {s.t.}\; g_{ij}(\theta )=\sum _{k=1}^{r_i}\theta _{ijk}-1=0,\quad \forall i=1,\ldots ,n,\; \forall j=1,\ldots ,q_i. \end{aligned} \end{aligned}$$
(13.15)

Its global optimum solution is

$$\begin{aligned} \begin{aligned} \theta _{ijk}=\frac{n_{ijk}}{n_{ij}}, \end{aligned} \end{aligned}$$
(13.16)

where \(n_{ij}=\sum _{k=1,\ldots ,r_i}n_{ijk}\).
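The closed-form solution (13.16) is just conditional frequency counting. A minimal sketch, using a hypothetical two-state child/parent pair, is:

```python
import numpy as np

def mle_cpt(child, parent, n_child_states, n_parent_states):
    """MLE of P(child | parent) from complete discrete data (0-based
    state labels): theta_ijk = n_ijk / n_ij, as in (13.16)."""
    counts = np.zeros((n_parent_states, n_child_states))
    for c, p in zip(child, parent):
        counts[p, c] += 1                       # n_ijk
    n_ij = counts.sum(axis=1, keepdims=True)    # n_ij = sum_k n_ijk
    return counts / np.where(n_ij == 0, 1, n_ij)

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
child  = np.array([0, 0, 1, 1, 1, 1, 0, 1])
cpt = mle_cpt(child, parent, 2, 2)   # each row is P(child | parent=j)
```

Each row of the resulting table is a conditional distribution and sums to one; parent configurations never observed in the data are left uniform-free (all zeros) rather than dividing by zero.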

13.2 BCN-Based Fault Detection and Inference

The complete monitoring model is established by combining the multivariate causal structure with Bayesian parameter learning, revealing the qualitative and quantitative relationships among the process variables to the greatest extent. This model can then be used in the forward direction to accurately predict the operation status and detect faults of the critical process variables (forward inference). Similarly, it can be used in the backward direction to find the source of the faults (backward inference). The overall block diagram of the proposed method is shown in Fig. 13.3.

Fig. 13.3
figure 3

Overall design block diagram

Causality network prediction or inference calculates the probability of the hypothesis variables being in a certain status according to the network topology and the conditional probability distribution of the evidence variables. An inference, or query, \(P(\boldsymbol{Q}=q|\boldsymbol{E}=e_0)\) calculates the posterior probability of a query variable \(\boldsymbol{Q}\) taking a specific value q given the evidence \(e_0\) for node \(\boldsymbol{E}\).

There are many existing network inference algorithms, such as the variable elimination algorithm and the junction tree (JT) algorithm. These algorithms exploit the hypothesis variables and the specific independence relations induced by the evidence in the BN to simplify the updating task. JT implements the inference procedure in four steps (Borsotto et al. 2006):

  1. Cluster the nodes into several cliques;

  2. Connect the cliques to form a junction tree;

  3. Propagate information in the network;

  4. Answer a query.

The inference starts from a root clique. The core of message propagation consists of a collection phase and a distribution phase. The cliques of the junction tree are connected by separators such that the so-called junction tree property holds. When a message is passed from one clique \(\boldsymbol{X}\) to another clique \(\boldsymbol{Y}\), it is mediated by the separator set \(\boldsymbol{S}\) between the two cliques. Every conditional probability distribution of the original BN is associated with a clique such that the domain of the distribution is a subset of the clique domain (the notation \(dom(\boldsymbol{\phi })\) refers to the domain of a potential \(\boldsymbol{\phi }\)). In standard junction tree architectures, the set of distributions \(\boldsymbol{\phi }_{X}\) associated with a clique \(\boldsymbol{X}\) is combined to form the initial clique potential

$$\begin{aligned} \begin{aligned} \boldsymbol{\phi }_{X}=\prod _{\boldsymbol{\phi }\in \boldsymbol{\phi }_{X}}\boldsymbol{\phi }. \end{aligned} \end{aligned}$$
(13.17)

For a clique, a potential or a message is a mapping from the value assignments of the nodes to the interval [0, 1]. A message pass from \(\boldsymbol{X}\) to \(\boldsymbol{Y}\) consists of two procedures, projection and absorption, based on the Hugin architecture (Jensen et al. 1990). The projection procedure saves the current potential and assigns a new one to \(\boldsymbol{S}\):

$$\begin{aligned} \begin{aligned} \boldsymbol{\phi }_S^{old}\leftarrow \boldsymbol{\phi }_{S}, \text {and}\,\boldsymbol{\phi }_{S}\leftarrow \sum _{\boldsymbol{X}\backslash {S}}\boldsymbol{\phi }_{X}. \end{aligned} \end{aligned}$$
(13.18)

The absorption procedure assigns a new potential to \(\boldsymbol{Y}\) using both the old and the new tables of \(\boldsymbol{S}\),

$$\begin{aligned} \begin{aligned} \boldsymbol{\phi }_{Y}\leftarrow \boldsymbol{\phi }_{Y}\frac{\boldsymbol{\phi }_{S}}{\boldsymbol{\phi }_S^{old}}, \end{aligned} \end{aligned}$$
(13.19)

where \(\boldsymbol{\phi }_{S}\) is the current separator potential, \(\boldsymbol{\phi }_S^{old}\) is the old separator potential, \(\boldsymbol{\phi }_{X}\) is the clique potential for \(\boldsymbol{X}\), \(\boldsymbol{\phi }_{Y}\) is the clique potential for \(\boldsymbol{Y}\).

The query answering step has two procedures. First, the marginalization procedure calculates the joint probability of \(\boldsymbol{Q}\) and \(\boldsymbol{E}=e_0\): \(P(\boldsymbol{Q},\boldsymbol{E}=e_0 )=\sum _{\boldsymbol{X}\backslash \{\boldsymbol{Q}\}}\boldsymbol{\phi }_{X}\). Second, the normalization procedure calculates the inference result,

$$\begin{aligned} \begin{aligned} P(\boldsymbol{Q}=q|\boldsymbol{E}=e_0)=\frac{P(\boldsymbol{Q}=q,\boldsymbol{E}=e_0)}{\sum _{\boldsymbol{Q}}P(\boldsymbol{Q},\boldsymbol{E}=e_0)}. \end{aligned} \end{aligned}$$
(13.20)
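For a small network, the marginalization and normalization steps of (13.20) can be checked directly by brute-force enumeration over the joint distribution (13.13). The chain A → B → C and its CPTs below are hypothetical; a junction tree would return the same posterior far more efficiently on large networks:

```python
import itertools
import numpy as np

p_a = np.array([0.6, 0.4])                    # P(A)
p_b_a = np.array([[0.8, 0.2], [0.3, 0.7]])    # P(B|A)
p_c_b = np.array([[0.9, 0.1], [0.4, 0.6]])    # P(C|B)

def posterior_c_given_a(a_obs):
    """P(C | A=a_obs): sum the joint over the hidden variable B,
    then normalize, as in (13.20)."""
    unnorm = np.zeros(2)
    for b, c in itertools.product(range(2), range(2)):
        unnorm[c] += p_a[a_obs] * p_b_a[a_obs, b] * p_c_b[b, c]
    return unnorm / unnorm.sum()

post = posterior_c_given_a(0)   # P(C | A=0)
```

The evidence factor \(P(A=a_0)\) cancels in the normalization, exactly as in (13.20); here the posterior evaluates to \(P(C|A=0)=[0.8, 0.2]\).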

A fault in the operational variables is an intervention that has various effects on the production process. The main task of fault detection is to predict the system output and detect whether a fault occurs; the object of causal inference is to find the real root cause under the faulty intervention.

13.3 Case Study

To evaluate the performance of the proposed method, the experimental results are reported from three aspects: causal direction identification of multiple variables, network parameter learning, and probability inference.

13.3.1 Public Data Sets Experiment

Four published data sets proposed by Mooij and Janzing (Leoand et al. 2001) are used to test the effectiveness of the nonlinear multivariate causal model. The cause-effect pairs are available at http://webdav.tuebingen.mpg.de/cause-effect/, which is considered a benchmark for testing causal detection algorithms. The four data sets are (1) the ground altitude and temperature sampled at 349 stations in the US; (2) the census income data set, which contains weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau, with the variables age and wage per hour; (3) the attribute information (age and heart rate) from the Cardiac Arrhythmia database; and (4) the population with sustainable access to improved drinking water sources (%, total) and the infant mortality rate (per 1000 live births, both sexes) in 2006. Each data set consists of two random variables whose cause-effect relationship is known. The four data sets have different attributes, which is sufficient to show the generality and diversity of the data.

Figure 13.4 gives the scatter plots of the selected data sets (1)–(4). Table 13.1 shows the statistics of the independence test on \(\boldsymbol{x}\) and \(\boldsymbol{y}\) for data sets (1)–(4) under different assumptions of the causal direction. The statistics are calculated separately under each assumption. Comparing the test statistics under the two assumptions in Table 13.1, the causal direction of each data set is determined as \(\boldsymbol{x}\rightarrow \boldsymbol{y}\). Table 13.2 summarizes the causal results obtained by the multivariate causality model. The test results are consistent with the real causal relationships, so we can conclude that the proposed method correctly identifies the causal direction regardless of the diversity of the data.

Fig. 13.4
figure 4

Scatter plots of four data sets, a–d corresponding to data sets (1)–(4), respectively

Table 13.1 Independence test statistics under different assumption of causal directions
Table 13.2 Causal results of the public data sets

13.3.2 TE Process Experiment

To illustrate the applicability of the proposed method to an actual complex industrial process, the network topology of the TE process is established and used to predict the alarm variables. The TE platform simulates an actual chemical process; a detailed description of the TE process is given in Chap. 4.

Experiment 1: Build Causal Structure

In this experiment, eight important process variables are selected for causality calculation in order to facilitate visualization of the results. From the mechanism analysis of the TE process, it is known that when the reactor feed \(\boldsymbol{X}_2\) increases, the material first enters the reactor, so the reactor level \(\boldsymbol{X}_4\) must increase; hence the reactor feed \(\boldsymbol{X}_2\) directly affects the reactor level \(\boldsymbol{X}_4\). The cooling water temperature \(\boldsymbol{X}_8\) and the reactor feed \(\boldsymbol{X}_2\) are the main factors affecting the reactor temperature \(\boldsymbol{X}_5\). The reactor pressure \(\boldsymbol{X}_3\) changes synchronously with the reactor temperature \(\boldsymbol{X}_5\) according to general physical principles. In addition, once the chemical reaction in the reactor becomes more intense, the compressor power \(\boldsymbol{X}_7\) strengthens synchronously due to the sequential loop. At the same time, the reactor pressure \(\boldsymbol{X}_3\) also has an obvious influence on the recycle flow \(\boldsymbol{X}_1\) and the material level \(\boldsymbol{X}_6\) in the separator. The initial structure of the causality network, named Bnet0 and shown in Fig. 13.5, is built based on this mechanism analysis (including expert prior knowledge and intuitive correlation analysis of the process variables).

Fig. 13.5
figure 5

The network Bnet0 from the mechanism analysis

The pre-defined fault is random variation in the A, B, and C compositions in stream 4. The corresponding data of the eight variables are collected from the simulation platform. The reaction length is 700 h to ensure that the data sufficiently reflect the system process, and 500 samples are obtained after equal-interval downsampling. The causal direction of the paired variables is shown in Table 13.3. Three different causality models are compared: (1) Bnet1, the proposed multivariate post-nonlinear acyclic causal model, shown in Fig. 13.6a; (2) Bnet2, an alternative network obtained from the traditional BN structure learning method (the K2 algorithm, which needs a preset node order), shown in Fig. 13.6b; and (3) Bnet3, the network structure learned with the expectation maximization (EM) algorithm, shown in Fig. 13.6c.

Table 13.3 Causal direction of TE variables
Fig. 13.6
figure 6

Network comparison: a Bnet1, b Bnet2, c Bnet3

Comparing the mechanism-based structure Bnet0 with Bnet1 determined by the proposed Bayesian Causal Network, it is found that Bnet1 is exactly consistent with Bnet0. The structure determined by the proposed method exactly matches the mechanism and expert knowledge, which indicates that the causal structure is credible and accurate. However, Bnet2 and Bnet3, learned with the traditional BN methods, are not consistent with the mechanism and show a large gap from the actual physical relationships. This demonstrates that general BN learning methods fail when applied to complex nonlinear systems, whereas the proposed multivariate causality model proves its superiority.

Experiment 2: Parameter Learning Once the TE network structure is determined, the alarm prediction model can be obtained by parameter learning on this causality structure. In general, a process alarm event can be divided into five alarm levels, namely low-low alarm (LL), low alarm (L), normal (N), high alarm (H), and high-high alarm (HH), corresponding to the numbers 1–5. The first step is to discretize the continuous variables into the five alarm levels by setting different thresholds, as shown in Table 13.4.

Table 13.4 Threshold setting for alarm status in different variables
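The discretization step can be sketched with `np.digitize`; the four threshold values below are hypothetical placeholders, not the actual entries of Table 13.4:

```python
import numpy as np

# Four (hypothetical) thresholds split a continuous measurement into the
# five alarm levels 1 (LL), 2 (L), 3 (N), 4 (H), 5 (HH).
thresholds = np.array([-2.0, -1.0, 1.0, 2.0])
samples = np.array([-2.5, -1.2, 0.3, 1.5, 2.7])
levels = np.digitize(samples, thresholds) + 1   # bin index 0..4 -> level 1..5
```

In practice each process variable would use its own threshold row from Table 13.4, producing the discrete status data that the MLE parameter learning consumes.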

Here the MLE algorithm is adopted to learn the network parameters and obtain a complete probability table. Suppose that the initial probability of each alarm level in the normal condition is theoretically equal. The conditional probability values of all variables are then calculated based on BN parameter learning. For the two root nodes \(\boldsymbol{X}_2\) and \(\boldsymbol{X}_8\), the corresponding probabilities of the five statuses are 0.0843, 0.2211, 0.4704, 0.2026, and 0.0217, respectively. The probabilities of the other descendant variables are shown in Fig. 13.7. A heat map is used to show the probabilities, since the precise values carry little meaning for alarm prediction and inference; the color represents the probability in the range between 0 and 1.

Fig. 13.7
figure 7

Conditional probability of the descendant variables: a \(P(\boldsymbol{X}_5|\boldsymbol{X}_8,\boldsymbol{X}_2)\), b \(P(\boldsymbol{X}_4|\boldsymbol{X}_3)\), c \(P(\boldsymbol{X}_3|\boldsymbol{X}_5)\), d \(P(\boldsymbol{X}_7|\boldsymbol{X}_5)\), e \(P(\boldsymbol{X}_1|\boldsymbol{X}_3)\), f \(P(\boldsymbol{X}_6|\boldsymbol{X}_3)\)

Attention should be paid to probability values close to 1, since these are the key points in determining the inference results. When the probability is less than 0.5, the corresponding situation is unlikely to appear in the actual inference. Figure 13.7a shows the probability of \(\boldsymbol{X}_5\) under the combined action of \(\boldsymbol{X}_2\) and \(\boldsymbol{X}_8\). The abscissa is the status combination of \(\boldsymbol{X}_8\) and \(\boldsymbol{X}_2\), and the ordinate is the probability of each of the five alarm statuses of \(\boldsymbol{X}_5\), displayed in the corresponding color. \(P(\boldsymbol{X}_5=1|\boldsymbol{X}_8=1,2\; \text {and}\; \boldsymbol{X}_2=1) \approx 1\) in the lower left corner of Fig. 13.7a, which means that \(\boldsymbol{X}_5\) is in the low-low alarm status with a probability close to 1 when \(\boldsymbol{X}_2\) and \(\boldsymbol{X}_8\) are in the low-low alarm status. \(P(\boldsymbol{X}_5=5|\boldsymbol{X}_8=4,5\; \text {and} \;\boldsymbol{X}_2=5) \approx 1\) in the upper right corner of Fig. 13.7a, which means that \(\boldsymbol{X}_5\) is in the high-high alarm status with a probability close to 1 when \(\boldsymbol{X}_2\) and \(\boldsymbol{X}_8\) are in the high-high alarm status. These inference results are consistent with the actual mechanism.

Figure 13.7b–e reflects the probability relationships between pairs of variables. Figure 13.7b shows the probability of \(\boldsymbol{X}_4\) under the action of \(\boldsymbol{X}_3\). \(P(\boldsymbol{X}_4=5|\boldsymbol{X}_3=5)\approx 1\) in the upper right corner, meaning that \(\boldsymbol{X}_4\) enters the high-high alarm status with probability close to 1 when \(\boldsymbol{X}_3\) is in the high-high alarm status. In contrast, \(P(\boldsymbol{X}_4=1|\boldsymbol{X}_3=5)=0\) in the lower right corner, meaning that \(\boldsymbol{X}_4\) enters the low-low alarm status with probability close to 0 when \(\boldsymbol{X}_3\) is in the high-high alarm status. \(P(\boldsymbol{X}_4=1|\boldsymbol{X}_3=2)\approx P(\boldsymbol{X}_4=2|\boldsymbol{X}_3=2)\approx 0.5\) in the green area, meaning that \(\boldsymbol{X}_4\) enters the low-low or low alarm status with almost the same probability when \(\boldsymbol{X}_3\) is in the low alarm status. Similarly, the inference results obtained from Fig. 13.7c–e are consistent with the mechanism.

Experiment 2: Alarm Prediction Alarm prediction is a top-down inference that draws conclusions from the observed evidence. The probabilistic analysis calculates the likelihood of each status that the result variable may take, and the discrete status corresponding to the maximum probability is taken as the alarm prediction result.
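The top-down inference step can be sketched as follows: given the observed statuses of the parent variables, read out the child's conditional distribution from the learned table and take the argmax. The CPT values below are hypothetical, chosen only to illustrate the mechanics:

```python
import numpy as np

def predict_alarm(cpt, parent_status):
    """Top-down inference: given the observed statuses of the parent
    variables (coded 1..5), return the child status with the maximum
    conditional probability, together with that probability."""
    dist = cpt[tuple(s - 1 for s in parent_status)]
    best = int(dist.argmax())
    return best + 1, float(dist[best])

# Hypothetical CPT for a child with one parent
# (rows: parent status 1..5, columns: child status 1..5).
cpt = np.array([
    [0.80, 0.15, 0.05, 0.00, 0.00],
    [0.30, 0.45, 0.20, 0.05, 0.00],
    [0.05, 0.15, 0.60, 0.15, 0.05],
    [0.00, 0.05, 0.20, 0.45, 0.30],
    [0.00, 0.00, 0.05, 0.15, 0.80],
])
status, prob = predict_alarm(cpt, (3,))
print(status, prob)  # → 3 0.6
```

The returned maximum probability doubles as a confidence measure for the prediction, which is how the results in Table 13.5 can be assessed.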

Using the established multivariate causality network model, the compressor work \(X_7\) is predicted when its parent variables \(\boldsymbol{X}_2\), \(\boldsymbol{X}_8\) and \(\boldsymbol{X}_5\) are known. The prediction results for model Bnet1 are shown in Table 13.5, where \(\hat{\boldsymbol{X}}_{\boldsymbol{7}}\) is the predicted value of \(\boldsymbol{X_7}\).

Table 13.5 Alarm level prediction of compressor work \(\boldsymbol{X_7}\)

The total prediction accuracy over the 20 simulation experiments is 75%. When the maximum probability of the predicted value is greater than 0.5, the prediction result is confident; moreover, the predictions with a high probability are consistent with the true status. When the maximum probability of the predicted value is less than 0.5, the prediction result is neither reliable nor accurate. The mis-predictions confuse adjacent statuses, such as the normal status and the neighboring low alarm (or high alarm) status. The simulation results show that the multivariate causality network can find the intrinsic relationships among the various process variables and give accurate fault or alarm predictions.
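The evaluation described above separates confident predictions (maximum posterior probability above 0.5) from unreliable ones. A minimal sketch of that bookkeeping, with hypothetical statuses and probabilities standing in for the 20 experimental runs:

```python
import numpy as np

def evaluate_predictions(true_status, pred_status, pred_prob, threshold=0.5):
    """Overall accuracy, plus accuracy split by prediction confidence:
    predictions whose maximum posterior probability falls below
    `threshold` are treated as low-confidence."""
    true_status = np.asarray(true_status)
    pred_status = np.asarray(pred_status)
    pred_prob = np.asarray(pred_prob)
    correct = pred_status == true_status
    confident = pred_prob >= threshold
    acc = float(correct.mean())
    acc_conf = float(correct[confident].mean()) if confident.any() else float("nan")
    acc_unconf = float(correct[~confident].mean()) if (~confident).any() else float("nan")
    return acc, acc_conf, acc_unconf

# Hypothetical example values (not the chapter's experimental data).
true_s = [3, 3, 2, 4, 3]
pred_s = [3, 3, 3, 4, 2]
probs  = [0.9, 0.8, 0.45, 0.7, 0.4]
print(evaluate_predictions(true_s, pred_s, probs))  # → (0.6, 1.0, 0.0)
```

In this toy run all confident predictions are correct and all low-confidence ones are wrong, mirroring the pattern reported for the TE experiments.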

13.4 Conclusions

This chapter proposes a multivariate causality model to analyze the causal directions among multiple variables and finally determine the network topology. The proposed method describes the system structure more accurately than the traditional BN structure learning methods, especially when the industrial process is highly complex. Combined with network parameter learning and evidence inference techniques, accurate monitoring and alarm prediction can be performed. The validity of the proposed method is verified on a public data set and the TE process. A compact network structure and confident alarm predictions are obtained for the TE process based on the causal analysis and probability inference. Both the methodology and the simulation results show that the proposed multivariate causality model has great value for process industry modeling and monitoring.

There are some issues worth further discussion. The computational efficiency of the proposed multivariate post-nonlinear acyclic causal modeling method should be considered when solving large-scale causal analysis problems in the real world. Developing efficient algorithms to find the causal relationships of multiple variables based on general functional causal models is still an important topic. To improve computational efficiency, a feasible solution is to limit the complexity of the causal structure, for example by decreasing the number of direct causes of each variable. Moreover, a smart optimization procedure instead of exhaustive search should be considered further.