1 Introduction

Early detection and localization of process faults, while the industrial system is still operating in a controllable region, can prevent the progression of abnormal events and reduce productivity loss. The most straightforward solution is to integrate an automated diagnostic procedure into the supervisory system, so as to provide operators with recovery decisions and protective measures, particularly in emergency cases. However, when faced with large-scale applications, fault diagnosis approaches that employ quantitative modeling and diagnostic reasoning usually suffer from high computational overheads and adapt poorly to operational instabilities and configuration changes[1]. Robustness to uncertainty is a common challenge for fault diagnosis technology. As efforts in this direction, various formalisms have been investigated, including the Dempster-Shafer theory[2], fuzzy logic[3, 4], support vector machines[5, 6], neuro-fuzzy networks[7], hidden Markov models[8, 9], etc. During the last two decades, a gradual shift towards probability theory as the foundation of many works was mainly due to the impact, both theoretical and practical, of the introduction of Bayesian networks (BNs)[10–12] and the related probabilistic graphical models[13, 14] into this field. BNs offer a powerful framework for modeling uncertain interactions among variables. As is well known, the number of parameters needed in a conditional probability table of a BN is exponential in the number of parent variables and states involved, and probabilistic reasoning in BNs is NP-hard[15, 16]. The complexity becomes particularly problematic for large and multi-connected networks[17].
Therefore, studies on efficient inference algorithms and on structure compilation or conversion for BNs have emerged, such as variable elimination[18], recursive conditioning algorithms[19], the enhanced qualitative probabilistic network[20], the decomposable negation normal form[21], multiplicative factorization for the noisy-MAX[22], the weighted CNF encoding algorithm[23], the qualitative characterization method of ICI models[24], etc.

However, challenges also lie in the insufficiency of historical fault data. Many current approaches rely on learning for structure selection or parameter estimation. A suitable set of training cases is essential, one that well represents the phenomena to be investigated and covers as many failure modes as possible[25]. What is undeniable is that both components of structure learning, the scoring function and the search procedure, are considerably complicated when data are incomplete[13]. This is particularly true for costly systems with stringent reliability requirements, such as nuclear power equipment and spacecraft. High reliability and the consequent low failure rate inevitably mean that the available failure statistics are scarce, scattered and random, which undermines the rationality of learning methods. Some major accidents, even though they have never happened before, cannot be definitively excluded from consideration. Moreover, individual systems usually differ from each other in many aspects, including structural design, equipment type, environment and system configuration such as parameter fluctuation intervals, alarm thresholds, protection and trigger mechanisms, etc. Although using common data might be a reluctant choice, strictly speaking, the underlying failure modes of a specific system may never be universal.

Another problem arises from the interpretability or comprehensibility of the conclusions and inference processes formulated by diagnostic approaches. People are unwilling to adopt both a new technology and a new decision strategy that might modify their established routines. Therefore, it is preferable that a diagnostic reasoning approach can intuitively clarify how its conclusions are drawn and why these conclusions are appropriate. Only then can operators be expected to accept, approve of and follow its suggestions. Indeed, many current methods lack such an explanation facility, which hinders their engineering applicability, especially in scenarios with inevitably incomplete or erroneous evidence[26].

All these problems can be summarized as a synthesis issue in developing a fault diagnosis method for subsystems of a nuclear power plant: Suppose that there are hundreds or even thousands of observable variables and hundreds of failure modes; the domain knowledge is imperfect, since there are insufficient historical fault data for scenarios involving some rare malfunctions; the calculation efficiency must meet the timeliness requirement of online operational maintenance; the diagnosis and prediction should be meaningful, accessible and reliable; and the safety implications of nuclear reactor operations impose stringent requirements on decision-making accuracy. To solve these problems, this paper describes a method for complex system modeling and fault diagnosis based on the dynamic uncertain causality graph (DUCG)[27], pioneered by Prof. Qin Zhang as a new attempt at uncertain knowledge representation and probabilistic inference.

The remainder of this paper is organized as follows. Section 2 introduces the principal concepts of DUCG and Section 3 analyzes the diagnostic inference algorithm. In Section 4, the diagnostic reasoning cases involving two groups of synthetic failures are put forward with elaborate calculations. Section 5 presents the results of verification experiments with real industrial fault data. Section 6 discusses the theoretical and practical significance of this method.

2 Introduction to DUCG: Concepts and terms

The fundamental theory of DUCG was preliminarily defined by Zhang[27]. DUCG aims to represent uncertain causal knowledge compactly and intuitively, to provide efficient probabilistic reasoning, and to make the inference results explainable. The complex causalities among system components are explicitly represented in DUCG in the form of graphical symbols. The probability parameters of DUCG can be incomplete (only those of concern need to be specified), while exact probabilistic inference can still be made, which is a great convenience in engineering applications. As a preliminary, some concepts of DUCG are introduced here.

2.1 Graphical representation and causality definition of DUCG

For ease of understanding DUCG's modeling mechanism, we take Fig. 1 as a reference in this introduction. Fig. 1 will also be used as a diagnostic reasoning case in the following sections.

Fig. 1 An illustrative example of DUCG

We first introduce the variables of DUCG. As shown in Fig. 1, the ellipse-shaped variable “X” represents an observable event; variable “B” represents a basic or root event (the fault origin), which can be further classified into initiating events (in the shape of a square with a single border) and non-initiating events (in the shape of a square with a double border) within a process system, depending on whether or not it can independently trigger the system abnormality; the default cause of X n is defined as the diamond-shaped variable “D n ”, which usually denotes the unknown or unspecified cause of an event; the logic gate variable “G i ” represents complex combinational logic relationships among variables, e.g., the logic gate variables G 1, G 2 and G 3 in Fig. 1 (their logic gate specifications are listed in Table 1). In the variable state expression \({V_{i,{j_i}}}\), V ∈ {X, B, D, G}, the first subscript, i, indexes the variable, and the second subscript, j i , indexes the current state of variable V i . For an X type variable, state 0 denotes the normal state, and nonzero state values indicate different abnormal states: the odd numbers 1, 3, ⋯ respectively signify “mildly low”, “seriously low”, ⋯; the even numbers 2, 4, ⋯ respectively signify “mildly high”, “seriously high”, and so on. For simplicity, subscript “j i ” can be written as “j”, and symbol “V i,j ” can be abbreviated to “V ij ” where no confusion arises.

Table 1 The logic gate specifications of Fig. 1

The directed arc stands for the weighted functional variable, F n;i ≡ (r n;i /r n ) A n;i , which indicates the directed cause-effect relationship between the parent variable V i and the child variable X n . A n;i is an event matrix with A nk;ij as its elements, where k indexes the row and j indexes the column; A nk;ij represents the uncertain physical mechanism by which V ij causes X nk , with the parameter value a nk;ij quantifying the strength of the causal influence. (r n;i /r n ) is the weighting factor associated with A n;i , where r n;i > 0 is called the causal relationship intensity between V i and X n , \({r_n} \equiv \sum\limits_i {{r_{n;i}}}\). The cause of every variable state and the causal relationships between each pair of child-parent variables are all represented separately in DUCG. The essential concept underlying the causality definition of a variable is the “weighted event outspread”[27], which can be formalized as

$$\matrix{{{X_{nk}} = \sum\limits_i {{F_{nk;i}}{V_i}} = \sum\limits_i {({r_{n;i}}/{r_n})} \sum\limits_j {{A_{nk;ij}}{V_{ij}}} = } \cr {\sum\limits_i {\sum\limits_j {({r_{n;i}}/{r_n}){A_{nk;ij}}{V_{ij}}} }.} \cr }$$
(1)

The basic operators included in DUCG event expressions are “·” and “+”, which are respectively used to represent logic “AND” and the sum of multiple independent causal effects. In (1), for a specific child variable X n , the sum of all weighted functions from its parent variables V i governs X n ’s final probability distribution, and therefrom determines its current state. So the general query Pr \(\left\{ {{{\rm{X}}_{{\rm{nk}}}}|\mathop \cap \limits_i {V_{i{j_i}}}} \right\}\) on the causality graph can be calculated as (2), which is “Eq. (35)” in [27].

$$\Pr \left\{ {{X_{nk}}\mid \mathop \cap \limits_i {V_{i{j_i}}}} \right\} = \sum\limits_i {\left( {{r_{n;i}}/{r_n}} \right)} {a_{nk;i{j_i}}}.$$
(2)

For example, consider the causal relationships between X 5 and its parent variables {X 1, X 3}, which are illustrated in Fig. 1. Suppose that they are all three-state variables and the parameters of A 5;1 and A 5;3 are

r 5;1 = 1 and r 5;3 = 2, in which “−” indicates that the causalities associated with normal states are not of concern. Indeed, a normal state can be neither the cause nor the effect of an abnormal event, so its functional effect is equivalent to 0. A value of 0 for a nk;ij (e.g., a 5,2;1,2, a 5,1;3,1 and a 5,1;3,2) indicates that the causal relationship between V ij and X nk does not exist. With regard to the child event X 5,2, we obviously have r 5 = r 5;1 + r 5;3 = 1 + 2 = 3, (r 5;1/r 5) = (1/3), and (r 5;3/r 5) = (2/3).

According to (1) and (2), we get that X 5,2 = (r 5;1/r 5) A 5,2;1,1 + (r 5;3/r 5) A 5,2;3,1 + (r 5;3/r 5) A 5,2;3,2, and \(\Pr \left\{ {{X_{5,2}}|{X_{1,1}}{X_{3,1}}} \right\} = \left( {{r_{5;1}}/{r_5}} \right){a_{5,2;1,1}} + \left( {{r_{5;3}}/{r_5}} \right){a_{5,2;3,1}} = 0.667\). Likewise, for X 5,1 only X 1 behaves as its parent event because a 5,1;3,1 = a 5,1;3,2 = 0. So we have r 5 = r 5;1 = 1, (r 5;1/r 5) = 1, and X 5,1 = (r 5;1/r 5) A 5,1;1,1 + (r 5;1/r 5) A 5,1;1,2. Thus we get Pr {X 5,1 | X 1,1 X 3,1} = a 5,1;1,1 = 0.5.
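A minimal sketch of Eq. (2) in Python. The entries of A 5;1 and A 5;3 below are illustrative assumptions: only a 5,1;1,1 = 0.5 and the zero entries are stated in the text, and the remaining values are chosen so that the worked numbers are reproduced.

```python
def pr_child_state(k, parent_states, r, a):
    """Pr{X_nk | states of parents} per Eq. (2).

    A parent whose a-row for state k is all zero is excluded, and the
    normalizing intensity r_n is recomputed over the remaining parents
    (this reproduces the per-state re-weighting used for X_{5,1}).
    """
    active = [i for i in parent_states
              if any(a[i][k][j] > 0 for j in range(len(a[i][k])))]
    r_n = sum(r[i] for i in active)
    return sum((r[i] / r_n) * a[i][k][parent_states[i]] for i in active)

# a[i][k][j]: parent index -> child state k (row) -> parent state j (column)
a = {
    1: [[0, 0,    0],      # normal child state: not of concern
        [0, 0.5,  0.2],    # a_{5,1;1,j}; a_{5,1;1,1} = 0.5 as in the text
        [0, 0.5,  0]],     # a_{5,2;1,j}; a_{5,2;1,2} = 0 as in the text
    3: [[0, 0,    0],
        [0, 0,    0],      # a_{5,1;3,1} = a_{5,1;3,2} = 0 as in the text
        [0, 0.75, 0.6]],   # a_{5,2;3,j}: assumed values
}
r = {1: 1, 3: 2}

p52 = pr_child_state(2, {1: 1, 3: 1}, r, a)  # Pr{X_{5,2} | X_{1,1} X_{3,1}}
p51 = pr_child_state(1, {1: 1, 3: 1}, r, a)  # Pr{X_{5,1} | X_{1,1} X_{3,1}}
print(round(p52, 3), p51)  # → 0.667 0.5
```

The per-state exclusion of parents with all-zero rows is what makes r 5 collapse from 3 to 1 when evaluating X 5,1.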

2.2 The chaining inference mechanisms

Benefiting from the equilibrium effect of the weighting factors, the “auto-normalization” property[27] of the variable state probability can be proven: \(\sum\nolimits_k {{X_{nk}}} = \sum\nolimits_k {\sum\nolimits_i {\left( {{r_{n;i}}/{r_n}} \right)} } \sum\nolimits_j {{A_{nk;ij}}{V_{ij}}} = 1\). This property always holds automatically, and no imposed normalization formula is needed, because of the definitions and facts: \({r_n} \equiv \sum\nolimits_i {{r_{n;i}}}, \,\sum\nolimits_i {\left( {{r_{n;i}}/{r_n}} \right)} = 1,\,\sum\nolimits_k {{A_{n,k;i,j}}} = 1\), and \(\sum\nolimits_j {{V_{i,j}}} = 1\). Therefore, for the calculation of Pr {X nk }, the values of \({A_{nk \prime; ij}}\) need not be known (k′ ≠ k), and \({\rm{Pr}}\left\{ {{X_{nk\prime}}} \right\}\) can be excluded from consideration too. Such an algorithm is characterized as “self-relied” inference, which achieves the sufficiency and separability desired for compact representation[28].
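The auto-normalization property can be checked numerically. The matrices below are arbitrary illustrative values (not parameters from Fig. 1) whose columns each sum to 1, as required by \(\sum\nolimits_k {{A_{nk;ij}}} = 1\):

```python
# A[i][k][j]: parent i causes child state k given parent state j.
# Each column (fixed i, j) sums to 1 over k; r is arbitrary and positive.
r = [2.0, 1.0]                       # r_{n;1}, r_{n;2}
A = [
    [[0.3, 0.1], [0.5, 0.2], [0.2, 0.7]],
    [[0.6, 0.2], [0.3, 0.3], [0.1, 0.5]],
]
r_n = sum(r)

# For every combination of parent states, the child's state probabilities
# sum to 1 with no explicit normalization step.
for j1 in range(2):
    for j2 in range(2):
        total = sum((r[0] / r_n) * A[0][k][j1] + (r[1] / r_n) * A[1][k][j2]
                    for k in range(3))
        assert abs(total - 1.0) < 1e-12
print("auto-normalization verified")
```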

Fundamentally, the usual causal process of diagnostic inference is: when significant deviations are detected, primary faults are hypothesized and the propagation pathways in the directed graph are analyzed to determine whether a candidate hypothesis (supposed to be a fault origin) can account for the current failures. For this purpose, with the “auto-normalization” and “self-relied” probabilistic reasoning features as a basis, DUCG resorts to the scheme of “chaining” inference[27] in the diagnosis process. Chaining inference independently outspreads an observed event into logic expressions in the form of disjoint “sums-of-products” composed of independent events B ij and A nk;ij associated with weighting factors (r n;i/r n). For each event, such a logical outspread follows all of the event's upstream causality chains and ancestors towards its root causes. For example, suppose that the upstream events of X 8,2 are {X 5,1, X 1,1, B 6,1, B 2,1} as shown in Fig. 1, and r 8;5 = r 8;6 = 1. The chaining logic outspread result of X 8,2 is then

$$\matrix{{{X_{8,2}} = {F_{8,2;6,1}}{B_{6,1}} + {F_{8,2;5,1}}{X_{5,1}} = } \cr {{F_{8,2;6,1}}{B_{6,1}} + {F_{8,2;5,1}}{F_{5,1;1,1}}{F_{1,1;2,1}}{B_{2,1}} = } \cr {\left( {1/2} \right){A_{8,2;6,1}}{B_{6,1}} + \left( {1/2} \right){A_{8,2;5,1}}{A_{5,1;1,1}}{A_{1,1;2,1}}{B_{2,1}}.} \cr }$$

As we can see, some of the parameters not involved in the causality chains of the target event can be absent, while the reasoning procedure is not affected at all. In other words, even though the parameters needed to specify a CPT are not given completely, we can still calculate the exact probability of the variable state of concern. Above all, chaining inference, together with the fact that exact inference can be performed with incomplete knowledge, brings significant convenience in knowledge base construction and probabilistic reasoning.
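Once an observed event has been outspread into a disjoint sum of products of independent events, its probability is evaluated by replacing each event with its probability. A sketch for the X 8,2 example above, with all a-parameter and prior values assumed purely for illustration:

```python
# Each term of the "sum-of-products" is a weighting factor times a chain of
# a-parameters ending in the prior probability b of a root (B type) event.
# All numeric values below are assumed for illustration only.
chains = [
    # (weight r_{8;i}/r_8, [a-parameters along the chain], Pr{B root})
    (0.5, [0.4], 0.01),              # (1/2) a_{8,2;6,1} b_{6,1}
    (0.5, [0.3, 0.5, 0.6], 0.002),   # (1/2) a_{8,2;5,1} a_{5,1;1,1} a_{1,1;2,1} b_{2,1}
]

def chain_probability(chains):
    """Sum the probabilities of disjoint causality-chain products."""
    total = 0.0
    for weight, a_params, b_prior in chains:
        term = weight * b_prior
        for a in a_params:
            term *= a
        total += term
    return total

print(chain_probability(chains))  # Pr{X_{8,2}} from the two causality chains
```

Because the terms are disjoint and the constituent events independent, no joint distribution over the omitted parameters is ever needed.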

3 The probabilistic reasoning method

This section introduces how to conduct exact and efficient probabilistic reasoning based on DUCG, under situations of incomplete representations and ubiquitous uncertainties. The fundamental probabilistic reasoning method is divided into four reasoning steps as follows.

  • Step 1. Causal simplification

    Simplification eliminates unlikely and meaningless causalities and variables from a graph according to the evidence received. Simplification is indispensable in the diagnostic inference of DUCG: whenever some causal dependencies are not supported by the evidence, they are removed from the graph and excluded from consideration, and the irrelevant parts are likewise eliminated. The 11 simplification rules of DUCG were initially presented in [27], whereby the causality graph can be significantly reduced in scale. Furthermore, the problem area is promptly focused since unnecessary details are excluded, and the calculation complexity may be decreased without any loss of accuracy. Sometimes, fault origins can even be determined in the early reasoning stage, before numerical calculation. Limited by space, we do not present the simplification rules here; readers can refer to [27] for details.

  • Step 2. Structure decomposition — the “divide and conquer” strategy

    As is well known, probabilistic risk/safety assessment (PRA/PSA, e.g., the “WASH-1400” report by N. C. Rasmussen et al.) has been generally followed in the safety assessment of modern nuclear power plants, aerospace systems and other fields. According to PRA analysis, during a process system's continuous and smooth operation, once the system state changes from normality to abnormality, the probability of more than one initiating event occurring concurrently is a small value of high order, compared with the concurrence probability of one initiating event with any number (including none) of non-initiating events[27]. This supports the assumption that the DUCG under abnormal conditions is decomposable without affecting the diagnostic reasoning accuracy. Specifically, by assuming different initiating events, a large and complex DUCG can be divided into a set of local diagnostic structures, which are collectively exhaustive and mutually exclusive. A sub-graph is valid (meaningful) if and only if it can account for all the abnormal evidence received, and only the initiating event in a valid sub-graph can serve as a candidate root cause of the abnormal observations. All meaningless sub-DUCGs are excluded from consideration during diagnostic reasoning. Obviously, the decomposition strategy accelerates the inference process.

  • Step 3. Weighted logic inference

    Before the numerical probabilistic calculation, weighted event outspread and weighted logic operations are first carried out on the observed evidence E[27] to get the hypothesis space of each meaningful sub-DUCG g (g = 1, 2, ⋯). The evidence \(E \equiv \mathop \cap \limits_i {E_i} = \mathop \cap \limits_i {V_{i,{j_i}}}\) is also called the “complete evidence”, in which each \({E_i} \equiv {V_{i,{j_i}}}\) is an observed evidence included in sub-DUCG g. Among them, we denote the abnormal evidence as E′ (namely, the incomplete evidence \(E\prime \equiv \mathop \cap \limits_i {E_i}^\prime \)) and the normal evidence as E″, so that E = E′E″. Let H k,j represent a candidate hypothesis that is a possible root cause of evidence E. The hypothesis space is then defined as \({S_{{H_g}}} = \{ {H_{k,j}}|{H_{k,j}} \in {\rm{sub}}-{\rm{DUCG}}g\} \), which includes all the candidate hypotheses on sub-DUCG g. Along with the event outspread, static logic cycles are broken, and ordinary logic operations, such as AND, OR, XOR, NOT, absorption, exclusion and complement, are conducted. Products containing mutually exclusive events are removed from each logic expression, and inclusive events within any product are absorbed. Eventually we get the final logic expressions of the evidence, in the form of “sum-of-products” of independent events as stated above. This process will be demonstrated in detail in Section 4 using diagnostic reasoning cases with synthetic failures.

    The subsequent probabilistic calculation is implemented on the basis of the logic event expressions. Such a reasoning procedure as a whole is referred to as the “two-phase algorithm”. The main motivation behind the logic operations is to remove redundant probabilistic calculations and lower the overall computation cost.

  • Step 4. Probability calculation

    Now the posterior probabilities of concern can be calculated simply by replacing the events in the logic expressions yielded by the logic operations with their probabilities. What deserves mention is that most current qualitative or deduction-based expert systems incorporate only abnormal process information into the diagnostic inference. Indeed, some abnormal symptoms that are expected to appear under a specific candidate hypothesis may not have been observed yet, which should naturally decrease our confidence in that hypothesis's failure interpretation. Including positive symptoms (normal variables) in the fault diagnosis process may therefore yield a more accurate diagnosis. DUCG takes normal observations into account by regarding them as negative evidence against a hypothesis. The incomplete evidence E′ is first employed to get an approximate state probability of H k,j , namely \(h_{k,j}^{s\prime}\); this result is then modified by supplementing the normal evidence E″, if any. Thus, the exact state probability for the complete evidence is obtained, namely \(h_{k,j}^s\). The calculation formulas are (3)–(5) below, and will be applied to and analyzed in the calculation cases of Section 4.

    $$h_{k,j}^{s^{\prime}} \equiv \Pr \left\{ {{H_{k,j}}\mid E^{\prime}} \right\} = {{\Pr \left\{ {{H_{k,j}}E^{\prime}} \right\}} \over {\Pr \left\{ {E^{\prime}} \right\}}}$$
    (3)
    $${h_{k,j}^s \equiv \Pr \left\{ {{H_{k,j}}\mid E} \right\} = {{\Pr \left\{ {{H_{k,j}}E} \right\}} \over {\Pr \left\{ E \right\}}} = {{\Pr \left\{ {{H_{k,j}}\mathop \cap \limits_i {V_{i,{j_i}}}} \right\}} \over {\Pr \left\{ {\mathop \cap \limits_i {V_{i,{j_i}}}} \right\}}}}$$
    (4)
    $$h_{k,j}^s = h_{k,j}^{s^{\prime}}\cdot{\sigma _{k,j}}.$$
    (5)

In (5), \({\sigma _{k,j}} = {{\Pr \left\{ {E\prime\prime| {H_{k,j}}E\prime} \right\}} \over {\Pr \left\{ {E\prime\prime| E\prime} \right\}}}\) is the modification factor relating the exact result to the approximate result.

According to the definition of state probability, the normalization \(\sum\nolimits_{{H_{k,j}} \in {S_{{H_g}}}} {h_{k,j}^s} = 1\) holds within each sub-DUCG g. If more than one meaningful sub-DUCG remains in concern (i.e., has not been eliminated as meaningless with respect to the abnormal evidence), the local state probability \(h_{k,j}^s\) is further modified by a weight associated with the prior probability of the evidence in each sub-DUCG g that contains the hypothesis H k,j , yielding a global state probability \(h_{k,j}^s\). This \(h_{k,j}^s\) quantifies our degree of confidence that the hypothesis is exactly the root cause of the current failures. Therefore, the most probable root cause is the one that maximizes the posterior hypothesis probability, i.e., the one that best explains the observed symptoms: \(\arg {\max _{{H_{k,j}}}}\left( {h_{k,j}^s} \right)\), where \({H_{k,j}} \in {S_H},\,{S_H} \equiv \mathop \cup \limits_g {S_{{H_g}}}\) is the complete hypothesis space comprising the root causes from all meaningful sub-DUCGs.
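As a toy illustration of (3)–(5), the sketch below computes approximate posteriors from assumed joint probabilities Pr{H k,j E′}, corrects them with an assumed modification factor σ, and selects the maximizing hypothesis. All numbers and hypothesis labels are invented for illustration, not taken from the cases of Section 4.

```python
# Pr{H_{k,j} E'} for each hypothesis in one sub-DUCG (assumed values).
pr_HE_prime = {"H1,1": 2e-7, "H1,2": 6e-7}

# Within a sub-DUCG, E' outspreads into a sum over the hypotheses,
# so Pr{E'} is the sum of the joint probabilities.
pr_E_prime = sum(pr_HE_prime.values())

# Eq. (3): approximate posterior from abnormal evidence only.
h_prime = {h: p / pr_E_prime for h, p in pr_HE_prime.items()}

# Eq. (5): correct with sigma = Pr{E''|H E'} / Pr{E''|E'} (assumed values;
# sigma = 1 when there is no normal evidence E'').
sigma = {"H1,1": 0.8, "H1,2": 1.05}
h = {k: h_prime[k] * sigma[k] for k in h_prime}

best = max(h, key=h.get)  # arg max over the hypothesis space
print(best)  # → H1,2
```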

4 Diagnostic reasoning cases with synthetic failures

In order to validate the effectiveness of the diagnostic inference algorithm and demonstrate the calculation details, we simulate two groups of failure observations on the model illustrated in Fig. 1. The specifications of the parameter matrices are listed in the Appendix, and all weighting factors are assumed to be r n;i = 1 for simplicity.

4.1 Diagnostic reasoning case 1

Suppose that the evidences received currently are

Abnormal evidences: \({{E\prime}_1} = {X_{3,1}};\,{{E\prime}_2} = {X_{4,1}};\,{{E\prime}_3} = {X_{6,2}};\,{{E\prime}_4} = {X_{11,2}};\,{{E\prime}_5} = {X_{13,1}};\,{{E\prime}_6} = {X_{14,1}}\)

Normal evidences: \({{E\prime\prime}_7} = {X_{22,0}}\).

Except for the above evidences, suppose that the observed signal X 14,1 appears earlier than X 6,2. Any X type variable not listed here implies an unconfirmed or unidentifiable signal, due to possible symptom losses, time delays, etc. By applying the simplification rules to Fig. 1, we get the simplified DUCG shown in Fig. 2. The colored ellipses graphically distinguish the symptom states of X type variables as a supplement to the numerical state tags: green indicates the normal state indexed by 0; sky blue indicates state 1; yellow and brown respectively indicate states 2 and 4; gray indicates one state of a binary variable (the color figures can be seen in the electronic version). Among them, the brown and gray ellipses will appear in Section 5. Any X type variable without color implies a state-unknown signal. We can speculate a state-unknown variable's only possible state by analyzing the parameters and evidences given; the ellipses with dashed lines denote these speculated states. For example, on account of the evidences \({{E\prime}_4} = {X_{11,2}},\,{{E\prime}_5} = {X_{13,1}}\) and the parameter a 13,1;11,2 = 0, only X 12,2 can serve as a possible cause of the abnormal event X 13,1; considering the parameter vector a 12,2;1 = (− 0.9 0), G 1,1 becomes the only explanation for X 12,2, hence both X 16,1 and X 7,1 can be inferred according to the logic expression G 1,1 = X 16,1 · X 7,1, and so on.

Fig. 2 The partially simplified DUCG of Case 1

Note that B 4 and B 6 are the initiating events that finally remain in Fig. 2, which can be further divided and simplified into two exclusive and exhaustive sub-DUCGs (Fig. 3), each containing one initiating event. X 22,0 has been eliminated from Fig. 3 (a) for its irrelevance to the hypothesis of concern (B 4); mathematically speaking, X 22,0 is not a negative evidence to B 4 (E″ = φ). As we can see, both Figs. 3 (a) and (b) cover all the abnormal evidences, hence both are meaningful in the context of causal interpretation. The causal graphs can vividly show users how the influence of a fault origin propagates through causal pathways and eventually results in the current abnormal conditions.
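The decomposition and validity check of Step 2, illustrated by Fig. 3, can be sketched as follows. The graph encoding, function names and toy structure are illustrative assumptions, not the paper's implementation:

```python
def reachable(graph, source):
    """All nodes reachable from `source` along directed arcs."""
    seen, stack = set(), [source]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

def meaningful_subgraphs(graph, initiating_events, abnormal_evidence):
    """Assume one initiating event at a time; keep only sub-graphs that
    can account for ALL abnormal observations."""
    valid = {}
    for b in initiating_events:
        covered = reachable(graph, b)
        if abnormal_evidence <= covered:   # set inclusion: explains everything
            valid[b] = covered
    return valid

# Toy structure: B1 can reach both symptoms, B2 reaches only one.
graph = {"B1": ["X1"], "X1": ["X2", "X3"], "B2": ["X3"]}
subs = meaningful_subgraphs(graph, ["B1", "B2"], {"X2", "X3"})
print(sorted(subs))  # → ['B1']  (B2 cannot explain X2, so it is discarded)
```

In Case 1, both sub-DUCGs of Fig. 3 survive this check; in Case 2 below, one of them is discarded exactly as the toy B2 is here.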

Fig. 3 The simplified sub-DUCGs of Case 1

4.1.1 Reasoning process for B 4 on Fig. 3 (a)

The event outspread operations are first performed on Fig. 3 (a) to generate the hypothesis space \({S_{{H_1}}}\). Within the following outspread expressions, the superscript {8} indicates the node where the causal cycle is broken; the solution to directed cyclic graphs (DCGs) will be presented in detail in another paper.

$$\matrix{{{{E'}_1} = {X_{3,1}} = {F_{3,1;4}}{B_4} + {F_{3,1;8}}{F_{8;5,2}}{F_{5,2;3,1}}{X_{3,1}} = } \cr {F_{3,1;4}^{\left\{ 8 \right\}}{B_4} = {A_{3,1;4}}{B_4}} \cr {{{E'}_2} = {X_{4,1}} = \left( {{F_{4,1;4}} + {F_{4,1;3,1}}F_{3,1;4}^{\left\{ 8 \right\}}} \right){B_4} = } \cr {{1 \over 2}\left( {{A_{4,1;4}} + {A_{4,1;3,1}}{A_{3,1;4}}} \right){B_4}} \cr {{{E'}_3} = {X_{6,2}} = {F_{6,2;4,1}}\left( {{F_{4,1;4}} + {F_{4,1;3,1}}F_{3,1;4}^{\left\{ 8 \right\}}} \right){B_4} = } \cr {{1 \over 2}{A_{6,2;4,1}}\left( {{A_{4,1;4}} + {A_{4,1;3,1}}{A_{3,1;4}}} \right){B_4}} \cr {{{E'}_4} = {X_{11,2}} = {F_{11,2;8}}{F_{8;5,2}}{F_{5,2;3,1}}F_{3,1;4}^{\left\{ 8 \right\}}{B_4} = } \cr {{A_{11,2;8}}{A_{8;5,2}}{A_{5,2;3,1}}{A_{3,1;4}}{B_4}} \cr }$$

Thus, we get that

$$\matrix{{E' = {{E'}_1}{{E'}_2}{{E'}_3}{{E'}_4}{{E'}_5}{{E'}_6} = } \cr {{1 \over 2}\left( {{A_{6,2;4,1}}{A_{13,1;12,2}}{A_{12,2;1,1}}{A_{16,1;16D}}{A_{7,1;3,1}}{A_{4,1;4,1}}} \right)\cdot} \cr {\left( {{A_{14,1;11,2}}{A_{11,2;8}}{A_{8;5,2}}{A_{5,2;3,1}}{A_{3,1;4,1}}} \right){B_{4,1}} + } \cr {{1 \over 2}\left( {{A_{6,2;4,1}}{A_{13,1;12,2}}{A_{12,2;1,1}}{A_{16,1;16D}}{A_{7,1;3,1}}{A_{4,1;4,2}}} \right)\cdot} \cr {\left( {{A_{14,1;11,2}}{A_{11,2;8}}{A_{8;5,2}}{A_{5,2;3,1}}{A_{3,1;4,2}}} \right){B_{4,2}} + } \cr {{1 \over 2}\left( {{A_{6,2;4,1}}{A_{13,1;12,2}}{A_{12,2;1,1}}{A_{16,1;16D}}{A_{7,1;3,1}}{A_{4,1;3,1}}} \right)\cdot} \cr {\left( {{A_{14,1;11,2}}{A_{11,2;8}}{A_{8;5,2}}{A_{5,2;3,1}}} \right){A_{3,1;4,1}}{B_{4,1}} + } \cr {{1 \over 2}\left( {{A_{6,2;4,1}}{A_{13,1;12,2}}{A_{12,2;1,1}}{A_{16,1;16D}}{A_{7,1;3,1}}{A_{4,1;3,1}}} \right)\cdot} \cr {\left( {{A_{14,1;11,2}}{A_{11,2;8}}{A_{8;5,2}}{A_{5,2;3,1}}} \right){A_{3,1;4,2}}{B_{4,2}}.} \cr }$$

So we get Pr {E′} = 0.0000005901984. By ignoring all A type events and weighting factors in the outspread expression of E′, we get the hypothesis space \({S_{{H_1}}} = \left\{ {{H_{1,1}},{H_{1,2}}} \right\} = \left\{ {{B_{4,1}},{B_{4,2}}} \right\}\), where H 1 ≡ B 4, H 1,1 ≡ B 4,1 and H 1,2 ≡ B 4,2. Now we turn to calculating Pr {H k,j E′}.

$$\matrix{{\Pr \left\{ {{H_{1,1}}E'} \right\} = } \cr {\Pr \left\{ {{1 \over 2}\left( {{A_{6,2;4,1}}{A_{13,1;12,2}}{A_{12,2;1,1}}{A_{16,1;16D}}{A_{7,1;3,1}}{A_{4,1;4,1}}} \right)\cdot} \right.} \cr {\left( {{A_{14,1;11,2}}{A_{11,2;8}}{A_{8;5,2}}{A_{5,2;3,1}}{A_{3,1;4,1}}} \right){B_{4,1}} + } \cr {{1 \over 2}\left( {{A_{6,2;4,1}}{A_{13,1;12,2}}{A_{12,2;1,1}}{A_{16,1;16D}}{A_{7,1;3,1}}{A_{4,1;3,1}}} \right)\cdot} \cr {\left( {{A_{14,1;11,2}}{A_{11,2;8}}{A_{8;5,2}}{A_{5,2;3,1}}} \right){A_{3,1;4,1}}{B_{4,1}}\} = } \cr {0.0000000653184.} \cr }$$

Likewise, we get Pr {H 1,2 E′} = 0.00000052488.

The state probabilities of \({H_{k,j}} \in {S_{{H_1}}}\) conditioned on incomplete evidence E′ are

$$\matrix{{h_{1,1}^{s'} = {{\Pr \left\{ {{H_{1,1}}E'} \right\}} \over {\Pr \left\{ {E'} \right\}}} = 0.110672} \cr {h_{1,2}^{s'} = {{\Pr \left\{ {{H_{1,2}}E'} \right\}} \over {\Pr \left\{ {E'} \right\}}} = 0.889328.} \cr }$$

Since E″ = φ, we get that Pr {E} = Pr {E′} = 0.0000005901984, and the modification factor σ k,j between \(h_{k,j}^s\) and \(h_{k,j}^{s\prime}\) is 1. Therefore, the local state probabilities conditioned on complete evidence are

$$\matrix{{h_{1,1}^s = h_{1,1}^{s'} = 0.110672} \cr {h_{1,2}^s = h_{1,2}^{s'} = 0.889328.} \cr }$$
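The local posteriors above can be checked with a few lines of arithmetic (all values taken from the text):

```python
# Joint and evidence probabilities from the Case 1 reasoning on Fig. 3 (a).
pr_E = 0.0000005901984
pr_H11_E = 0.0000000653184
pr_H12_E = 0.00000052488

# Pr{E'} is the sum of the joint probabilities over S_H1.
assert abs(pr_H11_E + pr_H12_E - pr_E) < 1e-15

h11 = pr_H11_E / pr_E   # Eq. (3) for H_{1,1}
h12 = pr_H12_E / pr_E   # Eq. (3) for H_{1,2}
print(round(h11, 6), round(h12, 6))  # → 0.110672 0.889328
```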

4.1.2 Reasoning process for B 6 on Fig. 3 (b)

Now we proceed to perform similar reasoning on Fig. 3 (b), with B 6 supposed to be the fault hypothesis. We get the outspread evidence as

$$\matrix{{E' = {A_{6,2;4,1}}{A_{13,1;12,2}}{A_{12,2;1,1}}{A_{16,1;16D}}{A_{7,1;3,1}}{A_{4,1;3,1}}} \cr {{A_{3,1;8,1}}{A_{14,1;11,2}}{A_{11,2;8,1}}\left( {{A_{8,1;6,1}}{B_{6,1}} + {A_{8,1;6,2}}{B_{6,2}}} \right).} \cr }$$

The hypothesis space is \({S_{{H_2}}} = \left\{ {{H_{2,1}},{H_{2,2}}} \right\} = \left\{ {{B_{6,1}},{B_{6,2}}} \right\}\), where H 2 ≡ B 6, H 2,1 ≡ B 6,1 and H 2,2 ≡ B 6,2. The state probabilities of \({H_{k,j}} \in {S_{{H_2}}}\) conditioned on E′ are \(h_{2,1}^{s\prime} = 0.0625\) and \(h_{2,2}^{s\prime} = 0.9375\). The normal evidence is E″ = X 22,0, which can be outspread as

Note that the parameter value a 22,0;8,1 is not given explicitly. As the quantification of the causality between X 22 and X 8, the value of a 22,0;8,1 can be inferred as \({a_{22,0;8,1}} = 1 - {a_{22,1;8,1}} - {a_{22,2;8,1}}\) according to the fact that \({X_{22,0}} = \overline {{X_{22,1}} + {X_{22,2}}} \).

Thus we get that Pr {E} = 0.0000001990656, and σ 2,1 = σ 2,2 = 1. The local state probabilities of \({H_{k,j}} \in {S_{{H_2}}}\) are

$$\matrix{{h_{2,1}^s = h_{2,1}^{s'}\cdot{\sigma _{2,1}} = 0.0625} \cr {h_{2,2}^s = h_{2,2}^{s'}\cdot{\sigma _{2,2}} = 0.9375.} \cr }$$

4.1.3 Reasoning result

By combining the candidate hypotheses from the two meaningful sub-DUCGs in Fig. 3, we get the complete hypothesis space \({S_H} = \mathop \cup \limits_g {S_{{H_g}}} = \left\{ {{B_{4,1}},{B_{4,2}},{B_{6,1}},{B_{6,2}}} \right\}\). The final global state probabilities of H k,j ∈ S H are obtained as \(h_{1,1}^s = 0.082758,\,h_{1,2}^s = 0.665022,\,h_{2,1}^s = 0.015764,\,h_{2,2}^s = 0.236456\).
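Under the weighting rule of Section 3, these global values can be reproduced (up to the rounding of the figures quoted in the text) by weighting each local posterior with the relative prior probability Pr{E} of its sub-DUCG. A minimal check, with all probabilities taken from the case above:

```python
# Evidence probabilities of the two meaningful sub-DUCGs (from the text).
pr_E = {"g1": 0.0000005901984, "g2": 0.0000001990656}

# Local state probabilities within each sub-DUCG (from the text).
local = {("g1", "B4,1"): 0.110672, ("g1", "B4,2"): 0.889328,
         ("g2", "B6,1"): 0.0625,   ("g2", "B6,2"): 0.9375}

# Global posterior: local value weighted by Pr{E_g} / sum_g Pr{E_g}.
total = sum(pr_E.values())
global_h = {h: p * pr_E[g] / total for (g, h), p in local.items()}

# The global values normalize over the complete hypothesis space.
assert abs(sum(global_h.values()) - 1.0) < 1e-9

best = max(global_h, key=global_h.get)
print(best)  # → B4,2
```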

As the conclusion of this diagnostic inference case, B 4,2, out of the candidate hypothesis space {B 4,1, B 4,2, B 6,1, B 6,2}, is identified as the most probable root cause of the current abnormalities, for it best interprets all the observations in an independent and self-contained manner.

4.2 Diagnostic reasoning Case 2

Now we proceed to calculate another diagnosis case. While the original DUCG (Fig. 1) and its parameters are kept unchanged, the observed values are partly replaced in this case to simulate a new scenario with a different underlying failure mode. Suppose that the evidences received are as follows.

Abnormal evidences: \({{E\prime}_1} = {X_{3,1}};\,{{E\prime}_2} = {X_{4,2}};\,{{E\prime}_3} = {X_{6,2}};\,{{E\prime}_4} = {X_{8,2}};\,{{E\prime}_5} = {X_{11,4}};\,{{E\prime}_6} = {X_{13,1}};\,{{E\prime}_7} = {X_{14,1}};\)

Normal evidences: \({{E\prime\prime}_8} = {X_{1,0}};\,{{E\prime\prime}_9} = {X_{2,0}};\,{{E\prime\prime}_{10}} = {X_{9,0}};\,{{E\prime\prime}_{11}} = {X_{15,0}};\,{{E\prime\prime}_{12}} = {X_{17,0}};\,{{E\prime\prime}_{13}} = {X_{18,0}};\,{{E\prime\prime}_{14}} = {X_{19,0}};\,{{E\prime\prime}_{15}} = {X_{20,0}};\,{{E\prime\prime}_{16}} = {X_{21,0}};\,{{E\prime\prime}_{17}} = {X_{22,0}}\)

After simplification and decomposition of Fig. 1, we get the two graphs in Fig. 4, each containing an initiating event. As we can see, Fig. 4 (a) is invalid because it contains no causal explanation for the abnormal evidences X 4,2 and X 6,2. Therefore, Fig. 4 (a) is discarded and Fig. 4 (b) becomes the only meaningful sub-DUCG.

Fig. 4 The simplified sub-DUCGs of Case 2

4.2.1 Reasoning for B 4 on Fig. 4 (b)

Based on Fig. 4 (b), we implement event outspread operations on evidence E′,

$$E' = \left( {{A_{6,2;4,2}}\left( {{A_{13,1;12,2}}{A_{12,2;1,1}}{A_{16,1;16D}}{A_{7,1;3,2}}} \right)} \right){A_{4,2;4,2}}\left( {{A_{14,1;11,4}}{A_{11,4;8,2}}{A_{8,2;5,2}}{A_{5,2;3,1}}{A_{3,1;4,2}}} \right){B_{4,2}}{B_{7,1}}.$$

The hypothesis space is \({S_{{H_1}}} = \left\{ {{H_{1,1}}} \right\} = \left\{ {{B_{4,2}}{B_{7,1}}} \right\}\), and Fig. 4 (b) is further simplified as Fig. 5. Only one hypothesis B 4,2 B 7,1 remains in the hypothesis space, so we can determine the diagnostic conclusion without any numerical calculation. In fact, because H 1,1 E′ = E′, the incomplete state probability of H 1,1 is \(h_{1,1}^{s\prime} = {{\Pr \left\{ {{H_{1,1}}E\prime} \right\}} \over {\Pr \left\{ {E\prime} \right\}}} = 1\). Since the normal evidence E″ = φ in Fig. 5, we finally get \(h_{1,1}^s = 1\).
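This shortcut can be verified numerically: since H 1,1 E′ = E′, we have Pr{H 1,1 E′} = Pr{E′}, so the ratio is exactly 1 no matter what parameter values the causal chains carry. The chain parameters below are assumed for illustration only and are not from the paper.

```python
# Hedged sketch: when the hypothesis space collapses to a single
# hypothesis H with H * E' = E', the conditional probability
# Pr{H E'} / Pr{E'} is exactly 1 regardless of the parameters.
from functools import reduce

# Illustrative a-/b-parameters (NOT the paper's values) for the two
# causal chains in the outspread of E'.
chain_upper = [0.9, 0.8, 0.7, 0.95, 0.6]   # upper chain, assumed
chain_lower = [0.85, 0.9, 0.8, 0.75, 0.7]  # lower chain, assumed
b_4_2, b_7_1 = 0.001, 0.002                # basic-event priors, assumed

prod = lambda xs: reduce(lambda a, b: a * b, xs, 1.0)
pr_E = prod(chain_upper) * prod(chain_lower) * b_4_2 * b_7_1
pr_HE = pr_E  # H_{1,1} E' = E' implies equal joint probability

h_state = pr_HE / pr_E
print(h_state)  # 1.0
```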

Fig. 5
figure 5

The finally simplified DUCG of Case 2

4.2.2 Concurrency of multiple faults

As a result, the hypothesis event B 4,2 B 7,1 is uniquely determined as the root cause of the current failures. This hypothesis denotes the joint occurrence (concurrent multiple faults) of initiating event B 4,2 and non-initiating event B 7,1, which together account for all the abnormalities observed.

5 Verification experiments with industrial failure data

The purpose of the verification experiment with real-world data is to characterize the diagnostic performance of this method. Scalability and efficiency are vital for an algorithm's applicability, because more observable variables usually lead to more intricate dependencies among them and aggravate the complexity of probabilistic reasoning. We have developed an engineering application of DUCG for the operational maintenance of two turbine generator set systems (rated active power 1150 MW, half-speed) installed in China Lingdong Nuclear Power Plant (LNPT). A total of 659 variables and 2852 arcs are involved, and the causality graph is shown in Fig. 6. The fault data sets are extracted from the Supervisory Information System (SIS) of LNPT, and contain discrete, continuous, switching and vibrating types of signals.

Fig. 6
figure 6

The complete DUCG for generator diagnosis system of LNPT

In total, we have conducted 38 diagnostic experiments covering 258 failure modes of this generator system. The inferences complete in about 324–700 ms; the timing measurements are made on a personal computer (Intel Core i7 1.73 GHz processor, 4 GB random access memory). Some disturbances, such as lost and spurious observations, are intentionally inserted for robustness testing. Even so, the generated fault hypothesis space remains as concise as possible, with all unlikely hypotheses excluded. As an example, the experiment results for the vibration fault of the generator tilting pad journal bearing are presented in detail.

5.1 The vibration fault of generator tilting pad journal bearing

Because the vibration fault is fairly complicated in its logical causality and the potential root causes are numerous, only the situations concerning retaining ring breakup are discussed here. We introduce two experiments for different failure scenarios.

This generator's rated speed is 1500 rpm. During start-up/shutdown, the changing rotation speed may pass through certain intervals (800–1000 rpm or 1100–1300 rpm), which can normally increase the vibration amplitude of the tilting pad journal bearing. Such a situation is called “generator in critical speed”. Besides this, the fault of generator retaining ring breakup may also induce an abnormally high vibration amplitude of the tilting pad journal bearing. Typically, the fault event “rotor windings being thrown out of the magnetic core due to retaining ring breakup” may damage the generator stator insulation, resulting in short circuit, and the generator may be tripped by the triggering of “generator and power transmission protection (GPA)”. Fig. 7 demonstrates the sub-DUCG constructed for this failure mode. The logic gate variables G 2, G 3, G 4 and G 5 are specified in Table 2, and Table 3 lists the variable definitions.

Fig. 7
figure 7

The sub-DUCG for vibration fault of generator tilting pad journal bearing caused by the retaining ring breakup

Table 2 The logic gate specifications of Fig. 7
Table 3 The definitions of variables in Fig. 7

Note that this example demonstrates a key property of DUCG: the modularized modeling and automatic synthesis scheme, by which the domain knowledge can be divided into a set of semi-independent fragments and incorporated from multiple perspectives. Consistency checks play a major role in dealing with ambiguous and contradictory knowledge during the combination of all sub-DUCGs. With this property, we can flexibly describe various aspects of targets and processes at arbitrary granularities. While the difficulty for knowledge engineers of ensuring consistency in the knowledge base has been considered a major limitation of diagnostic expert systems[29], this scheme significantly reduces the difficulty of DUCG modeling for complex systems.
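The consistency check mentioned above can be sketched as follows. This is an illustrative simplification, not the DUCG synthesis algorithm: fragments are modeled as dictionaries mapping arcs to relation labels, and contradictory specifications are collected rather than silently overwritten.

```python
# Hedged sketch of a consistency check when merging semi-independent
# sub-DUCG knowledge fragments. Fragment contents are illustrative.
def merge_fragments(fragments):
    """Merge arc->relation dicts; collect conflicts instead of
    silently overwriting one expert's knowledge with another's."""
    merged, conflicts = {}, []
    for frag in fragments:
        for edge, relation in frag.items():
            if edge in merged and merged[edge] != relation:
                conflicts.append((edge, merged[edge], relation))
            else:
                merged[edge] = relation
    return merged, conflicts

# Two fragments agree on one arc and contradict on another.
frag1 = {("B4", "X3"): "causes", ("X3", "X8"): "causes"}
frag2 = {("X3", "X8"): "causes", ("B4", "X3"): "inhibits"}
merged, conflicts = merge_fragments([frag1, frag2])
print(conflicts)  # [(('B4', 'X3'), 'causes', 'inhibits')]
```

A conflict list like this is the kind of output a knowledge engineer would resolve before the combined graph is accepted.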

5.2 The verification Experiment 1

In the first experiment, the abnormal evidence received is E′ = X 171,4 X 175,1 X 179,1 X 185,1 X 186,1 X 187,1. Other X type variables not listed here are observed as normal.

Simplification is first performed during the diagnostic reasoning process. In effect, the huge causality graph of Fig. 6 is simplified into Fig. 8 (a), in which only five of the total 230 B type variables are identified as candidate root causes. This significantly reduces the computational cost. According to Fig. 8 (a), all the observed fault signals might be caused by “rotor fan blades breakup” (B 3), “retaining ring breakup” (B 4), “the retaining pin of generator fan loosened” (B 77), “the retaining bolt of fan counterweight loosened” (B 78), or “the retaining screw of wind guider at stator end loosened” (B 189). Each one can individually account for all the abnormalities. Among them, Fig. 8 (b) illustrates the causal relationships with “retaining ring breakup” (B 4) supposed to be the root cause. The global state probabilities of the five candidate hypotheses are: \(h_{1,1}^s = 0.0844,\,h_{2,1}^s = 0.9117,\,h_{3,1}^s = 0.0028,\,h_{4,1}^s = 0.0002\) and \(h_{5,1}^s = 0.0008\). Therefore, B 4,1 is determined as the most probable origin of the current system failures.

Fig. 8
figure 8

The diagnostic Experiment 1 for generator faults

5.3 The verification Experiment 2

We now change the evidence to another group of abnormal signals: E′ = X 171,2 X 175,1 X 495,1. Other X type variables are observed as normal or unknown.

Although fewer evidences are available than in Experiment 1, the diagnostic inference in this experiment can still be performed accurately. The results reveal that these abnormal vibration observations may be caused by “rotor fan blades breakup” (B 3), “retaining ring breakup” (B 4), “winding interturn short circuit” (B 8), “rotor magnetization” (B 167), etc. The most probable causal source of the current system abnormalities is B 8,1, with the maximum probability of 0.2435. The resulting graph in Fig. 9 (a) indicates the circumstance with “retaining ring breakup” (B 4,1) as a root cause. Based on Fig. 9 (a) and the descendant causality chains in the original DUCG, we can make predictions about future fault developments. For example, Fig. 9 (b) is the 2-step prediction result. As we can see, X 172,2, X 174,2 and X 180,1 are inferred as possible upcoming failures, indicating the abnormally high vibration amplitude signals and the seriously high vibration alarm signal of the generator tilting pad journal bearing. The state-unknown prediction result for X 173 means that there are two possibilities for its abnormal state, i.e., the high or the seriously high vibration amplitude signal.
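The k-step prediction described above can be sketched as a depth-limited traversal of the descendant causality chains: starting from the confirmed root cause, the variables reachable within k causal steps are collected as possible upcoming failures. The edge list below is illustrative, not the actual arcs of Fig. 9.

```python
# Hedged sketch of k-step fault prediction over descendant causality
# chains. Edges are illustrative, not those of the real DUCG.
from collections import deque

def k_step_descendants(edges, root, k):
    """Nodes reachable from `root` in at most k directed steps."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist, queue = {root: 0}, deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # depth limit reached on this branch
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {n for n, d in dist.items() if 0 < d <= k}

# Illustrative causal chain: X180 lies 3 steps away, so it falls
# outside a 2-step prediction horizon.
edges = [("B4", "X172"), ("B4", "X174"), ("X172", "X173"),
         ("X173", "X180")]
print(sorted(k_step_descendants(edges, "B4", 2)))
# ['X172', 'X173', 'X174']
```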

Fig. 9
figure 9

The diagnostic Experiment 2 for generator faults

Validated against the operational maintenance records of LNPT and specialists' diagnostic conclusions, the diagnostic results of the two experiments are both accurate. This method offers intuitive insights into the underlying pathological mechanisms, increasing the objectivity of diagnosis and decision-making.

6 Discussion and conclusion

This paper analyzes some difficulties of fault diagnosis in large-scale applications, such as high computational overhead, poor scalability, the reliance on sufficient historical fault data and precise online observations, and the absence of interpretability in the conclusions formulated. To cope with these problems, we introduce the DUCG method as an attempt to model the causal behaviors of complex systems so as to provide reliable fault diagnosis. Some diagnostic reasoning solutions are investigated in order to reduce the calculation complexity and improve robustness to losses or imprecision in observations.

By explicitly elucidating the causal relationships derived from domain knowledge and data, together with the modularized construction scheme, the task of building the knowledge base of a complex system is greatly reduced in difficulty. In contrast to other fault diagnosis approaches, DUCG's visual analysis of causality pathways can intuitively explain to users how fault influence propagates through causality chains and results in the status of the malfunctioning system. Moreover, the probabilistic reasoning algorithm exhibits high precision, good generalization capability and resilience to incomplete information. All these properties manifest the feasibility of DUCG in practical engineering.

Further refinements of DUCG to improve its theoretical completeness are our future goals, for instance, the rigorous formulation of weighted logic inference and the verification of its soundness and self-containment, as well as the investigation of decision support strategies.