1 Introduction

Fault detection and diagnosis is an important challenge for many systems, in particular manufacturing systems. On the one hand, fault detection consists in generating an alarm once a fault is detected based on the available measurements. On the other hand, fault diagnosis consists in isolating the detected fault among a set of fault candidates [1]. Fault detection and diagnosis methods may be separated into model-based and data-based approaches. Model-based approaches have been intensively studied for continuous time systems [2] and also for discrete event systems [3, 4]. Among model-based approaches, numerous methods are based on the design of an observer whose aim is to estimate the system state. In the context of fault diagnosis, the state estimate is compared to the measurements to track the occurrence of faults. With continuous time systems, observers serve to generate residual signals that reveal the occurrence of faults when these signals become significantly different from zero [2]. With discrete event systems, the faulty states are mainly incorporated in the model [1] and the observer is transformed into a diagnoser that tags each possible state as normal or faulty. Consequently, state estimation leads directly to fault diagnosis. Diagnosis methods based on observers were first developed for finite state automata [5], stochastic automata [6], and timed automata [7], and were then extended to Petri nets [8, 9]. These methods are mainly devoted to logical faults (i.e., some unexpected events). Among data-based approaches, statistical control charts have been successfully used in many application domains, in particular to detect and diagnose operation faults in manufacturing systems [10,11,12]. One limitation of such approaches is that they consider the systems from an atomic point of view and consequently focus on specific operations. However, they are efficient for temporal faults (i.e., unexpected changes in the occurrence frequency of some events). One contribution of this work is to combine a discrete event observer with control charts in order to reveal time drifts (due, for example, to unexpected delays) in timed discrete event systems.

From a high-level point of view, manufacturing systems are basically modeled as discrete event systems (DES) that are pure logical models [2, 3]. In order to incorporate temporal specifications [13], DES can be enriched by adding time stamps or some information that represent the temporal evolution of the systems. Petri nets (PN) have demonstrated their ability to model such specifications and present numerous advantages. They are ready to perform qualitative and quantitative analysis of the systems thanks to their underlying mathematical structure; they can be directly converted into simulation models; in addition, they are graphical, easy to develop, extend, and offer a good understanding of the dynamic behavior of the systems. In particular, activities, resources, and constraints of a manufacturing system can be represented in a single consistent formulation. In the context of manufacturing systems, a large variety of PN subclasses have been listed and discussed, in particular to solve scheduling problems [14]. Some examples of these subclasses of models are systems of simple sequential processes with resources (S3PR) [15], systems of sequential systems with shared resources (S4R) [16] proposed for job-shop problems (that consist of several operations with total precedence constraints), and sets of simple open processes with resources (S2OPR) [17] proposed for open-shop problems (that consist of several jobs with full routing flexibility within the operations). In order to add flexibility at the operation level, colored Petri nets have been also used to model manufacturing systems [18], in particular for control applications [19]. A few applications are also devoted to diagnosis issues [20], but the extension of colored nets applied to temporal faults remains to the best of our knowledge an open question.

In this article, we aim to apply the principles of statistical control charts to timed DES in order to extract significant information about potential faults from heterogeneous measurements collected in different parts of the system. In particular, we propose to use a moving average (MA) control chart to detect and diagnose faults of manufacturing systems that are modeled as stochastic DES with labeled timed Petri nets (LTPN) [8]. LTPN are characterized by deterministic timed transitions that fire after constant durations and also stochastic timed transitions that fire according to probability density functions (PDF) of finite supports. LTPN encode the system sensors as a measurement function that define the measurements collected when observable events occur. The considered faults correspond to significant variations of the PDF supports that define the stochastic firing durations. For detection and isolation purposes, MA control charts [21] are combined with observers that compute the set of trajectories consistent with the observation collected thus far. For this purpose, the MA control charts smooth the successive measurements and compare these measurements with some thresholds. In particular, the proposed observers provide a set of minimal-size elementary trajectories (MSET) consistent with each new measurement. The MSET are also used for fault isolation (or diagnosis) by identifying the suspicious transitions for which the mean firing durations are outdated. Detection and isolation functions are proposed as a result of the approach. This approach is applicable when large sets of transition firing durations are collected. 
In contrast with many other diagnosis approaches, the main advantages of the proposed approach are: (1) to detect and isolate temporal faults that only affect the mean duration of some activities (such faults are largely undetectable by the main existing diagnosis methods as in [5,6,7,8, 22]); (2) to be applicable to a large variety of timed processes; (3) to be suitable also for slow deviations and gradual drifts; (4) to reduce the number of sensors and consequently the cost of the sensing (compared to approaches used at the level of the workstations): with an appropriate sensor configuration, it becomes possible to detect various faults; (5) to avoid the computation of the trajectory probabilities, which is very expensive in time and space [23]. The current work continues our preliminary study on slow deviations of firing durations [24]. More precisely, in [24] we proposed a simple method to use control charts for manufacturing systems based on the decomposition of the system into some specific paths. The main limitation of that former approach is that it is only suitable for 1-bounded nets (i.e., systems where a single operation is performed at each time and where multiple products cannot circulate simultaneously). The main contribution of the present paper is to introduce a new class of observers that track the recent trajectories in the state space for systems where several operations and products may be simultaneously considered.

The rest of this document is organized as follows: In Sect. 2, tools and useful definitions are introduced. In Sect. 3, the working assumptions and the model of the temporal faults are detailed. Section 4 provides a detailed explanation of the observer design. Section 5 describes the detection and diagnosis functions used to generate alarms and to isolate the faulty transitions. Section 6 concludes the paper and outlines future work.

2 Definitions and notations

2.1 Notations

Table 1 explains the notations and acronyms used in the rest of the paper.

Table 1 Abbreviations and notations

2.2 Timed Petri nets

A PN structure is defined as \({\text{PN}} = \left\langle {\varvec{P},\varvec{T},W_{\text{pr}} ,W_{\text{po}} } \right\rangle\), where \(\varvec{P} = \{ P_{1} , \ldots ,P_{n} \}\) is a set of n places, \(\varvec{T} = \{ T_{1} , \ldots ,T_{q} \}\) is a set of q transitions, \(W_{\text{pr}} \in ({\mathbf{N}})^{n \times q}\) and \(W_{\text{po}} \in ({\mathbf{N}})^{n \times q}\) are the pre- and post-incidence matrices (N is the set of nonnegative integer numbers), and \(W = W_{\text{po}} - W_{\text{pr}}\) is the incidence matrix. \(\left\langle {{\text{PN}},M_{\text{I}} } \right\rangle\) is a PN system with initial marking \(M_{\text{I}}\), and \(M \in ({\mathbf{N}})^{n}\) represents the PN marking vector corresponding to the number of tokens in each place. A transition Tj is enabled at M if \(M \ge W_{\text{pr}} \left( {:,j} \right)\), where \(W_{\text{pr}} \left( {:,j} \right)\) stands for the column j of matrix \(W_{\text{pr}}\). When Tj is enabled at M, we write M [Tj〉, and then, Tj may fire. When Tj fires once, the marking varies according to \(\Delta M = M^{\prime } - M = W(:,j)\). This is denoted as M [Tj〉 M′, and \(T(M,M^{\prime } ) \subseteq \varvec{T}\) is defined as the subset of transitions T such that M [T〉 M′. If σ is a sequence of several firings, feasible at M, then one can write M [σ〉 M′ and X(σ) refers to the firing count vector of σ. For systems with a finite number N of states, R and G are, respectively, the set of reachable markings and the generator matrix (i.e., G is a matrix of dimension N × N such that the entry g(M, M′) ∈ G is the transition T such that M [T〉 M′) of the reachability graph of the net system 〈PN, MI〉.
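The enabling and firing rules above can be sketched in a few lines of Python. This is a minimal illustration only: the two-place, two-transition net, the 0-based indices, and the names `is_enabled` and `fire` are assumptions made for the example and do not come from the paper.

```python
from typing import List

Vector = List[int]
Matrix = List[List[int]]  # indexed [place][transition]


def column(w: Matrix, j: int) -> Vector:
    """Return column j of an n x q matrix stored as a list of rows."""
    return [row[j] for row in w]


def is_enabled(m: Vector, w_pr: Matrix, j: int) -> bool:
    """T_j is enabled at M when M >= W_pr(:, j) componentwise."""
    return all(mi >= pre for mi, pre in zip(m, column(w_pr, j)))


def fire(m: Vector, w_pr: Matrix, w_po: Matrix, j: int) -> Vector:
    """Return M' = M + W(:, j) with W = W_po - W_pr; raise if T_j is not enabled."""
    if not is_enabled(m, w_pr, j):
        raise ValueError(f"transition index {j} is not enabled")
    return [mi - pre + post
            for mi, pre, post in zip(m, column(w_pr, j), column(w_po, j))]


# Hypothetical toy net (not from the paper): P1 -> T1 -> P2 -> T2 -> P1.
W_PR = [[1, 0],
        [0, 1]]
W_PO = [[0, 1],
        [1, 0]]
M_I = [1, 0]
```

With this toy net, only the first transition is enabled at `M_I`, and firing it moves the token from the first place to the second.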

As far as time is considered, the basic logical PN model can be extended to include time stamps. Time is measured in time units (TU) and can be associated with the duration of the transition firings or with the sojourn of the tokens in the places. In this paper, time is associated with the transitions and we refer to such an extension of PN as timed-transition Petri nets (TPN): The firing of each transition T occurs after a time d that is either deterministic (d has a constant value) or stochastic (d is a random variable (RV) with a PDF f(d)) [25]. When d is deterministic, its value is either strictly positive or zero. When d is zero, the firing is said to be immediate; otherwise, it is said to be delayed. When d is stochastic, the firing times are distributed according to an arbitrary PDF that is assumed to have a finite support, and this support is known for each transition. In the sequel, we consider TPN with stochastic firing durations with two particular types of PDF: uniform PDF on finite support (Fig. 1 left) and symmetrical triangular PDF (Fig. 1 right) defined, respectively, with Eqs. (1) and (2):

$$\begin{array}{*{20}l} {f(d) = 1/(b - a)} \hfill &\quad {{\text{if}}\quad d\in[a,b]} \hfill \\ {f(d) = 0} \hfill &\quad {\text{otherwise}} \hfill \\ \end{array}$$
(1)
$$\begin{array}{*{20}l} {f(d) = 4 /(b - a)^{2} \cdot (d - a) } \hfill &\quad {{\text{if}}\quad d\in[a,(a + b) /2]} \hfill \\ {f(d) = 4 /(b - a)^{2} \cdot (b - d)} \hfill &\quad {{\text{if}}\quad d\in[(a + b) /2,b]} \hfill \\ {f(d) = 0} \hfill &\quad {\text{otherwise}} \hfill \\ \end{array}$$
(2)
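As an illustration, the two densities of Eqs. (1) and (2) can be written and sampled as follows. This is a minimal sketch; the function names and the use of Python's standard `random` module are choices made for the example, not part of the paper.

```python
import random


def uniform_pdf(d: float, a: float, b: float) -> float:
    """Eq. (1): constant density 1/(b - a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= d <= b else 0.0


def triangular_pdf(d: float, a: float, b: float) -> float:
    """Eq. (2): symmetric triangle on [a, b] peaking at (a + b) / 2."""
    mid = (a + b) / 2.0
    if a <= d <= mid:
        return 4.0 / (b - a) ** 2 * (d - a)
    if mid < d <= b:
        return 4.0 / (b - a) ** 2 * (b - d)
    return 0.0


def sample_duration(kind: str, a: float, b: float, rng: random.Random) -> float:
    """Draw one firing duration from the chosen PDF with support [a, b]."""
    if kind == "uniform":
        return rng.uniform(a, b)
    if kind == "triangular":
        return rng.triangular(a, b, (a + b) / 2.0)
    raise ValueError(f"unknown PDF type: {kind}")
```

Both densities integrate to 1 over [a, b]; the triangular one reaches its maximum 2/(b − a) at the midpoint of the support.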

Fig. 1 PDF of the transition firing durations: bounded uniform PDF (left); symmetrical triangular PDF (right)

The time semantic of the considered TPN [26, 27] is completed by defining the server, choice, and memory policies:

  a. The servers are single server.

  b. The choice policy is a preselection policy. Such a policy is used in case of effective conflicts. In such situations, the next transition is randomly chosen from all currently enabled transitions according to a uniform PDF and a set of weights ωj, j = 1,…,q associated with the transitions. The weight vector \(\Omega = (\omega_{j}) \in ({\mathbf{R}}^{+*})^{q}\) (R+* is the set of strictly positive real numbers) determines the firing probability of each transition. For simplicity and without any loss of generality, ωj = 1, j = 1,…,q, in the rest of the paper, so that the transitions in conflict have identical firing probabilities.

  c. The memory policy is an enabling memory policy. With such a policy, at each firing, the residual durations associated with still enabled transitions are decremented and the residual durations associated with disabled transitions are reset.
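One possible reading of the preselection and enabling memory policies is sketched below. The function names, the dictionary of residual durations, and the `draw` callback that samples a fresh firing delay are all hypothetical; the sketch only illustrates the two rules stated above and is not the simulation engine used in the paper.

```python
import random
from typing import Callable, Dict, Iterable, Sequence


def preselect(conflicting: Sequence[int], weights: Dict[int, float],
              rng: random.Random) -> int:
    """Pick one transition in conflict, with probability proportional to its weight."""
    return rng.choices(list(conflicting),
                       weights=[weights[j] for j in conflicting], k=1)[0]


def update_residuals(residual: Dict[int, float], fired: int, elapsed: float,
                     enabled_after: Iterable[int],
                     draw: Callable[[int], float]) -> Dict[int, float]:
    """Enabling memory: running clocks survive only while a transition stays enabled."""
    updated = {}
    for j in enabled_after:
        if j in residual and j != fired:
            updated[j] = residual[j] - elapsed  # still enabled: decrement
        else:
            updated[j] = draw(j)                # newly enabled or just fired: fresh delay
    # Transitions absent from enabled_after are dropped, i.e., their clocks are reset.
    return updated
```

With all weights equal to 1, `preselect` picks each transition in conflict with the same probability, which matches the simplification adopted in the text.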

A timed firing sequence is defined as σ = T(j1, t1) T(j2, t2) … T(jh, th) with T(jk, tk) ∈ T, k = 1,…,h being the transitions that consecutively fire in the sequence σ. The integers jk are the indexes of the transitions that successively fire, the times tk are the time stamps of the successive firings and h is the length of σ. In addition, a timed trajectory is defined as (M(t0), M(th), σ) = M(t0)[T(j1, t1)〉 M(t1) … [T(jh, th)〉 M(th), from marking M(t0) at time t0 to marking M(th) at time th according to the timed firing sequence σ. Making abstraction of the timing information, a firing sequence is defined as σ = T(j1) T(j2) … T(jh) with T(jk) ∈ T, k = 1,…,h and a trajectory is defined as (M(0), M(h), σ) = M(0)[T(j1, 1)〉 M(1) … [T(jh, h)〉 M(h).

Example 1

Figure 2 is an example of TPN with a set of places \(\varvec{P} = \left\{ {P_{1} , \ldots ,P_{8} } \right\}\) and a set of transitions \(\varvec{T} = \left\{ {T_{1} , \ldots ,T_{6} } \right\}\). Matrices \(W_{\text{pr}}\) and \(W_{\text{po}}\) are both of dimensions 8 × 6. For example, \(W_{\text{pr}} \left( {1, 1} \right) = W_{\text{pr}} \left( {7, 1} \right) = W_{\text{po}} \left( {2,1} \right) = 1\) describes how transition \(T_{1}\) is connected to the rest of the net. The initial marking \(M_{\text{I}}\) in this example is such that \(M_{\text{I}} = \left( {n_{1}~ 0~ 0~ n_{2}~ 0~ 0~ r_{1}~ r_{2} } \right)^{\text{T}}\) with n1, n2, r1, r2 ∈ N* (the set of strictly positive integer numbers). The transitions \(T_{1}\) and \(T_{4}\) are enabled at \(M_{\text{I}}\). The firing durations in this TPN are stochastic and defined with a set of uniform PDF of support [ai, bi], i = 1, …, 6. The values of ai and bi are reported in Fig. 2 so that “\(T_{1} :\left[ {2.7, 3.3} \right]\)” means that transition \(T_{1}\) needs a time d1 that is uniformly distributed within the time interval \(\left[ {2.7, 3.3} \right]\) before it fires. For the values n1 = 1, n2 = 1, r1 = 2, and r2 = 2, the system has nine reachable markings and no deadlock. Increasing n1, n2, r1, r2 increases the number of reachable markings. When r1 < n1 + n2 or r2 < n1 + n2, deadlocks appear. For n1 = 1, n2 = 1, r1 = 2, and r2 = 2, an example of timed firing sequence enabled at MI is σ = T(4, 2.8) T(1, 3.1) T(5, 3.8) T(2, 4.2). The corresponding timed trajectory is (M(0), M(4.2), σ) = M(0) [T(4, 2.8)〉 M(2.8) [T(1, 3.1)〉 M(3.1) [T(5, 3.8)〉 M(3.8) [T(2, 4.2)〉 M(4.2) with M(0) = MI = \(\left( {1~ 0~ 0~ 1~ 0~0~ 2~ 2} \right)^{\text{T}}\) and M(4.2) = \(\left( {0~ 0~ 1~ 0~ 0~ 1~ 1~ 1} \right)^{\text{T}}\).

Fig. 2 An example of manufacturing system modeled with a TPN

2.3 Labeled timed Petri nets

The transitions of the net are basically separated into observable transitions that deliver a label and silent ones that do not. L: T → E ∪ {ε} is a labeling function that assigns a label to each transition, where E = {e1, …, eqO} is the set of qO labels that are assigned to observable transitions and ε is the null label that is assigned to the silent ones.

Labeled timed Petri nets (LTPN) are finally defined as <PN, PDF, Ω, L, MI> where PN is a Petri net structure, PDF is a set of probability density functions, Ω is a set of weights, L is the measurement function that defines the sensor configuration and MI is the initial marking. The measurement of any trajectory is obtained according to the selected sensor configuration. The function L collects the K successive dated measurements of a timed trajectory (σ, MI) over time interval [t0, tp]. These measurements are organized in a measured trajectory: TRo = L(σ, MI) = (e1, τ1) … (eK, τK) that will be analyzed in the next section to detect and isolate temporal faults.
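The construction of a measured trajectory can be illustrated by projecting a timed firing sequence onto the observable labels. In the sketch below, the labeling dictionary, the function name `measured_trajectory`, and the firing times of the sample sequence are hypothetical values chosen for the illustration.

```python
from typing import Dict, List, Optional, Tuple

TimedFiring = Tuple[int, float]  # (transition index j, firing time t)
Measurement = Tuple[str, float]  # (label e, time stamp tau)


def measured_trajectory(sigma: List[TimedFiring],
                        labels: Dict[int, Optional[str]]) -> List[Measurement]:
    """Keep only the firings of observable transitions, replaced by their labels."""
    return [(labels[j], t) for j, t in sigma if labels.get(j) is not None]


# Hypothetical sensor configuration: only two transitions deliver a label.
LABELS = {1: None, 2: None, 3: "a", 4: None, 5: None, 6: "b"}

# Hypothetical timed firing sequence (transition index, time stamp).
SIGMA = [(4, 2.8), (1, 3.1), (5, 3.8), (2, 4.2), (6, 6.0), (3, 6.9)]

TR_O = measured_trajectory(SIGMA, LABELS)  # [("b", 6.0), ("a", 6.9)]
```

The silent firings disappear from the result, so only the dated labels remain available to the detection and isolation functions.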

Example 2

Consider again the TPN in Fig. 2 with n1 = 1, n2 = 1, r1 = 2, and r2 = 2. Assume that transition T3 delivers a label a when T3 fires and similarly that transition T6 delivers b when T6 fires. The other transitions are assumed to be silent. Consequently, the labeling function is defined as L(T3) = a, L(T6) = b, and L(Tj) = ε, for j = 1, 2, 4, 5. With the addition of the labeling function, the TPN is now a LTPN. Figure 3 reports the observations captured by the labeling function when the LTPN works during a period of 100 time units (TU). According to the labeling function, only the complete execution of the cycles {T1, T2, T3} and {T4, T5, T6} is detected. Figure 3 top reports the number of executions of cycle {T1, T2, T3} in light red (resp. cycle {T4, T5, T6} in dark blue) with respect to (wrt) time (in TU). One can notice that, for n1 = 1, n2 = 1, r1 = 2, and r2 = 2, both cycles are executed at maximal speed with an average period of 6 TU (25 executions of each cycle are detected within the time window of width 150 TU). Figure 3 bottom reports the distributions of the durations of the cycles {T1, T2, T3} in light gray and {T4, T5, T6} in dark gray. One can notice the dispersion of these distributions due to the uniform PDF of the firing durations.

Fig. 3 Number of executions (top) of cycle {T1, T2, T3} in red (resp. cycle {T4, T5, T6} in blue) with respect to the time; histogram of the durations (bottom) of cycle {T1, T2, T3} in light gray (resp. cycle {T4, T5, T6} in dark gray)

2.4 Models of manufacturing systems with LTPN

From a high-level perspective, a manufacturing system is often considered as a workshop composed of multiple jobs where each job consists of a set of operations performed on several resources. Sensors are used to detect the achievement of some operations.

  • Each operation oj is modeled by a transition Tj, an input buffer represented by a place and an output buffer represented by another place (that may be at the same time the input buffer of the next operation of the job when intermediate buffers do not exist). The duration of oj is represented by the firing delay dj of Tj. The definition of the firing delay as a stochastic distribution is suitable to take into account some uncertainties in the execution of the operations.

  • A set of resources Rj is generally required to perform each operation oj. Each type k of resource is represented by a specific place with an initial marking rk that indicates how many resources of type k are available to perform the different operations. The resource place is simultaneously in the preset and postset of Tj (and can also belong to the preset and postset of other transitions to model resources that are shared by several operations).

  • Sensors are used to measure the activities in the workshop. The labeling function collects particular labels when some transitions fire and these labels give the information that the corresponding operation is achieved. Note that improved strategies for data collections have been investigated (see, for example, cloud infrastructure-based methods which collect data in real time from intelligent devices [28]) and can be used for the same purpose.

In addition, a supervisor may be added to remove some forbidden markings (for example to avoid the deadlock markings).

Example 3

Consider again the TPN in Fig. 2. This net is the model of a manufacturing system with two jobs. The first job J1 consists of three operations {o1, o2, o3} that are modeled with the transitions T1, T2, and T3. The second job J2 consists of the operations {o4, o5, o6} that are modeled with the transitions T4, T5, and T6. The lot size (i.e., the number of products that are simultaneously accepted by the job) for J1 (resp. J2) is given by the initial marking of the place P1 (resp. P4). Two types of resources are required to perform the operations, and the initial marking (r1 and r2) of the places P7 and P8 gives the number of resources of each type. The resource r1 is needed to perform the operations {o1, o2} and {o5, o6}. Similarly, the resource r2 is needed to perform the operations {o2, o3} and {o4, o5}. No supervisor has been considered in this example. The labeling function introduced in Example 2 indicates that operation o3, and consequently job J1, is completed when a label a is detected, whereas operation o6, and consequently job J2, is completed when a label b is detected. Consequently, Fig. 3 top can be interpreted as the number of executions of job J1 in light red (resp. J2 in dark blue) and Fig. 3 bottom can be interpreted as the distributions of the durations of job J1 in light gray (resp. J2 in dark gray).

3 Faults modeling and control charts

3.1 Assumptions

Assumptions 1–8 are considered in the sequel:

  1. The LTPN are bounded, and consequently, the set R and generator matrix G are of finite dimensions.

  2. T(M, M′) is at most of cardinality 1, for any pair of markings M, M′ ∈ R.

  3. The silent part of the considered LTPN is acyclic.

  4. The system parameters, the PDF of the stochastic firing durations, and the net initial marking are assumed to be known.

  5. The time semantic is defined according to single server, preselection, and enabling memory policies.

  6. The temporal faults correspond to significant variations of the PDF support (with respect to the measurement errors).

  7. Single faults are considered.

  8. The type of the PDF is not affected by the faults.

Assumptions 1–4 are usual assumptions needed to design observers for labeled Petri nets [8, 9]. In particular, Assumptions 1 and 3 ensure that a finite size deterministic observer exists. Assumption 2 is stated for simplicity and may be relaxed, as may the knowledge of the initial marking (Assumption 4). Then, Assumptions 5–8 are required to obtain significant residuals for the temporal faults. Assumption 5 describes the time semantic specifications and is frequently used for timed stochastic discrete event systems [27]. Assumption 7 is also a common assumption for fault isolation [2]. Finally, Assumptions 6 and 8 restrict the class of considered temporal faults in order to make the approach tractable.

3.2 Support variation of the PDF

The firing durations of the transitions are random variables defined by their PDF. The core of the approach is to detect and characterize the PDF support variations according to the variation of the firing durations mean value. Such a variation is captured by a control chart as illustrated in Fig. 4. In particular, Fig. 4 left shows a translation for a bounded uniform PDF and Fig. 4 right shows a translation for a symmetrical triangular PDF. The original PDF has a support [ac, bc] and a mean value mc, whereas the resulting PDF has a support [am, bm] and a mean value mm.

Fig. 4 Support variation of a symmetrical triangular (left) and bounded uniform (right) PDF

From the perspective of manufacturing systems, the time drifts represented in Fig. 4 concern the operation execution times. In a given workshop, delays may occur due to an accumulation of atomic unexpected behaviors. In many cases, the occurrence and increase in such delays are the symptom of dysfunctions in the system.

Example 4

Consider again the TPN in Fig. 2 as the model of a manufacturing system with two jobs and the labeling function introduced in Example 2. Assume that a time drift of approximately 1% affects the duration of operation o3. The sensor configuration of the system remains unchanged. Figure 5 top reports the number of executions of job J1 in light red (resp. J2 in dark blue) for this situation. Observe that the job J1 gradually slows down, whereas the makespan of J2 is not affected by the dysfunction. In addition, Fig. 5 bottom reports the distributions of the durations of job J1 in light gray (resp. J2 in dark gray) and one can notice that the dispersion of job J1 durations increases with respect to the time. This example illustrates how time drifts may affect the makespan of the manufacturing systems and how such time drifts could be tracked with LTPN models.

Fig. 5 Number of executions of job J1 in red (resp. J2 in blue) with respect to the time (top); histogram of the durations of job J1 in light gray (resp. J2 in dark gray) when a temporal drift affects the duration of operation o3

3.3 Detection of the PDF support variation with control chart

The firing durations of a given transition T are initially distributed according to the PDF of known support [ac, bc]. A variation of this PDF support is deduced from the mean and the standard deviation of the durations of N consecutive firings of the same transition. Then, the new PDF support [a, b] can be computed. The support [a, b] of a bounded uniform PDF is computed from the mean m and the standard deviation σ of this PDF by Eq. (3) [24]:

$$\begin{aligned} a & = m{-}\sigma \cdot (3)^{ 1 / 2} \\ b& = m + \sigma \cdot (3)^{ 1 / 2} \\ \end{aligned}$$
(3)

In a similar way, the support [a, b] of a symmetrical triangular PDF is computed from the mean m and the standard deviation σ of this PDF by Eq. (4) [24]:

$$\begin{aligned} a & = m{-}\sigma \cdot (6)^{ 1 / 2} \\ b& = m + \sigma \cdot (6)^{ 1 / 2} \\ \end{aligned}$$
(4)
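Equations (3) and (4) translate directly into code. The sketch below is only an illustration: the function names are hypothetical, and using the population standard deviation (`statistics.pstdev`) to estimate σ from measured durations is an assumption, since the text does not specify the estimator.

```python
import math
from statistics import mean, pstdev
from typing import Sequence, Tuple

# Factors of Eqs. (3) and (4): sqrt(3) for uniform, sqrt(6) for symmetrical triangular.
_FACTOR = {"uniform": math.sqrt(3.0), "triangular": math.sqrt(6.0)}


def support_from_moments(m: float, sigma: float, kind: str) -> Tuple[float, float]:
    """Return the support [a, b] from the mean m and standard deviation sigma."""
    half_width = sigma * _FACTOR[kind]
    return m - half_width, m + half_width


def estimate_support(durations: Sequence[float], kind: str) -> Tuple[float, float]:
    """Estimate the support from N measured firing durations (assumed estimator)."""
    return support_from_moments(mean(durations), pstdev(durations), kind)
```

For instance, a uniform PDF on [2.7, 3.3] has m = 3 and σ = 0.6/√12, and `support_from_moments` recovers the interval [2.7, 3.3] from these two values.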

A MA control chart aims to detect the variations in the average of a data series. For this purpose, N consecutive measurements collected by the sensors of the system are saved and some computations are proposed for these histories of data. In particular, lower and upper thresholds δ and Δ are defined according to the desired tolerances of the system. Then, a fault is detected in case of violation of the thresholds. A time window of variable size that contains N consecutive values of the firing duration associated with a given transition T is considered, and the mean value MA(dn) of the N firing durations \(d_{n - N + 1} , \ldots ,d_{n}\) is computed. For n > N, an updating of MA(dn) is obtained with Eq. (5):

$${\text{MA}}(d_{n} ) = {\text{MA}}(d_{n - 1} ) + \frac{{\left( {d_{n} - d_{{n - N}} } \right)}}{N}\quad {\text{if}}\,n > N$$
(5)

The detection thresholds δ and Δ define the acceptable variations of the mean value: The system is assumed to have a fault-free behavior as long as MA(dn) ∈ ]δ, Δ[; otherwise, it is assumed to have a faulty behavior. Consequently, the thresholds are selected in order to fulfill some desired performance with respect to safety requirements. A usual method is to compute these thresholds according to mean m and standard deviation σ of the firing durations of T measured during nominal (i.e., fault-free) behavior: Δ = m + γ·σ and δ = m − γ·σ where γ is an input parameter.
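A minimal sketch of such a chart is given below. The class name `MovingAverageChart` and its interface are hypothetical. The recursive update adds the newest duration and drops the oldest duration of the previous window, which is equivalent to recomputing the mean of the last N values.

```python
from collections import deque
from typing import Deque, Optional


class MovingAverageChart:
    """MA control chart with thresholds delta = m - gamma*sigma and Delta = m + gamma*sigma."""

    def __init__(self, window: int, m: float, sigma: float, gamma: float):
        self.window = window
        self.lower = m - gamma * sigma   # delta
        self.upper = m + gamma * sigma   # Delta
        self.buffer: Deque[float] = deque()
        self.ma: Optional[float] = None

    def update(self, d: float) -> Optional[bool]:
        """Add one duration; return None until N samples exist, else True if a fault is flagged."""
        self.buffer.append(d)
        if len(self.buffer) < self.window:
            return None
        if len(self.buffer) == self.window:
            # First full window: plain average of the N stored values.
            self.ma = sum(self.buffer) / self.window
        else:
            # Recursive update: the oldest value leaves the window.
            oldest = self.buffer.popleft()
            self.ma += (d - oldest) / self.window
        # Fault-free behavior as long as MA lies strictly between the thresholds.
        return not (self.lower < self.ma < self.upper)
```

A chart built with `window=3`, `m=3.0`, `sigma=0.1`, and `gamma=3.0` stays silent for the durations 3.0, 3.1, 2.9 and flags a fault when the next duration is 4.0, since the moving average then rises to about 3.33, above Δ = 3.3.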

4 Observer design

4.1 Elementary trajectory observer design

In this section, elementary trajectory observers for LTPN are detailed to track the trajectories consistent with the initial marking and the successive measurements. The design of such observers is motivated by the fact that the temporal fault of a given transition will affect the durations of the sequences in which this transition occurs. An elementary trajectory observer (ETO) that computes all elementary trajectories (i.e., trajectories between two consecutive measurements) consistent with the measurements observed thus far is obtained with Algorithm 1. This algorithm iteratively builds all trajectories that are feasible at a given marking. Then, it projects the firing sequence of each trajectory onto the set of observable labels. Only the trajectories that coincide with the sequence of observations collected thus far are saved for diagnosis purposes. Because the number of consistent trajectories necessarily grows with the number of collected observations, only elementary trajectories between two successive observations are considered. This motivates the design of the ETO. The algorithm uses the labeling function L and a list of unexplored states UNXPL. It returns the set Robs of observer states and the generator matrix Gobs of the ETO. Each state S of the ETO is composed of a set of trajectories tr = (MO, MD, σ) where MO is an origin marking, MD is a destination (final) marking and σ is a logical (i.e., making abstraction of the timing information) feasible sequence from MO to MD (i.e., MO [σ〉 MD) that satisfies: σ = σ′T with L(σ′) = ε and L(T) = e (i.e., σ is consistent with a given label e and has no silent closure). Consequently, by construction, all edges arriving in a given state of the ETO are tagged with the same label. Note that a given origin marking MO (resp. a given destination marking MD) can appear in several elementary trajectories associated with the same state. From S, it is easy to compute the set SO(S) of origin markings and the set SD(S) of destination markings. The set SD(S) also represents the set of the current markings consistent with the measurement. The states are stored in Robs. Each entry gobs(S, S′) of Gobs is composed of the label that is measured when the observer state varies from S to S′.

figure a

The set S′ of elementary trajectories (MO, MD, σ′) originated from a marking MO ∈ SD(S) and such that σ′ = σ″T with L(σ″) = ε and L(T) = e is computed with Algorithm 2.

figure b
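The computation of the elementary trajectories consistent with a new label can be sketched as a search over the reachability graph. The code below is not the authors' Algorithm 2. It is a hedged illustration in which the graph encoding, the function name, and the small example graph are assumptions; termination relies on the silent part being acyclic (Assumption 3).

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

Marking = str
Graph = Dict[Marking, List[Tuple[str, Marking]]]        # M -> [(T, M')]
Trajectory = Tuple[Marking, Marking, Tuple[str, ...]]   # (MO, MD, sigma)


def elementary_trajectories(origins: Iterable[Marking], graph: Graph,
                            labels: Dict[str, Optional[str]],
                            e: str) -> FrozenSet[Trajectory]:
    """Collect all (MO, MD, sigma) with sigma = silent prefix + one transition labeled e."""
    found = set()
    stack = [(mo, mo, ()) for mo in origins]
    while stack:
        mo, m, sigma = stack.pop()
        for t, m_next in graph.get(m, []):
            label = labels.get(t)
            if label is None:
                # Silent transition: extend the prefix and keep exploring.
                stack.append((mo, m_next, sigma + (t,)))
            elif label == e:
                # Observable transition with the measured label: trajectory complete.
                found.add((mo, m_next, sigma + (t,)))
    return frozenset(found)


# Hypothetical reachability graph and labeling (None marks a silent transition).
GRAPH: Graph = {"M1": [("T1", "M2")],
                "M2": [("T2", "M3"), ("T4", "M1")],
                "M3": [("T3", "M1")]}
LABELS = {"T1": "e1", "T2": None, "T3": "e2", "T4": "e2"}
```

In this toy graph, measuring e2 from the marking M2 is explained either by the direct firing of T4 or by the silent firing of T2 followed by T3, so the resulting observer state contains two elementary trajectories.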

Proposition 1

Consider a LTPN < PN, PDF, Ω, L, MI > that satisfies Assumptions 1–4. The ETO obtained with Algorithm 1 has a finite number of states that does not exceed

$$\sum\limits_{j = 1, \ldots ,H + 1} {\left( {2^{{\left( {N^{j + 1} } \right)}} - 1} \right)} ,$$

where N is the number of system states and H is the maximal number of consecutive silent transitions.

Proof

  i. A trajectory of length k includes not only a firing sequence of length k but also the origin and destination markings, and because of Assumption 2, a trajectory of length 1 (MO, MD, T) is indifferently defined by (MO, T) or (MO, MD). Consequently, the total number of trajectories of length k does not exceed \(N_{1} = N^{k + 1}\) where N is the finite number of system states. Moreover, the number of states with trajectories of length k cannot exceed the sum of the combinations \(C^{1}_{{N_{1} }} + \cdots + C^{{N_{1} }}_{{N_{1} }} = 2^{{N_{1} }} - 1\).

  ii. According to Assumption 3, the number of consecutive silent transitions is finite and cannot exceed a maximal number referred to as H. Consequently, the trajectories encoded by the observer states have a maximal length of H + 1 and each state of the observer is a combination of trajectories with length 1 to H + 1.

The upper bound \(\sum\limits_{j = 1, \ldots ,H + 1} {\left( {2^{{\left( {N^{j + 1} } \right)}} - 1} \right)}\) results from (i) and (ii).□
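To give a feeling for how loose this bound is, it can be evaluated for small values. The helper name below is hypothetical, and the values of N and H are arbitrary illustrations.

```python
def eto_state_bound(n_markings: int, h_silent: int) -> int:
    """Sum over j = 1..H+1 of (2**(N**(j+1)) - 1), the bound of Proposition 1."""
    return sum(2 ** (n_markings ** (j + 1)) - 1
               for j in range(1, h_silent + 2))


# N = 2 markings, H = 1 silent transition: (2**4 - 1) + (2**8 - 1) = 15 + 255
print(eto_state_bound(2, 1))  # 270
```

The bound grows doubly exponentially with H, so it is mainly useful as a finiteness argument rather than as an estimate of the actual observer size.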

Proposition 2

Consider a LTPN < PN, PDF, Ω, L, MI > that satisfies Assumptions 1–4. The ETO obtained with Algorithm 1 is of minimal size.

Proof

First observe that the ETO is designed iteratively: For each state S and label e, a new state S′ is created only if the set of trajectories encoded in S′ is different from the set of trajectories encoded in S. Now, imagine a reduced observer obtained by merging two different states S and S′ in a single state S″. For simplicity and without any loss of generality, assume that S and S′ differ only by a single trajectory of length k: tr ∈ S and tr ∉ S′.

In case S″ is defined so that tr ∈ S″, all trajectories that result in S′ have a postfix different from tr. So, tr is not consistent with some sequences of observations and the reduced ETO is no longer an elementary trajectory observer.

In case S″ is defined so that tr ∉ S″, there exist trajectories with a postfix tr that result in S but not in S″ and the reduced ETO is no longer an elementary trajectory observer.□

Example 5

Consider the marked LTPN system < PN2, PDF, Ω, L, MI > in Fig. 6 with \(M_{1} = (1~0~0~0~0)^{\text{T}}\) and the sensor configuration defined by L(T1) = e1, L(T3) = e2, L(T4) = e3 and L(Tj) = ε for j = 2, 5, 6, 7. The reachability set of < PN2, PDF, Ω, L, MI > has five states: \(M_{1} = (1~0~0~0~0)^{\text{T}}\); \(M_{2} = (0~1~0~0~0)^{\text{T}}\); \(M_{3} = (0~0~1~0~0)^{\text{T}}\); \(M_{4} = (0~0~0~1~0)^{\text{T}}\); and \(M_{5} = (0~0~0~0~1)^{\text{T}}\). The observer of < PN2, PDF, Ω, L, MI > has 11 states and is reported in Fig. 7. Note that this observer is composed of a transient part that corresponds to the set of states {S1, S2, S3} and of a steady-state part that corresponds to the other states. The list of elementary trajectories consistent with each state of the ETO is reported in the second column of Table 2. Column 3 reports the mean duration in nominal behavior for these trajectories. Such a mean duration is not calculated for the transient states, for which at most a single measurement is collected.

Fig. 6 The LTPN system < PN2, PDF, Ω, L, MI > of Example 5

Fig. 7 ETO for < PN2, PDF, Ω, L, MI > (for clarity, labels are reported next to the states; all edges that reach the same state share the same label)

Table 2 Elementary trajectories and mean duration in fault-free behavior

4.2 Minimal-size elementary trajectories

In the previous simple example, each state is associated with a single elementary trajectory. This is no longer the case for more complex systems or when the initial marking increases. To illustrate this difficulty, let us consider the previous example of Fig. 6 and an initial marking M1(k) = k·(1 0 0 0 0)T that depends on k; Table 3 illustrates how the system size N, the observer size Nobs, and the number NET of elementary trajectories increase with respect to k. In particular, one can notice the rapid increase in NET with respect to k.

Table 3 Complexity increases with respect to initial marking

In order to limit this explosion of complexity, minimal-size elementary trajectories (MSET) are introduced. A MSET MO [σ〉 MD in state S is defined as an elementary trajectory that is matched [22, 29, 30] by all other elementary trajectories in S that originate from the same marking MO. A trajectory MO [σ′〉 MD matches another trajectory MO [σ〉 MD if σ′ contains all transitions of σ and these transitions respect the same precedence conditions (i.e., fire in the same order). More formally, a sequence σ′ matches a sequence σ ≠ ε (one writes σ ≪ σ′) if σ = σ1 σ2 with σ1 ∈ T* (“*” denotes the Kleene star and T* is the set of sequences of transitions in T) and σ2 ∈ T*, and there exist σ′1 ∈ T* and σ′2 ∈ T* such that σ′ = σ′1 σ1 σ′2 and σ2 ≪ σ′2. MSET(S) is defined as the set of MSET in S, that is, the subset of elementary trajectories of S that are matched by all other elementary trajectories in S originating from the same marking MO. NET is defined as the global number of MSET in all states of the ETO (i.e., in Robs). The idea behind the MSET computation is to remove the transitions that fire concurrently in some (but not all) elementary trajectories. Such an elimination is reasonable because a temporal fault that affects such a transition will not affect the collected measurements. Algorithm 3 computes the set MSET(S) of MSET for a given state S of the ETO. In this algorithm, T(h) stands for the hth transition of sequence σ.

figure c

Example 6 illustrates the computation of MSET.

Example 6

Consider again the marked LTPN system < PN2, PDF, Ω, L, MI > in Fig. 6. If MI = (2 0 0 0 0)T, the ETO has 53 states and each state is composed of a set of elementary trajectories. Consider, for example, a particular state S composed of nine elementary trajectories:

$$\begin{aligned}S\, = \,\{&M_{ 3} [T_{1} \rangle M_{ 5} ; \\ & M_{3} [T_{5} T_{1} \rangle M_{2} ; \\ & M_{5} [T_{2} T_{5} T_{1} \rangle M_{5} ; \\ & M_{5} [T_{2} T_{5} T_{5} T_{1} \rangle M_{2} ; \\ & M_{5} [T_{5} T_{1} \rangle M_{4} ; \\ & M_{5} [T_{5} T_{2} T_{1} \rangle M_{5} ; \\ & M_{5} [T_{5} T_{2} T_{5} T_{1} \rangle M_{2} ; \\ & M_{7} [T_{5} T_{1} \rangle M_{5} ; \\ & M_{7} [T_{5} T_{5} T_{1} \rangle M_{2} \} \\ \end{aligned}$$

with M2 = (1 1 0 0 0)T; M3 = (1 0 1 0 0)T; M4 = (0 2 0 0 0)T; M5 = (0 1 1 0 0)T; M7 = (0 0 2 0 0)T.

From these elementary trajectories, three MSET are computed:

$$\begin{aligned} & M_{ 3} [T_{ 1} \rangle M_{ 5} ; \\ & M_{ 5} [T_{ 5} T_{ 1} \rangle M_{ 4} ; \\ & M_{ 7} [T_{ 5} T_{ 1} \rangle M_{ 5} . \\ \end{aligned}$$

In the elementary trajectory M3 [T5T1〉 M2, the transition T5 fires concurrently with T1, but the measurement is obtained when T1 fires. A fault in T5 will not affect the duration of the sequence M3 [T5T1〉 M2, and this trajectory can be removed. The global number of MSET is reported in Table 3 with respect to k, and one can notice the reduction of complexity compared to the number NET of elementary trajectories.
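
The selection made in Example 6 can be reproduced with a few lines of Python. This is a minimal sketch and not Algorithm 3 itself: it assumes that trajectories are grouped by their origin marking only (as in the example, where destinations differ), and it keeps a trajectory when every trajectory of its group contains its transitions in the same order. The names `Trajectory`, `matches` and `compute_mset` are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, NamedTuple


class Trajectory(NamedTuple):
    origin: str                 # origin marking MO, e.g. "M3"
    sequence: tuple[str, ...]   # fired transitions, e.g. ("T5", "T1")
    destination: str            # destination marking MD


def matches(longer: tuple[str, ...], shorter: tuple[str, ...]) -> bool:
    """True if `longer` contains all transitions of `shorter` in the same order."""
    remaining = iter(longer)
    # `t in remaining` consumes the iterator up to the first occurrence of t.
    return all(t in remaining for t in shorter)


def compute_mset(state: Iterable[Trajectory]) -> list[Trajectory]:
    """Keep the trajectories matched by all trajectories with the same origin."""
    by_origin: dict[str, list[Trajectory]] = defaultdict(list)
    for tr in state:
        by_origin[tr.origin].append(tr)
    result = []
    for group in by_origin.values():
        for cand in group:
            if all(matches(other.sequence, cand.sequence) for other in group):
                result.append(cand)
    return result


# State S of Example 6 (nine elementary trajectories).
S = [
    Trajectory("M3", ("T1",), "M5"),
    Trajectory("M3", ("T5", "T1"), "M2"),
    Trajectory("M5", ("T2", "T5", "T1"), "M5"),
    Trajectory("M5", ("T2", "T5", "T5", "T1"), "M2"),
    Trajectory("M5", ("T5", "T1"), "M4"),
    Trajectory("M5", ("T5", "T2", "T1"), "M5"),
    Trajectory("M5", ("T5", "T2", "T5", "T1"), "M2"),
    Trajectory("M7", ("T5", "T1"), "M5"),
    Trajectory("M7", ("T5", "T5", "T1"), "M2"),
]

MSET_S = compute_mset(S)
```

Applied to the nine trajectories of S, the sketch keeps M3 [T1〉 M5, M5 [T5T1〉 M4 and M7 [T5T1〉 M5, which are the three MSET listed above.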

5 Detection and diagnosis

In this section, a detection and diagnosis method is proposed for temporal faults. This method uses a moving average control chart and the ETO previously defined. It has three steps: (a) the computation of residuals for each MSET, (b) the computation of the detection function, and (c) the computation of the isolation function. In addition to Assumptions 1–4 that are needed to obtain the ETO, Assumptions 5–8 will also be considered in this section to compute significant residuals.

5.1 Residuals computation

Once the MSET are computed, the durations between two successive measurements are assigned to all MSET that are consistent with these measurements. The detection and isolation of temporal faults are then based on the analysis of a set of residuals obtained for the MSET. For each MSET, the mean duration of the nominal behavior is first evaluated, and dN(mset, S) is defined as the mean duration in nominal behavior for the minimal-size elementary trajectory mset in set MSET(S). This evaluation can be obtained from an analytical computation, from the statistical analysis of a fault-free sequence of measurements, or from expert knowledge about the system. In order to compute the residuals used to detect and isolate the temporal faults, the proposed approach has the following steps:

  1. For each pair of consecutive measurements (ek−1, τk−1) and (ek, τk) of TRo, the state S consistent with the observed trajectory TRo(k) = (e1, τ1) … (ek−1, τk−1) (ek, τk) is computed thanks to the ETO.

  2. The duration dk = τk − τk−1, measured at time τk, is filtered with the MA control chart to smooth the variations.

  3. The filtered duration MA(dk) is associated with each minimal-size elementary trajectory mset in set MSET(S). The series of measurements collected for the minimal-size elementary trajectory mset in state S at time τk is consequently defined as D(mset, S, τk):

    $$D({\textit{mset}},S,\tau_{k} ) = \{ ({\text{MA}}(d_{k} ),\tau_{k} )\;{\text{such}}\,{\text{that}}\,{\textit{mset}}\, \in \,{\text{MSET}}(S)\,{\text{and}}\,S\,{\text{is}}\,{\text{consistent}}\,{\text{with}}\,{\text{TRo}}(k)\}$$
    (6)
  4. One difficulty is that the series of measurements D(mset,S,τk) are generated at specific times τk that only depend on the occurrence of the observable events. On the contrary, the detection and diagnosis decisions are expected to be computed periodically. To solve this issue, the series of measurements D(mset,S,τk) are resampled with a given sampling time dt according to Eq. (7):

    $$\begin{aligned} & D^{\prime } ({\textit{mset}},S,h.dt) = \{ ({\text{MA}}(d_{h} ),h.dt),h = 0, \ldots ,\left\lfloor {\tau_{k} /dt} \right\rfloor + 1, \\ & \quad {\text{and}}\,({\text{MA}}(d_{h} ),h.dt) \in D({\textit{mset}},S,\tau_{k} )\,{\text{is}}\,{\text{the}}\,{\text{measurement}}\,{\text{such}} \\ & \quad {\text{that}}\,\tau_{k} \,{\text{is}}\,{\text{the}}\,{\text{largest}}\,{\text{measurement}}\,{\text{time}}\,{\text{that}}\,{\text{satisfies}}\,\tau_{k} \le h.dt\} \\ \end{aligned}$$
    (7)

    A consequence of the resampling operation is that the values of the firing durations are maintained constant and equal to the last collected value as long as no new measurement is collected in series D(mset,S,τk).

  5. The series of residuals are computed from the series of durations. The filtered resampled durations MA(dh) are used to compute the residuals δ(MA(dh), mset, S) for all mset ∈ MSET(S):

    $$\delta ({\text{MA}}(d_{h} ),{\textit{mset}},S) = {\text{MA}}(d_{h} ) - d_{N} ({\textit{mset}},S)$$
    (8)
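
Steps 2 to 5 can be sketched in Python for a single series D(mset, S, τk). The sketch below is illustrative only: it assumes that every measured duration has already been assigned to the same mset (step 1 and the ETO are not modelled), the moving average window `window` is a hypothetical parameter, and the measurement times are synthetic values chosen for readability.

```python
from bisect import bisect_right
from collections import deque


def moving_average(values, window):
    """Causal moving average over the last `window` values (shorter at the start)."""
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out


def resample_hold(times, values, dt, t_end):
    """Eq. (7): at each h*dt keep the last value measured at a time <= h*dt."""
    last_h = int(t_end // dt) + 1
    out = []
    for h in range(last_h + 1):
        i = bisect_right(times, h * dt)  # number of measurement times <= h*dt
        out.append(values[i - 1] if i > 0 else None)  # None: nothing measured yet
    return out


def residuals(resampled, d_nominal):
    """Eq. (8): filtered duration minus nominal mean duration dN(mset, S)."""
    return [None if v is None else v - d_nominal for v in resampled]


# Synthetic measurement times tau_k (not taken from the paper).
tau = [0, 2, 4, 6, 12]
durations = [b - a for a, b in zip(tau, tau[1:])]    # d_k = tau_k - tau_{k-1}
filtered = moving_average(durations, window=2)       # MA(d_k)
resampled = resample_hold(tau[1:], filtered, dt=3, t_end=tau[-1])
delta = residuals(resampled, d_nominal=2.0)
```

The zero-order hold in `resample_hold` reflects the remark made in step 4: between two measurements, the last filtered duration is kept constant.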

5.2 Detection and isolation of temporal faults

For each transition Tj ∈ T, let us first define the set MSET(Tj) of MSET in which Tj fires. The two following functions are then defined for detection and diagnosis purposes:

$${\text{diag}}^{ + } (T_{j} ,h.dt) = \hbox{max} \{ \delta ({\text{MA}}(d_{h} ),{\textit{mset}},S)\,{\text{for}}\,{\text{all}}\,{\textit{mset}} \in {\text{MSET}}(T_{j} )\,{\text{and}}\,S \in R_{\text{obs}} \}$$
(9)
$${\text{diag}}^{ - } (T_{j} ,h.dt) = \hbox{min} \{ |\delta ({\text{MA}}(d_{h} ),{\textit{ mset,}}S)|\,{\text{for}}\,{\text{all}}\,{\textit{mset}} \in {\text{MSET}}(T_{j} )\,{\text{and}}\,S \in R_{\text{obs}} \}$$
(10)

The function detect(h.dt) is proposed as a detection function that captures the variation of the collected durations and evaluates whether a temporal fault has occurred at sampled time h.dt. This function is sensitive to the sum of the maximal residuals computed for each transition:

$${\text{detect}}\left( {h.dt} \right) = \frac{1}{q}\mathop \sum \limits_{{T_{j} \in T}} {\text{diag}}^{ + } \left( {T_{j} , h.dt} \right)$$
(11)

This function will be compared with a detection threshold ΔD in order to generate an alarm when the threshold is exceeded.

The fault isolation results from the combined use of the functions diag+(Tj, h.dt) and diag−(Tj, h.dt). These functions evaluate how the firing of a given transition Tj is affected by the variation of the collected durations. On the one hand, the function diag+(Tj, h.dt) is sensitive to the maximal value of the residuals computed for Tj. When this function increases significantly from zero, it means that a variation of the durations affects at least one of the MSET where Tj appears. Note that the use of diag+(Tj, h.dt) without diag−(Tj, h.dt) for isolation may lead to diagnosis errors by overestimating the risk that the temporal fault concerns Tj. On the other hand, the function diag−(Tj, h.dt) is sensitive to the minimal absolute value of the residuals computed for Tj. When this function increases significantly from zero, it means that a variation of the durations necessarily affects all MSET where Tj occurs. The use of diag−(Tj, h.dt) without diag+(Tj, h.dt) may also lead to diagnosis errors by underestimating the risk that the temporal fault concerns Tj. Consequently, both functions are used together to compute the probability prob(Tj, h.dt) that the detected fault has affected the transition Tj. Let us introduce a normalization parameter N(h.dt) at time h.dt with Eq. (12):

$$N\left( {h.dt} \right) = \mathop \sum \limits_{{\begin{array}{*{20}c} {T_{j} {\text{ such}}\,{\text{that }}} \\ {\left( {{\text{detect}}\left( {h \cdot dt} \right) > {\Delta_D} } \right)} \\ { \wedge ({\text{diag}}^{ + } \left( {T_{j} , h \cdot dt} \right) > {\Delta_D} )} \\ \end{array} }} {\text{diag}}^{ - } \left( {T_{j} , h.dt} \right)$$
(12)

The probability prob(Tj, h.dt) is computed for each h and Tj with Eq. (13):

$$\begin{array}{*{20}l} {{\text{prob}}(T_{j} ,h.dt) \, = {\text{diag}}^{ - } (T_{j} ,h.dt) /N(h.dt),} \hfill & {{\text{if}}\,({\text{detect}}\left( {h.dt} \right) > \Delta_{D} ) \wedge ({\text{diag}}^{ + } (T_{j} ,h.dt) > \Delta_{D} )} \hfill \\ {{\text{prob}}(T_{j} ,h.dt) \, = 0,} \hfill & {\text{otherwise}} \hfill \\ \end{array}$$
(13)
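
The evaluation of Eqs. (9) to (13) at one sampled time can be sketched as follows. Several points are assumptions of this sketch and not statements of the method: the divisor q of Eq. (11) is taken as the number of transitions, the probability is set to 0 for transitions that do not satisfy the condition of Eq. (12), and the residual values and mset identifiers are made up for illustration. The name `diagnose` is hypothetical.

```python
def diagnose(residuals, msets_of, threshold, q=None):
    """Evaluate detect, diag+, diag- and prob at one sampled time h*dt.

    residuals: dict, mset identifier -> residual delta at h*dt
    msets_of:  dict, transition Tj -> identifiers of the MSET in which Tj fires
    q:         divisor of Eq. (11); assumed here to be the number of transitions
    """
    q = len(msets_of) if q is None else q
    diag_plus = {t: max((residuals[m] for m in ms), default=0.0)
                 for t, ms in msets_of.items()}
    diag_minus = {t: min((abs(residuals[m]) for m in ms), default=0.0)
                  for t, ms in msets_of.items()}
    detect = sum(diag_plus.values()) / q
    # Candidates: alarm raised and diag+ above the threshold (condition of Eq. (12)).
    suspects = [t for t in msets_of
                if detect > threshold and diag_plus[t] > threshold]
    norm = sum(diag_minus[t] for t in suspects)
    # Assumption: probability 0 outside the candidate set or when norm is 0.
    prob = {t: diag_minus[t] / norm if t in suspects and norm > 0 else 0.0
            for t in msets_of}
    return detect, diag_plus, diag_minus, prob


# Made-up residuals for three MSET "a", "b", "c" (illustration only).
RESIDUALS = {"a": 4.0, "b": 0.1, "c": 3.8}
MSETS_OF = {"T1": ["a", "b"], "T2": ["a", "c"], "T3": ["b"]}
detect, diag_plus, diag_minus, prob = diagnose(RESIDUALS, MSETS_OF, threshold=1.0)
```

With these made-up values, diag+ is identical for T1 and T2 because both fire in the disturbed MSET "a", so diag+ alone cannot separate them. Only T2 has all of its MSET disturbed, which gives it a much larger diag− and therefore the larger probability.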

Note that the detection delay depends on the time drift and on the number of sensors used by the labeling function. For systems with few sensors, this delay may increase critically. Improving the detection delay is one of the perspectives of this work.

Example 7

Consider again the marked LTPN system < PN2, PDF, Ω, L, MI > in Fig. 6 with MI = (1 0 0 0 0)T and L(T1) = e1, L(T3) = e2, L(T4) = e3. Uniform bounded PDF are considered for all transitions with supports provided in Table 4.

Table 4 Support of the transition PDF for <SPN2, L, MI>

The mean values resulting from a fault-free trajectory of duration 4000 TU are reported in the last column of Table 2. Note that mean values are not computed for the transient states S1, S2, and S3 because at most a single measurement is collected for these states.

It is now assumed that transition T2 experiences a PDF support variation from [1.8, 2.2] to [5.8, 6.2] between 2000 and 4000 TU (for the clarity of the presentation, an abrupt change of the support was preferred for this example instead of a slow timed drift). The detection threshold is defined as ΔD = 1.
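
As a quick check of the size of this fault, recall that a uniform PDF with support [a, b] has mean (a + b)/2. The mean firing duration of T2 therefore changes as follows (the symbols μN and μF are introduced here only for illustration):

$$\mu_{N} (T_{2} ) = \frac{1.8 + 2.2}{2} = 2\,{\text{TU}},\quad \mu_{F} (T_{2} ) = \frac{5.8 + 6.2}{2} = 6\,{\text{TU}},\quad \mu_{F} (T_{2} ) - \mu_{N} (T_{2} ) = 4\,{\text{TU}}$$

Under the assumption that T2 fires once in a given MSET and that no other duration changes, the mean duration of this MSET shifts by roughly the same 4 TU.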

The functions detect(h.dt), diag+(Tj, h.dt) (in full line), and diag−(Tj, h.dt) (in dashed line) resulting from this simulation are reported in Fig. 8. Temporal faults are detected once the detection function detect(h.dt) exceeds the detection threshold (Fig. 8a). For the considered series, the detection time is 2020 TU and the delay is 20 TU, which corresponds to roughly ten successive transition firings. One can notice that the function diag+(Tj, h.dt) is not enough to isolate the fault. This is because T2 occurs in MSET that also include the transitions T1, T3, T4, and T5 (Table 2). For this reason, these transitions are also suspicious. The additional use of the function diag−(Tj, h.dt) confirms that the fault has affected T2. Note that in the present case, the function diag−(Tj, h.dt) by itself clearly indicates the faulty transition, but for more complicated cases, both functions are required. This is confirmed by the probabilities prob(Tj, h.dt) reported in Fig. 9 in addition to the residuals diag−(Tj, h.dt). In particular, one can notice that the probability prob(T2, h.dt) that isolates transition T2 is approximately equal to 1 from the detection time at 2020 TU.

Fig. 8 Diagnosis of < PN2, PDF, Ω, L, MI > with MI = (1 0 0 0 0)T: a detection function wrt time (TU); isolation functions wrt time (TU); b T1; c T2; d T3; e T4; f T5; g T6; h T7 (diag+(Tj, h.dt) in full line and diag−(Tj, h.dt) in dashed line)

Fig. 9 Fault probabilities for < PN2, PDF, Ω, L, MI > with MI = (1 0 0 0 0)T: b T1; c T2; d T3; e T4; f T5; g T6; h T7 (prob(Tj, h.dt) in full line and diag−(Tj, h.dt) in dashed line)

Consider again the marked LTPN system < PN2, PDF, Ω, L, MI > in Fig. 6 with MI = (2 0 0 0 0)T, L(T1) = e1, L(T3) = e2, L(T4) = e3 and the supports provided in Table 4. The transition T2 experiences again a PDF support variation from [1.8, 2.2] to [5.8, 6.2] between 2000 and 4000 TU. The detection threshold is defined as ΔD = 2.

The detection and isolation results are reported in Figs. 10 and 11. Detection occurs at time 2100 TU, so the detection delay increases to 100 TU. Comparing the probabilities prob(Tj, h.dt) after detection shows that transition T2 should be preferred to the other transitions as the faulty one. One can notice that the diagnosis decisions (and even the detection decision) are less confident in the present case because the larger number of possible behaviors increases the risk of error.

Fig. 10 Diagnosis of < PN2, PDF, Ω, L, MI > with MI = (2 0 0 0 0)T: a detection function wrt time (TU); isolation functions wrt time (TU); b T1; c T2; d T3; e T4; f T5; g T6; h T7 (diag+(Tj, h.dt) in full line and diag−(Tj, h.dt) in dashed line)

Fig. 11 Fault probabilities for < PN2, PDF, Ω, L, MI > with MI = (2 0 0 0 0)T: b T1; c T2; d T3; e T4; f T5; g T6; h T7 (prob(Tj, h.dt) in full line and diag−(Tj, h.dt) in dashed line)

6 Conclusion

This article has proposed an approach that can be used to detect and isolate temporal drifts in timed manufacturing systems that are characterized by cyclic behaviors and repetitive operations. Such systems were modeled as timed Petri nets where operations are represented by the net transitions and their durations correspond to the transition firing times, considered as random variables with arbitrary PDF. In this context, the temporal drifts were characterized by the variations of the mean firing durations. The detection and isolation problem has been solved by combining moving average control charts with a new class of observers that estimate the recent elementary trajectories consistent with the last measurements. On the one hand, fault detection has been obtained by comparing residual signals with thresholds. On the other hand, the isolation of the faulty operation among the set of fault candidates has been performed thanks to the analysis of the measurements with respect to the set of minimal-size elementary trajectories generated by the observer. The following concluding comments lead to some interesting perspectives:

  • The proposed method can also be used to test and compare several sensor configurations in order to select the most appropriate one. One interesting issue is to reverse the problem and to propose a method that searches for the best selection and positioning of the sensors in order to improve the performance of the detection.

  • When competition is used as a choice policy instead of preselection, one can compute, in a similar way, additional residuals based on the frequency of the MSET occurrences. With competition, conflicts are solved according to the duration of the firings. When a transition experiences a temporal fault, the conditions of the competition change and this modifies the MSET frequency.

The main limitation of the proposed approach is the rapid increase in the complexity of the observer design. In particular, the size of the elementary trajectory observer grows rapidly with respect to the initial marking of the LTPN and to the labeling function that models the sensors, resulting in large nets that may prevent the use of the method for more complicated systems. Consequently, the first objective of our future work will be to address the complexity and scalability issues. Another perspective is to pay more attention to the detection delay and to improve the approach in order to decrease this delay. The possible combined use of time and colors in Petri net models is also among the perspectives of this work.