1 Introduction

Process discovery techniques promise that, given enough data, it is possible to output a realistic model of the process as is. This evidence-based approach comes with caveats. First, one needs to assume that all inputs belong to the same process: not accounting for process variance over time might result in under- or over-constrained models that do not represent reality. Second, one must assume that full traces can be identified from the event log. This requirement presents considerable obstacles in organizations where processes are constantly evolving, either because the starting events are located in legacy systems no longer in use, or because current traces have not finished yet.

Accounting for change is particularly important in declarative processes. Based on an “outside-in” approach, declarative processes describe the minimal set of rules that generate accepting traces. For process mining, the simplicity of declarative processes has been shown to fit well with real process executions, and declarative miners are currently the most precise miners in use. However, little research exists on how sensitive declarative miners are to process change. The objective of this paper is to study how declarative miners can give accurate and timely views of partial traces (so-called event streams). We integrate techniques from streaming process mining into declarative modelling notations, in particular DCR graphs [14]. While previous work on streaming conformance checking has addressed other declarative languages (e.g., Declare [23]), these languages are fundamentally different: Declare provides a predefined set of 18 constraint templates with an underlying semantics based on LTL formulae over finite traces [12], whereas DCR is based on a minimal set of 5 constraints and is able to capture regular and omega-regular languages [13]. In comparison with Declare, DCR is a language adopted by industry: DCR is integrated into KMD Workzone, a case management solution used by 70% of central government institutions in Denmark [22].

Event streams present challenges for discovery. Streams are potentially infinite, making memory and time complexity major concerns. Our technique optimizes these aspects by relying on intermediate representations that are updated at runtime. Another aspect is extensibility: our technique can be extended to more complex workflow patterns via the combination of atomic DCR constraints. Figure 1 illustrates our contribution: a streaming mining component capable of continuously generating DCR graphs from an event stream (we use the plural graphs to indicate that the DCR model may evolve over time, to accommodate drifts that might occur). Towards the long-term goal of a system capable of spotting changes in a detailed fashion, we also sketch a simple model-to-model metric for DCR, which can be used to compare the results of stream mining with a catalogue or repository of processes. An implementation of our techniques, together with tests and datasets, is available in Beamline [6].

Fig. 1. Contribution of the paper

The rest of the paper is structured as follows: related work is presented in Sect. 2; theoretical background is covered in Sect. 3; the streaming discovery technique is presented in Sect. 4 and the approach is validated in Sect. 5. Section 6 concludes.

2 Related Work

This is the first work aiming at discovering DCR graphs from event streams. We find related work in the offline discovery of DCR graphs and in streaming process mining for Declare.

Offline Process Discovery Techniques. The most recent discovery technique for DCR graphs is the DisCoveR algorithm [4]. In their paper, the authors claim an accuracy of 96.1% with linear time complexity (in PDC 2021 the algorithm achieved 96.2%). The algorithm is an extension of the ParNek algorithm [21] using an efficient implementation of DCR mapping via bit vectors. In its most recent version [24], DisCoveR has been extended with the idea of using both positive and negative examples to produce a more precise process model. Other related works derive from conformance checking [10] and process repair [1] techniques. Both fields aim at understanding whether executions can be replayed on top of an existing process model. However, in our case we want to separate the identification of the processes (i.e., control-flow discovery) from the calculation of their similarity (i.e., the model-to-model metric), so that these two contributions can be used independently of each other. Conformance checking and process repair, on the other hand, embed the evaluation and the improvement into one activity.

Online Discovery for Declarative Models. In [7], a framework for the discovery of Declare models from streams was introduced as a way to deal with large collections of datasets that are impossible to store and process as a whole. In [20], this work was generalized to the mining of data constraints, leveraging the MP-Declare notation [9].

Streaming Process Mining in General. In his PhD thesis [29], van Zelst proposes process mining techniques applicable to process discovery, conformance checking, and process enhancement from event streams. An important conclusion from his research is the idea of building intermediate models that capture the knowledge observed in the stream before creating the final process model. In [5], the author presents a taxonomy for the classification of streaming process mining techniques. Our technique constitutes a hybrid approach in the categories of [5]: it mixes a smart window-based model, used to construct and keep up to date an intermediate structure, with a problem reduction technique, used to transform that structure into a DCR graph.

3 Background

In this section, we recall basic notions of Directly Follows Graphs [1] and Dynamic Condition Response (DCR) graphs [14]. While, in general, DCR is expressive enough to capture multi-perspective constraints such as time and data [15, 26], in this paper we use the classical, set-based formulation first presented in [14], which contains only the four most basic behavioural relations: conditions, responses, inclusions and exclusions.

Definition 1 (Sets, Events and Sequences)

Let \(\mathcal {C}\) denote the set of possible case identifiers and let \(\mathcal {A}\) denote the set of possible activity names. The event universe is the set of all possible events \(\mathcal {E} = \mathcal {C} \times \mathcal {A}\) and an event is an element \(e =(c,a) \in \mathcal {E}\). Given a set \(\mathbb {N}_n^+ = \{1, 2, \dots , n\}\) and a target set A, a sequence \(\sigma : \mathbb {N}_n^+ \rightarrow A\) maps index values to elements in A. For simplicity, we can consider sequences using a string interpretation: \(\sigma = \langle a_1,\dots ,a_n \rangle \) where \(\sigma (i) = a_i \in A\).

We can now formally characterize an event stream:

Definition 2 (Event stream)

An event stream is an unbounded sequence mapping indexes to events: \(\mathcal {S}: \mathbb {N}^+ \rightarrow \mathcal {E}\).
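
For illustration, an event stream can be represented in code as an unbounded iterator of case-activity pairs. The following toy Python fragment (our own sketch; case ids and activity names are invented) shows the shape assumed by the code sketches later in the paper:

```python
# A toy sketch of Definition 2: an unbounded iterator of events (c, a).
import itertools
from typing import Iterator, Tuple

Event = Tuple[str, str]  # (case id c, activity name a)

def example_stream() -> Iterator[Event]:
    for i in itertools.count():  # unbounded: one case after another
        for activity in ("register", "check", "decide"):
            yield (f"case-{i}", activity)
```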

Definition 3 (Directly Follows Graph (DFG))

A DFG is a graph \(G = (V,R)\) where nodes represent activities (i.e., \(V \subseteq \mathcal {A}\)), and edges indicate directly follows relations from source to target activities (i.e., \((a_s,a_t) \in R\) with \(a_s, a_t \in V\), so \(R \subseteq V \times V\)).

Definition 4 (Extended DFG)

An extended DFG is a graph \(G_x = (V, R, X)\) where (V, R) is a DFG and X contains additional numerical attributes referring to the nodes: \(X: V \times \textit{Attrs} \rightarrow \mathbb {R}\), where \(\textit{Attrs}\) is the set of all attribute names. To access attribute \(\alpha _1\) for node v we use the notation \(X(v,\alpha _1)\).

We use the following attributes: avgFO, the average index of the first appearance of an activity in a trace; noTraceApp, the current number of traces containing the activity; avgIdx, the average index of the activity in a trace; and noOccur, the number of occurrences of the activity.
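
As an illustration, the following Python sketch (assumed names, not the paper's implementation) shows how such a structure, together with the four attributes above, can be maintained incrementally, one observed event at a time:

```python
# A sketch of the extended DFG of Definition 4 with its four node attributes.
from collections import defaultdict

class ExtendedDFG:
    def __init__(self):
        self.nodes = set()              # V, a subset of the activity names
        self.edges = set()              # R, directly-follows pairs in V x V
        self.attrs = defaultdict(dict)  # X: V x Attrs -> R

    def add_edge(self, source, target):
        self.nodes.update((source, target))
        self.edges.add((source, target))

    def update_attrs(self, activity, index_in_trace, new_trace):
        """Incrementally maintain the four attributes for one observation."""
        x = self.attrs[activity]
        n = x.get("noOccur", 0)
        x["noOccur"] = n + 1
        # running average of the activity's index within its trace
        x["avgIdx"] = (x.get("avgIdx", 0) * n + index_in_trace) / (n + 1)
        if new_trace:  # first appearance of the activity in this trace
            t = x.get("noTraceApp", 0)
            x["noTraceApp"] = t + 1
            x["avgFO"] = (x.get("avgFO", 0) * t + index_in_trace) / (t + 1)
```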

Definition 5 (DCR Graph)

A DCR graph is a tuple \(P = \langle \mathcal {A}, M, \rightarrow \!\bullet , \bullet \!\rightarrow , \rightarrow \!+, \rightarrow \!\% \rangle \), where \(\mathcal {A}\) is a set of activities, \(M \subseteq \mathcal{{P}}(\mathcal {A}) \times \mathcal{{P}}(\mathcal {A}) \times \mathcal{{P}}(\mathcal {A})\) is a marking, and each \(\phi \in \{\rightarrow \!\bullet , \bullet \!\rightarrow , \rightarrow \!+, \rightarrow \!\%\}\) with \(\phi \subseteq \mathcal {A}\times \mathcal {A}\) is a relation between activities: conditions, responses, inclusions and exclusions, respectively.

A DCR graph defines processes whose executions are finite and infinite sequences of activities. An activity may be executed several times. The three sets of activities in the marking \(M=(\textsf{Ex},\textsf{Re},\textsf{In})\) define the state of a process, and they are referred to as the executed activities (\(\textsf{Ex}\)), the pending responses (\(\textsf{Re}\)) and the included activities (\(\textsf{In}\)). DCR relations define the effect of executing one activity in the graph. Briefly: a condition \(a \rightarrow \!\bullet \, a'\) says that the execution of a is a prerequisite for \(a'\): if a is included, then a must have been executed for \(a'\) to be enabled for execution. A response \(a \bullet \!\rightarrow \, a'\) says that whenever a is executed, \(a'\) becomes pending; in a run, a pending activity must eventually be executed or be excluded. We refer to \(a'\) as a response to a. An inclusion \(a \rightarrow \!+ \, a'\) (respectively exclusion \(a \rightarrow \!\% \, a'\)) means that if a is executed, then \(a'\) is included (respectively excluded).
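
To make the effect of these relations concrete, the following minimal Python sketch (our own rendering of the semantics described above, not taken from [14]) encodes enabledness and execution over a marking \(M=(\textsf{Ex},\textsf{Re},\textsf{In})\). Each relation is represented as a map from an activity a to the set of activities \(a'\) it points to:

```python
# A minimal sketch of DCR enabledness and execution (names are ours).

def enabled(a, Ex, In, conditions):
    """a is enabled iff it is included and every *included* activity that is
    a condition for a has already been executed."""
    if a not in In:
        return False
    pre = {b for b, targets in conditions.items() if a in targets}
    return (pre & In) <= Ex

def execute(a, Ex, Re, In, responses, includes, excludes):
    """Effect of executing a on the marking (Ex, Re, In)."""
    Ex = Ex | {a}                              # a becomes executed
    Re = (Re - {a}) | responses.get(a, set())  # a's responses become pending
    In = (In - excludes.get(a, set())) | includes.get(a, set())
    return Ex, Re, In
```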

For a DCR graph P with activities \(\mathcal {A}\) and marking \(M=(\textsf{Ex},\textsf{Re},\textsf{In})\), we write \(P_{\rightarrow \!\bullet }\) for the set of pairs in the condition relation (similarly for any of the relations in \(\phi \)), and we write \(P_\mathcal {A}\) for the set of activities. Definition 5 omits the set of labels and the labelling function present in [14]. This has a consequence on the set of observable traces. Assume a graph \(G = \langle \{a,b\}, (\emptyset , \{a, b\}, \{a, b\}), \emptyset , \emptyset , \emptyset ,\emptyset \rangle \), a set of labels \(L=\{p\}\) and a labelling function \( l= \{(a,p),(b,p)\}\). A possible run of G has the shape \(\sigma = \langle p, p \rangle \), which can be generated by 1) two executions of a, 2) two executions of b, or 3) an interleaved execution of a and b. By removing the labels from the events (or, alternatively, assuming a bijective labelling function in [14]), we ensure that two occurrences of the same event in the stream imply event repetition.

4 Streaming DCR Miner

This section presents the general structure of the stream mining algorithm for DCR graphs. The idea of the approach, depicted in Fig. 2, is to construct and maintain an extended DFG structure (cf. Definition 4) from the stream and then, periodically, to extract a new DCR graph from the most recent version of the extended DFG. The extraction of the different DCR rules starts from the same extended DFG instance. For readability, we split the approach into two phases: the former (Algorithm 1) is in charge of extracting the extended DFG; the latter (Algorithms 2, 3 and 4) focuses on the extraction of DCR rules from the extended DFG.

Fig. 2. Conceptual representation of the discovery strategy in this paper.

[Algorithm 1: listing omitted]

Algorithm 1 takes as input a stream of events S, two parameters specifying the maximum number of traces \(m_t\) and the maximum number of events per trace \(m_e\) to store, and a set of DCR patterns to mine. The algorithm starts by initializing two supporting map data structures \(\textsf {obs}\) and \(\textsf {deps}\), as well as an empty extended DFG \(G_X\) (lines 1–3). \(\textsf {obs}\) is a map associating case ids to partial traces; \(\textsf {deps}\) is a map associating case ids to activity names. After initialization, the algorithm starts consuming the actual events in a never-ending loop (line 4). The initial step consists of receiving a new event (line 5). Then two major steps take place: first, the extended DFG is updated; second, the extended DFG is transformed into a DCR model.

To update the extended DFG, the algorithm first updates the set of nodes and the extra attributes. If the case id c of the new event has been seen before (line 6), the algorithm refreshes the update time of the case id (line 7, useful to keep track of which cases are the most recent) and checks whether the maximum length of the partial trace for that case id has been reached (line 8). If so, the oldest event is removed and \(G_X\) is updated to incorporate the removal. If this is the first time the case id is seen (line 11), it is first necessary to verify that the new case can be accommodated (line 12): if there is no room, space is created by removing the oldest cases and propagating the corresponding changes (lines 13–14), and a new empty list is created to host the partial trace (line 15). In either situation, the new event is added to the partial trace (line 16) and, if needed, a new node is added to the set of vertices V (line 17). The X data structure is then refreshed by leveraging the properties of the partial trace seen so far (line 18). To update the relations in the extended DFG (i.e., the R component of \(G_X\)), the algorithm checks whether an activity was seen previously for the given case id c and, if so, adds the relation from that activity (i.e., \(\textsf {deps}(c)\)) to the activity just seen (i.e., a) (lines 19–20). In any case, the observed activity becomes the latest activity for case id c (line 21), and old cases (i.e., cases not likely to receive further events) are removed from \(\textsf {deps}\) (line 23).

Finally, the algorithm refreshes the DCR model by calling the procedure (lines 25–26) that transforms the extended DFG into a DCR model (cf. Algorithm 2). Updates can be triggered based on some periodicity (line 24) or based on the amount of behaviour seen; the mechanics of such periodicity are beyond the scope of this paper.
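
The following Python sketch condenses the control flow just described. It is our own rendering of Algorithm 1 under simplifying assumptions: the update-time bookkeeping of lines 7 and 21–23 is folded into the hypothetical helpers forget and evict_oldest_case, and ExtendedDFG is the structure sketched in Sect. 3:

```python
# A condensed sketch of Algorithm 1 (helper names forget, evict_oldest_case
# and dfg_to_dcr are ours; the latter stands for Algorithm 2, sketched below).

def stream_miner(S, m_t, m_e, patterns, period=1000):
    obs, deps = {}, {}            # case id -> partial trace / last activity
    G = ExtendedDFG()
    for i, (c, a) in enumerate(S, start=1):     # never-ending loop
        if c in obs:
            if len(obs[c]) >= m_e:              # partial trace full:
                G.forget(obs[c].pop(0))         # drop its oldest event
        else:
            if len(obs) >= m_t:                 # no room for a new case:
                evict_oldest_case(obs, deps, G) # drop the oldest case
            obs[c] = []
        obs[c].append(a)                        # extend the partial trace
        G.nodes.add(a)
        G.update_attrs(a, index_in_trace=len(obs[c]),
                       new_trace=obs[c].count(a) == 1)  # within stored window
        if c in deps:
            G.add_edge(deps[c], a)              # directly-follows relation
        deps[c] = a                             # a is now c's last activity
        if i % period == 0:                     # periodic refresh
            yield dfg_to_dcr(G, patterns)       # cf. Algorithm 2
```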

[Algorithm 2: listing omitted]

Algorithm 2 generates a DCR graph from an extended DFG. First, it (1) matches patterns that describe occurrences of atomic DCR constraints in the extended DFG; then, it (2) combines them into composite patterns that describe the most common behaviour. Given a set of relation patterns T, \(\langle T, \le \rangle \) denotes a pattern dependency poset, with \(\le \) a partial order over T, and \(MinimalElements(\langle T, \le \rangle ) = \{ x \in T \mid \not \exists y\in T. y \le x\}\). Treating patterns as posets allows us to reuse and simplify the outputs of the discovery algorithm. Consider a pattern describing a sequential composition from a to b (similar to a flow in BPMN). A DCR model that captures such sequential behaviour needs 4 constraints, combining conditions, responses and exclusions. Consider \(T=\{T_1:\textit{Condition}, T_2:\textit{Response}, T_3:\textit{Exclusion}, T_4:\textit{Sequence}\}\). The pattern poset \(\langle T, \{(T_4, T_1), (T_4,T_2), (T_4,T_3)\}\rangle \) defines the dependency relations for a miner capable of mining sequential patterns. Additional patterns (e.g., exclusive choices, escalation patterns, etc.) can be modelled similarly. Pattern posets are finite, thus minimal elements exist.

The generation of a DCR model from an extended DFG is described in Algorithm 2. We illustrate the mining of DCR conditions, responses and self-responses; more patterns are available in [25]. The algorithm takes as input an extended DFG \(G_X\) and a pattern poset. It starts by creating an empty DCR graph P with activities equal to the nodes in \(G_X\) and initial marking \(M_{\textit{init}} = (\emptyset ,\emptyset , V)\), that is, all activities are included, not pending and not executed. We then split the processing between atomic patterns (those with no dependencies) and composite patterns. The map \(\textit{Rel}\) stores the relations from atomic patterns, which will be used by the composite miner. We use the merge notation \(P \oplus \textit{Rels}\) to denote the creation of a DCR graph whose activities and marking are the same as P's, and whose relations are the pairwise union of the range of \(\textit{Rels}\) and the corresponding relational structure in P. Line 11 applies a transitive reduction strategy [4], reducing the number of relations while maintaining identical reachability properties.
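
As an illustration, the overall flow of Algorithm 2 can be sketched as follows (names are ours; mine_atomic, mine_composite, merge and transitive_reduction stand for Algorithms 3 and 4, the \(\oplus \) merge, and the reduction of line 11, respectively). We represent the pattern poset as a map from each pattern to its dependencies, so atomic patterns are exactly those with an empty dependency set:

```python
# A sketch of Algorithm 2's overall flow under assumed names.

def dfg_to_dcr(G, depends_on):
    P = {"act": set(G.nodes),
         "marking": (set(), set(), set(G.nodes))}  # (Ex, Re, In) = (∅, ∅, V)
    rels = {}                                      # the map Rel of Algorithm 2
    atomic = {t for t, deps in depends_on.items() if not deps}
    for t in atomic:
        rels[t] = mine_atomic(G, t)                # cf. Algorithm 3
    for t in depends_on.keys() - atomic:
        rels[t] = mine_composite(G, t, rels)       # cf. Algorithm 4
    return transitive_reduction(merge(P, rels))    # P ⊕ Rels, then reduce
```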

[Algorithms 3 and 4: listings omitted]

The atomic and composite miners are described in Algorithms 3 and 4. The atomic miner in Algorithm 3 iterates over all node dependencies in the DFG and matches them against the existing set of implemented patterns. Take the case of a response constraint: we identify it if, on average, s occurs before t (line 6). This condition, together with the dependency between s and t in \(G_X\), is sufficient to infer a response constraint from s to t. To detect conditions, the algorithm verifies another set of properties: given a dependency between s and t, it checks that the first occurrence of s precedes t and that s and t appear in the same traces (approximated by counting the number of traces containing both activities, line 9). The composite miner in Algorithm 4 receives the DFG, a pattern, and the list of relations mined from atomic patterns. We provide an example for the case of include and exclude relations: this pattern is built as a combination of self-exclusions, precedences and not-chain-successions. As these atomic patterns each generate a set of include/exclude relations, the composite pattern simply takes their union.
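
A sketch of these atomic checks in Python, phrased over the extended DFG attributes of Definition 4 (the exact tests are those of Algorithm 3; the co-occurrence test below is our rough approximation):

```python
# A sketch of the atomic miner's checks, using the attributes avgIdx, avgFO
# and noTraceApp maintained in the extended DFG.

def mine_atomic(G, pattern):
    mined = set()
    for (s, t) in G.edges:                   # dependencies observed in the DFG
        Xs, Xt = G.attrs[s], G.attrs[t]
        if pattern == "Response":
            if Xs["avgIdx"] < Xt["avgIdx"]:  # on average, s occurs before t
                mined.add((s, t))
        elif pattern == "Condition":
            # first occurrence of s precedes t, and s and t appear in the
            # same traces (approximated here by comparing trace counts)
            if Xs["avgFO"] < Xt["avgFO"] and Xs["noTraceApp"] == Xt["noTraceApp"]:
                mined.add((s, t))
    return mined
```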

Suitability of the Algorithms for Streaming Settings. When discussing algorithms that tackle the streaming process mining problem [5], it is important to keep in mind that, while a stream is assumed to be infinite, only a finite amount of memory can be used to store information, and the time complexity for processing each event must be constant. Concerning memory, an upper bound on the number of events stored by Algorithm 1 is given by \(m_t \cdot m_e\), where \(m_t\) is the maximum number of parallel traces and \(m_e\) the maximum number of events stored per trace (e.g., with \(m_t = 100\) and \(m_e = 50\), at most 5,000 events reside in memory at any time). Moreover, note that the extended DFG is also finite, since there is a node for each activity retained in memory. Concerning time complexity, Algorithm 1 does not perform any unbounded backtracking: for each event, it operates only on maps, which have amortized constant complexity, or on the extended DFG (which has a finite, controlled size). The same observation holds for Algorithm 2, as it iterates over the extended DFG, whose size is bounded by the provided parameters (and can hence be considered constant).

5 Experimental Evaluation

To validate our approach we executed several tests: first to quantitatively validate the streaming discovery on synthetic data, then to qualitatively evaluate the whole approach on a real dataset. Due to lack of space, we report only the quantitative tests, while performance results and the qualitative evaluation can be found in a separate technical report [8].

5.1 Quantitative Evaluation of Streaming Discovery

Recall from the previous section that time and space complexities are constant in streaming settings. Our analysis thus focuses on how the algorithm behaves when encountering sudden changes in a stream. We compare with existing process discovery algorithms for DCR graphs, in this case the DisCoveR miner [4]. The tests are performed on a publicly available dataset of event streams [11], which includes (1) a synthetic stream inspired by a loan application process and (2) perturbations of the original stream obtained using change patterns [28]. Recall that DisCoveR is an offline miner and thus assumes an infinite memory model; to provide a fair evaluation, we parameterize DisCoveR with the same amount of available memory as our miner. We divided the experiment into two parts: a simple stream, where the observations of each process instance arrive in an ordered manner (i.e., one complete process instance at a time), and a complex stream, where observations from many instances arrive intertwined. As no initial DCR graph exists for this process and no other streaming DCR miner exists, we used the DisCoveR miner in its original (offline) setting to generate a baseline graph from the entire dataset. This model (the one calculated with offline DisCoveR) was used to calculate the model-to-model similarity between the DCR stream miner and the memory-limited DisCoveR miner. For the sake of simplicity, in this paper we consider only sudden drifts, leaving other types of drift for future work.

Fig. 3. Performance comparison between the offline DisCoveR miner and the streaming DCR Miner with equal storage available (capacity of up to 100 and 500 events).

We introduce a metric that quantifies the similarity between two DCR graphs. It can be used, for example, to identify which process is being executed with respect to a model repository, or to quantify the change rate of one process over time. The metric takes as input two DCR graphs P and Q, as well as a weight relation W that associates each DCR relation in \(\phi \) (cf. Definition 5) with a weight, plus one additional weight for the activities. It then computes the weighted Jaccard similarity [17] of the sets of relations and the set of activities, similarly to what is done in [2] for imperative models:

Definition 6 (DCR Model-to-Model metric)

Given two DCR graphs P and Q, and a weight function \(W: \phi \cup \{\textsf {act}\} \rightarrow [0, 1]\) such that \(\sum _{r \in \phi \cup \{\textsf {act}\}} W(r)=1\), the model-to-model similarity metric is defined as:

$$\begin{aligned} S(P,Q,W) = W(\textsf {act}) \cdot \frac{|P_\mathcal {A} \cap Q_\mathcal {A}|}{|P_\mathcal {A} \cup Q_\mathcal {A}|} + \sum _{r \in \phi } W(r) \cdot \frac{|P_r \cap Q_r |}{|P_r \cup Q_r|} \end{aligned}$$
(1)

The similarity metric compares the activities and the relations of the two DCR graphs, returning a value between 0 and 1, where 1 indicates a perfect match and 0 no match at all. A brief evaluation of the metric is reported in Appendix A.
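
The metric translates directly into code. The following Python sketch computes Eq. 1, representing a model as a dict with one set of activities (act) and one set of pairs per relation (relation names are ours; we define the Jaccard index of two empty sets as 1):

```python
# A direct transcription of Definition 6: weighted Jaccard similarity over
# the activities and over each DCR relation.

def similarity(P, Q, W):
    def jaccard(A, B):
        return len(A & B) / len(A | B) if (A | B) else 1.0
    s = W["act"] * jaccard(P["act"], Q["act"])
    for r in ("condition", "response", "include", "exclude"):
        s += W[r] * jaccard(P[r], Q[r])
    return s
```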

The results of the quantitative evaluation are reported in Fig. 3. Each plot shows the performance of the incremental version of DisCoveR and of the streaming DCR miner, for two different memory configurations, over time. The vertical black bars indicate where a sudden drift occurs in the stream. While the performance on the simple stream is very good for both the DisCoveR and the streaming DCR miners, when the stream becomes more complicated (i.e., Fig. 3b) DisCoveR becomes less effective: although its average performance increases over time, the drifts completely disrupt its accuracy. In contrast, our approach is more robust to drifts and more stable over time, proving its ability to manage the available memory more effectively.

5.2 Discussion

One of the limitations of the approach regards precision with respect to offline miners, and a limiting aspect of our work is the choice of the intermediate structure. A DFG representation may report confusing model behaviour, as it simplifies the observations using a purely frequency-based threshold [27]. A DFG is, in essence, an imperative data structure that captures the most common flows appearing in a stream; in a sense, this goes against the declarative paradigm, treating declarative constraints as second-class citizens. We believe that the choice of the DFG as intermediate data structure entails a loss of precision with respect to the DisCoveR miner in offline settings. However, in an online setting the DFG still provides a valid approximation for streams in which complete traces are not available. This is far from an abnormal situation: IoT communication protocols such as MQTT [16] assume that subscriber nodes may connect to the network after communications have started, making it impossible to identify starting events. More generally, in a streaming setting it is impossible to know exactly when a certain execution is complete and, especially in declarative settings, certain constraints describe liveness behaviours that can only be verified after a whole trace has been inspected. While watermarking techniques [3] could be employed to cope with lateness issues, we decided to favour self-contained approaches in this paper, leaving the exploration of watermarking techniques for future work.

6 Conclusion and Future Work

This paper presented a novel streaming discovery technique capable of extracting declarative models, expressed in the DCR language, from event streams. Additionally, we reported a model-to-model metric that quantifies whether and to what extent two DCR models are the same. An experimental evaluation, comprising both synthetic and real data, validated the two contributions separately, as well as their combination, in a qualitative fashion that included interviews with the process owner.

We plan to explore several directions in future work. Regarding the miner, we plan to extend its capabilities to the identification of sub-processes, nesting, and data constraints. Regarding the model-to-model similarity, we would like to embed more semantic aspects, such as those mentioned in [18]. A possible limitation of the streaming miner followed here relates to the updating mechanism: currently, lines 22–24 of Algorithm 1 trigger updates purely periodically, based on time, which generates notifications even when no potential changes in the model have been identified. A possible extension would be to integrate the model-to-model similarity as a parameter of the discovery algorithm, so that models are only updated after a given change threshold (a similarity value specified by the user) is reached, as sketched below.
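
A minimal sketch of this idea, reusing the similarity metric of Definition 6 (this is a speculative illustration, not part of the implemented miner; the threshold value and helper names are hypothetical):

```python
# A hypothetical change-gated refresh: only publish a newly mined model when
# it differs enough from the last published one.

def publish_if_changed(new_model, last_published, W, threshold=0.95):
    if last_published is None or similarity(new_model, last_published, W) < threshold:
        return new_model      # enough change observed: publish the update
    return last_published     # otherwise keep the current model
```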