1 Introduction

The previous chapter has introduced the alpha algorithm and the inductive mining algorithm as basic algorithms that discover an accepting Petri net from a (simplified) event log. It has also shown a number of example event logs for which these two basic algorithms work well. However, these two basic algorithms do not always perform well; their performance depends on the characteristics of the given event log.

In this chapter, we first introduce an example event log where the recorded process behavior features intertwined parallel compositions and exclusive choices. Second, we discuss the results of the alpha algorithm and the inductive mining algorithm on this example event log, showing that there is room for improvement. Third, we introduce four advanced process mining algorithms, discussing the results of using these algorithms on the example event log – highlighting their benefits and limitations. The first two advanced algorithms use region-based techniques to discover accepting Petri nets, where the first algorithm uses state-based regions and the second uses language-based regions. The third algorithm relies on sophisticated approaches to pre-process the DFG prior to the identification of the behavioral semantics of splits and joins, and it natively outputs BPMN models. Whereas these three algorithms produce imperative process models, the fourth algorithm generates declarative process models (like Declare) called log skeletons. As we shall see in this chapter, thanks to their advanced approaches, these mining algorithms can handle event logs that record very complex process behavior better than the basic mining algorithms do. At the same time, these algorithms should not be considered bullet-proof solutions for automated process discovery, as, in general, their results vary depending on the input event log.

2 Motivation

To motivate the need for advanced process discovery algorithms, we introduce the event log \(L_6 = [\langle a,b,c,g,e,h \rangle ^{10}, \langle a,b,c,f,g,h \rangle ^{10}, \langle a,b,d,g,e,h \rangle ^{10}, \langle a,b,d,e,g,h \rangle ^{10}, \langle a,b,e,c,g,h \rangle ^{10}, \langle a,b,e,d,g,h \rangle ^{10}, \langle a,c,b,e,g,h \rangle ^{10}, \langle a,c,b,f,g,h \rangle ^{10}, \langle a,d,b,e,g,h \rangle ^{10}, \langle a,d,b,f,g,h \rangle ^{10}]\). At first sight, there seems to be a choice between c and d, followed by a choice between e and f. However, it is more complicated than that, as traces like \(\langle a,c,b,g,e,h \rangle \) and \(\langle a,b,d,f,g,h \rangle \) are not included in \(L_6\).

Fig. 1.
figure 1

The directly-follows graph of the event log \(L_6\).

Figure 1 shows the DFG that results from the event log \(L_6\). Clearly, this DFG is not as symmetric as we would have thought after a first glance at \(L_6\). For example, e can be directly followed by c or d, but f is always directly followed by g.
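To make these observations concrete, the DFG can be computed with a few lines of code. The sketch below is illustrative only: the string encoding of traces and the `(trace, multiplicity)` pairs are our own convention, not part of any tool described in this chapter.

```python
# Illustrative sketch: computing the directly-follows graph (DFG) of L_6.
# Traces are encoded as strings of single-letter activities; each pair
# (trace, multiplicity) mirrors the multiset notation of the log.
from collections import Counter

def dfg(log):
    """Count directly-follows arcs (x, y) over all traces of the log."""
    arcs = Counter()
    for trace, mult in log:
        for x, y in zip(trace, trace[1:]):
            arcs[(x, y)] += mult
    return arcs

L6 = [("abcgeh", 10), ("abcfgh", 10), ("abdgeh", 10), ("abdegh", 10),
      ("abecgh", 10), ("abedgh", 10), ("acbegh", 10), ("acbfgh", 10),
      ("adbegh", 10), ("adbfgh", 10)]

arcs = dfg(L6)
print(sorted(y for (x, y) in arcs if x == "f"))  # ['g']: f is only followed by g
print(sorted(y for (x, y) in arcs if x == "e"))  # ['c', 'd', 'g', 'h']
```

The asymmetry noted above is directly visible: e has four distinct successors, while f has exactly one.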

Fig. 2.
figure 2

The accepting Petri net discovered by the alpha algorithm from the event log \(L_6\).

Figure 2 shows the accepting Petri net that results from running the alpha algorithm on event log \(L_6\). The places with the > sign are places with a larger inflow than outflow, whereas the places with the < symbol are places with a smaller inflow than outflow. This is a clear indication that this net has quality problems, which is also confirmed by the fact that in this net the final marking is not reachable from the initial marking. It is possible to put a token in the final place, but then there would be other tokens in the net as well. Precisely, there would be tokens in the place that is the output of a and e and the input of c.

Fig. 3.
figure 3

The process tree discovered by the Inductive Mining Algorithm for event log \(L_6\).

Figure 3 shows the process tree that results from running the inductive mining algorithm on event log \(L_6\). Although the process tree guarantees that the final marking is always reachable from the initial marking, it allows for too much behavior. For example, it is possible to execute both e and f, or neither, even though in \(L_6\) exactly one of these two activities is observed per trace. Also, the fact that f is always directly followed by g is not captured by this process tree.

This shows that, for more complex event logs, we need algorithms that are more advanced than the alpha algorithm and the inductive mining algorithm. This chapter introduces four such advanced algorithms, each having more success in discovering a process model from the event log \(L_6\) than the basic algorithms from the previous chapter:

  1. The State-based Region Miner, which produces accepting Petri nets like the basic algorithms do;

  2. The Language-based Region Miner, which also produces accepting Petri nets;

  3. The Split Miner, which produces BPMN models;

  4. The Log Skeleton Miner, which produces declarative process models (like Declare [45]) called log skeletons.

These four advanced algorithms are discussed in the next sections. As the first two algorithms both use the theory of regions, they are discussed in a single section. Then, we continue with the Split Miner, and we conclude with the Log Skeleton Miner.

Fig. 4.
figure 4

The accepting Petri net discovered by the State-based Region Algorithm for event log \(L_6\).

3 The Theory of Regions

The theory of regions [30] was proposed in the early nineties to define a formal correspondence between behavior and structure. In particular, several region-based algorithms have been proposed in the last decades to synthesize specifications into Petri nets using this powerful theory.

As mining is a form of synthesis, several approaches have appeared to mine process models from event logs. Regardless of the region-based technique applied, the approaches that rely on region theory search for a process model that is both fitting and precise [17]. This section presents two branches of region-based approaches for process discovery: state-based and language-based approaches.

3.1 State-Based Region Approach for Process Discovery

Figure 4 shows the accepting Petri net that results from running the State-based Region Algorithm on event log \(L_6\). Note that for all places the inflow equals the outflow. In the remainder of this section we will provide an overview of the main ingredients of state-based region discovery.

State-based region approaches for process discovery need to convert the event log into a state-based representation, which is then used to discover the Petri net. This representation is formalized in the following definition.

Definition 1 (Transition System)

A transition system (TS) is a tuple \((S, \varSigma , A,s_{in})\), where S is a set of states, \(\varSigma \) is an alphabet of activities, \(A \subseteq S\times \varSigma \times S\) is a set of (labeled) arcs, and \(s_{in} \in S\) is the initial state. We will use \(s {\mathop {\rightarrow }\limits ^{e}} s'\) as a shortcut for \((s,e,s') \in A\), and the transitive closure of this relation will be denoted by \({\mathop {\rightarrow }\limits ^{*}}\).

Figure 5(a) presents an example of a transition system.

Definition 2 (Multiset representation of traces)

We denote by \(\#(\sigma ,\mathsf {e}_{\scriptstyle })\) the number of times that event \(\mathsf {e}_{\scriptstyle }\) occurs in \(\sigma \), that is \(\#(\langle \mathsf {e}_{\scriptstyle 1}\ldots \mathsf {e}_{\scriptstyle n} \rangle ,\mathsf {e}_{\scriptstyle }) = |\{ \mathsf {e}_{\scriptstyle i} ~|~ \mathsf {e}_{\scriptstyle i} = \mathsf {e}_{\scriptstyle } \}|\). Given an alphabet \(\varSigma \), the Parikh vector of a sequence \(\sigma \) with respect to \(\varSigma \) is a vector \(p_{\sigma } \in \mathbb {N}^{|\varSigma |}\) such that \(p_{\sigma }(\mathsf {e}_{\scriptstyle }) = \#(\sigma ,\mathsf {e}_{\scriptstyle })\).
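Definition 2 can be illustrated with a minimal sketch (the dictionary representation of the vector is an implementation choice of ours):

```python
# Minimal sketch of Definition 2: the Parikh vector of a sequence over an
# alphabet maps each activity to its number of occurrences in the sequence.
def parikh(sigma, alphabet):
    return {a: sigma.count(a) for a in alphabet}

alphabet = ("a", "b", "c")
print(parikh("abcb", alphabet))  # {'a': 1, 'b': 2, 'c': 1}
# Two different orderings of the same events share one Parikh vector, which is
# exactly why the multiset conversion below merges their states:
print(parikh("abcb", alphabet) == parikh("bbca", alphabet))  # True
```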

The techniques described in [62] present different variants for generating a transition system from an event log. For the most common variant, the basic idea to incorporate state information is to look at the multiset of events contained in each prefix of a trace in the event log:

Definition 3 (Multiset State Representation of an Event Log)

Given an event log \(L \in \mathcal {B}({\mathcal{U}_{ act }}^*)\), the TS corresponding to the multiset conversion of L, denoted as \(\textsf {TS}_{\text {mset}}(L)\), is \(\langle S, \varSigma , A, \mathsf {s}_{\scriptstyle p_{\epsilon }} \rangle \), such that: S contains one state \(\mathsf {s}_{\scriptstyle p_{w}}\) for each Parikh vector \(p_w\) of a prefix w in L, with \(\epsilon \) denoting the empty prefix, and \(A = \{ \mathsf {s}_{\scriptstyle p_{w}} {{{\mathop {\longrightarrow }\limits ^{\mathsf {e}}}}} \mathsf {s}_{\scriptstyle p_{w\mathsf {e}}} ~|~ w \mathsf {e} \text { is a prefix of } L \}\).

In contrast, in the sequence conversion, two traces lead to the same state only if they contain exactly the same events in exactly the same order.
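A sketch of Definition 3 can clarify the construction. Representing each Parikh vector as a frozen set of (event, count) pairs is our own choice; any canonical multiset encoding would do.

```python
# Sketch of Definition 3: building TS_mset(L). States are Parikh vectors of
# prefixes (encoded as frozensets of (event, count) pairs); an arc connects a
# prefix state to the state reached by appending one more event.
def ts_mset(log):
    states, arcs = set(), set()
    for trace in log:
        counts = {}
        state = frozenset(counts.items())  # state of the empty prefix
        states.add(state)
        for e in trace:
            counts = dict(counts)
            counts[e] = counts.get(e, 0) + 1
            nxt = frozenset(counts.items())
            states.add(nxt)
            arcs.add((state, e, nxt))
            state = nxt
    return states, arcs

# Traces <a,b> and <b,a> end in the same state under the multiset conversion:
states, arcs = ts_mset(["ab", "ba"])
print(len(states))  # 4 states: empty, {a}, {b}, {a,b}
```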

Example 1

Throughout this section, let us use an example extracted from [61]. The event log contains the following activities: r=register, s=ship, sb=send_bill, p=payment, ac=accounting, ap=approved, c=close, em=express_mail, rj=rejected, and rs=resolve. Given the event log \(L_7 = [\langle r,s,sb,p,ac,ap,c \rangle ^{10}, \langle r,sb,em,p,ac,ap,c \rangle ^{10}, \langle r,sb,p,em,ac,rj,rs,c \rangle ^{10}, \langle r,em,sb,p,ac,ap,c \rangle ^{10}, \langle r,sb,s,p,ac,rj,rs,c \rangle ^{10}, \langle r,sb,p,s,ac,ap,c \rangle ^{10}, \langle r,sb,p,em,ac,ap,c \rangle ^{10}]\), Fig. 5(a) shows the TS constructed according to Definition 3.

Fig. 5.
figure 5

State-based region discovery: (a) transition system corresponding to \(L_7\), (b) derived Petri net.

A region in a transition system is a set of states that satisfies a homogeneous relation with respect to the set of arcs. In the simplest case, this relation can be described by a predicate on the set of states considered. Formally:

Definition 4 (Region)

Let \(S'\) be a subset of the states of a TS, \(S' \subseteq S\). If \(s\not \in S'\) and \(s'\in S'\), then we say that transition \(s{\mathop {\rightarrow }\limits ^{a}} s'\) enters \(S'\). If \(s\in S'\) and \(s'\not \in S'\), then transition \(s{\mathop {\rightarrow }\limits ^{a}} s'\) exits \(S'\). Otherwise, transition \(s{\mathop {\rightarrow }\limits ^{a}} s'\) does not cross \(S'\): it is completely inside (\(s\in S'\) and \(s'\in S'\)) or completely outside (\(s\notin S'\) and \(s'\not \in S'\)). A set of states \(r \subseteq S\) is a region if, for each event \(e \in \varSigma \), exactly one of the three predicates (enters, exits, or does not cross) holds for all arcs labeled with e.

An example of a region is presented in Fig. 6 on the TS of our running example. In the highlighted region, event r enters the region, s and em exit the region, and the rest of the labels do not cross the region.

A region corresponds to a place in the Petri net, and the role of the arcs determines the Petri net flow relation: when an event e enters the region, there is an arc from the corresponding transition for e to the place, and when e exits the region, there is an arc from the place to the transition for e. Events satisfying the does-not-cross relation are not connected to the corresponding place. For instance, the region shown in Fig. 6(a) corresponds to the shadowed place in Fig. 6(b), where event r belongs to the set of input transitions of the place, whereas events em and s belong to the set of output transitions. Hence, the algorithm for Petri net derivation from a transition system consists of finding regions and constructing the Petri net as illustrated with the previous example. In [26] it was shown that only a minimal set of regions is necessary, and further relaxations of this restriction can be found in [17]. The Petri net obtained by this method is guaranteed to accept the language of the transition system and to satisfy the minimal language containment property: if all the minimal regions are used, the derived Petri net is the one whose language difference with respect to the log is minimal, hence the most precise Petri net for the transition system considered.
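The region conditions of Definition 4 are easy to check mechanically. The following sketch (our own toy TS, not the running example) tests whether a candidate set of states is a region:

```python
# Sketch of Definition 4: checking whether a set of states is a region.
# For every event, all of its arcs must agree on one relation with respect to
# the candidate set: enters, exits, or does not cross.
def is_region(arcs, subset):
    role = {}  # event -> "enter" | "exit" | "nocross"
    for s, e, t in arcs:
        if s not in subset and t in subset:
            r = "enter"
        elif s in subset and t not in subset:
            r = "exit"
        else:
            r = "nocross"  # completely inside or completely outside
        if role.setdefault(e, r) != r:
            return False   # event e plays two different roles
    return True

# Toy diamond TS (a and b concurrent): s0 -a-> s1 -b-> s2, s0 -b-> s3 -a-> s2
arcs = [("s0", "a", "s1"), ("s1", "b", "s2"),
        ("s0", "b", "s3"), ("s3", "a", "s2")]
print(is_region(arcs, {"s0", "s3"}))  # True: a always exits, b never crosses
print(is_region(arcs, {"s1"}))        # False: b exits one arc, ignores another
```

The region \(\{s_0, s_3\}\) corresponds to the place that enables transition a, mirroring the region-to-place correspondence described above.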

Fig. 6.
figure 6

(a) Example of region (three shadowed states). The predicates are r enters, s and em exits, and the rest of events do not cross, (b) Corresponding place shadowed in the Petri net.

In any case, the algorithm that searches for regions in a transition system must explore the lattice of sets of states (or multisets, in the case of k-bounded regions), thus having a high complexity: for a transition system with n states, the lattice for k-bounded regions is of size \(\mathcal{O}(k^n)\). For instance, the lattice of sets of states for the toy TS used in this chapter (which has 22 states) has \(2^{22}\) possible sets to check for the region conditions. Although many simplification properties, efficient data structures and algorithms, and heuristics are used to prune this search space [17], they only alleviate the problem. Decomposition alternatives, which for instance use partitions of the state space to guide the search for regions, significantly alleviate the complexity of the state-based region algorithm, at the expense of not guaranteeing the derivation of precise models [15]. Other state-based region approaches for discovery have been proposed, which complement the approach described in this section [54,55,56].

3.2 Language-Based Region Approach for Process Discovery

In language-based region theory [6, 8, 9, 22, 37, 38] the goal is to construct the smallest Petri net such that the behavior of the net is equal to the given input language (or minimally larger). [41] provides an overview of language-based region theory for different classes of languages: step languages, regular languages, and (infinite) partial languages.

Fig. 7.
figure 7

The accepting Petri net discovered by the Language-based Region Algorithm for event log \(L_6\).

Figure 7 shows the accepting Petri net that results from running the Language-based Region Algorithm on event log \(L_6\). As with state-based regions, for all places the inflow equals the outflow.

More formally, let \(L \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an event log. Language-based region theory constructs a Petri net whose set of transitions equals \(\varSigma \) and in which every trace of L is a firing sequence. The Petri net should allow only minimally more firing sequences than those in the language L (and all prefixes of L). This is achieved by adding places to the Petri net that restrict unobserved behavior, while allowing for observed behavior. The theory of regions provides a method to identify these places, using language regions.

Definition 5 (Prefix Closure)

Let \(L \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an event log. The prefix closed language \(\mathcal {L} \subseteq \varSigma ^*\) of L is defined as: \(\mathcal {L} = \{ \sigma \in \varSigma ^* \mid \exists _{\sigma ' \in \varSigma ^*} \sigma \circ {}\sigma ' \in L\}\).

The prefix closure of a log is simply the set of all prefixes in the log (including the empty prefix).
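The prefix closure is a one-liner; the sketch below encodes traces as strings, with the empty string standing for the empty prefix \(\epsilon\):

```python
# Sketch of Definition 5: the prefix closure of a log is the set of all
# prefixes of its traces, including the empty prefix (here: the empty string).
def prefix_closure(log):
    return {trace[:i] for trace in log for i in range(len(trace) + 1)}

print(sorted(prefix_closure(["ab", "ac"])))  # ['', 'a', 'ab', 'ac']
```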

Definition 6 (Language Region)

Let \(\varSigma \) be a set of activities. A region of a prefix-closed language \(\mathcal {L} \subseteq \varSigma ^*\) is a triple \((\vec {x},\vec {y}, c)\) with \(\vec {x},\vec {y} \in \{0,1\}^{\varSigma }\) and \(c \in \{0,1\}\), such that for each non-empty sequence \(w = w'\circ {}a \in \mathcal {L}\), \(w' \in \mathcal {L}\), \(a \in \varSigma \):

$$ c + \sum _{t \in \varSigma } \left( \vec {w'}(t) \cdot \vec {x}(t) - \vec {w}(t) \cdot \vec {y}(t)\right) \ge 0 $$

This can be rewritten into the inequation system:

$$\begin{aligned} c \cdot \vec {1} + M'\cdot \vec {x} - M \cdot \vec {y} \ge \vec {0} \end{aligned}$$

where M and \(M'\) are two \(|\mathcal {L}| \times |\varSigma |\) matrices with \(M(w,t) = \vec {w}(t)\), and \(M'(w,t) = \vec {w'}(t)\), with \(w = w'\circ {}a\). The set of all regions of a language is denoted by \(\Re (\mathcal {L})\) and the region \((\vec {0}, \vec {0}, 0)\) is called the trivial region.

Fig. 8.
figure 8

Region for a language over four activities [63].

Intuitively, vectors \(\vec {x},\vec {y}\) denote the set of incoming and outgoing arcs of the place corresponding to the region, respectively, and c indicates whether it is initially marked. Figure 8 shows a region for a language over four activities, i.e., each solution \((\vec {x},\vec {y},c)\) of the inequation system can be regarded in the context of a Petri net, where the region corresponds to a feasible place with preset \(\{t \mid t \in T, \vec {x}(t)=1\}\) and postset \(\{t \mid t\in T, \vec {y}(t) = 1\}\), initially marked with c tokens. Note that we do not assume arc weights here, while the authors of [6, 7, 22, 38] do.
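The defining inequality of Definition 6 can be checked directly: a candidate \((\vec{x}, \vec{y}, c)\) is a region iff the corresponding place never goes negative on any word of the language. The sketch below (our own encoding: words as strings, \(\vec{x}, \vec{y}\) as dictionaries) verifies this condition term by term.

```python
# Sketch of Definition 6: verifying that a candidate (x, y, c) is a language
# region, i.e. that c + sum_t (w'(t)*x(t) - w(t)*y(t)) >= 0 holds for every
# non-empty word w = w'.a of the prefix-closed language.
def is_language_region(language, alphabet, x, y, c):
    for w in language:
        if not w:
            continue                    # the condition ranges over non-empty words
        w_prev = w[:-1]                 # w' such that w = w'.a
        tokens = c + sum(w_prev.count(t) * x.get(t, 0)
                         - w.count(t) * y.get(t, 0)
                         for t in alphabet)
        if tokens < 0:
            return False                # the place would go negative after w
    return True

# Language of a net where a precedes b: the place between a and b corresponds
# to x = {a}, y = {b}, c = 0.
lang = {"", "a", "ab"}
print(is_language_region(lang, "ab", {"a": 1}, {"b": 1}, 0))  # True
print(is_language_region(lang, "ab", {}, {"b": 1}, 0))        # False
```

The second check fails because without the incoming arc from a, firing b would require a token the place never receives, which is exactly the intuition of a feasible place.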

Since the place represented by a region is a place which can be added to a Petri net, without disturbing the fact that the net can reproduce the language under consideration, such a place is called a feasible place.

Definition 7 (Feasible place)

Let \(\mathcal {L}\) be a prefix-closed language over \(\varSigma \) and let \(N = ((P,\varSigma ,F),m)\) be a marked Petri net. A place \(p \in P\) is called feasible if and only if there exists a corresponding region \((\vec {x},\vec {y},c) \in \Re (\mathcal {L})\) such that \(m(p) = c\), and \(\vec {x}(t) = 1\) if and only if \(t \in {{}^{\bullet }{p}}\), and \(\vec {y}(t) = 1\) if and only if \(t \in {{p}^{\bullet }}\).

In general, there are many feasible places for any given event log (when considering arc-weights in the discovered Petri net, there are even infinitely many). Several methods exist for selecting an appropriate subset of these places. The authors of [7, 38] present two ways of finitely representing these places, namely a basis representation and a separating representation. Both representations maximize precision, i.e. they select a set of places such that the behavior of the model outside of the log is minimal.

In contrast, the authors of [63, 65, 66, 68] focus on those feasible places that express some causal dependency observed in the event log, and/or ensure that the entire model is a connected workflow net. They do so by introducing various cost functions favouring one solution of the equation system over another and then selecting the top candidates.

3.3 Strengths and Limitations of Region Theory

The goal of region theory is to find a Petri net that perfectly describes the observed behavior (where this behavior is specified in terms of a language or a state space). As a result, the Petri nets are perfectly fitting and maximally precise. Consequently, the assumption on the input event log is that it records a full behavioral specification, i.e., that the input is complete and noise-free, while the assumption on the output is that it is a compact and exact representation of the behavior recorded in the input event log. To this end, we note that, although in this section we have focused on safe nets, the theory of regions can represent general k-bounded Petri nets – a feature that is not yet provided by any other automated process discovery technique.

When applying region theory in the context of process mining, it is therefore very important to perform any required generalization of the behavior recorded in the input event log before calling region theory algorithms. For state-based regions, the challenge lies in the construction of the state space from the event log, while for language-based regions it lies in the selection of the appropriate prefixes to include in the final prefix-closed language in order to ensure some level of generalization.

In the next section, we will see that Split Miner relaxes the requirement of having the full behavioral specification recorded in the input event log, striving instead to discover BPMN process models that maximize the balance between fitness and precision.

4 Split Miner

In the following, we describe how Split Miner (hereinafter, SM) discovers a BPMN model starting from an event log. SM operates in six steps (cf. Fig. 9). In the first step, it constructs the DFG and analyses it to detect self-loops and short-loops. In the second step, it discovers concurrency relations between pairs of activities in the DFG. In the third step, the DFG is filtered by applying a filtering algorithm designed to strike a balance between the fitness and precision of the final BPMN model while maintaining a low control-flow complexity. The fourth and fifth steps focus (respectively) on the discovery of split and join gateways: activities having multiple outgoing edges are turned into a hierarchy of split gateways, while activities having multiple incoming edges are turned into a hierarchy of join gateways. Lastly, if any OR-joins were discovered, they are analyzed and turned (whenever possible) into either XOR- or AND-gateways.

Although some of the steps executed by SM are typical of basic automated process discovery techniques such as the alpha miner and the inductive miner (e.g., the filtering of the DFG), the steps of SM were designed to overcome the limitations of such techniques – most notably, to increase precision without compromising fitness or structural complexity. Furthermore, in SM, each step can operate as a black box, allowing for future improvements by redesigning or enhancing one step at a time [5].

We now provide a brief overview of each step of SM in a tutorial-like fashion, by leveraging the example log \(L_6 = [\langle a,b,c,g,e,h \rangle ^{10}, \langle a,b,c,f,g,h \rangle ^{10}, \langle a,b,d,g,e,h \rangle ^{10}, \langle a,b,d,e,g,h \rangle ^{10}, \langle a,b,e,c,g,h \rangle ^{10}, \langle a,b,e,d,g,h \rangle ^{10}, \langle a,c,b,e,g,h \rangle ^{10}, \langle a,c,b,f,g,h \rangle ^{10}, \langle a,d,b,e,g,h \rangle ^{10}, \langle a,d,b,f,g,h \rangle ^{10}]\) (introduced in Sect. 2). Given that an in-depth analysis of the algorithms behind SM would be out of the scope of this chapter and book, we refer the interested reader to the original work [3].

Fig. 9.
figure 9

Overview of the Split Miner algorithm.

4.1 Step 1: DFG and Loops Discovery

Given the input event log \(L_6\), SM immediately builds its DFG, as shown in Fig. 10a. In this example, all the traces have the same start and end activity; nevertheless, SM automatically adds artificial start and end activities (represented by the nodes \(\blacktriangleright \) and \({\scriptstyle {\blacksquare }}\)).

Then, SM detects self-loops and short-loops, i.e., loops involving only one or two activities, respectively. Loops are known to cause problems when detecting concurrency [60]; hence, we want to detect loops before detecting concurrency.

The simplest of the loops is the self-loop: a self-loop exists if a node is both source and target of one arc of the DFG, i.e., \(a\rightarrow a\). Short-loops and their frequencies are detected in the log as follows. Given two activities a and b, for SM, a short-loop (\(a\circlearrowright b\)) exists if and only if (iff) the following two conditions hold:


  (i) both a and b are not self-loops;

  (ii) there exists at least one log trace containing the subtrace \(\langle a,b,a\rangle \) or \(\langle b,a,b\rangle \).

Condition (i) is necessary because otherwise the short-loop evaluation may not be reliable. In fact, if we consider a process that allows concurrency between a self-loop activity a and a normal activity b, we could observe log traces containing the subtrace \(\langle a,b,a\rangle \), which can also characterize \(a\circlearrowright b\). Condition (ii) guarantees that we have observed (in at least one trace of the log) a short-loop between the two activities. In fact, short-loops are characterized by subtraces of the type \(\langle a,b,a\rangle \) or \(\langle b,a,b\rangle \).

Fig. 10.
figure 10

Processing of the directly-follows graph.

The detected self-loops are trivially removed from the DFG and restored only in the output BPMN model, while the detected short-loops are saved and used in the next step. In our example (Fig. 10a), there are no self-loops or short-loops.
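Step 1 can be sketched as follows. This is our own minimal reading of the two conditions above (traces encoded as strings), not SM's actual implementation, and it ignores loop frequencies.

```python
# Sketch of Step 1: detecting self-loops (some trace contains a directly
# followed by a) and short-loops (a subtrace <a,b,a> or <b,a,b> occurs, and
# neither a nor b is a self-loop -- Conditions (i) and (ii)).
def find_loops(log):
    self_loops = {t[i] for t in log for i in range(len(t) - 1) if t[i] == t[i + 1]}
    short_loops = set()
    for t in log:
        for i in range(len(t) - 2):
            a, b = t[i], t[i + 1]
            if (t[i + 2] == a and a != b
                    and a not in self_loops and b not in self_loops):
                short_loops.add(frozenset((a, b)))  # unordered pair {a, b}
    return self_loops, short_loops

self_loops, short_loops = find_loops(["abab", "acca"])
print(sorted(self_loops))              # ['c']: "acca" contains c -> c
print(frozenset("ab") in short_loops)  # True: "abab" contains <a,b,a>
```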

4.2 Step 2: Concurrency Discovery

Given a DFG and any two activities a and b, such that neither a nor b is a self-loop, a and b are considered concurrent by SM (denoted \(a \Vert b\)) iff the following three conditions hold:


  (iii) there exist two arcs in the DFG: \((a,b)\) and \((b,a)\);

  (iv) both a and b are not in a short-loop;

  (v) the arcs \((a,b)\) and \((b,a)\) have similar frequency: \(\frac{\left| \left| a \rightarrow b \right| - \left| b\rightarrow a \right| \right| }{\left| a \rightarrow b \right| + \left| b \rightarrow a \right| } < \varepsilon \) (\(\varepsilon \in [0,1]\)).

These three conditions define the heuristic-based concurrency oracle of SM. The rationale behind the conditions is the following. Condition (iii) captures the basic requirement for \(a \Vert b\): the existence of the arcs \((a,b)\) and \((b,a)\) entails that a and b can occur in any order. However, Condition (iii) is not sufficient to postulate concurrency, because it may hold in three cases: a and b form a short-loop; \((a,b)\) or \((b,a)\) is an infrequent observation (e.g., noise in the data); a and b are concurrent. We are interested in identifying when the third case holds. To this end, we check Conditions (iv) and (v). When Condition (iv) holds, we can exclude the first case because a and b do not form a short-loop. When Condition (v) holds, we can exclude the second case because \((a,b)\) and \((b,a)\) are both observed frequently and have similar frequencies. At this point, we are left with the third case and we assume \(a\Vert b\). The variable \(\varepsilon \) is a user input parameter: the smaller its value, the more similar the numbers of observations of \((a,b)\) and \((b,a)\) have to be, whereas setting \(\varepsilon = 1\) makes Condition (v) always hold.

Whenever we find \(a\Vert b\), we remove the arcs \((a,b)\) and \((b,a)\) from the DFG, since we assume there is no causality but instead concurrency. On the other hand, if we find that either \((a,b)\) or \((b,a)\) represents an infrequent directly-follows relation, we remove the less frequent of the two edges. We call the output of this step a Pruned DFG (PDFG).

In the example in Fig. 10a, we identify four possible cases of concurrency: \((b,c)\), \((b,d)\), \((d,e)\), \((e,g)\). Setting \(\varepsilon = 0.25\), we capture the following concurrency relations: \(b\Vert c\), \(b\Vert d\), \(d\Vert e\), \(e\Vert g\). The resulting PDFG is shown in Fig. 10b.
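The concurrency oracle of Step 2 amounts to a short predicate over arc frequencies. The sketch below is our own rendering of Conditions (iii)-(v); the frequencies used for the pairs (b, c) and (c, f) are those of \(L_6\).

```python
# Sketch of Step 2: SM's heuristic concurrency oracle, Conditions (iii)-(v).
def concurrent(arcs, a, b, short_loops, eps):
    ab, ba = arcs.get((a, b), 0), arcs.get((b, a), 0)
    if ab == 0 or ba == 0:                  # (iii): both arcs must exist
        return False
    if frozenset((a, b)) in short_loops:    # (iv): not a short-loop
        return False
    return abs(ab - ba) / (ab + ba) < eps   # (v): balanced frequencies

# Arc frequencies taken from L_6 for two activity pairs:
arcs = {("b", "c"): 20, ("c", "b"): 20, ("c", "f"): 10}
print(concurrent(arcs, "b", "c", set(), 0.25))  # True: |20-20|/40 = 0 < 0.25
print(concurrent(arcs, "c", "f", set(), 0.25))  # False: arc (f, c) never observed
```

Note that the same pair would be rejected if it were recorded as a short-loop, regardless of how balanced the frequencies are.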

4.3 Step 3: Filtering

Fig. 11.
figure 11

Processing of the directly-follows graph.

The filtering algorithm applied by SM on the PDFG is based on three criteria. First, each node of the PDFG must be on a path from the single start node (source) to the single end node (sink). Second, each node must be on (at least one) source-to-sink path of maximum capacity, where, in our context, the capacity of a path is the frequency of the least frequent arc of the path. Third, the number of edges of the PDFG must be minimal. Together, the three criteria aim to guarantee that the discovered BPMN process model is accurate and simple at the same time.

The filtering algorithm performs a double breadth-first exploration: forward (source to sink) and backward (sink to source). During the forward exploration, for each node of the PDFG, we discover its maximum source-to-node capacity and the incoming edge granting such capacity (the best incoming edge). During the backward exploration, for each node of the PDFG, we discover its maximum node-to-sink capacity and the best outgoing edge. Then, we remove from the PDFG all the edges that are neither best incoming edges nor best outgoing edges. In doing so, we may reduce the amount of behavior that the final model can replay, and consequently its fitness. Therefore, we introduce a frequency threshold that allows the user to strike a balance between fitness and precision. Precisely, we compute the \(\eta \) percentile over the frequencies of the best incoming and outgoing edges of each node, and we add back to the PDFG the edges with a frequency exceeding this threshold. It is important to note that the percentile is not taken over the frequencies of all the edges; otherwise, we would simply retain \(\eta \) percent of all the edges. Also, this means that even by setting \(\eta = 0\), SM will still apply a certain amount of filtering.
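The forward exploration can be sketched as a widest-path computation. The chapter describes a breadth-first exploration; for simplicity we use a Dijkstra-like widest-path variant here, which yields the same maximum source-to-node capacities. The graph below is our own toy example, not the PDFG of \(L_6\).

```python
# Sketch of Step 3's forward exploration: for each node, compute the maximum
# source-to-node capacity (the capacity of a path is the frequency of its
# least frequent arc) and the best incoming edge granting that capacity.
import heapq

def forward_capacities(arcs, source):
    """arcs: dict (u, v) -> frequency. Returns {node: (capacity, best_in_edge)}."""
    best = {source: (float("inf"), None)}
    heap = [(-float("inf"), source)]
    while heap:
        neg_cap, u = heapq.heappop(heap)
        cap = -neg_cap
        if cap < best[u][0]:
            continue  # stale heap entry
        for (x, v), f in arcs.items():
            if x != u:
                continue
            c = min(cap, f)  # path capacity = bottleneck frequency
            if v not in best or c > best[v][0]:
                best[v] = (c, (u, v))
                heapq.heappush(heap, (-c, v))
    return best

# Toy PDFG: the direct arc (i, b) has frequency 10, but going via a gives b a
# source-to-node capacity of 30, so (a, b) is b's best incoming edge.
arcs = {("i", "a"): 40, ("i", "b"): 10, ("a", "b"): 30, ("b", "o"): 40}
best = forward_capacities(arcs, "i")
print(best["b"])  # (30, ('a', 'b'))
```

The backward exploration is symmetric: run the same computation on the reversed graph from the sink.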

Figure 11b shows the output of the filtering algorithm when applied to the PDFG of our working example (Fig. 11a). As a consequence of retaining the best incoming and outgoing edges for each node, we drop the arcs \((e,c)\) and \((c,f)\), which would not be retained regardless of the value assigned to \(\eta \).

4.4 Step 4: Splits Discovery

Before discovering the split gateways, the filtered PDFG is converted into a BPMN process model by turning the start (\(\blacktriangleright \)) and end (\({\scriptstyle {\blacksquare }}\)) nodes of the graph into the start and end events of the BPMN model, and each other node of the graph into a BPMN activity. Figure 12a shows the BPMN model generated from the filtered PDFG of our working example (Fig. 11b). Now, let us focus on the discovery of the split gateways by considering the example in Fig. 13a. Given an activity with multiple outgoing edges (e.g., activity z), the splits discovery is based on the idea that all the activities directly following (successors of) the same split gateway must have the same concurrency and/or mutual-exclusion relations with the activities that do not directly follow their preceding split gateway. With reference to Fig. 13b, we see that since activities c and d are successors of gateway \(and_1\), both c and d are concurrent to e, f, and g, due to gateway \(and_3\) (i.e., \(c\Vert e\), \(c\Vert f\), \(c\Vert g\), and \(d\Vert e\), \(d\Vert f\), \(d\Vert g\)). At the same time, both c and d are mutually exclusive with a and b, due to gateway \(xor_3\). Considering activities in pairs, and analyzing which concurrency or mutual-exclusion relations they have in common, we can generate the appropriate splits hierarchy.

Fig. 12.
figure 12

Processing of the BPMN model.

Fig. 13.
figure 13

Splits discovery example.

With this in mind, we continue our working example. Let us consider activity A (Fig. 12a): it has three successors, B, C, and D. From the outcome of Step 2, we know that both C and D are concurrent to B, while C and D are not concurrent (hence, mutually exclusive with each other). Since C and D share the same relations to the other activities (both are concurrent to B), they can be selected as successors of the same gateway, which in this case is an XOR-gateway because C and D are mutually exclusive. After we add the XOR-gateway, activity A has two successors: B and the newly added XOR-gateway (see Fig. 12b). The algorithm becomes trivial when an activity with multiple outgoing edges has only two successors: it is enough to add a split gateway matching the relation between the two successors. Continuing the example of activity A, the successor B is concurrent with the newly added XOR-gateway or, more precisely, with all the activities following the XOR-gateway (activities C and D). Therefore, we can add an AND-gateway preceding B and the XOR-gateway. Similarly, if we consider activity B and its two successors, activities E and F, given that they are not concurrent, they must be mutually exclusive, and therefore an XOR-gateway is placed before them. The result of the splits discovery is shown in Fig. 12c.

4.5 Step 5: Joins Discovery

Once all the split gateways have been placed, we can discover the join gateways. To do so, we rely on the Refined Process Structure Tree (RPST) [46] of the current BPMN model. The RPST of a process model is a tree data structure whose nodes represent the single-entry single-exit (SESE) fragments of the process model, and whose edges denote a containment relation between SESE fragments. Precisely, the children of a SESE fragment are its directly contained SESE fragments, whilst SESE fragments on different branches of the tree are disjoint. Each SESE fragment represents a subgraph of the process model, and the partition of the process model into SESE fragments is made in terms of edges. A SESE fragment can be of one of the following four types: a trivial fragment, which consists of a single edge; a polygon, which consists of a sequence of fragments; a bond, which is a fragment where all the children fragments share two common nodes, one being the entry and the other being the exit of the bond; and a rigid, which represents any other fragment. Each SESE fragment is classified as homogeneous if the gateways it directly contains (i.e., those not contained in any of its SESE children) are all of the same type (e.g., only XOR-gateways), or heterogeneous if its gateways have different types. Figure 14a and Fig. 14b show two examples of homogeneous SESE fragments: a bond and a rigid.

We note that, at this stage, all the SESE fragments' exits in the BPMN model (Fig. 12c) correspond to activities with multiple incoming edges, which we aim to turn into join gateways. Starting from the leaves of the RPST, i.e., the innermost SESE fragments of the process model, we explore the RPST bottom-up. For each SESE fragment we encounter in this exploration, we select the activities it contains that have multiple incoming edges (there is always at least one, the SESE fragment exit). For each of the selected activities, we add a join gateway preceding it. The join gateway type depends on whether the SESE fragment is homogeneous or heterogeneous: in the former case, the join gateway has the same type as the other gateways in the SESE fragment; in the latter case, the join gateway is an OR-gateway. Figure 14 shows in brief how our approach works for SESE bonds (Fig. 14a), for homogeneous SESE rigids (Fig. 14b), and for all other cases, i.e., heterogeneous SESE rigids (Fig. 14c).
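The join-type rule described above is simple enough to sketch directly; the function name and the type labels below are illustrative, not taken from the SM implementation.

```python
# Sketch of the join-type rule of Step 5: within a SESE fragment, a join
# gets the fragment's gateway type if the fragment is homogeneous, and
# becomes an OR-join otherwise.

def join_type(gateway_types_in_fragment):
    """`gateway_types_in_fragment` lists the types of the gateways directly
    contained in the SESE fragment, e.g. ["XOR", "XOR"] or ["AND", "XOR"]."""
    types = set(gateway_types_in_fragment)
    if len(types) == 1:      # homogeneous fragment
        return types.pop()
    return "OR"              # heterogeneous fragment
```

For the XOR-homogeneous bond of the working example this returns XOR, while the heterogeneous parent rigid yields OR, matching the joins placed in Fig. 12d.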

Returning to our working example (Fig. 12c), we can discover three joins. The first one is the XOR-join in the SESE bond containing activities C, D and G, with G as the exit of the bond and the XOR-split as the entry. The bond is XOR-homogeneous, so that the type of the join is set to XOR. The remaining two joins are in the parent SESE fragment of the bond, which is a heterogeneous rigid, hence, we place two OR-joins. The resulting model is shown in Fig. 12d.

Fig. 14.
figure 14

Joins discovery examples.

4.6 Step 6: OR-joins Minimization

The previous step may leave several OR-join gateways in the discovered BPMN model. Since OR-gateways can be difficult to interpret [42], SM tries to remove them by analyzing the process behavior and turning OR-gateways into AND- or XOR-gateways whenever the behavior is interchangeable.

4.7 Strengths and Limitations of Split Miner

SM was designed to bring together the strengths of earlier basic automated process discovery algorithms while addressing their limitations. An example of this design strategy is the filtering algorithm. Past filtering algorithms were either based on heuristics [73, 79] that risk compromising the correctness of the output model, or driven by structural requirements [35]. While SM retains the idea of an integrated filtering algorithm, it focuses on balancing the fitness, precision, and simplicity of the output process model.

While past automated discovery algorithms favored either accuracy [73, 79] or simplicity [11, 35], SM aims to strike a trade-off between the two. The splits and joins discovery steps do not impose any structural constraint on the output process model, as opposed to the inductive miner [35] and the evolutionary tree miner [11], which enforce block-structuredness; this allows SM to pursue accuracy. Yet, the discovery of the split gateways is designed to produce hierarchies of gateways, which fosters simplicity and structuredness, while the joins discovery and the use of OR-gateways allow for simplicity without compromising accuracy.

However, SM has its own limitations as well. First, SM was designed for real-life contexts, and it operates under the assumption that there is always some infrequent behavior to filter out. Second, SM may discover unsound processes: hitherto, soundness has been guaranteed only by enforcing block-structuredness, a trend that SM does not adhere to. While SM is guaranteed to discover deadlock-free process models [3], these models may violate the soundness property of proper completion: when a token reaches the end event of the process model, other tokens may be left behind. Nonetheless, the chances of SM discovering an unsound process model are very low [2], and in most cases it discovers accurate yet simple and sound process models.

5 Log Skeletons

The previous sections introduced three advanced mining algorithms that tackle the example event log \(L_6\) with more success than the basic algorithms as introduced in Sect. 2. Like these basic algorithms, these advanced algorithms all result in an imperative process model, that is, a process model that indicates what the next possible steps are. However, next to these imperative models, we also have declarative models, like Declare [45]. Unlike an imperative model, a declarative model does not specify what the next possible steps are, instead it provides a collection of constraints that any process instance in the end should adhere to.

This section introduces an advanced mining algorithm that results in a declarative process model called a log skeleton [75]. This algorithm has been implemented as the “Visualize Log as Log Skeleton” visualizer plugin in ProM 6 [76]. Given an event log L, the algorithm first extends the event log with the artificial start activity \(\blacktriangleright \) and the artificial end activity \({\scriptstyle {\blacksquare }}\). In accordance with Sect. 2, we use \(L'\) to denote the event log L extended with these artificial activities. Second, the algorithm discovers from this extended event log \(L'\) the collection of initial specific constraints it adheres to. Third, it reduces these constraints, keeping only those that are considered to be relevant. Fourth, it shows the most-relevant constraints to the user as a graph. These last three steps are detailed in the next sections.

5.1 Discovering the Log Skeleton

The specific constraints in a log skeleton are the following three activity frequencies and six binary activity relations.

Definition 8 (Log Skeleton Frequencies and Relations)

Let \(L' \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an extended event log and let \(a, b \in act (L')\) be two different activities.

$$\begin{aligned} c_{L'}(a) = \#^{ act }_{L'}(a) \end{aligned}$$

is the frequency of activity a in event log \(L'\).

$$\begin{aligned} l_{L'}(a) = \underline{ min }\{\left| {\sigma \uparrow \{a\}}\right| \mid \sigma \in L'\} \end{aligned}$$

is the lowest frequency of activity a in any trace in event log \(L'\).

$$\begin{aligned} h_{L'}(a) = \underline{ max }\{\left| {\sigma \uparrow \{a\}}\right| \mid \sigma \in L'\} \end{aligned}$$

is the highest frequency of activity a in any trace in event log \(L'\).

$$\begin{aligned} (a,b) \in E_{L'} \Leftrightarrow \forall _{\sigma \in L'}\left| {\sigma \uparrow \{a\}}\right| = \left| {\sigma \uparrow \{b\}}\right| \end{aligned}$$

denotes that for every trace in event log \(L'\) the frequencies of activities a and b are the same. Note that the relation \(E_{L'}\) induces an equivalence relation over the activities. We use \(r_{L'}(a)\) to denote the representative activity for the equivalence class of activity a (by definition, \((r_{L'}(a),a) \in E_{L'}\)).

$$\begin{aligned} (a,b) \in R_{L'} \Leftrightarrow \forall _{\sigma \in L'}\forall _{i \in \{1,\ldots ,\left| {\sigma }\right| \}}(\sigma _i = a \Rightarrow \exists _{j \in \{i+1,\ldots ,\left| {\sigma }\right| \}}\sigma _j = b) \end{aligned}$$

denotes that for every trace in event log \(L'\) an occurrence of activity a is always followed by an occurrence of activity b. This corresponds to the response relation in Declare.

$$\begin{aligned} (a,b) \in P_{L'} \Leftrightarrow \forall _{\sigma \in L'}\forall _{i \in \{1,\ldots ,\left| {\sigma }\right| \}}(\sigma _i = a \Rightarrow \exists _{j \in \{1,\ldots ,i-1\}}\sigma _j = b) \end{aligned}$$

denotes that for every trace in event log \(L'\) an occurrence of activity a is always preceded by an occurrence of activity b. This corresponds to the precedence relation in Declare.

$$\begin{aligned} (a,b) \in \overline{R}_{L'} \Leftrightarrow \forall _{\sigma \in L'}\forall _{i \in \{1,\ldots ,\left| {\sigma }\right| \}}(\sigma _i = a \Rightarrow \lnot \exists _{j \in \{i+1,\ldots ,\left| {\sigma }\right| \}}\sigma _j = b) \end{aligned}$$

denotes that for every trace in event log \(L'\) an occurrence of activity a is never followed by an occurrence of activity b.

$$\begin{aligned} (a,b) \in \overline{P}_{L'} \Leftrightarrow \forall _{\sigma \in L'}\forall _{i \in \{1,\ldots ,\left| {\sigma }\right| \}}(\sigma _i = a \Rightarrow \lnot \exists _{j \in \{1,\ldots ,i-1\}}\sigma _j = b) \end{aligned}$$

denotes that for every trace in event log \(L'\) an occurrence of activity a is never preceded by an occurrence of activity b.

$$\begin{aligned} (a,b) \in \overline{C}_{L'} \Leftrightarrow \forall _{\sigma \in L'}(\left| {\sigma \uparrow \{a\}}\right| = 0 \vee \left| {\sigma \uparrow \{b\}}\right| = 0) \end{aligned}$$

denotes that for every trace in event log \(L'\) an occurrence of activity a never co-occurs with an occurrence of activity b.
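To make Definition 8 concrete, the following sketch computes the three frequencies and the six relations on a toy log. The function names and the toy log are ours, not from the ProM implementation; a real implementation would operate on the extended event log \(L'\) and would avoid the repeated trace scans.

```python
from collections import Counter

# Illustrative sketch of Definition 8. Traces are tuples of activity
# names; the artificial start/end activities are assumed already added.

def relations(log):
    acts = sorted({a for t in log for a in t})
    count = Counter(a for t in log for a in t)              # c(a)
    low = {a: min(t.count(a) for t in log) for a in acts}   # l(a)
    high = {a: max(t.count(a) for t in log) for a in acts}  # h(a)
    # Equivalence: same frequency in every trace.
    E = {(a, b) for a in acts for b in acts
         if a != b and all(t.count(a) == t.count(b) for t in log)}
    def holds_everywhere(ok):
        return {(a, b) for a in acts for b in acts
                if a != b and all(ok(t, a, b) for t in log)}
    R = holds_everywhere(      # a is always followed by b
        lambda t, a, b: all(b in t[i + 1:] for i, x in enumerate(t) if x == a))
    P = holds_everywhere(      # a is always preceded by b
        lambda t, a, b: all(b in t[:i] for i, x in enumerate(t) if x == a))
    notR = holds_everywhere(   # a is never followed by b
        lambda t, a, b: all(b not in t[i + 1:] for i, x in enumerate(t) if x == a))
    notP = holds_everywhere(   # a is never preceded by b
        lambda t, a, b: all(b not in t[:i] for i, x in enumerate(t) if x == a))
    notC = holds_everywhere(   # a and b never co-occur
        lambda t, a, b: a not in t or b not in t)
    return count, low, high, E, R, P, notR, notP, notC

log = [("s", "a", "b", "e"), ("s", "a", "c", "e")]
count, low, high, E, R, P, notR, notP, notC = relations(log)
```

On this toy log, s, a, and e are equivalent (each occurs exactly once per trace), while b and c never co-occur.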

Fig. 15.
figure 15

The nodes of the log skeleton discovered from the event log \(L_6\).

Figure 15 shows that we can easily visualize the frequencies and the equivalence relation in the nodes of the log skeleton. The activity, the representative of the equivalence class and the frequencies are simply shown at the bottom of the node, whereas equivalent nodes also have the same background color. For example, Fig. 15 immediately shows that the activities a, b, g, h, \({\scriptstyle {\blacksquare }}\), and \(\blacktriangleright \) are equivalent.

Table 1. An overview of the initial non-Equivalence relations for event log \(L_6\).

The remaining five activity relations will be visualized by edges between these nodes. However, there could be many such relations, which could very well result in a model that is often called a spaghetti model: A model that contains way too many edges to make any sense of it. Consider, for example, Table 1, which shows that for event log \(L_6\) there are relations between 80 out of 90 possible pairs of different activities, like \((f,b) \in P_{L_6} \cap \overline{R}_{L_6}\). For this reason, the algorithm reduces the collection of these remaining five relations to a collection of relevant relations.

Definition 9 (Relevant Log Skeleton Relations)

Let \(L' \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an extended event log and let \(a, b \in act (L')\) be two different activities.

$$\begin{aligned} (a,b) \in \mathcal {R}_{L'} \Leftrightarrow (a,b) \in R_{L'} \wedge \lnot \exists _{c \in act (L')}((a,c) \in R_{L'} \wedge (c,b) \in R_{L'}) \end{aligned}$$

that is, \(\mathcal {R}_{L'}\) is the transitively reduced version of \(R_{L'}\). Clearly, if a is always followed by c and c is always followed by b, then a must be always followed by b.

$$\begin{aligned} (a,b) \in \mathcal {P}_{L'} \Leftrightarrow (a,b) \in P_{L'} \wedge \lnot \exists _{c \in act (L')}((a,c) \in P_{L'} \wedge (c,b) \in P_{L'}) \end{aligned}$$

that is, \(\mathcal {P}_{L'}\) is the transitively reduced version of \(P_{L'}\). Clearly, if a is always preceded by c and c is always preceded by b, then a must be always preceded by b.

$$\begin{aligned} (a,b) \in \overline{\mathcal {R}}_{L'} \Leftrightarrow (a,b) \in \overline{R}_{L'} \wedge (a,b) \notin \overline{C}_{L'} \wedge \lnot \exists _{c \in act (L')}((a,c) \in \overline{R}_{L'} \wedge (c,b) \in \overline{R}_{L'}) \end{aligned}$$

that is, \(\overline{\mathcal {R}}_{L'}\) is the transitively reduced version of \(\overline{R}_{L'}\), on top of which the fact that a is never followed by b is also considered irrelevant if a and b do not co-occur. Note that it is not true that if a is never followed by c and c is never followed by b, then a is never followed by b. Consider, for example, the event log containing the traces \(\langle a, b \rangle \), \(\langle b, c \rangle \), and \(\langle c, a \rangle \). We are aware of this, but believe that the benefits of the transitive reduction outweigh the fact that we may remove relevant relations.

$$\begin{aligned} (a,b) \in \overline{\mathcal {P}}_{L'} \Leftrightarrow (a,b) \in \overline{P}_{L'} \wedge (a,b) \notin \overline{C}_{L'} \wedge \lnot \exists _{c \in act (L')}((a,c) \in \overline{P}_{L'} \wedge (c,b) \in \overline{P}_{L'}) \end{aligned}$$

that is, \(\overline{\mathcal {P}}_{L'}\) is the transitively reduced version of \(\overline{P}_{L'}\), on top of which the fact that a is never preceded by b is also considered irrelevant if a and b do not co-occur. Like with \(\overline{\mathcal {R}}_{L'}\), it is not true that if a is never preceded by c and c is never preceded by b, then a is never preceded by b.

$$\begin{aligned} (a,b) \in \overline{\mathcal {C}}_{L'} \Leftrightarrow (a,b) \in \overline{C}_{L'} \wedge \lnot \exists _{c \in act (L')}((b,c) \in P_{L'} \wedge (a,c) \in \overline{C}_{L'}) \end{aligned}$$

Clearly, if b is always preceded by c and c does not co-occur with a, then b cannot co-occur with a. Note that we could also have used the always-follows relation \(R_{L'}\) here instead of the always-precedes relation \(P_{L'}\), but using the latter results in the relevant never-co-occurs relations appearing closer to the beginning of the process, that is, towards the point where the actual decision was made to choose one activity or the other.
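The transitive reduction underlying Definition 9 can be sketched as follows. This is an illustrative helper, not the ProM implementation: a pair (a, b) is kept only if it cannot be induced via an intermediate activity c.

```python
# Sketch of the transitive reduction used in Definition 9.

def transitive_reduction(rel):
    """Keep (a, b) only when no intermediate c induces it."""
    activities = {x for pair in rel for x in pair}
    return {(a, b) for (a, b) in rel
            if not any((a, c) in rel and (c, b) in rel
                       for c in activities if c not in (a, b))}

# a->b, b->c, and the induced a->c: the reduction drops (a, c).
reduced = transitive_reduction({("a", "b"), ("b", "c"), ("a", "c")})
```

As the running text notes, applying this reduction to the never-followed and never-preceded relations may drop pairs that are not actually induced; the algorithm accepts that loss for the sake of readability.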

Table 2. An overview of the relevant non-Equivalence relations for event log \(L_6\).

Table 2 shows the results for the event log \(L_6\): Of the 80 initial relations, only 32 are considered to be relevant. Finally, the algorithm shows the log skeleton as a graph to the user, where this graph contains only edges for the relevant relations.

5.2 Visualizing the Log Skeleton

The discovered log skeleton is visualized using a log skeleton graph, which is a graph showing the relevant relations, the equivalence classes, and the frequencies as discovered from the event log.

Definition 10 (Log Skeleton Graph)

Let \(L' \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an extended event log and let \(a, b \in act (L')\). The log skeleton graph for \(L'\) is the graph \(G=(V,E,t)\) where:

$$\begin{aligned} V = \{(a,r_{L'}(a),c_{L'}(a),l_{L'}(a),h_{L'}(a))|a \in act (L')\} \end{aligned}$$

is the set of nodes, where every node contains the activity, the representative of the activity within its equivalence class, the frequency of the activity in the log, and the minimal and maximal frequencies of the activity in any trace. If \(l(a)=h(a)\) then only l(a) is shown, otherwise l(a)..h(a) is shown.

$$\begin{aligned} E =&\,(\mathcal {R}_{L'} \cup \mathcal {P}_{L'} \cup \overline{\mathcal {R}}_{L'} \cup \overline{\mathcal {P}}_{L'} \cup \overline{\mathcal {C}}_{L'}) \nonumber \\ \cup&(\mathcal {R}_{L'} \cup \mathcal {P}_{L'} \cup \overline{\mathcal {R}}_{L'} \cup \overline{\mathcal {P}}_{L'} \cup \overline{\mathcal {C}}_{L'})^{-1} \end{aligned}$$

is the set of edges, where we have an edge from one activity to another activity if we have a relevant relation between these activities (either way).

$$\begin{aligned} d: E \rightarrow \{\blacktriangleright ,\blacktriangleleft ,\triangleright ,\triangleleft ,\mid ,\bot \} \end{aligned}$$

denotes the decorator to be used to show the relation from the activity at the tail to the activity at the head:

  • if \((a,b) \in R_{L'}\) then \(d((a,b)) =\, \blacktriangleright \), indicating that a is always followed by b,

  • else if \((a,b) \in P_{L'}\) then \(d((a,b)) =\, \blacktriangleleft \), indicating that a is always preceded by b,

  • else if \((a,b) \in \overline{C}_{L'}\) then \(d((a,b)) =\, \mid \), indicating that a does not co-occur with b,

  • else if \((a,b) \in \overline{R}_{L'}\) then \(d((a,b)) = \triangleleft \), indicating that a is never followed by b,

  • else if \((a,b) \in \overline{P}_{L'}\) then \(d((a,b)) = \triangleright \), indicating that a is never preceded by b, and

  • otherwise \(d((a,b)) = \bot \), indicating that no relation was discovered from a to b.

These decorations are shown on the tail of the corresponding edge.
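The decorator selection above is a first-match chain, which can be sketched as follows. The string labels stand in for the graphical symbols, and the function is illustrative, not the ProM implementation.

```python
# Sketch of the decorator chain of Definition 10: the first matching
# relation determines the symbol shown at the tail of the edge (a, b).

def decorator(a, b, R, P, notC, notR, notP):
    if (a, b) in R:    return "always-followed"   # filled right triangle
    if (a, b) in P:    return "always-preceded"   # filled left triangle
    if (a, b) in notC: return "not-co-occur"      # vertical bar
    if (a, b) in notR: return "never-followed"    # open left triangle
    if (a, b) in notP: return "never-preceded"    # open right triangle
    return "none"                                 # bottom symbol
```

Note that the order of the checks matters: an always-follows relation wins over a never-co-occurs relation for the same pair.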

Fig. 16.
figure 16

The full log skeleton discovered from the event log \(L_6\) (shown using a left-right orientation).

Table 3. An overview of the decorators used for the non-Equivalence relations for event log \(L_6\).

Table 3 shows which decorators will be shown for the event log \(L_6\), and Fig. 16 shows the resulting log skeleton. Note that the edges (a, b) and (b, a) are visualized by a single edge, with the decorator for (a, b) near a and the decorator for (b, a) near b. As example relations, activity b is never preceded by e (that is, if both b and e occur, then e occurs after b), e is always preceded by b, and e and f do not co-occur. Also note that although 32 relations were considered to be relevant, 34 are now shown: The relations \((g,c) \in \overline{R}_{L_6}\) and \((g,d) \in \overline{R}_{L_6}\) were not considered relevant as these relations can be induced using f. However, as \((c,g) \in R_{L_6}\) and \((d,g) \in R_{L_6}\) are considered relevant, the relations for (g, c) and (g, d) are shown as well.

Using the log skeleton shown in Fig. 16, we can deduce the following facts on the example event log:

  • The activities a, b, g, and h are always executed exactly once, and always in the given order.

  • In parallel with b, there is a 50/50 choice between c and d.

  • There is a 70/30 choice between e and f, but the position of this choice in the process is less clear. If e is chosen, it is executed after b but in parallel with c, d, and g. However, if f is chosen it is executed after b, c, and d, and before g.

5.3 Handling Noise

So far, we have assumed that the event log does not contain any noise. As a result, a constraint like \((a,b) \in R_{L'}\) may be invalidated because a single occurrence of a in the entire event log is not followed by an occurrence of b. To be able to handle noisy logs, log skeletons allow the user to set a percentage for which the constraint should hold. We recall here the definition of the Response constraint as provided earlier:

$$\begin{aligned} (a,b) \in R_{L'} \Leftrightarrow \forall _{\sigma \in L'}\forall _{i \in \{1,\ldots ,\left| {\sigma }\right| \}}(\sigma _i = a \Rightarrow \exists _{j \in \{i+1,\ldots ,\left| {\sigma }\right| \}}\sigma _j = b) \end{aligned}$$

When dealing with noise, we are interested in the percentage of cases for which the left-hand side of the implication (\(\sigma _i = a\)) holds, for which then also the right-hand side (\(\exists _{j \in \{i+1,\ldots ,\left| {\sigma }\right| \}}\sigma _j = b\)) holds. As such, we can divide the instances of the left-hand side into positive instances (for which the right-hand side holds) and negative instances (for which the right-hand side does not hold). If the user allows for a noise level of l (where \(0 \le l \le 1\)), then the number of negative instances should be at most l times the total number of instances:

$$\begin{aligned} \left| { neg }\right| \le l \times (\left| { pos }\right| + \left| { neg }\right| ) \end{aligned}$$

where pos and neg denote the sets of positive and negative instances, respectively.
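The counting of positive and negative instances described above can be sketched as follows (an illustrative helper, not the ProM implementation):

```python
# Sketch of the noise-tolerant Response check: (a, b) holds at noise
# level l if the negative instances of "a is eventually followed by b"
# are at most a fraction l of all instances of a.

def response_holds(log, a, b, l):
    pos = neg = 0
    for trace in log:
        for i, x in enumerate(trace):
            if x == a:
                if b in trace[i + 1:]:
                    pos += 1   # this occurrence of a is followed by b
                else:
                    neg += 1   # this occurrence of a is not followed by b
    total = pos + neg
    return total == 0 or neg <= l * total

# One of four occurrences of a is not followed by b (25% noise).
log = [("a", "b"), ("a", "b"), ("a", "b"), ("a", "c")]
```

With this toy log, the Response constraint from a to b holds at noise level 0.25 but not at 0.2.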

This way of handling noise can also be used for the relations \(P_{L'}\), \(\overline{R}_{L'}\), \(\overline{P}_{L'}\), and \(\overline{C}_{L'}\), because these constraints are structured in a similar way. However, this approach does not work for the equivalence relation \(E_{L'}\). To decide whether two different activities \(a_1\) and \(a_n\) (where \(n \ge 2\)) are considered to be equivalent given a certain noise level l (where again \(0 \le l \le 1\)), we use the following condition for equivalence:

$$\begin{aligned} \forall _{i \in \{1,\ldots ,n-1\}}{\left( \left( \sum _{\sigma \in L'}\left| {\left| {\sigma \uparrow \{a_i\}}\right| - \left| {\sigma \uparrow \{a_{i+1}\}}\right| }\right| \right) \le l \times \left| {L'}\right| \right) } \end{aligned}$$

That is, there is a series of activities \(a_1\), \(a_2\), \(\ldots \), \(a_n\) such that for every subsequent pair \((a_i,a_{i+1})\) the distance between both activity counts, summed over all traces, is at most l times the number of traces in the event log. Clearly, setting a noise level of \(l = 0\) results in the condition that the activity counts should match perfectly, which is exactly what we want.
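The equivalence condition above can be sketched as follows for a given candidate chain of activities. This is an illustrative helper: in practice the algorithm would also have to search for a suitable chain rather than being handed one.

```python
# Sketch of the noisy equivalence condition: the activities form a chain
# a_1, ..., a_n where, for each adjacent pair, the per-trace count
# differences summed over all traces stay within l times the trace count.

def noisy_equivalent_chain(log, chain, l):
    n_traces = len(log)
    return all(
        sum(abs(t.count(x) - t.count(y)) for t in log) <= l * n_traces
        for x, y in zip(chain, chain[1:]))

# b is missing from one of four traces: a and b are equivalent at
# noise level 0.25 but not at 0.
log = [("a", "b"), ("a", "b"), ("a",), ("a", "b")]
```

With \(l = 0\) the condition degenerates to exact per-trace count equality, matching the noise-free equivalence relation.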

Fig. 17.
figure 17

The full log skeleton discovered from the event log \(L_6\) allowing for \(20\%\) noise.

Figure 17 shows the log skeleton that results from event log \(L_6\) when setting the noise level to 0.2. For example, this shows that \(80\%\) of the instances of activity c are never preceded by e, that \(85\%\) of the instances of e are never followed by c, and that \(80\%\) of the instances of activity d do not co-occur with f.

5.4 Strengths and Limitations

Clearly, a log skeleton is not an imperative process model like a Petri net or a BPMN diagram. Instead, it is a declarative process model like Declare [45]. Some of the relations in log skeletons exist in Declare as well, such as \(R_{L'}\) (Response) and \(P_{L'}\) (Precedence). However, Declare contains many relations that are unknown in a log skeleton, while the Equivalence relation \(E_{L'}\) has no counterpart in Declare. As a result, a log skeleton can be considered as a Declare model restricted to only some relations but extended with an equivalence relation.

Of course, log skeletons also have their limitations. Process constructs that are known to be hard for log skeletons are loops and duplicate tasks. Furthermore, noise in an event log may be a problem, as a single misplaced activity may prevent the discovery of some relations. To alleviate the problems with these constructs and noise, the visualizer plugin allows the user to specify boundary activities (to tackle loops), to split activities (to tackle duplicates), and to set various noise levels (to tackle noise). Although our experience with the noise levels is very positive, our experience with the boundary activities and the splitting of activities shows that they can only solve some of the problems related to these hard process constructs. As a result, more research is needed in this direction.

6 Related Work

Discovering accurate and simple process models is extremely important to reduce the time spent to enhance them and avoid mistakes during process analysis [28].

While extensive research effort has been spent on designing the perfect automated process discovery algorithm, in parallel, researchers have investigated the problem of improving the quality of the input data, proposing techniques for data filtering and data repairing [19, 21, 32, 50,51,52, 57, 59, 69, 70, 78], as well as the problem of predicting which process discovery algorithm would yield the best process model for a given event log [47,48,49]. A few research studies also explored divide-and-conquer strategies, designing approaches that divide the input data into smaller chunks and separately feed each chunk to a discovery algorithm, in order to facilitate the discovery task. The set of process models discovered from the data chunks is then re-assembled into a single process model. Among these techniques we find Genet [15, 16], C-net miner [55], Stage Miner [43], BPMN Miner [20], and Decomposed Process Mining [77].

It is also worth mentioning techniques that have the ability to deal with negative examples [23, 24, 33], i.e., with traces that are known not to be part of the underlying process. Of course, such information is not often available, unless domain knowledge can be used or automated techniques can be applied for generating it [71, 72]. These techniques seem to be better positioned to also consider generalization when searching for the best process model.

Optimization metaheuristics have also been extensively applied in the context of automated process discovery, aiming to incrementally discover and refine the process model so as to reach a trade-off between accuracy and simplicity. The best known among these approaches are those based on evolutionary (genetic) algorithms [11, 25]. However, several other metaheuristics have been explored, such as the imperialist competitive algorithm [1], particle swarm optimization [18, 29, 44], and simulated annealing [31, 58].

Nonetheless, the latest literature review and benchmark in automated process discovery [2] highlighted that many of the state-of-the-art automated process discovery algorithms [4, 13, 34, 36, 67, 73, 79] were affected by one (or more) of the following three limitations when discovering process models from real-life event logs: i) they achieve limited accuracy; ii) they are too computationally inefficient to be used in practice; iii) they discover syntactically incorrect process models. In practice, when the behavior of the process recorded in the input event log varies little, most state-of-the-art automated process discovery algorithms can output accurate and simple process models. However, as the behavioral complexity of the process increases, the quality of the discovered process models can worsen quickly. Given that real-life event logs are oftentimes highly complex (i.e., containing complex process behavior, noise, and incomplete data), discovering highly accurate and simple process models with traditional state-of-the-art algorithms can be challenging.

On the other hand, achieving in a robust and scalable manner the best trade-off between accuracy and simplicity, while ensuring behavioral correctness (i.e., process soundness), has proved elusive. In particular, it is possible to group automated process discovery algorithms into two categories: those focusing more on the simplicity, the soundness, and either the precision [13] or the fitness [36] of the discovered process model, and those focusing more on its fitness and its precision at the cost of simplicity and/or soundness [4, 73, 79]. The first kind of algorithms strives for simplicity and soundness by enforcing block-structured behavior on the discovered process model. However, since real-life processes are not always block-structured, a direct consequence is an approximation of the behavior, which leads to a loss of accuracy (either fitness or precision). The second kind of algorithms does not adopt any strategy to deal with process simplicity and soundness, focusing only on capturing the process behavior in a process model, but in doing so it can produce unsound process models.

Alongside techniques that discover imperative process models, it is important to mention that there exist many discovery algorithms that produce declarative models [10, 27, 39, 40, 53, 74]. Declarative models capture the process behavior through a set of rules, also known as declarative constraints. Even though each declarative constraint is precise, capturing the whole process behavior in a declarative model can be very difficult, especially because declarative models do not give any information about “undeclared” behavior, i.e., any behavior that does not violate the declarative constraints is technically allowed. Hence, imperative process models are usually preferred in practice.

7 Challenges Ahead

Process mining started about 20 years ago with the development of control-flow miners like the Alpha Miner [64] and the Little Thumb Miner [80]. Although the field has advanced in these 20 years with many other control-flow miners, this does not mean that control-flow mining is a done deal.

Consider, for example, the results of the latest Process Discovery Contest (PDC 2020) [14], which are shown in Fig. 18. The PDC 2020 was a contest for fully-automated control-flow miners, and its results reflect the then-current state of the field. In this contest, every miner was used to discover a control-flow model from a training event log, after which this model was used to classify every trace from a test event log. As the ground truth for this classification is known, we can compute both the average positive accuracy and the average negative accuracy for all of the algorithms on this data set. The results show that there is still some ground to cover for the imperative miners, as none of them was able to achieve both an average positive accuracy and an average negative accuracy exceeding 80.0%.
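The evaluation scheme described above can be sketched as follows. This is an illustrative computation (the names are ours, and the PDC's actual scoring may differ in details): positive accuracy is measured over the traces that should be accepted, negative accuracy over the traces that should be rejected.

```python
# Sketch of a PDC-style evaluation: a discovered model classifies each
# test trace as fitting (True) or not, and accuracy is split over the
# ground truth's positive and negative traces.

def pdc_accuracies(predictions, ground_truth):
    """Both arguments map trace ids to booleans (fits / should fit)."""
    pos = [t for t, g in ground_truth.items() if g]
    neg = [t for t, g in ground_truth.items() if not g]
    pos_acc = sum(predictions[t] for t in pos) / len(pos)
    neg_acc = sum(not predictions[t] for t in neg) / len(neg)
    return pos_acc, neg_acc
```

Splitting the accuracy this way prevents a trivially permissive model (accept everything) from scoring well: it would reach a perfect positive accuracy but a negative accuracy of zero.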

Fig. 18.
figure 18

The results of the PDC 2020. The squares correspond to base miners, the circles to imperative miners (that result in an imperative model, like a Petri net or a BPMN diagram), and the triangles to declarative miners (that result in a declarative model, like a DCR graph or a log skeleton). The percentage mentioned with a miner is the score (see footnote 4) of that miner.

Table 4 shows the weaknesses of several algorithms submitted to the PDC 2020 contest. As an example, the weaknesses of the Inductive IMfa Miner included loops: It scored 59.2% on the event logs in the PDC 2020 data set that do not contain loops, and only 19.3% on the event logs that do contain loops. This table indicates that noise and loops, but also optional tasks and duplicate tasks, can be considered as challenges for control-flow miners in the near future.

Table 4. Weaknesses of miners submitted to PDC 2020, with their scores on the event logs that do not contain the weakness (No) and on those that do contain it (Yes). Only weaknesses for which the No and Yes scores differ by at least 10.0% are listed.

In these 20 years, algorithms have been developed that discover perspectives other than the control-flow perspective. However, many of these other perspectives are added on top of the discovered control-flow model, and hence depend on the discovery of a control-flow model of high-enough quality. Nevertheless, even if assuming that the quality of the control-flow model is indeed high enough, challenges remain for these other perspectives as well.

As a first example, consider the data perspective, which would add expressions (guards) to the control-flow model to guide its execution: Certain parts of the control-flow model may only be valid if a certain guard evaluates to true. Challenges here include the discovery of sensible guards with sensible values. As an example, if based on some value the control-flow goes either left or right, then the data in the event log may not contain this precise value. As a result, this value needs to be guessed based on the data that is in the event log.

A second example is the organizational perspective, which would add organizational entities (like users, groups, or roles) to certain parts of the control-flow model: Only resources (like users and automated services) that qualify for these entities can be involved in these parts. Challenges here include the discovery of the correct organizational entities at the correct level. As an example, if some activity was performed by some user according to the event log, then what is the correct organizational level (like user, role, group) for this activity?

8 Conclusion

In this chapter, we have introduced four advanced process discovery techniques: the state-based region miner, the language-based region miner, the split miner, and the log skeleton miner. Each of these techniques aims to alleviate shortcomings of the more basic process discovery techniques introduced in the previous chapter.

First, the region-based miners can lift the restriction of having to assume that every activity occurs only once in the model. When using regions, different contexts of an activity can be found, and the activity can then be divided over these contexts, leading to a model with one activity per context. This feature is not shared by any of the other miners, and it can be very important in case we have an event log of a system where such “duplicate activities” occur. Where other miners need to assume there is only one activity, which may lead to incomprehensible discovered models, the region-based miners do not need to make this assumption, which may result in more precise models.

Second, the split miner aims to discover process models that simultaneously maximize and balance fitness and precision, while at the same time minimizing the control-flow complexity of the resulting model. This approach brings precision and complexity into the equation, something that previously could be done only by using genetic miners like the evolutionary tree miner [12]. However, unlike genetic miners, the split miner typically takes seconds to discover a process model from the event log, as opposed to the hour-long execution times required by genetic miners [2].

Third, the log skeleton miner is not limited to using only the directly-follows relations, which are heavily leveraged by many existing discovery algorithms. This miner discovers a declarative model from the event log that contains facts like “95% of the instances of activity a are always followed by activity b”, or “90% of the instances of activity a do not co-occur with an instance of activity b”. As such, it can discover relations between activities that cannot be discovered by considering only the directly-follows relations.

It is clear that each of these advanced techniques can be used effectively on certain event logs, and may produce better models than those produced by basic techniques. However, ultimately, there is no technique yet that is effective on all (or even almost all) event logs regardless of the process behavior they feature. Such an ideal process discovery technique should be able to maximize the accuracy of the discovered process model while at the same time guaranteeing its simplicity and soundness. While, hitherto, the design of such a technique has proved challenging and elusive, it has become clear that each process discovery technique can be useful on some event logs. Hence, while we hope that future research endeavors will lead to the ideal process discovery technique, until it materializes, we have to rely on educated choices based on the process data at hand (i.e., in the form of an event log), and select the most appropriate technique for discovering the best process model.