Abstract
Given the challenges associated with the process discovery task, more than a hundred research studies have addressed the problem over the past two decades. Despite the richness of proposals, many state-of-the-art automated process discovery techniques, especially the oldest ones, struggle to systematically discover accurate and simple process models. In general, when the behavior recorded in the input event log is simple (e.g., exhibiting little parallelism, repetitions, or inclusive choices) or noise free, some basic algorithms such as the alpha miner can output accurate and simple process models. However, as the complexity of the input data increases, the quality of the discovered process models can worsen quickly. Given that oftentimes real-life event logs record very complex and unstructured process behavior containing many repetitions, infrequent traces, and incomplete data, some state-of-the-art techniques become unreliable and unfit for purpose. Specifically, they tend to discover process models that either have limited accuracy (i.e., low fitness and/or precision) or are syntactically incorrect. While currently there exists no perfect automated process discovery technique, some are better than others at discovering a process model from event logs recording complex process behavior. In this chapter, we introduce four such techniques, discussing their underlying approaches and algorithmic ideas, reporting their benefits and limitations, and comparing their performance with the algorithms introduced in the previous chapter.
1 Introduction
The previous chapter has introduced the alpha algorithm and the inductive mining algorithm as basic algorithms that discover an accepting Petri net from a (simplified) event log. It has also shown a number of example event logs for which these two basic algorithms work excellently. However, these two basic algorithms do not always perform well, often depending on the characteristics of the given event log.
In this chapter, we first introduce an example event log where the recorded process behavior features intertwined parallel compositions and exclusive choices. Second, we discuss the results of the alpha algorithm and the inductive mining algorithm on this example event log, showing that there is room for improvement. Third, we introduce four advanced process mining algorithms, discussing the results of applying these algorithms to the example event log and highlighting their benefits and limitations. The first two advanced algorithms use region-based techniques to discover accepting Petri nets: the first algorithm uses state-based regions and the second uses language-based regions. The third algorithm relies on sophisticated approaches to preprocess the DFG prior to the identification of the behavioral semantics of splits and joins, and it natively outputs BPMN models. Whereas these three algorithms produce imperative process models, the fourth algorithm generates declarative process models (like Declare) called log skeletons. As we shall see in this chapter, thanks to their advanced approaches, these mining algorithms are capable of handling event logs recording very complex process behavior better than the basic mining algorithms do. At the same time, these algorithms should not be considered bulletproof solutions for automated process discovery as, in general, their results vary depending on the input event log.
2 Motivation
To motivate the need for advanced process discovery algorithms, we introduce the event log \(L_6 = [\langle a,b,c,g,e,h \rangle ^{10}, \langle a,b,c,f,g,h \rangle ^{10}, \langle a,b,d,g,e,h \rangle ^{10}, \langle a,b,d,e,g,h \rangle ^{10}, \langle a,b,e,c,g,h \rangle ^{10}, \langle a,b,e,d,g,h \rangle ^{10}, \langle a,c,b,e,g,h \rangle ^{10}, \langle a,c,b,f,g,h \rangle ^{10}, \langle a,d,b,e,g,h \rangle ^{10}, \langle a,d,b,f,g,h \rangle ^{10}]\). At first sight, there seems to be a choice between c and d, followed by a choice between e and f. However, it is more complicated than that, as traces like \(\langle a,c,b,g,e,h \rangle \) and \(\langle a,b,d,f,g,h \rangle \) are not included in \(L_6\).
Figure 1 shows the DFG that results from the event log \(L_6\). Clearly, this DFG is not as symmetric as we would have thought after a first glance at \(L_6\). For example, e can be directly followed by c or d, but f is always directly followed by g.
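To make the directly-follows relation concrete, the DFG of \(L_6\) can be computed with a few lines of Python (an illustrative sketch; the function name and the log encoding, with each trace written as a string of activity labels, are ours):

```python
from collections import Counter

def build_dfg(log):
    """Count directly-follows pairs over all traces, weighted by trace frequency."""
    dfg = Counter()
    for trace, freq in log:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += freq
    return dfg

# Event log L6 from the text, each trace encoded as a string of activity labels.
L6 = [
    ("abcgeh", 10), ("abcfgh", 10), ("abdgeh", 10), ("abdegh", 10),
    ("abecgh", 10), ("abedgh", 10), ("acbegh", 10), ("acbfgh", 10),
    ("adbegh", 10), ("adbfgh", 10),
]
dfg = build_dfg(L6)
print(dfg[("e", "c")], dfg[("f", "g")])  # 10 30
```

The counts confirm the asymmetry noted above: e is directly followed by c in only one trace variant, whereas every occurrence of f is directly followed by g.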
Figure 2 shows the accepting Petri net that results from running the alpha algorithm on event log \(L_6\). The places with the > sign are places with a larger inflow than outflow, whereas the places with the < symbol are places with a smaller inflow than outflow. This is a clear indication that this net has quality problems, which is also confirmed by the fact that in this net the final marking is not reachable from the initial marking. It is possible to put a token in the final place, but then there would be other tokens in the net as well. Precisely, there would be tokens in the place that is the output of a and e and the input of c.
Figure 3 shows the process tree that results from running the inductive mining algorithm on event log \(L_6\). Although the process tree guarantees that the final marking is always reachable from the initial marking, this process tree allows for too much behavior. As an example, it is possible to do both e and f, or neither, even though in \(L_6\) exactly one of these two activities is observed per trace. Also, the fact that f is always directly followed by g is not captured by this process tree.
This shows that, for more complex event logs, we need more advanced algorithms than the alpha algorithm and the inductive mining algorithm. This chapter introduces four such advanced algorithms, each having more success in discovering a process model from the event log \(L_6\) than the basic algorithms from the previous chapter. They are:

1.
The State-based Region Miner, which produces accepting Petri nets like the basic algorithms do;

2.
The Language-based Region Miner, which also produces accepting Petri nets;

3.
The Split Miner, which produces BPMN models;

4.
The Log Skeleton Miner, which produces declarative process models (like Declare [45]) called log skeletons.
These four advanced algorithms are discussed in the next sections. As the first two algorithms both use the theory of regions, they are discussed in a single section. Then we continue with the Split Miner, and lastly we conclude with the Log Skeleton Miner.
3 The Theory of Regions
The theory of regions [30] was proposed in the early nineties to define a formal correspondence between behavior and structure. In particular, several region-based algorithms have been proposed in the last decades to synthesize specifications into Petri nets using this powerful theory.
As mining is a form of synthesis, several approaches have appeared to mine process models from event logs. Regardless of the particular region-based technique applied, the approaches that rely on the notion of region theory search for a process model that is both fitting and precise [17]. This section presents two branches of region-based approaches for process discovery: state-based and language-based approaches.
3.1 State-Based Region Approach for Process Discovery
Figure 4 shows the accepting Petri net that results from running the State-based Region Algorithm on event log \(L_6\). Note that for all places the inflow equals the outflow. In the remainder of this section, we provide an overview of the main ingredients of state-based region discovery.
State-based region approaches for process discovery need to convert the event log into a state-based representation, which is then used to discover the Petri net. This representation is formalized in the following definition.
Definition 1 (Transition System)
A transition system (TS) is a tuple \((S, \varSigma , A,s_{in})\), where S is a set of states, \(\varSigma \) is an alphabet of activities, \(A \subseteq S\times \varSigma \times S\) is a set of (labeled) arcs, and \(s_{in} \in S\) is the initial state. We will use \(s {\mathop {\rightarrow }\limits ^{e}} s'\) as a shortcut for \((s,e,s') \in A\), and the transitive closure of this relation will be denoted by \({\mathop {\rightarrow }\limits ^{*}}\).
Figure 5(a) presents an example of a transition system.
Definition 2 (Multiset representation of traces)
We denote by \(\#(\sigma ,\mathsf {e})\) the number of times that event \(\mathsf {e}\) occurs in \(\sigma \), that is, \(\#(\langle \mathsf {e}_{1}\ldots \mathsf {e}_{n} \rangle ,\mathsf {e}) = |\{ i \mid \mathsf {e}_{i} = \mathsf {e} \}|\). Given an alphabet \(\varSigma \), the Parikh vector of a sequence \(\sigma \) with respect to \(\varSigma \) is a vector \(p_{\sigma } \in \mathbb {N}^{\varSigma }\) such that \(p_{\sigma }(\mathsf {e}) = \#(\sigma ,\mathsf {e})\).
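The definition can be illustrated with a small Python sketch (function name ours), encoding a Parikh vector as a tuple of occurrence counts over the sorted alphabet:

```python
from collections import Counter

def parikh(sigma, alphabet):
    """Parikh vector of a sequence: the occurrence count of each activity."""
    counts = Counter(sigma)
    return tuple(counts[e] for e in sorted(alphabet))

# The sequence <a,b,c,a,b> over {a,b,c} has Parikh vector (2, 2, 1).
print(parikh("abcab", {"a", "b", "c"}))  # (2, 2, 1)
```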
The techniques described in [62] present different variants for generating a transition system from an event log. For the most common variant, the basic idea to incorporate state information is to look at the multiset of events included in each prefix of a trace in the event log:
Definition 3 (Multiset State Representation of an Event Log)
Given an event log \(L \in \mathcal {B}({\mathcal{U}_{ act }}^*)\), the TS corresponding to the multiset conversion of L, denoted as \(\textsf {TS}_{\text {mset}}(L)\), is \(\langle S, \varSigma , T, \mathsf {s}_{p_{\epsilon }} \rangle \), such that: S contains one state \(\mathsf {s}_{p_{w}}\) for each Parikh vector \(p_w\) of a prefix w in L, with \(\epsilon \) denoting the empty prefix, and \(T = \{ \mathsf {s}_{p_{w}} {{\mathop {\longrightarrow }\limits ^{\mathsf {e}}}} \mathsf {s}_{p_{w\mathsf {e}}} \mid w \mathsf {e} \text { is a prefix of } L \}\).
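Following Definition 3, the multiset transition system can be sketched as follows (an illustrative sketch, not the implementation of [62]; names ours). We encode a Parikh vector as a frozenset of (activity, count) pairs so that states are hashable:

```python
from collections import Counter

def ts_mset(log):
    """Build the multiset transition system: states are Parikh vectors of prefixes."""
    states, arcs = set(), set()
    for trace in log:
        counts = Counter()
        state = frozenset()  # Parikh vector of the empty prefix
        states.add(state)
        for e in trace:
            counts[e] += 1
            nxt = frozenset(counts.items())  # Parikh vector after firing e
            states.add(nxt)
            arcs.add((state, e, nxt))
            state = nxt
    return states, arcs

states, arcs = ts_mset(["abc", "acb"])
print(len(states), len(arcs))  # 5 5
```

Note how the traces \(\langle a,b,c \rangle \) and \(\langle a,c,b \rangle \) reach the same final state, since they have the same Parikh vector.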
In the sequence conversion, two traces lead to the same state if they fire the same events in exactly the same order.
Example 1
Let us use throughout this section an example extracted from [61]. The event log contains the following activities: r=register, s=ship, sb=send_bill, p=payment, ac=accounting, ap=approved, c=close, em=express_mail, rj=rejected, and rs=resolve. Given the event log \(L_7 = [\langle r,s,sb,p,ac,ap,c \rangle ^{10}, \langle r,sb,em,p,ac,ap,c \rangle ^{10}, \langle r,sb,p,em,ac,rj,rs,c \rangle ^{10}, \langle r,em,sb,p,ac,ap,c \rangle ^{10}, \langle r,sb,s,p,ac,rj,rs,c \rangle ^{10}, \langle r,sb,p,s,ac,ap,c \rangle ^{10}, \langle r,sb,p,em,ac,ap,c \rangle ^{10}]\), Fig. 5(a) shows an example of a TS constructed according to Definition 3.
A region^{Footnote 1} in a transition system is a set of states that satisfies a homogeneous relation with respect to the set of arcs. In the simplest case, this relation can be described by a predicate on the set of states considered. Formally:
Definition 4 (Region)
Let \(S'\) be a subset of the states of a TS, \(S' \subseteq S\). If \(s\not \in S'\) and \(s'\in S'\), then we say that transition \(s{\mathop {\rightarrow }\limits ^{a}} s'\) enters \(S'\). If \(s\in S'\) and \(s'\not \in S'\), then transition \(s{\mathop {\rightarrow }\limits ^{a}} s'\) exits \(S'\). Otherwise, transition \(s{\mathop {\rightarrow }\limits ^{a}} s'\) does not cross \(S'\): it is completely inside (\(s\in S'\) and \(s'\in S'\)) or completely outside (\(s\notin S'\) and \(s'\not \in S'\)). A set of states \(r \subseteq S\) is a region if, for each event \(e \in \varSigma \), exactly one of the three predicates (enters, exits, or does not cross) holds for all arcs labeled with e.
An example of a region is presented in Fig. 6 on the TS of our running example. In the highlighted region, event r enters the region, s and em exit the region, and the remaining labels do not cross the region.
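Checking the region condition of Definition 4 for a candidate set of states is straightforward to sketch (names ours): for every event, all arcs labeled with it must agree on a single relation.

```python
def is_region(arcs, subset):
    """A subset of states is a region iff, per event, all its arcs agree on
    one relation: enter, exit, or do-not-cross the subset."""
    relation = {}  # event -> "enter" | "exit" | "no-cross"
    for s, e, t in arcs:
        if s not in subset and t in subset:
            r = "enter"
        elif s in subset and t not in subset:
            r = "exit"
        else:
            r = "no-cross"
        if relation.setdefault(e, r) != r:
            return False  # event e plays two different roles
    return True

# Tiny TS with a and b concurrent: s0 -a-> s1 -b-> s2 and s0 -b-> s3 -a-> s2.
arcs = [("s0", "a", "s1"), ("s1", "b", "s2"), ("s0", "b", "s3"), ("s3", "a", "s2")]
print(is_region(arcs, {"s0", "s3"}))  # True: a always exits, b never crosses
print(is_region(arcs, {"s1"}))       # False: b both exits and stays outside
```

The region {s0, s3} corresponds to a place consumed by transition a, in line with the translation explained next.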
A region corresponds to a place in the Petri net, and the role of the arcs determines the Petri net flow relation: when an event e enters the region, there is an arc from the corresponding transition for e to the place, and when e exits the region, there is an arc from the place to the transition for e. Events satisfying the do not cross relation are not connected to the corresponding place. For instance, the region shown in Fig. 6(a) corresponds to the shadowed place in Fig. 6(b), where event r belongs to the set of input transitions of the place whereas events em and s belong to the set of output transitions. Hence, the algorithm for Petri net derivation from a transition system consists of finding regions and constructing the Petri net as illustrated with the previous example. In [26] it was shown that only a minimal set of regions is necessary, whereas further relaxations of this restriction can be found in [17]. The Petri net obtained by this method is guaranteed to accept the language of the transition system and to satisfy the minimal language containment property: if all the minimal regions are used, the derived Petri net is the one whose language difference with respect to the log is minimal, hence being the most precise Petri net for the set of transitions considered.
In any case, the algorithm that searches for regions in a transition system must explore the lattice of sets (or multisets, in the case of k-bounded regions), thus having a high complexity: for a transition system with n states, the lattice for k-bounded regions is of size \(\mathcal{O}(k^n)\). For instance, the lattice of sets of states for the toy TS used in this chapter (which has 22 states) has \(2^{22}\) possible sets to check for the region conditions. Although many simplification properties, efficient data structures and algorithms, and heuristics are used to prune this search space [17], they only help to alleviate the problem. Decomposition alternatives, which for instance use partitions of the state space to guide the search for regions, significantly alleviate the complexity of the state-based region algorithm, at the expense of not guaranteeing the derivation of precise models [15]. Other state-based region approaches for discovery have been proposed, which complement the approach described in this section [54,55,56].
3.2 Language-Based Region Approach for Process Discovery
In language-based region theory [6, 8, 9, 22, 37, 38] the goal is to construct the smallest Petri net such that the behaviour of the net is equal to the given input language (or minimally larger). [41] provides an overview of language-based region theory for different classes of languages: step languages, regular languages, and (infinite) partial languages.
Figure 7 shows the accepting Petri net that results from running the Language-based Region Algorithm on event log \(L_6\). As happened with state-based regions, again for all places the inflow equals the outflow.
More formally, let \(L \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an event log; language-based region theory then constructs a Petri net whose set of transitions equals \(\varSigma \) and in which all traces of L are firing sequences. The Petri net should allow only a minimal set of firing sequences not in the language L (while covering all prefixes in L). This is achieved by adding places to the Petri net that restrict unobserved behavior, while allowing for observed behavior. The theory of regions provides a method to identify these places, using language regions.
Definition 5 (Prefix Closure)
Let \(L \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an event log. The prefix closed language \(\mathcal {L} \subseteq \varSigma ^*\) of L is defined as: \(\mathcal {L} = \{ \sigma \in \varSigma ^* \mid \exists _{\sigma ' \in \varSigma ^*} \sigma \circ {}\sigma ' \in L\}\).
The prefix closure of a log is simply the set of all prefixes in the log (including the empty prefix).
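As a sketch (function name ours), the prefix closure of a log whose traces are encoded as strings:

```python
def prefix_closure(log):
    """All prefixes of all traces, including the empty prefix."""
    return {trace[:i] for trace in log for i in range(len(trace) + 1)}

# The shared prefix "a" and the empty prefix appear only once in the set.
print(sorted(prefix_closure(["ab", "ac"])))  # ['', 'a', 'ab', 'ac']
```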
Definition 6 (Language Region)
Let \(\varSigma \) be a set of activities. A region of a prefix-closed language \(\mathcal {L} \subseteq \varSigma ^*\) is a triple \((\vec {x},\vec {y}, c)\) with \(\vec {x},\vec {y} \in \{0,1\}^{\varSigma }\) and \(c \in \{0,1\}\), such that for each non-empty sequence \(w = w'\circ {}a \in \mathcal {L}\), \(w' \in \mathcal {L}\), \(a \in \varSigma \):

$$c + \vec {w'} \cdot \vec {x} - \vec {w} \cdot \vec {y} \ge 0,$$

where \(\vec {w}\) denotes the Parikh vector of w. This can be rewritten into the inequation system:

$$c \cdot \vec {1} + M' \cdot \vec {x} - M \cdot \vec {y} \ge \vec {0},$$

where M and \(M'\) are two \(\mathcal {L} \times \varSigma \) matrices with \(M(w,t) = \vec {w}(t)\) and \(M'(w,t) = \vec {w'}(t)\), with \(w = w'\circ {}a\). The set of all regions of a language is denoted by \(\Re (\mathcal {L})\), and the region \((\vec {0}, \vec {0}, 0)\) is called the trivial region.
Intuitively, vectors \(\vec {x},\vec {y}\) denote the set of incoming and outgoing arcs of the place corresponding to the region, respectively, and c indicates whether it is initially marked. Figure 8 shows a region for a language over four activities, i.e., each solution \((\vec {x},\vec {y},c)\) of the inequation system can be regarded in the context of a Petri net, where the region corresponds to a feasible place with preset \(\{t \in T \mid \vec {x}(t)=1\}\) and postset \(\{t \in T \mid \vec {y}(t) = 1\}\), initially marked with c tokens. Note that we do not assume arc weights here, while the authors of [6, 7, 22, 38] do.
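A candidate triple \((\vec {x},\vec {y},c)\) can be checked against a prefix-closed language with a direct Python sketch (names ours), testing the non-negativity of the token count of the corresponding place after every non-empty prefix:

```python
from collections import Counter

def is_language_region(prefixes, alphabet, x, y, c):
    """Check, for every non-empty prefix w = w'.a, that the place holds
    c + (Parikh of w') . x - (Parikh of w) . y >= 0 tokens."""
    for w in prefixes:
        if not w:
            continue
        pw, pw_ = Counter(w), Counter(w[:-1])
        tokens = c + sum(pw_[t] * x[t] for t in alphabet) \
                   - sum(pw[t] * y[t] for t in alphabet)
        if tokens < 0:
            return False
    return True

# A place between a and b for the language {"", "a", "ab"}: a produces, b consumes.
alphabet = ["a", "b"]
x = {"a": 1, "b": 0}   # incoming arc from transition a
y = {"a": 0, "b": 1}   # outgoing arc to transition b
print(is_language_region({"", "a", "ab"}, alphabet, x, y, 0))  # True
```

By contrast, the same place is not a region of a language containing the prefix "b", since b would consume from an empty, initially unmarked place.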
Since the place represented by a region can be added to a Petri net without disturbing the fact that the net can reproduce the language under consideration, such a place is called a feasible place.
Definition 7 (Feasible place)
Let \(\mathcal {L}\) be a prefixclosed language over \(\varSigma \) and let \(N = ((P,\varSigma ,F),m)\) be a marked Petri net. A place \(p \in P\) is called feasible if and only if there exists a corresponding region \((\vec {x},\vec {y},c) \in \Re (\mathcal {L})\) such that \(m(p) = c\), and \(\vec {x}(t) = 1\) if and only if \(t \in {{}^{\bullet }{p}}\), and \(\vec {y}(t) = 1\) if and only if \(t \in {{p}^{\bullet }}\).
In general, there are many feasible places for any given event log (when considering arc weights in the discovered Petri net, there are even infinitely many). Several methods exist for selecting an appropriate subset of these places. The authors of [7, 38] present two ways of finitely representing these places, namely a basis representation and a separating representation. Both representations maximize precision, i.e., they select a set of places such that the behavior of the model outside of the log is minimal.
In contrast, the authors of [63, 65, 66, 68] focus on those feasible places that express some causal dependency observed in the event log, and/or ensure that the entire model is a connected workflow net. They do so by introducing various cost functions favouring one solution of the equation system over another and then selecting the top candidates.
3.3 Strengths and Limitations of Region Theory
The goal of region theory is to find a Petri net that perfectly describes the observed behavior (where this behavior is specified in terms of a language or a state space). As a result, the Petri nets are perfectly fitting and maximally precise. Consequently, the assumption on the input event log is that it records a full behavioral specification, i.e., that the input is complete and noise free, while the assumption on the output is that it is a compact and exact representation of the behavior recorded in the input event log. To this end, we note that, although in this section we have focused on safe nets, the theory of regions can represent general k-bounded Petri nets, a feature that is not yet provided by any other automated process discovery technique.
When applying region theory in the context of process mining, it is therefore very important to perform any required generalization of the behavior recorded in the input event log before calling region theory algorithms. For state-based regions, the challenge lies in the construction of the state space from the event log, while in language-based regions it lies in the selection of the appropriate prefixes to include in the final prefix-closed language in order to ensure some level of generalization.
In the next section, we will see that Split Miner relaxes the requirement of having the full behavioral specification recorded in the input event log, striving to discover BPMN process models that maximize the balance between fitness and precision.
4 Split Miner
In the following, we describe how Split Miner (hereinafter, SM) discovers a BPMN model starting from an event log. SM operates in six steps (cf. Fig. 9). In the first step, it constructs the DFG and analyses it to detect self-loops and short-loops. In the second step, it discovers concurrency relations between pairs of activities in the DFG. In the third step, the DFG is filtered by applying a filtering algorithm designed to strike a balance between the fitness and precision of the final BPMN model while maintaining a low control-flow complexity. The fourth and fifth steps focus (respectively) on the discovery of split and join gateways: activities having multiple outgoing edges are turned into a hierarchy of split gateways, while activities having multiple incoming edges are turned into a hierarchy of join gateways. Lastly, if any OR-joins were discovered, they are analyzed and turned (whenever possible) into either XOR-gateways or AND-gateways.
Although some of the steps executed by SM are typical of basic automated process discovery techniques such as the alpha miner and the inductive miner (e.g., the filtering of the DFG), the steps of SM were designed to overcome the limitations of such techniques, most notably to increase precision without compromising fitness and/or structural complexity. Furthermore, in SM, each step can operate as a black box, allowing for future improvements by redesigning or enhancing one step at a time [5].
We now provide a brief overview of each step of SM in a tutorial-like fashion, leveraging the example log \(L_6 = [\langle a,b,c,g,e,h \rangle ^{10}, \langle a,b,c,f,g,h \rangle ^{10}, \langle a,b,d,g,e,h \rangle ^{10}, \langle a,b,d,e,g,h \rangle ^{10}, \langle a,b,e,c,g,h \rangle ^{10}, \langle a,b,e,d,g,h \rangle ^{10}, \langle a,c,b,e,g,h \rangle ^{10}, \langle a,c,b,f,g,h \rangle ^{10}, \langle a,d,b,e,g,h \rangle ^{10}, \langle a,d,b,f,g,h \rangle ^{10}]\) (introduced in Sect. 2). Given that an in-depth analysis of the algorithms behind SM would be out of the scope of this chapter and book, we refer the interested reader to the original work [3].
4.1 Step 1: DFG and Loops Discovery
Given the input event log \(L_6\), SM first builds its DFG, as shown in Fig. 10a. In this example, all the traces have the same start and end activity; nevertheless, SM automatically adds artificial start and end activities (represented by the nodes \(\blacktriangleright \) and \({\scriptstyle {\blacksquare }}\)).
Then, SM detects self-loops and short-loops, i.e., loops involving only one or two activities (respectively). Loops are known to cause problems when detecting concurrency [60]; hence, we want to detect loops before detecting concurrency.
The simplest of the loops is the self-loop: a self-loop exists if a node is both the source and the target of one arc of the DFG, i.e., \(a\rightarrow a\). Short-loops and their frequencies are detected in the log as follows. Given two activities a and b, for SM, a short-loop (\(a\circlearrowright b\)) exists if and only if (iff) the following two conditions hold:
 i.:

neither a nor b is a self-loop;
 ii.:

there exists at least one log trace containing the subtrace \(\langle a,b,a\rangle \) or \(\langle b,a,b\rangle \).
Condition (i) is necessary because otherwise the short-loop evaluation may not be reliable. In fact, if we consider a process that allows concurrency between a self-loop activity a and a normal activity b, we could observe log traces containing the subtrace \(\langle a,b,a\rangle \), which can also characterize \(a\circlearrowright b\). Condition (ii) guarantees that we have observed (in at least one trace of the log) a short-loop between the two activities. In fact, short-loops are characterized by subtraces of the type \(\langle a,b,a\rangle \) or \(\langle b,a,b\rangle \).
The detected self-loops are trivially removed from the DFG and restored only in the output BPMN model, while the detected short-loops are saved and used in the next step. In our example (Fig. 10a), there are no self-loops or short-loops.
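Conditions (i) and (ii) above translate into a short Python sketch (illustrative only, not SM's implementation; names ours; traces are strings of activity labels):

```python
def detect_loops(log, dfg_arcs):
    """Self-loops: an arc a -> a in the DFG. Short-loops a <-> b: neither is a
    self-loop and some trace contains <a,b,a> or <b,a,b>."""
    self_loops = {a for (a, b) in dfg_arcs if a == b}
    short_loops = set()
    for trace in log:
        for i in range(len(trace) - 2):
            a, b = trace[i], trace[i + 1]
            if trace[i + 2] == a and a != b \
                    and a not in self_loops and b not in self_loops:
                short_loops.add(frozenset((a, b)))
    return self_loops, short_loops

# "xaba" contains the subtrace <a,b,a>, so a and b form a short-loop.
log = ["xaba", "xabab"]
arcs = {("x", "a"), ("a", "b"), ("b", "a")}
print(detect_loops(log, arcs))
```

Note that the same pattern \(\langle a,b,a\rangle \) is only counted as a short-loop because neither activity is a self-loop, mirroring Condition (i).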
4.2 Step 2: Concurrency Discovery
Given a DFG and any two activities a and b, such that neither a nor b is a self-loop, for SM, a and b are considered concurrent (denoted \(a \Vert b\)) iff three conditions hold:
 iii.:

there exist two arcs in the DFG: (a, b) and (b, a);
 iv.:

neither a nor b is in a short-loop;
 v.:

the arcs (a, b) and (b, a) have similar frequency: \(\frac{\left| \left| a \rightarrow b \right| - \left| b\rightarrow a \right| \right| }{\left| a \rightarrow b \right| + \left| b \rightarrow a \right| } < \varepsilon \) (\(\varepsilon \in [0,1]\)).
These three conditions define the heuristic-based concurrency oracle of SM. The rationale behind the conditions is the following. Condition (iii) captures the basic requirement for \(a \Vert b\): the existence of the arcs (a, b) and (b, a) entails that a and b can occur in any order. However, Condition (iii) is not sufficient to postulate concurrency because it may hold in three cases: a and b form a short-loop; (a, b) or (b, a) is an infrequent observation (e.g., noise in the data); or a and b are concurrent. We are interested in identifying when the third case holds. To this end, we check Conditions (iv) and (v). When Condition (iv) holds, we can exclude the first case because a and b do not form a short-loop. When Condition (v) holds, we can exclude the second case because (a, b) and (b, a) are both observed frequently and with similar frequencies. At this point, we are left with the third case and we assume \(a\Vert b\). The variable \(\varepsilon \) is a user input parameter: the smaller its value, the more similar the numbers of observations of (a, b) and (b, a) have to be. Conversely, setting \(\varepsilon = 1\) makes Condition (v) always hold.
Whenever we find \(a\Vert b\), we remove the arcs (a, b) and (b, a) from the DFG, since we assume there is no causality but instead there is concurrency. On the other hand, if we find that either (a, b) or (b, a) represents an infrequent directly-follows relation, we remove the least frequent of the two edges. We call the output of this step a Pruned DFG (PDFG).
In the example in Fig. 10a, we identify four possible cases of concurrency: (b, c), (b, d), (d, e), (e, g). Setting \(\varepsilon = 0.25\), we capture the following concurrency relations: \(b\Vert c\), \(b\Vert d\), \(d\Vert e\), \(e\Vert g\). The resulting PDFG is shown in Fig. 10b.
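The three conditions of the oracle can be sketched as follows (an illustrative sketch, not SM's implementation; names ours). The example frequencies match the DFG of \(L_6\): (b, c) and (c, b) both occur 20 times, while (c, e) is never observed:

```python
def concurrent(a, b, dfg, self_loops, short_loops, eps):
    """SM's heuristic concurrency oracle, sketched: a || b iff both arcs exist,
    the pair is not a short-loop, and the arc frequencies are similar."""
    if a in self_loops or b in self_loops:         # precondition of the oracle
        return False
    fab, fba = dfg.get((a, b), 0), dfg.get((b, a), 0)
    if fab == 0 or fba == 0:                       # condition (iii)
        return False
    if frozenset((a, b)) in short_loops:           # condition (iv)
        return False
    return abs(fab - fba) / (fab + fba) < eps      # condition (v)

dfg = {("b", "c"): 20, ("c", "b"): 20, ("e", "c"): 10}
print(concurrent("b", "c", dfg, set(), set(), 0.25))  # True
print(concurrent("e", "c", dfg, set(), set(), 0.25))  # False: no arc (c, e)
```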
4.3 Step 3: Filtering
The filtering algorithm applied by SM on the PDFG is based on three criteria. First, each node of the PDFG must be on a path from the single start node (source) to the single end node (sink). Second, for each node, (at least one of) its path(s) from source to sink must be the one having maximum capacity. In our context, the capacity of a path is the frequency of the least frequent arc of the path. Third, the number of edges of the PDFG must be minimal. The three criteria aim to guarantee that the discovered BPMN process model is accurate and simple at the same time.
The filtering algorithm performs a double breadth-first exploration: forward (source to sink) and backward (sink to source). During the forward exploration, for each node of the PDFG, we discover its maximum source-to-node capacity and the incoming edge granting such capacity (best incoming edge). During the backward exploration, for each node of the PDFG, we discover its maximum node-to-sink capacity and the best outgoing edges. Then, we remove from the PDFG all the edges that are neither best incoming edges nor best outgoing edges. In doing so, we may reduce the amount of behavior that the final model can replay, and consequently its fitness. Therefore, we introduce a frequency threshold that allows the user to strike a balance between fitness and precision. Precisely, we compute the \(\eta \) percentile over the frequencies of the best incoming and outgoing edges of each node, and we add back to the PDFG the edges with a frequency exceeding the threshold. It is important to note that the percentile is not taken over the frequencies of all the edges; otherwise, we would simply retain an \(\eta \) percentage of all the edges. Also, this means that even by setting \(\eta = 0\), SM will still apply a certain amount of filtering.
Figure 11b shows the output of the filtering algorithm when applied to the PDFG of our working example (Fig. 11a). As a consequence of retaining only the best incoming and outgoing edges of each node, we drop the arcs (e, c) and (c, f); these would not be retained regardless of the value assigned to \(\eta \).
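The forward exploration amounts to a widest-path (maximum-bottleneck) search over the edge frequencies. A sketch using a Dijkstra-like variant (illustrative only; SM uses a breadth-first exploration, and all names here are ours):

```python
import heapq

def forward_capacities(edges, source):
    """Maximum source-to-node capacity (bottleneck of the widest path) and the
    best incoming edge realising it, via a widest-path variant of Dijkstra."""
    cap = {source: float("inf")}
    best_in = {}
    heap = [(-cap[source], source)]  # max-heap via negated capacities
    while heap:
        c, n = heapq.heappop(heap)
        c = -c
        if c < cap.get(n, 0):
            continue  # stale entry
        for (u, v), f in edges.items():
            if u != n:
                continue
            nc = min(c, f)  # path capacity = least frequent arc on the path
            if nc > cap.get(v, 0):
                cap[v] = nc
                best_in[v] = (u, v)
                heapq.heappush(heap, (-nc, v))
    return cap, best_in

# Toy PDFG: the widest i-to-o path goes through b (bottleneck 10, not 5).
edges = {("i", "a"): 30, ("i", "b"): 10, ("a", "o"): 5, ("b", "o"): 10}
cap, best_in = forward_capacities(edges, "i")
print(cap["o"], best_in["o"])  # 10 ('b', 'o')
```

The backward pass is symmetric, running the same search over reversed edges from the sink.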
4.4 Step 4: Splits Discovery
Before discovering the split gateways, the filtered PDFG is converted into a BPMN process model by turning the start (\(\blacktriangleright \)) and end (\({\scriptstyle {\blacksquare }}\)) nodes of the graph into the start and end events of the BPMN model, and each other node of the graph into a BPMN activity. Figure 12a shows the BPMN model^{Footnote 2} generated from the filtered PDFG of our working example (Fig. 11b). Now, let us focus on the discovery of the split gateways by considering the example in Fig. 13a. Given an activity with multiple outgoing edges (e.g., activity z), the splits discovery is based on the idea that all the activities directly following (successors of) the same split gateway must have the same concurrency and/or mutual exclusion relations with the activities that do not directly follow their preceding split gateway. Looking at Fig. 13b, we see that since activities c and d are successors of gateway \(and_1\), both c and d are concurrent to e, f, g, due to gateway \(and_3\) (i.e., \(c\Vert e\), \(c\Vert f\), \(c\Vert g\), and \(d\Vert e\), \(d\Vert f\), \(d\Vert g\)). At the same time, both c and d are mutually exclusive with a and b, due to gateway \(xor_3\). Considering activities in pairs, and analyzing which concurrency or mutual exclusion relations they have in common, we can generate the appropriate hierarchy of splits.
With this in mind, we continue our working example. Let us consider activity A (Fig. 12a); it has three successors: B, C, and D. From the outcome of Step 2, we know that both C and D are concurrent to B, while C and D are not concurrent (hence, mutually exclusive with each other). Since C and D share the same relations to other activities (both are concurrent to B), they can be selected as successors of the same gateway, which in this case is an XOR-gateway because C and D are mutually exclusive. After we add the XOR-gateway, activity A has two successors: B and the newly added XOR-gateway (see Fig. 12b). The algorithm becomes trivial when an activity with multiple outgoing edges has only two successors; indeed, it is enough to add a split gateway matching the relation between the two successors. Continuing the example of activity A, the successor B is in parallel with the newly added XOR-gateway or, more precisely, with all the activities following the XOR-gateway (activities C and D). Therefore, we can add an AND-gateway preceding B and the XOR-gateway. Similarly, if we consider activity B and its two successors, activities E and F: given that they are not concurrent, they must be mutually exclusive, and therefore an XOR-gateway is placed before them. The result of the splits discovery is shown in Fig. 12c.
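The grouping of successors by shared relations can be sketched as follows (a simplified illustration of the idea, not SM's actual algorithm; names ours). Two successors join the same group, and hence the same XOR branch, when they are mutually exclusive with each other and agree on their concurrency relations towards the remaining successors:

```python
def group_successors(successors, concurrent_pairs):
    """Group mutually exclusive successors that share the same concurrency
    relations towards the other successors of the split."""
    def conc(a, b):
        return frozenset((a, b)) in concurrent_pairs

    groups = []
    for s in successors:
        for g in groups:
            rep = g[0]  # representative of the group
            same_relations = all(
                conc(s, o) == conc(rep, o)
                for o in successors if o not in g and o != s)
            if not conc(s, rep) and same_relations:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

# Successors of A in the working example: C and D are mutually exclusive and
# both concurrent to B, so {C, D} forms one XOR branch while B stands alone.
print(group_successors(["b", "c", "d"],
                       {frozenset(("b", "c")), frozenset(("b", "d"))}))
```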
4.5 Step 5: Joins Discovery
Once all the split gateways have been placed, we can discover the join gateways. To do so, we rely on the Refined Process Structure Tree (RPST) [46] of the current BPMN model. The RPST of a process model is a tree data structure where the tree nodes represent the single-entry single-exit (SESE) fragments of the process model, and the tree edges denote a containment relation between SESE fragments. Precisely, the children of a SESE fragment are its directly contained SESE fragments, whilst SESE fragments on different branches of the tree are disjoint. Each SESE fragment represents a subgraph of the process model, and the partition of the process model into SESE fragments is made in terms of edges. A SESE fragment can be of one of the following four types: a trivial fragment, which consists of a single edge; a polygon, which consists of a sequence of fragments; a bond, which is a fragment where all the children fragments share two common nodes, one being the entry and the other being the exit of the bond; and a rigid, which represents any other fragment. Each SESE fragment is classified as homogeneous if the gateways it contains (and that are not contained in any of its SESE children) are all of the same type (e.g., only XOR-gateways), or heterogeneous if its gateways have different types. Figure 14a and Fig. 14b show two examples of homogeneous SESE fragments: a bond and a rigid.
We note that, at this stage, all the SESE fragment exits in the BPMN model (Fig. 12c) correspond to activities with multiple incoming edges, which we aim to turn into join gateways. Starting from the leaves of the RPST, i.e., the innermost SESE fragments of the process model, we explore the RPST bottom-up. For each SESE fragment we encounter in this exploration, we select the activities it contains that have multiple incoming edges (there is always at least one, the SESE fragment exit). For each of the selected activities, we add a join gateway preceding it. The join gateway type depends on whether the SESE fragment is homogeneous or heterogeneous: in the former case, the join gateway has the same type as the other gateways in the SESE fragment; in the latter case, the join gateway is an OR-gateway. Figure 14 shows in brief how our approach works for SESE bonds (Fig. 14a), for homogeneous SESE rigids (Fig. 14b), and for all other cases, i.e., heterogeneous SESE rigids (Fig. 14c).
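The join-type rule above is simple enough to state as a small helper. This is a hypothetical sketch of our own (the RPST computation and fragment traversal are assumed to happen elsewhere):

```python
def join_type(fragment_gateway_types):
    """Choose the type of a join gateway to insert in a SESE fragment:
    if the fragment is homogeneous (all the gateways it directly
    contains share one type) reuse that type, otherwise fall back to
    an inclusive OR-join.

    fragment_gateway_types: list of types ('XOR' or 'AND') of the
    gateways directly contained in the fragment."""
    kinds = set(fragment_gateway_types)
    if len(kinds) == 1:   # homogeneous fragment
        return kinds.pop()
    return 'OR'           # heterogeneous fragment
```

For the working example, the XOR-homogeneous bond around C, D, and G yields an XOR-join, while the heterogeneous parent rigid yields OR-joins, as described next.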
Returning to our working example (Fig. 12c), we can discover three joins. The first one is the XOR-join in the SESE bond containing activities C, D, and G, with G as the exit of the bond and the XOR-split as the entry. The bond is XOR-homogeneous, so the type of the join is set to XOR. The remaining two joins are in the parent SESE fragment of the bond, which is a heterogeneous rigid; hence, we place two OR-joins. The resulting model is shown in Fig. 12d.
4.6 Step 6: OR-joins Minimization
The previous step may leave several OR-join gateways in the discovered BPMN model. Since OR-gateways can be difficult to interpret [42], SM tries to remove them by analyzing the process behavior and turning OR-gateways into AND- or XOR-gateways whenever the behavior is interchangeable.
4.7 Strengths and Limitations of Split Miner
SM was designed to bring together the strengths of older, basic automated process discovery algorithms while addressing their limitations. An example of this design strategy is the filtering algorithm. Past filtering algorithms were either based on heuristics [73, 79], which risk compromising the correctness of the output model, or driven by structural requirements [35]. While SM retains the idea of an integrated filtering algorithm, it focuses on balancing fitness, precision, and simplicity of the output process model.
While past automated discovery algorithms favored either accuracy [73, 79] or simplicity [11, 35], SM aims to strike a trade-off between the two. The splits and joins discovery steps do not impose any structural constraint on the output process model, as opposed to the inductive miner [35] and the evolutionary tree miner [11], which enforce block-structuredness; this allows SM to pursue accuracy. Yet, the discovery of the split gateways is designed to produce hierarchies of gateways, which foster simplicity and structuredness, while the joins discovery and the use of OR-gateways allow for simplicity without compromising accuracy.
However, SM has its own limitations too. First, SM was designed for real-life contexts, and it operates under the assumption that there is always some infrequent behavior to filter out. Second, SM may discover unsound processes: hitherto, soundness has been guaranteed only by enforcing block-structuredness, a trend that SM does not adhere to. While SM guarantees to discover deadlock-free process models [3], it does not guarantee that such process models respect the soundness property of proper completion, meaning that when a token reaches the end event of the process model, more tokens may be left behind. Nonetheless, the chances of SM discovering an unsound process model are very low [2], and in most cases it can discover accurate yet simple and sound process models.
5 Log Skeletons
The previous sections introduced three advanced mining algorithms that tackle the example event log \(L_6\) with more success than the basic algorithms introduced in Sect. 2. Like these basic algorithms, these advanced algorithms all result in an imperative process model, that is, a process model that indicates what the next possible steps are. However, next to these imperative models, there are also declarative models, like Declare [45]. Unlike an imperative model, a declarative model does not specify what the next possible steps are; instead, it provides a collection of constraints that any process instance should, in the end, adhere to.
This section introduces an advanced mining algorithm that results in a declarative process model, called a log skeleton [75]. This algorithm has been implemented as the “Visualize Log as Log Skeleton” visualizer plugin in ProM 6 [76]. Provided an event log L, the algorithm first extends the event log with the artificial start activity \(\blacktriangleright \) and the artificial end activity \({\scriptstyle {\blacksquare }}\). In accordance with Sect. 2, we use \(L'\) to denote the event log L extended with these artificial activities. Second, the algorithm discovers from this extended event log \(L'\) the collection of initial specific constraints it adheres to. Third, it reduces these constraints, keeping only those that are considered to be relevant. Fourth, it shows the most relevant constraints to the user as a graph. These last three steps are detailed in the next sections.
5.1 Discovering the Log Skeleton
The specific constraints in a log skeleton are the following three activity frequencies and six binary activity relations.
Definition 8 (Log Skeleton Frequencies and Relations)
Let \(L' \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an extended event log and let \(a, b \in act (L')\) be two different activities.
The frequency of activity a in event log \(L'\), i.e., the total number of occurrences of a over all traces.
\(l(a)\) is the lowest frequency of activity a in any trace in event log \(L'\).
\(h(a)\) is the highest frequency of activity a in any trace in event log \(L'\).
\((a,b) \in E_{L'}\) denotes that for every trace in event log \(L'\) the frequencies of activities a and b are the same. Note that the relation \(E_{L'}\) induces an equivalence relation over the activities. We use \(r_{L'}(a)\) to denote the representative activity for the equivalence class of activity a (by definition, \((r_{L'}(a),a) \in E_{L'}\)).
\((a,b) \in R_{L'}\) denotes that for every trace in event log \(L'\) an occurrence of activity a is always followed by an occurrence of activity b. This corresponds to the response relation in Declare.
\((a,b) \in P_{L'}\) denotes that for every trace in event log \(L'\) an occurrence of activity a is always preceded by an occurrence of activity b. This corresponds to the precedence relation in Declare.
\((a,b) \in \overline{R}_{L'}\) denotes that for every trace in event log \(L'\) an occurrence of activity a is never followed by an occurrence of activity b.
\((a,b) \in \overline{P}_{L'}\) denotes that for every trace in event log \(L'\) an occurrence of activity a is never preceded by an occurrence of activity b.
\((a,b) \in \overline{C}_{L'}\) denotes that for every trace in event log \(L'\) an occurrence of activity a never co-occurs with an occurrence of activity b.
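As an illustration, the frequencies and relations of Definition 8 can be computed by brute force from a small log. The following is a naive sketch in plain Python (the names are ours, not the chapter's notation, and a real implementation would avoid the quadratic scans):

```python
from collections import Counter

def log_skeleton_relations(log):
    """Compute the Definition 8 frequencies and relations from a list
    of traces, each trace being a list of activity labels."""
    acts = {a for t in log for a in t}
    freq = Counter(a for t in log for a in t)        # overall frequency
    low = {a: min(t.count(a) for t in log) for a in acts}   # l(a)
    high = {a: max(t.count(a) for t in log) for a in acts}  # h(a)

    def always(pred):
        # pairs (a, b) of different activities such that pred holds
        # in every trace of the log
        return {(a, b) for a in acts for b in acts if a != b
                and all(pred(t, a, b) for t in log)}

    # a is always followed by b: every occurrence of a has a later b
    resp = always(lambda t, a, b:
                  all(b in t[i + 1:] for i, x in enumerate(t) if x == a))
    # a is always preceded by b: every occurrence of a has an earlier b
    prec = always(lambda t, a, b:
                  all(b in t[:i] for i, x in enumerate(t) if x == a))
    # a is never followed by b
    nresp = always(lambda t, a, b:
                   all(b not in t[i + 1:] for i, x in enumerate(t) if x == a))
    # a is never preceded by b
    nprec = always(lambda t, a, b:
                   all(b not in t[:i] for i, x in enumerate(t) if x == a))
    # a never co-occurs with b
    nco = always(lambda t, a, b: not (a in t and b in t))
    # equivalence: same frequency in every trace
    equiv = always(lambda t, a, b: t.count(a) == t.count(b))
    return dict(freq=freq, low=low, high=high, equiv=equiv,
                resp=resp, prec=prec, nresp=nresp, nprec=nprec, nco=nco)
```

The extension with the artificial start and end activities is assumed to have been applied to `log` beforehand.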
Figure 15 shows that we can easily visualize the frequencies and the equivalence relation in the nodes of the log skeleton. The activity, the representative of its equivalence class, and the frequencies are simply shown at the bottom of the node, whereas equivalent nodes also have the same background color. For example, Fig. 15 immediately shows that the activities a, b, g, h, \({\scriptstyle {\blacksquare }}\), and \(\blacktriangleright \) are equivalent.
The remaining five activity relations are visualized by edges between these nodes. However, there could be many such relations, which could very well result in a model that is often called a spaghetti model: a model that contains way too many edges to make any sense of it. Consider, for example, Table 1, which shows that for event log \(L_6\) there are relations between 80 out of 90 possible pairs of different activities, like \((f,b) \in P_{L_6} \cap \overline{R}_{L_6}\). For this reason, the algorithm reduces the collection of these remaining five relations to a collection of relevant relations.
Definition 9 (Relevant Log Skeleton Relations)
Let \(L' \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an extended event log and let \(a, b \in act (L')\) be two different activities.
\(\mathcal {R}_{L'}\) is the transitively reduced version of \(R_{L'}\). Clearly, if a is always followed by c and c is always followed by b, then a must be always followed by b.
\(\mathcal {P}_{L'}\) is the transitively reduced version of \(P_{L'}\). Clearly, if a is always preceded by c and c is always preceded by b, then a must be always preceded by b.
\(\overline{\mathcal {R}}_{L'}\) is the transitively reduced version of \(\overline{R}_{L'}\), on top of which the fact that a is never followed by b is also considered irrelevant if a and b do not co-occur. It is not true that if a is never followed by c and c is never followed by b, then a is never followed by b. Consider, for example, the event log containing the traces \(\langle a, b \rangle \), \(\langle b, c \rangle \), and \(\langle c, a \rangle \). We are aware of this, but believe that the benefits of doing the transitive reduction outweigh the fact that we may remove relevant relations.
\(\overline{\mathcal {P}}_{L'}\) is the transitively reduced version of \(\overline{P}_{L'}\), on top of which the fact that a is never preceded by b is also considered irrelevant if a and b do not co-occur. As with \(\overline{\mathcal {R}}_{L'}\), it is not true that if a is never preceded by c and c is never preceded by b, then a is never preceded by b.
The relevant never-co-occurs relation \(\overline{\mathcal {C}}_{L'}\) reduces \(\overline{C}_{L'}\) in a similar way: clearly, if b is always preceded by c and c does not co-occur with a, then b cannot co-occur with a, so such induced pairs can be dropped. Note that we could also have used the always-follows relation \(R_{L'}\) here instead of the always-precedes relation \(P_{L'}\), but using the latter relation results in the relevant never-co-occurs relations being positioned more towards the beginning of the process, that is, towards the point where the actual decision was made to choose one or the other.
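The transitive reduction used throughout Definition 9 can be sketched as follows. This naive version assumes the input relation is transitively closed (as the always-follows and always-precedes relations are, per the argument above) and simply drops every pair that is induced via an intermediate activity:

```python
def transitive_reduction(rel):
    """Transitively reduce a relation given as a set of (a, b) pairs:
    drop (a, b) whenever some intermediate c gives both (a, c) and
    (c, b), since such a pair is induced and hence irrelevant."""
    elems = {x for pair in rel for x in pair}
    return {(a, b) for (a, b) in rel
            if not any((a, c) in rel and (c, b) in rel for c in elems)}
```

For instance, if a is always followed by c, c by b, and (hence) a by b, the induced pair (a, b) is removed while (a, c) and (c, b) are kept.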
Table 2 shows the results for the event log \(L_6\): Of the 80 initial relations, only 32 are considered to be relevant. Finally, the algorithm shows the log skeleton as a graph to the user, where this graph contains only edges for the relevant relations.
5.2 Visualizing the Log Skeleton
The discovered log skeleton is visualized using a log skeleton graph, which is a graph showing the relevant relations, the equivalence classes, and the frequencies as discovered from the event log.
Definition 10 (Log Skeleton Graph)
Let \(L' \in \mathcal {B}({\mathcal{U}_{ act }}^*)\) be an extended event log and let \(a, b \in act (L')\). The log skeleton graph for \(L'\) is the graph \(G=(V,E,d)\) where:
\(V\) is the set of nodes, where every node contains the activity, the representative of the activity within its equivalence class, the frequency of the activity in the log, and the minimal and maximal frequencies of the activity in any trace. If \(l(a)=h(a)\) then only l(a) is shown, otherwise l(a)..h(a) is shown.
\(E\) is the set of edges, where we have an edge from one activity to another activity if we have a relevant relation between these activities (in either direction).
\(d\) denotes the decorator to be used to show the relation from the activity at the tail to the activity at the head:

if \((a,b) \in R_{L'}\) then \(d((a,b)) =\, \blacktriangleright \), indicating that a is always followed by b,

else if \((a,b) \in P_{L'}\) then \(d((a,b)) =\, \blacktriangleleft \), indicating that a is always preceded by b,

else if \((a,b) \in \overline{C}_{L'}\) then \(d((a,b)) =\, \mid \), indicating that a does not cooccur with b,

else if \((a,b) \in \overline{R}_{L'}\) then \(d((a,b)) = \triangleleft \), indicating that a is never followed by b,

else if \((a,b) \in \overline{P}_{L'}\) then \(d((a,b)) = \triangleright \), indicating that a is never preceded by b, and

otherwise \(d((a,b)) = \bot \), indicating that no relation was discovered from a to b.
These decorations are shown on the tail of the corresponding edge.
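The decorator selection above is a straight priority chain over the relations. A small sketch, with our own names for the relation sets and the glyphs standing in for the symbols of Definition 10:

```python
def decorator(pair, rels):
    """Pick the edge decorator for an ordered pair (a, b) using the
    priority order of Definition 10. `rels` maps relation names to
    sets of pairs; the first matching relation wins."""
    order = [('resp',  '▶'),   # a is always followed by b
             ('prec',  '◀'),   # a is always preceded by b
             ('nco',   '|'),   # a does not co-occur with b
             ('nresp', '◁'),   # a is never followed by b
             ('nprec', '▷')]   # a is never preceded by b
    for name, glyph in order:
        if pair in rels.get(name, set()):
            return glyph
    return None                # no relation discovered from a to b
```

Note that the priority order matters: a pair in both the always-followed and never-preceded relations is decorated with the always-followed glyph.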
Table 3 shows which decorators will be shown for the event log \(L_6\), and Fig. 16 shows the resulting log skeleton^{Footnote 3}. Note that the edges (a, b) and (b, a) are visualized by a single edge, with the decorator for (a, b) near a and the decorator for (b, a) near b. As example relations: activity b is never preceded by e (that is, if both b and e occur, then e occurs after b), e is always preceded by b, and e and f do not co-occur. Also note that although 32 relations were considered to be relevant, 34 are now shown: the relations \((g,c) \in \overline{R}_{L_6}\) and \((g,d) \in \overline{R}_{L_6}\) were not considered relevant as these relations can be induced using f. However, as \((c,g) \in R_{L_6}\) and \((d,g) \in R_{L_6}\) are considered relevant, the relations for (g, c) and (g, d) are shown as well.
Using the log skeleton shown in Fig. 16, we can deduce the following facts on the example event log:

The activities a, b, g, and h are always executed exactly once, and always in the given order.

In parallel with b, there is a 50/50 choice between c and d.

There is a 70/30 choice between e and f, but the position of this choice in the process is less clear. If e is chosen, it is executed after b but in parallel with c, d, and g. However, if f is chosen, it is executed after b, c, and d, and before g.
5.3 Handling Noise
So far, we have assumed that the event log does not contain any noise. As a result, a constraint like \((a,b) \in R_{L'}\) may be invalidated because a single occurrence of a in the entire event log is not followed by b. To be able to handle noisy logs, log skeletons allow the user to set a percentage for which the constraint should hold. Recall the definition of the Response constraint as provided earlier: for every trace \(\sigma \) in \(L'\) and every position i, \(\sigma _i = a\) implies \(\exists _{j \in \{i+1,\ldots ,|\sigma |\}}\, \sigma _j = b\).
When dealing with noise, we are interested in the percentage of cases for which the left-hand side of the implication (\(\sigma _i = a\)) holds and the right-hand side (\(\exists _{j \in \{i+1,\ldots ,|\sigma |\}}\sigma _j = b\)) also holds. As such, we can divide the instances of the left-hand side into positive instances (for which the right-hand side holds) and negative instances (for which the right-hand side does not hold). If the user allows for a noise level of l (where \(0 \le l \le 1\)), then the number of negative instances should be at most l times the total number of instances.
This way of handling noise can also be used for the relations \(P_{L'}\), \(\overline{R}_{L'}\), \(\overline{P}_{L'}\), and \(\overline{C}_{L'}\), because these constraints are structured in a similar way. However, it will not work for the equivalence relation \(E_{L'}\). To decide whether two different activities \(a_1\) and \(a_n\) (where \(n \ge 2\)) are considered to be equivalent given a certain noise level l (where again \(0 \le l \le 1\)), we use the following condition for equivalence: there should be a series of activities \(a_1\), \(a_2\), \(\ldots \), \(a_n\) such that for every subsequent pair \((a_i,a_{i+1})\) the distance between both activity counts over all traces is at most l times the number of traces in the event log. Clearly, setting a noise level of \(l = 0\) results in the condition that the activity counts should match perfectly, which is exactly what we want.
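The positive/negative-instance counting for the Response relation under a noise level l can be sketched as follows (our own naming, not the plugin's code):

```python
def response_holds(log, a, b, noise_level):
    """Check the noisy Response constraint 'a is always followed by b':
    the negative instances of a (occurrences with no later b in their
    trace) may amount to at most noise_level times all instances of a."""
    positive = negative = 0
    for trace in log:
        for i, x in enumerate(trace):
            if x == a:
                if b in trace[i + 1:]:
                    positive += 1   # right-hand side holds
                else:
                    negative += 1   # right-hand side violated
    total = positive + negative
    return total > 0 and negative <= noise_level * total
```

With a noise level of 0 this degenerates to the exact constraint, since a single negative instance is then enough to invalidate it.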
Figure 17 shows the log skeleton that results from event log \(L_6\) when setting the noise level to 0.2. For example, this shows that \(80\%\) of the instances of activity c are never preceded by e, that \(85\%\) of the instances of e are never followed by c, and that \(80\%\) of the instances of activity d do not cooccur with f.
5.4 Strengths and Limitations
Clearly, a log skeleton is not an imperative process model like a Petri net or a BPMN diagram. Instead, it is a declarative process model like Declare [45]. Some of the relations in log skeletons exist in Declare as well, like \(R_{L'}\) (Response) and \(P_{L'}\) (Precedence). But Declare contains many relations that are unknown in a log skeleton, while the equivalence relation \(E_{L'}\) does not have a counterpart in Declare. As a result, a log skeleton can be considered a Declare model restricted to only some relations but with an additional equivalence relation.
Of course, limitations also exist for log skeletons. Known process constructs that are hard for log skeletons are loops and duplicate tasks. Furthermore, noise in an event log may be a problem, as a single misplaced activity may prevent the discovery of some relations. To alleviate the problems with these constructs and noise, the visualizer plugin allows the user to specify boundary activities (to tackle loops), to split activities (to tackle duplicate tasks), and various noise levels (to tackle noise). Although our experience with the noise levels is very positive, our experience with the boundary activities and the splitting of activities shows that they can only solve some of the problems related to the hard process constructs. As a result, more research is needed in this direction.
6 Related Work
Discovering accurate and simple process models is extremely important to reduce the time spent enhancing them and to avoid mistakes during process analysis [28].
While extensive research effort has been spent on designing the perfect automated process discovery algorithm, in parallel, researchers have investigated the problem of improving the quality of the input data, proposing techniques for data filtering and data repairing [19, 21, 32, 50,51,52, 57, 59, 69, 70, 78], as well as the problem of predicting which process discovery algorithm would yield the best process model for a given event log [47,48,49]. A few research studies also explored divide-and-conquer strategies, designing approaches that divide the input data into smaller chunks and separately feed each chunk to a discovery algorithm, in order to facilitate the discovery task. The set of process models discovered from the data chunks is then reassembled into a unique process model. Among these techniques we find Genet [15, 16], C-net miner [55], Stage Miner [43], BPMN Miner [20], and Decomposed Process Mining [77].
It is also worth mentioning techniques that have the ability to deal with negative examples [23, 24, 33], i.e., to also accept traces that are known not to be part of the underlying process. Of course, this information is not often available, unless domain knowledge can be used or some automated technique can be applied for generating it [71, 72]. These techniques seem to be better positioned to also consider generalization when searching for the best process model.
Optimization metaheuristics have also been extensively applied in the context of automated process discovery, aiming to incrementally discover and refine the process model to reach a trade-off between accuracy and simplicity. The best known among these approaches are those based on evolutionary (genetic) algorithms [11, 25]. However, several other metaheuristics have been explored, such as the imperialist competitive algorithm [1], particle swarm optimization [18, 29, 44], and simulated annealing [31, 58].
Nonetheless, the latest literature review and benchmark in automated process discovery [2] highlighted that many of the state-of-the-art automated process discovery algorithms [4, 13, 34, 36, 67, 73, 79] were affected by one (or more) of the following three limitations when discovering process models from real-life event logs: i) they achieve limited accuracy; ii) they are too computationally inefficient to be used in practice; iii) they discover syntactically incorrect process models. In practice, when the behavior of the process recorded in the input event log varies little, most of the state-of-the-art automated process discovery algorithms can output accurate and simple process models. However, as the behavioral complexity of the process increases, the quality of the discovered process models can worsen quickly. Given that oftentimes real-life event logs are highly complex (i.e., containing complex process behavior, noise, and incomplete data), discovering highly accurate and simple process models with traditional state-of-the-art algorithms can be challenging.
On the other hand, achieving in a robust and scalable manner the best trade-off between accuracy and simplicity, while ensuring behavioral correctness (i.e., process soundness), has proved elusive. In particular, it is possible to group automated process discovery algorithms into two categories: those focusing more on the simplicity, the soundness, and either the precision [13] or the fitness [36] of the discovered process model, and those focusing more on its fitness and its precision at the cost of simplicity and/or soundness [4, 73, 79]. The first kind of algorithms strive for simplicity and soundness by enforcing block-structured behavior on the discovered process model. However, since real-life processes are not always block-structured, a direct consequence is an approximation of the behavior, which leads to a loss of accuracy (either fitness or precision). The second kind of algorithms do not adopt any strategy to deal with process simplicity and soundness, focusing only on capturing the process behavior in a process model, but in doing so they can produce unsound process models.
Alongside techniques that discover imperative process models, it is important to mention that there exist many discovery algorithms that produce declarative models [10, 27, 39, 40, 53, 74]. Declarative models capture the process behavior through a set of rules, also known as declarative constraints. Even though each declarative constraint is precise, capturing the whole process behavior in a declarative model can be very difficult, especially because declarative models do not give any information about “undeclared” behavior, i.e., any behavior that does not violate the declarative constraints is technically allowed. Hence, imperative process models are usually preferred in practice.
7 Challenges Ahead
Process Mining started about 20 years ago with the development of control-flow miners like the Alpha Miner [64] and the Little Thumb Miner [80]. Although the field has advanced in these 20 years with many other control-flow miners, this does not mean that control-flow mining is already a done deal.
Consider, for example, the results of the latest Process Discovery Contest (PDC 2020) [14], shown in Fig. 18. The PDC 2020 was a contest for fully automated control-flow miners, and its results reflect the then-current state of the field. In this contest, every miner was used to discover a control-flow model from a training event log, after which this model was used to classify every trace from a test event log. As the ground truth for this classification is known, we can compute both the average positive accuracy and the average negative accuracy for all of the algorithms on this data set. The results show that there is still some ground to cover for the imperative miners, as none of these miners was able to achieve both an average positive accuracy and an average negative accuracy exceeding 80.0%.
Table 4 shows the weaknesses of several algorithms submitted to the PDC 2020 contest. As an example, the weaknesses of the Inductive IMfa Miner included loops: it scored 59.2%^{Footnote 4} on the event logs in the PDC 2020 data set that do not contain loops, and only 19.3% on the event logs that do contain loops. This table indicates that noise and loops, but also optional tasks and duplicate tasks, can be considered challenges for control-flow miners in the near future.
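The score referenced in footnote 4 is the harmonic mean of the positive accuracy \(P_L\) and the negative accuracy \(N_L\), so it is high only when both accuracies are high. A one-line sketch:

```python
def pdc_score(positive_accuracy, negative_accuracy):
    """Harmonic mean 2*P*N/(P+N) of positive and negative accuracy,
    as used to score miners in the PDC 2020 (see footnote 4)."""
    p, n = positive_accuracy, negative_accuracy
    return 0.0 if p + n == 0 else 2 * p * n / (p + n)
```

For instance, a miner that accepts every trace (positive accuracy 1.0, negative accuracy 0.0) scores 0.0, which is why this score penalizes one-sided classifiers.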
In these 20 years, algorithms have been developed that discover perspectives other than the control-flow perspective. However, many of these other perspectives are added on top of the discovered control-flow model, and hence depend on the discovery of a control-flow model of high-enough quality. Nevertheless, even assuming that the quality of the control-flow model is indeed high enough, challenges remain for these other perspectives as well.
As a first example, consider the data perspective, which adds expressions (guards) to the control-flow model that guide the execution of the model: certain parts of the control-flow model may only be valid if a certain guard evaluates to true. Challenges here include the discovery of sensible guards with sensible values. As an example, if based on some value the control-flow “goes either left or right”, then the data in the event log may not contain this precise value. As a result, this value needs to be guessed based on the data that is in the event log.
A second example is the organizational perspective, which adds organizational entities (like users, groups, or roles) to certain parts of the control-flow model: only resources (like users and automated services) that qualify for these entities can be involved in these parts. Challenges here include the discovery of the correct organizational entities at the correct level. As an example, if some activity was performed by some user according to the event log, then what is the correct organizational level (like user, role, or group) for this activity?
8 Conclusion
In this chapter, we have introduced four advanced process discovery techniques: the state-based region miner, the language-based region miner, the split miner, and the log skeleton miner. Each of the four techniques aims to alleviate shortcomings of the more basic process discovery techniques introduced in the previous chapter.
First, the region-based miners can lift the shortcoming of having to assume that each activity occurs only once in the model. When using regions, different contexts of an activity can be found, and the activity can then be divided over these contexts, leading to a model with an activity for every different context. This feature is not shared by any of the other miners, and it can be very important in case we have an event log of a system where these “duplicate activities” occur. Where other miners need to assume there is only one activity, which may lead to discovered models that are incomprehensible, region-based miners do not need to make this assumption, which may result in more precise models.
Second, the split miner aims to discover process models that simultaneously maximize and balance fitness and precision, while at the same time minimizing the control-flow complexity of the resulting model. This approach brings precision and complexity into the equation, something that previously could be done only by using genetic miners like the evolutionary tree miner [12]. However, unlike genetic miners, the split miner typically takes seconds to discover a process model from the event log, as opposed to the hour-long execution times required by genetic miners [2].
Third, the log skeleton miner is not limited to using only the directly-follows relations, which are heavily leveraged by many existing discovery algorithms. This miner discovers a declarative model from the event log that contains facts like “95% of the instances of activity a are always followed by activity b”, or “90% of the instances of activity a do not co-occur with an instance of activity b”. As such, it can discover relations between activities that cannot be discovered by considering only the directly-follows relations.
It is clear that each of these advanced techniques can be used effectively on certain event logs, and may produce better models than those produced by basic techniques. However, ultimately, there is no technique yet that is effective on all (or even almost all) event logs regardless of the process behavior features. Such an ideal process discovery technique should be able to maximize the accuracy and simplicity of the discovered process model while at the same time guaranteeing its soundness. While, hitherto, the design of such a technique has proved challenging and elusive, it has become clear that each process discovery technique can be useful on some event logs. Hence, while we hope that future research endeavors will lead to the ideal process discovery technique, until it materializes, we have to rely on educated choices based on the process data at hand (i.e., the event log) and select the most appropriate technique for discovering the best process model.
Notes
 1.
In this chapter we use region to denote a 1-bounded region. However, when needed we use k-bounded region to extend the notion, as necessary to account for k-bounded Petri nets.
 2.
Labels are capitalised to distinguish them from the DFG nodes.
 3.
For sake of completeness, we mention that we are using version 6.12.5 of the LogSkeleton package, which is available in the Nightly Build of ProM, see https://www.promtools.org/doku.php?id=nightly.
 4.
This score is computed as the average over \(\frac{2\cdot {}P_L\cdot {}N_L}{P_L+N_L}\), where \(P_L\) is the positive accuracy and \(N_L\) is the negative accuracy for (1) the model discovered from a training log L and (2) the corresponding test log.
References
Alizadeh, S., Norani, A.: ICMA: a new efficient algorithm for process model discovery. Appl. Intell. 48(11) (2018)
Augusto, A., et al.: Automated discovery of process models from event logs: review and benchmark. IEEE Trans. Knowl. Data Eng. 31(4) (2019)
Augusto, A., Conforti, R., Dumas, M., La Rosa, M., Polyvyanyy, A.: Split miner: automated discovery of accurate and simple business process models from event logs. Knowl. Inf. Syst. 59(2), 251–284 (2018). https://doi.org/10.1007/s10115-018-1214-x
Augusto, A., Conforti, R., Dumas, M., La Rosa, M., Bruno, G.: Automated discovery of structured process models: discover structured vs. discover and structure. In: Comyn-Wattiau, I., Tanaka, K., Song, I.Y., Yamamoto, S., Saeki, M. (eds.) ER 2016. LNCS, vol. 9974, pp. 313–329. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46397-1_25
Augusto, A., Dumas, M., La Rosa, M.: Automated discovery of process models with true concurrency and inclusive choices. In: International Conference on Process Mining, pp. 43–56. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-98581-3_1
Badouel, E., Bernardinello, L., Darondeau, P.: Polynomial algorithms for the synthesis of bounded nets. In: TAPSOFT, pp. 364–378 (1995)
Bergenthum, R., Desel, J., Lorenz, R., Mauser, S.: Process mining based on regions of languages. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 375–383. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75183-0_27
Bergenthum, R., Desel, J., Lorenz, R., Mauser, S.: Synthesis of Petri nets from infinite partial languages. In: Billington, J., Duan, J., Koutny, M. (eds.) ACSD, pp. 170–179. IEEE (2008)
Bergenthum, R., Desel, J., Mauser, S., Lorenz, R.: Synthesis of Petri nets from term based representations of infinite partial languages. Fundam. Inform. 95(1), 187–217 (2009)
Bernardi, M.L., Cimitile, M., Di Francescomarino, C., Maggi, F.M.: Do activity lifecycles affect the validity of a business rule in a business process? Inf. Syst. 62 (2016)
Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: On the role of fitness, precision, generalization and simplicity in process discovery. In: Meersman, R., et al. (eds.) OTM 2012. LNCS, vol. 7565, pp. 305–322. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33606-5_19
Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: A genetic algorithm for discovering process trees. In: IEEE Congress on Evolutionary Computation (CEC), 2012, pp. 1–8. IEEE (2012)
Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: Quality dimensions in process discovery: the importance of fitness, precision, generalization and simplicity. Int. J. Cooperat. Inf. Syst. 23(1), 1440001 (2014)
Carmona, J., Depaire, B., Verbeek, H.M.W.: Process discovery contest 2020 (2019). https://icpmconference.org/2020/processdiscoverycontest/. Accessed 23 Apr 2021
Carmona, J.: Projection approaches to process mining using regionbased techniques. Data Min. Knowl. Discov. 24(1), 218–246 (2012)
Carmona, J., Cortadella, J., Kishinevsky, M.: Divideandconquer strategies for process mining. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 327–343. Springer, Heidelberg (2009). https://doi.org/10.1007/9783642038488_22
Carmona, J., Cortadella, J., Kishinevsky, M.: New regionbased algorithms for deriving bounded Petri nets. IEEE Trans. Comput. 59(3), 371–384 (2009)
Chifu, V.R., Pop, C.B., Salomie, I., Balla, I., Paven, R.: Hybrid particle swarm optimization method for process mining. In: ICCP, IEEE (2012)
Conforti, R., La Rosa, M., ter Hofstede, A.H.M.: Filtering out infrequent behavior from business process event logs. IEEE Trans. Knowl. Data Eng. 29(2), 300–314 (2016)
Conforti, R., Dumas, M., GarcíaBañuelos, L., La Rosa, M.: BPMN miner: automated discovery of BPMN process models with hierarchical structure. Inf. Syst. 56, 284–303 (2016)
Conforti, R., La Rosa, M., ter Hofstede, A.H.M., Augusto, A.: Automatic repair of sametimestamp errors in business process event logs. In: Fahland, D., Ghidini, C., Becker, J., Dumas, M. (eds.) BPM 2020. LNCS, vol. 12168, pp. 327–345. Springer, Cham (2020). https://doi.org/10.1007/9783030586669_19
Darondeau, P.: Deriving unbounded Petri nets from formal languages. In: Sangiorgi, D., de Simone, R. (eds.) CONCUR 1998. LNCS, vol. 1466, pp. 533–548. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0055646
Ponce de León, H., Nardelli, L., Carmona, J., vanden Broucke, S.K.L.M.: Incorporating negative information to process discovery of complex systems. Inf. Sci. 422, 480–496 (2018)
PoncedeLeón, H., Rodríguez, C., Carmona, J., Heljanko, K., Haar, S.: Unfoldingbased process discovery. In: Finkbeiner, B., Pu, G., Zhang, L. (eds.) ATVA 2015. LNCS, vol. 9364, pp. 31–47. Springer, Cham (2015). https://doi.org/10.1007/9783319249537_4
Alves de Medeiros, A.K.: Genetic process mining. Ph.D. thesis, Eindhoven University of Technology (2006)
Desel, J., Reisig, W.: The synthesis problem of Petri nets. Acta Inf. 33(4), 297–315 (1996)
Di Ciccio, C., Mecella, M.: A twostep fast algorithm for the automated discovery of declarative workflows. In: 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp. 135–142. IEEE (2013)
Dumas, M., La Rosa, M., Mendling, J., Reijers, H.A.: Fundamentals of Business Process Management. Springer, Berlin (2013). https://doi.org/10.1007/9783662565094
Effendi, Y.A., Sarno, P.: Discovering optimized process model using rule discovery hybrid particle swarm optimization. In: 2017 3rd International Conference on Science in Information Technology (ICSI Tech), pp. 97–103. IEEE (2017)
Ehrenfeucht, A., Rozenberg, G.: Partial (Set) 2structures. Part I, II. Acta Inform. 27, 315–368 (1990)
Gao, D., Liu, Q.: An improved simulated annealing algorithm for process mining. In: CSCWD, IEEE (2009)
Ghionna, L., Greco, G., Guzzo, A., Pontieri, L.: Outlier detection techniques for process mining applications. In: An, A., Matwin, S., Ras, Z.W., Slezak, D. (eds.) ISMIS 2008. LNCS (LNAI), vol. 4994, pp. 150–159. Springer, Heidelberg (2008). https://doi.org/10.1007/9783540681236_17
Goedertier, S., Martens, D., Vanthienen, J., Baesens, B.: Robust process discovery with artificial negative events. J. Mach. Learn. Res. 10, 1305–1340 (2009)
Guo, Q., Wen, L., Wang, J., Yan, Z., Yu, P.S.: Mining invisible tasks in nonfreechoice constructs. In: MotahariNezhad, H.R., Recker, J., Weidlich, M. (eds.) BPM 2015. LNCS, vol. 9253, pp. 109–125. Springer, Cham (2015). https://doi.org/10.1007/9783319230634_7
Leemans, S.J.J., Fahland, D., van der Aalst, W.M.P.: Discovering blockstructured process models from event logs containing infrequent behaviour. In: Lohmann, N., Song, M., Wohed, P. (eds.) BPM 2013. LNBIP, vol. 171, pp. 66–78. Springer, Cham (2014). https://doi.org/10.1007/9783319062570_6
Leemans, S.J.J., Fahland, D., van der Aalst, W.M.P.: Discovering blockstructured process models from incomplete event logs. In: Ciardo, G., Kindler, E. (eds.) PETRI NETS 2014. LNCS, vol. 8489, pp. 91–110. Springer, Cham (2014). https://doi.org/10.1007/9783319077345_6
Lorenz, R.: Towards synthesis of petri nets from general partial languages. In: Lohmann, N., Wolf, K. (eds.) AWPN, vol. 380 of CEUR Workshop Proceedings, pp. 55–62. CEURWS.org (2008)
Lorenz, R., Juhás, R.: How to synthesize nets from languages  a survey. In: Proceedings of the Wintersimulation Conference (WSC) 2007 (2007)
Maggi, F.M., Bose, R.P.J.C., van der Aalst, W.M.P.: Efficient discovery of understandable declarative process models from event logs. In: Ralyté, J., Franch, X., Brinkkemper, S., Wrycza, S. (eds.) CAiSE 2012. LNCS, vol. 7328, pp. 270–285. Springer, Heidelberg (2012). https://doi.org/10.1007/9783642310959_18
Maggi, F.M., Dumas, M., GarcíaBañuelos, L., Montali, M.: Discovering dataaware declarative process models from event logs. In: Daniel, F., Wang, J., Weber, B. (eds.) BPM 2013. LNCS, vol. 8094, pp. 81–96. Springer, Heidelberg (2013). https://doi.org/10.1007/9783642401763_8
Mauser, S., Lorenz, S.: Variants of the language based synthesis problem for petri nets. In: ACSD, pp. 89–98 (2009)
Mendling, J., Reijers, H.A., van der Aalst, W.M.P.: Seven process modeling guidelines (7PMG). Inform. Softw. Technol. 52(2), 127–136 (2010)
Nguyen, H., Dumas, M., ter Hofstede, A.H.M., La Rosa, M., Maggi, F.M.: Mining business process stages from event logs. In: Dubois, E., Pohl, K. (eds.) CAiSE 2017. LNCS, vol. 10253, pp. 577–594. Springer, Cham (2017). https://doi.org/10.1007/9783319595368_36
Nurlaili, A.L., Sarno, R.: A combination of the evolutionary tree miner and simulated annealing. In: 2017 4th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), pp. 1–5. IEEE (2017)
Pesic, M., Schonenberg, H., van der Aalst, W.I.P.: DECLARE: full support for looselystructured processes. In: 11th IEEE International Enterprise Distributed Object Computing Conference (EDOC 2007), 15–19 October 2007, Annapolis, Maryland, USA, pp. 287–300 (2007)
Polyvyanyy, A., Vanhatalo, J., Völzer, H.: Simplified computation and generalization of the refined process structure tree. In: WSFM, pp. 25–41 (2010)
Ribeiro, J., Carmona, J.: RS4PD: a tool for recommending controlflow algorithms. In: BPM (Demos), pp. 66. Citeseer (2014)
Ribeiro, J., Carmona, J., Mısır, M., Sebag, M.: A recommender system for process discovery. In: Sadiq, S., Soffer, P., Völzer, H. (eds.) BPM 2014. LNCS, vol. 8659, pp. 67–83. Springer, Cham (2014). https://doi.org/10.1007/9783319101729_5
Ribeiro, J., Carmona Vargas, J.: A method for assessing parameter impact on controlflow discovery algorithms. In: Proceedings of the International Workshop on Algorithms & Theories for the Analysis of Event Data: Brussels, Belgium, 22–23 June 2015, pp. 83–96. CEURWS. org (2015)
RoggeSolti, A., Mans, R.S., van der Aalst, W.M.P., Weske, M.: Improving Documentation by repairing event logs. In: Grabis, J., Kirikova, M., Zdravkovic, J., Stirna, J. (eds.) PoEM 2013. LNBIP, vol. 165, pp. 129–144. Springer, Heidelberg (2013). https://doi.org/10.1007/9783642416415_10
Sani, M.F., van Zelst, S.J., van der Aalst, W.M.P.: Improving process discovery results by filtering outliers using conditional behavioural probabilities. In: International Workshop on Business Process Intelligence (BPI 2017) (2017)
Sani, M.F., van Zelst, S.J., van der Aalst, W.M.P.: Repairing outlier behaviour in event logs using contextual behaviour. EMISAJ 14, 1–24 (2019)
Schönig, S., RoggeSolti, A., Cabanillas, C., Jablonski, S., Mendling, J.: Efficient and customisable declarative process mining with SQL. In: Nurcan, S., Soffer, P., Bajec, M., Eder, J. (eds.) CAiSE 2016. LNCS, vol. 9694, pp. 290–305. Springer, Cham (2016). https://doi.org/10.1007/9783319396965_18
Solé, M., Carmona, J.: Light regionbased techniques for process discovery. Fundam. Inform. 113(3–4), 343–376 (2011)
Solé, M., Carmona, J.: Incremental process discovery. Trans. Petri Nets Other Models of Concurr. 5, 221–242 (2012)
Solé, M., Carmona, J.: Regionbased foldings in process discovery. IEEE Trans. Knowl. Data Eng. 25(1), 192–205 (2013)
Song, S., Cao, Y., Wang, J.: Cleaning timestamps with temporal constraints. VLDB Endow. 9(10), 708–719 (2016)
Song, W., Liu, S., Liu, Q.: Business process mining based on simulated annealing. In: ICYCS, IEEE (2008)
Tax, N., Sidorova, N., van der Aalst, W.M.P.: Discovering more precise process models from event logs by filtering out chaotic activities. J. Intell. Inf. Syst., 52(1), 107–139 (2019)
van der Aalst, W., Weijters, T., Maruster, L.: Workflow mining: discovering process models from event logs. IEEE Trans. Knowl. Data Eng. 16(9) (2004)
van der Aalst, W.M.P., Günther, C.W.: Finding structure in unstructured processes: the case for process mining. In: ACSD, pp. 3–12 (2007)
van der Aalst, W.M.P., Rubin, V., (Eric) Verbeek, H.M.W., van Dongen, B.F., Kindler, E., Günther, C.W.: Process mining: a twostep approach to balance between underfitting and overfitting. Softw. Syst. Model. 9, 87–111 (2009)
van der Aalst, W.M.P., van Dongen, B.F.: Discovering petri nets from event logs. Trans. Petri Nets Other Models Concurr. 7, 372–422 (2013)
van der Aalst, W.M.P., Weijters, T., Maruster, L.: Workflow mining: discovering process models from event logs. IEEE Trans. Knowl. Data Eng. 16(9), 1128–1142 (2004)
van der Werf, J.M.E.M., van Dongen, B.F., Hurkens, C.A.J., Serebrenik, A.: Process discovery using integer linear programming. Fundam. Inform. 94(3–4), 387–412 (2009)
van Zelst, S.J., van Dongen, B.F., van der Aalst, W.M.P.: ILPbased process discovery using hybrid regions. In van der Aalst, W.M.P., Bergenthum, R., Carmona, J. (eds.) Proceedings of the International Workshop on Algorithms & Theories for the Analysis of Event Data, ATAED 2015, Satellite Event of the Conferences: 36th International Conference on Application and Theory of Petri Nets and Concurrency Petri Nets 2015 and 15th International Conference on Application of Concurrency to System Design ACSD 2015, Brussels, Belgium, 22–23 June 2015, vol. 1371 of CEUR Workshop Proceedings, pp. 47–61. CEURWS.org (2015)
van Zelst, S.J., van Dongen, B.F., van der Aalst, W.M.P.: ILPbased process discovery using hybrid regions. In: International Workshop on Algorithms & Theories for the Analysis of Event Data, ATAED 2015, vol. 1371 of CEUR Workshop Proceedings, pp. 47–61. CEURWS.org (2015)
van Zelst, S.J., van Dongen, B.F., van der Aalst, W.M.P., Verbeek, H.M.W.: Discovering workflow nets using integer linear programming. Computing 100(5), 529–556 (2017). https://doi.org/10.1007/s0060701705825
van Zelst, S.J., Fani Sani, M., Ostovar, A., Conforti, R., La Rosa, M.: Filtering spurious events from event streams of business processes. In: Krogstie, J., Reijers, H.A. (eds.) CAiSE 2018. LNCS, vol. 10816, pp. 35–52. Springer, Cham (2018). https://doi.org/10.1007/9783319915630_3
van Zelst, S.J., Fani Sani, M., Ostovar, A., Conforti, R., La Rosa, M.: Detection and removal of infrequent behaviour from event streams of business processes. Inf. Syst. 90 (2019)
vanden Broucke, S.K.L.M., De Weerdt, J., Baesens, B., Vanthienen, J.: Improved artificial negative event generation to enhance process event logs. In: Ralyté, J., Franch, X., Brinkkemper, S., Wrycza, S. (eds.) CAiSE 2012. LNCS, vol. 7328, pp. 254–269. Springer, Heidelberg (2012). https://doi.org/10.1007/9783642310959_17
vanden Broucke, S.K.L.M., De Weerdt, J., Vanthienen, J., Baesens, B.: Determining process model precision and generalization with weighted artificial negative events. IEEE Trans. Knowl. Data Eng, 26(8), 1877–1889 (2014)
vanden Broucke, S.K.L.M., De Weerdt, J.: Fodina: a robust and flexible heuristic process discovery technique. Decis. Supp. Syst. 100, 109–118 (2017)
vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Declarative process discovery with evolutionary computing. In: 2014 IEEE Congress on Evolutionary Computation (CEC), pp. 2412–2419. IEEE (2014)
Verbeek, H.M.W.: The Log Skeleton Visualizer in ProM 6.9: the winning contribution to the process discovery contest 2019. Int. J. Softw. Tools Technol. Trans. 339 (2021). https://doi.org/10.1007/s1000902100618y
Verbeek, H.M.W. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: ProM 6: the process mining toolkit. In: Proceedings of BPM Demonstration Track 2010, vol. 615, pp. 34–39. CEURWS.org (2010)
Verbeek, H.M.W., van der Aalst, W.M.P.: Decomposed process mining: the ILP case. In: Fournier, F., Mendling, J. (eds.) BPM 2014. LNBIP, vol. 202, pp. 264–276. Springer, Cham (2015). https://doi.org/10.1007/9783319158952_23
Wang, J., Song, S., Lin, X., Zhu, X., Pei, J.: Cleaning structured event logs: a graph repair approach. In: Proceedings of IEEE ICDE, pp. 30–41. IEEE (2015)
Weijters, A.J.M.M., Ribeiro, J.T.S.: Flexible heuristics miner (FHM). In: 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp. 310–317. IEEE (2011)
Weijters, A.J.M.M., van der Aalst, W.: Rediscovering workflow models from eventbased data using little thumb. Integr. Comput.Aid. Eng. 10(2) (2003)
Acknowledgements
This work has been supported by MCIN/AEI funds under grant PID2020112581GBC21.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
Cite this chapter
Augusto, A., Carmona, J., Verbeek, E. (2022). Advanced Process Discovery Techniques. In: van der Aalst, W.M.P., Carmona, J. (eds.) Process Mining Handbook. Lecture Notes in Business Information Processing, vol. 448. Springer, Cham. https://doi.org/10.1007/978-3-031-08848-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08847-6
Online ISBN: 978-3-031-08848-3