This section outlines our framework for optimizing APDAs by means of S-metaheuristics (cf. Sect. 2). First, we give an overview of the framework and its core components. Next, we discuss the adaptation of the S-metaheuristics to the problem of process discovery. Finally, we describe the instantiations of our framework for Split Miner, Fodina, and Inductive Miner.
Preliminaries
In order to discover a process model, an APDA takes as input an event log and transforms it into an intermediate representation from which a process model is derived. Below, we define one of the most popular intermediate representations, that is, the directly-follows graph (DFG). Although other intermediate representations are available in the literature (e.g., behavioral profiles [28]), our framework focuses only on DFGs for two main reasons: first, because they are adopted by many state-of-the-art automated process discovery approaches [7, 9, 24, 34, 35]; second, because they allow us to leverage the Markovian accuracy [5] to facilitate the application of metaheuristics and the navigation of the solution space, as we show later in this section.
Definition 1
(Event Log) Given a set of activities \({\mathscr {A}}\), an event log \({\mathscr {L}}\) is a multiset of traces where a trace \(t \in {\mathscr {L}}\) is a sequence of activities \(t= \langle a_1, a_2, \dots , a_n \rangle \), with \(a_i \in {\mathscr {A}}, 1 \le i \le n\).
Definition 2
[Directly-follows graph (DFG)] Given an event log \({\mathscr {L}}\), its directly-follows graph (DFG) is a directed graph \({\mathscr {G}}= (N, E)\), where N is the set of nodes, \(N = \{ a \in {\mathscr {A}}\mid \exists t \in {\mathscr {L}}: a \in t\}\), and E is the set of edges, \(E = \{(x, y) \in N \times N \mid \exists t = \langle a_1, a_2, \ldots , a_n \rangle \in {\mathscr {L}}, \exists i \in \left[ 1, n-1 \right] : a_i = x \wedge a_{i+1} = y \}\).
By definition, each node of the DFG represents an activity recorded in at least one trace of the event log, while each edge of a DFG represents a directly-follows relation between two activities (the source and target nodes of the edge). An APDA is said to be DFG-based if it first generates the DFG of the event log, then applies an algorithm to manipulate the DFG (e.g., removing edges), and finally converts the processed DFG into a process model. Such a processed DFG no longer adheres to Definition 2; therefore, we redefine it as a Refined DFG.
Definition 3
(Refined DFG) Given an event log \({\mathscr {L}}\) and its DFG \({\mathscr {G}}_{{\mathscr {L}}} = (N, E)\), a Refined DFG is a directed graph \({\mathscr {G}}= (N', E')\), where: \(N' \subseteq N\) and \(E' \subseteq E\). If \(N' = N\) and \(E' = E\), the refined DFG is equivalent to the event log DFG.
Examples of DFG-based APDAs are Inductive Miner [24], Heuristics Miner [7, 35], Fodina [34], and Split Miner [9]. Different DFG-based APDAs may extract different Refined DFGs from the same log. Also, a DFG-based APDA may discover different Refined DFGs from the same log depending on its hyperparameter settings (e.g., a filtering threshold). The algorithm(s) used by a DFG-based APDA to discover the Refined DFG from the event log and to convert it into a process model may greatly affect its accuracy. Accordingly, our framework focuses on optimizing the discovery of the Refined DFG rather than its conversion into a process model.
Given that a Refined DFG is a directed graph, it can be represented as a binary matrix as follows.
Definition 4
(DFG-Matrix) Given a Refined DFG \({\mathscr {G}}= (N, E)\) and a function \(\theta : N \rightarrow [1, \left| N \right| ]\),Footnote 3 the DFG-Matrix is a square matrix \(X_{{\mathscr {G}}} \in \{0,1\}^{\left| N\right| \times \left| N\right| }\), where each cell \(x_{i,j} = 1 \Longleftrightarrow \exists (a_1,a_2)\in E \mid {\theta }(a_1)=i \wedge {\theta }(a_2)=j\), and \(x_{i,j}=0\) otherwise.
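For illustration, the following is a minimal sketch of how a Refined DFG could be encoded as a DFG-Matrix; the function name and the construction of the indexing function \(\theta \) (here 0-based instead of 1-based) are our own illustrative choices, not part of the framework's implementation.

```python
# Minimal sketch (our own illustration): encoding a Refined DFG as a DFG-Matrix.
# theta maps each activity to a row/column index; we use 0-based indices here,
# whereas Definition 4 uses 1-based ones.

def dfg_matrix(nodes, edges):
    """Build the boolean adjacency matrix of a Refined DFG (N, E)."""
    theta = {a: i for i, a in enumerate(sorted(nodes))}   # bijection N -> [0, |N|-1]
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for (src, tgt) in edges:
        matrix[theta[src]][theta[tgt]] = 1                # x_{i,j} = 1 iff (src, tgt) in E
    return matrix, theta

# Example: a log with traces <a,b,c> and <a,c> yields edges (a,b), (b,c), (a,c).
matrix, theta = dfg_matrix({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("a", "c")})
```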
In the remainder of this paper, we refer to the Refined DFG simply as DFG, for readability.
Framework overview
As shown in Fig. 1, our framework takes three inputs (in addition to the log): (i) the optimization metaheuristic; (ii) the objective function to be optimized (e.g., the F-score); and (iii) the DFG-based APDA to be used for discovering a process model.
Algorithm 1 describes how our framework operates, while Fig. 2 captures its control-flow representation. First, the input event log is given to the APDA, which returns the discovered (refined) DFG and its corresponding process model (lines 1 and 2). This (refined) DFG becomes the current DFG, while the process model becomes the best process model (so far); the process model’s objective function score (e.g., its F-score) is stored as both the current score and the best score (lines 3 and 4). The current DFG is then given as input to the function GenerateNeighbors, which applies changes to it to generate a set of neighboring DFGs (line 6). These neighboring DFGs are given as input to the APDA, which returns the corresponding process models, and the process models are assessed by the objective function evaluators (lines 9 to 13). When the metaheuristic receives the results from the evaluators (along with the current DFG and its score), it chooses the new current DFG and updates the current score (lines 14 and 15). If the new current score is higher than the best score (line 16), the best process model and the best score are updated (lines 17 and 18). Then a new iteration starts, unless a termination criterion is met (e.g., a timeout, a maximum number of iterations, or a minimum threshold for the objective function), in which case the framework outputs the best process model identified, i.e., the process model scoring the highest value of the objective function.
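The sketch below paraphrases the optimization loop of Algorithm 1 in Python; the method and parameter names are our own, the APDA is assumed to expose the two functions named in the text, and the GNF, the UDF, the objective evaluator, and the termination test are passed in as callables. For simplicity, the score is treated here as a plain number, whereas in our instantiation it is the richer data structure described in Sect. 3.4.

```python
# Sketch (our own paraphrase) of the optimization loop of Algorithm 1.
def optimize(apda, log, objective, generate_neighbors, update_dfg, terminate):
    current_dfg = apda.discover_dfg(log)                                   # line 1
    best_model = apda.convert_dfg_to_process_model(current_dfg)            # line 2
    current_score = best_score = objective(best_model, log)                # lines 3-4

    while not terminate():
        neighbors = generate_neighbors(current_dfg, current_score)         # line 6 (GNF)
        scored = [(dfg, objective(apda.convert_dfg_to_process_model(dfg), log))
                  for dfg in neighbors]                                     # lines 9-13
        current_dfg, current_score = update_dfg(scored, current_dfg,
                                                current_score, apda, log)  # lines 14-15 (UDF)
        if current_score > best_score:                                     # line 16
            best_score = current_score                                     # lines 17-18
            best_model = apda.convert_dfg_to_process_model(current_dfg)
    return best_model
```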
Adaptation of the optimization metaheuristics
To adapt iterative local search (ILS), tabu search (TABU), and simulated annealing (SIMA) to the problem of automated process discovery, we need to define the following three concepts: (i) the problem solution space; (ii) a solution neighborhood; (iii) the objective function. These design choices influence how each metaheuristic navigates the solution space and escapes local optima, i.e., how to design the functions GenerateNeighbors and UpdateDFG of Algorithm 1 (lines 6 and 14, respectively).
Solution space Since our goal is the optimization of APDAs, we must choose a solution space that fits our context regardless of the selected APDA. If we assume that the APDA is DFG-based (which is the case for the majority of the available APDAs), we can define the solution space as the set of all the DFGs discoverable from the event log. Indeed, any DFG-based APDA can deterministically generate a process model from a DFG.
Solution neighborhood Having defined the solution space as the set of all the DFGs discoverable from the event log, we can represent any element of this solution space as a DFG-Matrix. Given a DFG-Matrix, we define its neighborhood as the set of all the matrices differing from it in exactly one cell value (i.e., DFGs having one more or one fewer edge). In the following, every time we refer to a DFG we assume it is represented as a DFG-Matrix.
Objective function The objective function can be any function assessing one of the four quality dimensions for discovered process models (introduced in Sect. 2). However, since we are interested in optimizing the APDAs to discover the most accurate process model, in our framework instantiations we use as objective function the F-score of fitness and precision, i.e., their harmonic mean, recalled below. Furthermore, we remark that our framework could also operate with objective functions that take into account multiple quality dimensions striving for a trade-off, e.g., F-score and model complexity, provided that the multiple quality dimensions can be combined into a single objective function.
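For concreteness, the F-score combines fitness and precision as their harmonic mean:

\[ \textit{F-score} = \frac{2 \cdot \textit{fitness} \cdot \textit{precision}}{\textit{fitness} + \textit{precision}} \]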
Having defined the solution space, a solution neighborhood, and the objective function, we can turn our attention to how ILS, TABU, and SIMA navigate the solution space. ILS, TABU, and SIMA share similar traits in solving an optimization problem, especially when it comes to the navigation of the solution space. Given a problem and its solution space, each of these three S-metaheuristics starts from a (random) solution, discovers one or more neighboring solutions, and assesses them with the objective function to find a solution that is better than the current one. If a better solution is found, it is chosen as the new current solution and the metaheuristic performs a new neighborhood exploration. If no better solution is found, e.g., because the current solution is locally optimal, the three metaheuristics follow different approaches to escape the local optimum and continue the solution space exploration. Algorithm 1 orchestrates and facilitates the parts of this procedure shared by the three metaheuristics. However, we must still define the functions GenerateNeighbors (GNF) and UpdateDFG (UDF).
The GNF receives as input a solution of the solution space, i.e., a DFG, and generates a set of neighboring DFGs. By definition, the GNF is independent of the metaheuristic and can be as simple or as elaborate as we demand. An example of a simple GNF is a function that randomly selects neighboring DFGs by turning one cell of the input DFG-Matrix to 0 or to 1, as sketched below. An example of an elaborate GNF is, instead, a function that carefully selects neighboring DFGs relying on the feedback received from the objective function assessing the input DFG, as we show in Sect. 3.4.
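The following is a minimal sketch of the simple GNF just mentioned: each neighbor differs from the input DFG-Matrix in exactly one randomly chosen cell. Function and parameter names are our own illustrative choices.

```python
import copy
import random

# Minimal sketch of a simple GNF: flip one random cell per generated neighbor.
def generate_neighbors_random(dfg_matrix, size_n, seed=None):
    rng = random.Random(seed)
    n = len(dfg_matrix)
    neighbors = []
    for _ in range(size_n):
        i, j = rng.randrange(n), rng.randrange(n)
        neighbor = copy.deepcopy(dfg_matrix)
        neighbor[i][j] = 1 - neighbor[i][j]   # flip one cell: add or remove edge (i, j)
        neighbors.append(neighbor)
    return neighbors
```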
The UDF (captured in Algorithm 2) is the core of our optimization framework, and it implements the metaheuristic itself. The UDF receives as input the selected metaheuristic (\(\omega \)), the neighboring DFGs and their corresponding objective function scores (S), the current DFG (\({\mathscr {G}}_c\)), the current score (\(s_c\)), the APDA (\(\alpha \)), and the event log (\({\mathscr {L}}\)). We can then distinguish two cases: (i) among the input neighboring DFGs there is at least one with a higher objective function score than the current one; (ii) none of the input neighboring DFGs has a higher objective function score than the current one. In the first case, the UDF always outputs the DFG having the highest score, regardless of the selected metaheuristic (see Algorithm 2, lines 4, 11, and 33 for ILS, TABU, and SIMA, respectively). In the second case, the current DFG may be a local optimum, and each metaheuristic escapes it with a different strategy. Figures 3, 4, and 5 show the high-level control flow of how ILS, TABU, and SIMA update the current DFG (that is, the UDF of Algorithm 2).
Iterative Local Search applies the simplest strategy: it perturbs the current DFG (Algorithm 2, line 7). The perturbation is meant to alter the DFG in such a way as to escape the local optimum, e.g., by randomly adding and removing multiple edges of the current DFG. The perturbed DFG is the output of the UDF.
Tabu Search relies on its three memories to escape a local optimum (Algorithm 2, lines 25 to 30): the short-term memory (a.k.a. the Tabu-list), which contains DFGs that must not be explored further; the intermediate-term memory, which contains DFGs that should lead to better results and should therefore be explored in the near future; and the long-term memory, which contains DFGs (and their characteristics) that have been seen multiple times and should therefore not be explored in the near future. TABU updates the three memories each time the UDF is executed. Given the set of neighboring DFGs and their respective objective function scores (see Algorithm 1, map S), TABU adds each DFG to a different memory: DFGs worsening the objective function score are added to the Tabu-list; DFGs improving the objective function score, yet less than another neighboring DFG, are added to the intermediate-term memory; DFGs that neither improve nor worsen the objective function score are added to the long-term memory. The current DFG is also added to the Tabu-list, since it has already been explored. When TABU does not find a better DFG in the neighborhood of the current DFG, it returns the latest DFG added to the intermediate-term memory. If the intermediate-term memory is empty, TABU returns the latest DFG added to the long-term memory. If both memories are empty, TABU requests a new (random) DFG from the APDA and outputs it.
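A sketch of TABU's behavior in one UDF call is given below, simplified from the description above: the three memories are plain lists, and the returned DFG is either the best improving neighbor or the appropriate fallback. All names are ours, and Algorithm 2 may perform additional bookkeeping not shown here.

```python
# Simplified sketch of TABU's memory update and DFG selection in one UDF call.
# neighbors_scores is a list of (DFG, score) pairs.
def tabu_update(neighbors_scores, current_dfg, current_score,
                tabu_list, intermediate_memory, long_term_memory, request_new_dfg):
    best_dfg, best_score = None, current_score
    for dfg, score in neighbors_scores:
        if score > best_score:
            best_dfg, best_score = dfg, score
    for dfg, score in neighbors_scores:
        if score < current_score:
            tabu_list.append(dfg)                # worsening: not to be explored again
        elif score > current_score and dfg is not best_dfg:
            intermediate_memory.append(dfg)      # improving, but not the best neighbor
        elif score == current_score:
            long_term_memory.append(dfg)         # unchanged score: deprioritized
    tabu_list.append(current_dfg)                # the current DFG has been explored
    if best_dfg is not None:
        return best_dfg                          # a strictly better neighbor exists
    if intermediate_memory:
        return intermediate_memory.pop()         # latest DFG added to the intermediate memory
    if long_term_memory:
        return long_term_memory.pop()            # latest DFG added to the long-term memory
    return request_new_dfg()                     # restart from a new (random) DFG
```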
Simulated Annealing avoids getting stuck in a local optimum by allowing the selection of DFGs that worsen the objective function score (Algorithm 2, lines 36 to 40). In doing so, SIMA explores areas of the solution space that other S-metaheuristics do not. When a better DFG is not found in the neighborhood of the current DFG, SIMA analyzes one neighboring DFG at a time. If this neighbor does not worsen the objective function score, SIMA outputs it. If, instead, the neighboring DFG worsens the objective function score, SIMA outputs it with a probability of \(e^{-\frac{\left| s_n - s_c \right| }{T}}\), where \(s_n\) and \(s_c\) are the objective function scores of the neighboring DFG and the current DFG, respectively, and the temperature T is an integer that converges to zero as a linear function of the maximum number of iterations. The temperature is fundamental to avoid updating the current DFG with a worse one when there would be no time to recover from the worsening (i.e., too few iterations left to continue the exploration of the solution space from the worse DFG).
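A short sketch of SIMA's acceptance test for a worsening neighbor, using the probability \(e^{-\frac{\left| s_n - s_c \right| }{T}}\) given above; the linear cooling schedule is written as described in the text, while the initial temperature value is an illustrative assumption of ours.

```python
import math
import random

# Sketch of SIMA's acceptance of a neighbor that worsens the objective score.
def accept_worsening(s_n, s_c, temperature, rng=random.Random(1)):
    if temperature <= 0:
        return False                             # no iterations left to recover
    return rng.random() < math.exp(-abs(s_n - s_c) / temperature)

def temperature(iteration, max_iterations, initial_temperature=10):
    # T decreases linearly towards zero over the allotted iterations
    # (initial_temperature is an illustrative choice).
    return int(initial_temperature * (1 - iteration / max_iterations))
```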
Framework instantiation
To assess our framework, we instantiated it for three APDAs: Split Miner [9], Fodina [34], and Inductive Miner [24]. These three APDAs are all DFG-based and representative of the state of the art. In fact, the latest literature review and benchmark of APDAs [8] showed that Fodina, Split Miner, and Inductive Miner outperformed other APDAs when their hyperparameters were optimized via a brute-force approach. Therefore, we decided to focus on those DFG-based APDAs that would benefit the most from the application of our optimization framework.
To complete the instantiation of our framework for any concrete DFG-based APDA, it is necessary to implement an interface that allows the metaheuristics to interact with the APDA (as discussed above). Such an interface should provide four functions: DiscoverDFG and ConvertDFGtoProcessModel (see Algorithm 1), the Restart Function (RF) for TABU, and the Perturbation Function (PF) for ILS.
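One possible shape of such an interface is sketched below as an abstract class; the method names mirror the four functions listed above, while the signatures and docstrings are our own assumptions.

```python
from abc import ABC, abstractmethod

# Sketch of the interface a DFG-based APDA could expose to the metaheuristics.
class DFGBasedAPDA(ABC):

    @abstractmethod
    def discover_dfg(self, log):
        """DiscoverDFG: return the (refined) DFG extracted from the event log."""

    @abstractmethod
    def convert_dfg_to_process_model(self, dfg):
        """ConvertDFGtoProcessModel: turn a (refined) DFG into a process model."""

    @abstractmethod
    def restart(self, log):
        """RF (for TABU): return a DFG not returned before, e.g., via random parameters."""

    @abstractmethod
    def perturb(self, dfg):
        """PF (for ILS): return a perturbed variant of the given DFG."""
```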
The first two functions, DiscoverDFG and ConvertDFGtoProcessModel, are inherited from the DFG-based APDA, in our case Split Miner, Fodina, or Inductive Miner. We note that Split Miner and Fodina receive as input parameter settings that can vary the output of the DiscoverDFG function. Precisely, Split Miner has two parameters: the noise filtering threshold, used to drop infrequent edges of the DFG, and the parallelism threshold, used to determine which potential parallel relations between activities are used when discovering the process model from the DFG. Fodina, instead, has three parameters: the noise filtering threshold, similar to the one of Split Miner, and two thresholds to detect, respectively, self-loops and short-loops in the DFG. The DFG-based variant of Inductive Miner [24] that we integrated in our optimization framework does not receive any input parameters.
To discover the initial DFG (Algorithm 1, line 1) with Split Miner, default parameters are used.Footnote 4 We avoid randomness when discovering the initial DFG because, most of the time, the DFG discovered by Split Miner with default parameters is already a good solution [9], and starting the solution space exploration from it can reduce the total exploration time.
Similarly, if Fodina is the selected APDA, the initial DFG (Algorithm 1, line 1) is discovered using Fodina's default parameters,Footnote 5 even though there is no guarantee that the default parameters allow Fodina to discover a good starting solution [8]. Still, this design choice is less risky than choosing random values for the input parameters to discover the initial DFG, because Fodina, which does not guarantee soundness, would likely discover unsound models when randomly tuned.
On the other hand, Inductive Miner [24] does not apply any manipulation to the discovered initial DFG. In this case, we pseudorandomly generate an initial DFG starting from a given seed, to ensure determinism. Differently from the case of Fodina, this is a suitable design choice for Inductive Miner, because it always guarantees block-structured sound process models, regardless of the DFG.
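The initial-DFG policy for the three APDAs could be summarized as in the following sketch. The way a pseudorandom DFG is drawn for Inductive Miner (a seeded random subset of the log DFG's edges) is our own illustrative assumption, not a description of the actual implementation.

```python
import random

# Sketch of the initial-DFG policy described above (the edge-subsetting
# strategy for Inductive Miner is an illustrative assumption).
def initial_dfg(apda_name, apda, log, log_dfg, seed=1):
    if apda_name in ("Split Miner", "Fodina"):
        return apda.discover_dfg(log)            # DFG obtained with default parameters
    # Inductive Miner: pseudorandomly keep a subset of the log DFG's edges,
    # deterministically with respect to the given seed.
    nodes, edges = log_dfg
    rng = random.Random(seed)
    kept_edges = {e for e in sorted(edges) if rng.random() < 0.5}
    return nodes, kept_edges
```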
Function RF is very similar to DiscoverDFG, since it requires the APDA to output a DFG. The only difference is that RF must output a different DFG every time it is executed. We adapted the DiscoverDFG function of Split Miner and Fodina to output the DFG discovered with default parameters the first time it is executed, and a DFG discovered with pseudorandom parameters in the following executions. The case of Inductive Miner is simpler: since its DiscoverDFG function always returns a pseudorandom DFG, we mapped RF directly to DiscoverDFG.
Finally, function PF can be provided either by the APDA (through the interface) or by the metaheuristic. However, PF can be more effective when it is not generalised by the metaheuristic, allowing the APDA to apply different perturbations to the DFGs that take into account how the APDA converts the DFG into a process model. We chose a different PF for each of the three APDAs, as follows.
- Split Miner PF We invoke Split Miner's concurrency oracle to extract the possible parallelism relations in the log using a randomly chosen parallelism threshold. For each newly discovered parallel relation that is not present in the current solution, two edges are removed from the DFG, while, for each parallel relation that is no longer detected, two edges are added to the DFG.
- Fodina PF Given the current DFG, we analyze its self-loop and short-loop relations using random loop thresholds. As a result, a new DFG is generated in which a different set of edges is retained as self-loops and short-loops.
- Inductive Miner PF Since Inductive Miner does not perform any manipulation of the DFG, we could not determine an efficient way to perturb the DFG. Thus, we set PF = RF, so that instead of perturbing the current DFG, a new random DFG is generated. This variant of ILS is called Repeated Local Search (RLS). In the evaluation reported in Sect. 4, we use only RLS for Inductive Miner, and both ILS and RLS for Fodina and Split Miner.
To complete the instantiation of our framework, we need to set an objective function. With the goal of optimizing the accuracy of the APDAs, we chose as objective function the F-score of fitness and precision. Among the existing measures of fitness and precision, we selected the Markovian fitness and precision presented in [5, 6].Footnote 6 The rationale for this choice is that these measures are the fastest to compute among state-of-the-art measures [5, 6]. Furthermore, they indicate which edges could be added to or removed from the DFG to improve the fitness or precision of the model. This feedback allows us to design an effective GNF.
In the instantiation of our framework, the objective function's output is a data structure composed of: the Markovian fitness and precision of the model, their F-score, and the mismatches between the model and the event log identified during the computation of the Markovian fitness and precision, i.e., the sets of edges that could be added to improve fitness or removed to improve precision. Algorithm 3 illustrates how we build this data structure; its high-level control flow is sketched in Fig. 6.
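The output data structure could be shaped as in the following sketch; the class and field names are our own illustrative choices.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Sketch of the objective function's output (names are our own):
# accuracy values plus the mismatching DFG edges identified while computing them.
@dataclass(frozen=True)
class ObjectiveScore:
    fitness: float                               # Markovian fitness
    precision: float                             # Markovian precision
    f_score: float                               # harmonic mean of the two
    edges_to_add: FrozenSet[Tuple[str, str]]     # DFG edges whose addition could improve fitness
    edges_to_remove: FrozenSet[Tuple[str, str]]  # DFG edges whose removal could improve precision
```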
Given an event log and a process model, we generate their respective Markovian abstractions by applying the method described in [5] (lines 1 and 2). We recall that the Markovian abstraction of the log/model is a graph where each edge represents a subtraceFootnote 7 observed in the log/model. Next, we collect all the edges of the Markovian abstractions of the log and of the model into two sets: \(E_l\) and \(E_m\) (lines 3 and 4). These two sets are used to determine the Markovian fitness and precision of the process model [5], by applying the formulas in lines 4 and 10. We note that the edges in \(E_l\) that cannot be found in \(E_m\) (set \(E_{df}\), line 6) represent subtraces of the log that cannot be found in the process model. Vice versa, the edges in \(E_m\) that cannot be found in \(E_l\) (set \(E_{dp}\), line 11) represent subtraces of the process model that cannot be found in the log. We analyze these subtraces to detect directly-follows relations, i.e., DFG edges (lines 9 and 14), that can be added to or removed from the DFG that generated the process model in order to improve either fitness or precision. Precisely, the DFG edges that can be added to improve fitness are those captured by the directly-follows relations found in the Markovian abstraction edges of set \(E_{df}\). On the other hand, the edges that can be removed to improve precision are those captured by the directly-follows relations found in the Markovian abstraction edges of set \(E_{dp}\). Once these edges to be added or removed are identified (sets \(E_f\) and \(E_p\)), we output the final data structure, which comprises the Markovian fitness and precision, their F-score, and the two sets \(E_f\) and \(E_p\).
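The following sketch mirrors the edge-set comparison of Algorithm 3, under the simplifying assumptions that fitness is the fraction of log-abstraction edges reproduced by the model, precision is the fraction of model-abstraction edges observed in the log (see [5, 6] for the exact formulas), and each Markovian-abstraction edge is represented as a tuple of activities denoting the corresponding subtrace. The helper name and these representations are ours, not those of the implementation.

```python
# Hedged sketch of the edge-set comparison performed in Algorithm 3.
def markovian_scores(log_abstraction_edges, model_abstraction_edges):
    e_l, e_m = set(log_abstraction_edges), set(model_abstraction_edges)
    e_df = e_l - e_m              # subtraces of the log missing from the model
    e_dp = e_m - e_l              # subtraces of the model missing from the log
    fitness = 1 - len(e_df) / len(e_l) if e_l else 1.0
    precision = 1 - len(e_dp) / len(e_m) if e_m else 1.0
    # DFG edges suggested by the mismatching subtraces (sets E_f and E_p).
    edges_to_add = {pair for edge in e_df for pair in directly_follows_pairs(edge)}
    edges_to_remove = {pair for edge in e_dp for pair in directly_follows_pairs(edge)}
    return fitness, precision, edges_to_add, edges_to_remove

def directly_follows_pairs(subtrace):
    """Illustrative helper: consecutive activity pairs of the subtrace
    represented by a Markovian-abstraction edge (here a tuple of activities)."""
    return {(subtrace[i], subtrace[i + 1]) for i in range(len(subtrace) - 1)}
```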
Given the above objective function’s output, our GNF is described in Algorithm 4, while Fig. 7 captures its high-level control flow sketch.
This function receives as input the current DFG (\({\mathscr {G}}_c\)), its objective function score (the data structure \(s_c\)), and the number of neighbors to generate (\(\textit{size}_n\)). If fitness is greater than precision, we retrieve from \(s_c\) the set of edges (\(E_m\)) that could be removed from \({\mathscr {G}}_c\) to improve its precision (line 2). Conversely, if precision is greater than fitness, we retrieve from \(s_c\) the set of edges (\(E_m\)) that could be added to \({\mathscr {G}}_c\) to improve its fitness (line 4). The reasoning behind this design choice is that, since our objective function is the F-score, it is preferable to increase the lower of the two measures: if fitness is lower, we increase fitness; if precision is lower, we increase precision. Once we have \(E_m\), we randomly select one edge from it, generate a copy of the current DFG (\({\mathscr {G}}_n\)), and either remove or add the selected edge according to the accuracy measure we want to improve (precision or fitness), see lines 7 to 13. If the removal of an edge generates a disconnected \({\mathscr {G}}_n\), we do not add it to the neighbors set N (line 10). We keep iterating over \(E_m\) until the set is empty (i.e., no mismatching edges are left) or N reaches its maximum size (i.e., \(\textit{size}_n\)), and we then return N. The overall algorithm ends when the maximum execution time or the maximum number of iterations is reached.
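The feedback-driven GNF of Algorithm 4 could be sketched as follows. The field names follow the ObjectiveScore sketch given earlier, the DFG is represented as a pair of a node set and an edge set, and is_connected is an assumed helper that checks whether the DFG is still connected; all of these are our own illustrative assumptions.

```python
import random

# Sketch of the feedback-driven GNF of Algorithm 4: use the mismatching edges
# reported by the objective function to add (improve fitness) or remove
# (improve precision) one edge per generated neighbor.
def generate_neighbors(current_dfg, score, size_n, is_connected, seed=None):
    """current_dfg: (nodes, edges), with edges a set of (source, target) pairs."""
    rng = random.Random(seed)
    nodes, edges = current_dfg
    improve_fitness = score.precision > score.fitness
    candidates = list(score.edges_to_add if improve_fitness else score.edges_to_remove)
    neighbors = []
    while candidates and len(neighbors) < size_n:
        edge = candidates.pop(rng.randrange(len(candidates)))
        if improve_fitness:
            neighbors.append((nodes, edges | {edge}))    # add a missing edge
        else:
            candidate = (nodes, edges - {edge})          # drop a superfluous edge
            if is_connected(candidate):                  # skip disconnected DFGs
                neighbors.append(candidate)
    return neighbors
```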