1 Introduction

In today’s organizations, it is important to ensure that process executions follow the protocols prescribed by process stakeholders so that compliance is maintained. Conformance checking in process mining compares event data with the corresponding process model to identify commonalities and discrepancies [2]. Detailed diagnostics provide novel insights into the magnitude and effect of deviations. The state of the art in conformance checking is formed by alignment-based techniques, which provide detailed explanations of the observed behavior in terms of the modeled behavior [4].

However, one limitation of alignment-based approaches is the state-space explosion during alignment computation. For example, the classic cost-based alignment approach [4] is, in the worst case, exponential in the size of the model [5].

One line of research focuses on decomposition techniques, which break down a conformance problem into smaller sub-problems [1]. Experimental results have shown that decomposed approaches can be several times faster than their monolithic counterparts and can compute alignments for datasets that were previously infeasible. Until recently, however, decomposition techniques were limited to the decision problem of whether a log trace fits the model perfectly, so reliable diagnostics were missing. Recent work has shown that overall alignment results can be computed under decomposed conformance checking by using the so-called recomposition approach: a framework that computes overall alignment results in a decomposed manner was presented in [10, 11].

A key result of that work is the definition and proof of the border agreement condition, which permits merging sub-alignment results into an overall result. If the condition is not met, the decomposed sub-components are “recomposed” so that the merging condition is more likely to be met in the next alignment iteration. Experimental results have shown significant performance gains from recomposition, but they have also shown that the merging aspect of the framework can become a performance bottleneck: log traces may require numerous recompositions before reaching the merging condition. This paper addresses that bottleneck by defining and structuring the recomposition step, proposing different recomposition strategies, and evaluating their impact on the overall computation time. The experimental results show that, by applying the presented recomposition strategies, exact alignment results can be computed on synthetic and real-life datasets much faster.

The remainder of the paper is structured as follows: Sect. 2 introduces the required notations and concepts; in particular, Sect. 2.2 presents the recomposition approach, which is the focus of this paper. Section 3 defines and structures the recomposition step and sheds light on the limitations of the existing recomposition strategies. Section 4 presents four recomposition strategies that can be used in the recomposition step. Section 5 details the experimental setup for the evaluation of the proposed strategies, and Sect. 6 analyzes the experimental results. Section 7 discusses related work. Finally, Sect. 8 presents conclusions and future work.

2 Preliminaries

This section introduces basic concepts related to process models, event logs, and alignment-based conformance checking techniques.

Let X be a set. \(\mathcal {B}(X)\) denotes the set of all possible multisets over set X, and \(X^{*}\) denotes the set of all possible sequences over set X. \(\langle {}\rangle {}\) denotes the empty sequence. Concatenation of sequences \(\sigma _1 \in X^{*}\) and \(\sigma _2 \in X^{*}\) is denoted as \(\sigma _1 \cdot \sigma _2\). Given a tuple \(\varvec{x} = (x_1, x_2, \ldots , x_n) \in X_1 \times X_2 \times \ldots \times X_n\), \(\pi _{i}(\varvec{x}) = x_i\) denotes the projection operator for all \(i \in \{1, \ldots , n\}\). This operator is extended to sequences so that given a sequence \(\varvec{\sigma } \in (X_1 \times X_2 \times \ldots \times X_n)^{*}\) of length m with \(\varvec{\sigma } = \langle {}(x_{1_1}, x_{2_1}, \ldots , x_{n_1}), (x_{1_2}, x_{2_2}, \ldots , x_{n_2}), \ldots , (x_{1_m}, x_{2_m}, \ldots , x_{n_m})\rangle {}\), \(\pi _{i}(\varvec{\sigma }) = \langle {}x_{i_1}, x_{i_2}, \ldots , x_{i_m}\rangle {}\) for all \(i \in \{1, \ldots , n\}\). Projection onto a set and application of a function are also lifted to sequences. Given \(Y \subseteq X\) and a sequence \(\sigma \in X^{*}\), projection onto Y is defined recursively: \(\langle {}\rangle {}\!\!\upharpoonright _{Y} = \langle {}\rangle {}\), \((\langle {}x\rangle {} \cdot \sigma )\!\!\upharpoonright _{Y} = \langle {}x\rangle {} \cdot \sigma \!\!\upharpoonright _{Y}\) if \(x \in Y\), and \((\langle {}x\rangle {} \cdot \sigma )\!\!\upharpoonright _{Y} = \sigma \!\!\upharpoonright _{Y}\) if \(x \notin Y\). Similarly, given a function \(f: X \rightarrow Y\) and a sequence \(\sigma = \langle {}x_1, x_2, \ldots , x_n\rangle {} \in X^{*}\), \(f(\sigma ) = \langle {}f(x_1), f(x_2), \ldots , f(x_n)\rangle {}\).
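
To make the notation concrete, the following Python sketch implements the three sequence operators; it is purely illustrative, and all names are ours rather than the paper's.

```python
# Illustrative implementation of the sequence operators above.

def project(sigma, i):
    """pi_i(sigma): project a sequence of tuples onto its i-th component (1-indexed)."""
    return [x[i - 1] for x in sigma]

def restrict(sigma, Y):
    """Projection of sigma onto Y: keep only the elements that are in Y."""
    return [x for x in sigma if x in Y]

def apply(f, sigma):
    """f(sigma): apply f to every element of the sequence."""
    return [f(x) for x in sigma]

sigma = [("a", 1), ("b", 2), ("c", 3)]
assert project(sigma, 1) == ["a", "b", "c"]
assert restrict(project(sigma, 1), {"a", "c"}) == ["a", "c"]
assert apply(str.upper, ["a", "b"]) == ["A", "B"]
```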

2.1 Preliminaries on Petri Net, Event Log, and Net Decomposition

In this paper, Petri nets are used to represent process models.

Definition 1 (Labeled Petri net)

Let P denote a set of places, T denote a set of transitions, and \(F \subseteq (P \times T) \cup (T \times P)\) denote the flow relation. A labeled Petri net \(N = (P, T, F, l)\) is a Petri net \((P, T, F)\) with labeling function \(l \in T\nrightarrow {}\mathcal {U}_{A}\) where \(\mathcal {U}_{A}\) is some universe of activity labels.

In a process setting, there is typically a well-defined start and end to an instance of the process. This can be denoted with the initial and final marking of a system net.

Fig. 1. System net S that models a loan application process (adapted from [6])

Definition 2 (System net)

A system net is a triplet \(S= (N, I, O)\) where \(N = (P, T, F, l)\) is a labeled Petri net, \(I \in \mathcal {B}(P)\) is the initial state and \(O \in \mathcal {B}(P)\) is the final state. \(\phi _f(S) \) is the set of transition sequences that reach the final state when started in the initial state. If \(\sigma \) is a transition sequence, then \(l(\sigma \!\!\upharpoonright _{ dom (l)})\) is an activity sequence.

\(T_v(S) = dom (l)\) is the set of visible transitions in \(S\). \(T^u_v(S) = \{t \in T_v(S)\mid {}\forall _{t' \in T_v(S)} l(t) = l(t') \Rightarrow t = t'\}\) is the set of unique visible transitions in \(S\).

Figure 1 presents a system net S that models a loan application process (ignore the grey boxes in the background for now). \([\mathsf {i}]\) is the initial marking and \([\mathsf {o}]\) is the final marking. An example activity sequence is \(\langle {}\mathsf {a}, \mathsf {b}, \mathsf {c}, \mathsf {d}, \mathsf {f}, \mathsf {g}, \mathsf {h}, \mathsf {i}, \mathsf {k}\rangle {}\), which corresponds to the events of a successful loan application. Real-life process executions are recorded as event data, which can be expressed as an event log.

Definition 3 (Trace, Event log)

Let \(A \subseteq \mathcal {U}_{A}\) be a set of activities. A trace \(\sigma \in A^{*}\) is a sequence of activities. An event log \(L \in \mathcal {B}(A^{*})\) is a multiset of traces.

Figure 2 presents an event log L corresponding to the system net in Fig. 1. Log L has 20 cases in total with 5 cases following trace \(\sigma _1\), 10 cases following trace \(\sigma _2\), and 5 cases following trace \(\sigma _3\). In cost-based alignment conformance checking, a trace is aligned with the corresponding system net to produce an alignment.

Fig. 2. Running example: event log L

Definition 4 (Alignment [4])

Let \(L \in \mathcal {B}(A^{*})\) be an event log with \(A \subseteq \mathcal {U}_{A}\), let \(\sigma _L \in L\) be a log trace and \(\sigma _M \in \phi _f(S)\) a complete transition sequence of system net \(S\). An alignment of \(\sigma _L\) and \(\sigma _M\) is a sequence of pairs \(\gamma \in ((A \cup \{\, \gg \}) \times (T \cup \{\, \gg \}))^{*}\) where \(\pi _{1}(\gamma )\!\!\upharpoonright _{A} = \sigma _L\), \(\pi _{2}(\gamma )\!\!\upharpoonright _{T} = \sigma _M\), \(\forall _{(a, t) \in \gamma } ~ a \ne \, \gg \vee ~ t \ne \, \gg \), and \(\forall _{(a, t) \in \gamma } ~ a \ne \, \gg \Rightarrow ~ (t = \, \gg \vee ~ a = l(t))\).

Each pair in an alignment is a legal move. There are four types of legal moves: a synchronous move \((a, t)\) means that the activity matches the label of the transition, i.e., \(a = l(t)\); a log move \((a, \, \gg )\) means that there is a step in the log that is not matched by a corresponding step in the model; a model move \((\, \gg , t)\) with \(t \in dom (l)\) means that there is a step in the model that is not matched by a corresponding step in the log; and an invisible move \((\, \gg , t)\) with \(t \in T \setminus dom (l)\) means that the step in the model corresponds to an invisible transition that is not observable in the log.
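
For illustration, the four move types can be distinguished mechanically. The sketch below assumes \(\gg \) is encoded by a sentinel string and the labeling function is given as a dictionary over the visible transitions; both representations are our own simplification.

```python
# Classifying the legal moves of an alignment (cf. Definition 4).
SKIP = ">>"  # stands for the no-move symbol

def classify_move(a, t, labels):
    # `labels` maps visible transitions to activity labels;
    # invisible transitions are simply absent from it.
    if a != SKIP and t != SKIP and a == labels.get(t):
        return "synchronous move"
    if a != SKIP and t == SKIP:
        return "log move"
    if a == SKIP and t in labels:
        return "model move"
    if a == SKIP and t != SKIP:
        return "invisible move"
    raise ValueError("illegal move")

labels = {"t1": "a", "t2": "b"}  # t3 would be an invisible transition
alignment = [("a", "t1"), ("b", SKIP), (SKIP, "t2"), (SKIP, "t3")]
print([classify_move(a, t, labels) for a, t in alignment])
# ['synchronous move', 'log move', 'model move', 'invisible move']
```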

Definition 5 (Valid decomposition [1] and Border activities [11])

Let \(S= (N, I, O)\) with \(N = (P, T, F, l)\) be a system net. \(D = \{ S_1, S_2, \ldots ,S_n \}\) is a valid decomposition if and only if the following properties are fulfilled:

  • \(S_i = (N_i, I_i, O_i)\) is a system net with \(N_i = (P_i, T_i, F_i, l_i)\) for all \(1 \le i \le n\).

  • \(l_i = l\!\!\upharpoonright _{T_i}{}\) for all \(1 \le i \le n\).

  • \(P_i \cap P_j = \emptyset \) and \(T_i \cap T_j \subseteq T^u_v(S)\) for all \(1 \le i < j \le n\).

  • \(P = \bigcup _{1 \le i \le n}{P_i}\), \(T = \bigcup _{1 \le i \le n}{T_i}\), and \(F = \bigcup _{1 \le i \le n}{F_i}\).

\(\mathcal {D}(S)\) is the set of all valid decompositions of \(S\).

\(A_b(D) = \{l(t)\mid {}\exists _{1 \le i < j \le n}~{t \in T_i \cap T_j}\}\) is the set of border activities of the valid decomposition D. To retrieve the sub-nets that share the same border activity, for an activity \(a \in rng (l)\), \(S_b(a, D) = \{S_i \in D\mid {} a \in rng (l_i)\}\) is the set of sub-nets that contain a as an observable activity.
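
A minimal sketch of how \(A_b(D)\) and \(S_b(a, D)\) can be computed, assuming each sub-net is represented simply by its transition set and the labeling function by a dictionary (a deliberate simplification of Definition 5):

```python
from itertools import combinations

def border_activities(subnets, labels):
    """A_b(D): labels of transitions shared between two sub-nets."""
    shared = set()
    for (_, Ti), (_, Tj) in combinations(subnets, 2):
        shared |= Ti & Tj
    return {labels[t] for t in shared if t in labels}

def subnets_with_activity(a, subnets, labels):
    """S_b(a, D): sub-nets containing a as an observable activity."""
    return [name for name, Ti in subnets
            if any(labels.get(t) == a for t in Ti)]

labels = {"t1": "a", "t4": "d", "t5": "e"}      # toy labeling
subnets = [("S1", {"t1", "t4", "t5"}), ("S2", {"t4", "t5"})]
print(border_activities(subnets, labels))           # {'d', 'e'}
print(subnets_with_activity("d", subnets, labels))  # ['S1', 'S2']
```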

Figure 1 presents a valid decomposition D of net S, where the sub-nets are marked by the grey boxes. For example, sub-net \(\mathsf {S}_1\) consists of the transitions \(\mathsf {t}_1\), \(\mathsf {t}_2\), \(\mathsf {t}_3\), \(\mathsf {t}_4\), \(\mathsf {t}_5\), and \(\mathsf {t}_6\). Border activities are the labels of the transitions shared between two sub-nets; here, the shared transitions are \(\mathsf {t}_4\), \(\mathsf {t}_5\), \(\mathsf {t}_6\), \(\mathsf {t}_8\), \(\mathsf {t}_{11}\), and \(\mathsf {t}_{12}\). Under the recomposing conformance checking framework, overall alignments can be computed in a decomposed manner.

2.2 Recomposing Conformance Checking

Figure 3 presents an overview of the recomposing conformance checking framework [10, 11], which consists of the following five steps:

(1) The net and log are decomposed using a decomposition strategy, e.g., maximal decomposition [1].

(2) Alignment-based conformance checking is performed per sub-net and sub-log to produce a set of sub-alignments for each log trace.

(3) Since sub-components overlap on border activities, the sub-alignments of a log trace also overlap on moves involving border activities; consequently, only border activities can cause merge conflicts. In [11], it was shown that if the sub-alignments synchronize on these moves, they can be merged into an overall optimal alignment using the merging algorithm presented in [18]. This condition was formalized as the total border agreement condition. Log traces that do not meet the condition are either rejected or left for the next iteration.

(4) User-configured termination conditions are checked at the end of each iteration. If the framework terminates before computing the overall optimal alignments for all log traces, an approximate overall result is given. The results of the framework consist of a fitness value and a set of alignments corresponding to the log traces. For an approximate result, the fitness value is an interval bounding the exact fitness value, and the set of alignments may contain pseudo alignments.

(5) If there are remaining log traces to be aligned and the termination conditions have not been reached, a recomposition step is taken to produce a new net decomposition and a corresponding set of sub-logs. The next iteration of the framework then starts from Step (2).
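
The control flow of the framework can be summarized by the following schematic loop. It is a sketch of the structure only: the five callables stand for the framework steps of [10, 11] and are not their actual implementations.

```python
def recomposing_loop(decompose, align, merge, recompose, terminated, log):
    """Schematic control flow of the framework in Fig. 3."""
    decomposition, sublogs = decompose(log)                 # step (1)
    results, pending = {}, list(log)
    while pending and not terminated(results):              # step (4)
        subalignments = align(decomposition, sublogs)       # step (2)
        merged, conflicts = merge(subalignments)            # step (3)
        results.update(merged)                              # accepted traces
        pending = [t for t in pending if t not in merged]
        if pending:                                         # step (5)
            decomposition, sublogs = recompose(decomposition,
                                               pending, conflicts)
    return results, pending  # non-empty `pending` means an approximate result
```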

Fig. 3.
figure 3

Recomposing conformance checking framework with the recomposition step highlighted in dark blue (Color figure online)

While experimental results have shown significant performance gains from the recomposition approach over its monolithic counterpart, large-scale experimentation has shown that the recomposition step is a potential bottleneck. In particular, the strategies used at the recomposition step can have a significant impact on performance. The following section takes a more detailed look at the recomposition step and discusses the limitations of the current recomposition strategies.

3 Recomposition Step

The recomposition step refers to Step (5) of the framework overview presented in Fig. 3 and is highlighted in dark blue. We formalize the step in two parts: the production of a new net decomposition and a corresponding set of sub-logs.

Definition 6 (Recomposition step)

Let \(D \in \mathcal {D}(S)\) be a valid decomposition of system net \(S\) and let \(L \in \mathcal {B}(A^{*})\) be an event log. For \(1 \le i \le n\), where \(n = |D|\), let \(M_i = (A_i \cup \{\, \gg \}) \times (T_i \cup \{\, \gg \})\) be the possible alignment moves for a sub-component so that \(\Upgamma _D = [(\gamma _{i_1}, \ldots , \gamma _{i_n}) \in M_1^{*} \times \ldots \times M_n^{*} \mid \exists _{\sigma _i \in L} \forall _{j \in \{1, \ldots , n\}} \pi _{1}(\gamma _{i_j})\!\!\upharpoonright _{A_j} = \sigma _i\!\!\upharpoonright _{A_j}]\) contains the latest sub-alignments for all log traces. Given the valid decomposition and the latest sub-alignments, \( R_S: \mathcal {D}(S) \times \mathcal {B}(M_1^{*} \times \ldots \times M_n^{*})\rightarrow {}\mathcal {D}(S) \) creates a new valid decomposition \(D' \in \mathcal {D}(S)\) where \( m = |D'| < |D|\). Then, given the new and current net decompositions, the event log, and the latest sub-alignments, \( R_L: \mathcal {D}(S) \times \mathcal {D}(S) \times \mathcal {B}(A^{*}) \times \mathcal {B}(M_1^{*} \times \ldots \times M_n^{*})\nrightarrow {}\mathcal {B}(A'_1{}^*) \times \ldots \times \mathcal {B}(A'_m{}^*) \) creates a set of sub-logs to align in the following iteration of the recomposition approach. Overall, the recomposition step \(R\) creates a new net decomposition and a corresponding set of sub-logs, \(R: \mathcal {D}(S) \times \mathcal {B}(A^{*}) \times \mathcal {B}(M_1^{*} \times \ldots \times M_n^{*})\nrightarrow {}\mathcal {D}(S) \times \mathcal {B}(A'_1{}^*) \times \ldots \times \mathcal {B}(A'_m{}^*)\).
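
The two parts of the step can be pictured as the following skeleton, in which sub-nets are represented by their activity sets and the helpers passed in play the roles of \(R_S\) and the trace-selection part of \(R_L\); all of this is an illustrative simplification of Definition 6, not the framework's implementation.

```python
def recomposition_step(D, log, subalignments, recompose_net, select_traces):
    """R: given decomposition D, the log, and the latest sub-alignments,
    produce a coarser decomposition D' and its corresponding sub-logs."""
    D_new = recompose_net(D, subalignments)      # R_S, with |D_new| < |D|
    L_r = select_traces(log, subalignments)      # trace selection, e.g. IC
    # R_L: project each selected trace onto the activities of each sub-net
    sublogs = [[restrict(trace, activities) for trace in L_r]
               for activities in D_new]
    return D_new, sublogs

def restrict(sigma, Y):
    # the projection operator from Sect. 2
    return [x for x in sigma if x in Y]
```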

The current recomposition strategy recomposes on the most frequent conflicting activities (MFC) and constructs sub-logs containing the to-be-aligned traces that carry conflicting activities which have been recomposed upon (IC).

Most frequent conflict (MFC) recomposes the current net decomposition on the activity set \(A_r = \{ a \in A_b(D) \mid a \in \text {arg}\,\max _{a' \in A_b(D)} \sum _{\varvec{\gamma _i} \in \text {Supp}(\Upgamma _D)} \text {C}(\varvec{\gamma _i})(a')\}\) where \(\Upgamma _D \in \mathcal {B}(M_1^{*} \times \ldots \times M_n^{*})\) are the latest sub-alignments and \(\text {C}: M_1^{*} \times \ldots \times M_n^{*} \rightarrow \mathcal {B}(A)\) is a function that gives the multiset of conflicting activities of sub-alignments. Hence, \(A_r\) contains the border activities with the most conflicts.

Inclusion by conflict (IC) then creates a log \(L_r = [\sigma _i \in L \mid \exists _{a \in A_b(D)}~\text {C}(\varvec{\gamma _i})(a) > 0 \wedge a \in A_r ]\) where \(\varvec{\gamma _i} \in \Upgamma _D\) are the sub-alignments of trace \(\sigma _i \in L\) under net decomposition \(D \in \mathcal {D}(S)\). As such, log \(L_r\) includes the to-be-aligned log traces that have conflicts on at least one of the border activities that have been recomposed upon. Log \(L_r\) is then projected onto the new net decomposition to create the corresponding sub-logs.
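
Both strategies are easy to state in code. The sketch below represents \(\text {C}(\varvec{\gamma _i})\) as a Counter per trace; the toy input mirrors the conflict counts of the running example discussed in Sect. 3.1, and all names are illustrative.

```python
from collections import Counter

def mfc(conflicts):
    """MFC: border activities tied for the highest total conflict count."""
    totals = sum(conflicts.values(), Counter())
    top = max(totals.values())
    return {a for a, n in totals.items() if n == top}

def ic(conflicts, A_r):
    """IC: traces with a conflict on at least one recomposed activity."""
    return [i for i, c in conflicts.items() if set(c) & A_r]

conflicts = {"s1": Counter(),                    # no merge conflicts
             "s2": Counter({"c": 1, "i": 1}),
             "s3": Counter({"c": 1, "j": 1})}
A_r = mfc(conflicts)
print(A_r, ic(conflicts, A_r))  # {'c'} ['s2', 's3']
```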

3.1 Limitations to the Current Recomposition Strategies

To explain the limitations, we refer to the set of optimal sub-alignments in Fig. 4, obtained by aligning net decomposition D in Fig. 1 with log L in Fig. 2. We first note that, for the conflicting activities highlighted in grey, \(\sum _{\varvec{\gamma } \in \Upgamma _\textsf {D}}\text {C}(\varvec{\gamma })(\mathsf {c}) = 2\), \(\sum _{\varvec{\gamma } \in \Upgamma _\textsf {D}}\text {C}(\varvec{\gamma })(\mathsf {i}) = 1\), and \(\sum _{\varvec{\gamma } \in \Upgamma _\textsf {D}}\text {C}(\varvec{\gamma })(\mathsf {j}) = 1\), where \(\Upgamma _\textsf {D} = \{\varvec{\gamma }_1, \varvec{\gamma }_2, \varvec{\gamma }_3\}\). Since activity c is the most frequent conflicting activity, MFC recomposes the current net decomposition on \(A_r = \{ \mathsf {c} \}\), and IC creates the corresponding sub-logs from \(L_r = \{ \sigma _2, \sigma _3 \}\), since both traces have activity c as a conflicting activity. The new net decomposition contains three sub-nets rather than four: sub-nets \(\mathsf {S}_1\) and \(\mathsf {S}_2\) are recomposed on activity c into a single sub-net. The corresponding sub-log set is created by projecting log \(L_r\) onto the new net decomposition.

While one merge conflict is resolved by recomposing on activity c, the merge conflicts at activities i and j remain in the following iteration. In fact, under the current recomposition strategy, traces \(\sigma _2\) and \(\sigma _3\) have to be aligned three times each before the merging condition required to yield overall alignments is reached. This illustrates the limitations of MFC, which only partially resolves merge conflicts at the trace level, and of IC, which leniently includes to-be-aligned log traces whose subsequent sub-alignments are unlikely to reach the necessary merging condition.

Fig. 4. Sub-alignments \(\varvec{\gamma _1} = (\gamma _{1_1}, \gamma _{1_2}, \gamma _{1_3}, \gamma _{1_4})\), \(\varvec{\gamma _2} = (\gamma _{2_1}, \gamma _{2_2}, \gamma _{2_3}, \gamma _{2_4})\), and \(\varvec{\gamma _3} = (\gamma _{3_1}, \gamma _{3_2}, \gamma _{3_3}, \gamma _{3_4})\) of log L and net decomposition D with merge conflicts highlighted in grey

As such, the key to improving the existing recomposition strategies lies in lifting conflict resolution from the individual activity level to the trace level: the net recomposition strategy should resolve the merge conflicts of traces rather than of activities, and the log recomposition strategy should select log traces whose merge conflicts are likely to be fully resolved with the latest net recomposition. In the following section, three net recomposition strategies and one log recomposition strategy are presented. These strategies improve on the existing ones by looking at merge conflict sets, identifying co-occurring conflicting activities, and minimizing the average size of the resulting recomposed sub-nets. The experimental results presented later show that the strategies lead to significant performance improvements on both synthetic and real-life datasets.

4 Recomposition Strategies

In this section, three net recomposition strategies and one log recomposition strategy are presented.

4.1 Net Recomposition Strategies

As previously shown, resolving individual conflicting activities may only partially resolve the merge conflicts of traces. This key observation motivates the following net recomposition strategies which target conflicts at the trace level.

Top k most frequent conflict set (MFCS-k) constructs a multiset of conflict sets \(A_{cs} = [\text {Supp}(C(\varvec{\gamma })) \subseteq A_b(D) \mid \varvec{\gamma } \in \Upgamma _D \wedge |C(\varvec{\gamma })| > 0]\). Then the top k most frequent conflict sets \(A_{cs,k} \subseteq \{ a_{cs} \subseteq A_b(D) \mid A_{cs}(a_{cs}) > 0 \}\) are selected. If \(|A_{cs}| < k\), then all conflict sets are taken. Afterwards, the recomposing activity set \(A_r = \cup (A_{cs,k}) \subseteq A_b(D)\) is created. In the case where two conflict sets have the same occurrence frequency, a random one is chosen; this tie-breaking criterion avoids bias and empirically performs better than other straightforward criteria.
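
A compact sketch of MFCS-k, counting conflict sets as frozensets and recomposing on the union of the k most frequent ones (ties broken arbitrarily, as discussed above); the representation is our own simplification.

```python
from collections import Counter

def mfcs(conflicts, k):
    """MFCS-k: union of the k most frequent conflict sets."""
    sets = Counter(frozenset(c) for c in conflicts.values() if c)
    top_k = [s for s, _ in sets.most_common(k)]
    return set().union(*top_k) if top_k else set()

conflicts = {"s1": set(), "s2": {"c", "i"}, "s3": {"c", "j"}}
print(mfcs(conflicts, 1))  # either {'c', 'i'} or {'c', 'j'} (a tie)
print(mfcs(conflicts, 2))  # {'c', 'i', 'j'}
```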

Merge conflict graph (MCG) recomposes on conflicting activities that co-occur on the trace level by constructing a weighted undirected graph \(G = (V, E)\) where \(E = \{ \{a_1, a_2\} \mid \exists _{\varvec{\gamma } \in \Upgamma _D}~ a_1 \in \text {C}(\varvec{\gamma }) \wedge a_2 \in \text {C}(\varvec{\gamma }) \wedge a_1 \ne a_2 \}\) with a weight function \(w: E \rightarrow \mathbb {N}^+\) such that \(w(\{a_1,a_2\}) = |\{ \varvec{\gamma } \in \Upgamma _D \mid \text {C}(\varvec{\gamma })(a_1)> 0 \wedge \text {C}(\varvec{\gamma })(a_2) > 0 \}|\) and \(V = \{a \in A_b(D) \mid \exists _{\{a_1,a_2\} \in E}~a = a_1 \vee a = a_2 \}\). Then, with a threshold \(t \in [0, 1]\), edges are filtered so that \(E_f = \{ e \in E \mid w(e) \ge t \times w_{\max } \}\) where \(w_{\max }\) is the maximum edge weight in E. The corresponding vertex set and filtered graph can be created as \(V_f = \{ a \in A_b(D) \mid \exists _{\{a_1,a_2\} \in E_f}~ a = a_1 \vee a = a_2 \}\) and \(G_f = (V_f, E_f)\). Finally, the current net decomposition is recomposed on activity set \(A_r = V_f\).
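
The following sketch builds the weighted co-occurrence graph and applies the edge filter; representing the graph as a Counter over activity pairs is our simplification.

```python
from collections import Counter
from itertools import combinations

def mcg(conflict_sets, t):
    """MCG: activities on edges with weight >= t * w_max."""
    weights = Counter()
    for cs in conflict_sets:
        for a1, a2 in combinations(sorted(cs), 2):
            weights[(a1, a2)] += 1          # co-occurrence on one trace
    if not weights:
        return set()
    w_max = max(weights.values())
    kept = [e for e, w in weights.items() if w >= t * w_max]
    return {a for e in kept for a in e}

conflict_sets = [{"c", "i"}, {"c", "i"}, {"c", "j"}]
print(mcg(conflict_sets, 1.0))  # {'c', 'i'}
print(mcg(conflict_sets, 0.5))  # {'c', 'i', 'j'}
```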

Balanced. This recomposition strategy extends the MFCS-k strategy by also trying to minimize the average size of the sub-nets resulting from the recomposition. For a border activity \(a \in A_b(D)\), \(|(a,D)| = |\cup _{S_i \in S_b(a, D)} A_v(S_i)|\), where \(A_v(S_i)\) denotes the observable activities of sub-net \(S_i\), approximates the size of the sub-net recomposed on activity a. The average size of the recomposed sub-nets for a particular conflict set \(A_c\) can then be approximated by \(|(A_c, D)| = \frac{\sum _{a \in A_c} |(a,D)|}{|A_{c}|}\). The score of the conflict set is computed as a weighted combination \(\beta (A_c, D) = w_0 \times \frac{m(A_c)}{\max _{A'_c \in A_{cs}} m(A'_c)} + w_1 \times (1 - \frac{|(A_c, D)|}{\max _{A'_c \in A_{cs}} |(A'_c, D)|})\), where \(m(A_c) = A_{cs}(A_c)\) is the frequency of the conflict set, so that higher scores are assigned to frequent conflict sets that do not create large recomposed sub-nets. The activities of the conflict sets with the highest score, \(A_r = \{ a \in A_c \mid A_c \in \text {arg}\,\max {}_{A'_c \in A_{cs}} \beta (A'_c, D)\}\), are then recomposed upon to create a new net decomposition.
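
A sketch of the Balanced score, with \(w_0 = w_1 = 0.5\) and a toy size function both chosen by us for illustration; it shows how a less frequent conflict set can win when recomposing on the most frequent one would create large sub-nets.

```python
def balanced(conflict_sets, freq, size, w0=0.5, w1=0.5):
    """Balanced: recompose on the conflict set with the best
    frequency-vs-subnet-size trade-off."""
    def avg_size(cs):
        return sum(size(a) for a in cs) / len(cs)
    max_f = max(freq[cs] for cs in conflict_sets)
    max_s = max(avg_size(cs) for cs in conflict_sets)
    def score(cs):
        return w0 * freq[cs] / max_f + w1 * (1 - avg_size(cs) / max_s)
    return set(max(conflict_sets, key=score))

freq = {frozenset({"c", "i"}): 2, frozenset({"c", "j"}): 1}
size = {"c": 4, "i": 12, "j": 3}.get   # toy |(a, D)| per border activity
print(balanced(list(freq), freq, size))
# picks {'c', 'j'}: less frequent, but avoids the large sub-net around 'i'
```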

4.2 Log Recomposition Strategy

Analogous to the observations motivating the net recomposition strategies, the existing IC strategy can be too lenient: it includes log traces whose conflicting activities are unlikely to be resolved in the following decomposed replay iteration.

Strict include by conflict (SIC) tightens the requirement for a to-be-aligned log trace to be selected for the next iteration. This addresses the limitation of IC, which can include log traces whose merge conflicts are only partially covered by the net recomposition. Given the recomposed activity set \(A_r\), SIC includes a log trace with merge conflicts only if its conflict set is a subset of \(A_r\): \(L_r = [\sigma _i \in L \mid \forall _{a \in \text {C}(\varvec{\gamma }_i)}~a \in A_r]\). Note that this log strategy only works in conjunction with the net strategies that are based on conflict sets, i.e., MFCS-k and Balanced, so that at least one to-be-aligned log trace is guaranteed to be included.
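
The difference from IC is a single quantifier: the trace's whole conflict set must be covered by \(A_r\). A minimal sketch, using the same toy representation as before:

```python
def sic(conflicts, A_r):
    """SIC: include a trace only if all its conflicts are covered by A_r."""
    return [i for i, c in conflicts.items() if c and set(c) <= A_r]

conflicts = {"s2": {"c", "i"}, "s3": {"c", "j"}}
print(sic(conflicts, {"c", "i"}))  # ['s2']; s3's conflict on 'j' is uncovered
```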

5 Experiment Setup

Both synthetic and real-life datasets are used to evaluate the proposed recomposition strategies. Dataset generation is performed using the PTandLogGenerator [8] with information from the empirical study [9]; it is reproducible as a RapidProM workflow [3]. The BPIC 2018 dataset [16] is used as the real-life dataset. Moreover, two baseline net recomposition strategies are used: All recomposes on all conflicting activities, and Random recomposes on a random number of conflicting activities. Similarly, a baseline log recomposition strategy All, which includes all to-be-aligned log traces, is used. For the sake of space, the full experimental setup and the datasets are available at the GitHub repository (see Footnote 1) so that the experimental results can be reproduced.

6 Results

The results shed light on two key insights: first, the choice of recomposition strategy can lead to very different performance; second, good performance requires both selecting appropriate conflicting activities to recompose on and including well-grouped to-be-aligned log traces.

Figure 5 presents the experimental results for both the synthetic and real-life datasets. For each of the synthetic models, there are three event logs with different noise profiles, described as netX-(noise probability)-(dispersion over trace) where \(X \in \{1, 2, 3\}\). For the sake of readability, we only show the results of three out of five synthetic datasets; the results are consistent across all five. Readers interested in more details are referred to the GitHub repository for a detailed explanation of the noise generation and the rest of the experimental results. For the MFCS-k and Balanced strategies, only configurations using the SIC log strategy are shown, since the results indicated that SIC provides better performance. For the other strategies, where SIC is not applicable, only configurations using the IC log strategy are shown, again because the results indicated better performance. Overall, the results show that, for both the monolithic and the recomposition approach, it is more difficult to compute alignment results for less fitting datasets.

Fig. 5. Bar chart showing fitness and overall time per net recomposition strategy (including the monolithic approach). The time limit is shown as a dashed red line and indicates infeasible replays. The best performing approaches and their time gains over the second fastest approaches are indicated by black arrows. (Color figure online)

Different Approaches Give Different Performances. Comparing the monolithic and recomposition approaches, it is clear that the recomposition approach performs better than its monolithic counterpart under at least one recomposition strategy configuration. Furthermore, performance can vary significantly across recomposition strategies. For example, the existing MFC strategy is the worst performing strategy: it is not able to give exact results for the real-life dataset or for the netX-10-60 and netX-60-10 noise scenarios of the synthetic datasets. The MFCS-k and Balanced strategies are shown to be the best performing strategies. For high-fitness scenarios, i.e., netX-10-10, MFCS-k gives better performance with a high \(k = 10\): when there is little noise, the computation becomes simply a “race” to align traces with similar merge conflicts. Conversely, for low-fitness scenarios, where merge conflicts are potentially much more difficult to resolve, the Balanced strategy avoids quickly creating large sub-components that take longer to replay. In these cases, the time differences between the feasible strategies can go up to three minutes. For all the experiments, the proposed recomposition strategies outperform the baseline strategies. Lastly, for the real-life dataset BPIC18, only the MFCS-1, Balanced, and MCG recomposition strategies are able to compute exact alignment results, and the Balanced strategy outperforms MFCS-1 by more than three minutes.

Both Net and Log Recomposition Strategies Matter. Figure 6 presents the number of aligned traces and the percentage of valid alignments per iteration under the All, IC, and SIC log strategies, with the net strategy fixed as Balanced, on BPIC18. We first note that only the SIC log strategy resulted in exact alignment results. While all strategies start by aligning all traces in the first iteration, there are significant differences in the number of aligned traces across iterations. Similar to the All strategy, the existing IC strategy includes a high number of traces to align throughout all iterations; the number of aligned traces only tapers off in the later iterations, when half of the traces have resulted in valid alignments. This confirms the hypothesis that the existing IC strategy can be too lenient in its inclusion of traces to align. Furthermore, up until iteration 13, none of the aligned traces reaches the merging condition necessary to yield a valid alignment; this means that both the All and IC strategies “waste” resources aligning many traces. Conversely, the SIC strategy keeps the number of traces to align per iteration comparatively low. Moreover, at the peak number of traces to align, at iteration 21, almost 80% of the \({\sim }300\) aligned traces resulted in valid alignments. This is likely to explain why only the SIC log strategy is able to compute an exact result.

Fig. 6. Comparing log strategies: the number of aligned traces (left) and the percentage of valid alignments (right) per iteration on the real-life dataset BPIC18

7 Related Work

The performance problems of alignment-based conformance checking are well known, and a large number of conformance checking techniques have been proposed to tackle them. Approximate alignments have been proposed to reduce the problem complexity by abstracting sequential information from segments of log traces [14]. The notion of indication relations has been used to reduce the model and log prior to conformance checking [15]. Several approaches have been proposed along the research line of decomposition techniques. These include different decomposition strategies, e.g., maximal [1] and SESE-based [12]. Moreover, different decomposed replay approaches, such as the hide-and-reduce replay [17] and the recomposition approach [11], have also been investigated. Compared to the existing work, this paper investigates different strategies for the recomposition approach in order to improve the overall performance in computation time.

Other than alignment-based approaches, there are also other conformance checking approaches. These include classical token replay [13], behavioral profile approaches [19], and, more recently, approaches based on event structures [7].

8 Conclusions and Future Work

This paper investigated the recomposition aspect of the recomposing conformance checking approach, which can become a bottleneck to the overall performance. By defining the recomposition problem, the paper identified the limitations of the current recomposition strategy: it does not fully resolve merge conflicts at the trace level, and it is too lenient in including log traces for the subsequent decomposed replay iteration. Based on these observations, three net recomposition strategies and one log recomposition strategy have been presented. The strategies were then evaluated on both synthetic and real-life datasets against two baseline approaches. The results show that different recomposition strategies can significantly impact the overall performance of computing alignments. Moreover, they show that the presented approaches perform better than the baseline approaches as well as both the existing recomposition and the monolithic approach. While simpler strategies tend to perform better on synthetic datasets, a more sophisticated strategy can perform better on a real-life dataset. In all cases, the results show that both the selection of activities to recompose on and the selection of log traces to include are important for achieving superior performance.

Future Work. The results have shown that the recomposition strategy has a significant impact on performance. We plan to extend the evaluation of the presented approaches to a larger variety of models, noise scenarios, initial decomposition strategies, and other real-life datasets. In the current and presented approaches, new net decompositions are created by recomposing the initial decomposition on selected activities. Entirely different net decompositions could be created using the merge conflict information from the previous iteration; however, our preliminary results showed that this may be difficult. Lastly, in the current framework, the same strategies (both decomposition and recomposition) are used in all iterations; higher-level meta-strategies might be useful. For example, it might be good to switch to the monolithic approach for the small number of log traces that remain unaligned after many iterations.