Enabling efficient process mining on large data sets: realizing an in-database process mining operator

Dijkman, Remco; Gao, Juntao; Syamsiyah, Alifah; van Dongen, Boudewijn; Grefen, Paul; ter Hofstede, Arthur

doi:10.1007/s10619-019-07270-1

Enabling efficient process mining on large data sets: realizing an in-database process mining operator

Open access
Published: 09 May 2019

Volume 38, pages 227–253, (2020)
Cite this article

Download PDF

You have full access to this open access article

Distributed and Parallel Databases Aims and scope Submit manuscript

Enabling efficient process mining on large data sets: realizing an in-database process mining operator

Download PDF

Remco Dijkman ORCID: orcid.org/0000-0003-4083-0036¹,
Juntao Gao²,
Alifah Syamsiyah¹,
Boudewijn van Dongen¹,
Paul Grefen¹ &
…
Arthur ter Hofstede³

5459 Accesses
13 Citations
Explore all metrics

Abstract

Process mining can be used to analyze business processes based on logs of their execution. These execution logs are often obtained by querying a database and storing the results in a file. The mining itself is then done on the file, such that the data processing power of the database cannot be used after the log is extracted. Enabling process mining directly on a database therefore provides additional flexibility and efficiency. To help facilitate this, this paper formally defines a database operator that extracts the ‘directly follows’ relation—one of the relations that is at the heart of process mining—from an operational database. It defines the operator using the well-known relational algebra and formally proves equivalence properties of the operator that are useful for query optimization. Subsequently, it presents time-complexity properties of the operator. Finally, it presents an implementation of the operator as part of the H2 DBMS and demonstrates that this implementation extracts the ‘directly follows’ relation from a database with an arbitrary database structure within a fraction of a second; several orders of magnitude faster than is currently possible.

Process Mining on Databases: Unearthing Historical Data from Redo Logs

DB-XES: Enabling Process Discovery in the Large

Celonis PQL: A Query Language for Process Mining

1 Introduction

Process mining can be used to analyze business processes, based on logs of their execution. These execution logs contain the activities that were performed in the business process, the customer cases for which they were performed, the time at which they were performed, and additional information about the resources involved and the properties of the customer case. Many different analysis techniques have been developed (e.g., [1,2,3]) that can help to analyze the causal and temporal relations between the activities, to analyze whether or not the process has been executed as it is supposed to be executed, and to analyze performance properties of the process.

Table 1 Database table Log that contains a log

Full size table

Table 1 shows a simple event log as it could be stored in a database. In practice, an event log contains thousands of rows (e.g., [4]) and may be distributed over multiple tables (such as e.g., in SAP [5]). Table 1 shows the activities that were performed in an organization, the (customer) case for which these activities were performed and the start and end time of the activities. Given such a table, it is important in process mining to be able to retrieve the ‘directly follows’ relation, because this is the basis for many process mining techniques, including the alpha algorithm [1] (and its variants), the inductive miner [6], the heuristic miner [2], and the fuzzy miner [3]. The directly follows relation retrieves the events that follow each other directly in some instance of process execution (i.e., case). The SQL query that retrieves this relation from relation R in Table 1 has two parts: one part that extracts all pairs of activities that follow each other either directly or indirectly and another, nested part that excludes activities that indirectly follow each other. Consequently, it is defined as follows:

The result of applying this query to Log in Table 1 is the table that contains the tuples (A, B), (B, E), (A, D), (D, E), (A, C), (C, E).

The query illustrates the challenges that arise when process mining directly on a database. First, it is inconvenient: even a conceptually simple process mining request like ‘retrieve all directly follows relations between events’ is relatively difficult to phrase in SQL. Second, it is inefficient, because it requires a nested query (i.e., the ‘NOT EXISTS’ part) and nested queries are known to cause performance problems [7, 8]. This is discussed in detail in Sect. 4.

Consequently, measures must be taken to make process mining feasible on relational databases. Figure 1 shows three possible strategies. Figure 1(i) shows a strategy that is frequently used in practice, in which a user constructs an SQL query to extract a log from the database, writes this log to disk, and then imports it in memory in a process mining tool for further processing. Consequently, the complete log must be read or written three times and in current practice there is manual work involved in getting the log to and from disk. It is easy to imagine a process mining tool that does not need intermediate storage to disk. Such a tool would only need to read the log once and would not require manual intervention to get the log to and from disk. However, it would require that the information is obtained from the database in the form of a log, which limits flexibility and performance in some cases as we will explain further on in the paper. Figure 1(ii) illustrates this strategy.

This paper proposes a third strategy, illustrated in Fig. 1(iii), in which the database management system supports a native ‘directly follows’ operator. This strategy combines the benefits of the first and the second approach, in that it does not require intermediate storage on disk, but does facilitates flexible and efficient querying. To realize this strategy, this paper defines the ‘directly follows’ operator in relational algebraic form. This definition can be used to implement the operator as part of a database management system, including query optimization measures . Finally, the paper presents an implementation of the operator as part of the H2 DBMS [9] as well as the effect that the operator has on the time performance of that system.

Against this background the remainder of this paper is structured as follows. Section 2 explains process mining and relational algebra as background for this paper. Section 3 presents a relational algebraic definition of the ‘directly follows’ operator. Section 4 shows the theoretical computational cost of executing this operator and the potential effects of query optimization with respect to this operator. Section 5 presents the implementation of the operator in the H2 DBMS and the time performance of that operator. Section 6 discusses alternative approaches for extracting the directly follows relation from a database. Finally, Sect. 7 presents related work and Sect. 8 the conclusions.

2 Background

This section presents background information that is necessary to understand the goal of this paper: developing a ‘directly follows’ relation for process mining on a database: process mining, relational algebra, and how relational algebra is used in a database system.

2.1 Process mining

Process mining is the extraction of a process from an event log, where a process defines possible sequences of actions. Process mining is most commonly used in the area of business processes. In this context, a process defines the tasks that must be performed to achieve a certain business goal, such as approving a mortgage or processing a sales order.

Figure 2 shows a simple example of a business process that concerns processing an application. After the process starts, the application must be registered. After that, there is a choice between approving the application, rejecting the application, or postponing a decision. Finally, the result of the decision is filed and the process ends. Each application that is processed has to go through the process and a choice must be made to either approve, reject, or postpone the application. Consequently, identifying the tasks by their letters, each application will either go through the sequence [A, B, E], the sequence [A, C, E], or the sequence [A, D, E]. Assuming that the result of each task that is performed is stored in a database along with a timestamp and an identifier for the application on which it was performed, each application will leave a trace in the database, like the traces that are shown in Table 1. The challenge of process mining is to reconstruct the business process when execution traces such as the ones in Table 1 are available, but the process itself is not.

There exist a variety of algorithms for process mining, and in many of them the ‘directly follows’ relation plays an important role [1,2,3, 6]. The ‘directly follows’ relation identifies pairs of events that ever follow each other directly (in time) in a log of execution traces. We will define the operator precisely in Sect. 3, but for now let $x>y$ represent that event y follows event x directly in some execution trace in the log, such that in Table 1 we have the directly follows relations: $A>B, B>E, A>D, D>E, A>C, C>E$. The directly follows relation is the basis for mining the relations that form the process model. For example, the Alpha algorithm [1] has the rule that, if there exist x, y, z in the log, such that $x>y$, $x>z$, and neither $y>z$ nor $z>y$, a process model must be constructed in such a way that x occurs first, after which there is a choice between y and z. This is the case in the log in Table 1 for the event A and the events B, C, D, which indeed leads to a task A followed by a choice between B, C, D as in Fig. 2. While the precise manner in which the ‘directly follows’ relation is used to construct a process model from a log differers for different algorithms, it is the basis for constructing the process model in the alpha algorithm [1] (and its variants), the inductive miner [6], the heuristic miner [2], and the fuzzy miner [3]. Consequently, for in-database process mining it is important that the directly follows relation can be extracted efficiently.

2.2 Relational algebra

In this section we briefly define the basic relational algebraic operators. We refer to the many textbooks on databases (e.g., [10]) for more detailed definitions.

Definition 1

(Attribute, schema, relation) An attribute is a combination of a name and a domain of possible values. A schema is a set of attributes. Any two elements $s_1, s_2$ of schema s with $s_1 \ne s_2$ have different names. A relation is a combination of a schema and a set of tuples, where a tuple maps each attribute name from the schema to a value from the domain of that attribute.

For example, let C be the domain of case identifiers, E be the domain of activities, and T be the time domain. The relation of Table 1 has the schema $\{case: C, activity: E, start\_time: T, end\_time: T\}$ and the set of tuples $\{\{case\mapsto 1, activity\mapsto A, start\_time\mapsto 00{:}20, end\_time\mapsto 00{:}22\}, \ldots \}$

In the remainder of this paper, we will also refer to R as the set of tuples of a relation R. For a relation R with a schema that defines an attribute with name a, we will use $e_a$ to refer to the value of attribute a in element $e\in R$.

Definition 2

(Relational algebra) Let R, S be relations with schemas r, s. Furthermore, let a, b be attribute names, and $\phi $ a condition over attributes that is constructed using the logic operators conjunction ($\wedge $), disjunction ($\vee $), and negation ($\lnot $), and the binary conditions ($>,\ge ,=,\ne ,\le ,<$) over attributes and attribute values. We define the usual relational algebra operators [10, 11]:

Selection$\sigma _\phi R = \{e | e \in R, \phi (e) \}$, where $\phi (e)$ is derived from $\phi $ by replacing each attribute name a by its value in tuple e: $e_a$. The schema of $\sigma _\phi R$ is r.
Projection$\pi _{a,b,\ldots } R = \{\{a\mapsto e_a,b\mapsto e_b,\ldots \} | e \in R \}$. The schema of $\pi _{a,b,\ldots } R$ is r restricted to the attributes with names $a, b, \ldots $
Renaming$\rho _{a/b} R = R$. The schema of $\rho _{a/b} R$ is derived from r by replacing the name of attribute a by b, while the set of tuples in R does not change. In the remainder of this paper, we will also use $\rho _x R$ to represent prefixing all attribute names of R with x.

In addition, we define the usual set theoretic operators $R \cup S$, $R \cap S$, $R - S$. These operators have the usual set-theoretic interpretation, but they require that R and S have the same schema. We define the Cartesian product of R with schema $r=\{r_1: R_1, r_2: R_2, \ldots , r_n:R_n\}$ and S with schema $s=\{s_1:S_1, s_2:S_2, \ldots , s_n:S_n\}$ as $R \times S = \{\{r_1\mapsto e_{r_1},r_2\mapsto e_{r_2},\ldots ,r_n\mapsto e_{r_n},s_1\mapsto u_{s_1}, s_2\mapsto u_{s_2},\ldots ,s_m\mapsto u_{s_m}\}|e \in R, u \in S\}$. The schema of $R \times S$ is $r \cup s$.

Finally, a join operator is usually defined for the commonly used operator of joining tuples from two relations that have some property in common. The join operator is a shorthand for a combination of Cartesian product and selection: $R \bowtie _\phi S = \sigma _\phi R \times S$. The theta join is a special case of the join operator, in which $\phi $ has the form $a \theta b$ and $\theta $ is a binary condition (>, $\ge $, $=$, $\ne $, $\le $, <) on the attributes a and b.

Table 2 shows examples of the selection, projection, and renaming operators, applied to the relation in Table 1.

Table 2 Example relational algebra expressions

Full size table

2.3 Query optimization

Relational algebra is at the core of every relational database system. It is used to define the execution semantics of an SQL query and to define equivalence rules to determine which execution semantics (among a set of equivalent ones) is the most efficient to execute. There is a large number of proven relational algebraic equivalences that can be used to rewrite relational algebraic equations into the equivalent that is most efficient to execute [10, 11]. In the remainder of this paper, we use the following ones. Let R, S be relations, a, b, c be attributes, x, y be attribute values, $\phi , \psi $ be conditions, and $\theta $ be a binary condition (>, $\ge $, $=$, $\ne $, $\le $, <). Then:

$$\begin{aligned} \sigma _{\phi \wedge \psi } R&= \sigma _{\phi } (\sigma _\psi R) \end{aligned}$$

(1)

$$\begin{aligned} \sigma _{\phi } (\sigma _\psi R)&= \sigma _{\psi } (\sigma _\phi R) \end{aligned}$$

(2)

$$\begin{aligned} R \bowtie _\phi S&= S \bowtie _\phi R \end{aligned}$$

(3)

$$\begin{aligned} (R \bowtie _\phi S) \bowtie _\psi T&= R \bowtie _\phi (S \bowtie _\psi T) \end{aligned}$$

(4)

$$\begin{aligned} \sigma _\psi (R \bowtie _\phi S)&= (\sigma _\psi R) \bowtie _\phi S \text {, if } \psi \text { only has attributes from } R \end{aligned}$$

(5)

$$\begin{aligned} \sigma _\psi (R - S)&= (\sigma _\psi R) - (\sigma _\psi S) \end{aligned}$$

(6)

$$\begin{aligned} \sigma _{a \theta x} (\rho _{b/a} R)&= \rho _{b/a} (\sigma _{b \theta x} R) \end{aligned}$$

(7)

$$\begin{aligned} \pi _a (\rho _{b/a} R)&= \rho _{b/a} (\pi _b R) \end{aligned}$$

(8)

$$\begin{aligned} \pi _{a,b,\ldots } (\sigma _{\phi } R)&= \sigma _{\phi } (\pi _{a,b,\ldots } R) \text {, if } \phi \text { only has attributes from } a, b, \ldots \end{aligned}$$

(9)

$$\begin{aligned} \pi _{a,b,\ldots } (R \bowtie _\phi S)&= (\pi _{a,\ldots } R) \bowtie _\phi (\pi _{b,\ldots } S)\text {, if } a, b, \ldots \text {can be split over } R, S \end{aligned}$$

(10)

$$\begin{aligned} \pi _{a,b} ( \pi _{b,c} R)&= \pi _b R \end{aligned}$$

(11)

$$\begin{aligned} \pi _{a,\ldots } ( \pi _{b,\ldots } R)&= \pi _{b,\ldots } ( \pi _{a,\ldots } R) \end{aligned}$$

(12)

$$\begin{aligned} \rho _{b/a}(R \bowtie _{b\theta c} S)&= (\rho _{b/a}R) \bowtie _{a\theta c} S \text {, if } a,b \text { only in } R \end{aligned}$$

(13)

$$\begin{aligned} \pi _{Rs} R&= R \text {, if } Rs \text { contains all attributes from } R \end{aligned}$$

(14)

$$\begin{aligned} \pi _{Rs} (R \bowtie _\phi S)&= R \text {, if } R \bowtie _\phi S \text { includes each tuple of } R \nonumber \\&\phantom { = R,\ } \text {and } Rs \text { contains all and only attributes from } R \end{aligned}$$

(15)

$$\begin{aligned} (R-T) \bowtie _{a \theta b} S&= R \bowtie _{a \theta b} S - T \bowtie _{a \theta b} S \end{aligned}$$

(16)

In practice these equivalences are used to generate alternative formulas that lead to the same result, but represent alternative execution strategies. For example, $\sigma _\psi (\sigma _\phi R \times \sigma _\theta S)$ can be proven to be equivalent to $\sigma _{\psi \wedge \phi \wedge \theta } (R \times S)$. However, $\sigma _\psi (\sigma _\phi R \times \sigma _\theta S)$ represents the execution strategy in which we first execute the selections and then the Cartesian product, while $\sigma _{\psi \wedge \phi \wedge \theta } (R \times S)$ represents the execution strategy where we first execute the Cartesian product and then the selection. The first execution strategy is more efficient than the second, because it only requires the Cartesian product to be computed for a subset of R and S.

3 Relational algebra for process mining

This section defines the ‘directly follows’ relation as a relational algebraic operator. It also presents and proves equivalences for this operator that can be used for query optimization, analogous to the equivalences that are presented in Sect. 2.3.

3.1 Directly follows operator

The directly follows operator retrieves events that directly follow each other in some case. Let Log be a database table that contains log events and that has a column c, which represents the case identifier of an event, a column t, which represents the completion timestamp of an event, and further arbitrary columns $a_1, \ldots $, which represent other event attributes. We define the directly follows operator, denoted $>_{c,t} Log$, as follows:

Definition 3

(Directly follows operator)

$$\begin{aligned} >_{c,t} Log =&\Big \{ \{{\downarrow }c \mapsto r_c, {\downarrow }t \mapsto r_t, {\downarrow }a_1 \mapsto r_{a_1}, \ldots , {\uparrow }c \mapsto s_c, {\uparrow }t \mapsto s_t, {\uparrow }a_1\mapsto s_{a_1}, \ldots \},\\&r \in Log, s \in Log, r_c = s_c, r_t< s_t,\\&\lnot \exists q \in Log: r_c = q_c \wedge r_t< q_t \wedge q_t < s_t\Big \} \end{aligned}$$

The definition renames existing attributes by prefixing them with $\downarrow $ and $\uparrow $. The resulting attributes refer to the event that precedes ($\downarrow $) the other event and the event that succeeds ($\uparrow $) the other event. Note that the prefixes $\downarrow $ and $\uparrow $ are arbitrary labels, we could also have used the labels ‘preceeding’ and ‘succeeding’, but prefer to use $\downarrow $ and $\uparrow $ for brevity. The directly follows operator takes all events r and s that have the same case identifier and that follow each other in time, and for which there is no event q that has the same case identifier and that happens in between r and s in time. For example, applying the operator to the example log from Table 1 (i.e., $>_{case, end\_time} Log$) returns Table 3.

Similar to the way in which the join operator is defined in terms of other relational algebra operators, we can also define the ‘directly follows’ operator in terms of the traditional relational algebra operators.

Table 3 Result of $>_{case, end\_time} Log$

Full size table

Proposition 17

(Composite definition of the directly follows operator)

$$\begin{aligned} >_{c,t} Log =&\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log \\&- \pi _{As} ((\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \bowtie _{\downarrow t< t \wedge t < \uparrow t \wedge \downarrow c = c} Log) \end{aligned}$$

where As is the set of attributes that are in $\rho _{\downarrow } Log$ or $\rho _{\uparrow } Log$.

Proof

The correctness of this definition follows trivially from the definition of the weakly follows operator (Definition 3) and the definitions of the relational algebra operators (Definition 2). $\square $

The directly follows operator can both be used in an algorithm for process mining that is based on it (or on ‘footprints’ which are derived from it [12]) and for querying for process-mining related information. Some example queries include:

The activities in which the amount of a loan is changed:
$\pi _{\uparrow activity} \sigma _{\uparrow amount \ne \downarrow amount} >_{case, end\_time} Log$
This query first extracts the directly follows relation from a log. The resulting table contains a directly follows relation with, among others, columns for the preceeding ($\downarrow activity$) and the succeeding ($\uparrow activity$) activity, as well as the loan amount that is associated with the preceeding ($\downarrow amount$) and the succeeding ($\uparrow amount$) activity. Selecting the rows in which the succeeding amount differs from the preceeding amount identifies the activities that change the amount. Subsequently, we can project the activities for which this is the case.
The resources that ever changed the amount of a loan:
$\pi _{\uparrow resource} \sigma _{\uparrow amount \ne \downarrow amount} >_{case, end\_time} Log$
This is a alternative to the previous query, in which we project the resource that changes the amount rather than the activity in which the amount is changed.
The two activities that precede a rejection:
$\sigma _{\uparrow \uparrow activity = reject} (>_{\uparrow case, \uparrow end\_time} >_{case, end\_time} Log)$
This query first extracts the directly follows relation as usual, it then applies the directly follows relation again to the results, such that we do not only get the activities that succeed ($\uparrow activity$) another activity, but also the activities that succeed ($\uparrow \uparrow activity$) that activity.

3.2 Directly follows query optimization

To facilitate query optimization for the directly follows operator, we must define how it behaves with respect to the other operators and prove that behavior. We present this behavior as propositions along with their proofs. In each of these propositions, we use a, b, c, t as attributes (where—as convention—we use c to denote the case identifier attribute and t to denote the time attribute), $\theta $ as a binary operator from the set $\{>,\ge ,=,\ne ,\le ,<\}$, and x as a value.

The first proposition states that the directly follows operator and the selection operator commute, i.e., you can change the order in which they are executed. It holds for case attributes and event attributes, but not for other types of attributes. We define case attributes as attributes that keep the same value for all events in a case, from the moment that they get a value. An example of a case attribute is an attribute ‘customer name’ that gets a value when ‘register customer’ occurs and keeps that value for the remainder of the case. We define event attributes as attributes that have a value for at most one event in each case. An example of an event attribute is an attribute ‘amount’ that may have a value for activities labeled ‘determine if fine must be paid’. As this proposition only holds for case and event attributes, we can only use it for optimizing queries that involve a selection on a case or event attribute. Selections on attributes that may have different values for different activities with the same case identifier cannot be optimized.

Proposition 18

(directly follows and selection commute) $>_{c,t} \sigma _{a \theta x} Log = \sigma _{\downarrow a \theta x \wedge \uparrow a \theta x} >_{c,t} Log $, if a is a case or event attribute.

Proof

$$\begin{aligned}&>_{c,t} \sigma _{a \theta x} Log\\&= \text { (Proposition}~17)\\&\quad \rho _{\downarrow }(\sigma _{a \theta x} Log) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }(\sigma _{a \theta x} Log) \\&\quad - \pi _{As} \big (((\rho _{\downarrow } (\sigma _{a \theta x} Log) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }(\sigma _{a \theta x} Log)) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} \sigma _{a \theta x} Log) \big ) \\&= \text { (Proposition}~7)\\&\quad \sigma _{\downarrow a \theta x} (\rho _{\downarrow } Log) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \sigma _{\uparrow a \theta x}(\rho _{\uparrow } Log) \\&\quad - \pi _{As} \big (((\sigma _{\downarrow a \theta x}(\rho _{\downarrow } Log) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \sigma _{\uparrow a \theta x}(\rho _{\uparrow } Log)) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} \\ {}&\sigma _{a \theta x} Log) \big )\\&= \text { (Propositions}~1,~3,~5)\\&\quad \sigma _{\downarrow a \theta x \wedge \uparrow a \theta x} (\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \\&\quad - \pi _{As} \big ( \sigma _{\downarrow a \theta x \wedge \uparrow a \theta x \wedge a \theta x} ((\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} Log) \big )\\&= (\text {assume } {\downarrow }a \theta x \wedge {\uparrow }a \theta x \Rightarrow a \theta x)\\&\quad \sigma _{\downarrow a \theta x \wedge \uparrow a \theta x} (\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \\&\quad - \pi _{As} \big ( \sigma _{\downarrow a \theta x \wedge \uparrow a \theta x} ((\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} Log) \big ) \\&= \text { (Proposition}~9)\\&\quad \sigma _{\downarrow a \theta x \wedge \uparrow a \theta x} (\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \\&\quad - \sigma _{\downarrow a \theta x \wedge \uparrow a \theta x} \big ( \pi _{As} ((\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} Log) \big ) \\&= \text { (Proposition}~6)\\&\quad \sigma _{\downarrow a \theta x \wedge \uparrow a \theta x} \big ( \rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log \\&\quad - \pi _{As} ((\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \bowtie _{\downarrow t< t \wedge t < \uparrow t \wedge \downarrow c = c} Log) \big ) \\&= \text { (Proposition}~17) \\&\quad \sigma _{\downarrow a \theta x \wedge \uparrow a \theta x} >_{c,t} Log \end{aligned}$$

$\square $

Note that Proposition 9 requires that the condition only contains attributes that are also projected (in this case $\downarrow a, \uparrow a$ must be in As). This condition is satisfied as per Proposition 17. Also note that the proof uses an assumption, which states that if any two events in a case have the same value for an attribute, all events for that case that are between these two (in time) must also have that value (${\downarrow }a \theta x \wedge {\uparrow }a \theta x \Rightarrow a \theta x$). This assumption holds for case attributes and for event attributes, which are the scope of this proposition. As a consequence, the proposition can only be used to optimize a query on such attributes, but not on other attributes, such as the resource attribute.

The next proposition is a variant of the previous one, in which there is a selection condition on two attributes instead of an attribute and a value.

Proposition 19

(directly follows and selection commute 2) $>_{c,t} \sigma _{a \theta b} Log = \sigma _{\downarrow a \theta \downarrow b \wedge \downarrow a \theta \uparrow b \wedge \uparrow a \theta \downarrow b \wedge \uparrow a \theta \uparrow b} >_{c,t} Log $, if a, b are case or event attributes.

Proof

Analogous to the proof of Proposition 18$\square $

To prove that directly follows and projection commute, we first need to prove that projection and set minus commute, because the set minus operator is an important part of the definition of the directly follows operator. However, for the general case it is not true that projection and set minus commute. A counter example is easily constructed. Let $R = \{\{a\mapsto 1,b\mapsto 2\}\}$ and $S = \{\{a\mapsto 1,b\mapsto 3\}\}$. For these relations it does not hold that $\pi _{a} (R - S) = (\pi _{a} R) - (\pi _{a} S)$. However, we can prove this proposition for the special case that S is a subset of R and a uniquely identifies tuples in R. Since these conditions are satisfied for the directly follows operator, it is sufficient to prove the proposition under these conditions.

Proposition 20

(projection and restricted set minus commute) $\pi _{a} (R - S) = (\pi _{a} R) - (\pi _{a} S)$, if $S \subseteq R$ and a uniquely identifies each tuple in R.

Proof

This equivalence is proven by observing that $S \subset R$ implies that a non-surjective injective function $f:S\rightarrow R$ exists that matches each tuple s in S to a unique tuple r in R. The fact that a uniquely identifies tuples in R (and also in S, because S is a subset of R) implies that f is completely determined by the values of tuples in a, i.e., the values of attributes other than a have no consequence for f. Therefore, projecting R and S onto a does not change the tuple mapping defined by f.

Now, looking at the left side of Proposition 20, calculating the projection over the difference, means removing the attributes not in a from the selected tuples in R that are not in the range of f. Looking at the right side, calculating the difference over the projections, means removing the attributes not in a from R and S (which does not affect f) and then selecting the tuples in R that are not in the range of f. These two are equivalent. $\square $

Proposition 21

(directly follows and restricted projection commute) $>_{c,t} \pi _{c,t,a} Log = \pi _{\downarrow c,\downarrow t,\downarrow a, \uparrow c, \uparrow t, \uparrow a} >_{c,t} Log $

Proof

$$\begin{aligned}&>_{c,t} \pi _{c,t,a} Log\\&= \text { (Proposition}~17)\\&\quad \rho _{\downarrow }(\pi _{c,t,a} Log) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }(\pi _{c,t,a} Log) \\&\quad - \pi _{As} ((\rho _{\downarrow } (\pi _{c,t,a} Log) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }(\pi _{c,t,a} Log)) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} \pi _{c,t,a} Log)\\&= \text { (Proposition}~8)\\&\quad \pi _{\downarrow c,\downarrow t,\downarrow a} (\rho _{\downarrow } Log) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \pi _{\uparrow c,\uparrow t,\uparrow a} (\rho _{\uparrow } Log) \\&\quad - \pi _{As} ((\pi _{\downarrow c,\downarrow t,\downarrow a} (\rho _{\downarrow } Log) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \pi _{\uparrow c,\uparrow t,\uparrow a} (\rho _{\uparrow } Log)) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} \pi _{c,t,a} Log)\\&= \text { (Proposition}~10)\\&\quad \pi _{\downarrow c,\downarrow t,\downarrow a,\uparrow c,\uparrow t,\uparrow a} (\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \\&\quad - \pi _{As} \big (\pi _{\downarrow c,\downarrow t,\downarrow a,\uparrow c,\uparrow t,\uparrow a,c,t,a}((\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} Log) \big ) \\&= \text { (Propositions}~11,~12)\\&\quad \pi _{\downarrow c,\downarrow t,\downarrow a,\uparrow c,\uparrow t,\uparrow a} (\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \\&\quad - \pi _{\downarrow c,\downarrow t,\downarrow a,\uparrow c,\uparrow t,\uparrow a} \big ( \pi _{As} ((\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} Log) \big ) \\&= \text { (Proposition}~20)\\&\quad \pi _{\downarrow c,\downarrow t,\downarrow a,\uparrow c,\uparrow t,\uparrow a} \big ( (\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \\&\quad - \pi _{As} ((\rho _{\downarrow } Log \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow } Log) \bowtie _{\downarrow t< t \wedge t < \uparrow t \wedge \downarrow c = c} Log) \big )\\&= \text { (Proposition}~17)\\&\quad \pi _{\downarrow c,\downarrow t,\downarrow a, \uparrow c, \uparrow t, \uparrow a} >_{c,t} Log \end{aligned}$$

$\square $

The next proposition states that the directly follows relation and the theta join (see Definition 2) commute. This proposition relies on the fact that the directly follows operator duplicates each attribute a into an attribute ${\uparrow }a$ and an attribute ${\downarrow }a$. Table 3 illustrates this. However, if the case, activity and start time attribute uniquely identify an event, then there is no need to duplicate the end time attribute or any other attribute, and the duplicate attributes just take up space. This redundancy can easily be fixed later on with a projection operator and in future work additional efficiency may be achieved by avoiding this redundancy altogether.

Proposition 22

(directly follows and theta join commute) $>_{c,t} (R \bowtie _{a \theta b} S) = (>_{c,t} R) \bowtie _{\downarrow a \theta b} S \bowtie _{\uparrow a \theta b} S $, if each tuple from R is combined with a tuple in S.

Proof

$$\begin{aligned}&>_{c,t} (R \bowtie _{a \theta b} S)\\&= \text { (Proposition}~17)\\&\qquad \rho _{\downarrow }(R \bowtie _{a \theta b} S) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }(R \bowtie _{a \theta b} S) \\&\qquad - \pi _{As} ((\rho _{\downarrow } (R \bowtie _{a \theta b} S) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }(R \bowtie _{a \theta b} S)) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} \\ {}&\quad (R \bowtie _{a \theta b} S))\\ \\&= \text { (Proposition}~13)\\&\quad (\rho _{\downarrow }R) \bowtie _{\downarrow a \theta b} S \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} (\rho _{\uparrow }R) \bowtie _{\uparrow a \theta b} S \\&\quad - \pi _{As} (((\rho _{\downarrow }R) \bowtie _{\downarrow a \theta b} S) \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} (\rho _{\uparrow }R) \bowtie _{\uparrow a \theta b} S) \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} \\ {}&\quad (R \bowtie _{a \theta b} S))\\&= \text { (Propositions}~3,~4,~10,~14,~15)\\&\quad \rho _{\downarrow }R \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }R \bowtie _{\downarrow a \theta b} S \bowtie _{\uparrow a \theta b} S \\&\quad - \pi _{As} (\rho _{\downarrow }R \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }R \bowtie _{\downarrow t< t \wedge t< \uparrow t \wedge \downarrow c = c} R) \bowtie _{\downarrow a \theta b} S \bowtie _{\uparrow a \theta b} S\\&= \text { (Proposition}~16)\\&\quad \big (\rho _{\downarrow }R \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }R\\&\quad - \pi _{As} (\rho _{\downarrow }R \bowtie _{\downarrow t< \uparrow t \wedge \downarrow c = \uparrow c} \rho _{\uparrow }R \bowtie _{\downarrow t< t \wedge t < \uparrow t \wedge \downarrow c = c} R)\big ) \bowtie _{\downarrow a \theta b} S \bowtie _{\uparrow a \theta b} S\\&= \text { (Proposition}~17)\\&\quad (>_{c,t} R) \bowtie _{\downarrow a \theta b} S \bowtie _{\uparrow a \theta b} S \end{aligned}$$

$\square $

The properties that are proven in this way help to define different queries with equivalent results, but different execution costs, such that the most efficient query among a set of equivalent queries can be established.

4 Theoretical execution cost

We determine the computational cost of executing the directly follows operation and compare the costs for the different execution strategies that we identified in the introduction. We also determine the effect of query optimization on the directly follows operator.

4.1 Cost of computing the directly follows relation

The execution cost of a database operation is typically defined in terms of the number of disk blocks read or written, because reading from or writing to disk are the most expensive database operations. In line with the strategies for process mining on a database that are presented in Fig. 1, Table 4 shows four execution strategies with their costs. Note that the ‘with database operator’ strategy from Fig. 1 is split up into two alternatives. The first alternative requires a database management system that contains an implementation of a dedicated directly follows operator. The second alternative can be applied in any database management system and uses a nested query to extract the directly follows relation. The execution cost is presented as an order of magnitude, measured in terms of the number of disk blocks B that must be read or written. The number of disk blocks is linear with the number of events in the log and depends on the number of bytes needed to store an event and the number of bytes per disk block. These measures assume that the complete log fits into memory. If the log does not fit into memory, the measures would worsen. However, the best operators in terms of execution cost (classical process mining with database connection and the native database operator proposed here) would remain the best and the worst operator (the composite database operator) would remain the worst.

Table 4 Execution costs of process mining on a database

Full size table

Process mining with intermediate storage requires that the log is read and written three times: once to query the database for the log, once to store the log to disk, and once to load the log in the process mining tool. Consequently, the complexity is $3 \cdot B$. Process mining directly on a database requires that the log is read only once. Subsequent processing can be done in memory.

While process mining with a database connection can be fast, in many usage scenarios more flexible querying capabilities are needed, which can benefit from access to all SQL operators. In particular, this is the case if the process mining log must be aggregated from multiple database tables, or if—rather than the event log itself—process mining related information must be extracted.

For such usage scenarios, the ‘directly follows’ relation must be extracted directly from the database. Such an operator reads the log from disk once and computes the directly follows relation in memory. Consequently, it has linear cost. For databases that do not have the native ‘directly follows’ operator proposed in this paper, the operator can be emulated using the composite formula from Proposition 17. The drawback of this formula is that it requires that the intermediate results from both sides of the minus operator are stored, after which the minus operator can be applied. While this is no problem as long as the intermediate results fit into memory, the costs become prohibitive once the intermediate results must be written to disk. We will show the precise practical implications of this problem in Sect. 5.

On a theoretical level, the problem is illustrated in Fig. 3. This figure shows that the problem arises when the number of events in the log is high, relative to the number of cases. The mechanism that causes this is easy to derive from Proposition 17, which shows that the intermediate results that must be stored are the pairs of events that directly or indirectly follow each other in some case (the left side and right side of the minus operator). Consequently, if there are many events per case, this number is high (cubic over the number of events per case in the right-hand side of the minus operator).

The precise calculation can be performed as follows. Let V be the number of cases in the log, N be the number of events, F be the block size (i.e., the number of tuples/events that fit in a single disk block), $B_{Log} = \frac{N}{F}$ be the number of disk blocks required to store the log, and M be the total memory size in blocks. Note that the cost of a block nested join (or minus) operator on two relations R and S that take $B_R$ and $B_S$ disk blocks (with $B_R \le B_S$), is equal to $B_R + B_S$ when one of the two relations fits in memory, and equal to $B_R + \frac{B_S}{M} \cdot B_R$ otherwise [13]. The cost is split up into five components:

1.
The cost of the first join is denoted as $B_{join_1}$. This equals $B_{Log}$ if the log fits into memory and $B_{Log} + \frac{B_{Log}}{M} \cdot B_{Log}$ otherwise. Note that this join appears twice, but that it only needs to be computed once.
2.
The cost of storing the results of the first join to disk is denoted as $B_{result_1}$. This equals 0 if the result fits in memory. Otherwise, the number of tuples in the result, which we denote as $|t_1|$, equals the number of pairs of events that directly or indirectly follow each other in some case: $V \cdot \left( \frac{N}{V} \cdot \frac{N}{V}-1\right) /2$ on average. This fits into $\frac{|t_1|\cdot 2}{F}$ disk blocks (times 2 because each tuple in the result is a pair of tuples from the original).
3.
The cost of the second join is denoted as $B_{join_2}$. This equals 0 if the original log fits into memory. Otherwise, the cost equals $B_{Log} + \frac{|t_1|\cdot 2}{F}/M \cdot B_{Log}$.
4.
The cost of storing the result of the second join to disk is denoted as $B_{join_2}$. This equals 0 if the result fits into memory. Otherwise, the number of tuples in the result, which we denote as $|t_2|$, equals the number of pairs of events that indirectly follow each other. This equals the number of pairs of events $|t_1|$ that directly of indirectly follow each other minus the number of pairs of events that directly follow each other: $V\cdot \left( \frac{N}{V} -1\right) $ on average. This fits into $\frac{|t_2|\cdot 2}{F}$ disk blocks (times 2 because each tuple in the result is a triple of tuples from the original and then reduced to a pair by projection).
5.
The cost of the minus operator is denoted as $B_{minus}$. This equals 0 if the result of the second join fits into memory. Otherwise, it equals $B_{result_1} + \frac{B_{result_1}}{M} \cdot B_{result_2}$.

To generate Fig. 3 we used a tuple size of 80 bytes, a 4 GB buffer size, and a block size of 50, such that there is a total memory size of 1 million blocks. The figure shows two ‘thresholds’ in the computational cost. These thresholds are crossed when a particular intermediate result no longer fits into memory.

The order of the cost can be determined more easily. The order of the cost is determined by the cost of the set minus, because this incorporates both intermediate results, which are typically much larger than the original log. Therefore, the order of the cost of computing the intermediate results are $\frac{N^2}{V}/F$. The total order of cost is then obtained by filling these costs out in the right-hand side of the formula for computing the cost of the set minus, which yields: $\left( \frac{N^2}{V}/F/M\right) \cdot \frac{N^2}{V}/F$. If we let M be large enough to contain the log itself, but not the intermediate results (i.e., we set $M=\frac{N}{F}$), this can be simplified as: $\frac{N^3}{V^2}/F$.

Summarizing, the execution cost of retrieving a directly follows relation directly from a database can be as low as retrieving it from a process mining tool, if the database supports the native ‘directly follows’ operator proposed in this paper and the process mining tool supports on-database process mining. However, as long as a native ‘directly follows’ operator does not exist, the execution costs increase to third order polynomial cost if the average number of events per case is high (i.e., if intermediate results do not fit into memory anymore).

4.2 The effect of query optimization

An advantage of in-database process mining is that it enables query optimization. Query optimization, using the rewrite rules that are defined in Sect. 3.2 can greatly reduce the cost of executing a query. As an example, we show the cost of executing the query $>_{c,t} \sigma _{a \theta x} Log$ and the equivalent query $\sigma _{\downarrow a \theta x \wedge \uparrow a \theta x} >_{c,t} Log$. These costs decrease at least linearly with the fraction of events that match the selection criteria. Let Q be that fraction. Table 5 shows the different execution situations that can arise. It is possible to either first derive the directly follows relation and then do the selection, or vice versa. It is also possible that the intermediate results fit in memory, or that they must be stored on disk. If the results fit in memory (and the table is indexed with respect to the variable on which the selection is done), then the execution costs are simply the cost of reading the log, or the part that matches the selection criteria, into memory once. If the intermediate results do not fit into memory, the order of the execution cost is $\frac{N^3}{V^2}/F$ as explained in the previous section. Remembering that $B = \frac{N}{F}$ leads to the formulas that are shown in the table.

Table 5 Execution cost orders of different execution sequences

Full size table

The most dramatic increase occurs in the situation where, if the selection is done first, the intermediate results fit into memory, while, if the selection is done last, the intermediate results do not fit into memory. In practice this is likely to be the case, because the selection can greatly reduce the number of events that are considered. For example, for a log with N = 10,000 events over $V = 500$ cases, with a block size of $F = 50$ and a selection fraction $Q = 0.10$, the order of the cost increases from 20 to $8 \times 10^4$ according to the formulas from Table 5. The actual computed costs increase (the same order of magnitude) from 21 (plus one, because we need to read one disk block to load the index that is used to optimize the selection) to $9.5 \times 10^4$ using the formulas from the previous section.

This shows that the way in which a query that includes the directly follows operator is executed greatly influences the execution cost. Query optimizers, which are parameterized with equivalence relations from Sect. 3.2 and the cost calculation functions from Sect. 4.1, can automatically determine the optimal execution strategy for a particular query.

5 Practical execution time performance

We implemented the native ‘directly follows’ operator into the H2 Database Management System [9]. The implementation is publicly available through GitHub,^{Footnote 1} as are the process mining logs that we used to evaluate the implementation and that are referenced in the table. We also created a plugin for the ProM process mining tool [14] that enables process mining directly on an H2 Database. This implementation is publicly available through the ProM packaging system and as open source.^{Footnote 2}

As process mining logs, we used the logs from the BPI Challenges of the past years, except for the BPI 2016 challenge, which contained click-data rather than process data. For some BPI Challenges, multiple logs were made available, in which case we took the largest log. We implemented the operator as a function on a table, such that for the table R that contains at least case identifiers, activity labels, and completion timestamps in the first three columns, FOLLOWS(R) returns the directly follows relation of the table.

Subsequently, we evaluated the implementation of the native operator, by storing the logs from practice into an H2 database and extracting the directly follows relation, both using the implementation of the native operator and using the composite operator in the form of the following SQL query (which is the same as the one outlined in Sect. 1).

All evaluations were run on a server with two Intel Xeon processors running at 2.4 GHz, 64Gb of memory and a database cache size of 8GB. Clearly, the time performance will differ a lot depending on the hardware that is used. However, we mainly want to use the experiments to illustrate the practical implications of the time complexity that is specified in Table 4 and the time complexity will not differ depending on the hardware that is used.

Table 6 Time performance of deriving a directly follows relation

Full size table

Table 6 shows the results of the comparison. The table shows some descriptive statistics of the logs and the time performance of the different approaches that are illustrated in Fig. 1, of which the theoretical performance is specified in Table 4. Note that for classical process mining with intermediate storage, the time for storing the log on disk and retrieving it from disk (i.e., the ‘extract’ and ‘load’ steps from Fig. 1) is not included. We left that time out, because it involves manual actions, such as clicking menu buttons and choosing a file name, which will clearly be the dominant factor in the time performance and are irrelevant from the perspective of computational performance. If a fully automated means of extracting the log from the database is required, using classical mining with a database connection or one of the two database operators makes more sense.

As expected based on Table 4, the performance of the first three operators (excluding disk storage for classical process mining) has the same order or magnitude. An analysis on artificial data shows similar results [20]. While these results might suggest that for that reason the choice of technique for extracting the directly follows relation does not matter, this is not true. It is indeed true that the performance of the three techniques does not differ. However, the three techniques are suitable for very different applications:

Classical mining with intermediate storage is most suitable for once-off process mining, in which the data that must be extracted from the database has been carefully selected beforehand, because the extraction of the data from the database is labor intensive and often requires administrator privileges.
Classical mining with a database connection is most suitable for continuous process mining, in which a dedicated database table exists that contains the event log and that is continuously being updated. It cannot be used for ad-hoc process mining from a database, because it only works when a dedicated database table with an event log exists. Examples of such an approach are the DB-XES approach [21] and the approach that is proposed by the professional process mining tool Celonis [22].
Process mining with a native directly follows operator is most suitable for exploratory process mining, in which no dedicated table exists that contains the event log, or in which data from different tables must be combined before process mining takes place. It allows for fast ad-hoc process mining even when a database does not contain a dedicated table with an event log. An example of such an approach is the Event Cube approach [23].

As also expected based on Table 4, the performance of the composite operator is much worse than that of the other operators. It takes two to three orders of magnitude longer to compute the directly follows relation for the composite operator. In the worst case the computation time goes from milliseconds to over an hour. Moreover, the performance of the composite operator is heavily affected by the number of cases and events in the log, while the performance of the other operators is hardly affected by the number of cases and events. This result is in line with the theoretical complexity of the composite operator that is presented in Sect. 4.1, where we theorized that the order of the execution cost is polynomial over the fraction of events per case.

Note that Table 6 only shows the performance of extracting the directly follows relation from the database. It does not include the time that it takes to mine the process model based on the directly follows relation, because once the directly follows relation is derived, mining a process model from it, is the same for each of the approaches. For completeness, Table 7 shows the time performance of mining the process model from the directly follows relation for the different approaches. The table also presents the complexity of the mined models in terms of their nodes and edges. It shows that the time performance is indeed the same for the different approaches. We attribute the minor differences in time performance to noise due to, for example, background processor activity.

Table 7 Time performance of mining from a directly follows relation

Full size table

In an experiment, we explored the effect of the theoretical computational complexity on the computation time in more detail. To do so, we took the log of the BPI 2011 challenge [4] and varied the number of events per case to measure the effect on the computation time. We varied the number of events per case by randomly removing event types until a desired number of events per case is reached, and also by artificially adding event types. Event types were artificially added by taking an existing event type and copying its properties in terms of the probability that that event type follows and is followed by another event type. Figure 4 shows the result of the experiment. It shows the computational time required to extract the weakly follows relation from an event log, required both by the native and the composite operator for a different number of events per case. The figure also shows trend lines. For the native operator the trend line is linear and for the composite operator the trend line is polynomial. Due to the way in which the event logs were generated, it was hard to generate event logs with between 50 and 100 events per case. Regardless, the experiment clearly shows that the computational time of the composite operator is polynomial over the number of events per case, which is in line with the theoretical computational cost.

Summarizing, the native operator returns query results within a fraction of a second. As expected, it performs several orders of magnitude better than the composite operator. Moreover, unlike the composite operator, the native operator is hardly affected by properties of the log for current industry-strength logs. This makes the native operator uniquely suited for interactive in-database exploration of an event log.

6 Alternatives for querying the directly follows relation

On a high-level, we identified three different strategies to doing process mining on a database, as illustrated in Sect. 1 (Fig. 1).

These strategies can be compared along a number of criteria:

1.
Computational performance;
2.
Flexibility with which a log can be constructed for mining;
3.
Flexibility with which the database can be queried for process mining related information; and
4.
Effort needed to adapt a DBMS.

Where, by ‘flexibility’ we mean the ease with which queries can be constructed and adapted, for extracting a process mining log (point 2) or for extracting other information (point 3).

All three proposed strategies behave similarly with respect to the first two criteria: the computational performance is linear over the number of events in the log, and an SQL query can be defined that constructs a log file from (multiple tables) in a database. This SQL query can be adapted at will and running a process mining tool on the resulting log should be easy, especially if the process mining tool has a direct database connection and constructs the log file internally.

Consequently, the difference between the strategies primarily becomes a trade-off between the desire to ask process mining-related queries directly on the database and the associated need to adapt the DBMS. Queries that involve the directly follows operator are difficult to specify and perform badly when using the composite operator, while they are easy to specify and perform well when using the native operator. However, being able to use the native operator requires that such an operator is implemented in the DBMS, which clearly is a time-consuming task, especially with respect to the implementation of the query optimization rules that are discussed in Sect. 3.

To avoid the problems of the composite operator, one might be tempted to define a query without a ‘NOT EXISTS’ part, which exploits a time-order of the events to derive the weakly follows relation. For example, assuming that the database management system can return the row number of each tuple, a relation can be ordered by case identifier and time, and subsequently the following SQL query would return the ‘directly follows’ relation for event logs in which no events occur at the same time.

However, such a query fails if two events have the same timestamp. For example, for the log (1, A, 0 : 01), (1, B, 0 : 02), (1, C, 0 : 02), (1, D, 0 : 03), this query would not work, because the query would return (A, B), (B, C), (C, D), while it should return (A, B), (A, C), (B, D), (C, D). What makes matters worse, is that the result would be non-deterministic, since the result that would be returned, depends on the particular order of events that have the same timestamp, which is not specified. This problem may be more or less common in a log, depending on the time-granularity at which events are logged, which—in practice—is often (too) coarse. For example, it may happen that only the date of the event is logged and not the time. This problem is also known as the ‘timestamp challenge’ [24].

Table 8 shows that this problem clearly exists for the BPI logs. The experiments were run on a computer with an Intel i7 processor at 4.2 GHz with 16 GB of RAM, an SSD drive and a database cache size of 8 GB. The table shows the computational performance of the native operator that is proposed in this paper and the operator that works on the ‘sorted’ log. The performance is comparable for the two operators. The table also shows the number of directly follows relations that is returned by the operator for the ‘sorted’ log and the number of directly follows relations that exists in the log. This clearly shows that the operator for the ‘sorted’ log misses many directly follows relations that do exist in the log.

Table 8 Performance of the ‘sorted’ directly follows operator

Full size table

7 Related work

By defining an operator for efficiently extracting the ‘directly follows’ relation between events from a database, this paper has its basis in a long history of papers that focus on optimizing database operations. In particular, it is related to papers that focus on optimizing database operations for data mining purposes [25, 26], of which SAP HANA [27] is a recent development. The idea of proposing domain-specific database operators has also been applied in other domains, such as spatio-temporal databases [28] and scientific databases [29].

In related work on DB-XES [21], which extends the previous Relational XES [30], we developed a standard data warehouse structure to store process logs for efficient process mining. DB-XES also provides dedicated insert operations that insert pre-processed data in data warehouse tables for each inserted log event. This pre-processed data can be queried efficiently, which provides performance benefits that are similar to the performance benefits of the technique described in this paper [31]. However, the technique described in this paper allows for querying any database structure in a computationally fast manner, while the DB-XES data warehouse structures are fixed. In this respect, the functionality that is described in this paper is complementary to that of DB-XES, and a DB-XES data warehouse can even be filled using techniques described in this paper.

The Celonis process mining tool, which is being used in practice, has a similar approach as DB-XES [22]. It creates a data warehouse in SAP HANA that is being kept up-to-date with the operational database and on which process mining is facilitated. Like DB-XES this approach will facilitate fast process mining on the pre-defined data warehouse structure. Also like DB-XES, to facilitate computationally fast process mining on any database structure, additional facilities, such as the one proposed in this paper, are needed, and the functionality described in this paper can be used to fill the Celonis data warehouse.

By presenting a ‘directly follows’ operator, the primary goal of this paper is to support computationally efficient process mining on a database. There exist some papers that deal with the computational complexity of the process mining algorithms themselves [32, 33]. Also, in a research agenda for process mining the computational complexity and memory usage of process mining algorithms have been identified as important topics [34]. However, this paper focuses on a step that precedes the process mining itself: flexibly querying a database to investigate which information is useful for process mining.

More database-related work from the area of process mining comes from shaping data warehouses specifically for process mining [35, 36]. There also exists work that focuses on the extraction of logs from a database [5, 37], including conceptual problems that arise when lifting the lower-level data and events from a database to a process level [38, 39]. Also, several query languages have been developed specifically for databases that contain business processes and related information [40,41,42] as well as dedicated querying facilities for other forms of process mining [43,44,45].

8 Conclusions

This paper presents a first step towards in-database process mining. In particular, it defines a relational algebraic operator to extract the ‘directly follows’ relation from a log that is stored in a relational database, possibly distributed over multiple tables. The paper presents and proves relational algebraic properties of this operator, in particular that the operator commutes with the selection, projection, and theta join. These equivalence relations can be used for query optimization. Also, the paper presents and proves formulas to estimate the computational cost of the operator. These formulas can be used in combination with the equivalence relations to determine the most efficient execution strategy for a query.

The operator is primarily meant to facilitate exploratory process mining in the style of online analytical processing. For these purposes, it has the clear benefit that it facilitates efficient process mining, even on databases in which the ‘event log’ or ‘directly follows’ relation is not stored trivially in a single database table, but must be constructed with a complex SQL query.

The operator was implemented as part of the H2 Database Management System, as a function on a table that can be part of an SQL query. An evaluation with logs from practice shows that the operator returns the ‘directly follows’ relation of a log within a fraction of a second. If the ‘directly follows’ relation is extracted from the log using an SQL query with only standard SQL, the query takes several orders of magnitude longer, from a minute to even over an hour, depending on properties of the log. Consequently, the native implementation of the operator makes a database system uniquely suited for in-database process mining.

At this moment, query optimization measures are not implemented for the operator. This remains for future work. Also, the query optimization equations have some limitations, in particular with respect to the conditions under which they hold. These limitations restrict the possibilities for query optimization. The extent to which these theoretical limitations impact practical performance of the operator must be investigated in future work and, if possible, mitigated.

The work on in-database process mining can be further extended, by defining and implementing additional operators. Currently, it is hard to work with ‘cases’ in an SQL query, because a log table has a row for each event that happens rather than each case that completes. For that reason, we aim to develop operators that can easily aggregate properties from an event level to a case level. For example, operators for easily extracting variants of execution sequences could be envisioned, as well as operators for extracting properties of resource involvement and time characteristics of cases. In addition, we aim to extend the algebraic optimization rules, in particular because we expect that other process mining operators, such as strictly follows, independence, and exclusion - even though they can be defined in terms of the directly follows operator - can be better optimized if they get their own optimization rules. We also aim to do various case studies in which we explore the process-mining related queries that process analysts tend to define, such that we can develop support for those queries and end up with a feature-complete Process Mining DBMS. In parallel, we are working on the DB-XES approach to in-database process mining. This approach keeps separate (cache) tables for process mining-related queries. When the process mining query is requested, this requires a once-off effort to fill the separate tables. From that moment on, these tables are kept up-to-date with specialized insert statements.

Notes

References

van der Aalst, W., Weijters, A., Maruster, L.: Workflow mining: discovering process models from event logs. IEEE Trans. Knowl. Data Eng. 16(9), 1128–1142 (2004)
Article Google Scholar
Weijters, A., van der Aalst, W., De Medeiros, A.: Process mining with the heuristics miner-algorithm. Technische Universiteit Eindhoven, Technical Report WP 166, 1–34 (2006)
Günther, C., van der Aalst, W.: Fuzzy mining: adaptive process simplification based on multi-perspective metrics. In: Proceedings of BPM, pp. 328–343 (2007)
Google Scholar
van Dongen, B.: Real-life event logs—hospital log (2011). https://doi.org/10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54
Ingvaldsen, J.E., Gulla, J.A.: Preprocessing support for large scale process mining of sap transactions. In: Proceedings of BPM, pp. 30–41 (2007)
Chapter Google Scholar
Leemans, S.J.J., Fahland, D., van der Aalst, W.M.P.: Discovering block-structured process models from event logs: a constructive approach. In: Colom, J.M., Desel, J. (eds.) Proceeding of ATPN, pp. 311–329. Springer, New York (2013)
Google Scholar
Kim, W.: On optimizing an SQL-like nested query. ACM Trans. Database Syst. 7(3), 443–469 (1982)
Article MathSciNet Google Scholar
Chaudhuri, S.: An overview of query optimization in relational systems. In: Proceedings of PODS, pp. 34–43. New York (1998)
H2: H2 Database Engine (2017). http://h2database.com. Accessed 31 Aug 2017
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases: The Logical Level. Addison-Wesley, Boston (1995)
Google Scholar
Sagiv, Y., Yannakakis, M.: Equivalence among relational expressions with the union and difference operation. In: Proceedings of VLDB, pp. 535–548 (1978)
van der Aalst, W.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, New York (2011)
Book Google Scholar
Blasgen, M.W., Eswaran, K.P.: Storage and access in relational data bases. IBM Syst. J. 16(4), 363–377 (1977)
Article Google Scholar
van Dongen, B., de Medeiros, A., Verbeek, E., Weijters, T., van der Aalst, W.: The prom framework: a new era in process mining tool support. In: Proceedings of ATPN, pp. 444–454. Springer, New York (2005)
van Dongen, B.: BPI challenge 2012 (2012). https://doi.org/10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f
Steeman, W.: BPI challenge 2013, incidents (2013). https://doi.org/10.4121/500573e6-accc-4b0c-9576-aa5468b10cee
van Dongen, B.: BPI challenge 2014: activity log for incidents (2014). https://doi.org/10.4121/uuid:86977bac-f874-49cf-8337-80f26bf5d2ef
van Dongen, B.: BPI challenge 2015 municipality 5 (2015). https://doi.org/10.4121/uuid:b32c6fe5-f212-4286-9774-58dd53511cf8
van Dongen, B.: BPI challenge 2017—offer log (2017). https://doi.org/10.4121/uuid:7e326e7e-8b93-4701-8860-71213edf0fbe
Syamsiyah, A., van Dongen, B.F., Dijkman, R.M.: A native operator for process discovery. In: Proceedings of DEXA, pp. 292–300. Springer, New York (2018)
Google Scholar
Syamsiyah, A., van Dongen, B., van der Aalst, W.: Accelerating process mining using relational databases. In: Proceedings of SIMPDA, pp. 137–141 (2016)
Celonis: SAP Process Mining by Celonis. https://www.sap.com/developer/showcases/process-mining-by-celonis.html (2019). Accessed 22 Feb 2019
Ribeiro, J., Weijters, T.: Event cube: another perspective on business processes. In: Proceedings of OTM, pp. 274–283. Springer, New York (2011)
Chapter Google Scholar
van der Aalst, W., et al.: Process mining manifesto. In: Proceedings of BPM Workshops, pp. 169–194 (2012)
Chen, M.S., Han, J., Yu, P.S.: Data mining: an overview from a database perspective. IEEE Trans. Knowl. Data Eng. 8(6), 866–883 (1996)
Article Google Scholar
Agrawal, R., Imielinski, T., Swami, A.: Database mining: a performance perspective. IEEE Trans. Knowl. Data Eng. 5(6), 914–925 (1993)
Article Google Scholar
Färber, F., Cha, S.K., Primsch, J., Bornhövd, C., Sigg, S., Lehner, W.: SAP HANA database: data management for modern business applications. SIGMOD Rec. 40(4), 45–51 (2012)
Article Google Scholar
Abraham, T., Roddick, J.F.: Survey of spatio-temporal databases. GeoInformatica 3(1), 61–99 (1999)
Article Google Scholar
Cudre-Mauroux, P., Kimura, H., Lim, K.T., Rogers, J., Simakov, R., Soroush, E., Velikhov, P., Wang, D.L., Balazinska, M., Becla, J., DeWitt, D., Heath, B., Maier, D., Madden, S., Patel, J., Stonebraker, M., Zdonik, S.: A demonstration of SciDB: a science-oriented DBMS. Proc. VLDB Endow. 2(2), 1534–1537 (2009)
Article Google Scholar
van Dongen, B.F., Shabani, S.: Relational XES: data management for process mining. In: Proceedings of the CAiSE Forum, pp. 169–176 (2015)
Syamsiyah, A., van Dongen, B.F., van der Aalst, W.M.P.: DB-XES: enabling process discovery in the large. In: Data-Driven Process Discovery and Analysis, pp. 53–77. Springer, New Yrok (2018)
Google Scholar
Maggi, F.M., Bose, R.P.J.C., van der Aalst, W.M.P.: Efficient discovery of understandable declarative process models from event logs. In: Proceedings of CAiSE, pp. 270–285 (2012)
Google Scholar
Bergenthum, R., Desel, J., Lorenz, R., Mauser, S.: Process mining based on regions of languages. In: Proceedings of BPM, pp. 375–383 (2007)
van der Aalst, W., Weijters, A.: Process mining: a research agenda. Comput. Ind. 53(3), 231–244 (2004)
Article Google Scholar
Eder, J., Olivotto, G.E., Gruber, W.: A data warehouse for workflow logs. In: Proceedings of EDCIS, pp. 1–15 (2002)
Google Scholar
zur Mühlen, M.: Process-driven management information systems—combining data warehouses and workflow technology. In: Proceedings of ICECR, pp. 550–566 (2001)
González López de Murillas, E., Reijers, H.A., van der Aalst, W.M.P.: Connecting databases with process mining: a meta model and toolset. In: Proceedings of BPMDS 2016, pp. 231–249 (2016)
Chirigati, F., Freire, J.: Towards integrating workflow and database provenance. In: Proceedings of IPAW, pp. 11–23. Springer, New York (2012)
Google Scholar
van der Aalst, W.: Extracting event data from databases to unleash process mining. In: BPM—Driving Innovation in a Digital World, pp. 105–128 (2015)
González López de Murillas, E., Reijers, H.A., van der Aalst, W.M.P.: Process mining on databases: unearthing historical data from redo logs. In: Proceedings of BPM, pp. 296–309 (2017)
Momotko, M., Subieta, K.: Process query language: a way to make workflow processes more flexible. In: Proceedings of ADBIS, pp. 306–321 (2004)
Chapter Google Scholar
Beheshti, S.M.R., Benatallah, B., Motahari-Nezhad, H.R., Sakr, S.: A query language for analyzing business processes execution. In: Proceedings of BPM, pp. 281–297 (2011)
Google Scholar
Schönig, S., Rogge-Solti, A., Cabanillas, C., Jablonski, S., Mendling, J.: Efficient and customisable declarative process mining with SQL. In: Proceedings of CAiSE, pp. 290–305. Springer, New York (2016)
Chapter Google Scholar
Schönig, S., Di Ciccio, C., Maggi, F.M., Mendling, J.: Discovery of multi-perspective declarative process models. In: Proceeding of ICSOC, pp. 87–103. Springer, New York (2016)
Chapter Google Scholar
Syamsiyah, A., van Dongen, B.F., van der Aalst, W.M.P.: Discovering social networks instantly: moving process mining computations to the database and data entry time. In: Proceedings of BPMDS, pp. 51–67. Springer, New York (2017)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Eindhoven University of Technology, Eindhoven, The Netherlands
Remco Dijkman, Alifah Syamsiyah, Boudewijn van Dongen & Paul Grefen
Northeast Petroleum University, Daqing, China
Juntao Gao
Queensland University of Technology, Brisbane, Australia
Arthur ter Hofstede

Authors

Remco Dijkman
View author publications
You can also search for this author in PubMed Google Scholar
Juntao Gao
View author publications
You can also search for this author in PubMed Google Scholar
Alifah Syamsiyah
View author publications
You can also search for this author in PubMed Google Scholar
Boudewijn van Dongen
View author publications
You can also search for this author in PubMed Google Scholar
Paul Grefen
View author publications
You can also search for this author in PubMed Google Scholar
Arthur ter Hofstede
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Remco Dijkman.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Cite this article

Dijkman, R., Gao, J., Syamsiyah, A. et al. Enabling efficient process mining on large data sets: realizing an in-database process mining operator. Distrib Parallel Databases 38, 227–253 (2020). https://doi.org/10.1007/s10619-019-07270-1

Download citation

Published: 09 May 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10619-019-07270-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Enabling efficient process mining on large data sets: realizing an in-database process mining operator

Abstract

Similar content being viewed by others

Process Mining on Databases: Unearthing Historical Data from Redo Logs

DB-XES: Enabling Process Discovery in the Large

Celonis PQL: A Query Language for Process Mining

1 Introduction

2 Background

2.1 Process mining

2.2 Relational algebra

Definition 1

Definition 2

2.3 Query optimization

3 Relational algebra for process mining

3.1 Directly follows operator

Definition 3

Proposition 17

Proof

3.2 Directly follows query optimization

Proposition 18

Proof

Proposition 19

Proof

Proposition 20

Proof

Proposition 21

Proof

Proposition 22

Proof

4 Theoretical execution cost

4.1 Cost of computing the directly follows relation

4.2 The effect of query optimization

5 Practical execution time performance

6 Alternatives for querying the directly follows relation

7 Related work

8 Conclusions

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation