1 Introduction

Processes are executed by human actors and automated resources performing work on the cases of the process. For example, multiple employees of a bank jointly check a credit application, create (one or more) loan offers, and contact the client for additional information, to finally decline the application or prepare a contract. Each case evolves by executing actions according to the process’ control-flow [1]. Human actors (or resources) performing the actions often structure their work further by performing multiple actions on the same case before handing the case to the next actor, e.g., creating and sending two loan offers to the same client; such a larger unit of work is called a task [13, 19]. Routines research thereby investigates which patterns arise when actors jointly structure and divide work in a process into (recurring) tasks [10].

Task execution patterns can be identified from process event data when using graph-based data models. We can jointly model the synchronization of classical traces of all process cases and the traces of all actors working across all cases in an event knowledge graph [8]. Any sub-graph of this graph where an actor follows multiple events in a case corresponds to an execution of some task [13], as we recall in Sect. 2. Sub-graphs on real-life event logs can be identified through querying [14], e.g., 98% of the BPIC’17 [6] events are part of a larger task execution. But the structure of how task execution sub-graphs are related has not been described.

A key operation for describing structures in event data is aggregation. As the model of event knowledge graphs is novel, only limited aggregation operations have been proposed: they aggregate either events to actions [7] or task execution sub-graphs to higher-level events [13]. We show in Sect. 3 that understanding tasks in a process requires (R1) aggregating sets of similar higher-level events to suitable constructs while preserving their behavioral context, (R2) aggregating events underlying higher-level events to study variations among actions, and (R3) parameters for filtering and for controlling the aggregation level in either aggregation.

We then propose in Sect. 4 two new parameterized aggregation operations, formalized as queries over event knowledge graphs, that address (R1–R3) and demonstrate in Sect. 5 their effectiveness for summarizing task executions of real-life event data in new kinds of global and local process models [4]. We compare our results to related work in Sect. 6 and conclude in Sect. 7.

2 Preliminaries

A process-aware system can record an action execution as an event in an event log. We require that each event records at least the action that occurred, the time of occurrence and at least two different entity identifiers of entities involved in the event: a data object or case in which the event occurred, and the resource (or actor) executing the action. An event can also record additional attributes describing the event further.

Event Knowledge Graphs. A classical event log orders all events by sequential traces according to a single entity identifier (also called case id). In contrast, an event knowledge graph (EKG) orders events wrt. multiple different entity identifiers [8]. EKGs are based on labeled property graphs (LPG), a graph-based data model supported by graph DB systems [3] that describes concepts as nodes and various relationships between them as edges. In an LPG \(G = (X,Y,\varLambda ,\#)\), each node \(x \in X\) and each relationship \(y \in Y\) with edge \(\overrightarrow{y}=(x,x')\) from x to \(x'\) has a label \(\ell \in \varLambda \), denoted \(x \in \ell \) or \(y \in \ell \), that describes the concept represented by x or y. \(\#_a(x) = v\) and \(\#_a(y) = v\) denote that property a of x or y has value v; we use \(x.a=v\) and \(y.a=v\) as short-hand.

In an EKG, each event and each entity (i.e., each data object or resource) is represented by a node with label Event and Entity, respectively. Each node \(e \in \textit{Event}\) defines e.action and e.time; each node \(n \in \textit{Entity}\) defines n.type and n.id. While EKGs allow modeling arbitrarily many entity types, we subsequently restrict ourselves to EKGs with two entity types: case (any data object or a classical case identifier) and resource (the actors working in the process). Figure 1 shows an example graph: each square node is an Event node; each circle is an Entity node of the corresponding type (blue for case, red for resource). An EKG has the following relationship labels:

Fig. 1. Event knowledge graph.

  • corr (correlation): \(y \in \textit{corr}, \overrightarrow{y}=(e,n)\) iff event \(e \in Event\) is correlated to entity \(n \in Entity\); we write \((e,n) \in \textit{corr}\) as short-hand.

  • df (directly-follows): \(y \in \textit{df}, \overrightarrow{y}=(e,e')\) iff events \(e,e'\) are correlated to the same entity n, i.e., \((e,n),(e',n) \in \textit{corr}\), \(e.time < e'.time\) and there is no other event \((e'',n) \in \textit{corr}\) with \(e.time< e''.time < e'.time\); we write \((e,e')^{n.type} \in \textit{df}\) as short-hand, i.e., \((e,e')^c\) for entity type case and \((e,e')^r\) for resource.

In Fig. 1, corr relationships are shown as dashed edges, e.g., e1, e2, e3, e4, e5 are correlated to case c3 and e3, e4, e9, e10 are correlated to resource a5. df-relationships are shown as solid edges. The df-relationships between the events correlated to the same entity form a df-path for that entity; the graph in Fig. 1 has 2 df-paths for case entities, e.g., \(\sigma _{c3} = \langle (e1,e2)^c, (e2,e3)^c, (e3,e4)^c, (e4,e5)^c \rangle \) and 3 df-paths for resource entities, e.g., \(\sigma _{a5} = \langle (e3,e4)^r, (e4,e9)^r, (e9,e10)^r\rangle \). See [7] for details of how to create an EKG G from classical event data sources through graph DB queries.
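As an illustration of the df definition, the following minimal Python sketch derives df-edges from a flat list of events. The toy data is only loosely modeled on Fig. 1: the case identifier c4, the resource identifiers a1 and a2, the action labels, and the integer timestamps are assumptions made for this sketch, and the function name directly_follows is hypothetical. It is not the graph DB query of [7].

```python
from collections import defaultdict

# (event id, action, time, case, resource); values not stated in the text are assumed.
EVENTS = [
    ("e1", "A", 1, "c3", "a1"), ("e2", "B", 2, "c3", "a1"),
    ("e3", "E", 3, "c3", "a5"), ("e4", "F", 4, "c3", "a5"),
    ("e5", "C", 5, "c3", "a2"),
    ("e6", "A", 6, "c4", "a1"), ("e7", "B", 7, "c4", "a1"),
    ("e8", "D", 8, "c4", "a1"),
    ("e9", "E", 9, "c4", "a5"), ("e10", "F", 10, "c4", "a5"),
]


def directly_follows(events):
    """Return df-edges (e, e', entity type) between consecutive events of each entity."""
    per_entity = defaultdict(list)
    for eid, _action, time, case, resource in events:
        # corr: every event is correlated to one case and one resource entity
        per_entity[("case", case)].append((time, eid))
        per_entity[("resource", resource)].append((time, eid))
    df = set()
    for (etype, _entity_id), evs in per_entity.items():
        evs.sort()  # assumes distinct timestamps per entity, as in the toy data
        for (_, e), (_, e_next) in zip(evs, evs[1:]):
            df.add((e, e_next, etype))
    return df


DF = directly_follows(EVENTS)
```

On this toy data, the resource df-path of a5 consists of (e3,e4), (e4,e9) and (e9,e10), matching the path \(\sigma _{a5}\) given in the text.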

Task Instance Sub-graphs. Where individual events record the execution of an atomic action, a resource often performs multiple subsequent actions in the same case. This is called a task [19]. A task execution materializes in an EKG as a sub-graph, where a case and a resource df-path synchronize for several subsequent events [13]. While a variety of such task sub-graphs can be characterized [13], we here recall the simplest one: a sub-graph of events \(\{e_1,...,e_k\}\) and adjacent df-edges that contains (1) exactly one (part of a) case df-path \(\langle \ldots (e_1,e_2)^c,\ldots ,(e_{k-1},e_{k})^c \ldots \rangle \) for a case c and (2) exactly one (part of an) actor df-path \(\langle \ldots (e_1,e_2)^r,\ldots ,(e_{k-1},e_{k})^r \ldots \rangle \) for an actor r, i.e., both paths synchronize over the same subsequent events. In Fig. 1, sub-graphs of events that meet these criteria are \(\{e_1,e_2\}\), \(\{e_3,e_4\}\), \(\{e_5\}\), \(\{e_6,e_7,e_8\}\) and \(\{e_9,e_{10}\}\). Each such sub-graph ti describes one task instance. These sub-graphs \(\{G_1,\ldots ,G_k\} = \textit{TI}(G)\) can be queried from G by (1) aggregating any two parallel df-edges \((e,e')^c\) and \((e,e')^r\) into a “joint” edge \((e,e')\) with label df-joint and (2) then querying for maximal df-joint paths; see [13] for details.
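The two query steps can be sketched in plain Python as follows. This is a minimal in-memory illustration under assumptions, not the query of [13]: the toy events are loosely modeled on Fig. 1 (case c4, resources a1 and a2, action labels and timestamps are assumed), and the helper names df_pairs and task_instances are hypothetical.

```python
from collections import defaultdict

# (event id, action, time, case, resource); values not stated in the text are assumed.
EVENTS = [
    ("e1", "A", 1, "c3", "a1"), ("e2", "B", 2, "c3", "a1"),
    ("e3", "E", 3, "c3", "a5"), ("e4", "F", 4, "c3", "a5"),
    ("e5", "C", 5, "c3", "a2"),
    ("e6", "A", 6, "c4", "a1"), ("e7", "B", 7, "c4", "a1"),
    ("e8", "D", 8, "c4", "a1"),
    ("e9", "E", 9, "c4", "a5"), ("e10", "F", 10, "c4", "a5"),
]


def df_pairs(events, key_index):
    """df-edges (e, e') for the entity stored at tuple position key_index."""
    groups = defaultdict(list)
    for ev in events:
        groups[ev[key_index]].append((ev[2], ev[0]))  # (time, event id)
    pairs = set()
    for evs in groups.values():
        evs.sort()
        pairs.update((a[1], b[1]) for a, b in zip(evs, evs[1:]))
    return pairs


def task_instances(events):
    # Step (1): a df-joint edge exists where a case and a resource df-edge are parallel.
    joint = df_pairs(events, 3) & df_pairs(events, 4)
    successor = dict(joint)  # each event has at most one df-joint successor
    has_predecessor = {b for _a, b in joint}
    # Step (2): follow maximal df-joint paths from events without a joint predecessor.
    instances = []
    for ev in sorted(events, key=lambda ev: ev[2]):
        eid = ev[0]
        if eid in has_predecessor:
            continue
        path = [eid]
        while path[-1] in successor:
            path.append(successor[path[-1]])
        instances.append(path)
    return instances


TI = task_instances(EVENTS)
```

On the toy data this yields the five event sets listed above, with e5 forming a task instance of a single event.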

3 Existing Aggregation Queries and Requirements

We first review existing aggregation operations on EKGs and present them systematically as three different types of aggregation queries. We then analyze their properties and shortcomings for summarizing task instances in (large) EKGs.

Node Aggregation. The first basic aggregation query \(\textit{Agg}_\textit{nodes}(a,X',\ell ,\ell ')\) on an EKG \(G = (X,Y,\varLambda ,\#)\) proposed in [7] aggregates nodes \(X' \subseteq X\) by property a into concept \(\ell \) as follows: (1) query all values \(V = \{ x.a \mid x \in X' \}\), (2) for each value \(v \in V\) add a new node \(x_v \in \ell \) to G with label \(\ell \) and set \(x_v.id = v\), \(x_v.type=a\), (3) for each \(x \in X'\), add new relationship \(y \in \ell '\) with label \(\ell '\) from x to \(x_v\), \(\overrightarrow{y}=(x,x_v)\).

For example, applying \(\textit{Agg}_\textit{nodes}(\textit{action},\textit{Event},\textit{Class},\textit{observed})\) on the graph in Fig. 1 creates one new event Class node for each value of the Event nodes’ action property, i.e., nodes \(cl1,\ldots ,cl6\) shown in Fig. 2, and links each event to the event class that was observed when the event occurred.
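The three steps of \(\textit{Agg}_\textit{nodes}\) can be illustrated on a small dictionary-based graph. This is a sketch under assumptions: the action labels of the events are invented for illustration, the subset \(X'\) is selected by node label for brevity, and the names agg_nodes, nodes and rels are hypothetical rather than taken from [7].

```python
# Toy Event nodes; the action values are illustrative assumptions.
ACTIONS = {"e1": "A", "e2": "B", "e3": "E", "e4": "F", "e5": "C",
           "e6": "A", "e7": "B", "e8": "D", "e9": "E", "e10": "F"}

nodes = {e: {"label": "Event", "action": a} for e, a in ACTIONS.items()}
rels = []  # relationships as (source node, relationship label, target node)


def agg_nodes(nodes, rels, prop, source_label, new_label, rel_label):
    """Aggregate all nodes with source_label by property prop into new_label nodes."""
    selected = [n for n, p in nodes.items() if p["label"] == source_label]
    # (1) query all values of the property
    values = sorted({nodes[n][prop] for n in selected})
    # (2) add one new node per value
    for v in values:
        nodes[(new_label, v)] = {"label": new_label, "id": v, "type": prop}
    # (3) link every aggregated node to the node of its value
    for n in selected:
        rels.append((n, rel_label, (new_label, nodes[n][prop])))
    return values


classes = agg_nodes(nodes, rels, "action", "Event", "Class", "observed")
```

With six distinct action values, the sketch creates six Class nodes and ten observed relationships, one per event.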

Fig. 2. Aggregation of the EKG of Fig. 1 by action into Class nodes (top), and by task instance sub-graphs into TaskInstance nodes (bottom).

Event Sub-graph Aggregation. The query \(\textit{Agg}_\textit{sub}(\mathcal {G},\ell ,\ell ')\) proposed in [13] aggregates given sub-graphs \(\mathcal {G} = \{G_1,\ldots ,G_k\}\) over Event nodes of G into high-level events with label \(\ell \) as follows: (1) the sub-graphs \(\mathcal {G}\) have been obtained by a previous query, e.g., \(\mathcal {G} = TI(G)\), see Sect. 2, (2) for each \(G' \in \mathcal {G}\), create a new high-level event node \(h_{G'} \in \ell \) with label \(\ell \) and set \(h_{G'}.time_{start} = \min \{ e.time \mid e \in G'\}\) and \(h_{G'}.time_{end} = \max \{ e.time \mid e \in G'\}\), and (3) for each \(e \in G'\) add new relationship \(y \in \ell '\) with label \(\ell '\) from \(h_{G'}\) to e, \(\overrightarrow{y}=(h_{G'},e)\). Although \(\ell \ne Event \), we interpret each new node \(h_{G'}\) as a high-level event with duration as it has a start and an end timestamp.

For example, applying \(\textit{Agg}_\textit{sub}(\textit{TI}(G),\textit{TaskInstance},\textit{contains})\) on the graph in Fig. 1 materializes five task instance sub-graphs as TaskInstance high-level event nodes \(h1,\ldots ,h5\) shown in Fig. 2, and links each event to the TaskInstance in which it is contained.
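A minimal Python sketch of \(\textit{Agg}_\textit{sub}\) is given below. It assumes the task instance sub-graphs are already available as lists of event ids and uses assumed integer timestamps; the function name agg_sub and the node ids h1 to h5 generated by enumeration are illustrative choices, not the Cypher implementation of [13].

```python
# Assumed integer timestamps for events e1..e10.
TIMES = {f"e{i}": i for i in range(1, 11)}

# Task instance sub-graphs of the running example, given as event sets.
TI = [["e1", "e2"], ["e3", "e4"], ["e5"], ["e6", "e7", "e8"], ["e9", "e10"]]


def agg_sub(subgraphs, times, new_label, rel_label):
    """Create one high-level event node per sub-graph and link it to its events."""
    nodes, rels = {}, []
    for i, events in enumerate(subgraphs, start=1):
        h = f"h{i}"
        nodes[h] = {
            "label": new_label,
            "time_start": min(times[e] for e in events),
            "time_end": max(times[e] for e in events),
        }
        rels.extend((h, rel_label, e) for e in events)
    return nodes, rels


h_nodes, contains = agg_sub(TI, TIMES, "TaskInstance", "contains")
```

Each resulting node carries a start and an end timestamp, which is what allows interpreting it as a high-level event with duration.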

Directly-Follows Aggregation. The query \(\textit{Agg}_{df}(\textit{t},\ell ,\ell ')\) proposed in [7] aggregates (or lifts) df-relationships between Event nodes for a particular entity type t to \(\ell \) nodes along the \(\ell '\) relationships as follows: (1) for any two nodes \(x,x' \in \ell \) query the set \(df_{x,x'}^{t}\) of all df-edges \((e,e')^{t} \in \textit{df}\) where events \(e,e' \in \textit{Event}\) are related to \(x,x'\) via \(y,y'\in \ell ', \overrightarrow{y}=(x,e),\overrightarrow{y'}=(x',e')\), (2) if \(df_{x,x'}^{t} \ne \emptyset \) create a new df-relationship \(y^* \in \textit{df}, \overrightarrow{y^*} = (x,x'), y^*.type=t\) and set \(y^*.count = |df_{x,x'}^{t}|\). The variant \(\textit{Agg}_{df}^{\mathord {\ne }}(\textit{t},\ell ,\ell ')\) of the above query that requires \(x \ne x'\) was proposed in [13].

For example, first aggregating events to Class nodes (as explained above), and then applying \(\textit{Agg}_{df}(t,\textit{Class},\textit{observed})\) for \(t\in \{\textit{Case},\textit{Resource}\}\) on the graph in Fig. 1 results in the df-edges between \(cl1,\ldots ,cl6\) shown in Fig. 2. For instance, \((cl2,cl1)^r\) originates from \((e2,e6)^r\) while \((cl1,cl2)^c\) originates from \((e1,e2)^c\) and \((e6,e7)^c\). Likewise, aggregating to TaskInstance nodes and then applying \(\textit{Agg}_{df}^{\mathord {\ne }}(t,\textit{TaskInstance},\textit{contains})\) for \(t\in \{\textit{Case},\textit{Resource}\}\) results in the df-edges between \(h1,\ldots ,h5\) shown in Fig. 2. For instance, \((h1,h4)^r\) originates from \((e2,e6)^r\).
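The lifting of df-edges can be sketched in Python as a counting step over event-level df-edges. The sketch below is an illustration under assumptions: the toy events (case c4, resources a1 and a2, action labels, timestamps) are assumed, Class nodes are identified by their action value, and the names df_edges and agg_df as well as the skip_self flag (standing in for the \(\ne \) variant) are hypothetical.

```python
from collections import Counter, defaultdict

# (event id, action, time, case, resource); values not stated in the text are assumed.
EVENTS = [
    ("e1", "A", 1, "c3", "a1"), ("e2", "B", 2, "c3", "a1"),
    ("e3", "E", 3, "c3", "a5"), ("e4", "F", 4, "c3", "a5"),
    ("e5", "C", 5, "c3", "a2"),
    ("e6", "A", 6, "c4", "a1"), ("e7", "B", 7, "c4", "a1"),
    ("e8", "D", 8, "c4", "a1"),
    ("e9", "E", 9, "c4", "a5"), ("e10", "F", 10, "c4", "a5"),
]


def df_edges(events):
    """Event-level df-edges as (e, e', entity type)."""
    groups = defaultdict(list)
    for eid, _action, time, case, resource in events:
        groups[("case", case)].append((time, eid))
        groups[("resource", resource)].append((time, eid))
    return [(a[1], b[1], etype)
            for (etype, _), evs in groups.items()
            for a, b in zip(sorted(evs), sorted(evs)[1:])]


def agg_df(df, member_of, etype, skip_self=False):
    """Lift df-edges of one entity type to aggregate nodes and count them."""
    counts = Counter()
    for e, e_next, t in df:
        if t != etype:
            continue
        x, x_next = member_of[e], member_of[e_next]
        if skip_self and x == x_next:  # variant that requires x != x'
            continue
        counts[(x, x_next)] += 1
    return counts


DF = df_edges(EVENTS)
observed = {ev[0]: ev[1] for ev in EVENTS}  # event -> Class (by action)
contains = {"e1": "h1", "e2": "h1", "e3": "h2", "e4": "h2", "e5": "h3",
            "e6": "h4", "e7": "h4", "e8": "h4", "e9": "h5", "e10": "h5"}

class_case = agg_df(DF, observed, "case")
class_res = agg_df(DF, observed, "resource")
ti_res = agg_df(DF, contains, "resource", skip_self=True)
```

In this toy data, the case edge from the class of action A to the class of action B has count 2 (from (e1,e2) and (e6,e7)), and the resource edge from h1 to h4 has count 1 (from (e2,e6)), in line with the examples given above.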

Extensions. These basic aggregation queries can be extended for specific use cases. For instance, every task instance sub-graph is essentially a path \(e_1,\ldots ,e_k\) over event nodes. Aggregation into a TaskInstance node \(h_{ti}\) then allows setting the property \(h_{ti}.name = e_1.action,\ldots ,e_k.action\) [13] describing the sequence of actions executed in the task, as shown in Fig. 2. All these queries are implemented as Cypher queries over the graph DB system Neo4j [14].

Properties, Shortcomings, and Requirements. \(\textit{Agg}_\textit{nodes}\) together with \(\textit{Agg}_\textit{df}\) constructs directly-follows graphs where edges distinguish between multiple types of entities [8], i.e., nodes and edges are on the level of actions. \(\textit{Agg}_\textit{sub}\) together with \(\textit{Agg}_\textit{df}^{\mathord {\ne }}\) constructs a “higher level” event graph, i.e., nodes and edges are on the level of sets of events but not on the level of actions.

Applying the aggregations in this way does not suffice to adequately summarize the process “as a whole” for analyzing task instances and tasks within a (larger) process. On the one hand, task instances themselves are similar to events as they describe the specific execution of a task, i.e., multiple actions in a single case by a single resource. On the other hand, task instances are not a hierarchical abstraction of the events wrt. actions: multiple different task instances overlap in their actions. The queries discussed so far do not take this nature of task instances into account.

In principle, the aim is to summarize the task instances (on the level of sets of Events) as actual tasks (on the level of sets of actions or event Classes), and to lift the df-relationships accordingly.

A naive approach would be to aggregate TaskInstance nodes to Task nodes by their \(h_{ti}.name\) property, i.e., \(\textit{Agg}_\textit{nodes}(name,TaskInstance ,Task ,observed )\). However, as task instances are sequences of multiple actions, two different \(h_{ti}.name\) values may be different variants of the same task. For example, h1 and h4 in Fig. 2 with \(h1.name = A,B\) and \(h4.name = A,B,D\) might be variants of the same task. Depending on the analysis, it may be desirable to (R1) aggregate TaskInstance nodes with similar (but not identical) name properties into the same Task node, which is not possible with the available queries.

If multiple task instances are considered as variants of the same task, it will be useful to summarize all the task instances on the level of actions to study the “contents” and “variability” of executions of a task. We seek to (R2) aggregate events and directly-follows relations that belong to similar TaskInstance nodes.

The presence of multiple types of DF-relationships (per entity type) increases the (visual) complexity of the aggregated graphs (see Fig. 2 (top)). Depending on the analysis, it may be desirable to (R3) control the aggregation through filtering and refinement to obtain more specific summaries in the form of smaller, simpler, or more precise aggregated graphs.

4 Queries for Summarizing Task Instances

To address requirements (R1–R3), we propose new queries for aggregating task instances in different ways, and discuss how to configure and combine aggregation queries with other queries to obtain specific graphs. In the following, let G be the event knowledge graph \(G = (X,Y,\varLambda ,\#)\) after applying \(\textit{Agg}_\textit{sub}\) and \(\textit{Agg}_\textit{df}^{\mathord {\ne }}\) as defined in Sect. 3, i.e., the graph has Event nodes and TaskInstance nodes connected by df-edges.

4.1 Aggregating Similar Task Instances

Addressing (R1) requires (a) identifying which task instances are similar, and (b) aggregating the task instance nodes considered similar.

The specific criteria when two TaskInstance nodes are similar depend on the concrete process, data, and analysis use case. For the scope of this work, we therefore assume an “oracle query” \(\textit{O}(h) = i\) that determines \(\textit{O}(h_{ti}) = \textit{O}(h_{ti}')\) iff two task instances \(h_{ti}\) and \(h_{ti}'\) belong to the same task. \(\textit{O}(h)\) could, for instance, be implemented by agglomerative clustering wrt. the \(h_{ti}.name\) values (with suitable parameters) [15].

Given such an oracle O, the query \(\textit{Agg}_{sim}(O,X')\) aggregates TaskInstance nodes \(X' \subseteq \textit{TaskInstance}\) wrt. oracle O to Class nodes as follows: (1) for each \(h_{ti} \in X'\) set \(h_{ti}.Task = \textit{O}(h_{ti})\), (2) aggregate the TaskInstance nodes by property \(h_{ti}.Task\) using \(\textit{Agg}_{nodes}(\textit{Task},\textit{TaskInstance},\textit{Class},\textit{observed})\) of Sect. 3.

For example, applying \(\textit{Agg}_{sim}(O,TaskInstance )\) on the graph in Fig. 2 creates the Class nodes cl7, cl8, cl9 of type Task shown in Fig. 3 (top). Further properties of a Task node t can be set based on the use case, e.g., setting t.name as the set of (most frequent) e.action of events contained in the \(h_{ti}\) nodes that observed t.
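The two steps of \(\textit{Agg}_{sim}\) can be sketched in Python as follows. The oracle used here, prefix_oracle, is a deliberately simple hypothetical stand-in that groups task instances by the first two actions of their name; it is not the clustering-based oracle referred to in the text. The name values other than those of h1 and h4 are assumed for illustration.

```python
# h.name per TaskInstance node; only h1 and h4 are given in the text, the rest is assumed.
NAMES = {
    "h1": ("A", "B"),
    "h2": ("E", "F"),
    "h3": ("C",),
    "h4": ("A", "B", "D"),
    "h5": ("E", "F"),
}


def prefix_oracle(names, k=2):
    """Hypothetical oracle: instances sharing their first k actions belong to one task."""
    ids, task_of = {}, {}
    for h, name in names.items():
        key = name[:k]
        task_of[h] = ids.setdefault(key, len(ids) + 1)
    return task_of


def agg_sim(oracle, names):
    # (1) set h.Task = O(h) for every task instance
    task_of = oracle(names)
    # (2) aggregate by the Task property into Class nodes of type Task
    task_nodes = {t: {"label": "Class", "type": "Task", "id": t}
                  for t in sorted(set(task_of.values()))}
    observed = [(h, "observed", t) for h, t in task_of.items()]
    return task_nodes, observed


task_nodes, observed = agg_sim(prefix_oracle, NAMES)
```

With this oracle, h1 and h4 are treated as variants of the same task, so the five task instances aggregate into three Task nodes. Any other oracle with the same interface could be substituted without changing agg_sim.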

To also lift df-relationships from TaskInstance nodes to the Class nodes of type Task, we have to generalize \(\textit{Agg}_{df}\) to also consider high-level events such as TaskInstance and not just “regular” Events. The query \(\textit{Agg}_{df}(Z',t,\ell ,\ell ')\) aggregates df-relationships between nodes \(Z'\) for a particular entity type t to \(\ell \) nodes along the \(\ell '\) relationships as follows: (1) for any two nodes \(x,x' \in \ell \) query the set \(df_{x,x'}^{t}\) of all df-edges \((z,z')^{t} \in \textit{df}\) where nodes \(z,z' \in Z'\) are related to \(x,x'\) via \(y,y'\in \ell ', \overrightarrow{y}=(z,x),\overrightarrow{y'}=(z',x')\), (2) if \(df_{x,x'}^{t} \ne \emptyset \) create a new df-relationship \(y^* \in \textit{df}, \overrightarrow{y^*} = (x,x'), y^*.type=t\) and set \(y^*.count = |df_{x,x'}^{t}|\).

Fig. 3. Task instances aggregated into task classes for deriving inter-task DFGs (top). Subset of lower-level events aggregated into event classes for deriving intra-task DFGs (bottom).

Applying \(\textit{Agg}_{df}(\textit{TaskInstance},t,\textit{Class},\textit{observed})\) for \(t\in \{\textit{Case},\textit{Resource}\}\) in our running example yields the df-edges between cl7, cl8, cl9 shown in Fig. 3 (top).

The sub-graph over the Class nodes of type Task created in this way is a directly-follows graph (DFG) on the level of tasks (instead of the DFG on the level of actions obtained in Sect. 3). We call this DFG an inter-task DFG to distinguish it from the DFG describing behavior within a task, which we discuss next.

4.2 Aggregating Events Within Similar Task Instances

To address (R2) we need to aggregate only those events that are contained within task instances of the same task. Two previous aggregation operations already materialized this information. Each event e is connected to one task \(\textit{t} \in \textit{Class},t.type=\textit{Task}\) via \((h,e) \in contains \) and \((h,t) \in observed \) (created by \(\textit{Agg}_\textit{sub}\) of Sect. 3 and \(\textit{Agg}_{sim}\) of Sect. 4.1, e.g., (h1, e1) and (h1, cl7) in Fig. 2).

Using these edges to task t as context, we adapt the node aggregation of Sect. 3 to be local to a task t. But as the same action may occur in different instances of different tasks, we have to distinguish to which task an action belongs. This requires defining an event classifier query \(\textit{class}(e)\) which returns for each event a value based on the properties of e or neighboring nodes. \(\textit{Agg}_\textit{nodes}\) of Sect. 3 used \(\textit{class}(e) = e.X\) for some property name X. To distinguish the task, we define the event classifier \(class_{task}(e) = (e.action,task(e))\) with \(task(e) = i\) iff \((h_{ti},e) \in \textit{contains},(h_{ti},t) \in \textit{observed}, t.ID = i\). Note that by basing \(class_{task}(e)\) on the Class node, we become independent of the specific oracle O used to identify tasks.

The generalized aggregation query \(\textit{Agg}_\textit{nodes}(class,X',\ell ,\ell ')\) on an EKG \(G = (X,Y,\varLambda ,\#)\) differs from \(\textit{Agg}_\textit{nodes}(a,X',\ell ,\ell ')\) in the first step: (1) query all values \(V = \{ class(x) \mid x \in X' \}\), (2) for each value \(v \in V\) add a new node \(x_v \in \ell \) to G with label \(\ell \) and set \(x_v.id = v\), \(x_v.type=class\), (3) for each \(x \in X'\), add new relationship \(y \in \ell '\) with label \(\ell '\) from x to \(x_v\), \(\overrightarrow{y}=(x,x_v)\).

We can then aggregate the events per task \(\textit{t} \in \textit{Class},t.type=\textit{Task}\) as follows. Query the events \(E_t = \{ e \in \textit{Event} \mid \exists h: (h,e) \in \textit{contains}, (h,t) \in \textit{observed} \}\) and aggregate by \(\textit{Agg}_\textit{nodes}(class_{task},E_t,\textit{Class},\textit{observed})\). Applying this query in our example for cl7 (Task with ID=1), we obtain the class nodes cl1, cl2, cl3 shown in Fig. 3 (bottom). The df-edges can be aggregated using \(\textit{Agg}_{df}\) of Sect. 4.1.

In this way, Fig. 3 (bottom) shows how events e1, e2, e6, e7, e8 are aggregated to an “intra-task directly-follows graph” describing the local behavior within a task in one model. Analysts can use such an intra-task DFG to understand task contents and how homogeneous the task instances assigned to the same task are, e.g., to evaluate whether the chosen oracle O is of sufficient quality.
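The construction of an intra-task DFG can be sketched in Python as below. The sketch makes several assumptions: the action labels and the task ids are illustrative, only case df-edges are lifted, and df-edges are kept only if both events belong to the task; the names class_task, task_of_event and intra_task_dfg are hypothetical helpers.

```python
from collections import Counter

# Illustrative action labels per event (assumed).
ACTION = {"e1": "A", "e2": "B", "e3": "E", "e4": "F", "e5": "C",
          "e6": "A", "e7": "B", "e8": "D", "e9": "E", "e10": "F"}
# contains: TaskInstance -> events; observed: TaskInstance -> Task id (assumed ids).
CONTAINS = {"h1": ["e1", "e2"], "h2": ["e3", "e4"], "h3": ["e5"],
            "h4": ["e6", "e7", "e8"], "h5": ["e9", "e10"]}
OBSERVED = {"h1": 1, "h2": 2, "h3": 3, "h4": 1, "h5": 2}
# Case df-edges between events of the two cases of the toy example.
CASE_DF = [("e1", "e2"), ("e2", "e3"), ("e3", "e4"), ("e4", "e5"),
           ("e6", "e7"), ("e7", "e8"), ("e8", "e9"), ("e9", "e10")]

# task(e): follow contains and observed from the event to its task
task_of_event = {e: OBSERVED[h] for h, evs in CONTAINS.items() for e in evs}


def class_task(e):
    """Event classifier (action, task) that keeps actions of different tasks apart."""
    return (ACTION[e], task_of_event[e])


def intra_task_dfg(task):
    events = {e for e, t in task_of_event.items() if t == task}  # E_t
    classes = sorted({class_task(e) for e in events})
    edges = Counter((class_task(a), class_task(b))
                    for a, b in CASE_DF if a in events and b in events)
    return classes, edges


classes, edges = intra_task_dfg(1)
```

For task 1, the events e1, e2, e6, e7, e8 are aggregated into three class nodes, and the edge from (A, 1) to (B, 1) is observed twice, once per task instance.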

4.3 Parameterized, Specific Aggregation

The aggregations of Sect. 4.1 and 4.2 can result in a complex graph over two behavioral dimensions that is difficult to visualize and possibly not specific enough to answer an analysis question. To obtain more specific DFGs, we introduce the following parameters: (1) node aggregation by using more specific classifiers for TaskInstance nodes, (2) filtering by using different criteria to decide which TaskInstance nodes to keep, and (3) edge aggregation by selecting which df-edges to aggregate. Each parameter is defined by the properties of the entire event knowledge graph, including the underlying events.

(1) We can refine the aggregation of TaskInstance nodes to Class nodes using a classifier over multiple properties. For example, the following classifier distinguishes tasks per actor: \(class_{T \times R}(h) = (h.cluster,resource(h))\) with \(resource(h) = a\) iff \((h,e) \in \textit{contains}, e.resource = a\). The df-relationships are then aggregated per actor, which allows comparing different actors wrt. their behavior over tasks.

(2) To obtain a DFG for specific parts of the data, the Agg queries allow limiting the set of nodes to be aggregated to a subset \(\textit{TI}' \subseteq \textit{TaskInstance}\). We can construct \(\textit{TI}'\) by another query. For instance, we can select (a) only \(h_{ti} \in \textit{TaskInstance}\) nodes correlated to an entity based on a specific property, e.g., in Fig. 2, related to resource entities where \(n.ID=a5\), i.e., h2 and h5, or case entities where \(n.item\_category=\) Electronics, i.e., h1, h2 and h3; or (b) nodes based on temporal properties, e.g., only \(h_{ti}\) nodes in cases that end before 15:00, i.e., \((h_{ti},e) \in \textit{contains}, (e,n) \in \textit{corr}, n.type=case\) and all events \((e',n) \in \textit{corr}\) have \(e'.time < 15:00\).

(3) We can limit the df-relationships to aggregate to a subset \(\textit{df}' \subseteq \textit{df}\) determined by structural or temporal properties in the same way as in (2). Note that if \(\textit{df}'\) is chosen independently of \(\textit{TI}'\), there may be no aggregated df-edges between Class nodes.
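The three parameters can be illustrated together in a short Python sketch. It assumes that each task instance already carries a task id and a resource, and that df-edges between task instances are given; all concrete values except the resource a5 of h2 and h5 and the resource edge from h1 to h4 are assumptions, and the names class_txr and select are hypothetical.

```python
# Task id and resource per TaskInstance node (task ids and a1, a2 are assumed).
TASK = {"h1": 1, "h2": 2, "h3": 3, "h4": 1, "h5": 2}
RESOURCE = {"h1": "a1", "h2": "a5", "h3": "a2", "h4": "a1", "h5": "a5"}
# df-edges between TaskInstance nodes as (h, h', entity type); partly assumed.
TI_DF = [("h1", "h2", "case"), ("h2", "h3", "case"), ("h4", "h5", "case"),
         ("h1", "h4", "resource"), ("h2", "h5", "resource")]


def class_txr(h):
    """(1) Classifier over two properties: distinguishes tasks per actor."""
    return (TASK[h], RESOURCE[h])


def select(keep):
    """(2) Filter TaskInstance nodes; (3) keep only df-edges inside the selection."""
    ti = [h for h in TASK if keep(h)]
    df = [(a, b, t) for a, b, t in TI_DF if a in ti and b in ti]
    return ti, df


# Only task instances executed by resource a5.
ti_a5, df_a5 = select(lambda h: RESOURCE[h] == "a5")
classes_a5 = {class_txr(h) for h in ti_a5}
```

Here the df-edges are restricted to the selected task instances, which avoids the situation noted in (3) where an independently chosen \(\textit{df}'\) leaves no edges to aggregate.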

Analysis typically requires understanding where behavior starts or ends. We summarize how often a Class node cl is a start node of the DFG (for entity type n) by querying the number of \(h_{ti} \in \textit{TI}'\) nodes with \((h_{ti},cl) \in \textit{observed}\) and no incoming df-edge \((h_{ti}',h_{ti})^n \in \textit{df}'\). For example, in Fig. 3, cl7 is a start node once for r and twice for c. End nodes are summarized correspondingly. We visualize this as edges from/to artificially inserted start/end nodes.
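Counting start nodes can be sketched in Python as follows. The df-edges between task instances and the task ids are assumptions for a toy graph loosely modeled on the running example (only the resource edge from h1 to h4 is stated in the text), and start_counts is a hypothetical helper name.

```python
from collections import Counter

# Task id per TaskInstance node (assumed ids; task 1 plays the role of cl7).
TASK = {"h1": 1, "h2": 2, "h3": 3, "h4": 1, "h5": 2}
# df-edges between TaskInstance nodes as (h, h', entity type); partly assumed.
TI_DF = [("h1", "h2", "case"), ("h2", "h3", "case"), ("h4", "h5", "case"),
         ("h1", "h4", "resource"), ("h2", "h5", "resource")]


def start_counts(task_instances, df):
    """Count per (task, entity type) the task instances without an incoming df-edge."""
    counts = Counter()
    for etype in ("case", "resource"):
        has_incoming = {target for _source, target, t in df if t == etype}
        for h in task_instances:
            if h not in has_incoming:
                counts[(TASK[h], etype)] += 1
    return counts


starts = start_counts(list(TASK), TI_DF)
```

In this toy graph, task 1 is a start node twice for the case dimension (h1 and h4) and once for the resource dimension (h1), which agrees with the counts stated for cl7. End nodes could be counted symmetrically by collecting the sources instead of the targets of the df-edges.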

5 Demonstration

We implemented the queries we proposed in Sect. 4 as naive, non-optimized Cypher queries invoked via parameterized Python scripts on the graph database Neo4j. We applied the queries on the event knowledge graph [7] of the BPIC’17 data [6] to evaluate and demonstrate the feasibility of the queries for obtaining new insight into the process on the level of tasks and task instances.

Fig. 4. Inter-task DFG over Task nodes obtained by aggregating case-df-edges wrt. similar task instance sub-graphs.

First, we materialized the task instance sub-graphs as TaskInstance nodes (see Sect. 2), which resulted in 171,200 task instances with 1,208 task variants (unique \(h_{ti}.name\) values). Naively aggregating the TaskInstance nodes by \(h_{ti}.name\) would lead to a graph too large to understand. We removed TaskInstance nodes describing variants occurring \(<10\) times (1%) and of length \(=1\) (6%). We then implemented a simple oracle O for identifying tasks of similar task instances by agglomerative clustering, as this method fits the bottom-up aggregation of instances into tasks. We used Euclidean distance between \(h_{ti}.name\) as distance metric and chose the number of clusters by maximizing the silhouette index, see [15]. Applying \(\textit{Agg}_{sim}(O,\textit{TaskInstance})\) (Sect. 4.1) resulted in 20 Class nodes of type Task. Aggregating all case-df-edges results in the DFG shown in Fig. 4. Due to space limitations, we can explain only the contents of a few selected tasks. For example, C1, C2, C5 show the 3 most frequent ways actors group actions differently into tasks at the start of the process, with C1 containing actions A_Create, A_Concept, W_Compl appl+Start; C2 containing A_Create, A_Submit, W_Handle Lds+Start; while C5 contains more actions A_Create, A_Concept, W_Compl appl+Start, A_Accept, O_Create, W_Call offers+Start, A_Complete. Cases starting with C1 and C2 later go through C0 containing A_Accept, O_Create, O_Sent, W_Compl appl+E, W_Call offers+S, A_Complete, i.e., task C5 is done by a single actor combining tasks C1 and C0 done by two different actors.

We then evaluated whether aggregating events of task instances of the same task is effective to understand the contents of a task. We applied \(\textit{Agg}_\textit{nodes}(class_{task},E_t,\textit{Class},\textit{observed})\) of Sect. 4.2 for all tasks and obtained a corresponding intra-task DFG for each. Figure 5 shows the intra-task DFG of C14 (the most frequent task after C5 and C0), highlighting two different variants of the task that differ in the actions executed. The intra-task DFGs of other tasks also revealed variability in the order of actions executed.

Fig. 5. Intra-task DFG of C14 with variants V5 and V6.

We then explored the more advanced capabilities of parameterizing aggregation with further queries explained in Sect. 4.3. We chose to compare how two different actors work and collaborate on a day-to-day basis in terms of tasks. For this, we constructed a composition of multiple actor DFGs interconnected by cases as follows: (1) use the classifier \(class_{T \times R}(h)\) defined in Sect. 4.3, i.e., create Class nodes per task and actor, (2) let \(\textit{TI}'\) contain only TaskInstance nodes related to one of two specific resources, (3) include a resource df-edge \((h,h')^r \in \textit{df}'\) only if \(h.time_{end}\) and \(h'.time_{start}\) occur on the same day, and include any case df-edge in \(\textit{df}'\).

Fig. 6. Inter-task DFGs showing behavior of users 29 and 113 and handovers.

Figure 6 shows an example of a specialized DFG; it summarizes for each actor the behavior executed over a day (no df-edges to a task on the next day), and the aggregated case df-edges show how often an actor handed a case from one task to another actor with another task. U29 worked on 100 days while U113 worked on 43 days; U29 performs 4 tasks while U113 performs 3 tasks; both work on C4 and C8 but otherwise do disjoint work and hand work over between C11 and C14 and from C14 to C4.

Our naive queries took 1.5 h to build the graph with similar tasks materialized as Class nodes and about 1 min to compute each of the DFGs, including filtering, on an Intel i7 CPU @ 2.2 GHz machine with 32 GB RAM.

6 Related Work

We discuss how our findings relate to other works on aggregation and analyzing tasks and actor behavior in terms of sub-sequences or patterns in cases and/or actor behavior.

Kumar and Lui [16] analyze tasks by detecting frequent collaboration patterns in sequences of actor behavior; but the contents of work between hand-offs are disregarded and the whole process cannot be summarized. Yang et al. [20] discover organizational models including grouping of resources and their relation to execution contexts; but an execution context consists of single activities, disregarding work that may be aggregated into larger tasks. Hulzen et al. [11] cluster activity instances into activity instance archetypes related to actors; this technique corresponds to the oracle for identifying similar task executions used as input for aggregation in Sect. 4. Delcoucq et al. [5] aggregate frequent, gapped behavior of an actor over the entire trace into a local process model of a task; but the resulting models are not related to the case, making it impossible to study the behavioral context in the case or in relation to other actors as we allow. Jooken et al. [12] mine resource interactions as collaboration sessions of actors working on the same data objects within a specific time-window from a multi-entity event table; the collaborations are then aggregated into a social network. This approach is an alternative to the task instance querying [13] used in this paper, but it does not model task executions in the context of process executions, allowing fewer types of aggregations compared to our approach. Leoni and Dündar [18] use waiting time between events as a heuristic to group consecutive low-level events into “batch sessions” and cluster them using the most frequently executed activity in a cluster as label; in contrast, our aggregation queries preserve the structure of task variants in the graph. Several task mining approaches [2, 17] aim to discover task executions by segmenting an event log of desktop interactions such that repetitive patterns or pre-identified routines are found, similar to our previous work [15]. However, such tasks are limited to a single actor, ignoring collaboration, and do not investigate the process context. Finally, Genga et al. [9] used frequent sub-graph mining with SUBDUE to summarize graph-based event data; the approach enforces a hierarchical structure and cannot be configured to a desired abstraction level, e.g., task instances or specific subsets. In contrast, the query-based aggregation operations proposed in this paper offer the required flexibility.

7 Conclusion

We showed how to adapt and generalize existing aggregation queries on event knowledge graphs to preserve the intermediate abstraction level of task instances, i.e., multiple events executed by one actor in the same case. These queries, implemented as Cypher queries on standard graph DB systems, allow us to generate three completely new types of event data summaries: global inter-task DFGs that summarize processes on the level of larger tasks (instead of atomic actions); local intra-task DFGs that summarize behavior within a task (similar to a local process model [4]); and inter-task DFGs modeling behavior and interactions of multiple actors.

Our demonstration on the BPIC’17 event data suggests that these data summaries are helpful in answering questions of how work is structured and divided among actors in different parts of the process. We believe that such analysis of event data can give new insights into actor behavior in the context of routines [10, 19] and organizational models [20]. Future work is to evaluate whether the aggregation operations are effective for analysts trying to understand tasks.

Our work has two limitations. First, we modeled behavior along the control-flow using a single entity identifier while many processes operate on multiple objects; task identification and queries have to be generalized in this regard. Second, the aggregation queries share many elements, but are formalized independently as pattern matching and creation operations over LPGs. A necessary next step is to systematically inventory query operators over event knowledge graphs and develop a formal query algebra that is natural to process concepts.