Multi-Dimensional Event Data in Graph Databases

Process event data is usually stored either in a sequential process event log or in a relational database. While the sequential, single-dimensional nature of event logs aids querying for (sub)sequences of events based on temporal relations such as "directly/eventually-follows", it does not support querying multi-dimensional event data of multiple related entities. Relational databases allow storing multi-dimensional event data, but existing query languages do not support querying for sequences or paths of events in terms of temporal relations. In this paper, we propose a general data model for multi-dimensional event data based on labeled property graphs that allows storing structural and temporal relations in a single, integrated graph-based data structure in a systematic way. We provide semantics for all concepts of our data model, and generic queries for modeling event data over multiple entities that interact synchronously and asynchronously. The queries allow for efficiently converting large real-life event data sets into our data model, and we provide 5 converted data sets for further research. We show that typical and advanced queries for retrieving and aggregating such multi-dimensional event data can be formulated and executed efficiently in the existing query language Cypher, giving rise to several new research questions. Specifically, aggregation queries on our data model enable process mining over multiple interrelated entities using off-the-shelf technology.


Introduction
Retrieving and aggregating subsets of event data of a particular characteristic is a recurring activity in process analysis and process mining [2]. Each event is thereby defined by an event classifier such as the activity or state that was recorded, a case identifier referring to the object or entity where the activity was carried out, and a timestamp or ordering attribute defining the order of events.
If all events use the same, single case identifier attribute, the event data is single-dimensional and can be stored in an event log as one sequence of events per case according to the data model of the XES-Standard [1], see Fig. 1(a). Such sequences can be easily queried for behavioral properties such as event (sub-)sequences or temporal relations such as "directly/eventually-follows" in combination with other data attributes [11,13,30,40,43,45]. Aggregating directly/eventually-follows relations between events is fundamental for discovering process models from event logs [2,6,46].
Most processes in practice, however, involve multiple inter-related entities, which results in multi-dimensional event data in which each event is directly or indirectly linked to multiple different case identifiers; sequential event logs cannot represent such multi-dimensional event data [33]. Relational databases (RDBs) can store 1:n and n:m relations between events and case identifiers and among different case identifiers, but the explicit behavioral information of sequences (of arbitrary length) is lost (Fig. 1(e)).
State of the art. From existing literature [26,33,36], we identified requirements for modeling (R1-R4), querying (R5-R11), and aggregating (R12-R16) multi-dimensional event data, see Sect. 2. Several data models for multi-dimensional event data have been researched. Extracting sequential paths of events over multiple entities requires large, non-intuitive queries [28,37] and introduces false information (e.g., D → D and E → E in Fig. 1(a) have no corresponding Offer) [33]. Behavioral queries on RDBs are limited to pairs of directly-following events [14] or pre-defined patterns [42] for a single entity. Multi-identifier event tables [3,29,39] correlate each event to multiple entities, but provide no sequential order for each entity dimension, see Fig. 1(d).
Entity-specific event logs describe sequential information per entity [37,39] while relation-specific event logs allow to reify relations between entities [33] into a composite entity describing their interaction [24], see Fig. 1(b); both do not describe correlation of an event to multiple entities. A path such as a sequence D → E → F → G over 3 different entities as shown in Fig. 1(f) cannot be queried in any of these models as they leave multi-entity correlation and sequential information per entity separated.
In a prior exploratory case study [22], we showed an integrated data model based on labeled property graphs using edges to correlate events to entities, and to model "directly-following" events per entity as shown in Fig. 1(c). We used graph query languages of existing graph database systems [41] to answer behavioral multi-entity queries (Fig. 1(f)) and to aggregate directly-follows per entity. In parallel, Berti et al. [10] demonstrated that aggregating directly-follows edges per entity in a graph of events allows for simpler discovery of models of behavior over multiple entities; though their model assumes that relations between entities have already been reified and does not support querying, as we discuss in Sect. 2. However, our data model [22] was based on a single real-life data set and did not provide a generic data model and queries, specifically for aggregation.

Research problem. In this paper, we approach the problem of identifying a generally applicable model of event data in a multi-dimensional setting. The specific problem is to identify a minimal set of core concepts for a data model of multi-dimensional process event data with clearly defined semantics to fully (1) model, (2) query, and (3) aggregate all kinds of real-life process event data suitable for process analysis, addressing requirements R1-R15. From a collection of public real-life event logs, we identified 5 publicly available data sets with unique multi-dimensional characteristics that can serve as a benchmark: multiple entities interact via a shared common entity (BPIC14 [15]); multiple entities interact asynchronously, based on click-stream data (BPIC16 [17]), based on a case management system (BPIC17 [18]), based on ERP system data (BPIC19 [20]); multiple event logs of the same processes executed in different organizations (BPIC15 [16]). A data model has to allow modeling, querying, and aggregating at least these 5 data sets in their multi-dimensional nature.
Method. First, we determined the process event data concepts any data model had to support based on literature (see Sect. 2.1). All recent works that succeed in modeling (some) aspects of multi-dimensional event data employ graph properties. We therefore took the most complete proposal [22] based on labeled property graphs (LPGs, see Sect. 2.4) as a starting point. We then iteratively developed a solution that could support all 5 benchmark data sets using the following approach with the existing graph database (GDB) system Neo4j (neo4j.com); Neo4j was chosen for LPG storage and querying due to off-the-shelf availability and suitable performance.
1. We transformed the event data into a standardized input format of an event table where each record describes one event and its properties, including references to all entities involved. This format is readily available or easily obtainable from the XES event log standard [1].
2. We imported events into the GDB as an LPG of generic, unrelated nodes only.
3. We then identified node and relationship types as semantic concepts for the data model and corresponding data transformations in terms of queries on LPGs to model all input data sets through the same semantic concepts and the same data transformation queries.
4. The semantic concepts for nodes and relations thereby had to serve as adequate abstractions so that all event data could be queried and analyzed through these semantic abstractions only.
5. In case a suitable solution (types, relations, queries) was found for one data set, we applied it on all other data sets. If the solution could not be applied on one data set, we identified the cause, generalized the concepts and queries, and repeated steps 3-5 for all data sets.
We conducted over 100 iterations of the above process over all 5 data sets until reaching a fixed point.
Contribution and Results. We contribute a generally applicable, minimal, integrated data model for multi-dimensional event data with semantic definitions for correlation and behavioral ordering using labeled property graphs. Our model allows querying and aggregating the modeled multi-dimensional event data and thereby subsumes and exceeds several prior works. We identified 4 semantic node types for multi-dimensional event data in LPGs: events, entities, logs, and event classes; 3 semantic structural relations for relating each event to one or more entities, to exactly one log, and to one or more event classes; and 2 semantic behavioral relations describing directly-follows between two events (along a chosen entity), and its congruent directly-follows relation between event classes (summarizing event-level directly-follows on class level); see Sect. 3 for the concepts and Sect. 4 for their semantic definition in terms of LPGs.
We identified queries to extract entities from input event data, both for explicitly recorded entities and for composite entities obtained by reifying relations, to correlate events to entities, and to derive directly-follows relations for all events of a specific entity (explicit or reified). All the available multi-dimensional event data could be transformed into our model using a standard set of queries satisfying R1-R4; see Sect. 5.
We show that all query requirements (R5-R11) on multi-dimensional event data are satisfied by our model using the standard query language Cypher. We evaluated the query results to be correct against a manually constructed ground truth. Query execution times are practically feasible and in complex cases outperform hand-written algorithms on sequential event logs for the same task; see Sect. 6.
We identified queries to aggregate events and directly-follows relations between events to event classes and directly-follows relations between event classes for a specific entity; the queries allow considering structural and behavioral properties during aggregation, addressing all aggregation requirements (R12-R15). Specifically, aggregating directly-follows on event classes allows realizing process discovery for models of processes over multiple entities directly through standard queries in existing graph database systems. In Sect. 7 we demonstrate discovery of artifact-centric models of multiple entities with asynchronous and synchronous interactions [33], also called multi-viewpoint models [10].
All queries realizing the transformations into our data model, querying the data, and discovering process models through graph databases are available on GitHub. We discuss limitations and avenues for future work in Sect. 8.

Background
We first recall the foundational concepts of single-dimensional process event logs in Sect. 2.1. After discussing challenges and requirements for analyzing multi-dimensional event data in Sect. 2.2, we discuss the state of the art on modeling, querying, and analyzing multi-dimensional event data in Sect. 2.3, before we recall the data model of labeled property graphs and the query language Cypher.

Modeling Single-Dimensional Event Logs
Information Systems (IS) create and update information records in structured transactions or activities. Each update is linked to one or more entities with unique entity identifiers, for example a specific order and related invoices. Each update can be recorded as an event with attributes for the activity carried out, the entity identifiers, and the timestamp (or ordering of updates). Events are implicitly related to each other via the structural 1:1, 1:n, and n:m relations between the entities on which the updates occurred [33]. A process event log is a collection of recorded events E structured into a specific view on an IS from the perspective of handling one specific entity, e.g., handling a credit application. Table 1 shows a simplified event log taken from the BPIC17 [18] data set describing the handling of a credit application (identified by Appl.).
The following process-specific concepts are part of every event log [2]. (E1) Each event e ∈ E in an event log records an atomic observation using 3 attributes: an entity identifier e.entityid to which the event is related, an event class e.class (usually the activity name e.activity), and an ordering attribute (usually the event's timestamp e.time). (E2) Optional attribute e.lifecycle records states of long-running behavior, e.g. an activity started or completed. (E3) Optional attribute e.resource records whether an actor or resource was involved in the event. (E4) Each entity identifier id = e.entityid, e.g. a specific application, defines a case (or execution) of the process; the set of events correlated to this entity is {e_1, ..., e_n} = {e ∈ E | e.entityid = id}. (E5) The sequence ⟨e_1, ..., e_n⟩ of all events correlated to an entity ordered by the ordering attribute is called a trace (of this entity). The IEEE XES-Standard [1] materializes these concepts in a tree-structure that specifically pre-determines a unique case identifier.
Process mining relies on querying and aggregating the directly-follows relation over events E. (E6) Event e_b directly follows e_a, written e_a → e_b, iff there is a trace ⟨..., e_a, e_b, ...⟩. Directly-follows is aggregated over event classes: event class b directly follows event class a, written a → b, iff there is a trace ⟨..., e_a, e_b, ...⟩ with e_a.class = a and e_b.class = b. Nearly all process discovery algorithms create one activity node for each event class [2], and dependencies between activities are derived from the directly-follows relation between activities [6,46]. Model quality improves when the class of an event is determined by behavioral properties such as the set of preceding activities [32].
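As an illustration of (E4)-(E6), the following minimal Python sketch groups events into traces and counts the directly-follows relation over event classes. The event tuples, the identifiers app1 and app2, and the function names are hypothetical and chosen only for this example; they are not taken from the BPIC17 data.

```python
from collections import Counter, defaultdict

# Hypothetical single-dimensional log: (entity id, event class, ordering attribute).
events = [
    ("app1", "Create Application", 1),
    ("app1", "Create Offer", 2),
    ("app1", "Accept", 3),
    ("app2", "Create Application", 1),
    ("app2", "Create Offer", 2),
    ("app2", "Create Offer", 3),
]

def traces(events):
    """Group events by entity id (E4) and order them into traces (E5)."""
    by_entity = defaultdict(list)
    for entity_id, cls, time in events:
        by_entity[entity_id].append((time, cls))
    return {eid: [cls for _, cls in sorted(evs)] for eid, evs in by_entity.items()}

def directly_follows(events):
    """Count a -> b over event classes for consecutive events of a trace (E6)."""
    df = Counter()
    for trace in traces(events).values():
        df.update(zip(trace, trace[1:]))
    return df

df = directly_follows(events)
```

The resulting counter is the kind of aggregated directly-follows information that discovery algorithms take as input.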

Requirements for Analyzing Multi-Dimensional Event Data
An IS usually hosts multiple uniquely identifiable entities, e.g., credit applications and offers. For example, the BPIC'17 data (shown in Tab. 1) identifies four entities: credit applications (identified by Appl., events with Origin = A), credit offers (identified by oID with Origin = O) with a 1:n relation to Applications, the Workflow (identified by Appl. with Origin = W), and the actors working on the case (resource) with an n:m relation to Application, Workflow, and Offers; Fig. 1(e) illustrates (part of) the data in a relational database (RDB).
Extracting a single-dimensional event log (Fig. 1(a)) groups all events under a single entity (case identifier), e.g. the Application or the Offer document, and flattens the data accordingly [28], leading to false behavioral information called convergence and divergence [3,33]. Flattening the data in Fig. 1(e) under Application de-normalizes the 1:n relation to Offer and results in the event log of Tab. 1 and Fig. 1(a): Create Offer → Create Offer occurs in the log, whereas in reality this never happens for any entity (convergence). Flattening the data in Fig. 1(e) under Offer via the n:1 relation replicates events on the 1-side for each entity on the n-side (divergence), see Fig. 1(a).
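The effect of flattening under the 1-side of a 1:n relation can be reproduced with a small Python sketch. The five events, their identifiers, and the activity "Send Offer" are hypothetical and only mimic the Application/Offer structure described above.

```python
from collections import defaultdict

# Hypothetical events: (event id, activity, application id, offer id or None, time).
events = [
    ("e1", "Create Application", "A1", None, 1),
    ("e2", "Create Offer", "A1", "O1", 2),
    ("e3", "Create Offer", "A1", "O2", 3),
    ("e4", "Send Offer", "A1", "O1", 4),
    ("e5", "Send Offer", "A1", "O2", 5),
]

def df_pairs(events, key):
    """Directly-follows pairs of activities after grouping events by key(event)."""
    groups = defaultdict(list)
    for ev in events:
        k = key(ev)
        if k is not None:  # events without this identifier are not part of any case
            groups[k].append(ev)
    pairs = set()
    for evs in groups.values():
        evs.sort(key=lambda ev: ev[4])
        pairs.update((a[1], b[1]) for a, b in zip(evs, evs[1:]))
    return pairs

flattened = df_pairs(events, key=lambda ev: ev[2])  # case notion: Application
per_offer = df_pairs(events, key=lambda ev: ev[3])  # case notion: Offer
```

With Application as the case notion, the pair (Create Offer, Create Offer) appears although no single offer is ever created twice; grouping per Offer yields only (Create Offer, Send Offer).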
A recent literature survey of 95 studies [26,36] established requirements for querying event data. Focusing on querying for structure and behavior in multi-dimensional event data we identified from [26, pp.133] the requirements to (R1) query and analyze events (E1 of Sect. 2.1), and to (R2) consider relations between multiple data entities (as in RDBs). The technique shall support (R3) storing and querying business process-oriented concepts (E2-E3) and (R4) capture information about how events are correlated to different entities to avoid convergence and divergence (generalize E4-E6).
According to [26,36], queries should (R5) be expressed as graphs to specify the behavior of interest in a natural way, (R6) allow to query paths (or sequences) of events (connected by some relation), (R7) allow to select individual cases based on partial patterns, (R8) allow to query temporal properties (such as directly/eventually-follows), (R9) correlate events related to the same entity, (R10) allow querying aspects related to several entities or processes at the same time on the same data set, and (R11) allow to query multiple event logs and combine results.
Prior work on analyzing multi-dimensional event data [33] identified four major aggregation operations for discovering so-called artifact-centric process models. The technique has to support (R12) aggregating events into user-defined event classes, e.g., activities, based on data properties, (R13) aggregating (reifying) records of a relation between two entities into a new composite entity to model, query and aggregate interactions between different entities, (R14) aggregating behavioral relations from the event level to the event class level per (inferred) entity type, and (R15) relating or synchronizing aggregated behavior of different entity types.
Altogether, a user shall be able to query and aggregate for individual events (and their properties), for different entities/case notions, for behavioral and structural relations, and for patterns of multiple events (within and across entities).

Related work
We review 5 existing types of data models for event data against the requirements of Sect. 2.2 showing that no existing data model or query language on sequential event logs or RDBs satisfies R1-R15.
#1. Single event log for a single, selected entity. Event logs as described in Sect. 2.1 and illustrated in Fig. 1(a) cannot correctly model or aggregate behavior of events related to multiple entities due to convergence and divergence as discussed in Sect. 2.2. Sequential event logs can be stored and queried using files [1] or through RDBs [44].
Of the 95 works surveyed [26, pp.133], several approaches exist to retrieve cases from event logs for temporal properties [30,43], for most frequent behavior [13], for sequences of activities [11] or algebraic expressions of sequence, choice, and parallelism over activities [45], or to check whether a temporal-logic property holds [40]. Several techniques support graph-based queries [12,27,30]. These techniques satisfy R7 and R8. However, they only support a single fixed case notion and thus fail R2, R10, R11.
#2. Event table with multiple entity identifiers. The model defines a single table in which each record is an event with multi-valued entity identifier attributes; it was first introduced by Popova et al. [39] and later formalized by Aalst et al. as object-centric log [3]. Redo event logs [35] include database operations and XOC logs [29] even include database snapshots. The BPAF [34] format is a precursor that allows querying event data of different processes [8] but not based on properties of specific events (violates R7). All these models only describe correlation (and data operations) of an event to multiple entities, but leave sequential ordering per entity implicit, see Fig. 1(d), which prevents R6 and R8. They are usually transformed to other formats for analysis [10,39].
#3. Event data in a relational database. Events are stored as time-stamped attributes and can be related to various entities through primary and foreign keys as shown in Fig. 1(e). Dijkman et al. [14] show a native, efficient relational algebra operator to query directly-following events. Also pre-defined behavioral patterns can be queried efficiently [42]. However, these operators are fixed to one entity identifier and querying paths requires unbounded joins (violates R4, R6, R10).
#4. Multiple logs, one per entity and per relation. Convergence and divergence can be avoided by extracting one log per entity providing multiple views [3,33,37,39], as shown in Fig. 1(b). Non-overlapping logs can be extracted automatically by partitioning the relational schema [33,39].
Interactions between entities can be modeled by extracting a sequential log per relation: per record in the relation, include all events of one entity preceded or succeeded by an event of another entity [33], as shown in Fig. 1(b). Extraction of an interaction log corresponds to reifying the relation into an entity which overlaps and synchronizes with other entities [24]. The approach of [37] provides a meta-model to extract event logs of different perspectives from user-defined, composite entities that also may overlap. However, the separation into multiple event logs violates R9 and R10, and, if logs do not overlap, also R15.
DAPOQ [26,Ch.7] generalizes various prior query languages to query and extract events in the context of their relational data model for behavior properties, but does not support retrieving individual cases (R7) or specifying behavioral and structural patterns (R8).
#5. Graph-based, events as nodes related to multiple entities. The technique in [9] supports graph-based SPARQL queries over event data in RDF format from multiple entities, but does not allow to select individual cases or querying for behavioral properties (violates R7,R8). Werner et al. [47] modeled behavior over two entity types in financial auditing as a graph over events describing "directly-follows" per entity or relation, but their model was not generalized or used for querying (violates R5-R11). Our graph-based model [22] shown in Fig. 1(c) generalizes the model of Werner et al. [47] to standard process concepts (Sect. 2.1) using labeled property graphs and Cypher [25].
Berti et al. [10] convert object-centric logs (format #2) into two separate graphs: one describes correlation of events to entities, and one describes the directly-follows relation between any two events per entity similar to [22,47]. Assuming that all relations between entities have been reified into entities (see #4), they aggregate the directly-follows relations per entity and satisfy R14,R15. However, in most event data in practice [18,33], such as Fig. 1, relations are not reified yet, limiting the applicability of [10] as it does not support R13. Further, the model does not support querying the event data prior to aggregation (violates R6, R7, R8, R10) because correlation and directly-follows relations are stored in separate graphs.
In this paper, we generalize our previously explored integrated, graph-based model [22] shown in Fig. 1 to be applicable to all kinds of real-life data sets while satisfying R1-R11 and additionally R12-R15, thereby subsuming prior works.

Labeled property graphs and querying
Labeled Property Graphs (LPGs) are a data structure used in graph databases (GDBs) [41]. An LPG G = (N, R, label, prop) consists of nodes N (vertices) and relationships R (edges) where each relationship r ∈ R defines a directed edge r⃗ = (n_1, n_2) ∈ N × N between two nodes. The labeling function label : N ∪ R → 2^Label assigns each node and each relationship a non-empty set of labels designating their type. Function prop : (N ∪ R) × K → V assigns each node or relationship an arbitrary number of key-value pairs, called properties. We write n.k = v for prop(n, k) = v, and n.k = ⊥ if k is undefined for n.
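To make the definition concrete, the following Python sketch encodes an LPG as plain dictionaries. It is only an illustrative in-memory structure under our own naming (the class LPG, its methods, the relationship label SUPERVISES, and the property value 2018 are hypothetical), not the storage model of any graph database.

```python
from dataclasses import dataclass, field

@dataclass
class LPG:
    """Minimal in-memory LPG G = (N, R, label, prop)."""
    nodes: set = field(default_factory=set)     # N
    rels: dict = field(default_factory=dict)    # r -> (n1, n2), the directed edge of r
    labels: dict = field(default_factory=dict)  # label: N ∪ R -> set of labels
    props: dict = field(default_factory=dict)   # prop: (N ∪ R) x K -> V

    def add_node(self, n, labels, **properties):
        self.nodes.add(n)
        self.labels[n] = set(labels)
        for k, v in properties.items():
            self.props[(n, k)] = v

    def add_rel(self, r, source, target, labels, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise ValueError("relationship endpoints must be nodes of the graph")
        self.rels[r] = (source, target)
        self.labels[r] = set(labels)
        for k, v in properties.items():
            self.props[(r, k)] = v

    def prop(self, x, k):
        """Return x.k, or None (standing in for ⊥) if k is undefined for x."""
        return self.props.get((x, k))

g = LPG()
g.add_node("dirk", {"Person", "Professor"}, Name="Dirk")
g.add_node("stefan", {"Person", "Student"}, Name="Stefan")
# Hypothetical relationship label and property value, for illustration only.
g.add_rel("r1", "dirk", "stefan", {"SUPERVISES"}, Since=2018)
```

Here g.prop("dirk", "Name") plays the role of n.k, and an undefined key yields None in place of ⊥.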

Fig. 2 Labeled Property Graph
The example LPG in Fig. 2 models the relationships between a professor and two students. The example contains nodes with the labels :Person, :Professor, :Student and :Document. The document you are currently reading is authored by Stefan, a student supervised by Dirk, who co-authors this document; Miro is another student contributing to this paper. The "Name" of each person is a property of the :Person nodes; "Type" is a property of :Document nodes. The described relationships between the nodes can also hold properties, like the starting date of a supervision. Neo4j supports multiple labels for nodes, while relationships have exactly one label.
Cypher is a language for querying LPGs [25] and is supported by Neo4j. Cypher queries use pattern matching to select sub-graphs of interest. The pattern (n:Label {Prop: Value}) matches any node labeled :Label that has property Prop set to Value. Pattern (n)-[r:Label]->(m) matches any relationship labeled :Label from a node n to a node m. Any combination of nodes and relations (n, r, m) that match the pattern are included in the result set; if any variable n, r, m is already bound, then only combinations including the bound nodes/relationships will be returned. We explain the Cypher query concepts used in this paper by a single (albeit inefficient) example query. For the graph in Fig. 2, we query for the longest path between "Dirk" and a student, other than "Stefan", who also works on a document that "Dirk" co-authors. The MATCH clause retrieves pairs of nodes s and p and a path between s and p that match the pattern in line 1: a Student s related to Professor p by a path -[*]- of arbitrary relationships, direction, and length. The WHERE clause in line 2 restricts the pattern such that the student's name cannot be "Stefan". By defining the professor's name property to be "Dirk" in line 1 we also restrict the pattern. WITH in line 3 formats and renames the result set (e.g., s renamed to student) that is passed to the next query from line 4 on: variable student in lines 4-6 may only take values retrieved for variable s in lines 1-2, e.g., "Miro" but not "Stefan". Line 4 matches the documents Dirk co-authors and line 5 restricts the results to documents that have a direct relationship to a student.
The RETURN statement in line 6 formats the result set of lines 4-5 as output. For the example graph, the student is "Miro" and the document d is the "paper"; variable paths contains the 2 possible paths between Miro and Dirk. One walks over Stefan and one does not. "Length()" is a function of Cypher that returns the hops needed to walk a path. Lines 7-8 sort the results by path lengths (ORDER BY clause) in descending order (DESC ) and return only the first path of this ordered list (LIMIT 1).
Instead of RETURN, a Cypher query may also end with a statement CREATE (student)<-[r:FOUND]-(d) to add a new relationship of label :FOUND from node d to node student; statement MERGE only creates the specified node/relationship if it does not exist yet; see [21] for more details.

Representing Multi-Dimensional Event Data in Labeled Property Graphs
Labeled property graphs introduced in Sect. 2.4 allow versatile data modeling of various concepts and relations between concepts. In this section, we propose how to model the central concepts and relations of process event data of Sect. 2.1 in labeled property graphs. Figure 3 summarizes our proposal, which we explain in detail below. In Sect. 4, we constrain how the concepts and relations may occur in a labeled property graph describing event data, thereby defining the semantics for process concepts in terms of LPGs. In that section, we also discuss how to refine the proposed node and relationship types to aid in the analysis.

Modeling Events Related to Multiple Entities or Logs
We introduce node and relation types for event, entity, and event log; together they describe the instance-level concepts of Fig. 3, i.e., concrete entities or recorded events.
Event. We represent the core element of each event log, the event, as a node with label :Event as shown in Fig. 3. Of the three mandatory event attributes (cf. Sect. 2.1), we only model activity and timestamp as properties of event nodes, having datatypes STRING and DATETIME, respectively. The third attribute, the entity identifier, is modeled through the separate Entity concept described next.
Entity. Single-dimensional event logs fix a single entity identifier (called case identifier, cf. Sect. 2.1) to which each event is correlated. Our model abandons the notion of a case identifier in favor of the more general Entity concept. We model each entity as a node with the label :Entity as shown in Fig. 3. Its property EntityType describes the type of the entity. Property ID is the entity identifier. We require the combination of EntityType and ID, stored as property uID, to be unique in the entire graph, similar to a primary key value in a relational database (indicated by uID being underlined in Fig. 3). We keep our model limited to describing the existence of entities and defer modeling of more specific entity types and structural relations between entities and types to existing proposals [41].
The graph in Fig. 4 models 4 entities of 4 different entity types. Each :Entity node represents a concrete entity related to the process, such as Application 1 and Offer 1 of our running example of Tab. 1. Our model also allows entities that are not classically considered part of the process execution such as Resource 11. Finally, an entity node can describe a reified relation between two existing entities, such as the entity (1, 1) of type A+O describing the relation between Application 1 and Offer 1.
We model correlation of an event to an entity by an event-to-entity relationship type: (:Event)-[:E_EN]->(:Entity). Specifically through this relationship, we can correlate any event to any number of entities of different types, allowing for multi-dimensional correlation of events to entities as shown in Fig. 4. For example, event e4 is correlated to Resource 11, Offer 1, and reified relation A+O (1,1) whereas event e2 is correlated to Application 1 and A+O (1,1).
Event Log. Some event data sets, such as BPIC'15 [16], consist of multiple event logs. To support multiple logs in one graph instance, we introduce a separate node type for logs with the label :Log as shown in Fig. 3. Similar to the entities, we model which event belongs to which log with a relationship type from events to logs, :E_L.
Attributes. Figure 3 shows all properties our graph data model expects to be present. Additionally, any event, entity or log node can carry any other property.

Modeling Behavior as Paths
Directly-Follows. Events are ordered by time from the viewpoint of an entity they are correlated to (cf. Sect. 2.1). As our model allows events to be correlated to multiple entities, each event may have multiple "next" events, depending on the entity. We model temporal ordering of events through a :DF relationship between any two events x and y that directly follow each other from the perspective of one or more entities: (x:Event)-[:DF]->(y:Event). Each :DF has the list of EntityTypes for which this relationship holds as property.
In Fig. 4, events e1, e2, and e9 follow each other for entity type Application, i.e., Application 1; events e2, e4, e7, e9 follow each other for entity type A+O; and e7 also follows e4 for Offer. Note that in this way, all directly-follows relations of the events of Tab. 1 are modeled correctly and in a single data structure, compared to the other models discussed in Fig. 1.
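The derivation of directly-follows relationships per entity can be sketched in a few lines of Python. The correlations below are only those mentioned in the text for Fig. 4; the numeric order standing in for timestamps and the function name derive_df are assumptions made for this illustration, and the sketch is not the set of Cypher queries used for the actual transformation.

```python
from collections import defaultdict

# Event-to-entity correlation as mentioned in the text for Fig. 4 (partial).
e_en = {
    "e1": [("Application", "1")],
    "e2": [("Application", "1"), ("A+O", "(1,1)")],
    "e4": [("Resource", "11"), ("Offer", "1"), ("A+O", "(1,1)")],
    "e7": [("Offer", "1"), ("A+O", "(1,1)")],
    "e9": [("Application", "1"), ("A+O", "(1,1)")],
}
# Assumption: the event number reflects the temporal order of the events.
order = {"e1": 1, "e2": 2, "e4": 4, "e7": 7, "e9": 9}

def derive_df(e_en, order):
    """Return {(x, y): sorted entity types} for events directly following per entity."""
    per_entity = defaultdict(list)
    for event, entities in e_en.items():
        for entity in entities:
            per_entity[entity].append(event)
    df = defaultdict(set)
    for (entity_type, _), evs in per_entity.items():
        evs.sort(key=order.__getitem__)
        for x, y in zip(evs, evs[1:]):
            df[(x, y)].add(entity_type)
    return {edge: sorted(types) for edge, types in df.items()}

df = derive_df(e_en, order)
```

The edge from e4 to e7 carries both A+O and Offer, matching the observation that e7 follows e4 for two entity types, while e2 has two different "next" events (e4 for A+O and e9 for Application).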

Modeling Aggregations of Events and Behavior
Event classes. While we assume each event to have the mandatory attribute Activity, events can be classified in other ways as well (cf. Sect. 2.1). In our model, event classes are nodes with label :Class, and events are related to their classes by :E_C relationships (cf. Fig. 3). The example graph of Fig. 5 shows non-standard event classes. Two event classes of type "Resource" are defined by the Resource entities occurring in the data, e.g., Res1 and Res2. Events e5, e6, and e8 belong to class Res1. Five event classes of type Last 2 Activities are defined by the sequence of the current and previous activity for this event, e.g., distinguishing "first A" (−, A), "repeated A" (A, A), and "A after B" (B, A). Events e2, e4, e7 belong to class (A, B).
Directly-follows on event classes. Event data analysis aggregates directly-follows relations between events to directly-follows relations between event classes (cf. Sect. 2.1). Our model of Fig. 3 provides relationship (:Class)-[:DF_C]->(:Class). As for :DF, each :DF_C relationship lists in attribute EntityType for which entity types the aggregated directly-follows relationship holds. This allows describing aggregated behavior per entity type. The graph in Fig. 5 shows the aggregation of :DF to :DF_C, assuming a single entity type. The two sub-graphs induced by :DF_C describe a hand-over of work social network (bottom) [4] and a transition system [5] (top) of the event data in the same model.
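The aggregation from event-level to class-level directly-follows can be illustrated with the following Python sketch. The events, classes, the entity type "Case", and the frequency count stored per class-level edge are hypothetical choices for this example; the model itself only requires the entity types to be listed on each class-level relationship.

```python
from collections import defaultdict

def aggregate_df(df, event_class):
    """Lift event-level DF edges to class-level edges.

    df:          {(x, y): list of entity types} on events
    event_class: {event: class} for one chosen classifier
    Returns {(class_x, class_y): {entity type: number of aggregated DF edges}}.
    """
    df_c = defaultdict(lambda: defaultdict(int))
    for (x, y), entity_types in df.items():
        edge = (event_class[x], event_class[y])
        for entity_type in entity_types:
            df_c[edge][entity_type] += 1
    return {edge: dict(counts) for edge, counts in df_c.items()}

# Hypothetical events, classes, and DF edges for a single entity type "Case".
event_class = {"e1": "A", "e2": "B", "e3": "A", "e4": "B"}
df = {("e1", "e2"): ["Case"], ("e2", "e3"): ["Case"], ("e3", "e4"): ["Case"]}
df_c = aggregate_df(df, event_class)
```

Swapping event_class for a different classifier (for instance the resource of each event) yields a different class-level graph over the same event-level edges.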

Semantics of Entities and Events in Labeled Property Graphs
The node and relationship types introduced in Sect. 3 allow us to model multidimensional event data in LPGs. However, LPGs allow for unrestricted use of any node and relationship types which would allow for creating LPGs that do not capture the semantics of event data. For example, Figure 6 only uses the node and relationship types of Sect. 3 but the graph violates the semantics the types shall encode: :DF does not order events e2 and e3 according to their timestamp, events e1 and e2 are ordered by :DF but belong to different entities, and event e3 even directly-follows itself.
In this section, we formulate constraints on how the nodes and edges over the types of Sect. 3 may be related, thereby giving them semantics. In the following, we formulate such constraints for any labeled property graph G = (N, R, label, prop) (see Sect. 2.4).

Strictly Typed Nodes
We formalize the semantics of the node labels Entity, Event, Log, Class and of the relationship labels E EN (event to entity), E L (event to log), DF (directly-follows on events), E C (event to event class), and DF CL (directly-follows on event classes).
Each node/relationship may have at most one of these types (i.e., no node or relationship may have two different semantic roles). Formally, Label N = {Entity, Event, Log, Class}, Label R = {E EN, E L, DF, E C, DF CL} and for each n ∈ N, |label(n) ∩ Label N| ≤ 1 and |label(n) ∩ Label R| = 0 and for each r ∈ R, |label(r) ∩ Label N| = 0 and |label(r) ∩ Label R| ≤ 1.
As all nodes and relationships of interest carry exactly one label in Label N and Label R, which are disjoint sets, we write n.label and r.label for their labels in the following. Further, we write L for the set {n ∈ N | n.label = L} of nodes and for the set {r ∈ R | r.label = L} of relationships carrying label L, respectively. For example, n ∈ Entity and (e1, e2) ∈ DF.

Semantics of Event-Entity Relations
The Event-Entity relationship E EN correlates an event to its process entities. While each event e can be related to multiple different entities, for example an Application and a Resource, there must not be two E EN relationships to the same entity. Furthermore, each event is correlated to some entity, and vice versa, as shown in Fig. 7. Formally, the following properties have to hold:
1. Between any pair of e ∈ Event and n ∈ Entity there is at most one relation r ∈ E EN with →r = (e, n). As a shorthand, we write E EN ⊆ Event × Entity, and (e, n) ∈ E EN.
2. Each event e ∈ Event is correlated to at least one entity: there exists (e, n) ∈ E EN.
3. Each entity n ∈ Entity is correlated to at least one event: there exists (e, n) ∈ E EN.

Semantics of Log-Event Relations
Log-Event relationships L E explicitly encode which events belong to which log. Every event must be in exactly one log, and each log must have at least one event, as shown in Fig. 7. Formally, the following properties have to hold: Each event e ∈ Event is in exactly one event log: there exists exactly one r ∈ L E with →r = (l, e).
3. Each event log l ∈ Log has at least one event: there exists at least one r ∈ L E with →r = (l, e).

Semantics of Directly-Follows Relation
We model temporal relations as paths of :DF relationships over :Event. Each :DF relationship must go forward in time from the point of view of an :Entity node correlated to both events involved, as shown in Fig. 7. Overall, all :DF relationships induce a partial order. Formally, the following properties have to hold: For every (e1, e2) ∈ DF, there exists a log l ∈ Log with (l, e1) and (l, e2) ∈ L E.
3. For every (e1, e2) ∈ DF, e1.Timestamp ≤ e2.Timestamp holds, i.e., events are ordered by time.
4. For every (e1, e2) ∈ DF, there exists an entity n ∈ Entity with (e1, n) and (e2, n) ∈ E EN such that n.EntityType ∈ (e1, e2).EntityTypes, and there exists no event ex correlated to n, (ex, n) ∈ E EN, such that e1.Timestamp < ex.Timestamp < e2.Timestamp holds, i.e., e2 directly-follows e1 from the perspective of entity n.
5. For all events e1, (e1, e1) ∉ DF, i.e., :DF is irreflexive.
6. For all events e0, . . . , en ∈ Event, there exists no cycle (e0, e1), (e1, e2), . . . , (en−1, en), (en, e0) ∈ DF, i.e., :DF is acyclic and hence the transitive closure of :DF is a partial order.
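The constraints above can be checked mechanically on a small graph. The following is a minimal plain-Python sketch, not one of the paper's Cypher queries; the toy events, the entity `app1`, and the function name `df_violations` are hypothetical. It covers the timestamp order, the "no event in between" condition per shared entity, irreflexivity, and acyclicity, and it omits the same-log condition.

```python
from datetime import datetime

# Hypothetical toy graph (not Tab. 1): event timestamps, E_EN pairs, entity types.
events = {
    "e1": datetime(2021, 1, 1, 9, 0),
    "e2": datetime(2021, 1, 1, 10, 0),
    "e3": datetime(2021, 1, 1, 11, 0),
}
entity_type = {"app1": "Application"}
e_en = {("e1", "app1"), ("e2", "app1"), ("e3", "app1")}


def df_violations(df):
    """Return violations found in df, a map (e1, e2) -> list of entity types."""
    problems = []
    for (a, b), types in df.items():
        if a == b:  # irreflexivity
            problems.append(f"{a} directly follows itself")
            continue
        if events[a] > events[b]:  # events must be ordered by time
            problems.append(f"({a},{b}) goes backwards in time")
        # Some shared entity of a listed type must have no event between a and b.
        ok = False
        for (e, n) in e_en:
            if e != a or (b, n) not in e_en or entity_type[n] not in types:
                continue
            between = [x for (x, m) in e_en
                       if m == n and events[a] < events[x] < events[b]]
            if not between:
                ok = True
        if not ok:
            problems.append(f"({a},{b}) is not directly-following for any shared entity")

    # Acyclicity: depth-first search over the DF successor lists.
    succ = {}
    for (a, b) in df:
        succ.setdefault(a, []).append(b)
    state = {}  # "open" while on the DFS stack, "done" afterwards

    def has_cycle(node):
        state[node] = "open"
        for nxt in succ.get(node, []):
            if state.get(nxt) == "open":
                return True
            if nxt not in state and has_cycle(nxt):
                return True
        state[node] = "done"
        return False

    if any(has_cycle(n) for n in list(succ) if n not in state):
        problems.append("DF contains a cycle")
    return problems


good = {("e1", "e2"): ["Application"], ("e2", "e3"): ["Application"]}
skipping = {("e1", "e3"): ["Application"]}  # e2 lies in between for app1
```

Calling `df_violations(good)` yields no violations, whereas `df_violations(skipping)` reports that the edge skips an event of the shared entity.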

Semantics of Event-Class relation
Events relate to :Class nodes in a similar way as to :Entity nodes: relationship :E C relates each event to at least one class of the same type, and vice versa. Formally, the following properties have to hold: Each event e ∈ Event has at least one event class: there exists (e, c) ∈ E C, and there are no two classes (e, c1), (e, c2) ∈ E C with c1 ≠ c2 and c1.Type = c2.Type.

Semantics of Class-level Directly-Follows Relation
The class-level directly-follows relationship :DF C is only defined between :Class nodes of the same type, and may only aggregate :DF relationships between events correlated to the same entity, as shown in Fig. 8. Formally, the following properties have to hold:
1. DF C ⊆ Class × Class.
2. Any two related classes (c1, c2) ∈ DF C are of the same type, c1.Type = c2.Type.
3. For any two related classes (c1, c2) ∈ DF C, there exist events e1, e2 of these classes, (e1, c1), (e2, c2) ∈ E C, ordered in the same way as c1 and c2, (e1, e2) ∈ DF, for corresponding entity types, (e1, e2).EntityTypes ⊆ (c1, c2).EntityTypes.

Refined Directly-Follows Relation
The :DF relationship defined in Sect. 4.4 ensures that there is a single :DF relationship between any two ordered events, allowing for writing simple queries. However, queries over paths cannot easily restrict :DF relationships to specific entity types as these are stored as lists. This renders queries over multiple entities very complex or infeasible, such as Q6 in Sect. 6. Here, we discuss two options to refine :DF relationships. We can refine the relationship label :DF into a set of relationship labels with one :DF type label for each value type of the EntityType property occurring in the graph. In the example of Fig. 9, :DF is refined into three labels :DF Application, :DF Offer, and :DF A+O. All relationships of :DF type have to satisfy the constraints of Sect. 4.4 and additionally, if (e1, e2) ∈ DF type then both events are correlated, (e1, n), (e2, n) ∈ E EN, to an entity n of this type, n.EntityType = type.
In the resulting model, there can be as many dedicated :DF type relations between two events as there are entity types in the data. In consequence, the required disk space can grow significantly, depending on the log size and number of entity types. The analysis in turn becomes more efficient because queries can directly match :DF type labels, which allowed us to define a query for Q6 (see Sect. 6). We can refine the :DF C relation to :DF type relations in the same way.
Next to encoding the entity type in the label, we can also add a property EntityType to the :DF relationships leading to a similar data model with the same number of relationships, but using all the same :DF label. Formally, we now may have more than one :DF relationship between two events. This encoding has advantages when writing queries for aggregating :DF relations.

Translating Event Logs to Labeled Property Graphs
We now present a semi-automatic procedure for translating event tables with multiple entity identifiers in CSV format (cf. Sect. 2.3) into the graph data structure introduced in Sect. 3 satisfying the semantic constraints of Sect. 4.
In a nutshell, our method has the following steps. (5.1) We assume the event data to be given in the form of an event table where each record describes one event. (5.2) We translate each record with all its attributes to an :Event node in the LPG with corresponding properties, obtaining a graph of unrelated :Event nodes. (5.3) We create :Log nodes for each log in the source data set and correlate them with the respective :Event nodes. (5.4) We provide query templates to extract :Entity nodes from :Event properties (e.g. identifiers) and to correlate :Event nodes to all their :Entity nodes. (5.5) A generic query derives the entity-specific directly-follows :DF relationships between events. (5.6) Finally, we provide queries to reify relations between entities into new composite entities, which allows deriving :DF relationships of interactions between entities. We explain the queries on the running example of Tab. 1.
We demonstrate the types of graphs obtained on the full BPIC17 dataset [18] in Sect. 5.7 and report on a quantitative evaluation of all datasets in Sect. 5.8.

Source Event Data Format
We expect the event data of the source log to be in event table format (see Sect. 2.3) defining columns Activity and Timestamp and multiple columns Attribute1,. . . ,AttributeN that also contain entity identifiers. In case the data comes from multiple logs, also a LogID column is required.

Import the Events
The following Cypher query imports the entire event table into the graph such that each row of the table translates to one :Event node with one property per attribute (column) in the table. Importing the first four rows of Tab. 1 results in the graph shown in
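As an illustration of the import step, the sketch below turns each row of a small event table into one event node with one property per non-empty column. It is plain Python rather than the Cypher import query; the table content, the node layout as a dict, and the choice to drop empty cells are assumptions made for this example only.

```python
import csv
import io

# Hypothetical event table; only Activity and Timestamp are columns required by the text.
EVENT_TABLE = """Activity,Timestamp,Appl,oID,Origin
A_Create,2021-01-01T09:00:00,1,,A
O_Create,2021-01-01T10:00:00,1,1,O
"""


def import_events(csv_text):
    """Translate each table row into one event node (label set plus property dict)."""
    nodes = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: empty cells are dropped so absent attributes do not become properties.
        props = {key: value for key, value in row.items() if value != ""}
        nodes.append({"labels": {"Event"}, "props": props})
    return nodes


event_nodes = import_events(EVENT_TABLE)
```

The result is a list of unrelated event nodes, in the same order as the rows of the source table.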

Create Entity Nodes and Correlate Events (using Domain Knowledge)
We create entities and correlate events to entities in two steps. First we identify and create the set of all entities that occur in the data by creating :Entity nodes. Then we correlate each event to all its entities by creating :E EN relationships.
Creating entities from events requires domain knowledge about how entity types and identifiers are stored in the event data. The user decides whether the presence of an identifier allows correlating the event to an entity. In our running example, events can have two different entity identifiers: Appl and oID (see Tab. 1 or Fig. 10). The property Origin designates their "owning" entity Application, Workflow, or Offer. For example, only events with e.Origin = "A" belong to the Application entity with id e.Appl.
The following query template provides 3 parameters. Type sets the entity type to which an event shall be correlated; EntityID is the event property that defines the entity identifier to which the event shall be correlated; EntityProperty allows correlating only those events having a specific property. Calling the template with EntityProperty ≡ e.Origin = "A", EntityID ≡ "Appl", Type ≡ "Application" creates one :Entity node of type "Application" for each value of e.Appl found in all event nodes where e.Origin = "A", i.e., e1 and e2 in Fig. 10. Prefixing the entity id with its type ensures a unique :Entity node id (uID). The following query template with the same parameters correlates each matching event to its entity node.
The above query template has to be executed for each entity type in the data.
Assuming each event contains correlation information for at least one entity, it conforms to the semantic requirements of Sect. 4.2. The event graph after creating entities for Application, Workflow, and Offer is shown in Fig. 11(left).
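The two-step procedure (create entities, then correlate events) can be sketched in plain Python as follows. This is not the Cypher template itself: the function names, the dict-based event layout, and the toy events are hypothetical, and the three parameters mirror Type, EntityID, and EntityProperty from the text, with EntityProperty passed as a predicate.

```python
def create_entities(events, entity_type, entity_id, predicate):
    """Create one entity per distinct identifier value among the matching events."""
    entities = {}
    for e in events:
        if predicate(e) and e.get(entity_id) is not None:
            uid = f"{entity_type}_{e[entity_id]}"  # type prefix keeps the uID unique
            entities[uid] = {"uID": uid, "ID": e[entity_id], "EntityType": entity_type}
    return entities


def correlate(events, entities, entity_type, entity_id, predicate):
    """Return the E_EN pairs (event id, entity uID) for all matching events."""
    e_en = set()
    for e in events:
        if predicate(e) and e.get(entity_id) is not None:
            uid = f"{entity_type}_{e[entity_id]}"
            if uid in entities:
                e_en.add((e["id"], uid))
    return e_en


# Hypothetical toy events, loosely following the running example's attribute names.
toy_events = [
    {"id": "e1", "Origin": "A", "Appl": "1"},
    {"id": "e2", "Origin": "A", "Appl": "1"},
    {"id": "e3", "Origin": "O", "Appl": "1", "oID": "1"},
]


def is_application_event(e):
    return e["Origin"] == "A"


applications = create_entities(toy_events, "Application", "Appl", is_application_event)
application_e_en = correlate(toy_events, applications, "Application", "Appl", is_application_event)
```

As with the template in the text, the two functions would be called once per entity type, each time with the identifier property and filter that the domain knowledge suggests.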

Create Entity-specific Directly-Follows Relation
We derive the directly-follows relation per entity from the timestamp attribute of each event. We collect all :Event nodes correlated to the same :Entity node n via :E EN (lines 1-2 in the query below), order all events by their timestamp attribute, and collect them in an eventList of length k (lines 3-4). We then iterate over the 0-indexed eventList = e 0 , . . . , e k−1 (lines 5-6) and create a :DF relationship from e i to e i+1 for each i = 0, . . . , k − 2. The data may contain events with identical timestamps, typically due to coarse-grained or imprecise recording [31,38]. To ensure that all directly-follows relations form a directed acyclic graph (see Sect. 4.4), we need to provide a globally consistent ordering for events with identical timestamps. We do so using the internal unique ID(e) of the :Event nodes in line 3 to order events by ID(e) in case their timestamps are identical. As we import the events in the same order as in the source data, ID(e) is consistent with the implicit ordering in the source data.
The query creates :DF relationships for events per entity node in the graph; through MERGE in line 7, we ensure that we only add relationships between different events per EntityType as discussed in Sect. 4.7. We may also create entity type-specific :DF type relationships when using type as parameter: we add WHERE n.EntityType = type in line 1 and use MERGE ( e1 ) -[df:DF type]->( e2 ) instead in line 7. Figure 11(right) shows a :DF Application relationship created between e1 and e2 of the running example using this adapted query. Creating :DF relationships in this way conforms to the constraints of Sect. 4.4.
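The derivation just described can be sketched in plain Python: group events per entity, sort by timestamp with the import position as tie-breaker, and link consecutive events. The function name `derive_df`, the toy data, and the use of an explicit import position in place of Neo4j's internal ID(e) are assumptions of this sketch, not part of the paper's query.

```python
from datetime import datetime


def derive_df(events, e_en, entity_type):
    """Derive DF edges per entity.

    events: event id -> (timestamp, import position)
    e_en: set of (event id, entity id) pairs
    entity_type: entity id -> entity type
    Returns {(e_i, e_i+1): sorted list of entity types for which the edge holds}.
    """
    per_entity = {}
    for e, n in e_en:
        per_entity.setdefault(n, []).append(e)
    df = {}
    for n, evs in per_entity.items():
        # Tuples sort by timestamp first, then by import position for identical timestamps.
        evs.sort(key=lambda e: events[e])
        for a, b in zip(evs, evs[1:]):
            df.setdefault((a, b), set()).add(entity_type[n])
    return {edge: sorted(types) for edge, types in df.items()}


# Hypothetical toy data; e2 and e3 share a timestamp and are ordered by import position.
toy_events = {
    "e1": (datetime(2021, 1, 1, 9), 0),
    "e2": (datetime(2021, 1, 1, 10), 1),
    "e3": (datetime(2021, 1, 1, 10), 2),
}
toy_e_en = {("e1", "app1"), ("e2", "app1"), ("e3", "app1"), ("e2", "off1"), ("e3", "off1")}
toy_types = {"app1": "Application", "off1": "Offer"}

df_edges = derive_df(toy_events, toy_e_en, toy_types)
```

Keeping a set of entity types per edge corresponds to the single :DF relationship with a list property; emitting one edge per (pair, type) instead would correspond to the refined :DF type relationships of Sect. 4.7.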

Reify relations between entities into composite entities for describing interactions
Entity creation and correlation may leave events of different entities unrelated if an event is not explicitly related to more than one entity. In our running example of BPIC17 [18], events are tightly correlated to either an Application, Offer, or Workflow entity as shown in Fig. 11(left). Deriving directly-follows relations per entity as in Sect. 5.5 leaves these entities disconnected.
We cannot connect Offer entities by further correlating Offer events to other existing entities such as Application. If we correlated e3 and e4 directly to Application 1 by entity identifier Appl., we would "pollute" the directly-follows relation of Application 1 with events that are only remotely related to it, resulting in convergence errors (see Sect. 2.2). Instead, we have to model the interactions between two entities n1 and n2 by reifying the relation between n1 and n2 into a composite entity r and then derive :DF relationships for r [24,33]. This also requires domain knowledge.
Our data model starts from recorded events, thus we have to infer relations between entities from event attributes. Assume two events (e1:Event) -[:E EN]-> (n1:Entity) and (e2:Event) -[:E EN]-> (n2:Entity) are correlated to different entities n1 <> n2. If e2 contains some property refto1 referencing the entity identifier ID of n1, i.e., a foreign key, we observe that n2 is related to n1 through event e2. In our running example, we observe that Offer 1 is related to Application 1 through event e4 of Offer 1 via property Appl., see Fig. 12.
We lift this observation to entity types. The relation R from entity type1 to entity type2 is the set of all pairs (n1.ID, n2.ID) where (1) there is an entity n1 of type1, (2) some event (e2:Event) -[:E EN]-> (n2:Entity) is correlated to entity n2 of type2, (3) with e2.refto1 = n1.ID. Lines 1-5 of the query template below construct this relation R for chosen parameters type1, type2, and refto1. Line 6 reifies relation R by creating a new composite entity r of typeR for each pair (n1.ID, n2.ID).   Applying the above queries for type1 ≡ Application, type2 ≡ Offer, refto1 ≡ Appl, typeR ≡ Case AO, on our running example results in the new entity with ID (1, 1) in the graph of Fig. 12. The new entity r refers to n1 and n2 by properties type1ID and type2ID, e.g., ApplicationID and OrderID; r's own ID is the combination of n1.ID and n2.ID.
Any event e correlated to an entity n to which the composite entity r refers (by r.type1ID or r.type2ID) can now be correlated to r using the following generic query template. Applying the above query twice, once for type ≡ Application and once for type ≡ Offer both with typeR ≡ Case AO adds the :E EN relationships from e2 and e4 to Case AO (1, 1) shown in Fig. 12. Depending on the available domain knowledge, the correlation query can be made more specific by adding WHERE clauses for only correlating events that satisfy specific properties.
We may now derive :DF relationships for the composite entity typeR using the queries of Sect. 5.5; for example, Fig. 12 also shows relationship :DF Case AO derived for the composite entity of type Case AO. By correlating events of related entities Application 1 and Offer 1 to their own reified entity Case AO (1,1), we could construct a new directly-follows relation :DF Case AO describing the interaction between Application 1 and Offer 1. The original directly-follows relations :DF Application and :DF Offer remain as before and "unpolluted".
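The reification and the subsequent correlation can be illustrated with a small plain-Python sketch. It is not the paper's query template: the function names, the dict layout, the toy entities, and the generated property names (here `ApplicationID` and `OfferID`, built from the type names) are assumptions for illustration.

```python
def reify(entities, events, e_en, type1, type2, ref_to_1, type_r):
    """Create one composite entity per pair (n1, n2) where an event of n2 references n1."""
    composites = {}
    for e_id, n2 in e_en:
        if entities[n2]["EntityType"] != type2:
            continue
        ref = events[e_id].get(ref_to_1)  # foreign key to an entity of type1, if present
        for props in entities.values():
            if props["EntityType"] == type1 and props["ID"] == ref:
                uid = f"{type_r}_{props['ID']}_{entities[n2]['ID']}"
                composites[uid] = {
                    "EntityType": type_r,
                    f"{type1}ID": props["ID"],
                    f"{type2}ID": entities[n2]["ID"],
                }
    return composites


def correlate_to_composites(entities, composites, e_en, type_name):
    """Correlate each event of an entity of type_name to the composites referring to it."""
    new_pairs = set()
    for e_id, n in e_en:
        if entities[n]["EntityType"] != type_name:
            continue
        for uid, r in composites.items():
            if r.get(f"{type_name}ID") == entities[n]["ID"]:
                new_pairs.add((e_id, uid))
    return new_pairs


# Hypothetical toy data: e4 belongs to Offer 1 and references Application 1 via "Appl".
toy_entities = {
    "Application_1": {"EntityType": "Application", "ID": "1"},
    "Offer_1": {"EntityType": "Offer", "ID": "1"},
}
toy_events = {"e2": {"Appl": "1"}, "e4": {"Appl": "1", "oID": "1"}}
toy_e_en = {("e2", "Application_1"), ("e4", "Offer_1")}

case_ao = reify(toy_entities, toy_events, toy_e_en, "Application", "Offer", "Appl", "Case_AO")
case_ao_e_en = (
    correlate_to_composites(toy_entities, case_ao, toy_e_en, "Application")
    | correlate_to_composites(toy_entities, case_ao, toy_e_en, "Offer")
)
```

After this step, the directly-follows derivation of Sect. 5.5 can be run on the composite entity just like on any other entity, leaving the relations of Application and Offer untouched.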

Demonstration on BPIC17
We applied the above queries on the events of the full BPIC17 dataset [18]. After importing all events⁴, we derived entities for 3 types: Application, Workflow, Offer. We reified the binary relations between these three entities into Case AO, Case AW, Case WO and derived entity-specific :DF type relationships. Figure 13 shows the graph of handling loan application 681547497 involving one Application entity (dark blue), one Workflow entity (light blue), and two Offer entities (orange). Interactions are shown through the grey :DF relationships of Case AO, Case AW, and Case WO.⁵ The graph shows how both offers are created and handled concurrently to the application entity. Figure 14 visualizes 7 randomly selected process executions: the 1st and 4th involve only one Offer whereas all others involve two offer entities; some executions in BPIC17 involve 5 or more offer entities. Offers may be created in parallel (2nd, 7th) or with Application events in between (3rd, 5th, 6th). Offers may conclude in parallel (2nd, 3rd, 4th) or with Application events in between (6th, 7th).
⁴ We filtered out events with lifecycle attribute suspend or resume for reducing the size of the figures.
⁵ To simplify the visualization, the graph does not contain :DF Case AO, :DF Case AW, :DF Case WO relationships which are in parallel to a DF Application, DF Workflow, DF Offer relationship.
Fig. 14 Graph of 7 randomly selected process executions of BPIC17 [18]
We then derived Resource as additional entity from the e.resource property of events. While Application, Workflow, and Offer are local to a process execution, the Resource entities describe workers who persist in the system and work on many entities. We derived the :DF Resource relationships. Querying the data for the events of the 7 process executions of Fig. 14  We can clearly see that all process executions and entities are tightly connected through the resources. Each resource is always involved for a sequence of several events of the same or related entities, and then moves to another entity in another process execution while handing the previous entity over to another resource.
Overlaying :DF Resource relationships also allows us to see that interactions between related Application, Workflow, and Offer entities of the same process execution are not explained by Resource entities. In the graph in Fig. 16, the first event of Offer 1647347263 correlated to User 85 follows after an Application event (via Case AO) correlated to User 7, i.e., there is no resource explaining the ordering of Application and Offer. This confirms the importance of reifying relations between entities into composite entities; otherwise, the graph would describe that Offer 1647347263 would start concurrently to all preceding events.

Evaluation
We applied the above steps for importing and transforming the event data into our proposed graph-based data model on 5 real-life datasets [15,16,17,18,20]⁶ using a Neo4j instance with 20GB of main memory allocated⁷ and at [23]. We measured the time required for the conversion and the memory requirements for storing the data in Neo4j.
⁶ For BPIC2016, we omitted all click events without a session identifier as these could not be correlated.
All translations succeeded within several minutes, even for the largest datasets; however, explicitly encoding the structural information requires significant space. The results are shown in Table 2. We observed execution time and space to grow linearly with the number of entities to derive and relations to reify, as per event and entity we derive one :E EN relationship and one :DF relationship. In general, the size of the source log is not a solid indicator for the size of a graph event log. For example, the BPIC'14 is small in size but defines several related entities, resulting in a large graph. More importantly, any graph can be adapted to the needs of a particular research question, e.g., by deriving only a limited number of (composite) entities.

Querying Multi-Dimensional Event Data
In the following we present 6 classes of analysis questions that we formulated to evaluate requirements R5-R11 of Sect. 2.2 for querying multi-dimensional event data on the LPGs of Sect. 5. For each question we provide a Cypher query and report results and the query processing times (measured on an Intel i7 CPU @ 2.6 GHz machine with 32 GB of memory with Neo4j Browser).
We conducted the experiment on the BPIC17 dataset for which we additionally derived the Case AWO entity type which corresponds to the original case notion, i.e., all events sharing the same e.case attribute. We did this in order to be able to verify the correctness of our results against classical process mining software which works with the original case notion only.
Q1. Query Attributes of Events/Cases. We want to query for the first-class concepts of event logs, a case and an event, based on event/case attributes by using partial patterns to satisfy R7. The following query returns the event attribute "end" and the case attribute "LoanGoal" of Case "Application 681547497". Note that all (event and entity) attributes are encoded as properties of event nodes. The query has been processed in 0.061 seconds. After modifying the query to consider all cases, i.e. removing the condition for a specific case in line 2, the query completed in 0.582 seconds.
Q2. Query Directly-Follows Relations. Q2 is focused on temporal aspects. Here we show a query that satisfies R8 by considering 2 consecutive events. Directly-follows relations of events in a case are an important characteristic of event logs as they represent the case internal temporal order of events and many of today's process mining techniques rely on these relations. The query below returns the event directly following the node with the activity property "O Created" of a given offer entity by matching the :DF Offer relationship. The query execution time for one specific offer was 0.064 seconds whereas querying the :DF Offer relations with destination node "O Created" for all 42,995 offers took 11.932 seconds. Directly-follows relations of other entities (Application and Workflow) or across entities (Case AWO) can be queried by adjusting the query in the MATCH and WHERE clauses accordingly.
Q3. Query Eventually-Follows Relations. We want a query that satisfies R8 by considering the temporal relationship of any 2 events of a case. Eventually-follows relations are also related to the case-internal order of events. Event y eventually follows event x if y occurs after x in the same case, that is, if x and y are connected through a path of directly-follows relations of arbitrary length. We query the offer-specific eventually-follows relationship between "O Created" and "O Cancelled" for a given offer as follows: Even though the MATCH clause looks similar to the one of the directly-follows query, the *-Operator changes the pattern from a direct relationship to a path of arbitrary length. Since we want to find the eventually-follows relationship of two specific activities we also added condition e2.Activity = "O Cancelled" to the WHERE clause to define the endpoint of the paths we want to match in the graph. For the given offer the query took 0.068 seconds. For all 20,898 offers where "O Created" is eventually followed by "O Cancelled" we removed the condition for "Offer 716078829" from the query which then took 4.264 seconds.
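The difference between directly-follows and eventually-follows can be made concrete with a short plain-Python sketch over an adjacency map of offer-specific DF edges. It only illustrates the path semantics and is not the Cypher query used for the measurements; the function name, the toy events, and the intermediate activity "O Sent" are hypothetical.

```python
def eventually_follows(df, start, end_activity, activity):
    """Return all events reachable from start via one or more DF edges
    whose activity equals end_activity."""
    found, stack, seen = [], [start], {start}
    while stack:
        current = stack.pop()
        for nxt in df.get(current, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            if activity[nxt] == end_activity:
                found.append(nxt)
            stack.append(nxt)
    return found


# Hypothetical offer: created, sent, cancelled.
df_offer = {"e1": ["e2"], "e2": ["e3"]}
activity = {"e1": "O Created", "e2": "O Sent", "e3": "O Cancelled"}

directly = df_offer["e1"]  # one hop only
eventually = eventually_follows(df_offer, "e1", "O Cancelled", activity)
```

Here `directly` contains only `e2`, while `eventually` contains `e3`, which is reached over a path of length two.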
Q4. Case Variants. We want a query to return a case variant as path in the graph to satisfy R6. A case variant is the sequence of activities of a case. Case variants are for example used to detect frequent behaviour of a process. We can query the graph to retain the path of events of a case by walking over all of its directly-follows relationships from the first to the last event. For a given case (Case AWO) this can be done as follows: The pattern of the match clause follows the same logic as the eventually-follows match pattern. For variants we limit the output to the first and last event of a case, i.e. the events that have no incoming or no outgoing ":DF Case AWO" relationship. The query completed in 0.079 seconds. Similarly, we can query the graph for variants of another entity such as Offer. The paths of events returned by the above query can be turned into a list of activity sequences by Cypher's list operators: UNWIND processes each path in the paths variable iteratively, function nodes() translates the path into a list of nodes, and list comprehension maps each event node to its activity property. The resulting list of activities can be compared for equality with other lists, etc.
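The same idea, walking from the event without incoming DF edge to the event without outgoing DF edge and mapping each event to its activity, can be sketched in plain Python. The sketch assumes a single DF type in which each event has at most one successor (as for one case notion) and relies on DF being acyclic; the names and toy data are hypothetical.

```python
from collections import Counter


def variant(df_next, activity, first_event):
    """Follow DF edges from first_event and return the activity sequence as a tuple."""
    sequence, current = [], first_event
    while current is not None:
        sequence.append(activity[current])
        current = df_next.get(current)  # None once the last event is reached
    return tuple(sequence)


# Hypothetical toy data: two cases, e1 -> e2 -> e3 and e4 -> e5.
df_next = {"e1": "e2", "e2": "e3", "e4": "e5"}
activity = {"e1": "A", "e2": "B", "e3": "C", "e4": "A", "e5": "B"}

# First events are those without an incoming DF edge.
first_events = sorted(set(activity) - set(df_next.values()))
variants = Counter(variant(df_next, activity, e) for e in first_events)
```

Because variants are plain tuples, they can be compared for equality and counted to find frequent behavior.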
Q5. Query Duration/Distance between two specific Activities. The information on how much time or how many activities were needed to get an Offer from "O Created" to "O Accepted" for example can be used to measure process performance. For Q5 we want to query temporal relations in the form of durations and path lengths to satisfy R8. Say we are interested in the offer entity that took the longest time to get accepted. We can query the eventually-follows relation of two given activities and use their timestamps to calculate the elapsed time between them: The query matches all :E EN relationships, filters for the given activities and then uses Cypher's duration function to calculate the time spans. Only the result with the longest duration is returned. In case we want to retrieve the distance wrt. the number of activities, we can aggregate over the nodes along the path between the two events with eventually-follows relation and count the hops with the "Length()" function as shown in [21]. The query for the elapsed time completed in 0.585 seconds. Querying for the longest path took 0.755 seconds.
Q6. Query for Behavior across Multi-Instance Relations. Event logs such as BPIC'17 can contain multiple case identifiers. A case identifier may be a single entity, e.g. Offer, or any combination of entities such as the Case notion of BPIC'17 combining Application, Workflow and Offer entities. Querying the behavior across different instances of these entities typically requires multiple steps with traditional event logs such as custom scripts to be able to select, project, aggregate and combine the results accordingly. With Q6 we want to satisfy R9 by querying for events correlated to the same entity, R10 by combining data from different entities in the same query, and to satisfy R11 by querying 2 (sub)processes in a single query. We defined a query that returns all paths from "A Create Application" to "O Cancelled" of the BPIC'17 Cases for Offers that have "O Created" directly followed by "O Cancelled" on entity The query demonstrates several central properties of querying multi-dimensional event data in labeled property graphs.
The first MATCH clauses (lines 1-4) return all case nodes of Cases structurally related to more than one Offer (via :E EN ) where "O Created" is directly followed by "O Cancelled" in this offer (via :DF Offer ). Note that the case (Case AWO) typically has several other events not related to the offer in between the two events, i.e., they only directly follow each other according to :DF Offer but not according to :DF Case AWO. The second pair of MATCH clauses (lines 5-7) returns all "O Cancelled" events that directly succeed "O Created" (via :DF Offer ) in an Offer that is correlated to one of the cases with multiple offers (found in lines 1-4). The returned "O Cancelled" events are used in the last pair of MATCH clauses (lines 8-9) to return paths from some "A Create Application" event to one of the "O Cancelled" events. This way we get a unique path for every Offer that meets the criteria. The query's execution time was 0.569 seconds in Neo4j Browser. Figure 17 shows 2 of the 218 paths of the query's output in Neo4j's graphical representation; in both cases the "O Created" and "O Cancelled" events of one offer are interleaved with events from the Application or other offers.
Discussion. We validated the correctness of our queries against an independent baseline implementation. The results of Q1-Q5 were obtained by processing the event log with manual filtering in Disco. Q6 required a manual procedural algorithm using a single-pass search over the data as the evaluation with existing tools was not possible. Our Cypher queries obtained the same result as the baseline implementations [21]. The graph analysis for Q1-Q6 required only Cypher queries with clauses and functions as described in [25] (except for the typecasts which are not part of Cypher but provided by Neo4j). Notably, the single-pass baseline algorithm for Q6 required 15 mins compared to the 0.453 secs needed to obtain the same results using Neo4j. Further details on the evaluation of Q1-Q6 regarding time and baselines can be found in [21].

Constructing Simple Models for Multiple Entities
We now show that our data model of Sect. 3 allows aggregating the directly-follows relations between events to directly-follows relations between event classes, taking the notion of entities and entity types into account. We provide queries that satisfy R12-R15 of Sect. 2.2.

Aggregating events into user-defined event classes
An event :Class is a node describing a set of events with the same characteristics, e.g., having the same Activity or other combination of data attributes. We can aggregate events into user-defined event classes using the same principles as deriving and correlating :Entity nodes: we query for all distinct values of a particular (combination of) event attributes and create a new :Class node per retrieved value(s). We illustrate the concept for two event classes: the activity name and life-cycle attribute, and resource. We then link each :Class node to all events of this class when they match on the defining attributes, as for correlating events to entities. We show the query for Activity+Lifecycle. We may also derive event classes based on behavioral properties of events, e.g., based on an event (e2:Event) -[:DF]-> (e) preceding e. The above queries satisfy the semantic constraints for :E C of Sect. 4.5.
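A minimal plain-Python sketch of this aggregation step is given below: it creates one class per distinct combination of attribute values and links each event to its class. The function name, the "+"-joined class identifier, and the toy events are assumptions for illustration and do not reproduce the paper's Cypher query.

```python
def create_classes(events, attributes, class_type):
    """Create one class per distinct value combination and link events to their class.

    events: event id -> property dict (string values)
    Returns (classes, e_c) where e_c is the set of (event id, class id) pairs.
    """
    classes, e_c = {}, set()
    for e_id, props in events.items():
        if any(a not in props for a in attributes):
            continue  # events lacking a defining attribute get no class of this type
        class_id = "+".join(props[a] for a in attributes)
        classes[class_id] = {"ID": class_id, "Type": class_type}
        e_c.add((e_id, class_id))
    return classes, e_c


# Hypothetical toy events.
toy_events = {
    "e1": {"Activity": "A_Create", "lifecycle": "complete"},
    "e2": {"Activity": "A_Create", "lifecycle": "complete"},
    "e3": {"Activity": "O_Create", "lifecycle": "complete"},
}

classes, e_c = create_classes(toy_events, ["Activity", "lifecycle"], "Activity+Lifecycle")
```

Each event ends up with exactly one class of the given type, in line with the constraint on :E C that no event has two classes of the same type.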

Fig. 18 Handover of Work Network
We can retrieve this network with the query MATCH (c1) -[dfc:DF C]-> (c2) WHERE c1.Type = "Resource" AND c2.Type = "Resource" AND dfc.EntityType = "Case AWO". We verified the correctness of the query using the social network mining plugin of ProM (www.promtools.org), see [21] for details. Figure 18 shows the Neo4j graph output of the query above on a sample of 20 cases.
With traditional event logs, creating a handover of work network typically requires the use of a tool or programming language whereas Neo4j is capable of creating them by in-DB processing only.
Mining behavioral models over multiple entities
In the same way an aggregated directly-follows graph can be obtained in-DB by aggregating :DF relations. We aggregated the :DF relationships of Application, Workflow, Offer, Case AO, Case WO, Case AW, Case AWO for event class Activity+Lifecycle. Figure 19(left) shows the classical directly-follows graph that can be obtained by aggregating all :DF relationships for the global case notion of the original log (entity type Case AWO). Each node is a :Class node and each edge is a :DF C relationship of type Case AWO; we only queried relationships with count ≥ 500. Figure 19(right) shows a process model discovered by the Split Miner [7] from the same event log; the model has a fitness of 95%, i.e. cannot explain 5% of the data. However, both describe the behavior as a complex interleaving of steps of three different entities while the underlying log suffers from convergence and divergence, see Sect. 2.2. Figure 20 shows the graph we obtained by querying for the aggregated :DF C relationships of types Application (dark blue), Workflow (light blue), Offer (orange), and of the reified relations Case AO, Case WO, Case AW (grey) occurring with count ≥ 500, i.e., > 98% of all process executions. This graph describes directly-follows relations per entity and thus is similar to an artifact-centric model [33] or a multiple viewpoint model [10]. Compared to Fig. 19, the graph of Fig. 20 explicitly describes the directly-follows behavior of each entity; the behavior of each entity is concurrent to the behavior of other entities up to the few explicit interactions. In contrast, Fig. 19 shows few edges between the event classes associated with Application, Workflow, and Offer and most edges in between because the classical event log interleaved all events. The graph of Fig. 20 is significantly easier to understand and more precise as it was derived from data without convergence and divergence.
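The aggregation of :DF relationships into class-level relationships with a frequency threshold can be sketched in plain Python as follows. The function name, the triple-based edge representation, and the toy data are hypothetical; the threshold parameter plays the role of the count ≥ 500 filter mentioned above.

```python
from collections import Counter


def aggregate_df(df_edges, event_class, min_count=1):
    """Aggregate event-level DF edges into class-level edges per entity type.

    df_edges: iterable of (e1, e2, entity type)
    event_class: event id -> class id
    Returns {(class of e1, class of e2, entity type): count} for counts >= min_count.
    """
    counts = Counter((event_class[a], event_class[b], t) for a, b, t in df_edges)
    return {edge: c for edge, c in counts.items() if c >= min_count}


# Hypothetical toy data.
toy_df = [("e1", "e2", "Case_AWO"), ("e3", "e4", "Case_AWO"), ("e2", "e3", "Offer")]
toy_class = {"e1": "A", "e2": "B", "e3": "A", "e4": "B"}

all_edges = aggregate_df(toy_df, toy_class)
frequent_edges = aggregate_df(toy_df, toy_class, min_count=2)
```

Keeping the entity type in the key is what allows one class-level graph per entity type, rather than a single interleaved directly-follows graph.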

Conclusion
We introduced a new data model for event data based on labeled property graphs. Our data model provides node types and relationship types (see Sect. 3) with semantic constraints (see Sect. 4) for all first-class concepts of event logs: events, entities (generalizing the case notion), event classes (generalizing the activity and the resource attribute), and the directly-follows relation between events, satisfying requirements R1 and R3 of Sect. 2.2. The semi-structured nature of graphs allowed us to represent multiple different, related entities (R2) and the relations between entities and events (R4) through dedicated correlation relationships. Thus, the data model can be seen as a multi-dimensional event log, where events of each entity are ordered by "their" directly-follows relation leading to a partial order of events. Our data model avoids all shortcomings of existing event data models including event tables, event logs, and relational databases, see Sect. 2.3, while building on a standard data storage format.
We provided a succinct set of queries to efficiently convert data in event table format into our data model (see Sect. 5). The queries are parameterized where user-provided domain knowledge is required. We specifically provide queries to reify relations between entities into composite entities, which allows deriving directly-follows relations describing interactions between entities (R13). The data model and queries allowed us to convert 5 different real-life datasets into our data model.
We demonstrated that the query language Cypher allows querying event data in our data model, see Sect. 6. Queries and results are given as graphs, satisfying R5. Queries Q4 and Q6 retrieve entire paths of events (R6), which allows analyzing the sequences. Q1-Q3 and Q6 select individual cases based on partial patterns (R7), which allows to "query by example". Q2, Q3, Q5, and Q6 query for temporal properties (R8), where Q5 specifically considers time; all queries correlate events related to a common entity (R9); Q7 queries aspects of multiple entities in the same query (R10) and allows querying the behavior of multiple entities and combining the results (R11). Altogether, we could demonstrate that queries over labeled property graphs satisfy R5-R11, which no existing query language on event data offers, see Sect. 2.3.
Finally, we demonstrated that our data model and Cypher allow aggregating events to event classes (R12) and directly-follows relations to event classes per entity (R14, R15). The resulting graphs are simpler and describe the behavior more accurately than techniques using other data models, see Sect. 7. The queries and data sets are publicly available for further research [23].
The model has several limitations and requires further research. Our data model does not model properties of entities and relations between entities; practical applications require a more complete data model of the entities as well. When one event is correlated to multiple entities of the same type, the current modeling of the directly-follows relation does not distinguish the different entities, rendering queries more complex; the model has to be generalized further. Our aggregation queries aggregate behavior on event type level, thereby hiding multiplicities of entities of the same type involved in a behavior; further research is required to aggregate behavior in a way that preserves multiplicities of entities in interactions. Within the scope of this work, we only consider converting event tables to our data model, whereas most event data is stored in relational databases; an automated technique for conversion is desirable for practical adoption. Cypher is highly expressive but not specifically designed for querying event data; it takes expertise and patience to write the right queries, and query patterns and best practices have to be established. While we demonstrated feasibility and obtained performance that allows for usage in practice, existing graph database systems are still significantly slower than relational databases or dedicated algorithms, specifically due to deficiencies in query optimization which may easily render queries practically infeasible. Further improvements in the performance of graph databases are required, possibly by specifically taking the partially ordered nature of our data into account.
Our model also enables new lines of research. Providing a more general standard event data model allows for development of new event data analysis and process mining techniques that explicitly consider the presence of multiple entities. The data format enables the adoption of graph mining techniques for event data.