1 Introduction

Process mining [1] is a family of techniques relating the fields of data science and process management to support the analysis of operational processes based on event logs. Process mining algorithms and tools normally expect event logs to follow certain standards. In reality, however, most IT systems in companies and organizations do not directly produce such logs, and the relevant information is spread across legacy systems, e.g., relational databases. Event log extraction from legacy systems is therefore a key enabler for process mining [6,7,8].

There have been several proposals for the representation of event logs, e.g., eXtensible Event Stream (XES) [16], JSON Support for XES (JXES) [15], Open SQL Log Exchange (OpenSLEX) [14], and eXtensible Object-Centric (XOC) [12], where XES is the most widely adopted one, being the IEEE standard for interoperability in event logs [11]. In XES (and other similar proposals), each event is related to a single case object, which leads to problems with convergence and divergence [2], as explained later in Sect. 2.2. To solve these issues, object-centric approaches are promising: objects are the central notion, and one event may refer to multiple objects. In particular, along this direction, the Object-Centric Event Logs (OCEL) standard [10] has been proposed recently.

To the best of our knowledge, the crucial problem of extracting OCEL logs from external sources is still largely unexplored. The only exception is [3], where OCEL logs are extracted by identifying the so-called master and relevant tables in the underlying database and building a Graph of Relations (GoR). Though promising, this approach might be difficult to adopt when the underlying tables are complex and the GoR is hard to model because it does not separate the storage level (i.e., the database) from the concept level (i.e., domain knowledge about events).

In this work, we try to fill this gap by leveraging the OnProm framework [5, 7] for extracting event logs from legacy information systems. OnProm v1 already relied on the technology of Virtual Knowledge Graphs (VKG) [17] to expose databases as Knowledge Graphs that conform to a conceptual model, and to query this conceptual model and eventually generate logs by using ontology and mapping-based query processing. Using VKG, a SPARQL query q expressed over the virtual view is translated into a query \(\mathcal {Q}\) that can be directly executed on a relational database \(\mathcal {D}\), and the answer is simply the RDF graph following the standard SPARQL semantics. The workflow of OnProm v1 for the extraction of event logs in XES format from relational databases consists of three steps: conceptual modeling, event data annotation, and automatic event log extraction [7]. OnProm came with a toolchain to process the conceptual model and to automatically extract XES event logs by relying on the VKG system Ontop [18].

We present here OnProm v2, which we have modularized so that it becomes easier to extend, and in which we have implemented OCEL-specific features to extract OCEL logs. We have carried out an experiment with OnProm over the Dolibarr Enterprise Resource Planning (ERP) & Customer Relationship Management (CRM) system. The evaluation results confirm that OnProm can effectively extract OCEL logs. The code of OnProm and the data for reproducing the experiment can be found on GitHub at https://github.com/onprom/onprom.

2 Event Log Standards: XES and OCEL

A variety of event log standards have emerged in the literature. In this paper, we are mostly interested in the XES and OCEL standards.

2.1 XES Standard

The eXtensible Event Stream (XES) is an XML-based standard for event logs. It aims to provide an open format for event log data exchange between tools and applications. Since it first appeared in 2009, it has quickly become the de-facto standard in Process Mining and eventually became an official IEEE standard in 2016 [4]. The main elements of the XES standard are Log, Trace, Event, Attribute, Extension, and Classifier. We emphasize the following points:

  • Log is the root component in XES.

  • The Trace element is directly contained in the Log root element. Each trace belongs to a log, and each log may contain many traces.

  • Each event belongs to a trace, and each trace usually contains many events.

  • All information in an event log is stored in attributes. Both traces and events may contain an arbitrary number of attributes.

Example 1

Consider the order management process in an ERP system, and suppose there is an instance of order cancellation. Taking the order as a case, there is a trace containing events such as create order, review order, cancel order, and close order, as shown below.

figure a
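The nesting of Log, Trace, Event, and attributes described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the listing of the original figure: the case identifier order1 is an assumption borrowed from Example 2, the attribute key concept:name is the one defined by the XES concept extension, and the log header (extensions, classifiers, timestamps) is omitted.

```python
import xml.etree.ElementTree as ET

# Activities of the order-cancellation trace from Example 1.
ACTIVITIES = ["create order", "review order", "cancel order", "close order"]


def build_xes_log(case_id, activities):
    """Build a minimal XES-like tree: one log, one trace, one event per activity."""
    log = ET.Element("log")
    trace = ET.SubElement(log, "trace")
    # Trace-level attribute identifying the case (here: the order).
    ET.SubElement(trace, "string", key="concept:name", value=case_id)
    for name in activities:
        event = ET.SubElement(trace, "event")
        # Event-level attribute carrying the activity name.
        ET.SubElement(event, "string", key="concept:name", value=name)
    return log


# "order1" is a hypothetical case identifier used only for illustration.
log = build_xes_log("order1", ACTIVITIES)
xes_text = ET.tostring(log, encoding="unicode")
print(xes_text)
```

Every event sits inside exactly one trace, which is the single-case assumption that Sect. 2.2 contrasts with OCEL.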

2.2 OCEL Standard

The purpose of the Object-Centric Event Logs (OCEL) standard is to provide a general standard to interchange event data with multiple case notions. Its goal is to exchange data between information systems and process mining analysis tools. It has been proposed recently as the mainstream format for storing object-centric event logs [10].

The main elements of the OCEL standard are Log, Object, Event, and Element. The main difference between XES and OCEL lies in the usage of Case in XES and Object in OCEL. Recall that XES requires a single case to describe events. In contrast, in OCEL the relationship between objects and events is many-to-many. This gives OCEL several advantages [10] compared with existing standards:

  • It can handle application scenarios involving multiple cases, thus making up for the deficiencies of XES.

  • Each event log contains a list of objects, and the properties of the objects are written only once in the log (and not replicated for each event).

  • In comparison to tabular formats, the information is strongly typed.

  • It supports lists and maps of elements, while most existing formats (such as XOC, tabular formats, and OpenSLEX) do not properly support them.

One main motivation for OCEL is to support multiple case notions. Using a traditional event log standard like XES may lead to problems of convergence (when an event is related to different cases and occurs repetitively) and divergence (when it is hard to separate events with a single case) [10]. We show these in the following example.

Example 2

Considering again an ERP system as in Example 1, when a valid order has been confirmed and payment has been completed, the goods are about to be delivered. Usually the items in the same order may come from different warehouses or suppliers, and may be packaged into different packages for delivery, as shown below.

figure b

Suppose that we want to use XES to model this event log. If item is regarded as a case and create order is regarded as an activity, then create order will be repeated because there are multiple items (e.g., item1, item2, ...), even if there is only one order order1. This is the convergence problem.

If order is regarded as a case and pack item and check item are regarded as activities, then in the same order case, there are multiple pack item events that should be executed after the check item events. However, we cannot distinguish different items in an order, and the order between the two activities may be disrupted. This is the divergence problem.

The OCEL standard is a good solution to these problems, since they can be easily solved by treating order, item, and package as objects, and then each event can be related to different objects. In this way, the properties of the objects are written only once in the event log and not replicated for each event.
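The two problems can be reproduced on a toy log. The sketch below assumes a hypothetical object-centric event list (the event identifiers and the two items are invented for illustration) and flattens it to a single case notion, once per item and once per order.

```python
from collections import defaultdict

# Hypothetical object-centric events: each event refers to several objects.
EVENTS = [
    {"id": "e1", "activity": "create order", "objects": ["order1", "item1", "item2"]},
    {"id": "e2", "activity": "check item", "objects": ["order1", "item1"]},
    {"id": "e3", "activity": "check item", "objects": ["order1", "item2"]},
    {"id": "e4", "activity": "pack item", "objects": ["order1", "item1"]},
    {"id": "e5", "activity": "pack item", "objects": ["order1", "item2"]},
]


def flatten(events, prefix):
    """Flatten to one case notion: one trace per object whose id starts with prefix."""
    traces = defaultdict(list)
    for event in events:
        for obj in event["objects"]:
            if obj.startswith(prefix):
                traces[obj].append(event["activity"])
    return dict(traces)


by_item = flatten(EVENTS, "item")
by_order = flatten(EVENTS, "order")

# Convergence: the single "create order" event e1 is copied into every item trace.
copies = sum(trace.count("create order") for trace in by_item.values())
print(copies)  # 2

# Divergence: inside order1 the item identity is lost, so the check and pack
# events of different items can no longer be told apart.
print(by_order["order1"])
```

With item as the case, one create order event appears twice; with order as the case, the trace no longer records which pack item follows which check item.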

3 The OnProm V2 Framework

We now describe the OnProm approach for event log extraction. OnProm v1, which supports only the XES standard, has been discussed extensively in [5, 7]. We describe here the revised version v2, which has a better modularized architecture and also supports OCEL. The architecture of OnProm v2 is shown in Fig. 1. We first briefly introduce the basic components of the framework.

Fig. 1. OnProm event log extraction framework.

To extract event logs that conform to an event log standard X from a legacy information system \(\mathcal {I}\), OnProm works as follows:

  (A) Creating a VKG specification. The user designs a domain ontology \(\mathcal {T}\) using the UML Editor of OnProm. Then they create a VKG mapping \(\mathcal {M}\) (using, e.g., the Ontop plugin for Protégé [18]) to declare how the instances of classes and properties in \(\mathcal {T}\) are populated from \(\mathcal {I}\). This step is only concerned with modeling the domain of interest and is agnostic to the event log standard.

  (B) Annotating the domain ontology with the event ontology. OnProm assumes that for the event log standard X, a specific event ontology \(\mathcal {E}_X\) is available. The Annotation Editor of OnProm imports \(\mathcal {E}_X\), and allows the user to create annotations \(\mathcal {L}_X\) over the classes in \(\mathcal {T}\) that are based on the classes of \(\mathcal {E}_X\).

  (C) Extracting the event log. OnProm assumes that for the standard X also a set of SPARQL queries for extracting the log information is defined. The Log Extractor of OnProm relies on a conceptual schema transformation approach [6] and query reformulation of Ontop, using \(\mathcal {L}_X\), \(\mathcal {T}\), \(\mathcal {M}\), and \(\mathcal {R}\). It internally translates these SPARQL queries to SQL queries over \(\mathcal {I}\), and evaluates the generated SQL queries to construct corresponding Java objects and serialize them into log files compliant with X.

In this work, we have first modularized the system by separating the above steps into different software components, so as to make it more extensible. Then we have introduced OCEL-specific features in Steps (B) and (C). Hence, OnProm v2 is now able to extract OCEL logs from relational databases.

Table 1. Table llx_commande.

Below we detail these steps and provide a case study with the Dolibarr system. This example also serves as the base of the experiments in the next section. Dolibarr [9] is a popular open source ERP & CRM system. It uses a relational database as backend, and we consider a subset of the tables that are related to the Sale Orders. We model it as an information system \(\mathcal {I}=\langle \mathcal {R},\mathcal {D}\rangle \), where the schema \(\mathcal {R}\) consists of 9 tables, related to product, customer, order, item, invoice, payment, shipment, etc., and the data \(\mathcal {D}\) includes instances of the tables, a sample of which is shown in Table 1. Note that the table name is not immediately understandable (llx_commande is a table about orders).

3.1 Creating a VKG Specification

In this step, we define a domain ontology \(\mathcal {T}\) and a mapping \(\mathcal {M}\) for creating a VKG from \(\mathcal {I}=\langle \mathcal {R},\mathcal {D}\rangle \). This is to provide a more understandable knowledge graph view of the underlying data in terms of the domain.

The domain ontology \(\mathcal {T}\) is a high-level abstraction of the business logic concerned with the domain of interest. The UML editor in OnProm uses UML class diagrams as a concrete language for ontology building and provides their logic-based formal encoding according to the OWL 2 QL ontology language [13]. In this case study, the domain ontology about Sale Orders is constructed using the UML editor as shown in Fig. 2.

Fig. 2. Domain ontology in the UML editor.

In a VKG system, the domain ontology \(\mathcal {T}\) is connected to the information system \(\mathcal {I}\) through a declarative specification \(\mathcal {M}\), called mapping. The mapping \(\mathcal {M}\) is a collection of mapping assertions, each of which consists of a SQL statement (called Source) over \(\mathcal {I}\) and a triple template at the conceptual schema level (called Target) over \(\mathcal {T}\). For example, the following mapping assertion constructs instances of the Order class in \(\mathcal {T}\), with their creation date, from a SQL query over the llx_commande table:

figure c

By instantiating rowid and date_creation with the values from the first row of llx_commande in Table 1, this mapping assertion would produce two triples:  

The mapping can be edited using the Ontop plugin for Protégé. Figure 3 shows the mapping for the running example.

Fig. 3. Mapping in the Ontop plugin for Protégé.
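The Source/Target structure of a mapping assertion can be illustrated with a small Python sketch. It is not Ontop's mapping syntax and not the assertion of the original figure: the row values, the IRI template :Order/{rowid}, and the property name :creationDate are assumptions made for illustration. Note also that a VKG system keeps the graph virtual, whereas this sketch materializes the triples to make the instantiation visible.

```python
import sqlite3

# Hypothetical stand-in for llx_commande; the row values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE llx_commande (rowid INTEGER, date_creation TEXT)")
conn.execute("INSERT INTO llx_commande VALUES (1, '2021-01-01 10:00:00')")

# Source: a SQL query over the database.
SOURCE = "SELECT rowid, date_creation FROM llx_commande"
# Target: triple templates with {column} placeholders
# (hypothetical IRI template and property name).
TARGET = [
    (":Order/{rowid}", "rdf:type", ":Order"),
    (":Order/{rowid}", ":creationDate", '"{date_creation}"'),
]


def apply_mapping(connection, source, target):
    """Instantiate every triple template once per row returned by the source query."""
    cursor = connection.execute(source)
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        binding = dict(zip(columns, row))
        for subj, pred, obj in target:
            triples.append((subj.format(**binding), pred, obj.format(**binding)))
    return triples


triples = apply_mapping(conn, SOURCE, TARGET)
for triple in triples:
    print(triple)
```

Each row of the source query yields one class-membership triple and one data-property triple, which matches the two triples per row mentioned in the text.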

3.2 Annotating the Domain Ontology with the Event Ontology

In this step, we establish the connection between the VKG and the event ontology. This is achieved by annotating the classes in the domain ontology using the elements from the event ontology.

An event ontology \(\mathcal {E}\) is a conceptual event schema, which describes the key concepts and relationships in an event log standard. For the OCEL standard, we have created an ontology \(\mathcal {E}_{\textsf {OCEL}}\), whose main elements are shown in Fig. 4.

Fig. 4. OCEL event ontology.

In this ontology, the classes Event and Object are connected by the many-to-many relation e-contains-o. One event may contain multiple objects, and an object may be contained in multiple events. Events and objects can be related to attributes through the relations e-has-a and o-has-a, respectively. An attribute has a name (attKey), a type (attType), and a value (attValue).
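As a rough sketch of this structure, the classes and relations can be written as Python dataclasses. The field names mirror attKey, attType, and attValue from the ontology; the sample objects, events, timestamps, and the price attribute are hypothetical and serve only to show the many-to-many relation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    att_key: str    # attKey
    att_type: str   # attType
    att_value: str  # attValue


@dataclass
class OcelObject:
    object_id: str
    object_type: str
    attributes: List[Attribute] = field(default_factory=list)  # o-has-a


@dataclass
class Event:
    event_id: str
    activity: str
    timestamp: str
    objects: List[OcelObject] = field(default_factory=list)    # e-contains-o
    attributes: List[Attribute] = field(default_factory=list)  # e-has-a


# Hypothetical sample data.
order = OcelObject("order1", "order")
item = OcelObject("item1", "item", [Attribute("price", "float", "9.99")])
create = Event("e1", "create order", "2021-01-01T10:00:00", [order, item])
pack = Event("e2", "pack item", "2021-01-02T09:00:00", [order, item])

# Many-to-many: one event contains several objects,
# and one object is contained in several events.
events_of_item = [e.event_id for e in (create, pack) if item in e.objects]
print(events_of_item)  # ['e1', 'e2']
```

The item's price is stored once on the object and is not repeated in either event, which is the property highlighted in Sect. 2.2.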

Now, using the Annotation Editor in the OnProm toolchain, we can annotate the classes in \(\mathcal {T}\) using the elements from \(\mathcal {E}\) to produce an annotation \(\mathcal {L}\). For OCEL, there are three kinds of annotations:

  • The event annotation specifies which concepts in \(\mathcal {T}\) are OCEL events. Each event represents an execution record of an underlying business process and contains mandatory elements (e.g., id, activity, timestamp, and relevant objects) and optional elements (e.g., event attributes). A screenshot of an example event annotation is shown in Fig. 5(a), where we annotate the Order class as an Event and specify its properties label, activity, eventId, and timestamp.

  • The object annotation specifies which concepts in \(\mathcal {T}\) are OCEL objects. An OCEL object contains mandatory elements (e.g., id and type) and optional elements (e.g., price and weight). A screenshot of an example object annotation is shown in Fig. 5(b).

  • The attribute annotation specifies the attributes attached to the events/objects. Both an event and an object may contain multiple attributes, and each attribute annotation consists of an attKey, an attType, and an attValue. A screenshot of an (event) attribute annotation using the Annotation Editor tool is shown in Fig. 5(c).

Fig. 5. Annotation samples.

3.3 Extracting the Event Log

Once the annotation is concluded, OnProm computes a new VKG specification \(\mathcal {P}'\) with a new mapping \(\mathcal {M}'\) and the event ontology \(\mathcal {E}\), so that it exposes the information system \(\mathcal {I}\) as a VKG using the vocabulary from the event standard. For example, among others, OnProm produces a new mapping assertion in \(\mathcal {M}'\) from the Dolibarr database to the Event class in \(\mathcal {E}_{\textsf {OCEL}}\):

figure g

Now all the information in an OCEL log can be obtained by issuing several predefined SPARQL queries over \(\mathcal {P}'\). For example, the following query extracts all OCEL events and their attributes:

figure h

The following query extracts the relations between OCEL events and objects through the property e-contains-o:

figure i

To evaluate these SPARQL queries, OnProm uses Ontop, which translates them to SQL queries over the database. In this way, extracting OCEL event logs boils down to evaluating some automatically generated (normally complex) SQL queries over the database directly. Finally, OnProm just needs to serialize the query results as logs in the XML or JSON format according to OCEL. Figure 6(a) shows a fragment in XML-OCEL and Fig. 6(b) shows its visualization.

Fig. 6. A fragment of the extracted OCEL log from the Dolibarr ERP system.
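The final serialization step can be sketched as follows. OnProm itself builds Java objects; this Python fragment only illustrates the idea of turning query answers into a JSON-OCEL document. The input rows are hypothetical, and the key names (ocel:events, ocel:objects, ocel:omap, ocel:vmap, ocel:ovmap) follow our reading of the JSON-OCEL layout and should be checked against the standard [10]. The global header sections are omitted, so the output is not claimed to pass validation.

```python
import json

# Hypothetical rows standing in for the answers of the extraction queries.
EVENT_ROWS = [("e1", "create order", "2021-01-01T10:00:00")]
RELATION_ROWS = [("e1", "order1"), ("e1", "item1")]  # e-contains-o
OBJECT_ROWS = [("order1", "order"), ("item1", "item")]


def to_json_ocel(event_rows, relation_rows, object_rows):
    """Group query answers into the events and objects sections of a log."""
    events = {
        event_id: {
            "ocel:activity": activity,
            "ocel:timestamp": timestamp,
            "ocel:omap": [],   # identifiers of the related objects
            "ocel:vmap": {},   # event attributes (none in this sketch)
        }
        for event_id, activity, timestamp in event_rows
    }
    for event_id, object_id in relation_rows:
        events[event_id]["ocel:omap"].append(object_id)
    objects = {
        object_id: {"ocel:type": object_type, "ocel:ovmap": {}}
        for object_id, object_type in object_rows
    }
    return {"ocel:events": events, "ocel:objects": objects}


ocel_log = to_json_ocel(EVENT_ROWS, RELATION_ROWS, OBJECT_ROWS)
print(json.dumps(ocel_log, indent=2))
```

Objects are listed once in their own section and events only reference them by identifier, so object properties are not replicated per event.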

4 Experiments

We have conducted an evaluation of OnProm based on the Dolibarr scenario. The experiments have been carried out using a machine with an Intel Core i7 2.0 GHz processor, 16 GB of RAM, Dolibarr v14, and MySQL v8. In order to test the scalability, we have generated 8 database instances of different sizes, ranging from 2K to 1M rows, where the size of a database is its number of rows.

Table 2. Extraction details of OCEL log elements.

Performance Evaluation. The running times are reported in Fig. 7. First, we notice that our approach scales well. The overall running time scales linearly with respect to the size of the database. In the biggest dataset of 1M rows, it takes less than 12 min to extract the event log. We also computed the division of the running time over the subtasks of log extraction. The upper left corner of Fig. 7 shows the proportion of the running time for each OCEL element. We observe that most of the time (98%) has been spent on the event, object, and attribute extraction, whose main tasks are to evaluate SPARQL queries and create corresponding Java objects in memory. The time for log serialization is almost negligible (2%). We note that since extracting these logs corresponds to evaluating the same SQL queries over databases of different sizes, it is actually not surprising to observe this linear behavior when the database tables are properly indexed.

Fig. 7. Running times of the experiment.

We also report in Table 2 the number of OCEL elements extracted for each database size. At the size of 1M, OnProm extracts an OCEL event log with 374999 objects, 499997 events, 374999 attributes and 624943 relations, and it takes 207 MB to serialize the whole log in XML-OCEL.
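For orientation, two back-of-the-envelope figures can be derived from the numbers reported for the 1M-row instance. The sketch assumes that 1 MB means 10^6 bytes and uses the stated upper bound of 12 min, so the throughput is only a lower bound and both values are rough estimates rather than measured results.

```python
# Figures reported in the text for the 1M-row instance.
ROWS = 1_000_000
MAX_SECONDS = 12 * 60  # "less than 12 min", used as an upper bound
ELEMENTS = 374_999 + 499_997 + 374_999 + 624_943  # objects, events, attributes, relations
LOG_BYTES = 207 * 10**6  # assumption: 1 MB = 10^6 bytes

# Lower bound on throughput, since the actual running time is below 12 min.
min_rows_per_second = ROWS / MAX_SECONDS
# Average size of one serialized element in XML-OCEL.
bytes_per_element = LOG_BYTES / ELEMENTS

print(ELEMENTS)                    # 1874938
print(round(min_rows_per_second))  # 1389
print(round(bytes_per_element))    # 110
```

In other words, the extraction processes at least about 1.4K database rows per second, and each extracted element takes roughly 110 bytes in the serialized log.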

Conformance Test. The OCEL standard also comes with a Python library, one of whose main functionalities is the validation of JSON-OCEL and XML-OCEL logs. The library reports that the log obtained by our method is compliant with the OCEL standard.

5 Conclusions

In this work, we have presented how to extract OCEL logs using the revised version of the OnProm framework. OnProm uses an annotation-based interface for users to specify the relationship between a domain ontology and an event ontology. Then OnProm leverages the VKG system Ontop to expose the underlying sources as a Knowledge Graph using the vocabulary from the OCEL event ontology. Thus, extracting OCEL logs is reduced to evaluating a fixed set of SPARQL queries. Our experiments confirmed that the extraction is efficient and that the extracted logs are compliant with the standard. OnProm provides a flexible framework for users to choose XES or OCEL according to their needs. For processes without many-to-many relations, the two standards yield similar results, but OCEL extraction is more efficient because events do not need to be grouped into cases. For modeling many-to-many relations, OCEL has a greater advantage because its underlying structure is a graph.

There are several directions for future work. First of all, we would like to carry out a user study to let more users try out our toolkit and to confirm that it is indeed easy to use. We are also interested in extracting logs from other sources beyond relational databases, e.g., from graph databases. Finally, the modularity of the approach makes it relatively straightforward to support other standards, and we will study this possibility.