1 Introduction

The field of process mining offers a wide variety of techniques to analyze event data. Process discovery, conformance and compliance checking, performance analysis, process monitoring and prediction, and operational support are some of the techniques that process mining provides to better understand and improve business processes. However, most of these techniques rely on the existence of an event log.

Obtaining event logs in real-life scenarios is not a trivial task. It is not common to find logs exactly in the right form. On many occasions, such logs simply do not exist and need to be extracted from some sort of storage, like databases. In these situations, when a database exists, several approaches are available to extract events. The most general is the classical extraction in which events are manually obtained from the tables in the database. To do so, substantial domain knowledge is required to select the right data scattered across tables. Some work has been done in this field to assist in the extraction and log generation task [2]. Also, studies have been performed on how to extract events in specific environments like SAP [8, 10, 18] or other ERP systems [12]. In [5], we presented a more general solution to extract events from databases, regardless of the application under study. The paper describes how to automatically obtain events from database systems that generate redo logs as a way to recover from failure. All mentioned approaches aim at, eventually, generating an event log, i.e., a set of traces, each of which contains a set of events. These events represent operations or actions performed in the system under study and are grouped into traces according to some criterion. However, there are multiple ways in which events can be selected and grouped into traces. Depending on the perspective we want to take on the data, we need to extract event logs differently. Also, a database contains a lot more information than just events. The extraction of events and their representation as a plain event log can be seen as a “lossy” process during which valuable information can get lost. Considering the prevalence of databases as a source for event logs, it makes sense to gather as much information as possible, combining the process view with the actual data.

We see that process mining techniques grow more and more sophisticated. Yet, the most time-consuming activity, event log extraction, is hardly supported. In big industrial database settings, where event data are scattered through hundreds of tables, and many processes coexist in the same environment, the queries used to extract event logs can become very complicated, difficult to write and hard to modify. Ideally, users should be able to find events explicitly defined and stored in a centralized way. These events should be defined in such a way that event correlation and log building could be performed effortlessly and be easily adapted to the business questions to answer in each situation. Also, to discover meaningful data rules, these events should be annotated with enough data attributes.

This paper aims at providing support to tackle the problem of obtaining, transforming, organizing and deriving data and process information from databases, abstracting the data to high-level concepts. This means that, after the ETL procedure is applied, users will no longer have to deal with low-level raw data scattered across tables (e.g., timestamps, activity names and case ids as columns of different tables that need to be joined together). Instead, users will be able to focus on the analysis, dealing only with familiar process elements such as events, cases, logs and activities, or with data elements such as objects, object versions (Footnote 1), object classes and attributes. Also, the new meta model proposed in this work can be seen as a data warehouse schema that captures all the pieces of information necessary to apply process mining to database environments.

In order to build event logs with events from databases, languages like SQL are the natural choice for many users. The point of this work is not to replace the use of query languages, but to provide a common meta model as a standard abstraction that ensures process mining applicability. Moreover, one of the main advantages of the adoption of a standard meta model has to do with multi-perspective event log building and analysis. Many different event logs can be built from the information stored in a database. Looking at data from different perspectives requires ad hoc queries that extract and correlate events in different ways (e.g., (a) order, delivery and invoice events, versus (b) order, product and supplier events). Our meta model defines the abstract concepts that need to be extracted to enable this multi-perspective event log building in a more intuitive way.

Additionally, as a result of the adoption of the proposed meta model, it becomes easier to connect the event recording systems of enterprises with analysis tools, generating different views on the data in a flexible way. Also, this work presents a comprehensive integration of process and data information in a consistent and unified format. All of this is supported by our implementation. Moreover, the provided solution has the benefit of being universal, being applicable regardless of the specific system in use. Figure 1 depicts an environment in which the process information of a company is scattered over several systems of a different nature, like ERPs, CRMs, BPM managers, database systems, redo logs, etc. In such a heterogeneous environment, the goal is to extract, transform and derive data from all sources into a common representation. By putting all the pieces together, analysis techniques like process mining can be readily applied.

The remainder of this paper is structured as follows: Section 2 presents a running example used throughout the paper. Section 3 explains the proposed meta model. Implementation details are presented in Sect. 4. The approach is evaluated in Sect. 5 on three real-life environments. The results of the evaluation and the querying of the output of our approach are analyzed in Sect. 6. Section 7 discusses the related work, and finally, Sect. 8 presents conclusions and future work. Additionally, Appendix A provides a formalization of the meta model, and Appendix B extends the formal details of the real-life environments presented in Sect. 5.

Fig. 1 Data gathering from several systems to a meta model

Fig. 2 Data schema of the example database

2 Running example

In this section, we present a running example to explain and illustrate our approach. Assume we want to analyze a setting where concerts are organized and concert tickets are sold. To do so, a database is used to store all the information related to concerts, concert halls (hall), seats, tickets, bands, performances of bands in concerts (band_playing), customers and bookings. Figure 2 shows the data schema of the database. It shows many different elements of the process involved. Let us now consider a complex question that could be posed from a business point of view: What is the behavior through the process of customers between 18 and 25 years old who bought tickets for concerts of band X? Answering this question starting from the given database is challenging for several reasons:

  1. The database does not provide an integrated view of process and data. Therefore, questions related to the execution of the underlying process with respect to some of the elements cannot be directly answered with a query.

  2. The current database schema fits the purpose of storing the information in this specific setting, but it is not flexible enough to accommodate new kinds of data such as events or objects of a different nature.

  3. The setting lacks execution information in an accessible way (events, traces and logs are missing, so process mining cannot be applied directly), and there is no assistance on how to extract or derive this information from the given data.

  4. If we plan to use the data as it is, we need to adapt to the way it is stored for every question we want to answer.

All these reasons make the analysis complex in many settings. At best, such an analysis can only be carried out by extracting a highly specialized event log through a complex ad hoc query. Besides, the extraction will need to be repeated for every new question that requires a new viewpoint on the data.

If we consider that, for the present database, some sort of event recording system is in place, the extraction of additional events becomes feasible. Listing 1 shows an example of an ad hoc query to answer the sample question posed above. This query makes use of the event data scattered across the database (booking, band, etc.), as depicted in Fig. 2, together with events recorded by a redo log system provided by the RDBMS (redo_logs_mapping table). The way to access redo log information has been simplified in this example for the sake of clarity. The first part of the query retrieves ticket booked events, making use of the booking_date timestamp stored in the table booking. These events are combined, by means of a union, with customer events. The latter are obtained from the redo logs, due to the lack of such event information in the data schema. Then, events belonging to the same customer are correlated by means of the caseid attribute, and the cases are restricted to those belonging to customers aged 18–25 at the time the booking was made. Additionally, we need to keep only the bookings made on bands named “X” at the moment of the booking. (A band could change its name at any point in time.) This is an especially tricky step, since we need to look into the redo logs to check whether the name was different at the time the booking was made.

Listing 1 Ad hoc query to answer the sample business question
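Since Listing 1 is rendered as a figure in the original, the fragment below gives a hedged sketch of what such an ad hoc query could look like, written in SQLite syntax for illustration. The table names follow Fig. 2, but the column names, the layout of redo_logs_mapping and the join path from booking to band are assumptions, and the check of the historical band name via the redo logs is omitted.

```sql
-- Hedged sketch in the spirit of Listing 1; not the paper's exact query.
-- Table names follow Fig. 2; columns, the redo_logs_mapping layout and the
-- booking -> concert join path are assumptions.
WITH all_events AS (
    -- "ticket booked" events taken directly from the booking table
    SELECT b.customer_id AS caseid, 'ticket_booked' AS activity, b.booking_date AS ts
    FROM booking b
    UNION ALL
    -- customer events reconstructed from the redo logs exposed by the RDBMS
    SELECT r.row_id AS caseid, r.operation AS activity, r.change_time AS ts
    FROM redo_logs_mapping r
    WHERE r.table_name = 'CUSTOMER'
)
SELECT e.caseid, e.activity, e.ts
FROM all_events e
JOIN customer     c  ON c.id = e.caseid
JOIN booking      b  ON b.customer_id = c.id
JOIN band_playing bp ON bp.concert_id = b.concert_id   -- simplified; may go via ticket/seat
JOIN band         bd ON bd.id = bp.band_id
WHERE bd.name = 'X'                                    -- current name only; renames ignored
  -- customers aged 18-25 at booking time (SQLite date arithmetic)
  AND (strftime('%Y', b.booking_date) - strftime('%Y', c.birth_date))
      - (strftime('%m-%d', b.booking_date) < strftime('%m-%d', c.birth_date))
      BETWEEN 18 AND 25
ORDER BY e.caseid, e.ts;
```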

It is evident that extracting specialized event logs that answer very specific questions, while maintaining the original data schema, is possible. However, such queries will have to be adapted or rewritten for every different setting or event recording system. The fact that users, and especially business analysts, need to be knowledgeable about the particularities of how event data are stored represents an important challenge to overcome. This is one of the main reasons why so much time and effort that should be devoted to analysis is consumed during the ETL phase instead. This work aims at supporting the ETL phase and, ultimately, at decoupling it from the analysis by providing a set of familiar abstractions readily available to business analysts.

3 Meta model

As has been shown before, a need exists for a way to store execution information in a structured way, something that accepts data from different sources and allows building further analysis techniques independently from the origin of this data. Efforts in this field have already been made as can be observed in [7] with the IEEE XES standard. This standard defines a structure to manage and manipulate logs, containing events and traces and the corresponding attributes. Therefore, XES is a good format to represent behavior. However, an XES file is just one view on the data, and despite being an extensible format, it does not provide a predefined structure to store all the linked information we want to consider.

Because of this, it seems necessary to define a structured way to store additional information that can be linked to the classical event log. This new way to generalize and store information must provide sufficient details about the process, the data types and the relations between all the elements, making it possible to answer questions at the business level, while looking at two different perspectives: data and process.

3.1 Requirements

To be able to combine the data and process perspectives in a single structure, it is important to define a set of requirements that a meta model must fulfill. It seems reasonable to define requirements that consider backwards compatibility with well-established standards, support of additional information, its structure and the correlation between process and data views:

  1. The meta model must be compatible with the current meta model of XES, i.e., any XES log can be transformed to the new meta model and back without loss of information,

  2. It must be possible to store several logs in the new meta model, avoiding event duplication,

  3. Logs stored in the same meta model can share events and belong to different processes,

  4. It must be possible to store some kind of process representation in the meta model,

  5. The meta model must allow storing additional information, like database objects, together with the events, traces and processes, and the correlation between all these elements,

  6. The structure of additional data must be precisely modeled,

  7. All information mentioned must be self-contained in a single storage format, easy to share and exchange, similarly to the way that XES logs are handled.

The following section describes the proposed meta model which complies with these requirements, providing a description of the concepts. A formalization of the meta model can be found in Appendix A.

3.2 Description

Considering the typical environments studied in the process mining field, it is common to find systems backed by some sort of database storage. Regardless of the specific technology behind these databases, all of them have in common some kind of structure for their data. We can describe our meta model as a way to integrate the process and data perspectives, providing flexibility in its inspection and assistance in reconstructing the missing parts. Figure 3 shows a high-level representation of the meta model. On the right-hand side, the data perspective is considered, while the left-hand side models the process view. Assuming that the starting point of our approach is data, we see that the least abstract elements of the meta model, events and versions, are related, providing the connection between the process and data views. These are the basic building blocks of the whole structure and, usually, the rest can be derived from them.

The data side considers three elements: data models, objects and versions. The data models provide a schema describing the objects of the database. The objects represent the unique entities of data that ever existed or will exist in our database, while the versions represent the specific values of the attributes of an object during a period of time. Versions represent the evolution of objects through time. The process side considers events, instances and processes. Processes describe the behavior of the system. Instances are traces of execution for a given process, i.e., sets of events ordered in time. These events represent the most granular kind of execution data, denoting the occurrence of an activity or action at a certain point in time.

Fig. 3 Diagram of the meta model at a high level

Fig. 4 ER diagram of the meta model. The entities have been grouped in sectors, delimited by the dashed lines

Figure 4 depicts the entity-relationship diagram of the meta model. Some elements have been omitted in the diagram for the sake of simplicity. A full version of the ER diagram is available online (Footnote 2), and a formalization is provided in Appendix A. Each of the entities in the diagram, represented by a square, corresponds to a basic entity of the meta model. Also, these entities, together with their relations (diamond shapes), have been grouped in areas that we call sectors (delimited by dashed lines). These sectors (data models, objects, versions, events, cases and process models) contain tightly related concepts and provide an abbreviated representation of the meta model. For the sake of clarity, this “sectorized” representation of the meta model will be used in further sections. As can be observed, the entity-relationship diagram is divided into six sectors. The purpose of each of them is described below, and a simplified relational sketch of the sectors is given after the list:

  • Data models: this sector is formed by concepts needed to describe the structure of any database system. Many data models can be represented together in this sector, whose main element is the data model entity. For each data model, several classes can exist. These classes are abstractions of the more specific concept of table, which is commonly found in RDBMSs. Looking at the database provided in Sect. 2, the tables customer and booking are examples of classes. These classes contain attributes, which are equivalent to table columns in modern databases (e.g., id, name, address, etc.). The references between classes of the same data model are represented with the relationship entity. This last entity holds links between a source and a target class (e.g., booking_customer_fk which relates the source class booking to the target class customer).

  • Objects: the object entity, part of the objects sector, represents each of the unique data elements that belong to a class. An example of this can be a hypothetical customer with customer_id = 75. Additional details of this object are omitted, given that they belong to the next sector.

  • Object versions (Versions): for each of the unique object entities described in the previous sector, one or many versions can exist. A version is an instantiation of an object during a certain period of time, e.g., the customer object with id 75 existed in the database during a certain period of time, for example from “2015-08-01 14:45:00” to “2016-09-03 12:32:00”. During that period of time, the object had specific values for the attributes of the customer class that it belongs to. Therefore, there is a version of customer 75, valid between the mentioned dates, with name “John Smith”, address “45, 5th Avenue”, and birth date “1990-01-03”. If at some point the value of one of the attributes changed (e.g., a new address), the end timestamp of the previous version would be set to the time of the change, and a new version would be created with the updated value for that attribute and a start timestamp equal to the end of the previous version, e.g., version_1 = {object_id = 75, name = “John Smith”, address = “45, 5th Avenue”, birth_date = “1990-01-03”, start_timestamp = “2015-08-01 14:45:00”, end_timestamp = “2016-09-03 12:32:00”}, and version_2 = {object_id = 75, name = “John Smith”, address = “floor 103, Empire State Building”, birth_date = “1990-01-03”, start_timestamp = “2016-09-03 12:32:00”, end_timestamp = NONE}. Note that the value of end_timestamp for the newly created object version (version_2) is NONE. That means that it is the current version for the corresponding object (object_id = 75). Another entity reflected in this sector is the concept of relation. A relation is an instantiation of a relationship and holds a link between versions of objects that belong to the source and target classes of the relationship. For example, a version of a booking object can be related to another version of a customer object by means of a relation instance, given that a relationship (booking_customer_fk) exists from class booking to class customer.

  • Events: this sector collects a set of events, obtained from any available source (database tables, redo logs, change records, system logs, etc.). In this sector, events appear as a collection, not grouped into traces (such grouping is reflected in the next sector). In order to keep process information connected to the data side, each event can be linked to one or many object versions by means of a label (eventToOVLabel). This label allows specifying what kind of interaction exists between the event and the referred object version, e.g., insert, update, delete, read, etc. Events hold details such as timestamp, life cycle and resource information, apart from an arbitrary number of additional event attributes.

  • Cases and instances: the entities present in this sector are very important from the process mining point of view. The events by themselves do not provide much information about the control flow of the underlying process, unless they are correlated and grouped into traces (or cases). First, the activity instance entity must be explained. This entity is used to group events that refer to the same instance of a certain activity with different values for its life cycle, e.g., the execution of the activity book_tickets generates one event for each phase of its life cycle: book_tickets+start and book_tickets+complete. Both events, referring to the same execution of an activity, are grouped in the same activity instance. Then, as in any other event log format, activity instances can be grouped in cases, and these cases, together, form a log.

  • Processes: the last sector contains information about processes. Several processes can be represented in the same meta model. Each process is related to a set of activities, and each of these activities can be associated with several activity instances, contained in the corresponding cases and instances sector.
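To make the sector structure more concrete, the DDL fragment below sketches how these concepts could map onto relational tables. It is a deliberately minimal illustration: the table and column names are assumptions, and the actual OpenSLEX schema (Fig. 4 and the full class diagram available online) contains additional entities, attributes and mapping tables.

```sql
-- Minimal relational sketch of the meta model sectors (illustrative names only;
-- the real OpenSLEX schema is richer, see Fig. 4 and the full diagram online).

-- Data models sector
CREATE TABLE datamodel    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE class        (id INTEGER PRIMARY KEY, datamodel_id INTEGER REFERENCES datamodel(id), name TEXT);
CREATE TABLE attribute    (id INTEGER PRIMARY KEY, class_id INTEGER REFERENCES class(id), name TEXT);
CREATE TABLE relationship (id INTEGER PRIMARY KEY, source_class INTEGER REFERENCES class(id),
                           target_class INTEGER REFERENCES class(id), name TEXT);

-- Objects and versions sectors
CREATE TABLE object  (id INTEGER PRIMARY KEY, class_id INTEGER REFERENCES class(id));
CREATE TABLE version (id INTEGER PRIMARY KEY, object_id INTEGER REFERENCES object(id),
                      start_timestamp TEXT, end_timestamp TEXT);
CREATE TABLE version_attribute_value (version_id INTEGER REFERENCES version(id),
                                      attribute_id INTEGER REFERENCES attribute(id), value TEXT);
CREATE TABLE relation (relationship_id INTEGER REFERENCES relationship(id),
                       source_version INTEGER REFERENCES version(id),
                       target_version INTEGER REFERENCES version(id));

-- Events, cases and processes sectors
CREATE TABLE process  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE activity (id INTEGER PRIMARY KEY, process_id INTEGER REFERENCES process(id), name TEXT);
CREATE TABLE log      (id INTEGER PRIMARY KEY, process_id INTEGER REFERENCES process(id), name TEXT);
CREATE TABLE "case"   (id INTEGER PRIMARY KEY, log_id INTEGER REFERENCES log(id), name TEXT);
CREATE TABLE activity_instance (id INTEGER PRIMARY KEY,
                                activity_id INTEGER REFERENCES activity(id),
                                case_id INTEGER REFERENCES "case"(id));
CREATE TABLE event (id INTEGER PRIMARY KEY,
                    activity_instance_id INTEGER REFERENCES activity_instance(id),
                    timestamp TEXT, lifecycle TEXT, resource TEXT);
CREATE TABLE event_to_object_version (event_id INTEGER REFERENCES event(id),
                                      version_id INTEGER REFERENCES version(id),
                                      label TEXT);   -- e.g., insert, update, delete, read
```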

An instantiation of this meta model fulfills the requirements set in Sect. 3.1 in terms of storage of data and process view. Some characteristics of this meta model that enable full compatibility with the XES standard have been omitted in this formalization for the sake of brevity. Additionally, an implementation of this meta model has been made. This was required in order to provide tools that assist in the exploration of the information contained within the populated meta model. More details on this implementation are explained in the following section.

4 Implementation

The library Open SQL Log Exchange (OpenSLEX, Footnote 3), based on the meta model proposed in this work, has been implemented in Java. This library provides an interface to insert data into an instantiation of this meta model and to access it in a way similar to how XES logs are managed by the IEEE XES Standard [7]. However, under the hood it relies on SQL technology. Specifically, the populated meta model is stored in a SQLite (Footnote 4) file. This provides some advantages, like an SQL query engine, a standardized format, and storage in self-contained single data files, which benefits exchange and portability. Figure 4 shows an ER diagram of the internal structure of the meta model. However, it represents a simplified version to make it more understandable and easier to visualize. The complete class diagram of the meta model can be accessed on the OpenSLEX website (Footnote 3). In addition to the library mentioned earlier, an inspector has been included in the Process Aware Data Suite (PADAS, Footnote 5) tool. This inspector, depicted in Fig. 5, allows the user to explore the content of OpenSLEX files by means of a GUI, digging into the data and applying basic filters on each element of the structure. The tool presents a series of blocks that contain the activities, logs, cases, activity instances, events, event attribute values, data model, objects, object versions, object version attribute values and relations entities of the meta model. Some of the lists in the inspector (logs, cases, activity instances, events and objects) have tabs that allow one to filter the content they show. For instance, if the tab “Per Activity” in the cases list is clicked, only cases that contain events of the selected activity will be shown. In the same way, if the tab “Per Case” in the events list is clicked, only events contained in the selected case will be displayed. An additional block in the tool displays the attributes of the selected event.

Fig. 5 Screenshot of the meta model inspector tool

The goal of providing these tools is to assist in the task of populating the proposed meta model, in order to query it in a posterior step. Because the meta model structure and the data inserted into it are stored in a SQLite file, it is possible to execute SQL queries in a straightforward way. In particular, the whole process of extracting, transforming and querying data has been implemented in a RapidProM (Footnote 6) workflow. RapidProM is an extension that adds process mining plugins from ProM to the well-known data mining workflow tool RapidMiner (Footnote 7). For our task, an OpenSLEX meta model population operator has been implemented in a development branch (Footnote 8) of RapidProM. This operator, together with the data handling operators of RapidMiner (including database connectors), lets us extract, transform and query our populated meta model automatically in a single execution of the workflow. More details on this workflow and the extraction and transformation steps are provided in Sect. 5, showing how the technique can be applied in different real-life environments. A workflow was designed for each of the evaluation environments with the purpose of extracting and transforming the content of the corresponding databases to fit the structure of the OpenSLEX meta model. Each of these workflows, also referred to as adapters, is extensible using the collection of operators available in RapidMiner, RapidProM and other plugins, and can be modified to fit new environments. Moreover, Sect. 6 demonstrates how querying the resulting populated OpenSLEX meta model can be standardized across all the proposed environments.
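For instance, once a meta model has been populated, a question such as “how many events of each activity does a given log contain” reduces to a plain SQL query over the SQLite file. The query below uses the illustrative table and column names from the sketch in Sect. 3.2 and a hypothetical log name; the exact OpenSLEX schema may differ.

```sql
-- Events per activity for one log, on a populated meta model stored in SQLite.
-- Table/column names are the illustrative ones from Sect. 3.2;
-- 'redo_log_sample' is a hypothetical log name.
SELECT a.name      AS activity,
       COUNT(e.id) AS num_events
FROM log l
JOIN "case" c             ON c.log_id = l.id
JOIN activity_instance ai ON ai.case_id = c.id
JOIN event e              ON e.activity_instance_id = ai.id
JOIN activity a           ON a.id = ai.activity_id
WHERE l.name = 'redo_log_sample'
GROUP BY a.name
ORDER BY num_events DESC;
```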

The provided implementation of the OpenSLEX meta model is a general and extensible platform for data extraction and transformation. It provides the means to abstract raw data to high-level concepts. These concepts (events, cases, objects, etc.) are easier for business analysts to deal with. On the other hand, the current implementation has limitations. At the moment, there is no general method to perform this extraction and transformation independently of the raw data structure. This means that an ad hoc adapter needs to be designed for each data architecture (SAP data schema, Oracle redo logs, in-table versioning, etc.) in order to properly extract and interpret events and object versions. However, we believe that the three adapters provided with the implementation should suffice in most cases and serve as templates for other environments. It is important to note that, despite the need to design an extraction adapter for each type of environment, this still represents an advantage with respect to writing ad hoc queries for event log extraction. This is due to the fact that, once the extraction and transformation to OpenSLEX are performed, automated methods (implemented as operators in the RapidProM platform) become available for generating multiple event logs from different perspectives on a single dataset. This saves time and effort and is less error-prone than designing ad hoc queries for each specific event log perspective.

Fig. 6 Meta model completion in the three evaluated environments

5 Application in real-life environments

The development of the meta model presented in this paper has been partly motivated by the need for a general way to capture the information contained in different systems, combining the data and process views. Such systems, usually backed by a database, use very different ways to internally store their data. Therefore, in order to extract these data, it is necessary to define a translation mechanism tailored to the wide variety of such environments. Because of this, the evaluation aims at demonstrating the feasibility of transforming information from different environments into the proposed meta model. Specifically, three real-life source environments are analyzed:

  1. Database redo logs: files generated by the RDBMS in order to maintain the consistency of the database in case of failure or rollback operations. The data were obtained from real redo logs generated by a running Oracle instance. A synthetic process was designed and run in order to insert and modify data in the database and trigger the generation of redo logs. Because of the simplicity of the data model, this environment inspired the running example of Sect. 2.

  2. In-table version storage: an application-specific schema that stores each new version of an object as a new row in the corresponding table. The data for this analysis were obtained from a real-life instance at a Dutch financial organization.

  3. SAP-style change tables: changes to tables are recorded in a “redo log” style in separate tables, the way it is done in SAP systems. For this analysis, real SAP data, generated by an external entity, were used. Such data are often used for training purposes by the third-party organization.

The benefit of transforming data to a common representation is that it allows decoupling the analysis techniques from the sources of data. In addition, a centralized representation allows linking data from different sources. However, the source data may be incomplete in themselves. In some real-life scenarios, explicitly defined events might be missing. Other scenarios do not keep a record of previous object versions. A common issue is the lack of a clear case notion and, therefore, the absence of process instances. In all these cases, when merely transforming the data to fit our meta model is not enough, we need to apply automated inference techniques. This allows one to derive the missing information and create a complete and fully integrated view. Figure 6 shows these environments and which sectors of our meta model can be populated right away, extracting only what is available in the database. Then, following a series of automated steps such as version and event inference, case derivation, and process discovery, all the sectors can be populated with inferred or derived data.

The first part of this evaluation (Sect. 5.1) presents the different scenarios that we can find when transforming data. Each of these scenarios starts from data that correspond to different sectors of the meta model. Then, we show how to derive the missing sectors from the given starting point. Sections 5.2, 5.3 and 5.4 analyze the three common real-life environments mentioned before. We will demonstrate that data extraction is possible and that the meta model can be populated from these different sources. Section 5.5 presents a method to merge data from two different systems into a single meta model structure, providing an end-to-end process view. Section 6 demonstrates that, from the three resulting populated meta models, it is possible to standardize the process mining analysis and perform it automatically, adapting only a few parameters specific to each dataset. In addition, an example of the resulting populated meta model is shown for each of the environments.

Fig. 7 Input scenarios to complete meta model sectors

5.1 Meta model completion scenarios

Experience teaches us that it is rare to find an environment that explicitly provides the information to fill every sector of our meta model. This means that additional steps need to be taken to evolve from an incomplete meta model to a complete one. To do so, Fig. 7 presents several scenarios in which, starting from a certain input (gray sectors), it is possible to infer the content of other elements (dashed sectors). Depending on the starting point, we must infer the missing elements consecutively in the right order, which will, in the end, lead to a completely populated meta model:

  (a) Schema discovery: One of the most basic elements we require in our meta model to be able to infer other elements is the events sector. Starting from this input and applying schema, primary key and foreign key discovery techniques [17, 25], it is possible to obtain a data model describing the structure of the original data.

  (b) Object identification: If the events and a data model are known, we can infer the objects that these events represent. To do so, it is necessary to know the attributes of each class that identify the objects (primary keys). Finding the unique values for such attributes in the events corresponding to each class results in the list of unique objects of the populated meta model.

  (c) Case derivation: Another possible scenario is the one in which we derive cases from the combination of events and a data model. The event splitting technique described in [5], which uses the transitive relations between events defined by the data model, allows generating different sets of cases, or event logs. Similarly to scenario (b), this requires matching the attributes of each event to the attributes of each class. Then, the foreign key relations are used to correlate events with each other. Selecting the desired combination of relationships between events determines how to group these events into cases to form a log.

  (d) Version inference: The events of each object can be processed to infer the object versions that result from the execution of each event. To do so, events must contain the values of the attributes of the object they relate to at a certain point in time or, at least, the values of the attributes that were affected (modified) by the event. Then, ordering the events by (descending) timestamp and applying them to the last known value of each attribute allows us to reconstruct the versions of each object (a simplified SQL sketch is given after this list).

  (e) Event inference: The inverse of scenario (d) is the one in which events are inferred from object versions. Looking at the attributes that differ between consecutive versions, it is possible to create the corresponding event for each modification.

  (f) Process discovery: Finally, a set of cases is required to discover a process model using any of the many discovery algorithms available in the process mining field.
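To make scenario (d) more tangible, the following sketch (SQLite 3.25+ window-function syntax) derives one version interval per event: each event opens a new version of its object, which stays valid until the next event on the same object. The staging table event_stage and its columns are assumptions, and carrying unchanged attribute values forward from one version to the next is omitted for brevity.

```sql
-- Simplified version inference (scenario d): every event on an object opens a
-- new version, valid until the next event on the same object.
-- event_stage(object_id, event_ts, changed_values) is a hypothetical staging table.
SELECT object_id,
       changed_values,
       event_ts AS start_timestamp,
       LEAD(event_ts) OVER (
           PARTITION BY object_id
           ORDER BY event_ts
       )        AS end_timestamp          -- NULL marks the current version
FROM event_stage
ORDER BY object_id, start_timestamp;
```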

The following three sections consider real-life environments that can act as a source for event data. These environments are used to illustrate the scenarios in Fig. 7 and demonstrate that the complete meta model structure can be derived for each of them. In each scenario, the goal is to create an integrated view of data and process, even when event logs are not directly available.

5.2 Database redo logs

Table 1 Fragment of a redo log: each line corresponds to the occurrence of an event

The first environment focuses on database redo logs, a mechanism present in many DBMSs to guarantee consistency, as well as to provide additional features such as rollback and point-in-time recovery. Redo logs have already been considered in our earlier work [5, 21] as a source of event data for process mining. This environment corresponds to the scenario depicted in Fig. 7d, where data model, objects and events are available, but object versions need to be inferred. Table 1 shows an example fragment of a redo log obtained from an Oracle DBMS. After its processing, explained in [5], these records are transformed into events. Table 2 shows the derived values for the attributes of these events according to the changes observed in the redo log. For instance, the first row of Table 2 corresponds to the processed redo log record shown in the first row of Table 1. It corresponds to a customer insertion; therefore, all the values for each attribute were nonexistent before the event was executed (4th column). The second column holds the values for each attribute right after the event occurred. The rest of the events are interpreted in the same way.

Table 2 Events derived from the redo log fragment: each row corresponds to the occurrence of an event

Figure 6 shows a general overview of how the meta model sectors are completed according to the starting input data and the steps taken to derive the missing sectors. In this case, the analysis of database redo logs allows one to obtain a set of events, together with the objects they belong to and the data model of the database. These elements alone are not sufficient to do process mining without the existence of an event log (cases). In addition, the versions of the objects of the database need to be inferred from the events as well. Therefore, the starting point is a meta model in which the only populated sectors are Events, Objects and Data Model. We need to infer the remaining ones: Versions, Cases and Process Models.

Fortunately, a technique to build logs using different perspectives (Trace ID Patterns) is presented in [5]. The existence or definition of a data model is required for this technique to work. Figure 6 shows a diagram of the data transformation performed by the technique and how it fits in the proposed meta model structure. A more formal description of the mapping between redo logs and our meta model can be consulted in “Appendix B.2”. The data model is automatically extracted from the database schema and is the one included in the meta model. This data model, together with the extracted events, allows us to generate both cases (c) and object versions (d). Then, process discovery completes the meta model with a process (f). Once the meta model structure is populated with data, we can query it, taking advantage of the established connections between all the entities, and apply process mining for the analysis.
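As an illustration of how cases can be derived once events and the data model are available, the sketch below builds a simple log view for the Trace ID Pattern “one case per customer”: events recorded directly on customer rows keep their customer id, while booking events are correlated to a customer through the booking-to-customer foreign key. The staging views customer_events and booking_events and their columns are assumptions; the actual technique in [5] generalizes this to arbitrary foreign key paths of the data model.

```sql
-- Simplified case derivation for the Trace ID Pattern "one case per customer".
-- customer_events(event_id, customer_id, activity, ts) and
-- booking_events(event_id, booking_id, activity, ts) are hypothetical staging views.
SELECT customer_id AS case_id, event_id, activity, ts
FROM customer_events
UNION ALL
SELECT b.customer_id AS case_id, e.event_id, e.activity, e.ts
FROM booking_events e
JOIN booking b ON b.id = e.booking_id   -- correlate via the booking -> customer foreign key
ORDER BY case_id, ts;
```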

5.2.1 Database redo logs: transformation

In the previous section, we have described the nature of redo logs and how they can be extracted and transformed into event collections first and used to populate the meta model afterward. The goal of this section is to sketch the content of this resulting populated meta model and, in further sections, the kind of analysis that can be done on it.

Table 3 Redo log dataset transformation to populate the OpenSLEX meta model
Fig. 8 Data model of the redo log dataset as obtained from the populated meta model

Table 3 quantitatively represents the input and output data of our meta model population process for each of the entities introduced in Sect. 3. These entities match the ones depicted in Fig. 4. Moreover, the table provides qualitative properties on the nature of the transformation for each entity. In this case, we observe that all the data are obtained from the original database through SQL (except for the redo logs, which require some specific Oracle tools). The input entities (data model, class, attribute, relationship and event) are extracted automatically (5th column), directly transformed given that they are explicitly defined in the source data (6th column), and maintain their quantity (3rd vs. 4th columns). However, the remaining entities of our meta model need to be inferred, i.e., derived from the original entities. In this case, this derivation can be done automatically by our tool (PADAS). The data model that describes this dataset is depicted in Fig. 8 and clearly inspired the running example provided in Sect. 2.

5.2.2 Database redo logs: adapter implementation

In order to assist in the task of extracting and processing redo logs, the tool PADAS (Footnote 9) has been developed. This tool is able to connect to an existing Oracle RDBMS, load a collection of redo logs, extract their content and fill in the missing data from a database snapshot, obtain the data model automatically and export the resulting collection of events to an intermediate format. Then, the rest of the processing can be done offline, independently of the Oracle database.

In addition to this, the tool assists in selecting case notions from the data model and building logs for a specific view. As can be observed in Fig. 9, once the three main sectors have been extracted, i.e., data model, events and cases, the meta model population can begin. This is an automatic process that does not require any further user interaction.

Fig. 9 PADAS tool settings to convert a redo log dataset into a populated meta model

5.3 In-table versioning

It is not always possible to get redo logs from databases. Sometimes they are disabled or not supported by the DBMS. Also, we simply may not be able to obtain credentials to access them. Whatever the reason, it is common to face the situation in which events are not explicitly stored. This seriously limits the analysis that can be performed on the data. The challenge in this environment is to obtain, somehow, an event log to complete our data.

It can be the case that in a certain environment, despite lacking events, versioning of objects is kept in the database, i.e., it is possible to retrieve the old value of any attribute of an object at a certain point in time. This is achieved by storing every modified version of a row as a new row. This environment corresponds to the scenario depicted in Fig. 7e, where data model, objects and object versions are available, but events need to be inferred. The table at the bottom left corner of Fig. 10 shows an example of in-table versioning of objects. We see that the primary key of Customer is formed by the fields id and load_timestamp. Each row represents a version of an object, and every new reference to the same id at a later load_timestamp represents an update. Therefore, if we order rows (ascending) by id and load_timestamp, we get the set of versions of each object. The first one (with the oldest load_timestamp) represents an insertion, and the rest are updates of the values. A more formal description of the mapping from in-table versioning environments to our meta model can be found in Appendix B.3.

Fig. 10 Example of in-table versioning and its transformation into objects and versions

5.3.1 In-table versioning: transformation

Looking at Fig. 10, it is clear that, by ordering the versions in the original set (bottom left) by timestamp, we can reconstruct the different states of the database (right). Each new row in the original table represents a change in the state of the database. Performing this process for all the tables allows inferring the events in a setting where they were not explicitly stored (a simplified SQL sketch of this derivation is given below). Figure 6 shows that, thanks to the proposed meta model, it is possible to derive events starting from a data model, a set of objects, and their versions as input (Fig. 7e). The next step is to obtain cases from the events and data model by applying the technique from [5], which splits event collections into cases by selecting an appropriate Trace ID Pattern (scenario c). Finally, process discovery will allow us to obtain a process model to complete the meta model structure (scenario f). A more formal description of the mapping between in-table versioning sources and our meta model can be found in Appendix B.3.
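A minimal SQL sketch of this derivation (SQLite window-function syntax) is shown below for the Customer table of Fig. 10: every stored row becomes an event, where the oldest row per id yields an insert event and every later row an update event. The column names id and load_timestamp follow the figure; everything else is an assumption.

```sql
-- Derive events from in-table versioning: one event per stored row version.
SELECT id             AS object_id,
       load_timestamp AS event_ts,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY id
                                    ORDER BY load_timestamp) = 1
            THEN 'insert' ELSE 'update' END AS event_type
FROM customer
ORDER BY id, load_timestamp;
```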

Table 4 In-table versioning dataset transformation to populate the OpenSLEX meta model

As a result of the whole procedure, we have a meta model completely filled with data (original and derived) that enables any kind of analysis available nowadays in the process mining field. Moreover, it allows for extended analysis combining the data and process perspectives. Table 4 shows the input/output of the transformation process, together with details of its automation and derivation. We see that, in this environment, the input data correspond to the following entities of the meta model: data model, class, attribute and version. These elements are automatically transformed from the input data, where they are explicitly defined. However, the remaining elements need to be either manually obtained from domain knowledge (relationships between classes) or automatically inferred from the other elements. This is so because, given the configuration of the database, there was no way to query the relationships between tables, as they were not explicitly defined in the DBMS but managed at the application level. Therefore, these relationships needed to be specified by hand after interviewing domain experts on the process at hand.

Fig. 11 Data model of the in-table versioning dataset as obtained from the populated meta model

The resulting data model is depicted in Fig. 11. It is based on the database of a financial organization, focusing on the claim and service processing for a specific financial product. As can be observed, this dataset links information corresponding to customers (CUSTOMER (3)), their products (Product (8) and Service_Provition (6)), incoming and outgoing phone calls (Call_IN (2) and Call_OUT (7)), letters being sent to customers (Letter_OUT (4)), monthly statements from customers (MONTHLY_STATEMENT (1)) and customer details change forms (User_Details_Change_Form_IN (5)). In further sections, we will show how to obtain a view on this dataset, exploiting the structure of the meta model.

5.3.2 In-table versioning: adapter implementation

The task of transforming the data from an environment that presents an in-table versioning structure into our meta model can be automated by means of an adapter. In this case, we make use of the RapidMiner (Footnote 10) platform together with the RapidProM (Footnote 11) extension to implement a workflow that automates this transformation. Specifically, the operator needed for this task can be found in one of the development branches of the RapidProM repository (Footnote 12). This adapter, as depicted in Fig. 12, takes as input the set of CSV files that contain the unprocessed dump of the tables to be analyzed, obtained directly from the source database. Also, if we get direct access to such a database, the data can be queried directly, skipping the transformation to CSV files.

Fig. 12 RapidMiner workflow to convert an in-table versioning dataset in order to populate the OpenSLEX meta model

5.4 SAP-style change table

Fig. 13 Example of SAP change tables CDHDR and CDPOS

The last environment we consider for our feasibility study is related to widespread ERP systems such as SAP. These systems provide a huge number of functionalities to companies by means of configurable modules. They can run on various platforms and rely on databases to store all their information. However, in order to make them as flexible as possible, the implementation tries to be independent of the specific storage technology running underneath. We can observe SAP systems running on MS SQL, Oracle or other technologies, but they generally do not make intensive use of the features that the database vendor provides. Therefore, data relations are often not defined in the database schema, but managed at the application level. This considerably complicates the life of an analyst interested in obtaining event logs. Fortunately, SAP implements its own redo-log-like mechanism to store changes in data, and it represents a valid source of data for our purposes. In this setting, we lack event logs, object versions, a complete data model and processes. Without some of these elements, performing any kind of process mining analysis becomes very complicated. For instance, the lack of an event log does not allow for the discovery of a process, and without it, performance or conformance analyses are not possible. To overcome this problem, we need to infer the missing elements from the information available in the SAP database.

First, it must be noted that, despite the absence of an explicitly defined data model, SAP uses a consistent naming system for its tables and columns, and there is a lot of documentation available that describes the data model of the whole SAP table landscape. Therefore, this environment corresponds to the scenario depicted in Fig. 7d, where data model, objects and events are available, but object versions need to be inferred. To extract the events, we need to process the change log. This SAP-style change log, as can be observed in Fig. 13, is based on two change tables: CDHDR and CDPOS. The first table (CDHDR) stores one entry per change performed on the data, with a unique change id (CHANGENR, OBJECTCLAS, OBJECTID) and other additional details. The second table (CDPOS) stores one entry per field changed. Several fields of a data object can be changed at the same time and will share the same change id. For each field changed, the table name is recorded (TABNAME) together with the field name (FNAME), the key of the row affected by the change (TABKEY) and the old and new values of the field (VALUE_OLD, VALUE_NEW). A more formal description of the mapping between SAP change tables and our meta model can be found in “Appendix B.4”. This structure is very similar to the one used for redo logs. However, one of the differences is that changes to different fields of the same record are stored in different rows of the CDPOS table, while in the redo logs they are grouped in a single operation.
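As an illustration, the query below (SQLite syntax, e.g., run on the change tables after they have been extracted to a staging database) reconstructs one event per change id affecting the EKPO table, aggregating all changed fields of that change into a single record. The CDHDR/CDPOS column names follow the change-document layout described above, but the exact set of available columns should be verified on the concrete system.

```sql
-- One event per change id on table EKPO, aggregating its changed fields.
SELECT h.OBJECTCLAS,
       h.OBJECTID,
       h.CHANGENR,
       h.UDATE  AS change_date,        -- date of the change (assumed CDHDR column)
       h.UTIME  AS change_time,        -- time of the change (assumed CDHDR column)
       p.TABKEY AS affected_row_key,
       GROUP_CONCAT(p.FNAME || '=' || p.VALUE_NEW, '; ') AS changed_fields
FROM CDHDR h
JOIN CDPOS p
  ON  p.OBJECTCLAS = h.OBJECTCLAS
  AND p.OBJECTID   = h.OBJECTID
  AND p.CHANGENR   = h.CHANGENR
WHERE p.TABNAME = 'EKPO'
GROUP BY h.OBJECTCLAS, h.OBJECTID, h.CHANGENR, h.UDATE, h.UTIME, p.TABKEY;
```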

5.4.1 SAP-style change table: transformation

As can be seen in Fig. 6, after processing the change log and providing a SAP data model, we are in a situation in which the events, objects and data model are known. Therefore, we can infer the versions of each object (d), split the events into cases (c) and finally discover a process model (f). With all these ingredients, it becomes possible to perform any process mining analysis and answer complex questions combining the process and data perspectives.

Table 5 SAP dataset transformation to populate the OpenSLEX meta model
Fig. 14 General view of the data model of the SAP dataset as obtained from the populated meta model. Due to the size of the data model, the attributes of the tables have been omitted from this graph

Table 5 shows that, in order to populate our meta model, what we obtain from SAP are the following entities: data model, class, attribute, object and event. All these elements are explicitly defined and can be automatically transformed. However, the rest need further processing to be inferred from the input data. Only the relationships cannot be inferred, but they can be obtained automatically from domain knowledge. The difference between in-table versioning and SAP is that, despite the need for domain knowledge to obtain the relationships in both scenarios, in the case of SAP we can query the online documentation that specifies the connections between the different tables of the data model. To do so, a script (Footnote 13) has been developed which, given a set of table names, finds the relationships and stores them in a CSV file.

To get an idea of the complexity of the data model of this dataset, Fig. 14 shows a full picture of it. As can be noticed, it is a very complex structure. However, the tool allows one to zoom in and explore the interesting areas. Figure 15 shows a zoomed area of the data model, where the EKPO table points to the EKKO table.

Fig. 15 Detail of the data model of the SAP dataset as obtained from the populated meta model

5.4.2 SAP-style change table: adapter implementation

As has been mentioned before, SAP systems often use relational databases to store the documents and information they manage. Despite the wide variety of database systems used, the extraction of the relevant information is always possible, as long as a JDBC Driver exists. This allows one to connect to the source database directly from RapidMiner, as shown in the workflow in Fig. 16. In this specific case, the database was SAP Adaptive Server Enterprise (ASE), originally known as Sybase SQL Server. The process is as simple as providing the list of table names we are interested in. The workflow will loop through them (Fig. 17) and extract their content into CSV files that will be processed in a further step by a different workflow.

Once we have the CSV files of the SAP tables, we need to process them. To do so, the RapidProM workflow in Fig. 18 has been developed, which loops through these files (including the SAP change tables) to build our meta model.

5.5 Distributed system merge

The three previous environments have been considered in isolation, showing the ETL process on individual systems. On the other hand, as depicted in Fig. 1, it is not rare that a company's IT landscape is composed of several systems, such as ERPs, CRMs, BPM workflow managers, and so on, which operate separately. These systems store data related to different aspects of the functioning of the company. It is possible that, in order to perform our analysis, we need to combine data from several of these sources, e.g., to link user requests made through a website with the request handling managed by an internal SAP system. In that case, we need to merge the data coming from these independent systems into a single structure. In order to achieve this goal, we require at least one common connecting concept shared by these systems.

Fig. 16 RapidMiner workflow to connect to a SAP database and extract the content of all the tables

Fig. 17 RapidMiner subprocess to extract each table when connecting to a SAP database

Figure 19 shows an example of two systems being merged into a single data model. On the left side of the figure, the data model described in Sect. 5.3 is shown as an abstract group of tables, of which one, the customer table, has been selected. On the right side, the ticket selling platform described in Sect. 5.2 is represented, from which the customer table has been selected as well. Both customer tables (CUSTOMER_A and CUSTOMER_B) have a customer identification field. Both could share the same ids to represent the same concept; however, sharing ids is not required for the tables to be correlated. The merging process requires the creation of an additional linking table (CUSTOMER_LINK), which holds the relation between the customer ids of both separate data models.

To make the combination of these data models fit the structure of our meta model, it is necessary to make the connection between objects at the object version level. This can be done automatically as long as we know (a) which customer id from table A is related to which customer id from table B, and (b) during which period of time this relation existed. It can be that the correlation is permanent, for example, when we map customer ids of two systems which always represent the same entity (e.g., social security numbers on one system with national identification numbers on the other). In such a case, the connecting object will always relate to versions of the same two connected objects. This situation is represented in Fig. 20, where an object of class A (top line) is connected to an object of class B (bottom line) by means of a linking object (middle line). For each pair of coexisting object versions of objA and objB, a new version of objL must exist to relate them. This method has the benefit of providing great flexibility, since it can also hold relations between objects that change through time. An example of this is the case in which the connection between two tables represents employer-to-employee relations. An employer can be related to many employees, and employees can change employers at any time. Therefore, the mapping between employer and employee objects will not be permanent through time, but will evolve and change and only exist during a specific period. Such a case is supported by the proposed mapping method, as presented in Fig. 21, by means of new object versions for the linking object.
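Assuming version tables for both customer classes and the CUSTOMER_LINK mapping, the linking object versions described above can be computed as the time intersection of every pair of overlapping versions. The sketch below uses SQLite's scalar MAX/MIN (standard SQL would use GREATEST/LEAST) and a far-future date to stand in for open-ended versions; all table and column names are assumptions.

```sql
-- One version of the linking object per period in which a CUSTOMER_A version
-- and a CUSTOMER_B version coexist (validity = intersection of both periods).
SELECT l.customer_a_id,
       l.customer_b_id,
       MAX(va.start_timestamp, vb.start_timestamp)        AS start_timestamp,
       MIN(COALESCE(va.end_timestamp, '9999-12-31'),
           COALESCE(vb.end_timestamp, '9999-12-31'))      AS end_timestamp
FROM customer_link l
JOIN version_customer_a va ON va.object_id = l.customer_a_id
JOIN version_customer_b vb ON vb.object_id = l.customer_b_id
WHERE MAX(va.start_timestamp, vb.start_timestamp)
    < MIN(COALESCE(va.end_timestamp, '9999-12-31'),
          COALESCE(vb.end_timestamp, '9999-12-31'));
```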

Fig. 18 RapidMiner workflow to populate the meta model based on SAP dataset

As demonstrated with this example, the proposed meta model is able to merge information from different systems into a single structure, enabling the analysis of process and data in a holistic way, beyond the boundaries of the IT systems infrastructure. Throughout this section, we have seen how data extraction and transformation into the proposed meta model (Fig. 22) can be performed when dealing with environments of a very different nature. It is important to note that, despite its apparent complexity, all the steps previously mentioned are carried out automatically by the provided implementations of the adapters. These adapters can be modified and extended to add support for new environments.

6 Analysis of the resulting populated meta model

The main advantage of transforming all our source information into the proposed meta model structure is that, regardless of the origin of the data, we can pose questions in a standard way. Let us consider the three environments described in previous sections: redo logs, in-table versions and SAP systems. From the examples chosen to illustrate the transformation, it is evident that they belong to very different processes and businesses. The redo log example corresponds to a concert ticket selling portal. The second example, based on in-table versioning, corresponds to the claim and service processing for a financial product within an organization. The third example, based on an SAP system, represents the procurement process of a fictitious company. Due to the diversity of data formats and system designs, in a normal situation these three environments would require very different approaches in order to be queried and analyzed. However, we claim that our meta model provides the standardization layer required to tackle these systems in a similar manner.

6.1 Standardized querying

Our goal is to query the data from different sources in a standardized fashion. To do so, as demonstrated in Sect. 5, we extracted the raw input data from the databases of each system. Then, our automated workflows were used to transform these data and obtain, as a result, a structure compliant with the proposed meta model. In this way, we hope to make it plausible that, given a specific business question, it is possible to write an SQL query that answers this question on any populated meta model, regardless of the process it represents or the original format of the data. We demonstrate our claim with the following example. Let us assume we want to find the group of cases in our logs that comply with a specific rule, or that represent a part of the behavior we want to analyze. As an example, for the redo log dataset we could define the following Business Question (BQ1):

Fig. 19: Link between different data models through tables representing a common concept (CUSTOMER_A and CUSTOMER_B) by means of a link table (CUSTOMER_LINK)

BQ1: In which cases was the address of a customer updated?

Answering this on the original data would require going through the redo log looking for events on the table CUSTOMER, linking them to the specific customer_id, correlating the other events of the same customer, grouping them into traces and building the log. This is not an easy task, especially if such a complicated query needs to be built from scratch for each new business question that comes to mind. Let us consider another example, now for the in-table versioning dataset (BQ2):

Fig. 20: Merging method to map versions of objects belonging to classes of different data models (class A and class B) through a linking class (class Link)

Fig. 21: Merging method to map versions of multiple objects belonging to classes of different data models (class A and class B) through a linking class (class Link). In this case, the related objects change through time

Fig. 22: Fragment of a resulting populated meta model

BQ2: In which cases was a service request resolved?

To answer this, we need to go through the table product and check in which rows the value of the field decision changed with respect to the previous row sharing the same customer_id. Then, we need to correlate events from any other table through the same customer_id and build the log again. This requires a sequence of non-trivial and time-consuming steps. As a final example, we could think of the following question to ask of the SAP dataset (BQ3):

BQ3: In which cases was the quantity of a purchase order modified?

This time we need to go to the EKPO table, which contains information about purchasing document items. However, unlike the in-table versioning case, the previous values of the items are no longer in the original table. To find historical values for each field, we need to go through the CDHDR and CDPOS tables to find a row that corresponds to the modification of the field MENGE (purchase order quantity). To do so, we must look at the value of the field FNAME, which should be equal to the string “MENGE”, and the value of the field TABNAME, which should be equal to the string “EKPO”. Then, we must match it with the correct purchase order (sharing the same TABKEY) and correlate it with any other event (from the CDHDR and CDPOS tables) that affected the same purchase orders, grouping them into cases to form a log. Doing so, we can see the changes in context. We see that, as the complexity of the data increases, the complexity of querying it increases as well.
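To give an impression of the effort involved, the following is a rough sketch of the kind of join needed on the raw change documents. It assumes the standard CDHDR/CDPOS key and value columns (OBJECTCLAS, OBJECTID, CHANGENR, VALUE_OLD, VALUE_NEW); correlating the matches with other events and grouping them into cases would still have to be done on top of this.

-- Sketch: locate changes of the purchase order quantity (MENGE) in EKPO
-- by joining the change-document header (CDHDR) and item (CDPOS) tables.
SELECT h.OBJECTCLAS, h.OBJECTID, h.USERNAME, h.UDATE, h.UTIME,
       p.TABKEY, p.VALUE_OLD, p.VALUE_NEW
FROM   CDHDR h
JOIN   CDPOS p
  ON   p.OBJECTCLAS = h.OBJECTCLAS
 AND   p.OBJECTID   = h.OBJECTID
 AND   p.CHANGENR   = h.CHANGENR
WHERE  p.TABNAME = 'EKPO'
  AND  p.FNAME   = 'MENGE';
-- The result still has to be matched to the affected purchase order items
-- (via TABKEY) and correlated with all other changes of the same orders
-- before an event log can be built.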

Therefore, to summarize, we want to ask three different questions in three different environments, and each question needs to be answered in a specific way. However, we claim that we can derive a standard way to answer questions of the same nature on different datasets, as long as the data have been previously normalized into our meta model. If we look at the three proposed business questions (BQ1, BQ2 and BQ3), all of them have something in common: they rely on identifying cases in which a value was modified for a specific field. To complicate things a bit more, we can even filter the cases to obtain only the ones that cover a specific period of time. This means that we can generalize these questions into one common general question (GQ):

GQ: In which cases was there (a) an event that happened between time T1 and T2, (b) that performed a modification on a version of class C, (c) in which the value of field F changed from X to Y?

This is a question we can answer on the basis of our meta model, regardless of the dataset it represents. All we need to do is specify the values of each parameter (T1, T2, C, F, X and Y) according to the data model at hand. We translated the general question GQ into the SQL query in Listing 2.

Listing 2: SQL query answering the general question GQ on a populated meta model

This query is standard and independent of the dataset thanks to the use of RapidMiner macros, denoted by %{MACRO_NAME}. We just need to instantiate their values according to the dataset to process. Table 6 shows the values for each macro to be replaced in each case. Notice that the timestamps (TS_LOWER_BOUND and TS_UPPER_BOUND) are expressed in milliseconds, and that the string % in NEW_VALUE and OLD_VALUE will match any value.
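Since Listing 2 itself is only available as a figure, a minimal sketch of what such a parameterized query could look like is given below. The meta model table and column names used here ("case", activity_instance, event, object_version, class, attribute_name, attribute_value, and their keys) are simplifications assumed for illustration, and CLASS_NAME and ATTRIBUTE_NAME are illustrative macro names for the class C and field F parameters; the actual query in Listing 2 operates on the concrete schema of the implemented meta model.

-- Sketch of GQ on an assumed, simplified meta model schema.
-- %{...} placeholders are RapidMiner macros, instantiated per dataset.
-- Timestamps are assumed to be stored as milliseconds (cf. Table 6).
SELECT DISTINCT c.id AS case_id
FROM   "case"            c
JOIN   activity_instance ai      ON ai.case_id = c.id
JOIN   event             e       ON e.activity_instance_id = ai.id
JOIN   object_version    ov      ON ov.id = e.object_version_id
JOIN   object            o       ON o.id = ov.object_id
JOIN   class             cl      ON cl.id = o.class_id
-- the version created by the modification ...
JOIN   attribute_value   av_new  ON av_new.object_version_id = ov.id
JOIN   attribute_name    an      ON an.id = av_new.attribute_name_id
-- ... and the version it replaced, paired on the object's timeline
JOIN   object_version    ov_prev ON ov_prev.object_id = ov.object_id
                                AND ov_prev.end_timestamp = ov.start_timestamp
JOIN   attribute_value   av_old  ON av_old.object_version_id = ov_prev.id
                                AND av_old.attribute_name_id = an.id
WHERE  cl.name      = '%{CLASS_NAME}'       -- class C
  AND  an.name      = '%{ATTRIBUTE_NAME}'   -- field F
  AND  av_old.value LIKE '%{OLD_VALUE}'     -- value X ('%' matches any)
  AND  av_new.value LIKE '%{NEW_VALUE}'     -- value Y ('%' matches any)
  AND  e.timestamp BETWEEN %{TS_LOWER_BOUND} AND %{TS_UPPER_BOUND};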

Table 6 Parameters to query the three different populated meta models with the same query

This query can easily be executed in the RapidMiner environment using the Query Database operator. Figure 23 shows the workflow executed to set the macro values and run the query for each dataset. Some details of the logs obtained for each of the three datasets can be observed in Table 7.

Table 7 Query results for the three different populated meta models
Fig. 23: Analysis workflow to process the three resulting populated meta models

It is important to note that this technique does not exclude the classical way of analyzing logs. On the contrary, it enables the use of all existing process mining techniques. Proof of this is the workflow in Fig. 23, which executes the subprocess in Fig. 24 for each of the resulting logs. As can be observed, this process mining task transforms the log table into an XES log. Then, a process tree is discovered using the Inductive Miner and transformed into a Petri net. Next, a performance analysis is carried out on the discovered model and the log, together with conformance checking. In addition, a more interactive view of the process is obtained with the Inductive Visual Miner, and a social network (when the resource field is available) can be discovered using the Social Network Miner. All these analyses are performed automatically and with exactly the same parameters for each resulting log. The analysis is completely independent of the source, which means that, just by modifying the values of the macros in our query, we can standardize the process of analyzing sublogs for value modification cases.

6.2 Process mining results

To continue the demonstration of the viability of this technique, in this section we show the results of the automated analysis performed on the three different datasets by means of the same query and the same workflow.

Fig. 24: Process mining analysis subprocess included in the analysis workflow of the three populated meta models. It illustrates that queries on the populated meta models can be converted into event logs that can be analyzed by a range of process mining algorithms

Fig. 25: Discovered model and deviations for the redo log dataset

6.2.1 The redo log environment: ticket selling process

In the case of the ticket selling process, we are interested in the traces that modified the address of a customer. First, the discovered model is shown in Fig. 25. We see five main activities in this model: customer insertions, booking insertions, and three different kinds of customer updates. One can notice that the activity CUSTOMER+UPDATE+1411, which corresponds to updates of the address field, is the most frequent of the three update activities, second only to the BOOKING+INSERT+44 activity. One can also see that, as expected, a customer record can only be modified after it has been inserted. The dashed lines in the model represent deviations from the discovered model, usually corresponding to infrequent cases that were filtered out during the mining phase.

Fig. 26: Performance analysis of the model for the redo log dataset

Another technique that can be applied during the execution of the automated workflow is performance analysis. To do so, the log is replayed on the discovered model, and time differences are computed between consecutive events. Then, this information is displayed on top of the original model to visualize bottlenecks and performance issues. In this case, Fig. 26 shows that the most time-consuming activity is one of the customer updates. However, this information can be misleading, since these modifications were atomic and data about start and complete events are missing. To obtain a better view on performance metrics and accurate task durations, the life-cycle attribute should be properly assigned to pair events into activity instances. This is done automatically by the transformation adapter; however, in this case, there are no events to be paired.

Fig. 27: Conformance analysis of the model for the redo log dataset

To finalize the analysis of this process, Fig. 27 shows the result of checking the conformance of the log with respect to the process model. Here we see that, framed in red, the activities CUSTOMER+UPDATE+1411 and BOOKING+INSERT+44 show move-on-log deviations. However, as the green bar at the bottom and the number in parentheses show, these deviations occur in a very small share of the total cases. The rest of the activities are always executed synchronously with the model.

Fig. 28: Discovered model and deviations for the in-table versioning dataset

6.2.2 The in-table versioning environment: claim management process

The second environment corresponds to the claim management process of a financial organization. The result of our query represents the cases in which the value of the Decision field of a product service claim changed from Undecided (Nog geen beslissing) to Granted (Toekenning). Figure 28 shows the model obtained with the Inductive Visual Miner. It can be observed that the process starts with a Monthly Statement being initialized. Then, for most of the following steps, three possibilities exist: the activity is executed, the activity is re-executed (in a loop), or the activity is skipped. The frequencies on the arcs show that, in most of the cases, each subsequent activity happens at least once. These activities consist of the Reception and Checking of Monthly Statements, Incoming calls, Outgoing letters, Changes in user details and Outgoing calls. Finally, the claim is resolved and three outcomes are possible: Not decided, Accepted or Rejected.

Fig. 29: Performance analysis of the model for the in-table versioning dataset

Fig. 30: Conformance analysis of the model for the in-table versioning dataset

When looking at the performance view in Fig. 29, we notice that most of the places of the net are colored red. This means that the waiting time in these places is higher than in the rest. This makes sense since, for some automatic activities, the waiting time between tasks will be very low, on the order of milliseconds or seconds, and those places set the lower bound of the performance scale. For most of the human-driven activities, on the other hand, the times will be much longer, on the order of days. With respect to the activities, we see some variability, with some very costly activities, especially at the beginning. These are mainly the activities related to the processing of Monthly Statements, Incoming calls and Outgoing letters.

Figure 30 shows some conformance results. Here we can observe that most of the activities happen without significant deviations. Some model moves are observed, but at a very low rate, mainly because of the infrequent skips discussed for Fig. 28.

6.2.3 The SAP environment: procurement process

Finally, we focus on the procurement process of an SAP environment. We are particularly interested in the cases in which the quantity of a purchase order was modified. After querying the populated meta model and obtaining the relevant cases, we used the Inductive Visual Miner to obtain a model of the process. Figure 31 shows a very structured process, which is not surprising given the few cases that fulfilled our criteria.

Fig. 31: Discovered model and deviations for the SAP dataset

It seems clear that some activities are executed repeatedly because of the loops. The process always starts with an update of the value of the Effective Price in the Purchasing Info Record (INFOSATZ_U_EFFPR). This is followed by updates in the Purchasing Document (EINKBELEG object class), specifically in the Purchase order not yet complete field (MEMORY). Only then is the Purchase Order Quantity (MENGE) field updated.

Fig. 32: Performance analysis of the model for the SAP dataset

Fig. 33: Conformance analysis of the model for the SAP dataset

Fig. 34: Social network for the SAP dataset

According to Fig. 32, the performance seems evenly distributed, except for the update in Purchase Order Quantities, which seems to take a shorter time than the rest. From the conformance point of view (Fig. 33), we see that the process does not show deviations. This was expected, given the size of the log and the existence of only one case with this particular behavior.

Something interesting to note about this dataset, compared to the other environments, is the existence of resource information. This means that we know who performed each change to the documents in SAP. This can be exploited to discover the social network behind the interactions. In this case, Fig. 34 shows the discovered social network, revealing the handover of work between resources UJBSSEN and EHANI.

7 Related work

This paper is based on our previous work [4], in which we already proposed the same meta model. However, in this paper, special emphasis is put on the applicability of the approach and the standardization of the analysis. For this, we provide a formal description of the mapping between real-life systems and our meta model. Moreover, we provide an implementation of all our techniques, making it possible to transform the data to populate our meta model, capturing the data and process perspectives of the system under analysis. This makes it possible to query the meta model in a standard way, obtaining the relevant cases for the business question at hand, in order to proceed with the analysis. Finally, a demonstration of the analysis is given, covering all the phases of the process, starting from data extraction, and continuing with data transformation, querying, process discovery, conformance and performance analysis.

Despite the efforts made by the community, we can identify a few areas in which the problem of modeling, collecting and analyzing process execution data remains a challenge. Business process management and workflow management are areas in which the community has focused on providing models to describe system functioning and make the analysis of system behavior possible. Papers like [16, 27] provide meta models to give structure to audit trails on workflows. However, they focus mainly on the workflow or process perspective.

Process mining has different needs, and the desire to store event data in a unified way is obvious. In [19], the authors provide a meta model to define event logs, which would later evolve into the IEEE XES Standard format [7]. This format represents a great achievement from the standardization point of view and allows exchanging logs and developing mining techniques that assume a common representation of the data. However, the current definition of XES is not able to effectively and efficiently store the data perspective of our meta model. From our point of view, the XES format is, in fact, a target format and not a source of information. We aim at generating, from a richer source, different views on the data in XES format to enable process mining.

Artifact-centric approaches provide a different point of view. These approaches give more relevance to the data entities in business management systems. Data artifacts are identified within business process execution data, in order to discover how changes and updates affect their life cycle [15]. Techniques like the one proposed in [9] are able to combine the individual life cycles of different artifacts to show the interrelation of the different steps. However, a limitation of these techniques is the difficulty of correctly interpreting the resulting models, for which clear execution semantics are not always available. The application of other techniques, such as conformance and performance analysis, to the resulting models remains an open problem.

Data warehousing is an area in which there is great interest in solving the problem of gathering business process information from different sources. Work has been done on this aspect [3, 14, 26, 28, 29], proposing ways to centralize the storage of heterogeneous data, with the purpose of enabling process monitoring and workflow analysis. However, some of these approaches, despite being process-oriented, do not allow applying process mining: event data are lacking, and measures of process performance are stored instead. On the other hand, the approaches that do allow storing more detailed information [13] force an ad hoc structure, only relevant for the process at hand. The main distinction between our approach and existing work on data warehousing for process mining is the generality and standardization of our meta model, which is independent of the specific process while combining both data and process perspectives.

Process cubes [20] are techniques, closely related to data warehousing, that aim at combining aspects of multi-dimensional databases (OLAP cubes) with event logs, in order to enable the application of process mining techniques. Further work, with functional implementations, has been presented in [1, 22,23,24]. These approaches allow one to slice, dice, roll up and drill down event data using predefined categories and dimensions. They represent a great improvement for the analysis of multi-dimensional data. However, these techniques still rely on event data as their only source of information, and they require any additional data to be part of each individual event. Therefore, given that it is impossible to efficiently represent in an event log all the aspects we cover in our meta model, process cubes, as they stand, do not represent an alternative for our purposes.

SAP Process Observer is a component of the wide spectrum of SAP utilities. It makes it possible to monitor the behavior of some SAP Business Suite processes (e.g., order to cash). This component monitors Business Object (BO) events and creates logs that correlate them. The tool provides insights such as deviation detection, real-time exception handling, service level agreement tracking, etc. These event logs can also be used to perform other kinds of analytics. One of the main advantages of this solution is that it enables real-time monitoring without the need to deal with raw data. However, it needs to be configured by specifying the activities that must be monitored. The kind of logs it generates is no different from the event logs provided by other tools or techniques, in the sense that they use the same structure, with events grouped in sets of traces and data only represented as plain attributes of events, traces and logs. Also, it lacks genericity, being only applicable in SAP environments to monitor Business Suite processes.

Nowadays, several commercial process mining tools are available on the market. These tools import either XES event logs (e.g., Disco) or event tables (e.g., Celonis), supporting attributes at the event, trace and log levels. Most of them provide advanced filtering features based on these attributes, as is the case with Disco. Additionally, Celonis has two interesting features. First, it implements OLAP capabilities, making it possible to perform operations like slicing and dicing on event data. Second, Celonis provides its own language to express formulas. These formulas are used to compute statistics and key performance indicators (KPIs) and to express complex conditions for OLAP operations. All these features are remarkable. However, they are restricted by the flat structure of the input event log format. It is not possible to keep complex data relations in such a restricted representation. These complex relations are needed to perform advanced data querying, like query GQ in Sect. 6.1, where relevant event data must be selected based on the evolution of database objects of a specific class.

The main flaw of most of these approaches resides in the way they force the representation of complex systems by means of a flat event log. The data perspective is missing, only allowing one to add attributes at the event, trace or log level. More recent works try to improve the situation, analyzing data dependencies in business models with the purpose of improving them [11], or observing changes in object states to improve their analysis [6]. However, none of the existing approaches provides a generic and standard way of gathering, classifying, storing and connecting the process and data perspectives of information systems, especially when dealing with databases where the concept of a structured process can be fuzzy or nonexistent.

8 Conclusion

In this paper, a meta model has been proposed that provides a way to capture a more descriptive image of the reality of business information systems. This meta model aligns the data and process perspectives and enables the application of existing process mining techniques. At the same time, it enables a new way to query data and historical information, thanks to the combination of data and process perspectives in the analysis of information systems. The applicability of the technique has been demonstrated by means of the analysis of several real-life environments. Also, an implementation of the proposed solution has been developed and tested, including adapters for each of these well-known environments. However, from our point of view, the main contribution of this work is the universality of the proposed solution. Its applicability to so many different environments provides a common ground to separate data extraction and analysis into different problems, in this way generating an interface that is much richer and more powerful than the existing standards. To summarize, it provides a standardization layer to simplify and generalize the analysis phase.

Nevertheless, several challenges remain open. For instance, a system that enhances the query building experience, allowing queries to be formulated in a more natural and user-friendly way, is desirable. Also, mechanisms to exploit the benefits of the process side of the meta model (control flow, performance information, conformance, etc.) in combination with the data side (e.g., data rule discovery) would make the solution truly beneficial compared to the regular queries that can be performed on the source database systems.