1 Introduction

Process mining is, today, an essential analytical instrument for data-driven process improvement and steering [8, 10, 21]. It helps to understand how a specific process contributes to the overall value chain, to identify different types of operational debt, to quantify improvement opportunities and, eventually, to measure the impact of transformation projects. Put another way, it is the instrument by means of which the business process management (BPM) lifecycle, as in [11, p. 21], can be effectively brought to life.

However, it was not always this way. Considering the state of the process mining discipline as of 2013 [2], the majority of work was still very academia-focused. Use-cases and pilots, run within research projects or by pioneering process mining technology providers – at that time spin-offs founded by PhD researchers in the area – substantiated the power of process mining. Yet practical evidence for the suitability of process mining as a scalable instrument for identifying process improvements was still missing.

There were two main reasons for this. The first was the lack of market (and methodology) maturity: stakeholders could not clearly distinguish between process mining and business intelligence, and providers/consultants could not clearly articulate (and/or substantiate) its advantages. The second was the fact that process mining, like any other data-analytical instrument, requires a specific data model. For process mining, this is an event log, whose assembly requires a wide range of skills beyond pure data staging and aggregation. Experience in pulling together an event log for complex processes and heterogeneous systems was lacking.

Put another way, while “academic” process mining work mostly starts with a given log L, “practical” process mining work starts with a set of systems (or tables) and aims at creating the log file L for subsequent analysis. Admittedly, the latter is easier said than done. Depending on the complexity of the source data model and of the process to be discovered, up to \(80\%\) of a project timespan is used for data preprocessing and log creation, leaving \(20\%\) for the real process analytical work [9]. While reviewing the existing literature, we have seen a focus on use-cases [14, 23, 25], on general approaches to (and techniques for) process analytics [5], and on strategies and frameworks for creating event logs for process mining [4, 17]. Recently, data quality has also been receiving more attention [3]. However, we could not find previous work addressing all of these elements together with a hands-on data preprocessing example and corresponding best-practices.

Given this scenario and our practitioner’s view, the goal of this chapter is threefold:

  1. Report on process mining adoption in different industries, as well as on the drivers for process mining usage. We will illustrate different application scenarios and drivers with practical cases.

  2. Elaborate on a real-world example focusing on the event log construction for the Order to Cash (O2C) process as seen on an SAP system.

  3. Summarize the best-practices and the experience we have acquired by conducting process mining projects.

Below, we explicitly take up a practical view of process mining. We thus refrain from formalizations and introduce the necessary technical concepts – especially in the context of SAP – on an on-demand basis, including only the necessary aspects. While the hands-on example focuses on SAP, the methodology we elaborate on can be equally applied to other processes within SAP, or to other ERP systems, such as Oracle, Navision and Salesforce. It is also agnostic to the data transformation approach and platform, and to the process mining technology, thereby decoupling data transformation from the specific analytical tool one intends to employ.

By focusing on data preprocessing, we deliberately leave out various other – equally relevant – phases of a process mining project. See [27] for a process mining project methodology. For example, although we explain the different angles that make up a process mining project scope in Sect. 4.1, we will not cover the scoping phase in detail (e.g. deciding which process or legal entities are to be analyzed). We also skip the data maturity assessment phase, whose goal is to ensure that the system’s data provides a sound basis for process mining. This is typically required for less known, highly customized or legacy systems, not so much for standard ERP systems and their common satellite applications. Neither do we cover the analytical and improvement phases with their methods and methodologies, e.g. to derive insights from process mining and calculate a business case for change. The improvement perspective is extensively covered in [26].

The remainder of this chapter is laid out as follows. Section 2 reports on process mining adoption in different industries and the drivers for process mining. Section 3 introduces the SAP O2C process and the corresponding data foundation. Section 4 elaborates on how to construct a simple log file for SAP O2C. It does so by cutting through the complexities of data extraction, transformation and data model engineering, both in a general manner and in the specific context of SAP O2C. Section 5 summarizes the best-practices in creating an event log. Section 6 takes stock and provides an outlook on the upcoming challenges for data preprocessing.

2 Process Mining Adoption

Process mining is widely used in a multitude of industries and businesses to create transparency on key processes. This section first provides an overview of where process mining is being used and, subsequently, elaborates on the drivers for firms to deploy process mining as a basis for process understanding, monitoring and improvement. Although we illustrate, by means of real-world cases, how process mining has contributed to process improvements in those industries, this section will not deep-dive into specific case studies. For this, we refer to [14], a database with example process mining applications, and to [23], a book compiling a series of industry use-cases for process mining.

2.1 Business Usage

We have seen process mining being used in several industries and processes. Still, the adoption focus differs depending on the underlying industry type and its characteristics. To better differentiate adoption in the different industry segments and map the corresponding processes to the industries, we split businesses into three types, namely (a) “Financial Products” (e.g. banks and insurance companies); (b) “Industrial Products” (e.g. pharmaceuticals and manufacturing); and (c) “Services” (e.g. telecommunication, healthcare, retail and government).

Overall, Financial and Industrial Products are, to date, the segments with the highest process mining penetration [10]. That is not to say that process mining is not being successfully adopted in Services: healthcare [19, 24], telecommunication providers [23, Chap. 13 and 20] and municipalities [15] already profit highly from process mining today. However, according to technology providers and market research reports [12], they make up around \(15\%\) of the installed process mining base. Below we provide examples of how process mining is being adopted in the main industry segments, before turning to the driving factors in Sect. 2.2.

Financial Products. These are predominantly banks, e.g. retail, corporate and investment banking, and different types of insurance companies, e.g. health, life, composite and reinsurance. In banking, we have observed a focus on two processes: (a) loan and mortgage services and (b) account opening, in particular the KYC process (know-your-client), closely related to anti-money laundering prevention mechanisms. For the former, the main goal is to unleash operational efficiency by identifying automation potentials or by redesigning the process completely. For example, we have applied process mining to assess the loan process of a large bank based in the Benelux region. In doing so, we learned that around 70% of the applications were rejected (by the bank) or canceled (by the applicant), which is well above the industry benchmark for this type of process and region. More importantly, rejections and cancellations happened at the activity “Final Application Check”, the penultimate process step before completion. Put another way, applications ran through (at least) ten process steps, including an “Initial Application Check” (second process step), only to be rejected or withdrawn near the end of the process. This insight paved the way to reengineer the process by creating a more thorough initial application check, eventually reducing effort by 19 full-time equivalents (FTEs) per year.

Moving on to insurance – irrespective of its kind –, the focus is on two areas: first, claim management and processing, and second, back and front office functions, such as master data changes and lead management. Because of its sheer volume and business relevance, the primary focus is on claim management’s efficiency and effectiveness, specifically the level of fully automated claim processing and the adherence to service level agreements (SLA), that is, the time elapsed between the submission and settlement of a claim. In a Swiss-based health insurance company with around 15 million claims per year, process mining first helped measure the full automation rate over the year, namely 74% (the target being 80%). Second, it shed light on the root cause for manual work: a large bulk of claims were detoured to manual inspection just to set a final approval sign. While this activity took less than 10 s of processing time, it delayed the process by a median of 1.8 days (waiting time in work baskets) and reduced the automation level by 8.2%. By refining the rule-set for claims that really required the approval step, the automation rate immediately rose to 82.2%. As a side-effect, this improved SLA adherence by 8%.

Industrial Products. This type of industry is predominantly characterized by the manufacture of different types of products, such as cars, electronics, power plants or chemicals. Producing businesses, when transforming their operation towards bottom-line savings or top-line improvement, mainly focus on the so-called operational support functions including procurement, sales and general accounting, and supply-chain and production.

Because the operation of such industries is usually based on a traditional ERP system whose data structures are widely understood, such as SAP or Oracle, this industry can be seen as the forerunner for the deployment of process mining “in the large”. The main targets for process mining are procurement – “procure to pay” (P2P) or “source to pay” (S2P) – and sales – “order to cash” (O2C) or “lead to cash” (L2C). We address the sales process in the context of SAP in detail in Sect. 3. In fact, these two core processes – procurement and sales – often deliver a number of quick wins for rapid process improvements, both in terms of cost savings and increased revenue.

As an example, we have analyzed the procurement process of a mid-sized company manufacturing laser-cutting machines, focusing the analysis on three main European legal entities. With process mining we identified cyclic payment runs for invoices (every fourth working day). By overlaying the payment cycles with the payment terms associated with those invoices, we identified a negative offset. That is, discounts granted for paying an invoice within a specific period were not taken into account when prioritizing the payment runs. Over one year, and considering only the three entities in scope, this amounted to EUR 0.83 million in unrealized discounts.

Turning to production, a very popular analysis regards the interplay between the front office (in charge of taking leads and orders) and the production plants – in other words, the interplay between the sales and the production process. In this setting, we have used process mining to analyze the impact of late order change requests (coming from front-office executives) on four production plants of a global fragrance and flavor producer. Late requests led to changes in the production planning, requiring, depending on the situation, a rescheduling of production or stock transfers to keep production running. The former created idle production time worth 40.7 FTE per year. By preventing order changes in the so-called “frozen zone,” i.e. for orders already scheduled for production, the company was able to reduce the idle time by \(47\%\) and ensure a more reliable customer service.

2.2 Drivers for Process Mining Deployment

The adoption of process mining as a technique for process understanding, monitoring and improvement is fueled by some characteristics of the leading industry segments. In this section we revisit some of these drivers and how they contribute to process mining adoption.

System Homogeneity. Firms in the Industrial Products space are usually based on one core ERP system, most predominantly Dynamics, Oracle, Navision or SAP, covering the main processes, with satellite systems for specific tasks, e.g. invoice processing with Basware or customs processing with SAP GTS. Because the underlying tables, data structures and operations of “standard” ERP systems are well known to experts, the preparation of data for process mining becomes easier. Generally, the more homogeneous the system landscape, the easier it is to implement and use process mining, be it by collecting and transforming the data or by connecting directly to a process mining tool which performs the data transformation. The downside of system homogeneity is that, because of the systems’ maturity, one oftentimes finds fewer low-hanging fruits in terms of process improvements.

Transaction Volume. Some processes are executed once a month (e.g. the consolidation of financial statements in general accounting), others millions of times a day (e.g. cab hailing rides at Uber). Both kinds of processes can undoubtedly lead to enterprise performance improvements when analyzed with process mining. However, the higher the number of transactions one has at hand, the higher the (at least potential) impact that can be achieved, and consequently the higher the return on investment (ROI) for process mining and improvement exercises. Just imagine one can identify, on average, USD 0.50 in cost savings per claim with 15 million claims processed a year. In practice, this inevitably turns into a scoping question when analyzing processes: what is the “minimal” transaction volume to qualify for process mining? There is no magic formula for this, as processes are subject to different cycles and seasonality. So even the same process (e.g. procurement) in the same industry (e.g. manufacturing) might differ considerably from company to company depending on what is produced (e.g. power plants vs. chips). Our recommendation is to start with the end in mind and delineate the scope based on the business questions to be answered, the operational debts to be bridged and the process improvement ambition. See Sect. 4.1 for the different scoping elements.

Process Drivenness. Some industries – and more specifically, companies in those industries, or even functions in specific companies – exhibit a high maturity level in terms of “process-drivenness” and, correspondingly, digitalization of processes. That is, processes are captured in a structured manner (e.g. by means of BPMN) and the underlying system landscape and data models responsible for process execution are documented (e.g. ER diagrams capturing the relationships of entity sets stored in a database, or ADL specifications for architecture description). Other companies (or some of their functions), be it because of their business model or niche of operation, are less “process-driven.” For example, the Loan and Credit functions in banks are highly process-driven, while lead management in Asset and Wealth Management is less so. In fact, for the latter, technically speaking, each process execution is a legitimate variant. Clearly, the higher the process-drivenness and the volume of transactions, the better the chances of being able to run process mining. The downside is that, because plenty of thinking has already been spent on process design and implementation, the quick wins in terms of improvement potential may already have been harvested by previous initiatives, whether data-driven or not.

Existing Data Foundation. Irrespective of the aforementioned drivers, some companies have invested heavily in building a cross-functional data foundation as part of their data strategy [21], either in the sense of a data mart (department-wide, for the provision of some form of business intelligence) or a data warehouse/data lake (enterprise-wide, for large-scale data analytics), the latter being the focus of current projects tackling the transformation towards data-driven decision making. Process mining profits substantially from a data foundation that exists outside of (and combines the different) core systems. The reasons are threefold: first, it avoids dedicated bulk data extractions, which are usually time-consuming and require additional effort from IT or base teams; second, the platforms on which such foundations are deployed (e.g. Snowflake or Teradata) offer a transformation layer allowing (automatable, periodic) data transformation, thereby avoiding the setup of an additional transformation platform/layer; and third, they enforce some data homogeneity by standardizing data staging, for example, by making sure that timestamps are recorded to the precision of milliseconds.

Overall, these four key drivers bring together the factors favoring the use of process mining. Of course, transparency and analytics on their own do not lead to bottom-line savings or top-line growth. That is, process mining should be embedded in a broader context aiming at continuous improvement: identifying and eliminating operational debts, measuring the impact of changes and recalibrating performance goals according to a well-understood and well-established KPI framework [26]. The business process management lifecycle [11] provides a basis for data-driven process improvement based on process mining in particular, and process analytics in general [5].

3 Real-World Example: Order-to-Cash on SAP Systems

In order to make the approach and considerations presented hereafter tangible and easily related to actual use cases relevant for both industry and academia, we introduce an exemplary Order-to-Cash (O2C) process run on an SAP ERP system. O2C is not only prevalent across all three industry types as laid out in Sect. 2.1, but also very much relatable to anyone running a business or even just buying goods online. The twist is to simply look at this buying process through the vendor’s eyes, i.e. the firm selling for example electronics through a web shop. Irrespective of the firm’s business type, region or size, the main process steps of any O2C process are fairly similar. Hence, it makes a perfect running example to showcase event data preprocessing in a real-world scenario.

Many large organizations run their core business processes on ERP software solutions from Oracle or SAP, imposing a minimum level of standardization on process steps and their sequential flow. Since these systems are, however, designed to fit many different industries and business models, the predefined guardrails are not very strict, allowing for significant variation even in otherwise well-defined business processes like O2C. And while some companies even go to the length of modifying the underlying data structures in order to tailor the systems to their very needs, most modifications do not interfere with the core O2C process flow, but rather add complementary information. Paired with the fact that the adoption of SAP-based O2C process mining is far ahead of its Oracle-based counterpart, an SAP ERP has been chosen to exemplify event data preprocessing for O2C.

An end-to-end O2C process encompasses steps from the initial entry of a sales order and its items all the way to the actual receipt of payment, or another financial record clearing the open balance (e.g. a credit note). In practice we have encountered O2C process analyses with well over 100 different process steps. To reduce this complexity to a manageable but representative set of events, the process flow is represented here by nine individual steps (or events).

Fig. 1. SAP O2C process description across the different flow types

The events depicted in the first swim lane of Fig. 1 have been selected so as to (a) capture at least one instance of each event archetype, while (b) reducing the number of events substantially and (c) retaining the major milestones of a typical O2C process. We correlate the Business Flow, the underlying Document Flow and the corresponding Data Flow as follows.

Business Flow. The process starts with the creation of a sales order (SO) with at least one item (SO Item created), after which a confirmation can be sent to the customer (Order Conf. sent). As a next step the corresponding delivery document is created, including details for all items (Delivery Item created), after which the warehouse operations (Picking completed, then Packing completed) follow, illustrating the application of O2C for sales of physical goods from stock. The goods are eventually sent to the customer (Delivery Item dispatched) and a corresponding billing document including the respective items gets created (Billing Item created), which typically interfaces with the financial accounting part of the process. In favor of simplicity, this part is omitted (i.e. all financial postings, such as the settlement of billing documents).

In this example, we include changes to the quantity ordered (Quantity changed), which can be triggered at any stage before the creation of the delivery note. This change event can be seen as a template and hence applied to a variety of other change attributes (e.g. price or requested delivery date). After each change, the corresponding marker on the sales order gets updated as well (SO Item last changed).

Document Flow. The second perspective focuses on the business documents and their flow, as if actual paper documents would be processed. It starts with the sales order (SO and SO Item), after which the customer is sent a confirmation (Order Confirmation). A delivery document (DD and DD Item) is created and dispatched before the billing document (BD and BD Item) opens a balance for the respective customer.

Data Flow. Next, we focus on the main corresponding data structures holding information about the events and/or documents. For SAP-based O2C processes, sales orders and their items are stored in a pair of tables distinguishing sales order header information (table VBAK) from their item level information (VBAP). The data recorded in these tables include their creation date, as well as the date it was last changed. Order confirmations are persisted in a log table (NAST) comprising nearly all outgoing messages, while delivery document information, including creation and dispatch, can be found in another table pair (headers: LIKP; items: LIPS). Picking and packing is traced through changelogs (headers: CDHDR; items: CDPOS) on sales document status (headers: VBUK; items: VBUP) and billing documents are stored in a separate table pair (headers: VBRK; items: VBRP). Similar to picking and packing, all change events – including quantity changes – are tracked in a change audit log (headers: CDHDR; items: CDPOS).
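To make the mapping from tables to events concrete, the following sketch derives two of the event types from simplified table extracts: “SO Item created” from VBAP rows and “Quantity changed” from CDHDR/CDPOS change documents. The rows are illustrative stand-ins; field names such as ERDAT (creation date) and KWMENG (order quantity) follow the SAP standard, but the key handling (e.g. deriving the case identifier from TABKEY) is simplified and must be adapted to the concrete system.

```python
def events_from_vbap(vbap_rows):
    """One 'SO Item created' event per sales order item (table VBAP)."""
    return [
        {"case": f"{r['VBELN']}-{r['POSNR']}",
         "activity": "SO Item created",
         "timestamp": r["ERDAT"]}
        for r in vbap_rows
    ]

def events_from_change_log(cdhdr_rows, cdpos_rows):
    """One 'Quantity changed' event per CDPOS entry on the quantity field,
    timestamped via the matching CDHDR header."""
    headers = {h["CHANGENR"]: h for h in cdhdr_rows}
    events = []
    for p in cdpos_rows:
        if p["FNAME"] != "KWMENG":  # keep only changes to the order quantity
            continue
        h = headers[p["CHANGENR"]]
        events.append({
            "case": p["TABKEY"],    # simplified: order + item key
            "activity": "Quantity changed",
            "timestamp": h["UDATE"],
        })
    return events
```

Joining CDPOS to CDHDR via the change number (CHANGENR) is what attaches a timestamp to each change item; the same pattern applies to the picking and packing status changes.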

Limitations. Finally, it is important to point out that the presented O2C process constitutes a radical oversimplification. While the individual events are indeed representative, the process flow and set of events should be treated solely as an excerpt for demonstration purposes. Not only will real-life O2C processes be significantly more complex, but system customizations and other modifications to the SAP O2C standard configuration are also likely to require additional attention.

4 Event Log Engineering in Practice

Data preparation for process mining in the form of event log engineering encompasses three main steps, namely:

  1. Data selection and extraction

  2. Data transformation

  3. Data-model engineering and fine-tuning
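As a minimal, purely illustrative sketch, these three steps can be seen as a pipeline from raw table rows to a canonical event log; all function and field names here are assumptions for illustration, not part of any tool’s API.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    case_id: str       # e.g. sales order number + item number
    activity: str      # e.g. "SO Item created"
    timestamp: datetime

def select_and_extract(source_rows, start, end):
    """Step 1: keep only rows falling within the scoped time angle."""
    return [r for r in source_rows if start <= r["created_at"] <= end]

def transform(rows):
    """Step 2: map raw rows to the canonical (case, activity, time) form."""
    return [Event(r["order_item"], r["activity"], r["created_at"]) for r in rows]

def build_event_log(events):
    """Step 3: engineer the final model -- here, a log ordered per case and time."""
    return sorted(events, key=lambda e: (e.case_id, e.timestamp))
```

In practice each step fans out over many source tables and attributes, but the contract between the steps – raw rows in, an ordered event log out – stays the same.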

This section addresses these three steps from two perspectives: first from a broad perspective by touching upon key aspects to be considered; and second, in a zoom into the specific setting introduced in Sect. 3. The following is not meant to be a complete cookbook for process mining preprocessing. Instead, it focuses on the predominant, recurring aspects and challenges – some specific to process mining, some applicable to a wider spectrum of data analysis initiatives.

4.1 Data Selection and Extraction

From a general standpoint, this step focuses on answering the following questions:

  1. Which data is to be extracted and when?

  2. Which attributes are necessary for the analysis?

  3. Is the data sensitive?

  4. Is the data readily available or archived?

  5. Will the data size to be extracted be overwhelming?

The answers to these questions can be clustered under the labels “scoping” and “sourcing.” The scoping phase defines four angles: the processual angle (i.e. the subject of analysis), the regional angle (i.e. a specific country or set of legal entities), the time angle (i.e. the time span of transactions to analyze), and the analytical angle (i.e. the “why” behind the analysis).

Once the scope is set, the sourcing phase establishes a mapping between the process steps and their attributes in a transaction and the events in the source systems, tables and objects. The overarching goal is to identify where – if at all – the necessary events are digitally represented and which attributes are natively available. In some situations, both events and attributes need to be derived by combining different characteristics. For example, the definition of an “automated event” in an SAP system depends on various factors, including user type and reference transaction. Hence capturing and interpreting them correctly is essential for the credibility of process mining (see Sect. 4.2 for details).
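As an illustration of such a derived attribute, the sketch below classifies an event as automated based on the executing user’s type and, hypothetically, the transaction code. The user type codes (‘B’ for system users, ‘C’ for communication users) follow the SAP standard (field USTYP in table USR02); the transaction-based rule is a placeholder that would need to be calibrated per system.

```python
# SAP user types (USR02-USTYP): 'A' dialog, 'B' system, 'C' communication,
# 'S' service, 'L' reference. The rule below is an illustrative starting point.
AUTOMATED_USER_TYPES = {"B", "C"}

# Hypothetical: custom transaction codes known to indicate background execution.
BATCH_TCODES = {"Z_BATCH_RUN"}

def is_automated(user_type, tcode=""):
    """Derive the 'automated event' flag from user type and reference transaction."""
    if user_type in AUTOMATED_USER_TYPES:
        return True
    return tcode in BATCH_TCODES
```

Misclassifying these events (e.g. counting a batch job run under a dialog user as manual work) directly distorts automation-rate figures, which is why this derivation deserves explicit validation with system experts.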

The final step in the sourcing phase is the “physical” data extraction from the relevant system and corresponding data objects to a destination outside the system. Assuming that all the data resides in a single ERP landscape, this usually happens by querying the corresponding tables and applying selection criteria to filter out, e.g., transactions falling outside the chosen time and regional angles. This can be done either by means of an ETL tool connecting to the system, by creating a dedicated extraction script (e.g. specialized ABAP code for SAP, or DART, SAP’s embedded extraction tool), or by backing up the relevant tables and fields from the system (see Sect. 5 for best practices on data extraction). In companies with large data volumes, or for analyses considering a wide time angle (say, \(10+\) years), data extraction might need to consider so-called “archived transactional data”. While archived data can be seamlessly brought back to life, in practical settings not the entire transaction is archived but rather its main attributes, for storage capacity reasons. This might restrict the analytical angle for archived transactions.
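Conceptually, a dedicated extraction script boils down to one filtered query per table, with the selection criteria implementing the scoping angles. The sketch below uses an in-memory SQLite database as a stand-in for the source system; the table and field names (VBAK, VKORG for the sales organization, ERDAT for the creation date) mirror the SAP standard, while the concrete filter values are invented for illustration.

```python
import sqlite3

# Stand-in for the source system: a miniature VBAK (sales order header) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE VBAK (VBELN TEXT, VKORG TEXT, ERDAT TEXT)")
conn.executemany("INSERT INTO VBAK VALUES (?, ?, ?)", [
    ("4711", "1000", "2020-03-01"),
    ("4712", "2000", "2020-03-02"),  # other sales organization -> out of scope
    ("4713", "1000", "2019-12-31"),  # outside the time angle -> out of scope
])

# The WHERE clause implements the regional and time angles at extraction time.
rows = conn.execute(
    "SELECT VBELN, ERDAT FROM VBAK "
    "WHERE VKORG = ? AND ERDAT BETWEEN ? AND ?",
    ("1000", "2020-01-01", "2020-12-31"),
).fetchall()
```

Pushing the scope filter into the query, rather than extracting everything and filtering later, keeps extraction volumes and transfer times manageable.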

When extracting transactional and associated master and change data, two aspects are important: first, data size; and second, data protection. For the former, to estimate the final size of the extraction and simultaneously test the extraction method, one usually extracts, say, one month of data. By extrapolating this to the full time angle, one approximates the final number of cases and events to be dealt with, and consequently the size of the final extraction. For the latter, the advent of the General Data Protection Regulation (GDPR) specifically, and the increased awareness for data protection generally, put additional requirements on data extraction and processing. Here, two strategies come in handy. The first is data minimization, that is, extracting only the information strictly needed to cover the analytical angle. For example, if an analysis aims to measure the level of automation in a particular process, one can extract only the user type, not necessarily the user name or ID. The second, for necessary but sensitive fields, is data obfuscation: techniques which generate – during the extraction – an irreversible value for a particular field. In practice, the most common method is hashing the values of the sensitive fields. This is typically applied to personally identifiable information, such as user IDs and customer names. Security and privacy have long been important topics in business process management and process mining [1, 20]; the widespread adoption of process analytics and mining, paired with stricter legislation, has created a sense of urgency which is translating into cutting-edge, scalable data-protection approaches, such as [18, 22].
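A minimal sketch of hash-based obfuscation during extraction: the same input always yields the same token, so joins across tables remain possible, while the original value is not reconstructible from the extract. The salt handling shown here is an assumption; in practice the salt must be agreed upon and stored securely per project.

```python
import hashlib

# Assumption: a project-specific secret salt, kept outside the extracted data.
SALT = b"project-specific-secret"

def obfuscate(value):
    """Pseudonymize a sensitive field value, keeping it join-stable across tables."""
    digest = hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()
    return digest[:16]  # shortened token for readability

record = {"user_id": "JDOE", "user_type": "A"}    # illustrative extract row
record["user_id"] = obfuscate(record["user_id"])  # applied during extraction
```

Note that hashing low-entropy identifiers without a secret salt would be reversible by brute force, which is why the salt (and its protection) is part of the data-protection design, not an implementation detail.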

Data Selection and Extraction in the SAP O2C Scenario. In the following, we apply the general considerations discussed before to our SAP O2C running example. As a reminder, the scoping phase defines the rationale (“why”) and derives the object of study (“what”) using business terminology, while the sourcing phase translates this scope into technical delimitations and specifications, guiding the actual extraction of data. The exemplary scenario presented hereafter is fictitious, though resembling essential experiences and learnings from real process mining initiatives.

Scoping Phase. While the initial trigger for starting scoping discussions for process mining can originate from IT/analytics departments or solution vendors during pre-sales, we choose to exemplify an arguably more value-driven context. The Global Head of Order Management aims to optimize the firmwide order management process and has been introduced to the general concept of process mining, which seems to be a perfect fit. She initiates a pilot project to evaluate the suitability of the approach, drive process transparency and distill tangible process improvement levers. During scoping discussions with process mining experts, three hypotheses are agreed to become the predominant analysis directions for moving ahead in an orchestrated manner:

  1. Quantity changes. The number, magnitude and time-wise distribution of quantity changes in sales order items is, while being a driver for additional manual effort and downstream ripple effects, concentrated around recurring patterns (e.g. customers, regions, product groups).

  2. SLA adherence. Transparency on Service Level Agreement (SLA) performance with regards to sending order confirmations a minimum number of days before their dispatch can greatly improve both the adherence to and eventually the perceived value from such agreements (e.g. with key customers).

  3. Process conformance. Data-driven sensing of process flows violating the designed and desired process model informs process owners, operational staff running the process, as well as governance bodies (e.g. internal audit), about needs for additional training, additional guardrails or even process re-engineering.

While the first hypothesis looks at options to streamline the process, the second one bears potential to create additional value for customers. Lastly, the third hypothesis looks at more medium- to long-term objectives around process robustness and clarity of flow, which is often a precursor for automation.

After rallying around the rationale for employing process mining, the scope (i.e. the object of study) is defined. As this is a largely business-driven exercise, process mining experts typically need to act as a (technical) counterbalance, since the larger the business scope is set, the more complex all steps of the resulting process mining exercise will be. Hence, it must be the joint goal to aim at the smallest possible scope, while still retaining enough to be representative with regards to all shortlisted hypotheses.

The first delimitation is made with regards to the underlying business process. In the context of our SAP O2C example, all three hypotheses are related to the O2C process, more specifically to its non-financial part (sometimes referred to as order management). As the next level of detail, a minimum set of process steps or events is selected in line with the hypotheses (as described in the ‘Business Flow’ swim lane in Fig. 1).

The second scoping task identifies corresponding business objects to be traced; please refer to the ‘Document Flow’ swim lane in Fig. 1.

The third delimitation challenges whether all organizational units (e.g. legal entities, regions or segments) and transaction types (e.g. consignment vs. standard sales) need to be included to retain validity and significance of analytical results. Oftentimes, the project participants are highly acquainted with one specific part of the business, making it a natural choice to ensure the right expertise is available when validating results later. With regards to transaction types, high volume types are typically scoped in when looking at efficiency hypotheses. In our example we focus on ‘standard sales from stock’ only, while the fictitious firm operates as one legal entity with one sales organization and one warehouse.

The fourth delimitation looks at the timeframe to be analyzed. Depending on the underlying data volume, it might become necessary to further restrict the timeframe in scope later during the sourcing phase. In order to capture seasonality, it is generally recommendable to cover one full calendar or fiscal year. In our example we restrict the analysis to data from 2020.

The fifth and last business-driven scoping discussion typically presents the biggest challenge. Here, one aims to delimit the number of different data points associated with each business object. For example, each SO item has more than 400 individual attributes in any given SAP system. Some of them are collocated, others require multiple data linkages, but most importantly, many do not naturally indicate whether they might become useful context around process execution during the downstream analysis. While the default reaction of business favors retaining everything, the resulting spike in technical effort and complexity renders this extreme inadvisable, sometimes even infeasible. In our simplified example we assume the process mining experts are seasoned enough to guide the team toward a narrow selection with necessary attributes only. Such a selection typically does not exceed 40 attributes in the case of the SO item example.

Sourcing Phase. With the scope being clearly defined from a business perspective, the first step in the sourcing phase is to translate all delimitations into technical terms, i.e. a selection of source systems, data sources (e.g. tables or log files) within these systems, corresponding parameters to filter data records and last but not least the selection of required attributes within the data sources.

In our SAP O2C scenario, we focus on one source system only (exemplarily ‘P42’, an SAP R/3 ERP, even though the characteristics described largely apply to SAP S/4 instances as well). Since SAP ERP systems are capable of multi-tenancy, it is important to select the correct tenant in addition, which in case of the running example falls on the only active tenant configured in the productive ERP instance (i.e. ‘P42/010’).

Next, the process steps and traced documents (please refer to the ‘Business Flow’ and ‘Document Flow’ swim lanes in Fig. 1) are translated into their respective data sources. Often, this exercise with its required deep expertise is indicative of whether multiple data scope refinements and, hence, data extractions will become necessary, thereby prolonging the project timeline. These translations applied to the running example are shown in the tables in Fig. 2.

Fig. 2. Data sources for the SAP O2C process

In order to restrict the extraction data volume for each data source, the delimitations on organizational scope and transaction type, as well as the timeframe, are translated into row filter criteria. Figure 3 shows exemplary filtering criteria. While the tenant filter represents an example for restricting the organizational scope, a timeframe filter is also applied to each data source. As shown for the data source CDPOS, timeframe restrictions sometimes require linkage to another data source, in this case its header information in CDHDR. Setting fixed timeframe boundaries will, however, lead to cut-off artifacts in the resulting analysis. If a sales order is registered on December 31, 2020, the corresponding order confirmation will likely be created outside the selection window and thereby cut from the extraction. Preventing such artifacts would require substantial pre-extraction analysis and sophisticated extraction mechanisms catering to dependencies between data sources. This is typically deemed impractical, and analysts would rather deal with the resulting artifacts during analysis. Lastly, transaction type filters are exemplified through the document category for sales orders, the configured message type for order confirmations, which needs to be looked up in the system itself, and the list of tables affected by logged changes.
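The row filter logic just described can be sketched in a few lines. The following fragment is a minimal illustration, assuming the extracted tables are available as in-memory records; all sample values are hypothetical, and the tenant and timeframe filters mirror the criteria discussed above, including the indirect restriction of CDPOS via its header table CDHDR.

```python
from datetime import date

# Hypothetical in-memory rows standing in for extracted SAP tables.
vbap = [
    {"MANDT": "010", "VBELN": "0001", "POSNR": "000010", "ERDAT": date(2020, 3, 5)},
    {"MANDT": "010", "VBELN": "0002", "POSNR": "000010", "ERDAT": date(2019, 12, 30)},
    {"MANDT": "020", "VBELN": "0003", "POSNR": "000010", "ERDAT": date(2020, 6, 1)},
]
cdhdr = [
    {"MANDT": "010", "CHANGENR": "C1", "UDATE": date(2020, 7, 1)},
    {"MANDT": "010", "CHANGENR": "C2", "UDATE": date(2018, 1, 1)},
]
cdpos = [
    {"MANDT": "010", "CHANGENR": "C1", "TABNAME": "VBUP"},
    {"MANDT": "010", "CHANGENR": "C2", "TABNAME": "VBUP"},
]

TENANT, YEAR = "010", 2020

# Organizational scope (tenant) and timeframe filter applied directly.
vbap_scoped = [r for r in vbap if r["MANDT"] == TENANT and r["ERDAT"].year == YEAR]

# CDPOS carries no date itself: restrict it via its header table CDHDR.
hdr_in_scope = {(h["MANDT"], h["CHANGENR"]) for h in cdhdr
                if h["MANDT"] == TENANT and h["UDATE"].year == YEAR}
cdpos_scoped = [r for r in cdpos if (r["MANDT"], r["CHANGENR"]) in hdr_in_scope]
```

In a real extraction these filters would of course be pushed down to the source system rather than applied after the fact.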

Fig. 3. Filtering criteria when extracting data for SAP O2C process

Equipped with a clear selection of data sources and filter criteria (i.e. restricting data records), selection criteria (i.e. restricting attributes/columns) are next. While data sources like sales order item or delivery item tables have over 350 columns, only a small number of them are required to evaluate specific hypotheses. Typically, practitioners home in on (a) the identifying primary key, (b) temporal, quantity, price, cost, volume, and weight information, (c) markers indicating a state of the object, (d) links to other relevant objects, and (e) links to supplementary information. Figure 4 showcases the attribute selection for delivery items (system table LIPS). During this screening process it is natural to come across additional supplementary information not yet covered in the data source selection. In such cases, and if their usefulness is validated, they need to go through the same delimitation procedure as other data sources.
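The attribute screening heuristic (a) to (e) can be captured as a simple whitelist. The sketch below uses a handful of illustrative LIPS field names; the grouping by category, not the concrete field selection, is the point.

```python
# Hypothetical whitelist grouped by the categories (a)-(e); field names
# beyond the LIPS primary key are illustrative, not a complete real layout.
LIPS_SELECTION = {
    "key":     ["MANDT", "VBELN", "POSNR"],          # (a) primary key
    "figures": ["ERDAT", "ERZET", "LFIMG", "BRGEW"], # (b) time/quantity/weight
    "status":  ["PSTYV"],                            # (c) state markers
    "links":   ["VGBEL", "VGPOS"],                   # (d) preceding document
    "lookups": ["MATNR", "WERKS"],                   # (e) supplementary info
}

def project(record, selection):
    """Keep only whitelisted attributes of an extracted record."""
    keep = [f for group in selection.values() for f in group]
    return {f: record[f] for f in keep if f in record}
```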

Fig. 4. Attribute selection for delivery items in SAP

The final step of the technical translation is the screening of the complete, resulting attribute selection for sensitive data. Some data types must not be transferred across country borders (even if solely for analysis), others fall into categories requiring additional safeguarding, pseudonymization or even anonymization. While the act of obfuscation is part of the extraction itself, it is recommended to identify all attributes requiring special attention upfront. In the SAP O2C scenario, these could entail usernames and details from customer master data.
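Where usernames or customer details must be retained for analysis but not exposed, keyed hashing is one common pseudonymization technique. The sketch below is one possible approach under stated assumptions, not a prescribed mechanism; the secret key is hypothetical and would be managed outside the extract.

```python
import hashlib
import hmac

# Assumption: a project-specific secret held outside the extracted payload.
SECRET = b"rotate-me-per-project"

def pseudonymize(value: str) -> str:
    """Deterministic pseudonym: the same input always yields the same token,
    so joins across tables still work, yet the original value cannot be
    recovered without the key."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:12]
```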

Lastly, the actual extraction is configured and run accordingly. As all major considerations regarding the extraction of large volumes of data from ERP systems have already been presented in the general section of this chapter, our running example assumes a proven one-time extraction mechanism is used. Such setups have been utilized extensively by external auditors; however, they are usually limited to selective one-time extracts, storing the payload in individual files locally on the SAP application server. Since no sensitive data has been identified in the data scope of our SAP O2C example, no obfuscation mechanisms need to be configured.

4.2 Data Transformation

After the extraction, data is typically loaded onto a preprocessing platform to generate the target data model, i.e. the log file and ancillary tables (see Sect. 4.3 for details). This platform can be a database management system (such as a Microsoft SQL Database Server) or part of the ETL tool applied during the extraction. Depending on the process mining technology applied, transformation might also happen inside the tool, such as in UiPath Process Mining and Celonis.

Data transformation is the most important preprocessing step in the journey towards process analytics. This is in particular relevant because an error in reconstructing the end-to-end process may cascade to a completely flawed process mining exercise, delivering misleading results, creating negative experience and, in the worst case, discrediting the whole approach. Consequently, a lot of attention needs to be put into mapping the correct process.

Specifically, we want to call out three key aspects:

  1. Case Identifier consistency,

  2. Timestamp quality, and

  3. Amount handling.

Serving as unique transaction identifier, the Case Identifier (CaseID for short) is a primary point of concern when transforming data in order to achieve an end-to-end representation of the process. When considering one single system, such as SAP, the transaction identifier is typically given by the key document number being tracked (e.g. the sales order number), to which other related documents refer. When the transaction spans different systems, the CaseID may or may not be consistent across them.

Assuming that CaseIDs are not consistent, two situations might occur: (1) there exists a mapping between the systems, that is, one can precisely link the transactions, even though the CaseIDs used in the systems for the same transaction differ; or (2) there is no link between the systems, or this link is not persistent (e.g. being deleted after 24 h). In (2), transactions can only be approximated by relating timestamps and transactional attributes on both systems, the so-called linkage criteria. That is, assume that transactions are passed on from an Application A to an Application B. The linkage criteria will initially define a time range (e.g. from 1 to 10 s) within which a transaction ending on Application A is connected with the transaction that commences on Application B. Ideally, the matching of timestamps on both ends will create a one-to-one linking of the transactions. The resultant CaseID could be the concatenation of the CaseIDs on Systems A and B, e.g. CaseID\(_A\)–CaseID\(_B\). However, in practice linkage criteria based solely on time ranges can lead to one-to-many relationships between the transactions on both systems, for instance when the cadence of transactions is high. By refining the linkage criteria with non-temporal matching attributes (e.g. the vendor and/or material), one gains precision and reduces uncertainty. Still, in some settings it is impossible to achieve a perfect mapping across systems. In these situations we recommend adding a case attribute that flags which transactions match perfectly and which do not.
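A linkage criterion of this kind can be sketched as follows, assuming transactions are plain records carrying an id, a timestamp and a material attribute (all names hypothetical). Unambiguous matches receive the concatenated CaseID together with a flag marking the match as perfect.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=10)  # time range of the linkage criteria

def link_transactions(ends_a, starts_b):
    """Match each transaction ending on system A to candidates starting on
    system B within the time window, refined by the material attribute.
    Returns (combined_case_id, perfect_match_flag) pairs."""
    linked = []
    for a in ends_a:
        cands = [b for b in starts_b
                 if timedelta(0) <= b["ts"] - a["ts"] <= WINDOW
                 and b["material"] == a["material"]]
        if len(cands) == 1:
            linked.append((f'{a["id"]}-{cands[0]["id"]}', True))
        else:
            # ambiguous or unmatched: keep the A-side id, flag imperfect
            linked.append((a["id"], False))
    return linked
```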

Timestamps are essential in process mining, as they characterize the partial ordering \(\prec \) in which the events are sequenced by the underlying process mining algorithms. A typical problem, occurring in particular when analyzing automated process steps in sequence, but also in other contexts, is the precision of timestamps. Automated process steps happen in the range of milliseconds, and this is the precision with which timestamps need to be captured; otherwise, consecutive process steps will carry the same timestamp. In practical settings, this leads to an extremely high number of process variants, as the process mining engines will pick events with the same timestamp in a random order and artificially create variants. When precise timestamps are not available, hardwiring the process ordering by means of a dedicated field in the final log might come in handy. Tools will use this field to enforce the ordering, avoiding unnecessary variants. Another solution is to subsume all the sequenced steps into one, assuming that their ordering is not relevant for the analytical angle.
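Hardwiring the ordering via a dedicated field can be as simple as a secondary sort key. In the sketch below, the step ranks are hypothetical and would be derived from the designed process model.

```python
# Hypothetical ranks hardwiring the designed order of automated steps.
STEP_ORDER = {"Received": 0, "Validated": 1, "Posted": 2}

def sort_events(events):
    """Order events by timestamp first, then by the dedicated order field,
    so equal timestamps no longer yield random artificial variants."""
    return sorted(events, key=lambda e: (e["ts"], STEP_ORDER[e["activity"]]))
```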

Another aspect commonly overlooked in analysis is the fact that timestamps might be captured in different time zones, especially when the regional angle spans various continents. (Summer and winter time shifts shall not be forgotten either.) As an example, suppose one is looking for outlier transactions in which invoices were settled outside of the standard European working hours for a particular company, e.g. between 7PM and 6AM. Completely legitimate settlements happening in the US might then be considered illegitimate if one does not normalize the time zones. This can be done either by adding a supplementary timestamp field to the log (denoting the time according to the reference point, e.g. CET, defined in the analytical angle), or by adding an attribute field for the timezone offset according to the reference point (e.g. \(+2\)).
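A minimal normalization sketch, using fixed UTC offsets for brevity; a production setup would rather rely on a proper time zone database to handle summer and winter time shifts.

```python
from datetime import datetime, timedelta, timezone

# Reference point of the analytical angle: CET as a fixed +1 offset here.
CET = timezone(timedelta(hours=1))

def to_reference(ts_local: datetime, offset_hours: int) -> datetime:
    """Normalize a timestamp captured at a given UTC offset to the CET
    reference, yielding the supplementary timestamp field for the log."""
    src = timezone(timedelta(hours=offset_hours))
    return ts_local.replace(tzinfo=src).astimezone(CET)
```

A settlement at 2PM US Eastern winter time (UTC-5), for example, normalizes to 8PM CET and would correctly fall outside the 7PM to 6AM window check.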

A final consideration regarding timestamps is the fact that typical ERP systems, as well as most legacy systems, do not capture the begin and end timestamps of events. As events are therefore “atomic”, it is impossible to measure the actual duration of a process step and, correspondingly, to quantify the working time per step. In fact, the lead times between events in a discovered process map encompass both processing and waiting times. To address this, approaches for so-called “effort mining” are being developed and tested in practical settings [28]. They use statistical methods to estimate the duration of tasks, thereby allowing the quantification of working time and productivity, as well as benchmarking.

Considering amounts, two frequent issues are: (1) amount duplication, and (2) unharmonized currencies. The duplication happens when loops exist in the process. For example, suppose the event “Issue Invoice” happens, with an associated event attribute “Invoice Amount.” Furthermore, suppose that, because of a loop, this event happens twice in some cases. Naively adding up the amounts associated with the event “Issue Invoice” will include duplications because of the multiple occurrences of the event within a case. Similarly, amount corrections happening in a case must be taken into account when assessing the final amount related to a case. Ideally, to avoid duplications and other errors associated with amounts, one should parse the execution traces and create an ancillary case attribute table recording the amounts per case (see Sect. 4.3 for details on the data model), thereby avoiding calculations in the specific process mining tool.
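The deduplication logic can be sketched by keying amounts on the document identity rather than on the event occurrence. All field names below are hypothetical.

```python
def case_amounts(event_log):
    """Sum 'Issue Invoice' amounts per case, counting each invoice document
    once even when a loop makes the event appear repeatedly in a case."""
    seen, totals = set(), {}
    for e in event_log:
        if e["activity"] != "Issue Invoice":
            continue
        key = (e["case_id"], e["invoice_no"])  # document identity, not event
        if key in seen:
            continue
        seen.add(key)
        totals[e["case_id"]] = totals.get(e["case_id"], 0.0) + e["amount"]
    return totals
```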

Unharmonized currencies typically happen when the analytical angle spans different countries, e.g. Denmark and Brazil. As above, naively adding up amounts without taking into account the different local currencies will lead to a wrong financial assessment of the process. Therefore, as a preprocessing step, currencies shall be harmonized to a reference currency set during the analytical angle, such as USD or EUR. This will be the reporting currency. The basis for such a harmonization might be system tables capturing the history of currency conversion rates (e.g. TCURR on SAP), or dedicated APIs from which historical foreign exchange rates can be retrieved (e.g. Fixer.io). For flexibility, the resulting data model stores the amounts in both local and reporting currency, optionally also the conversion rate.
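A conversion helper might look as follows; the rates table is a hypothetical stand-in for a TCURR extract or an FX API lookup, and the monthly period granularity is an assumption.

```python
# Hypothetical monthly conversion rates to the EUR reporting currency,
# keyed like a simplified TCURR lookup: (from_currency, period).
RATES_TO_EUR = {("USD", "2020-03"): 0.90, ("DKK", "2020-03"): 0.134}

def harmonize(amount, currency, period):
    """Return (local amount, reporting amount, rate), i.e. all three
    fields the flexible data model is meant to store."""
    rate = 1.0 if currency == "EUR" else RATES_TO_EUR[(currency, period)]
    return amount, round(amount * rate, 2), rate
```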

Data Transformation in the SAP O2C Scenario. In the following, we apply the general considerations discussed before to our SAP O2C running example. While first focusing on unit harmonization (i.e. timestamps and prices), the second part will describe different archetypes of events and exemplarily discuss data transformation steps to generate event log entries.

Unit Harmonization. As characterized above, there are several types of attributes that can occur in different base units across the data sources, sometimes even within one source system. Starting with the timestamps, SAP typically stores date (DATS) and time (TIMS) data in the configured time zone in the SAP installation. Some timestamps are, however, persisted in the time zone of the individual user interacting with the system (e.g. in SAP Warehouse Management, short WM). Luckily, all attributes relevant to our example are based on the same time zone and therefore, no adjustment for different time zones needs to be made. Since we analyze a full year of data, the switch between summer and winter time can – depending on the SAP system configuration – still require adjustments and a decision to treat one of them as dominant.

Before applying adjustments, we prepare our data sources by combining separated date and time information into timestamps (e.g. in VBAP: ERDAT & ERZET > tsCreation). If any data source contains multiple separated timestamps, each pair will result in an additional attribute. Moving to the actual adjustment and taking CET as our dominant base time zone, we adjust all timestamps in summertime by subtracting one hour. For traceability and testing purposes the adjusted timestamp shall be added as an additional column (tsCreationCET). Once a project matures into an operational monitoring solution, such steps are typically collapsed to reduce overall data volume.
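The timestamp preparation described above can be sketched as follows; the month-based summer-time rule is a deliberate simplification of the actual CET/CEST calendar and serves illustration only.

```python
from datetime import datetime, timedelta

def combine(dats: str, tims: str) -> datetime:
    """Combine SAP DATS (YYYYMMDD) and TIMS (HHMMSS) into one timestamp,
    e.g. VBAP.ERDAT & VBAP.ERZET -> tsCreation."""
    return datetime.strptime(dats + tims, "%Y%m%d%H%M%S")

def to_cet(ts: datetime) -> datetime:
    """Simplified rule: treat the months April to October as summer time
    and subtract one hour to normalize onto the dominant CET base."""
    return ts - timedelta(hours=1) if 4 <= ts.month <= 10 else ts
```

Keeping the adjusted value in a separate column (tsCreationCET) preserves traceability, as noted above.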

Another major category for unit harmonization is currency denominated attributes. Within SAP, some data sources provide figures in multiple currencies (often document, local and reporting currency), others hold the transaction or document currency only. In these cases, and especially when firms engage in international business relations, the respective metrics need to be harmonized before being compared or aggregated.

There is a substantial level of semantics captured in the way SAP ERP systems convert currencies. However, in the context of our SAP O2C process at hand, we assume a currency conversion mechanism is available in the data transformation environment. Some of the currency attributes which need to be harmonized are static in terms of source and target currency (e.g. all records converted from USD to EUR), others need to dynamically capture the source currency per each individual record (e.g. sales order item price from document currency to EUR). Exemplarily, Fig. 5 shows the input to such a dynamic currency conversion function, whose output is then stored in an additional attribute.

Fig. 5. Exemplary currency conversion.

Event Data Transformation. When transforming data into an event log capturing all relevant events as defined in Sect. 4.1, different event data archetypes should be distinguished. These types inform corresponding transformation recipes and while they need to be tailored to individual events, their core structure remains largely intact. Figure 6 delimits these three archetypes and Fig. 7 maps them to the events which are part of our SAP O2C running example.

Fig. 6. Types of data archetypes.

Below, we detail the event Sales Order Item created to exemplify the immutable timestamp transformation archetype, the event Sales Order Item last changed to exemplify the mutable timestamp transformation archetype, and both events Order Confirmation sent as well as Picking completed to exemplify the log entry transformation archetype. For simplicity reasons we limit the transformations to the three basic elements for process mining: (a) the object ID/case ID candidate, (b) the event name, and (c) the timestamp.

Fig. 7. Mapping archetypes to the events.

Timestamp – immutable. In order to extract event records for the event type Sales Order Item created an object ID (caseID candidate) is crafted by concatenating VBAP.MANDT, VBAP.VBELN and VBAP.POSNR, the primary key of the respective data source table. In a preparation step we have already generated the corresponding timestamp VBAP.tsCreated from VBAP.ERDAT and VBAP.ERZET. Many event types can be extracted in this manner.
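A sketch of this immutable-timestamp recipe, assuming the prepared VBAP records are available as plain dictionaries with the timestamp already combined.

```python
def so_item_created_events(vbap_rows):
    """Immutable-timestamp recipe: one event per sales order item,
    object ID = MANDT|VBELN|POSNR (the primary key of VBAP),
    timestamp = the prepared tsCreated."""
    return [{
        "object_id": f'{r["MANDT"]}|{r["VBELN"]}|{r["POSNR"]}',
        "event": "Sales Order Item created",
        "ts": r["tsCreated"],
    } for r in vbap_rows]
```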

Timestamp – mutable. To extract event records for the event type Sales Order Item last changed we use the same object ID as for the immutable event. In a preparation step, we have also generated a corresponding timestamp VBAP.tsLastChanged from VBAP.AEDAT and 23:59:59, a dummy time to fill the missing precision in this timestamp. It is very important to clearly document usage of such dummy times, since they can lead to undesired analysis results due to misinterpretation of the event sequence. In general, such mutable event types are more valuable for operational process mining analyses, with shortened refresh cycles, and thus a greater chance of the data still being current at the time of analysis. We included it in the SAP O2C running example for completeness only.

Log entry. As the first example of the log entry archetype, the event records for Order Confirmation sent are retrieved. Assuming the data source NAST has already been filtered to solely include order confirmation message types, it is linked to VBAP based on the client (MANDT) and its object key (OBJKY) referencing the header primary key of VBAP (MANDT, VBELN). The same concatenated object ID is used as for the immutable event. Since the two sources are already linked, tsProcessed, as derived from NAST.DATVR and NAST.UHRVR, is used as the event timestamp.

The second example derives the event Picking completed from SAP’s change documentation. Assuming the data source VBUP has already been filtered to solely include the item status information of standard sales order items, we also restrict the change logs based on the affected table (CDPOS.TABNAME = VBUP), on the affected field (CDPOS.FNAME = PKSTA), and on the change type (CDPOS.CHNGIND = U) to retain value updates only. Thereafter, VBUP is linked to CDPOS based on MANDT and CDPOS.TABKEY referencing the primary key of VBUP. Next, change log header information (CDHDR) is linked based on MANDT and CHANGENR. Lastly, we can extract the object ID from VBUP analogously to the immutable timestamp example, and the prepared timestamp tsUpdated, derived from CDHDR.UDATE and CDHDR.UTIME. Many changelog structures for other event types work similarly, even outside the SAP ecosystem.
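The log entry recipe can be sketched analogously. For brevity, the fragment below joins the filtered change items directly to their headers and uses CDPOS.TABKEY as a stand-in for the resolved VBUP object ID.

```python
def picking_completed_events(cdpos_rows, cdhdr_rows):
    """Log-entry recipe: keep only value updates of VBUP.PKSTA and attach
    the change timestamp from the header (CDHDR) via MANDT + CHANGENR."""
    hdr = {(h["MANDT"], h["CHANGENR"]): h["tsUpdated"] for h in cdhdr_rows}
    return [{
        "object_id": p["TABKEY"],  # references VBUP's primary key
        "event": "Picking completed",
        "ts": hdr[(p["MANDT"], p["CHANGENR"])],
    } for p in cdpos_rows
        if p["TABNAME"] == "VBUP" and p["FNAME"] == "PKSTA"
        and p["CHNGIND"] == "U"]
```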

The exemplarily described recipes can be applied beyond the events listed as part of our simplified SAP O2C process analysis. This is rarely a blind application, however, but rather a tailoring exercise. Sometimes the name of the resulting events – mostly in the log entry archetype – is even meant to be dynamically derived from attributes on a record-by-record basis. This becomes particularly useful when analyzing workflow systems with potentially hundreds of different events, since all of them can be extracted with one transformation recipe.

4.3 Data Model Engineering

Generally speaking, the transformation creates an event log for the process in scope, as defined in the processual and regional angles. It further contains the necessary events and attributes needed to respond to the analytical angle, within the time span prescribed by the time angle. This section focuses on considerations when building a data model fit for scalable process mining analytics.

The simplest target data model for an event log file is a table in which the columns capture the attributes and the rows capture the events. Although some tools still build upon a single event log table as their input format and although this format might be handy for small exercises, producing a single event log has several adverse practical implications, namely:

  1. Log file generation. Even in narrow-scoped analysis, an event log may quickly have 0.5 million transactions (cases) with around 100 associated attributes. Changes in the transformation logic – a frequent step to appropriately capture the business logic or fine-tune for system customizations – lead to a full reprocessing of the whole event log. This procedure can, depending on the transformation platform, take hours (or even days) to complete.

  2. Lower analytical performance. When analyzing a process, e.g. by applying a filter for a drill-down or computing a KPI, a large event log packed into a single table might have an adverse impact on the user experience in the tool. Specifically, the response time can be very high, making it hard to interactively produce insights.

  3. Unnecessary redundancy. A single event log does not distinguish between case (or transaction) attributes, such as legal entity or material, and event attributes, such as user name. Therefore, depending on how the transformation handles case attributes, they might be replicated throughout all the events within a case, or have empty values, which is a suboptimal use of the data model.

  4. Scalability and size. While an event log might have a stable size during a proof-of-value, in process mining implementations tackling continuous process monitoring and improvement, the event log steadily grows as new cases are appended to the existing model. Depending on the cadence of transactions (and the number of events in those transactions), this can easily result in an average monthly increase well over 100 million lines. This has implications on the disk space necessary for storage, as well as on how the tools might be able to load and process this log.

Practical process mining thus calls out the need for more efficient and scalable logs. There are two complementary strategies for engineering event logs. The more general strategy focuses on splitting the log into at least two tables: the so-called event table containing the events and their attributes, and the transaction table containing the case attributes. The key linking these two tables is the CaseID. This strategy can be further refined, depending on the needs of the analysis. For example, other usual structures seen in practice are the change table capturing updates in the main documents (e.g. Quantity changed in Sect. 3) and the property table capturing derived, precomputed transaction attributes easing the analysis (e.g. the number of events in a particular transaction or the precision of the linkage criteria, as of Sect. 4.2).
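Splitting a flat log into an event table and a transaction (case) table can be sketched as follows; which attributes count as case attributes is an assumption to be settled per data model.

```python
# Assumption: these attributes are constant per case, not per event.
CASE_ATTRS = {"legal_entity", "material"}

def split_log(event_log):
    """Split a single flat event log into an event table and a case table
    keyed by CaseID, removing per-event replication of case attributes."""
    cases, events = {}, []
    for e in event_log:
        cid = e["case_id"]
        cases.setdefault(cid, {k: e[k] for k in CASE_ATTRS} | {"case_id": cid})
        events.append({k: v for k, v in e.items() if k not in CASE_ATTRS})
    return events, list(cases.values())
```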

The more specific strategy takes the scope and its different angles into account, as well as who is eventually consuming the analysis. Specifically, when the regional angle comprises multiple geographies (e.g. five hubs of a Global Business Services (GBS) topology), it is wise to create one data model – irrespective of its layout – for each hub. While this does not prevent having a global analysis, benchmarks and knowledge transfer of best-practices, it by default ensures controllability and need-to-know policies, i.e. that hubs focus on their area of concern. The analytical angle is also a strong driver for event log engineering. For example, an SAP O2C analysis might focus on improving client servicing and lead management. In this case, the focus is on transactions against external customers, and not on intercompany or intracompany transactions. Therefore, the data model for this analysis can be built to comprise only the relevant transactions.

Generally, narrowing the event log according to the scope reduces the risk of adding noise to the analysis, and the risk of misinterpretation. This is because it requires the clear-cut specification (and transparent communication) of the filtering criteria used during log engineering and data transformation. It also reinforces that there is no “one size fits all”, standard target log file and set of events and attributes to be reconstructed.

Data Model Engineering in the SAP O2C Scenario. In the following, we apply the general considerations discussed before to our SAP O2C running example. Starting with the selection of a common process instance identifier or case ID suitable for the analytical angle at hand, we define a dedicated data table for information on each process instance (i.e. the case table). Lastly, contextual data is added in a scalable way and linked to the core data model.

Case Identifier. When transforming source data into event records as described in Sect. 4.2, the resulting object identifier (object ID) typically references the underlying business object or document. Exemplarily, the events derived from sales order items (VBAP), e.g. Sales Order Item created, will have a concatenation of the table’s primary key fields as their unique object ID reference. However, events derived from other data sources, like Delivery Item created, will correspondingly have an object ID composed of the primary key fields of LIPS assigned. This results in a need to relate the objects and documents involved in our O2C process, in order to retrieve the original process flow end-to-end.

We look at the document flow in Fig. 1 and use the link attributes we preserved in Sect. 4.1 to derive an object graph in accordance with the relationships between the corresponding data sources. The only exception is the data source NAST, whose corresponding events have already been linked to the respective sales order item object ID during event data transformation as described in Sect. 4.2. Such direct links are typically used when the business object or document has very few additional attributes of relevance, like – in the exemplary case – the order confirmation. Please refer to Fig. 8 for the resulting relationship model and exemplary graph.

Fig. 8. Relationship model relating to the case identifier.

As some of the relationships between the business object data sources are one-to-many (in some scenarios even many-to-many), the resulting graph/forest can become quite complex. Consider the example in Fig. 8, specifically the part of the graph in which two sales order items exist (X and Y), both belonging to the same sales order (R). While X has no link to any delivery item yet, Y references two distinct delivery items (U and W) with two different headers (T and Z). This means the sales order item was likely split into two deliveries. Furthermore, one of the deliveries (W) is already billed with a billing document item (A) and header (B).

For most process analyses it is advisable to define one of the object ID types as the case identifier (caseID). Based on the underlying analytical angle and hypotheses, we select the sales order item as the identifying document type and create a mapping table, which lists all reachable objects within the forest as a function of the caseID (see Fig. 9). As a rule of thumb, when traversing the forest, the very same relationship shall not be traversed in both ways, i.e. after connecting X to R, we do not proceed to connect Y to the same set of reachable objects since it would take the same relation (VBAP \(\rightleftharpoons \) VBAK) that connects X to R in a backward direction. This approach prevents linking objects and thereby their associated events erroneously. The combination of our mapping table with the event record table from Sect. 4.2 results in a final event log table, which can already be used for process mining.
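The traversal rule of thumb can be sketched as a search over typed relations that never takes a relation type back in the opposite direction once it has been traversed. The edge list below encodes the exemplary graph from Fig. 8; relation type names are illustrative.

```python
def reachable(start, edges):
    """Objects reachable from a caseID candidate. Each edge is
    (child, parent, relation_type); once a relation type has been
    traversed in one direction, it is not traversed in the opposite
    direction, mirroring the rule of thumb for the mapping table."""
    seen, frontier, used_dirs = {start}, [start], set()
    while frontier:
        node = frontier.pop()
        for child, parent, rel in edges:
            if node == child and (rel, "down") not in used_dirs:
                nxt, step = parent, (rel, "up")
            elif node == parent and (rel, "up") not in used_dirs:
                nxt, step = child, (rel, "down")
            else:
                continue
            used_dirs.add(step)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen - {start}

# The exemplary forest of Fig. 8 (child, parent, relation type).
edges = [
    ("X", "R", "VBAP-VBAK"), ("Y", "R", "VBAP-VBAK"),
    ("U", "Y", "LIPS-VBAP"), ("W", "Y", "LIPS-VBAP"),
    ("U", "T", "LIPS-LIKP"), ("W", "Z", "LIPS-LIKP"),
    ("A", "W", "VBRP-LIPS"), ("A", "B", "VBRP-VBRK"),
]
```

With these edges, X reaches only its sales order R, while Y additionally reaches U, W, T, Z, A and B, as intended by the rule.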

Fig. 9. Mapping table listing reachable objects.

Case Table. Most process mining analyses move beyond the pure event traces quite quickly, resulting in the need to add contextual information. The most straightforward option is to create an additional table with exactly one record per process instance (i.e. per active caseID) and add so-called case-level information to it. Since we have defined the sales order item as our case ID type, we can simply add the attributes preserved in Sect. 4.1 as case attributes (e.g. net price, material number).

Contextual Data. When using process mining in real business scenarios, the thirst for contextual information does not stop at the case table. Applied to our running example, the fact that we selected the sales order item as the caseID type does not mean that additional information on the delivery document items and billing items is irrelevant. One approach is to add such information to the event log, with the tradeoff being an extremely detrimental impact on data volume. In practice, we rather opt to introduce additional tables, often one per objectID type (except the one selected as caseID type). As illustrated in Fig. 10, we connect two additional contextual data sources to the case table. To establish the link from these object tables to the case table, the previously generated mapping table (caseID \(\rightleftharpoons \) objectID) can be re-used.

Fig. 10.

Connecting additional contextual data sources to the case table.
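Re-using the mapping table to pull object-level context into a case-centric view can be sketched as follows (all table contents are illustrative):

```python
# Mapping table (caseID <-> objectID) as generated in Sect. 4.3 and a
# contextual table for one objectID type; all values are illustrative.
mapping = [
    {"case_id": "Y", "object_id": "U"},
    {"case_id": "Y", "object_id": "W"},
]
delivery_items = {
    "U": {"delivery_qty": 5},
    "W": {"delivery_qty": 3},
}

def context_for_case(case_id, mapping, object_table):
    """Join object-level context to a case via the mapping table, keeping
    the event log itself free of this additional data volume."""
    return [
        {"case_id": entry["case_id"], "object_id": entry["object_id"],
         **object_table[entry["object_id"]]}
        for entry in mapping
        if entry["case_id"] == case_id and entry["object_id"] in object_table
    ]
```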

In practice, even more advanced data models, such as the one described above, do not support the testing of every hypothesis project stakeholders come up with. Sometimes, hypothesis-specific “helper” tables are created and linked into the process mining data model. In such cases, it is advisable to challenge the business value of such additions before triggering substantial data model modifications.

5 Best Practices

In this section we take stock of the above sections and distill best practices from our experience of rolling out process mining “in the large”. Clearly, this is a non-exhaustive list; its intent is to elaborate on the most relevant and recurring topics.

Data Selection and Extraction. This is the basis for process mining, and if not structured well, hiccups here can undermine the entire analytical effort. Four best-practices in this area:

  • BP1 Explicitly formulate the four analytical angles and confirm them with all stakeholders.

  • BP2 Find a sweet-spot between data minimization and extraction efficiency.

  • BP3 Estimate the final size (and time) of extraction.

  • BP4 Extract data from a QA environment or existing staging platform.

By following (BP1), one ensures common knowledge as to the analytical objectives and avoids getting lost in details. Turning to (BP2), as mentioned in Sect. 4.1, data extraction technically boils down to some form of select-statement on terabyte-sized tables. Data minimization criteria (formulated as where-constraints) add constraints to such a statement, slowing down the extraction. Therefore, it is important to find the sweet spot between minimization and efficiency. One way to do so is to follow (BP3) and carry out a probe extraction with a drastically reduced scope, then extrapolate the values to the full scope. Finally, because a data extraction might have an impact on the performance of the system, (BP4) recommends extracting data from a QA (test) environment or staging platform, as opposed to a productive environment. Of course, for this, the extraction environment must fully cover the scope.
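In its simplest form, (BP3) is a linear extrapolation from the probe; the figures below are invented for illustration:

```python
def estimate_full_extraction(probe_rows, probe_seconds, scale):
    """Linearly extrapolate a probe extraction to the full scope.
    scale = full scope / probe scope, e.g. 52 when the probe covered
    one week of a one-year extraction window."""
    return {"rows": probe_rows * scale, "seconds": probe_seconds * scale}

# Invented probe result: one week yielded 250,000 rows in 90 seconds.
estimate = estimate_full_extraction(probe_rows=250_000, probe_seconds=90, scale=52)
# -> 13 million rows and 4680 seconds (78 minutes) for the full year
```

Such an estimate is of course only a first-order approximation; real extractions rarely scale perfectly linearly, which is exactly why the probe should precede the full run.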

Data Transformation. When transforming data towards an event log, events are discovered according to the business logic and system-specific configurations. In doing so, precision is essential for analytical correctness. Five best-practices to emphasize in this area:

  • BP5 Modularize event discovery.

  • BP6 Harmonize timestamp format, currencies and other units.

  • BP7 Take system customizations into account.

  • BP8 Do not ignore the business logic and context.

  • BP9 Use a meaningful event naming convention.

With (BP5) one creates modules to discover the different types of events (e.g. SO Item created). In doing so, adjustments to those events (e.g. naming convention or discovery logic) can be made locally without requiring the regeneration of the whole event log. With (BP6) one avoids misinterpretation of results and establishes a sound basis for analysis. Beyond those harmonization efforts, ERP systems are highly customized to a particular business and operational mode. This can be at the level of fields in a table (e.g. a field capturing a company-specific flag for a completed delivery) or the way values are derived for an attribute (e.g. what counts as an automated vs. a manual step). Therefore, (BP7) recommends taking those customizations into account when transforming data. One approach is to carefully reuse and validate existing transformation scripts. Building on that, different attributes carry aspects of the business logic, e.g. document types associated with sales orders indicating external or internal sales. In (BP8) we recommend taking this into account when generating the event log by creating different events or transaction identifiers. Finally, by (BP9) one facilitates the understanding of process maps. For example, instead of naming an event SO Item qty chg., use SO Item qty incr., already indicating how the change impacted the sales order quantity field. This creates more meaningful logs and a more effective basis for analysis. However, if overused, this leads to an inflation of distinct events, making any analysis a complex undertaking.
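A hypothetical sketch of (BP5) and (BP9) combined: one discovery module per event type, with the direction of a quantity change encoded in the event name. Field names are invented, not actual SAP columns:

```python
def discover_so_item_created(order_items):
    """Module for one event type; its naming or logic can change locally
    without touching the other modules (BP5)."""
    return [{"case_id": row["so_item"], "event": "SO Item created",
             "timestamp": row["created_at"]} for row in order_items]

def discover_so_item_qty_changed(change_records):
    """BP9: encode the direction of the quantity change in the event name."""
    return [{"case_id": rec["so_item"],
             "event": ("SO Item qty incr." if rec["new_qty"] > rec["old_qty"]
                       else "SO Item qty decr."),
             "timestamp": rec["changed_at"]} for rec in change_records]

def build_event_log(*module_outputs):
    """Concatenate the module outputs and order events per case over time."""
    events = [event for output in module_outputs for event in output]
    return sorted(events, key=lambda e: (e["case_id"], e["timestamp"]))
```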

Data Model Engineering. The best-practices regarding the target data model have an impact on the scalability and ease of analysis. We emphasize the following:

  • BP10 Add sanity checks.

  • BP11 Modularize logs according to the analytical scope.

  • BP12 Split the attributes according to attribute types.

  • BP13 Consider ancillary analytics, e.g. machine learning, prediction and simulation.

In (BP10) we recommend the use of sanity check tests indicating, for example, the overall number of cases reconstructed or a summary of fields including NULL or empty values. This helps in the quality assurance phase, e.g. by matching the number of expected transactions with the number of cases or by detecting and tracing transformation bugs. By (BP11), one ensures that event logs fit their analytical angle while, at the same time, separating business concerns. Regardless of modularization, by (BP12) one separates the characteristics of events from those of transactions. This eliminates redundant data and is less error-prone during analysis. Finally, process data is increasingly used as a subject of more advanced analytics. The data model required for such analytics differs substantially from a plain event log. In (BP13) we recommend taking this into account when deciding on the necessary attributes and their aggregation level. In some situations, it is worth creating a separate table allowing, e.g., regression or time series analytics.
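A minimal sanity-check sketch in the spirit of (BP10), reporting case counts and NULL/empty fields; the flat event log structure is an assumption for illustration:

```python
def sanity_report(event_log, expected_cases=None):
    """Summarize an event log: case/event counts and NULL or empty fields."""
    case_ids = {event["case_id"] for event in event_log}
    null_fields = {}
    for event in event_log:
        for field, value in event.items():
            if value in (None, ""):
                null_fields[field] = null_fields.get(field, 0) + 1
    report = {"n_cases": len(case_ids), "n_events": len(event_log),
              "null_fields": null_fields}
    if expected_cases is not None:
        # e.g. compare against the number of expected transactions
        report["case_count_matches"] = len(case_ids) == expected_cases
    return report
```

Running such a report after each transformation step makes it easy to spot where cases are lost or attributes are silently dropped.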

6 Outlook

During the past ten years, with process mining finally finding its way from academia into market-leading organizations, a lot of progress has been made, both in simplifying the approach for business use, including data preparation, and in extending associated functionality and demonstrating tangible value in a growing number of industries. With spreading awareness and substantial increases in capital allocation from venture firms, this development has only accelerated in the recent past. From our perspective as practitioners, we expect the following to be some of the most substantial improvements:

  1.

    Architected for Analytics. With many large-scale tech modernization programs under way, corporations replace more and more legacy systems with modern architectures. Most of these solutions are already designed with modularity and an analysis angle in mind (e.g. with data provisioning APIs), leading to a substantial reduction of effort for data extraction and transformation.

  2.

    Maturing Connector Landscape. Two factors will contribute to a maturing landscape of data extraction and transformation connectors. Firstly, and focused on the data intake, the growing adoption of process mining across industries and therefore across a wide variety of (partially) standardized source systems will produce proven, configurable and scalable connector modules. Corresponding know-how will be gradually commoditized, easing access to and usage of these modules. Secondly, and more focused on the data transformation output, vendor agnostic bodies in the field are working on standardizing event log data formats beyond academia. Once adopted by enough players in the field, such standardized formats greatly improve compatibility of approaches across process mining vendors and will eventually even lead to data source systems offering “native” data outputs conforming to the process mining data standard.

  3.

    Data Model Advancements. As broached in Sect. 4.3, substantial complexity stems from the requirement to select a caseID type during data transformation. End users, however, desire the flexibility of switching between such case perspectives interactively during analysis. With solution vendors reacting to these requirements, data transformation steps related to linking all events to one single case type will become redundant. This field of object-centric and ontology-based process mining [6] not only constitutes an active academic field, but also offers researchers the exciting opportunity to work closely with, and tangibly impact, businesses across the globe.

Besides the overall maturing industry and the improvements listed above in particular, we anticipate some of the already prevalent challenges to intensify, while new ones emerge from macrotrends:

  1.

    Ubiquitous Data. As more and more processes get digitized and data storage capacity is easily and economically scalable, it is no surprise to expect data volumes to continue to increase significantly. However, with the growing adoption of the Internet of Things (IoT) and the use of sensor data in general, an arguably new data category is added to the mix. Not only can such IoT data volumes exceed transactional data volumes manyfold, they usually come in less structured formats (e.g. simple text files) and are subject to less scrutiny with regards to data quality (i.e. the occasional outlier and malfunction of single sensors is expected). Also, the frequency in which the sensor state is persisted is normally independent of process transactions, but rather driven by the sensor’s configuration. Correlating such IoT data to individual process flows presents a major challenge ahead [16].

  2.

    Simulation. After embracing the end-to-end transparency added with process mining, organizations demand putting this insight to use beyond educational purposes and the occasional process improvement. The first step – for obvious automation candidates – was and still is to interface or integrate automation solutions. But as soon as more sophisticated questions are asked in the context of process mining based process analytics (e.g. “What if we change A? Would there be another bottleneck?”), the lack of simulation capabilities becomes apparent. In isolation from process mining, there has been plenty of research [7] and tool support for simulation engines [13]. The challenge will be to seamlessly integrate simulation engines with process mining engines, without turning the corresponding configuration into a Customer Experience (CX) nightmare. It is expected that, as part of this convergence, additional requirements towards event log engineering emerge (e.g. statistical distribution information).

  3.

    Data Exchange Restrictions. With regulations and restrictions around data sensitivity and data exchange tightening around the world, it becomes increasingly difficult to manage the compliance angle of holistic and often global analytics initiatives. This shift also leads to additional precaution whenever data is shared with third parties and especially when these are within the open domain. It will accordingly become more and more difficult for academia to work on relevant business scenarios with representative underlying data sets, which in turn results in slower and less targeted innovation in the field, including event log engineering.

In summary, the discipline of process mining and corresponding event log engineering is expected to thrive under the increased attention of academia, solution vendors, professional service firms and financiers. The most substantial impact, however, will continue to emanate from firms of all sizes adopting process mining to streamline operations and – at times – turn process excellence into their competitive advantage.