1 Introduction

The main goal of predictive process monitoring is to provide timely information by predicting the behavior of business processes (van der Aalst et al., 2011) and enabling proactive actions to improve the performance of the process (Park & van der Aalst, 2022). It provides various predictive information about a process instance (e.g., a patient or a product), such as its next activity (Breuker et al., 2016), its waiting time for an activity, and its remaining time to complete the process (Marquez-Chamorro et al., 2018). For instance, by predicting a long waiting time of a patient for registration, one can bypass the activity or add more resources to perform it.

A plethora of approaches have been proposed to support predictive process monitoring. In particular, with the recent breakthroughs in machine learning, various machine learning-based approaches have been developed (Marquez-Chamorro et al., 2018). The emergence of ensemble learning methods has led to improvements in accuracy in different areas (Breiman, 1996). eXtreme Gradient Boosting (XGBoost) (Chen and Guestrin, 2016) has shown promising results, often outperforming other ensemble methods such as Random Forest as well as single regression trees (Senderovich et al., 2017; Teinemaa et al., 2019b). Furthermore, techniques based on deep neural networks, e.g., Long Short-Term Memory (LSTM) networks, have shown high performance in different predictive tasks (Evermann et al., 2017).

However, machine learning-based techniques are computationally expensive due to their training process (Zhou et al., 2017). Moreover, they often require exhaustive hyperparameter tuning to provide acceptable accuracy. Such limitations hinder the application of machine learning-based techniques to real-life business processes, where new prediction models are required at short intervals to adapt to changing business situations. Business analysts need to test the efficiency and reliability of their conclusions via repeated training of different prediction models with different parameters (Marquez-Chamorro et al., 2018). The resulting long training times limit the applicability of these techniques under constraints on time and hardware (Pourghassemi et al., 2020).

In this regard, instance selection has been studied as a promising direction of research to reduce original datasets to a manageable volume for machine learning tasks, while maintaining the quality of the results (e.g., accuracy) as if the original dataset were used (García et al., 2014). Instance selection techniques are categorized into two classes based on the way they select instances. First, some techniques select the instances at the boundaries of classes. For instance, Decremental Reduction Optimization Procedure (DROP) (Wilson & Martinez, 2000) selects instances using k-Nearest Neighbors by incrementally discarding an instance if its neighbors are correctly classified without the instance. The other techniques preserve the instances residing inside classes, e.g., Edited Nearest Neighbor (ENN) (Wilson, 1972) preserves instances by repeatedly discarding an instance if it does not belong to the class of the majority of its neighbors.

However, existing instance selection techniques cannot be applied directly to the training of predictive process monitoring models, since such techniques assume independence among instances (Wilson & Martinez, 2000). In predictive process monitoring, by contrast, instances are computed from event data that are recorded by the information system supporting business processes (de Leoni et al., 2016). Thus, they are highly correlated (van der Aalst, 2016) through the notion of case, e.g., a manufacturing product in a factory or a customer in a store.

In this work, we suggest an instance selection approach for predicting the next activity, the remaining time, and the outcome of the process, which are the main applications of predictive business process monitoring. By considering the characteristics of the event data, the proposed approach samples event data such that the training speed is improved while the accuracy of the resulting prediction model is maintained. We have evaluated the proposed methods using three real-life datasets and state-of-the-art techniques for predictive business process monitoring, including LSTM (Huang et al., 2015) and XGBoost (Chen & Guestrin, 2016).

This paper extends our earlier work presented in Sani et al. (2021) in the following dimensions: 1) Evaluating the applicability of the proposed approach, 2) Enhancing the accessibility of the work, and 3) Extending the discussion of strengths and limitations. First, we have evaluated the applicability of the proposed approach both task-wise and domain-wise. For the task-wise evaluation, we have selected the three most well-known predictive monitoring tasks (i.e., next activity, remaining time, and outcome predictions) and evaluated the performance of the proposed approach in the different tasks. For the domain-wise evaluation, we have evaluated the performance of our proposed approach in real-life event logs from different domains, including finance, government, and healthcare domains. Second, we have extended the accessibility of the proposed approach by implementing the proposed sampling methods in the Python platform as well as the Java platform. Finally, we have extensively discussed the strengths and limitations of the proposed approach, providing foundations for further research.

The remainder of the paper is organized as follows. We discuss the related work in Section 2. Next, we present the preliminaries in Section 3 and the proposed methods in Section 4. Afterward, Section 5 evaluates the proposed methods using real-life event data and Section 6 provides discussions. Finally, Section 7 concludes the paper.

2 Related work

This section presents the related work on predictive process monitoring, time optimization, and instance sampling.

2.1 Predictive process monitoring

Predictive process monitoring is an exceedingly active field, both currently and historically, thanks to the compatibility of process sciences and other branches of data science that include inference techniques, such as statistics and machine learning. At its core, the fundamental component of many predictive monitoring approaches is the abstraction technique they use to obtain a fixed-length representation of the process component subject to the prediction (often, but not always, process traces). In the earlier approaches, the need for such abstraction was overcome through model-aware techniques, employing process models and replay techniques on partial traces to abstract a flat representation of event sequences. Such process models are mostly automatically discovered from a set of available complete traces, and require perfect fitness on training instances (and, seldom, also on unseen test instances). For instance, van der Aalst et al. (2011) proposed a time prediction framework based on replaying partial traces on a transition system, effectively clustering training instances by control-flow information. This framework has later been the basis for a prediction method by Polato et al. (2018), where the transition system is annotated with an ensemble of SVR and Naïve Bayes classifiers, to perform a more accurate time estimation. Some more recent approaches split the predictive contribution of process models and machine learning models in a perspective-wise manner: for instance, Park and Song (2020) obtain a representation of the performance perspective using an annotated transition system, and design an ensemble with a deep neural network to obtain the final predictive model. A related approach, albeit more linked to the simulation domain and based on a Monte Carlo method, is the one proposed by Rogge-Solti and Weske (2013), which maps partial process instances in an enriched Petri net.

Recently, predictive process monitoring started to use a plethora of machine learning approaches, achieving varying degrees of success. For instance, Teinemaa et al. (2016) provided a framework to combine text mining methods with Random Forest and Logistic Regression. Senderovich et al. (2017) studied the effect of using intra-case and inter-case features in predictive process monitoring and showed a promising result for XGBoost compared to other ensemble and linear methods. A comprehensive benchmark on using classical machine learning approaches for outcome-oriented predictive process monitoring tasks (Teinemaa et al., 2019b) has shown that XGBoost is the best-performing classifier among different machine learning approaches such as SVM, Decision Tree, Random Forest, and Logistic Regression.

More recent methods are model-unaware and perform based on a single and more complex machine learning model instead of an ensemble. In fact, such an evolution of predictive monitoring mimics the advancement in the accuracy of newer machine learning approaches, specifically the numerous and sophisticated models based on deep neural networks that have been developed in the last decade. The LSTM network model has proven to be particularly effective for predictive monitoring (Evermann et al., 2017; Tax et al., 2017), since the recurrent architecture can natively support sequences of data of arbitrary length. It allows performing trace prediction while employing a fixed-length event abstraction, which can be based on control-flow alone (Evermann et al., 2017; Tax et al., 2017), data-aware (Navarin et al., 2017), time-aware (Nguyen et al., 2020), text-aware (Pegoraro et al., 2021), or model-aware (Park and Song, 2020). Additionally, rather than leveraging control-flow information for prediction, some recent research aims to use predictive monitoring to reconstruct missing control-flow attributes such as labels (van der Aa et al., 2021) or case identifiers (Pegoraro et al., 2022a; 2022b). However, the body of work currently standing in predictive process monitoring, regardless of architecture of the predictive model and/or target of the prediction, does not include strategies for performance-preserving sampling.

2.2 Time optimization and instance sampling

The latest research developments in the field of predictive monitoring, much like both this paper and the application of machine learning techniques in other domains, shift their focus away from increasing the quality of prediction (with respect to a given error metric or a given benchmark), and specialize in enriching the results of the prediction on additional aspects or perspectives. Unlike the methods mentioned in the previous section, this is usually obtained through dedicated machine learning models (or modifications thereof) rather than by designing specific event- or trace-level abstractions. For instance, many scholars have attempted to make predictions more transparent through the use of explainable machine learning techniques (Verenich, 2019; Stierle et al., 2021; Sindhgatta et al., 2020; Galanti et al., 2020; Park et al., 2022). More related to our present work, Pauwels and Calders (2021) propose a technique to avoid the time expenditure caused by retraining machine learning models; retraining is necessary when the models are no longer representative, for instance when changes occur in the underlying process (caused, e.g., by concept drift). While in this paper we focus on the data and propose a solution based on sampling, Pauwels and Calders intervene on the model side, devising an incremental training schema which accounts for new information in an efficient way.

Another concept similar to the idea proposed in this paper, and of current interest in the field of machine learning, is dataset distillation: utilizing a dataset to obtain a smaller set of training instances that contain the same information (with respect to training a machine learning model) (Wang et al., 2020). While this is not considered sampling, since some instances of the distilled dataset are created ex-novo, it is an approach very similar to the one we illustrate in our paper.

The concept of instance sampling, or instance subset selection, is present in the context of process mining at large, albeit the development of such techniques is very recent. Some instance selection algorithms have been proposed to help classical process mining tasks. For example, (Fani Sani et al., 2021) proposes to use instance selection techniques to improve the performance of process discovery algorithms; in this context, the goal is to obtain automatically a descriptive model of the process, on the basis of the data recorded about the historical executions of process instances—the same starting point of the present work. Then, the work in Fani Sani et al. (2020) applies the same concept and integrates the edit distance to obtain a fast technique to approximate the conformance checking score of process traces: this consists of measuring the deviation between a model, which often represents the normative or prescribed behavior of a process, and the data, which represents the actual behavior of a process. This paper integrates the two aforementioned works, extending the effects of strategic sampling: while in Fani Sani et al. (2021) the sampling optimizes descriptive modeling and in Fani Sani et al. (2020) it optimizes process diagnostics, in this work it aids predictive modeling.

To the best of our knowledge, no work present in literature inspects the effects of building a training set for predictive process monitoring through a strategic process instance selection, with the exception of our previous work, which we extend in this paper (Sani et al., 2021). In this paper, we examine the underexplored topic of event data sampling and selection for predictive process monitoring, with the objective of assessing if and to which extent prediction quality can be retained when we utilize subsets of the training data.

3 Preliminaries

In this section, some process mining concepts such as event log and sampling are discussed. In process mining, we use events to provide insights into the execution of business processes. Event logs, i.e., collections of events representing the execution of several instances of a process, are the starting point of process mining algorithms. An example event log is shown in Table 1. Each event that relates to a row in the table is related to specific activities of the underlying process. Furthermore, we refer to a collection of events related to a specific process instance of the process as a case (represented by the Case-id column). Both cases and events may have different attributes. An event log that is a collection of events and cases is defined as follows.

Table 1 Simple example of an event log

Definition 1 (Event Log)

Let \(\mathcal {E}\) be the universe of events, \(\mathcal {C}\) be the universe of cases, \(\mathcal {A} \mathcal {T}\) be the universe of attributes, and \(\mathcal {U}\) be the universe of attribute values. Moreover, let \(C{\subseteq }\mathcal {C}\) be a non-empty set of cases, let \(E{\subseteq }\mathcal {E}\) be a non-empty set of events. We define \((C,E,\pi _{C},\pi _{E})\) as an event log, where \(\pi _{C}{:}C{\times }\mathcal {A} \mathcal {T}{\nrightarrow }\mathcal {U}\) and \(\pi _{E}{:}E{\times }\mathcal {A} \mathcal {T}{\nrightarrow }\mathcal {U}\). Any event in the event log has a case, and thus, \(\nexists _{e\in E} (\pi _{E}(e, case) \not \in C)\) and \(\bigcup \limits _{e\in E}(\pi _{E}(e, case)){=} C \). Let \(\mathcal {A}{\subseteq }\mathcal {U}\) be the universe of activities and \(\mathcal {V}{\subseteq }\mathcal {A}^{*}\) be the universe of sequences of activities. For any \(e{\in }E\), \(\pi _{E}(e, activity){\in } \mathcal {A}\), which means that any event in the event log has an activity. Moreover, for any \(c{\in }C\), \(\pi _{C}(c, variant){\in } \mathcal {A}^{*}{\setminus } \{\langle \rangle \}\), which means that any case in the event log has a variant.

Therefore, case and activity are mandatory attributes for events, and variant is a mandatory attribute for cases. For example, for the event with Event-id 35 in Table 1, πE(e,case) = 7 and πE(e,activity) = Register (a).

A variant is the sequence of activities that is present in a case. For example, for case 7 in Table 1, the variant is 〈a,b,g,e,h〉 (for simplicity, we denote each activity by a letter). Variant information plays an important role, and in some process mining applications, e.g., process discovery and conformance checking, only this information is considered. In this regard, event logs are considered as a multiset of sequences of activities. In the following, a simple event log is defined.

Definition 2 (Simple event log)

Let \(\mathcal {A}\) be the universe of activities and let the universe of multisets over a set X be denoted by \({\mathscr{B}}(X)\). A simple event log is \(L{\in } {\mathscr{B}}(\mathcal {A}^{*}) \). Moreover, let \(\mathcal {E} {\mathscr{L}}\) be the universe of event logs and \(EL{=}(C,E,\pi _{C},\pi _{E}){\in } \mathcal {E} {\mathscr{L}} \) be an event log. We define function \(sl{:}\mathcal {E} {\mathscr{L}}{\to }{\mathscr{B}}(\mathcal {A}^{*})\) returns the simple event log of an event log where \(sl(EL){=}[\sigma ^{k} \mid \sigma {\in }\{ \pi _{C}(c,variant) \mid c{\in }C \} \wedge k{=}{\Sigma }_{c\in C} \big (\pi _{C}(c,variant){=} \sigma \big ) ] \). The set of unique variants in the event log is denoted by \(\overline {sl(EL)}{=}\{ \pi _{C}(c,variant) \mid c{\in }C \}\).

Therefore, sl returns the multiset of variants in the event log. Note that the size of a simple event log equals the number of cases in the event log, i.e., ∣sl(EL)∣=∣C∣.
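As a minimal sketch of Definition 2, the multiset of variants can be represented with a counter. The data layout (a dictionary from case identifier to activity sequence), the function name simple_log, and the three cases below are assumptions made for illustration; they are not taken from Table 1.

```python
from collections import Counter

def simple_log(case_variants):
    """Return the multiset of variants (Definition 2) as a Counter.

    case_variants maps each case id to its sequence of activities.
    """
    return Counter(tuple(trace) for trace in case_variants.values())

# Hypothetical cases, for illustration only.
cases = {
    "c1": ["a", "b", "e"],
    "c2": ["a", "c", "e"],
    "c3": ["a", "b", "e"],
}
sl = simple_log(cases)
unique_variants = set(sl)  # the set of unique variants of the log

# The size of the simple event log equals the number of cases.
assert sum(sl.values()) == len(cases)
```

The counter keys play the role of the unique variants, and the counts are the multiplicities k used in the definition.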

In this paper, we use sampling techniques to reduce the size of event logs. An event log sampling method is defined as follows.

Definition 3 (Event log sampling)

Let \(\mathcal {E} {\mathscr{L}}\) be the universe of event logs and \(\mathcal {A}\) be the universe of activities. Moreover, let \(EL{=}(C,E,\pi _{C},\pi _{E}){\in } \mathcal {E}{\mathscr{L}} \) be an event log. We define function \(\delta {:}\mathcal {E} {\mathscr{L}}{\to } \mathcal {E}{\mathscr{L}} \) that returns the sampled event log, where if \((C^{\prime },E^{\prime },\pi ^{\prime }_{C}, \pi ^{\prime }_{E}){=}\delta (EL)\), then \(C^{\prime }{\subseteq }C\), \(E^{\prime }{\subseteq }E\), \(\pi ^{\prime }_{E}{\subseteq }\pi _{E}\), \(\pi ^{\prime }_{C}{\subseteq }\pi _{C}\), and consequently, \(\overline {sl(\delta (EL))} {\subseteq } \overline {sl(EL)}\). We say that δ is a variant-preserving sampling if \(\overline {sl(\delta (EL))} {=} \overline {sl(EL)}\).

In other words, a sampling method is variant-preserving if and only if all the variants of the original event log are presented in the sampled event log.
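The variant-preserving condition can be checked mechanically. The following sketch assumes that both the original and the sampled log are given as dictionaries from case identifier to activity sequence; the function name and the example cases are hypothetical.

```python
def is_variant_preserving(original, sampled):
    """Check the variant-preserving condition of Definition 3.

    Both arguments map case ids to activity sequences.
    """
    # A sampled log may only contain cases of the original log, unchanged.
    for case_id, trace in sampled.items():
        if case_id not in original or list(original[case_id]) != list(trace):
            raise ValueError("sampled log is not a sub-log of the original")

    def variants(log):
        return {tuple(trace) for trace in log.values()}

    return variants(sampled) == variants(original)

# Hypothetical log with two variants.
log = {"c1": ["a", "b"], "c2": ["a", "b"], "c3": ["a", "c"]}
keeps_all_variants = {"c1": ["a", "b"], "c3": ["a", "c"]}
drops_a_variant = {"c1": ["a", "b"], "c2": ["a", "b"]}
```

Here, is_variant_preserving(log, keeps_all_variants) holds, while is_variant_preserving(log, drops_a_variant) does not, because the variant of c3 is lost.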

To use machine learning methods for prediction, we usually need to transfer each case to one or more features. The feature is defined as follows.

Definition 4 (Feature)

Let \(\mathcal {A}\mathcal {T}\) be the universe of attributes, \(\mathcal {U}\) be the universe of attribute values, and \(\mathcal {C}\) be the universe of cases. Moreover, let \(AT{\subseteq }\mathcal {A}\mathcal {T}\) be a set of attributes. A feature is a relation between a sequence of attributes’ values for AT and the target attribute value, i.e., \(f{\in } (\mathcal {U}^{{\mid }AT{\mid }} {\times } \mathcal {U} ) \). We define \(\mathit {fe}{:}\mathcal {C}{\times }\mathcal {E} {\mathscr{L}}{\to }{\mathscr{B}}(\mathcal {U}^{{\mid }AT{\mid }} {\times } \mathcal {U} )\) is a function that receives a case and an event log, and returns a multiset of features.

For the next and final activity prediction, the target attribute value should be an activity. However, for the remaining time prediction, the target attribute value is a numerical value. Moreover, a case in the event log may have different features. For example, suppose that we only consider the activities. For the case 〈a,b,c,d〉, we may have (〈a〉,b), (〈a,b〉,c), and (〈a,b,c〉,d) as features. Furthermore, \(\sum \limits _{c\in C} fe(c,EL)\) are the corresponding features of event log EL=(C,E,πC,πE) that could be given to different machine learning algorithms. For more details on how to extract features from event logs please refer to Qafari and van der Aalst (2020).
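The prefix-based features of the example above can be generated as follows. This is a minimal sketch that considers activities only; the function name is an assumption, and real feature extraction would also encode the attributes in AT.

```python
def next_activity_features(trace):
    """Return (prefix, next activity) pairs for one case."""
    return [(tuple(trace[:i]), trace[i]) for i in range(1, len(trace))]

features = next_activity_features(["a", "b", "c", "d"])
# features == [(("a",), "b"), (("a", "b"), "c"), (("a", "b", "c"), "d")]
```

A case with n activities therefore contributes n - 1 features, which is why reducing the number of cases directly reduces the size of the training set.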

4 Proposed sampling methods

In this section, we propose an event log preprocessing procedure that helps prediction algorithms to perform faster while maintaining reasonable accuracy. The schematic view of the proposed instance selection approach is presented in Fig. 1. First, we need to traverse the event log and find the variants and corresponding traces of each variant in the event log. Moreover, different distributions of data attributes in each variant will be computed. Afterward, using different sorting and instance selection strategies, we are able to select some of the cases and return the sample event log. In the following, each of these steps is explained in more detail. To illustrate the following steps, we provide an example event log with 10 cases, visible in Table 2.

  1. Traversing the event log: In this step, the unique variants of the event log and the corresponding traces of each variant are determined. In other words, given an event log EL with \(\overline {sl(EL)}{=}\{\sigma _{1}, \dots ,\sigma _{n} \}\), where \(n{=}{\mid } \overline {sl(EL)}{\mid }\), we aim to split EL into \(EL_{1}, \dots , EL_{n}\), where \(EL_{i}\) contains exactly the cases \(C_{i}{=}\{c{\in }C \mid \pi _{C}(c,variant){=}\sigma _{i}\}\) and the events \(E_{i}{=}\{e{\in }E \mid \pi _{E}(e,case){\in }C_{i}\}\). Obviously, \(\bigcup \limits _{1\leq i\leq n }(C_{i}){=}C\) and \(\bigcap \limits _{1\leq i\leq n }(C_{i}){=}\varnothing \). For the event log that is presented in Table 2, we have n = 4 variants and E1 = {c1,c3,c4,c9,c10}, E2 = {c2,c5,c8}, E3 = {c6}, and E4 = {c7}.

  2. Computing Distribution: In this step, for each variant of the event log, we compute the distribution of different data attributes \(a{\in }AT\). It would be more practical if the interesting attributes are chosen by an expert. Both event and case attributes can be considered. A simple approach is to compute the frequency of categorical data values. For numerical data attributes, it is possible to consider the average or the median of values for all cases of each variant. In the running example, for E3 and E4, we only have one case for each variant. However, for E1 and E2, the average of Amount is 500 and 460, respectively.

  3. Sorting the cases of each variant: In this step, we aim to sort the traces of each variant. We need to sort the traces to give a higher priority to those traces that can represent the variant better. One way is to sort the traces based on how often they contain the most frequent data values of the variant. For example, we can give a higher priority to the traces that have the more frequent resources of each variant. For the event log that is presented in Table 2, we do not need to prioritize the cases in E3 and E4. However, if we sort the traces according to the distance between their amount value and the average value of each variant, for E1, we have c3,c9,c4,c1, and c10. The order for E2 is c5,c2, and c8. It is also possible to sort the traces based on their arrival time or randomly.

  4. Returning sample event logs: Finally, depending on the setting of the sampling function, we return some of the traces with the highest priority for all variants. The most important point about this step is to know how many traces of each variant should be selected.

    In the following, some possibilities will be introduced.

    • Unique selection: In this approach, we select only one trace with the highest priority for each variant. In other words, suppose that \(L^{\prime }{=}sl(\delta (EL))\); then \(\forall _{\sigma \in L^{\prime }} L^{\prime }(\sigma ){=}1 \). Therefore, using this approach we will have \({\mid }sl(\delta (EL)){\mid }{=} {\mid }\overline {sl(EL)}{\mid }\). It is expected that by using this approach, the frequency distribution of variants will change and, consequently, the resulting prediction model will be less accurate. By applying this sampling method on the event log that is presented in Table 2, the sampled event log will have 4 traces, i.e., one trace for each variant. The corresponding cases are \(C^{\prime }=\{ c_{3}, c_{5},c_{6},c_{7}\}\). For the variants that have more than one trace, the trace with the highest priority is chosen (i.e., the one whose amount value is closest to the average amount value of the variant).

    • Logarithmic distribution: In this approach, we reduce the number of traces in each variant in a logarithmic way. If \(L{=}sl(EL)\) and \(L^{\prime }{=}sl(\delta (EL))\), then \(\forall _{\sigma \in L^{\prime }} L^{\prime }(\sigma ){=}{[}Log_{k}(L(\sigma )) {]} \). Using this approach, the infrequent variants will not have any trace in the sampled event log and, consequently, it is not variant-preserving. According to the above formula, by using a higher base for the logarithm (i.e., k), the size of the sampled event log is reduced more. By using this sampling strategy with k = 3 on the event log that is presented in Table 2, the cases that are selected in the sampled event log are \(C^{\prime }=\{ c_{3}, c_{9},c_{5}\}\). Note that for the infrequent variants no trace is selected in the sampled event log.

    • Division: This approach performs similarly to the previous one; however, instead of using a logarithmic scale, we apply the division operator. In this approach, \(\forall _{\sigma \in L^{\prime }} L^{\prime }(\sigma ){=}{\lceil }\frac {L(\sigma )}{k} {\rceil } \). A higher k results in fewer cases in the sampled event log. Note that, as the ceiling operator ⌈⌉ is used in the above formula, all the variants have at least one trace in the sampled event log and the approach is variant-preserving. By using this sampling strategy with k = 4 on the event log that is presented in Table 2, the sampled event log will have 5 traces, which are \(C^{\prime }=\{ c_{3}, c_{9},c_{5},c_{6},c_{7}\}\).

    There is also a possibility to consider other selection methods. For example, we can select the traces completely randomly from the original event log.
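The four steps above can be sketched in a few lines of Python. This is not the ProM or Python implementation used in the evaluation; it is a minimal illustration under several assumptions: cases are given as a dictionary from case identifier to a (variant, amount) pair, priority is the distance of the amount to the variant's average, and the bracket in the logarithmic formula is read as rounding up, which is the reading that reproduces the three cases selected in the running example. The variant labels and amount values below are invented so that they match the averages (500 and 460) and the orderings reported in the text; they are not the contents of Table 2.

```python
import math
from collections import defaultdict
from statistics import mean

def ceil_log(n, k):
    """Smallest integer m with k**m >= n, computed without floating point."""
    m, power = 0, 1
    while power < n:
        power *= k
        m += 1
    return m

def sample_cases(cases, strategy="division", k=2):
    """Return the selected case ids; cases maps case id -> (variant, amount)."""
    # Step 1: split the cases by variant.
    by_variant = defaultdict(list)
    for case_id, (variant, _) in cases.items():
        by_variant[variant].append(case_id)

    selected = []
    for ids in by_variant.values():
        # Step 2: distribution of the attribute (here, its average).
        avg = mean(cases[c][1] for c in ids)
        # Step 3: cases closest to the average get the highest priority.
        ranked = sorted(ids, key=lambda c: abs(cases[c][1] - avg))
        # Step 4: number of cases to keep for this variant.
        n = len(ids)
        if strategy == "unique":
            keep = 1
        elif strategy == "log":
            keep = ceil_log(n, k)  # assumption: the bracket rounds up
        elif strategy == "division":
            keep = math.ceil(n / k)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        selected.extend(ranked[:keep])
    return selected

# Hypothetical stand-in for Table 2 (variant labels and amounts are invented).
v1, v2, v3, v4 = ("a", "b"), ("a", "c"), ("a", "d"), ("a", "e")
cases = {
    "c1": (v1, 470), "c2": (v2, 480), "c3": (v1, 500), "c4": (v1, 480),
    "c5": (v2, 470), "c6": (v3, 300), "c7": (v4, 700), "c8": (v2, 430),
    "c9": (v1, 510), "c10": (v1, 540),
}
unique_sample = sample_cases(cases, "unique")
log_sample = sample_cases(cases, "log", k=3)
division_sample = sample_cases(cases, "division", k=4)
```

With these inputs, the three samples contain the cases {c3, c5, c6, c7}, {c3, c9, c5}, and {c3, c9, c5, c6, c7}, respectively, in line with the running example. The integer helper ceil_log is a design choice to avoid floating-point rounding issues at exact powers of k.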

Fig. 1 A schematic view of the proposed sampling procedure

Table 2 An example event log with 10 traces and 4 variants

By choosing different data attributes in Step 2 and different sorting algorithms in Step 3, we are able to steer the sampling method toward the cases that should be chosen. Moreover, by choosing the type of distribution in Step 4, we determine how many cases should be chosen. To compute how much a sampling method δ reduces the size of the given event log EL, we use the following equation:

$$ R_{S}{=}\frac{{\mid}sl(EL){\mid}}{{\mid}sl(\delta(EL)){\mid}} $$

A higher \(R_{S}\) value means that the sampling method reduces the size of the training log more. By choosing different distribution methods and different k-values, we are able to control the size of the sampled event log. It should be noted that the proposed method is applied only to the training event log. In other words, we do not sample event logs for development and test datasets.
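As a worked example, consider the running example of Table 2 with 10 cases, where unique selection, logarithmic distribution with k = 3, and division with k = 4 select 4, 3, and 5 cases, respectively. The superscripts below are only labels introduced here to tell the three settings apart.

$$ R_{S}^{unique}{=}\frac{10}{4}{=}2.5, \qquad R_{S}^{log_{3}}{=}\frac{10}{3}{\approx }3.33, \qquad R_{S}^{d_{4}}{=}\frac{10}{5}{=}2 $$

In this small example, the logarithmic distribution reduces the log the most, at the cost of dropping the two variants that occur only once.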

5 Evaluation

In this section, we aim at designing some experiments to answer the research question, i.e., “Is it possible to have computational performance improvement of prediction methods by using the sampled event logs, while maintaining a similar accuracy?”. It should be noted that the focus of the experiments is not on prediction model tuning to have higher accuracy. Conversely, we aim to analyze the effect of using sampled event logs (instead of the whole datasets) on the required time and the accuracy of prediction models.

In the following, we first explain the evaluation settings and event logs that are used. Afterward, we provide some information about the implementation of sampling methods, and finally, we show the experimental results.

5.1 Evaluation setting

In this section, we first explain the prediction methods and parameters that are used in the evaluation. Afterward, we discuss the evaluation metrics.

5.1.1 Evaluation parameters

We have developed the sampling methods as a plug-in in the ProM framework (Verbeek et al., 2010), accessible via https://svn.win.tue.nl/repos/prom/Packages/LogFiltering. This plug-in takes an event log and returns k different train and test event logs in the CSV format. Moreover, we have also implemented the sampling methods in Python to have all the evaluations in one workflow.

We have used two machine learning methods to train the prediction models, i.e., LSTM and XGBoost. For predicting the next activity, our LSTM network consisted of an input layer, two LSTM layers with dropout rates of 10%, and a dense output layer with the SoftMax activation function. We used “categorical cross-entropy” to calculate the loss and adopted ADAM as an optimizer. We built the same LSTM architecture for the remaining time prediction, with some differences: we employed “mean absolute error” as the loss function and “Root Mean Squared Propagation” (RMSprop) as the optimizer. We used gbtree with a max depth of 6 as the booster in our XGBoost model for both the next activity and the remaining time prediction tasks. Uniform distribution is used as the sampling method inside our XGBoost model. To avoid overfitting in both models, the training set is further divided into a 90% training set and a 10% validation set, and training is stopped once the model performance on the validation set stops improving. We used the same parameter setting of both models for original event logs and sampled event logs. The implementations of these methods are available at https://github.com/gyunamister/pm-prediction/.

To train the prediction models using machine learning methods, we extract features from event data. To this end, we use the most commonly-used features for each prediction task in order to reduce the degree of freedom in selecting relevant features. In other words, we focus on comparing the performance of predictions between sampled and non-sampled event data with a fixed feature space. For instance, for the next activity prediction, we use the partial trace (i.e., the sequence of historical activities) of cases and the temporal measures of each activity (e.g., sojourn time) with one-hot encoding (Park & Song, 2019). For the remaining time prediction, we use the partial trace of cases along with case attributes (e.g., cost), resources, and temporal measures (Verenich et al., 2019).

To sample the event logs, we use three distributions: log distribution, division, and unique variants. For the log distribution method, we have used the bases 2, 3, 5, and 10 (i.e., log2, log3, log5, and log10). For the division method, we have used the divisors 2, 3, 5, and 10 (i.e., d2, d3, d5, and d10). For each event log and each sampling method, we have used 5-fold cross-validation, i.e., we split the data into 5 groups; one of the groups is used as the test event log, and the rest are merged into the training event log. It should be noted that for each event log, the splitting groups were the same for all the prediction and sampling methods. Moreover, as the results of the experiments are non-deterministic, all the experiments have been repeated 5 times, and the average values are reported. Finally, to have a fair evaluation, one CPU thread has been used in all the steps.
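The evaluation protocol can be sketched as follows. This is only an illustration, not the code used in the experiments: splitting at the level of case identifiers and fixing the groups through a random seed are assumptions, and the sampler passed in is a placeholder for any of the sampling methods of Section 4. The sketch shows that the same groups are reused for every method and that sampling is applied to the training part only.

```python
import random

def k_fold_case_splits(case_ids, n_folds=5, seed=0):
    """Return a list of (train_ids, test_ids); the seed fixes the groups."""
    ids = sorted(case_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        splits.append((train, test))
    return splits

def sampled_splits(case_ids, sampler, n_folds=5, seed=0):
    """Apply the sampler to each training part; test parts stay untouched."""
    return [
        (sampler(train), test)
        for train, test in k_fold_case_splits(case_ids, n_folds, seed)
    ]

# Hypothetical usage with ten case ids and a placeholder sampler.
case_ids = [f"c{i}" for i in range(1, 11)]
keep_every_second = lambda train: train[::2]
splits = sampled_splits(case_ids, keep_every_second)
```

Because the seed is fixed, calling k_fold_case_splits again for another sampling or prediction method yields exactly the same five groups.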

5.1.2 Metrics

To evaluate the correctness of prediction methods for predicting the next activities, we have considered two metrics, i.e., Accuracy and F1-score. The F1-score is used for imbalanced data (Luque et al., 2019). For remaining time prediction, we consider the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) measures, as used in Verenich et al. (2019), which are computed as follows.

$$ MAE {=}\frac{1}{n}\sum\limits_{t=1}^{n}{\mid}e_{t}{\mid} $$
$$ RMSE {=}\sqrt{\frac{1}{n}\sum\limits_{t=1}^{n}{e_{t}^{2}}} $$

In the above equations, \(e_{t}\) indicates the prediction error for the t-th instance of the validation data. In other words, we aim to compute the absolute difference between the predicted and real values (in days) for the remaining time. For both measures, a lower value means higher accuracy. Note that, similar to Verenich et al. (2019), we considered seconds as the time unit to compute these two metrics.
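Both measures follow directly from the list of prediction errors. The sketch below uses two made-up errors to show the arithmetic; the function names are chosen for illustration.

```python
import math

def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical errors (predicted minus real remaining time).
errors = [3.0, -4.0]
mae_value = mae(errors)    # (3 + 4) / 2 = 3.5
rmse_value = rmse(errors)  # sqrt((9 + 16) / 2) = sqrt(12.5)
```

Since RMSE squares the errors before averaging, it penalizes large individual errors more strongly than MAE.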

To evaluate how accurate the prediction methods are when trained on the sampled event logs, we use relative metrics that compare them with the case where the whole event logs are used, according to the following equations.

$$ R_{Acc} = \frac{\text{Accuracy using the sampled training log}}{\text{Accuracy using the whole training log}} $$
$$ R_{F1} = \frac{\text{F1-score using the sampled training log}}{\text{F1-score using the whole training log}} $$

In both of the above equations, a value close to 1 means that, using the sampled event logs, the prediction methods behave almost the same as when the whole data is used for training. Moreover, values higher than 1 indicate that the accuracy/F1-score of the prediction methods has improved.

Unlike previous metrics, for MAE and RMSE, a higher value means the prediction model is less accurate in predicting the remaining time. Therefore, we use the following measures.

$$ R_{MAE}{=}\frac{MAE \text{ using the whole training log}}{MAE \text{ using the sampled training log}} $$
$$ R_{RMSE}{=}\frac{RMSE \text{ using the whole training log}}{RMSE \text{ using the sampled training log}} $$

In both of the above measures, a higher value means that the applied instance selection method preserves higher accuracy. In case the values of the above measures are higher than 1, the instance selection methods improve the accuracy of prediction models compared to the case that the whole training data have been used.

To compute the improvement in the performance of feature extraction and training time, we will use the following equations.

$$ R_{FE}{=}\frac{\text{Feature extraction time using whole data}}{\text{Feature extraction time using the sampled data}} $$
$$ R_{t}{=}\frac{\text{Training time using whole data}}{\text{Training time using the sampled data}} $$

For both equations, the resulting values indicate how many times applying the sampled log is faster than using all data.
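The relative measures defined in this section differ only in the direction of the ratio, which the following sketch makes explicit. The helper names and all numbers are hypothetical and serve only to illustrate the definitions.

```python
def ratio_higher_is_better(sampled: float, whole: float) -> float:
    """R_Acc and R_F1: the sampled value is divided by the whole-log value."""
    return sampled / whole


def ratio_lower_is_better(sampled: float, whole: float) -> float:
    """R_MAE, R_RMSE, R_FE and R_t: the whole-log value is divided by the sampled value."""
    return whole / sampled


# Hypothetical measurements for one event log and one sampling method.
r_acc = ratio_higher_is_better(sampled=0.78, whole=0.80)
r_mae = ratio_lower_is_better(sampled=110.0, whole=100.0)
r_t = ratio_lower_is_better(sampled=8.0, whole=120.0)

print(f"R_Acc = {r_acc:.3f}")  # slightly below 1: accuracy almost preserved
print(f"R_MAE = {r_mae:.3f}")  # below 1: the error grew after sampling
print(f"R_t   = {r_t:.1f}")    # training on the sample is this many times faster
```

With both conventions, a value above 1 always favors the sampled log, so the tables can be read uniformly.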

5.1.3 Event logs

We have used three event logs that are widely used in the literature. In the BPIC-2012-W event log, relating to a process of an insurance company, the average variant frequency is low. In the RTFM event log, which corresponds to a road traffic management system, there are some highly frequent variants and several infrequent variants. Moreover, the number of activities in this event log is high. In the Sepsis event log, relating to a health care process, there are several variants, most of which are unique. Some of the activities in the last two event logs are infrequent, which makes these event logs imbalanced. Some information about these event logs and the results of using the prediction methods on them is presented in Table 3. Note that the time-related features in this table are in seconds.

Table 3 Event logs that are used in the evaluation and results of using them for the next activity and remaining time prediction

According to Table 3, using the whole event data, we usually obtain high accuracy for the next activity prediction. However, the F1-score is not as high, mainly because the event logs are imbalanced (specifically RTFM and Sepsis). Moreover, the MAE and RMSE values are very high, specifically for the RTFM event log. This is mainly because the durations of process instances are very long in this event log, and consequently, the $e_t$ values are higher. Finally, there is a direct relation between the size of the event logs and the time required for extracting features and training the models.
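The gap between Accuracy and F1-score on imbalanced data can be reproduced with a small constructed example. The sketch below assumes a macro-averaged F1-score; the averaging scheme used in the experiments is not stated in this section, and the labels and predictions are invented.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of the per-label F1-scores (macro averaging is an assumption)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Nine frequent next activities "A" and one infrequent "B"; the model always predicts "A".
y_true = ["A"] * 9 + ["B"]
y_pred = ["A"] * 10

print(accuracy(y_true, y_pred))  # 0.9
print(macro_f1(y_true, y_pred))  # below 0.5, because "B" is never predicted
```

A model that ignores the infrequent activity still reaches 90% accuracy here, while the per-label F1-score of the ignored activity is zero.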

5.2 Evaluation results

Here, we provide the results of using sampled training event logs instead of whole training event logs. First, Table 4 shows how much sampling reduces the size of the training data. As expected, the highest reduction occurs when log10 is used. Using this sampling, the size of the RTFM event log is reduced by a factor of more than 1000. However, for the Sepsis event log, as most variants are unique, sampling the training event logs using the divide distribution does not result in a high $R_S$. Moreover, this table shows how using the sampled event logs reduces the time required to extract features from the event data, i.e., $R_{FE}$. As expected, there is a correlation between the size reduction of the sampled event logs and the improvement in $R_{FE}$.

Table 4 Reduction in size of training event logs (i.e., $R_S$) and the improvement in the feature extraction process (i.e., $R_{FE}$) using different sampling methods

In the following, we show how using the sampling event logs affects the next activity, remaining time, and outcome prediction.

5.2.1 Next Activity Prediction

The accuracy of both the LSTM and XGBoost methods trained using sampled training data is presented in Table 5. The results indicate that in most cases, when the division sampling methods are used, we achieve similar accuracy, i.e., $R_{Acc}$ close to 1, compared to the case where the whole training data is used. In some cases, such as using d2 for the Sepsis event log with the LSTM method, the accuracy is even (slightly) improved. However, for RTFM, sampling with the logarithmic or unique distribution strongly changes the frequency distribution of variants and consequently causes a larger reduction in the accuracy of the prediction models. Moreover, the accuracy reduction was higher when the XGBoost method was used. The results indicate that, by increasing the $R_S$ value, we lose more information in the training event logs, and consequently, the quality of the prediction models decreases.

Table 5 $R_{Acc}$ of different event logs when different sampling methods are used

Table 6 shows the $R_{F1}$ of the trained models. The results again indicate that using the sampling method with the divide distribution leads, in most cases, to a similar (and sometimes higher) F1-score.

Table 6 $R_{F1}$ of different event logs when different sampling methods are used

Considering the results of these two tables, we found that, specifically when the LSTM method is used, the accuracy and F1-score remain similar to those obtained with the whole training data. Moreover, for the Sepsis and BPIC-2012-W event logs, whose variants have similar frequencies, keeping all the variants (i.e., using the divide and unique distributions) helps the prediction method achieve results similar to the case where the whole training data is used. However, for the RTFM event log, which has some highly frequent variants, using the unique distribution results in lower accuracy and F1-score.

Table 7 shows how much faster training is when the sampled training data is used instead of the whole event data. There is a direct relationship between the size reduction and $R_t$ (refer to the results in Table 4). However, in most cases, the performance improvement is larger for the XGBoost method. Considering the results in this table and Table 6, we found that, using the sampling method, we are able to improve the performance of the next activity prediction methods on the used event logs while they provide similar results. However, sampling too aggressively (e.g., applying log10 for RTFM) results in lower accuracy.

Table 7 $R_t$ of different event logs when different sampling methods are used

5.2.2 Remaining time prediction

In Tables 8 and 9, we show how the MAE and RMSE of the remaining time prediction methods change when the sampled event logs are used. For the LSTM method, the results indicate that, independent of the sampling method and for all event logs, the predictions are similar to the case where the whole event logs are used. However, this similarity should be read with care: the settings used for training the LSTM do not appear to be suitable, i.e., the trained model is not accurate enough even on the whole data. We repeated this experiment for LSTM with several different parameters and obtained almost the same results. We attribute this mainly to the challenging nature of the remaining time prediction task compared to classification-based problems (such as next activity and outcome prediction). Nevertheless, sampling the training event logs preserves the quality of the prediction models.

Table 8 $R_{MAE}$ of different event logs when different sampling methods are used
Table 9 $R_{RMSE}$ of different event logs when different sampling methods are used

For the XGBoost method, the results indicate that $R_{MAE}$ and $R_{RMSE}$ stay high as long as the sample is not too small (as happens, for example, with logarithmic sampling). In general, as the main attribute for the next activity prediction is the sequence of activities, it is less sensitive to sampling. However, to predict the remaining time, the other data attributes can be essential too. In other words, for the remaining time prediction, we need larger sampled event logs.

Table 10 shows how sampling the event logs reduces the required training time and thereby improves the performance of the remaining time prediction process. Considering the results in Tables 7 and 10, as expected, a higher $R_S$ yields a higher $R_t$ value.

Table 10 $R_t$ of different event logs when different sampling methods are used

5.3 Outcome prediction

For the outcome prediction, in order to facilitate comparison and remain consistent with previous work on outcome prediction, we transform each event log into different event logs (Teinemaa et al., 2019a). For example, we transform BPIC2012 event log to BPIC2012Accepted, BPIC2012Cancelled, and BPIC2012Declined.

The $R_{Acc}$ and $R_{F1}$ of both the LSTM and XGBoost methods trained for outcome prediction using sampled training data are presented in Tables 11 and 12, respectively. The results indicate that, in many cases, we are able to improve the accuracy of the outcome prediction algorithms. Specifically, using the unique strategy for sampling the traces of the BPIC2012Accepted event log leads to a considerable improvement for both the LSTM and XGBoost methods. Unlike the other two applications, i.e., the next activity and the remaining time prediction, even with aggressive sampling of some event logs, e.g., log10, we obtain results similar to the cases in which the whole training event logs are used.

Table 11 $R_{Acc}$ of different event logs when different sampling methods are used for outcome prediction
Table 12 $R_{F1}$ of different event logs when different sampling methods are used for outcome prediction

Table 13 shows the $R_t$ of the different sampling methods. The performance improvement is usually larger for the LSTM method. There are several cases in which we are not able to improve the performance of the prediction method ($R_t$ values less than 1). This happens mainly for the unique sampling method. One possible reason is that removing the frequencies increases the convergence time of the learning method. When the logarithmic method is used, the performance improvement is around a factor of 50, i.e., the training process using the sampled event log is about 50 times faster than when the whole training log is used.

Table 13 $R_t$ of different event logs when different sampling methods are used for outcome prediction

6 Discussion

In this section, we discuss the results presented in the previous section. The results indicate that we do not always face the typical trade-off between the accuracy of the trained model and the performance of the prediction procedure. For example, for the next activity prediction, there are some cases where the training process is much faster than the normal procedure, even though the trained model provides a similar or higher accuracy and F1-score. Thus, the proposed instance selection procedure can be applied when we aim to perform hyper-parameter optimization (Bergstra et al., 2011). In this way, more settings can be analyzed in a limited time. Moreover, it is reasonable to use the proposed method when we aim to train an online prediction method or to train on resource-constrained hardware such as cell phones.

To achieve the highest performance improvement while keeping the trained model accurate enough, different sampling methods should be used for different event logs. For example, for the RTFM event log, which has some highly frequent variants, the division distribution may be more useful. In other words, independently of the prediction method used, if we change the distribution of variants (e.g., using the unique distribution), the accuracy is expected to decrease sharply. However, for event logs with a more uniform distribution, we can use the unique distribution to sample event logs. Furthermore, the results indicate that the choice of distribution (i.e., unique, division, or logarithmic) matters more than the k-value used. This is mainly because the logarithmic distribution may remove some of the variants, and the unique distribution changes the frequency distribution of variants. Therefore, it would be interesting to investigate further the characteristics of the given event log and the suitable sampling parameters for each distribution. For example, if most variants of a given event log are unique (e.g., Sepsis), using the logarithmic distribution leads to a remarkable $R_S$, and consequently, $R_{FE}$ and $R_t$ will be very high. However, we will lose most of the variants, and the trained model might make poor predictions.
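The different effects of the three distributions can be illustrated with a small variant-based sampler. The sketch below is not the exact procedure of the proposed approach: the rounding rules are assumptions, namely keeping one trace per variant for unique, keeping the ceiling of f/k traces for division, and keeping the floor of the base-k logarithm of f traces for logarithmic, where f is the variant frequency. These choices were made only so that the behavior matches the description above, in which division and unique keep every variant whereas logarithmic may remove variants. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Trace = Tuple[str, ...]


def int_log_floor(freq: int, base: int) -> int:
    """Largest e with base**e <= freq, using integers to avoid float rounding."""
    exponent, power = 0, base
    while power <= freq:
        exponent += 1
        power *= base
    return exponent


def keep_count(freq: int, method: str, k: int = 2) -> int:
    """Number of traces kept for a variant of frequency freq (assumed rounding rules)."""
    if method == "unique":
        return 1
    if method == "division":
        return -(-freq // k)  # ceiling of freq / k, so every variant survives
    if method == "log":
        return int_log_floor(freq, k)  # 0 for freq < k, so rare variants vanish
    raise ValueError(f"unknown sampling method: {method}")


def sample_log(traces: List[Trace], method: str, k: int = 2) -> List[Trace]:
    """Group traces by variant and keep the first keep_count traces of each group."""
    by_variant: Dict[Trace, List[Trace]] = defaultdict(list)
    for trace in traces:
        by_variant[trace].append(trace)
    sampled: List[Trace] = []
    for members in by_variant.values():
        sampled.extend(members[: keep_count(len(members), method, k)])
    return sampled


# Toy log: one variant occurring 8 times and one unique variant.
toy_log = [("a", "b")] * 8 + [("a", "c")]
print(len(sample_log(toy_log, "unique")))         # 2
print(len(sample_log(toy_log, "division", k=2)))  # 5
print(len(sample_log(toy_log, "log", k=2)))       # 3
```

In this toy log, the logarithmic setting drops the unique variant `("a", "c")` entirely, the unique setting flattens the 8-to-1 frequency ratio to 1-to-1, and the division setting keeps both the variant and, roughly, the ratio.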

By analyzing the results, we found that infrequent activities can be ignored by the prediction models under some hyper-parameter settings. The significant difference between the F1-score and Accuracy values in Table 3 also points to this problem. Using sampling methods that modify the distribution of the event logs, such as the unique method, can help the prediction methods to also consider these activities. However, as these activities are infrequent, improving their prediction does not strongly affect the presented aggregated F1-score value.

Finally, in real-life business scenarios, the process can change because of different reasons (Carmona and Gavaldà, 2012). This phenomenon is usually called concept drift. By considering the whole event log for training the prediction model, it is most probable that these changes are not considered in the prediction. Using the proposed sampling procedure, and giving higher priorities to newer traces, it is expected that we are able to adapt to the changes faster, which may be critical for specific applications.

6.1 Limitations

Comparing the results for the next activity and remaining time prediction, we found that predicting the remaining time of the process is more sensitive to sampling. In other words, this application requires more data to predict accurately. Therefore, in cases where the target attribute depends more on other data attributes than on variant information, we need to sample more data. Otherwise, the trained model might be inaccurate.

We also found a critical problem in predictive monitoring. In some cases, specifically when using LSTM for predicting the remaining time, the accuracy of the predictions is low. For the next activity prediction, it is possible that the prediction models almost ignore infrequent activities. In these cases, even if we use the training data for the evaluation, we do not obtain acceptable results. In machine learning, this problem is called a high bias error (van der Putten & van Someren, 2004). In other words, the training is not effective even when using the whole data, and we need to change the prediction method (or its parameters).

7 Conclusion

In this paper, we proposed an instance selection approach to improve the performance of predictive business process monitoring methods. We suggested that it is possible to use a sample of the training event data instead of the whole training event data. To evaluate the proposed approach, we considered two main applications of predictive business monitoring, i.e., the next activity and the remaining time prediction. The results of applying the proposed approach to three real-life event logs and two widely used machine learning methods, i.e., LSTM and XGBoost, indicate that in most cases, we are able to improve the performance of predictive monitoring algorithms while providing similar accuracy compared to the case where the whole training event logs are used. However, by sampling too aggressively, the accuracy of the trained model might be reduced. Moreover, we have found that the remaining time prediction application is more sensitive to sampling.

To continue this research, we aim to extend the experiments to study the relationship between the event log characteristics and the sampling parameters. In other words, we aim to help the end-user adjust the sampling parameters based on the characteristics of the given event log. Moreover, it would be valuable to investigate how the proposed sampling procedure can be applied to streaming event data, which is potentially one of the major advantages of the proposed method in a real-life setting. Finally, it would be interesting to investigate feature selection methods for improving the performance of the predictive monitoring procedure. It is expected that, similar to process instance sampling, feature selection methods are able to reduce the required training time.

Another important outcome of the results is that for different event logs, we should use different sampling methods to achieve the highest performance. Considering lots of different machine learning methods and their parameters can lead to an increase in the search space and complexity for users. In other words, finding the right setting for sampling may be challenging in real scenarios. Therefore, it would be valuable to research the relationship between the event log characteristics and suitable sampling parameters that can be used for preprocessing the training event log.