1 Introduction

Predictive Process Monitoring Maggi et al. [25] is a research topic aiming at developing techniques that use the abundant availability of event logs extracted from information systems in order to predict how ongoing (uncompleted) process executions (a.k.a. cases) will unfold up to their completion. In turn, these techniques can be embedded within information systems to enhance their ability to manage business processes. For example, an information system can exploit a predictive monitoring technique to predict the remaining execution time of each ongoing case of a process Rogge-Solti and Weske [35], the next activity that will be executed in each case Evermann et al. [15], or the final outcome of a case w.r.t. a set of possible outcomes Maggi et al. [25], Metzger et al. [27], Metzger et al. [28].

Existing Predictive Process Monitoring techniques first construct a predictive model based on data coming from past process executions. Then, they use this model to predict the future of an ongoing case (e.g. outcome, remaining time, or next activity). However, once the predictive model has been constructed, it does not automatically take into account new cases as they complete their execution. This is a limitation in the usage of predictive techniques in the area of Business Process Monitoring: well-known characteristics of real processes are, in fact, their complexity, variability, and lack of steady state. Due to changing circumstances, processes (and thus their executions) evolve and increase their variability, and systems need to adapt in a timely manner.

A rough answer to this problem would be to re-build new predictive models from the wider set of available data; however, building predictive models has a cost, and this option should therefore be well understood before embracing it. Moreover, preliminary studies such as the one of Maisenbacher and Weidlich [26] investigate the usage of incremental techniques Gepperth and Hammer [19] in the presence of Concept Drift phenomena, thus suggesting a different strategy for updating a Predictive Process Monitoring model.

In this paper, we tackle the problem of updating Predictive Process Monitoring models in the presence of new process execution data in a principled manner, by investigating, in a comparative and empirically driven way, how different strategies to keep predictive models up-to-date perform. In particular, given an event log \(\mathcal{TR}_0\) and a set of new traces \(\mathcal{TR}_1\), we focus on four diverse strategies to update a predictive model \({{\mathcal {M}}}_0\), built using the traces of \(\mathcal{TR}_0\), so that it also takes into account the set of new traces \(\mathcal{TR}_1\):

  • Do nothing. In this case, \({{\mathcal {M}}}_0\) is never updated and does not take into account \(\mathcal{TR}_1\) in any way. This strategy also acts as a baseline against which to compare all the other strategies;

  • Re-train with no hyperopt. In this case, a new predictive model \({{\mathcal {M}}}_1\) is built using \(\mathcal{TR}_0\cup \mathcal{TR}_1\) as the train set, but no optimisation of the hyperparameters is performed and the ones of \({{\mathcal {M}}}_0\) are used;

  • Full re-train. In this case, a new predictive model \({{\mathcal {M}}}_2\) is built using \(\mathcal{TR}_0\cup \mathcal{TR}_1\) as the train set and a new optimisation of the hyperparameters is performed;

  • Incremental update. In this case, a new predictive model \({{\mathcal {M}}}_3\) is built starting from \({{\mathcal {M}}}_0\), using the cases contained in \(\mathcal{TR}_1\) in an incremental manner (that is, using incremental learning algorithms).

The evaluation aims at investigating two main aspects of these update strategies: first, their impact on the quality of the predictions in event logs that do or do not exhibit Concept Drift; second, their impact on the time spent building the predictive model \({{\mathcal {M}}}_0\) and its updates.

The reason why we provide two different strategies for re-training the model, i.e. with and without hyperparameter optimisation, is that the two costly activities when building a predictive model are the actual training of the model on a train set and the optimisation of the hyperparameters of the constructed model. Therefore, when evaluating the impact of re-training on an extended set of data, we aim at investigating the impact of building a new predictive model and the impact of optimising the hyperparameters in a separate manner.

The problem upon which we investigate these four strategies is the one of outcome predictions, where the outcomes are expressed by using either Linear Temporal Logic (LTL) formulae Pnueli [32], in line with several works such as Di Francescomarino et al. [12], Maggi et al. [25], or case duration properties. The four different strategies are evaluated in a broad experimental setting that considers different real and synthetic datasets. Since we focus on outcome predictions, we have decided to centre our evaluation on Random Forest. This algorithm was chosen as it was experimentally proven to be one of the best performing techniques for the outcome prediction problem—see Teinemaa et al. [40] for a rigorous review—and is therefore widely used on the event log data typically employed in Predictive Process Monitoring.

Perhaps not surprisingly, the results show that the do-nothing strategy is not viable (and therefore the issue of updating a Predictive Process Monitoring model is a real one) and that full re-training and incremental updates are the best strategies in terms of quality of the updated predictive model. Moreover, the incremental update is able to keep up with the full re-training strategy and deliver a properly fitted model almost in real time, whereas the full re-training might take hours and, in some cases, even days. This suggests that the potential of incremental models is under-appreciated and that clever solutions could be applied to deliver more stable performance while retaining the positive sides of the update functions.

The rest of the paper is structured as follows: Section 2 provides the necessary background on Predictive Process Monitoring and incremental learning; Sect. 3 presents two exemplifying scenarios of process variability and explicit Concept Drift; Sect. 4 introduces the four update strategies; Sect. 5 illustrates the data and procedure we use to evaluate the proposed update strategies, while Sect. 6 presents and discusses the results. We finally provide some related work (Sect. 7) and concluding remarks (Sect. 8).

2 Background

In this section, we provide an overview of the four main building blocks that compose our research effort: Predictive Process Monitoring, Random Forest, hyperparameter optimisation, and Concept Drift.

2.1 Predictive Process Monitoring

Predictive Process Monitoring Maggi et al. [25] is a branch of Process Mining that aims at predicting at runtime and as early as possible the future development of ongoing cases of a process given their uncompleted traces. In the last few years, a wide literature about Predictive Process Monitoring techniques has become available—see Di Francescomarino, Ghidini, Maggi and Milani [13] for a survey—mostly based on Machine Learning techniques. The main dimension that is typically used to classify Predictive Process Monitoring techniques is the type of prediction, which can belong to one of three macro-categories: numeric predictions (e.g. time or cost predictions); categorical predictions (e.g. risk predictions or specific categorical outcome predictions such as the fulfilment of a certain property); next activities predictions (e.g. the sequence of the future activities, possibly with their attributes).

Frameworks such as Nirdizati Rizzi et al. [34], Jorbina et al. [21] collect a set of Machine Learning techniques that can be instantiated and used for providing different types of predictions to the user. In detail, these frameworks take as input a set of past executions and use them to train predictive models, which can then be stored to be used at runtime to continuously supply predictions to the user. Moreover, the computed predictions can be used to compute accuracy scores for specific configurations. Within these frameworks, we can identify two main modules: one for the case encoding, and one for the supervised learning. Each of them can be instantiated with different techniques. Examples of case encodings are the index-based encodings presented in Leontjeva et al. [22]. Supervised learning techniques instead vary and can also depend on the type of prediction a user is interested in, ranging from Decision Tree and Random Forest, to regression methods and Recurrent Neural Networks.

2.2 Random forest

Random Forest [20] is an ensemble learning method used for classification and regression. The goal is to create a model composed of a multitude of Decision Trees Quinlan [33]. In a Decision Tree, each interior node corresponds to one of the input variables and each leaf node to a possible classification or decision. Each path from the root to a leaf represents a different configuration of input variables. A tree can be “learned” by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The result is a tree in which each selected variable contributes to the labelling of a given example. When an example needs to be labelled, it is run through all the Decision Trees and the output of each tree is counted: the most frequent label is the output of the model.
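The majority-vote mechanism described above can be sketched with scikit-learn's Random Forest implementation on a toy dataset; the dataset and parameter values are illustrative, not those used in the paper.

```python
# Minimal sketch of Random Forest majority voting on a toy binary task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample of the train set;
# predict() returns the majority label across all trees.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```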

In this work, we use non-incremental and incremental versions of Random Forest. In a nutshell, the non-incremental version builds a predictive model once and for all using a specific set of training data in a single training phase. Instead, in addition to the step of building a predictive model during the training phase, the incremental learning versions are able to update such a model whenever needed through an update function. The specific implementation used in this work incrementally updates the starting model by adding new decision trees as soon as new data are available.
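The idea of an update function that extends the starting model with new decision trees can be sketched with scikit-learn's `warm_start` flag; this is a stand-in under our own assumptions, not the specific incremental implementation used in the paper.

```python
# Sketch: grow an existing forest with trees fitted on newly arrived data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X0, y0 = make_classification(n_samples=400, n_features=10, random_state=0)
X1, y1 = make_classification(n_samples=400, n_features=10, random_state=1)

# Initial model built once on the first batch of training data.
rf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
rf.fit(X0, y0)

# Update function: keep the 100 existing trees untouched and add 50 new
# trees fitted on the new data only.
rf.n_estimators += 50
rf.fit(X1, y1)
print(len(rf.estimators_))  # 150
```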

As already mentioned in the Introduction, Random Forest was chosen as it was experimentally proven to be one of the best performing techniques for the outcome prediction problem in Predictive Process Monitoring. The interested reader is referred to Teinemaa et al. [40] for a rigorous review.

2.3 Hyperparameter optimisation

Machine Learning techniques are known to use model parameters and hyperparameters. Model parameters are automatically learned during the training phase so as to fit the data. Instead, hyperparameters are set outside the training procedure and used for controlling how flexible the model is in fitting the data. While the values of hyperparameters can influence the performance of the predictive models in a relevant manner, their optimal values highly depend on the specific dataset under examination, thus making their setting rather burdensome. To support and automatise this onerous but important task, several hyperparameter optimisation techniques have been developed in the literature Bergstra and Bengio [4], Bergstra et al. [3], also for Predictive Process Monitoring models—see, for example, Teinemaa et al. [40], Di Francescomarino, Dumas, Federici, Ghidini, Maggi, Rizzi and Simonetto [11], Teinemaa et al. [41]. While in Teinemaa et al. [40], Di Francescomarino, Dumas, Federici, Ghidini, Maggi, Rizzi and Simonetto [11], the Tree-structured Parzen Estimator (TPE) has been used for outcome-oriented Predictive Process Monitoring solutions, in Teinemaa et al. [41], Random Search has been used for hyperparameter optimisation in a comparative analysis focusing on how stable outcome-oriented predictions are along time. Although hyperparameter optimisation techniques for Predictive Process Monitoring have shown their ability to identify accurate and reliable framework configurations, they are also computationally expensive, and we have hence decided to evaluate the role of hyperparameter tuning in our update strategies.
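As an illustration of the Random Search flavour mentioned above, the following sketch tunes a Random Forest with scikit-learn's `RandomizedSearchCV`; the search space and budget are assumptions for demonstration, not the configuration used in the cited works.

```python
# Sketch of Random Search hyperparameter optimisation for a Random Forest.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative search space: ranges are assumptions, not the paper's values.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
}

# Sample 10 random configurations and score each by cross-validated AUC.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), param_dist,
    n_iter=10, cv=3, scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_)
```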

2.4 Concept drift

In Machine Learning, Concept Drift refers to a change over time, in unforeseen ways, of the statistical properties of the target variable a learned model is trying to predict. This drift is often due to changes in the target data w.r.t. the data used in the training phase. These changes are problematic as they cause the predictions to become less accurate as time passes. Depending on the type of change (e.g. gradual, recurring, or abrupt), different types of techniques have been proposed in the literature to detect and handle them Gama et al. [18], Widmer and Kubat [17], Schlimmer and Granger [37, 49].
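The accuracy degradation caused by drift can be made concrete with a toy sketch (our own illustration): the relationship between features and label changes abruptly, and a model trained before the change performs close to chance afterwards.

```python
# Toy illustration of an abrupt Concept Drift and its effect on accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Before the drift, the label depends on feature 0; after the drift,
# it depends on feature 1 instead.
X_old = rng.normal(size=(1000, 2))
y_old = (X_old[:, 0] > 0).astype(int)
X_new = rng.normal(size=(1000, 2))
y_new = (X_new[:, 1] > 0).astype(int)

clf = RandomForestClassifier(random_state=0).fit(X_old, y_old)
print(clf.score(X_old, y_old))  # high on the old concept
print(clf.score(X_new, y_new))  # near chance level on the drifted concept
```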

Business processes are subject to change due to, for example, changes in their normative or organisational context, and so are their executions. Processes and executions can hence be subject to Concept Drifts, which may involve several process dimensions such as control-flow dependencies and data handling. For instance, an organisational change might affect how a certain procedure is managed by employees in a Public Administration scenario (e.g. a further approval by a new manager is required for closing the procedure), or a normative change might affect either the way in which patients are managed in an emergency department (e.g. patients have to be tested for COVID-19 before they can be visited) or the age of the customers who are allowed to submit a loan request procedure in a bank process. The Concept Drift phenomenon has originated a few works that focus on drift detection and localisation in procedural and in declarative business processes—see Bose et al. [6], Carmona and Gavaldà [10] and Maggi et al. [24], respectively—as well as on attempts to deal with it in the context of Predictive Process Monitoring Maisenbacher and Weidlich [26], Pauwels and Calders [31].

3 Two descriptive scenarios

We aim at assessing the benefits of incremental learning techniques in scenarios characterised by process variability and/or explicit Concept Drift phenomena. In this section, we introduce two typical scenarios, which refer to some of the datasets used in the evaluation described in Sect. 5.

Scenario 1

(Dealing with Process Variability) Information systems are widely used in healthcare and several scenarios of predictive analytics can be provided in this domain. Indeed, the exploitation of predictive techniques in healthcare is described as one of the promising big data trends in this domain Bughin et al. [8], Munoz-Gama et al. [30].

Despite some successful evaluations of Predictive Process Monitoring techniques using healthcare data Maggi et al. [25], predictive monitoring needs to consider a well-known feature of healthcare processes, that is, their variability Rojas et al. [36], i.e. the variety of different alternative paths characterising the executions of a process. Whether they refer to non-elective care (e.g. medical emergencies), or elective care (e.g. scheduled standard, routine and non-routine procedures), healthcare processes often exhibit characteristics of high variability and instability. For instance, the treatment processes related to different patients can be quite different due to allergies, comorbidities, or other specific characteristics of a patient. In fact, when attempting to discover process models from data related to these processes, the resulting models are often spaghetti-like, i.e. cumbersome models in which it is difficult to distil a stable procedure. Moreover, small changes in the organisational structure (e.g. new personnel in charge of a task, unforeseen seasonal variations due to holidays or diseases) may originate subtle variability that is not detectable in terms of stable Concept Drifts, but is nonetheless relevant in terms of predictive data analytics.

In such a complex environment, an important challenge concerns the emergence of new behaviours: regardless of how much data we consider, an environment highly dependent on the human factor is likely to exhibit new variants that may not be captured when stopping the training at a specific time. Similarly, some variants may become obsolete, thus making the forgetting of data equally important.

Thus, a way for adapting the predictions to these changes and an investigation of which update strategies are especially suited to highly variable and realistic process executions would be of great impact.

Scenario 2

(Dealing with explicit Concept Drift) The presence of Concept Drift in business processes, due to, for example, changes in the organisational structures, legal regulations, and technological infrastructures, has been acknowledged in the Process Mining manifesto van der Aalst et al. [44] and literature Maggi et al. [24], together with some preliminary studies on its relation with Predictive Process Monitoring Gepperth and Hammer [19].

Such a sudden and abrupt variation in the data provides a clear challenge to the process owners: they must be ready to cope with a Predictive Process Monitoring model with degraded performance on the drifted data, or to perform an update that allows the Predictive Process Monitoring technique to support both the non-drifted and the drifted trends of the data (as ongoing executions may still concern the non-drifted cases).

Similarly to the above, an investigation of which update strategies are especially suited to realistic process executions that exhibit an explicit Concept Drift would provide a concrete support for the maintenance of Predictive Process Monitoring models.

4 Update strategies

Predictive Process Monitoring provides a set of techniques that use the availability of execution traces (or cases) extracted from information systems in order to predict how ongoing (uncompleted) executions will unfold up to their completion.

Fig. 1 The general idea

Thus, assume that an organisation was able to exploit a set of process executions \(\mathcal{TR}_0\), collected within a period of time that we will call “Period A”, to obtain a predictive model \({{\mathcal {M}}}_0\), and that it starts to exploit \({{\mathcal {M}}}_0\) to perform predictions upon new incomplete traces (see Fig. 1 for an illustration of this scenario). As soon as the new incomplete traces terminate, they become new data, potentially available to be exploited for building a new model \({{\mathcal {M}}}'\), which in turn can be used to provide predictions on new incomplete traces. The need for exploiting new execution traces and building such an updated \({{\mathcal {M}}}'\) could be due to several reasons, among which is the evolution of the process at hand (and thus of its executions), to which the system needs to adapt in a timely manner.

In this paper, we provide four different strategies for computing \({{\mathcal {M}}}'\), by exploiting a new set of process executions \(\mathcal{TR}_1\), collected in a period of time “Period B” subsequent to “Period A”, along with the original set \(\mathcal{TR}_0\). The strategies are summarised in Fig. 2. The figure represents, on the left-hand side, model \({{\mathcal {M}}}_0\) and, on the right-hand side, the operations performed starting from \({{\mathcal {M}}}_0\) to obtain model \({{\mathcal {M}}}'\) according to the four update strategies.

Fig. 2 Four strategies to produce \({{\mathcal {M}}}'\)

The first strategy, \({{\mathcal {S}}}_0\), is a do nothing strategy. This strategy simply disregards that new traces are produced in “Period B” and continues to use \({{\mathcal {M}}}_0\) as a predictive model. This strategy may prove useful when the process remains stable, and it also acts as a baseline against which to compare all the other strategies.

The second strategy, \({{\mathcal {S}}}_1\), exploits the new traces in \(\mathcal{TR}_1\) produced in “Period B” for training but not for the optimisation of the hyperparameters. In this re-train with no hyperopt strategy, \({{\mathcal {M}}}_0\) is replaced by a new predictive model \({{\mathcal {M}}}_1\) built from scratch by using \(\mathcal{TR}_0\cup \mathcal{TR}_1\) as the train set. No optimisation of the hyperparameters is made in the construction of \({{\mathcal {M}}}_1\); the hyperparameter values computed for \({{\mathcal {M}}}_0\) are used instead. This strategy aims at exploiting the new data in \(\mathcal{TR}_1\) while avoiding the costly steps needed for hyperparameter optimisation.

The third strategy, \({{\mathcal {S}}}_2\), completely replaces the old model \({{\mathcal {M}}}_0\) with a new predictive model \({{\mathcal {M}}}_2\) built from scratch using both \(\mathcal{TR}_0\) and \(\mathcal{TR}_1\). This strategy aims at performing a full re-train, thus exploiting to the utmost all the available data. Also, the comparison between \({{\mathcal {S}}}_1\) and \({{\mathcal {S}}}_2\) enables us to investigate the specific role of hyperparameter tuning in the predictions.

The final strategy, \({{\mathcal {S}}}_3\), exploits the new traces in \(\mathcal{TR}_1\) produced in “Period B” for training the predictive model in an incremental manner. Differently from \({{\mathcal {S}}}_1\), the data of \(\mathcal{TR}_1\) are added as training data in a continuous manner by means of incremental Machine Learning algorithms, so as to extend the existing knowledge of model \({{\mathcal {M}}}_0\). The incremental update strategy is chosen as an example of a dynamic technique, which can be applied when training data become available gradually over time or when the training data are too large to store or process all at once. Similarly to \({{\mathcal {S}}}_1\), the values of the hyperparameters do not change when adding new training data.
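The four strategies can be summarised in a short sketch using scikit-learn Random Forests; the hyperparameter search of \({{\mathcal {S}}}_2\) is stubbed out with a fixed choice, and `warm_start` is used as a stand-in for an incremental learner, both being our own assumptions for illustration.

```python
# Sketch of the four update strategies S0-S3 on toy data.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X0, y0 = make_classification(n_samples=300, random_state=0)  # "Period A" data
X1, y1 = make_classification(n_samples=300, random_state=1)  # "Period B" data
X01, y01 = np.vstack([X0, X1]), np.concatenate([y0, y1])

# M0: initial model trained on the Period A data only.
M0 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X0, y0)

# S0 (do nothing): keep using M0 as-is.

# S1 (re-train with no hyperopt): same hyperparameters, wider train set.
M1 = clone(M0).fit(X01, y01)

# S2 (full re-train): hyperparameters re-optimised, then fit on all data;
# the search itself is replaced here by a fixed illustrative choice.
M2 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X01, y01)

# S3 (incremental update): extend M0's ensemble with trees fitted on the
# Period B data only.
M3 = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
M3.fit(X0, y0)
M3.n_estimators += 50
M3.fit(X1, y1)
print(len(M1.estimators_), len(M2.estimators_), len(M3.estimators_))
```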

5 Empirical evaluation

The evaluation reported in this paper aims at understanding the characteristics of the four different update strategies introduced in the previous section in terms of accuracy and time. We aim at evaluating these strategies with two types of real-life event log data: event logs without an explicit Concept Drift and event logs with an explicit Concept Drift. As such, we have selected four real-life datasets, three for the first scenario and one for the second. To consolidate the evaluation on the Concept Drift scenario, we also expanded the evaluation to include synthetic event logs with explicit Concept Drifts introduced in Maaradji et al. [23].

In this section, we introduce the research questions, the datasets, the metrics used to evaluate the effectiveness of the four update strategies described in Sect. 4, the procedure, and the tool settings. The results are instead reported in Sect. 6.

5.1 Research questions

Our evaluation is guided by the following research questions:


RQ1: How do the four update strategies do nothing, re-train with no hyperopt, full re-train, and incremental update compare to one another in terms of accuracy?

RQ2: How do the four update strategies do nothing, re-train with no hyperopt, full re-train, and incremental update compare to one another in terms of time performance?

RQ1 aims at evaluating the quality of the predictions returned by the four update strategies, while RQ2 investigates the time required to build the predictive models in the four scenarios and, in particular, aims at assessing the difference between the complete periodic rediscovery (full re-train) and the other two update strategies re-train with no hyperopt and incremental update.

5.2 Datasets

The four update strategies are evaluated using six datasets. Four of them are real-life event logs provided for a Business Process Intelligence (BPI) Challenge in different years. Three of them do not contain an explicit Concept Drift: the BPI Challenges 2011 3TU Data Center [1], 2012 van Dongen [45], and 2015 van Dongen [46]. They are examples of event logs exhibiting Process Variability as described in the first scenario in Sect. 3. The remaining datasets are instead examples of logs with explicit Concept Drift as described in the second scenario in Sect. 3. Our aim was to evaluate the four strategies on real-life event logs, but, to the best of our knowledge, the only publicly available event log which contains an explicit concept drift is the BPI Challenge 2018 van Dongen and Borchert [47]. Therefore, we decided to augment the evaluation by also considering the synthetic event logs introduced in Maaradji et al. [23]. Here, we report the main characteristics of each dataset, while the outcomes to be predicted for each dataset are contained in Table 1.

Table 1 The outcome formulae

The first dataset, originally provided for the BPI Challenge 2011, contains the treatment history of patients diagnosed with cancer in a Dutch academic hospital. The log contains 1,140 cases and 149,730 events referring to 623 different activities. Each case in this log records the events related to a particular patient. For instance, the first labelling (\(\phi _{11}\)), for this dataset, is such that the positive traces are all the ones for which if activity \(\mathsf {CEA - tumour\ marker\ using\ meia}\) occurs, then it is followed by an occurrence of activity \(\mathsf {squamous\ cell\ carcinoma\ using\ eia}\).

The second dataset, originally provided for the BPI Challenge 2012, contains the execution history of a loan application process in a Dutch financial institution. It is composed of 4,685 cases and 186,693 events referring to 36 different activities. Each case in this log records the events related to a particular loan application. For instance, the first labelling (\(\phi _{21}\)), for this dataset, is such that the positive traces are all the ones in which event \(\mathsf {Accept\ Loan\ Application}\) occurs.

The third dataset, originally provided for the BPI Challenge 2015, concerns the application process for construction permits in five Dutch municipalities. We consider the log pertaining to the first municipality, which is composed of 1,199 cases and 52,217 events referring to 398 different activities.

The fourth dataset, originally provided for the BPI Challenge 2018, concerns an application process for EU direct payments for German farmers from the European Agricultural Guarantee Fund. Depending on the document types, different branches of the workflow are performed. The event log used in this evaluation is composed of 29,302 cases and 1,661,656 events referring to 40 different activities.

The fifth and sixth datasets, hereafter called DriftRIO1 and DriftRIO2, are synthetic event logs that use a “textbook” example of a business process for assessing loan applications Weber et al. [48]. The DriftRIO event logs, introduced in Maaradji et al. [23], have been built by alternating traces executing the original “base” model and traces modified so as to exhibit complex Concept Drifts obtained by composing simple log changes, namely, re-sequentialisation of process model activities (R), insertion of a new activity (I), and optionalisation of one activity (O). DriftRIO1 is composed of 3,994 cases and 47,776 events related to 19 activities. DriftRIO2 is composed, instead, of 2,000 cases and 21,279 events referring to 19 different activities. The outcomes to be predicted for each dataset are expressed by using LTL formulae for the four BPI Challenges Di Francescomarino et al. [12], Maggi et al. [25] and by using the case duration property of being a fast case for DriftRIO (see Table 1).
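Outcome labellings of the kind used for \(\phi _{11}\) (“if activity A occurs, it is eventually followed by activity B”) can be sketched as a simple check over completed traces; the activity names below are illustrative placeholders, not the paper's.

```python
# Sketch of labelling completed traces with a response-style LTL outcome.
def satisfies_response(trace, a, b):
    """True iff every occurrence of activity a is followed by a later b."""
    for i, activity in enumerate(trace):
        if activity == a and b not in trace[i + 1:]:
            return False
    return True

# A positive and a negative trace for the outcome "A is followed by B".
trace_pos = ["register", "A", "check", "B", "close"]
trace_neg = ["register", "A", "close"]
print(satisfies_response(trace_pos, "A", "B"))  # True
print(satisfies_response(trace_neg, "A", "B"))  # False
```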

5.3 Metrics

In order to answer the research questions, we use two metrics, one for accuracy and one for time. The one for accuracy is used to evaluate RQ1, whereas the time measure is used to evaluate RQ2.

The accuracy metric.

In this work, we exploit a typical evaluation metric for calculating the performance of a classification model, that is AUC-ROC (hereafter only AUC). The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) using various threshold settings. In formulae, \(\textit{TPR} = \frac{T_P}{T_P + F_N}\), and \(\textit{FPR} = \frac{F_P}{F_P + T_N}\), where \(T_P\), \(T_N\), \(F_P\), and \(F_N\) are the true-positives (positive outcomes correctly predicted), the true-negatives (negative outcomes correctly predicted), the false-positives (negative outcomes predicted as positive), and the false-negatives (positive outcomes predicted as negative), respectively. In our case, the AUC is the area under the ROC curve and, when using normalised units, it can be intuitively interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. As usual, \(T_P\), \(T_N\), \(F_P\), and \(F_N\) are obtained by comparing the predictions produced by the predictive models against a gold standard that indicates the correct labelling of each case. In our experiments, we have built the gold standard by evaluating the outcome of each completed case in the test set.
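The ranking interpretation of the AUC can be illustrated on a handful of labelled cases; the labels and scores below are made up for the example.

```python
# Toy AUC computation: of the 6 (positive, negative) pairs, 5 are ranked
# correctly, so the AUC is 5/6.
from sklearn.metrics import roc_auc_score

# Gold-standard outcomes of five completed test cases, and the
# positive-class scores of a hypothetical predictive model.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.4, 0.7, 0.3, 0.2]

print(roc_auc_score(y_true, y_score))  # 0.8333...
```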

The time metric.

We measure the time spent to build the predictive model in terms of execution time. The execution time indicates the time required to create and update (in the case of incremental algorithms) the predictive models. We remark here that the execution time does not include the time spent to load and pre-process the data, but only the bare processing time.

5.4 Experimental procedure

We adopt the classical Machine Learning Train/Validate/Test experimental procedure and configure it for the four different strategies we want to compare. The procedure consists of the following main steps: (1) dataset preparation; (2) classifier training and validation; (3) classifier testing and metrics collection.

In the dataset preparation phase, the execution traces are first ordered according to their starting date-time so as to be able to meaningfully identify those referring to “Period A” (that is, \(\mathcal{TR}_0\)), those referring to the subsequent “Period B” (that is, \(\mathcal{TR}_1\)), and those referring to the test set \(\mathcal{TE}\), which here represents the most recent set of traces where the actual predictions are made. The predictions are tested using the four different models \({{\mathcal {M}}}_0\)-\({{\mathcal {M}}}_3\) corresponding to the different strategies \({{\mathcal {S}}}_0\)-\({{\mathcal {S}}}_3\) described in Sect. 4.

Fig. 3 The experimental settings used to build the predictive models

The actual splits between train, hyperparameter validation, and test sets used for evaluating the four strategies are illustrated in Fig. 3. Following a common practice in Machine Learning, we have decided to use 80% of the data for training/validation and 20% for testing.

Table 2 Dataset entropy

To test whether the performance of the different strategies was connected to different sizes of \(\mathcal{TR}_0\) and \(\mathcal{TR}_1\), we devised two experimental settings to supply the train data to the learning algorithms. The first experimental setting divides the train set unequally into \(\mathcal{TR}_0\) (10%) and \(\mathcal{TR}_1\) (70%); the second experimental setting divides the train set equally into \(\mathcal{TR}_0\) (40%) and \(\mathcal{TR}_1\) (40%). These two settings do not concern the construction of \({{\mathcal {M}}}_2\), which is always built using the entire set of available train data. The hyperparameter validation set is extracted from the train set through a randomised sampling procedure. It amounts to 20% of the train set, which corresponds to 2% of the dataset for the train set at 10%, 8% for the train set at 40%, and 16% for the train set at 80% (see Fig. 3).
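The chronological split described above can be sketched as follows; the trace representation and helper name are our own, and the fractions come from the two settings in the text (10/70/20 and 40/40/20).

```python
# Sketch of the chronological train/test split used in the two settings.
def chronological_split(traces, tr0_frac, tr1_frac):
    """Order traces by start time, then cut them into TR_0, TR_1, test."""
    traces = sorted(traces, key=lambda t: t["start"])
    n = len(traces)
    a = round(n * tr0_frac)
    b = round(n * (tr0_frac + tr1_frac))
    return traces[:a], traces[a:b], traces[b:]

traces = [{"id": i, "start": i} for i in range(100)]
tr0, tr1, te = chronological_split(traces, 0.10, 0.70)  # first setting
print(len(tr0), len(tr1), len(te))  # 10 70 20
```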

The variability of the log behaviour of each dataset can be measured through the trace entropy and the Global block entropy metrics  Back et al. [2] (Table 2). While the first metric mainly focuses on the number of variants in an event log, the latter also takes into account the internal structure of the traces. Note that the BPI Challenge 2018 and DriftRIO datasets contain a Concept Drift in \(\mathcal{TR}_1\) in both settings (it affects the last 30% of data in \(\mathcal{TR}\), when ordered according to their starting date-time) and, for these datasets, the entropy value of the split 0%–40%–80%–100% is higher than that of the split 40%–80%–80%–100% for both entropy metrics. Comparing the behaviour variability of the splits 0%–10%–80%–100% and 10%–80%–80%–100% is instead less informative, since the two sets of traces have different sizes and, for all datasets, the smaller set obviously has a lower entropy than the larger one. Finally, the entropy on the complete datasets (column 0%–100%) is useful to understand what type of input is provided when evaluating strategy \({{\mathcal {S}}}_2\), which uses the full dataset to train the predictive model.
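As an illustration of the first metric, the trace entropy of a log can be computed as the Shannon entropy of its variant distribution; this is a minimal sketch, and the exact formulation of Back et al. [2] may differ in normalisation:

```python
import math
from collections import Counter

def trace_entropy(log):
    """Shannon entropy of the variant distribution of an event log.

    `log` is a list of traces, each trace a sequence of activity names;
    traces with the same activity sequence belong to the same variant.
    """
    variants = Counter(tuple(trace) for trace in log)
    total = sum(variants.values())
    return -sum((c / total) * math.log2(c / total)
                for c in variants.values())

# A log with a single variant has entropy 0; the entropy grows as the
# traces spread over more (and more evenly distributed) variants.
```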

Table 3 Dataset label distribution

Additional information about the input logs used in our experiments is reported in Table 3, which shows, for each dataset and for each setting, the distribution of the labels on both the train and the test sets, thus providing an idea of how balanced the datasets are.

Once the data are prepared, training and validation start by extracting execution prefixes and by encoding them using the complex index encoding introduced by Leontjeva et al. [22].Footnote 7 In the complex index encoding, the data related to a process execution are divided into static and dynamic information. Static information is the same for all the events in the sequence (e.g. the age of a patient), while dynamic information changes for different events occurring in the process execution (e.g. the name of the activity, the pressure value of a patient associated with an event aimed at measuring the patient’s pressure). We encode static information as attribute-value pairs, while, for dynamic information, the order in which events (and the related attributes) occur in the sequence is also taken into account.
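A minimal sketch of the complex index encoding, assuming each trace is represented as a dictionary of static attributes plus an ordered list of events (the attribute names below are illustrative):

```python
def complex_index_encode(trace, prefix_len):
    """Encode a prefix of a trace with the complex index encoding.

    Static attributes appear once as attribute-value pairs; each dynamic
    attribute of the i-th event is indexed by its position in the prefix,
    so the order of events is preserved in the feature names.
    """
    features = dict(trace["static"])  # e.g. {"age": 54}
    for i, event in enumerate(trace["events"][:prefix_len], start=1):
        for attr, value in event.items():
            features[f"{attr}_{i}"] = value  # e.g. "activity_2"
    return features

# Illustrative patient trace with one static and two dynamic attributes:
trace = {
    "static": {"age": 54},
    "events": [{"activity": "admission", "pressure": 120},
               {"activity": "measure pressure", "pressure": 135}],
}
# complex_index_encode(trace, 2) yields a flat feature dictionary with
# keys age, activity_1, pressure_1, activity_2, pressure_2.
```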

The hyperparameter optimisation function uses the Tree of Parzen Estimators (TPE)Footnote 8 to retrieve the hyperparameter configuration (see Sect. 2.3), with a maximum of 1000 iterations, and the AUC as objective evaluation measure to maximise. Models \({{\mathcal {M}}}_0\)-\({{\mathcal {M}}}_3\) are produced at the end of this phase. The specific hyperparameters used by the models are reported in Table 4. The table shows that indeed differences exist in the optimised hyperparameters computed starting from different train sets, i.e. 10% of the datasets, 40% of the datasets (both used for \({{\mathcal {M}}}_0\), \({{\mathcal {M}}}_1\), and \({{\mathcal {M}}}_3\)) and 80% of the datasets (used for \({{\mathcal {M}}}_2\)).

Table 4 Model hyperparameters

After the training and validation procedures are completed, we test the resulting model with the test set, and we collect the scored metrics (see Sect. 5.3) and store them in a database. Concerning time, we have decided to measure here the time spent to build the initial model \({{\mathcal {M}}}_0\), plus the time needed to update it according to the four different strategies. Thus, for \({{\mathcal {S}}}_0\), this will coincide with the time spent for building \({{\mathcal {M}}}_0\) (as, in this case, no further action is taken for updating the model); for \({{\mathcal {S}}}_1\), we compute the time spent for building \({{\mathcal {M}}}_0\) plus the time spent for the re-training over \(\mathcal{TR}\) (no hyperopt); for \({{\mathcal {S}}}_2\), we compute the time spent for building \({{\mathcal {M}}}_0\) plus the time needed for the re-training over \(\mathcal{TR}\) (with hyperopt); and, finally, for \({{\mathcal {S}}}_3\), we compute the time spent for building \({{\mathcal {M}}}_0\) plus the time needed for updating it with the data in \(\mathcal{TR}_1\).
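The time accounting just described can be sketched as follows; the argument names are illustrative placeholders for the measured durations:

```python
def strategy_times(t_build_m0, t_retrain, t_retrain_hyperopt, t_incremental):
    """Total time attributed to each update strategy.

    t_build_m0         : time to build M0 on TR0 (with hyperopt)
    t_retrain          : time to re-train on TR without hyperopt (S1)
    t_retrain_hyperopt : time to re-train on TR with hyperopt (S2)
    t_incremental      : time to update M0 with TR1 (S3)
    Every strategy is charged the cost of building M0 first.
    """
    return {
        "S0 (do nothing)":            t_build_m0,
        "S1 (re-train, no hyperopt)": t_build_m0 + t_retrain,
        "S2 (full re-train)":         t_build_m0 + t_retrain_hyperopt,
        "S3 (incremental update)":    t_build_m0 + t_incremental,
    }
```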

5.5 Experimental settings

The tool used for the experimentation is Nirdizati  Rizzi et al. [34]. The experimental evaluation was performed on a Dell Precision 7820 workstation with the following configuration: (i) 314GB DDR4 2666MHz RDIMM ECC of RAM; (ii) a double Intel Xeon Gold 6136 3.0GHz, 3.7GHz Turbo, 12C, 10.4GT/s 3UPI, 24.75MB Cache, HT (150W) CPU; and (iii) one 2.5" 256GB SATA Class 20 Solid State Drive. We assumed only 1 CPU was available for training the predictive models, i.e. we did not parallelise the training of the base learners of the Random Forests. We also ensured that there was no race condition over the disk and no starvation over the RAM usage by actively monitoring the resources through Netdata  Tsaousis [42]. Each experiment was allowed to run for at most 100 hours. No other experiment or interaction with the workstation was performed other than the monitoring of the used resources.

6 Results

In this section, we present the results of our experiments, reported in Tables 5, 6, 7, and discuss how they allow us to answer the two research questions introduced before. We also provide a discussion about the four update strategies with a cost-effectiveness analysis and an analysis of the validity threats. To ensure reproducibility, the datasets used, the configurations, and the detailed results are available at http://bit.ly/how_do_I_update_my_model.

6.1 Discussion

Answering RQ1.  The AUC of all models, for the two experimental settings 10%–70% and 40%–40%, for all datasets, is reported in Tables 5a and 5b. The best result for each dataset and labelling is emphasised in italicFootnote 9. Since, in many cases, different models have very close accuracy, we have emphasised in bold the results that differ from the best ones by less than 0.01. Tables 6a and 6b report, for the two experimental settings 10%–70% and 40%–40%, the percentage of gain/loss of \({{\mathcal {M}}}_1\), \({{\mathcal {M}}}_2\), and \({{\mathcal {M}}}_3\) w.r.t. \({{\mathcal {M}}}_0\). The percentage of gain or lossFootnote 10 is reported together with a histogram of gains and losses. In order to ease the comparison, \({{\mathcal {M}}}_2\) is reported in both tables.

Table 5 The accuracy results
Table 6 Accuracy improvement against \({{\mathcal {M}}}_0\)

By looking at the tables, we can immediately see that \({{\mathcal {M}}}_2\) and \({{\mathcal {M}}}_3\) are the clear best performers, especially in the first setting 10%–70%, and that \({{\mathcal {M}}}_0\) is almost consistently the worst performer, often with a significant difference in terms of accuracy. This highlights that the need to update the predictive models is a real issue in typical Predictive Process Monitoring settings. The only exception to this finding is provided by the results obtained for the BPIC15 dataset, where the performance of \({{\mathcal {M}}}_0\) is comparable to that of the other models for almost all the outcome formulae. This is due to the fact that, for this dataset, there is a high homogeneity of the process behaviour over time, as confirmed by the entropy values, provided in Table 2, which remain quite stable across the entire log. Moreover, we can observe that BPIC12 with labelling \(\phi _{23}\) has overall the lowest accuracy for all four update strategies. This is possibly due to the high label imbalance characterising this dataset (see Table 3).

The performance of all the evaluated strategies is overall higher in the 40%–40% setting, and this is likely related to the higher amount of data used for hyperparameter optimisation. Nonetheless, the lower performance of \({{\mathcal {M}}}_0\) also in the 40%–40% setting indicates the need to update the models with new data at regular intervals. Concerning the possible differences between the results obtained using datasets with and without an explicit Concept Drift, our experiments did not find any striking variation among the different strategies, thus consolidating the finding that devising update strategies is important in general, even in scenarios where the process changes over time are less pronounced. Nonetheless, if we look at Tables 6a and 6b, it is easy to see that the experiments with an explicit Concept Drift are the ones with the greatest difference between \({{\mathcal {M}}}_0\) and all the other models, thus confirming that an explicit Concept Drift can have a significant negative influence on the performance of a Predictive Process Monitoring model, if the model is not updated. This is especially true for BPIC18, in which the Concept Drift strongly affects the entropy measure (Table 2), so that the entropy value of the split 0%–40%–80%–100% is higher than that of the split 40%–80%–80%–100% for both entropy metrics.

Tables 6a and 6b also show another interesting aspect of our evaluation: while \({{\mathcal {M}}}_2\) and \({{\mathcal {M}}}_3\) tend to always gain against \({{\mathcal {M}}}_0\) (or, in very few cases, to remain stable), the same cannot be said for \({{\mathcal {M}}}_1\). In fact, if we look at BPIC12 with labelling \(\phi _{23}\) and BPIC15 with labelling \(\phi _{32}\) in Table 6a, and, particularly, at DriftRIO with labelling \(\phi _{52}\) in Table 6b, we can see a decrease in accuracy. By carrying out a deeper analysis of the chosen hyperparameters, we found that the lower accuracy of \({{\mathcal {M}}}_1\) is due to the inadequacy of the hyperparameters derived from \(\mathcal{TR}_0\) for the new data used to build \({{\mathcal {M}}}_1\). While this aspect may need further investigation, we can conclude that, while re-train with no hyperopt is usually a viable solution, it is nonetheless riskier than full re-train or incremental update.

The general findings and trends derived from the results obtained using Random Forest as classifier are further confirmed, with few exceptions, by the results obtained by using Perceptron  Minsky and Papert [29] as predictive model in the analysis of the four update strategies. The perceptron results are reported in Appendix A.

To sum up, concerning RQ1, our evaluation shows that full re-train and incremental update are the best performing update strategies in terms of accuracy, followed by re-train with no hyperopt. With the exception of BPIC15 in the 10%–70% setting, do nothing is, often by far, the worst strategy, indicating the importance of updating the predictive models with new data, when it becomes available.

Answering RQ2.

The time spent for creating the four models, for the two experimental settings 10%–70% and 40%–40%, for all datasets, is reported in Tables 7a and 7b using the “hh:mm:ss” format. The best results for each dataset and labelling (that is, the lowest execution times) are emphasised in italic, while execution times that differ by less than 60 seconds from the best ones are indicated in bold. The last two rows of each table report the average time (and standard deviation) necessary to train a model for a given strategy. In order to ease the comparison, \({{\mathcal {M}}}_2\) is reported in both tables. \({{\mathcal {M}}}_0\) and \({{\mathcal {M}}}_2\) are self-contained models that are “built from scratch” and, therefore, the time reported in the tables for these models is the time spent to train them. In contrast, \({{\mathcal {M}}}_1\) and \({{\mathcal {M}}}_3\) are built in a two-step fashion that includes a training phase but reuses the hyperparameters computed for \({{\mathcal {M}}}_0\). Therefore, their construction time is measured by summing the time spent for the training phase and the time spent for building \({{\mathcal {M}}}_0\).

Table 7 The time results

By looking at Tables 7a and 7b, we can immediately see that, among all the evaluated strategies, \({{\mathcal {M}}}_0\) is the clear best performer, especially for the 10%–70% setting, and that \({{\mathcal {M}}}_2\) is almost consistently the worst performer, often with a significant difference in terms of time spent to build the model (with few exceptions in the 40%–40% case that we discuss below). As a second general observation, we note that \({{\mathcal {M}}}_1\) and \({{\mathcal {M}}}_3\) have almost the same construction time as \({{\mathcal {M}}}_0\). This fact is not particularly surprising, as the hyperparameter optimisation routine is often the most expensive step in the construction of this type of predictive model. Therefore, the two strategies \({{\mathcal {S}}}_1\) and \({{\mathcal {S}}}_3\) that underlie the construction of these models are very inexpensive along the time dimension, especially when \({{\mathcal {M}}}_0\) is already available.

If we compare the two experimental settings 10%–70% and 40%–40%, we can observe that, while \({{\mathcal {M}}}_0\) is almost always the best performer in both settings, the difference between the construction time of \({{\mathcal {M}}}_0\) (and thus of \({{\mathcal {M}}}_1\) and \({{\mathcal {M}}}_3\)) and that of \({{\mathcal {M}}}_2\) is significantly higher for the 10%–70% setting. While the investigation of when it is convenient to perform a full re-train is out of the scope of the paper and is left to future work, this finding emphasises the fact that the cost of a full re-train may increase significantly if the update of the predictive model is overly delayed and the amount of new data greatly increases. Interestingly enough, the 40%–40% setting presents two cases in which \({{\mathcal {M}}}_2\) is the fastest model to build. The case of BPIC11 with labelling \(\phi _{12}\) is likely due to an “unfortunate” guess in the hyperparameter optimisation step for \({{\mathcal {M}}}_0\), which makes the training time explode.Footnote 11 The case of BPIC15 with labelling \(\phi _{31}\), instead, represents a situation in which \({{\mathcal {M}}}_0\) and \({{\mathcal {M}}}_2\) take almost the same time to build. Concerning possible differences between datasets with and without an explicit Concept Drift, our experiments did not find any striking difference among the evaluated strategies.

Finally, our evaluation did not find any consistent correlation between the training times of the strategies and (i) the size of the dataset or the size of its alphabet within the same setting, or (ii) the quality of the predictive model in terms of accuracy (and thus the difficulty of the prediction problem). As an example of the first, we can observe that BPIC12 contains four times the cases of BPIC11; nonetheless, most of the prediction models built for BPIC11 take more time to construct than the ones built for BPIC12. Similarly, BPIC15 has an alphabet with almost 10 times the activities of BPIC18, but the prediction models built for BPIC18 take much more time to construct than the ones built for BPIC15. As an example of the second, we can observe, from Table 7a, that the time needed for building \({{\mathcal {M}}}_0\) for BPIC11 with labelling \(\phi _{23}\) is much greater than the one needed for building \({{\mathcal {M}}}_0\) for BPIC11 with labelling \(\phi _{13}\), even if the accuracy for the same cases, in Table 5a, follows the inverse trend.Footnote 12

To sum up, concerning RQ2, our evaluation shows that, once \({{\mathcal {M}}}_0\) is available, incremental update and re-train with no hyperopt are the two most convenient update strategies, as they can be applied in almost no time. This may suggest the possibility of implementing an almost continuous update strategy whenever new data become available. While the investigation of when it is convenient to perform a full re-train is out of the scope of the paper, our experiments show that the cost of a full re-train may increase significantly if the update of the predictive model is overly delayed and the amount of new data increases considerably.

Overall Conclusions. The plots in Figs. 4 and 5 show inaccuracy and time related to the 10%–70% and 40%–40% settings, respectively, for each of the considered datasets. The closer an item is to the origin, the better the balance between the time required for training, re-training, or updating the model and the accuracy of the results. By looking at the plots, it is clear that the worst choice in terms of balance is \({{\mathcal {M}}}_0\), while, for the other three models, the choice somehow depends on the dataset and on the labelling. With the sole exception of \(\phi _{12}\), \(\phi _{21}\), \(\phi _{22}\), and \(\phi _{33}\) for both settings, as well as of \(\phi _{11}\) for the 10%–70% setting and of \(\phi _{61}\) for the 40%–40% setting, for all other datasets and labellings, \({{\mathcal {M}}}_1\) and/or \({{\mathcal {M}}}_3\) are the only non-dominated update strategies, i.e. strategies for which no other strategy exists that improves on both the inaccuracy and the time dimension.Footnote 13
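The notion of non-dominated strategy used here can be sketched as follows; the example values in the comment are purely illustrative and not taken from our measurements:

```python
def non_dominated(points):
    """Return the labels of the non-dominated (Pareto-optimal) strategies.

    `points` maps a strategy label to an (inaccuracy, time) pair; lower is
    better on both axes. A point is dominated if another point is no worse
    on both dimensions and strictly better on at least one.
    """
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b
    return {label for label, p in points.items()
            if not any(dominates(q, p) for q in points.values())}

# Illustrative call:
# non_dominated({"M0": (0.30, 1), "M1": (0.12, 2),
#                "M2": (0.10, 9), "M3": (0.11, 2)})
```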

Fig. 4

Inaccuracy versus time plots (10%)

Fig. 5

Inaccuracy versus time plots (40%)

To conclude, our evaluation shows that the do-nothing strategy is not viable, as the accuracy of a non-updated model tends to decrease significantly for typical real-life datasets (with and without explicit Concept Drift), whereas lightweight update strategies, such as incremental update and re-train with no hyperopt, are often extremely effective in updating the models. Full re-train almost always achieves the best accuracy (or an accuracy in line with the best one); nonetheless, its training time may increase significantly, especially in the presence of an abundance of new data. According to our experiments, the incremental update is able to keep up with the full re-train strategy and deliver a properly fitted model almost in real time, suggesting that the potential of incremental models is under-appreciated in Predictive Process Monitoring and that smart Predictive Process Monitoring solutions could be developed leveraging this update strategy.

6.2 Cost-effectiveness analysis

In order to get a better grasp of the cost-effectiveness of the different update strategies, we also investigated the costs incurred by the update strategies when new batches of data become available over time. In particular, given a set of train set batches \(\mathcal{TR}_0\), \(\mathcal{TR}_1\), ... \(\mathcal{TR}_n\), and the corresponding test set batches \(\mathcal{TE}_{0}\), \(\mathcal{TE}_{1}\), ... \(\mathcal{TE}_n\) (where \(\mathcal{TE}_{0}\) is the test set immediately following \(\mathcal{TR}_0\) in the temporal timeline, \(\mathcal{TE}_{1}\) the test set immediately following \(\mathcal{TR}_1\), and so on), we can define \(CE_M\)(\(\mathcal{TR}_0\), \(\mathcal{TR}_1\), ... \(\mathcal{TR}_n\), \(\mathcal{TE}_{0}\), \(\mathcal{TE}_{1}\), ... \(\mathcal{TE}_n\)), from here on shortened as \(CE_M\)(\(\mathcal{TR}_{0 \ldots n}\), \(\mathcal{TE}_{0 \ldots n}\)), as the cost of a model trained with the batches of arriving data \(\mathcal{TR}_0\), \(\mathcal{TR}_1\), ... \(\mathcal{TR}_n\) and tested with the corresponding batches of test data \(\mathcal{TE}_{0}\), \(\mathcal{TE}_{1}\), ... \(\mathcal{TE}_n\).

In our scenario, the cost-effectiveness of the update strategies is characterised by two main aspects: on the one hand, the cost of the time required for building the model and, on the other, the cost of returning wrong predictions (prediction inaccuracy). We can hence define \(CE_{{\mathcal {M}}}\)(\(\mathcal{TR}_{0 \ldots n}\), \(\mathcal{TE}_{0 \ldots n}\)) as the sum of (i) \(CT_{{\mathcal {M}}}\)(\(\mathcal{TR}_{0 \ldots n}\), \(\mathcal{TE}_{0 \ldots n}\)), i.e. the cost of the time required for building, training and, when necessary, re-training the model \({{\mathcal {M}}}\), whenever a new batch of traces arrives;Footnote 14 and (ii) \(CI_{{\mathcal {M}}}\)(\(\mathcal{TR}_{0 \ldots n}\), \(\mathcal{TE}_{0 \ldots n}\)), i.e. the cost of the inaccuracy due to wrong predictions returned by the trained models on the traces of the test sets. Low values indicate a cost-effective model.

Defining \(CT_{T}(TR)\), \(CT_{T}^{H}(TR)\), and \(CT_{U}(TR)\) as the time required for training, for training while optimising the hyperparameters, and for incrementally updating a model with the train set TR, respectively, the time costs related to the four models can be computed as reported in Eq. 1. The time cost for \({{\mathcal {M}}}_0\) is given only by the cost required for training and optimising the hyperparameters on \(\mathcal{TR}_0\). For \({{\mathcal {M}}}_1\) (\({{\mathcal {M}}}_3\)), besides the cost for training and optimising the hyperparameters on \(\mathcal{TR}_0\), when the i-th train set batch becomes available, the cost for training the model on the union of the train set batches up to the i-th one (for \({{\mathcal {M}}}_1\)), or for updating it with the i-th train batch (for \({{\mathcal {M}}}_3\)), also has to be considered. Finally, the time cost for \({{\mathcal {M}}}_2\) is given by the cost required for re-training and optimising the hyperparameters from scratch each time a new batch arrives.

$$\begin{aligned} \begin{aligned} CT_{{{\mathcal {M}}}_0}(\mathcal{TR}_{0 \ldots n}, \mathcal{TE}_{0 \ldots n})&=CT_{T}^{H}(\mathcal{TR}_0)\\ CT_{{{\mathcal {M}}}_1}(\mathcal{TR}_{0 \ldots n}, \mathcal{TE}_{0 \ldots n})&=CT_{T}^{H}(\mathcal{TR}_0)+\sum _{i=1}^{n-1} CT_T\left( \bigcup _{j=0}^{i}\mathcal{TR}_j\right) \\ CT_{{{\mathcal {M}}}_2}(\mathcal{TR}_{0 \ldots n}, \mathcal{TE}_{0 \ldots n})&=\sum _{i=0}^{n-1} CT_T^{H}\left( \bigcup _{j=0}^{i}\mathcal{TR}_j\right) \\ CT_{{{\mathcal {M}}}_3}(\mathcal{TR}_{0 \ldots n}, \mathcal{TE}_{0 \ldots n})&=CT_{T}^{H}(\mathcal{TR}_0)+\sum _{i=1}^{n-1} CT_U(\mathcal{TR}_i) \end{aligned} \end{aligned}$$

The inaccuracy costs are instead reported in Eq. 2. For \({{\mathcal {M}}}_0\), the inaccuracy cost is given by the sum of the costs obtained by providing predictions using the model trained on the batch \(\mathcal{TR}_0\) and tested on each new test set batch \(\mathcal{TE}_{i}\). For the other three models, instead, the inaccuracy cost, at the arrival of the train set batch \(\mathcal{TR}_{i}\), is obtained from models trained, updated and/or optimised on a train set that takes into account the data of all train set batches up to \(\mathcal{TR}_{i}\), and evaluated on the test set batch \(\mathcal{TE}_{i}\). The total inaccuracy cost is then the sum of these costs over all train and test batches \(\mathcal{TR}_i\) and \(\mathcal{TE}_i\).

$$\begin{aligned} \begin{aligned} CI_{{{\mathcal {M}}}_0}(\mathcal{TR}_{0 \ldots n}, \mathcal{TE}_{0 \ldots n})&=\sum _{i=0}^{n-1} CI_{{{\mathcal {M}}}_0}(\mathcal{TR}_0,\mathcal{TE}_{i})\\ CI_{{{\mathcal {M}}}_1}(\mathcal{TR}_{0 \ldots n}, \mathcal{TE}_{0 \ldots n})&=\sum _{i=0}^{n-1} CI_{{{\mathcal {M}}}_1}\left( \bigcup _{j=0}^{i}\mathcal{TR}_{j}, \mathcal{TE}_{i}\right) \\ CI_{{{\mathcal {M}}}_2}(\mathcal{TR}_{0 \ldots n}, \mathcal{TE}_{0 \ldots n})&=\sum _{i=0}^{n-1} CI_{{{\mathcal {M}}}_2}\left( \bigcup _{j=0}^{i}\mathcal{TR}_{j}, \mathcal{TE}_{i}\right) \\ CI_{{{\mathcal {M}}}_3}(\mathcal{TR}_{0 \ldots n}, \mathcal{TE}_{0 \ldots n})&=\sum _{i=0}^{n-1} CI_{{{\mathcal {M}}}_3}\left( \bigcup _{j=0}^{i}\mathcal{TR}_{j}, \mathcal{TE}_{i}\right) \\ \end{aligned} \end{aligned}$$

We assume we can approximate the inaccuracy cost of the model \({{\mathcal {M}}}_0\) tested on the i-th test set batch \(\mathcal{TE}_i\), that is \(CI_{{{\mathcal {M}}}_0}(\mathcal{TR}_0,\mathcal{TE}_{i})\), with the inaccuracy cost of \({{\mathcal {M}}}_0\) tested on \(\mathcal{TE}_1\) plus an extra inaccuracy cost \(\delta _i^{{{\mathcal {M}}}_0}\), i.e. \(CI_{{{\mathcal {M}}}_0}(\mathcal{TR}_0,\mathcal{TE}_{i})=CI_{{{\mathcal {M}}}_0}(\mathcal{TR}_0,\mathcal{TE}_{1})+\delta _i^{{{\mathcal {M}}}_0}\). Similarly, the inaccuracy costs of the other three models, \(CI_{{\mathcal {M}}}(\bigcup _{j=0}^{i}\mathcal{TR}_{j}, \mathcal{TE}_{i})\), can be approximated with the inaccuracy cost computed on the first test set plus an extra inaccuracy cost \(\delta _i^{{\mathcal {M}}}\), i.e. \(CI_{{\mathcal {M}}}(\bigcup _{j=0}^{i}\mathcal{TR}_{j}, \mathcal{TE}_{i})=CI_{{\mathcal {M}}}(\bigcup _{j=0}^{i}\mathcal{TR}_{j},\mathcal{TE}_{1})+\delta _i^{{\mathcal {M}}}\).

Defining \(c_t\) as the hourly time cost and \(c_e\) as the unitary prediction error cost, we can compute the time cost and the inaccuracy cost of a model \({{\mathcal {M}}}\) starting from the time required for training and updating the model \(T_{{\mathcal {M}}}\) and from the number of prediction errors (i.e. false positives + false negatives) \(E_{{\mathcal {M}}}\) as:

$$\begin{aligned} \begin{aligned} CT_{{\mathcal {M}}}(\mathcal{TR}, \mathcal{TE})&=c_t * T_{{\mathcal {M}}}\\ CI_{{\mathcal {M}}}(\mathcal{TR}, \mathcal{TE})&=c_e * E_{{\mathcal {M}}}\\ \end{aligned} \end{aligned}$$
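Under these definitions, the total cost of a model is a simple linear combination of its training time and its prediction errors; a minimal sketch with illustrative numbers (not taken from Table 8):

```python
def cost(c_t, c_e, hours, errors):
    """Total cost of a model: time cost plus inaccuracy cost.

    c_t    : hourly cost of training/updating time
    c_e    : unitary cost of one wrong prediction (FP and FN counted alike)
    hours  : total time spent building/updating the model, in hours
    errors : number of wrong predictions on the test batches
    """
    return c_t * hours + c_e * errors

# Time-dominated configuration (c_t=100, c_e=0.1): a fast, slightly less
# accurate model can beat a slow, slightly more accurate one.
fast = cost(100, 0.1, hours=0.2, errors=250)
slow = cost(100, 0.1, hours=3.0, errors=200)
```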

In order to have an estimate of the costs in our scenario, we instantiated this cost-effectiveness framework. In detail, we set the extra-inaccuracy costs to 0 (\(\delta _i^{{\mathcal {M}}}=0\)) and chose two sample configurations for the unitary prediction error cost and the hourly time cost (\(c_t=0.1\), \(c_e=100\) and \(c_t=100\), \(c_e=0.1\)), as well as for the number of batches n (\(n=1\) and \(n=5\)). Table 8 reports the results obtained using as reference values those of the experimental setting 40%–40%. The best result among the four strategies for each outcome is emphasised in italic, while the results that differ from the best ones by less than 1 are emphasised in bold.

The results in the table show that, depending on the unitary costs of time and prediction errors, differences can exist in the choice of the cheapest model. In both settings, \({{\mathcal {M}}}_3\) seems to be the cheapest model for most of the labellings and for different numbers of batches. In the setting in which \(c_t=0.1\) and \(c_e=100\), due to the low hourly unitary cost, \({{\mathcal {M}}}_2\) is the cheapest model for some of the labellings. Moreover, no significant differences exist in terms of costs when more and more data batches arrive (at least for the specific assumptions made for this cost-effectiveness framework). In the other setting, i.e. when the time cost is much higher than the error cost (\(c_t=100\) and \(c_e=0.1\)), \({{\mathcal {S}}}_2\) is always the most expensive update strategy, due to its substantial training time. Moreover, in this setting, the cheapest update strategy can change when the number of arriving data batches increases. Indeed, while with only one batch of data, for some of the labellings (e.g. \(\phi _{12}\)), \({{\mathcal {M}}}_1\) is slightly cheaper than or has the same cost as \({{\mathcal {M}}}_3\), in the long run, \({{\mathcal {M}}}_3\) is cheaper than \({{\mathcal {M}}}_1\).

To sum up, this instantiation of the cost-effectiveness framework confirms the results of the plots reported in Fig. 5, i.e. that overall \({{\mathcal {S}}}_1\) and \({{\mathcal {S}}}_3\) are the strategies providing the best balance between time and accuracy and hence the cheapest update strategies. Moreover, the analysis suggests that the best update strategy can vary with the unitary costs of time and errors, as well as with the number of available data batches, although \({{\mathcal {M}}}_3\) is consistently the cheapest in most of the tested settings.

Table 8 Cost-effectiveness framework instantiation with \(\delta _i^{{\mathcal {M}}}=0\) and experimental setting 40%–40%

6.3 Threats to validity

The main threats affecting the validity of the evaluation carried out are external validity threats, limiting the generalisability of the results. Indeed, although we investigated the usage of different update strategies on different types of labellings, we limited the investigation to outcome predictions and to classification techniques typically used for this type of prediction. As future work, we plan to inspect other types of predictions, i.e. numeric and sequence predictions, together with the techniques typically used for them, i.e. regression and deep learning techniques.

Finally, the lack of an exhaustive investigation of the hyperparameter values affects the construct validity of our experimentation. We limited this threat by using standard techniques for hyperparameter optimisation  Bergstra et al. [5].

Table 9 The accuracy results related to the Perceptron model

7 Related work

To the best of our knowledge, no other work exists on the comparison of update strategies for Predictive Process Monitoring models, with the exception of the two works by Pauwels and Calders [31] and Maisenbacher and Weidlich [26]. We hence first position our work within the Predictive Process Monitoring field and then provide a specific comparison with Pauwels and Calders [31] and Maisenbacher and Weidlich [26].

We can classify Predictive Process Monitoring works based on the types of predictions they provide. A first group of approaches deals with numeric predictions, in particular predictions related to time  van der Aalst et al. [43], Folino et al. [16], Rogge-Solti and Weske [35]. A second group of approaches focuses on the prediction of next activities. These approaches mainly use deep learning techniques, specifically techniques based on LSTM neural networks Tax et al. [38], Di Francescomarino et al. [14], Camargo et al. [9], Brunk et al. [7], Taymouri et al. [39]. These studies have shown that, when the datasets are large, deep learning techniques can outperform classical Machine Learning techniques. A third group of approaches deals with outcome predictions Teinemaa et al. [40], Maggi et al. [25], Di Francescomarino et al. [12], Leontjeva et al. [22], which are the ones we focus on. A key difference between these works and the work presented in this paper is that we do not aim at proposing or supporting a specific outcome prediction method; rather, we aim at evaluating different update strategies.

The work by  Pauwels and Calders [31] leverages deep learning models to address the challenge of next activity prediction in the context of incremental Predictive Process Monitoring. The goal of their paper is twofold: they explore different strategies to update a model over time for next-activity prediction, and they investigate the potential of neural networks for the incremental Predictive Process Monitoring scenario. The goal is reached by (i) identifying different settings related to the data to use for training, updating, and testing the models, both in a static and a dynamic scenario; and (ii) showing the positive impact of catastrophic forgetting of deep learning models for the Predictive Process Monitoring use case. In our work, we focus on another type of technique and prediction, i.e. we aim at investigating the potential of classical Machine Learning models in the Predictive Process Monitoring scenario for the prediction of an outcome.

The work by Maisenbacher and Weidlich [26] is the only one we are aware of that exploits classical incremental Machine Learning in the context of Predictive Process Monitoring. The goal of that paper is to show the usefulness of incremental techniques in the presence of Concept Drift, which is demonstrated through an evaluation over synthetic logs that exhibit different types of Concept Drift. In our work, we aim at comparatively investigating four different model update strategies (including incremental update) both in terms of accuracy of the results and in terms of the time required to update the models. We carry out our evaluation on real-life and synthetic logs with and without an explicit Concept Drift.

8 Conclusion

In this paper, we have provided a first investigation of different update strategies for Predictive Process Monitoring models in the context of the outcome prediction problem. In particular, we have evaluated the performance of four update strategies, namely do nothing, re-train with no hyperopt, full re-train, and incremental update, applied to Random Forest, the reference technique for outcome-oriented predictions, on a number of real and synthetic datasets with and without an explicit Concept Drift. The cost-effectiveness of the different update strategies has been evaluated both in the simple case where only one training set and one test set are available, and in the more complex scenario where new batches of data become continuously available. The results show that the need to update a Predictive Process Monitoring model is real for typical real-life event logs (regardless of the presence of an explicit Concept Drift). They also show the potential of incremental learning strategies for Predictive Process Monitoring in real environments. An avenue for future work is the extension of our evaluation to different prediction problems, such as remaining time and sequence predictions, which would, in turn, extend the evaluation to different reference Machine Learning techniques, such as regression and LSTM, respectively. Moreover, a deeper investigation of the proposed cost-effectiveness framework in the context of the proposed update strategies will allow us to derive more detailed best practices that guide users in understanding which strategy is the most appropriate under specific contextual conditions.
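To make the four update strategies concrete, the following is a minimal sketch using scikit-learn's RandomForestClassifier. The function names, data, and parameters are illustrative and not taken from our implementation; the paper's full pipeline (trace encoding, hyperparameter optimisation) is omitted, and the incremental update is approximated here via scikit-learn's warm_start mechanism, which keeps the existing trees and adds new ones trained on the latest batch.

```python
# Illustrative sketch of the four model update strategies.
# Assumes scikit-learn; all names here are hypothetical.
from sklearn.ensemble import RandomForestClassifier


def do_nothing(model, X_new, y_new):
    # Strategy 1: keep the original model untouched.
    return model


def retrain_no_hyperopt(model, X_all, y_all):
    # Strategy 2: re-fit on all available data, reusing the old
    # hyperparameters instead of running a new hyperopt phase.
    fresh = RandomForestClassifier(**model.get_params())
    return fresh.fit(X_all, y_all)


def full_retrain(X_all, y_all, n_estimators=100):
    # Strategy 3: rebuild the model from scratch on all data; in the
    # paper this would also include hyperparameter optimisation,
    # which is omitted in this sketch.
    return RandomForestClassifier(n_estimators=n_estimators).fit(X_all, y_all)


def incremental_update(model, X_new, y_new, extra_trees=20):
    # Strategy 4 (approximation): warm-start the forest and add trees
    # fitted on the new batch only, keeping the previously built trees.
    model.set_params(warm_start=True,
                     n_estimators=model.n_estimators + extra_trees)
    return model.fit(X_new, y_new)
```

The sketch highlights the cost trade-off discussed above: strategies 2 and 3 touch all accumulated data at every update, while strategies 1 and 4 touch none or only the newest batch, respectively.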

To conclude, we believe that the potential of incremental models is under-appreciated in the Predictive Process Monitoring field. To allow researchers to better understand the usefulness of the update strategies proposed in this paper, we have made them readily available in the latest release of Nirdizati Rizzi et al. [34].