Abstract
Existing well-investigated Predictive Process Monitoring techniques typically construct a predictive model based on past process executions and then use this model to predict the future of new ongoing cases, without the possibility of updating it with new cases when they complete their execution. This can make Predictive Process Monitoring too rigid to deal with the variability of processes working in real environments that continuously evolve and/or exhibit new variant behaviours over time. As a solution to this problem, we evaluate the use of three different strategies that allow the periodic rediscovery or incremental construction of the predictive model so as to exploit new available data. The evaluation focuses on the performance of the newly learned predictive models, in terms of accuracy and time, against the original one, and uses a number of real and synthetic datasets with and without explicit Concept Drift. The results provide evidence of the potential of incremental learning algorithms for Predictive Process Monitoring in real environments.
1 Introduction
Predictive Process Monitoring Maggi et al. [25] is a research topic aiming at developing techniques that use the abundant availability of event logs extracted from information systems in order to predict how ongoing (uncompleted) process executions (a.k.a. cases) will unfold up to their completion. In turn, these techniques can be embedded within information systems to enhance their ability to manage business processes. For example, an information system can exploit a predictive monitoring technique to predict the remaining execution time of each ongoing case of a process Rogge-Solti and Weske [35], the next activity that will be executed in each case Evermann et al. [15], or the final outcome of a case w.r.t. a set of possible outcomes Maggi et al. [25], Metzger et al. [27], Metzger et al. [28].
Existing Predictive Process Monitoring techniques first construct a predictive model based on data coming from past process executions. Then, they use this model to predict the future of an ongoing case (e.g. outcome, remaining time, or next activity). However, once the predictive model has been constructed, it does not automatically take into account new cases when they complete their execution. This is a limitation in the usage of predictive techniques in the area of Business Process Monitoring: well-known characteristics of real processes are, in fact, their complexity, variability, and lack of steady state. Due to changing circumstances, processes (and thus their executions) evolve and increase their variability, and systems need to adapt in a timely manner.
While a naive answer to this problem would be to rebuild predictive models from the wider available set of data, building predictive models has a cost and this option should therefore be well understood before embracing it; moreover, preliminary studies such as the one of Maisenbacher and Weidlich [26] investigate the usage of incremental techniques Gepperth and Hammer [19] in the presence of Concept Drift phenomena, thus suggesting an alternative strategy for updating a Predictive Process Monitoring model.
In this paper, we tackle the problem of updating Predictive Process Monitoring models in the presence of new process execution data in a principled way by investigating, in a comparative and empirically driven manner, how well different strategies keep predictive models up-to-date. In particular, given an event log \(\mathcal{TR}_0\) and a set of new traces \(\mathcal{TR}_1\), we focus on four diverse strategies to update a predictive model \({{\mathcal {M}}}_0\), built using the traces of \(\mathcal{TR}_0\), so that it also takes into account the set of new traces \(\mathcal{TR}_1\):

Do nothing. In this case, \({{\mathcal {M}}}_0\) is never updated and does not take into account \(\mathcal{TR}_1\) in any way. This strategy acts also as a baseline against which to compare all the other strategies.

Retrain with no hyperopt. In this case, a new predictive model \({{\mathcal {M}}}_1\) is built using \(\mathcal{TR}_0\cup \mathcal{TR}_1\) as the train set, but no optimisation of the hyperparameters is performed and the ones of \({{\mathcal {M}}}_0\) are used;

Full retrain. In this case, a new predictive model \({{\mathcal {M}}}_2\) is built using \(\mathcal{TR}_0\cup \mathcal{TR}_1\) as the train set and a new optimisation of the hyperparameters is performed;

Incremental update. In this case, a new predictive model \({{\mathcal {M}}}_3\) is built starting from \({{\mathcal {M}}}_0\) using the cases contained in \(\mathcal{TR}_1\) in an incremental manner (that is, using incremental learning algorithms).
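To make the four strategies concrete, the sketch below instantiates them with scikit-learn's `RandomForestClassifier` (the classifier used later in the paper). The synthetic data, the chosen hyperparameter values, and the use of `warm_start` to approximate the incremental update are all illustrative assumptions, not the paper's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-ins for the encoded event-log data of TR0 and TR1.
X, y = make_classification(n_samples=400, random_state=0)
X0, y0 = X[:200], y[:200]   # traces from "Period A" (TR0)
X1, y1 = X[200:], y[200:]   # traces from "Period B" (TR1)

# S0 -- do nothing: M0 is trained on TR0 only and never touched again.
m0 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X0, y0)

# S1 -- retrain with no hyperopt: rebuild on TR0 + TR1, reusing M0's hyperparameters.
m1 = RandomForestClassifier(**m0.get_params()).fit(X, y)

# S2 -- full retrain: rebuild on TR0 + TR1 with freshly optimised hyperparameters
# (the hyperparameter search itself is elided; these values are made up).
m2 = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0).fit(X, y)

# S3 -- incremental update: extend M0 with extra trees grown on the new data only.
m3 = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0).fit(X0, y0)
m3.set_params(n_estimators=80)  # request 30 additional trees
m3.fit(X1, y1)                  # the 50 existing trees are kept, 30 new ones are added
```

With `warm_start=True`, a second call to `fit` keeps the already-grown trees and only trains the newly requested ones, which mimics the "add new decision trees as new data arrive" behaviour described in Sect. 2.2.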
The evaluation aims at investigating two main aspects of these update strategies: first, their impact on the quality^{Footnote 1} of the prediction in event logs that do or do not exhibit Concept Drift; second, their impact on the time spent building the predictive model \({{\mathcal {M}}}_0\) and its updates.
We provide two different strategies for retraining the model, i.e. with and without hyperparameter optimisation, because the two costly activities when building a predictive model are the actual training of the model on a train set and the optimisation of the hyperparameters of the constructed model. Therefore, when evaluating the impact of retraining on an extended set of data, we aim at investigating the impact of building a new predictive model and the impact of optimising the hyperparameters separately.
We investigate these four strategies on the problem of outcome prediction, where the outcomes are expressed using either Linear Temporal Logic (LTL) formulae Pnueli [32], in line with several works such as Di Francescomarino et al. [12], Maggi et al. [25], or case duration properties. The four different strategies are evaluated in a broad experimental setting that considers different real and synthetic datasets. Since we focus on outcome predictions, we have decided to centre our evaluation on Random Forest. This algorithm was chosen as it was experimentally proven to be one of the best performing techniques on the outcome prediction problem—see Teinemaa et al. [40] for a rigorous review—and is therefore widely used on the event log data typically encountered in Predictive Process Monitoring.
Perhaps not surprisingly, the results show that the do-nothing strategy is not viable (and therefore the issue of updating a Predictive Process Monitoring model is a real one) and that full retraining and incremental updates are the best strategies in terms of quality of the updated predictive model. Notably, the incremental update is able to keep up with the retraining strategy and deliver a properly fitted model almost in real time, whereas the full retraining might take hours and in some cases even days. This suggests that the potential of incremental models is underappreciated and that clever solutions could be applied to deliver more stable performance while retaining the positive sides of the update functions.
The rest of the paper is structured as follows: Section 2 provides the necessary background on Predictive Process Monitoring and incremental learning; Sect. 3 presents two exemplifying scenarios of process variability and explicit Concept Drift; Sect. 4 describes the four update strategies; Sect. 5 illustrates the data and procedure we use to evaluate the proposed update strategies, while Sect. 6 presents and discusses the results. We finally provide some related work (Sect. 7) and concluding remarks (Sect. 8).
2 Background
In this section, we provide an overview of the four main building blocks that compose our research effort: Predictive Process Monitoring, Random Forest, hyperparameter optimisation, and Concept Drift.
2.1 Predictive Process Monitoring
Predictive Process Monitoring Maggi et al. [25] is a branch of Process Mining that aims at predicting at runtime and as early as possible the future development of ongoing cases of a process given their uncompleted traces. In the last few years, a wide literature about Predictive Process Monitoring techniques has become available—see Di Francescomarino, Ghidini, Maggi and Milani [13] for a survey—mostly based on Machine Learning techniques. The main dimension that is typically used to classify Predictive Process Monitoring techniques is the type of prediction, which can belong to one of three macro-categories: numeric predictions (e.g. time or cost predictions); categorical predictions (e.g. risk predictions or specific categorical outcome predictions such as the fulfilment of a certain property); next activities predictions (e.g. the sequence of the future activities, possibly with their attributes).
Frameworks such as Nirdizati Rizzi et al. [34], Jorbina et al. [21] collect a set of Machine Learning techniques that can be instantiated and used for providing different types of predictions to the user. In detail, these frameworks take as input a set of past executions and use them to train predictive models, which can then be stored to be used at runtime to continuously supply predictions to the user. Moreover, the computed predictions can be used to compute accuracy scores for specific configurations. Within these frameworks, we can identify two main modules: one for the case encoding, and one for the supervised learning. Each of them can be instantiated with different techniques. Examples of case encodings are the index-based encodings presented in Leontjeva et al. [22]. Supervised learning techniques vary and can depend on the type of prediction a user is interested in, ranging from Decision Tree and Random Forest to regression methods and Recurrent Neural Networks.
2.2 Random forest
Random Forest [20] is an ensemble learning method used for classification and regression. The goal is to create a model composed of a multitude of Decision Trees Quinlan [33]. In a Decision Tree, each interior node corresponds to one of the input variables and each leaf node to a possible classification or decision. Each path from the root to a leaf represents a different configuration of input variables. A tree can be “learned” by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The result is a tree in which each selected variable contributes to the labelling of the corresponding example. When an example needs to be labelled, it is run through all the Decision Trees and the output of each tree is counted: the most frequent label is the output of the model.
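As a small illustration of the majority-vote mechanism described above, the snippet below runs one example through every tree of a scikit-learn forest and counts the individual outputs. Note that scikit-learn actually averages class probabilities rather than counting hard votes, so the two can differ on borderline examples; the dataset choice here is purely illustrative.

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Run a single example through all the Decision Trees and count their outputs.
sample = X[:1]
votes = Counter(int(tree.predict(sample)[0]) for tree in forest.estimators_)

# The most frequent label is taken as the output of the model.
majority_label = votes.most_common(1)[0][0]
```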
In this work, we use nonincremental and incremental versions of Random Forest. In a nutshell, the nonincremental version builds a predictive model once and for all using a specific set of training data in a single training phase. Instead, in addition to the step of building a predictive model during the training phase, the incremental learning versions are able to update such a model whenever needed through an update function. The specific implementation used in this work incrementally updates the starting model by adding new decision trees as soon as new data are available.
As already mentioned in the Introduction, Random Forest was chosen as it was experimentally proven to be one of the best performing techniques on the outcome prediction problem in Predictive Process Monitoring. The interested reader is referred to Teinemaa et al. [40] for a rigorous review.
2.3 Hyperparameter optimisation
Machine Learning techniques are known to use model parameters and hyperparameters. Model parameters are automatically learned during the training phase so as to fit the data. Instead, hyperparameters are set outside the training procedure and used for controlling how flexible the model is in fitting the data. While the values of hyperparameters can influence the performance of the predictive models in a relevant manner, their optimal values highly depend on the specific dataset under examination, thus making their setting rather burdensome. To support and automatise this onerous but important task, several hyperparameter optimisation techniques have been developed in the literature Bergstra and Bengio [4], Bergstra et al. [3], also for Predictive Process Monitoring models—see, for example, Teinemaa et al. [40], Di Francescomarino, Dumas, Federici, Ghidini, Maggi, Rizzi and Simonetto [11], Teinemaa et al. [41]. While in Teinemaa et al. [40], Di Francescomarino, Dumas, Federici, Ghidini, Maggi, Rizzi and Simonetto [11], the Tree-structured Parzen Estimator (TPE) has been used for outcome-oriented Predictive Process Monitoring solutions, in Teinemaa et al. [41], Random Search has been used for hyperparameter optimisation in a comparative analysis focusing on how stable outcome-oriented predictions are along time. Although hyperparameter optimisation techniques for Predictive Process Monitoring have shown their ability to identify accurate and reliable framework configurations, they are also computationally expensive, and we have hence decided to evaluate the role of hyperparameter tuning in our update strategies.
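As an illustration, the sketch below applies Random Search (one of the techniques mentioned above) to a Random Forest via scikit-learn's `RandomizedSearchCV`. The search space, the number of sampled configurations, and the synthetic data are illustrative assumptions, not the settings used in the paper.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Hyperparameters are sampled from these distributions, not learned from the data.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 200),
        "max_depth": randint(2, 20),
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=10,          # number of randomly sampled configurations
    scoring="roc_auc",  # the quality metric used in Sect. 5.3
    cv=3,
    random_state=0,
).fit(X, y)

best_params = search.best_params_
```

Each of the `n_iter` configurations is cross-validated on the train data, which is why hyperparameter optimisation multiplies the cost of a single training run.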
2.4 Concept drift
In Machine Learning, Concept Drift refers to a change over time, in unforeseen ways, of the statistical properties of the target variable a learned model is trying to predict. This drift is often due to changes in the target data w.r.t. the data used in the training phase. These changes are problematic as they cause the predictions to become less accurate as time passes. Depending on the type of change (e.g. gradual, recurring, or abrupt), different types of techniques have been proposed in the literature to detect and handle it Gama et al. [18], Widmer and Kubat [17], Schlimmer and Granger [37, 49].
Business processes are subject to change due to, for example, changes in their normative or organisational context, and so are their executions. Processes and executions can hence be subject to Concept Drifts, which may involve several process dimensions such as its control-flow dependencies and data handling. For instance, an organisational change might affect how a certain procedure is managed by employees in a Public Administration scenario (e.g. a further approval by a new manager is required for closing the procedure), or a normative change might affect either the way in which patients are managed in an emergency department (e.g. patients have to be tested for COVID-19 before they can be visited) or the age of the customers who are allowed to submit a loan request procedure in a bank process. The Concept Drift phenomenon has originated a few works that focus on drift detection and localisation in procedural and in declarative business processes—see Bose et al. [6], Carmona and Gavaldà [10] and Maggi et al. [24], respectively—as well as on attempts to deal with it in the context of Predictive Process Monitoring Maisenbacher and Weidlich [26], Pauwels and Calders [31].
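The effect of an abrupt Concept Drift on a static model can be simulated in a few lines: below, a classifier trained on one labelling concept is scored against a drifted concept. The data and the two concepts are, of course, purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))

# Concept in "Period A": the label depends only on the first feature.
y_before = (X[:, 0] > 0).astype(int)
# Drifted concept: the labelling rule changes abruptly.
y_after = (X[:, 0] + X[:, 1] > 0.5).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y_before)
acc_before = clf.score(X, y_before)  # accuracy on the original concept
acc_after = clf.score(X, y_after)    # accuracy degrades on the drifted concept
```

The model is unchanged; only the statistical relationship between inputs and target has shifted, which is exactly why predictions "become less accurate as time passes" under drift.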
3 Two descriptive scenarios
We aim at assessing the benefits of incremental learning techniques in scenarios characterised by process variability and/or explicit Concept Drift phenomena. In this section, we introduce two typical scenarios, which refer to some of the datasets used in the evaluation described in Sect. 5.
Scenario 1
(Dealing with Process Variability) Information systems are widely used in healthcare and several scenarios of predictive analytics can be provided in this domain. Indeed, the exploitation of predictive techniques in healthcare is described as one of the promising big data trends in this domain Bughin et al. [8], Munoz-Gama et al. [30].
Despite some successful evaluations of Predictive Process Monitoring techniques using healthcare data Maggi et al. [25], predictive monitoring needs to consider a well-known feature of healthcare processes, that is, their variability Rojas et al. [36], i.e. the variety of different alternative paths characterising the executions of a process. Whether they refer to non-elective care (e.g. medical emergencies) or elective care (e.g. scheduled standard, routine and non-routine procedures), healthcare processes often exhibit high variability and instability. For instance, the treatment processes related to different patients can be quite different due to allergies, comorbidities, or other specific characteristics of a patient. In fact, the process models discovered from data related to these processes are often spaghetti-like, i.e. cumbersome models in which it is difficult to distil a stable procedure. Moreover, small changes in the organisational structure (e.g. new personnel in charge of a task, unforeseen seasonal variations due to holidays or diseases) may originate subtle variability that is not detectable in terms of stable Concept Drifts but is nonetheless relevant in terms of predictive data analytics.
In such a complex environment, an important challenge concerns the emergence of new behaviours: regardless of how much data we consider, an environment highly dependent on the human factor is likely to exhibit new variants that may not be captured when stopping the training at a specific time. Similarly, some variants may become obsolete, thus making the forgetting of data equally important.
Thus, a way for adapting the predictions to these changes and an investigation of which update strategies are especially suited to highly variable and realistic process executions would be of great impact.
Scenario 2
(Dealing with explicit Concept Drift) The presence of Concept Drift in business processes, due to, for example, changes in the organisational structures, legal regulations, and technological infrastructures, has been acknowledged in the Process Mining manifesto van der Aalst et al. [44] and literature Maggi et al. [24], together with some preliminary studies on its relation with Predictive Process Monitoring Gepperth and Hammer [19].
Such a sudden and abrupt variation in the data provides a clear challenge to the process owners: they must be ready to cope with a Predictive Process Monitoring model with degraded performance on the drifted data, or to perform an update that allows the Predictive Process Monitoring technique to support both the nondrifted and the drifted trends of the data (as ongoing executions may still concern the nondrifted cases).
Similarly to the above, an investigation of which update strategies are especially suited to realistic process executions that exhibit an explicit Concept Drift would provide a concrete support for the maintenance of Predictive Process Monitoring models.
4 Update strategies
Predictive Process Monitoring provides a set of techniques that use the availability of execution traces (or cases) extracted from information systems in order to predict how ongoing (uncompleted) executions will unfold up to their completion.
Thus, assume that an organisation was able to exploit a set of process executions \(\mathcal{TR}_0\), collected within a period of time that we will call “Period A”, to obtain a predictive model \({{\mathcal {M}}}_0\), and that it starts to exploit \({{\mathcal {M}}}_0\) to perform predictions upon new incomplete traces (see Fig. 1 for an illustration of this scenario). As soon as the new incomplete traces terminate, they become new data, potentially available for building a new model \({{\mathcal {M}}}'\), which in turn can be used to provide predictions on new incomplete traces. The need for exploiting new execution traces and building such an updated \({{\mathcal {M}}}'\) could be due to several reasons, among which is the evolution of the process at hand (and thus of its executions), to which the system needs to adapt in a timely manner.
In this paper, we provide four different strategies for computing \({{\mathcal {M}}}'\), by exploiting a new set of process executions \(\mathcal{TR}_1\), collected in a period of time “Period B” subsequent to “Period A”, along with the original set \(\mathcal{TR}_0\). The strategies are summarised in Fig. 2. The figure represents, on the left-hand side, model \({{\mathcal {M}}}_0\) and, on the right-hand side, the operations performed starting from \({{\mathcal {M}}}_0\) to obtain model \({{\mathcal {M}}}'\) according to the four update strategies.
The first strategy, \({{\mathcal {S}}}_0\), is a do nothing strategy. This strategy simply disregards that new traces are produced in “Period B” and continues to use \({{\mathcal {M}}}_0\) as a predictive model. This strategy may prove to be useful when the processes remain stable and it acts also as a baseline against which to compare all the other strategies.
The second strategy, \({{\mathcal {S}}}_1\), exploits the new traces in \(\mathcal{TR}_1\) produced in “Period B” for training but not for the optimisation of the hyperparameters. In this retrain with no hyperopt strategy, \({{\mathcal {M}}}_0\) is replaced by a new predictive model \({{\mathcal {M}}}_1\) built from scratch by using \(\mathcal{TR}_0\cup \mathcal{TR}_1\) as train set. No optimisation of the hyperparameters is made in the construction of \({{\mathcal {M}}}_1\) and the values of the ones computed for \({{\mathcal {M}}}_0\) are instead used. This strategy aims at exploiting the new data in \(\mathcal{TR}_1\) still avoiding the costly steps needed for hyperparameter optimisation.
The third strategy, \({{\mathcal {S}}}_2\), completely replaces the old model \({{\mathcal {M}}}_0\) with a new predictive model \({{\mathcal {M}}}_2\) built from scratch using both \(\mathcal{TR}_0\) and \(\mathcal{TR}_1\). This strategy aims at performing a full retrain, thus exploiting to the utmost all the available data. Also, the comparison between \({{\mathcal {S}}}_1\) and \({{\mathcal {S}}}_2\) enables us to investigate the specific role of hyperparameter tuning in the predictions.
The final strategy, \({{\mathcal {S}}}_3\), exploits the new traces in \(\mathcal{TR}_1\) produced in “Period B” for training the predictive model in an incremental manner. Differently from \({{\mathcal {S}}}_1\), the data of \(\mathcal{TR}_1\) are added as training data in a continuous manner by means of incremental Machine Learning algorithms, so as to extend the existing knowledge of model \({{\mathcal {M}}}_0\). The incremental update strategy is chosen as an example of a dynamic technique, which can be applied when training data become available gradually over time or when the size of the training data is too large to store or process all at once. Similarly to \({{\mathcal {S}}}_1\), the values of the hyperparameters do not change when adding new training data.
5 Empirical evaluation
The evaluation reported in this paper aims at understanding the characteristics of the four different update strategies introduced in the previous section in terms of accuracy and time. We aim at evaluating these strategies with two types of real-life event log data: event logs without an explicit Concept Drift and event logs with an explicit Concept Drift. As such, we have selected four real-life datasets,^{Footnote 2} three for the first scenario and one for the second one. To consolidate the evaluation on the Concept Drift scenario, we also expanded the evaluation to include two synthetic event logs with explicit Concept Drifts introduced in Maaradji et al. [23].
In this section, we introduce the research questions, the datasets, the metrics used to evaluate the effectiveness of the four update strategies described in Sect. 4, the procedure, and the tool settings. The results are instead reported in Sect. 6.
5.1 Research questions
Our evaluation is guided by the following research questions:
 RQ1:

How do the four update strategies do nothing, retrain with no hyperopt, full retrain, and incremental update compare to one another in terms of accuracy?
 RQ2:

How do the four update strategies do nothing, retrain with no hyperopt, full retrain, and incremental update compare to one another in terms of time performance?
RQ1 aims at evaluating the quality of the predictions returned by the four update strategies, while RQ2 investigates the time required to build the predictive models in the four scenarios and, in particular, aims at assessing the difference between the complete periodic rediscovery (full retrain) and the other two update strategies, retrain with no hyperopt and incremental update.
5.2 Datasets
The four update strategies are evaluated using six datasets. Three of them are real-life event logs provided for a Business Process Intelligence (BPI) Challenge, in different years, without an explicit Concept Drift: the BPI Challenges 2011 3TU Data Center [1], 2012 van Dongen [45], and 2015 van Dongen [46]. They are examples of event logs exhibiting Process Variability as described in the first scenario in Sect. 3. The remaining three datasets are instead examples of logs with explicit Concept Drift as described in the second scenario in Sect. 3. Our aim was to evaluate the four strategies on real-life event logs, but, to the best of our knowledge, the only publicly available event log which contains an explicit Concept Drift is the BPI Challenge 2018 van Dongen and Borchert [47]. Therefore, we decided to augment the evaluation considering also two of the synthetic event logs introduced in Maaradji et al. [23]. Here, we report the main characteristics of each dataset, while the outcomes to be predicted for each dataset are contained in Table 1.^{Footnote 3}
The first dataset, originally provided for the BPI Challenge 2011, contains the treatment history of patients diagnosed with cancer in a Dutch academic hospital. The log contains 1,140 cases and 149,730 events referring to 623 different activities. Each case in this log records the events related to a particular patient. For instance, the first labelling (\(\phi _{11}\)), for this dataset, is such that the positive traces are all the ones for which if activity \(\mathsf {CEA\ tumour\ marker\ using\ meia}\) occurs, then it is followed by an occurrence of activity \(\mathsf {squamous\ cell\ carcinoma\ using\ eia}\).
The second dataset, originally provided for the BPI Challenge 2012, contains the execution history of a loan application process in a Dutch financial institution. It is composed of 4,685 cases and 186,693 events referring to 36 different activities. Each case in this log records the events related to a particular loan application. For instance, the first labelling (\(\phi _{21}\)), for this dataset, is such that the positive traces are all the ones in which event \(\mathsf {Accept\ Loan\ Application}\) occurs.
The third dataset, originally provided for the BPI Challenge 2015, concerns the application process for construction permits in five Dutch municipalities. We consider the log pertaining to the first municipality, which is composed of 1,199 cases and 52,217 events referring to 398 different activities.
The fourth dataset, originally provided for the BPI Challenge 2018, concerns an application process for EU direct payments for German farmers from the European Agricultural Guarantee Fund. Depending on the document types, different branches of the workflow are performed. The event log used in this evaluation is composed of 29,302 cases and 1,661,656 events referring to 40 different activities.
The fifth and sixth datasets, hereafter called DriftRIO1 and DriftRIO2, are synthetic event logs that use a “textbook” example of a business process for assessing loan applications Weber et al. [48]. The DriftRIO event logs introduced in Maaradji et al. [23] have been built by alternating traces executing the original “base” model and traces modified so as to exhibit complex Concept Drifts obtained by composing simple log changes, namely the resequentialisation of process model activities (R), the insertion of a new activity (I), and the optionalisation of one activity (O).^{Footnote 4} DriftRIO1 is composed of 3,994 cases and 47,776 events related to 19 activities. DriftRIO2 is composed, instead, of 2,000 cases and 21,279 events referring to 19 different activities. The outcomes to be predicted for each dataset are expressed by using LTL formulae for the four BPI Challenges Di Francescomarino et al. [12], Maggi et al. [25] and by using the case duration property of being a fast case for DriftRIO^{Footnote 5} (see Table 1).
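To illustrate how such outcome labellings can be evaluated on completed traces, the helpers below label a toy trace, represented as an ordered list of (activity, timestamp) pairs — a hypothetical encoding, not the paper's — according to an occurrence property in the style of \(\phi _{21}\), a response property in the style of \(\phi _{11}\), and a fast-case duration property. Activity names and the duration threshold are illustrative.

```python
from datetime import datetime, timedelta

# A toy trace: an ordered list of (activity, timestamp) events.
trace = [
    ("Submit Loan Application", datetime(2012, 1, 1, 9, 0)),
    ("Check Credit", datetime(2012, 1, 1, 10, 30)),
    ("Accept Loan Application", datetime(2012, 1, 2, 16, 0)),
]

def occurs(trace, activity):
    # phi_21-style labelling: positive iff the activity occurs in the trace.
    return any(act == activity for act, _ in trace)

def response(trace, a, b):
    # phi_11-style labelling: every occurrence of a is eventually followed by b.
    acts = [act for act, _ in trace]
    return all(b in acts[i + 1:] for i, act in enumerate(acts) if act == a)

def is_fast(trace, threshold=timedelta(days=2)):
    # Duration-based labelling: positive iff the case completes within the threshold.
    return trace[-1][1] - trace[0][1] <= threshold
```

Applying such a labelling function to every completed case is what produces the gold standard used later for computing the evaluation metrics.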
5.3 Metrics
In order to answer the research questions, we use two metrics, one for accuracy and one for time. The one for accuracy is used to evaluate RQ1, whereas the time measure is used to evaluate RQ2.
 The accuracy metric.:

In this work, we exploit a typical evaluation metric for calculating the performance of a classification model, that is, AUC-ROC (hereafter simply AUC). The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. In formulae, \(\textit{TPR} = \frac{T_P}{T_P + F_N}\) and \(\textit{FPR} = \frac{F_P}{F_P + T_N}\), where \(T_P\), \(T_N\), \(F_P\), and \(F_N\) are the true positives (positive outcomes correctly predicted), the true negatives (negative outcomes correctly predicted), the false positives (negative outcomes predicted as positive), and the false negatives (positive outcomes predicted as negative), respectively. In our case, the AUC is the area under the ROC curve and, when using normalised units, it can be intuitively interpreted as the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one. As usual, \(T_P\), \(T_N\), \(F_P\), and \(F_N\) are obtained by comparing the predictions produced by the predictive models against a gold standard that indicates the correct labelling of each case. In our experiments, we have built the gold standard by evaluating the outcome of each completed case in the test set.^{Footnote 6}
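The probabilistic reading of the AUC can be checked directly: below, the area under the ROC curve computed by scikit-learn coincides with the fraction of (positive, negative) pairs that the made-up scores rank correctly.

```python
from sklearn.metrics import roc_auc_score

# Gold-standard outcomes and illustrative classifier scores for six test cases.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

auc = roc_auc_score(y_true, y_score)

# Probability that a random positive is ranked above a random negative.
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
rank_prob = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))
```

Here one of the nine positive–negative pairs is mis-ranked (0.6 vs 0.7), so both quantities equal 8/9; with ties in the scores, the pair count would need a 0.5 correction for tied pairs.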
 The time metric.:

We measure the time spent to build the predictive model in terms of execution time. The execution time indicates the time required to create and update (in the case of incremental algorithms) the predictive models. We remark here that the execution time does not include the time spent to load and preprocess the data, but only the bare processing time.
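Measuring the bare model-building time as described above amounts to timing only the call that fits the model, keeping data loading and preprocessing outside the timed region; the sketch below, with illustrative synthetic data, shows one way to do this with `time.perf_counter`.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Data generation stands in for loading/preprocessing and is deliberately
# left outside the timed region, mirroring the execution-time metric.
X, y = make_classification(n_samples=500, random_state=0)

start = time.perf_counter()
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
elapsed = time.perf_counter() - start  # bare training time, in seconds
```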
5.4 Experimental procedure
We adopt the classical Machine Learning Train/Validate/Test experimental procedure and configure it for the four different strategies we want to compare. The procedure consists of the following main steps: (1) dataset preparation; (2) classifier training and validation; (3) classifier testing and metrics collection.
In the dataset preparation phase, the execution traces are first ordered according to their starting date-time so as to meaningfully identify those referring to “Period A” (that is, \(\mathcal{TR}_0\)), those referring to the subsequent “Period B” (that is, \(\mathcal{TR}_1\)), and those referring to the test set \(\mathcal{TE}\), which here represents the most recent set of traces on which the actual predictions are made. The predictions are tested using the four different models \({{\mathcal {M}}}_0\)–\({{\mathcal {M}}}_3\) corresponding to the different strategies \({{\mathcal {S}}}_0\)–\({{\mathcal {S}}}_3\) described in Sect. 4.
The actual splits between train, hyperparameter validation, and test sets used for evaluating the four strategies are illustrated in Fig. 3. Following a common practice in Machine Learning, we have decided to use 80% of the data for training/validation and 20% for testing.
To test whether the performance of the different strategies was connected to different sizes of \(\mathcal{TR}_0\) and \(\mathcal{TR}_1\), we devised two experimental settings to supply the train data to the learning algorithms. The first experimental setting divides the train set unequally into \(\mathcal{TR}_0\) (10%) and \(\mathcal{TR}_1\) (70%); the second divides it equally into \(\mathcal{TR}_0\) (40%) and \(\mathcal{TR}_1\) (40%). These two settings do not concern the construction of \({{\mathcal {M}}}_2\), which is always built using the entire set of available train data. The hyperparameter validation set is extracted from the train set through a randomised sampling procedure. It amounts to 20% of the train set, which corresponds to 2% of the dataset for the train set at 10%, 8% for the train set at 40%, and 16% for the train set at 80% (see Fig. 3).
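The two settings can be pictured as a single temporal split routine: traces are ordered by starting date-time and then cut at the configured fractions. The trace representation and its start-time attribute below are hypothetical.

```python
def temporal_split(traces, tr0_frac, tr1_frac):
    """Order traces by starting date-time, then cut them into TR0, TR1,
    and the test set TE (the remaining, most recent traces)."""
    ordered = sorted(traces, key=lambda t: t["start_time"])
    n = len(ordered)
    c0 = int(n * tr0_frac)
    c1 = c0 + int(n * tr1_frac)
    return ordered[:c0], ordered[c0:c1], ordered[c1:]

log = [{"start_time": i} for i in range(10)]      # toy event log
tr0, tr1, te = temporal_split(log, 0.1, 0.7)      # first setting: 10%-70%-20%
tr0b, tr1b, teb = temporal_split(log, 0.4, 0.4)   # second setting: 40%-40%-20%
```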
The variability of the log behaviour of each dataset can be measured through the trace entropy and the Global block entropy metrics Back et al. [2] (Table 2). While the first metric mainly focuses on the number of variants in an event log, the latter also takes into account the internal structure of the traces. Note that the BPI Challenge 2018 and DriftRIO datasets contain a Concept Drift in \(\mathcal{TR}_1\) in both settings (it affects the last 30% of data in \(\mathcal{TR}\), when ordered according to their starting date-time) and, for these datasets, the entropy value of the split 0%–40%–80%–100% is higher than that of the split 40%–80%–80%–100% for both entropy metrics. Comparing the behaviour variability of the splits 0%–10%–80%–100% and 10%–80%–80%–100% is instead less useful, since the two sets of traces have different sizes and the smaller set obviously has a lower entropy w.r.t. the larger one for all datasets. Finally, the entropy on the complete datasets (column 0%–100%) is useful to understand what type of input is provided when evaluating strategy \({{\mathcal {S}}}_2\), which uses the full dataset to train the predictive model.
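For intuition, a simplified variant-based trace entropy can be computed as below. This is an illustrative approximation in the spirit of the metrics of Back et al. [2], not their exact definitions.

```python
import math
from collections import Counter

def trace_entropy(log):
    """Shannon entropy (in bits) over the distribution of trace variants,
    where a variant is the sequence of activity names of a trace."""
    variants = Counter(tuple(trace) for trace in log)
    n = sum(variants.values())
    return -sum((c / n) * math.log2(c / n) for c in variants.values())

# A log with two equally frequent variants has entropy 1 bit; a log with a
# single variant has entropy 0.
varied_log = [("A", "B", "C"), ("A", "B", "C"), ("A", "C", "B"), ("A", "C", "B")]
uniform_log = [("A", "B")] * 3
```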
Additional information about the input logs used in our experiments is reported in Table 3, which shows, for each dataset and each setting, the distribution of the labels on both the train and the test sets, thus providing an idea of how balanced the datasets are.
Once the data are prepared, training and validation start by extracting execution prefixes and by encoding them using the complex index encoding introduced by Leontjeva et al. [22].^{Footnote 7} In the complex index encoding, the data related to a process execution are divided into static and dynamic information. Static information is the same for all the events in the sequence (e.g. the age of a patient), while dynamic information changes across the events occurring in the process execution (e.g. the name of the activity, or the pressure value of a patient associated with an event aimed at measuring the patient’s pressure). We encode static information as attribute-value pairs, while, for dynamic information, the order in which events (and the related attributes) occur in the sequence is also taken into account.
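A minimal sketch of a complex-index-style encoding is given below; the attribute names and values are hypothetical, and the actual encoding of Leontjeva et al. [22] includes further details (e.g. the treatment of categorical attributes).

```python
def encode_prefix(static_attrs, events, prefix_len):
    """Encode the first `prefix_len` events of a trace as one flat feature
    row: static attributes appear once, dynamic attributes are indexed by
    event position so that event order is preserved."""
    row = dict(static_attrs)                      # e.g. {"age": 63}
    for i, event in enumerate(events[:prefix_len], start=1):
        for attr, value in event.items():         # e.g. activity, pressure
            row[f"{attr}_{i}"] = value            # position-indexed feature
    return row

row = encode_prefix(
    {"age": 63},
    [{"activity": "A", "pressure": 120}, {"activity": "B", "pressure": 118}],
    prefix_len=2,
)
```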
The hyperparameter optimisation function uses the Tree of Parzen Estimators (TPE)^{Footnote 8} to retrieve the hyperparameter configuration (see Sect. 2.3), with a maximum of 1000 iterations, and the AUC as objective evaluation measure to maximise. Models \({{\mathcal {M}}}_0\)\({{\mathcal {M}}}_3\) are produced at the end of this phase. The specific hyperparameters used by the models are reported in Table 4. The table shows that indeed differences exist in the optimised hyperparameters computed starting from different train sets, i.e. 10% of the datasets, 40% of the datasets (both used for \({{\mathcal {M}}}_0\), \({{\mathcal {M}}}_1\), and \({{\mathcal {M}}}_3\)) and 80% of the datasets (used for \({{\mathcal {M}}}_2\)).
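The optimisation interface can be pictured as in the sketch below. For readability, the sketch uses plain random search in place of TPE (the actual experiments use a TPE implementation, e.g. the one provided by the Hyperopt library); the search space and objective are hypothetical.

```python
import random

def optimise(objective, space, max_evals=1000, seed=42):
    """Sample configurations from `space` and return the one maximising
    `objective` (in our setting, the validation AUC of the trained model)."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(max_evals):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = objective(cfg)          # e.g. train model, score AUC on validation set
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical Random Forest search space and a stand-in objective:
space = {"n_estimators": [100, 300, 500], "max_depth": [4, 8, 16]}
best_cfg, best_score = optimise(lambda c: c["n_estimators"] + c["max_depth"], space)
```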
After the training and validation procedures are completed, we test the resulting model with the test set, and we collect the scored metrics (see Sect. 5.3) and store them in a database. Concerning time, we have decided to measure here the time spent to build the initial model \({{\mathcal {M}}}_0\), plus the time needed to update it according to the four different strategies. Thus, for \({{\mathcal {S}}}_0\), this will coincide with the time spent for building \({{\mathcal {M}}}_0\) (as, in this case, no further action is taken for updating the model); for \({{\mathcal {S}}}_1\), we compute the time spent for building \({{\mathcal {M}}}_0\) plus the time spent for the retraining over \(\mathcal{TR}\) (no hyperopt); for \({{\mathcal {S}}}_2\), we compute the time spent for building \({{\mathcal {M}}}_0\) plus the time needed for the retraining over \(\mathcal{TR}\) (with hyperopt); and, finally, for \({{\mathcal {S}}}_3\), we compute the time spent for building \({{\mathcal {M}}}_0\) plus the time needed for updating it with the data in \(\mathcal{TR}_1\).
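The execution times can be collected with a simple wrapper that times only the model-construction call, excluding data loading and pre-processing; the per-strategy totals then follow the sums just described. The builder functions named in the comments are hypothetical.

```python
import time

def timed(build_fn, *args, **kwargs):
    """Run a model-construction function and return (model, elapsed seconds);
    only the call itself is timed, not data loading or pre-processing."""
    start = time.perf_counter()
    model = build_fn(*args, **kwargs)
    return model, time.perf_counter() - start

# Hypothetical usage:
#   m0, t_m0 = timed(train_with_hyperopt, tr0)   # initial model M0
#   m1, t_retrain = timed(train, tr)             # retrain, no hyperopt
#   time_S0 = t_m0
#   time_S1 = t_m0 + t_retrain
model, elapsed = timed(lambda: "toy-model")
```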
5.5 Experimental settings
The tool used for the experimentation is Nirdizati Rizzi et al. [34]. The experimental evaluation was performed on a Dell Precision 7820 workstation with the following configuration: (i) 314GB DDR4 2666MHz RDIMM ECC of RAM; (ii) dual Intel Xeon Gold 6136 3.0GHz, 3.7GHz Turbo, 12C, 10.4GT/s 3UPI, 24.75MB Cache, HT (150W) CPUs; and (iii) one 2.5" 256GB SATA Class 20 Solid State Drive. We assumed to have only 1 CPU for training the predictive models, i.e. we did not parallelise the training of the base learners of the Random Forests. We also ensured that there was no race condition over the disk and no starvation over the RAM usage by actively monitoring the resources through Netdata Tsaousis [42]. Each experiment was allowed to run for at most 100 hours. No other experiment or interaction with the workstation was performed other than the monitoring of the used resources.
6 Results
In this section, we present the results of our experiments, reported in Tables 5, 6, 7, and discuss how they allow us to answer the two research questions introduced before. We also provide a discussion about the four update strategies with a costeffectiveness analysis and an analysis of the validity threats. To ensure reproducibility, the datasets used, the configurations, and the detailed results are available at http://bit.ly/how_do_I_update_my_model.
6.1 Discussion
Answering RQ1. The AUC of all models, for the two experimental settings 10%–70% and 40%–40%, for all datasets, is reported in Tables 5a and 5b. The best result for each dataset and labelling is emphasised in italics^{Footnote 9}. Since, in many cases, different models have very close accuracy, we have emphasised in bold the results that differ from the best ones by less than 0.01. Tables 6a and 6b report, for the two experimental settings 10%–70% and 40%–40%, the percentage of gain/loss of \({{\mathcal {M}}}_1\), \({{\mathcal {M}}}_2\), and \({{\mathcal {M}}}_3\) w.r.t. \({{\mathcal {M}}}_0\). The percentage of gain or loss^{Footnote 10} is reported together with a histogram of gains and losses. In order to ease the comparison, \({{\mathcal {M}}}_2\) is reported in both tables.
By looking at the tables, we can immediately see that \({{\mathcal {M}}}_2\) and \({{\mathcal {M}}}_3\) are the clear best performers, especially in the first setting 10%–70%, and that \({{\mathcal {M}}}_0\) is almost consistently the worst performer, often with a significant difference in terms of accuracy. This highlights that the need to update the predictive models is a real issue in typical Predictive Process Monitoring settings. The only exception to this finding is provided by the results obtained for the BPIC15 dataset, where the performance of \({{\mathcal {M}}}_0\) is comparable to that of the other models for almost all the outcome formulae. This is due to the fact that, for this dataset, there is a high homogeneity of the process behaviour over time. This is confirmed by the entropy values, provided in Table 2, that remain quite stable across the entire log. Moreover, we can observe that BPIC12 with labelling \(\phi _{23}\) has overall the lowest accuracy for all the four update strategies. This is possibly due to the high label unbalance characterising this dataset (see Table 3).
The performance of all the evaluated strategies is overall higher in the 40%–40% setting, and this is likely related to the higher amount of data used for hyperparameter optimisation. Nonetheless, the lower performance of \({{\mathcal {M}}}_0\) also in the 40%–40% setting indicates the need to update the models with new data at regular intervals. Concerning the possible differences between the results obtained using datasets with and without an explicit Concept Drift, our experiments did not find any striking variation among the different strategies, thus consolidating the finding that devising update strategies is important in general, even in scenarios where the process changes over time are not so definite. Nonetheless, if we look at Tables 6a and 6b, it is easy to see that the experiments with an explicit Concept Drift are the ones with the greatest difference between \({{\mathcal {M}}}_0\) and all the other models, thus confirming that an explicit Concept Drift can have a significant negative influence on the performance of a Predictive Process Monitoring model, if it is not updated. This is especially true for BPIC18, in which the Concept Drift affects the entropy measure (Table 2) so markedly that the entropy value of the split 0%–40%–80%–100% is higher than that of the split 40%–80%–80%–100% for both entropy metrics.
Tables 6a and 6b also show another interesting aspect of our evaluation: while \({{\mathcal {M}}}_2\) and \({{\mathcal {M}}}_3\) tend to always gain against \({{\mathcal {M}}}_0\) (or to be stable in very few cases), the same cannot be said for \({{\mathcal {M}}}_1\). In fact, if we look at BPIC12 with labelling \(\phi _{23}\) and BPIC15 with labelling \(\phi _{32}\) in Table 6a, and, particularly, at DriftRIO with labelling \(\phi _{52}\) in Table 6b, we can see a decrease in accuracy. A deeper analysis of the chosen hyperparameters revealed that the lower accuracy of \({{\mathcal {M}}}_1\) is due to the inappropriateness of the hyperparameters derived from \(\mathcal{TR}_0\) for the new data used to build \({{\mathcal {M}}}_1\). While this aspect may need to be investigated further, we can conclude that, while retrain with no hyperopt is usually a viable solution, it is nonetheless riskier than full retrain or incremental update.
The general findings and trends derived from the results obtained using Random Forest as classifier are further confirmed, with few exceptions, by the results obtained by using Perceptron Minsky and Papert [29] as predictive model in the analysis of the four update strategies. The perceptron results are reported in Appendix A.
To sum up, concerning RQ1, our evaluation shows that full retrain and incremental update are the best performing update strategies in terms of accuracy, followed by retrain with no hyperopt. With the exception of BPIC15 in the 10%–70% setting, do nothing is, often by far, the worst strategy, indicating the importance of updating the predictive models with new data, when it becomes available.
Answering RQ2.
The time spent for creating the four models, for the two experimental settings 10%–70% and 40%–40%, for all datasets, is reported in Tables 7a and 7b using the “hh:mm:ss” format. The best results for each dataset and labelling (that is, the lowest execution times) are emphasised in italics, while execution times that differ by less than 60 seconds from the best ones are indicated in bold. The last two rows of each table report the average time (and standard deviation) necessary to train a model for a given strategy. In order to ease the comparison, \({{\mathcal {M}}}_2\) is reported in both tables. \({{\mathcal {M}}}_0\) and \({{\mathcal {M}}}_2\) are self-contained models that are “built from scratch” and, therefore, the time reported in the tables for these models is the time spent to train them. Differently, \({{\mathcal {M}}}_1\) and \({{\mathcal {M}}}_3\) are built in a two-step fashion that includes a training phase but also reuses the hyperparameters computed for \({{\mathcal {M}}}_0\). Therefore, their construction time is measured by summing up the time spent for the training phase and the time spent for building \({{\mathcal {M}}}_0\).
By looking at Tables 7a and 7b, we can immediately see that, among all the evaluated strategies, \({{\mathcal {M}}}_0\) is the clear best performer, especially for the 10%–70% setting, and that \({{\mathcal {M}}}_2\) is almost consistently the worst performer, often with a significant difference in terms of time spent to build the model (with few exceptions in the 40%–40% case that we will discuss below). As a second general observation, we note that \({{\mathcal {M}}}_1\) and \({{\mathcal {M}}}_3\) share almost the same construction time with \({{\mathcal {M}}}_0\). This fact is not particularly surprising, as the hyperparameter optimisation routine is often the most expensive step in the construction of this type of predictive model. Therefore, the two strategies \({{\mathcal {S}}}_1\) and \({{\mathcal {S}}}_3\) that underlie the construction of these models are very inexpensive along the time dimension, especially when \({{\mathcal {M}}}_0\) is already available.
If we compare the two experimental settings 10%–70% and 40%–40%, we can observe that, while \({{\mathcal {M}}}_0\) is almost always the best performer in both settings, the difference between the construction time of \({{\mathcal {M}}}_0\) (and thus of \({{\mathcal {M}}}_1\) and \({{\mathcal {M}}}_3\)) and the construction time of \({{\mathcal {M}}}_2\) is significantly higher for the 10%–70% setting. While the investigation of when it is convenient to perform a full retrain is out of the scope of the paper and is left to further investigations, this finding emphasises the fact that the cost of a full retrain may increase in a significant manner if the update of the predictive model is delayed too long and the amount of new data greatly increases. Interestingly enough, the 40%–40% setting presents two cases in which \({{\mathcal {M}}}_2\) is the fastest model to build. The case of BPIC11 with labelling \(\phi _{12}\) is likely due to an “unfortunate” guess in the hyperparameter optimisation step for \({{\mathcal {M}}}_0\), which makes the training time explode.^{Footnote 11} The case of BPIC15 with labelling \(\phi _{31}\), instead, represents a situation in which \({{\mathcal {M}}}_0\) and \({{\mathcal {M}}}_2\) take almost the same time to be built. Concerning possible differences between datasets with and without an explicit Concept Drift, our experiments did not find any striking difference among the evaluated strategies.
Finally, our evaluation did not find any clear correlation between the training times of the strategies and (i) the size of the dataset and the alphabet of the dataset within the same settings or (ii) the quality of the predictive model in terms of accuracy (and thus the difficulty of the prediction problem). As an example of the first, we can observe that BPIC12 contains four times the cases of BPIC11; nonetheless, most of the prediction models built for BPIC11 take more time to be constructed than the ones built for BPIC12. Similarly, BPIC15 has an alphabet with a number of activities that is almost 10 times that of BPIC18, but the prediction models built for BPIC18 take much more time to be constructed than the ones built for BPIC15. As an example of the second, we can observe, from Table 7a, that the time needed for building \({{\mathcal {M}}}_0\) for BPIC11 with labelling \(\phi _{23}\) is much greater than the one needed for building \({{\mathcal {M}}}_0\) for BPIC11 with labelling \(\phi _{13}\), even if the accuracy for the same cases, in Table 5a, follows the inverse trend.^{Footnote 12}
To sum up, concerning RQ2, our evaluation shows that, once \({{\mathcal {M}}}_0\) is available, incremental update and retrain with no hyperopt are the two most convenient update strategies, as they can be applied in almost no time. This may suggest the possibility to implement an almost continuous update strategy whenever new data become available. While the investigation of when it is convenient to perform a full retrain is out of the scope of the paper, our experiments show that the cost of a full retrain may increase in a significant manner if the update of the predictive model is delayed too long and the amount of new data increases significantly.
Overall Conclusions. The plots in Figs. 4 and 5 show inaccuracy and time related to the 10%–70% and 40%–40% settings, respectively, for each of the considered datasets. The closer an item is to the origin, the better the balance between the time required for training, retraining, or updating the model and the accuracy of the results. By looking at the plots, it is clear that the worst choice in terms of balance is \({{\mathcal {M}}}_0\), while, for the other three models, the choice somehow depends on the dataset and on the labelling. With the exception of \(\phi _{12}\), \(\phi _{21}\), \(\phi _{22}\), and \(\phi _{33}\) for both settings, as well as of \(\phi _{11}\) for the 10%–70% setting and of \(\phi _{61}\) for the 40%–40% setting, for all other datasets and labellings, \({{\mathcal {M}}}_1\) and/or \({{\mathcal {M}}}_3\) are the only non-dominated update strategies, i.e. those strategies for which no other strategy exists that improves both the inaccuracy and the time dimension.^{Footnote 13}
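The non-dominated strategies in the plots can be identified with a standard Pareto filter over the (inaccuracy, time) pairs; the numbers below are hypothetical, not values from our experiments.

```python
def non_dominated(points):
    """points: dict strategy -> (inaccuracy, time), lower is better on both.
    A strategy is dominated if another one is at least as good on both
    dimensions and strictly better on at least one."""
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b
    return {name for name, p in points.items()
            if not any(dominates(q, p) for q in points.values())}

# Toy example: M1 is dominated by M3 (worse inaccuracy and worse time).
points = {"M0": (0.40, 10), "M1": (0.20, 30),
          "M2": (0.10, 500), "M3": (0.15, 25)}
frontier = non_dominated(points)
```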
To conclude, our evaluation shows that the do-nothing strategy is not viable, as the accuracy of a non-updated model tends to decrease significantly for typical real-life datasets (with and without explicit Concept Drift), whereas lightweight update strategies, such as incremental update and retrain with no hyperopt, are often extremely effective in updating the models. Full retrain almost always achieves the best accuracy (or an accuracy in line with the best one). Nonetheless, its training time may increase significantly, especially in the presence of an abundance of new data. According to our experiments, the incremental update is able to keep up with the full retrain strategy and deliver a properly fitted model almost in real time, suggesting that the potential of incremental models is underappreciated in Predictive Process Monitoring, and that smart Predictive Process Monitoring solutions could be developed leveraging this update strategy.
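One way to realise an incremental update of a Random Forest is via scikit-learn's `warm_start` mechanism, which keeps the already-fitted trees and only grows new ones on the fresh batch. This is an illustrative sketch, not necessarily the exact mechanism used in our implementation; the data and parameter values are toy examples.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy TR0 (initial training data) and TR1 (newly arrived batch):
X0, y0 = [[0, 0], [1, 1], [0, 1], [1, 0]], [0, 1, 0, 1]
X1, y1 = [[2, 2], [3, 3], [2, 3], [3, 2]], [1, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
clf.fit(X0, y0)            # initial model M0, 10 trees fitted on TR0

clf.n_estimators += 5      # request 5 additional trees
clf.fit(X1, y1)            # the 10 existing trees are kept; only the new
                           # trees are fitted, on the fresh batch TR1
```

Note that with `warm_start` the existing trees are left untouched, so the update cost is proportional only to the number of new trees, which is what makes the strategy so cheap in time.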
6.2 Cost-effectiveness analysis
In order to have a better grasp of the cost-effectiveness of the different update strategies, we also investigated the costs required by the update strategies when new batches of data become available over time. In particular, given a set of train set batches \(\mathcal{TR}_0\), \(\mathcal{TR}_1\), ..., \(\mathcal{TR}_n\), and the corresponding test set batches \(\mathcal{TE}_{0}\), \(\mathcal{TE}_{1}\), ..., \(\mathcal{TE}_n\) (where \(\mathcal{TE}_{0}\) is the test set immediately following \(\mathcal{TR}_0\), \(\mathcal{TE}_{1}\) the test set immediately following \(\mathcal{TR}_1\), and so on in the temporal timeline), we can define \(CE_M\)(\(\mathcal{TR}_0\), \(\mathcal{TR}_1\), ..., \(\mathcal{TR}_n\), \(\mathcal{TE}_{0}\), \(\mathcal{TE}_{1}\), ..., \(\mathcal{TE}_n\))—from here on shortened as \(CE_M\)(\(\mathcal{TR}_{0 \ldots n}\), \(\mathcal{TE}_{0 \ldots n}\))—as the cost of a model trained with n batches of arriving data \(\mathcal{TR}_0\), \(\mathcal{TR}_1\), ..., \(\mathcal{TR}_n\) and tested with the corresponding batches of test data \(\mathcal{TE}_{0}\), ..., \(\mathcal{TE}_n\).
In our scenario, the cost-effectiveness of the update strategies is characterised by two main aspects: on the one hand, the cost of the time required for building the model and, on the other, the cost of returning wrong predictions (prediction inaccuracy). We can hence define \(CE_{{\mathcal {M}}}\)(\(\mathcal{TR}_{0 \ldots n}\), \(\mathcal{TE}_{0 \ldots n}\)) as the sum of (i) \(CT_{{\mathcal {M}}}\)(\(\mathcal{TR}_{0 \ldots n}\), \(\mathcal{TE}_{0 \ldots n}\)), i.e. the cost of the time required for building, training and, when necessary, retraining the model \({{\mathcal {M}}}\) whenever a new batch of traces arrives;^{Footnote 14} and of (ii) \(CI_{{\mathcal {M}}}\)(\(\mathcal{TR}_{0 \ldots n}\), \(\mathcal{TE}_{0 \ldots n}\)), i.e. the cost of the inaccuracy due to wrong predictions returned by the trained models on the traces of the test sets. Low values indicate a cost-effective model.
Defining \(CT_{T}(TR)\), \(CT_{T}^{H}(TR)\), and \(CT_{U}(TR)\) as the time required for training, for training and optimising the hyperparameters, and for incrementally updating a model with the train set TR, respectively, the time costs related to the four models can be computed as reported in Eq. 1. The time cost for \({{\mathcal {M}}}_0\) is given only by the cost required for training and optimising the hyperparameters on \(\mathcal{TR}_0\). For \({{\mathcal {M}}}_1\) (\({{\mathcal {M}}}_3\)), besides the cost for training and optimising the hyperparameters on \(\mathcal{TR}_0\), when the ith train set batch is available, the costs for training (updating) the model with the union of the train set batches up to the ith one (with the ith train batch) also have to be considered. Finally, the time cost for \({{\mathcal {M}}}_2\) is given by the cost required for retraining and optimising the hyperparameters from scratch each time a new batch arrives.
The inaccuracy costs are instead reported in Eq. 2. For \({{\mathcal {M}}}_0\), the inaccuracy cost is given by the sum of the costs obtained by providing predictions using the model \({{\mathcal {M}}}_0\) trained on the batch \(\mathcal{TR}_0\) and tested on each new test set batch \(\mathcal{TE}_{i}\). For the other three models, instead, the cost of inaccuracy at the arrival of the train set batch \(\mathcal{TR}_{i}\) is given by the inaccuracy cost obtained from models trained, updated and/or optimised on a train set that takes into account the data of all train set batches up to \(\mathcal{TR}_{i}\), and evaluated on the test set batch \(\mathcal{TE}_{i}\). The overall inaccuracy cost is given by the sum of these costs over each new train and test batch \(\mathcal{TR}_i\) and \(\mathcal{TE}_i\).
We assume we can approximate the inaccuracy cost of the model \({{\mathcal {M}}}_0\) tested on the ith test set batch \(\mathcal{TE}_i\)—\(CI_{{{\mathcal {M}}}_0}(\mathcal{TR}_0,\mathcal{TE}_{i})\)—with the inaccuracy cost of \({{\mathcal {M}}}_0\) tested on \(\mathcal{TE}_1\) plus an extra inaccuracy cost \(\delta _i^{{{\mathcal {M}}}_0}\), i.e. \(CI_{{{\mathcal {M}}}_0}(\mathcal{TR}_0,\mathcal{TE}_{i})=CI_{{{\mathcal {M}}}_0}(\mathcal{TR}_0,\mathcal{TE}_{1})+\delta _i^{{{\mathcal {M}}}_0}\). Similarly, the inaccuracy costs of the other three models—\(CI_{{\mathcal {M}}}(\bigcup _{j=0}^{i}\mathcal{TR}_{j}, \mathcal{TE}_{i})\)—can be approximated with the inaccuracy cost computed on the first test set plus an extra inaccuracy cost \(\delta _i^{{\mathcal {M}}}\), i.e. \(CI_{{\mathcal {M}}}(\bigcup _{j=0}^{i}\mathcal{TR}_{j}, \mathcal{TE}_{i})=CI_{{\mathcal {M}}}(\bigcup _{j=0}^{i}\mathcal{TR}_{j},\mathcal{TE}_{1})+\delta _i^{{\mathcal {M}}}\).
Defining \(c_t\) as the hourly time cost and \(c_e\) as the unit prediction-error cost, we can compute the time cost and the inaccuracy cost of a model \({{\mathcal {M}}}\) starting from the time \(T_{{\mathcal {M}}}\) required for training and updating the model and from the number of prediction errors (i.e. false positives + false negatives) \(E_{{\mathcal {M}}}\) as \(CT_{{\mathcal {M}}} = c_t \cdot T_{{\mathcal {M}}}\) and \(CI_{{\mathcal {M}}} = c_e \cdot E_{{\mathcal {M}}}\).
In order to have an estimate of the costs in our scenario, we instantiated such a cost-effectiveness framework. In detail, we set the extra inaccuracy costs to 0 (\(\delta _i^{{\mathcal {M}}}=0\)) and chose a couple of sample configurations for the unit prediction-error cost and for the hourly time cost (\(c_t=0.1\), \(c_e=100\) and \(c_t=100\), \(c_e=0.1\)), as well as for the number of batches n (\(n=1\) and \(n=5\)). Table 8 reports the obtained results using as reference values those of the experimental setting 40%–40%. The best result among the four strategies for each outcome is emphasised in italics, while results that differ from the best ones by less than 1 are emphasised in bold.
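The instantiated cost model can be transcribed directly; the training time and error counts below are hypothetical values, not figures from Table 8.

```python
def cost_effectiveness(t_hours, errors, c_t, c_e):
    """CE_M = CT_M + CI_M = c_t * T_M + c_e * E_M (lower is cheaper):
    t_hours is the time spent building/updating the model, errors is the
    number of false positives plus false negatives on the test batches."""
    return c_t * t_hours + c_e * errors

# The two sample configurations applied to one hypothetical model
# (2 hours of training, 30 prediction errors):
ce_error_heavy = cost_effectiveness(2.0, 30, c_t=0.1, c_e=100)  # errors dominate
ce_time_heavy = cost_effectiveness(2.0, 30, c_t=100, c_e=0.1)   # time dominates
```

Under the first configuration the error term dwarfs the time term, favouring accurate but slow strategies; under the second the ordering reverses, which mirrors the discussion of Table 8 below.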
The results in the table show that, depending on the unit costs of time and prediction errors, differences can exist in the choice of the cheapest model. In both settings, \({{\mathcal {M}}}_3\) seems to be the cheapest model for most of the labellings and for different numbers of batches. In the setting in which \(c_t=0.1\) and \(c_e=100\), due to the low hourly time cost, \({{\mathcal {M}}}_2\) is the cheapest model for some of the labellings. Moreover, no significant differences exist in terms of costs when more and more data batches arrive (at least under the specific assumptions made for this cost-effectiveness framework). In the other setting, i.e. when the time cost is much higher than the error cost (\(c_t=100\) and \(c_e=0.1\)), \({{\mathcal {S}}}_2\) is always the most expensive update strategy, due to its substantial training time. Moreover, in this setting, the cheapest update strategy can change when the number of arriving data batches increases. Indeed, while with only one batch of data, for some of the labellings (e.g. \(\phi _{12}\)), \({{\mathcal {M}}}_1\) is slightly cheaper than or has the same cost as \({{\mathcal {M}}}_3\), in the long run, \({{\mathcal {M}}}_3\) is cheaper than \({{\mathcal {M}}}_1\).
To sum up, this instantiation of the cost-effectiveness framework confirms the results of the plots reported in Fig. 5, i.e. that overall \({{\mathcal {S}}}_1\) and \({{\mathcal {S}}}_3\) are the strategies providing the best balance between time and accuracy and hence the cheapest update strategies. Moreover, the analysis suggests that, based on the unit costs of time and errors, as well as on the number of available data batches, differences can exist related to the best update strategy, although \({{\mathcal {M}}}_3\) seems to be consistently cheaper in most of the tested settings.
6.3 Threats to validity
The main threats affecting the validity of the evaluation carried out are external validity threats, limiting the generalisability of the results. Indeed, although we investigated the usage of different update strategies on different types of labellings, we limited the investigation to outcome predictions and to the classification techniques typically used for this type of prediction. We plan to inspect other types of predictions, i.e. numeric and sequence predictions, together with the techniques typically used for them, i.e. regression and deep learning techniques, in future work.
Finally, the lack of an exhaustive investigation of the hyperparameter values affects the construct validity of our experimentation. We limited this threat by using standard techniques for hyperparameter optimisation Bergstra et al. [5].
7 Related work
To the best of our knowledge, no other work exists on the comparison of update strategies for Predictive Process Monitoring models, with the exception of the two by Pauwels and Calders [31] and by Maisenbacher and Weidlich [26]. We hence first position our work within the Predictive Process Monitoring field and then provide a specific comparison with Pauwels and Calders [31] and Maisenbacher and Weidlich [26].
We can classify Predictive Process Monitoring works based on the types of predictions they provide. A first group of approaches deals with numeric predictions and, in particular, predictions related to time van der Aalst et al. [43], Folino et al. [16], Rogge-Solti and Weske [35]. A second group of approaches focuses on the prediction of next activities. These approaches mainly use deep learning techniques—specifically, techniques based on LSTM neural networks Tax et al. [38], Di Francescomarino et al. [14], Camargo et al. [9], Brunk et al. [7], Taymouri et al. [39]. These studies have shown that, when the datasets are large, deep learning techniques can outperform those based on classical Machine Learning. A third group of approaches deals with outcome predictions Teinemaa et al. [40], Maggi et al. [25], Di Francescomarino et al. [12], Leontjeva et al. [22], which are the ones we focus on. A key difference between these works and the work presented in this paper is that we do not aim at proposing/supporting a specific outcome prediction method; rather, we aim at evaluating different update strategies.
The work by Pauwels and Calders [31] leverages deep learning models to address the challenge of next activity prediction in the context of incremental Predictive Process Monitoring. The goal of their paper is twofold: they explore different strategies to update a model over time for next-activity prediction, and they investigate the potential of neural networks for the incremental Predictive Process Monitoring scenario. This goal is reached by (i) identifying different settings related to the data to use for training, updating, and testing the models, both in a static and a dynamic scenario; and (ii) showing the positive impact of catastrophic forgetting of deep learning models in the Predictive Process Monitoring use case. In our work, we focus on another type of techniques/predictions, i.e. we aim at investigating the potential of classical Machine Learning models in the Predictive Process Monitoring scenario for the prediction of an outcome.
The work by Maisenbacher and Weidlich [26] is the only one we are aware of that exploits classical incremental Machine Learning in the context of Predictive Process Monitoring. The goal of that paper is to show the usefulness of incremental techniques in the presence of Concept Drift, which is demonstrated by an evaluation over synthetic logs exhibiting different types of Concept Drift. In our work, we aim at comparatively investigating four different model update strategies (which include the case of the incremental update), both in terms of accuracy of the results and in terms of time required to update the models. We carry out our evaluation on real-life and synthetic logs with and without an explicit Concept Drift.
8 Conclusion
In this paper, we have provided a first investigation of different update strategies for Predictive Process Monitoring models in the context of the outcome prediction problem. In particular, we have evaluated the performance of four update strategies, namely do nothing, retrain with no hyperopt, full retrain, and incremental update, applied to Random Forest, the reference technique for outcomeoriented predictions, on a number of real and synthetic datasets with and without explicit Concept Drift. The costeffectiveness of the different update strategies has been evaluated in the simple case where only one train and one test set are available, and in the more complex scenario where new batches of data become continuously available. The results show that the need to update a Predictive Process Monitoring model is real for typical reallife event logs (regardless of the presence of an explicit Concept Drift). They also show the potential of incremental learning strategies for Predictive Process Monitoring in real environments. An avenue for future work is the extension of our evaluation to different prediction problems such as remaining time and sequence predictions, which would, in turn, extend the evaluation to different reference Machine Learning techniques such as regression and LSTM, respectively. Also, a deeper investigation of the proposed costeffectiveness framework in the context of the proposed update strategies will allow us to come up with more detailed best practices to guide the user in understanding which strategy is the most appropriate one under specific contextual conditions.
To conclude, we believe that the potential of incremental models is underappreciated in the Predictive Process Monitoring field. To allow researchers to better understand the usefulness of the update strategies proposed in this paper, we made them readily available in the latest release of Nirdizati Rizzi et al. [34].
Change history
03 August 2022
Missing Open Access funding information has been added in the Funding Note.
Notes
We measure the prediction quality using different metrics, which are precisely defined later in the paper.
From the repository available at https://data.4tu.nl/.
We assume that events occurring during the process execution fall in the set of atomic propositions. LTL rules are constructed from these atoms by applying the temporal operators in addition to the usual Boolean connectives. Given a formula \(\varphi \), \({\textbf {F}} \varphi \) indicates that \(\varphi \) is true sometimes in the future. \({\textbf {G}} \varphi \) means that \(\varphi \) is true always in the future. \(\varphi {\textbf {U}} \psi \) indicates that \(\varphi \) has to hold at least until \(\psi \) holds and \(\psi \) must hold in the current or in a future time instant.
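The semantics of the three temporal operators over a finite trace can be sketched as follows; this is a minimal, hypothetical evaluator (a trace is modelled as a list of sets of atomic propositions, one set per event, and the activity names are illustrative):

```python
# A trace is a list of sets of atomic propositions, one set per event.

def F(phi, trace):       # "finally": phi holds at some suffix of the trace
    return any(phi(trace[i:]) for i in range(len(trace)))

def G(phi, trace):       # "globally": phi holds at every suffix of the trace
    return all(phi(trace[i:]) for i in range(len(trace)))

def U(phi, psi, trace):  # "until": psi eventually holds, and phi holds until then
    for i in range(len(trace)):
        if psi(trace[i:]):
            return True
        if not phi(trace[i:]):
            return False
    return False

def atom(a):             # atomic proposition: a occurs in the current event
    return lambda trace: a in trace[0]

trace = [{"register"}, {"check"}, {"pay"}]
pay_eventually = F(atom("pay"), trace)  # "pay" eventually occurs in this trace
```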
The logs used in Maaradji et al. [23] have been slightly modified (i) by further strengthening the introduced Concept Drift with the increase in the time duration of two activities, as well as (ii) by considering only the first parts of the event logs, so as to have a single occurrence of the Concept Drift rather than multiple occurrences as in the original event logs.
Specifically, the property defines as fast a case whose cycle time is lower than the average cycle time of the event log.
To avoid biases related to the chosen metric, in our experiments we have also measured prediction quality in terms of average F-measure and of accuracy defined as \(\frac{T_P + T_N}{T_P + T_N + F_P + F_N}\), and the results remain consistent. We have, therefore, decided to focus only on AUC for the sake of simplicity and readability of the results.
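For illustration, the three metrics can be computed on a toy set of predictions with scikit-learn (the labels and scores below are invented for the example):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy binary outcome predictions (labels and scores invented for illustration)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]  # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # thresholded at 0.5

auc = roc_auc_score(y_true, y_score)   # threshold-independent ranking quality
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
acc = accuracy_score(y_true, y_pred)   # (TP + TN) / (TP + TN + FP + FN)
```

Note that AUC is computed from the raw scores, while F-measure and accuracy depend on the chosen classification threshold.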
We have decided to use the complex index encoding as it is the one providing better performance in Predictive Process Monitoring – see Leontjeva et al. [22].
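A minimal sketch of the complex index encoding: each prefix is mapped to a fixed-length vector containing the static case attributes once plus, for every event index up to a maximum prefix length, the activity and the event attributes at that index. The attribute names used here are illustrative, not taken from the evaluated logs:

```python
def complex_index_encode(case_attrs, events, max_len):
    """Encode a case prefix as one fixed-length feature row (dict)."""
    row = dict(case_attrs)  # static case attributes appear once
    for i in range(max_len):
        if i < len(events):
            row[f"activity_{i + 1}"] = events[i]["activity"]
            row[f"resource_{i + 1}"] = events[i]["resource"]
        else:  # pad prefixes shorter than max_len
            row[f"activity_{i + 1}"] = None
            row[f"resource_{i + 1}"] = None
    return row

prefix = [{"activity": "register", "resource": "r1"},
          {"activity": "check", "resource": "r2"}]
encoded = complex_index_encode({"age": 54}, prefix, max_len=3)
```

Because every index gets its own columns, the encoding preserves the order of events in the prefix, which is what distinguishes it from simpler aggregation encodings.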
We indicate in italic the best result, which may be due to differences in digits beyond the third decimal place reported in the paper.
Computed as \(\frac{{{\mathcal {M}}}_{i} - {{\mathcal {M}}}_0}{{{\mathcal {M}}}_0}\), \(i \in \{1,2,3\}\).
Note that an explosion of the training time for the same reason also affects model \({{\mathcal {M}}}_2\) for BPIC11 with labelling \(\phi _{13}\).
This is just one of many examples of an inverse relation between time and accuracy that can be found in the tables.
We did not take into account small differences that cannot be observed with the naked eye in the plots.
We assume that the time cost for providing predictions at runtime, i.e. for testing our model, is negligible.
References
3TU Data Center (2011) BPI Challenge 2011 Event Log. https://doi.org/10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54
Back CO, Debois S, Slaats T (2019) Entropy as a measure of log variability. J. Data Semant. 8(2):129–156. https://doi.org/10.1007/s13740-019-00105-3
Bergstra J, Bardenet R, Bengio Y, Kegl B (2011) Algorithms for hyper-parameter optimization. In: Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira FCN, Weinberger KQ (eds) Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, 12–14 December 2011, pp 2546–2554
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281–305
Bergstra J, Yamins D, Cox DD (2013) Making a science of model search: hyper-parameter optimization in hundreds of dimensions for vision architectures. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, Vol. 28 of JMLR Workshop and Conference Proceedings, JMLR.org, pp 115–123
Bose RPJC, van der Aalst WMP, Zliobaite I, Pechenizkiy M (2014) Dealing with concept drifts in process mining. IEEE Trans. Neural Netw. Learning Syst. 25(1):154–171
Brunk J, Stottmeister J, Weinzierl S, Matzner M, Becker J (2020) Exploring the effect of context information on deep learning business process predictions. Journal of Decision Systems 1–16
Bughin J, Hazan E, Ramaswamy S, Chui M, Allas T, Dahlstrom P, Henke N, Trench M (2017) Artificial intelligence, the next digital frontier? McKinsey
Camargo M, Dumas M, Rojas OG (2019) Learning accurate LSTM models of business processes. In: Hildebrandt TT, van Dongen BF, Röglinger M, Mendling J (eds) Business Process Management – 17th International Conference, BPM 2019, Vienna, Austria, September 1–6, 2019, Proceedings, Vol. 11675 of Lecture Notes in Computer Science, Springer, pp 286–302
Carmona J, Gavalda R (2012) Online techniques for dealing with concept drift in process mining. In: Hollmen J, Klawonn F, Tucker A (eds) Advances in Intelligent Data Analysis XI – 11th International Symposium, IDA 2012, Helsinki, Finland, October 25–27, 2012, Proceedings, Vol. 7619 of Lecture Notes in Computer Science, Springer, pp 90–102
Di Francescomarino C, Dumas M, Federici M, Ghidini C, Maggi FM, Rizzi W, Simonetto L (2018) Genetic algorithms for hyperparameter optimization in predictive business process monitoring. Inf. Syst. 74(Part):67–83
Di Francescomarino C, Dumas M, Maggi FM, Teinemaa I (2019) Clustering-based predictive process monitoring. IEEE Trans Serv Comput 12(6):896–909
Di Francescomarino C, Ghidini C, Maggi FM, Milani F (2018) Predictive process monitoring methods: which one suits me best? In: Weske M, Montali M, Weber I, vom Brocke J (eds) Business Process Management – 16th International Conference, BPM 2018, Sydney, NSW, Australia, September 9–14, 2018, Proceedings, Vol. 11080 of Lecture Notes in Computer Science, Springer, pp 462–479
Di Francescomarino C, Ghidini C, Maggi FM, Petrucci G, Yeshchenko A (2017) An eye into the future: leveraging a-priori knowledge in predictive business process monitoring. In: BPM, pp 252–268
Evermann J, Rehse J, Fettke P (2017) Predicting process behaviour using deep learning. Decision Support Syst 100:129–140
Folino F, Guarascio M, Pontieri L (2012) Discovering context-aware models for predicting business process performances. In: Meersman R, Panetto H, Dillon TS, Rinderle-Ma S, Dadam P, Zhou X, Pearson S, Ferscha A, Bergamaschi S, Cruz IF (eds) On the Move to Meaningful Internet Systems: OTM 2012, Confederated International Conferences: CoopIS, DOA-SVI, and ODBASE 2012, Rome, Italy, September 10–14, 2012, Proceedings, Part I, Vol. 7565 of Lecture Notes in Computer Science, Springer, pp 287–304
Gama J, Medas P, Castillo G, Rodrigues PP (2004) Learning with drift detection. In: Bazzan ALC, Labidi S (eds) Advances in Artificial Intelligence – SBIA 2004, 17th Brazilian Symposium on Artificial Intelligence, Sao Luis, Maranhao, Brazil, September 29 – October 1, 2004, Proceedings, Vol. 3171 of Lecture Notes in Computer Science, Springer, pp 286–295
Gama J, Zliobaite I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput. Surv. 46(4):44:1–44:37
Gepperth A, Hammer B (2016) Incremental learning algorithms and applications. In: 24th European Symposium on Artificial Neural Networks, ESANN 2016, Bruges, Belgium, April 27–29, 2016
Ho TK (1995) Random decision forests. In: Third International Conference on Document Analysis and Recognition, ICDAR 1995, August 14–15, 1995, Montreal, Canada, Volume I, IEEE Computer Society, pp 278–282
Jorbina K, Rozumnyi A, Verenich I, Di Francescomarino C, Dumas M, Ghidini C, Maggi FM, Rosa ML, Raboczi S (2017) Nirdizati: a web-based tool for predictive process monitoring. In: Clariso R, Leopold H, Mendling J, van der Aalst WMP, Kumar A, Pentland BT, Weske M (eds) Proceedings of the BPM Demo Track and BPM Dissertation Award co-located with the 15th International Conference on Business Process Modeling (BPM 2017), Barcelona, Spain, September 13, 2017, Vol. 1920 of CEUR Workshop Proceedings, CEUR-WS.org
Leontjeva A, Conforti R, Di Francescomarino C, Dumas M, Maggi FM (2015) Complex symbolic sequence encodings for predictive monitoring of business processes. In: Motahari-Nezhad HR, Recker J, Weidlich M (eds) Business Process Management – 13th International Conference, BPM 2015, Innsbruck, Austria, August 31 – September 3, 2015, Proceedings, Vol. 9253 of Lecture Notes in Computer Science, Springer, pp 297–313
Maaradji A, Dumas M, Rosa ML, Ostovar A (2015) Fast and accurate business process drift detection. In: Motahari-Nezhad HR, Recker J, Weidlich M (eds) Business Process Management – 13th International Conference, BPM 2015, Innsbruck, Austria, August 31 – September 3, 2015, Proceedings, Vol. 9253 of Lecture Notes in Computer Science, Springer, pp 406–422
Maggi FM, Burattin A, Cimitile M, Sperduti A (2013) Online process discovery to detect concept drifts in LTL-based declarative process models. In: Meersman R, Panetto H, Dillon TS, Eder J, Bellahsene Z, Ritter N, Leenheer PD, Dou D (eds) On the Move to Meaningful Internet Systems: OTM 2013 Conferences – Confederated International Conferences: CoopIS, DOA-Trusted Cloud, and ODBASE 2013, Graz, Austria, September 9–13, 2013, Proceedings, Vol. 8185 of Lecture Notes in Computer Science, Springer, pp 94–111
Maggi FM, Di Francescomarino C, Dumas M, Ghidini C (2014) Predictive monitoring of business processes. In: Jarke M, Mylopoulos J, Quix C, Rolland C, Manolopoulos Y, Mouratidis H, Horkoff J (eds) Advanced Information Systems Engineering – 26th International Conference, CAiSE 2014, Thessaloniki, Greece, June 16–20, 2014, Proceedings, Vol. 8484 of Lecture Notes in Computer Science, Springer, pp 457–472
Maisenbacher M, Weidlich M (2017) Handling concept drift in predictive process monitoring. In: Liu XF, Bellur U (eds) 2017 IEEE International Conference on Services Computing, SCC 2017. IEEE Computer Society, pp 1–8
Metzger A, Franklin R, Engel Y (2012) Predictive monitoring of heterogeneous service-oriented business networks: the transport and logistics case. In: 2012 Annual SRII Global Conference, San Jose, CA, USA, July 24–27, 2012. IEEE Computer Society, pp 313–322
Metzger A, Leitner P, Ivanovic D, Schmieders E, Franklin R, Carro M, Dustdar S, Pohl K (2015) Comparing and combining predictive business process monitoring techniques. IEEE Trans Syst, Man, Cybern: Syst 45(2):276–290
Minsky M, Papert SA (2017) Perceptrons: An introduction to computational geometry. MIT press
Munoz-Gama J, Martin N, Fernandez-Llatas C, Johnson OA, Sepulveda M, Helm E, Galvez-Yanjari V, Rojas E, Martinez-Millana A, Aloini D, Amantea IA, Andrews R, Arias M, Beerepoot I, Benevento E, Burattin A, Capurro D, Carmona J, Comuzzi M, Dalmas B, de la Fuente R, Di Francescomarino C, Ciccio CD, Gatta R, Ghidini C, Gonzalez-Lopez F, Ibanez-Sanchez G, Klasky HB, Prima Kurniati A, Lu X, Mannhardt F, Mans R, Marcos M, Medeiros de Carvalho R, Pegoraro M, Poon SK, Pufahl L, Reijers HA, Remy S, Rinderle-Ma S, Sacchi L, Seoane F, Song M, Stefanini A, Sulis E, ter Hofstede AH, Toussaint PJ, Traver V, Valero-Ramon Z, van de Weerd I, van der Aalst WM, Vanwersch R, Weske M, Wynn MT, Zerbato F (2022) Process mining for healthcare: characteristics and challenges. J Biomed Inform 22:36
Pauwels S, Calders T (2021) Incremental predictive process monitoring: the next activity case. In: Polyvyanyy A, Wynn MT, Looy AV, Reichert M (eds) Business Process Management – 19th International Conference, BPM 2021, Rome, Italy, September 6–10, 2021, Proceedings, Vol. 12875 of Lecture Notes in Computer Science, Springer, pp 123–140
Pnueli A (1977) The temporal logic of programs. In: 18th Annual Symposium on Foundations of Computer Science. IEEE, pp 46–57
Quinlan JR (1986) Induction of decision trees. Machine Learning 1(1):81–106
Rizzi W, Simonetto L, Di Francescomarino C, Ghidini C, Kasekamp T, Maggi FM (2019) Nirdizati 2.0: new features and redesigned backend. In: Depaire B, Smedt JD, Dumas M, Fahland D, Kumar A, Leopold H, Reichert M, Rinderle-Ma S, Schulte S, Seidel S, van der Aalst WMP (eds) Proceedings of the Dissertation Award, Doctoral Consortium, and Demonstration Track at BPM 2019, co-located with the 17th International Conference on Business Process Management, BPM 2019, Vienna, Austria, September 1–6, 2019, Vol. 2420 of CEUR Workshop Proceedings, CEUR-WS.org, pp 154–158
Rogge-Solti A, Weske M (2013) Prediction of remaining service execution time using stochastic Petri nets with arbitrary firing delays. In: Basu S, Pautasso C, Zhang L, Fu X (eds) Service-Oriented Computing – 11th International Conference, ICSOC 2013, Berlin, Germany, December 2–5, 2013, Proceedings, Vol. 8274 of Lecture Notes in Computer Science, Springer, pp 389–403
Rojas E, Munoz-Gama J, Sepulveda M, Capurro D (2016) Process mining in healthcare: a literature review. J Biomed Inform 61:224–236
Schlimmer JC, Granger RH (1986) Incremental learning from noisy data. Mach. Learn. 1(3):317–354
Tax N, Verenich I, La Rosa M, Dumas M (2017) Predictive business process monitoring with LSTM neural networks. In: Advanced Information Systems Engineering – 29th International Conference, CAiSE 2017, Essen, Germany, June 12–16, 2017, Proceedings, Vol. 10253 of Lecture Notes in Computer Science, Springer, pp 477–492
Taymouri F, Rosa ML, Erfani SM, Bozorgi ZD, Verenich I (2020) Predictive business process monitoring via generative adversarial nets: the case of next event prediction. In: Fahland D, Ghidini C, Becker J, Dumas M (eds) Business Process Management – 18th International Conference, BPM 2020, Seville, Spain, September 13–18, 2020, Proceedings, Vol. 12168 of Lecture Notes in Computer Science, Springer, pp 237–256
Teinemaa I, Dumas M, La Rosa M, Maggi FM (2019) Outcome-oriented predictive process monitoring: review and benchmark. ACM Trans Knowl Discov Data 13(2):17:1–17:57
Teinemaa I, Dumas M, Leontjeva A, Maggi FM (2018) Temporal stability in predictive process monitoring. Data Min. Knowl. Discov. 32(5):1306–1338
Tsaousis C (2013) Netdata, https://github.com/netdata/netdata/
van der Aalst W, Schonenberg MH, Song M (2011) Time prediction based on process mining. Inf. Syst. 36(2):450–475
van der Aalst W et al (2012) Process mining manifesto. In: Daniel F, Barkaoui K, Dustdar S (eds) Business Process Management Workshops (BPM 2011 International Workshops, Clermont-Ferrand, France, August 29, 2011, Revised Selected Papers, Part I). Lecture Notes in Business Information Processing, Springer, Germany, pp 169–194
van Dongen B (2012) BPI Challenge 2012. https://doi.org/10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f
van Dongen B (2015) BPI Challenge 2015. https://doi.org/10.4121/uuid:31a308ef-c844-48da-948c-305d167a0ec1
van Dongen B, Borchert FF (2018) BPI Challenge 2018. https://data.4tu.nl/articles/dataset/BPI_Challenge_2018/12688355/1
Weber B, Reichert M, RinderleMa S (2008) Change patterns and change support features  enhancing flexibility in processaware information systems. Data Knowl. Eng. 66(3):438–466
Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach. Learn. 23(1):69–101
Acknowledgements
We thank anonymous reviewers for their very useful comments and suggestions.
Funding
Open access funding provided by Free University of BozenBolzano within the CRUICARE Agreement.
A. Further Results
We report, in this appendix, the results obtained using Perceptron [29] (rather than Random Forest) as classifier. Tables 9a and 9b show the results obtained with such a classifier for the different strategies and for the different considered settings. Also for these experiments, the best result for each dataset and labelling is emphasised in italic, while the results that differ from the best ones by less than 0.01 are emphasised in bold.
As for Random Forest, we can observe that the most accurate strategies are \({{\mathcal {M}}}_2\) and \({{\mathcal {M}}}_3\), with the exception of the datasets BPIC12 and BPIC15 with labelling \(\phi _{32}\) in the setting 40%–40%, in which the best results are obtained with \({{\mathcal {M}}}_0\). Moreover, we can observe a significant difference in performance with respect to Random Forest for BPIC18: the accuracy obtained with Perceptron for this dataset is very low for all four update strategies. Also with Perceptron, the accuracy is overall better for the 40%–40% setting than for 10%–70%, and no relevant differences in terms of winning strategies can be observed between datasets with and without explicit Concept Drift.
Overall, the evaluation with Perceptron confirms that \({{\mathcal {M}}}_2\) and \({{\mathcal {M}}}_3\) (i.e. full retrain and incremental update, respectively) are the best performing update strategies in terms of accuracy.
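As an illustration, the incremental update with Perceptron can be sketched with scikit-learn's partial_fit, which continues training the same weight vector on each new batch of completed cases. This is a minimal sketch on synthetic data, not the exact setup used in the experiments:

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.RandomState(0)
X0, y0 = rng.rand(200, 5), rng.randint(0, 2, 200)  # initial training data
X1, y1 = rng.rand(100, 5), rng.randint(0, 2, 100)  # a new batch of completed cases

clf = Perceptron(random_state=0)
# Initial model: the first partial_fit call must declare all classes.
clf.partial_fit(X0, y0, classes=np.array([0, 1]))
# Incremental update: the weights continue from their previous state,
# so no retraining from scratch is needed when new cases arrive.
clf.partial_fit(X1, y1)
```

Unlike Random Forest, Perceptron is natively incremental, which is one reason linear online learners are attractive for this kind of update strategy despite their lower accuracy on some datasets.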
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Rizzi, W., Di Francescomarino, C., Ghidini, C. et al. How do I update my model? On the resilience of Predictive Process Monitoring models to change. Knowl Inf Syst 64, 1385–1416 (2022). https://doi.org/10.1007/s10115-022-01666-9