How do I update my model? On the resilience of Predictive Process Monitoring models to change

Existing, well-investigated Predictive Process Monitoring techniques typically construct a predictive model based on past process executions and then use it to predict the future of new ongoing cases, without the possibility of updating it with new cases when they complete their execution. This can make Predictive Process Monitoring too rigid to deal with the variability of processes working in real environments that continuously evolve and/or exhibit new variant behaviours over time. As a solution to this problem, we evaluate the use of three different strategies that allow the periodic rediscovery or incremental construction of the predictive model so as to exploit newly available data. The evaluation focuses on the performance of the newly learned predictive models, in terms of accuracy and time, against the original one, and uses a number of real and synthetic datasets with and without explicit Concept Drift. The results provide evidence of the potential of incremental learning algorithms for Predictive Process Monitoring in real environments.


Introduction
Predictive Process Monitoring [25] is a research topic aiming at developing techniques that use the abundant availability of event logs extracted from information systems in order to predict how ongoing (uncompleted) process executions (a.k.a. cases) will unfold up to their completion. In turn, these techniques can be embedded within information systems to enhance their ability to manage business processes. For example, an information system can exploit a predictive monitoring technique to predict the remaining execution time of each ongoing case of a process [35], the next activity that will be executed in each case [15], or the final outcome of a case w.r.t. a set of possible outcomes [25,27,28].
Existing Predictive Process Monitoring techniques first construct a predictive model based on data coming from past process executions. Then, they use this model to predict the future of an ongoing case (e.g., outcome, remaining time, or next activity). However, once the predictive model has been constructed, it does not automatically take into account new cases when they complete their execution. This is a limitation in the usage of predictive techniques in the area of Business Process Monitoring: well-known characteristics of real processes are, in fact, their complexity, variability, and lack of steady-state. Due to changing circumstances, processes (and thus their executions) evolve, increase their variability, and systems need to adapt in a timely manner.
While a rough answer to this problem would be to re-build predictive models from the wider available set of data, building predictive models has a cost, and this option should therefore be well understood before embracing it. Moreover, preliminary studies, such as the one of Maisenbacher and Weidlich [26], investigate the usage of incremental techniques [19] in the presence of Concept Drift phenomena, thus suggesting a different strategy for updating a Predictive Process Monitoring model.
In this paper, we tackle the problem of updating Predictive Process Monitoring models in the presence of new process execution data in a principled manner by investigating, in a comparative and empirically driven manner, how different strategies to keep predictive models up-to-date work. In particular, given an event log TR0 and a set of new traces TR1, we focus on four diverse strategies to update a predictive model M0, built using the traces of TR0, so as to also take into account the set of new traces TR1:

• Do nothing. In this case, M0 is never updated and does not take into account TR1 in any way. This strategy acts also as a baseline against which to compare all the other strategies.

• Re-train with no hyperopt. In this case, a new predictive model M1 is built using TR0 ∪ TR1 as train set, but no optimisation of the hyperparameters is performed and the ones of M0 are used.

• Full re-train. In this case, a new predictive model M2 is built using TR0 ∪ TR1 as the train set and a new optimisation of the hyperparameters is performed.

• Incremental update. In this case, a new predictive model M3 is built starting from M0 using the cases contained in TR1 in an incremental manner (that is, using incremental learning algorithms).
The evaluation aims at investigating two main aspects of these update strategies: first, their impact on the quality of the prediction in event logs that do or do not exhibit Concept Drift; second, their impact on the time spent for building the predictive model M0 and its updates.
The reason why we provide two different strategies for retraining the model, i.e., with and without an optimisation of the hyperparameters, is that the two costly activities when building a predictive model are the actual training of the model w.r.t. a train set and the optimisation of the hyperparameters for the constructed model. Therefore, when evaluating the impact of retraining on an extended set of data, we aim at investigating the impact of building a new predictive model and the impact of optimising the hyperparameters in a separate manner.
The problem on which we investigate these four strategies is that of outcome prediction, where the outcomes are expressed by using either Linear Temporal Logic (LTL) formulae [32], in line with several works such as [12,25], or case duration properties. The four different strategies are evaluated in a broad experimental setting that considers different real and synthetic datasets. Since we focus on outcome predictions, we have decided to center our evaluation on Random Forest. This algorithm was chosen as it was experimentally proven to be one of the best performing techniques on the outcome prediction problem (see [40] for a rigorous review) and is therefore widely applied to the event log data typically used in Predictive Process Monitoring.
Perhaps not surprisingly, the results show that do nothing is not a viable strategy (confirming that updating a Predictive Process Monitoring model is a real issue) and that full re-train and incremental update are the best strategies in terms of the quality of the updated predictive model. Moreover, the incremental update is able to keep up with the retraining strategy and deliver a properly fitted model almost in real time, whereas the full re-train might take hours, and in some cases even days. This suggests that the potential of incremental models is under-appreciated and that clever solutions could be applied to deliver more stable performance while retaining the benefits of the update functions.
The rest of the paper is structured as follows: Section 2 provides the necessary background on Predictive Process Monitoring and incremental learning; Section 3 presents two exemplifying scenarios of process variability and explicit Concept Drift; Section 4 describes the four update strategies; Section 5 illustrates the data and procedure we use to evaluate the proposed update strategies, while Section 6 presents and discusses the results. We finally provide some related work (Section 7) and concluding remarks (Section 8).

Background
In this section, we provide an overview of the four main building blocks that compose our research effort: Predictive Process Monitoring, Random Forest, hyperparameter optimisation, and Concept Drift.

Predictive Process Monitoring
Predictive Process Monitoring [25] is a branch of Process Mining that aims at predicting, at runtime and as early as possible, the future development of ongoing cases of a process given their uncompleted traces. In the last few years, a wide literature about Predictive Process Monitoring techniques has become available (see [13] for a survey), mostly based on Machine Learning techniques. The main dimension that is typically used to classify Predictive Process Monitoring techniques is the type of prediction, which can belong to one of three macro-categories: numeric predictions (e.g., time or cost predictions); categorical predictions (e.g., risk predictions or specific categorical outcome predictions, such as the fulfillment of a certain property); and next activity predictions (e.g., the sequence of the future activities, possibly with their attributes).
Frameworks such as Nirdizati [34,21] collect a set of Machine Learning techniques that can be instantiated and used for providing different types of predictions to the user. In detail, these frameworks take as input a set of past executions and use them to train predictive models, which can then be stored and used at runtime to continuously supply predictions to the user. Moreover, the computed predictions can be used to compute accuracy scores for specific configurations. Within these frameworks, we can identify two main modules: one for the case encoding and one for the supervised learning. Each of them can be instantiated with different techniques. Examples of case encodings are the index-based encodings presented in Leontjeva et al. [22]. Supervised learning techniques vary and can also depend on the type of prediction a user is interested in, ranging from Decision Tree and Random Forest to regression methods and Recurrent Neural Networks.

Random Forest
Random Forest [20] is an ensemble learning method used for classification and regression. The goal is to create a model composed of a multitude of Decision Trees [33]. In a Decision Tree, each interior node corresponds to one of the input variables and each leaf node to a possible classification or decision. Each different path from the root to a leaf represents a different configuration of input variables. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The result is a tree in which each selected variable contributes to the labelling of the relative example. When an example needs to be labelled, it is run through all the Decision Trees and the output of each Decision Tree is counted. The most frequent label is the output of the model.
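The majority-vote step described above can be sketched as follows; the three toy "trees" are plain functions standing in for learned Decision Trees and are purely illustrative.

```python
from collections import Counter

# Majority voting over an ensemble: each tree emits a label, the votes
# are tallied, and the most frequent label is the model's output.
def forest_predict(trees, example):
    votes = [tree(example) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Toy stand-ins for learned trees: they disagree on the threshold used
# to split the single input feature.
trees = [
    lambda x: "pos" if x > 2 else "neg",
    lambda x: "pos" if x > 5 else "neg",
    lambda x: "pos" if x > 3 else "neg",
]

# forest_predict(trees, 4) == "pos"  (two of three trees vote "pos")
# forest_predict(trees, 1) == "neg"  (all trees vote "neg")
```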
In this work, we use non-incremental and incremental versions of Random Forest. In a nutshell, the non-incremental version builds a predictive model once and for all, using a specific set of training data in a single training phase. Instead, in addition to the step of building a predictive model during the training phase, the incremental learning versions are able to update such a model whenever needed through an update function. The specific implementation used in this work incrementally updates the starting model by adding new decision trees as soon as new data is available.
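The incremental scheme can be sketched as follows. This is not the paper's implementation: the base learner is a deliberately trivial stub (it predicts the majority label of its own batch), whereas a real implementation would grow proper decision trees on each batch.

```python
from collections import Counter

# Sketch of the incremental update: instead of retraining from scratch,
# each new batch of completed cases trains an additional "tree" that is
# appended to the existing ensemble.

class IncrementalForest:
    def __init__(self):
        self.trees = []

    def update(self, batch):
        """Train one new base learner on `batch` (a list of (x, label)
        pairs) and append it, leaving existing trees untouched."""
        majority = Counter(label for _, label in batch).most_common(1)[0][0]
        self.trees.append(lambda x, m=majority: m)

    def predict(self, x):
        votes = [tree(x) for tree in self.trees]
        return Counter(votes).most_common(1)[0][0]

forest = IncrementalForest()
forest.update([(1, "neg"), (2, "neg"), (3, "pos")])   # Period A data
forest.update([(4, "pos"), (5, "pos")])               # Period B data
forest.update([(6, "pos"), (7, "neg"), (8, "pos")])   # a later batch
# len(forest.trees) == 3; the ensemble now votes "pos".
```

A comparable effect can be approximated in scikit-learn with `RandomForestClassifier(warm_start=True)` by increasing `n_estimators` before each additional `fit` call, so that only the newly added trees are trained on the new data.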
As already mentioned in the Introduction, Random Forest was chosen as it was experimentally proven to be one of the best performing techniques on the outcome prediction problem in Predictive Process Monitoring. The interested reader is referred to Teinemaa et al. [40] for a rigorous review.

Hyperparameter Optimisation
Machine Learning techniques are known to use model parameters and hyperparameters. Model parameters are automatically learned during the training phase so as to fit the data. Instead, hyperparameters are set outside the training procedure and used for controlling how flexible the model is in fitting the data. While the values of hyperparameters can influence the performance of the predictive models in a relevant manner, their optimal values highly depend on the specific dataset under examination, thus making their setting rather burdensome. To support and automatise this onerous but important task, several hyperparameter optimisation techniques have been developed in the literature [4,3], also for Predictive Process Monitoring models (see, e.g., [40,11,41]). While, in [40,11], the Tree-structured Parzen Estimator (TPE) has been used for outcome-oriented Predictive Process Monitoring solutions, in [41], Random Search has been used for hyperparameter optimisation in a comparative analysis focusing on how stable outcome-oriented predictions are over time. Although hyperparameter optimisation techniques for Predictive Process Monitoring have shown their ability to identify accurate and reliable framework configurations, they are also expensive, and we have hence decided to evaluate the role of hyperparameter tuning in our update strategies.
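A toy Random Search, of the kind used in [41], can be sketched as follows. The `validation_score` function is a hypothetical stand-in for training a model with a given configuration and measuring it on a validation set; the search space is illustrative.

```python
import random

# Toy random search over a hyperparameter space: sample configurations
# at random, evaluate each on a validation set, keep the best one.

def validation_score(config):
    # Hypothetical response surface: performance peaks at max_depth=12
    # and n_estimators=300, and degrades away from that optimum.
    return -abs(config["max_depth"] - 12) - abs(config["n_estimators"] - 300) / 50

def random_search(space, n_iter, seed=42):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_iter):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = validation_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config

space = {"max_depth": [4, 8, 12, 16], "n_estimators": [100, 300, 500]}
best = random_search(space, n_iter=20)
```

TPE differs in that it models the score distribution to bias sampling towards promising regions, but the train-evaluate-keep-best loop is the same, and it is exactly this repeated training that makes hyperparameter optimisation expensive.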

Concept Drift
In Machine Learning, Concept Drift refers to a change over time, in unforeseen ways, of the statistical properties of the target variable a learned model is trying to predict. This drift is often due to changes in the target data w.r.t. the data that was used in the training phase. These changes are problematic as they cause the predictions to become less accurate as time passes. Depending on the type of change (e.g., gradual, recurring, or abrupt), different types of techniques have been proposed in the literature to detect and handle them [18,17,49,37].
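To make the degradation concrete, the following is a minimal sketch of one naive detection idea: monitor prediction accuracy over a sliding window and flag a drift when it falls markedly below the initial level. Real detectors from the literature cited above are far more sophisticated; the window size and threshold here are arbitrary illustrative values.

```python
# Illustrative drift check based on windowed accuracy degradation.

def detect_drift(correct_flags, window=50, drop=0.15):
    """`correct_flags` is a chronological list of 0/1 values recording
    whether each prediction was correct. Returns the first index at
    which the windowed accuracy falls `drop` below the accuracy of the
    initial window, or None if no such degradation is observed."""
    if len(correct_flags) < 2 * window:
        return None
    baseline = sum(correct_flags[:window]) / window
    for i in range(window, len(correct_flags) - window + 1):
        recent = sum(correct_flags[i:i + window]) / window
        if recent <= baseline - drop:
            return i
    return None

# A prediction stream that is 90% accurate, then abruptly drops to 40%.
stream = [1] * 45 + [0] * 5 + [1, 0, 0, 1, 0] * 20
# detect_drift(stream) points at the start of the degraded segment.
```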
Business Processes are subject to change due to, for example, changes in their normative or organisational context, and so are their executions. Processes and executions can hence be subject to Concept Drifts, which may involve several process dimensions, such as control-flow dependencies and data handling. For instance, an organisational change might affect how a certain procedure is managed by employees in a Public Administration scenario (e.g., a further approval by a new manager is required for closing the procedure), or a normative change might affect either the way in which patients are managed in an emergency department (e.g., patients have to be tested for COVID-19 before they can be visited) or the age of the customers who are allowed to submit a loan request in a bank process. The Concept Drift phenomenon has originated a few works that focus on drift detection and localisation in procedural and in declarative business processes (see [6,10] and [24], respectively), as well as on attempts to deal with it in the context of Predictive Process Monitoring [26,31].

Two Descriptive Scenarios
We aim at assessing the benefits of incremental learning techniques in scenarios characterised by process variability and/or explicit Concept Drift phenomena. In this section, we introduce two typical scenarios, which refer to some of the datasets used in the evaluation described in Section 5.
Scenario 1. Dealing with Process Variability. Information systems are widely used in healthcare, and several scenarios of predictive analytics can be provided in this domain. Indeed, the exploitation of predictive techniques in healthcare is described as one of the promising big data trends in this domain [8,30].
Despite some successful evaluations of Predictive Process Monitoring techniques using healthcare data [25], predictive monitoring needs to consider a well-known feature of healthcare processes, that is, their variability [36], i.e., the variety of different alternative paths characterising the executions of a process. Whether they refer to non-elective care (e.g., medical emergencies) or elective care (e.g., scheduled standard, routine, and non-routine procedures), healthcare processes often exhibit high variability and instability. For instance, the treatment processes related to different patients can be quite different due to allergies, comorbidities, or other specific characteristics of a patient. In fact, when attempting to discover process models from data related to these processes, the resulting models are often spaghetti-like, i.e., cumbersome models in which it is difficult to distill a stable procedure. Moreover, small changes in the organisational structure (e.g., new personnel in charge of a task, unforeseen seasonal variations due to holidays or diseases) may originate subtle variability that is not detectable in terms of stable Concept Drifts, but is nonetheless relevant in terms of predictive data analytics.
In such a complex environment, an important challenge concerns the emergence of new behaviours: regardless of how much data we consider, an environment highly dependent on the human factor is likely to exhibit new variants that may not be captured when stopping the training at a specific time.Similarly, some variants may become obsolete, thus making the forgetting of data equally important.
Thus, a way for adapting the predictions to these changes, and an investigation of which update strategies are especially suited to highly variable and realistic process executions, would be of great impact.

Scenario 2. Dealing with explicit Concept Drift. The presence of Concept Drift in business processes, due to, e.g., changes in the organisational structures, legal regulations, and technological infrastructures, has been acknowledged in the Process Mining manifesto [44] and literature [24], together with some preliminary studies on its relation with Predictive Process Monitoring [19].
Such a sudden and abrupt variation in the data provides a clear challenge to the process owners: they must be ready either to cope with a Predictive Process Monitoring model whose performance degrades on the drifted data, or to perform an update that allows the Predictive Process Monitoring technique to support both the non-drifted and the drifted trends of the data (as ongoing executions may still concern non-drifted cases).
Similarly to the above, an investigation of which update strategies are especially suited to realistic process executions that exhibit an explicit Concept Drift would provide a concrete support for the maintenance of Predictive Process Monitoring models.

Update Strategies
Predictive Process Monitoring provides a set of techniques that use the availability of execution traces (or cases) extracted from information systems in order to predict how ongoing (uncompleted) executions will unfold up to their completion. Thus, assume that an organisation was able to exploit a set of process executions TR0, collected within a period of time that we will call "Period A", to obtain a predictive model M0, and that it starts to exploit M0 to perform predictions upon new incomplete traces (see Figure 1 for an illustration of this scenario). As soon as the new incomplete traces terminate, they become new data, potentially available to be exploited for building a new model M′, which, in turn, can be used to provide predictions on new incomplete traces. The need for exploiting new execution traces and building such an updated M′ could be due to several reasons, among which the evolution of the process at hand (and thus of its executions), to which the system needs to adapt in a timely manner.

In this paper, we provide four different strategies for computing M′, by exploiting a new set of process executions TR1, collected in a period of time "Period B" subsequent to "Period A", along with the original set TR0. The strategies are summarised in Figure 2. The figure represents, on the left-hand side, model M0 and, on the right-hand side, the operations performed starting from M0 to obtain model M′ according to the four update strategies.
The first strategy, S0, is a do nothing strategy. It simply disregards the new traces produced in "Period B" and continues to use M0 as a predictive model. This strategy may prove to be useful when the process remains stable, and it acts also as a baseline against which to compare all the other strategies.
The second strategy, S1, exploits the new traces in TR1 produced in "Period B" for training but not for the optimisation of the hyperparameters. In this re-train with no hyperopt strategy, M0 is replaced by a new predictive model M1 built from scratch by using TR0 ∪ TR1 as train set. No optimisation of the hyperparameters is made in the construction of M1; the values computed for M0 are used instead. This strategy aims at exploiting the new data in TR1 while avoiding the costly steps needed for hyperparameter optimisation.
The third strategy, S2, completely replaces the old model M0 with a new predictive model M2 built from scratch using both TR0 and TR1. This strategy aims at performing a full re-train, thus exploiting to the utmost all the available data. Also, the comparison between S1 and S2 enables us to investigate the specific role of hyperparameter tuning in the predictions.
The final strategy, S3, exploits the new traces in TR1 produced in "Period B" for training the predictive model in an incremental manner. Differently from S1, the data of TR1 is added as training data in a continuous manner by means of incremental Machine Learning algorithms, so as to extend the existing knowledge of model M0. The incremental update strategy is chosen as an example of a dynamic technique, which can be applied when training data becomes available gradually over time or when the size of the training data is too large to store or process all at once. Similarly to S1, the values of the hyperparameters do not change when adding new training data.
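The four strategies can be summarised schematically as follows. Here `train`, `optimise_hyperparams`, and `incremental_update` are hypothetical placeholders for the concrete procedures (e.g., Random Forest training and hyperparameter search), not names from the paper's implementation, and the stub functions at the bottom exist only to make the sketch self-contained.

```python
# Schematic dispatch over the four update strategies S0-S3.

def build_model(strategy, M0, H0, TR0, TR1,
                train, optimise_hyperparams, incremental_update):
    if strategy == "do_nothing":           # S0: keep M0 as is
        return M0
    if strategy == "retrain_no_hyperopt":  # S1: retrain, reuse M0's hyperparams
        return train(TR0 + TR1, H0)
    if strategy == "full_retrain":         # S2: retrain and re-optimise
        H2 = optimise_hyperparams(TR0 + TR1)
        return train(TR0 + TR1, H2)
    if strategy == "incremental":          # S3: extend M0 with TR1 only
        return incremental_update(M0, TR1)
    raise ValueError(f"unknown strategy: {strategy}")

# Stub procedures for illustration only.
train = lambda data, hp: ("trained", len(data), hp)
optimise_hyperparams = lambda data: "H_new"
incremental_update = lambda model, new: ("extended", model, len(new))

M0, H0 = "M0", "H0"
TR0, TR1 = [1, 2, 3], [4, 5]
```

Note that only S3 touches TR1 alone; S1 and S2 always train on the union of the two logs, which is precisely what makes them increasingly expensive as data accumulates.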

Empirical Evaluation
The evaluation reported in this paper aims at understanding the characteristics of the four different update strategies introduced in the previous section in terms of accuracy and time. We aim at evaluating these strategies on two types of real-life event log data: event logs without an explicit Concept Drift and event logs with an explicit Concept Drift. To this end, we have selected four real-life datasets (from the repository available at https://data.4tu.nl/), three for the first scenario and one for the second one. To consolidate the evaluation on the Concept Drift scenario, we also expanded the evaluation to include synthetic event logs with explicit Concept Drifts introduced in Maaradji et al. [23].
In this section, we introduce the research questions, the datasets, the metrics used to evaluate the effectiveness of the four update strategies described in Section 4, the procedure, and the tool settings. The results are reported in Section 6.

Research Questions
Our evaluation is guided by the following research questions:

RQ1. How do the four update strategies do nothing, re-train with no hyperopt, full re-train, and incremental update compare to one another in terms of accuracy?

RQ2. How do the four update strategies do nothing, re-train with no hyperopt, full re-train, and incremental update compare to one another in terms of time performance?
RQ1 aims at evaluating the quality of the predictions returned by the four update strategies, while RQ2 investigates the time required to build the predictive models in the four scenarios and, in particular, aims at assessing the difference between the complete periodic rediscovery (full re-train) and the other two update strategies, re-train with no hyperopt and incremental update.

Datasets
The four update strategies are evaluated using six datasets. Three of them are real-life event logs provided for a Business Process Intelligence (BPI) Challenge, in different years, without an explicit Concept Drift: the BPI Challenges 2011 [1], 2012 [45], and 2015 [46]. They are examples of event logs exhibiting Process Variability, as described in the first scenario in Section 3. The remaining datasets are instead examples of logs with explicit Concept Drift, as described in the second scenario in Section 3. Our aim was to evaluate the four strategies on real-life event logs but, to the best of our knowledge, the only publicly available real-life event log which contains an explicit Concept Drift is the BPI Challenge 2018 [47]. Therefore, we decided to augment the evaluation by considering also two of the synthetic event logs introduced in Maaradji et al. [23]. Here, we report the main characteristics of each dataset, while the outcomes to be predicted for each dataset are contained in Table 1.

The first dataset, originally provided for the BPI Challenge 2011, contains the treatment history of patients diagnosed with cancer in a Dutch academic hospital. The log contains 1,140 cases and 149,730 events referring to 623 different activities. Each case in this log records the events related to a particular patient. For instance, the first labelling (ϕ11) for this dataset is such that the positive traces are all the ones for which, if activity CEA - tumor marker using meia occurs, then it is followed by an occurrence of activity squamous cell carcinoma using eia.
The second dataset, originally provided for the BPI Challenge 2012, contains the execution history of a loan application process in a Dutch financial institution. It is composed of 4,685 cases and 186,693 events referring to 36 different activities. Each case in this log records the events related to a particular loan application. For instance, the first labelling (ϕ21) for this dataset is such that the positive traces are all the ones in which event Accept Loan Application occurs.
The third dataset, originally provided for the BPI Challenge 2015, concerns the application process for construction permits in five Dutch municipalities. We consider the log pertaining to the first municipality, which is composed of 1,199 cases and 52,217 events referring to 398 different activities.
The fourth dataset, originally provided for the BPI Challenge 2018, concerns an application process for EU direct payments for German farmers from the European Agricultural Guarantee Fund. Depending on the document types, different branches of the workflow are performed. The event log used in this evaluation is composed of 29,302 cases and 1,661,656 events referring to 40 different activities.
The fifth and the sixth datasets, hereafter called DriftRIO1 and DriftRIO2, are synthetic event logs that use a "textbook" example of a business process for assessing loan applications [48]. The DriftRIO event logs introduced in Maaradji et al. [23] have been built by alternating traces executing the original "base" model and traces modified so as to exhibit complex Concept Drifts obtained by composing simple log changes, namely, resequentialisation of process model activities (R), insertion of a new activity (I), and optionalisation of one activity (O). DriftRIO1 is composed of 3,994 cases and 47,776 events related to 19 activities. DriftRIO2 is composed, instead, of 2,000 cases and 21,279 events referring to 19 different activities. The outcomes to be predicted for each dataset are expressed by using LTL formulae for the four BPI Challenges [12,25] and by using the case duration property of being a fast case for the DriftRIO logs (see Table 1).
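The two kinds of LTL-style outcome labellings mentioned above can be sketched as simple checks over completed traces: an occurrence property (as in ϕ21, "event Accept Loan Application occurs") and a response property (as in ϕ11, "whenever A occurs, it is eventually followed by B"). The helper names and toy traces below are illustrative, not part of the paper's tooling.

```python
# Illustrative outcome labelling functions over a completed trace,
# represented as a list of activity names.

def occurs(trace, a):
    """Occurrence property: activity `a` appears somewhere in the trace."""
    return a in trace

def response(trace, a, b):
    """Response property: every occurrence of `a` is followed, later in
    the trace, by at least one occurrence of `b` (vacuously true if `a`
    never occurs)."""
    return all(b in trace[i + 1:] for i, act in enumerate(trace) if act == a)

# phi_21-style labelling on a toy trace:
# occurs(["Submit", "Accept Loan Application"], "Accept Loan Application")
# evaluates to True, so the trace would be labelled positive.
```

Applying such a function to each completed case yields the positive/negative labels used to train and test the classifiers.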

Metrics
In order to answer the research questions, we use two metrics: one for accuracy and one for time. The accuracy metric is used to evaluate RQ1, whereas the time metric is used to evaluate RQ2.
The accuracy metric. In this work, we exploit a typical evaluation metric for the performance of a classification model, namely AUC-ROC (hereafter simply AUC). The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. In formulae, TPR = TP / (TP + FN) and FPR = FP / (FP + TN), where TP, TN, FP, and FN are the true positives (positive outcomes correctly predicted), the true negatives (negative outcomes correctly predicted), the false positives (negative outcomes predicted as positive), and the false negatives (positive outcomes predicted as negative), respectively. In our case, the AUC is the area under the ROC curve and, when using normalised units, it can be intuitively interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. As usual, TP, TN, FP, and FN are obtained by comparing the predictions produced by the predictive models against a gold standard that indicates the correct labelling of each case. In our experiments, we have built the gold standard by evaluating the outcome of each completed case in the test set.
The time metric. We measure the time spent to build the predictive model in terms of execution time. The execution time indicates the time required to create and update (in the case of incremental algorithms) the predictive models. We remark that the execution time does not include the time spent to load and pre-process the data, but only the bare processing time.
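Both metrics admit a compact pure-Python sketch. The snippet below computes TPR and FPR from the four counts and implements the probabilistic reading of AUC stated above (probability that a random positive is ranked above a random negative); it is a didactic illustration, not the evaluation code used in the experiments.

```python
# Pure-Python sketch of the metrics defined in the text.
def tpr_fpr(tp, fn, fp, tn):
    """TPR = TP/(TP+FN), FPR = FP/(FP+TN)."""
    return tp / (tp + fn), fp / (fp + tn)

def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive instance
    is ranked above a randomly chosen negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(tpr_fpr(8, 2, 1, 9))                      # (0.8, 0.1)
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfectly ranked -> 1.0
```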

Experimental Procedure
We adopt the classical Machine Learning train/validate/test experimental procedure and configure it for the four different strategies we want to compare. The procedure consists of the following main steps: (1) dataset preparation; (2) classifier training and validation; (3) classifier testing and metrics collection.
In the dataset preparation phase, the execution traces are first ordered according to their starting date-time, so as to meaningfully identify those referring to "Period A" (that is, TR0), those referring to the subsequent "Period B" (that is, TR1), and those referring to the test set TE, which here represents the most recent set of traces on which the actual predictions are made. The predictions are tested using the four different models M0-M3 corresponding to the different strategies S0-S3 described in Section 4.
The actual splits between train, hyperparameter validation, and test sets used for evaluating the four strategies are illustrated in Figure 3. Following a common practice in Machine Learning, we have decided to use 80% of the data for training/validation and 20% for testing.
To test whether the performance of the different strategies is connected to different sizes of TR0 and TR1, we devised two experimental settings to supply the train data to the learning algorithms. The first experimental setting divides the train set unequally into TR0 (10%) and TR1 (70%); the second divides it equally into TR0 (40%) and TR1 (40%). These two settings do not concern the construction of M2, which is always built using the entire set of available train data. The hyperparameter validation set is extracted from the train set through a randomised sampling procedure. It amounts to 20% of the train set, which corresponds to 2% of the dataset for the train set at 10%, 8% for the train set at 40%, and 16% for the train set at 80% (see Figure 3). We have also measured accuracy in terms of average F-measure and of accuracy defined as (TP + TN) / (TP + TN + FP + FN), and the results remain consistent. We have, therefore, decided to focus only on AUC for the sake of simplicity and readability of the results.
The variability of the log behaviour of each dataset can be measured through the trace entropy and the global block entropy metrics [2] (Table 2). While the first metric mainly focuses on the number of variants in an event log, the latter also takes into account the internal structure of the traces. Note that the BPI Challenge 2018 and DriftRIO datasets contain a Concept Drift in TR1 in both settings (it affects the last 30% of the data in TR, when ordered according to starting date-time) and, for these datasets, the entropy value of the split 0%-40%-80%-100% is higher than that of the split 40%-80%-80%-100% for both entropy metrics. Comparing the behaviour variability of the splits 0%-10%-80%-100% and 10%-80%-80%-100% is instead less useful, since the two sets of traces have different sizes and the smaller set obviously has a lower entropy than the larger one for all datasets. Finally, the entropy on the complete datasets (column 0%-100%) is useful to understand what type of input is provided when evaluating strategy S2, which uses the full dataset to train the predictive model.
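The temporal split boundaries for the two settings can be computed as below. This is a minimal sketch using the percentages stated above; the index arithmetic (and the assumption that traces are already ordered by starting date-time) is illustrative, not the exact preparation code.

```python
# Illustrative computation of the temporal split boundaries, given traces
# already ordered by starting date-time (an assumption for this sketch).
def split_indices(n_traces, tr0_frac, tr1_frac, test_frac=0.20):
    """Return (start, end) index pairs for TR0, TR1, and TE."""
    tr0_end = int(n_traces * tr0_frac)
    tr1_end = tr0_end + int(n_traces * tr1_frac)
    te_start = int(n_traces * (1 - test_frac))
    return (0, tr0_end), (tr0_end, tr1_end), (te_start, n_traces)

# First setting: TR0 = 10%, TR1 = 70%, TE = last 20%.
print(split_indices(1000, 0.10, 0.70))  # ((0, 100), (100, 800), (800, 1000))
# Second setting: TR0 = 40%, TR1 = 40%, TE = last 20%.
print(split_indices(1000, 0.40, 0.40))  # ((0, 400), (400, 800), (800, 1000))
```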

Additional information about the input logs used in our experiments is reported in Table 3, which shows, for each dataset and each setting, the distribution of the labels on both the train and the test sets, thus providing an idea of how balanced the datasets are.
Once the data is prepared, training and validation start by extracting execution prefixes and encoding them using the complex index encoding. We measure the time spent to build the initial model M0, plus the time needed to update it according to the four different strategies. Thus, for S0, this coincides with the time spent for building M0 (as, in this case, no further action is taken to update the model); for S1, we compute the time spent for building M0 plus the time spent for re-training over TR (no hyperopt); for S2, we compute the time spent for building M0 plus the time needed for re-training over TR (with hyperopt); and, finally, for S3, we compute the time spent for building M0 plus the time needed for updating it with the data in TR1.
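The per-strategy time accounting just described can be sketched as follows. The `train`, `train_hyperopt`, and `update` functions are hypothetical placeholders standing in for the actual training routines; only the accounting scheme mirrors the text.

```python
# Hedged sketch of the per-strategy execution-time accounting.
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed wall-clock time)."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0

def train_hyperopt(data):  # placeholder: training + hyperparameter optimisation
    return {"model": "M", "data": data}

def train(data):           # placeholder: training with M0's hyperparameters
    return {"model": "M", "data": data}

def update(model, data):   # placeholder: incremental update with a new batch
    return model

m0, t_m0 = timed(train_hyperopt, "TR0")
_, t_s1 = timed(train, "TR0+TR1")           # S1: re-train, no hyperopt
_, t_s2 = timed(train_hyperopt, "TR0+TR1")  # S2: full re-train
_, t_s3 = timed(update, m0, "TR1")          # S3: incremental update

# Reported execution times, as described in the text:
reported = {"S0": t_m0, "S1": t_m0 + t_s1,
            "S2": t_m0 + t_s2, "S3": t_m0 + t_s3}
```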

Experimental Settings
The tool used for the experimentation is Nirdizati [34]. The experimental evaluation was performed on a Dell Precision 7820 workstation with the following configuration: (i) 314GB DDR4 2666MHz RDIMM ECC RAM; (ii) two Intel Xeon Gold 6136 3.0GHz (3.7GHz Turbo, 12C, 10.4GT/s 3UPI, 24.75MB Cache, HT, 150W) CPUs; and (iii) one 2.5" 256GB SATA Class 20 Solid State Drive. We assumed to have only one CPU for training the predictive models, i.e., we did not parallelise the training of the base learners of the Random Forests. We also ensured that there was no race condition over the disk and no starvation over the RAM usage, by actively monitoring the resources through Netdata [42]. Each experiment was allowed to run for at most 100 hours. No other experiment or interaction with the workstation was performed other than the monitoring of the used resources.

Results
In this section, we present the results of our experiments, reported in Tables 5-7, and discuss how they allow us to answer the two research questions introduced before. We also provide a discussion of the four update strategies with a cost-effectiveness analysis and an analysis of the validity threats. To ensure reproducibility, the datasets used, the configurations, and the detailed results are available at http://bit.ly/how_do_I_update_my_model.

Discussion
Answering RQ1. The AUC of all models, for the two experimental settings 10%-70% and 40%-40% and for all datasets, is reported in Tables 5a and 5b. The percentage of gain or loss is reported together with a histogram of gains and losses. In order to ease the comparison, M2 is reported in both tables. By looking at the tables, we can immediately see that M2 and M3 are the clear best performers, especially in the first setting 10%-70%, and that M0 is almost consistently the worst performer, often with a significant difference in terms of accuracy. This highlights that the need to update the predictive models is a real issue in typical Predictive Process Monitoring settings. The only exception to this finding is provided by the results obtained for the BPIC15 dataset, where the performance of M0 is comparable to that of the other models for almost all the outcome formulae. This is due to the fact that, for this dataset, there is a high homogeneity of the process behaviour over time, as confirmed by the entropy values in Table 2, which remain quite stable across the entire log. Moreover, we can observe that BPIC12 with labelling ϕ23 has overall the lowest accuracy for all four update strategies. This is possibly due to the high label unbalance characterising this dataset (see Table 3).
The performance of all the evaluated strategies is overall higher in the 40%-40% setting, and this is likely related to the higher amount of data used for hyperparameter optimisation. Nonetheless, the lower performance of M0 also in the 40%-40% setting indicates the need to update the models with new data at regular intervals. Concerning possible differences between the results obtained using datasets with and without an explicit Concept Drift, our experiments did not find any striking variation among the different strategies, thus consolidating the finding that devising update strategies is important in general, also in scenarios where the process changes over time are less definite. Nonetheless, if we look at Tables 6a and 6b, it is easy to see that the experiments with an explicit Concept Drift are the ones with the greatest difference between M0 and all the other models, thus confirming that an explicit Concept Drift can have a significant negative influence on the performance of a Predictive Process Monitoring model, if the model is not updated. This is especially true for BPIC18, in which the Concept Drift strongly affects the entropy measures (Table 2): the entropy value of the split 0%-40%-80%-100% is higher than that of the split 40%-80%-80%-100% for both entropy metrics.
Tables 6a and 6b also show another interesting aspect of our evaluation: while M2 and M3 tend to always gain against M0 (or to be stable in very few cases), the same cannot be said for M1. In fact, if we look at BPIC12 with labelling ϕ23 and BPIC15 with labelling ϕ32 in Table 6a and, particularly, at DriftRIO with labelling ϕ52 in Table 6b, we can see a decrease in accuracy. By carrying out a deeper analysis of the chosen hyperparameters, we found that the lower accuracy of M1 is due to the inappropriateness, for the new data used to build M1, of the hyperparameters derived from TR0. While this aspect may need to be better investigated, we can conclude that, while re-train with no hyperopt is usually a viable solution, it is nonetheless riskier than full re-train or incremental update.
The general findings and trends derived from the results obtained using Random Forest as classifier are further confirmed, with few exceptions, by the results obtained using Perceptron [29] as predictive model in the analysis of the four update strategies. The Perceptron results are reported in Appendix A.
To sum up, concerning RQ1, our evaluation shows that full re-train and incremental update are the best performing update strategies in terms of accuracy, followed by re-train with no hyperopt. With the exception of BPIC15 in the 10%-70% setting, do nothing is, often by far, the worst strategy, indicating the importance of updating the predictive models with new data when it becomes available.
Answering RQ2. The time spent for creating the four models, for the two experimental settings 10%-70% and 40%-40% and for all datasets, is reported in Tables 7a and 7b using the "hh:mm:ss" format. The best results for each dataset and labelling (that is, the lowest execution times) are emphasised in italic, while execution times that differ by less than 60 seconds from the best ones are indicated in bold. The last two rows of each table report the average time (and standard deviation) necessary to train a model for a given strategy. In order to ease the comparison, M2 is reported in both tables. M0 and M2 are self-contained models that are "built from scratch" and, therefore, the time reported in the tables for these models is the time spent to train them. Differently, M1 and M3 are built in a two-step fashion that includes a training phase but also reuses the hyperparameters used to build M0. Therefore, their construction time is measured by summing the time spent for the training phase and the time spent for building M0.
By looking at Tables 7a and 7b, we can immediately see that, among all the evaluated strategies, M0 is the clear best performer, especially for the 10%-70% setting, and that M2 is almost consistently the worst performer, often with a significant difference in terms of time spent to build the model (with a few exceptions in the 40%-40% case that we discuss below). As a second general observation, we note that M1 and M3 share almost the same construction time as M0. This fact is not particularly surprising, as the hyperparameter optimisation routine is often the most expensive step in the construction of this type of predictive model. Therefore, the two strategies S1 and S3 that underlie the construction of these models are highly inexpensive along the time dimension, especially when M0 is already available.
If we compare the two experimental settings 10%-70% and 40%-40%, we can observe that, while M0 is almost always the best performer in both settings, the difference between the construction time of M0 (and thus of M1 and M3) and that of M2 is significantly higher in the 10%-70% setting. While the investigation of when it is convenient to perform a full re-train is out of the scope of this paper and is left to further investigations, this finding emphasises that the cost of a full re-train may increase significantly if the update of the predictive model is over-delayed and the amount of new data greatly increases. Interestingly, the 40%-40% setting presents two cases in which M2 is the fastest model to build. The case of BPIC11 with labelling ϕ12 is likely due to an "unfortunate" guess-estimate in the hyperparameter optimisation step for M0, which makes the training time explode; the case of BPIC15 with labelling ϕ31, instead, represents a situation in which M0 and M2 take almost the same time to build. Concerning possible differences between datasets with and without an explicit Concept Drift, our experiments did not find any striking difference among the evaluated strategies. Finally, our evaluation did not find any fixed correlation between the training times of all strategies and (i) the size of the dataset and its alphabet within the same settings, or (ii) the quality of the predictive model in terms of accuracy (and thus the difficulty of the prediction problem). As an example of the first, we can observe that BPIC12 contains four times the cases of BPIC11; nonetheless, most of the prediction models built for BPIC11 take more time to construct than the ones built for BPIC12. Similarly, BPIC15 has an alphabet with a number of activities that is almost 10 times that of BPIC18, but the prediction models built for BPIC18 take much more time to construct than the ones built for
BPIC15. As an example of the second, we can observe, from Table 7a, that the time needed for building M0 for BPIC11 with labelling ϕ23 is much greater than that needed for building M0 for BPIC11 with labelling ϕ13, even if the accuracy for the same cases, in Table 5a, follows the inverse trend. To sum up, concerning RQ2, our evaluation shows that, once M0 is available, incremental update and re-train with no hyperopt are the two most convenient update strategies, as they can be built in almost no time. This suggests the possibility of implementing an almost continuous update strategy whenever new data becomes available. While the investigation of when it is convenient to perform a full re-train is out of the scope of this paper, our experiments show that the cost of a full re-train may increase significantly if the update of the predictive model is over-delayed and the amount of new data increases substantially.
Overall Conclusions. The plots in Figures 4 and 5 show inaccuracy and time related to the 10%-70% and 40%-40% settings, respectively, for each of the considered datasets. The closer an item is to the origin, the better the balance between the time required for training, re-training, or updating the model and the accuracy of the results. By looking at the plots, it is clear that the worst choice in terms of balance is given by M0, while, for the other three models, the choice somehow depends on the dataset and on the labelling. With the only exceptions of ϕ12, ϕ21, ϕ22, and ϕ33 for both settings, as well as of ϕ11 for the 10%-70% setting and of ϕ61 for the 40%-40% setting, for all other datasets and labellings M1 and/or M3 are the only non-dominated update strategies, i.e., those strategies for which no other strategy exists that improves both the inaccuracy and the time dimension. To conclude, our evaluation shows that the do-nothing strategy is not viable, as the accuracy of a non-updated model tends to decrease significantly for typical real-life datasets (with and without explicit Concept Drift), whereas lightweight update strategies, such as incremental update and re-train with no hyperopt, are instead often extremely effective in updating the models. Full re-train almost always achieves the best accuracy (or an accuracy in line with the best one). Nonetheless, its training time may increase significantly, especially in the presence of an abundance of new data. According to our experiments, the incremental update is able to keep up with the full re-train strategy and deliver a properly fitted model almost in real time, suggesting that the potential of incremental models is under-appreciated in Predictive Process Monitoring and that smart Predictive Process Monitoring solutions could be developed leveraging this update strategy.

Cost-effectiveness analysis
In order to get a better grasp of the cost-effectiveness of the different update strategies, we also investigated the costs required by the update strategies when new batches of data become available over time. In particular, given a set of train set batches TR0, TR1, ..., TRn, and the corresponding test set batches TE0, TE1, ..., TEn (where TE0 is the test set immediately following TR0, TE1 the test set immediately following TR1, and so on in the temporal timeline), we can define CE_M(TR0, TR1, ..., TRn, TE0, TE1, ..., TEn), from here on shortened as CE_M(TR0...n, TE0...n), as the cost of a model trained with the n batches of arriving data TR0, TR1, ..., TRn and tested with the corresponding batches of test data TE1, ..., TEn.
In our scenario, the cost-effectiveness of the update strategies is characterised by two main aspects: on the one hand, the cost of the time required for building the model and, on the other, the cost of returning wrong predictions (prediction inaccuracy). We can hence define CE_M(TR0...n, TE0...n) as the sum of (i) CT_M(TR0...n, TE0...n), i.e., the cost of the time required for building, training and, when necessary, re-training the model M whenever a new batch of traces arrives; and (ii) CI_M(TR0...n, TE0...n), i.e., the cost of the inaccuracy due to wrong predictions returned by the trained models on the traces of the test sets. Low values for a given model indicate good cost-effectiveness.
Defining CT_T(TR), CT_HT(TR), and CT_U(TR) as the time required for training, for training and optimising the hyperparameters, and for incrementally updating a model with the train set TR, respectively, the time costs related to the four models can be computed as reported in Equation 1. The time cost for M0 is given by the sole cost required for training and optimising the hyperparameters on TR0. For M1 (M3), besides the cost for training and optimising the hyperparameters on TR0, when the i-th train set batch is available, the cost for training (updating) the model with the union of the train set batches up to the i-th one (with the i-th train batch) has also to be considered. Finally, the time cost for M2 includes, at the arrival of each new batch, the cost of re-training and optimising the hyperparameters on the union of all train set batches available so far.
The inaccuracy costs are instead reported in Equation 2. For M0, the inaccuracy cost is given by the sum of the costs obtained by providing predictions using the model M0 (trained on TR0) on each new test set batch TE_i. For the other three models, instead, the cost of inaccuracy at the arrival of the train set batch TR_i is given by the inaccuracy cost obtained from models trained, updated and/or optimised on a train set that takes into account the information of all train set batches up to TR_i, and evaluated on the test set batch TE_i. The overall inaccuracy cost is given by the sum of the costs over each new training and test batch TR_i and TE_i. We assume we can approximate the inaccuracy cost of the model M0 tested on the i-th test set batch TE_i, that is CI_M0(TR0, TE_i), with the inaccuracy cost of M0 tested on TE1 plus an extra inaccuracy cost δ_i^M0, i.e., CI_M0(TR0, TE_i) = CI_M0(TR0, TE1) + δ_i^M0. Similarly, the inaccuracy costs of the other three models, CI_M(⋃_{j=0..i} TR_j, TE_i), can be approximated with the inaccuracy cost computed on the first test set plus an extra inaccuracy cost δ_i^M. Table 8 reports the instantiation of the cost-effectiveness framework with δ_i^M = 0 and the experimental setting 40%-40%.
Depending on the number of available data batches, differences can exist as to the best update strategy, although M3 seems to be consistently cheaper for most of the tested settings.
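The cost-effectiveness computation CE_M = CT_M + CI_M described above can be sketched as follows. The per-batch cost values are invented for illustration and are not taken from Table 8.

```python
# Illustrative instantiation of CE_M = CT_M + CI_M over a stream of batches.
def cost_effectiveness(time_costs, inaccuracy_costs):
    """CT_M: time cost paid at each training/updating step;
    CI_M: inaccuracy cost paid on each test set batch."""
    return sum(time_costs) + sum(inaccuracy_costs)

# Made-up numbers: do nothing (M0) pays one expensive hyperopt training but
# accumulates inaccuracy; incremental update (M3) adds cheap updates while
# keeping inaccuracy flat.
ce_m0 = cost_effectiveness([10.0], [2.0, 3.0, 4.0])
ce_m3 = cost_effectiveness([10.0, 0.5, 0.5], [2.0, 2.0, 2.0])
print(ce_m0, ce_m3)  # 19.0 17.0
```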

Threats to Validity
The main threats affecting the validity of the evaluation carried out are external validity threats, which limit the generalisability of the results. Indeed, although we investigated the usage of different update strategies on different types of labellings, we limited the investigation to outcome predictions and to classification techniques typically used with this type of prediction. We plan to inspect other types of predictions, i.e., numeric and sequence predictions, together with the techniques typically used with them, i.e., regression and deep learning techniques, in future work.
Finally, the lack of an exhaustive investigation of the hyperparameter values affects the construct validity of our experimentation. We limited this threat by using standard techniques for hyperparameter optimisation [5].

Related Work
To the best of our knowledge, no other work exists on the comparison of update strategies for Predictive Process Monitoring models, with the exception of the two by Pauwels and Calders [31] and by Maisenbacher and Weidlich [26]. We hence first position our work within the Predictive Process Monitoring field and then provide a specific comparison with Pauwels and Calders [31] and Maisenbacher and Weidlich [26].
We can classify Predictive Process Monitoring works based on the types of predictions they provide. A first group of approaches deals with numeric predictions and, in particular, predictions related to time [43,16,35]. A second group of approaches focuses on the prediction of next activities. These approaches mainly use deep learning techniques, specifically techniques based on LSTM neural networks [38,14,9,7,39]. These studies have shown that, when the datasets are large, deep learning techniques can outperform those based on classical Machine Learning. A third group of approaches deals with outcome predictions [40,25,12,22], which are the ones we focus on. A key difference between these works and the work presented in this paper is that we do not aim at proposing/supporting a specific outcome prediction method; rather, we aim at evaluating different update strategies.
The work by Pauwels and Calders [31] leverages deep learning models to address the challenge of next activity prediction in the context of incremental Predictive Process Monitoring. The goal of their paper is two-fold: they explore different strategies to update a model over time for next-activity prediction, and they investigate the potential of neural networks for the incremental Predictive Process Monitoring scenario. The goal is reached by (i) identifying different settings related to the data to use for training, updating, and testing the models, both in a static and in a dynamic scenario; and (ii) showing the positive impact of catastrophic forgetting of deep learning models for the Predictive Process Monitoring use-case. In our work, we focus on another type of techniques/predictions, i.e., we aim at investigating the potential of classical Machine Learning models in the Predictive Process Monitoring scenario for the prediction of an outcome.
The work by Maisenbacher and Weidlich [26] is the only one we are aware of that exploits classical incremental Machine Learning in the context of Predictive Process Monitoring. The goal of that paper is to show the usefulness of incremental techniques in the presence of Concept Drift. This is demonstrated by performing an evaluation over synthetic logs that exhibit different types of Concept Drift. In our work, we aim at comparatively investigating four different model update strategies (including incremental update), both in terms of accuracy of the results and in terms of time required to update the models. We carry out our evaluation on real-life and synthetic logs with and without an explicit Concept Drift.

Conclusion
In this paper, we have provided a first investigation of different update strategies for Predictive Process Monitoring models in the context of the outcome prediction problem. In particular, we have evaluated the performance of four update strategies, namely do nothing, re-train with no hyperopt, full re-train, and incremental update, applied to Random Forest, the reference technique for outcome-oriented predictions, on a number of real and synthetic datasets with and without explicit Concept Drift. The cost-effectiveness of the different update strategies has been evaluated in the simple case where only one train and one test set are available, and in the more complex scenario where new batches of data become continuously available. The results show that the need to update a Predictive Process Monitoring model is real for typical real-life event logs (regardless of the presence of an explicit Concept Drift). They also show the potential of incremental learning strategies for Predictive Process Monitoring in real environments. An avenue for future work is the extension of our evaluation to different prediction problems, such as remaining time and sequence predictions, which would, in turn, extend the evaluation to different reference Machine Learning techniques, such as regression and LSTM, respectively. Also, a deeper investigation of the proposed cost-effectiveness framework in the context of the proposed update strategies will allow us to come up with more detailed best practices to guide the user in understanding which strategy is the most appropriate under specific contextual conditions.
To conclude, we believe that the potential of incremental models is under-appreciated in the Predictive Process Monitoring field. To allow researchers to better understand the usefulness of the update strategies proposed in this paper, we have made them readily available in the latest release of Nirdizati [34].

A Further Results
We report in this appendix the results obtained using Perceptron [29] (rather than Random Forest) as classifier. Tables 9a and 9b show the results obtained with this classifier for the different strategies and the different considered settings. Also for these experiments, the best result for each dataset and labelling is emphasised in italic, while the results that differ from the best ones by less than 0.01 are emphasised in bold.
As in the case of Random Forest, we can observe that the most accurate strategies are M2 and M3, with the exception of the datasets BPIC12 and BPIC15 with labelling ϕ32 in the setting 40%-40%, for which the best results are obtained with M0. Moreover, we can observe a significant difference in performance with respect to Random Forest for BPIC18: the accuracy obtained with Perceptron for this dataset is very low for all four update strategies. Also with Perceptron, the accuracy is overall better in the 40%-40% setting than in the 10%-70% one, and no relevant differences in terms of winning strategies can be observed between datasets with and without an explicit Concept Drift.
Overall, the evaluation with Perceptron confirms that M2 and M3 (i.e., full re-train and incremental update, respectively) are the best performing update strategies in terms of accuracy.
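The reason a perceptron lends itself to the incremental-update strategy is its online update rule: each new completed case can adjust the weights without re-training from scratch. The following is a minimal didactic sketch of that rule, not the implementation used in the experiments.

```python
# Minimal online perceptron update rule: weights change only on a
# misclassified example, so new cases can be absorbed one at a time.
def perceptron_update(w, b, x, y, lr=1.0):
    """Update (w, b) on one example; y in {-1, +1}."""
    pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
    if pred != y:
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b

w, b = [0.0, 0.0], 0.0
stream = [([1, 0], 1), ([0, 1], -1), ([1, 0], 1)]  # cases arriving over time
for x, y in stream:
    w, b = perceptron_update(w, b, x, y)
print(w, b)  # [1.0, -1.0] 0.0
```

Each call costs time proportional to the feature vector length, which is why the incremental strategy's update step is nearly free compared with a full re-train.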
E 2 H i c d Z P f b 9 M w E M e 9 L c A o v z Z 4 5 C W i T P C A q g Y h w W O 7 j r E J M Q Z b t 0 F T T Y 5 z a a w m d m Q 7 a y s r E m + I V 3 i F P 4 Q / h f 8 G J 6 2 m J m 1 P S n K 6 z 3 1 9 d 3 b s J R G V q t n 8 t 7 a + Y d 2 4 e W v z d u 3 O 3 X v 3 H 2 x t P z y T 3 8 D 5 j 0 u / g = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " l I 3 S J f a Y M J H + K t k 9 6 k n M e r S L 0 K c = " > A A A E 2 H i c d Z P f b 9 M w E M e 9 L c A o v z Z 4 5 C W i T P C A q g Y h w W O 7 j r E J M Q Z b t 0 F T T Y 5 z a a w m d m Q 7 a y s r E m + I V 3 i F P 4 Q / h f 8 G J 6 2 m J m 1 P S n K 6 z 3 1 9 d 3 b s J R G V q t n 8 t 7 a + Y d 2 4 e W v z d u 3 O 3 X v 3 H 2 x t P z y T 3 8 D 5 j 0 u / g = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " z X s I e z g Y R 3 E N a b Q Y G 2 u j K n L j K U M = " > A A A E 2 H i c d Z P f b 9 M w E M e 9 L c A o v z Z 4 5 C W i m u A B V c 2 E B I / r O s Y m x B h s 3 Q Z L N T n O p b G a 2 J H t r K 2 s S L w h X u E V / h D + F P 4 b n D S a m r Q 9 K c n p P v f 1 3 d m x l 0 R U q n b 7 3 8 r q m n X r 9 p 3 1 u 4 1 7 9 x 8 8 f L
s 7 5 b P 2 a p 6 2 u l 5 h m q m P X z P / Z 0 v B U = < / l a t e x i t > S0(M0) = < l a t e x i t s h a 1 _ b a s e 6 4 = " g 6 + b b O K 8 e k 1 H S J b + 7 P b e k J J v h S 0 = " > A A A E 6 3 i c d Z P f b 9 M w E M e 9 L c A o v z p 4 g 5 e I a t J 4 q Z I J i b 0 g r e s Y m x B j 0 H W b t F S V 4 1 x a q 4 k d 2 c 7 a y o r E / 8 A b 4 h V e 4 Z U / h f 8 G p 4 2 m N m 1 P S n K 6 z 3 1 9 d 3 b s s h a 1 _ b a s e 6 4 = " g 6 + b b O K 8 e k 1 H S J b + 7 P b e k J J v h S 0 = " > A A A E 6 3 i c d Z P f b 9 M w E M e 9 L c A o v z p 4 g 5 e I a t J 4 q Z I J i b 0 g r e s Y m x B j 0 H W b t F S V 4 1 x a q 4 k d 2 c 7 a y o r E / 8 A b 4 h V e 4 Z U / h f 8 G p 4 2 m N m 1 P S n K 6 z 3 1 9 d 3 b s s h a 1 _ b a s e 6 4 = " g 6 + b b O K 8 e k 1 H S J b + 7 P b e k J J v h S 0 = " > A A A E 6 3 i c d Z P f b 9 M w E M e 9 L c A o v z p 4 g 5 e I a t J 4 q Z I J i b 0 g r e s Y m x B j 0 H W b t F S V 4 1 x a q 4 k d 2 c 7 a y o r E / 8 A b 4 h V e 4 Z U / h f 8 G p 4 2 m N m 1 P S n K 6 z 3 1 9 d 3 b s l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " g 6 + b b O K 8 e k 1 H S J b + 7 P b e k J J v h S 0 = " > A A A E 6 3 i c d Z P f b 9 M w E M e 9 L c A o v z p 4 g 5 e I a t J 4 q Z I J i b 0 g r e s Y m x B j 0 H W b t F S V 4 1 x a q 4 k d 2 c 7 a y o r E / 8 A b 4 h V e 4 Z U / h f 8 G p 4 2 m N m 1 P S n K 6 z 3 1 9 d 3 b s J x G V y n H + r a 1 v W H f u 3 t u 8 X 3 n w 8 N H j J 9 W t p x e S p 4 J A m / C I i y s f S 4 g o g 7 a i K o K r R A C O / Q g u / U E z 5 5 c 3 I C T l 7 F y N E + j E u M d o S A l W J t S t P t c e w Z H d y r r O z t T 9 a N x X 9 l u 7 0 q 3 H 3 a n j m 8 p 9 j y 2 Z 0 x C d N 8 J h e 9 4 M W h a x 2 a r T C Z G 9 7 F 7 9 a k 8 7 5 1 u b n r v p f X 7 d 3 N k u r / 0 y e o F e o g 3 k o T d o B x 2 g Y 3 S G C P q K f q H f 6 E + D N b 4 1 v j d + j F M X F 0 r N M 1 S x x s / / k 2 n B L g = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " q r c r o 1 w H 3 a n j m 8 p 9 j y 2 Z 0 x C d N 8 
J h e 9 4 M W h a x 2 a r T C Z G 9 7 F 7 9 a k 8 7 5 1 u b n r v p f X 7 d 3 N k u r / 0 y e o F e o g 3 k o T d o B x 2 g Y 3 S G C P q K f q H f 6 E + D N b 4 1 v j d + j F M X F 0 r N M 1 S x x s / / k 2 n B L g = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " q r c r o 1 w H 3 a n j m 8 p 9 j y 2 Z 0 x C d N 8 J h e 9 4 M W h a x 2 a r T C Z G 9 7 F 7 9 a k 8 7 5 1 u b n r v p f X 7 d 3 N k u r / 0 y e o F e o g 3 k o T d o B x 2 g Y 3 S G C P q K f q H f 6 E + D N b 4 1 v j d + j F M X F 0 r N M 1 S x x s / / k 2 n B L g = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " q r c r o 1 w H 3 a n j m 8 p 9 j y 2 Z 0 x C d N 8 J h e 9 4 M W h a x 2 a r T C Z G 9 7 F 7 9 a k 8 7 5 1 u b n r v p f X 7 d 3 N k u r / 0 y e o F e o g 3 k o T d o B x 2 g Y 3 S G C P q K f q H f 6 E + D N b 4 1 v j d + j F M X F 0 r N M 1 S x x s / / k 2 n B L g = = < / l a t e x i t > S2(M0) = < l a t e x i t s h a 1 _ b a s e 6 4 = " 6 4 b q t e x i t s h a 1 _ b a s e 6 4 = "x b k S g O x U D H L o D s v U 2 A g G A E w u 5 p U = " > A A A E 2 n i c d Z P f b 9 M w E M e 9 r c A o v z Z 4 5 C W i m o S E V C W 8 b I / t O s Y m p D G 2 dp v U V J P j X B q r i W 3 Z z t r K y g t v i F d 4 B P 4 O / h T + G 5 y 2 m t q 0 P S n J 6 T 7 3 9 d 3 Z c S A S q r T r / t v Y 3 K o 8 e P h o + 3 H 1 y d N n z 1 / s 7 L 6 8 U j y T H x / 7 9 y S 5 k C C D 5 4 m k U b / u z m 8 s 8 X 6 2 x Y y v Q a 1 T N O d W e s 6 A b B G s 0 n w 6 X j m 8 p 9 y O 2 Z E 1 D d N 0 I p 6 1 1 M 2 h Z x F a r 2 n M i e 9 m 9 8 t V e d q 7 e 1 z 2 3 7 n 2 x t / 4 A T W 0 b r / P g P 7 m W + K Q = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " C y K L s 4 B d P H 4 R P / t k / S J g U y m d b 0 Y = " > A A A E 2 n i c d Z P P b 9 M w F M e 9 r c A o v z Y 4 c o m o 6 4 3 j Y / / + Z B c y B J B s 8 T T y t / 3 Z z W W W r d b Y s R X o N a r a n G r P W d A N g j W a z 4 d L x 7 e U + w l b s q Y h u m 6 E R n 3 d D F r m s d W q 5 p z I X n a v e L W X n a v 3 V c 
+ t e u f 2 1 u + j q W 2 j 1 + g N e o s 8 9 A E d o B N 0 h l q I I I 5 + o F / o d 8 k v 3 Z W + l r 5 N U z c 3 Z p p X a M F K 3 / 8 D Z P u / r w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " C y K L s 4 B d P H 4 R P / t k / S J g U y m d b 0 Y = " > A A A E 2 n i c d Z P P b 9 M w F M e 9 r c A o v z Y 4 c o m o 4 b 4 a y z b g Y l 8 t h q V X d B Z C 6 7 W 7 / a y 8 7 1 + 5 b r t N y v T v P w o L z 2 2 + g N e o v e I R d 9 Q I f o F F 2 g H i K I o 1 / o N / p j e d Y 3 6 7 v 1 Y 5 6 6 u V F q X q O K W T / / A 5 z + v H E = < / l a t e x i t > T R1 < l a t e x i t s h a 1 _ b a s e 6 4 = " z g R z 1 I F b u B o 6 M n x d + a 3 q P T S d 7 B A = " > A A A E 2 n i c d Z P f b 9 M w E M e 9 L c A o v z Z 4 5 C W i m o S E V C W 8 b I / t O s Y m p D G 2 d p v U V J P j X F q r i W 3 Z z t r K y g t v i F d 4 B P 4 O / h T + G 5 y 2 m t q 0 P S n J 6 T 7 3 9 d 3 Z c S g S q r T n / d v Y 3 H I e P H y 0 / b j y 5 O m z 5 y 9 2 d l 9 e K Z 5 J A m 3 C E y 5 8 5 8 7 5 6 n y b p m 5 u z D S v 0 I I 5 3 / 8 D a W e / s A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " h r n / 5 + h 9 G a h e g Z d U v y e F N j W z r S E = " > A A A E 2 n i c d Z P P b 9 M w F M e 9 L c A o v z Y 4 c o m o 8 5 8 7 5 6 n y b p m 5 u z D S v 0 I I 5 3 / 8 D a W e / s A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " P 2 A J Q 5 d 0 K n v P 2 B l 3 H W F R C o n e I d Q = " > A A A E 2 n i c d Z P P b 9 M w F M e 9 L c A o P 7 b 7 e 5 3 3 9 3 r N j P 4 m o V I 7 z b 2 N z y 3 r 0 + M n 2 0 8 a z 5 y 9 e 7 u z u v b q

Figure 3: The experimental settings used to build the predictive models.

Table 5: The accuracy results. The best result for each dataset and labelling is emphasised in italics. Since, in many cases, different models have very close accuracy, we have emphasised in bold the results that differ from the best ones by less than 0.01. Tables 6a and 6b report, for the two experimental settings 10%-70% and 40%-40%, the percentage of gain/loss of M1, M2, and M3 w.r.t. M0.
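The gain/loss percentages reported in Tables 6a and 6b amount to the relative change of each updated model's accuracy with respect to the baseline model M0. As a minimal sketch (the function name and the example accuracy values are illustrative, not taken from the evaluation):

```python
def pct_gain(acc_new: float, acc_base: float) -> float:
    """Percentage gain (positive) or loss (negative) of an updated model's
    accuracy with respect to the baseline model's accuracy."""
    return 100.0 * (acc_new - acc_base) / acc_base

# Hypothetical accuracies: baseline M0 at 0.80, updated model M1 at 0.82.
print(round(pct_gain(0.82, 0.80), 2))  # 2.5, i.e. a 2.5% gain over M0
```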

Table 7: The time results.

Table 9: The accuracy results related to the Perceptron model.
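The Perceptron lends itself naturally to the incremental strategies evaluated here, since its weights can be adjusted one completed case at a time rather than rebuilding the model from scratch. As an illustrative sketch only (not the paper's actual implementation or feature encoding), a minimal online perceptron with case-by-case updates might look like:

```python
class IncrementalPerceptron:
    """Minimal online perceptron: weights are updated case by case, so the
    model keeps learning as newly completed cases become available."""

    def __init__(self, n_features: int, lr: float = 1.0):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        # Linear score followed by a threshold at zero.
        s = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if s >= 0 else 0

    def update(self, x, y):
        # Standard perceptron rule: adjust weights only on misclassification.
        err = y - self.predict(x)
        if err != 0:
            self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * err


# Incremental use: feed each completed case (features, outcome label)
# as it arrives; the data below is purely synthetic.
model = IncrementalPerceptron(n_features=2)
for x, y in [([1, 0], 1), ([0, 1], 0), ([1, 1], 1)] * 10:
    model.update(x, y)
```

The same update loop can run indefinitely as the event log grows, which is the key practical difference from the periodic-rediscovery strategies, where the whole model is retrained at fixed intervals.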