Proactive Process Adaptation Using Deep Learning Ensembles

Abstract. Proactive process adaptation can prevent and mitigate upcoming problems during process execution. Proactive adaptation decisions are based on predictions about how an ongoing process instance will unfold up to its completion. On the one hand, these predictions must have high accuracy, as, for instance, false negative predictions mean that necessary adaptations are missed. On the other hand, these predictions should be produced early during process execution, as this leaves more time for adaptations, which typically have non-negligible latencies. However, there is an important tradeoff between prediction accuracy and earliness. Later predictions typically have a higher accuracy, because more information about the ongoing process instance is available. To address this tradeoff, we use an ensemble of deep learning models that can produce predictions at arbitrary points during process execution and that provides reliability estimates for each prediction. We use these reliability estimates to dynamically determine the earliest prediction with sufficient accuracy, which is used as basis for proactive adaptation. Experimental results indicate that our dynamic approach may offer cost savings of 27% on average when compared to using a static prediction point.


Introduction
Proactive process adaptation can prevent the occurrence of problems and it can mitigate the impact of upcoming problems during process execution [1,18,31] by dynamically re-planning the flow of a running process instance [19,28,37]. Proactive process adaptation thereby can avoid contractual penalties or time-consuming roll-back and compensation activities.
Proactive process adaptation relies on predictive process monitoring to predict potential problems. Predictive process monitoring predicts how an ongoing process instance (a.k.a. case) will unfold up to its completion [8,16,22]. If a potential problem is predicted, adaptation decisions are taken at run time to prevent or mitigate the predicted problem. As an example, a delay in the expected delivery time for a freight transport process may incur contractual penalties [11]. If during the execution of such freight transport process a delay is predicted, faster transport services (such as air delivery instead of road delivery) can be proactively scheduled in order to prevent the delay.
With respect to predictions, there are two important requirements for proactive process adaptation. On the one hand, predictions must have high accuracy, as, for instance, false negative predictions mean that necessary adaptations are missed. On the other hand, predictions should be produced early during process execution, as this leaves more time for adaptations, which typically have non-negligible latencies. However, there is an important tradeoff between these two requirements. Later predictions typically have a higher accuracy, because more information about the ongoing process instance becomes available.
We address the aforementioned tradeoff by using ensembles of deep learning models. Ensemble prediction is a meta-prediction technique where the predictions of m prediction models are combined into a single prediction [30]. We use deep learning ensembles to produce predictions at arbitrary points during process execution. In addition, we compute reliability estimates for each prediction by computing the fraction of prediction models that predicted the majority class [19]. A high reliability indicates a high probability that the ensemble prediction is correct. We use these reliability estimates to dynamically determine the earliest prediction with sufficiently high reliability and use this prediction as basis for proactive adaptation. Experimental results based on four real-world data sets suggest that our dynamic approach offers cost savings of 27% on average when compared to using a fixed, static prediction point.
Section 2 provides a detailed problem statement and analysis of related work. Section 3 describes our approach. Section 4 provides its experimental evaluation.

Prediction Accuracy and Reliability
Problem. As mentioned above, one key requirement for proactive process adaptation is accurate predictions. Informally, prediction accuracy characterizes the ability of a prediction technique to forecast as many true violations as possible, while generating as few false alarms as possible [33]. Prediction accuracy is important for two main reasons [21]. First, accurate predictions deliver more true violations and thus trigger more required adaptations. Each missed required adaptation means one less opportunity for preventing or mitigating a problem. Second, accurate predictions mean fewer false alarms, and thus trigger fewer unnecessary adaptations. Unnecessary adaptations incur additional costs for executing the adaptations, while not addressing actual problems.
Previous research on predictive process monitoring (see [16] for an overview) focused on aggregate accuracy, such as precision or recall. Even though a high aggregate accuracy is beneficial, it does not provide direct information about the accuracy of an individual prediction. Knowing the accuracy of an individual prediction is important, because some predictions may have a higher probability of being correct than others. Proactive adaptation decisions are taken on a case by case basis. Therefore, the information about whether an individual prediction may be correct provides additional support for decision making [18,19].
Prediction techniques traditionally used for predictive process monitoring (such as decision trees, k-nearest-neighbors, support vector machines, and multilayer perceptrons [8,16]) can provide probabilities to indicate whether an individual prediction is correct (e.g., in the form of class probabilities of a decision tree). Yet, probabilities estimated by most of these prediction techniques are poor [38]. In contrast, so called reliability estimates (e.g., computed from ensembles of prediction models) can provide better estimates of the probability that an individual prediction is correct [2].

Related Work.
To improve aggregate prediction accuracy, deep learning techniques are being employed for predictive process monitoring [5,9,17,26,27,34]. In particular, Recurrent Neural Networks (RNNs) are employed, which are a special type of artificial neural network, where each neuron also feeds back information into itself [5,10]. Empirical evidence indicates that RNNs provide significant accuracy improvements for predictive process monitoring [5,20,34]. As an example, the empirical results of our previous work on using RNNs show an accuracy improvement of 36% when compared to multi-layer perceptrons [20]. Yet, the aforementioned approaches only consider aggregate accuracy and do not consider the accuracy of individual predictions for proactive adaptation decisions.
In the literature, some authors considered the probability that an individual prediction is correct. Maggi et al. [15] use decision tree learning for predictive process monitoring. As a follow-up, Francescomarino et al. [7] employ random forests (ensembles of decision trees) for prediction. Both use class probabilities of decision trees. They analyze how selecting predictions using class probabilities impacts aggregate prediction accuracy. They observe that using class probabilities may improve aggregate accuracy, but at the expense of losing predictions that are below a given probability threshold. Yet, they do not analyze in how far using these class probabilities may improve proactive process adaptation and whether it may offer cost savings.
In our earlier work, we used reliability estimates computed from ensembles of multi-layer perceptrons to decide on proactive adaptation [18,19]. If the reliability for a given prediction is equal to or greater than a predefined threshold, the prediction is used to trigger a proactive adaptation. In [19] we considered reliabilities computed from ensembles of classification models, which led to cost savings of up to 54% (14% on average). In [18] we also included the magnitude of a predicted violation (computed from ensembles of regression models) into the adaptation decision, which led to additional cost savings of up to 31% (14.8% on average). Yet, we used a fixed point for our predictions (the 50% mark of process execution), and thus did not consider the aspect of prediction earliness.

Prediction Earliness
Problem. Predictions can be made at different points during the execution of a process instance. The point during process execution for which a prediction is made is called checkpoint [13,22]. When determining checkpoints, there is an important tradeoff to be taken into account between prediction accuracy and the earliness of the prediction [13]. This is particularly important when predictions at a given checkpoint are used as basis for proactive process adaptation.
Typically, prediction accuracy increases as the process unfolds, as more information about the process instance becomes available. As an example, between the 25% mark and the 75% mark in process execution, accuracy may increase by 44% (as reported in [36]) and even by 97% (as reported in [22]). This means later predictions have a higher chance to be correct predictions, and thus one should favor later checkpoints as basis for proactive process adaptation.
However, waiting for the predictions of later checkpoints also means that the remaining time for proactively addressing problems becomes shorter [13]. This can be important as adaptations typically have non-negligible latencies, i.e., it may take some time until they become effective [23]. As an example, dispatching additional personnel to mitigate delays in container transports may take several hours. Also, the later a process is adapted, the fewer options may be available for adaptation. As an example, while at the beginning of a transport process one may be able to transport a container by train instead of ship, once the container is on-board the ship, such adaptation may no longer be feasible. Finally, if an adaptation is performed late in the process and turns out not to be effective, not much time may remain for any remedial actions or further adaptations. This means one should choose a rather early checkpoint.

Related Work.
In the literature, the tradeoff between prediction accuracy and earliness was approached from different angles. Several authors use prediction earliness as a dependent variable in their experiments. This means they evaluate their proposed predictive process monitoring techniques by considering prediction earliness in addition to prediction accuracy. As an example, Kang et al. [12], Teinemaa et al. [36], and we in our earlier work [20,22] measured the accuracy of different prediction techniques for the different checkpoints along process execution. Results presented in the aforementioned works clearly show the tradeoff between prediction earliness and accuracy. However, how to resolve the tradeoff between accuracy and earliness was not further addressed.
To increase the earliness of accurate predictions, several authors proposed new variants of prediction techniques. As an example, Teinemaa et al. investigate whether unstructured data may increase prediction earliness and accuracy [35]. Similarly, Leontjeva et al. exploit the data payload of process events to increase prediction earliness [14]. Finally, Francescomarino et al. investigate in how far hyper-parameter optimization [7] and clustering [6] can improve earliness.
A similar tradeoff between accuracy and earliness was investigated for time series classification, i.e., for predicting the class label of a temporally-indexed set of data points. The aim is to predict the final label of a time series with sufficiently high accuracy by using the lowest number of data points. Being able to accurately classify a time series early on facilitates early situation detection and thus may help to timely respond to risks and failures [24]. An additional motivation is to reduce the computational effort when compared with using the whole time series for prediction, which is of particular concern for resource- or power-constrained devices [25,29]. As an example, Mori et al. use probabilistic classifiers to produce a class label for a time series as soon as the probability at a checkpoint exceeds a class-dependent threshold [25].
The aforementioned works address different needs of earliness and accuracy by setting the available parameters, such as prediction reliability thresholds. However, they did not examine in how far the techniques have found a good trade-off between earliness and accuracy. Doing so requires quantifying the utility of the achieved trade-off, as comparing the techniques solely based on earliness and accuracy may not provide a fair comparison. To quantify such utility, we thus measure how the choice of the actual checkpoint for adaptation decisions impacts the overall costs of process execution.

Deep Learning Ensembles for Proactive Adaptation
To find a trade-off between earliness and accuracy, we exploit the fact that reliability estimates can provide information about the accuracy of an individual prediction. The key idea of our approach is to (i) dynamically determine, for each process instance, the earliest checkpoint that delivers a sufficiently high reliability, and (ii) use this checkpoint to decide on proactive adaptation.
We compute predictions and reliability estimates from ensembles of deep learning models, specifically RNN models. Ensemble prediction is a meta-prediction technique where the predictions of m prediction models, so called base learners, are combined into a single prediction [30].
Ensemble prediction is primarily used to increase aggregate prediction accuracy, while it also allows computing reliability estimates (see Sect. 2.1). Computing reliability estimates is the main reason why we use ensembles of RNN models in our approach. As an added benefit, predictions computed via such ensembles provide higher prediction accuracy than using a single RNN model. As an example, RNN ensembles provide an 8.4% higher accuracy when compared with a single RNN model (as used in [20]).
Figure 1 provides an overview of the main activities of our approach. The ensemble of RNN models creates a prediction T_j at each potential checkpoint j. In addition, it provides the reliability estimate ρ_j for this prediction. If the ensemble predicts a violation at checkpoint j, a proactive adaptation may be needed in order to prevent or mitigate the predicted violation. However, we only act on this prediction if its reliability ρ_j is equal to or greater than a predefined threshold, i.e., we only act if we consider the prediction reliable enough. Thereby, our approach dynamically determines the earliest checkpoint with sufficient reliability that is used as basis for proactive adaptation. This implies that the actual checkpoint chosen for a proactive adaptation decision will vary among the different process instances, in the same way the reliability estimates may be different for each prediction and each process instance.
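The dynamic checkpoint selection just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name and the representation of predictions as (label, reliability) pairs are our assumptions:

```python
def earliest_reliable_checkpoint(predictions, threshold):
    """Return the index of the earliest checkpoint whose predicted
    violation is sufficiently reliable, or None if no checkpoint
    triggers a proactive adaptation.

    `predictions` is a list of (T_j, rho_j) pairs, one per checkpoint j,
    where T_j is "violation" or "non-violation" and rho_j is in [0.5, 1].
    """
    for j, (t_j, rho_j) in enumerate(predictions):
        # Act only on a predicted violation whose reliability reaches
        # the predefined threshold theta; otherwise wait for the next
        # checkpoint, where more of the instance has been observed.
        if t_j == "violation" and rho_j >= threshold:
            return j
    return None
```

Because the loop stops at the first sufficiently reliable violation prediction, the chosen checkpoint naturally varies from instance to instance, as described above.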

RNNs as Base Learners
We use RNNs as base learners, i.e., as the individual models in the ensemble, as RNNs can handle arbitrary-length sequences of input data [10]. Thus, a single RNN can be employed to make predictions for business processes that have an arbitrary length in terms of process activities. In contrast, other prediction techniques (such as random forests or multi-layer perceptrons) either require training a prediction model for each of the checkpoints or they require a special encoding of the input data to train a single model [16,22,36]. However, these encodings entail information loss and thus may limit prediction performance.
RNNs also facilitate the scalability of our dynamic approach. Assume we have c checkpoints in the business process. A single RNN model can make predictions at any of these c checkpoints [5,34]. If we want to avoid information loss, other prediction techniques would require the training of c prediction models, one for each of the c checkpoints. Our exploratory performance measurements indicate a training time of ca. 8 min per checkpoint for multi-layer perceptrons on a standard PC, while the training time for an RNN was 25 min. This means that RNNs provide better scalability for our approach if the process has many potential checkpoints (c > 3 in our case).
We use RNNs with Long Short-Term Memory (LSTM) cells as they better capture long-term dependencies in the data [17,34]. Our implementation of these RNN base learners is available online. It exploits the Keras library running on top of TensorFlow.
However, RNNs also face specific challenges when used for predictive process monitoring. Even though the data that is fed into an RNN is sequential, i.e., a sequence of events, these events represent the execution of business processes which may include loops and parallel regions. Such non-sequential control flows can make prediction with RNNs more difficult [5,9,34], as RNNs were conceived for natural language processing, which is sequential by nature [10].
To address these difficulties, we employ the following two solutions (presented in earlier work [20]). First, instead of incrementally predicting the next process event until we reach the final event and thus the process outcome (such as proposed in [5,9,34]), we directly predict the process outcome. Thereby, we avoid the problem RNNs may have in predicting the next process activity when process execution entails loops with many repeated activities [9,34]. Second, we encode parallel process activities by embedding the branch information as an additional attribute of the respective process activity. Thereby, we address the problem that parallel process activities can make the prediction task more difficult [5].
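The second solution can be illustrated with a small encoding sketch. The activity and branch vocabularies below are hypothetical, and the actual encoding used in [20] may differ; the point is only that the parallel-branch identifier becomes an additional one-hot attribute of each event:

```python
def encode_event(activity, branch, activities, branches):
    """One-hot encode a process event, embedding the identifier of the
    parallel branch it belongs to as an additional attribute."""
    vec = [0.0] * (len(activities) + len(branches))
    vec[activities.index(activity)] = 1.0                 # activity one-hot
    vec[len(activities) + branches.index(branch)] = 1.0   # branch one-hot
    return vec
```

A sequence of such vectors, one per event observed so far, can then be fed into the RNN without any per-checkpoint re-encoding.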

RNN Ensembles
We use bagging (bootstrap aggregating [4]) as a concrete ensemble technique to build the base learners of our RNN ensemble. Bagging generates m new training data sets by sampling from the whole training data set uniformly and with replacement. For each of the m new training data sets an individual RNN model is trained. We use bagging with a sample size of 60% to increase the diversity of the RNN ensembles. Generating the ensembles using bagging also contributes to the scalability of our approach, as the training of the base learners can happen in parallel.
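The sampling step of bagging might look as follows. This is a sketch under our own naming (the training of one RNN per sample is omitted); only the 60% sample size and sampling with replacement are taken from the text above:

```python
import random

def bagging_samples(train_data, m, sample_frac=0.6, seed=42):
    """Draw m bootstrap samples from the training data, uniformly and
    with replacement; each sample has sample_frac of the original size
    (60% here, as used for the RNN ensemble). One base learner is then
    trained per sample, which can happen in parallel."""
    rng = random.Random(seed)
    n = int(sample_frac * len(train_data))
    return [[rng.choice(train_data) for _ in range(n)] for _ in range(m)]
```

Because each sample omits and duplicates different instances, the resulting base learners differ, which is what makes the reliability estimate (the degree of agreement among them) informative.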
For computing the ensemble predictions T_j and reliability estimates ρ_j for each checkpoint j, we employ the strategies defined in [18,19], because these strategies showed reasonably good results for a fixed checkpoint. Let us assume that at each checkpoint j, each of the m base learners of the ensemble delivers a prediction result T_{i,j}, with i = 1, ..., m, where T_{i,j} is either of class "violation" or "non-violation".
The ensemble prediction for checkpoint j is computed as a majority vote:

T_j = "violation"       if |{i : T_{i,j} = "violation"}| ≥ m/2
      "non-violation"   otherwise
The reliability estimate ρ_j for prediction T_j is computed as the fraction of base learners that predicted the majority class:

ρ_j = |{i : T_{i,j} = T_j}| / m
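Both quantities can be computed directly from the base-learner outputs. The following sketch uses string class labels for readability (the function name is ours):

```python
def ensemble_vote(base_predictions):
    """Combine the m base-learner predictions at one checkpoint into
    the ensemble prediction T_j (majority vote, ties resolved towards
    "violation" via the >= m/2 rule) and its reliability estimate
    rho_j (fraction of base learners in the majority)."""
    m = len(base_predictions)
    violations = sum(1 for t in base_predictions if t == "violation")
    t_j = "violation" if violations >= m / 2 else "non-violation"
    agreeing = violations if t_j == "violation" else m - violations
    return t_j, agreeing / m
```

Note that ρ_j is by construction at least 0.5, which is why the reliability threshold θ ranges over [.5, 1] in the experiments below.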

Cost Model
We aim to answer the question in how far determining checkpoints based on reliability estimates (our dynamic approach) compares to determining checkpoints based on aggregate accuracy (the static approach). To this end, we quantify and compare the costs of process execution and adaptation of these approaches.
We employ a cost model used in our previous work [19], which incorporates two cost factors of proactive process adaptation as shown in Fig. 2. The first cost factor is adaptation costs, as an adaptation of a running process typically requires effort and resources, and thus incurs costs. The second cost factor is penalties, which may be faced in two situations. First, a proactive adaptation may be missed. This can be due to a false negative prediction (i.e., a non-violation is predicted despite an actual violation), or because the reliability threshold was not reached, even though it was an actual violation. Second, a proactive adaptation may not be effective, i.e., the violation may persist after the adaptation.
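Under these two cost factors, the expected cost of a single process instance can be sketched as follows. This is an illustrative simplification that already assumes the constant costs and penalties used in the experiments below (c_a = λ · c_p), not the general cost model of [19]:

```python
def instance_cost(adapted, actual_violation, alpha, lam, penalty=1.0):
    """Expected cost of one process instance: adaptation costs
    c_a = lam * c_p whenever an adaptation is performed, plus the
    penalty c_p when a violation is missed (no adaptation) or when
    the adaptation is ineffective (probability 1 - alpha)."""
    c_a = lam * penalty
    if not adapted:
        # Missed adaptation: pay the penalty only on an actual violation.
        return penalty if actual_violation else 0.0
    # Adapted: always pay adaptation costs; an actual violation
    # persists with probability 1 - alpha despite the adaptation.
    return c_a + ((1 - alpha) * penalty if actual_violation else 0.0)
```

Summing this quantity over all process instances in the test set yields the total costs used as the dependent variable below.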

Experimental Variables
We consider cost as the dependent variable in our experiments. For each process instance, we compute its individual costs according to the cost model defined in Sect. 4.1. The total costs are the sum of the individual costs of all process instances in our test data set. The test data set comprises 1/3 of the process instances of the overall data set.
We consider the following independent variables (also shown in Table 1): Reliability threshold θ ∈ [.5, 1]. As introduced in Sect. 3, a proactive adaptation is triggered only if the reliability of a predicted violation is equal to or greater than a predefined threshold, which we name θ.
Relative adaptation costs λ ∈ [0, 1]. To be able to concisely analyze and present our experimental results, we assume constant costs and penalties (like we did in [19]). Thus, the costs of a process adaptation, c_a, are expressed as a fraction of the penalty for process violation, c_p, i.e., c_a = λ · c_p. We thereby can reflect different situations that may be faced in practice concerning how costly a process adaptation in relation to a penalty may be. Choosing λ > 1 would not make sense, as this leads to higher costs than if no adaptation is performed.
Adaptation effectiveness α ∈ (0, 1]. If an adaptation results in a non-violation, we consider such an adaptation effective (cf. Fig. 2). We use α to represent the fact that not all adaptations might be effective. More concretely, α represents the probability that an adaptation is effective. We do not consider α = 0 as this means that no adaptation is effective. To reflect the fact that earlier checkpoints may be favored as they provide more options and time for proactive adaptations (see Sect. 2.2), we vary α in our experiments in such a way that α linearly decreases over the course of process execution. This means that the probability for effective proactive adaptations diminishes towards the end of the process. To model this, we define α_max as the α for the first checkpoint in the process instance, and α_min as the α for the last checkpoint.
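The linear decrease of α between α_max and α_min can be written as a simple interpolation (a sketch; we assume 0-based checkpoint indices, which the paper does not specify):

```python
def adaptation_effectiveness(j, c, alpha_max, alpha_min):
    """Effectiveness alpha at checkpoint j (0-based) of c checkpoints,
    decreasing linearly from alpha_max at the first checkpoint to
    alpha_min at the last."""
    if c == 1:
        return alpha_max
    return alpha_max - (alpha_max - alpha_min) * j / (c - 1)
```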

Data Sets
We use four data sets from different sources. The Cargo2000 transport data set is the one we used in our previous work. The other three data sets are among the ones frequently used to evaluate predictive process monitoring approaches [3,5,32,34,36]. Table 2 provides key characteristics of these data sets, including the number of checkpoints we used in our experiments.

Experimental Results
Using two of the data sets as an example, Fig. 3 gives a first impression of the effect of our dynamic approach. The figure shows the costs of the dynamic approach (bold) and the costs of the static approach for each of the possible checkpoints (dashed). The right hand side of each chart shows the costs without any adaptation (expressed by θ > 1). These costs also serve as a baseline. We chose α_max = .9 and α_min = .5, reflecting the fact that early on in the process there is a high chance that adaptation is effective, whilst at the very end, this chance is only 50%. Also, we show the results for two values of λ (relative adaptation costs). Here, λ = .1 reflects a situation where adaptation is rather cheap, whereas λ = .4 reflects a situation where it is more expensive. The charts in Fig. 3 indicate that the dynamic approach provides cost benefits when compared to the static approach, in particular when adaptations are not too expensive (λ = .1). The charts for more expensive adaptations (λ = .4) show that cost savings may become less, and may even be negative, such as for BPIC2012, where the static approach performs better for three checkpoints.
Also, independent of the static or dynamic approach, it can be observed that if adaptation costs get higher, a higher threshold (and thus a more conservative stance in taking adaptation decisions) offers cost savings. While costs for λ = .1 are lowest for small thresholds, the situation is exactly opposite for λ = .4. The reason is that if adaptation costs are low (smaller λ), carrying out unnecessary adaptations is not so costly and thus it pays off not being too conservative (i.e., setting a lower threshold). However, if adaptation costs are high (greater λ), unnecessary adaptations can quickly become very costly, and thus being more conservative (i.e., setting a higher threshold) pays off.
To further explore the situations in which the dynamic approach offers cost savings, we performed a full-factorial experiment, combining all parameter settings of our independent variables as shown in Table 1. The results for all four data sets are presented in Table 3 for different, selected values of λ. For each λ, 6666 different situations were explored. Column A shows the number of situations for which proactive adaptation leads to lower costs than the baseline costs when not performing an adaptation. This is an important metric to contextualize our results, as proactive adaptation may not be beneficial in all situations. In particular, when adaptation costs are high and the chances for a successful adaptation are low, proactive adaptation may not help (e.g., see [19]). As can be seen in column A, the relative number of situations where proactive adaptation helps save costs diminishes as λ increases. As an example, while for λ = .05, proactive adaptation is beneficial in 85.4% of situations for Traffic and 75.7% for BPIC2012, this goes down to .8% and 1.4% respectively for λ = .85.
Column B shows the number of situations where the dynamic approach has lower costs than the static approach. This indicates that the dynamic approach indeed offers additional cost savings in many situations when compared to the static approach. As can be seen from the last column, the number of situations in which the static approach has lower costs than the dynamic approach is small (at around 3% on average). Again, the dynamic approach offers savings in the largest number of situations for smaller values of λ, where, to give an example, the number of situations reaches 82.4% for Cargo2000 and even 89.2% for BPIC2017.
The actual cost savings of the dynamic approach compared with the static one are shown in Table 4 for different values of λ and θ. Gray cells highlight where costs of the dynamic approach are less than the costs of the static one ('0' indicates that the costs of proactive adaptation are higher than the costs of not performing an adaptation). The table shows the cost savings for selected values of θ, averaged over all combinations of α. Higher thresholds imply that cost savings are achieved for higher values of λ. As an example, for Traffic a threshold of θ = .6 allows cost savings up to λ = .45, while θ = .8 allows cost savings up to λ = .65. The reason is that a higher threshold means that adaptation decisions are taken more conservatively. They are taken only if a prediction is highly reliable, which in turn implies that the number of unnecessary (and costly) adaptations is reduced.
However, being conservative comes at a risk. The cost savings for higher thresholds can become smaller than the cost savings for lower thresholds. As an example, while for Cargo2000 a threshold of θ = .5 leads to cost savings of up to 24%, this goes down to savings of only up to 16% for θ = 1. And it may even mean that the cost of proactive adaptation is higher than not performing any adaptation, as can be seen for θ = 1 for BPIC2012.
Overall (i.e., considering all possible situations), the average savings are 9.2% for Cargo2000, 27.2% for Traffic, 15.1% for BPIC2012, and 35.8% for BPIC2017. Across all four data sets, average savings are 27%. We conclude that the dynamic approach can deliver cost savings compared to the static approach, with a high chance that it is better than not performing any proactive adaptation at all.
In addition, the dynamic approach comes with the benefit that there is no need for an up-front decision on which checkpoint to use as basis for proactive adaptation, which is required in the static approach. In particular this means, that there is no need for a testing phase during which aggregate accuracies are computed in order to select a suitable static checkpoint.

Threats to Validity
Internal Validity. To minimize the risk of bias, we explored different ensemble sizes (ranging from 2 to 100). Literature indicates that smaller ensembles might perform better than larger ensembles ("many could be better than all" [38]). In our experiments, however, the size of the ensemble did not lead to different principal findings. Yet, by using a larger ensemble, we gain more fine-grained reliability estimates than by using a smaller ensemble.

External Validity.
To cover different situations that may be faced in practice, we specifically chose different reliability thresholds, different probabilities of effective process adaptations, as well as different slopes for how these probabilities diminish towards the end of process execution. In addition, we used four large, real-world data sets from different application domains, which differ in key characteristics. For the sake of generalizability, we used a naïve approach to select data from the event log, i.e., we used whatever data was available and did not perform any manual feature engineering or selection. For non-numeric data attributes a categorical encoding (one-hot) was used.
Construct Validity. We took great care to ensure we measure the right things. In particular, we used a cost model that was tested in our previous work. However, we have only used constant cost functions for adaptation costs and penalties. Yet, as we showed in our previous work [18], the shape of the cost functions can have an impact on savings. We aim to investigate the impact of such non-constant cost functions as part of our future work.

Conclusion Validity.
As our experiments indicate, the choice of cost model parameters (such as λ and α) impacts on whether proactive adaptation has a positive impact on cost. To address this threat, we have carefully identified influencing variables and varied them over the whole range of permissible values.

Conclusions and Perspectives
Dynamically determining which prediction along the process execution to use for proactive adaptation can offer cost savings. Our experimental results indicate average cost savings of 27% when compared to using a static prediction point. Such a dynamic approach thereby effectively addresses the tradeoff between prediction accuracy and prediction earliness. Also, the dynamic approach does not require a testing phase during which aggregate accuracies are computed in order to select a suitable static prediction point.
As part of our future work, we will extend our dynamic approach towards non-constant cost models. In particular, we will consider different shapes of penalties and different costs of adaptations. To this end, we will employ regression models to predict continuous indicators in order to quantify, for instance, the extent of deviations. Regression models will also facilitate computing more complex reliability estimates (e.g., ones that use the variance of the ensemble).