Fire Now, Fire Later: Alarm-Based Systems for Prescriptive Process Monitoring

Predictive process monitoring is a family of techniques to analyze events produced during the execution of a business process in order to predict the future state or the final outcome of running process instances. Existing techniques in this field are able to predict, at each step of a process instance, the likelihood that it will lead to an undesired outcome. These techniques, however, focus on generating predictions and do not prescribe when and how process workers should intervene to decrease the cost of undesired outcomes. This paper proposes a framework for prescriptive process monitoring, which extends predictive monitoring with the ability to generate alarms that trigger interventions to prevent an undesired outcome or mitigate its effect. The framework incorporates a parameterized cost model to assess the cost-benefit trade-off of generating alarms. We show how to optimize the generation of alarms given an event log of past process executions and a set of cost model parameters. The proposed approaches are empirically evaluated using a range of real-life event logs. The experimental results show that the net cost of undesired outcomes can be minimized by changing the threshold for generating alarms as the process instance progresses. Moreover, introducing delays for triggering alarms, instead of triggering them as soon as the probability of an undesired outcome exceeds a threshold, leads to lower net costs.


Introduction
Predictive process monitoring [1,2] is a family of techniques to predict the future state of running instances of a business process based on event logs recording past instances thereof. These techniques may provide predictions on the remaining execution time of a process instance (herein called a case), the next activity to be executed, or the final outcome of the process instance with respect to a set of possible outcomes. This article is concerned with the last type of predictive process monitoring, which we call outcome-oriented [3]. For example, in a lead-to-order process, outcome-oriented techniques may predict whether a case will lead to an order (desired outcome), or not (undesired outcome).
Existing techniques for outcome-oriented predictive process monitoring are able to predict, after each event of a case, the probability that the case will end with an undesired outcome. Yet, these techniques are restricted to prediction: they neither suggest nor prescribe how and when process workers should intervene to decrease the probability or cost of undesired outcomes.
A naive approach to turn a predictive process monitoring technique into a prescriptive one is to trigger an alarm when the probability that a case will lead to an undesired outcome is above a fixed threshold (e.g., 90%). This alarm can be linked to an intervention, such as calling the customer, offering a discount, etc. Yet, this naive approach may be far from optimal, as interventions induce a cost (e.g., time spent by workers in the intervention or forgone revenue) and an effect (e.g., preventing the undesired outcome altogether or only partially mitigating it). Moreover, the cost and the effect of an intervention may vary as the case advances. Late interventions may be less effective (and less costly) than early ones, in such a way that the net cost of intervening varies (possibly non-monotonically) as the case advances.
This article proposes a framework to extend predictive process monitoring techniques in order to make them prescriptive. The proposed framework extends a given predictive process monitoring model with a mechanism for generating alarms that lead to interventions, which, in turn, mitigate (or altogether prevent) undesired outcomes. The framework is equipped with a cost model that captures, among other things, the trade-off between the cost of an intervention and the cost of an undesired outcome. Based on this cost model, the article presents an approach to tune the generation of alarms so as to minimize the expected cost for a given event log and a set of parameters. The approach is empirically evaluated, under various configurations, using a collection of real-life event logs.
This article is an extended and revised version of a previous conference paper [4]. The conference version focused on the scenario where there is a single type of alarm leading to a single type of intervention (e.g., calling the customer). This article enhances the scope of the framework to consider multiple alarm types, each one leading to a different intervention (e.g., one alarm type leads to calling the customer while another one leads to offering a discount by email). Additionally, this article considers further factors that influence the effectiveness of alarms in practice. First, we add the possibility that the probability threshold above which an alarm is triggered may vary depending on how far the case has progressed. Second, we introduce the possibility of delaying the firing of an alarm to reduce false alarms stemming from instability in the predictive model.
The article is structured as follows. Section 2 presents the prescriptive process monitoring framework. Next, Section 3 outlines the approach to optimize the alarm generation mechanism, while Section 4 reports on the empirical evaluation. Section 5 discusses related work and Section 6 summarizes the contributions.

Prescriptive Process Monitoring Framework
This section introduces a framework for an alarm-based prescriptive process monitoring system. We first introduce basic notions of event logs in Section 2.1, before turning to the cost model of our framework in Section 2.2. For the sake of clarity, this model is first introduced for scenarios that allow for one type of alarm. The latter assumption is dropped in Section 2.3, extending the cost model to scenarios with multiple alarms. Finally, Section 2.4 builds on the cost model to formalize the concept of an alarm system.
Figure 1: Excerpt of an event log. Each row is an event; events with the same customer id are grouped into traces.

Event Log
Executions of process instances (a.k.a. cases) are recorded in so-called event logs. At the most abstract level, an event log can be represented in a tabular form. Figure 1 shows an excerpt of an event log that refers to the execution of the unemployment-benefit process in the Netherlands. Example 1 describes how the process is usually executed, which serves for illustration in the remainder of the paper. The description is the result of several interactions of one of the authors (see, e.g., [5]) with the company's stakeholders.
Example 1 (Unemployment Benefit). In the Netherlands, UWV is the institution that provides social-security insurances for Dutch residents who carry on job-related activities in the country. UWV is in charge of several insurances, of which one of the most relevant is the provision of unemployment benefit. When residents (hereafter customers) become unemployed, they are usually entitled to monthly monetary benefits for a certain period of time. UWV executes a process to determine the amount of these benefits, to manage the interaction with the customers, and to perform the monthly payments. These payments are stopped when the customer reports that he/she has found a new job or when the time period ends in which the customer is entitled to unemployment benefits.
Each row in Figure 1 corresponds to an event. An event represents the execution of an activity for a case with a certain identifier that occurred at a specific moment in time and was performed by a given resource. The case identifier can vary depending on the process: for the example in question, it coincides with the customer id. Events can be grouped by case identifier and ordered by timestamp, thus obtaining a sequence of events, a.k.a. a trace.
Figure 1 shows that events can be associated with various properties, namely the different columns of the event table; however, we assume that the activity name and the timestamp are properties that are always present. Given an event e of a log, π_T(e) and π_A(e), respectively, return the timestamp and the activity name of e. As an example, the trace of customer id 25879 in Figure 1 is ⟨e_1, e_2, ..., e_n⟩ where, e.g., it holds that π_T(e_1) = 2017-01-15 and π_A(e_1) = 'Initialize income form'. In the remainder, ⟨⟩ is the empty sequence, and σ_1 • σ_2 indicates the concatenation of sequences σ_1 and σ_2. Given a sequence σ, |σ| denotes its length, σ(i) its i-th element, and hd^k(σ) the prefix consisting of its first k elements. With these concepts at hand, a trace is defined as a sequence of events, and an event log as a set of traces:
Definition 1 (Trace, Event Log). Let E be the universe of all events. A trace is a finite non-empty sequence of events σ ∈ E^+ such that each event appears at most once and time is non-decreasing, i.e., for 1 ≤ i < j ≤ |σ| : σ(i) ≠ σ(j) and π_T(σ(i)) ≤ π_T(σ(j)). An event log is a set of traces L ⊂ E^+ such that each event appears at most once in the entire log.
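The grouping of events into traces described above can be sketched as follows. This is a minimal illustration, not part of the original framework: the Event type, the function names build_log and hd, and all sample rows other than the first event of customer 25879 are assumptions made for the example. The function hd(trace, k) returns the first k events of a trace, which is how the prefix notation hd k (σ) is used later in the paper.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class Event(NamedTuple):
    case_id: str     # case identifier (the customer id in the running example)
    activity: str    # activity name, pi_A(e)
    timestamp: date  # timestamp, pi_T(e)


def build_log(events):
    """Group events by case id and order each group by timestamp."""
    traces = defaultdict(list)
    for e in events:
        traces[e.case_id].append(e)
    return {cid: sorted(evts, key=lambda e: e.timestamp)
            for cid, evts in traces.items()}


def hd(trace, k):
    """Prefix made of the first k events of a trace."""
    return trace[:k]


# Only the first row is taken from the text; the others are hypothetical.
events = [
    Event("25879", "Receive income form", date(2017, 1, 20)),
    Event("25879", "Initialize income form", date(2017, 1, 15)),
    Event("30001", "Initialize income form", date(2017, 1, 16)),
]
log = build_log(events)
first = hd(log["25879"], 1)[0]
print(first.activity, first.timestamp)
```

Representing the log as a mapping from case id to an ordered list keeps each event in exactly one trace, in line with the requirement that an event appears at most once in the log.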

Concepts and Cost Model
An alarm-based prescriptive process monitoring system (alarm system for short) is a monitoring system that can raise an alarm in relation to a running case of a business process in order to indicate that the case is likely to lead to some undesired outcome. These alarms are handled by process workers who intervene in the process instance by performing an action (e.g., by calling a customer or blocking a credit card), thereby preventing the undesired outcome or mitigating its effect. These actions may have a cost, which we call cost of intervention. When a case does end in a negative outcome, this leads to the cost of undesired outcome. Clearly, intervening is beneficial if its cost is lower than the cost of reaching an undesired outcome.
Example 1 (continued). While receiving unemployment benefits at UWV, several customers omit to inform UWV that they have found a job and, thus, keep receiving benefits that they are not entitled to. If the salary of the new job is lower than the salary of the previous job, then the customer is still entitled to a benefit proportional to the difference in salary. In fact, customers often mistakenly declare the wrong salary for their new job and are then expected to return a certain amount of received benefits. In practice, however, this rarely happens. UWV would therefore benefit from an alarm system that informs about customers who are likely to be receiving unentitled benefits. The cost of an intervention stems from the hourly rate of UWV's employees; investigations using different IT systems and verification of a customer's information with the old and new employer take a significant amount of time. The cost of undesired outcome is the monetary value of the benefits that the customer received unlawfully.
For many practical application scenarios, however, it is not sufficient to consider solely a trade-off between the cost of intervention and the cost of undesired outcome. In the above example, raising an alarm induces solely the cost of intervention, which is independent of whether the intervention was actually necessary. In other scenarios, however, interventions that turn out to be superfluous induce further costs, ranging from financial damage to losing customers (we later give a detailed example for such a scenario in finance). We capture these effects by a cost of compensation in case the intervention was unnecessary, i.e., if the case was only suspected to lead to an undesired outcome, but this suspicion was wrong.
A further aspect to consider is the mitigation effectiveness, i.e., the relative benefit of raising an alarm at a certain point in time. Consider again the process of handling unemployment benefits at UWV as detailed in Example 1. In case of unlawful benefits, the longer UWV postpones an intervention, the lower its effectiveness: The amount paid already to the customer cannot always be claimed back successfully. Therefore, the mitigation effectiveness would be defined by the fraction of money that is not paid in the first place.
Having introduced the various costs to consider when interfering in the execution of a process instance, we need to clarify the granularity at which these costs are modeled. In general, an alarm system is intended to continuously monitor the cases of the business process. However, in many application scenarios, such continuous monitoring is expensive (resources to interfere need to be continuously available). Hence, an approximation may be employed that assumes that significant cost changes are always correlated with the availability of new events (i.e., new information) about a case. Following this line, alarms can only be raised after the occurrence of an event.
In the remainder, each case is identified by a trace σ that is (eventually) recorded in an event log. For a prefix of such a trace, the above characteristics of an alarm, i.e., its cost of intervention, cost of compensation, and mitigation effectiveness, are captured by an alarm model. These characteristics may depend on the position in the case at which the alarm is raised and/or on other cases being executed. Hence, they are defined as functions over the number of already recorded events and the entire set of cases being executed, as follows:
Definition 2 (Alarm Model). An alarm model is a tuple (c_in, c_com, eff) consisting of:
• a function c_in : N × E^+ × 2^(E^+) → R^+_0 modeling the cost of intervention: given a trace σ belonging to an event log L, c_in(k, σ, L) indicates the cost of an intervention for a running case with trace σ when the intervention takes place after the k-th event;
• a function c_com : E^+ × 2^(E^+) → R^+_0 modeling the cost of compensation: for a trace σ of an event log L, c_com(σ, L) indicates the cost of compensation when an intervention in σ turns out to be unnecessary;
• a function eff : N × E^+ × 2^(E^+) → [0, 1] modeling the mitigation effectiveness of an intervention: for a trace σ of an event log L, eff(k, σ, L) indicates the mitigation effectiveness of an intervention in σ when the intervention takes place after the k-th event.
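Combining the above properties of an alarm with the cost of the undesired outcome that shall be prevented, we define a single-alarm cost model.
Definition 3 (Single-Alarm Cost Model). A single-alarm cost model consists of:
• an alarm model (c_in, c_com, eff);
• a function c_out : E^+ × 2^(E^+) → R^+_0 modeling the cost of undesired outcome: for a trace σ of an event log L, c_out(σ, L) indicates the cost incurred if the case with trace σ ends in an undesired outcome.
Next, we illustrate the above model with exemplary application scenarios.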
Example 1 (continued). Let unt(σ) denote the amount of unentitled benefits received in a case corresponding to trace σ. Based on discussions with UWV, we designed the following cost model.

Cost of intervention. For the intervention, an employee needs to check if the customer is indeed receiving unentitled benefits and, if so, fill in the forms for stopping the payments. Let S be the employee's average salary rate per time unit; let i_s and i_f denote the positions of the events in σ when the employee started working on the intervention and finished it, respectively. The cost of an intervention can be modeled as the time spent on it multiplied by the salary rate: c_in(k, σ, L) = (π_T(σ(i_f)) − π_T(σ(i_s))) · S. If an employee takes care of one case, they cannot take care of a different one, which potentially yields unentitled benefits. This analysis is clearly an approximation: The cost can be modeled as the average amount that customers usually receive unlawfully: c_in(k, σ, L) = avg_{σ'∈L} unt(σ').
Cost of undesired outcome. The amount of unentitled benefits that the customer would obtain without stopping the payments, i.e., c_out(σ, L) = unt(σ).
Cost of compensation. The social security institution works in a situation of monopoly, which means that the customer cannot be lost because of moving to a competitor, i.e., there is no cost of compensation: c_com(σ, L) = 0.
Mitigation effectiveness. The proportion of unentitled benefits that will not be paid thanks to the intervention: eff(k, σ, L) = (unt(σ) − unt(hd^k(σ))) / unt(σ).
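The cost model of Example 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: a case is represented by a hypothetical list payments holding the unentitled amount paid at each event, and the mitigation effectiveness reads "proportion" as the not-yet-paid amount divided by the total unt(σ). The intervention cost is omitted because it depends on salary and timing data not shown here.

```python
def unt(payments):
    """Total unentitled benefits in a case or prefix.

    `payments` is a hypothetical per-event list of unentitled amounts.
    """
    return sum(payments)


def c_out(payments):
    """Cost of undesired outcome: all unentitled benefits of the full case."""
    return unt(payments)


def c_com(payments):
    """Cost of compensation: zero, since UWV operates as a monopoly."""
    return 0.0


def eff(k, payments):
    """Share of unentitled benefits not yet paid after the k-th event.

    Assumes normalization by the case total; returns 0.0 when nothing
    is unentitled, to avoid dividing by zero.
    """
    total = unt(payments)
    if total == 0:
        return 0.0
    return (total - unt(payments[:k])) / total


payments = [0, 100, 100, 200]  # hypothetical case with four events
print(eff(2, payments))  # 0.75
```

Intervening after the second event in this hypothetical case still prevents three quarters of the unentitled payments, whereas intervening after the last event prevents nothing.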
Due to the lack of competition for public administration, the above example does not include compensation costs.The following example from finance, however, illustrates the importance of being able to incorporate the cost of compensation.
Example 2 (Financial Institute). Suppose that the customers of a financial institute use their credit cards to make payments online. Each such transaction is associated with a risk of fraud, e.g., through a stolen or cloned card. In this scenario, an alarm system shall determine whether the card needs to be blocked, due to a high risk of fraud. However, superfluous blocking of the card causes discomfort to the customer, who may then switch to a different financial institute.
Cost of intervention. The card is automatically blocked by the system. Therefore, the intervention costs c_in(k, σ, L) are limited to the costs for sending a new credit card to the customer by mail.
Cost of undesired outcome. The total amount of money related to fraudulent transactions that the bank would need to reimburse to the legitimate customer, c_out(σ, L) = value(σ).

Cost of compensation. It is defined as the expected asset of the customer that would be lost, namely the actual asset multiplied by a certain probability p ∈ [0, 1], which is the fraction of customers who left the institute within a short time after the card was wrongly blocked. Denoting the asset value of a customer (the amount of the investment portfolio, the account balance, etc.) with asset(σ), the cost of compensation is estimated as: c_com(σ, L) = p · asset(σ).
Mitigation effectiveness. The proportion of the total amount of money related to fraudulent transactions that does not need to be reimbursed by blocking the credit card after k events have been executed: eff(k, σ, L) = (value(σ) − value(hd^k(σ))) / value(σ).

Multi-Alarm Cost Model
The above formalization of the cost model assumes that an alarm system supports a single type of intervention. For some processes, however, multiple types of alarms exist and represent alternative options to interfere in the execution of a case. Consider Example 2, the scenario of blocking a credit card to prevent fraud. An alternative intervention is to call the credit card owner by phone to verify the suspicious transaction. A phone call certainly has a higher cost of intervention, in comparison to automatically blocking the credit card. Yet, a phone call has a lower cost of compensation because, in case of falsely suspected fraud, it prevents the inconvenience of unnecessarily blocking a card.
The choice of which alarm to employ might depend on the likelihood of an undesired outcome if no action is taken. For instance, for the financial institute, if fraud is assessed to be a possibility, but not a near-certainty, then it is preferable to call the customer. If it is assessed to be fraud with near-certainty, it is safer to preventively block the card.
To handle scenarios with multiple alarms, the following definition generalizes our earlier single-alarm cost model.

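Definition 4 (Multi-Alarm Cost Model). Let A be a set of alarms and C be the universe of alarm models. A multi-alarm cost model is a tuple (A, ma, c_out) consisting of:
• the set of alarms A;
• a function ma : A → C that maps each alarm to its alarm model;
• a function c_out : E^+ × 2^(E^+) → R^+_0 modeling the cost of undesired outcome.
The above definition subsumes Definition 3: If only one alarm is present (i.e., A = {a}), there is only one alarm model ma(a) = c.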

Alarm-Based Prescriptive Process Monitoring System
An alarm-based prescriptive process monitoring system is driven by the outcome of the cases. In the remainder, the outcome of the cases is represented by a function out : E^+ → {true, false}: given a case identified by a trace σ, if the case has an undesired outcome, out(σ) = true; otherwise, out(σ) = false. In reality, during the execution of a case, its outcome is not yet known and needs to be estimated based on past executions that are recorded in an event log L ⊂ E^+.
The outcome estimator is a function out_L : E^+ → [0, 1] predicting the likelihood out_L(σ') that the outcome of a case that starts with prefix σ' is undesired. We define an alarm system as a function that, based on the predicted outcome, returns the alarm to be raised, if any.
Definition 5 (Alarm-Based Prescriptive Process Monitoring System). Given an event log L ⊂ E^+, let out_L be an outcome estimator built from L. Let A be the set of alarms that can be raised. An alarm-based prescriptive process monitoring system is a function alarm_{out_L} : E^+ → A ∪ {⊥}. Given a running case with current prefix σ', alarm_{out_L}(σ') returns the alarm raised after σ', or ⊥ if no alarm is raised.
Table 1: Cost of a case with trace σ based on its outcome and whether an alarm was raised. If the alarm is raised, i_σ indicates the index of σ when the alarm occurred.
                 | undesired outcome                                    | desired outcome
alarm raised     | c_in(i_σ, σ, L) + (1 − eff(i_σ, σ, L)) · c_out(σ, L) | c_in(i_σ, σ, L) + c_com(σ, L)
alarm not raised | c_out(σ, L)                                          | 0
For simplicity, we omit subscript L from out_L, and the entire subscript out_L from alarm_{out_L} when it is clear from the context. An alarm system can raise an alarm at most once per case, since we assume that the first alarm already triggers an intervention by the stakeholders.
The purpose of an alarm system is to minimize the cost of executing a case.
Table 1 illustrates how the cost of a case is determined on the basis of a multi-alarm cost model (A, ma, c_out). In the table, i_σ indicates the index of the event in σ when the alarm was raised, namely the smallest i ∈ [1, |σ|] such that alarm(hd^i(σ)) ≠ ⊥; also, ma(alarm(hd^{i_σ}(σ))) = (c_in, c_com, eff).
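The cost matrix of Table 1 can be written as a small function. This is a sketch under assumptions: the row for a raised alarm is taken to be the intervention cost plus either the unmitigated share of the outcome cost (undesired outcome) or the compensation cost (desired outcome), which follows from the definitions of c_in, c_com, and eff but should be checked against the original table. The arguments are the already evaluated values c_in(i_σ, σ, L), c_com(σ, L), eff(i_σ, σ, L), and c_out(σ, L); the function name case_cost is hypothetical.

```python
def case_cost(undesired, alarm_raised, c_in, c_com, eff, c_out):
    """Cost of one case, given its outcome and whether an alarm was raised."""
    if not alarm_raised:
        # No intervention: pay the full outcome cost, or nothing.
        return c_out if undesired else 0.0
    if undesired:
        # Intervention mitigates a share `eff` of the outcome cost.
        return c_in + (1.0 - eff) * c_out
    # Unnecessary intervention: pay for it and compensate.
    return c_in + c_com


# Hypothetical values: intervention 10, compensation 5, effectiveness 0.75,
# outcome cost 100.
print(case_cost(True, True, 10.0, 5.0, 0.75, 100.0))   # 35.0
print(case_cost(False, True, 10.0, 5.0, 0.75, 100.0))  # 15.0
```

With these numbers, raising the alarm on a truly undesired case costs 35 instead of 100, while a false alarm costs 15 instead of 0, which is the trade-off the alarm system has to balance.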
The above definition provides a framework for alarm-based prescriptive process monitoring systems. We next turn to its instantiation with the goal of minimizing the overall costs of case execution following the matrix of Table 1.

Alarm Systems and Empirical Thresholding
This section introduces different instantiations of an alarm system. In Section 3.1, we introduce a basic mechanism based on empirical thresholding. In Section 3.2, we enhance this approach with the idea of delaying alarms. An enhancement based on prefix-length-dependent thresholds is introduced in Section 3.3. Finally, multiple possible alarms are handled in Section 3.4.

Basic Alarm System
We first consider a basic alarm system, in which there is only one type of alarm a, which is triggered as soon as the estimated probability of an undesired outcome exceeds a given threshold τ. Given this simple alarm system, we aim at finding an optimal value for the alarming threshold τ, which minimizes the cost on a log L_thres comprising historical traces such that L_thres ∩ L_train = ∅, with respect to a given probability estimator out_{L_train} and alarm model ma(a).
Optimizing a threshold τ on a separate thresholding set is called empirical thresholding [6]. The search for such a threshold τ with respect to a specified alarm model ma(a) and log L_thres can be done through any hyperparameter optimization technique, such as Tree-structured Parzen Estimator (TPE) optimization [7]. The resulting approach can be considered as a form of cost-sensitive learning, since the value τ depends on how the alarm model ma(a) is specified.
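Empirical thresholding can be illustrated with a toy search. The sketch below swaps the TPE optimizer mentioned in the text for a plain grid search, which is enough to show the idea. Everything in it is hypothetical: each thresholding case is a pair of per-prefix probabilities and the true outcome, the costs are constants, eff is a simple decay in the prefix length, and the cost of a raised alarm follows the assumed reading of the cost matrix (intervention cost plus unmitigated outcome cost, or intervention cost plus compensation).

```python
def first_alarm(probs, tau):
    """1-based index of the first prefix whose probability exceeds tau."""
    for k, p in enumerate(probs, start=1):
        if p > tau:
            return k
    return None


def avg_cost(cases, tau, c_in, c_com, c_out, eff):
    """Average cost per case on the thresholding set for a given tau."""
    total = 0.0
    for probs, undesired in cases:
        k = first_alarm(probs, tau)
        if k is None:
            total += c_out if undesired else 0.0
        elif undesired:
            total += c_in + (1.0 - eff(k)) * c_out
        else:
            total += c_in + c_com
    return total / len(cases)


def best_threshold(cases, grid, **cost):
    """Grid search standing in for TPE: pick the cheapest tau."""
    return min(grid, key=lambda tau: avg_cost(cases, tau, **cost))


# Hypothetical thresholding set: (probabilities per prefix, undesired?).
cases = [
    ([0.2, 0.8, 0.9], True),
    ([0.1, 0.3, 0.2], False),
    ([0.6, 0.4, 0.3], False),
]
cost = dict(c_in=3.0, c_com=0.0, c_out=10.0,
            eff=lambda k: 1.0 - 0.1 * (k - 1))
tau = best_threshold(cases, [0.0, 0.25, 0.5, 0.75, 1.0], **cost)
print(tau)
```

Because the objective is the cost from the alarm model rather than classification accuracy, changing c_in or c_com moves the selected threshold, which is what makes the approach cost-sensitive.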

Delayed Firing System
The basic alarm system introduced above fires an alarm as soon as the probability of an undesired outcome is higher than the threshold τ. This may lead to firing an alarm too soon. Consider a scenario in which, after observing event e_i, the probability is above τ, whereas it drops below τ and stays below τ after the subsequent event e_{i+1} has been recorded. The basic alarm system would fire the alarm even though the probability of an undesired outcome is above τ for a single event only.
An alternative alarming system (herein called delayed firing) is to fire an alarm only if the probability of a negative outcome remains above threshold τ for κ consecutive events, where κ is the firing delay. One would expect this delayed firing system to be more robust to instabilities in the predictive model (e.g., to high levels of variation in the probability calculated by the predictive model for consecutive events in a trace).
When building a delayed alarming mechanism, we need to optimize both the firing delay κ and the threshold τ. Note that the basic alarm system is a special case of the delayed firing system with a firing delay of κ = 1.
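The delayed firing rule can be sketched as a scan that counts consecutive prefixes above the threshold. The function name delayed_alarm and the probability sequence are hypothetical; the sequence stands for the estimator's output after each event of one case.

```python
def delayed_alarm(probs, tau, kappa):
    """1-based index at which the alarm fires, or None.

    The alarm fires once the probability has exceeded tau for kappa
    consecutive events; kappa = 1 gives the basic alarm system.
    """
    run = 0  # length of the current streak of prefixes above tau
    for k, p in enumerate(probs, start=1):
        run = run + 1 if p > tau else 0
        if run >= kappa:
            return k
    return None


probs = [0.9, 0.2, 0.8, 0.85, 0.9]  # hypothetical, with an early spike
print(delayed_alarm(probs, tau=0.7, kappa=1))  # 1
print(delayed_alarm(probs, tau=0.7, kappa=2))  # 4
```

With κ = 1 the isolated spike at the first event triggers the alarm, while κ = 2 ignores it and fires only once the probability has stayed high for two events in a row.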

Prefix-length-dependent Threshold System
To cope with scenarios with dynamic costs, we propose not to rely on a single global alarming threshold τ, but to use different thresholds depending on the length of the prefix. That is, a separate threshold τ_k is optimized for each prefix length k or, more generally, thresholds τ_{a→b} are optimized for certain intervals of prefix lengths [a, b] for a, b ∈ N. For instance, τ_{1→4} denotes the optimal threshold for prefix lengths 1 to 4, while τ_{5→∞} denotes the optimal threshold from length 5 to the end of the trace. To divide the whole range of prefix lengths into n non-overlapping intervals, we define n splitting prefixes ρ_1, ..., ρ_n, where ρ_i refers to the start point of interval i. For instance, in the previous example, ρ_1 = 1, ρ_2 = 5. A single global threshold and prefix-length-based thresholds can be seen as special cases of interval-based thresholds.
The splitting prefixes can either be user-defined or optimized over L_train. In a system with multiple thresholds, all the thresholds can be optimized simultaneously. It is also possible to treat the splitting prefixes for prefix intervals as hyperparameters and optimize them over the event log L together with the thresholds. Prefix-length-dependent thresholds τ_{a→b} may be combined with a firing delay to build an alarm system. Then, the firing delay and the prefix-length-dependent thresholds τ_{a→b} are trained at the same time.
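Looking up the threshold that applies to a given prefix length amounts to finding the interval whose start point is the largest one not exceeding that length. A minimal sketch, assuming the splitting prefixes are kept in a sorted list starting at 1 with one threshold per interval; the names threshold_for and interval_alarm and the numeric values are hypothetical.

```python
from bisect import bisect_right


def threshold_for(k, splits, taus):
    """Threshold for prefix length k.

    `splits` holds the sorted interval start points (the first one is 1);
    `taus[i]` is the threshold of the interval starting at `splits[i]`.
    """
    return taus[bisect_right(splits, k) - 1]


def interval_alarm(probs, splits, taus):
    """1-based index of the first prefix exceeding its own threshold."""
    for k, p in enumerate(probs, start=1):
        if p > threshold_for(k, splits, taus):
            return k
    return None


splits = [1, 5]    # intervals 1..4 and 5..end, as in the text's example
taus = [0.9, 0.6]  # hypothetical thresholds for the two intervals
print(interval_alarm([0.7] * 6, splits, taus))  # 5
```

A constant probability of 0.7 does not trigger the alarm during the first four events, where the threshold is 0.9, but does at the fifth event, where the threshold drops to 0.6. A single global threshold corresponds to splits = [1].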

Multi-alarm systems
The problem of selecting the best alarm type from a set of possible alarms can be posed as a multi-class classification task [8], where each possible alarm type, as well as the option of not firing an alarm, is considered a different class. For the set of alarm types A, we formulate a classification task with |A ∪ {⊥}| different classes: one class for no alarm and one class per alarm type.
Note that common approaches for training multi-class classifiers (e.g., multi-class decision trees) are not applicable in our context, since the training data contains solely one out of two possible labels per trace.
One way of solving a multi-class classification problem is to break it into several binary classification problems (i.e., the one-vs-one approach) [9]. Then, all classes are tested against each other with a binary classification algorithm. The set of binary decisions leads to a number of votes that are assigned to the classes that won the binary classifications. The input data is ascribed the class with the most votes. An alternative solution is the one-vs-all approach, where for each class a binary classifier is trained that decides for each instance if it belongs to the respective class or not. The class with the highest probability is then assigned to the input. However, it is not possible to apply the latter approach to our problem, due to the asymmetrical misclassification costs (the cost of not assigning a class depends on the class assigned). In traditional multi-class classification, it does not matter which wrong class is assigned. Hence, we exploit the one-vs-one approach to implement a mechanism for a multi-alarm system.
Training the alarms against the class of not firing an alarm is done by applying the basic model to each alarm independently. In a scenario with two possible alarms, this gives us the thresholds τ_{False-vs.-a1} and τ_{False-vs.-a2}. All alarms are then trained against each other with empirical thresholding. In our example, that implies training the threshold τ_{a1-vs.-a2}. However, for training τ_{a1-vs.-a2}, we use only the prefixes σ whose estimated probability is higher than both thresholds for these alarms (i.e., out_{L_train}(hd^k(σ)) > τ_{False-vs.-a1} and out_{L_train}(hd^k(σ)) > τ_{False-vs.-a2}). While this might seem contrary to the one-vs-one approach (the threshold is not trained over the whole dataset), it is necessary for the following reasons. First, the question of 'should we fire an alarm?' shall be separated from the question of 'which alarm to fire?'. The decision on the alarm type is taken only over the subset of prefixes where alarming was found necessary. Training the threshold over all prefixes would fit the threshold to irrelevant prefixes. Second, it is more important to fire any alarm for a case with an undesired outcome than to choose the best alarm. Since we train the thresholds to decide between alarms in a hierarchical order, first the alarm vs. no alarm thresholds and then the alarms against each other, we call this approach hierarchical thresholding.
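How the three trained thresholds might be combined at run time for two alarms can be sketched as follows. The text describes how the thresholds are trained but not the exact decision rule, so the rule below is an assumption: an alarm is a candidate if the probability exceeds its alarm vs. no-alarm threshold, and when both are candidates, the threshold between the alarms decides, with a2 taken to be the alarm preferred at higher probabilities (for instance, blocking a card rather than calling the customer). The function name and all threshold values are hypothetical.

```python
def choose_alarm(p, tau_none_a1, tau_none_a2, tau_a1_a2):
    """Return "a1", "a2", or None for an estimated probability p.

    Assumed rule: first decide whether to alarm at all, then decide
    which alarm to fire among the candidates.
    """
    fire_a1 = p > tau_none_a1
    fire_a2 = p > tau_none_a2
    if fire_a1 and fire_a2:
        # Both alarms are worthwhile; the pairwise threshold breaks the tie.
        return "a2" if p > tau_a1_a2 else "a1"
    if fire_a1:
        return "a1"
    if fire_a2:
        return "a2"
    return None


# Hypothetical thresholds: a1 pays off above 0.5, a2 above 0.7,
# and a2 is preferred over a1 above 0.85.
for p in (0.3, 0.6, 0.8, 0.9):
    print(p, choose_alarm(p, 0.5, 0.7, 0.85))
```

Keeping the "whether" and the "which" decisions separate mirrors the hierarchical order in which the thresholds are trained.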

Evaluation
In this section, we report on an experimental evaluation of the proposed framework. Specifically, our evaluation addresses the following research questions related to the overall effectiveness of alarm-based prescriptive process monitoring:
RQ1 Does empirical thresholding identify thresholds that consistently reduce the average processing cost for different alarm model configurations?
RQ2 Does the alarm system consistently yield a benefit over different values of the mitigation effectiveness?
RQ3 Does the alarm system consistently yield a benefit over different values of the cost of compensation?
Moreover, we explore in detail the design choices involved when deciding on when to fire an alarm, going beyond simple empirical thresholding. Is there a reduction in the average processing cost per case:
RQ4 When training a parameter for the minimum number of events that exceed the threshold?
RQ5 When using more than one threshold interval?
RQ6 When increasing the number of prefix-length-dependent thresholds?
RQ7 When combining prefix-length-dependent thresholds with a firing delay, compared to a single-alarm system?
Finally, we turn to a comparison of systems that feature a single alarm only and those that support multiple alarms.
RQ8 Is there a reduction in the average processing cost per case when using multiple alarms compared to the best system with only one alarm?
In the remainder, we first discuss the real-world datasets used in our evaluation (Section 4.1), before turning to the experimental setup (Section 4.2). We then report on our evaluation results in detail and close with an overview of our answers to the above research questions (Section 4.3).

Datasets
We use the following real-world datasets to evaluate the alarm system:
BPIC2017. This log contains traces of a loan application process in a Dutch bank. It was split into two sub-logs, denoted with bpic2017 refused and bpic2017 cancelled. In the first one, the undesired cases refer to the process executions in which the applicant has refused the final offer(s) by the financial institution. In the second one, the undesired cases consist of those cases where the financial institution has cancelled the offer(s).
Road traffic fines. This log originates from an Italian police unit and relates to a process to collect traffic fines. The desired outcome is that a fine is paid, while in the undesired cases the fine needs to be sent for credit collection.
Unemployment. This event log corresponds to the Unemployment Benefits process run by the UWV in the Netherlands, introduced already as Example 1 in Section 2. Due to privacy constraints, this event log is not publicly available. The undesired outcome of the process is that a resident will receive more benefits than entitled, causing the need for a reclamation.
Table 2 describes the characteristics of the event logs used. These logs cover diverse evaluation settings, along several dimensions. The classes are well balanced in bpic2017 cancelled and traffic fines, while the undesired outcome is rarer in unemployment and bpic2017 refused. In traffic fines, the traces are very short, while in the other datasets the traces are generally longer.
For each event log, we use all available data attributes as input to the classifier. Additionally, we extract the event number, i.e., the index of the event in the given case, the hour, weekday, month, time since case start, and time since last event. Infrequent values of categorical attributes (occurring less than 10 times in the log) are replaced with value 'other', to avoid a massive blow-up of the considered number of dimensions. Missing attributes are imputed with the respective most recent (preceding) value of that attribute in the same trace when available, otherwise with zero. Traces are cut before the labeling of the case becomes trivially known and are truncated at the 90th percentile of all case lengths to avoid bias from very long traces.

Experimental Setup
We split the aforementioned datasets temporally, as follows. We order cases by their start time and randomly select 80% of the first 80% of the cases (i.e., 64% of the total) for L train ; 20% of the first 80% of the cases (i.e., 16% of the total) for L thres ; and use the remaining 20% as the test set L test . The events in cases in L train and L thres that overlap in time with L test are discarded in order to not use any information that would not be available yet in a real setting.
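The split can be illustrated with the following sketch. It assumes that each case is a pair (case_id, events), that events are dicts with a numeric 'time' key sorted by time, and that "overlap in time with L test" means any event at or after the start of the earliest test case; the function name temporal_split and these representation details are assumptions made for illustration.

```python
import random


def temporal_split(cases, test_frac=0.2, thres_frac=0.2, seed=0):
    """Split cases into (train, thres, test) following a temporal hold-out."""
    # Order cases by the timestamp of their first event (case start).
    ordered = sorted(cases, key=lambda case: case[1][0]["time"])
    n_test = round(len(ordered) * test_frac)
    early, test = ordered[: len(ordered) - n_test], ordered[len(ordered) - n_test :]

    # Randomly divide the early cases into a training and a thresholding set.
    shuffled = early[:]
    random.Random(seed).shuffle(shuffled)
    n_thres = round(len(shuffled) * thres_frac)
    thres, train = shuffled[:n_thres], shuffled[n_thres:]

    # Discard events of early cases that occur once the test period has started.
    test_start = test[0][1][0]["time"] if test else None

    def clip(part):
        if test_start is None:
            return part
        clipped = [
            (cid, [ev for ev in events if ev["time"] < test_start])
            for cid, events in part
        ]
        return [(cid, events) for cid, events in clipped if events]

    return clip(train), clip(thres), test
```

Clipping events rather than dropping whole cases keeps the early part of long-running cases available for training while still avoiding leakage from the test period.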
Based on L train , we build the classifier out using random forest (RF) and gradient boosted trees (GBT). Both algorithms have been shown to work well on a variety of classification tasks [10,11]. Features for a given prefix are obtained using the aggregation encoding [12], which is known to be effective for logs [3].
The configuration of the alarming mechanism depends on the setup chosen for a specific research question. For RQ1 to RQ3, we rely on the basic model, introduced in Section 3.1, and determine an optimal alarming threshold τ based on L thres . We employ Tree-structured Parzen Estimator (TPE) optimization [7] with 3-fold cross validation. The resulting alarm system is then compared against several baselines. First, we compare with the as-is situation, in which alarms are never raised. Second, we define a baseline with τ = 0, which enables us to compare with the situation where alarms are always raised directly at the start of a case. Finally, setting τ = 0.5, we consider a comparison with the cost-insensitive scenario that simply raises alarms when an undesired outcome is expected.
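To make the comparison between the optimized threshold and the baselines concrete, the following sketch evaluates the average cost per case for a given τ and picks the cheapest threshold. Several points are assumptions: the per-case cost function is one plausible reading of a constant-cost model over the tuple (c in , c out , c com , eff ), an alarm is taken to fire as soon as the estimated probability reaches τ, and a plain grid search stands in for the TPE optimization used in the experiments. All function names are hypothetical.

```python
def fires(probs, tau):
    """True if the estimated probability reaches tau at any prefix of the case."""
    return any(p >= tau for p in probs)


def case_cost(alarmed, undesired, c_in, c_out, c_com, eff):
    """Assumed constant-cost model for a single case."""
    if alarmed:
        # Intervention is paid; the outcome cost is mitigated or compensation is due.
        return c_in + ((1 - eff) * c_out if undesired else c_com)
    return c_out if undesired else 0.0


def average_cost(cases, tau, c_in, c_out, c_com, eff):
    """cases: list of (per-prefix probabilities, undesired-outcome flag)."""
    total = 0.0
    for probs, undesired in cases:
        total += case_cost(fires(probs, tau), undesired, c_in, c_out, c_com, eff)
    return total / len(cases)


def best_threshold(cases, grid, **costs):
    """Grid search (a stand-in for TPE) for the threshold with the lowest average cost."""
    return min(grid, key=lambda tau: average_cost(cases, tau, **costs))


# Toy thresholding set: two undesired and two desired cases.
cases = [([0.2, 0.8], True), ([0.1, 0.3], False), ([0.6, 0.7], True), ([0.4, 0.2], False)]
costs = dict(c_in=1.0, c_out=5.0, c_com=0.5, eff=1.0)
grid = [i / 10 for i in range(11)]

tau = best_threshold(cases, grid, **costs)
never = average_cost(cases, float("inf"), **costs)   # as-is baseline: no alarms
always = average_cost(cases, 0.0, **costs)           # baseline with tau = 0
```

In this sketch, the never-alarm baseline corresponds to an unreachable threshold and the always-alarm baseline to τ = 0, so all three systems are evaluated by the same cost function.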
To answer RQ4 to RQ7, we consider the advanced models introduced in Section 3.2 and Section 3.3. Again, the parameters are optimized using TPE. For RQ4, this includes the threshold τ and the firing delay κ ∈ {1, ..., 7}. For RQ5, we train two prefix-length-interval-based thresholds τ 1→ρ , τ ρ→∞ and optimize them together with parameter ρ. For RQ6, we define three different systems with 1 to 3 prefix length intervals. We rely on user-defined intervals for the prefix length, as follows. We set the length of each interval, except for the last, to one prefix and place the intervals one after another, starting at prefix length 1. This results in three systems with {τ 1→∞ }, {τ 1→2 , τ 2→∞ }, and {τ 1→2 , τ 2→3 , τ 3→∞ }.
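The interplay of prefix-length-dependent thresholds and a firing delay can be sketched as follows. The sketch assumes that an interval is identified by the prefix length at which it starts and that a firing delay κ means the threshold must be met for κ consecutive events, so that κ = 1 corresponds to firing immediately. Both the semantics of κ and the function names are assumptions for illustration; the definitions in Section 3.2 and Section 3.3 are authoritative.

```python
def threshold_for(k, starts, taus):
    """Return the threshold that applies at prefix length k.

    starts: ascending interval start positions, e.g. [1, 2, 3] for
            the system {tau 1->2, tau 2->3, tau 3->inf}.
    taus:   one threshold per interval.
    """
    tau = taus[0]
    for start, value in zip(starts, taus):
        if k >= start:
            tau = value
    return tau


def alarm_prefix(probs, starts, taus, kappa=1):
    """Prefix length at which the alarm fires, or None if it never fires.

    The alarm fires once the applicable threshold has been met for
    kappa consecutive events (assumed semantics of the firing delay).
    """
    streak = 0
    for k, p in enumerate(probs, start=1):
        streak = streak + 1 if p >= threshold_for(k, starts, taus) else 0
        if streak >= kappa:
            return k
    return None


probs = [0.9, 0.2, 0.6, 0.7, 0.8]
immediate = alarm_prefix(probs, [1], [0.5], kappa=1)          # single global threshold
delayed = alarm_prefix(probs, [1], [0.5], kappa=2)            # wait for two hits in a row
two_intervals = alarm_prefix(probs, [1, 2], [0.95, 0.5])      # stricter threshold at prefix 1
```

In the toy trace, the early spike at the first event triggers the immediate system, whereas the delayed system and the system with a stricter first interval both wait until the probability stays high.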
To explore RQ8, we derive a multi-alarm system that optimizes the alarming threshold τ , using TPE, for two different alarm types independently (Section 3.4). For these alarms, we multiply the factors given in Table 3 with the cost of intervention c in and the cost of compensation c com . We compare the resulting system against a baseline that always uses one of the alarms, i.e., the one that is better in terms of average processing cost per case.
It is common in cost-sensitive learning to apply calibration techniques to the resulting classifier [13]. Yet, we found that calibration using Platt scaling [14] does not consistently improve the estimated likelihood of the undesired outcome on our data and, thus, did not apply calibration.
Next, we turn to the alarm models used in our evaluation, summarized in Table 4. For RQ1, we vary the ratio between the cost of the undesired outcome c out and the cost of intervention c in , keeping the cost of compensation c com and the mitigation effectiveness eff unchanged. The same is done for RQ2 and, in addition, the mitigation effectiveness eff is varied. For RQ3, we vary two ratios: the one between c out and c in , and the one between c in and c com .
In the experiments related to RQ4-RQ7, we consider three types of configurations, each corresponding to one row in Table 4. We vary the cost of intervention c in and the cost of compensation c com from values that render them insignificant compared to c out , to values that yield a significant impact. Specifically, the first type of alarm model, coined constant cost configurations, assigns constant costs over the whole trace. A second type, linear cost configurations, assigns costs that increase with longer trace prefixes. The third type, non-monotonic cost configurations, changes costs non-linearly over the length of the trace. Constants for this setting are listed in Table 5. We assign values similar to the minimum case length to constants used as a numerator and values similar to the median case length to constants used as a divisor. This yields a non-monotonic cost development over the trace length for most cases.
For RQ8, we largely follow the same setup. However, we define our test scenarios such that the alarm that is more expensive for true positives is at least 10% cheaper for false positives compared to the other alarm. This ensures that the two alarms differ in the cost trade-off they induce.
To evaluate the success of prescriptive process monitoring, we measure the average cost per case using the test set L test derived for each dataset as discussed above. This cost should be as low as possible. Moreover, we measure the benefit of the alarm system, i.e., the reduction in the average cost of a case when using the alarm system compared to the average cost when not using it.
Additionally, we use the f-score to test the accuracy of our systems, in terms of how often they fire an alarm correctly. It is calculated as the harmonic mean of precision and recall and ranges from 0 (worst) to 1 (best).
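The two evaluation measures can be written down directly. The sketch below assumes that the positive class is the undesired outcome and that a fired alarm counts as a positive prediction; the function names are illustrative.

```python
def f_score(fired, undesired):
    """Harmonic mean of precision and recall of the alarms.

    fired[i]     -- True if an alarm was raised for case i
    undesired[i] -- True if case i ended with the undesired outcome
    """
    pairs = list(zip(fired, undesired))
    tp = sum(1 for f, u in pairs if f and u)
    fp = sum(1 for f, u in pairs if f and not u)
    fn = sum(1 for f, u in pairs if not f and u)
    if tp == 0:
        return 0.0  # no correct alarm: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def benefit(avg_cost_without, avg_cost_with):
    """Reduction in average cost per case achieved by the alarm system."""
    return avg_cost_without - avg_cost_with
```

For example, with one correct alarm, one false alarm, and one missed undesired case, precision and recall are both 0.5, and so is the resulting score.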

The above experimental setup has been implemented in Python based on Scikit-Learn and LightGBM, with the prototype being publicly available online.

Results
We evaluate our basic model of an alarm system, which is based on empirical thresholding, by exploring whether it consistently reduces the average processing cost under different alarm models (RQ1). Figure 3 shows the average cost per case when varying the ratio of c out and c in . We only present the results obtained with GBT, which slightly outperform those with RF. When the ratio between the two costs is balanced, the minimal cost is obtained by never alarming. When c out ≫ c in , one shall always raise an alarm. However, when c out is slightly higher than c in , the best strategy is to sometimes raise an alarm based on the classifier out. We found that the optimized τ outperforms the baselines in nearly all settings. The only exception is the ratio 2:1 for the traffic fines dataset, where never alarming is slightly better.
The impact of the threshold for firing an alarm is further explored in Figure 4. The optimized threshold is marked with a red cross and each line represents one particular cost ratio. While the optimized threshold generally obtains minimal costs, there sometimes exist multiple optimal thresholds for a given alarm model. For instance, for the 5:1 ratio in bpic2017 cancelled, all thresholds between 0 and 0.4 are cost-wise equivalent. Hence, empirical thresholding consistently finds a threshold that yields the lowest cost for a given log and cost model configuration.
Turning to RQ2, we evaluate how the mitigation effectiveness influences the results. Figure 5a shows the benefit of having an alarm system compared to not having it for different (constant) effectiveness values. As the results are similar for logs with similar class ratios, hereinafter, we show the results for bpic2017 cancelled (balanced classes) and unemployment (imbalanced classes).
As expected, the benefit increases both with higher eff and with higher c out : c in ratios. For bpic2017 cancelled, the alarm system yields a benefit when c out : c in is high and eff > 0. Also, a benefit is always obtained when eff > 0.5 and […]. In the case of the unemployment dataset, the average benefits are smaller, since there are fewer cases with an undesired outcome and, therefore, the number of cases where c out can be prevented by alarming is lower. In this case, a benefit is obtained when both eff and c out : c in are high. We conducted analogous experiments with a linear decay in effectiveness, varying the maximum possible effectiveness (at the start of the case), which confirmed that the observed patterns remain the same. As such, we have confirmed empirically that an alarm system yields a benefit over different values of mitigation effectiveness.
Next, we consider the influence of the cost of compensation (RQ3). As above, the benefit of the alarm system is plotted in Figure 5b across different ratios of c out : c in and c in : c com . When the cost of compensation c com is high, the benefit decreases due to false alarms. For bpic2017 cancelled, a benefit is obtained almost always, except when c out : c in is low (e.g., 2:1) and c com is high (i.e., higher than c in ). For unemployment, fewer configurations are beneficial, e.g., when c out : c in = 5 : 1 and c com is smaller than c in . We conducted analogous experiments with a linearly increasing cost of intervention, varying the maximum possible cost, which confirmed the above trends. In sum, we confirmed empirically that the alarm system achieves a benefit if the cost of the undesired outcome is sufficiently higher than the cost of the intervention and/or the cost of the intervention is sufficiently higher than the cost of compensation.
Next, we evaluate the design choices involved when deciding on when to fire an alarm. We explore the reduction in the average processing cost under a […]. To answer RQ4 on the effectiveness of a firing delay, we calculated the ratio of the average cost per case for the system with the firing delay as an additional hyperparameter, divided by that of the system without a firing delay. Figure 6 shows almost no cost settings in which the system that fires immediately outperforms the system with the firing delay (red cells). In many settings, the firing delay reduces the costs (green cells), especially for the non-monotonic scenarios. Sometimes the costs are reduced by 20% or more. However, in the vast majority of cases, both systems produce the same results. This is due to the firing delay parameter being set to 1 in many scenarios. In addition to the ratio, we also visualized the benefit in terms of f-score that the system with firing delay delivers, see Figure 7.
The results provide evidence for our hypothesis that waiting a certain number of events before firing an alarm improves the classification of traces. Overall, our results support a positive answer to RQ4: it is possible to reduce the average processing cost per case with a firing delay parameter.
To visualize relative benefits of multiple threshold intervals (RQ5), we calculate the ratio of the average processing costs between our baseline, a system based on the basic model, and our system with two threshold intervals. In Figure 8, we plot this ratio for all used cost configurations. Green and red coloring represents an improvement by using interval-based thresholds or by using a single global threshold, respectively. Contrary to our expectation, in some scenarios the system with a global threshold outperforms a system with intervals. However, improvements are observed for most scenarios.
Consistent improvements materialize for scenarios with non-monotonically changing costs and a cost of compensation that is lower than the cost of the undesired outcome. The improvements are the highest for the traffic fines dataset, which has the shortest traces. This also leads to shorter intervals. The smallest improvement in this group is seen in the bpic2017 refused dataset, which shows the lowest ratio of traces with an undesired outcome. Due to this low class ratio, the potential room for improvement is also the smallest, because an intervention is necessary for fewer cases.
Moreover, we observe consistent improvements for the two bpic2017 datasets for linearly changing costs and constant costs.
The traffic fines dataset with linear costs is an interesting outlier. It shows nearly no improvement for scenarios in which the cost of compensation is lower than the cost of the undesired outcome. If the cost of the undesired outcome equals the cost of compensation, the system with threshold intervals performs significantly better than the system with a global threshold. If the cost of compensation is higher than the cost of the undesired outcome, the system with a single global threshold is significantly better. We investigated why the interval system performs worse than the single-threshold system and found that, in the linear scenario, the single-threshold system sets the threshold so high that it is never reached and no alarm is fired. The system with two threshold intervals, in contrast, fires alarms in a variety of cases and thereby underperforms compared to not firing an alarm at all. The reason for this seems to be overfitting.
In general, these results support a positive answer to RQ5: prefix-length-dependent thresholds based on intervals may improve a prescriptive alarm system.
We further consider the use of several prefix-length-dependent thresholds (RQ6). Figure 9 depicts the rank correlation between the number of thresholds of a system, in ascending order, and the corresponding average cost per case, in descending order. As in the previous experiment, an increasing number of thresholds should not lead to a higher average cost per case: if there is no benefit to additional thresholds, all thresholds could simply be set to the same value. This expectation is not always met in our experiments, though, which we attribute to overfitting. An example of overfitting is the situation where the basic model does not fire an alarm for any case, while the prefix-length-dependent approach uses the second or third threshold to fire alarms for lengthy traces. This leads to better results on the training set, but not on the test set.
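A rank correlation of this kind can be computed as in the sketch below. The text does not state which coefficient was used, so Spearman's coefficient with average ranks for ties is an assumption here, as are the function names. With both sequences ranked in ascending order, a negative value indicates that the average cost per case tends to fall as the number of thresholds grows.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one sequence is constant
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: cost per case for systems with 1, 2, and 3 thresholds.
n_thresholds = [1, 2, 3]
avg_costs = [3.0, 2.5, 2.0]
rho = spearman(n_thresholds, avg_costs)
```

The numbers in the example are made up purely to exercise the function and do not correspond to any result reported here.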
The above results do not generally suggest a positive answer to RQ6. However, we see consistent improvements, or at least no declines, for the non-monotonic versions of the datasets bpic2017 cancelled and traffic fines. Both datasets have a balanced class ratio, whereas the dataset bpic2017 refused has an imbalanced class ratio (see Table 2). This leads to the conclusion that it is possible to improve the average cost per case with an increasing number of thresholds for datasets with a balanced class ratio in non-monotonic cost scenarios. A limitation is that we only observe improvements if the cost of an undesired outcome is larger than the cost of compensation. As expected, the improvements in the traffic fines dataset are higher. Since this dataset has shorter traces, multiple thresholds imply a high coverage of possible prefix lengths.
As stated in RQ7, we also investigate whether the combination of prefix-length-dependent thresholds and a firing delay is better than the basic model. To this end, we first check whether the combined approach outperforms a system with a global threshold and without firing delay. The results in Figure 10 confirm that the proposed approach leads to equivalent or better ratios of average processing cost under nearly all tested cost configurations. However, for three cost configurations in the traffic fines dataset, we observe negative results.
Exploring these experiments in more detail, we found that the negative results for two of them are due to overfitting. In these scenarios, the basic model approach builds a system that never fires an alarm, while the proposed approach builds a system that fires an alarm for some traces. In the third case, the combined approach sets the firing delay to 1, the later prefix-length-dependent threshold to 1, and the first threshold to a value similar to the threshold of the basic approach. With more training runs, the combined approach would therefore probably end up with the same behavior as the basic model.
Finally, we turn to the evaluation of a multi-alarm system (RQ8). Our results in Figure 11 indicate that hierarchical thresholding may, but does not necessarily, outperform the best single-alarm system. Specifically, we observe a stair-like border for scenarios with high compensation cost, in which hierarchical thresholding outperforms the best single-alarm system. This aligns with the relative benefit of alarm 1 over alarm 2 in the case of a false positive. We visualize this benefit in Figure 12 by calculating the ratio between the false-positive cost of firing alarm 1 and that of firing alarm 2. Based thereon, we conclude that a multi-alarm system using hierarchical thresholding can outperform the best single-alarm system for scenarios in which the alarm that is more expensive in terms of cost of intervention has a large cost benefit over the cheaper alarm in the case of a false alarm.
A summary of the experimental findings is given in Table 6.

Related Work
The problem of determining the optimal alarm threshold with respect to a given cost function is closely related to the problem of cost-sensitive learning. Cost-sensitive learning seeks to find an optimal prediction when different types of misclassifications have different costs and different types of correct classifications have different benefits [15]. A non-cost-sensitive classifier can be turned into a cost-sensitive one by stratification (rebalancing the ratio of positive and negative training samples) [15], by learning a meta-classifier after relabeling the training samples according to their estimated cost-minimizing class label [16], or via empirical thresholding [6]. In this article, we use the latter approach, which has been shown to perform better than other cost-sensitive learning approaches [6].
The above approaches target predictions at a given timepoint, without the possibility of delaying the decision. We, in turn, deal with a different problem formulation, where the costs may depend on the time when the decision is made, i.e., one may delay the decision to first accumulate more information.
Cost-sensitive learning for making sequential decisions has been approached using reinforcement learning (RL) techniques [17]. This work differs from ours in two ways. First, instead of making a sequence of decisions, we aim to find an optimal time to take a single decision. Second, RL assumes that actions affect the state observed after their implementation. In our case, possible actions are 1) alarming and 2) delaying the decision. In the latter case, however, we observe an additional event, which is not at all affected by the former decision.
Predictive and prescriptive process monitoring are also related to Early Classification of Time Series (ECTS), which aims at accurate classification of a (partial) time series as early as possible [18]. Common solutions for ECTS find an optimal trigger function that decides whether to output the prediction or to delay the decision and wait for another observation in the time series. To this end, existing methods identify the minimum prediction length at which the memberships assigned by the nearest neighbor classifier become stable [18], estimate the probability that the label assigned based on the current prefix is the same as the one based on the complete time series [19], or determine the percentage of accuracy that would be achieved on a prefix, compared to that of a complete trace [20].
Recently, a few non-myopic methods have been proposed [21,22]. Yet, these approaches assume a-priori knowledge of the length of the sequence, which is not given in the context of traces of business processes. As such, the problem is different, as delaying the decision comes with a risk that the case will end before the next possible decision point. The methods outlined in [23,21,22], however, are the only ECTS methods that try to balance accuracy-related and earliness-related costs. They assume that predicting a positive class early has the same effect on the cost function as predicting a negative class early, which is not the case in typical business process monitoring scenarios, where earliness matters only when an undesired outcome is predicted.
We are aware of four previous studies related to alarm-based prescriptive process monitoring [24,25,26,27]. Metzger et al. [24] study the effect of different likelihood thresholds on the total intervention cost (called adaptation cost) and the misclassification penalties. However, they assume that the alarms can only be generated at one pre-determined point in the process that can be located as a state in a process model. Hence, their approach is restricted to scenarios where: (i) there is a process model that perfectly captures all cases; (ii) the costs and rewards implied by alarms are not time-varying. Also, their approach relies on a mechanism with a user-defined threshold, as opposed to our empirical thresholding approach. The latter remark also applies to [25], which generates a prediction when the likelihood returned by a trace prefix classifier first exceeds a given threshold. Also, the latter study is not cost-sensitive. Gröger et al. [26] provide recommendations during the execution of business processes to avoid a predicted performance deviation. Yet, this approach lacks the two core elements of our proposed framework: an alarm model and a notion of earliness, i.e., the fact that firing an alarm earlier has a different effect than firing it at a later point in time. Finally, Krumeich et al. [27] propose a general architecture for prescriptive process monitoring. However, this architecture neither incorporates an alarm model nor proposes a method for cost optimization.

Conclusion
This article presented a prescriptive process monitoring framework that extends existing approaches for predictive process monitoring with a mechanism to raise alarms, and hence trigger interventions, to prevent or mitigate the effects of undesired outcomes. The framework incorporates a cost model to capture the trade-offs between the cost of intervention, the benefit of mitigating or preventing undesired outcomes, and the cost of compensating for unnecessary interventions. We also showed how to optimize the threshold(s) for generating alarms, with respect to a given configuration of the alarm model and event log.
An empirical evaluation on real-life logs showed significant benefits in optimizing the alarm threshold, relative to the baseline case where an alarm is raised when the likelihood of a negative outcome exceeds a pre-determined value. We also highlighted that under some conditions, it is preferable to use multiple alarm thresholds. Finally, it became apparent that additional cost reductions can be obtained, in some configurations, by delaying the firing of an alarm. These findings provide insights into how alarm-based systems for prescriptive process monitoring shall be configured in practice.
Our work opens various directions for future research. We plan to lift the assumption of an alarm always triggering an intervention (regardless of, for example, the workload of process workers). To this end, our framework shall be extended with notions of resource capacity and utilization, possibly drawing on our previous work on risk-aware resource allocation across concurrent cases [28]. We also strive for incremental tuning of the alarming mechanism based on feedback about the alarm relevance and the intervention effectiveness. We foresee that active learning methods could be applied in this context.

Figure 1: A fragment of an event log for two UWV customers, as explained in Example 1.

Definition 3 (Single-Alarm Cost Model). An alarm-based cost model is a tuple (c in , c out , c com , eff ) consisting of:

Figure 2 illustrates the general idea for a system with two alarm types. It represents the probability of an undesired outcome on the horizontal axis. The three aforementioned thresholds are visualized as vertical lines. Different colors illustrate the slices of the probability range that lead to different actions by the alarm system: firing no alarm, or firing one of the two alarms a 1 , a 2 .

Figure 3: Cost over different ratios of c out and c in (GBT)

Figure 4: Cost over different thresholds (τ is marked with a red cross)

Figure 5: Benefit with different alarm model configurations

Figure 6: Ratio of the average processing costs for a system with and without firing delay

Figure 9: Correlation coefficient between number of intervals and average cost per case

Figure 11: Ratio of average processing cost for a system with multiple alarms based on hierarchical thresholding and the best single-alarm system

Table 3: Factors for the different alarms that are multiplied with the respective costs

Table 4: Alarm model configurations

Table 5: Constants for non-monotonic cost configurations

Table 6: Summary of our findings with respect to the research questions