Specification-Driven Predictive Business Process Monitoring

Predictive analysis in business process monitoring aims at forecasting the future information of a running business process. The prediction is typically made based on the model extracted from historical process execution logs (event logs). In practice, different business domains might require different kinds of predictions. Hence, it is important to have a means for properly specifying the desired prediction tasks, and a mechanism to deal with these various prediction tasks. Although there have been many studies in this area, they mostly focus on a specific prediction task. This work introduces a language for specifying the desired prediction tasks, and this language allows us to express various kinds of prediction tasks. This work also presents a mechanism for automatically creating the corresponding prediction model based on the given specification. Differently from previous studies, instead of focusing on a particular prediction task, we present an approach to deal with various prediction tasks based on the given specification of the desired prediction tasks. We also provide an implementation of the approach which is used to conduct experiments using real-life event logs.


Introduction
Process mining [1,2] provides a collection of techniques for extracting process-related information from the logs of business process executions (event logs). One important area in this field is predictive business process monitoring, which aims at forecasting the future information of a running process based on the models extracted from event logs. Through predictive analysis, potential future problems can be detected and preventive actions can be taken in order to avoid unexpected situations, e.g., processing delays and Service-Level Agreement (SLA) violations. Many studies have been conducted in order to deal with various prediction tasks such as predicting the remaining processing time [4,63,54,52,53], predicting the outcomes of a process [37,22,67,50], predicting future events [23,63,27], etc. (cf. [43,42,58,49,15,19]). An overview of various works in the area of predictive business process monitoring can be found in [38,24].
In practice, different business areas might need different kinds of prediction tasks. For instance, an online retail company might be interested in predicting the processing time until an order can be delivered to the customer, while for an insurance company, predicting the outcome of an insurance claim process would be interesting. On the other hand, both of them might be interested in predicting whether their processes comply with some business constraints (e.g., the processing time must be less than a certain amount of time).
When it comes to predicting the outcome of a process, the satisfaction of a business constraint, or the existence of an unexpected behaviour, it is important to specify the desired outcomes, the business constraint and the unexpected behaviour precisely. For instance, in the area of customer problem management, to increase customer satisfaction as well as to promote efficiency, we might be interested in predicting the possibility of ping-pong behaviour among the Customer Service (CS) officers while handling the customer problems. However, the definition of ping-pong behaviour can vary. For instance, when a CS officer transfers a customer problem to another CS officer who belongs to the same group, it can already be considered ping-pong behaviour, since both of them should be able to handle the same problem. Another possible definition would be to consider ping-pong behaviour as a situation in which a CS officer transfers a problem to another CS officer who has the same expertise, and the problem is transferred back to the original CS officer.
To have a suitable prediction service for our domain, we need to be able to specify the desired prediction tasks properly. Thus, we need a means to express the specification. Once we have characterized the prediction objectives and are able to express them properly, we need a mechanism to create the corresponding prediction model. To automate the prediction model creation, the specification should be unambiguous and machine processable. As illustrated above, such a specification mechanism should also allow us to specify constraints over the data and to compare data values at different time points. For example, to characterize the ping-pong behaviour, one possibility is to specify the behaviour as follows: "there is an event at a certain time point in which the CS officer (who handles the problem) is different from the CS officer in the event at the next time point, but both of them belong to the same group". Note that here we need to compare the information about the CS officer names and groups at different time points. In other cases, we might even need to involve arithmetic expressions. For instance, consider a business constraint that requires the customer order processing time to be less than 3 hours, where the processing time is the time difference between the timestamps of the first and the last activity within the process. To express this constraint, we need to be able to specify that "the time difference between the timestamp of the first activity and the last activity within the process is less than 3 hours".
The language should also enable us to specify how to compute/obtain the target information to be predicted. For instance, in the prediction of remaining processing time, we need to be able to define that the remaining processing time is the time difference between the timestamp of the last activity and that of the current activity. We might also need to aggregate some data values, for instance in the prediction of the total processing cost, where the total cost is the sum of the costs of all activities/events. In other cases, we might even need to specify an expression that counts the number of occurrences of a certain activity. For example, in the prediction of the amount of work to be done (workload), we might be interested in predicting the number of remaining validation activities that still have to be performed for processing a client application.
In this work, we tackle those problems by proposing an approach for obtaining the desired prediction services based on the specification of the desired prediction tasks. Specifically, we provide the following contributions:
1. We introduce a rich language for expressing the desired prediction tasks. This language allows us to specify various desired prediction tasks. In some sense, this language allows us to specify how to create the desired prediction models based on the event logs. We also provide a formal semantics for the language in order to ensure a uniform understanding and avoid ambiguity.
2. We devise a mechanism for building the corresponding prediction model based on the given specification. This includes the mechanism for automatically processing the specification. Once created, the prediction model can be used to provide predictive analysis services in business process monitoring.
3. To provide a general idea of the capability of our language, we exhibit how our proposal can be used for specifying various prediction tasks (cf. Section 5).
4. We provide an implementation of our approach which enables the automatic creation of prediction models based on the specified prediction objective.
5. To demonstrate the applicability of our approach, we carry out experiments using real-life event logs that were provided for the Business Process Intelligence Challenge (BPIC) 2012, 2013, and 2015.
Our approach for obtaining prediction services essentially consists of the following main steps: (i) First, we specify the desired prediction tasks, (ii) Second, we automatically create the prediction models based on the given specification, (iii) Once created, we can use the constructed prediction models for predicting the future information of a running process.
Roughly speaking, we specify the desired prediction task by specifying how we want to map each (partial) business process execution information into the expected predicted information. Based on this specification, we train either a classification or a regression model that will serve as the prediction model. By specifying a set of desired prediction tasks, we can obtain multi-perspective prediction services that enable us to focus on different aspects and predict various information of interest. Our approach is independent of the classification/regression model that is used. In our implementation, to get the expected quality of predictions, the users are allowed to choose the desired classification/regression model as well as the feature encoding mechanisms (in order to allow some sort of feature engineering).
This article extends [57] in several ways. First, we extend the specification language so as to incorporate various aggregate functions such as Max, Min, Average, Sum, Count, and Concat. Importantly, our aggregate functions allow us not only to perform aggregation over some values but also to choose the values to be aggregated. This extension increases the expressivity of the language and allows us to specify many more interesting prediction tasks. Next, we add various new showcases that exhibit the capabilities of our language in specifying prediction tasks. We also extend the implementation of our prototype in order to incorporate those extensions. To demonstrate the applicability of our approach, more experiments on different prediction tasks are also conducted and presented. Apart from using the real-life event log that was provided for BPIC 2013 [62], we also use other real-life event logs, namely those that were provided for BPIC 2012 [65] and BPIC 2015 [66]. Notably, our experiments also exhibit the usage of a deep learning model [32] in predictive process monitoring. In particular, we use a deep feed-forward neural network. Though there have been some works that exhibit the usage of deep learning models in predictive process monitoring (cf. [63,26,27,23,40]), here we consider prediction tasks that are different from the tasks that have been studied in those works. We also add a more thorough explanation of several concepts and ideas of our approach so as to provide a better understanding. The discussion of the related work is also extended. Last but not least, several examples are added in order to support the explanation of various technical concepts as well as to ease the understanding of the ideas.
The remainder of this article is structured as follows. In Section 2, we provide the required background on the concepts that are needed for the rest of the paper. Having laid the foundation, in Section 3, we present the language that we introduce for specifying the desired prediction tasks. In Section 4, we present a mechanism for building the corresponding prediction model based on the given specification. In Section 5, we continue the explanation by providing numerous showcases that exhibit the capability of our language in specifying various prediction tasks. In Section 6, we present the implementation of our approach as well as the experiments that we have conducted. Related work is presented in Section 7. Finally, in Section 8 we present a discussion on some potential limitations which pave the way towards our future direction, and Section 9 concludes this work.

Preliminaries
We will see later that we build the prediction models by using machine learning classification/regression techniques and based on the data in event logs. To provide some background concepts, this section briefly explains the typical structure of event logs as well as the notion of classification and regression in machine learning.

Trace, Event and Event Log
We follow the usual notion of event logs as in process mining [2]. Essentially, an event log captures historical information of business process executions. Within an event log, an execution of a business process instance (a case) is represented as a trace. In the following, we may use the terms trace and case interchangeably. Each trace has several events, and each event in a trace captures the information about a particular event/activity that happens during the process execution. Events are characterized by various attributes, e.g., timestamp (the time when the event occurred).
We now proceed to formally define the notion of event logs as well as their components. Let $E$ be the event universe (i.e., the set of all event identifiers), and $A$ be the set of attribute names. For any event $e \in E$ and attribute name $n \in A$, $\#_n(e)$ denotes the value of attribute $n$ of $e$. E.g., $\#_{\text{timestamp}}(e)$ denotes the timestamp of the event $e$. If an event $e$ does not have an attribute named $n$, then $\#_n(e) = \bot$ (where $\bot$ is the undefined value). A finite sequence over $E$ of length $n$ is a mapping $\sigma : \{1, \ldots, n\} \to E$, and we represent such a sequence as a tuple of elements of $E$, i.e., $\sigma = \langle e_1, e_2, \ldots, e_n \rangle$ where $e_i = \sigma(i)$ for $i \in \{1, \ldots, n\}$. The set of all finite sequences over $E$ is denoted by $E^*$. The length of a sequence $\sigma$ is denoted by $|\sigma|$.
A trace $\tau$ is a finite sequence over $E$ such that each event $e \in E$ occurs at most once in $\tau$, i.e., $\tau \in E^*$ and for $1 \le i < j \le |\tau|$, we have $\tau(i) \neq \tau(j)$, where $\tau(i)$ refers to the event of the trace $\tau$ at the index $i$. Let $\tau = \langle e_1, e_2, \ldots, e_n \rangle$ be a trace; then $\tau^k = \langle e_1, e_2, \ldots, e_k \rangle$ denotes the $k$-length trace prefix of $\tau$ (for $1 \le k < n$).
Finally, an event log $L$ is a set of traces such that each event occurs at most once in the entire log, i.e., for each $\tau_1, \tau_2 \in L$ such that $\tau_1 \neq \tau_2$, we have that $\tau_1$ and $\tau_2$ share no event: there are no indices $i$ and $j$ with $\tau_1(i) = \tau_2(j)$.
An IEEE standard for representing event logs, called XES (eXtensible Event Stream), has been introduced in [34]. The standard defines the XML format for organizing the structure of traces, events and attributes in event logs. It also introduces some extensions that define some attributes with pre-defined meaning such as:
1. concept:name, which stores the name of event/trace;
2. org:resource, which stores the name/identifier of the resource that triggered the event (e.g., a person name);
3. org:group, which stores the group name of the resource that triggered the event.
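To make the definitions above concrete, the following is a minimal Python sketch of events, traces, and trace prefixes. The representation (dictionaries and lists), the helper names `attr` and `prefix`, and the sample data are assumptions made for illustration only; they are not part of the formalization or of the XES standard.

```python
from typing import Any, Dict, List, Optional

# Assumed representation (not from the paper): an event is a dictionary
# from attribute names to values, and a trace is a list of events.
Event = Dict[str, Any]
Trace = List[Event]


def attr(event: Event, name: str) -> Optional[Any]:
    """Return #_name(event); None plays the role of the undefined value."""
    return event.get(name)


def prefix(trace: Trace, k: int) -> Trace:
    """Return the k-length prefix of a trace, for 1 <= k < |trace|."""
    if not 1 <= k < len(trace):
        raise ValueError("prefix length must satisfy 1 <= k < |trace|")
    return trace[:k]


# A made-up trace using the XES attribute names mentioned above.
trace: Trace = [
    {"concept:name": "OrderCreated", "org:resource": "Alice", "org:group": "Sales"},
    {"concept:name": "Validation", "org:resource": "Bob", "org:group": "Sales"},
    {"concept:name": "Delivery", "org:resource": "Carol"},
]

two_prefix = prefix(trace, 2)              # the first two events
missing = attr(trace[2], "org:group")      # None: the attribute is undefined
```

Using `None` for the undefined value keeps the sketch short; a dedicated sentinel would be safer if `None` could be a legitimate attribute value.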

Classification and Regression
In machine learning, a classification or regression model can be seen as a function $f : \vec{X} \to Y$ that takes some input features/variables $\vec{x} \in \vec{X}$ and predicts the corresponding target value/output $y \in Y$. The key difference is that the output range of a classification task is a finite number of discrete categories (qualitative outputs), while the output range of a regression task is continuous values (quantitative outputs) [30,33]. Both of them are supervised machine learning techniques where the models are trained with labelled data, i.e., the inputs for the training are pairs of input variables $\vec{x}$ and the (expected) target value $y$. This way, the models learn how to map certain inputs $\vec{x}$ into the expected target value $y$.

Specifying the Desired Prediction Tasks
This section elaborates our mechanism for specifying the desired prediction tasks. Here we introduce a language that is able to capture the desired prediction task in terms of the specification on how to map each (partial) trace in the event log into the desired prediction results. Such specification can be used to train a classification/regression model that will be used as the prediction model.
To express the specification of a prediction task, we introduce the notion of analytic rule. An analytic rule $R$ is an expression of the form
$$R = \langle \mathsf{Cond}_1 \Longrightarrow \mathsf{Target}_1,\ \ldots,\ \mathsf{Cond}_n \Longrightarrow \mathsf{Target}_n,\ \mathsf{DefaultTarget} \rangle$$
where (i) $\mathsf{Cond}_i$ (for $i \in \{1, \ldots, n\}$) is called condition expression; (ii) $\mathsf{Target}_i$ (for $i \in \{1, \ldots, n\}$) is called target expression; (iii) $\mathsf{DefaultTarget}$ is a special target expression called default target expression; (iv) the expression $\mathsf{Cond}_i \Longrightarrow \mathsf{Target}_i$ is called conditional-target expression.
Section 3.1 provides an informal intuition of our language for specifying prediction tasks. Throughout Sections 3.2 and 3.3, we introduce the language for specifying the condition and target expressions in analytic rules. Specifically, Section 3.3 introduces a language called First-Order Event Expression (FOE), while Section 3.2 elaborates several components that are needed to define such a language. We will see later that FOE can be used to formally specify condition expressions and that a fragment of FOE can be used to specify target expressions. Finally, the formalization of analytic rules is provided in Section 3.4.

Overview: Prediction Task Specification Language
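An analytic rule $R$ is interpreted as a mapping that maps each (partial) trace into the value obtained by evaluating the target expression whose condition is satisfied by that trace. Let $\tau$ be a (partial) trace; such a mapping $R$ can be illustrated as follows:
$$R(\tau) = \begin{cases} eval(\mathsf{Target}_1) & \text{if } \tau \text{ satisfies } \mathsf{Cond}_1 \\ \quad\vdots & \quad\vdots \\ eval(\mathsf{Target}_n) & \text{if } \tau \text{ satisfies } \mathsf{Cond}_n \\ eval(\mathsf{DefaultTarget}) & \text{otherwise} \end{cases}$$
where $eval(\mathsf{DefaultTarget})$ and $eval(\mathsf{Target}_i)$ denote, respectively, the results of evaluating the target expressions $\mathsf{DefaultTarget}$ and $\mathsf{Target}_i$, for $i \in \{1, \ldots, n\}$ (the formal definition of this evaluation operation is given later).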
We will see later that a target expression specifies either the desired prediction result or expresses the way to compute the desired prediction result. Thus, an analytic rule R can also be seen as a means to map (partial) traces into either the desired prediction results, or to compute the expected prediction results of (partial) traces.
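To specify condition expressions in analytic rules, we introduce a language called First-Order Event Expression (FOE). Roughly speaking, an FOE formula is a First-Order Logic (FOL) formula [61] where the atoms are expressions over some event attribute values and some comparison operators, e.g., ==, ≠, >, ≤. The quantification in FOE is restricted to the indices of events (so as to quantify the time points). The idea of condition expressions is to capture a certain property of (partial) traces. To give some intuition, before we formally define the language in Section 3.3, consider the ping-pong behaviour that can be specified as follows:
$$\mathsf{Cond}_1 = \exists i.\big(i > \mathsf{curr} \ \wedge\ e[i].\text{org:resource} \neq e[i+1].\text{org:resource} \ \wedge\ i + 1 \le \mathsf{last} \ \wedge\ e[i].\text{org:group} == e[i+1].\text{org:group}\big)$$
where an expression such as $e[i+1].\text{org:group}$ refers to the org:group attribute value of the event at the index $i + 1$, curr refers to the current time point, and last refers to the last time point. In words, $\mathsf{Cond}_1$ says that there is a time point $i$ after the current one at which the resource differs from the resource at the next time point while both belong to the same group. As for the target expression, we can specify some strings such as "Ping-Pong" and "Not Ping-Pong". Based on these, we can create an example of an analytic rule $R_1$ as follows:
$$R_1 = \langle \mathsf{Cond}_1 \Longrightarrow \text{"Ping-Pong"},\ \text{"Not Ping-Pong"} \rangle$$
where $\mathsf{Cond}_1$ is as above. In this case, $R_1$ specifies a task for predicting the ping-pong behaviour. In the prediction model creation phase, we will create a classifier that classifies (partial) traces based on whether they satisfy $\mathsf{Cond}_1$ or not (i.e., a trace will be classified into "Ping-Pong" if it satisfies $\mathsf{Cond}_1$, otherwise it will be classified into "Not Ping-Pong"). During the prediction phase, such a classifier can be used to predict whether a given (partial) trace will lead to ping-pong behaviour or not.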
The target expression can be more complex than merely a string. For instance, it can be an expression that involves arithmetic operations over numeric values, such as
$$R_2 = \langle \mathsf{curr} < \mathsf{last} \Longrightarrow e[\mathsf{last}].\text{time:timestamp} - e[\mathsf{curr}].\text{time:timestamp},\ 0 \rangle$$
which specifies a task for predicting the remaining processing time, because $R_2$ maps each (partial) trace into its remaining processing time. In this case, during the prediction model creation phase, we will create a regression model for predicting the remaining processing time of a given (partial) trace. Section 5 provides more examples of prediction task specifications using our language.
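The reading of an analytic rule as a mapping from (partial) traces to target values can be sketched in Python as below. This is only an illustrative sketch under several assumptions: conditions and targets are modelled as plain functions of the full trace and the prefix length k, the first satisfied condition wins, the ping-pong condition is an assumed reading of the informal description (a hand-over after the current time point between two different resources of the same group), timestamps are plain numbers, and all names and data are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]
Trace = List[Event]
Cond = Callable[[Trace, int], bool]    # (trace, k) -> condition holds?
Target = Callable[[Trace, int], Any]   # (trace, k) -> target value


def apply_rule(pairs: List[Tuple[Cond, Target]], default: Target,
               trace: Trace, k: int) -> Any:
    """Evaluate the target of the first satisfied condition, else the default."""
    for cond, target in pairs:
        if cond(trace, k):
            return target(trace, k)
    return default(trace, k)


def ping_pong(trace: Trace, k: int) -> bool:
    """Assumed reading: some i > k with i + 1 <= last where the resource
    changes but the group stays the same (indices are 1-based)."""
    for i in range(k + 1, len(trace)):
        a, b = trace[i - 1], trace[i]
        values = [a.get("org:resource"), b.get("org:resource"),
                  a.get("org:group"), b.get("org:group")]
        if None in values:
            continue  # comparisons involving undefined values do not hold
        if values[0] != values[1] and values[2] == values[3]:
            return True
    return False


def remaining_time(trace: Trace, k: int) -> float:
    """Timestamp of the last event minus timestamp of the current event."""
    return trace[-1]["time:timestamp"] - trace[k - 1]["time:timestamp"]


# Hypothetical trace; timestamps are seconds since the start of the case.
trace: Trace = [
    {"org:resource": "Ann", "org:group": "G1", "time:timestamp": 0},
    {"org:resource": "Ann", "org:group": "G1", "time:timestamp": 60},
    {"org:resource": "Bob", "org:group": "G1", "time:timestamp": 120},
    {"org:resource": "Bob", "org:group": "G1", "time:timestamp": 300},
]

# Label every proper prefix, as one would when preparing training data.
labels_r1 = [
    apply_rule([(ping_pong, lambda t, j: "Ping-Pong")],
               lambda t, j: "Not Ping-Pong", trace, k)
    for k in range(1, len(trace))
]
labels_r2 = [
    apply_rule([(lambda t, j: j < len(t), remaining_time)],
               lambda t, j: 0, trace, k)
    for k in range(1, len(trace))
]
```

The string labels would feed a classifier and the numeric labels a regression model; how prefixes are encoded as feature vectors is left out of this sketch.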

Towards Formalizing the Condition and Target Expressions
This section introduces several components that are needed to define the language for specifying condition and target expressions in Section 3.3. As we have seen in Section 3.1, we often need to refer to a particular index of an event within a trace. Recall the expression e[i + 1].org:group that refers to the org:group attribute value of the event at the index i + 1, and also the expression e[last].time:timestamp that refers to the timestamp of the last event. The former requires us to refer to the event at the index i + 1, while the latter requires us to refer to the last event in the trace. To capture this, we introduce the notion of index expression idx defined as follows:
$$idx ::= i \mid pint \mid \mathsf{last} \mid \mathsf{curr} \mid idx_1 + idx_2 \mid idx_1 - idx_2$$
where (i) $i$ is an index variable; (ii) $pint$ is a positive integer (i.e., $pint \in \mathbb{Z}^+$); (iii) last and curr are special indices in which the former refers to the index of the last event in a trace, and the latter refers to the index of the current event (i.e., the last event of the trace prefix under consideration). For instance, given a $k$-length trace prefix $\tau^k$ of the trace $\tau$, curr is equal to $k$ (or $|\tau^k|$), and last is equal to $|\tau|$; (iv) $idx_1 + idx_2$ and $idx_1 - idx_2$ are the usual arithmetic addition and subtraction operations over indices.
The semantics of index expressions is defined over traces and the considered trace prefix length. Since an index expression can be a variable, given a trace $\tau$ and a considered trace prefix length $k$, we first introduce a variable valuation $\nu$, i.e., a mapping from index variables into $\mathbb{Z}^+$. We assign meaning to index expressions by associating to $\tau$, $k$, and $\nu$ an interpretation function $(\cdot)^{\tau,k}_{\nu}$ which maps an index expression into $\mathbb{Z}^+$. Formally, $(\cdot)^{\tau,k}_{\nu}$ is inductively defined as follows:
$$\begin{aligned} (i)^{\tau,k}_{\nu} &= \nu(i) \\ (pint)^{\tau,k}_{\nu} &= pint \\ (\mathsf{curr})^{\tau,k}_{\nu} &= k \\ (\mathsf{last})^{\tau,k}_{\nu} &= |\tau| \\ (idx_1 + idx_2)^{\tau,k}_{\nu} &= (idx_1)^{\tau,k}_{\nu} + (idx_2)^{\tau,k}_{\nu} \\ (idx_1 - idx_2)^{\tau,k}_{\nu} &= (idx_1)^{\tau,k}_{\nu} - (idx_2)^{\tau,k}_{\nu} \end{aligned}$$
The definition above says that the interpretation function $(\cdot)^{\tau,k}_{\nu}$ interprets index expressions as follows: (i) each variable is interpreted based on how the variable valuation $\nu$ maps the corresponding variable into a positive integer in $\mathbb{Z}^+$; (ii) each positive integer is interpreted as itself, e.g., $(2603)^{\tau,k}_{\nu} = 2603$; (iii) curr is interpreted into $k$; (iv) last is interpreted into $|\tau|$; and (v) the arithmetic addition/subtraction operators are interpreted as usual.
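The five interpretation cases described above translate almost directly into a small recursive evaluator. The following Python sketch assumes a hypothetical encoding of index expressions: an `int` for a positive integer, the strings `"curr"` and `"last"` for the special indices, any other string for an index variable, and a tuple `(op, left, right)` for addition and subtraction. The encoding and the function name are illustrative choices, not taken from the paper.

```python
from typing import Any, Dict


def interpret_idx(idx: Any, trace_len: int, k: int, nu: Dict[str, int]) -> int:
    """Interpret an index expression for a trace of length trace_len,
    a considered prefix length k, and a variable valuation nu."""
    if isinstance(idx, int):          # positive integer: interpreted as itself
        return idx
    if idx == "curr":                 # current index: the prefix length k
        return k
    if idx == "last":                 # last index: the length of the trace
        return trace_len
    if isinstance(idx, str):          # index variable: looked up in nu
        return nu[idx]
    op, left, right = idx             # ("+", idx1, idx2) or ("-", idx1, idx2)
    lhs = interpret_idx(left, trace_len, k, nu)
    rhs = interpret_idx(right, trace_len, k, nu)
    if op == "+":
        return lhs + rhs
    if op == "-":
        return lhs - rhs
    raise ValueError(f"unknown operator: {op!r}")


# i + 1 with nu(i) = 2, for a trace of length 5 and prefix length 3.
next_index = interpret_idx(("+", "i", 1), trace_len=5, k=3, nu={"i": 2})
```

Here `next_index` is 3, since the variable `i` is mapped to 2 by the valuation and the literal 1 is interpreted as itself.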
To access the value of an event attribute, we introduce the so-called event attribute accessor, which is an expression of the form
$$e[idx].attName$$
where $attName$ is an attribute name and $idx$ is an index expression. To define the semantics of event attribute accessors, we extend the definition of our interpretation function $(\cdot)^{\tau,k}_{\nu}$ such that it interprets an event attribute accessor expression into the attribute value of the corresponding event at the given index. Formally, $(\cdot)^{\tau,k}_{\nu}$ is defined as follows:
$$(e[idx].attName)^{\tau,k}_{\nu} = \begin{cases} \#_{attName}(e) & \text{if } (idx)^{\tau,k}_{\nu} = i,\ 1 \le i \le |\tau|, \text{ and } e = \tau(i) \\ \bot & \text{otherwise} \end{cases}$$
Note that the above definition also says that if the event attribute accessor refers to an index that is beyond the valid event indices in the corresponding trace, then we will get the undefined value (i.e., $\bot$).
As an example of an event attribute accessor, the expression e[i].org:resource refers to the value of the attribute org:resource of the event at the position i.
Example 2 Consider the trace $\tau = \langle e_1, e_2, e_3, e_4, e_5 \rangle$, let "Bob" be the value of the attribute org:resource of the event $e_3$ in $\tau$, i.e., $\#_{\text{org:resource}}(e_3) = \text{"Bob"}$, and suppose that $e_3$ does not have any attribute named org:group, i.e., $\#_{\text{org:group}}(e_3) = \bot$. In this example, we have that $(e[3].\text{org:resource})^{\tau,k}_{\nu} = \text{"Bob"}$, and $(e[3].\text{org:group})^{\tau,k}_{\nu} = \bot$.
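Example 2 can be replayed with a few lines of Python. This is a hedged sketch: events are assumed to be dictionaries, `None` stands in for the undefined value, and the function name `access` as well as the contents of the events other than the third one are hypothetical.

```python
from typing import Any, Dict, List, Optional

Event = Dict[str, Any]


def access(trace: List[Event], index: int, att_name: str) -> Optional[Any]:
    """Value of e[index].att_name with 1-based indices. Returns None
    (undefined) if the index is outside the trace or the attribute is absent."""
    if 1 <= index <= len(trace):
        return trace[index - 1].get(att_name)
    return None


# Five events; only the third one matters for Example 2.
trace: List[Event] = [
    {"concept:name": "a1"},
    {"concept:name": "a2"},
    {"concept:name": "a3", "org:resource": "Bob"},  # no org:group attribute
    {"concept:name": "a4"},
    {"concept:name": "a5"},
]

resource = access(trace, 3, "org:resource")    # "Bob"
group = access(trace, 3, "org:group")          # None: attribute is missing
beyond = access(trace, 6, "org:resource")      # None: index beyond the trace
```

Both ways of obtaining the undefined value (missing attribute, invalid index) collapse into the same `None` result, mirroring the single undefined value in the formal definition.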
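The value of an event attribute within a trace can be either numeric (e.g., 26, 3.86) or non-numeric (e.g., "sendOrder"), and we might want to specify properties that involve arithmetic operations over numeric values. Thus, we introduce the notion of numeric expression and non-numeric expression as follows:
$$\begin{aligned} nonNumExp &::= \mathsf{true} \mid \mathsf{false} \mid \mathsf{String} \mid e[idx].NonNumericAttribute \\ numExp &::= \mathsf{number} \mid idx \mid e[idx].NumericAttribute \mid numExp_1 + numExp_2 \mid numExp_1 - numExp_2 \end{aligned}$$
where e[idx].NonNumericAttribute (resp. e[idx].NumericAttribute) is an event attribute accessor for an attribute with non-numeric (resp. numeric) values.
To give the semantics for numeric and non-numeric expressions, we extend the definition of our interpretation function $(\cdot)^{\tau,k}_{\nu}$ by interpreting true, false, String, and number as themselves, e.g., $(3)^{\tau,k}_{\nu} = 3$ and $(\text{"sendOrder"})^{\tau,k}_{\nu} = \text{"sendOrder"}$, and by interpreting the arithmetic operations as usual. Formally, we extend our interpretation function as follows:
$$\begin{aligned} (\mathsf{true})^{\tau,k}_{\nu} &= \mathsf{true} \qquad (\mathsf{false})^{\tau,k}_{\nu} = \mathsf{false} \\ (\mathsf{String})^{\tau,k}_{\nu} &= \mathsf{String} \qquad (\mathsf{number})^{\tau,k}_{\nu} = \mathsf{number} \\ (numExp_1 + numExp_2)^{\tau,k}_{\nu} &= (numExp_1)^{\tau,k}_{\nu} + (numExp_2)^{\tau,k}_{\nu} \\ (numExp_1 - numExp_2)^{\tau,k}_{\nu} &= (numExp_1)^{\tau,k}_{\nu} - (numExp_2)^{\tau,k}_{\nu} \end{aligned}$$
Note that the value of an event attribute might be undefined, i.e., it is equal to $\bot$. In this case, we define that the arithmetic operations involving $\bot$ give $\bot$, e.g., $26 + \bot = \bot$.
We now define the notion of event expression as a comparison between either numeric expressions or non-numeric expressions. Formally, it is defined as follows:
$$\begin{aligned} eventExp &::= \mathsf{true} \mid \mathsf{false} \mid numExp_1 \ \mathsf{acop}\ numExp_2 \mid numExp_1 \ \mathsf{lcop}\ numExp_2 \mid nonNumExp_1 \ \mathsf{lcop}\ nonNumExp_2 \\ \mathsf{lcop} &::= {==} \mid {\neq} \\ \mathsf{acop} &::= {<} \mid {>} \mid {\le} \mid {\ge} \end{aligned}$$
where (i) numExp is a numeric expression; (ii) nonNumExp is a non-numeric expression; (iii) the operators == and ≠ are the usual logical comparison operators, namely equality and inequality; (iv) the operators <, >, ≤, and ≥ are the usual arithmetic comparison operators, namely less than, greater than, less than or equal, and greater than or equal.
Example 3 The expression e[i].concept:name == "OrderCreated" is an event expression saying that the value of the attribute concept:name of the event at the index i is equal to "OrderCreated".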
We interpret each logical/arithmetic comparison operator (i.e., ==, ≠, <, >, etc.) in the event expressions as usual. For instance, the expression 26 ≥ 3 is interpreted as true, while the expression "receivedOrder" == "sendOrder" is interpreted as false. Additionally, any comparison involving the undefined value ($\bot$) is interpreted as false. It is easy to see how to extend the formal definition of our interpretation function $(\cdot)^{\tau,k}_{\nu}$ towards interpreting event expressions, therefore we omit the details.
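The treatment of the undefined value in arithmetic and in comparisons can be illustrated with the following Python sketch. It assumes that `None` represents the undefined value and that operators are passed as strings; the helper names `add`, `sub`, and `compare` are hypothetical.

```python
import operator
from typing import Any, Optional, Union

Number = Union[int, float]

_COMPARISONS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


def add(a: Optional[Number], b: Optional[Number]) -> Optional[Number]:
    """Addition that yields the undefined value if an operand is undefined."""
    return None if a is None or b is None else a + b


def sub(a: Optional[Number], b: Optional[Number]) -> Optional[Number]:
    """Subtraction that yields the undefined value if an operand is undefined."""
    return None if a is None or b is None else a - b


def compare(op: str, left: Any, right: Any) -> bool:
    """Comparison that is false whenever an operand is undefined."""
    if left is None or right is None:
        return False
    return _COMPARISONS[op](left, right)


undefined_sum = add(26, None)                              # None
holds = compare(">=", 26, 3)                               # True
differs = compare("==", "receivedOrder", "sendOrder")      # False
```

Note that under this reading both `compare("==", None, 3)` and `compare("!=", None, 3)` are false, so inequality is not simply the negation of equality once undefined values are involved.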

Adding Aggregate Functions
We now extend the notion of numeric expression and non-numeric expression by adding several numeric and non-numeric aggregate functions. A numeric (resp. non-numeric) aggregate function is a function that performs an aggregation operation over some values and returns a numeric (resp. non-numeric) value. Before providing the formal syntax and semantics of our aggregate functions, in the following we illustrate the need for aggregate functions and provide some intuition on their shape.
Suppose that each event in each trace has an attribute named cost. Consider the situation where we want to specify a task for predicting the total cost of all activities (from the first until the last event) within a trace. In this case, we need to sum up all values of the cost attribute in all events. To express this need, we introduce the aggregate function sum and we can specify the notion of total cost as follows: sum(e[x].cost; where x = 1 : last).
The expression above computes the sum of the values of e[x].cost for all x ∈ {1, . . . , last}. In this case x is called aggregation variable, the expression e[x].cost specifies the aggregation source, i.e., the source of the values to be aggregated, and the expression x = 1 : last specifies the aggregation range by defining the range of the aggregation variable x.
In some situations, we might only be interested in computing the total cost of a certain activity, e.g., the total cost of all validation activities within a trace. To do this, we introduce the notion of aggregation condition, which allows us to select only some values that we want to aggregate. For example, we can write sum(e[x].cost; where x = 1 : last; and e[x].concept:name == "Validation"). This expression computes the sum of the values of e[x].cost for all x ∈ {1, . . . , last} in which the condition e[x].concept:name == "Validation" is evaluated to true. Therefore, the summation only considers the values of x in which the activity name is "Validation", and we only compute the total cost of all validation activities. As before, e[x].cost specifies the source of the values to be aggregated, the expression x = 1 : last specifies the aggregation range by defining the range of the aggregation variable x, and the expression e[x].concept:name == "Validation" provides the aggregation condition.
The expression for specifying the source of the values to be aggregated can be more complex, for example when we want to compute the average activity running time within a trace. In this case, the running time of an activity is specified as the time difference between the timestamp of that activity and the next activity, i.e., avg(e[x + 1].time:timestamp − e[x].time:timestamp; where x = 1 : last).
Essentially, the expression above computes the average of the time difference between the activity at the timepoint x + 1 and x, where x ∈ {1, . . . , last}.
In other cases, we might not be interested in aggregating data values but in counting the number of occurrences of a certain activity/event. To do this, we introduce the aggregate function count. As an example, we can specify an expression to count the number of validation activities within a trace as follows: count(e[x].concept:name == "validation"; where x = 1 : last), where e[x].concept:name == "validation" is an aggregation condition. The expression above counts how many times the specified aggregation condition is true within the specified range. Thus, in this case, it counts the number of events between the first and the last event in which the activity name is "validation". We might also be interested in counting the number of different values of a certain attribute within a trace. For example, we might be interested in counting the number of different resources that are involved within a trace. To capture this, we introduce the aggregate function countVal. We can then specify the expression to count the number of different resources between the first and the last event as follows: countVal(org:resource; within 1 : last), where (i) org:resource is the name of the attribute whose number of different values we want to count; and (ii) the expression "within 1 : last" is the aggregation range.
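The intuition behind sum, avg, count, and countVal can be sketched in Python as follows. This is an illustrative sketch rather than the paper's implementation: aggregation sources and conditions are modelled as functions of the trace and the aggregation variable x, `None` stands for the undefined value, returning `None` for the average of no values and excluding undefined values in `count_val` are assumptions, and the trace data is made up.

```python
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]
Trace = List[Event]
Source = Callable[[Trace, int], Optional[float]]   # (trace, x) -> value
Condition = Callable[[Trace, int], bool]           # (trace, x) -> bool


def at(trace: Trace, x: int, name: str) -> Optional[Any]:
    """e[x].name with 1-based x; None if x is out of range or name is absent."""
    return trace[x - 1].get(name) if 1 <= x <= len(trace) else None


def always(trace: Trace, x: int) -> bool:
    return True


def selected(trace: Trace, src: Source, st: int, ed: int,
             cond: Condition = always) -> List[float]:
    """Defined source values for x in st..ed where the condition holds."""
    values = []
    for x in range(st, ed + 1):
        if cond(trace, x):
            value = src(trace, x)
            if value is not None:
                values.append(value)
    return values


def agg_sum(trace, src, st, ed, cond=always):
    return sum(selected(trace, src, st, ed, cond))


def agg_avg(trace, src, st, ed, cond=always):
    values = selected(trace, src, st, ed, cond)
    return sum(values) / len(values) if values else None  # assumption


def count(trace: Trace, cond: Condition, st: int, ed: int) -> int:
    return sum(1 for x in range(st, ed + 1) if cond(trace, x))


def count_val(trace: Trace, name: str, st: int, ed: int) -> int:
    # Assumption: undefined values are not counted as a value.
    return len({at(trace, x, name) for x in range(st, ed + 1)} - {None})


def cost(trace, x):
    return at(trace, x, "cost")


def is_validation(trace, x):
    return at(trace, x, "concept:name") == "Validation"


def running_time(trace, x):
    nxt, cur = at(trace, x + 1, "time:timestamp"), at(trace, x, "time:timestamp")
    return None if nxt is None or cur is None else nxt - cur


trace: Trace = [
    {"concept:name": "Init", "cost": 5, "org:resource": "Ann", "time:timestamp": 0},
    {"concept:name": "Validation", "cost": 2, "org:resource": "Bob", "time:timestamp": 10},
    {"concept:name": "Assembling", "cost": 4, "org:resource": "Ann", "time:timestamp": 30},
    {"concept:name": "Validation", "cost": 7, "org:resource": "Bob", "time:timestamp": 60},
]
last = len(trace)

total_cost = agg_sum(trace, cost, 1, last)                        # 18
validation_cost = agg_sum(trace, cost, 1, last, is_validation)    # 9
avg_running = agg_avg(trace, running_time, 1, last)               # 20.0
n_validations = count(trace, is_validation, 1, last)              # 2
n_resources = count_val(trace, "org:resource", 1, last)           # 2
```

In the average, the index x = last contributes nothing because e[last + 1] is undefined, so only the three differences 10, 20, and 30 are averaged.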
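We will see later in Section 5 that the presence of aggregate functions allows us to express numerous interesting prediction tasks. Towards formalizing the aggregate functions, we first formalize the notion of aggregation conditions. An aggregation condition is an unquantified First-Order Logic (FOL) [61] formula where the atoms are event expressions and which may use only a single unquantified variable, namely the aggregation variable. The values of the unquantified/free variable in an aggregation condition range over the aggregation range specified in the corresponding aggregate function. Formally, aggregation conditions are defined as follows:
$$aggCond ::= eventExp \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2 \mid \varphi_1 \vee \varphi_2$$
where eventExp is an event expression, $\varphi$, $\varphi_1$, and $\varphi_2$ are aggregation conditions, and the semantics of aggCond is based on the usual FOL semantics. Formally, we extend the definition of our interpretation function $(\cdot)^{\tau,k}_{\nu}$ as follows:
$$\begin{aligned} (\neg\varphi)^{\tau,k}_{\nu} = \mathsf{true} \ &\text{ if } (\varphi)^{\tau,k}_{\nu} = \mathsf{false} \\ (\varphi_1 \wedge \varphi_2)^{\tau,k}_{\nu} = \mathsf{true} \ &\text{ if } (\varphi_1)^{\tau,k}_{\nu} = \mathsf{true} \text{ and } (\varphi_2)^{\tau,k}_{\nu} = \mathsf{true} \\ (\varphi_1 \vee \varphi_2)^{\tau,k}_{\nu} = \mathsf{true} \ &\text{ if } (\varphi_1)^{\tau,k}_{\nu} = \mathsf{true} \text{ or } (\varphi_2)^{\tau,k}_{\nu} = \mathsf{true} \end{aligned}$$
With this machinery in hand, we are ready to define the syntax and the semantics of numeric and non-numeric aggregate functions. We first extend the syntax of the numeric and non-numeric expressions by adding the numeric and non-numeric aggregate functions as follows:
numExp ::= number | idx | e[idx].NumericAttribute | numExp_1 + numExp_2 | numExp_1 − numExp_2
    | sum(numSrc; where x = st : ed; and aggCond)
    | avg(numSrc; where x = st : ed; and aggCond)
    | min(numSrc; where x = st : ed; and aggCond)
    | max(numSrc; where x = st : ed; and aggCond)
    | count(aggCond; where x = st : ed)
    | countVal(attName; within st : ed)
    | min(numExp_1, numExp_2) | max(numExp_1, numExp_2)
nonNumExp ::= true | false | String | e[idx].NonNumericAttribute
    | concat(nonNumSrc; where x = st : ed; and aggCond)
where
(i) number, idx, e[idx].NumericAttribute, true, false, String, e[idx].NonNumericAttribute, numExp_1 + numExp_2, and numExp_1 − numExp_2 are as before;
(ii) st and ed are either positive integers (i.e., st ∈ Z+ and ed ∈ Z+) or special indices (i.e., last or curr), and st ≤ ed;
(iii) x is a variable called aggregation variable, and the range of its value is between st and ed (i.e., st ≤ x ≤ ed). The expressions "where x = st : ed" and "within st : ed" are called aggregation variable range;
(iv) numSrc and nonNumSrc specify the source of the values to be aggregated. The numSrc is specified as a numeric expression while nonNumSrc is specified as a non-numeric expression. Both of them may use only the corresponding aggregation variable x as a variable, and they cannot contain any aggregate functions;
(v) aggCond is an aggregation condition over the corresponding aggregation variable x, and no other variables are allowed to occur in aggCond. As in the examples above, the part "; and aggCond" may be omitted when all values in the range are to be aggregated;
(vi) attName is an attribute name;
(vii) for the aggregate functions, as the names describe, sum stands for summation, avg stands for average, min stands for minimum, max stands for maximum, count stands for counting, countVal stands for counting values, and concat stands for concatenation. The behaviour of these aggregate functions is quite intuitive. Some intuition has been given previously, and we explain their detailed behaviour while providing their formal semantics below. The aggregate functions sum, avg, min, max, and concat that have aggregation conditions aggCond are also called conditional aggregate functions.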
Notice that a numeric aggregate function is also a numeric expression, and a numeric expression is also a component of a numeric aggregate function (either in the source value or in the aggregation condition). Hence, this may give rise to nested aggregate functions. To simplify the presentation, in this work we do not allow nested aggregate functions of this form, although it is technically possible to support them provided that the variables are used with some care (similarly for the non-numeric aggregate functions).
To formalize the semantics of aggregate functions, we first introduce some notation. Given a variable valuation $\nu$, we write $\nu[x \mapsto d]$ to denote a new variable valuation obtained from $\nu$ as follows:
$$\nu[x \mapsto d](y) = \begin{cases} d & \text{if } y = x \\ \nu(y) & \text{if } y \neq x \end{cases}$$
i.e., $\nu[x \mapsto d]$ maps $x$ to $d$, and the other variables (apart from $x$) are mapped in the same way as by $\nu$. Given a conditional summation aggregate function sum(numSrc; where x = st : ed; and aggCond), a trace $\tau$, a considered trace prefix length $k$, and a variable valuation $\nu$, we define its corresponding set $\mathsf{Idx}$ of valid aggregation indices as follows:
$$\mathsf{Idx} = \{\, d \in \mathbb{Z}^+ \mid (st)^{\tau,k}_{\nu} \le d \le (ed)^{\tau,k}_{\nu},\ (aggCond)^{\tau,k}_{\nu[x \mapsto d]} = \mathsf{true}, \text{ and } (numSrc)^{\tau,k}_{\nu[x \mapsto d]} \neq \bot \,\}$$
Basically, $\mathsf{Idx}$ collects the values within the given aggregation range (i.e., between st and ed) for which, by substituting the aggregation variable x with those values, the aggregation condition aggCond is evaluated to true and numSrc is not evaluated to the undefined value $\bot$. For the other conditional aggregate functions avg, max, min, and concat, the corresponding set of valid aggregation indices can be defined similarly.
Moreover, let #concept:name(e_1) = "initialization" and #concept:name(e_3) = "assembling". Suppose that the cost of each activity is the same, say 3. Of the two aggregate functions considered in this example, the former computes the total cost of all activities while the latter computes the total cost of validation activities. In this case, the corresponding set of valid aggregation indices (with respect to the given trace τ) for the first aggregate function is Idx_1 = {1, 2, 3, 4}, while for the second aggregate function we have Idx_2 = {2, 4}, because the second aggregate function requires the activity name (i.e., the value of the attribute concept:name) to be equal to "validation", and this is only true when x is equal to either 2 or 4.
Having this machinery in hand, we are now ready to formally define the semantics of aggregate functions. The formal semantics of the conditional aggregate functions sum, avg, max, min is provided in Figure 1. Intuitively, the aggregate function sum computes the sum of the values that are obtained from the evaluation of the specified numeric expression numSrc over the specified aggregation range (i.e., between st and ed). Additionally, the computation of the summation ignores undefined values and it only considers those indices within the specified aggregation range in which the aggregation condition is evaluated to true. The intuition for the aggregate functions avg, max, min is similar, except that avg computes the average, max computes the maximum values, and min computes the minimum values.
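The intuition behind the conditional aggregate functions can be illustrated with a small Python sketch. It is not the prototype's implementation: the trace representation (a list of attribute dictionaries), the helper names `attr`, `valid_indices` and `agg`, and the behaviour on an empty index set (0 for sum, undefined for avg, max, and min) are assumptions made here for illustration, since Figure 1 is the authoritative definition.

```python
from typing import Callable, List

Trace = List[dict]      # one dictionary of attribute values per event
UNDEFINED = None        # stands in for the undefined value ⊥


def attr(trace: Trace, i: int, name: str):
    """Value of attribute `name` at the i-th event (1-based), or ⊥ if absent."""
    if 1 <= i <= len(trace):
        return trace[i - 1].get(name, UNDEFINED)
    return UNDEFINED


def valid_indices(trace: Trace, st: int, ed: int,
                  num_src: Callable, agg_cond: Callable) -> List[int]:
    """Idx: indices in [st, ed] where agg_cond holds and num_src is defined."""
    return [d for d in range(st, ed + 1)
            if agg_cond(trace, d) and num_src(trace, d) is not UNDEFINED]


def agg(kind: str, trace: Trace, st: int, ed: int, num_src: Callable,
        agg_cond: Callable = lambda t, d: True):
    """Conditional sum/avg/max/min of num_src over the valid indices."""
    values = [num_src(trace, d)
              for d in valid_indices(trace, st, ed, num_src, agg_cond)]
    if kind == "sum":
        return sum(values)          # assumption: an empty Idx sums to 0
    if not values:
        return UNDEFINED            # assumption: avg/max/min of nothing is ⊥
    if kind == "avg":
        return sum(values) / len(values)
    if kind == "max":
        return max(values)
    if kind == "min":
        return min(values)
    raise ValueError(f"unknown aggregate function: {kind}")


# The four-event trace of the running example: every activity costs 3.
example_trace = [
    {"concept:name": "initialization", "cost": 3},
    {"concept:name": "validation", "cost": 3},
    {"concept:name": "assembling", "cost": 3},
    {"concept:name": "validation", "cost": 3},
]
cost = lambda t, d: attr(t, d, "cost")
is_validation = lambda t, d: attr(t, d, "concept:name") == "validation"

total_cost = agg("sum", example_trace, 1, 4, cost)
validation_cost = agg("sum", example_trace, 1, 4, cost, is_validation)
```

With this trace, `total_cost` is 12 and `validation_cost` is 6, matching the index sets {1, 2, 3, 4} and {2, 4} of the running example.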
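Example 5 Continuing Example 4, the first aggregate function is evaluated to 12 because we have that Idx_1 = {1, 2, 3, 4} and the cost of each of these four activities is 3, i.e., 3 + 3 + 3 + 3 = 12. On the other hand, the second aggregate function is evaluated to 6 because we have that Idx_2 = {2, 4}, and hence 3 + 3 = 6.
The aggregate function max(numExp_1, numExp_2) computes the maximum value between the two values that are obtained by evaluating the two specified numeric expressions numExp_1 and numExp_2. It gives the undefined value ⊥ if one of them is evaluated to the undefined value ⊥ (similarly for the aggregate function min(numExp_1, numExp_2), except that it computes the minimum value). Formally, the semantics of these functions is defined as follows:
$$(\mathsf{max}(\mathsf{numExp}_1, \mathsf{numExp}_2))^{\tau,k}_{\nu} = \begin{cases} \bot & \text{if } (\mathsf{numExp}_1)^{\tau,k}_{\nu} = \bot \text{ or } (\mathsf{numExp}_2)^{\tau,k}_{\nu} = \bot,\\ \max\big((\mathsf{numExp}_1)^{\tau,k}_{\nu}, (\mathsf{numExp}_2)^{\tau,k}_{\nu}\big) & \text{otherwise,} \end{cases}$$
and analogously for min.
The formal semantics of the aggregate function count is as follows:
$$(\mathsf{count}(\mathsf{aggCond};\ \mathsf{where}\ x = \mathsf{st} : \mathsf{ed}))^{\tau,k}_{\nu} = \big|\{\, d \in \mathbb{Z}^{+} \mid (\mathsf{st})^{\tau,k}_{\nu} \le d \le (\mathsf{ed})^{\tau,k}_{\nu} \text{ and } (\mathsf{aggCond})^{\tau,k}_{\nu[x \to d]} = \mathsf{true} \,\}\big|.$$
Intuitively, it counts how many times aggCond is evaluated to true within the given range, i.e., between st and ed. This aggregate function is useful for counting the number of events/activities within a certain range that satisfy a certain condition, for example, the number of activities named "modifying delivery appointment" within a certain range in a trace.
The semantics of the aggregate function countVal is defined analogously. Intuitively, it counts the number of different values of the attribute attName within all events between the given start and end time points (i.e., between st and ed).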
The aggregate function concat concatenates the values that are obtained from the evaluation of the given non-numeric expression over the valid aggregation range (i.e., we only consider the values within the given aggregation range for which the aggregation condition is satisfied). Moreover, the concatenation ignores undefined values and treats them as the empty string. The formal semantics of the aggregate function concat is provided in Figure 2, where ⊙ is a concatenation operator that simply concatenates two non-numeric values.
Notice that, for convenience, we could easily extend our language with unconditional aggregate functions, i.e., aggregate functions without an aggregation condition such as sum(numSrc; where x = st : ed). In this case, they simply perform an aggregation computation over the values that are obtained by evaluating the specified numeric/non-numeric expression over the specified aggregation range. However, they do not give additional expressive power since they are only syntactic variants of the current conditional aggregate functions. This is the case because we can simply put "true" as the aggregation condition, e.g., sum(numSrc; where x = st : ed; and true). Based on their semantics, we get aggregate functions that behave as unconditional aggregate functions, i.e., they ignore the aggregation condition since it is always true for every value within the specified aggregation range. In the following, for brevity of presentation, when the aggregation condition is not important we often simply use the unconditional version of aggregate functions.
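The remaining aggregate functions can be sketched in the same spirit. The following Python fragment is only an illustration of the intuitions stated above: the helper names are hypothetical, the trace is assumed to be a list of attribute dictionaries, and the choice to skip undefined values in `count_val` is an assumption, not something taken from the formal semantics.

```python
from typing import Callable, List

Trace = List[dict]  # one dictionary of attribute values per event


def attr(trace: Trace, i: int, name: str):
    """Value of attribute `name` at the i-th event (1-based), or None for ⊥."""
    return trace[i - 1].get(name) if 1 <= i <= len(trace) else None


def count(trace: Trace, st: int, ed: int, agg_cond: Callable) -> int:
    """Number of indices in [st, ed] at which agg_cond is true."""
    return sum(1 for d in range(st, ed + 1) if agg_cond(trace, d))


def count_val(trace: Trace, st: int, ed: int, att_name: str) -> int:
    """Number of different values of att_name among events st..ed."""
    values = {attr(trace, d, att_name) for d in range(st, ed + 1)}
    values.discard(None)    # assumption: undefined values are not counted
    return len(values)


def concat(trace: Trace, st: int, ed: int, src: Callable,
           agg_cond: Callable = lambda t, d: True) -> str:
    """Concatenate src over [st, ed]; undefined values act as empty strings."""
    return "".join(str(src(trace, d)) for d in range(st, ed + 1)
                   if agg_cond(trace, d) and src(trace, d) is not None)


sample = [
    {"concept:name": "a", "org:resource": "ann"},
    {"concept:name": "b", "org:resource": "bob"},
    {"concept:name": "a", "org:resource": "ann"},
]
name = lambda t, d: attr(t, d, "concept:name")

n_a = count(sample, 1, 3, lambda t, d: name(t, d) == "a")
n_resources = count_val(sample, 1, 3, "org:resource")
all_names = concat(sample, 1, 5, name)   # indices 4 and 5 are undefined
```

Passing the default condition (always true) to `concat` corresponds to the unconditional variant discussed above.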

First-Order Event Expression (FOE)
Finally, we are ready to define the language for specifying condition expression, namely First-Order Event Expression (FOE). A part of this language is also used to specify target expression.
An FOE formula is a First Order Logic (FOL) [61] formula where the atoms are event expressions and the quantification ranges over event indices. Syntactically, an FOE formula is built from event expressions using, among others, the following constructs: (iii) ∀i.ϕ is an FOE formula where the variable i is universally quantified; (iv) ∃i.ϕ is an FOE formula where the variable i is existentially quantified; (v) ϕ_1 ∧ ϕ_2 is a conjunction of FOE formulas; (vi) ϕ_1 ∨ ϕ_2 is a disjunction of FOE formulas; (vii) ϕ_1 → ϕ_2 is an FOE implication formula saying that ϕ_1 implies ϕ_2; (viii) the notion of free and bound variables is as usual in FOL, except that the variables inside aggregate functions, i.e., aggregation variables, are not considered as free variables; (ix) the aggregation variables cannot be existentially/universally quantified.
The semantics of FOE constructs is based on the usual FOL semantics. Formally, we extend the definition of our interpretation function $(\cdot)^{\tau,k}_{\nu}$ to FOE formulas. In particular, for the quantifiers, $(\exists i.\phi)^{\tau,k}_{\nu} = \mathsf{true}$ iff $(\phi)^{\tau,k}_{\nu[i \to c]} = \mathsf{true}$ for some c ∈ {1, . . . , |τ|}, and $(\forall i.\phi)^{\tau,k}_{\nu} = \mathsf{true}$ iff $(\phi)^{\tau,k}_{\nu[i \to c]} = \mathsf{true}$ for every c ∈ {1, . . . , |τ|}. As before, ν[i → c] substitutes each variable i with c, while the other variables are substituted the same way as ν is defined. When ϕ is a closed formula, its truth value does not depend on the valuation of the variables, and we denote the interpretation of ϕ simply by $(\phi)^{\tau,k}$. We also say that the trace τ and the prefix length k satisfy ϕ, written τ, k |= ϕ, if $(\phi)^{\tau,k} = \mathsf{true}$. With a little abuse of notation, sometimes we also say that the k-length trace prefix τ^k of the trace τ satisfies ϕ, written τ^k |= ϕ, if τ, k |= ϕ.
Example 6 An SLA can be expressed as an FOE formula which essentially says that whenever there is an event where an order is created, eventually there will be an event where the corresponding order is delivered and the time difference between the two events (the processing time) is less than or equal to 10.800.000 milliseconds (3 hours).
In general, FOE has the following main features: (i) it allows us to specify constraints over the data (attribute values); (ii) it allows us to (universally/existentially) quantify over different event time points and to compare different event attribute values at different event time points; (iii) it allows us to specify arithmetic expressions/operations involving the data as well as aggregate functions; (iv) it allows us to do selective aggregation operations (i.e., selecting the values to be aggregated); (v) the fragments of FOE, namely the numeric and non-numeric expressions, allow us to specify the way to compute a certain value (we will see later that this is needed to specify how to compute the target value).

Checking Whether a Closed FOE Formula is Satisfied
We now proceed to introduce several properties of FOE formulas that are useful for checking whether a trace τ and a prefix length k satisfy a closed FOE formula ϕ, i.e., to check whether τ, k |= ϕ. This check is needed when we create the prediction model based on the specification of prediction task provided by an analytic rule.
Let ϕ be an FOE formula, we write ϕ[i → c] to denote a new formula obtained by substituting each variable i in ϕ by c. In the following, Theorems 1 and 2 show that, while checking whether a trace τ and a prefix length k satisfy a closed FOE formula ϕ, we can eliminate the presence of existential and universal quantifiers.
Theorem 1 Given a closed FOE formula ∃i.ϕ, a trace τ and a prefix length k,
$$\tau, k \models \exists i.\phi \quad \text{iff} \quad \tau, k \models \bigvee_{c \in \{1, \ldots, |\tau|\}} \phi[i \to c].$$
Proof By the definition of the semantics of FOE, we have that τ and k satisfy ∃i.ϕ (i.e., τ, k |= ∃i.ϕ) iff there exists an index c ∈ {1, . . . , |τ|} such that τ and k satisfy the formula ψ that is obtained from ϕ by substituting each variable i in ϕ with c. Thus, it is the same as satisfying the disjunction of the formulas obtained by considering all possible substitutions of the variable i in ϕ by all possible values of c (i.e., $\bigvee_{c \in \{1, \ldots, |\tau|\}} \phi[i \to c]$). This is the case because such a disjunction of formulas is satisfied by τ and k if and only if at least one of its disjuncts is satisfied by τ and k. □
Theorem 2 Given a closed FOE formula ∀i.ϕ, a trace τ and a prefix length k,
$$\tau, k \models \forall i.\phi \quad \text{iff} \quad \tau, k \models \bigwedge_{c \in \{1, \ldots, |\tau|\}} \phi[i \to c].$$
Proof The proof is quite similar to that of Theorem 1, except that we use a conjunction of formulas. Basically, we have that τ and k satisfy ∀i.ϕ (i.e., τ, k |= ∀i.ϕ) iff for every c ∈ {1, . . . , |τ|}, we have that τ, k |= ψ, where ψ is obtained from ϕ by substituting each variable i in ϕ with c. In other words, τ and k satisfy each formula that is obtained from ϕ by considering all possible substitutions of the variable i with all possible values of c. Hence it is the same as satisfying the conjunction of those formulas (i.e., $\bigwedge_{c \in \{1, \ldots, |\tau|\}} \phi[i \to c]$). This is the case because such a conjunction of formulas is satisfied by τ and k if and only if each of its conjuncts is satisfied by τ and k. □
To check whether a trace τ and a prefix length k satisfy a closed FOE formula ϕ, i.e., τ, k |= ϕ, we could perform the following steps:
1. First, we eliminate all quantifiers. This can be done easily by applying Theorems 1 and 2. As a result, each quantified variable is instantiated with a concrete value;
2. Next, we evaluate all aggregate functions as well as all event attribute accessor expressions based on the event attributes in τ so as to get the actual values of the corresponding event attributes. After this step, we have a formula that consists only of concrete values composed by either arithmetic operators (i.e., + or −), logical comparison operators (i.e., == or ≠), or arithmetic comparison operators (i.e., <, >, ≤, ≥, == or ≠);
3. Last, we evaluate all arithmetic expressions as well as all expressions involving logical and arithmetic comparison operators. If the whole evaluation gives us true (i.e., $(\phi)^{\tau,k} = \mathsf{true}$), then we have that τ, k |= ϕ, otherwise τ, k ⊭ ϕ (i.e., τ and k do not satisfy ϕ).
The existence of this procedure gives us the following theorem:
Theorem 3 Given a closed FOE formula ϕ, a trace τ and a prefix length k, checking whether τ, k |= ϕ is decidable.
This procedure has been implemented in our prototype as a part of the mechanism for processing the specification of prediction task while constructing the prediction model.
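The checking procedure can be sketched compactly in Python: because a trace is finite, a quantifier over event indices becomes an `any` or `all` over 1..|τ|, which is exactly the disjunction or conjunction of Theorems 1 and 2. This is an illustrative sketch, not the prototype's code; formulas are modelled as Python callables, and the activity names "create order" and "deliver order" are hypothetical placeholders.

```python
from typing import Callable, List

Trace = List[dict]
Formula = Callable[[Trace, int], bool]   # closed formula: (trace, k) -> bool


def attr(trace: Trace, i: int, name: str):
    """Value of attribute `name` at the i-th event (1-based), or None for ⊥."""
    return trace[i - 1].get(name) if 1 <= i <= len(trace) else None


def exists(body: Callable[[int], Formula]) -> Formula:
    """∃i.ϕ as the disjunction of ϕ[i → c] for c in 1..|τ| (Theorem 1)."""
    return lambda trace, k: any(body(c)(trace, k)
                                for c in range(1, len(trace) + 1))


def forall(body: Callable[[int], Formula]) -> Formula:
    """∀i.ϕ as the conjunction of ϕ[i → c] for c in 1..|τ| (Theorem 2)."""
    return lambda trace, k: all(body(c)(trace, k)
                                for c in range(1, len(trace) + 1))


def implies(p: Formula, q: Formula) -> Formula:
    return lambda trace, k: (not p(trace, k)) or q(trace, k)


def created(i: int) -> Formula:
    return lambda t, k: attr(t, i, "concept:name") == "create order"


def delivered_in_time(i: int, j: int) -> Formula:
    def atom(t: Trace, k: int) -> bool:
        ti = attr(t, i, "time:timestamp")
        tj = attr(t, j, "time:timestamp")
        return (j > i
                and attr(t, j, "concept:name") == "deliver order"
                and ti is not None and tj is not None
                and tj - ti <= 10_800_000)        # 3 hours in milliseconds
    return atom


# Every created order is eventually delivered within 3 hours.
sla = forall(lambda i: implies(created(i),
                               exists(lambda j: delivered_in_time(i, j))))

on_time = [{"concept:name": "create order", "time:timestamp": 0},
           {"concept:name": "deliver order", "time:timestamp": 3_600_000}]
late = [{"concept:name": "create order", "time:timestamp": 0},
        {"concept:name": "deliver order", "time:timestamp": 20_000_000}]
```

Evaluating `sla(on_time, 1)` yields true and `sla(late, 1)` yields false. The evaluation always terminates because each quantifier ranges over finitely many indices, which is the observation behind Theorem 3.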

Formalizing the Analytic Rule
With this machinery in hand, we can formally say how to specify condition and target expressions in analytic rules, namely that condition expressions are specified as closed FOE formulas, while target expressions are specified as either numeric expression or non-numeric expression, except that target expressions are not allowed to have index variables (Thus, they do not need variable valuation). We require an analytic rule to be coherent, i.e., all target expressions of an analytic rule should be either only numeric or non-numeric expressions. An analytic rule in which all of its target expressions are numeric expressions is called numeric analytic rule, while an analytic rule in which all of its target expressions are non-numeric expressions is called non-numeric analytic rule.
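We can now formalize the semantics of analytic rules as illustrated in Section 3.1. Formally, given a trace τ, a considered prefix length k, and an analytic rule R of the form
$$R = \langle \mathsf{Cond}_1 \Longrightarrow \mathsf{Target}_1,\ \ldots,\ \mathsf{Cond}_n \Longrightarrow \mathsf{Target}_n,\ \mathsf{DefaultTarget} \rangle,$$
R maps τ and k into a value obtained from evaluating the corresponding target expression as follows:
$$R(\tau, k) = \begin{cases} (\mathsf{Target}_i)^{\tau,k} & \text{if } \tau, k \models \mathsf{Cond}_i \text{ for some } i \in \{1, \ldots, n\},\\ (\mathsf{DefaultTarget})^{\tau,k} & \text{otherwise,} \end{cases}$$
where $(\mathsf{Target}_i)^{\tau,k}$ is the application of our interpretation function $(\cdot)^{\tau,k}$ to the target expression Target_i in order to evaluate the expression and get the value. Checking whether the given trace τ and the given prefix length k satisfy Cond_i, i.e., τ, k |= Cond_i, can be done as explained in Section 3.3.1.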
We also require an analytic rule to be well-defined. Given a trace τ, a prefix length k, and an analytic rule R, we say that R is well-defined for τ and k if R maps τ and k into exactly one target value, i.e., for every pair of condition expressions Cond_i and Cond_j such that τ, k |= Cond_i and τ, k |= Cond_j, we have that $(\mathsf{Target}_i)^{\tau,k} = (\mathsf{Target}_j)^{\tau,k}$. This notion of well-definedness can be easily generalized to event logs as follows: given an event log L and an analytic rule R, we say that R is well-defined for L if, for every trace τ in L and every possible prefix length k, R is well-defined for τ and k. Note that this condition can be easily checked for a given event log L and analytic rule R since the event log is finite. Well-definedness is required in order to guarantee that the given analytic rule R behaves as a function with respect to the given event log L, i.e., R maps every pair of trace τ and prefix length k into a unique value. Compared to enforcing that the conditions in an analytic rule must not overlap, our notion of well-definedness gives more flexibility in writing a specification in our language while still guaranteeing reasonable behaviour. For instance, one can specify several characteristics of ping-pong behaviour in a more convenient way by specifying several condition-target expressions that share the same target, rather than combining all characteristics into a single disjunction, which could end up as a very long condition expression.
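To make the mapping and the well-definedness check concrete, the following Python sketch represents an analytic rule as a list of (condition, target) pairs plus a default target. The representation, the function names, and the example labels "Complex", "Shared" and "Normal" are assumptions chosen for illustration only.

```python
from typing import Any, Callable, List, Tuple

Trace = List[dict]
Cond = Callable[[Trace, int], bool]      # closed condition: (trace, k) -> bool
Target = Callable[[Trace, int], Any]     # target expression: (trace, k) -> value
Rule = Tuple[List[Tuple[Cond, Target]], Target]   # (pairs, default target)


def satisfied_targets(rule: Rule, trace: Trace, k: int) -> list:
    """Target values of all condition-target pairs whose condition holds."""
    pairs, _default = rule
    return [target(trace, k) for cond, target in pairs if cond(trace, k)]


def apply_rule(rule: Rule, trace: Trace, k: int):
    """R(τ, k): the unique satisfied target value, or the default target."""
    values = satisfied_targets(rule, trace, k)
    if not values:
        return rule[1](trace, k)
    if any(v != values[0] for v in values):
        raise ValueError("rule is not well-defined for this trace and prefix")
    return values[0]


def is_well_defined(rule: Rule, log: List[Trace]) -> bool:
    """Check well-definedness for every trace and every prefix length."""
    for trace in log:
        for k in range(1, len(trace) + 1):
            values = satisfied_targets(rule, trace, k)
            if any(v != values[0] for v in values):
                return False
    return True


# Two overlapping conditions; overlap is harmless when the targets agree.
long_trace: Cond = lambda t, k: len(t) > 3
several_resources: Cond = lambda t, k: len({e["org:resource"] for e in t}) > 1
const = lambda label: (lambda t, k: label)

good_rule: Rule = ([(long_trace, const("Complex")),
                    (several_resources, const("Complex"))], const("Normal"))
bad_rule: Rule = ([(long_trace, const("Complex")),
                   (several_resources, const("Shared"))], const("Normal"))

log = [[{"org:resource": "ann"}, {"org:resource": "bob"},
        {"org:resource": "ann"}, {"org:resource": "bob"}]]
```

Here `good_rule` is well-defined for `log` even though both conditions hold for the same trace, whereas `bad_rule` is not, because the two satisfied conditions disagree on the target value.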

Building the Prediction Model
Given an analytic rule R and an event log L, if R is a numeric analytic rule, we build a regression model. Otherwise, if R is a non-numeric analytic rule, we build a classification model. In both cases, our aim is to create a prediction function that takes a (partial) trace as the input and predicts the most probable output value for the given input. To this aim, we train a classification/regression model in which the input is the features that are obtained from the encoding of all possible trace prefixes in the event log L (the training data). There are several ways to encode (partial) traces into input features for training a machine learning model. For instance, [35,60] study various encoding techniques such as index-based encoding, boolean encoding, etc. In [63], the authors use the so-called one-hot encoding of event names, and also add some time-related features (e.g., the time increase with respect to the previous event). Some works consider feature encodings that incorporate the information of the last n events. There are also several choices regarding the information to be incorporated. One can incorporate only the names of the events/activities, or one can also incorporate other information (provided by the available event attributes) such as the (human) resource who is in charge of the activity.
In general, an encoding technique can be seen as a function enc that takes a trace τ as the input and produces a set {x_1, . . . , x_m} of features, i.e., enc(τ) = {x_1, . . . , x_m}. Furthermore, since a trace τ might have arbitrary length (i.e., an arbitrary number of events), the encoding function must be able to transform this arbitrary amount of trace information into a fixed number of features. This can be done, for example, by considering the last n events of the given trace τ or by aggregating the information within the trace itself. In the encoding that incorporates the last n events, if the number of events within the trace τ is less than n, then typically we can add 0 for all missing information in order to get a fixed number of features.
In our approach, users are allowed to choose the desired encoding mechanism by specifying a set Enc of preferred encoding functions (i.e., Enc = {enc_1, . . . , enc_n}). This allows us to do some sort of feature engineering (note that a desired feature engineering approach that might help increase the prediction performance can also be added as one of these encoding functions). The set of features of a trace is then obtained by combining all features produced by applying each of the selected encoding functions to the corresponding trace. In the implementation (cf. Section 6), we provide some encoding functions that can be selected in order to encode a trace.
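As an illustration of how encoding functions can be combined, the sketch below implements two simple encoders in Python: a one-hot encoding of the last n activity names with zero padding, and the time since the previous event. The function names, the fixed activity vocabulary, and the feature layout are assumptions made for this example and need not match the encoders shipped with the prototype.

```python
from typing import Callable, List

Trace = List[dict]
Encoder = Callable[[Trace], List[float]]


def one_hot_last_n(vocab: List[str], n: int) -> Encoder:
    """One-hot encode the names of the last n events, zero-padded on the left."""
    def enc(prefix: Trace) -> List[float]:
        window = prefix[-n:]
        # Missing events (prefix shorter than n) contribute all-zero blocks.
        feats = [0.0] * ((n - len(window)) * len(vocab))
        for event in window:
            feats.extend(1.0 if event.get("concept:name") == a else 0.0
                         for a in vocab)
        return feats
    return enc


def time_since_previous(prefix: Trace) -> List[float]:
    """Time elapsed between the last two events (0 if there is only one)."""
    if len(prefix) < 2:
        return [0.0]
    return [float(prefix[-1]["time:timestamp"] - prefix[-2]["time:timestamp"])]


def encode(prefix: Trace, encoders: List[Encoder]) -> List[float]:
    """Combine the features produced by every selected encoding function."""
    feats: List[float] = []
    for enc in encoders:
        feats.extend(enc(prefix))
    return feats


encoders = [one_hot_last_n(["a", "b"], n=2), time_since_previous]
prefix_1 = [{"concept:name": "a", "time:timestamp": 0}]
prefix_2 = prefix_1 + [{"concept:name": "b", "time:timestamp": 5}]
```

Both prefixes are mapped to feature vectors of the same length (five features here), regardless of how many events they contain.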
Algorithm 1 Procedure for building the prediction model
Input: an analytic rule R, an event log L, and a set Enc = {enc_1, . . . , enc_n} of encoding functions
Output: a prediction function P
1: for each trace τ ∈ L do
2:   for each k where 1 < k < |τ| do
3:     τ^k_encoded = enc_1(τ^k) ∪ . . . ∪ enc_n(τ^k)
4:     targetValue = R(τ^k)
5:     add a new training instance for P, where P(τ^k_encoded) = targetValue
6:   end for
7: end for
8: Train the prediction function P (either classification or regression model)
Algorithm 1 illustrates our procedure for building the prediction model based on the given inputs, namely: (i) an analytic rule R, (ii) an event log L, and (iii) a set Enc = {enc_1, . . . , enc_n} of encoding functions. The algorithm works as follows: for each k-length trace prefix τ^k of each trace τ in the event log L (where 1 < k < |τ|), we do the following. In line 3, we apply each encoding function enc_i ∈ Enc to τ^k and combine all obtained features. This step gives us the encoded trace prefix. In line 4, we compute the expected prediction result (target value) by applying the analytic rule R to τ^k. In line 5, we add a new training instance by specifying that the prediction function P maps the encoded trace prefix τ^k_encoded into the target value computed in the previous step. Finally, we train the prediction function P and get the desired prediction function.
Observe that the procedure above is independent of the classification/regression model and of the trace encoding technique that are used. One can plug in different machine learning classification/regression models as well as different trace encoding techniques in order to get the desired quality of prediction.
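Algorithm 1 can be sketched in a few lines of Python. The sketch assumes that the analytic rule is available as a function `rule(trace, k)` that evaluates the target on the full trace for prefix length k, that encoders return lists of numbers, and that the learner is any object with a `fit(X, y)` method; `MajorityLabel` is a hypothetical stand-in learner used only to keep the example self-contained.

```python
from collections import Counter
from typing import Any, Callable, List, Tuple

Trace = List[dict]
Encoder = Callable[[Trace], List[float]]


def build_training_set(rule: Callable[[Trace, int], Any],
                       log: List[Trace],
                       encoders: List[Encoder]) -> Tuple[list, list]:
    """Lines 1-7 of Algorithm 1: one training instance per trace prefix."""
    X, y = [], []
    for trace in log:
        for k in range(2, len(trace)):            # prefix lengths 1 < k < |τ|
            prefix = trace[:k]
            features = [f for enc in encoders for f in enc(prefix)]  # line 3
            target = rule(trace, k)                                  # line 4
            X.append(features)                                       # line 5
            y.append(target)
    return X, y


class MajorityLabel:
    """Placeholder learner: always predicts the most frequent training label."""

    def fit(self, X: list, y: list) -> "MajorityLabel":
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X: list) -> list:
        return [self.label_ for _ in X]


def build_prediction_model(rule, log, encoders, model):
    """Line 8 of Algorithm 1: train the chosen model on the generated data."""
    X, y = build_training_set(rule, log, encoders)
    return model.fit(X, y)


# Toy rule: the number of events that remain after the prefix.
remaining_events = lambda trace, k: len(trace) - k
prefix_length: Encoder = lambda prefix: [float(len(prefix))]
toy_log = [[{"concept:name": n} for n in ("a", "b", "c", "d")]]

X, y = build_training_set(remaining_events, toy_log, [prefix_length])
```

Any classifier or regressor exposing a compatible `fit` method could be passed as `model`, which reflects the independence of the procedure from the learning technique.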

Showcases and Multi-Perspective Prediction Service
An analytic rule R specifies a particular prediction task of interest. To specify several desired prediction tasks, we only have to specify several analytic rules, i.e., R 1 , R 2 , . . . , R n . Given a set R = {R 1 , R 2 , . . . , R n } of analytic rules, our approach allows us to construct a prediction model for each analytic rule R i ∈ R. By having all of the constructed prediction models where each of them focuses on a particular prediction objective, we can obtain a multi-perspective prediction analysis service.
In Section 3, we have seen some examples of prediction task specification for predicting the ping-pong behaviour and the remaining processing time. In this section, we present numerous other showcases of prediction task specification using our language.

Predicting Unexpected Behaviour/Situation
We can specify the task for predicting unexpected behaviour by first expressing the characteristics of the unexpected behaviour.
Ping-pong Behaviour. The condition expression Cond 1 (in Section 3.1) expresses a possible characteristic of ping-pong behaviour. Another possible characterization of ping-pong behaviour, Cond 2, captures the condition where "an officer transfers a task to another officer of the same group, and then the task is transferred back to the original officer". In the event log, this situation is captured by a change of the org:resource value in the next event, followed by a change back to the original value in the event after that, while the values of org:group remain the same. We can then create an analytic rule to specify the task for predicting ping-pong behaviour as follows: R 3 = ⟨Cond 1 ⟹ "Ping-Pong", Cond 2 ⟹ "Ping-Pong", "Not Ping-Pong"⟩, where Cond 1 is the same as specified in Section 3.1. During the construction of the prediction model, in the training phase, R 3 maps each trace prefix τ^k that satisfies either Cond 1 or Cond 2 into the target value "Ping-Pong", and those prefixes that satisfy neither Cond 1 nor Cond 2 into "Not Ping-Pong". After training the model based on this rule, we get a classifier that is trained to distinguish between (partial) traces that are likely and unlikely to lead to ping-pong behaviour. This example also exhibits the ability of our language to specify a behaviour that has multiple characteristics.
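The second characterization of ping-pong behaviour can be read operationally. The following Python sketch follows the prose description only: it scans the whole trace for three consecutive events of the same group in which the resource changes and then changes back. The function names are hypothetical, and scanning the whole trace (rather than only the events after the current one) is an assumption.

```python
from typing import List

Trace = List[dict]


def has_ping_pong(trace: Trace) -> bool:
    """True if some task goes to another officer of the same group and back."""
    for first, second, third in zip(trace, trace[1:], trace[2:]):
        resource_changes = first["org:resource"] != second["org:resource"]
        resource_returns = first["org:resource"] == third["org:resource"]
        same_group = (first["org:group"] == second["org:group"]
                      == third["org:group"])
        if resource_changes and resource_returns and same_group:
            return True
    return False


def ping_pong_label(trace: Trace) -> str:
    """Target value used for training, with the labels named in the text."""
    return "Ping-Pong" if has_ping_pong(trace) else "Not Ping-Pong"


def events(*pairs):
    return [{"org:resource": r, "org:group": g} for r, g in pairs]


bounced = events(("ann", "g1"), ("bob", "g1"), ("ann", "g1"))
forwarded = events(("ann", "g1"), ("bob", "g1"), ("carl", "g1"))
cross_group = events(("ann", "g1"), ("bob", "g2"), ("ann", "g1"))
```

Only `bounced` is labelled "Ping-Pong"; `forwarded` never returns to the original officer, and `cross_group` leaves the group.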
Abnormal Activity Duration. A condition expression Cond 3 can specify the existence of an abnormal waiting duration by stating that there exists a waiting activity whose duration is more than 2 hours (7.200.000 milliseconds). As before, we can then specify an analytic rule for predicting whether a (partial) trace is likely to have an abnormal waiting duration or not. Applying the approach for constructing the prediction model in Section 4, we obtain a classifier that is trained to predict whether a (partial) trace is likely or unlikely to have an abnormal waiting duration.

Predicting SLA/Business Constraints Compliance
Using FOE, we can easily specify numerous expressive SLA conditions as well as business constraints. Furthermore, using the approach presented in Section 4, we can create the corresponding prediction model, which predicts the compliance of the corresponding SLA/business constraints.
Time-related SLA. Let Cond 4 be the FOE formula in Example 6. Roughly speaking, Cond 4 expresses an SLA stating that each order that is created will be eventually delivered within 3 hours. We can then specify an analytic rule for predicting the compliance of this SLA as follows: Using R 5 , our procedure for constructing the prediction model in Section 4 generates a classifier that is trained to predict whether a (partial) trace is likely or unlikely to comply with the given SLA.

Separation of Duties (SoD). We could also specify a constraint concerning Separation of Duties (SoD). For instance, we require that the person who assembles the product is different from the person who checks the product (i.e., quality assurance). This can be expressed by a condition expression Cond 5. Intuitively, Cond 5 states that for every two activities, if they are assembling and checking activities, then the resources who are in charge of those activities must be different. Similar to previous examples, we can specify an analytic rule for predicting the compliance with this constraint. Applying our procedure for building the prediction model, we obtain a classifier that is trained to predict whether or not a trace is likely to fulfil this constraint.
Constraint on Activity Duration. Another example would be a constraint on the activity duration, e.g., a requirement which states that each activity must be finished within 2 hours. This can be expressed as follows: Cond 6 = ∀i.(e[i + 1]. time:timestamp − e[i]. time:timestamp) < 7.200.000.
Cond 6 basically says that the time difference between two consecutive activities is always less than 2 hours (7.200.000 milliseconds). An analytic rule R 7 to predict the compliance with this SLA can then be specified using Cond 6 as its condition expression. Notice that we can express the same specification in a different way, for instance with an analytic rule R 8 based on Cond 7 = ∃i.(e[i + 1]. time:timestamp − e[i]. time:timestamp) > 7.200.000. Essentially, Cond 7 expresses a specification on the existence of an abnormal activity duration. It states that there exists an activity for which the time difference between that activity and the next activity is greater than 7.200.000 milliseconds (2 hours). Using either R 7 or R 8, our procedure for building the prediction model (cf. Algorithm 1) gives us a classifier that is trained to distinguish between the partial traces that most likely will and will not satisfy this activity duration constraint. We could even specify a more fine-grained constraint by focusing on a particular activity. For instance, the following expression specifies that each validation activity must be done within 2 hours (7.200.000 milliseconds): Cond 8 = ∀i.(e[i]. concept:name == "validation" → (e[i + 1]. time:timestamp − e[i]. time:timestamp) < 7.200.000). Cond 8 basically says that for each validation activity, the time difference between that activity and its next activity is always less than 2 hours (7.200.000 milliseconds). Similar to the previous examples, it is easy to see that we could specify an analytic rule for predicting the compliance with this SLA and create a prediction model that is trained to predict whether a (partial) trace is likely or unlikely to fulfil this SLA.

Predicting Time Related Information
In Section 3.1, we have seen how we can specify the task for predicting the remaining processing time (by specifying a target expression that computes the time difference between the timestamps of the last and the current events). In the following, we provide further examples of predicting time-related information.
Predicting Delay. Delay can be defined as a condition when the actual processing time is longer than the expected processing time. Suppose we have the information about the expected processing time, e.g., provided by an attribute "expectedDuration" of the first event, we can specify an analytic rule for predicting the occurrence of delay as follows: where Cond 9 is specified as follows: Cond 9 states that the difference between the last event timestamp and the first event timestamp (i.e., the processing time) is greater than the expected duration (provided by the value of the event attribute "expectedDuration"). While training the classification model, R 9 maps each trace prefix τ k into either "Delay" or "Normal" depending on whether the processing time of the whole trace τ is greater than the expected processing time or not.
Predicting the Overhead of Running Time. The overhead of running time is the amount of time that exceeds the expected running time. If the actual running time does not go beyond the expected running time, then the overhead is 0. Suppose that the expected running time is 3 hours (10.800.000 milliseconds). The task for predicting the overhead of running time can then be specified as follows: R 10 = ⟨curr < last ⟹ max(Overhead, 0), 0⟩, where Overhead = TotalRunTime − 10.800.000, and TotalRunTime = e[last]. time:timestamp − e[1]. time:timestamp. In this case, R 10 computes the difference between the actual total running time and the expected total running time. Moreover, it outputs 0 if the actual total running time is less than the expected total running time, since it takes the maximum value between the computed time difference and 0. Applying our procedure for creating the prediction model, we obtain a regression model that predicts the overhead of running time.
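The target value for the overhead task is a short arithmetic computation, shown here as a Python sketch. It assumes timestamps in milliseconds stored under the attribute time:timestamp and takes the total running time to be the difference between the last and the first event timestamps; the function name is hypothetical.

```python
from typing import List

Trace = List[dict]

EXPECTED_RUN_TIME_MS = 10_800_000   # 3 hours, as in the text


def overhead(trace: Trace, expected_ms: int = EXPECTED_RUN_TIME_MS) -> int:
    """Amount of time by which the total running time exceeds the expectation."""
    total_run_time = trace[-1]["time:timestamp"] - trace[0]["time:timestamp"]
    # Taking the maximum with 0 yields 0 whenever the process is not overdue.
    return max(total_run_time - expected_ms, 0)


four_hours = [{"time:timestamp": 0}, {"time:timestamp": 14_400_000}]
one_hour = [{"time:timestamp": 0}, {"time:timestamp": 3_600_000}]
```

For `four_hours` the overhead is 3.600.000 milliseconds (1 hour), while for `one_hour` it is 0.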

Predicting the Remaining Duration of a Certain Event. Let the duration of an event be the time difference between the timestamp of that event and that of its succeeding event. The task for predicting the total duration of all remaining "waiting" events can be specified by an analytic rule whose target expression RemWaitingDur is defined, using the aggregate function sum, as the sum of the durations of all remaining waiting events. Similarly, we can consider the average activity duration, where the activity duration is defined as the time difference between the timestamp of that activity and that of its next activity. We can then specify an analytic rule that expresses the task for predicting the average activity duration. Similar to previous examples, applying our procedure for creating the prediction model, we get a regression model that approximates the average activity duration of a process.

Predicting Workload-related Information
Knowing the amount of work to be done (i.e., the workload) would be beneficial. Predicting the activity frequency is one way to get an overview of the workload. An analytic rule R 13 can specify the task for predicting the number of remaining activities that need to be performed; in this case, R 13 counts the number of remaining activities. We could also provide a more fine-grained specification by focusing on a certain activity. For instance, the task for predicting the number of remaining validation activities that need to be done can be specified by an analytic rule R 14 whose target expression NumOfRemValidation is specified as follows: count(e[x]. concept:name == "validation"; where x = curr : last). NumOfRemValidation counts the occurrences of validation activities between the current event and the last event (the occurrence of a validation activity is reflected by the fact that the value of the attribute concept:name is equal to "validation"). Applying our procedure for creating the prediction model to R 13 and R 14, we get regression models that predict the number of remaining activities and the number of remaining validation activities, respectively. We could also classify a process as complex or normal based on the frequency of a certain activity. For instance, we could consider a process that requires more than 25 validation activities as complex (otherwise it is normal). This task can be specified by an analytic rule R 15 whose condition expression Cond 10 is specified as follows: count(e[x]. concept:name == "validation"; where x = 1 : last) > 25. Based on R 15, we could train a model to classify whether a (partial) trace is likely to be a complex or a normal process.

Predicting Resource-related Information
Human resources could be a crucial factor in the process execution. Knowing the number of different resources that are needed for handling a process could be beneficial. The following analytic rule specifies the task for predicting the number of different resources that are required: R 16 = ⟨curr < last ⟹ countVal(org:resource; within 1 : last), 0⟩.
During the training phase, since countVal(org:resource; within 1 : last) is evaluated to the number of different values of the attribute org:resource within the corresponding trace, R 16 maps each trace prefix τ k into the number of different resources.
To predict the number of task handovers among resources, we can specify the following prediction task: where NumHandovers is defined as follows: i.e., NumHandovers counts the number of changes on the value of the attribute org:resource and the changes of resources reflect the task handovers among resources. Thus, in this case, R 17 maps each trace prefix τ k into the number of task handovers.
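Both resource-related targets are easy to state operationally. The Python sketch below follows the prose: the number of different resources is the number of distinct org:resource values in the trace, and the number of handovers is the number of positions at which the resource differs from the one at the next event. The function names are hypothetical and every event is assumed to carry an org:resource attribute.

```python
from typing import List

Trace = List[dict]


def num_resources(trace: Trace) -> int:
    """Number of different values of org:resource within the trace."""
    return len({event["org:resource"] for event in trace})


def num_handovers(trace: Trace) -> int:
    """Number of consecutive event pairs handled by different resources."""
    return sum(1 for current, following in zip(trace, trace[1:])
               if current["org:resource"] != following["org:resource"])


trace = [{"org:resource": r} for r in ("ann", "ann", "bob", "ann", "carl")]
```

For this trace there are three different resources and three handovers (ann to bob, bob to ann, ann to carl), which shows that the two quantities measure different aspects of resource involvement.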
A process can be considered labor intensive if it involves at least a certain number of different resources, e.g., three different resources. This kind of task can be specified by an analytic rule with a condition expression Cond 11. Essentially, Cond 11 states that there are at least three different events in which the values of the attribute org:resource are pairwise different.

Cost-related prediction
Suppose that each activity within a process has its own cost and this information is stored in the attribute named cost.
The task for predicting the total cost of a process can be specified as follows: R 19 = ⟨curr < last ⟹ sum(e[x]. cost; where x = 1 : last), 0⟩, where R 19 maps each trace prefix τ^k into the corresponding total cost that is computed by summing up the cost of all activities. We can also specify the task for predicting the maximal cost within a process as follows: R 20 = ⟨curr < last ⟹ max(e[x]. cost; where x = 1 : last), 0⟩.
In this case, R 20 computes the maximal cost among the cost of all activities within the corresponding process. Similarly, we can specify the task for predicting the average activity cost as follows: We could also create a more detailed specification. For instance, we want to predict the total cost of all validation activities. This task can be specified as follows: where TotalValidationCost is as follows: In a certain situation, the cost of an activity can be broken down into several components such as human cost and material cost. Thus, the total cost of each activity is actually the sum of the human and material costs. To take these components into account, the prediction task can be specified as follows: where TotalCost is as follows: One might consider a process as expensive if its total cost is greater than a certain amount (e.g., 550 Eur), otherwise it is normal. Based on this characteristic, we could specify a task for predicting whether a process would be expensive or not as follows: where TotalCost = sum(e[x]. cost; where x = 1 : last).

Predicting Process Performance
One could consider the process that runs longer than a certain amount of time as slow, otherwise it is normal. Given a (partial) process execution information, we might be interested to predict whether it will end up as a slow or a normal process. This prediction task can be specified as follows: R 25 states that if the total running time of a process is greater than 18.000.000 milliseconds (5 hours), then it is categorized as slow, otherwise it is normal. During the training, R 25 maps each trace prefix τ k into the corresponding performance category (i.e., slow or normal). In this manner, we get a prediction model that is trained to predict whether a certain (partial) trace will most likely be slow or normal. Notice that we can specify a more fine-grained characteristic of process performance. For instance, we can add one more characteristic into R 25 by saying that the processes that spend less than 3 hours (10.800.000 milliseconds) are considered as fast. This is specified by R 26 as follows: One might consider that a process is performed efficiently if there are only small amount of task handovers between resources. On the other hand, one might consider a process is efficient if it involves only a certain number of different resources. Suppose that the processes that have more than 7 times of task handovers among the (human) resources are considered to be inefficient. We can then specify a task to predict whether a (partial) trace is most likely to be inefficient or not as follows: where Cond 14 is specified as follows: i.e., Cond 14 counts how many times the value of the attribute org:resource is changing from a one time point to another time point by checking whether the value of the attribute org:resource at a particular time point is different from the value of the attribute org:resource at the next time point. Now, suppose that the processes that involve more than 5 resources are considered to be inefficient. 
We can then specify a task to predict whether a (partial) trace is most likely to be inefficient or not as follows: where Cond 15 = countVal(org:resource; within 1 : last), i.e., it counts the number of different values of the attribute org:resource. As before, using R 27 and R 28 , we could then train a classifier to predict whether a process will most likely perform inefficiently or normally.
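The performance targets described above can be sketched in Python as follows. This is a hedged illustration rather than our tool's code: the trace layout, the function names, and the use of integer millisecond timestamps under the key time:timestamp are assumptions; the thresholds (5 hours, 3 hours, 7 handovers, 5 resources) are those used in the text.

```python
# Assumption: timestamps are integers in milliseconds.
def running_time_ms(trace):
    return trace[-1]["time:timestamp"] - trace[0]["time:timestamp"]

def performance_category(trace, slow_ms=18_000_000, fast_ms=10_800_000):
    """slow if longer than 5 hours, fast if shorter than 3 hours, else normal."""
    duration = running_time_ms(trace)
    if duration > slow_ms:
        return "slow"
    if duration < fast_ms:
        return "fast"
    return "normal"

def count_handovers(trace):
    """Number of positions where org:resource differs from the next event."""
    return sum(
        1
        for current, following in zip(trace, trace[1:])
        if current["org:resource"] != following["org:resource"]
    )

def count_resources(trace):
    """Number of different values of org:resource in the trace."""
    return len({event["org:resource"] for event in trace})

def is_inefficient(trace, max_handovers=7, max_resources=5):
    return count_handovers(trace) > max_handovers or count_resources(trace) > max_resources

trace = [
    {"org:resource": "ann", "time:timestamp": 0},
    {"org:resource": "bob", "time:timestamp": 3_600_000},
    {"org:resource": "bob", "time:timestamp": 7_200_000},
    {"org:resource": "ann", "time:timestamp": 20_000_000},
]
print(performance_category(trace), count_handovers(trace), count_resources(trace))
```

Note that the text specifies the two inefficiency criteria as separate tasks; combining them with a disjunction in is_inefficient is a simplification made only for this sketch.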

Predicting Future Activities/Events
The task for predicting the next activity/event can be specified as follows: During the construction of the prediction model, R 29 maps each trace prefix τ k into its next activity name, because e[curr + 1].concept:name is evaluated to the name of the next activity. Similarly, we can specify the task for predicting the next lifecycle as follows: In this case, since e[curr + 1].lifecycle:transition is evaluated to the lifecycle information of the next event, R 30 maps each trace prefix τ k into its next lifecycle.
Instead of just predicting the information about the next activity, we might be interested in predicting more information such as the information about the next three activities. This task can be specified as follows: During the construction of the prediction model, in the training phase, R 31 maps each trace prefix τ k into the information about the next three activities.
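The mapping from trace prefixes to future activities can be illustrated with a small Python sketch. The function name, the representation of a trace as a list of activity names, and the padding value used when fewer than the requested number of events remain are assumptions made for this example; they are not taken from our specification language.

```python
def next_activity_samples(names, horizon=1, pad=None):
    """Pair each prefix of length k with the names of the next `horizon` events.

    Assumption: when fewer than `horizon` events remain, the target is padded
    with `pad` so that all targets have the same length.
    """
    samples = []
    for k in range(1, len(names)):
        future = names[k:k + horizon]
        future = future + [pad] * (horizon - len(future))
        samples.append((names[:k], tuple(future)))
    return samples

trace_names = ["a", "b", "c", "d"]
for prefix, target in next_activity_samples(trace_names, horizon=3):
    print(prefix, "->", target)
```

With horizon=1 the target corresponds to the next activity name only; with horizon=3 it corresponds to the next three activities.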

Implementation and Experiment
As a proof of concept, we develop a prototype that implements our approach. This prototype includes a parser for our language and a program for automatically processing the given prediction task specification as well as for building the corresponding prediction model based on our approach explained in Sections 3 and 4. We also build a ProM 3 plugin that wraps these functionalities. Several feature encoding functions to be selected are also provided, e.g., one hot encoding of event attributes, time since the previous event, plain attribute values encoding, etc. We can also choose the desired machine learning model to be built. Our implementation uses Java and Python. For the interaction between Java and Python, we use Jep (Java Embedded Python) 4 . In general, we use Java for implementing the program for processing the specification and we use Python for dealing with the machine learning models.
Our experiments aim at demonstrating the applicability of our approach in automatically constructing reliable prediction models based on the given specification. The experiments were conducted by applying our approach to several case studies/problems that are based on real-life event logs. Particularly, we use the publicly available event logs that were provided for the Business Process Intelligence Challenge (BPIC) 2012, BPIC 2013, and BPIC 2015. For each event log, several relevant prediction tasks are formulated based on the corresponding domain, and also by considering the available information. For instance, predicting the occurrence of ping-pong behaviour among support groups might be suitable for the BPIC 13 event log, but not for the BPIC 12 event log, since there is no information about groups in the BPIC 12 event log (in fact, they are event logs from two different domains). For each prediction task, we provide the corresponding formal specification that can be fed into our tool in order to create the corresponding prediction model.
For the experiment, we follow the standard holdout method [33]. Specifically, we partition the data into two sets as follows: we use the first 2/3 of the log for the training data and the last 1/3 of the log for the testing data. For each prediction task specification, we apply our approach in order to generate the corresponding prediction model, and then we evaluate the prediction quality of the generated prediction model by considering each k-length trace prefix τ k of each trace τ in the testing set (for 1 < k < |τ|). In order to provide a baseline, we use a statistical-based prediction technique, which is often called Zero Rule (ZeroR). Specifically, for the classification task, the prediction by ZeroR is performed based on the most common target value in the training set, while for the regression task, the prediction is based on the mean value of the target values in the training data.
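The evaluation protocol described above can be sketched in a few lines of Python. This is an illustrative sketch using only the standard library; the function names are hypothetical, and the log is assumed to be a list of traces already ordered as in the event log.

```python
from collections import Counter
from statistics import mean

def holdout_split(traces):
    """First 2/3 of the log for training, last 1/3 for testing."""
    cut = (2 * len(traces)) // 3  # integer arithmetic avoids rounding issues
    return traces[:cut], traces[cut:]

def evaluation_prefixes(trace):
    """All k-length prefixes of a trace with 1 < k < len(trace)."""
    return [trace[:k] for k in range(2, len(trace))]

def zero_r_classifier(train_targets):
    """ZeroR for classification: always predict the most common training target."""
    return Counter(train_targets).most_common(1)[0][0]

def zero_r_regressor(train_targets):
    """ZeroR for regression: always predict the mean of the training targets."""
    return mean(train_targets)

train, test = holdout_split(list(range(6)))
print(train, test)
print(zero_r_classifier(["slow", "normal", "normal"]))
print(zero_r_regressor([1.0, 2.0, 6.0]))
```

Because ZeroR ignores the input features entirely, any model that uses the trace prefix meaningfully should be able to improve on it.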
Within these experiments, we consider several machine learning models, namely (i) Logistic Regression, (ii) Linear Regression, (iii) Naive Bayes Classifier, (iv) Decision Tree [14], (v) Random Forest [13], (vi) Ada Boost [29] with Decision Tree as the base estimator, (vii) Extra Trees [31], and (viii) a Voting Classifier that is composed of Decision Tree, Random Forest, Ada Boost, and Extra Trees. Among these, Logistic Regression, Naive Bayes, and Voting Classifier are only used for classification tasks, and Linear Regression is only used for regression tasks. The rest are used for both. Notably, we also use a Deep Learning Model [32]. In particular, we use a Deep Feed-Forward Neural Network and we consider various sizes of the network by varying its depth and width (we consider numbers of hidden layers ranging from 2 to 6 and three variants of the number of neurons, namely 75, 100 and 150). In the implementation, we use the machine learning libraries provided by scikit-learn [46]. For the implementation of the neural network, we use Keras 5 with the Theano [64] backend.
To assess the prediction quality, we use the standard metrics for evaluating classification and regression models that are generally used in the machine learning literature. These metrics are also widely used in many works in this research area (e.g. [4,39,37,67,35,63]). For the classification task, we use Accuracy, Area Under the ROC Curve (AUC), Precision, Recall, and F-Measure. For the regression task, we use Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). In the following, we briefly explain these metrics. A more elaborate explanation of these metrics can be found in the typical literature on machine learning and data mining, e.g., [44,33,30].
5 https://keras.io
Accuracy is the fraction of predictions that are correct. It is computed by dividing the number of correct predictions by the number of all predictions. The range of accuracy value is between 0 and 1. The value 1 indicates the best model, while 0 indicates the worst model.
An ROC (Receiver Operating Characteristic) curve allows us to visualize the prediction quality of a classifier. If the classifier is good, the curve should be as close to the top left corner as possible. Random guessing is depicted as a straight diagonal line. Thus, the closer the curve is to the straight diagonal line, the worse the classifier is. The value of the area under the ROC curve (AUC) allows us to assess a classifier as follows: an AUC value equal to 1 shows a perfect classifier, while an AUC value equal to 0.5 shows the worst classifier that is not better than random guessing. Thus, the closer the value to 1, the better it is, and the closer the value to 0.5, the worse it is.
Precision measures the exactness of the prediction. When a classifier predicts a certain output for a certain case, the precision value intuitively indicates how likely it is that such a prediction is correct. Specifically, among all cases that are classified into a particular class, precision measures the fraction of those cases that are correctly classified. On the other hand, recall measures the completeness of the prediction. Specifically, among all cases that should be classified as a particular class, recall measures the fraction of those cases that are classified correctly. Intuitively, given a particular class, the recall value indicates the ability of the model to correctly classify all cases that should be classified into that particular class. The best precision and recall value is 1.
F-Measure is the harmonic mean of precision and recall. It provides a measurement that combines both precision and recall values by giving equal weight to them. Formally, it is computed as $F = \frac{2 \cdot P \cdot R}{P + R}$, where P is precision and R is recall. The best F-Measure value is 1. Thus, the closer the value to 1, the better it is.
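As a small worked example of the classification metrics, the following Python sketch computes accuracy, precision, recall, and F-Measure (as the harmonic mean of precision and recall) for one class of interest. The function names and the sample labels are made up for illustration; returning 0.0 when a denominator is zero is a convention chosen here, not something prescribed by the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that are correct."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall and F-Measure with respect to the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

y_true = ["slow", "slow", "normal", "normal", "slow"]
y_pred = ["slow", "normal", "normal", "slow", "slow"]
print(accuracy(y_true, y_pred))
print(precision_recall_f(y_true, y_pred, positive="slow"))
```

In this example two of the three cases predicted as slow are really slow, and two of the three truly slow cases are found, so precision, recall, and F-Measure all equal 2/3, while accuracy is 3/5.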
MAE computes the average of the absolute errors of all predictions over the whole testing data, where each error is computed as the difference between the expected and the predicted values. Formally, given n testing instances, $\mathit{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $\hat{y}_i$ (resp. $y_i$) is the predicted value (resp. the expected/actual value) for the testing instance $i$. RMSE is computed as $\mathit{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, with $\hat{y}_i$ and $y_i$ as above. Compared to MAE, RMSE is more sensitive to large errors, since the 'square' operation gives a larger penalty to larger errors. For both MAE and RMSE, the lower the score, the better the model is.
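The two regression metrics can be checked on a tiny example with the following Python sketch, which uses the standard definitions (including the square root in RMSE). The function names and the sample values are illustrative assumptions.

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the mean of the squared differences."""
    return sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true))

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(mae(y_true, y_pred))
print(rmse(y_true, y_pred))
```

The errors here are 1, 0, 2 and 0, so MAE is 0.75, whereas the mean squared error is 1.25 and RMSE is its square root. The single error of size 2 contributes four times as much as the error of size 1 to the squared sum, which illustrates why RMSE penalizes larger errors more heavily.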
In our experiments, we use the trace encoding that incorporates the information of the last n events, where n is the maximal length of the traces in the event log under consideration. Furthermore, for each experiment we consider two types of encoding that differ in the event attributes they take into account (one encoding incorporates more event attributes than the other). The details of the event attributes that are considered are explained in each experiment below.
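One possible reading of an encoding over the last n events is sketched below in Python for the simplest case, where only the event names are encoded with a one-hot scheme. This is an assumption-laden illustration and not the encoding functions shipped with our prototype: the function names, the left padding with zero vectors for prefixes shorter than n, and the fixed vocabulary are all choices made for this example.

```python
def one_hot(value, vocabulary):
    """One-hot vector of `value` with respect to an ordered vocabulary."""
    return [1 if value == candidate else 0 for candidate in vocabulary]

def encode_prefix(prefix, vocabulary, n):
    """Encode the last n event names of a prefix as one flat feature vector.

    Assumptions: n >= 1, and prefixes shorter than n are padded on the left
    with all-zero vectors so that every prefix yields n * len(vocabulary) features.
    """
    window = prefix[-n:]
    padding = [[0] * len(vocabulary)] * (n - len(window))
    vectors = padding + [one_hot(name, vocabulary) for name in window]
    return [bit for vector in vectors for bit in vector]

vocabulary = ["a", "b", "c"]
print(encode_prefix(["a", "c"], vocabulary, n=3))
```

An encoding that incorporates more event attributes could be obtained in the same spirit by concatenating one such block per attribute, at the price of a larger feature vector.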

Experiment on BPIC 2013 Event Log
The event log from BPIC 2013 6 [62] contains the data from the Volvo IT incident management system called VINST. It stores information concerning the incidents handling process. For each incident, a solution should be found as quickly as possible so as to bring back the service with minimum interruption to the business. It contains 7554 traces (process instances) and 65533 events. There are also several attributes in each event containing various information such as the problem status, the support team (group) that is involved in handling the problem, the person who works on the problem, etc.
In BPIC 2013, ping-pong behaviour is one of the interesting problems to be analyzed. Ideally, an incident should be solved quickly without involving too many support teams. To specify the tasks for predicting whether a process would probably exhibit a ping-pong behaviour, we first identify and express the possible characteristics of ping-pong behaviour as follows: Roughly speaking, Cond E1 says that there is a change in the support team while the problem is not being "Queued". Cond E2 and Cond E3 state that there is a change in the person who handles the problem, but then at some point it changes back into the original person. Cond E4 and Cond E5 say that there is a change in the support team (group) who handles the problem, but then at some point it changes back into the original support team. Cond E6 states that the process of handling the incident involves at least three different groups.
6 More information on BPIC 2013 can be found in http://www.win.tue.nl/bpi/doku.php?id=2013:challenge
We then specify three different analytic rules below in order to specify three different tasks for predicting ping-pong behaviour based on various characteristics of this unexpected behaviour.
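To give an intuition for the kind of pattern that such conditions capture, the following Python sketch checks two of the informally described characteristics on a complete trace: a value that changes and later changes back, and the involvement of at least three different groups. It is only a rough approximation of the informal descriptions above and does not reproduce the formal conditions; the function names and the trace layout are assumptions.

```python
def changes_back(trace, attribute):
    """True if the attribute changes away from some value and later returns to it."""
    values = [event[attribute] for event in trace]
    for i in range(len(values) - 1):
        changed = values[i] != values[i + 1]
        if changed and values[i] in values[i + 2:]:
            return True
    return False

def involves_at_least(trace, attribute, k=3):
    """True if the attribute takes at least k different values in the trace."""
    return len({event[attribute] for event in trace}) >= k

ping_pong = [{"org:group": g} for g in ["G1", "G2", "G1"]]
forward_only = [{"org:group": g} for g in ["G1", "G2", "G3"]]
print(changes_back(ping_pong, "org:group"), changes_back(forward_only, "org:group"))
print(involves_at_least(forward_only, "org:group"))
```

Applying changes_back to org:resource instead of org:group gives the analogous check for the person who handles the problem.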
In BPIC 2013 event log, an incident can have several statuses. One of them is waiting. In this experiment, we predict the remaining duration of all waiting-related events by specifying the following analytic rule: where RemWaitingTime is as follows: i.e., RemWaitingTime is the sum of all event duration in which the status is related to waiting (e.g., Awaiting Assignment, Wait, Wait-User, etc). Similarly, we predict the remaining duration of all (exactly) waiting events by specifying the following: where RemWaitDur is as follows: sum ( i.e., RemWaitDur is the sum of all event duration in which the status is "wait". Both R E4 and R E5 can be fed into our tool, and in this case we generate regression models.
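The two remaining-duration targets can be illustrated with the Python sketch below. Several points are assumptions made only for this example and are labeled as such in the code: the duration of an event is taken to be the time until the next event, the attribute keys status and time are hypothetical, curr is a 1-based position as in our language, and "waiting-related" is approximated by a case-insensitive substring test.

```python
def remaining_duration(trace, curr, is_relevant):
    """Sum the durations of the relevant events from position curr (1-based) onwards.

    Assumption: the duration of event x is time(x + 1) - time(x), so the
    last event of the trace contributes no duration.
    """
    total = 0
    for x in range(curr - 1, len(trace) - 1):
        if is_relevant(trace[x]["status"]):
            total += trace[x + 1]["time"] - trace[x]["time"]
    return total

def waiting_related(status):
    # Hypothetical test covering e.g. "Awaiting Assignment", "Wait", "Wait - User".
    return "wait" in status.lower()

def exactly_wait(status):
    return status.lower() == "wait"

events = [
    {"status": "Accepted", "time": 0},
    {"status": "Wait", "time": 10},
    {"status": "Wait - User", "time": 25},
    {"status": "Accepted", "time": 40},
    {"status": "Completed", "time": 45},
]
print(remaining_duration(events, 1, waiting_related))
print(remaining_duration(events, 1, exactly_wait))
```

Evaluating the function for every position curr of a training trace yields the numeric targets on which a regression model can be trained.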
For all of these tasks, we consider two different trace encodings. First, we use the trace encoding that incorporates several available event attributes, namely concept:name, org:resource, org:group, lifecycle:transition, organization involved, impact, product, resource country, organization country, org:role. Second, we use the trace encoding that only incorporates the event names, i.e., the values of the attribute concept:name. Intuitively, the first encoding considers more information than the second encoding. Thus, the prediction models that are obtained by using the first encoding use more input information for doing the prediction. The evaluation on the generated prediction models from all prediction tasks specified above is reported in Tables 1 and 2.

Experiment on BPIC 2012 Event Log
The event log for BPIC 2012 7 [65] comes from a Dutch financial institute. It stores the information concerning the process of handling either personal loan or overdraft applications. It contains 13,087 traces (process instances) and 262,200 events. Generally, the process of handling an application is as follows: Once an application is submitted, some checks are performed. After that, the application is augmented with necessary additional information that is obtained by contacting the client by phone. An offer will be sent to the client, if the applicant is eligible. After this offer is received back, it is assessed. The customer will be contacted again if there is missing information. After that, a final assessment is performed. In this experiment, we consider two prediction tasks as follows:
1. One type of activity within this process is named W_Completeren aanvraag, which stands for "Filling in information for the application". The task for predicting the total duration of all remaining activities of this type is formulated as follows: where RemTimeFillingInfo is as follows: i.e., it computes the sum of the duration of all remaining W_Completeren aanvraag activities.
2. At the end of the process, an application can be declined. The task to predict whether an application will eventually be declined is specified as follows: i.e., Cond E8 says that eventually there will be an event in which the application is declined.
Both R E6 and R E7 can be fed into our tool. For R E6 , we generate a regression model, while for R E7 , we generate a classification model. Different from the BPIC 2013 and BPIC 2015 event logs, there are not so many event attributes in this log. For all of these tasks, we consider two different trace encodings. First, we use the trace encoding that incorporates several available event attributes, namely concept:name and lifecycle:transition. Second, we use the trace encoding that only incorporates the event names, i.e., the values of the attribute concept:name. Thus, intuitively the first encoding considers more information than the second encoding. The evaluation on the generated prediction models from the prediction tasks specified above is shown in Tables 3 and 4.

Experiment on BPIC 2015 Event Log
In BPIC 2015 8 [66], 5 event logs from 5 Dutch municipalities are provided. They contain the data of the processes for handling building permit applications. In general, the processes in these 5 municipalities are similar. Thus, in this experiment we only consider one of these logs. Several kinds of information are available, such as the activity name and the resource/person that carried out a certain task/activity. The log that we consider has 1409 traces (process instances) and 59681 events.
For this event log, we consider several tasks related to predicting workload-related information (i.e., related to the amount of work/activities that need to be done). First, we deal with the task for predicting whether a process of handling an application is complex or not based on the number of the remaining different activities that need to be done. Specifically, we consider a process to be complex (or to need more attention) if there are still more than 25 different activities that need to be done. This task can be specified as follows: where NumDifRemAct is specified as follows: countVal(activityNameEN; within curr : last) i.e., NumDifRemAct counts the number of different values of the attribute 'activityNameEN' from the current time point until the end of the process. As the next workload-related prediction task, we specify the task for predicting the number of remaining events/activities as follows: where RemAct = count(true; where x = curr : last), i.e., RemAct counts the number of events/activities from the current time point until the end of the process. Both R E8 and R E9 can be fed into our tool. For the former, we generate a classification model, and for the latter, we generate a regression model. For all of these tasks, we consider two different trace encodings. First, we use the trace encoding that incorporates several available event attributes, namely monitoringResource, org:resource, activityNameNL, activityNameEN, question, concept:name. Second, we use the trace encoding that only incorporates the event names, i.e., the values of the attribute concept:name. As before, the first encoding considers more information than the second encoding. 
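The two workload-related targets, NumDifRemAct and RemAct, can be read as the simple Python functions sketched below. The function names, the trace layout, and the sample activity names are assumptions for illustration; curr is treated as a 1-based position that is included in the range, following the reading of curr : last as "from the current time point until the end of the process".

```python
def num_dif_rem_act(trace, curr, attribute="activityNameEN"):
    """Reading of countVal(activityNameEN; within curr : last)."""
    return len({event[attribute] for event in trace[curr - 1:]})

def rem_act(trace, curr):
    """Reading of count(true; where x = curr : last)."""
    return len(trace[curr - 1:])

def is_complex(trace, curr, threshold=25):
    """A process is complex if more than `threshold` different activities remain."""
    return num_dif_rem_act(trace, curr) > threshold

trace = [{"activityNameEN": name} for name in ["a", "b", "a", "c"]]
print(num_dif_rem_act(trace, 2), rem_act(trace, 2), is_complex(trace, 2))
```

The first function yields a target for a classification task once it is compared with the threshold, while the second directly yields a numeric target for a regression task.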
The evaluation on the generated prediction models from the prediction tasks specified above is shown in Tables 5 and 6.

Discussion on the Experiments
In total, our experiments involve 9 different prediction tasks over 3 different real-life event logs from 3 different domains (1 event log from BPIC 2015, 1 event log from BPIC 2012, and 1 event log from BPIC 2013).
Overall, these experiments show the capabilities of our language in capturing and specifying the desired prediction tasks that are based on the event logs coming from real-life situation. These experiments also exhibit the applicability of our approach in automatically constructing reliable prediction models based on the given specification. This is supported by the following facts: first, for all prediction tasks that we have considered, by considering different input features and machine learning models, we are able to obtain prediction models that beat the baseline. Moreover, for all prediction tasks that predict categorical values, in our experiments we are always able to get a prediction model that has AUC value greater than 0.5. Recall that AUC = 0.5 indicates the worst classifier that is not better than a random guess. Thus, since we have AUC > 0.5, the prediction models that we generate certainly take into account the given input and predict the most probable output based on the given input, instead of randomly guessing the output no matter what the input is. In fact, in many cases, we could even get very high AUC values which are ranging between 0.8 and 0.9 (see Tables 1 and 5). This score is very close to the AUC value for the best predictor (recall that AUC = 1 indicates the best classifier).
As can be seen from the experiments, the choice of the input features and the machine learning models influences the quality of the prediction model. The result of our experiments also shows that there is no single machine learning model that always outperforms other models on every task. Since our approach does not rely on a particular machine learning model, we can simply plug in different supervised machine learning techniques in order to get different or better performance. In fact, in our experiments, by considering different models we could get different/better prediction quality. Concerning the input features, for each task in our experiments, we intentionally consider two different input encodings. The first one includes many attributes (hence it incorporates much information), and the second one includes only a certain attribute (i.e., it incorporates less information). In general, one would expect that the more information is available, the better the prediction quality would be, because more information gives a more holistic view of the situation. Although many of our experiment results confirm this expectation, there are several cases where considering fewer features gives a better result, e.g., the RMSE score in the experiment with several models on the task R E5 , and the scores of several metrics in the experiment R E8 (see Tables 2 and 5). In fact, this is aligned with the typical observation in machine learning. The presence of irrelevant features could decrease the prediction quality. Although in the learning process a good model should (or will try to) ignore irrelevant features, the absence of these unrelated features might make the learning process better and might improve the quality of the prediction. 
Additionally, in some situations, too many features might cause overfitting, i.e., the model fits the training data very well, but it fails to generalize well when doing prediction on new data. Based on the experience from these experiments, time constraints would also be a crucial factor in choosing the model when we would like to apply this approach in practice. Some models require a lot of tuning in order to achieve a good performance (e.g., neural network), while other models do not need much adjustment and are still able to achieve relatively good performance (e.g., Extra Trees, Random Forest).
Looking at another perspective, our experiments complement various studies in the area of predictive process monitoring in several ways. First, instead of using machine learning models that are typically used in many studies within this area such as Random Forest and Decision Tree (cf. [37,67,21,22]), we also consider other machine learning models that, to the best of our knowledge, are not typically used. For instance, we use Extra Trees, Ada Boost, and Voting Classifier. Thus, we provide a fresh insight on the performance of these machine learning models in predictive process monitoring by using them in various different prediction tasks (e.g., predicting (fine-grained) time-related information, unexpected behaviour). Although this work is not aimed at comparing various machine learning models, as we see from the experiments, in several cases, Extra Trees exhibits similar performance (in terms of accuracy) as Random Forest. There are also some cases where it outperforms the Random Forest (e.g., see the experiment with the task R E9 in Table 6). In the experiment with the task R E7 , AdaBoost outperforms all other models. Regarding the type of the prediction tasks, we also look into the tasks that are not yet highly explored in the literature within the area of predictive process monitoring. For instance, while there are numerous works on predicting the remaining processing time, to the best of our knowledge, there is no literature exploring a more fine-grained task such as the prediction of the remaining duration of a particular type of event (e.g., predicting the duration of all remaining waiting events). We also consider several workload-related prediction tasks, which is rarely explored in the area of predictive process monitoring.
Concerning the Deep Learning approach, there have been several studies that explore the usage of Deep Neural Network for predictive process monitoring (cf. [63,26,27,23,40]). However, they focus on predicting the name of the future activities/events, the next timestamp, and the remaining processing time. In this light, our experiments contribute new insights on exhibiting the usage of Deep Learning approach in dealing with different prediction tasks other than just those tasks. Although the deep neural network does not always give the best result in all tasks in our experiments, there are several interesting cases where it shows a very good performance. Specifically, in the experiments with the tasks R E4 and R E5 (cf. Table 2), where all other models cannot beat the RMSE score of the baseline, the deep neural network comes to the rescue and becomes the only model that could beat the RMSE score of our baseline.

Related Work
This work is tightly related to the area of predictive analysis in business process management. In the literature, there have been several works focusing on predicting time-related properties of running processes. The works by [3,4,54,55,52,53] focus on predicting the remaining processing time. In [3,4], the authors present an approach for predicting the remaining processing time based on an annotated transition system that contains time information extracted from event logs. The work by [54,55] proposes a technique for predicting the remaining processing time using stochastic Petri nets. The works by [58,59,42,49] focus on predicting delays in process execution. In [58,59], the authors use queueing theory to address the problem of delay prediction, while [42] explores the delay prediction in the domain of transport and logistics process. In [28], the authors present an ad-hoc predictive clustering approach for predicting process performance. The authors of [63] present a deep learning approach (using an LSTM neural network) for predicting the timestamp of the next event and use it to predict the remaining cycle time by repeatedly predicting the timestamp of the next event.
Looking at another perspective, the works by [37,22,67] focus on predicting the outcomes of a running process. The work by [37] introduces a framework for predicting the business constraints compliance of a running process. In [37], the business constraints are formulated in propositional Linear Temporal Logic (LTL), where the atomic propositions are all possible events during the process executions. The work by [22] improves the performance of [37] by using a clustering preprocessing step. Another work on outcomes prediction is presented by [50], which proposes an approach for predicting aggregate process outcomes by taking into account the information about overall process risk. Related to process risks, [18,19] propose an approach for risks prediction. The work by [39] presents an approach based on evolutionary algorithm for predicting business process indicators of a running process instance, where business process indicator is a quantifiable metric that can be measured by data that is generated by the processes. The authors of [41] present a work on predicting business constraint satisfaction. Particularly, [41] studies the impact of considering the estimation of prediction reliability on the costs of the processes.
Another major stream of works tackles the problem of predicting the future activities/events of a running process (cf. [63,26,27,23,40,15,53]). The works by [63,26,27,23,40] use a deep learning approach for predicting the future events, e.g., the next event of the current running process. Specifically, [63,26,27,23] use an LSTM neural network, while [40] uses a deep feed-forward neural network. In [53,23,63] the authors also tackle the problem of predicting the whole sequence of future events (the suffix of the current running process).
A key difference between many of those works and ours is that, instead of focusing on dealing with a particular prediction task (e.g., predicting the remaining processing time or the next event), this work introduces a specification language that enables us to specify various desired prediction tasks for predicting various future information of a running business process. To deal with these various desired prediction tasks, we present a mechanism to automatically process the given specification of a prediction task and to build the corresponding prediction model. From another point of view, several works in this area often describe the prediction tasks under study simply in (possibly ambiguous) natural language. In this light, the presence of our language complements this area by providing a means of formally and unambiguously specifying/describing the desired prediction tasks. Consequently, it could ease the definition of the task and the comparison among different works that propose a particular prediction technique for a particular prediction task.
Regarding the specification language, unlike the propositional LTL [51], which is the basis of the Declare language [47,48] and often used for specifying business constraints over a sequence of events (cf. [37]), our FOE language (which is part of our rule-based specification language) allows us not only to specify properties over sequences of events but also to specify properties over the data (attribute values) of the events, i.e., it is data-aware. Concerning data-aware specification languages, the work by [5] introduces a data-aware specification language by combining data querying mechanisms and temporal logic. Such a language has been used in several works on verification of data-aware process systems (cf. [6,56,17,16]). The works by [20,36] provide a data-aware extension of the Declare language based on the First-Order LTL (LTL-FO). Although those languages are data-aware, they do not support arithmetic expressions/operations over the data, which is absolutely necessary for our purpose, e.g., for expressing the time difference between the timestamp of the first and the last event. Another interesting data-aware language is S-FEEL, which is part of the Decision Model and Notation (DMN) standard [45] by OMG. Though S-FEEL supports arithmetic expressions over the data, it does not allow us to universally/existentially quantify different event time points and to compare different event attribute values at different event time points, which is important for our needs, e.g., in specifying the ping-pong behaviour.
Concerning aggregation, there are several formal languages that incorporate such a feature (cf. [25,12,9]) and many of them have been used in system monitoring. The work by [25] extends the temporal logic Past Time LTL with a counting quantifier. Such an extension allows us to express a constraint on the number of occurrences of events (similar to our count function). In [12] a language called SOLOIST is introduced and it supports several aggregate functions on the number of event occurrences within a certain time window. Differently from ours, both [25] and [12] do not consider aggregation over data (attribute values). The works by [9,10] extend the temporal logic that was introduced in [8,11] with several aggregate functions. Such a language allows us to select the values to be aggregated. However, due to the interplay between the set and bag semantics in their language, as they have illustrated, some values might be lost while computing the aggregation because they first collect the set of tuples of values that satisfy the specified condition and then they collect the bag of values to be aggregated from that set of tuples of values. To avoid this situation, they need to make sure that each tuple of values has a sort of unique identifier. This situation does not happen in our aggregation because, in some sense, we directly use the bag semantics while collecting the values to be aggregated. Importantly, unlike those languages above, apart from allowing us to specify a complex constraint/pattern, a fragment of our FOE language also allows us to specify the way to compute certain values, which is needed for specifying the way to compute the target/predicted values, e.g., the remaining processing time, or the remaining number of a certain activity/event. Our language is also specifically tuned for expressing data-aware properties based on the typical structure of business process execution logs (cf. [34]), and the design is highly driven by the typical prediction tasks in business process management. From another point of view, our work complements the works on predicting SLA/business constraints compliance by providing an expressive language to specify complex data-aware constraints that may involve arithmetic expressions and data aggregation.

Discussion
This section discusses potential limitations of this work, which might pave the way towards our future direction.
This work focuses on the problem of predicting the future information of a single running process based on the current information of that corresponding running process. In practice, there could be several processes running concurrently. Hence, it is absolutely interesting to extend the work further so as to consider the prediction problems on concurrently running processes. This extension would involve the extension of the language itself. For instance, the language should be able to specify some patterns over multiple running processes. Additionally, it should be able to express the desired predicted information or the way to compute the desired predicted information, and it might involve the aggregation of information over multiple running processes. Consequently, the mechanism for building the corresponding prediction model needs to be adjusted.
Our experiments (cf. Section 6) show a possible instantiation of our generic approach in creating prediction services. In this case, we predict the future information of a running process by considering only the information from that single running process. However, in practice, processes that are running concurrently might affect each other's execution. For instance, if many processes are running together and there are not enough employees to handle all of them simultaneously, some processes might need to wait. Hence, when we predict the remaining duration of waiting events, the current workload is a factor that needs to be considered, and ideally this information should be incorporated in the prediction. One possibility to overcome this limitation is to use a trace encoding function that incorporates information related to the concurrently running processes. For instance, we can define an encoding function that extracts relevant information from all concurrently running processes and uses it as input features. Such information could be the number of employees that are actively handling some processes, the number of available resources/employees, the number of processes of a certain type that are currently running, etc.
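A minimal sketch of such a workload-aware encoding is given below. It is not part of our implementation: the Case record, its fields, and the function names are hypothetical, and the base feature vector stands for the output of whichever trace encoding function is already in use.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Case:
    """Hypothetical summary of one process instance."""
    case_id: str
    process_type: str
    start: float            # timestamp of the first event
    end: Optional[float]    # timestamp of the last event, None if still running


def workload_features(cases: Sequence[Case], now: float,
                      process_type: str) -> List[float]:
    """Count the cases running at time `now`, overall and of the given type."""
    running = [c for c in cases
               if c.start <= now and (c.end is None or c.end > now)]
    same_type = [c for c in running if c.process_type == process_type]
    return [float(len(running)), float(len(same_type))]


def encode_with_workload(base_features: Sequence[float], cases: Sequence[Case],
                         now: float, process_type: str) -> List[float]:
    """Append the workload counts to an existing trace encoding."""
    return list(base_features) + workload_features(cases, now, process_type)


# Usage example with made-up data.
CASES = [
    Case("c1", "loan", start=0.0, end=10.0),
    Case("c2", "loan", start=5.0, end=None),
    Case("c3", "claim", start=3.0, end=4.0),
]
features = encode_with_workload([1.0, 0.5], CASES, now=6.0, process_type="loan")
```

Further workload indicators mentioned above, such as the number of available employees, could be appended in the same way, provided that the corresponding data is recorded in the event log or in an accompanying data source.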
This kind of machine-learning-based technique performs the prediction based on observable information. Thus, if the information to be predicted depends on unobservable factors, the quality of the prediction might decrease. Therefore, in practice, the factors that highly influence the information to be predicted should be incorporated as much as possible. Furthermore, the prediction model is built only from the historical information about previously executed processes, and it neglects domain knowledge (e.g., organizational rules) that might influence the prediction. In some sense, it (implicitly) assumes that the domain knowledge is already reflected in the historical data that capture past process executions. It is therefore interesting to develop the technique further so as to incorporate existing domain knowledge in the creation of the prediction model, with the aim of enhancing the prediction quality. From another perspective, since the prediction model is built only from the historical data of past process executions, this approach is suitable for situations in which an (explicit) process model is unavailable or hard to obtain.
As also observed by other works in this area (e.g., [4]), in practice, by predicting the future information of a running process, we might affect the future of the process itself, and hence we might reduce the precision of the prediction. For instance, when it is predicted that a particular process will exhibit an unexpected behaviour, we might watch the process closely in order to prevent that behaviour. In the end, the unexpected behaviour might not happen because of our preventive actions, and hence the prediction does not come true. On the other hand, if we predict that a particular process will run normally, we might pay less attention to that process than we otherwise would, and hence the unexpected behaviour might occur. Therefore, knowing the (prediction of the) future might not always be beneficial in this case. This also indicates that care needs to be taken when using the predicted information.

Conclusion
We have introduced an approach for obtaining predictive process monitoring services based on the specification of the desired prediction tasks. Specifically, we proposed a novel rule-based language for specifying the desired prediction tasks, and we devised a mechanism for automatically building the corresponding prediction models based on the given specification. Establishing such a language is a nontrivial task. The language should be able to capture various prediction tasks, while at the same time allowing us to have a procedure for building/deriving the corresponding prediction model. Our language is a logic-based language equipped with a well-defined formal semantics. Therefore, it allows us to do formal reasoning over the specification, and to have a machine-processable language that enables us to automate the creation of the prediction model. The language allows us to express complex properties involving data and arithmetic expressions. It also allows us to specify the way to compute certain values. Notably, our language supports several aggregate functions. A prototype that implements our approach has been developed, and several experiments using real-life event logs confirmed the applicability of our approach. Our experiments also involve the usage of a deep learning model (in particular, a deep feed-forward neural network).
Apart from the directions discussed in Section 8, future work includes the extension of the tool and the language. One possible extension would be to incorporate a trace attribute accessor that allows us to specify properties involving trace attribute values. As our FOE language is a logic-based language, there is a possibility to exploit existing logic-based tools such as Satisfiability Modulo Theories (SMT) solvers [7] for performing some reasoning tasks related to the language. Experimenting with other supervised machine learning techniques would be a next step as well, for instance by using another deep learning approach (i.e., another type of neural network such as a recurrent neural network) with the aim of improving the prediction quality.