Probability Estimation of Uncertain Process Trace Realizations

Process mining is a scientific discipline that analyzes event data, often collected in databases called event logs. Recently, uncertain event logs have become of interest: these contain non-deterministic and stochastic event attributes that may represent many possible real-life scenarios. In this paper, we present a method to reliably estimate the probability of each such scenario, allowing their analysis. Experiments show that the probabilities calculated with our method closely match the true chances of occurrence of specific outcomes, enabling more trustworthy analyses on uncertain data.


Introduction
Process mining is a discipline that focuses on extracting insights about processes in a data-driven manner. For instance, on the basis of the recorded information on historical process executions, process mining makes it possible to automatically extract a model of the behavior of process instances, or to measure the compliance of the process data with a prescribed normative model of the process. In process mining, the central focus is on the event log, a collection of data that tracks past process instances. Every activity performed in a process is recorded in the event log, together with information such as the corresponding process case and the timestamp of the activity, in a sequence of events called a trace.
Recently, research on novel forms of event data has garnered the attention of the scientific community. Among these are uncertain event logs, which contain data affected by imprecision [8]. This data contains meta-information describing the nature and extent of the uncertainty. Such meta-information can be obtained from the inherent precision with which the data has been recorded (e.g., timestamps only indicating the date have a possible "true value" range of 24 hours), from the precision of the tools involved in supporting the process (e.g., the absolute error of sensors), or from the domain knowledge provided by a process expert. An uncertain trace corresponds to multiple possible real-life scenarios, each of which might have very diverse implications on features of cases such as compliance to a model. It is then important to be able to assess the risk of occurrence of specific outcomes of uncertain traces, which makes it possible to estimate the impact of such traces on indicators such as cost and conformance.
In this paper, we present a method to obtain a complete probability distribution over the possible instantiations of uncertain attributes in a trace. As a possible example of application, we frame our results in the context of conformance checking, and show the impact of assessing probability estimates for uncertain traces on insights about the compliance of an uncertain trace to a process model. We validate our method with experiments based on a Monte Carlo simulation, which shows that the probability estimates are reliable and reflect the true chances of occurrence of a specific outcome.
The remainder of the paper is structured as follows. Section 2 examines relevant related work. Section 3 illustrates a motivating running example for our technique. Section 4 presents preliminary definitions of different types of uncertainty in process mining. Section 5 illustrates a method for computing probabilities of realizations for uncertain process traces. Section 6 validates our method through experimental results. Finally, Section 7 concludes the paper.

Related Work
The analysis of uncertain data in process mining is a very recent research direction. The specific formulation and definition of uncertain data utilized in this paper was introduced in 2019 [8], in the context of an analysis approach that computes bounds for the conformance score of uncertain traces through alignments [5]. Subsequently, that work was extended with an inductive mining approach for process discovery over uncertainty [9] and a taxonomy of different types of uncertain data, with their characteristics [10].
Uncertain data, as formulated in our present and previous work, is closely related to a considerably more studied data anomaly in process mining: partially ordered event data. In fact, uncertain data as described here is a generalization of partially ordered traces. Lu et al. [7] proposed a conformance checking approach based on alignments to measure conformance of partially ordered traces. More recently, Van der Aa et al. [1] illustrated a method for inferring a linear extension, i.e., a compliant total order, of events in partially ordered traces, based on examples of correct orderings extracted from other traces in the log. Busany et al. [4] estimated probabilities for partially ordered events in IoT event streams.
An associated topic, which draws from disciplines such as pattern and sequence mining and is antithetical to the analysis of partially ordered data, is the inference of partial orders from fully sequential data as a way to model its behavior. This goes under the name of episode mining, which can be performed with many techniques both on batched data and with online streams of events [11,6,2].
In this paper, we present a method to estimate the likelihood of any scenario in an uncertain setting, which covers partially ordered traces as well as other types of uncertainty illustrated in the taxonomy [10]. Furthermore, we will cover both the non-deterministic case (strong uncertainty) and the probabilistic case (weak uncertainty).

Running Example
In this section, we will provide a running example of an uncertain process instance related to a sample process. We will then apply our probability estimation method to this uncertain trace, to illustrate its operation. The example we analyze here is a simplified generalization of a remote credit card fraud investigation process. This process is visualized by the Petri net in Figure 1.
Firstly, the credit card owner alerts the credit card company of a possibly fraudulent transaction. The customer may either notify the company by calling their hotline (alert hotline) or arrange an urgent meeting with personnel of the bank that issued the credit card (alert bank). In both scenarios, the customer's credit is frozen (freeze credit) to prevent further fraud. All information provided by the customer about the transaction is summarized when filing the formal report (file report). As a next step, the credit card company tries to contact the merchant that charged the credit card. If this happens (contact merchant), the credit card company clarifies whether there has been just a mistake (e.g., the merchant charging without delivering a product, or a billing mistake) on the merchant's side. In such cases, the customer gets a refund from the merchant and the case is closed. Another outcome might be the discovery of a friendly fraud, which is when a cardholder makes a purchase and then disputes it as fraud even though it was not. If contacting the merchant is impossible, a fraud investigation is initiated. In this case, fraud investigators will usually start with the transaction data and look for timestamps, geolocation, IP addresses, and other elements that can be used to prove whether or not the cardholder was involved in the transaction. The outcome might be either friendly fraud or true fraud. True fraud can also happen when both the merchant and the cardholder are affected by the fraud. In this case, the cardholder receives a refund from the credit institute (activity refund credit institute) and the case is closed.
Note that for simplicity, we have used single letters to represent the activity labels in the Petri net transitions. Some possible traces in this process are, for example: $\langle h, c, r, m, u \rangle$, $\langle b, c, r, m, f \rangle$, $\langle h, c, r, i, f \rangle$ and $\langle b, c, r, i, t, v \rangle$.
Suppose that the credit card company wants to perform conformance checking to identify deviant process instances.However, some traces in the information system of the company are affected by uncertainty, such as the one in Table 1.
Suppose that in the first half of October 2020, the company was implementing a new system for automatic event data generation. During this time, the event data regarding the credit card fraud investigation process often had to be inserted manually by the employees. Such manual recordings were subject to inaccuracies, leading to imprecise or missing data affecting the cases during this period. The process instance from Table 1 is one of the affected instances.
Here, events $e_2$, $e_3$, $e_5$ and $e_6$ are uncertain. The timestamp of event $e_2$ is not precise enough, so the possible timestamp lies between 06-10-2020 00:00 and 06-10-2020 23:59. Event $e_3$ happened some time between 20:00 on October 5th and 10:00 on October 6th. Event $e_5$ has two possible activity labels: $f$ with probability 0.3 and $t$ with probability 0.7. Refunding the customer (event $e_6$) has been recorded in the system, but the customer has not received the money yet, which is why the event is indeterminate: this is indicated with a question mark (?) in the rightmost column, and denotes an event that has been recorded, but for which it is unclear whether it actually occurred in reality.
The credit card company is interested in understanding if and how the data in this uncertain trace conforms to the normative process model, and the extent of the actual compliance risk; they are specifically interested in knowing whether a severely non-compliant scenario is highly likely. In the remainder of the paper, we will describe a method able to estimate the probability of all possible outcome scenarios.

Preliminaries
Let us now present some preliminary definitions regarding uncertain event data.
Following Definition 1, we denote with $W_D$ the set of all weakly uncertain attributes of domain $D$, i.e., attributes whose possible values are described by a probability distribution. We collectively denote with $U_D = S_D \cup W_D$ the set of all uncertain attributes of domain $D$. It is easy to see how a "certain" attribute $x$, with a value not affected by any uncertainty, can be represented through the definitions in use here: if its domain is discrete, it can be represented with the singleton $\{x\}$; otherwise, it can be represented with the degenerate interval $[x, x]$.
Definition 2 (Uncertain events). Let $U_I$ be the universe of event identifiers. Let $U_C$ be the universe of case identifiers. Let $A \in U$ be the discrete domain of all the activity identifiers. Let $T \in U$ be the totally ordered domain of all the timestamp identifiers. Let $O = \{?\} \in U$, where the "?" symbol is a placeholder denoting event indeterminacy. The universe of uncertain events is denoted with $\mathcal{E} = U_I \times U_C \times U_A \times U_T \times U_O$. The activity label, timestamp and indeterminacy attribute values of an uncertain event are drawn from $U_A$, $U_T$ and $U_O$; in accordance with Definition 1, each of these attributes can be strongly uncertain (set of possible values or interval) or weakly uncertain (probability distribution). The indeterminacy domain is defined on a single element "?": thus, strongly uncertain indeterminacy may be $\{?\}$ (indeterminate event) or $\emptyset$ (no indeterminacy). In weakly uncertain indeterminacy, the "?" element is associated with a probability value.
Definition 3 (Projection functions). For an uncertain event $e = (i, c, a, t, o) \in \mathcal{E}$, we define the following projection functions: $\pi_a(e) = a$, $\pi_t(e) = t$, $\pi_o(e) = o$. We define $\pi^{set}_a(e) = a$ if $a$ is strongly uncertain, and $\pi^{set}_a(e) = \{x \in A \mid f_A(x) > 0\}$ with $a = f_A$ otherwise. If the timestamp $t = [t_{min}, t_{max}]$ is strongly uncertain, we define $\pi_{t_{min}}(e) = t_{min}$ and $\pi_{t_{max}}(e) = t_{max}$. If the timestamp $t = f_T$ is weakly uncertain, we define $\pi_{t_{min}}(e) = \operatorname{argmin}_x (f_T(x) > 0)$ and $\pi_{t_{max}}(e) = \operatorname{argmax}_x (f_T(x) > 0)$.
Definition 4 (Uncertain traces and logs). $\tau \subset \mathcal{E}$ is an uncertain trace if all the event identifiers in $\tau$ are unique and all events in $\tau$ share the same case identifier $c \in U_C$. $\mathcal{T}$ denotes the universe of uncertain traces. $L \subset \mathcal{T}$ is an uncertain log if all the event identifiers in $L$ are unique.

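Definition 5 (Realizations of uncertain traces). Let $e, e' \in \mathcal{E}$ be two uncertain events. $\prec_{\mathcal{E}}$ is a strict partial order defined on the universe of strongly uncertain events $\mathcal{E}$ as $e \prec_{\mathcal{E}} e' \Leftrightarrow \pi_{t_{max}}(e) < \pi_{t_{min}}(e')$. Let $\tau \in \mathcal{T}$ be an uncertain trace. The sequence $\rho = \langle e_1, e_2, \dots, e_n \rangle \in \mathcal{E}^*$, with $n \le |\tau|$, is an order-realization of $\tau$ if there exists a total function $f : \{1, 2, \dots, n\} \to \tau$ such that: for all $1 \le i < j \le n$ we have that $\rho[j] \nprec_{\mathcal{E}} \rho[i]$; and for all $e \in \tau$ with $\pi_o(e) = \emptyset$ there exists $1 \le i \le n$ such that $f(i) = e$. We denote with $R_O(\tau)$ the set of all such order-realizations of the trace $\tau$.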
Given an order-realization $\rho = \langle e_1, e_2, \dots, e_n \rangle \in R_O(\tau)$, the sequence $\sigma = \langle a_1, a_2, \dots, a_n \rangle \in U_A^*$ is a realization of $\rho$ if $a_i \in \pi^{set}_a(e_i)$ for all $1 \le i \le n$. We denote with $R_A(\rho) \subseteq U_A^*$ the set of all such realizations of the order-realization $\rho$. We denote with $R(\tau) \subseteq U_A^*$ the union of the realizations obtainable from all the order-realizations of $\tau$: $R(\tau) = \bigcup_{\rho \in R_O(\tau)} R_A(\rho)$. We will say that an order-realization $\rho \in R_O(\tau)$ enables a sequence $\sigma \in U_A^*$ if $\sigma \in R_A(\rho)$.
Detailing an algorithm to generate all realizations of an uncertain trace is beyond the scope of this paper. The literature illustrates a conformance checking method over uncertain data which employs a behavior net, a Petri net able to replay all and only the realizations of an uncertain trace [8]. Exhaustively exploring all complete firing sequences of a behavior net, e.g., through its reachability graph, provides all realizations of the corresponding uncertain trace.
Given the above formalization, we can now define more clearly the research question that we are investigating in this paper. Given an uncertain trace $\tau \in \mathcal{T}$ and one of its realizations $\sigma \in R(\tau)$, our goal is to obtain a procedure to reliably compute $P(\sigma \mid \tau)$ = "probability of $\sigma$ given that we observe $\tau$". In other words, provided that $\sigma$ corresponds to a scenario (i.e., a realization) for the uncertain trace $\tau$, we are interested in calculating the probability that $\sigma$ is the actual scenario that occurred in reality, which caused the recording of the uncertain trace $\tau$ in the event log. In the next section, we will illustrate how to calculate such probabilities of uncertain trace realizations.
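For very small traces, the sets of order-realizations and realizations from Definition 5 can also be enumerated by brute force. The following Python sketch is only an illustration and is not the behavior-net construction of [8]; the dictionary encoding, the event names, the activity labels and the timestamp intervals are all hypothetical.

```python
from itertools import permutations, product

# Hypothetical encoding (not from the paper): possible labels, a timestamp
# interval [tmin, tmax], and a flag marking indeterminate ("?") events.
trace = {
    "e1": {"labels": {"a"}, "tmin": 0, "tmax": 1, "optional": False},
    "e2": {"labels": {"b", "c"}, "tmin": 2, "tmax": 5, "optional": False},
    "e3": {"labels": {"d"}, "tmin": 3, "tmax": 6, "optional": True},
    "e4": {"labels": {"e"}, "tmin": 7, "tmax": 8, "optional": False},
}

def precedes(x, y):
    """Strict partial order of Definition 5: x certainly happens before y."""
    return x["tmax"] < y["tmin"]

def order_realizations(trace):
    """Sequences that keep every non-indeterminate event and never place
    an event before another event that certainly precedes it."""
    mandatory = [e for e, v in trace.items() if not v["optional"]]
    optional = [e for e, v in trace.items() if v["optional"]]
    result = []
    for mask in product([False, True], repeat=len(optional)):
        chosen = mandatory + [e for e, keep in zip(optional, mask) if keep]
        for perm in permutations(chosen):
            compatible = all(
                not precedes(trace[perm[j]], trace[perm[i]])
                for i in range(len(perm))
                for j in range(i + 1, len(perm))
            )
            if compatible:
                result.append(perm)
    return result

def realizations(trace, rho):
    """All label sequences enabled by the order-realization rho."""
    return set(product(*(sorted(trace[e]["labels"]) for e in rho)))

all_rhos = order_realizations(trace)
all_sigmas = set().union(*(realizations(trace, rho) for rho in all_rhos))
```

The enumeration is exponential in the number of events, so it is only meant to make the definitions concrete on toy inputs.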

Method
Before we show how we can obtain probability estimates for all realizations of an uncertain trace, it is important to state an assumption: the information on uncertainty related to a particular attribute in some event is independent of the possible values of the same attribute present in other events, and it is independent of the uncertainty information on other attributes of the same event. Note that in the examples of uncertainty sources given in Section 1 (data coarseness and sensor errors), this independence assumption often holds.
Additionally, we need to consider the fact that strongly uncertain attributes do not come with known probability values: their description only specifies the values that attributes might acquire, but not the likelihood of each possible value. As a consequence, estimating probability for specific realizations in a strongly uncertain environment is only possible with a priori assumptions on how probability distributes among the attribute values. At times, it might be possible to assume the distribution in an informed way, for instance on the basis of features of the information system hosting the data, of the sensors recording events and attributes, or other tools involved in the management of the process.
In case no indication is present, a reasonable assumption, which we will hold for the remainder of the paper, is that any possible value of a strongly uncertain attribute is equally likely. Formally, with $e = (i, c, a, t, o) \in \mathcal{E}$, let $\tau_s : \mathcal{E} \to \mathcal{E}$ be a function such that $\tau_s(e) = (i, c, a', t', o')$, where $a' = \{(x, 1/|\pi^{set}_a(e)|) \mid x \in \pi^{set}_a(e)\}$ if $a \in S_A$ and $a' = a$ otherwise; $t' = U(\pi_{t_{min}}(e), \pi_{t_{max}}(e))$ if $t \in S_T$ and $t' = t$ otherwise; $o' = 0.5$ if $o = \{?\}$ and $o' = o$ otherwise. Here, $U(\cdot, \cdot)$ denotes the uniform distribution over the given interval.
First, observe that the probability $P(\sigma \mid \tau)$ that an activity sequence $\sigma \in U_A^*$ is indeed a realization of the trace $\tau \in \mathcal{T}$, and thus $\sigma \in R(\tau)$, increases with the number of order-realizations enabling it. Furthermore, for each such order-realization, one can construct a probability function $P_O(\rho \mid \tau)$ reflecting the likelihood of the sequence $\rho$ itself given the trace $\tau$, and a probability function $P_A(\sigma \mid \rho)$ reflecting the likelihood that the realization corresponding to $\rho$ is indeed $\sigma$. The value of $P_O(\rho \mid \tau)$ is affected by the uncertainty information in timestamps and indeterminate events, while the value of $P_A(\sigma \mid \rho)$ is aggregated from the uncertainty information in the activity labels. Given a realization $\sigma$ of an uncertain process instance and the set of its enablers, its probability is computed as follows:

$$P(\sigma \mid \tau) = \sum_{\rho \in R_O(\tau)} P_O(\rho \mid \tau) \cdot P_A(\sigma \mid \rho)$$

Note that, if $\rho$ does not enable $\sigma$, $P_A(\sigma \mid \rho) = 0$. For any uncertain trace $\tau \in \mathcal{T}$, it holds that $\sum_{\sigma \in R(\tau)} P(\sigma \mid \tau) = 1$, since both $P_O(\cdot)$ and $P_A(\cdot)$ are each constructed to be (independent) probability distributions.
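The equal-likelihood assumption behind $\tau_s$ can be sketched in a few lines of Python. The dictionary layout below (a set of labels for strong activity uncertainty, a `(t_min, t_max)` tuple for a strongly uncertain timestamp, `{"?"}` for strong indeterminacy) is a hypothetical encoding chosen only for this illustration.

```python
from fractions import Fraction

def to_weak(event):
    """Sketch of tau_s: replace strong uncertainty by uniform distributions.

    Attributes that are already weakly uncertain (or certain) are left as is.
    """
    weak = dict(event)
    labels = event["activity"]
    if isinstance(labels, (set, frozenset)):      # a in S_A
        weak["activity"] = {x: Fraction(1, len(labels)) for x in labels}
    stamp = event["timestamp"]
    if isinstance(stamp, tuple):                  # t in S_T, as (t_min, t_max)
        weak["timestamp"] = ("uniform", stamp[0], stamp[1])
    if event["indeterminacy"] == {"?"}:           # o = {?}
        weak["indeterminacy"] = {"?": Fraction(1, 2)}
    return weak

# Hypothetical strongly uncertain event: two labels, a 24-hour window, and "?".
strong_event = {"activity": {"f", "t"}, "timestamp": (0.0, 24.0),
                "indeterminacy": {"?"}}
weak_event = to_weak(strong_event)

# An event that is already weakly uncertain is returned unchanged.
already_weak = {"activity": {"f": 0.3, "t": 0.7}, "timestamp": 5.0,
                "indeterminacy": set()}
unchanged = to_weak(already_weak)
```

Exact fractions are used here only to keep the uniform weights free of rounding; floating-point values would serve equally well.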
We will now compute $P_A(\sigma \mid \rho)$ using the information on the uncertainty of activity labels. Let us write $f^e_A$ as a shorthand for $\pi_a(e)$. If there is uncertainty in activities, then for each event $e \in \rho$ and activity label $a \in \pi^{set}_a(e)$, the probability that $e$ executes $a$ is given by $f^e_A(a)$. Thus, for every $\rho = \langle e_1, \dots, e_n \rangle \in R_O(\tau)$ and $\sigma = \langle a_1, \dots, a_n \rangle \in R(\tau)$, the value $P_A$ can be aggregated from these distributions in the following way:

$$P_A(\sigma \mid \rho) = \prod_{i=1}^{n} f^{e_i}_A(a_i)$$

Through the value of $P_A$, we can assess the likelihood that any given order-realization executes a particular realization.
The next step is to estimate the probability of each order-realization $\rho$ from the set $R_O(\tau)$. The probability of observing $\rho$ needs to be aggregated from two values: the probability that the corresponding set of events appears in the given particular order, which is determined by the timestamp intervals and, if applicable, the distributions over them; and the probability that the order-realization contains the corresponding specific set of events, which is determined by the uncertainty information on the indeterminacy. Multiplying these two values to yield a probability estimate for the order-realization reflects our independence assumption. Let us first focus on uncertainty on timestamps, which causes the events to be partially ordered.
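Before turning to timestamps, the product that defines $P_A$ can be made concrete with a short Python sketch. The event names and labels are hypothetical; the distribution for `e2` reuses the values 0.9 and 0.1 from the validation example in Section 6.

```python
from math import prod

# Hypothetical label distributions f_A^e, one per event.
label_dist = {
    "e1": {"a": 1.0},
    "e2": {"b": 0.9, "c": 0.1},
    "e3": {"d": 1.0},
    "e4": {"e": 1.0},
}

def p_activity(sigma, rho, label_dist):
    """P_A(sigma | rho): product of the per-event label probabilities.

    Returns 0 when rho does not enable sigma (different length, or a label
    that the corresponding event cannot execute).
    """
    if len(sigma) != len(rho):
        return 0.0
    return prod(label_dist[e].get(a, 0.0) for e, a in zip(rho, sigma))

rho = ("e1", "e2", "e3", "e4")
p_b = p_activity(("a", "b", "d", "e"), rho, label_dist)
p_c = p_activity(("a", "c", "d", "e"), rho, label_dist)
p_not_enabled = p_activity(("a", "d", "b", "e"), rho, label_dist)
```

For a fixed order-realization, the values of $P_A$ over all enabled realizations sum to 1, which is what makes $P_A(\cdot \mid \rho)$ a probability distribution.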
We will write $f^e_T(t)$ as a shorthand for $\pi_t(e)(t)$. For every event $e$, the value of $f^e_T(t)$ yields the probability that event $e$ happened on timestamp $t$. This value is always 0 for all $t < \pi_{t_{min}}(e)$ and $t > \pi_{t_{max}}(e)$ (see $\pi_{t_{min}}$ and $\pi_{t_{max}}$ in Definition 3). Given the continuous domain of timestamps, $P_O(\cdot)$ is assessed by using integrals. For a trace $\tau \in \mathcal{T}$ and an order-realization $\rho = \langle e_1, \dots, e_n \rangle \in R_O(\tau)$, let $a_i = \pi_{t_{min}}(e_i)$ and $b_i = \pi_{t_{max}}(e_i)$ for all $1 \le i \le n$. Then, we define:

$$I(\rho) = \int_{a_1}^{\min\{b_1, \dots, b_n\}} \int_{\max\{a_2, x_1\}}^{\min\{b_2, \dots, b_n\}} \cdots \int_{\max\{a_n, x_{n-1}\}}^{b_n} \prod_{i=1}^{n} f^{e_i}_T(x_i) \, dx_n \cdots dx_1$$

This chain of integrals allows us to compute the probability of a specific order among all the events in an uncertain trace.
Now, to compute the probability of each order-realization from $R_O(\tau)$ accounting for indeterminate events, we combine both the probability of the events having appeared in a particular order and the probability that the sequence contains exactly those events. For simplicity, we will use a function that acquires the value 1 if an event is not indeterminate. Let us define $f^e_O : O \to [0, 1]$ such that $f^e_O(?) = \pi_o(e)(?)$ if $\pi_o(e) \neq \emptyset$ and $f^e_O(?) = 1$ otherwise. More precisely, given $\tau \in \mathcal{T}$ and $\rho \in R_O(\tau)$, we compute:

$$P_O(\rho \mid \tau) = I(\rho) \cdot \prod_{e \in \rho} f^e_O(?) \cdot \prod_{e \in \tau,\, e \notin \rho} \bigl(1 - f^e_O(?)\bigr)$$

We now have at our disposal all the necessary tools to compute a probability distribution over the trace realizations of any uncertain process instance in any possible uncertainty scenario. Let us then apply this method to compute the probabilities of all realizations of the trace $\tau$ in Table 1, and to analyze its conformance to the process in Figure 1.
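For uniformly distributed timestamps, the order probability $I(\rho)$ can be approximated numerically. The Python sketch below does not evaluate the nested integrals symbolically; it uses an equivalent cumulative formulation on a grid, which is a substitution made here for brevity. The intervals are hypothetical and expressed in hours: two of them loosely mirror $e_3$ and $e_2$ of the running example (with the window of $e_2$ rounded to a full day), the third stands for a later indeterminate event.

```python
def order_probability(intervals, cells=20000):
    """Approximate P(t_1 < t_2 < ... < t_n) for independent
    t_i ~ Uniform(a_i, b_i), with a_i < b_i, given in the order of rho."""
    lo = min(a for a, _ in intervals)
    hi = max(b for _, b in intervals)
    dx = (hi - lo) / cells
    # g[j] approximates P(t_1 < ... < t_k < midpoint of cell j); k = 0 gives 1.
    g = [1.0] * cells
    acc = 1.0
    for a, b in intervals:
        density = 1.0 / (b - a)
        acc = 0.0
        new_g = []
        for j in range(cells):
            left = lo + j * dx
            overlap = max(0.0, min(b, left + dx) - max(a, left))
            mass = density * g[j] * overlap
            new_g.append(acc + mass / 2)  # value at the middle of the cell
            acc += mass
        g = new_g
    return acc

def p_order(rho, intervals, occur_prob):
    """P_O(rho | tau): order probability times the occurrence factors."""
    p = order_probability([intervals[e] for e in rho])
    for e, q in occur_prob.items():
        p *= q if e in rho else 1.0 - q
    return p

# Hypothetical data: hours counted from the start of the first day.
intervals = {"e3": (20.0, 34.0), "e2": (24.0, 48.0), "e6": (100.0, 101.0)}
occur_prob = {"e3": 1.0, "e2": 1.0, "e6": 0.5}  # f_O^e(?) for each event

i_32 = order_probability([intervals["e3"], intervals["e2"]])
i_23 = order_probability([intervals["e2"], intervals["e3"]])
with_e6 = p_order(("e3", "e2", "e6"), intervals, occur_prob)
without_e6 = p_order(("e3", "e2"), intervals, occur_prob)
```

For these two overlapping intervals the exact values are $1 - 50/336$ and $50/336$, which the grid approximation reproduces up to discretization error. The two order-realizations that differ only in the presence of the non-overlapping indeterminate event receive the same probability, since its occurrence probability is 0.5.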
One can notice that the $I$ values only depend on the ordering of the first three events, which are also the only ones with overlapping timestamps. Since the indeterminate event $e_6$ does not overlap with any other event, pairs of sequences where the first three events have the same order also have the same probability. This reflects our assumption that the occurrence and non-occurrence of $e_6$ are both equally possible. Table 3 displays the calculations for the computation of the $P(\sigma \mid \tau)$ values for all realizations, together with their enablers and their conformance scores; the conformance score is equal to the cost of the optimal alignment between the trace and the Petri net in Figure 1. Now we can compute the expected conformance score for the uncertain process instance $\tau = \{e_1, \dots, e_6\}$. We can do so by computing alignments [5] for each realization of $\tau$ and weighting each resulting score by the probability of that realization.
Given the information on uncertainty available for the trace, this expected conformance score is a more realistic estimate of the real conformance score than the best, worst or average scores, with values 0, 3 and 1.75 respectively.
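The difference between a probability-weighted score and the plain best, worst and average scores can be illustrated with a small Python sketch. The probabilities and alignment costs below are hypothetical and are not the values of Table 3; only the weighting scheme is the point.

```python
def expected_conformance(prob, cost):
    """Probability-weighted alignment cost over all realizations."""
    assert abs(sum(prob.values()) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * cost[sigma] for sigma, p in prob.items())

# Hypothetical realizations with made-up probabilities and alignment costs.
prob = {"sigma_1": 0.5, "sigma_2": 0.3, "sigma_3": 0.2}
cost = {"sigma_1": 0, "sigma_2": 1, "sigma_3": 3}

expected = expected_conformance(prob, cost)
best = min(cost.values())
worst = max(cost.values())
unweighted_average = sum(cost.values()) / len(cost)
```

In this toy setting the expected score lies below the unweighted average because the costly realization is also the least likely one; an unweighted summary cannot capture this.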

Validation of Probability Estimates
In this section, we compute the probability estimates for the realizations of an uncertain trace, and then show a validation of those estimates by Monte Carlo simulation on the behavior net of the trace. The process instance of our example has strong uncertainty in timestamps and weak uncertainty in activities and indeterminacy. It consists of 4 events: $e_1$, $e_2$, $e_3$ and $e_4$, where $e_2$ and $e_3$ have overlapping timestamps. Event $e_2$ executes $b$ (resp., $c$) with probability 0.9 (resp., 0.1). There is a probability of 0.2 that $e_3$ did not occur. Figure 2 shows the corresponding behavior graph, an uncertain event data visualization that represents the time relationships between events with a directed acyclic graph [8]. Lastly, Table 4 lists all the possible realizations, their probabilities, and the order-realizations enabling them.
We now validate our obtained probability estimates quantitatively by means of a Monte Carlo simulation approach. First, we construct the behavior net [10] corresponding to the uncertain process instance, which is shown in Figure 3. The set of replayable traces in this behavior net is exactly the set of realizations for the uncertain instance. Then, we simulate realizations on the behavior net, dividing the accumulated count of each realization by the number of runs, and compare those values to our probability estimates. Here, we use the stochastic simulator of the PM4Py library [3]. In every step of the simulation, the stochastic simulator chooses one enabled transition to fire according to a stochastic map, assigning a weight to each transition in the Petri net (here, the behavior net).
To simulate uncertainty in activities, events and timestamps, we do the following: possible activities executed by the same event appearing in an XOR-split in the behavior net are weighted so as to reflect the probability values of the activity labels. Indeterminacy is equivalently modeled as an XOR-choice between a visible transition and a silent one in the behavior net, so as to model a "skip". If there are two or more possible activities for an indeterminate event, then the sum of the weights of the visible transitions in relation to the weight of the silent transition should be the same as in the distribution given in the event type uncertainty information. Whenever there are events with overlapping timestamps, these appear in an AND-split in the behavior net. The (enabled) path of the AND-split which is taken first signals which event is executed at that moment.
Let $bn(\tau) = (P, T)$ be the behavior net of trace $\tau$. Let $(e, a) \in T$ be a visible transition related to some event $e \in \tau$. We weight $(e, a)$ according to the uncertainty information on the activity label and the indeterminacy of $e$, as described above. Events with overlapping timestamps appear in an AND construct. By definition of our weight function, whenever the transitions of some $e \in \tau$ are enabled (in an XOR construct), the probability of firing one of them is $1/k$, where $k$ is the number of events from $\tau$ for which none of the corresponding transitions have fired yet. This way, there is always a uniform distribution over the set of enabled transitions representing overlapping events. Assigning the weights according to this distribution makes it possible to decorate the behavior net with probabilities that reflect the chances of occurrence of every possible value in uncertain attributes.
Applying the stochastic simulator $n$ times yields $n$ realizations. For each of the 6 possible realizations for the uncertain process instance, we obtain a probability measurement by dividing its simulated frequency by $n$. Figures 4 through 7 show how for greater $n$, this measurement converges to the probability estimates shown in Table 4, which were computed with our method.
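The convergence of simulated frequencies can be reproduced in spirit with a self-contained Python sketch. It does not use the behavior net or the PM4Py simulator; it samples realizations directly from the uncertainty information of the validation example. The labels `a`, `d` and `e` for $e_1$, $e_3$ and $e_4$ are hypothetical, and the two orders of the overlapping events $e_2$ and $e_3$ are assumed to be equally likely, in line with the $1/k$ rule above.

```python
import random
from collections import Counter

P_E2_IS_B = 0.9    # e2 executes b with probability 0.9, c otherwise
P_E3_OCCURS = 0.8  # e3 did not occur with probability 0.2

def simulate_once(rng):
    """Draw one realization of the four-event validation trace."""
    pending = ["e2", "e3"]
    rng.shuffle(pending)  # overlapping events: each order is equally likely
    middle = []
    for event in pending:
        if event == "e2":
            middle.append("b" if rng.random() < P_E2_IS_B else "c")
        elif rng.random() < P_E3_OCCURS:
            middle.append("d")  # hypothetical label of e3
    return ("a", *middle, "e")  # hypothetical labels of e1 and e4

def estimate(n, seed=0):
    """Relative frequency of each realization over n simulated runs."""
    rng = random.Random(seed)
    counts = Counter(simulate_once(rng) for _ in range(n))
    return {sigma: count / n for sigma, count in counts.items()}

# Probabilities under the stated assumptions:
# P(sigma) = P(e3 present or absent) * P(order) * P(label of e2).
exact = {
    ("a", "b", "e"): 0.2 * 0.9,
    ("a", "c", "e"): 0.2 * 0.1,
    ("a", "b", "d", "e"): 0.8 * 0.5 * 0.9,
    ("a", "d", "b", "e"): 0.8 * 0.5 * 0.9,
    ("a", "c", "d", "e"): 0.8 * 0.5 * 0.1,
    ("a", "d", "c", "e"): 0.8 * 0.5 * 0.1,
}

freq = estimate(100_000)
max_gap = max(abs(freq.get(sigma, 0.0) - p) for sigma, p in exact.items())
```

With a fixed seed the run is reproducible; increasing `n` should shrink `max_gap` roughly in proportion to $1/\sqrt{n}$, which is the behavior one expects from a Monte Carlo estimate.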
To conclude, the Monte Carlo simulation shows that our estimated probabilities for realizations match their relative frequencies when one simulates the behavior net of the corresponding uncertain trace.

Conclusion
Uncertain traces inherently contain behavior, allowing for many realizations; these, in turn, correspond to diverse possible real-life scenarios, which may have different consequences on the management and governance of a process. In this paper, we presented a method to quantify the probability of each realization of an uncertain trace. This enables process analysts to weigh the impact of specific insights gathered with uncertainty-aware process mining techniques, such as conformance checking using alignments. As a consequence, information from process analysis techniques can be associated with a quantification of risk or opportunity for specific scenarios, making them more trustworthy.
Multiple avenues for future work on this topic are possible. These include inferring probabilities for uncertain traces from sections of the log not affected by uncertainty, adopting certain traces or fragments of traces as ground truth. Moreover, inferring probabilities by examining evidence against a ground truth can also be achieved with a normative model that includes information concerning the probability of error or noise in specific parts of the process.

Fig. 1: A Petri net model of the credit card fraud investigation process. This net allows for 10 possible traces.

Definition 1 (Uncertain attributes). Let $U$ be the universe of attribute domains, and the set $D \in U$ be an attribute domain. Any $D \in U$ is a discrete set or a totally ordered set. A strongly uncertain attribute of domain $D$ is a subset $d_S \subseteq D$ if $D$ is a discrete set, and it is a closed interval $d_S = [d_{min}, d_{max}]$ with $d_{min} \in D$ and $d_{max} \in D$ otherwise. We denote with $S_D$ the set of all such strongly uncertain attributes of domain $D$. A weakly uncertain attribute $f_D$ of domain $D$ is a function $f_D : D \to [0, 1]$ describing a probability distribution over $D$.
For the function $\tau_s$ used in the Method section, $a' = \{(x, 1/|\pi^{set}_a(e)|) \mid x \in \pi^{set}_a(e)\}$ if $a \in S_A$ and $a' = a$ otherwise; $t' = U(\pi_{t_{min}}(e), \pi_{t_{max}}(e))$ if $t \in S_T$ and $t' = t$ otherwise; $o' = 0.5$ if $o = \{?\}$ and $o' = o$ otherwise.

Fig. 2: The behavior graph of the uncertain trace considered as example for validation.

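Table 3: The set of possible realizations of the example from Table 1, their enablers, their probabilities, and their conformance scores. The conformance score is equal to the cost of the optimal alignment between the trace and the Petri net in Figure 1.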