1 Introduction

Process mining is the science of understanding processes and improving them based on event data. Event data is mined for insights that can help industries optimize their processes, re-engineer them, and support their decision-making. Process discovery is one application of process mining that makes the underlying processes understandable by visualizing a process model of how the process was executed.

This means that event data availability is key to any process mining task; without event data, there is no process mining. However, publishing event data is, in many cases, subject to constraints due to privacy concerns, which limits the availability of event data. Fields such as healthcare use process mining techniques to optimize their processes and deal with patient information for which privacy is of utmost importance.

Privacy is therefore an important topic in the process mining field, given the growing collection and use of data that may originate from personal activities or from process event logs containing information on individuals. Maintaining the privacy of individuals in process mining use cases is difficult since event data is sequential, and individual cases or events are often related to sensitive information about individuals, such as the medical tests performed on a patient in a hospital. Existing work has therefore provided methods to quantify the risk of re-identification of individual information in a published event log [8, 10]. This allows assessing the risk of releasing the original event log or judging the effectiveness of a specific anonymization technique on the privacy of an event log.

The existing work on privacy-risk quantification only considers the various privacy-related risks, e.g., the re-identification risk as in [10], for published event logs or closely derived representations such as directly-follows graphs [5] (Sect. 3). However, in many use cases the event log does not need to be public and may only be available to a process mining system that discovers process models providing an abstract representation of the source data. Still, there is a risk of re-identifying sensitive information based on such published process models mined from the event log. This re-identification risk of discovered process models, and how it differs from the re-identification risk of the source event log, has not yet been investigated.

This work explores which privacy attacks are possible using the information in a published process model and aims to quantify the re-identification risk for a given published process model. Such quantification would enable new evaluation options for anonymization schemes and help to judge whether a certain process model can be released to a specific audience. The inputs to our method are frequency-annotated process models that can be converted to process trees [11], such as those discovered by the Inductive Miner [7]. We propose a randomized log replay technique to generate multiple possible event logs (scenarios) given the constraints of the process model and its frequencies. Based on these generated event log scenarios, we leverage the existing re-identification risk measures proposed by Rafiei et al. [8] (Sect. 4). The method was evaluated on several real-life event logs (Sect. 5), and the results were compared to the re-identification risk of the original logs.

2 Problem Statement

Process models, which are graphical representations of a process, can be of different types such as Petri nets, process trees, or directly-follows graphs (DFGs). These process models can be discovered automatically using process discovery algorithms such as the Inductive Miner [7]. A process discovery method takes an event log as input and returns a process model as a compact representation of the observed process behavior.

Table 1. Fragment of an event log about handling requests for compensation [2]
Fig. 1. Petri net with frequencies of the Medical Center COVID and HIV testing process.

Events in an event log contain, in addition to case identifiers that refer to the process instance in which the event occurred, other event information such as timestamp, activity, resource, and cost. For each case, i.e., all events with the same case identifier in an event log, we can write its trace, the sequence of its events ordered by timestamp, in a concise form listing the activities of the case. For example, the trace of case 1 in the event log of Table 1 can be represented as

$$ \langle \text {register request, examine thoroughly, check ticket, decide, reject request}\rangle . $$
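
For concreteness, the following Python sketch shows one way to derive such traces from a tabular event log using pandas; the column names (case_id, activity, timestamp) and the timestamps are illustrative and not taken from Table 1 verbatim.

import pandas as pd

# Toy event log fragment; column names and timestamps are illustrative and
# the rows only mimic case 1 of Table 1.
events = pd.DataFrame({
    "case_id": [1, 1, 1, 1, 1],
    "activity": ["register request", "examine thoroughly", "check ticket",
                 "decide", "reject request"],
    "timestamp": pd.to_datetime([
        "2020-01-01 09:00", "2020-01-01 10:00", "2020-01-02 09:30",
        "2020-01-03 11:00", "2020-01-03 15:00"]),
})

# The trace of a case is the sequence of its activities ordered by timestamp.
traces = (events.sort_values("timestamp")
                .groupby("case_id")["activity"]
                .apply(tuple))
print(traces[1])
# ('register request', 'examine thoroughly', 'check ticket', 'decide', 'reject request')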

A common output format of process discovery algorithms is the Petri net, a graphical representation of a given process. Since both event logs and process models contain information, there are risks of privacy attacks that seek to reveal sensitive information, given that the attacker has some background information about the individuals in an event log or process model. Petri nets can represent a variety of information that might unintentionally be revealed by the publisher of the Petri net to stakeholders or to the public. In other scenarios, an adversary might mount a privacy attack and disclose sensitive or classified information based on the published Petri net. Although the data might not be explicitly present in the Petri net, the attacker might be able to draw out sensitive information using different techniques, especially when combining the Petri net with data obtained from a breach or with other publicly available data about the individuals. For illustration, we use the example in Fig. 1 of a testing process for COVID and HIV in a medical facility. It covers the process from registration in the relevant department (COVID or HIV) through testing and the result (negative or positive) to the discharge of the patient. The Petri net is frequency-annotated with the number of occurrences of the respective activities (transitions).

A re-identification attack occurs when an adversary attempts to reverse the anonymity of information that was masked by an operation to remove sensitive information. The adversary can use information acquired publicly or from a data breach. Attacks are most successful when adversaries correlate or match different datasets and pieces of information. Process models may contain sensitive information about individuals or processes, and an adversary might use this information to re-identify individuals. In a dataset or an event log, an adversary will attempt to single out an individual based on the uniqueness of a record's or event's identifiers in order to mount a re-identification attack. Having singled out a record, the adversary can attempt a linkage attack using other datasets in their possession, which can lead to the re-identification of individual information.

In the context of a frequency-annotated process model, a similar paradigm can be established to understand how an adversary can use the information in the process model to re-identify individuals. Individuals can be singled out in a process model based on infrequent paths (runs): an infrequent path allows an adversary to single out an activity of that process. To illustrate, in the Petri net shown in Fig. 1, we notice a single case in which a patient transferred from the HIV testing department to the COVID testing department. This information does not reveal the individual's identity; however, coupled with other information that the attacker might have, it can lead to a successful re-identification. For instance, if the adversary has background information about patient registrations in the COVID department and detects that a unique individual transferred from the HIV department without registering directly for a PCR test, they can conclude from the background information that this individual is Carla Sanders. From the process model, they also learn that Carla Sanders underwent an HIV test as well as the result of that test (positive). Therefore, the individual's identity and HIV result are revealed by the attack. In Fig. 1, there is a single occurrence of the activity transfer_reg following result_back_positive_hiv, as indicated by the frequency on the activity (transition). Assume that the background information also records that Carla Sanders transferred her registration to the COVID department. The adversary can then identify Carla Sanders as the individual belonging to the trace containing transfer_reg and can also reconstruct the complete trace prior to the transfer, including that she had a positive HIV test.

By singling out the infrequent activity transfer_reg with frequency 1 in the Petri net of Fig. 1, the adversary mounts a linkage attack using certain background knowledge to re-identify the individual as Carla Sanders, who is the only one who underwent a transfer on that day, April 2, 2020, and can thus infer that she had a positive HIV test. This shows that a re-identification attack is also possible using the information available in a published Petri net.
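
To make this attack pattern concrete, the following Python sketch shows how an adversary could single out cases via activities that occur in only one trace and link them to background knowledge; the traces, activity names, and background data are hypothetical and merely mimic the example of Fig. 1.

from collections import Counter

# Hypothetical traces, e.g., reconstructed from a published
# frequency-annotated model; names only mimic Fig. 1.
traces = [
    ("reg_covid", "pcr_test", "result_back_negative_covid", "discharge"),
    ("reg_covid", "pcr_test", "result_back_negative_covid", "discharge"),
    ("reg_hiv", "hiv_test", "result_back_positive_hiv", "transfer_reg",
     "pcr_test", "result_back_negative_covid", "discharge"),
]

# Hypothetical background knowledge: the adversary knows which patient
# transferred from the HIV department to the COVID department.
background = {"transfer_reg": "Carla Sanders"}

# Singling out: an activity occurring in exactly one trace is unique.
occurs_in = Counter(a for t in traces for a in set(t))
for trace in traces:
    for activity in set(trace):
        if occurs_in[activity] == 1 and activity in background:
            # Linkage: the unique activity re-identifies the trace owner and
            # discloses the rest of the trace (e.g., the positive HIV test).
            print(background[activity], "->", trace)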

3 Related Work

Privacy-preserving process mining has increasingly gained interest in the process mining community. This comes with legislation and data protection regulations across the EU becoming stricter, especially with the General Data Protection Regulation (GDPR) [1]. A main concern for privacy-preserving process mining approaches is how to ensure privacy when event logs are published. Since the studied processes can come from fields dealing with sensitive data, such as healthcare, financial institutions, and other critical domains, privacy is of utmost importance.

Quantifying the re-identification risk is closely related to research on privacy-preserving process mining. Knowing the risk can help evaluate the effectiveness of privacy-preserving methods in process mining and can act as a comparison measure before and after applying privacy models. Although re-identification attacks are widely researched in different fields [3, 4, 6, 9], only a few works in the process mining community address the quantification of disclosure risk in event logs and directly-follows graphs. Most related to our research are [5, 8, 10].

In [10], Nunez von Voigt et al. present a method to quantify the re-identification risk in event logs. The authors propose two measures, both based on uniqueness in the event log: uniqueness based on case attributes and uniqueness based on traces. The first measure uses the uniqueness of case attributes in an event log to estimate the re-identification risk. The second measure considers the uniqueness of traces in an event log to account for event logs with few case attributes, where the only information in the log is the traces themselves. The work quantifies the risk in publicly available event logs and demonstrates that the re-identification risk can be very high for some of them and that, in some scenarios, almost every case can be re-identified. This sheds light on the need for adequate methods to quantify re-identification risks and for means to protect against them.

In [8], Rafiei et al. introduce two measures for quantifying the disclosure risk in published event logs in order to evaluate the effectiveness of privacy-preserving techniques: identity (case) disclosure and attribute (trace) disclosure. The first measure, case disclosure, uses uniqueness to measure how trace owners can be re-identified. The second measure, trace disclosure, measures how sensitive attributes, such as the complete trace of a case, can be disclosed. The method takes into account the background knowledge that the attacker might have about the event log when quantifying the risk and considers three types of background knowledge: set, multiset, and sequence. The set background knowledge is simply the set of activities of a process that the attacker might know are related to an individual. Multiset background knowledge adds knowledge about the number of occurrences of the process activities, while sequence background knowledge adds information about the order in which the activities occurred for the trace owner. The paper applies the method to two publicly available real-life event logs.

In [5], Elkoumy et al. discuss the re-identification probability in DFGs, which are an output of process mining techniques. The work expresses the re-identification and the guessing advantage of an attacker by calculating the guessing probability given a DFG.
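
As an illustration of the three background knowledge types of Rafiei et al. [8] discussed above, the following Python sketch projects a single trace onto the candidate set, multiset, and sequence knowledge of a given size that an attacker might hold; it is a simplified re-implementation for exposition, not the p-privacy-qt library itself.

from collections import Counter
from itertools import combinations

trace = ("register request", "check ticket", "decide", "check ticket", "decide")

def candidate_knowledge(trace, size, kind):
    """All background-knowledge candidates of the given size for one trace."""
    index_subsets = combinations(range(len(trace)), size)
    if kind == "set":        # which activities occurred
        return {frozenset(trace[i] for i in idx) for idx in index_subsets}
    if kind == "multiset":   # how often they occurred
        return {tuple(sorted(Counter(trace[i] for i in idx).items()))
                for idx in index_subsets}
    if kind == "sequence":   # in which order they occurred
        return {tuple(trace[i] for i in idx) for idx in index_subsets}

for kind in ("set", "multiset", "sequence"):
    print(kind, candidate_knowledge(trace, 2, kind))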

4 Approach

Existing re-identification risk quantification techniques calculate the re-identification risk from event logs, which consist of traces. Our main idea is to calculate the re-identification risk from a frequency-annotated process model by generating, from the process model, possible sets of traces corresponding to the original event log.

4.1 Approximating Re-identification Risk by Simulation

Generating the exact traces corresponding to the original log from the frequency-annotated process model is not possible for all process models. It can be done accurately only for process models that do not contain concurrent transitions or cycles. When the process model contains concurrent transitions and cyclic behaviour, there can be multiple execution traces, and we do not know which of them correspond to the traces in the original event log. A brute-force approach to calculating the re-identification risk would have to generate all combinations of transitions in a process model and their possible trace sets, which can be computationally very expensive.
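
To see how quickly this blows up, the number of interleavings of k concurrent branches with \(n_1, \ldots , n_k\) activities is the multinomial coefficient \((n_1 + \cdots + n_k)! / (n_1! \cdots n_k!)\); the short Python sketch below computes this count for an illustrative model (the branch lengths are made up).

from math import factorial, prod

def interleavings(branch_lengths):
    """Number of interleavings of concurrent branches of the given lengths
    (multinomial coefficient)."""
    n = sum(branch_lengths)
    return factorial(n) // prod(factorial(k) for k in branch_lengths)

# Three concurrent branches with four activities each already allow 34650
# orderings for a single case, before even considering cycles or the
# assignment of traces to cases.
print(interleavings([4, 4, 4]))  # 34650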

However, since most process models describing business processes contain concurrent and cyclic behaviour, we adapt the solution to estimate the re-identification risk using approximation techniques. Our approach generates execution traces from the frequency-annotated process model without considering all possible trace sets of the model, because when the model has many transitions, especially nested cyclic and concurrent ones, the number of possible trace sets increases substantially. Afterwards, we apply the existing risk quantification measures to the approximated possible traces of the process model.

The idea is to generate the execution traces in an ordered manner rather than considering all possible trace combinations. For this purpose, we restrict ourselves to process models, i.e., Petri nets, that can be represented as process trees. We can then traverse the process model in an ordered manner without randomizing the choice of transitions to fire. Therefore, for XOR and concurrent transitions, we only consider a fixed firing order of the activities instead of all possible combinations. In the following, we assume a frequency-annotated process tree as input.

4.2 Process Trees

Process discovery using graph-based notations can result in unsound process models, which complicates further analysis. We therefore use process trees, i.e., block-structured models that are sound by construction [2]. A process tree is defined as follows:

Definition 1

(Process Tree) [2]. Let \(A \subseteq \mathcal {A}\) be a finite set of activities with \(\tau \notin A\), and let \(\{\rightarrow ,\times ,\wedge , \circlearrowleft \}\) be the set of process tree operators, ranged over by \(\oplus \).

  • If \(a \in A \cup \{ \tau \}\), then Q = a is a process tree,

  • If \(n \ge 1\), \(Q_1, Q_2, \ldots , Q_n\) are process trees, and \(\oplus \in \{\rightarrow ,\times ,\wedge \}\), then \(Q = \oplus (Q_1, Q_2, \ldots , Q_n)\) is a process tree, and

  • If \(n \ge 2\) and \(Q_1, Q_2, \ldots , Q_n\) are process trees, then \(Q = \circlearrowleft (Q_1, Q_2, \ldots , Q_n)\) is a process tree.

\(\mathcal {L}_A\) is the set of all process trees over A.
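
For illustration, a frequency-annotated process tree can be encoded as a small recursive data structure; the following Python sketch is one possible encoding (the names Node and frequency are ours, not taken from the paper's implementation), using the operator abbreviations seq, xor, and, and xor loop introduced below.

from dataclasses import dataclass, field
from typing import List, Optional

# Operator labels, following the abbreviations used in the text.
SEQ, XOR, AND, LOOP = "seq", "xor", "and", "xor loop"

@dataclass
class Node:
    """A node of a frequency-annotated process tree.

    Leaf nodes carry an activity label (operator is None; a None label is a
    silent tau leaf); inner nodes carry one of the four operators. frequency
    is how often the node fires according to the annotated Petri net."""
    label: Optional[str] = None        # activity name for leaf nodes
    operator: Optional[str] = None     # SEQ, XOR, AND, or LOOP for inner nodes
    frequency: int = 0
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return self.operator is None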

The leaf nodes of a process tree represent process activities and the other nodes represent operators. Process trees have four operators: the sequence operator, the exclusive choice operator, the parallel operator, and the loop operator, denoted as \(\rightarrow \), \(\times \), \(\wedge \), and \(\circlearrowleft \), respectively. The four operators are also abbreviated as seq, xor, and, and xor loop, which we adopt in this work. The operator nodes define the order of execution of their children. In the following, we define the semantics of the execution of a process tree by its operator type:

Definition 2 (Semantics of a Process Tree)

Let \(\mathcal {P}\) be a process tree and let \(\mathcal {N}\) be any non-leaf node of \(\mathcal {P}\). Let \(\mathcal {T}(\mathcal {N}) \in \{\rightarrow ,\times ,\wedge , \circlearrowleft \}\) be the operator type of node \(\mathcal {N}\), and let \(\mathcal {N}\) have children \(\{a,b\}\), with a being the leftmost and b the rightmost child. The children of node \(\mathcal {N}\) are executed as follows (a code sketch of these semantics follows the definition):

  • if \(\mathcal {T}(\mathcal {N}) = \,\rightarrow \), a is executed and then b is executed. The trace is \(\langle a,b\rangle \).

  • if \(\mathcal {T}(\mathcal {N}) = \times \), either a is executed or b is executed. The trace is \(\langle a\rangle \) or \(\langle b\rangle \).

  • if \(\mathcal {T}(\mathcal {N}) = \wedge \), both a and b are executed, and either may come first. The trace is \(\langle a,b\rangle \) or \(\langle b,a\rangle \).

  • if \(\mathcal {T}(\mathcal {N}) = \,\circlearrowleft \), a is executed and b can be executed any number of times \(0 \ldots n\); after each execution of b, a is executed again. The trace is \(\langle a\rangle \), \(\langle a,b,a\rangle \), or \(\langle a,b,a,\ldots ,a,b,a\rangle \).
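
Read as a recursive enumerator, these semantics can be sketched in Python as follows, building on the Node encoding above; the function bounds loop repetitions and, for brevity, only permutes whole child traces of an AND node instead of enumerating every interleaving of their events.

from itertools import permutations, product

def traces_of(node, max_loop=1):
    """Enumerate traces a process tree can produce, repeating loop bodies at
    most max_loop times (illustrative; exponential in general)."""
    if node.is_leaf():
        # A tau leaf (label None) contributes the empty trace.
        return [[]] if node.label is None else [[node.label]]
    child_traces = [traces_of(c, max_loop) for c in node.children]
    if node.operator == SEQ:
        return [sum(combo, []) for combo in product(*child_traces)]
    if node.operator == XOR:
        return [t for ts in child_traces for t in ts]
    if node.operator == AND:
        # Simplification: permute whole child traces rather than enumerating
        # all interleavings of their events.
        return [sum(combo, []) for ts in permutations(child_traces)
                for combo in product(*ts)]
    if node.operator == LOOP:
        do = child_traces[0]
        redo = [t for ts in child_traces[1:] for t in ts]
        result, current = [], list(do)
        for _ in range(max_loop + 1):
            result.extend(current)
            current = [a + b + c for a in current for b in redo for c in do]
        return result

# Example: a loop whose do-part is seq(a, b) and whose redo-part is c.
tree = Node(operator=LOOP, children=[
    Node(operator=SEQ, children=[Node(label="a"), Node(label="b")]),
    Node(label="c"),
])
print(traces_of(tree))  # [['a', 'b'], ['a', 'b', 'c', 'a', 'b']]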

4.3 Frequency Constrained Traversal of the Process Tree

After assigning all nodes of the process tree their respective firing frequencies, the next step is to traverse the process tree according to those frequencies. This allows us to generate simulated traces whose transition frequencies match those of the input Petri net. The process tree is traversed top-down while decrementing the counts of the traversed nodes until the frequencies of the process tree are fully satisfied; each traversal must respect the counts on the nodes of the tree. We define the execution order of the nodes in a process tree by our simulation approach as follows:

Definition 3 (Execution Order of Nodes in Process Tree by our Approach)

Let P be a process tree and let N be a node of P, i.e., P itself or any non-leaf subtree of P. Let T(N) be the type of the operator at the root of N, which can be one of \(\{ SEQ, XOR, AND, LOOP \}\). The execution order of the children of a node N by our approach is as follows (a code sketch of this traversal is given after the list):

  1. If \(T(N) = SEQ\): execute the leftmost child first, followed by the second leftmost, and so on.

  2. If \(T(N) = AND\): execute all children of the AND node in a fixed order from left to right.

  3. If \(T(N) = XOR\): execute the first child of the XOR node with remaining frequency > 0.

  4. If \(T(N) = LOOP\): execute the leftmost child; repetition is possible by additionally executing the rightmost child and then executing the leftmost child again. The overall number of repetitions equals the frequency of the rightmost child of the XOR LOOP node.
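
A minimal Python sketch of this frequency-constrained traversal, again using the Node encoding from above, could look as follows; the counts bookkeeping and the greedy repetition policy for loops are our simplifications and not necessarily the exact strategy of our implementation.

def execute(node, counts, trace):
    """One run of the tree in the fixed order of Definition 3; counts maps
    node ids to remaining frequencies and is decremented in place."""
    counts[id(node)] -= 1
    if node.is_leaf():
        if node.label is not None:       # silent (tau) leaves produce no event
            trace.append(node.label)
        return
    if node.operator in (SEQ, AND):      # fixed left-to-right order
        for child in node.children:
            execute(child, counts, trace)
    elif node.operator == XOR:           # first child with remaining frequency
        child = next(c for c in node.children if counts[id(c)] > 0)
        execute(child, counts, trace)
    elif node.operator == LOOP:
        do, redo = node.children[0], node.children[1]
        execute(do, counts, trace)
        # Greedy policy: repeat as long as both the redo child and the do
        # child still have remaining frequency.
        while counts[id(redo)] > 0 and counts[id(do)] > 0:
            execute(redo, counts, trace)
            execute(do, counts, trace)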

Fig. 2. Frequency-annotated process tree of the handling requests for compensation log.

In Fig. 2, we show the frequency-annotated process tree corresponding to the initial example log in Table 1, decorated by our approach as described in the previous section. We give one example run of the tree traversal execution order according to Definition 3 for the process tree in Fig. 2. The execution starts from the root node and proceeds as follows:

  1. The root SEQ node is executed.

  2. The children of the SEQ node are executed in order from left to right: register request, XOR LOOP, and XOR.

  3. When the XOR LOOP is executed, its children are executed: SEQ and reinitiate request. For each execution of the right child of the loop (reinitiate request), the leftmost child (SEQ) of the XOR LOOP is executed again. That is, the leftmost child (SEQ) is executed, followed by the execution of the AND node and the decide node. Next, the children of the AND node (check ticket and XOR) are executed in left-to-right order. Then, examine thoroughly is executed if its remaining frequency is > 0; otherwise, examine casually is executed.

  4. The XOR node is executed, and then its child reject request is executed if its remaining frequency is > 0; otherwise, pay compensation is executed.

The process tree is traversed multiple times until its frequencies are satisfied. While nodes are executed, the counts of the executed nodes are decremented; when all counts reach zero, the observed frequencies are satisfied and the execution stops. We thus obtain from the process tree executions that are equivalent in their activity frequencies to the original event log from which the Petri net was mined. The individual traces, however, may differ due to the higher abstraction level of the process model.
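
As a usage sketch (assuming the execute function and Node encoding from above), repeating the traversal until the root's remaining frequency reaches zero yields one simulated event log whose activity frequencies match the annotated model.

def all_nodes(node):
    """Yield a node and all of its descendants."""
    yield node
    for child in node.children:
        yield from all_nodes(child)

def simulate_log(root):
    """Generate traces until the annotated frequencies are used up."""
    counts = {id(n): n.frequency for n in all_nodes(root)}
    log = []
    while counts[id(root)] > 0:
        trace = []
        execute(root, counts, trace)
        log.append(tuple(trace))
    return log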

5 Evaluation

We evaluate our approach on real-life event logs to investigate feasibility and validity.

5.1 Experimental Setup

We performed experiments with multiple real-life event logs that are publicly available at the 4TU Centre for Research Data. Here, we only report the results on the Sepsis Cases and the Road Traffic Fine Management (RTFM) logs since they are frequently used in related work. For each log, we generate a frequency-annotated process model using the Inductive Miner [7]. From the process model, we obtain five simulated event logs by applying our approach; we did not opt for more simulations since there is little variation between the results of the simulated event logs. Afterwards, we calculate the identity (case) disclosure and trace disclosure measures mentioned in our approach in Sect. 4, as implemented by Rafiei et al. [8] in the p-privacy-qt library published with their work. We then report the case disclosure and trace disclosure for both the original event log and the five simulated event logs generated from the process model mined from it. Our approach is implemented in Python and is available on GitHub.
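
For orientation, the discovery step of this pipeline can be reproduced with pm4py's simplified interface as sketched below; the file path is hypothetical, and the exact calls may differ from our implementation and from the pm4py version used.

import pm4py

# Hypothetical path; the Sepsis Cases log is available from the
# 4TU Centre for Research Data.
log = pm4py.read_xes("Sepsis_Cases_Event_Log.xes")

# Discover a block-structured model with the Inductive Miner; the process
# tree can also be converted to a Petri net for frequency annotation.
tree = pm4py.discover_process_tree_inductive(log)
net, initial_marking, final_marking = pm4py.convert_to_petri_net(tree)

# From here, our approach annotates the model with frequencies, simulates
# five event logs, and feeds the original and simulated logs into the
# disclosure measures of Rafiei et al. [8].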

5.2 Identity (Case) Disclosure Results

Fig. 3. Case disclosure results for the original log and our simulated event logs.

In Fig. 3, we show the identity (case) disclosure risk for the original Sepsis Cases log and the simulated logs generated by our approach. The identity (case) disclosure risk increases with the size of the background knowledge: the more background information is available to the attacker, the higher the risk of re-identifying individuals in the event log. The risk also increases when varying the background knowledge type from set to multiset to sequence, which is again explained by more background information being available to the attacker about the activities. This increase is also noticeable in all other event logs studied in the experiments. As can also be seen in Fig. 3, the identity (case) disclosure risk in the five simulated event logs generated from the process model mined from the original Sepsis Cases event log is less than or equal to the risk in the original event log, with the gap between the original log and the simulated logs growing as the size of the background knowledge increases. In our approach, the risk of the logs simulated from the process model serves as an estimator of the identity (case) disclosure risk of the process model mined from the original event log.

5.3 Trace Disclosure

Fig. 4. Trace disclosure results for the original Sepsis log and our simulated event logs.

We report the results of the experiments quantifying the trace disclosure risk. In Fig. 4, we notice that the trace disclosure risk increases with the size of the background knowledge for the original Sepsis Cases event log but not for the simulated event logs. For other event logs, such as the Road Traffic Fine Management event log in Fig. 5, the trend is different: the trace disclosure risk decreases for the original RTFM log and varies for the simulated event logs as the background knowledge increases. This indicates that the trace disclosure risk does not follow the same trend as the identity (case) disclosure risk when the size of the background knowledge increases. The trace disclosure risk can thus be high even for weaker background knowledge, which was also found in [8].

5.4 Discussion

The distributions of the identity disclosure risks of the five simulated logs of the process model mined from the Sepsis Cases log are all below or equal to the risk of the original event log. This result was also observed for all other tested event logs. Therefore, for all studied event logs, the identity disclosure risk is, on average, lower in the logs simulated from the process model than in the original event log. This confirms our intuition that a process model abstracts certain behavior and therefore provides less information to an adversary. It also confirms that our simulation approach is feasible and that our constrained simulation seems to return valid results. Clearly, the identity disclosure risk of the simulated event logs generated from the process model discovered from the original event log should, overall, not be higher than the risk of the original log. As noted, the identity disclosure risk and the trace disclosure risk vary little between the simulated event logs. However, the results of the experiments on these event logs do not guarantee that the hypotheses hold for all event logs.

Fig. 5. Trace disclosure results for the original RTFM log and our simulated event logs.

We already argued from a theoretical perspective that a process model reveals less information than an event log. It is thus safe to say that process models are generally safer to publish than the event logs they were mined from. Moreover, the experiments show that the identity disclosure risk is, on average, significantly lower for process models than for the original event logs they were mined from. In some cases, however, the re-identification risk of the model can be equal to that of the log for some background knowledge sizes, as shown in our experimental results. This may be an artefact of our simulation method but could also indicate that publishing a process model can, in some cases, carry a risk similar to publishing the log.

Regarding the trace disclosure risk, the results are less clear. Depending on the log, our simulated event logs result in a higher risk compared to the original event log. Indeed, our method may generate fewer variants than contained in the original log, so the disclosure risk appears higher. Thus, our method is not well suited to investigate the trace disclosure risk.

6 Conclusion

We discussed possible privacy attacks that an adversary can mount using a published process model in the form of a block-structured Petri net or process tree. We proposed a method to quantify the re-identification risk of such models that is based on a constrained simulation and leverages existing work on quantifying re-identification risk. In our experiments, we validated the feasibility of our approach on several event logs and reported detailed results on the Sepsis Cases event log. We conclude that our approach returns results in line with the intuition that discovering a process model from an event log abstracts from certain behavior and, thus, the re-identification risk should, in general, be lower than that of the original event log. In future work, we want to evaluate this method in a more statistically rigorous manner and work on more efficient approaches that approximate the re-identification risk directly from a non-block-structured Petri net without generating event logs.