Keywords

1 Introduction

Data science is changing the way we do business, socialize, conduct research, and govern society. Data are collected on anything, at any time, and in any place. Therefore, it is not surprising that many people are concerned about the responsible use of data. The Responsible Data Science (RDS) [8] initiative focuses on four main questions: (1) How to avoid unfair conclusions even if they are true?, (2) How to answer questions with a guaranteed level of accuracy?, (3) How to answer questions without revealing secrets?, and (4) How to clarify answers such that they become indisputable? This paper focuses on the confidentiality problem (third question) when applying process mining to event data.

Process mining uses event data to provide novel insights into actual processes [2]. There are many activities and techniques in the field of process mining. However, the three basic types of process mining are; process discovery [1], conformance checking [2], and process re-engineering (enhancement) [7]. Also, four perspectives are considered to analyze the event data including; control-flow, organizational, case, and time perspective [2]. In this paper, we focus on control-flow and organizational perspective side by side. A simple definition for process discovery is learning process models from event logs. In fact, a discovery technique takes an event log and produces process model without using additional information [5]. A social network is a social structure which shows relations among social actors (individuals or organizations) [30]. When event data contain information about resources, not only can it be used to thoroughly analyze bottlenecks, but also it turns to a valuable data to derive social networks among resources, involved in the process. Since such event data contain highly sensitive information about the organization and the people involved in the process, confidentiality is a major concern. Note that by confidentiality in process mining, we aim to deal with two important issues; (1) protecting the sensitive data belonging to the organization, (2) protecting the private information about the individuals.

As we show in this paper, confidentiality in process mining cannot be achieved by merely encrypting all data. Since people need to use and see process mining results, the challenge is to retain as little information as possible while still being able to have the same desired result. Here, the desired results are process models and social networks. The discovered models (networks) based on the anonymized event data should be identical to the results obtained from the original event data (assuming proper authorizations).

In this paper, we propose an approach to deal with confidentiality in process mining which is based on abstractions. Moreover, we present the connector method by which the individual traces of a process stay anonymous, yet, at the same time, process models and social networks are discoverable. The proposed framework allows us to derive the same results from secure event logs when compared to the results from original event logs, while unauthorized persons cannot access confidential information. In addition, this framework can provide a secure solution for process mining when processes are cross-organizational.

The remainder of this paper is organized as follows. Section 2 outlines related work and the problem background. In Sect. 3, we clarify process mining, social network discovery, and cryptography as preliminaries. In Sect. 4, the problem is explained in detail. Our approach is introduced in Sect. 5. In Sect. 6 the approach is evaluated, and Sect. 7 concludes the paper.

2 Related Work

In data science, social networks, and information systems, confidentiality has been a topic of interest in the last decade. In computer science, privacy-preserving algorithms and methods in differential privacy are most applicable to confidentiality in process mining. In sequential pattern mining, the field of data science which arguably close to process mining, there has been work on preserving privacy in settings with distributed databases [15] or in cross-organizational settings [31]. Also, privacy-preservation in social networks is a well-researched topic, and most of the research in this field aims to protect the privacy of the individuals involved in a given social network [17]. However, here, we focus on the confidentiality issues arising when initially discovering social networks from event logs that comprise lots of sensitive private data about the individuals.

Although there have been a lot of breakthroughs in the field of process mining ranging from data preprocessing [28] and process discovery [22] to performance analysis [18] and prediction [25], the research field confidentiality and privacy has received relatively little attention. This is despite the fact that already the Process Mining Manifesto [6] points out that privacy concerns are important to be addressed. In the following, we introduce some research regarding Responsible Process Mining (RPM) and few publications which focused specifically on confidentiality issues, in the control-flow perspective or during process discovery.

The topic of Responsible Process Mining (RPM) [3] has been put forward by several authors thereby raising concerns related to fairness, accuracy, confidentiality, and transparency. In [29], a method for securing event logs to be able to do process discovery by Alpha algorithm has been proposed. In [12], a possible approach toward a solution, allowing the outsourcing of process mining while ensuring the confidentiality of dataset and processes, has been presented. In [20], the authors has used a cross-organizational process discovery setting, where public process model fragments are shared as safe intermediates. In [23], the aim is to provide an overview of privacy challenges when process mining is used in human-centered industrial environments. In [27], the authors introduce a framework for ensuring confidentiality in process mining which is utilized and extended in this paper. In [14], a privacy model is proposed for privacy-aware process discovery. In [26], the organizational perspective in process mining is taken into account, and the aim is to provide a privacy-preserving method for role mining, which can be used for generalizing resources as individuals in event data. It is also worth noting that process mining can be used for security analyses, e.g., in [10], process mining is used for security auditing.

3 Background

In this section, we briefly present the main concepts and refer the readers to relevant literature for more detailed explanations.

3.1 Process Mining

In the following, we introduce some basic concepts of process mining to which we will refer in this paper.

Events are the smallest data unit in process mining and occur when an activity in a process is executed. Events comprise of multiple attributes including; Case ID, Timestamp, Activity, Resource, etc. In Table 1, each row indicates an event. In the remainder of this paper, we will refer to the activities and resources of Table 1 with their abbreviations, e.g., “R” stands for “Register”.

A trace is a sequence of events and represents how a process is executed in one instance, e.g., in Table 1, case 1 is first registered, then documents are verified, and vacancies are checked. Finally, a decision is made for the case.

An event log is a collection of sequences of events which are used as the input of process mining algorithms. Event data are widely available in current information systems [6].

As you can see in Table 1, a “Timestamp” identifies the moment in time at which an event has taken place, and a “Case ID” is what all events in a trace have in common so that they can be identified as part of that process instance. Event logs can also include additional attributes for the events they record. There are two main attribute types that fall under this category. “Event Attributes” which are specific to an event, and “Case Attributes” which are ones that stay the same throughout an entire trace.

Table 1. Sample event log (each row represents an event).
Fig. 1.
figure 1

The DFG resulting from event log Table 1

A Directly Follows Graph (DFG) is a graph where the nodes represent activities and the arcs represent causalities. Activities “a” and “b” are connected by an arrow when “a” is frequently followed by “b”. The weights of the arrows denote the frequency of the relation [19]. Most commercial process mining tools use DFGs. Unlike more advanced process discovery techniques (e.g., implemented in ProM), DFGs cannot express concurrency. Figure 1 shows the DFG resulting from the event log Table 1.

3.2 Discovering Social Networks

There are different methods for discovering social networks from event logs including those based on causality, joint activities, joint cases, etc. [9]. Here, we, however, focus purely on the metrics based on causality. These metrics monitor for individual cases how work moves from resource to resource. E.g., there is a handover relation from individual i to individual j, if there are two subsequent activities where the first is performed by i and the second is performed by j. This relation furthermore becomes a real-handover if casual dependency between both activities exists. Note that in this case the directly follows relations between resources are not enough and the real casual dependencies are required. Dependency measure (Eq. 1) can be used to realize whether there is a real casual dependency between two activities (a and b) or not, while a threshold is set as the minimum required value [2]. In Eq. 1, \(| a \>_{L} b|\) shows how frequent a is followed by b:

$$\begin{aligned} |a \Rightarrow _{L}{b}|= {\left\{ \begin{array}{ll} \frac{| a>_{L} b| - | b>_{L} a|}{| a>_{L} b| - | b>_{L} a| + 1} &{} \textit{if } a\ne b\\ \frac{| a>_{L} b|}{| a >_{L} b| + 1} &{} \textit{if } a = b \end{array}\right. } \end{aligned}$$
(1)

When observing handovers, indirect succession may also be considered. E.g., based on the event log of Table 1, there is a non-real handover relation between “Frank” and “Alex” with the depth three. It is non-real due to there is no real casual dependency between all the corresponding activities. Figure 2 shows the networks based on causality having been obtained from event log Table 1.

Fig. 2.
figure 2

The networks based on causality for the event log Table 1

3.3 Cryptography

Cryptography or cryptology is about constructing and analyzing protocols that prevent third parties or the public from reading private messages [11].

A cryptosystem is a suite of cryptographic algorithms needed to implement a particular security service, most commonly for achieving confidentiality [16]. There are different kinds of cryptosystems. In this paper, we use the following ones.

  • Symmetric Cryptosystem: The same secret key is used to encrypt and decrypt a message. Data manipulation in symmetric systems is faster than asymmetric systems as they generally use shorter key lengths. Advanced Encryption Standard (AES) is a symmetric encryption algorithm [13].

  • Asymmetric Cryptosystem: Asymmetric systems use a public key to encrypt a message and a private key to decrypt it or vice versa. The use of asymmetric systems enhances the security of communication. Rivest-Shamir-Adleman (RSA) is an asymmetric encryption algorithm.

  • Deterministic Cryptosystem: A deterministic cryptosystem is a cryptosystem which always produces the same ciphertext for a given plaintext and key, even over separate executions of the encryption algorithm.

  • Probabilistic Cryptosystem: A probabilistic cryptosystem, other than the deterministic cryptosystem, is a cryptosystem which uses randomness when encrypting so that when the same plaintext is encrypted several times, it will produce different ciphertexts.

  • Homomorphic Cryptosystem: A homomorphic cryptosystem allows computation on ciphertext, e.g., Paillier is a partially homomorphic cryptosystem [24].

4 Problem Definition (Attack Analysis)

To illustrate the challenge of confidentiality in process mining, we start this section with an example. Consider Table 2, describing a totally encrypted event log, belonging to a hospital conducting surgeries. Since we need to preserve difference to find a sequence of activities for each case, discovering process model, and other analyses like social network discovery, “Case ID”, “Activity”, and “Resource” are encrypted based on a deterministic encryption method. Numerical data (i.e., “Timestamp” and “Cost”) are encrypted by a homomorphic encryption method to preserve the ability of basic mathematical computations on the encrypted data. Now suppose that we have background knowledge about surgeons and the approximate cost of different types of surgeries. The question arises whether parts of the log can now be deanonymized.

Owning to the fact that “Cost” is encrypted by a homomorphic encryption method, the maximum value for the “Cost” is the real maximum cost and based on background knowledge we know that e.g., the most expensive event in the hospital was the brain surgery by “Dr. Jone”, on “01/09/2018 at 12:00”, and the patient name is “Judy”. Since “Case ID”, “Activity”, and “Resource” are encrypted by a deterministic encryption method, we can replace all these encrypted values with the corresponding plain values. Consequently, encrypted data could be made visible without requiring decryption. This example demonstrates that even given completely encrypted event logs small fraction of contextual knowledge can leads to data leakage.

Given domain knowledge, several analyses could be done to identify individuals or extract some sensitive information from an encrypted event log. In the following, we explain couple of them.

Table 2. A totally encrypted event log.
  • Exploring the Length of Traces: One can find the longest/shortest trace, and the related background knowledge can be exploited to realize the actual activities and the related case(s).

  • Frequency Mining: One can find the most or the less frequent traces and the related background knowledge can be utilized to identify the corresponding case(s) and the actual activities.

These are just some examples demonstrate that encryption alone is not a solution. For example, [21] shows that mobility traces are easily identifiable after encryption. Any approach which is based on solely encrypting the whole event log will furthermore have the following weaknesses:

  • Encrypted Results: Since results are encrypted, the data analyst is not able to interpret the results. E.g., as data analyst we want to know which paths are the most frequent after “Registration” activity; how can one perform this analysis when the activities are not plain? The only solution is decrypting the results.

  • Impossibility of Accuracy Evaluation: How can we make sure that a result of the encrypted event log is the same as the result of the plain event log? Again, decryption would be required.

Generally, and as explored by [12], using cryptography is a resource consuming activity, and decryption is even much more resource consuming than encryption. The weaknesses demonstrate that encryption methods should be used wisely and one needs to evaluate closely where they are beneficiary and where unavoidable to provide confidentiality.

Here, we assume that background knowledge could be any contextual knowledge about traces which can result in a case disclosure including; frequency of traces, length of traces, exact/approximate time related to the cases, etc. Note that this background knowledge is assumed where unauthorized people can access the anonymized data. For example, given domain knowledge regarding frequency of traces one can guess the actual sequence of activities and possible case(s) (e.g., politicians, celebrities, etc) for the traces which are too rare. Consequently, individuals or minority group of people and their private information would be revealed. Therefore, the case disclosure is a crucial type of data leakage which should be prevented.

5 Approach

Figure 3 illustrates a framework to provide a solution for confidentiality when the desired result is a model. This framework has been inspired by [5], where abstractions are introduced as intermediate results for relating models and logs. Here, abstractions are directly follows matrix of activities (A-DFM) and directly follows matrix of resources (R-DFM). Figure 4 shows the A-DFM and R-DFM resulting from event log Table 1. A-DFM is considered as the abstraction for relating logs and process models, and R-DFM together with A-DFM are considered as the abstraction for relating logs and social networks which are based on causality. As can be seen in Fig. 3 three different environments and two confidentiality solutions are presented.

  • Forbidden Environment: In this environment, the actual information system runs that needs to use the real data. The real event logs (EL) produced by this environment contain a lot of valuable confidential information and except some authorized persons no one can access this data.

  • Internal Environment: This environment is just accessible by authorized stakeholders. A data analyst can be considered as an authorized stakeholder who can access the internal event logs. Event logs in this environment are partially secure, selected results produced in this environment (e.g., a process model) are the same as the results produced in the forbidden environment, and data analyst is able to interpret the results without decryption.

  • External Environment: In this environment, unauthorized external persons can access the data. Such environments may be used to provide the computing infrastructure dealing with large data sets (e.g., a cloud solution). Event logs in this environment are supposed to be entirely secure, and the results are encrypted. Whenever a data analyst wants to interpret results, they need to be decrypted and converted to an internal version. Furthermore, results from the external environment do not need to be exactly the same as the results from the internal environment, but, the same interpretations need to be provided.

Fig. 3.
figure 3

The proposed framework for confidentiality in process mining.

Table 3 shows a summary of our assumptions with respect to the internal and external environments. Note that in the forbidden environment, the main assumption is that only few highly trusted persons can access the data. Therefore, there is no need to employ confidentiality solutions. As described in Sect. 4, contextual knowledge regarding traces is assumed as background knowledge. As can be seen in Fig. 3, the desired results, which are process models (PM) and social networks (SN), can be obtained in each environment. The original event log (EL) is converted to the partially secure event log in the internal environment (\(EL'\)) and then to the entirely secure event log in the external environment (\(EL''\)) by the internal confidentiality solution (ICS) and the external confidentiality solution (ECS) respectively. Abstractions, which are intermediate results, are used for proving accuracy. It should be taken into account that since abstractions are considered as the outputs of the very last phase before the final results (only thresholds are required to be applied), when they are equal, the final results would be the same. In addition, transparency is provided by the reverse operation of the internal confidentiality solution (\(ICS^{-1}\)) and the reverse operation of the external confidentiality solution (\(ECS^{-1}\)). In the following, we explain our ICS and ECS in detail.

Fig. 4.
figure 4

The abstractions from Table 1.

Table 3. The general assumptions based on the environments

5.1 Internal Confidentiality Solution (ICS)

For ICS we combine several methods and introduce the connector method. Figure 5 gives an overview of the anonymization procedure.

Filtering and Modifying the Input. The first step to effective anonymization is preparing the data input. To filter the input, simple limits for frequencies can be set, and during loading an event log all traces that do not reach the minimal frequencies are not transferred to the \(EL'\).

Choosing the Plain Data. As mentioned, we need to produce interpretable results. Hence, some parts of event log remain as plain text in the internal version of the secure event log (\(EL'\)). We should decide what information and/or structure is strictly necessary for the desired analysis. Based on our considered abstractions (A-DFM and R-DFM), the only information necessary are directly follows relations between activities/resources.

Fig. 5.
figure 5

The internal confidentiality solution.

Encryption. Here there are two important choices. The first choice is which columns of the event log should be encrypted. Second, we need to decide which algorithms should be used. For the internal environment, since we want to keep the capability of applying basic mathematical computations on the encrypted values, we use Paillier for numeric attributes (i.e., “Cost”), and AES-128 with only ASCII characters as the key is used for other attributes. Note that the encrypted values shown in the paper are not necessarily the real outputs of the encryption methods (they are just unintelligible text).

Making Times Relative. Times need to be modified because keeping the exact epoch time of an event can allow one to identify it. The naive approach, of setting the starting time of every trace to 0, would make it impossible to replay events and reconstruct the original log. Thus, we select another time that all events are made relative to. This time can be kept secure along with the keys for decryption. Table 4 shows the first 10 rows of our sample log after encrypting cost and making times relative to the “01-01-2018:00.00”.

Table 4. The first 10 rows of Table 1 after encryption and making times relative
Table 5. Adding previous activities/resources and previous IDs.

The Connector Method. Using the connector method we embed the structure, which can be used for extracting directly follows relations, into \(EL'\). Also, the connector method helps us to reconstruct the full original event logs when keys and relative values are given. In the first step, the previous activity (“Prev. Activity”) and the previous resource (“Prev. Resource”) columns are added in order to identify which arcs can be directly connected.

In the second step, we find a way to securely save the information contained in the “Case ID”, without allowing it to link the events. This can be done by giving each row a random ID (“ID”) and a previous ID (“Prev. ID”). These uniquely identify the following event in a trace because the IDs are not generic like activity names. The ID for start activities is always a number of zeros. Table 5 shows the log after adding “Prev. Activity”, “Prev. Resource”, “ID”, and “Prev. ID”.

Table 6. The event log after adding the connector column
Table 7. The output event log after applying ICS

In the third step, regarding the fact that these columns contain the same information previously found in the “Case ID”, they must be hidden and secured. This can be done by concatenating the “ID” and “Prev. ID” of each row and encrypting those using AES. Due to the nature of AES, neither orders nor sizes of the IDs remain inferable. The concatenation can be done in any style, in this example, we however simply concatenate “ID” and “Prev. ID”,e.g., connector of the first row would be “3100”. To retain the “ID” and “Prev. ID” one simply needs to decrypt the “Connector” column and cut the resulting number in two equal parts. This method requires that every time the two IDs differ by a factor 10 a zero must be added to guarantee equal length. Table 6 shows the log after concatenating the ID columns and encrypting them as a connector.

In the final step, we use the “Case ID” to anonymize the “Time tamp”. The “Time tamp” attribute of events which have the same “Case ID” is made relative to the preceded one. The exception is the first event of each trace which remains unchanged. This allows the complete calculation of all durations of the arcs in a directly follows graph but makes it complicated to identify events based on the epoch times they occurred at. After creating the relative times, we are free to delete the “Case ID” and disarray the order of all rows, ending up with an unconnected log in Table 7.

Table 7 is internally secure event log (\(EL'\)), which can be used by a data analyst to create a A-DFM and a R-DFM. It is trivial to see that if process/social network discovery could have been done on the plain event log (EL), AEL would be identical to \(AEL'\) (i.e., both are the same A-DFM/R-DFM) and the final desired results would be the same. Note that when the desired result is a process model, resource related information (“Resource” and “Prev. Resource” columns) can be removed from Table 7. Moreover, when the desired result is a handover network, activity related information (“Activity” and “Prev. Activity”) can be removed, since the real causal dependencies do not need to be taken into account.

Comparing Table 7 and the original log, one can see that there is no answer for the following questions in \(EL'\) anymore: (1) Who was responsible for doing an activity for case c? (2) What is the sequence of activities which has been done for case c? (3) How long did it take to process case c? (4) What is the cost of activity a which has been done by resource r for case c? (5) What is the the length of case c? (6) What is the the frequency of case c?, and many other questions related to the cases.

It is also worth noting that since we assume that the data in the internal environment can be accessed by the internal trustworthy people who already know the organizational structure, the plain resources are not considered as a privacy issue. In fact, \(EL'\) is a partially secure version of event log in such a way that it contains the minimum level of information, which a data analyst might need to reach the result. Although ICS does not preserve the standard format of the event log which is used by the current process discovery techniques, the intermediate input it provides can be used by the current tools. In the External Confidentiality Solution (ECS), we need to avoid any form of data leakage and privacy risks based on the assumed background knowledge.

5.2 External Confidentiality Solution (ECS)

In the external environment, the plain part of the event log may cause data leakage. Therefore, the whole event log gets encrypted. Moreover, some additional attributes, which can lead to data leakage even in the encrypted form, are projected. In the following, our two-steps ECS is explained.

Encrypting the Plain Part. In this step, activities and resources are encrypted by a deterministic encryption method like AES. A deterministic encryption method must be used, because for discovering DFMs, differences should be preserved. Table 8 shows the result after encrypting activities and resources.

However, after encrypting, detecting “START” activities seem to be impossible, and without detecting them, finding traces becomes impossible. For identifying the “START” activities, we can go through the “Activity” (“Resource”) and “Prev. Activity” (“Prev. Resource”) columns, the activities (resources) which are appeared in the “Prev. Activity” (“Prev. Resource”) column but not appeared in the “Activity” (“Resource”) column are the “START” activities (resources).

Table 8. The event log after encrypting activities and resources

Fortifying Encryption and/or Projecting Event Logs. As mentioned in Sect. 4, since resources are encrypted by a deterministic encryption method, and costs are encrypted by a homomorphic encryption method, which preserves differences, by comparison, one can find the minimum/maximum cost, which can be used as knowledge for extracting confidential or private information (e.g. name of resource). In order to decrease the effect of such analyses, fortifying encryption and/or projecting event logs could be done. Here, we project the costs which are indeed not necessary for the desired results.

6 Evaluation

We consider three evaluation criteria for the proposed approach, yet, at the same time, performance is also taken into account:

  • Ensuring Confidentiality: As explained in Sect. 5, we can improve confidentiality by defining different environments and indicating a level of information which is accessible in each of these environments. In addition, using multiple encryption methods and our connector method for disassociating events from their cases provide high level of confidentiality with respect to the assumed background knowledge.

  • Reversibility: When the keys and the value used for making times relative are given, both ICS and ECS are reversible, which means that transparency is addressed by the proposed approach.

  • Accuracy: To show the accuracy of our approach, by a case study we illustrate that the results obtaining from the secure version of event logs are exactly the same as the results obtaining from the original event logs.

6.1 Correctness of the Approach

As can be seen in Fig. 3, from accuracy point of view, we need to show that the abstraction of the original event log is the same as the abstraction of the internal event log \((AEL = AEL')\) (rule (2)), and also the abstraction of the internal event log is the same as the abstraction of the external event log, which is encrypted \((AEL' = ECS^{-1}(AEL''))\) (rule (3)). To show that these relations are guaranteed to hold, we have implemented an interactive environment in Python and tested the approach on multiple event logs. In the following, we illustrate the results obtaining by applying the approach on “BPI challenge 2012”.

$$\begin{aligned} AEL=AEL'\Rightarrow {PM=PM'} \wedge SN=SN' \end{aligned}$$
(2)
$$\begin{aligned} AEL'= ECS^{-1} (AEL'') \Rightarrow {PM' \approx ECS^{-1} (PM'')} \wedge SN' \approx ECS^{-1} (SN'') \end{aligned}$$
(3)
Fig. 6.
figure 6

Comparing the DFG from the EL with the DFG from the \(EL'\) for BPI challenge 2012: both graphs are identical, only layouts are different.

Fig. 7.
figure 7

The DFG from the external event log (BPI challenge 2012) for the frequency threshold 2000.

In the first step, \(EL'\), and \(EL''\) were created. Then, to verify that AEL and \(AEL'\) are identical, we created a DFG from the original and the internal version of event log. Figure 6 shows the DFGs resulting from BPI challenge 2012 for the frequency threshold 2000. As one can see both DFGs are the same. Also, Fig. 7 shows the DFG resulting from BPI challenge 2012 for the same frequency threshold (2000) in the external environment. As can be seen, this DFG is also the same as the DFG from the EL and \(EL'\) (modulo renaming and layout differences), i.e., all the process discovery algorithms which are based on a DFG would lead to the same process models in the different environments.

In order to demonstrate that the causality based social networks in the secure environments are the same as the actual social networks from the original event log, we have made the real-handover from the original and internal version of event log for BPI challenge 2012. Figure 8 shows the networks for the real causal dependency threshold 0.5 and the frequency threshold 50. The networks are exactly the same. It is obvious that the network from the external version of event log must be the same (while resources are encrypted). Nevertheless, in Fig. 9, we have zoomed in the highlighted parts of Fig. 8 for the networks resulting from the internal and external environment (the same thresholds were applied), and relations are the same except the fact that resources in the external environment are encrypted. As can be seen in Fig. 9 all the relations of the resource “11201” are the sameFootnote 1.

6.2 Performance

To demonstrate performance of the approach, we apply it on several benchmarking [4] and real-life event logs. Table 9 shows specifications of the used event logs. “BPI Challenge 2012” and “BPI Challenge 2017” are used to evaluate the performance when social networks are discovered, and the benchmarking event logs are used to evaluate the performance of the control-flow discovery.

Table 9. The specifications of the event logs used for evaluation

Figure 10 shows how the control-flow discovery scales when using the benchmarking event logs and increasing the number of events exponentially, and Fig. 11 shows the performance of social network discovery when the approach is applied on the two real-life event logs with different scales. All runtimes are in milliseconds and have been tested using an Intel i7 Processor with 1.8 GHz and 16 GB RAM.

Fig. 8.
figure 8

Comparing the real-handover networks resulting from BPI challenge 2012: both networks are identical.

Fig. 9.
figure 9

Comparing the relations of resource “11201” in the real-handover networks resulting from \(EL'\) and \(EL''\) for BPI challenge 2012

In Fig. 10, the darker bars show the execution time for discovering the DFG from the original event logs, and the lighter bars show the execution time for discovering the DFG from the secure event logs. One can see a linear increase of the runtime in milliseconds when adding choices or loops. In addition, as can be seen in Fig. 11, when the metric is real-handover, the execution time for discovering social networks is higher, since the real causal dependencies between subsequent resources need to be verified.

Fig. 10.
figure 10

The execution time of the control-flow discovery when using the benchmaking event logs.

Fig. 11.
figure 11

The execution time of the social network discovery when using the real-life event logs.

7 Conclusions

This paper presented a novel approach to ensure confidentiality in process mining when the desired results are models. We demonstrated that confidentiality in process mining cannot be achieved by only encrypting an event log. We outlined the little related work, most of which use just encryption, and explained the weaknesses of following this approach. The new approach is introduced since there always exists a trade-off between confidentiality and data utility. Therefore, we reasoned backwards from the desired results and how they can be obtained with as little data as possible.

Here, process models and social networks were considered as the desired results, and the confidentiality solutions presented in the context of a framework that can be extended to include other forms of process mining, i.e., different ICS and ECS could be explored for different process mining activities. Moreover, the proposed framework could be utilized in cross-organizational context such that each environment could cover specific constraints and authorizations of a party. In this paper, we focused on causality based social networks, and in the future other metrics could be explored. Moreover, in the future, a measure for confidentiality could be defined so that the effectiveness of different solutions in this research area could be quantified and compared.

We have utilized a new method named “connector”, which can be employed in any situation where we need to store associations securely. For evaluating the proposed approach, we have implemented an interactive environment in Python, and a real-life log was used as the case study.