Business Process Privacy Analysis in Pleak

Pleak is a tool to capture and analyze privacy-enhanced business process models to characterize and quantify to what extent the outputs of a process leak information about its inputs. Pleak incorporates an extensible set of analysis plugins, which enable users to inspect potential leakages at multiple levels of detail.


Introduction
Data minimization is a core tenet of the European General Data Protection Regulation (GDPR) [1]. According to GDPR, usage of private data should be limited to the purpose for which it has been collected. To verify compliance with this principle, privacy analysts need to determine who has access to the data and what private information these data may disclose. Business process models are a rich source of metadata to support this analysis. Indeed, these models capture which tasks are performed by whom, what data are taken as input and output by each task, and what data are exchanged with external actors. Process models are usually captured using the Business Process Model and Notation (BPMN). This paper introduces Pleak, the first tool to analyze privacy-enhanced BPMN models in order to characterize and quantify to what extent the outputs of a process leak information about its inputs. Pleak supports analysis at three levels of detail. The top level, namely the Boolean level (Sec. 2), tells us whether or not a given (intermediate or final) output of a process may reveal information about a given input. The middle level, the qualitative level (Sec. 3), goes further by indicating which attributes of (or functions over) a given input data object are potentially leaked by each output, and under what conditions this leakage may occur. The lowest level, the quantitative level, quantifies to what extent a given output leaks information about an input, either in terms of a sensitivity measure (Sec. 4) or in terms of the guessing advantage that an attacker gains by having the output (Sec. 5).
To illustrate the capabilities of Pleak, we refer to an "aid distribution" process in Fig. 1. This process starts when a nation requests aid from the international community to handle an emergency and a country offers to route a ship to help transport people and/or goods. The goal of the process is to allocate a port and a berth to the ship without revealing information about ships that are unable to help or the parameters of the ports. The process uses a type of privacy-enhancing technology (PET) known as secure multiparty computation (MPC). MPC allows participants to perform joint computations such that no party gets to see the data of the other parties, yet each party can learn the output, which depends on the private inputs. Given a ship, a deadline and the list of ports, task "Compute reachable ports" retrieves the list of ports reachable by the deadline. Tasks with identical names in different pools denote MPC computations carried out jointly by multiple stakeholders. Task "Select feasible ports" retrieves ports with the capacity to host the ship. The third task selects a port, a berth, and a slot for the ship, and discloses them to both participants.

PE-BPMN Editor and Simple Disclosure Analysis
The model in Fig. 1 is captured in Privacy-Enhanced BPMN (PE-BPMN) [6]. PE-BPMN uses stereotypes to distinguish the PETs used, e.g. MPC or homomorphic encryption, that affect which data is protected in the process. The PE-BPMN editor allows users to attach stereotypes to model elements and to enter the stereotype's parameters where applicable. The editor integrates a checker, which verifies stereotype-specific restrictions. For example, it checks that: (1) when a task has an MPC stereotype, there is at least one other "twin" task with the same label in another pool, since an MPC computation involves at least two parties; (2) when one of these tasks is enabled, the other twin tasks are eventually enabled; and (3) the joint computation has at least one input and one output.
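The structural checks (1) and (3) can be illustrated with a minimal Python sketch. It is not Pleak's checker: the `Task` record, the pool names and the function `check_mpc_groups` are hypothetical, and check (2) is left out because it requires reasoning about the control flow of the model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for a PE-BPMN task (not Pleak's data model)."""
    label: str
    pool: str
    stereotype: str = ""
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def check_mpc_groups(tasks):
    """Return error messages for MPC groups that violate checks (1) and (3)."""
    groups = defaultdict(list)
    for task in tasks:
        if task.stereotype == "MPC":
            groups[task.label].append(task)  # twins share the same label

    errors = []
    for label, members in groups.items():
        pools = {t.pool for t in members}
        # Check (1): an MPC computation involves at least two parties.
        if len(pools) < 2:
            errors.append(f"'{label}': MPC task needs a twin in another pool")
        # Check (3): the joint computation needs an input and an output.
        if not any(t.inputs for t in members):
            errors.append(f"'{label}': joint computation has no input")
        if not any(t.outputs for t in members):
            errors.append(f"'{label}': joint computation has no output")
    return errors


valid = [
    Task("Compute reachable ports", "Ship", "MPC", ["ship"], ["reachable ports"]),
    Task("Compute reachable ports", "Nation", "MPC", ["ports"], []),
]
broken = [
    Task("Select feasible ports", "Nation", "MPC", ["ports"], ["feasible ports"]),
]

print(check_mpc_groups(valid))
print(check_mpc_groups(broken))
```

The second model reports an error because its only MPC task has no twin in another pool.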
Given a valid PE-BPMN model, Pleak runs a binary privacy analysis, which produces a simple disclosure report and a data dependency matrix. The disclosure report in Fig. 2 tells us whether or not a stakeholder gets to see a given data object. In the report, "V" indicates that a data object (in columns) is visible to a stakeholder (in rows). Row "shared over" refers to the network service provider, who may also see some of the data (e.g. unencrypted data objects).

Qualitative Leaks-When Analysis
Leaks-When analysis [2] is a technique that takes as input a SQL workflow and determines, for each (output, input) pair such that the output discloses information about the input, which attributes of the input object are disclosed by the output object and under which conditions. A SQL workflow is a BPMN process model in which every data object corresponds to a database table, defined by a table schema, and every task is a SQL query that transforms the input tables of the task into its output tables. Fig. 3 shows a sample SQL workflow, a variant of the "aid distribution" example where the disclosure of information about ships to the aid-requesting country is made incrementally. The figure shows the SQL workflow alongside the query corresponding to task "Select reachable ports".
To perform a Leaks-When analysis, the user selects one or more output data objects and clicks the "LeaksWhen Report" button. The Leaks-When analysis shows one tab for each output data object and one report for each column in the output table. An example of a leaks-when report (in graphical form) is shown in Fig. 4. The report states that the aid-requesting country would get to know that one or more ships (left branch) can reach a specific port (right branch) before the deadline (branch in the middle). The rest of the report specifies how the disclosed elements are computed from the inputs (in the dashed rectangles). The report is generated by extracting all runs of the workflow and applying dataflow analysis techniques to each run in order to infer all relevant data dependencies.

Sensitivity Analysis
The sensitivity of a function is the expected maximum change in the output, given a change in the input of the function. Sensitivity is the basis for calibrating the amount of noise to be added to prevent leakages on statistical database queries using a differential privacy mechanism [5]. Differential privacy ensures that it is difficult for an attacker, who observes the query output, to distinguish between two input databases that are sufficiently "close" to each other, e.g. differ in one row.
Pleak tells the user how to sample noise to achieve differential privacy, and how this affects the correctness of the output. Pleak provides two methods, global and local, to quantify the sensitivity of a task in a SQL workflow or of an entire SQL workflow. These methods can be applied to queries that output aggregations (e.g. count, sum, min, max).
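To make the link between sensitivity and noise concrete, the following Python sketch shows the textbook Laplace mechanism, in which the noise scale is the sensitivity divided by ε. This is a standard construction used here for illustration only; it is not claimed to be the mechanism Pleak implements, and the function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release a query answer with noise calibrated to sensitivity / epsilon."""
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


# Example: a COUNT query with sensitivity 1, released with epsilon = 1.
rng = random.Random(42)
print(laplace_mechanism(53, sensitivity=1.0, epsilon=1.0, rng=rng))
```

A larger sensitivity or a smaller ε both increase the noise scale, which is why a very large or infinite global sensitivity makes the released output unusable.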
Global sensitivity analysis [4] takes as input a database schema and a query, and computes the theoretical bounds for sensitivity, which are suitable for any instance of the database. Sensitivity shows how the output changes if we add (remove) a row to (from) some input table. To launch the analysis, the user clicks the "Analyse Sensitivities" button, receiving a matrix that shows the sensitivity w.r.t. each input table separately. It supports only COUNT queries.
Sometimes, the global sensitivity may be very large or even infinite. Local sensitivity analysis is an alternative approach, which requires as input not only a schema and a query, but also a particular instance of the underlying database, and it tells how the output changes when the given input changes. Using the database instance reduces the amount of noise needed to ensure differential privacy w.r.t. the number of rows. Moreover, it supports COUNT, SUM, MIN, MAX aggregations, and allows capturing more interesting distances between input tables, such as a change in a particular attribute of some row. In Pleak, we have investigated a particular type of local sensitivity, called derivative sensitivity [3], which is primarily adapted to continuous functions and is closely related to the derivative of a function. Pleak uses derivative sensitivity to quantify the required amount of noise as described in [3].
Let us look at some examples of derivative sensitivity analysis. Since differential privacy works with real-valued outputs, we cannot apply the analysis directly to the model of Fig. 1. We compute some related queries instead.
An example of derivative sensitivity analysis output with a COUNT query is shown in Fig. 5. The query counts the number of ships that are able to arrive at the available port before the deadline. The actual database instance contains 53 ships. The user wants to enforce differential privacy w.r.t. unit change in ship location (latitude and longitude), assuming that all ships (all rows in the table Ship) are sensitive. This might correspond to the case where the user is the owner of the Ship table, and the attacker is any other party that might see the output. The analysis result tells that the derivative sensitivity w.r.t. the Ship table is 0.0625, and that a differential privacy level of ε = 1 can be achieved using smoothness parameter β = 0.1. To this end, we would have to add an amount of noise such that the relative error of the output is 1.28%. More precisely, if the correct output is y, the noised answer will be between 0.9872y and 1.0128y with probability 80%.
A related SUM-query would be e.g. one that estimates the total amount of cargo that all arriving ships bring altogether. An example of a SUM query is shown in Fig. 6. The table norm and analysis settings are the same as in the COUNT query ( Fig. 5) and are omitted from the figure. The sensitivity is larger, since some ships have more than 1 unit of cargo and hence affect the output more, but the output itself is larger as well and in turn reduces the relative error.
Instead of counting the number of ships that reach the port before the deadline, we may be interested in the time when the first of them reaches the port. The corresponding example of a MIN query is shown in Fig. 7. The table norm and analysis settings are the same as before. We see that the error is quite large for a MIN query, and it is now 111%. While sensitivity itself is 0.05, which is quite small, the reason why error is large is that the output itself is small. Differently from COUNT or SUM queries, the output does not increase with the number of table rows, and it is more difficult to achieve differential privacy.
It may be interesting to analyse a related query that computes the time when the last ship reaches the port. The corresponding example of a MAX query is shown in Fig. 8. The table norm and analysis settings are the same as before. We see that the error is much smaller than for a MIN query, and it is 4.75%. This is because the output itself is large, so we would in general have a smaller relative error.
A tutorial on the sensitivity analyzer can be found at https://pleak.io/wiki/sql-derivative-sensitivity-analyser.

Attacker's Guessing Advantage
While function sensitivity as defined in Sec. 4 can be used directly to compute the noise required to achieve ε-differential privacy, it is in general not clear which ε is good enough, since its "goodness" depends on the particular data and the query [5]. We therefore want to use a more standard security measure, such as the attacker's guessing advantage. Formally, it is defined as the difference between the posterior (after observing the output) and prior (before observing the output) probabilities of the attacker guessing the input. This tells the user how much the attacker is able to infer about the input after observing the output, in addition to what the attacker already knew before (if anything). Internally, Pleak still performs query function sensitivity analysis, but presents the analysis result in terms of guessing advantage, as described in [3].
The guessing advantage analysis of PLEAK takes as input the desired upper bound on the attacker's advantage, which ranges between 0% and 100%. The user specifies a particular subset of attributes that the attacker is trying to guess for some data table record, within a given precision range. To characterize the attacker more precisely, the user defines the prior knowledge of the attacker, which is currently expressed as an upper and a lower bound on an attribute. The analyser internally converts these values to a suitable ε for differential privacy, and computes the noise required to achieve the bound on the attacker's advantage. Fig. 9 shows an example of a guessing advantage analysis result. We consider the same COUNT query that has undergone sensitivity analysis in Fig. 5, which counts the total number of ships arriving before the given deadline. Here, the attacker already knows that the longitude and latitude of a ship are in the range [0..300] while the speed is in the range [20..90]. By default, the attacker does not know anything else besides the bounds, and the prior distribution is assumed to be uniform in the range. The attacker's goal is to learn the location of any ship with a precision of 5 units of its actual latitude and longitude. The analysis result says that, if we want to bound the guessing advantage by 30% using a noise addition mechanism, the relative error of the output will be 13.57%.
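The definition of guessing advantage as posterior minus prior can be illustrated for a single attribute. The Python sketch below is a one-dimensional simplification under the stated uniform prior; the function names are hypothetical and the sketch does not reproduce how Pleak converts the bound into ε.

```python
def uniform_prior(lower, upper, radius):
    """Best prior probability of guessing within `radius` of a value that is
    uniformly distributed on [lower, upper] (one attribute only)."""
    return min(1.0, 2.0 * radius / (upper - lower))


def guessing_advantage(prior, posterior):
    """Advantage = posterior success probability minus prior success probability."""
    return posterior - prior


def max_posterior(prior, advantage_bound):
    """Largest posterior probability permitted by a bound on the advantage."""
    return min(1.0, prior + advantage_bound)


# One coordinate in [0..300], guessed with a precision of 5 units.
prior = uniform_prior(0.0, 300.0, 5.0)
print(prior)                        # 10 / 300
print(max_posterior(prior, 0.30))   # prior + 0.30
```

With these values the attacker's prior chance of guessing one coordinate is 1/30, so a 30% bound on the advantage caps the posterior at 1/30 + 0.30.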
A tutorial on the guessing advantage analyzer can be found at https://pleak.io/wiki/sql-guessing-advantage-analyser.

The models created under the demo account are periodically deleted. In addition, PLEAK offers a public view through the links included below. The public links are enough to see example models with their metadata and run PLEAK's analyzers. The account is necessary to create or modify the models.
The following description provides a walkthrough of capabilities of PLEAK using a unified scenario. The focus is on explaining the models and running the analysis and it is expected that the reader follows the writing using the demo account or the public links to the models. Our live demo would follow a similar pattern, but would allow for more interaction with the models, especially modifying the model data. Parts of the expected demonstration can be seen in the demo video in https://www.youtube.com/watch?v=pQDYn1Q-BQM.

A.1 Introduction to PLEAK
The front page of pleak.io allows a user to log in and access their models using the files menu. Clicking on the model name in the file menu opens the editor used to create the BPMN model. Other actions can be accessed using the button on the right hand side of the model row. Choosing the Shared models tab also shows the models that are not owned by the user, but where others have granted either view or edit rights to the user. All models considered in this description are available for the demo account under the Shared models tab with view rights. The user can copy the shared models so that they appear in the My models view and become modifiable. PLEAK also allows users to publish models so that the analysis tools are accessible without a user account.
All of the following revolves around a running scenario (e.g. see https://pleak.io/app/#/view/Zta5dILQC6DozqcqQB4E) that involves cargo ships and a nation with ports for the ships to dock at. The ship needs to find suitable berths available before its deadline. The data object reachable ports contains ports that can be reached within the deadline. Feasible ports narrows this down to ports that the ship can actually fit into. The final output of the process gives the actual port and berth slot assignment for each ship. The goal is to hide the ship location and the exact details of the ports where the ship cannot dock.
The example models folder in the demonstration account has other processes that can be analyzed using our tools (model is intended to be used with the analyzer specified by the folder name). The process of using the tools is similar to the description given for the running example, but the concrete scenarios, the computations involved in the process, and therefore the analysis outputs can differ significantly. In addition, the wiki page in pleak.io gives further information about the usage and details of our tools.

A.2 PE-BPMN
Consider the Ship Allocation model using the PE-BPMN editor (https://pleak.io/app/#/view/Zta5dILQC6DozqcqQB4E). This is one possible process for agreeing on the slot assignment using secure multiparty computation (MPC). MPC methods allow participants to collaboratively compute on their data while only revealing the computation output. Privacy-Enhanced BPMN is a BPMN extension that captures the use of privacy enhancing technologies in the model. It adds notations to specify the technology and its concrete operation within a classical BPMN model, for example the blue MPC markers in the example.
Clicking on tasks opens the stereotype menu when the user has edit rights. This can be tried by copying the demo model so that it appears under the My models tab in the demo account; an example of the edit view of the PE-BPMN editor is given in Fig. 10. This menu is organized based on privacy goals such as data protection or processing. For example, MPC can be found under Data processing/Privacy preserving. Choosing a stereotype like secure multiparty computation opens a stereotype-specific panel on the right, where the required parameters can be added. Tasks with the MPC stereotype are grouped based on which tasks correspond to a joint computation. The editor highlights the selected model element and other related elements, for example, other group members for MPC. In the given example, tasks with the same name in separate pools are considered to correspond to the same joint computation; hence, clicking on a task in one pool highlights the corresponding task in the other pool.
The correctness of PE-BPMN models can be checked using the Validate button. The result of the validation appears on the right hand side of the screen. For a valid model, like the demo model, we get two analysis options -simple disclosure and data dependency. The simple disclosure report visualizes which participants have access to which data in the process. In addition, it distinguishes between data objects that are visible or hidden. For example, the Nation sees its inputs such as port, berth and slot and also learns the intermediate values, namely, feasible ports and the output assignment. However, the Nation does not have direct access to the inputs of the ship manager nor the reachable ports computed for the ship manager. A data dependency matrix describes the interdependence of data objects. However, other analyzers offer more tools to go into the details of the dependencies. In the basic use of PLEAK, the analyst first finds potential leakages (data marked with V in the disclosure report) and then uses the data dependency matrix to check if any visible data depends on any private data. If it does then the leaks-when or sensitivity analysis can be used to further study this dependency.
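The basic workflow described above, first reading the disclosure report and then consulting the dependency matrix, can be sketched in Python as follows. The dictionaries are hypothetical stand-ins for the two reports, and the dependency entries are illustrative assumptions rather than the actual output for the demo model.

```python
def suspect_dependencies(visible, depends_on, private_inputs):
    """List (stakeholder, visible object, private input) triples where a
    visible object depends on a private input the stakeholder cannot see."""
    suspects = []
    for stakeholder, seen in sorted(visible.items()):
        for obj in sorted(seen):
            for source in sorted(depends_on.get(obj, set())):
                if source in private_inputs and source not in seen:
                    suspects.append((stakeholder, obj, source))
    return suspects


# Disclosure report: objects marked "V" for each stakeholder (assumed values).
visible = {"Nation": {"port", "berth", "slot", "feasible ports", "assignment"}}
# Data dependency matrix as a mapping from object to its sources (assumed values).
depends_on = {
    "feasible ports": {"port", "ship"},
    "assignment": {"port", "berth", "slot", "ship"},
}
private_inputs = {"ship"}

for triple in suspect_dependencies(visible, depends_on, private_inputs):
    print(triple)
```

Each reported triple is a candidate for a follow-up leaks-when or sensitivity analysis.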
Validation produces an error list and does not allow analysis in case there are any problems in the model. For example, https://pleak.io/app/#/view/NyWvwmKjUedE10nNyY6u is an invalid model where clicking the validation button shows an error. Clicking on the error helps to locate the model elements that cause the error. In this case the second part of the feasible ports task is missing the MPC stereotype, so the error draws attention to the fact that the feasible ports task in the Nation requires another group member. The distributed nature of MPC tasks requires that there are at least two tasks in a group; hence, the removal of one stereotype causes the remaining MPC task to give an error.
Other demo account models consider different privacy enhancing technologies than MPC. Our approach to various PETs, including the concrete stereotypes and types of validation, is documented in the PLEAK wiki.

A.3 Leaks-When
Open https://pleak.io/app/#/view/lsQufWrKxjbdGtpJErHl using the SQL editor to consider an example of the leaks-when analysis. The editor can be changed to SQL editor using the Change Analyzer button in PLEAK.
Leaks-when analysis takes a SQL workflow or SQL collaborative workflow as an input. A SQL workflow is a BPMN model where each task corresponds to a SQL script that transforms input database tables into temporary tables. The editor allows the user to view and edit these scripts. For example, clicking on the task Select reachable ports reveals a SQL script that takes the tables port, ship and parameters as inputs and produces reachable ports. Data objects are defined analogously: for the table port one would enter a SQL CREATE TABLE statement. The data object parameters is a special table that we use to define the names and data types of input parameters for the overall computation. In this scenario, we assume that the SQL workflow is executed for one ship at a time, such that the parameters are the ship's name and desired deadline.
PLEAK's leaks-when analyzer processes the PostgreSQL dialect of SQL. For example, task Select feasible ports has two stored procedures, one of which computes the distance over the earth's surface given the coordinates of two objects.
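The stored procedure itself is not reproduced here. As an illustration of what a distance over the earth's surface typically involves, the Python sketch below uses the haversine formula; this choice of formula, the function name and the mean Earth radius constant are assumptions and may differ from the procedure in the demo model.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def great_circle_distance(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the equator: from (0, 0) to (0, 90).
print(great_circle_distance(0.0, 0.0, 0.0, 90.0))
```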
To analyze a fully annotated model, the analyst has to select one or more output data objects by clicking on them (selected data is green) and then start the analysis by clicking the button Leaks-when report. For example, select feasible ports and reachable ports in the demo model. At the beginning of the analysis, PLEAK collects SQL scripts for each of the runs of the BPMN model and sends them to the backend for analysis.
The analysis output is a leaks-when report for each attribute. The right hand panel lists the chosen data objects, and expanding the data object view shows the number of leakage graphs corresponding to this data object. Each data object has one graph for each column in its output. For example, the reachable ports data object has two columns, which correspond to the port and deadline computed in the script. The reachable ports(1) is the port column. The final node in a leaks-when graph is a filter where the first input shows what leaks and the second input shows under which conditions the leakage occurs. In the reachable ports(1) case, the leaks-when report shows that the port id is disclosed if the ship can reach the port by a given deadline. The graph describes the deadline computation: it is computed from the ship's speed and distance from the port as determined from its coordinates. The second column reachable ports(0) corresponds to deadline in the reachable ports table. Looking at the corresponding leaks-when report shows that it leaks the arrival time under the same conditions as for the port. However, in the leaks branch of the graph we now also have the deadline computation.
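The filter structure of this report, a leaked value guarded by a condition, can be mimicked in a few lines of Python. This is a sketch under simplifying assumptions: the column names are hypothetical and the distance is planar rather than measured over the earth's surface.

```python
import math


def arrival_time(ship, port):
    """Travel time from the ship's position to the port (planar distance / speed)."""
    distance = math.hypot(port["latitude"] - ship["latitude"],
                          port["longitude"] - ship["longitude"])
    return distance / ship["speed"]


def reachable_ports(ship, ports, deadline):
    """What leaks: the port id. When: the ship can reach the port by the deadline."""
    return [p["port_id"] for p in ports if arrival_time(ship, p) <= deadline]


ship = {"latitude": 0.0, "longitude": 0.0, "speed": 10.0}
ports = [
    {"port_id": "A", "latitude": 30.0, "longitude": 40.0},    # distance 50
    {"port_id": "B", "latitude": 300.0, "longitude": 400.0},  # distance 500
]
print(reachable_ports(ship, ports, deadline=10.0))
```

An observer of the output learns a port id only when the condition holds, and the condition itself depends on the ship's private speed and coordinates.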
The feasible ports table is computed from the reachable ports data and the leaks-when report reflects this. We can see that the deadline condition is still present for the leaks-when report of feasible ports. In addition, there are new conditions specific to this SQL query to stress that the ship's draft has to be less than the harbor depth and that the port must be able to offload the cargo.
The leaks-when analysis can be extended to collaborative workflows. An example of our ships workflow as a collaboration can be seen in https://pleak.io/app/#/view/wJuteo5sJAa_sf4cJ5oY.

A.4 Sensitivity
Sensitivity tells us how much information the output reveals about adding or modifying a row in the input table. Knowing the workflow's sensitivity allows us to make it differentially private (DP). PLEAK considers two flavours of sensitivity: global and local. Global sensitivity computes bounds based on the data structures whereas local sensitivity depends on the actual data.
Global. Open https://pleak.io/app/#/view/lh2NY01e2brJb6hspcFN. Click the button Change Analyzer and select SQL analyzer. It is the same editor as used for leaks-when SQL analysis, and it requires similar SQL data object and task descriptions. Global sensitivity quantifies the magnitude of the noise that should be added to the output to make it differentially private. It can be computed based on the table schemas and the SQL workflow, similar to the leaks-when analysis.
Since global sensitivity is reasonable only for COUNT-queries, we count the number of ports for which the time that it takes to reach the port is below a certain threshold. Note that the main query does not contain the keyword COUNT, since the analyzer itself counts the rows in the output table.
The global sensitivity analysis starts by clicking the blue button Analyze Sensitivities. The sensitivity matrix depicts the sensitivity of tasks in columns with respect to the input tables making up the rows. In this example, the sensitivity w.r.t. all tables except ship is ∞. The sensitivity w.r.t. ship is 1 since adding a ship may increase the total number of counted ships by 1. If we remove the keyword DISTINCT from the query, the sensitivity becomes ∞, since now the same ship can be potentially assigned to an unbounded number of berths.
Local. Similarly to the SQL editor, this editor allows the user to define SQL statements for all tasks. In addition, it requires the user to insert actual input data tables into the data objects. Data can be viewed and added by clicking on the data objects. For example, select the ship data object and consider the definition of sensitive rows in Table norm. We see that all rows are considered sensitive (line rows: all), there are no sensitive columns (line cols: none), but the number of rows itself is sensitive, and the cost of adding/removing one row is 1.0 (line G: 1.0).
Analysis is started by clicking the button Analyze. First, set the parameters to define the desired privacy level. The variable ε comes directly from the definition of differential privacy. Simply put, a smaller epsilon means more privacy. The variable β is a parameter that can be optimized. In general, it gives less noise if it is smaller but if it is too small, then achieving differential privacy may be impossible. The parameters can be left to their default values.
The analysis is executed by clicking on the green button Run analysis. Similarly to global sensitivity, local sensitivity is computed with respect to each input table. In our case, the only sensitive table is ship, and the sensitivity w.r.t. it is 2. Indeed, since there is no keyword DISTINCT, a ship can be assigned to two possible berths of the port "alma" (assignments of berths to ports can be viewed by clicking the data object berth), so the count may change by 2. Recall that it would be ∞ in the case of global sensitivity. Relative error shows how much noise we have to tolerate to achieve differential privacy for ship table.
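The effect of DISTINCT on the count can be reproduced with a brute-force check in Python. This sketch is not Pleak's analysis: it simply removes each ship in turn from a tiny hypothetical instance and records the largest change in the count. The berth identifiers and the assumption that every ship matches both berths of port "alma" are illustrative.

```python
def count_assignments(ships, berths, distinct=False):
    """Count (ship, berth) rows, or distinct ships when `distinct` is set."""
    rows = [(s, b["berth_id"]) for s in ships for b in berths if b["port"] == "alma"]
    if distinct:
        return len({ship for ship, _ in rows})
    return len(rows)


def removal_sensitivity(ships, berths, distinct=False):
    """Largest change in the count when any single ship row is removed."""
    base = count_assignments(ships, berths, distinct)
    return max(
        abs(base - count_assignments(ships[:i] + ships[i + 1:], berths, distinct))
        for i in range(len(ships))
    )


ships = ["gamma", "farmi"]
berths = [
    {"berth_id": 1, "port": "alma"},
    {"berth_id": 2, "port": "alma"},
]

print(removal_sensitivity(ships, berths))                 # without DISTINCT
print(removal_sensitivity(ships, berths, distinct=True))  # with DISTINCT
```

Without DISTINCT each ship contributes one row per berth, so removing a ship changes the count by the number of berths in this instance; with DISTINCT the change is at most 1.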
Derivative. Open https://pleak.io/app/#/view/lQSSx15uY13H9S4EhcXA in Combined sensitivity analyzer using Change Analyzer button. This is the same model as before but the table norms are defined differently. In component based sensitivity, the user may choose which rows and columns are sensitive.
Click the port data object and consider the definition of sensitive rows in Table norm. Here we assume that the columns offloadcapacity, offloadtime, and harbordepth are sensitive in all rows. It is possible to define more sophisticated sensitive components. In the table ship, only the rows indexed 3 and 7 (ships "gamma" and "farmi") are considered sensitive, as can be seen in the first row of the Table norm. We combine latitude and longitude to define the Euclidean distance (i.e. the 2-norm) from the port with the line u = lp 2.0 latitude longitude;. We may assign different privacy weights to different columns, e.g. 0.2 in v1 = scaleNorm 0.2 u; means that we conceal changes in location up to 1/0.2 = 5 units, so the location is more private than the length. Then, z = lp 1.0 v1 v2; means that the distance between two rows is the sum of the distances between the locations and between the lengths (i.e. the 1-norm). Finally, return linf z; shows how the distance between the tables is computed from the distances between their rows: linf means that we take the maximum row distance (i.e. the ∞-norm), so DP conceals the change even if all sensitive rows change by a unit. In general, DP requires much more noise to hide the changes in all rows simultaneously, but in our case only the rows 3 and 7 are sensitive, so it is fine.
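The table norm described above can be read as a distance function between two versions of the ship table. The Python sketch below mirrors the norm line by line; it assumes that v2 is the unscaled absolute change in the length column (the definition of v2 is not shown in the text) and that tables are lists of dictionaries, which is purely illustrative.

```python
import math


def row_distance(row_a, row_b):
    """Distance between two versions of one ship row under the table norm."""
    # u = lp 2.0 latitude longitude;  (Euclidean change in location)
    u = math.hypot(row_a["latitude"] - row_b["latitude"],
                   row_a["longitude"] - row_b["longitude"])
    # v1 = scaleNorm 0.2 u;  (a 5-unit move counts as one unit of change)
    v1 = 0.2 * u
    # v2: assumed to be the unscaled change in length
    v2 = abs(row_a["length"] - row_b["length"])
    # z = lp 1.0 v1 v2;  (1-norm: sum of the two components)
    return v1 + v2


def table_distance(table_a, table_b, sensitive_rows):
    """return linf z;  (maximum row distance over the sensitive rows)"""
    return max(row_distance(table_a[i], table_b[i]) for i in sensitive_rows)


original = [{"latitude": 0.0, "longitude": 0.0, "length": 100.0} for _ in range(8)]
changed = [dict(row) for row in original]
changed[3]["latitude"], changed[3]["longitude"] = 3.0, 4.0  # moved by 5 units
changed[7]["length"] = 101.0                                # length changed by 1

print(table_distance(original, changed, sensitive_rows=[3, 7]))
```

Both changes have row distance 1, and because of the maximum the two tables are at distance 1 from each other, which is the unit change that DP is asked to conceal.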
Analysis is started by clicking the button Analyze. The button Attacker settings allows the user to define known bounds on table attributes. In the example model, it says that the ship maximum speed ranges from 20 to 90 units. Without the lower bound on ship speed, the arrival time approaches ∞ as speed approaches 0, which makes it impossible to define a β-smooth lower bound for a finite β.
The analysis is executed by clicking on the green button Run analysis. The sensitivity is computed with respect to each input table. The sensitivity w.r.t. table port is very large. Indeed, if the port attributes change, it may happen that no ship will fit there anymore. Sensitivity w.r.t. the table ship is 4, where 2 comes from possible changes in the 3rd row, and 2 from possible changes in the 6th row. The row sensitivity 2 comes from the fact that modifying the length by 1 unit, or the location by 5 units, may cause filtering failure, and since there are 2 available berths, we lose 2 rows from the count.

A.5 Guessing Advantage
Open https://pleak.io/app/#/view/P4RRkJV-DsBttt5NnapS. Click the button Change Analyzer and select Guessing advantage analyzer. Here, each table has a schema and data, but no norm. Clicking Analyze opens a slider, ranging from 0% to 100%, to set the upper bound on attacker's guessing advantage. There are now two extra buttons to define bounds for used attributes: Attacker settings defines prior knowledge of the attacker by setting preknown bounds on attributes, defined either as exact, range a b, or total a (the latter is used only for discrete data).
Sensitive attributes defines a set of sensitive components, which the attacker is trying to guess. The definition starts from a keyword leak and for each attribute, the guess can either be exact (discrete attributes), or approx r (approximated by r > 0 units). The list of attributes is followed by the keyword cost and a number that defines the cost of leaking that attribute.
Reducing the advantage slider to 0% gives the error ∞, as it is impossible to achieve perfect privacy with bounded noise. Increasing it to 100% gives the error 0, since the attacker is allowed to guess everything. Reducing the allowed guessing radius under Sensitive attributes, or the known radius under Attacker settings (click Save after making any changes) makes the guess more difficult. Clicking View more, we see that both prior and posterior probabilities decrease. However, since the noise level depends on the advantage, which is the difference between these two probabilities, the error does not necessarily decrease.