1 Introduction

The public sector poses various interesting tasks, ranging from predicting energy demand to simulating infection spread in hindsight. Predicting future behaviour and inferring the most likely past behaviour are two instances of the general task of probabilistic inference, which can be reduced to computing marginal probability distributions in probabilistic models. Answering such tasks relies on an accurate model of the influences between the components, individuals, and stakeholders in the specific setting of a task. As such, these models have to represent relational domains, in which individuals are characterised through attributes and relations among them under uncertainty.

First-order frameworks like description logic naturally lend themselves to modelling relational domains, grouping individuals that behave (almost) indistinguishably without further evidence. To additionally handle uncertainty, formalisms such as parametric factor (parfactor) graphs [39], Markov logic networks (MLNs) [40], or ProbLog [9] have been developed. So-called lifted inference algorithms (e.g., [5, 18, 21, 22, 39, 50, 51]) exploit any first-order structure for efficient inference by using representatives for groups during calculations, enabling inference that is tractable in the sizes of those groups [36].

While efficient probabilistic inference, exact or approximate, is already a challenging research problem, applications often come with further external requirements such as resource limitations [33] or demands for privacy of sensitive information. Applications in the public sector come with a heightened need for privacy regarding the influences modelled and the information about individuals encoded in such models due to the very nature of the public sector. This is especially true if these models are learned from real-world data, and the public sector often comes with a host of data, collected through periodic censuses or through technical devices such as smart meters. Of course, privacy is not only in the purview of the public sector. Whenever customer data is involved, from online purchases in web shops to health data collected by fitness trackers, privacy becomes increasingly important when processing this data.

For publishing collected data in a privacy-preserving manner, there exist various methods such as k-anonymity [45], \(\ell\)-diversity [31], t-closeness [29], or differential privacy [14] to anonymise data, each coming with strengths and weaknesses. The notion of k-anonymity, for example, provides anonymity by requiring at least \(k-1\) other data points for each data point with the same values of identifying attributes (quasi-identifying properties); efficient algorithms exist for checking this property [2]. Unfortunately, for k-anonymity, one has to specify the identifying attributes for which k-anonymity is then ensured. These attributes may be impossible to determine beforehand, which enables several attacks on data sets published with k-anonymity [8].

When looking at privacy-preserving probabilistic inference, especially in relational models, research is scarce. Relational domains present a unique challenge, since a corresponding relational model contains, by its very definition, individuals. However, given that these models represent groups of indistinguishable individuals, they provide a rather obvious connection to k-anonymity. Therefore, we investigate privacy-preserving probabilistic inference in relational models by transferring the notion of k-anonymity to probabilistic relational models, presenting the analogous notion of s-symmetry, with the intuitive idea that groups must be large enough, i.e., must be of cardinality \(\ge s\). As we will see, the weakness applying to k-anonymity does not apply to s-symmetry in fully relational models, as all attributes talk about indistinguishable individuals in groups or about the modelled scenario as a whole. Next to s-symmetry as a property for models to fulfil, we need to adapt the query language as well as evidence handling to ensure that solving an inference task does not lead to violating s-symmetry. The problem lies in evidence or query terms referencing individuals explicitly, which can ground a first-order model, leading to groups that might be smaller than s, and thus privacy concerns arise again. We combine all these insights into PAULI, to the best of our knowledge, the first lifted privacy-preserving algorithm for probabilistic inference in probabilistic relational domains. Specifically, the contributions are:

  (i) s-symmetry as a property for privacy-preserving probabilistic relational models,

  (ii) Changes to the query language for privacy aspects, and

  (iii) An adaptation of evidence entering to keep models privacy-preserving using privacy-preserving clustering,

yielding PAULI. While the main discussion focuses on episodic models and inference, we also provide an extension to the temporal setting, TemPAULI, which allows for bounding the approximation error introduced by clustering evidence.

In the following, we define parfactor graphs as a representative modelling formalism and recap lifted inference algorithms. Then, we present our contributions, including s-symmetry, an adapted query language, and privacy-preserving evidence handling, culminating in PAULI. Last, we discuss related work and conclude.

2 Preliminaries

We briefly present parfactor models [6] and give an overview of lifted inference algorithms.

2.1 Parfactor Models

Parfactor models combine first-order logic with probabilistic models, using logical variables as parameters in random variables to represent sets of indistinguishable random variables, forming parameterised random variables (PRVs). The logical variables are grounded, i.e., replaced by constants, to get from a first-order model to a propositional one. In the privacy setting, these constants refer to (anonymous) individuals with associated information that should be kept private. As such, we refer to constants and individuals interchangeably. As a running example throughout this article, we set up a model for influences between income, residential districts, schools, and households with children, which could be learned from census data, with one logical variable representing people and one representing districts. Throughout the paper, we use bold-faced letters \(\varvec{V}\) to denote sets and calligraphic letters \(\mathcal {V}\) to denote sequences. In abuse of notation, we use set operations on sequences, upholding the order of the sequence. We further assume familiarity with standard relational algebra operators such as \(\pi\) for projection.

Definition 1

Let \(\textbf{R}\) be a set of random variable names, \(\textbf{L}\) a set of logical variable names, \(\Phi\) a set of factor names, and \(\textbf{D}\) a set of constants. All sets are finite. Each logical variable L has a domain \(dom(L) \subseteq \textbf{D}\). A constraint is a tuple \((\mathcal {X}, C_{\mathcal {X}})\) of a logical variable sequence \(\mathcal {X} = (X_1, \dots , X_n)\) and a set \(C_{\mathcal {X}} \subseteq \times _{i = 1}^n dom(X_i)\). The symbol \(\top\) for C marks that no restrictions apply, i.e., \(C_{\mathcal {X}} = \times _{i = 1}^n dom(X_i)\). A PRV \(R(L_1, \dots , L_n), n \ge 0\), is a syntactical construct of a random variable \(R \in \textbf{R}\) possibly combined with logical variables \(L_1, \dots , L_n \in \textbf{L}\). If \(n = 0\), the PRV forms a propositional random variable. A PRV A (or logical variable L) under constraint C is given by \(A_{|C}\) (\(L_{|C}\)). We may omit \(|\top\) in \(A_{|\top }\) (\(L_{|\top }\)). If \(|C_\mathcal {X}| = 1\), \(A_{|C}\) forms a propositional random variable as well. The term ran(A) denotes the possible values (range) A can take. For a set or sequence of PRVs \(A_1,\dots , A_n\), the range is defined as the cross product of the individual ranges, i.e., \(ran(A_1,\dots , A_n) = \times _{i=1}^n ran(A_i)\). An event \(A = a\) denotes the occurrence of a grounded PRV A with value \(a \in ran(A)\).

Example 1

Consider \(\textbf{R} = \{I, R, S, H\}\) for income, residence, school situation, and household with children, respectively, and \(\textbf{L} = \{X, D\}\) with \(dom(X) = \{x_1, x_2, x_3\}\) (people) and \(dom(D) = \{d_1, d_2\}\) (districts), combined into PRVs I(X), R(X, D), S(D), and H(X). PRVs R(X, D) and H(X) have Boolean ranges, while the other two have three possible values, i.e., \(ran(I(X))=\{low,middle,high\}\) and \(ran(S(D))=\{good,neutral,bad\}\).
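To make these definitions concrete, the following Python sketch (with hypothetical class and variable names, not part of the formalism) encodes the logical variables and PRVs of Ex. 1 and enumerates their groundings.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class LogVar:
    name: str
    domain: tuple  # constants, i.e., (anonymous) individuals

@dataclass(frozen=True)
class PRV:
    name: str
    logvars: tuple   # sequence of LogVar
    range_: tuple    # possible values ran(A)

    def groundings(self, constraint=None):
        """Ground instances w.r.t. a constraint; None stands for the top constraint."""
        tuples = constraint if constraint is not None else \
            list(product(*(lv.domain for lv in self.logvars)))
        return [(self.name,) + t for t in tuples]

X = LogVar("X", ("x1", "x2", "x3"))   # people
D = LogVar("D", ("d1", "d2"))         # districts
I = PRV("I", (X,), ("low", "middle", "high"))
R = PRV("R", (X, D), (True, False))

assert len(I.groundings()) == 3   # I(x1), I(x2), I(x3)
assert len(R.groundings()) == 6   # 3 people x 2 districts
```

A constraint is then simply a subset of these tuples, and a singleton constraint collapses a PRV to a propositional random variable, as stated in Def. 1.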

A parfactor describes a function, mapping PRV values to real values (potentials).

Definition 2

We denote a parfactor g by \(\phi (\mathcal {A})_{| C}\) with \(\mathcal {A} = (A_1, \dots , A_n)\) a sequence of PRVs, \(\phi : \times _{i = 1}^n ran(A_i)\mapsto \mathbb {R}^+\) a function with name \(\phi \in \Phi\), and C a constraint on the logical variables of \(\mathcal {A}\). We may omit \(|\top\) in \(\phi (\mathcal {A})_{| \top }\). A set of parfactors \(\{g_i\}_{i=1}^n\) forms a model G. The term rv(Y) refers to the PRVs in input Y (e.g., a parfactor or a model). The term lv(Y) refers to the logical variables in Y (a PRV, a parfactor, or sets thereof). The term \(gr(Y_{| C})\) denotes the set of all groundings of Y w.r.t. constraint C. The semantics of G is given by grounding and building a full joint distribution, i.e., \(P_G = \frac{1}{Z} \prod _{f \in gr(G)} f\) with Z for normalisation.

Example 2

Figure 1 shows a graphical representation of a model \(G_{ex} = \{g_i\}^1_{i=0}\), with parfactors

$$\begin{aligned} g_0&=\phi ^0(I(X), R(X,D))_{| \top }\\ g_1&= \phi ^1(R(X,D), S(D), H(X))_{| \top } \end{aligned}$$

(input–output pairs omitted). PRVs are depicted as ellipses and parfactors as layered boxes, with an edge from a PRV to a parfactor if the PRV occurs as an argument of the parfactor.

Constraints are \(\top\), i.e., the \(\phi\)’s are defined for all domain values.

Fig. 1: Parfactor graph for \(G_{ex}\)
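To illustrate the grounding semantics of Def. 2, the following sketch computes the unnormalised product over all ground instances of \(g_0\); the potentials are made up for illustration (in practice, they would be learned or elicited).

```python
from itertools import product

dom_X, dom_D = ["x1", "x2", "x3"], ["d1", "d2"]
ran_I, ran_R = ["low", "middle", "high"], [True, False]

def phi0(i, r):  # made-up potentials for g0 = phi^0(I(X), R(X,D))
    return 2.0 if (i == "high") == r else 1.0

# Ground random variables of the model fragment {g0}.
atoms = [("I", x) for x in dom_X] + [("R", x, d) for x in dom_X for d in dom_D]
ranges = [ran_I if a[0] == "I" else ran_R for a in atoms]

def unnormalised(world):  # world: dict mapping ground atom -> value
    p = 1.0
    for x in dom_X:       # one ground factor per tuple in the (top) constraint
        for d in dom_D:
            p *= phi0(world[("I", x)], world[("R", x, d)])
    return p

# P_G assigns each world its potential product divided by Z.
Z = sum(unnormalised(dict(zip(atoms, vals))) for vals in product(*ranges))
```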

For the remainder of this article, we assume a model that is shattered on itself [41], i.e., constraints are either disjoint or fully overlapping. In general, a query asks for a probability distribution of a set of random variables given fixed events as evidence based on the full joint distribution \(P_G\) of a model G.

Definition 3

Given a model G, a set of query terms \(\varvec{Q}\) (ground PRVs), and events \(\varvec{e} = \{E^i=e^i\}_{i=1}^m\), the expression \(P(\varvec{Q} \mid \varvec{e})\) denotes a query w.r.t. \(P_G\).

Example 3

Query \(P(I(x_1) \mid S(d_1)=good)\) for \(G_{ex}\) asks for the probability distribution of the income of \(x_1\) given that the school situation in \(d_1\) is good.

2.2 Lifted Inference

Lifted variable elimination (LVE) [39], which lifts variable elimination [53] to the first-order level and has been further refined since its inception [6, 32, 41, 47], takes a model G of Def. 2 and answers a query of Def. 3. We explain its evidence handling and elimination steps in more detail below.

To handle evidence in a lifted way, evidence parfactors are built by combining those events that refer to the same PRV and have the same range value as an observation: The constraint contains the constants occurring in those events. The potential function maps the observed range value to 1 and the remaining values to 0. Additionally, before any inference calculations can start, G is shattered on the evidence parfactors and the query terms. That is, LVE ensures that the constraints between G, the evidence parfactors, and the constants in the query terms are either disjoint or fully overlapping. To do so, LVE splits a parfactor g with constraint C given a partially overlapping constraint \(C'\) by duplicating g and adjusting the constraints of g and its duplicate: The constraint of g keeps those tuples that only contain constants that do not occur in \(C'\) and the constraint of the duplicate gets the remaining tuples, i.e., those constants also appearing in \(C'\). The partially overlapping constraint \(C'\) can come from an evidence parfactor or is built for each query term Q by collecting the logical variables of the PRV referenced in Q, i.e., lv(rv(Q)), for the sequence of logical variables \(\mathcal {X}\) in \(C'\) and adding a single tuple for the constants occurring in Q to \(C'_\mathcal {X}\). After shattering, there is one group of indistinguishable constants for each observed value of a PRV and a group for the unobserved constants as well as singleton groups for the constants in the query terms.
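A split on a partially overlapping constraint can be sketched as a simple partition of the constraint's tuples; the helper below is hypothetical and for illustration only, separating the tuples that contain evidence constants of one logical variable from the rest.

```python
def split_constraint(tuples, position, evidence_constants):
    """Partition a constraint into the tuples whose logical variable at
    `position` takes an evidence constant and the remaining tuples."""
    overlap = [t for t in tuples if t[position] in evidence_constants]
    rest = [t for t in tuples if t[position] not in evidence_constants]
    return overlap, rest

# Splitting the (top) constraint of g1 over (X, D) on evidence for S(d1):
C = [(x, d) for x in ["x1", "x2", "x3"] for d in ["d1", "d2"]]
with_d1, without_d1 = split_constraint(C, 1, {"d1"})
# with_d1    == [("x1", "d1"), ("x2", "d1"), ("x3", "d1")]
# without_d1 == [("x1", "d2"), ("x2", "d2"), ("x3", "d2")]
```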

Example 4

Let us go back to Ex. 3, in which the query term is \(I(x_1)\) and the evidence is \(S(d_1)=good\). First, the observed value, \(S(d_1)=good\), is split off from the model. Splitting the model on \(d_1\) results in splitting \(g_0\) and \(g_1\) as both of them have \(d_1\) in their constraint for the logical variable D: The parfactors are split into two parts, one part for \(d_1\) and another part for the remaining constants of D, \(d_2\) in our toy example. The resulting parfactors after splitting look like this:

$$\begin{aligned} g_0&=\phi ^0(I(X), R(X,d_2))_{| \top }, \\ g_0'&=\phi ^0(I(X), R(X,d_1))_{| \top },\\ g_1&= \phi ^1(R(X,d_2), S(d_2), H(X))_{| \top }, \\ g_1'&= \phi ^1(R(X,d_1), S(d_1), H(X))_{| \top } \end{aligned}$$

The input–output pairs of the \(\phi\)’s are not affected. With logical variables encoding more than just two constants and the observation of S being good available for more than one constant, the result would be two parfactors like \(g_i\) and \(g_i'\) for \(i\in \{0,1\}\), but with D in \(g_i'\) representing those constants with such an observation and in \(g_i\) representing the remaining constants.

Second, the constant of the query term is split off. The remaining parts talk about \(x_2\) and \(x_3\). Since \(x_1\) occurs in the constraints of all four parfactors, all four need to be split. E.g., for \(g_0'\), the result is:

$$\begin{aligned} g_0'&=\phi ^0(I(X), R(X,d_1))_{| (X,\{x_2,x_3\})},\\ g_0^{*}&=\phi ^0(I(x_1), R(x_1,d_1)) \end{aligned}$$

The other three parfactors are split analogously, with \(d_1\), \(d_2\), and \(x_1\) each appearing in parfactors without other constants of their domains.

After shattering, LVE absorbs the evidence parfactors [47] and then eliminates all remaining PRVs except \(\varvec{Q}\) from G by, in a nutshell, summing out a representative for each PRV under a given constraint and exponentiating the result with the number of instances referenced in the constraint. Refer to, e.g., [47] for details on LVE and its operators ensuring correctness.
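The essence of lifted summing out, namely computing the sum once for a representative and exponentiating with the number of interchangeable instances, can be sketched as follows. This is a simplification of the actual LVE operator, which has additional preconditions ensuring correctness [47].

```python
from itertools import product

def lifted_sum_out(phi, arg_ranges, elim_pos, num_instances):
    """Sum out the argument at elim_pos for one representative and raise the
    result to num_instances, the size of the group the representative stands for."""
    kept = [r for i, r in enumerate(arg_ranges) if i != elim_pos]
    result = {}
    for rest in product(*kept):
        s = 0.0
        for v in arg_ranges[elim_pos]:
            args = list(rest)
            args.insert(elim_pos, v)
            s += phi(tuple(args))
        result[rest] = s ** num_instances  # one representative, |group| instances
    return result
```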

With a new query, LVE restarts with the original model. Therefore, further research concentrates on efficiently solving multiple queries. One particular algorithm that uses LVE as a subroutine is the lifted junction tree algorithm (LJT) [5], which lifts the junction tree algorithm [26] to the first-order level. It constructs a helper structure, which exploits conditional independences between submodels by forming an acyclic graph over clusters of PRVs and exchanging messages between clusters. These messages make the clusters independent from each other, enabling query answering on the smaller clusters. Before messages are sent, evidence is entered into the clusters for absorption. Evidence absorption, messages, and query answers are computed using LVE operators. See [4] for details. Lifted multi-query inference algorithms operating on MLNs and answering queries of Def. 3 reduce the inference problem to a first-order version of weighted model counting, as is the case with first-order knowledge compilation [50] and probabilistic theorem proving [21], also building tree-like helper structures for efficient inference.

3 Privacy-Preserving Probabilistic Relational Models

This section considers the challenges for probabilistic inference in probabilistic relational models from a privacy viewpoint: It points out sources of privacy leakage, derives a property for privacy-preserving probabilistic relational models, and argues why existing lifted inference algorithms do not necessarily uphold this property during their computations.

3.1 Sources of Privacy Leakage

Privacy leakage in probabilistic relational models occurs whenever a part of a model encodes information about a few or, in the worst case, single individuals. Therefore, the model itself can possibly leak information. Further, a query can also reveal information as it consists of grounded query terms and evidence for individual instances, which can pose a threat even if the model itself does not leak information. Consider the query \(P(I(x_1)\mid S(d_1)=good)\) from Ex. 3 with query term \(I(x_1)\) and evidence for \(S(d_1)\), referencing specific individuals, which means that these individuals are no longer indistinguishable from the rest by virtue of appearing in the query, as foreshadowed in Ex. 4. Thus, we discuss the model as well as evidence and query terms as possible sources of privacy leakage.

Model Probabilistic relational models constitute a general modelling framework that does not restrict the nature of the PRVs occurring. PRVs may be combined with no logical variables at all, yielding propositional random variables that may encode features that apply to the general setting but also talk about specific individuals. Even PRVs that are combined with logical variables may have constraints that only reference a single constant for each logical variable, rendering the PRV essentially a propositional random variable talking about a specific individual. Under all these circumstances, privacy leakage can occur.

Evidence Specific observations for an individual distinguish the individual from other individuals, which include those without evidence and those with a different set of observations. If there are further individuals with the same observations, then this set of individuals forms a new set of indistinguishable instances among themselves given the evidence, distinguishable from the set of individuals without or with other evidence. Of course, we can have multiple such groups arising out of different sets of observations, each of which may talk about only a small number of individuals.

On a technical level, conditioning on evidence leads to modifying the model by shattering [41] the model on the evidence, which splits up the constraints of the parfactors according to the groups of observations that occur in the evidence. As shown in Ex. 4, parfactors \(g_0\) and \(g_1\) are split by duplicating the original parfactors and then modifying the constraints such that one constraint contains only \(d_1\) for D and the other contains the rest. We thus have two parfactors that talk about a specific individual, namely \(d_1\), which would then absorb the evidence and, as such, have an influence on the query result that is distinguishable from the other d’s. In our very small running example, splitting off \(d_1\) leads to the constraint for the remaining constants containing only \(d_2\), which also makes \(d_2\) distinguishable. With larger domains, the problem can remain, either with an evidence set that contains too few observations for constants d or with an evidence set that contains so many observations that the number of remaining constants without observations is too small.

Query Terms The query above references constant \(x_1\) in the query term, which refers to a specific individual. As such, the query language itself is not privacy-preserving. On a technical level, the model is shattered on query terms, leading to parfactors with constraints referencing single individuals like \(x_1\) (see Ex. 4). Especially given several sets of observations, repeatedly querying for \(x_1\) would allow for collecting information about \(x_1\) that might lead to revealing its identity.

Privacy Leakage in Weighted First-order Model Counting Even though the descriptions here are given for parfactors, one only needs to look at the formal definition of a weighted first-order model count to see the same problem. Given a knowledge base \(\Gamma\), which consists of weighted first-order logical formulas, query terms \(\varvec{q}\), and evidence \(\varvec{e}\), answering a query \(P(\varvec{q} \mid \varvec{e})\) in \(\Gamma\) reduces to computing a count for \(\Gamma \wedge \varvec{q} \wedge \varvec{e}\) and a count for \(\Gamma \wedge \varvec{e}\), where the individuals occurring in \(\varvec{q}\) and \(\varvec{e}\) lead to a shattering as well [49].

Lessons All these cases can be traced back to constraints referencing few or single individuals, either by specification in the model itself or after shattering on evidence or query terms. On the one hand, we need to make sure that initial models have fitting constraints. Additionally, any propositional random variable in the model may only encode information about the whole system and not individuals, which is something that needs to be ensured when setting up the model together with domain experts. To this end, we define a property called s-symmetry, which formalises our requirements for constraints and draws on k-anonymity for inspiration. Although we present the property for parfactors, the notion is directly applicable to other first-order formalisms. On the other hand, we need to deal with evidence and query terms. Regarding query terms, the general query language needs to be adapted, which is independent of a specific inference algorithm. Evidence handling is more involved and essentially has to make sure that s-symmetry is not violated.

Thus, we first define s-symmetry and adapt the query language as general measures for privacy preservation. We also show under which circumstances current lifted algorithms uphold privacy. Afterwards, with PAULI, we present a way of implementing an inference algorithm upholding s-symmetry for privacy-preserving probabilistic relational models including evidence handling.

3.2 s-Symmetry

This subsection formally defines s-symmetry, which enables privacy preservation by requiring that there are at least \(s-1\) other individuals that exhibit the same behaviour in the full joint distribution of the given model. As such, s-symmetry transfers the idea of k-anonymity to the setting of probabilistic relational models. However, we apply the idea to constraints to ensure that anonymity holds for all possible combinations of PRVs with their respective logical variables occurring in parfactors.

As s-symmetry is inspired by the idea of k-anonymity, let us define k-anonymity, which uses the notion of quasi-identifying property for identifying attributes from the privacy literature, and then define s-symmetry.

Definition 4

A multiset \(\varvec{D}\) is k-anonymous w.r.t. a set of quasi-identifying properties \(\varvec{q}\) if for each element \(m \in \varvec{D}\) there are at least \(k-1\) other elements \(m' \ne m \in \varvec{D}\) such that \(\varvec{q}(m') = \varvec{q}(m)\).

Definition 5

A constraint \((\mathcal {X},C_{\mathcal {X}})\) containing at least s constants for each logical variable \(X \in \mathcal {X}\), i.e.,

$$\begin{aligned} \forall X \in \mathcal {X} : |\pi _{X}(C_{\mathcal {X}})| \ge s, \end{aligned}$$

is referred to as s-sized. Given a model G, in which the parfactor constraints are s-sized, we say the model exhibits s-symmetry or is s-symmetric.

Assuming \(\top\)-constraints, constraints are cross-products of logical variable domains, meaning that each domain needs to contain at least s constants for a model to be s-symmetric under \(\top\)-constraints. In other words, an s-sized constraint means that there are at least s constants for each logical variable in the constraint, which, for cross-product constraints, yields at least \(s^n\) tuples overall given n logical variables appearing in the constraint.
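Checking whether a constraint is s-sized is a direct implementation of Def. 5: project the constraint onto each logical variable position and count distinct constants (a minimal sketch).

```python
def is_s_sized(tuples, num_logvars, s):
    """A constraint is s-sized iff its projection onto every logical
    variable position contains at least s distinct constants."""
    return all(len({t[i] for t in tuples}) >= s for i in range(num_logvars))

# Top constraint over X (3 people) and D (2 districts):
C = [(x, d) for x in ["x1", "x2", "x3"] for d in ["d1", "d2"]]
assert is_s_sized(C, 2, 2)       # |pi_X(C)| = 3 and |pi_D(C)| = 2
assert not is_s_sized(C, 2, 3)   # |pi_D(C)| = 2 < 3
```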

Example 5

Let us look at the connection of k-anonymity and constants in logical variables. Table 1 shows a selection of possible database entries. Applying an expectation maximisation algorithm [11] to learn a probabilistic model on these entries would yield indistinguishable random variables w.r.t. X and D. So by lifting the probabilistic model [30], we would get the model from Fig. 1. In Fig. 2, we see the identical model but with the corresponding constants each PRV represents. Thus, for example, we see that I(X) actually stands for 3 indistinguishable groundings.

In the following, we use this natural correspondence between k-anonymity and lifted models to show how lifted inference can preserve privacy.

Table 1: Illustration for entries in a database (range values are abbreviated by their first letter); the highlighted columns can be used to turn the database into a k-anonymous database
Fig. 2: Parfactor graph for \(G_{ex}\) illustrating the constants each PRV stands for

Theorem 1

Given an s-symmetric model G whose propositional random variables do not refer to individuals, G is privacy-preserving.

Proof

All PRVs in G reference at least s individuals due to G being s-symmetric. There are also no propositional random variables that reference single individuals as per condition in the theorem. As such, there are always \(s-1\) other individuals with an indistinguishable influence on the full joint encoded by G, i.e., for any assignment \(\varvec{v}\) to the remaining random variables in \(P_G\), any two groundings \(R(\textbf{x}), R(\textbf{x}') \in gr(A_{|C})\) of each PRV \(A \in rv(G)\) exhibit

$$\begin{aligned} \forall \varvec{r},\varvec{r}' \in ran(R(\textbf{x}), R(\textbf{x}')) : P_G[\varvec{r},\varvec{r}',\varvec{v}] = P_G[\varvec{r}',\varvec{r},\varvec{v}], \end{aligned}$$

with \(P_G[.]\) referring to a specific entry in the full joint distribution \(P_G\) encoded by G. That is, the probabilities in \(P_G\) are identical when exchanging the range values for any two groundings of A, meaning that all s groundings are exchangeable in \(P_G\). Thus, analogous to k-anonymity, the s-symmetric G is privacy-preserving. \(\square\)

In contrast to k-anonymity, we require s-sized constraints for each PRV, thereby circumventing the weakness of k-anonymity that (quasi-)identifiers have to be determined and may lead to leakage if not considered. Drawing parallels to k-anonymity, besides being able to use the identity function for the quasi-identifying properties, each property q has at least s (respectively k) identical data points; in terms of lifted inference, each logical variable in a constraint of a PRV refers to at least s (k) constants.

Corollary 1

All features encoded in s-symmetric PRVs are anonymised according to k-anonymity.

Having made the connection between s-symmetry, k-anonymity, and privacy preservation, for the remainder of this paper, we use “a model is privacy-preserving” interchangeably with “a model is s-symmetric”.

3.3 A Privacy-Preserving Query Language

The query definition contains grounded PRVs, forming propositional random variables, which leak information as argued above. We solve this problem by only allowing representative queries, i.e., answering queries only for representative (and not specific) constants for each group. The idea of representative constants is not new: In a similar vein, representative constants are used in so-called first-order decomposition trees, which represent LVE calculations and in which indistinguishable subtrees of different constants are represented by a single subtree and a representative constant [46]. The implementation of first-order knowledge compilation also provides an option for asking for all marginals of single terms, using random groundings as representatives of each group that forms for each PRV in a model.

Instead of a grounded PRV, a PRV with its logical variables is the query term. A query is then answered using a representative grounding for each group (disjoint set of tuples over all constraints referencing the logical variables from the query) in the model after evidence handling, possibly leading to a set of queries if multiple groups indeed appear. Formally, we define this new representative query as follows.

Definition 6

Given a model G, a set of PRVs \(\varvec{A}\), and events \(\varvec{e} = \{E^i=e^i\}_{i=1}^m\), the expression \(P(\varvec{A} \mid \varvec{e})\) denotes a representative query w.r.t. \(P_G\).

A representative query is what PAULI will work with. Of course, from a user perspective, the evidence is not known to the user as it is private. It needs to come from within the system operating PAULI or from a trusted third party. Query terms in the form of PRVs may be provided by the user and, together with the evidence, form the representative query that PAULI needs to answer.

Example 6

Harking back to the example model, a representative query is \(P(I(X) \mid S(d_1)=good)\), which is a representative version of query \(P(I(x_1) \mid S(d_1) = good)\) from Ex. 3. It shows the already private nature of the query term I(X) but also that the evidence in the query is not yet private. A user may provide the query term I(X) but must not know \(S(d_1)=good\) for privacy reasons.

Given a representative query, shattering on query terms still has to occur for a random tuple of constants out of the at least s constants in a group. The query answer must not show the chosen constant and is anonymised in the sense that it applies to all constants represented, which are anonymous to the user.
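Forming representative queries can be sketched as picking one (hidden) constant per group after evidence handling; the function names below are hypothetical.

```python
import random

def representative_queries(prv_name, groups):
    """One query per group of indistinguishable constants; the chosen
    representative stays internal and is never shown in the answer."""
    return [(prv_name, random.choice(sorted(group))) for group in groups]

# After evidence handling, X might fall into two groups:
groups = [{"x1", "x4", "x7"}, {"x2", "x3", "x5", "x6"}]
queries = representative_queries("I", groups)
# Each query is answered with LVE/LJT; only the distribution is reported,
# labelled by the (anonymous) group rather than by the chosen constant.
```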

3.4 When Do Existing Lifted Algorithms Keep Inference Privacy-Preserving?

Adapting the query language is straightforward even in existing off-the-shelf lifted inference algorithms such as LVE [47], LJT [5], or first-order knowledge compilation [50]. However, as argued above, these algorithms do not suffice to guarantee privacy based on s-symmetry due to shattering on evidence. Still, existing lifted algorithms can already keep inference privacy-preserving if one does not allow evidence. Then, the algorithms would not shatter the input model on evidence. The models would not violate s-symmetry and inference would be kept privacy-preserving.

Theorem 2

Given no evidence, i.e., \(\varvec{e}=\emptyset\), as well as representative queries, the result of a query for any lifted inference algorithm provides an s-symmetric answer.

Proof

Given an s-symmetric model, a lifted inference algorithm reasons over constraints with at least s individuals in each constraint. Further, without evidence, i.e., no conditional queries, these constraints are not split up and thus, no individuals are shattered out. As a consequence, during query answering, any lifted inference algorithm always reasons over groups that are at least s-sized. By only allowing representative queries, the result is anonymised and applies to at least s constants. Therefore, any lifted inference algorithm that does not enter any evidence and answers only representative queries over s-sized constraints does not reveal information about specific individuals. \(\square\)

So, given an s-symmetric model, no evidence, and representative queries, any lifted inference algorithm already preserves privacy, without any compromises in terms of utility as the model is unchanged. Here, privacy is preserved as constraints always have at least s constants, i.e., s individuals, to reason over. Requiring s-sized constraints is not really a restriction since lifting is based on the assumption that one has constraints with many constants. Allowing only representatives in queries is also only a minor restriction as the constants per constraint are indistinguishable so the answer would not change with a specific constant. However, not allowing any evidence is a major restriction, as only marginal but not conditional queries are possible. Therefore, the next section introduces PAULI, an inference algorithm that allows for conditional representative queries while ensuring that the model remains s-symmetric.

4 Probabilistic Inference in s-Symmetric Models

This section presents PAULI for answering queries in a privacy-preserving manner while being as accurate as possible. As identified in the previous section, to answer queries in a privacy-preserving fashion, we have to adapt evidence handling, keeping constraints s-sized s.t. PAULI continuously operates on an s-symmetric model. We first look at evidence handling before presenting PAULI. Then, we show how to extend the result to temporal inference. We end with a discussion of privacy, accuracy, and expressiveness.

4.1 Evidence Handling

A basic assumption for lifted inference is that individuals in a group behave similarly, i.e., are close to indistinguishable as mentioned before. Therefore, observations are expected to be more or less the same. Reasons for different observations lie in faulty sensors sending wrong observations, observations getting lost, or, assumed to be highly unlikely, an individual actually behaving differently. Nonetheless, if a constraint contains more than s constants, it is possible that groups of at least size s emerge, one for each possible observed value, which can thus be represented by s-sized constraints. So, in the case that evidence leads to groups of at least size s, it is privacy-preserving to split a parfactor on that particular evidence. Otherwise, a split might lead to violating s-symmetry, such as in Ex. 4.

Since there are situations where evidence might induce differences between otherwise indistinguishable individuals, we need to devise a method to avoid splits that are not privacy-preserving. A simple idea would be to change the evidence in some way: We could compute majority evidence, i.e., the evidence that most individuals agree upon, and assign that evidence to all individuals. Another option would be to compute a form of mean evidence, i.e., counting how often which value is observed and then introducing a parfactor that maps to these normalised counts as a form of uncertain evidence [19] with the constraint containing all individuals. The upside of both these approaches is that no splits would be necessary, leaving the constraints as they are and as such, keeping the model s-symmetric. However, computing equal evidence for all constants throws away a lot of information, and the goal is to be as accurate as possible for a good utility-privacy trade-off. A more fine-grained approach that locally changes evidence for individuals until constraints are s-sized again requires a lot of oversight, though, and might not lead to an optimal result if larger groups are possible. Thus, we are interested in an automated way to ensure that the model remains s-symmetric. Let us start with the high-level argumentation of how PAULI ensures s-sized constraints before formally defining privacy-preserving evidence handling.

4.1.1 Overview

Unfortunately, we cannot simply cluster the evidence s.t. each cluster contains at least s observations. The problem is that observations for some groundings can cause splits in other parfactors because of a shared logical variable.

Example 7

Observations for some groundings of I(X) not only lead to a split of \(g_0\) but also cause a split in \(g_1\) due to the shared logical variable X: Consider the evidence \(I(x_1) = low\), which leads to splitting \(g_0\), which contains I(X), resulting in two parfactors \(g_0',g_0''\), one containing all tuples in the constraint that reference \(x_1\) and one for the remaining tuples (referencing \(x_2\) and \(x_3\)) of the constraint in the original parfactor \(g_0\). However, since X also occurs in \(g_1\), the split also needs to be carried out in \(g_1\) for the constraints in the model to not partially overlap. Otherwise, inference could not be carried out in a lifted way because one parfactor talks about three X constants, while two other parfactors talk about one and two X constants, respectively, and thus, calculations would differ. As such, \(g_1\) is also split into two versions \(g_1',g_1''\), one with a constraint containing tuples referencing \(x_1\) and one with a constraint containing the remaining tuples. Afterwards, the constraints in the four resulting parfactors are either identical or disjoint: \(g_0',g_1'\) have identical constraints referencing \(x_1\) and \(g_0'',g_1''\) have identical constraints referencing \(x_2\) and \(x_3\). As the former only consist of tuples containing \(x_1\) and the latter only consist of tuples containing \(x_2\) and \(x_3\), the constraints are also disjoint. The same applies to observations regarding groundings of D.

So, while handling evidence, PAULI has to consider splits over all parfactors that are split in the same way due to evidence. This partitions the model into sets of parfactors containing constraints that are affected by splits in the same way, called partitions. Since evidence might be available for multiple PRVs and thus affect multiple logical variables in those parfactors, such as X and D in the example model, constraints need to remain s-sized over all splits that are necessary due to evidence. Hence, PAULI cannot collect evidence at the level of PRVs but has to collect evidence for sets of PRVs that are influenced by splits in an identical fashion.

Having collected all evidence terms per set of PRVs, PAULI uses a clustering algorithm that ensures a cluster size of at least s on vector representations of the collected evidence and uses the mean of each cluster as evidence for the individuals in the cluster. To ensure that individuals without an observation do not end up as a group with fewer than s members, PAULI adds an evidence parfactor for the group of these individuals as well, but mapping all range values to 1. Defining the mapping using 1 makes sure that the full joint distribution remains unchanged and no scaling is introduced. Thereby, PAULI ensures the model remains s-symmetric while remaining as close as possible to the original evidence.

Next, we present how PAULI collects the evidence terms in detail and then describe how to cluster the evidence terms privately.

4.1.2 Evidence Collection

To collect the evidence terms per group of indistinguishable constants, PAULI performs the following steps, as outlined in Alg. 1 with a set of parfactors G and evidence \(\varvec{e}\) as input:

  (i) Identify partitions of parfactors that are influenced by splits in the same way,

  (ii) Collect evidence terms for each parfactor in each partition, and

  (iii) Collect unobserved groundings in a group of their own.

First, PAULI has to identify those sets of parfactors in G that are affected in the same way by splits, which partitions G. Each partition \(P \subseteq G\) has a set of logical variables \(\varvec{X}_p\) that is affected in the same way by splitting due to evidence. Formally, P has the form

$$\begin{aligned} P = \{g_{i,1}, \dots , g_{i,o}\}_{i=1}^{n_p} \end{aligned}$$

with \(lv(g_{i,j}) \subseteq \varvec{X}_p, j \in \{1, \dots , o\}\). With the partitions, PAULI already has those sets that will be influenced by evidence terms in the same way. Grouping more than these individuals together would result in splitting later on. Thus, these partitions are the starting point for collecting evidence terms that are later clustered. Second, PAULI collects the evidence terms, iterating over the parfactors of each partition. To account for unobserved groundings, PAULI also adds those groundings that do not occur in the evidence to a group of their own. Before moving on to the next step, which involves clustering these sets of evidence, let us briefly consider the running example but with domains of size 30 and \(s=10\).

Example 8

Consider identical observations for 20 groundings of I(X) and 20 groundings of H(X), of which 15 refer to the same constants of X. Shattering would lead to a group of 15 constants that have (the same) observations for both I(X) and H(X). Then, there would be a group of 5 constants with an observation of I(X) and a group of 5 constants with an observation of H(X). Then there are 5 remaining constants without any observation. With \(s=10\), the model would no longer be s-symmetric. Given this small example, there is then one partition, which contains both parfactors \(g_0\) and \(g_1\) as both are affected by the split on X. Both sets of evidence would be collected. The 10 groundings of I(X) not observed make up their own group (without having an observed value associated). The same holds for the 10 groundings of H(X) not observed.
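Identifying the partitions amounts to transitively grouping parfactors that share evidence-affected logical variables; a minimal sketch, representing parfactors as hypothetical name/logical-variable pairs:

```python
def partition_parfactors(parfactors, evidence_logvars):
    """Group parfactors that evidence splits affect in the same way, i.e.,
    that (transitively) share evidence-affected logical variables."""
    partitions = []  # each entry: (set of logvars, list of parfactor names)
    for name, logvars in parfactors:
        affected = logvars & evidence_logvars
        overlapping = [p for p in partitions if p[0] & affected]
        for p in overlapping:
            partitions.remove(p)
        lvs = affected.union(*(p[0] for p in overlapping))
        gs = [name] + [g for p in overlapping for g in p[1]]
        partitions.append((lvs, gs))
    return partitions

# Ex. 8: evidence on I(X) and H(X) affects X; g0 and g1 both contain X,
# so they end up in one partition and must be handled together.
parts = partition_parfactors([("g0", {"X", "D"}), ("g1", {"X", "D"})], {"X"})
assert len(parts) == 1 and sorted(parts[0][1]) == ["g0", "g1"]
```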

4.1.3 Computing s-Sized Evidence Groups

PAULI has collected evidence per partition in the previous step, which has not yet affected the evidence itself. In the next step, the evidence is manipulated to become privacy-preserving regarding s-symmetry, as described in Alg. 2. In essence, PAULI has to cluster the identified groups of evidence, for which it first has to build inputs that the clustering algorithm can handle in a way that respects s-symmetry.

Algorithm 1: Evidence Collection

Algorithm 2: Computing s-Sized Evidence Groups

The problem with building the inputs to the clustering algorithm from the grouped evidence is as follows: Remember that clustering evidence per PRV is not possible, as two PRVs with evidence referencing the same logical variable might lead to different clusters that individually contain at least s evidence terms; however, when forming evidence parfactors for each PRV and then shattering the model on those evidence parfactors, the necessary splits might lead to constraints that are no longer s-sized if the clusters overlap unfavourably. To avoid such a situation, PAULI collects evidence per partition, which contains those parfactors whose logical variables are affected by splits in the same way. Thus, PAULI basically has to simultaneously cluster the evidence per partition. To do so, PAULI builds the clustering inputs by combining the evidence for each ground combination of constants in the constraints of the whole partition. Depending on the chosen clustering algorithm and distance metric, the combination can be a vector representation with the observed range values at specific positions and zeros otherwise. Alternatively, each possible assignment to the evidence PRVs could be considered a feature, set to 1 if the assignment was observed, to 0 if the assignment was not the observed one, or to all 1’s if there was no observation. The only restriction for the representation is that PAULI needs to be able to map the inputs back to the collected evidence that goes into each input. If this were not possible, we would need to use the combined inputs as evidence, which would require multiplying all parfactors per partition, unnecessarily enlarging parfactors. Mapping the inputs back to the collected evidence allows for computing private evidence parfactors per cluster per PRV, which we describe below, keeping the parfactors small.

In terms of the clustering algorithm, the only restriction is that it needs to be able to form clusters with at least s elements. Hierarchical clustering could be used to ensure that each cluster has at least s elements. However, a hierarchical clustering might then split up a cluster that has \(2 \cdot s\) elements into two clusters even if all points are relatively close together and keeping them in one cluster would lead to hardly any accuracy drawbacks. Mondrian is a clustering algorithm for k-anonymity that ensures that each resulting cluster contains at least k data points while producing a good clustering [27]. Therefore, we suggest using Mondrian as the clustering algorithm, but any clustering algorithm that ensures s-sized constraints (i.e., clusters of at least size s) works.

Clustering evidence like this allows for using off-the-shelf clustering algorithms. Of course, in the spirit of lifted inference and for efficiency gains, it would be better to not work with the ground instances but actually use parfactors to represent the evidence and then cluster the parfactors, taking into account that a parfactor represents multiple elements. Future work includes developing a fitting lifted clustering algorithm.
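As a stand-in for a size-constrained clustering such as Mondrian, the following greedy post-processing sketch (an assumption for illustration, not the Mondrian algorithm [27]) enforces the minimum cluster size s on top of any base clustering by merging undersized clusters into their nearest neighbour.

```python
import numpy as np

def enforce_min_size(points, labels, s):
    """Merge clusters with fewer than s points into the cluster with the
    nearest centroid until every cluster has at least s elements."""
    labels = labels.copy()
    while True:
        ids, counts = np.unique(labels, return_counts=True)
        small = ids[counts < s]
        if len(small) == 0 or len(ids) == 1:
            return labels
        c = small[0]
        centroids = {i: points[labels == i].mean(axis=0) for i in ids}
        nearest = min((i for i in ids if i != c),
                      key=lambda i: np.linalg.norm(centroids[i] - centroids[c]))
        labels[labels == c] = nearest
```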

To compute private evidence parfactors, for each PRV in each cluster, PAULI maps the data points in the cluster back to the collected evidence. It then computes an uncertain evidence parfactor for each PRV by taking the mean of the vector representations of the potential functions representing the observations (or uniform distributions) and setting the result as the potentials in the evidence parfactor, with the constants of the evidence involved in this calculation in the constraint.

Example 9

Consider Ex. 8, which has evidence for H(X) and I(X). Assume that we observe true for the 20 H(X) instances and middle for the 20 I(X) instances, of which 15 instances refer to the same constants. We build clustering inputs using vectors, with the first two positions referring to the two possible values of H(X) and the other three positions referring to the three possible values of I(X). For the 15 instances with both observations, we get a vector representation of \((1,0,0,1,0)^\top\) each, which is basically a concatenation of the evidence parfactor mappings. For the 5 instances with only an H(X) observation and thus no observation for I(X), the vector representation is \((1,0,1,1,1)^\top\) each, using 1’s for the three possible values of I(X) to not introduce a scaling as argued previously. The vector is basically a concatenation of (1, 0) representing the observed evidence and (1, 1, 1) representing that I(X) is unobserved. For the 5 instances with only an I(X) observation and no H(X) observation, the vector representation is \((1,1,0,1,0)^\top\) each. For the 5 instances with no observation, the vector representation is \((1, 1, 1, 1, 1)^\top\) each. These 30 data points are the input to a clustering algorithm with, e.g., cosine similarity as a distance measure and \(s=k=10\) as the minimum cluster size. Assume the result consists of two clusters, one for the 15 instances with both observations, which contains only \((1,0,0,1,0)^\top\) data points, and one for the other 15 data points, which contains the other three vector representations. For the first cluster, the in-going evidence was that of \(H(X)=true\) and \(I(X)=middle\) with vector representations of (1, 0) and (0, 1, 0). The mean of these vector representations is (1, 0) and (0, 1, 0), respectively, leaving the evidence as is and leading to an evidence parfactor \(\phi (H(X))\), mapping true to 1 and false to 0, and an evidence parfactor \(\phi (I(X))\), mapping middle to 1 and low and high to 0, with constraints containing the 15 X instances. For the second cluster, the in-going evidence as vectors was five times (1, 0) and five times (1, 1) for H(X) and five times (0, 1, 0) and five times (1, 1, 1) for I(X). The normalised mean of these vectors is (2/3, 1/3) and (1/4, 1/2, 1/4), respectively, which are used as mappings in the corresponding evidence parfactors for the 15 constants in this cluster. When splitting \(g_0\) and \(g_1\) on these evidence parfactors, the result consists of two parfactors each, one that contains the 15 constants of the first cluster and one that contains the 15 constants of the second cluster, both in a cross product with the 30 D constants, keeping the constraints s-sized with \(s=10\).
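The second cluster's evidence parfactors in Ex. 9 can be reproduced by taking the normalised mean of the per-instance vectors; this is a sketch, and whether to normalise the mean is a design choice, but normalising reproduces the numbers above.

```python
import numpy as np

def evidence_potential(vectors):
    """Mean of the evidence vectors for one PRV in one cluster, normalised
    to serve as the mapping of an uncertain evidence parfactor."""
    m = np.mean(vectors, axis=0)
    return m / m.sum()

# Second cluster of Ex. 9: five observed H(X) = true and five unobserved,
# five observed I(X) = middle and five unobserved.
h = [(1, 0)] * 5 + [(1, 1)] * 5
i = [(0, 1, 0)] * 5 + [(1, 1, 1)] * 5
print(evidence_potential(h))  # approx. [0.667, 0.333] -> phi(H(X)) = (2/3, 1/3)
print(evidence_potential(i))  # [0.25, 0.5, 0.25]      -> phi(I(X)) = (1/4, 1/2, 1/4)
```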

Lemma 1

Given a clustering algorithm that returns clusters of at least size s, Algs. 1 and 2 yield privacy-preserving evidence, i.e., evidence with s-sized constraints.

Proof

If a parfactor is not affected by a split due to evidence, its constraint will not be changed and remains s-sized. If a parfactor is affected, Alg. 1 adds it to the corresponding partition to take the shared splits into account. Since a partition contains all parfactors that are influenced by evidence splits over shared logical variables, they are considered jointly during clustering. Using a clustering algorithm that returns clusters of a minimum size of s ensures that the splits that are then necessary still yield s-sized constraints. \(\square\)

Let us combine privacy-preserving evidence and query handling into PAULI.

4.2 The Complete Algorithm

PAULI answers queries on models in a privacy-preserving manner while being as accurate as possible. It uses the adapted query language defined in Sect. 3 and the evidence handling just described. Since representative queries, even if concerning just a single PRV, very likely lead to multiple queries given the groups, we formulate PAULI based on LJT as an efficient multi-query algorithm that operates on parfactors.

Algorithm 3: Privacy-preserving and Utility-controlled Lifted Inference

Algorithm 3 outlines PAULI. Like LJT, PAULI works in four general steps:

  (i) Building a helper structure J for the input model G,

  (ii) Handling evidence \(\varvec{e}\) in J,

  (iii) Passing messages in J, and

  (iv) Answering queries for query terms \(\varvec{Q}\) in J.

To make this procedure privacy-preserving, the query terms need to adhere to the updated query language and the evidence handling needs to turn the given evidence into privacy-preserving evidence using Algs. 1 and 2 before entering the evidence into J. More specifically, PAULI constructs the helper structure of an acyclic graph over clusters of PRVs as nodes, exploiting conditional independences between PRVs. It also includes assigning each parfactor of the input model to a node, yielding local models at each node. For evidence handling, PAULI begins by identifying groups w.r.t. logical variables, i.e., the partitions. For these partitions, PAULI builds the corresponding evidence groups. To be privacy-preserving, PAULI then uses a clustering algorithm on the inputs generated from these groups, which ensures s-sized constraints. Therefore, all constants are always in a group with at least \(s-1\) other constants. Then, PAULI computes privacy-preserving evidence parfactors using the cluster result.

After entering these privacy-preserving evidence parfactors into the helper structure by absorbing each evidence parfactor at the local model of each node containing the PRVs of the evidence parfactor, PAULI passes messages on the structure, which makes the nodes independent from each other. To answer a query, a representative constant for each group of the queried PRV is chosen. Then, a node containing the PRV is selected and its local model and received messages are used to compute the query result. That is, for each private cluster, the result is a (conditional) marginal distribution of a representative. As such, neither the instances included in a cluster nor how many of them exist are returned, keeping that information private. Thereby, queries are only answered over the groups that emerged after privacy-preserving evidence handling.
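The overall control flow can be summarised in a high-level sketch, where the callables bundled in `ops` stand for the LJT/LVE machinery and the privacy-preserving evidence handling of Algs. 1 and 2 (all names hypothetical).

```python
def pauli(model, evidence, query_prvs, s, ops):
    """Four steps of PAULI; `ops` bundles the (hypothetical) subroutines."""
    jtree = ops.build_helper_structure(model)            # (i)
    groups = ops.collect_evidence(model, evidence)       # (ii), Alg. 1
    ev_parfactors = ops.s_sized_evidence(groups, s)      #      Alg. 2
    ops.enter_evidence(jtree, ev_parfactors)
    ops.pass_messages(jtree)                             # (iii)
    answers = {}
    for prv in query_prvs:                               # (iv)
        for group in ops.groups_of(prv, jtree):
            rep = ops.pick_representative(group)         # hidden from the user
            answers[(prv, frozenset(group))] = ops.answer(jtree, prv, rep)
    return answers
```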

Theorem 3

PAULI keeps a model s-symmetric while computing query answers.

Proof

The helper structure is built solely using the topology of the model, which does not influence constraints. From Lemma 1, it follows that handling evidence in this fashion makes the manipulation of the model privacy-preserving. By only allowing representative queries, PAULI is also privacy-preserving in the sense that no information about constant names is revealed. \(\square\)

This concludes PAULI for episodic models. However, we can also use the ideas in PAULI analogously for temporal inference with the added advantage of a bounded error over time, which we show next.

4.3 A Temporal Extension

This section presents how we can use PAULI during temporal inference. To this end, we first introduce a temporal version of parfactor models [18], recap the lifted dynamic junction tree algorithm (LDJT), a temporal inference algorithm that uses LJT as a subroutine, thereby lifting the interface algorithm for temporal inference in propositional temporal models [34], and lastly show how to incorporate PAULI into LDJT.

Fig. 3: Parfactor graph for \(\mathcal {G}_{ex}\)

4.3.1 Introducing Time to Models

As is convention in temporal probabilistic inference, we define a temporal model based on the first-order Markov assumption, i.e., a time slice t only depends on the previous time slice \(t-1\). Further, the underlying process is stationary, i.e., the model behaviour does not change over time. For more details, please refer to [16].

Definition 7

A temporal model \(\mathcal {G}\) is a pair of models \((G^0,G^\rightarrow )\) with \(G^0\) a model representing the first time step and \(G^\rightarrow\) a two-slice model over \(\varvec{A}^{t-1}\) and \(\varvec{A}^t\) with \(\varvec{A}^\pi\) a set of PRVs from time slice \(\pi\). The semantics of \(\mathcal {G}\) is to unroll \(\mathcal {G}\) for a given number of time steps T, by instantiating the model for \(t \in \{0, \dots , T\}\), yielding a model of Def. 2.

The two-slice model \(G^\rightarrow\) essentially is a template model that can be instantiated for various time steps. It consists of parfactors describing the model behaviour within a time slice. Since we assume that the process is stationary, the intra-slice behaviour is identical in each slice, which means that \(G^\rightarrow\) consists of the same model twice, once indexed by \(t-1\) and once by t, whose parfactors only contain PRVs indexed by either \(t-1\) or t. In addition, there are so-called inter-slice parfactors, describing how the model state changes from \(t-1\) to t, containing PRVs indexed by \(t-1\) and t, with those PRVs indexed by \(t-1\) being called the interface. The interface m-separates the two slices, i.e., conditioning on the interface makes the two slices independent from each other, enabling efficient temporal inference within a single slice. Since the size of the interface determines the complexity of temporal inference, a general assumption is that the connections between \(t-1\) and t are sparse, yielding a small interface.

Example 10

Given the scenario modelled in our running example so far, it would be sensible to assume that income, residence, state of the schools, and the household situation do not change with high probability from one time step to the next, requiring inter-slice parfactors for each to carry over that information. However, with the general assumption that the two slices in a \(G^\rightarrow\) are sparsely connected, we use some creative licence here: Figure 3 shows a temporal model \(\mathcal {G}_{ex}\), which consists of \(G_{ex}\) for the intra-slice behaviour. As such, PRVs are indexed by \(t-1\) and t to describe the behaviour of the modelled scenario within time slices \(t-1\) and t, respectively. Additionally, there is an inter-slice parfactor connecting \(I^{t-1}(X)\) and \(I^t(X)\). The interface consists of \(I^{t-1}(X)\) as the one PRV in the inter-slice parfactor indexed with \(t-1\). All paths between PRVs indexed with \(t-1\) and PRVs indexed with t go through \(I^{t-1}(X)\), m-separating \(t-1\) and t.

In general, a temporal query asks for a probability distribution of a random variable given a sequence of events as evidence.

Definition 8

Given a temporal model \(\mathcal {G}\), query terms \(\varvec{Q}\) (ground PRVs), and events \(\varvec{e}^{0:t} = \{E_{i}^\tau =e_{i}^\tau \}_{i=1,\tau =0}^{m,t}\), the expression \(P(\varvec{Q}^\pi \mid \varvec{e}^{0:t})\) denotes a query w.r.t. \(P_{\mathcal {G}}\).

The problem of answering a query \(P(\varvec{Q}^\pi |\varvec{e}^{0:t})\) w.r.t. the model is called hindsight for \(\pi < t\), filtering for \(\pi = t\), and prediction for \(\pi > t\). For the representative version, \(\varvec{Q}\) is replaced by a set of PRVs \(\varvec{A}\).

Example 11

Temporal queries in \(\mathcal {G}_{ex}\) would be:

  • \(P(I^3(x_1)\mid S^{0:3}(d_1)=good)\) (filtering),

  • \(P(I^5(x_1)\mid S^{0:3}(d_1)=good)\) (prediction), and

  • \(P(I^1(x_1)\mid S^{0:3}(d_1)=good)\) (hindsight).

Representative versions of these queries would have \(x_1\) replaced by X in all three queries.

4.3.2 A Temporal Query Answering Algorithm: LDJT

LDJT lifts the interface algorithm [34] and thereby uses LJT as a subroutine. It uses the interface to m-separate time steps, which means that state descriptions about the interface PRVs render the models of different time steps independent from each other. Thus, LDJT builds the same helper structure that LJT uses and ensures that the interface occurs in one cluster to use as a gateway to the next time step. Within a time step, LDJT proceeds like LJT, starting with entering evidence, then passing messages, and ending with answering queries on the helper structure. After those steps, it computes a so-called forward message \(\varvec{m}^t\) over the interface, which encodes all information (evidence, model behaviour) up to t, adds this message to a new instance of the helper structure, and starts again with evidence entering for \(t+1\). For a full description including how to handle hindsight and prediction queries, refer to [17].

A problem with the forward message is that it carries over splits from one time step to the next, slowly grounding a model over time. However, since the model behaviour is added anew to the inference procedure in each time step, it is possible to re-merge previously split parfactors if the differences in evidence lie far enough in the past. This is realised by an algorithm called TAMe [20], which can be added to LDJT as a subroutine to merge splits in the forward message. Adding the model behaviour in each time step actually allows for bounding the error introduced by merging parfactors [20].
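The following is only an illustrative sketch of the core merging idea, not the actual TAMe algorithm from [20]: two split parfactors over the same PRVs are re-merged once their potentials lie within a tolerance \(\epsilon\), using a group-size-weighted average. All names here are our own.

```python
# Hedged sketch of re-merging two split parfactors; not TAMe itself.
def maybe_merge(pot_a, n_a, pot_b, n_b, eps):
    """pot_a, pot_b map range assignments to potentials;
    n_a, n_b are the sizes of the two split groups."""
    if pot_a.keys() != pot_b.keys():
        return None  # splits over different assignments cannot be merged
    if max(abs(pot_a[k] - pot_b[k]) for k in pot_a) > eps:
        return None  # potentials still differ too much
    # The weighted average stays within eps of both original potentials,
    # keeping the introduced error controllable.
    return {k: (n_a * pot_a[k] + n_b * pot_b[k]) / (n_a + n_b) for k in pot_a}
```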

4.3.3 PAULI for Temporal Inference

Since LDJT uses LJT as a subroutine and PAULI essentially proceeds like LJT, we only have to replace the LJT steps within LDJT with the corresponding steps of PAULI.

There is one complication to consider, however. For PAULI, we have assumed that the model is preemptively shattered on itself, i.e., constraints talking about the same logical variables are restricted identically. In temporal inference, this setting does not hold, since the forward message \(\varvec{m}^t\) introduces splits from previous time steps into the new instantiation for time step \(t+1\), which normally are only propagated through the model during message passing. Therefore, we assume that the splits caused by \(\varvec{m}^t\) have been propagated into the helper structure for \(t+1\).

Algorithm 4 shows the algorithm that we call TemPAULI. We present it as an offline algorithm, with the set of evidence and query terms over time already available, but it could easily be adapted to accept two streams, one for evidence and one for query terms, for online inference. Given a temporal model as well as a set of evidence and query terms for \(T\) time steps, TemPAULI builds the above-mentioned helper structure and then proceeds in time, starting at \(t=0\), until \(t=T\). In each time step \(t\), it adds the forward message from the previous time step (not available for \(t=0\)), computes privacy-preserving evidence, adds the result to the helper structure, passes messages, and answers queries. Then, it computes a forward message before moving on to the next time step.

Optionally, one can use TAMe here to merge split parfactors for efficiency gains. With TAMe, the sizes of the groups can increase again, which possibly allows new splits in the private evidence groups in later time steps, which in turn can lead to more accurate private evidence groups w.r.t. the original evidence terms.

Algorithm 4 Temporal Privacy-preserving and Utility-controlled Lifted Inference (TemPAULI)
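For illustration, the following hedged sketch captures the control flow of Algorithm 4 as described above. The subroutines are passed in as functions because their names and signatures (`instantiate`, `private_evidence`, `forward`, and the methods `add`, `enter`, `pass_messages`, `answer`) are our own stand-ins, not the actual interface from the paper.

```python
# Hedged sketch of the TemPAULI loop; subroutine names are our own.
def tempauli(instantiate, private_evidence, forward, evidence, queries, T):
    """instantiate(t)      -> helper structure instance for time step t
    private_evidence(ev)   -> s-sized privacy-preserving evidence clusters
    forward(j)             -> forward message over the interface of j"""
    fwd, answers = None, []
    for t in range(T + 1):
        j = instantiate(t)                      # fresh helper structure for t
        if fwd is not None:
            j.add(fwd)                          # information up to t-1
        j.enter(private_evidence(evidence[t]))  # privacy-preserving evidence
        j.pass_messages()                       # LJT-style message passing
        answers.append([j.answer(q) for q in queries[t]])
        fwd = forward(j)                        # forward message for t+1
        # optionally, TAMe could merge splits in fwd here [20]
    return answers
```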

This concludes the presentation of PAULI (and TemPAULI) as an algorithm inspired by k-anonymity to preserve privacy while answering queries as accurately as possible. Next, we discuss PAULI and in particular s-symmetry.

5 Discussion

This section discusses the privacy implications of our approach, starting with the overall setting and then moving on to queries, evidence, and accuracy in particular. Afterwards, it considers runtime performance and the expressiveness of the modelling formalism.

Query Answering As shown in Sect. 3, answering representative queries on a model with s-sized constraints without entering evidence is privacy-preserving. The main problem for a more general result is that evidence might lead to a model violating the prerequisite of s-sized constraints. Therefore, PAULI handles evidence differently than standard inference algorithms, namely by computing privacy-preserving clusters of evidence terms and then entering those into the model. Thereby, entering evidence can no longer violate the prerequisite of s-sized constraints. During query answering, PAULI presents the conditional marginal distribution for representatives instead of querying individuals, avoiding the privacy risk of queries containing specific constants. The next paragraphs take a closer look at the privacy implications of queries, evidence, and the effect on accuracy individually.

Queries The information an attacker gains by asking queries is, for each queried PRV, the conditional marginal distribution for a representative of each group. Thereby, the attacker knows the number of groups of indistinguishable individuals, but not the exact number of constants in a group (only that it is at least s if s is publicly known). Solely based on the query results, an attacker does not know which individuals belong to which group of indistinguishable individuals.

In a temporal setting, based on multiple query results over several time steps, an attacker might be able to infer how the query results evolve, but that has no privacy implications for an individual, as the attacker only learns how the results of a large group of indistinguishable individuals evolve.

Evidence To keep the model as close as possible to the actual data, PAULI enters evidence into the model. To do so in a privacy-preserving way, PAULI first groups the evidence terms w.r.t. privacy-preserving groups and then uses a clustering algorithm to ensure s-sized constraints, limiting any potential data leakage. Lastly, PAULI enters these privacy-preserving evidence clusters into the model.
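For illustration, here is a deliberately simple stand-in for this step for evidence over a single PRV, assuming a greedy merging strategy (PAULI's actual clustering algorithm may differ): individuals are grouped by observed value, and the two smallest groups are merged until every group contains at least \(s\) members.

```python
# Hedged sketch of privacy-preserving evidence grouping for one PRV.
from collections import defaultdict

def private_evidence_groups(evidence: dict, s: int):
    """evidence maps individuals (constants) to observed values."""
    groups = defaultdict(set)
    for individual, value in evidence.items():
        groups[value].add(individual)
    clusters = list(groups.items())
    # greedily merge the two smallest groups while any group is < s
    while len(clusters) > 1 and min(len(m) for _, m in clusters) < s:
        clusters.sort(key=lambda vm: len(vm[1]))
        (v0, m0), (v1, m1) = clusters[0], clusters[1]
        # keep the majority value so as few evidence terms as possible change
        clusters = clusters[2:] + [(v1 if len(m1) >= len(m0) else v0, m0 | m1)]
    return clusters

ev = {"x1": "true", "x2": "true", "x3": "false", "x4": "false", "x5": "false"}
print(private_evidence_groups(ev, s=3))  # one cluster of all five, value "false"
```

Individuals in the minority of a merged cluster have their evidence altered, which is exactly the source of the accuracy error discussed next.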

Accuracy Only the privacy-preserving evidence handling can introduce an error, as evidence may get altered. With very diverse evidence for various PRVs in the model and unfortunate group sizes, the effects might be large and depend highly on the clustering algorithm. In general, the error might be large when preserving privacy in a highly individualistic setting. However, given decent numbers of individuals per constraint and somewhat homogeneous behaviour within groups, which are fundamental assumptions in lifting, the error is likely to be negligible in the episodic setting: if there is homogeneous behaviour within large groups, these groups will produce almost identical evidence. Therefore, the actual mean and the computed mean of the evidence will be very close, and thus the results will be nearly identical in the end, leading to a negligible error.

One might presume that adding a temporal component worsens the error situation. However, using TAMe with TemPAULI ensures that the approximation error remains bounded indefinitely and decreases over time, which follows directly from previous results [3, 20]. Empirical results follow these bounds, with the error remaining bounded indefinitely, meaning accuracy can even increase over time.

Runtime By clustering evidence, PAULI prevents splitting off individuals or small groups and thereby ensures an s-symmetric model. While such an s-symmetric model might exhibit some inaccuracy, it has a great benefit w.r.t. runtimes, as lifted inference is tractable w.r.t. group sizes, which are at least of size s in the privacy-preserving case. Therefore, the runtime of PAULI always benefits from lifting, which applies to both the episodic and the temporal setting.

Expressiveness PAULI is an algorithm that solves probabilistic inference tasks in probabilistic relational models in a privacy-preserving manner. Choosing a first-order formalism comes with greater expressiveness, enabling privacy in the first place. Propositional models either have to combine the features of all individuals into features for one individual, prohibiting any reasoning over groups and preventing a more accurate model, or model each individual explicitly, making any privacy advancements nearly impossible (in addition to being very costly).

This ends the discussion of privacy-preserving probabilistic inference with the help of lifting. Before concluding, we look at related work in the areas of probabilistic inference for the public sector, lifted probabilistic inference, and privacy-preserving inference.

6 Related Work

Probabilistic graphical models and probabilistic inference have been used in various applications in the public sector, ranging from energy forecasting [10] and assessing the resilience of smart grid systems [23] using dynamic Bayesian networks to predicting infection risks [24] or spatial transmissions [12] in a pandemic, including policy-making supported by probabilistic modelling, e.g., based on occupational health data [37]. To the best of our knowledge, none of these works consider privacy, which becomes increasingly important with personal data as a basis for learning these models. Additionally, these works do not consider the advantages that come with the expressiveness of first-order modelling. The remainder of this section considers related work on lifted inference as well as privacy-preserving probabilistic inference in general.

First-order probabilistic inference leverages the relational aspect of a static model, using representatives for groups of indistinguishable, known objects, also known as lifting [39]. Poole presents parfactor graphs as relational models and proposes LVE as an exact inference algorithm on them [39]. Taghipour et al. extend LVE to its current form [47]. To combine the advantages of the junction tree algorithm [26] and LVE, Braun and Möller present LJT for exact inference given a set of queries [5]. To answer multiple temporal queries, Gehrke et al. present LDJT [18], which combines the advantages of the interface algorithm [34] and LJT. Gehrke et al. propose TAMe to approximate symmetries over time to retain sets of indistinguishable objects [20]. Finke and Mohr introduce an a priori learning approach, assuming complete knowledge, to prevent groundings from happening in the first place [15]; it follows a similar idea of adjusting evidence, but for efficiency gains, while we are interested in preserving privacy. TAMe approximates symmetries after groundings have occurred, which may lead to privacy concerns.

k-anonymity is widely used for data publication [42, 44]. Pei et al. consider how incremental updates can be handled while maintaining k-anonymity [38]; in our case, these incremental updates can be viewed as conditional queries and thereby as evidence entering. Besides using Mondrian to obtain a clustering for k-anonymity [27], hierarchical clustering has also been studied [28]. The general idea of using privacy-preserving clustering to achieve anonymity [1] is still widely studied in the field of differential privacy [7, 13, 14, 25, 35, 43]. For privacy-preserving inference, there are some propositional approaches that, for example, only provide an interface for query answering, leaving the model as a black box to users [48]. There are also works that use Bayesian networks to preserve privacy [52, 54]. However, the focus of these works is not on query answering in a privacy-preserving manner, but on generating a synthetic data set from original data, which can then be published. Further, PAULI benefits from lifted query answering, which is polynomial w.r.t. domain sizes.

7 Conclusion

The public sector has a particular need for privacy-preserving handling of data and tasks. One such task lies in probabilistic inference, for which research is scarce when privacy has to be preserved during inference, which is especially hard in propositional models. However, first-order frameworks have great potential for affording privacy-preserving inference while keeping accuracy and expressiveness high, using the concept of lifting. Unfortunately, using any off-the-shelf lifted algorithm only works well to protect the privacy of large groups; smaller groups or individuals are still at risk. Therefore, we present the notion of s-symmetry, inspired by k-anonymity, for privacy-preserving probabilistic relational models, which requires that constraints always refer to at least s individuals and that this property be upheld during inference. An additional risk comes from queries, which allow for asking about specific individuals and may include evidence that distinguishes individuals, requiring special attention to protect small groups and individuals. Thus, we define representative queries, which are applicable to any lifted inference algorithm. Without evidence, in an s-symmetric model, any off-the-shelf lifted algorithm provides privacy-preserving inference when answering representative queries. To also allow for evidence, we present PAULI, which achieves privacy preservation by making evidence entering privacy-preserving using k-anonymity clustering. When considering temporal inference, PAULI can be used analogously and, if combined with TAMe, even allows for bounding the error introduced by approximating the evidence terms to make them privacy-preserving. To the best of our knowledge, PAULI and its temporal extension are the first algorithms to offer privacy-preserving lifted inference in episodic and temporal probabilistic models, respectively, and can be used for a wide range of inference tasks in various applications, including those coming from the public sector.

Future work includes a large-scale public sector case study testing the algorithms in the wild. On the technical side, future work involves developing a fitting lifted clustering algorithm. Another direction lies in adopting another privacy framework, such as differential privacy, which provides strong guarantees for privacy.