1 Introduction

Data from census or survey studies are among the most useful sources of information for social and political studies. However, when statistical and governmental agencies release microdata to the public, they often encounter ethical and moral issues concerning possible privacy leaks for the individuals present in the dataset. Anonymization techniques, like encrypting or removing personally identifiable information, have been widely used with the hope of ensuring privacy protection. However, recent studies by Gymrek et al. (2013), Homer et al. (2008), Narayanan and Shmatikov (2008), Sweeney (1997) have shown that, even after removing directly identifying variables, like names or national insurance numbers, the potential for breaches of confidentiality is still present. Specifically, an intruder might still be able to identify individuals by cross-classifying categorical variables in the dataset and matching them with some external database. This kind of privacy problem has been widely considered in the statistical literature, and different measures of disclosure risk have been proposed to assess the riskiness of a specific dataset.

Different disclosure limitation techniques have been proposed, like rounding, suppression of extreme values or entire variables, sampling or perturbation techniques. Post Randomization Methods (PRAM) are among the most used techniques for disclosure risk limitation; see De Wolf et al. (1997), Gouweleeuw et al. (1997), Kooiman et al. (1997). With these techniques, before releasing the dataset, the data curator randomly changes the values of some categorical identifying variables, like gender, job or age, of some individuals in the dataset. In a recent paper, Shlomo and Skinner (2010) consider PRAM and random data swapping of a geographical variable and propose a way of computing measures of disclosure risk to assess whether these techniques have been effective in “privatizing” the dataset. The choice of the geographical variable is motivated by the fact that swapping or changing it is usually less likely to generate unreasonable combinations of categorical variables, like for instance a pregnant man or a 10 year old lawyer. In order to implement PRAM, they consider a stochastic matrix M, where the (i, j) entry of this matrix gives the probability that an individual from location i has his geographical variable swapped to location j. Given this known matrix M, Shlomo and Skinner (2010) suggest some measures of risk and related estimation methods. However, an open problem is to decide how the data curator should actually choose the matrix M in order to guarantee an effective level of privacy.

Over the last ten years, a new approach to data protection, called Differential Privacy (see Dwork et al. (2006)), has become more and more popular in the computer science literature and has been implemented by IT companies in their security protocols [Eland (2015), Erlingsson et al. (2014), Machanavajjhala et al. (2008)]. This framework finds its roots in the cryptography literature and prescribes transforming the original data, containing sensitive information, through a channel or mechanism Q into a sanitized dataset. The mechanism Q should be chosen carefully, in such a way that, by only looking at the released dataset, an intruder has very low probability of correctly guessing the presence or absence of a specific individual in the raw data, so that the privacy of the latter is preserved. Differential Privacy formalizes this intuitive idea mathematically. We provide a short review of it in Sect. 2.3.

In this work, we bring together ideas from the Disclosure Risk and Differential Privacy literature to propose a formal way of choosing the stochastic matrix M used in PRAM. Specifically, when choosing M, we need to balance two conflicting goals: (1) on the one hand, we want the application of M to make the dataset private in a precise sense; (2) on the other hand, we also want the released dataset to preserve as much statistical information as possible from the raw data. In order to balance this trade-off, we propose to choose M as the solution of a constrained maximization problem. We maximize the Mutual Information between the released and the raw dataset, hence guaranteeing preservation of statistical information and achieving goal (2). Mutual Information is a common measure of dependence between random variables used in probability and information theory. In order to also guarantee goal (1), we introduce a constraint in the maximization problem by imposing that the mechanism based on M satisfies Differential Privacy, so that the resulting mechanism can formally be considered private. We show that this optimization problem results in a convex maximization problem under linear constraints and can therefore be solved efficiently by known optimization algorithms.

The rest of this work is organized as follows. In Sect. 2, we first briefly review the disclosure risk problem in Sect. 2.1 and then the tools needed for our approach; specifically, we review Mutual Information in Sect. 2.2 and Differential Privacy in Sect. 2.3. In Sect. 3, we formalize the proposed constrained maximization problem to choose the stochastic matrix M in PRAM and show that this choice amounts to solving a convex optimization problem under linear constraints. Section 4 contains a simulation study showing first the effect of the Differential Privacy constraint on simulated data and then the effect of different choices of M using a real dataset from a survey of New York residents. Finally, a concluding remarks section closes the work. Proofs of the statements are deferred to the Appendix.

2 Literature review

2.1 Disclosure risk limitation with categorical variables

In disclosure risk problems, we usually have microdata of n individuals, where for each individual we can observe two distinct sets of variables: (1) some variables, usually called sensitive variables, containing private information, e.g. health status or salary; (2) some identifying categorical variables, usually called key variables, e.g. gender, age or job. Disclosure problems arise because an intruder may be able to identify individuals in the dataset by cross-classifying their corresponding key variables and matching them to some external source of information. If the matching is correct, the intruder will be able to disclose the information contained in the sensitive variables.

Formally, let us assume we have J categorical key variables in the dataset, observed for a sample of n individuals, collected from a population of size N. The j-th variable has \(n_{j}\) possible categories labelled, without loss of generality, from 1 up to \(n_j\). The observation for individual i, \(X_{i}=(X_{i1},\ldots ,X_{iJ})\), therefore takes values in the state space \({\mathcal {C}}:=\prod _{j=1}^{J}\{1,\ldots ,n_{j}\}\). This set has \(K:=|{\mathcal {C}}|=\prod _{j=1}^{J}n_{j}\) values, corresponding to all possible cross-classifications of the J key variables. The information about the sample is usually given through the sample frequency vector \((f_{1},\ldots ,f_{K})\), where \(f_{i}\) counts how many individuals of the sample have been observed with the particular combination of cross-classified key variables corresponding to cell i. \((F_{1},\ldots ,F_{K})\) denotes the corresponding vector of frequencies for the whole population of N individuals.

The earliest papers to consider disclosure risk problems include Bethlehem et al. (1990), Duncan and Lambert (1986), Duncan and Lambert (1989), Lambert (1993). These works propose different measures of disclosure risk and possible ways to estimate them under different model choices. Skinner and Elliot (2002), Skinner et al. (1994) review the most popular measures of disclosure risk. These measures depend on the sample frequencies \((f_{1},\ldots ,f_{K})\) and usually focus on small frequencies, especially cells having frequency 1, called sample uniques. The individuals belonging to these cells are those with the highest risk of having their sensitive information disclosed. Specifically, suppose that an individual is the only one both in the sample and in the population to have a specific combination of key variables. Then, if his key variables are matched to an external database, the match will be perfect, i.e. correct with probability one, and his sensitive information will therefore be disclosed.
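To make these quantities concrete, the following minimal Python sketch computes the observed cell frequencies \((f_{1},\ldots ,f_{K})\) from a matrix of cross-classified key variables and extracts the sample uniques; the key variables, category counts and sample size are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# J = 3 hypothetical categorical key variables observed on n = 1000 individuals;
# variable names and category counts are purely illustrative
X = np.column_stack([
    rng.integers(1, 3, size=1000),    # e.g. gender, n_1 = 2 categories
    rng.integers(1, 10, size=1000),   # e.g. age band, n_2 = 9 categories
    rng.integers(1, 6, size=1000),    # e.g. region, n_3 = 5 categories
])

# observed cells of the cross-classification and their sample frequencies f_k
cells, f = np.unique(X, axis=0, return_counts=True)
sample_uniques = cells[f == 1]        # combinations observed exactly once in the sample
```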

We usually distinguish between two groups of measures of disclosure risk:

  1.

    Record Level (or per-record) measures: they assign a measure of risk for each data point. Among the most popular, there are

    $$\begin{aligned} \begin{aligned}&r_{1k}=\mathbb {P}(F_{k}=1|f_{k}=1), \\&r_{2k}=\mathbb {E}(1/F_{k}|f_{k}=1). \end{aligned} \end{aligned}$$
    (1)

    for \(k\in \{1,\ldots ,K\}\). The first measure gives the probability that a sample unique is also a population unique. The second gives the probability that, if we select a sample unique and guess his identity uniformly at random among the matching population units, we guess correctly. The first measure is less conservative and is always smaller than the second.

  2.

    File level measures: they provide an overall measure of risk for the whole dataset and are usually defined by aggregating the record level measures. Popular examples are

    $$\begin{aligned} \tau _{1}=\sum _{k:f_{k}=1}r_{1k}, \ \ \ \tau _{2}=\sum _{k:f_{k}=1}r_{2k}. \end{aligned}$$
    (2)

These measures of disclosure risk are estimated using the data \((f_{1},\ldots ,f_{K})\) under different modelling choices. For example, Skinner and Shlomo (2008), Shlomo and Skinner (2010) consider the estimation of these measures under log-linear models for the population and sample frequencies. Under this model choice, the indexes (1) and (2) can be derived in closed form and estimated using plug-in MLE estimators. A different modelling approach, proposed in Manrique-Vallier and Reiter (2012, 2014), is to apply grade of membership models, which provide very accurate estimates of (2). For a fairly recent review of disclosure risk problems, the reader is referred to Matthews and Harel (2011).

If the estimated values of (1) and (2) are too high, then the data curator should apply a disclosure limitation technique to the dataset before releasing it to the public. Some possibilities are for example rounding, suppression of extreme values or entire variables, subsampling or perturbation techniques. See Willenborg and de Waal (2001) for a review of different disclosure limitation techniques.

2.2 Mutual information

Let X be a discrete random variable taking values on a finite set \({\mathcal {X}}\) and having probability mass function \(p_{X}(x)\). The (Shannon) entropy of X is defined as

$$\begin{aligned} H(X)=-\sum _{x\in {\mathcal {X}}}p_{X}(x)\log p_{X}(x)=-\mathbb {E}(\log p_{X}(X)) \end{aligned}$$

and it is a measure of uncertainty about the distribution of X. H(X) is always non-negative, takes value 0 when \(p_{X}\) is a point mass in one of the support points and it is maximized when \(p_{X}\) is uniform, \(p_{X}(x)=\frac{1}{|{\mathcal {X}}|}\)\(\forall x\in {\mathcal {X}}\), in which case \(H(X)=\log |{\mathcal {X}}|\).

Similarly, given two discrete random variables X and Z, their joint entropy is defined as

$$\begin{aligned} H(X,Z)=-\sum _{x\in {\mathcal {X}}}\sum _{z\in {\mathcal {Z}}}p_{(X,Z)}(x,z)\log p_{(X,Z)}(x,z), \end{aligned}$$

where \(p_{(X,Z)}\) denotes the joint mass function on \({\mathcal {X}}\times {\mathcal {Z}}\). H(X, Z) measures the joint uncertainty of X and Z taken together.

Moreover, the conditional entropy of Z given X is defined as

$$\begin{aligned} H(Z|X)&= - \sum _{x,z} p_{(X,Z)}(x,z) \log (p_{Z|X=x}(z)) \nonumber \\&= H(X,Z)-H(X) \end{aligned}$$
(3)

and quantifies the amount of information needed to describe the outcome of Z given that the value of X is known. If Z and X are independent, the conditional entropy H(Z|X) coincides with H(Z).

The mutual information between X and Z is defined as

$$\begin{aligned} I(X,Z)=\sum _{z\in {\mathcal {Z}}}\sum _{x\in {\mathcal {X}}}p_{(X,Z)}(x,z)\log \left( \frac{p_{(X,Z)}(x,z)}{p_{X}(x)p_{Z}(z)}\right) \end{aligned}$$

where \(p_{X},p_{Z},p_{(X,Z)}\) are respectively the marginal and joint distributions of X and Z. From the definition of I(X, Z) it follows that

$$\begin{aligned} I(X,Z)=D_{KL}(p_{(X,Z)}||p_{X}p_{Z}) \end{aligned}$$
(4)

where \(D_{KL}\) denotes the Kullback–Leibler divergence. Therefore, I(X, Z) measures the divergence between the joint distribution of X and Z and the product of their marginals. From (4), it also follows that \(I(X,Z)\ge 0\), and \(I(X,Z) = 0\) if and only if X and Z are independent.

An important equality connecting the mutual information I(X, Z) with the marginal and joint entropies is

$$\begin{aligned} I(X,Z)=H(X)+H(Z)-H(X,Z). \end{aligned}$$
(5)

This formula is the basis of the so-called 3H principle for estimating I(X, Z), in which the three entropy terms on the right hand side are estimated from the data and plugged into (5) to obtain an estimate \(\widehat{I(X,Z)}\).
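As an illustration, the following Python sketch implements a plug-in version of the 3H principle for two categorical samples; the function names are ours, and the simple empirical estimator shown here is only one of several possible choices.

```python
import numpy as np

def plugin_entropy(counts):
    """Empirical Shannon entropy (natural log) from a vector of cell counts."""
    p = counts / counts.sum()
    p = p[p > 0]                                  # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

def mutual_information_3h(x, z):
    """3H plug-in estimate: I(X,Z) = H(X) + H(Z) - H(X,Z), see Eq. (5)."""
    x, z = np.asarray(x), np.asarray(z)
    joint_counts = np.unique(np.stack([x, z], axis=1), axis=0, return_counts=True)[1]
    h_x = plugin_entropy(np.unique(x, return_counts=True)[1])
    h_z = plugin_entropy(np.unique(z, return_counts=True)[1])
    h_xz = plugin_entropy(joint_counts)
    return h_x + h_z - h_xz
```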

For a review on entropy, mutual information and their properties, see for example Gibbs and Su (2002), Gray (2011) and references therein.

2.3 Differential privacy

Differential Privacy is a notion recently proposed in the computer science literature by Dwork et al. (2006); see also Dwork and Roth (2014). It mathematically formalizes the idea that the presence or absence of an individual in the raw data should have a limited impact on the transformed data, in order for the latter to be considered privatized. Formally, let \(X_{1:n}=(X_{1},\ldots ,X_{n})\) be a set of observations, taking values in a state space \({\mathcal {X}}^{n}\subseteq \mathbb {R}^{n}\), containing sensitive information. A mechanism is simply a conditional distribution Q that, given the raw dataset \(X_{1:n}\), returns a transformed dataset \(Z_{1:k_{n}}=(Z_{1},\ldots ,Z_{k_{n}})\), taking values in \({\mathcal {Z}}^{k_n}\subseteq \mathbb {R}^{k_n}\), to be released to the public, where the sample sizes of \(X_{1:n}\) and \(Z_{1:k_{n}}\) are allowed to be different. Differential Privacy is a property of Q guaranteeing that it is very difficult for an intruder to recover the sensitive information in \(X_{1:n}\) by having access only to \(Z_{1:k_{n}}\); it is defined as follows.

Definition 1

(\(\alpha \)-Differential Privacy, Dwork et al. (2006)) The mechanism Q satisfies \(\alpha \)-Differential Privacy if

$$\begin{aligned} \underset{S\in \sigma ({\mathcal {Z}}^{n})}{\sup }\frac{Q(Z_{1:n}\in S|X_{1:n})}{Q(Z_{1:n}\in S|X'_{1:n})} \le \exp (\alpha ) \end{aligned}$$
(6)

for all \(X_{1:n},X'_{1:n}\in {\mathcal {X}}^{n}\) s.t. \(d_{H}(X_{1:n},X'_{1:n})=1\), where \(d_{H}\) denotes the Hamming distance, \(d_{H}(X_{1:n},X'_{1:n})=\sum _{i=1}^{n}\mathbb {I}(X_{i}\ne X'_{i})\) and \(\mathbb {I}\) is the indicator function of the event inside brackets.

For small values of \(\alpha \) the right hand side of (6) is approximately 1. Therefore, if Q satisfies Differential Privacy, (6) guarantees that the output database \(Z_{1:n}\) has essentially the same probability of having been generated from either one of two neighboring databases \(X_{1:n}\), \(X'_{1:n}\), i.e. databases differing in only one entry. See Rinott et al. (2018) for a statistical viewpoint on differential privacy.

Differential Privacy has been studied in a wide range of problems, differing among them in the way data is collected and/or released to the end user. The two most important classifications are between Global versus Local privacy, and Interactive versus Non-Interactive models. In the Global (or Centralized) model of privacy, each individual sends his data to the data curator, who privatizes the entire data set centrally. Alternatively, in the Local (or Decentralized) model, each user privatizes his own data before sending it to the data curator. In this latter model, the data also remain secret to the possibly untrusted curator. In the Non-Interactive (or Off-line) model, the transformed data set \(Z_{1:n}\) is released in one shot and each end user has access to it to perform his statistical analysis. In the Interactive (or On-line) model, however, no data set is directly released to the public; instead, each end user can ask queries f about \(X_{1:n}\) to the data holder, who replies with a noisy version of the true answer \(f(X_{1:n})\).

There have been many extensions and generalizations of the notion (6) of Differential Privacy proposed over the last ten years, in order to accommodate different areas of application and state spaces of the input and output data. Among them, we mention \((\alpha ,\delta )\)-Differential Privacy (Dwork and Roth 2014), vertex and edge Differential Privacy for network models (Borgs et al. 2015), zero-Concentrated Differential Privacy (Bun and Steinke 2016), randomised differential privacy (Happ et al. 2011) and \(\rho \)-Differential Privacy (Chatzikokolakis et al. 2013; Dimitrakakis et al. 2017), where the Hamming distance \(d_{H}\) in (6) is replaced by a possibly arbitrary distance \(\rho \), among many others. Since it is not possible to review all the many extensions of Differential Privacy here, we refer to Dwork and Roth (2014) for a fairly up-to-date review of different applications and extensions of Differential Privacy. To conclude this brief review, we recall one of the most important properties of any Differentially Private mechanism: post-processing, see Dwork and Roth (2014). This property guarantees that if the output \(Z_{1:n}\) of any \(\alpha \)-Differentially Private mechanism is further processed through another mechanism (depending only on \(Z_{1:n}\), and not on \(X_{1:n}\)), then the resulting output will also be \(\alpha \)-Differentially Private. Therefore, there is no risk of any leak of privacy simply by post-processing the released data \(Z_{1:n}\).

3 An information-theoretic approach to PRAM using differential privacy

The Post Randomization Method is a popular perturbation method for disclosure risk limitation. It is connected to the randomized response techniques described by Warner (1965). In the former approach, the raw data are perturbed by the data holder after having been collected, while in the latter, the perturbation is directly applied by the respondents during the interviewing process. We recall that PRAM was introduced by Kooiman et al. (1997) and further explored by Gouweleeuw et al. (1997) and De Wolf et al. (1997). Given raw microdata, PRAM produces a new dataset where some of the entries are randomly changed according to a prescribed probability mechanism. The randomness introduced by the mechanism implies that a matched record in the perturbed dataset may actually be a mismatch rather than a true match, hence making the usual disclosure matching attempts less reliable.

Shlomo and Skinner (2010) consider the problem of disclosure risk estimation when the microdata has gone through either a PRAM or data swapping process. They perturb the geographical key variable using a stochastic matrix M, i.e. every row of M sums to one, where \(M_{ij}\) provides the probability that an individual from location i is changed to location j. Shlomo and Skinner (2010) then proceed to discuss the problem of how to estimate the measures of risk presented in Sect. 2.1, but without providing any tangible rule on how to choose M, which is not the main goal of that paper.

In this work, we propose a novel approach to choosing the randomization matrix M in PRAM. Specifically, we propose to choose it as the solution of a constrained maximization problem, in which we maximize the mutual information between the raw data \(X_{1:n}\) and the released data \(Z_{1:n}\), under the constraint that the perturbation mechanism satisfies the Differential Privacy condition (6). Other optimization approaches for PRAM were already considered by Willenborg (1999) and Willenborg (2000), using different target functions and constraints. See also Section 5.5 of Willenborg and de Waal (2001). However, these choices usually result in difficult maximization problems and often rely on approximation methods.

We argue that the choice of Mutual Information and Differential Privacy has several advantages. First, Mutual Information and Differential Privacy are very natural notions and popular measures of information similarity and privacy guarantee that have been widely considered in Information Theory and Machine Learning. Second, as will be shown shortly, the resulting maximization problem reduces to a convex maximization problem under a set of linear constraints; hence it can be solved efficiently by well known optimization tools, like the Simplex method, which is implemented in most commonly used computational software, like Matlab or R. Finally, the level of privacy guaranteed by the proposed methodology is controlled by a single tuning parameter \(\alpha \), which can be chosen by the data curator to achieve the desired level of privacy in a very simple manner. In Sect. 4.1, we will show empirically how the choice of this parameter affects the estimation of the parameters, hence providing some evidence and guidance on how to choose it.

3.1 Model of PRAM

We propose to choose M as the solution of the following constrained maximization program

$$\begin{aligned} \max _{M \ \text {satisfies} \ (6)} I(X_{1:n},Z_{1:n}). \end{aligned}$$
(7)

We will consider the case of randomly changing the values of a key variable with S possible outcomes, e.g. the geographical location. \(X_{i}\in \{1,\ldots ,S\}\) is the corresponding categorical random variable, having probabilities \(p=(p_{1},\ldots ,p_{S})\), and therefore \(\mathbb {P}(X_{i}=j)=p_{j}\). We consider the class of all randomizing matrices of the following form

$$\begin{aligned} M= \begin{bmatrix} q_{1} &{}\quad \frac{1-q_{1}}{S-1} &{}\quad \cdots &{}\quad \frac{1-q_{1}}{S-1} \\ \frac{1-q_{2}}{S-1} &{}\quad q_{2} &{}\quad \cdots &{}\quad \frac{1-q_{2}}{S-1} \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ \frac{1-q_{S}}{S-1} &{}\quad \frac{1-q_{S}}{S-1} &{}\quad \cdots &{}\quad q_{S} \end{bmatrix} \end{aligned}$$
(8)

for an unknown parameter vector \(q=(q_{1},\ldots ,q_{S})\). This corresponds to the case in which, given that \(X_{i}\) belongs to category j, then its transformed value \(Z_{i}\) will either remain unchanged with probability \(q_{j}\), or will be changed to one of the other \(S-1\) categories, chosen uniformly at random, with probability \(1-q_{j}\). Therefore, the conditional distribution of \(Z_{i}\) given \(X_{i}\) is

$$\begin{aligned} Q(Z_{i}|X_{i})=q_{X_{i}}^{\mathbb {I}(Z_{i}=X_{i})}\left( \frac{1-q_{X_{i}}}{S-1}\right) ^{\mathbb {I}(Z_{i}\not = X_{i})}. \end{aligned}$$

To underline the dependency on the vector q, we will sometimes write \(Q_q\). It is easy to check that the marginal of \(Z_{i}\) is given by

$$\begin{aligned} \mathbb {P}(Z_{i}=j)= p_{j}q_{j}+\sum _{k\ne j}p_{k}\frac{1-q_{k}}{S-1} =: m_{j} \end{aligned}$$
(9)

for \(j \in \{1,\ldots ,S \}\). We remark that the vector \(m=(m_{1},\ldots ,m_{S})\) can be computed in linear time in the dimension S by first computing the quantity \( \sum _{k=1}^S p_{k}\frac{1-q_{k}}{S-1}. \)
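For concreteness, a possible O(S) implementation of this marginal, following Eq. (9), is sketched below; the function name is illustrative.

```python
import numpy as np

def marginal_of_Z(p, q):
    """Marginal m of Z under the PRAM matrix (8):
    m_j = p_j q_j + sum_{k != j} p_k (1 - q_k) / (S - 1), computed in O(S)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    S = p.size
    total = np.sum(p * (1.0 - q)) / (S - 1)         # sum over all k of p_k (1 - q_k)/(S - 1)
    return p * q + total - p * (1.0 - q) / (S - 1)  # subtract the k = j term from the sum
```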

In the non-interactive setting that we are considering, i.e. when \(Z_i\) only depends on \(X_i\), the conditional distribution of \(Z_{1:n}\) factorizes and can be written as

$$\begin{aligned} Q(Z_{1:n} | X_{1:n}) = \prod \limits _{i=1}^n Q(Z_i|X_i). \end{aligned}$$

Plugging it into (6), the Differential Privacy condition simplifies into

$$\begin{aligned} \underset{Z_{i},X_{i}\not = X_{i}'}{\sup } \frac{Q(Z_{i}|X_{i})}{Q(Z_{i}|X_{i}')} \le e^\alpha . \end{aligned}$$

Depending on the value of \(Z_{i}\), the quotient \(Q(Z_{i}|X_{i})/Q(Z_{i}|X_{i}')\) can take one of three values. If \(Z_{i}=X_{i}\), then it is equal to \((S-1)q_{X_{i}}/(1-q_{X_{i}'})\). If \(Z_{i}=X_{i}'\), then it is equal to \((1-q_{X_{i}})/(S-1)q_{X_{i}'}\). Finally, if \(Z_{i}\) is different from both \(X_{i}\) and \(X_{i}'\), then the quotient is equal to \( (1-q_{X_{i}})/(1-q_{X_{i}'})\). Therefore, the privacy condition specializes into the following set of constraints

$$\begin{aligned} \max \left( \frac{(S-1)q_{k}}{1-q_{k'}} , \frac{1-q_{k}}{(S-1)q_{k'}}, \frac{1-q_{k}}{1-q_{k' }} \mathbb {I}(S \ge 3) \right) \le e^\alpha \end{aligned}$$
(10)

for any couple \(k \not = k' \in \{1,\ldots ,S\}\). We notice that this set of conditions can be expressed as a linear constraint. Specifically,

Fact 1: There exists a matrix C and a vector \(b_\alpha \) (depending on \(\alpha \)) such that the set of differential privacy constraints (10) can be rewritten as the following linear constraint

$$\begin{aligned} C q^T \le b_\alpha , \end{aligned}$$
(11)

where q is the vector \(q=(q_1,\ldots ,q_S)\) and \(\le \) denotes entry-wise inequality. C and \(b_\alpha \) are given in the Appendix.
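The explicit C and \(b_\alpha \) used in the paper are given in its Appendix. Purely as an illustration of how such a linearization can be carried out, the sketch below stacks the pairwise inequalities (10) into a system \(Cq \le b_\alpha \) by multiplying each inequality by its non-negative denominator; the construction and the function name are ours and need not coincide with the Appendix's parametrization.

```python
import numpy as np

def dp_constraints(S, alpha):
    """One way to stack the pairwise inequalities (10) as C q <= b_alpha."""
    ea = np.exp(alpha)
    rows, rhs = [], []
    for k in range(S):
        for kp in range(S):
            if k == kp:
                continue
            # (S-1) q_k / (1 - q_k') <= e^alpha  <=>  (S-1) q_k + e^alpha q_k' <= e^alpha
            r = np.zeros(S); r[k] = S - 1; r[kp] = ea
            rows.append(r); rhs.append(ea)
            # (1 - q_k) / ((S-1) q_k') <= e^alpha  <=>  -q_k - e^alpha (S-1) q_k' <= -1
            r = np.zeros(S); r[k] = -1.0; r[kp] = -ea * (S - 1)
            rows.append(r); rhs.append(-1.0)
            if S >= 3:
                # (1 - q_k) / (1 - q_k') <= e^alpha  <=>  -q_k + e^alpha q_k' <= e^alpha - 1
                r = np.zeros(S); r[k] = -1.0; r[kp] = ea
                rows.append(r); rhs.append(ea - 1.0)
    return np.array(rows), np.array(rhs)
```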

In general, computing I(X, Z) takes on the order of \(|{\mathcal {X}}||{\mathcal {Z}}|\) operations, meaning that here it would be quadratic in S. However, due to the particular form of the matrix M considered here, this computation can be carried out in time linear in S. Let us recall from Sect. 2.2 that H(Z) denotes the Shannon entropy of the random variable Z and H(Z|X) the conditional entropy of Z given X. To underline the dependency on q, we denote \(f(q) := I(X,Z)\). We use the following known identity, which can immediately be derived from (5) and (3),

$$\begin{aligned} f(q) = I(X,Z) = H(Z) - H(Z|X), \end{aligned}$$

which leads to the simpler form

$$\begin{aligned} f(q)&= \sum _{x=1}^{S} p_x \left( q_x \log q_x + (1-q_x) \log \frac{1-q_x}{S-1}\right) \\&\qquad - \sum _{z=1}^{S} m_z \log m_z \end{aligned}$$

where we recall that \(m=(m_{1},\ldots ,m_{S})\) denotes the marginal distribution of Z given in (9). Let us start by noticing that f is minimal, equal to 0, for \(q_1=\cdots =q_S=\frac{1}{S}\), in which case Z is independent of X. In the Appendix, we show that f is convex in q, which, together with the differential privacy constraint (11), implies that problem (7) is a linearly constrained convex program, i.e. we are maximizing a convex function under a set of linear constraints. As a consequence, we obtain the following proposition.
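A direct O(S) implementation of the objective f(q) above, with our own helper for the convention \(0\log 0=0\), could look as follows; it is reused in the sketches of the later sections.

```python
import numpy as np

def xlogx(t):
    """t * log(t) with the convention 0 * log(0) = 0."""
    t = np.asarray(t, dtype=float)
    return t * np.log(np.where(t > 0, t, 1.0))

def f(q, p):
    """Objective f(q) = I(X,Z) = H(Z) - H(Z|X) for the PRAM matrix (8), in O(S) time."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    S = p.size
    m = p * q + np.sum(p * (1 - q)) / (S - 1) - p * (1 - q) / (S - 1)   # marginal of Z, Eq. (9)
    minus_h_z_given_x = np.sum(p * (xlogx(q) + xlogx(1 - q) - (1 - q) * np.log(S - 1)))
    h_z = -np.sum(xlogx(m))
    return h_z + minus_h_z_given_x
```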

Proposition 1

Any optimal solution q of the program (7) lies at a vertex of the convex polytope formed by the feasible points.

It follows from the previous proposition that finding the optimal matrix M of general form (8) requires finding the vertices of the feasible set. In Sect. 3.3 we will give some properties of this feasible set, which might make the search faster. In the following paragraph, we will provide the optimal M for several sub-cases of (8).

3.2 Examples

In this section we show how we can use Proposition 1 to give the explicit solutions of the program (7) for several particular examples of interest.

3.2.1 Binary key variable with symmetric M

We start from the simplest case of a categorical variable with only two possible categories, denoted \({\mathcal {X}} = \{0,1\}\), and a symmetric M with \(q_1=q_2=q\). We will abuse notation by writing q for both the scalar value in [0, 1] and the corresponding two-dimensional vector \((q,q)^T\) with both coordinates equal to this value. We are considering binary symmetric matrices of the following form,

$$\begin{aligned} M= \begin{bmatrix} q &{}\quad 1-q \\ 1-q &{}\quad q \end{bmatrix}. \end{aligned}$$

In this setting, the Differential Privacy condition (10) specializes into

$$\begin{aligned} \max \left( \frac{q}{1-q}, \frac{1-q}{q} \right) \le e^\alpha , \end{aligned}$$

which simplifies to \(q \in [\frac{1}{1+e^\alpha },\frac{e^\alpha }{1+e^\alpha }]\). In this situation, the constrained maximization problem can actually be solved analytically by differentiating the target function. However, from Proposition 1, it is already known that the optimal q lies on the boundary of the feasible region. Let \(\psi :\{0,1\}\rightarrow \{0,1\}\) be defined as \(\psi (x) = 1-x\). Since \(\psi \) is one-to-one, it follows that \(I(X,Z)=I(X,\psi (Z))\). Moreover, by noticing that \(\psi (Z)\) has conditional distribution \(Q_{1-q}\), we can deduce that \(I(X,\psi (Z))=f(1-q)\), and therefore \(f(q)=f(1-q)\). Hence, the optimal values of q are both boundary points, \(\frac{1}{1+e^\alpha }\) and \(\frac{e^\alpha }{1+e^\alpha }\).

Two interesting properties appear already in this simple example. First, there are two solutions to the program. Second, these solutions are independent of p, the marginal distribution of X. A numerical check of this symmetry is sketched below.
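The symmetry f(q) = f(1-q) and the optimality of the two boundary points can be verified numerically with the short self-contained sketch below, which uses an arbitrary illustrative marginal p for X.

```python
import numpy as np

alpha, (p1, p2) = 1.0, (0.3, 0.7)     # illustrative privacy level and marginal of X

def f_binary(q):
    """I(X,Z) = H(Z) - H(Z|X) for the binary symmetric matrix with diagonal q."""
    m1 = p1 * q + p2 * (1 - q)                                    # marginal of Z
    h_z_given_x = -(q * np.log(q) + (1 - q) * np.log(1 - q))
    h_z = -(m1 * np.log(m1) + (1 - m1) * np.log(1 - m1))
    return h_z - h_z_given_x

lo, hi = 1 / (1 + np.exp(alpha)), np.exp(alpha) / (1 + np.exp(alpha))
grid = np.linspace(lo, hi, 101)
vals = np.array([f_binary(q) for q in grid])
# both endpoints of the feasible interval attain the maximum, and f(lo) = f(hi)
assert np.isclose(f_binary(lo), f_binary(hi)) and vals.argmax() in (0, len(grid) - 1)
```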

3.2.2 Binary key variable with any M

The previous argument can be easily extended to the non-symmetric case,

$$\begin{aligned} M= \begin{bmatrix} q_{1} &{}\quad 1-q_{1} \\ 1-q_{2} &{}\quad q_{2} \end{bmatrix}. \end{aligned}$$

In this setting, the convex polytope generated by the linear constraints has four vertices; specifically, \((q_1,q_2)\) belongs to the following set

$$\begin{aligned} \left\{ (1,0), (0,1), \Big (\frac{1}{1+e^\alpha },\frac{1}{1+e^\alpha }\Big ), \Big (\frac{e^\alpha }{1+e^\alpha },\frac{e^\alpha }{1+e^\alpha }\Big ) \right\} \end{aligned}$$

If \((q_1,q_2)\) is equal to either (1, 0) or (0, 1), then the Mutual Information \(I(X_{1:n},Z_{1:n})\) is null, since \(Z_{i}\) is constant and hence independent of \(X_{i}\). Therefore, the only optimal solutions are the two symmetric matrices derived in the symmetric case.

3.2.3 Symmetric M

Let us now consider the case of a categorical variable with S categories and symmetric M. Specifically, we consider \({\mathcal {X}} = \{1,\ldots ,S\}\) and M of the form (8) with \(q_1=q_2=\cdots =q_S\). We again abuse notation by denoting with q both the scalar in [0, 1] and the corresponding S-dimensional vector with all entries equal to this value. The differential privacy condition (10) specializes into \(\max \left( \frac{(S-1)q}{1-q}, \frac{1-q}{(S-1)q} \right) \le e^\alpha \), which leads to \(q \in [\frac{e^{-\alpha }}{S-1+e^{-\alpha }},\frac{e^\alpha }{S-1+e^\alpha }]\). As before, it follows from Proposition 1 that the optimal q are the boundary values.

Table 1 Scenario I: the number of times the \(q_k\)’s assume the four possible values \(v_\alpha , v_{-\alpha }, v_{\mathrm{min}}\) and \(v_{\mathrm{max}}\), under different choices of \(\alpha \)
Fig. 1

Scenario I: estimates of the true probabilities generating the data. The x-axis encodes the \(S=10\) possible categories, for each one the yellow point represents the true probability \(p_k\), while the solid red line connects the estimated probabilities averaged over 100 iterations

3.3 Feasible set

In our experiments, we have found that routine optimization functions implemented in standard software, e.g. Matlab, can solve the optimization problem (7) extremely quickly. However, when the number of possible categories S becomes very large, the optimization might become time consuming. For this reason, in the following Proposition, we provide a description of all possible vectors \(q=(q_{1},\ldots ,q_{S})\) that can arise as vertices of the convex polytope generated by the Differential Privacy constraints (10) when S is large enough. This result should help speed up the search for the optimal vertices among all feasible points given by (10).

Proposition 2

For \(S \ge 4\), if \(\alpha \le \log (S+\sqrt{S(S-4)}-2)-\log 2\), then, up to permutations, the vertices of the convex polytope formed by all feasible points are:

  1.

    \(q_k = v_\alpha \), \(\forall k\in \{1,\ldots ,S\}\);

  2.

    \(q_k = v_{-\alpha }\), \(\forall k\in \{1,\ldots ,S\}\);

  3.

    \(q_{k} = v_{i_k \alpha }\), with \(i_k = \pm 1\) and \(2 \le \# \{k\ \text {s.t.} i_k = 1\} \le S-2\), \(\forall k\in \{1,\ldots ,S\}\);

  4.

    \(q_1 = v_{\min }\), \(q_{k} = v_\alpha \), \(\forall k\in \{2,\ldots ,S\}\);

  5.

    \(q_1 = v_{\max }\), \(q_{k} = v_{-\alpha }\), \(\forall k\in \{2,\ldots ,S\}\);

where \( v_x = \frac{e^{x}}{e^{x}+S-1}\), \(v_{\min } = \frac{e^{-\alpha }}{e^{\alpha }+S-1}\) and \(v_{\max } = \frac{e^{\alpha }}{e^{-\alpha }+S-1}\).

Common values of \(\alpha \) generally lie within the range [0, 2], which means that the conditions of the previous Proposition are satisfied when \(S \ge 10\). Contrary to the symmetric case, the optimal M will depend on p. In the following section, we will show some simulations illustrating, for some values of p, which of the vertices in Proposition 2 are optimal.
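Under the conditions of Proposition 2, the optimal q can therefore be found by brute-force enumeration of the candidate vertices instead of calling a generic solver. The sketch below implements such an enumeration (all coordinate permutations included) and reuses the function f(q, p) from the sketch in Sect. 3.1; it is only practical for moderate S, since the mixed patterns of case 3 grow exponentially with S.

```python
import itertools
import numpy as np

def candidate_vertices(S, alpha):
    """Vertex patterns of Proposition 2, with all coordinate permutations."""
    v = lambda x: np.exp(x) / (np.exp(x) + S - 1)
    v_min = np.exp(-alpha) / (np.exp(alpha) + S - 1)
    v_max = np.exp(alpha) / (np.exp(-alpha) + S - 1)
    # cases 1-3: each coordinate equals v(alpha) or v(-alpha), and the number of
    # coordinates equal to v(alpha) lies in {0, 2, 3, ..., S-2, S}
    for signs in itertools.product((alpha, -alpha), repeat=S):
        n_plus = sum(s > 0 for s in signs)
        if n_plus in (1, S - 1):
            continue
        yield np.array([v(s) for s in signs])
    # cases 4-5: one coordinate at v_min (resp. v_max), the others at v(alpha) (resp. v(-alpha))
    for k in range(S):
        q = np.full(S, v(alpha)); q[k] = v_min
        yield q
        q = np.full(S, v(-alpha)); q[k] = v_max
        yield q

def optimal_q(p, alpha):
    """Vertex maximizing the objective f(q, p) sketched in Sect. 3.1."""
    return max(candidate_vertices(len(p), alpha), key=lambda q: f(q, p))
```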

Table 2 Scenario II: the number of times the \(q_k\)’s assume the four possible values \(v_\alpha , v_{-\alpha }, v_{\mathrm{min}}\) and \(v_{\mathrm{max}}\), under different choices of \(\alpha \)
Fig. 2

Scenario II: estimates of the true probabilities generating the data. The x-axis encodes the \(S=10\) possible categories, for each one the yellow point represents the true probability \(p_k\), while the solid red line connects the estimated probabilities averaged over 100 iterations

4 Simulations

4.1 Simulation study

We consider different simulated scenarios, where the observations \(X_{1:n}\) are generated from a categorical distribution with S possible outcomes, having known probabilities \(p=(p_1, \ldots , p_S)\). In the first scenario, we set \(S=10\) and consider the following vector of probabilities

$$\begin{aligned} p=(0.3 ,0.1 ,0.2 ,0.08, 0.02 ,0.04, 0.06, 0.1, 0.01, 0.09). \end{aligned}$$

We consider the values \(\alpha =0.5, 1, 1.5, 2\), and we determine the corresponding optimal vectors \(q= (q_1, \ldots , q_S)\) that solve (7). We select a sample size \(n=10^4\). As explained in Sect. 3, the Differential Privacy condition can be expressed as a set of linear constraints as in (11). The optimal q is then determined numerically by solving the constrained maximization problem via the optimization routines in Matlab. We have also generated the corresponding privatized dataset \(Z_{1:n}\) using the determined values of \((q_1, \ldots , q_S)\) for the different choices of \(\alpha \). The determined values of q are reported in Table 1. From Proposition 2, we know that, up to permutations, there are only 5 possible different configurations and the \(q_k\)’s may assume only 4 different values, corresponding to \(v_\alpha , v_{-\alpha }, v_{\mathrm{min}}\) and \(v_{\mathrm{max}}\). Hence, in Table 1 we report the number of times the \(q_k\)’s assume these values for the different choices of \(\alpha \). In order to investigate the effect of differential privacy, for the four values of \(\alpha \) considered here, we report the MLE of the vector of probabilities p obtained using the observed sample \(Z_{1:n}\). The results are represented in Fig. 1; all simulations are averaged over 100 iterations. For each value of the categorical variable \( k \in \{1, \ldots , 10\}\), we report the estimated \(p_k\)’s, and each blue star corresponds to the MLE of \(p_k\) in one of the 100 experiments. The solid red line links the averaged estimates of the \(p_k\)’s over the 100 runs, while the true values of the probabilities \(p_k\) are represented in yellow. It is apparent that as \(\alpha \) increases, the estimates improve and their variability decreases; hence the higher \(\alpha \), the weaker the privacy mechanism.
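An end-to-end sketch of this scenario is given below; it reuses the vertex search optimal_q from the sketch after Proposition 2, and reconstructs p from the privatized sample by inverting the known matrix M, a simple moment-type estimator that need not coincide with the MLE reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def pram_matrix(q):
    """PRAM matrix (8): q_j on the diagonal, (1 - q_j)/(S - 1) off the diagonal in row j."""
    q = np.asarray(q, dtype=float)
    S = q.size
    M = np.tile(((1 - q) / (S - 1))[:, None], (1, S))
    np.fill_diagonal(M, q)
    return M

# Scenario I set-up: p and n as in the text, alpha set to one of the values considered
p = np.array([0.3, 0.1, 0.2, 0.08, 0.02, 0.04, 0.06, 0.1, 0.01, 0.09])
n, alpha = 10_000, 1.0

q = optimal_q(p, alpha)                 # vertex search from the previous sketch
M = pram_matrix(q)

# raw sample and its privatized version
X = rng.choice(p.size, size=n, p=p)
Z = np.array([rng.choice(p.size, p=M[x]) for x in X])

# reconstruct p from Z via the known M: the marginal of Z is m = M^T p
m_hat = np.bincount(Z, minlength=p.size) / n
p_hat = np.linalg.solve(M.T, m_hat)
```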

In the second scenario we have generated the data using the vector of probabilities

$$\begin{aligned} p&= (0.0336, 0.1059 , 0.1697 , 0.0962 , 0.0180 , \\&\qquad \qquad 0.0062 , 0.1097 , 0.0005 , 0.1233 , 0.3369). \end{aligned}$$

As before, we report the values of the \(q_k\)’s for different choices of \(\alpha \) in Table 2, while the estimated probabilities \(p_k\) are reported in Fig. 2. The simulations are averaged over 100 iterations.

We now consider a third scenario, in which \(S=30\) and the data are generated using a vector of probabilities p obtained by normalizing 30 independent gamma random variables with parameters (1, 5); more precisely, we have generated \(G_k \sim \mathrm{Gamma}(1,5)\) for \(k=1, \ldots , 30\) and set \(p_k := G_k/ \sum _{s=1}^S G_s\). As before, we report the vectors q for different values of \(\alpha \) in Table 3 and the estimated probabilities averaged over 100 iterations in Fig. 3, where again the sample size is \(n=10^4\).

Table 3 Scenario III: the number of times the \(q_k\)’s assume the four possible values \(v_\alpha , v_{-\alpha }, v_{\mathrm{min}}\) and \(v_{\mathrm{max}}\), under different choices of \(\alpha \)
Fig. 3

Scenario III: estimates of the true probabilities generating the data. The x-axis encodes the \(S=30\) possible categories, for each one the yellow point represents the true probability \(p_k\), while the solid red line connects the estimated probabilities averaged over 100 iterations

In the last scenario, Scenario IV, we assume again that \(S=30\) and we have generated the data using the vector of probabilities p defined by

$$\begin{aligned} p_1=0.05 , \quad p_k = 0.95/29 \, \text { for } k \ge 2. \end{aligned}$$

We report the vectors of q for different values of \(\alpha \) in Table 4 and the estimated probabilities averaged over 100 iterations in Fig. 4, where the sample size equals \(n=10^4\).

4.2 Real data

We finally test the performance of our strategy on some benchmark data from the public use microdata sample of the U.S. 2000 census for the state of New York (Ruggles et al. 2010). The data contain the values of ten categorical variables for 953,076 individuals: ownership of dwelling (3 levels), mortgage status (4 levels), age (9 levels), sex (2 levels), marital status (6 levels), single race identification (5 levels), educational attainment (11 levels), employment status (4 levels), work disability status (3 levels), and veteran status (3 levels).

Table 4 Scenario IV: the number of times the \(q_k\)’s assume the four possible values \(v_\alpha , v_{-\alpha }, v_{\mathrm{min}}\) and \(v_{\mathrm{max}}\), under different choices of \(\alpha \)
Fig. 4

Scenario IV: estimates of the true probabilities generating the data. The x-axis encodes the \(S=30\) possible categories, for each one the yellow point represents the true probability \(p_k\), while the solid red line connects the estimated probabilities averaged over 100 iterations

For ease of illustration, we consider the sex variable, which has two possible categories (female or male), so that \(S=2\) and \(q=q_1=q_2\). We have already seen that the optimal q lies on the boundary of the interval \(J_\alpha := [1/(e^\alpha +1), e^\alpha /(1+e^\alpha )]\). We have estimated the probabilities of the two possible categories using the sample mean, obtaining \(p_1=0.48\) and \(p_2=0.52\). In our numerical experiments we have considered \(\alpha =0.05\), and for different values of \(q \in J_\alpha \) we have generated the privatized dataset \(Z_{1:n}\) and estimated \(p_1\) and \(p_2\). More precisely, in Fig. 5 we consider six values of \(q \in [0.4875, 0.5125]\) and report the estimates of \(p_1\) and \(p_2\) averaged over 100 iterations. Each panel corresponds to a different q, and each blue star corresponds to the estimated value in one of the 100 experiments based on the privatized sample \(Z_{1:n}\). The solid blue line links the averaged estimates of the \(p_k\)’s over the 100 runs, while the true values of the probabilities are represented in yellow. From the top left to the bottom right, we have chosen \(q=0.4875, 0.4925, 0.4975, 0.5025, 0.5075, 0.5125\): in light of the theory developed in the paper, it is not surprising that the values on the boundary lead to more reliable estimates; indeed, they maximize the mutual information between X and Z. In Fig. 6 we report the estimated mutual information between X and Z for different values of \(q \in J_{0.05}\); to do so, we have estimated \(P_Z(k)\) and \(P_X(k)\) using the corresponding sample means for each \(k=1,2\).
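A rough sketch of the computation behind Fig. 6 is given below. Since the raw records are not reproduced here, X is resampled from the estimated marginal (0.48, 0.52), and the mutual information is estimated by the plug-in formula from the empirical joint and marginal frequencies; the six values of q and the value of \(\alpha \) are those quoted in the text, everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, n = 0.05, 953_076
p_hat = np.array([0.48, 0.52])                      # estimated marginal of the sex variable
qs = np.linspace(1 / (1 + np.exp(alpha)), np.exp(alpha) / (1 + np.exp(alpha)), 6)

X = rng.choice(2, size=n, p=p_hat)                  # surrogate for the raw binary variable
for q in qs:
    flip = rng.random(n) > q                        # each entry is flipped with probability 1 - q
    Z = np.where(flip, 1 - X, X)
    pX = np.bincount(X, minlength=2) / n
    pZ = np.bincount(Z, minlength=2) / n
    pXZ = np.histogram2d(X, Z, bins=2)[0] / n       # empirical joint distribution
    mi_hat = sum(pXZ[i, j] * np.log(pXZ[i, j] / (pX[i] * pZ[j]))
                 for i in range(2) for j in range(2) if pXZ[i, j] > 0)
    print(f"q = {q:.4f}, estimated I(X,Z) = {mi_hat:.2e}")
```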

Fig. 5

NY dataset: estimates of the true probabilities generating the data. The x-axis encodes the \(S=2\) possible categories (female or male), for each one the yellow point represents the true probability \(p_k\), while the solid blue line connects the estimated probabilities averaged over 100 iterations

5 Conclusions and future work

In this work, we have proposed a novel approach to choosing the randomizing matrix M in PRAM. This approach applies popular tools from computer science to derive M as the solution of a constrained optimization problem, in which the Mutual Information between raw and transformed data is maximized, under the constraint that the transformation satisfies Differential Privacy. The proposed approach has the advantage of being quick and easy to implement. Moreover, the desired level of privacy can be tuned by a single parameter \(\alpha \).

Fig. 6

NY dataset: mutual information as a function of q

There are different ways in which the present work could be extended. A first possible direction of research is to understand how to tune the Differential Privacy parameter \(\alpha \), which regulates the desired level of privacy, using the classical measures of risk (1) and (2). Specifically, given the choice of some model, \(\alpha \) can be chosen in such a way that the estimate of the disclosure risk index computed on the transformed dataset matches or falls below a particular threshold value. A second direction of research is to generalize the proposed procedure to the case in which a few categorical variables are jointly perturbed. The proposed methodology can be extended to this case following similar lines. In particular, an individual will be randomly moved from one frequency cell to another using a \(K\times K\) stochastic matrix, where K is the number of cells after cross-classifying the variables we want to jointly perturb. In this setting, it will be important to study what further structure the \(K\times K\) matrix should have in order to avoid combinations that are structural zeros. Furthermore, another direction of research is to examine the problem using other formulations of Differential Privacy. Specifically, the definition of Differential Privacy in (6) is known to provide a very strong privacy guarantee. Therefore, generalizing the proposed methodology to other formulations and relaxations of Differential Privacy, such as those mentioned in Sect. 2.3, might be an interesting topic.

Some other important research directions have been suggested by the reviewers. The proposed approach is focused on perturbing categorical variables, which are usually the most sensitive in terms of disclosure risk. However, one direction of research is to study the problem for other data types, possibly including continuous data. Another line of research is to study theoretical guarantees in terms of preservation of utility for different classes of queries. In computer science, a query usually means a statistic of the observations, or a function of some sufficient statistics. A crucial problem consists in quantifying and analysing the expected distance (risk) between some classes of queries computed on the raw and on the released dataset. Contributions in this direction include Smith (2011) and Duchi et al. (2018). A useful extension of the proposed methodology would focus on different structures for the matrix M, rather than the uniform off-diagonal rows in (8). An interesting example of application suggested by one reviewer, in which imposing non-uniform off-diagonal rows would be important, is in spatial modelling, when perturbing a geographical variable. In this context, a more suitable structure for M would allow the geographical category to have a higher probability of being swapped with a spatially neighboring category rather than with one very far from the true observed value. Within this context, the optimal choice of M will have to balance the higher randomization needed to achieve the same level of Differential Privacy against the benefit in statistical utility that follows from geographically localised perturbation for any later spatial analysis. Finally, another extension could be to include all variables in the mutual information in the maximization (7), while applying the privacy perturbation and the Differential Privacy constraint only to a subset of them. If the included and excluded variables are modelled as independent, the solution M of the maximization problem should be unaltered. In the dependent case, instead, the optimal solution M might depend also on the non-perturbed variables and the maximization problem could become analytically much more challenging.