An Information Theoretic approach to Post Randomization Methods under Differential Privacy

Post Randomization Methods (PRAM) are among the most popular disclosure limitation techniques for both categorical and continuous data. In the categorical case, given a stochastic matrix $M$ and a specified variable, an individual belonging to category $i$ is changed to category $j$ with probability $M_{i,j}$. Every approach to choosing the randomization matrix $M$ has to balance two desiderata: 1) preserving as much statistical information from the raw data as possible; 2) guaranteeing the privacy of individuals in the dataset. This trade-off has generally been shown to be very challenging to solve. In this work, we use recent tools from the computer science literature and propose to choose $M$ as the solution of a constrained maximization problem: we maximize the Mutual Information between raw and transformed data, under the constraint that the transformation satisfies the notion of Differential Privacy. For the general Categorical model, it is shown how this maximization problem reduces to a convex program under linear constraints and can therefore be solved with known optimization algorithms.


Introduction
Data from census or survey studies are among the most useful sources of information for social and political studies. However, when statistical and governmental agencies release microdata to the public, they often encounter ethical and moral issues concerning possible privacy leaks for individuals present in the dataset. Anonymization techniques, like encrypting or removing personally identifiable information, have been widely used with the hope of ensuring privacy protection. However, studies by Gymrek et al. (2013), Homer et al. (2008), Narayanan and Shmatikov (2008), Sweeney (1997) have shown that, even after removing directly identifying variables, like names or national insurance numbers, the potential for breaches of confidentiality is still present. Specifically, an intruder might still be able to identify individuals by cross-classifying categorical variables in the dataset and matching them with some external database. This kind of privacy problem has been widely considered in the statistical literature, and different measures of disclosure risk have been proposed to assess the riskiness of a specific dataset.
Different disclosure limitation techniques have been proposed, like rounding, suppression of extreme values or entire variables, sampling or perturbation techniques. Post Randomization Methods (PRAM) are among the most used techniques for disclosure risk limitation. See De , Gouweleeuw et al. (1997), Kooiman et al. (1997). With these techniques, before releasing the dataset, the data curator randomly changes the values of some categorical identifying variables, like gender, job or age, of some individuals in the dataset. In a recent paper, Shlomo and Skinner (2010) consider PRAM and random data swapping of a geographical variable and propose a way of computing measures of disclosure risk to assess whether these techniques have been effective in "privatizing" the dataset. The choice of the geographical variable is motivated by the fact that, by swapping or changing it, it is usually less likely to generate unreasonable combinations of categorical variables, like for instance a pregnant man or a 10 year old lawyer. In order to implement PRAM, they consider a stochastic matrix M , where the (i, j) entry of this matrix gives the probability that an individual from location i has his geographical variable swapped to location j. Given this known matrix M , Shlomo and Skinner (2010) suggest some measures of risk and related estimation methods. However, an open problem is to decide how the data curator should actually choose the matrix M in order to guarantee an effective level of privacy.
Over the last ten years, a new approach to data protection, called Differential Privacy (see Dwork et al. (2006)), has become more and more popular in the computer science literature and has been implemented by IT companies in their security protocols (Eland (2015), Erlingsson et al. (2014), Machanavajjhala et al. (2008)). This new framework finds its roots in the cryptography literature and prescribes transforming the original data, containing sensitive information, into a sanitized dataset using a channel or mechanism Q. The mechanism Q should be chosen carefully, in such a way that, by only looking at the released dataset, an intruder will have very low probability of guessing correctly the presence or absence of a specific individual in the raw data and, therefore, the privacy of the latter will be preserved. Differential Privacy formalizes this intuitive idea mathematically. We will provide a short review of it in Subsection 2.3.
In this work, we bring together ideas from the Disclosure Risk and Differential Privacy literature to propose a formal way of choosing the stochastic matrix M used in PRAM. Specifically, when choosing M, we need to balance two conflicting goals: 1) on the one hand, we want the application of M to make the dataset private; 2) on the other hand, we also want the released dataset to preserve as much statistical information as possible from the raw data. In order to balance this trade-off, we propose to choose M as the solution of a constrained maximization problem. We maximize the Mutual Information between the released and the raw dataset, hence guaranteeing preservation of statistical information and achieving goal 2). Mutual Information is a common measure of dependence between random variables used in probability and information theory. In order to also guarantee goal 1), we introduce a constraint in the maximization problem by imposing that the application of M satisfies Differential Privacy, so that the resulting mechanism based on M can formally be considered private. We show that this optimization problem is a convex maximization problem under linear constraints and can therefore be solved efficiently by known optimization algorithms.
The rest of this work is organized as follows. In Section 2, first we will briefly review the disclosure risk problem in Subsection 2.1 and then the tools needed for our approach. Specifically, we review Mutual Information in Subsection 2.2 and Differential Privacy in Subsection 2.3. In Section 3, we formalize the proposed constrained maximization problem to choose the stochastic matrix M in PRAM and show that this choice is made by solving a convex optimization problem under linear constraints.
Section 4 contains a simulation study showing first the effect of the Differential Privacy constraint on simulated data and then the effect of different choices of M using a real dataset from a survey of New York residents. Finally, a concluding remarks section closes the work. Proofs of the statements are deferred to the Appendix.
2 Literature Review

Disclosure Risk Limitation with categorical variables
In disclosure risk problems, we usually have microdata of n individuals, where for each individual we can observe two distinct sets of variables: 1) some variables, usually called sensitive variables, containing private information, e.g. health status or salary; 2) some identifying categorical variables, usually called key variables, e.g. gender, age or job. Disclosure problems arise because an intruder may be able to identify individuals in the dataset by cross-classifying their corresponding key variables and matching them to some external source of information. If the matching is correct, the intruder will be able to disclose the information contained in the sensitive variables.
Formally, let us assume we have $J$ categorical key variables in the dataset, observed for a sample of $n$ individuals, collected from a population of size $N$. The $j$-th variable has $n_j$ possible categories labelled, without loss of generality, from $1$ up to $n_j$. The observation for individual $i$, $X_i = (X_{i1}, \ldots, X_{iJ})$, therefore takes values in the state space $\mathcal{C} := \prod_{j=1}^{J} \{1, \ldots, n_j\}$. This set has $K := |\mathcal{C}| = \prod_{j=1}^{J} n_j$ values, corresponding to all possible cross-classifications of the $J$ key variables. The information about the sample is usually given through the sample frequency vector $(f_1, \ldots, f_K)$, where $f_i$ counts how many individuals of the sample have been observed with the particular combination of cross-classified key variables corresponding to cell $i$. $(F_1, \ldots, F_K)$ denotes the corresponding vector of frequencies when considering the whole population of $N$ individuals.
The earliest papers to consider disclosure risk problems include Bethlehem et al. (1990), Duncan and Lambert (1986), Duncan and Lambert (1989), Lambert (1993). These works propose different measures of disclosure risk and possible ways to estimate them under different model choices. Skinner and Elliot (2002), Skinner et al. (1994) review the most popular measures of disclosure risk.
These measures depend on the sample frequencies $(f_1, \ldots, f_K)$ and usually focus on small frequencies, especially those equal to 1, called sample uniques. The individuals belonging to these cells are those with the highest risk of having their sensitive information disclosed. Specifically, suppose that an individual is the only one, both in the sample and in the population, to have a specific combination of key variables. Then, if his key variables are matched to an external database, the match will be perfect, i.e. correct with probability one, and his sensitive information will therefore be disclosed.
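As an illustration of the quantities just described, the following minimal sketch computes the sample frequencies of the cross-classified cells and lists the sample uniques. The toy records and the function names `cell_frequencies` and `sample_uniques` are hypothetical and only serve the example; only observed cells are stored, so empty cells are implicitly given frequency zero.

```python
from collections import Counter

def cell_frequencies(records):
    """Count how many records fall in each cross-classified cell."""
    return Counter(tuple(r) for r in records)

def sample_uniques(records):
    """Return the cells that are observed exactly once in the sample."""
    freqs = cell_frequencies(records)
    return sorted(cell for cell, count in freqs.items() if count == 1)

# Hypothetical toy sample with J = 3 key variables coded as integers,
# e.g. (sex, age band, region).
sample = [(1, 2, 1), (1, 2, 1), (2, 3, 1), (1, 1, 2), (2, 3, 1), (2, 1, 2)]
freqs = cell_frequencies(sample)
uniques = sample_uniques(sample)
```

Whether a sample unique is also a population unique cannot be read off the sample alone, which is why the risk measures discussed in this subsection have to be estimated under a model.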
We usually distinguish between two groups of measures of disclosure risk: 1. Record Level (or per-record) measures: they assign a measure of risk to each data point.
Among the most popular, there are
$$r_{1k} = P(F_k = 1 \mid f_k = 1), \qquad (1)$$
$$r_{2k} = E(1/F_k \mid f_k = 1), \qquad (2)$$
for $k \in \{1, \ldots, K\}$. The first measure provides the probability that a sample unique is also a population unique. The second gives the probability that, if we select a sample unique and guess uniformly about his identity among the individuals in the same population cell, we pick him correctly. The first measure is less conservative and is never larger than the second.
These measures of disclosure risk are estimated using the data (f 1 , . . . , f K ) under different modelling choices. For example, Skinner and Shlomo (2008), Shlomo and Skinner (2010) (2014), is to apply grade of membership models, which provide very accurate estimates for (2). For a quite recent review on disclosure risk problems, the reader is referred to Matthews and Harel (2011).
If the estimated values of (1) and (2)

Mutual Information
Let $X$ be a discrete random variable taking values in a finite set $\mathcal{X}$ and having probability mass function $p_X(x)$. The (Shannon) entropy of $X$ is defined as
$$H(X) = -\sum_{x \in \mathcal{X}} p_X(x) \log p_X(x),$$
and it is a measure of uncertainty about the distribution of $X$. $H(X)$ is always non-negative, takes value 0 when $p_X$ is a point mass at one of the support points, and is maximized when $p_X$ is uniform, $p_X(x) = 1/|\mathcal{X}|$ for all $x \in \mathcal{X}$, in which case $H(X) = \log |\mathcal{X}|$. Similarly, given two discrete random variables $X$ and $Z$, their joint entropy is defined as
$$H(X, Z) = -\sum_{x \in \mathcal{X}} \sum_{z \in \mathcal{Z}} p_{(X,Z)}(x, z) \log p_{(X,Z)}(x, z),$$
where $p_{(X,Z)}$ denotes the joint mass function on $\mathcal{X} \times \mathcal{Z}$. $H(X, Z)$ measures the joint uncertainty of $X$ and $Z$ taken together. Moreover, the conditional entropy of $Z$ given $X$ is defined as
$$H(Z \mid X) = -\sum_{x \in \mathcal{X}} \sum_{z \in \mathcal{Z}} p_{(X,Z)}(x, z) \log \frac{p_{(X,Z)}(x, z)}{p_X(x)}, \qquad (3)$$
and quantifies the amount of information needed to describe the outcome of $Z$ given that the value of $X$ is known. If $Z$ and $X$ are independent, the conditional entropy $H(Z \mid X)$ coincides with $H(Z)$.
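The mutual information between $X$ and $Z$ is defined as
$$I(X, Z) = \sum_{x \in \mathcal{X}} \sum_{z \in \mathcal{Z}} p_{(X,Z)}(x, z) \log \frac{p_{(X,Z)}(x, z)}{p_X(x)\, p_Z(z)},$$
where $p_X$, $p_Z$, $p_{(X,Z)}$ are respectively the marginal and joint distributions of $X$ and $Z$. From the definition of $I(X, Z)$ it follows that
$$I(X, Z) = D_{KL}\left(p_{(X,Z)} \,\|\, p_X \otimes p_Z\right), \qquad (4)$$
where $D_{KL}$ denotes the Kullback-Leibler divergence. Therefore, $I(X, Z)$ measures the divergence between the joint distribution of $X$ and $Z$ and the product of their marginals. From (4), it also follows that $I(X, Z) \ge 0$, and $I(X, Z) = 0$ if and only if $X$ and $Z$ are independent.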
An important equality connecting the mutual information $I(X, Z)$ with the marginal and joint entropies is
$$I(X, Z) = H(X) + H(Z) - H(X, Z). \qquad (5)$$
This formula is the basis of the so-called 3H principle to estimate $I(X, Z)$, in which the three entropy terms on the right hand side are estimated from the data and plugged into (5) to obtain an estimate $\hat{I}(X, Z)$.
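The 3H principle can be sketched in a few lines. The snippet below is a minimal plug-in version, assuming paired samples of $X$ and $Z$ and empirical frequencies as entropy estimates (natural logarithm); the function names `entropy` and `mutual_information_3h` are illustrative choices and not part of the original text.

```python
import math
from collections import Counter

def entropy(counts):
    """Plug-in Shannon entropy (natural log) from a Counter of cell counts."""
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values() if c > 0)

def mutual_information_3h(xs, zs):
    """Estimate I(X, Z) as H(X) + H(Z) - H(X, Z) from paired samples."""
    h_x = entropy(Counter(xs))
    h_z = entropy(Counter(zs))
    h_xz = entropy(Counter(zip(xs, zs)))
    return h_x + h_z - h_xz

xs = [0, 0, 1, 1]
mi_copy = mutual_information_3h(xs, xs)             # Z is a copy of X
mi_indep = mutual_information_3h(xs, [0, 1, 0, 1])  # empirically independent
```

In the first call the estimate equals the entropy of $X$, while in the second the empirical joint distribution factorizes and the estimate is zero, in line with the properties recalled above.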
For a review on entropy, mutual information and their properties, see for example Gibbs and Su (2002); Gray (2011) and references therein.

Differential Privacy
Differential Privacy is a notion recently proposed in the computer science literature by Dwork et al. (2006); Dwork and Roth (2014) to mathematically formalize the idea that the presence or absence of an individual in the raw data should have a limited impact on the transformed data, in order for the latter to be considered privatized. Formally, let $X_{1:n} = (X_1, \ldots, X_n)$ be a set of observations, taking values in a state space $\mathcal{X}^n \subseteq \mathbb{R}^n$, containing sensitive information. A mechanism is simply a conditional distribution $Q$ that, given the raw dataset $X_{1:n}$, returns a transformed dataset $Z_{1:k_n} = (Z_1, \ldots, Z_{k_n})$, taking values in $\mathcal{Z}^{k_n} \subseteq \mathbb{R}^{k_n}$, to be released to the public, where the sample sizes of $X_{1:n}$ and $Z_{1:k_n}$ are allowed to be different. Differential Privacy is a property of $Q$ that guarantees that it should be very difficult for an intruder to recover the sensitive information of $X_{1:n}$ by having access only to $Z_{1:k_n}$, and is defined as follows.
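Definition 2.1 (α-Differential Privacy, Dwork et al. (2006)) The mechanism $Q$ satisfies α-Differential Privacy if
$$\sup_{A} \frac{Q(Z_{1:k_n} \in A \mid X_{1:n})}{Q(Z_{1:k_n} \in A \mid X'_{1:n})} \le \exp(\alpha), \qquad (6)$$
where the supremum is taken over the measurable subsets $A$ of $\mathcal{Z}^{k_n}$, for all $X_{1:n}, X'_{1:n} \in \mathcal{X}^n$ s.t. $d_H(X_{1:n}, X'_{1:n}) = 1$, where $d_H$ denotes the Hamming distance, $d_H(X_{1:n}, X'_{1:n}) = \sum_{i=1}^{n} I(X_i \neq X'_i)$, and $I$ is the indicator function of the event inside brackets.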
For small values of α the right hand side of (6) is approximately 1. Therefore, if $Q$ satisfies Differential Privacy, (6) guarantees that the output database $Z_{1:k_n}$ has basically the same probability of having been generated from either one of two neighboring databases $X_{1:n}$, $X'_{1:n}$, i.e. databases differing in only one entry. See Rinott et al. (2018) for a statistical viewpoint on differential privacy.
Differential Privacy has been studied in a wide range of problems, differing among them in the way data is collected and/or released to the end user. The two most important classifications are between Global vs Local privacy, and Interactive vs Non-Interactive models. In the Global (or Centralized) model of privacy, each individual sends his data to the data curator who privatizes the entire data set centrally. Alternatively, in the Local (or Decentralized) model, each user privatizes his own data before sending it to the data curator. In this latter model, data also remains secret to the possibly untrusted curator. In the Non-Interactive (or Off-line) model, the transformed data set Z 1:n is released in one spot and each end user has access to it to perform his statistical analysis. In the Interactive (or Online) model however, no data set is directly released to the public, but each end user can ask queries f about X 1:n to the data holder who will reply with a noisy version of the true answer f (X 1:n ).
Many extensions and generalizations of the notion (6) of Differential Privacy have been proposed over the last ten years, in order to accommodate different areas of application and state spaces of the input and output data. Among them, we mention (α, δ)-Differential Privacy (Dwork and Roth (2014)), vertex and edge Differential Privacy for network models (Borgs et al. (2015)), zero-Concentrated Differential Privacy (Bun and Steinke (2016)), randomised differential privacy (2017)), where the Hamming distance $d_H$ in (6) is replaced by possibly any distance ρ, and many others. However, since it is not possible to review all the extensions of Differential Privacy here, we refer to Dwork and Roth (2014) for a fairly up-to-date review of different applications and extensions of Differential Privacy. To conclude this brief review, we recall one of the most important properties of any Differentially Private mechanism: post-processing, see Dwork and Roth (2014). This property guarantees that if the output $Z_{1:n}$ of any α-Differentially Private mechanism is further processed and passed through another mechanism (depending only on $Z_{1:n}$, and not on $X_{1:n}$), then the resulting output will also be α-Differentially Private. Therefore, no privacy can be leaked simply by post-processing the released data $Z_{1:n}$.
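For a finite mechanism acting on a single record, condition (6) and the post-processing property can be checked numerically. The sketch below is an illustration under stated assumptions: the mechanism is represented by a row-stochastic matrix (rows are inputs, columns are outputs), neighboring datasets are simply two different input categories, and the function names `dp_level` and `compose` as well as the two example matrices are hypothetical.

```python
import math

def dp_level(M):
    """Smallest alpha with M[i][z] <= exp(alpha) * M[k][z] for all i, k, z.

    Returns math.inf if an output is possible under one input only.
    """
    alpha = 0.0
    for z in range(len(M[0])):
        column = [row[z] for row in M]
        hi, lo = max(column), min(column)
        if hi == 0.0:
            continue  # output z never occurs, so it imposes no constraint
        if lo == 0.0:
            return math.inf
        alpha = max(alpha, math.log(hi / lo))
    return alpha

def compose(M, P):
    """Channel obtained by applying M first and then P (matrix product M P)."""
    return [[sum(M[i][k] * P[k][j] for k in range(len(P)))
             for j in range(len(P[0]))] for i in range(len(M))]

M = [[0.7, 0.3], [0.3, 0.7]]   # privatizing channel
P = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical post-processing channel
alpha_M = dp_level(M)
alpha_MP = dp_level(compose(M, P))
```

Here `alpha_MP` does not exceed `alpha_M`, which is what the post-processing property predicts for this particular pair of channels.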
3 An information-theoretic approach to PRAM using Differential Privacy

The Post Randomization Method is a popular perturbation method for disclosure risk limitation. It is connected to the randomized response techniques described by Warner (1965). In the former approach, the raw data are perturbed by the data holder after having been collected, while in the latter, the perturbation is directly applied by the respondents during the interviewing process. We recall that PRAM was introduced by Kooiman et al. (1997) and further explored by Gouweleeuw et al. (1997) and De . Given raw microdata, PRAM produces a new dataset where some of the entries are randomly changed according to a prescribed probability mechanism. The randomness introduced by the mechanism implies that a match with a record in the perturbed dataset may actually be a mismatch instead of a true match, hence making usual disclosure matching attempts less reliable. Shlomo and Skinner (2010) consider the problem of disclosure risk estimation when the microdata has gone through either a PRAM or data swapping process. They perturb the geographical key variable using a stochastic matrix M, i.e. every row of M sums to one, where $M_{ij}$ provides the probability that an individual from location i is changed to location j. Shlomo and Skinner (2010) then proceed to discuss the problem of how to estimate the measures of risk presented in Subsection 2.1, but without providing any tangible rule on how to choose M, which is not the main goal of that paper.
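The perturbation step itself is straightforward to sketch. The snippet below applies a given stochastic matrix to a categorical column, record by record; the 0-based category coding, the function name `apply_pram` and the example matrix are assumptions made for illustration only.

```python
import random

def apply_pram(values, M, rng):
    """Replace each category i (0-based) by j drawn with probability M[i][j]."""
    categories = list(range(len(M)))
    return [rng.choices(categories, weights=M[v])[0] for v in values]

# Hypothetical 3-category stochastic matrix: every row sums to one.
M = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]

rng = random.Random(0)          # fixed seed only to make the run reproducible
raw = [0, 1, 2, 2, 1, 0, 0, 2]
released = apply_pram(raw, M, rng)
```

Each record is perturbed independently of the others, which corresponds to the non-interactive setting considered later in this section.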
In this work, we propose a novel approach to choose the randomization matrix M in PRAM. Specifically, we propose to choose it as the solution of a constrained maximization problem, in which we maximize the mutual information between raw data $X_{1:n}$ and released data $Z_{1:n}$, under the constraint that the perturbation mechanism satisfies the Differential Privacy condition (6). Other optimization approaches for PRAM were already considered by Willenborg (1999) and Willenborg (2000), using different target functions and constraints. See also Section 5.5 of Willenborg and de Waal (2001).
However, these choices usually result in difficult maximization problems and often rely on approximation methods.
We argue that the choice of Mutual Information and Differential Privacy have several advantages.
First, Mutual Information and Differential Privacy are very natural notions and popular measures of information similarity and privacy guarantee that have been widely considered in Information Theory and Machine Learning. Second, as will be shown shortly, the resulting maximization problem reduces to a convex maximization problem under a set of linear constraints, hence it can be solved efficiently by well known optimization tools, like the Simplex method, which is implemented in most of the commonly used computational software, like Matlab or R. Finally, the level of privacy guaranteed by the proposed methodology is controlled by a single tuning parameter α, which can be chosen by the data curator to achieve the desired level of privacy in a very simple manner. In Subsection 4.1, we will show empirically how the choice of this parameter affects the estimation of the parameters, hence providing some evidence and guidance on how to choose it.

Model of PRAM
We propose to choose M as the solution of the following constrained maximization program
$$\max_{M \text{ satisfies } (6)} I(X_{1:n}, Z_{1:n}). \qquad (7)$$
We will consider the case of randomly changing the values of a key variable with $S$ possible outcomes, e.g. the geographical location. $X_i \in \{1, \ldots, S\}$ is the corresponding categorical random variable, having probabilities $p = (p_1, \ldots, p_S)$, so that $P(X_i = j) = p_j$. We consider the class of all randomizing matrices of the following form
$$M = \begin{pmatrix} q_1 & \frac{1-q_1}{S-1} & \cdots & \frac{1-q_1}{S-1} \\ \frac{1-q_2}{S-1} & q_2 & \cdots & \frac{1-q_2}{S-1} \\ \vdots & & \ddots & \vdots \\ \frac{1-q_S}{S-1} & \frac{1-q_S}{S-1} & \cdots & q_S \end{pmatrix}, \qquad (8)$$
for an unknown parameter vector $q = (q_1, \ldots, q_S)$. This corresponds to the case in which, given that $X_i$ belongs to category $j$, its transformed value $Z_i$ will either remain unchanged with probability $q_j$, or will be changed to one of the other $S-1$ categories, chosen uniformly at random, with probability $1 - q_j$. Therefore, the conditional distribution of $Z_i$ given $X_i$ is
$$Q(Z_i = k \mid X_i = j) = q_j \, I(k = j) + \frac{1 - q_j}{S-1} \, I(k \neq j).$$
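To underline the dependency on the vector $q$, we will sometimes write $Q_q$. It is easy to check that the marginal of $Z_i$ is given by
$$m_k := P(Z_i = k) = p_k q_k + \sum_{j \neq k} p_j \frac{1 - q_j}{S-1}, \qquad k = 1, \ldots, S. \qquad (9)$$
In the non-interactive setting that we are considering, i.e. when $Z_i$ only depends on $X_i$, the conditional distribution of $Z_{1:n}$ factorizes and can be written as
$$Q(Z_{1:n} \mid X_{1:n}) = \prod_{i=1}^{n} Q(Z_i \mid X_i).$$
Plugging it into (6), the Differential Privacy condition simplifies into
$$\sup_{Z_i} \frac{Q(Z_i \mid X_i)}{Q(Z_i \mid X'_i)} \le e^{\alpha},$$
where $i$ is the only entry in which the two neighboring datasets differ. The quotient depends on the value of $Z_i$: if $Z_i = X_i$, it is equal to $(S-1) q_{X_i}/(1 - q_{X'_i})$; if $Z_i = X'_i$, it is equal to $(1 - q_{X_i})/((S-1) q_{X'_i})$; if $Z_i$ differs from both, then the quotient is equal to $(1 - q_{X_i})/(1 - q_{X'_i})$. Therefore, the privacy condition specializes into the following set of constraints
$$\frac{(S-1)\, q_k}{1 - q_{k'}} \le e^{\alpha}, \qquad \frac{1 - q_k}{(S-1)\, q_{k'}} \le e^{\alpha}, \qquad \frac{1 - q_k}{1 - q_{k'}} \le e^{\alpha}, \qquad (10)$$
for any couple $k \neq k' \in \{1, \ldots, S\}$, the last inequality being needed only when $S \ge 3$. We notice that this set of conditions can be expressed as a linear constraint. Specifically,

Fact 1: There exists a matrix $C$ and a vector $b_\alpha$ (depending on α) such that the set of differential privacy constraints (10) can be rewritten as the following linear constraint
$$C q \le b_\alpha, \qquad (11)$$
where $q$ is the vector $q = (q_1, \ldots, q_S)$ and $\le$ denotes entry-wise inequality. $C$ and $b_\alpha$ are given in the Appendix.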
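The construction above can be sketched as follows. The code builds a matrix of the uniform off-diagonal form described in this subsection, computes the marginal of $Z$ in a number of operations linear in $S$, and writes the pairwise privacy ratios (stay versus move, move versus stay, move versus move) as linear inequalities in $q$. The row layout produced by `dp_constraints` is one possible encoding obtained by clearing denominators; it is an assumption of this sketch and need not coincide with the matrix $C$ and vector $b_\alpha$ of the Appendix. All function names are hypothetical, and $q$ is assumed to lie in $[0,1]^S$.

```python
import math

def pram_matrix(q):
    """Keep category j with probability q[j], otherwise move uniformly."""
    S = len(q)
    return [[q[j] if k == j else (1.0 - q[j]) / (S - 1) for k in range(S)]
            for j in range(S)]

def marginal_z(p, q):
    """Marginal of Z: m_k = p_k q_k + sum_{j != k} p_j (1 - q_j) / (S - 1)."""
    S = len(q)
    leak = [p[j] * (1.0 - q[j]) / (S - 1) for j in range(S)]
    total = sum(leak)
    return [p[k] * q[k] + total - leak[k] for k in range(S)]

def dp_constraints(S, alpha):
    """Rows (c, b) meaning c . q <= b, for every ordered pair k != k'."""
    e = math.exp(alpha)
    rows = []
    for k in range(S):
        for kp in range(S):
            if k == kp:
                continue
            # (coefficient of q_k, coefficient of q_k', right-hand side)
            family = [(S - 1.0, e, e),               # (S-1) q_k <= e^a (1 - q_k')
                      (-1.0, -e * (S - 1.0), -1.0)]  # 1 - q_k <= e^a (S-1) q_k'
            if S > 2:
                family.append((-1.0, e, e - 1.0))    # 1 - q_k <= e^a (1 - q_k')
            for ck, ckp, b in family:
                c = [0.0] * S
                c[k], c[kp] = ck, ckp
                rows.append((c, b))
    return rows

def is_feasible(q, alpha, tol=1e-9):
    """True if q satisfies every linearized privacy constraint."""
    return all(sum(ci * qi for ci, qi in zip(c, q)) <= b + tol
               for c, b in dp_constraints(len(q), alpha))

alpha = 1.0
q_boundary = [math.exp(alpha) / (2 + math.exp(alpha))] * 3  # S = 3
checks = (is_feasible(q_boundary, alpha), is_feasible([0.99] * 3, alpha))
```

Keeping almost every record unchanged (`q = 0.99`) violates the constraints for α = 1, whereas the symmetric boundary value is feasible with the first family of inequalities active.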
In general, computing $I(X, Z)$ requires of the order of $|\mathcal{X}||\mathcal{Z}|$ operations, meaning that here it should be quadratic in $S$. However, due to the particular form of the matrix M considered here, this computation can be achieved linearly in $S$. Let us recall from Subsection 2.2 that $H(Z)$ denotes the Shannon entropy of the random variable $Z$ and $H(Z \mid X)$ the conditional entropy of $Z$ given $X$. To underline the dependency on $q$, we denote $f(q) := I(X, Z)$. We use the following known identity, which can immediately be derived from (5) and (3),
$$I(X, Z) = H(Z) - H(Z \mid X),$$
which leads to the simpler form
$$f(q) = -\sum_{k=1}^{S} m_k \log m_k + \sum_{j=1}^{S} p_j \left[ q_j \log q_j + (1 - q_j) \log \frac{1 - q_j}{S-1} \right],$$
where we recall that $m = (m_1, \ldots, m_S)$ denotes the marginal distribution of $Z$ given in (9). Let us start by noticing that $f$ is minimal, equal to 0, for $q_1 = \cdots = q_S = 1/S$. In the Appendix, we show that $f$ is convex in $q$, which, together with the differential privacy constraint (11), implies that the problem (7) is a linearly constrained convex program, i.e. we are maximizing a convex function under a set of linear constraints. As a consequence, the following proposition follows.

Proposition 3.1 Any optimal solution $q$ of the program (7) lies among the vertices of the convex polytope formed by all the feasible points.
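The linear-time evaluation of the mutual information can be checked against the quadratic-time definition. The sketch below is an illustration, assuming the uniform off-diagonal form of M described in this subsection and natural logarithms; `f_linear` uses the decomposition $H(Z) - H(Z \mid X)$, while `f_direct` sums over all pairs of categories. Both function names are hypothetical.

```python
import math

def xlogx(t):
    """t * log(t), with the usual convention 0 * log(0) = 0."""
    return t * math.log(t) if t > 0.0 else 0.0

def f_linear(p, q):
    """I(X, Z) = H(Z) - H(Z|X) in O(S) operations."""
    S = len(q)
    leak = [p[j] * (1.0 - q[j]) / (S - 1) for j in range(S)]
    total = sum(leak)
    m = [p[k] * q[k] + total - leak[k] for k in range(S)]
    h_z = -sum(xlogx(mk) for mk in m)
    # H(Z | X = j) = -[q_j log q_j + (1 - q_j) log((1 - q_j) / (S - 1))]
    h_z_given_x = -sum(
        p[j] * (xlogx(q[j]) + xlogx(1.0 - q[j])
                - (1.0 - q[j]) * math.log(S - 1))
        for j in range(S))
    return h_z - h_z_given_x

def f_direct(p, q):
    """Same quantity from the definition of I(X, Z), in O(S^2) operations."""
    S = len(q)
    M = [[q[j] if k == j else (1.0 - q[j]) / (S - 1) for k in range(S)]
         for j in range(S)]
    m = [sum(p[j] * M[j][k] for j in range(S)) for k in range(S)]
    return sum(p[j] * M[j][k] * math.log(M[j][k] / m[k])
               for j in range(S) for k in range(S) if p[j] * M[j][k] > 0.0)

p = [0.5, 0.3, 0.2]
q = [0.6, 0.7, 0.8]
gap = abs(f_linear(p, q) - f_direct(p, q))
at_uniform = f_linear(p, [1.0 / 3] * 3)
```

The value `at_uniform` vanishes up to rounding, consistent with the remark that $f$ is minimal at $q_1 = \cdots = q_S = 1/S$.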
It follows from the previous proposition that finding the optimal matrix M of general form (8) requires finding the vertices of the feasible set. In Section 3.3 we will give some properties of this feasible set, which might make the search faster. In the following paragraph, we will provide the optimal M for several sub-cases of (8).

Examples
In this section we show how we can use Proposition 3.1 to give the explicit solutions of the program (7) for several particular examples of interest.

Binary key variable with symmetric M
We start from the simplest case of a categorical variable with only two possible categories, denoted $\mathcal{X} = \{0, 1\}$, and symmetric M with $q_1 = q_2 = q$. We will abuse our notation by writing $q$ for both the scalar value in $[0, 1]$ and the corresponding two-dimensional vector $(q, q)^T$ having both coordinates equal to this value. We are considering binary symmetric matrices of the following form,
$$M = \begin{pmatrix} q & 1-q \\ 1-q & q \end{pmatrix}.$$
In this setting, the Differential Privacy condition (10) specializes into
$$\max\left\{ \frac{q}{1-q}, \frac{1-q}{q} \right\} \le e^{\alpha},$$
which simplifies to $q \in \left[ \frac{1}{1+e^{\alpha}}, \frac{e^{\alpha}}{1+e^{\alpha}} \right]$. In such a situation, the constrained maximization problem can actually be solved analytically by differentiating the target function. However, from Proposition 3.1, it is already known that the optimal $q$ is among the boundary points of the feasible region. Let $\psi : \{0, 1\} \to \{0, 1\}$ be defined as $\psi(x) = 1 - x$. Since $\psi$ is one-to-one, it follows that $I(X, Z) = I(X, \psi(Z))$. Moreover, by noticing that $\psi(Z)$ has conditional distribution $Q_{1-q}$, we can deduce that $I(X, \psi(Z)) = f(1 - q)$, and therefore $f(q) = f(1 - q)$. Hence, both boundary points, $\frac{1}{1+e^{\alpha}}$ and $\frac{e^{\alpha}}{1+e^{\alpha}}$, are optimal. There are two interesting properties appearing in this simple example. First, the program has two solutions. Second, these solutions are independent of $p$, the marginal of $X$.
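The binary symmetric case is small enough to be explored on a grid. The following sketch, with hypothetical names and an arbitrary choice of α and of the marginal of $X$, evaluates the mutual information over the feasible interval and picks the grid point with the largest value.

```python
import math

def binary_entropy(t):
    """Entropy (natural log) of a Bernoulli(t) variable."""
    return 0.0 if t in (0.0, 1.0) else -t * math.log(t) - (1 - t) * math.log(1 - t)

def f_binary(p1, q):
    """I(X, Z) for the binary symmetric matrix, with p1 = P(X = first category)."""
    m1 = p1 * q + (1 - p1) * (1 - q)   # probability that Z is the first category
    return binary_entropy(m1) - binary_entropy(q)

def feasible_interval(alpha):
    """Values of q allowed by the privacy constraint in the binary symmetric case."""
    e = math.exp(alpha)
    return 1 / (1 + e), e / (1 + e)

alpha, p1 = 1.0, 0.3                    # illustrative values
lo, hi = feasible_interval(alpha)
grid = [lo + (hi - lo) * t / 100 for t in range(101)]
best = max(grid, key=lambda q: f_binary(p1, q))
```

In this example the maximizer is an end point of the grid, the two end points give the same value up to rounding, and the value at $q = 1/2$ is zero, as the symmetry argument above predicts.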

Binary key variable with any M
The previous argument can be easily extended to the non-symmetric case,
$$M = \begin{pmatrix} q_1 & 1-q_1 \\ 1-q_2 & q_2 \end{pmatrix}.$$
In this setting, the convex polytope generated by the linear constraints has four vertices; specifically, $(q_1, q_2)$ belongs to the following set
$$\left\{ (1, 0),\ (0, 1),\ \left( \frac{1}{1+e^{\alpha}}, \frac{1}{1+e^{\alpha}} \right),\ \left( \frac{e^{\alpha}}{1+e^{\alpha}}, \frac{e^{\alpha}}{1+e^{\alpha}} \right) \right\}.$$
If $(q_1, q_2)$ is equal to either $(1, 0)$ or $(0, 1)$, then the Mutual Information $I(X_{1:n}, Z_{1:n})$ is null, since $Z_i$ will be constant and independent of $X_i$. Therefore, the only optimal solutions are the two symmetric matrices derived in the symmetric case.
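The vertex search suggested by Proposition 3.1 can be carried out by brute force in the binary case. The sketch below is only an illustration: it writes the four privacy inequalities for $S = 2$ in linear form (obtained here by clearing denominators, which is an assumption of this sketch), intersects every pair of boundary lines, keeps the feasible intersection points and evaluates the mutual information at each of them. All names and the values of α and $p$ are hypothetical.

```python
import math
from itertools import combinations

def binary_constraints(alpha):
    """Constraints a . q <= b for (q1, q2), from the privacy ratios with S = 2."""
    e = math.exp(alpha)
    return [((1.0, e), e), ((e, 1.0), e), ((-1.0, -e), -1.0), ((-e, -1.0), -1.0)]

def polytope_vertices(constraints, tol=1e-9):
    """Intersect every pair of boundary lines and keep the feasible points."""
    found = []
    for (a, b1), (c, b2) in combinations(constraints, 2):
        det = a[0] * c[1] - a[1] * c[0]
        if abs(det) < tol:
            continue  # parallel boundaries never meet
        q1 = (b1 * c[1] - a[1] * b2) / det   # Cramer's rule
        q2 = (a[0] * b2 - c[0] * b1) / det
        feasible = all(u[0] * q1 + u[1] * q2 <= v + tol for u, v in constraints)
        if feasible and all(abs(q1 - x) + abs(q2 - y) > 1e-7 for x, y in found):
            found.append((q1, q2))
    return found

def mutual_information(p, q):
    """I(X, Z) for M = [[q1, 1 - q1], [1 - q2, q2]] and P(X = k) = p[k]."""
    q = [min(1.0, max(0.0, t)) for t in q]   # guard against rounding noise
    M = [[q[0], 1.0 - q[0]], [1.0 - q[1], q[1]]]
    m = [p[0] * M[0][k] + p[1] * M[1][k] for k in range(2)]
    return sum(p[j] * M[j][k] * math.log(M[j][k] / m[k])
               for j in range(2) for k in range(2) if p[j] * M[j][k] > 0.0)

alpha, p = 1.0, (0.3, 0.7)               # illustrative values
verts = polytope_vertices(binary_constraints(alpha))
scores = {v: mutual_information(p, v) for v in verts}
best = max(scores.values())
optimal = [v for v, s in scores.items() if best - s < 1e-12]
```

For larger $S$ the number of candidate intersections grows quickly, so this exhaustive enumeration is meant as a check of the binary case rather than as a practical solver.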

Symmetric M
Let us now consider the case of a categorical variable with $S$ categories and symmetric M. Specifically, we consider $\mathcal{X} = \{1, \ldots, S\}$ and M of the form (8) with $q_1 = q_2 = \cdots = q_S$. We again abuse our notation by denoting with $q$ both the scalar in $[0, 1]$ and the corresponding $S$-dimensional vector with all entries equal to this value. The differential privacy condition (10) specializes into
$$\max\left\{ \frac{(S-1)\, q}{1-q}, \frac{1-q}{(S-1)\, q} \right\} \le e^{\alpha},$$
which leads to $q \in \left[ \frac{e^{-\alpha}}{S-1+e^{-\alpha}}, \frac{e^{\alpha}}{S-1+e^{\alpha}} \right]$. As before, following from Proposition 3.1, the optimal $q$ are the boundary values.
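The end points of the feasible interval in the symmetric case can be verified numerically: at both of them the largest ratio between a diagonal and an off-diagonal entry of M should equal $e^{\alpha}$. The sketch below uses illustrative values of $S$ and α and hypothetical function names.

```python
import math

def symmetric_interval(S, alpha):
    """Feasible values of q for the symmetric matrix with S categories."""
    lo = math.exp(-alpha) / (S - 1 + math.exp(-alpha))
    hi = math.exp(alpha) / (S - 1 + math.exp(alpha))
    return lo, hi

def worst_ratio(S, q):
    """Largest ratio between two entries in the same column of the symmetric M."""
    off = (1 - q) / (S - 1)   # common off-diagonal entry
    return max(q / off, off / q)

S, alpha = 5, 0.5             # illustrative values
lo, hi = symmetric_interval(S, alpha)
ratios = (worst_ratio(S, lo), worst_ratio(S, hi))
```

The interval always contains $q = 1/S$, at which every entry of M equals $1/S$ and the released variable is independent of the raw one.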

Feasible set
In our experiments, we have observed that routine optimization functions implemented in standard software, e.g. Matlab, can solve the optimization problem (7) extremely quickly. However, when the number of possible categories $S$ becomes very large, the optimization might become time consuming.
Common values of α are generally within the range [0, 2], which means that the conditions of previous Proposition are satisfied when S ≥ 10. Contrary to the symmetric case, the optimal M will depend on p. In the following section, we will show some simulations illustrating for some values of p which of the vertices in Proposition 3.2 are optimal.
We consider the following values of α = 0.5, 1, 1.5, 2, and we determine the corresponding optimal vectors $q = (q_1, \ldots, q_S)$ that solve (7). We select a sample size $n = 10^4$. As explained in Section 3, the Differential Privacy condition can be expressed as a set of linear constraints as in (11). The optimal $q$ is then determined numerically by solving the constrained maximization problem via the optimization function in Matlab. We have also generated the corresponding privatized dataset $Z_{1:n}$ using the determined values of $(q_1, \ldots, q_S)$ for the different choices of α. The determined values of $q$ are reported in Table 1. From Proposition 3.2, we know that, up to permutations, there are only 5 possible different scenarios and the $q_k$'s may assume only 4 different values, corresponding to $v_{\alpha}$, $v_{-\alpha}$, $v_{\min}$ and $v_{\max}$. Hence in Table 1 we have reported the number of times the $q_k$'s assume these values for the different choices of α. In order to investigate the effect of differential privacy, for the four

As before we report the values of the $q_k$'s for different choices of α in Table 2, and the estimated probabilities $p_k$'s are reported in Figure 2. The simulations are averaged over 100 iterations.

We consider now a third scenario, in which $S = 30$ and we have generated the data using the vector of probabilities $p$ obtained as a normalization of 30 independent gamma random variables with parameters (1, 5); more precisely, we have generated $G_k \sim \mathrm{Gamma}(1, 5)$ for $k = 1, \ldots, 30$ and we have put $p_k := G_k / \sum_{s=1}^{S} G_s$. As before we report the vectors of $q$ for different values of α in Table 3 and the estimated probabilities averaged over 100 iterations in Figure 3, where again $n = 10^4$ is the sample size.

In the last scenario IV, we assume again that $S = 30$ and we have generated the data using the vector of probabilities $p$ defined by $p_1 = 0.05$, $p_k = 0.95/29$ for $k \ge 2$.
We report the vectors of $q$ for different values of α in Table 4 and the estimated probabilities averaged over 100 iterations in Figure 4, where the sample size equals $n = 10^4$.
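A simplified version of such a simulation can be sketched as follows. This is not the procedure used to produce the tables and figures: for brevity the sketch uses the symmetric matrix with $q$ at the upper end of its feasible interval instead of the numerically optimized vector $q$, and it assumes a moment-type estimator of $p$ obtained by inverting the relation between the marginal of $Z$ and $p$, since the estimator is not spelled out in the text. The Gamma variables are drawn with shape 1 and scale 5, which is an assumed reading of the parametrization; the scale cancels after normalization. The function name `simulate` is hypothetical.

```python
import math
import random

def simulate(S=30, n=10_000, alpha=1.0, seed=0):
    """Draw p, privatize n records with a symmetric M, and estimate p from Z."""
    rng = random.Random(seed)
    g = [rng.gammavariate(1.0, 5.0) for _ in range(S)]   # shape 1, scale 5
    total = sum(g)
    p = [x / total for x in g]
    e = math.exp(alpha)
    q = e / (S - 1 + e)               # upper end of the symmetric feasible interval
    x = rng.choices(range(S), weights=p, k=n)
    z = []
    for xi in x:
        if rng.random() < q:
            z.append(xi)              # category left unchanged
        else:
            j = rng.randrange(S - 1)  # uniform over the other S - 1 categories
            z.append(j if j < xi else j + 1)
    off = (1 - q) / (S - 1)
    m_hat = [z.count(k) / n for k in range(S)]
    # Invert m_k = q p_k + off (1 - p_k); this estimator is an assumption.
    p_hat = [(mk - off) / (q - off) for mk in m_hat]
    return p, p_hat

p, p_hat = simulate()
```

The estimates sum to one by construction but individual entries can fall outside $[0, 1]$, because the inversion amplifies sampling noise when $q$ is close to $1/S$.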

Real Data
We finally test the performance of our strategy on some benchmark datasets from the public use microdata sample of the U.S. 2000 census for the state of New York, Ruggles et al. (2010). The data contains the values of ten categorical variables of 953076 individuals: ownership of dwelling (3 levels), mortgage status (4 levels), age (9 levels), sex (2 levels), marital status (6 levels), single race identification (5 levels), educational attainment (11 levels), employment status (4 levels), work disability status (3 levels), and veteran status (3 levels).
For ease of illustration we consider the sex variable, which has two possible categories (female or male), therefore $S = 2$ and $q = q_1 = q_2$. We have already seen that the optimal $q$ lies on the boundaries of the interval $J_\alpha := [1/(e^{\alpha} + 1), e^{\alpha}/(1 + e^{\alpha})]$. We have estimated the probabilities of the two possible categories using the sample mean, thus obtaining $p_1 = 0.48$ and $p_2 = 0.52$. In our numerical experiments we have considered α = 0.05, and for different values of $q \in J_\alpha$ we have generated the privatized dataset $Z_{1:n}$, estimating $p_1$ and $p_2$. More precisely, in Figure  the theory developed in the paper it is not surprising to realize that the values on the boundary lead to more reliable estimates, indeed they maximize the mutual information between $X$ and $Z$. In Figure 6 we report the estimated mutual information between $X$ and $Z$ for different values of $q \in J_{0.05}$; in order to do that, we have estimated $P_Z(k)$ and $P_X(k)$ using the corresponding sample means for each $k = 1, 2$.
In this work, we have proposed a novel approach to choose the randomizing matrix M in PRAM. This approach applies popular tools from computer science to derive M as the solution of a constrained optimization problem, in which the Mutual Information between raw and transformed data is maximized, under the constraint that the transformation satisfies Differential Privacy. The proposed approach has the advantage of being quick and easy to implement. Also, the desired level of privacy can be tuned by a single parameter α.
There are different ways in which the present work could be extended. A first possible direction of research is to understand how to tune the Differential Privacy parameter α, which regulates the desired level of privacy, using the classical measures of risk (1) and (2). Specifically, given the choice of some model, α can be chosen in such a way that the estimate of the disclosure risk index computed on the transformed dataset matches or falls below a particular threshold value. A second direction of research is to generalize the proposed procedure to the case in which a few categorical variables are jointly perturbed. The proposed methodology can be extended to this case along similar lines. In particular, an individual will be randomly moved from one frequency cell to another using a K × K stochastic matrix, where K is the number of cells after cross-classifying the variables we want to jointly perturb. In this setting, it will be important to study what further structure the K × K matrix should have in order to avoid combinations corresponding to structural zeros. Further, another direction of research is to examine the problem using other formulations of Differential Privacy. Specifically, the definition of Differential Privacy as in (6) is known to provide a very strong privacy guarantee. Therefore, generalizing the proposed methodology to other formulations and relaxations of Differential Privacy, such as those mentioned in Subsection 2.3, might be an interesting topic.
Some other important research directions have also been suggested by the reviewers. Specifically, the proposed approach focuses on perturbing categorical variables, which are usually the most sensitive in terms of disclosure risk. However, one direction of research is to study the problem for other data types, possibly including some continuous data. Another line of research is to study theoretical guarantees in terms of preservation of utility for different classes of queries. In computer science, a query usually means a statistic of the observations, or a function of some sufficient statistics. A crucial problem consists in quantifying and analysing the expected distance (risk) between some classes of queries computed on the raw and on the realised dataset. Related contributions in this direction are Smith (2011) and Duchi et al. (2018). A useful extension of the proposed methodology would focus on different structures for the matrix $M$, rather than uniform off-diagonal rows as in (8). An interesting example of application suggested by one reviewer, in which imposing non-uniform off-diagonal rows would be important, is spatial modelling, when perturbing a geographical variable. In this context, a more suitable structure for $M$ would give the geographical category a higher probability of being swapped with a spatially neighboring category than with one very far from the true observed value. Within this context, the optimal choice of $M$ will have to balance the higher randomization needed to achieve the same level of Differential Privacy against the benefit in statistical utility that follows from geographically localised perturbation for any later spatial analysis. Finally, another extension could be to include all variables in the mutual information in the maximization (7), while applying the privacy perturbation and the Differential Privacy constraint only to a subset of them. If the included and excluded variables are modelled as independent, the solution $M$ of the maximization problem should be unaltered. Instead, in the dependent case, the optimal solution $M$ might also depend on the non-perturbed variables, and the maximization problem could become analytically much more challenging.

Acknowledgment
The authors thank the Associate Editor and two anonymous referees, whose constructive comments and suggestions helped to improve the paper.

To underline the dependency on $q$, we will sometimes use the notation $Q_q$. We need to show that $f$ is convex. Let $q = (q_1, \dots, q_S)^T$, $q' = (q'_1, \dots, q'_S)^T$ and $\theta \in [0, 1]$. Let $k, l \in \{1, \dots, S\}$ such that $k \neq l$. Checking the diagonal entries and the off-diagonal entries $(k, l)$ separately shows that every entry of $Q_q$ is an affine function of $q$. Therefore, $Q_{\theta q+(1-\theta)q'} = \theta Q_q + (1-\theta) Q_{q'}$. It is known that, for a fixed marginal distribution of one of the variables, the mutual information is convex in the conditional distribution of the second; see for example Theorem 2.7.4 of Cover and Thomas (2012). Therefore, $f(\theta q+(1-\theta)q') \leq \theta f(q)+(1-\theta) f(q')$, and hence $f$ is convex.

Fact 1: Set of feasible parameters q
We start by writing explicitly the linear constraints (11) on $q$. Let $T_\alpha$ be the convex polytope of all $q$ satisfying $\alpha$-differential privacy. Let $S_\alpha$ be the planar polygon defined by the set of equations (12) to (17). The set of feasible points is then characterized by $(q_1, \dots, q_S) \in T_\alpha \iff \forall (k, l),\ (q_k, q_l) \in S_\alpha$. This set, characterized by the $3S(S-1)$ linear constraints given by equations (12) to (17), can thus be defined as the set of solutions of the equation
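Lemma 5.1 For $S \geq 2$, if $q$ satisfies differential privacy, then at most one of its coordinates is larger than $v_\alpha$ and at most one is smaller than $v_{-\alpha}$.

Proof This follows directly from the constraint (10). Indeed, suppose that $q_i > \frac{e^{\alpha}}{S-1+e^{\alpha}}$. Then, for any other $q_j$, formula (19) implies that $q_j$ cannot exceed $v_\alpha$. Similarly, suppose that $q_i < \frac{e^{-\alpha}}{S-1+e^{-\alpha}}$. Then, for any other $q_j$, formula (21) implies that $q_j$ cannot fall below $v_{-\alpha}$.

Let $(q_1, \dots, q_S) \in T_\alpha$. Using the previous Lemma, we know that, up to permutations, one of four settings is possible: (i) all coordinates lie in $[v_{-\alpha}, v_\alpha]$; (ii) $q_1 > v_\alpha$ and all other coordinates lie in $[v_{-\alpha}, v_\alpha]$; (iii) $q_1 < v_{-\alpha}$ and all other coordinates lie in $[v_{-\alpha}, v_\alpha]$; (iv) $q_1 > v_\alpha$, $q_2 < v_{-\alpha}$ and all remaining coordinates lie in $[v_{-\alpha}, v_\alpha]$.

The first setting is the most straightforward: since $v_{\min} < v_{-\alpha}$ and $v_{\max} > v_\alpha$, all the points $(v_{i_k \alpha})_{1 \leq k \leq S}$, for any sequence $(i_k)_{1 \leq k \leq S} \in \{-1, 1\}^S$, are within the convex hull of $V$, and so is the whole hypercube generated by those $2^S$ points.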
The second and third settings have similar proofs, which we make explicit for the second setting. As said in the previous remark, the point $(v_\alpha, v_{-\alpha}, \dots, v_{-\alpha})$ belongs to the convex hull of $V$. Let $k \geq 2$; we know that $q_k \geq v_{-\alpha}$. Besides, since $(q_k, q_1) \in S_\alpha$, (12) gives that $(q_k, q_1)$ is below the line passing through $(v_{-\alpha}, v_{\max})$ and $(v_\alpha, v_\alpha)$. Hence, denoting $\theta = \frac{q_1 - v_\alpha}{v_{\max} - v_\alpha}$, we find that $v_{-\alpha} \leq q_k \leq \theta v_{-\alpha} + (1-\theta) v_\alpha$. Therefore, we only need to show that any point $(q_1, x_2, \dots, x_S)$ is in the convex hull of $V$ for any sequence $(x_k)_{k \geq 2} \in \{v_{-\alpha}, \theta v_{-\alpha} + (1-\theta) v_\alpha\}^{S-1}$.