Covariance's Loss is Privacy's Gain: Computationally Efficient, Private and Accurate Synthetic Data

The protection of private information is of vital importance in data-driven research, business, and government. The conflict between privacy and utility has triggered intensive research in the computer science and statistics communities, which have developed a variety of methods for privacy-preserving data release. Among the main concepts that have emerged are anonymity and differential privacy. Today, another solution is gaining traction: synthetic data. However, the road to privacy is paved with NP-hard problems. In this paper we focus on the NP-hard challenge of developing a synthetic data generation method that is computationally efficient, comes with provable privacy guarantees, and rigorously quantifies data utility. We solve a relaxed version of this problem by studying a fundamental, but at first glance completely unrelated, problem in probability concerning the concept of covariance loss. Namely, we find a nearly optimal and constructive answer to the question of how much information is lost when we take conditional expectation. Surprisingly, this excursion into theoretical probability produces mathematical techniques that allow us to derive constructive, approximately optimal solutions to difficult applied problems concerning microaggregation, privacy, and synthetic data.


Introduction
"Sharing is caring", we are taught.But if we care about privacy, then we better think twice what we share.As governments and companies are increasingly collecting vast amounts of personal information (often without the consent or knowledge of the user [37]), it is crucial to ensure that fundamental rights to privacy of the subjects the data refer to are guaranteed 1 .We are facing the problem of how to release data that are useful to make accurate decisions and predictions without disclosing sensitive information on specific identifiable individuals.
The conflict between privacy and utility has triggered intensive research in the computer science and statistics communities, which have developed a variety of methods for privacy-preserving data release. Among the main concepts that have emerged are anonymity and differential privacy [6]. Today, another solution is gaining traction: synthetic data [4]. However, the road to privacy is paved with NP-hard problems. For example, finding the optimal partition into k-anonymous groups is NP-hard [24]. Optimal multivariate microaggregation is NP-hard [26,33] (albeit the error metric used in these papers differs from the one used in our paper). Moreover, assuming the existence of one-way functions, there is no polynomial-time, differentially private algorithm for generating Boolean synthetic data that preserves all two-dimensional marginals with accuracy o(1) [35].
No matter which privacy-preserving strategy one pursues, the challenge in implementing it is to navigate this NP-hard privacy jungle and develop a method that is computationally efficient, comes with provable privacy guarantees, and rigorously quantifies data utility. This is the main topic of our paper.

State of the art
Anonymity captures the understanding that it should not be possible to re-identify any individual in the published data [6]. One of the most popular ways of trying to ensure anonymity is via the concept of k-anonymity [32,31]. A dataset has the k-anonymity property if the information for each person contained in the dataset cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the dataset. Although the privacy guarantees offered by k-anonymity are limited, its simplicity has made it a popular part of the arsenal of privacy-enhancing technologies, see e.g. [6,10,21,16]. k-anonymity is often implemented via the concept of microaggregation [7,19,30,17,6]. The principle of microaggregation is to partition a dataset into groups of at least k similar records and to replace the records in each group by a prototypical record (e.g. the centroid).
Finding the optimal partition into k-anonymous groups is an NP-hard problem [24]. Several practical algorithms exist that produce acceptable empirical results, albeit without any theoretical bounds on the information loss [7,6,25]. In light of the popularity of k-anonymity, it is thus quite surprising that designing a computationally efficient algorithm for k-anonymity that comes with theoretical utility guarantees remains an open problem.
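The microaggregation principle is easy to sketch in code. The following is a minimal illustration only: a naive greedy grouping in the spirit of the practical heuristics cited above, not the partition construction analyzed in this paper, and with no bound on information loss. The helper name `microaggregate` is ours.

```python
import numpy as np

def microaggregate(x, k):
    """Naive k-anonymous microaggregation: greedily form groups of at
    least k similar records and replace each record by its group centroid."""
    n = len(x)
    remaining = list(range(n))
    groups = []
    while len(remaining) >= 2 * k:
        pts = x[remaining]
        # seed a group with the record farthest from the center of what is left
        far = remaining[int(np.argmax(np.linalg.norm(pts - pts.mean(axis=0), axis=1)))]
        # group it with its k-1 nearest neighbors among the remaining records
        dists = np.linalg.norm(x[remaining] - x[far], axis=1)
        group = [remaining[i] for i in np.argsort(dists)[:k]]
        groups.append(group)
        remaining = [i for i in remaining if i not in group]
    groups.append(remaining)  # leftover group has between k and 2k - 1 records
    y = x.astype(float).copy()
    for g in groups:
        y[g] = x[g].mean(axis=0)  # prototypical record: the centroid
    return y, groups
```

Each output record coincides with at least k − 1 others (its group members), which is exactly the k-anonymity property implemented by microaggregation.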
As k-anonymity is prone to various attacks, differential privacy is generally considered a more robust type of privacy.
Differential privacy formalizes the intuition that the presence or absence of any single individual record in the database or dataset should be unnoticeable when looking at the responses returned for the queries [9]. Differential privacy is a popular and robust notion that comes with a rigorous mathematical framework and provable guarantees. It can protect aggregate information, but not sensitive information in general. Differential privacy is usually implemented via noise injection, where the noise level depends on the query sensitivity. However, the added noise negatively affects the utility of the released data.
As pointed out in [6], microaggregation is a useful primitive for building bridges between privacy models. It is a natural idea to combine microaggregation with differential privacy [30,29] to address some of the privacy limitations of k-anonymity. As before, the fundamental question is whether there are computationally efficient methods to implement this scheme while also maintaining utility guarantees.
Synthetic data are generated (typically via some randomized algorithm) from existing data such that they maintain the statistical properties of the original dataset, but without the risk of exposing sensitive information. Combining synthetic data with differential privacy is a promising means to overcome key weaknesses of the latter [13,4,15,20]. Clearly, we want the synthetic data to be faithful to the original data, so as to preserve utility. To quantify this faithfulness, we need similarity metrics. A common and natural choice for tabular data is to try to (approximately) preserve low-dimensional marginals [3,34,8].
We model the true data x_1, …, x_n as a sequence of n points from the Boolean cube {0,1}^p, which is a standard benchmark data model [3,35,13,27]. For example, this can be n health records of patients, each containing p binary parameters (smoker/nonsmoker, etc.). We are seeking to transform the true data into synthetic data y_1, …, y_m ∈ {0,1}^p that are both differentially private and accurate.
As mentioned before, we measure accuracy by comparing the marginals of the true and synthetic data. A d-dimensional marginal of the true data has the form

(1/n) Σ_{k=1}^n x_k(j_1) ⋯ x_k(j_d)    (1.1)

for some given indices j_1, …, j_d ∈ [p]. In other words, a low-dimensional marginal is the fraction of the patients whose d given parameters all equal 1. The one-dimensional marginals encode the means of the parameters, and the two-dimensional marginals encode the covariances.
The accuracy of the synthetic data for a given marginal can be defined as the difference

(1/m) Σ_{k=1}^m y_k(j_1) ⋯ y_k(j_d) − (1/n) Σ_{k=1}^n x_k(j_1) ⋯ x_k(j_d).

Clearly, the accuracy is bounded by 1 in absolute value.
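For concreteness, here is how the marginals and the accuracy just defined can be computed for Boolean data. This is a small illustration; the helper names are ours.

```python
import numpy as np
from itertools import combinations

def marginal(data, idx):
    """Fraction of records whose coordinates in idx are all equal to 1
    (a d-dimensional marginal, d = len(idx))."""
    return float(np.mean(np.all(data[:, list(idx)] == 1, axis=1)))

def accuracy(true_data, synth_data, idx):
    """Signed error of one marginal; always in [-1, 1]."""
    return marginal(synth_data, idx) - marginal(true_data, idx)

def mean_square_accuracy(true_data, synth_data, d):
    """Average squared error over all (p choose d) d-dimensional marginals."""
    p = true_data.shape[1]
    errs = [accuracy(true_data, synth_data, idx) ** 2
            for idx in combinations(range(p), d)]
    return float(np.mean(errs))
```

The quantity `mean_square_accuracy` is exactly the averaged squared accuracy that our accuracy goal will be stated in terms of.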

Our contributions
Our goal is to design a randomized algorithm that satisfies the following list of desiderata: (i) (synthetic data): the algorithm outputs a list of vectors y 1 , . . ., y m ∈ {0, 1} p ; (ii) (efficiency): the algorithm requires only polynomial time in n and p; (iii) (privacy): the algorithm is differentially private; (iv) (accuracy): the low-dimensional marginals of y 1 , . . ., y m are close to those of x 1 , . . ., x n .
There are known algorithms that satisfy any three of the above four requirements if we restrict the accuracy condition (iv) to two-dimensional marginals.
Indeed, if (i) is dropped, one can first compute the mean x̄ = (1/n) Σ_{k=1}^n x_k and the covariance matrix (1/n) Σ_{k=1}^n (x_k − x̄)(x_k − x̄)^T, add some noise to achieve differential privacy, and output i.i.d. samples from the Gaussian distribution with the noisy mean and covariance.
Suppose (ii) is dropped. It suffices to construct a differentially private vector u and matrix V such that u ≈ (1/n) Σ_k x_k and V ≈ (1/n) Σ_k x_k x_k^T (see Lemma 2.4 below), and then set µ to be a probability measure on {0,1}^p that minimizes ‖∫_{{0,1}^p} x dµ(x) − u‖_∞ + ‖∫_{{0,1}^p} x x^T dµ(x) − V‖_∞, where ‖·‖_∞ is the ℓ_∞ norm on R^p or R^{p²}. However, this requires time exponential in p, since the set of all probability measures on {0,1}^p can be identified with a convex subset of R^{2^p}. See [3]. If (iii) or (iv) is dropped, the problem is trivial: in the former case, we can output the original data; in the latter, all zeros.
While there are known algorithms that satisfy (i)–(iii) with proofs and empirically satisfy (iv) in simulations (see e.g. [22,36,18,19]), the challenge is to develop an algorithm that provably satisfies all four conditions. Ullman and Vadhan [35] showed that, assuming the existence of one-way functions, one cannot achieve (i)–(iv) even for d = 2, if we require in (iv) that all of the d-dimensional marginals be preserved accurately. More precisely, if one-way functions exist, there is no polynomial-time, differentially private algorithm for generating synthetic data in {0,1}^p that preserves all of the two-dimensional marginals with accuracy o(1). This remarkable no-go result by Ullman and Vadhan could already put an end to our quest for an algorithm that rigorously achieves conditions (i)–(iv).
Surprisingly, however, a slightly weaker interpretation of (iv) suffices to put our quest on a more successful path. Indeed, we will show in this paper that one can achieve (i)–(iv) if we require in (iv) that most of the d-dimensional marginals be preserved accurately. Remarkably, our result holds not only for two-dimensional marginals, but for marginals of any given degree.
Note that even if the differential privacy condition in (iii) is replaced by the condition of anonymous microaggregation, it is still a challenging open problem to develop an algorithm that fulfills all these desiderata. In this paper we solve this problem by deriving a computationally efficient anonymous microaggregation framework that comes with provable accuracy bounds.
Covariance loss. We approach the aforementioned goals by studying a fundamental, but at first glance completely unrelated, problem in probability. This problem is concerned with the most basic notion of probability: conditional expectation. We want to answer the fundamental question: "How much information is lost when we take conditional expectation?" The law of total variance shows that taking the conditional expectation of a random variable underestimates the variance. A similar phenomenon holds in higher dimensions: taking the conditional expectation of a random vector underestimates the covariance (in the positive-semidefinite order). We may ask: how much covariance is lost? And which sigma-algebra of given complexity minimizes the covariance loss?
Answering this fundamental probability question turns into a quest of finding, among all sigma-algebras of given complexity, the one that minimizes the covariance loss. We derive a nearly optimal bound based on a careful explicit construction of a specific sigma-algebra. Amazingly, this excursion into theoretical probability produces mathematical techniques that are well suited to solving the previously discussed challenging practical problems concerning microaggregation and privacy.

Private, synthetic data?
Now that we have described the spirit of our main results, let us introduce them in more detail.
As mentioned before, it is known from Ullman and Vadhan [35] that it is generally impossible to efficiently make private synthetic data that accurately preserve all low-dimensional marginals. However, as we will prove, it is possible to efficiently construct private synthetic data that preserve most of the low-dimensional marginals.
To state our goal mathematically, we average the squared accuracy (in the L² sense) over all (p choose d) subsets of indices {i_1, …, i_d} ⊂ [p], and then take the expectation over the randomness in the algorithm. In other words, we would like to see

E [ (p choose d)^{-1} Σ_{i_1 < ⋯ < i_d} E(i_1, …, i_d)² ] ≤ δ    (1.2)

for some small δ, where E(i_1, …, i_d) denotes the accuracy of the marginal with indices i_1, …, i_d, and the outer expectation is over the randomness in the algorithm. Theorem 5.15 gives a formal and non-asymptotic version of this result. Our method is not specific to Boolean data. It can be used to generate synthetic data with any predefined convex constraints (Theorem 5.14). If we assume that the input data x_1, …, x_n lie in some known convex set K ⊂ R^p, one can make private and accurate synthetic data y_1, …, y_m that lie in the same set K.

Covariance loss
Our method is based on a new problem in probability theory, a problem that is interesting in its own right. It concerns the most basic notion of probability: conditional expectation. And the question is: how much information is lost when we take conditional expectation?
The law of total expectation states that for a random variable X and a sigma-algebra F, the conditional expectation Y = E[X|F] gives an unbiased estimate of the mean: E X = E Y. The law of total variance, which can be expressed as

Var X = Var Y + E (X − Y)²,

shows that taking conditional expectation underestimates the variance.
Heuristically, the simpler the sigma-algebra F is, the more variance gets lost. What is the best sigma-algebra F with a given complexity? Among all sigma-algebras F that are generated by a partition of the sample space into k subsets, which one achieves the smallest loss of variance, and what is that loss?
If X is bounded, let us say |X| ≤ 1, we can decompose the interval [−1, 1] into k subintervals of length 2/k each, take F_i to be the preimage of each subinterval under X, and let F = σ(F_1, …, F_k) be the sigma-algebra generated by these events. Since X and Y take values in the same subinterval a.s., we have |X − Y| ≤ 2/k a.s. Thus, the law of total variance gives

Var X − Var Y = E (X − Y)² ≤ 4/k².    (1.3)

Let us try to generalize this question to higher dimensions. If X is a random vector taking values in R^p, the law of total expectation holds unchanged. The law of total variance becomes the law of total covariance:

Σ_X = Σ_Y + E (X − Y)(X − Y)^T,

where Σ_X = E(X − E X)(X − E X)^T denotes the covariance matrix of X, and similarly for Σ_Y (see Lemma 3.1 below). Just like in the one-dimensional case, we see that taking conditional expectation underestimates the covariance (in the positive-semidefinite order).
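The one-dimensional construction above is easy to check numerically. The sketch below (our illustration) partitions [−1, 1] into k subintervals, forms the conditional expectation Y = E[X|F] by averaging within each cell, and computes the variance loss, which is always at most (2/k)².

```python
import numpy as np

def variance_loss(x, k):
    """Empirical variance loss Var X - Var Y, where Y = E[X|F] and F is
    generated by the preimages of k subintervals of [-1, 1] of length 2/k."""
    edges = np.linspace(-1.0, 1.0, k + 1)
    cells = np.clip(np.digitize(x, edges) - 1, 0, k - 1)  # subinterval index
    y = np.empty_like(x)
    for j in range(k):
        mask = cells == j
        if mask.any():
            y[mask] = x[mask].mean()  # conditional expectation on each cell
    return x.var() - y.var()

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)  # any distribution on [-1, 1] works
```

Since |X − Y| ≤ 2/k within each cell, the loss Var X − Var Y = E(X − Y)² is nonnegative and at most 4/k², which matches the bound in the text.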
However, if we naively attempt to bound the loss of covariance as we did to get (1.3), we face a curse of dimensionality. The unit Euclidean ball in R^p cannot be partitioned into k subsets of diameter, let us say, 1/4, unless k is exponentially large in p (see e.g. [2]). The following theorem shows that a much better bound can be obtained, one that does not suffer from the curse of dimensionality.
Theorem 1.2 (Covariance loss). Let X be a random vector in R^p such that ‖X‖₂ ≤ 1 a.s. Then, for every k ≥ 3, there exists a partition of the sample space into at most k sets such that for the sigma-algebra F generated by this partition, the conditional expectation Y = E[X|F] satisfies

‖Σ_X − Σ_Y‖₂ ≤ C √(log log k / log k).    (1.4)

Here C is an absolute constant. Moreover, if the probability space has no atoms, then the partition can be made with exactly k sets, all of which have the same probability 1/k.
Remark 1.3 (Optimality). The rate in Theorem 1.2 is in general optimal up to a √(log log k) factor; see Proposition 3.14.
Remark 1.4 (Higher moments). Theorem 1.2 extends automatically to higher moments via a tensorization principle (Theorem 3.10), which for all d ≥ 2 bounds the moment loss ‖E X^{⊗d} − E Y^{⊗d}‖₂ in terms of the covariance loss (1.5). Remark 1.5 (Hilbert spaces). The bound (1.4) is dimension-free. Indeed, Theorem 1.2 can be extended to infinite-dimensional Hilbert spaces.

Anonymous microaggregation
Let us apply these abstract probability results to the problem of making synthetic data. As before, denote the true data by x_1, …, x_n ∈ R^p. Let X(i) = x_i be the random variable on the sample space [n] equipped with the uniform probability distribution. Obtain a partition [n] = I_1 ∪ ⋯ ∪ I_m from the Covariance Loss Theorem 1.2, where m ≤ k, and let us assume for simplicity that m = k and that all sets I_j have the same cardinality |I_j| = n/k (this can be achieved whenever k divides n, a requirement that can easily be dropped, as we will discuss later). The conditional expectation Y = E[X|F] on the sigma-algebra F = σ(I_1, …, I_k) generated by this partition takes the values

y_j = (1/|I_j|) Σ_{i∈I_j} x_i = (k/n) Σ_{i∈I_j} x_i,  j = 1, …, k,    (1.6)

with probability 1/k each. In other words, the synthetic data y_1, …, y_k are obtained by taking local averages, or by microaggregation, of the input data x_1, …, x_n. The crucial point is that the synthetic data are obviously generated via (n/k)-anonymous microaggregation. Here, we use the following formal definition of r-anonymous microaggregation.
Definition 1.6. Let x_1, …, x_n ∈ R^p be a dataset and let r ∈ N. An r-anonymous averaging of x_1, …, x_n is a dataset consisting of the points (1/|I_1|) Σ_{i∈I_1} x_i, …, (1/|I_m|) Σ_{i∈I_m} x_i for some partition [n] = I_1 ∪ … ∪ I_m such that |I_i| ≥ r for each 1 ≤ i ≤ m. An r-anonymous microaggregation algorithm A(·) with input dataset x_1, …, x_n ∈ R^p is the composition of an r-anonymous averaging procedure followed by any algorithm.
For any notion of privacy, any post-processing of a private dataset should still be considered private. While a post-processing of an r-anonymous averaging of a dataset is not necessarily an r-anonymous averaging of the original dataset (it might not even consist of vectors), the notion of r-anonymous microaggregation allows a post-processing step after r-anonymous averaging.
What about the accuracy? The law of total expectation guarantees that the means are preserved exactly: (1/k) Σ_{j=1}^k y_j = (1/n) Σ_{i=1}^n x_i. As for higher moments, assume that ‖x_i‖₂ ≤ 1 for all i. Then the Covariance Loss Theorem 1.2 together with the tensorization principle (1.5) yields a bound on ‖(1/k) Σ_j y_j^{⊗d} − (1/n) Σ_i x_i^{⊗d}‖₂ that tends to zero as k grows. Thus, if k ≫ 1 and d = O(1), the synthetic data are accurate in the sense of the mean square average of marginals. This general principle can be specialized to Boolean data. With appropriate rescaling, bootstrapping (Section 4.2), and randomized rounding (Section 4.3), we can conclude the following: Theorem 1.7 (Anonymous synthetic Boolean data). Suppose k divides n. There exists a randomized (n/k)-anonymous microaggregation algorithm that transforms input data x_1, …, x_n ∈ {0,1}^p into output data y_1, …, y_m ∈ {0,1}^p. Moreover, if d = O(1), d ≤ p/2, k ≫ 1, m ≫ 1, the synthetic data is o(1)-accurate for d-dimensional marginals on average. The algorithm runs in time polynomial in p, n and linear in m, and is independent of d.
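To make the pipeline behind Theorem 1.7 concrete, here is a simplified sketch (our illustration only): it uses an arbitrary contiguous block partition instead of the covariance-loss partition constructed in the paper, and the simplest form of the bootstrapping and randomized rounding steps of Sections 4.2–4.3.

```python
import numpy as np

def anonymous_boolean_synthetic(x, k, m, rng=None):
    """Simplified sketch: partition the n records into k equal blocks,
    average each block ((n/k)-anonymous microaggregation), resample m
    block averages (bootstrapping), and apply randomized rounding so the
    output lies in {0,1}^p."""
    rng = rng or np.random.default_rng()
    n, p = x.shape
    assert n % k == 0, "for simplicity, assume k divides n"
    blocks = x.reshape(k, n // k, p).mean(axis=1)      # microaggregation
    picks = rng.integers(0, k, size=m)                 # bootstrap block choices
    probs = blocks[picks]                              # values in [0,1]^p
    return (rng.random((m, p)) < probs).astype(int)    # randomized rounding
```

Randomized rounding preserves every marginal in expectation, since each output coordinate is a Bernoulli variable whose mean is the corresponding block-average coordinate.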

Differential privacy
How can we pass from anonymity to differential privacy and establish Theorem 1.1? The microaggregation mechanism by itself is not differentially private. However, it reduces the sensitivity of the synthetic data. If a single input data point x_i is changed, microaggregation (1.6) suppresses the effect of such a change on the synthetic data y_j by the factor k/n. Once the data has low sensitivity, the classical Laplacian mechanism can make it private: one simply has to add Laplacian noise. This is the gist of the proof of Theorem 1.1. However, several issues arise. One is that we do not know how to make all blocks I_j of the same size while preserving their privacy, so in the application to privacy we allow them to have arbitrary sizes. However, small blocks I_j may cause instability of microaggregation and diminish its beneficial effect on sensitivity. We resolve this issue by downplaying, or damping, the small blocks (Section 5.4). The second issue is that adding Laplacian noise to the vectors y_i may move them outside the set K where the synthetic data must lie (for Boolean data, K is the unit cube [0,1]^p). We resolve this issue by metrically projecting the perturbed vectors back onto K (Section 5.5).
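The noise-then-project step can be sketched as follows. This is our simplified illustration, omitting the damping of small blocks from Section 5.4: each row is a block average of at least r records from [0,1]^p, so changing one input record moves one row by at most p/r in ℓ₁ norm, and the Laplace mechanism applies with that sensitivity. The metric projection onto K = [0,1]^p is plain clipping and, as post-processing, costs no privacy.

```python
import numpy as np

def private_block_averages(y, r, eps, rng=None):
    """Add i.i.d. Laplacian noise with scale p/(r*eps), calibrated to the
    l1-sensitivity of one block average of at least r records in [0,1]^p,
    then metrically project back onto K = [0,1]^p (clipping)."""
    rng = rng or np.random.default_rng()
    m, p = y.shape
    scale = p / (r * eps)                      # noise scale from sensitivity
    noisy = y + rng.laplace(0.0, scale, size=(m, p))
    return np.clip(noisy, 0.0, 1.0)            # projection is post-processing
```

The larger the blocks (r), the smaller the required noise, which is exactly the beneficial effect of microaggregation on sensitivity described above.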

Related work
There exists a large body of work on privately releasing answers in the interactive and noninteractive query settings. But a major advantage of releasing a synthetic dataset instead of just the answers to specific queries is that synthetic data opens up a much richer toolbox (clustering, classification, regression, visualization, etc.), and thus much more flexibility, to analyze the data.
In [5], Blum, Ligett, and Roth gave an ε-differentially private synthetic data algorithm whose accuracy scales logarithmically with the number of queries, but whose complexity scales exponentially with p. The papers [12,13] propose methods for producing private synthetic data with an error bound of about Õ(√n p^{1/4}) per query. However, the associated algorithms have running time that is at least exponential in p. In [3], Barak et al. derive a method for producing accurate and private synthetic Boolean data based on linear programming, with a running time that is exponential in p. This should be contrasted with the fact that our algorithm runs in time polynomial in p and n, see Theorem 1.7. We emphasize that our method is designed to produce synthetic data. But, as suggested by the reviewers, we briefly discuss how well d-way marginals can be preserved by our method in the non-synthetic data regime. Here, we consider the dependence of n on p as well as the accuracy we achieve versus existing methods.
Dependence of n on p: In order to release 1-way marginals with any nontrivial accuracy on average and with ε-differential privacy, one already needs n ≳ p [9, Theorem 8.7]. Our main result Theorem 1.1 on differential privacy only requires n to grow slightly faster than linearly in p.
If we just want to privately release the d-way marginals without creating synthetic data, and moreover relax ε-differential privacy to (ε, δ)-differential privacy, one can relax the dependence of n on p. Specifically, n ≫ p^{⌈d/2⌉/2} √(log(1/δ))/ε suffices [8]. In particular, when d = 2, this means that n ≫ √(p log(1/δ))/ε suffices. On the other hand, our algorithm does not depend on d. Moreover, for d ≥ 5, the dependence of n on p required in Theorem 1.1 is less restrictive than in [8].
Accuracy in n: As mentioned above, Theorem 5.15 gives a formal and non-asymptotic version of Theorem 1.1. The average error of the d-way marginals we achieve decays at the order of log log n / log n in n. In Remark 5.16, we show that even for d = 1, 2, no polynomial-time differentially private algorithm can have the average error of the marginals decay faster than n^{−a} for any a > 0, assuming the existence of one-way functions.
However, if we only need to release d-way marginals with differential privacy but without creating synthetic data, then one can have the average error of the d-way marginals decay at the rate 1/ √ n [8].

Outline of the paper
The rest of the paper is organized as follows. In Section 2 we provide basic notation and other preliminaries. Section 3 is concerned with the concept of covariance loss. We give a constructive and nearly optimal answer to the problem of how much information is lost when we take conditional expectation. In Section 4 we use the tools developed for covariance loss to derive a computationally efficient microaggregation framework that comes with provable accuracy bounds regarding low-dimensional marginals. In Section 5 we obtain analogous versions of these results in the framework of differential privacy.

Basic notation
The approximate inequality signs hide absolute constant factors; thus a ≲ b means that a ≤ Cb for a suitable absolute constant C > 0. A list of elements ν_1, …, ν_k of a metric space M is an α-covering, where α > 0, if every element of M has distance less than α from one of ν_1, …, ν_k. For p ∈ N, define [p] = {1, …, p}.

Tensors
The marginals of a random vector can be conveniently represented in tensor notation. A tensor is a d-way array X ∈ R^{p×⋯×p}. In particular, 1-way tensors are vectors, and 2-way tensors are matrices.
A simple example of a tensor is the rank-one tensor x^{⊗d}, which is constructed from a vector x ∈ R^p by multiplying its entries:

(x^{⊗d})_{i_1 … i_d} = x_{i_1} ⋯ x_{i_d}.

In particular, the tensor x^{⊗2} is the same as the matrix x x^T. The ℓ₂ norm of a tensor X can be defined by regarding X as a vector in R^{p^d}, thus

‖X‖₂ = ( Σ_{i_1, …, i_d ∈ [p]} X_{i_1 … i_d}² )^{1/2}.

Note that when d = 2, the tensor X can be identified with a matrix and ‖X‖₂ is the Frobenius norm of X.
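In code, rank-one tensors and the tensor ℓ₂ norm are conveniently built with iterated outer products. The following is a small numpy illustration; the helper names are ours.

```python
import numpy as np

def rank_one_tensor(x, d):
    """The rank-one tensor x^{(tensor)d}: entries x_{i1} * ... * x_{id},
    built from iterated outer products."""
    t = np.asarray(x, dtype=float)
    for _ in range(d - 1):
        t = np.multiply.outer(t, x)
    return t

def tensor_l2_norm(X):
    """The l2 norm of a tensor: the Euclidean norm of the flattened array
    (the Frobenius norm when d = 2)."""
    return float(np.linalg.norm(np.ravel(X)))
```

For d = 2 this reproduces the matrix x x^T, whose Frobenius norm equals ‖x‖₂².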
The errors of the marginals (1.1) can be thought of as the coefficients of the error tensor

(1/m) Σ_{k=1}^m y_k^{⊗d} − (1/n) Σ_{k=1}^n x_k^{⊗d}.    (2.1)

A tensor X ∈ R^{p×⋯×p} is symmetric if the values of its entries are independent of the permutation of the indices, i.e. if X_{i_1 … i_d} = X_{i_{π(1)} … i_{π(d)}} for any permutation π of [d]. It often makes sense to count each distinct entry of a symmetric tensor once instead of d! times. To make this formal, we may consider the restriction operator P_sym that preserves the (p choose d) entries whose indices satisfy i_1 < i_2 < ⋯ < i_d and zeroes out all other entries. Thus

‖P_sym X‖₂² = Σ_{i_1 < ⋯ < i_d} X_{i_1 … i_d}².

Thus, the goal we stated in (1.2) can be restated as follows: for the error tensor (2.1), we would like to bound the quantity

(p choose d)^{-1} E ‖P_sym X‖₂².    (2.2)

The operator P_sym is related to another restriction operator P_off, which retains the (p choose d) d! off-diagonal entries, i.e. those for which all indices i_1, …, i_d are distinct, and zeroes out all other entries. Thus,

‖P_off X‖₂² = d! ‖P_sym X‖₂²    (2.3)

for all symmetric tensors X.
Lemma 2.1. If p ≥ 2d, we have

(p choose d)^{-1} ‖P_sym X‖₂² ≤ (2/p)^d ‖X‖₂²    (2.4)

for all symmetric d-way tensors X.

Proof. According to (2.3), the left-hand side of (2.4) equals (p choose d)^{-1} (d!)^{-1} ‖P_off X‖₂², and

(p choose d) d! = p(p − 1) ⋯ (p − d + 1) ≥ (p/2)^d,

where the inequality uses p ≥ 2d. Combined with ‖P_off X‖₂ ≤ ‖X‖₂, this yields the desired bound.
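The identity relating P_off and P_sym used in the proof above can be checked numerically. The sketch below (our implementation of the two restriction operators) symmetrizes a random 3-way tensor and compares ‖P_off X‖₂² with 3! · ‖P_sym X‖₂².

```python
import numpy as np
from itertools import permutations

def P_sym(X):
    """Keep entries whose indices are strictly increasing; zero out the rest."""
    Y = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        if all(a < b for a, b in zip(idx, idx[1:])):
            Y[idx] = X[idx]
    return Y

def P_off(X):
    """Keep entries whose indices are pairwise distinct; zero out the rest."""
    Y = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        if len(set(idx)) == len(idx):
            Y[idx] = X[idx]
    return Y

# a random symmetric 3-way tensor, obtained by symmetrization
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6, 6))
X = sum(np.transpose(A, perm) for perm in permutations(range(3))) / 6.0
```

For a symmetric tensor, each set of d distinct indices contributes d! equal off-diagonal entries, of which P_sym keeps exactly the sorted one, which is why the factor d! appears.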

Differential privacy
We briefly review some basic facts about differential privacy. The interested reader may consult [9] for details. Definition 2.2 (Differential privacy [9]). A randomized function M gives ε-differential privacy if for all databases D_1 and D_2 differing on at most one element, and all measurable S ⊆ range(M),

P[M(D_1) ∈ S] ≤ e^ε P[M(D_2) ∈ S],

where the probability is with respect to the randomness of M.
Almost all existing mechanisms to implement differential privacy are based on adding noise to the data or the data queries, e.g. via the Laplacian mechanism [3]. Recall that a random variable has the (centered) Laplacian distribution Lap(σ) if its probability density function at x is (2σ)^{-1} e^{−|x|/σ}. The ℓ₁-sensitivity of a function f : D → R^d is

∆f = max ‖f(D_1) − f(D_2)‖₁

over all D_1, D_2 differing in at most one element. Lemma 2.4 (Laplace mechanism, Theorem 2 in [3]). For any f : D → R^d, the addition of i.i.d. Lap(σ) noise in each of the d coordinates preserves (∆f/σ)-differential privacy.
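The Laplace mechanism of Lemma 2.4 is a one-liner in code. The sketch below (our helper) adds i.i.d. Laplacian noise calibrated to a given sensitivity bound.

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, eps, rng=None):
    """Release f(D) plus i.i.d. Lap(sensitivity/eps) noise per coordinate.
    By Lemma 2.4 (with sigma = sensitivity/eps) this is eps-differentially
    private whenever `sensitivity` upper-bounds the l1-sensitivity of f."""
    rng = rng or np.random.default_rng()
    f_value = np.asarray(f_value, dtype=float)
    return f_value + rng.laplace(0.0, sensitivity / eps, size=f_value.shape)
```

For example, the mean of n records from [0,1]^p has ℓ₁-sensitivity at most p/n (each coordinate of the mean changes by at most 1/n), so `laplace_mechanism(mean, p/n, eps)` gives ε-differential privacy.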
The proof of the following lemma, which is similar in spirit to the Composition Theorem 3.14 in [9], is left to the reader.

Lemma 2.5. Suppose that an algorithm
Remark 2.6.As outlined in [3], any function applied to private data, without accessing the raw data, is privacy-preserving.
The following observation is a special case of Lemma 2.5. Lemma 2.7. Suppose the data Y_1 and Y_2 are independent with respect to the randomness of the privacy-generating algorithm and that each is ε-differentially private. Then (Y_1, Y_2) is 2ε-differentially private.

Covariance loss
The goal of this section is to prove Theorem 1.2 and its higher-order version, Corollary 3.12. We establish the main part of Theorem 1.2 in Sections 3.1–3.4, the "moreover" part (equipartition) in Sections 3.5–3.6, and the tensorization principle (1.5) in Section 3.7, which immediately yields Corollary 3.12. Finally, we show optimality in Section 3.8.

Law of total covariance
Throughout this section, X is an arbitrary random vector in R^p, F is an arbitrary sigma-algebra, and Y = E[X|F] is the conditional expectation. Lemma 3.1 (Law of total covariance). We have

Σ_X − Σ_Y = E XX^T − E YY^T = E (X − Y)(X − Y)^T.

In particular, Σ_X ⪰ Σ_Y.
Proof. The covariance matrix can be expressed as

Σ_X = E XX^T − (E X)(E X)^T,

and likewise Σ_Y = E YY^T − (E Y)(E Y)^T = E YY^T − (E X)(E X)^T by the law of total expectation, proving the first equality in the lemma. Next, one can check that

E[(X − Y)(X − Y)^T | F] = E[XX^T | F] − YY^T

almost surely, by expanding the product on the right-hand side and recalling that Y is F-measurable. Finally, take expectation on both sides to complete the proof.
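Lemma 3.1 is easy to sanity-check numerically on a discrete example (our illustration): X uniform on four points of R², with F generated by a two-cell partition of the sample space.

```python
import numpy as np

# X is uniform on four points in R^2; F is generated by the partition
# {omega_1, omega_2} u {omega_3, omega_4} of the sample space.
pts = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
cells = [[0, 1], [2, 3]]

Y = pts.copy()
for c in cells:
    Y[c] = pts[c].mean(axis=0)          # Y = E[X|F]: the cell averages

Sigma_X = np.cov(pts.T, bias=True)      # population covariance, uniform weights
Sigma_Y = np.cov(Y.T, bias=True)
D = pts - Y
correction = (D[:, :, None] * D[:, None, :]).mean(axis=0)   # E (X-Y)(X-Y)^T
# law of total covariance: Sigma_X == Sigma_Y + correction
```

The matrix `Sigma_X - Sigma_Y` is also positive semidefinite, illustrating that conditional expectation underestimates the covariance in the semidefinite order.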
Lemma 3.2 (Decomposing the covariance loss). For any orthogonal projection P in R^p we have

‖Σ_X − Σ_Y‖₂ ≤ 2 E ‖P(X − Y)‖₂² + 2 ‖(I − P)(E XX^T)(I − P)‖₂.

Proof. By the law of total covariance (Lemma 3.1), the matrix A := Σ_X − Σ_Y = E(X − Y)(X − Y)^T is positive semidefinite. Then we can use the following inequality, which holds for any positive-semidefinite matrix A (see e.g. [1, p.157]):

‖A‖₂ ≤ 2 ‖PAP‖₂ + 2 ‖(I − P)A(I − P)‖₂.    (3.1)

Let us bound the two terms on the right-hand side. Jensen's inequality gives

‖PAP‖₂ = ‖E P(X − Y)(X − Y)^T P‖₂ ≤ E ‖P(X − Y)‖₂².

Next, since the matrix E YY^T is positive semidefinite, we have 0 ⪯ A ⪯ E XX^T in the semidefinite order, so 0 ⪯ (I − P)A(I − P) ⪯ (I − P)(E XX^T)(I − P), which yields

‖(I − P)A(I − P)‖₂ ≤ ‖(I − P)(E XX^T)(I − P)‖₂.

Substitute the previous two bounds into (3.1) to complete the proof.

Spectral projection
The two terms in Lemma 3.2 will be bounded separately. Let us start with the second term. It simplifies if P is a spectral projection. Lemma 3.3 (Spectral projection). Let X be a random vector in R^p such that ‖X‖₂ ≤ 1 a.s., and let P be the orthogonal projection in R^p onto the t leading eigenvectors of the second moment matrix S = E XX^T. Then ‖(I − P)S(I − P)‖₂ ≤ 1/√t.

Proof. We have

‖(I − P)S(I − P)‖₂² = Σ_{i > t} λ_i(S)²,    (3.2)

where λ_i(S) denote the eigenvalues of S arranged in non-increasing order. Using linearity of expectation and of the trace, we get

Σ_i λ_i(S) = tr S = E tr(XX^T) = E ‖X‖₂² ≤ 1.

It follows that at most t eigenvalues of S can be larger than 1/t. By monotonicity, this yields λ_i(S) ≤ 1/t for all i > t. Combining this with the bound above, we conclude that

Σ_{i > t} λ_i(S)² ≤ (1/t) Σ_{i > t} λ_i(S) ≤ 1/t.

Substitute this bound into (3.2) to complete the proof.

Nearest-point partition
Next, we bound the first term in Lemma 3.2. This is the only step that does not hold for a general sigma-algebra; it requires a specific one, which we generate by a nearest-point partition.
Definition 3.4 (Nearest-point partition). Let X be a random vector taking values in R^p, defined on a probability space (Ω, Σ, P). A nearest-point partition {F_1, …, F_s} of Ω with respect to a list of points ν_1, …, ν_s ∈ R^p is a partition {F_1, …, F_s} of Ω such that

‖X(ω) − ν_j‖₂ = min_{1≤i≤s} ‖X(ω) − ν_i‖₂

for all ω ∈ F_j and 1 ≤ j ≤ s. (Some of the F_j could be empty.) Remark 3.5. A nearest-point partition can be constructed as follows: for each ω ∈ Ω, choose a point ν_j nearest to X(ω) in the ℓ₂ metric and put ω into F_j. Break any ties arbitrarily as long as the F_j are measurable.
Lemma 3.6 (Approximation). Let X be a random vector in R^p such that ‖X‖₂ ≤ 1 a.s. Let P be an orthogonal projection on R^p. Let ν_1, …, ν_s ∈ R^p be an α-covering of the unit Euclidean ball of ran(P). Let Ω = F_1 ∪ ⋯ ∪ F_s be a nearest-point partition for PX with respect to ν_1, …, ν_s, and let F = σ(F_1, …, F_s) be the sigma-algebra generated by the partition. Then the conditional expectation Y = E[X|F] satisfies ‖P(X − Y)‖₂ ≤ 2α a.s.

Proof. If ω ∈ F_j then, by the definition of the nearest-point partition, ‖ν_j − PX(ω)‖₂ = min_{1≤i≤s} ‖ν_i − PX(ω)‖₂. So by the definition of an α-covering, we have

‖PX(ω) − ν_j‖₂ ≤ α.    (3.4)

Furthermore, by the definition of Y, for ω ∈ F_j we have PY(ω) = E[PX | F_j], the average of PX over the event F_j. Thus, for such ω we have, by the triangle inequality and Jensen's inequality,

‖P(X − Y)(ω)‖₂ ≤ ‖PX(ω) − ν_j‖₂ + ‖ν_j − E[PX | F_j]‖₂ ≤ α + E[ ‖ν_j − PX‖₂ | F_j ] ≤ 2α,

where in the last step we used (3.4). Since the bound holds for each ω ∈ F_j and the events F_j form a partition of Ω, it holds for all ω ∈ Ω. The proof is complete.
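The construction in Remark 3.5 is exactly a nearest-center assignment. A minimal implementation (our helper, with ties broken by lowest index) for a finite sample space:

```python
import numpy as np

def nearest_point_partition(points, centers):
    """Assign each sample point to a nearest center in the l2 metric,
    realizing the nearest-point partition of Definition 3.4; ties are
    broken by the lowest center index.  Some cells may be empty."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(d, axis=1)          # argmin picks the first minimizer
    return [np.flatnonzero(labels == j) for j in range(len(centers))]
```

The cells returned are disjoint and cover all sample points, as a partition must.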

Proof of the main part of Theorem 1.2
The following simple (and possibly known) observation will come in handy to bound the cardinality of the α-covering that we will need in the proof of Theorem 1.2.
Proposition 3.7 (Number of lattice points in a ball). For all α > 0 and t ∈ N,

|B_2^t ∩ (α/√t) Z^t| ≤ (1 + α/2)^t (√t/α)^t Vol(B_2^t).

In particular, for any α ∈ (0, 1), it follows that

|B_2^t ∩ (α/√t) Z^t| ≤ (7/α)^t.

Proof. The open cubes of side length α/√t that are centered at the points of the set N := B_2^t ∩ (α/√t) Z^t are all disjoint. Thus the total volume of these cubes equals |N| (α/√t)^t. On the other hand, since each such cube is contained in a ball of radius α/2 centered at some point of N, the union of these cubes is contained in the ball (1 + α/2) B_2^t. So, comparing the volumes, we obtain

|N| (α/√t)^t ≤ (1 + α/2)^t Vol(B_2^t).

Now, it is well known [23] that Vol(B_2^t) = π^{t/2} / Γ(t/2 + 1).
Using Stirling's formula we have Γ(t/2 + 1) ≥ (t/(2e))^{t/2}. This gives Vol(B_2^t) ≤ (2eπ/t)^{t/2}. Substitute this into the bound on |N| above to complete the proof.
It follows now from Proposition 3.7 that for every α ∈ (0, 1), there exists an α-covering of the unit Euclidean ball of dimension t of cardinality at most (7/α)^t.
Fix an integer k ≥ 3 and choose t of the order log k / log log k together with α = 7k^{−1/t}. The choice of t is made so that we can find an α-covering of the unit Euclidean ball of ran(P) of cardinality at most (7/α)^t ≤ k.
We decompose the covariance loss Σ_X − Σ_Y in Lemma 3.1 into two terms as in Lemma 3.2 and bound the two terms as in Lemma 3.3 and Lemma 3.6. This way we obtain

‖Σ_X − Σ_Y‖₂ ≤ 2 E ‖P(X − Y)‖₂² + 2 ‖(I − P)S(I − P)‖₂ ≤ 8α² + 2/√t ≤ C √(log log k / log k),

where the last bound follows from our choice of α and t. If t = 0 then k ≤ C′ for some universal constant C′ > 0, so ‖Σ_X − Σ_Y‖₂ is at most O(1) while log log k / log k is of constant order, and the bound holds trivially. The main part of Theorem 1.2 is proved.

Monotonicity
Next, we are going to prove the "moreover" (equipartition) part of Theorem 1.2. This part is crucial in the application to anonymity, but it can be skipped if the reader is only interested in differential privacy. Before we proceed, let us first note a simple monotonicity property: Lemma 3.8 (Monotonicity). Conditioning on a larger sigma-algebra can only decrease the covariance loss. Specifically, if Z is a random variable and B ⊂ G are sigma-algebras then Proof. Denoting X = E[Z|G] and Y = E[Z|B], we see from the law of total expectation that Passing to a smaller sigma-algebra may in general increase the covariance loss. The additional covariance loss can be bounded as follows: Lemma 3.9 (Merger). Let Z be a random vector in R^p such that ‖Z‖_2 ≤ 1 a.s. If a sigma-algebra is generated by a partition, merging elements of the partition may increase the covariance loss by at most the total probability of the merged sets. Specifically, if Proof. The lower bound follows from monotonicity (Lemma 3.8). To prove the upper bound, we have where the first bound follows by the triangle inequality, and the second from Lemma 3.1 and Lemma 3.2 with P = I.
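The mechanism behind Lemma 3.8 is the law of total covariance; a sketch of the key display, in the lemma's notation (the inequality is in the positive semidefinite order, and we use the tower property Y = E[X | B] since B ⊂ G):

```latex
\Sigma_X \;=\; \operatorname{Cov}\!\big(\mathbb{E}[X \mid \mathcal{B}]\big)
      \;+\; \mathbb{E}\,\operatorname{Cov}(X \mid \mathcal{B})
  \;\succeq\; \operatorname{Cov}\!\big(\mathbb{E}[X \mid \mathcal{B}]\big)
  \;=\; \Sigma_Y,
\qquad\text{hence}\qquad
\Sigma_Z - \Sigma_Y \;\succeq\; \Sigma_Z - \Sigma_X .
```

That is, refining the sigma-algebra can only move the conditional expectation closer to Z in covariance.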
Denote by E_G the conditional expectation on the set Indeed, to check the first case, note that since B ⊂ G, the law of total expectation yields Y = E[X|B]; the case then follows since G ∈ B. To check the second case, note that the sets G_{r+1}, …, G_m belong to both sigma-algebras G and B, so the conditional expectations X and Y must agree on each of these sets and thus on their union G^c. Hence Here we bounded the variance by the second moment, and used the assumption that ‖X‖_2 ≤ 1 a.s. Substitute this bound into (3.6) to complete the proof.
3.6 Proof of equipartition (the "moreover" part of Theorem 1.2). Assume that k′ ≥ 3. (Otherwise k < 9 and the result is trivial by taking an arbitrary partition into k sets of equal probability.) Applying the first part of Theorem 1.2 with k′ instead of k, we obtain a sigma-algebra F′ generated by a partition of the sample space into at most k′ sets F_i, such that Divide each set F_i into subsets of probability 1/k each using division with residual. Thus we partition each F_i into a certain number of subsets (if any) of probability 1/k each and one residual subset of probability less than 1/k. By the Monotonicity Lemma 3.8, such a refinement can only reduce the covariance loss.
This process creates many good subsets, each having probability 1/k, and at most k′ residual subsets of probability less than 1/k each. Merge all residuals into one new "residual subset". While a merger may increase the covariance loss, Lemma 3.9 guarantees that the additional loss is bounded by the probability of the set being merged. Since we chose k′ = ⌊√k⌋, the probability of the merged residual subset is less than k′ · (1/k) ≤ 1/√k. So the additional covariance loss is bounded by 1/√k. Finally, divide the residual subset into further subsets of probability 1/k each. By monotonicity, this refinement cannot increase the covariance loss. At this point we have partitioned the sample space into subsets of probability 1/k each and one smaller residual subset. Since all other subsets have probability exactly 1/k, the residual has probability 1 − j/k for some integer j; being less than 1/k, it must be zero, and thus it can be added to any other subset without affecting the covariance loss.
Let us summarize. We partitioned the sample space into k subsets of equal probability such that the covariance loss is bounded by The proof is complete.

Higher moments: tensorization
Recall that Theorem 1.2 provides a bound on the covariance loss.⁴ Perhaps counterintuitively, the bound on the covariance loss can automatically be lifted to higher moments, at the cost of multiplying the error by at most 4^d.
Theorem 3.10 (Tensorization). Let X be a random vector in R^p such that ‖X‖_2 ≤ 1 a.s., let F be a sigma-algebra, and let d ≥ 2 be an integer. Then the conditional expectation For the proof, we need an elementary identity: Lemma 3.11. Let U and V be independent and identically distributed random vectors in R^p. Then Proof of Theorem 3.10. Step 1: binomial decomposition. Denoting Since Taking expectation on both sides and using the triangle inequality, we obtain Let us look at each summand on the right hand side separately. ⁴ Recall Lemma 3.1 for the first identity, and refer to Section 2.2 for the tensor notation.
Step 2: Dropping trivial terms. First, let us check that all summands for which i_1 + ⋯ + i_d = 1 vanish. Indeed, in this case exactly one term in the product Step 3: Bounding nontrivial terms. Next, we bound the terms for which be an independent copy of the pair of random variables (X_0, X_1). Then By assumption, we have ‖X‖_2 ≤ 1 a.s., which implies by Jensen's inequality that ‖X_0‖_2 = ‖E_F X‖_2 ≤ E_F ‖X‖_2 ≤ 1 a.s. These bounds imply by the triangle inequality that ‖X_1‖_2 = ‖X − X_0‖_2 ≤ 2 a.s. By identical distribution, we also have ‖X′_0‖_2 ≤ 1 and ‖X′_1‖_2 ≤ 2 a.s. Hence, Returning to the term we need to bound, this yields (by Lemma 3.11) (by Lemma 3.1).
Step 4: Conclusion. Combining the Covariance Loss Theorem 1.2 with Theorem 3.10 in view of (3.7), we conclude: Corollary 3.12 (Tensorization). Let X be a random vector in R^p such that ‖X‖_2 ≤ 1 a.s. Then, for every k ≥ 3, there exists a partition of the sample space into at most k sets such that for the sigma-algebra F generated by this partition, the conditional expectation Moreover, if the probability space has no atoms, then the partition can be made with exactly k sets, all of which have the same probability 1/k.
Remark 3.13. A similar bound can be deduced for the higher-order version of the covariance matrix, Σ_X := E(X − E X)^{⊗d}. Indeed, apply Theorem 1.2 and Theorem 3.10 for X − E X instead of X (and so for (X − E X)/2 after rescaling). (The extra 2^d factor appears because from ‖X‖_2 ≤ 1 we can only conclude that ‖X − E X‖_2 ≤ 2, so the bound needs to be rescaled accordingly.)

Optimality
The following result shows that the rate in Theorem 1.2 is in general optimal up to a √(log log k) factor. Proposition 3.14 (Optimality). Let p > 16 ln(2k). Then there exists a random vector X in R^p such that ‖X‖_2 ≤ 1 a.s. and for any sigma-algebra F generated by a partition of the sample space into at most k sets, the conditional expectation We will make X uniformly distributed on a well-separated subset of the Boolean cube p^{-1/2}{0, 1}^p of cardinality n = 2k. The following well-known lemma states that such a subset exists: Lemma 3.15 (A separated subset). Let p > 16 ln n. Then there exist points x_1, …, x_n ∈ p^{-1/2}{0, 1}^p such that Proof. Let X and X′ be independent random vectors uniformly distributed on {0, 1}^p. Then ‖X − X′‖_2² = Σ_{r=1}^p (X(r) − X′(r))² is a sum of i.i.d. Bernoulli random variables with parameter 1/2. Then Hoeffding's inequality [11] yields Let X_1, …, X_n be independent random vectors uniformly distributed on {0, 1}^p. Applying the above inequality to each pair of them and then taking the union bound, we conclude that due to the condition on n. Therefore, there exists a realization of these random vectors that satisfies Divide both sides by √p to complete the proof.
We will also need a high-dimensional version of the identity Var(X) = ½ E(X − X′)², where X and X′ are independent and identically distributed random variables. The following generalization is straightforward: Lemma 3.16. Let X and X′ be independent and identically distributed random vectors taking values in R^p. Then Proof of Proposition 3.14. Let n = 2k. Consider the sample space [n] equipped with the uniform probability measure and the sigma-algebra consisting of all subsets of [n]. Define the random variable X by where {x_1, …, x_n} is the (1/2)-separated subset of p^{-1/2}{0, 1}^p from Lemma 3.15. Hence, X is uniformly distributed on the set {x_1, …, x_n}. Now, if F is the sigma-algebra generated by a partition {F_1, …, (where E_F denotes conditional expectation) where the random variable X_j is uniformly distributed on the set {x_i}_{i∈F_j} = ½ where X′_j is an independent copy of X_j, by Lemma 3.16. Since X_j and X′_j are independent and uniformly distributed on a set of |F_j| points, ‖X_j − X′_j‖_2 is either zero (if both random vectors hit the same point, which happens with probability 1/|F_j|) or greater than 1/2 by separation. Hence Moreover, P(F_j) = |F_j|/n, so substituting into the bound above yields where we used that the sets F_j form a partition of [n] so their cardinalities sum to n, our choice of n = 2k, and the fact that k_0 ≤ k. We proved that the covariance loss is bounded below by a constant multiple of 1/√p.
If p ≤ 25 ln n, this quantity is further bounded below by 1/(80√ln n) = 1/(80√ln(2k)), completing the proof in this range. For larger p, the result follows by appending enough zeros to X, thus embedding it into a higher dimension. Such an embedding obviously does not change ‖Σ_X − Σ_Y‖_2.

Anonymity
In this section, we use our results on the covariance loss to make anonymous and accurate synthetic data by microaggregation. To this end, we interpret microaggregation probabilistically as conditional expectation (Section 4.1) and deduce a general result on anonymous microaggregation (Theorem 4.1). We then show how to make synthetic data of custom size by bootstrapping (Section 4.2) and Boolean synthetic data by randomized rounding (Section 4.3).

Microaggregation as conditional expectation
For discrete probability distributions, conditional expectation can be interpreted as microaggregation, or local averaging.
Consider a finite sequence of points x_1, …, x_n ∈ R^p, which we can think of as the true data. Define the random variable X on the sample space [n], equipped with the uniform probability distribution, by setting X(i) = x_i. Now, if F = σ(I_1, …, I_k) is the sigma-algebra generated by some partition [n] = I_1 ∪ ⋯ ∪ I_k, the conditional expectation Y = E[X|F] must take a constant value on each set I_j, and that value is the average of X on that set. In other words, Y takes values y_j with probability w_j, where w_j = |I_j|/n and y_j = |I_j|^{-1} Σ_{i∈I_j} x_i. The law of total expectation E X = E Y in our case states that (1/n) Σ_{i=1}^n x_i = Σ_{j=1}^k w_j y_j. Higher moments are handled using Corollary 3.12. This way, we obtain an effective anonymous algorithm that creates synthetic data while accurately preserving most marginals: Theorem 4.1 (Anonymous microaggregation). Suppose k divides n. There exists a (deterministic) algorithm that takes input data x_1, …, x_n ∈ R^p such that ‖x_i‖_2 ≤ 1 for all i, and computes a partition [n] = I_1 ∪ ⋯ ∪ I_k with |I_j| = n/k for all j, such that the microaggregated vectors satisfy for all d ∈ N, The algorithm runs in time polynomial in p and n, and is independent of d.
Proof. Most of the statement follows straightforwardly from Corollary 3.12 in light of the discussion above. However, the "moreover" part of Corollary 3.12 requires the probability space to be atomless, while our probability space [n] does have atoms. Nevertheless, if the sample space consists of n atoms of probability 1/n each, and k divides n, then the divide-and-merge argument explained in Section 3.6 clearly still works, and so the "moreover" part of Corollary 3.12 also holds in this case. Thus, we obtain (n/k)-anonymity from the microaggregation procedure. It is also clear that the algorithm (which is independent of d) runs in time polynomial in p and n. See the Microaggregation part of Algorithm 1.
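The conditional-expectation view of microaggregation from Section 4.1 amounts to replacing each point by its block average. A minimal sketch (hypothetical helper `microaggregate`, assuming NumPy); note that the law of total expectation E X = E Y corresponds to the block means averaging back to the global mean:

```python
import numpy as np

def microaggregate(X, partition):
    """Replace each data point by the mean of its block, i.e. the
    conditional expectation Y = E[X|F] for F generated by the partition.
    X: (n, p) array of points; partition: list of index lists covering [n]."""
    Y = np.empty_like(X, dtype=float)
    for block in partition:
        Y[block] = X[block].mean(axis=0)
    return Y
```

Points in the same block become identical, which is exactly the source of (n/k)-anonymity when all blocks have size n/k.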
Remark 4.2. The requirement that k divides n, appearing in Theorem 4.1 as well as in other theorems, makes it possible to partition [n] into k sets of exactly the same cardinality. While convenient for proofs, this assumption is not strictly necessary: one can drop it and make one set slightly larger than the others. The corresponding modifications are left to the reader.
The use of spectral projection in combination with microaggregation has also been proposed in [25], although without any theoretical analysis regarding privacy or utility.

Synthetic data with custom size: bootstrapping
A seeming drawback of Theorem 4.1 is that the anonymity strength n/k and the cardinality k of the output data y_1, …, y_k are tied to each other. To produce synthetic data of arbitrary size, we can use the classical technique of bootstrapping, which consists of sampling new data u_1, …, u_m from the data y_1, …, y_k independently and with replacement. The following general lemma establishes the accuracy of resampling: Proof. We have (by independence and zero mean) Using the assumption ‖Y‖_2 ≤ 1 a.s., we complete the proof.
Going back to the data y_1, …, y_k produced by Theorem 4.1, consider a random vector Y that takes values y_j with probability 1/k each. Then obviously E Y^{⊗d} = (1/k) Σ_{j=1}^k y_j^{⊗d}. Moreover, the assumption that ‖x_i‖_2 ≤ 1 for all i implies that ‖y_j‖_2 ≤ 1 for all j, so we have ‖Y‖_2 ≤ 1 as required in the Bootstrapping Lemma 4.3. Applying this lemma, we get Combining this with the bound in Theorem 4.1 yields:
The algorithm consists of anonymous averaging (described in Theorem 4.1) followed by bootstrapping (described above).It runs in time polynomial in p and n, and is independent of d.
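The bootstrapping step itself is one line of sampling; a hypothetical sketch (name `bootstrap` is ours, assuming NumPy):

```python
import numpy as np

def bootstrap(Y, m, seed=None):
    """Classical bootstrap: sample m points from the rows of Y
    independently and with replacement."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    idx = rng.integers(0, len(Y), size=m)
    return Y[idx]
```

Every output point is one of the microaggregated vectors y_j, so convexity constraints on the data (Remark 4.5) are automatically preserved.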
Remark 4.5 (Convexity). Microaggregation respects convexity: if the input data x_1, …, x_n lie in a given convex set K, the output data u_1, …, u_m will lie in K too. This can be useful in applications where one needs to preserve natural constraints on the data, such as positivity.

Boolean data: randomized rounding
Let us now specialize to Boolean data. Suppose the input data x_1, …, x_n are taken from {0, 1}^p. We can use Theorem 4.4 (with the obvious renormalization by the factor √p, since ‖x_i‖_2 ≤ √p) to make (n/k)-anonymous synthetic data u_1, …, u_m that satisfies According to Remark 4.5, the output data u_1, …, u_m lie in the cube K = [0, 1]^p. In order to transform the vectors u_i into Boolean vectors, i.e. points in {0, 1}^p, we can apply the known technique of randomized rounding [28]. We define the randomized rounding of a number x ∈ [0, 1] as a random variable r(x) ~ Ber(x). Thus, to compute r(x), we flip a coin that comes up heads with probability x and output 1 for heads and 0 for tails. It is convenient to think of r : [0, 1] → {0, 1} as a random function. The randomized rounding r(x) of a vector x ∈ [0, 1]^p is obtained by randomized rounding of each of the p coordinates of x independently. Theorem 4.6 (Anonymous synthetic Boolean data). Suppose k divides n. There exists a randomized (n/k)-anonymous microaggregation algorithm that transforms input data x_1, …, x_n ∈ {0, 1}^p into output data z_1, …, z_m ∈ {0, 1}^p in such a way that the error for all d ≤ p/2. The algorithm consists of anonymous averaging and bootstrapping (as in Theorem 4.4) followed by independent randomized rounding of all coordinates of all points. It runs in time polynomial in p and n and linear in m, and is independent of d.
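Randomized rounding of a vector can be sketched directly from the definition r(x) ~ Ber(x), applied coordinatewise and independently (a hypothetical helper `randomized_round`, assuming NumPy):

```python
import numpy as np

def randomized_round(u, rng=None):
    """Randomized rounding r(u): round each coordinate u_i in [0,1]
    to 1 with probability u_i, independently. Unbiased: E r(u) = u."""
    rng = np.random.default_rng(rng)
    u = np.asarray(u, dtype=float)
    return (rng.random(u.shape) < u).astype(int)
```

Unbiasedness per coordinate is what drives Lemma 4.7: off-diagonal entries of the moment tensors are products of independent coordinates and hence match exactly in expectation.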
For convenience of the reader, Algorithm 1 below gives a pseudocode description of the algorithm described in Theorem 4.6.
Algorithm 1 Boolean n/k-anonymous synthetic data via microaggregation. Input: a sequence of points x_1, …, x_n in the cube {0, 1}^p (true data); k ≥ 9, where k divides n; m ∈ N (number of points in the synthetic data). Microaggregation: 3. Let P : R^p → R^p be the orthogonal projection onto the span of the eigenvectors associated with the t largest eigenvalues of S.
4. Choose an α-covering ν_1, …, ν_s ∈ R^p of the unit Euclidean ball of the subspace ran(P). This is done by enumerating B_2^t ∩ (α/√t)Z^t and mapping it into ran(P) using any linear isometry.

5. Construct a nearest-point partition [n] = F_1 ∪ ⋯ ∪ F_s of P x_1, …, P x_n with respect to ν_1, …, ν_s as follows. For each ℓ ∈ [n], choose a point ν_j nearest to P x_ℓ in the ℓ_2 metric and put ℓ into F_j. Break any ties arbitrarily.

6. Transform the partition into [n] = I_1 ∪ ⋯ ∪ I_k with |I_j| = n/k for all j, following the steps in Section 3.6: divide each non-empty set F_i into subsets with probability 1/k each using division with residual, then merge all residuals into one new residual subset and divide the residual subset into further subsets of probability 1/k each. 7. Perform microaggregation: compute y_j = (k/n) Σ_{i∈I_j} x_i, j = 1, …, k. Bootstrapping creates new data u_1, …, u_m by sampling (independently and with replacement) m points from the data y_1, …, y_k. Randomized rounding maps the data {u_ℓ}_{ℓ=1}^m ∈ [0, 1]^p to data {z_ℓ}_{ℓ=1}^m ∈ {0, 1}^p. Output: a sequence of points z_1, …, z_m in the cube {0, 1}^p (synthetic data) satisfying the properties outlined in Theorem 4.6.
To prove Theorem 4.6, first note:⁵ Lemma 4.7 (Randomized rounding is unbiased). For any x ∈ [0, 1]^p and d ∈ N, all off-diagonal entries of the tensors E r(x)^{⊗d} and x^{⊗d} match: where P_off is the orthogonal projection onto the subspace of tensors supported on the off-diagonal entries.
Proof. For any tuple of distinct indices i_1, …, i_d ∈ [p], the definition of randomized rounding implies that r(x)_{i_1}, …, r(x)_{i_d} are independent Ber(x_{i_1}), …, Ber(x_{i_d}) random variables. Thus Proof of Theorem 4.6. Condition on the data u_1, …, u_m obtained in Theorem 4.4. The output data of our algorithm can be written as z_i = r_i(u_i), where the index i in r_i indicates that we perform randomized rounding on each point u_i independently. Let us bound the error introduced by randomized rounding, which is where Z_i := P_off(r_i(u_i)^{⊗d} − u_i^{⊗d}) are independent mean-zero random variables by Lemma 4.7. Therefore, Since the variance is bounded by the second moment, we have Lifting the conditional expectation (i.e. taking expectation with respect to u_1, …, u_m) and combining this with (4.3) via the triangle inequality, we obtain Finally, we can replace the off-diagonal norm by the symmetric norm using Lemma 2.1. If p ≥ 2d, it yields In view of (2.2), the proof is complete.

Differential Privacy
Here we pass from anonymity to differential privacy via noisy microaggregation. In Section 5.1, we construct a "private" version of the PCA projection using repeated applications of the "exponential mechanism" [14]. This "private" projection is needed to make the PCA step in Algorithm 1 in Section 4.3 differentially private. In Sections 5.2-5.6, we show that the microaggregation is sufficiently stable with respect to additive noise, as long as we damp small blocks F_j (Section 5.4) and project the weights w_j and the vectors y_j back to the unit simplex and the convex set K, respectively (Section 5.5). We then establish differential privacy in Section 5.7 and accuracy in Section 5.6, with Theorem 5.13 being the most general result on private synthetic data. Just as we did for anonymity, we then show how to make synthetic data of custom size by bootstrapping (Section 5.9) and Boolean synthetic data by randomized rounding (Section 5.10).

Differentially private projection
If A is a self-adjoint linear transformation on a real inner product space, then the i-th largest eigenvalue of A is denoted by λ_i(A), the spectral norm of A by ‖A‖, and the Frobenius norm of A by ‖A‖_2. If v_1, …, v_t ∈ R^p, then P_{v_1,…,v_t} denotes the orthogonal projection from R^p onto span{v_1, …, v_t}. In particular, if v ∈ R^p then P_v denotes the orthogonal projection from R^p onto span{v}.
In this section, we construct, for any given p × p positive semidefinite matrix A and 1 ≤ t ≤ p, a random projection P that behaves like the projection onto the t leading eigenvectors of A and, at the same time, is "differentially private" in the sense that if A is perturbed a little, the distribution of P changes a little. Something similar is done in [14]. However, the output in [14] is a PCA approximation of A rather than the projection itself, and the error of that approximation is estimated in the operator norm, whereas in this paper we need to estimate the error in the Frobenius norm.
Thus, we will modify the construction of [14]. But the general idea is the same: first construct a vector that behaves like the principal eigenvector (i.e., one-dimensional PCA) and, at the same time, is "differentially private". Repeating this procedure gives a "differentially private" version of the t-dimensional PCA projection.
The following algorithm is referred to as the "exponential mechanism" in [14]. As shown in Lemma 5.1 below, this algorithm outputs a random vector that behaves like the principal eigenvector (see part 1) and is "differentially private" in the sense of part 3.

Algorithm 2 PVEC(A)
Input: a positive semidefinite linear transformation A : V → V, where V is a finite dimensional real inner product space. Output: x sampled from the unit sphere of V according to the density proportional to e^{⟨Ax,x⟩}. Lemma 5.1 ([14]). Suppose that A is a positive semidefinite linear transformation on a finite dimensional vector space V.
(1) If v is an output of PVEC(A), then for any measurable subset S of V .
Let us restate part 1 of Lemma 5.1 more conveniently: Lemma 5.2. Suppose that A is a positive semidefinite linear transformation on a finite dimensional vector space V. If v is an output of PVEC(A), then for all γ > 0, where C > 0 is an absolute constant.
Proof.Fix γ > 0, and let us consider two cases.
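One exact (though potentially slow) way to realize the sampler of Algorithm 2 is rejection sampling against the uniform distribution on the sphere; this is only an illustrative sketch under that assumption (the helper name `pvec` is ours), not the implementation analyzed in [14]:

```python
import numpy as np

def pvec(A, seed=None, max_tries=100000):
    """Rejection sampler for PVEC(A): draw x uniformly on the unit
    sphere (via a normalized Gaussian), accept with probability
    exp(<Ax, x> - lambda_max(A)) <= 1. The accepted x has density
    proportional to exp(<Ax, x>) on the sphere."""
    rng = np.random.default_rng(seed)
    lam_max = np.linalg.eigvalsh(A)[-1]
    p = A.shape[0]
    for _ in range(max_tries):
        x = rng.standard_normal(p)
        x /= np.linalg.norm(x)
        if rng.random() < np.exp(x @ A @ x - lam_max):
            return x
    raise RuntimeError("rejection sampler failed to accept")
```

The acceptance rate degrades when ‖A‖ is large, which is why [14] uses a more careful sampler; the sketch only illustrates the target distribution.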
We now construct a "differentially private" version of the t-dimensional PCA projection. This is done by repeated applications of PVEC in Algorithm 2.

Algorithm 3 PROJ(A, t)
Input: a p × p positive semidefinite real matrix A and 1 ≤ t ≤ p. The following lemma shows that the algorithm PROJ is "differentially private" in the sense of part 3 of Lemma 5.1, except that e^β is replaced by e^{tβ}. Lemma 5.3. Suppose that A and B are p × p positive semidefinite matrices with ‖A − B‖ ≤ β. Then P{PROJ(A, t) ∈ S} ≤ e^{tβ} P{PROJ(B, t) ∈ S} for any measurable subset S of R^{p×p}.
Proof. Fix β. We first define a notion of privacy similar to the one in [14]. A randomized algorithm M whose input is a p × p positive semidefinite real matrix A is θ-DP if, whenever ‖A − B‖ ≤ β, we have P{M(A) ∈ S} ≤ e^θ P{M(B) ∈ S} for all measurable subsets S of R^{p×p}. In the algorithm PROJ, the computation of v_1, viewed as an algorithm, is β-DP by Lemma 5.1 (3).
Similarly, if we fix v_1, the computation of v_2 as an algorithm is also β-DP. So by a version of Lemma 2.5, the computation of (v_1, v_2) as an algorithm (without fixing v_1) is 2β-DP.
And so on. By induction, the computation of (v_1, …, v_t) as an algorithm is tβ-DP. Thus, PROJ(·, t) is tβ-DP, and the result follows.
Next, we show that the output of the algorithm PROJ behaves like the t-dimensional PCA projection, in the sense of Lemma 5.5 below. Observe that if P is the projection onto the t leading eigenvectors of a p × p positive semidefinite matrix A, then ‖(I − P)A(I − P)‖_2² = Σ_{i=t+1}^p λ_i(A)². To prove Lemma 5.5, we first prove the following lemma, which we then apply repeatedly to obtain Lemma 5.5.
Proof. For every p × p real symmetric matrix B and every 1 ≤ i ≤ p, we have where the infimum is over all subspaces W of R^p of dimension p − i + 1. Thus, since P_v is a rank-one orthogonal projection, for every 1 ≤ i ≤ p − 1. Thus, the result follows.
Lemma 5.5. Suppose that A is a p × p positive semidefinite matrix and 1 ≤ t ≤ p. If P is an output of PROJ(A, t), then for all γ > 0, where C > 0 is an absolute constant.
Proof. Let v_1, …, v_t be the vectors defined in the algorithm PROJ(A, t).
Since v_{k+1} is an output of PVEC(A_k), by Lemma 5.2 we have for all 1 ≤ j ≤ p and 0 ≤ k ≤ t − 1, where the expectation E_{v_{k+1}} is over v_{k+1} conditioned on v_1, …, v_k. By Lemma 5.4, we have for all 1 ≤ j ≤ p and 0 ≤ k ≤ t − 1. Therefore, for all 1 ≤ j ≤ p and 0 ≤ k ≤ t − 1.
In the algorithm PROJ(A, t), each v_{k+1} is chosen from the unit sphere of ran(I − P_{v_1,…,v_k}). Hence, the vectors v_1, …, v_t are orthonormal, so I − P_{v_1,…,v_{k+1}} = (I − P_{v_{k+1}})(I − P_{v_1,…,v_k}) for all 1 ≤ k ≤ t − 1. Thus, (I − P_{v_{k+1}})A_k(I − P_{v_{k+1}}) = A_{k+1}. So we have for all 1 ≤ j ≤ p and 0 ≤ k ≤ t − 1. Taking the full expectation E on both sides, we get for all 1 ≤ j ≤ p and 0 ≤ k ≤ t − 1, where we used the fact that λ_1(A_k) = ‖A_k‖ ≤ ‖A‖. Repeated application of this inequality yields Note that A_t = (I − P_{v_1,…,v_t})A(I − P_{v_1,…,v_t}) and P = P_{v_1,…,v_t} is the output of PROJ(A, t). Thus, the left hand side is equal to E‖A_t‖

Microaggregation with more control
We will protect privacy by adding noise to the microaggregation mechanism. To make this work, we need a version of Theorem 4.1 with more control. We adapt the microaggregation mechanism from (4.1) to the current setting. Given a partition [n] = F_1 ∪ ⋯ ∪ F_s (where some F_j could be empty), we define w_j = |F_j|/n and y_j = |F_j|^{-1} Σ_{i∈F_j} x_i for 1 ≤ j ≤ s with F_j non-empty; when F_j is empty, set w_j = 0 and let y_j be an arbitrary point. (5.1)
Theorem 5.6 (Microaggregation with more control). Let x_1, …, x_n ∈ R^p be such that ‖x_i‖_2 ≤ 1 for all i. Let S = (1/n) Σ_{i=1}^n x_i x_i^T. Let P be an orthogonal projection on R^p. Let ν_1, …, ν_s ∈ R^p be an α-covering of the unit Euclidean ball of ran(P). Let [n] = F_1 ∪ ⋯ ∪ F_s be a nearest-point partition of (P x_i) with respect to ν_1, …, ν_s. Then the weights w_j and vectors y_j defined in (5.1) satisfy for all d ∈ N: ‖(1/n) Σ_{i=1}^n x_i^{⊗d} − Σ_{j=1}^s w_j y_j^{⊗d}‖ (5.2) Proof of Theorem 5.6. We explained in Section 4.1 how to realize microaggregation probabilistically as conditional expectation. To reiterate, we consider the sample space [n] equipped with the uniform probability distribution and define a random variable X on [n] by setting X(i) = x_i for i = 1, …, n.
If F = σ(F_1, …, F_s) is the sigma-algebra generated by this partition, then Y = E[X|F] is a random vector that takes values y_j with probability w_j, as defined in (5.1). Then the left hand side of (5.2) equals

Perturbing the weights and vectors
Theorem 5.6 makes the first step towards noisy microaggregation. Next, we will add noise to the weights (w_j) and vectors (y_j) obtained by microaggregation. To control the effect of such noise on the accuracy, the following two simple bounds will be useful.
Lemma 5.7. Let u, v ∈ R^n be such that ‖u‖_2 ≤ 1 and ‖v‖_2 ≤ 1. Then, for every d ∈ N, Proof. For d = 1 the result is trivial. For d ≥ 2, we can represent the difference u^{⊗d} − v^{⊗d} as a telescoping sum Then, by the triangle inequality, where we used the assumption on the norms of u and v in the last step. The lemma is proved.
Lemma 5.8. Consider numbers λ_j, µ_j ∈ R and vectors u_j, v_j ∈ R^p such that ‖u_j‖_2 ≤ 1 and ‖v_j‖_2 ≤ 1 for all j = 1, …, m. Then, for every d ∈ N, (5.3) Proof. Adding and subtracting the cross term Σ_j λ_j v_j^{⊗d} and using the triangle inequality, we can bound the left side of (5.3) by It remains to use Lemma 5.7 and note that ‖v_j^{⊗d}‖ = ‖v_j‖_2^d ≤ 1.

Damping
Although the microaggregation mechanism (5.1) is stable with respect to additive noise in the weights w_j or the vectors y_j, as shown in Section 5.3, there are still two issues that need to be resolved.
The first issue is the potential instability of the microaggregation mechanism (5.1) for small blocks F_j. For example, if |F_j| = 1, the microaggregation does nothing for that block and returns the original input vector y_j = x_i. To protect the privacy of such a vector, a lot of noise is needed, which might be harmful to the accuracy.
One may wonder why we cannot make all blocks F_j of the same size, as we did in Theorem 4.1. Indeed, in Section 3.6 we showed how to transform a potentially imbalanced partition [n] = F_1 ∪ ⋯ ∪ F_s into an equipartition (where all F_j have the same cardinality) using a divide-and-merge procedure; could we not apply it here? Unfortunately, an equipartition might be too sensitive⁶ to changes even in a single data point x_i. The original partition F_1, …, F_s, on the other hand, is sufficiently stable.
We resolve this issue by suppressing, or damping, the blocks F_j that are too small. Whenever the cardinality of F_j drops below a predefined level b, we divide by b rather than |F_j| in (5.1). In other words, instead of the vanilla microaggregation (5.1), we consider the following damped microaggregation: ỹ_j = (1/max(|F_j|, b)) Σ_{i∈F_j} x_i, j = 1, …, s. (5.4)
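Damped microaggregation per (5.4) can be sketched as follows (a hypothetical helper `damped_microaggregation`, assuming NumPy; the convention w_j = |F_j|/n and the arbitrary point for empty blocks follow (5.1)):

```python
import numpy as np

def damped_microaggregation(X, partition, b, n):
    """Damped microaggregation (5.4): block averages, except that blocks
    smaller than b are divided by b instead of |F_j|, which suppresses
    unstable small blocks. Empty blocks get weight 0 and the zero vector
    as an (arbitrary) point."""
    p = X.shape[1]
    ws, ys = [], []
    for F in partition:
        if len(F) == 0:
            ws.append(0.0)
            ys.append(np.zeros(p))
            continue
        ws.append(len(F) / n)
        ys.append(X[F].sum(axis=0) / max(len(F), b))
    return np.array(ws), np.array(ys)
```

Note that a block of size below b is shrunk toward zero, which is exactly the damping that keeps the sensitivity bounds of Lemmas 5.10 and 5.11 small.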

Metric projection
And here is the second issue. Recall that the numbers w_j returned by microaggregation (5.4) are probability weights: the weight vector w = (w_j)_{j=1}^s belongs to the unit simplex ∆ := {a = (a_1, …, a_s) : a_j ≥ 0, Σ_j a_j = 1}. This feature may be lost if we add noise to the w_j. Similarly, if the input vectors x_j are taken from a given convex set K (for Boolean data, this is K = [0, 1]^p), we would like the synthetic data to belong to K too. The microaggregation mechanism (5.1) respects this feature: by convexity, the vectors y_j do belong to K. However, this property may be lost if we add noise to the y_j. We resolve this issue by projecting the perturbed weights and vectors back onto the unit simplex ∆ and the convex set K, respectively. For this purpose, we utilize metric projections: mappings that return a proximal point in a given set. Formally, we let (5.5) (If the minimum is not unique, break the tie arbitrarily. One valid choice of π_{∆,1}(w) can be defined by setting all negative entries of w to 0 and then normalizing so that the result lies in ∆; in the case when all entries of w are nonpositive, set π_{∆,1}(w) to be any point of ∆.) Thus, here is our plan: given input data (x_i)_{i=1}^n, we apply damped microaggregation (5.4) to compute weights and vectors (w_j, ỹ_j)_{j=1}^s, add noise, and project the noisy weights and vectors back to the unit simplex ∆ and the convex set K, respectively. In other words, we compute w̄ = π_{∆,1}(w + ρ), ȳ_j = π_{K,2}(ỹ_j + r_j), (5.6) where ρ ∈ R^s and r_j ∈ R^p are noise vectors (which we will later take to be random Laplacian noise).
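The valid choice of π_{∆,1} described above (clip negatives, then normalize) is a one-liner; a sketch under that description (hypothetical helper `project_simplex_l1`, assuming NumPy):

```python
import numpy as np

def project_simplex_l1(w):
    """One valid l1-metric projection onto the unit simplex, as in the
    text: set negative entries to 0 and normalize; if all entries are
    nonpositive, return an arbitrary simplex point."""
    w = np.clip(np.asarray(w, dtype=float), 0.0, None)
    s = w.sum()
    if s == 0.0:
        e = np.zeros_like(w)
        e[0] = 1.0  # arbitrary point of the simplex
        return e
    return w / s
```

The output always has nonnegative entries summing to one, so the noisy weights can again be interpreted as probabilities.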

The accuracy guarantee
Here is the accuracy guarantee of our procedure. It is a version of Theorem 5.6 with noise, damping, and metric projection: Theorem 5.9 (Accuracy of damped, noisy microaggregation). Let K be a convex set in R^p that lies in the unit Euclidean ball B_2^p. Let x_1, …, x_n ∈ K. Let S = (1/n) Σ_{i=1}^n x_i x_i^T. Let P be an orthogonal projection on R^p. Let ν_1, …, ν_s ∈ R^p be an α-covering of the unit Euclidean ball of ran(P). Let [n] = F_1 ∪ ⋯ ∪ F_s be a nearest-point partition of (P x_i) with respect to ν_1, …, ν_s. Then the weights w̄_j and vectors ȳ_j defined in (5.6) satisfy for all d ∈ N: (5.7) Proof. Adding and subtracting the cross term Σ_j w_j y_j^{⊗d}, (5.8) The first term can be bounded by Theorem 5.6. For the second term we can use Lemma 5.8 and note that ‖y_j‖_2 ≤ 1 and ‖ȳ_j‖_2 ≤ 1 for all j ∈ [s].
(5.9) Indeed, the definition (5.1) of y_j and the assumption that the x_i lie in the convex set K imply that y_j ∈ K. Also, definition (5.6) implies that ȳ_j ∈ K as well. Now the bounds in (5.9) follow from the assumption that K ⊂ B_2^p. So, applying Theorem 5.6 and Lemma 5.8, we see that the quantity in (5.8) is bounded by (5.10) We bound the two sums in this expression separately. Let us start with the sum involving y_j and ȳ_j. We will handle large and small blocks differently. For a large block, one for which |F_j| ≥ b, by (5.4) we have ỹ_j = |F_j|^{-1} Σ_{i∈F_j} x_i = y_j ∈ K. By definition (5.6), ȳ_j is the closest point in K to ỹ_j + r_j. Since y_j ∈ K, we have ‖y_j − ȳ_j‖_2 ≤ ‖y_j + r_j − ȳ_j‖_2 + ‖r_j‖_2 (by the triangle inequality) ≤ ‖y_j + r_j − y_j‖_2 + ‖r_j‖_2 (by the minimality property of ȳ_j) = 2‖r_j‖_2.

Hence
Now let us handle small blocks. By (5.9) we have ‖y_j − ȳ_j‖_2 ≤ 2, so Combining our bounds for large and small blocks, we conclude that (5.11) Finally, let us bound the last sum in (5.10). By definition (5.6), w̄ is a closest point of the unit simplex ∆ to w + ρ in the ℓ_1 metric. Since w ∈ ∆, we have (5.12) Substitute (5.11) and (5.12) into (5.10) to complete the proof.

Privacy
Now that we have analyzed the accuracy of the synthetic data, we prove differential privacy. To that end, we will use the Laplacian mechanism, so we need to bound the sensitivity of the microaggregation.
Lemma 5.10 (Sensitivity of damped microaggregation). Let ‖·‖ be a norm on R^p. Consider vectors x_1, …, x_n ∈ R^p. Let I and I′ be subsets of [n] that differ in exactly one element. Then, for any b > 0, we have (5.13) Proof. Without loss of generality, we can assume that I′ = I \ {n_0} for some n_0 ∈ I. Denoting by ξ the difference vector whose norm we are estimating in (5.13), we have The sum on the right hand side consists of |I| − 1 terms, each satisfying ‖x_{n_0} − x_i‖ ≤ 2 max_i ‖x_i‖. This yields ‖ξ‖ ≤ (2/|I|) max_i ‖x_i‖. Since |I| ≥ b + 1 by assumption, we get an even better bound than we need in this case.
Lemma 5.11 (Sensitivity of damped microaggregation II). Let $\|\cdot\|$ be a norm on $\mathbb{R}^p$. Let $I$ be a subset of $[n]$ and let $n_0 \in I$. Consider vectors $x_1, \ldots, x_n \in \mathbb{R}^p$ and $x'_1, \ldots, x'_n \in \mathbb{R}^p$ such that $x_i = x'_i$ for all $i \ne n_0$. Then, for any $b > 0$, we have
$$\Big\| \frac{1}{\max(|I|, b)} \sum_{i \in I} x_i - \frac{1}{\max(|I|, b)} \sum_{i \in I} x'_i \Big\| \le \frac{2}{b} \max_i \max\big(\|x_i\|, \|x'_i\|\big).$$
Proof. The difference vector equals $\max(|I|, b)^{-1} (x_{n_0} - x'_{n_0})$, and since $\max(|I|, b) \ge b$, its norm is at most $\frac{1}{b}\|x_{n_0} - x'_{n_0}\| \le \frac{2}{b} \max_i \max(\|x_i\|, \|x'_i\|)$.
Theorem 5.12 (Privacy). In the situation of Theorem 5.9, suppose that all coordinates of the vectors $\rho$ and $r_j$ are independent Laplacian random variables, namely
$$\rho_j \sim \operatorname{Lap}\Big(\frac{6}{n\varepsilon}\Big) \quad \text{and} \quad (r_j)_k \sim \operatorname{Lap}\Big(\frac{12\sqrt{p}}{b\varepsilon}\Big),$$
and $P$ is an output of $\mathrm{PROJ}(\frac{n\varepsilon}{6t} S, t)$ where $S = \frac{1}{n} \sum_{i=1}^n x_i x_i^{\mathsf{T}}$. Then the output data $(\widehat{w}_j, \bar{y}_j)_{j=1}^s$ is $\varepsilon$-differentially private in the input data $(x_i)_{i=1}^n$.
Proof. First we check that the projection $P$ is private. To do this, let us bound the sensitivity of the second moment matrix $S = \frac{1}{n} \sum_{i=1}^n x_i x_i^{\mathsf{T}}$ in the spectral norm. Consider two input data sets $(x_i)_{i=1}^n$ and $(x'_i)_{i=1}^n$ that differ in exactly one element, i.e. $x_i = x'_i$ for all $i$ except some $i = n_0$. Then the corresponding matrices $S$ and $S'$ satisfy, by the triangle inequality,
$$\|S - S'\| = \frac{1}{n} \big\| x_{n_0} x_{n_0}^{\mathsf{T}} - x'_{n_0} (x'_{n_0})^{\mathsf{T}} \big\| \le \frac{1}{n} \big( \|x_{n_0}\|_2^2 + \|x'_{n_0}\|_2^2 \big) \le \frac{2}{n}.$$
Thus $\big\| \frac{n\varepsilon}{6t} S - \frac{n\varepsilon}{6t} S' \big\| \le \frac{\varepsilon}{3t}$, so by Lemma 5.3 the projection $P$ is $(\varepsilon/3)$-differentially private. Due to Lemma 2.5, it suffices to prove that for any fixed projection $P$, the output data $(\widehat{w}_j, \bar{y}_j)_{j=1}^s$ is $(2\varepsilon/3)$-differentially private in the input data $(x_i)_{i=1}^n$. Fixing $P$ also fixes the covering $(\nu_j)_{j=1}^s$.

Consider what happens if we change exactly one vector in the input data $(x_i)_{i=1}^n$. The effect of that change on the nearest-point partition $[n] = F_1 \cup \cdots \cup F_s$ is minimal: at most one index can move from one block $F_j$ to another block (thereby changing the cardinalities of those two blocks by $1$ each) or to another point in the same block, and the rest of the blocks stay the same. Thus the weight vector $w = (w_j)_{j=1}^s$, $w_j = |F_j|/n$, can change by at most $2/n$ in the $\ell_1$ norm. Due to the choice of $\rho$, it follows by Lemma 2.4 that $w + \rho$ is $(\varepsilon/3)$-differentially private.
For the same reason, all vectors $\widetilde{y}_j$ defined in (5.4), except for at most two, stay the same. Moreover, by Lemma 5.10 and Lemma 5.11, the change of each of these (at most) two vectors in the $\ell_1$ norm is bounded by $\frac{2}{b} \max_i \|x_i\|_1 \le \frac{2\sqrt{p}}{b}$, since all $x_i \in K \subset B_2^p$. Hence the change of the tuple $(\widetilde{y}_1, \ldots, \widetilde{y}_s) \in \mathbb{R}^{ps}$ in the $\ell_1$ norm is bounded by $4\sqrt{p}/b$. Due to the choice of $r_j$, it follows by Lemma 2.4 that $(\widetilde{y}_j + r_j)_{j=1}^s$ is $(\varepsilon/3)$-differentially private.
Since $\rho$ and the $r_j$ are independent, it follows by Lemma 2.7 that the pair $(w + \rho, (\widetilde{y}_j + r_j)_{j=1}^s)$ is $(2\varepsilon/3)$-differentially private. The output data $(\widehat{w}_j, \bar{y}_j)_{j=1}^s$ is a function of that pair, so it follows by Remark 2.6 that, for any fixed projection $P$, the output data is $(2\varepsilon/3)$-differentially private. Applying Lemma 2.5, the result follows.
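For intuition, the Laplace mechanism invoked via Lemma 2.4 can be sketched generically as follows (a standard illustration, not the paper's full pipeline; the function name is ours):

```python
import numpy as np

def laplace_mechanism(value, l1_sensitivity, eps, rng):
    """Release `value` with eps-differential privacy, assuming that changing
    one input data point moves `value` by at most l1_sensitivity in ell_1."""
    scale = l1_sensitivity / eps
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))

rng = np.random.default_rng(1)
n, eps = 1000, 0.5
w = np.array([0.3, 0.7])  # block weights w_j = |F_j| / n
# As shown above, one data point moves w by at most 2/n in the ell_1 norm.
w_noisy = laplace_mechanism(w, l1_sensitivity=2 / n, eps=eps, rng=rng)
```

Here the whole budget $\varepsilon$ is spent on one release; in the proof above the budget is split three ways (projection, weights, block averages), each part receiving $\varepsilon/3$.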

Accuracy
We are now ready to combine the privacy and accuracy guarantees provided by Theorem 5.12 and Theorem 5.9.
Choose the noises $\rho \in \mathbb{R}^s$ and $r_j \in \mathbb{R}^p$ as in the Privacy Theorem 5.12; then the moment bounds (5.14) hold. To check the first bound, use the triangle inequality to reduce it to the sum of the standard deviations of the Laplacian distribution. The second bound follows from summing the variances of the Laplace distribution over all entries.
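Both facts are elementary properties of the Laplace distribution: if $Z \sim \operatorname{Lap}(\lambda)$ then $\mathbb{E}|Z| = \lambda$, the standard deviation is $\sqrt{2}\,\lambda$, and $\operatorname{Var}(Z) = 2\lambda^2$. A quick Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.25
z = rng.laplace(scale=lam, size=2_000_000)

assert abs(np.mean(np.abs(z)) - lam) < 0.01        # E|Z| = lambda
assert abs(np.std(z) - np.sqrt(2) * lam) < 0.01    # sd(Z) = sqrt(2) * lambda
assert abs(np.var(z) - 2 * lam**2) < 0.01          # Var(Z) = 2 * lambda^2
```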
Choose $P$ to be an output of $\mathrm{PROJ}(\frac{n\varepsilon}{6t} S, t)$ as in Theorem 5.12, where $S = \frac{1}{n} \sum_{i=1}^n x_i x_i^{\mathsf{T}}$. Taking $A = \frac{n\varepsilon}{6t} S$ in Lemma 5.5, we obtain the bound (5.16) on $\mathbb{E}\, \|(I - P) A (I - P)\|$, where $C > 0$ is an absolute constant.
Choose the accuracy $\alpha$ of the covering and its dimension $t$ as follows:
$$t := \frac{\kappa \log n}{\log(7/\alpha)}, \qquad \alpha = \frac{1}{(\log n)^{1/4}},$$
where $\kappa \in (0, 1)$ is a fixed constant that will be introduced later (see Theorem 5.13), and assume that $t \ge 1$. In the case $t = 0$ we have $n \le C$ for some universal constant $C > 0$, so the left-hand side is at most $\|S\| \le 1$ while the right-hand side is of constant order.

Apply the Accuracy Theorem 5.9 for this choice of parameters, square both sides, and take expectations, using (5.14) and (5.16). Since the weights $w_j = |F_j|/n$ satisfy $\sum_{j=1}^s w_j = 1$, we have
$$\Big( \mathbb{E} \Big( \sum_{j=1}^s w_j \|r_j\|_2 \Big)^2 \Big)^{1/2} \le \Big( \mathbb{E} \sum_{j=1}^s w_j \|r_j\|_2^2 \Big)^{1/2},$$
and with this choice we can simplify the error bound. Note that $\kappa \log n = \frac{7}{2} \log(n^{2\kappa/7}) \le \frac{7}{2} n^{2\kappa/7}$. Note also that in the complementary range, where $n < (p/\varepsilon)^{1/(1-\kappa)}$, the second term of the bound is greater than one, so such an error bound is trivial to achieve by outputting an arbitrary point of $K$ as $\bar{y}_j$ for all $j$. Thus we have proved:

Theorem 5.13 (Privacy and accuracy). Let $K$ be a convex set in $\mathbb{R}^p$ that lies in the unit ball $B_2^p$, let $\varepsilon \in (0, 1)$, and fix $\kappa \in (0, 1)$. There exists an $\varepsilon$-differentially private algorithm that transforms input data $(x_i)_{i=1}^n$, where all $x_i \in K$, into output data $(\widehat{w}_j, \bar{y}_j)_{j=1}^s$, where $s \le n$, all $\widehat{w}_j \ge 0$, $\sum_j \widehat{w}_j = 1$, and all $\bar{y}_j \in K$, in such a way that the error bound obtained above holds for all $d \in \mathbb{N}$. The algorithm runs in time polynomial in $p$ and $n$, linear in the time to compute the metric projection onto $K$, and it is independent of $d$.

Bootstrapping
To get rid of the weights $\widehat{w}_j$ and give the synthetic data a custom size, we can use the bootstrapping introduced in Section 4.2: we sample new data $u_1, \ldots, u_m$ independently and with replacement by choosing $\bar{y}_j$ with probability $\widehat{w}_j$ at every step. Thus, consider the random vector $Y$ that takes the value $\bar{y}_j$ with probability $\widehat{w}_j$, and let $Y_1, \ldots, Y_m$ be independent copies of $Y$. Then obviously $\mathbb{E}\, Y^{\otimes d} = \sum_{j=1}^s \widehat{w}_j \bar{y}_j^{\otimes d}$, so the Bootstrapping Lemma 4.3 applies. Combining its bound with the bound in Theorem 5.13, we obtain:

Theorem 5.14 (Privacy and accuracy: custom data size). Let $K$ be a convex set in $\mathbb{R}^p$ that lies in the unit ball $B_2^p$, let $\varepsilon \in (0, 1)$, and fix $\kappa \in (0, 1)$. There exists an $\varepsilon$-differentially private algorithm that transforms input data $x_1, \ldots, x_n \in K$ into output data $u_1, \ldots, u_m \in K$, in such a way that the corresponding error bound holds for all $d \in \mathbb{N}$. The algorithm runs in time polynomial in $p$ and $n$, linear in $m$ and in the time to compute the metric projection onto $K$, and it is independent of $d$.
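This resampling step is plain weighted sampling with replacement; a minimal sketch (the names `bootstrap`, `w_hat`, and `y_bar` are our own illustration):

```python
import numpy as np

def bootstrap(y_bar, w_hat, m, rng):
    """Draw m i.i.d. samples, picking row j of y_bar with probability w_hat[j]."""
    idx = rng.choice(len(w_hat), size=m, p=w_hat)
    return y_bar[idx]

rng = np.random.default_rng(3)
y_bar = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
w_hat = np.array([0.5, 0.25, 0.25])
u = bootstrap(y_bar, w_hat, m=100_000, rng=rng)
# The empirical mean approximates sum_j w_hat[j] * y_bar[j] = [0.25, 0.25].
```

The same unbiasedness holds for all tensor powers $Y^{\otimes d}$, which is exactly what the Bootstrapping Lemma 4.3 quantifies.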

Boolean data: randomized rounding
Now we specialize to Boolean data, i.e., data from $\{0, 1\}^p$. If the input data $x_1, \ldots, x_n$ are Boolean, the output data $u_1, \ldots, u_m$ lie in $[0, 1]^p$ (for technical reasons we may need to rescale the data by $\sqrt{p}$, because Theorem 5.14 requires $K$ to lie in $B_2^p$). To transform them into Boolean data, we can use randomized rounding as described in Section 4.3: each coefficient of each vector $u_i$ is independently and randomly rounded to $1$ with probability equal to that coefficient (and to $0$ with the complementary probability). Exactly the same analysis as in Section 4.3 applies here, and we conclude:

Theorem 5.15 (Boolean private synthetic data). Let $\varepsilon, \kappa \in (0, 1)$. There exists an $\varepsilon$-differentially private algorithm that transforms input data $x_1, \ldots, x_n \in \{0, 1\}^p$ into output data $z_1, \ldots, z_m \in \{0, 1\}^p$ in such a way that the error satisfies
$$\frac{1}{\binom{p}{d}} \sum_{1 \le i_1 < \cdots < i_d \le p} \mathbb{E}\, |E(i_1, \ldots, i_d)|^2 \le 32^d\, \frac{\log\log n}{\kappa \log n} + \frac{p}{n^{1-\kappa} \varepsilon} + \frac{1}{m},$$
where $E(i_1, \ldots, i_d)$ is defined in (1.1).
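The rounding step can be sketched as follows (illustrative only; the key property is that each rounded coordinate is an unbiased Bernoulli version of the original, so all marginals are preserved in expectation):

```python
import numpy as np

def randomized_rounding(U, rng):
    """Round each entry of U (values in [0, 1]) to 1 with probability equal
    to that entry, independently across all entries."""
    return (rng.random(U.shape) < U).astype(int)

rng = np.random.default_rng(4)
U = np.full((200_000, 1), 0.3)  # many copies of a single coordinate u = 0.3
Z = randomized_rounding(U, rng)
# Each entry of Z is Bernoulli(0.3), so the empirical mean is close to 0.3.
```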

Theorem 4.6 gives a formal and non-asymptotic version of this result.

Conclusion. Let us summarize. The sum on the right side of (3.8) has $2^d - 1$ terms. The $d$ terms corresponding to $i_1 + \cdots + i_d = 1$ vanish. The remaining $2^d - d - 1$ terms are bounded by $K := 2^{d-2}\, \|\mathbb{E}\, X^{\otimes 2} - \mathbb{E}\, Y^{\otimes 2}\|_2$ each. Hence the entire sum is bounded by $(2^d - d - 1) K$, as claimed. The theorem is proved.



$$\frac{1}{\binom{p}{d}} \sum_{1 \le i_1 < \cdots < i_d \le p} |E(i_1, \ldots, i_d)|^2 \le 32^d\, \frac{\log\log n}{\kappa \log n} + \frac{p}{n^{1-\kappa} \varepsilon} + \frac{1}{m},$$
where $E(i_1, \ldots, i_d)$ is defined in (1.1). When this average error is at most $\delta$, we say that the synthetic data is $\delta$-accurate for $d$-dimensional marginals on average. Using Markov's inequality, we can see that the synthetic data is $o(1)$-accurate for $d$-dimensional marginals on average if and only if, with high probability, most of the $d$-dimensional marginals are asymptotically accurate; more precisely, with probability $1 - o(1)$, a $1 - o(1)$ fraction of the $d$-dimensional marginals of the synthetic data is within $o(1)$ of the corresponding marginals of the true data. Let us state our result informally.

Theorem 1.1 (Private synthetic Boolean data). Let $\varepsilon, \kappa \in (0, 1)$ and $n, m \in \mathbb{N}$. There exists an $\varepsilon$-differentially private algorithm that transforms input data $x_1, \ldots, x_n \in \{0, 1\}^p$ into output data $y_1, \ldots, y_m \in \{0, 1\}^p$. Moreover, if $d = O(1)$, $d \le p/2$, $m \gg 1$, and $n \gg (p/\varepsilon)^{1+\kappa}$, then the synthetic data is $o(1)$-accurate for $d$-dimensional marginals on average. The algorithm runs in time polynomial in $p$ and $n$, linear in $m$, and is independent of $d$.