Abstract
Differential privacy is a mathematical definition of privacy for statistical data analysis. It guarantees that any (possibly adversarial) data analyst is unable to learn too much information that is specific to an individual. Mironov et al. (CRYPTO 2009) proposed several computational relaxations of differential privacy (CDP), which relax this guarantee to hold only against computationally bounded adversaries. Their work and subsequent work showed that CDP can yield substantial accuracy improvements in various multiparty privacy problems. However, these works left open whether such improvements are possible in the traditional client-server model of data analysis. In fact, Groce, Katz and Yerukhimovich (TCC 2011) showed that, in this setting, it is impossible to take advantage of CDP for many natural statistical tasks.
Our main result shows that, assuming the existence of subexponentially secure one-way functions and 2-message witness indistinguishable proofs (zaps) for \(\mathbf {NP}\), there is in fact a computational task in the client-server model that can be efficiently performed with CDP, but is infeasible to perform with information-theoretic differential privacy.
Keywords
 Differential Privacy
 Client-server Model
 Witness Indistinguishability
 Valid Message-signature Pair
 Digital Signature Scheme
© IACR 2016. This article is the final version submitted by the authors to the IACR and to Springer-Verlag on August 23, 2016.
M. Bun—Supported by an NDSEG Fellowship and NSF grant CNS-1237235. Part of this work was done while the author was visiting Yale University.
Y. H. Chen—Supported by NSF grant CCF-1420938.
S. Vadhan—Supported by NSF grant CNS-1237235 and a Simons Investigator Award. Part of this work was done while the author was visiting the Shing-Tung Yau Center and the Department of Applied Mathematics at National Chiao-Tung University in Hsinchu, Taiwan.
1 Introduction
Differential privacy is a formal mathematical definition of privacy for the analysis of statistical datasets. It promises that a data analyst (treated as an adversary) cannot learn too much individual-level information from the outcome of an analysis. The traditional definition of differential privacy makes this promise information-theoretically: Even a computationally unbounded adversary is limited in the amount of information she can learn that is specific to an individual. On one hand, there are now numerous techniques that actually achieve this strong guarantee of privacy for a rich body of computational tasks. On the other hand, the information-theoretic definition of differential privacy does not itself permit the use of basic cryptographic primitives that naturally arise in the practice of differential privacy (such as the use of cryptographically secure pseudorandom generators in place of perfect randomness). More importantly, computationally secure relaxations of differential privacy open the door to designing improved mechanisms: ones that achieve either better utility (accuracy) or better computational efficiency than their information-theoretically secure counterparts.
Motivated by these observations, and building on ideas suggested in [BNO08], Mironov et al. [MPRV09] proposed several definitions of computational differential privacy (CDP). All of these definitions formalize what it means for the output of a mechanism to “look” differentially private to a computationally bounded (i.e., probabilistic polynomial-time) adversary. The sequence of works [DKM+06, BNO08, MPRV09] introduced a paradigm that enables two or more parties to take advantage of CDP, either to achieve better utility or reduced round complexity, when computing a joint function of their private inputs: The parties use a secure multiparty computation protocol to simulate having a trusted third party perform a differentially private computation on the union of their inputs. Subsequent work [MMP+10] showed that such a CDP protocol for approximating the Hamming distance between two private bit vectors is in fact more accurate than any (information-theoretically secure) differentially private protocol for the same task. A number of works [CSS12, GMPS13, HOZ13, KMS14, GKM+16] have since sought to characterize the extent to which CDP yields accuracy improvements for two-party privacy problems.
Despite the success of CDP in the design of improved algorithms in the multiparty setting, much less is known about what can be achieved in the traditional client-server model, in which a trusted curator holds all of the sensitive data and mediates access to it. Beyond just the absence of any techniques for taking advantage of CDP in this setting, results of Groce, Katz, and Yerukhimovich [GKY11] (discussed in more detail below) show that CDP yields no additional power in the client-server model for many basic statistical tasks. An additional barrier stems from the fact that all known lower bounds against computationally efficient differentially private algorithms [DNR+09, UV11, Ull13, BZ14, BZ16] in the client-server model are proved by exhibiting computationally efficient adversaries. Thus, these lower bounds rule out the existence of CDP mechanisms just as well as they rule out differentially private ones.
In this work, we give the first example of a computational problem in the client-server model that can be solved in polynomial time with CDP, but (under plausible assumptions) is computationally infeasible to solve with (information-theoretic) differential privacy. Our problem is specified by an efficiently computable utility function u, which takes as input a dataset \(D \in \mathcal {X}^n\) and an answer \(r \in \mathcal {R}\), and outputs 1 if the answer r is “good” for the dataset D, and 0 otherwise.
Theorem 1
(Main (Informal)). Assuming the existence of subexponentially secure one-way functions and “exponentially extractable” 2-message witness indistinguishable proofs (zaps) for \(\mathbf {NP}\), there exists an efficiently computable utility function \(u : \mathcal {X}^n \times \mathcal {R}\rightarrow \{0, 1\}\) such that

1.
There exists a polynomial time CDP mechanism \(M^{\mathrm {CDP}}\) such that for every dataset \(D \in \mathcal {X}^n\), we have \(\Pr [u(D, M^{\mathrm {CDP}}(D)) = 1] \ge 2/3\).

2.
There exists a computationally unbounded differentially private mechanism \(M^{\mathrm {unb}}\) such that for every dataset \(D \in \mathcal {X}^n\), we have \(\Pr [u(D, M^{\mathrm {unb}}(D)) = 1] \ge 2/3\).

3.
For every polynomial time differentially private M, there exists a dataset \(D \in \mathcal {X}^n\), such that \(\Pr [u(D, M(D)) = 1] \le 1/3\).
Note that the theorem provides a task where achieving differential privacy is infeasible – not impossible. This is inherent because the CDP mechanism we exhibit (for item 1) satisfies a simulation-based form of CDP (“SIM-CDP”), which implies the existence of a (possibly inefficient) differentially private mechanism, provided the utility function u is efficiently computable as we require. It remains an intriguing open problem to exhibit a task that can be achieved with a weaker indistinguishability-based notion of CDP (“IND-CDP”) but is impossible to achieve (even inefficiently) with differential privacy. Such a task would also separate IND-CDP and SIM-CDP, which is an interesting open problem in its own right.
Circumventing the impossibility results of [GKY11]. Groce et al. showed that in many natural circumstances, computational differential privacy cannot yield any additional power over differential privacy in the clientserver model. In particular, they showed two impossibility results:

1.
If a CDP mechanism accesses a one-way function (or more generally, any cryptographic primitive that can be instantiated with a random function) in a black-box way, then it can be simulated just as well (in terms of both utility and computational efficiency) by a differentially private mechanism.

2.
If the output of a CDP mechanism is in \(\mathbb {R}^d\) (for some constant d) and its utility is measured via an \(L_p\)-norm, then the mechanism can be simulated by a differentially private one, again without significant loss of utility or efficiency.
(In Sect. 4, we revisit the techniques of [GKY11] to strengthen the second result in some circumstances. In general, we show that when error is measured in any metric with doubling dimension \(O(\log k)\), CDP cannot improve utility by more than a constant factor. Specifically, with respect to \(L_p\) error, CDP cannot do much better than DP mechanisms even when d is logarithmic in the security parameter.)
We get around both of these impossibility results by (1) making non-black-box use of one-way functions via the machinery of zap proofs and (2) relying on a utility function that is far from the form to which the second result of [GKY11] applies. Indeed, our utility function is cryptographic and unnatural from a data analysis point of view. Roughly speaking, it asks whether the answer r is a valid zap proof of the statement “there exists a row of the dataset D that is a valid message-signature pair” for a secure digital signature scheme. It remains an intriguing problem for future work whether a separation can be obtained from a more natural task (such as answering a polynomial number of counting queries with differential privacy).
Our Construction and Techniques. Our construction is based on the existence of two cryptographic primitives: an existentially unforgeable digital signature scheme \(({{\mathrm{Gen}}}, {{\mathrm{Sign}}}, {{\mathrm{Ver}}})\), and a 2-message witness indistinguishable proof system (zap) (P, V) for \(\mathbf {NP}\). We make use of complexity leveraging [CGGM00] and thus require a complexity gap between the two primitives: namely, a subexponential-time algorithm should be able to break the security of the zap proof system, but should not be able to forge a valid message-signature pair for the digital signature scheme.
We now describe (eliding technical complications) the computational task which allows us to separate computational and information-theoretic differential privacy in the client-server model. Inspired by prior differential privacy lower bounds [DNR+09, UV11], we consider a dataset D that consists of many valid message-signature pairs \((m_1, \sigma _1), \dots , (m_n, \sigma _n)\) for the digital signature scheme. We say that a mechanism M gives a useful answer on D, i.e. the utility function u(D, M(D)) evaluates to 1, if it produces a proof \(\pi \) in the zap proof system that there exists a message-signature pair \((m, \sigma )\) for which \({{\mathrm{Ver}}}(m, \sigma ) = 1\).
First, let us see how the above task can be performed inefficiently with differential privacy. Consider the mechanism \(M^{\mathrm {unb}}\) that first confirms (in a standard differentially private way) that its input dataset indeed contains “many” valid message-signature pairs. Then \(M^{\mathrm {unb}}\) uses its unbounded computational resources to forge a canonical valid message-signature pair \((m, \sigma )\) and uses the zap prover on witness \((m, \sigma )\) to produce a proof \(\pi \). Since the choice of the forged pair does not depend on the input dataset at all, the procedure as a whole is differentially private.
Now let us see how a CDP mechanism can perform the same task efficiently. Our mechanism \(M^{\mathrm {CDP}}\) again first checks that it possesses many valid message-signature pairs, but this time it simply outputs a proof \(\pi \) using an arbitrary valid pair \((m_i, \sigma _i) \in D\) as its witness. Since the proof system is witness indistinguishable, a computationally bounded observer cannot distinguish \(\pi \) from the canonical proof output by the differentially private mechanism \(M^{\mathrm {unb}}\). Thus, the mechanism \(M^{\mathrm {CDP}}\) is in fact CDP in the strongest (simulation-based) sense.
Despite the existence of the inefficient differentially private mechanism \(M^{\mathrm {unb}}\), we show that the existence of an efficient mechanism M for this task would violate the subexponential security of the digital signature scheme. Suppose there were such a mechanism M. Now consider a subexponential-time adversary A that completely breaks the security of the zap proof system, in the sense that given a valid proof \(\pi \), it is always able to recover a corresponding witness \((m, \sigma )\). Since M is differentially private, the \((m, \sigma )\) extracted by A cannot be in the dataset D given to M. Thus, \((m, \sigma )\) constitutes a forgery of a valid message-signature pair, and hence the composed algorithm \(A \circ M\) violates the security of the signature scheme.
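The control flow described above can be sketched in code. The sketch below is purely structural and NOT a secure instantiation: `sign`/`ver` are a keyed hash rather than an unforgeable signature scheme, and `zap_prove` simply packages the witness, so it is not witness indistinguishable. The names `m_cdp` and `zap_prove` are illustrative, not from the paper.

```python
import hashlib

def sign(sk: bytes, m: bytes) -> bytes:
    # Toy stand-in for Sign(sk, m): a keyed hash of the message.
    return hashlib.sha256(sk + m).digest()

def ver(sk: bytes, m: bytes, sigma: bytes) -> bool:
    # Toy stand-in for Ver (a real scheme verifies against a public key vk).
    return sigma == sign(sk, m)

def zap_prove(witness):
    # Placeholder "proof"; a real zap would hide which witness was used.
    return ("proof", witness)

def m_cdp(dataset, sk, threshold):
    # M^CDP: confirm the dataset holds many valid message-signature pairs,
    # then prove the statement using an arbitrary valid pair as the witness.
    # (The real mechanism does the confirmation step differentially
    # privately; this sketch checks the count exactly.)
    valid = [(m, s) for (m, s) in dataset if ver(sk, m, s)]
    if len(valid) < threshold:
        return None
    return zap_prove(valid[0])

sk = b"secret-key"
dataset = [(bytes([i]), sign(sk, bytes([i]))) for i in range(10)]
proof = m_cdp(dataset, sk, threshold=5)
```

The attack on a hypothetical efficient differentially private M corresponds to composing such a mechanism with an extractor that recovers the witness from the proof.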
2 Preliminaries
2.1 (Computational) Differential Privacy
We first fix notation that will be used throughout this paper, and recall the notions of \((\varepsilon , \delta )\)-differential privacy and computational differential privacy. The abbreviation “PPT” stands for “probabilistic polynomial-time Turing machine.”
Security Parameter k. Let \(k \in \mathbb {N}\) denote a security parameter. In this work, datasets, privacypreserving mechanisms, and privacy parameters \(\varepsilon ,\delta \) will all be sequences parameterized in terms of k. Adversaries will also have their computational power parameterized by k; in particular, efficient adversaries have circuit size polynomial in k. A function is said to be negligible if it vanishes faster than any inverse polynomial in k.
Dataset D. A dataset D is an ordered tuple of n elements from some data universe \(\mathcal {X}\). Two datasets \(D, D'\) are said to be adjacent (written \(D \sim D'\)) if they differ in at most one row. We use \(\{D_k\}_{k\in \mathbb {N}}\) to denote a sequence of datasets, each over a data universe \(\mathcal {X}_k\), with sizes growing with the parameter k. The size in bits of a dataset \(D_k\), and in particular the number of rows n, will always be \({{\mathrm{poly}}}(k)\).
Mechanism M. A mechanism \(M : \mathcal {X}^* \rightarrow \mathcal {R}\) is a randomized function taking a dataset \(D \in \mathcal {X}^*\) to an output in a range space \(\mathcal {R}\). We will be especially interested in ensembles of efficient mechanisms \(\{M_k\}_{k\in \mathbb {N}}\) where each \(M_k : \mathcal {X}_k^* \rightarrow \mathcal {R}_k\), when run on an input dataset \(D \in \mathcal {X}_k^n\), runs in time \({{\mathrm{poly}}}(k, n)\).
Adversary A. Given an ensemble of mechanisms \(\{M_k\}_{k\in \mathbb {N}}\) with \(M_k : X_k^* \rightarrow \mathcal {R}_k\), we model an adversary \(\{A_k\}_{k\in \mathbb {N}}\) as a sequence of polynomial-size circuits \(A_k : \mathcal {R}_k \rightarrow \{0, 1\}\). Equivalently, \(\{A_k\}_{k\in \mathbb {N}}\) can be thought of as a probabilistic polynomial-time Turing machine with nonuniform advice.
Definition 1
(Differential Privacy [DMNS06, DKM+06]). A mechanism M is \((\varepsilon , \delta )\)-differentially private if for all adjacent datasets \(D \sim D'\) and every set \(S \subseteq \mathrm {Range}(M)\),
$$\begin{aligned} \Pr [M(D) \in S] \le e^{\varepsilon }\cdot \Pr [M(D') \in S] + \delta . \end{aligned}$$
Equivalently, for all adjacent datasets \(D \sim D'\) and every (computationally unbounded) algorithm A, we have
$$\begin{aligned} \Pr [A(M(D)) = 1] \le e^{\varepsilon }\cdot \Pr [A(M(D')) = 1] + \delta . \qquad (1) \end{aligned}$$
For consistency with the definition of SIM-CDP, we also make the following definitions for sequences of mechanisms:

An ensemble of mechanisms \(\{M_k\}_{k\in \mathbb {N}}\) is \(\varepsilon _k\)-DP if for all k, \(M_k\) is \((\varepsilon _k, {{\mathrm{negl}}}(k))\)-differentially private.

An ensemble of mechanisms \(\{M_k\}_{k\in \mathbb {N}}\) is \(\varepsilon _k\)-PureDP if for all k, \(M_k\) is \((\varepsilon _k, 0)\)-differentially private.
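For concreteness, the canonical example of a mechanism satisfying Definition 1 is the Laplace mechanism applied to a counting query (a sensitivity-1 statistic). The sketch below is a standard illustration, not part of this paper's construction:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Sample from Laplace(0, scale) via inverse-CDF of a uniform draw.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(dataset, predicate, eps: float) -> float:
    # A counting query changes by at most 1 between adjacent datasets,
    # so Laplace noise of scale 1/eps yields (eps, 0)-differential privacy.
    true_count = sum(1 for x in dataset if predicate(x))
    return true_count + laplace_noise(1.0 / eps)
```

The released answer is unbiased, with error on the order of \(1/\varepsilon \) independent of n.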
The above definitions are completely information-theoretic. Several computational relaxations of this definition were proposed by Mironov et al. [MPRV09]. The first, “indistinguishability-based” definition, denoted IND-CDP, relaxes Condition (1) to hold only against computationally bounded adversaries:
Definition 2
(IND-CDP). A sequence of mechanisms \(\{M_k\}_{k\in \mathbb {N}}\) is \(\varepsilon _k\)-IND-CDP if there exists a negligible function \({{\mathrm{negl}}}(\cdot )\) such that for all sequences of pairs of \({{\mathrm{poly}}}(k)\)-size adjacent datasets \(\{(D_k, D'_k)\}_{k\in \mathbb {N}}\), and all nonuniform polynomial-time adversaries A,
$$\begin{aligned} \Pr [A(M_k(D_k)) = 1] \le e^{\varepsilon _k}\cdot \Pr [A(M_k(D'_k)) = 1] + {{\mathrm{negl}}}(k). \end{aligned}$$
Mironov et al. [MPRV09] also proposed a stronger “simulation-based” definition of computational differential privacy. A mechanism is said to be \(\varepsilon \)-SIM-CDP if its output is computationally indistinguishable from that of an \(\varepsilon \)-differentially private mechanism:
Definition 3
(SIM-CDP). A sequence of mechanisms \(\{M_k\}_{k\in \mathbb {N}}\) is \(\varepsilon _k\)-SIM-CDP if there exists a negligible function \({{\mathrm{negl}}}(\cdot )\) and a family of mechanisms \(\{M'_k\}_{k\in \mathbb {N}}\) that is \(\varepsilon _k\)-differentially private such that for all \({{\mathrm{poly}}}(k)\)-size datasets D, and all nonuniform polynomial-time adversaries A,
$$\begin{aligned} \left| \Pr [A(M_k(D)) = 1] - \Pr [A(M'_k(D)) = 1]\right| \le {{\mathrm{negl}}}(k). \end{aligned}$$
If \(M_k'\) is in fact \(\varepsilon _k\)-pure differentially private, then we say that \(\{M_k\}_{k\in \mathbb {N}}\) is \(\varepsilon _k\)-PureSIM-CDP.
We write \(A \preceq B\) to denote that a mechanism satisfying definition A also satisfies definition B (that is, A is a stricter privacy definition than B). We then have the following relationships between the various notions of (computational) differential privacy:
$$\begin{aligned} \varepsilon _k\text {-PureDP} \preceq \varepsilon _k\text {-PureSIM-CDP} \preceq \varepsilon _k\text {-SIM-CDP} \preceq \varepsilon _k\text {-IND-CDP}, \end{aligned}$$and likewise \(\varepsilon _k\text {-PureDP} \preceq \varepsilon _k\text {-DP} \preceq \varepsilon _k\text {-SIM-CDP}\).
We will state and prove our separation between CDP and differential privacy for the simulation-based definition SIM-CDP. Since SIM-CDP is a stronger privacy notion than IND-CDP, this implies a separation between IND-CDP and differential privacy as well.
2.2 Utility
We describe an abstract notion of what it means for a mechanism to “succeed” at performing a computational task. We define a computational task implicitly in terms of an efficiently computable utility function, which takes as input a dataset \(D \in \mathcal {X}^*\) and an answer \(r \in \mathcal {R}\) and outputs a score describing how well r solves a given problem on instance D. For our purposes, it suffices to consider binary-valued utility functions u, which output 1 iff the answer r is “good” for the dataset D.
Definition 4
(Utility). A utility function is an efficiently computable (deterministic) function \(u : \mathcal {X}^* \times \mathcal {R} \rightarrow \{0, 1\}\). A mechanism M is \(\alpha \)-useful for a utility function \(u : \mathcal {X}^* \times \mathcal {R} \rightarrow \{0, 1\}\) if for all datasets D,
$$\begin{aligned} \Pr [u(D, M(D)) = 1] \ge \alpha . \end{aligned}$$
Restricting our attention to efficiently computable utility functions is necessary to rule out pathological separations between computational and statistical notions of differential privacy. For instance, let \(\{G_k\}_{k\in \mathbb {N}}\) be a pseudorandom generator with \(G_k : \{0, 1\}^k \rightarrow \{0, 1\}^{2k}\), and consider the (hard-to-compute) function with \(u(0, r) = 1\) iff r is in the image of \(G_k\), and \(u(1, r) = 1\) iff r is not in the image of \(G_k\). Then the mechanism M(b) that samples from \(G_k\) if \(b = 0\) and samples a random string if \(b = 1\) is useful with overwhelming probability. Moreover, M is computationally indistinguishable from the mechanism that always outputs a random string, and hence is SIM-CDP. On the other hand, the supports of \(u(0, \cdot )\) and \(u(1, \cdot )\) are disjoint, so no differentially private mechanism can achieve high utility with respect to u.
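To make the pathology concrete, here is a toy instantiation with an 8-bit seed, where the "PRG" is a hash-based expander (illustrative only; at this scale u is easy to compute by enumerating the image, which is exactly why the argument needs a real, secure \(G_k\)):

```python
import hashlib
import random

K = 8  # toy seed length, in bits

def G(seed: int) -> int:
    # "Expands" an 8-bit seed to a 16-bit output (stand-in for a PRG).
    return int.from_bytes(
        hashlib.sha256(seed.to_bytes(1, "big")).digest()[:2], "big")

IMAGE = {G(s) for s in range(2 ** K)}  # at most 256 of the 65536 strings

def u(b: int, r: int) -> int:
    # u(0, r) = 1 iff r is in the image of G; u(1, r) = 1 iff it is not.
    return int((r in IMAGE) == (b == 0))

def M(b: int) -> int:
    # Pseudorandom output on b = 0, truly random output on b = 1.
    return G(random.randrange(2 ** K)) if b == 0 else random.randrange(2 ** 16)
```

Since the image occupies at most a \(2^{-k}\) fraction of \(\{0,1\}^{2k}\), M(1) lands outside it with overwhelming probability, so M is useful; yet any differentially private mechanism must put comparable mass on the two disjoint supports and fails.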
2.3 Zaps (2-Message WI Proofs)
The first cryptographic tool we need in our construction is 2-message witness indistinguishable proofs for \(\mathbf {NP}\) (“zaps”) [FS90, DN07] in the plain model (with no common reference string). Consider a language \(L \in \mathbf {NP}\). A witness relation for L is a polynomial-time decidable binary relation \(R_L = \{(x, w)\}\) such that \(|w| \le {{\mathrm{poly}}}(|x|)\) whenever \((x, w) \in R_L\), and
$$\begin{aligned} L = \{x : \exists w \text { such that } (x, w) \in R_L\}. \end{aligned}$$
Definition 5
(Zap). Let \(R_L = \{(x, w)\}\) be a witness relation corresponding to a language \(L \in \mathbf {NP}\). A zap proof system for \(R_L\) consists of a pair of algorithms (P, V) where:

In the first round, the verifier sends a message \(\rho \leftarrow \{0, 1\}^{\ell (k, |x|)}\) (“public coins”), where \(\ell (\cdot , \cdot )\) is a fixed polynomial.

In the second round, the prover runs a PPT P that takes as input a pair (x, w) and verifier’s first message \(\rho \) and outputs a proof \(\pi \).

The verifier runs an efficient, deterministic algorithm V that takes as input an instance x, a firstround message \(\rho \), and proof \(\pi \), and outputs a bit in \(\{0, 1\}\).
The security requirements of the proof system are:

1.
Perfect completeness. An honest prover who possesses a valid witness can always convince an honest verifier. Formally, for all \(x \in \{0, 1\}^{{{\mathrm{poly}}}(k)}\), \((x, w)\in R_L\), and \(\rho \in \{0, 1\}^{\ell (k, |x|)}\),
$$\begin{aligned} \mathop {\Pr }\limits _{{\pi \leftarrow P(1^k, x, w, \rho )}}[V(1^k, x, \rho , \pi ) = 1] = 1. \end{aligned}$$ 
2.
Statistical soundness. With overwhelming probability over the choice of \(\rho \), it is impossible to convince an honest verifier of the validity of a false statement. Formally, there exists a negligible function \({{\mathrm{negl}}}(\cdot )\) such that for all sufficiently large k and \(t = {{\mathrm{poly}}}(k)\), we have
$$\begin{aligned} \mathop {\Pr }\limits _{{\rho \leftarrow \{0, 1\}^{\ell (k, t)}}}[\exists x \notin L \cap \{0, 1\}^t, \pi \in \{0, 1\}^* : V(1^k, x, \rho , \pi ) = 1] \le {{\mathrm{negl}}}(k). \end{aligned}$$ 
3.
Witness indistinguishability. For every sequence \(\{x_k\}_{k\in \mathbb {N}}\) with \(|x_k| = {{\mathrm{poly}}}(k)\), every two sequences \(\{w^1_k\}_{k\in \mathbb {N}}\), \(\{w^2_k\}_{k\in \mathbb {N}}\) such that \((x_k, w^1_k), (x_k, w^2_k)\in R_L\), and every choice of the verifier’s first message \(\rho \), we have
$$\begin{aligned} \{P(1^k, x_k, w^1_k, \rho )\}_{k\in \mathbb {N}} \mathbin {\mathop {\approx }\limits ^\mathrm{c}}\{P(1^k, x_k, w^2_k, \rho )\}_{k\in \mathbb {N}}. \end{aligned}$$Namely, for every such pair of sequences, there exists a negligible function \({{\mathrm{negl}}}(\cdot )\) such that for all polynomialtime adversaries A and all sufficiently large k, we have
$$\begin{aligned} \left| \Pr [A(1^k, P(1^k, x_k, w^1_k, \rho )) = 1] - \Pr [A(1^k, P(1^k, x_k, w^2_k, \rho )) = 1]\right| \le {{\mathrm{negl}}}(k). \end{aligned}$$
In our construction, we will need more fine-grained control over the security of our zap proof system. In particular, we need the proof system to be extractable by an adversary running in time \(2^{O(k)}\), in that such an adversary can always reverse-engineer a valid proof \(\pi \) to find a witness w such that \((x, w) \in R_L\). It is important to note that we require the running time of the adversary to be exponential in the security parameter k, but otherwise independent of the statement size \(|x|\).
Definition 6
(Extractable Zap). The triple of algorithms (P, V, E) is an extractable zap proof system if (P, V) is a zap proof system and there exists an algorithm E running in time \(2^{O(k)}\) with the following property:

4.
(Exponential Statistical) Extractability. There exists a negligible function \({{\mathrm{negl}}}(\cdot )\) such that for all \(x \in \{0, 1\}^{{{\mathrm{poly}}}(k)}\):
$$\begin{aligned} \mathop {\Pr }\limits _{{\rho \leftarrow \{0, 1\}^{\ell (k, x)}}}[\exists \pi \in \{0, 1\}^*&, w \in E(1^k, x, \rho , \pi ) : \\&(x, w) \notin R_L \; \wedge \; V(1^k, x, \rho , \pi ) = 1] \le {{\mathrm{negl}}}(k). \end{aligned}$$
While we do not know whether extractability is a generic property of zaps, it is preserved under Dwork and Naor’s reduction to NIZKs in the common random string model. Namely, if we plug an extractable NIZK into Dwork and Naor’s construction, we obtain an extractable zap.
Theorem 2
If there exist non-interactive zero-knowledge proofs of knowledge for \(\mathbf {NP}\) [DN07], then every language in \(\mathbf {NP}\) has an extractable zap proof system (P, V, E), as defined in Definition 6.
For completeness, we sketch Dwork and Naor’s construction in Appendix B and argue its extractability.
2.4 Digital Signatures
The other ingredient we need in our construction is a subexponentially strongly unforgeable digital signature scheme. Here “strong unforgeability” [ADR02] means that the adversary in the existential unforgeability game is allowed to forge a signature for a message it has queried before, as long as the signature is different from the one it received.
Definition 7
(Subexponentially Strongly Unforgeable Digital Signature Scheme). Let \(c\in (0, 1)\) be a constant. A c-strongly unforgeable digital signature scheme is a triple of PPT algorithms \(({{\mathrm{Gen}}}, {{\mathrm{Sign}}}, {{\mathrm{Ver}}})\) where

\((sk, vk) \leftarrow {{\mathrm{Gen}}}(1^k)\): The generation algorithm takes as input a security parameter k and generates a secret key and a verification key.

\(\sigma \leftarrow {{\mathrm{Sign}}}(sk, m)\): The signing algorithm signs a message \(m\in \{0, 1\}^*\) to produce a signature \(\sigma \in \{0, 1\}^*\).

\(b \leftarrow {{\mathrm{Ver}}}(vk, m, \sigma )\): The (deterministic) verification algorithm outputs a bit to indicate whether the signature \(\sigma \) is a valid signature of m.
The algorithms have the following properties:

1.
Correctness. For every message \(m \in \{0, 1\}^*\),
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (sk, vk)\leftarrow {{\mathrm{Gen}}}(1^k)\\ \sigma \leftarrow {{\mathrm{Sign}}}(sk, m) \end{array}}}[{{\mathrm{Ver}}}(vk, m, \sigma ) = 1] = 1. \end{aligned}$$ 
2.
Existential unforgeability. There exists a negligible function \({{\mathrm{negl}}}(\cdot )\) such that for all adversaries A running in time \(2^{k^c}\),
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (sk, vk)\leftarrow {{\mathrm{Gen}}}(1^k)\\ (m, \sigma )\leftarrow A^{{{\mathrm{Sign}}}(sk, \cdot )}(vk) \end{array}}}[{{\mathrm{Ver}}}(vk, m, \sigma ) = 1\,\text { and }\, (m, \sigma )\notin Q] < {{\mathrm{negl}}}(k) \end{aligned}$$where Q is the set of message-signature pairs obtained through A’s use of the signing oracle.
Theorem 3
If subexponentially secure one-way functions exist, then there is a constant \(c\in (0, 1)\) such that a c-strongly unforgeable digital signature scheme exists.
The reduction from one-way functions to digital signatures [NY89, Rom90, KK05, Gol04] can be applied when both primitives are secure against subexponential-time adversaries.
3 Separating CDP and Differential Privacy
In this section, we define a computational problem in the client-server model that can be efficiently solved with CDP, but not with statistical differential privacy. That is, we define a utility function u for which there exists a CDP mechanism achieving high utility, whereas any efficient differentially private algorithm can achieve only negligible utility.
Theorem 4
(Main). Assume the existence of subexponentially secure one-way functions and extractable zaps for \(\mathbf {NP}\). Then there exists a sequence of data universes \(\{\mathcal {X}_k\}_{k\in \mathbb {N}}\), range spaces \(\{\mathcal {R}_k\}_{k\in \mathbb {N}}\), and an (efficiently computable) utility function \(u_k : \mathcal {X}_k^* \times \mathcal {R}_k \rightarrow \{0, 1\}\) such that

1.
There exists a polynomial p such that for any \(\varepsilon _k, \beta _k > 0\) there exist a polynomial-time \(\varepsilon _k\)-PureSIM-CDP mechanism \(\{M^{\mathrm {CDP}}_k\}_{k\in \mathbb {N}}\) and an (inefficient) \(\varepsilon _k\)-PureDP mechanism \(\{M^{\mathrm {unb}}_k\}_{k\in \mathbb {N}}\) such that for every \(n \ge p(k, 1/\varepsilon _k, \log (1/\beta _k))\) and dataset \(D \in \mathcal {X}_k^n\), we have
$$\begin{aligned} \Pr [u_k(D, M^{\mathrm {CDP}}(D)) = 1] \ge 1\beta _k \;\text { and }\; \Pr [u_k(D, M^{\mathrm {unb}}(D)) = 1] \ge 1\beta _k \end{aligned}$$ 
2.
For every \(\varepsilon _k \le O(\log k)\), \(\alpha _k = 1/{{\mathrm{poly}}}(k)\), \(n = {{\mathrm{poly}}}(k)\), and efficient \((\varepsilon _k, \delta = 1/n^2)\)-differentially private mechanism \(\{M'_k\}_{k\in \mathbb {N}}\), there exists a dataset \(D \in \mathcal {X}_k^n\) such that
$$\begin{aligned} \Pr [u(D, M'(D)) = 1] \le \alpha _k \;\text { for sufficiently large k.} \end{aligned}$$
Remark 1
We can only hope to separate SIM-CDP and differential privacy by designing a task that is infeasible with differential privacy but not impossible. By the definition of (Pure)SIM-CDP, for a mechanism \(\{M_k\}_{k\in \mathbb {N}}\) there exists an \(\varepsilon _k\)-(Pure)DP mechanism \(\{M'_k\}_{k\in \mathbb {N}}\) that is computationally indistinguishable from \(\{M_k\}_{k\in \mathbb {N}}\). But if for every differentially private \(\{M'_k\}_{k\in \mathbb {N}}\) there were a dataset \(D_k\in \mathcal {X}^n_k\) such that \(\Pr [u_k(D_k, M_k'(D_k)) = 1] \le \Pr [u_k(D_k, M_k(D_k)) = 1] - 1/{{\mathrm{poly}}}(k)\), then the utility function \(u_k(D_k, \cdot )\) would itself serve as a distinguisher between \(\{M'_k\}_{k\in \mathbb {N}}\) and \(\{M_k\}_{k\in \mathbb {N}}\).
3.1 Construction
Let \(({{\mathrm{Gen}}}, {{\mathrm{Sign}}}, {{\mathrm{Ver}}})\) be a c-strongly unforgeable digital signature scheme with parameter \(c > 0\) as in Definition 7. After fixing c, we define for each \(k \in \mathbb {N}\) a reduced security parameter \(k_c= k^{c/2}\). We will use \(k_c\) as the security parameter for an extractable zap proof system (P, V, E). Since k and \(k_c\) are polynomially related, a negligible function in k is negligible in \(k_c\) and vice versa.
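To spell out the complexity gap this choice creates: the zap extractor (Definition 6) under security parameter \(k_c\) runs in time
$$\begin{aligned} 2^{O(k_c)} = 2^{O(k^{c/2})} = 2^{o(k^{c})}, \end{aligned}$$which is asymptotically smaller than the \(2^{k^c}\) running time against which the signature scheme remains unforgeable. Hence an adversary given enough time to break the zap by brute force still cannot forge signatures.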
Given a security parameter \(k \in \mathbb {N}\), define the following sets of bit strings:
 Verification Key Space:

\(\mathcal {K}_k = \{0, 1\}^{\ell _1}\) where \(\ell _1 = |vk|\) for \((sk, vk) \leftarrow {{\mathrm{Gen}}}(1^k)\),
 Message Space:

\(\mathcal {M}_k = \{0, 1\}^k\),
 Signature Space:

\(\mathcal {S}_k = \{0, 1\}^{\ell _2}\) where \(\ell _2 = |\sigma |\) for \(\sigma \leftarrow {{\mathrm{Sign}}}(sk, m)\) with \(m \in \mathcal {M}_k\),
 Public Coins Space:

\(\mathcal {P}_k = \{0, 1\}^{\ell _3}\) where \(\ell _3 = {{\mathrm{poly}}}(\ell _1)\) is the length of first-round zap messages used to prove statements from \(\mathcal {K}_k\) under security parameter \(k_c\),
 Data Universe:

\(\mathcal {X}_k = \mathcal {K}_k\times \mathcal {M}_k\times \mathcal {S}_k\times \mathcal {P}_k\).
That is, similarly to one of the hardness results of [DNR+09], we consider datasets D that contain n rows of the form \(x_1 = (vk_1, m_1, \sigma _1, \rho _1), \dots , x_n = (vk_n, m_n, \sigma _n, \rho _n)\), each corresponding to a verification key, message, and signature from the digital signature scheme, and to a zap verifier’s public coin tosses.
Let \(L \in \mathbf {NP}\) be the language
$$\begin{aligned} L = \{vk \in \mathcal {K}_k : \exists (m, \sigma ) \in \mathcal {M}_k\times \mathcal {S}_k\text { such that } {{\mathrm{Ver}}}(vk, m, \sigma ) = 1\}, \end{aligned}$$which has the natural witness relation
$$\begin{aligned} R_L = \{(vk, (m, \sigma )) : {{\mathrm{Ver}}}(vk, m, \sigma ) = 1\}. \end{aligned}$$
Define
 Proof Space:

\(\varPi _k = \{0, 1\}^{\ell _4}\) where \(\ell _4 = |\pi |\) for \(\pi \leftarrow P(1^{k_c}, vk, (m, \sigma ), \rho )\) for \(vk \in (L \cap \mathcal {K}_k)\) with witness \((m, \sigma ) \in \mathcal {M}_k \times \mathcal {S}_k\) and public coins \(\rho \in \mathcal {P}_k\), and
 Output Space:

\(\mathcal {R}_k = \mathcal {K}_k\times \mathcal {P}_k\times \varPi _k\).
Definition of Utility Function u. We now specify our computational task of interest via a utility function \(u : \mathcal {X}_k^n \times \mathcal {R}_k \rightarrow \{0, 1\}\). For any strings \(vk \in \mathcal {K}_k\) and \(\rho \in \mathcal {P}_k\), and any dataset \(D = ((vk_1, m_1, \sigma _1, \rho _1), \cdots , (vk_n, m_n, \sigma _n, \rho _n))\in \mathcal {X}_k^n\), define an auxiliary function
$$\begin{aligned} f_{vk, \rho }(D) = \left| \{i \in [n] : vk_i = vk, \; \rho _i = \rho , \; {{\mathrm{Ver}}}(vk, m_i, \sigma _i) = 1\}\right| . \end{aligned}$$That is, \(f_{vk, \rho }(D)\) is the number of elements of the dataset D with verification key equal to vk and public coin string equal to \(\rho \) for which \((m_i, \sigma _i)\) is a valid message-signature pair under vk. We now define \(u(D, (vk, \rho , \pi )) = 1\) if and only if either (1) many entries of D (at least a fixed threshold fraction of the n rows) contain valid message-signature pairs under the same verification key vk with the same public coin string \(\rho \), and \(\pi \) is a valid proof for statement vk using \(\rho \) (i.e. \(V(1^{k_c}, vk, \rho , \pi ) = 1\)); or (2) it is not the case that many entries of D contain valid message-signature pairs under the same verification key, with the same public coin string (in which case any response \((vk, \rho , \pi )\) is acceptable).
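The two cases above can be sketched as follows, with the verifiers passed in as callables and the "many entries" threshold left as an explicit parameter t (the construction fixes a particular fraction of n; the names `t`, `ver`, and `zap_verify` are illustrative, not from the paper):

```python
def f(vk, rho, dataset, ver):
    # f_{vk,rho}(D): rows matching (vk, rho) whose (m, sigma) verifies.
    return sum(1 for (vk_i, m_i, s_i, rho_i) in dataset
               if vk_i == vk and rho_i == rho and ver(vk, m_i, s_i))

def u(dataset, answer, ver, zap_verify, t):
    vk, rho, pi = answer
    pairs = {(vk_i, rho_i) for (vk_i, _, _, rho_i) in dataset}
    if all(f(vk2, rho2, dataset, ver) < t for (vk2, rho2) in pairs):
        return 1  # case (2): no heavy (vk, rho); any answer is acceptable
    # case (1): some (vk, rho) is heavy, so the answer must name a heavy
    # pair and carry a proof that the zap verifier accepts.
    return int(f(vk, rho, dataset, ver) >= t and zap_verify(vk, rho, pi))
```

Note that if some pair is heavy but the answer names a different, non-heavy pair, u evaluates to 0, matching the description above.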
3.2 An Inefficient Differentially Private Algorithm
We begin by showing that there is an inefficient differentially private mechanism that achieves high utility under u.
Proposition 1
Let \(k \in \mathbb {N}\). For every \(\varepsilon > 0\), there exists an \((\varepsilon , 0)\)-differentially private algorithm \(M_k^{\mathrm {unb}} : \mathcal {X}_k^n \rightarrow \mathcal {R}_k\) such that, for every \(\beta > 0\), every \(n \ge \frac{10}{\varepsilon }\log (2 \cdot |\mathcal {K}_k| \cdot |\mathcal {P}_k| / \beta ) = {{\mathrm{poly}}}(1/\varepsilon , \log (1/\beta ), k)\) and \(D\in (\mathcal {K}_k\times \mathcal {M}_k\times \mathcal {S}_k\times \mathcal {P}_k)^n\),
$$\begin{aligned} \Pr [u(D, M_k^{\mathrm {unb}}(D)) = 1] \ge 1 - \beta . \end{aligned}$$
Remark 2
While the mechanism \(M^{\mathrm {unb}}\) considered here is only accurate for \(n \ge \varOmega (\log |\mathcal {P}_k|)\), it is also possible to use “stability techniques” [DL09, TS13] to design an \((\varepsilon , \delta )\)-differentially private mechanism that achieves high utility for \(n \ge O(\log (1/\delta )/\varepsilon )\) for \(\delta > 0\). We choose to provide a “pure” \(\varepsilon \)-differentially private algorithm here to make our separation more dramatic: both the inefficient differentially private mechanism and the efficient SIM-CDP mechanism achieve pure \((\varepsilon , 0)\)-privacy, whereas no efficient mechanism can even achieve \((\varepsilon , \delta )\)-differential privacy with \(\delta > 0\).
Our algorithm relies on standard differentially private techniques for identifying frequently occurring elements in a dataset.
Report Noisy Max. Consider a data universe \(\mathcal {X}\). A predicate \(q : \mathcal {X}\rightarrow \{0, 1\}\) defines a counting query over the set of datasets \(\mathcal {X}^n\) as follows: For \(D = (x_1, \dots , x_n) \in \mathcal {X}^n\), we abuse notation by defining \(q(D) = \sum _{i = 1}^n q(x_i)\). We further say that a collection of counting queries Q is disjoint if, whenever \(q(x) = 1\) for some \(q \in Q\) and \(x \in \mathcal {X}\), we have \(q'(x) = 0\) for every other \(q' \ne q\) in Q. (Thus, disjoint counting queries slightly generalize point functions, which are each supported on exactly one element of the domain \(\mathcal {X}\).)
The “Report Noisy Max” algorithm [DR14], combined with observations of [BV16], can efficiently and privately identify which of a set of disjoint counting queries is (approximately) the largest on a dataset D, and release its identity along with the corresponding noisy count. We sketch the proof of the following proposition in Appendix A.
Proposition 2
(Report Noisy Max). Let Q be a set of efficiently computable and sampleable disjoint counting queries over a domain \(\mathcal {X}\). Further suppose that for every \(x \in \mathcal {X}\), the query \(q \in Q\) for which \(q(x) = 1\) (if one exists) can be identified efficiently. For every \(n\in \mathbb {N}\) and \(\varepsilon > 0\) there is a mechanism \(F:\mathcal {X}^n\rightarrow \mathcal {X}\times \mathbb {R}\) such that

1.
F runs in time \({{\mathrm{poly}}}(n, \log |\mathcal {X}|, \log |Q|, 1/\varepsilon )\).

2.
F is \(\varepsilon \)-differentially private.

3.
For every dataset \(D \in \mathcal {X}^n\), let \(q_{{{\mathrm{OPT}}}} = {{\mathrm{argmax}}}_{q \in Q}q(D)\) and \({{\mathrm{OPT}}}= q_{\mathrm {OPT}}(D)\). Let \(\beta > 0\). Then with probability at least \(1-\beta \), the algorithm F outputs a solution \((\hat{q}, a)\) such that \(a \ge \hat{q}(D) - \gamma /2\) where \(\gamma = \frac{8}{\varepsilon } \cdot \left( \log |Q| + \log (1/\beta ) \right) \). Moreover, if \({{\mathrm{OPT}}} - \gamma > \max _{q\ne {q_{{{\mathrm{OPT}}}}}}q(D)\), then \(\hat{q} = {{\mathrm{argmax}}}_{q \in Q}q(D)\).
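For intuition, the following sketch shows the classical Report Noisy Max idea of [DR14]: perturb each count with Laplace noise and release the argmax together with its noisy value. All names here are ours; we use noise scale \(2/\varepsilon \) as a conservative choice that permits releasing the noisy count alongside the identity, and the proposition itself is proved differently (via the sanitizer in Appendix A).

```python
import math
import random

def laplace(scale):
    # Inverse-CDF sampler for the Laplace distribution; the Python
    # standard library has no built-in Laplace sampler.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def report_noisy_max(dataset, queries, eps):
    """Add Lap(2/eps) noise to each counting query's value on `dataset`
    and report the winning query's name with its noisy count. `queries`
    maps names to 0/1 predicates; for the disjoint queries used in the
    paper, only the polynomially many queries with nonzero support on
    the dataset need to be enumerated, which keeps this efficient."""
    noisy = {name: sum(q(x) for x in dataset) + laplace(2.0 / eps)
             for name, q in queries.items()}
    winner = max(noisy, key=noisy.get)
    return winner, noisy[winner]
```

With a clear gap between the largest count and the rest, the correct query is reported except with probability exponentially small in the gap.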
We are now ready to describe our unbounded algorithm \(M_k^{\mathrm {unb}}\) as Algorithm 1. We prove Proposition 1 via the following two claims, capturing the privacy and utility guarantees of \(M_k^{\mathrm {unb}}\), respectively.
Lemma 1
The algorithm \(M_k^{\mathrm {unb}}\) is \(\varepsilon \)-differentially private.
Proof
The algorithm \(M_k^{\mathrm {unb}}\) accesses its input dataset D only through the \(\varepsilon \)-differentially private Report Noisy Max algorithm (Proposition 2). Hence, by the closure of differential privacy under post-processing, \(M_k^{\mathrm {unb}}\) is also \(\varepsilon \)-differentially private.
Lemma 2
The algorithm \(M_k^{\mathrm {unb}}\) is \((1-\beta )\)-useful for any number of rows \(n \ge \frac{20}{\varepsilon }\log (|\mathcal {K}_k| \cdot |\mathcal {P}_k|/ \beta )\).
Proof
If \(f_{vk, \rho }(D) < 9n/10\) for every vk and \(\rho \), then the utility of the mechanism is always 1. Therefore, it suffices to consider the case where there exist \(vk, \rho \) for which \(f_{vk, \rho }(D) \ge 9n/10\). When such vk and \(\rho \) exist, observe that we have \(f_{vk', \rho '}(D) \le n/10\) for every other pair \((vk', \rho ') \ne (vk, \rho )\). Thus, as long as
$$\begin{aligned} \gamma = \frac{8}{\varepsilon } \cdot \left( \log (|\mathcal {K}_k| \cdot |\mathcal {P}_k|) + \log (1/\beta ) \right) \le \frac{2n}{5}, \end{aligned}$$which holds whenever \(n \ge \frac{20}{\varepsilon }\log (|\mathcal {K}_k| \cdot |\mathcal {P}_k| / \beta )\), the Report Noisy Max algorithm successfully identifies the correct \(vk, \rho \) in Step 1 with probability at least \(1 - \beta \) (Proposition 2). Moreover, the reported value a is at least 7n/10. By the perfect completeness of the zap proof system, the algorithm produces a useful triple \((vk, \rho , \pi )\) in Step 4. Thus, the mechanism as a whole is \((1-\beta )\)-useful.
3.3 A SIM-CDP Algorithm
We define a PPT algorithm \(M_k^{\mathrm {CDP}}\) in Algorithm 2, which we argue is an efficient SIM-CDP algorithm achieving high utility with respect to u.
The only difference between \(M_k^{\mathrm {CDP}}\) and the inefficient algorithm \(M_k^{\mathrm {unb}}\) occurs in Step 3, where we have replaced the inefficient process of finding a canonical message-signature pair \((m^*, \sigma ^*)\) with selecting a message-signature pair \((m_i, \sigma _i)\) already present in the dataset. Since all the other steps (Report Noisy Max and the zap prover’s algorithm) are efficient, \(M_k^{\mathrm {CDP}}\) runs in polynomial time. However, this change means that \(M_k^{\mathrm {CDP}}\) is no longer statistically differentially private, since a (computationally unbounded) adversary could reverse-engineer the proof \(\pi \) produced in Step 4 to recover the pair \((m_i, \sigma _i)\) contained in the dataset. On the other hand, the witness indistinguishability of the proof system implies that \(M_k^{\mathrm {CDP}}\) is nevertheless computationally differentially private:
Lemma 3
The algorithm \(M_k^{\mathrm {CDP}}\) is \(\varepsilon \)-SIM-CDP provided that \(n \ge (20/\varepsilon ) \cdot (k + \log |\mathcal {K}_k| + \log |\mathcal {P}_k|) = {{\mathrm{poly}}}(k, 1/\varepsilon )\).
Proof
Indeed, we will show that \(M'_k = M^{\mathrm {unb}}_k\) is secure as the simulator for \(M_k = M_k^{\mathrm {CDP}}\). That is, we will show that for any \({{\mathrm{poly}}}(k)\)-size adversary A,
$$\begin{aligned} \left| \Pr [A(M_k^{\mathrm {CDP}}(D)) = 1] - \Pr [A(M_k^{\mathrm {unb}}(D)) = 1] \right| \le {{\mathrm{negl}}}(k). \end{aligned}$$
First observe that by definition, the first two steps of the mechanisms are identical. Now define, for either mechanism \(M^{\mathrm {unb}}_k\) or \(M^{\mathrm {CDP}}_k\), a “bad” event B in which the mechanism in Step 1 produces a pair \(((vk, \rho ), a)\) for which \(f_{vk, \rho }(D) = 0\), but does not output \((\bot , \bot , \bot )\) in Step 2. For either mechanism, the probability of the bad event B is \({{\mathrm{negl}}}(k)\), as long as \(n \ge (20/\varepsilon ) \cdot (k + \log (|\mathcal {K}_k| \cdot |\mathcal {P}_k|))\). This follows from the utility guarantee of the Report Noisy Max algorithm (Proposition 2), setting \(\beta = 2^{-k}\).
Thus, it suffices to show that for any fixing of the coins of both mechanisms in Steps 1 and 2 in which B does not occur, the mechanisms \(M_k^{\mathrm {CDP}}(D)\) and \(M_k^{\mathrm {unb}}(D)\) are indistinguishable. There are now two cases to consider based on the coin tosses in Steps 1 and 2:
Case 1: Both Mechanisms Output \((\bot , \bot , \bot )\) in Step 2. In this case, \(M_k^{\mathrm {CDP}}(D) = M_k^{\mathrm {unb}}(D) = (\bot , \bot , \bot )\), and the mechanisms are perfectly indistinguishable.
Case 2: Step 1 Produced a Pair \(((vk, \rho ), a)\) for which \(f_{vk, \rho }(D) > 0\). In this case, we reduce to the witness indistinguishability of the zap proof system. Let \((vk_i = vk, m_i, \sigma _i)\) be the first entry of D for which \({{\mathrm{Ver}}}(vk, m_i, \sigma _i) = 1\), and let \((m^*, \sigma ^*)\) be the lexicographically first message-signature pair with \({{\mathrm{Ver}}}(vk, m^*, \sigma ^*) = 1\). The proofs we need to distinguish are \(\pi _{\mathrm {CDP}} \leftarrow P(1^{k_c}, vk, (m_i, \sigma _i), \rho )\) and \(\pi _{\mathrm {unb}} \leftarrow P(1^{k_c}, vk, (m^*, \sigma ^*), \rho )\). Let \(A^{\mathrm {zap}}(1^{k_c}, \rho , \pi ) = A(vk, \rho , \pi )\). Then we have
$$\begin{aligned} \Pr [A(M_k^{\mathrm {CDP}}(D)) = 1] = \Pr [A^{\mathrm {zap}}(1^{k_c}, \rho , \pi _{\mathrm {CDP}}) = 1] \end{aligned}$$and
$$\begin{aligned} \Pr [A(M_k^{\mathrm {unb}}(D)) = 1] = \Pr [A^{\mathrm {zap}}(1^{k_c}, \rho , \pi _{\mathrm {unb}}) = 1]. \end{aligned}$$
Thus, indistinguishability of \(M_k^{\mathrm {CDP}}(D)\) and \(M_k^{\mathrm {unb}}(D)\) follows from the witness indistinguishability of the zap proof system.
The proof of Lemma 2 also shows that \(M_k^{\mathrm {CDP}}\) is useful for u.
Lemma 4
The algorithm \(M_k^{\mathrm {CDP}}\) is \((1-\beta )\)-useful for any number of rows \(n \ge \frac{20}{\varepsilon }\log (2 \cdot |\mathcal {K}_k| \cdot |\mathcal {P}_k|/ \beta )\).
3.4 Infeasibility of Differential Privacy
We now show that any efficient algorithm achieving high utility cannot be differentially private. In fact, like many prior hardness results, we provide an attack A that does more than violate differential privacy. Specifically, we exhibit a distribution on datasets such that, given any useful answer produced by an efficient mechanism, A can with high probability recover a row of the input dataset. Following [DNR+09], we work with the following notion of a re-identifiable dataset distribution.
Definition 8
(Re-identifiable Dataset Distribution). Let \(u : \mathcal {X}^n \times \mathcal {R}\rightarrow \{0, 1\}\) be a utility function. Let \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) be an ensemble of distributions over \((D_0, z) \in \mathcal {X}^{n(k) + 1} \times \{0, 1\}^{{{\mathrm{poly}}}(k)}\) for \(n(k) = {{\mathrm{poly}}}(k)\). (Think of \(D_0\) as a dataset on \(n + 1\) rows, and z as a string of auxiliary information about \(D_0\).) Let \((D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k\) denote a sample from the following experiment: Sample \((D_0 = (x_1, \dots , x_{n+1}), z) \leftarrow \mathcal {D}_k\) and \(i \in [n]\) uniformly at random. Let \(D \in \mathcal {X}^n\) consist of the first n rows of \(D_0\), and let \(D'\) be the dataset obtained by replacing \(x_i\) in D with \(x_{n+1}\).
We say the ensemble \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) is a re-identifiable dataset distribution with respect to u if there exists a (possibly inefficient) adversary A and a negligible function \({{\mathrm{negl}}}(\cdot )\) such that for all polynomial-time mechanisms \(M_k\),

1.
Whenever \(M_k\) is useful, A recovers a row of D from \(M_k(D)\). That is, for any PPT \(M_k\):
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k \\ r \leftarrow M_k(D) \end{array}}}[u(D, r) = 1 \; \wedge \; A(r, z) \notin D] \le {{\mathrm{negl}}}(k). \end{aligned}$$ 
2.
A cannot recover the row \(x_i\) not contained in \(D'\) from \(M_k(D')\). That is, for any algorithm \(M_k\):
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k \\ r \leftarrow M_k(D') \end{array}}}[A(r, z) = x_i] \le {{\mathrm{negl}}}(k), \end{aligned}$$where \(x_i\) is the ith row of D.
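The sampling experiment behind \(\tilde{\mathcal {D}}_k\) can be written out mechanically. In the sketch below, `sample_d0` is a placeholder for a sampler of \(\mathcal {D}_k\): it returns a dataset of the requested size together with the auxiliary string z.

```python
import random

def neighbor_experiment(sample_d0, n):
    """One draw of (D, D', i, z) <- tilde(D)_k: sample (D0, z) with n+1
    rows, pick a uniformly random index i in [n], let D be the first n
    rows of D0, and let D' be D with row i replaced by row n+1."""
    D0, z = sample_d0(n + 1)
    i = random.randrange(n)
    D = list(D0[:n])
    D_prime = list(D)
    D_prime[i] = D0[n]
    return D, D_prime, i, z
```

The two conditions of the definition then say: a useful mechanism run on D lets A recover some row of D, yet no mechanism run on D' lets A recover the removed row \(x_i\).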
Proposition 3
([DNR+09]). If a distribution ensemble \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) on datasets of size n(k) is re-identifiable with respect to a utility function u, then for every \(\gamma > 0\) and \(\alpha (k)\) with \(\min \{\alpha , (1-8\alpha )/8n^{1+\gamma }\} \ge {{\mathrm{negl}}}(k)\), there is no polynomial-time \((\varepsilon = \gamma \log (n), \delta = (1-8\alpha )/2n^{1+\gamma })\)-differentially private mechanism \(\{M_k\}_{k\in \mathbb {N}}\) that is \(\alpha \)-useful for u.
In particular, for every \(\varepsilon = O(\log k), \alpha = 1/{{\mathrm{poly}}}(k)\), there is no polynomial-time \((\varepsilon , 1/n^2)\)-differentially private and \(\alpha \)-useful mechanism for u.
Construction of a Re-identifiable Dataset Distribution. For \(k \in \mathbb {N}\), recall that the digital signature scheme induces a choice of verification key space \(\mathcal {K}_k\), message space \(\mathcal {M}_k\), and signature space \(\mathcal {S}_k\), each consisting of \({{\mathrm{poly}}}(k)\)-bit strings. Let \(n = {{\mathrm{poly}}}(k)\). Define a distribution \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) as follows. To sample \((D_0, z)\) from \(\mathcal {D}_k\), first sample a key pair \((sk, vk) \leftarrow {{\mathrm{Gen}}}(1^k)\) and a public coin string \(\rho \leftarrow \mathcal {P}_k\) uniformly at random. Sample messages \(m_1, \dots , m_{n+1} \leftarrow \mathcal {M}_k\) uniformly at random. Then let \(\sigma _i \leftarrow {{\mathrm{Sign}}}(sk, m_i)\) for each \(i = 1, \dots , n+1\). Let the dataset \(D_0 = (x_1, \dots , x_{n+1})\) where \(x_i = (vk, m_i, \sigma _i, \rho )\), and set the auxiliary string \(z = (vk, \rho )\).
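As a toy illustration of this sampler (and emphatically not the actual construction: HMAC is a secret-key MAC, not an existentially unforgeable public-key signature scheme, and the real scheme must resist \(2^{k^c}\)-time adversaries), the following sketch mimics the shape of \(\mathcal {D}_k\).

```python
import hashlib
import hmac
import os

def toy_gen():
    """Toy stand-in for Gen(1^k): an HMAC key plays sk, and a hash of it
    plays vk. Shape only -- this is not a real signature scheme."""
    sk = os.urandom(16)
    vk = hashlib.sha256(sk).hexdigest()
    return sk, vk

def toy_sign(sk, m):
    # Stand-in for Sign(sk, m).
    return hmac.new(sk, m, hashlib.sha256).hexdigest()

def sample_dataset(n, k_bytes=16):
    """Sample (D0, z) as in the construction: n+1 rows, all sharing one
    verification key vk and one public coin string rho."""
    sk, vk = toy_gen()
    rho = os.urandom(16).hex()                  # zap verifier's coins
    msgs = [os.urandom(k_bytes) for _ in range(n + 1)]
    D0 = [(vk, m, toy_sign(sk, m), rho) for m in msgs]
    z = (vk, rho)
    return D0, z
```

Every row carries the same (vk, \(\rho \)), so \(f_{vk, \rho }(D) = n\) for the sampled dataset, which is what forces a useful mechanism to produce an accepting proof for vk.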
Proposition 4
The distribution \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) defined above is re-identifiable with respect to the utility function u.
Proof
We define an adversary \(A : \mathcal {R}_k \times (\mathcal {K}_k \times \mathcal {P}_k) \rightarrow \mathcal {X}_k\). Consider an input to A of the form \((r, z) = ((vk', \rho ', \pi ), (vk, \rho ))\). If \(vk' \ne vk\) or \(\rho ' \ne \rho \) or \(\pi = \bot \), then output \((vk, \bot , \bot , \rho )\). Otherwise, run the zap extraction algorithm \(E(1^{k_c}, vk, \rho , \pi )\) to extract a witness \((m, \sigma )\), and output the resulting \((vk, m, \sigma , \rho )\). Note that the running time of A is \(2^{O(k_c)}\).
We break the proof of re-identifiability into two lemmas. First, we show that A can successfully recover a row of D from any useful answer:
Lemma 5
Let \(M_k : \mathcal {X}_k^n \rightarrow \mathcal {R}_k\) be a PPT algorithm. Then
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k \\ r \leftarrow M_k(D) \end{array}}}[u(D, r) = 1 \; \wedge \; A(r, z) \notin D] \le {{\mathrm{negl}}}(k). \end{aligned}$$
Proof
First, if \(u(D, r) = u(D, (vk', \rho ', \pi )) = 1\), then \(vk' = vk\), \(\rho ' = \rho \), and \(V(1^{k_c}, vk, \rho , \pi ) = 1\). In other words, \(\pi \) is a valid proof that \(vk\in (L\cap \mathcal {K}_k)\). Hence, by the extractability of the zap proof system, we have that \((m, \sigma ) = E(1^{k_c}, vk, \rho , \pi )\) satisfies \((vk, (m, \sigma ))\in R_L\); namely, \({{\mathrm{Ver}}}(vk, m, \sigma ) = 1\) with overwhelming probability over the choice of \(\rho \).
Next, we use the exponential security of the digital signature scheme to show that the extracted pair \((m, \sigma )\) must indeed appear in the dataset D. Consider the following forgery adversary for the digital signature scheme.
The dataset built by the forgery algorithm \(A^{{{\mathrm{Sign}}}(sk, \cdot )}_{\mathrm {forge}}\) is identically distributed to a sample D from the experiment \((D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k\). Since a message-signature pair \((m, \sigma )\) appears in D if and only if the signing oracle was queried on m to produce \(\sigma \), we have
$$\begin{aligned} \Pr [A^{{{\mathrm{Sign}}}(sk, \cdot )}_{\mathrm {forge}} \text { outputs a valid forgery}] \ge \mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k \\ r \leftarrow M_k(D) \end{array}}}[u(D, r) = 1 \; \wedge \; A(r, z) \notin D] - {{\mathrm{negl}}}(k). \end{aligned}$$
The running time of the algorithm A, and hence the algorithm \(A^{{{\mathrm{Sign}}}(sk, \cdot )}_{\mathrm {forge}}\), is \(2^{O(k_c)} = 2^{o(k^c)}\). Thus, by the existential unforgeability of the digital signature scheme against \(2^{k^c}\)-time adversaries, this probability is negligible in k.
We next argue that A cannot recover row \(x_i = (vk, m_i, \sigma _i, \rho )\) from \(M_k(D')\), where we recall that \(D'\) is the dataset obtained by replacing row \(x_i\) in D with row \(x_{n+1}\).
Lemma 6
For every algorithm \(M_k\):
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k \\ r \leftarrow M_k(D') \end{array}}}[A(r, z) = x_i] \le {{\mathrm{negl}}}(k), \end{aligned}$$where \(x_i\) is the ith row of D.
Proof
Since in \(D_0 = ((vk, m_1, {{\mathrm{Sign}}}(sk, m_1), \rho ), \dots , (vk, m_{n+1}, {{\mathrm{Sign}}}(sk, m_{n+1}), \rho ))\) the messages \(m_1, \dots , m_{n+1}\) are drawn independently, the dataset \(D' = (D_0 \setminus \{(vk, m_i, \sigma _i, \rho )\}) \cup \{(vk, m_{n+1}, \sigma _{n+1}, \rho )\}\) contains no information about the message \(m_i\). Since \(m_i\) is drawn uniformly at random from the space \(\mathcal {M}_k = \{0, 1\}^k\), the probability that \(A(r, z) = A(M_k(D'), (vk, \rho ))\) outputs row \(x_i\) is at most \(2^{-k} = {{\mathrm{negl}}}(k)\).
Re-identifiability of the distribution \(\tilde{\mathcal {D}}_k\) follows by combining Lemmas 5 and 6.
4 Limits of CDP in the Client-Server Model
We revisit the techniques of [GKY11] to exhibit a setting in which efficient CDP mechanisms cannot do much better than information-theoretically differentially private mechanisms. In particular, we consider computational tasks with output in some discrete space (or which can be reduced to some discrete space) \(\mathcal {R}_k\), and with utility measured via functions of the form \(g:\mathcal {R}_k\times \mathcal {R}_k\rightarrow \mathbb {R}\). We show that if \((\mathcal {R}_k, g)\) forms a metric space with \(O(\log k)\) doubling dimension (and other properties described in detail later), then CDP mechanisms can be efficiently transformed into differentially private ones. In particular, when \(\mathcal {R}_k = \mathbb {R}^d\) for \(d = O(\log k)\) and utility is measured by an \(L_p\)-norm, we can transform a CDP mechanism into a differentially private one.
The result in this section is incomparable to that of [GKY11]. We incur a constant-factor blowup in error, rather than a negligible additive increase as in [GKY11]. However, in the case that utility is measured by an \(L_p\)-norm, our result applies to output spaces of dimension that grows logarithmically in the security parameter k, whereas the result of [GKY11] only applies to outputs of constant dimension. In addition, we handle IND-CDP directly, while [GKY11] prove their results for SIM-CDP, and then extend them to IND-CDP by applying a reduction of [MPRV09].
4.1 Task and Utility
Consider a computational task with discrete output space \(\mathcal {R}_k\). Let \(g:\mathcal {R}_k\times \mathcal {R}_k\rightarrow \mathbb {R}\) be a metric on \(\mathcal {R}_k\). We impose the following additional technical conditions on the metric space \((\mathcal {R}_k, g)\):
Definition 9
(Property \(\mathcal {L}\) ). A metric space formed by a discrete set \(\mathcal {R}_k\) and a metric g has property \(\mathcal {L}\) if

1.
The doubling dimension of \((\mathcal {R}_k, g)\) is \(O(\log k)\). That is, for every \(a \in \mathcal {R}_k\) and radius \(r > 0\), the ball B(a, r) centered at a with radius r is contained in a union of \({{\mathrm{poly}}}(k)\) balls of radius r/2.

2.
The metric space is uniform. Namely, for any fixed radius r, the size of a ball of radius r is independent of its center.

3.
Given a center \(a \in \mathcal {R}_k\) and a radius \(r > 0\), membership in the ball B(a, r) can be checked in time \({{\mathrm{poly}}}(k)\).

4.
Given a center \(a \in \mathcal {R}_k\) and a radius \(r > 0\), a uniformly random point in B(a, r) can be sampled in time \({{\mathrm{poly}}}(k)\).
Given a metric g, we can define a utility function measuring the accuracy of a mechanism with respect to g:
Definition 10
(\(\alpha \)-accuracy). Consider a dataset space \(\mathcal {X}_k\). Let \(q_k : \mathcal {X}_k^n \rightarrow \mathcal {R}_k\) be any function on datasets of size n. Let \(M_k: \mathcal {X}^n_k\rightarrow \mathcal {R}_k\) be a mechanism for approximating \(q_k\). We say that \(M_k\) is \(\alpha _k\)-accurate for \(q_k\) with respect to g if, with overwhelming probability, the error of \(M_k\) as measured by g is at most \(\alpha _k\). Namely, there exists a negligible function \({{\mathrm{negl}}}(\cdot )\) such that for every \(D \in \mathcal {X}_k^n\),
$$\begin{aligned} \Pr [g(M_k(D), q_k(D)) > \alpha _k] \le {{\mathrm{negl}}}(k). \end{aligned}$$
We take the failure probability here to be negligible primarily for aesthetic reasons. In general, taking the failure probability to be \(\beta _k\) yields, in our result below, a mechanism that is \((\varepsilon _k, \beta _k + {{\mathrm{negl}}}(k))\)-differentially private.
Moreover, for reasonable queries \(q_k\), taking the failure probability to be negligible is essentially without loss of generality. We can reduce the failure probability of a mechanism \(M_k\) from constant to negligible by repeating the mechanism \(O(\log ^2 k)\) times and taking a median. By composition theorems for differential privacy, this incurs a cost of at most \(O(\log ^2 k)\) in the privacy parameters. But we can compensate for this loss in privacy by first increasing the sample size n by a factor of \(O(\log ^2 k)\), and then applying a “secrecy-of-the-sample” argument [KLN+11] – running the original mechanism on a random subsample of the larger dataset. This step maintains accuracy as long as the query \(q_k\) generalizes from random subsamples.
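The repeat-and-take-a-median step can be sketched directly; here `mechanism` is any hypothetical randomized mechanism whose answer is accurate with probability bounded away from 1/2, so the median of independent runs is accurate except with probability exponentially small in the number of repetitions.

```python
import statistics

def amplify(mechanism, dataset, reps):
    """Run `mechanism` on the same dataset `reps` times and return the
    median answer. If each run is accurate with probability p > 1/2, a
    Chernoff bound makes the median accurate except with probability
    exp(-Omega(reps)); under basic composition, privacy degrades by a
    factor of `reps`, which secrecy-of-the-sample then recoups."""
    return statistics.median(mechanism(dataset) for _ in range(reps))
```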
4.2 Result and Proof
Theorem 5
Let \((\mathcal {R}_k, g)\) be a metric space with property \(\mathcal {L}\). Suppose \(M_k:\mathcal {X}_k^n\rightarrow \mathcal {R}_k\) is an efficient \(\varepsilon _k\)-IND-CDP mechanism that is \(\alpha _k\)-accurate for some function \(q_k\) with respect to g. Then there exists an efficient \((\varepsilon _k, {{\mathrm{negl}}}(k))\)-differentially private mechanism \(\hat{M}_k\) that is \(O(\alpha _k)\)-accurate for \(q_k\) with respect to g.
Proof
We denote the ball centered at a with radius r in the metric space \((\mathcal {R}_k, g)\) by
$$\begin{aligned} B(a, r) \mathbin {\mathop {=}\limits ^\mathrm{def}}\{ s \in \mathcal {R}_k : g(a, s) \le r \}. \end{aligned}$$We also let \(V(r)\mathbin {\mathop {=}\limits ^\mathrm{def}}|B(a, r)|\) for any \(a \in \mathcal {R}_k\), which is well-defined due to the uniformity of the metric space. Now we define a mechanism \(\hat{M}_k\) which outputs a uniformly random point from \(B(M_k(x), c_k)\), where \(c_k > 0\) is a parameter to be determined later. Note that \(\hat{M}_k\) can be implemented efficiently due to the efficient sampling condition of property \(\mathcal {L}\). Since g satisfies the triangle inequality, \(\hat{M}_k\) is \((\alpha _k+c_k)\)-accurate. Thus it remains to prove that \(\hat{M}_k\) is \((\varepsilon _k,{{\mathrm{negl}}}(k))\)-differentially private.
The key observation is that, for any pair of adjacent datasets \(D, D' \in \mathcal {X}_k^n\) and every \(s \in \mathcal {R}_k\),
$$\begin{aligned} \Pr [\hat{M}_k(D) = s] = \frac{\Pr [g(M_k(D), s) \le c_k]}{V(c_k)} \le \frac{e^{\varepsilon _k}\Pr [g(M_k(D'), s) \le c_k] + {{\mathrm{negl}}}(k)}{V(c_k)} = e^{\varepsilon _k}\Pr [\hat{M}_k(D') = s] + \frac{{{\mathrm{negl}}}(k)}{V(c_k)}, \end{aligned}$$where the inequality uses the \(\varepsilon _k\)-IND-CDP guarantee of \(M_k\), since membership of \(M_k(D)\) in the ball \(B(s, c_k)\) is an efficiently testable event. For all sets \(S\subseteq \mathcal {R}_k\), using the \(\alpha _k\)-accuracy of \(M_k\) to restrict attention to the at most \(V(\alpha _k + c_k)\) outputs that \(\hat{M}_k\) produces with non-negligible probability, we thus have
$$\begin{aligned} \Pr [\hat{M}_k(D) \in S] \le e^{\varepsilon _k}\Pr [\hat{M}_k(D') \in S] + \frac{V(\alpha _k + c_k)}{V(c_k)} \cdot {{\mathrm{negl}}}(k) + {{\mathrm{negl}}}(k). \end{aligned}$$
By the bounded doubling dimension of \((\mathcal {R}_k, g)\), we can set \(c_k = O(\alpha _k)\) to make \(V(\alpha _k + c_k)/V(c_k) = {{\mathrm{poly}}}(k)\). Hence \(\hat{M}_k\) is an \((\varepsilon _k, {{\mathrm{negl}}}(k))\)-differentially private algorithm.
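A toy rendering of the wrapper \(\hat{M}_k\), on the integer points of the plane under the \(L_1\) metric (dimension 2 for readability, whereas the theorem allows \(d = O(\log k)\)); note how the ball size is independent of the center, as the uniformity condition of property \(\mathcal {L}\) requires.

```python
import random

def ball(center, r):
    """All integer points within L1 distance r of an integer center: a
    small instance of the balls B(a, r) used in the proof."""
    cx, cy = center
    return [(x, y) for x in range(cx - r, cx + r + 1)
                   for y in range(cy - r, cy + r + 1)
                   if abs(x - cx) + abs(y - cy) <= r]

def smooth(mechanism_output, c):
    """The wrapper M-hat: release a uniformly random point of the ball
    B(M(x), c). Every fixed output s then has probability at most
    1/V(c) conditioned on M's answer, which is what converts the
    computational guarantee into a statistical one."""
    return random.choice(ball(mechanism_output, c))
```

Uniform sampling spreads each output over V(c) points, so no single transcript can betray more than the IND-CDP guarantee plus a negl(k)/V(c) slack, exactly as in the displayed inequality.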
\(L_p\)-norm Case. Many natural tasks can be captured by outputs in \(\mathbb {R}^d\) with utility measured by an \(L_p\)-norm (e.g. counting queries). Since we work with efficient mechanisms, we may assume that our mechanisms always have outputs represented with \({{\mathrm{poly}}}(k)\) bits of precision. The level of precision is unimportant, so we may assume an output space represented with k bits of precision for simplicity. By rescaling, we may assume all responses are integers taking values in \(\mathbb {N}_k\mathbin {\mathop {=}\limits ^\mathrm{def}}\mathbb {N} \cap [0, 2^k]\). When \(d = O(\log k)\), the doubling dimension of the new discrete metric space induced by the \(L_p\)-norm on integral points is \(O(\log k)\) ([GKL03] shows that the subspace of \(\mathbb {R}^d\) equipped with the \(L_p\)-norm has doubling dimension O(d)). Now the metric space almost satisfies property \(\mathcal {L}\), with the exception of the uniformity condition, because balls close to the boundary of \(\mathbb {N}_k^d\) are smaller than those in the interior. However, we can apply Theorem 5 to first construct a statistically differentially private mechanism with outputs in the larger uniform metric space \(\mathbb {N}^d\). Then we may construct the final mechanism \(\hat{M}_k\) by projecting answers that are not in \(\mathbb {N}_k^d\) to the closest point in \(\mathbb {N}_k^d\). By post-processing, the modified mechanism \(\hat{M}_k\) is still differentially private. Moreover, its utility is only improved, since the projection can only move the answer closer to the true query answer in every coordinate. Therefore, we have the following corollary.
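The projection step described above is just a coordinatewise clamp; moving each coordinate into the box \([0, 2^k]^d\) cannot increase the error in any coordinate, hence not in any \(L_p\)-norm, as the short check below illustrates.

```python
def project(point, lo, hi):
    """Clamp each coordinate of `point` into [lo, hi]. For any target
    inside the box, this never increases the per-coordinate error, so
    it never increases any L_p error either."""
    return tuple(min(max(v, lo), hi) for v in point)
```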
Corollary 1
Let \(M_k:\mathcal {X}_k^n\rightarrow \mathbb {R}^d\) with \(d = O(\log k)\) be an efficient \(\varepsilon _k\)-IND-CDP mechanism that is \(\alpha _k\)-accurate for some function \(q_k\) when error is measured by an \(L_p\)-norm. Then there exists an efficient \((\varepsilon _k, {{\mathrm{negl}}}(k))\)-differentially private mechanism \(\hat{M}_k\) that is \(O(\alpha _k)\)-accurate for \(q_k\).
Notes
 1.
Such a constraint, which depends only on the security parameter k, will be important for meeting our definition of exponentially extractable zaps.
References
An, J.H., Dodis, Y., Rabin, T.: On the security of joint signature and encryption. In: Knudsen, L.R. (ed.) EUROCRYPT 2002. LNCS, vol. 2332, pp. 83–107. Springer, Heidelberg (2002). doi:10.1007/3540460357_6
Beimel, A., Nissim, K., Omri, E.: Distributed private data analysis: simultaneously solving how and what. In: Wagner, D. (ed.) CRYPTO 2008. LNCS, vol. 5157, pp. 451–468. Springer, Heidelberg (2008). doi:10.1007/9783540851745_25
Bitansky, N., Paneth, O.: ZAPs and noninteractive witness indistinguishability from indistinguishability obfuscation. In: Dodis, Y., Nielsen, J.B. (eds.) TCC 2015, Part II. LNCS, vol. 9015, pp. 401–427. Springer, Heidelberg (2015). doi:10.1007/9783662464977_16
Balcer, V., Vadhan, S.: Efficient algorithms for differentially private histograms with worstcase accuracy over large domains (2016). Manuscript
Boneh, D., Zhandry, M.: Multiparty key exchange, efficient traitor tracing, and more from indistinguishability obfuscation. In: Garay, J.A., Gennaro, R. (eds.) CRYPTO 2014, Part I. LNCS, vol. 8616, pp. 480–499. Springer, Heidelberg (2014). doi:10.1007/9783662443712_27
Bun, M., Zhandry, M.: Orderrevealing encryption and the hardness of private learning. In: Kushilevitz, E., Malkin, T. (eds.) TCC 2016A. LNCS, vol. 9562, pp. 176–206. Springer, Heidelberg (2016). doi:10.1007/9783662490969_8
Canetti, R., Goldreich, O., Goldwasser, S., Micali, S.: Resettable zeroknowledge. In: Proceedings of the ThirtySecond Annual ACM Symposium on Theory of Computing, pp. 235–244. ACM (2000)
Chan, T.H.H., Shi, E., Song, D.: Privacypreserving stream aggregation with fault tolerance. In: Keromytis, A.D. (ed.) FC 2012. LNCS, vol. 7397, pp. 200–214. Springer, Heidelberg (2012). doi:10.1007/9783642329463_15
De Santis, A., Di Crescenzo, G., Persiano, G.: Necessary and sufficient assumptions for noninteractive zeroknowledge proofs of knowledge for all NP relations. In: Montanari, U., Rolim, J.D.P., Welzl, E. (eds.) ICALP 2000. LNCS, vol. 1853, pp. 451–462. Springer, Heidelberg (2000). doi:10.1007/354045022X_38
Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., Naor, M.: Our data, ourselves: privacy via distributed noise generation. In: Vaudenay, S. (ed.) EUROCRYPT 2006. LNCS, vol. 4004, pp. 486–503. Springer, Heidelberg (2006). doi:10.1007/11761679_29
Dwork, C., Lei, J.: Differential privacy and robust statistics. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, 31 May–2 June 2009, pp. 371–380 (2009)
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). doi:10.1007/11681878_14
Dwork, C., Naor, M.: Zaps and their applications. SIAM J. Comput. 36(6), 1513–1543 (2007). Preliminary version in FOCS 2000
Dwork, C., Naor, M., Reingold, O., Rothblum, G.N., Vadhan, S.P.: On the complexity of differentially private data release: efficient algorithms and hardness results. In: STOC, pp. 381–390 (2009)
De Santis, A., Persiano, G.: Zeroknowledge proofs of knowledge without interaction (extended abstract). In: 33rd Annual Symposium on Foundations of Computer Science, Pittsburgh, 24–27 October 1992, pp. 427–436 (1992)
Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)
Feige, U., Lapidot, D., Shamir, A.: Multiple noninteractive zero knowledge proofs under general assumptions. SIAM J. Comput. 29(1), 1–28 (1999)
Feige, U., Shamir, A.: Witness indistinguishable and witness hiding protocols. In: Proceedings of the TwentySecond Annual ACM Symposium on Theory of Computing, STOC 1990, pp. 416–426. ACM, New York (1990)
Gupta, A., Krauthgamer, R., Lee, J.R.: Bounded geometries, fractals, and lowdistortion embeddings. In: Proceedings of 44th Symposium on Foundations of Computer Science (FOCS 2003), 11–14 October 2003, Cambridge, pp. 534–543 (2003)
Goyal, V., Khurana, D., Mironov, I., Pandey, O., Sahai, A.: Do distributed differentiallyprivate protocols require oblivious transfer? In: 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, Rome, 12–15 July 2016, Proceedings, Part I (2016, to appear)
Groce, A., Katz, J., Yerukhimovich, A.: Limits of computational differential privacy in the client/server setting. In: Ishai, Y. (ed.) TCC 2011. LNCS, vol. 6597, pp. 417–431. Springer, Heidelberg (2011). doi:10.1007/9783642195716_25
Goyal, V., Mironov, I., Pandey, O., Sahai, A.: Accuracyprivacy tradeoffs for twoparty differentially private protocols. In: Canetti, R., Garay, J.A. (eds.) CRYPTO 2013, Part I. LNCS, vol. 8042, pp. 298–315. Springer, Heidelberg (2013). doi:10.1007/9783642400414_17
Goldreich, O.: Foundations of Cryptography: Basic Applications. Cambridge University Press, Cambridge (2004)
Groth, J., Ostrovsky, R., Sahai, A.: New techniques for noninteractive zeroknowledge. J. ACM (JACM) 59(3), 11 (2012)
Haitner, I., Omri, E., Zarosim, H.: Limits on the usefulness of random oracles. In: Sahai, A. (ed.) TCC 2013. LNCS, vol. 7785, pp. 437–456. Springer, Heidelberg (2013). doi:10.1007/9783642365942_25
Katz, J., Koo, C.Y.: On constructing universal oneway hash functions from arbitrary oneway functions. IACR Cryptology ePrint Archive 2005:328 (2005)
Kasiviswanathan, S.P., Lee, H.K., Nissim, K., Raskhodnikova, S., Smith, A.D.: What can we learn privately? SIAM J. Comput. 40(3), 793–826 (2011)
Khurana, D., Maji, H.K., Sahai, A.: Blackbox separations for differentially private protocols. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014, Part II. LNCS, vol. 8874, pp. 386–405. Springer, Heidelberg (2014). doi:10.1007/9783662456088_21
McGregor, A., Mironov, I., Pitassi, T., Reingold, O., Talwar, K., Vadhan, S.: The limits of twoparty differential privacy. In: 2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 81–90. IEEE (2010)
Mironov, I., Pandey, O., Reingold, O., Vadhan, S.: Computational differential privacy. In: Halevi, S. (ed.) CRYPTO 2009. LNCS, vol. 5677, pp. 126–142. Springer, Heidelberg (2009). doi:10.1007/9783642033568_8
Naor, M., Yung, M.: Universal oneway hash functions and their cryptographic applications. In: Proceedings of the TwentyFirst Annual ACM Symposium on Theory of Computing, STOC 1989, pp. 33–43. ACM, New York (1989)
Rompel, J.: Oneway functions are necessary and sufficient for secure signatures. In: Proceedings of the TwentySecond Annual ACM Symposium on Theory of Computing, STOC 1990, pp. 387–394. ACM, New York (1990)
Thakurta, A., Smith, A.D.: Differentially private feature selection via stability arguments, and the robustness of the Lasso. In: The 26th Annual Conference on Learning Theory. COLT 2013, 12–14 June 2013, Princeton University, pp. 819–850 (2013)
Ullman, J.: Answering \(n^{2+ o (1)}\) counting queries with differential privacy is hard. In: Proceedings of the FortyFifth Annual ACM Symposium on Theory of Computing, pp. 361–370. ACM (2013)
Ullman, J., Vadhan, S.: PCPs and the hardness of generating private synthetic data. In: Ishai, Y. (ed.) TCC 2011. LNCS, vol. 6597, pp. 400–416. Springer, Heidelberg (2011). doi:10.1007/9783642195716_24
Vadhan, S.: The complexity of differential privacy (2016). http://privacytools.seas.harvard.edu/publications/complexitydifferentialprivacy
Acknowledgements
We are grateful to an anonymous reviewer for pointing out that our original construction based on non-interactive witness indistinguishable proofs could be modified to accommodate 2-message proofs (zaps).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Appendices
A Missing Proofs
1.1 A.1 Proof of Proposition 2
Proposition 2
(Report Noisy Max). Let Q be a set of efficiently computable and sampleable disjoint counting queries over a domain \(\mathcal {X}\). Further suppose that for every \(x \in \mathcal {X}\), the query \(q \in Q\) for which \(q(x) = 1\) (if one exists) can be identified efficiently. For every \(n\in \mathbb {N}\) and \(\varepsilon > 0\) there is a mechanism \(F:\mathcal {X}^n\rightarrow \mathcal {X}\times \mathbb {R}\) such that

1.
F runs in time \({{\mathrm{poly}}}(n, \log |\mathcal {X}|, \log |Q|, 1/\varepsilon )\).

2.
F is \(\varepsilon \)-differentially private.

3.
For every dataset \(D \in \mathcal {X}^n\), let \(q_{{{\mathrm{OPT}}}} = {{\mathrm{argmax}}}_{q \in Q}q(D)\) and \({{\mathrm{OPT}}}= q_{{{\mathrm{OPT}}}}(D)\). Let \(\beta > 0\). Then with probability at least \(1-\beta \), the algorithm F outputs a solution \((\hat{q}, a)\) such that \(a \ge \hat{q}(D) - \gamma /2\), where \(\gamma = \frac{8}{\varepsilon } \cdot \left( \log |Q| + \log (1/\beta ) \right) \). Moreover, if \({{\mathrm{OPT}}} - \gamma > \max _{q\ne {q_{{{\mathrm{OPT}}}}}}q(D)\), then \(\hat{q} = q_{{{\mathrm{OPT}}}}\).
The proof of Proposition 2 relies on the existence of an efficient sanitizer for the disjoint query class Q. Such a sanitizer appears in [Vad16], and is based on ideas of [BV16]. (There, it is stated for the specific class of point functions, but it immediately extends to disjoint counting queries.)
Proposition 3
([Vad16, Theorem 7.1]). Let Q be a set of efficiently computable and sampleable disjoint counting queries over a domain \(\mathcal {X}\). Suppose that for every element \(x \in \mathcal {X}\), the query \(q \in Q\) for which \(q(x) = 1\) (if one exists) can be identified in time \({{\mathrm{polylog}}}(|\mathcal {X}|)\). Let \(\beta > 0\). Then there exists an \(\varepsilon \)-differentially private algorithm \({\text {San}}\) running in time \({{\mathrm{poly}}}(n, \log |\mathcal {X}|, 1/\varepsilon )\) for which the following holds. For any database \(D \in \mathcal {X}^n\), with probability at least \(1-\beta \), the algorithm \({\text {San}}\) produces a “synthetic database” \(\hat{D} \in \mathcal {X}^m\) such that
$$\begin{aligned} \left| q(D) - \frac{n}{m}\, q(\hat{D}) \right| \le \frac{\gamma }{2}, \quad \text {where } \gamma = \frac{8}{\varepsilon }\left( \log |Q| + \log (1/\beta )\right) , \end{aligned}$$
for every \(q \in Q\).
Proof
(of Proposition 2). Consider the algorithm F which first runs the algorithm \({\text {San}}\) on its input dataset to obtain a synthetic dataset \(\hat{D}\), and then outputs the pair \((\hat{q}, \frac{n}{m}\hat{q}(\hat{D}))\) where \(\hat{q} = {{\mathrm{argmax}}}_{q \in Q} q(\hat{D})\). The algorithm F inherits efficiency and differential privacy from \({\text {San}}\). To see that it is useful, suppose \({\text {San}}\) indeed produces a database \(\hat{D} \in \mathcal {X}^m\) for which
$$\begin{aligned} \left| q(D) - \frac{n}{m}\, q(\hat{D}) \right| \le \frac{\gamma }{2} \end{aligned}$$
for every \(q \in Q\). Let \(q_{{{\mathrm{OPT}}}} = {{\mathrm{argmax}}}_{q \in Q} q(D)\), and \(\gamma = 8(\log |Q| + \log (1/\beta ))/\varepsilon \). Then \(\frac{n}{m}\hat{q}(\hat{D}) \ge \frac{n}{m}q_{{{\mathrm{OPT}}}}(\hat{D}) \ge q_{{{\mathrm{OPT}}}}(D) - \gamma /2\). Moreover, suppose \(q_{{{\mathrm{OPT}}}}(D) - \gamma > \max _{q \ne q_{{{\mathrm{OPT}}}}} q(D)\). Then for any \(q' \ne q_{{{\mathrm{OPT}}}}\), we have
$$\begin{aligned} \frac{n}{m}\, q'(\hat{D}) \le q'(D) + \frac{\gamma }{2} < q_{{{\mathrm{OPT}}}}(D) - \frac{\gamma }{2} \le \frac{n}{m}\, q_{{{\mathrm{OPT}}}}(\hat{D}). \end{aligned}$$
Hence \(q'(\hat{D}) < q_{{{\mathrm{OPT}}}}(\hat{D})\) for every \(q' \ne q_{{{\mathrm{OPT}}}}\), and hence \(\hat{q} = q_{{{\mathrm{OPT}}}}\).
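The guarantee of Proposition 2 can also be illustrated with the classical report-noisy-max template, which releases the argmax of Laplace-perturbed counts directly rather than going through a sanitizer. The sketch below is an illustrative simplification, not the sanitizer-based construction used in the proof; the names `counts`, `eps`, and the sampler `laplace` are our own. Because the queries are disjoint, replacing one row of the dataset changes at most two counts by 1 each, so the noisy histogram with Lap(2/ε) noise is ε-differentially private, and taking the argmax is post-processing.

```python
import math
import random

def laplace(scale, rng):
    """Sample from the Laplace distribution Lap(scale) via inverse-CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def report_noisy_max(counts, eps, rng=None):
    """Release (argmax index, its noisy count) with eps-DP.

    counts[i] = q_i(D) for disjoint counting queries. Neighboring
    datasets change at most two disjoint counts by 1 each, so adding
    Lap(2/eps) noise to every count releases an eps-DP histogram;
    the (index, value) of the maximum is post-processing of it.
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    noisy = [c + laplace(2.0 / eps, rng) for c in counts]
    j = max(range(len(noisy)), key=noisy.__getitem__)
    return j, noisy[j]
```

For realistic ε, the standard analysis of the noisy histogram gives error O(log(|Q|/β)/ε) with probability 1−β, matching the γ/2 guarantee of Proposition 2 up to constants.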
B Extractability for Zap Proof Systems
B.1 Non-interactive Zero-Knowledge Proofs
Most known constructions of zaps, as defined in Definition 5, are based on constructions of non-interactive zero-knowledge proofs or arguments in the common reference string model. We review the requirements of such proof systems below.
Definition 11
(NIZK Proofs and Arguments). Let \(R_L = \{(x, w)\}\) be a witness relation corresponding to a language \(L \in \mathbf {NP}\). A non-interactive zero-knowledge proof (or argument) system for \(R_L\) consists of a triple of algorithms \(({{\mathrm{Gen}}}, P, V)\) where:

The generator \({{\mathrm{Gen}}}\) is a PPT that takes as input a security parameter k and statement length \(t = {{\mathrm{poly}}}(k)\), and produces a common reference string \({{\mathrm{crs}}}\). An important special case is where \({{\mathrm{Gen}}}(1^k, 1^t)\) outputs a uniformly random string, in which case we say the proof (or argument) system operates in the common random string model.

The prover P is a PPT that takes as input a \({{\mathrm{crs}}}\) and a pair (x, w) and outputs a proof \(\pi \).

The verifier V is an efficient, deterministic algorithm that takes as input a \({{\mathrm{crs}}}\), an instance x and proof \(\pi \), and outputs a bit in \(\{0, 1\}\).
Various security requirements that can be imposed on the proof system are:

Perfect completeness. An honest prover who possesses a valid witness can always convince an honest verifier. Formally, for all \((x, w)\in R_L\),
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} {{\mathrm{crs}}}\leftarrow {{\mathrm{Gen}}}(1^k, 1^{|x|}) \\ \pi \leftarrow P({{\mathrm{crs}}}, x, w) \end{array}}}[V({{\mathrm{crs}}}, x, \pi ) = 1] = 1. \end{aligned}$$ 
Statistical soundness. It is statistically impossible to convince an honest verifier of the validity of a false statement. There exists a negligible function \({{\mathrm{negl}}}(\cdot )\) such that for every sequence \(\{x_k\}_{k\in \mathbb {N}}\) of \({{\mathrm{poly}}}(k)\)size statements \(x_k \notin L\),
$$\begin{aligned} \mathop {\Pr }\limits _{{{{\mathrm{crs}}}\leftarrow {{\mathrm{Gen}}}(1^k, 1^{|x_k|})}}[\exists \pi \in \{0, 1\}^* \text { s.t. } V({{\mathrm{crs}}}, x_k, \pi ) = 1] \le {{\mathrm{negl}}}(k). \end{aligned}$$ 
Computational zero-knowledge. Proofs do not reveal anything to the verifier beyond their validity. Formally, a proof system is computational zero-knowledge if there exists a PPT simulator \((S_1, S_2)\) where \(S_1\) produces a simulated common reference string \({{\mathrm{crs}}}\) with associated trapdoor \(\tau \). The pair \(({{\mathrm{crs}}}, \tau )\) allows \(S_2\) to simulate accepting proofs without knowledge of a witness w. That is, there exists a negligible function \({{\mathrm{negl}}}\) such that for all (possibly cheating) PPT verifiers \(V^*\) and sequences \(\{(x_k, w_k)\}_{k\in \mathbb {N}}\) of \({{\mathrm{poly}}}(k)\)-size statement-witness pairs \((x_k, w_k) \in R_L\),
$$\begin{aligned}&\left| \mathop {\Pr }\limits _{{\begin{array}{c} {{\mathrm{crs}}}\leftarrow {{\mathrm{Gen}}}(1^k, 1^{|x_k|}) \\ \pi \leftarrow P({{\mathrm{crs}}}, x_k, w_k) \end{array}}}[V^*({{\mathrm{crs}}}, x_k, \pi ) = 1]\right. \\&\qquad \qquad \qquad \left. -\; \mathop {\Pr }\limits _{{\begin{array}{c} ({{\mathrm{crs}}}, \tau ) \leftarrow S_1(1^k, 1^{|x_k|}) \\ \pi \leftarrow S_2({{\mathrm{crs}}}, \tau , x_k) \end{array}}} [V^*({{\mathrm{crs}}}, x_k, \pi ) = 1] \right| \le {{\mathrm{negl}}}(k). \end{aligned}$$ 
Statistical knowledge extraction. A proof system is additionally a proof of knowledge if a witness can be extracted from a valid proof. That is, there exists a polynomial-time knowledge extractor \(E = (E_1, E_2)\) such that \(E_1\) produces a simulated common reference string \({{\mathrm{crs}}}\) with associated extraction key \(\xi \), which we assume to have length O(k). The pair \(({{\mathrm{crs}}}, \xi )\) allows the deterministic algorithm \(E_2\) to extract a witness from a proof. Formally, the first component of \(({{\mathrm{crs}}}, \xi )\leftarrow E_1(1^k, 1^{|x|})\) is identically distributed to \({{\mathrm{crs}}}\leftarrow {{\mathrm{Gen}}}(1^k, 1^{|x|})\). Moreover, there exists a negligible function \({{\mathrm{negl}}}\) such that for every \(x \in \{0, 1\}^{{{\mathrm{poly}}}(k)}\),
$$\begin{aligned}&\mathop {\Pr }\limits _{{{{\mathrm{crs}}}\leftarrow {{\mathrm{Gen}}}(1^k, 1^{|x|})}}\Bigl [\exists \, \xi \in \{0, 1\}^*, \pi \in \{0, 1\}^* : ({{\mathrm{crs}}}, \xi ) \in E_1(1^k, 1^{|x|}) \\&\qquad \wedge \; V({{\mathrm{crs}}}, x, \pi ) = 1 \; \wedge \; (x, E_2({{\mathrm{crs}}}, \xi , x, \pi )) \notin R_L \Bigr ] \le {{\mathrm{negl}}}(k). \end{aligned}$$For technical reasons, we also require that the relation \(\{({{\mathrm{crs}}}, \xi ) \in E_1(1^k, 1^{|x|})\}\) be recognizable in polynomial time, which will always be the case for our constructions.
B.2 Extractability of Zaps Based on Exponentially Extractable NIZKs
We next describe Dwork and Naor’s original construction of zaps [DN07]. Here, we show that extractable zaps can be based on the existence of NIZK proofs of knowledge in the common random string model, which can in turn be built from various number-theoretic assumptions [DP92, DDP00, GOS12]. (Recall that in the common random string model for NIZK proofs, the \({{\mathrm{crs}}}\) generation algorithm simply outputs a uniformly random string.) The discussion in this section can be summarized by the following theorem.
Theorem 6
Let \(R_L\) be a witness relation for a language \(L \in \mathbf {NP}\). Then \(R_L\) has an extractable zap proof system if:
There exists a noninteractive zeroknowledge proof of knowledge for \(R_L\) (in the common random string model) with perfect completeness, statistical soundness, computational zeroknowledge, and statistical extractability.
The existence of such proofs of knowledge for \(\mathbf {NP}\) can be based on any of the following assumptions:

1.
The existence of NIZK proofs of membership for \(\mathbf {NP}\) and “dense secure public-key encryption schemes” [DP92]. NIZK proofs of membership can in turn be constructed from trapdoor permutations [FLS99] or indistinguishability obfuscation and one-way functions [BP15]. Dense secure public-key encryption schemes can be constructed under the hardness of factoring Blum integers [DDP00] or the Diffie-Hellman assumption [DP92].

2.
The decisional linear assumption for groups equipped with a bilinear map [GOS12].
The remainder of this section is devoted to the proof of Theorem 6. Let \(R_L\) be a witness relation for a language \(L \in \mathbf {NP}\). Let \((P_{\mathrm {NIZK}}, V_{\mathrm {NIZK}})\) be a NIZK proof system in the common random string model. We now describe Dwork and Naor’s [DN07] zap proof system for \(R_L\) based on \((P_{\mathrm {NIZK}}, V_{\mathrm {NIZK}})\).
For simplicity, assume we are interested in proving statements x whose length is a fixed polynomial in k. Let \(\ell = \ell (k)\) be a fixed polynomial. (Its value depends on the length of x and on the soundness error of the NIZK proof system; we defer discussion to the proof of Proposition 6, where it will also depend on the knowledge error of the NIZK knowledge extractor \(E_2\).) The verifier’s first message is a string \(\rho \in \{0, 1\}^{\ell \cdot m}\), which should be interpreted as a sequence of random strings \(\rho _1, \dots , \rho _\ell \), each in \(\{0, 1\}^m\). Here, \(m = {{\mathrm{poly}}}(k)\) is the length of the \({{\mathrm{crs}}}\) used in the proof system \((P_{\mathrm {NIZK}}, V_{\mathrm {NIZK}})\). The prover and verifier algorithms appear as Algorithms 4 and 5, respectively.
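The structure of the Dwork-Naor prover and verifier (Algorithms 4 and 5, whose listings are omitted here) can be sketched as follows. This is a plumbing-level illustration only: `nizk_prove` and `nizk_verify` are assumed stubs standing in for \((P_{\mathrm {NIZK}}, V_{\mathrm {NIZK}})\), and the construction's security of course depends on instantiating them with a real NIZK.

```python
import secrets

def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def zap_prove(rho_list, x, w, nizk_prove):
    """Prover: pick one random shift b, derive crs_j = b XOR rho_j
    for every verifier string rho_j, and prove x under each crs_j
    with the same witness w. The message is (b, pi_1, ..., pi_ell)."""
    m = len(rho_list[0])
    b = secrets.token_bytes(m)
    proofs = [nizk_prove(xor(b, rho), x, w) for rho in rho_list]
    return b, proofs

def zap_verify(rho_list, x, message, nizk_verify):
    """Verifier: re-derive each crs_j from b and accept iff every
    NIZK proof pi_j verifies under its crs_j."""
    b, proofs = message
    return len(proofs) == len(rho_list) and all(
        nizk_verify(xor(b, rho), x, pi)
        for rho, pi in zip(rho_list, proofs))
```

Soundness intuitively follows because the prover picks b after seeing ρ, yet (by a union bound over b) almost every derived crs_j remains sound; witness indistinguishability is inherited from the NIZK's zero-knowledge property.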
Theorem 7
([DN07]). Suppose \((P_{\mathrm {NIZK}}, V_{\mathrm {NIZK}})\) is a perfectly complete and statistically sound NIZK proof system for \(R_L\) in the common random string model. Then (P, V) is a perfectly complete, statistically sound zap proof system for \(R_L\).
Our goal now is to show that if \((P_{\mathrm {NIZK}}, V_{\mathrm {NIZK}})\) is also a statistically sound proof of knowledge, then the zap proof system (P, V) is extractable in the sense of Definition 6.
Proposition 6
If, in addition, \((P_{\mathrm {NIZK}}, V_{\mathrm {NIZK}})\) is statistically knowledge extractable, then (P, V) is also an extractable zap for \(R_L\).
Proof
Consider the extraction Algorithm 6.
Let \(x \in \{0, 1\}^*\). We say a common random string \({{\mathrm{crs}}}\in \{0, 1\}^m\) is knowledge-sound for x if there does not exist a pair \((\pi , \xi )\) such that

1.
\(V_{\mathrm {NIZK}}({{\mathrm{crs}}}, x, \pi ) = 1\),

2.
\(({{\mathrm{crs}}}, \xi )\) is in the support of \(E_1(1^k, 1^{|x|})\), and

3.
\((x, w) \notin R_L\) for \(w = E_2({{\mathrm{crs}}}, \xi , x, \pi )\).
Lemma 7
There exists a polynomial \(\ell (k)\) for which the following holds. Let \(x \in \{0, 1\}^{{{\mathrm{poly}}}(k)}\) and let \(\rho _1, \dots , \rho _\ell \) be random m-bit strings. Then with overwhelming probability over the choice of \(\rho \), for every \(b \in \{0, 1\}^m\), there exists an index j for which \({{\mathrm{crs}}}_j = b \oplus \rho _j\) is knowledge-sound for x.
Proof
Let q(k) denote the knowledge error of the NIZK proof system, i.e.,
$$\begin{aligned} q(k) = \mathop {\Pr }\limits _{{{{\mathrm{crs}}}\leftarrow \{0,1\}^m}}[{{\mathrm{crs}}}\text { is not knowledge-sound for } x]. \end{aligned}$$
Statistical extractability of the NIZK proof system requires that \(q(k) = {{\mathrm{negl}}}(k)\) for any \(|x| = {{\mathrm{poly}}}(k)\). For any fixed b, the strings \({{\mathrm{crs}}}_j = b \oplus \rho _j\) are independent and uniformly random. Therefore, the probability that all \(\ell \) copies fail to be knowledge-sound for x is at most \(q^\ell \). The number of possible assignments to \(b \in \{0, 1\}^m\) is \(2^m\). Therefore, it suffices to take \(\ell = 2m\) to make \(2^{m} q^\ell = {{\mathrm{negl}}}(k)\): since \(q(k) \le 1/2\) for all large k, we have \(2^m q^{2m} = (2q^2)^m \le q^m = {{\mathrm{negl}}}(k)\).
We may now complete the proof of Proposition 6.
By Lemma 7, with overwhelming probability over the choice of \(\rho \), there exists an index j for which \({{\mathrm{crs}}}_{j} = b\oplus \rho _{j}\) is knowledge-sound for x. If the zap verifier V accepts, then in particular, \(V_{\mathrm {NIZK}}({{\mathrm{crs}}}_{j}, x, \pi _j) = 1\). Thus, the zap knowledge extractor \(E_2({{\mathrm{crs}}}_{j}, \xi _{j}, x, \pi _{j})\) recovers a valid witness w for x. Since the number of strings \({{\mathrm{crs}}}_{j}\) that need to be checked is polynomial in k, and each extraction key \(\xi _j\) has length O(k) (so brute-force search over candidate keys takes \(2^{O(k)}\) steps), the extractor runs in time \(2^{O(k)}\).
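The brute-force extraction strategy just described (Algorithm 6, whose listing is omitted here) can be sketched as follows. Everything abstract is stubbed: `in_E1_support`, `E2`, `nizk_verify`, and `R_L` are assumed oracles for the corresponding objects, and `key_bits` = O(k) bounds the extraction-key length, so the inner loop is the \(2^{O(k)}\)-time search.

```python
from itertools import product

def zap_extract(rho_list, x, message, key_bits,
                in_E1_support, E2, nizk_verify, R_L):
    """For each crs_j = b XOR rho_j, try every candidate extraction
    key xi of length key_bits (2^{O(k)} time in total). If (crs_j, xi)
    is in the support of E1 and pi_j verifies under crs_j, run the
    deterministic extractor E2; return the first valid witness found."""
    b, proofs = message
    for rho, pi in zip(rho_list, proofs):
        crs = bytes(u ^ v for u, v in zip(b, rho))
        if not nizk_verify(crs, x, pi):
            continue
        for bits in product((0, 1), repeat=key_bits):
            xi = ''.join(map(str, bits))  # candidate extraction key
            if not in_E1_support(crs, xi):
                continue
            w = E2(crs, xi, x, pi)
            if R_L(x, w):
                return w
    return None  # no knowledge-sound crs_j yielded a witness
```

By Lemma 7, for all but a negligible fraction of verifier messages ρ some crs_j is knowledge-sound, so on any accepting proof the search above succeeds.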
© 2016 International Association for Cryptologic Research
Bun, M., Chen, Y.-H., Vadhan, S. (2016). Separating Computational and Statistical Differential Privacy in the Client-Server Model. In: Hirt, M., Smith, A. (eds.) Theory of Cryptography, TCC 2016. Lecture Notes in Computer Science, vol. 9985. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53641-4_23
Print ISBN: 978-3-662-53640-7
Online ISBN: 978-3-662-53641-4