In this section, we define a computational problem in the client-server model that can be efficiently solved with CDP, but not with statistical differential privacy. That is, we define a utility function u for which there exists a CDP mechanism achieving high utility. On the other hand, any efficient differentially private algorithm can only have negligible utility.
Theorem 4
(Main). Assume the existence of sub-exponentially secure one-way functions and extractable zaps for \(\mathbf {NP}\). Then there exists a sequence of data universes \(\{\mathcal {X}_k\}_{k\in \mathbb {N}}\), range spaces \(\{\mathcal {R}_k\}_{k\in \mathbb {N}}\) and an (efficiently computable) utility function \(u_k : \mathcal {X}_k^* \times \mathcal {R}_k \rightarrow \{0, 1\}\) such that
1. There exists a polynomial p such that for any \(\varepsilon _k, \beta _k > 0\) there exists a polynomial-time \(\varepsilon _k\)-PURE-SIM-CDP mechanism \(\{M^{\mathrm {CDP}}_k\}_{k\in \mathbb {N}}\) and an (inefficient) \(\varepsilon _k\)-PURE-DP mechanism \(\{M^{\mathrm {unb}}_k\}_{k\in \mathbb {N}}\) such that for every \(n \ge p(k, 1/\varepsilon _k, \log (1/\beta _k))\) and dataset \(D \in \mathcal {X}_k^n\), we have
$$\begin{aligned} \Pr [u_k(D, M^{\mathrm {CDP}}(D)) = 1] \ge 1-\beta _k \;\text { and }\; \Pr [u_k(D, M^{\mathrm {unb}}(D)) = 1] \ge 1-\beta _k \end{aligned}$$
2. For every \(\varepsilon _k \le O(\log k)\), \(\alpha _k = 1/{{\mathrm{poly}}}(k)\), \(n = {{\mathrm{poly}}}(k)\), and efficient \((\varepsilon _k, \delta = 1/n^2)\)-differentially private mechanism \(\{M'_k\}_{k\in \mathbb {N}}\), there exists a dataset \(D \in \mathcal {X}_k^n\) such that
$$\begin{aligned} \Pr [u_k(D, M'(D)) = 1] \le \alpha _k \;\text { for sufficiently large } k. \end{aligned}$$
Remark 1
We can only hope to separate SIM-CDP and differential privacy by designing a task that is infeasible with differential privacy but not impossible. By the definition of (PURE-)SIM-CDP for a mechanism \(\{M_k\}_{k\in \mathbb {N}}\), there exists an \(\varepsilon _k\)-(PURE-)DP mechanism \(\{M'_k\}_{k\in \mathbb {N}}\) that is computationally indistinguishable from \(\{M_k\}_{k\in \mathbb {N}}\). But if for every differentially private \(\{M'_k\}_{k\in \mathbb {N}}\), there were a dataset \(D_k\in \mathcal {X}^n_k\) such that \(\Pr [u_k(D_k, M_k'(D_k)) = 1] \le \Pr [u_k(D_k, M_k(D_k)) = 1] - 1/{{\mathrm{poly}}}(k)\), then the utility function \(u_k(D_k, \cdot )\) would itself serve as a distinguisher between \(\{M'_k\}_{k\in \mathbb {N}}\) and \(\{M_k\}_{k\in \mathbb {N}}\).
3.1 Construction
Let \(({{\mathrm{Gen}}}, {{\mathrm{Sign}}}, {{\mathrm{Ver}}})\) be a c-strongly unforgeable digital signature scheme with parameter \(c > 0\) as in Definition 7. After fixing c, we define for each \(k \in \mathbb {N}\) a reduced security parameter \(k_c= k^{c/2}\). We will use \(k_c\) as the security parameter for an extractable zap proof system (P, V, E). Since k and \(k_c\) are polynomially related, a negligible function in k is negligible in \(k_c\) and vice versa.
Given a security parameter \(k \in \mathbb {N}\), define the following sets of bit strings:
- Verification Key Space: \(\mathcal {K}_k = \{0, 1\}^{\ell _1}\) where \(\ell _1 = |vk|\) for \((sk, vk) \leftarrow {{\mathrm{Gen}}}(1^k)\),
- Message Space: \(\mathcal {M}_k = \{0, 1\}^k\),
- Signature Space: \(\mathcal {S}_k = \{0, 1\}^{\ell _2}\) where \(\ell _2 = |\sigma |\) for \(\sigma \leftarrow {{\mathrm{Sign}}}(sk, m)\) with \(m \in \mathcal {M}_k\),
- Public Coins Space: \(\mathcal {P}_k = \{0, 1\}^{\ell _3}\) where \(\ell _3 = {{\mathrm{poly}}}(\ell _1)\) is the length of first-round zap messages used to prove statements from \(\mathcal {K}_k\) under security parameter \(k_c\),
- Data Universe: \(\mathcal {X}_k = \mathcal {K}_k\times \mathcal {M}_k\times \mathcal {S}_k\times \mathcal {P}_k\).
That is, similarly to one of the hardness results of [DNR+09], we consider datasets D that contain n rows \(x_1 = (vk_1, m_1, \sigma _1, \rho _1), \dots , x_n = (vk_n, m_n, \sigma _n, \rho _n)\), each consisting of a verification key, a message, and a signature from the digital signature scheme, together with a zap verifier’s public coin tosses.
Let \(L \in \mathbf {NP}\) be the language
$$\begin{aligned} vk \in (L \cap \mathcal {K}_k) \iff \exists (m, \sigma ) \in \mathcal {M}_k \times \mathcal {S}_k \text { s.t. } {{\mathrm{Ver}}}(vk, m, \sigma ) = 1 \end{aligned}$$
which has the natural witness relation
$$\begin{aligned} R_L = \bigcup _k\{(vk, (m, \sigma ))\in \mathcal {K}_k\times (\mathcal {M}_k\times \mathcal {S}_k)\;:\;{{\mathrm{Ver}}}(vk, m, \sigma ) = 1\}. \end{aligned}$$
Define
- Proof Space: \(\varPi _k = \{0, 1\}^{\ell _4}\) where \(\ell _4 = |\pi |\) for \(\pi \leftarrow P(1^{k_c}, vk, (m, \sigma ), \rho )\) for \(vk \in (L \cap \mathcal {K}_k)\) with witness \((m, \sigma ) \in \mathcal {M}_k \times \mathcal {S}_k\) and public coins \(\rho \in \mathcal {P}_k\), and
- Output Space: \(\mathcal {R}_k = \mathcal {K}_k\times \mathcal {P}_k\times \varPi _k\).
Definition of Utility Function u. We now specify our computational task of interest via a utility function \(u : \mathcal {X}_k^n \times \mathcal {R}_k \rightarrow \{0, 1\}\). For any \(vk \in \mathcal {K}_k\), \(\rho \in \mathcal {P}_k\), and \(D = ((vk_1, m_1, \sigma _1, \rho _1), \dots , (vk_n, m_n, \sigma _n, \rho _n))\in \mathcal {X}_k^n\), define an auxiliary function
$$\begin{aligned} f_{vk, \rho }(D) = \#\{i \in [n]: vk_i = vk \wedge \rho _i = \rho \wedge {{\mathrm{Ver}}}(vk, m_i, \sigma _i) = 1\}. \end{aligned}$$
That is, \(f_{vk, \rho }\) is the number of elements of the dataset D with verification key equal to vk and public coin string equal to \(\rho \) for which \((m_i, \sigma _i)\) is a valid message-signature pair under vk. We now define \(u(D, (vk, \rho , \pi )) = 1\) iff
$$\begin{aligned} f_{vk, \rho }(D) \ge 9n/10 \quad&\wedge \quad V(1^{k_c}, vk, \rho , \pi ) = 1 \\&\text { or } \\ f_{vk', \rho '}(D) < 9n/10 \quad&\text { for all } vk' \in \mathcal {K}_k \text { and } \rho ' \in \mathcal {P}_k. \end{aligned}$$
That is, the utility function u is satisfied if either (1) many entries of D contain valid message-signature pairs under the same verification key vk with the same public coin string \(\rho \) and \(\pi \) is a valid proof for statement vk using \(\rho \), or (2) it is not the case that many entries of D contain valid message-signature pairs under the same verification key, with the same public coin string (in which case any response \((vk, \rho , \pi )\) is acceptable).
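For concreteness, the definition can be transcribed into a short Python sketch. Everything here is illustrative: Ver and V stand for the signature verification and zap verification algorithms, k_c is the zap security parameter, and byte strings stand in for elements of \(\mathcal {K}_k, \mathcal {M}_k, \mathcal {S}_k, \mathcal {P}_k\); none of these names come from the paper itself.

```python
from typing import Callable, List, Tuple

# A row of the data universe X_k: (vk, m, sigma, rho); encodings are placeholders.
Row = Tuple[bytes, bytes, bytes, bytes]

def f(vk: bytes, rho: bytes, D: List[Row], Ver: Callable) -> int:
    """f_{vk,rho}(D): rows matching (vk, rho) whose (m, sigma) verifies under vk."""
    return sum(1 for (vk_i, m_i, sig_i, rho_i) in D
               if vk_i == vk and rho_i == rho and Ver(vk, m_i, sig_i) == 1)

def u(D: List[Row], answer: Tuple, Ver: Callable, V: Callable, k_c: int) -> int:
    """The utility u(D, (vk, rho, pi)) defined above."""
    vk, rho, pi = answer
    n = len(D)
    # All (vk', rho') pairs that at least 9n/10 rows validly sign under.
    pairs = {(vk_i, rho_i) for (vk_i, _, _, rho_i) in D}
    heavy = [p for p in pairs if f(p[0], p[1], D, Ver) >= 9 * n / 10]
    if not heavy:
        return 1  # case (2): no heavy pair, so any answer is acceptable
    # case (1): the answer must name the heavy pair and carry an accepting zap proof
    return int((vk, rho) in heavy and V(k_c, vk, rho, pi) == 1)
```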
3.2 An Inefficient Differentially Private Algorithm
We begin by showing that there is an inefficient differentially private mechanism that achieves high utility under u.
Proposition 1
Let \(k \in \mathbb {N}\). For every \(\varepsilon > 0\), there exists an \((\varepsilon , 0)\)-differentially private algorithm \(M_k^{\mathrm {unb}} : \mathcal {X}_k^n \rightarrow \mathcal {R}_k\) such that, for every \(\beta > 0\), every \(n \ge \frac{10}{\varepsilon }\log (2 \cdot |\mathcal {K}_k| \cdot |\mathcal {P}_k| / \beta ) = {{\mathrm{poly}}}(1/\varepsilon , \log (1/\beta ), k)\), and every \(D\in (\mathcal {K}_k\times \mathcal {M}_k\times \mathcal {S}_k\times \mathcal {P}_k)^n\),
$$\begin{aligned} \mathop {\Pr }\limits _{{(vk, \rho , \pi ) \leftarrow M_k^{\mathrm {unb}}(D)}}[u(D, (vk, \rho , \pi )) = 1] \ge 1-\beta \end{aligned}$$
Remark 2
While the mechanism \(M^{\mathrm {unb}}\) considered here is only accurate for \(n \ge \varOmega (\log |\mathcal {P}_k|)\), it is also possible to use “stability techniques” [DL09, TS13] to design an \((\varepsilon , \delta )\)-differentially private mechanism that achieves high utility for \(n \ge O(\log (1/\delta )/\varepsilon )\) for \(\delta > 0\). We choose to provide a “pure” \(\varepsilon \)-differentially private algorithm here to make our separation more dramatic: Both the inefficient differentially private mechanism and the efficient SIM-CDP mechanism achieve pure \((\varepsilon , 0)\)-privacy, whereas no efficient mechanism can even achieve \((\varepsilon , \delta )\)-differential privacy with \(\delta > 0\).
Our algorithm relies on standard differentially private techniques for identifying frequently occurring elements in a dataset.
Report Noisy Max. Consider a data universe \(\mathcal {X}\). A predicate \(q : \mathcal {X}\rightarrow \{0, 1\}\) defines a counting query over the set of datasets \(\mathcal {X}^n\) as follows: For \(D = (x_1, \dots , x_n) \in \mathcal {X}^n\), we abuse notation by defining \(q(D) = \sum _{i = 1}^n q(x_i)\). We further say that a collection of counting queries Q is disjoint if, whenever \(q(x) = 1\) for some \(q \in Q\) and \(x \in \mathcal {X}\), we have \(q'(x) = 0\) for every other \(q' \ne q\) in Q. (Thus, disjoint counting queries slightly generalize point functions, which are each supported on exactly one element of the domain \(\mathcal {X}\).)
The “Report Noisy Max” algorithm [DR14], combined with observations of [BV16], can efficiently and privately identify which of a set of disjoint counting queries is (approximately) the largest on a dataset D, and release its identity along with the corresponding noisy count. We sketch the proof of the following proposition in Appendix A.
Proposition 2
(Report Noisy Max). Let Q be a set of efficiently computable and sampleable disjoint counting queries over a domain \(\mathcal {X}\). Further suppose that for every \(x \in \mathcal {X}\), the query \(q \in Q\) for which \(q(x) = 1\) (if one exists) can be identified efficiently. For every \(n\in \mathbb {N}\) and \(\varepsilon > 0\) there is a mechanism \(F:\mathcal {X}^n\rightarrow Q\times \mathbb {R}\) such that
1. F runs in time \({{\mathrm{poly}}}(n, \log |\mathcal {X}|, \log |Q|, 1/\varepsilon )\).
2. F is \(\varepsilon \)-differentially private.
3. For every dataset \(D \in \mathcal {X}^n\), let \(q_{{{\mathrm{OPT}}}} = {{\mathrm{argmax}}}_{q \in Q}q(D)\) and \({{\mathrm{OPT}}}= q_{\mathrm {OPT}}(D)\). Let \(\beta > 0\). Then with probability at least \(1-\beta \), the algorithm F outputs a solution \((\hat{q}, a)\) such that \(a \ge \hat{q}(D) - \gamma /2\) where \(\gamma = \frac{8}{\varepsilon } \cdot \left( \log |Q| + \log (1/\beta ) \right) \). Moreover, if \({{\mathrm{OPT}}}- \gamma > \max _{q\ne {q_{{{\mathrm{OPT}}}}}}q(D)\), then \(\hat{q} = {{\mathrm{argmax}}}_{q \in Q}q(D)\).
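One standard way to realize such a mechanism, sketched below in Python, is to add Laplace noise to the counts of the (at most n) queries that rows of D actually satisfy and release the noisy maximum. This simplified version ignores the exponentially many zero-count queries, which the full argument (Appendix A, following [BV16]) must also handle, so it illustrates the idea rather than reproducing the mechanism F of Proposition 2.

```python
import random
from collections import Counter
from typing import Callable, Hashable, List, Optional, Tuple

def laplace(scale: float) -> float:
    """A Laplace(scale) sample, as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def report_noisy_max(D: List, key: Callable[[object], Optional[Hashable]],
                     eps: float) -> Tuple[Optional[Hashable], float]:
    """Toy Report Noisy Max for disjoint counting queries.

    key(x) names the unique query q with q(x) = 1, or returns None.
    Disjointness means a neighboring dataset changes at most two counts,
    each by 1, so Laplace noise of scale 2/eps on every count makes the
    whole noisy histogram eps-DP; taking the argmax is post-processing.
    """
    counts = Counter(q for q in map(key, D) if q is not None)
    noisy = {q: c + laplace(2.0 / eps) for q, c in counts.items()}
    if not noisy:
        return None, 0.0
    q_hat = max(noisy, key=noisy.get)
    return q_hat, noisy[q_hat]
```

In the construction above, key((vk, m, sigma, rho)) would return (vk, rho) when Ver(vk, m, sigma) = 1 and None otherwise.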
We are now ready to describe our unbounded algorithm \(M_k^{\mathrm {unb}}\) as Algorithm 1. We prove Proposition 1 via the following two claims, capturing the privacy and utility guarantees of \(M_k^{\mathrm {unb}}\), respectively.
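Algorithm 1 is not reproduced in this excerpt. The following Python sketch reconstructs its steps from the way they are referenced in the proofs below (Report Noisy Max in Step 1, a possible \((\bot , \bot , \bot )\) output in Step 2, the lexicographically first valid message-signature pair in Step 3, and the zap prover in Step 4). The rejection threshold in Step 2 and all helper names (lex_first_valid_pair in particular) are assumptions made for illustration, not the paper's exact algorithm.

```python
BOT = (None, None, None)  # stands for the output (bot, bot, bot)

def M_unb(D, eps, k_c, P, lex_first_valid_pair, key):
    """Hedged reconstruction of the unbounded mechanism M_k^unb.

    P: zap prover.  lex_first_valid_pair(vk): (inefficient) search for the
    lexicographically first (m*, sigma*) with Ver(vk, m*, sigma*) = 1.
    key: as in the Report Noisy Max sketch above.
    """
    n = len(D)
    # Step 1: privately identify the pair (vk, rho) maximizing f_{vk,rho}(D).
    hit, a = report_noisy_max(D, key, eps)
    # Step 2: if the noisy count is too small, output (bot, bot, bot).
    # (The exact threshold is not stated in this excerpt; 7n/10 is one value
    # consistent with the accuracy bounds used in Lemmas 2 and 3.)
    if hit is None or a < 7 * n / 10:
        return BOT
    vk, rho = hit
    # Step 3: find a canonical witness -- the lexicographically first valid pair.
    m_star, sigma_star = lex_first_valid_pair(vk)
    # Step 4: produce a zap proof that vk is in L, using rho as the verifier's coins.
    pi = P(k_c, vk, (m_star, sigma_star), rho)
    return (vk, rho, pi)
```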
Lemma 1
The algorithm \(M_k^{\mathrm {unb}}\) is \(\varepsilon \)-differentially private.
Proof
The algorithm \(M_k^{\mathrm {unb}}\) accesses its input dataset D only through the \(\varepsilon \)-differentially private Report Noisy Max algorithm (Proposition 2). Hence, by the closure of differential privacy under post-processing, \(M_k^{\mathrm {unb}}\) is also \(\varepsilon \)-differentially private.
Lemma 2
The algorithm \(M_k^{\mathrm {unb}}\) is \((1-\beta )\)-useful for any number of rows \(n \ge \frac{20}{\varepsilon }(\log (|\mathcal {K}_k| \cdot |\mathcal {P}_k|/ \beta ))\).
Proof
If \(f_{vk, \rho }(D) < 9n/10\) for every vk and \(\rho \), then the utility of the mechanism is always 1. Therefore, it suffices to consider the case when there exist \(vk, \rho \) for which \(f_{vk, \rho }(D) \ge 9n/10\). When such vk and \(\rho \) exist, observe that we have \(f_{vk', \rho '}(D) \le n/10\) for every other pair \((vk', \rho ') \ne (vk, \rho )\). Thus, as long as
$$\begin{aligned} \frac{9n}{10} - \frac{n}{10} > \frac{8}{\varepsilon } \cdot (\log (|\mathcal {K}_k| \cdot |\mathcal {P}_k|) + \log (1/\beta )), \end{aligned}$$
the Report Noisy Max algorithm successfully identifies the correct \(vk, \rho \) in Step 1 except with probability \(\beta \) (Proposition 2). Moreover, the reported value a is at least \(9n/10 - \gamma /2 \ge 7n/10\), since \(n \ge \frac{20}{\varepsilon }(\log (|\mathcal {K}_k| \cdot |\mathcal {P}_k|/ \beta ))\) implies \(\gamma \le 2n/5\). By the perfect completeness of the zap proof system, the algorithm then produces a useful triple \((vk, \rho , \pi )\) in Step 4. Thus, the mechanism as a whole is \((1-\beta )\)-useful.
3.3 A SIM-CDP Algorithm
We define a PPT algorithm \(M_k^{\mathrm {CDP}}\) in Algorithm 2, which we argue is an efficient, SIM-CDP algorithm achieving high utility with respect to u.
The only difference between \(M_k^{\mathrm {CDP}}\) and the inefficient algorithm \(M_k^{\mathrm {unb}}\) occurs in Step 3, where we have replaced the inefficient process of finding a canonical message-signature pair \((m^*, \sigma ^*)\) with selecting a message-signature pair \((m_i, \sigma _i)\) from the dataset. Since all the other steps (Report Noisy Max and the zap prover’s algorithm) are efficient, \(M_k^{\mathrm {CDP}}\) runs in polynomial time. However, this change means that \(M_k^{\mathrm {CDP}}\) is no longer statistically differentially private, since a (computationally unbounded) adversary could reverse-engineer the proof \(\pi \) produced in Step 4 to recover the pair \((m_i, \sigma _i)\) contained in the dataset. On the other hand, the witness indistinguishability of the proof system implies that \(M_k^{\mathrm {CDP}}\) is nevertheless computationally differentially private:
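In terms of the sketch of \(M_k^{\mathrm {unb}}\) given earlier, only Step 3 changes; a hedged version of the modified step is shown below (again an illustration rather than the paper's Algorithm 2).

```python
def step3_cdp(D, vk, Ver):
    """Step 3 of M_k^CDP: take the witness from the dataset itself.

    Returns the first (m_i, sigma_i) in D with vk_i = vk and
    Ver(vk, m_i, sigma_i) = 1 (such a row exists whenever Step 2 did not
    output bot and the bad event B of the proof of Lemma 3 does not occur).
    """
    for (vk_i, m_i, sigma_i, rho_i) in D:
        if vk_i == vk and Ver(vk, m_i, sigma_i) == 1:
            return (m_i, sigma_i)
    return None
```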
Lemma 3
The algorithm \(M_k^{\mathrm {CDP}}\) is \(\varepsilon \)-SIM-CDP provided that \(n \ge (20/\varepsilon ) \cdot (k + \log |\mathcal {K}_k| + \log |\mathcal {P}_k|) = {{\mathrm{poly}}}(k, 1/\varepsilon )\).
Proof
Indeed, we will show that \(M'_k = M^{\mathrm {unb}}_k\) can serve as the simulator for \(M_k = M_k^{\mathrm {CDP}}\). That is, we will show that for every dataset \(D \in \mathcal {X}_k^n\) and every \({{\mathrm{poly}}}(k)\)-size adversary A,
$$\begin{aligned} \left| \Pr [A(M_k^{\mathrm {CDP}}(D)) = 1] - \Pr [A(M_k^{\mathrm {unb}}(D)) = 1] \right| \le {{\mathrm{negl}}}(k). \end{aligned}$$
First observe that by definition, the first two steps of the mechanisms are identical. Now define, for either mechanism \(M^{\mathrm {unb}}_k\) or \(M^{\mathrm {CDP}}_k\), a “bad” event B where the mechanism in Step 1 produces a pair \(((vk, \rho ), a)\) for which \(f_{vk, \rho }(D) = 0\), but does not output \((\bot , \bot , \bot )\) in Step 2. For either mechanism, the probability of the bad event B is \({{\mathrm{negl}}}(k)\), as long as \(n \ge (20/\varepsilon ) \cdot (k + \log (|\mathcal {K}_k| \cdot |\mathcal {P}_k|))\). This follows from the utility guarantee of the Report Noisy Max algorithm (Proposition 2), setting \(\beta = 2^{-k}\).
Thus, it suffices to show that, for any fixing of the coins of both mechanisms in Steps 1 and 2 under which B does not occur, the mechanisms \(M_k^{\mathrm {CDP}}(D)\) and \(M_k^{\mathrm {unb}}(D)\) are indistinguishable. There are now two cases to consider based on the coin tosses in Steps 1 and 2:
Case 1: Both Mechanisms Output \((\bot , \bot , \bot )\) in Step 2. In this case,
$$\begin{aligned} \Pr [A(M_k^{\mathrm {CDP}}(D)) = 1] = \Pr [A(\bot , \bot , \bot ) = 1] = \Pr [A(M_k^{\mathrm {unb}}(D)) = 1], \end{aligned}$$
and the mechanisms are perfectly indistinguishable.
Case 2: Step 1 Produced a Pair \(((vk, \rho ), a)\) for which \(f_{vk, \rho }(D) > 0\). In this case, we reduce to the witness indistinguishability of the zap proof system. Let \((vk_i = vk, m_i, \sigma _i)\) be the first entry of D for which \({{\mathrm{Ver}}}(vk, m_i, \sigma _i) = 1\), and let \((m^*, \sigma ^*)\) be the lexicographically first message-signature pair with \({{\mathrm{Ver}}}(vk, m^*, \sigma ^*) = 1\). The proofs we need to distinguish are \(\pi _{\mathrm {CDP}} \leftarrow P(1^{k_c}, vk, (m_i, \sigma _i), \rho )\) and \(\pi _{\mathrm {unb}} \leftarrow P(1^{k_c}, vk, (m^*, \sigma ^*), \rho )\). Let \(A^{\mathrm {zap}}(1^{k_c}, \rho , \pi ) = A(vk, \rho , \pi )\). Then we have
$$\begin{aligned} \Pr [A(M_k^{\mathrm {CDP}}(D)) = 1] = \Pr [A^{\mathrm {zap}}(1^{k_c}, \rho , \pi _{\mathrm {CDP}}) = 1] \end{aligned}$$
and
$$\begin{aligned} \Pr [A(M_k^{\mathrm {unb}}(D)) = 1] = \Pr [A^{\mathrm {zap}}(1^{k_c}, \rho , \pi _{\mathrm {unb}}) = 1]. \end{aligned}$$
Thus, indistinguishability of \(M_k^{\mathrm {CDP}}(D)\) and \(M_k^{\mathrm {unb}}(D)\) follows from the witness indistinguishability of the zap proof system.
The proof of Lemma 2 also shows that \(M_k^{\mathrm {CDP}}\) is useful for u.
Lemma 4
The algorithm \(M_k^{\mathrm {CDP}}\) is \((1-\beta )\)-useful for any number of rows \(n \ge \frac{20}{\varepsilon }(\log (2 \cdot |\mathcal {K}_k| \cdot |\mathcal {P}_k|/ \beta ))\).
3.4 Infeasibility of Differential Privacy
We now show that any efficient algorithm achieving high utility cannot be differentially private. In fact, like many prior hardness results, we provide an attack A that does more than violate differential privacy. Specifically we exhibit a distribution on datasets such that, given any useful answer produced by an efficient mechanism, A can with high probability recover a row of the input dataset. Following [DNR+09], we work with the following notion of a re-identifiable dataset distribution.
Definition 8
(Re-identifiable Dataset Distribution). Let \(u : \mathcal {X}^n \times \mathcal {R}\rightarrow \{0, 1\}\) be a utility function. Let \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) be an ensemble of distributions over \((D_0, z) \in \mathcal {X}^{n(k) + 1} \times \{0, 1\}^{{{\mathrm{poly}}}(k)}\) for \(n(k) = {{\mathrm{poly}}}(k)\). (Think of \(D_0\) as a dataset on \(n + 1\) rows, and z as a string of auxiliary information about \(D_0\)). Let \((D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k\) denote a sample from the following experiment: Sample \((D_0 = (x_1, \dots , x_{n+1}), z) \leftarrow \mathcal {D}_k\) and \(i \in [n]\) uniformly at random. Let \(D \in \mathcal {X}^n\) consist of the first n rows of \(D_0\), and let \(D'\) be the dataset obtained by replacing \(x_i\) in D with \(x_{n+1}\).
We say the ensemble \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) is a re-identifiable dataset distribution with respect to u if there exists a (possibly inefficient) adversary A and a negligible function \({{\mathrm{negl}}}(\cdot )\) such that for all polynomial-time mechanisms \(M_k\),
1. Whenever \(M_k\) is useful, A recovers a row of D from \(M_k(D)\). That is, for any PPT \(M_k\):
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k \\ r \leftarrow M_k(D) \end{array}}}[u(D, r) = 1 \; \wedge \; A(r, z) \notin D] \le {{\mathrm{negl}}}(k). \end{aligned}$$
2. A cannot recover the row \(x_i\) not contained in \(D'\) from \(M_k(D')\). That is, for any algorithm \(M_k\):
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k \\ r \leftarrow M_k(D') \end{array}}}[A(r, z) = x_i] \le {{\mathrm{negl}}}(k), \end{aligned}$$
where \(x_i\) is the i-th row of D.
Proposition 3
([DNR+09]). If a distribution ensemble \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) on datasets of size n(k) is re-identifiable with respect to a utility function u, then for every \(\gamma > 0\) and \(\alpha (k)\) with \(\min \{\alpha , (1-8\alpha )/8n^{1+\gamma }\} \ge {{\mathrm{negl}}}(k)\), there is no polynomial-time \((\varepsilon = \gamma \log (n), \delta = (1-8\alpha )/2n^{1+\gamma })\)-differentially private mechanism \(\{M_k\}_{k\in \mathbb {N}}\) that is \(\alpha \)-useful for u.
In particular, for every \(\varepsilon = O(\log k), \alpha = 1/{{\mathrm{poly}}}(k)\), there is no polynomial-time \((\varepsilon , 1/n^2)\)-differentially private and \(\alpha \)-useful mechanism for u.
Construction of a Re-identifiable Dataset Distribution. For \(k \in \mathbb {N}\), recall that the digital signature scheme induces a choice of verification key space \(\mathcal {K}_k\), message space \(\mathcal {M}_k\), and signature space \(\mathcal {S}_k\), each consisting of \({{\mathrm{poly}}}(k)\)-bit strings. Let \(n = {{\mathrm{poly}}}(k)\). Define a distribution \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) as follows. To sample \((D_0, z)\) from \(\mathcal {D}_k\), first sample a key pair \((sk, vk) \leftarrow {{\mathrm{Gen}}}(1^k)\) and a public coin string \(\rho \leftarrow \mathcal {P}_k\) uniformly at random. Sample messages \(m_1, \dots , m_{n+1} \leftarrow \mathcal {M}_k\) uniformly at random, and let \(\sigma _i \leftarrow {{\mathrm{Sign}}}(sk, m_i)\) for each \(i = 1, \dots , n+1\). Let the dataset \(D_0 = (x_1, \dots , x_{n+1})\) where \(x_i = (vk, m_i, \sigma _i, \rho )\), and set the auxiliary string \(z = (vk, \rho )\).
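A Python sketch of this sampling procedure, together with the resampling experiment \((D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k\) of Definition 8, is given below; Gen and Sign are the signature algorithms, bit lengths are passed explicitly, and the byte-string encodings are placeholder assumptions.

```python
import os
import random

def sample_Dk(k, n, ell3, Gen, Sign):
    """Sample (D_0, z) from the distribution D_k described above."""
    sk, vk = Gen(k)
    rho = os.urandom(ell3 // 8)                        # uniform public coins from P_k
    msgs = [os.urandom(k // 8) for _ in range(n + 1)]  # n+1 uniform messages from M_k
    D0 = [(vk, m, Sign(sk, m), rho) for m in msgs]
    return D0, (vk, rho)                               # z = (vk, rho)

def resample_experiment(k, n, ell3, Gen, Sign):
    """The experiment (D, D', i, z) <- tilde{D}_k."""
    D0, z = sample_Dk(k, n, ell3, Gen, Sign)
    i = random.randrange(n)                 # uniform index (0-based here)
    D = D0[:n]                              # first n rows of D_0
    Dprime = D[:i] + [D0[n]] + D[i + 1:]    # replace x_i with x_{n+1}
    return D, Dprime, i, z
```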
Proposition 4
The distribution \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) defined above is re-identifiable with respect to the utility function u.
Proof
We define an adversary \(A : \mathcal {R}_k \times (\mathcal {K}_k \times \mathcal {P}_k) \rightarrow \mathcal {X}_k\). Consider an input to A of the form \((r, z) = ((vk', \rho ', \pi ), (vk, \rho ))\). If \(vk' \ne vk\) or \(\rho ' \ne \rho \) or \(\pi = \bot \), then output \((vk, \bot , \bot , \rho )\). Otherwise, run the zap extraction algorithm \(E(1^{k_c}, vk, \rho , \pi )\) to extract a witness \((m, \sigma )\), and output the resulting \((vk, m, \sigma , \rho )\). Note that the running time of A is \(2^{O(k_c)}\).
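In code, A is a thin wrapper around the zap extractor E (all callables and encodings are assumptions of this sketch; note that A need not be efficient, since running E may take \(2^{O(k_c)}\) time).

```python
def adversary_A(r, z, E, k_c):
    """The re-identification adversary A on input (r, z) = ((vk', rho', pi), (vk, rho))."""
    vk_p, rho_p, pi = r
    vk, rho = z
    if vk_p != vk or rho_p != rho or pi is None:
        return (vk, None, None, rho)      # the dummy row (vk, bot, bot, rho)
    m, sigma = E(k_c, vk, rho, pi)        # extract a witness from the accepting proof
    return (vk, m, sigma, rho)
```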
We break the proof of re-identifiability into two lemmas. First, we show that A can successfully recover a row in D from any useful answer:
Lemma 5
Let \(M_k : \mathcal {X}_k^n \rightarrow \mathcal {R}_k\) be a PPT algorithm. Then
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k\\ r \leftarrow M_k(D) \end{array}}}[u(D, r) = 1 \; \wedge \; A(r, z) \notin D] \le {{\mathrm{negl}}}(k). \end{aligned}$$
Proof
First, if \(u(D, r) = u(D, (vk', \rho ', \pi )) = 1\), then \(vk' = vk\), \(\rho ' = \rho \), and \(V(1^{k_c}, vk, \rho , \pi ) = 1\). In other words, \(\pi \) is a valid proof that \(vk\in (L\cap \mathcal {K}_k)\). Hence, by the extractability of the zap proof system, we have that \((m, \sigma ) = E(1^{k_c}, vk, \rho , \pi )\) satisfies \((vk, (m, \sigma ))\in R_L\); namely, \({{\mathrm{Ver}}}(vk, m, \sigma ) = 1\) with overwhelming probability over the choice of \(\rho \).
Next, we use the exponential security of the digital signature scheme to show that the extracted pair \((m, \sigma )\) must indeed appear in the dataset D. Consider the following forgery adversary for the digital signature scheme.
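The forgery adversary is not displayed in this excerpt; the following hedged sketch reconstructs it from how it is used in the next paragraph: it rebuilds a dataset distributed like D using its signing oracle, runs the mechanism on it, and feeds the result to A. All names follow the earlier sketches and are illustrative assumptions.

```python
import os

def A_forge(vk, sign_oracle, k, n, ell3, M, E, k_c):
    """Hedged reconstruction of the forgery adversary A_forge^{Sign(sk,.)}(vk).

    sign_oracle(m) returns Sign(sk, m); M is the polynomial-time mechanism
    under attack; adversary_A is the re-identification adversary above.
    The dataset built here is distributed exactly like D in the experiment
    (D, D', i, z) <- tilde{D}_k.
    """
    rho = os.urandom(ell3 // 8)                       # verifier coins, as in D_k
    msgs = [os.urandom(k // 8) for _ in range(n)]     # n uniform messages
    D = [(vk, m, sign_oracle(m), rho) for m in msgs]  # all signatures via the oracle
    r = M(D)                                          # run the mechanism on D
    _, m, sigma, _ = adversary_A(r, (vk, rho), E, k_c)
    return (m, sigma)                                 # candidate forgery
```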
The dataset built by the forgery algorithm \(A^{{{\mathrm{Sign}}}(sk, \cdot )}_{\mathrm {forge}}\) is identically distributed to a sample D from the experiment \((D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k\). Since a message-signature pair \((m, \sigma )\) appears in D if and only if the signing oracle was queried on m to produce \(\sigma \), we have
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (sk, vk)\leftarrow {{\mathrm{Gen}}}(1^k)\\ (m, \sigma )\leftarrow A_{\mathrm {forge}}^{{{\mathrm{Sign}}}(sk, \cdot )}(vk) \end{array}}}&[{{\mathrm{Ver}}}(vk, m, \sigma ) = 1 \wedge (m, \sigma )\notin Q] \\&=\mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k \\ r \leftarrow M_k(D) \end{array}}}[u(D, r) = 1 \; \wedge \;(vk, m, \sigma , \rho ) = A(r, z) \notin D]. \end{aligned}$$
The running time of the algorithm A, and hence of the algorithm \(A^{{{\mathrm{Sign}}}(sk, \cdot )}_{\mathrm {forge}}\), is \(2^{O(k_c)} = 2^{o(k^c)}\). Thus, by the strong unforgeability of the digital signature scheme against \(2^{k^c}\)-time adversaries, this probability is negligible in k.
We next argue that A cannot recover row \(x_i = (vk, m_i, \sigma _i, \rho )\) from \(M_k(D')\), where we recall that \(D'\) is the dataset obtained by replacing row \(x_i\) in D with row \(x_{n+1}\).
Lemma 6
For every algorithm \(M_k\):
$$\begin{aligned} \mathop {\Pr }\limits _{{\begin{array}{c} (D, D', i, z) \leftarrow \tilde{\mathcal {D}}_k \\ r \leftarrow M_k(D') \end{array}}}[A(r, z) = x_i] \le {{\mathrm{negl}}}(k), \end{aligned}$$
where \(x_i\) is the i-th row of D.
Proof
Since in \(D_0 = ((vk, m_1, {{\mathrm{Sign}}}(sk, m_1), \rho ), \dots , (vk, m_{n+1}, {{\mathrm{Sign}}}(sk, m_{n+1}), \rho ))\) the messages \(m_1, \dots , m_{n+1}\) are drawn independently, the dataset \(D' = (D_0 - \{(vk, m_i, \sigma _i, \rho )\}) \cup \{(vk, m_{n+1}, \sigma _{n+1}, \rho )\}\) contains no information about the message \(m_i\). Since \(m_i\) is drawn uniformly at random from the space \(\mathcal {M}_k = \{0, 1\}^k\), the probability that \(A(r, z) = A(M_k(D'), (vk, \rho ))\) outputs row \(x_i\) is at most \(2^{-k} = {{\mathrm{negl}}}(k)\).
Re-identifiability of the distribution ensemble \(\{\mathcal {D}_k\}_{k\in \mathbb {N}}\) follows by combining Lemmas 5 and 6.