1 Introduction

A proof of work (PoW), introduced by Dwork and Naor [DN93], is a proof system in which a prover \(\mathcal{P}\) convinces a verifier \(\mathcal{V}\) that he spent some computation with respect to some statement x. A simple PoW can be constructed from a hash function \(H(\cdot )\), where a proof with respect to a statement x is simply a salt s such that \(H(s\Vert x)\) starts with t 0’s. If H is modelled as a random function, \(\mathcal{P}\) must evaluate H on \(2^t\) values (in expectation) before he finds such an s.
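For concreteness, here is a minimal sketch of such a PoW in Python, with H instantiated by SHA-256 (the function names and the 8-byte salt encoding are our own illustrative choices):

```python
import hashlib
import itertools

def prove(x: bytes, t: int) -> int:
    """Find a salt s such that H(s || x) starts with t zero bits."""
    for s in itertools.count():
        h = hashlib.sha256(s.to_bytes(8, "big") + x).digest()
        if int.from_bytes(h, "big") >> (256 - t) == 0:
            return s  # found after ~2^t evaluations of H in expectation

def verify(x: bytes, s: int, t: int) -> bool:
    """Checking a proof costs a single evaluation of H."""
    h = hashlib.sha256(s.to_bytes(8, "big") + x).digest()
    return int.from_bytes(h, "big") >> (256 - t) == 0

# e.g. verify(b"statement x", prove(b"statement x", 16), 16) == True
```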

The original motivation for PoWs was prevention of email spam and denial of service attacks, but today by far the most important application for PoWs is securing blockchains, most prominently the Bitcoin blockchain, whose security is based on the assumption that the majority of the computing power dedicated towards the blockchain comes from honest users. This results in a massive waste of energy and other resources, as this mining is mostly done on dedicated hardware (ASICs) which has no other use than Bitcoin mining. In [DFKP15], proofs of space (PoS) were suggested as an alternative to PoW. The idea is to use disk space rather than computation as the main resource for mining. As millions of users have a significant amount of unused disk space available (on laptops etc.), dedicating this space towards securing a blockchain would result in almost no waste of resources.

Let [N] denote some domain of size N. For convenience we’ll often assume that \(N=2^n\) is a power of 2 and identify [N] with \(\{0,1\}^n\), but this is never crucial, and [N] can be any other efficiently samplable domain. A simple idea for constructing a PoS is to have the verifier specify a random function \(f:[N]\rightarrow [N]\) during the initialization phase, and have the prover compute the function table of f and sort it by the output values. Then, during the proof phase, to convince the verifier that he really stores this table, the prover must invert f on a random challenge. We will call this approach the “simple PoS”; we discuss it in more detail in Sect. 1.3 below and explain why it miserably fails to provide any meaningful security guarantees.

Instead, existing PoS [DFKP15, RD16] are based on pebbling lower bounds for graphs. These PoS provide basically the best security guarantees one could hope for: a cheating prover needs \(\varTheta (N)\) space or time after the challenge is known to make a verifier accept. Unfortunately, compared to the (insecure) simple PoS, they have two drawbacks which make them more difficult to use as a replacement for PoW in blockchains. First, the proof size is quite large (several MB instead of a few bytes as in the simple PoS or Bitcoin’s PoW). Second, the initialization phase requires two messages: the first message, like in the simple PoS, is sent from the verifier to the prover, specifying a random function f; the second message, unlike in the simple PoS, is a “commitment” from the prover to the verifier.

If such a pebbling-based PoS is used as a replacement for PoW in a blockchain design, the first message can be chosen non-interactively by the miner (who plays the role of the prover), but the commitment sent in the second message is more tricky. In Spacemint (a PoS-based decentralized cryptocurrency [PPK+15]), this is solved by having a miner put this commitment into the blockchain itself before he can start mining. As a consequence, Spacemint lacks the nice property of the Bitcoin blockchain where miners can join the effort by just listening to the network, and only need to speak up once they find a proof and want to add it to the chain.

1.1 Our Results

In this work we “resurrect” the simple approach towards constructing PoS based on inverting random functions. This seems impossible, as Hellman’s time-memory trade-offs — which are the reason this approach fails — can be generalized to apply to all functions (see Sect. 1.4). But for Hellman’s attacks to apply, one needs to be able to evaluate the function efficiently in forward direction. At first glance, this may not seem like a real restriction at all, as inverting functions which cannot be efficiently computed in forward direction is undecidable in general. However, we observe that for functions to be used in the simple PoS outlined above, the requirement of efficient computability can be relaxed in a meaningful way: we only need to be able to compute the entire function table in time linear (or quasilinear) in the size of the input domain. We construct functions satisfying this relaxed condition for which we prove lower bounds on time-memory trade-offs beyond the upper bounds given by Hellman’s attacks.

Our most basic construction of such a function \(g_f:[N]\rightarrow [N]\) is based on a function \(g:[N]\times [N]\rightarrow [N]\) and a permutation \(f:[N]\rightarrow [N]\). For the lower bound proof, g and f are modelled as truly random, and all parties access them as oracles. The function is defined as \(g_f(x)=g(x,x')\) where \(f(x)=\pi (f(x'))\) for any involution \(\pi \) without fixed points; for concreteness we let \(\pi \) simply flip all bits, denoted \(f(x)=\overline{f(x')}\). Let us stress that f does not need to be a permutation – it can also be a random function – but we’ll state and prove our main result for a permutation as it makes the analysis cleaner. In practice — where one has to instantiate f and g with something efficient — one would rather use a function, because it can be instantiated with a cryptographic hash function like SHA-3 or (truncated) AES, whereas we don’t have good candidates for suitable permutations (at the very least f needs to be one-way, and, unfortunately, all candidates we have for one-way permutations are number-theoretic and thus much less efficient).

In Theorem 2 we state that for \(g_f\) as above, any algorithm which has a state of size S (that can arbitrarily depend on g and f) and inverts \(g_f\) on an \(\epsilon \) fraction of outputs must satisfy \(S^2T\in \varOmega (\epsilon ^2 N^2)\). This must be compared with the best lower bound known for inverting random functions (or permutations), which is \(ST=\varOmega (\epsilon N)\). We can further push the lower bound to \(S^kT\in \varOmega (\epsilon ^k N^k)\) by “nesting” the construction; in the first iteration of this nesting one replaces the inner function f with \(g_f\). These lower bounds are illustrated in Fig. 1.

Fig. 1. [Figure not reproduced.] Illustration of lower bounds: the \(ST=\varOmega (N)\) lower bound for inverting random permutations or functions; the ideal bound, where either T or S is \(\varOmega (N)\), as achieved by the pebbling-based PoS [DFKP15, RD16] (more precisely, the bound approaches this line for large N); and the lower bound \(S^2T=\varOmega (N^2)\) for \(T\le N^{2/3}\) for our most basic construction as stated in Theorem 2. The restriction \(T\le N^{2/3}\) which we need for our proof to go through can be relaxed to \(T\le N^{t/(t+1)}\) by using t-wise collisions instead of pairwise collisions in our construction (the figure illustrates the improvement from \(t=2\) to \(t=3\)). Finally, the \(S^2T=\varOmega (N^2)\) lower bound of the basic construction can be pushed to \(S^kT=\varOmega (N^k)\) by using \(k-1\) levels of nesting (the figure illustrates the improvement from \(k=2\) to \(k=3\)).

In this paper we won’t give a proof for the general construction, as it doesn’t require any new ideas but just gets more technical. We also expect the basic construction to already be sufficient for constructing a secure PoS: although for \(g_f\) there exists a time-memory trade-off \(S^4T\in O(N^4)\) (say, \(S=T\approx N^{4/5}\)), achieved by “nesting” Hellman’s attack, we expect this attack to be of concern only for extremely large N.

A caveat of our lower bound is that it only applies if \(T\le N^{2/3}\). We don’t see how to beat our lower bound when \(T >N^{2/3}\); the restriction \(T\le N^{2/3}\) seems to be mostly an artifact of the proof technique. One can improve the bound to \(T\le N^{t/(t+1)}\) for any t by generalizing our construction to t-wise collisions. One way to do this – if f is a permutation and t divides N – is as follows (see the sketch below): let \(g:[N]^t \rightarrow [N]\) and define \(g_f(x)=g(x,x_1,\ldots ,x_{t-1})\), where for some partition \(S_1,\ldots ,S_{N/t}\) of [N] into blocks of size \(|S_i|=t\), the values \(f(x),f(x_1),\ldots ,f(x_{t-1})\) together make up one block \(S_i\) and \(x_1<x_2<\ldots <x_{t-1}\).
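As an illustration, a minimal sketch of how a prover could tabulate this t-wise variant (our own code, taking the blocks \(S_i\) to be the consecutive length-t intervals of f’s output domain):

```python
def gf_table_twise(f, g, N, t):
    """Tabulate g_f(x) = g(x, x_1, ..., x_{t-1}) for the t-wise variant.

    f is a permutation on range(N), g maps t values to range(N), and
    t divides N; block S_i is {i*t, ..., (i+1)*t - 1}.
    """
    finv = [0] * N
    for x in range(N):
        finv[f(x)] = x                    # inverse table, built in linear time
    table = {}
    for x in range(N):
        i = f(x) // t                     # index of the block containing f(x)
        others = sorted(finv[v] for v in range(i * t, (i + 1) * t)
                        if finv[v] != x)  # yields x_1 < ... < x_{t-1}
        table[x] = g(x, *others)
    return table
```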

1.2 Proofs of Space

A proof of space as defined in [DFKP15] is a two-phase protocol between a prover \(\mathcal{P}\) and a verifier \(\mathcal{V}\), where after an initial phase \(\mathcal{P}\) holds a file F of size N, whereas \(\mathcal{V}\) only needs to store some small value. The running time of \(\mathcal{P}\) during this phase must be at least N, as \(\mathcal{P}\) has to write down F, which is of size N; we require that it is not much more than that, at most quasilinear in N. \(\mathcal{V}\), on the other hand, must be very efficient; in particular, its running time can be polynomial in a security parameter, but must be basically independent of N.

Then there’s a proof execution phase — which typically will be executed many times over a period of time — in which \(\mathcal{V}\) challenges \(\mathcal{P}\) to prove it stored F. The security requirement states that a cheating prover \(\tilde{\mathcal{P}}\) who only stores a file \(F'\) of size significantly smaller than N either fails to make \(\mathcal{V}\) accept, or must invest a significant amount of computation, ideally close to \(\mathcal{P}\)’s cost during initialization. Note that we cannot hope to make it more expensive than that as a cheating \(\tilde{\mathcal{P}}\) can always just store the short communication during initialization, and then reconstruct all of F before the execution phase.

1.3 A Simple PoS that Fails

Probably the first candidate for a PoS scheme that comes to mind is to have — during the initialization phase — \(\mathcal{V}\) send the (short) description of a “random behaving” function \(f:[N]\rightarrow [N]\) to \(\mathcal{P}\), who then computes the entire function table of f and stores it sorted by the outputs. During proof execution \(\mathcal{V}\) picks a random \(x\in [N]\) and challenges \(\mathcal{P}\) to invert f on \(y=f(x)\).

An honest prover can answer any challenge y by looking up an entry \((x',y)\) in the table, which is efficient as the table is sorted by the y’s; a sketch in code follows below. At first one might hope this provides good security against any cheating prover: intuitively, a prover who only stores \(\ll N\log N\) bits (i.e., uses space sufficient to only store \(\ll N\) output labels of length \(\log N\)) will not have stored a value \(x\in f^{-1}(y)\) for most y’s, and thus must invert by brute force, which requires \(\varTheta (N)\) invocations of f. Unfortunately, even if f is modelled as a truly random function, this intuition is totally wrong due to Hellman’s time-memory trade-offs, which we’ll discuss in the next section.
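In code, the simple scheme reads as follows (a minimal sketch, with f given as a Python callable):

```python
from bisect import bisect_left

def pos_init(f, N):
    """Honest prover: the function table of f, sorted by output values."""
    return sorted((f(x), x) for x in range(N))  # ~ N log N bits of space

def pos_answer(table, y):
    """Look up a preimage of the challenge y via binary search."""
    i = bisect_left(table, (y, -1))
    if i < len(table) and table[i][0] == y:
        return table[i][1]
    return None  # y has no preimage under f

def pos_verify(f, y, x):
    """The verifier checks the proof with one evaluation of f."""
    return x is not None and f(x) == y
```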

The goal of this work is to save this elegant and simple approach towards constructing PoS. As discussed before, for our function \(g_f:[N]\rightarrow [N]\) (defined as \(g_f(x)=g(x,x')\) where \(f(x)=\overline{f(x')}\)) we can prove better lower bounds than for random functions. Instantiating the simple PoS with \(g_f\) needs some minor adaptations. \(\mathcal{V}\) sends the description of a function \(g:[N]\times [N]\rightarrow [N]\) and a permutation \(f:[N]\rightarrow [N]\) to \(\mathcal{P}\). Now \(\mathcal{P}\) first computes the entire function table of f and sorts it by the output values; note that with this table \(\mathcal{P}\) can efficiently invert f. Then \(\mathcal{P}\) computes (and sorts) the function table of \(g_f\) (using that \(g_f(x)=g(x,f^{-1}(\overline{f(x)}))\)). Another issue is that in the execution phase \(\mathcal{V}\) can no longer compute a challenge as before – i.e., \(y=g_f(x)\) for a random x – as it cannot evaluate \(g_f\). Instead, we let \(\mathcal{V}\) just pick a random \(y\in [N]\). The prover \(\mathcal{P}\) must answer this challenge with a tuple \((x,x')\) s.t. \(f(x)=\overline{f(x')}\) and \(g(x,x')=y\) (i.e., \(g_f(x)=y\)). Just sending the preimage x of y under \(g_f\) is no longer sufficient, as \(\mathcal{V}\) is not able to verify whether \(g_f(x)=y\) without \(x'\).
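A minimal sketch of the adapted initialization and proof phases (our own code; we identify [N] with \(\{0,\ldots ,N-1\}\) for \(N=2^n\), so \(\overline{v}\) is the XOR of v with \(N-1\)):

```python
from bisect import bisect_left

def gf_init(f, g, N):
    """Prover: tabulate and invert f, then tabulate g_f sorted by outputs."""
    mask = N - 1                   # flipping all n bits implements the involution
    finv = [0] * N
    for x in range(N):
        finv[f(x)] = x             # sorted-by-output table of f
    # store (g_f(x), x, x') so challenges can be answered with the pair (x, x')
    table = []
    for x in range(N):
        xp = finv[f(x) ^ mask]     # x' with f(x') = ~f(x)
        table.append((g(x, xp), x, xp))
    table.sort()
    return table

def gf_answer(table, y):
    """Return (x, x') with g(x, x') = y and f(x) = ~f(x'), or None."""
    i = bisect_left(table, (y, -1, -1))
    if i < len(table) and table[i][0] == y:
        return table[i][1], table[i][2]
    return None

def gf_verify(f, g, N, y, proof):
    """The verifier needs both x and x' to check the proof."""
    if proof is None:
        return False
    x, xp = proof
    return f(x) == (f(xp) ^ (N - 1)) and g(x, xp) == y
```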

Remark 1

(Completeness and Soundness Error). This protocol has a significant soundness and completeness error. On the one hand, a cheating prover \(\tilde{\mathcal{P}}\) who only stores, say, \(10\%\) of the function table will still be able to make \(\mathcal{V}\) accept in \(10\%\) of the cases. On the other hand, even if \(g_f\) behaves like a random function, an honest prover \(\mathcal{P}\) will only be able to answer a \(1-1/e\) fraction (\(\approx 63\%\)) of the challenges \(y\in [N]\), as some will simply not have a preimage under \(g_f\).
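The \(1-1/e\) fraction follows from a standard calculation: if \(g_f\) behaves like a random function, a fixed \(y\in [N]\) has no preimage with probability

$$\begin{aligned} \Pr [y\notin g_f([N])]=\left( 1-\frac{1}{N}\right) ^{N}\approx \frac{1}{e}\ , \end{aligned}$$

so in expectation only a \(1-1/e\) fraction of the challenges \(y\in [N]\) can be answered.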

When used as a replacement for PoW in cryptocurrencies, neither the soundness nor the completeness error is an issue. If this PoS is to be used in a context where one needs negligible soundness and/or completeness error, one can use standard repetition tricks to amplify soundness and completeness and make the corresponding errors negligible.

Remark 2

(Domain vs. Space). When constructing a PoS from a function with a domain of size N, the space the honest prover requires is around \(N\log N\) bits for the simple PoS outlined above (where we store the sorted function table of a function \(f:[N]\rightarrow [N]\)), and roughly twice that for our basic construction (where we store the function tables of \(g_f:[N]\rightarrow [N]\) and \(f:[N]\rightarrow [N]\)). Thus, for a given amount \(N'\) of space the prover wants to commit to, it must use a function with a domain of size \(N\approx N'/\log (N')\). In particular, the time-memory trade-offs we can prove on the hardness of inverting the underlying function translate directly to the security of the PoS.

1.4 Hellman’s Time-Memory Trade-Offs

Hellman [Hel80] showed that any permutation \(p:[N]\rightarrow [N]\) can be inverted using an algorithm that is given S bits of auxiliary information on p and makes at most T oracle queries to \(p(\cdot )\), where (\(\tilde{O}\) below hides \(\log (N)^{O(1)}\) factors)

$$\begin{aligned} S\cdot T \in \tilde{O}( N)\quad \text {e.g. when}\quad S=T\approx N^{1/2}\ . \end{aligned}$$
(1)

Hellman also presents attacks against random functions, but with worse parameters. A rigorous bound was only later proven by Fiat and Naor [FN91], who show that Hellman’s attack on random functions satisfies

$$\begin{aligned} S^2\cdot T\in \tilde{O}(N^2)\quad \text {e.g. when}\quad S=T\approx N^{2/3}\ . \end{aligned}$$
(2)

Fiat and Naor [FN91] also present an attack with worse parameters which works for any (not necessarily random) function, where

$$\begin{aligned} S^3\cdot T\in \tilde{O}(N^3)\quad \text {e.g. when}\quad S=T\approx N^{3/4}\ . \end{aligned}$$
(3)

The attack on a permutation \(p:[N]\rightarrow [N]\) for a given T is easy to explain: Pick any \(x\in [N]\) and define \(x_0,x_1,\ldots \) as \(x_0=x\), \(x_{i+1}=p(x_i)\); let \(\ell \le N\) be minimal such that \(x_\ell =x_0\). Now store the values \(x_T,\,x_{2T},\ldots ,\,x_{\lfloor \ell /T\rfloor T}\) in a sorted list. Let us assume for simplicity that \(\ell =N\), so \(x_0,\ldots ,x_{\ell -1}\) cover the entire domain (if this is not the case, one picks some \(x'\) not yet covered and makes a new table for the values \(x_0=x',\,x_1=p(x_0),\ldots \)). This requires storing \(S=N/T\) values. Given this table and a challenge y to invert, we just apply p to y until we hit some stored value \(x_{iT}\), then continue applying p to \(x_{(i-1)T}\) until we hit y; the value right before y is the inverse \(p^{-1}(y)\). By construction this attack requires at most T invocations of p. The attack on general functions is more complicated and gives worse bounds, as we don’t have such a nice cycle structure. In a nutshell, one computes several different chains, where for the jth chain we pick some random \(h_j:[N]\rightarrow [N]\) and compute \(x_0,\,x_1,\ldots ,\,x_n\) as \(x_i=f(h_j(x_{i-1}))\). Then, every Tth value of the chain is stored. To invert a challenge y we apply \(f(h_1(\cdot ))\) sequentially on input y up to T times. If we hit a value \(x_{iT}\) we stored in the first chain, we try to invert by applying \(f(h_1(\cdot ))\) starting with \(x_{(i-1)T}\). If we don’t succeed, we continue with the chains generated by \(f(h_2(\cdot )),\,f(h_3(\cdot )),\ldots \) until the inverse is found or all chains are used up. This attack is successful with high probability if the chains cover a large fraction of f’s output domain.
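A minimal sketch of this cycle-walking attack on a permutation, under the simplifying assumption of a single cycle covering all of [N] (our own code):

```python
def hellman_table(p, N, T):
    """Walk the cycle of p and store every T-th point: S = N/T values.

    Maps each stored point x_{iT} to the point x_{(i-1)T} lying T steps
    earlier on the cycle.  Assumes p has one cycle of length N.
    """
    table, x, prev = {}, 0, 0
    for i in range(1, N + 1):
        x = p(x)
        if i % T == 0:
            table[x] = prev
            prev = x
    return table

def hellman_invert(p, table, y, T):
    """Invert y = p(x) using at most ~T evaluations of p."""
    z = y
    for _ in range(T):            # walk forward until a stored point is hit
        if z in table:
            w = table[z]          # jump back T steps ...
            nxt = p(w)
            while nxt != y:       # ... and walk forward again up to y
                w, nxt = nxt, p(nxt)
            return w              # the predecessor of y, i.e. p^{-1}(y)
        z = p(z)
    return None
```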

1.5 Samplability is Sufficient for Hellman’s Attack

One reason the lower bound for our function \(g_f:[N]\rightarrow [N]\) (defined as \(g_f(x)=g(x,\,x')\) where \(f(x)=\overline{f(x')}\)) does not contradict Hellman’s attacks is the fact that \(g_f\) cannot be efficiently evaluated in forward direction. One can think of simpler constructions with this property, such as \(g'_f(x)=g(x,\,f^{-1}(x))\), but observe that Hellman’s attack is easily adapted to break \(g'_f\). More generally, Hellman’s attack doesn’t require that the function can be efficiently computed in forward direction; it is sufficient to have an algorithm that efficiently samples random input/output tuples of the function. This is possible for \(g'_f\): for a random z, the tuple \((f(z),\,g(f(z),\,z))\) is a valid input/output pair, as \(g'_f(f(z))=g(f(z),\,f^{-1}(f(z)))=g(f(z),\,z)\). To adapt Hellman’s attack to this setting – where we just have an efficient input/output sampler \(\sigma _f\) for f – replace the \(f(h_i(\cdot ))\)’s in the attack described in the previous section with \(\sigma _f(h_i(\cdot ))\).
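For instance, a minimal sketch of such a sampler for \(g'_f\) (our own code); note that no inversion of f is needed:

```python
import random

def sample_io(f, g, N):
    """Sample a random input/output pair of g'_f(x) = g(x, f^{-1}(x)).

    For random z, the input x = f(z) satisfies f^{-1}(x) = z, hence
    g'_f(f(z)) = g(f(z), z).
    """
    z = random.randrange(N)
    x = f(z)
    return x, g(x, z)
```

In the chain-based attack the sampler’s randomness is derived deterministically, e.g. as \(z=h_j(x_{i-1})\), so that chain steps are reproducible during inversion.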

1.6 Lower Bounds

De, Trevisan and Tulsiani [DTT10] (building on work by Yao [Yao90], Gennaro-Trevisan [GT00] and Wee [Wee05]) prove a lower bound for inverting random permutations, and in particular show that Hellman’s attack as stated in Eq. (1) is optimal: for any oracle-aided algorithm \(\mathcal{A}\), it holds for most permutations \(p:[N]\rightarrow [N]\) that if \(\mathcal{A}\) is given advice (that can arbitrarily depend on p) of size S, makes at most T oracle queries and inverts p on \(\epsilon N\) values, then \(S\cdot T\in \varOmega (\epsilon N)\). Their lower bound proof can easily be adapted to random functions \(f:[N]\rightarrow [N]\), but note that in this case it is no longer tight, i.e., it does not match Eq. (2). Barkan, Biham, and Shamir [BBS06] show a matching \(S^2\cdot T\in \tilde{\varOmega }( N^2)\) lower bound for a restricted class of algorithms.

1.7 Proof Outline

The starting point of our proof is the \(S\cdot T\in \varOmega (\epsilon N)\) lower bound for inverting random permutations by De, Trevisan and Tulsiani [DTT10] mentioned in the previous section. We sketch their simple and elegant proof, with a minor adaptation to make it work for functions rather than permutations, in Appendix A.

The high level idea of their lower bound proof is as follows: Assume an adversary \(\mathcal{A}\) exists, which is given an auxiliary string \(\mathsf{aux}\), makes at most T oracle queries and can invert a random permutation \(p:[N]\rightarrow [N]\) on an \(\epsilon \) fraction of [N] with high probability (\(\mathsf{aux}\) can depend arbitrarily on p). One then shows that given (black box access to) \(\mathcal{A}_{\mathsf{aux}}(\cdot ){\mathop {=}\limits ^{\mathsf{def}}}\mathcal{A}(\mathsf{aux},\cdot )\) it’s possible to “compress” the description of p from \(\log (N!)\) to \(\log (N!)-\varDelta \) bits for some \(\varDelta >0\). As a random permutation is incompressible (formally stated as Fact 1 in Sect. 2 below), the \(\varDelta \) bits we saved must come from the auxiliary string given, so \(S=|\mathsf{aux}|\gtrapprox \varDelta \).

To compress p, one now finds a subset \(G\subset [N]\) where (1) \(\mathcal{A}\) inverts successfully, i.e., for all \(y\in p(G)=\{p(x)\ :\ x\in G\}\) we have \(\mathcal{A}^{p}_\mathsf{aux}(y)=p^{-1}(y)\), and (2) \(\mathcal{A}\) never makes a query in G, i.e., for all \(y\in p(G)\), all oracle queries made by \(\mathcal{A}^{p}_\mathsf{aux}(y)\) are in \([N]-G\) (except for the last query, which we always assume is \(p^{-1}(y)\)).

The compression now exploits the fact that one can learn the mapping \(G\rightarrow p(G)\) given \(\mathsf{aux}\), an encoding of the set p(G), and the remaining mapping \([N]-G \rightarrow p([N]-G)\). During decoding, one recovers \(G\rightarrow p(G)\) by invoking \(\mathcal{A}^p_\mathsf{aux}(\cdot )\) on all values in p(G) (answering all oracle queries using \([N]-G \rightarrow p([N]-G)\); the first query outside \([N]-G\) will be the right preimage by construction).

Thus, we compress by not encoding the mapping \(G\rightarrow p(G)\), which saves \(|G|\log N\) bits; however, we have to pay an extra \(|G|\log (eN/|G|)\) bits to encode the set p(G), so overall we compress by \(|G|\log (|G|/e)\) bits, and therefore \(S\ge |G|\) assuming \(|G|\ge 2e\). Thus the question is how large a set G we can choose. A simple probabilistic argument – basically picking values at random until it’s no longer possible to extend G – shows that we can always pick a G of size at least \(|G|\ge \epsilon N /T\), and we conclude \(S \ge \epsilon N/ T\) assuming \(T\le \epsilon N/2e\).
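Spelled out, the savings are

$$\begin{aligned} |G|\log N-|G|\log \frac{eN}{|G|}=|G|\log \frac{N\cdot |G|}{eN}=|G|\log \frac{|G|}{e}\ , \end{aligned}$$

which is at least |G| bits as soon as \(|G|\ge 2e\).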

In the De et al. proof, the size of the good set G will always be close to \(\epsilon N /T\), no matter how \(\mathcal{A}_\mathsf{aux}\) actually behaves. In this paper we give a more fine-grained analysis, introducing a new parameter \(T_g\), as discussed next.

The \(T_g\) parameter. Informally, our compression algorithm for a function \(g:[N]\rightarrow [N]\) goes as follows: Define the set \(I=\{x\ :\ \mathcal{A}_\mathsf{aux}^{g}(g(x))=x\}\) of values on which \(\mathcal{A}^g_\mathsf{aux}\) inverts g; by assumption \(|I|=\epsilon N\). Now we add values from I to G as long as possible; every time we add a value x, we “spoil” up to T values in I, where we say \(x'\) gets spoiled if \(\mathcal{A}^g_\mathsf{aux}(g(x))\) makes the oracle query \(x'\), so that we will not be able to add \(x'\) to G in the future. As we start with \(|I|=\epsilon N\) and spoil at most T values for every value added to G, we can add at least \(\epsilon N/T\) values to G.

This is a worst-case analysis assuming \(\mathcal{A}^g_\mathsf{aux}\) really spoils close to T values every time we add a value to G; potentially \(\mathcal{A}^g_\mathsf{aux}\) behaves more nicely and spoils fewer values on average. In the proof of Lemma 1 we take advantage of this and extend G as long as possible, ending up with a good set G of size at least \(\epsilon N/2T_g\) for some \(1\le T_g\le T\), where \(T_g\) is the average number of elements spoiled for every element added to G (see the sketch below).
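A minimal sketch of this greedy construction (our own paraphrase, ignoring the bookkeeping of the actual encoding; `invert(y)` stands for running \(\mathcal{A}^g_\mathsf{aux}(y)\) and recording its oracle queries):

```python
def build_good_set(I, g, invert):
    """Greedily build the good set G and measure the spoil rate T_g.

    I is the set of x on which A inverts, i.e. A(g(x)) = x;
    invert(y) returns the list of oracle queries A makes on input y.
    """
    G, spoiled = [], set()
    for x in sorted(I):
        if x in spoiled:
            continue                 # x was spoiled by an earlier invocation
        queries = invert(g(x))       # at most T oracle queries
        G.append(x)
        spoiled.update(q for q in queries if q in I)
    Tg = len(spoiled) / len(G)       # average spoils per element added to G
    return G, Tg                     # |G| >= |I| / T_g, up to constants
```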

This doesn’t help to improve the De et al. lower bound, as in general \(T_g\) can be as large as T, in which case our lower bound \(S\cdot T_g\in \varOmega (\epsilon N)\) coincides with the De et al. \(S\cdot T\in \varOmega (\epsilon N)\) lower bound. But this more fine-grained bound will be a crucial tool in proving the lower bound for \(g_f\).

Lower Bound for \(g_f\). We now outline the proof idea for our lower bound \(S^2\cdot T\in \varOmega (\epsilon ^2 N^2)\) for inverting \(g_f(x)=g(x,\,x'),\,f(x)=\overline{f(x')}\), assuming \(g:[N]\times [N]\rightarrow [N]\) is a random function and \(f:[N]\rightarrow [N]\) is a random permutation. We assume an adversary \(\mathcal{A}^{f,g}_\mathsf{aux}\) exists which has oracle access to \(f,\,g\) and inverts \(g_f:[N]\rightarrow [N]\) on a set \(J=\{y\ :\ g_f(\mathcal{A}^{f,g}_\mathsf{aux}(y))=y\}\) of size \(|J|=\epsilon N\).

If the function table of f is given, \(g_f:[N]\rightarrow [N]\) is a random function that can be efficiently evaluated, and we can prove a lower bound \(S\cdot T_g\in \varOmega (\epsilon N)\) as outlined above.

At this point, we make a case distinction, depending on whether \(T_g\) is below or above \(\sqrt{T}\).

If \(T_g< \sqrt{T}\) our \(S\cdot T_g\in \varOmega (\epsilon N)\) bound becomes \(S^2\cdot T\in \varOmega (\epsilon ^2 N^2)\) and we are done.

The more complicated case is when \(T_g\ge \sqrt{T}\), where we show how to use the existence of \(\mathcal{A}^{f,g}_\mathsf{aux}\) to compress f instead of g. Recall that \(T_g\) is the average number of values that got “spoiled” while running the compression algorithm for \(g_f\); that means, for every value added to the good set G, \(\mathcal{A}_\mathsf{aux}^{f,g}\) made on average \(T_g\) “fresh” queries to \(g_f\). Now making fresh \(g_f\) queries isn’t that easy, as it requires finding pairs \(x,x'\) where \(f(x)=\overline{f(x')}\). We can use \(\mathcal{A}_\mathsf{aux}^{f,g}\) – which makes many such fresh \(g_f\) queries – to “compress” f: when \(\mathcal{A}_\mathsf{aux}^{f,g}\) makes two f queries \(x,\,x'\) where \(f(x)=\overline{f(x')}\), we just need to store the first output f(x), but not the second output \(f(x')\), as we know it is \(\overline{f(x)}\). For decoding we must also store when exactly \(\mathcal{A}_\mathsf{aux}^{f,g}\) makes the f queries x and \(x'\); more on this below.

Every time we invoke \(\mathcal{A}_\mathsf{aux}^{f,g}\) for compression as just outlined, up to T outputs of f may get “spoiled” in the sense that \(\mathcal{A}_\mathsf{aux}^{f,g}\) makes an f query that we need to answer at this point, and thus it’s no longer available to be compressed later.

As \(\mathcal{A}_\mathsf{aux}^{f,g}\) can spoil up to T queries on every invocation, we can hope to invoke it at least \(\epsilon N/T\) times before all the f queries are spoiled. Moreover, on average \(\mathcal{A}_\mathsf{aux}^{f,g}\) makes \(T_g\) fresh \(g_f\) queries, so we can hope to compress around \(T_g\) outputs of f with every invocation of \(\mathcal{A}_\mathsf{aux}^{f,g}\), which would give us around \(T_g\cdot \epsilon N/T\) compressed values. This assumes that a large fraction of the fresh \(g_f\) queries use values of f that were not spoiled in previous invocations. The technical core of our proof is a combinatorial lemma, stated and proved in Sect. 5, which implies that it’s always possible to find a sequence of inputs to \(\mathcal{A}_\mathsf{aux}^{f,g}\) for which this is the case. Concretely, we can always find a sequence of inputs such that at least \(T_g\cdot \epsilon N/32 T\) values can be compressed.

2 Notation and Basic Facts

We use brackets like \((x_1,x_2,\ldots )\) and \(\{x_1,x_2,\ldots \}\) to denote ordered and unordered sets, respectively. We’ll usually refer to unordered sets simply as sets, and to ordered sets as lists. [N] denotes some domain of size N, for notational convenience we assume \(N=2^n\) is a power of two and identify [N] with \(\{0,1\}^n\). For a function \(f:[N] \rightarrow [M]\) and a set \(S\subseteq [N]\) we denote with f(S) the set \(\{f(S[1]),\ldots , f(S[|S|])\}\), similarly for a list \(L\subseteq [N]\) we denote with f(L) the list \((f(L[1]),\ldots ,f(L[|L|]))\). For a set \(\mathcal X\), we denote with \(x\leftarrow \mathcal{X}\) that x is assigned a value chosen uniformly at random from \(\mathcal{X}\).

Fact 1

(from [DTT10]). For any randomized encoding procedure \(\mathsf{Enc}:\{0,1\}^r\times \{0,1\}^n\rightarrow \{0,1\}^m\) and decoding procedure \(\mathsf{Dec}:\{0,1\}^r\times \{0,1\}^m\rightarrow \{0,1\}^n\) where

$$\begin{aligned} \mathop {\Pr }\limits _{x\leftarrow \{0,1\}^n,\,\rho \leftarrow \{0,1\}^r}[\mathsf{Dec}(\rho ,\,\mathsf{Enc}(\rho ,\,x))=x]\ge \delta \end{aligned}$$

we have \(m\ge n-\log (1/\delta )\).

Fact 2

If a set X is at least \(\epsilon \) dense in Y, i.e., \(X\subseteq Y,\, |X|\ge \epsilon |Y|\), and Y is known, then X can be encoded using \(|X|\cdot \log (e/\epsilon )\) bits. To show this we use the inequality \(\binom{n}{\epsilon n} \le (e/\epsilon )^{\epsilon n}\), which implies \(\log \binom{n}{\epsilon n} \le \epsilon n\log (e/\epsilon )\).

3 A Lower Bound for Functions

The following theorem is basically from [DTT10], but stated for functions rather than permutations.

Theorem 1

Fix some \(\epsilon \ge 0\) and an oracle algorithm \(\mathcal{A}\) which on any input makes at most T oracle queries. If for every function \(f:[N]\rightarrow [N]\) there exists a string \(\mathsf{aux}\) of length \(|\mathsf{aux}|=S\) such that

$$\begin{aligned} \mathop {\Pr }\limits _{y\leftarrow [N]}[f(\mathcal{A}^f_\mathsf{aux}(y))=y]\ge \epsilon \end{aligned}$$

then

$$\begin{aligned} T\cdot S \in \varOmega ( \epsilon N)\ . \end{aligned}$$
(4)

The theorem follows from Lemma 1 below using Fact 1 as follows: in Fact 1, let \(\delta =0.9\) and \(n=N\log N\), and think of x as the function table of a function \(f:[N]\rightarrow [N]\). Then \(|\mathsf{Enc}(\rho ,\,\mathsf{aux},\,f)|\ge N\log N -\log (1/0.9)\), which together with the upper bound on the encoding from Eq. (6) implies Eq. (4). Note that the extra assumption \(T\le \epsilon N/40\) in the lemma below doesn’t matter, as the theorem is trivially true if it’s not satisfied. For now the value \(T_g\) in the lemma below is not important, and the reader can just assume \(T_g=T\).

Lemma 1

Let \(\mathcal{A},\,T,\,S,\,\epsilon \) and f be as in Theorem 1, and assume \(T\le \epsilon N/40\). There are randomized encoding and decoding procedures \(\mathsf{Enc},\,\mathsf{Dec}\) such that if \(f:[N]\rightarrow [N]\) is a function and for some \(\mathsf{aux}\) of length \(|\mathsf{aux}|=S\)

$$\begin{aligned} \mathop {\Pr }\limits _{y\leftarrow [N]}[f(\mathcal{A}^f_\mathsf{aux}(y))=y]\ge \epsilon \end{aligned}$$

then

$$\begin{aligned} \mathop {\Pr }\limits _{\rho \leftarrow \{0,1\}^r}[\mathsf{Dec}(\rho ,\mathsf{Enc}(\rho ,\,\mathsf{aux},\,f))=f]\ge 0.9 \end{aligned}$$
(5)

and the length of the encoding is at most

$$\begin{aligned} |\mathsf{Enc}(\rho ,\mathsf{aux},\,f)|\le \underbrace{N\log N}_{=|f|}-\frac{\epsilon N}{2T_g}+S+\log (N) \end{aligned}$$
(6)

for some \(T_g,\,1\le T_g\le T\).

3.1 Proof of Lemma 1

The Encoding and Decoding Algorithms. In Algorithms 1 and 2, we always assume that if \(\mathcal{A}^{f}_\mathsf{aux}(y)\) outputs some value x, it makes the query f(x) at some point. This is basically w.l.o.g., as we can turn any adversary into one satisfying this by making at most one extra query. If at some point \(\mathcal{A}^{f}_\mathsf{aux}(y)\) makes an oracle query x where \(f(x)=y\), then we also assume w.l.o.g. that right after this query \(\mathcal{A}\) outputs x and stops. Note that if \(\mathcal{A}\) is probabilistic, it uses random coins which are given as input to \(\mathsf{Enc}\) and \(\mathsf{Dec}\), so we can make sure the same coins are used during encoding and decoding.

[Algorithms 1 and 2: the encoding procedure \(\mathsf{Enc}\) and the decoding procedure \(\mathsf{Dec}\); not reproduced here.]

The Size of the Encoding. We will now upper bound the size of the encoding of \(\mathsf{aux},\,G,\,f(Q'),\,(|q'_1|,\,\ldots ,|q'_{|G|}|),\,f([N]-\{G^{-1}\cup Q'\})\) as output in line (15) of the \(\mathsf{Enc}\) algorithm.

Let \(T_g:=|B|/|G|\) be the average number of elements we added to the bad set B for every element added to the good set G, then

$$\begin{aligned} |G|\ge \epsilon N /2 T_g\ . \end{aligned}$$
(7)

To see this we note that when we leave the while loop (see line (8) of the algorithm \(\mathsf{Enc}\)) it holds that \(|B|\ge |J|/2 =\epsilon N/2\), so \(|G|=|B|/T_g\ge |J|/2T_g=\epsilon N /2 T_g\).

 

\(G\): Instead of G we will actually encode the set \(\pi ^{-1}(G)=\{c_1,\ldots ,c_{|G|}\}\). From this, the decoder \(\mathsf{Dec}\) (who gets \(\rho \), and thus knows \(\pi \)) can reconstruct \(G=\pi (\pi ^{-1}(G))\). We claim that the elements \(c_1<c_2<\ldots <c_{|G|}\) are whp. at least \(\epsilon /2\) dense in \([c_{|G|}]\) (equivalently, \(c_{|G|}\le 2|G|/\epsilon \)). By Fact 2 we can thus encode \(\pi ^{-1}(G)\) using \(|G|\log (2e/\epsilon )+\log N\) bits (the extra \(\log N\) bits encode the size of G, which the decoder needs in order to parse the encoding). To see that the \(c_i\)’s are \(\epsilon /2\) dense whp., consider line (9) in \(\mathsf{Enc}\), which states \(c:=\min \{c'>c\ :\ y_{c'}\in J\setminus B\}\). If we replaced \(J\setminus B\) with J, the \(c_i\)’s would whp. be close to \(\epsilon \) dense, as J is \(\epsilon \) dense in [N] and the \(y_i\) are uniformly random. As \(|B|< |J|/2\), using \(J\setminus B\) instead of J decreases the density by at most a factor of 2. If we don’t have this density, i.e., \(c_{|G|}> 2|G|/\epsilon \), we consider the encoding to have failed.

\(f(Q')\): This is a list of \(|Q'|\) elements in [N] and can be encoded using \(|Q'|\log N\) bits.

\((|q'_1|,\ldots ,|q'_{|G|}|)\): Requires \(|G|\log T\) bits, as \(|q'_i|\le |q_i|\le T\). A more careful argument (using that the \(|q'_i|\) are on average at most \(T_g\)) requires only \(|G|\log (e T_g)\) bits.

\(f([N]-\{G^{-1}\cup Q'\})\): Requires \((N-|G|-|Q'|)\log N\) bits (using that \(G^{-1}\cap Q'=\emptyset \) and \(|G^{-1}|=|G|\)).

\(\mathsf{aux}\): Is S bits long.

Summing up we get

$$\begin{aligned} |\mathsf{Enc}(\rho ,\mathsf{aux},f)|=|G|\log (2e^2T_g/\epsilon )+(N-|G|)\log N+S+\log N \end{aligned}$$

as by assumption \(T_g\le T\le \epsilon N/40\), we get \(\log N-\log (2e^2T_g/\epsilon ) \ge 1\), and further using (7) we get

$$\begin{aligned} |\mathsf{Enc}(\rho ,\mathsf{aux},f)|\le N\log N -\frac{\epsilon N}{2 T_g}+S+\log N \end{aligned}$$

as claimed.

4 A Lower Bound for \(g(x,f^{-1}(\overline{f(x)}))\)

For a permutation \(f:[N]\rightarrow [N]\) and a function \(g:[N]\times [N]\rightarrow [N]\) we define \(g_f:[N]\rightarrow [N]\) as

$$\begin{aligned} g_f(x)=g(x,x')\text { where }f(x)=\overline{f(x')}\text {, or equivalently, }g_f(x)=g(x,f^{-1}(\overline{f(x)}))\ . \end{aligned}$$

Theorem 2

Fix some \(\epsilon >0\) and an oracle algorithm \(\mathcal{A}\) which makes at most

$$\begin{aligned} T\le (N/4e)^{2/3} \end{aligned}$$
(8)

oracle queries and takes an advice string \(\mathsf{aux}\) of length \(|\mathsf{aux}|=S\). If for every permutation \(f:[N]\rightarrow [N]\) and every function \(g:[N]\times [N]\rightarrow [N]\) there is some \(\mathsf{aux}\) of length \(|\mathsf{aux}|=S\) such that

$$\begin{aligned} \mathop {\Pr }\limits _{y\leftarrow [N]}[g_f(A^{f,g}_\mathsf{aux}(y))=y]\ge \epsilon \end{aligned}$$
(9)

then

$$\begin{aligned} T S^2 \in \varOmega ( \epsilon ^2 N^2)\ . \end{aligned}$$
(10)

The theorem follows from Lemma 2 below, as we prove thereafter.

Lemma 2

Fix some \(\epsilon \ge 0\) and an oracle algorithm \(\mathcal{A}\) which makes at most \(T\le (N/4e)^{2/3}\) oracle queries. There are randomized encoding and decoding procedures \(\mathsf{Enc}_g,\mathsf{Dec}_g\) and \(\mathsf{Enc}_f,\mathsf{Dec}_f\) such that if \(f:[N]\rightarrow [N]\) is a permutation, \(g:[N]\times [N]\rightarrow [N]\) is a function and for some advice string \(\mathsf{aux}\) of length \(|\mathsf{aux}|=S\) we have

$$\begin{aligned} \mathop {\Pr }\limits _{y\leftarrow [N]}[g_f(A^{f,g}_\mathsf{aux}(y))=y]\ge \epsilon \end{aligned}$$

then

$$\begin{aligned} \mathop {\Pr }\limits _{\rho \leftarrow \{0,1\}^r}[\mathsf{Dec}_g(\rho ,f,\mathsf{Enc}_g(\rho ,\mathsf{aux},f,g))=g]\ge 0.9 \end{aligned}$$
(11)
$$\begin{aligned} \mathop {\Pr }\limits _{\rho \leftarrow \{0,1\}^r}[\mathsf{Dec}_f(\rho ,g,\mathsf{Enc}_f(\rho ,\mathsf{aux},f,g))=f]\ge 0.9\ . \end{aligned}$$
(12)

Moreover for every \(\rho ,\mathsf{aux},f,g\) there is a \(T_g, 1\le T_g\le T\), such that

$$\begin{aligned} |\mathsf{Enc}_g(\rho ,\mathsf{aux},f,g)|\le \underbrace{N^2\log N }_{=|g|}-\frac{\epsilon N }{2T_g}+S +\log N \end{aligned}$$
(13)

and if \(T_g\ge \sqrt{T}\)

$$\begin{aligned} |\mathsf{Enc}_f(\rho ,\mathsf{aux},f,g)|\le \underbrace{\log N!}_{=|f|}-\frac{\epsilon NT_g}{64T}+S + \log N\ . \end{aligned}$$
(14)

We first explain how Theorem 2 follows from Lemma 2 using Fact 1.

Proof

(of Theorem 2). The basic idea is a case analysis: if \(T_g<\sqrt{T}\) we compress g, otherwise we compress f. Intuitively, our encoding for g achieving Eq. (13) makes both f and g queries, but only g queries “spoil” g values; as the compression runs until all g values are spoiled, it compresses better the smaller \(T_g\) is. On the other hand, the encoding for f achieving Eq. (14) is derived from our encoding for g, and it manages to compress on the order of \(T_g\) values of f for every invocation (while “spoiling” at most T of the f values), so the larger \(T_g\), the better it compresses f.

Concretely, pick f, g uniformly at random (and assume Eq. (9) holds). By a union bound, for at least a 0.8 fraction of the \(\rho \), Eqs. (11) and (12) hold simultaneously. Consider any such good \(\rho \), which together with f, g fixes some \(T_g,\,1\le T_g\le T\), as in the statement of Lemma 2. Now consider an encoding \(\mathsf{Enc}_{f,g}\), where \(\mathsf{Enc}_{f,g}(\rho ,\mathsf{aux},f,g)\) outputs \((f,\mathsf{Enc}_{g}(\rho ,\mathsf{aux},f,g))\) if \(T_g<\sqrt{T}\), and \((g,\mathsf{Enc}_f(\rho ,\mathsf{aux},f,g))\) otherwise.

  • If \(T_g< \sqrt{T}\) we use (13) to get

    $$\begin{aligned} |\mathsf{Enc}_{f,g}(\rho ,\mathsf{aux},f,g)|=|f|+|\mathsf{Enc}_{g}(\rho ,\mathsf{aux},f,g)| \le |f|+|g|-{\epsilon N }/{2T_g}+S +\log N \end{aligned}$$

    and now using Fact 1 (with \(\delta =0.8\)) we get

    $$\begin{aligned} S\ge {\epsilon N }/{2T_g}-\log N-\log (1/0.8)> {\epsilon N }/{2\sqrt{T}}-\log N-\log (1/0.8) \end{aligned}$$

    and thus \(TS^2\in \varOmega (\epsilon ^2 N^2)\) as claimed in Eq. (10).

  • If \(T_g\ge \sqrt{T}\) then we use Eq. (14) and Fact 1 and again get \(S\ge \epsilon N T_g/64T -\log N -\log (1/0.8)\) which implies Eq. (10) as \(T_g\ge \sqrt{T}\).

   \(\square \)

[Algorithms 3–6: the encoding and decoding procedures \(\mathsf{Enc}_g,\,\mathsf{Dec}_g\) and \(\mathsf{Enc}_f,\,\mathsf{Dec}_f\); not reproduced here.]

Proof

(of Lemma 2).

The Encoding and Decoding Algorithms. The encoding and decoding of g are depicted in Algorithms 3 and 4, and those of f in Algorithms 5 and 6. \(\mathcal{A}_\mathsf{aux}^{f,g}(\cdot )\) can make up to T queries in total to its oracles \(f(\cdot )\) and \(g(\cdot )\). We will assume that whenever a query \(g(x,x')\) is made, the adversary has made the queries f(x) and \(f(x')\) before. This is basically without loss of generality, as we can turn any adversary into one adhering to this by at most tripling the number of queries. It will also be convenient to assume that \(\mathcal{A}_\mathsf{aux}^{f,g}\) only queries g on its restriction to \(g_f\), that is, for all \(g(x,x')\) queries it holds that \(f(x)=\overline{f(x')}\); the proof is easily extended to allow all queries to g, as our encoding will store the function table of g on all “uninteresting” inputs \((x,x'),\,f(x)\ne \overline{f(x')}\) and can thus directly answer any such query.

As in the proof of Lemma 1, we don’t explicitly show the randomness in case \(\mathcal{A}\) is probabilistic.

The Size of the Encodings. We will now upper bound the size of the encodings output by \(\mathsf{Enc}_g\) and \(\mathsf{Enc}_f\) in Algorithms 3 and 5, and hence prove Eqs. (13) and (14).

Equation (13) now follows almost directly from Lemma 1, as our compression algorithm \(\mathsf{Enc}_g\) for \(g:[N]\times [N]\rightarrow [N]\) simply uses \(\mathsf{Enc}\) to compress g restricted to \(g_f:[N]\rightarrow [N]\), and thus compresses by exactly the same amount as \(\mathsf{Enc}\).

It remains to prove an upper bound on the length of the encoding of f by our algorithm \(\mathsf{Enc}_f\), as claimed in Eq. (14). Recall that \(\mathsf{Enc}\) (as used inside \(\mathsf{Enc}_g\)) defines a set G such that for every \(y\in G\), (1) \(\mathcal{A}^{f,g}_\mathsf{aux}(y)\) inverts, i.e., \(g_f(\mathcal{A}^{f,g}_\mathsf{aux}(y))=y\), and (2) \(\mathcal{A}^{f,g}_\mathsf{aux}(y)\) never makes a \(g_f\) query x where \(g_f(x)\in G\). Recall that \(T_g\) in Eq. (13) satisfies \(T_g=\epsilon N/2|G|\) and corresponds to the average number of “fresh” \(g_f\) queries made by \(\mathcal{A}^{f,g}_\mathsf{aux}(\cdot )\) when invoked on the values in G.

\(\mathsf{Enc}_f\) invokes \(\mathcal{A}^{f,g}_\mathsf{aux}(\cdot )\) on a carefully chosen subset \(G_f=(z_1,\ldots ,z_{|G_f|})\) of G (to be defined later). It keeps lists \(L_f,\,C_f\) and \(T_f\) such that after invoking \(\mathcal{A}^{f,g}_\mathsf{aux}(\cdot )\) on \(G_f\), \(L_f\cup C_f\) holds the outputs of all f queries made. Looking ahead, the decoder \(\mathsf{Dec}_f\) will also invoke \(\mathcal{A}^{f,g}_\mathsf{aux}(\cdot )\) on \(G_f\), but will only need \(L_f\) and \(T_f\) (not \(C_f\)) to answer all f queries.

The lists \(L_f,T_f,C_f\) are generated as follows. On the first invocation \(\mathcal{A}^{f,g}_\mathsf{aux}(z_1)\), we observe up to T oracle queries made to g and f. Every g query \((x,x')\) must be preceded by f queries x and \(x'\) where \(f(x)=\overline{f(x')}\). Assume x and \(x'\) are queries number \(t,t'\) (\(1\le t<t'\le T\)). A key observation is that by just storing \((t,t')\) and f(x), \(\mathsf{Dec}_f\) will later be able to reconstruct \(f(x')\): it invokes \(\mathcal{A}^{f,g}_\mathsf{aux}(z_1)\), and when query \(t'\) is made, looks up f(x) in \(L_f\) (its position in \(L_f\) is given by t) and sets \(f(x')=\overline{f(x)}\). Thus, every time a fresh query \(f(x')\) is made, we append it to \(L_f\), unless earlier in this invocation we made a fresh query f(x) where \(f(x')=\overline{f(x)}\); in this case we append the indices \((t,\,t')\) to the list \(T_f\). We also add \(f(x')\) to a list \(C_f\), just to keep track of what we already compressed. \(\mathsf{Enc}_f\) continues this process by invoking \(\mathcal{A}^{f,g}_\mathsf{aux}(\cdot )\) on inputs \(z_2,\,z_3,\ldots ,\,z_{|G_f|} \in G_f\), and finally outputs an encoding of \(G_f\), an encoding of the list \(L_f\) of images of fresh queries, an encoding of the list \(T_f\) of colliding indices, \(\mathsf{aux}\), and all values of f that were neither compressed nor queried.
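A minimal sketch of this bookkeeping for a single invocation (our own illustration; `queries(z)` stands for the sequence of f queries that \(\mathcal{A}_\mathsf{aux}^{f,g}(z)\) makes, in order):

```python
def process_invocation(queries, f, z, known, L_f, T_f, C_f, N):
    """Record the f queries of one invocation of A on input z.

    known holds f values answered in earlier invocations ("spoiled").
    When two fresh queries x, x' with f(x) = ~f(x') collide, only f(x)
    goes to L_f; instead of f(x') we store the index pair (t, t') in
    T_f, and put f(x') in C_f to track what was compressed.
    """
    mask = N - 1
    fresh = {}                          # image -> index t of a fresh query
    for t, x in enumerate(queries(z)):
        if x in known:
            continue                    # answered from an earlier invocation
        y = f(x)
        known[x] = y
        if (y ^ mask) in fresh:         # second half of a colliding pair
            T_f.append((fresh[y ^ mask], t))
            C_f.append(y)               # compressed: recoverable as ~f(x)
        else:
            fresh[y] = t
            L_f.append(y)               # stored explicitly
```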

In the sequel we show how to choose \(G_f \subseteq G\) such that \(|G_f| \ge \epsilon N/8T\); hence it can be encoded using \(|G_f|\log N + \log N\) bits, where the extra \(\log N\) encodes \(|G_f|\). We also show that \(|T_f| \ge |G_f|\cdot T_g/4\), and furthermore that we can compress at least one bit per element of \(T_f\). Putting things together we get

$$\begin{aligned} |\mathsf{Enc}_f(\rho ,\mathsf{aux},f,g)|\le \log N! - |G_f|(T_g/4 - \log N) + S +\log N . \end{aligned}$$

And if \(\log N \le T_g/8\), we get Eq. (14)

$$\begin{aligned} |\mathsf{Enc}_f(\rho ,\mathsf{aux},f,g)|\le \log N! - \epsilon NT_g/64T +S+ \log N . \end{aligned}$$

Given G with \(|G|\ge \epsilon N/2T_g\), the subset \(G_f\) can be constructed by carefully applying Lemma 3, which we prove in Sect. 5. Let \((X_1, \ldots , X_{|G|})\) and \((Y_1,\ldots , Y_{|G|})\) be two sequences of sets with \(Y_i\subseteq X_i \subseteq [N]\) and \(|X_i|\le T\), where \(Y_i\) and \(X_i\) correspond to the g and f queries, respectively, in the |G| consecutive executions of \(\mathcal{A}^{f,g}_\mathsf{aux}(\cdot )\) on G. Given such sequences, Lemma 3 constructs a subsequence of executions \(G_f\subseteq G\) whose corresponding g queries \((Y_{i_1}, \ldots , Y_{i_{|G_f|}})\) are fresh. As a g query is preceded by two f queries, such a subsequence induces a sequence \((Z_{i_1}, \ldots , Z_{i_{|G_f|}})\) of queries that are not only fresh for g but also fresh for f. Furthermore, such a sequence covers at least \(y\cdot |I|/16T\) fresh elements, where \(y=|I|/|G|\) is the average coverage of the \(Y_i\)’s and \(I\subseteq [N]\) is their total coverage.

However, Lemma 3 considers a g query \((x,x')\in Y_i\) to be fresh if either \(x\notin \cup _{j=1}^{i-1}X_j\) or \(x'\notin \cup _{j=1}^{i-1}X_j\), i.e., if at least one of \(x,x'\) is fresh in the ith execution, then the pair is considered fresh. For compressing f, both \(x,x'\) need to be fresh. To enforce this while applying Lemma 3 directly, we apply the lemma to augmented sets \(X_1, \ldots ,X_{|G|}\) such that whenever \(X_i,Y_i\) are selected, the corresponding \(Z_i\) contains exactly \(|Z_i|/2\) pairs of queries that are fresh for both g and f. We augment \(X_i\) as follows: for every \(X_i\) and every f query x made in the ith step, add \(f^{-1}(\overline{f(x)})\) to \(X_i\). This results in sets with \(|X_i|\le 2T\), as originally \(|X_i|\le T\).

Applying Lemma 3 to \(Y_1, \ldots , Y_{|G|}\) and such augmented sets \(X_1,\ldots , X_{|G|}\) yields \(G_f\) such that the total number of fresh colliding queries is at least

$$\begin{aligned} y\cdot \frac{|I|}{16\cdot 2T} =\frac{\epsilon N}{|G|}\cdot \frac{\epsilon N}{32T} = \frac{\epsilon NT_g}{16T}\ . \end{aligned}$$

Therefore the total number of fresh colliding pairs, or equivalently \(|T_f|\), is at least \(\epsilon NT_g/32T\), as claimed. Furthermore, Lemma 3 guarantees that \(|G_f|\ge \epsilon N/8T\).

What remains to show is that for each colliding pair in \(T_f\) we compress by at least one bit. Recall that the list \(T_f\) has exactly as many entries as \(C_f\); however, entries in \(T_f\) are colliding pairs of indices \((t,t')\), while entries in \(C_f\) are images of size \(\log N\). We compress for an entry \((t,t')\) of \(T_f\) if the encoding size of \((t,t')\) is strictly less than \(\log N\). Encoding each entry \((t,t')\) naively as two indices costs \(2\log T\) bits and would save a bit per element of \(T_f\) only if \(T\le \sqrt{N/2}\); instead, we encode the set of colliding pairs among all possible query pairs. Concretely, for each \(z\in G_f\) we obtain a set of at least \(T_g/4\) colliding index pairs, which we encode as a subset of all possible pairs, of which there are at most \(T^2\), using

$$\begin{aligned} \log \binom{T^2}{T_g/4} \le \frac{T_g}{4}\log \frac{4eT^2}{T_g} \end{aligned}$$

bits. Given that \(T_g\ge \sqrt{T}\) and \(T \le (N/4e)^{2/3}\), we have \(\log N - \log (4eT^2/T_g) \ge 1\), and therefore we compress by at least one bit for each pair, i.e., for each element in \(T_f\), which concludes the proof.    \(\square \)

5 A Combinatorial Lemma

In this section we state and prove a lemma which can be cast in terms of the following game between Alice and Bob. For some integers n, N, M, Alice chooses a partition \((Y_1,\ldots ,Y_n)\) of \(I\subseteq [N]\), and for every \(Y_i\) also a superset \(X_i\supseteq Y_i\) of size \(|X_i|\le M\). The goal of Bob is to find a subsequence \(1\le b_1<b_2<\ldots <b_\ell \le n\) such that \(Y_{b_1},Y_{b_2},\ldots ,Y_{b_\ell }\) contains as many “fresh” elements as possible, where after picking \(Y_{b_i}\) the elements \(\bigcup _{k=1}^{i}X_{b_k}\) are no longer fresh, i.e., picking \(Y_{b_i}\) “spoils” all of \(X_{b_i}\). How many fresh elements can Bob expect to hit in the worst case? Intuitively, as every \(Y_{b_i}\) added spoils up to M elements, he can hope to pick up to \(\ell \approx |I|/M\) of the \(Y_i\)’s before most of the elements are spoiled. As the \(Y_i\) are on average of size \(y:=|I|/n\), this is also an upper bound on the number of fresh elements he can hope to get with every step. This gives on the order of \(y\cdot (|I|/M)\) fresh elements in total. By the lemma below, a subsequence containing about that many fresh elements always exists.
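The proof below is constructive; a minimal sketch of the greedy selection it uses (our own code, with the \(Y_i,X_i\) given as Python sets):

```python
def greedy_subsequence(Ys, Xs, I):
    """Greedily pick indices b_1 < b_2 < ... as in the proof of Lemma 3."""
    y_avg = len(I) / len(Ys)
    picked, spoiled, fresh_total = [], set(), 0
    for i, (Y, X) in enumerate(zip(Ys, Xs)):
        if len(Y) < y_avg / 2:
            continue                 # only the large Y_i's are considered
        Z = Y - spoiled              # the fresh elements this Y_i would add
        if len(Z) >= len(Y) / 2:     # keep Y_i only if half of it is fresh
            picked.append(i)
            fresh_total += len(Z)
            spoiled |= X             # picking Y_i spoils all of X_i
    return picked, fresh_total       # fresh_total >= y*|I|/16M by Lemma 3
```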

Lemma 3

For \(M,N\in \mathbb {N}, M\le N\) and any disjoint sets \(Y_1,\ldots , Y_n\subset [N]\)

$$\begin{aligned} \bigcup _{i=1}^n Y_i=I,\quad \forall i\ne j:Y_i\cap Y_j=\emptyset \end{aligned}$$

and supersets \((X_1,\ldots ,X_n)\) where

$$\begin{aligned} \forall i\in [n]\ :\ Y_i \subseteq X_i\subseteq [N],\quad |X_i|\le M \end{aligned}$$

there exists a subsequence \(1\le b_1<b_2<\ldots < b_\ell \le n\) such that the sets

$$\begin{aligned} Z_{b_j}=Y_{b_j} \setminus \cup _{k< j} X_{b_k} \end{aligned}$$
(15)

have total size

$$\begin{aligned} \sum _{j=1}^\ell |Z_{b_j}|=|\bigcup _{j=1}^\ell Z_{b_j}|\ge y\cdot \frac{|I|}{16M} \end{aligned}$$

where \(y=|I|/n\) denotes the average size of the \(Y_i\)’s.

Proof

Let \((Y_{a_1}, \ldots , Y_{a_m})\) be the subsequence of \((Y_1,\ldots , Y_n)\) containing all the sets of size at least \(y/2\). By a Markov bound, these large \(Y_{a_i}\)’s cover at least half of the domain I, i.e.

$$\begin{aligned} \left| \cup _{i\in [m]} Y_{a_i}\right| > |I|/2 . \end{aligned}$$
(16)

We now choose the subsequence \((Y_{b_1}, \ldots , Y_{b_\ell })\) from the statement of the lemma as a subsequence of \((Y_{a_1}, \ldots , Y_{a_m})\) in a greedy way: for \(i=1,\ldots ,m\) we add \(Y_{a_i}\) to the sequence if it contributes many “fresh” elements. Concretely, assume we are in step i and have so far added \(Y_{b_1},\ldots ,Y_{b_{j-1}}\); then we pick the next element, i.e., \(Y_{b_j}:=Y_{a_i}\), if the fresh elements \(Z_{b_j}=Y_{b_j} \setminus \cup _{k< j} X_{b_k}\) contributed by \(Y_{b_j}\) satisfy \(|Z_{b_j}|\ge |Y_{b_j}|/2\).

We claim that we can always add at least one more \(Y_{b_j}\) as long as we have so far added fewer than \(|I|/4M\) sets, i.e., as long as \(j< |I|/4M\). Note that this proves the lemma, as

$$\begin{aligned} \sum _{j=1}^\ell |Z_{b_j}| \ge \sum _{j=1}^\ell |Y_{b_j}|/2 \ge \ell y/4 \ge |I|/4M\cdot y/4 = y|I|/16M\ . \end{aligned}$$

It remains to prove the claim. For contradiction, assume our greedy algorithm picked \((Y_{b_1}, \ldots , Y_{b_\ell })\) with \(\ell <|I|/4M\). We’ll show that there is a \(Y_{a_t}\) (with \(a_t>b_\ell \)) with

$$\begin{aligned} |Y_{a_t} \setminus \cup _{i=1}^{\ell } X_{b_i}| \ge |Y_{a_t}|/2 \end{aligned}$$

which is a contradiction as this means the sequence could be extended by \(Y_{b_{\ell +1}}=Y_{a_t}\). We have

$$\begin{aligned} |\cup _{i=1}^{\ell } X_{b_i}| \le |I|/4M \cdot M=|I|/4 . \end{aligned}$$

This together with (16) implies

$$\begin{aligned} |\cup _{i\in [m]} Y_{a_i}\setminus \cup _{i=1}^{\ell } X_{b_i} |> |\cup _{i\in [m]} Y_{a_i}|/2 . \end{aligned}$$

By Markov there must exist some \(Y_{a_t}\) with

$$\begin{aligned} |Y_{a_t}\setminus \cup _{i=1}^{\ell } X_{b_i}|\ge |Y_{a_t}|/2 \end{aligned}$$

as claimed.    \(\square \)

6 Conclusions

In this work we showed that existing time-memory trade-offs for inverting functions can be overcome if one relaxes the requirement that the function be efficiently computable, and instead just asks for its function table to be computable in (quasi)linear time. We showed that such functions have interesting applications to constructing proofs of space. The ideas we introduced can potentially also be used for related problems, like memory-bound or memory-hard functions.