
1 Introduction

Memory-hard functions are a fast-emerging trend and a popular remedy against hardware-equipped adversaries in various applications: cryptocurrencies, password hashing, key derivation, and more generic proof-of-work constructions. They were motivated by the rise of various attack techniques, which can be commonly described as optimized exhaustive search. In cryptocurrencies, the hardware arms race has made Bitcoin mining [29] on regular desktops tremendously inefficient, as the best mining rigs spend 30,000 times less energy per hash than x86 desktops/laptopsFootnote 1. This causes major centralization of the mining efforts, which goes against the democratic philosophy behind the Bitcoin design. This in turn prevents wide adoption and use of such cryptocurrencies in the economy, limiting the current activities in this area to mining and hoarding, with negative effects on the price. Restoring the ability of CPU or GPU mining by the use of memory-hard proof-of-work functions may have a dramatic effect on cryptocurrency adoption and use in the economy, for example as a form of decentralized micropayments [15]. In password hashing, numerous leaks of hash databases triggered the wide use of GPUs [3, 34] and FPGAs [27] for password cracking with a dictionary. In this context, constructions that intensively use a lot of memory seem to be a countermeasure. The reasons are that memory operations have very high latency on GPUs and that memory chips are quite large and thus expensive in FPGA and ASIC environments compared to a logic core, which computes, e.g., a regular hash function.

Memory-intensive schemes, which bound the memory bandwidth only, were suggested earlier by Burrows et al. [8] and Dwork et al. [17] in the context of spam countermeasures. It was quickly realized that to be a real countermeasure, the amount of memory must also be bounded [18], so that memory cannot be easily traded for computations, time, or other resources that are cheaper on certain architectures. Schemes that are resilient to such tradeoffs are called memory-hard [21, 30]. In fact, the constructions in [18] are so strong that even a tiny memory reduction results in a huge computational penalty.

Disadvantage of Classical Constructions and New Schemes. The provably tradeoff-resilient superconcentrators [32] and their applications in [18, 19] have serious performance problems. They are terribly slow for modern memory sizes. A superconcentrator requiring N blocks of memory makes \(O(N\log N)\) calls to F. As a result, filling, e.g., 1 GB of RAM with 256-bit blocks would require dozens of calls to F per block (\(C\log N\) calls for some constant C). This would take several minutes even with lightweight F and is thus intolerable for most applications like web authentication or cryptocurrencies. Using less memory, e.g., several megabytes, does not effectively prohibit hardware adversaries.

It has thus been an open challenge to construct a reasonably fast and tradeoff-resilient scheme. Since the seminal paper by Dwork et al. [18], the first important step was made by Percival, who suggested scrypt [30]. The idea of scrypt was quite simple: fill the memory by an iterative hash function and then make a pseudo-random walk on the blocks using the block value as an address for the next step. However, the entire design is somewhat sophisticated, as it employs a stack of subfunctions and a number of different crypto primitives. Under certain assumptions, Percival proved that the time-memory product is lower bounded by some constant. The scrypt function is used inside the cryptocurrency Litecoin [4] with a 128 KB memory parameter and has been adopted as an IETF standard for key derivation [5]. scrypt is a notable example of data-dependent schemes, where the memory access pattern depends on the input; this property enabled Percival to prove some lower bound on the adversary's costs. However, the performance and/or the tradeoff resilience of scrypt are apparently not sufficient to discourage hardware mining: the Litecoin ASIC miners are more efficient than CPU miners by a factor of 100 [1].

The need for even faster, simpler, and possibly more tradeoff-resilient constructions was further emphasized by the ongoing Password Hashing Competition [2], which has recently selected 9 finalists out of the 24 original submissions. Notable entries are Catena [20], just presented at Asiacrypt 2014 with a security proof based on [26], and yescrypt and Lyra2 [25], which both claim performance up to 1 GB/sec and which were quickly adopted within a cryptocurrency proof-of-work [7]. The tradeoff resilience of these constructions has not been challenged so far. It is also unclear how possible tradeoffs would translate to the cost of attacks.

Our Contributions. We present a rigorous approach and a reference model to estimate the amortized costs of password brute-force on special hardware using full-memory algorithms or time-space tradeoffs. We show how to evaluate the adversary’s gains in terms of area-time and time-memory products via computational complexity and latency of the algorithm.

Then we present our tradeoff attacks on the latest versions of Catena and yescrypt, and on the original version of Lyra2. We then generalize them to wide classes of data-dependent and data-independent schemes. For Catena we analyze the faster Dragonfly mode and show that the original security proof for it is flawed and that the computation-memory product can be kept constant while reducing the memory. For ASIC-equipped adversaries we show how to reduce the area-time product (abbreviated further by AT) by the factor of 25 under reasonable assumptions on the architecture. The attack algorithm is then generalized for a wide class of data-independent schemes as a precomputation method.

Then we consider data-dependent schemes and present the first generic tradeoff strategy for them, which we call the ranking method. Our method easily applies to yescrypt and then to the second phase of Lyra2, both taken with minimally secure time parameters. We further exploit the incomplete diffusion in the core primitives of these designs, which reduces the time-memory and time-area products for both designs.

Altogether, we show how to decrease the time-memory product by the factor of 2 for yescrypt and the factor of 8 for Lyra2. Our results are summarized in Table 1. To the best of our knowledge, our methods are the first generic attacks so far on data-dependent or data-independent schemesFootnote 2.

Table 1. Our tradeoff gains on Catena, yescrypt and Lyra2 with minimal secure parameters, \(2^{30}\) memory bytes and reference hardware implementations (Sect. 2). TM loss is the maximal factor by which we can reduce the time-memory product compared to the full-memory implementation. AT loss is the maximal factor for time-area product reduction. Compactness of TM and AT is the maximal memory reduction factor which does not increase the TM or AT, resp., compared to the default implementation.

Related Work. So far there have been only a few attempts to develop tradeoff attacks on memory-hard functions. A simple tradeoff for scrypt has been known in folklore and was recently formalized in [20]. Alwen and Serbinenko analyzed a simplified version of Catena in [9]. Designers of Lyra2 and Catena attempted to attack their own designs in the original submissions [20, 25]. Simple analysis of Catena has been made in [16].

Paper Outline. We introduce necessary definitions and metrics in Sect. 2. We attack Catena-Dragonfly in Sect. 3 and generalize this method in Sect. 4. Then we present a generic ranking algorithm for data-dependent schemes in Sect. 5 and attack yescrypt with this method in Sect. 6. The attack on Lyra2 is quite sophisticated and we leave it for Appendix A.

2 Preliminaries

2.1 Syntax

Let \(\mathcal {G}\) be a hash function that takes a fixed-length string I as input and outputs tag H. We consider functions that iteratively fill and overwrite memory blocks \(X[1],X[2],\ldots , X[M]\) using a compression function F:

$$\begin{aligned} X[i_j]&= f_j(I),\; 1\le j\le s;\end{aligned}$$
(1)
$$\begin{aligned} X[i_j]&= F(X[\phi _1(j)], X[\phi _2(j)], \ldots , X[\phi _k(j)]),\;s< j\le T, \end{aligned}$$
(2)

where \(\phi _i\) are some indexing functions referring to some already filled blocks and \(f_j\) are auxiliary hash functions (similar to F) filling the initial s blocks for some positive s.

We say that the function makes p passes over the memory, if \(T = pM\). Usually p and M are tunable parameters which are responsible for the total running time and the memory requirements, respectively.

2.2 Time-Space Tradeoff

Let \(\mathcal {A}\) be an algorithm that computes \(\mathcal {G}\). The computational complexity \(C(\mathcal {A})\) is the total number of calls to F and \(f_i\) by \(\mathcal {A}\), averaged over all inputs to \(\mathcal {G}\). We do not consider possible complexity amortization over successive calls to \(\mathcal {A}\). The space complexity \(S(\mathcal {A})\) is the peak number of blocks (or their equivalents) stored by \(\mathcal {A}\), again averaged over all inputs to \(\mathcal {G}\). Suppose that \(\mathcal {A}\) can be represented as a directed acyclic graph with vertices being calls to F. Then the latency \(L(\mathcal {A})\) is the length of the longest chain in the graph from the input to the output. Therefore, \(L(\mathcal {A})\) is the minimum time needed to run \(\mathcal {A}\) assuming unlimited parallelism and instant memory access.

A straightforward implementation of the scheme (1) results in an algorithm with computational complexity T, latency \(L=T\), and space complexity M. However, it might be possible to compute \(\mathcal {G}\) using less memory. According to [24], any function that is described by Eq. (1) and whose reference block indices \(\phi _i(j)\) are known in advance can be computed using \(c_k\frac{T}{\log T}\) memory blocks for some constant \(c_k\) depending on the number k of input blocks for F. Therefore, any p-pass function can be computed using less than \(M=T/p\) memory for sufficiently large M.

Let us fix some default algorithm \(\mathcal {A}\) for \(\mathcal {G}\) with \((C_1,M_1,L_1)\) being the computational complexity, space complexity, and latency of \(\mathcal {A}\), respectively. Suppose that there is a time-space tradeoff given by a family of algorithmsFootnote 3 \(\mathcal {B}= \{B_q\}\) that compute \(\mathcal {G}\) using \(\frac{M_1}{q}\) space for different q. The idea is to store only one of q memory blocks on average and recompute the missing blocks whenever they are needed. Then we define the computational penalty \(CP_{\mathcal {B}}(q)\) as

$$ CP_{\mathcal {B}}(q) = \frac{C(B_q)}{C_1} $$

and the latency penalty \(LP_{\mathcal {B}}(q)\) as

$$ LP_{\mathcal {B}}(q) = \frac{L(B_q)}{L_1}. $$

2.3 Attackers and Cost Estimates

We consider the following attack. Suppose that \(\mathcal {G}\) with time and memory parameters (T, M) is used as a password hashing function with \(I=(P,S)\), where P is a secret password and S is a public salt. An attacker obtains H and S (e.g., from a database leak) and tries to recover P. He attempts a dictionary attack: given a list L of the most probable passwords, he runs \(\mathcal {G}\) on every \(P\in L\) and checks the output.

Definition 1

Let \(\varPhi \) be a cost function defined over a space of algorithms. Let also \(\mathcal {G}_{T,M}\) be a hash function with a fixed default algorithm \(\mathcal {A}\). Then \(\mathcal {G}_{T,M}\) is called \((\alpha ,\varPhi )\)-secure if for every algorithm \(\mathcal {B}\) for \(\mathcal {G}_{T,M}\)

$$ \varPhi (\mathcal {B})> \alpha \varPhi (\mathcal {A}). $$

In other words, \(\mathcal {G}_{T,M}\) cannot be computed more cheaply than by the factor of \(\frac{1}{\alpha }\).

The cost function is more difficult to determine. We suggest evaluating amortized computing costs for a single password trial. Depending on the architecture, the costs vary significantly for the same algorithm \(\mathcal {A}\). For the ASIC-equipped attackers, who can use parallel computing cores, it is widely suggested that the costs can be approximated by the time-area product \(\mathrm {AT}\) [9, 11, 28, 35]. Here T is the time complexity of the used algorithm and A is the sum of areas needed to implement the memory cells and the area needed to implement the cores. Let the area needed to implement one block of memory be the unit of area measurement. Then in order to know the total area, we need core-memory ratio \(R_c\), which is how many memory blocks we can place on the area taken by one core.

Suppose that the adversary runs algorithm \(B_q\) using M / q memory and l computing cores, thus having computational complexity \(C_q = C(B_q)\). The running time is lower bounded by the latency \(L_q = L(B_q)\) of the algorithm. If \(L_q<C_q/l\), i.e. the computing cores can not finish the work in minimum time, then the time T can be approximated by \(C_q/l\), and the costs are estimated as follows:

$$ \mathrm {AT}_{B_q}(l) = \left( lR_c+\frac{M}{q}\right) \frac{C_q}{l} = C_q(R_c + \frac{M}{ql}) $$

We see that the costs drop as l increases. Therefore, the adversary would be motivated to push it to the maximum limit \(C_q/L_q\). Thus we obtain the final approximation of costs:

$$\begin{aligned} \mathrm {AT}_{B_q} = C_qR_c + L_q\frac{M}{q}. \end{aligned}$$
(3)

Here we assume unlimited memory bandwidth. Taking bandwidth restrictions into account is even more difficult, as they depend on the relative frequency of the computing core and the memory, as well as on the architecture of the memory bus. Moreover, the memory bandwidth of the algorithm depends on the implementation and is not easy to evaluate. We leave rigorous memory bandwidth evaluation and restrictions for future work.
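The cost model of Eq. (3) can be sketched in a few lines; a minimal model with the unit of area equal to one memory block (the function and parameter names are ours, not from a reference implementation):

```python
def at_cost(C_q: float, L_q: float, M: float, q: float, R_c: float) -> float:
    """AT cost of a tradeoff algorithm B_q, Eq. (3).

    With l pushed to its limit l = C_q / L_q, the running time equals the
    latency L_q, and AT = (l*R_c + M/q) * L_q = C_q*R_c + L_q*M/q.
    """
    return C_q * R_c + L_q * M / q
```

For example, doubling the computational complexity C_q only adds core area, while halving the memory divides the second (memory) term by two.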

We recall that the value \(R_c\) depends on the architecture, the function F, and the block size. To give a concrete example, suppose that the block is 64 bytes and F is the Blake-512 hash function. We use the following reference implementationsFootnote 4:

  • The 50-nm DRAM [22], which takes 550 mm\({}^2\) per GByte;

  • The 65-nm Blake-512 [23], which takes about 0.1 mm\({}^2\).

Then the core-memory ratio is \(\frac{2^{24} \cdot 0.1}{550} \approx 3000\). For more lightweight hash functions this ratio will be smaller.
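The arithmetic behind this estimate, assuming 64-byte blocks, 550 mm² per GByte of DRAM, and 0.1 mm² per Blake-512 core:

```python
# Worked check of the core-memory ratio R_c (variable names are ours).
blocks_per_gib = 2**30 // 64            # 2^24 64-byte blocks in 1 GByte
area_per_block = 550 / blocks_per_gib   # mm^2 of DRAM per 64-byte block
core_area = 0.1                         # mm^2 per Blake-512 core
R_c = core_area / area_per_block        # blocks fitting in one core's area
```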

The actual functions F in the designs that we attack are often ad hoc and have not been implemented in hardware yet. Moreover, the numbers may change when going to a smaller feature size. To make our estimates of the attack costs architecture-independent, we introduce a simpler metric, the time-memory product \(\mathrm {TM}\):

$$\begin{aligned} \mathrm {TM}_{B_q} = L_q\frac{M}{q}, \end{aligned}$$
(4)

which for not too high computational penalties gives a good approximation of \(\mathrm {AT}\).

In our tradeoff attacks, we are mainly interested in comparing the AT and TM costs of \(B_q\) with those of the default algorithm \(\mathcal {A}\). Thus we define the AT ratio of \(B_q\) as \(\frac{\mathrm {AT}_{B_q}}{\mathrm {AT}_{\mathcal {A}}}\) and the TM ratio of \(B_q\) as \(\frac{\mathrm {TM}_{B_q}}{\mathrm {TM}_{\mathcal {A}}}\).

We note that for the same \(\mathrm {TM}\) value the implementation with less memory is preferable, as its design and production will be cheaper. Thus we explore how much the memory can be reduced keeping the AT or TM costs below those of the default algorithm.

Definition 2

Tradeoff algorithms \(\mathcal {B}\) have AT compactness q if q is the maximal value such that

$$ \mathrm {AT}_{B_q} \le \mathrm {AT}_{\mathcal {A}}. $$

Tradeoff algorithms \(\mathcal {B}\) have TM compactness q if q is the maximal value such that

$$ \mathrm {TM}_{B_q} \le \mathrm {TM}_{\mathcal {A}}. $$

For the concrete schemes we take "minimally secure" values of T, i.e. those that are supposed to provide \((\alpha ,\varPhi )\)-security for reasonably high \(\alpha \). Unfortunately, no explicit security claim of this kind is present in the design documents of the functions we consider.

Data-Dependent and Data-Independent Schemes. The existing schemes can be categorized according to the way they access memory. The data-independent schemes Catena [20], Pomelo [36], and Argon2i [13] compute \(\phi (j)\) independently of the actual password in order to avoid timing attacks like in [33]. Then an algorithm \(\mathcal {B}\) that uses less memory can recompute the missing blocks just by the time they are requested. Therefore, it has the same latency as the full-memory algorithm, i.e. \(L(\mathcal {B}) = L_1\). For these algorithms the time-memory product can be arbitrarily small, and the minimum \(\mathrm {AT}\) value is determined by the core-memory ratio.

The data-dependent schemes scrypt [30], yescrypt [31], and Argon2d [13] compute \(\phi (j)\) using the just computed block: \( \phi (j) = \phi (j,X[i_{j-1}])\). Then precomputation is impossible, and for each recomputed block the latency is increased by the latency of the recomputation algorithm, so \(L_q>L_1\). There also exist hybrid schemes [25], which first run a data-independent phase and then a data-dependent one.

3 Cryptanalysis of Catena-Dragonfly

3.1 Description

Short History. Catena was first published on ePrint [20] and then submitted to the Password Hashing Competition. Eventually the paper was accepted to Asiacrypt 2014 [21]. In the middle of the reviewing process, we discovered and communicated the first attack on Catena to the authors. The authors have introduced a new mode for Catena in the camera-ready version of the Asiacrypt paper, which is resistant to the first attack. The final version of Catena, which is the finalist of the Password Hashing Competition, contains two modes: Catena-Dragonfly (which we abbreviate to Catena-D), which is an extension to the original Catena, and Catena-Butterfly, which is a new mode advertised as tradeoff-resistant. In this paper we present the attack on Catena-Dragonfly, which is very similar to the first attack on Catena.

Specification. Catena-D is essentially a mode of operation over the hash function F, which is instantiated by Blake2b [10] in the full or reduced-round version. The functional graph of Catena-D is determined by the time parameter \(\lambda \) (values \(\lambda =1,2\) are recommended) and the memory parameter n, and can be viewed as a \((\lambda +1)\)-layer graph with \(2^n\) vertices in each layer (denoted by Catena-D-\(\lambda \)). We denote the X-th vertex in layer l (both counted from 0) by \([X]^l\). With each vertex we associate the corresponding output of the hash function F and denote it by \([X]^l\) as well. The outputs are stored in the memory, and due to the memory access pattern it is sufficient to store only \(2^n\) blocks at each moment. The hash function F has a 512-bit output, so the total memory requirements are \(2^{n+6}\) bytes.

The first layer is filled as follows:

  • \([0]^0 = G_1(P,S)\), where \(G_1\) invokes 3 calls to F;

  • \([1]^0 = G_2(P,S)\), where \(G_2\) invokes 3 calls to F;

  • \([i]^0 \leftarrow F([{i-1}]^0,[{i-2}]^0),\; 2\le i \le 2^n-1\).

Then \(2^{3n/4}\) nodes of the first layer are modified by function \(\varGamma \). The details of \(\varGamma \) are irrelevant to our attack.

The memory access pattern at the next layers is determined by the bit-reversal permutation \(\nu \). Each index is viewed as an n-bit string and is transformed as follows:

$$ \nu ({x_1 x_2\ldots x_n}) = x_n x_{n-1}\ldots x_1,\; \text {where }x_i \in \{0,1\}. $$

The layers are then computed as

  • \([0]^j = F([0]^{j-1}\,||\,[{2^n-1}]^{j-1})\);

  • \([i]^j = F([{i-1}]^j\,||\,[{\nu ({i})}]^{j-1})\).

Thus to compute \([X]^l\) we need \([\nu ({X})]^{l-1}\). The latter can then be overwrittenFootnote 5. An example of Catena-D with \(\lambda =2\) and \(n=3\) is shown in Fig. 1.

The bit-reversal permutation is supposed to provide memory-hardness. The intuition is that it maps any segment to a set of blocks that are evenly distributed at the upper layer.
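For intuition, the functional graph can be modeled in a few lines; a toy sketch only, not the real design: \(\varGamma \) is omitted, and \(G_1\), \(G_2\), and F are all replaced by a single SHA-512 call instead of Blake2b. Two arrays are used for clarity, although (per footnote 5) one array of \(2^n\) blocks suffices:

```python
import hashlib

def nu(x: int, n: int) -> int:
    """Bit-reversal permutation on n-bit indices; nu is an involution."""
    return int(format(x, f"0{n}b")[::-1], 2)

def F(*blocks: bytes) -> bytes:
    # Stand-in for Blake2b-512 in this sketch.
    return hashlib.sha512(b"".join(blocks)).digest()

def catena_d(password: bytes, salt: bytes, n: int, lam: int) -> bytes:
    """Toy Catena-D-lambda functional graph (Gamma omitted)."""
    N = 1 << n
    prev = [b""] * N
    prev[0] = F(b"G1", password, salt)   # stand-in for G1 (3 calls to F)
    prev[1] = F(b"G2", password, salt)   # stand-in for G2
    for i in range(2, N):
        prev[i] = F(prev[i - 1], prev[i - 2])
    for _ in range(lam):                 # layers 1..lambda
        cur = [b""] * N
        cur[0] = F(prev[0], prev[N - 1])
        for i in range(1, N):
            cur[i] = F(cur[i - 1], prev[nu(i, n)])
        prev = cur
    return prev[N - 1]
```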

Fig. 1. Catena-D-2 with \(n=3\): 3 layers, 8 vertices per layer.

Original Tradeoff Analysis. The authors of Catena-D originally provided two types of security bounds against tradeoff attacks. Recall that Catena-D-\(\lambda \) can be computed with \(\lambda 2^n\) calls to F using \(2^n\) memory blocks. The Catena-D designers demonstrated that Catena-D-\(\lambda \) can be computed using \(\lambda S\) memory blocks with time complexityFootnote 6

$$ T \le 2^n + 2^n\left( \frac{2^n}{2S}\right) ^{\lambda -1} + 2^n\left( \frac{2^n}{2S}\right) ^{\lambda } $$

Therefore, if we reduce the memory by the factor of q, i.e. use only \(\frac{2^n}{q}\) blocks, we get the following penalty:

$$\begin{aligned} P_{\lambda }(q) \approx \left( \frac{q}{2}\right) ^{\lambda }. \end{aligned}$$
(5)

The second result is the lower bound for tradeoff attacks with memory reduction by q:

$$\begin{aligned} P_{\lambda }(q) \ge \varOmega \left( q^{\lambda }\right) . \end{aligned}$$
(6)

However the constant in \(\varOmega ()\) is too small (\(2^{-18}\) for \(\lambda =3\)) to be helpful in bounding tradeoff attacks for small q. More importantly, the proof is flawed: the result for \(\lambda =1\) is incorrectly generalized for larger \(\lambda \). The reason seems to be that the authors assumed some independence between the layers, which is apparently not the case (and is somewhat exploited in our attack).

In the further text we demonstrate a tradeoff attack yielding much smaller penalties than Eq. (5) and thus asymptotically violating Eq. (6).

3.2 Our Tradeoff Attack on Catena-D

The idea of our method is based on the simple fact that

$$ \nu (\nu (X)) = X, $$

where X can be a single index or a set of indices. We exploit it as follows. We partition layers into segments of length \(2^k\) for some integer k, and store the first block of every segment (first two blocks at layer 0). As the index of such a block ends with k zeros, we denote the set of these blocks as \([*^{n-k}0^k]\). We also store all \(2^{3n/4}\) blocks modified by \(\varGamma \), which we denote by \([\varGamma ]\).

Consider a single segment \([AB*^k]\), where A is a k-bit constant and B is an \((n-2k)\)-bit constant. Then

$$ \nu ([AB*^k]) = [*^k\nu (B)\nu (A)]. $$

Blocks \([*^k\nu (B)\nu (A)]\) belong to \(2^k\) segments that have \(\nu (B)\) in the middle of the index. Denote the union of these segments by \([*^k\nu (B)*^k]\). Now note that

$$ \nu ([*^k\nu (B)*^k]) = [*^kB*^k], $$

and

$$ \nu (\nu ([*^kB*^k])) = [*^kB*^k]. $$

Therefore, when we iterate the permutation \(\nu \), we are always within some \(2^k\) segments. We suggest the computing strategy in Algorithm 1. At layer t we recompute \(2^k\) full segments from layers 0 to \(t-2\) and \(2^k\) subsegments of length \(\nu (A)\) (interpreted as a number in the binary form) at layer \(t-1\). Therefore, the total cost of computing layer t is

$$\begin{aligned} C(t)&= \sum _{A}\sum _{B}\left( (t-1)2^{2k} + \nu (A)2^k + 2^k\right) \\ &= \sum _A \left( (t-1)2^{n} + \nu (A)2^{n-k} + 2^{n-k}\right) \\ &= (t-1)2^{n+k} + 2^{n+k-1} + 2^n = \left( t-\frac{1}{2}\right) 2^{n+k} + 2^n. \end{aligned}$$
(7)

The total cost of computing Catena-D-\(\lambda \) is

$$ 2^n\left( \frac{\lambda ^2}{2}2^k+\lambda +1\right) . $$

We store \((\lambda +1) 2^{n-k}\) blocks as segment starting points, \(2^{3n/4}\) blocks \([\varGamma ]\), and \(2^{2k}\) blocks for intermediate computations. For \(k = \log q +\log (\lambda +1) \) and \(q<2^{n/4}\) we store about \(2^n/q\) blocks, so the memory is reduced by the factor of q. This value of k yields the total computational complexity of

$$\begin{aligned} C_q = 2^n\left( \frac{q\lambda ^2(\lambda +1)}{2}+\lambda +1\right) \end{aligned}$$
(8)

Since the computational complexity of the memory-full algorithm is \((\lambda +1)2^n\), our tradeoff method gives the computational penalty

$$ \frac{q\lambda ^2}{2}+1. $$
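As a sanity check, dividing Eq. (8) by the memory-full cost \((\lambda +1)2^n\) reproduces this penalty; a quick numeric sketch (the function name is ours):

```python
def catena_tradeoff_penalty(n: int, lam: int, q: int) -> float:
    """Computational penalty of our Catena-D tradeoff: Eq. (8) over the
    memory-full cost (lam+1)*2^n; equals q*lam^2/2 + 1."""
    C_q = 2**n * (q * lam**2 * (lam + 1) / 2 + lam + 1)   # Eq. (8)
    C_full = (lam + 1) * 2**n                              # full-memory cost
    return C_q / C_full
```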

Since Catena is a data-independent scheme, the latency of our method does not increase. Therefore, the time-memory product (Eq. (4)) can be reduced by the factor of \(2^{n/4}\). We can estimate how the AT costs evolve assuming the reference implementation in Sect. 2.3:

$$ \mathrm {AT}_{B_q} = 2^n\left( \frac{q\lambda ^2(\lambda +1)}{2}+\lambda +1\right) \cdot 3000 + (\lambda +1)2^n\frac{2^n}{q}. $$

For \(q = 2^{n/5}\) and \(\lambda =2\) we get

$$ \mathrm {AT}_{B_{2^{n/5}}} = 2^n\left( 6\cdot 2^{n/5}\right) \cdot 2^{11.5} + 3\cdot 2^{9n/5}. $$

For \(n=24\) (1 GB of RAM) we get

$$ \mathrm {AT}_{B_{2^{4.8}}} \approx 2^{24 + 2.5 + 4.8 + 11.5} + 2^{43.2+1.5}\approx 2^{45}, $$

whereas

$$ \mathrm {AT}_{B_{1}} = 2^{49.5}. $$

Therefore, we expect the time-area product to drop by a factor of about 25 if the memory is reduced by a factor of 30. In terms of Definition 1, Catena-D-2 is not \((1/25,\mathrm {AT})\)-secure. Our tradeoff method also has AT and TM compactness of at least \(2^{n/4} = 64\).

On other architectures the \(\mathrm {AT}\) may drop even further, and we expect that an adversary would choose the one that maximizes the tradeoff effect, so the actual impact of our attack can be even higher.

Violation of the Catena-D Lower Bound. Our method shows that the Catena-D lower bound is wrong. If we sum the computational costs over the \(\lambda \) layers, we obtain the following computational penalty for memory reduction by the factor of q:

$$ CP_{\lambda }(q) = O(\lambda ^3 q), $$

which is asymptotically smaller than the lower bound \(\varOmega (q^{\lambda })\) (Eq. (6)) from the original Catena submission [20].

3.3 Other Results for Catena

Our attack on Catena can be further scrutinized and generalized to non-even segments. More details are provided in [14] with the summary given in Table 2.

Table 2. Computation-memory tradeoff for Catena-D-3 and Catena-D-4.

4 Generic Precomputation Tradeoff Attack

Now we generalize the tradeoff method used in the attack on Catena to a class of data-independent schemes. We consider schemes \(\mathcal {G}\) where each memory block is a function of the previous block and some earlier block:

$$ X[i]\leftarrow F(X[i-1],X[\phi (i)]),\quad 1< i \le T, $$

where \(\phi \) is a deterministic function such that \(\phi (i) <i\). A group of existing password hashing schemes falls into this category: Catena [20], Pomelo [36], Lyra2 [25] (first phase). Multiple iterations of such a scheme are equivalent to a single iteration with larger T and an additional restriction

$$ x - M \le \phi (x), $$

so that the memory requirements are M blocks.

The crucial property of attacks on data-independent schemes is that they can be tested and tuned offline, without hashing any real password. An attacker may spend significant time searching for an optimal tradeoff strategy, since it will then apply to the whole set of passwords hashed with this scheme.

Precomputation Method. Our tradeoff method generalizes as follows. We divide memory into segments and store only the first block of each segment. For every segment I we calculate its image \(\phi (I)\). Let \(\overline{\phi ({I})}\) be the union of segments that contain \(\phi ({I})\). We repeat this process until we get an invariant set \(U_k = U(I)\):

$$ \underbrace{{I}}_{U_0} \rightarrow \underbrace{\overline{\phi ({I})}}_{U_1} \rightarrow \underbrace{\overline{\phi (\overline{\phi ({I})})}}_{U_2} \cdots \rightarrow U_k. $$

The scheme \(\mathcal {G}\) is then computed according to Algorithm 2.


The total number of calls to F is \(\sum _{i\ge 0}|U_i|\), and the penalty to compute I is

$$ CP(I) = \frac{\sum _{i\ge 0}|U_i|}{|I|}. $$
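The iteration above can be sketched in a few lines; a toy model (all names are ours) in which closure() rounds a set of block indices up to whole segments:

```python
def invariant_set(I, phi, seg_len):
    """Iterate I -> closure(phi(I)) -> ... until an invariant set is reached.

    I: set of block indices forming whole segments,
    phi: deterministic index function with phi(i) <= i,
    seg_len: segment length.
    Returns (union of all U_i, recomputation penalty CP(I)).
    """
    def closure(blocks):
        # Round a set of indices up to the whole segments containing them.
        starts = {(b // seg_len) * seg_len for b in blocks}
        return {s + j for s in starts for j in range(seg_len)}

    layers = [set(I)]      # U_0, U_1, ...
    seen = set(I)
    while True:
        nxt = closure({phi(i) for i in layers[-1]})
        if nxt <= seen:    # invariant set U_k reached: nothing new appears
            break
        layers.append(nxt)
        seen |= nxt
    penalty = sum(len(u) for u in layers) / len(I)   # CP(I)
    return seen, penalty
```

With \(\phi (i) = \lfloor i/2 \rfloor \) (an illustrative choice) and segment length 4, the segment {8..11} pulls in {4..7} and then {0..3} before stabilizing, giving CP(I) = 3.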

The efficiency of the tradeoff depends on the properties of \(\phi \) and of the segment partition, i.e. on how fast \(U_i\) expands. As we have seen, Catena uses a bit permutation for \(\phi \), whereas Lyra2 uses a simple arithmetic function or a bit permutation [20, 25]. In both cases \(U_i\) stabilizes in size after two iterations. If \(\phi \) is a more sophisticated function, the following heuristics (borrowed from our attacks on data-dependent schemes) might be helpful:

  • Store the first \(T_1\) computed blocks and the last \(T_2\) computed blocks for some \(T_1,T_2\) (usually about M/q).

  • Keep the list \(\mathcal {L}\) of the most expensive blocks to recompute and store X[i] if \(\phi (i)\in \mathcal {L}\) (Fig. 2).

Fig. 2. Segment unions in the precomputation method.

5 Generic Ranking Tradeoff Attack

Now we present a generic attack on a wide class of schemes with data-dependent memory addressing. Such schemes include scrypt [30] and the PHC finalists yescrypt [31], Argon2d [13], and Lyra2 [25]. We consider the schemes described by Eq. (1) with \(k=2\) and the following addressing (cf. also Fig. 3):

$$\begin{aligned} X[1]&= f(I);\\ \text {for } 1<i<T:&\\ r_i&= g(X[i-1]);\\ X[i]&= F(X[i-1], X[r_i]). \end{aligned}$$
(9)

Here g is some indexing function. This construction and our tradeoff method can be easily generalized to multiple functions F, to stateful functions (like in Lyra2), to multiple inputs, outputs, and passes, etc. However, for the sake of simplicity we restrict to the construction above.

Fig. 3. Data-dependent schemes.

Our tradeoff strategy is as follows: we compute the blocks sequentially and for each block X[i] decide whether to store it. If we do not store it, we calculate its access complexity A(i), the number of calls needed to recompute it, as the sum of the access complexities of \(X[i-1]\) and \(X[r_i]\) plus one. If we store X[i], its access complexity is 0.

The storing heuristic rule is the crucial element of our strategy. The idea is to store the block if \(A(r_i)\) is too high.

Our ranking tradeoff method works according to Algorithm 3 (Fig. 4).


Here w, s and l are parameters, and we usually set \(l=3s\). The computational complexity is computed as

$$ C = \sum _i A(r_i). $$

We also compute the latency L(i) of each block as \(L(i) = \max (L(r_i),L(i-1))+1\) if we do not store X[i], and \(L(i) = 0\) if we store it. Then the total latency is

$$ L = \sum _i L(r_i). $$

We implemented our attack and tested it on the class of functions described by Eq. (9). For fixed w and s, the total number of calls to F and the number of stored blocks are entirely determined by the indices \(\{r_i\}\). Thus we do not have to implement a real hash function: it is sufficient to generate \(r_i\) according to some distribution, model the computation as a directed acyclic graph, and compute C and L for this graph. We ran a number of tests with uniformly random \(r_i\) (within the segment [0; i] and \(T=2^{12}\)) and different values of w and s. We then grouped the C and L values by memory complexity and selected the lowest complexities for each memory reduction factor. These values are given in Table 3.
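This experiment can be reproduced with a small simulation; a sketch under simplifying assumptions: \(r_i\) is uniformly random, and the storing rule is reduced to a single threshold on \(A(r_i)\) as a stand-in for the full w/s/l-parameterized heuristic (all names are ours):

```python
import random

def simulate_ranking(T: int, threshold: int, seed: int = 0):
    """Model the ranking method on the abstract scheme of Eq. (9):
    blocks are DAG vertices, r_i is uniform in [0, i-1], and we track
    only access complexities A(i) and latencies L(i)."""
    rng = random.Random(seed)
    A = [0] * T            # access complexity per block (0 = stored)
    L = [0] * T            # recomputation latency per block
    stored = 1             # X[0] is stored
    comp, lat = 0, 0       # accumulated recomputation cost and latency
    for i in range(1, T):
        r = rng.randrange(i)          # r_i = g(X[i-1]), modeled as uniform
        comp += A[r]                  # C accumulates A(r_i)
        lat += L[r]                   # latency overhead of fetching X[r_i]
        if A[r] > threshold:          # reference too expensive: store X[i]
            A[i], L[i] = 0, 0
            stored += 1
        else:                         # do not store X[i]
            A[i] = A[i - 1] + A[r] + 1
            L[i] = max(L[r], L[i - 1]) + 1
    return comp, lat, stored
```

The memory reduction factor q is then T divided by the number of stored blocks, so sweeping the threshold traces out a computation/latency-versus-memory curve.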

Fig. 4. Outline of the ranking tradeoff method.

Table 3. Computational, latency, AT (for \(R_c=3000\) and \(M=2^{24}\)), and TM penalties for the ranking tradeoff attack on generic data-dependent schemes.

We conclude that generic 1-pass data-dependent schemes with random addressing are (0.75, AT)- and (0.75, TM)-secure with respect to our ranking method. Both the AT and TM ratios exceed 1 when \(q\ge 4\), so both the AT- and the TM-compactness are about 4.

6 Cryptanalysis of yescrypt

6.1 Description

yescrypt [31] is another PHC finalist, which is built upon scrypt and is notable for its high memory filling rate (up to 2 GB/sec) and a number of features, which include custom S-boxes to thwart exhaustive search on GPUs, multiplicative chains to increase the ASIC latency, and some others. yescrypt is essentially a family of functions, each member activated by a combination of flags. Due to the page limits, we consider only one function of the family.

Here we consider the yescrypt setting where flag yescrypt_RW is set, there is no parallelism, and no ROM (in the further text – just yescrypt). It operates on 1024-byte memory blocks \(X[1],X[2],\ldots , X[M]\). The scheme works as follows:

$$\begin{aligned} X[1]&\leftarrow F'(I);\\ X[i]&\leftarrow F(X[i-1]\oplus X[\phi (i)]),\;1<i\le M;\\ Y&\leftarrow X[M];\\ Y,\,X[Y \bmod M]&\leftarrow F(Y\oplus X[Y \bmod M]),\;M < i \le T. \end{aligned}$$

Here F and \(F'\) are compression functions (the details of \(F'\) are irrelevant for the attack). The memory is thus filled in the first M steps, and then \((T-M)\) blocks are updated using the state variable Y. Here \(\phi (i)\) is the data-dependent indexing function: it takes 32 bits of \(X[i-1]\) and interprets them as a random block index among the last \(2^k\) blocks, where \(2^k\) is the largest power of 2 smaller than i.

Transformation F operates on 1024-byte blocks as follows:

  • The block is partitioned into sixteen 64-byte subblocks \(B_0, B_1,\ldots ,B_{15}\).

  • New subblocks are produced sequentially:

    $$\begin{aligned} B_{0}^{new}&\leftarrow f(B_{0}^{old}\oplus B_{15}^{old});\\ B_{i}^{new}&\leftarrow f(B_{i-1}^{new}\oplus B_{i}^{old}),\; 0 <i<16. \end{aligned}$$

    The details of f are irrelevant to our attack.
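A minimal sketch of this subblock chaining, with SHA-512 standing in for the transformation f (whose details are, as noted, irrelevant here):

```python
import hashlib

def f(sub: bytes) -> bytes:
    # Stand-in for the 64-byte subblock transformation f.
    return hashlib.sha512(sub).digest()   # 64-byte output

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def F(block: bytes) -> bytes:
    """Chain the 16 subblocks of a 1024-byte block:
    B0_new = f(B0_old xor B15_old), then
    Bi_new = f(B(i-1)_new xor Bi_old) for 0 < i < 16."""
    old = [block[64 * i:64 * (i + 1)] for i in range(16)]
    new = [f(xor(old[0], old[15]))]
    for i in range(1, 16):
        new.append(f(xor(new[i - 1], old[i])))
    return b"".join(new)
```

Note that subblock \(B_i^{new}\) never depends on \(B_j^{old}\) for \(j > i\) (except through \(B_{15}^{old}\) in the first step), which is the lack of diffusion exploited in the next subsection.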

6.2 Tradeoff Attack on yescrypt

Our crucial observation is that there is no diffusion from the last subblocks to the first ones. Thus if we store all \(B_0\), we break the dependencies between consecutive blocks, and the subblocks can be recomputed from \(B_1\) to \(B_{15}\) with pipelining (Fig. 5). Suppose that block X[i] is computed with latency L(i), i.e., its computation tree has L(i) levels when measured in F. If we instead count calls to f, the actual latency of X[i] is \(L(i)+15\) rather than the expected 16L(i).

The tradeoff strategy is given in Algorithm 4.


If the missing block is recomputed by a tree of depth D, then the latency of the new block is \(D+16\) measured in calls to f, or \(\frac{D}{16}+1\) if measured in calls to F. This number should be compared to the latency \(D+1\) if we had not exploited the iterative structure of F. Thus if the ranking method gives the total latency L (measured in F), the actual latency should be \(\frac{L+15}{16}\).
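This latency accounting can be expressed as a pair of small helper functions (our naming; D is the recomputation depth measured in calls to F):

```python
def pipelined_latency(D: int) -> int:
    """Latency, in calls to f, of a missing block recomputed by a tree
    of depth D when F's subblocks are pipelined: one f per tree level
    for the first subblock, then 15 more for the remaining subblocks."""
    return D + 16

def naive_latency(D: int) -> int:
    """Without pipelining, each of the D + 1 F-computations on the
    critical path costs 16 sequential calls to f."""
    return 16 * (D + 1)
```

For deep recomputation trees the pipelined latency approaches a 16-fold improvement over the naive one, which is where the \(\frac{L+15}{16}\) estimate comes from.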

For the smallest secure parameter (\(T=4M/3\)), the final computational and latency penalties, as well as the AT and TM penalties, are given in Table 4 (1/16-th of each block is added to the attacker's memory). We conclude that yescrypt is only (0.45, AT)- and (0.45, TM)-secure, whereas the AT-compactness is 4 and the TM-compactness is 6. Since these numbers are worse than for generic 1-pass schemes, our attack clearly signals a vulnerability in the design of BlockMix. We expect our attack to become inefficient for \(T=2M\) and higher.

Table 4. Computational, latency, AT (for \(R_c=3000\) and \(M=2^{24}\)), and TM penalties for the ranking tradeoff attack on yescrypt mode of operation with 4/3 passes, using the iterative structure of F.
Fig. 5. Pipelining the block computation in yescrypt: only the first subblock is computed with delay D.

7 Future Work

Our tradeoff methods apply to a wide class of memory-hard functions, so our research can be continued in the following directions:

  • Application of our methods to other PHC candidates and finalists: Pomelo [36] and the modified Lyra2.

  • Set of design criteria for the indexing functions that would withstand our attacks.

  • New methods that directly target schemes that make multiple passes over memory or use parallel cores.

  • A set of tools that help to choose a proof-of-work instance for various applications: cryptocurrencies, proofs of space, etc.

8 Conclusion

Tradeoff cryptanalysis of memory-hard functions is a young, relatively unexplored, and complex area of research, combining cryptanalytic techniques with an understanding of implementation aspects and hardware constraints. It has direct real-world impact, since its results can be immediately used in the ongoing arms race of mining hardware for cryptocurrencies.

In this paper we have analyzed the memory-hard functions Catena-Dragonfly and yescrypt. We show that Catena-Dragonfly is not memory-hard despite the original claims and the designers' security proof, since a hardware-equipped adversary can reduce the attack costs significantly using our tradeoffs. We also show that yescrypt is more tradeoff-resilient than Catena, though we can still exploit several design decisions to reduce the time-memory and time-area products by a factor of 2.

We generalize our ideas to a generic precomputation method for data-independent schemes and a generic ranking method for data-dependent schemes. Our techniques may be used to estimate attack costs in various applications, from the fast-emerging area of memory-hard cryptocurrencies to password-based key derivation.