1 Introduction

Two quantum algorithms threaten to dramatically reduce the security of currently deployed cryptosystems: Shor’s algorithm solves the abelian hidden subgroup problem in polynomial time [1, 2], and Grover’s algorithm provides a quadratic improvement in the number of queries needed to solve black-box search problems [3,4,5].

Efficient quantum algorithms for integer factorization, finite field discrete logarithms, and elliptic curve discrete logarithms can all be constructed by reduction to the abelian hidden subgroup problem. As such, cryptosystems based on these problems cannot be considered secure in a post-quantum environment. Diffie-Hellman key exchange, RSA encryption, and RSA signatures will all need to be replaced before quantum computers are available. Some standards bodies have already begun discussions about transitioning to new public key cryptographic primitives [6, 7].

The situation is less dire for hash functions and symmetric ciphers. In a pre-quantum setting, a cryptographic primitive that relies on the hardness of inverting a one-way function is said to offer k-bit security if inverting the function is expected to take \(N=2^k\) evaluations of the function. An exhaustive search that is expected to take O(N) queries with classical hardware can be performed with \(\varTheta (\sqrt{N})\) queries using Grover’s algorithm on quantum hardware. Hence, Grover’s algorithm could be said to reduce the bit-security of such primitives by half; one might say that a 128-bit pre-quantum primitive offers only 64-bit security in a post-quantum setting.

A conservative defense against quantum search is to double the security parameter (e.g. the key length of a cipher, or the output length of a hash function). However, this does not mean that the true cost of Grover’s algorithm should be ignored. A cryptanalyst may want to know the cost of an attack even if it is clearly infeasible, and users of cryptosystems may want to know the minimal security parameter that provides “adequate protection” in the sense of [8,9,10].

In the context of pre-image search on a hash function, the cost of a pre-quantum attack is given as a number of invocations of the hash function. If one assumes that quantum queries have the same cost as classical queries, then the query model provides a reasonable comparison between quantum and classical search. However, realistic designs for large quantum computers call this assumption into question.

The main difficulty is that the coherence time of physical qubits is finite. Noise in the physical system will eventually corrupt the state of any long computation. If the physical error rate can be suppressed below some threshold, then logical qubits with arbitrarily long coherence times can be created using quantum error correcting codes. Preserving the state of a logical qubit is an active process that requires periodic evaluation of an error detection and correction routine. This is true even if no logical gates are performed on the logical qubit. Hence the classical processing required to evaluate a quantum circuit will grow in proportion to both the depth of the circuit and the number of logical qubits on which it acts.

We suggest that a cost model that facilitates direct comparisons of classical and quantum algorithms should take the classical computation required for quantum error correction into consideration. Clearly such estimates will be architecture dependent, and advances in quantum computing could invalidate architectural assumptions.

To better understand the impact of costing quantum error correction, we present an estimate of the cost of pre-image attacks on SHA-2 and SHA-3 assuming a quantum architecture based on the surface code with a logical Clifford+T gate set. We execute the following procedure for each hash function. First, we implement the function as a reversible circuit over the Clifford+T gate set. We use a quantum circuit optimization tool, “T-par” [11], to minimize the circuit’s T-count and T-depth. With the optimized circuit in hand we estimate the additional overhead of fault-tolerant computation. In particular, we estimate the size of the circuits needed to produce the ancillary states that are consumed by T-gates.

Grassl et al. presented a logical-layer quantum circuit for applying Grover’s algorithm to AES key recovery [12]. Separately, Fowler et al. have estimated the physical resources required to implement Shor’s factoring algorithm on a surface code based quantum computer [13]. Our resource estimates combine elements of both of these analyses. We focus on the number of logical qubits in the fault-tolerant circuit and the overall depth of the circuit in units of surface code cycles. While our cost model ties us to a particular quantum architecture, we segment our analysis into several layers so that the impact of different assumptions at any particular level can be readily evaluated. We illustrate our method schematically in Fig. 2.

The structure of this article reflects our workflow. In Sect. 2 we state the problem of pre-image search using Grover’s algorithm. Section 3 introduces our framework for computing costs, and Sect. 4 applies these principles to compute the intrinsic cost of performing Grover search. Sections 5 and 6 detail our procedure for generating reversible circuits for SHA-256 and SHA3-256 respectively. In Sect. 7 we embed these reversible implementations into a surface code, and estimate the required physical resources. We summarize our results and propose avenues of future research in Sect. 8.

Fig. 1. Grover searching with an oracle for \(f : \{0,1\}^k \rightarrow \{0,1\}^k\).

Fig. 2. Analyzing Grover’s algorithm.

2 Pre-image Search via Grover’s Algorithm

Let \(f : \{0,1\}^k \rightarrow \{0,1\}^k\) be an efficiently computable function. For a fixed \(y \in \{0,1\}^k\), the value x such that \(f(x) = y\) is called a pre-image of y. In the worst case, the only way to compute a pre-image of y is to systematically search the space of all inputs to f. A function that must be searched in this way is known as a one-way function. A one-way function that is bijective is a one-way permutation.

Given a one-way permutation f, one might ask for the most cost effective way of computing pre-images. With a classical computer one must query f on the order of \(2^k\) times before finding a pre-image. By contrast, a quantum computer can perform the same search with \(2^{k/2}\) queries to f by using Grover’s algorithm [3]. Of course, counting only the queries to f neglects the potentially significant overhead involved in executing f on a quantum computer.

Figure 1 gives a high-level description of Grover’s algorithm. The algorithm makes \(\lfloor \frac{\pi }{4} 2^{k/2}\rfloor \) calls to G, the Grover iteration. The Grover iteration has two subroutines. The first, \(U_g\), implements the predicate \(g : \{0,1\}^k \rightarrow \{0,1\}\) that maps x to 1 if and only if \(f(x) = y\). Each call to \(U_g\) involves two calls to a reversible implementation of f and one call to a comparison circuit that checks whether \(f(x) = y\).

The second subroutine in G implements the transformation \(2|\psi \rangle \langle \psi | - I\), where \(|\psi \rangle \) denotes the uniform superposition over all k-bit strings, and is called the diffusion operator. The diffusion operator is responsible for amplifying the probability that a measurement of the output register would yield x such that \(f(x) = y\). As it involves only single-qubit gates and one k-fold controlled-NOT, the cost of the diffusion operator is expected to be small compared with that of \(U_g\).
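To make the iteration structure of Fig. 1 concrete, the following minimal sketch simulates Grover search as a classical statevector computation on a toy predicate; the function name, the choice \(k=8\), and the target value are illustrative only and play no role in the cost analysis.

import numpy as np

def grover_search(k, target):
    # Search {0,1}^k for the unique x with g(x) = 1, i.e. x == target.
    n = 2 ** k
    state = np.full(n, 1 / np.sqrt(n))           # uniform superposition
    iterations = int(np.floor(np.pi / 4 * 2 ** (k / 2)))
    for _ in range(iterations):
        state[target] *= -1                      # oracle U_g: phase flip where g(x) = 1
        state = 2 * state.mean() - state         # diffusion: inversion about the mean
    return int(np.argmax(state ** 2)), iterations

print(grover_search(8, 173))                     # (173, 12): floor(pi/4 * 2^4) iterations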

3 A Cost Metric for Quantum Computation

Without significant future effort, the classical processing will almost certainly limit the speed of any quantum computer, particularly one with intrinsically fast quantum gates.

Fowler–Whiteside–Hollenberg [14]

The majority of the overhead for quantum computation, under realistic assumptions about quantum computing architectures, comes from error detection and correction. There are a number of error correction methods in the literature; however, the most promising from the perspective of experimental realizability is the surface code [15].

The surface code allows for the detection and correction of errors on a two-dimensional array of nearest-neighbor coupled physical qubits. A distance d surface code encodes a single logical qubit into an \(n \times n\) array of physical qubits \((n=2d-1)\). A classical error detection algorithm must be run at regular intervals in order to track the propagation of physical qubit errors and, ultimately, to prevent logical errors. Every surface code cycle involves some number of one- and two-qubit physical quantum gates, physical qubit measurements, and classical processing to detect and correct errors.

The need for classical processing allows us to make a partial comparison between the cost of classical and quantum algorithms for any classical cost metric. The fact that quantum system engineers consider classical processing to be a bottleneck for quantum computation [14] suggests that an analysis of the classical processing may serve as a good proxy for an analysis of the cost of quantum computation itself.

Performing this analysis requires that we make a number of assumptions about how quantum computers will be built, not least of which is the assumption that quantum computers will require error correcting codes, and that the surface code will be the code of choice.

Assumption 1

The resources required for any large quantum computation are well approximated by the resources required for that computation on a surface code based quantum computer.

Fowler et al. [16] give an algorithm for the classical processing required by the surface code. A timing analysis of this algorithm was given in [14], and a parallel variant was presented in [17]. Under a number of physically motivated assumptions, the algorithm of [17] runs in constant time per round of error detection. It assumes a quantum computer architecture consisting of an \(L \times L\) grid of logical qubits overlaid by a constant density mesh of classical computing units. More specifically, the proposed design involves one ASIC (application-specific integrated circuit) for each block of \(C_a \times C_a\) physical qubits. These ASICs are capable of nearest-neighbor communication, and the number of rounds of communication between neighbors is bounded with respect to the error model. The number of ASICs scales linearly with the number of logical qubits, but the constant \(C_a\), and the amount of computation each ASIC performs per time step, is independent of the number of logical qubits.

Each logical qubit is a square grid of \(n \times n\) physical qubits where n depends on the length of the computation and the required level of error suppression. We are able to estimate n directly (Sect. 7). Following [14] we will assume that \(C_a = n\). We therefore estimate the number of classical computing units to be equal to the number of logical qubits in the circuit. Note that assuming \(C_a = n\) introduces a dependence between \(C_a\) and the length of the computation, but we will ignore this detail. Since error correction must be performed on the time scale of hundreds of nanoseconds (200 ns in [15]), we do not expect it to be practical to make \(C_a\) much larger than n. Furthermore, while n depends on the length of the computation it will always lie in a fairly narrow range. A value of \(n < 100\) is sufficient even for the extremely long computations we consider. The comparatively short modular exponentiation computations in [15] require \(n > 31\). As long as it is not practical to take \(C_a\) much larger than 100, the assumption that \(C_a = n\) will introduce only a small error in our analysis.

Assumption 2

The classical error correction routine for the surface code on an \(L \times L\) grid of logical qubits requires an \(L \times L\) mesh of classical processors (i.e. \(C_a = n\)).

The algorithm that each ASIC performs is non-trivial and estimating its exact runtime depends on the physical qubit error model. In [14] evidence was presented that the error correction algorithm requires \(O(C_a^2)\) operations, on average, under a reasonable error model. This work considered a single qubit in isolation, and some additional overhead would be incurred by communication between ASICs. A heuristic argument is given in [17] that the communication overhead is also independent of L, i.e. that the radius of communication for each processor depends on the noise model but not on the number of logical qubits in the circuit.

Assumption 3

Each ASIC performs a constant number of operations per surface code cycle.

Finally we (arbitrarily) peg the cost of a surface code cycle to the cost of a hash function invocation. If we assume, as in [15], that a surface code cycle time on the order of 100 ns is achievable, then we are assuming that each logical qubit is equipped with an ASIC capable of performing several million hashes per second. This would be on the very low end of what is commercially available for Bitcoin mining today [18]; however, the ASICs used for Bitcoin have very large circuit footprints. One could alternatively justify this assumption by noting that typical hash functions require roughly 10 cycles per byte on commercial desktop CPUs [19], which translates to roughly 1000 cycles per hash function invocation. Since commercial CPUs operate at around 4 GHz, this again translates to a few million hashes per second.

Assumption 4

The temporal cost of one surface code cycle is equal to the temporal cost of one hash function invocation.

Combining Assumptions 1, 2, and 4 we arrive at the following metric for comparing the costs of classical and quantum computations.

Cost Metric 1

The cost of a quantum computation involving \(\ell \) logical qubits for a duration of \(\sigma \) surface code cycles is equal to the cost of classically evaluating a hash function \(\ell \cdot \sigma \) times. Equivalently we will say that one logical qubit cycle is equivalent to one hash function invocation.

We will use the term “cost” to refer either to logical qubit cycles or to hash function invocations.
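For example, under this metric a computation that occupies \(2^{12}\) logical qubits for \(2^{40}\) surface code cycles has the same cost as \(2^{52}\) hash function invocations (the numbers here are purely illustrative).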

4 Intrinsic Cost of Grover Search

Suppose there is polynomial overhead per Grover iteration, i.e. \(\varTheta (2^{k/2})\) Grover iterations cost \(\approx k^v2^{k/2}\) logical qubit cycles for some real v independent of k. Then an adversary who is willing to execute an algorithm of cost \(2^C\) can use Grover’s algorithm to search a space of k bits provided that

$$\begin{aligned} k/2 + v\log _2(k) \le C. \end{aligned}$$
(1)

We define the overhead of the circuit as v and the advantage of the circuit as k/C. Note that if we view k as a function of v and C then for any fixed v we have \(\lim _{C\rightarrow \infty } k(v,C)/C = 2,\) i.e. asymptotically, Grover’s algorithm provides a quadratic advantage over classical search. However, here we are interested in non-asymptotic advantages.

When costing error correction, we must have \(v\ge 1\) purely from the space required to represent the input. However, we should not expect the temporal cost to be independent of k. Even if the temporal cost is dominated by the k-fold controlled-NOT gate, the Clifford+T depth of the circuit will be at least \(\log _2(k)\) [20], so each Grover iteration costs at least \(k\log _2(k) = k^{1 + \log _k(\log _2 k)}\) logical qubit cycles. Hence, \(v \ge 1.375\) for \(k\le 256\). This still neglects some spatial overhead required for magic state distillation, but \(v = 1.375\) may be used to derive strict upper bounds, in our cost model, for the advantage of Grover search.

In practice the overhead will be much greater. The AES-256 circuit from [12] has depth 130929 and requires 1336 logical qubits, so a single iteration costs \(130929 \times 1336 \approx 256^{3.423}\) logical qubit cycles. This yields an overhead of \(v \approx 3.423\) from the reversible layer alone.

Substituting \(z = \frac{k\ln {2}}{2v}\), the case of equality in Eq. 1 is

$$\begin{aligned} ze^z = \frac{2^{C/v}\ln {2}}{2v} \quad \Longrightarrow \quad k(v,C) = \frac{2v}{\ln (2)} \cdot {\text {W}}\left( \frac{2^{C/v} \ln 2}{2v}\right) \end{aligned}$$
(2)

where \({\text {W}}\) is the Lambert W-function. Table 4 in Appendix A gives the advantage of quantum search as a function of its cost C and overhead v; k is computed using Eq. 2.
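Equation 2 is straightforward to evaluate numerically. The sketch below does so with SciPy’s Lambert W implementation; the sample overhead values are illustrative and are not the entries of Table 4.

from math import log
from scipy.special import lambertw

def searchable_bits(v, C):
    # k(v, C) from Eq. 2: bits of search space reachable at cost 2^C
    # when the Grover iterations cost ~ k^v 2^(k/2) logical qubit cycles.
    x = 2 ** (C / v) * log(2) / (2 * v)
    return 2 * v / log(2) * lambertw(x).real

for v in (1, 1.375, 3.862, 4.8):
    print(v, round(searchable_bits(v, 128), 1))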

5 Reversible Implementation of a SHA-256 Oracle

The Secure Hash Algorithm 2 (SHA-2) [21] is a family of collision resistant cryptographic hash functions. There are a total of six functions in the SHA-2 family: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224 and SHA-512/256. There are currently no known classical pre-image attacks against any of the SHA-2 algorithms that are faster than brute force. We will focus on SHA-256, a commonly used variant, and will assume a message size of one block (512 bits).

First the message block is stretched using Algorithm 2 and the result is stored in \(\mathbf {W}\). The internal state is then initialized using a set of constants. The round function is then run 64 times, each run using a single entry of \(\mathbf {W}\) to modify the internal state. The round function for SHA-256 is shown in Algorithm 1.

Algorithm 1. The SHA-256 round function.
Algorithm 2. The SHA-256 stretch function.
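For reference, a plain (irreversible) Python sketch of the stretch and round operations described above follows; the round constants \(K_t\) and the initial hash values are as specified in FIPS 180-4 and are omitted here, so this is a structural sketch rather than a complete implementation, and it is not our reversible circuit.

MASK = 0xFFFFFFFF

def rotr(x, n):
    # 32-bit right rotation
    return ((x >> n) | (x << (32 - n))) & MASK

def stretch(W):
    # message stretch (cf. Algorithm 2): extend 16 message words to 64 words
    for t in range(16, 64):
        s0 = rotr(W[t - 15], 7) ^ rotr(W[t - 15], 18) ^ (W[t - 15] >> 3)
        s1 = rotr(W[t - 2], 17) ^ rotr(W[t - 2], 19) ^ (W[t - 2] >> 10)
        W.append((W[t - 16] + s0 + W[t - 7] + s1) & MASK)
    return W

def round_fn(state, Kt, Wt):
    # one SHA-256 round (cf. Algorithm 1) on the 8-word internal state
    a, b, c, d, e, f, g, h = state
    S1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)                      # Ch
    t1 = (h + S1 + ch + Kt + Wt) & MASK
    S0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)            # Maj
    t2 = (S0 + maj) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)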

5.1 Reversible Implementation

Our implementation of the SHA-256 algorithm as a reversible circuit is similar to the one presented in [22] (with the addition of the stretching function). Each round can be performed fully reversibly (with access to the input) so no additional space is accumulated as rounds are performed. The in-place adders shown in the circuit are described in [23]. The adders perform the function \((a,b,0) \mapsto (a,a+b,0)\) where the 0 is a single ancilla bit used by the adder. Since the \(\varSigma \) blocks use only rotate and XOR operations, they are constructed using CNOT gates exclusively.

\(\mathsf {Maj}\) is the bitwise majority function. The majority function is computed using a CNOT gate and two Toffoli gates as shown in Fig. 4.

Fig. 3. SHA-256 round.

Fig. 4. Majority circuit implementation. The \(a\oplus b\) line will be returned to b when the inverse circuit is applied.

Fig. 5. \(\mathsf {Ch}\) circuit implementation. This circuit is applied bitwise to the input of each \(\mathsf {Ch}\) block.

The \(\mathsf {Ch}\) function is \(ab\oplus \lnot ac\) which can be rewritten as \(a(b\oplus c) \oplus c\). This requires a single Toffoli gate as shown in Fig. 5.
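As a quick sanity check of the algebra behind Figs. 4 and 5, the following exhaustive test over single bits can be run; the Toffoli-friendly forms are the ones described above, and the exact gate layout of the figures is not reproduced here.

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            # Ch: ab ^ (~a)c == a(b ^ c) ^ c  (a single Toffoli once b ^= c is formed)
            assert (a & b) ^ ((1 - a) & c) == (a & (b ^ c)) ^ c
            # Maj: majority(a, b, c) == ab ^ (a ^ b)c  (a CNOT and two Toffolis)
            assert int(a + b + c >= 2) == (a & b) ^ ((a ^ b) & c)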

Table 1. T-par optimization results for a single round of SHA-256, one iteration of the stretch algorithm and full SHA-256. Note that 64 iterations of the round function and 48 iterations of the stretch function are needed. The stretch function does not contribute to overall depth since it can be performed in parallel with the rounds function. No X gates are used so an X column is not included. The circuit uses 2402 total logical qubits.

There are a few options for constructing the round circuit. For example if space is available some of the additions can be performed in parallel, and the cleanup of the \(\varSigma \), \(\mathsf {Ch}\), and \(\mathsf {Maj}\) functions can be neglected if it is desirable to exchange space for a lower gate count. We select the round implementation shown in Fig. 3.

5.2 Quantum Implementation

For the quantum implementation we converted the Toffoli-CNOT-NOT circuit (Fig. 3) discussed above into a Clifford+T circuit. To expand the Toffoli gates we used the T-depth 3 Toffoli reported in [24]. T-par was then used to optimize a single round. The results are shown in Table 1. Due to the construction of the adders every Toffoli gate shares two controls with another Toffoli gate. This allows T-par to remove a large number of T-gates (see [20]).

Observing that the depth of the optimized SHA-256 circuit, 830720, is approximately \(256^{2.458}\), and likewise that it requires \(2402 \approx 256^{1.404}\) logical qubits, the overhead, from the reversible layer alone, is \(v \approx 3.862\).
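The exponent quoted here can be reproduced directly from those two figures; a one-line check, included for illustration only:

from math import log
depth, qubits = 830720, 2402
print((log(depth) + log(qubits)) / log(256))    # ~= 3.862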

6 Reversible Implementation of a SHA3-256 Oracle

The Secure Hash Algorithm 3 standard [25] defines six individual hash algorithms, based on the length of their output in the case of SHA3-224, SHA3-256, SHA3-384 and SHA3-512, or their security strength in the case of SHAKE-128 and SHAKE-256. In contrast to the SHA-2 standard, each of the SHA-3 algorithms requires effectively the same resources to implement reversibly, owing to their definition as cryptographic sponge functions [26]. Analogous to a sponge, the sponge construction first pads the input to a multiple of the given rate constant then absorbs the padded message in chunks, applying a permutation to the state after each chunk, before “squeezing” out a hash value of desired length. Each of the SHA-3 algorithms use the same underlying permutation, but vary the chunk size, padding and output lengths.

The full SHA-3 algorithm is given in pseudocode in Algorithm 3. Each instance results from the sponge construction with the permutation Keccak-p[1600, 24] described below, the padding function \(\mathsf{pad}10^*1(x, m)\), which produces a string of length \((-m)\bmod x\) of the form (as a regular expression) \(10^*1\), and rate \(1600-2k\). The algorithm first pads the input message M with a string of the form \(0110^*1\) to a total length that is a multiple of \(1600-2k\). It then splits this string into segments of length \(1600-2k\) and absorbs each segment into the current hash value S, applying the Keccak-p[1600, 24] permutation after each absorption. Finally, the hash value is truncated to a length-\(k\) string.

The SHAKE algorithms are obtained by padding the input M with a string of the form \(111110^*1\), but otherwise proceed identically to SHA-3.

Algorithm 3. The SHA-3 hash functions (sponge construction).

Assuming the pre-image has length k, the padded message P has length exactly \(1600 - 2k\) and hence \(n=1\) for every value of \(k\), so Algorithm 3 reduces to one application of Keccak-p[1600, 24].
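For SHA3-256, for instance, the rate is \(1600 - 2\times 256 = 1088\) bits, so a 256-bit pre-image is padded with the 832-bit string \(011\,0^{828}\,1\) to fill exactly one block.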

The Keccak Permutation. The permutation underlying the sponge construction in each SHA-3 variant is an instance of a family of functions, denoted Keccak-\(p[b, r]\). The Keccak permutation accepts a 5 by 5 array of lanes (bitstrings of length \(w=2^l\) for some l, where \(b=25w\)) and performs \(r\) rounds of an invertible operation on this array. In particular, round i is defined, for i from \(12+2l - r\) up to \(12+2l - 1\), as \(R_i=\iota _i\circ \chi \circ \pi \circ \rho \circ \theta \), where the component functions are described in Fig. 6. Note that array indices are taken mod 5 and \(A, A'\) denote the input and output arrays, respectively. The rotation array c and round constants RC(i) are pre-computed values.

The Keccak-\(p[b, r]\) permutation itself is defined as the composition of all \(r\) rounds, indexed from \(12+2l -r\) to \(12+2l - 1\). While any parameters could potentially be used to define a hash function, only Keccak-p[1600, 24] is used in the SHA-3 standard. Note that the lane size w in this case is 64 bits (Fig. 6).

Fig. 6. The component functions of \(R_i\).

6.1 Reversible Implementation

Given the large size of the input register for the instance used in SHA3-256 (1600 bits), we sought a space-efficient implementation as opposed to a more straightforward implementation using Bennett’s method [27] which would add an extra 1600 bits per round, to a total of 38400 bits. While this space usage could be reduced by using pebble games [28], the number of iterations of Keccak-p would drastically increase. Instead, we chose to perform each round in place by utilizing the fact that each component function \((\theta , \rho , \pi , \chi , \iota _i)\) is invertible. The resulting circuit requires only a single temporary register the size of the input, which is returned to the all-zero state at the end of each round.

Fig. 7. Reversible circuit implementation for round i of Keccak-p.

Figure 7 shows our circuit layout for a given round of Keccak-\(p[b, r]\). We compute \(\theta (A)\) into the ancilla register by a straightforward implementation of (3), as binary addition (\(\oplus \)) is implemented reversibly by the CNOT gate. The implementation of \(\theta ^{-1}:|\psi \rangle |\theta (A)\rangle \mapsto |\psi \oplus A\rangle |\theta (A)\rangle \) is much less obvious – we adapted our implementation from the C++ library Keccak tools [29] with minor modifications to remove temporary registers. To reduce the number of unnecessary gates, we perform the \(\rho \) and \(\pi \) operations “in software” rather than physically swapping bits. The \(\chi \) and \(\chi ^{-1}\) operations are again straightforward implementations of (6) and the inverse operation from Keccak tools, respectively, using Toffoli gates to implement the binary multiplications. Finally the addition of the round constant (\(\iota _i\)) is a sequence of at most 5 NOT gates, precomputed for each of the 24 individual rounds.

As a function of the lane width w, \(\theta \) comprises 275w CNOT gates. The inverse of \(\theta \) is more difficult to assign a formula to, as it depends on some precomputed constants – in particular, \(\theta ^{-1}\) is implemented using \(125w\cdot j\) CNOT gates, where \(j\) is 170 for \(b=1600\). As \(\rho \) and \(\pi \) are implemented simply by re-indexing, they have no logical cost. We implement \(\chi \) using 50w additions and 25w multiplications, giving 50w CNOT gates and 25w Toffoli gates in 5 parallel stages. Finally \(\chi ^{-1}\) requires 25w CNOT gates to copy the output back into the initial register, then 60w CNOT and 30w Toffoli gates in 6 parallel stages. As the cost of \(\iota _i\) depends on the round, we do not give its per-round resources.
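To make these counts easier to follow, here is a minimal classical model of the two components that determine them: \(\theta \) (linear over GF(2), hence CNOT-only) and \(\chi \) (the only nonlinear step, hence the only source of Toffoli gates). It operates on a 5 by 5 array of 64-bit integer lanes and illustrates the maps themselves, not our reversible circuit.

W = 64
MASK = (1 << W) - 1

def rotl(lane, n):
    # rotate a w-bit lane left by n positions
    return ((lane << n) | (lane >> (W - n))) & MASK

def theta(A):
    # theta: XOR each lane with a parity of the neighbouring columns (CNOTs only)
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rotl(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]

def chi(A):
    # chi: the bitwise AND makes this the Toffoli-generating step
    return [[(A[x][y] ^ (~A[(x + 1) % 5][y] & A[(x + 2) % 5][y])) & MASK
             for y in range(5)] for x in range(5)]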

Table 2. Clifford+T resource counts for the Keccak-p[1600, 24] components, as well as for the full oracle implementation of SHA3-256. \(\iota \) gives the combined resource counts for all 24 rounds of \(\iota _i\). The circuit uses 3200 total logical qubits.

The final circuit comprises 3200 qubits, 85 NOT gates, 33269760 CNOT gates and 84480 Toffoli gates. Additionally, the Toffoli gates are arranged in 264 parallel stages.

6.2 Quantum Implementation

As with the Clifford+T implementation of SHA-256, we used the T-depth 3 Toffoli reported in [24] to expand each Toffoli gate. Since the \(\chi \) (and \(\chi ^{-1}\)) transformation is the only non-linear operation of Keccak-p[1600, 24], we applied T-par just to the \(\chi /\chi ^{-1}\) subcircuit to optimize T-count and depth. We used the formally verified reversible circuit compiler ReVerC [30] to generate a machine readable initial circuit for \(\chi /\chi ^{-1}\) – while the ReVerC compiler performs some space optimization [31], the straightforward manner in which we implemented the circuit meant the compiled circuit coincided exactly with our analysis above. The optimized results are reported in Table 2. Note that each algorithm in the SHA-3 family corresponds to one application of Keccak-p[1600, 24] for our purposes, so the resources are identical for any output size.

As an illustration of the overhead for SHA3-256, our reversible SHA3-256 circuit, having depth \(11040\approx 256^{1.679}\) and a logical qubit count of \(3200\approx 256^{1.455}\) yields \(v \approx 3.134\) at the reversible layer.

7 Fault-Tolerant Cost

The T gate is the most expensive in terms of the resources needed for implementing a circuit fault-tolerantly in a surface code. Most known schemes implement the T gate using an auxiliary resource called a magic state. The latter is usually prepared in a faulty manner, and purified to the desired fidelity via a procedure called magic state distillation. Fault-tolerant magic state distilleries (circuits for performing magic state distillation) require a substantial number of logical qubits. In this section we estimate the additional resources required by distilleries in the particular case of SHA-256 and SHA3-256.

Let \(T^{c}_U\) denote the T-count of a circuit U (i.e., total number of logical T gates), and let \(T^{d}_U\) be the T-depth of the circuit. We denote by \(T^{w}_U = T^c_U/T^d_U\) the T-width of the circuit (i.e., the number of logical T gates that can be done in parallel on average for each layer of depth). Each T gate requires one logical magic state of the form

$$\begin{aligned} |A_L\rangle = \frac{1}{\sqrt{2}}\left( |0_L\rangle + e^{i\pi /4}|1_L\rangle \right) \end{aligned}$$
(8)

for its implementation. For the entirety of U to run successfully, the magic states have to be produced with an error rate no larger than \(p_{out} = 1/T^{c}_U\).

The magic state distillation procedure is based on the following scheme. The procedure starts with a physical magic state prepared with some failure probability \(p_{in}\). This faulty state is injected into an error correcting code, and a suitable distillation procedure performed on the carrier qubits of the encoded state yields a magic state with a smaller failure probability. If this failure probability is still larger than the desired \(p_{out}\), the scheme uses another layer of distillation, i.e. it concatenates the first layer of distillation with a second layer, and so forth. The failure probability thus decreases exponentially.

In our case, we use the Reed-Muller 15-to-1 distillation scheme introduced in [32]. Given a state injection error rate \(p_{in}\), the output error rate after a layer of distillation can be made arbitrarily close to the ideal \(p_{dist}=35p_{in}^3\) provided we ignore the logical errors that may appear during the distillation procedure (those can be ignored if the distillation code uses logical qubits with high enough distance). As pointed out in [33] logical errors do not need to be fully eliminated. We also assume that the physical error rate per gate in the surface code, \(p_g\), is approximately 10 times smaller than \(p_{in}\), i.e. \(p_g = p_{in}/10\), as during the state injection approximately 10 gates have to be performed without a fault before error protection is available (see [13] for more details).

We define \(\varepsilon \) so that \(\varepsilon p_{dist}\) represents the amount of logical error introduced, so \(p_{out} = (1+\varepsilon )p_{dist}\). In the balanced case \(\varepsilon = 1\) the logical circuit introduces the same amount of errors as distillation eliminates. Algorithm 4 [33] summarizes the procedure for estimating the number of rounds of state distillation needed to achieve a given output error rate, as well as the required minimum code distances at each round. Note that \(d_1\) represents the distance of the surface code used in the top layer of distillation (where by top we mean the initial copy of the Reed-Muller circuit), \(d_2\) the distance of the surface code used in the next layer, and so forth.

Algorithm 4. Estimating the number of distillation rounds and the required code distances.
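A rough sketch of the layer-counting part of this procedure follows (the distance selection in Algorithm 4 is not reproduced); the target error rate in the example is the SHA-256 value derived in Sect. 7.1.

def distillation_rounds(p_in, p_out):
    # Each 15-to-1 layer maps an input error rate p to roughly 35 * p^3,
    # ignoring logical errors inside the distillery itself.
    p, rounds = p_in, 0
    while p > p_out:
        p = 35 * p ** 3
        rounds += 1
    return rounds

print(distillation_rounds(1e-4, 7.89e-45))      # 3 layers, matching Sect. 7.1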

7.1 SHA-256

The T-count of our SHA-256 circuit is 228992 (see Table 1), and the T-count of the k-fold controlled-NOT is \(32k - 84\) [12]. With \(k = 256\), we have \(32k - 84 = 8108\), and the total T-count of the SHA-256 oracle \(U_g\) (of Fig. 1) is

$$\begin{aligned} T^c_{U_g} = 2 \times 228992 + 32 \times 256 - 84 = 466092. \end{aligned}$$
(9)

The diffusion operator consists of Clifford gates and a \((k-1)\)-fold controlled-NOT, hence its T-count is \(32 \times 255 - 84 = 8076\). The T-count of one Grover iteration G is therefore

$$\begin{aligned} T^c_{G} = T^c_{U_g} + 32\times 255 - 84 = 474168, \end{aligned}$$
(10)

and the T-count for the full Grover algorithm (let us call it GA) is

$$\begin{aligned} T^c_{GA} = \lfloor \pi /4 \times 2^{128} \rfloor \times 474168\approx 1.27\times 10^{44}.\end{aligned}$$
(11)

For this T-count the output error rate for state distillation should be no greater than \(p_{out}=1/T^c_{GA}\approx 7.89\times 10^{-45}\). Assuming a magic state injection error rate \(p_{in}=10^{-4}\), a per-gate error rate \(p_{g}=10^{-5}\), and choosing \(\varepsilon = 1\), Algorithm 4 suggests 3 layers of distillation, with distances \(d_1=33, d_2=13\) and \(d_3=7\).

The bottom layer of distillation occupies the largest footprint in the surface code. Three layers of distillation consume \(N_{dist} = 16 \times 15 \times 15 = 3600\) input states in the process of generating a single state. These input states are encoded on a distance \(d_3=7\) code that uses \(2.5 \times 1.25 \times d_3^2 \approx 154\) physical qubits per logical qubit. The total footprint of the distillation circuit is then \(N_{dist} \times 154\approx 5.54\times 10^5\) physical qubits. The round of distillation is completed in \(10d_3=70\) surface code cycles.

The middle layer of distillation requires a \(d_2 = 13\) surface code, for which a logical qubit takes \(2.5 \times 1.25 \times d_2^2 \approx 529\) physical qubits. The total number of physical qubits required in the second round is therefore \(16 \times 15 \times 529\approx 1.27\times 10^5\) physical qubits, with the round of distillation completed in \(10 d_2 = 130\) surface code cycles.

The top layer of state distillation requires a \(d_1 = 33\) surface code, for which a logical qubit takes \(2.5 \times 1.25 \times d_1^2 \approx 3404\) physical qubits. The total number of physical qubits required in the top layer is therefore \(16 \times 3404 = 54464\) physical qubits, with the round of distillation completed in \(10 d_1 = 330\) surface code cycles.

Note that the physical qubits required in the bottom layer of state distillation can be reused in the middle and top layers. Therefore the total number of physical qubits required for successfully distilling one purified state is \(n_{dist} = 5.54 \times 10^5\). The concatenated distillation scheme is performed in \(\sigma _{dist} = 70 + 130 + 330 = 530\) surface code cycles. Since the middle layer of distillation has smaller footprint than the bottom layer, distillation can potentially be pipelined to produce \(\phi = (5.54\times 10^5) / (1.27\times 10^5) \approx 4\) magic states in parallel. Assuming, as in [13], a \(t_{sc} = 200\) ns time for a surface code cycle, a magic state distillery can therefore produce 4 states every \(\sigma _{dist} \times t_{sc} \approx 106\,\upmu \)s. Generating the requisite \(T^c_{GA} = 1.27\times 10^{44}\) magic states with a single distillery would take approximately \(t_{dist} = 3.37\times 10^{39}\) s \(\approx 1.06\times 10^{32}\) years.
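The footprint and timing figures above can be reproduced as follows; the constants are the same ones used in the text, and small differences from the quoted values come from rounding the per-qubit counts.

def phys_per_logical(d):
    # physical qubits per logical qubit at distance d (as in [13])
    return 2.5 * 1.25 * d * d

d1, d2, d3 = 33, 13, 7
bottom = 16 * 15 * 15 * phys_per_logical(d3)    # ~5.5e5 physical qubits
middle = 16 * 15 * phys_per_logical(d2)         # ~1.3e5 physical qubits
top = 16 * phys_per_logical(d1)                 # ~5.4e4 physical qubits
cycles = 10 * (d3 + d2 + d1)                    # 530 surface code cycles per batch
print(int(bottom), int(middle), int(top), cycles)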

We now compute the distance required to embed the entire algorithm in a surface code. The number of Clifford gates in one iteration of G is roughly \(8.76\times 10^6\), so the full attack circuit performs around \(2.34 \times 10^{45}\) Clifford gates. The overall error rate of the circuit should therefore be less than \(4.27 \times 10^{-46}\). To compute the required distance, we seek the smallest d that satisfies the inequality [14]

$$\begin{aligned} \left( \frac{p_{in}}{0.0125} \right) ^{\frac{d + 1}{2}} < 4.27 \times 10^{-46}, \end{aligned}$$
(12)

and find this to be \(d = 43\). The total number of physical qubits in the Grover portion of the circuit is then \(2402 \times (2.5\times 1.25\times 43^2) \approx 1.39 \times 10^7\).

We can further estimate the number of cycles required to run the entire algorithm, \(\sigma _{GA}\). Consider a single iteration of G from Fig. 1. The full algorithm has T-count \(T^{c}_{GA} = 1.27 \times 10^{44}\) and T-depth \(T^{d}_{GA} \approx 3.79 \times 10^{43}\), yielding an average T-width of \(T^{w}_{G} = T^c_{GA} / T^d_{GA} \approx 3\), i.e. roughly 3 T gates can be applied in parallel in each layer of T-depth.

Our SHA-256 circuit has 2402 logical qubits. Between sequential T gates on any one qubit we will perform some number of Clifford operations. These are mostly CNOT gates, which take 2 surface code cycles, and Hadamard gates, which take a number of surface code cycles equal to the code distance [13].

Assuming the \(8.76 \times 10^6\) Clifford gates in one Grover iteration are uniformly distributed among the 2402 logical qubits, we expect to perform \(8.76\times 10^6 / (2402 \times T^{d}_{G}) \approx 0.026\) Clifford operations per qubit per layer of T-depth. As a magic state distillery produces 4 magic states per 530 surface code cycles, we can perform a single layer of T-depth every 530 surface code cycles. We thus need only a single distillery, \(\varPhi = 1\). On average about \(2\%\) of the Cliffords are Hadamards, and the remaining \(98\%\) are CNOTs. The expected number of surface code cycles required to implement the roughly 0.025 Clifford gates per qubit in a given layer of T-depth is therefore \(2\% \times 0.025 \times 43 + 98\% \times 0.025 \times 2 \approx 0.071\). As this is significantly lower than 1, we conclude that performing the T gates comprises the largest part of the implementation, while the qubits performing the Clifford gates are idle most of the time. In conclusion, the total number of cycles is determined solely by magic state production, i.e.

$$\begin{aligned} \sigma _{GA} = 530 \cdot T^{d}_{GA} \approx 2^{153.8}. \end{aligned}$$

As discussed in Sect. 3, the total cost of a quantum attack against SHA-256 equals the product of the total number of logical qubits (including the ones used for magic state distillation) and the number of code cycles, which in our case results in

$$\begin{aligned} (2402 + 3600) \cdot \sigma _{GA} \approx 2^{166.4}, \end{aligned}$$

corresponding to an overhead factor of \(v = (166.4 - 128)/\log _2(256) = 4.8\).

7.2 SHA3-256

We perform a similar analysis for SHA3-256. We have \(T^{c}_{U_g} = 2\times 499200 + 32\times 256 - 84 = 1006508\), and \(T^{c}_G = 1006508 + 32\times 255 - 84 = 1014584\), and thus the full Grover algorithm has T-count \(T^c_{GA} = \lfloor \pi /4 \times 2^{128} \rfloor \times 1014584 \approx 2.71 \times 10^{44}\). If we choose, as in the case of SHA-256, \(p_{in} = 10^{-4}, p_g = 10^{-5}\), and \(\varepsilon = 1\), Algorithm 4 yields 3 layers of distillation with distances \(d_1 = 33, d_2 = 13\), and \(d_3 = 7\); these are identical to those of SHA-256. Thus, the distillation code takes 3600 logical qubits (and \(5.54\times 10^5\) physical qubits), and in 530 cycles is able to produce roughly 4 magic states.

We compute the distance required to embed the entire algorithm in a surface code. The total number of Cliffords in one iteration of G is roughly \(6.90\times 10^7\), so the total number will be around \(1.84 \times 10^{46}\) operations. We thus need the overall error rate to be less than \(5.43 \times 10^{-47}\), which by Eq. 12 yields a distance \(d = 44\). The number of physical qubits is then \(3200 \times (2.5\times 1.25\times 44^2) \approx 1.94 \times 10^7\).

Consider a single iteration of G from Fig. 1. We have \(T^c_G = 1014584\) and \(T^d_G = 2\times 432 = 864\), which yields \(T^w_G = 1014584 / 864 \approx 1175\). Above we determined that one distillery can produce 4 magic states in 530 code cycles. To produce 1175 magic states in the same number of cycles we will therefore need roughly \(\varPhi = 294\) distillation factories working in parallel. This will increase the number of physical qubits required for state distillation to \(1.63\times 10^8\). If we assume a \(t_{sc} = 200\) ns cycle time, generation of the full set of magic states will take \(2.28\times 10^{37}\) s, or about \(t_{dist} = 7.23 \times 10^{29}\) years.
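The factory count can be checked with the same figures; a short calculation, with the values as stated in the text:

from math import ceil

states_per_layer = ceil(1014584 / 864)          # T-width of one Grover iteration (~1175)
factories = ceil(states_per_layer / 4)          # each distillery yields ~4 states per 530 cycles
print(states_per_layer, factories, factories * 5.54e5)   # 1175 294 ~1.63e8 physical qubits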

Our SHA3-256 circuit uses 3200 logical qubits. Assuming the \(6.90\times 10^7\) Clifford gates per Grover iteration are uniformly distributed among the qubits, and between the 864 sequential T gates, we must be able to implement \(6.90\times 10^7 / (3200 \times 864) \approx 25\) Clifford operations per qubit per layer of T-depth. As the ratio of CNOTs to Hadamards is roughly 202 to 1, i.e. \(99.5\%\) of the Cliffords are CNOTs and only \(0.5\%\) are Hadamards, the expected number of surface code cycles required to implement the average of 25 Clifford gates in a given layer of T-depth is \(25\times (0.005 \times 44 + 0.995 \times 2) \approx 55\). We have used just enough ancilla factories to implement a single layer of T-depth in 530 cycles, so once again the limiting step is the production of magic states. Hence, we can compute the total number of surface code cycles required to implement SHA3-256 using just the T-depth:

$$\begin{aligned} \sigma _{GA} = 530 \cdot T^{d}_{GA} = 530 \times 864 \times \lfloor \pi /4 \times 2^{128} \rfloor \approx 1.2\times 10^{44} \approx 2^{146.5}. \end{aligned}$$

The total cost of a quantum attack against SHA3-256 is then

$$\begin{aligned} (3200 + 294 \times 3600) \cdot \sigma _{GA} \approx 2^{166.5}, \end{aligned}$$

or an overhead of \(v = (166.5 - 128)/\log _2(256) = 4.81\) (Table 3).

Table 3. Fault-tolerant resource counts for Grover search of SHA-256 and SHA3-256.

8 Conclusions and Open Questions

We estimated the cost of a quantum pre-image attack on the SHA-256 and SHA3-256 cryptographic hash functions via Grover’s quantum searching algorithm. We constructed reversible implementations of both hash functions and optimized their T-count and T-depth. We then estimated the physical resources needed to run a brute force Grover search on a fault-tolerant, surface code based architecture.

We showed that attacking SHA-256 requires approximately \(2^{153.8}\) surface code cycles and that attacking SHA3-256 requires approximately \(2^{146.5}\) surface code cycles. For both SHA-256 and SHA3-256 we found that the total cost when including the classical processing increases to approximately \(2^{166}\) basic operations.

Our estimates are by no means a lower bound, as they are based on a series of assumptions. First, we optimized our T-count by optimizing each component of the SHA oracle individually, which of course is not optimal. Dedicated optimization schemes may achieve better results. Second, we considered a surface code fault-tolerant implementation, as such a scheme looks the most promising at present. However, it may be the case that other quantum error correcting schemes perform better. Finally, we considered an optimistic per-gate error rate of about \(10^{-5}\), which is the limit of current quantum hardware. This number will probably be improved in the future. Improving any of the issues listed above will certainly result in a better estimate and a lower number of operations; however, the decrease in the number of bits of security will likely be limited.