The SQALE of CSIDH: sublinear Vélu quantum-resistant isogeny action with low exponents

Recent independent analyses by Bonnetain–Schrottenloher and Peikert at Eurocrypt 2020 significantly reduced the estimated quantum security of the isogeny-based commutative group-action key-exchange protocol CSIDH. This paper refines the estimates of a resource-constrained quantum collimation sieve attack to give a precise quantum security assessment of CSIDH. Furthermore, we optimize large CSIDH parameters for performance while still achieving NIST security levels 1, 2, and 3. Finally, we provide a constant-time C implementation of these large CSIDH instantiations using the square-root-complexity Vélu formulas recently proposed by Bernstein, De Feo, Leroux and Smith.

applications such as key encapsulation mechanisms, signatures and other primitives. It has remarkably small public keys (in fact, even with the parameter scaling proposed in this paper it still has shorter keys than the four round-3 public-key encryption finalists of the NIST post-quantum standardization process [36]), and allows a highly efficient key-validation procedure. This latter feature makes CSIDH better suited than most (if not all) post-quantum schemes for resisting chosen-ciphertext attacks (CCA) and for supporting static-dynamic and static-static key-exchange settings. On the downside, CSIDH has a significantly higher latency than other isogeny-based protocols such as SIDH and SIKE [3,31]. Furthermore, as this paper discusses in detail, several recent analyses revised CSIDH's true quantum security downwards (see, for example, [12,38]).
The CSIDH framework considers a set of curves with the same $\mathbb{F}_p$-endomorphism ring. Isogenies between curves are represented by ideal classes in this ring, which form a group, so that an ideal $\mathfrak{a}$ can act on a curve $E$ to produce a new curve $E'$. We denote this by $\mathfrak{a} \star E = E'$, and call it the CSIDH group action. One very appealing feature of the CSIDH group action is its commutativity. This allows one to build a key exchange between two parties that mimics the Diffie-Hellman protocol. Starting from a base elliptic curve $E_0$, Alice and Bob first choose secret keys $\mathfrak{a}$ and $\mathfrak{b}$, respectively. They then produce their corresponding public keys by computing the group actions $E_A = \mathfrak{a} \star E_0$ and $E_B = \mathfrak{b} \star E_0$. After exchanging these public keys and taking advantage of the commutativity of the group action, Alice and Bob obtain a common secret by calculating
$$\mathfrak{a} \star E_B = \mathfrak{b} \star E_A = (\mathfrak{a}\mathfrak{b}) \star E_0.$$
The CSIDH protocol introduced in [13] operates on supersingular elliptic curves $E/\mathbb{F}_p$ expressed in the Montgomery model as $E : y^2 = x^3 + Ax^2 + x$. Since $E/\mathbb{F}_p$ is supersingular, one has full control of its order, which is $\#E(\mathbb{F}_p) = p + 1$. The CSIDH protocol chooses $p$ such that $p + 1 = 4\prod_{i=1}^{n} \ell_i$, where $\ell_1, \ldots, \ell_n$ are small odd primes. This enables an efficient computation of degree-$\ell_i$ isogenies, which correspond to the group action of ideals $\mathfrak{l}_i$ of norm $\ell_i$. The most demanding computational task of CSIDH is the evaluation of its class group action, which takes as input an elliptic curve $E_0$, represented by its $A$-coefficient, and an ideal class $\mathfrak{a} = \prod_{i=1}^{n} \mathfrak{l}_i^{e_i}$, represented by its list of exponents $(e_1, \ldots, e_n) \in [-m \mathinner{.\,.} m]^n$. This list of exponents is the CSIDH secret key. The output of the class group action is the $A$-coefficient of the elliptic curve $E_A$ defined as
$$E_A = \mathfrak{a} \star E_0 = \mathfrak{l}_1^{e_1} \star \cdots \star \mathfrak{l}_n^{e_n} \star E_0. \qquad (2)$$
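The Diffie-Hellman-like flow above can be sketched with a toy commutative action. The sketch below is purely structural: `action` stands in for the isogeny-based class group action (it uses modular exponentiation, which is commutative, instead of real isogeny arithmetic), and all parameter values are illustrative.

```python
# Toy sketch of the CSIDH key-exchange flow. The real group action applies
# an ideal class to a curve via isogenies; here exponentiation in (Z/pZ)*
# is a structural stand-in for a commutative group action.
p = 2**61 - 1          # toy modulus (NOT a CSIDH prime)
E0 = 3                 # stand-in for the base curve E_0

def action(secret, curve):
    # "a * E": a commutative action of an integer secret on a "curve"
    return pow(curve, secret, p)

a, b = 123456789, 987654321        # Alice's and Bob's secret keys
EA = action(a, E0)                 # Alice's public key  E_A = a * E_0
EB = action(b, E0)                 # Bob's public key    E_B = b * E_0
shared_alice = action(a, EB)       # a * (b * E_0)
shared_bob = action(b, EA)         # b * (a * E_0)
assert shared_alice == shared_bob  # commutativity gives a common secret
```

The final assertion is exactly the commutativity property the protocol relies on: applying the two secrets in either order yields the same curve.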
The action of each ideal $\mathfrak{l}_i^{e_i}$ in Eq. 2 can be computed by performing $|e_i|$ degree-$\ell_i$ isogeny construction operations, for $i = 1, \ldots, n$. For practical implementations of CSIDH, constructing and evaluating $n$ degree-$\ell_i$ isogenies, plus up to $\frac{n(n+1)}{2}$ scalar multiplications by the prime factors $\ell_i$, dominate the computational cost [15].
Previous works regularly evaluated and constructed degree-$\ell$ isogenies using Vélu's formulas (cf. [28, §2.4] and [42, Theorem 12.16]), which cost $\approx 6\ell$ field multiplications each. Recently, Bernstein, De Feo, Leroux and Smith presented in [6] a new approach for constructing and evaluating degree-$\ell$ isogenies at a combined cost of just $\tilde{O}(\sqrt{\ell})$ field multiplications. Later, it was reported in [2] that constant-time CSIDH implementations using 511- and 1023-bit primes were moderately favored by the new algorithm of [6] for evaluating Vélu's formulas.
CSIDH's Security. The security of CSIDH rests on an analogue of the discrete logarithm problem: given the base elliptic curve $E_0$ and the public-key elliptic curve $E_A$ (or $E_B$), deduce the ideal class $\mathfrak{a}$ (or $\mathfrak{b}$) (see Eq. 2).
From a classical perspective, the security of CSIDH is related to the problem of finding an isogeny path between the isogenous supersingular elliptic curves $E_0$ and $E_A$. Random-walk-based attacks on the whole class group (of size roughly $\sqrt{p}$) have a complexity of $\tilde{O}(\sqrt[4]{p})$ steps with constant space (for more details see [20]). Thus, in order to provide a security level of 128 classical bits, the prime $p$ needs to be large enough to support $2^{256}$ ideal classes; hence the choice of a 512-bit prime in the original CSIDH proposal. The parameter $m$ should then be chosen so that the private key space also comprises $2^{256}$ different secret keys, which we heuristically expect to fill nearly all ideal classes. From a quantum attack perspective, Childs, Jao, and Soukharev tackled in [16] the problem of recovering the secret $\mathfrak{a}$ from the relation $E_A = \mathfrak{a} \star E_0$. They managed to reduce this computational task to the abelian hidden-shift problem on the class group, where the hidden shift corresponds to the secret $\mathfrak{a}$ that one wants to find. Previously, in 2003 and 2004, Kuperberg and Regev had presented two sieving algorithms that could solve this problem in subexponential time if executed in a quantum setting [29,39]. In particular, Kuperberg's procedure has a quantum time and space complexity of just $\exp(O(\sqrt{\log p}))$. Later, in 2011, Kuperberg refined his algorithm by adding a collimation sieving phase [30]. The time complexity of this new variant was still $\exp(O(\sqrt{\log p}))$, but the quantum space complexity was just $O(\log p)$.
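The arithmetic behind these parameter choices is simple enough to check directly. The sketch below reproduces the 512-bit sizing argument and the $\exp(O(\sqrt{\log p}))$ scaling of Kuperberg's sieve; the constant `c` in the quantum exponent is a placeholder, not a value from the literature.

```python
import math

# Classical sizing: a random-walk attack on the class group of size ~sqrt(p)
# costs about p^(1/4) steps, so 128-bit classical security needs a ~512-bit p.
target_classical_bits = 128
prime_bits = 4 * target_classical_bits      # attack cost is 2^(prime_bits/4)
class_group_bits = prime_bits // 2          # class group has ~sqrt(p) elements
assert prime_bits == 512 and class_group_bits == 256

# Quantum: Kuperberg's sieve costs exp(O(sqrt(log p))), so quadrupling log p
# only doubles the attack exponent (illustrative scaling with placeholder c).
def kuperberg_exponent(lg_p, c=1.0):
    return c * math.sqrt(lg_p)

assert kuperberg_exponent(2048) == 2 * kuperberg_exponent(512)
```

This square-root scaling is why the revised CSIDH primes discussed below grow so quickly: each extra bit of quantum security costs many bits of $p$.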
In a nutshell, a Kuperberg-like approach for solving the hidden-shift problem consists of two main components:

1. A quantum oracle that evaluates the group action on a uniform superposition and produces random phase vectors.
2. A sieving procedure that destructively combines low-quality phase vectors into high-quality phase vectors.

The sieving procedure gradually improves the quality of the phase vectors until they can be measured to reveal some bits of the hidden shift, and thus of the CSIDH secret key.
Recent analyses of this quantum algorithm, presented at Eurocrypt 2020 [12,38], point to a significant reduction of the quantum security provided by CSIDH. Concretely, the original 511-bit prime CSIDH instantiation was deemed to achieve NIST security level 1 in [13]. However, the authors of [12] recommended that the size of the CSIDH prime $p$ be upgraded to at least 2260 or 5280 bits, according to what they named the aggressive and conservative modes, respectively.
Both [12] and [38] focus on breaking the originally proposed instantiations of CSIDH, rather than on an exhaustive analysis of the quantum attack. [12] focuses mainly on Kuperberg's first attack and Regev's attack, providing a thorough accounting of a quantum group-action circuit. [38] gives a thorough practical and theoretical analysis of Kuperberg's second algorithm and provides many optimizations. While [38] simulates the full algorithm to give very precise estimates, this method does not extend to the larger primes we consider here because, by design, even the classical aspects of the attack should be infeasible to compute. We use the results of the theoretical analysis in [38] to count resource use without a full simulation. This allows us to evaluate very large primes, to explore depth-width trade-offs, and thus to compare against NIST's security levels. We argue that for the primes we consider, CSIDH's quantum security depends mainly on the cost of the collimation sieve, not on current isogeny-evaluation costs.
The SQALE of CSIDH. We use the acronym SQALE for "Sublinear Vélu Quantum-resistant isogeny Action with Low Exponents." The SQALE of CSIDH is a CSIDH instance such that $p = 4 \prod_{i=1}^{n} \ell_i - 1$ is a prime number with small odd primes $\ell_1, \ldots, \ell_n$, and the key-space size $N \ll \sqrt{p}$ is determined by using only the $k \leq n$ smallest $\ell_i$'s, where the exponents $e_i$ of the ideal class $\mathfrak{a} = \prod_{i=1}^{n} \mathfrak{l}_i^{e_i}$ are drawn from a small range, possibly $\{-1, 0, 1\}$.
The original CSIDH protocol chose exponents large enough that the key space is approximately equal to the class group. We show in Sect. 2 that a SQALE'd CSIDH preserves classical security. We also argue in Sect. 4 that quantum attackers need to attack the entire class group, regardless of the subset that keys are drawn from, so we can choose low exponents and preserve quantum security as well. With this change, we improve the trade-off between the performance of the key exchange and its quantum security. To further improve the performance of the large CSIDH instances considered in this paper, we incorporate the improved $\tilde{O}(\sqrt{\ell})$ Vélu algorithm for isogeny computations.
In a related idea, the isogeny-based signature scheme SeaSign presented in [19] uses the notion of lossy keys, where the ideals $\prod_i \mathfrak{l}_i^{e_i}$ cover only a small part of the class group. The security guarantees of SeaSign are partially based on the computational assumption that it is hard to distinguish the special case of lossy keys from uniform ideal classes (see [19, §8.1]).
As an aside, note that increasing the size of the prime makes it impossible with current technology to compute the class group structure, as was done for the CSIDH-512 prime to derive related schemes such as the CSI-FiSh signature scheme [9]. Quantum computing would allow efficient computation of larger class groups in the future, but this does not affect the CSIDH scheme itself, and a scheme like CSI-FiSh is incompatible with our idea of low exponents anyway.
Outline. In this work, we present a detailed classical and quantum cryptanalysis of CSIDH and its constant-time C implementation using our revised prime sizes, which, according to our analysis, are required to achieve NIST security levels 1, 2 and 3 (Table 1). Section 2 gives background on CSIDH, efficient methods for computing its group action, and the quantum cost models we use. In Sect. 3 we describe the quantum collimation sieve attack and explain how to estimate its cost. We account for larger primes, depth limits, and improved memory circuits, and find several small optimizations. The sieve only seems able to attack the full class group, and not any smaller generating subset. We give several arguments for this in Sect. 4, ultimately concluding that for a quantum attacker, only the size of the class group affects the total quantum attack cost. These conclusions suggest that an ideal scheme will operate on isogenies of a number of degrees, but with small exponents for each. Section 5 summarizes the quantum and classical security and the effects of hardware limits.
We then give a concrete cost analysis of the CSIDH group action for a key exchange with different sizes of the prime $p$ in Sect. 6. We account for different choices of the exponent interval $[-m \mathinner{.\,.} m]$, from the minimal setting $[-1 \mathinner{.\,.} 1]$ (with or without zero) up to the original proposal of $[-5 \mathinner{.\,.} 5]$. For each interval, we apply the framework reported in [15] to select optimal bounds (a different $m_i$ for each prime) and their corresponding optimal strategies. Starting from the Python-3 CSIDH library reported in [2], we present the first constant-time implementation of large CSIDH instantiations supporting the $\tilde{O}(\sqrt{\ell})$ isogeny-evaluation algorithm from [6]. Our C library also includes a companion script that estimates quantum attack costs. Our software is freely available at https://github.com/JJChiDguez/sqale-csidh-velusqrt.

Background
This section presents some of the main concepts required for performing classical and quantum attacks on CSIDH.

Construction and evaluation of odd-degree isogenies using the square-root Vélu algorithm
Let $\ell$ be an odd prime number, $\mathbb{F}_p$ a finite field of large characteristic, and $A$ the Montgomery coefficient of an elliptic curve $E_A/\mathbb{F}_p : y^2 = x^3 + Ax^2 + x$. Given an order-$\ell$ point $P \in E_A(\mathbb{F}_p)$, we consider the construction of an isogeny $\phi : E_A \to E_{A'}$ of kernel $\langle P \rangle$ and its evaluation at a point with $x$-coordinate $\alpha \in \mathbb{F}_p$. Using the recent Vélu square-root algorithm (aka $\sqrt{\text{élu}}$) as presented by Bernstein, De Feo, Leroux and Smith in [6], both the codomain coefficient $A'$ and the image $\phi_x(\alpha)$ can be computed from a few evaluations of the kernel polynomial (see also [17,33,34] and [2]). Hence, the main cost of computing $A'$ and $\phi_x(\alpha)$ corresponds to the computation of $h_S(X)$. Given $E_A/\mathbb{F}_p$, an order-$\ell$ point $P \in E_A(\mathbb{F}_p)$, and some value $\alpha \in \mathbb{F}_p$, we want to efficiently evaluate the polynomial
$$h_S(\alpha) = \prod_{i \in S} \big( \alpha - x([i]P) \big), \quad S = \{1, 3, 5, \ldots, \ell - 2\}.$$
This suggests a rearrangement à la baby-step giant-step, where $s$ is a fixed integer representing the size of the giant steps and $I$, $J$ are two sets of indices such that $I \pm sJ$ covers $S$.
Now $h_S(\alpha)$ can be efficiently computed by calculating resultants of two polynomials in $\mathbb{F}_p[Z]$, one encoding the baby steps and one the giant steps. The most demanding operations of $\sqrt{\text{élu}}$ require computing four different resultants. Those four resultants are computed using a remainder-tree approach supported by carefully tailored Karatsuba polynomial multiplications. In practice, the computational cost of computing degree-$\ell$ isogenies using $\sqrt{\text{élu}}$ is close to $K (\sqrt{\ell})^{\log_2 3}$ field operations for a constant $K$. For more details about these computations see [2,6].
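To make the index rearrangement concrete, the sketch below builds one simple decomposition of $S = \{1, 3, \ldots, \ell-2\}$ into giant steps $I$, baby steps $J$, and a leftover set $K$, and checks that the regrouped product equals the direct one. The specific index sets and the stand-in values for $x([s]P)$ are illustrative; [6] uses a more refined construction and evaluates the regrouped product via resultants.

```python
# Toy baby-step giant-step split of S = {1,3,...,l-2}, as used to evaluate
# h_S(alpha) = prod_{i in S} (alpha - x_i). The index sets below are one
# simple choice covering S; the exact sets in [6] differ.
l, q = 101, 10007                      # toy isogeny degree and field modulus
S = list(range(1, l - 1, 2))
b = 3                                  # number of baby steps
J = list(range(1, 2 * b, 2))           # baby steps: odd numbers below 2b
t = (l - 2) // (4 * b)                 # number of giant-step blocks
I = [4 * b * i + 2 * b for i in range(t)]   # giant steps (block centers)
covered = {i + j for i in I for j in J} | {i - j for i in I for j in J}
K = [s for s in S if s not in covered]      # leftover indices
assert covered | set(K) == set(S) and covered <= set(S)

x = {s: (7 * s * s + 5) % q for s in S}     # stand-in for x([s]P)
alpha = 1234
direct = 1
for s in S:
    direct = direct * (alpha - x[s]) % q
split = 1
for i in I:
    for j in J:                             # pairs (alpha - x_{i+j})(alpha - x_{i-j})
        split = split * (alpha - x[i + j]) * (alpha - x[i - j]) % q
for s in K:                                 # mop up the leftover indices
    split = split * (alpha - x[s]) % q
assert direct == split                      # regrouping preserves the product
```

With $\ell = 101$ the decomposition covers the odd indices $1$ through $95$ and leaves $K = \{97, 99\}$; the point of the regrouping is that the double product over $I \times J$ is exactly what a resultant of two small polynomials computes in $\tilde{O}(\sqrt{\ell})$ operations.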

Summary of CSIDH
Here, we give a general description of CSIDH. A more detailed description of the CSIDH group action computation can be found in [13,14,32,37]. The most demanding computational task of CSIDH is evaluating its class group action, whose cost is dominated by performing a number of degree-$\ell_i$ isogeny constructions. Roughly speaking, three major variants for computing the CSIDH group action have been proposed, which we briefly outline next.
Let $\pi : (x, y) \mapsto (x^p, y^p)$ be the Frobenius map and $N \in \mathbb{Z}$ a positive integer. Working now with points over the extension field $\mathbb{F}_{p^2}$, let $E[N]$ denote the $N$-torsion subgroup of $E/\mathbb{F}_{p^2}$, defined as $E[N] = \{P \in E(\mathbb{F}_{p^2}) : [N]P = \mathcal{O}\}$. Note that $E[\pi - 1]$ corresponds to the original set of $\mathbb{F}_p$-rational points, whereas $E[\pi + 1]$ is the set of points of the form $(x, iy)$ where $x, y \in \mathbb{F}_p$ and $i = \sqrt{-1}$, so that $i^p = -i$. We call $E[\pi + 1]$ the set of zero-trace points.
The MCR style [32] of evaluating the CSIDH group action takes as input a secret integer vector $e = (e_1, \ldots, e_n)$ such that $e_i \in [0 \mathinner{.\,.} m]$. From this input, isogenies with kernel generated by a point $P \in E_A[\ell_i] \cap E_A[\pi - 1]$ are constructed for exactly $e_i$ iterations. In the case of the OAYT style [37], the exponents are drawn from $e_i \in [-m \mathinner{.\,.} m]$, and $P$ lies either in $E_A[\ell_i] \cap E_A[\pi - 1]$ or in $E_A[\ell_i] \cap E_A[\pi + 1]$ (the sign of $e_i$ determines which one is used). We stress that in constant-time implementations of CSIDH adopting the MCR and OAYT styles, the group action evaluation starts by constructing isogenies with kernel generated by such a point $P$ for $|e_i|$ iterations, followed by dummy isogeny constructions that are performed for the remaining $(m - |e_i|)$ iterations.
On the other hand, the dummy-free constant-time CSIDH group action evaluation proposed in [14] takes as input a secret integer vector $e = (e_1, \ldots, e_n)$ such that $e_i \in [-m \mathinner{.\,.} m]$ has the same parity as $m$. One starts by constructing isogenies with kernel generated by a point in $E_A[\ell_i] \cap E_A[\pi \mp 1]$ (according to the sign of $e_i$) for exactly $|e_i|$ iterations. Thereafter, one alternately computes isogenies in the two directions for the remaining $m_i - |e_i|$ iterations, so that their effects cancel out (for more details see [14]).
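The constant-time idea behind the MCR/OAYT styles can be sketched in a few lines: the loop always performs $m$ iterations per prime, independent of the secret $e_i$. The function below is a hypothetical trace model, not real curve arithmetic; in an actual implementation the dummy step performs the same field operations as a real isogeny and the branch is replaced by branchless masking.

```python
# Constant-trace sketch of dummy-based CSIDH evaluation (OAYT-style,
# e_i in [-m..m]): always run m iterations per prime, |e_i| real isogeny
# steps plus (m - |e_i|) dummy steps, so the iteration count is secret-
# independent. "isogeny"/"dummy" are placeholders for the curve operations.
def group_action_trace(e, m):
    trace = []
    for e_i in e:
        real = abs(e_i)
        for it in range(m):
            # Real code selects real-vs-dummy with constant-time masking,
            # not a branch; both steps cost the same field operations.
            trace.append("isogeny" if it < real else "dummy")
    return trace

# Two different secrets produce traces of identical length.
t1 = group_action_trace([2, -1, 0], m=3)
t2 = group_action_trace([-3, 3, 1], m=3)
assert len(t1) == len(t2) == 9
```

The dummy-free variant of [14] removes the dummy steps by instead alternating the isogeny direction for the leftover iterations, which is why it requires $e_i$ and $m$ to share the same parity.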

Quantum computing
We refer to [35] for the basics and notation of quantum computing. Following [26], we treat a quantum computer as a memory peripheral of a classical computer, which can modify the quantum state with certain operations called "gates." We give the cost of a quantum algorithm in terms of these operations (specifically Clifford+T gates), which we treat as a classical computation cost. With this we can directly add and compare quantum and classical costs, since we measure quantum computation costs in classical operations. We use the "$DW$"-cost, which assumes that the controller must actively correct all the qubits at every time step to prevent decoherence. This means the total cost is proportional to the total number of qubits (the "width") times the total circuit depth.
We depart from [26] by charging an overhead of $2^{10}$ classical operations for each unit of $DW$-cost, to represent the overhead of quantum error correction. With surface-code error correction, every logical qubit is formed of many physical qubits, which continuously run through measurement cycles. We assume each cycle of each physical qubit is equivalent to a classical operation. By this metric, Shor's algorithm has an overhead of $2^{17}$ for each logical gate [23]. The algorithm we analyze will need much more error correction, but we assume continuing advances in quantum error correction will reduce this overhead to $2^{10}$. Since a surface code needs to maintain a distance between logical qubits in two physical dimensions and one dimension of time [21], we assume the $2^{10}$ overhead is the cube of the code distance, and thus every logical qubit is composed of $2^{10 \cdot \frac{2}{3}}$ physical qubits.
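The cube-root accounting above can be checked with two lines of arithmetic: if the $2^{10}$ overhead is the cube of the code distance (two spatial dimensions plus one of time), the implied distance is $2^{10/3} \approx 10.1$ and each logical qubit needs about $2^{20/3} \approx 102$ physical qubits.

```python
import math

# Error-correction accounting: 2^10 classical operations per unit of
# DW-cost, read as (code distance)^3: two spatial dimensions plus time.
distance = 2 ** (10 / 3)                # implied surface-code distance, ~10.1
physical_per_logical = distance ** 2    # physical qubits per logical, ~101.6
assert math.isclose(distance ** 3, 2 ** 10, rel_tol=1e-9)
assert math.isclose(physical_per_logical, 2 ** (10 * 2 / 3), rel_tol=1e-9)
```

For comparison, the $2^{17}$ overhead quoted for Shor's algorithm corresponds to a distance of $2^{17/3} \approx 51$ under the same accounting, which shows how much the assumed error-correction progress matters to the final security numbers.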

Quantum attack
We follow Peikert [38] and analyze only Kuperberg's second algorithm [30]. Because of this, and our assumption that classical operations are only $2^{10}$ times cheaper than quantum ones, the trade-offs of [10,11] do not help our analysis.
Kuperberg's algorithm can be divided into three stages:

1. Constructing phase states: we compute an arbitrary isogeny action in superposition, perform a quantum Fourier transform, then measure the result. This leaves a single qubit in a random phase state with some associated classical data, which forms the input to the next stage.
2. A sieving stage: we use a process called "collimation" to destructively combine phase states into "better" phase states. This requires some quantum arithmetic, but the main costs are quantum access, in superposition, to a large table of classical memory, and subsequent classical computations on this table.
3. A measurement stage: we measure a sufficiently "good" phase state and recover some number of bits of the secret key.
We repeat these steps until we recover enough bits of the secret key to exhaustively search the remainder. Asymptotically, the sieving stage is the most costly, so we focus on that. In Sect. 3.6 we justify our choice to ignore the cost of constructing phase states.

Overview of Kuperberg's algorithm
We start with an abelian group G (the class group) of order N and two injective functions f : G → X and h : G → X such that h(x) = f (x − S) for some secret S. For this description we assume G is cyclic. This is generally untrue for class groups, but a quantum attacker can recover the group structure as a polynomial-cost precomputation (see [12,Section 4]). They can then decompose the group into cyclic subgroups, perform a quantum Fourier transform on each, and collimate them independently. The total amount of collimation will be the same, so we focus on a cyclic group as it is easier to describe.
For CSIDH, the function f will identify an element of the class group with an isogeny from E 0 to some other curve E, and output the j-invariant of that curve. The function h is the same, but starts with a public key curve E A .
To begin, we generate a superposition over $G$ (ignoring normalization), $\sum_{g \in G} |g\rangle$. Then we initialize a single qubit in the state $|+\rangle = |0\rangle + |1\rangle$, and use it to control applying either $f$ or $h$:
$$\sum_{g \in G} \big( |0\rangle|g\rangle|f(g)\rangle + |1\rangle|g\rangle|h(g)\rangle \big).$$
Then we measure the final register, finding $f(g) = h(g + S)$ for some $g$. Because $f$ and $h$ are injective, this leaves only two states in superposition:
$$|0\rangle|g\rangle + |1\rangle|g + S\rangle.$$
This is the ideal state. Naive representations of the group will not produce precisely this state. Section 4.1 explains why our best option is to fix a generator $g$ and produce superpositions $\sum_{x=0}^{N-1} |x\rangle|xg\rangle$, which leads to a final state
$$|0\rangle|x\rangle + |1\rangle|x + s \bmod N\rangle, \qquad (5)$$
where the secret satisfies $S = sg$. At this point, we apply a quantum Fourier transform (QFT), modulo the group order $N$, to the index register, to produce
$$\sum_{b=0}^{N-1} \left( e^{2\pi i \frac{bx}{N}}|0\rangle + e^{2\pi i \frac{b(x+s)}{N}}|1\rangle \right) |b\rangle. \qquad (6)$$
Then we measure the final register and find some value $b$, leaving us (up to a global phase) with the state
$$|0\rangle + e^{2\pi i \frac{bs}{N}}|1\rangle. \qquad (7)$$
From this point, we define $\zeta_b^s = e^{2\pi i \frac{bs}{N}}$. We emphasize that it is critical that the QFT acts as a homomorphism between the elements of the group and phases modulo $N$, even an approximate homomorphism as in [12].
A classical computer with knowledge of s can easily simulate input phase vectors, and the cost of the remainder of the algorithm is mainly classical. Peikert thus simulated the remaining steps of the algorithm for a precise security estimate [38]. We hope to choose parameters such that the remaining steps are infeasible, so we cannot classically simulate them. Instead we extrapolate Peikert's results to estimate the full cost, with some small algorithmic improvements we now describe.
Phase vectors with data. Kuperberg works with states of the form in Eq. 7 to save quantum memory; however, we will maintain the factor b in quantum memory.
We define a phase vector with data to have a length $L$, a height $S$, an altitude $A$, and a phase function $B : \{0, \ldots, L-1\} \to \{0, \ldots, S-1\}$ whose values are multiples of $A$, representing the state
$$\sum_{j=0}^{L-1} \zeta_{B(j)}^{s} |j\rangle |B(j)\rangle.$$
The phase function $B$ is known classically. The vector in Eq. 7 almost has this form, with $L = 2$, $B(0) = 0$ and $B(1) = b$ (in fact $B(0) = 0$ for all phase vectors), and $S = b$. To add the data to it, we simply use the qubit to control a write of the value of $b$ to a new register.
Starting from an initial phase vector with data, we can double its length using a new initial phase vector. We describe the procedure for power-of-two lengths, which is much easier, though other lengths are possible with relabeling. We first concatenate the new phase vector, then treat the new qubit as the most significant bit of the index $j$:
$$\left( \sum_{j=0}^{L-1} \zeta_{B(j)}^{s} |j\rangle|B(j)\rangle \right) \otimes \big( |0\rangle + \zeta_{b'}^{s} |1\rangle \big).$$
On the left sum, the leading bit of $j$ is 0, and on the right it is 1. We then redefine the phase function to be $B' : \{0, \ldots, 2L-1\} \to \mathbb{Z}$ with $B'(j) = B(j \bmod L) + b' \lfloor j/L \rfloor$. To update the phase register, we perform an addition of $b'$, controlled on the first qubit (which is now the leading bit of the index $j$). The state is now twice as long, at the cost of just one quantum addition and classical processing of the table of values representing $B$.
We can produce initial phase vectors with data of length $L = 2^c$ by starting with an initial phase vector, adding its phase function to a quantum register, then repeating this doubling process $c - 1$ times. The height of such a vector will be the maximum of $c$ uniformly random values from 0 to $2^n$; we assume this is simply $2^n$. The altitude will be the greatest common divisor of these values, and we assume this is 1.
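Because the phase function $B$ is classical, an attacker-side simulation of phase vectors with data needs no quantum state at all. The sketch below (a simulator that knows the shift $s$, as in Peikert's experiments; all names and parameters are illustrative) builds a length-16 vector by repeated doubling and checks it against the definition.

```python
import cmath
import random

# Classical simulation of "phase vectors with data": a length-L vector is
# fully described by its classical phase table B, where B[j] is the phase
# multiplier of basis state |j>. Doubling with a fresh measured value b
# concatenates B and [v + b for v in B], mirroring the controlled addition.
N = 2 ** 16                                  # toy group order
s = 12345                                    # hidden shift (known to simulator)

def new_phase_vector():
    b = random.randrange(N)                  # random measured QFT value
    return [0, b]                            # B(0) = 0, B(1) = b

def double(B, b):
    return B + [v + b for v in B]            # new qubit = top bit of index j

B = new_phase_vector()
for _ in range(3):                           # three doublings: length 2 -> 16
    B = double(B, random.randrange(N))
assert len(B) == 16 and B[0] == 0

# The represented state is sum_j exp(2*pi*i*B[j]*s/N) |j>|B[j]>.
amplitudes = [cmath.exp(2j * cmath.pi * b_j * s / N) for b_j in B]
assert all(abs(abs(a) - 1) < 1e-9 for a in amplitudes)
```

Each doubling consumes one fresh initial phase vector and one quantum addition, so a length-$2^c$ vector costs $c$ oracle calls, which is the accounting used for the leaf nodes of the sieve tree later on.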
The next part of the algorithm is to collimate phase vectors until their height approximately equals their length. A collimation takes $r$ phase vectors of some length $L$, height $S$, and altitude $A$, and destructively produces a new phase vector of length $L'$, height $S'$, and altitude $A'$, where $S' < S$ and $A' \geq A$. For efficiency, we try to keep $L' = L$.
Once the height equals the length, say $S_0$, we perform a QFT and hopefully recover $\lg S_0$ bits of the secret $s$, starting from the bit at position $\lg A$. To recover all of the secret bits, we run the same process but target different bits each time, sequentially or in parallel. Classical simulations show that each run recovers only $\lg S_0 - 2$ bits on average [38].
Adaptive Strategy. The length of the register in Eq. 5, which undergoes the QFT, governs the cost of the sieve. Ideally, after finishing one sieve, we would use the known bits of the secret to reduce the size of the problem. For example, if the group order is $N = 2^n$ for some $n$, the secret is $s = s_1 2^k + s_0$, and we know $s_0$, then we start with a state $|0\rangle|x\rangle + |1\rangle|x + s \bmod 2^n\rangle$ for some random, unknown $x$.
We can subtract $s_0$ from the second register, controlled on the first qubit, to obtain
$$|0\rangle|x\rangle + |1\rangle|x + s_1 2^k \bmod 2^n\rangle.$$
The least significant $k$ bits of the second register are the same in both branches, so we can remove or measure them and only apply the QFT to the remaining bits. Then our initial phase vectors start with a height of $2^{n-k}$, rather than $2^n$. This is Kuperberg's original technique. Peikert analyzed a non-adaptive attack, using a high-bit collimation in the case of non-smooth group orders. We remain uncertain whether an attack can be adaptive with a prime-order group. With prime orders, there is little correlation between the bits of $x$ and $x + s \bmod N$, even if we know most of the bits of $s$.
Alternatively, we could represent group elements by exponent vectors. In that case, we end up with a state of the form
$$|0\rangle|\mathbf{x} \bmod \mathcal{L}\rangle + |1\rangle|\mathbf{x} + \mathbf{s} \bmod \mathcal{L}\rangle,$$
where $\mathcal{L}$ is the lattice representing the kernel of the map from exponent vectors to class group elements. However, a direct, bit-wise QFT does not define a homomorphism from vectors modulo a lattice to phases (see Sect. 4.1).
We could try to represent integer exponent vectors $\mathbf{x}$ by reduced vectors $\mathbf{v}$ for which the QFT behaves homomorphically, but it is unclear how. Indeed, it is possible that adaptive sieving on a prime-order group is inherently difficult. There is a large gap between the classical difficulty of discrete logarithms in prime-order groups and in smooth-order groups, so a similar gap may exist in the closely related abelian hidden-shift problem. In summary, we assume that partial knowledge of the bits of a secret $s$ in an abelian hidden-shift problem gives no advantage in finding the unknown bits for groups of prime order. More formally:

Assumption 1
If it costs $C$ to recover $t$ secret bits in an abelian hidden-shift problem for a group of prime order, it will still cost $\max\{C, O(2^{n-k})\}$ to recover $t$ bits even if $k$ bits out of $n$ are already known.
Each run of the sieve recovers about $\lg S_0 - 2$ bits on average, so the total number of sieves is $\frac{\lg N}{\lg S_0 - 2}$. If this assumption is wrong, then in the worst case the total sieving cost will be dominated by the first run of the sieve, leading to a reduction of $\approx 7$ bits of security.

Collimation
From $r$ vectors of length $L$ and height $S$, we repeatedly collimate to a height $S'$ as follows: first we concatenate the vectors and add their phase functions together, so that the phase register holds the combined phase value. The addition is done in place on one of the phase registers.
The resulting state (written for $r = 2$) will be
$$\sum_{j_1, j_2} \zeta_{B_1(j_1) + B_2(j_2)}^{s} |j_1, j_2\rangle |B_1(j_1) + B_2(j_2)\rangle.$$
Then we divide the summed phase value $B(\mathbf{j}) = B_1(j_1) + B_2(j_2)$ by $S'$ and compute the quotient and remainder. We then measure the quotient $\lfloor B(\mathbf{j})/S' \rfloor$, which gives some value $K$. Let $J \subseteq [L] \times [L]$ be the set of index pairs $(j_1, j_2)$ such that $\lfloor (B_1(j_1) + B_2(j_2))/S' \rfloor = K$. Since we know $K$, $B_1$, and $B_2$ classically, we can find the set $J$ and use it to construct a permutation $\pi : J \to \{0, \ldots, |J|-1\}$ relabeling the surviving indices. A short calculation shows that the factor of $K$ only introduces a global phase and thus we can ignore it.
We now fix up the phase vector that is left after measurement. First, we must erase $B_1(j_1)$. We use a quantum random-access classical memory (QRACM) look-up uncomputation, which only needs to look up values of $j_1$ that are part of a pair in $J$. We expect $L$ such values.
Then we compute $\pi(\mathbf{j})$ in another register. This is a QRACM look-up from a table of $L'$ indices with words of size $\lg L'$. Letting $j' = \pi(\mathbf{j})$, this leaves the state
$$\sum_{j'=0}^{L'-1} \zeta_{B'(j')}^{s} |\pi^{-1}(j')\rangle |j'\rangle |B'(j')\rangle.$$
We now do a QRACM look-up uncomputation in a table of $L'$ indices to erase $\pi^{-1}(j')$. This technique is analogous for $r > 2$. We uncompute with a single look-up; we can do this because each value of $j_i$ that appears in a tuple in $J$ likely appears in a unique tuple, since there are $L$ possible values of $j_i$ and only about $L'$ tuples in $J$. Since this is an uncomputation, the extra word size is irrelevant [4]. The greatest cost here seems to be computing the permutation $\pi$.
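Since $B_1$, $B_2$, and the measured $K$ are all classical, the collimation bookkeeping can be simulated entirely classically. The sketch below runs one $r = 2$ collimation step with toy parameters, sampling the measured quotient $K$ from the correct distribution and checking that every surviving phase lies below the new height $S'$ (written `S2` here).

```python
import random

# Classical simulation of one collimation step (r = 2): combine two phase
# vectors of height S into one of height S2 by "measuring" the quotient K
# and keeping the index pairs whose summed phase lies in [K*S2, (K+1)*S2).
random.seed(1)
L, S, S2 = 256, 2 ** 20, 2 ** 12
B1 = sorted(random.randrange(S) for _ in range(L))
B2 = sorted(random.randrange(S) for _ in range(L))

# Measurement: the superposition is uniform over all L*L pairs, so pick a
# uniformly random pair (j1, j2) and read off its quotient K.
j1, j2 = random.randrange(L), random.randrange(L)
K = (B1[j1] + B2[j2]) // S2

# Surviving index set J and new phase function B'; the K*S2 offset is only
# a global phase, so it is dropped.
J = [(a, b) for a in range(L) for b in range(L)
     if (B1[a] + B2[b]) // S2 == K]
B_new = [(B1[a] + B2[b]) % S2 for (a, b) in J]
assert (j1, j2) in J                      # the measured pair always survives
assert all(0 <= v < S2 for v in B_new)    # all new phases fit below S2
# The new length concentrates around c_2 * L^2 * S2 / S (a few hundred here).
```

The quantum cost of the real step is only the arithmetic and the QRACM look-ups; everything in this sketch other than the measurement is what the classical co-processor computes to build the look-up tables.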
QRAM. Collimations repeatedly perform look-ups in quantum random-access classical memory (QRACM), also known as quantum read-only memory (QROM). Given a large table of classical data $T = [t_0, \ldots, t_{n-1}]$ of $w$-bit words, we want a circuit to perform the following map:
$$|i\rangle|x\rangle \mapsto |i\rangle|x \oplus t_i\rangle.$$
The simplest method is a sequential look-up from Babbush et al. [4], while Berry et al. [8] provide a version that parallelizes nicely. Beyond the minimum depth of that circuit, we use a wide circuit (Fig. 1). Our cost estimation checks the cost of each of these circuits and chooses whichever has the lowest cost under each depth constraint; often this is Berry et al.'s circuit with $k \approx 8$.
Following Peikert, we assume that if our target length is $L$, the actual look-ups will need to access $L_{\max} = 8L$ words.
Memory latency has no effect on our final costs. For both the look-ups and the permutation computation, we added a depth of $(100W)^{1/2}$, where $W$ is the total hardware (classical and quantum) needed. Signal propagation across a single bit should be faster than execution of a single gate, which is our unit of depth, so $(100W)^{1/2}$ should safely overestimate the latency of accessing two-dimensional memory. This still had no effect on our final costs except under extreme conditions of more than about $2^{130}$ classical processors.

Permutation
To compute the permutation $\pi$, we start with $r$ sorted lists of $L$ elements in the range $[S]$. We want to find all tuples that add up to a specified value $K$ in $[rS]$. For our estimation, we checked the cost of three different approaches for different $r$ and chose the cheapest, which was often $r = 2$.
Problem 1 (Collimation permutation) Let $L$, $S_1$, and $S_2$ be integers such that $S_1 \geq S_2 \geq L$. On input of $r$ sorted lists $B_1, \ldots, B_r$ of $L$ random numbers from 0 to $S_1$ and an integer $K$, list all $r$-tuples from $B_1 \times \cdots \times B_r$ such that their sum is in $\{K S_2, K S_2 + 1, \ldots, K S_2 + S_2 - 1\}$.
One approach is to iterate through all $(r-1)$-tuples of elements from $B_1 \times \cdots \times B_{r-1}$, compute the sum of each tuple, then search through $B_r$ to find all elements that produce a sum in the correct range. This has a cost of approximately $L^{r-1} \lg L$, since we expect to check only a $1/L^{r-1}$ fraction of the elements in $B_r$ for each $(r-1)$-tuple. With appropriate read-write controls, this parallelizes perfectly.
The structure of the sieve guarantees $S_2 \geq L^r$ for all but the final collimation. This means we cannot guess a value for the sum of the first $r/2$ lists and then search for a matching sum in the remaining lists, because we would need to guess $\frac{r}{2} S_2$ values, raising the cost over $L^r$. This prevents divide-and-conquer strategies like those for subset-sum, as in [11].
A lower-cost but memory-intensive algorithm first merges $s$ of the lists into a single sorted list of $L^s$ $s$-tuples and their sums, at cost $L^s (s \lg L)$. Then it exhaustively iterates over the remaining $L^{r-s}$ tuples and searches for matches in the merged, sorted list. The total cost is $O(L^s + L^{r-s} s \lg L)$. We choose $s = \lceil r/2 \rceil$.
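The merge-based approach can be sketched directly for $r = 4$, $s = 2$: merge $B_1 \times B_2$ into a sorted list of pair-sums, then scan $B_3 \times B_4$ with binary search. The toy sizes below let us verify the output against brute force; all parameter values are illustrative.

```python
import bisect
import itertools
import random

# Merge-based solver for the collimation-permutation problem (r = 4, s = 2):
# sort all sums from B1 x B2, then for each pair in B3 x B4 binary-search
# the window of sums landing in [K*S2, K*S2 + S2 - 1].
random.seed(7)
L, S1, S2, r = 12, 10 ** 4, 500, 4
lists = [sorted(random.randrange(S1) for _ in range(L)) for _ in range(r)]
K = 30
lo, hi = K * S2, K * S2 + S2            # target sums lie in [lo, hi)

merged = sorted((a + b, a, b) for a in lists[0] for b in lists[1])
sums = [m[0] for m in merged]
found = set()
for c, d in itertools.product(lists[2], lists[3]):
    left = bisect.bisect_left(sums, lo - c - d)
    right = bisect.bisect_left(sums, hi - c - d)
    for _, a, b in merged[left:right]:
        found.add((a, b, c, d))

# Cross-check against the exhaustive O(L^r) enumeration.
brute = {t for t in itertools.product(*lists) if lo <= sum(t) < hi}
assert found == brute
```

The merge costs $O(L^2 \lg L)$ and the scan costs $O(L^2 \lg L)$ plus output size, matching the $O(L^s + L^{r-s} s \lg L)$ estimate above once the lists are large enough for the logarithmic factors to dominate.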
We assume both classical approaches parallelize perfectly, but we track the total number of classical processors required to fit within any depth limit.
Grover's algorithm. A simple quantum approach is Grover's algorithm, searching through the set of $L^r$ $r$-tuples for those whose sum is in the correct range. This requires $O(L^{r/2})$ iterations, but each iteration requires $r$ look-ups, which each cost $O(L)$. Each Grover search returns one possible tuple, creating a coupon-collector problem, so we repeat the Grover search $L \lg L$ times. The cost thus grows as $L^{\frac{r+3}{2}} \lg L$, which improves on the classical approach for $r \geq 5$.
The cost of Grover's algorithm gets much worse under a depth limit. Grover oracles should minimize their depth as much as possible, and since the look-up circuits parallelize almost perfectly, we analyze only the wide look-up as a Grover oracle subroutine. We assume the $L \lg L$ search repetitions run in parallel as well.

Sieving
To find the cost of each sieve repetition, we first find the depth of the tree of sieves. We follow [5] to derive some facts about the distribution of phase vectors after sieving. Let $K = \{K_1, \ldots, K_s\}$ be the set of all possible measurement results from a collimation. We treat each of the $L^r$ states in superposition as i.i.d. random variables $X_i$ with values in $K$, defining $p_m = P[X = K_m]$. Since the states are in uniform superposition, we imagine that measurement selects one such state $X_j$. Let $W_j$ be the number of other states in the superposition with the same value as $X_j$. Conditioned on the measured value $K_m$, each of the other $L^r - 1$ states matches independently with probability $p_m$, so
$$E[W_j] = (L^r - 1) \sum_{m=1}^{s} p_m^2 \approx L^r \sum_{m=1}^{s} p_m^2.$$
The size of the collimated list is the expected value of $W_j$. In the first layer of collimation $X$ is uniformly random, so $p_m = 1/s$ and the expected list size is $L^r/s$. To find $p_m$ for later collimations, we assume $X$ is a sum of $r$ i.i.d. uniformly random variables with values in $[0, \ldots, s]$, where $s = S_i/S_{i+1}$. By the central limit theorem this converges to an $N(r\mu, r\sigma^2)$ random variable, where $\mu = s/2$ and $\sigma^2 \approx \frac{s^2}{12}$. We approximate $\sum_{m=1}^{s} p_m^2$ as the integral of the square of the probability density function of $N(\mu', \sigma'^2)$, which is $\frac{1}{2\sqrt{\pi}\sigma'}$; applying this with $\sigma' = \sigma\sqrt{r}$ gives
$$E[W_j] \approx \frac{L^r}{2\sqrt{\pi r}\,\sigma} = \frac{S_{i+1}}{S_i} \sqrt{\frac{3}{\pi r}}\, L^r.$$
This means the size of a new list is approximately $\frac{S_{i+1}}{S_i} \sqrt{\frac{3}{\pi r}} L^r$. We use $c_r := \sqrt{\frac{3}{\pi r}}$ as an "adjustor." Peikert takes this as $\frac{2}{3}$ for $r = 2$. Using the central limit theorem might be inaccurate for small $r$, but in fact our adjustor gives $\approx 0.69$ for $r = 2$, so we assume it is also accurate for $r \geq 3$.
This derivation replicates Peikert's result that each collimation reduces the height by a multiplicative factor of c_r L^{r−1}, with a more precise expression for c_r.
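As a quick sanity check on the adjustor, the following sketch evaluates c_r = √(3/(rπ)) numerically:

```python
from math import pi, sqrt

def adjustor(r):
    """c_r = sqrt(3 / (r * pi)): the CLT-based correction to the expected
    collimated list size (S_{i+1}/S_i) * c_r * L^r."""
    return sqrt(3 / (r * pi))
```

For r = 2 this gives c_2 ≈ 0.691, close to Peikert's 2/3, and the adjustor shrinks slowly as r grows.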
We start with a height of N = √p and we want to reach a height of S_0, so the height of the tree must be

h = ⌈ lg(N/S_0) / lg(c_r L^{r−1}) ⌉.

Because of the rounding, we might need vectors of length less than L in the initial layer. Thus, we recalculate: the height of the phase vectors in the second layer (after the first collimation) must be

S_{h−1} = S_0 (c_r L^{r−1})^{h−1}.

The top layer has height S_h = N, the height of random new phase vectors. Since S_h/S_{h−1} is larger than the ratio of any other layer, the phase vectors in the top layer only need a length L_0 which is less than L. Following Section 3.3.1 of Peikert and the previous derivation, the sieve requires L_0 = (L N / S_{h−1})^{1/r}. For this top layer we do not have the adjusting factor of c_r because the sum of r uniformly random values up to N, modulo N, will still be uniformly random.
This tells us how many oracle calls must be performed: there will be r^h leaf nodes in the tree, and each one must have length L_0, for a total of r^h L_0 initial phase vectors. We adjust this slightly: since each layer has some probability of failing, we divide this total by (1−δ)^h for δ = 0.02, which is an empirical value from Peikert. We also add a 2^{0.3} "fudge factor" from Peikert, giving the number of oracle calls as 2^{0.3} r^h L_0 / (1−δ)^h.
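Putting the pieces of this derivation together, here is a rough sketch of the tree shape and oracle-call count (our reconstruction of the formulas above; the parameter values in the test are illustrative, not the paper's):

```python
from math import ceil, log2, pi, sqrt

def sieve_shape(lg_p, lg_L, r, lg_S0, delta=0.02):
    """Tree height and (lg of the) oracle-call count for the collimation
    sieve, following the derivation above. All inputs are base-2 logs;
    delta = 0.02 is Peikert's empirical per-layer failure rate."""
    lg_N = lg_p / 2                                        # height N = sqrt(p)
    lg_layer = log2(sqrt(3 / (r * pi))) + (r - 1) * lg_L   # lg(c_r * L^(r-1))
    h = ceil((lg_N - lg_S0) / lg_layer)                    # tree height
    lg_S_h1 = lg_S0 + (h - 1) * lg_layer                   # height after first collimation
    lg_L0 = (lg_L + lg_N - lg_S_h1) / r                    # top-layer list length
    # r^h leaves of length L0, divided by (1-delta)^h, times the 2^0.3 fudge
    lg_calls = h * log2(r) + lg_L0 - h * log2(1 - delta) + 0.3
    return h, lg_L0, lg_calls
```

Note that the computed L_0 comes out smaller than L, as the text predicts for the top layer.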

Fitting the sieve in a depth limit
We focus on NIST's security levels, which have a fixed limit MAXDEPTH on circuit depth, forcing the sieve to parallelize. The full algorithm consists of recursive sieving steps, producing a tree, where we collimate nodes together at one level to produce a node at the next level. This parallelizes extremely well, though a tree of height h must do at least h sequential collimations.
From this, we use MAXDEPTH/h as the depth limit for each collimation. The cost of collimation is mainly QRACM look-ups, which parallelize almost perfectly.
If each collimation has depth d_c and the tree has height h, then MAXDEPTH − hd_c is the maximum depth available for oracle calls. We divide this by the depth of each oracle call, d_o, to get the number of sequential oracle calls that fit, and then divide the total number of oracle calls by this quantity. This determines the number of oracle calls one must make simultaneously.
We also check whether collimation must be parallelized. We compute the total number of collimations in the tree, then multiply this by the depth of each collimation. Since one can start collimating as soon as the first oracle calls are done, the depth available for collimating is MAXDEPTH − d_o. This tells us how many parallel oracle calls the sieve must make, P_o, and the number of parallel collimations, P_c.

If P_o > lg(L_0)P_c, then we will need to store extra phase vectors. We compute the depth to finish all the oracle calls, then subtract the number of phase vectors that are collimated in that time, to find the number that must be stored.

If P_o ≤ lg(L_0)P_c, the algorithm cannot parallelize the collimation as much as required, because the input rate of phase vectors is too low. Hence, we must increase P_o to lg(L_0)P_c. This slightly overestimates the oracle's parallelization, since we can occupy the collimation circuits by collimating at higher levels in the tree, but since the number of vectors in successive levels of the tree decreases exponentially, we expect negligible impact.
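The parallelization accounting above can be sketched as follows (a simplified model under our reading of the text; all parameter values in the test are illustrative):

```python
def parallel_requirements(maxdepth, h, d_c, d_o, total_oracle_calls,
                          total_collimations, lg_L0):
    """Estimate the number of parallel oracle calls P_o and parallel
    collimations P_c needed to fit the sieve into a depth budget."""
    # Depth left for oracle calls after h sequential collimations:
    depth_for_oracle = maxdepth - h * d_c
    P_o = total_oracle_calls * d_o / depth_for_oracle
    # Collimation can start once the first oracle calls finish:
    P_c = total_collimations * d_c / (maxdepth - d_o)
    # The oracle must at least keep the collimation circuits fed:
    P_o = max(P_o, lg_L0 * P_c)
    return P_o, P_c
```

In the regime where the raw oracle parallelism falls below lg(L_0)·P_c, the max() applies the floor described in the text.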

Oracle costs
We propose that the cost of the oracle is the most likely factor for future algorithmic improvements to reduce CSIDH quantum security. Any improvement in basic quantum arithmetic will apply to computing the CSIDH group action in superposition; thus, using estimates based on current quantum arithmetic techniques, as in [12], will almost certainly overestimate costs (indeed, the costs they reference have since been reduced [24]). The alternative approach of [7] was to produce a classical constant-time implementation to give a lower bound on cost, since latency, reversibility, and fault tolerance will add significant overheads.
However, there is some possibility that quantum implementations may be cheaper than reversible classical methods. A prominent example is the recent idea of "ghost pebbles" [22], which shows that the lower bounds on the costs of reversibly computing classical straight-line programs [27] do not hold for quantum computers.
We give some rough estimates for the oracle cost here. We start with [7] and assume the number of nonlinear bit operations scales quadratically with the size of the prime. The √élu formulas have a memory cost of 8b + 3b log₂ b field elements, where b ≈ √ℓ_max ≈ √(log p log log p) and ℓ_max is the degree of the largest isogeny computed. Each field element is log₂ p bits. We assume that this is enough to hold the "state" of the group action evaluation, and thus we can apply straight-line ghost-pebbling techniques. This is likely not optimal, but it is a first approximation. We assume that the depth is equal to the number of operations, though with perfect parallelization up to a factor of log₂ p. We treat each nonlinear bit operation as a quantum AND gate, and do not include the cost of linear bit operations.
Pebbling. Reversible computers cannot delete memory, and "pebbling" is the process of managing a limited amount of memory ("pebbles") to compute a program. We refer to [27] for details. Ghost pebbling [22] is a quantum technique where we measure a state in the {|+⟩, |−⟩} basis, which releases the qubits but may add an unwanted phase that must be cleaned up. For our purposes, a pebble will be a state of many qubits, so with near certainty, a measurement-based uncomputation will leave a phase that we need to remove.
Our strategy is as follows: suppose we have enough qubits to hold s states simultaneously and n steps remaining in the program. From one state we can compute the next step, uncompute the previous state with measurements, and then repeat this; this only requires 2 states at a time. As a base case for s = 3, this gives the "Constant Space" strategy from [22], which requires n(n+1)/2 steps. In fact we only need 2 states, since we either consider the final state separately from this accounting, or we only need to clear the phase from the final state.
For a recursive strategy, we pick some k < n, and repeat the 2-states-at-a-time method to reach step n − k. We then recurse with s − 1 states for the final k steps, then uncompute the state at step n − k with a measurement. To clean up the phase from this measurement, we repeat the 2-states-at-a-time method to reach step n − 2k, then recurse for the next k steps. We repeat this process until all phases are removed.
If C(k, s − 1) is the cost for the recursive step, this has total cost approximately

C(n, s) = Σ_{i=1}^{⌈n/k⌉} (n − ik) + ⌈n/k⌉ · C(k, s − 1).

Based on some simple optimization, we choose k = n^{(s−1)/s}. We find the total costs numerically, and test initial values of s between (1/2) lg n and 5 lg n to find an optimal value. Table 2 gives the costs of one call to the oracle.
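The recursive pebbling cost can be evaluated numerically; the sketch below follows our reconstruction of the recursion (a rough model — the paper's Table 2 figures come from a more careful numerical optimization):

```python
from math import ceil

def pebble_cost(n, s):
    """Illustrative cost model for the recursive ghost-pebbling strategy.
    n: program steps remaining; s: states we can hold at once."""
    if n <= 1:
        return max(n, 0)
    if s <= 3:
        return n * (n + 1) // 2            # "Constant Space" base case [22]
    k = max(1, round(n ** ((s - 1) / s)))  # segment size from the simple optimization
    segments = ceil(n / k)
    # Walk to step n - i*k with the 2-states-at-a-time method, then recurse
    # with s - 1 states on the k steps above it, for each segment.
    walk = sum(max(0, n - i * k) for i in range(1, segments + 1))
    return walk + segments * pebble_cost(k, s - 1)
```

As expected, holding more states strictly reduces the cost relative to the constant-space base case.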

Security of low exponents
One of our main contributions is the use of low exponents as secret keys.
Our key space is thus a small subset of the class group. We believe that this extra information does not help a quantum adversary, for the following reasons: 1. The representation of group elements as a bitstring must be homomorphic to bitstrings representing integers; 2. Creating an incomplete superposition of states will not produce properly formed phase vectors; and 3. Incorrect phase vectors as input are likely undetectable, uncorrectable, and quickly render the sieve useless.
We will explain each point in detail. These support our main assumptions: -Quantum adversaries will still need to search the entire class group; -The oracle for a quantum adversary will need to evaluate arbitrary group actions, not just small exponents.
Both points mean that the quantum security depends only on the size of the class group, not the size of the subset we draw keys from. Importantly, these assumptions fail if we restrict the keys to a small subgroup of the class group. It is critical that the subset of keys generates the entire class group.

Group representations
To create the input states, we must use a QFT which computes a homomorphism between elements of the group and phases of quantum states. Circuits to do this are well-known only for modular integers, represented as bitstrings. With a different representation of group elements (e.g., vectors in a lattice), we either need a custom-built QFT circuit for that representation, or we first change the representation to modular integers. However, a custom-built QFT is equivalent to a change of representation: we could apply the custom QFT, then the inverse of the usual QFT on integers, and this will map our group elements to modular integers. This seems to restrict us to representing elements of the class group as multiples of a generator. We might be able to reduce the cost of the search if we only used small multiples of this generator; however, low exponents do not correspond to small multiples. Hence, the exponent vectors will likely be indistinguishable from random multiples of the generator.
The state before the QFT has the form |0⟩|x⟩ + |1⟩|x + s⟩, where x is the coefficient of the generator for the group element that we measured. Hence, if x is randomly distributed, we will still need lg |G| qubits to represent it, and the QFT will produce random phase vectors of height up to |G|. Since the cost of the sieve is governed by the height of the input phase vectors, the cost of the sieve will be the same.
In short, to exploit the fact that secrets are restricted, we require a representation of group elements that can be homomorphically compressed to fewer than lg |G| qubits. We see no method to do this.

Incomplete superpositions
The first step of producing phase vectors involves a superposition over all of G. If we know that the secret s is in a smaller subset H_1 ⊆ G, we could instead sample from H_1. We could even sample from another set H_0 for f, though it must be the same size for the normalization to match. This produces a superposition over |0⟩|f(g)⟩ for g ∈ H_0 and |1⟩|h(g)⟩ for g ∈ H_1. Measuring the final register returns a particular value z = f(g) for some g ∈ H_0 or z = h(g) = f(g − S) for some g ∈ H_1. Let Z = f(H_0) ∪ h(H_1), and partition it into 3 subsets: Z_0 = f(H_0) \ h(H_1), Z_1 = h(H_1) \ f(H_0), and Z_+ = f(H_0) ∩ h(H_1). If we measure z ∈ Z_0, then the state after the QFT is just |0⟩, since there was no value g ∈ H_1 such that h(g) = z. Similarly, measuring z ∈ Z_1 leaves the state |1⟩. Only if we measure z ∈ Z_+ will we have a "successful" phase vector, i.e., one that is not just |0⟩ or |1⟩ and has some information about s. The size of Z_+ is |H_0 ∩ (H_1 − S)| ≤ |H_0|, and the probability of measuring z ∈ Z_+ is |Z_+|/|H_0|. Choosing H_0 and H_1 to make this probability large, without knowing S, seems very challenging. For example:

Theorem 1 If H_0 and H_1 are both hypercubes of exponent vectors in [−m..m]^n, the probability of measuring z ∈ Z_+ is approximately (2m/(2m+1))^n.

Proof There are 2(2m + 1)^n states in superposition when we measure: (2m + 1)^n exponent vectors in superposition for each value |0⟩ or |1⟩ of the leading qubit. Each state has equal probability. We measure curves, meaning that a curve reached by both E_0 and E_1 is twice as likely as a curve reached by only one or the other. For small m, the set of curves reached by E_0 is close to a bijection with a hypercube of exponent vectors of width 2m + 1 centered at 0. The set of curves reached by E_1 is in bijection with a hypercube of exponent vectors of the same width centered at s, the exponent vector of the secret key. The intersection of these hypercubes has volume about (2m)^n, giving the stated probability.
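The hypercube overlap estimate is easy to evaluate; this sketch computes the probability (2m)^n/(2m+1)^n of a useful measurement (the value n = 221 in the usage is chosen purely for illustration, not taken from the paper's tables):

```python
def success_probability(m, n):
    """Overlap estimate: two width-(2m+1) hypercubes of exponent vectors
    intersect in volume about (2m)^n, so a measurement yields a useful
    phase vector with probability roughly (2m / (2m + 1))^n."""
    return (2 * m / (2 * m + 1)) ** n
```

For example, m = 1 with n = 221 exponents gives (2/3)^221 ≈ 2^{−129}: almost every measurement yields a useless |0⟩ or |1⟩ state.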

Effects of incomplete superpositions
We define a defective phase vector with fidelity q of length L as a state

(1/√(qL)) Σ_{j∈J} e^{iB(j)s/N} |j⟩,

where J ⊆ {1, ..., L} is an unknown subset with |J| = qL, while we believe we hold the full-length phase vector with phase multiplier B(j) for every j ∈ {1, ..., L}. If we measure a |0⟩ or |1⟩ state from an oracle that produces incomplete superpositions, then q = 1/2. In short, a phase vector with q < 1 is one where our classical beliefs about the set of phases in superposition are wrong. We know the function B correctly, but it only matches the real state on the unknown subset J. The issue is that the oracle cannot tell us the fidelity of a new phase vector; our measurements do not tell us whether we succeeded or not.
We call this fidelity because it represents quantum fidelity with respect to the state we believe we have, given the classical information of the function B. This means that if k input phase vectors are defective, the fidelity of the entire input state degrades to 2^{−k}. If our final phase vector before measurement has fidelity q with respect to the state we want, then q is the probability of measuring the same result [35, Section 9.2.2]. As a rough argument for why fidelity must be high: even if the QFT partitions the noise so that the high-order bits always give an accurate result but the low-order bits are uniformly random, then at least k bits must be uniformly random if the probability of a correct measurement is 2^{−k}.
Hence, if our input states have fidelity q, we need the fidelity to increase by the time we reach the final state. Quantum circuits without measurement are unitary operations and thus preserve fidelity, but measurements may increase it, so we first argue that collimation does not appreciably increase the fidelity.

Theorem 2
Starting with an initial phase vector of length L and fidelity q < 1/2, with height S, if we collimate to a new height S′, the resulting phase vector is a new defective phase vector with expected fidelity at most q(1 + o(1)), for L′ := (S′/S)Lq ≥ 40.
Proof The probability of measuring any phase is uniform in the first collimation. This means p_m is constant in Eq. 16, so the number of surviving true phases after measurement, which we denote X, has distribution 1 + Bin(|J| − 1, S′/S) = 1 + Bin(qL − 1, p) for p = S′/S and qL = |J|. The number of phases that we incorrectly believe we have will have distribution Y ∼ Bin(L − qL, p). The fidelity of the measured state is X/(X + Y). We use Chernoff bounds to concentrate X and Y to be within a factor of (1 ± δ) of their means, except with probability exponentially small in δ²L′. With careful rearranging we find that the expected fidelity is at most q(1 + δ)/(1 − δ), up to the failure probability of the bounds. For sufficiently large L′ this fits the required bound.
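The concentration claim in the proof is easy to check numerically. This Monte Carlo sketch (illustrative parameters, not the paper's) confirms that the measured fidelity stays near q, i.e., collimation does not boost fidelity:

```python
import random

def mean_collimated_fidelity(L, q, p, trials=2000, seed=1):
    """Monte Carlo model of one collimation of a defective phase vector:
    surviving true phases X ~ 1 + Bin(qL - 1, p) versus phantom phases
    Y ~ Bin(L - qL, p), with p = S'/S; the fidelity of the measured
    state is X / (X + Y)."""
    rng = random.Random(seed)
    J = int(q * L)          # size of the unknown subset of true phases
    total = 0.0
    for _ in range(trials):
        x = 1 + sum(rng.random() < p for _ in range(J - 1))
        y = sum(rng.random() < p for _ in range(L - J))
        total += x / (x + y)
    return total / trials
```

For example, starting from q = 0.25 with L = 1000 and p = 1/2, the average post-collimation fidelity remains close to 0.25.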
Theorem 2 shows that for small q, the fidelity increases only linearly with each collimation. The factor of L′ is approximately equal to the actual number of states in superposition in the collimated phase vector. Each phase vector is only collimated once for each level of the tree, and there are only ≈ 2^7 sequential collimations, even at very large prime sizes. Hence, even if collimation is helpful, it would only remove the noise from ≈ 7 defective input phase vectors. Each sieving run over a 6144-bit prime needs 2^76 input phase vectors and recovers 39 bits of the secret. This means we would need fidelity greater than 2^{−38} to gain any information, so we would need the probability of failure for each input vector to be at most 2^{−31}. Given Theorem 1, this nearly rules out sampling low exponents.
Since sieving is ineffective, can we instead take many phase vectors, some of which may be defective, and produce good vectors? We summarize this as the following problem:

Problem 2 (Probabilistic Phase Vector Distillation (PPVD)) Let s be an unknown secret value. As input, there are n input states |φ_k⟩ with labels k, such that with probability p, |φ_k⟩ = |0⟩ + e^{iks/N}|1⟩; with probability (1−p)/2, |φ_k⟩ = |0⟩; and with probability (1−p)/2, |φ_k⟩ = |1⟩. With some probability ε, either output 0 for failure, or output 1 and t states |φ_{j_1}⟩, ..., |φ_{j_t}⟩ and their associated phase multipliers j_i, such that, for all i, |φ_{j_i}⟩ = |0⟩ + e^{i j_i s/N}|1⟩.

The PPVD problem cannot be solved with ε > 0 for n = 1:

Lemma 1
There is no quantum channel (circuit plus measurement) that distinguishes a single phase vector from |0 or |1 without calling the group oracle or learning the secret s.
Proof Suppose such a quantum channel Φ exists. Since the states we want to distinguish are constrained to a 2-dimensional subspace, any measurement will produce a state in a 1-dimensional space, which is a single vector. Since we want the output to be a phase vector, our measurement must produce a valid phase vector φ′. Suppose φ′ has some associated phase multiplier j. The vector φ′ is the basis of our measurement, and thus cannot depend on the input states nor the secret s, since we assume we do not learn s. Hence, for an input |φ⟩ = |0⟩ + e^{iks/N}|1⟩, the secret is s, so we require φ′ = |0⟩ + e^{ijs/N}|1⟩. But if we instead had an input for a secret s′ ≠ s, then φ′ is not a correct phase vector.
The argument of Lemma 1 does not readily extend to n > 1, but we assume that similar arguments exist. The central issue is that our distillation process must project inputs onto phase vectors that are correct for an unknown secret phase multiplier s. We see no way to do this without learning s and without being able to produce correct phase vectors from "blank" inputs of |0 and |1 . Either of these cases implies a more efficient solution to the dihedral hidden subgroup problem. We make that last statement more precise and argue that we cannot expect to "gain" phase vectors on average: Lemma 2 If the collimation sieve gives the optimal query complexity for the dihedral hidden subgroup problem, then no process can solve PPVD with t > pn.
Proof For a contradiction, let t > pn. Assume we have a perfect phase vector oracle, from which we make n initial queries. We then take pn of our phase vectors and shuffle them together with |0⟩ and |1⟩ vectors. Then we run the process that solves the PPVD. If it succeeds, it produces t new phase vectors, which we add to a growing list; if it fails, we call the phase vector oracle another t times. Either way we gain t − np new phase vectors, and in the first case we did not need to call the oracle. Thus each iteration calls the oracle t(1 − ε) times on average. We repeat this process to create all the phase vectors that the collimation sieve needs.
If the collimation sieve requires Q states, this process only calls the oracle (Q/(t − np)) · t(1 − ε) times on average. If t > pn, then this is fewer than Q oracle calls, and thus we solve the dihedral hidden subgroup problem with fewer than Q states, contradicting the assumed optimality of the sieve [41]. □
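The accounting in this proof can be sketched directly (our reconstruction of the bookkeeping; all parameter values are illustrative):

```python
def expected_oracle_calls(Q, n, p, t, eps):
    """Expected oracle calls to supply Q phase vectors to the sieve when a
    PPVD solver is available: each round feeds p*n genuine vectors to the
    solver, nets t - p*n new vectors, and costs t*(1 - eps) oracle calls
    on average."""
    assert t > p * n, "the lemma's accounting needs t > p*n"
    rounds = Q / (t - p * n)
    return rounds * t * (1 - eps)
```

With, say, Q = 1000, n = 10, p = 0.5, t = 10, and ε = 0.9, the sieve would need only 200 oracle calls, fewer than Q, which is the contradiction the proof exploits; with ε = 0 the distiller gives no advantage.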

Quantum-secure CSIDH instantiations
To compare these three algorithms, which have distinct space-time tradeoffs, we include fixed hardware limits and add a fault-tolerance overhead. Figure 2 shows the space-time tradeoffs of the three algorithms. These assumptions are stronger than the assumptions used in the analyses of other post-quantum schemes, particularly proposed NIST standards. Since CSIDH, and our 'SQALE'd version, are not being considered for standardization, we use riskier assumptions in our cost model. This means the performance is not directly comparable to other post-quantum schemes at the same security level. Our recommended parameters are a 4096-bit prime for level 1, 6144 bits for level 2, and 8192 bits for level 3.
Quantum Oracle Costs. Our estimates in this section assume free oracle costs. The number of oracle calls decreases, relative to the total computational expense, as the size of the prime grows, and we need very large primes to reach NIST security levels. Further, the sieve can reparameterize to use more collimations when it uses a more expensive oracle. Compared to a free oracle, we found that the oracle costs from Sect. 3.6 only increase the total cost by between 0 and 14 bits, depending on the prime size, with no change in the NIST security levels. Since oracle costs are the most likely to change with future research, we opted to estimate costs based on a free oracle, which gives us conservative estimates.
Hardware Limits. Grover-like quantum algorithms parallelize very badly, but the collimation sieve parallelizes almost perfectly. Thus the threshold for security increases as depth decreases, but CSIDH's bits of security remain the same. To an adversary with a high depth budget of 2^96, SQALE'd CSIDH-4096 costs much more to break than AES-128, but it costs much less to break if the adversary must finish their attack in depth 2^40. Is SQALE'd CSIDH-4096 as secure as AES-128?
We assert that it does not matter if an adversary with access to more than 2^80 qubits could attack AES-128 at a higher cost than attacking CSIDH-4096, since such an adversary is unrealistic. We constrain an adversary's amount of "hardware": the total of classical processors, memory, and physical qubits (see Sect. 2.3). All three are given equal weight. Under limits on both hardware and depth, certain attacks are impossible. The depths in Table 4 are the minimum depths for which the collimation sieve can finish under our hardware constraint. Because Grover search becomes more expensive at lower depths, this removes high-cost attacks on AES.
Our hardware limit for NIST level 1 is 2^80, based on [1]. For level 2 we use 2^100, the memory contained in a "New York City-sized memory made of petabyte micro-SD cards" [40], and for level 3 we use 2^119, the memory of a 15 mm shell of such cards around the Earth [40].

Classical security
Assume we want to find a CSIDH key that connects two given supersingular Montgomery curves E_0 and E_1 defined over F_p for a prime p = 4 ∏_{i=1}^{n} ℓ_i − 1. Let N denote the key space size.
Notice that large primes p ≥ 2^512 permit key space sizes N much smaller than the class group order p^{1/2}; and then random-walk-based attacks are costlier than Meet-in-the-Middle (MITM) procedures. In fact, MITM performs about N^{1/2} ≪ p^{1/4} steps. To illustrate the MITM approach, let us assume that for i := 1, ..., n, we require the computation of isogenies of degree ℓ_i, each of which we repeat m ∈ Z⁺ times. The first step is to split the set {ℓ_1, ..., ℓ_n} into two disjoint subsets L_0 and L_1, both of size n/2. Next, for i = 0, 1, let S_i be the table with elements (e, g_e), where g_e corresponds to the output of the group action evaluation with inputs E_i and a CSIDH key e = (e_1, ..., e_n) such that e_j = 0 for each ℓ_j ∈ L_{1−i}. The MITM procedure on CSIDH looks for a collision between S_0 and S_1; that is, two pairs (e, g_e) ∈ S_0 and (f, g_f) ∈ S_1 such that g_e = g_f; consequently, the concatenation of e and f maps E_0 to E_1.
The tables S_0 and S_1 each have about N^{1/2} elements. The size of the class group #cl(O) is asymptotically close to p^{1/2}, and the key space size N must be approximately equal to 2^{2λ} to ensure λ ∈ {128, 192} bits of classical security. Consequently, for large primes p ≥ 2^1024, we have N ≪ #cl(O). Then (#S_1)(#S_0) ≪ #cl(O), and the birthday-paradox probability of a collision between S_0 and S_1 (other than the one expected by construction) happening by chance is negligible. The expected running time of MITM is 1.5N^{1/2}, and it requires N^{1/2} ≈ 2^λ cells of memory. Here, the classical security of CSIDH falls into the same case as SIDH, where van Oorschot & Wiener (vOW) Golden Collision Search (GCS) is cheaper than MITM, and a small key space still provides λ ∈ {128, 192} bits of classical security. In fact, the vOW GCS procedure [1,41] applied to CSIDH has an expected running time of (2.5/μ)(N³/w)^{1/2} when only μ processors and w cells of memory are allowed to be used. As a consequence, the number k of small odd primes ℓ_i that allows λ bits of classical security is the smallest k such that this cost reaches 2^λ, where N = (δm + 1)^k and (δm + 1) is the size of the per-exponent key set: either [−m..m] (δ = 2, OAYT-style [37]), [0..m] (δ = 1, MCR-style [32]), or S(m) = {e ∈ [−m..m] | e ≡ m mod 2} (δ = 1, dummy-free style [14]).

Fig. 2 Costs of the quantum collimation sieve attack under various hardware limits. Colored solid lines are the costs of the collimation sieve at primes of bit lengths from 512 to 9216; dotted lines are the cost of key search on AES, from [25], with the same memory limits and overhead as our analysis. All figures are logarithmic in base 2. Plots on the left are parameterized to minimize gate cost, plots on the right to minimize DW-cost. Larger primes achieving lower depth (e.g., 5120 vs. 4096) is due to increased memory limits
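The vOW cost model can be evaluated directly. The sketch below assumes the running-time formula (2.5/μ)√(N³/w) as stated above; the parameter values in the usage (λ = 128, w = 2^80, OAYT-style δ = 2, m = 1) are illustrative of the formula, not parameters taken from the paper's tables:

```python
from math import log2

def lg_vow_cost(lg_N, lg_w, lg_mu=0.0):
    """lg of the vOW golden collision search running time
    (2.5 / mu) * sqrt(N^3 / w), all arguments as base-2 logs."""
    return log2(2.5) + 1.5 * lg_N - 0.5 * lg_w - lg_mu

def min_small_primes(lam, lg_w, delta, m):
    """Smallest k with N = (delta*m + 1)^k reaching lam bits of vOW cost."""
    lg_base = log2(delta * m + 1)
    k = 1
    while lg_vow_cost(k * lg_base, lg_w) < lam:
        k += 1
    return k
```

Under this model, increasing the memory budget w lowers the attack cost and therefore forces a larger k, as noted below.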
Assuming the previously mentioned technological limits of w = 2^80, w = 2^100, and w = 2^119 cells of classical memory for NIST levels 1, 2, and 3 (resp.), Table 3 gives bounds m_i for each degree-ℓ_i isogeny construction that optimize the cost using the approach reported in [15]. Note that any increase in our classical memory budget w will imply a higher value of k, thus forcing us to re-parameterize the collection of k isogenies that must be processed.

Table 3 Depth is the minimum possible under the given hardware limit. The final two columns give the lowest cost of attacking {AES, SHA} in depth at least as much as the minimum needed to break the associated CSIDH instance, based on [18,25,36]. Italics highlight where such a break exceeds the hardware limit

Experimental results
In this section, we discuss larger and safer CSIDH instantiations. We report the first constant-time C-coded implementation of the CSIDH group action evaluation that uses the new fast isogeny algorithm of [6], as reported in [2]. The C-code implementation allows an easy application to any prime field, which requires the shortest differential addition chains (SDACs), the list of small odd primes (SOPs), and the optimal strategies presented in [15]; in particular, our C-code implementation is a direct application of the algorithm and Python code presented in [2], and thus all the required data (for each different prime field) can be obtained from the corresponding Python-code version. Our experiments focus on instantiations of CSIDH with primes of the form p = 4 ∏_{i=1}^{n} ℓ_i − 1 of 1024, 1792, 2048, 3072, 4096, 5120, 6144, 8192, and 9216 bits (see Table 5). We compared the three variants of CSIDH, namely, i) MCR-style, ii) OAYT-style, and iii) dummy-free-style. All of our experiments were executed on an Intel(R) Core(TM) i7-6700K CPU at 4.00 GHz with 16 GB of RAM, with Turbo Boost disabled, and using clang version 3.8. Our software library is freely available from https://github.com/JJChiDguez/sqale-csidh-velusqrt.
To illustrate the impact of using low exponents, Fig. 3 shows experimental results for all instantiations of CSIDH using exponent bounds ranging from m = 1 to m = 4. Each exponent bound is parameterized to reach the same security, meaning fewer ℓ_i for larger m. In all cases we started from the global bound of Table 3 and then optimized the bounds per individual small prime and the evaluation strategies as in [15]. Each experiment considers the cost of 1024 random instances, except for the experiments corresponding to the 8192- and 9216-bit instances, which consider a smaller set of 128 experiments. Note that some configurations of the 1024- and 1792-bit primes do not have enough ℓ_i's to support the m = 1 and m = 2 bounds. We stopped the experiments at m = 4 because performance degraded at higher values.
Our results show a slight drop in performance with the m = 1 bound in both the dummy-free and MCR-style versions; from m = 2 onwards, higher m steadily performs worse. For OAYT style, on the other hand, m = 1 was always optimal. Because the performance penalty at m = 1 is ameliorated at larger primes, we decided to use the m = 1 bound in all cases due to its simplicity and security. The results for these instantiations, which provide NIST security levels 1, 2, and 3, are given in Table 6. These results correspond to measurements over 1024 random instances.

Conclusions
As the quantum security analysis of CSIDH has become more robust, it seems clear now that its original parameters must be updated by considering larger primes.
In this paper, we propose a set of primes large enough to make the protocol quantum-secure. Taking as a basis the Python 3 library reported in [2], we provide a freely available software library coded in C, which implements CSIDH instantiations that were built using these large primes.
Since the introduction of CSIDH in 2018, it has been the norm to try to match the key space to its maximum theoretical size of #cl(O) ≈ √p. Nevertheless, as quantum security demands a larger prime, this key space has become unnecessarily large. It is therefore important to argue that leaving a portion of this space unused does not compromise CSIDH's security, an important conjecture that our analysis supports.
To make larger prime field instantiations of CSIDH more viable, our implementation combines techniques such as exponent strategy optimization, low exponents, and the new Vélu formulas presented in [6]. Our results are the first of their kind for these larger primes, hoping that these designs will pave the path forward for future refinements of CSIDH.
From our analysis, the main computational cost of the quantum sieve comes from the classical cost of merging lists to find matching sums. Improvements to this subroutine would lower the security of CSIDH. Given that CSIDH's relative security and its 'SQALE'd performance depend on hardware limits, our analysis highlights the need for consensus on the resources of far-future attackers.