Abstract
We introduce Four\(\mathbb {Q}\), a high-security, high-performance elliptic curve that targets the 128-bit security level. At the highest arithmetic level, cryptographic scalar multiplications on Four\(\mathbb {Q}\) can use a four-dimensional Gallant-Lambert-Vanstone decomposition to minimize the total number of elliptic curve group operations. At the group arithmetic level, Four\(\mathbb {Q}\) admits the use of extended twisted Edwards coordinates and can therefore exploit the fastest known elliptic curve addition formulas over large prime characteristic fields. Finally, at the finite field level, arithmetic is performed modulo the extremely fast Mersenne prime \(p=2^{127}-1\). We show that this powerful combination facilitates scalar multiplications that are significantly faster than all prior works. On Intel’s Haswell, Ivy Bridge and Sandy Bridge architectures, our software computes a variable-base scalar multiplication in 59,000, 71,000 and 74,000 cycles, respectively; on the same platforms, it computes a Diffie-Hellman shared secret in 92,000, 110,000 and 116,000 cycles, respectively.
1 Introduction
This paper introduces a new, complete twisted Edwards [5] curve \(\mathcal {E}(\mathbb {F}_{p^2}):\,-x^2+y^2=1+dx^2y^2\), where p is the Mersenne prime \(p=2^{127}-1\), and d is a non-square in \(\mathbb {F}_{p^2}\). This curve, dubbed “Four\(\mathbb {Q}\)”, arises as a special instance of recent constructions using \(\mathbb {Q}\)-curves [27, 46], and is thus equipped with an endomorphism \(\psi \) related to the p-power Frobenius map. In addition, it has complex multiplication (CM) by the order of discriminant \(D=-40\), meaning it comes equipped with another efficient, low-degree endomorphism \(\phi \) [47].
We built an elliptic curve cryptography (ECC) library that works inside the cryptographic subgroup \(\mathcal {E}(\mathbb {F}_{p^2})[N]\), where N is a 246-bit prime. The endomorphisms \(\psi \) and \(\phi \) do not give any practical speedup to Pollard’s rho algorithm [42], which means the best known attack against the elliptic curve discrete logarithm problem (ECDLP) on \(\mathcal {E}(\mathbb {F}_{p^2})[N]\) requires around \(\sqrt{\pi N/4} \sim 2^{122.5}\) group operations on average. Thus, the cryptographic security of \(\mathcal {E}\) (see Sect. 2.3 for more details) is closely comparable to other curves that target the 128-bit security level, e.g., [6, 9, 21, 37].
Our choice of curve and the accompanying library offer a range of advantages over existing curves and implementations:
- Speed: Four\({\mathbb {Q}}\)’s library computes scalar multiplications significantly faster than all known software implementations of curve-based cryptographic primitives. It uses the endomorphisms \(\psi \) and \(\phi \) to accelerate scalar multiplications via four-dimensional Gallant-Lambert-Vanstone (GLV)-style [22] decompositions. Four-dimensional decompositions have been used before [9, 32, 37], but not over the Mersenne prime; this choice of field is significantly faster than any neighboring fields and several works have studied its arithmetic [13, 21, 36]. The combination of extremely fast modular reductions and four-dimensional scalar decompositions makes for highly efficient scalar multiplications on \(\mathcal {E}\). Furthermore, we can exploit the fastest known addition formulas for elliptic curves over large characteristic fields [31], which are complete on \(\mathcal {E}\) since the above d is non-square [31, Sect. 3]. In Sect. 2, we explain why four-dimensional decompositions and this special underlying field were not previously partnered at the 128-bit security level.
- Simplicity and concrete correctness: Simplicity is a major priority in this work and in the development of our software; in some cases we sacrifice speed enhancements in order to design a simpler and more compact algorithm (cf. Sect. 4.2).
On input of any point \(P \in \mathcal {E}(\mathbb {F}_{p^2})[N]\), validated as in [14, Appendix A] if necessary, and any integer scalar \(m \in [0,2^{256})\), our software does the following (strictly in constant-time and without exception):
1. Computes \(\phi (P)\), \(\psi (P)\) and \(\psi (\phi (P))\) using exactly \(68 \mathbf{M}\), \(27 \mathbf{S}\) and \(49.5 \mathbf{A}\) – see Sect. 3.
2. Decomposes m (e.g., in less than 200 Sandy Bridge cycles) into a multiscalar \((a_1,a_2,a_3,a_4) \in \mathbb {Z}^4\) such that each \(a_i\) is positive and at most 64 bits – see Sect. 4.
3. Recodes the multiscalar (e.g., in less than 800 Sandy Bridge cycles) to ensure a simple and constant-time main loop – see Sect. 5.
4. Computes a lookup table of 8 elements using exactly 7 complete additions, before executing the main loop using exactly 64 complete twisted Edwards double-and-add operations, and finally outputting \([m]P = [a_1]P+[a_2]\phi (P)+[a_3]\psi (P) + [a_4]\psi \phi (P)\) – see Sect. 5.
This paper details each of the above steps explicitly, culminating in the full routine presented in Algorithm 2. Several prior works exploiting scalar decompositions have potential points of failure (cf. [30, Sect. 7], and Sect. 4.2), but crucially, and for the first time in the setting of four-dimensional decompositions, we accompany our routine with a robust proof of correctness – see Theorem 1.
- Cryptographic versatility: Four\(\mathbb {Q}\) is intended to be used in the same way, i.e., using the same model, same coordinates and same explicit formulas, irrespective of the cryptographic protocol or nature of the intended scalar multiplication. Unlike implementations using ladders [4, 6, 9, 23], Four\(\mathbb {Q}\) supports fast variable-base and fast fixed-base scalar multiplications, both of which use twisted Edwards coordinates; this serves as a basis for fast (ephemeral) Diffie-Hellman key exchange and fast Schnorr-like signatures. The presence of a single, complete addition law gives implementers the ability to easily wrap higher-level software and protocols around Four\(\mathbb {Q}\)’s library exactly as is.
- Public availability: Prior works exploiting four-dimensional decompositions have either made code available that did not attempt to run in constant-time [9], or not published code that did run in constant-time [18, 37]. Our library, which is publicly available [15], is largely written in portable C and includes two modular implementations of the arithmetic over \(\mathbb {F}_{p^2}\): a portable implementation written in C and a high-performance implementation for x64 platforms written in C and optional x64 assembly. The library also allows selecting (at build time) whether or not the efficiently computable endomorphisms \(\psi \) and \(\phi \) are used for computing generic scalar multiplications. The code is accompanied by Magma scripts that can be used to verify the proofs of all claims and the claimed operation counts. Our aim is to make it easy for subsequent implementers to replicate the routine and, if desired, develop specialized code that is tailored to specific platforms for further performance gains or with different memory constraints.
When the NIST curves [40] were standardized in 1999, many of the landmark discoveries in ECC (e.g., [17, 21, 22, 46]) were yet to be made. Four\(\mathbb {Q}\) and its accompanying library represent the culmination of several of the best known ECC optimizations to date: it pulls together the extremely fast Mersenne prime, the fastest known large characteristic addition formulas [31], and the highest degree of scalar decompositions (there is currently no known way of achieving higher dimensional decompositions without exposing the ECDLP to attacks that are asymptotically much faster than Pollard rho). Consequently, for generic scalar multiplications, Four\(\mathbb {Q}\) performs around four to five times faster than the original NIST P-256 curve [26], between two and three times faster than curves that are currently under consideration as NIST alternatives, e.g., Curve25519 [4], and is also significantly faster than all of the other curves used to set previous speed records (see Sect. 6 for the comparisons). Interestingly, Four\(\mathbb {Q}\) is still highly efficient if the endomorphisms \(\psi \) and \(\phi \) are not used at all for computing generic scalar multiplications. In this case, Four\(\mathbb {Q}\) performs about three times faster than the NIST P-256 curve and up to 1.5 times faster than Curve25519.
It is our belief that the demand for high-performance cryptography warrants the state-of-the-art in ECC to be part of the standardization discussion: this paper ultimately demonstrates the performance gains that are possible if such a curve were to be considered alongside the “conservative” choices.
The extended version. For space considerations, we have omitted the proofs of Propositions 1, 2, 4 and 5, Lemma 1 and Theorem 1, as well as several additional remarks. All of these, along with an appendix covering point validation, can be found in the extended version of this article [14].
2 The Curve: Four\(\mathbb {Q}\)
This section describes the proposed curve, where we adopt Smith’s notation [44, 46] for the most part. We present the curve parameters in Sect. 2.1, shed some light on how the curve was found in Sect. 2.2, and discuss its cryptographic security in Sect. 2.3. Both Sects. 2.2 and 2.3 discuss that \(\mathcal {E}\) is essentially one-of-a-kind, illustrating that there were no degrees of freedom in the choice of curve (see [14] for more details).
2.1 A Complete Twisted Edwards Curve
We will work over the quadratic extension field \(\mathbb {F}_{p^2} := \mathbb {F}_p(i)\), where \(p:=2^{127}-1\) and \(i^2 = -1\). We define \(\mathcal {E}\) to be the twisted Edwards [5] curve
\[\mathcal {E}/\mathbb {F}_{p^2}:\, -x^2+y^2=1+dx^2y^2, \qquad (1)\]
where \(d := 125317048443780598345676279555970305165 \cdot i + 4205857648805777768770\).
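To illustrate why this field is convenient, the following is a minimal Python sketch of reduction modulo \(p=2^{127}-1\) by folding, together with a three-multiplication product in \(\mathbb {F}_p(i)\). The function names `reduce_mersenne` and `fp2_mul` are hypothetical and are not taken from the accompanying library; the sketch uses arbitrary-precision integers and makes no attempt to run in constant time.

```python
P = 2**127 - 1  # the Mersenne prime p = 2^127 - 1 from the text


def reduce_mersenne(x: int) -> int:
    """Reduce a non-negative integer modulo p by folding 127-bit chunks."""
    while x >> 127:
        # 2^127 = 1 (mod p), so the high part folds onto the low 127 bits.
        x = (x & P) + (x >> 127)
    return 0 if x == P else x


def fp2_mul(u, v):
    """Multiply u = a + b*i and v = c + d*i in F_p(i), where i^2 = -1.

    Inputs are pairs of integers in [0, p). Only three multiplications
    in F_p are used (a Karatsuba-style trick, assumed here for illustration).
    """
    a, b = u
    c, d = v
    ac = reduce_mersenne(a * c)
    bd = reduce_mersenne(b * d)
    cross = reduce_mersenne((a + b) * (c + d))        # ac + ad + bc + bd
    real = reduce_mersenne(ac - bd + P)               # a*c - b*d
    imag = reduce_mersenne(cross - ac - bd + 2 * P)   # a*d + b*c
    return (real, imag)
```

The offsets `P` and `2 * P` only keep intermediate values non-negative before folding; they do not change the result modulo p.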
The set of \(\mathbb {F}_{p^2}\)-rational points satisfying the affine model for \(\mathcal {E}\) forms a group: the neutral element is \(\mathcal {O}_{\mathcal {E}} = (0,1)\) and the inverse of a point (x, y) is \((-x,y)\). The fastest set of explicit formulas for the addition law on \(\mathcal {E}\) are due to Hisil, Wong, Carter and Dawson [31]: they use extended twisted Edwards coordinates to represent the affine point (x, y) on \(\mathcal {E}\) by any projective tuple of the form \((X :Y :Z :T)\) for which \(Z \ne 0\), \(x = X/Z\), \(y=Y/Z\) and \(T=XY/Z\). Since d is not a square in \(\mathbb {F}_{p^2}\), this set of formulas is also complete on \(\mathcal {E}\) (see [5]), meaning that they will work without exception for all points in \(\mathcal {E}(\mathbb {F}_{p^2})\).
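The group law can be exercised on a few points that are easy to write down. The sketch below assumes the standard affine addition law for twisted Edwards curves with \(a=-1\) (not the extended-coordinate formulas of [31]) and checks it on the neutral element, on the point \((0,-1)\) of order 2, and on the point \((i,0)\) of order 4. It also tests that d is a non-square by checking that its norm \(a^2+b^2\) is a non-residue in \(\mathbb {F}_p\). All helper names are hypothetical.

```python
P = 2**127 - 1
# d = D[0] + D[1]*i, with the constants quoted in the text
D = (4205857648805777768770, 125317048443780598345676279555970305165)
ONE = (1, 0)


def fp2_add(u, v):
    return ((u[0] + v[0]) % P, (u[1] + v[1]) % P)


def fp2_sub(u, v):
    return ((u[0] - v[0]) % P, (u[1] - v[1]) % P)


def fp2_mul(u, v):
    a, b = u
    c, d = v
    return ((a * c - b * d) % P, (a * d + b * c) % P)


def fp2_inv(u):
    a, b = u
    n = pow(a * a + b * b, -1, P)  # inverse of the norm (Python 3.8+)
    return (a * n % P, (-b * n) % P)


def fp2_is_square(u):
    # u is a square in F_{p^2} iff its norm a^2 + b^2 is a square in F_p.
    norm = (u[0] * u[0] + u[1] * u[1]) % P
    return norm == 0 or pow(norm, (P - 1) // 2, P) == 1


def on_curve(pt):
    x, y = pt
    x2, y2 = fp2_mul(x, x), fp2_mul(y, y)
    return fp2_sub(y2, x2) == fp2_add(ONE, fp2_mul(D, fp2_mul(x2, y2)))


def edwards_add(p1, p2):
    """Affine addition on -x^2 + y^2 = 1 + d*x^2*y^2."""
    (x1, y1), (x2, y2) = p1, p2
    t = fp2_mul(D, fp2_mul(fp2_mul(x1, x2), fp2_mul(y1, y2)))
    x3 = fp2_mul(fp2_add(fp2_mul(x1, y2), fp2_mul(y1, x2)), fp2_inv(fp2_add(ONE, t)))
    y3 = fp2_mul(fp2_add(fp2_mul(y1, y2), fp2_mul(x1, x2)), fp2_inv(fp2_sub(ONE, t)))
    return (x3, y3)


NEUTRAL = ((0, 0), (1, 0))       # (0, 1)
ORDER_2 = ((0, 0), (P - 1, 0))   # (0, -1)
ORDER_4 = ((0, 1), (0, 0))       # (i, 0)

d_is_square = fp2_is_square(D)
```

Since d is a non-square, the denominators \(1 \pm dx_1x_2y_1y_2\) never vanish on \(\mathcal {E}(\mathbb {F}_{p^2})\), which is why `edwards_add` needs no special cases.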
The trace \(t_{\mathcal {E}}\) of the \(p^2\)-power Frobenius endomorphism \(\pi _\mathcal {E}\) of \(\mathcal {E}\) is \(t_\mathcal {E}= 136368062447564341573735631776713817674\), which reveals that
\[\#\mathcal {E}(\mathbb {F}_{p^2}) = p^2+1-t_\mathcal {E}= 2^3 \cdot 7^2 \cdot N, \qquad (2)\]
where N is a 246-bit prime. The cryptographic group we work with in this paper is \(\mathcal {E}(\mathbb {F}_{p^2})[N]\).
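The group order can be recomputed directly from p and the trace. The short sketch below assumes the cofactor is \(2^3 \cdot 7^2 = 392\) and checks that assumption by exact division; the base-2 Fermat test is only a quick probabilistic sanity check on N, not a primality proof.

```python
P = 2**127 - 1
TRACE = 136368062447564341573735631776713817674  # t_E as quoted in the text

group_order = P * P + 1 - TRACE
COFACTOR = 2**3 * 7**2                 # assumed cofactor, verified below

divides_exactly = group_order % COFACTOR == 0
N = group_order // COFACTOR
n_bits = N.bit_length()

# Fermat test to base 2: necessary (not sufficient) for N to be prime.
fermat_base2_ok = pow(2, N - 1, N) == 1

print("cofactor divides order:", divides_exactly)
print("bit length of N:", n_bits)
print("passes Fermat test to base 2:", fermat_base2_ok)
```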
2.2 Where did this Curve Come From?
The curve \(\mathcal {E}\) above comes from the family of \(\mathbb {Q}\)-curves of degree 2 – originally defined by Hasegawa [29] – that was recently used as one of the example families in Smith’s general construction of \(\mathbb {Q}\)-curve endomorphisms [44, 46]. Certain examples of low-degree \(\mathbb {Q}\)-curves (including this family) were independently obtained through a different construction by Guillevic and Ionica [27], who also studied 4-dimensional decompositions arising from such curves possessing CM. In fact, \(\mathcal {E}\) has a similar structure to the curve constructed in [27, Exercise 1], but is over the prime \(p=2^{127}-1\).
For \(\varDelta \) a square-free integer, this family is defined over \(\mathbb {Q}(\sqrt{\varDelta })\) and is parameterized by \(s \in \mathbb {Q}\) as
\[\tilde{\mathcal {E}}_{2,\varDelta ,s}:\, y^2 = x^3-6(5-3s\sqrt{\varDelta })x+8(7-9s\sqrt{\varDelta }). \qquad (3)\]
By definition [44, Definition 1], curves from this family are 2-isogenous (over \(\mathbb {Q}(\sqrt{\varDelta },\sqrt{-2})\)) to their Galois conjugates \(^\sigma \tilde{\mathcal {E}}_{2,\varDelta ,s}\). Smith reduces \(\tilde{\mathcal {E}}_{2,\varDelta ,s}\) and \(^\sigma \tilde{\mathcal {E}}_{2,\varDelta ,s}\) modulo primes p that are inert in \(\mathbb {Q}(\sqrt{\varDelta })\) to produce the curves \(\mathcal {E}_{2,\varDelta ,s}\) and \(^\sigma \mathcal {E}_{2,\varDelta ,s}\) defined over \(\mathbb {F}_{p^2}\). He then composes the induced 2-isogeny from \(\mathcal {E}_{2,\varDelta ,s}\) to \(^\sigma \mathcal {E}_{2,\varDelta ,s}\) with the p-power Frobenius map from \(^\sigma \mathcal {E}_{2,\varDelta ,s}\) back to \(\mathcal {E}_{2,\varDelta ,s}\), which produces an efficiently computable degree 2p endomorphism \(\psi \) on \(\mathcal {E}_{2,\varDelta ,s}\).
Recall that in this paper we fix \(p=2^{127}-1\) for efficiency reasons. For this particular prime p and this family of \(\mathbb {Q}\)-curves, Smith’s construction gives rise to precisely p non-isomorphic curves corresponding to each possible choice of \(s \in \mathbb {F}_p\) [46, Proposition 1]. Varying s allows us to readily find curves belonging to this family with strong cryptographic group orders, each of which comes equipped with the endomorphism \(\psi \) that facilitates a two-dimensional scalar decomposition.
Seeking a four-dimensional (rather than two-dimensional) scalar decomposition on \(\mathcal {E}_{2,\varDelta ,s}\) restricts us to a very small subset of possible s values. This is because we require the existence of another efficiently computable endomorphism on \(\mathcal {E}_{2,\varDelta ,s}\), namely the low-degree GLV endomorphism \(\phi \) on those instances of \(\mathcal {E}_{2,\varDelta ,s}\) that possess CM over \(\mathbb {Q}(\sqrt{\varDelta })\). In [46, Sect. 9], Smith explains why there are only a handful of s values in any particular \(\mathbb {Q}\)-curve family that correspond to a curve with CM, before cataloging all such instances in the families of \(\mathbb {Q}\)-curves of degrees 2, 3, 5 and 7. In particular, up to isogeny and over any prime p, there are merely 13 values of s such that \(\mathcal {E}_{2,\varDelta ,s}\) has CM over \(\mathbb {Q}(\sqrt{\varDelta })\). As is remarked in [46, Sect. 9], this scarcity of CM curves makes it highly unlikely that we will find a secure instance of a low-degree \(\mathbb {Q}\)-curve family with CM over any fixed prime p. This is the reason why other authors chasing high speeds at the 128-bit security level have previously sacrificed the fast Mersenne prime \(p=2^{127}-1\) in favor of a four-dimensional decomposition [9, 37]; one can always search through the small handful of exceptional CM curves over many sub-optimal primes until a cryptographically secure instance is found. However, in the specific case of \(p=2^{127}-1\), we actually get extremely lucky: our search through Smith’s tables of exceptional \(\mathbb {Q}\)-curves with CM [46, Theorem 6] found one particular instance over \(\mathbb {F}_{p^2}\) with a prime subgroup of 246 bits, namely \(\mathcal {E}_{2,\varDelta ,s}\) with \(s=\pm \frac{4}{9}\) and \(\varDelta =5\). As is detailed in [46, Sect. 3], the specification of \(\varDelta =5\) here does not dictate how we form the extension field \(\mathbb {F}_{p^2}\) over \(\mathbb {F}_p\); all quadratic extension fields of \(\mathbb {F}_p\) are isomorphic, so we can take \(s\sqrt{\varDelta } = \pm \frac{4}{9}\sqrt{5}\) in (3) while still taking the reduction of \(\tilde{\mathcal {E}}_{2,5,\pm \frac{4}{9}}\) modulo p to be \(\mathcal {E}_{2,5,\pm \frac{4}{9}}/\mathbb {F}_{p^2}\) with \(\mathbb {F}_{p^2}:=\mathbb {F}_p(\sqrt{-1})\). To simplify notation, from hereon we fix \(\tilde{\mathcal {E}}_\mathrm{W} := \tilde{\mathcal {E}}_{2,5,\pm \frac{4}{9}}\) and define \(\mathcal {E}_\mathrm{W}\) as the reduction of \(\tilde{\mathcal {E}}_\mathrm{W}\) modulo p, given as
\[\mathcal {E}_\mathrm{W}/\mathbb {F}_{p^2}:\, y^2 = x^3-(30-8\sqrt{5})x+(56-32\sqrt{5}), \qquad (4)\]
where the choice of the root \(\sqrt{5}\) in \(\mathbb {F}_{p^2}\) will be fixed in Sect. 3. We note that the short Weierstrass curve \(\mathcal {E}_\mathrm{W}\) is not isomorphic to our twisted Edwards curve \(\mathcal {E}\), but rather to a twisted Edwards curve \(\hat{\mathcal {E}}\) that is \(\mathbb {F}_{p^2}\)-isogenous to \(\mathcal {E}\). The reason we work with \(\mathcal {E}\) rather than \(\hat{\mathcal {E}}\) is because the curve constant d on \(\mathcal {E}\) is non-square in \(\mathbb {F}_{p^2}\), which is not the case for the curve constant \(\hat{d}\) on \(\hat{\mathcal {E}}\); as we mentioned above, d being a non-square ensures that the fastest known addition formulas are also complete on \(\mathcal {E}\). The isogenies between \(\mathcal {E}\) and \(\hat{\mathcal {E}}\) are made explicit as follows.
Proposition 1
Let \(\hat{\mathcal {E}}/K\) and \(\mathcal {E}/K\) be the twisted Edwards curves defined by \(\hat{\mathcal {E}}/K :-x^2+y^2 = 1+\hat{d}x^2y^2\) and \(\mathcal {E}/K :-x^2+y^2 = 1+dx^2y^2\). If \(d = -(1+1/\hat{d})\), then the map \(\tau \, :\, \mathcal {E}\rightarrow \hat{\mathcal {E}}\), \((x,y) \mapsto \left( \frac{2xy}{(x^2+y^2)\sqrt{\hat{d}}} \, , \, \frac{x^2-y^2+2}{y^2-x^2} \right) \) is a 4-isogeny, the dual of which is \(\hat{\tau } \, :\, \hat{\mathcal {E}} \rightarrow \mathcal {E}\), \((x,y) \mapsto \left( \frac{2xy\sqrt{\hat{d}}}{x^2-y^2+2} \, , \, \frac{y^2-x^2}{y^2+x^2} \right) \).
We note at once that if \(\hat{d}\) is a square in K, then \(\tau \) and \(\hat{\tau }\) are defined over K. Fortunately, while the twisted Edwards curve \(\hat{\mathcal {E}}\) corresponding to \(\mathcal {E}_\mathrm{W}/\mathbb {F}_{p^2}\) has a square constant \(\hat{d}\), our chosen isogenous curve \(\mathcal {E}\) has the non-square constant \(d = -(1+1/\hat{d})\). Our implementation will work solely in twisted Edwards coordinates on \(\mathcal {E}\), but we will pass back and forth through \(\mathcal {E}_\mathrm{W}\) (via \(\hat{\mathcal {E}}\)) when deriving explicit formulas for the endomorphisms \(\phi \) and \(\psi \) in Sect. 3. We note that Hamburg used 4-isogenies (also derived from [1]) to a similar effect in [28].
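The maps in Proposition 1 can be sanity-checked numerically on a toy example. The sketch below works over the small prime field \(\mathbb {F}_{103}\) with \(\hat{d}=9\); both values are hypothetical choices made only so that every affine point can be enumerated, and they have nothing to do with the parameters of Four\(\mathbb {Q}\). It checks that \(\tau \) sends points of \(\mathcal {E}\) to points of \(\hat{\mathcal {E}}\) and that \(\hat{\tau }\) sends points of \(\hat{\mathcal {E}}\) back to \(\mathcal {E}\), skipping points where a denominator vanishes. It does not verify the degree or the duality of the two maps.

```python
Q = 103                                  # toy prime, for illustration only
SQRT_DHAT = 3                            # toy square root of d-hat
DHAT = SQRT_DHAT**2 % Q                  # square constant of E-hat
D = (-(1 + pow(DHAT, -1, Q))) % Q        # d = -(1 + 1/d-hat)


def inv(a):
    return pow(a, -1, Q)                 # modular inverse (Python 3.8+)


def on_curve(pt, d):
    """Test -x^2 + y^2 = 1 + d*x^2*y^2 over F_Q."""
    x, y = pt
    return (-x * x + y * y - 1 - d * x * x * y * y) % Q == 0


def tau(pt):
    """The map E -> E-hat from Proposition 1 (None if a denominator is zero)."""
    x, y = pt
    den_x = (x * x + y * y) * SQRT_DHAT % Q
    den_y = (y * y - x * x) % Q
    if den_x == 0 or den_y == 0:
        return None
    return (2 * x * y * inv(den_x) % Q, (x * x - y * y + 2) * inv(den_y) % Q)


def tau_hat(pt):
    """The map E-hat -> E from Proposition 1 (None if a denominator is zero)."""
    x, y = pt
    den_x = (x * x - y * y + 2) % Q
    den_y = (y * y + x * x) % Q
    if den_x == 0 or den_y == 0:
        return None
    return (2 * x * y * SQRT_DHAT * inv(den_x) % Q, (y * y - x * x) * inv(den_y) % Q)


def affine_points(d):
    return [(x, y) for x in range(Q) for y in range(Q) if on_curve((x, y), d)]


images = [im for im in map(tau, affine_points(D)) if im is not None]
back = [im for im in map(tau_hat, affine_points(DHAT)) if im is not None]
tau_ok = all(on_curve(im, DHAT) for im in images)
tau_hat_ok = all(on_curve(im, D) for im in back)
```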
2.3 The Cryptographic Security of Four\(\mathbb {Q}\)
Pollard’s rho algorithm [42] is the best known way to solve the ECDLP in \(\mathcal {E}(\mathbb {F}_{p^2})[N]\). An optimized version of this attack which uses the negation map [50] requires around \(\sqrt{\pi N/4} \sim 2^{122.5}\) group operations on average. We note that, unlike some of the typical GLV [22] or GLS [21] endomorphisms that can be used to speed up Pollard’s rho algorithm [16], both \(\psi \) and \(\phi \) on \(\mathcal {E}\) do not facilitate any known advantage; neither of these endomorphisms have a small orbit and they are both more expensive to compute than an amortized addition. Thus, the known complexity of the ECDLP on \(\mathcal {E}\) is comparable to various other curves used in the speed-record literature; optimized implementations of Pollard rho against any of the fastest curves in [4, 9, 13, 18, 21, 37, 41] would require between \(2^{124.8}\) and \(2^{125.8}\) group operations on average. Ideally, we would prefer not to have the factor \(7^2\) dividing \(\#\mathcal {E}(\mathbb {F}_{p^2})\), but the resulting (\(\sim 2.8\) bit) security degradation is a small price to pay for having the fastest field at the 128-bit level in conjunction with a four-dimensional scalar decomposition. As we discuss further in [14], it was a long shot to try and find such a cryptographically secure \(\mathbb {Q}\)-curve with CM over \(\mathbb {F}_{p^2}\) in Smith’s tables in the first place, let alone one that also had the necessary torsion to support a twisted Edwards model.
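The figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes the cofactor of \(\#\mathcal {E}(\mathbb {F}_{p^2})\) is \(2^3\cdot 7^2\); it evaluates \(\sqrt{\pi N/4}\) in bits and the loss attributable to the factor \(7^2\), using floating-point logarithms, so the results are approximations.

```python
import math

P = 2**127 - 1
TRACE = 136368062447564341573735631776713817674  # t_E as quoted in Sect. 2.1
N = (P * P + 1 - TRACE) // (2**3 * 7**2)          # assumed cofactor 392

# Expected cost of Pollard rho with the negation map, in bits.
rho_bits = 0.5 * math.log2(math.pi * N / 4)

# Security lost because 7^2 divides the group order, in bits.
loss_bits = 0.5 * math.log2(7**2)

# Hypothetical cost if the factor 7^2 were absent.
rho_bits_without_49 = rho_bits + loss_bits

print(round(rho_bits, 2), round(loss_bits, 2), round(rho_bits_without_49, 2))
```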
Since \(\mathcal {E}(\mathbb {F}_{p^2})\) has rational 2-torsion, it is easy to write down the corresponding abelian surface over \(\mathbb {F}_p\) whose Jacobian is isogenous to the Weil restriction of \(\mathcal {E}\) – see [43, Lemma 2.1 and Lemma 3.1]. But since the best known algorithm to solve the discrete logarithm problem on such abelian surfaces is again Pollard’s rho algorithm, the Weil descent philosophy (cf. [24]) does not pose a threat here. Furthermore, the embedding degree of \(\mathcal {E}\) with respect to N is \((N-1)/2\), making it infeasible to reduce the ECDLP into a finite field [19, 39].
We note that the largest prime factor dividing the group order of \(\mathcal {E}\)’s quadratic twist is 158 bits, but twist-security [4] is not an issue in this work: firstly, our software always validates input points (such validation is essentially free), and secondly, x-coordinate-only arithmetic (which is where twist-security makes sense) on \(\mathcal {E}\) is not competitive with a four-dimensional decomposition that uses both coordinates.
In contrast to most currently standardized curves, the proposed curve is both defined over a quadratic extension field and has a small discriminant; one notable exception is secp256k1 in the SEC standard [11], which is used in the Bitcoin protocol and also has small discriminant. However, it is important to note that there is no better-than-generic attack known to date that can exploit either of these two properties on \(\mathcal {E}\). In fact, with respect to ECDLP difficulty, Koblitz, Koblitz and Menezes [33, Sect. 11] point out that slower, large discriminant curves, like NIST P-256 and Curve25519, may turn out to be less conservative than specially chosen curves with small discriminant.
3 The Endomorphisms \(\psi \) and \(\phi \)
In this section we derive explicit formulas for the two endomorphisms on \(\mathcal {E}\). In what follows we use \(c_{i,j,k,l}\) to denote the constant \(i+j\sqrt{2}+k\sqrt{5}+l\sqrt{2}\sqrt{5}\) in \(\mathbb {F}_{p^2}\), which is fixed by setting \(\sqrt{2}:=2^{64}\) and \(\sqrt{5}:=87392807087336976318005368820707244464 \cdot i\).
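The two square roots fixed above are easy to check or recompute. Since \(2^{128} \equiv 2 \pmod p\), the choice \(\sqrt{2}:=2^{64}\) is immediate. For \(\sqrt{5}\), the sketch below computes a square root c of \(-5\) in \(\mathbb {F}_p\) (using \(p \equiv 3 \pmod 4\)), so that \((c\cdot i)^2 = 5\), and compares it with the quoted constant up to sign. The helper `const` is a hypothetical name for building \(c_{i,j,k,l}\) as a pair (real part, imaginary part).

```python
P = 2**127 - 1
SQRT2 = 2**64
QUOTED_SQRT5_IMAG = 87392807087336976318005368820707244464  # from the text

sqrt2_ok = SQRT2 * SQRT2 % P == 2

# p = 3 (mod 4): a square root of a quadratic residue a is a^((p+1)/4).
c = pow(-5 % P, (P + 1) // 4, P)
sqrt5_ok = c * c % P == (-5) % P

# True if the quoted constant equals the computed root up to sign.
matches_quoted = QUOTED_SQRT5_IMAG in (c, P - c)


def const(i, j, k, l, sqrt5_imag=c):
    """c_{i,j,k,l} = i + j*sqrt(2) + k*sqrt(5) + l*sqrt(2)*sqrt(5) in F_p(i).

    sqrt(5) is purely imaginary here, so terms with k and l land in the
    imaginary coordinate.
    """
    real = (i + j * SQRT2) % P
    imag = (k + l * SQRT2) * sqrt5_imag % P
    return (real, imag)


print("sqrt(2) check:", sqrt2_ok)
print("sqrt(5) check:", sqrt5_ok)
print("quoted sqrt(5) matches up to sign:", matches_quoted)
```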
For both \(\psi \) and \(\phi \), we start by deriving the explicit formulas on the short Weierstrass model \(\mathcal {E}_\mathrm{W}\). As discussed in the previous section, we will pass back and forth between \(\mathcal {E}\) and \(\mathcal {E}_\mathrm{W}\) via the twisted Edwards curve \(\hat{\mathcal {E}}\) that is 4-isogenous to \(\mathcal {E}\) over \(\mathbb {F}_{p^2}\). The maps between \(\mathcal {E}\) and \(\hat{\mathcal {E}}\) are given in Proposition 1, and we take the maps \(\delta :\mathcal {E}_\mathrm{W} \rightarrow \hat{\mathcal {E}}\) and \(\delta ^{-1} :\hat{\mathcal {E}} \rightarrow \mathcal {E}_\mathrm{W}\) from [46, Sect. 5] (tailored to our \(\hat{\mathcal {E}}\)) as \(\delta \, :(x,y) \mapsto \left( \frac{\gamma (x-4)}{y},\frac{x-4-c_{0,2,0,1}}{x-4+c_{0,2,0,1}}\right) \), and \(\delta ^{-1} \, :(x,y) \mapsto \left( \frac{c_{0,2,0,1}(y+1)}{1-y}+4,\frac{ c_{0,2,0,1}(y+1)\gamma }{x(1-y)}\right) \), where \(\gamma ^2=c_{-12,-4,0,-2}\). The choice of the square root \(\gamma \in \mathbb {F}_{p^2}\) becomes irrelevant in the compositions below.
3.1 Explicit Formulas for \(\psi \)
There is almost no work to be done in deriving \(\psi \) on \(\mathcal {E}\), since this is Smith’s \(\mathbb {Q}\)-curve endomorphism corresponding to the degree-2 family to which \(\mathcal {E}_\mathrm{W}\) belongs. We start with \(\psi _\mathrm{W} :\mathcal {E}_\mathrm{W} \rightarrow \mathcal {E}_\mathrm{W}\), taken from [46, Sect. 5], as \(\psi _\mathrm{W} :(x,y) \mapsto \left( \left( -\frac{x}{2}-\frac{c_{9,0,4,0}}{x-4}\right) ^p, \left( \frac{y}{i\sqrt{2}} \left( -\frac{1}{2}+\frac{c_{9,0,4,0}}{(x-4)^2}\right) \right) ^p\right) \). With \(\psi _\mathrm{W}\) as above, we define \(\psi :\mathcal {E}\rightarrow \mathcal {E}\) as the composition \(\psi = \hat{\tau }\delta \psi _\mathrm{W} \delta ^{-1} \tau \). In optimizing the explicit formulas for this composition, there is practically nothing to be gained by simplifying the full composition in the function field \(\mathbb {F}_{p^2}(\mathcal {E})\). However, it is advantageous to optimize explicit formulas for the inner composition \((\delta \psi _\mathrm{W} \delta ^{-1})\) in the function field \(\mathbb {F}_{p^2}(\hat{\mathcal {E}})\). In fact, for both \(\psi \) and \(\phi \), optimized explicit formulas for this inner composition are faster than the respective endomorphisms \(\psi _\mathrm{W}\) and \(\phi _\mathrm{W}\), and are therefore much faster than computing the respective compositions individually.
Simplifying the composition \(\delta \psi _\mathrm{W} \delta ^{-1}\) in the function field \(\mathbb {F}_{p^2}(\hat{\mathcal {E}})\) yields \((\delta \psi _\mathrm{W} \delta ^{-1}) :\hat{\mathcal {E}} \rightarrow \hat{\mathcal {E}}\),
Note that each of the p-power Frobenius operations above amounts to one \(\mathbb {F}_{p}\) negation. As mentioned above, we compute the endomorphism \(\psi = \hat{\tau } (\delta \psi _\mathrm{W} \delta ^{-1}) \tau \) on \(\mathcal {E}\) by computing \(\tau \) and \(\hat{\tau }\) separately; see Sect. 3.4 for the operation counts.
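The claim that a p-power Frobenius costs one negation in \(\mathbb {F}_p\) follows from \(i^p = -i\) when \(p \equiv 3 \pmod 4\), so that \((a+bi)^p = a-bi\). The sketch below compares a generic square-and-multiply exponentiation in \(\mathbb {F}_{p^2}\) against conjugation; the function names are hypothetical.

```python
P = 2**127 - 1


def fp2_mul(u, v):
    a, b = u
    c, d = v
    return ((a * c - b * d) % P, (a * d + b * c) % P)


def fp2_pow(u, e):
    """Square-and-multiply exponentiation in F_p(i)."""
    result = (1, 0)
    while e:
        if e & 1:
            result = fp2_mul(result, u)
        u = fp2_mul(u, u)
        e >>= 1
    return result


def frobenius(u):
    """p-power Frobenius on F_p(i): conjugation, i.e. one negation in F_p."""
    a, b = u
    return (a, (-b) % P)


samples = [(2, 3), (0, 1), (5, 0), (123456789, 987654321)]
frobenius_ok = all(fp2_pow(z, P) == frobenius(z) for z in samples)
```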
3.2 Deriving Explicit Formulas for \(\phi \)
We now derive the second endomorphism \(\phi \) that arises from \(\mathcal {E}\) admitting CM by the order of discriminant \(D=-40\). We start by pointing out that there are actually multiple routes that could be taken in defining and deriving \(\phi \) (see the full version [14] for additional details). The possibility that we use in this paper produces an endomorphism of degree 5. This option was revealed to us in correspondence with Ben Smith, who pointed out that \(\mathbb {Q}\)-curves with CM can also be produced as the intersection of families of \(\mathbb {Q}\)-curves, and that our curve \(\mathcal {E}\) is not only a degree-2 \(\mathbb {Q}\)-curve, but is also a degree-5 \(\mathbb {Q}\)-curve. Thus, the second endomorphism \(\phi \) can be derived by first following the treatment in [46, Sect. 7] (see also [27, Sect. 3.3]) to derive \(\phi _\mathrm{W}\) as a 5-isogeny on \(\mathcal {E}_\mathrm{W}\), which we do below.
Working in \(\mathbb {Q}(\sqrt{5})[x]\), the 5-division polynomial (cf. [20, Definition 9.8.4]) of \(\tilde{\mathcal {E}}_\mathrm{W}\) factors as f(x)g(x), where \(f(x) = x^2 + 4\sqrt{5} \cdot x +(18-4/5\sqrt{5})\) and g(x) (which is of degree 10) are irreducible. The polynomial f(x) defines the kernel of a 5-isogeny \(\phi ^\sigma _\mathrm{W} :\tilde{\mathcal {E}}_\mathrm{W} \rightarrow \tilde{\mathcal {E}}^\sigma _\mathrm{W}\). We use this kernel to compute \(\phi ^\sigma _\mathrm{W}\) via Vélu’s formulae [49] (see also [34, Sect. 2.4]), reduce modulo p, and then compose with Frobenius \(\pi _p :\mathcal {E}^\sigma _\mathrm{W} \rightarrow \mathcal {E}_\mathrm{W}\) to give \(\phi _\mathrm{W} :\mathcal {E}_\mathrm{W} \rightarrow \mathcal {E}_\mathrm{W} , (x,y) \mapsto (x_{\phi _W},y_{\phi _W})\), where
As was the case with \(\psi \) in Sect. 3.1, it is advantageous to optimize formulas in \(\mathbb {F}_{p^2}(\hat{\mathcal {E}})\) for the composition \((\delta \phi _\mathrm{W} \delta ^{-1})\), which gives \((\delta \phi _\mathrm{W} \delta ^{-1}):\hat{\mathcal {E}} \rightarrow \hat{\mathcal {E}}, (x,y) \mapsto (x_\phi ,y_\phi )\), where
Again, we use this to compute the full endomorphism \(\phi = \hat{\tau } (\delta \phi _\mathrm{W} \delta ^{-1}) \tau \) on \(\mathcal {E}\) by computing \(\tau \) and \(\hat{\tau }\) separately; see Sect. 3.4 for the operation counts.
3.3 Eigenvalues
The eigenvalues of the two endomorphisms \(\psi \) and \(\phi \) play a key role in developing scalar decompositions. In this subsection we write them in terms of the curve parameters. From [46, Theorem 2], and given that we used a 4-isogeny \(\tau \) and its dual to pass back and forth to \(\mathcal {E}_\mathrm{W}\), the eigenvalues of \(\psi \) on \(\mathcal {E}(\mathbb {F}_{p^2})[N]\) are \(\lambda _{\psi } := 4 \cdot \frac{p+1}{r} \pmod {N}\) and \(\lambda _{\psi }' := - \lambda _{\psi } \pmod {N}\), where r is an integer satisfying \(2r^2 = 2p + t_{\mathcal {E}}\). To derive the eigenvalues for \(\phi \), we make use of the CM equation for \(\mathcal {E}\), which (since \(\mathcal {E}\) has CM by the order of discriminant \(D=-40\)) is \(40V^2 = 4p^2-t_\mathcal {E}^2\), for some integer V. We fix r and V to be the positive integers satisfying these equations, namely \(V:=4929397548930634471175140323270296814\) and \(r:=15437785290780909242\).
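The value of r and the eigenvalue \(\lambda _\psi \) can be checked numerically. The sketch below assumes the cofactor \(2^3\cdot 7^2\) for \(\#\mathcal {E}(\mathbb {F}_{p^2})\). It also tests the relation \(\lambda _\psi ^2 \equiv 32 \pmod N\), which is not stated in the text; it follows from \(2r^2 = 2p+t_\mathcal {E}\) and \(t_\mathcal {E}\equiv p^2+1 \pmod N\), since then \(2r^2 \equiv (p+1)^2\) and \(\lambda _\psi ^2 = 16(p+1)^2/r^2\).

```python
P = 2**127 - 1
TRACE = 136368062447564341573735631776713817674   # t_E
R = 15437785290780909242                           # r as quoted in the text
N = (P * P + 1 - TRACE) // (2**3 * 7**2)           # assumed cofactor 392

# r is defined by 2*r^2 = 2*p + t_E (an exact integer identity).
r_ok = 2 * R * R == 2 * P + TRACE

# lambda_psi = 4*(p+1)/r (mod N); the inverse needs Python 3.8+.
lambda_psi = 4 * (P + 1) * pow(R, -1, N) % N
lambda_psi_conj = (-lambda_psi) % N

# Derived consistency check (see the lead-in): lambda_psi^2 = 32 (mod N).
square_is_32 = lambda_psi * lambda_psi % N == 32

print("2r^2 = 2p + t_E:", r_ok)
print("lambda_psi^2 = 32 (mod N):", square_is_32)
```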
Proposition 2
The eigenvalues of \(\phi \) on \(\mathcal {E}(\mathbb {F}_{p^2})[N]\) are
3.4 Section Summary
Table 1 summarizes the isogenies derived in this section, together with their exact operation counts. The reason that multiples of 0.5 appear in the additions column is that we count Frobenius operations (which amount to a negation in \(\mathbb {F}_p\)) as half an addition in \(\mathbb {F}_{p^2}\). Four-dimensional scalar decompositions on \(\mathcal {E}\) require the computation of \(\phi (P)\), \(\psi (P)\) and the composition \(\psi (\phi (P))\); the ordering here is important since \(\psi \) is much faster than \(\phi \), meaning we actually compute \(\phi \) once and \(\psi \) twice. We note that all sets of explicit formulas were derived assuming the inputs were projective points \((X :Y :Z)\) corresponding to a point (X / Z, Y / Z) in the domain of the isogeny. Similarly, all explicit formulas output the point \((X' :Y' :Z')\) corresponding to \((X'/Z',Y'/Z')\) in the codomain, and in the special cases when the codomain is \(\mathcal {E}\) (i.e., for \(\hat{\tau }\), \(\phi \), \(\psi \) and \(-\psi \phi \)), we also output the coordinate \(T'\) (or a related variant) corresponding to \(T'=X'Y'/Z'\), which facilitates faster subsequent group law formulas on \(\mathcal {E}\) – see [14].
Table 1 reveals that, on input of a projective point in \(\mathcal {E}(\mathbb {F}_{p^2})[N]\), the total cost of the three maps \(\phi \), \(\psi \) and \(\psi \phi \) is \(68 \mathbf{M}+27 \mathbf{S}+49.5\mathbf{A}\). Computing the maps using these explicit formulas requires the storage of 16 constants in \(\mathbb {F}_{p^2}\), and at any stage of the endomorphism computations, requires the storage of at most 7 temporary variables.
4 Optimal Scalar Decompositions
Let \(\lambda _\psi \) and \(\lambda _\phi \) be as fixed in Sect. 3.3. In this section we show how to compute, for any integer scalar \(m \in \mathbb {Z}\), a corresponding 4-dimensional multiscalar \((a_1,a_2,a_3,a_4) \in \mathbb {Z}^4\) such that \(m \equiv a_1+a_2\lambda _\phi +a_3\lambda _\psi +a_4\lambda _\phi \lambda _\psi \pmod {N}\), such that \(0\le a_i<2^{64}-1\) for \(i=1,2,3,4\), and such that \(a_1\) is odd (which facilitates faster scalar recodings and multiplications – see Sect. 5). An excellent reference for general scalar decompositions in the context of elliptic curve cryptography is [45], where it is shown how to write down short lattice bases for scalar decompositions directly from the curve parameters. Here, we show how to further reduce such short bases into bases that are, in the context of multiscalar multiplications, optimal.
4.1 Babai Rounding and Optimal Bases
Following [45, Sect. 1], we define the lattice of zero decompositions as
\[\mathcal {L}:= \{ (a_1,a_2,a_3,a_4) \in \mathbb {Z}^4 \, |\, a_1+a_2\lambda _\phi +a_3\lambda _\psi +a_4\lambda _\phi \lambda _\psi \equiv 0 \pmod {N} \} ,\]
so that the set of decompositions for \(m \in \mathbb {Z}/N\mathbb {Z}\) is the lattice coset \((m,0,0,0)+\mathcal {L}\). For a given basis \(\mathbf{B}=(\mathbf {b}_1,\mathbf {b}_2,\mathbf {b}_3,\mathbf {b}_4)\) of \(\mathcal {L}\), and on input of any \(m \in \mathbb {Z}\), the Babai rounding technique [2] computes \((\alpha _1,\alpha _2,\alpha _3,\alpha _4) \in \mathbb {Q}^4\) as the unique solution to \((m,0,0,0) = \sum _{i=1}^{4} \alpha _i \mathbf {b}_i\), and subsequently computes the multiscalar \((a_1,a_2,a_3,a_4)=(m,0,0,0)-\sum _{i=1}^4 \lfloor \alpha _i \rceil \cdot \mathbf {b}_i\). It follows that \((a_1,a_2,a_3,a_4)-(m,0,0,0) \in \mathcal {L}\), so \(m \equiv a_1+a_2\lambda _\phi + a_3\lambda _\psi + a_4 \lambda _\phi \lambda _\psi \pmod {N}\). Since \(-1/2\le x - \lfloor x \rceil \le 1/2\), this technique finds the unique element in \((m,0,0,0)+\mathcal {L}\) that lies inside the parallelepiped defined by \(\mathcal {P}(\mathbf{B}) = \{\mathbf{B}{} \mathbf{x}\, |\, \mathbf{x} \in [-1/2,1/2)^4\}\), i.e., Babai rounding maps \(\mathbb {Z}\) onto \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\). For a given m, the length of the corresponding multiscalar multiplication is then determined by the infinity norm, \(||\cdot ||_\infty \), of the corresponding element \((a_1,a_2,a_3,a_4)\) in \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\).
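Babai rounding is easiest to see on a small example. The sketch below uses a hypothetical two-dimensional toy lattice with \(N=101\) and \(\lambda =10\) (so that \(\lambda ^2 \equiv -1\)), not the four-dimensional lattice of this paper, and exact rational arithmetic. The functions `solve` and `babai_round` are illustrative names; the same code works for any dimension.

```python
from fractions import Fraction
from math import floor

N, LAM = 101, 10                       # toy parameters, not from the paper
BASIS = [(10, -1), (1, 10)]            # both satisfy a1 + a2*LAM = 0 (mod N)


def solve(basis, target):
    """Return alphas with sum(alpha_i * b_i) = target, via exact elimination."""
    n = len(basis)
    # Row j holds the j-th coordinate of every basis vector, then the target.
    rows = [[Fraction(basis[i][j]) for i in range(n)] + [Fraction(target[j])]
            for j in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


def babai_round(basis, m):
    """Decompose m into a short multiscalar in the coset (m, 0, ...) + L."""
    n = len(basis)
    target = [m] + [0] * (n - 1)
    alphas = solve(basis, target)
    rounded = [floor(a + Fraction(1, 2)) for a in alphas]   # nearest integer
    return [target[j] - sum(rounded[i] * basis[i][j] for i in range(n))
            for j in range(n)]


a1, a2 = babai_round(BASIS, 50)
print(a1, a2, (a1 + a2 * LAM) % N)
```

Every output lies in \(\mathcal {P}(\mathbf{B})\), so for this toy basis each coordinate is bounded in absolute value by \((10+1)/2\), i.e. by 5 for integers.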
Since our scalar multiplications must run in time independent of m, the speed of the multiscalar exponentiations will depend on the worst case, i.e., on the maximal infinity norm taken across all elements in \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\). Equivalently, the speed of the routine will depend on the width of the smallest 4-cube whose convex body contains \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\). This width depends only on the choice of \(\mathbf{B}\), so this gives us a natural way of finding a basis that is optimal for our purposes. We make this concrete in the following definition, which is stated for an arbitrary lattice of dimension n. Definition 1 simplifies the situation by looking for the smallest n-cube containing \(\mathcal {P}(\mathbf{B})\), rather than \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^n\), but our candidate bases will always be orthogonal enough that the two conditions are equivalent in practice.
Definition 1
(Babai-optimal bases). We say that a basis \(\mathbf{B}\) of a lattice \(\mathcal {L}\subset \mathbb {R}^n\) is Babai-optimal if the width of the smallest n-cube containing the parallelepiped \(\mathcal {P}(\mathbf{B})\) is minimal across all bases for \(\mathcal {L}\).
We note immediately that taking the n successive minima under \(||\cdot ||_\ell \), for any \(\ell \in \{1,2,\dots ,\infty \}\), will not yield a Babai-optimal basis in general. Indeed, for our specific lattice \(\mathcal {L}\), neither the \(||\cdot ||_2\)-reduced basis (output from LLL [35]) nor the \(||\cdot ||_\infty \)-reduced basis (in the sense of Lovász and Scarf [38]) is Babai-optimal.
For very low dimensions, such as those used in ECC scalar decompositions, we can find a Babai-optimal basis via straightforward enumeration as follows. Starting with any reasonably small basis \(\mathbf{B}'=(\mathbf {b}_1',\dots ,\mathbf {b}_n')\), like the ones in [45], we compute the width, \(w(\mathbf{B}')\), of the smallest n-cube whose convex body contains \(\mathcal {P}(\mathbf{B}')\); by the definition of \(\mathcal {P}\), this is \(w(\mathbf{B}') = \mathrm{max}_{1 \le j \le n}\left\{ \sum _{i=1}^n |\mathbf {b}'_i[j]| \right\} \). We then enumerate the set S of all vectors \(\mathbf{v} \in \mathcal {L}\) such that \(||\mathbf{v}||_\infty \le w(\mathbf{B}')\); any vector not in S cannot be in a basis whose width is smaller than that of \(\mathbf{B}'\). We can then test all possible bases \(\mathbf{B}\) that are formed as combinations of n linearly independent vectors in S, and choose one corresponding to the minimal value of \(w(\mathbf{B})\).
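The enumeration just described can be sketched in Python on a two-dimensional toy lattice. The parameters \(N=101\), \(\lambda =10\), the starting basis and all function names are assumptions for illustration only. In this sketch a pair of vectors is accepted as a basis when its determinant equals \(\pm N\), which is the index of the toy lattice in \(\mathbb {Z}^2\).

```python
from itertools import combinations

# Toy parameters (assumed for illustration only).
N, LAM = 101, 10   # lattice: {(z1, z2) : z1 + z2*LAM = 0 (mod N)}

def width(basis):
    """w(B) = max over coordinates j of sum_i |b_i[j]|."""
    n = len(basis[0])
    return max(sum(abs(b[j]) for b in basis) for j in range(n))

def in_lattice(v):
    return (v[0] + v[1] * LAM) % N == 0

def babai_optimal(start):
    """Exhaustively search for a basis of minimal width, starting from `start`."""
    w0 = width(start)
    # S: all non-zero lattice vectors with infinity norm at most w(start).
    S = [(x, y)
         for x in range(-w0, w0 + 1)
         for y in range(-w0, w0 + 1)
         if (x, y) != (0, 0) and in_lattice((x, y))]
    best = tuple(start)
    for u, v in combinations(S, 2):
        det = u[0] * v[1] - u[1] * v[0]
        # |det| == N means (u, v) generates the whole lattice, not a sublattice.
        if abs(det) == N and width((u, v)) < width(best):
            best = (u, v)
    return best

if __name__ == "__main__":
    start = ((10, -1), (11, 9))      # a valid but skewed basis, width 21
    best = babai_optimal(start)
    print(width(start), best, width(best))
```

Starting from the skewed basis of width 21, the search finds a basis of width 11. In four dimensions the same idea applies with 4-subsets of S and a \(4\times 4\) determinant test.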
Proposition 3
A Babai-optimal basis for our zero decomposition lattice \(\mathcal {L}\) is given by \(\mathbf{B}:=( \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_3, \mathbf {b}_4 )\), where
for V and r as fixed in Sect. 3, and where \(\alpha := V/r \in \mathbb {Z}\).
Proof
Straightforward but lengthy calculations using the equations in Sect. 3.3 reveal that \(\mathbf {b}_1\), \(\mathbf {b}_2\), \(\mathbf {b}_3\) and \(\mathbf {b}_4\) are all in \(\mathcal {L}\). Another direct calculation reveals that the determinant of \(\langle \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_3, \mathbf {b}_4 \rangle \) is N, so \(\mathbf{B}\) is a basis for \(\mathcal {L}\). To show that \(\mathbf{B}\) is Babai-optimal, we set \(\mathbf{B}'=\mathbf{B}\) and compute \(w(\mathbf{B}') = \mathrm{max}_{1 \le j \le 4}\left\{ \sum _{i=1}^4 |\mathbf {b}'_i[j]| \right\} \), which (at \(j=1\)) is \(w(\mathbf{B}')= (245\alpha +120r+17)/448\). Enumeration under \(||\cdot ||_\infty \) yields exactly 128 vectors (up to sign) in \(S = \{\mathbf{v} \in \mathcal {L}\mid ||\mathbf{v}||_\infty \le w(\mathbf{B}') \}\); none of the rank 4 bases formed from S have a width smaller than \(\mathbf{B}\). \(\square \)
The size of the set S in the above proof depends on the quality of the initial basis \(\mathbf{B}'\). For the proof, it suffices to start with the Babai-optimal basis \(\mathbf{B}\) itself, but in practice we will usually start with a basis that is not optimal according to Definition 1. In our case we computed the basis in Proposition 3 by first writing down a short basis using Smith’s methodology [45]. We input this into the LLL algorithm [35] to obtain an LLL-reduced basis \(( \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_1+\mathbf {b}_4, \mathbf {b}_3)\); these are also the four successive minima under \(|| \cdot ||_2\). We then input this basis into the algorithm of Lovász and Scarf [38]; this forced the requisite changes to output a basis consisting of the four successive minima under \(|| \cdot ||_\infty \), namely \(( \mathbf {b}_1, \mathbf {b}_1+\mathbf {b}_4,\mathbf {b}_2,\mathbf {b}_1+\mathbf {b}_3)\). Using this as our input \(\mathbf{B}'\) into the enumeration gave a set S of size 282, which we exhaustively searched to find \(\mathbf{B}\).
We now describe a simple scalar decomposition that uses Babai rounding on the optimal basis above. Note that, since V and r are fixed, the four \(\hat{\alpha }_i\) values below are fixed integer constants.
Proposition 4
For a given integer m, and the basis \(\mathbf{B}:=( \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_3, \mathbf {b}_4 )\) in Prop. 3, let \((\alpha _1,\alpha _2,\alpha _3,\alpha _4) \in \mathbb {Q}^4\) be the unique solution to \((m,0,0,0) = \sum _{i=1}^{4} \alpha _i \mathbf {b}_i\), and let \((a_1, a_2, a_3,a_4) =(m,0,0,0)- \sum _{i=1}^{4}\lfloor \alpha _i \rceil \cdot \mathbf {b}_i\). Then \(m \equiv a_1+a_2\lambda _\phi + a_3\lambda _\psi + a_4 \lambda _\phi \lambda _\psi \pmod {N}\) and \(|a_1|,|a_2|, |a_3|, |a_4| <2^{62}\).
4.2 Handling Round-Off Errors
The decomposition described in Proposition 4 requires the computation of four roundings \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil \), where m is the input scalar and the four \(\hat{\alpha }_i\) and N are fixed curve constants. Following [10, Sect. 4.2], one efficient way of performing these roundings is to choose a power of 2 greater than the denominator N, say \(\mu \), and precompute the fixed curve constants \(\ell _i = \lfloor \frac{\hat{\alpha }_i}{N} \cdot \mu \rceil \), so that \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil \) can be computed at runtime as \(\lfloor \frac{\ell _i \cdot m }{\mu } \rfloor \), and the division by \(\mu \) can be computed as a simple shift.
It is correctly noted in [10, Sect. 4.2] that computing the rounding in this way means the answer can be out by 1 in some cases, but it is further said that “in practice this does not affect the size of the multiscalars”. While this assertion may have been true in [10], in general this will not be the case, particularly when we wish to bound the size of the multiscalars as tightly as possible. We address this issue on \(\mathcal {E}\) starting with Lemma 1.
Lemma 1
Let \(\hat{\alpha }\) be any integer, and let m, N and \(\mu \) be positive integers with \(m < \mu \). Then \(\left\lfloor \frac{\hat{\alpha } m}{N} \right\rceil - \left\lfloor \left\lfloor \frac{\hat{\alpha } \mu }{N} \right\rceil \cdot \frac{m}{\mu } \right\rfloor \) is either 0 or 1.
Lemma 1 says that, so long as we choose \(\mu \) to be greater than the maximum size of our input scalars m, our fast method of approximating \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil \) will either give the correct answer, or it will be \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil -1\). It is easy to see that larger choices of \(\mu \) decrease the probability of a rounding error. For example, on 10 million random decompositions of integers between 0 and N with \(\mu =2^{246}\), roughly 2.2 million trials gave at least one error in the \(\alpha _i\); when \(\mu =2^{247}\), roughly 1.7 million trials gave at least one error; when \(\mu =2^{256}\), 4333 trials gave an error; and \(\mu =2^{269}\) was the first power of two that gave no errors.
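The behaviour described in Lemma 1 can be checked numerically. The sketch below uses exact integer arithmetic; the helper names and the toy constants (\(\hat{\alpha }=10\), \(N=101\), \(\mu =2^7\)) are illustrative assumptions, not the Four\(\mathbb {Q}\) curve constants.

```python
def round_div(num, den):
    """Nearest integer to num/den for den > 0 (ties rounded up), exactly."""
    return (2 * num + den) // (2 * den)

def fast_round(alpha_hat, m, N, mu_bits):
    """Approximate round(alpha_hat*m/N) with one multiplication and one shift."""
    ell = round_div(alpha_hat << mu_bits, N)   # precomputable constant l_i
    return (ell * m) >> mu_bits                # floor(l_i * m / mu)

def rounding_error(alpha_hat, m, N, mu_bits):
    """Exact rounding minus fast rounding; Lemma 1 says this is 0 or 1 if m < mu."""
    return round_div(alpha_hat * m, N) - fast_round(alpha_hat, m, N, mu_bits)

if __name__ == "__main__":
    alpha_hat, N, mu_bits = 10, 101, 7         # toy values, mu = 128 > N
    errors = [rounding_error(alpha_hat, m, N, mu_bits) for m in range(N)]
    print(sorted(set(errors)), sum(errors), "of", N, "scalars are off by one")
```

With these toy values, \(m=50\) is rounded correctly while \(m=57\) comes out one too small (5 instead of 6). Increasing `mu_bits` makes such cases rarer but, as discussed in the text, does not by itself rule them out.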
Prior works have seemingly addressed this problem by taking \(\mu \) to be large enough so that the chance of roundoff errors is very (perhaps even exponentially) small. However, no matter how large \(\mu \) is chosen, the existence of a permissible scalar whose decomposition gives a roundoff error is still a possibility (see Note 4), and this could violate constant-time promises.
In this work, and in light of Theorem 1, we instead choose to sacrifice some speed by guaranteeing that roundoff errors are always accounted for. Rather than assuming that \((a_1,a_2,a_3,a_4)=\sum _{i=1}^4 (\alpha _i - \lfloor \alpha _i \rceil )\mathbf {b}_i\), we account for the approximation \(\tilde{\alpha }_i\) to \(\lfloor \alpha _i \rceil \) (described in Lemma 1) by allowing \((a_1, a_2, a_3,a_4) =\sum _{i=1}^4 \left( \alpha _i - \tilde{\alpha }_i \right) \mathbf {b}_i=\sum _{i=1}^4 \left( \alpha _i - (\lfloor {\alpha }_i\rceil - \epsilon _i) \right) \mathbf {b}_i\), for all sixteen combinations arising from \(\epsilon _i \in \{0,1\}\), for \(i=1,2,3,4\). This means that all integers less than \(\mu \) will decompose to a multiscalar in \(\mathbb {Z}^4\) whose coordinates lie inside the parallelepiped \(\mathcal {P}_\epsilon (\mathbf{B}):=\{\mathbf{B}{} \mathbf{x}\, |\, \mathbf{x} \in [-1/2,3/2)^4\}\). Theorem 1 permits scalars as any 256-bit strings, so we fix \(\mu :=2^{256}\) from here on, which also means that division by \(\mu \) will correspond to a shift of machine words. The edges of \(\mathcal {P}_\epsilon (\mathbf{B})\) are twice as long as those of \(\mathcal {P}(\mathbf{B})\), so the number of points in \(\mathcal {P}_\epsilon (\mathbf{B}) \cap \mathbb {Z}^4\) is \(\mathrm{vol}(\mathcal {P}_\epsilon ) = 16N\). We note that, even though the number of permissible scalars far exceeds 16N, the decomposition that maps integers in \([0,\mu )\) to multiscalars in \(\mathcal {P}_\epsilon (\mathbf{B}) \cap \mathbb {Z}^4\) is certainly no longer onto; almost all of the \(\mu \) scalars will map into \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\), since the chance of roundoff errors that take us into \(\mathcal {P}_\epsilon (\mathbf{B})-\mathcal {P}(\mathbf{B})\) is small. 
Plainly, the width of the smallest 4-cube containing \(\mathcal {P}_\epsilon (\mathbf{B})\) is also twice that of the 4-cube containing \(\mathcal {P}(\mathbf{B})\), so (in the sense of Definition 1) our basis is still Babai-optimal. Nevertheless, the bounds in Proposition 4 no longer apply, which is one of the issues addressed in the next subsection.
4.3 All-Positive Multiscalars
Many points in \(\mathcal {P}_\epsilon (\mathbf{B}) \cap \mathbb {Z}^4\) have coordinates that are far greater than \(2^{62}\) in absolute value, and in addition, the majority of them will have coordinates that are both positive and negative. Dealing with such signed multiscalars can require an additional iteration in the main loop of the scalar multiplication, so in this subsection we use an offset vector in \(\mathcal {L}\) to find a translate of \(\mathcal {P}_\epsilon (\mathbf{B})\) that contains points whose four coordinates are always positive. We note that this does not save the additional iteration mentioned above, but (at no cost) it does simplify the scalar recoding, such that we do not have to deal with multiscalars that can have negative coordinates. Such offset vectors were used in two dimensions in [13, Sect. 4].
From the proof of Proposition 3, we have that the width of the smallest 4-cube containing \(\mathcal {P}_\epsilon (\mathbf{B})\) is \(2\cdot (245\alpha +120r+17)/448\), which lies between \(2^{63}\) and \(2^{64}\). Thus, the optimal situation is to find a translate of \(\mathcal {P}_\epsilon (\mathbf{B})\) (using a vector in \(\mathcal {L}\)) that fits inside the convex body of the 4-cube \(\mathcal {H} = \{2^{64}\cdot \mathbf{x}\, |\, \mathbf{x} \in [0,1]^4\}\). In fact, as we discuss in the next paragraph, we actually want to find two distinct translates of \(\mathcal {P}_\epsilon (\mathbf{B})\) inside \(\mathcal {H}\).
The scalar recoding described in Sect. 5 requires that the first component of the multiscalar \((a_1,a_2,a_3,a_4)\) is odd. In the case that \(a_1\) is even, which happens around half of the time, previous works have employed this “odd-only” recoding by instead working with the multiscalar \((a_1-1,a_2,a_3,a_4)\), and adding the point P to the value output by the main loop (cf. [41, Algorithm 4] and [18, Algorithm 2]). Of course, in a constant-time routine, this scalar update and point addition must be performed regardless of the parity of \(a_1\), and the correct scalars and results must be masked in and out of the main loop accordingly. In this work we simplify the situation by using offset vectors in \(\mathcal {L}\) to achieve the same result; this has the added advantage of avoiding an extra point addition. We do this by finding two vectors \(\mathbf{c}, \mathbf{c}' \in \mathcal {L}\) such that \(\mathbf{c}+\mathcal {P}_\epsilon (\mathbf{B})\) and \(\mathbf{c}' +\mathcal {P}_\epsilon (\mathbf{B})\) both lie inside \(\mathcal {H}\), and such that precisely one of \((a_1,a_2,a_3,a_4)+\mathbf{c}\) and \((a_1,a_2,a_3,a_4)+\mathbf{c}'\) has a first component that is odd. This is made explicit in the full scalar decomposition described below.
Proposition 5
(Scalar Decompositions). Let \(\mathbf{B}=(\mathbf {b}_1,\mathbf {b}_2,\mathbf {b}_3,\mathbf {b}_4)\) be the basis in Proposition 3, let \(\mu =2^{256}\), and define the four curve constants \(\ell _i:=\lfloor \hat{\alpha }_i \cdot \mu /N \rceil \) for \(i=1,2,3,4\), with the \(\hat{\alpha }_i\) as given in Proposition 4. Let \(\mathbf{c}=2\mathbf {b}_1-\mathbf {b}_2+5\mathbf {b}_3+2\mathbf {b}_4\) and \(\mathbf{c}'= 2\mathbf {b}_1-\mathbf {b}_2+5\mathbf {b}_3+\mathbf {b}_4\) in \(\mathcal {L}\). For any integer \(m \in [0,2^{256})\), let \(\tilde{\alpha }_i =\left\lfloor \ell _i m /\mu \right\rfloor \), and let \((a_1, a_2, a_3,a_4) =(m,0,0,0)- \sum _{i=1}^{4}\lfloor \tilde{\alpha }_i \rceil \cdot \mathbf {b}_i\). Then, both of the multiscalars \((a_1,a_2,a_3,a_4)+\mathbf{c}\) and \((a_1,a_2,a_3,a_4)+\mathbf{c}'\) are valid decompositions of m, have all four coordinates positive and less than \(2^{64}\), and precisely one of them has a first coordinate that is odd.
The scalar decomposition described in Proposition 5 outputs two multiscalars. Our decomposition routine uses a bitmask to select and output the one with an odd first coordinate in constant time.
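The bitmask selection mentioned above can be sketched as follows. This is only an illustration of the branch-free masking idea: the function name and the 64-bit word model are assumptions, and Python integers do not give real constant-time guarantees, so a production version would be written with fixed-width machine words in C or assembly.

```python
MASK64 = (1 << 64) - 1

def select_odd_first(v, w):
    """Pick whichever of v, w has an odd first coordinate, without branching.

    Assumes all coordinates lie in [0, 2**64) and that exactly one of
    v[0], w[0] is odd, as guaranteed for the two outputs of Proposition 5.
    """
    # mask is all ones when v[0] is odd, and all zeros otherwise.
    mask = (-(v[0] & 1)) & MASK64
    not_mask = ~mask & MASK64
    return tuple((x & mask) | (y & not_mask) for x, y in zip(v, w))

if __name__ == "__main__":
    print(select_odd_first((3, 2, 4, 6), (4, 1, 1, 1)))   # first input chosen
    print(select_odd_first((4, 1, 1, 1), (3, 2, 4, 6)))   # second input chosen
```

Both candidates are always computed and combined with the same sequence of AND/OR operations, so the instruction sequence does not depend on which multiscalar is odd.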
5 The Scalar Multiplication
This section describes the full scalar multiplication of \(P \in \mathcal {E}(\mathbb {F}_{p^2})\) by an integer \(m \in [0,2^{256})\), pulling together the endomorphisms and scalar decompositions derived in the previous two sections.
5.1 Recoding the Multiscalar
The “all-positive” multiscalar \((a_1,a_2,a_3,a_4)\) that is obtained from the decomposition described in Proposition 5 could be fed as is into a simple 4-way multiexponentiation (e.g., the 4-dimensional version of [48]) to achieve an efficient scalar multiplication. However, more care needs to be taken to obtain an efficient routine that also runs in constant time. For example, we need to guarantee that the main loop always runs for the same number of iterations, which would not currently be the case since \(\mathrm{max}_j(\mathrm{log}_2(|a_j|))\) can be several integers less than 64. As another example, a straightforward multiexponentiation could leak information in the case that the i-th bit of all four \(a_j\) values was 0, which would result in a “do-nothing” rather than a non-trivial addition.
To achieve an efficient constant-time routine, we adopt the general recoding algorithm from [18, Algorithm 1], and tailor it to scalar multiplications on Four\(\mathbb {Q}\). This results in Algorithm 1 below, which is presented in two flavors: one that is geared towards the general reader and one that is geared towards implementers (we note that the lines do not coincide for the most part). On input of any multiscalar \((a_1,a_2,a_3,a_4)\) produced by Proposition 5, Algorithm 1 outputs an equivalent multiscalar \((b_1,b_2,b_3,b_4)\) with \(b_j = \sum _{i=0}^{64}b_j[i] \cdot 2^i\) for \(b_j[i]\in \{-1,0,1\}\) and \(j=1,2,3,4\), such that we always have \(b_1[64]=1\) and such that \(b_1[i]\) is non-zero for every \(i=0,\dots ,63\). This fixes the length of the main loop and ensures that each addition step of the multiexponentiation requires an addition by something other than the neutral element.
Another benefit of Algorithm 1 is that \(b_j[i] \in \{0,b_1[i]\}\) for \(j=2,3,4\); as was exploited in [18], this “sign-alignment” means that the lookup table used in our multiexponentiation only requires 8 elements, rather than the 16 that would be required in a naïve multiexponentiation that uses \((a_1,a_2,a_3,a_4)\). More specifically, since \(b_1[i]\) (which is to be multiplied by P) is always non-zero, every element of the lookup table T must contain P, so we have \(T[u]:=P+[u_0]\phi (P)+[u_1]\psi (P)+[u_2]\psi (\phi (P))\), where \(u=(u_2,u_1,u_0)_2\) for \(u =0,\dots ,7\). We point out that the recoding must itself be implemented in constant-time; the implementer-friendly version shows that Algorithm 1 indeed lends itself to such a constant-time implementation. We further note that the outputs of the two versions are formatted differently: the left side outputs the multiscalar \((b_1,b_2,b_3,b_4)\), while the right side instead outputs the corresponding lookup table indices (the \(d_i\)) and the masks (the \(m_i\)) used to select the correct signs of the lookup elements. That is, \((m_{64}, \ldots , m_0)\) corresponds to the binary expansion of \(b_1\) and \((d_{64}, \ldots , d_0)\) corresponds to the binary expansion of \(b_2 + 2b_3+4b_4\).
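A sign-aligned recoding with the properties stated above can be sketched in Python as follows. This is a reconstruction from those stated properties, not a transcription of Algorithm 1; the function names are hypothetical, and the sketch makes no attempt to run in constant time.

```python
def recode(a, bits=64):
    """Sign-aligned recoding of (a1, ..., an), a1 odd, all a_j in [0, 2**bits).

    Returns digit lists b[j][0..bits] with b[0][i] in {-1, 1}, b[0][bits] = 1,
    and b[j][i] in {0, b[0][i]} for j >= 1, such that sum_i b[j][i]*2**i = a[j].
    """
    a1, rest = a[0], list(a[1:])
    assert a1 & 1 == 1 and all(0 <= x < (1 << bits) for x in a)
    b = [[0] * (bits + 1) for _ in a]
    b[0][bits] = 1
    for i in range(bits):
        # Digit i of b1 is +1 or -1, read off bit i+1 of the odd scalar a1.
        b[0][i] = 2 * ((a1 >> (i + 1)) & 1) - 1
        for j in range(len(rest)):
            # Non-zero digits of the other scalars copy the sign of b1[i].
            b[j + 1][i] = b[0][i] * (rest[j] & 1)
            # floor(a/2) - floor(digit/2); note floor(-1/2) = -1 in Python.
            rest[j] = (rest[j] >> 1) - (b[j + 1][i] >> 1)
    for j in range(len(rest)):
        b[j + 1][bits] = rest[j] & 1
    return b

def digits_value(d):
    """Integer represented by the signed digit list d (least significant first)."""
    return sum(x * (1 << i) for i, x in enumerate(d))

def table_indices(b):
    """Per-digit (sign, index) pairs: sign = b1[i], index = |b2|+2|b3|+4|b4|."""
    return [(b[0][i], abs(b[1][i]) + 2 * abs(b[2][i]) + 4 * abs(b[3][i]))
            for i in range(len(b[0]))]

if __name__ == "__main__":
    a = (0x123456789ABCDEF1, 0x0FEDCBA987654321, 42, (1 << 64) - 1)
    b = recode(a)
    print([digits_value(d) == x for d, x in zip(b, a)])
```

Each digit position yields one sign and one 3-bit table index, which matches the 8-element lookup table described in the text.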
5.2 The Full Routine
We now present Algorithm 2: the full scalar multiplication routine. This is accompanied by Theorem 1, the proof of which (see [14]) gives more details on the steps summarized in Algorithm 2; in particular, it specifies the representations of all points in order to state the total number of \(\mathbb {F}_{p^2}\) operations. Algorithm 2 assumes that the input point P is in \(\mathcal {E}(\mathbb {F}_{p^2})[N]\), i.e., has been validated according to [14, Appendix A].
Theorem 1
For every point \(P \in \mathcal {E}(\mathbb {F}_{p^2})[N]\) and every non-negative integer m less than \(2^{256}\), Algorithm 2 computes [m]P correctly using a fixed sequence of field, integer and table-lookup operations.
6 Performance Analysis and Results
This section shows that, at the 128-bit security level, Four\(\mathbb {Q}\) is significantly faster than all other known curve-based primitives. We reiterate that our software runs in constant time and is therefore fully protected against timing and cache attacks.
6.1 Operation Counts
We begin with a first-order comparison based on operation counts between Four\(\mathbb {Q}\) and two other efficient curve-based primitives that are defined over large prime characteristic fields and that target the 128-bit security level: the twisted Edwards GLV+GLS curve defined over \(\mathbb {F}_{p^2}\) with \(p=2^{127}-5997\) proposed in [37], and the genus 2 Kummer surface defined over \(\mathbb {F}_p\) with \(p=2^{127}-1\) that was proposed in [25]; we dub these “GLV+GLS” and “Kummer” below. Both of these curves have recently set speed records on a variety of platforms (see [18] and [6]). Table 2 summarizes the operation counts for one variable-base scalar multiplication on Four\(\mathbb {Q}\), GLV+GLS and Kummer. In the right-most column we approximate the cost in terms of prime field operations (using the standard assumption that 1 base field squaring is approximately 0.8 base field multiplications), where we round each tally to the nearest integer. For the GLV+GLS and Four\(\mathbb {Q}\) operation counts, we assume that one multiplication over \(\mathbb {F}_{p^2}\) involves 3 multiplications and 5 additions/subtractions over \(\mathbb {F}_p\) (when using Karatsuba) and one squaring over \(\mathbb {F}_{p^2}\) involves 2 multiplications and 3 additions/subtractions over \(\mathbb {F}_p\).
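The operation counts assumed for \(\mathbb {F}_{p^2}\) arithmetic can be illustrated with a short Python sketch. It assumes the representation \(\mathbb {F}_{p^2}=\mathbb {F}_p(i)\) with \(i^2=-1\) and \(p=2^{127}-1\); the function names and the operation counter are hypothetical and exist only to make the tallies visible.

```python
P127 = (1 << 127) - 1
COUNT = {"mul": 0, "add": 0}   # tallies of F_p multiplications and add/subs

def fp_add(x, y):
    COUNT["add"] += 1
    return (x + y) % P127

def fp_sub(x, y):
    COUNT["add"] += 1
    return (x - y) % P127

def fp_mul(x, y):
    """Multiply reduced inputs, folding twice with 2**127 = 1 (mod p)."""
    COUNT["mul"] += 1
    t = x * y
    t = (t & P127) + (t >> 127)
    t = (t & P127) + (t >> 127)
    return 0 if t == P127 else t

def fp2_mul(x, y):
    """(a + b*i)(c + d*i) with Karatsuba: 3 multiplications, 5 add/subs."""
    (a, b), (c, d) = x, y
    ac = fp_mul(a, c)
    bd = fp_mul(b, d)
    cross = fp_mul(fp_add(a, b), fp_add(c, d))       # (a+b)(c+d)
    return (fp_sub(ac, bd), fp_sub(fp_sub(cross, ac), bd))

def fp2_sqr(x):
    """(a + b*i)**2 = (a+b)(a-b) + 2ab*i: 2 multiplications, 3 add/subs."""
    a, b = x
    re = fp_mul(fp_add(a, b), fp_sub(a, b))
    ab = fp_mul(a, b)
    return (re, fp_add(ab, ab))

if __name__ == "__main__":
    print(fp2_mul((3, 5), (7, 11)), dict(COUNT))
```

The reduction in `fp_mul` uses only masks, shifts and additions, which is one reason arithmetic modulo a Mersenne prime is inexpensive.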
Table 2 shows that the GLV+GLS routine from [37] requires slightly fewer operations than Four\({\mathbb {Q}}\). This can mainly be explained by the faster endomorphisms, but (as we will see in Table 3) this difference is more than made up for by the faster modular arithmetic and superior simplicity of Four\(\mathbb {Q}\). Table 2 shows that Four\(\mathbb {Q}\) requires far fewer operations (in the same ground field) than Kummer; it is therefore expected, in general, that implementations based on Four\(\mathbb {Q}\) outperform Kummer implementations for computing variable-base scalar multiplications.
6.2 Experimental Results
To evaluate performance, we wrote a standalone library supporting Four\(\mathbb {Q}\) – see [15]. The library’s design pursues modularity and code reuse, and leverages the simplicity of Four\(\mathbb {Q}\)’s arithmetic. It also facilitates the addition of specialized code for different platforms and applications: the core functionality of the library is fully written in portable C and works together with pluggable implementations of the arithmetic over \(\mathbb {F}_{p^2}\) (and a few other complementary functions). The first release version of the library comes with two of those pluggable modules: a portable implementation written in C and a high-performance implementation for x64 platforms written in C and optional x64 assembly. The library computes all of the basic elliptic curve operations including variable-base and fixed-base scalar multiplications, making it suitable for a wide range of cryptographic protocols. In addition, the software permits the selection (at build time) of whether or not the endomorphisms \(\psi \) and \(\phi \) are to be exploited in variable-base scalar multiplications.
In Table 3, we compare Four\(\mathbb {Q}\)’s performance with other state-of-the-art implementations documented in the literature. Our benchmarks cover a wide range of x64 processors, from high-end architectures (e.g., Intel’s Haswell) to low-end architectures (e.g., Intel’s Atom). To cast the performance numbers in the context of a real-world protocol, we choose to illustrate Four\(\mathbb {Q}\)’s performance in one round of an ephemeral Diffie-Hellman (DH) key exchange. This means that both parties can generate their public keys using a fixed-base scalar multiplication and generate the shared secret using a variable-base scalar multiplication. Exploiting such precomputations to generate truly ephemeral public keys agrees with the comments made by Bernstein and Lange in [8, Sect. 1], e.g., that “forward secrecy is at its strongest when a key is discarded immediately after its use”. Thus, Table 3 shows the execution time (in terms of clock cycles) for both variable-base and fixed-base scalar multiplications. We note that the laddered implementations in [4, 6, 9] only compute variable-base scalar multiplications, which is why we use the cost of two variable-base scalar multiplications to approximate the cost of ephemeral DH in those cases. For the Four\(\mathbb {Q}\) and GLV+GLS implementations, precomputations for the fixed-base scalar multiplications occupied 7.5KB and 6KB of storage, respectively.
Table 3 shows that, in comparison with the “conservative” curves, Four\(\mathbb {Q}\) is 2.1–2.7 times faster than the Curve25519 implementations in [3, 12] and up to 5.4 times faster than the curve P-256 implementation in [26], when computing variable-base scalar multiplications. When considering the results for the DH key exchange, Four\(\mathbb {Q}\) performs 1.8–3.5 times faster than Curve25519 and up to 4.2 times faster than curve P-256.
In terms of comparisons to the previously fastest implementations, variable-base scalar multiplications using our software are between 1.20 and 1.34 times faster than the Kummer [6, 9] and the GLV+GLS [18] implementations on AMD’s Kaveri and Intel’s Atom Pineview, Sandy Bridge and Ivy Bridge. The Kummer implementation for Haswell in [6] is particularly fast because it takes advantage of the powerful AVX2 vector instructions. Nevertheless, our implementation (which does not currently exploit vector instructions to accelerate the field arithmetic) is still faster in the case of variable-base scalar multiplication. Moreover, in practice we expect a much larger advantage. For example, in the case of the DH key exchange, we leverage the efficiency of fixed-base scalar multiplications to achieve a factor 1.33x speedup over the Kummer implementation on Haswell. For the rest of the platforms considered in Table 3, a DH shared secret using the Four\(\mathbb {Q}\) software can be computed 1.5–1.8 times faster than a DH secret using the Kummer software in [6]. We note that the eBACS website [7] and [6] report different results for the same Kummer software on the same platform (i.e., Titan0): eBACS reports 60,556 Haswell cycles whereas [6] claims 54,389 Haswell cycles. This difference in performance raises questions regarding accuracy. The results that we obtained after running the eBACS’ SUPERCOP toolkit on our own targeted Haswell machine seem to confirm that the results claimed in [6] for the Kummer were measured with TurboBoost enabled.
Four\(\mathbb {Q}\) without endomorphisms. Our library can be built with a version of the variable-base scalar multiplication function that does not exploit the endomorphisms \(\psi \) and \(\phi \) to accelerate computations (note that fixed-base scalar multiplications do not exploit these endomorphisms by default). In this case, Four\(\mathbb {Q}\) computes one variable-base scalar multiplication in (respectively) 109, 131, 138 and 803 thousand cycles on the Haswell, Ivy Bridge, Sandy Bridge and Atom Pineview processors used for our experiments. These results are up to 2.9 times faster than the corresponding results for NIST P-256 and up to 1.5 times faster than the corresponding results for Curve25519.
Notes
1. p stands alone as the only Mersenne prime suitable for high-security curves over quadratic extension fields. The next largest Mersenne prime is \(2^{521}-1\), which is suitable only for prime field curves targeting the 256-bit level.
2. Here, and throughout, \(\mathbf{I}\), \(\mathbf{M}\), \(\mathbf{S}\) and \(\mathbf{A}\) are used to denote the respective costs of inversions, multiplications, squarings and additions in \(\mathbb {F}_{p^2}\). We note that Frobenius operations amount to conjugations in \(\mathbb {F}_p\), which are tallied as \(0.5\mathbf{A}\).
3. This is a translate (by \(-\frac{1}{2}(\sum _{i=1}^4 \mathbf {b}_i)\)) of the fundamental parallelepiped, which is defined using \(\mathbf{x} \in [0,1)^4\).
4. This is not technically true: so long as the set of permissible scalars is finite, there will always be a \(\mu \) large enough to round all scalar decompositions accurately, but finding or proving this is, to our knowledge, very difficult.
References
Ahmadi, O., Granger, R.: On isogeny classes of Edwards curves over finite fields. Cryptology ePrint Archive, Report 2011/135 (2011). http://eprint.iacr.org/
Babai, L.: On Lovász’ lattice reduction and the nearest lattice point problem. Combinatorica 6(1), 1–13 (1986)
Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.-Y.: High-speed high-security signatures. In: Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 124–142. Springer, Heidelberg (2011)
Bernstein, D.J.: Curve25519: new Diffie-Hellman speed records. In: Yung, M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958, pp. 207–228. Springer, Heidelberg (2006)
Bernstein, D.J., Birkner, P., Joye, M., Lange, T., Peters, C.: Twisted Edwards curves. In: Vaudenay, S. (ed.) AFRICACRYPT 2008. LNCS, vol. 5023, pp. 389–405. Springer, Heidelberg (2008)
Bernstein, D.J., Chuengsatiansup, C., Lange, T., Schwabe, P.: Kummer strikes back: new DH speed records. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014. LNCS, vol. 8873, pp. 317–337. Springer, Heidelberg (2014)
Bernstein, D.J., Lange, T.: eBACS: ECRYPT Benchmarking of Cryptographic Systems. http://bench.cr.yp.to/results-dh.html. Accessed on May 19 2015
Bernstein, D.J., Lange, T.: Hyper-and-elliptic-curve cryptography. LMS J. Comput. Math. 17(A), 181–202 (2014)
Bos, J.W., Costello, C., Hisil, H., Lauter, K.: Fast cryptography in genus 2. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013. LNCS, vol. 7881, pp. 194–210. Springer, Heidelberg (2013)
Bos, J.W., Costello, C., Hisil, H., Lauter, K.: High-performance scalar multiplication using 8-dimensional GLV/GLS decomposition. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 331–348. Springer, Heidelberg (2013)
Certicom Research. Standards for Efficient Cryptography 2: Recommended Elliptic Curve Domain Parameters, v2.0. Standard SEC2, Certicom (2010)
Chou, T.: Fastest Curve25519 implementation ever. In: Workshop on Elliptic Curve Cryptography Standards (2015). http://www.nist.gov/itl/csd/ct/ecc-workshop.cfm
Costello, C., Hisil, H., Smith, B.: Faster compact Diffie–Hellman: endomorphisms on the x-line. In: Nguyen, P.Q., Oswald, E. (eds.) EUROCRYPT 2014. LNCS, vol. 8441, pp. 183–200. Springer, Heidelberg (2014)
Costello, C., Longa, P.: Four\(\mathbb{Q}\): four-dimensional decompositions on a \(\mathbb{Q}\)-curve over the Mersenne prime (extended version). Cryptology ePrint Archive, Report 2015/565 2015. http://eprint.iacr.org/
Costello, C., Longa, P.: Four\(\mathbb{Q}\)lib (2015). http://research.microsoft.com/fourqlib/
Duursma, I.M., Gaudry, P., Morain, F.: Speeding up the discrete log computation on curves with automorphisms. In: Lam, K.-Y., Okamoto, E., Xing, C. (eds.) ASIACRYPT 1999. LNCS, vol. 1716, pp. 103–121. Springer, Heidelberg (1999)
Edwards, H.: A normal form for elliptic curves. Bull. Am. Math. Soc. 44(3), 393–422 (2007)
Faz-Hernández, A., Longa, P., Sánchez, A.H.: Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV-GLS curves (extended version). J. Cryptographic Eng. 5(1), 31–52 (2015)
Frey, G., Müller, M., Rück, H.: The Tate pairing and the discrete logarithm applied to elliptic curve cryptosystems. IEEE Trans. Inf. Theor. 45(5), 1717–1719 (1999)
Galbraith, S.D.: Mathematics of Public Key Cryptography. Cambridge University Press, Cambridge (2012)
Galbraith, S.D., Lin, X., Scott, M.: Endomorphisms for faster elliptic curve cryptography on a large class of curves. J. Cryptology 24(3), 446–469 (2011)
Gallant, R.P., Lambert, R.J., Vanstone, S.A.: Faster point multiplication on elliptic curves with efficient endomorphisms. In: Kilian, J. (ed.) CRYPTO 2001. LNCS, vol. 2139, pp. 190–200. Springer, Heidelberg (2001)
Gaudry, P.: Fast genus 2 arithmetic based on Theta functions. J. Math. Cryptology 1(3), 243–265 (2007)
Gaudry, P.: Index calculus for abelian varieties of small dimension and the elliptic curve discrete logarithm problem. J. Symbolic Comput. 44(12), 1690–1702 (2009)
Gaudry, P., Schost, E.: Genus 2 point counting over prime fields. J. Symbolic Comput. 47(4), 368–400 (2012)
Gueron, S., Krasnov, V.: Fast prime field elliptic curve cryptography with 256 bit primes. J. Cryptographic Eng. 5(2), 141–151 (2015)
Guillevic, A., Ionica, S.: Four-dimensional GLV via the Weil restriction. In: Sako, K., Sarkar, P. (eds.) ASIACRYPT 2013, Part I. LNCS, vol. 8269, pp. 79–96. Springer, Heidelberg (2013)
Hamburg, M.: Twisting Edwards curves with isogenies. Cryptology ePrint Archive, Report 2014/027 (2014). http://eprint.iacr.org/
Hasegawa, Y.: \(\mathbb{Q}\)-curves over quadratic fields. Manuscripta Math. 94(1), 347–364 (1997)
Hisil, H., Costello, C.: Jacobian coordinates on genus 2 curves. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014. LNCS, vol. 8873, pp. 338–357. Springer, Heidelberg (2014)
Hisil, H., Wong, K.K.-H., Carter, G., Dawson, E.: Twisted Edwards curves revisited. In: Pieprzyk, J. (ed.) ASIACRYPT 2008. LNCS, vol. 5350, pp. 326–343. Springer, Heidelberg (2008)
Hu, Z., Longa, P., Xu, M.: Implementing 4-dimensional GLV method on GLS elliptic curves with j-invariant 0. Des. Codes Cryptography 63(3), 331–343 (2012)
Koblitz, A.H., Koblitz, N., Menezes, A.: Elliptic curve cryptography: the serpentine course of a paradigm shift. J. Number Theor. 131(5), 781–814 (2011)
Kohel, D.: Endomorphism rings of elliptic curves over finite fields. Ph.D. thesis, University of California at Berkeley (1996)
Lenstra, A.K., Lenstra, H.W., Lovász, L.: Factoring polynomials with rational coefficients. Math. Ann. 261(4), 515–534 (1982)
Longa, P., Gebotys, C.: Efficient techniques for high-speed elliptic curve cryptography. In: Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp. 80–94. Springer, Heidelberg (2010)
Longa, P., Sica, F.: Four-dimensional Gallant-Lambert-Vanstone scalar multiplication. J. Cryptology 27(2), 248–283 (2014)
Lovász, L., Scarf, H.E.: The generalized basis reduction algorithm. Math. Oper. Res. 17(3), 751–764 (1992)
Menezes, A., Vanstone, S.A., Okamoto, T.: Reducing elliptic curve logarithms to logarithms in a finite field. In: Koutsougeras, C., Vitter, J.S. (eds.) Proceedings of 23rd Annual ACM Symposium on Theory of Computing, pp. 80–89. ACM (1991)
National Institute of Standards and Technology (NIST): Digital Signature Standard (DSS). Federal Information Processing Standards (FIPS) Publication 186-2 (2000)
Oliveira, T., López, J., Aranha, D.F., Rodríguez-Henríquez, F.: Two is the fastest prime: lambda coordinates for binary elliptic curves. J. Cryptographic Eng. 4(1), 3–17 (2014)
Pollard, J.M.: Monte Carlo methods for index computation (mod \(p\)). Math. Comput. 32(143), 918–924 (1978)
Scholten, J.: Weil restriction of an elliptic curve over a quadratic extension (2004). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.7987&rep=rep1&type=pdf
Smith, B.: Families of fast elliptic curves from \(\mathbb{Q}\)-curves. In: Sako, K., Sarkar, P. (eds.) ASIACRYPT 2013, Part I. LNCS, vol. 8269, pp. 61–78. Springer, Heidelberg (2013)
Smith, B.: Easy scalar decompositions for efficient scalar multiplication on elliptic curves and genus 2 Jacobians. In: Contemporary Mathematics Series, vol. 637, p. 15. American Mathematical Society (2015)
Smith, B.: The \(\mathbb{Q}\)-curve construction for endomorphism-accelerated elliptic curves. J. Cryptology (2015, to appear)
Stark, H.M.: Class-numbers of complex quadratic fields. In: Kuijk, W. (ed.) Modular Functions of One Variable I, pp. 153–174. Springer, Heidelberg (1973)
Straus, E.G.: Addition chains of vectors. Am. Math. Mon. 70, 806–808 (1964)
Vélu, J.: Isogénies entre courbes elliptiques. CR Acad. Sci. Paris Sér. AB 273, A238–A241 (1971)
Wiener, M., Zuccherato, R.J.: Faster attacks on elliptic curve cryptosystems. In: Tavares, S., Meijer, H. (eds.) SAC 1998. LNCS, vol. 1556, pp. 190–200. Springer, Heidelberg (1999)
Acknowledgements
We thank Michael Naehrig for several discussions throughout this work, and Joppe Bos, Sorina Ionica and Greg Zaverucha for their comments on an earlier version of this paper. We are especially thankful to Ben Smith for pointing out the better option for \(\phi \) in Sect. 3.2.
Copyright information
© 2015 International Association for Cryptologic Research
About this paper
Cite this paper
Costello, C., Longa, P. (2015). Four\(\mathbb {Q}\): Four-Dimensional Decompositions on a \(\mathbb {Q}\)-curve over the Mersenne Prime. In: Iwata, T., Cheon, J. (eds) Advances in Cryptology – ASIACRYPT 2015. ASIACRYPT 2015. Lecture Notes in Computer Science, vol 9452. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48797-6_10
DOI: https://doi.org/10.1007/978-3-662-48797-6_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48796-9
Online ISBN: 978-3-662-48797-6
eBook Packages: Computer Science (R0)