1 Introduction

This paper introduces a new, complete twisted Edwards [5] curve \(\mathcal {E}(\mathbb {F}_{p^2}):\,-x^2+y^2=1+dx^2y^2\), where p is the Mersenne prime \(p=2^{127}-1\), and d is a non-square in \(\mathbb {F}_{p^2}\). This curve, dubbed “Four\(\mathbb {Q}\)”, arises as a special instance of recent constructions using \(\mathbb {Q}\)-curves [27, 46], and is thus equipped with an endomorphism \(\psi \) related to the p-power Frobenius map. In addition, it has complex multiplication (CM) by the order of discriminant \(D=-40\), meaning it comes equipped with another efficient, low-degree endomorphism \(\phi \) [47].

We built an elliptic curve cryptography (ECC) library that works inside the cryptographic subgroup \(\mathcal {E}(\mathbb {F}_{p^2})[N]\), where N is a 246-bit prime. The endomorphisms \(\psi \) and \(\phi \) do not give any practical speedup to Pollard’s rho algorithm [42], which means the best known attack against the elliptic curve discrete logarithm problem (ECDLP) on \(\mathcal {E}(\mathbb {F}_{p^2})[N]\) requires around \(\sqrt{\pi N/4} \sim 2^{122.5}\) group operations on average. Thus, the cryptographic security of \(\mathcal {E}\) (see Sect. 2.3 for more details) is closely comparable to other curves that target the 128-bit security level, e.g., [6, 9, 21, 37].

Our choice of curve and the accompanying library offer a range of advantages over existing curves and implementations:

  • Speed: Four\({\mathbb {Q}}\)’s library computes scalar multiplications significantly faster than all known software implementations of curve-based cryptographic primitives. It uses the endomorphisms \(\psi \) and \(\phi \) to accelerate scalar multiplications via four-dimensional Gallant-Lambert-Vanstone (GLV)-style [22] decompositions. Four-dimensional decompositions have been used before [9, 32, 37], but not over the Mersenne prime; this choice of field is significantly faster than any neighboring fields and several works have studied its arithmetic [13, 21, 36]. The combination of extremely fast modular reductions and four-dimensional scalar decompositions makes for highly efficient scalar multiplications on \(\mathcal {E}\). Furthermore, we can exploit the fastest known addition formulas for elliptic curves over large characteristic fields [31], which are complete on \(\mathcal {E}\) since the above d is non-square [31, Sect. 3]. In Sect. 2, we explain why four-dimensional decompositions and this special underlying field were not previously partnered at the 128-bit security level.

  • Simplicity and concrete correctness: Simplicity is a major priority in this work and in the development of our software; in some cases we sacrifice speed enhancements in order to design a simpler and more compact algorithm (cf. Sect. 4.2).

    On input of any point \(P \in \mathcal {E}(\mathbb {F}_{p^2})[N]\), validated as in [14, Appendix A] if necessary, and any integer scalar \(m \in [0,2^{256})\), our software does the following (strictly in constant-time and without exception):

    1. Computes \(\phi (P)\), \(\psi (P)\) and \(\psi (\phi (P))\) using exactly \(68 \mathbf{M}\), \(27 \mathbf{S}\) and \(49.5 \mathbf{A}\) – see Sect. 3.

    2. Decomposes m (e.g., in less than 200 Sandy Bridge cycles) into a multiscalar \((a_1,a_2,a_3,a_4) \in \mathbb {Z}^4\) such that each \(a_i\) is positive and at most 64 bits – see Sect. 4.

    3. Recodes the multiscalar (e.g., in less than 800 Sandy Bridge cycles) to ensure a simple and constant-time main loop – see Sect. 5.

    4. Computes a lookup table of 8 elements using exactly 7 complete additions, before executing the main loop using exactly 64 complete twisted Edwards double-and-add operations, and finally outputting \([m]P = [a_1]P+[a_2]\phi (P)+[a_3]\psi (P) + [a_4]\psi \phi (P)\) – see Sect. 5.

    This paper details each of the above steps explicitly, culminating in the full routine presented in Algorithm 2. Several prior works exploiting scalar decompositions have potential points of failure (cf. [30, Sect. 7], and Sect. 4.2), but crucially, and for the first time in the setting of four-dimensional decompositions, we accompany our routine with a robust proof of correctness – see Theorem 1.

  • Cryptographic versatility: Four\(\mathbb {Q}\) is intended to be used in the same way, i.e., using the same model, same coordinates and same explicit formulas, irrespective of the cryptographic protocol or nature of the intended scalar multiplication. Unlike implementations using ladders [4, 6, 9, 23], Four\(\mathbb {Q}\) supports fast variable-base and fast fixed-base scalar multiplications, both of which use twisted Edwards coordinates; this serves as a basis for fast (ephemeral) Diffie-Hellman key exchange and fast Schnorr-like signatures. The presence of a single, complete addition law gives implementers the ability to easily wrap higher-level software and protocols around Four\(\mathbb {Q}\)’s library exactly as is.

  • Public availability: Prior works exploiting four-dimensional decompositions have either made code available that did not attempt to run in constant-time [9], or not published code that did run in constant-time [18, 37]. Our library, which is publicly available [15], is largely written in portable C and includes two modular implementations of the arithmetic over \(\mathbb {F}_{p^2}\): a portable implementation written in C and a high-performance implementation for x64 platforms written in C and optional x64 assembly. The library also allows the user to select (at build time) whether or not the efficiently computable endomorphisms \(\psi \) and \(\phi \) are used for computing generic scalar multiplications. The code is accompanied by Magma scripts that can be used to verify the proofs of all claims and the claimed operation counts. Our aim is to make it easy for subsequent implementers to replicate the routine and, if desired, develop specialized code that is tailored to specific platforms for further performance gains or with different memory constraints.

When the NIST curves [40] were standardized in 1999, many of the landmark discoveries in ECC (e.g., [17, 21, 22, 46]) were yet to be made. Four\(\mathbb {Q}\) and its accompanying library represent the culmination of several of the best known ECC optimizations to date: it pulls together the extremely fast Mersenne prime, the fastest known large characteristic addition formulas [31], and the highest degree of scalar decompositions (there is currently no known way of achieving higher dimensional decompositions without exposing the ECDLP to attacks that are asymptotically much faster than Pollard rho). Subsequently, for generic scalar multiplications, Four\(\mathbb {Q}\) performs around four to five times faster than the original NIST P-256 curve [26], between two and three times faster than curves that are currently under consideration as NIST alternatives, e.g., Curve25519 [4], and is also significantly faster than all of the other curves used to set previous speed records (see Sect. 6 for the comparisons). Interestingly, Four\(\mathbb {Q}\) is still highly efficient if the endomorphisms \(\psi \) and \(\phi \) are not used at all for computing generic scalar multiplications. In this case, Four\(\mathbb {Q}\) performs about three times faster than the NIST P-256 curve and up to 1.5 times faster than Curve25519.

It is our belief that the demand for high-performance cryptography warrants the state-of-the-art in ECC to be part of the standardization discussion: this paper ultimately demonstrates the performance gains that are possible if such a curve were to be considered alongside the “conservative” choices.

The extended version. For space considerations, we have omitted the proofs of Propositions 1, 2, 4 and 5, Lemma 1 and Theorem 1, as well as several additional remarks. All of these, along with an appendix covering point validation, can be found in the extended version of this article [14].

2 The Curve: Four\(\mathbb {Q}\)

This section describes the proposed curve, where we adopt Smith’s notation [44, 46] for the most part. We present the curve parameters in Sect. 2.1, shed some light on how the curve was found in Sect. 2.2, and discuss its cryptographic security in Sect. 2.3. Both Sects. 2.2 and 2.3 discuss that \(\mathcal {E}\) is essentially one-of-a-kind, illustrating that there were no degrees of freedom in the choice of curve (see [14] for more details).

2.1 A Complete Twisted Edwards Curve

We will work over the quadratic extension field \(\mathbb {F}_{p^2} := \mathbb {F}_p(i)\), where \(p:=2^{127}-1\) and \(i^2 = -1\). We define \(\mathcal {E}\) to be the twisted Edwards [5] curve

$$\begin{aligned} \mathcal {E}/\mathbb {F}_{p^2} \, :-x^2+y^2 = 1+dx^2y^2, \end{aligned}$$
(1)

where \(d := 125317048443780598345676279555970305165 \cdot i + 4205857648805777768770\).

The set of \(\mathbb {F}_{p^2}\)-rational points satisfying the affine model for \(\mathcal {E}\) forms a group: the neutral element is \(\mathcal {O}_{\mathcal {E}} = (0,1)\) and the inverse of a point \((x,y)\) is \((-x,y)\). The fastest set of explicit formulas for the addition law on \(\mathcal {E}\) is due to Hisil, Wong, Carter and Dawson [31]: they use extended twisted Edwards coordinates to represent the affine point \((x,y)\) on \(\mathcal {E}\) by any projective tuple of the form \((X :Y :Z :T)\) for which \(Z \ne 0\), \(x = X/Z\), \(y=Y/Z\) and \(T=XY/Z\). Since d is not a square in \(\mathbb {F}_{p^2}\), this set of formulas is also complete on \(\mathcal {E}\) (see [5]), meaning that they will work without exception for all points in \(\mathcal {E}(\mathbb {F}_{p^2})\).
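The group structure described above can be sketched in a few lines of Python. This is an illustrative affine model only, not the extended-coordinate formulas of [31] and not constant-time: the helper names (`fp_reduce`, `fp2_mul`, `edwards_add`, and so on) are our own, elements of \(\mathbb {F}_{p^2}\) are represented as pairs `(a0, a1)` meaning \(a_0 + a_1 i\), and the addition law is the standard affine twisted Edwards law specialised to \(a=-1\).

```python
# Affine sketch of arithmetic on E: -x^2 + y^2 = 1 + d*x^2*y^2 over F_{p^2} = F_p(i).
# All helper names are hypothetical; this is not the library's API.
P = 2**127 - 1  # the Mersenne prime p

# d = D[1]*i + D[0], copied from the curve definition in the text.
D = (4205857648805777768770, 125317048443780598345676279555970305165)
ONE = (1, 0)

def fp_reduce(x):
    """Reduce a non-negative integer modulo p using 2^127 = 1 (mod p)."""
    while x >> 127:
        x = (x & P) + (x >> 127)   # fold the high bits onto the low bits
    return 0 if x == P else x

def fp2_add(a, b):
    return ((a[0] + b[0]) % P, (a[1] + b[1]) % P)

def fp2_sub(a, b):
    return ((a[0] - b[0]) % P, (a[1] - b[1]) % P)

def fp2_mul(a, b):
    # (a0 + a1*i)(b0 + b1*i) with i^2 = -1; (P - a1) keeps the operand non-negative.
    re = fp_reduce(a[0] * b[0] + (P - a[1]) * b[1])
    im = fp_reduce(a[0] * b[1] + a[1] * b[0])
    return (re, im)

def fp2_inv(a):
    # 1/(a0 + a1*i) = (a0 - a1*i)/(a0^2 + a1^2); the norm is nonzero for a != 0.
    norm = (a[0] * a[0] + a[1] * a[1]) % P
    n_inv = pow(norm, P - 2, P)
    return (a[0] * n_inv % P, (P - a[1]) * n_inv % P)

def on_curve(pt):
    x, y = pt
    x2, y2 = fp2_mul(x, x), fp2_mul(y, y)
    return fp2_sub(y2, x2) == fp2_add(ONE, fp2_mul(D, fp2_mul(x2, y2)))

def edwards_add(p1, p2):
    """Affine twisted Edwards addition for a = -1."""
    (x1, y1), (x2, y2) = p1, p2
    x1x2, y1y2 = fp2_mul(x1, x2), fp2_mul(y1, y2)
    t = fp2_mul(D, fp2_mul(x1x2, y1y2))
    x3 = fp2_mul(fp2_add(fp2_mul(x1, y2), fp2_mul(y1, x2)), fp2_inv(fp2_add(ONE, t)))
    y3 = fp2_mul(fp2_add(y1y2, x1x2), fp2_inv(fp2_sub(ONE, t)))
    return (x3, y3)

def negate(pt):
    x, y = pt
    return (((-x[0]) % P, (-x[1]) % P), y)

NEUTRAL = ((0, 0), (1, 0))      # the point (0, 1)
ORDER4 = ((0, 1), (0, 0))       # the point (i, 0), which satisfies -x^2 = 1
```

For instance, `edwards_add(ORDER4, ORDER4)` gives the point \((0,-1)\) of order 2, and adding `NEUTRAL` to any point returns that point unchanged.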

The trace \(t_{\mathcal {E}}\) of the \(p^2\)-power Frobenius endomorphism \(\pi _\mathcal {E}\) of \(\mathcal {E}\) is \(t_\mathcal {E}= 136368062447564341573735631776713817674\), which reveals that

$$\begin{aligned} \#\mathcal {E}(\mathbb {F}_{p^2}) = p^2+1-t_\mathcal {E}= 2^3 \cdot 7^2 \cdot N, \end{aligned}$$
(2)

where N is a 246-bit prime. The cryptographic group we work with in this paper is \(\mathcal {E}(\mathbb {F}_{p^2})[N]\).
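The group order in (2) and the security estimate quoted in the introduction can be reproduced directly from \(p\) and \(t_\mathcal {E}\). The following is a small sanity-check sketch; the variable names are ours, and the Fermat test shown is only a probable-prime check, not a primality proof.

```python
import math

p = 2**127 - 1
t_E = 136368062447564341573735631776713817674

# #E(F_{p^2}) = p^2 + 1 - t_E, which the text factors as 2^3 * 7^2 * N.
order = p * p + 1 - t_E
cofactor = 2**3 * 7**2
N, remainder = divmod(order, cofactor)

n_bits = N.bit_length()

# Base-2 Fermat test: a necessary (not sufficient) condition for N to be prime.
fermat_ok = pow(2, N - 1, N) == 1

# Expected cost of Pollard rho with the negation map: sqrt(pi * N / 4) group operations.
rho_log2 = 0.5 * math.log2(math.pi * N / 4)

print(remainder, n_bits, fermat_ok, round(rho_log2, 1))
```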

2.2 Where did this Curve Come From?

The curve \(\mathcal {E}\) above comes from the family of \(\mathbb {Q}\)-curves of degree 2 – originally defined by Hasegawa [29] – that was recently used as one of the example families in Smith’s general construction of \(\mathbb {Q}\)-curve endomorphisms [44, 46]. Certain examples of low-degree \(\mathbb {Q}\)-curves (including this family) were independently obtained through a different construction by Guillevic and Ionica [27], who also studied 4-dimensional decompositions arising from such curves possessing CM. In fact, \(\mathcal {E}\) has a similar structure to the curve constructed in [27, Exercise 1], but is over the prime \(p=2^{127}-1\).

For \(\varDelta \) a square-free integer, this family is defined over \(\mathbb {Q}(\sqrt{\varDelta })\) and is parameterized by \(s \in \mathbb {Q}\) as

$$\begin{aligned} \tilde{\mathcal {E}}_{2,\varDelta ,s} :y^2=x^3 - 6(5-3s\sqrt{\varDelta })x + 8(7-9s\sqrt{\varDelta }). \end{aligned}$$
(3)

By definition [44, Definition 1], curves from this family are 2-isogenous (over \(\mathbb {Q}(\sqrt{\varDelta },\sqrt{-2})\)) to their Galois conjugates \(^\sigma \tilde{\mathcal {E}}_{2,\varDelta ,s}\). Smith reduces \(\tilde{\mathcal {E}}_{2,\varDelta ,s}\) and \(^\sigma \tilde{\mathcal {E}}_{2,\varDelta ,s}\) modulo primes p that are inert in \(\mathbb {Q}(\sqrt{\varDelta })\) to produce the curves \(\mathcal {E}_{2,\varDelta ,s}\) and \(^\sigma \mathcal {E}_{2,\varDelta ,s}\) defined over \(\mathbb {F}_{p^2}\). He then composes the induced 2-isogeny from \(\mathcal {E}_{2,\varDelta ,s}\) to \(^\sigma \mathcal {E}_{2,\varDelta ,s}\) with the p-power Frobenius map from \(^\sigma \mathcal {E}_{2,\varDelta ,s}\) back to \(\mathcal {E}_{2,\varDelta ,s}\), which produces an efficiently computable degree 2p endomorphism \(\psi \) on \(\mathcal {E}_{2,\varDelta ,s}\).

Recall that in this paper we fix \(p=2^{127}-1\) for efficiency reasons. For this particular prime p and this family of \(\mathbb {Q}\)-curves, Smith’s construction gives rise to precisely p non-isomorphic curves corresponding to each possible choice of \(s \in \mathbb {F}_p\) [46, Proposition 1]. Varying s allows us to readily find curves belonging to this family with strong cryptographic group orders, each of which comes equipped with the endomorphism \(\psi \) that facilitates a two-dimensional scalar decomposition.

Seeking a four-dimensional (rather than two-dimensional) scalar decomposition on \(\mathcal {E}_{2,\varDelta ,s}\) restricts us to a very small subset of possible s values. This is because we require the existence of another efficiently computable endomorphism on \(\mathcal {E}_{2,\varDelta ,s}\), namely the low-degree GLV endomorphism \(\phi \) on those instances of \(\mathcal {E}_{2,\varDelta ,s}\) that possess CM over \(\mathbb {Q}(\sqrt{\varDelta })\). In [46, Sect. 9], Smith explains why there are only a handful of s values in any particular \(\mathbb {Q}\)-curve family that correspond to a curve with CM, before cataloging all such instances in the families of \(\mathbb {Q}\)-curves of degrees 2, 3, 5 and 7. In particular, up to isogeny and over any prime p, there are merely 13 values of s such that \(\mathcal {E}_{2,\varDelta ,s}\) has CM over \(\mathbb {Q}(\sqrt{\varDelta })\). As is remarked in [46, Sect. 9], this scarcity of CM curves makes it highly unlikely that we will find a secure instance of a low-degree \(\mathbb {Q}\)-curve family with CM over any fixed prime p. This is the reason why other authors chasing high speeds at the 128-bit security level have previously sacrificed the fast Mersenne prime \(p=2^{127}-1\) in favor of a four-dimensional decomposition [9, 37]; one can always search through the small handful of exceptional CM curves over many sub-optimal primes until a cryptographically secure instance is found. However, in the specific case of \(p=2^{127}-1\), we actually get extremely lucky: our search through Smith’s tables of exceptional \(\mathbb {Q}\)-curves with CM [46, Theorem 6] found one particular instance over \(\mathbb {F}_{p^2}\) with a prime subgroup of 246 bits, namely \(\mathcal {E}_{2,\varDelta ,s}\) with \(s=\pm \frac{4}{9}\) and \(\varDelta =5\). As is detailed in [46, Sect. 3], the specification of \(\varDelta =5\) here does not dictate how we form the extension field \(\mathbb {F}_{p^2}\) over \(\mathbb {F}_p\); all quadratic extension fields of \(\mathbb {F}_p\) are isomorphic, so we can take \(s\sqrt{\varDelta } = \pm \frac{4}{9}\sqrt{5}\) in (3) while still taking the reduction of \(\tilde{\mathcal {E}}_{2,5,\pm \frac{4}{9}}\) modulo p to be \(\mathcal {E}_{2,5,\pm \frac{4}{9}}/\mathbb {F}_{p^2}\) with \(\mathbb {F}_{p^2}:=\mathbb {F}_p(\sqrt{-1})\). To simplify notation, from here on we fix \(\tilde{\mathcal {E}}_\mathrm{W} := \tilde{\mathcal {E}}_{2,5,\pm \frac{4}{9}}\) and define \(\mathcal {E}_\mathrm{W}\) as the reduction of \(\tilde{\mathcal {E}}_\mathrm{W}\) modulo p, given as

$$\begin{aligned} \mathcal {E}_\mathrm{W}/\mathbb {F}_{p^2} \, :\, y^2 =x^3 - (30-8 \sqrt{5})x + (56-32\sqrt{5}), \end{aligned}$$
(4)

where the choice of the root \(\sqrt{5}\) in \(\mathbb {F}_{p^2}\) will be fixed in Sect. 3. We note that the short Weierstrass curve \(\mathcal {E}_\mathrm{W}\) is not isomorphic to our twisted Edwards curve \(\mathcal {E}\), but rather to a twisted Edwards curve \(\hat{\mathcal {E}}\) that is \(\mathbb {F}_{p^2}\)-isogenous to \(\mathcal {E}\). The reason we work with \(\mathcal {E}\) rather than \(\hat{\mathcal {E}}\) is because the curve constant d on \(\mathcal {E}\) is non-square in \(\mathbb {F}_{p^2}\), which is not the case for the curve constant \(\hat{d}\) on \(\hat{\mathcal {E}}\); as we mentioned above, d being a non-square ensures that the fastest known addition formulas are also complete on \(\mathcal {E}\). The isogenies between \(\mathcal {E}\) and \(\hat{\mathcal {E}}\) are made explicit as follows.

Proposition 1

Let \(\hat{\mathcal {E}}/K\) and \(\mathcal {E}/K\) be the twisted Edwards curves defined by \(\hat{\mathcal {E}}/K :-x^2+y^2 = 1+\hat{d}x^2y^2\) and \(\mathcal {E}/K :-x^2+y^2 = 1+dx^2y^2\). If \(d = -(1+1/\hat{d})\), then the map \(\tau \, :\, \mathcal {E}\rightarrow \hat{\mathcal {E}}\), \((x,y) \mapsto \left( \frac{2xy}{(x^2+y^2)\sqrt{\hat{d}}} \, , \, \frac{x^2-y^2+2}{y^2-x^2} \right) \) is a 4-isogeny, the dual of which is \(\hat{\tau } \, :\, \hat{\mathcal {E}} \rightarrow \mathcal {E}\), \((x,y) \mapsto \left( \frac{2xy\sqrt{\hat{d}}}{x^2-y^2+2} \, , \, \frac{y^2-x^2}{y^2+x^2} \right) \).

We note at once that if \(\hat{d}\) is a square in K, then \(\tau \) and \(\hat{\tau }\) are defined over K. Fortunately, while the twisted Edwards curve \(\hat{\mathcal {E}}\) corresponding to \(\mathcal {E}_\mathrm{W}/\mathbb {F}_{p^2}\) has a square constant \(\hat{d}\), our chosen isogenous curve \(\mathcal {E}\) has the non-square constant \(d = -(1+1/\hat{d})\). Our implementation will work solely in twisted Edwards coordinates on \(\mathcal {E}\), but we will pass back and forth through \(\mathcal {E}_\mathrm{W}\) (via \(\hat{\mathcal {E}}\)) when deriving explicit formulas for the endomorphisms \(\phi \) and \(\psi \) in Sect. 3. We note that Hamburg used 4-isogenies (also derived from [1]) to a similar effect in [28].
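Proposition 1 is stated over a general field K, so its curve-to-curve claim can be spot-checked over a small prime field. The sketch below is a numerical sanity check, not a proof: the field \(\mathbb {F}_{103}\), the choice \(\sqrt{\hat{d}}=3\) and all function names are assumptions made purely for illustration. It confirms that \(\tau \) sends points of \(\mathcal {E}\) to \(\hat{\mathcal {E}}\) and that \(\hat{\tau }\) sends points of \(\hat{\mathcal {E}}\) to \(\mathcal {E}\) wherever the affine denominators are nonzero.

```python
# Toy check of Proposition 1 over F_q with q = 103 (illustrative parameters only).
Q = 103
S = 3                                # assumed square root of d_hat in F_q
D_HAT = S * S % Q
D = -(1 + pow(D_HAT, -1, Q)) % Q     # d = -(1 + 1/d_hat)

def inv(a):
    return pow(a, -1, Q)

def on_curve(x, y, d):
    """Test -x^2 + y^2 = 1 + d*x^2*y^2 in F_q."""
    return (-x * x + y * y - 1 - d * x * x * y * y) % Q == 0

def tau(x, y):
    """Map E -> E_hat; returns None where the affine formula has a zero denominator."""
    den_x = (x * x + y * y) * S % Q
    den_y = (y * y - x * x) % Q
    if den_x == 0 or den_y == 0:
        return None
    return (2 * x * y * inv(den_x) % Q, (x * x - y * y + 2) * inv(den_y) % Q)

def tau_hat(x, y):
    """Map E_hat -> E; returns None where the affine formula has a zero denominator."""
    den_x = (x * x - y * y + 2) % Q
    den_y = (y * y + x * x) % Q
    if den_x == 0 or den_y == 0:
        return None
    return (2 * x * y * S * inv(den_x) % Q, (y * y - x * x) * inv(den_y) % Q)

def spot_check():
    """Return (points checked, failures) over all affine points of both toy curves."""
    checked = failures = 0
    for x in range(Q):
        for y in range(Q):
            for src_d, dst_d, f in ((D, D_HAT, tau), (D_HAT, D, tau_hat)):
                if on_curve(x, y, src_d):
                    image = f(x, y)
                    if image is not None:
                        checked += 1
                        failures += not on_curve(image[0], image[1], dst_d)
    return checked, failures
```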

2.3 The Cryptographic Security of Four\(\mathbb {Q}\)

Pollard’s rho algorithm [42] is the best known way to solve the ECDLP in \(\mathcal {E}(\mathbb {F}_{p^2})[N]\). An optimized version of this attack which uses the negation map [50] requires around \(\sqrt{\pi N/4} \sim 2^{122.5}\) group operations on average. We note that, unlike some of the typical GLV [22] or GLS [21] endomorphisms that can be used to speed up Pollard’s rho algorithm [16], neither \(\psi \) nor \(\phi \) on \(\mathcal {E}\) facilitates any known advantage; neither of these endomorphisms has a small orbit and they are both more expensive to compute than an amortized addition. Thus, the known complexity of the ECDLP on \(\mathcal {E}\) is comparable to various other curves used in the speed-record literature; optimized implementations of Pollard rho against any of the fastest curves in [4, 9, 13, 18, 21, 37, 41] would require between \(2^{124.8}\) and \(2^{125.8}\) group operations on average. Ideally, we would prefer not to have the factor \(7^2\) dividing \(\#\mathcal {E}(\mathbb {F}_{p^2})\), but the resulting (\(\sim 2.8\) bit) security degradation is a small price to pay for having the fastest field at the 128-bit level in conjunction with a four-dimensional scalar decomposition. As we discuss further in [14], it was a long shot to try and find such a cryptographically secure \(\mathbb {Q}\)-curve with CM over \(\mathbb {F}_{p^2}\) in Smith’s tables in the first place, let alone one that also had the necessary torsion to support a twisted Edwards model.

Since \(\mathcal {E}(\mathbb {F}_{p^2})\) has rational 2-torsion, it is easy to write down the corresponding abelian surface over \(\mathbb {F}_p\) whose Jacobian is isogenous to the Weil restriction of \(\mathcal {E}\) – see [43, Lemma 2.1 and Lemma 3.1]. But since the best known algorithm to solve the discrete logarithm problem on such abelian surfaces is again Pollard’s rho algorithm, the Weil descent philosophy (cf. [24]) does not pose a threat here. Furthermore, the embedding degree of \(\mathcal {E}\) with respect to N is \((N-1)/2\), making it infeasible to reduce the ECDLP into a finite field [19, 39].

We note that the largest prime factor dividing the group order of \(\mathcal {E}\)’s quadratic twist is 158 bits, but twist-security [4] is not an issue in this work: firstly, our software always validates input points (such validation is essentially free), and secondly, x-coordinate-only arithmetic (which is where twist-security makes sense) on \(\mathcal {E}\) is not competitive with a four-dimensional decomposition that uses both coordinates.

In contrast to most currently standardized curves, the proposed curve is both defined over a quadratic extension field and has a small discriminant; one notable exception is secp256k1 in the SEC standard [11], which is used in the Bitcoin protocol and also has small discriminant. However, it is important to note that there is no better-than-generic attack known to date that can exploit either of these two properties on \(\mathcal {E}\). In fact, with respect to ECDLP difficulty, Koblitz, Koblitz and Menezes [33, Sect. 11] point out that slower, large discriminant curves, like NIST P-256 and Curve25519, may turn out to be less conservative than specially chosen curves with small discriminant.

3 The Endomorphisms \(\psi \) and \(\phi \)

In this section we derive explicit formulas for the two endomorphisms on \(\mathcal {E}\). In what follows we use \(c_{i,j,k,l}\) to denote the constant \(i+j\sqrt{2}+k\sqrt{5}+l\sqrt{2}\sqrt{5}\) in \(\mathbb {F}_{p^2}\), which is fixed by setting \(\sqrt{2}:=2^{64}\) and \(\sqrt{5}:=87392807087336976318005368820707244464 \cdot i\).

For both \(\psi \) and \(\phi \), we start by deriving the explicit formulas on the short Weierstrass model \(\mathcal {E}_\mathrm{W}\). As discussed in the previous section, we will pass back and forth between \(\mathcal {E}\) and \(\mathcal {E}_\mathrm{W}\) via the twisted Edwards curve \(\hat{\mathcal {E}}\) that is 4-isogenous to \(\mathcal {E}\) over \(\mathbb {F}_{p^2}\). The maps between \(\mathcal {E}\) and \(\hat{\mathcal {E}}\) are given in Proposition 1, and we take the maps \(\delta :\mathcal {E}_\mathrm{W} \rightarrow \hat{\mathcal {E}}\) and \(\delta ^{-1} :\hat{\mathcal {E}} \rightarrow \mathcal {E}_\mathrm{W}\) from [46, Sect. 5] (tailored to our \(\hat{\mathcal {E}}\)) as \(\delta \, :(x,y) \mapsto \left( \frac{\gamma (x-4)}{y},\frac{x-4-c_{0,2,0,1}}{x-4+c_{0,2,0,1}}\right) \), and \(\delta ^{-1} \, :(x,y) \mapsto \left( \frac{c_{0,2,0,1}(y+1)}{1-y}+4,\frac{ c_{0,2,0,1}(y+1)\gamma }{x(1-y)}\right) \), where \(\gamma ^2=c_{-12,-4,0,-2}\). The choice of the square root \(\gamma \in \mathbb {F}_{p^2}\) becomes irrelevant in the compositions below.

3.1 Explicit Formulas for \(\psi \)

There is almost no work to be done in deriving \(\psi \) on \(\mathcal {E}\), since this is Smith’s \(\mathbb {Q}\)-curve endomorphism corresponding to the degree-2 family to which \(\mathcal {E}_\mathrm{W}\) belongs. We start with \(\psi _\mathrm{W} :\mathcal {E}_\mathrm{W} \rightarrow \mathcal {E}_\mathrm{W}\), taken from [46, Sect. 5], as \(\psi _\mathrm{W} :(x,y) \mapsto \left( \left( -\frac{x}{2}-\frac{c_{9,0,4,0}}{x-4}\right) ^p, \left( \frac{y}{i\sqrt{2}} \left( -\frac{1}{2}+\frac{c_{9,0,4,0}}{(x-4)^2}\right) \right) ^p\right) \). With \(\psi _\mathrm{W}\) as above, we define \(\psi :\mathcal {E}\rightarrow \mathcal {E}\) as the composition \(\psi = \hat{\tau }\delta \psi _\mathrm{W} \delta ^{-1} \tau \). In optimizing the explicit formulas for this composition, there is practically nothing to be gained by simplifying the full composition in the function field \(\mathbb {F}_{p^2}(\mathcal {E})\). However, it is advantageous to optimize explicit formulas for the inner composition \((\delta \psi _\mathrm{W} \delta ^{-1})\) in the function field \(\mathbb {F}_{p^2}(\hat{\mathcal {E}})\). In fact, for both \(\psi \) and \(\phi \), optimized explicit formulas for this inner composition are faster than the respective endomorphisms \(\psi _\mathrm{W}\) and \(\phi _\mathrm{W}\), and are therefore much faster than computing the respective compositions individually.

Simplifying the composition \(\delta \psi _\mathrm{W} \delta ^{-1}\) in the function field \(\mathbb {F}_{p^2}(\hat{\mathcal {E}})\) yields \((\delta \psi _\mathrm{W} \delta ^{-1}) :\hat{\mathcal {E}} \rightarrow \hat{\mathcal {E}}\),

$$\begin{aligned} (x,y) \mapsto \left( \frac{2 i x^p \cdot c_{-2,3,-1,0}}{y^p \cdot \left( (x^p)^2\cdot c_{-140,99,0,0}+c_{-76,57,-36,24}\right) }, \frac{c_{-9,-6,4,3}-(x^p)^2}{c_{-9,-6,4,3}+(x^p)^2} \right) . \end{aligned}$$

Note that each of the p-power Frobenius operations above amounts to one \(\mathbb {F}_{p}\) negation. As mentioned above, we compute the endomorphism \(\psi = \hat{\tau } (\delta \psi _\mathrm{W} \delta ^{-1}) \tau \) on \(\mathcal {E}\) by computing \(\tau \) and \(\hat{\tau }\) separately; see Sect. 3.4 for the operation counts.
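The reason a p-power Frobenius costs only a negation is that \(p \equiv 3 \pmod 4\), so \(i^p=-i\) and hence \((a+bi)^p = a-bi\) for \(a,b \in \mathbb {F}_p\). The sketch below compares a generic square-and-multiply exponentiation against this conjugation; the function names and the sample element are hypothetical and chosen only for the check.

```python
P = 2**127 - 1  # p = 3 (mod 4), so -1 is a non-residue and F_{p^2} = F_p(i)

def fp2_mul(a, b):
    """Multiply a0 + a1*i by b0 + b1*i using i^2 = -1."""
    return ((a[0] * b[0] - a[1] * b[1]) % P, (a[0] * b[1] + a[1] * b[0]) % P)

def fp2_pow(a, e):
    """Generic square-and-multiply exponentiation in F_{p^2}."""
    result = (1, 0)
    while e:
        if e & 1:
            result = fp2_mul(result, a)
        a = fp2_mul(a, a)
        e >>= 1
    return result

def frobenius(a):
    """The p-power Frobenius on F_{p^2}: negate the imaginary coordinate."""
    return (a[0], -a[1] % P)

sample = (123456789, 987654321)           # arbitrary element of F_{p^2}
slow = fp2_pow(sample, P)                 # 127 squarings plus multiplications
fast = frobenius(sample)                  # a single negation in F_p
```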

3.2 Deriving Explicit Formulas for \(\phi \)

We now derive the second endomorphism \(\phi \) that arises from \(\mathcal {E}\) admitting CM by the order of discriminant \(D=-40\). We start by pointing out that there are actually multiple routes that could be taken in defining and deriving \(\phi \) (see the full version [14] for additional details). The possibility that we use in this paper produces an endomorphism of degree 5. This option was revealed to us in correspondence with Ben Smith, who pointed out that \(\mathbb {Q}\)-curves with CM can also be produced as the intersection of families of \(\mathbb {Q}\)-curves, and that our curve \(\mathcal {E}\) is not only a degree-2 \(\mathbb {Q}\)-curve, but is also a degree-5 \(\mathbb {Q}\)-curve. Thus, the second endomorphism \(\phi \) can be derived by first following the treatment in [46, Sect. 7] (see also [27, Sect. 3.3]) to derive \(\phi _\mathrm{W}\) as a 5-isogeny on \(\mathcal {E}_\mathrm{W}\), which we do below.

Working in \(\mathbb {Q}(\sqrt{5})[x]\), the 5-division polynomial (cf. [20, Definition 9.8.4]) of \(\tilde{\mathcal {E}}_\mathrm{W}\) factors as f(x)g(x), where \(f(x) = x^2 + 4\sqrt{5} \cdot x +(18-4/5\sqrt{5})\) and g(x) (which is of degree 10) are irreducible. The polynomial f(x) defines the kernel of a 5-isogeny \(\phi ^\sigma _\mathrm{W} :\tilde{\mathcal {E}}_\mathrm{W} \rightarrow \tilde{\mathcal {E}}^\sigma _\mathrm{W}\). We use this kernel to compute \(\phi ^\sigma _\mathrm{W}\) via Vélu’s formulae [49] (see also [34, Sect. 2.4]), reduce modulo p, and then compose with Frobenius \(\pi _p :\mathcal {E}^\sigma _\mathrm{W} \rightarrow \mathcal {E}_\mathrm{W}\) to give \(\phi _\mathrm{W} :\mathcal {E}_\mathrm{W} \rightarrow \mathcal {E}_\mathrm{W} , (x,y) \mapsto (x_{\phi _W},y_{\phi _W})\), where

$$\begin{aligned} x_{\phi _W}=&\left( \frac{x^5 + 8\sqrt{5}x^4 + (40\sqrt{5} + 260)x^3 + (720\sqrt{5} +640)x^2 + (656\sqrt{5} + 4340)x + (1920\sqrt{5} + 960)}{5\left( x^2 + 4\sqrt{5}x - 1/5(4\sqrt{5} - 90)\right) ^2}\right) ^p\!,\\ y_{\phi _W}=&\left( \frac{-y \left( x^2 + (4\sqrt{5} - 8)x - 12\sqrt{5} + 26\right) \left( x^4 + (8\sqrt{5} + 8)x^3 + 28x^2 - (48\sqrt{5} + 112)x - 32\sqrt{5} - 124\right) }{\left( \sqrt{5}(x^2 + 4\sqrt{5}x - 1/5(4\sqrt{5} - 90))\right) ^3}\right) ^p\!. \end{aligned}$$

As was the case with \(\psi \) in Sect. 3.1, it is advantageous to optimize formulas in \(\mathbb {F}_{p^2}(\hat{\mathcal {E}})\) for the composition \((\delta \phi _\mathrm{W} \delta ^{-1})\), which gives \((\delta \phi _\mathrm{W} \delta ^{-1}):\hat{\mathcal {E}} \rightarrow \hat{\mathcal {E}}, (x,y) \mapsto (x_\phi ,y_\phi )\), where

$$\begin{aligned} x_\phi =&\left( \frac{c_{9,-6,4,-3} \cdot x \cdot (y^2-c_{7,5,3,2} \cdot y+c_{21,15,10,7}) \cdot (y^2+c_{7,5,3,2} \cdot y+c_{21,15,10,7})}{(y^2+c_{3,2,1,1} \cdot y+c_{3,3,2,1}) \cdot (y^2-c_{3,2,1,1} \cdot y+c_{3,3,2,1})}\right) ^p\!, \\ y_\phi =&\left( \frac{c_{15,10,6,4} \cdot (5 y^4+c_{120,90,60,40} \cdot y^2+c_{175,120,74,54})}{5 y \cdot (y^4+c_{240,170,108,76} \cdot y^2+c_{3055,2160,1366,966})}\right) ^p\!. \end{aligned}$$

Again, we use this to compute the full endomorphism \(\phi = \hat{\tau } (\delta \phi _\mathrm{W} \delta ^{-1}) \tau \) on \(\mathcal {E}\) by computing \(\tau \) and \(\hat{\tau }\) separately; see Sect. 3.4 for the operation counts.

3.3 Eigenvalues

The eigenvalues of the two endomorphisms \(\psi \) and \(\phi \) play a key role in developing scalar decompositions. In this subsection we write them in terms of the curve parameters. From [46, Theorem 2], and given that we used a 4-isogeny \(\tau \) and its dual to pass back and forth to \(\mathcal {E}_\mathrm{W}\), the eigenvalues of \(\psi \) on \(\mathcal {E}(\mathbb {F}_{p^2})[N]\) are \(\lambda _{\psi } := 4 \cdot \frac{p+1}{r} \pmod {N}\) and \(\lambda _{\psi }' := - \lambda _{\psi } \pmod {N}\), where r is an integer satisfying \(2r^2 = 2p + t_{\mathcal {E}}\). To derive the eigenvalues for \(\phi \), we make use of the CM equation for \(\mathcal {E}\), which (since \(\mathcal {E}\) has CM by the order of discriminant \(D=-40\)) is \(40V^2 = 4p^2-t_\mathcal {E}^2\), for some integer V. We fix r and V to be the positive integers satisfying these equations, namely \(V:=4929397548930634471175140323270296814\) and \(r:=15437785290780909242\).
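The eigenvalue \(\lambda _\psi \) can be computed directly from the stated values of \(p\), \(t_\mathcal {E}\) and r. The sketch below (variable names are ours) also checks a consequence of the relations above: since \(p^2+1 \equiv t_\mathcal {E} \pmod N\), we have \((p+1)^2 \equiv 2p+t_\mathcal {E} = 2r^2 \pmod N\), and therefore \(\lambda _\psi ^2 \equiv 32 \pmod N\).

```python
p = 2**127 - 1
t_E = 136368062447564341573735631776713817674
r = 15437785290780909242

# N is the large prime factor of #E(F_{p^2}) = p^2 + 1 - t_E = 2^3 * 7^2 * N.
N = (p * p + 1 - t_E) // (2**3 * 7**2)

# lambda_psi = 4 * (p + 1) / r (mod N), with the division done via a modular inverse.
lam_psi = 4 * (p + 1) * pow(r, -1, N) % N
lam_psi_conj = -lam_psi % N          # the second eigenvalue, -lambda_psi (mod N)

# Consistency checks implied by the relations in the text.
r_relation_holds = 2 * r * r == 2 * p + t_E
lam_squared = lam_psi * lam_psi % N
```

The modular inverse via `pow(r, -1, N)` requires Python 3.8 or later.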

Proposition 2

The eigenvalues of \(\phi \) on \(\mathcal {E}(\mathbb {F}_{p^2})[N]\) are

$$\begin{aligned} \lambda _{\phi } \,\, := \,\, 4\cdot \frac{(p-1)r^3}{(p+1)^2V} \pmod {N} \quad \mathrm{and} \quad \lambda _{\phi }' := - \lambda _{\phi } \pmod {N}. \end{aligned}$$

3.4 Section Summary

Table 1 summarizes the isogenies derived in this section, together with their exact operation counts. The reason that multiples of 0.5 appear in the additions column is that we count Frobenius operations (which amount to a negation in \(\mathbb {F}_p\)) as half an addition in \(\mathbb {F}_{p^2}\). Four-dimensional scalar decompositions on \(\mathcal {E}\) require the computation of \(\phi (P)\), \(\psi (P)\) and the composition \(\psi (\phi (P))\); the ordering here is important since \(\psi \) is much faster than \(\phi \), meaning we actually compute \(\phi \) once and \(\psi \) twice. We note that all sets of explicit formulas were derived assuming the inputs were projective points \((X :Y :Z)\) corresponding to a point \((X/Z,Y/Z)\) in the domain of the isogeny. Similarly, all explicit formulas output the point \((X' :Y' :Z')\) corresponding to \((X'/Z',Y'/Z')\) in the codomain, and in the special cases when the codomain is \(\mathcal {E}\) (i.e., for \(\hat{\tau }\), \(\phi \), \(\psi \) and \(-\psi \phi \)), we also output the coordinate \(T'\) (or a related variant) corresponding to \(T'=X'Y'/Z'\), which facilitates faster subsequent group law formulas on \(\mathcal {E}\) – see [14].

Table 1 reveals that, on input of a projective point in \(\mathcal {E}(\mathbb {F}_{p^2})[N]\), the total cost of the three maps \(\phi \), \(\psi \) and \(\psi \phi \) is \(68 \mathbf{M}+27 \mathbf{S}+49.5\mathbf{A}\). Computing the maps using these explicit formulas requires the storage of 16 constants in \(\mathbb {F}_{p^2}\), and at any stage of the endomorphism computations, requires the storage of at most 7 temporary variables.

Table 1. Summary of isogenies used in the derivation of the three endomorphisms \(\phi \), \(\psi \) and \(\phi \psi \) on \(\mathcal {E}\), together with the cost of their explicit formulas. Here \(\mathbf{M}\), \(\mathbf{S}\) and \(\mathbf{A}\) respectively denote the costs of one multiplication, one squaring and one addition in \(\mathbb {F}_{p^2}\).

4 Optimal Scalar Decompositions

Let \(\lambda _\psi \) and \(\lambda _\phi \) be as fixed in Sect. 3.3. In this section we show how to compute, for any integer scalar \(m \in \mathbb {Z}\), a corresponding 4-dimensional multiscalar \((a_1,a_2,a_3,a_4) \in \mathbb {Z}^4\) such that \(m \equiv a_1+a_2\lambda _\phi +a_3\lambda _\psi +a_4\lambda _\phi \lambda _\psi \pmod {N}\), such that \(0\le a_i<2^{64}-1\) for \(i=1,2,3,4\), and such that \(a_1\) is odd (which facilitates faster scalar recodings and multiplications – see Sect. 5). An excellent reference for general scalar decompositions in the context of elliptic curve cryptography is [45], where it is shown how to write down short lattice bases for scalar decompositions directly from the curve parameters. Here, we show how to further reduce such short bases into bases that are, in the context of multiscalar multiplications, optimal.
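To illustrate why a short multiscalar is useful, the sketch below evaluates a four-term combination \([a_1]P+[a_2]\phi (P)+[a_3]\psi (P)+[a_4]\psi \phi (P)\) with a single shared double-and-add loop, so that the loop length is the bit length of the \(a_i\) rather than that of m. It is a simplified, variable-time stand-in and not the recoded constant-time loop of Sect. 5: the group is modelled as the integers modulo a toy modulus, the endomorphisms act as multiplication by hypothetical constants, and the 16-entry table of subset sums is an assumption made for brevity.

```python
TOY_N = 1000003                   # toy modulus standing in for the group order N
LAM_PHI, LAM_PSI = 1234, 56789    # hypothetical "eigenvalues" for the toy group

def multi_scalar_mul(scalars, points, bits):
    """Compute sum(scalars[i] * points[i]) mod TOY_N using one shared loop."""
    # table[mask] holds the sum of the points selected by the set bits of mask.
    table = [0] * 16
    for mask in range(1, 16):
        low = mask & -mask                       # lowest set bit of mask
        idx = low.bit_length() - 1
        table[mask] = (table[mask ^ low] + points[idx]) % TOY_N
    acc = 0
    for bit in reversed(range(bits)):
        acc = (2 * acc) % TOY_N                  # one "doubling" per iteration
        mask = sum(((s >> bit) & 1) << i for i, s in enumerate(scalars))
        acc = (acc + table[mask]) % TOY_N        # one "addition" per iteration
    return acc

def toy_points(base):
    """Return P, phi(P), psi(P), psi(phi(P)) in the toy additive group."""
    return [base % TOY_N,
            LAM_PHI * base % TOY_N,
            LAM_PSI * base % TOY_N,
            LAM_PHI * LAM_PSI * base % TOY_N]

def direct(scalars, base):
    """Reference value: (a1 + a2*lam_phi + a3*lam_psi + a4*lam_phi*lam_psi) * P."""
    a1, a2, a3, a4 = scalars
    return (a1 + a2 * LAM_PHI + a3 * LAM_PSI + a4 * LAM_PHI * LAM_PSI) * base % TOY_N
```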

4.1 Babai Rounding and Optimal Bases

Following [45, Sect. 1], we define the lattice of zero decompositions as

$$\begin{aligned} \mathcal {L}:= \langle \, (z_1, z_2, z_3, z_4) \in \mathbb {Z}^4 \,\, | \,\, z_1 + z_2\lambda _\phi +z_3\lambda _\psi +z_4\lambda _\phi \lambda _\psi \equiv 0 \pmod {N} \rangle , \end{aligned}$$

so that the set of decompositions for \(m \in \mathbb {Z}/N\mathbb {Z}\) is the lattice coset \((m,0,0,0)+\mathcal {L}\). For a given basis \(\mathbf{B}=(\mathbf {b}_1,\mathbf {b}_2,\mathbf {b}_3,\mathbf {b}_4)\) of \(\mathcal {L}\), and on input of any \(m \in \mathbb {Z}\), the Babai rounding technique [2] computes \((\alpha _1,\alpha _2,\alpha _3,\alpha _4) \in \mathbb {Q}^4\) as the unique solution to \((m,0,0,0) = \sum _{i=1}^{4} \alpha _i \mathbf {b}_i\), and subsequently computes the multiscalar \((a_1,a_2,a_3,a_4)=(m,0,0,0)-\sum _{i=1}^4 \lfloor \alpha _i \rceil \cdot \mathbf {b}_i\). It follows that \((a_1,a_2,a_3,a_4)-(m,0,0,0) \in \mathcal {L}\), so \(m \equiv a_1+a_2\lambda _\phi + a_3\lambda _\psi + a_4 \lambda _\phi \lambda _\psi \pmod {N}\). Since \(-1/2\le x - \lfloor x \rceil \le 1/2\), this technique finds the unique element in \((m,0,0,0)+\mathcal {L}\) that lies inside the parallelepiped (Footnote 3) defined by \(\mathcal {P}(\mathbf{B}) = \{\mathbf{B}{} \mathbf{x}\, |\, \mathbf{x} \in [-1/2,1/2)^4\}\), i.e., Babai rounding maps \(\mathbb {Z}\) onto \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\). For a given m, the length of the corresponding multiscalar multiplication is then determined by the infinity norm, \(||\cdot ||_\infty \), of the corresponding element \((a_1,a_2,a_3,a_4)\) in \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\).
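The Babai rounding step can be illustrated on a small example. The following is a minimal sketch in two dimensions, assuming a toy modulus \(N=101\), a toy eigenvalue \(\lambda =10\) and the basis \(((10,-1),(1,10))\) of the corresponding lattice of zero decompositions; none of these values come from Four\(\mathbb {Q}\), and the function names are hypothetical.

```python
from fractions import Fraction

# Toy parameters (assumptions for illustration only, not FourQ values).
N, LAM = 101, 10            # 1 + LAM*10 = 101, so (1, 10) lies in the lattice
B1, B2 = (10, -1), (1, 10)  # basis of {(z1, z2) : z1 + z2*LAM = 0 mod N}

def round_nearest(x: Fraction) -> int:
    """Nearest integer to x, computed exactly as floor(x + 1/2)."""
    return (2 * x.numerator + x.denominator) // (2 * x.denominator)

def babai_decompose(m: int):
    """Return (a1, a2) in (m, 0) + L obtained by Babai rounding."""
    det = B1[0] * B2[1] - B2[0] * B1[1]
    # Solve (m, 0) = alpha1*B1 + alpha2*B2 over the rationals (Cramer's rule).
    alpha1 = Fraction(m * B2[1], det)
    alpha2 = Fraction(-m * B1[1], det)
    r1, r2 = round_nearest(alpha1), round_nearest(alpha2)
    # Subtract the rounded lattice vector from (m, 0).
    return (m - r1 * B1[0] - r2 * B2[0], -r1 * B1[1] - r2 * B2[1])
```

For this toy basis every output has both coordinates of absolute value at most 5, which is half the width of the smallest square containing the parallelepiped.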

Since our scalar multiplications must run in time independent of m, the speed of the multiscalar exponentiations will depend on the worst case, i.e., on the maximal infinity norm taken across all elements in \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\). Or, equivalently, the speed of the routine will depend on the width of the smallest 4-cube whose convex body contains \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\). This width depends only on the choice of \(\mathbf{B}\), so this gives us a natural way of finding a basis that is optimal for our purposes. We make this concrete in the following definition, which is stated for an arbitrary lattice of dimension n. Definition 1 simplifies the situation by looking for the smallest n-cube containing \(\mathcal {P}(\mathbf{B})\), rather than \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^n\), but our candidate bases will always be orthogonal enough such that the conditions are equivalent in practice.

Definition 1

(Babai-optimal bases). We say that a basis \(\mathbf{B}\) of a lattice \(\mathcal {L}\subset \mathbb {R}^n\) is Babai-optimal if the width of the smallest n-cube containing the parallelepiped \(\mathcal {P}(\mathbf{B})\) is minimal across all bases for \(\mathcal {L}\).

We note immediately that taking the n successive minima under \(||\cdot ||_\ell \), for any \(\ell \in \{1,2,\dots ,\infty \}\), will not be Babai-optimal in general. Indeed, for our specific lattice \(\mathcal {L}\), neither the \(||\cdot ||_2\)-reduced basis (output from LLL [35]) nor the \(||\cdot ||_\infty \)-reduced basis (in the sense of Lovász and Scarf [38]) is Babai-optimal.

For very low dimensions, such as those used in ECC scalar decompositions, we can find a Babai-optimal basis via straightforward enumeration as follows. Starting with any reasonably small basis \(\mathbf{B}'=(\mathbf {b}_1',\dots ,\mathbf {b}_n')\), like the ones in [45], we compute the width, \(w(\mathbf{B}')\), of the smallest n-cube whose convex body contains \(\mathcal {P}(\mathbf{B}')\); by the definition of \(\mathcal {P}\), this is \(w(\mathbf{B}') = \mathrm{max}_{1 \le j \le n}\left\{ \sum _{i=1}^n |\mathbf {b}'_i[j]| \right\} \). We then enumerate the set S of all vectors \(\mathbf{v} \in \mathcal {L}\) such that \(||\mathbf{v}||_\infty \le w(\mathbf{B}')\); any vector not in S cannot be in a basis whose width is smaller than that of \(\mathbf{B}'\). We can then test all possible bases \(\mathbf{B}\) that are formed as combinations of n linearly independent vectors in S, and choose one corresponding to the minimal value of \(w(\mathbf{B})\).
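The enumeration just described can be sketched as follows for a two-dimensional toy lattice. This is only an illustration under stated assumptions: the modulus \(N=101\), the eigenvalue \(\lambda =10\), the starting basis and all function names are hypothetical, and a pair of vectors is accepted as a basis exactly when its determinant is \(\pm N\).

```python
from itertools import combinations

N, LAM = 101, 10  # toy parameters (assumptions, not FourQ values)

def width(basis):
    """w(B): the maximum over coordinates j of sum_i |b_i[j]|."""
    n = len(basis[0])
    return max(sum(abs(b[j]) for b in basis) for j in range(n))

def short_vectors(bound):
    """All nonzero (z1, z2) with z1 + z2*LAM = 0 mod N and infinity norm <= bound."""
    vecs = []
    for z2 in range(-bound, bound + 1):
        residue = (-LAM * z2) % N
        # Smallest z1 >= -bound that is congruent to residue modulo N.
        start = residue - N * ((residue + bound) // N)
        for z1 in range(start, bound + 1, N):
            if (z1, z2) != (0, 0):
                vecs.append((z1, z2))
    return vecs

def babai_optimal(start_basis):
    """Exhaustive search for a basis of minimal width among short vectors."""
    best, best_w = start_basis, width(start_basis)
    candidates = short_vectors(best_w)
    for u, v in combinations(candidates, 2):
        is_basis = abs(u[0] * v[1] - u[1] * v[0]) == N
        if is_basis and width((u, v)) < best_w:
            best, best_w = (u, v), width((u, v))
    return best
```

Starting from the deliberately poor basis `((101, 0), (-10, 1))` of width 111, the search returns a basis of width 11. No basis can do better here, since a square of width \(w\) containing a parallelepiped of area 101 needs \(w^2 \ge 101\).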

Proposition 3

A Babai-optimal basis for our zero decomposition lattice \(\mathcal {L}\) is given by \(\mathbf{B}:=( \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_3, \mathbf {b}_4 )\), where

$$\begin{aligned} 224 \cdot \mathbf {b}_1&:= \left( 16(-60 \alpha +13r-10) , \, \, \, 4(-10 \alpha -3r+12) \, , \, \,4(-15 \alpha +5r-13)\, , \, \,-13 \alpha -6r+3 \right) , \\ 8 \cdot \mathbf {b}_2&:= \left( 32(5 \alpha -r)\, , \, \, -8\, , \, \, 8\, , \, \, 2 \alpha +r \right) , \\ 224 \cdot \mathbf {b}_3&:= \left( 16(80 \alpha -15r+18)\, , \, \,4(18 \alpha -3r-16)\, , \, \, 4(-15 \alpha -9r+15) \, , \, \,15 \alpha +8r+3 \right) , \\ 448 \cdot \mathbf {b}_4&:= \left( 16(-360 \alpha +77r+42) , \, 4(42 \alpha +17r+72) , \, 4(85 \alpha -21r-77) , \, (-77 \alpha -36r-17) \right) , \end{aligned}$$

for V and r as fixed in Sect. 3, and where \(\alpha := V/r \in \mathbb {Z}\).

Proof

Straightforward but lengthy calculations using the equations in Sect. 3.3 reveal that \(\mathbf {b}_1\), \(\mathbf {b}_2\), \(\mathbf {b}_3\) and \(\mathbf {b}_4\) are all in \(\mathcal {L}\). Another direct calculation reveals that the determinant of \(\langle \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_3, \mathbf {b}_4 \rangle \) is N, so \(\mathbf{B}\) is a basis for \(\mathcal {L}\). To show that \(\mathbf{B}\) is Babai-optimal, we set \(\mathbf{B}'=\mathbf{B}\) and compute \(w(\mathbf{B}') = \mathrm{max}_{1 \le j \le 4}\left\{ \sum _{i=1}^4 |\mathbf {b}'_i[j]| \right\} \), which (at \(j=4\)) is \(w(\mathbf{B}')= (245\alpha +120r+17)/448\). Enumeration under \(||\cdot ||_\infty \) yields exactly 128 vectors (up to sign) in \(S = \{\mathbf{v} \in \mathcal {L}\mid ||\mathbf{v}||_\infty \le w(\mathbf{B}') \}\); none of the rank 4 bases formed from S have a width smaller than that of \(\mathbf{B}\).    \(\square \)

The size of the set S in the above proof depends on the quality of the initial basis \(\mathbf{B}'\). For the proof, it suffices to start with the Babai-optimal basis \(\mathbf{B}\) itself, but in practice we will usually start with a basis that is not optimal according to Definition 1. In our case we computed the basis in Proposition 3 by first writing down a short basis using Smith’s methodology [45]. We input this into the LLL algorithm [35] to obtain an LLL-reduced basis \(( \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_1+\mathbf {b}_4, \mathbf {b}_3)\); these are also the four successive minima under \(|| \cdot ||_2\). We then input this basis into the algorithm of Lovász and Scarf [38]; this forced the requisite changes to output a basis consisting of the four successive minima under \(|| \cdot ||_\infty \), namely \(( \mathbf {b}_1, \mathbf {b}_1+\mathbf {b}_4,\mathbf {b}_2,\mathbf {b}_1+\mathbf {b}_3)\). Using this as our input \(\mathbf{B}'\) into the enumeration gave a set S of size 282, which we exhaustively searched to find \(\mathbf{B}\).

We now describe a simple scalar decomposition that uses Babai rounding on the optimal basis above. Note that, since V and r are fixed, the four \(\hat{\alpha }_i\) values below are fixed integer constants.

Proposition 4

For a given integer m, and the basis \(\mathbf{B}:=( \mathbf {b}_1, \mathbf {b}_2, \mathbf {b}_3, \mathbf {b}_4 )\) in Prop. 3, let \((\alpha _1,\alpha _2,\alpha _3,\alpha _4) \in \mathbb {Q}^4\) be the unique solution to \((m,0,0,0) = \sum _{i=1}^{4} \alpha _i \mathbf {b}_i\), and let \((a_1, a_2, a_3,a_4) =(m,0,0,0)- \sum _{i=1}^{4}\lfloor \alpha _i \rceil \cdot \mathbf {b}_i\). Then \(m \equiv a_1+a_2\lambda _\phi + a_3\lambda _\psi + a_4 \lambda _\phi \lambda _\psi \pmod {N}\) and \(|a_1|,|a_2|, |a_3|, |a_4| <2^{62}\).

4.2 Handling Round-Off Errors

The decomposition described in Proposition 4 requires the computation of four roundings \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil \), where m is the input scalar and the four \(\hat{\alpha }_i\) and N are fixed curve constants. Following [10, Sect. 4.2], one efficient way of performing these roundings is to choose a power of 2 greater than the denominator N, say \(\mu \), and precompute the fixed curve constants \(\ell _i = \lfloor \frac{\hat{\alpha }_i}{N} \cdot \mu \rceil \), so that \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil \) can be approximated at runtime by \(\lfloor \frac{\ell _i \cdot m }{\mu } \rfloor \), where the division by \(\mu \) is a simple shift.

It is correctly noted in [10, Sect. 4.2] that computing the rounding in this way means the answer can be out by 1 in some cases, but it is further said that “in practice this does not affect the size of the multiscalars”. While this assertion may have been true in [10], in general this will not be the case, particularly when we wish to bound the size of the multiscalars as tightly as possible. We address this issue on \(\mathcal {E}\) starting with Lemma 1.

Lemma 1

Let \(\hat{\alpha }\) be any integer, and let m, N and \(\mu \) be positive integers with \(m < \mu \). Then \(\left\lfloor \frac{\hat{\alpha } m}{N} \right\rceil - \left\lfloor \left\lfloor \frac{\hat{\alpha } \mu }{N} \right\rceil \cdot \frac{m}{\mu } \right\rfloor \) is either 0 or 1.

Lemma 1 says that, so long as we choose \(\mu \) to be greater than the maximum size of our input scalars m, our fast method of approximating \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil \) will either give the correct answer, or it will be \(\lfloor \frac{\hat{\alpha }_i}{N} \cdot m \rceil -1\). It is easy to see that larger choices of \(\mu \) decrease the probability of a rounding error. For example, on 10 million random decompositions of integers between 0 and N with \(\mu =2^{246}\), roughly 2.2 million trials gave at least one error in the \(\alpha _i\); when \(\mu =2^{247}\), roughly 1.7 million trials gave at least one error; when \(\mu =2^{256}\), 4333 trials gave an error; and, taking \(\mu =2^{269}\) was the first power of two that gave no errors.
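The shift-based approximation and the error bound of Lemma 1 can be checked numerically. The sketch below is a minimal illustration with hypothetical function names and toy parameters (\(N=101\), \(\mu =2^8\)); it does not use the actual curve constants \(\hat{\alpha }_i\) or N.

```python
def round_div(n: int, d: int) -> int:
    """Nearest integer to n/d for d > 0, computed as floor(n/d + 1/2)."""
    return (2 * n + d) // (2 * d)

def precompute_ell(alpha_hat: int, N: int, shift: int) -> int:
    """Fixed constant ell = round(alpha_hat * mu / N) with mu = 2**shift."""
    return round_div(alpha_hat << shift, N)

def approx_round(ell: int, m: int, shift: int) -> int:
    """Runtime approximation floor(ell * m / mu); the division is a shift."""
    return (ell * m) >> shift

def rounding_error(alpha_hat: int, m: int, N: int, shift: int) -> int:
    """Exact rounding minus its approximation; Lemma 1 says this is 0 or 1."""
    exact = round_div(alpha_hat * m, N)
    return exact - approx_round(precompute_ell(alpha_hat, N, shift), m, shift)
```

Exhaustively testing all \(m < \mu \) for a range of positive and negative \(\hat{\alpha }\) in this toy setting only ever produces errors of 0 or 1, in line with the lemma.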

Prior works have seemingly addressed this problem by taking \(\mu \) to be large enough so that the chance of roundoff errors is very (perhaps even exponentially) small. However, no matter how large \(\mu \) is chosen, the existence of a permissible scalar whose decomposition gives a roundoff error is still a possibility (Footnote 4), and this could violate constant-time promises.

In this work, and in light of Theorem 1, we instead choose to sacrifice some speed by guaranteeing that roundoff errors are always accounted for. Rather than assuming that \((a_1,a_2,a_3,a_4)=\sum _{i=1}^4 (\alpha _i - \lfloor \alpha _i \rceil )\mathbf {b}_i\), we account for the approximation \(\tilde{\alpha }_i\) to \(\lfloor \alpha _i \rceil \) (described in Lemma 1) by allowing \((a_1, a_2, a_3,a_4) =\sum _{i=1}^4 \left( \alpha _i - \tilde{\alpha }_i \right) \mathbf {b}_i=\sum _{i=1}^4 \left( \alpha _i - (\lfloor {\alpha }_i\rceil - \epsilon _i) \right) \mathbf {b}_i\), for all sixteen combinations arising from \(\epsilon _i \in \{0,1\}\), for \(i=1,2,3,4\). This means that all integers less than \(\mu \) will decompose to a multiscalar in \(\mathbb {Z}^4\) whose coordinates lie inside the parallelepiped \(\mathcal {P}_\epsilon (\mathbf{B}):=\{\mathbf{B}{} \mathbf{x}\, |\, \mathbf{x} \in [-1/2,3/2)^4\}\). Theorem 1 permits scalars as any 256-bit strings, so we fix \(\mu :=2^{256}\) from here on, which also means that division by \(\mu \) will correspond to a shift of machine words. The edges of \(\mathcal {P}_\epsilon (\mathbf{B})\) are twice as long as those of \(\mathcal {P}(\mathbf{B})\), so the number of points in \(\mathcal {P}_\epsilon (\mathbf{B}) \cap \mathbb {Z}^4\) is \(\mathrm{vol}(\mathcal {P}_\epsilon ) = 16N\). We note that, even though the number of permissible scalars far exceeds 16N, the decomposition that maps integers in \([0,\mu )\) to multiscalars in \(\mathcal {P}_\epsilon (\mathbf{B}) \cap \mathbb {Z}^4\) is certainly no longer onto; almost all of the \(\mu \) scalars will map into \(\mathcal {P}(\mathbf{B}) \cap \mathbb {Z}^4\), since the chance of roundoff errors that take us into \(\mathcal {P}_\epsilon (\mathbf{B})-\mathcal {P}(\mathbf{B})\) is small. 
Plainly, the width of the smallest 4-cube containing \(\mathcal {P}_\epsilon (\mathbf{B})\) is also twice that of the 4-cube containing \(\mathcal {P}(\mathbf{B})\), so (in the sense of Definition 1) our basis is still Babai-optimal. Nevertheless, the bounds in Proposition 4 no longer apply, which is one of the issues addressed in the next subsection.
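For completeness, the count of 16N points follows from a one-line volume computation: \(\mathcal {P}_\epsilon (\mathbf{B})\) is a translate of the half-open fundamental parallelepiped of the scaled basis \(2\mathbf{B}\), and \(\det (\mathbf{B})=N\) by the proof of Proposition 3, so

```latex
$$\begin{aligned} \mathrm{vol}(\mathcal {P}_\epsilon (\mathbf{B})) = |\det (2\mathbf{B})| = 2^4 \cdot |\det (\mathbf{B})| = 16N . \end{aligned}$$
```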

4.3 All-Positive Multiscalars

Many points in \(\mathcal {P}_\epsilon (\mathbf{B}) \cap \mathbb {Z}^4\) have coordinates that are far greater than \(2^{62}\) in absolute value, and in addition, the majority of them will have coordinates that are both positive and negative. Dealing with such signed multiscalars can require an additional iteration in the main loop of the scalar multiplication, so in this subsection we use an offset vector in \(\mathcal {L}\) to find a translate of \(\mathcal {P}_\epsilon (\mathbf{B})\) that contains points whose four coordinates are always positive. We note that this does not save the additional iteration mentioned above, but (at no cost) it does simplify the scalar recoding, such that we do not have to deal with multiscalars that can have negative coordinates. Such offset vectors were used in two dimensions in [13, Sect. 4].

From the proof of Proposition 3, we have that the width of the smallest 4-cube containing \(\mathcal {P}_\epsilon (\mathbf{B})\) is \(2\cdot (245\alpha +120r+17)/448\), which lies between \(2^{63}\) and \(2^{64}\). Thus, the optimal situation is to find a translate of \(\mathcal {P}_\epsilon (\mathbf{B})\) (using a vector in \(\mathcal {L}\)) that fits inside the convex body of the 4-cube \(\mathcal {H} = \{2^{64}\cdot \mathbf{x}\, |\, \mathbf{x} \in [0,1]^4\}\). In fact, as we discuss in the next paragraph, we actually want to find two distinct translates of \(\mathcal {P}_\epsilon (\mathbf{B})\) inside \(\mathcal {H}\).

The scalar recoding described in Sect. 5 requires that the first component of the multiscalar \((a_1,a_2,a_3,a_4)\) is odd. In the case that \(a_1\) is even, which happens around half of the time, previous works have employed this “odd-only” recoding by instead working with the multiscalar \((a_1-1,a_2,a_3,a_4)\), and adding the point P to the value output by the main loop (cf. [41, Algorithm 4] and [18, Algorithm 2]). Of course, in a constant-time routine, this scalar update and point addition must be performed regardless of the parity of \(a_1\), and the correct scalars and results must be masked in and out of the main loop accordingly. In this work we simplify the situation by using offset vectors in \(\mathcal {L}\) to achieve the same result; this has the added advantage of avoiding an extra point addition. We do this by finding two vectors \(\mathbf{c}, \mathbf{c}' \in \mathcal {L}\) such that \(\mathbf{c}+\mathcal {P}_\epsilon (\mathbf{B})\) and \(\mathbf{c}' +\mathcal {P}_\epsilon (\mathbf{B})\) both lie inside \(\mathcal {H}\), and such that precisely one of \((a_1,a_2,a_3,a_4)+\mathbf{c}\) and \((a_1,a_2,a_3,a_4)+\mathbf{c}'\) has a first component that is odd. This is made explicit in the full scalar decomposition described below.

Proposition 5

(Scalar Decompositions). Let \(\mathbf{B}=(\mathbf {b}_1,\mathbf {b}_2,\mathbf {b}_3,\mathbf {b}_4)\) be the basis in Proposition 3, let \(\mu =2^{256}\), and define the four curve constants \(\ell _i:=\lfloor \hat{\alpha }_i \cdot \mu /N \rceil \) for \(i=1,2,3,4\), with the \(\hat{\alpha }_i\) as given in Proposition 4. Let \(\mathbf{c}=2\mathbf {b}_1-\mathbf {b}_2+5\mathbf {b}_3+2\mathbf {b}_4\) and \(\mathbf{c}'= 2\mathbf {b}_1-\mathbf {b}_2+5\mathbf {b}_3+\mathbf {b}_4\) in \(\mathcal {L}\). For any integer \(m \in [0,2^{256})\), let \(\tilde{\alpha }_i =\left\lfloor \ell _i m /\mu \right\rfloor \), and let \((a_1, a_2, a_3,a_4) =(m,0,0,0)- \sum _{i=1}^{4} \tilde{\alpha }_i \cdot \mathbf {b}_i\). Then, both of the multiscalars \((a_1,a_2,a_3,a_4)+\mathbf{c}\) and \((a_1,a_2,a_3,a_4)+\mathbf{c}'\) are valid decompositions of m, have all four coordinates positive and less than \(2^{64}\), and precisely one of them has a first coordinate that is odd.

The scalar decomposition described in Proposition 5 outputs two multiscalars. Our decomposition routine uses a bitmask to select and output the one with an odd first coordinate in constant time.
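The bitmask selection can be sketched as follows. This is a minimal illustration of the branch-free logic only, with a hypothetical function name: Python integers are not constant-time, and the sketch assumes that all coordinates lie in \([0,2^{64})\) and that exactly one of the two inputs has an odd first coordinate, as Proposition 5 guarantees.

```python
MASK64 = (1 << 64) - 1  # models a 64-bit machine word

def select_odd(x, y):
    """Return whichever of the multiscalars x, y has an odd first coordinate.

    Uses a mask instead of a branch: mask is all ones when x[0] is odd
    and all zeros otherwise.
    """
    mask = (-(x[0] & 1)) & MASK64
    inv = ~mask & MASK64
    return tuple((xi & mask) | (yi & inv) for xi, yi in zip(x, y))
```

For example, `select_odd((3, 4, 5, 6), (2, 7, 8, 9))` returns the first argument, while swapping the arguments still returns `(3, 4, 5, 6)`.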

5 The Scalar Multiplication

This section describes the full scalar multiplication of \(P \in \mathcal {E}(\mathbb {F}_{p^2})\) by an integer \(m \in [0,2^{256})\), pulling together the endomorphisms and scalar decompositions derived in the previous two sections.

5.1 Recoding the Multiscalar

The “all-positive” multiscalar \((a_1,a_2,a_3,a_4)\) that is obtained from the decomposition described in Proposition 5 could be fed as is into a simple 4-way multiexponentiation (e.g., the 4-dimensional version of [48]) to achieve an efficient scalar multiplication. However, more care needs to be taken to obtain an efficient routine that also runs in constant-time. For example, we need to guarantee that the main loop iterates in the same number of steps, which would not currently be the case since \(\mathrm{max}_j(\mathrm{log}_2(|a_j|))\) can be several integers less than 64. As another example, a straightforward multiexponentiation could leak information in the case that the i-th bit of all four \(a_j\) values was 0, which would result in a “do-nothing” rather than a non-trivial addition.

To achieve an efficient constant-time routine, we adopt the general recoding algorithm from [18, Algorithm 1], and tailor it to scalar multiplications on Four\(\mathbb {Q}\). This results in Algorithm 1 below, which is presented in two flavors: one that is geared towards the general reader and one that is geared towards implementers (we note that the lines do not coincide for the most part). On input of any multiscalar \((a_1,a_2,a_3,a_4)\) produced by Proposition 5, Algorithm 1 outputs an equivalent multiscalar \((b_1,b_2,b_3,b_4)\) with \(b_j = \sum _{i=0}^{64}b_j[i] \cdot 2^i\) for \(b_j[i]\in \{-1,0,1\}\) and \(j=1,2,3,4\), such that we always have \(b_1[64]=1\) and such that \(b_1[i]\) is non-zero for every \(i=0,\dots ,63\). This fixes the length of the main loop and ensures that each addition step of the multiexponentiation requires an addition by something other than the neutral element.

Another benefit of Algorithm 1 is that \(b_j[i] \in \{0,b_1[i]\}\) for \(j=2,3,4\); as was exploited in [18], this “sign-alignment” means that the lookup table used in our multiexponentiation only requires 8 elements, rather than the 16 that would be required in a naïve multiexponentiation that uses \((a_1,a_2,a_3,a_4)\). More specifically, since \(b_1[i]\) (which is to be multiplied by P) is always non-zero, every element of the lookup table T must contain P, so we have \(T[u]:=P+[u_0]\phi (P)+[u_1]\psi (P)+[u_2]\psi (\phi (P))\), where \(u=(u_2,u_1,u_0)_2\) for \(u =0,\dots ,7\). We point out that the recoding must itself be implemented in constant-time; the implementer-friendly version shows that Algorithm 1 indeed lends itself to such a constant-time implementation. We further note that the outputs of the two versions are formatted differently: the left side outputs the multiscalar \((b_1,b_2,b_3,b_4)\), while the right side instead outputs the corresponding lookup table indices (the \(d_i\)) and the masks (the \(m_i\)) used to select the correct signs of the lookup elements. That is, \((m_{64}, \ldots , m_0)\) corresponds to the binary expansion of \(b_1\) and \((d_{64}, \ldots , d_0)\) corresponds to the binary expansion of \(b_2 + 2b_3+4b_4\).
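A recoding with the properties described above can be sketched as follows. This is a reader-oriented illustration based on the general sign-aligned recoding of [18], not a transcription of Algorithm 1; the function names are hypothetical, the code is not constant-time, and it assumes \(a_1\) odd with all \(a_j\) in \([0,2^{64})\).

```python
def recode(a, length=65):
    """Return digit lists (b1, b2, b3, b4), least significant digit first.

    Each b1[i] is in {-1, 1} with b1[length-1] = 1, and every other
    digit bj[i] is either 0 or equal to b1[i] (sign alignment).
    """
    a1 = a[0]
    assert a1 & 1, "first coordinate must be odd"
    b1 = [2 * ((a1 >> (i + 1)) & 1) - 1 for i in range(length - 1)] + [1]
    digits = [b1]
    for aj in a[1:]:
        bj = []
        for i in range(length):
            d = b1[i] * (aj & 1)      # digit is 0 or carries the sign of b1[i]
            bj.append(d)
            aj = (aj >> 1) - (d >> 1)  # d >> 1 is -1 when d = -1, else 0
        assert aj == 0, "input coordinate too large for this length"
        digits.append(bj)
    return digits

def signs_and_indices(digits):
    """Per-position sign b1[i] and table index |b2[i]| + 2|b3[i]| + 4|b4[i]|."""
    b1, b2, b3, b4 = digits
    return [(b1[i], abs(b2[i]) + 2 * abs(b3[i]) + 4 * abs(b4[i]))
            for i in range(len(b1))]
```

Summing \(b_j[i]\cdot 2^i\) over all positions recovers \(a_j\), and every table index lies in \(\{0,\dots ,7\}\), which is why an 8-element table suffices.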

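[Algorithm 1: recoding of the multiscalar \((a_1,a_2,a_3,a_4)\), in reader-oriented and implementer-oriented versions (figure not reproduced).]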

5.2 The Full Routine

We now present Algorithm 2: the full scalar multiplication routine. This is accompanied by Theorem 1, the proof of which (see [14]) gives more details on the steps summarized in Algorithm 2; in particular, it specifies the representations of all points in order to state the total number of \(\mathbb {F}_{p^2}\) operations. Algorithm 2 assumes that the input point P is in \(\mathcal {E}(\mathbb {F}_{p^2})[N]\), i.e., has been validated according to [14, Appendix A].

[Algorithm 2: the full Four\(\mathbb {Q}\) scalar multiplication routine (figure not reproduced).]

Theorem 1

For every point \(P \in \mathcal {E}(\mathbb {F}_{p^2})[N]\) and every non-negative integer m less than \(2^{256}\), Algorithm 2 computes [m]P correctly using a fixed sequence of field, integer and table-lookup operations.
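The structure of the main loop (an 8-element table, 65 recoded digits, one doubling and one signed table addition per iteration) can be exercised on a stand-in group. The sketch below is an assumption-laden toy, not Algorithm 2: it replaces \(\mathcal {E}(\mathbb {F}_{p^2})[N]\) by the additive group of integers modulo a toy prime, models \(\phi \) and \(\psi \) as multiplication by hypothetical eigenvalues, takes an already decomposed multiscalar as input, and makes no attempt at constant-time behavior.

```python
def recode(a, length=65):
    """Sign-aligned recoding (sketch after [18]); a[0] must be odd."""
    b1 = [2 * ((a[0] >> (i + 1)) & 1) - 1 for i in range(length - 1)] + [1]
    out = [b1]
    for aj in a[1:]:
        bj = []
        for i in range(length):
            d = b1[i] * (aj & 1)
            bj.append(d)
            aj = (aj >> 1) - (d >> 1)
        out.append(bj)
    return out

def multi_scalar_mul(P, a, lam_phi, lam_psi, N):
    """Compute [a1]P + [a2]phi(P) + [a3]psi(P) + [a4]psi(phi(P)) in Z/NZ."""
    images = [lam_phi * P % N, lam_psi * P % N, lam_phi * lam_psi * P % N]
    # T[u] = P + [u0]phi(P) + [u1]psi(P) + [u2]psi(phi(P)), u = (u2 u1 u0)_2.
    T = [(P + sum(images[k] for k in range(3) if (u >> k) & 1)) % N
         for u in range(8)]
    b1, b2, b3, b4 = recode(a)
    Q = 0  # neutral element of the toy group
    for i in reversed(range(65)):
        index = abs(b2[i]) + 2 * abs(b3[i]) + 4 * abs(b4[i])
        Q = (2 * Q + b1[i] * T[index]) % N  # double, then add +/- T[index]
    return Q
```

Because every table entry contains P and \(b_1[i]\) is never zero, each iteration performs a genuine addition, which mirrors the reason the recoding is used in the first place.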

6 Performance Analysis and Results

This section shows that, at the 128-bit security level, Four\(\mathbb {Q}\) is significantly faster than all other known curve-based primitives. We reiterate that our software runs in constant-time and is therefore fully protected against timing and cache attacks.

6.1 Operation Counts

We begin with a first-order comparison based on operation counts between Four\(\mathbb {Q}\) and two other efficient curve-based primitives that are defined over large prime characteristic fields and that target the 128-bit security level: the twisted Edwards GLV+GLS curve defined over \(\mathbb {F}_{p^2}\) with \(p=2^{127}-5997\) proposed in [37], and the genus 2 Kummer surface defined over \(\mathbb {F}_p\) with \(p=2^{127}-1\) that was proposed in [25]; we dub these “GLV+GLS” and “Kummer” below. Both of these curves have recently set speed records on a variety of platforms (see [18] and [6]). Table 2 summarizes the operation counts for one variable-base scalar multiplication on Four\(\mathbb {Q}\), GLV+GLS and Kummer. In the right-most column we approximate the cost in terms of prime field operations (using the standard assumption that 1 base field squaring is approximately 0.8 base field multiplications), where we round each tally to the nearest integer. For the GLV+GLS and Four\(\mathbb {Q}\) operation counts, we assume that one multiplication over \(\mathbb {F}_{p^2}\) involves 3 multiplications and 5 additions/subtractions over \(\mathbb {F}_p\) (when using Karatsuba) and one squaring over \(\mathbb {F}_{p^2}\) involves 2 multiplications and 3 additions/subtractions over \(\mathbb {F}_p\).
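The per-operation costs assumed above can be made concrete with a small sketch. It assumes \(\mathbb {F}_{p^2}=\mathbb {F}_p(i)\) with \(i^2=-1\) and \(p=2^{127}-1\), represents elements as pairs of Python integers, and uses hypothetical function names; the conversion helper further assumes that one addition in \(\mathbb {F}_{p^2}\) costs two additions in \(\mathbb {F}_p\). It illustrates the operation counts only and is not the library's field arithmetic.

```python
P127 = (1 << 127) - 1  # the Mersenne prime p = 2^127 - 1

def fp2_mul(a, b):
    """Karatsuba product in F_p(i): 3 F_p multiplications, 5 additions/subtractions."""
    a0, a1 = a
    b0, b1 = b
    t0 = a0 * b0 % P127
    t1 = a1 * b1 % P127
    t2 = (a0 + a1) * (b0 + b1) % P127
    return ((t0 - t1) % P127, (t2 - t0 - t1) % P127)

def fp2_sqr(a):
    """Squaring in F_p(i): 2 F_p multiplications, 3 additions/subtractions."""
    a0, a1 = a
    real = (a0 + a1) * (a0 - a1) % P127  # a0^2 - a1^2
    imag = 2 * a0 * a1 % P127            # doubling counted as one addition
    return (real, imag)

def to_fp_ops(M, S, A):
    """Convert an F_{p^2} tally (M, S, A) into (F_p multiplications, F_p additions)."""
    return (3 * M + 2 * S, 5 * M + 3 * S + 2 * A)
```

Under these assumptions, the endomorphism cost \(68 \mathbf{M}+27 \mathbf{S}+49.5\mathbf{A}\) quoted in Sect. 3 corresponds to 258 multiplications and 520 additions in \(\mathbb {F}_p\).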

Table 2. Operation counts for variable-base scalar multiplications on three different curves targeting the 128-bit security level. In the case of the Kummer surface, we additionally use a “word-mul” column to count the number of special multiplications of a general element in \(\mathbb {F}_p\) by a small (i.e., one-word) constant – see [6].

Table 2 shows that the GLV+GLS routine from [37] requires slightly fewer operations than Four\({\mathbb {Q}}\). This can mainly be explained by the faster endomorphisms, but (as we will see in Table 3) this difference is more than made up for by the faster modular arithmetic and superior simplicity of Four\(\mathbb {Q}\). Table 2 shows that Four\(\mathbb {Q}\) requires far fewer operations (in the same ground field) than Kummer; it is therefore expected, in general, that implementations based on Four\(\mathbb {Q}\) outperform Kummer implementations for computing variable-base scalar multiplications.

6.2 Experimental Results

To evaluate performance, we wrote a standalone library supporting Four\(\mathbb {Q}\) – see [15]. The library’s design pursues modularity and code reuse, and leverages the simplicity of Four\(\mathbb {Q}\)’s arithmetic. It also facilitates the addition of specialized code for different platforms and applications: the core functionality of the library is fully written in portable C and works together with pluggable implementations of the arithmetic over \(\mathbb {F}_{p^2}\) (and a few other complementary functions). The first release version of the library comes with two of those pluggable modules: a portable implementation written in C and a high-performance implementation for x64 platforms written in C and optional x64 assembly. The library computes all of the basic elliptic curve operations including variable-base and fixed-base scalar multiplications, making it suitable for a wide range of cryptographic protocols. In addition, the software permits the selection (at build time) of whether or not the endomorphisms \(\psi \) and \(\phi \) are to be exploited in variable-base scalar multiplications.

In Table 3, we compare Four\(\mathbb {Q}\)’s performance with other state-of-the-art implementations documented in the literature. Our benchmarks cover a wide range of x64 processors, from high-end architectures (e.g., Intel’s Haswell) to low-end architectures (e.g., Intel’s Atom). To cast the performance numbers in the context of a real-world protocol, we choose to illustrate Four\(\mathbb {Q}\)’s performance in one round of an ephemeral Diffie-Hellman (DH) key exchange. This means that both parties can generate their public keys using a fixed-base scalar multiplication and generate the shared secret using a variable-base scalar multiplication. Exploiting such precomputations to generate truly ephemeral public keys agrees with the comments made by Bernstein and Lange in [8, Sect. 1], e.g., that “forward secrecy is at its strongest when a key is discarded immediately after its use”. Thus, Table 3 shows the execution time (in terms of clock cycles) for both variable-base and fixed-base scalar multiplications. We note that the laddered implementations in [4, 6, 9] only compute variable-base scalar multiplications, which is why we use the cost of two variable-base scalar multiplications to approximate the cost of ephemeral DH in those cases. For the Four\(\mathbb {Q}\) and GLV+GLS implementations, precomputations for the fixed-base scalar multiplications occupied 7.5KB and 6KB of storage, respectively.

Table 3. Performance results (expressed in terms of thousands of clock cycles) of state-of-the-art implementations of various curves targeting the 128-bit security level on various x64 platforms. Benchmark tests were taken with Intel’s TurboBoost and AMD’s TurboCore disabled and the results were rounded to the nearest 1000 clock cycles. The benchmarks for the Four\(\mathbb {Q}\) and GLV+GLS implementations were done on 1.66 GHz Intel Atom N570 Pineview, 3.4 GHz Intel Core i7-2600 Sandy Bridge, 3.4 GHz Intel Core i7-3770 Ivy Bridge, 3.4 GHz Intel Core i7-4770 Haswell and 3.1GHz AMD A8 PRO-7600B Kaveri. For the Kummer implementations [6, 9] and Curve25519 implementation [3], Pineview, Sandy Bridge, Ivy Bridge and Haswell benchmarks were taken from eBACS [7] (machines h2atom, h6sandy, h9ivy and titan0), while AMD benchmarks were obtained by running eBACS’ SUPERCOP toolkit on the corresponding targeted machine. The benchmarks for curve NIST P-256 were taken directly from [26] and the second set of Curve25519 benchmarks were taken directly from [12].

Table 3 shows that, in comparison with the “conservative” curves, Four\(\mathbb {Q}\) is 2.1–2.7 times faster than the Curve25519 implementations in [3, 12] and up to 5.4 times faster than the curve P-256 implementation in [26], when computing variable-base scalar multiplications. When considering the results for the DH key exchange, Four\(\mathbb {Q}\) performs 1.8–3.5 times faster than Curve25519 and up to 4.2 times faster than curve P-256.

In terms of comparisons to the previously fastest implementations, variable-base scalar multiplications using our software are between 1.20 and 1.34 times faster than the Kummer [6, 9] and the GLV+GLS [18] implementations on AMD’s Kaveri and Intel’s Atom Pineview, Sandy Bridge and Ivy Bridge. The Kummer implementation for Haswell in [6] is particularly fast because it takes advantage of the powerful AVX2 vector instructions. Nevertheless, our implementation (which does not currently exploit vector instructions to accelerate the field arithmetic) is still faster in the case of variable-base scalar multiplication. Moreover, in practice we expect a much larger advantage. For example, in the case of the DH key exchange, we leverage the efficiency of fixed-base scalar multiplications to achieve a 1.33x speedup over the Kummer implementation on Haswell. For the remaining platforms considered in Table 3, a DH shared secret using the Four\(\mathbb {Q}\) software can be computed 1.5–1.8 times faster than a DH secret using the Kummer software in [6]. We note that the eBACS website [7] and [6] report different results for the same Kummer software on the same platform (i.e., Titan0): eBACS reports 60,556 Haswell cycles whereas [6] claims 54,389 Haswell cycles. This difference in performance raises questions regarding accuracy. The results that we obtained after running the eBACS’ SUPERCOP toolkit on our own targeted Haswell machine seem to confirm that the results claimed in [6] for the Kummer were measured with TurboBoost enabled.

Four\(\mathbb {Q}\) without endomorphisms. Our library can be built with a version of the variable-base scalar multiplication function that does not exploit the endomorphisms \(\psi \) and \(\phi \) to accelerate computations (note that fixed-base scalar multiplications do not exploit these endomorphisms by default). In this case, Four\(\mathbb {Q}\) computes one variable-base scalar multiplication in (respectively) 109, 131, 138 and 803 thousand cycles on the Haswell, Ivy Bridge, Sandy Bridge and Atom Pineview processors used for our experiments. These results are up to 2.9 times faster than the corresponding results for NIST P-256 and up to 1.5 times faster than the corresponding results for Curve25519.