1 Introduction

The Gallant–Lambert–Vanstone (GLV) method is a generic approach to speed up the computation of scalar multiplication on some elliptic curves defined over fields of large prime characteristic. Given a curve with a point P of prime order n, it consists essentially of an algorithm that finds a decomposition of an arbitrary scalar multiplication kP, for k∈[1,n], into two scalar multiplications, where the new scalars have only about half the bitlength of the original scalar. This immediately enables the elimination of half the doublings by employing the Straus–Shamir trick for simultaneous point multiplication, as sketched below.
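To make the trick concrete, the following minimal sketch (ours, not from the paper) interleaves the two scalar multiplications with a single shared chain of doublings; the group operations are abstracted, and a toy additive group modulo n stands in for the curve group so that the fragment is runnable.

```python
def straus_shamir(k1, k2, P, Q, add, dbl, zero):
    """Compute k1*P + k2*Q scanning both scalars simultaneously:
    one doubling per bit position, shared by both scalars."""
    PQ = add(P, Q)                      # precompute P + Q once
    R = zero
    for i in range(max(k1.bit_length(), k2.bit_length()) - 1, -1, -1):
        R = dbl(R)
        b1, b2 = (k1 >> i) & 1, (k2 >> i) & 1
        if b1 and b2:
            R = add(R, PQ)
        elif b1:
            R = add(R, P)
        elif b2:
            R = add(R, Q)
    return R

# Toy check in the additive group Z/1009 (a stand-in for <P> on the curve):
n = 1009
add, dbl = (lambda a, b: (a + b) % n), (lambda a: 2 * a % n)
P, Q = 5, 7            # stand-ins for P and Phi(P)
k1, k2 = 123, 456      # half-length scalars produced by the decomposition
assert straus_shamir(k1, k2, P, Q, add, dbl, 0) == (k1 * P + k2 * Q) % n
```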

Whereas the original GLV method as defined in [13] works on curves over \(\mathbb{F}_{p}\) with an endomorphism of small degree (GLV curves), Galbraith–Lin–Scott (GLS) in [11] have shown that over \(\mathbb{F}_{p^{2}}\) one can expect to find many more such curves, essentially by exploiting the action of the Frobenius endomorphism. One can therefore expect that, on GLV curves in particular, this new insight will lead to improvements over \(\mathbb{F}_{p^{2}}\). Indeed, the GLS article itself considers four-dimensional decompositions on GLV curves with nontrivial automorphisms (corresponding to the degree one cases) but leaves the other cases open to investigation.

In this work, we generalize the GLS method to all GLV curves by exploiting fast endomorphisms Φ,Ψ over \(\mathbb{F}_{p^{2}}\) acting on a cyclic group generated by a point P of prime order n to construct a proven decomposition, with no heuristics involved, for any scalar k∈[1,n]

$$kP=k_1P+ k_2\varPhi(P)+ k_3\varPsi(P) + k_4\varPsi\varPhi(P)\quad \text{with}\ \max_i \bigl(|k_i| \bigr)< C n^{1/4} $$

for some explicitly computable C. In doing this we provide a reduction algorithm for the relevant four-dimensional lattice which runs in \(O(\log^2 n)\) by implementing two Cornacchia-type algorithms [9, 25], one in ℤ, the other in ℤ[i]. The algorithm is remarkably simple to implement and allows us to demonstrate an improved \(C=O(\sqrt{s})\) (compared to the value obtained with LLL, which is only \(\varOmega(s^{3/2})\)). Thus, it guarantees a relative speedup practically independent of the curve when moving from a two-dimensional to a four-dimensional GLV method over the same underlying field. If parallel computation is available, then the computation of kP can possibly be implemented (close to) four times faster in this case. When moving from two-dimensional GLV over \(\mathbb {F}_{p}\) to the four-dimensional case over \(\mathbb{F}_{p^{2}}\), our method still guarantees a relative speedup that is quasi-uniform among all GLV curves (see Sect. 8 for details). In fact, we present experimental results on different GLV curves that demonstrate that the relative speedup between the original GLV method and the proposed method (termed GLV–GLS in the remainder) is as high as 1.5 times.

Twisted Edwards curves [2] are efficient generalizations of the popular Edwards curves [10], which exhibit high-performance arithmetic. By exploiting this curve model, Galbraith, Lin, and Scott [12] showed that the GLS method can be improved in practice by a further 10 %, approximately. Similar findings were later reported by Longa and Gebotys [23] (see also Longa [22]). Galbraith et al. also described how to write down j-invariant 0 and 1728 curves in Edwards form to combine a four-dimensional decomposition with the fast arithmetic provided by this curve model. We exploit this approach and, most remarkably, lift the restriction to those special curves, showing that in practice the GLV–GLS curves discussed in this work may achieve extremely high performance and become virtually equivalent in terms of speed when written in Twisted Edwards form.

In recent years, multiple works have incrementally shown the impact of using the GLS method for high performance [11, 16, 23]. However, it is still unclear how well the method behaves in settings where side-channel attacks are a threat. Since it is usually assumed that the required countermeasures, once in place, degrade performance significantly, it is also unclear whether the GLS method would retain its current superiority in the case of side-channel protected implementations. Here, we study this open problem and describe how to protect implementations based on the GLV–GLS method against timing attacks, cache attacks, and similar attacks while still achieving very high performance. The techniques discussed naturally apply to GLV-based implementations in general. Finally, we discuss different strategies to implement GLV-based scalar multiplication on modern multicore processors, and include the case in which countermeasures against side-channel attacks are required.

The presented implementations corresponding to the GLV–GLS method improve the state-of-the-art performance of point multiplication for all the cases under study: protected and unprotected versions with sequential and parallel execution. For instance, on one core of an Intel Core i7-2600 processor and at roughly 128 bits of security, we compute an unprotected scalar multiplication in only 91,000 cycles (which is 1.34 times faster than a previous result reported by Hu, Longa, and Xu [16]) and a side-channel protected scalar multiplication in only 137,000 cycles (which is 1.42 times faster than the protected implementation presented by Bernstein et al. [3]).

Related Work

Recently, a paper by Zhou, Hu, Xu, and Song [32] has shown that it is possible to combine the GLV and GLS approaches by introducing a three-dimensional version of the GLV method, which seems to work to a certain degree, albeit with no justification other than practical implementations. The first author, together with Hu and Xu [16], studied the case of curves with j-invariant 0 and provided a bound for this particular case. Our analysis supplements [16] by considering all GLV curves and providing a unified treatment.

2 The GLV Method

In this section we briefly summarize the GLV method following [29]. Let E be an elliptic curve defined over a finite field \(\mathbb{F}_{q}\), and P be a point on this curve with prime order n such that the cofactor \(h=\#E(\mathbb{F}_{q})/n\) is small, say h≤4. Let us consider a nontrivial endomorphism Φ defined over \(\mathbb{F}_{q}\) and its characteristic polynomial \(X^2+rX+s\). In all the examples, r and s are actually small fixed integers, and q is varying in some family. By hypothesis there is only one subgroup of order n in \(E(\mathbb {F}_{q})\), implying that Φ(P)=λP for some λ∈[0,n−1], since Φ(P) has order dividing the prime n. In particular, λ is obtained as a root of \(X^2+rX+s\) modulo n.

Define the group homomorphism (the GLV reduction map)

$$\mathfrak{f}\colon \mathbb{Z}\times \mathbb{Z}\longrightarrow \mathbb{Z}/n, \qquad (x,y)\longmapsto x+\lambda y \ (\mathrm{mod}\ n). $$
Let \(\mathcal {K}=\ker \mathfrak {f}\). It is a sublattice of ℤ×ℤ of rank 2 since the quotient is finite. Let \(\Bbbk>0\) be a constant (depending on the curve) such that we can find two linearly independent vectors \(v_1, v_2\) of \(\mathcal {K}\) satisfying \(\max\{\vert v_{1}\vert _{\infty}, \vert v_{2}\vert _{\infty}\}< \Bbbk\sqrt{n}\), where \(\vert \cdot \vert _{\infty}\) denotes the rectangle norm. Express

$$(k,0)= \beta_1v_1 + \beta_2v_2, $$

where \(\beta_i \in \mathbb{Q}\). Then round \(\beta_i\) to the nearest integer \(b_i=\lfloor \beta_i \rceil = \lfloor \beta_i +1/2 \rfloor\) and let \(v=b_1 v_1+b_2 v_2\). Note that \(v\in \mathcal {K}\) and that \(u\overset{\text{def}}{=} (k,0)-v\) is short. Indeed, by the triangle inequality we have that

$$\vert \vphantom{v_1}u\vert _\infty\leq \frac{\vert v_1\vert _\infty + \vert v_2 \vert _\infty}{2} <\Bbbk\sqrt{n}. $$

If we set \((k_1,k_2)=u\), then we get \(k\equiv k_1+k_2 \lambda \ (\mathrm{mod}\ n)\), or equivalently \(kP=k_1 P+k_2 \varPhi(P)\), with \(\max (|k_{1}|,|k_{2}|)<\Bbbk\sqrt{n}\).
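As a concrete illustration (our sketch, with hypothetical toy parameters rather than a real curve), the following fragment carries out this decomposition with exact rational arithmetic:

```python
from fractions import Fraction
from math import floor

def glv_decompose(k, n, v1, v2):
    """Babai rounding: write (k, 0) = beta1*v1 + beta2*v2 over Q, round the
    betas, and return u = (k, 0) - b1*v1 - b2*v2 = (k1, k2)."""
    det = v1[0] * v2[1] - v1[1] * v2[0]    # = +-[Z^2 : K] = +-n for a basis of K
    beta1 = Fraction(k * v2[1], det)       # Cramer's rule for (k, 0)
    beta2 = Fraction(-k * v1[1], det)
    b1 = floor(beta1 + Fraction(1, 2))
    b2 = floor(beta2 + Fraction(1, 2))
    return k - b1 * v1[0] - b2 * v2[0], -b1 * v1[1] - b2 * v2[1]

# Toy parameters: n = 1009 = 15^2 + 28^2, and lam = 540 satisfies
# lam^2 = -1 (mod n); (15, 28) and (-28, 15) form a short basis of ker(f).
n, lam = 1009, 540
v1, v2 = (15, 28), (-28, 15)
assert all((v[0] + v[1] * lam) % n == 0 for v in (v1, v2))  # both in ker(f)
k = 777
k1, k2 = glv_decompose(k, n, v1, v2)
assert (k1 + k2 * lam - k) % n == 0 and max(abs(k1), abs(k2)) ** 2 < n
```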

In [29], the optimal value of \(\Bbbk\) (with respect to large values of n, i.e., large fields, keeping \(X^2+rX+s\) constant) is determined. Let \(\varDelta=r^2-4s\) be the discriminant of the characteristic polynomial of Φ. Then the optimal \(\Bbbk\) is given by the following result.

Theorem 1

[29, Theorem 4]

Assuming that n is the norm of an element of ℤ[Φ], the optimal value of \(\Bbbk\) is

$$\Bbbk= \begin{cases} \frac{\sqrt{s}}{2} (1+\frac{1}{|\varDelta |} ) &\text{\textit{if} $r$ \textit{is odd,}}\\[4pt] \frac{\sqrt{s}}{2} \sqrt{1+\frac{4}{|\varDelta |}} &\text{\textit{if} $r$ \textit{is even.}} \end{cases} $$

3 The GLS Improvement

In 2009, Galbraith, Lin, and Scott [11] realized that we do not need to have \(\varPhi^2+r\varPhi+s=0\) in \(\operatorname{End}(E)\) but only in a subgroup of \(E(\mathbb{F})\) for a specific finite field \(\mathbb{F}\). In particular, considering \(\varPsi=\operatorname{Frob}_{p}\), the p-power Frobenius endomorphism of a curve E defined over \(\mathbb{F}_{p}\), we know that \(\varPsi^m(P)=P\) for all \(P\in E(\mathbb{F}_{p^{m}})\). While this tells nothing useful if m=1,2, it does offer new nontrivial relations for higher-degree extensions. The case m=4 is particularly useful here.

In this case, if \(P\in E(\mathbb{F}_{p^{4}}) \backslash E(\mathbb {F}_{p^{2}})\), then \(\varPsi^2(P)=-P\), and hence on the subgroup generated by P, Ψ satisfies the equation \(X^2+1=0\). This implies that if Ψ(P) is a multiple of P (which happens as soon as the order n of P is sufficiently large, say at least 2p), we can apply the previous GLV construction and split again a scalar multiplication as \(kP=k_1 P+k_2 \varPsi(P)\) with \(\max(|k_{1}|,|k_{2}|) = O(\sqrt{n})\). Contrast this with the characteristic polynomial of Ψ, which is \(X^2-a_p X+p\) for some integer \(a_p\), a nonconstant polynomial to which we cannot apply the GLV paradigm efficiently.

For efficiency reasons, however, one does not work with \(E/\mathbb{F}_{p^{4}}\) directly but with \(E'/\mathbb{F}_{p^{2}}\), isomorphic to E over \(\mathbb{F}_{p^{4}}\) but not over \(\mathbb{F}_{p^{2}}\), that is, a quadratic twist over \(\mathbb{F}_{p^{2}}\). In this case, it is possible for \(\#E'(\mathbb{F}_{p^{2}})=n\geq(p-1)^{2}\) to be prime. Furthermore, if ψ:E′→E is an isomorphism defined over \(\mathbb{F}_{p^{4}}\), then the endomorphism \(\varPsi= \psi \operatorname{Frob}_{p} \psi^{-1} \in \operatorname{End}(E')\) satisfies the equation \(X^2+1=0\), and if \(p\equiv 5 \ (\mathrm{mod}\ 8)\), it can be defined over \(\mathbb{F}_{p}\).

This idea is at the heart of the GLS approach, but it only works for curves over \(\mathbb{F}_{p^{m}}\) with m>1, and therefore it does not generalize the original GLV method but rather complements it.

4 Combining GLV and GLS

Let \(E/\mathbb{F}_{p}\) be a GLV curve. As in Sect. 3, we will denote by \(E'/\mathbb{F}_{p^{2}}\) a quadratic twist \(\mathbb {F}_{p^{4}}\)-isomorphic to E via the isomorphism ψ:E→E′. We also suppose that \(\# E'(\mathbb{F}_{p^{2}}) = nh\), where n is prime and h≤4. We then have the two endomorphisms of E′, \(\varPsi= \psi \operatorname{Frob}_{p} \psi^{-1}\) and \(\varPhi=\psi \phi \psi^{-1}\), with ϕ the GLV endomorphism coming with the definition of a GLV curve. They are both defined over \(\mathbb {F}_{p^{2}}\), since if σ is the nontrivial Galois automorphism of \(\mathbb {F}_{p^{4}}/ \mathbb {F}_{p^{2}}\), then \(\psi^{\sigma}=-\psi\), so that \(\varPsi^{\sigma}= \psi^{\sigma} \operatorname{Frob}_{p}^{\sigma}(\psi^{-1} )^{\sigma}= (-\psi)\operatorname{Frob}_{p}(-\psi^{-1}) = \varPsi\), meaning that \(\varPsi\in \operatorname{End}_{ \mathbb {F}_{p^{2}}}(E')\). Similarly for Φ, where we are using the fact that \(\phi\in \operatorname{End}_{ \mathbb {F}_{p}}(E)\). Notice that \(\varPsi^2+1=0\) and that Φ has the same characteristic polynomial as ϕ. Furthermore, since we have a large subgroup \(\langle P \rangle\subset E'(\mathbb{F}_{p^{2}})\) of prime order, Φ(P)=λP and Ψ(P)=μP for some λ,μ∈[1,n−1]. We will assume that Φ and Ψ, when viewed as algebraic integers, generate disjoint quadratic extensions of ℚ. In particular, we are not dealing with Example 1 from Appendix A, but this case can be treated separately with a quartic twist as described in Appendix B.

Consider the biquadratic (Galois of degree 4, with Galois group ℤ/2×ℤ/2) number field K=ℚ(Φ,Ψ). Let \(\mathfrak{o}_{K}\) be its ring of integers. The following analysis is inspired by [29, Sect. 8].

We have \(\mathbb{Z}[\varPhi, \varPsi] \subseteq\mathfrak{o}_{K}\). Since the degrees of Φ and Ψ are much smaller than n, the prime n is unramified in K, and the existence of λ and μ above means that n splits in ℚ(Φ) and ℚ(Ψ), namely that n splits completely in K. There exists therefore a prime ideal \(\mathfrak{n}\) of \(\mathfrak{o}_{K}\) dividing \(n\mathfrak {o}_{K}\), such that its norm is n. We can also suppose that \(\varPhi \equiv \lambda\pmod{\mathfrak{n}}\) and \(\varPsi\equiv\mu\pmod{\mathfrak{n}}\). The four-dimensional GLV–GLS method works as follows.

Consider the GLV–GLS reduction map F defined by

$$F\colon \mathbb{Z}^4\longrightarrow \mathbb{Z}/n, \qquad (x_1,x_2,x_3,x_4)\longmapsto x_1+x_2\lambda+x_3\mu +x_4\lambda\mu \ (\mathrm{mod}\ n). $$
If we can find four linearly independent vectors \(v_1,\dots,v_4\in\ker F\) with \(\max_i \vert v_i\vert_{\infty}\leq C n^{1/4}\) for some constant C>0, then for any k∈[1,n−1], we write

$$(k,0,0,0) = \sum_{j=1}^4 \beta_j v_j $$

with \(\beta_j \in \mathbb{Q}\). As in the GLV method, one performs a Babai rounding to obtain the closest lattice vector \(v= \sum_{j=1}^{4} \lfloor \beta_{j} \rceil v_{j}\) and defines

$$u = (k,0,0,0)-v = (k_1, k_2, k_3, k_4) . $$

We then get

$$ kP=k_1P+ k_2\varPhi(P)+ k_3 \varPsi(P) + k_4\varPsi\varPhi(P)\quad \text{with } \max_i \bigl(|k_i|\bigr)\leq2C n^{1/4} . $$
(1)

We next focus on the study of \(\ker F\) in order to find a reduced basis \(v_1,v_2,v_3,v_4\) with an explicit C. We can factor the GLV–GLS map F as

$$\mathbb{Z}^4 \overset{f}{\longrightarrow} \mathbb{Z}[\varPhi,\varPsi] \longrightarrow \mathbb{Z}/n, \qquad f(x_1,x_2,x_3,x_4)= x_1+x_2\varPhi+x_3\varPsi+x_4\varPhi\varPsi. $$
Notice that the kernel of the second map (reduction mod \(\mathfrak{n}\cap \mathbb {Z}[\varPhi,\varPsi]\)) is exactly \(\mathfrak{n}\cap \mathbb {Z}[\varPhi,\varPsi]\). This can be seen as follows. The reduction map factors as

$$\mathbb{Z}[\varPhi,\varPsi] \longrightarrow\mathfrak{o}_K \longrightarrow \mathfrak{o}_K / \mathfrak{n} \cong\mathbb{Z}/n, $$

where the first arrow is inclusion, and the second is reduction mod \(\mathfrak{n}\), corresponding to reducing the \(x_i\)'s mod \(\mathfrak {n}\cap\mathbb{Z}= n\mathbb{Z}\) and using \(\varPhi\equiv\lambda, \varPsi \equiv\mu\pmod{\mathfrak{n}}\). But the kernel of this map consists precisely of the elements of ℤ[Φ,Ψ] which are in \(\mathfrak {n}\), and that is what we want.

Moreover, since the reduction map is surjective, we obtain an isomorphism \(\mathbb {Z}[\varPhi,\varPsi]/\mathfrak{n}\cap \mathbb {Z}[\varPhi ,\varPsi] \cong\mathbb {Z}/n\), which says that the index of \(\mathfrak{n}\cap \mathbb {Z}[\varPhi ,\varPsi]\) inside ℤ[Φ,Ψ] is n. Since the first map f is an isomorphism, we get that \(\ker F = f^{-1} (\mathfrak{n}\cap \mathbb {Z}[\varPhi,\varPsi])\) and that \(\ker F\) has index \([\mathbb{Z}^4 : \ker F]=n\) inside \(\mathbb{Z}^4\).

We can also produce a basis of \(\ker F\) by the following observation. Let \(\varPhi'=\varPhi-\lambda\), \(\varPsi'=\varPsi-\mu\), and hence \(\varPhi'\varPsi'=\varPhi\varPsi-\lambda\varPsi-\mu\varPhi+\lambda\mu\). In matrix form,

$$\begin{pmatrix} 1 \\ \varPhi' \\ \varPsi' \\ \varPhi'\varPsi' \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -\lambda & 1 & 0 & 0 \\ -\mu & 0 & 1 & 0 \\ \lambda\mu & -\mu & -\lambda & 1 \end{pmatrix} \begin{pmatrix} 1 \\ \varPhi \\ \varPsi \\ \varPhi\varPsi \end{pmatrix}. $$
Since the determinant of the square matrix is 1, we deduce that ℤ[Φ,Ψ]=ℤ[Φ′,Ψ′]. But in this new basis, we claim that

$$\mathfrak{n}\cap \mathbb {Z}\bigl[\varPhi',\varPsi'\bigr] = n \mathbb{Z} + \mathbb {Z}\varPhi' + \mathbb{Z}\varPsi' + \mathbb{Z}\varPhi'\varPsi' . $$

Indeed, the reverse inclusion (⊇) is easy: \(\varPhi',\varPsi', \varPhi'\varPsi' \in\mathfrak{n}\), and so is n, because \(\mathfrak{n}\) dividing \(n\mathfrak{o}_{K}\) is equivalent to \(\mathfrak{n} \supseteq n\mathfrak {o}_{K}\). On the other hand, the index of both sides in ℤ[Φ′,Ψ′] is n, which can only happen, once one inclusion is proved, if the two sides are equal. Using the isomorphism f, we see that a basis of \(\ker F\subset\mathbb{Z}^4\) is therefore given by

$$w_1= (n,0,0,0), w_2= (-\lambda, 1 ,0,0), w_3 = (-\mu, 0, 1, 0), w_4 = ( \lambda\mu, -\mu, - \lambda, 1) . $$
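A small sketch (ours, with hypothetical toy parameters) that builds this basis and checks that each \(w_j\) is killed by \(F(x)=x_1+x_2\lambda+x_3\mu+x_4\lambda\mu \bmod n\):

```python
def kernel_basis(n, lam, mu):
    """The basis w1, ..., w4 of ker F read off above."""
    return [(n, 0, 0, 0),
            (-lam, 1, 0, 0),
            (-mu, 0, 1, 0),
            (lam * mu, -mu, -lam, 1)]

def F(x, n, lam, mu):
    return (x[0] + x[1] * lam + x[2] * mu + x[3] * lam * mu) % n

# Hypothetical toy parameters; any lam, mu in [1, n-1] make the check pass,
# since membership in ker F is an algebraic identity in lam and mu.
n, lam, mu = 1009, 123, 540
assert all(F(w, n, lam, mu) == 0 for w in kernel_basis(n, lam, mu))
# The matrix of the w_j is triangular with diagonal (n, 1, 1, 1), so the
# sublattice they span indeed has index n in Z^4, as required.
```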

The LLL algorithm [20] then finds, for a given basis \(w_1,\dots,w_4\) of \(\ker F\), a reduced basis \(v_1,\dots,v_4\) in polynomial time (in the logarithm of the norm of the \(w_i\)'s) such that (cf. [8, Theorem 2.6.2, p. 85])

$$ \prod_{i=1}^4 |v_i|_\infty \leq8 \bigl[\mathbb{Z}^4 \colon \ker F\bigr] = 8n. $$
(2)

Lemma 1

Let Φ and Ψ be as defined at the beginning of this section, and let

$$\mathcal{N}(x_1,x_2,x_3,x_4)= \sum_{\substack{i_1,i_2,i_3,i_4\\ i_1+i_2+i_3+i_4=4}} b_{i_1,i_2,i_3,i_4}\, x_1^{i_1} x_2^{i_2} x_3^{i_3} x_4^{i_4} $$

be the norm of an element \(x_1+x_2 \varPhi+x_3 \varPsi+x_4 \varPhi\varPsi\in\mathbb{Z}[\varPhi,\varPsi]\), where the \(b_{i_{1},i_{2},i_{3},i_{4}}\)'s lie in ℤ. Then, for any nonzero \(v\in\ker F\), one has

$$ |v|_\infty\geq \frac{n^{1/4}}{ (\sum_{\substack{i_1,i_2,i_3,i_4\\ i_1+i_2+i_3+i_4=4}} |b_{i_1,i_2,i_3,i_4}| )^{1/4}} . $$
(3)

Proof

For \(v\in\ker F\), we have \(\mathcal{N}(v)\equiv 0 \ (\mathrm{mod}\ n)\), and if v≠0, we must therefore have \(|\mathcal{N}(v)|\geq n\). On the other hand, if we did not have (3), then every component of v would be strictly less than the right-hand side, and plugging this upper bound into the definition of \(\mathcal{N}\) would yield a quantity <n, a contradiction. □

Let B be the denominator of the right-hand side of (3). Then (2) and (3) imply that

$$ |v_i|_\infty\leq8B^{3} n^{1/4}, \quad i=1,2,3,4 . $$
(4)

Remark 1

In our case, where \(\varPsi^2+1=0\) and \(\varPhi^2+r\varPhi+s=0\), we get as norm function

$$\mathcal{N}(x_1,x_2,x_3,x_4)= \bigl(x_1^2-x_3^2-r(x_1x_2-x_3x_4)+s\bigl(x_2^2-x_4^2\bigr)\bigr)^2 + \bigl(2x_1x_3-r(x_1x_4+x_2x_3)+2sx_2x_4\bigr)^2, $$

and therefore,

$$ B= \bigl(4+4s^2 + 8s + 8|r| + 8 |r| s + 2 \bigl(r^2+2s\bigr) + 2 |r^2-2s| \bigr)^{1/4} . $$
(5)

From (1) and (4) we have proved the following theorem.

Theorem 2

Let \(E/\mathbb{F}_{p}\) be a GLV curve, and \(E'/\mathbb{F}_{p^{2}}\) a twist, together with the two efficient endomorphisms Φ and Ψ, where everything is defined as at the start of this section. Suppose that the minimal polynomial of Φ is \(X^2+rX+s\). Let \(P\in E'(\mathbb{F}_{p^{2}})\) be a generator of the large subgroup of prime order n. There exists an efficient algorithm which for any k∈[1,n] finds integers \(k_1,k_2,k_3,k_4\) such that

$$kP=k_1P+ k_2\varPhi(P)+ k_3\varPsi(P) + k_4\varPsi\varPhi(P)\quad \text{\textit{with} } \max_i \bigl(|k_i|\bigr)\leq16 B^3 n^{1/4} $$

and

$$B= \bigl(4+4s^2 + 8s + 8|r| + 8 |r| s + 2 \bigl(r^2+2s \bigr) + 2 \bigl|r^2-2s\bigr| \bigr)^{1/4} . $$
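As a quick numerical check (ours), for the family with \(\Phi^2+2=0\) (\(r=0\), \(s=2\), the case of \(E'_3\) in Sect. 8) the formula gives \(B=52^{1/4}\approx 2.69\), so the constant \(16B^3\) is about 310:

```python
def theorem2_constant(r, s):
    """The constant 16*B^3 of Theorem 2 for Phi^2 + r*Phi + s = 0, Psi^2 + 1 = 0."""
    B4 = (4 + 4 * s**2 + 8 * s + 8 * abs(r) + 8 * abs(r) * s
          + 2 * (r**2 + 2 * s) + 2 * abs(r**2 - 2 * s))
    return 16 * B4 ** 0.75

print(theorem2_constant(0, 2))   # ~309.8 for Phi^2 + 2 = 0 (curve E'_3)
print(theorem2_constant(1, 1))   # ~254.4 for Phi^2 + Phi + 1 = 0 (j = 0 curves)
```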

5 Uniform Improvements and a Tale of Two Cornacchia Algorithms

The previous analysis is only the first step of our work. It shows that the GLV–GLS method works as predicted in a four-way decomposition on twists of GLV curves over \(\mathbb{F}_{p^{2}}\). However, the constant \(B^3\) involved is rather large and, hence, does not guarantee a non-negligible gain when switching from two to four dimensions (especially on those GLV curves with more complicated endomorphism rings). A much deeper argument allows us to prove the following result.

Theorem 3

When performing an optimal lattice reduction on \(\ker F\), it is possible to decompose any k∈[1,n] into integers \(k_1,k_2,k_3,k_4\) such that

$$kP=k_1P+ k_2\varPhi(P)+ k_3\varPsi(P) + k_4\varPsi\varPhi(P) $$

with \(\max_{i} (|k_{i}|) < 103 (\sqrt{1+|r|+s}) \, n^{1/4}\).

The significance of this theorem lies in the improvement of the constant \(16B^3\), which is \(\varOmega(s^{3/2})\) in Theorem 2, to a value that is only an absolute constant times the minimal bound for the two-dimensional GLV method (Theorem 1). Hence, this guarantees in practice a more uniform improvement when switching from two-dimensional to four-dimensional GLV, independently of the curve.

To prove Theorem 3, first note that Lemma 1 gives a rather poor bound when applied to more than one vector, as is done three times for the proof of Theorem 2. A more direct treatment of the reduced vectors of kerF becomes necessary, and this is done via a modification of the original GLV approach. This results in a new, easy-to-implement lattice reduction algorithm which employs two Cornacchia-type algorithms [8, Sect. 1.5.2], one in ℤ (as in the original GLV method), the other one in ℤ[i] (Gaussian Cornacchia).

The full proof of Theorem 3 via the new lattice reduction algorithm can be found in Appendix D.

5.1 The Euclidean Algorithm in ℤ

The first step is to find \(\nu=a+ib\in\mathbb{Z}[i]\) such that \(|\nu|^2=a^2+b^2=n\), i.e., a Gaussian prime above n. Recall that n splits in ℤ[i]. Let ν=a+ib be a prime above n. We can furthermore assume that \(\nu P = aP+b(iP) := aP+b\varPsi(P)=0\): since \(\nu\bar{\nu}P = nP=0\), either \(\bar{\nu}P\) is a nonzero multiple of P, and therefore νP=0, or else \(\bar{\nu}P=0\), so that in any case one of the Gaussian primes (WLOG ν) above n satisfies νP=0. We can find ν by Cornacchia's algorithm [8, Sect. 1.5.2], which is a truncated form of the Euclidean algorithm. For completeness and consistency with what will follow, we recall how this is done.

Let μ∈[1,n] be such that \(\mu^2\equiv -1 \ (\mathrm{mod}\ n)\), with μ defined by Ψ(P)=μP (so that \(\mu\equiv i \pmod{\nu}\)). Actually, in the GLS approach [11], it has been pointed out that this value of μ can be readily computed from \(\#E( \mathbb {F}_{p})\). The extended Euclidean algorithm to compute the gcd of n and μ produces three terminating sequences of integers \((r_j)_{j\geq0}\), \((s_j)_{j\geq0}\), and \((t_j)_{j\geq0}\) such that

$$ r_j = q_{j+1} r_{j+1} + r_{j+2}, \qquad s_j = q_{j+1} s_{j+1} + s_{j+2}, \qquad t_j = q_{j+1} t_{j+1} + t_{j+2} $$
(6)

for some integer q j+1>0 and initial data

$$ (r_0, r_1) = (n, \mu), \qquad (s_0, s_1) = (1, 0), \qquad (t_0, t_1) = (0, 1). $$
(7)

This means that at step j≥0,

$$r_j = q_{j+1} r_{j+1} + r_{j+2} $$

and similarly for the other sequences. The sequence \((q_j)_{j\geq1}\) is uniquely defined by imposing that the previous equation be the integer division of \(r_j\) by \(r_{j+1}\). In other terms, \(q_{j+1}=\lfloor r_j/r_{j+1}\rfloor\). This implies by induction that all the sequences are well defined in the integers, together with the following properties.

Lemma 2

The sequences \((r_j)_{j\geq0}\), \((s_j)_{j\geq0}\), and \((t_j)_{j\geq0}\) defined by (6) and (7) with \(q_{j+1}=\lfloor r_j/r_{j+1}\rfloor\) satisfy the following properties, valid for all j≥0.

  1. \(r_j > r_{j+1}\geq 0\) and \(q_{j+1}\geq 1\),

  2. \((-1)^j s_j \geq 0\) and \(|s_j|<|s_{j+1}|\) (this last inequality valid for j≥1),

  3. \((-1)^{j+1} t_j \geq 0\) and \(|t_j|<|t_{j+1}|\),

  4. \(s_{j+1} r_j - s_j r_{j+1}=(-1)^{j+1} r_1\),

  5. \(t_{j+1} r_j - t_j r_{j+1}=(-1)^j r_0\),

  6. \(r_0 s_j + r_1 t_j = r_j\).

These properties lie at the heart of the original GLV algorithm. They imply in particular, via property 1, that the algorithm terminates (once \(r_j\) reaches zero) and that it has \(O(\log n)\) steps, as \(r_j = q_{j+1} r_{j+1}+r_{j+2}\geq r_{j+1}+r_{j+2}>2r_{j+2}\). Note that properties 1, 2, and 3 imply that properties 4 and 5 can be rewritten in our case respectively as

$$ |s_{j+1} r_j| + |s_j r_{j+1}| = \mu\quad\text{and} \quad|t_{j+1} r_j| + |t_j r_{j+1}| = n . $$
(8)

The Cornacchia (as well as the GLV) algorithm does not make use of the full sequences \((r_j)\), \((s_j)\), and \((t_j)\) but rather stops at the m≥0 such that \(r_{m}\geq\sqrt{n}\) and \(r_{m+1}< \sqrt{n}\). An application of (8) with j=m yields \(|t_{m+1} r_m|<n\), or \(|t_{m+1}| < \sqrt{n}\). Since by property 6 we have \(r_{m+1}-\mu t_{m+1}=ns_{m+1}\equiv 0 \ (\mathrm{mod}\ n)\), we deduce that \(r_{m+1}^{2}+ t_{m+1}^{2}\equiv(r_{m+1}-\mu t_{m+1})(r_{m+1}+\mu t_{m+1}) \equiv0 \pmod{n}\). Moreover, \(t_{m+1}\neq 0\) by property 3, so that \(0< r_{m+1}^{2}+t_{m+1}^{2} < n + n = 2n\), which therefore implies that \(r_{m+1}^{2}+ t_{m+1}^{2} = n\) and finally that \(\nu=r_{m+1}-it_{m+1}\).

We present here the pseudo-code of this Euclidean algorithm in ℤ.

Algorithm 1

(Cornacchia’s GCD in ℤ)

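A minimal Python sketch of this truncated Euclidean algorithm (ours, not the paper's exact formulation), assuming inputs n and μ with \(\mu^2 \equiv -1 \pmod{n}\); it returns the pair (a, b) with \(a^2+b^2=n\), representing \(\nu = a + ib\):

```python
def cornacchia_z(n, mu):
    """Truncated extended Euclidean algorithm on (r0, r1) = (n, mu):
    stop at the first remainder r_{m+1} < sqrt(n); then
    r_{m+1}^2 + t_{m+1}^2 = n and nu = r_{m+1} - i*t_{m+1}."""
    assert (mu * mu + 1) % n == 0
    r0, r1 = n, mu
    t0, t1 = 0, 1
    while r1 * r1 >= n:                 # i.e., r1 >= sqrt(n)
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    return r1, -t1                      # (a, b) with nu = a + i*b

# Toy check: 540^2 = -1 (mod 1009) and indeed 1009 = 28^2 + 15^2.
a, b = cornacchia_z(1009, 540)
assert a * a + b * b == 1009
```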

5.2 The Euclidean Algorithm in ℤ[i]

In the previous subsection we have given a meaning to zP, where z∈ℤ[i], and we have seen how to construct ν, a Gaussian prime such that νP=0. By identifying \((x_1,x_2,x_3,x_4)\in\mathbb{Z}^4\) with \((z_1,z_2)=(x_1+ix_3,x_2+ix_4)\in\mathbb{Z}[i]^2\), we can rewrite the 4-GLV reduction map F of Sect. 4 as (using the same letter F by abuse of notation)

$$F\colon \mathbb{Z}[i]\times\mathbb{Z}[i]\longrightarrow \mathbb{Z}[i]/\nu\cong\mathbb{Z}/n, \qquad (z_1,z_2)\longmapsto z_1+\lambda z_2 \ (\mathrm{mod}\ \nu). $$
This F should be compared with the map \(\mathfrak {f}\) of Sect. 2. In mimicking the original GLV paper [13], we would like to apply the extended Euclidean algorithm (defined exactly as before, with integer divisions occurring in ℤ[i], henceforth denoted EGEA, short for extended Gaussian Euclidean algorithm) to the pair \((r_0,r_1)=(\lambda,\nu)\) if \(\lambda\geq\sqrt{2}\, |\nu|\) and \((r_0,r_1)=(\lambda+n,\nu)\) otherwise (the latter case being exceptionally rare). This should output short vectors in \(\mathbb{Z}[i]^2\), which we can transform into short vectors in \(\mathbb{Z}^4\) using the previous isomorphism, thus proving Theorem 3 by the Babai rounding argument given in Sect. 4.

What are the difficulties in following this path? Let us note that properties 4, 5, and 6 of Lemma 2 still hold, and property 1 holds in modulus (in particular, the algorithm terminates). However, in the analysis of this algorithm, especially in [29], a crucial role is played by (8), in order to derive a bound on \(|s_{j+1} r_j|\) and \(|s_j r_{j+1}|\) from a bound on

$$ s_{j+1}r_j - s_jr_{j+1} = (-1)^{j+1}\nu $$
(9)

in the present case. This fact, as we saw, stems from the alternating sign of the sequence \((s_j)\), which results from taking a canonical form of integer division with positive quotients \(q_{j+1}\) and nonnegative remainders \(r_{j+2}\), a property which is not available here. Nevertheless, we can still use a similar reasoning using (9), provided that the arguments of \(s_{j+1} r_j\) and \(s_j r_{j+1}\) are not too close, so as to avoid a high degree of cancellation. In other terms, in order to follow the argument of [29, Theorem 1], we need a property of the kind

$$| s_{j+1}r_j - s_jr_{j+1} | \leq M \quad\Longrightarrow\quad\max \bigl(|s_{j+1}r_j|, |s_jr_{j+1}|\bigr) \leq c M $$

for some explicit absolute constant c (equal to 1 in [29]). This is in general impossible to attain, because in the EGEA, in contrast to the usual extended Euclidean algorithm, we have no control over the arguments of the \(r_j\)'s or the \(s_j\)'s. However, in most cases something of the sort can be proved. This is the content of Lemma 4 (Appendix D). We define the corresponding indices (terms) of the sequences \(r_j, s_j\) as “good” when this happens. If all the terms were good, then the proof of [29, Theorem 1] could be carried over to prove Theorem 3 almost without change (the final constant of the theorem would be different, depending on c). However, this is not the case, and the main difficulty here lies in the treatment of the terms which are not good (called therefore “bad”). The surprising fact is that we can still control the contribution of bad terms to our advantage (see Lemma 5) and, ultimately, the combination of Lemmas 4 and 5 becomes the main ingredient in the proof of Theorem 3. All of the above makes the reasoning noticeably more sophisticated than in [29].

We now turn to the description of the EGEA. The first observation is that in the case of Gaussian integers, there can be 2, 3, or 4 possible choices for a remainder in the jth step of the integer division \(r_j=q_{j+1} r_{j+1}+r_{j+2}\). It turns out that choosing at each step j≥0 of the EGEA a remainder \(r_{j+2}\) with smallest modulus will yield Theorem 3.

We give the pseudo-code of Cornacchia’s Algorithm in ℤ[i] in two forms, working with complex numbers (see Algorithm 2) and separating real and imaginary parts (see Algorithm 3, Appendix C).

Algorithm 2

(EGEA or Cornacchia’s algorithm in ℤ[i]—compact form)

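A Python sketch of the Gaussian Euclidean loop with smallest-modulus remainders (ours, not the paper's exact formulation; Gaussian integers are represented as integer pairs to keep the arithmetic exact, and the stopping threshold is left as a parameter):

```python
def gmul(a, b):          # product in Z[i]; (x, y) represents x + iy
    return (a[0] * b[0] - a[1] * b[1], a[0] * b[1] + a[1] * b[0])

def gsub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def gnorm(a):            # |a|^2
    return a[0] * a[0] + a[1] * a[1]

def iround(num, den):    # nearest integer to num/den (den > 0), exact
    return (2 * num + den) // (2 * den)

def gdivmod(a, b):
    """Integer division a = q*b + r in Z[i] with a smallest-modulus remainder:
    q is a/b rounded coordinate-wise, giving |r| <= |b|/sqrt(2) (termination)."""
    nb = gnorm(b)
    t = gmul(a, (b[0], -b[1]))          # a * conj(b), so a/b = t/nb
    q = (iround(t[0], nb), iround(t[1], nb))
    return q, gsub(a, gmul(q, b))

def egea(r0, r1, stop_norm):
    """Euclidean loop in Z[i] tracking (s_j) as in Sect. 5.1; stops once
    gnorm(r1) < stop_norm. Returns the last two pairs (r_j, s_j), from which
    the full algorithm assembles short vectors of the kernel lattice."""
    s0, s1 = (1, 0), (0, 1)
    while gnorm(r1) >= stop_norm:
        q, r2 = gdivmod(r0, r1)
        r0, r1 = r1, r2
        s0, s1 = s1, gsub(s0, gmul(q, s1))
    return (r0, s0), (r1, s1)

# Toy run with hypothetical inputs:
(rm, sm), (rn, sn) = egea((1234, 0), (56, 78), stop_norm=100)
assert gnorm(rn) < 100 <= gnorm(rm)
```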

Remark 2

In the case of the LLL algorithm, we have not managed to demonstrate a bound as good as the one obtained with our lattice reduction algorithm.

Remark 3

Nguyen and Stehlé [26] have produced an efficient lattice reduction in four dimensions which finds successive minima and hence produces a decomposition with relatively good bounds. Our algorithm represents a very simple and easy-to-implement alternative that may be ideal for certain cryptographic libraries.

6 GLV–GLS using the Twisted Edwards Model

The GLV–GLS method can be sped up in practice by writing down GLV–GLS curves in the Twisted Edwards model. Note that arithmetic on j-invariant 0 Weierstrass curves is already very efficient. However, some GLV curves do not exhibit such high-speed arithmetic. In particular, the curves in Examples 3–6 from Appendix A have Weierstrass coefficients \(a_4 a_6\neq 0\) for curve parameters \(a_4\) and \(a_6\), and hence they have a more expensive point doubling (even more so if we consider the extra multiplication by the twisted parameter u when using the GLS method). So the impact of using Twisted Edwards is expected to be especially significant for these curves. In fact, if we consider that suitable parameters can always be chosen, the use of Twisted Edwards curves isomorphic to the original Weierstrass GLV–GLS curves uniformizes the performance of all of them.

Let us illustrate how to produce a Twisted Edwards GLV–GLS curve with the GLV curve from Example 4, Appendix A. First, consider its quadratic twist over \(\mathbb{F}_{p^{2}}\)

$$E'/\mathbb{F}_{p^2}{:}\ x^3 - \frac{15}{2} u^2 x -7 u^3 = (x + 2u) \cdot \biggl(x^2 - 2ux - \frac{7}{2} u^2\biggr). $$

The change of variables \(x_1=x+2u\) transforms E′ into

$$y^2 = x_1^3 -6u x_1^2 + \frac{9u^2}{2} x_1 . $$

Let \(\beta= 3u/\sqrt{2} \in\mathbb{F}_{p^{2}}\) and substitute \(x_1=\beta x'\) to get

$$\frac{1}{\beta^3} y^2 = x'^3 - \frac{6u}{\beta} x'^2 + x', $$

and this is a Montgomery curve \(M_{A,B}\colon Bv^2=u^3+Au^2+u\), where A≠±2, B≠0, with

$$B= \frac{1}{\beta^3} = \frac{2\sqrt{2}}{27u^3} , \qquad A=- \frac{6u}{\beta}= -2\sqrt{2} . $$

The corresponding Twisted Edwards GLV–GLS curve is then \(E_{a,d}\colon ax^2+y^2=1+dx^2 y^2\) with

$$a =\frac{A+2}{B} = 27u^3 \biggl(\frac{\sqrt{2}}{2} - 1 \biggr), \qquad d = \frac{A-2}{B} = -27u^3 \biggl(\frac{\sqrt{2}}{2} + 1 \biggr). $$

The map \(E'\to E_{a,d}\) is

$$(x,y) \mapsto \biggl(\frac{x+2u}{\beta y} , \frac{x+2u-\beta }{x+2u+\beta } \biggr) = (X,Y) $$

with inverse

$$(X,Y) \mapsto \biggl( \frac{\beta-2u + (\beta+2u)Y}{1-Y} , \frac {1+Y}{(1-Y) X} \biggr) . $$
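The Montgomery-to-Twisted-Edwards conversion used in this derivation is the standard birational equivalence from [2]; the sketch below (ours, over a small toy prime field rather than \(\mathbb{F}_{p^2}\), with hypothetical parameters) checks that \(a=(A+2)/B\), \(d=(A-2)/B\) and \((u,v)\mapsto(u/v,(u-1)/(u+1))\) send a Montgomery point onto the twisted Edwards curve:

```python
def mont_to_tedwards_params(A, B, p):
    """a = (A+2)/B, d = (A-2)/B for the Montgomery curve B*v^2 = u^3 + A*u^2 + u."""
    invB = pow(B, p - 2, p)                   # Fermat inversion
    return (A + 2) * invB % p, (A - 2) * invB % p

def mont_to_tedwards_point(u, v, p):
    """(u, v) -> (x, y) = (u/v, (u-1)/(u+1)), defined when v != 0, u != -1."""
    return (u * pow(v, p - 2, p) % p,
            (u - 1) * pow(u + 1, p - 2, p) % p)

# Toy field and curve (hypothetical parameters, not the paper's):
p, A, B = 103, 6, 1
a, d = mont_to_tedwards_params(A, B, p)
invB = pow(B, p - 2, p)
for u in range(2, p - 1):                     # find any affine point, map it
    rhs = (u**3 + A * u**2 + u) * invB % p
    v = next((w for w in range(1, p) if w * w % p == rhs), None)
    if v is not None:
        x, y = mont_to_tedwards_point(u, v, p)
        assert (a * x * x + y * y) % p == (1 + d * x * x * y * y) % p
        break
```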

We now specify the formulas for Φ and Ψ, obtained by composing these endomorphisms on the Weierstrass model with the birational maps above. We found an extremely appealing expression in the case where u=1+i and \(i^2=-1\). Then \(\beta= 3u/\sqrt{2} = 3\zeta_8\), where \(\zeta_8\) is a primitive 8th root of unity. We have

and

$$\varPsi(X,Y) = \biggl(\zeta_8 X^p, \frac{1}{Y^p} \biggr) . $$

In this case,

$$a = 54 \bigl(\zeta_8^3-\zeta_8^2+1 \bigr) , \qquad d = -54 \bigl(\zeta_8^3+ \zeta_8^2-1\bigr). $$

Finally, one would want to use the efficient formulas given in [15] for the case a=−1. After ensuring that −a is a square in \(\mathbb{F}_{p^{2}}\), we use the map \((x,y) \mapsto(x/\sqrt{-a},y)\) to convert to the isomorphic curve \(-x^2+y^2=1+d'x^2 y^2\), where \(d'=-d/a\).

7 Side-Channel Protection and Parallelization of the GLV–GLS Method

Given the potential threat posed by attacks that exploit timing information to deduce secret keys ([7, 19]), many works have proposed countermeasures to minimize the risks and achieve so-called constant-time execution during cryptographic computations. In general, to avoid leakage, the execution flow should be independent of the secret key. This means that conditional branches and secret-dependent table lookup indices should be avoided [4, 18]. There are five key points that are especially vulnerable during the computation of scalar multiplication: inversion, modular reduction in field operations, precomputation, scalar recoding, and double-and-add execution.

A well-known technique that is secure and easy to implement for inverting any field element a consists of computing the exponentiation \(a^{p-2} \bmod p\) using a short addition chain for p−2.
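For instance (a generic sketch of the idea, not the paper's specific addition chain), inversion becomes one fixed exponentiation with no secret-dependent branching:

```python
def inv_mod_p(a, p):
    # Always the same exponentiation a^(p-2) mod p; a production version
    # would expand pow() into a fixed addition chain for p - 2.
    return pow(a, p - 2, p)

p = 2**127 - 5997          # the prime p3 used for E'_3 / E'_T3 in Sect. 8
assert inv_mod_p(12345, p) * 12345 % p == 1
```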

To protect field operations, one may exploit conditional move instructions typically found on modern x86 and x64 processors (a.k.a. cmove). Since conditional checks happen during operations such as addition and subtraction as part of the reduction step, it is standard practice to replace conditional branches with the conditional move instruction. Luckily, these conditional branches are highly unpredictable, and, hence, the substitution above not only makes the execution constant-time but also more efficient in most cases. An exception happens when performing modular reduction during a field multiplication or squaring, where a final correction step occurs very rarely, and hence a conditional branch may be more efficient.

For the case of precomputation in the setting of elliptic curves, recent work [18] and later [3] showed how to enable the use of precomputed points by employing constant-time table lookups that mask the extraction of points, which is a known technique in the literature (see, for example, [5]). In our implementations (see Sect. 8), we exploit a similar approach based on cmove and conditional vector instructions instead, which is expected to achieve higher performance on some platforms than implementations based on logical instructions (see Listing 1 in [18]). Note that it is straightforward to enable the use of signed-digit representations that allow negative points by performing a second table lookup between the point selected in the first table lookup and its negated value.
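To illustrate the masking idea behind such table lookups, here is a model of ours in Python (Python integers are not actually constant-time, so this only mirrors the logic a cmove- or vector-instruction-based extraction would follow):

```python
def ct_lookup(table, index, limbs):
    """Scan every entry and accumulate it under an all-zero/all-one mask,
    so the memory access pattern is independent of the secret index."""
    result = [0] * limbs
    for i, entry in enumerate(table):
        diff = i ^ index
        bit = ((diff | -diff) >> 63) + 1     # 1 iff diff == 0, no branches
        mask = -bit                          # all-ones when i == index
        for j in range(limbs):
            result[j] |= entry[j] & mask
    return tuple(result)

# Toy table of eight "points", two limbs each:
table = [(11 * i, 13 * i) for i in range(8)]
assert ct_lookup(table, 5, limbs=2) == (55, 65)
```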

To protect the scalar recoding and its corresponding double-and-add algorithm, one needs a regular pattern of execution. Based on a method in [27], Joye and Tunstall [17] proposed a constant-time recoding that supports a regular-execution double-and-add algorithm that exploits precomputations. The nonzero density of the method is 1/(w−1), where w is the window width. Therefore, there is a certain loss in performance in comparison with an unprotected version with nonzero density 1/(w+1). In GLV-based implementations one has to deal with more than one scalar, and these scalars are scanned simultaneously, using interleaving [13], for instance, during multi-exponentiation. So there are two issues that arise. First, how are the several scalars aligned with respect to their zero and nonzero digit representation? And second, how do we guarantee the same representation length for all scalars so that no dummy operations are required? The first issue is inherently solved by the recoding algorithm itself. The input is always an odd number, which means that, from left to right, one obtains the execution pattern (w−1) doublings, d additions, (w−1) doublings, d additions, … , (w−1) doublings and d additions, for d-dimensional GLV. For dealing with even numbers, one may employ the technique described in [17] in a constant-time fashion, namely, scalars \(k_i\) that are even are replaced by \(k_i+1\), and scalars that are odd are replaced by \(k_i+2\) (the correction, also constant-time, is performed after the scalar multiplication computation using d point additions). A solution to the second issue was also hinted at by [17]. We present in Appendix E the modified recoding algorithm that outputs a regular pattern representation with fixed length; a sketch of the underlying recoding appears below. Note that in the case of Twisted Edwards, one can alternatively use unified addition formulas that also work for doubling (see [2, 15] for details). However, our analysis indicates that this approach is consistently slower because of the high cost of these unified formulas in comparison to doubling and the extra cost incurred by the increase in constant-time table lookup accesses.
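A sketch of this style of regular recoding (ours, following the Joye–Tunstall idea; the fixed-length variant of Appendix E additionally pads to a constant number of digits): every digit is odd and nonzero, so the double-and-add loop has a fully regular pattern.

```python
def regular_recode(k, w):
    """Recode an odd k > 0 into odd digits d_i in {+-1, +-3, ..., +-(2^(w-1)-1)}
    with k = sum_i d_i * 2^((w-1)*i): one nonzero digit every w - 1 bits."""
    assert k > 0 and k % 2 == 1
    digits = []
    while k > 2 ** (w - 1):
        d = (k % 2 ** w) - 2 ** (w - 1)   # odd, in (-2^(w-1), 2^(w-1))
        digits.append(d)
        k = (k - d) >> (w - 1)            # the new k is again odd
    digits.append(k)
    return digits

# Every digit odd, reconstruction exact:
w = 4
for k in range(1, 2001, 2):
    ds = regular_recode(k, w)
    assert all(d % 2 == 1 for d in ds)    # in Python, (-3) % 2 == 1
    assert sum(d << ((w - 1) * i) for i, d in enumerate(ds)) == k
```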

7.1 Multicore Computation and Its Side-Channel Protection

Parallelization of scalar multiplication over prime fields is particularly difficult on modern multicore processors. This is due to the difficulty of performing point operations concurrently when executing the double-and-add algorithm from left to right. From right to left, parallelization is easier, but performance is hurt because the use of precomputations is cumbersome. Hence, parallelization should ideally be performed at the field arithmetic level. Unfortunately, current multicore processors still impose a severe overhead for thread creation/destruction. During our tests, we observed overheads of a few thousand cycles on modern 64-bit CPUs (that is, much more costly than a point addition or doubling). Given this limitation, for the GLV method, the ideal approach (from a speed perspective) seems to be to let each core manage a separate scalar multiplication with \(k_i\), as sketched below. This is simple to implement, minimizes thread management overhead, and also eases the task of protecting the implementation against side-channel attacks, since each scalar can be recoded using Algorithm 4, Appendix E. Using d cores, the total cost of a protected d-dimensional GLV l-bit scalar multiplication (disregarding precomputation) is approximately l/d doublings and l/((w−1)⋅d) mixed additions. A somewhat slower approach (but more power efficient) would be to let one core manage all doublings and let one or two extra cores manage the additions corresponding to nonzero digits. For instance, for dimension four and three cores, the total cost (disregarding precomputation) is approximately l/d doublings and l/((w−1)⋅d) general additions, provided that the latency of (w−1) doublings is equal to or greater than that of the addition part (otherwise, the cost is dominated by nonmixed additions).
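A sketch of the first strategy (ours; a toy additive group stands in for the curve, and Python threads only model the four-core dispatch):

```python
from concurrent.futures import ThreadPoolExecutor

n = 1009                      # toy group Z/n stands in for <P>

def partial_mul(args):
    k_i, P_i = args           # placeholder for a double-and-add on one core
    return k_i * P_i % n

def glv_parallel(ks, Ps):
    """Each core handles one k_i and one precomputed point among
    P, Phi(P), Psi(P), PsiPhi(P); the partial results are added at the end."""
    with ThreadPoolExecutor(max_workers=len(ks)) as pool:
        return sum(pool.map(partial_mul, zip(ks, Ps))) % n

ks = [123, -45, 67, -89]      # the four quarter-length scalars k1..k4
Ps = [5, 11, 17, 23]          # stand-ins for P, Phi(P), Psi(P), PsiPhi(P)
assert glv_parallel(ks, Ps) == sum(k * P for k, P in zip(ks, Ps)) % n
```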

8 Performance Analysis and Experimental Results

For our analysis and experiments, we consider the five curves below: two GLV curves in Weierstrass form with and without nontrivial automorphisms, their corresponding GLV–GLS counterparts, and one curve in Twisted Edwards form isomorphic to the GLV–GLS curve \(E'_{3}\) (see below).

  • GLV–GLS curve with j-invariant 0 in Weierstrass form \(E'_{1}/\mathbb{F}_{p_{1}^{2}}: y^{2}=x^{3} + 9u\), where \(p_1=2^{127}-58309\) and \(\#E'_{1}(\mathbb{F}_{p_{1}^{2}}) = r\), where r is a 254-bit prime. We use \(\mathbb{F}_{p_{1}^{2}} = \mathbb {F}_{p_{1}}[i]/(i^{2} +1)\) and \(u=1+i \in\mathbb{F}_{p_{1}^{2}}\). \(E'_{1}\) is the quadratic twist of the curve in Example 2, Appendix A. \(\varPhi(x,y)=\lambda P=(\xi x,y)\) and \(\varPsi(x,y)=\mu P=(u^{(1-p)/3} x^p,u^{(1-p)/2} y^p)\), where \(\xi^3=1 \bmod p_1\). We have that \(\varPhi^2+\varPhi+1=0\) and \(\varPsi^2+1=0\).

  • GLV curve with j-invariant 0 in Weierstrass form \(E_{2}/\mathbb {F}_{p_{2}}: y^{2} = x^{3} + 2\), where \(p_2=2^{256}-11733\), and \(\# E_{2}(\mathbb{F}_{p_{2}})\) is a 256-bit prime. This curve corresponds to Example 2, Appendix A.

  • GLV–GLS curve in Weierstrass form \(E'_{3}/\mathbb{F}_{p_{3}^{2}}: y^{2}=x^{3}-15/2\; u^{2} x-7u^{3}\), where \(p_3=2^{127}-5997\) and \(\#E'_{3}(\mathbb{F}_{p_{3}^{2}}) = 8r\), where r is a 251-bit prime. We use \(\mathbb{F}_{p_{3}^{2}} = \mathbb {F}_{p_{3}}[i]/(i^{2} + 1)\) and \(u=1+i \in\mathbb{F}_{p_{3}^{2}}\). \(E'_{3}\) is the quadratic twist of a curve isomorphic to the one in Example 4, Appendix A. The formula for \(\varPhi(x,y)=\lambda P\) can be easily derived from ψ(x,y), and \(\varPsi(x,y)=\mu P=(u^{1-p} x^p,u^{3(1-p)/2} y^p)\). It can be verified that \(\varPhi^2+2=0\) and \(\varPsi^2+1=0\).

  • GLV–GLS curve in Twisted Edwards form \(E'_{T3}/\mathbb {F}_{p_{3}^{2}}: -x^{2} + y^{2}=1+ dx^{2} y^{2}\), where

    \(p_3=2^{127}-5997\) and \(\# E'_{T3}(\mathbb{F}_{p_{3}^{2}}) = 8r\), where r is a 251-bit prime. We use again \(\mathbb{F}_{p_{3}^{2}} = \mathbb{F}_{p_{3}}[i]/(i^{2} + 1)\) and \(u=1+i \in\mathbb{F}_{p_{3}^{2}}\). \(E'_{T3}\) is isomorphic to curve \(E'_{3}\) above and was obtained following the procedure in Sect. 6. The formulas for Φ(x,y) and Ψ(x,y) are also given in Sect. 6. It can be verified that \(\varPhi^2+2=0\) and \(\varPsi^2+1=0\).

  • GLV curve \(E_{4}/\mathbb{F}_{p_{4}}: y^{2}=x^{3}-15/2\; x-7\), where \(p_4=2^{256}-45717\) and \(\#E_{4}(\mathbb{F}_{p_{4}}) = 2r\), where r is a 256-bit prime. This curve is isomorphic to the curve in Example 4, Appendix A.

For our experiments, we also explored the case of \(p=2^{128}-c\), with a relatively small integer c, for GLV–GLS curves. We finally decided on \(p=2^{127}-c\) because it was consistently faster thanks to the use of lazy reduction in the multiplication over \(\mathbb {F}_{p^{2}}\) [21], at the expense of a slight reduction in security.

Let us first analyze the performance of the GLV–GLS method over \(\mathbb{F}_{p^{2}}\) in comparison with the traditional 2-GLV case over \(\mathbb{F}_{p}\). We assume the use of a pseudo-Mersenne prime of the form \(p=2^m-c\) with small c (for our targeted curves, groups with (near) prime order cannot be constructed using the attractive Mersenne prime \(p=2^{127}-1\)). Given that we have a proven ratio \(C_2/C_1<412\) that is independent of the curve, the only values left that could significantly affect a uniform speedup between GLV–GLS and 2-GLV are the quadratic nonresidue β used to build \(\mathbb{F}_{p^{2}}\) as \(\mathbb{F}_{p}[i]/(i^{2}-\beta)\), the value of the twisting parameter u, and the cost of applying the endomorphisms Φ and Ψ. In particular, if |β|>1, a few extra additions (or a multiplication by a small constant) are required per \(\mathbb{F}_{p^{2}}\) multiplication and squaring. Luckily, for all the GLV curves listed in Appendix A, one can always use a suitably chosen modulus p so that |β| can be one or at least very close to it. Similar comments apply to the twisting parameter u. In this case, the extra cost (equivalent to a few additions) is added to the cost of point doubling whenever the curve parameter a in the Weierstrass equation is different from zero (e.g., it does not affect j-invariant 0 curves). In the case of Twisted Edwards, we applied a better strategy, that is, we eliminated the twisting parameter u in the isomorphic curve. The cost of applying Φ and Ψ does depend on the chosen curve, and it could be relatively expensive. If computing Φ(P), Ψ(P), or ΨΦ(P) is more expensive than a point addition, then its use can be limited to only one application (i.e., multiples of those values, if using precomputations, should be computed with point additions). Further, the extra cost can be minimized by choosing the optimal window width for each \(k_i\).

To illustrate how the parameters above may affect the performance gain, we detail in Table 1 estimates for the cost of computing a scalar multiplication with our representative curves. For the remainder, we use the following notation: M, S, A, and I represent field multiplication, squaring, addition, and inversion over \(\mathbb {F}_{p}\), respectively, and m, s, a, and i represent the same operations over \(\mathbb{F}_{p^{2}}\). Side-channel protected multiplication and squaring are denoted by \(m_s\) and \(s_s\). We consider the cost of addition, subtraction, negation, multiplication by 2, and division by 2 as equivalent. For the targeted curves in Weierstrass form, a mixed addition consists of 8 multiplications, 3 squarings, and 7 additions, and a general addition consists of 12 multiplications, 4 squarings, and 7 additions. For \(E'_{1}\) and \(E_2\), a doubling consists of 3 multiplications, 4 squarings, and 7 additions, and for \(E'_{3}\) and \(E_4\), a doubling consists of 3 multiplications, 6 squarings, and 12 additions. For Twisted Edwards, we consider the use of mixed homogeneous/extended homogeneous projective coordinates [15]. In this case, a mixed addition consists of 7 multiplications and 7 additions, a general addition consists of 8 multiplications and 6 or 7 additions, and a doubling consists of 4 multiplications, 3 squarings, and 5 additions. We also assume the use of interleaving [13] with width-w nonadjacent form (wNAF) and the use of the LM scheme for precomputing points on the Weierstrass curves [24] (see also [22, Chap. 3]).

Table 1. Operation counts and performance for scalar multiplication at approximately 128 bits of security. To determine the total costs, we consider 1i=66m, 1s=0.76m, and 1a=0.18m for \(E'_{1}\), \(E'_{3}\), and \(E'_{T3}\); and 1I=290M, 1S=0.85M, and 1A=0.18M for E 2 and E 4. The cost ratio of multiplications over \(\mathbb{F}_{p}\) and \(\mathbb{F}_{p^{2}}\) is M/m=0.91. These values and the performance figures (in cycles) were obtained by benchmarking full implementations on a single core of a 3.4 GHz Intel Core i7-2600 (Sandy Bridge) processor.

According to our theoretical estimates, the relative speedup when moving from 2-GLV to GLV–GLS is expected to be as high as 1.5 times, approximately. To confirm our findings, we produced full implementations of the methods. Experimental results, also displayed in Table 1, closely follow our estimates and confirm that speedups in practice are about 1.52 times. Most remarkably, the use of the Twisted Edwards model pushes performance even further. In Table 1, the expected gains for \(E'_{T3}\) are 31 % and 97 % in comparison with 4-GLV–GLS and 2-GLV in Weierstrass form (respectively). In practice, we achieved similar speedups, namely, 33 % and 102 % (respectively). Likewise, a rough analysis indicates that a Twisted Edwards GLV–GLS curve for a j-invariant 0 curve would achieve roughly similar speed to \(E'_{T3}\), which means that in comparison to its corresponding Weierstrass counterpart the gains are on the order of 9 % and 66 % (respectively). This highlights the impact of using Twisted Edwards, especially over those GLV–GLS curves that are relatively slower in the Weierstrass model. Timings were measured on a single core of a 3.4 GHz Intel Core i7-2600 (Sandy Bridge) processor.

Let us now focus on curves \(E'_{1}\), \(E_2\), and \(E'_{T3}\) to assess the performance of implementations targeting four scenarios of interest: unprotected and side-channel protected versions with sequential and multicore execution. Operation counts for computing a scalar multiplication at approximately 128 bits of security for the different cases are displayed in Table 2. The techniques used to protect and parallelize our implementations are described in Sect. 7. In particular, the execution flow and memory address access of side-channel protected versions are not secret and are fully independent of the scalar. For our versions running on several cores, we used OpenMP. We use an implementation in which each core is in charge of one scalar multiplication with \(k_i\). Given the high cost of thread creation/destruction, this approach guarantees the fastest computation in our case (see Sect. 7 for a discussion). Note that these multicore figures are only relevant for scenarios in which latency rather than throughput is targeted. Finally, we consider the cost of constant-time table lookups (denoted by t), given their nonnegligible cost in protected implementations.

Table 2. Operation counts for scalar multiplication at approximately 128 bits of security using curves \(E'_{1}\), \(E_2\), and \(E'_{T3}\) in up to four variants: unprotected and side-channel protected implementations with sequential and multicore execution. To determine the total costs we consider 1i=66m, 1s=0.76m, and 1a=0.18m for unprotected versions of \(E'_{1}\) and \(E'_{T3}\); 1i=79\(m_s\), 1\(s_s\)=0.81\(m_s\), and 1a=0.17\(m_s\) for protected versions of \(E'_{1}\) and \(E'_{T3}\); t=0.83\(m_s\) for \(E'_{1}\) (32 pts.); t=1.28\(m_s\) for \(E'_{T3}\) (36 pts.); t=0.78\(m_s\) for \(E'_{T3}\) (20 pts.); and 1I=290M, 1S=0.85M, and 1A=0.18M for \(E_2\). In our case, M/m=0.91 and \(m_s\)/m=1.11. These values were obtained by benchmarking full implementations on a 3.4 GHz Intel Core i7-2600 (Sandy Bridge) processor.

Focusing on curve \(E'_{1}\), a significant cost reduction can be noted when switching from non-GLV to a GLV–GLS implementation. The speedup is more than twofold for sequential, unprotected versions. Significant improvements are also expected when using multiple cores. A remarkable factor-3 speedup is expected when using GLV–GLS on four cores in comparison with a traditional execution (listed as non-GLV).

In general for our targeted GLV–GLS curves, the speedup obtained by using four cores is between 1.42 and 1.80 times. Interestingly, the improvement is greater for protected implementations, since the overhead of using a regular pattern execution is minimized when distributing the computation among several cores. Remarkably, protecting implementations against timing attacks slows down performance by a factor between 1.28 and 1.52, approximately. On the other hand, in comparison with curve \(E_2\), an optimal execution of GLV–GLS on four cores is expected to run 1.81 times faster than an optimal execution of the standard 2-GLV on two cores.

To confirm our findings, we implemented the different versions using curves \(E'_{1}\), \(E_2\), and \(E'_{T3}\). To achieve maximum performance and ease the task of parallelizing and protecting the implementations, we wrote our own standalone software without employing any external library. For our experiments we used a 3.4 GHz Intel Core i7-2600 processor, which contains four cores. The timings in terms of clock cycles are displayed in Table 3. As can be seen, closely following our analysis, GLV–GLS achieves a twofold speedup over a non-GLV implementation on a single core. Parallel execution improves performance by up to 1.76 times for side-channel protected versions. In comparison with the non-GLV implementation, the four-core implementation runs 3 times faster. Our results also confirm the lower-than-expected cost of adding side-channel protection. Sequential versions lose about 50 % in performance, whereas parallel versions only lose about 28 %. The relative speedup when moving from 2-GLV to GLV–GLS on j-invariant 0 curves is 1.53 times, closely following the theoretical factor-1.5 speedup estimated previously. Four-core GLV–GLS supports a computation that runs 1.81 times faster than the standard 2-GLV on two cores. Finally, in practice our Twisted Edwards curve achieves up to a 9 % speedup in the sequential, unprotected scenario in comparison with the efficient j-invariant 0 curve based on Jacobian coordinates.

Table 3. Point multiplication timings (in clock cycles), 64-bit processor.

Comparison to Related Work

Let us now compare our best numbers with recent results in the literature for elliptic curves over large prime characteristic fields. Focusing on one-core unprotected implementations, the first author together with Hu and Xu reported in [16] 122,000 cycles for a j-invariant 0 Weierstrass curve on an Intel Core i7-2600 (Sandy Bridge) processor. We report 91,000 cycles with the GLV–GLS Twisted Edwards curve \(E'_{T3}\), improving that number by a factor-1.34 speedup. We benchmarked on the same processor the side-channel protected software recently presented by Bernstein et al. [3] and obtained 194,000 cycles. Thus, our protected implementation, which runs in 137,000 cycles, is 1.42 times faster. Our result is also 1.12 times faster than the recent implementation by Hamburg [14].

It is also relevant to mention very recent results in settings other than elliptic curves over large prime characteristic fields. Taverne et al. [31] reported a protected implementation of a binary Edwards curve that runs in 225,000 cycles on an Intel Core i7-2600 (Sandy Bridge) machine, which is 1.64 times slower than our corresponding result. Aranha et al. [1] presented an implementation of the Koblitz curve K-283 that runs in 99,000 cycles on the same machine, which is 9 % slower than our GLV–GLS Twisted Edwards curve \(E'_{T3}\) (unprotected sequential execution). Aranha et al. do not report timings for side-channel protected implementations. A faster (although also unprotected) implementation of a GLS binary curve over a quadratic extension field of characteristic two was recently announced at ECC 2012. The running time in this case is about 73,000 cycles on the same Sandy Bridge processor [28]. These results highlight the significant impact of the carryless multiplier on the efficiency of characteristic two fields in the newest Intel processors. Efficient implementations on genus-2 (hyperelliptic) curves were recently reported in Bos et al. [6]. For instance, a protected implementation on a Kummer surface over a prime field runs in approximately 117,000 cycles on an Intel Core i7-3520M (Ivy Bridge) processor. Note that this processor architecture is in general more efficient than Sandy Bridge.

To the best of our knowledge, we have presented the first scalar multiplication implementation running on multiple cores that is protected against timing attacks, cache attacks, and several others.

9 Conclusion

We have shown how to generalize the GLV scalar multiplication method by combining it with Galbraith–Lin–Scott's ideas to perform a proven almost fourfold speedup on GLV curves over \(\mathbb{F}_{p^{2}}\). We have introduced a new and easy-to-implement reduction algorithm, consisting of two applications of the extended Euclidean algorithm, one in ℤ and the other in ℤ[i]. The refined bound obtained from this algorithm has allowed us to get a relative improvement from 2-GLV to 4-GLV–GLS that is practically independent of the curve. Our analysis and experimental results on different GLV curves show that in practice one should expect a factor-1.5 speedup, approximately. We improve performance even further by exploiting the Twisted Edwards model over a larger set of curves and show that this approach is especially significant for certain GLV curves with slow arithmetic in the Weierstrass model. This makes available to implementers new curves that achieve close to optimal performance. Moreover, we have shown how to protect GLV-based implementations against certain side-channel attacks with relatively low overhead and carried out a performance analysis on modern multicore processors. Our implementations of the generalized GLV–GLS method improve the state-of-the-art performance of elliptic curve point multiplication over fields of large prime characteristic for multiple scenarios: unprotected and side-channel protected versions with sequential and parallel execution. Finally, we have produced new families of GLV curves and written down all such curves (up to isomorphism) with nontrivial endomorphisms of degree ≤3.