1 Introduction

Let \(\mathbb {G}\) denote a cyclic group of order \(\ell \). Given an element \(g\in \mathbb {G}\) of order \(r|\ell \) and \(h\in \langle g \rangle \), the Discrete Logarithm Problem (DLP) for \(\mathbb {G}\) is the computational problem of finding an integer x such that \(g^x = h\). The smallest integer x satisfying \(g^x = h\) is called the discrete logarithm of h to the base g and is denoted by \(\log _gh\). In this paper, we are mainly interested in the case where \(\mathbb {G}\) is the additive group of points on an elliptic curve E defined over a binary field, and g is a point \(P\in E({\mathbb {F}}_{2^m})\).

In 1985, Koblitz [51] and Miller [62] independently showed that the group of points on an elliptic curve defined over a finite field could be used for designing a public key cryptosystem, having the DLP in that group as the underlying hard computational problem. This was the birth of Elliptic Curve Cryptography (ECC), which over the years has become one of the most intensively analyzed public key schemes in our discipline.Footnote 1

In the abstract of his seminal paper, Koblitz [51] remarked that

These elliptic curve cryptosystems may be more secure, because the analog of the discrete logarithm problem on elliptic curves is likely to be harder than the classical discrete logarithm problem, especially over \(GF(2^n)\).

Indeed, it was soon realized that, for sensible choices of the elliptic curve parameters, there did not appear to exist a subexponential-time algorithm that could solve the DLP over the group of points of an elliptic curve; or, at least, none analogous to the index-calculus family of algorithms, which had proved highly successful when the DLP was defined in the multiplicative group of a finite field.

Furthermore, in the abstract of his paper [62], Miller commented on the disparity between the security offered by the original Diffie–Hellman protocol, whose security rests on the DLP defined over finite fields, and that of its analogue over elliptic curves:

As computational power grows, this disparity should get rapidly bigger.

As an interesting historical remark, let us recall that just one year before Koblitz and Miller presented their ECC proposal, Coppersmith had reported in [23]Footnote 2 an index-calculus algorithm able to solve the DLP over characteristic two fields of the form \({\mathbb {F}}_{q}\) with \(q = 2^m,\) with a running time of \(L_q(1/3, (32/9)^{\frac{1}{3}}),\) where \(L_q(\alpha , c),\) with parameters \(0< \alpha < 1\) and \(c > 0,\) denotes the expression \(\exp \left( (c + o(1))(\log q)^\alpha (\log \log q)^{1 - \alpha }\right) .\) Almost 30 years later, in February 2013, Joux [47] presented a new DLP algorithm with a running time of \(L_q(1/4, c)\) (for some undetermined c), which was rapidly followed by several other developments that culminated in the discovery of algorithms that asymptotically enjoy a quasi-polynomial time complexity [6, 38]. Hence, a bit less than three decades of cryptanalysis rendered the DLP in binary fields completely useless for constructive cryptographic purposes.

In stark contrast with its binary field instantiation, and after all these years of extensive analysis, the DLP over elliptic curves defined over the binary field \({\mathbb {F}}_{q},\) \(q=2^m,\) still stands as a formidable computational task. As of today, the best known general-purpose algorithm to solve it is still the Pollard rho algorithm [30]. In the last decade, however, inspired by the new lines of research presented by Semaev [79], several researchers [29, 31, 35, 49, 80] (see also [30] for a comprehensive survey) have attempted to attack the DLP of all binary elliptic curves using summation-polynomial methods.Footnote 3 However, these attempts crucially rely on ill-understood Gröbner basis assumptions, which in some cases have led to contradictory behaviors [42], even for tiny experiments [30].Footnote 4

On the other hand, it is now standard knowledge that the Pollard rho algorithm is able to solve the DLP over generic curves with an exponential expected cost of \((1 + o(1))\frac{\sqrt{\pi \cdot q}}{2}\) group operations. Moreover, it is always possible to apply the negation map, which yields an extra speedup factor of up to \(\sqrt{2}\) over the above estimate [10, 90]. Further, in the case of Koblitz curves, which are the main subject of this paper, one can apply the Frobenius endomorphism to speed up the Pollard rho algorithm by an extra \(\sqrt{m}\) factor [34, 91]. Notice that in practice this \(\sqrt{m}\) factor implies a relatively modest computational saving of no more than 4–5 bits. Taking the above into account, several international bodies have suggested that the size of the binary field where cryptographic elliptic curves are defined should be chosen by “adding about 10 bits in the binary field case” [27].Footnote 5
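To make these estimates concrete, the following Python sketch (an illustration written for this discussion, not part of any library cited here) computes the approximate Pollard rho cost in bits for a curve over \({\mathbb {F}}_{2^m}\) with a group order near \(2^m\), including the negation-map and Frobenius speedups quoted above:

```python
import math

def rho_security_bits(m, koblitz=True):
    """Approximate Pollard rho cost, in bits, for a group of order near
    q = 2^m: sqrt(pi * q) / 2 steps (the o(1) term is ignored), divided
    by sqrt(2) for the negation map, and by sqrt(m) for Koblitz curves."""
    q = 2.0 ** m
    steps = math.sqrt(math.pi * q) / 2.0
    steps /= math.sqrt(2.0)        # negation-map speedup
    if koblitz:
        steps /= math.sqrt(m)      # Frobenius-endomorphism speedup
    return math.log2(steps)
```

For example, for \(m = 283\) the Frobenius factor saves roughly \(\frac{1}{2}\log_2 283 \approx 4\) bits, in line with the 4–5 bit estimate above.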

1.1 Known Attacks Against Certain Classes of Binary Curves

At the time ECC was proposed, it appeared clear that binary supersingular curves were the most efficient curves in practice [52, 61]. However, when Menezes, Okamoto, and Vanstone published their famous MOV attack [58, 59], it was realized that for those curves the difficulty of computing discrete logarithms in \(E({\mathbb {F}}_{2^m})\) is actually comparable to that of the DLP in the multiplicative group of the field \({\mathbb {F}}_{2^{2m}}.\) Because of this, it was recommended not to use any kind of supersingular curve for cryptographic applications.

Surprisingly, this situation changed around the year 2000, when several papers proposing the usage of pairing-based protocols were published [19, 45, 46, 76]. All of a sudden, and after having been banned for almost a decade, it was again a good idea to implement cryptographic applications using binary (and also ternary) supersingular curves. As a consequence, several works reported efficient implementations of those curves on both hardware and software platforms [4, 13,14,15]. Nevertheless, disaster struck again about one decade later, when Joux [47] discovered his aforementioned DLP algorithm for binary field multiplicative groups, with a running time of \(L_q(1/4, c)\), for \(q = 2^m, c > 0.\) This marked the end of the usage of binary supersingular curves in cryptography.

Given a target ordinary binary curve defined over the field \({\mathbb {F}}_{2^m},\) the Gaudry–Hess–Smart (GHS) attack [33, 36, 41, 60] exploits the idea of finding an algebraic curve C of relatively small genus g such that the Jacobian of C contains the target elliptic curve group. In this case, the original elliptic curve discrete logarithm problem can be transferred to the Jacobian of C defined over \({\mathbb {F}}_{2^l},\) with \(l \mid m\). The hope is that if the genus of C is not too large, the DLP could be easier to solve in that Jacobian, due to the availability of an index-calculus strategy for that group [28].

In general, however, the GHS strategy is difficult to implement due to the large genus that a suitable curve C usually ends up having.Footnote 6 In fact, in [36, 60], it was proved that the GHS attack fails (i.e., the Pollard rho attack is more effective) for all binary elliptic curves defined over \({\mathbb {F}}_{2^m},\) where \(m\in [160, 600]\) is prime. Furthermore, in [57], it was proved that the GHS attack fails for most of the composite extensions in the range \(m\in [160, 600].\) To our knowledge, the largest instance where the GHS attack has been proved effective is the DLP computation over \(E({\mathbb {F}}_{2^{5\cdot 31}}).\) In [87], it was estimated that when using the Enge–Gaudry algorithm [28], the cost of such a computation is around 1736 core days.

1.2 Performance Advantage of Ordinary Binary Elliptic Curves

The scalar multiplication operation on binary elliptic curves can be computed significantly faster than on prime field curves, on both software and hardware platforms.

In software, the libraries reported in [3, 70, 71, 84] rank among the fastest Diffie–Hellman software benchmarked on the eBACS site [11] across various platforms. In particular, the software library announced in [70] holds the current speed record for constant-time variable-base-point Diffie–Hellman software at the 128-bit security level.

In hardware, due to the carry-less arithmetic, the computation of the scalar multiplication operation on binary elliptic curves yields substantial circuitry savings when compared with the full adders and integer multipliers required for prime field elliptic curves. Consequently, ECC accelerators using binary curves tend to be more compact and faster than their prime field curve counterparts (see [5, Table IV] for a comparison of recent designs). It should be remarked that for the Internet of Things and several other applications, hardware performance is much more important than software performance.

1.3 Choosing Side Channel-Resistant Binary Elliptic Curves

Since the publication of Kocher’s paper on differential power analysis [55], the cryptographic community has been increasingly concerned with producing side channel-resistant cryptographic software and hardware. In a post-Snowden world, this concern has only intensified. In the case of ECC, a first line of defense is the sound selection of elliptic curves with solid cryptographic properties against several known side-channel attacks, in both hardware and software platforms.

In [12] (see also [20]), the authors present several criteria that a so-called safe curve should exhibit. Among others, the following properties are listed: rigidity, safety against transfers, Montgomery-ladder-friendliness, twist security, etc. The reader is referred to [12] for a definition of each of the desirable security properties that a safe curve should enjoy. Unfortunately, the authors in [12, 20] explicitly exclude binary elliptic curves from their analysis, without elaborating on the technical arguments for avoiding them.

However, as will be discussed in the remainder of this paper, Koblitz curves (see their definition in the next subsection) meet most of the requirements specified for safe curves in [12]. In particular, Koblitz curves arguably stand among the purest mathematical elliptic curve constructions, which makes them fully rigid. Similarly, these curves support the most efficient Montgomery ladder formulas that we know of (consisting of only five field multiplications per bit, with the extra bonus of being amenable to parallelization). Moreover, Koblitz curves are suitable for right-to-left Montgomery ladders as discussed in [69]. This feature is especially valuable for the vast majority of protocols (such as the Diffie–Hellman protocol, digital signatures, key generation, etc.), which usually require the computation of one or more fixed-point scalar multiplications. Similarly, Koblitz curves enjoy twist security and they are transfer safe. On the other hand, they do not meet the completeness criterion as defined in [12].

1.4 Koblitz Curves

Anomalous binary curves, generally referred to as Koblitz curves, are binary elliptic curves satisfying the Weierstrass equation, \(E_a:y^2+xy = x^3 + ax^2+1\), with \(a\in \{0,1\}\). Since their introduction in 1991 by Koblitz [53], these curves have been extensively studied for their additional structure that allows, in principle, a performance speedup in the computation of the elliptic curve point multiplication operation. Also, Koblitz curves were historically the first family of ordinary elliptic curves proposed for cryptographic usage [52].

Koblitz curves defined over \({\mathbb {F}}_4\) were also proposed in [53]. Nevertheless, research to date on standardized Koblitz curves in commercial use, such as the binary curves standardized by NIST [65,66,67] or the suite of elliptic curves supported by the TLS protocol [17, 25], has exclusively analyzed the security and performance of curves defined over binary extension fields \({\mathbb {F}}_{2^m}\), with m a prime number (for recent examples, see [3, 18, 84, 89]).

We found it interesting to explore the cryptographic usage of Koblitz curves defined over \({\mathbb {F}}_4\) due to their inherent usage of quadratic field arithmetic. Indeed, it has recently been shown [56, 69, 71] that quadratic field arithmetic is extraordinarily efficient when implemented in software. This is because one can take full advantage of the Single Instruction Multiple Data (SIMD) paradigm, where a single vector instruction simultaneously performs the same operation on a set of input data items.

A quadratic extension \({\mathbb {F}}_{q^{2}}\) of a binary finite field \({\mathbb {F}}_{q}\) can be defined by means of a monic polynomial \(h(u) \in {\mathbb {F}}_{2}[u]\) of degree two that is irreducible over \({\mathbb {F}}_{q}\). The field \({\mathbb {F}}_{q^{2}}\) is isomorphic to \({\mathbb {F}}_{q}[u]/(h(u))\), and its elements can be represented as \(a_0 + a_1u\), with \(a_0, a_1 \in {\mathbb {F}}_{q}\). The addition of two elements \(a, b \in {\mathbb {F}}_{q^2}\) can be performed as \(c = (a_0 + b_0) + (a_1 + b_1)u.\) Using \(h(u) = u^2 + u +1,\) the product \(a\cdot b\) can be computed as \(d = a_0b_0 + a_1b_1 + ((a_0 + a_1)\cdot (b_0 + b_1) + a_0b_0)u,\) which requires only three \({\mathbb {F}}_{q}\) multiplications. By carefully organizing the code associated with these arithmetic operations, one can greatly exploit the instruction-level parallelism of the pipelines available in contemporary high-end processors.
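The Karatsuba-like formula above can be checked exhaustively for the smallest instance \(q = 2\) (so that \({\mathbb {F}}_{q^2} = {\mathbb {F}}_4\)). The following Python sketch, written purely for illustration, compares it against schoolbook multiplication with the reduction \(u^2 \mapsto u + 1\):

```python
# GF(4) elements a0 + a1*u over GF(2), encoded as bit pairs (a0, a1).

def mul_formula(a, b):
    # d = a0*b0 + a1*b1 + ((a0 + a1)*(b0 + b1) + a0*b0) * u
    a0, a1 = a
    b0, b1 = b
    t = a0 & b0
    return (t ^ (a1 & b1), ((a0 ^ a1) & (b0 ^ b1)) ^ t)

def mul_schoolbook(a, b):
    # Multiply term by term, then fold the u^2 coefficient with u^2 = u + 1.
    a0, a1 = a
    b0, b1 = b
    c0, c1, c2 = a0 & b0, (a0 & b1) ^ (a1 & b0), a1 & b1
    return (c0 ^ c2, c1 ^ c2)

F4 = [(0, 0), (1, 0), (0, 1), (1, 1)]
assert all(mul_formula(a, b) == mul_schoolbook(a, b) for a in F4 for b in F4)
```

For instance, with \(a = b = u\) both routines return \(u + 1 = u^2\), as expected.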

1.4.1 Our Contributions

In this work, we designed for the first time a 128-bit secure and timing attack-resistant scalar multiplication on a Koblitz curve defined over \({\mathbb {F}}_4,\) as proposed in [53]. Furthermore, we present a taxonomy of Koblitz-like elliptic curves. Some of these curves are equipped with more efficient endomorphisms, which to the best of our knowledge have not been discussed before.

We developed all the algorithms required for performing the scalar multiplication at the 128-bit security level for standard Koblitz curves, and also for one of the Koblitz-like elliptic curves introduced in this paper for the first time. This led us to reconsider the strategy of using redundant trinomials (also known as almost irreducible trinomials), which were proposed more than 10 years ago in [21, 26]. We also present what is perhaps the most comprehensive analysis to date of how to efficiently implement arithmetic operations in binary finite fields and their quadratic extensions using the vectorized instructions available in high-end microprocessors.

The remainder of this paper is organized as follows. In Sect. 2, we formally introduce the family of Koblitz elliptic curves defined over \({\mathbb {F}}_4.\) In Sect. 3, we analyze the twelve ordinary elliptic curves that can be defined over \({\mathbb {F}}_4\) and their classification into isogeny classes. For one isogeny class, curves equipped with efficient endomorphisms are introduced. In Sects. 4 and 5, a detailed description of the efficient implementation of the base and quadratic field arithmetic using vectorized instructions is given. We introduce in Sect. 6 the scalar multiplication algorithms used in this work, and we present in Sect. 7 the analysis and discussion of the results obtained by our software library. Finally, we draw our concluding remarks in Sect. 8.

2 Koblitz Curves over \({\mathbb {F}}_4\)

Koblitz curves over \({\mathbb {F}}_4\) are defined by the following equation

$$\begin{aligned} E_a:y^2 + xy = x^3 + a \gamma x^2 + \gamma , \end{aligned}$$
(1)

where \(\gamma \in {\mathbb {F}}_{4}\) satisfies \(\gamma ^2 = \gamma + 1\) and \(a \in \{0,1\}\). The number of points on the curves \(E_0/{\mathbb {F}}_4\) and \(E_1/{\mathbb {F}}_4\) is 4 and 6, respectively. For cryptographic purposes, one uses (1) over extension fields of the form \({\mathbb {F}}_{4^m},\) with m a prime number. The set of affine points \(P = (x, y) \in {\mathbb {F}}_{4^m} \times {\mathbb {F}}_{4^m}\) that satisfy (1), together with the point at infinity \(\mathcal O\), forms an abelian group denoted by \(E_a({\mathbb {F}}_{4^m}),\) whose group law is the point addition operation.
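The stated orders of \(E_0/{\mathbb {F}}_4\) and \(E_1/{\mathbb {F}}_4\) can be verified by brute force. The following Python sketch (for illustration only) enumerates the affine points over \({\mathbb {F}}_4\):

```python
# GF(4) as bit pairs (c0, c1) = c0 + c1*g, with g^2 = g + 1.
def add(a, b):
    return (a[0] ^ b[0], a[1] ^ b[1])

def mul(a, b):
    a0, a1, b0, b1 = a[0], a[1], b[0], b[1]
    return (a0 & b0 ^ a1 & b1, (a0 ^ a1) & (b0 ^ b1) ^ a0 & b0)

F4 = [(0, 0), (1, 0), (0, 1), (1, 1)]
g = (0, 1)  # gamma, satisfying gamma^2 = gamma + 1

def order(a):
    """Brute-force #E_a(GF(4)) for E_a: y^2 + xy = x^3 + a*gamma*x^2 + gamma."""
    ag = mul((a, 0), g)       # the coefficient a*gamma, a in {0, 1}
    count = 1                 # the point at infinity
    for x in F4:
        x2 = mul(x, x)
        rhs = add(add(mul(x2, x), mul(ag, x2)), g)
        count += sum(add(mul(y, y), mul(x, y)) == rhs for y in F4)
    return count
```

Running it confirms \(\#E_0({\mathbb {F}}_4) = 4\) and \(\#E_1({\mathbb {F}}_4) = 6\).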

Table 1 Almost-prime group orders \(\#E_a({\mathbb {F}}_{4^m})\) with prime \(m \in \{127, \ldots , 191\}\)

For each divisor r of s, \(E({\mathbb {F}}_{4^r})\) is a subgroup of \(E({\mathbb {F}}_{4^s})\); hence, \(\#E({\mathbb {F}}_{4^r})\) divides \(\#E({\mathbb {F}}_{4^s})\) and, consequently, the order of every group of points on a quadratic Koblitz curve is divisible by 4 or 6 (depending on the curve a-parameter). On the other hand, by choosing specific prime extensions m, it is possible to find groups \(E_a({\mathbb {F}}_{4^m})\) with almost-prime order, for instance, \(E_0({\mathbb {F}}_{4^{163}})\) and \(E_1({\mathbb {F}}_{4^{167}})\). In Table 1, we present the almost-prime group orders \(\#E_a({\mathbb {F}}_{4^m})\) for prime degrees \(m \in \{127, \ldots , 191\}\).

The Frobenius map \(\tau : E_a({\mathbb {F}}_{4^m}) \rightarrow E_a({\mathbb {F}}_{4^m})\), defined by \(\tau (\mathcal {O}) = \mathcal {O}\) and \(\tau (x,y) = (x^4,y^4),\) is a curve automorphism satisfying \((\tau ^2 + 4)P = \mu \tau (P)\) for \(\mu =(-1)^{a}\) and all \({P \in E_a({\mathbb {F}}_{4^m})}\). By solving the equation \(\tau ^2+4 = \mu \tau \), the Frobenius map can be seen as the complex number \(\tau = (\mu \pm \sqrt{-15})/2\).

2.1 The \(\tau \)-Adic Representation

Given a Koblitz curve \(E_a/{\mathbb {F}}_{4^{m}}\) with group order \(\#E_a({\mathbb {F}}_{4^{m}}) = h \cdot p \cdot r\), where h is the order of \(E_a({\mathbb {F}}_4)\), r is the order of our subgroup of interest, and p is the order of a group of no cryptographic interest,Footnote 7 we can express a scalar \(k \in \mathbb {Z}/r\mathbb {Z}\) as an element in \(\mathbb {Z}[\tau ]\) using the classical partial reduction introduced by Solinas [82], with a few modifications. The modified version is based on the fact that \(\tau ^2 = \mu \tau - 4\).

Given that the norm of \(\tau \) is \(N(\tau ) = 4\), \(N(\tau -1) = h\), \(N(\tau ^m - 1) = h \cdot p \cdot r\), and \(N((\tau ^m - 1)/(\tau - 1)) = p \cdot r\), the subscalars \(k_0\) and \(k_1\) resulting from the partial reduction will both be of size approximately \(\sqrt{p \cdot r}\). As a consequence, the corresponding scalar multiplication will need more iterations than expected, since the order p of a subgroup which is not of cryptographic interest will also be taken into account in the computation. For that reason, we took the design decision of assuming that the input scalar of our point multiplication algorithm is already given in the \(\mathbb {Z}[\tau ]\) domain.

As a result, a partial reduction of the scalar k is no longer required, and the number of iterations in the point multiplication will be consistent with the size of the scalar k. If one needs to retrieve the equivalent value of the scalar k in the ring \(\mathbb {Z}/r\mathbb {Z}\), this can easily be computed with one multiplication and one addition in \(\mathbb {Z}/r\mathbb {Z}\). This strategy is in line with the degree-2 scalar decomposition method within the GLS curves context, as suggested in [32].
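The norm values quoted above can be reproduced with exact arithmetic in \(\mathbb {Z}[\tau ]\). Since \(\tau + \bar{\tau } = \mu \) and \(\tau \bar{\tau } = 4\), the norm is \(N(x + y\tau ) = x^2 + \mu xy + 4y^2\); the following sketch (illustrative only) checks \(N(\tau ) = 4\) and \(N(\tau - 1) = h\) for both curve parameters:

```python
def norm(x, y, mu):
    # N(x + y*tau) in Z[tau], where tau = (mu +/- sqrt(-15))/2, so that
    # tau + conj(tau) = mu and tau * conj(tau) = 4.
    return x * x + mu * x * y + 4 * y * y

for mu, h in [(1, 4), (-1, 6)]:  # mu = (-1)^a, h = #E_a(GF(4))
    assert norm(0, 1, mu) == 4   # N(tau) = 4
    assert norm(-1, 1, mu) == h  # N(tau - 1) = h
```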

2.2 The Width-\(w\,\tau \)NAF Form

Assuming that the scalar k is specified in the \(\mathbb {Z}[\tau ]\) domain, one can represent the scalar in the regular width-w \(\tau \)NAF form [69] by slightly adapting the method to the \({\mathbb {F}}_4\) case. The length of the width-w \(\tau \)NAF representation of an element \(k \in \mathbb {Z}[\tau ]\) is discussed in [81]. Given a width w, after running the regular \(\tau \)NAF algorithm, we have \(2^{2(w-1)-1}\) different digits.Footnote 8

Table 2 Representations of \(\alpha _v = v \bmod {\tau ^w}\), for \(w \in \{2, 3, 4\}\), \(a = 1\) and the required operations for computing \(\alpha _v\)

Therefore, it is necessary to be more conservative when choosing the width w than for Koblitz curves defined over \({\mathbb {F}}_2\). For widths \(w = 2, 3, 4, 5\), we have to pre- or post-compute 2, 8, 32, and 128 points, respectively. In order to construct an efficient 128-bit secure point multiplication, we estimated that the value of the width w must be at most four. Otherwise, the cost of the point pre- and post-processing outweighs the additional savings obtained in the main iteration.
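The digit counts above follow directly from the \(2^{2(w-1)-1}\) formula (a quick illustrative check):

```python
# Points to pre- or post-compute for the regular width-w tauNAF over GF(4),
# per the 2^(2(w-1)-1) digit-count formula.
digits = {w: 2 ** (2 * (w - 1) - 1) for w in (2, 3, 4, 5)}
assert digits == {2: 2, 3: 8, 4: 32, 5: 128}
```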

In addition, we must find efficient ways of computing the values \(\alpha _v = v \bmod {\tau ^w}\). The method for searching for the best expressions in Koblitz curves over \({\mathbb {F}}_2\) [85] cannot be directly applied in the \({\mathbb {F}}_4\) case. As a result, we manually derived \(\alpha _v\) representations for \(w \in \{2, 3, 4\}\) and \(a = 1\), which are our implementation parameters.Footnote 9 The rationale for our chosen representations was to minimize the number of field arithmetic operations. In practice, we must reduce the number of full point additions in favor of point doublings and mixed additions.Footnote 10 In Table 2, we present the \(\alpha _v\) representatives along with the operations required to generate the multiples of the base point.Footnote 11

As a result, generating the points \(\alpha _v P\) requires one point doubling and one full addition for \(w = 2\); one point doubling, four full additions, three mixed additions, and four applications of the Frobenius map for \(w = 3\); and one point doubling, twenty full additions, eleven mixed additions, and five applications of the Frobenius map for \(w = 4\).

2.3 Security Analysis Against the GHS Attack

Given that the Koblitz curve groups \(E_a({\mathbb {F}}_{4^m})\) are defined over quadratic extension fields, it is conceivable that Weil descent attacks [36, 41] could be efficiently applied to these curves. However, Menezes and Qu showed in [60] that the GHS attack cannot be implemented efficiently for elliptic curves defined over binary extension fields \({\mathbb {F}}_{q},\) with \(q=2^m,\) and m a prime number in [160, 600]. Further, a specialized analysis for binary curves defined over fields of the form \({\mathbb {F}}_{4^m}\) reported in [39] proved that the only vulnerable prime extension in the range [80, 256] is \(m = 127\).

3 Extended Koblitz Curves over \({\mathbb {F}}_4\)

There exist several ordinary elliptic curves over \({\mathbb {F}}_4\) that, strictly speaking, cannot be considered Koblitz curves as defined in Sect. 2. Since some of these additional curves come equipped with additional endomorphisms, they are also of cryptographic interest. This extended set of Koblitz curves can be better described using isogeny classes, as discussed in the following subsection.

3.1 Isogenies

Let \(E_0\) and \(E_1\) be two elliptic curves defined over a field \(\mathbb {K}.\) An isogeny is a non-constant homomorphism \(\phi : E_0(\bar{\mathbb {K}}) \rightarrow E_1(\bar{\mathbb {K}})\) with \(\phi (\mathcal O) = \mathcal O,\) which can be described by means of rational functions: \(\phi (P + Q) = \phi (P) + \phi (Q)\) and \(\phi (x_0,y_0) = (x_1, y_1),\) where \(x_1 = R(x_0)\) and \(y_1 = y_0S(x_0)\) for rational functions R and S. If \(R(x) = \frac{p(x)}{q(x)}\) and \(\gcd (p(x), q(x)) = 1,\) then the degree of \(\phi \) is \(\max \{\deg ({p(x)}), \deg (q(x))\}\). If \(\phi \) has the same domain and co-domain, then \(\phi \) is an endomorphism. If \(\phi \) has degree one, then \(\phi \) is an isomorphism. Two elliptic curves are isomorphic if an isomorphism between them exists.

It is known that two curves \(E_0\) and \(E_1\) are isogenous over \(\mathbb {K}\) if and only if they have the same number of points [83, Theorem 1]. This fact enables the classification of elliptic curves into isogeny classes, that is, by their point cardinality.

3.2 The Twelve Ordinary Elliptic Curves over \({\mathbb {F}}_4\)

The description of Koblitz curves given in Sect. 2 defines four ordinary elliptic curves. Nevertheless, there is a total of 12 ordinary elliptic curves over the field \({\mathbb {F}}_4.\) As shown in Table 3, these 12 ordinary elliptic curves define four different isogeny classes.

Table 3 Twelve ordinary elliptic curves \(E_{(a, b)}/{\mathbb {F}}_4: y^2 + xy = x^3 + ax^2 + b\) define four isogeny classes. The curve parameters \(a, b \in {\mathbb {F}}_4\) can take the values \(\{0, 1, \gamma , \gamma ^2\},\) with \(\gamma \in {\mathbb {F}}_4{\setminus }{\mathbb {F}}_2\)

Notice that the Koblitz curves described in Sect. 2 correspond to isogeny classes 1 and 2 of Table 3, with curve parameters (a, b) given as \((\gamma , \gamma )\), \((\gamma ^2 , \gamma ^2)\), \((0, \gamma ),\) and \((0, \gamma ^2).\) Since these two classes were already studied, in the following, we will focus our attention on isogeny classes 0 and 3.

The curves \(E_{(1, 1)}\) and \(E_{(0, 1)}\) are the only two members of isogeny class 0. Notice that the curve parameters a, b of these curves lie in \({\mathbb {F}}_2,\) and that \(E_{(1, 1)}\) is the quadratic twist of \(E_{(0, 1)}\) over \({\mathbb {F}}_{2^m}.\) Furthermore, it can be shown that \(E_{(1, 1)}\) and \(E_{(0, 1)}\) become isomorphic over \({\mathbb {F}}_4\) and that the standard twist order relation \(\#E_{(0, 1)}({\mathbb {F}}_{2^m}) \cdot \#E_{(1, 1)}({\mathbb {F}}_{2^m}) = \#E_{(0, 1)}({\mathbb {F}}_{4^m})\) holds. A direct consequence of this relation is that the largest prime factor of \(\#E_{(0, 1)}({\mathbb {F}}_{4^m})\) must be smaller than the order \(\#E_{(0, 1)}({\mathbb {F}}_{2^m})\approx \#E_{(1, 1)}({\mathbb {F}}_{2^m})\approx 2^m\). Thus, when the two curves in isogeny class 0 are defined over the field \({\mathbb {F}}_{4^m}\), one can only hope to achieve at most an \(\frac{m}{2}\)-bit security level. Hence, we conclude that isogeny class 0 is of little or no cryptographic value.
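As a small sanity check (for the case \(m = 1\), using Python purely for illustration) of the quadratic-twist order identity \(\#E({\mathbb {F}}_{q^2}) = \#E({\mathbb {F}}_{q}) \cdot \#E'({\mathbb {F}}_{q})\) invoked here:

```python
# Brute-force orders of E_(a,b): y^2 + xy = x^3 + a*x^2 + b over GF(2)
# and GF(4) (bit pairs (c0, c1) = c0 + c1*g, with g^2 = g + 1).
def add(a, b):
    return (a[0] ^ b[0], a[1] ^ b[1])

def mul(a, b):
    a0, a1, b0, b1 = a[0], a[1], b[0], b[1]
    return (a0 & b0 ^ a1 & b1, (a0 ^ a1) & (b0 ^ b1) ^ a0 & b0)

def order(a, b, field):
    count = 1  # the point at infinity
    for x in field:
        x2 = mul(x, x)
        rhs = add(add(mul(x2, x), mul(a, x2)), b)
        count += sum(add(mul(y, y), mul(x, y)) == rhs for y in field)
    return count

F2 = [(0, 0), (1, 0)]
F4 = F2 + [(0, 1), (1, 1)]
zero, one = (0, 0), (1, 0)

# E_(0,1) and its quadratic twist E_(1,1): the product of their GF(2)
# orders equals the order of E_(0,1) over GF(4).
assert order(zero, one, F2) * order(one, one, F2) == order(zero, one, F4)
```

Here \(\#E_{(0,1)}({\mathbb {F}}_2) = 4\), \(\#E_{(1,1)}({\mathbb {F}}_2) = 2\), and \(\#E_{(0,1)}({\mathbb {F}}_4) = 8\).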

3.3 A Novel Endomorphism for Isogeny Class 3

The curves \(E_{(\gamma , 1)}\) and \(E_{(\gamma ^2, 1)}\) are the only two members of isogeny class 3 shown in Table 3. Both of these two curves are equipped with an efficient endomorphism as discussed next.

It can be seen that the mapping \(\tau _2 : E_{(a, b)}({\mathbb {F}}_{q}) \rightarrow E_{(a', b')}({\mathbb {F}}_{q})\), defined by \(\tau _2(\mathcal {O}) = \mathcal {O}\) and \(\tau _2(x,y) = (x^2, y^2),\) is a degree-two isogeny such that \(\tau _2(E_{(\gamma , 1)}({\mathbb {F}}_{q})) = E_{(\gamma ^2, 1)}({\mathbb {F}}_{q})\) and \(\tau _2(E_{(\gamma ^2, 1)}({\mathbb {F}}_{q})) = E_{(\gamma , 1)}({\mathbb {F}}_{q}).\) Moreover, the curves \(E_{(\gamma ,1)}\) and \(E_{(\gamma ^2, 1)}\) are also isomorphic, since one can define the isomorphisms \(\phi _0: E_{(\gamma , 1)} \rightarrow E_{(\gamma ^2, 1)}\) and \(\phi _1 : E_{(\gamma ^2, 1)} \rightarrow E_{(\gamma , 1)}\) such that \(\phi _0(x, y) = (x, y + \gamma \cdot x)\) and \(\phi _1(x, y) = (x, y + \gamma ^2\cdot x).\) As illustrated in Fig. 1, for each of the curves in class 3, one can therefore build two novel endomorphisms \({\bar{\tau }_0}, {\bar{\tau }_1}\) as follows:

$$\begin{aligned} {\bar{\tau }_0}(x, y)&= (\phi _0 \circ \tau _2)(x,y) = (x^2, y^2 + \gamma \cdot x^2)\\ {\bar{\tau }_1}(x, y)&= (\phi _1 \circ \tau _2)(x, y) = (x^2 , y^2 + \gamma ^2\cdot x^2). \end{aligned}$$

Using affine \(\lambda \)-coordinates as defined in [71], the endomorphisms \(\bar{\tau }_0\) and \(\bar{\tau }_1\) can be written as \(\bar{\tau }_0(x, \lambda ) = (x^2 , \lambda ^2 + \gamma )\) and \(\bar{\tau }_1(x, \lambda ) = (x^2 , \lambda ^2 + \gamma ^2),\) respectively. Using projective \(\lambda \)-coordinates, they become \(\bar{\tau }_0(X, L, Z) = (X^2 , L^2 + \gamma \cdot Z^2, Z^2)\) and \(\bar{\tau }_1(X, L, Z) = (X^2 , L^2 + \gamma ^2\cdot Z^2, Z^2)\).
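As an illustrative check of these formulas (not part of our library), the following Python sketch builds \({\mathbb {F}}_{16} = {\mathbb {F}}_{4}[v]/(v^2 + v + \gamma )\) (this particular modulus is our choice for the example only) and verifies that \(\bar{\tau }_0\) maps every point of \(E_{(\gamma ^2, 1)}({\mathbb {F}}_{16})\) back onto the same curve:

```python
# GF(4): bit pairs, G = gamma with G^2 = G + 1.
def a4(a, b):
    return (a[0] ^ b[0], a[1] ^ b[1])

def m4(a, b):
    a0, a1, b0, b1 = a[0], a[1], b[0], b[1]
    return (a0 & b0 ^ a1 & b1, (a0 ^ a1) & (b0 ^ b1) ^ a0 & b0)

Z, U, G = (0, 0), (1, 0), (0, 1)
G2 = m4(G, G)  # gamma^2

# GF(16) = GF(4)[v]/(v^2 + v + gamma): pairs of GF(4) elements.
def a16(a, b):
    return (a4(a[0], b[0]), a4(a[1], b[1]))

def m16(a, b):
    a0, a1, b0, b1 = a[0], a[1], b[0], b[1]
    hi = m4(a1, b1)                      # coefficient of v^2
    return (a4(m4(a0, b0), m4(G, hi)),   # fold v^2 = v + gamma
            a4(a4(m4(a0, b1), m4(a1, b0)), hi))

F16 = [(x, y) for x in (Z, U, G, G2) for y in (Z, U, G, G2)]
one, g, g2 = (U, Z), (G, Z), (G2, Z)     # 1, gamma, gamma^2 embedded in GF(16)

def on_curve(P):
    # E_(gamma^2, 1): y^2 + xy = x^3 + gamma^2 * x^2 + 1
    x, y = P
    x2 = m16(x, x)
    return a16(m16(y, y), m16(x, y)) == a16(a16(m16(x2, x), m16(g2, x2)), one)

def bar_tau0(P):
    # bar_tau0(x, y) = (x^2, y^2 + gamma * x^2)
    x, y = P
    x2 = m16(x, x)
    return (x2, a16(m16(y, y), m16(g, x2)))

pts = [(x, y) for x in F16 for y in F16 if on_curve((x, y))]
assert len(pts) == 15  # #E(GF(16)) = 16, counting the point at infinity
assert all(on_curve(bar_tau0(P)) for P in pts)
```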

Fig. 1 Construction of the endomorphisms \(\bar{\tau }_0\) and \(\bar{\tau }_1\) for the isogeny class-3 elliptic curves \(E_{(\gamma , 1)}\) and \(E_{(\gamma ^2, 1)}\)

Since \(\bar{\tau }_0\) and \(\bar{\tau }_1\) satisfy the same properties, in the following we will use the symbol \(\bar{\tau }\) to refer to either of them. We stress that \(\bar{\tau }\) is computationally cheaper than the endomorphism \(\tau \) of the Koblitz curves discussed in the previous section. Indeed, the computational cost of \(\bar{\tau }\) is two squaring operations instead of the four squaring operations associated with \(\tau .\) As will be further discussed in Sect. 7, this computational saving induces an important reduction in the number of pre-computed points for the point multiplication \(Q = kP\) that uses a width-w \(\tau \)NAF scalar representation.

Another interesting property of the \(\bar{\tau }\) endomorphism is that \(\bar{\tau }^2(x,y) = (x^4 , y^4 + x^4) = -\tau (x,y)\). Moreover, for the elliptic curves in isogeny class 3, \(\tau \) satisfies the equation \(\tau ^2 + 4 = 3\tau \), which implies that \(\bar{\tau }^2 + \bar{\tau }= -2.\) It also follows that \(\bar{\tau }=(-1 + \sqrt{-7})/2.\) Since the ring \(\mathbb {Z}[(-1 + \sqrt{-7})/2]\) has been extensively studied in the literature, we can adopt the existing methods reported in [3, 18, 84, 85] for performing the width-w \(\tau \)NAF scalar recoding.
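These algebraic relations can be double-checked with exact arithmetic in \(\mathbb {Z}[\omega ]\), \(\omega = (-1 + \sqrt{-7})/2\) (an illustrative sketch, representing \(x + y\omega \) as integer pairs):

```python
# Elements x + y*w of Z[w], with w = (-1 + sqrt(-7))/2, so w^2 + w + 2 = 0.
def mul(p, q):
    a, b = p
    c, d = q
    # (a + b*w)(c + d*w) = ac + (ad + bc)*w + bd*w^2, and w^2 = -w - 2.
    return (a * c - 2 * b * d, a * d + b * c - b * d)

w = (0, 1)
tau = (2, 1)  # tau = (3 + sqrt(-7))/2 = 2 + w, the class-3 Frobenius

assert mul(w, w) == (-2, -1)            # w^2 = -2 - w, i.e. w^2 + w = -2
assert mul(w, w) == (-tau[0], -tau[1])  # bar_tau^2 = -tau
t2 = mul(tau, tau)
assert (t2[0] + 4, t2[1]) == (3 * tau[0], 3 * tau[1])  # tau^2 + 4 = 3*tau
```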

We computed the cardinality of the elliptic curves belonging to isogeny class 3 defined over the field \({\mathbb {F}}_{4^m}\), with m a prime extension degree in the range [127, 191]. From this experiment, we found that the only extension of cryptographic interest is \(m=163\). Indeed, for the extension field \({\mathbb {F}}_{4^{163}}\), the cardinality of the elliptic curves in class 3 has the following integer factorization:

$$\begin{aligned}&\texttt {0x2} \cdot \texttt {0x28D} \cdot \texttt {0xC8B90A95C20EE5BBC91D671B0CEFED2EA}\\&\quad \texttt {7901F5CEAAA522F37A4E0D020A19EBBDC1D0437C458139}. \end{aligned}$$

The largest prime factor above has a size of approximately 316 bits. Hence, its associated security level is around 158 bits. This curve is comfortably above the 128-bit security level (even under the criterion that, for a given field extension of degree m, a binary curve offers ten fewer bits of security than \(\lfloor \frac{m}{2}\rfloor \)).

4 Base Field Arithmetic

In this section, we introduce practical aspects of the Koblitz curves previously presented. More particularly, we describe techniques for an efficient software implementation of 128-bit secure scalar multiplication algorithms over selected Koblitz curves defined over \({\mathbb {F}}_4\). Our library was specially designed for high-end 64-bit processors equipped with a 64-bit carry-less multiplication instruction and 128-bit vector registers that store and simultaneously process two 64-bit words [1, 43].

We based our curve selection on two factors: security and performance. For conservative scenarios, we propose the class-3 curve over \({\mathbb {F}}_{4^{163}}\). This curve offers about 156 bits of security, comfortably absorbing the 10-bit security margin suggested for binary curves (see Sect. 1 for further discussion). Moreover, it is equipped with a faster endomorphism than the class-1 and class-2 curves. For other scenarios, we suggest the class-1 curve over \({\mathbb {F}}_{4^{149}}\), chosen because of its 254-bit prime subgroup order, which yields a security level of approximately 128 bits.

4.1 Modular Reduction

One can construct a binary extension field \({\mathbb {F}}_{2^m}\) by taking a polynomial \(f(x) \in {\mathbb {F}}_2[x]\) of degree m that is irreducible over \({\mathbb {F}}_2\). It is very important that the form of the polynomial f(x) admit an efficient modular reduction. The criteria for selecting f(x) depend on the target architecture, as extensively discussed in [78].

For our field extension degrees \(m \in \{149, 163\}\), there are no irreducible trinomials of degree m in the ring \({\mathbb {F}}_2[x]\). The immediate solution is to construct the fields \({\mathbb {F}}_{2^{149}}\) and \({\mathbb {F}}_{2^{163}}\) through irreducible pentanomials. However, the cost of a modular reduction with pentanomials is excessively high when compared with the field multiplication computed with carry-less instructions. This is because we need to perform four shift-and-add operations per reduction step. Besides, most of those operations require costly shift instructions, since the number of bits to be shifted is often not a multiple of the word size (64 bits on our target architecture).

As a consequence, we resorted to the redundant trinomial strategy introduced in [21, 26]. Given a non-irreducible trinomial g(x) of degree n that is divisible by an irreducible polynomial f(x) of degree \(m < n\), the idea is to perform the field reduction modulo g(x) throughout the scalar multiplication and, at the end of the algorithm, reduce the polynomials modulo f(x). Considering that our target platform features a native 64-bit carry-less multiplier, an efficient representation of the field elements must have at most 192 bits, that is, three 64-bit words. Furthermore, given a redundant trinomial \(g(x) = x^n + x^a + x^b\), most of the following constraints should be satisfied for an efficient reduction:

  1. (R1)

    The difference \((n-a) \ge 64\), which allows us to perform the reduction in an optimal number of steps in the interleaved representation (see Sect. 5 for more details).

  2. (R2)

    The properties \((n-a) \equiv 0 \pmod {64}\), \((n-b) \equiv 0 \pmod {64}\) result in faster shift-and-add steps after the field multiplication and squaring operations. This is because they avoid additions between words, which require a series of left and right shifts to align the data within the registers.

  3. (R3)

    The properties \((n-a) \equiv \pm 1 \pmod {64}\), \((n-b) \equiv \pm 1 \pmod {64}\) result in faster shift-and-add steps after the field squaring operation. The squaring operation consists of interleaving zeroes between the bits of the binary representation of a field element [40]. Therefore, parameters with this configuration help us avoid one-bit shifts and one field addition in each reduction step.

For constructing the extension field \({\mathbb {F}}_{4^{149}}\), we selected the trinomial \(g_{149}(x) = x^{192} + x^{19} + 1\), since it complies with criteria R1 and R2. This polynomial is divisible by a 69-term irreducible polynomial \(f_{149}(x)\) of degree 149. The field \({\mathbb {F}}_{4^{163}}\) is built via the trinomial \(g_{163}(x) = x^{192} + x^3 + x^2\), which only satisfies constraint R1. This affects the efficiency of the basic arithmetic operations; however, there were no better options for constructing this field. The trinomial \(g_{163}(x)\) is divisible by an 87-term irreducible polynomial \(f_{163}(x)\) of degree 163. The final reduction by \(f_m(x)\) is performed via the mul-and-add approach,Footnote 12 which, experimentally, performed more efficiently than the shift-and-add reduction for irreducible polynomials with a large number of terms.

5 Quadratic Field Arithmetic

The quadratic field is constructed by the degree-two irreducible polynomial \(h(u) = u^2 + u + 1\). Elements in this field are represented as \(a_0 + a_1u\), with \(a_0, a_1 \in {\mathbb {F}}_{2^{m}}\). As discussed in Sect. 1, the \({\mathbb {F}}_{4^m}\) arithmetic can be implemented in terms of operations in the base field, resulting in considerable speedups due to the internal processor parallelism.

5.1 Register Allocation

The first question to be addressed when implementing an efficient quadratic field arithmetic is: how should the binary representation of the field elements be allocated into the registers available in the architecture? In our case, we have to store two 192-bit polynomials into 128-bit vector registers so that a fast modular reduction is possible and the overhead in the two main arithmetic operations (i.e., multiplication and squaring) is kept to a minimum.

Let us consider an element \(a = (a_0 + a_1u) \in {\mathbb {F}}_{4^{m}}\), where \(a_0 = Cx^{128} + Bx^{64} + A\) and \(a_1 = Fx^{128} + Ex^{64} + D\) are 192-bit polynomials, each one of them stored into three 64-bit words (A-C, D-F). Also, suppose we have three 128-bit registers \(R_i\), with \(i \in \{0, 1, 2\}\), which can store two packed 64-bit words each.Footnote 13 The first option is to rearrange the extension field element a as

$$\begin{aligned} R_0 = A | B, \quad R_1 = C | D, \quad R_2 = E | F. \end{aligned}$$

The problem with this representation is that a significant overhead is generated in the multiplication function, more specifically in the pre-computation phase of the Karatsuba procedure (cf. Sect. 5.2 with the computation of \(V_{0,1}\), \(V_{0,2}\), and \(V_{1,2}\)). Besides, in order to efficiently perform the subsequent reduction phase, we must temporarily store the polynomial terms into four 128-bit vectors, which could cause register spilling. A better method for storing the element a is to use the interleaved arrangement as follows:

$$\begin{aligned} R_0 = D | A, \quad R_1 = E | B, \quad R_2 = F | C. \end{aligned}$$

Using this setting, some overhead in the multiplication and squaring operations still exists, even though the penalty on the latter operation is almost negligible. On the positive side, the terms of the elements \(a_0, a_1\) do not need to be rearranged, and the modular reduction of these two base field elements can be performed in parallel.

5.2 Multiplication

Given two \({\mathbb {F}}_{4^{m}}\) elements \(a = (a_0 + a_1u)\) and \(b = (b_0 + b_1u)\), with \(a_0, a_1, b_0, b_1\) in \({\mathbb {F}}_{2^{m}}\), we perform the multiplication \(c = a \cdot b\) as

$$\begin{aligned} c = a \cdot b&= (a_0 + a_1u) \cdot (b_0 + b_1u)\\&= (a_0 b_0 \oplus a_1b_1) + (a_0b_0 \oplus (a_0 \oplus a_1)\cdot (b_0 \oplus b_1))u, \end{aligned}$$

where each element \(a_i, b_i \in {\mathbb {F}}_{2^{m}}\) is composed of three 64-bit words. The analysis of the Karatsuba algorithm cost for different word sizes was presented in [88]. There, it was shown that the most efficient way to multiply three-word polynomials \(s(x) = s_2x^2 + s_1x + s_0\) and \(t(x) = t_2x^2 + t_1x + t_0\) as \(v(x) = s(x) \cdot t(x)\) is through the one-level Karatsuba method:

$$\begin{aligned}&\displaystyle V_0 = s_0 \cdot t_0, \quad V_1 = s_1 \cdot t_1, \quad V_2 = s_2 \cdot t_2,\\&\displaystyle V_{0,1} = (s_0 \oplus s_1) \cdot (t_0 \oplus t_1), \quad V_{0,2} = (s_0 \oplus s_2) \cdot (t_0 \oplus t_2) \quad V_{1,2} = (s_1 \oplus s_2) \cdot (t_1 \oplus t_2),\\&\displaystyle v(x) = V_2x^4 + (V_{1,2} \oplus V_1 \oplus V_2)x^3 + (V_{0,2} \oplus V_0 \oplus V_1 \oplus V_2)x^2 + (V_{0,1} \oplus V_0 \oplus V_1)x + V_0, \end{aligned}$$

which costs six multiplications and 12 additions.
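The scheme above maps directly onto a 64-bit carry-less multiplier. The following plain-C sketch illustrates it for two 192-bit binary polynomials stored in three 64-bit words each; here clmul64 is a software stand-in for the hardware carry-less multiplication instruction, and all names are illustrative rather than taken from our library.

```c
#include <stdint.h>

/* Software stand-in for a 64-bit carry-less (GF(2)[x]) multiplier. */
static void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t l = 0, h = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            l ^= a << i;
            if (i > 0) h ^= a >> (64 - i);
        }
    }
    *lo = l; *hi = h;
}

/* One-level Karatsuba product of 192-bit binary polynomials:
   six 64x64 carry-less multiplications instead of nine. */
void mul192(const uint64_t s[3], const uint64_t t[3], uint64_t v[6])
{
    uint64_t V0[2], V1[2], V2[2], V01[2], V02[2], V12[2];

    clmul64(s[0], t[0], &V0[0], &V0[1]);
    clmul64(s[1], t[1], &V1[0], &V1[1]);
    clmul64(s[2], t[2], &V2[0], &V2[1]);
    clmul64(s[0] ^ s[1], t[0] ^ t[1], &V01[0], &V01[1]);
    clmul64(s[0] ^ s[2], t[0] ^ t[2], &V02[0], &V02[1]);
    clmul64(s[1] ^ s[2], t[1] ^ t[2], &V12[0], &V12[1]);

    /* 128-bit coefficients of x^64 and x^128, x^192 per the formula */
    uint64_t C1[2] = { V01[0] ^ V0[0] ^ V1[0], V01[1] ^ V0[1] ^ V1[1] };
    uint64_t C2[2] = { V02[0] ^ V0[0] ^ V1[0] ^ V2[0],
                       V02[1] ^ V0[1] ^ V1[1] ^ V2[1] };
    uint64_t C3[2] = { V12[0] ^ V1[0] ^ V2[0], V12[1] ^ V1[1] ^ V2[1] };

    v[0] = V0[0];
    v[1] = V0[1] ^ C1[0];
    v[2] = C1[1] ^ C2[0];
    v[3] = C2[1] ^ C3[0];
    v[4] = C3[1] ^ V2[0];
    v[5] = V2[1];
}
```

In the actual implementation, each clmul64 call corresponds to one carry-less multiplication instruction, and the word-level xors above are performed on 128-bit vector registers.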

The algorithm requires six carry-less instructions, six vectorized xors, and three bitwise shift instructions. In order to calculate the total multiplication cost, it is necessary to include the Karatsuba pre-computation operations at the base field level (12 vectorized xors and six byte interleaving instructions) and at the quadratic field level (six vectorized xors). Also, we must consider the reorganization of the registers in order to proceed with the modular reduction (six vectorized xors).
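At the quadratic-extension level, the formula given at the beginning of this section needs only three base-field multiplications. The toy sketch below checks it against the schoolbook expansion with \(u^2 = u + 1\); base-field elements are modeled as small GF(2)[x] polynomials in a single word, with no modular reduction, purely for illustration.

```c
#include <stdint.h>

/* Carry-less multiplication of small binary polynomials; operands are
   assumed small enough for the product to fit in 64 bits. */
static uint64_t clmul_small(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1) r ^= a << i;
    return r;
}

/* c = a * b in F_{2^m}[u]/(u^2 + u + 1) using three base-field
   multiplications: c0 = a0*b0 + a1*b1,
                    c1 = a0*b0 + (a0 + a1)*(b0 + b1). */
void mul_quadratic(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
                   uint64_t *c0, uint64_t *c1)
{
    uint64_t t0 = clmul_small(a0, b0);
    uint64_t t1 = clmul_small(a1, b1);
    uint64_t t2 = clmul_small(a0 ^ a1, b0 ^ b1);
    *c0 = t0 ^ t1;
    *c1 = t0 ^ t2;
}
```

The identity \(a_0b_1 \oplus a_1b_0 \oplus a_1b_1 = a_0b_0 \oplus (a_0 \oplus a_1)(b_0 \oplus b_1)\) is what trades the fourth base-field product for cheap additions.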

5.3 Modular Reduction

The modular reduction of an element in \({\mathbb {F}}_{4^m}\) highly benefits from the interleaved representation, since the two packed words in a 128-bit register do not change their positions. In other words, the real (\(a_0\)) and the imaginary (\(a_1\)) parts are always placed in the least and most significant 64-bit words, respectively. The trinomial \(g_{149}(x) = x^{192} + x^{19} + 1\) satisfies condition R2 (cf. Sect. 4.1), that is, \((192-0) \bmod 64 = 0\). For that reason, each step of the shift-and-add algorithm consists of only two shifts and three xors, and three such steps are required.
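As an illustration, a plain-C sketch of this three-step reduction for a single 384-bit product (six 64-bit words) follows; the actual implementation performs the same folds on the two packed base-field halves of a 128-bit register at once. The code repeatedly applies the congruence \(x^{192} \equiv x^{19} + 1 \pmod {g_{149}(x)}\).

```c
#include <stdint.h>

/* Reduce a 384-bit polynomial product p[0..5] modulo
   g(x) = x^192 + x^19 + 1 by folding with x^192 = x^19 + 1.
   Each step costs two shifts and three xors; the high words are
   processed top-down so that the spill of the 19-bit shift into
   bits >= 192 is folded in turn. */
void reduce192(uint64_t p[6])
{
    for (int j = 2; j >= 0; j--) {   /* fold p[j+3] into p[j], p[j+1] */
        uint64_t h = p[j + 3];
        p[j]     ^= h;               /* the "+1" term of x^19 + 1   */
        p[j]     ^= h << 19;         /* the x^19 term, low part     */
        p[j + 1] ^= h >> 45;         /* the x^19 term, carried part */
        p[j + 3]  = 0;
    }
}
```

The result still lives in the redundant representation modulo \(g_{149}(x)\) (degree below 192); the final reduction modulo \(f_{149}(x)\) happens only once, at the end of the scalar multiplication.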

Nonetheless, for the \({\mathbb {F}}_{4^{163}}\) case, we have to compute four shifts and four xors per step. In total, the modular reduction in this field requires three extra shifts and xors when compared with the same operation over \({\mathbb {F}}_{4^{149}}\). Although this difference does not appear significant, its impact can be seen in the total cost of the scalar multiplication algorithm (see Sect. 7 for specific timings).

5.4 Squaring

Squaring is a very important function in the Koblitz curve point multiplication algorithm, since it is the building block for computing the \(\tau \) endomorphism in both curve classes. In our implementation, we computed the squaring operation through carry-less multiplication instructions which, experimentally, proved less expensive than the bit interleaving method [40, §2.3.4]. The preprocessing phase is straightforward: we just need to rearrange the 32-bit packed words of the 128-bit registers in order to prepare them for the subsequent modular reduction. The pre- and post-processing phases require three 32-bit shuffle instructions,Footnote 14 three vectorized xors, and three bitwise shifts.
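The equivalence between the two squaring strategies is easy to verify: in GF(2)[x], \(a(x)^2 = a(x^2)\), so squaring amounts to interleaving zeros between the bits of a. The sketch below implements the bit-interleaving variant for a single 64-bit word (the paper's version instead squares with carry-less multiplications on 128-bit registers); the function names are illustrative.

```c
#include <stdint.h>

/* Spread the 32 bits of x into the even positions of a 64-bit word
   (bit i -> bit 2i), i.e., interleave zeros. */
static uint64_t spread(uint32_t x)
{
    uint64_t r = x;
    r = (r | (r << 16)) & 0x0000FFFF0000FFFFull;
    r = (r | (r << 8))  & 0x00FF00FF00FF00FFull;
    r = (r | (r << 4))  & 0x0F0F0F0F0F0F0F0Full;
    r = (r | (r << 2))  & 0x3333333333333333ull;
    r = (r | (r << 1))  & 0x5555555555555555ull;
    return r;
}

/* Square a 64-bit binary polynomial: a(x)^2 = a(x^2). */
void square64(uint64_t a, uint64_t *lo, uint64_t *hi)
{
    *lo = spread((uint32_t)a);
    *hi = spread((uint32_t)(a >> 32));
}
```

A carry-less multiplication of a word by itself produces exactly the same output, since all cross terms cancel in characteristic two.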

5.5 Inversion

The inversion operation is computed via the Itoh–Tsujii method [44]. Given an element \(c \in {\mathbb {F}}_{2^m}\), we compute \(c^{-1} = c^{(2^{m-1}-1) \cdot 2}\) through an addition chain, where each step computes the term \((c^{2^i-1})^{2^j} \cdot c^{2^j-1}\) with \(0< j < i \le m-1\). For the case \(m = 149\), the following chain is used:

$$\begin{aligned} 1 \rightarrow 2 \rightarrow 4 \rightarrow 8 \rightarrow 16 \rightarrow 32 \rightarrow 33 \rightarrow 66 \rightarrow 74 \rightarrow 148, \end{aligned}$$

while for \(m = 163\), we used

$$\begin{aligned} 1 \rightarrow 2 \rightarrow 4 \rightarrow 5 \rightarrow 10 \rightarrow 20 \rightarrow 40 \rightarrow 80 \rightarrow 81 \rightarrow 162. \end{aligned}$$

Both addition chains are optimal and were found through the procedure described in [22]. Note that although we computed the inversion operation over polynomials in \({\mathbb {F}}_2[x]\) (reduced modulo \(g_m(x)\)), we still had to build the addition chain with \(m \in \{149, 163\}\), since we are in fact interested in the embedded \({\mathbb {F}}_{2^{m}}\) field element.
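To illustrate the structure of these chains on a field small enough to check exhaustively, the sketch below runs Itoh–Tsujii in the toy field \({\mathbb {F}}_{2^7} = {\mathbb {F}}_2[x]/(x^7 + x + 1)\) with the chain \(1 \rightarrow 2 \rightarrow 4 \rightarrow 6\) for \(m - 1 = 6\); both the field and the chain are our own illustrative choices, but every step multiplies \((c^{2^i-1})^{2^j}\) by \(c^{2^j-1}\) exactly as described in the text.

```c
#include <stdint.h>

/* Toy field F_{2^7} = F_2[x]/(x^7 + x + 1). */
static uint64_t f2m_mul(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 7; i++)
        if ((b >> i) & 1) r ^= a << i;
    for (int i = 12; i >= 7; i--)          /* reduce degree <= 12 */
        if ((r >> i) & 1) r ^= (1ull << i) ^ (0x3ull << (i - 7));
    return r;
}

static uint64_t f2m_sqr_n(uint64_t a, int n)  /* a^(2^n) */
{
    while (n--) a = f2m_mul(a, a);
    return a;
}

/* Itoh-Tsujii: c^{-1} = (c^{2^6 - 1})^2 via the chain 1 -> 2 -> 4 -> 6. */
uint64_t f2m_inv(uint64_t c)
{
    uint64_t t1 = c;                               /* c^(2^1 - 1) */
    uint64_t t2 = f2m_mul(f2m_sqr_n(t1, 1), t1);   /* c^(2^2 - 1) */
    uint64_t t4 = f2m_mul(f2m_sqr_n(t2, 2), t2);   /* c^(2^4 - 1) */
    uint64_t t6 = f2m_mul(f2m_sqr_n(t4, 2), t2);   /* c^(2^6 - 1) */
    return f2m_mul(t6, t6);                        /* final squaring */
}
```

In the real implementation, each \(2^j\)-fold squaring is a multi-squaring over \({\mathbb {F}}_2[x]/(g_m(x))\), performed by consecutive field squarings rather than lookup tables, as explained in Sect. 7.2.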

6 Scalar Multiplication Algorithms

In this part, we briefly discuss single-core algorithms for computing a timing-resistant scalar multiplication function over Koblitz curves. The fastest approach is to compute the function via the left-to-right or right-to-left \(\tau \)-and-add algorithms [40]. A more conservative and simpler method is to perform the point multiplication with Montgomery ladders [63]. This method disregards the efficient \(\tau \) endomorphism but takes advantage of the special curve b-parameter within the x-coordinate doubling and addition formulas.

6.1 Left-to-Right \(\tau \)-and-Add

This algorithm is similar to the traditional left-to-right double-and-add method. Here, the point doubling operation is replaced by the computationally cheaper \(\tau \) endomorphism.Footnote 15 Before that, we need to compute the width-w \(\tau \)NAF representation of the scalar k. In addition, linear passes must be applied to the accumulators in order to avoid cache-based timing attacks (for more details, see Sect. 6.4). The process is shown in Algorithm 1.

[Algorithm 1]

The main advantage of this method is that the sensitive data are indirectly related to the points \(P_{v_i}\), which are only read and then added to the unique accumulator Q. As a consequence, just one linear pass per iteration is required, immediately prior to reading \(P_{v_i}\). On the other hand, the operation \(\tau ^{w-1}(Q)\) must be performed on the three coordinates of a projective point by successive squarings, since computing it through lookup tables could leak information about the bits of the scalar k.Footnote 16

Note that, in the context of the Diffie–Hellman key exchange protocol, we can assume that the scalar k is kept in the \(\mathbb {Z}[\tau ]\) form, since it is a private key and therefore (ideally) it is not manipulated by entities other than its owner. This procedure, besides simplifying the implementation of the \(\tau \)-and-add point multiplication, reduces the possibilities of timing attacks, given that Solinas’ partial reduction algorithm contains branching sections [82]. If desired, the scalar can be retrieved as an integer in \(\mathbb {Z}_r\) as discussed in Sect. 2.1.

6.2 Right-to-Left \(\tau \)-and-Add

An alternative is to process the scalar k from the least to the most significant digit. Again, taking advantage of the \(\tau \) endomorphism, the GLV method is brought to its full extent. This approach is called right-to-left \(\tau \)-and-add, and it is presented in Algorithm 2.

[Algorithm 2]

Here, we have to perform a post-computation on the accumulators instead of pre-computing the points \(P_{i}\) as in the previous approach. Also, the \(\tau \) endomorphism is applied to the point P, which is usually public. For that reason, we can compute \(\tau \) with table lookups instead of performing multiple squarings. The downside of this algorithm is that the accumulators carry sensitive information about the private key and, moreover, are both read and written. As a result, it is necessary to apply the linear pass algorithm over the whole set of accumulators \(Q_{i}\) twice per loop iteration.

6.3 Montgomery Ladder

The scalar multiplication can also be computed via Montgomery ladders, which provide simple and efficient timing-resistant formulas. The downside of this approach is that the speedup from the Frobenius map is lost. On the other hand, the Montgomery formulas are significantly improved when combined with the Koblitz curve b-parameter. For instance, the doubling formula only requires three field squarings and one multiplication. Moreover, if we choose the group generator with the x-coordinate equal to \(x^{64}u\), the cost of one loop multiplication is reduced to 16 logical vector instructions. For fixed-point scenarios, the right-to-left Montgomery ladder with Koblitz curves efficiently computes the point multiplication with \(\log _2(r)\) pre-computed x-coordinates (see [72] for more details on the Montgomery formulas for binary curves). The advantage of this method over windowed versions of the \(\tau \)-and-add algorithm is that no linear passes are required.

For Montgomery ladders, we suggest constructing the scalar k as follows. First, 40 bytes \(K_i\) are chosen at random from [0, 255]. Then, we update the most and least significant bytes as

$$\begin{aligned} K_{39} \leftarrow 0x5, \quad K_{0} \leftarrow K_{0} \vee 0x1. \end{aligned}$$

Next, the scalar is multiplied by 653, which is one of the factors of \(\#E({\mathbb {F}}_{4^{163}})\). Finally, after processing the scalar multiplication, the accumulator is doubled in order to return \(Q = 2kP\). By doing this, we ensure that the least and most significant bits of k are always equal to one, avoiding any inconsistencies in the left-to-right and right-to-left binary Montgomery addition formulas.
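A minimal sketch of this scalar construction is shown below (plain C; the random bytes are taken as an input so that the routine stays deterministic, and the buffer sizes and function names are our own illustration rather than the library's API):

```c
#include <stdint.h>
#include <string.h>

/* Multiply the little-endian byte array k (len bytes, with two extra
   bytes of headroom for the carry) by a small constant c. */
void mul_small(uint8_t *k, int len, uint32_t c)
{
    uint32_t carry = 0;
    for (int i = 0; i < len + 2; i++) {
        uint32_t t = (uint32_t)k[i] * c + carry;
        k[i] = (uint8_t)t;
        carry = t >> 8;
    }
}

/* Build the Montgomery-ladder scalar: fix the top byte to 0x05, force
   the low bit to 1, then multiply by the cofactor 653.  rnd must hold
   40 bytes from a cryptographic RNG. */
void build_scalar(uint8_t k[42], const uint8_t rnd[40])
{
    memset(k, 0, 42);
    memcpy(k, rnd, 40);
    k[39] = 0x05;           /* fix the most significant byte  */
    k[0] |= 0x01;           /* force the least significant bit */
    mul_small(k, 40, 653);  /* land in the subgroup of order r */
}
```

Since both 653 and the randomized value are odd, the product k is odd, as required by the ladder formulas; the final doubling of the accumulator handles the cofactor.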

Remark on ECDSA applications   The lattice-based attacks presented in [7, 68] show that fixing bits of the scalar k in the signing phase of the ECDSA potentially reveals the secret key after the collection of a few hundred signatures. Nguyen and Shparlinski [68] describe how to reduce the ECDSA problem to the Hidden Number Problem (HNP) if the \(\ell \) most (or least) significant bits of the scalar are known. Here, \(\ell = \lceil \log \log r \rceil \), where r is the order of the subgroup of interest. For our first curve, r is a 316-bit prime. Hence, \(\ell = 9\) consecutive biased bits would be necessary in order to launch the ECDSA-HNP attack. Since in our setting we are only fixing three bits (the most and the two least significant bits), it would appear that our configuration is safe against this kind of attack.

Nevertheless, as demonstrated in [7], a concrete analysis of the attack could improve the aforementioned bound. Moreover, since the effects that this type of attack may have on points of composite order are not clearly understood, we do not recommend the usage of Montgomery ladders in the ECDSA signing phase. Instead, the designer may use \(\tau \)-and-add/double-and-add methods, which do not fix any bit of the scalar k. In cases where the implementer is compelled to use Montgomery ladders, the scalar should be selected at random from \([1, r-1]\). As a consequence, the order of the input point has to be verified. Finally, to avoid formula inconsistencies, the order r must be added to k if the nonce is an even number.

6.4 Auxiliary Functions

In the next paragraphs, we comment on the implementation aspects of functions that support the aforementioned scalar multiplication algorithms.

6.4.1 Regular \(\tau \)NAF Representation

The regular NAF representation was proposed by Joye [48] and adopted for the \(\mathbb {Z}[\tau ]\) ring in [69]. This approach is crucial for implementing timing-resistant NAF-based scalar multiplication algorithms on curves with non-complete addition formulas. The general procedure is presented in Algorithm 3.

[Algorithm 3]

The operations throughout the algorithm are performed via basic arithmetic and logic machine operations, with the subscalars \(r_0\), \(r_1\) represented in two's complement form. As a result, the negation can be implemented through a masking function. Each of the \(\beta _u\) and \(\gamma _u\) values and their respective signs are stored into a single 64-bit register, which is accessed via bitwise shift and logical operators. Finally, to avoid carry-related branches, the multiword addition is programmed in Assembly code with the add and add-with-carry instructions (addq, adcq) [43].
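The regularity idea can be illustrated in the plain integer setting. The sketch below is not the \(\mathbb {Z}[\tau ]\) recoding of Algorithm 3, only its binary analogue: any odd k with top bit index t satisfies \(k = 2^t + \sum _{i=0}^{t-1}(2b_{i+1}-1)2^i\), so every digit is \(\pm 1\) and a scalar-multiplication loop driven by these digits performs exactly one addition per iteration.

```c
#include <stdint.h>

/* Toy regular recoding of an odd 64-bit integer into {+1,-1} digits
   (least significant first): d[i] = 2*bit_{i+1}(k) - 1 for i < t and
   d[t] = 1, where t is the index of the top set bit.  This is the
   constant-pattern property that the regular width-w tauNAF recoding
   provides in Z[tau]. */
int regular_recode(uint64_t k, int8_t d[64])
{
    int t = 63;
    while (t > 0 && !((k >> t) & 1)) t--;
    for (int i = 0; i < t; i++)
        d[i] = (int8_t)(2 * (int)((k >> (i + 1)) & 1) - 1);
    d[t] = 1;
    return t + 1;                        /* number of digits */
}
```

Because every digit is nonzero, the sequence of point operations is independent of the secret bits, which is precisely why regular recodings suit timing-resistant implementations.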

6.4.2 Linear Pass

The linear pass is a method designed to protect sensitive information against side-channel attacks associated with the CPU cache access patterns [73, 86]. Let us consider an array A of size l. Before reading a value A[i], with \(i \in \{0, \ldots , l-1\}\), the linear pass technique reads the entire array A but only stores, usually into an output register, the requested data A[i].

As a consequence, the attacker cannot learn which array index was accessed just by analyzing the location of the cache misses in his own artificially injected data. Writing to A[i] proceeds in a similar manner. Assuming that the data to be written are stored in a register, we perform dummy updates on all values of A except for A[i], which receives the real data. Note that this method causes a considerable overhead, directly related to the size of the array.

In this work, we implemented the linear pass method using 128-bit SSE vectorized instructions and registers. The selection of A[i] is performed via logical functions and the SSE instruction pcmpeqq, which compares the packed 64-bit values of two 128-bit registers A and B and sets each 64-bit lane of the resulting register C to all ones if the corresponding lanes are equal, and to all zeros otherwise [43].
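A portable scalar sketch of the linear-pass read conveys the idea without the SSE machinery (the vectorized version replaces the mask computation with pcmpeqq and processes two entries at a time):

```c
#include <stdint.h>
#include <stddef.h>

/* Cache-timing-resistant table read: scan the whole array and keep
   only the requested entry through an all-ones/all-zeros mask, so the
   memory access pattern is independent of the secret index i. */
uint64_t linear_pass_read(const uint64_t *A, size_t l, size_t i)
{
    uint64_t r = 0;
    for (size_t j = 0; j < l; j++) {
        /* mask = all ones when j == i, all zeros otherwise */
        uint64_t mask = (uint64_t)0 - (uint64_t)(j == i);
        r |= A[j] & mask;
    }
    return r;
}
```

The cost is linear in the table size, which is why Table 7 later reports the linear pass as a noticeable fraction of the scalar multiplication time for large window widths.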

6.4.3 Conditional Swap

The conditional swap is a technique usually employed in Montgomery ladders to hide the access order of the two accumulators \(R_0\) and \(R_1\). The knowledge of this order by a malicious entity could have catastrophic consequences for the cryptosystem [37]. The countermeasure can be similarly applied to width-w versions of \(\tau \)-and-add algorithms that process only two accumulators. As a result, the double linear pass in the right-to-left variant is no longer required. For \(\tau \)-and-add algorithms where more than two accumulators are processed, it is possible to extend the conditional swap to n points in \(\log _2(n)\) steps. However, the required number of xor instructions results in a greater overhead than the double linear pass discussed above. Finally, we can perform the conditional swap only on the lowest \(2^{n-1}\) values of the array A. Nonetheless, this approach requires storing the array state.
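The standard branch-free realization of the countermeasure can be sketched as follows (one 64-bit word per accumulator for brevity; curve points swap every coordinate word with the same mask):

```c
#include <stdint.h>

/* Branch-free conditional swap: if bit == 1, exchange *a and *b;
   if bit == 0, leave them untouched.  The instruction sequence and
   memory access pattern are identical in both cases. */
void cswap(uint64_t *a, uint64_t *b, uint64_t bit)
{
    uint64_t mask = (uint64_t)0 - bit;   /* 0x00...0 or 0xFF...F */
    uint64_t t = (*a ^ *b) & mask;
    *a ^= t;
    *b ^= t;
}
```

Since both accumulators are always read and written, an observer of the cache cannot tell whether a swap took place.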

7 Results and Discussion

Our software library was designed for 64-bit high-end desktops provided with SSE 4.1-equivalent vector instructions and a 64-bit carry-less multiplier. The timings were measured on an Intel Core i7-4770K 3.50 GHz machine (Haswell architecture) with the Turbo Boost and Hyper-Threading technologies disabled. The implementation was coded in the GNU11 C and Assembly languages.

We compiled our code with GCC (GNU Compiler Collection) version 5.4 with the optimization flags -march=haswell -fomit-frame-pointer -O3. In addition, the code of the 3-\(\tau \)NAF left-to-right \(\tau \)-and-add point multiplication over \(E_{(0,u)}/{\mathbb {F}}_{4^{149}}\) is available on the eBACS platform [9].

Remark on error margin   The analysis in [74] shows that our machine has a measurement error margin of four clock cycles. This value is not significant when considering the point arithmetic or scalar multiplication timings. Still, for simpler functions such as the basic finite field operations, this margin should be taken into account.

7.1 Parameters

Our base binary fields \({\mathbb {F}}_{2^m} \cong {\mathbb {F}}_2[x]/(f_m(x))\) with \(m \in \{149, 163\}\) were constructed with the 69- and 87-term irreducible polynomials \(f_{\{149, 163\}}(x)\) described in Sect. 4.1. The quadratic extension \({\mathbb {F}}_{q^2} \cong {\mathbb {F}}_q[u]/(h(u))\) was built through the irreducible quadratic \(h(u) = u^2 + u + 1\).

Our class-1 Koblitz curve is defined as \(E_{(0,u)}/{\mathbb {F}}_{4^{149}} : y^2 + xy = x^3 + ux^2 + u,\) and the group \(E_{(0,u)}({\mathbb {F}}_{4^{149}})\) contains a subgroup of interest of order

$$\begin{aligned} r = \texttt {0x637845F7F8BFAB325B85412FB54061F148B7F6E79AE11CC843ADE1470F7E4E29}, \end{aligned}$$

which corresponds to approximately 254 bits. Our class-3 curve is defined as \(E_{(u,1)}/{\mathbb {F}}_{4^{163}} : y^2 + xy = x^3 + ux^2 + 1,\) and the order of the prime subgroup of \(E_{(u,1)}({\mathbb {F}}_{4^{163}})\) is

$$\begin{aligned} r&= \texttt {0xC8B90A95C20EE5BBC91D671B0CEFED2EA701F5CE}\\&\quad \quad \texttt {9AAA522F37A4E0D020A19EBBDC1D0437C458139}, \end{aligned}$$

of about 316 bits. In addition, throughout our \(\tau \)-and-add scalar multiplication implementation, we represented the points in \(\lambda \)-affine [50, 77] and \(\lambda \)-projective [71] coordinates.

7.2 Field and Elliptic Curve Arithmetic Timings

We present in Table 4 the timings for the quadratic field arithmetic. The field multiplication, squaring, and inversion timings already include the cost of the modular reduction modulo \(g_m(x)\). Because of our machine's error margin, we abstained from presenting timings for the field addition (implemented by three xor instructions), which theoretically costs from 1.66 to three clock cycles. The column Ratio op/m shows the ratio between the timing of the presented operation and that of the field multiplication.

Table 4 Timings for the finite field arithmetic in \({\mathbb {F}}_{4^{149}}\) and \({\mathbb {F}}_{4^{163}}\)

The ratio squaring/multiplication is relatively high when compared with other binary curves such as GLS-254 [71]. This is because here the trinomials \(g_m(x)\) do not admit a reduction specially crafted for the squaring operation (see Sect. 4.1). Table 4 also shows that the field inversion is significantly more expensive than a multiplication and therefore should be avoided as much as possible in our scalar multiplication design. The high cost of this operation is explained by the fact that, to avoid attacks on projective coordinates [64], we did not use lookup tables for computing the different Itoh–Tsujii multi-squaring operations, but applied consecutive field squarings instead. Finally, the consequence of having a redundant trinomial that does not comply with criterion R2 (see Sect. 4.1) is indicated above: sixteen and eight extra cycles for the field multiplication and squaring, respectively. Next, the timings for the point arithmetic used in our implementation are shown in Table 5.

Table 5 Timings for the point arithmetic in \(E_{(0,u)}({\mathbb {F}}_{4^{149}})\) and \(E_{(u,1)}({\mathbb {F}}_{4^{163}})\)

The effects of a slower field arithmetic are seen in Table 5. Over \(E_{(u,1)}({\mathbb {F}}_{4^{163}})\), the \(\lambda \)-doubling, -full addition, and -mixed additionFootnote 17 are 1.14, 1.05, and 1.05 times slower when compared with the same operations over \(E_{(0,u)}({\mathbb {F}}_{4^{149}})\). On the other hand, the faster class-3 endomorphisms outperform the class-1 equivalents by 20 and \(10\%\) in the affine and projective formulas, respectively.

7.3 Scalar Multiplication Timings

Finally, we present in Table 6 the results obtained for the scalar multiplication implementation through the \(\tau \)-and-add (with different values of the width w) and Montgomery ladder algorithms. Our estimates showed that, over the \(E_{(0,u)}/{\mathbb {F}}_{4^{149}}\) curve, the pre- and post-computation costs along with the timing-attack countermeasure overhead would surpass the savings in the total number of point operations. In the class-3 curve \(E_{(u,1)}/{\mathbb {F}}_{4^{163}}\), this scenario occurs when \(w > 5\). Note that the same parameter w represents a different number of pre-/post-computed points when comparing the two curve classes, since the \(\tau ,\bar{\tau }\) endomorphisms and the regular recoding algorithms are different.

Table 6 Timings for the \(\tau \)-and-add scalar multiplication on \(E_{(0,u)}/{\mathbb {F}}_{4^{149}}\) and \(E_{(u,1)}{/}{\mathbb {F}}_{4^{163}}\)

For the sake of comparison, Table 6 includes the ratio of clock cycles per bit, since it would be inaccurate to compare the absolute timings of each curve: the security provided by the class-3 curve is higher than that of our proposed class-1 curve. The ratio is calculated over the number of randomly chosen bits processed by the \(\tau \)-and-add algorithms. In the curve \(E_{(0,u)}/{\mathbb {F}}_{4^{149}}\) implementation, 252 bits are considered, while in the \(E_{(u,1)}/{\mathbb {F}}_{4^{163}}\) design, we have a total of 312 bits.

The above results show that, for the class-1 case, the costs of the pre- and post-computation increase rapidly with the parameter w. The difference in the point pre-computation timings between widths \(w = 2\) and 3 is about \(86.93\%\), while between widths \(w = 3\) and 4, the variation is approximately \(168.97\%\). This is because the number of points to be computed quadruples with each increment of w. This aspect is also reflected in the cost of the linear pass functions, which likewise depends on the number of computed points. Table 7 describes the fraction of the scalar multiplication cost dedicated to applying the linear pass function in different \(\tau \)-and-add approaches.

Table 7 Fraction (in percent) of the linear pass countermeasure on the cost of different \(\tau \)-and-add algorithms over \(E_{(0,u)}/{\mathbb {F}}_{4^{149}}\)

The same issue does not occur in the class-3 curve, since the number of points only doubles with each increment of w. The bottleneck here lies in the base field arithmetic, as discussed in the previous paragraphs. Nevertheless, the class-3 curve obtained competitive timings when compared with its counterpart. In terms of cost per bit, its best algorithm, namely the right-to-left variant with \(w=4\), is only 1.03 times more expensive than the left-to-right version with \(w=3\) over our class-1 curve. This result is clearly due to the faster \(\tau \) endomorphisms and the smaller number of points to protect in the main loop. Finally, we present in Table 8 the timings for the Montgomery ladder algorithms.

Table 8 Timings for the Montgomery ladder scalar multiplication \(E_{(u,1)}/{\mathbb {F}}_{4^{163}}\)

The savings in the left-to-right Montgomery addition formulas resulted in competitive timings when compared with the \(\tau \)-and-add algorithms with \(w \le 3\). The advantage of the Montgomery ladder is that it is simpler to implement and does not require complicated countermeasures for protecting the multiple pre-computed points and accumulators. The right-to-left Montgomery ladder results do not compare well with the costs presented in Table 6, since this approach targets a different setup, namely the fixed-point scenario.

7.4 Comparison

Finally, we compare our best point multiplication versions against state-of-the-art 128-bit secure timing-resistant software scalar multiplication implementations on binary and prime curves. The results are presented in Table 9.

Table 9 A comparison between state-of-the-art software implementations of 128-bit secure timing-resistant scalar multiplication

The 3-\(\tau \)NAF scalar multiplication on \(E_{(0,u)}/{\mathbb {F}}_{4^{149}}\) is the fastest published software implementation on 128-bit secure Koblitz curves, surpassing the work on the standardized curve K-283 [69] by \(20.29\%\). On binary fields, the GLS implementation in [70] outperforms our fastest implementation by \(38.80\%\). The works in [24] and [8] are about 29.04 and \(23.27\%\) faster, respectively. Finally, our \(E_{(u,1)}/{\mathbb {F}}_{4^{163}}\) curve is only 6000 cycles more expensive than the K-283 implementation in [69], but offers 15 extra bits of security. Given that both implementations work with the same endomorphism \(\tau \), we can see that the quadratic extension plays an important role in the efficiency of the field arithmetic.

8 Conclusion

In this work, we gave a comprehensive taxonomy of Koblitz-like curves defined over the extension fields \({\mathbb {F}}_{4^m}\) with good cryptographic properties. We also introduced a family of curves, whose rich structure admits novel and more efficient endomorphisms than the traditional Frobenius endomorphism associated with classical Koblitz curves.

We presented fast software implementations of 128-bit secure scalar multiplication algorithms on the Koblitz curve \(E_{(0,u)}/{\mathbb {F}}_{4^{149}}\) and the Koblitz-like curve \(E_{(u,1)}/{\mathbb {F}}_{4^{163}}\). The design of these implementations included original techniques to construct the basic finite field arithmetic and methods to protect the code against timing attacks. An important feature that distinguishes our code from previous binary field arithmetic implementations is the intensive usage of the carry-less multiplication instruction. The drastic reduction of its latency and the improvement of its throughput in recent architectures enabled its application to operations beyond the field multiplication, such as the field squaring and modular reduction.