Abstract
We describe the design and implementation of efficient signature and key-exchange schemes for the AVR ATmega and ARM Cortex M0 microcontrollers, targeting the 128-bit security level. Our algorithms are based on an efficient Montgomery ladder scalar multiplication on the Kummer surface of Gaudry and Schost’s genus-2 hyperelliptic curve, combined with the Jacobian point recovery technique of Chung, Costello, and Smith. Our results are the first to show the feasibility of software-only hyperelliptic cryptography on constrained platforms, and represent a significant improvement on the elliptic-curve state-of-the-art for both key exchange and signatures on these architectures. Notably, our key-exchange scalar-multiplication software runs in under 9520k cycles on the ATmega and under 2640k cycles on the Cortex M0, improving on the current speed records by 32 % and 75 % respectively.
L. Batina— This work has been supported by the Netherlands Organisation for Scientific Research (NWO) through Veni 2013 project 13114 and by the Technology Foundation STW (project 13499 - TYPHOON&ASPASIA), from the Dutch government. Permanent ID of this document: b230ab9b9c664ec4aad0cea0bd6a6732. Date: 2016-04-07.
1 Introduction
The current state of the art in asymmetric cryptography, not only on microcontrollers, is elliptic-curve cryptography; the most widely accepted reasonable security is the 128-bit security level. All current speed records for 128-bit secure key exchange and signatures on microcontrollers are held—until now—by elliptic-curve-based schemes. Outside the world of microcontrollers, it is well known that genus-2 hyperelliptic curves and their Kummer surfaces present an attractive alternative to elliptic curves [1, 2]. For example, at Asiacrypt 2014 Bernstein, Chuengsatiansup, Lange and Schwabe [3] presented speed records for timing-attack-protected 128-bit-secure scalar multiplication on a range of architectures with Kummer-based software. These speed records are currently surpassed only by the elliptic-curve-based Four\(\mathbb {Q}\) software by Costello and Longa [4] presented at Asiacrypt 2015, which makes heavy use of efficiently computable endomorphisms (i.e., of additional structure of the underlying elliptic curve). The Kummer-based speed records in [3] were achieved by exploiting the computational power of vector units of recent “large” processors such as Intel Sandy Bridge, Ivy Bridge, and Haswell, or the ARM Cortex-A8. Surprisingly, very little attention has been given to Kummer surfaces on embedded processors. Indeed, this is the first work showing the feasibility of software-only implementations of hyperelliptic-curve-based cryptography on constrained platforms. There have been some investigations of binary hyperelliptic curves targeting the much lower 80-bit security level, but those are actually examples of software-hardware co-design showing that hardware acceleration for field operations was necessary to get reasonable performance figures (see e.g. [5, 6]).
In this paper we investigate the potential of genus-2 hyperelliptic curves for both key exchange and signatures on the “classical” 8-bit AVR ATmega architecture and the more modern 32-bit ARM Cortex-M0 processor. The former has the most previous results to compare to, while ARM is becoming more relevant in real-world applications. We show that not only are hyperelliptic curves competitive, they clearly outperform state-of-the-art elliptic-curve schemes in terms of speed and size. For example, our variable-basepoint scalar multiplication on a 127-bit Kummer surface is 31 % faster on AVR and 26 % faster on the M0 than the recently presented speed records for Curve25519 software by Düll et al. [7]; our implementation is also smaller, and requires less RAM.
We use a recent result by Chung, Costello, and Smith [8] to also set new speed records for 128-bit secure signatures. Specifically, we present a new signature scheme based on fast Kummer surface arithmetic. It is inspired by the EdDSA construction by Bernstein, Duif, Lange, Schwabe, and Yang [9]. On the ATmega, it produces shorter signatures, achieves higher speeds and needs less RAM than the Ed25519 implementation presented in [10].
Our routines handling secret data are constant-time, and are thus naturally resistant to timing attacks. These algorithms are built around the Montgomery ladder, which improves resistance against simple-power-analysis (SPA) attacks. Resistance to DPA attacks can easily be added to the implementation by randomizing the scalar and/or Jacobian points. Re-randomizing the latter after each ladder step would also guarantee resistance against horizontal types of attacks.
Source code. We place all of the software described in this paper into the public domain, to maximize the reusability of our results. The software is available at http://www.cs.ru.nl/~jrenes/.
2 High-Level Overview
We begin by describing the details of our signature and Diffie–Hellman schemes, explaining the choices we made in their design. Concrete implementation details appear in Sects. 3 and 4 below. Experimental results and comparisons follow in Sect. 5.
2.1 Signatures
Our signature scheme, defined at the end of this section, adheres closely to the proposal of [8, Sect. 8], which in turn is a type of Schnorr signature [11]. There are however some differences and trade-offs, which we discuss below.
Group structure. We build the signature scheme on top of the group structure from the Jacobian \(\mathcal {J}_\mathcal {C}(\mathbb {F}_q)\) of a genus-2 hyperelliptic curve \(\mathcal {C}\). More specifically, \(\mathcal {C}\) is the Gaudry–Schost curve over the prime field \(\mathbb {F}_q\) with \(q=2^{127}-1\) (cf. Sect. 3.2). The Jacobian is a group of order \(\#\mathcal {J}_\mathcal {C}(\mathbb {F}_q)=2^4N\), where \(N\) is a 250-bit prime. For more details on the Jacobian and its elements, see Sect. 3.3.
Hash function. We may use any hash function H with a 128-bit security level. For our purposes, \(H(M)=\mathtt{SHAKE128}(M, 512)\) suffices [12]. While SHAKE128 has variable-length output, we fix its output length to 512 bits.
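For concreteness, H can be modeled directly with Python's hashlib (a sketch of the functional behaviour only; the deployed code embeds the SHAKE128 reference implementation):

```python
import hashlib

def H(M: bytes) -> bytes:
    """H(M) = SHAKE128(M, 512): SHAKE128 with its output length fixed to 512 bits."""
    return hashlib.shake_128(M).digest(64)  # 64 bytes = 512 bits
```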
Encoding. At the highest level, we operate on points Q in \(\mathcal {J}_\mathcal {C}(\mathbb {F}_q)\). To minimize communication costs, we compress the usual 508-bit representation of Q into a 256-bit encoding (see Sect. 3.3), borrowing the compressed-point notation of [9].
Public generator. The public generator can be any element P of \(\mathcal {J}_\mathcal {C}(\mathbb {F}_q)\) such that \([N]P=0\). In our implementation we have made the arbitrary choice \( P = (X^2+u_1X+u_0, v_1X+v_0) \), for fixed constants \(u_1\), \(u_0\), \(v_1\), \(v_0\) in \(\mathbb {F}_q\).
This is the point which we use the most for scalar multiplication. Since it remains fixed, we assume we have its decompressed representation precomputed, so as to avoid having to perform the relatively expensive decompression operation whenever we need a scalar multiplication; this gives a low-cost speed gain. We further assume we have a “wrapped” representation of the projection of \(P\) to the Kummer surface, which is used to speed up the xDBLADD function. See Sect. 4.1 for more details on the xWRAP function.
Public keys. In contrast to the public generator, we assume public keys are compressed: they are communicated much more frequently, and we therefore benefit much more from smaller keys. Moreover, we include the public key in one of the hashes during the sign operation [13, 14], hashing \((R||Q||M)\) instead of the \((R||M)\) originally suggested by Schnorr [11]. This protects against adversaries attacking multiple public keys simultaneously.
Compressed signatures. Schnorr [11] mentions the option of compressing signatures by hashing one of their two components: the hash size only needs to be b/2 bits, where b is the key length. Following this suggestion, our signatures are 384-bit values of the form \((h_{128}||s)\), where \(h_{128}\) denotes the lowest 128 bits of the hash \(h = H(R||Q||M)\), and s is a 256-bit scalar. The most obvious upside is that signatures are smaller, reducing communication overhead. Another big advantage is that we can exploit the half-size scalar to speed up signature verification. On the other hand, we lose the possibility of efficient batch verification.
Verification efficiency. The most costly operation in signature verification is the two-dimensional scalar multiplication \(T=[s]P\oplus [h_{128}]Q\). In [8], the authors propose an algorithm relying on the differential addition chains presented in [15]. However, since we are using compressed signatures, we have a small scalar \(h_{128}\); the two-dimensional algorithm in [8] cannot directly exploit this fact, and therefore derives little benefit from the compressed signature. On the other hand, we can simply compute [s]P and \([h_{128}]Q\) separately using the fast scalar multiplication on the Kummer surface, and finally add them together on the Jacobian. Here \([s]P\) is a 256-bit scalar multiplication, whereas \([h_{128}]Q\) is only a 128-bit scalar multiplication. Not only do we need fewer cycles compared to the two-dimensional routine, but we also reduce code size by reusing the one-dimensional scalar multiplication routine.
The scheme. We now define our signature scheme, taking the above into account.
- Key generation ( keygen ). Let d be a 256-bit secret key, and P the public generator. Compute \((d'||d'') \leftarrow H(d)\) (with \(d'\) and \(d''\) both 256 bits), then \(Q\leftarrow [16d']P\). The public key is the compressed encoding of \(Q\).
- Signing ( sign ). Let M be a message, d a 256-bit secret key, P the public generator, and Q the (compressed) public key. Compute \((d'||d'') \leftarrow H(d)\) (with \(d'\) and \(d''\) both 256 bits), then \(r\leftarrow H(d''||M)\), then \(R\leftarrow [r]P\), then \(h\leftarrow H(R||Q||M)\), and finally \(s\leftarrow \left( r-16h_{128}d'\right) \bmod {N}\). The signature is \((h_{128}||s)\).
- Verification ( verify ). Let M be a message with a signature \((h_{128}||s)\) corresponding to a public key Q, and let P be the public generator. Compute \(T\leftarrow [s]P\oplus [h_{128}]Q\), then \(g\leftarrow H(T||Q||M)\). The signature is correct if \(g_{128}=h_{128}\), and incorrect otherwise.
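To illustrate why verification succeeds, here is a toy model of the scheme in which Jacobian points are replaced by integers modulo a small prime N standing in for the 250-bit group order (so \([s]P\oplus [h_{128}]Q\) becomes \(s + h_{128}Q \bmod N\) with generator \(P = 1\)); the function names and the byte encodings are our own stand-ins, not the real compressed formats:

```python
import hashlib

N = 2**61 - 1  # toy stand-in for the 250-bit prime group order N

def H(M: bytes) -> bytes:
    return hashlib.shake_128(M).digest(64)  # SHAKE128 with 512-bit output

def Hint(*parts: bytes) -> int:
    return int.from_bytes(H(b"".join(parts)), "little")

def enc(x: int) -> bytes:          # toy "compression": fixed-width byte encoding
    return x.to_bytes(32, "little")

def keygen(d: bytes) -> int:
    d1 = int.from_bytes(H(d)[:32], "little")
    return (16 * d1) % N           # Q = [16 d']P

def sign(d: bytes, M: bytes):
    h = H(d)
    d1, d2 = int.from_bytes(h[:32], "little"), h[32:]
    Q = (16 * d1) % N
    r = Hint(d2, M) % N            # r = H(d''||M)
    R = r                          # R = [r]P with P = 1
    h128 = Hint(enc(R), enc(Q), M) % 2**128
    s = (r - 16 * h128 * d1) % N
    return h128, s

def verify(Q: int, M: bytes, sig) -> bool:
    h128, s = sig
    T = (s + h128 * Q) % N         # T = [s]P + [h128]Q; equals R for a valid signature
    return Hint(enc(T), enc(Q), M) % 2**128 == h128
```

Correctness follows because \(T = (r - 16h_{128}d') + h_{128}\cdot 16d' = r\), so the verifier recomputes exactly the hash input used by the signer; in the real scheme the group operations are the Kummer and Jacobian routines of Sects. 3 and 4.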
Remark 1
We note that there may be faster algorithms to compute the “one-and-a-half-dimensional” scalar multiplication in verify, especially since we do not have to worry about being constant-time. One option might be to adapt Montgomery’s PRAC [16, Sect. 3.3.1] to make use of the half-size scalar. But while this may lead to a speed-up, it would also cause an increase in code size compared to simply re-using the one-dimensional scalar multiplication. We have chosen not to pursue this line, preferring the solid benefits of reduced code size instead.
2.2 Diffie-Hellman Key Exchange
For key exchange it is not necessary to have a group structure; a pseudo-multiplication suffices. We can therefore carry out the key exchange directly on the Kummer surface \(\mathcal {K}_{\mathcal {C}}^{} = \mathcal {J}_{\mathcal {C}}^{}/{\left\langle {\pm 1}\right\rangle }\), gaining efficiency by not projecting from and recovering to the Jacobian \(\mathcal {J}_{\mathcal {C}}^{}\). If \(Q\) is a point on \(\mathcal {J}_{\mathcal {C}}^{}\), then its image in \(\mathcal {K}_{\mathcal {C}}^{}\) is \(\pm Q\). The common representation for points in \(\mathcal {K}_{\mathcal {C}}^{}(\mathbb {F}_q)\) is a 512-bit 4-tuple of field elements; for input points (i.e. the generator or public keys), we prefer the 384-bit “wrapped” representation (see Sect. 3.5). This not only reduces key size, but also allows a speed-up in the core xDBLADD subroutine.
- Key exchange ( dh_exchange ). Let d be a 256-bit secret key, and \(\pm P\) the public generator (respectively public key), given in wrapped form. Compute \(\pm Q\leftarrow \pm [d]P\). The generated public key (respectively shared secret) is the wrapped form of \(\pm Q\).
Remark 2
While it might be possible to reduce the key size even further to 256 bits, we would then have to pay the cost of compressing and decompressing, and also wrapping for xDBLADD (see the discussion in [8, App. A]). We therefore choose to keep the 384-bit representation, which is consistent with [3].
3 Building Blocks: Algorithms and Their Implementation
We begin by presenting the finite field \(\mathbb {F}_{2^{127}-1}\) in Sect. 3.1. We then define the curve \(\mathcal {C}\) in Sect. 3.2, before giving basic methods for the elements of \(\mathcal {J}_{\mathcal {C}}^{}\) in Sect. 3.3. We then present the fast Kummer \(\mathcal {K}_{\mathcal {C}}^{}\) and its differential addition operations in Sect. 3.4.
3.1 The Field \(\mathbb {F}_q\)
We work over the prime finite field \(\mathbb {F}_q\), where \(q\) is the Mersenne prime \(q = 2^{127}-1\).
We let M, S, a, s, neg, and I denote the costs of multiplication, squaring, addition, subtraction, negation, and inversion in \(\mathbb {F}_q\). Later, we will define a special operation for multiplying by small constants: its cost is denoted by \(\mathbf {m_c}\).
For complete field arithmetic we implement modular reduction, addition, subtraction, multiplication, and inversion. We comment on some important aspects here, giving cycle counts in Table 2.
We can represent elements of \(\mathbb {F}_q\) as 127-bit values; but since the ATmega and Cortex M0 work with 8- and 32-bit words, respectively, the obvious choice is to represent field elements with 128 bits. That is, an element \(g\in \mathbb {F}_q\) is represented as \(g=\sum _{i=0}^{15}g_i2^{8i}\) on the AVR ATmega platform and as \(g=\sum _{i=0}^{3}g'_i2^{32i}\) on the Cortex M0, where \(g_i\in \{0,\ldots ,2^8-1\}\), \(g'_i\in \{0,\ldots ,2^{32}-1\}\).
Working with the prime field \(\mathbb {F}_q\), we need integer reduction modulo \(q\); this is implemented as bigint_red. Reduction is very efficient because \(2^{128}\equiv 2 \pmod {q}\), which enables us to reduce using only shifts and integer additions. Given this reduction, we implement addition and subtraction operations for \(\mathbb {F}_q\) (as gfe_add and gfe_sub, respectively) in the obvious way.
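As a sketch of this strategy (a Python model of the folding, not the constant-time assembly), the congruences \(2^{128}\equiv 2\) and \(2^{127}\equiv 1 \pmod q\) reduce any 256-bit intermediate with shifts and additions only:

```python
q = 2**127 - 1

def bigint_red(t: int) -> int:
    """Reduce a (up to) 256-bit intermediate t modulo q = 2^127 - 1 without division."""
    t = (t & (2**128 - 1)) + 2 * (t >> 128)  # fold the high half: 2^128 ≡ 2 (mod q)
    t = (t & (2**127 - 1)) + (t >> 127)      # fold again: 2^127 ≡ 1 (mod q)
    return t - q if t >= q else t            # at most one conditional subtraction left
```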
The most costly operations in \(\mathbb {F}_q\) are multiplication (gfe_mul) and squaring (gfe_sqr), which are implemented as \(128\times 128\)-bit integer operations (bigint_mul and bigint_sqr) followed by a call to bigint_red. Since we are working on the same platforms as [7], where both of these operations are already highly optimized, we took the necessary code from those implementations:
-
On the AVR ATmega: The authors of [17] implement a 3-level Karatsuba multiplication of two 256-bit integers, representing elements f of \(\mathbb {F}_{2^{255}-19}\) as \(f=\sum _{i=0}^{31}f_i 2^{8i}\) with \(f_i\in \{0,\ldots ,2^8-1\}\). Since the first level of Karatsuba relies on a \(128\times 128\)-bit integer multiplication routine named MUL128, we simply lift this function out to form a 2-level \(128\times 128\)-bit Karatsuba multiplication. Similarly, their \(256\times 256\)-bit squaring relies on a \(128\times 128\)-bit routine SQR128, which we can (almost) directly use. Since the \(256\times 256\)-bit squaring is 2-level Karatsuba, the \(128\times 128\)-bit squaring is 1-level Karatsuba.
-
On the ARM Cortex M0: The authors of [7] use optimized Karatsuba multiplication and squaring. Their assembly code does not use subroutines, but fully inlines \(128\times 128\)-bit multiplication and squaring. The \(256\times 256\)-bit multiplication and squaring are both 3-level Karatsuba implementations. Hence, using these, we end up with 2-level \(128\times 128\)-bit Karatsuba multiplication and squaring.
The function gfe_invert computes inversions in \(\mathbb {F}_q\) as exponentiations, using the fact that \(g^{-1} = g^{q-2}\) for all \(g\) in \(\mathbb {F}_q^\times \). To do this efficiently we use an addition chain for \(q-2\), doing the exponentiation in \(10\mathbf{M}+126\mathbf{S}\).
Finally, to speed up our Jacobian point decompression algorithms, we define a function gfe_powminhalf which computes \(g\mapsto g^{-1/2}\) for \(g\) in \(\mathbb {F}_q\) (up to a choice of sign). To do this, we note that \( g^{-1/2} = \pm g^{-(q+1)/4} =\pm g^{{(3q-5)}/{4}} \) in \(\mathbb {F}_q\); this exponentiation can be done with an addition chain of length 136, using \(11\mathbf{M}+125\mathbf{S}\). We can then define a function gfe_sqrtinv, which given \((x,y)\) and a bit \(b\), computes \((\sqrt{x},1/y)\) as \((\pm xyz,xyz^2)\) where \(z = \mathtt {gfe\_powminhalf}(xy^2)\), choosing the sign so that the square root has least significant bit \(b\). Including the gfe_powminhalf call, this costs 15M + 126S + 1neg.
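Functionally, these maps are plain exponentiations; the following Python model (generic square-and-multiply via pow, rather than our fixed addition chains, and omitting the sign-selection bit b) shows the relations being used:

```python
q = 2**127 - 1

def gfe_invert(g: int) -> int:
    """g^(q-2) = 1/g in F_q (Fermat); the real code uses a 10M + 126S addition chain."""
    return pow(g, q - 2, q)

def gfe_powminhalf(g: int) -> int:
    """g^((3q-5)/4) = ±g^(-1/2) in F_q (up to a choice of sign)."""
    return pow(g, (3 * q - 5) // 4, q)

def gfe_sqrtinv(x: int, y: int):
    """Given (x, y) with x*y^2 a square, return (±sqrt(x), 1/y) from one gfe_powminhalf."""
    z = gfe_powminhalf((x * y * y) % q)  # z = ±(x y^2)^(-1/2)
    s = (x * y * z) % q                  # s = x y z = ±sqrt(x)
    return s, (s * z) % q                # s z = x y z^2 = 1/y
```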
3.2 The Curve \(\mathcal {C}\) and Its Theta Constants
We define the curve \(\mathcal {C}\) “backwards”, starting from its (squared) theta constants \((a:b:c:d)\), four fixed small integers. From these, we define the dual theta constants \((A:B:C:D) = (a+b+c+d : a+b-c-d : a-b+c-d : a-b-c+d)\). Observe that projectively, \((1/a:1/b:1/c:1/d) = (bcd:acd:abd:abc)\) and \((1/A:1/B:1/C:1/D) = (BCD:ACD:ABD:ABC)\).
Crucially, all of these constants can be represented using just 16 bits each. Since Kummer arithmetic involves many multiplications by these constants, we implement a separate \(16\times 128\)-bit multiplication function gfe_mulconst. For the AVR ATmega, we store the constants in two 8-bit registers. For the Cortex M0, the values fit into a halfword; this works well with the \(16\!\times \!16\)-bit multiplication. Multiplication by any of these 16-bit constants costs \(\mathbf {m_c}\).
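As an illustration, take the squared theta constants \((a:b:c:d) = (11:-22:-19:-3)\) of the Gaudry–Schost Kummer (values quoted from [3], since they do not survive in this version of the text): the duals follow by the Hadamard transform, and all stored constants, including the projectively scaled inverses, indeed fit in 16 bits:

```python
q = 2**127 - 1
a, b, c, d = 11, -22, -19, -3                     # squared theta constants, as in [3]
A, B, C, D = a+b+c+d, a+b-c-d, a-b+c-d, a-b-c+d   # duals: (A:B:C:D) = H((a:b:c:d))
# Projectively, (1/a : 1/b : 1/c : 1/d) = (bcd : acd : abd : abc), likewise for duals:
inv_small = (b*c*d, a*c*d, a*b*d, a*b*c)
inv_dual = (B*C*D, A*C*D, A*B*D, A*B*C)
all_16bit = all(-2**15 <= v < 2**15
                for v in (a, b, c, d, A, B, C, D, *inv_small, *inv_dual))
```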
Continuing, we define \(e/f := (1 + \alpha )/(1 - \alpha )\), where \(\alpha ^2 = CD/AB\) (we take the square root with least significant bit 0). From \(a,\ldots ,f\) one then obtains the Rosenhain invariants \(\lambda \), \(\mu \), and \(\nu \) of the curve \(\mathcal {C}\), found by Gaudry and Schost [18], which we are (finally!) ready to define as
\[ \mathcal {C}: Y^2 = f_\mathcal {C}(X) := X(X-1)(X-\lambda )(X-\mu )(X-\nu ). \]
The curve constants are the coefficients of \(f_\mathcal {C}(X) = \sum _{i=0}^5f_iX^i\): so \(f_0 = 0\), \(f_5 = 1\), and
\[ f_4 = -(1+\lambda +\mu +\nu ), \quad f_3 = \lambda +\mu +\nu +\lambda \mu +\lambda \nu +\mu \nu , \quad f_2 = -(\lambda \mu +\lambda \nu +\mu \nu +\lambda \mu \nu ), \quad f_1 = \lambda \mu \nu . \]
We store the squared theta constants \((a:b:c:d)\), along with \((1/a:1/b:1/c:1/d)\), and \((1/A:1/B:1/C:1/D)\); the Rosenhain invariants \(\lambda \), \(\mu \), and \(\nu \), together with \(\lambda \mu \) and \(\lambda \nu \); and the curve constants \(f_1\), \(f_2\), \(f_3\), and \(f_4\), for use in our Kummer and Jacobian arithmetic functions. Obviously, none of the Rosenhain or curve constants are small; multiplying by these costs a full M.
3.3 Elements of \(\mathcal {J}_{\mathcal {C}}^{}\), Compressed and Decompressed
Our algorithms use the usual Mumford representation for elements of \(\mathcal {J}_{\mathcal {C}}^{}(\mathbb {F}_q)\): they correspond to pairs \( {\left\langle {u(X)},{v(X)}\right\rangle } \), where \(u\) and \(v\) are polynomials over \(\mathbb {F}_q\) with \(u\) monic, \(\deg v < \deg u \le 2\), and \(v(X)^2 \equiv f_\mathcal {C}(X) \pmod {u(X)}\). We compute the group operation \(\oplus \) in \(\mathcal {J}_{\mathcal {C}}^{}(\mathbb {F}_q)\) using a function ADD, which implements the algorithm found in [19] (after a change of coordinates to meet their Assumption 1)Footnote 1 at a cost of 28M + 2S + 11a + 24s + 1I.
For transmission, we compress the 508-bit Mumford representation to a 256-bit form. Our functions compress (Algorithm 1) and decompress (Algorithm 2) implement Stahlke’s compression technique (see [20] and [8, Appendix A] for details).
3.4 The Kummer Surface \(\mathcal {K}_{\mathcal {C}}^{}\)
The Kummer surface of \(\mathcal {C}\) is the quotient \(\mathcal {K}_{\mathcal {C}}^{}:= \mathcal {J}_{\mathcal {C}}^{}/{\left\langle {\pm 1}\right\rangle }\); points on \(\mathcal {K}_{\mathcal {C}}^{}\) correspond to points on \(\mathcal {J}_{\mathcal {C}}^{}\) taken up to sign. If \(P\) is a point in \(\mathcal {J}_{\mathcal {C}}^{}\), then we write \(\pm P = (x_P:y_P:z_P:t_P)\) for its image in \(\mathcal {K}_{\mathcal {C}}^{}\). To avoid subscript explosion, we make the following convention: when points \(P\) and \(Q\) on \(\mathcal {J}_{\mathcal {C}}^{}\) are clear from the context, we write \((x_\oplus :y_\oplus :z_\oplus :t_\oplus ) = \pm (P\oplus Q)\) and \((x_\ominus :y_\ominus :z_\ominus :t_\ominus ) = \pm (P\ominus Q)\).
The Kummer surface of this \(\mathcal {C}\) has a “fast” model in \(\mathbb {P}^3\) defined by
\[ \mathcal {K}_{\mathcal {C}}^{}: E\cdot xyzt = \left( (x^2+y^2+z^2+t^2) - F(xt+yz) - G(xz+yt) - H(xy+zt)\right) ^2, \]
where
\[ F = \frac{a^2-b^2-c^2+d^2}{ad-bc}, \quad G = \frac{a^2-b^2+c^2-d^2}{ac-bd}, \quad H = \frac{a^2+b^2-c^2-d^2}{ab-cd}, \]
and \(E = 4abcd\left( ABCD/((ad-bc)(ac-bd)(ab-cd))\right) ^2 \) (see e.g. [8, 21, 22]). The identity point \({\left\langle {1},{0}\right\rangle }\) of \(\mathcal {J}_{\mathcal {C}}^{}\) maps to \(\pm 0 = (a:b:c:d)\).
Algorithm 3 (Project) maps general points from \(\mathcal {J}_{\mathcal {C}}^{}(\mathbb {F}_q)\) into \(\mathcal {K}_{\mathcal {C}}^{}\). The “special” case where \(u\) is linear is treated in [8, Sect. 7.2]; this is not implemented, since Project only operates on public generators and keys, none of which are special.
3.5 Pseudo-addition on \(\mathcal {K}_{\mathcal {C}}^{}\)
While the points of \(\mathcal {K}_{\mathcal {C}}^{}\) do not form a group, we have a pseudo-addition operation (differential addition), which computes \(\pm (P\oplus Q)\) from \(\pm P\), \(\pm Q\), and \(\pm (P\ominus Q)\). The function \(\texttt {xADD}\) (Algorithm 4) implements the standard differential addition. The special case where \(P = Q\) yields a pseudo-doubling operation.
To simplify the presentation of our algorithms, we define three operations on points in \(\mathbb {P}^3\). First, \(\mathcal {M}: \mathbb {P}^3\times \mathbb {P}^3\rightarrow \mathbb {P}^3\) multiplies corresponding coordinates:
\[ \mathcal {M}\left( (x_1:y_1:z_1:t_1), (x_2:y_2:z_2:t_2)\right) = (x_1x_2:y_1y_2:z_1z_2:t_1t_2). \]
The special case \((x_1:y_1:z_1:t_1) = (x_2:y_2:z_2:t_2)\) is denoted by \(\mathcal {S}: (x:y:z:t)\mapsto (x^2:y^2:z^2:t^2)\).
Finally, the Hadamard transformFootnote 2 is defined by
\[ \mathcal {H}: (x:y:z:t) \longmapsto (x+y+z+t \,:\, x+y-z-t \,:\, x-y+z-t \,:\, x-y-z+t). \]
Clearly \(\mathcal {M}\) and \(\mathcal {S}\) cost \(4\mathbf {M}\) and \(4\mathbf {S}\), respectively. The Hadamard transform can easily be implemented with \(4\mathbf {a}+4\mathbf {s}\). However, the additions and subtractions are relatively cheap, making function call overhead a large factor. To minimize this we inline the Hadamard transform, trading a bit of code size for efficiency.
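A sketch of the inlined computation (our own Python model): sharing two sums and two differences realizes \(\mathcal {H}\) in exactly \(4\mathbf {a}+4\mathbf {s}\). In the field implementation each operation is followed by a reduction modulo \(q\), omitted here.

```python
def hadamard(x, y, z, t):
    """Hadamard transform (x+y+z+t : x+y-z-t : x-y+z-t : x-y-z+t) in 4 adds + 4 subs."""
    s1, d1 = x + y, x - y                        # 1a + 1s
    s2, d2 = z + t, z - t                        # 1a + 1s
    return (s1 + s2, s1 - s2, d1 + d2, d1 - d2)  # 2a + 2s
```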
Lines 5 and 6 of Algorithm 4 only involve the third argument, \(\pm (P\ominus Q)\); essentially, they compute the point \(( y_\ominus z_\ominus t_\ominus : x_\ominus z_\ominus t_\ominus : x_\ominus y_\ominus t_\ominus : x_\ominus y_\ominus z_\ominus )\) (which is projectively equivalent to \((1/x_\ominus : 1/y_\ominus : 1/z_\ominus : 1/t_\ominus )\), but requires no inversions; note that this is generally not a point on \(\mathcal {K}_{\mathcal {C}}^{}\)). In practice, the pseudoadditions used in our scalar multiplication all use a fixed third argument, so it makes sense to precompute this “inverted” point and to scale it by \(x_\ominus \) so that the first coordinate is \(1\), thus saving \(7\mathbf M \) in each subsequent differential addition for a one-off cost of \(1\mathbf I \). The resulting data can be stored as the 3-tuple \((x_\ominus /y_\ominus ,x_\ominus /z_\ominus ,x_\ominus /t_\ominus )\), ignoring the trivial first coordinate: this is the wrapped form of \(\pm (P\ominus Q)\). The function xWRAP (Algorithm 5) applies this transformation.
Algorithm 6 combines the pseudo-doubling with the differential addition, sharing intermediate operands, to define a differential double-and-add \(\texttt {xDBLADD}\). This is the fundamental building block of the Montgomery ladder.
4 Scalar Multiplication
All of our cryptographic routines are built around scalar multiplication in \(\mathcal {J}_{\mathcal {C}}^{}\) and pseudo-scalar multiplication in \(\mathcal {K}_{\mathcal {C}}^{}\). We implement pseudo-scalar multiplication using the classic Montgomery ladder in Sect. 4.1. In Sect. 4.2, we extend this to full scalar multiplication on \(\mathcal {J}_{\mathcal {C}}^{}\) using the point recovery technique proposed in [8].
4.1 Pseudomultiplication on \(\mathcal {K}_{\mathcal {C}}^{}\)
Since \([m](\ominus P) = \ominus [m]P\) for all \(m\) and \(P\), we have a pseudo-scalar multiplication operation \((m,\pm P)\longmapsto \pm [m]P\) on \(\mathcal {K}_{\mathcal {C}}^{}\), which we compute using Algorithm 7 (the Montgomery ladder), implemented as crypto_scalarmult. The loop of Algorithm 7 maintains the following invariant: at the end of iteration \(i\) we have \((V_1,V_2) = \left( \pm [\lfloor m/2^i\rfloor ]P,\ \pm [\lfloor m/2^i\rfloor +1]P\right) \).
Hence, at the end we return \(\pm [m]P\), and also \(\pm [m+1]P\) as a (free) byproduct. We suppose we have a constant-time conditional swap routine \(\texttt {CSWAP}(b,(V_1,V_2))\), which returns \((V_1,V_2)\) if \(b = 0\) and \((V_2,V_1)\) if \(b = 1\). This makes the execution of Algorithm 7 uniform and constant-time, and thus suitable for use with secret \(m\).
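The ladder logic can be sketched as follows, with a toy additive group of integers standing in for Kummer points (so “double” is \(2V_1\) and “differential add” is \(V_1+V_2\)); the names cswap and ladder are ours, and Python branching is of course not actually constant-time:

```python
def cswap(b, v1, v2):
    """Model of the constant-time conditional swap CSWAP."""
    return (v2, v1) if b else (v1, v2)

def ladder(m, P, bits):
    """Montgomery ladder over a toy additive group: returns ([m]P, [m+1]P).

    Invariant: after processing bit i, (V1, V2) = ([k]P, [k+1]P)
    with k = m >> i (the bits of m from the top down to position i)."""
    V1, V2 = 0, P                    # ([0]P, [1]P)
    for i in reversed(range(bits)):
        b = (m >> i) & 1
        V1, V2 = cswap(b, V1, V2)
        V1, V2 = 2 * V1, V1 + V2     # the double-and-add (xDBLADD) step
        V1, V2 = cswap(b, V1, V2)
    return V1, V2
```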
Our implementation of crypto_scalarmult assumes that its input Kummer point \(\pm P\) is wrapped. This follows the approach of [3]. Indeed, many calls to crypto_scalarmult involve Kummer points that are stored or transmitted in wrapped form. However, crypto_scalarmult does require the unwrapped point internally—if only to initialize one variable. We therefore define a function xUNWRAP (Algorithm 8) to invert the xWRAP transformation at a cost of only 4M.
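The two transformations can be modeled as follows (a Python sketch under our own naming, mirroring xWRAP and xUNWRAP): wrapping costs an inversion, while unwrapping rebuilds a projective representative with just 4 multiplications and no inversions.

```python
q = 2**127 - 1

def xwrap(x, y, z, t):
    """Wrapped form (x/y, x/z, x/t) of a Kummer point; one shared inversion (Fermat)."""
    inv_yzt = pow(y * z * t, q - 2, q)   # 1/(y z t)
    return (x * z * t * inv_yzt % q,     # x/y
            x * y * t * inv_yzt % q,     # x/z
            x * y * z * inv_yzt % q)     # x/t

def xunwrap(w1, w2, w3):
    """Rebuild (x:y:z:t) projectively from (w1, w2, w3) = (x/y, x/z, x/t) in 4M:
    (x:y:z:t) ~ (w1 w2 w3 : w2 w3 : w1 w3 : w1 w2)."""
    u = w1 * w2 % q
    return (u * w3 % q, w2 * w3 % q, w1 * w3 % q, u)
```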
4.2 Point Recovery from \(\mathcal {K}_{\mathcal {C}}^{}\) to \(\mathcal {J}_{\mathcal {C}}^{}\)
Point recovery means efficiently computing \([m]P\) on \(\mathcal {J}_{\mathcal {C}}^{}\) given \(\pm [m]P\) on \(\mathcal {K}_{\mathcal {C}}^{}\) and some additional information. In our case, the additional information is the base point \(P\) and the second output of the Montgomery ladder, \(\pm [m+1]P\). Algorithm 9 (Recover) implements the point recovery described in [8]. This is the genus-2 analogue of the elliptic-curve methods in [24–26].
We refer the reader to [8] for technical details on this method, but there is one important mathematical detail that we should mention (since it is reflected in the structure of our code): point recovery is more natural starting from the general Flynn model \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) of the Kummer, because it is more closely related to the Mumford model for \(\mathcal {J}_{\mathcal {C}}^{}\). Algorithm 9 therefore proceeds in two steps: first Algorithms 10 (fast2genFull) and 11 (fast2genPartial) map the problem into \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\), and then we recover from \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) to \(\mathcal {J}_{\mathcal {C}}^{}\) using Algorithm 12 (recoverGeneral).
Since the general Kummer \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) only appears briefly in our recovery procedure (we never use its relatively slow arithmetic operations), we will not investigate it in detail here—but the curious reader may refer to [27] for the general theory. For our purposes, it suffices to recall that \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) is, like \(\mathcal {K}_{\mathcal {C}}^{}\), embedded in \(\mathbb {P}^3\); the isomorphism \(\mathcal {K}_{\mathcal {C}}^{}\rightarrow {\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) is the linear transformation defined (in e.g. [8, Sect. 7.4]) by a fixed \(4\times 4\) matrix \(L\) over \(\mathbb {F}_q\) (determined, up to scalar multiple, by the curve constants),
which we precompute and store. If \(\pm P\) is a point on \(\mathcal {K}_{\mathcal {C}}^{}\), then \(\widetilde{\pm P}\) denotes its image on \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\); we compute \(\widetilde{\pm P}\) using Algorithm 10 (fast2genFull).
Sometimes we only require the first three coordinates of \(\widetilde{\pm P}\). Algorithm 11 (fast2genPartial) saves \(4\mathbf M +3\mathbf a \) per point by not computing \(\tilde{t}_P\).
4.3 Full Scalar Multiplication on \(\mathcal {J}_{\mathcal {C}}^{}\)
We now combine our pseudo-scalar multiplication function crypto_scalarmult with the point-recovery function Recover to define a full scalar multiplication function jacobian_scalarmult (Algorithm 13) on \(\mathcal {J}_{\mathcal {C}}^{}\).
Remark 3
jacobian_scalarmult takes not only a scalar \(m\) and a Jacobian point \(P\) in its Mumford representation, but also the wrapped form of \(\pm P\) as an auxiliary argument: that is, we assume that \(\texttt {xP} \leftarrow \texttt {Project}(P)\) and \(\texttt {xWRAP}(\texttt {xP})\) have already been carried out. This saves redundant Project and xWRAP calls when operating on fixed base points, as is often the case in our protocols. Nevertheless, jacobian_scalarmult could easily be converted to a “pure” Jacobian scalar multiplication function (with no auxiliary input) by inserting appropriate Project and xWRAP calls at the start, and removing the xUNWRAP call at Line 2, increasing the total cost by 11M + 1S + 4\(\mathbf {m_c}\) + 7a + 8s + 1I.
5 Results and Comparison
The high-level cryptographic functions for our signature scheme are named keygen, sign and verify. Their implementations contain no surprises: they do exactly what was specified in Sect. 2.1, calling the lower-level functions described in Sects. 3 and 4 as required. Our Diffie-Hellman key generation and key exchange use only the function dh_exchange, which implements exactly what we specified in Sect. 2.2: one call to crypto_scalarmult plus a call to xWRAP to convert to the correct 384-bit representation. Table 1 (in the introduction) presents the cycle counts and stack usage for all of our high-level functions.
Code and compilation. For our experiments, we compiled our AVR ATmega code with avr-gcc -O2, and our ARM Cortex M0 code with clang -O2 (the optimization levels -O3, -O1, and -Os gave fairly similar results). The total program size is \(20\,242\) bytes for the AVR ATmega, and \(19\,606\) bytes for the ARM Cortex M0. This consists of the full signature and key-exchange code, including the reference implementation of the hash function SHAKE128 with 512-bit output.Footnote 3
Basis for comparison. As we believe ours to be the first genus-2 hyperelliptic curve implementation on both the AVR ATmega and the ARM Cortex M0 architectures, we can only compare with elliptic curve-based alternatives at the same 128-bit security level: notably [7, 29–31]. This comparison is not superficial: the key exchange in [7, 29, 30] uses the highly efficient \(x\)-only arithmetic on Montgomery elliptic curves, while [31] uses similar techniques for Weierstrass elliptic curves, and \(x\)-only arithmetic is the exact elliptic-curve analogue of Kummer surface arithmetic. To provide full scalar multiplication in a group, [31] appends \(y\)-coordinate recovery to its \(x\)-only arithmetic (using the approach of [26]); again, this is the elliptic-curve analogue of our methods.
Results for ARM Cortex M0. As we see in Table 4, genus-2 techniques give great results for Diffie–Hellman key exchange on the ARM Cortex M0 architecture. Compared with the current fastest implementation [7], we reduce the number of clock cycles by about \(27\,\%\), while roughly halving code size and stack space. For signatures, the state-of-the-art is [31]: here we reduce the cycle count for the underlying scalar multiplications by a very impressive \(75\,\%\), at the cost of an increase in code size and stack usage.
Results for AVR ATmega. Looking at Table 5, on the AVR ATmega architecture we reduce the cycle count for Diffie–Hellman by about \(32\,\%\) compared with the current record [7], again roughly halving the code size, and reducing stack usage by about \(80\,\%\). The cycle count for Jacobian scalar multiplication (for signatures) is reduced by \(71\,\%\) compared with [31], while increasing the stack usage by \(25\,\%\).
Finally we can compare to the current fastest full signature implementation [10], shown in Table 6. We almost halve the number of cycles, while reducing stack usage by a decent margin (code size is not reported in [10]).
Notes
- 1.
We only call ADD once in our algorithms, so for lack of space we omit its description.
- 2.
Note that \((A:B:C:D) = \mathcal {H}((a:b:c:d))\) and \((a:b:c:d) = \mathcal {H}((A:B:C:D))\).
- 3.
References
Bernstein, D.J.: Elliptic vs. hyperelliptic, part 1 (2006). http://cr.yp.to/talks/2006.09.20/slides.pdf
Bos, J.W., Costello, C., Hisil, H., Lauter, K.: Fast cryptography in genus 2. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013. LNCS, vol. 7881, pp. 194–210. Springer, Heidelberg (2013). https://eprint.iacr.org/2012/670.pdf
Bernstein, D.J., Chuengsatiansup, C., Lange, T., Schwabe, P.: Kummer strikes back: new DH speed records. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014. LNCS, vol. 8873, pp. 317–337. Springer, Heidelberg (2014). https://cryptojedi.org/papers/#kummer
Costello, C., Longa, P.: Four\(\mathbb{Q}\): four-dimensional decompositions on a \(\mathbb{Q}\)-curve over the Mersenne prime. In: Iwata, T., Cheon, J.H. (eds.) ASIACRYPT 2015. LNCS, vol. 9452, pp. 214–235. Springer, Heidelberg (2015). https://eprint.iacr.org/2015/565
Batina, L., Hwang, D., Hodjat, A., Preneel, B., Verbauwhede, I.: Hardware/Software Co-design for Hyperelliptic Curve Cryptography (HECC) on the 8051 \(\mu P\). In: Rao, J.R., Sunar, B. (eds.) CHES 2005. LNCS, vol. 3659, pp. 106–118. Springer, Heidelberg (2005). https://www.iacr.org/archive/ches2005/008.pdf
Hodjat, A., Batina, L., Hwang, D., Verbauwhede, I.: HW/SW co-design of a hyperelliptic curve cryptosystem using a microcode instruction set coprocessor. Integr. VLSI J. 40, 45–51 (2007). https://www.cosic.esat.kuleuven.be/publications/article-622.pdf
Düll, M., Haase, B., Hinterwälder, G., Hutter, M., Paar, C., Sánchez, A.H., Schwabe, P.: High-speed Curve25519 on 8-bit, 16-bit and 32-bit microcontrollers. Des. Codes Crypt. 77, 493–514 (2015). http://cryptojedi.org/papers/#mu25519
Costello, C., Chung, P.N., Smith, B.: Fast, uniform, and compact scalar multiplication for elliptic curves and genus 2 Jacobians with applications to signature schemes. Cryptology ePrint Archive, Report 2015/983 (2015). https://eprint.iacr.org/2015/983
Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.Y.: High-speed high-security signatures. J. Cryptogr. Eng. 2, 77–89 (2012). https://cryptojedi.org/papers/ed25519
Nascimento, E., López, J., Dahab, R.: Efficient and secure elliptic curve cryptography for 8-bit AVR microcontrollers. In: Chakraborty, R.S., Schwabe, P., Solworth, J. (eds.) SPACE 2015. LNCS, vol. 9354, pp. 289–309. Springer, Heidelberg (2015)
Schnorr, C.-P.: Efficient identification and signatures for smart cards. In: Brassard, G. (ed.) CRYPTO 1989. LNCS, vol. 435, pp. 239–252. Springer, Heidelberg (1990)
Dworkin, M.J.: SHA-3 standard: permutation-based hash and extendable-output functions. Technical report, National Institute of Standards and Technology (NIST) (2015). http://www.nist.gov/manuscript-publication-search.cfm?pub_id=919061
Katz, J., Wang, N.: Efficiency improvements for signature schemes with tight security reductions. In: Proceedings of the 10th ACM Conference on Computer and Communications Security, CCS 2003, pp. 155–164. ACM (2003). https://www.cs.umd.edu/~jkatz/papers/CCCS03_sigs.pdf
Vitek, J., Naccache, D., Pointcheval, D., Vaudenay, S.: Computational alternatives to random number generators. In: Tavares, S., Meijer, H. (eds.) SAC 1998. LNCS, vol. 1556, pp. 72–80. Springer, Heidelberg (1999). https://www.di.ens.fr/~pointche/Documents/Papers/1998_sac.pdf
Bernstein, D.J.: Differential addition chains (2006). http://cr.yp.to/ecdh/diffchain-20060219.pdf
Stam, M.: Speeding up subgroup cryptosystems. Ph.D. thesis, Technische Universiteit Eindhoven (2003). http://alexandria.tue.nl/extra2/200311829.pdf?q=subgroup
Hutter, M., Schwabe, P.: Multiprecision multiplication on AVR revisited. J. Cryptogr. Eng. 5, 201–214 (2015). http://cryptojedi.org/papers/#avrmul
Gaudry, P., Schost, E.: Genus 2 point counting over prime fields. J. Symb. Comput. 47, 368–400 (2012). https://cs.uwaterloo.ca/~eschost/publications/countg2.pdf
Hisil, H., Costello, C.: Jacobian coordinates on genus 2 curves. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014. LNCS, vol. 8873, pp. 338–357. Springer, Heidelberg (2014). https://eprint.iacr.org/2014/385.pdf
Stahlke, C.: Point compression on Jacobians of hyperelliptic curves over \(\mathbb{F}_q\). Cryptology ePrint Archive, Report 2004/030 (2004). https://eprint.iacr.org/2004/030
Chudnovsky, D.V., Chudnovsky, G.V.: Sequences of numbers generated by addition in formal groups and new primality and factorization tests. Adv. Appl. Math. 7, 385–434 (1986)
Cosset, R.: Applications of theta functions for hyperelliptic curve cryptography. Ph.D. thesis, Université Henri Poincaré - Nancy I (2011). https://tel.archives-ouvertes.fr/tel-00642951/file/main.pdf
Gaudry, P.: Fast genus 2 arithmetic based on theta functions. J. Math. Cryptol. 1, 243–265 (2007). https://eprint.iacr.org/2005/314/
López, J., Dahab, R.: Fast multiplication on elliptic curves over \(GF(2^m)\) without precomputation. In: Koç, Ç.K., Paar, C. (eds.) CHES 1999. LNCS, vol. 1717, pp. 316–327. Springer, Heidelberg (1999)
Okeya, K., Sakurai, K.: Efficient elliptic curve cryptosystems from a scalar multiplication algorithm with recovery of the \(y\)-coordinate on a Montgomery-form elliptic curve. In: Koç, Ç.K., Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 126–141. Springer, Heidelberg (2001)
Brier, E., Joye, M.: Weierstraß elliptic curves and side-channel attacks. In: Naccache, D., Paillier, P. (eds.) PKC 2002. LNCS, vol. 2274, pp. 335–345. Springer, Heidelberg (2002). http://link.springer.com/content/pdf/10.1007%2F3-540-45664-3_24.pdf
Cassels, J.W.S., Flynn, E.V.: Prolegomena to a Middlebrow Arithmetic of Curves of Genus 2, vol. 230. Cambridge University Press, Cambridge (1996)
Bertoni, G., Daemen, J., Peeters, M., Assche, G.V.: The Keccak sponge function family (2016). http://keccak.noekeon.org/
Liu, Z., Wenger, E., Großschädl, J.: MoTE-ECC: energy-scalable elliptic curve cryptography for wireless sensor networks. In: Boureanu, I., Owesarski, P., Vaudenay, S. (eds.) ACNS 2014. LNCS, vol. 8479, pp. 361–379. Springer, Heidelberg (2014). https://online.tugraz.at/tug_online/voe_main2.getvolltext?pCurrPk=77985
Hutter, M., Schwabe, P.: NaCl on 8-bit AVR microcontrollers. In: Youssef, A., Nitaj, A., Hassanien, A.E. (eds.) AFRICACRYPT 2013. LNCS, vol. 7918, pp. 156–172. Springer, Heidelberg (2013). http://cryptojedi.org/papers/#avrnacl
Wenger, E., Unterluggauer, T., Werner, M.: 8/16/32 shades of elliptic curve cryptography on embedded processors. In: Paul, G., Vaudenay, S. (eds.) INDOCRYPT 2013. LNCS, vol. 8250, pp. 244–261. Springer, Heidelberg (2013). https://online.tugraz.at/tug_online/voe_main2.getvolltext?pCurrPk=72486
© 2016 International Association for Cryptologic Research
Renes, J., Schwabe, P., Smith, B., Batina, L. (2016). \(\mu \)Kummer: Efficient Hyperelliptic Signatures and Key Exchange on Microcontrollers. In: Gierlichs, B., Poschmann, A. (eds.) Cryptographic Hardware and Embedded Systems – CHES 2016. Lecture Notes in Computer Science, vol. 9813. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53140-2_15