
1 Introduction

The current state of the art in asymmetric cryptography, not only on microcontrollers, is elliptic-curve cryptography; the most widely accepted security target is the 128-bit level. All current speed records for 128-bit secure key exchange and signatures on microcontrollers are held—until now—by elliptic-curve-based schemes. Outside the world of microcontrollers, it is well known that genus-2 hyperelliptic curves and their Kummer surfaces present an attractive alternative to elliptic curves [1, 2]. For example, at Asiacrypt 2014 Bernstein, Chuengsatiansup, Lange and Schwabe [3] presented speed records for timing-attack-protected 128-bit-secure scalar multiplication on a range of architectures with Kummer-based software. These speed records are currently surpassed only by the elliptic-curve-based Four\(\mathbb {Q}\) software by Costello and Longa [4] presented at Asiacrypt 2015, which makes heavy use of efficiently computable endomorphisms (i.e., of additional structure of the underlying elliptic curve). The Kummer-based speed records in [3] were achieved by exploiting the computational power of vector units of recent “large” processors such as Intel Sandy Bridge, Ivy Bridge, and Haswell, or the ARM Cortex-A8. Surprisingly, very little attention has been given to Kummer surfaces on embedded processors. Indeed, this is the first work showing the feasibility of software-only implementations of hyperelliptic-curve-based cryptography on constrained platforms. There have been some investigations of binary hyperelliptic curves targeting the much lower 80-bit security level, but those are actually examples of hardware-software co-design, showing that hardware acceleration of the field operations was needed to reach reasonable performance (see e.g. [5, 6]).

In this paper we investigate the potential of genus-2 hyperelliptic curves for both key exchange and signatures on the “classical” 8-bit AVR ATmega architecture and on the more modern 32-bit ARM Cortex-M0 processor. The former offers the most prior results to compare against, while the ARM architecture is increasingly relevant in real-world applications. We show that not only are hyperelliptic curves competitive, they clearly outperform state-of-the-art elliptic-curve schemes in terms of speed and size. For example, our variable-basepoint scalar multiplication on a 127-bit Kummer surface is 31 % faster on AVR and 26 % faster on the M0 than the recently presented speed records for Curve25519 software by Düll et al. [7]; our implementation is also smaller, and requires less RAM.

We use a recent result by Chung, Costello, and Smith [8] to also set new speed records for 128-bit secure signatures. Specifically, we present a new signature scheme based on fast Kummer surface arithmetic. It is inspired by the EdDSA construction by Bernstein, Duif, Lange, Schwabe, and Yang [9]. On the ATmega, it produces shorter signatures, achieves higher speeds and needs less RAM than the Ed25519 implementation presented in [10].

Table 1. Cycle counts and stack usage in bytes of all functions related to the signature and key exchange schemes, for the AVR ATmega and ARM Cortex M0 microcontrollers.

Our routines handling secret data are constant-time, and are thus naturally resistant to timing attacks. These algorithms are built around the Montgomery ladder, which improves resistance against simple-power-analysis (SPA) attacks. Resistance to differential-power-analysis (DPA) attacks can easily be added to the implementation by randomizing the scalar and/or the Jacobian points. Re-randomizing the latter after each ladder step would also guarantee resistance against horizontal attacks.

Source code. We place all of the software described in this paper into the public domain, to maximize the reusability of our results. The software is available at http://www.cs.ru.nl/~jrenes/.

2 High-Level Overview

We begin by describing the details of our signature and Diffie–Hellman schemes, explaining the choices we made in their design. Concrete implementation details appear in Sects. 3 and 4 below. Experimental results and comparisons follow in Sect. 5.

2.1 Signatures

Our signature scheme, defined at the end of this section, adheres closely to the proposal of [8, Sect. 8], which in turn is a type of Schnorr signature [11]. There are however some differences and trade-offs, which we discuss below.

Group structure. We build the signature scheme on top of the group structure from the Jacobian \(\mathcal {J}_\mathcal {C}(\mathbb {F}_q)\) of a genus-2 hyperelliptic curve \(\mathcal {C}\). More specifically, \(\mathcal {C}\) is the Gaudry–Schost curve over the prime field \(\mathbb {F}_q\) with \(q=2^{127}-1\) (cf. Sect. 3.2). The Jacobian is a group of order \(\#\mathcal {J}_\mathcal {C}(\mathbb {F}_q)=2^4N\), where

$$N=2^{250}-\mathtt{0x334D69820C75294D2C27FC9F9A154FF47730B4B840C05BD}$$

is a 250-bit prime. For more details on the Jacobian and its elements, see Sect. 3.3.

Hash function. We may use any hash function H with a 128-bit security level. For our purposes, \(H(M)=\mathtt{SHAKE128}(M, 512)\) suffices [12]. While SHAKE128 has variable-length output, we only use the 512-bit output implementation.
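For illustration, a 512-bit SHAKE128 digest can be computed with Python's standard hashlib module; our implementation links against the SHAKE128 reference code instead.

```python
import hashlib

def H(message: bytes) -> bytes:
    """H(M) = SHAKE128(M, 512): a 64-byte (512-bit) digest."""
    return hashlib.shake_128(message).digest(64)

# Example: splitting H(d) into the two 256-bit halves (d' || d'') used below.
d = bytes(32)                          # placeholder 256-bit secret key
d_prime, d_dprime = H(d)[:32], H(d)[32:]
```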

Encoding. At the highest level, we operate on points Q in \(\mathcal {J}_\mathcal {C}(\mathbb {F}_q)\). To minimize communication costs, we compress the usual 508-bit representation of Q into a 256-bit encoding, which we denote by \(\overline{Q}\) (see Sect. 3.3). (This notation is the same as in [9].)

Public generator. The public generator can be any element P of \(\mathcal {J}_\mathcal {C}(\mathbb {F}_q)\) such that \([N]P=0\). In our implementation we have made the arbitrary choice \( P = (X^2+u_1X+u_0, v_1X+v_0) \), where

$$\begin{aligned} u_1&= {\scriptstyle \mathtt 0x7D5D9C3307E959BF27B8C76211D35E8A},&u_0&= {\scriptstyle \mathtt 0x2703150F9C594E0CA7E8302F93079CE8}, \\ v_1&= {\scriptstyle \mathtt 0x444569AF177A9C1C721736D8F288C942},&v_0&= {\scriptstyle \mathtt 0x7F26CFB225F42417316836CFF8AEFB11}. \end{aligned}$$

This is the point we use most often for scalar multiplication. Since it remains fixed, we assume we have its decompressed representation precomputed, to avoid performing the relatively expensive decompression whenever we need a scalar multiplication; this is a cheap way to save time. We further assume we have a “wrapped” representation of the projection of \(P\) to the Kummer surface, which is used to speed up the xDBLADD function. See Sect. 3.5 for more details on the xWRAP function.

Public keys. In contrast to the public generator, we assume public keys are compressed: they are communicated much more frequently, and we therefore benefit much more from smaller keys. Moreover, we include the public key in one of the hashes during the sign operation [13, 14], computing \(h\leftarrow H(\overline{R}||\overline{Q}||M)\) instead of the \(H(\overline{R}||M)\) originally suggested by Schnorr [11]. This protects against adversaries attacking multiple public keys simultaneously.

Compressed signatures. Schnorr [11] mentions the option of compressing signatures by hashing one of their two components: the hash size only needs to be b/2 bits, where b is the key length. Following this suggestion, our signatures are 384-bit values of the form \((h_{128}||s)\), where \(h_{128}\) denotes the lowest 128 bits of the hash \(h = H(\overline{R}||\overline{Q}||M)\) computed during signing, and s is a 256-bit scalar. The most obvious upside is that signatures are smaller, reducing communication overhead. Another big advantage is that we can exploit the half-size scalar to speed up signature verification. On the other hand, we lose the possibility of efficient batch verification.

Verification efficiency. The most costly operation in signature verification is the two-dimensional scalar multiplication \(T=[s]P\oplus [h_{128}]Q\). In [8], the authors propose an algorithm relying on the differential addition chains presented in [15]. However, since we are using compressed signatures, we have a small scalar \(h_{128}\); the two-dimensional algorithm in [8] cannot directly exploit this, and therefore gains little from the compressed signature. Instead, we simply compute \([s]P\) and \([h_{128}]Q\) separately using the fast scalar multiplication on the Kummer surface, and then add them together on the Jacobian. Here \([s]P\) is a 256-bit scalar multiplication, whereas \([h_{128}]Q\) is only a 128-bit scalar multiplication. Not only do we need fewer cycles than the two-dimensional routine, but we also reduce code size by reusing the one-dimensional scalar multiplication routine.

The scheme. We now define our signature scheme, taking the above into account; a short illustrative sketch of the three routines follows the definitions below.

  • Key generation ( keygen ). Let d be a 256-bit secret key, and P the public generator. Compute \((d'||d'') \leftarrow H(d)\) (with \(d'\) and \(d''\) both 256 bits), then \(Q\leftarrow [16d']P\). The public key is \(\overline{Q}\).

  • Signing ( sign ). Let M be a message, d a 256-bit secret key, P the public generator, and \(\overline{Q}\) a compressed public key. Compute \((d'||d'') \leftarrow H(d)\) (with \(d'\) and \(d''\) both 256 bits), then \(r\leftarrow H(d''||M)\), then \(R\leftarrow [r]P\), then \(h\leftarrow H(\overline{R}||\overline{Q}||M)\), and finally \(s\leftarrow \left( r-16h_{128}d'\right) \bmod {N}\). The signature is \((h_{128}||s)\).

  • Verification ( verify ). Let M be a message with a signature \((h_{128}||s)\) corresponding to a public key \(\overline{Q}\), and let P be the public generator. Compute \(T\leftarrow [s]P\oplus [h_{128}]Q\), then \(g\leftarrow H(\overline{T}||\overline{Q}||M)\). The signature is correct if \(g_{128}=h_{128}\), and incorrect otherwise.
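The following Python-style sketch summarizes the three routines. The helpers scalar_mult, jac_add, compress, decompress and the constants P_GEN and N stand in for the functions and values of Sects. 3 and 4; the byte-level encodings (endianness, concatenation order) are illustrative only.

```python
import hashlib

def H(m: bytes) -> bytes:                   # H(M) = SHAKE128(M, 512)
    return hashlib.shake_128(m).digest(64)

# Stand-ins for the arithmetic of Sects. 3 and 4 (not defined here):
#   scalar_mult(k, P) -> [k]P on the Jacobian   (jacobian_scalarmult)
#   jac_add(P, Q)     -> P + Q on the Jacobian  (ADD)
#   compress(P), decompress(pk)                 (Algorithms 1 and 2)
#   P_GEN, N          -> public generator and the 250-bit prime N

def keygen(d: bytes) -> bytes:
    d1 = int.from_bytes(H(d)[:32], "little")          # d'
    return compress(scalar_mult(16 * d1, P_GEN))      # public key, 256 bits

def sign(M: bytes, d: bytes, pk: bytes) -> bytes:
    h = H(d)
    d1, d2 = int.from_bytes(h[:32], "little"), h[32:] # d', d''
    r = int.from_bytes(H(d2 + M), "little")           # per-message nonce
    R = scalar_mult(r, P_GEN)
    e = H(compress(R) + pk + M)
    h128 = int.from_bytes(e[:16], "little")           # lowest 128 bits
    s = (r - 16 * h128 * d1) % N
    return e[:16] + s.to_bytes(32, "little")          # (h_128 || s), 384 bits

def verify(M: bytes, sig: bytes, pk: bytes) -> bool:
    h128 = int.from_bytes(sig[:16], "little")
    s = int.from_bytes(sig[16:48], "little")
    T = jac_add(scalar_mult(s, P_GEN), scalar_mult(h128, decompress(pk)))
    g = H(compress(T) + pk + M)
    return g[:16] == sig[:16]                         # g_128 == h_128
```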

Remark 1

We note that there may be faster algorithms to compute the “one-and-a-half-dimensional” scalar multiplication in verify, especially since we do not have to worry about being constant-time. One option might be to adapt Montgomery’s PRAC [16, Sect. 3.3.1] to make use of the half-size scalar. But while this may lead to a speed-up, it would also cause an increase in code size compared to simply re-using the one-dimensional scalar multiplication. We have chosen not to pursue this line, preferring the solid benefits of reduced code size instead.

2.2 Diffie-Hellman Key Exchange

For key exchange it is not necessary to have a group structure; it is enough to have a pseudo-multiplication. We can therefore carry out our key exchange directly on the Kummer surface \(\mathcal {K}_{\mathcal {C}}^{} = \mathcal {J}_{\mathcal {C}}^{}/{\left\langle {\pm }\right\rangle }\), gaining efficiency by not projecting from and recovering to the Jacobian \(\mathcal {J}_{\mathcal {C}}^{}\). If \(Q\) is a point on \(\mathcal {J}_{\mathcal {C}}^{}\), then its image in \(\mathcal {K}_{\mathcal {C}}^{}\) is \(\pm Q\). The common representation for points in \(\mathcal {K}_{\mathcal {C}}^{}(\mathbb {F}_q)\) is a 512-bit 4-tuple of field elements. For input points (i.e., the generator or public keys), we prefer the 384-bit “wrapped” representation (see Sect. 3.5). This not only reduces key size, but also allows a speed-up in the core xDBLADD subroutine. The wrapped representation of a point \(\pm Q\) on \(\mathcal {K}_{\mathcal {C}}^{}\) is produced by the xWRAP function.

  • Key exchange ( dh_exchange ). Let d be a 256-bit secret key, and let \(\pm P\) be the public generator (respectively, the other party's public key), given in wrapped form. Compute \(\pm Q\leftarrow \pm [d]P\). The generated public key (respectively, shared secret) is the wrapped representation of \(\pm Q\).

Remark 2

While it might be possible to reduce the key size even further to 256 bits, we would then have to pay the cost of compressing and decompressing, and also wrapping for xDBLADD (see the discussion in [8, App. A]). We therefore choose to keep the 384-bit representation, which is consistent with [3].

3 Building Blocks: Algorithms and Their Implementation

We begin by presenting the finite field \(\mathbb {F}_{2^{127}-1}\) in Sect. 3.1. We then define the curve \(\mathcal {C}\) in Sect. 3.2, before giving basic methods for the elements of \(\mathcal {J}_{\mathcal {C}}^{}\) in Sect. 3.3. We then present the fast Kummer surface \(\mathcal {K}_{\mathcal {C}}^{}\) and its differential addition operations in Sects. 3.4 and 3.5.

3.1 The Field \(\mathbb {F}_q\)

We work over the prime finite field \(\mathbb {F}_q\), where \(q\) is the Mersenne prime

$$ q := 2^{127} - 1 . $$

We let M, S, a, s, neg, and I denote the costs of multiplication, squaring, addition, subtraction, negation, and inversion in \(\mathbb {F}_q\). Later, we will define a special operation for multiplying by small constants: its cost is denoted by \(\mathbf {m_c}\).

For complete field arithmetic we implement modular reduction, addition, subtraction, multiplication, and inversion. We comment on some important aspects here, giving cycle counts in Table 2.

We can represent elements of \(\mathbb {F}_q\) as 127-bit values; but since the ATmega and Cortex M0 work with 8- and 32-bit words, respectively, the obvious choice is to represent field elements with 128 bits. That is, an element \(g\in \mathbb {F}_q\) is represented as \(g=\sum _{i=0}^{15}g_i2^{8i}\) on the AVR ATmega platform and as \(g=\sum _{i=0}^{3}g'_i2^{32i}\) on the Cortex M0, where \(g_i\in \{0,\ldots ,2^8-1\}\), \(g'_i\in \{0,\ldots ,2^{32}-1\}\).

Working with the prime field \(\mathbb {F}_q\), we need integer reduction modulo \(q\); this is implemented as bigint_red. Reduction is very efficient because \(2^{128}\equiv 2\text { mod}\, q\), which enables us to reduce using only shifts and integer additions. Given this reduction, we implement addition and subtraction operations for \(\mathbb {F}_q\) (as gfe_add and gfe_sub, respectively) in the obvious way.
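As a sketch of this strategy (helper names are ours, not those of the assembly routines): since \(2^{128}\equiv 2 \pmod q\), the high half of an intermediate result is folded back in doubled, using only shifts and additions.

```python
q = 2**127 - 1

def bigint_red(v: int) -> int:
    """Reduce a non-negative integer of up to 256 bits modulo q = 2^127 - 1."""
    v = (v & ((1 << 128) - 1)) + 2 * (v >> 128)  # fold high 128 bits: 2^128 = 2 mod q
    v = (v & ((1 << 127) - 1)) + (v >> 127)      # fold the few remaining top bits
    return v - q if v >= q else v                # at most one final subtraction

def gfe_add(x, y): return bigint_red(x + y)
def gfe_sub(x, y): return bigint_red(x - y + 2 * q)   # keep the argument non-negative
def gfe_mul(x, y): return bigint_red(x * y)           # bigint_mul, then bigint_red

assert gfe_mul(3, q - 1) == (3 * (q - 1)) % q         # quick self-check
```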

The most costly operations in \(\mathbb {F}_q\) are multiplication (gfe_mul) and squaring (gfe_sqr), which are implemented as \(128\times 128\)-bit integer operations (bigint_mul and bigint_sqr) followed by a call to bigint_red. Since we are working on the same platforms as [7], where both of these operations are already highly optimized, we took the necessary code from those implementations:

  • On the AVR ATmega: The authors of [17] implement a 3-level Karatsuba multiplication of two 256-bit integers, representing elements f of \(\mathbb {F}_{2^{255}-19}\) as \(f=\sum _{i=0}^{31}f_i 2^{8i}\) with \(f_i\in \{0,\ldots ,2^8-1\}\). Since the first level of Karatsuba relies on a \(128\times 128\)-bit integer multiplication routine named MUL128, we simply lift this function out to form a 2-level \(128\times 128\)-bit Karatsuba multiplication. Similarly, their \(256\times 256\)-bit squaring relies on a \(128\times 128\)-bit routine SQR128, which we can (almost) directly use. Since the \(256\times 256\)-bit squaring is 2-level Karatsuba, the \(128\times 128\)-bit squaring is 1-level Karatsuba.

  • On the ARM Cortex M0: The authors of [7] use optimized Karatsuba multiplication and squaring. Their assembly code does not use subroutines, but fully inlines \(128\times 128\)-bit multiplication and squaring. The \(256\times 256\)-bit multiplication and squaring are both 3-level Karatsuba implementations. Hence, using these, we end up with 2-level \(128\times 128\)-bit Karatsuba multiplication and squaring.

The function gfe_invert computes inversions in \(\mathbb {F}_q\) as exponentiations, using the fact that \(g^{-1} = g^{q-2}\) for all \(g\) in \(\mathbb {F}_q^\times \). To do this efficiently we use an addition chain for \(q-2\), doing the exponentiation in \(10\mathbf{M}+126\mathbf{S}\).

Finally, to speed up our Jacobian point decompression algorithms, we define a function gfe_powminhalf which computes \(g\mapsto g^{-1/2}\) for \(g\) in \(\mathbb {F}_q\) (up to a choice of sign). To do this, we note that \( g^{-1/2} = \pm g^{-(q+1)/4} =\pm g^{{(3q-5)}/{4}} \) in \(\mathbb {F}_q\); this exponentiation can be done with an addition chain of length 136, using \(11\mathbf{M}+125\mathbf{S}\). We can then define a function gfe_sqrtinv, which given \((x,y)\) and a bit \(b\), computes \((\sqrt{x},1/y)\) as \((\pm xyz,xyz^2)\) where \(z = \mathtt {gfe\_powminhalf}(xy^2)\), choosing the sign so that the square root has least significant bit \(b\). Including the gfe_powminhalf call, this costs 15M + 126S + 1neg.
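The following Python sketch computes the same maps; for brevity it uses a generic modular exponentiation (Python's pow) where our code uses fixed addition chains.

```python
q = 2**127 - 1

def gfe_invert(g):
    return pow(g, q - 2, q)               # g^(q-2) = 1/g for g != 0

def gfe_powminhalf(g):
    return pow(g, (3 * q - 5) // 4, q)    # = +-g^(-1/2) when g is a square

def gfe_sqrtinv(x, y, b):
    """Return (sqrt(x), 1/y), choosing the root whose least significant bit is b."""
    z = gfe_powminhalf(x * y * y % q)
    r, inv_y = x * y * z % q, x * y * z * z % q     # (+-sqrt(x), 1/y)
    if (r & 1) != b:
        r = q - r                                   # switch to the other root
    return r, inv_y

# Sanity check on a square x: the returned root squares back to x.
x, y = pow(12345, 2, q), 6789
r, iy = gfe_sqrtinv(x, y, 0)
assert r * r % q == x and y * iy % q == 1 and (r & 1) == 0
```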

Table 2. Cycle counts for our field implementation (including function-call overhead).

3.2 The Curve \(\mathcal {C}\) and Its Theta Constants

We define the curve \(\mathcal {C}\) “backwards”, starting from its (squared) theta constants

$$ a := -11 , \quad b := 22 , \quad c := 19 , \quad \text {and} \quad d := 3 \quad \text {in } \mathbb {F}_q . $$

From these, we define the dual theta constants

$$\begin{aligned} A&{} := a + b + c + d = 33 ,&B&{} := a + b - c - d = -11 , \\ C&{} := a - b + c - d = -17 ,&D&{} := a - b - c + d = -49 . \end{aligned}$$

Observe that projectively,

$$\begin{aligned} (1/a:1/b:1/c:1/d)&= (114:-57:-66:-418) , \\ (1/A:1/B:1/C:1/D)&= (-833:2499:1617:561) . \end{aligned}$$

Crucially, all of these constants can be represented using just 16 bits each. Since Kummer arithmetic involves many multiplications by these constants, we implement a separate \(16\times 128\)-bit multiplication function gfe_mulconst. For the AVR ATmega, we store the constants in two 8-bit registers. For the Cortex M0, the values fit into a halfword; this works well with the \(16\!\times \!16\)-bit multiplication. Multiplication by any of these 16-bit constants costs \(\mathbf {m_c}\).

Continuing, we define \(e/f := (1 + \alpha )/(1 - \alpha )\), where \(\alpha ^2 = CD/AB\) (we take the square root with least significant bit 0), and thus

$$\begin{aligned} \lambda := {ac}/{bd}&= \mathtt {0x15555555555555555555555555555552} , \quad \\ \mu := {ce}/{df}&= \mathtt {0x73E334FBB315130E05A505C31919A746} , \\ \nu := {ae}/{bf}&= \mathtt {0x552AB1B63BF799716B5806482D2D21F3} . \end{aligned}$$

These are the Rosenhain invariants of the curve \(\mathcal {C}\), found by Gaudry and Schost [18], which we are (finally!) ready to define as

$$ \mathcal {C}: Y^2 = f_\mathcal {C}(X) := X(X-1)(X-\lambda )(X-\mu )(X-\nu ) . $$

The curve constants are the coefficients of \(f_\mathcal {C}(X) = \sum _{i=0}^5f_iX^i\): so \(f_0 = 0\), \(f_5 = 1\),

$$\begin{aligned} f_1&{} = {} \scriptstyle \mathtt {0x1EDD6EE48E0C2F16F537CD791E4A8D6E} ,&f_2&{} = {} \scriptstyle \mathtt {0x73E799E36D9FCC210C9CD1B164C39A35} , \\ f_3&{} = {} \scriptstyle \mathtt {0x4B9E333F48B6069CC47DC236188DF6E8} ,&f_4&{} = {} \scriptstyle \mathtt {0x219CC3F8BB9DFE2B39AD9E9F6463E172} . \end{aligned}$$

We store the squared theta constants \((a:b:c:d)\), along with \((1/a:1/b:1/c:1/d)\), and \((1/A:1/B:1/C:1/D)\); the Rosenhain invariants \(\lambda \), \(\mu \), and \(\nu \), together with \(\lambda \mu \) and \(\lambda \nu \); and the curve constants \(f_1\), \(f_2\), \(f_3\), and \(f_4\), for use in our Kummer and Jacobian arithmetic functions. Obviously, none of the Rosenhain or curve constants are small; multiplying by these costs a full M.
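The following Python snippet re-derives these constants from \((a,b,c,d)\) and checks them against the values above; \(\lambda\) is verified exactly, while \(\mu\) and \(\nu\) (and hence the \(f_i\)) depend on the stated choice of square root \(\alpha\) and are simply printed for comparison.

```python
q = 2**127 - 1
a, b, c, d = -11, 22, 19, 3                       # squared theta constants

A, B, C, D = a+b+c+d, a+b-c-d, a-b+c-d, a-b-c+d   # dual theta constants
assert (A, B, C, D) == (33, -11, -17, -49)

# The projective tuples (1/a:1/b:1/c:1/d) and (1/A:1/B:1/C:1/D): all four
# products being equal means the stated integer tuples are indeed proportional.
assert a * 114 == b * (-57) == c * (-66) == d * (-418)
assert A * (-833) == B * 2499 == C * 1617 == D * 561

inv = lambda x: pow(x, q - 2, q)                  # Fermat inversion in F_q

lam = a * c * inv(b * d) % q                      # lambda = ac/bd
assert lam == 0x15555555555555555555555555555552

alpha = pow(C * D * inv(A * B) % q, (q + 1) // 4, q)    # sqrt(CD/AB); q = 3 mod 4
if alpha & 1:
    alpha = q - alpha                             # take the root with lsb 0
e_over_f = (1 + alpha) * inv(1 - alpha) % q
mu, nu = c * e_over_f * inv(d) % q, a * e_over_f * inv(b) % q
print(hex(mu), hex(nu))                           # should match the values above
```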

3.3 Elements of \(\mathcal {J}_{\mathcal {C}}^{}\), Compressed and Decompressed

Our algorithms use the usual Mumford representation for elements of \(\mathcal {J}_{\mathcal {C}}^{}(\mathbb {F}_q)\): they correspond to pairs \( {\left\langle {u(X)},{v(X)}\right\rangle } \), where \(u\) and \(v\) are polynomials over \(\mathbb {F}_q\) with \(u\) monic, \(\deg v < \deg u \le 2\), and \(v(X)^2 \equiv f_\mathcal {C}(X) \pmod {u(X)}\). We compute the group operation \(\oplus \) in \(\mathcal {J}_{\mathcal {C}}^{}(\mathbb {F}_q)\) using a function ADD, which implements the algorithm found in [19] (after a change of coordinates to meet their Assumption 1) at a cost of 28M + 2S + 11a + 24s + 1I.

For transmission, we compress the 508-bit Mumford representation to a 256-bit form. Our functions compress (Algorithm 1) and decompress (Algorithm 2) implement Stahlke’s compression technique (see [20] and [8, Appendix A] for details).

[Algorithm 1: compress]
[Algorithm 2: decompress]

3.4 The Kummer Surface \(\mathcal {K}_{\mathcal {C}}^{}\)

The Kummer surface of \(\mathcal {C}\) is the quotient \(\mathcal {K}_{\mathcal {C}}^{}:= \mathcal {J}_{\mathcal {C}}^{}/{\left\langle {\pm 1}\right\rangle }\); points on \(\mathcal {K}_{\mathcal {C}}^{}\) correspond to points on \(\mathcal {J}_{\mathcal {C}}^{}\) taken up to sign. If \(P\) is a point in \(\mathcal {J}_{\mathcal {C}}^{}\), then we write

$$ (x_P:y_P:z_P:t_P) = \pm P $$

for its image in \(\mathcal {K}_{\mathcal {C}}^{}\). To avoid subscript explosion, we make the following convention: when points \(P\) and \(Q\) on \(\mathcal {J}_{\mathcal {C}}^{}\) are clear from the context, we write

$$ (x_\oplus :y_\oplus :z_\oplus :t_\oplus ) = \pm (P\oplus Q) \quad \text {and} \quad (x_\ominus :y_\ominus :z_\ominus :t_\ominus ) = \pm (P\ominus Q) . $$

The Kummer surface of this \(\mathcal {C}\) has a “fast” model in \(\mathbb {P}^3\) defined by

$$ \mathcal {K}_{\mathcal {C}}^{}: E\cdot xyzt = \left( \begin{array}{c} (x^2 + y^2 + z^2 + t^2) \\ - F\cdot (xt+yz) - G\cdot (xz+yt) - H\cdot (xy+zt) \end{array} \right) ^2 $$

where

$$ F = \frac{a^2-b^2-c^2+d^2}{ad-bc} , \quad G = \frac{a^2-b^2+c^2-d^2}{ac-bd} , \quad H = \frac{a^2+b^2-c^2-d^2}{ab-cd} , $$

and \(E = 4abcd\left( ABCD/((ad-bc)(ac-bd)(ab-cd))\right) ^2 \) (see eg. [8, 21, 22]). The identity point \({\left\langle {1},{0}\right\rangle }\) of \(\mathcal {J}_{\mathcal {C}}^{}\) maps to

$$ \pm 0_{\mathcal {J}_{\mathcal {C}}^{}} = ( a : b : c : d ) . $$

Algorithm 3 (Project) maps general points from \(\mathcal {J}_{\mathcal {C}}^{}(\mathbb {F}_q)\) into \(\mathcal {K}_{\mathcal {C}}^{}\). The “special” case where \(u\) is linear is treated in [8, Sect. 7.2]; this is not implemented, since Project only operates on public generators and keys, none of which are special.

[Algorithm 3: Project]

3.5 Pseudo-addition on \(\mathcal {K}_{\mathcal {C}}^{}\)

While the points of \(\mathcal {K}_{\mathcal {C}}^{}\) do not form a group, we have a pseudo-addition operation (differential addition), which computes \(\pm (P\oplus Q)\) from \(\pm P\), \(\pm Q\), and \(\pm (P\ominus Q)\). The function \(\texttt {xADD}\) (Algorithm 4) implements the standard differential addition. The special case where \(P = Q\) yields a pseudo-doubling operation.

To simplify the presentation of our algorithms, we define three operations on points in \(\mathbb {P}^3\). First, \(\mathcal {M}: \mathbb {P}^3\times \mathbb {P}^3\rightarrow \mathbb {P}^3\) multiplies corresponding coordinates:

$$ \mathcal {M}: \left( (x_1:y_1:z_1:t_1),(x_2:y_2:z_2:t_2)\right) \longmapsto (x_1x_2:y_1y_2:z_1z_2:t_1t_2) . $$

The special case \((x_1:y_1:z_1:t_1) = (x_2:y_2:z_2:t_2)\) is denoted by

$$ \mathcal {S}: (x:y:z:t) \longmapsto (x^2:y^2:z^2:t^2) . $$

Finally, the Hadamard transform is defined by

$$ \mathcal {H}: (x:y:z:t) \longmapsto (x':y':z':t') \quad \text {where} \quad \left\{ \begin{array}{r@{\;=\;}l} x' &{} x + y + z + t , \\ y' &{} x + y - z - t , \\ z' &{} x - y + z - t , \\ t' &{} x - y - z + t . \end{array} \right. $$

Clearly \(\mathcal {M}\) and \(\mathcal {S}\) cost \(4\mathbf {M}\) and \(4\mathbf {S}\), respectively. The Hadamard transform can easily be implemented with \(4\mathbf {a}+4\mathbf {s}\). However, the additions and subtractions are relatively cheap, making function call overhead a large factor. To minimize this we inline the Hadamard transform, trading a bit of code size for efficiency.
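In Python, the three operations are simply the following (a sketch; the C code works on packed 128-bit field elements and, as just noted, inlines \(\mathcal{H}\)):

```python
q = 2**127 - 1

def M4(P, Q):
    """Coordinatewise product of two 4-tuples: 4 multiplications."""
    return tuple(p * r % q for p, r in zip(P, Q))

def S4(P):
    """Coordinatewise squaring: 4 squarings."""
    return tuple(p * p % q for p in P)

def H4(P):
    """Hadamard transform: 4 additions and 4 subtractions."""
    x, y, z, t = P
    return ((x+y+z+t) % q, (x+y-z-t) % q, (x-y+z-t) % q, (x-y-z+t) % q)
```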

[Algorithm 4: xADD]

Lines 5 and 6 of Algorithm 4 only involve the third argument, \(\pm (P\ominus Q)\); essentially, they compute the point \(( y_\ominus z_\ominus t_\ominus : x_\ominus z_\ominus t_\ominus : x_\ominus y_\ominus t_\ominus : x_\ominus y_\ominus z_\ominus )\) (which is projectively equivalent to \((1/x_\ominus : 1/y_\ominus : 1/z_\ominus : 1/t_\ominus )\), but requires no inversions; note that this is generally not a point on \(\mathcal {K}_{\mathcal {C}}^{}\)). In practice, the pseudoadditions used in our scalar multiplication all use a fixed third argument, so it makes sense to precompute this “inverted” point and to scale it by \(x_\ominus \) so that the first coordinate is \(1\), thus saving \(7\mathbf M \) in each subsequent differential addition for a one-off cost of \(1\mathbf I \). The resulting data can be stored as the 3-tuple \((x_\ominus /y_\ominus ,x_\ominus /z_\ominus ,x_\ominus /t_\ominus )\), ignoring the trivial first coordinate: this is the wrapped form of \(\pm (P\ominus Q)\). The function xWRAP (Algorithm 5) applies this transformation.

[Algorithm 5: xWRAP]
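A Python sketch of this conversion (our naming; the multiplications are arranged so that one field inversion and a handful of multiplications suffice):

```python
q = 2**127 - 1
inv = lambda x: pow(x, q - 2, q)           # stand-in for gfe_invert

def xwrap(P):
    """(x : y : z : t)  ->  (x/y, x/z, x/t), the wrapped form."""
    x, y, z, t = P
    yz = y * z % q
    w = x * inv(yz * t % q) % q            # w = x/(yzt), the only inversion
    u = w * t % q                          # u = x/(yz)
    return (u * z % q, u * y % q, w * yz % q)
```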

Algorithm 6 combines the pseudo-doubling with the differential addition, sharing intermediate operands, to define a differential double-and-add \(\texttt {xDBLADD}\). This is the fundamental building block of the Montgomery ladder.

[Algorithm 6: xDBLADD]
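Algorithm 6 itself is not reproduced here; the following sketch (reusing q, M4, S4 and H4 from the sketch above) follows the standard fast-Kummer double-and-add formulas of [3, 8] and conveys its structure. The third argument is the wrapped difference \((x_\ominus/y_\ominus, x_\ominus/z_\ominus, x_\ominus/t_\ominus)\).

```python
# Projective representatives of (1/A:1/B:1/C:1/D) and (1/a:1/b:1/c:1/d),
# i.e. the small constants of Sect. 3.2.
INV_ABCD = (-833, 2499, 1617, 561)
INV_abcd = (114, -57, -66, -418)

def xdbladd(xP, xQ, wD):
    """Return (+-[2]P, +-(P + Q)) from +-P, +-Q and the wrapped +-(P - Q)."""
    U, V = H4(xP), H4(xQ)
    # Doubling branch: S, multiply by (1/A..1/D), H, S, multiply by (1/a..1/d).
    x2P = M4(S4(H4(M4(S4(U), INV_ABCD))), INV_abcd)
    # Addition branch: the same pattern, but mixing U and V and finishing
    # with the wrapped difference (whose first coordinate is already 1).
    x, y, z, t = S4(H4(M4(M4(U, V), INV_ABCD)))
    xPQ = (x, y * wD[0] % q, z * wD[1] % q, t * wD[2] % q)
    return x2P, xPQ

# Sanity check: doubling the identity +-0 = (a : b : c : d) gives +-0 again.
inv = lambda v: pow(v, q - 2, q)
O = (-11 % q, 22, 19, 3)
wO = (O[0] * inv(O[1]) % q, O[0] * inv(O[2]) % q, O[0] * inv(O[3]) % q)
D2, _ = xdbladd(O, O, wO)
assert D2[0] * O[1] % q == D2[1] * O[0] % q     # projectively equal to +-0
```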
Table 3. Operation and cycle counts of basic functions on the Kummer and Jacobian.

4 Scalar Multiplication

All of our cryptographic routines are built around scalar multiplication in \(\mathcal {J}_{\mathcal {C}}^{}\) and pseudo-scalar multiplication in \(\mathcal {K}_{\mathcal {C}}^{}\). We implement pseudo-scalar multiplication using the classic Montgomery ladder in Sect. 4.1. In Sect. 4.2, we extend this to full scalar multiplication on \(\mathcal {J}_{\mathcal {C}}^{}\) using the point recovery technique proposed in [8].

4.1 Pseudomultiplication on \(\mathcal {K}_{\mathcal {C}}^{}\)

Since \([m](\ominus P) = \ominus [m]P\) for all \(m\) and \(P\), we have a pseudo-scalar multiplication operation \((m,\pm P)\longmapsto \pm [m]P\) on \(\mathcal {K}_{\mathcal {C}}^{}\), which we compute using Algorithm 7 (the Montgomery ladder), implemented as crypto_scalarmult. The loop of Algorithm 7 maintains the following invariant: at the end of iteration \(i\) we have

$$ (V_1,V_2) = (\pm [k]P,\pm [k+1]P) \quad \text {where} \quad \textstyle k = \sum _{j=i}^{\beta -1}m_j2^{j-i} . $$

Hence, at the end we return \(\pm [m]P\), and also \(\pm [m+1]P\) as a (free) byproduct. We suppose we have a constant-time conditional swap routine \(\texttt {CSWAP}(b,(V_1,V_2))\), which returns \((V_1,V_2)\) if \(b = 0\) and \((V_2,V_1)\) if \(b = 1\). This makes the execution of Algorithm 7 uniform and constant-time, and thus suitable for use with secret \(m\).

[Algorithm 7: crypto_scalarmult (the Montgomery ladder)]

Our implementation of crypto_scalarmult assumes that its input Kummer point \(\pm P\) is wrapped. This follows the approach of [3]. Indeed, many calls to crypto_scalarmult involve Kummer points that are stored or transmitted in wrapped form. However, crypto_scalarmult does require the unwrapped point internally—if only to initialize one variable. We therefore define a function xUNWRAP (Algorithm 8) to invert the xWRAP transformation at a cost of only 4M.

[Algorithm 8: xUNWRAP]
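To illustrate how the pieces of this section fit together, here is a Python sketch of xUNWRAP together with the ladder of Algorithm 7, reusing q and the xdbladd sketch from Sect. 3.5. Unlike the real implementation, this sketch is constant-time only in structure (Python integers are not).

```python
def xunwrap(wP):
    """(x/y, x/z, x/t)  ->  a projective representative of (x : y : z : t); 4M."""
    w1, w2, w3 = wP
    w12 = w1 * w2 % q
    return (w12 * w3 % q, w2 * w3 % q, w1 * w3 % q, w12)

def cswap(bit, V1, V2):
    """Conditional swap; branch-free and constant-time in the real code."""
    return (V2, V1) if bit else (V1, V2)

def crypto_scalarmult(m, wP, beta=256):
    """Montgomery ladder: return (+-[m]P, +-[m+1]P) from the wrapped +-P.

    Invariant: after processing bit i, (V1, V2) = (+-[k]P, +-[k+1]P) where
    k is the integer formed by the bits m_{beta-1}, ..., m_i.
    """
    V1 = (-11 % q, 22, 19, 3)          # +-0 = (a : b : c : d)
    V2 = xunwrap(wP)                   # the only place the unwrapped point is needed
    for i in reversed(range(beta)):
        bit = (m >> i) & 1
        V1, V2 = cswap(bit, V1, V2)
        V1, V2 = xdbladd(V1, V2, wP)   # the difference +-(V1 - V2) is always +-P
        V1, V2 = cswap(bit, V1, V2)
    return V1, V2
```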

4.2 Point Recovery from \(\mathcal {K}_{\mathcal {C}}^{}\) to \(\mathcal {J}_{\mathcal {C}}^{}\)

Point recovery means efficiently computing \([m]P\) on \(\mathcal {J}_{\mathcal {C}}^{}\) given \(\pm [m]P\) on \(\mathcal {K}_{\mathcal {C}}^{}\) and some additional information. In our case, the additional information is the base point \(P\) and the second output of the Montgomery ladder, \(\pm [m+1]P\). Algorithm 9 (Recover) implements the point recovery described in [8]. This is the genus-2 analogue of the elliptic-curve methods in [24–26].

[Algorithm 9: Recover]

We refer the reader to [8] for technical details on this method, but there is one important mathematical detail that we should mention (since it is reflected in the structure of our code): point recovery is more natural starting from the general Flynn model \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) of the Kummer, because it is more closely related to the Mumford model for \(\mathcal {J}_{\mathcal {C}}^{}\). Algorithm 9 therefore proceeds in two steps: first Algorithms 10 (fast2genFull) and 11 (fast2genPartial) map the problem into \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\), and then we recover from \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) to \(\mathcal {J}_{\mathcal {C}}^{}\) using Algorithm 12 (recoverGeneral).

Since the general Kummer \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) only appears briefly in our recovery procedure (we never use its relatively slow arithmetic operations), we will not investigate it in detail here—but the curious reader may refer to [27] for the general theory. For our purposes, it suffices to recall that \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) is, like \(\mathcal {K}_{\mathcal {C}}^{}\), embedded in \(\mathbb {P}^3\); and the isomorphism \(\mathcal {K}_{\mathcal {C}}^{}\rightarrow {\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\) is defined (in eg. [8, Sect. 7.4]) by the linear transformation

$$ (x_{P}:y_{P}:z_{P}:t_{P}) \longmapsto (\tilde{x}_{P}:\tilde{y}_{P}:\tilde{z}_{P} : \tilde{t}_{P}) := (x_{P}:y_{P}:z_{P}:t_{P})L , $$

where \(L\) is (any scalar multiple of) the matrix

$$ \left( \begin{array}{c@{\ \ }c@{\ \ }c@{\ \ }c} a^{-1}(\nu - \lambda ) &{} a^{-1}(\mu \nu - \lambda ) &{} a^{-1}\lambda \nu (\mu - 1) &{} a^{-1}\lambda \nu (\mu \nu - \lambda ) \\ b^{-1}(\mu - 1) &{} b^{-1}(\mu \nu - \lambda ) &{} b^{-1}\mu (\nu - \lambda ) &{} b^{-1}\mu (\mu \nu - \lambda ) \\ c^{-1}(\lambda - \mu ) &{} c^{-1}(\lambda -\mu \nu ) &{} c^{-1}\lambda \mu (1 - \nu ) &{} c^{-1}\lambda \mu (\lambda -\mu \nu ) \\ d^{-1}(1-\nu ) &{} d^{-1}(\lambda -\mu \nu ) &{} d^{-1}\nu (\lambda -\mu ) &{} d^{-1}\nu (\lambda -\mu \nu ) \end{array} \right) , $$

which we precompute and store. If \(\pm P\) is a point on \(\mathcal {K}_{\mathcal {C}}^{}\), then \(\widetilde{\pm P}\) denotes its image on \({\widetilde{\mathcal {K}}_{\mathcal {C}}}^{}\); we compute \(\widetilde{\pm P}\) using Algorithm 10 (fast2genFull).

[Algorithm 10: fast2genFull]

Sometimes we only require the first three coordinates of \(\widetilde{\pm P}\). Algorithm 11 (fast2genPartial) saves \(4\mathbf M +3\mathbf a \) per point by not computing \(\tilde{t}_P\).
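Both maps amount to a row-vector-times-matrix product over \(\mathbb{F}_q\); a Python sketch (with the precomputed matrix passed in as a 4×4 array of field elements, row-major):

```python
q = 2**127 - 1

def fast2gen_full(xP, L):
    """+-P on K_C -> its image on the general Kummer: (x:y:z:t) . L (16M + 12a)."""
    return tuple(sum(xP[i] * L[i][j] for i in range(4)) % q for j in range(4))

def fast2gen_partial(xP, L):
    """As above, but skipping the last coordinate (saves 4M + 3a)."""
    return tuple(sum(xP[i] * L[i][j] for i in range(4)) % q for j in range(3))
```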

[Algorithm 11: fast2genPartial]
[Algorithm 12: recoverGeneral]

4.3 Full Scalar Multiplication on \(\mathcal {J}_{\mathcal {C}}^{}\)

We now combine our pseudo-scalar multiplication function crypto_scalarmult with the point-recovery function Recover to define a full scalar multiplication function jacobian_scalarmult (Algorithm 13) on \(\mathcal {J}_{\mathcal {C}}^{}\).

[Algorithm 13: jacobian_scalarmult]

Remark 3

jacobian_scalarmult takes not only a scalar \(m\) and a Jacobian point \(P\) in its Mumford representation, but also the wrapped form of \(\pm P\) as an auxiliary argument: that is, we assume that \(\texttt {xP} \leftarrow \texttt {Project}(P)\) and \(\texttt {xWRAP}(\texttt {xP})\) have already been carried out. This saves redundant Project and xWRAP calls when operating on fixed base points, as is often the case in our protocols. Nevertheless, jacobian_scalarmult could easily be converted to a “pure” Jacobian scalar multiplication function (with no auxiliary input) by inserting appropriate Project and xWRAP calls at the start, and removing the xUNWRAP call at Line 2, increasing the total cost by 11M + 1S + 4\(\mathbf {m_c}\) + 7a + 8s + 1I.
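In Python-style pseudocode, the composition just described (and the variant of Remark 3) is roughly the following; recover stands for Algorithm 9 and project for Algorithm 3, with the exact argument handling left abstract, and crypto_scalarmult, xwrap and xunwrap are as in the sketches of Sects. 3.5 and 4.1.

```python
def jacobian_scalarmult(m, P, wP):
    """[m]P on J_C from P and, as auxiliary input, the wrapped Kummer point +-P."""
    xP = xunwrap(wP)                          # Algorithm 8 (Line 2 of Algorithm 13)
    xmP, xm1P = crypto_scalarmult(m, wP)      # +-[m]P and +-[m+1]P on K_C
    return recover(P, xP, xmP, xm1P)          # Algorithm 9: lift back to J_C

def jacobian_scalarmult_pure(m, P):
    """Variant without the auxiliary wrapped argument (cf. Remark 3)."""
    xP = project(P)                           # Algorithm 3
    xmP, xm1P = crypto_scalarmult(m, xwrap(xP))
    return recover(P, xP, xmP, xm1P)          # no xUNWRAP needed here
```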

5 Results and Comparison

The high-level cryptographic functions for our signature scheme are named keygen, sign and verify. Their implementations contain no surprises: they do exactly what was specified in Sect. 2.1, calling the lower-level functions described in Sects. 3 and 4 as required. Our Diffie-Hellman key generation and key exchange use only the function dh_exchange, which implements exactly what we specified in Sect. 2.2: one call to crypto_scalarmult plus a call to xWRAP to convert to the correct 384-bit representation. Table 1 (in the introduction) presents the cycle counts and stack usage for all of our high-level functions.

Code and compilation. For our experiments, we compiled our AVR ATmega code with avr-gcc -O2, and our ARM Cortex M0 code with clang -O2 (the optimization levels -O3, -O1, and -Os gave fairly similar results). The total program size is \(20\,242\) bytes for the AVR ATmega, and \(19\,606\) bytes for the ARM Cortex M0. This consists of the full signature and key-exchange code, including the reference implementation of the hash function SHAKE128 with 512-bit output.

Basis for comparison. As we believe ours to be the first genus-2 hyperelliptic-curve implementation on both the AVR ATmega and the ARM Cortex M0 architectures, we can only compare with elliptic-curve-based alternatives at the same 128-bit security level: notably [7, 29–31]. This comparison is not superficial: the key exchange in [7, 29, 30] uses the highly efficient \(x\)-only arithmetic on Montgomery elliptic curves, while [31] uses similar techniques for Weierstrass elliptic curves, and \(x\)-only arithmetic is the exact elliptic-curve analogue of Kummer surface arithmetic. To provide full scalar multiplication in a group, [31] appends \(y\)-coordinate recovery to its \(x\)-only arithmetic (using the approach of [26]); again, this is the elliptic-curve analogue of our methods.

Results for ARM Cortex M0. As we see in Table 4, genus-2 techniques give great results for Diffie–Hellman key exchange on the ARM Cortex M0 architecture. Compared with the current fastest implementation [7], we reduce the number of clock cycles by about \(27\,\%\), while roughly halving code size and stack space. For signatures, the state-of-the-art is [31]: here we reduce the cycle count for the underlying scalar multiplications by a very impressive \(75\,\%\), at the cost of an increase in code size and stack usage.

Table 4. Comparison of scalar multiplication routines on the ARM Cortex M0 architecture at the 128-bit security level. S denotes signature-compatible full scalar multiplication; DH denotes Diffie–Hellman pseudo-scalar multiplication.
Table 5. Comparison of scalar multiplication routines on the AVR ATmega architecture at the 128-bit security level. S denotes signature-compatible full scalar multiplication; DH denotes Diffie–Hellman pseudo-scalar multiplication. The implementation marked \(^{*}\) also contains a fixed-basepoint scalar multiplication routine, whereas the implementation marked \(^{\dagger }\) does not report code size for the separated scalar multiplication.

Results for AVR ATmega. Looking at Table 5, on the AVR ATmega architecture we reduce the cycle count for Diffie–Hellman by about \(32\,\%\) compared with the current record [7], again roughly halving the code size, and reducing stack usage by about \(80\,\%\). The cycle count for Jacobian scalar multiplication (for signatures) is reduced by \(71\,\%\) compared with [31], while increasing the stack usage by \(25\,\%\).

Finally we can compare to the current fastest full signature implementation [10], shown in Table 6. We almost halve the number of cycles, while reducing stack usage by a decent margin (code size is not reported in [10]).

Table 6. Comparison of signature schemes on the AVR ATmega architecture at the 128-bit security level.