# Low-Weight Primes for Lightweight Elliptic Curve Cryptography on 8-bit AVR Processors

## Abstract

Small 8-bit RISC processors and micro-controllers based on the AVR instruction set architecture are widely used in the embedded domain with applications ranging from smartcards over control systems to wireless sensor nodes. Many of these applications require asymmetric encryption or authentication, which has spurred a body of research into implementation aspects of Elliptic Curve Cryptography (ECC) on the AVR platform. In this paper, we study the suitability of a special class of finite fields, the so-called *Optimal Prime Fields (OPFs)*, for a “lightweight” implementation of ECC with a view towards high performance and security. An OPF is a finite field \(\mathbb {F}_p\) defined by a prime of the form \(p = u \cdot 2^k + v\), whereby both \(u\) and \(v\) are “small” (in relation to \(2^k\)) so that they fit into one or two registers of an AVR processor. OPFs have a low Hamming weight, which allows for a very efficient implementation of the modular reduction since only the non-zero words of \(p\) need to be processed. We describe a special variant of Montgomery multiplication for OPFs that does not execute any input-dependent conditional statements (e.g. branch instructions) and is, hence, resistant against certain side-channel attacks. When executed on an Atmel ATmega processor, a multiplication in a 160-bit OPF takes just 3237 cycles, which compares favorably with other implementations of 160-bit modular multiplication on an 8-bit processor. We also describe a performance-optimized and a security-optimized implementation of elliptic curve scalar multiplication over OPFs. The former uses a GLV curve and executes in 4.19 M cycles (over a 160-bit OPF), while the latter is based on a Montgomery curve and has an execution time of approximately 5.93 M cycles. Both results improve the state-of-the-art in lightweight ECC on 8-bit processors.

## Keywords

Clock Cycle, Scalar Multiplication, Elliptic Curve Cryptography, Modular Multiplication, Wireless Sensor Node

## 1 Introduction

The 8-bit AVR architecture [2] has grown increasingly popular in recent years thanks to its rich instruction set that allows for efficient code generation from high-level programming languages. A typical AVR microcontroller, such as the Atmel ATmega128 [3], features 32 general-purpose registers, separate memories and buses for program and data, and some 130 instructions, most of which are executed in a single clock cycle. The AVR platform occupies a significant share of the worldwide smartcard market and other security-critical segments of the embedded systems industry, e.g. wireless sensor nodes. This has made AVR an attractive evaluation platform for research projects in the area of efficient implementation of cryptographic primitives for embedded devices. The literature contains papers dealing with block ciphers [10], hash functions [17], as well as public-key schemes based on Elliptic Curve Cryptography (ECC) [23]. Despite some recent progress [1, 5, 18], the implementation of ECC on 8-bit smartcards and sensor nodes is still a big challenge due to the resource constraints of these devices. A typical low-cost smartcard contains an 8-bit microcontroller clocked at a frequency of 5 MHz, 256 B RAM, and 16 kB ROM. On the other hand, a typical wireless sensor node, such as the MICAz mote [8], is equipped with an ATmega128 processor clocked at 7.3728 MHz and provides 4 kB RAM and 128 kB programmable flash memory.

### 1.1 Past Work on Lightweight ECC for 8-bit Processors

One of the first ECC software implementations for an 8-bit processor was presented by Woodbury et al. in 2000 [41]. Their work utilizes a so-called Optimal Extension Field (OEF), which is a finite field consisting of \(p^m\) elements where \(p\) is a *pseudo-Mersenne prime* [7] (i.e. a prime of the form \(p = 2^k - c\)) and \(m\) is chosen such that an irreducible binomial \(x(t) = t^m - \omega \) exists over GF(\(p\)). The specific OEF used in [41] is GF(\((2^8-17)^{17}\)) as this field allows the arithmetic operations, especially multiplication and inversion, to be executed efficiently on an 8-bit platform. Woodbury et al. implemented the point arithmetic in affine coordinates and achieved an execution time of \(23.4 \cdot 10^6\) clock cycles for a full 134-bit scalar multiplication on an 8051-compatible microcontroller that is significantly slower than the ATmega128. The first really efficient ECC software for an 8-bitter was introduced by Gura et al. at CHES 2004 [15]. They reported an execution time of only \(6.48 \cdot 10^6\) clock cycles for a full scalar multiplication over a 160-bit SECG-compliant prime field on the ATmega128. This impressive performance is mainly the result of a smart optimization of the multi-precision multiplication, the nowadays widely used *hybrid method*. In short, the core idea of hybrid multiplication is to exploit the large register file of the ATmega128 to process several bytes (e.g. four bytes) of the operands in each iteration of the inner loop(s), which significantly reduces the number of load/store instructions compared to a conventional byte-wise multiplication.

In the ten years since the publication of Gura et al.’s seminal paper, a large body of research has been devoted to further reducing the execution time of ECC on the ATmega128. The majority of research focussed on advancing the hybrid multiplication technique or devising more efficient variants of it. An example is the work of Uhsadel et al. [37], who improved the handling of carry bits in the hybrid method and managed to achieve an execution time of 2881 cycles for a (\(160 \times 160\))-bit multiplication (without modular reduction), which is about 7.3 % faster than Gura et al.’s original implementation (3106 cycles). Zhang et al. [43] re-arranged the sequence in which the byte-by-byte multiplications are carried out and measured an execution time of 2846 clock cycles. A further reduction of the cycle count to 2651 was reported by Scott et al. [30], who fully unrolled the loops and used so-called “carry catcher” registers to limit the propagation of carries. This unrolled hybrid multiplication was adopted by Szczechowiak et al. [35] to implement scalar multiplication over a 160-bit generalized-Mersenne prime field \(\mathbb {F}_p\). An interesting result of their work is that the reduction modulo \(p = 2^{160}-2^{112}+2^{64}+1\) requires 1228 cycles, which means a full modular multiplication (including reduction) executes in 3882 clock cycles altogether. Lederer et al. [22] also came up with an optimized variant of the hybrid method and performed ECDH key exchange using the 192-bit generalized-Mersenne prime \(p = 2^{192} - 2^{64} - 1\) as recommended by NIST. A scalar multiplication needs \(12.33 \cdot 10^6\) cycles for an arbitrary base point, and \(5.2 \cdot 10^6\) cycles when the base point is fixed. The currently fastest means of multiplying two large integers on the ATmega128 is the so-called *operand-caching method* [19, 31], which follows a similar idea to the hybrid multiplication method, namely to exploit the large number of general-purpose registers to store (parts of) the operands.

Most lightweight ECC implementations for 8-bit AVR processors mentioned above suffer from two notable shortcomings, namely (1) they are vulnerable to side-channel attacks, e.g. *Simple Power Analysis (SPA)* [28], and (2) they make aggressive use of loop unrolling to reduce the execution time of the prime-field arithmetic, which comes at the expense of a massive increase in code size and poor scalability since the operand length is “fixed.” SPA attacks exploit conditional statements and other irregularities in the execution of a cryptographic algorithm (e.g. double-and-add method for scalar multiplication [6]), which can leak key-related information through the power-consumption profile of a device executing the algorithm. However, not only the scalar multiplication, but also the underlying field arithmetic can be vulnerable to SPA attacks, e.g. due to conditional subtractions in the modular addition [34], modular multiplication [38], or modular reduction for generalized-Mersenne primes [29]. It was shown in various papers that SPA attacks on unprotected (or insufficiently protected) implementations of ECC pose a real-world threat to the security of embedded devices such as smart cards [25] or wireless sensor nodes [9].

Loop unrolling is a frequently employed optimization technique to increase the performance of the field arithmetic operations, in particular multiplication [1]. The basic idea is to replicate the loop body \(n\) times (and adjust the overall number of iterations accordingly) so that the condition for loop termination as well as the branch back to the top of the loop need to be performed only once per \(n\) executions. Full loop unrolling may allow some extra optimizations since the first and the last iteration of a loop often differ from the “middle” ones and can, therefore, be specifically tuned. However, full loop unrolling, when applied to operations of quadratic complexity (e.g. multiplication), bloats the code size (i.e. the size of the binary executable) significantly. Moreover, a fully unrolled implementation can only process operands up to a length corresponding to the number of loop iterations, which means it is not scalable anymore.

### 1.2 Contributions of This Paper

We present an efficient prime-field arithmetic library for the 8-bit AVR architecture that we developed under consideration of the resource constraints and security requirements of smart cards, wireless sensor nodes, and similar kinds of embedded devices. Our goal was to overcome the drawbacks of most existing implementations mentioned in the previous subsection, and therefore we aimed for a good compromise between performance, code size, and resistance against SPA attacks. Instead of using a Mersenne-like prime field, our library supports so-called *Optimal Prime Fields (OPFs)* [12] since this family of fields has some attractive properties that allow for efficient arithmetic on a wide range of platforms. An OPF is a finite field defined by a “low-weight” prime \(p\) of the form \(p = u \cdot 2^k + v\), where \(u\) and \(v\) are small (in relation to \(2^k\)) so that they fit into one or two registers of an 8-bit processor. The reduction modulo such a prime can be performed efficiently using Montgomery’s algorithm [26] since only the non-zero bytes of \(p\) need to be processed. Our implementation is based on the OPF library from [43], but we significantly improved the execution time of all arithmetic operations (especially multiplication) and made it resistant against SPA attacks. We present a new variant of Montgomery modular multiplication for OPFs that does not perform any data-dependent indexing or branching in the final subtraction. A multiplication (including modular reduction) in a 160-bit OPF takes 3237 clock cycles on the ATmega128, which compares very well with previous work on modular multiplication for 8-bit processors.

Our OPF library uses an optimized variant of Gura et al.’s hybrid technique [15] for the multiplication, whereby we process four bytes of the two operands per iteration of the inner loop(s). However, in contrast to the bulk of previous implementations, we do not fully unroll the loops in order to keep the code size small. All arithmetic functions provided by our OPF library are implemented in a parameterized fashion and with “rolled” loops, which means that the length of the operands is not fixed or hard-coded, but is passed as a parameter to the function along with other parameters such as the start address of the arrays in which the operands are stored. Consequently, our OPF library can support operands of varying length, ranging from 64 to 2048 bits (in 32-bit steps). This feature makes our OPF library highly scalable since one and the same function can be used for operands of different length without re-compilation.

We provide benchmarking results for operand lengths of 160, 192, 224, and 256 bits on the 8-bit ATmega128 processor, which we obtained with the help of the cycle-accurate simulator of AVR Studio. For the purpose of benchmarking, we also implemented and simulated scalar multiplication for two different families of elliptic curves, namely Montgomery curves [27] and GLV curves [11]. In the former case, an SPA-protected scalar multiplication over a 160-bit OPF takes only \(5.93 \cdot 10^6\) cycles, which is faster than most unprotected implementations reported in the literature. On the other hand, we use the GLV curve to explore the “lower bound” of the execution time for a scalar multiplication when resistance against SPA is not needed. Such a speed-optimized implementation has an execution time of only \(4.19 \cdot 10^6\) clock cycles for a 160-bit scalar.

## 2 Preliminaries

In this section we recap some basic properties of special families of prime fields and elliptic curves, and discuss how to exploit their distinctive features to speed up the arithmetic operations needed in ECC.

### 2.1 Prime Fields

Even though elliptic curves can be defined over various algebraic structures, we only consider prime fields in this paper [6]. Formally, a prime field \(\mathbb {F}_p\) consists of \(p\) elements (namely the integers between \(0\) and \(p-1\)) and the arithmetic operations are addition and multiplication modulo \(p\). It is common practice in ECC to use “special” primes to speed up the modular reduction; a well-known example of primes with good arithmetic properties are the so-called Mersenne primes, which are primes of the form \(p = 2^k - 1\). Multiplying two \(k\)-bit integers \(a, b \in \mathbb {F}_p\) yields a \(2k\)-bit product \(r\) that can be written \(r = r_H \cdot 2^k + r_L\), where \(r_H\) and \(r_L\) represent the upper half and the lower half of \(r\), respectively. Since \(2^k \equiv 1 \bmod p\), we can simply reduce \(r\) via a conventional addition of the form \(r \equiv r_H + r_L \bmod p\) to obtain a result that is at most \(k+1\) bits long. A final subtraction of \(p\) may be necessary to get a fully reduced result. In summary, a reduction modulo a Mersenne prime requires just a conventional \(k\)-bit addition and, in the worst case, a subtraction of \(p\). Unfortunately, Mersenne primes are rare, and there exist no Mersenne primes between \(2^{160}\) and \(2^{512}\), which is the interval from which one normally chooses primes for ECC.
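The reduction modulo a Mersenne prime described above can be illustrated with a minimal Python sketch. The function name `mersenne_mul` is hypothetical, and Python's arbitrary-precision integers stand in for the multi-word arithmetic of a real implementation; the explicit `if` is for readability only and is not meant to be side-channel resistant.

```python
def mersenne_mul(a, b, k):
    """Multiply a and b modulo the Mersenne number p = 2^k - 1.

    Assumes 0 <= a, b < p. The 2k-bit product r is split into an upper
    half r_H and a lower half r_L, and r_H + r_L is congruent to r
    because 2^k = 1 (mod p).
    """
    p = (1 << k) - 1
    assert 0 <= a < p and 0 <= b < p
    r = a * b
    r_high = r >> k          # upper half of the product
    r_low = r & p            # lower half (p doubles as a k-bit mask)
    t = r_high + r_low       # at most k + 1 bits long
    if t >= p:               # final subtraction (illustrative, not constant-time)
        t -= p
    return t
```

For example, `mersenne_mul(a, b, 127)` works in the field defined by the Mersenne prime \(2^{127} - 1\), which lies below the interval normally used for ECC.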

A *pseudo-Mersenne prime* is a prime of the form \(p = 2^k - c\), where \(c\) is small in relation to \(2^k\).

*Generalized Mersenne primes* were first described by Solinas in 1999 [33] and, shortly thereafter, NIST recommended a set of five of these special primes for use in ECC cryptosystems. The common form of the primes presented by Solinas is

### 2.2 Elliptic Curves

Any elliptic curve over a prime field \(\mathbb {F}_p\) can be expressed through a Weierstrass equation of the form \(y^2 = x^3 + ax + b\) [16]. When using mixed Jacobian-affine coordinates, a point addition on a Weierstrass curve costs eight multiplications (i.e. 8 M) and three squarings (i.e. 3 S) in the underlying field, whereas a point doubling requires 4 M and 4 S [16]. Similar to prime fields, there exist numerous “special” families of elliptic curves, each having a unique curve equation and a unique addition law. In the past 20 years, a massive research effort was devoted to finding special curves that allow for a more efficient implementation of the scalar multiplication than ordinary Weierstrass curves.

## 3 Optimal Prime Fields

*Optimal Prime Fields (OPFs)* were first described in the literature in an extended abstract from 2006 [12]. OPFs are defined by “low-weight” primes that can be written as \(p = u \cdot 2^k + v\), where \(u\) and \(v\) are small in relation to \(2^k\).

The implementation of most of the arithmetic operations we describe in the following subsections is based on Zhang et al.’s OPF library for AVR processors [43]. However, Zhang et al.’s library, in its original form, is not resistant against side-channel attacks because it contains operand-dependent conditional statements such as if-then-else constructs. Therefore, we modified the arithmetic functions so that they exhibit a highly regular execution pattern (and constant execution time) regardless of the actual values of the operands. In addition, we optimized a number of performance-critical code sections in the field arithmetic operations, which improved their execution time by up to 10 % versus Zhang et al.’s OPF library. As stated in Sect. 1.2, we strive for a scalable implementation able to process operands of varying length. To achieve this, we implemented all arithmetic functions to support the passing of a length parameter, which is then used by the function to calculate the number of loop iterations. Our library is dimensioned for operands between 64 and 2048 bits in steps of 32 bits, i.e. the operand length has to be a multiple of 32.

### 3.1 Selection of Primes

The original definition of OPFs in [12] specifies the coefficients \(u\) and \(v\) of the prime \(p = u \cdot 2^k + v\) to “fit into a single register of the target processor,” i.e. in our case, \(u\) and \(v\) would be (at most) 8 bits long. However, the OPF library we describe in this paper expects \(u\) to be a 16-bit integer, while \(v\) is fixed to 1. In the following, we explain the rationale behind this choice and elaborate on the supported bitlengths of \(p\).

It is common practice in ECC to use primes with a bitlength that is a multiple of 32, e.g. 160, 192, 224 and 256 bits for applications with low to medium security requirements, and 384 and 512 bits for high-security applications. All standardization bodies (e.g. NIST, IEEE, SECG) recommend primes of these lengths, and we follow this approach as well. However, for efficiency reasons, it can be advantageous to use finite fields of a length slightly smaller than a multiple of 32, e.g. 255 bits instead of 256 [4]. Such slightly reduced field sizes facilitate certain optimization techniques like the so-called “lazy reduction,” which means that the result of an addition or any other operation is only reduced when it is necessary to prevent overflow. We conducted some experiments with the 159-bit OPF given by \(p= 126 \cdot 2^{152} + 1\), but found the performance gain one can achieve through lazy reduction to be less than 5 %. Therefore, we decided to stick with the well-established field lengths of 160, 192, 224 and 256 bits.

Our OPF software uses Montgomery’s algorithm [26] for multiplication and squaring modulo \(p\). A standard implementation of Montgomery multiplication based on e.g. the so-called Finely Integrated Product Scanning (FIPS) method [21] has to execute \(2s^2+s\) word-level multiplications (i.e. \((w \times w)\)-bit mul instructions) for operands consisting of \(s\) words [14]. However, when we optimize the FIPS method for primes of the form \(p = u \cdot 2^k + v\) with \(0 < u,v < 2^w\), then only \(s^2 + 3s\) mul instructions are required since all the “middle” words of \(p\) do not need to be processed because they are 0. A further reduction is achievable if \(v = 1\) since this case simplifies the quotient determination in Montgomery’s algorithm so that only \(s^2 + s\) mul instructions need to be executed, as we will show in Sect. 3.3. The situation is similar for \(v = 2^w - 1\) (which corresponds to \(v = -1\) in two’s complement representation) as this special case also allows for a reduction of the number of mul instructions. Having \(v = 2^w - 1\) implies that the least significant word of \(p\) is an “all-one” word, which, in turn, means \(p \equiv 3 \bmod 4\), so that square roots modulo \(p\) can be computed efficiently [6].
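To make the three operation counts above concrete, the following small Python sketch evaluates them for a given number of words \(s\). The function name `mul_counts` and the dictionary keys are chosen here for illustration only; the formulas are the ones quoted in the text for single-word \(u\) and \(v\).

```python
def mul_counts(s):
    """Word-level multiplications of FIPS Montgomery multiplication
    for s-word operands, using the formulas quoted in the text."""
    return {
        "generic prime": 2 * s * s + s,       # standard FIPS method
        "p = u*2^k + v": s * s + 3 * s,       # zero words of p skipped
        "p = u*2^k + 1": s * s + s,           # v = 1, trivial quotient step
    }


# A 160-bit operand consists of s = 20 words when w = 8.
for form, count in mul_counts(20).items():
    print(f"{form:>14}: {count} mul instructions")
```

With \(s = 20\), the three formulas evaluate to 820, 460, and 420 multiplications, respectively, so the case \(v = 1\) roughly halves the multiplication count of the generic method.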

The bitlength of a prime of the form \(p = u \cdot 2^k + 1\) is not only determined by the exponent \(k\), but also by the coefficient \(u\). To maximize performance, it was recommended in [12] to select \(u\) so that its length matches the word-size of the underlying processor; in our case, \(u\) should be an 8-bit integer in order to fit in a single register of an ATmega128 processor. When doing so, an optimized FIPS Montgomery multiplication ignoring all the zero-bytes of \(p\) requires only \(s^2 + s\) mul instructions. However, high performance is only one of several design goals; as stated in Sect. 1.2, we also aim for scalability, which means the ability to support fields of different lengths without the need to re-compile the arithmetic library. Besides the common field lengths of 160, 192, 224, and 256 bits, we also want our library to be able to perform arithmetic in 384 and 512-bit OPFs. Unfortunately, neither a 384-bit nor a 512-bit prime of the form \(p = u\cdot 2^k + 1\) with \(2^7 \le u < 2^8\) exists. It should be noted that the situation is very similar for pseudo-Mersenne primes; none of the 256 integers of the form \(2^k - c\) with \(k = 384\) and \(c < 2^8\) is prime, and the same holds for \(k=512\). As a consequence, we decided to “weaken” the original criterion for the selection of \(u\), namely to fit into a single register on the target processor, and allow \(u\) to have a length of 16 bits. While this relaxed condition for the selection of \(u\) entails a slight performance degradation, it significantly increases scalability and allows our OPF library to support high-security applications requiring 384 and 512-bit fields. All arithmetic functions of our library assume that \(u\) is a 16-bit integer and can be kept in two registers of an 8-bit ATmega128 processor. The second coefficient \(v\) of our low-weight primes is fixed to 1.
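As an illustration of how candidates of the form \(p = u \cdot 2^k + 1\) with a 16-bit \(u\) could be found, the following Python sketch scans all values of \(u\) for a given bitlength. It is not the selection procedure used for the library; the function names are hypothetical, and the Miller-Rabin test with fixed bases only establishes that a number is a *probable* prime.

```python
def is_probable_prime(n, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    """Miller-Rabin test with fixed bases (probabilistic for large n)."""
    if n < 2:
        return False
    for q in bases:
        if n % q == 0:
            return n == q
    d, r = n - 1, 0
    while d % 2 == 0:            # write n - 1 as d * 2^r with d odd
        d //= 2
        r += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False         # a is a witness for compositeness
    return True


def find_opf_prime(n_bits, u_bits=16):
    """Return the first (u, p) with p = u * 2^k + 1 a probable prime,
    where u has exactly u_bits bits and p has exactly n_bits bits."""
    k = n_bits - u_bits
    for u in range(1 << (u_bits - 1), 1 << u_bits):
        p = (u << k) + 1
        if is_probable_prime(p):
            return u, p
    return None


print(find_opf_prime(160))
```

Calling `find_opf_prime(384, u_bits=8)` would repeat the search with an 8-bit \(u\), the case for which the text states that no prime exists.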

**Notation.** In what follows, \(\mathbb {F}_p\) denotes an OPF defined by a prime of the form \(p = u \cdot 2^k + 1\), whereby \(u\) is in the range \([2^{15}, 2^{16}-1]\), i.e. \(u\) has a length of 16 bits. As mentioned above, the bitlength \(n\) of the primes we use in this paper is always a multiple of 32, e.g. \(n = 160\), \(192\), \(224\), or \(256\) bits. Field elements are referred to by lowercase italic letters, e.g. \(a \in \mathbb {F}_p\). When implementing ECC in software, it is common practice to represent field elements by arrays of single-precision (i.e. \(w\)-bit) words so that the arithmetic operations can be executed efficiently on the processor’s fast integer unit [16]. Normally, one chooses \(w\) to match the word-size of the underlying processor, which would mean \(w=8\) in the case of an 8-bit processor. However, as shown by Gura et al. in [15], it can be more efficient to process several (e.g. four) bytes of the operands at a time (instead of just a single byte), which, in fact, means to work with 32-bit words even though the processor has just an 8-bit datapath. We follow this approach and represent the elements of \(\mathbb {F}_p\) via arrays of \(s = \lceil n / w \rceil \) words, each having a length of \(w = 32\) bits. For example, an element of a 160-bit prime field consists of five 32-bit words since \(s = 160/32 = 5\). We use uppercase letters to denote these arrays and indexed uppercase letters to refer to individual words within an array, e.g. \(A = (A_{s-1}, ... , A_1, A_0)\) where \(A_0\) is the least significant word and \(A_{s-1}\) the most significant word of \(A\), respectively.
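The word-array representation introduced in the notation above can be sketched in a few lines of Python. The helper names `to_words` and `from_words` are hypothetical, and the value of \(u\) is an arbitrary 16-bit placeholder (the resulting \(p\) is not claimed to be prime); the point is only to show where the non-zero words of a low-weight \(p\) end up.

```python
W = 32                                   # word size w in bits


def to_words(x, s):
    """Split x into s words of W bits; index 0 is the least significant."""
    mask = (1 << W) - 1
    return [(x >> (W * i)) & mask for i in range(s)]


def from_words(words):
    """Inverse of to_words."""
    return sum(word << (W * i) for i, word in enumerate(words))


n = 160
s = -(-n // W)                           # ceil(n / w) = 5 words
u = 0x8001                               # placeholder 16-bit coefficient
p = (u << (n - 16)) + 1                  # p = u * 2^k + 1 with k = n - 16

P = to_words(p, s)
print([hex(word) for word in P])         # only P[0] and P[s-1] are non-zero
```

Since \(k = n - 16\), the coefficient \(u\) occupies the upper 16 bits of the most significant word, while the least significant word is 1 and all words in between are 0.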

### 3.2 Modular Addition and Subtraction

The typical way to perform a modular addition \(a + b \bmod p\) is to first add the two \(n\)-bit operands \(a,b \in \mathbb {F}_p\) to get a temporary sum \(t = a + b\) (which can have a length of up to \(n+1\) bits), followed by a comparison between \(t\) and \(p\) to check whether \(t \ge p\). Based on the result of this comparison, it may be necessary to subtract \(p\) from \(t\) to get a sum in the range of \([0, p-1]\). However, this approach exhibits an operand-dependent (and, therefore, irregular) execution pattern that leaks information through small variations of both the execution time and power consumption profile, the latter of which may be exploited in an SPA attack as described in e.g. [34]. In fact, this side-channel leakage has two origins, one is the comparison between \(t\) and \(p\), and the other is the conditional subtraction of \(p\). Most performance-optimized ECC implementations adopt an “early-abort” strategy to compare two integers, which means the comparison is done word by word, starting at the most significant word-pair, and the result is immediately returned when the first unequal word-pair is found. Therefore, the difference between the operands determines the execution time; it is maximal when the operands are equal. The second origin of side-channel leakage, i.e. the subtraction of \(p\), is more obvious since this subtraction is only performed when the temporary sum \(t\) is not smaller than \(p\).

In order to eliminate or, at least, reduce side-channel leakage, we adopt the idea of incomplete modular arithmetic as described by Yanık et al. [42]. Instead of reducing the result \(t\) of the addition to the least non-negative residue in the range of \([0, p-1]\), incomplete modular arithmetic allows (i.e. tolerates) results that are not fully reduced as long as they do not exceed a certain bitlength. In our case, this means that all results of modular operations are (at most) \(n\) bits long, but do not necessarily need to be smaller than \(p\). All our modular arithmetic functions also accept incompletely reduced operands as inputs, provided that their length does not exceed \(n\), the bitlength of \(p\). The advantage of this “relaxed” residue representation is the possibility to perform modular addition without an exact comparison between the sum \(t\) and the prime \(p\). Instead, we just check whether the length of \(t\) exceeds \(n\) bits (i.e. whether \(t \ge 2^n\)), which is only the case when the addition \(t = a + b\) produced a “carry bit.”

Thanks to the carry bit (which is either 0 or 1), the conditional subtraction of \(p\) can be done in an “unconditional” way by applying a mask to each byte of \(p\) before it is subtracted. The value of this mask is either an “all-zero” byte or an “all-one” byte and can be easily obtained from the carry bit through negation. For example, when the carry bit \(c = 0\), the value of the mask becomes \(m = -c = 0\). Applying this mask \(m\) to a byte \(p_i\) of \(p\) (i.e. performing a logical AND between \(m\) and \(p_i\)) yields a zero-byte, which means \(0\) is subtracted from the sum \(t\). Conversely, when \(c = 1\), we have \(m = -c = -1 = 2^8-1 = \mathtt{0xff}\) (modulo \(2^8\)), and applying this \(m\) to the bytes \(p_i\) does not change their value, which means \(p\) is subtracted from \(t\). Note, however, that a second subtraction may be required to obtain an \(n\)-bit result since both operands can be incompletely reduced. To get “branch-less” code, we always perform two masked subtractions of \(p\) and update the carry bit \(c\) after the first one. More precisely, the first subtraction produces a “borrow bit,” which is either 0 or 1 and has to be subtracted from the carry bit to obtain the correct carry bit for the second subtraction.
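The masked, incompletely reduced addition described above can be modelled at byte level in Python as follows. This is a functional sketch only, not the AVR assembly code of the library: the function names are hypothetical, operands are little-endian lists of bytes, and Python itself gives no constant-time guarantees.

```python
def masked_sub(t, p, mask):
    """Subtract (p AND mask) from t in place, byte by byte.

    Returns the borrow bit (0 or 1) of the most significant byte."""
    borrow = 0
    for i in range(len(t)):
        d = t[i] - (p[i] & mask) - borrow
        t[i] = d & 0xFF
        borrow = (d >> 8) & 1          # 1 if the byte subtraction wrapped
    return borrow


def opf_add(a, b, p):
    """Incompletely reduced addition: returns t = a + b (mod p) with
    t < 2^n, where a, b < 2^n and 2^(n-1) < p < 2^n."""
    t, carry = [], 0
    for x, y in zip(a, b):             # ordinary n-bit addition
        total = x + y + carry
        t.append(total & 0xFF)
        carry = total >> 8
    for _ in range(2):                 # always two masked subtractions
        mask = -carry & 0xFF           # 0x00 if carry = 0, 0xFF if carry = 1
        carry -= masked_sub(t, p, mask)
    return t
```

Both loops always run over all bytes and the subtraction is always executed twice, so the sequence of operations does not depend on the operand values; only the mask differs.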

A modular subtraction \(a - b \bmod p\) can be implemented on the basis of the same principles as the modular addition described above. Our implementation performs an ordinary subtraction \(t = a - b\) followed by two masked additions of \(p\), whereby the mask is derived from the borrow bit of the subtraction.

### 3.3 Modular Multiplication and Squaring

As detailed earlier in this section, our OPF library supports low-weight primes of the form \(p = u \cdot {2^k} + 1\) where \(u\) is 16 bits long. Following the notation from Sect. 3.1, we can represent \(p\) via an array \(P = (P_{s-1}, \ldots , P_1, P_0)\) consisting of \(s\) words, each having a length of \(w\) bits, i.e. \(w/8\) bytes. The least significant word \(P_0\) is 1, while the most significant word \(P_{s-1}\) contains \(u\); all other words are 0. In this subsection, we assume that the two operands \(a, b\) to be multiplied have the same length as \(p\), namely \(n\) bits, but they do not necessarily need to be smaller than \(p\), i.e. \(a\) and \(b\) are in the range of \([0, 2^n-1]\).

A straightforward way to implement modular multiplication is to perform the reduction *after* the multiplication (see e.g. [18]), which is inefficient since the \(2s\)-word product is first written to memory (during the multiplication), and then it has to be loaded again from memory to accomplish the reduction. To avoid this, our implementation adopts a variant of the so-called Finely Integrated Product Scanning (FIPS) method [21] for Montgomery multiplication, which interleaves multiplication steps and reduction steps instead of executing them one after the other, thereby saving a number of load/store instructions and reducing the RAM footprint.

The standard FIPS technique for arbitrary primes, as described in [14, 21], has a nested-loop structure with two outer and two simple inner loops. In each iteration of the inner loops, two Multiply-Accumulate (MAC) operations are carried out; one with the words of the operands \(a\) and \(b\), which contributes to the computation of \(a \cdot b\). The second MAC operation involves words of the prime \(p\) and, hence, contributes to the reduction operation. Algorithm 1 shows a special variant of the FIPS method optimized for “low-weight” primes of the form \(p = u \cdot 2^k + 1\). This variant differs from the generic FIPS method for arbitrary primes in three main aspects. First, we eliminated all multiplications and MAC operations performed on zero words of \(p\) since they do not contribute to the final result. Consequently, the inner loops of Algorithm 1 perform only one MAC operation, similar to the product-scanning method for multiple-precision multiplication [16]. In fact, the inner loops in lines 6–8 and 15–17 are the same as in product-scanning multiplication, which makes Algorithm 1 fairly easy to implement. Another difference between our FIPS variant and the generic FIPS method for arbitrary primes is that the former is optimized for \(P_0 = 1\) and, as a consequence, the Montgomery reduction requires only \(s\) MAC operations; one is performed in line 10 and the remaining \(s-1\) in the second outer loop (line 18). When \(P_0 = 1\), we have \(-P_0^{-1} \equiv -1 \bmod 2^w\), which simplifies the quotient-determination part of the reduction operation compared to the original FIPS method (see [43, Sect. 4.3] for a detailed explanation). Due to this optimization, the total number of word-level multiplications and MAC operations of the FIPS method for \(p = u \cdot 2^k + 1\) amounts to only \(s^2 + s\). The third difference between our FIPS variant and the classic one is that we peeled off the computation of \(A_0 \times B_0\) from the first nested loop and re-arranged the loop structure accordingly. Because of this modification, all loops of Algorithm 1 iterate at least one time if \(s \ge 2\), which simplifies their implementation.

Our AVR Assembly implementation of the FIPS Montgomery multiplication is based on the pseudo-code from Algorithm 1. However, in order to maximize performance, we adopt a variant of Gura et al.’s hybrid multiplication method [15], which means all word-level multiplications and MAC operations are performed on four bytes (i.e. 32 bits) of the operands instead of just a single byte (i.e. our word-size \(w\) is 32). In each iteration of the two inner loops, four bytes of operand \(a\) (i.e. the word \(A_j\)) and operand \(b\) (i.e. the word \(B_{i-j}\)) are loaded from memory and multiplied together to a 64-bit product. This product is then added to a cumulative sum \(T\) held in nine 8-bit registers. Our implementation of the inner loops follows [24, Sect. 3.1] and is, therefore, slightly faster than Zhang et al.’s inner-loop operation from [43]. Each iteration of the inner loops consists of eight ld (i.e. load), 16 mul, 49 add (or adc), and four movw instructions (excluding loop overhead). When taking the updating of the loop-control variable and branch instruction into account, the overall execution time of one full iteration of the inner loop amounts to exactly 104 clock cycles.
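The cycle count of the inner loop can be checked with a short calculation. The per-instruction timings below are the usual AVR values (two cycles for `ld` from internal SRAM and for `mul`, one cycle for `add`/`adc` and `movw`); the split of the remaining cycles into a decrement and a taken branch is an assumption made here for illustration, since the text only gives the total.

```python
# Instruction counts per inner-loop iteration, as stated in the text.
COUNTS = {"ld": 8, "mul": 16, "add/adc": 49, "movw": 4}

# Cycles per instruction on an ATmega128 (standard AVR timings).
CYCLES = {"ld": 2, "mul": 2, "add/adc": 1, "movw": 1}

body = sum(COUNTS[name] * CYCLES[name] for name in COUNTS)

# Assumed loop overhead: one-cycle counter decrement plus a taken
# two-cycle branch (hypothetical breakdown of the 3 remaining cycles).
overhead = 1 + 2

print(body, body + overhead)
```

Under these assumptions the loop body accounts for 101 cycles and the loop control for the remaining 3, which matches the stated total of 104 clock cycles.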

Besides excellent performance, the inner-loop implementation from [24] has the further advantage that it occupies only 30 out of the 32 working registers of an AVR processor. We use the two free registers to accommodate the 16-bit coefficient \(u\) of the prime \(p = u \cdot 2^k + 1\). Hence, we have to maintain only three pointers, namely the pointers to the arrays \(A\), \(B\), and \(Z\), which we hold in the three pointer registers X, Y, and Z during the execution of a multiplication. In each iteration of the inner loop, the pointer to \(A\) gets incremented by 4, while the pointer to \(B\) is decremented. Therefore, the pointers need to be initialized with the correct start addresses, and this initialization has to be performed in the outer loop, immediately before the start of the inner loop. Zhang et al. [43] did this pointer initialization with the help of the “original” start address of the arrays \(A\) and \(B\) (i.e. the address of \(A_0\) and \(B_0\)), which they pushed onto the stack at the very beginning of the multiplication and then popped whenever needed. Unfortunately, this approach is quite expensive since push and pop instructions take two cycles each. We found it more efficient to re-calculate the original address of these pointers using the end-value of the loop counter.

Algorithm 1 does not include the so-called “final subtraction” of \(p\), which is generally required in Montgomery multiplication to guarantee that the result is smaller than \(p\) or, in our case, smaller than \(2^n\). Therefore, the array \(Z\) consists of \(s + 1\) words, whereby its most significant word \(Z_s\) is either 0 or 1. Note that (at most) one subtraction of \(p\) is required to get an \(s\)-word result in the range of \([0, 2^n-1]\), even when both inputs are not completely reduced. To minimize SPA leakage, we perform this subtraction of \(p\) in the same way as described in Sect. 3.2, but use \(Z_s\) to derive an “all-zero” or “all-one” mask.
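
A mask-based subtraction of this kind can be sketched as follows. Since Sect. 3.2 is not reproduced here, the code is a generic illustration rather than the paper's routine; the function name `masked_final_subtraction` is hypothetical, and Python itself gives no timing guarantees, so the sketch only shows the data flow.

```python
W = 32                      # word size in bits (assumed)
MASK = (1 << W) - 1


def masked_final_subtraction(Z, P):
    """Subtract p from Z if and only if Z[s] == 1, without branching on Z[s].

    Z holds s+1 words with Z[s] in {0, 1}; P holds the s words of p.
    Returns the s-word result.
    """
    s = len(P)
    mask = (-Z[s]) & MASK            # all-one if Z_s = 1, all-zero otherwise
    borrow = 0
    R = []
    for i in range(s):               # same operation sequence in both cases
        d = Z[i] - (P[i] & mask) - borrow
        R.append(d & MASK)
        borrow = (d >> W) & 1        # 1 if the word subtraction wrapped around
    return R
```

When \(Z_s = 1\), the final borrow cancels \(Z_s\), so the \(s\) returned words hold the complete result.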

We implemented modular squaring for our low-weight primes similar to the multiplication, using the same optimizations in the reduction. Furthermore, the squaring adopts the well-known “trick” that allows one to cut the total number of word-level multiplications by almost one half (from \(s^2\) to \(\frac{s^2 + s}{2}\)) [6].
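
The saving comes from computing each cross product \(A_j \cdot A_{i-j}\) with \(j \ne i-j\) only once and doubling it. The following sketch shows the idea for a plain (unreduced) product-scanning squaring; the function name `square_product_scanning` and the 32-bit word size are assumptions made for illustration, and the reduction part is omitted.

```python
W = 32                      # word size in bits (assumed)
MASK = (1 << W) - 1


def square_product_scanning(A):
    """Return (words, mults): the 2s words of A^2 and the multiplication count."""
    s = len(A)
    t = 0                                   # running column sum
    mults = 0
    out = []
    for i in range(2 * s - 1):
        cross = 0
        j = max(0, i - s + 1)
        while j < i - j:                    # each pair (j, i-j) visited once
            cross += A[j] * A[i - j]
            mults += 1
            j += 1
        t += 2 * cross                      # doubling replaces the mirrored products
        if i % 2 == 0:                      # diagonal term A[i/2]^2
            t += A[i // 2] * A[i // 2]
            mults += 1
        out.append(t & MASK)
        t >>= W
    out.append(t)                           # most significant word
    return out, mults
```

For \(s = 5\), the sketch performs 15 word-level multiplications instead of 25.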

## 4 Performance Evaluation and Comparison

Table 1. Execution time (in clock cycles) of arithmetic operations in OPFs

Operation | 160 bit | 192 bit | 224 bit | 256 bit |
---|---|---|---|---|
Addition | 530 | 631 | 732 | 833 |
Subtraction | 530 | 631 | 732 | 833 |
Multiplication | 3237 | 4500 | 5971 | 7650 |
Squaring | 2901 | 3909 | 5058 | 6347 |
Mul. by 16-bit integer | 873 | 1039 | 1295 | 1461 |
Inversion | 223374 | 311828 | 416758 | 531901 |

Table 1 summarizes the execution times we obtained using the ATmega128 processor as target platform, whereby all timings include the full function-call overhead. A multiplication in a 160-bit OPF takes 3237 clock cycles, which is almost 10 % faster than the average multiplication time of 3542 cycles reported by Zhang et al. [43]. For comparison, Szczechowiak et al.’s NanoECC [35] needs a total of 3882 clock cycles for a 160-bit modular multiplication (2654 cycles to do the multiplication, 1228 cycles for a reduction modulo a 160-bit generalized Mersenne prime), even though they fully unrolled the loops. The overhead due to the reduction operation accounts for about 31.6 % of the total multiplication time. On the other hand, the reduction overhead of multiplication in a 160-bit OPF is 459 clock cycles (or 14.2 %) since, according to [24], a conventional 160-bit multiplication (without modular reduction) requires 2778 cycles.

Table 2. Execution time (in cycles) of point arithmetic and scalar multiplication

Operation | 160 bit | 192 bit | 224 bit | 256 bit |
---|---|---|---|---|
GLV point addition | 40305 | 54417 | 70418 | 88550 |
GLV point doubling | 26684 | 36539 | 45369 | 56296 |
GLV scalar mul. | 4191073 | 6918518 | 10064582 | 14178625 |
Montgomery point add. | 19479 | 25890 | 33207 | 41428 |
Montgomery point dbl. | 15950 | 21072 | 26884 | 33390 |
Montgomery scalar mul. | 5928088 | 9445554 | 14109549 | 20158840 |

Table 2 lists the simulated execution times of point addition/doubling and full scalar multiplication for both GLV and Montgomery curves. As explained in Sect. 2.2, the addition and doubling of points on a Montgomery curve are less costly (in terms of arithmetic operations in the underlying prime field) than the point addition/doubling on a GLV curve, and the simulation results from Table 2 clearly confirm this. However, the situation becomes different when we compare the execution times of a full scalar multiplication since the GLV curve outperforms its Montgomery counterpart by a factor of 1.41 in the 160-bit case (i.e. \(4.19 \cdot 10^6\) versus \(5.93 \cdot 10^6\) cycles on an ATmega128). We implemented the scalar multiplication on the Montgomery curve in a straightforward way based on a “Montgomery ladder” [6], while the scalar multiplication on the GLV curve exploits an efficiently computable endomorphism as described in [11, 16]. Since the Montgomery curves we used have a positive trace and a co-factor of 4, we evaluated the execution time using scalars that are two bits shorter than the underlying OPF. On the other hand, our GLV curves have a co-factor of 1 and we used scalars \(k\) that satisfy the following conditions: (1) the two sub-scalars \(k_1\), \(k_2\) of the decomposition of \(k\) are both positive and \(n/2\) bits long (\(n\) is the bitlength of the underlying OPF), and (2) their JSF contains \(n/4\) zero bits.

Table 3. Comparison of execution time of scalar multiplication over fields of an order of roughly 160 bits (evaluation platform is an ATmega128 clocked at 7.3728 MHz)

Implementation | Field order | Fixed P. | Rand. P. | SPA resistant |
---|---|---|---|---|
Seo et al. [32] | GF(\(2^m\)), 163 bit | 1.14 s | 1.14 s | No |
Kargl et al. [20] | GF(\(2^m\)), 167 bit | 0.76 s | 0.76 s | No |
Aranha et al. [1] | GF(\(2^m\)), 163 bit | 0.29 s | 0.32 s | No |
Liu et al. [23] | GF(\(p\)), 160 bit | 2.05 s | 2.30 s | No |
Szczechowiak et al. [35] | GF(\(p\)), 160 bit | 1.27 s | 1.27 s | No |
Wang et al. [39] | GF(\(p\)), 160 bit | 1.24 s | 1.35 s | No |
Gura et al. [15] | GF(\(p\)), 160 bit | 0.88 s | 0.88 s | No |
Chu et al. [5] | GF(\(p\)), 160 bit | 0.79 s | 0.79 s | No |
Großschädl et al. [13] | GF(\(p\)), 160 bit | 0.74 s | 0.74 s | No |
Ugus et al. [36] | GF(\(p\)), 160 bit | 0.57 s | 1.03 s | No |
Wenger et al. [40] (Mon.) | GF(\(p\)), 160 bit | 0.75 s | 0.75 s | Yes |
Wenger et al. [40] (GLV) | GF(\(p\)), 160 bit | 0.53 s | 0.53 s | No |
Our work (Montg. curve) | GF(\(p\)), 160 bit | 0.80 s | 0.80 s | Yes |
Our work (GLV curve) | GF(\(p\)), 160 bit | 0.57 s | 0.57 s | No |
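
The timings in the last two rows can be related to the 160-bit cycle counts of Table 2 by dividing by the clock frequency of 7.3728 MHz. The short calculation below is a consistency check; the variable names are chosen here for illustration.

```python
F_CPU = 7_372_800                 # ATmega128 clock in Hz (7.3728 MHz)

glv_cycles = 4_191_073            # GLV scalar mul., 160 bit (Table 2)
mont_cycles = 5_928_088           # Montgomery scalar mul., 160 bit (Table 2)

glv_seconds = glv_cycles / F_CPU      # about 0.57 s
mont_seconds = mont_cycles / F_CPU    # about 0.80 s
ratio = mont_cycles / glv_cycles      # about 1.41

print(round(glv_seconds, 2), round(mont_seconds, 2), round(ratio, 2))
```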

## 5 Conclusions

The aim of this paper was to provide new insights into certain implementation aspects of OPFs on 8-bit AVR processors. First, we argued that OPFs defined by primes of the form \(p = u \cdot 2^k + 1\), where \(u\) is a 16-bit integer, represent an optimal trade-off between performance and scalability. Then, we described in detail how to implement arithmetic operations for OPFs, taking the properties (e.g. low Hamming weight) of these primes into account. In particular, we proposed a new variant of Montgomery multiplication for low-weight primes based on the FIPS method. Our Montgomery variant has the same loop structure as the ordinary product-scanning method for multiplication and can, therefore, be well optimized for ATmega processors. We implemented the multiplication and all other arithmetic operations needed for ECC in a parameterized fashion with rolled loops so as to achieve high scalability and small code size. Furthermore, we wrote the Assembly code of all arithmetic functions (bar inversion) in such a way that the same instruction sequence is always executed, irrespective of the actual value of the operands, which helps to foil SPA attacks. Simulation results obtained with AVR Studio 4 indicate an execution time of 3237 cycles for a multiplication in a 160-bit OPF, while squaring takes 2901 cycles. These results compare very favorably with previous work and outperform even some implementations with unrolled loops. We also evaluated the execution time of a full scalar multiplication on Montgomery as well as GLV curves over OPFs. In the former case, the scalar multiplication is “intrinsically” SPA resistant and executes in 5.93 million cycles over a 160-bit OPF, while, in the latter case, we have an execution time of 4.19 million cycles. Both results confirm that OPFs are an excellent implementation option for ECC on 8-bit AVR processors.

## References

- 1. Aranha, D.F., Dahab, R., López, J.C., Oliveira, L.B.: Efficient implementation of elliptic curve cryptography in wireless sensors. Adv. Math. Commun. **4**(2), 169–187 (2010)
- 2. Atmel Corporation. 8-bit AVR\(^{\textregistered }\) Instruction Set. User Guide, July 2008. http://www.atmel.com/dyn/resources/prod_documents/doc0856.pdf
- 3. Atmel Corporation. 8-bit AVR\(^{\textregistered }\) Microcontroller with 128K Bytes In-System Programmable Flash: ATmega128, ATmega128L. Datasheet, June 2008. http://www.atmel.com/dyn/resources/prod_documents/doc2467.pdf
- 4. Bernstein, D.J.: Curve25519: New Diffie-Hellman Speed Records. In: Yung, M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958, pp. 207–228. Springer, Heidelberg (2006)
- 5. Chu, D., Großschädl, J., Liu, Z., Müller, V., Zhang, Y.: Twisted Edwards-form elliptic curve cryptography for 8-bit AVR-based sensor nodes. In: Xu, S., Zhao, Y. (eds.) Proceedings of the 1st ACM Workshop on Asia Public-Key Cryptography (AsiaPKC 2013), pp. 39–44. ACM Press (2013)
- 6. Cohen, H., Frey, G.: Handbook of Elliptic and Hyperelliptic Curve Cryptography. Discrete Mathematics and Its Applications, vol. 34. Chapman & Hall, Boca Raton (2006)
- 7. Crandall, R.E.: Method and apparatus for public key exchange in a cryptographic system, U.S. Patent No. 5,159,632, October 1992
- 8. Crossbow Technology, Inc. MICAz Wireless Measurement System. Data sheet, January 2006. http://www.xbow.com/Products/Product_pdf_files/Wireless_pdf/MICAz_Datasheet.pdf
- 9. de Meulenaer, G., Standaert, F.-X.: Stealthy compromise of wireless sensor nodes with power analysis attacks. In: Chatzimisios, P., Verikoukis, C., Santamaría, I., Laddomada, M., Hoffmann, O. (eds.) MOBILIGHT 2010. LNICST, vol. 45, pp. 229–242. Springer, Heidelberg (2010)
- 10. Eisenbarth, T., Gong, Z., Güneysu, T., Heyse, S., Indesteege, S., Kerckhof, S., Koeune, F., Nad, T., Plos, T., Regazzoni, F., Standaert, F.-X., van Oldeneel tot Oldenzeel, L.: Compact implementation and performance evaluation of block ciphers in ATtiny devices. In: Mitrokotsa, A., Vaudenay, S. (eds.) AFRICACRYPT 2012. LNCS, vol. 7374, pp. 172–187. Springer, Heidelberg (2012)
- 11. Gallant, R.P., Lambert, R.J., Vanstone, S.A.: Faster point multiplication on elliptic curves with efficient endomorphisms. In: Kilian, J. (ed.) CRYPTO 2001. LNCS, vol. 2139, pp. 190–200. Springer, Heidelberg (2001)
- 12. Großschädl, J.: TinySA: a security architecture for wireless sensor networks. In: Diot, C., Ammar, M., Sá da Costa, C., Lopes, R.J., Leitão, A.R., Feamster, N., Teixeira, R. (eds.) Proceedings of the 2nd International Conference on Emerging Networking Experiments and Technologies (CoNEXT 2006), pp. 288–289. ACM Press (2006)
- 13. Großschädl, J., Hudler, M., Koschuch, M., Krüger, M., Szekely, A.: Smart elliptic curve cryptography for smart dust. In: Zhang, X., Qiao, D. (eds.) QShine 2010. LNICST, vol. 74, pp. 623–634. Springer, Heidelberg (2012)
- 14. Großschädl, J., Kamendje, G.-A.: Architectural enhancements for Montgomery multiplication on embedded RISC processors. In: Zhou, J., Yung, M., Han, Y. (eds.) ACNS 2003. LNCS, vol. 2846, pp. 418–434. Springer, Heidelberg (2003)
- 15. Gura, N., Patel, A., Wander, A., Eberle, H., Shantz, S.C.: Comparing elliptic curve cryptography and RSA on 8-bit CPUs. In: Joye, M., Quisquater, J.-J. (eds.) CHES 2004. LNCS, vol. 3156, pp. 119–132. Springer, Heidelberg (2004)
- 16. Hankerson, D.R., Menezes, A.J., Vanstone, S.A.: Guide to Elliptic Curve Cryptography. Springer, New York (2004)
- 17. Heyse, S., von Maurich, I., Wild, A., Reuber, C., Rave, J., Poeppelmann, T., Paar, C.: Evaluation of SHA-3 candidates for 8-bit embedded processors. Presentation at the 2nd SHA-3 Candidate Conference, Santa Barbara, CA, USA, August 2010. http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/
- 18. Hutter, M., Schwabe, P.: NaCl on 8-Bit AVR microcontrollers. In: Youssef, A., Nitaj, A., Hassanien, A.E. (eds.) AFRICACRYPT 2013. LNCS, vol. 7918, pp. 156–172. Springer, Heidelberg (2013)
- 19. Hutter, M., Wenger, E.: Fast multi-precision multiplication for public-key cryptography on embedded microprocessors. In: Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 459–474. Springer, Heidelberg (2011)
- 20. Kargl, A., Pyka, S., Seuschek, H.: Fast arithmetic on ATmega128 for elliptic curve cryptography. Cryptology ePrint Archive, Report 2008/442 (2008). http://eprint.iacr.org
- 21. Koç, Ç.K., Acar, T., Kaliski, B.S.: Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro **16**(3), 26–33 (1996)
- 22. Lederer, C., Mader, R., Koschuch, M., Großschädl, J., Szekely, A., Tillich, S.: Energy-efficient implementation of ECDH key exchange for wireless sensor networks. In: Markowitch, O., Bilas, A., Hoepman, J.-H., Mitchell, C.J., Quisquater, J.-J. (eds.) Information Security Theory and Practice. LNCS, vol. 5746, pp. 112–127. Springer, Heidelberg (2009)
- 23. Liu, A., Ning, P.: TinyECC: a configurable library for elliptic curve cryptography in wireless sensor networks. In: Proceedings of the 7th International Conference on Information Processing in Sensor Networks (IPSN 2008), pp. 245–256. IEEE Computer Society Press (2008)
- 24. Liu, Z., Großschädl, J.: New speed records for Montgomery modular multiplication on 8-bit AVR microcontrollers. Cryptology ePrint Archive, Report 2013/882 (2013). http://eprint.iacr.org
- 25. Mangard, S., Oswald, E., Popp, T.: Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer, New York (2007)
- 26. Montgomery, P.L.: Modular multiplication without trial division. Math. Comput. **44**(170), 519–521 (1985)
- 27. Montgomery, P.L.: Speeding the Pollard and elliptic curve methods of factorization. Math. Comput. **48**(177), 243–264 (1987)
- 28. Oswald, E.: Enhancing simple power-analysis attacks on elliptic curve cryptosystems. In: Kaliski, B.S., Koç, Ç.K., Paar, C. (eds.) CHES 2002. LNCS, vol. 2523, pp. 82–97. Springer, Heidelberg (2002)
- 29. Sakai, Y., Sakurai, K.: Simple power analysis on fast modular reduction with NIST recommended elliptic curves. In: Qing, S., Mao, W., López, J., Wang, G. (eds.) ICICS 2005. LNCS, vol. 3783, pp. 169–180. Springer, Heidelberg (2005)
- 30. Scott, M., Szczechowiak, P.: Optimizing multiprecision multiplication for public key cryptography. Cryptology ePrint Archive, Report 2007/299 (2007). http://eprint.iacr.org
- 31. Seo, H., Kim, H.: Multi-precision multiplication for public-key cryptography on embedded microprocessors. In: Lee, D.H., Yung, M. (eds.) WISA 2012. LNCS, vol. 7690, pp. 55–67. Springer, Heidelberg (2012)
- 32. Seo, S.C., Han, D.-G., Kim, H.C., Hong, S.: TinyECCK: efficient elliptic curve cryptography implementation over GF(\(2^m\)) on 8-bit Micaz mote. IEICE Trans. Inf. Syst. **E91–D**(5), 1338–1347 (2008)
- 33. Solinas, J.A.: Generalized Mersenne numbers. Technical report CORR-99-39, Centre for Applied Cryptographic Research (CACR), University of Waterloo, Waterloo, Canada (1999)
- 34. Stebila, D., Thériault, N.: Unified point addition formulæ and side-channel attacks. In: Goubin, L., Matsui, M. (eds.) CHES 2006. LNCS, vol. 4249, pp. 354–368. Springer, Heidelberg (2006)
- 35. Szczechowiak, P., Oliveira, L.B., Scott, M., Collier, M., Dahab, R.: NanoECC: testing the limits of elliptic curve cryptography in sensor networks. In: Verdone, R. (ed.) EWSN 2008. LNCS, vol. 4913, pp. 305–320. Springer, Heidelberg (2008)
- 36. Ugus, O., Westhoff, D., Laue, R., Shoufan, A., Huss, S.A.: Optimized implementation of elliptic curve based additive homomorphic encryption for wireless sensor networks. In: Wolf, T., Parameswaran, S. (eds.) Proceedings of the 2nd Workshop on Embedded Systems Security (WESS 2007), pp. 11–16 (2007). http://arxiv.org/abs/0903.3900
- 37. Uhsadel, L., Poschmann, A., Paar, C.: Enabling full-size public-key algorithms on 8-bit sensor nodes. In: Stajano, F., Meadows, C., Capkun, S., Moore, T. (eds.) ESAS 2007. LNCS, vol. 4572, pp. 73–86. Springer, Heidelberg (2007)
- 38. Walter, C.D.: Simple power analysis of unified code for ECC double and add. In: Joye, M., Quisquater, J.-J. (eds.) CHES 2004. LNCS, vol. 3156, pp. 191–204. Springer, Heidelberg (2004)
- 39. Wang, H., Li, Q.: Efficient implementation of public key cryptosystems on mote sensors (short paper). In: Ning, P., Qing, S., Li, N. (eds.) ICICS 2006. LNCS, vol. 4307, pp. 519–528. Springer, Heidelberg (2006)
- 40. Wenger, E., Großschädl, J.: An 8-bit AVR-based elliptic curve cryptographic RISC processor for the Internet of things. In: Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture Workshops (MICROW 2012), pp. 39–46. IEEE Computer Society Press (2012)
- 41. Woodbury, A.D., Bailey, D.V., Paar, C.: Elliptic curve cryptography on smart cards without coprocessors. In: Domingo-Ferrer, J., Chan, D., Watson, A. (eds.) Smart Card Research and Advanced Applications. International Federation for Information Processing, vol. 180, pp. 71–92. Kluwer Academic Publishers, Amsterdam (2000)
- 42. Yanık, T., Savaş, E., Koç, Ç.K.: Incomplete reduction in modular arithmetic. IEE Proc. Comput. Digit. Tech. **149**(2), 46–52 (2002)
- 43. Zhang, Y., Großschädl, J.: Efficient prime-field arithmetic for elliptic curve cryptography on wireless sensor nodes. In: Proceedings of the 1st International Conference on Computer Science and Network Technology (ICCSNT 2011), vol. 1, pp. 459–466. IEEE (2011)