Efficient implementation of modular multiplication over 192-bit NIST prime for 8-bit AVR-based sensor node

Modular multiplication is one of the most time-consuming operations in elliptic curve cryptography, accounting for almost 80% of the computational overhead of a scalar multiplication. In this paper, we present a new speed record for modular multiplication over the 192-bit NIST prime P-192 on 8-bit AVR ATmega microcontrollers. We propose a new integer representation named Range Shifted Representation (RSR), which enables an efficient merging of the reduction operation into the subtractive Karatsuba multiplication. This merging substantially optimizes the intermediate accumulation of modular multiplication by removing a significant amount of unnecessary memory access as well as a number of addition operations. Our merged modular multiplication on RSR is designed to have two duplicated groups of 96-bit intermediate values during accumulation. Hence, only one accumulation of the group is required and the result can be used twice. Consequently, we significantly reduce the number of load/store instructions, which are known to be among the most time-consuming instructions for modular multiplication on constrained devices. Our implementation requires only 2888 cycles for the modular multiplication of 192-bit integers and outperforms the previous best result for modular multiplication over P-192 by 17%. In addition, our modular multiplication is even faster than the Karatsuba multiplication (without reduction) that held the speed record for multiplication on AVR processors.


Introduction
With the rapid advancement of the Internet of Things (IoT), wireless sensor networks (WSNs) are recognized as important enablers consisting of a large number of resource-constrained sensor nodes. Constrained sensor nodes are now widely used to monitor and record physical and environmental conditions such as temperature, sound, and pollution levels. Compared with traditional wired networks, security is harder to achieve in WSNs, where sensor nodes are easily captured or eavesdropped on by adversaries owing to the nature of wireless communication. Such security issues naturally raise a requirement for cryptographic mechanisms in WSNs that enable secure and reliable communication. However, it is difficult to provide sufficient security in WSNs because of the many restrictions on computation capability, energy consumption, and even storage space of constrained sensor nodes. For example, the MICAz mote is widely considered a representative constrained 8-bit sensor node. It is equipped with an AVR ATmega128 processor, which has 4 Kbytes of RAM and 128 Kbytes of programmable flash memory and runs at a clock frequency of 7.3728 MHz. The energy consumption of cryptographic software executed on a processor is closely related to its execution time, so a faster cryptographic algorithm usually translates to savings in energy.
In the early days, it was believed that Public-Key Cryptosystems (PKCs) were infeasible to implement on resource-constrained sensor nodes, since they require a significant amount of computation. Since then, many studies have proposed ways to apply PKCs for secure communication in WSNs by overcoming the restrictions of resource-constrained sensor nodes [1][2][3][4]. Elliptic curve cryptography (ECC) is considered a better choice for WSNs than conventional PKCs such as RSA and DSA, owing to its short key length. For example, a 160-bit key in an ECC scheme provides the same level of security as a 1024-bit key in an RSA scheme. Such a small key allows a lower memory footprint and lower bandwidth consumption in WSNs. Moreover, a scalar multiplication, which is the most time-consuming part of all ECC-based schemes, requires only 5% to 10% of the execution time of an RSA exponentiation.
ECC-based schemes such as the Elliptic Curve Diffie-Hellman (ECDH) key exchange and the Elliptic Curve Digital Signature Algorithm (ECDSA) are composed of three levels of operations, as described in Fig. 1. The main operation of virtually all ECC-based schemes is scalar multiplication, which requires elliptic curve point arithmetic operations such as point addition and point doubling. These point arithmetic operations are in turn composed of field arithmetic operations such as multiplication, squaring, addition, and inversion. Apart from field inversion, multiplication is the most time-consuming field operation and accounts for almost 80% of the computational overhead of a scalar multiplication. After each multiplication, a reduction operation must be executed to reduce the double-sized result.
For an efficient ECC implementation in resource-constrained environments, careful design of the field arithmetic operations is required, and the most performance-critical operation is multi-precision multiplication. Hence, the majority of research on ECC implementation has focused on improving the performance of multi-precision multiplication for constrained sensor nodes.

Related work and motivation
After the first ECC implementation by Gura et al. [1], there have been a variety of approaches to optimize ECC implementation for constrained devices. Many studies have focused on improving the performance of multi-precision multiplication which is the most critical factor for an efficient implementation of scalar multiplication.
In 1994, Comba described an efficient column-wise approach to multi-precision multiplication on Intel processors, referred to as the product scanning method [5]. Until 2004, this method was known as the fastest multiplication with quadratic complexity on AVR processors. However, this no longer holds for integers larger than 96 bits.
In CHES 2004 [1], Gura et al. presented the hybrid method, which combines the advantages of conventional byte-wise multiplication techniques such as the operand scanning and product scanning methods. The hybrid method aims at minimizing the number of load instructions on processors with a large register file by processing four bytes in each iteration of the inner loop. This significant reduction in load instructions yielded a speed improvement of up to 25% compared to the product scanning method. Their 160-bit multiplication requires 3106 clock cycles on an 8-bit ATmega128 processor. After that, several authors applied this method to accelerate the scalar multiplication of ECC implementations. Most of them focused on optimizing the performance of the hybrid method and proposed variants that reported between 2593 and 2881 clock cycles on an 8-bit ATmega128 processor [6][7][8][9]. The next milestone belongs to Hutter and Wenger, who proposed the operand caching method [10]. Their technique increases the performance of multiplication by caching the operands in the general-purpose registers to reduce the number of load instructions. The operand caching method was slightly improved at WISA 2012, where Seo and Kim introduced an advanced consecutive operand caching method [11,12].
In 2015, the subtractive Karatsuba method was carefully revisited by Hutter and Schwabe [13]. Their implementation of the subtractive Karatsuba method costs only 1969 clock cycles for 160-bit operands and set the speed record for multi-precision multiplication on ATmega processors. In [3], it is also shown that the Karatsuba method is the fastest approach for modular multiplication on constrained devices.
From the point of view of implementation on constrained devices, load and store instructions have a large influence on the performance of multi-precision multiplication. Hence, the main concern of the various multiplication methods is to reduce the memory access for operands or intermediate accumulated results during the multiplication. The operand caching method [10] and the Karatsuba multiplication [13] show that careful scheduling of memory access can lead to the best performance by maximizing the use of the available registers.
Until now, the reduction has been treated as a separate part of the multiplication process. Most studies do not concentrate on optimizing the reduction operation even though it always follows multiplication and, consequently, can cause a large memory access overhead by reloading the previous results. In this paper, we focus on finding an effective way of removing unnecessary memory access by considering multiplication and reduction as a whole.

Contributions
In this paper, we propose a new method for fast modular multiplication over the 192-bit prime recommended by the US National Institute of Standards and Technology (NIST). The result of our work sets a new speed record on an 8-bit AVR ATmega processor. The following list details the contributions of our work.
• We propose a new integer representation to optimize the implementation of modular multiplication by exploiting the structure of a prime modulus that contains the term "− 1". In this regard, we choose the 192-bit NIST standard prime, which has this structure and is suitable for constrained devices.
• On the basis of the new integer representation, we present a novel approach for 192-bit modular multiplication over the 192-bit NIST prime on 8-bit architectures. By merging the reduction operations into the subtractive Karatsuba multiplication on the new integer representation, we optimize the intermediate accumulation in the modular multiplication. Our merged modular multiplication has two duplicated groups of 96-bit intermediate results during accumulation. Hence, only one accumulation of the group is required and the result can be used twice. Consequently, we significantly reduce the number of load/store instructions as well as the number of addition instructions.
• We present the implementation result of our proposed 192-bit modular multiplication over the 192-bit NIST prime on 8-bit AVR ATmega microcontrollers. Our implementation takes only 2888 clock cycles, which is 17% faster than the previous best record for modular multiplication by Liu et al. [3]. In addition, our modular multiplication is even faster than the subtractive Karatsuba multiplication (without reduction) of Hutter and Schwabe [13], which held the speed record for multiplication on AVR processors.
This paper is organized as follows: In Sect. 2, we give a brief introduction of ECC including NIST curve P-192 and review various multi-precision multiplication techniques. In Sect. 3, we propose the new modular multiplication over the 192-bit NIST prime. Section 4 compares our work with previous works. Finally, we conclude the paper in Sect. 5.

Elliptic curve cryptography
Elliptic curve cryptography was first introduced by Koblitz and Miller in 1985 [14,15]. The security of ECC is based on the Elliptic Curve Discrete Logarithm Problem (ECDLP), and no general-purpose subexponential algorithm is known to solve the ECDLP. Let $\mathbb{F}_p$ be a finite field with odd characteristic. An elliptic curve $E$ over $\mathbb{F}_p$ can be defined through a short Weierstraß equation of the form
$$y^2 = x^3 + ax + b,$$
where $a, b \in \mathbb{F}_p$ and $4a^3 + 27b^2 \neq 0$. It is preferred that the curve parameter $a$ is fixed to $-3$ to optimize the point arithmetic in scalar multiplication. NIST first proposed five prime-field curves in 1999 [16] for standardization. The so-called NIST curves $E$ can be defined through a short Weierstraß equation of the following form:
$$y^2 = x^3 - 3x + b.$$
From the point of view of implementation on resource-constrained devices, the NIST curve P-192 is in a better position than the other NIST curves because it provides an appropriate security level at a reasonable computational cost on small devices [3]. This curve uses the prime field $\mathbb{F}_{P_{192}}$ defined by the prime $P_{192} = 2^{192} - 2^{64} - 1$. This prime has the special characteristic that it can be expressed as the sum or difference of a small number of powers of 2. In addition, the exponents are all multiples of 8, 16, and 32. The reduction algorithm for $P_{192}$ is therefore especially fast and well suited to machines with a word size of 8, 16, or 32 bits. For example, the result of a multiplication can be reduced via three additions modulo $P_{192}$ using the congruence $2^{192} \equiv 2^{64} + 1 \pmod{P_{192}}$.
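The three-addition reduction mentioned above can be sketched as follows. This is a minimal illustration on Python integers with 64-bit words, not the byte-level AVR routine of this paper; the names `P192`, `MASK64`, and `reduce_p192` are chosen here for illustration only.

```python
P192 = 2**192 - 2**64 - 1
MASK64 = (1 << 64) - 1


def reduce_p192(c):
    """Reduce c < 2^384 modulo P192 using 2^192 = 2^64 + 1 (mod P192)."""
    # Split c into six 64-bit words, c0 being the least significant.
    c0, c1, c2, c3, c4, c5 = [(c >> (64 * i)) & MASK64 for i in range(6)]
    # Folding rules derived from the congruence:
    #   c3 * 2^192 = c3 * 2^64  + c3
    #   c4 * 2^256 = c4 * 2^128 + c4 * 2^64
    #   c5 * 2^320 = c5 * 2^128 + c5 * 2^64 + c5
    s1 = c0 | (c1 << 64) | (c2 << 128)
    s2 = c3 | (c3 << 64)
    s3 = (c4 << 64) | (c4 << 128)
    s4 = c5 | (c5 << 64) | (c5 << 128)
    r = s1 + s2 + s3 + s4          # three additions
    while r >= P192:               # a few final subtractions at most
        r -= P192
    return r
```

For example, `reduce_p192(a * b)` agrees with `(a * b) % P192` for any field elements `a` and `b`.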

Multi-precision multiplication techniques
In this section, we briefly review the multi-precision multiplication techniques for fast execution on constrained devices. Throughout this section, we represent $X$ and $Y$ by n-word integers as $X = x_0 + x_1 W + \cdots + x_{n-1} W^{n-1}$ and $Y = y_0 + y_1 W + \cdots + y_{n-1} W^{n-1}$, where $W = 2^8$.

Operand scanning method
The operand scanning method is the simplest approach to implementing multi-precision multiplication. It is also referred to as the schoolbook method or the row-wise method. The multiplication consists of two parts, an inner loop and an outer loop. In the outer loop, the operand word $x_i$ is loaded and held in a working register during the inner loop. Within the inner loop, the multiplicand words $y_j$ are loaded one by one and each partial product is computed by multiplying with $x_i$. Once the inner loop is completed, the next operand word $x_{i+1}$ is loaded and the inner loop is iterated again.
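The row-wise loop structure described above can be sketched in Python on lists of 8-bit words. This is only a model of the data flow, assuming little-endian word order; the helper names `to_words`, `from_words`, and `operand_scanning` are hypothetical.

```python
def to_words(v, n):
    """Split integer v into n little-endian 8-bit words."""
    return [(v >> (8 * i)) & 0xFF for i in range(n)]


def from_words(words):
    """Recombine little-endian 8-bit words into an integer."""
    return sum(w << (8 * i) for i, w in enumerate(words))


def operand_scanning(x, y):
    """Row-wise (schoolbook) product of two n-word operands."""
    n = len(x)
    z = [0] * (2 * n)
    for i in range(n):              # outer loop: x[i] is held fixed
        carry = 0
        for j in range(n):          # inner loop: y[j] loaded one by one
            t = z[i + j] + x[i] * y[j] + carry
            z[i + j] = t & 0xFF     # result word is written back each step
            carry = t >> 8
        z[i + n] = carry            # top word of the current row
    return z
```

Note that every inner-loop step reads and writes `z[i + j]`, which is the memory traffic the later methods try to avoid.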

Product scanning method
The product scanning method accumulates partial products in a different way. It computes partial products column by column, where the intermediate results of the same column are accumulated immediately in working registers without storing and loading. Once the accumulation for a column is completed, one word of the final multiplication result is obtained. This column-wise approach makes carry propagation easy to handle. In addition, the product scanning method is well suited to constrained devices, since only a few registers are needed to compute and accumulate the partial products.
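A minimal Python sketch of the column-wise accumulation follows. The variable `acc` stands in for the small multi-byte accumulator kept in registers; the function and helper names are illustrative and not taken from any cited implementation.

```python
def to_words(v, n):
    """Split integer v into n little-endian 8-bit words."""
    return [(v >> (8 * i)) & 0xFF for i in range(n)]


def from_words(words):
    """Recombine little-endian 8-bit words into an integer."""
    return sum(w << (8 * i) for i, w in enumerate(words))


def product_scanning(x, y):
    """Column-wise (Comba) product of two n-word operands."""
    n = len(x)
    z = [0] * (2 * n)
    acc = 0                                   # accumulator held in registers
    for k in range(2 * n - 1):                # one result column per pass
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += x[i] * y[k - i]            # all products with i + j == k
        z[k] = acc & 0xFF                     # column finished: store once
        acc >>= 8                             # carry into the next column
    z[2 * n - 1] = acc
    return z
```

Each result word is stored exactly once, in contrast to the repeated read-modify-write of the row-wise method.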

Hybrid scanning method
Another way to compute a multi-precision multiplication is the hybrid scanning method [1], which combines the advantages of operand scanning and product scanning. The hybrid scanning method consists of two nested loop structures, where the inner loop follows the operand scanning method and the outer loop accumulates the results of the inner loop, similar to the product scanning method. The outer loop can be implemented by processing the inner loop as a sequence of partial product blocks. This method reduces the number of load instructions by sharing the operands within a block. To maximize the number of shared operands, the block size can be chosen to make full use of the available registers. However, since the outer loop follows a column-wise approach, there is no shared operand between two consecutive blocks. Hence, all operands need to be reloaded for each block.

Operand caching method
In [10], Hutter and Wenger proposed the operand caching method. This method is based on the product scanning method, but it separates the computation into several rows. Each row can be further divided into four parts. In the first part, all operands for the first and second parts are loaded. In the second part, all operands are kept constant and reused; only one word of the multiplicand is loaded between two consecutive columns. The third part follows the opposite process: all multiplicand words are kept constant and reused, and only one word of the operand is loaded for each column. In the last part, no operand loading is required, since the working registers already hold the operands. This is an efficient way to remove a significant number of load operations within a row by reusing operands already loaded in the previous part. However, whenever the row changes, the operands must be reloaded, since there is no shared operand between rows. To overcome this disadvantage, Seo and Kim proposed the consecutive operand caching method [11,12], which reschedules the rows in order to share operands when the row changes.

Subtractive Karatsuba method
In the early 1960s, Karatsuba proposed a notable multiplication technique with subquadratic complexity [17]. The Karatsuba method reduces a multiplication of two n-word operands to three multiplications of two k(= n/2)-word operands. Any multiplication method mentioned above can be applied to compute the half-size multiplications. In [13], Hutter and Schwabe presented a highly optimized implementation of the subtractive Karatsuba method for various operand sizes on AVR processors. We can explain the subtractive Karatsuba multiplication on the 8-bit platform as follows. Let
$$X = X_A + X_B W^k, \qquad Y = Y_A + Y_B W^k,$$
$$L = X_A \cdot Y_A = L_A + L_B W^k, \qquad H = X_B \cdot Y_B = H_A + H_B W^k, \qquad M = (X_A - X_B) \cdot (Y_A - Y_B),$$
where $M$ is obtained from the product of the absolute differences $|X_A - X_B| \cdot |Y_A - Y_B|$ together with its sign. We can compute $X \cdot Y$ as
$$X \cdot Y = L + (L + H - M) W^k + H W^{2k} = L_A + (L_A + (L_B + H_A)) W^k + ((L_B + H_A) + H_B) W^{2k} + H_B W^{3k} - M W^k.$$
The main idea of the optimization technique in [13] is to reduce memory access by computing only once the term $L_B + H_A$, which occurs twice in $X \cdot Y$. In addition, this trick saves k addition operations. The subtractive Karatsuba method in [13] shows the best performance for multi-precision multiplication on an 8-bit processor.
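One level of the subtractive Karatsuba method, including the reuse of the shared term, can be sketched on Python integers as follows. This is an illustrative model only; the function name `karatsuba_subtractive` and the parameter `k_bits` (half the operand size in bits) are assumptions made here and do not reflect the assembly routine of [13].

```python
def karatsuba_subtractive(x, y, k_bits):
    """One Karatsuba level for x, y < 2^(2 * k_bits)."""
    mask = (1 << k_bits) - 1
    xa, xb = x & mask, x >> k_bits        # low and high halves of x
    ya, yb = y & mask, y >> k_bits        # low and high halves of y

    low = xa * ya                         # L
    high = xb * yb                        # H
    # Subtractive middle term: multiply absolute differences, then fix the sign.
    dx, dy = xa - xb, ya - yb
    m = abs(dx) * abs(dy)
    if (dx < 0) != (dy < 0):
        m = -m                            # M = (xa - xb) * (ya - yb)

    la, lb = low & mask, low >> k_bits
    ha, hb = high & mask, high >> k_bits
    shared = lb + ha                      # computed once, used twice below

    return (la
            + ((la + shared) << k_bits)
            + ((shared + hb) << (2 * k_bits))
            + (hb << (3 * k_bits))
            - m * (1 << k_bits))
```

With `k_bits = 96`, the function multiplies two 192-bit integers using three 96-bit products.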

Range shifted representation
Generally, we can represent 192-bit integers $X$, $Y$ and their product $Z = X \cdot Y$ based on the 8-bit word size ($W = 2^8$) as follows:
$$X = x_0 + x_1 W + \cdots + x_{23} W^{23}, \qquad (5)$$
$$Y = y_0 + y_1 W + \cdots + y_{23} W^{23}, \qquad (6)$$
$$Z = z_0 + z_1 W + \cdots + z_{47} W^{47}.$$
For simplicity, we can rewrite $Z$ as presented in (10), where we omit the complete reduction step.
In the following, we propose a new integer representation for 192-bit integers, which ranges from $2^{-96}$ to $2^{96} - 1$. We call it Range Shifted Representation (RSR). We can represent 192-bit integers $X$, $Y$ and their product $Z = X \cdot Y$ with RSR as follows:
$$X = x_0 W^{-12} + x_1 W^{-11} + \cdots + x_{23} W^{11}, \qquad (12)$$
$$Y = y_0 W^{-12} + y_1 W^{-11} + \cdots + y_{23} W^{11}, \qquad (13)$$
$$Z = z_0 W^{-24} + z_1 W^{-23} + \cdots + z_{47} W^{23}.$$
An interesting property of RSR is that the result of a multiplication expands to both sides: the shape of the result is symmetric with respect to $W^0$. Because we want to represent integers in the range $[2^{-96}, 2^{96} - 1]$, we have to transform $P_{192}$ into the range shifted form for modular reduction. We can use the range shifted prime $P_{192} \cdot 2^{-96} = 2^{96} - 2^{-32} - 2^{-96}$ for modular reduction. We have to reduce the result at both sides, such that $z_0 W^{-24} + z_1 W^{-23} + \cdots + z_{11} W^{-13}$ and $z_{36} W^{12} + z_{37} W^{13} + \cdots + z_{47} W^{23}$ are reduced modulo $P_{192} \cdot 2^{-96}$. Let $X, Y, Z \in \mathbb{F}_{P_{192}}$ be represented with RSR, where $Z = X \cdot Y$. Then, we can reduce $Z$ using the congruences
$$W^{12} \equiv W^{-4} + W^{-12}, \qquad W^{-12} \equiv W^{12} - W^{-4} \pmod{P_{192} \cdot W^{-12}},$$
which follow directly from the form of the range shifted prime. Applying them folds the upper part $z_{36} W^{12} + \cdots + z_{47} W^{23}$ and the lower part $z_0 W^{-24} + \cdots + z_{11} W^{-13}$ toward the range of the representation. Note that, for complete reduction, we also need to reduce the part of the result that still lies outside this range. Here we omit the complete reduction step for simplicity.
To utilize RSR in an elliptic curve protocol such as ECDH or ECDSA, conversions from the original integer representation to RSR and vice versa are required. For example, let $X$, $Y$ be the coordinates of the input point for scalar multiplication; then a conversion from $X, Y \in [0, 2^{192} - 1]$ in Eqs. (5, 6) to $X, Y \in [2^{-96}, 2^{96} - 1]$ in Eqs. (12, 13) is required before conducting the scalar multiplication. This conversion can be done simply by reducing each coordinate modulo $P_{192} \cdot W^{-12}$. For the output of the scalar multiplication, a conversion from RSR back to the original integer representation is required. However, compared to the computational cost of the scalar multiplication, these conversions require a negligible number of cycles and are needed only once. The other field arithmetic operations on RSR, such as addition, subtraction, multiplication, and squaring, are computed in the same way as on the original representation, with $P_{192} \cdot W^{-12}$ used for reduction.
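The two-sided reduction can be modeled with plain Python integers. The sketch below is one possible reading of RSR, under the assumption that an RSR value with words weighted $W^{-12}, \ldots, W^{11}$ is stored as the 192-bit integer of those words, so that a stored integer r stands for r · 2^-96. The function names `to_rsr`, `from_rsr`, and `rsr_reduce`, and the final `% P192` normalization, are illustrative choices and not the routine of this paper.

```python
P192 = 2**192 - 2**64 - 1
MASK96 = (1 << 96) - 1
MASK192 = (1 << 192) - 1


def to_rsr(v):
    """Assumed conversion: find r with r * 2^-96 = v (mod P192 * 2^-96)."""
    return (v << 96) % P192


def from_rsr(r):
    """Inverse conversion (modular inverse via pow needs Python 3.8+)."""
    return (r * pow(1 << 96, -1, P192)) % P192


def rsr_reduce(z):
    """Fold a 384-bit product (weights W^-24 .. W^23) into W^-12 .. W^11."""
    z_lo = z & MASK96                 # weights W^-24 .. W^-13 (out of range)
    z_mid = (z >> 96) & MASK192       # weights W^-12 .. W^11  (in range)
    z_hi = z >> 288                   # weights W^12 .. W^23   (out of range)
    t, u = z_lo & 0xFFFFFFFF, z_lo >> 32
    # Upper side: W^12 = W^-4 + W^-12, so z_hi contributes z_hi * (2^64 + 1).
    # Lower side: W^-12 = W^12 - W^-4, so z_lo contributes z_lo * 2^96 - u,
    # where the low 4 bytes t still fall below W^-12 after this first fold.
    r = z_mid + z_hi * ((1 << 64) + 1) + (z_lo << 96) - u
    # Second fold of the leftover: -t * W^-16 = -t * W^8 + t * W^-8.
    r += (t << 32) - (t << 160)
    return r % P192                   # stand-in for the complete reduction
```

Under this reading, `from_rsr(rsr_reduce(to_rsr(a) * to_rsr(b)))` equals `(a * b) % P192`, so products of RSR values stay in RSR and the conversions are needed only at the start and the end of a scalar multiplication.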

Modular multiplication with RSR
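We can use the Karatsuba method for multiplication with RSR. Let $X, Y \in \mathbb{F}_{P_{192}}$ be represented with RSR and $Z = X \cdot Y$. Let
$$X_A = x_0 + \cdots + x_{11} W^{11}, \quad X_B = x_{12} + \cdots + x_{23} W^{11}, \quad Y_A = y_0 + \cdots + y_{11} W^{11}, \quad Y_B = y_{12} + \cdots + y_{23} W^{11},$$
$$L = X_A \cdot Y_A = L_A + L_B W^{12}, \qquad H = X_B \cdot Y_B = H_A + H_B W^{12}, \qquad M = (X_A - X_B) \cdot (Y_A - Y_B).$$
Then, $X$, $Y$, $Z$ can be represented as
$$X = X_A W^{-12} + X_B, \qquad Y = Y_A W^{-12} + Y_B,$$
$$Z = L W^{-24} + (L + H - M) W^{-12} + H \equiv (L_A + L_B + H_A + H_B) W^{-12} + (L_A + L_B + H_A + H_B) - M W^{-12} - L_A W^{-16} + H_B W^{-4} \pmod{P_{192} \cdot W^{-12}}.$$
The interesting thing in the above equations is that $(L_A + L_B + H_A + H_B)$ appears exactly twice. We can make use of this duplicated intermediate result to reduce memory access and accumulation operations for an efficient implementation of modular multiplication.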
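The duplicated group can be checked numerically. The sketch below scales all RSR weights by $W^{24} = 2^{192}$ so that everything stays an integer ($W^{-24} \to 2^0$, $W^{-16} \to 2^{64}$, $W^{-12} \to 2^{96}$, $W^{-4} \to 2^{160}$, $W^0 \to 2^{192}$); this scaling and the name `merged_terms` are assumptions made for the illustration.

```python
P192 = 2**192 - 2**64 - 1
MASK96 = (1 << 96) - 1


def merged_terms(x, y):
    """Karatsuba product of 192-bit x, y with both outer parts folded in."""
    xa, xb = x & MASK96, x >> 96
    ya, yb = y & MASK96, y >> 96
    low, high = xa * ya, xb * yb          # L and H
    m = (xa - xb) * (ya - yb)             # signed middle term M
    la, lb = low & MASK96, low >> 96
    ha, hb = high & MASK96, high >> 96

    d = la + lb + ha + hb                 # duplicated group, accumulated once
    return (d * (1 << 96)                 # D * W^-12
            + d * (1 << 192)              # D * W^0  (same value reused)
            - m * (1 << 96)               # -M * W^-12
            - la * (1 << 64)              # -L_A * W^-16 (from folding L_A)
            + hb * (1 << 160))            # +H_B * W^-4  (from folding H_B)
```

The returned value is congruent to `x * y` modulo `P192`, which confirms that a single accumulation of `d` serves both halves of the result.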

In Eq. (35), the computation of $L^{(1)}_B + H^{(1)}_A$ appears twice. This duplicated computation can be utilized in Algorithm 1 to minimize register allocation and to avoid additional load and store instructions in the accumulation process. Suppose that the result of $L^{(1)}_B + H^{(1)}_A$ in Step 5 is not reused at Step 9. Then $L^{(1)}_B$ and $H^{(1)}_A$ must be held in registers before the computation of $L^{(1)}_B + H^{(1)}_A$ in Step 5 and saved to memory with store instructions. Moreover, the result of $L^{(1)}_B + H^{(1)}_A$ is kept in registers for the next accumulation. In Step 9, computing $L^{(1)}_B + H^{(1)}_A$ again requires loading $L^{(1)}_B$ and $H^{(1)}_A$, which were stored to memory after Step 5. In Algorithm 1, however, the store/load instructions for $L^{(1)}_B$ and $H^{(1)}_A$ are not necessary; only the result of $L^{(1)}_B + H^{(1)}_A$ needs to be kept in registers for reuse at Step 9. Furthermore, six addition instructions for $L^{(1)}_B + H^{(1)}_A$ can be saved.
Step 13. Then, we can store $L^{(2)}_A + L^{(2)}_B$ in $(z_{12}, \ldots, z_{23}, \mathit{carry})$. In comparison with Algorithm 1, we can save 6 load instructions for $L^{(1)}_A$ and compute $L^{(2)}_A + L^{(2)}_B$ in Algorithm 2 through this process.

192-Bit 2-level Karatsuba multiplication with reduction
We combined Karatsuba multiplication with reduction on RSR to generate more duplicated intermediate results. The graphical illustrations of the 192-bit 2-level Karatsuba multiplication with reduction on RSR are shown in Fig. 2. Figure 2a shows that $L^{(2)}_A = l_0 + \cdots + l_{11} W^{11}$ and $H^{(2)}_B = h_{12} W^{12} + \cdots + h_{23} W^{23}$ need to be reduced for modular reduction. Figure 2b shows the result of reducing $L^{(2)}_A$ and $H^{(2)}_B$ by $P_{192} \cdot W^{-12}$. Now, we can visualize which parts are accumulated for computing the final result of Eq. (31). As mentioned earlier, $(L^{(2)}_A + L^{(2)}_B + H^{(2)}_A + H^{(2)}_B)$ is duplicated, so we can use it to reduce memory access and optimize register usage by inserting the accumulated value of the duplicated intermediate results into the Karatsuba multiplication with reduction.
Algorithm 3 shows the implementation of the 192-bit × 192-bit 2-level Karatsuba multiplication with reduction over $P_{192} \cdot W^{-12}$. To compute $(L^{(2)}_A + L^{(2)}_B + H^{(2)}_A + H^{(2)}_B)$, first $L^{(2)}_A + L^{(2)}_B$ is computed during the evaluation of $L^{(2)}$ through Algorithm 2 and saved. After the multiplication $X_B \cdot Y_B$ in Step 4, we get the result $H^{(2)} = H^{(2)}_A + H^{(2)}_B \cdot W^{12}$ and compute $H^{(2)}_A + H^{(2)}_B$ in Step 5. In the next step, we load $L^{(2)}_A + L^{(2)}_B$ and accumulate it into $H^{(2)}_A + H^{(2)}_B$. The accumulated result requires an additional register for a carry byte. Therefore, we can hold the complete duplicated intermediate result $(L_A + L_B + H_A + H_B)$ in 13 registers, represented by $(T, \mathit{carry}_2) = (t_0, \ldots, t_{11}, \mathit{carry}_2)$. In Step 7, we can produce the other half of the intermediate result in Fig. 2b by just copying $T$ of the duplicated intermediate results without $\mathit{carry}_2$. This is a very efficient way to decrease the number of load and store operations for previous computation results. Moreover, the number of addition operations is reduced. These advantages save a significant number of clock cycles. In Step 10, $\mathit{carry}_2$ is added to complete the accumulation.
Fig. 2 Process of modular multiplication with RSR
Because we cannot always hold the 192-bit result of a 1-level Karatsuba multiplication in registers, careful handling of the 32 registers is required to minimize the memory access between the 96-bit Karatsuba multiplications $L^{(2)}$, $H^{(2)}$, and $M^{(2)}$. We changed the order of computation from $L^{(2)} \to H^{(2)} \to M^{(2)}$ in [13] to $M^{(2)} \to L^{(2)} \to H^{(2)}$. Since $H^{(2)}_B$ is kept in registers after Step 4, we can directly reduce $H^{(2)}_B$ without any memory access at Step 7. This generates $\mathit{carry}_3$, in which the carries from Step 8, Step 9, and Step 10 are accumulated so that all carries are reduced together at Step 11.

Result
In this section, we present the implementation result of our 192-bit modular multiplication on 8-bit AVR ATmega128 processors in terms of execution time (cycle counts). The timing of our work is obtained by simulation with Atmel Studio 7.0. We refer to the cycle counts reported in [18] for comparison with various multiplications. Table 1 shows the execution time of previous works for 192-bit multiplication (only) and 192-bit modular multiplication over NIST $P_{192}$. The results for multiplication cover various multiplication methods, including operand scanning, product scanning, hybrid scanning, operand caching, consecutive operand caching, and the Karatsuba method. Among them, the implementation of the Karatsuba method by Hutter and Schwabe [13] sets the speed record for 192-bit multiplication. In [3], it is also verified that modular multiplication using the Karatsuba method achieves better performance than other methods for 192-bit modular multiplication over NIST $P_{192}$.
The Karatsuba multiplication (only) [13] needs 241 LD/LDD instructions, 108 ST/STD instructions, 46 PUSH instructions, and 21 POP instructions. Our modular multiplication requires 212 LD/LDD instructions, 104 ST/STD instructions, 20 PUSH instructions, and 20 POP instructions. Even though our implementation includes a reduction step, it requires fewer load/store (LD/LDD, ST/STD) instructions and fewer PUSH instructions. This is because we can effectively remove redundant memory access by using the duplicated intermediate results of the multiplication, which are generated by combining Karatsuba multiplication with reduction on RSR.
In [3], Liu et al. present two implementations of modular multiplication over NIST $P_{192}$, one using consecutive operand caching and one using the Karatsuba method. By comparison, our work is about 26% faster than the one using the consecutive operand caching method, which requires 4042 cycles. The other one applies the Karatsuba method of [13] to modular multiplication and requires 3597 cycles, which is the previous best result. Our work saves 17% of the cycles compared to that result and is even faster than the multiplication (only) in [13]. Our modular multiplication achieves the best speed record for 192-bit modular multiplication over the NIST prime $P_{192}$ on the 8-bit AVR ATmega microcontroller.
In Table 2, we also compare the performance of modular multiplications in PKCs on 8-bit AVR processors. The basic operation underlying RSA is modular exponentiation, whose complexity is determined by the size of the modulus and the exponent. The Chinese Remainder Theorem (CRT) can be utilized to reduce the size of both the modulus and the exponent. For example, the exponentiation of RSA-1024 can be decomposed into two 512-bit modular exponentiations by applying the CRT, where 512-bit modular multiplication can be used instead of 1024-bit modular multiplication to speed up the computation by a factor of four. The 512-bit modular multiplication is the most time-consuming operation in RSA-1024, where Montgomery reduction [19] is commonly used to avoid trial division by using simple shift instructions, which accelerates the reduction operation. For the comparison between RSA and ECC, we choose a 160-bit key size for the ECC system to achieve a security level comparable to RSA-1024. The 160-bit ECC implementation in [20] uses Optimal Prime Fields (OPFs), which are represented by low-weight primes. These specific primes allow the modular arithmetic to be simplified. The result for 160-bit modular multiplication differs greatly from the result for the 512-bit modular multiplication used in RSA-1024 [21]. This difference shows why ECC is the better choice for the implementation of PKCs on constrained devices. Our 192-bit modular multiplication is even faster than the 160-bit modular multiplication, which also uses the Montgomery method to perform the reduction efficiently. In our work, instead of using Montgomery reduction, we focused on merging the reduction operation into the Karatsuba multiplication so that it has two duplicated groups of intermediate results, which reduces the memory access.
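The CRT decomposition mentioned above can be illustrated with a toy example. The sketch uses deliberately tiny primes (61 and 53) instead of 512-bit ones, and the function name `rsa_crt_decrypt` is hypothetical; it only shows how one exponentiation modulo n splits into two exponentiations modulo the half-size primes.

```python
def rsa_crt_decrypt(c, p, q, d):
    """RSA private-key operation c^d mod (p*q) via two half-size powers."""
    dp, dq = d % (p - 1), d % (q - 1)     # reduced exponents
    q_inv = pow(q, -1, p)                 # modular inverse (Python 3.8+)
    m1 = pow(c, dp, p)                    # half-size exponentiation mod p
    m2 = pow(c, dq, q)                    # half-size exponentiation mod q
    h = (q_inv * (m1 - m2)) % p           # Garner recombination
    return m2 + h * q


# Toy parameters, far too small for real use.
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
c = pow(65, e, n)                         # encrypt the message 65
```

Here `rsa_crt_decrypt(c, p, q, d)` gives the same result as `pow(c, d, n)`, while every multiplication inside it works on operands of half the modulus size.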

Conclusion
Many studies focus on improving the performance of multi-precision multiplication, which is the most critical factor for an efficient ECC implementation on constrained devices. Among the various methods for multi-precision multiplication, the Karatsuba multiplication of Hutter and Schwabe [13] is considered the best choice for an efficient implementation on the 8-bit AVR ATmega family of microcontrollers. However, these studies do not thoroughly consider the reduction operation that follows multiplication, although this process introduces a significant amount of memory access for reloading the multiplication result.
In this paper, we concentrated on removing unnecessary memory access related to the accumulation of intermediate results by merging the reduction process into the multiplication. In this context, we proposed a new integer representation named Range Shifted Representation and optimized the modular multiplication over the 192-bit NIST prime $P_{192}$. Our work shows that Karatsuba multiplication with reduction on RSR generates duplicated intermediate results during accumulation, which have many advantages for an efficient implementation of modular multiplication. Careful ordering of the computation routines also saves load/store instructions. Our proposed modular multiplication surpasses the multiplication (only) in [13] and achieves a new speed record for 192-bit modular multiplication over the NIST prime $P_{192}$ on an 8-bit AVR ATmega processor.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.