Abstract
Modular multiplication is one of the most time-consuming operations, accounting for almost 80% of the computational overhead of a scalar multiplication in elliptic curve cryptography. In this paper, we present a new speed record for modular multiplication over the 192-bit NIST prime P192 on 8-bit AVR ATmega microcontrollers. We propose a new integer representation named Range Shifted Representation (RSR) which enables an efficient merging of the reduction operation into the subtractive Karatsuba multiplication. This merging results in a dramatic optimization of the intermediate accumulation of the modular multiplication by eliminating a significant amount of unnecessary memory accesses as well as addition operations. Our merged modular multiplication on RSR is designed to have two duplicated groups of 96-bit intermediate values during accumulation. Hence, only one accumulation of the group is required and the result can be used twice. Consequently, we significantly reduce the number of load/store instructions, which are known to be among the most time-consuming operations for modular multiplication on constrained devices. Our implementation requires only 2888 cycles for the modular multiplication of 192-bit integers and outperforms the previous best result for modular multiplication over P192 by 17%. In addition, our modular multiplication is even faster than the Karatsuba multiplication (without reduction) which achieved a speed record for multiplication on AVR processors.
1 Introduction
With the rapid advancement of the Internet of Things (IoT), wireless sensor networks (WSNs) are recognized as important enablers consisting of a large number of resource-constrained sensor nodes. Recently, many constrained sensor nodes have been widely used to monitor and record physical and environmental conditions such as temperature, sound, and pollution levels. Compared with traditional wired networks, it is harder to achieve security in WSNs, where sensor nodes are easily captured or eavesdropped by adversaries owing to the wireless communication environment. Such security issues naturally raise a requirement for cryptographic mechanisms in WSNs which enable secure and reliable communication. However, it is difficult to provide sufficient security in WSNs because of the many restrictions on computation capability, energy consumption, and even storage space of constrained sensor nodes. For example, the MICAz mote is widely considered representative of constrained 8-bit sensor nodes. It is equipped with an AVR ATmega128 processor which has 4 Kbytes of RAM and 128 Kbytes of programmable flash memory with a clock frequency of 7.3728 MHz. The energy consumption of cryptographic software executed on a processor is closely related to its execution time, where a faster execution time of a cryptographic algorithm usually translates into savings in energy.
In the early days, it was believed that Public-Key Cryptosystems (PKCs) were infeasible to implement on resource-constrained sensor nodes since they require a significant amount of computation. More recently, many studies have been proposed to apply PKCs for secure communication in WSNs by overcoming the restrictions of resource-constrained sensor nodes [1,2,3,4]. Elliptic curve cryptography (ECC) is considered a better choice for WSNs than conventional PKCs, such as RSA and DSA, owing to its short key length. For example, a 160-bit ECC key provides the same level of security as a 1024-bit RSA key. Such small keys allow a lower memory footprint and bandwidth consumption in WSNs. Moreover, a scalar multiplication, the most time-consuming part of all ECC-based schemes, requires only 5% to 10% of the execution time of an RSA exponentiation.
ECC-based schemes such as the Elliptic Curve Diffie–Hellman (ECDH) key exchange and the Elliptic Curve Digital Signature Algorithm (ECDSA) are composed of three levels of operations, as described in Fig. 1. The main operation of virtually all ECC-based schemes is scalar multiplication, which requires elliptic curve point arithmetic operations such as point addition and point doubling. These point arithmetic operations are in turn composed of field arithmetic operations such as multiplication, squaring, addition, and inversion. Apart from field inversion, multiplication is the most time-consuming operation, accounting for almost 80% of the computational overhead of a scalar multiplication. After a multiplication, a reduction operation must always be executed to reduce the double-sized result.
For efficient ECC implementation in resource-constrained environments, a careful design of the field arithmetic operations is required, where the most performance-critical operation is multiprecision multiplication. Hence, the majority of research on ECC implementation has focused on improving the performance of multiprecision multiplication for constrained sensor nodes.
1.1 Related work and motivation
After the first ECC implementation by Gura et al. [1], there have been a variety of approaches to optimize ECC implementation for constrained devices. Many studies have focused on improving the performance of multiprecision multiplication which is the most critical factor for an efficient implementation of scalar multiplication.
In 1994, Comba described an efficient column-wise approach to multiprecision multiplication, referred to as the product scanning method, on an Intel processor [5]. Until 2004, this method was known as the fastest multiplication with quadratic complexity on AVR processors. However, this changed for integers larger than 96 bits.
At CHES 2004 [1], Gura et al. presented the hybrid method, which combines the advantages of the conventional byte-wise multiplication techniques, namely the operand scanning and product scanning methods. The hybrid method aims at minimizing the number of load instructions on processors with a large register file by processing four bytes in each iteration of the inner loop. This significant reduction in load instructions yielded a speed improvement of up to 25% compared to the product scanning method. Their 160-bit multiplication requires 3106 clock cycles on the 8-bit ATmega128 processor. Subsequently, several authors applied this method to accelerate the scalar multiplication of ECC implementations. Most of them focused on optimizing the performance of the hybrid method and proposed variants that reported between 2593 and 2881 clock cycles on the 8-bit ATmega128 processor [6,7,8,9].
The next milestone belongs to Hutter and Wenger, who proposed the operand caching method [10]. Their technique increases the performance of multiplication by caching the operands in the general-purpose registers to reduce the number of load instructions. The operand caching method was slightly improved at WISA 2012, where Seo and Kim introduced the consecutive operand caching method [11, 12].
In 2015, the subtractive Karatsuba method was carefully revisited by Hutter and Schwabe [13]. Their improved implementation of the subtractive Karatsuba method costs only 1969 clock cycles for 160-bit operands and set the speed record for multiprecision multiplication on ATmega processors. In [3], it was also shown that the Karatsuba method is the fastest approach for modular multiplication on constrained devices.
From the point of view of implementation on constrained devices, load and store instructions have a huge influence on the performance of multiprecision multiplication. Hence, the main concern of the various multiplication methods is reducing the memory accesses for operands or intermediate accumulated results during the multiplication. Recently, the operand caching method [10] and Karatsuba multiplication [13] showed that careful scheduling of memory accesses can lead to the best performance by maximizing the use of the available registers.
Until now, the reduction has been treated as a separate part of the multiplication process. Most studies have not concentrated on optimizing the reduction operation even though it always follows multiplication and, consequently, can cause a huge memory access overhead by recalling the previous results. In this paper, we focus on finding an effective way of reducing unnecessary memory accesses by considering multiplication and reduction as a whole.
1.2 Contributions
In this paper, we propose a new method for fast modular multiplication over the 192-bit prime recommended by the US National Institute of Standards and Technology (NIST). The result of our work sets a new speed record on an 8-bit AVR ATmega processor. The following list details the contributions of our work.

We propose a new integer representation to optimize the implementation of modular multiplication using the characteristic of a modulus prime which contains the term “−1.” In this regard, we choose the 192-bit NIST standard prime, which has this characteristic and is suitable for constrained devices.

On the basis of the new integer representation, we present a novel approach to 192-bit modular multiplication over the 192-bit NIST prime for 8-bit architectures. By merging the reduction operations into the subtractive Karatsuba multiplication on the new integer representation, we optimize the intermediate accumulation in the modular multiplication. Our merged modular multiplication has two duplicated groups of 96-bit intermediate results during accumulation. Hence, only one accumulation of the group is required and the result can be used twice. Consequently, we significantly reduce the number of load/store instructions as well as the number of addition instructions.

We present the implementation results of our proposed 192-bit modular multiplication over the 192-bit NIST prime on an 8-bit AVR ATmega microcontroller. Our implementation takes only 2888 clock cycles, which is 17% faster than the previous best record for modular multiplication by Liu et al. [3]. In addition, our modular multiplication is even faster than Hutter and Schwabe's subtractive Karatsuba multiplication (without reduction) [13], which achieved a speed record for multiplication on AVR processors.
This paper is organized as follows: In Sect. 2, we give a brief introduction to ECC, including the NIST curve P192, and review various multiprecision multiplication techniques. In Sect. 3, we propose the new modular multiplication over the 192-bit NIST prime. Section 4 compares our work with previous works. Finally, we conclude the paper in Sect. 5.
2 Preliminaries
2.1 Elliptic curve cryptography
Elliptic curve cryptography was first introduced by Koblitz and Miller in 1985 [14, 15]. The security of ECC is based on the Elliptic Curve Discrete Logarithm Problem (ECDLP), and there is no general-purpose subexponential algorithm to solve the ECDLP. Let \({\mathbb {F}}_{P}\) be a finite field with odd characteristic. An elliptic curve E over \({\mathbb {F}}_{P}\) can be defined through a short Weierstraß equation of the form \(y^{2} = x^3+ax+b\), where \(a,b \in {\mathbb {F}}_{P}\) and \(4a^3+27b^2 \ne 0\). It is preferred that the curve parameter a is fixed to \(-3\) to optimize the point arithmetic in scalar multiplication.
NIST first proposed five prime-field curves in 1999 [16] for standardization. The so-called NIST curves E can be defined through a short Weierstraß equation of the form \(y^{2} = x^{3}-3x+b\), where \(b \in {\mathbb {F}}_{P}\).
From the point of view of implementation on resource-constrained devices, the NIST curve P192 is in a better position than the other NIST curves because it provides an appropriate security level and a proper computational cost on small devices [3]. This curve uses the prime field \({\mathbb {F}}_{P_{192}}\), defined by the prime \(P_{192} = 2^{192}-2^{64}-1\). This prime has the special characteristic that it can be expressed as the sum or difference of a small number of powers of 2. In addition, the powers are all multiples of 8, 16, or 32. The reduction algorithm for \({\mathbb {F}}_{P_{192}}\) is therefore especially fast and suitable for machines with a word size of 8, 16, or 32 bits. For example, the result of a multiplication can be reduced via three additions modulo \(P_{192}\) using the congruence \(2^{192} \equiv 2^{64}+1 \pmod {P_{192}}\).
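The three-addition reduction can be sketched at the 64-bit word level. The following is an illustrative Python sketch of this standard fast-reduction idea (our code, not the paper's AVR implementation; `reduce_p192` is a hypothetical name):

```python
# Sketch of the fast reduction modulo P192 = 2^192 - 2^64 - 1.
# It relies on 2^192 ≡ 2^64 + 1 (mod P192): the upper words fold back
# as shifted additions, followed by a few conditional subtractions.
P192 = 2**192 - 2**64 - 1

def reduce_p192(c):
    """Reduce a product c < P192**2 modulo P192 without division."""
    mask = 2**64 - 1
    w = [(c >> (64 * i)) & mask for i in range(6)]   # six 64-bit words
    s1 = w[0] | (w[1] << 64) | (w[2] << 128)         # low half of c
    s2 = w[3] | (w[3] << 64)                         # w3*2^192 ≡ w3*(2^64 + 1)
    s3 = (w[4] << 64) | (w[4] << 128)                # w4*2^256 ≡ w4*(2^128 + 2^64)
    s4 = w[5] | (w[5] << 64) | (w[5] << 128)         # w5*2^320 ≡ w5*(2^128 + 2^64 + 1)
    r = s1 + s2 + s3 + s4                            # three additions
    while r >= P192:                                 # at most a few subtractions
        r -= P192
    return r
```

The folding congruences for \(2^{256}\) and \(2^{320}\) follow by multiplying \(2^{192} \equiv 2^{64}+1\) by \(2^{64}\) and \(2^{128}\), respectively.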
2.2 Multiprecision multiplication techniques
In this section, we briefly review the multiprecision multiplication techniques used for fast execution on constrained devices. Throughout this section, we represent X and Y as n-word integers \(X = x_0+x_1W+\cdots +x_{n-1}W^{n-1}\) and \(Y = y_0+y_1W+\cdots +y_{n-1}W^{n-1}\), where \(W=2^8\).
2.2.1 Operand scanning method
The operand scanning method is the simplest approach to implementing multiprecision multiplication. This method is also referred to as the schoolbook or row-wise method. The multiplication consists of two parts, i.e., an inner loop and an outer loop. In the outer loop, the operand word \(x_i\) is loaded and held in a working register during the inner loop. Within the inner loop, the multiplicand words \(y_j\) are loaded one by one and the partial products are computed by multiplying them with \(x_i\). Once the inner loop is completed, the next operand word \(x_{i+1}\) is loaded and the inner loop is iterated again.
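The row-wise flow described above can be sketched in Python with 8-bit words (an illustrative model, not AVR code; the function name is ours):

```python
# Illustrative operand-scanning (row-wise) multiplication with 8-bit
# words (W = 2^8). x and y are little-endian word lists; z is the
# (2n)-word accumulator holding the running result.
def operand_scanning_mul(x, y):
    n = len(x)
    z = [0] * (2 * n)
    for i in range(n):            # outer loop: fix one word x[i]
        carry = 0
        for j in range(n):        # inner loop: sweep the words of y
            t = z[i + j] + x[i] * y[j] + carry
            z[i + j] = t & 0xFF   # low byte stays in column i+j
            carry = t >> 8        # high part propagates to the next column
        z[i + n] = carry          # final carry of this row
    return z
```

Each row i reloads all words of y, which is exactly the memory-access cost that the later methods try to avoid.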
2.2.2 Product scanning method
The product scanning method accumulates the partial products in a different way. This method computes the partial products column by column, where the intermediate results of the same column are accumulated immediately in working registers without storing and loading. Once the accumulation for a column is completed, one word of the final multiplication result is obtained. This consecutive approach makes it easy to handle carry propagation. In addition, the product scanning method is very suitable for constrained devices, since only a few registers are needed to compute the partial products and accumulation.
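The column-wise flow can be sketched analogously (an illustrative model; the multi-word accumulator plays the role of the AVR working registers):

```python
# Illustrative product-scanning (column-wise) multiplication with
# 8-bit words: output column k accumulates all x[i]*y[j] with
# i + j == k before a single result byte is written out.
def product_scanning_mul(x, y):
    n = len(x)
    z = [0] * (2 * n)
    acc = 0                        # multi-word accumulator (registers on AVR)
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k + 1, n)):
            acc += x[i] * y[k - i]
        z[k] = acc & 0xFF          # exactly one store per column
        acc >>= 8                  # carry rolls into the next column
    z[2 * n - 1] = acc & 0xFF
    return z
```

Note that intermediate results never leave the accumulator, which is why this method needs so few stores.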
2.2.3 Hybrid scanning method
Another way to compute a multiprecision multiplication is the hybrid scanning method [1], which combines the advantages of operand scanning and product scanning. The hybrid scanning method consists of two nested loop structures, where the inner loop follows the operand scanning method and the outer loop accumulates the results of the inner loop, similar to the product scanning method. The outer loop can be implemented by processing the inner loop as a sequence of partial-product blocks. This method reduces the number of load instructions by sharing the operands within a block. To maximize the number of shared operands, full use is made of the available registers. However, since the outer loop follows a column-wise approach, there is no shared operand between two consecutive blocks. Hence, all operands need to be reloaded.
2.2.4 Operand caching method
In [10], Hutter and Wenger proposed the operand caching method. This method is based on the product scanning method, but it separates the computation into several rows. Each row can be further divided into four parts. In the first part, all operands for the first and second part are loaded. In the second part, all operands are kept constant and reused; only one word of the multiplicand is loaded between two consecutive columns. The third part follows the opposite process: all multiplicands are kept constant and reused, and only one word of the operand is loaded for each column. In the last part, no loading of operands is required, since the working registers already hold them. Reusing operands already loaded in a previous part is an efficient way to remove a significant number of load operations within a row. But whenever a row is changed, the operands must be reloaded, since there is no shared operand between rows. To overcome this disadvantage, Seo and Kim proposed the consecutive operand caching method [11, 12], which reschedules the rows in order to share the operands when a row is changed.
2.2.5 Subtractive Karatsuba method
In the early 1960s, Karatsuba proposed a notable multiplication technique with subquadratic complexity [17]. The Karatsuba method effectively reduces a multiplication of two n-word operands to three multiplications of two \(k(=n/2)\)-word operands. Any multiplication method mentioned above can be applied to compute the reduced half-size multiplications. In [13], Hutter and Schwabe highly optimized the implementation of the subtractive Karatsuba method for various operand sizes on AVR processors. We can explain the subtractive Karatsuba multiplication on the 8-bit platform as follows:
Let \(X = X_{A}+X_{B}\cdot W^k\) and \(Y = Y_{A}+Y_{B}\cdot W^k\). Then, with \(L = X_{A}\cdot Y_{A}\), \(H = X_{B}\cdot Y_{B}\), and \(M = (X_{A}-X_{B})\cdot (Y_{A}-Y_{B})\), we can compute \(X\cdot Y\) as
$$X\cdot Y = L + (L + H - M)\cdot W^{k} + H\cdot W^{2k}.$$
In the subtractive variant, M is computed as \(|X_{A}-X_{B}|\cdot |Y_{A}-Y_{B}|\) together with its sign, so that no negative intermediate products arise.
The main idea of the optimization technique in [13] is to reduce memory accesses by exploiting the duplicated computation of \(L_{B} + H_{A}\), which occurs twice in the expansion of \(X\cdot Y\). In addition, this trick saves k addition operations. The subtractive Karatsuba method of [13] shows the best performance for multiprecision multiplication on an 8-bit processor.
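One level of the subtractive Karatsuba recombination can be sketched as follows (illustrative Python, not the optimized AVR code of [13]; the function name is ours):

```python
# Sketch of one level of subtractive Karatsuba for operands split at
# bit k: three half-size multiplications L, H, M and the recombination
# X*Y = L + (L + H - t*M)*W^k + H*W^(2k), with t the sign of
# (X_A - X_B)*(Y_A - Y_B).
def subtractive_karatsuba(x, y, k):
    W = 1 << k
    xa, xb = x % W, x >> k         # X = X_A + X_B * W
    ya, yb = y % W, y >> k
    L = xa * ya                    # low product
    H = xb * yb                    # high product
    M = abs(xa - xb) * abs(ya - yb)   # absolute differences: never negative
    t = 1 if (xa >= xb) == (ya >= yb) else -1
    # in the word-level expansion, the shared sum L_B + H_A occurs in
    # both the W^k and W^(2k) coefficients and is computed only once
    return L + (L + H - t * M) * W + H * W * W
```

Using absolute differences keeps all three half-size products non-negative, which is what makes the variant convenient on an 8-bit machine.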
3 Proposed modular multiplication
3.1 Range shifted representation
Generally, we can represent 192-bit integers X, Y and their product \(Z = X\cdot Y\) with an 8-bit word size \((W=2^8)\) as
$$X = \sum_{i=0}^{23} x_{i}W^{i},\qquad Y = \sum_{i=0}^{23} y_{i}W^{i},\qquad Z = \sum_{i=0}^{47} z_{i}W^{i},$$
where \(x_i,y_i,z_i \in [0,2^8-1]\).
For simplicity, we can rewrite Z as \(Z = Z_{A} + Z_{B}\cdot W^{24}\), where \(Z_{A} = \sum_{i=0}^{23} z_{i}W^{i}\) and \(Z_{B} = \sum_{i=0}^{23} z_{24+i}W^{i}\), as presented in (10).
For modular reduction, the NIST prime \(P_{192} = 2^{192}-2^{64}-1\) can be used. We can use the equation \(W^{24} \equiv W^{8}+1 \pmod {P_{192}}\) for the modulo \(P_{192}\) reduction. Then, we have \(Z \equiv Z_{A} + Z_{B}\cdot W^{8} + Z_{B} \pmod {P_{192}}\).
This is not a complete reduction. We still need to reduce the part \((z_{40}W^{24}+z_{41}W^{25}+\cdots +z_{47}W^{31})\) of \(Z_{B}\cdot W^{8}\) that is not in the range of a 192-bit element. Here we omit the complete reduction step for simplicity.
In the following, we propose a new representation for 192-bit integers which ranges from \(-2^{96}\) to \(2^{96}-1\). We call it the Range Shifted Representation (RSR). We can represent 192-bit integers X, Y and their product \(Z = X\cdot Y\) with RSR as
$$X = \sum_{i=0}^{23} x_{i}W^{i-12},\qquad Y = \sum_{i=0}^{23} y_{i}W^{i-12},\qquad Z = \sum_{i=0}^{47} z_{i}W^{i-24},$$
where \(x_i,y_i,z_i \in [0,2^8-1]\). An interesting property of RSR is that the result of a multiplication expands to both sides: the shape of the result is symmetric with respect to \(W^0\). Because we want to represent integers in the range \([-2^{96},2^{96}-1]\), we have to transform \(P_{192}\) into range-shifted form for modular reduction. We can use the range-shifted prime \(P_{192}\cdot 2^{-96} = 2^{96}-2^{-32}-2^{-96}\) for modular reduction. We have to reduce the result on both sides, such that \(z_0W^{-24}+z_1W^{-23}+\cdots +z_{11}W^{-13}\) and \(z_{36}W^{12}+z_{37}W^{13}+\cdots +z_{47}W^{23}\) are reduced modulo \(P_{192}\cdot 2^{-96}\). Let \(X,Y,Z \in {\mathbb {F}}_{P_{192}}\) be represented with RSR, where \(Z = X\cdot Y\). Then, we can reduce Z using the equations \(W^{12} \equiv W^{-12}+W^{-4}\) and \(W^{-24} \equiv 1-W^{-16} \pmod {P_{192}\cdot W^{-12}}\).
Let \(Z_{A} = z_0+z_1W+\cdots +z_{11}W^{11}\) and \(Z_{B} = z_{36}+z_{37}W+\cdots +z_{47}W^{11}\), where \(z_i \in [0,2^8-1]\), so that \(Z_{A}\cdot W^{-24}\) and \(Z_{B}\cdot W^{12}\) are the parts of Z lying outside the range of RSR. Then, we can reduce Z as
$$Z \equiv \sum_{i=12}^{35} z_{i}W^{i-24} + Z_{A}\cdot (1-W^{-16}) + Z_{B}\cdot (W^{-12}+W^{-4}) \pmod {P_{192}\cdot W^{-12}}.$$
Note that, for a complete reduction, we still need to reduce the part \((-z_0-z_1W-z_{2}W^{2}-z_{3}W^{3})\cdot W^{-16}\) of \(-Z_{A}\cdot W^{-16}\) that is not in the range of RSR. Here we omit the complete reduction step for simplicity.
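The two congruences \(W^{12} \equiv W^{-12}+W^{-4}\) and \(W^{-24} \equiv 1-W^{-16}\) (modulo the range-shifted prime) can be sanity-checked as identities in \({\mathbb {F}}_{P_{192}}\) by mapping each power \(W^{e}\) to \(2^{8e} \bmod P_{192}\), with negative exponents handled via modular inverses. A small Python check (our verification, not code from the paper):

```python
# Verify W^12 ≡ W^-12 + W^-4 and W^-24 ≡ 1 - W^-16 as identities in
# F_P192, mapping W^e -> 2^(8e) mod P192 (W = 2^8).
P192 = 2**192 - 2**64 - 1

def w(e):
    # 3-argument pow with a negative exponent (Python 3.8+) computes
    # the modular inverse automatically
    return pow(2, 8 * e, P192)

assert w(12) == (w(-12) + w(-4)) % P192   # W^12 ≡ W^-12 + W^-4
assert w(-24) == (1 - w(-16)) % P192      # W^-24 ≡ 1 - W^-16
```

Both identities hold because the differences are exactly \(\pm 2^{-96}\cdot P_{192}\) and \(\pm 2^{-192}\cdot P_{192}\), i.e., multiples of \(P_{192}\) in the field.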
To utilize RSR in an elliptic curve protocol like an ECDH or ECDSA scheme, conversions from the original integer representation to RSR and vice versa are required. For example, let X, Y be the coordinates of the input point for a scalar multiplication; then a conversion from \(X,Y \in [0,2^{192}-1]\) in Eqs. (5, 6) to \(X,Y \in [-2^{96},2^{96}-1]\) in Eqs. (12, 13) is required before performing the scalar multiplication. This conversion can simply be done by applying modulo \(P_{192}\cdot W^{-12}\) to each coordinate. For the output of the scalar multiplication, a conversion from RSR back to the original integer representation is required. However, compared to the computational cost of a scalar multiplication, these conversions require negligible cycle counts and are needed only once. Regarding the computation process of the other field arithmetic operations on RSR, such as addition, subtraction, multiplication, and squaring, it is the same as on the original representation, except that \(P_{192}\cdot W^{-12}\) is used for reduction.
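At the integer level, these conversions can be viewed as entering and leaving a Montgomery-style domain with \(R = 2^{96}\): the byte string of an RSR element corresponds to \(X\cdot 2^{96} \bmod P_{192}\), and an RSR modular multiplication carries an implicit factor \(2^{-96}\). This is our reading of the representation, sketched below; it is not code from the paper, and the function names are ours:

```python
# Hedged sketch: RSR conversions viewed at the integer level as a
# Montgomery-style domain with R = 2^96 (our interpretation, not the
# paper's AVR code).
P192 = 2**192 - 2**64 - 1
R = 2**96

def to_rsr(x):
    return x * R % P192                    # enter the shifted domain

def from_rsr(xr):
    return xr * pow(R, -1, P192) % P192    # leave the shifted domain

def rsr_mul(xr, yr):
    # the reduction modulo P192 * 2^-96 implicitly divides by R, so the
    # product of two domain elements stays in the domain
    return xr * yr * pow(R, -1, P192) % P192
```

Because `rsr_mul(to_rsr(x), to_rsr(y)) == to_rsr(x*y % P192)`, an entire scalar multiplication can run inside the domain with a single conversion at each end, matching the paper's claim that the conversion cost is negligible.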
3.2 Modular multiplication with RSR
We can use the Karatsuba method for multiplication with RSR. Let \(X,Y \in {\mathbb {F}}_{P_{192}}\) be represented with RSR and \(Z = X\cdot Y\). Let
$$X_{A} = \sum_{i=0}^{11} x_{i}W^{i},\quad X_{B} = \sum_{i=0}^{11} x_{12+i}W^{i},\quad Y_{A} = \sum_{i=0}^{11} y_{i}W^{i},\quad Y_{B} = \sum_{i=0}^{11} y_{12+i}W^{i},$$
where \(x_i,y_i \in [0,2^8-1]\). Then, X, Y, Z can be represented as
$$X = (X_{A}+X_{B}\cdot W^{12})\cdot W^{-12},\qquad Y = (Y_{A}+Y_{B}\cdot W^{12})\cdot W^{-12},\qquad Z = X\cdot Y.$$
Let low (L), high (H), and middle (M) denote \(X_{A}\cdot Y_{A}\), \(X_{B}\cdot Y_{B}\), and \((X_{A}-X_{B})\cdot (Y_{A}-Y_{B})\), split into 12-word halves as
$$L = L_{A}+L_{B}\cdot W^{12},\qquad H = H_{A}+H_{B}\cdot W^{12},\qquad M = M_{A}+M_{B}\cdot W^{12}.$$
We can then simply denote Z by L, H, and M as
$$Z = L\cdot W^{-24} + (L+H-M)\cdot W^{-12} + H.$$
Then, the result of the Karatsuba multiplication can be reduced modulo \(P_{192}\cdot W^{-12}\). We do not need to reduce all parts of the result. Because \((L+H-M)\cdot W^{-12}\) of Eq. (30) just fits in the 192-bit range of RSR, we need to reduce only the two parts \(L_{A}\cdot W^{-24}\) and \(H_{B}\cdot W^{12}\), which overflow on both sides of the RSR range. We can compute Z modulo \(P_{192}\cdot W^{-12}\) using the equations \(W^{12} \equiv W^{-12}+W^{-4}\) and \(W^{-24} \equiv 1-W^{-16} \pmod {P_{192}\cdot W^{-12}}\) as follows:
$$Z \equiv (L_{A}+L_{B}+H_{A}+H_{B}-M_{A})\cdot W^{-12} + (L_{A}+L_{B}+H_{A}+H_{B}-M_{B}) - L_{A}\cdot W^{-16} + H_{B}\cdot W^{-4}.$$
The interesting thing about the above equations is that the group \((L_{A} + L_{B} + H_{A} + H_{B})\) appears exactly twice. We can exploit this duplicated intermediate result to reduce memory accesses and accumulation operations for an efficient implementation of modular multiplication.
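The merged Karatsuba-plus-reduction identity can be checked numerically by interpreting each power \(W^{e}\) as \(2^{8e} \bmod P_{192}\). The sketch below is our reconstruction of the reduced form (terms may be grouped differently in the paper's numbered equations); the reduced product must equal \(X\cdot Y\cdot 2^{-192} \bmod P_{192}\), and the duplicated group is visible as a single shared sum:

```python
# Numerical check of the merged Karatsuba + RSR reduction (our
# reconstruction): mapping W^e -> 2^(8e) mod P192, the reduced product
# equals X*Y*2^-192 mod P192, and (L_A + L_B + H_A + H_B) is used twice.
P192 = 2**192 - 2**64 - 1

def inv(v):
    return pow(v, -1, P192)

def merged_reduce(x, y):
    k = 2**96
    xa, xb = x % k, x // k
    ya, yb = y % k, y // k
    L, H = xa * ya, xb * yb
    M = (xa - xb) * (ya - yb)          # may be negative; Python handles it
    LA, LB = L % k, L // k
    HA, HB = H % k, H // k
    MA, MB = M % k, M // k             # floor split still satisfies M = MA + MB*k
    dup = LA + LB + HA + HB            # the duplicated group: computed once
    z = ((dup - MA) * inv(k)           # (dup - M_A) * W^-12  ... first use
         + (dup - MB)                  # (dup - M_B)          ... second use
         - LA * inv(2**128)            # - L_A * W^-16
         + HB * inv(2**32))            # + H_B * W^-4
    return z % P192
```

The assertion that this equals \(X\cdot Y\cdot 2^{-192} \bmod P_{192}\) reflects the implicit \(2^{-96}\) factor each RSR operand carries.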
3.3 Implementation of modular multiplication with RSR
We used 2-level Karatsuba recursion for the implementation of the 192-bit multiplication, which is composed of the three 96-bit 1-level Karatsuba multiplications L, H, and M, as represented in Eqs. (27), (28) and (29). Let \(L^{(1)},H^{(1)}\) and \(M^{(1)}\) denote the 48-bit small multiprecision multiplications within a 96-bit 1-level Karatsuba multiplication. Similarly, let \(L^{(2)},H^{(2)}\) and \(M^{(2)}\) denote the three 96-bit 1-level Karatsuba multiplications L, H, and M of the 192-bit 2-level Karatsuba multiplication.
3.3.1 96-bit 1-level Karatsuba multiplication
The implementation of the 96-bit 1-level Karatsuba multiplications \(L^{(2)},H^{(2)}\) and \(M^{(2)}\) basically follows the same scheduling as the 96-bit multiplication in [13]. Algorithm 1 is a basic implementation of the 96-bit 1-level Karatsuba multiplication presented in [13]. It is composed of three 48-bit small multiprecision multiplications \(L^{(1)},H^{(1)}\) and \(M^{(1)}\) that do not include any load or store instructions, and the result is kept in 11 registers.
Let
$$L^{(1)} = L^{(1)}_{A}+L^{(1)}_{B}\cdot W^{6},\qquad H^{(1)} = H^{(1)}_{A}+H^{(1)}_{B}\cdot W^{6},\qquad M^{(1)} = M^{(1)}_{A}+M^{(1)}_{B}\cdot W^{6},$$
where \(L^{(1)}_A, L^{(1)}_B, H^{(1)}_A, H^{(1)}_B, M^{(1)}_A\) and \(M^{(1)}_B\) are 6-byte integers. As described in Algorithm 1, we can obtain the result of a 96-bit 1-level Karatsuba multiplication \(L^{(2)},H^{(2)}\), or \(M^{(2)}\) through the computation of \(L^{(1)}+(L^{(1)}+H^{(1)}-M^{(1)})\cdot W^{6}+H^{(1)}\cdot W^{12}\). We can express this computation in detail as
$$L^{(1)}_{A} + (L^{(1)}_{A}+L^{(1)}_{B}+H^{(1)}_{A}-M^{(1)}_{A})\cdot W^{6} + (L^{(1)}_{B}+H^{(1)}_{A}+H^{(1)}_{B}-M^{(1)}_{B})\cdot W^{12} + H^{(1)}_{B}\cdot W^{18}.$$
In Eq. (35), the computation of \(L^{(1)}_B+H^{(1)}_A\) appears twice. This duplicated computation is exploited in Algorithm 1 to minimize register allocation and to avoid additional load and store instructions during accumulation. Suppose the result of \(L^{(1)}_B+H^{(1)}_A\) computed in Step 5 were not reused at Step 9; then \(L^{(1)}_B\) and \(H^{(1)}_A\) would have to be held in registers before the computation in Step 5 and saved to memory with store instructions, and at Step 9 they would have to be loaded again to recompute \(L^{(1)}_B+H^{(1)}_A\). In Algorithm 1, however, these store/load instructions for \(L^{(1)}_B\) and \(H^{(1)}_A\) are unnecessary; only the result of \(L^{(1)}_B+H^{(1)}_A\) needs to be kept in registers for reuse at Step 9. Furthermore, six addition instructions for \(L^{(1)}_B+H^{(1)}_A\) are saved.
3.3.2 Modified 96-bit 1-level Karatsuba multiplication for \(L^{(2)}\)
We can represent \(L^{(2)}\) as
$$L^{(2)} = L^{(2)}_{A}+L^{(2)}_{B}\cdot W^{12},$$
where \(L^{(2)}_{A}, L^{(2)}_{B}\) are 12-byte integers. We want to compute \(L^{(2)}_{A} + L^{(2)}_{B}\) during the 96-bit 1-level Karatsuba multiplication \(L^{(2)} = X_{A}\cdot Y_{A}\) and reload it later to build the complete duplicated intermediate result \((L^{(2)}_{A} + L^{(2)}_{B} + H^{(2)}_{A} + H^{(2)}_{B})\) in the 192-bit 2-level Karatsuba multiplication with reduction. Through this process, we avoid redundant memory accesses for \(L^{(2)}_{A}\) and \(L^{(2)}_{B}\) in the 2-level Karatsuba multiplication. In Algorithm 2, we modify the 96-bit 1-level Karatsuba multiplication for \(L^{(2)}\) by inserting the computation of \(L^{(2)}_{A} + L^{(2)}_{B}\) into Algorithm 1.
We can represent \(L^{(2)}\) by the 48-bit small multiprecision multiplications \(L^{(1)},H^{(1)}\) and \(M^{(1)}\) as
$$L^{(2)} = L^{(1)}+(L^{(1)}+H^{(1)}-M^{(1)})\cdot W^{6}+H^{(1)}\cdot W^{12}.$$
Then
$$L^{(2)} = L^{(1)}_{A} + (L^{(1)}_{A}+L^{(1)}_{B}+H^{(1)}_{A}-M^{(1)}_{A})\cdot W^{6} + (L^{(1)}_{B}+H^{(1)}_{A}+H^{(1)}_{B}-M^{(1)}_{B})\cdot W^{12} + (H^{(1)}_{B}+c)\cdot W^{18},$$
where c is a 1-byte carry. We can represent \(L^{(2)}_{A}\) and \(L^{(2)}_{B}\) as the lower and upper 12-byte halves of this expression.
We can get \(L^{(2)}_{A}\) easily by taking only the lower 12 bytes, without the carry byte c, from Eq. (38). \(L^{(2)}_{A} + L^{(2)}_{B}\) can then be represented as
$$L^{(2)}_{A}+L^{(2)}_{B} = (L^{(1)}_{A}+L^{(1)}_{B}+H^{(1)}_{A}+H^{(1)}_{B}-M^{(1)}_{B}) + (L^{(1)}_{A}+L^{(1)}_{B}+H^{(1)}_{A}+H^{(1)}_{B}-M^{(1)}_{A}+c)\cdot W^{6}.$$
To compute \(L^{(2)}_{A} + L^{(2)}_{B}\), we first compute \((L^{(1)}_{A}+L^{(1)}_{B}+H^{(1)}_{A}-M^{(1)}_{A})\cdot W^{6}+ (L^{(1)}_{A}+L^{(1)}_{B}+H^{(1)}_{A}+H^{(1)}_{B}-M^{(1)}_{B})\cdot W^{12} + H^{(1)}_{B}\cdot W^{18}\). Then, the upper 6 bytes of the first term, which are \((H^{(1)}_{B}+c)\), are added to the upper 6 bytes of \(L^{(2)}_{A}\).
In Algorithm 2, \(L^{(1)}_{A}\) is added to \(L^{(1)}_{B}\) at Step 3. In Step 6, \((L^{(1)}_{A}+L^{(1)}_{B}+H^{(1)}_{A})\) is computed, and a carry \(c'\), which is different from the c of (38), is propagated through \((\bar{h}_6,\ldots ,\bar{h}_{11})\). In Step 9, \((\bar{h}_0,\ldots ,\bar{h}_{5})\) is copied to duplicate the partial result \((L^{(1)}_{A}+L^{(1)}_{B}+H^{(1)}_{A})\) of Eq. (41), yielding \((\bar{h}_0,\ldots ,\bar{h}_{5},\bar{h}_0,\ldots ,\bar{h}_{5})\). To the right half of it, \((H^{(1)}_{B}+c')\) is added. In Step 10, \(M^{(1)}\) is subtracted. \((L^{(1)}_{A}+L^{(1)}_{B}+H^{(1)}_{A}-M^{(1)}_{A}) = (t_0,\ldots ,t_{5})\) is added to \((H^{(1)}_{B}+c)\) in Step 13. Then, we can store \(L^{(2)}_{A} + L^{(2)}_{B}\) in \((z_{12},\ldots ,z_{23},carry)\). In comparison with Algorithm 1, through this process we save 6 load instructions for \(L^{(1)}_{A}\) and additionally obtain \(L^{(2)}_{A} + L^{(2)}_{B}\) in Algorithm 2.
3.3.3 192-bit 2-level Karatsuba multiplication with reduction
We combined the Karatsuba multiplication with the reduction on RSR to generate more duplicated intermediate results. Graphical illustrations of the 192-bit 2-level Karatsuba multiplication with reduction on RSR are shown in Fig. 2. Figure 2a shows that \(L^{(2)}_{A} = l_0+\cdots +l_{11}W^{11}\) and \(H^{(2)}_{B} = h_{12}W^{12}+\cdots +h_{23}W^{23}\) need to be reduced for modular reduction. Figure 2b shows the results of reducing \(L^{(2)}_{A}\) and \(H^{(2)}_{B}\) modulo \(P_{192}\cdot W^{-12}\). Now, we can visualize which parts are accumulated to compute the final result of Eq. (31). As mentioned earlier, \((L^{(2)}_{A} + L^{(2)}_{B} + H^{(2)}_{A} + H^{(2)}_{B})\) is duplicated, so we can use it to reduce memory accesses and optimize register usage by inserting the accumulated value of the duplicated intermediate results into the Karatsuba multiplication with reduction.
Algorithm 3 shows the implementation of the 192-bit\(\times\)192-bit 2-level Karatsuba multiplication with reduction over \({\mathbb {F}}_{P_{192}\cdot W^{-12}}\). For computing \((L^{(2)}_{A} + L^{(2)}_{B} + H^{(2)}_{A} + H^{(2)}_{B})\), \(L^{(2)}_{A} + L^{(2)}_{B}\) is first computed and saved during the evaluation of \(L^{(2)}\) through Algorithm 2. After the multiplication \(X_{B}\cdot Y_{B}\) in Step 4, we obtain the result \(H^{(2)} = H^{(2)}_{A} + H^{(2)}_{B}\cdot W^{12}\) and compute \(H^{(2)}_{A} + H^{(2)}_{B}\) in Step 5. In the next step, we load \(L^{(2)}_{A} + L^{(2)}_{B}\) and accumulate it into \(H^{(2)}_{A} + H^{(2)}_{B}\). The accumulated result requires an additional register for a carry byte. Therefore, we can hold the complete duplicated intermediate result \((L_{A} + L_{B} + H_{A} + H_{B})\) in 13 registers, represented by \((T,carry_2) = (t_0,\ldots ,t_{11},carry_2)\). In Step 7, we can represent the other half of the intermediate result in Fig. 2b by simply copying T of the duplicated intermediate results without \(carry_2\). This is a very efficient way to decrease the number of load and store operations for previous computation results. Moreover, the number of addition operations is reduced. These advantages save clock cycles significantly. In Step 10, \(carry_2\) is added for the complete accumulation.
Because we cannot always hold the full 192-bit result of a 1-level Karatsuba multiplication in registers, careful handling of the 32 registers is required to minimize memory accesses between the 96-bit Karatsuba multiplications \(L^{(2)}, H^{(2)}\), and \(M^{(2)}\). We reordered the computation from \(L^{(2)}\rightarrow H^{(2)}\rightarrow M^{(2)}\) in [13] to \(M^{(2)}\rightarrow L^{(2)}\rightarrow H^{(2)}\). Since \(H^{(2)}_{B}\) is kept in registers after Step 4, we can directly reduce \(H^{(2)}_{B}\) without any memory access at Step 7. This generates \(carry_3\), into which the carries from Steps 8, 9, and 10 are accumulated so that all carries can be reduced together at Step 11.
4 Result
In this section, we present the implementation results of our 192-bit modular multiplication on the 8-bit AVR ATmega128 processor, providing the execution time (cycle counts). The timings of our work were obtained by simulation with Atmel Studio 7.0. We refer to the cycle counts reported in [18] for the comparison with the various multiplications.
Table 1 shows the execution times of previous works for 192-bit multiplication (only) and 192-bit modular multiplication over the NIST prime \(P_{192}\). The results for multiplication cover various methods, including operand scanning, product scanning, hybrid scanning, operand caching, consecutive operand caching, and the Karatsuba method. Among them, the implementation of the Karatsuba method by Hutter and Schwabe [13] set the speed record for 192-bit multiplication. In [3], it was also verified that modular multiplication using the Karatsuba method achieves better performance than the other methods for 192-bit modular multiplication over NIST \(P_{192}\).
The Karatsuba multiplication (only) of [13] needs 241 LD/LDD instructions, 108 ST/STD instructions, 46 PUSH instructions, and 21 POP instructions. Our modular multiplication requires 212 LD/LDD, 104 ST/STD, 20 PUSH, and 20 POP instructions. Even though our implementation includes the reduction step, it requires fewer LD/LDD and ST/STD instructions as well as fewer PUSH instructions. This is because we can effectively eliminate redundant memory accesses using the duplicated intermediate results of the multiplication, which are generated by combining the Karatsuba multiplication with the reduction on RSR.
In [3], Liu et al. present two implementations of modular multiplication over NIST \(P_{192}\), using the consecutive operand caching method and the Karatsuba method. By comparison, our work is about 26% faster than the one using the consecutive operand caching method, which requires 4042 cycles. The other applies the Karatsuba method of [13] for modular multiplication and requires 3597 cycles, which was the previous best result. Our work saves 17% of the cycles compared with that result and is even faster than the multiplication (only) in [13]. Our modular multiplication thus achieves a new speed record for 192-bit modular multiplication over the NIST prime \(P_{192}\) on the 8-bit AVR ATmega microcontroller.
In Table 2, we also compare the performance of modular multiplications in PKCs on 8-bit AVR processors. The basic operation underlying RSA is modular exponentiation, whose complexity is determined by the size of the modulus and the exponent. The Chinese Remainder Theorem (CRT) can be utilized to reduce the size of both the modulus and the exponent. For example, the exponentiation of RSA-1024 can be decomposed into two 512-bit modular exponentiations by applying CRT, so that 512-bit modular multiplications can be used instead of 1024-bit ones for a speedup by a factor of four. The 512-bit modular multiplication is the most time-consuming operation in RSA-1024, where Montgomery reduction [19] is commonly used to avoid trial division by relying on simple shift instructions, which accelerates the reduction operation. For the comparison between RSA and ECC, we choose a 160-bit ECC key size to achieve a security level comparable to RSA-1024. The 160-bit ECC implementation in [20] uses Optimal Prime Fields (OPFs), which are defined by low-weight primes. These specific primes allow for a simplification of the modular arithmetic. The result for 160-bit modular multiplication differs greatly from the result for the 512-bit modular multiplication used in RSA-1024 [21]. This difference shows why ECC is the better choice for implementing PKCs on constrained devices. Our 192-bit modular multiplication is even faster than the 160-bit modular multiplication, which also uses the Montgomery method to perform reduction efficiently. In our work, instead of using Montgomery reduction, we focused on merging the reduction operation into the Karatsuba multiplication, yielding two duplicated groups of intermediate results and thereby reducing memory accesses.
5 Conclusion
Many studies focus on improving the performance of multi-precision multiplication, which is the most critical factor for an efficient ECC implementation on constrained devices. Among the various methods for multi-precision multiplication, the Karatsuba multiplication of Hutter and Schwabe in [13] is considered the best choice for an efficient implementation on the 8-bit AVR ATmega family of microcontrollers. However, these studies do not thoroughly consider the reduction operation that follows the multiplication, although this step introduces a significant amount of memory access to reload the multiplication result.
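For context, the conventional separate reduction step for \(P_{192} = 2^{192} - 2^{64} - 1\) folds the upper half of the 384-bit product back into the lower half using \(2^{192} \equiv 2^{64} + 1 \pmod{P_{192}}\). A Python sketch of this standard routine (the unmerged approach, not the RSR-based merging proposed in this paper) is:

```python
P192 = 2**192 - 2**64 - 1

def reduce_p192(c):
    """Standard fast reduction mod P-192 for a product c < P192**2.

    Splits c into six 64-bit words c0..c5 and uses
    2^192 = 2^64 + 1, 2^256 = 2^128 + 2^64, 2^320 = 2^128 + 2^64 + 1
    (all mod P192) to fold the top three words down.
    """
    w = [(c >> (64 * i)) & ((1 << 64) - 1) for i in range(6)]
    s1 = w[0] | (w[1] << 64) | (w[2] << 128)   # (c2, c1, c0)
    s2 = w[3] | (w[3] << 64)                   # ( 0, c3, c3)
    s3 = (w[4] << 64) | (w[4] << 128)          # (c4, c4,  0)
    s4 = w[5] | (w[5] << 64) | (w[5] << 128)   # (c5, c5, c5)
    t = s1 + s2 + s3 + s4
    while t >= P192:                           # at most a few subtractions
        t -= P192
    return t
```

Accumulating \(s_1,\dots,s_4\) after the fact requires reloading words of the completed product from memory, which is exactly the overhead that merging the reduction into the multiplication aims to remove.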
In this paper, we concentrated on reducing unnecessary memory access related to the accumulation of intermediate results by merging the reduction process into the multiplication. In this context, we proposed a new integer representation named Range Shifted Representation (RSR) and optimized the modular multiplication over the 192-bit NIST prime \(P_{192}\). Our work shows that Karatsuba multiplication with reduction on RSR generates duplicated intermediate results during accumulation, which offers many advantages for an efficient implementation of modular multiplication. Careful ordering of the computation routines also saves load/store instructions. Our proposed modular multiplication surpasses the multiplication (only) in [13] and achieves a new speed record for 192-bit modular multiplication over the NIST prime \(P_{192}\) on an 8-bit AVR ATmega processor.
References
Gura N, Patel A, Wander A, Eberle H, Shantz SC (2004) Comparing elliptic curve cryptography and RSA on 8-bit CPUs. In: Joye M, Quisquater JJ (eds) Cryptographic hardware and embedded systems (lecture notes in computer science), vol 3156. Springer, Berlin, pp 119–132
Liu A, Ning P (2008) TinyECC: a configurable library for elliptic curve cryptography in wireless sensor networks. In: Proceedings of the 7th International Conference on Information Processing in Sensor Networks (IPSN), pp 245–256
Liu Z, Seo H, Großschädl J, Kim H (2016) Efficient implementation of NIST-compliant elliptic curve cryptography for 8-bit AVR-based sensor nodes. IEEE Trans Inf Forensics Secur 11(7):1385–1397
Seo SC, Seo H (2018) Highly efficient implementation of NIST-compliant Koblitz curve for 8-bit AVR-based sensor nodes. IEEE Access 6:67637–67652
Comba PG (1990) Exponentiation cryptosystems on the IBM PC. IBM Syst J 29(4):526–538
Scott M, Szczechowiak P (2007) Optimizing multi-precision multiplication for public key cryptography. Cryptology ePrint archive, report 2007/299
Szczechowiak P, Oliveira LB, Scott M, Collier M, Dahab R (2008) NanoECC: testing the limits of elliptic curve cryptography in sensor networks. In: Proceedings of the European Conference on Wireless Sensor Networks (EWSN'08). Springer, Berlin, pp 305–320
Uhsadel L, Poschmann A, Paar C (2007) Enabling full-size public-key algorithms on 8-bit sensor nodes. In: Proceedings of the International Conference on Security and Privacy in Ad-Hoc and Sensor Networks (ESAS'07). Springer, Berlin, pp 73–86
Yang Z, Johann G (2011) Efficient prime-field arithmetic for elliptic curve cryptography on wireless sensor nodes. In: Proceedings of the International Conference on Computer Science and Network Technology, pp 459–466
Hutter M, Wenger E (2011) Fast multi-precision multiplication for public-key cryptography on embedded microprocessors. In: Preneel B, Takagi T (eds) Cryptographic hardware and embedded systems (lecture notes in computer science), vol 6917. Springer, Berlin, pp 459–474
Seo H, Kim H (2012) Multi-precision multiplication for public-key cryptography on embedded microprocessors. In: Lee DH, Yung M (eds) Information security applications, vol 7690. Lecture notes in computer science. Springer, Berlin, pp 55–67
Seo H, Kim H (2013) Optimized multi-precision multiplication for public-key cryptography on embedded microprocessors. Int J Comput Commun Eng 2(3):255
Hutter M, Schwabe P (2015) Multiprecision multiplication on AVR revisited. J Cryptogr Eng 5(3):201–214
Miller VS (1985) Use of elliptic curves in cryptography. In: Proceedings of the Conference on the Theory and Application of Cryptographic Techniques, Santa Barbara, CA, USA. Springer, Berlin, pp 417–426
Koblitz N (1987) Elliptic curve cryptosystems. Math Comput 48(177):203–209
National Institute of Standards and Technology (1999) Recommended elliptic curves for federal government use. http://csrc.nist.gov/encryption/dss/ecdsa/NISTReCur.pdf
Karatsuba AA, Ofman YP (1963) Multiplication of multi-digit numbers on automata. Sov Phys Dokl 7(7):595–596
Liu Z, Seo H, Kim H (2016) A synthesis of multi-precision multiplication and squaring techniques for 8-bit sensor nodes: state-of-the-art research and future challenges. J Comput Sci Technol 31(2):284–299
Montgomery PL (1985) Modular multiplication without trial division. Math Comput 44(170):519–521
Liu Z, Großschädl J, Wong DS (2014) Low-weight primes for lightweight elliptic curve cryptography on 8-bit AVR processors. In: Information Security and Cryptology—INSCRYPT 2013. LNCS. Springer, Berlin
Liu Z, Großschädl J, Kizhvatov I (2010) Efficient and side-channel resistant RSA implementation for 8-bit AVR microcontrollers. In: Workshop on the Security of the Internet of Things—SOCIOT 2010, 1st International Workshop, Tokyo, Japan, November 29. IEEE Computer Society, Los Alamitos
Acknowledgements
This work was supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00033, Study on Quantum Security Evaluation of Cryptography based on Computational Quantum Complexity).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Park, D.-w., Hong, S., Chang, N.-S. et al. Efficient implementation of modular multiplication over 192-bit NIST prime for 8-bit AVR-based sensor node. J Supercomput 77, 4852–4870 (2021). https://doi.org/10.1007/s11227-020-03441-5