A High-Speed Elliptic Curve Cryptographic Processor for Generic Curves over \(\mathrm{GF}(p)\)
Abstract
Elliptic curve cryptography (ECC) is preferred for high-speed applications due to its lower computational complexity compared with other public-key cryptographic schemes. As the basic arithmetic, modular multiplication is the most time-consuming operation in public-key cryptosystems. Existing high-radix Montgomery multipliers perform a single Montgomery multiplication either in approximately \(2n\) clock cycles, or in approximately \(n\) cycles but at a very low frequency, where \(n\) is the number of words. In this paper, we first design a novel Montgomery multiplier by combining a quotient-pipelining Montgomery multiplication algorithm with a parallel array design. The parallel design with one-way carry propagation can determine the quotients in one clock cycle, so one Montgomery multiplication can be completed in approximately \(n\) clock cycles. Meanwhile, owing to the quotient-pipelining technique applied in digital signal processing (DSP) blocks, our multiplier works at a high frequency. We also implement an ECC processor for generic curves over \(\mathrm{GF}(p)\) using the novel multiplier on FPGAs. To the best of our knowledge, our processor is the fastest among the existing ECC implementations over \(\mathrm{GF}(p)\).
Keywords
FPGA · Montgomery multiplier · DSP · High-speed ECC
1 Introduction
Elliptic curve cryptography has captured more and more attention since its introduction by Koblitz [8] and Miller [12] in 1985. Compared with RSA or discrete logarithm schemes over finite fields, ECC uses a much shorter key to achieve an equivalent level of security. Therefore, ECC processors are preferred for high-speed applications owing to their lower computational complexity and other nice properties such as lower storage and power consumption. Hardware accelerators are the most appropriate solution for high-performance implementations with acceptable resource and power consumption. Among them, field-programmable gate arrays (FPGAs) are well suited for this application due to their reconfigurability and versatility.
Point multiplication dominates the overall performance of elliptic curve cryptographic processors. Efficient implementations of point multiplication can be separated into three distinct layers [6]: the finite field arithmetic, the elliptic curve point addition and doubling, and the scalar multiplication. The fundamental finite field arithmetic is the basis of all the others. Finite field arithmetic over \(\mathrm{GF}(p)\) consists of modular multiplications, modular additions/subtractions and modular inversions. By choosing an alternative representation, called the projective representation, for the coordinates of the points, the time-consuming finite field inversions can be eliminated almost completely. This leaves modular multiplication as the most critical operation in ECC implementations over \(\mathrm{GF}(p)\). One of the widely used algorithms for efficient modular multiplication is the Montgomery algorithm, which was proposed by Peter L. Montgomery [16] in 1985.
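As a concrete reference point for the discussion that follows, word-serial (high-radix) Montgomery multiplication can be sketched in a few lines. This is a generic textbook formulation, not the exact algorithm of this paper:

```python
def mont_mul(A, B, M, k, n):
    """Word-serial Montgomery multiplication, radix 2^k, n words.

    Returns A * B * 2^(-k*n) mod M for odd M with M < 2^(k*n).
    A generic textbook sketch, not the pipelined algorithm of this paper.
    """
    W = 1 << k
    M_inv = (-pow(M, -1, W)) % W          # -M^(-1) mod 2^k
    S = 0
    for i in range(n):
        b_i = (B >> (k * i)) & (W - 1)    # i-th k-bit word of B
        T = S + b_i * A
        q_i = (T * M_inv) % W             # quotient digit: makes T + q_i*M divisible by 2^k
        S = (T + q_i * M) >> k
    return S if S < M else S - M          # final conditional subtraction

# example: 7 * 5 * 16^(-1) mod 13 = 3
print(mont_mul(7, 5, 13, k=4, n=1))      # → 3
```

The quotient digit \(q_i\) depends on the freshly updated partial sum, which is exactly the data dependency that the techniques discussed below are designed to break.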
Hardware implementations of the Montgomery algorithm have been studied for several decades. From the perspective of the radix, Montgomery multiplication implementations can be divided into two categories: radix-2 based [7, 21] and high-radix based [1, 2, 9, 11, 17, 19, 20, 22, 23]. Compared with the former, the latter, which can significantly reduce the required number of clock cycles, is preferred for high-speed applications.
For high-radix Montgomery multiplication, the determination of quotients is critical for speeding up the modular multiplication. To simplify the quotient calculation, Walter et al. [3, 23] presented a method that shifts up the modulus and multiplicand, and proposed a systolic array architecture. Following similar ideas, Orup presented an alternative to the systolic architecture [18] for performing high-radix modular multiplication. He introduced a rewritten high-radix Montgomery algorithm with quotient pipelining and gave an example of a non-systolic (or parallel) architecture, but the design is characterized by a low frequency due to global broadcast signals. To improve the frequency, the DSP blocks widely available in modern FPGAs have been employed for high-speed modular multiplication since Suzuki's work [19] was presented. In summary, however, the existing high-radix Montgomery multipliers perform a single Montgomery multiplication for an \(n\)-word multiplicand either in approximately \(2n\) clock cycles, or in approximately \(n\) cycles but at a low frequency.
To design a high-speed ECC processor, our primary goal is to propose a new Montgomery multiplication architecture that is able to process one Montgomery multiplication within approximately \(n\) clock cycles while simultaneously raising the working frequency to a high level.
Key Insights and Techniques. One key insight is that a parallel array architecture with one-way carry propagation can efficiently weaken the data dependency in calculating quotients, so that the quotients can be determined in a single clock cycle. Another key insight is that a high working frequency can be achieved by employing quotient pipelining inside DSP blocks. Based on these insights, our Montgomery multiplication design centers on a novel combination of the parallel array design and quotient pipelining inside DSP blocks.
We also implement an ECC processor for generic curves over \(\mathrm{GF}(p)\) using the novel multiplier on FPGAs. Due to the pipelined nature of the multiplier, we reschedule the operations in the elliptic curve arithmetic by overlapping successive Montgomery multiplications to further reduce the number of operation cycles. Additionally, side-channel analysis (SCA) resistance is considered in our design. Experimental results indicate that our ECC processor can perform a 256-bit point multiplication in 0.38 ms at 291 MHz on a Xilinx Virtex-5 FPGA.

- We develop a novel architecture for Montgomery multiplication. As far as we know, it is the first Montgomery multiplier that combines the parallel array design and quotient pipelining using DSP blocks.

- We design and implement our ECC processor on modern FPGAs using the novel Montgomery multiplier. To the best of our knowledge, our ECC processor is the fastest among the existing hardware implementations over \(\mathrm{GF}(p)\).
2 Related Work
2.1 High-Speed ECC Implementations over \(\mathrm{GF}(p)\)
Among the high-speed ECC hardware implementations over \(\mathrm{GF}(p)\), the architectures in [5] and [4] are the two fastest. For a 256-bit point multiplication they reached latencies of 0.49 ms and 0.68 ms on the modern FPGAs Virtex-4 and Stratix II, respectively. The architectures in [5] are designed for NIST primes using fast reduction. By forcing the DSP blocks to run at their maximum frequency (487 MHz), the architectures reach a very low latency for one point multiplication. Nevertheless, due to the dual-clock design and the complex control inside DSP blocks, the architecture can only be implemented on FPGA platforms. Furthermore, due to the restriction on primes, the application scenario of [5] is limited to NIST prime fields. For generic curves over \(\mathrm{GF}(p)\), [4] combines residue number systems (RNS) and Montgomery reduction for the ECC implementation. The design achieves 6-stage parallelism and a high frequency with a large number of DSP blocks, resulting in the fastest ECC implementation for generic curves. In addition, the design in [4] is resistant to SCA.
As far as we know, the fastest ECC implementation based on Montgomery multiplication was presented in [11], which is much slower than the above two designs. The main reason is that the frequency is driven down to a low level, although the number of cycles for a single multiplication is approximately \(n\). On the earlier FPGA device Virtex-2 Pro, the latency for a 256-bit point multiplication is 2.27 ms without SCA resistance, and degrades to 2.35 ms with SCA countermeasures.
2.2 High-Radix Montgomery Multiplication
To date, a wealth of methods have been proposed for speeding up high-radix Montgomery multiplication, either by reducing the number of processing cycles or by shortening the critical path of the implementation.
The systolic array architecture seems to be the best solution for modular multiplication with very long integers. Eldridge and Walter performed a shift up of the multiplicand to speed up modular multiplication [3], and Walter designed a systolic array architecture with a throughput of one modular multiplication every clock cycle and a latency of \(2n+2\) cycles for \(n\)-word multiplicands [23]. Suzuki introduced a Montgomery multiplication architecture based on the DSP48, a dedicated DSP unit in modern FPGAs [19]. In order to achieve scalability and high performance, complex control signals and dual clocks were involved in the design. However, the average number of processing cycles per multiplication is at least approximately \(2n\). In fact, this is a common barrier in high-radix Montgomery algorithm implementations: the quotient is hard to generate in a single clock cycle. This barrier also exists in other systolic high-radix designs [9, 22].
In contrast, some non-systolic array architectures were proposed to speed up the quotient determination, but their clock frequency is a concern. Orup introduced a rewritten high-radix Montgomery algorithm with quotient pipelining and gave an example of a non-systolic architecture [18]. Another high-speed parallel design was proposed by Mentens [11], where the multiplication result is written in a special carry-save form to shorten the long computational path. The approach is able to process a single \(n\)-word Montgomery multiplication in approximately \(n\) clock cycles, but the maximum frequency is reduced to a very low level because too many arithmetic operations have to be completed within one clock cycle. Similar frequency drawbacks can also be found in [2, 17].
3 Processing Method
In this section, we propose a processing method for a pipelined implementation employing DSP blocks.
3.1 Pipelined Montgomery Algorithm
The calculation of the right \(q_i\) is crucial for Montgomery multiplication, and it is the most time-consuming operation in hardware implementations. To improve the maximum frequency, a straightforward method is to divide the computational path into \(\alpha\) stages for pipelining. The number of processing clock cycles, however, increases by a factor of \(\alpha\), since \(q_i\) is generated only every \(\alpha\) clock cycles. That is to say, the pipeline does not work due to the data dependency. The main idea of Algorithm 1 is to use the preset values \(q_{-d}=0, q_{-d+1}=0, \ldots, q_{-1}=0\) in place of real quotients to start the pipeline in the first \(d\) clock cycles. Then, in the \((d+1)\)-th cycle, the first right value \(q_1\) is generated and fed into the calculation of the next round. After that, one \(q_i\) is generated per clock cycle in pipelining mode. Compared to the traditional Montgomery algorithm, the cost of Algorithm 1 is a few extra iteration cycles, additional preprocessing and a wider range of the final result.
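To make the preset-quotient idea concrete, the following Python sketch emulates a quotient-pipelined Montgomery multiplication with delay \(d\). It is a reformulation in the spirit of Orup [18]; the variable names, the transformed modulus `M_bar` and the exact correction step are ours, not those of Algorithm 1:

```python
def mont_mul_qp(A, B, M, k, n, d):
    """Montgomery multiplication with quotient-pipelining delay d.

    The quotient consumed at iteration i is the one produced d iterations
    earlier, with q_{-d} .. q_{-1} preset to 0.  Returns S with
    S ≡ A * B * 2^(-k*n) (mod M), in a wider range than [0, M).
    A sketch after Orup's reformulation, not Algorithm 1 itself.
    """
    W = 1 << k
    D = 1 << (k * (d + 1))
    Mp = (-pow(M, -1, D)) % D               # -M^(-1) mod 2^(k(d+1))
    M_bar = (Mp * M + 1) >> (k * (d + 1))   # transformed modulus, ≡ 2^(-k(d+1)) mod M
    q = [0] * d                             # preset quotients q_{-d} .. q_{-1}
    S = 0
    for i in range(n + d + 1):
        b_i = (B >> (k * i)) & (W - 1) if i < n else 0
        q.append(S & (W - 1))               # q_i is just the low k bits of S_i
        S = (S >> k) + q[i] * M_bar + b_i * A   # consumes q_{i-d}
    # the last d quotient digits were dropped but never compensated; put them back
    S_out = S << (k * d)
    for j in range(d):
        S_out += q[n + 1 + d + j] << (k * j)
    return S_out

print(mont_mul_qp(3, 7, 5, k=2, n=2, d=1))  # → 6, and 6 ≡ 3·7·16⁻¹ (mod 5)
```

Note that the returned value lies in a wider range than \([0, M)\), matching the "wider range of the final result" mentioned above; a final reduction is needed before leaving the Montgomery domain.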
3.2 DSP Blocks in FPGAs
Figure 1 shows the generic DSP block structure in modern FPGAs. By using different data paths, DSP blocks can operate on the external inputs \(A, B, C\) as well as on feedback values from \(P\) or results \(PCIN\) from a neighboring DSP block. Notice that all the registers, shown in gray in Figure 1, can be added or bypassed to control the number of pipeline stages, which is helpful for implementing the pipelined Montgomery algorithm. Here, for the sake of brevity and portability of the design, we do not employ dual clocks and complex control signals as in [5, 19], which force DSP blocks to work at the maximum frequency.
3.3 Processing Method for Pipelined Implementation
Now we explain the consistency between Algorithms 1 and 2. There are three phases in Algorithm 2: Phase 0 for initialization, Phase 1 for iteration and Phase 2 for the final addition. The initialization is executed before each multiplication begins. In Phase 1, a four-stage pipeline is introduced in order to utilize the DSP blocks, so the total number of outer loop iterations becomes \(n+6\) instead of \(n+3\). The inner loop from \(0\) to \(n-1\) represents the operations of the \(n\) Processing Elements (PEs). In the pipeline, referring to Algorithm 1, we can see that Stage 1 to Stage 3 are used to calculate \(w_i=q_{i-d}M'+b_iA\), and Stage 4 is used to calculate \(S_i~\mathrm{div}~2^k + w_i\). Here, \(S_i~\mathrm{div}~2^k\) is divided into two parts: \(c_{(i+3,j)}\) inside the PE itself and \(s_{(i+3,j+1)}\) from the neighboring higher PE; the index delay is caused by the pipeline. In Stage 4, \(S_i\) is represented by \(s_{(i+3,j)}\) and \(c_{(i+3,j)}\) in a redundant form, where \(s_{(i+3,j)}\) represents the lower \(k\) bits and \(c_{(i+3,j)}\) the \(k+1\) carry bits. Note that the carry bits from lower PEs are not transferred to higher PEs, because this interconnection would increase the data dependency for calculating \(q_i\), implying that \(q_i\) could not be generated every clock cycle. Therefore, apart from \(q_{i-3}\), the transfer of \(s_{(i,j+1)}\) in Stage 4 is the only interconnection among the PEs, ensuring that \(q_{i-3}\) can be generated every cycle. The carry bits from lower PEs to higher PEs, which are saved in \(c_{(i,j)}\), are processed in Phase 2. In brief, the goal of Phase 1 is to generate the right quotient every clock cycle to run the iteration regardless of the representation of \(S_i\), while Phase 2 transforms, by simple additions, the redundant representation into the nonredundant representation of the final value \(S_{n+4}\). The detailed hardware architecture is presented in the next section.
4 Proposed Architecture
4.1 Montgomery Multiplier
In the first three pipeline stages, the arithmetic operations \(q_{i-d}M'+b_iA\) are located in the two DSP blocks named DSP1 and DSP2. To achieve a high frequency, two stages of registers are inserted in DSP2, which calculates the multiplication of \(q_{i-d}\) and \(m'_j\). Accordingly, another stage of registers is added in DSP1 in order to wait for the multiplication result \(u_{(i,j)}\) from DSP2. The addition of \(u_{(i,j)}\) and \(v_{(i,j)}\) is performed by DSP1, as shown in Fig. 2. In the fourth stage, the three-number addition \(w_{(i,j)}+c_{(i,j)}+s_{(i,j+1)}\) is performed using a carry-save adder (CSA). In FPGAs, the CSA can be implemented by a single stage of lookup tables (LUTs) and one carry-propagate adder (CPA). Because the computational path between the DSP registers is shorter than that of the CSA, the critical path consists only of the three-number addition, i.e., the CSA. Therefore, the PE can work at a high frequency thanks to the very short critical path.
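The carry-save step used in the fourth stage can be illustrated bit-wise as follows. This is a generic illustration of a 3:2 compressor, not the exact PE datapath:

```python
def csa(x, y, z):
    """3:2 carry-save compressor: reduce three addends to a sum word and a
    carry word without propagating carries (generic illustration)."""
    s = x ^ y ^ z                        # bitwise sum
    c = (x & y) | (x & z) | (y & z)      # bitwise majority = carry bits
    return s, c                          # invariant: x + y + z == s + 2*c

s, c = csa(0b1011, 0b0110, 0b1101)
assert 0b1011 + 0b0110 + 0b1101 == s + 2 * c
```

Because the carry word is merely shifted, not propagated, only one carry-propagate addition (the CPA) remains on the critical path, which is why the PE frequency stays high.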
Now we analyze the performance of the PE array. According to Algorithm 2, the number of iteration rounds of Phase 1 is \(n+7\). Together with one clock cycle for initialization, the PE array takes \(n+8\) processing cycles. Regarding the consumed hardware resources, \(n+m\) DSP blocks are required to form the PE array.
Although the frequency may decrease due to global signals and the large bus width, we find that these factors fortunately do not have a serious impact on the hardware performance, owing to the small bit size (256 bits or fewer) of ECC operands. This has been verified in our experiments, as shown in Sect. 5.1.
The outputs of the PEs form the redundant representation of the final result \(S_{n+4}\), so some additions (cf. Phase 2) have to be performed to obtain the nonredundant representation before it can be used as the input of a new multiplication. Here we use another circuit module, the redundant number adder (RNA), to implement the operations of Phase 2. In fact, the PE array and the RNA, which work in an alternating fashion, can be pipelined to process the independent multiplications that are inherent in elliptic curve point calculation algorithms. Therefore, the average number of processing clock cycles for one multiplication is only \(n+8\) in our ECC processor.
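The Phase 2 conversion performed by the RNA amounts to summing the per-word (sum, carry) pairs with a single carry-propagate pass. A minimal sketch; the word alignment shown here is illustrative, the real alignment follows Algorithm 2:

```python
def rna(s_words, c_words, k):
    """Redundant number adder (sketch): collapse the per-word redundant
    (sum, carry) representation into one nonredundant integer.
    Word alignment is illustrative; the real design follows Algorithm 2."""
    total = 0
    for j, (s, c) in enumerate(zip(s_words, c_words)):
        total += (s + c) << (k * j)      # one overall carry-propagate pass
    return total

# 2 words of k = 4 bits: (3 + 1) + (5 + 0)*16 = 84
assert rna([3, 5], [1, 0], 4) == 84
```

Because the PE array and the RNA operate on different multiplications at the same time, this conversion does not add to the per-multiplication cycle count on average.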
4.2 ECC Processor Architecture
Modular Adder/Subtracter. In elliptic curve arithmetic, modular additions and subtractions are interspersed among the modular multiplications. According to Algorithm 1, for inputs in the range \([0, 2\tilde{M}]\), the final result \(S_{n+d+2}\) is reduced to the range \([0, 2\tilde{M}]\).
In our design, the ModAdd/Sub actually performs a straightforward integer addition/subtraction without modular reduction. Instead, the modular reduction is carried out by the Montgomery multiplication with an expanded \(R\). After careful observation and scheduling, the results of ModAdd/Sub are restricted to the range \((0, 8\tilde{M})\), as shown in Appendix A, where squaring is treated as a generic multiplication with two identical multiplicands. The range \((0, 8\tilde{M})\) is determined by the rescheduling of the elliptic curve arithmetic. For example, to calculate \(8(x\times y)\) where \(x, y < 2\tilde{M}\), the computation is rescheduled as \((4x)\times(2y)\) to narrow the range of the result. In this case, the parameter \(R\) should be expanded to \(R > 64\tilde{M}\) to guarantee that for inputs in the range \((0, 8\tilde{M})\) the result \(S\) of the Montgomery multiplication still satisfies \(S < 2\tilde{M}\). The proof is omitted here.
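The omitted bound can be checked numerically: with \(R > 64\tilde{M}\) and inputs below \(8\tilde{M}\), one Montgomery reduction brings the product back below \(2\tilde{M}\), since \(S = (XY + mM)/R < XY/R + M < \tilde{M} + \tilde{M}\). A small sanity check, with \(\tilde{M} = M\) for simplicity and toy parameters of our own choosing:

```python
def redc(T, M, r_bits):
    """Montgomery reduction: returns S with S*2^r_bits ≡ T (mod M)."""
    R = 1 << r_bits
    m = (T * ((-pow(M, -1, R)) % R)) % R
    return (T + m * M) >> r_bits

M = 101                  # toy odd modulus (plays the role of M-tilde)
r_bits = 13              # R = 8192 > 64*M = 6464
for X in (0, 17, 805, 807):
    for Y in (0, 99, 806, 807):            # all inputs below 8*M = 808
        S = redc(X * Y, M, r_bits)
        assert S < 2 * M                   # result is back below 2*M
        assert (S << r_bits) % M == (X * Y) % M   # and congruent to X*Y/R
```

The same argument scales directly to the 256-bit parameters of the processor, where \(R\) is simply widened by six bits relative to the traditional choice.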
SCA Resistance. Considering both SCA resistance and efficiency, we combine the randomized Jacobian coordinates method and a window method [13] against differential power analysis (DPA) and simple power analysis (SPA), respectively. The randomization technique transforms the base point \((x, y, 1)\) in projective coordinates to \((r^2x, r^3y, r)\) with a random number \(r\ne 0\). The window method in [13], based on a special recoding algorithm, leaks minimal information through the computation time, and it is efficient under Jacobian coordinates with a precomputed table. A more efficient method was presented in [14], and a security-enhanced method, which avoids a fixed table and achieves comparable efficiency, was proposed in [15]. For computing a point multiplication, the window-based method [13] requires \(2^{w-1}+tw\) point doublings and \(2^{w-1}-1+t\) point additions, where \(w\) is the window size and \(t\) is the number of words after recoding. The precomputation time has been taken into account, and the base point is not assumed to be fixed. The precomputed table with \(2^w-1\) points can easily be implemented by the block RAMs that are abundant in modern FPGAs, and the cost is acceptable for our design. Note that the randomization technique has no impact on the area and causes only a small decrease in speed, as the randomization is executed only once or twice [13].
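The operation counts above can be tabulated for candidate window sizes with a small helper using the formulas quoted from [13]; the choice \(t = \lceil 256/w \rceil\) for a 256-bit scalar is our illustrative assumption:

```python
import math

def window_cost(w, bits=256):
    """Point-operation counts for the window method of [13]:
    2^(w-1) + t*w doublings and 2^(w-1) - 1 + t additions,
    with t = ceil(bits / w) recoded words (illustrative assumption)."""
    t = math.ceil(bits / w)
    doublings = 2 ** (w - 1) + t * w
    additions = 2 ** (w - 1) - 1 + t
    return doublings, additions

for w in (2, 3, 4, 5, 6):
    print(w, window_cost(w))
```

For example, \(w=4\) gives 264 doublings and 71 additions, already including the precomputation of the \(2^w-1\) table entries.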
5 Implementation and Comparison
5.1 Hardware Implementation
Table 1. Clock cycles for ECC-256p under Jacobian projective coordinates

  Operation                        Cycles
  MUL                              35 (average 29)
  ADD/SUB                          7
  Point doubling (Jacobian)        232
  Point addition (Jacobian)        484
  Inversion (Fermat)               13685
  Point multiplication (window)    109297
Table 2. PAR results of ECC-256p on Virtex-4 and Virtex-5

                      Virtex-4             Virtex-5
  Slices              4655                 1725
  LUTs                5740 (4-input)       4177 (6-input)
  Flip-flops          4876                 4792
  DSP blocks          37                   37
  BRAMs               11 (18 Kb)           10 (36 Kb)
  Frequency (Delay)   250 MHz (0.44 ms)    291 MHz (0.38 ms)
5.2 Performance Comparison and Discussion
The comparison results are shown in Table 3, where the first three works support generic elliptic curves, while the last two only support NIST curves. In addition, our work and [4, 11] are SCA-resistant, while the others are not. These differences are labeled in Table 3.
Table 3. Hardware performance comparison of this work and other ECC cores

  Work      Curve     Device        Size (DSP)                 Frequency (MHz)  Delay (ms)  SCA res.
  Our work  256 any   Virtex-5      1725 slices (37 DSPs)      291              0.38        Yes
  Our work  256 any   Virtex-4      4655 slices (37 DSPs)      250              0.44        Yes
  [4]       256 any   Stratix II    9177 ALMs (96 DSPs)        157              0.68        Yes
  [11]      256 any   Virtex-2 Pro  3529 slices (36 MULTs)     67               2.35        Yes
  [10]      256 any   Virtex-2 Pro  15755 slices (256 MULTs)   39.5             3.84        No
  [5]       256 NIST  Virtex-4      1715 slices (32 DSPs)      487              0.49        No
  [17]      192 NIST  Virtex-E      5708 slices                40               3           No
The designs in [10, 11] are both based on the classic Montgomery algorithm and implemented on the earlier FPGA Virtex-2 Pro, which did not yet provide DSP blocks. To the best of our knowledge, the architecture of [11] is the fastest among the Montgomery-multiplication-based implementations for generic curves. In [11], the multiplication result is written in a special carry-save form to shorten the long computational path, but the maximum frequency is reduced to a very low level. As the targeted platform of our design is more advanced than that of [11], it is necessary to explain, from the aspect of the critical path, why our frequency is higher than that of [11] by a large margin. The critical path of [11] is composed of one adder, two 16-bit multipliers and several stages of LUTs for a 6-2 CSA, whereas the critical path in our design is only one stage of LUTs for a 3-2 CSA and one 32-bit adder. As a result, owing to the quotient-pipelining technique applied in DSP blocks, the critical path is shortened significantly in our design.
The architecture described in [5] is the fastest FPGA implementation of elliptic curve point multiplication over \(\mathrm{GF}(p)\), but with restrictions on primes. It computes point multiplications over NIST curves, which are widely used and standardized in practice. It is a dual-clock design and shifts all the field arithmetic operations into DSP blocks; thus the design occupies a small area and runs at a high speed (487 MHz) on Virtex-4. Our design extends the application to support generic curves at a higher speed, and our architecture is not limited to FPGA platforms. In fact, our architecture can easily be transferred to application-specific integrated circuits (ASICs) by replacing the multiplier cores, i.e., the DSP blocks, with excellent pipelined multiplier IP cores. On ASICs it will be more flexible to configure the delay parameter and the radix to maximize the hardware performance. Furthermore, notice that the drawbacks of the pipelined Montgomery algorithm, i.e., the wider range and the additional iteration cycles mentioned in Sect. 3.1, can be eliminated for commonly used pseudo-Mersenne primes. Taking the NIST prime \(\mathrm{P}\text{-}256 = 2^{256}-2^{224}+2^{192}+2^{96}-1\) as an example, the 96 least significant bits are all '1', so the parameter \(\bar{M}\) equals 1 in Algorithm 1 for \(k, d\) satisfying \(2^{k(d+1)}\le 2^{96}\), and then \(\tilde{M}\) is reduced to \(\tilde{M}=M\). In this case, the ranges of the precomputed parameters correspond to the widths in the traditional Montgomery algorithm. Therefore, if our architecture were specialized for P-256, the performance would be further improved.
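The P-256 property invoked above is easy to verify: because the 96 least significant bits of the prime are all ones, \(-M^{-1} \equiv 1 \pmod{2^{m}}\) for any \(m \le 96\):

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

# the 96 least significant bits are all '1' ...
assert P256 % (1 << 96) == (1 << 96) - 1

# ... hence -P256^(-1) mod 2^m equals 1 for every m <= 96, so the
# quotient-recoding constant degenerates to 1 whenever k(d+1) <= 96
for m in (8, 32, 64, 96):
    assert (-pow(P256, -1, 1 << m)) % (1 << m) == 1
```

This is why the extra preprocessing and widened parameter ranges of the pipelined algorithm vanish for such pseudo-Mersenne moduli.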
6 Conclusion and Future Work
This paper presents a high-speed elliptic curve cryptographic processor for generic curves over \(\mathrm{GF}(p)\) based on our novel Montgomery multiplier. We combine the quotient-pipelining Montgomery multiplication algorithm with our new parallel array design, so that our multiplier completes a single Montgomery multiplication in approximately \(n\) clock cycles while working at a high frequency. Employing this multiplier, we implement the ECC processor for scalar multiplication on modern FPGAs. Experimental results indicate that the design is faster than the other existing ECC implementations over \(\mathrm{GF}(p)\) on FPGAs. From the comparison results, we can see that the pipelined-Montgomery-based scheme is a better choice than the classic-Montgomery-based and RNS-based ones in terms of speed and consumed resources for ECC implementations. In future work, we will implement the architecture on more advanced FPGAs such as Virtex-6 and Virtex-7, and transfer it to ASIC platforms.
Acknowledgements
The authors would like to acknowledge the contributions of Dr. Zhan Wang, Jingqiang Lin, and Chenyang Tu for useful discussions. The authors would also like to thank Professor Tanja Lange from Technische Universiteit Eindhoven in the Netherlands for helpful proofreading and comments. Finally, we are grateful to the anonymous reviewers for their invaluable suggestions and comments to improve the quality and fairness of this paper.
References
1. Blum, T., Paar, C.: High-radix Montgomery modular exponentiation on reconfigurable hardware. IEEE Trans. Comput. 50(7), 759–764 (2001)
2. Daly, A., Marnane, W.P., Kerins, T., Popovici, E.M.: An FPGA implementation of a \(\mathrm{GF}(p)\) ALU for encryption processors. Microprocess. Microsyst. 28(5–6), 253–260 (2004)
3. Eldridge, S.E., Walter, C.D.: Hardware implementation of Montgomery's modular multiplication algorithm. IEEE Trans. Comput. 42(6), 693–699 (1993)
4. Guillermin, N.: A high speed coprocessor for elliptic curve scalar multiplications over \({\mathbb{F}}_p\). In: Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp. 48–64. Springer, Heidelberg (2010)
5. Güneysu, T., Paar, C.: Ultra high performance ECC over NIST primes on commercial FPGAs. In: Oswald, E., Rohatgi, P. (eds.) CHES 2008. LNCS, vol. 5154, pp. 62–78. Springer, Heidelberg (2008)
6. Hankerson, D., Vanstone, S., Menezes, A.J.: Guide to Elliptic Curve Cryptography. Springer, New York (2004)
7. Huang, M., Gaj, K., Kwon, S., El-Ghazawi, T.: An optimized hardware architecture for the Montgomery multiplication algorithm. In: Cramer, R. (ed.) PKC 2008. LNCS, vol. 4939, pp. 214–228. Springer, Heidelberg (2008)
8. Koblitz, N.: Elliptic curve cryptosystems. Math. Comput. 48(177), 203–209 (1987)
9. McIvor, C., McLoone, M., McCanny, J.V.: High-radix systolic modular multiplication on reconfigurable hardware. In: Brebner, G.J., Chakraborty, S., Wong, W.-F. (eds.) FPT 2005, pp. 13–18. IEEE (2005)
10. McIvor, C.J., McLoone, M., McCanny, J.V.: Hardware elliptic curve cryptographic processor over \(\mathrm{GF}(p)\). IEEE Trans. Circ. Syst. I: Regul. Pap. 53(9), 1946–1957 (2006)
11. Mentens, N.: Secure and efficient coprocessor design for cryptographic applications on FPGAs. Ph.D. thesis, Katholieke Universiteit Leuven (2007)
12. Miller, V.S.: Use of elliptic curves in cryptography. In: Williams, H.C. (ed.) CRYPTO 1985. LNCS, vol. 218, pp. 417–426. Springer, Heidelberg (1986)
13. Möller, B.: Securing elliptic curve point multiplication against side-channel attacks. In: Davida, G.I., Frankel, Y. (eds.) ISC 2001. LNCS, vol. 2200, pp. 324–334. Springer, Heidelberg (2001)
14. Möller, B.: Securing elliptic curve point multiplication against side-channel attacks, addendum: efficiency improvement. http://pdf.aminer.org/000/452/864/securing_elliptic_curve_point_multiplication_against_side_channel_attacks.pdf (2001)
15. Möller, B.: Parallelizable elliptic curve point multiplication method with resistance against side-channel attacks. In: Chan, A.H., Gligor, V.D. (eds.) ISC 2002. LNCS, vol. 2433, pp. 402–413. Springer, Heidelberg (2002)
16. Montgomery, P.L.: Modular multiplication without trial division. Math. Comput. 44(170), 519–521 (1985)
17. Orlando, G., Paar, C.: A scalable GF(\(p\)) elliptic curve processor architecture for programmable hardware. In: Koç, Ç.K., Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 348–363. Springer, Heidelberg (2001)
18. Orup, H.: Simplifying quotient determination in high-radix modular multiplication. In: IEEE Symposium on Computer Arithmetic, pp. 193–199 (1995)
19. Suzuki, D.: How to maximize the potential of FPGA resources for modular exponentiation. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727, pp. 272–288. Springer, Heidelberg (2007)
20. Tang, S.H., Tsui, K.S., Leong, P.H.W.: Modular exponentiation using parallel multipliers. In: FPT 2003, pp. 52–59. IEEE (2003)
21. Tenca, A.F., Koç, Ç.K.: A scalable architecture for Montgomery multiplication. In: Koç, Ç.K., Paar, C. (eds.) CHES 1999. LNCS, vol. 1717, pp. 94–108. Springer, Heidelberg (1999)
22. Tenca, A.F., Todorov, G., Koç, Ç.K.: High-radix design of a scalable modular multiplier. In: Koç, Ç.K., Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 185–201. Springer, Heidelberg (2001)
23. Walter, C.D.: Systolic modular multiplication. IEEE Trans. Comput. 42(3), 376–378 (1993)