RNS Montgomery reduction algorithms using quadratic residuosity
Abstract
The residue number system (RNS) is a method for representing an integer as an n-tuple of its residues with respect to a given base. Since RNS has inherent parallelism, it is actively researched as a means of implementing faster processing systems for public-key cryptography. This paper proposes new RNS Montgomery reduction algorithms, Q-RNSs, the main part of which consists of two matrix multiplications. Letting n be the size of a base set, the number of unit modular multiplications in the proposed algorithms is evaluated as \((2n^2+n)\). This is achieved by imposing a new restriction on the RNS base, namely, that its elements should have a certain quadratic residuosity. This makes it possible to remove some multiplication steps from conventional algorithms, and thus the new algorithms are simpler and have higher regularity than conventional ones. Our experiments confirm that there are sufficient candidates for RNS bases meeting the quadratic residuosity requirements.
Keywords
Residue number system, Montgomery reduction, Quadratic residuosity, Cryptography
1 Introduction
The residue number system (RNS) is a method for representing an integer in which a given integer x is represented by its residues modulo the elements of a base of pairwise co-prime integers. If we denote the base by \(B=\{m_1, m_2, \dots , m_n \}\) and the RNS representation of x as \([ x_1,x_2, \dots , x_n ]\), it holds that \(x_i=x \bmod m_i\). The main feature of RNS is that addition, subtraction, and multiplication are carried out by independent addition, subtraction, and multiplication with respect to each base element. The operation flow at each base element is called a channel. If each channel has a processing unit, an n-fold speed increase can be achieved, as compared with the case with a single processing unit. This parallelism is attractive for efficient computation of public-key cryptography, which is built from modular operations on integers of several hundred or several thousand bits. However, modular reduction in RNS was not easy to carry out before it was replaced with Montgomery reduction (M-red) in [1]. Following the proposal of the RNS M-red, promising results have been obtained for the RSA algorithm [2, 3, 4, 5, 6], Elliptic Curve Cryptosystem [7, 8, 9, 10, 11, 12, 13, 14], Pairing-based Cryptosystem [15, 16], modular inversion [17], Lattice-based Cryptosystem [18], and an architectural study [19]. In parallel with these applications, improvements in the RNS M-red algorithm have been proposed [3, 5, 6, 9, 13, 14, 15]. An overview of this research is presented in [20].
This paper proposes improved RNS M-red algorithms, Q-RNS M-reds, which achieve the least number of multiplications by imposing quadratic residuosity constraints on the RNS base. Past improvements in RNS M-red algorithms, with exceptions such as [6], were optimizations within one round of M-red execution, whereas our optimization for Q-RNS M-red is novel in that it transfers the square root of a constant from the current round to the previous round. Q-RNS includes two concrete algorithms called sQ-RNS and dQ-RNS, which differ in the multiplication unit used.
This paper is organized as follows. Section 2 introduces the notation used and some basic concepts. Section 3 explains the conventional RNS M-red algorithms. In Sect. 4, we introduce the new idea of using quadratic residuosity to simplify the M-red algorithm. The first variant, sQ-RNS M-red, is a direct combination of the new idea and a conventional algorithm. The second variant, dQ-RNS M-red, relaxes the constraint on RNS base choice by introducing the double-level Montgomery technique from [13]. Other procedures necessary to implement public-key cryptography, such as Initialize, are discussed in Sect. 5. Section 6 compares RNS M-red algorithms including FPGA implementations, and Sect. 7 concludes this paper.
2 Basic concepts
2.1 Notation
The following definitions are applied in this paper.
Base \(B' = \{ m'_1, \dots , m'_n \}\), where \(\gcd ( m'_i, m'_j ) = 1\) for \(i \ne j\).
2.2 Modular multiplication
2.3 Montgomery reduction
Let \( MM (x, y)\) be the right-hand side of Eq. (3). Using \( MM (x, y)\), the procedure to compute a modular multiplication is described as follows.
The goal of this paper is to propose an efficient algorithm to compute the Montgomery reduction in RNS, that is, an RNS M-red algorithm.
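For reference, the integer (non-RNS) Montgomery reduction can be sketched in Python as follows. This is a minimal sketch, assuming the Montgomery constant is \(R = 2^l\), that p is odd, and that the input satisfies \(0 \le x < pR\); the names `mont_reduce` and `mont_mul` are illustrative and are not taken from the paper. The output is left in the relaxed range \([0, 2p)\), in line with footnote 1.

```python
def mont_reduce(x: int, p: int, l: int) -> int:
    """Return s with s = x * R^{-1} (mod p) and 0 <= s < 2p, where R = 2**l.

    Assumes p is odd (so gcd(p, R) = 1) and 0 <= x < p * R.
    """
    R = 1 << l
    neg_p_inv = (-pow(p, -1, R)) % R   # -p^{-1} mod R, a precomputable constant
    q = (x * neg_p_inv) % R            # chosen so that x + p*q is divisible by R
    s = (x + p * q) >> l               # exact division by R
    return s                           # relaxed bound s < 2p, no final subtraction


def mont_mul(a: int, b: int, p: int, l: int) -> int:
    """Montgomery product a * b * R^{-1} mod p, returned in [0, 2p).

    Assumes a * b < p * R.
    """
    return mont_reduce(a * b, p, l)
```

If inputs are themselves allowed to lie in \([0, 2p)\), the sketch needs \(R > 4p\) so that the product stays below \(pR\).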
2.4 RNS
In RNS, a large integer can be processed with independent parallel operations in each channel. If we could use small factors of public-key p as an RNS base, a very efficient implementation would be realized. It is, however, difficult to employ such an approach since the public-key p is usually a large prime or a product of two large primes.
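The channel-wise arithmetic described above can be illustrated with a short Python sketch. The toy base and the helper names (`to_rns`, `rns_add`, `rns_mul`, `from_rns`) are assumptions made for illustration only; `from_rns` uses the textbook Chinese remainder reconstruction and is valid as long as the true result is smaller than the product of the base elements.

```python
from math import prod


def to_rns(x, base):
    """Residues of x with respect to each base element."""
    return [x % m for m in base]


def rns_add(xs, ys, base):
    """Channel-wise addition; each channel is independent of the others."""
    return [(a + b) % m for a, b, m in zip(xs, ys, base)]


def rns_mul(xs, ys, base):
    """Channel-wise multiplication; each channel is independent of the others."""
    return [(a * b) % m for a, b, m in zip(xs, ys, base)]


def from_rns(xs, base):
    """CRT reconstruction of the unique integer in [0, M), M = product of the base."""
    M = prod(base)
    total = 0
    for x_i, m in zip(xs, base):
        M_i = M // m
        total += ((x_i * pow(M_i, -1, m)) % m) * M_i
    return total % M
```

For example, with the toy base `[13, 17, 19]` (M = 4199), multiplying the residues of 1234 and 3 channel by channel and reconstructing returns the ordinary product, because 3702 is below M.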
2.5 Chinese remainder theorem
2.6 Special modulus for fast reduction
Let \(\bar{\mu }= (1/w) \log _2 \mu _i\). Equation (4) is satisfied if \(\bar{\mu }< 0.5\).
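One common way to exploit a modulus of the form \(m_i = 2^w - \mu_i\) is to fold the high bits using \(2^w \equiv \mu_i \pmod{m_i}\). The Python sketch below illustrates this under the assumption \(0 < \mu_i < 2^{w/2}\) (that is, \(\bar{\mu} < 0.5\)) and an input below \(2^{2w}\); it is not necessarily the exact procedure intended in this section, and the name `fast_reduce` is hypothetical.

```python
def fast_reduce(x: int, w: int, mu: int) -> int:
    """Reduce 0 <= x < 2**(2*w) modulo m = 2**w - mu, assuming 0 < mu < 2**(w/2)."""
    m = (1 << w) - mu
    mask = (1 << w) - 1
    for _ in range(2):
        # Fold: hi * 2**w + lo is congruent to hi * mu + lo (mod m).
        x = (x >> w) * mu + (x & mask)
    # After two folds x is below 2**(w+1), so only a few subtractions remain.
    while x >= m:
        x -= m
    return x
```

With \(w = 50\) and \(\mu = 27\) (the first modulus in the P-192 row of Table 1), a product of two residues is reduced using two small multiplications by \(\mu\) instead of a general division.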
2.7 Quadratic residuosity
Quadratic residuosity has been used for RNS in signal processing applications to represent a complex signal (for instance, refer to section 8.1 of [22]), whereas no previously proposed RNS M-red algorithm has used quadratic residuosity. This paper applies quadratic residuosity to the RNS M-red algorithm for the first time to construct algorithms that consist of the least number of unit multiplications.
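As a concrete illustration of quadratic residuosity, the following Python sketch tests whether a is a quadratic residue modulo an odd prime m using Euler's criterion, and extracts a square root in the easy case \(m \equiv 3 \pmod 4\). The function names are hypothetical, and the sketch assumes that m is prime; the paper's own definition of the symbol \(\mathrm{QR}(\cdot,\cdot)\) is not reproduced here.

```python
def is_qr(a: int, m: int) -> bool:
    """Euler's criterion for an odd prime m.

    Returns True iff a is a nonzero quadratic residue modulo m,
    i.e. a^((m-1)/2) = 1 (mod m).
    """
    return pow(a % m, (m - 1) // 2, m) == 1


def sqrt_mod_3mod4(a: int, m: int) -> int:
    """A square root of the quadratic residue a modulo a prime m with m % 4 == 3."""
    assert m % 4 == 3, "this shortcut only applies when m = 3 (mod 4)"
    r = pow(a % m, (m + 1) // 4, m)
    assert (r * r - a) % m == 0, "a is not a quadratic residue modulo m"
    return r
```

For instance, 2 is a quadratic residue modulo 7 (since \(3^2 = 9 \equiv 2\)), whereas 3 is not.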
3 Conventional algorithms
3.1 Basic RNS M-red algorithm
3.1.1 Algorithm
Since step 1 is carried out in base B, modulo M is automatically applied to the computation and the result is equivalent to that of step 1 in Fig. 2. It is in base \(B'\) that steps 4 and 5 should be carried out. The reason for this is as follows: As for step 4, it is of no use computing \((x+pq)\) in B because the result is always a multiple of M and thus always 0 in base B. The computation in step 5 is to multiply \(M^{-1}\) by r. This can be carried out in base \(B'\) but not in base B, since \(M^{-1}\) does not exist in base B. Although the final result s is computed at step 5, it is only represented in base \(B'\). In order to complete the representation in base B, steps 6 and 7 extend \(\{s\}_{B'}\) to \(\{ s \}_B\). This ensures compatibility between output and input of the RNS M-red algorithm.
The matrix elements in each step are defined as follows:
Step 1\(^{2}\): \(d_i = \left\langle -p^{-1} \right\rangle _{m_i}\).
Step 2\(^{3}\): \(w_{ii} = \left\langle M_i^{-1} \right\rangle _{m_i}\).
Step 5: \(w_i = \left\langle M^{-1} \right\rangle _{m'_i}\).
Step 6: \(w'_{ii} = \left\langle {M'_i}^{-1} \right\rangle _{m'_i}\).
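The data flow of the basic algorithm described above can be sketched in Python. This is a simplified sketch, not the algorithm of Fig. 3 itself: the two base extensions are done by exact CRT reconstruction instead of the matrix-form extension with the constants \(w_{ii}\) and \(w'_{ii}\), and the function names `crt`, `base_extend`, and `rns_mred` are assumptions. It requires \(0 \le x < pM\), \(\gcd(p, M) = 1\), \(\gcd(M, M') = 1\), and \(2p \le M'\).

```python
from math import prod


def crt(res, base):
    """Integer in [0, prod(base)) having the given residues."""
    M = prod(base)
    return sum(((r * pow(M // m, -1, m)) % m) * (M // m)
               for r, m in zip(res, base)) % M


def base_extend(res, src, dst):
    """Exact base extension (simplification of the paper's matrix-form extension)."""
    value = crt(res, src)
    return [value % m for m in dst]


def rns_mred(xB, xB2, p, B, B2):
    """Return ({s}_B, {s}_B') with s = x * M^{-1} (mod p) and s < 2p."""
    M = prod(B)
    # Step 1 (base B): q = x * (-p^{-1}) in each channel.
    qB = [(x * (-pow(p, -1, m))) % m for x, m in zip(xB, B)]
    # Steps 2-3: extend q from B to B'.
    qB2 = base_extend(qB, B, B2)
    # Step 4 (base B'): r = x + p*q, which is a multiple of M.
    rB2 = [(x + p * q) % m for x, q, m in zip(xB2, qB2, B2)]
    # Step 5 (base B'): s = r * M^{-1}.
    sB2 = [(r * pow(M, -1, m)) % m for r, m in zip(rB2, B2)]
    # Steps 6-7: extend s from B' to B.
    sB = base_extend(sB2, B2, B)
    return sB, sB2
```

With the toy parameters `B = [13, 17, 19]`, `B2 = [23, 29, 31]`, and `p = 101`, the reconstructed output is congruent to \(xM^{-1}\) modulo p and stays below 2p for any \(x < pM\).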
3.1.2 Requirement for parameters
- (a)
\(m_i\) is a special modulus for fast reduction.
- (b)
\(1/m_i\) can be well approximated as \(1/2^w\).
- (i)
\(\gcd (M, M') = 1\)
- (ii)
\(\gcd (p, M) = 1\)
- (iii)
\(\max (e_1, e_2) \le \alpha < 1\)
- (iv)
\(\beta p \le (1-\alpha )M\)
- (v)
\(2p \le (1-\alpha )M'\)
3.2 G-RNS algorithm
Guillermin proposed an algorithm that at the time achieved the minimum number of unit multiplications [9]. We call this algorithm G-RNS (Fig. 4). Step 1 is the integration of steps 1 and 2 in Fig. 3. Step 2 is from the first term of step 4 combined with steps 5 and 6 of Fig. 3. Step 3 is derived from steps 3–6 of Fig. 3. Step 4 corresponds to step 7 of the basic algorithm. Step 5 is new in Fig. 4.
Elements of the matrices of G-RNS are defined as follows:
Step 1: \(d_{ii} = \left\langle w_{ii} \cdot d_i \right\rangle _{m_i} = \left\langle M_i^{-1}(-p^{-1}) \right\rangle _{m_i}\).
Step 2: \(e_{ii} = \left\langle w'_{ii} \cdot w_i \right\rangle _{m'_i} = \left\langle {M'_i}^{-1} M^{-1} \right\rangle _{m'_i}\).
3.3 C-RNS algorithm
Figure 5 shows an algorithm proposed by Cheung et al. [15]. The elements appearing in each step are defined as follows:
Step 1: \(d_{ii} = \left\langle M_i^{-1}(-p^{-1}) \right\rangle _{m_i}\).
Step 2: \(e_{ii} = \left\langle {M'_i}^{-1} M^{-1} \right\rangle _{m'_i}\).
3.4 R-RNS algorithm
- 1. Input and output are changed from \(\{ x \}_{BB'}\) and \(\{ s \}_{BB'}\) to \(\{ x \}_B \cup \{ \hat{\hat{x}} \}_{B'}\) and \(\{ s \}_B \cup \{ \hat{s} \}_{B'}\), respectively, where elements of \(\{ \hat{\hat{x}} \}_{B'}\) and \(\{ \hat{s} \}_{B'}\) are defined as
$$\begin{aligned}&\left\langle \hat{\hat{x}} \right\rangle _{m_i'} = \left\langle x {M_i'}^{-2} \right\rangle _{m_i'}, \\&\left\langle \hat{s} \right\rangle _{m_i'} = \left\langle s {M_i'}^{-1} \right\rangle _{m_i'}. \end{aligned}$$
- 2. In step 2, elements of the matrix are changed from \(e_{ij}\) to \(e'_{ij}\), where the latter is defined as
$$\begin{aligned} e_{ij}' = \left\langle e_{ij} {M_i'}^2 \right\rangle _{m_i'} = \left\langle M_i' M^{-1} \right\rangle _{m_i'}. \end{aligned}$$
Due to this definition, the following relationship holds.
$$\begin{aligned} \left\langle e_{ij} \cdot x \right\rangle _{m_i'} = \left\langle e_{ij}' \cdot \hat{\hat{x}} \right\rangle _{m_i'} \end{aligned}$$
Thus, the result of step 2 in Fig. 6 is identical to that in Fig. 4.
- 3. Notation of the result in step 3 is changed to \(\left\langle \hat{s}\right\rangle _{m_i'}\), although its value is identical to \(\left\langle \sigma \right\rangle _{m_i'}\), the result of step 3 of Fig. 4.
- 4. Since the new output includes \(\left\langle \hat{s}\right\rangle _{m_i'}\) instead of \(\left\langle s\right\rangle _{m_i'}\), step 5 of Fig. 4 is omitted in Fig. 6. This reduces the number of unit multiplications by n from Fig. 4.
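The relationship in item 2 can be checked numerically. The Python sketch below evaluates both sides for the per-channel constants \(\left\langle {M'_i}^{-1} M^{-1} \right\rangle _{m'_i}\) and \(\left\langle M'_i M^{-1} \right\rangle _{m'_i}\) on a toy pair of bases; the bases, the test value, and the function name `step2_terms` are assumptions chosen only for illustration.

```python
from math import prod


def step2_terms(x, B, B2):
    """For each channel of B', return (e * x, e' * x_hathat) reduced modulo m'_i."""
    M, M2 = prod(B), prod(B2)
    terms = []
    for m in B2:
        Mi = M2 // m                        # M'_i
        Mi_inv = pow(Mi, -1, m)             # M'_i^{-1} mod m'_i
        M_inv = pow(M, -1, m)               # M^{-1} mod m'_i
        e = (Mi_inv * M_inv) % m            # constant used in Fig. 4
        e_prime = (Mi * M_inv) % m          # e' = e * M'_i^2
        x_hathat = (x * Mi_inv * Mi_inv) % m   # x * M'_i^{-2}
        terms.append(((e * x) % m, (e_prime * x_hathat) % m))
    return terms
```

Both components of every returned pair coincide, which is why carrying \(\hat{\hat{x}}\) instead of x in base \(B'\) does not change the result of step 2.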
4 New algorithms
4.1 Derivation of Q-RNS
Most past improvements in RNS M-red algorithms, except [6], were optimizations within one round of M-red execution. Our optimization for Q-RNS M-red is unique in that it transfers a square root\(^{4}\) of a constant from the present round to the previous round.
As seen in Fig. 7, Q-RNS assumes that multiplication is carried out as preprocessing for the next M-red. This assumption ensures that the degree of K is 2. Let us consider possible degrees of K. All RNS M-red algorithms discussed in this paper use two bases, B and \(B'\), with the intent that these M-reds accommodate a number twice the length of what a single base can represent. This means M-red is designed not to accommodate a number with a degree of 3 or more. If the degree of K is 1 or 0, we could cope with such cases by multiplying the input by K or \(K^2\), respectively. Even if such cases should occur, the computation amount would be the same as in Fig. 7 (left).
4.2 sQ-RNS algorithm
Figure 8 shows the sQ-RNS algorithm. The initial "s" indicates single-level rather than double-level Montgomery, the latter being a technique proposed in [13]. sQ-RNS is basically derived according to the procedure shown in Fig. 7 with a small extra optimization.
Note that steps 2 and 3 in Fig. 8 can be carried out simultaneously. Therefore, the computation time of sQ-RNS can be estimated as comparable with twice that of the matrix multiplication. The number of unit multiplications is \((2n^2 + n)\), which is the minimum among all previously proposed RNS M-red algorithms.
4.3 dQ-RNS algorithm
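Table 1 Example base for sQ-RNS (\(n = 4\)), with \(\mu _i = 2^w - m_i\) and \(\mu '_i = 2^w - m'_i\)
Prime | w | \(\mu _1\) | \(\mu _2\) | \(\mu _3\) | \(\mu _4\) | \(\mu '_1\) | \(\mu '_2\) | \(\mu '_3\) | \(\mu '_4\) | \(\max \log _2 \mu \) | \(\max \bar{\mu }\) |
---|---|---|---|---|---|---|---|---|---|---|---|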
NIST P-192 | 50 | 27 | 117 | 351 | 951 | 1163 | 2567 | 2855 | 8543 | 13.0 | 0.26 |
NIST P-224 | 58 | 57 | 63 | 147 | 447 | 27 | 731 | 3807 | 7403 | 12.9 | 0.22 |
NIST P-256 | 65 | 535 | 751 | 3219 | 8031 | 49 | 979 | 2191 | 11,335 | 13.5 | 0.21 |
NIST P-384 | 98 | 51 | 855 | 4343 | 52,155 | 117 | 831 | 1571 | 1827 | 15.7 | 0.16 |
NIST P-521 | 132 | 347 | 363 | 527 | 38,835 | 725 | 6647 | 11,535 | 38,679 | 15.2 | 0.12 |
Curve25519 | 65 | 535 | 2191 | 3219 | 8031 | 49 | 751 | 979 | 11,335 | 13.5 | 0.21 |
4.4 Base search
A base search experiment is carried out for a given modulus p to find RNS bases that satisfy the requirement for quadratic residuosity. To avoid bias, we use five NIST primes [24] and one for Curve25519 [25], which are defined as common moduli for Elliptic Curve Cryptography. The NIST primes are called P-192, P-224, P-256, P-384, and P-521, with numbers representing the bit size of each prime. The prime for Curve25519 is defined as \(2^{255}-19\).
Experiment for sQ-RNS:
We search for bases satisfying Eq. (8) using the following search algorithm.
- 1
Let candidates be an ordered sequence of prime numbers in the form \(c_i =2^w - \mu _i\), where \(\mu _i> \mu _j > 0\) for \(i > j\). The search is done in a smaller-index-first manner.
- 2
\(\mathrm{Pool} = \{ c_i | \mathrm{QR}(c_i, c_j ) \cdot \mathrm{QR}(c_j, c_i ) = 1 \;\mathrm{for }\;i \ne j \}\).
- 3
\(B = \{ c_i | c_i \in \mathrm{Pool} \wedge \mathrm{QR}(p, c_i ) = 1 \wedge |B| = n \}\).
- 4
\(B' = \{ c_i | c_i \in \mathrm{Pool} \wedge c_i \notin B \wedge |B'| = n \}\).
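The four search steps above can be sketched in Python as follows. This is a hedged sketch under several assumptions that go beyond the text: \(\mathrm{QR}(a, m) = 1\) is taken to mean that a is a quadratic residue modulo the prime m (tested with Euler's criterion), the pool is built greedily in smaller-\(\mu\)-first order, and trial division is used for primality, which is only practical for toy word sizes. The names `is_prime`, `is_qr`, and `search_sq_bases` are hypothetical.

```python
def is_prime(c: int) -> bool:
    """Trial-division primality test (adequate for toy-sized candidates only)."""
    if c < 2:
        return False
    if c % 2 == 0:
        return c == 2
    d = 3
    while d * d <= c:
        if c % d == 0:
            return False
        d += 2
    return True


def is_qr(a: int, m: int) -> bool:
    """Euler's criterion: a is a nonzero quadratic residue modulo the odd prime m."""
    return pow(a % m, (m - 1) // 2, m) == 1


def search_sq_bases(p: int, w: int, n: int, max_mu: int):
    """Greedy search for bases B, B' of size n among primes 2**w - mu, mu < max_mu."""
    pool = []
    for mu in range(1, max_mu, 2):          # step 1: smaller index (smaller mu) first
        c = (1 << w) - mu
        if not is_prime(c):
            continue
        # Step 2: keep c only if it is mutually a quadratic residue with the pool.
        if all(is_qr(c, d) and is_qr(d, c) for d in pool):
            pool.append(c)
            B = [e for e in pool if is_qr(p, e)][:n]    # step 3
            B2 = [e for e in pool if e not in B][:n]    # step 4
            if len(B) == n and len(B2) == n:
                return B, B2
    return None                              # no bases found within the mu range
```

Choosing `max_mu = 2 ** (w // 2)` keeps every found modulus within \(\bar{\mu} < 0.5\). A greedy pool may miss valid bases that a more exhaustive search would find.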
Table 1 presents the search results for \(n=4\). The rightmost column shows that these bases satisfy the condition \(\bar{\mu }< 0.5\).
Experiment for dQ-RNS:
We apply the following search algorithm, which generates bases satisfying the condition given by Eq. (9).
- 1
Let seeds be an ordered sequence of odd numbers in the form \(\sigma _i = 2^{w/2} - \nu _i\), where \(\nu _i> \nu _j >0\) for \(i > j\). The search is done in a smaller-index-first manner.
- 2
\(\mathrm{Pool} = \{ \sigma _i | \gcd (\sigma _i, \sigma _j ) = 1 \;\mathrm{for}\; i \ne j \}\).
- 3
\(B = \{ \sigma _i^2 | \sigma _i \in \mathrm{Pool} \wedge \mathrm{QR}(p, \sigma _i ) = 1 \wedge |B| = n \}\).
- 4
\(B' = \{ \sigma _i^2 | \sigma _i \in \mathrm{Pool} \wedge \sigma _i^2 \notin B \wedge |B'| = n \}\).
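A toy version of this seed-based search can be sketched in Python. The sketch makes assumptions beyond the text: \(\mathrm{QR}(p, \sigma _i) = 1\) is interpreted as "p has a square root modulo \(\sigma _i\)" and is tested by brute force (feasible only for very small seeds), the pool is built greedily, and the check of condition (iii) on \(\max (e_1, e_2)\) is omitted. The names `has_sqrt` and `search_dq_bases` are hypothetical.

```python
from math import gcd


def has_sqrt(a: int, s: int) -> bool:
    """Brute-force test whether a is a square modulo s (toy-sized s only)."""
    a %= s
    return any((y * y) % s == a for y in range(s))


def search_dq_bases(p: int, w: int, n: int):
    """Greedy search for bases of squared odd seeds sigma = 2**(w/2) - nu."""
    half = 1 << (w // 2)
    pool = []
    for nu in range(1, half, 2):             # step 1: odd seeds, smaller nu first
        sigma = half - nu
        # Step 2: keep only seeds that are co-prime to every seed already kept.
        if any(gcd(sigma, t) != 1 for t in pool):
            continue
        pool.append(sigma)
        B = [t for t in pool if has_sqrt(p, t)][:n]     # step 3
        B2 = [t for t in pool if t not in B][:n]        # step 4
        if len(B) == n and len(B2) == n:
            return [t * t for t in B], [t * t for t in B2]
    return None
```

For the toy inputs `p = 1009`, `w = 16`, `n = 2`, the seeds visited are 255, 253, 251, 247, and 241 (249, 245, and 243 share a factor with 255), and the sketch returns the squares of 251 and 241 for B and the squares of 255 and 253 for \(B'\).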
Figure 10 shows the search results for \(\alpha = 0.5\) and \(\nu \ge 2\), where \(\nu \) is the degree of laziness. The search succeeds for (n, w) plotted in the figure, although the graph for P-256 is almost hidden behind that of C25519. The search fails when \(\max (e_1, e_2)\) exceeds 0.5 and violates condition (iii) \(\max (e_1, e_2) \le \alpha \) in Sect. 3.1.2. The lower bounds of word length w for success are 22, 24, 24, 26, 28, and 24 bits for P-192, P-224, P-256, P-384, P-521, and Curve25519, respectively. The lower bound \(t_0\) for the bit length necessary for approximation ranges from 3 to 8. Therefore, it is possible to realize a compact computation circuit for \(\hat{L}\) and \(\hat{L}'\). Let \(N_1\) be the number of seeds satisfying \(\mathrm{QR}( p, \sigma _i ) = 1\), and let \(N_0\) be the number of all seeds generated until the algorithm halts. In our experiment, \(N_1 / N_0\) ranges from 0.30 to 0.43, which implies that the probability that \(\mathrm{QR}( p, \sigma _i ) = 1\) is roughly 0.3 to 0.4 for these primes. We also confirm that bases are efficiently found for some randomly chosen non-NIST primes.
Experiments show that bases for dQ-RNS can be found unless the word size w is too small. For instance, \(w \ge 22\) suffices for P-192. Since values less than 22 do not seem to be promising parameters for efficient hardware implementation, dQ-RNS has a sufficient range of word size selection. It is up to hardware designers to determine optimum sizes for specific Q-RNS applications.
Table 2 Comparison of basic representations
Row | Algorithm | Representation | Note |
---|---|---|---|
(a) | Orthodox, Eq. (1) | \(\left\langle x \right\rangle _p\) | Typically, p is a large prime or a product of two primes. |
(b) | M-red, Eq. (3) | \(\overline{ \left\langle xM \right\rangle _p }\) | Setting \(M= 2^l\) is efficient for a binary computer. |
(c) | RNS M-red | \(\left\{ \overline{ \left\langle xM \right\rangle _p } \right\} _{BB'}\) | M is a product of elements in B |
(d) | sQ-RNS M-red | \(\left\{ K \overline{ \left\langle xM \right\rangle _p } \right\} _{BB'}\) | Constant K is a function of \((p, B, B')\). |
(e) | dQ-RNS M-red | \(\left\{ 2^{w/2} K \overline{ \left\langle xM \right\rangle _p } \right\} _{BB'}\) | The double-level Montgomery variant of (d) |
5 Application to cryptography
We discuss several procedures necessary for RNS implementation of public-key cryptography, including Initialize, Finalize, transform to RNS representation (ToRNS, hereafter), and transform to Binary representations (ToBin, hereafter). We also provide formulae for bounds on degree of laziness and for relaxation of reduction within a channel. Although these issues were discussed in previous work, we are interested in the case of Q-RNS and exact expressions of bounds.
5.1 Basic representation
Table 2 shows the representation of the computation result of each algorithm. Row (a) corresponds to the orthodox modular multiplication described by Eq. (1). Row (b) is for the standard Montgomery multiplication defined by Eq. (3), in which x is multiplied by a constant \(M = 2^l\). The bar symbol in row (b) means relaxation of the upper bound of reduction from p to 2p. Row (c) represents conventional RNS M-reds other than R-RNS. This is an immediate transformation of (b) into an RNS representation with bases B and \(B'\). Rows (d) and (e) are for Q-RNS, derived from (c) by multiplying by the constants K and \(2^{w/2}K\), respectively. The representation for the R-RNS algorithm is derived from (d) if we replace coefficient \(\{ K \}_{BB'}\) with \(\{1\}_B \cup \{X\}_{B'}\), where \(\left\langle X\right\rangle _{m_i'} = \left\langle {M_{i}'}^{-1}\right\rangle _{m_i'}\).
Montgomery developed an efficient reduction algorithm (b) by multiplying representation (a) by a constant R (here, M), whereas this paper proposes efficient RNS M-red algorithms (d) and (e) by multiplying representation (c) by the constants K and \(2^{w/2}K\). As a result, (d) and (e) are realized with fewer unit multiplications, and their structures are much simpler. As will be explained in the next subsection, we can embed multiplication by K or \(2^{w/2}K\) into the Initialize process. We can also carry out the removal of K or \(2^{w/2}K\) in parallel with the Finalize process.
5.2 ToRNS and Initialize
5.3 Finalize and ToBin
ToBin in Fig. 11 is derived for dQ-RNS from the one proposed in [9]. In [9], one of the moduli in base B is chosen as \(m_1 = 2^w\), which is used as \(2^w\) in Fig. 11. On the other hand, due to quadratic residuosity, it is not possible to use \(m_1 = 2^w\) for Q-RNS. Therefore, ToBin needs \((n+1)\) more words in its lookup table in step 2 than were used in [9]. In addition, step 4 needs an n-word table, while step 3 needs no table.
In Fig. 11, unit multiplication is basically the Montgomery multiplication, although step 2 is an exception. Typical implementation of step 2 is to apply multiplication without reduction and take the lower w bits. Since steps 2 and 3 can be carried out at the same time, efficient implementation is possible.
5.4 Degree of laziness
We express the upper bound on the degree of laziness \(\nu \) in terms of Q-RNS parameters. As a typical lazy reduction, we consider the product sum.
5.5 Relaxation of reduction within channel
So far, we have assumed that the modular reduction in a unit multiplication is carried out strictly; that is, its result is always less than \(m_i\). It is, however, known for the Montgomery reduction in Fig. 2 that relaxation of reduction is effective in avoiding a conditional branch due to the final subtraction, thus making the implementation simpler. It may also be possible to apply this idea to modular reduction at a unit operation.
Table 3 Comparison of RNS M-red
RNS M-red | Feature | # of unit mult. | Requirements to base (strength) | Requirements to base (conditions) |
---|---|---|---|---|
Basic [3] (Fig. 3) | Straightforward | \(2n^2+5n\) | Weak | Mutually prime; \(\bar{\mu } < 0.5\) |
G-RNS [9] (Fig. 4) | Integrated lookup tables | \(2n^2+3n\) | Weak | Mutually prime; \(\bar{\mu } < 0.5\) |
C-RNS [15] (Fig. 5) | Special form of base extension matrices | \(\le 2n^2+4n\) | Medium | Mutually prime; \(\bar{\mu } < 0.5\); n is smaller |
R-RNS [6] (Fig. 6) | Reorganized G-RNS | \(2n^2 + 2n\) | Weak | Mutually prime; \(\bar{\mu } < 0.5\) |
sQ-RNS (Fig. 8) | Quadratic residuosity (QR) | \(2n^2+n\) | Strong | Mutually prime; \(\bar{\mu } < 0.5\); QR by Eq. (8) |
dQ-RNS (Fig. 9) | QR and the double-level Montgomery | \(2n^2+n\) | Medium | Mutually prime; QR by Eq. (9) |
6 Comparison
6.1 Number of unit multiplications
Table 3 summarizes a comparison of four conventional RNS M-red algorithms and two Q-RNS M-red algorithms. Among these, the proposed ones achieve the least number of unit multiplications. It should be noted that the unit multiplication for dQ-RNS is Montgomery's, while the other algorithms use standard modular multiplication. Note also that if n is small in C-RNS, there is a possibility that one can find base extension matrices with less computation. As for the requirements for base choice, the basic algorithm, G-RNS, and R-RNS pose the weakest requirements, while sQ-RNS poses the strongest. C-RNS and dQ-RNS fall somewhere in between. dQ-RNS has weaker requirements on the RNS base than does sQ-RNS, since it is possible to employ square numbers as elements of the bases.
As in conventional RNS M-reds, it is easy to implement sQ-RNS and dQ-RNS in a parallel processing architecture due to RNS. Since sQ-RNS and dQ-RNS mostly consist of two matrix multiplications, these algorithms have more regularity and simplicity than do conventional ones. From past work, it is clear that Q-RNS can terminate in \((2n^2+n)/n = (2n +1)\) cycles if n processing units operate in parallel. Since the multiplication preceding Q-RNS finishes in \(2n/n = 2\) cycles, the total number of cycles of a Montgomery multiplication is \((2n+1+2\nu )\), where \(\nu \) is the degree of laziness. Another possibility, though less likely, is that with \((n^2 + n)\) unit multipliers, Q-RNS finishes in two cycles. Although this seems theoretically possible, in practice there are several issues to be worked out, such as the feasibility of a register fan-out of n and the design of an efficient circuit for summing up the results from unit multipliers.
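To make the counts concrete, the formulas from Table 3 and the cycle estimates in the paragraph above can be evaluated with a few lines of Python. The dictionary and function names are illustrative assumptions; the formulas themselves are the ones stated in the text.

```python
# Unit-multiplication counts from Table 3 as functions of the base size n.
UNIT_MULTS = {
    "Basic": lambda n: 2 * n * n + 5 * n,
    "G-RNS": lambda n: 2 * n * n + 3 * n,
    "C-RNS (upper bound)": lambda n: 2 * n * n + 4 * n,
    "R-RNS": lambda n: 2 * n * n + 2 * n,
    "sQ-RNS": lambda n: 2 * n * n + n,
    "dQ-RNS": lambda n: 2 * n * n + n,
}


def qrns_reduction_cycles(n: int) -> int:
    """Cycles of one Q-RNS M-red with n parallel units: (2n^2 + n) / n = 2n + 1."""
    return UNIT_MULTS["sQ-RNS"](n) // n


def qrns_total_cycles(n: int, nu: int) -> int:
    """Cycles of one Montgomery multiplication, 2n + 1 + 2*nu, nu = degree of laziness."""
    return 2 * n + 1 + 2 * nu
```

For \(n = 4\), the counts are 52, 44, at most 48, 40, and 36 unit multiplications for Basic, G-RNS, C-RNS, R-RNS, and Q-RNS, respectively.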
Bigou et al. proposed a method that consists of fewer unit multiplications than other RNS M-red algorithms, including Q-RNS, under the hypothesis that the modulus p and the product of base moduli M should satisfy a certain equation [12, 14]. Although Q-RNS also poses quadratic residuosity conditions, their hypothesis is much stronger than that of Q-RNS. Actually, no base exists for NIST primes [14]. In their algorithm, it should be preferable to fix the base first and then determine p under the hypothesis. On the other hand, we can find bases with very high probability not only for NIST primes but also for other primes. Therefore, the discussion in this paper does not include their algorithm for comparison.
6.2 Size of lookup table
Table 4 Number of memory words
Item | G-RNS [9] | R-RNS [6] | sQ-RNS | dQ-RNS |
---|---|---|---|---|
Main body\(^{*1}\) | \(2n^2+5n\) | \(2n^2 + 4n\) | \(2n^2+3n\) | \(2n^2+3n\) |
Base B, \(B'\) | \(+2n\) | \(+2n\) | \(+2n\) | \(+2n\) |
ToRNS | \(+0\) | \(+0\) | \(+0\) | \(+2n^{*2}\) |
Initialize\(^{*3}\) | \(+2n\) | \(+2n\) | \(+2n\) | \(+2n\) |
Finalize\(^{*4}\) | \(+0\) | \(+n\) | \(+2n\) | \(+2n\) |
ToBin\(^{*5}\) | \(+2n\) | \(+2n\) | \(+(3n+1)\) | \(+(2n+1)\) |
Total | \(2n^2+11n\) | \(2n^2 + 11n\) | \(2n^2+12n+1\) | \(2n^2+13n+1\) |
Table 4 compares the lookup table sizes required by the four algorithms G-RNS, R-RNS, sQ-RNS, and dQ-RNS. Compared with G-RNS and R-RNS, sQ-RNS and dQ-RNS need only \((n+1)\) and \((2n+1)\) words of extra memory, respectively. For this small amount of additional memory, the Q-RNS algorithms offer a clear benefit in the reduced number of multiplications and the simplicity of the algorithm. A toy example of parameters is shown in "Appendix C".
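The totals in Table 4 can be re-derived by summing the rows. In the Python sketch below each entry is written as a coefficient triple (a, b, c) standing for \(an^2 + bn + c\); this encoding and the names `ROWS` and `total` are assumptions made for illustration, while the values are copied from the table.

```python
# Rows of Table 4 in order: main body, bases, ToRNS, Initialize, Finalize, ToBin.
ROWS = {
    "G-RNS":  [(2, 5, 0), (0, 2, 0), (0, 0, 0), (0, 2, 0), (0, 0, 0), (0, 2, 0)],
    "R-RNS":  [(2, 4, 0), (0, 2, 0), (0, 0, 0), (0, 2, 0), (0, 1, 0), (0, 2, 0)],
    "sQ-RNS": [(2, 3, 0), (0, 2, 0), (0, 0, 0), (0, 2, 0), (0, 2, 0), (0, 3, 1)],
    "dQ-RNS": [(2, 3, 0), (0, 2, 0), (0, 2, 0), (0, 2, 0), (0, 2, 0), (0, 2, 1)],
}


def total(algorithm: str):
    """Sum the coefficient triples of one column of Table 4."""
    return tuple(sum(col) for col in zip(*ROWS[algorithm]))


def words(algorithm: str, n: int) -> int:
    """Number of memory words for a given base size n."""
    a, b, c = total(algorithm)
    return a * n * n + b * n + c
```

The sums reproduce the Total row, and the differences from G-RNS are \(n + 1\) words for sQ-RNS and \(2n + 1\) words for dQ-RNS.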
6.3 FPGA implementation
Figure 12 shows the main operation units, a multiply-and-add unit and a modular reduction unit, where the latter carries out the fast reduction algorithm presented in Sect. 2.6. Let \(c_m\) and \(c_r\) be the clock cycles required to carry out these operations, respectively. In our implementation, it follows that \(c_m=1\) and \(c_r=2\). n sets of these operation units are prepared. We use almost the same configuration for both sQ-RNS and R-RNS.
Synthesis results
Item | (a) sQ-RNS | (b) R-RNS |
---|---|---|
Word length w | 65 | 65 |
Base size n | 4 | 4 |
Number of operation units | 4 | 4 |
Clock cycles | 15 | 18 |
Max. frequency (MHz) | 139.5 | 142.7 |
Hardware components | ||
LUT | 4076 | 4247 |
FF | 2104 | 2329 |
DSP | 84 | 84 |
FPGA device | Kintex® UltraScale+™ | |
Compiler | Vivado® 2018.1 |
7 Conclusion
This paper proposed new RNS Montgomery reduction algorithms, namely, sQ-RNS and dQ-RNS, which are derived by imposing quadratic residuosity requirements on RNS bases. They require fewer unit multiplications than all previously proposed algorithms. The size of the lookup tables they use is comparable with that of conventional ones. Improvement over the R-RNS algorithm was confirmed with FPGA implementations. Since the proposed algorithms have more regularity and symmetry than do conventional ones, it may be worth studying software implementations for multi-core processors. Another topic for future study is improvement of the two base search algorithms proposed in this paper.
Footnotes
- 1.
\(s = (x+pq)/R< (\beta p^2 + pR)/R = (\beta p/R + 1)p < 2p\).
- 2.
d is used because it looks like an inverted p, which makes it easier to relate to \(p^{-1}\).
- 3.
w is similarly used because it looks like an inverted M.
- 4.
If more than one square root of the constant exists, any of them can be used to construct Q-RNS M-red.
References
- 1. Posch, K.C., Posch, R.: Modulo reduction in residue number systems. IEEE Trans. Parallel Distrib. Syst. 6(5), 449–454 (1995)
- 2. Schwemmlein, J., Posch, K.C., Posch, R.: RNS-modulo reduction upon a restricted base value set and its applicability to RSA cryptography. Comput. Secur. 17(7), 637–650 (1998)
- 3. Kawamura, S., Koike, M., Sano, F., Shimbo, A.: Cox-rower architecture for fast parallel Montgomery multiplication. In: EUROCRYPT2000, LNCS1807, pp. 523–538. Springer (2000)
- 4. Nozaki, H., Motoyama, M., Shimbo, A., Kawamura, S.: Implementation of RSA algorithm based on RNS Montgomery multiplication. In: CHES2001, LNCS2162, pp. 364–376. Springer (2001)
- 5. Bajard, J.-C., Imbert, L.: A full RNS implementation of RSA. IEEE Trans. Comput. (Brief Contrib.) 53(6), 769–774 (2004)
- 6. Gandino, F., Lamberti, F., Paravati, G., Bajard, J.-C., Montuschi, P.: An algorithmic and architectural study of Montgomery exponentiation in RNS. IEEE Trans. Comput. 61(8), 1071–1083 (2012)
- 7. Schinianakis, D.M., Kakarountas, A.P., Stouraitis, T.: A new approach to elliptic curve cryptography: an RNS architecture. In: Proceedings of IEEE MELECON 2006, May 16–19, Benalmadena (Malaga), Spain, pp. 1241–1245 (2006)
- 8. Schinianakis, D.M., Fournaris, A.P., Michail, H.E., Kakarountas, A.P., Stouraitis, T.: An RNS implementation of an Fp elliptic curve point multiplier. IEEE Trans. Circuits Syst. 56(6), 1202–1213 (2009)
- 9. Guillermin, N.: A high speed coprocessor for elliptic curve scalar multiplications over Fp. In: CHES2010, LNCS6225, pp. 48–64. Springer (2010)
- 10. Antão, S., Bajard, J.-C., Sousa, L.: RNS-based elliptic curve point multiplication for massive parallel architectures. Comput. J. 55(5), 629–647 (2012)
- 11. Schinianakis, D.M., Stouraitis, T.: Multifunction residue architectures for cryptography. IEEE Trans. Circuits Syst. 61(4), 1156–1169 (2014)
- 12. Bigou, K., Tisserand, A.: RNS modular multiplication through reduced base extensions. In: ASAP, pp. 57–62. IEEE (2014)
- 13. Bajard, J.-C., Merkiche, N.: Double level Montgomery cox-rower architecture, new bounds. In: Smart Card Research and Advanced Applications (CARDIS), LNCS 8968, pp. 139–153. Springer (2015)
- 14. Bigou, K., Tisserand, A.: Single base modular multiplication for efficient hardware RNS implementations of ECC. In: CHES2015, LNCS9293, pp. 123–140. Springer (2015)
- 15. Cheung, R., Duquesne, S., Fan, J., Guillermin, N., Verbauwhede, I., Yao, G.: FPGA implementation of pairing using residue number system and lazy reduction. In: CHES2011, LNCS6917, pp. 421–441. Springer (2011)
- 16. Yao, G.X., Fan, J., Cheung, R.C.C., Verbauwhede, I.: Faster pairing coprocessor architecture. In: Pairing 2012, LNCS 7708, pp. 160–176. Springer (2012)
- 17. Bigou, K., Tisserand, A.: Improving modular inversion in RNS using the plus-minus methods. In: CHES 2013, LNCS8086, pp. 233–249. Springer (2013)
- 18. Bajard, J.-C., Eynard, J., Merkiche, N., Plantard, T.: RNS arithmetic approach in lattice-based cryptography. In: 22nd IEEE Symposium on Computer Arithmetic (2015)
- 19. Gérard, B., Kammerer, J.-G., Merkiche, N.: Contribution to the design of RNS architecture. In: 22nd IEEE Symposium on Computer Arithmetic (2015)
- 20. Bajard, J.-C., Eynard, J., Merkiche, N.: Montgomery reduction within the context of residue number system arithmetic. Special Issue on Montgomery Arithmetic. J. Cryptogr. Eng. https://doi.org/10.1007/s13389-017-0154-9
- 21. Montgomery, P.L.: Modular multiplication without trial division. Math. Comput. 44(170), 519–521 (1985)
- 22. Ananda Mohan, P.V.: Residue Number Systems: Theory and Applications. Birkhäuser. ISBN: 978-3-319-41383-9 (2016)
- 23. Kawamura, S., Yonemura, T., Komano, Y., Shimizu, H.: Exact error bound of cox-rower architecture for RNS arithmetic. Cryptology ePrint Archive: Report 2016/266, March (2016). https://eprint.iacr.org/2016/266
- 24. Federal Information Processing Standards Publication: FIPS186-4 “Digital Signature Standard (DSS).” Appendix D, National Institute of Standards and Technology, July (2013)
- 25. Bernstein, D.J.: Curve25519: new Diffie-Hellman speed records. In: Public Key Cryptography - PKC 2006, LNCS 3958, pp. 207–228. Springer (2006)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.