1 Introduction

The residue number system (RNS) is a method for representing an integer in which a given integer x is represented by its residues with respect to a base of pairwise co-prime integers. If we denote the base by \(B=\{m_1, m_2, \dots , m_n \}\) and the RNS representation of x as \([ x_1,x_2, \dots , x_n ]\), then \(x_i=x \bmod m_i\). The main feature of RNS is that addition, subtraction, and multiplication are carried out by independent additions, subtractions, and multiplications with respect to each base element. The operation flow at each base element is called a channel. If each channel has its own processing unit, an n-fold speedup can be achieved compared with the case of a single processing unit. This parallelism is attractive for efficient computation of public-key cryptography, which is built from integer operations of several hundred or several thousand bits combined with modular reduction. However, modular reduction in RNS was not easy to carry out until it was replaced with Montgomery reduction (M-red) in [1]. Following the proposal of RNS M-red, promising results have been obtained for the RSA algorithm [2,3,4,5,6], Elliptic Curve Cryptosystems [7,8,9,10,11,12,13,14], Pairing-based Cryptosystems [15, 16], modular inversion [17], Lattice-based Cryptosystems [18], and an architectural study [19]. In parallel with these applications, improvements to the RNS M-red algorithm itself have been proposed [3, 5, 6, 9, 13,14,15]. An overview of this line of research is presented in [20].

This paper proposes improved RNS M-red algorithms, called Q-RNS M-reds, which, by imposing quadratic residuosity constraints on the RNS base, achieve the smallest number of unit multiplications. Past improvements to RNS M-red algorithms, with exceptions such as [6], were optimizations within one round of M-red execution, whereas our optimization for Q-RNS M-red is novel in that it transfers the square root of a constant from the current round to the previous round. Q-RNS comprises two concrete algorithms, called sQ-RNS and dQ-RNS, which differ in the multiplication unit used.

This paper is organized as follows. Section 2 introduces the notation and basic concepts used in this paper. Section 3 explains the conventional RNS M-red algorithms. In Sect. 4, we introduce the new idea of using quadratic residuosity to simplify the M-red algorithm. The first variant, sQ-RNS M-red, is a direct combination of the new idea with a conventional algorithm. The second variant, dQ-RNS M-red, relaxes the constraints on the RNS base choice by introducing the double-level Montgomery technique from [13]. Other procedures necessary to implement public-key cryptography, such as Initialize, are discussed in Sect. 5. Section 6 compares RNS M-red algorithms, including FPGA implementations, and Sect. 7 concludes this paper.

2 Basic concepts

2.1 Notation

The following definitions are applied in this paper.

w : Bit size of a word in a given computer.

$$\begin{aligned}&\left\langle x \right\rangle _m = x \bmod m,\hbox { where }\left\langle x \right\rangle _m \in [ 0, m).\\&\overline{ \left\langle x \right\rangle _m } \equiv x \; ( \bmod m )\hbox { and }\overline{ \left\langle x \right\rangle _m} \in [ 0, 2m). \end{aligned}$$

Base \(B = \{ m_1, \dots , m_n \}\), where \(\gcd ( m_i, m_j ) = 1\) for \(i \ne j\).

Base \(B' = \{ m'_1, \dots , m'_n \}\), where \(\gcd ( m'_i, m'_j ) = 1\) for \(i \ne j\).

|B| : Size of a set B.

$$\begin{aligned} M= & {} \prod _{i=1}^n m_i, M' = \prod _{i=1}^n m'_i,\hbox { where }\gcd (M, M') = 1.\\ M_i= & {} M/m_i, M'_i = M'/m'_i. \end{aligned}$$

\(\left\langle x^{-1} \right\rangle _m\): The multiplicative inverse of x modulo m, which exists if \(\gcd (x, m)=1\).

$$\begin{aligned} \{ x \}_B= & {} \left[ \left\langle x \right\rangle _{m_1}, \dots , \left\langle x \right\rangle _{m_n} \right] .\\ \{ x \}_{B'}= & {} \left[ \left\langle x \right\rangle _{m'_1}, \dots , \left\langle x \right\rangle _{m'_n} \right] .\\ \{ x \}_{BB'}= & {} \left[ \{ x \}_B, \{ x \}_{B'} \right] . \end{aligned}$$

Transpose T:

$$\begin{aligned}&\{x\}_B^T=\left[ \begin{array}{c} \left\langle x \right\rangle _{m_1} \\ \vdots \\ \left\langle x \right\rangle _{m_n} \end{array} \right] .\\&\left\langle x \right\rangle _m \otimes \left\langle y \right\rangle _m = \left\langle xy \right\rangle _m.\\&\{ x \}_B \otimes \{ y \}_B = \{ xy \}_B = \left[ \left\langle xy \right\rangle _{m_1}, \dots , \left\langle xy \right\rangle _{m_n} \right] . \end{aligned}$$

\(\otimes _M\): Single-word Montgomery multiplication.

\(\left\langle x \right\rangle _m \otimes _M \left\langle y \right\rangle _m = \left\langle xy2^{-w} \right\rangle _m\).

$$\begin{aligned} \{ x \}_B \otimes _M \{ y \}_B= & {} \{ xy2^{-w} \}_B \\= & {} \left[ \left\langle xy2^{-w} \right\rangle _{m_1}, \dots , \left\langle xy2^{-w} \right\rangle _{m_n} \right] . \end{aligned}$$

In this paper, matrix expressions are used to describe parallel processing using RNS. If the matrix is diagonal, no substantive mixture of B and \(B'\) in an operation occurs. In such cases, definitions above are sufficient to carry out the matrix operations. If different bases appear in an operation, which occurs for the base extension operation or the ToBin transformation, the following computation rules will apply.

$$\begin{aligned} \left[ \begin{array}{l} \left\langle a_1 \right\rangle _{m'_i}, \ldots , \left\langle a_n \right\rangle _{m'_i} \end{array} \right] \otimes \{x\}_B^T + \hat{ L } \otimes \left\langle a \right\rangle _{m'_i} \\ = \left\langle \sum _{k=1}^n \left\langle a_k \right\rangle _{m'_i} \left\langle x \right\rangle _{m_k} + \hat{ L } \left\langle a \right\rangle _{m'_i} \right\rangle _{m'_i}. \end{aligned}$$

If the multiplication unit is the Montgomery one, then

$$\begin{aligned} \left[ \begin{array}{c} \left\langle a_1 \right\rangle _{m'_i}, \ldots , \left\langle a_n \right\rangle _{m'_i} \end{array} \right] \otimes _M \{x\}_B^T + \hat{ L } \otimes _M \left\langle a \right\rangle _{m'_i} \\ = \left\langle \sum _{k=1}^n \left\langle a_k \right\rangle _{m'_i} \left\langle x \right\rangle _{m_k}2^{-w} + \hat{ L } \left\langle a \right\rangle _{m'_i}2^{-w} \right\rangle _{m'_i}. \end{aligned}$$

These definitions suffice to carry out the matrix computations appearing in this paper. Note that the matrix computation in this paper differs from standard matrix computation in that the result in each row is reduced by a modulus unique to that row. Therefore, no inverse matrix can be defined here. However, this representation effectively simplifies the description of the algorithms and makes it easy to count the number of operations.

2.2 Modular multiplication

Most public-key cryptosystems are implemented by repetition of a modular multiplication with a large modulus p, that is,

$$\begin{aligned} z = xy \bmod p, \end{aligned}$$
(1)

where \(p < 2^l\). Here, l ranges from several hundred to several thousand, and p is usually a large prime or a product of two large primes.

Suppose that we need to add the results of two modular multiplications. If we run a modular multiplication twice, at least two multiplications and two modular reductions are necessary. If we instead use the equation

$$\begin{aligned} z = ( x_1 y_1 + x_2 y_2 ) \bmod p, \end{aligned}$$
(2)

only a single reduction is needed to obtain the result after the two multiplications and an addition. This technique is called lazy reduction, and it effectively reduces the number of reduction operations when a sum of several modular multiplications is to be computed. Such a case frequently occurs in implementations of Elliptic Curve Cryptography. We define the number of terms \(\nu \) as the degree of laziness. For example, Eq. (2) has degree \(\nu = 2\).
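As a concrete illustration, the following toy sketch compares the eager and lazy evaluations of Eq. (2); the modulus and operands are arbitrary small example values.

```python
# A toy illustration of lazy reduction with degree of laziness nu = 2;
# p and the operands are arbitrary small example values.
p = 9973
x1, y1, x2, y2 = 1234, 5678, 4321, 8765

z_eager = ((x1 * y1) % p + (x2 * y2) % p) % p  # one reduction per product
z_lazy = (x1 * y1 + x2 * y2) % p               # Eq. (2): a single reduction
assert z_lazy == z_eager
```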

2.3 Montgomery reduction

When implementing modular multiplication, one option is simply to use Eq. (1). However, to avoid the conditional branches inherent in the division operation, another popular option is to implement it using the Montgomery multiplication below [21].

$$\begin{aligned} v = xyR^{-1} \bmod p \end{aligned}$$
(3)

Figure 1 shows details of the procedure. Step 1 is a multiplication followed by Montgomery reduction in steps 2–5, which here is called M-red. For correct results, it suffices that \(\gcd (R, p) =1\) and \(R > p\). Since p is usually an odd number, choosing \(R = 2^l\) satisfies these conditions. In this setting, step 4 is carried out simply by a shifting operation. Step 5 is called the final subtraction, which makes the computation result less than p.

Fig. 1: Montgomery multiplication \( MM (x, y)\)

Let \( MM (x, y)\) be the right-hand side of Eq. (3). Using \( MM (x, y)\), the procedure to compute a modular multiplication is described as follows.

Initialize:

$$\begin{aligned} x'= & {} MM (x, R^2) = xR^2R^{-1} \bmod p \\= & {} xR \bmod p \\ y'= & {} MM ({y}, R^2) = yR^2R^{-1} \bmod p \\= & {} yR \bmod p \end{aligned}$$

Main body:

$$\begin{aligned} z'= & {} MM (x', y') = (xR)(yR)R^{-1} \bmod p \\= & {} xyR \bmod p \end{aligned}$$

Finalize:

$$\begin{aligned} z= & {} MM ( z', 1) = (xyR) \cdot 1 \cdot R^{-1} \bmod p \\= & {} xy \bmod p \end{aligned}$$

If our goal is a single modular multiplication, calling \( MM (x, y)\) four times is not efficient. However, when computing many modular multiplications, the overheads of Initialize and Finalize become negligible, because they are called once at the beginning and once at the end, respectively.
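The following sketch traces this flow at toy sizes; p, l, and the operands are illustrative, and mm follows Fig. 1 with \(R = 2^l\) and one final subtraction.

```python
# A sketch of the Initialize / Main body / Finalize flow around MM(x, y),
# assuming R = 2**l with gcd(R, p) = 1; all parameters are toy values.
def mm(x, y, p, l, p_inv_neg):
    """MM(x, y) = x*y*R^{-1} mod p with R = 2**l, per Fig. 1."""
    R_mask = (1 << l) - 1
    u = x * y
    q = (u * p_inv_neg) & R_mask        # q = u * (-p^{-1}) mod R
    v = (u + q * p) >> l                # exact division by R (a shift)
    return v - p if v >= p else v       # final subtraction

p, l = 101, 8                           # toy parameters with p < 2**l
R = 1 << l
p_inv_neg = (-pow(p, -1, R)) % R

x, y = 55, 76
x1 = mm(x, pow(R, 2, p), p, l, p_inv_neg)   # Initialize: xR mod p
y1 = mm(y, pow(R, 2, p), p, l, p_inv_neg)   # Initialize: yR mod p
z1 = mm(x1, y1, p, l, p_inv_neg)            # Main body:  xyR mod p
z = mm(z1, 1, p, l, p_inv_neg)              # Finalize:   xy mod p
assert z == (x * y) % p
```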

The goal of this paper is to propose an efficient algorithm to compute the Montgomery reduction in RNS, that is, an RNS M-red algorithm.

2.4 RNS

Let \(B = \{ m_1, \dots , m_n \}\) be a base for RNS representation, where \(\gcd (m_i, m_j) = 1\) holds for \( i \ne j\). RNS representation of an integer x is given by

$$\begin{aligned} \{ x \}_B = [ \left\langle x \right\rangle _{m_1}, \dots , \left\langle x \right\rangle _{m_n} ]. \end{aligned}$$

The symbol \(\left\langle x \right\rangle _m\) is defined as \(\left\langle x \right\rangle _m = x \bmod m\), and thus \(\left\langle x \right\rangle _m \in [0, m)\) holds. The n-tuple on the right is called the RNS representation of x in base B. The representation is unique if \(0 \le x < M\), where \(M = \prod _{i=1}^{n} m_i\). This representation allows fast arithmetic in \(\mathbb {Z}/M\mathbb {Z}\) since

$$\begin{aligned} \{ x \}_B \odot \{ y \}_B = [\left\langle x \odot y \right\rangle _{m_1}, \dots , \left\langle x \odot y \right\rangle _{m_n} ], \end{aligned}$$

where \(\odot \in \{ +, -, \times , / \}\). ‘ / ’ applies only if y is co-prime to M.

In RNS, a large integer can be processed with independent parallel operations in each channel. If we could use small factors of the public modulus p as an RNS base, a very efficient implementation would be realized. It is, however, difficult to employ such an approach, since p is usually a large prime or a product of two large primes.
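The channel-wise parallelism can be sketched as follows, assuming a toy base of small pairwise co-prime moduli rather than the w-bit moduli used later in this paper.

```python
# A minimal sketch of RNS channel arithmetic over a toy base.
from math import prod

B = [13, 17, 19, 23]                    # gcd(m_i, m_j) = 1 for i != j
M = prod(B)                             # unique representation for 0 <= x < M

def to_rns(x):
    """{x}_B = [<x>_{m_1}, ..., <x>_{m_n}]."""
    return [x % m for m in B]

def rns_op(xs, ys, op):
    """Channel-wise operation: each channel works independently."""
    return [op(x, y) % m for x, y, m in zip(xs, ys, B)]

x, y = 1234, 5678
assert rns_op(to_rns(x), to_rns(y), lambda a, b: a * b) == to_rns(x * y % M)
assert rns_op(to_rns(x), to_rns(y), lambda a, b: a + b) == to_rns((x + y) % M)
```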

2.5 Chinese remainder theorem

According to the Chinese remainder theorem, the integer x represented in RNS is recovered by

$$\begin{aligned} x = \sum _{i=1}^{n} \left\langle x \right\rangle _{m_i} \left\langle M_i^{-1} \right\rangle _{m_i}M_i \bmod M. \end{aligned}$$

Let us consider a method avoiding the modulo M operation in evaluating the right-hand side. Since we can replace \(\left\langle x \right\rangle _{m_i} \left\langle M_i^{-1} \right\rangle _{m_i}\) by \({\xi _i(x)} = \left\langle xM_i^{-1} \right\rangle _{m_i}\) without affecting the equality, we can rewrite the equation with a new unknown integer L as

$$\begin{aligned} x = \sum _{i=1}^n {\xi _i(x)} M_i - LM. \end{aligned}$$

Considering \(0 \le x/M < 1\), we obtain the following equation.

$$\begin{aligned} L = \left\lfloor \sum _{i=1}^n \frac{{\xi _i(x)}}{m_i} \right\rfloor \end{aligned}$$

An approximation \(\hat{L}\) of L is proposed in [3] for an appropriate offset \(\alpha \in [0, 1)\),

$$\begin{aligned} \hat{L} = \left\lfloor \alpha + \sum _{i=1}^n \frac{\mathrm{trunc}( t, {\xi _i(x)})}{2^w} \right\rfloor . \end{aligned}$$

\(\mathrm{trunc}(t, x)\) is a function to force the lower \(w-t\) bits of x to zero.

$$\begin{aligned} \mathrm{trunc}( t, x) = \left\lfloor x2^{-(w-t)} \right\rfloor \cdot 2^{(w-t)}, \end{aligned}$$

where \(0<t \le w\). The difference between \(\hat{L}\) and L can be at most 1 if appropriate t and \(\alpha \) are selected. This approximation is the most important part of the base extension process in the computation of RNS M-red.
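A minimal sketch of the exact L and the approximation \(\hat{L}\) follows, with toy moduli \(m_i = 2^w - \mu _i\); all values are illustrative, and Fraction keeps the exact sum free of floating-point error.

```python
# A sketch of L versus L-hat for toy moduli m_i = 2**w - mu_i.
from math import prod, floor
from fractions import Fraction

w, t, alpha = 8, 6, 0.0
B = [251, 247, 241, 239]                        # pairwise co-prime, mu_i small
M = prod(B)

def trunc(t, x):
    """Force the lower w - t bits of x to zero."""
    return (x >> (w - t)) << (w - t)

x = 123456789 % M
xi = [x * pow(M // m, -1, m) % m for m in B]    # xi_i(x) = <x * M_i^{-1}>_{m_i}
L = floor(sum(Fraction(f, m) for f, m in zip(xi, B)))
L_hat = floor(alpha + sum(trunc(t, f) / 2**w for f in xi))
assert x == sum(f * (M // m) for f, m in zip(xi, B)) - L * M   # CRT identity
assert L - 1 <= L_hat <= L                      # L-hat misses L by at most 1
```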

2.6 Special modulus for fast reduction

Usually, arbitrary moduli can be selected as RNS base elements so long as they are pairwise co-prime. It is well known that a pseudo-Mersenne prime \(m_i = 2^w - \mu _i\) is useful for fast reduction. Actually, if

$$\begin{aligned} \mu _i < 2^{\lfloor w/2 \rfloor } \end{aligned}$$
(4)

holds, \(x \bmod m_i\) can be computed efficiently with the following procedure. First, repeat the operation below twice:

$$\begin{aligned} x \leftarrow (x \bmod 2^w) + \mu _i \cdot \lfloor x/2^w \rfloor . \end{aligned}$$

If \(x < 2^{2w}\) holds for the initial value, the result is in \([0, 2^{w+1})\). The final result is obtained by subtracting \(m_i\) at most once. In addition, if the Hamming weight of \(\mu _i\) is small, multiplication by \(\mu _i\) can be replaced by several additions.

Let \(\bar{\mu }= (1/w) \log _2 \mu _i\). Equation (4) is satisfied if \(\bar{\mu }< 0.5\).
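The procedure can be sketched as follows, assuming a toy pseudo-Mersenne modulus; w = 16 and \(\mu = 15\) are illustrative values satisfying Eq. (4).

```python
# A sketch of the fast reduction for a pseudo-Mersenne modulus
# m = 2**w - mu with mu < 2**(w//2), assuming an input x < 2**(2w).
def fast_reduce(x, w, mu):
    m = (1 << w) - mu
    for _ in range(2):                            # fold the upper part down twice
        x = (x & ((1 << w) - 1)) + mu * (x >> w)  # x <- (x mod 2**w) + mu*floor(x/2**w)
    return x - m if x >= m else x                 # at most one final subtraction

w, mu = 16, 15                                    # m = 2**16 - 15 = 65521
x = 0xFFFF_FFF0                                   # any x < 2**32
assert fast_reduce(x, w, mu) == x % ((1 << w) - mu)
```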

2.7 Quadratic residuosity

An integer a is called a quadratic residue modulo m if there exists a solution x to the congruence

$$\begin{aligned} x^2 \equiv a \,(\bmod m). \end{aligned}$$

In other words, a is called a quadratic residue if it has a square root modulo m, and a quadratic non-residue otherwise. Unlike the case of real numbers, not every integer has a square root for a given modulus m. An integer a can be a quadratic residue or a quadratic non-residue, depending on the value of the modulus m.

Let a function \(\mathrm{QR}(a, m)\) be defined by

$$\begin{aligned} \mathrm{QR}( a, m) = \left\{ \begin{array}{ll} 1 &{} (\mathrm{if}\; \exists x \;\mathrm{s.t.}\; x^2 \equiv a \,(\bmod m) ) \\ 0 &{} (\mathrm{otherwise}) \end{array} \right. \end{aligned}$$

This function will be used as a distinguisher of a quadratic residue. If \(\mathrm{QR}(a, m) = 1\), let \(\left\langle a^{1/2} \right\rangle _m\) denote one of the square roots of a modulo m.

Quadratic residuosity has been used with RNS in signal processing applications to represent complex signals (see, for instance, Section 8.1 of [22]), whereas no previously proposed RNS M-red algorithm has used quadratic residuosity. This paper applies quadratic residuosity to the RNS M-red algorithm for the first time, constructing algorithms that require the fewest unit multiplications.
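For an odd prime modulus, QR(a, m) can be evaluated with Euler's criterion, and for \(m \equiv 3 \,(\bmod 4)\) a square root has a closed form. The sketch below assumes this easy case; these are standard number-theoretic facts, not specific to this paper.

```python
# A sketch of QR(a, m) and <a^(1/2)>_m, assuming an odd prime m.
def qr(a, m):
    """1 if a is a quadratic residue modulo the odd prime m, else 0
    (Euler's criterion)."""
    return 1 if pow(a % m, (m - 1) // 2, m) == 1 else 0

def mod_sqrt(a, m):
    """One square root <a^(1/2)>_m, valid for prime m = 3 (mod 4)."""
    assert m % 4 == 3 and qr(a, m) == 1
    return pow(a % m, (m + 1) // 4, m)

m = 251                       # an odd prime with m = 3 (mod 4)
a = 17 * 17 % m               # a quadratic residue by construction
r = mod_sqrt(a, m)
assert qr(a, m) == 1 and r * r % m == a
```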

3 Conventional algorithms

3.1 Basic RNS M-red algorithm

3.1.1 Algorithm

Figure 2 shows the Montgomery reduction algorithm corresponding to steps 2–5 of Fig. 1. By relaxing the range of the output to less than 2p, the final subtraction has been removed. In addition, the upper bound of the input is also relaxed from \(p^2\) to \(\beta p^2\) with \(\beta \ge 4\). The condition \(R > \beta p\) ensures that the output is less than 2p (Footnote 1). All RNS M-red algorithms in this paper can be regarded as RNS variants of this M-red algorithm.

Fig. 2: Montgomery reduction (M-red)

Figure 3 shows the RNS M-red algorithm derived straightforwardly from the M-red in Fig. 2. A description of each step is given on the same line as the step number, followed by the actual specification in matrix form. Steps 1, 4, and 5 correspond to steps 1, 2, and 3 in Fig. 2, respectively. Steps 2 and 3 derive the approximation \(\{ q' \}_{B'}\) from \(\{ q \}_B\), a technique called base extension. Similarly, steps 6 and 7 are also a base extension, deriving \(\{ s \}_B\) from \(\{ s \}_{B'}\). The constant R is set to \(R = M = \prod _{i=1}^n m_i\) for RNS M-red, whereas \(R = 2^l\) is the common setting for binary M-red.

Fig. 3: Basic RNS M-red algorithm [3]

Since step 1 is carried out in base B, reduction modulo M is automatically applied and the result is equivalent to that of step 1 in Fig. 2. Steps 4 and 5 must be carried out in base \(B'\), for the following reasons. As for step 4, it is of no use computing \((x+pq)\) in B, because the result is always a multiple of M and thus always 0 in base B. The computation in step 5 multiplies r by \(M^{-1}\). This can be carried out in base \(B'\) but not in base B, since \(M^{-1}\) does not exist modulo the elements of B. Although the final result s is computed at step 5, it is only represented in base \(B'\). In order to complete the representation in base B, steps 6 and 7 extend \(\{s\}_{B'}\) to \(\{ s \}_B\). This ensures compatibility between the output and input of the RNS M-red algorithm.

The matrix elements in each step are defined as follows:

Step 1 (Footnote 2): \(d_i = \left\langle -p^{-1} \right\rangle _{m_i}\).

Step 2 (Footnote 3): \(w_{ii} = \left\langle M_i^{-1} \right\rangle _{m_i}\).

Step 3: The first base extension.

$$\begin{aligned} a_{ij}= & {} \left\langle M_j \right\rangle _{m'_i}, \\ a_i= & {} \left\langle -M \right\rangle _{m'_i}. \end{aligned}$$

Step 4: \(p_i = \left\langle p \right\rangle _{m'_i}\).

Step 5: \(w_i = \left\langle M^{-1} \right\rangle _{m'_i}\).

Step 6: \(w'_{ii} = \left\langle {M'_i}^{-1} \right\rangle _{m'_i}\).

Step 7: The second base extension.

$$\begin{aligned} b_{ij}= & {} \left\langle M'_j \right\rangle _{m_i}, \\ b_i= & {} \left\langle -M' \right\rangle _{m_i}. \end{aligned}$$

This algorithm requires \((2n^2+5n)\) unit multiplications. We exclude the multiplications by \(\hat{L}\) at steps 3 and 7, because a technique in [3] shows how to carry out each of them with fewer than n additions.
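To make the data flow concrete, the following sketch runs the basic RNS M-red of Fig. 3 at toy sizes, using the exact L of Sect. 2.5 instead of the approximation \(\hat{L}\) so that both base extensions are error-free; all moduli and the modulus p are illustrative.

```python
# A sketch of the basic RNS M-red flow (Fig. 3) with exact base extension.
from math import prod, floor
from fractions import Fraction

B = [251, 241, 239, 233]       # base B: pairwise co-prime
Bp = [256, 255, 253, 247]      # base B': co-prime to each other and to B
M, Mp = prod(B), prod(Bp)      # results must stay below Mp for step 7
p = 9973                       # odd toy modulus, gcd(p, M) = gcd(p, Mp) = 1

def base_extend(res, src, dst):
    """Extend {x}_src to {x}_dst via x = sum(xi_i * M_i) - L*M
    (steps 2-3 and 6-7 of Fig. 3, with the exact L)."""
    Ms = prod(src)
    xi = [r * pow(Ms // m, -1, m) % m for r, m in zip(res, src)]
    L = floor(sum(Fraction(f, m) for f, m in zip(xi, src)))
    return [(sum(f * (Ms // m) for f, m in zip(xi, src)) - L * Ms) % md
            for md in dst]

def rns_mred(x):
    """Return ({s}_B, {s}_B') with s = x * M^{-1} (mod p) and s < 2p."""
    xB, xBp = [x % m for m in B], [x % m for m in Bp]
    q = [xb * -pow(p, -1, m) % m for xb, m in zip(xB, B)]    # step 1
    qp = base_extend(q, B, Bp)                               # steps 2-3
    sp = [(xb + p * qq) * pow(M, -1, m) % m                  # steps 4-5
          for xb, qq, m in zip(xBp, qp, Bp)]
    return base_extend(sp, Bp, B), sp                        # steps 6-7

x = 123456                     # input x < beta * p**2 with beta * p <= M
sB, sBp = rns_mred(x)
s = next(v for v in range(2 * p) if [v % m for m in B] == sB)  # recover s < 2p
assert s % p == x * pow(M, -1, p) % p
```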

3.1.2 Requirement for parameters

Throughout this paper, we assume that the bit length w is common to base elements \(m_i\) and \(m'_i\). Let \(m_i\) have a form of the pseudo-Mersenne prime,

$$\begin{aligned} m_i = 2^ w- \mu _i \in B, \end{aligned}$$

where \(\mu _i\) is a relatively small positive integer. Similarly,

$$\begin{aligned} m'_i = 2^w - \mu '_i \in B'. \end{aligned}$$

Such a modulus has two important properties:

(a) \(m_i\) is a special modulus for fast reduction.

(b) \(1/m_i\) can be well approximated as \(1/2^w\).

Property (b) applies to the computation of \(\hat{L}\) in steps 3 and 7 in Fig. 3. Let \({\xi _i(q)} = \left\langle qM_i^{-1} \right\rangle _{m_i}\) be the result of step 2. To estimate the approximation error between

$$\begin{aligned} L = \left\lfloor \sum _{i=1}^n {\xi _i(q)} /m_i \right\rfloor \end{aligned}$$

and

$$\begin{aligned} \hat{L} = \left\lfloor \sum _{i=1}^n \mathrm{trunc}( t, {\xi _i(q)})/2^w \right\rfloor , \end{aligned}$$

let us consider a function \({f(q)} = \sum _{i=1}^n {\xi _i(q)}/m_i\) and its approximate function \({\tilde{f}(q)} = \sum _{i=1}^n \mathrm{trunc}(t, {\xi _i(q)})/2^w\). The following equations hold for f and \(\tilde{f}\) [23].

$$\begin{aligned}&{f(q)} - e_1 \le {\tilde{f}(q)} \le {f(q)} \nonumber \\&e_1 = n(2^{-t} - 2^{-w} ) + \frac{1}{2^w} \sum _{i=1}^n \left( 1 - \frac{1}{m_i} \right) \mu _i \end{aligned}$$
(5)

Let \(q'\) be the value of q extended to base \(B'\) as computed in step 3. Using the above equations, we can show that \(q' = q + uM\) with \(u \in \{ 0, 1 \}\) if the offset \(\alpha = 0\). This means that q is transformed to \(q'\) at step 3 with an error term uM. This error is absorbed in the relaxed range of the output \(\overline{ \left\langle xM^{-1} \right\rangle _p}\) until the end of step 5. A similar analysis applies to the second base extension at step 7. In this case, the approximation error \(e_1\) is replaced by \(e_2\) with parameters \(( m'_i, \mu '_i )\), and the offset \(\alpha \) is positive. A typical offset at the second base extension is \(\alpha = 0.5\) [3]. The second base extension is error-free if \(e_2 < \alpha \) and \(2p \le (1-\alpha ) M'\).

Conditions (i)–(v) below are typical requirements for ensuring correct results [3, 23].

(i) \(\gcd (M, M') = 1\)

(ii) \(\gcd (p, M) = 1\)

(iii) \(\max (e_1, e_2) \le \alpha < 1\)

(iv) \(\beta p \le (1-\alpha )M\)

(v) \(2p \le (1-\alpha )M'\)

From condition (iii) and Eq. (5), we can derive the lower bound of t, the effective number of bits for approximation, as

$$\begin{aligned} t \ge t_0 = \left\lceil -\log _2 \{(\alpha - \max (e_0, e'_0) ) n^{-1} + 2^{-w}\} \right\rceil , \end{aligned}$$

where \(e_0\) and \(e'_0\) represent the summation parts of \(e_1\) and \(e_2\), respectively.

Fig. 4: G-RNS M-red algorithm [9]

3.2 G-RNS algorithm

Guillermin proposed an algorithm that at the time achieved the minimum number of unit multiplications [9]. We call this algorithm G-RNS (Fig. 4). Step 1 is the integration of steps 1 and 2 in Fig. 3. Step 2 is from the first term of step 4 combined with steps 5 and 6 of Fig. 3. Step 3 is derived from steps 3–6 of Fig. 3. Step 4 corresponds to step 7 of the basic algorithm. Step 5 is new in Fig. 4.

Elements of the matrices of G-RNS are defined as follows:

Step 1: \(d_{ii} = \left\langle w_{ii} \cdot d_i \right\rangle _{m_i} = \left\langle M_i^{-1}(-p^{-1}) \right\rangle _{m_i}\).

Step 2: \(e_{ii} = \left\langle w'_{ii} \cdot w_i \right\rangle _{m'_i} = \left\langle {M'_i}^{-1} M^{-1} \right\rangle _{m'_i}\).

Step 3:

$$\begin{aligned} a'_{ij}= & {} \left\langle w'_{ii} \cdot w_i \cdot p_i \cdot a_{ij} \right\rangle _{m'_i} = \left\langle {M'_i}^{-1}m_j^{-1}p \right\rangle _{m'_i}, \\ a'_i= & {} \left\langle w'_{ii} \cdot w_i \cdot p_i \cdot a_i \right\rangle _{m'_i} = \left\langle {-M'_{i}}^{-1}p \right\rangle _{m'_i}. \end{aligned}$$

Step 4:

$$\begin{aligned} b_{ij}= & {} \left\langle M'_j \right\rangle _{m_i}, \\ b_i= & {} \left\langle -M' \right\rangle _{m_i}. \end{aligned}$$

Step 5: \(c_{ii} = \left\langle M'_i \right\rangle _{m'_i}\).

The necessary number of unit multiplications for G-RNS is \((2n^2 + 3n)\).

Fig. 5: C-RNS M-red algorithm [15]

3.3 C-RNS algorithm

Figure 5 shows an algorithm proposed by Cheung et al. [15]. The elements appearing in each step are defined as follows:

Step 1: \(d_{ii} = \left\langle M_i^{-1}(-p^{-1}) \right\rangle _{m_i}\).

Step 2: \(e_{ii} = \left\langle {M'_i}^{-1} M^{-1} \right\rangle _{m'_i}\).

Step 3:

$$\begin{aligned} a_{ij}= & {} \left\langle M_j \right\rangle _{m'_i},\\ a_i= & {} \left\langle -M \right\rangle _{m'_i}. \end{aligned}$$

Step 4: \(f_{ii} = \left\langle w'_{ii} \cdot w_i \cdot p_i \right\rangle _{m'_i} = \left\langle {M'_i}^{-1}M^{-1}p \right\rangle _{m'_i}\).

Step 5:

$$\begin{aligned} b_{ij}= & {} \left\langle M'_j \right\rangle _{m_i}, \\ b_i= & {} \left\langle -M' \right\rangle _{m_i}. \end{aligned}$$

Step 6: \(c_{ii} = \left\langle M'_i \right\rangle _{m'_i}\).

The difference from G-RNS is that C-RNS restores the original base extension matrix at step 3. The number of multiplications is \((2n^2 + 4n)\) in general, but the computation of the base extension can be reduced drastically in the special case where n is small. As discussed in [15, 16], it follows that

$$\begin{aligned} a_{ij}= & {} \left\langle M_j \right\rangle _{m'_i} = \prod _{k \ne j} m_k \bmod m'_i \\= & {} \prod _{k \ne j} (m_k - m'_i) \bmod m'_i, \end{aligned}$$

and that \(|m_k - m'_i| = |\mu '_i - \mu _k|\) is a small number. Therefore, if n is not too large, we can expect \(a_{ij}\) and \(b_{ij}\) to be close to \(2^w\). This makes it possible to reduce the amount of computation in the base extensions. In hindsight, this property can be applied to the basic RNS M-red as well. It is shown in [16] that efficient parameters exist for \(n = 4\) and a 258-bit modulus p.

Fig. 6: R-RNS M-red algorithm

Fig. 7: From G-RNS to Q-RNS

3.4 R-RNS algorithm

Gandino et al. proposed a reorganized version of the RNS Montgomery multiplication algorithm [6], which we call the R-RNS algorithm here. As shown in Fig. 6, we can describe the R-RNS algorithm using almost the same notation as G-RNS. Let us explain the differences between Figs. 4 and 6.

1. Input and output are changed from \(\{ x \}_{BB'}\) and \(\{ s \}_{BB'}\) to \(\{ x \}_B \cup \{ \hat{\hat{x}} \}_{B'}\) and \(\{ s \}_B \cup \{ \hat{s} \}_{B'}\), respectively, where elements of \(\{ \hat{\hat{x}} \}_{B'}\) and \(\{ \hat{s} \}_{B'}\) are defined as

   $$\begin{aligned}&\left\langle \hat{\hat{x}} \right\rangle _{m_i'} = \left\langle x {M_i'}^{-2} \right\rangle _{m_i'}, \\&\left\langle \hat{s} \right\rangle _{m_i'} = \left\langle s {M_i'}^{-1} \right\rangle _{m_i'}. \end{aligned}$$

2. In step 2, elements of the matrix are changed from \(e_{ij}\) to \(e'_{ij}\), where the latter is defined as

   $$\begin{aligned} e_{ij}' = \left\langle e_{ij} {M_i'}^2 \right\rangle _{m_i'} = \left\langle M_i' M^{-1} \right\rangle _{m_i'}. \end{aligned}$$

   Due to this definition, the following relationship holds.

   $$\begin{aligned} \left\langle e_{ij} \cdot x \right\rangle _{m_i'} = \left\langle e_{ij}' \cdot \hat{\hat{x}} \right\rangle _{m_i'} \end{aligned}$$

   Thus, the result of step 2 in Fig. 6 is identical to that in Fig. 4.

3. Notation of the result in step 3 is changed to \(\left\langle \hat{s}\right\rangle _{m_i'}\), although its value is identical to \(\left\langle \sigma \right\rangle _{m_i'}\), the result of step 3 of Fig. 4.

4. Since the new output includes \(\left\langle \hat{s}\right\rangle _{m_i'}\) instead of \(\left\langle s\right\rangle _{m_i'}\), step 5 of Fig. 4 is omitted in Fig. 6. This reduces the number of unit multiplications by n from Fig. 4.

The number of unit multiplications is \((2n^2 + 2n)\) in this case.

4 New algorithms

4.1 Derivation of Q-RNS

We introduce an idea of imposing quadratic residuosity on the RNS base so as to make steps 1 and 2 of the G-RNS algorithm unnecessary. Figure 7 (left) shows part of a long sequence of operations in which a multiplication and G-RNS M-red are repeated. It consists of three phases: the previous M-red, a multiplication, and the present M-red. The input of the present M-red is \(\left\{ xy \right\} _{BB'}\). For simplicity, elements of the RNS representation are uniformly numbered from 1 to 2n only in Fig. 7. From the definition of G-RNS, the input is multiplied by the constants

$$\begin{aligned} d_{ii}= & {} \left\langle M_i^{-1}(-p^{-1}) \right\rangle _{m_i}, \end{aligned}$$
(6)
$$\begin{aligned} e_{ii}= & {} \left\langle {M'_i}^{-1} M^{-1} \right\rangle _{m'_i} \end{aligned}$$
(7)

in bases \(B = \{ m_1, \dots , m_n \}\) and \(B'=\{ m'_1, \dots , m'_n \}\), respectively. If the bases B and \(B'\) are selected so that these constants are quadratic residues, each constant can be represented as the square of a constant K, as shown in Fig. 7 (left). We then transfer the square root K from the present M-red to the previous M-red, integrating K into the coefficients of the multiplications at the last steps (Fig. 7, right). As a result, the outputs of the previous M-reds are modified to Kx and Ky, and their product is \(K^2xy\), which is the same as the value immediately after steps 1 and 2 of Fig. 4. We call this new algorithm Q-RNS M-red, or simply Q-RNS, after the initials of quadratic residuosity. Q-RNS includes sQ-RNS, which is directly derived from G-RNS, and dQ-RNS, in which the unit multiplication is replaced by Montgomery multiplication.

Most past improvements to RNS M-red algorithms, except [6], were optimizations within one round of M-red execution. Our optimization for Q-RNS M-red is unique in that it transfers a square root (Footnote 4) of a constant from the present round to the previous round.

As seen in Fig. 7, Q-RNS assumes that a multiplication is carried out as preprocessing for the next M-red. This assumption ensures that the degree of K is 2. Let us consider the possible degrees of K. All RNS M-red algorithms discussed in this paper use two bases, B and \(B'\), with the intent that these M-reds accommodate a number twice the length of what a single base can represent. This means M-red is designed not to accommodate a number of degree 3 or more. If the degree of K were 1 or 0, we could cope with such cases by multiplying the input by K or \(K^2\), respectively. Even if such cases should occur, the amount of computation would be the same as in Fig. 7 (left).

4.2 sQ-RNS algorithm

Figure 8 shows the sQ-RNS algorithm, where the initial "s" indicates single-level Montgomery, as opposed to the double-level Montgomery technique proposed in [13]. sQ-RNS is basically derived according to the procedure shown in Fig. 7, with a small extra optimization.

The constants \(d_{ii}\) and \(e_{ii}\), defined by Eqs. (6) and (7), are the diagonal elements in steps 1 and 2 of G-RNS. Quadratic residuosity of these constants is key to the design of Q-RNS. The square root of \(d_{ii}\) would yield \(\left\langle M_i^{-1/2}(-1)^{1/2}p^{-1/2} \right\rangle _{m_i}\). The factor \((-1)^{1/2}\) requires that \((-1)\) be a quadratic residue modulo \(m_i\). To reduce the constraints on base B somewhat, the factor \(\left\langle -1 \right\rangle _{m_i}\) in \(d_{ii}\) is moved to the base extension matrix \(a'_{ij}\). As a result, the constant \(K^2\) is defined by

$$\begin{aligned} \left\langle K^2 \right\rangle _{m_i}= & {} \left\langle -d_{ii} \right\rangle _{m_i} = \left\langle M_i^{-1}p^{-1} \right\rangle _{m_i}, \\ \left\langle K^2 \right\rangle _{m'_i}= & {} e_{ii} = \left\langle {M'_i}^{-1} M^{-1} \right\rangle _{m'_i}. \end{aligned}$$

The new requirement for the bases is that the values on the right-hand side of the above equations must be quadratic residues. For a given p and \(i, j \in [1, n ]\), we can express the requirement using the function QR as follows:

$$\begin{aligned}&\prod _{i \ne j} \mathrm{QR}(m_j, m_i) \prod _{\forall i} \mathrm{QR}(p, m_i) \nonumber \\&\quad \times \prod _{i \ne j} \mathrm{QR}(m'_j, m'_i) \prod _{\forall i, j} \mathrm{QR}(m_j, m'_i) = 1 \end{aligned}$$
(8)

If Eq. (8) holds, there exists a coefficient K defined by the following equations and Q-RNS is properly defined.

$$\begin{aligned} \left\langle K \right\rangle _{m_i}= & {} \left\langle M_i^{-1/2}p^{-1/2} \right\rangle _{m_i} \\ \left\langle K \right\rangle _{m'_i}= & {} \left\langle {M'_i}^{-1/2} M^{-1/2} \right\rangle _{m'_i} \end{aligned}$$
Fig. 8: sQ-RNS M-red algorithm

The computation of \(\hat{L}'\) at step 1 is also modified due to the transfer of the factor \(\left\langle -1 \right\rangle _{m_i}\). Before the transfer, it was \(\hat{L} \leftarrow \lfloor 0 + \sum _{i=1}^n \mathrm{trunc}( t, \left\langle K^2x \right\rangle _{m_i} )/2^w \rfloor \). This is replaced by

$$\begin{aligned} \hat{L}' \leftarrow \left\lfloor 1 + \alpha + \sum _{i=1}^n \mathrm{trunc}( t, \left\langle K^2x \right\rangle _{m_i} ) / 2^w \right\rfloor . \end{aligned}$$

Note here that the offset value changes from 0 to \((1+\alpha )\), which compensates for the effect of transferring the factor \(\left\langle -1\right\rangle _{m_i}\). The derivation of the new formula is explained in "Appendix A". This also makes the constant \(a'_i\) negative, as shown in the next paragraph.

In Fig. 8, input and output of the algorithm are \(\left\{ K^2x \right\} _{BB'}\) and \(\left\{ Ks \right\} _{BB'}\), respectively, and the elements in each matrix are defined from those of G-RNS as follows:

$$\begin{aligned} \alpha _{ij}= & {} \left\langle -a'_{ij} \right\rangle _{m'_i} = \left\langle -{M'_i}^{-1} m_j^{-1} p \right\rangle _{m'_i} \\ \alpha _i= & {} \left\langle -a'_i \right\rangle _{m'_i} = \left\langle {M'_i}^{-1}p \right\rangle _{m'_i} \\ \beta _{ij}= & {} \left\langle Kb_{ij} \right\rangle _{m_i} = \left\langle KM'_j \right\rangle _{m_i} = \left\langle M_i^{-1/2}p^{-1/2}M'_j \right\rangle _{m_i} \\ \beta _i= & {} \left\langle Kb_i \right\rangle _{m_i} = \left\langle K(-M') \right\rangle _{m_i} = \left\langle -M_i^{-1/2}p^{-1/2}M' \right\rangle _{m_i} \\ \gamma _i= & {} \left\langle Kc_{ii} \right\rangle _{m'_i} = \left\langle KM'_i \right\rangle _{m'_i} = \left\langle {M'_i}^{1/2}M^{-1/2} \right\rangle _{m'_i} \end{aligned}$$

The constants \(\beta _{ij}\), \(\beta _i\), and \(\gamma _i\) are each multiplied by K, because they correspond to the explanatory constants \(h_1\) through \(h_{2n}\) in Fig. 7 (right).

Note that steps 2 and 3 in Fig. 8 can be carried out simultaneously. Therefore, the computation time of sQ-RNS is comparable to that of two matrix multiplications. The number of unit multiplications is \((2n^2 + n)\), the minimum among all previously proposed RNS M-red algorithms.

4.3 dQ-RNS algorithm

The double-level Montgomery is a technique proposed in [13]. It replaces the standard modular unit multiplication in RNS M-red with single-word Montgomery multiplication. dQ-RNS in Fig. 9 is derived by applying this technique to sQ-RNS. Using the Montgomery multiplication \(\otimes _M\) removes the requirement \(\bar{\mu }< 0.5\), which stems from the special modulus for fast reduction. Without this requirement, we can take square numbers \(m_i = \sigma _i^2\) as base elements, even though they may violate the condition \(\bar{\mu }< 0.5\). In this approach, the condition for quadratic residuosity becomes very simple:

$$\begin{aligned} \prod _{\forall i} \mathrm{QR}(p, m_i ) = 1 \end{aligned}$$
(9)

since \(m_i = {\sigma _i}^2\) automatically satisfies \(\mathrm{QR}( m_i, m_j ) = 1\). Therefore, it is expected that bases can be found efficiently for a wider range of base sizes n.

The elements of the matrices in dQ-RNS are defined from those of sQ-RNS, each modified by a coefficient \(2^{kw/2}\) (for \(k=1,2,3\)) and denoted by a symbol with a dot.

$$\begin{aligned} \dot{\alpha }_{ij}= & {} \left\langle 2^w \alpha _{ij} \right\rangle _{m'_i} \\ \dot{\alpha }_i= & {} \left\langle 2^w \alpha _i \right\rangle _{m'_i} \\ \dot{\beta }_{ij}= & {} \left\langle 2^{3w/2}\beta _{ij} \right\rangle _{m_i} \\ \dot{\beta }_i= & {} \left\langle 2^{3w/2}\beta _i \right\rangle _{m_i} \\ \dot{\gamma }_i= & {} \left\langle 2^{3w/2}\gamma _i \right\rangle _{m'_i} \end{aligned}$$

A variable \(\left\{ \zeta \right\} _{BB'}\) is also modified as

$$\begin{aligned} \left\{ \dot{\zeta }\right\} _{BB'} = \left\{ 2^{w/2}\zeta \right\} _{BB'}. \end{aligned}$$

If we take an even value of w, these elements are always well-defined. The constants \(\left\{ K^2 \right\} _{BB'}\) and \(\left\{ K \right\} _{BB'}\) are the same as in sQ-RNS. Consequently, the variables \(\left\{ K^2 x \right\} _{BB'}\) and \(\left\{ \xi \right\} _{B'}\) are unchanged from sQ-RNS, leaving the formulae for \(\hat{L}'\) and \(\hat{L}\) unchanged.

The number of unit multiplications of dQ-RNS is also \((2n^2 + n)\). It should be noted that the unit multiplication in this case is the Montgomery multiplication.

Fig. 9: dQ-RNS M-red algorithm

Table 1 Example base for sQ-RNS

4.4 Base search

A base search experiment was carried out for given moduli p to find RNS bases satisfying the quadratic residuosity requirement. To avoid bias, we use the five NIST primes [24] and the prime for Curve25519 [25], which are defined as common moduli for Elliptic Curve Cryptography. The NIST primes are called P-192, P-224, P-256, P-384, and P-521, with the numbers representing the bit size of each prime. The prime for Curve25519 is \(2^{255}-19\).

Experiment for sQ-RNS:

We search for bases satisfying Eq. (8) using the following search algorithm.

Search algorithm 1:

1. Let the candidates be an ordered sequence of prime numbers of the form \(c_i =2^w - \mu _i\), where \(\mu _i> \mu _j > 0\) for \(i > j\). The search is done in a smaller-index-first manner.

2. \(\mathrm{Pool} = \{ c_i | \mathrm{QR}(c_i, c_j ) \cdot \mathrm{QR}(c_j, c_i ) = 1 \;\mathrm{for }\;i \ne j \}\).

3. \(B = \{ c_i | c_i \in \mathrm{Pool} \wedge \mathrm{QR}(p, c_i ) = 1 \wedge |B| = n \}\).

4. \(B' = \{ c_i | c_i \in \mathrm{Pool} \wedge c_i \notin B \wedge |B'| = n \}\).

Since we choose the candidates from among prime numbers, they automatically satisfy the condition that they be pairwise co-prime. This also makes it easier to determine quadratic residuosity.
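A sketch of Search algorithm 1 at reduced size follows. It assumes sympy's isprime for the primality test, evaluates QR with Euler's criterion (valid because every candidate is prime), and caps \(\mu < 2^{w/2}\) to respect Eq. (4); the values of p, w, and n are illustrative.

```python
# A sketch of Search algorithm 1 for sQ-RNS bases, at toy size.
from sympy import isprime

def qr(a, m):
    """QR(a, m) for an odd prime m, by Euler's criterion."""
    return pow(a % m, (m - 1) // 2, m) == 1

def search_sq_rns(p, w, n):
    pool = []
    for mu in range(1, 1 << (w // 2), 2):        # candidates c_i = 2**w - mu_i,
        c = (1 << w) - mu                        # smallest mu first, mu < 2**(w/2)
        if not isprime(c):
            continue
        if all(qr(c, d) and qr(d, c) for d in pool):   # step 2: mutual-QR pool
            pool.append(c)
        B = [m for m in pool if qr(p, m)][:n]          # step 3: QR(p, m_i) = 1
        Bp = [m for m in pool if m not in B][:n]       # step 4: B' from the rest
        if len(B) == n and len(Bp) == n:
            return B, Bp
    return None                                  # no base found at this (w, n)

print(search_sq_rns(p=9973, w=32, n=4))
```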

Table 1 presents the search results for \(n=4\). The rightmost column shows that these bases satisfy the condition \(\bar{\mu }< 0.5\).

Experiment for dQ-RNS:

We apply the following search algorithm, which generates bases satisfying the condition given by Eq. (9).

Search algorithm 2:

1. Let the seeds be an ordered sequence of odd numbers of the form \(\sigma _i = 2^{w/2} - \nu _i\), where \(\nu _i> \nu _j >0\) for \(i > j\). The search is done in a smaller-index-first manner.

2. \(\mathrm{Pool} = \{ \sigma _i | \gcd (\sigma _i, \sigma _j ) = 1 \;\mathrm{for}\; i \ne j \}\).

3. \(B = \{ \sigma _i^2 | \sigma _i \in \mathrm{Pool} \wedge \mathrm{QR}(p, \sigma _i ) = 1 \wedge |B| = n \}\).

4. \(B' = \{ \sigma _i^2 | \sigma _i \in \mathrm{Pool} \wedge \sigma _i^2 \notin B \wedge |B'| = n \}\).
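A sketch of Search algorithm 2 at toy size follows. Since a seed \(\sigma _i\) may be composite, quadratic residuosity modulo \(\sigma _i\) is tested by brute force here, which is feasible only because w is small in this illustration; p, w, and n are example values.

```python
# A sketch of Search algorithm 2 for dQ-RNS bases, at toy size.
from math import gcd

def is_qr(a, m):
    """QR(a, m) by exhaustive search (toy sizes only)."""
    a %= m
    return any(x * x % m == a for x in range(m))

def search_dq_rns(p, w, n):
    pool = []
    for nu in range(1, 1 << (w // 2 - 1), 2):          # odd seeds, smallest nu first
        sig = (1 << (w // 2)) - nu                     # sigma_i = 2**(w/2) - nu_i
        if all(gcd(sig, t) == 1 for t in pool):        # step 2: co-prime pool
            pool.append(sig)
        B = [t * t for t in pool if is_qr(p, t)][:n]   # step 3: QR(p, sigma_i) = 1
        Bp = [t * t for t in pool if t * t not in B][:n]   # step 4
        if len(B) == n and len(Bp) == n:
            return B, Bp
    return None                                        # no base found at this (w, n)

print(search_dq_rns(p=9973, w=16, n=4))
```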

As a lemma, if \(\mathrm{QR}( p, \sigma _i ) = 1\), then \(\mathrm{QR}( p, {\sigma _i}^2 ) = 1\) (see “Appendix B”). For the base elements found by algorithm 2, it holds that \(\bar{\mu }> 0.5\), since

$$\begin{aligned} \sigma _i^2 = { (2^{w/2} - \nu _i )}^2 = 2^w - \nu _i 2^{w/2+1} + \nu _i^2. \end{aligned}$$
Fig. 10: Base search results for dQ-RNS

Figure 10 shows the search results for \(\alpha = 0.5\) and degree of laziness \(\nu \ge 2\). The search succeeds for the pairs (n, w) plotted in the figure, although the graph for P-256 is almost hidden behind that of C25519. The search fails when \(\max (e_1, e_2)\) exceeds 0.5, violating condition (iii), \(\max (e_1, e_2) \le \alpha \), in Sect. 3.1.2. The lower bounds of the word length w for success are 22, 24, 24, 26, 28, and 24 bits for P-192, P-224, P-256, P-384, P-521, and Curve25519, respectively. The lower bound \(t_0\) on the bit length necessary for approximation ranges from 3 to 8. Therefore, it is possible to realize a compact computation circuit for \(\hat{L}\) and \(\hat{L}'\). Let \(N_1\) be the number of seeds satisfying \(\mathrm{QR}( p, \sigma _i ) = 1\), and let \(N_0\) be the number of all seeds generated until the algorithm halts. In our experiment, \(N_1 / N_0\) ranges from 0.30 to 0.43, which implies that the probability that \(\mathrm{QR}( p, \sigma _i ) = 1\) is roughly 0.3 for these primes. We also confirmed that bases are efficiently found for some randomly chosen non-NIST primes.

Experiments show that bases for dQ-RNS can be found unless the word size w is too small. For instance, \(w \ge 22\) suffices for P-192. Since values less than 22 do not seem to be promising parameters for efficient hardware implementation, dQ-RNS offers a sufficient range of word sizes. It is up to hardware designers to determine optimum sizes for specific Q-RNS applications.

Table 2 Comparison of basic representations

5 Application to cryptography

We discuss several procedures necessary for RNS implementations of public-key cryptography, including Initialize, Finalize, the transformation to RNS representation (hereafter, ToRNS), and the transformation to binary representation (hereafter, ToBin). We also provide formulae for bounds on the degree of laziness and for the relaxation of reduction within a channel. Although these issues were discussed in previous work, we are interested in the case of Q-RNS and in exact expressions for the bounds.

5.1 Basic representation

Table 2 shows the representation of the computation result for each algorithm. Row (a) corresponds to the orthodox modular multiplication described by Eq. (1). Row (b) is the standard Montgomery multiplication defined by Eq. (3), in which x is multiplied by a constant \(M = 2^l\). The bar symbol in row (b) denotes relaxation of the upper bound of reduction from p to 2p. Row (c) represents the conventional RNS M-reds other than R-RNS; it is an immediate transformation of (b) into an RNS representation with bases B and \(B'\). Rows (d) and (e) are for Q-RNS, derived from (c) by multiplying by the constants K and \(2^{w/2}K\), respectively. The representation for the R-RNS algorithm is derived from (d) by replacing the coefficient \(\{ K \}_{BB'}\) with \(\{1\}_B \cup \{X\}_{B'}\), where \(\left\langle X\right\rangle _{m_i'} = \left\langle {M_{i}'}^{-1}\right\rangle _{m_i'}\).

Montgomery developed an efficient reduction algorithm (b) by multiplying representation (a) by a constant R (here, M), whereas this paper proposes efficient RNS M-red algorithms (d) and (e) by multiplying representation (c) by the constants K and \(2^{w/2}K\). As a result, (d) and (e) are realized with fewer unit multiplications, and their structures are much simpler. As will be explained in the next subsection, we can embed the multiplication by K or \(2^{w/2}K\) into the Initialize process. We can also carry out the removal of K or \(2^{w/2}K\) in parallel with the Finalize process.

5.2 ToRNS and Initialize

If \(\left\{ \overline{ \left\langle x \right\rangle _p} \right\} _{BB'}\) is given, Initialize for conventional RNS M-red is carried out as follows: First, a product

$$\begin{aligned} \left\{ \overline{ \left\langle x \right\rangle _p } \right\} _{BB'} \otimes \left\{ \left\langle M^2 \right\rangle _p \right\} _{BB'} \end{aligned}$$
(10)

is computed, then the product is input to RNS M-red to obtain \(\left\{ \overline{ \left\langle xM \right\rangle _p } \right\} _{BB'}\). A similar Initialize process can be defined for Q-RNS and applied to the result of ToRNS, which denotes the transformation from binary to RNS representation. To describe the concrete procedure, we assume the input is represented in binary as \(\overline{ \left\langle x \right\rangle _p } = \sum _{j=0}^{n-1} x_j 2^{jw}\) with \(x_j \in [0, 2^w-1]\).

For sQ-RNS:

$$\begin{aligned}&X_0&\leftarrow x_{n-1} \\&X_j&\leftarrow \left\langle X_{j-1}2^w + x_{n-1-j} \right\rangle _{m_i} \;(j=1, \dots , n-1). \end{aligned}$$

This procedure outputs \(X_{n-1} = \left\langle \overline{ \left\langle x \right\rangle _p} \right\rangle _{m_i}\). The second step can be implemented with a single-word modular reduction. This matches the special modulus for fast reduction. By running this procedure with all moduli, we obtain \(\left\{ \overline{ \left\langle x \right\rangle _p } \right\} _{BB'}\).
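A sketch of this Horner-style recurrence for one channel follows; the word size and values are illustrative, and in an actual implementation each update is one single-word modular reduction using the fast reduction of Sect. 2.6.

```python
# A sketch of the sQ-RNS ToRNS recurrence, assuming x is given as n
# w-bit words x_0..x_{n-1}, least significant first.
def to_rns_channel(x_words, m, w):
    """X_{n-1} = <x>_m, by Horner evaluation of the words of x."""
    X = x_words[-1]                          # X_0 <- x_{n-1}
    for xw in reversed(x_words[:-1]):        # X_j <- <X_{j-1}*2**w + x_{n-1-j}>_m
        X = (X * (1 << w) + xw) % m
    return X

w = 8
x = 0xDEADBEEF
x_words = [(x >> (j * w)) & ((1 << w) - 1) for j in range(4)]
assert to_rns_channel(x_words, 251, w) == x % 251
```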

To initialize this variable, we first multiply it by a constant.

$$\begin{aligned} \left\{ \overline{ \left\langle x \right\rangle _p} \right\} _{BB'} \otimes \left\{ K^2 \left\langle M^2 \right\rangle _p \right\} _{BB'} \end{aligned}$$
(11)

Then, we input the product to sQ-RNS and obtain the basic representation \(\left\{ K \overline{ \left\langle xM \right\rangle _p} \right\} _{BB'}\).

For dQ-RNS:

$$\begin{aligned}&Y_0&\leftarrow x_0 \\&Y_j&\leftarrow \overline{ \left\langle (x_j2^w + Y_{j-1})2^{-w} \right\rangle _{m_i} } \; (j=1, \dots , n-1). \end{aligned}$$

This procedure outputs \(Y_{n-1} = \left\langle 2^{-(n-1)w} \overline{ \left\langle x \right\rangle _p} \right\rangle _{m_i}\). The second step can be implemented with a single-word Montgomery reduction, which matches well with the double-level Montgomery. By running this procedure with all moduli, we obtain \(\left\{ 2^{-(n-1)w} \overline{ \left\langle x \right\rangle _p } \right\} _{BB'}\). It is possible to prepare the following lookup table for the single-word Montgomery reduction.

$$\begin{aligned} \left\langle -{m_i}^{-1} \right\rangle _{2^w}, \left\langle -{m'_i}^{-1} \right\rangle _{2^w} (i=1, \dots , n ) \end{aligned}$$
(12)

These constants are used in a similar way to the constant \((-p^{-1}) \bmod R\) at step 1 in Fig. 2.
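A sketch of the dQ-RNS recurrence for one channel follows, with the single-word Montgomery reduction built from the constant of Eq. (12); the toy modulus \(m = 15^2\) is illustrative, and since the reduction result is only partially reduced, the check reduces once more at the end.

```python
# A sketch of the dQ-RNS ToRNS recurrence with single-word Montgomery
# reduction, using the precomputed constant <-m^{-1}>_{2**w} of Eq. (12).
def mont_reduce_word(u, m, w, m_neg_inv):
    """u * 2**(-w) (mod m), not fully reduced."""
    q = (u * m_neg_inv) & ((1 << w) - 1)     # q = u * (-m^{-1}) mod 2**w
    return (u + q * m) >> w                  # exact division by 2**w

def to_rns_channel_dq(x_words, m, w):
    m_neg_inv = -pow(m, -1, 1 << w) % (1 << w)
    Y = x_words[0]                           # Y_0 <- x_0
    for xw in x_words[1:]:                   # Y_j <- <(x_j*2**w + Y_{j-1})*2**(-w)>_m
        Y = mont_reduce_word(xw * (1 << w) + Y, m, w, m_neg_inv)
    return Y

w, m = 8, 225                                # toy modulus m = sigma**2 = 15**2 (odd)
x = 0xDEADBEEF
x_words = [(x >> (j * w)) & ((1 << w) - 1) for j in range(4)]
n = len(x_words)
assert to_rns_channel_dq(x_words, m, w) % m == x * pow(2, -(n - 1) * w, m) % m
```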

For Initialize, we first multiply by a constant as follows.

$$\begin{aligned} \left\{ 2^{-(n-1)w} \overline{ \left\langle x \right\rangle _p} \right\} _{BB'} \otimes _M \left\{ 2^{nw} K^2 \left\langle M^2 \right\rangle _p \right\} _{BB'} \end{aligned}$$
(13)

Then, we input this product to dQ-RNS and obtain the basic representation \(\left\{ 2^{w/2}K \overline{ \left\langle xM \right\rangle _p} \right\} _{BB'}\).

5.3 Finalize and ToBin

In the conventional RNS M-red, the Finalize of the Montgomery reduction is carried out by inputting the following value to RNS M-red.

$$\begin{aligned} \left\{ \overline{ \left\langle xM \right\rangle _p} \right\} _{BB'} = \left\{ \overline{ \left\langle xM \right\rangle _p} \right\} _{BB'} \otimes \left\{ 1 \right\} _{BB'} \end{aligned}$$

Similarly, the Finalize for sQ-RNS and dQ-RNS is carried out by inputting the following values to the respective M-red algorithms.

sQ-RNS algorithm:

$$\begin{aligned} \left\{ K \overline{ \left\langle xM \right\rangle _p} \right\} _{BB'} \otimes \left\{ K \right\} _{BB'}. \end{aligned}$$
(14)

dQ-RNS algorithm:

$$\begin{aligned} \left\{ 2^{w/2}K \overline{ \left\langle xM \right\rangle _p} \right\} _{BB'} \otimes _M \left\{ 2^{w/2}K \right\} _{BB'}. \end{aligned}$$
(15)

For both sQ-RNS and dQ-RNS, the products above equal \(\left\{ K^2 \overline{ \left\langle xM \right\rangle _p} \right\} _{BB'}\). Similarly, the intermediate results at step 1 of the corresponding Q-RNS M-reds are the same, namely,

$$\begin{aligned} \left[ \begin{array}{c} \left\langle \xi \right\rangle _{m'_1} \\ \vdots \\ \left\langle \xi \right\rangle _{m'_n} \end{array} \right] = \left[ \begin{array}{c} \left\langle \overline{ \left\langle x \right\rangle _p}{M'_1}^{-1} \right\rangle _{m'_1} \\ \vdots \\ \left\langle \overline{ \left\langle x \right\rangle _p}{M'_n}^{-1} \right\rangle _{m'_n} \end{array} \right] . \end{aligned}$$

To compute the binary representation, we need an additional subroutine, ToBin, shown in Fig. 11, whose input is the intermediate result of step 1 above. From condition (v) in Sect. 3.1.2, it follows that

$$\begin{aligned} \overline{ \left\langle x \right\rangle _p} < 2p \le (1-\alpha )M'. \end{aligned}$$

This meets the input requirement of ToBin. If the return value Z of ToBin is not less than p, it is best to carry out the final subtraction in the binary representation.

ToBin in Fig. 11 is derived for dQ-RNS from the one proposed in [9]. In [9], one of the moduli in base B is chosen as \(m_1 = 2^w\), which is used as \(2^w\) in Fig. 11. On the other hand, due to quadratic residuosity, it is not possible to use \(m_1 = 2^w\) for Q-RNS. Therefore, ToBin needs \((n+1)\) more words in its lookup table in step 2 than were used in [9]. In addition, step 4 needs an n-word table, while step 3 needs no table.

In Fig. 11, the unit multiplication is basically the Montgomery multiplication, although step 2 is an exception. A typical implementation of step 2 applies the multiplication without reduction and takes the lower w bits. Since steps 2 and 3 can be carried out at the same time, an efficient implementation is possible.

To obtain ToBin for sQ-RNS, steps 3 and 4 in Fig. 11 should be modified with the following factors, respectively. These require a 2n-word table for sQ-RNS in addition to that of the conventional RNS M-reds.

$$\begin{aligned}&\mathrm{diag}\left[ \left\langle 2^{-w} \right\rangle _{m'_1}, \dots , \left\langle 2^{-w} \right\rangle _{m'_n} \right] \otimes ,\\&\left[ \begin{array}{c} \left\langle -{M'_1}^{-1}2^{-w} \right\rangle _{m'_1} \\ \vdots \\ \left\langle -{M'_n}^{-1}2^{-w} \right\rangle _{m'_n} \end{array} \right] \otimes . \end{aligned}$$
Fig. 11: ToBin transform for dQ-RNS

5.4 Degree of laziness

We now derive an upper bound on the degree of laziness \(\nu \) in terms of Q-RNS parameters. As a typical lazy reduction, we consider the following product sum.

Input:

$$\begin{aligned}&\sum _{i=1}^\nu \left\{ 2^{w/2} K \overline{ \left\langle x_iM \right\rangle _p} \right\} _{BB'} \otimes _M \left\{ 2^{w/2} K \overline{ \left\langle y_iM \right\rangle _p} \right\} _{BB'}\\&\quad = \left\{ K^2 \sum _{i=1}^\nu \overline{ \left\langle x_iM \right\rangle _p} \cdot \overline{ \left\langle y_iM \right\rangle _p} \right\} _{BB'}. \end{aligned}$$

Output:

$$\begin{aligned} \left\{ 2^{w/2} K \overline{ \left\langle M \cdot \sum _{i=1}^\nu x_i y_i \right\rangle _p } \right\} _{BB'}. \end{aligned}$$

If \(4\nu \le \beta \) holds, then

$$\begin{aligned} \sum _{i=1}^\nu \overline{ \left\langle x_iM \right\rangle _p } \cdot \overline{ \left\langle y_iM \right\rangle _p } \le \nu {(2p)}^2 = 4\nu p^2 \le \beta p^2. \end{aligned}$$

This satisfies the upper bound on input for dQ-RNS. On the other hand, from condition (iv) in Sect. 3.1.2, it holds that

$$\begin{aligned} \beta p \le (1-\alpha )M. \end{aligned}$$

Combined with \(4\nu \le \beta \), this leads to

$$\begin{aligned} \nu \le \beta /4 \le (1-\alpha )M/4p. \end{aligned}$$

Thus, we can conclude that

$$\begin{aligned} \nu _{\mathrm{max}} = \left\lfloor \frac{(1-\alpha )M }{4p} \right\rfloor . \end{aligned}$$

Here, \(\alpha \) may be replaced by \(e_1\). The same formula can be applied to sQ-RNS.

5.5 Relaxation of reduction within channel

So far, we have assumed that the modular reduction in a unit multiplication is carried out strictly; that is, its result is always less than \(m_i\). It is, however, known for the Montgomery reduction in Fig. 2 that relaxing the reduction is effective in avoiding the conditional branch due to the final subtraction, thus making the implementation simpler. It may also be possible to apply this idea to the modular reduction in a unit operation.

Let \(\overline{\overline{ \left\langle \xi \right\rangle _m }}\) denote the \(\delta \)-relax of \(\left\langle \xi \right\rangle _m\) defined as \(\overline{\overline{ \left\langle \xi \right\rangle _m }} \equiv \left\langle \xi \right\rangle _m (\bmod m)\) and \(\overline{\overline{ \left\langle \xi \right\rangle _m }} \in [ 0, \delta m )\), where \(\delta \ge 1\). A special case \(\delta = 1\) means strict reduction. If we introduce \(\delta \)-relax to RNS M-red, it affects the representation of error bound \(e_1\), which is modified as

$$\begin{aligned} \widetilde{ e_1 }= & {} n(2^{-t} - 2^{-w} ) + \frac{1}{2^w} \sum _{i=1}^n \left( \delta - \frac{1}{m_i} \right) \mu _i. \\= & {} e_1 + \frac{\delta -1}{2^w} \sum _{i=1}^n \mu _i \end{aligned}$$

Similar modification is required for \(e_2\).

By replacing \(e_1\) and \(e_2\) with \(\widetilde{ e_1 }\) and \(\widetilde{ e_2 }\) in conditions (i)–(v) in Sect. 3.1.2, the basic algorithm in Fig. 3 and all its variants, including the Q-RNS algorithms, output correct results for \(\delta \)-relax variables. Note that the ranges of \(\hat{L}\) and \(\hat{L}'\) change when \(\delta \)-relax is applied. In the strict reduction case, their ranges are

$$\begin{aligned} 0\le & {} \hat{L} \le n-1\\ 1\le & {} \hat{L'} \le n+1, \end{aligned}$$

whereas in the \(\delta \)-relax case,

$$\begin{aligned} 0\le & {} \widetilde{L} \le \delta n-1,\\ 1\le & {} \widetilde{L'} \le \delta n+1. \end{aligned}$$

Since the relaxation requires a wider bit length than w, there is a tradeoff between simple reduction and word size.

Table 3 Comparison of RNS M-red

6 Comparison

6.1 Number of unit multiplications

Table 3 summarizes a comparison of the four conventional RNS M-red algorithms and the two Q-RNS M-red algorithms. Among these, the proposed ones achieve the smallest number of unit multiplications. It should be noted that the unit multiplication for dQ-RNS is Montgomery's, while the other algorithms use standard modular multiplication. Note also that if n is small in C-RNS, one may find base extension matrices with less computation. As for the requirements on the base choice, the basic algorithm, G-RNS, and R-RNS pose the weakest requirements, while sQ-RNS poses the strongest. C-RNS and dQ-RNS fall somewhere in between. dQ-RNS has weaker requirements on the RNS base than sQ-RNS, since it can employ square numbers as base elements.

As with conventional RNS M-reds, it is easy to implement sQ-RNS and dQ-RNS on parallel processing architectures, thanks to RNS. Since sQ-RNS and dQ-RNS consist mostly of two matrix multiplications, these algorithms have more regularity and simplicity than the conventional ones. From past work, it is clear that Q-RNS can terminate in \((2n^2+n)/n = (2n +1)\) cycles if n processing units operate in parallel. Since the multiplication preceding Q-RNS finishes in \(2n/n = 2\) cycles, the total cycle count for a Montgomery multiplication is \((2n+1+2\nu )\), where \(\nu \) is the degree of laziness. Another possibility, though less likely, is that with \((n^2 + n)\) unit multipliers, Q-RNS finishes in two cycles. Although this seems theoretically possible, in practice several issues require elaboration, such as the feasibility of registers with fan-out n and the design of an efficient circuit for summing the results from the unit multipliers.

Bigou et al. proposed a method that requires fewer unit multiplications than other RNS M-red algorithms, including Q-RNS, under the hypothesis that the modulus p and the product M of the base moduli satisfy a certain equation [12, 14]. Although Q-RNS also imposes quadratic residuosity conditions, their hypothesis is much stronger than that of Q-RNS; in fact, no base exists for the NIST primes [14]. In their algorithm, it is preferable to fix the base first and then determine p under the hypothesis. In contrast, we can find bases with very high probability not only for NIST primes but also for other primes. Therefore, the comparison in this paper does not include their algorithm.

6.2 Size of lookup table

Table 4 Number of memory words

Table 4 compares the lookup table sizes necessary for the four algorithms G-RNS, R-RNS, sQ-RNS, and dQ-RNS. Compared with G-RNS and R-RNS, sQ-RNS and dQ-RNS need only \((n+1)\) and \((2n+1)\) extra words of memory, respectively. With so little additional memory, the Q-RNS variants provide substantial merit regarding the reduction in the number of multiplications and the simplicity of the algorithm. A toy example of parameters is shown in "Appendix C".

6.3 FPGA implementation

We implemented sQ-RNS on an FPGA with parameters \(n = 4\), \(w=65\), and P-256 as the modulus. We also implemented R-RNS for comparison.

Fig. 12: A set of operation units

Fig. 13: Operation diagrams for (a) sQ-RNS and (b) R-RNS

Figure 12 shows the main operation units: a multiply-and-add unit and a modular reduction unit, where the latter carries out the fast reduction algorithm presented in Sect. 2.6. Let \(c_m\) and \(c_r\) be the clock cycles required to carry out these operations, respectively. In our implementation, \(c_m=1\) and \(c_r=2\). n sets of these operation units are prepared. We use almost the same configuration for both sQ-RNS and R-RNS.

Figure 13 shows the operation diagrams for both implementations. In each implementation, four sets of operation units run in parallel, and the diagrams show the operation of each unit. For sQ-RNS, the first 7 cycles correspond to step 1 of the algorithm, followed by 3 cycles for step 3. Step 2 overlaps step 3 with a 1-cycle delay. sQ-RNS completes in 15 cycles. A similar diagram is shown for R-RNS, which takes 18 cycles. The 3-cycle difference is caused by steps 1 and 2 of R-RNS, which are unnecessary in sQ-RNS. Thus, sQ-RNS achieves an approximately 17% reduction in clock cycles compared with R-RNS. If \(c_m \le c_r\) holds, we can derive the following expressions for the clock cycles from Fig. 13:

$$\begin{aligned} \begin{aligned} \text{ sQ-RNS: } (2n + 3) c_m + 2c_r\\ \text{ R-RNS: } (2n + 4) c_m + 3c_r. \end{aligned} \end{aligned}$$

Table 5 summarizes the results of the FPGA implementations. Both (a) and (b) consume almost the same FPGA-specific hardware resources, such as lookup tables (LUTs), flip-flops (FFs), and digital signal processing (DSP) blocks. In the implementation, we did not apply hand tuning to the multiplier and adder; these components were synthesized automatically by the compiler. Further optimization may be possible.

Table 5 Synthesis results

7 Conclusion

This paper proposed new RNS Montgomery reduction algorithms, sQ-RNS and dQ-RNS, which are derived by imposing quadratic residuosity requirements on the RNS bases. They require fewer unit multiplications than all previously proposed algorithms, while the size of the lookup tables they use is comparable to that of conventional ones. The improvement over the R-RNS algorithm was confirmed with FPGA implementations. Since the proposed algorithms have more regularity and symmetry than conventional ones, it may be worth studying software implementations on multi-core processors. Another topic for future study is improvement of the two base search algorithms proposed in this paper.