Streamlined NTRU Prime on FPGA

We present a novel full hardware implementation of Streamlined NTRU Prime, with two variants: a high-speed, high-area implementation and a slower, low-area implementation. We introduce several new techniques that improve performance, including batch inversion for key generation, a high-speed schoolbook polynomial multiplier, an NTT polynomial multiplier combined with a CRT map, a new DSP-free modular reduction method, a high-speed radix sorting module, and new encoders and decoders. With the high-speed design, we achieve the fastest speeds to date for Streamlined NTRU Prime, with 5007, 10,989, and 64,026 cycles for encapsulation, decapsulation, and key generation, respectively, while running at 285 MHz on a Xilinx Zynq Ultrascale+. The entire design uses 40,060 LUTs, 26,384 flip-flops, 36.5 BRAMs, and 31 DSPs.


Introduction
With the advent of quantum computers, many cryptosystems would become insecure. In particular, quantum computers would completely break many public-key cryptosystems, including RSA, DSA, and elliptic curve cryptosystems. Due to this concern, the National Institute of Standards and Technology (NIST) began soliciting proposals for post-quantum cryptosystems [17]. The algorithms solicited are divided into public-key encryption (key exchange) and digital signatures. The NIST Post-Quantum Cryptography Standardization Process has entered the third phase, and NTRU Prime [7] is one of the candidate key encapsulation algorithms, as an alternate candidate. Since hardware implementations will be an important factor in the evaluation, it is important to research hardware implementations for various use cases.
NTRU Prime has two instantiations: Streamlined NTRU Prime and NTRU LPRime. In this paper, we implement the Streamlined NTRU Prime cryptosystem on Xilinx Artix-7 and Xilinx Zynq Ultrascale+ FPGA. We present two different versions: a high-performance, large-area implementation and a slower, compact implementation. Both implement the full cryptosystem, including all encoding and decoding, but without a TRNG. We also present several novel designs to implement the subroutines required by NTRU Prime, such as sorting, modular reduction, polynomial multiplication, and polynomial inversion.

Definitions
Streamlined NTRU Prime [7] defines the polynomial rings R = Z[x]/(x^p − x − 1), R/3 = (Z/3)[x]/(x^p − x − 1), and R/q = (Z/q)[x]/(x^p − x − 1). The parameters (p, q, w) of Streamlined NTRU Prime satisfy: p and q are prime; w > 0, w ∈ Z; 2p ≥ 3w; and q ≥ 16w + 1. The recommended parameter set for Streamlined NTRU Prime is sntrup761, with p = 761, q = 4591, and w = 286. For this reason, we will focus on this parameter set throughout this paper.

Streamlined NTRU Prime
The key generation of Streamlined NTRU Prime is described in Algorithm 1, and the encapsulation and decapsulation in Algorithms 2 and 3, respectively.

Design consideration with FPGAs
Field-programmable gate arrays (FPGAs) are popular hardware implementation platforms, as one can easily construct and prototype customized digital logic circuits without the large cost of manufacturing ASICs. Most FPGAs provide several kinds of general-purpose resources, which either are common building blocks of digital logic circuits or can implement arbitrary Boolean functions constructed from basic logic gates. For hardware implementations on FPGAs, the utilization of these resources is one of the most important bases of comparison among similar implementations. To "make an apples-to-apples comparison," a specific FPGA platform is often designated in a call for proposals. NIST recommends that "(PQC submission) teams generally focus their hardware implementation efforts on Artix-7" as an FPGA platform [2]. Artix-7™ is an FPGA platform manufactured by Xilinx®. We will focus on Xilinx FPGAs in this paper, in particular Xilinx Zynq® Ultrascale+™ and Artix-7 FPGAs, but note that the design philosophy remains the same as long as the resources are of similar types and structures, even when the FPGA manufacturer differs. Here, we introduce the main FPGA resources whose properties our design takes advantage of.

Look-up tables (LUTs)
Look-up tables are very basic units in popular FPGAs. A LUT is a combinational logic unit with usually 4-6 input bits and 1-2 output bits. Here, we denote an LUT with m input bits and n output bits as LUT m,n . An LUT can be considered as a block of read-only memory. For example, an LUT 5,2 can be considered as a block with 32 cells, each of which contains 2 bits. Xilinx Zynq Ultrascale+ and Artix-7 provide LUT units which support both the functions of LUT 5,2 and LUT 6,1 [21,22].
Usually, LUTs are used to implement combinational digital circuits, but they are also useful for implementing read-only memories (ROMs) and random-access memories, called distributed RAM. For example, to construct a 12-bit, 32-cell read-only memory unit, we need 6 LUT 5,2 units. A 13-bit, 64-cell block costs 13 LUT 6,1 units.
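This accounting can be sketched as a small cost function. The function below is our own illustration of the counting rule in the text, not part of the paper's design; the handling of ROMs deeper than 64 cells is a simplifying assumption that ignores the output multiplexer cost.

```python
def lut_rom_cost(cells: int, width_bits: int) -> int:
    """Estimate the LUT cost of a small ROM on Xilinx fabric.

    A LUT 5,2 covers 32 cells x 2 output bits; a LUT 6,1 covers
    64 cells x 1 output bit. We pick whichever matches the depth.
    """
    if cells <= 32:
        # LUT 5,2: each unit supplies 2 of the output bits
        return (width_bits + 1) // 2
    elif cells <= 64:
        # LUT 6,1: each unit supplies 1 output bit
        return width_bits
    else:
        # deeper ROMs: stack 64-cell layers (mux cost ignored here)
        return ((cells + 63) // 64) * width_bits

# The two examples from the text:
print(lut_rom_cost(32, 12))  # 6 LUTs
print(lut_rom_cost(64, 13))  # 13 LUTs
```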

Digital Signal Processing (DSP) slices
A DSP slice is an arithmetic unit which consists of one multiplier and some accumulators. The multiplier supports signed integer multiplication with wide operands, and it would cost many LUTs to construct the same multiplier if DSP slices were not used. Xilinx Zynq Ultrascale+ provides DSP slices with 27 × 18-bit signed integer multipliers [26], and Xilinx Artix-7 provides DSP slices with 25 × 18-bit signed integer multipliers [23].
If we multiply two integers whose bit lengths exceed what one DSP slice can offer, we can either apply a pipelined approach or connect two or more DSP slices in parallel. For example, to multiply a 24-bit signed integer with a 32-bit signed integer, we can connect two slices in parallel, or multiply the multiplicand first with the least significant 16 bits of the other integer and then with the most significant bits. If we can control the bit lengths of the integers we want to multiply, however, we are able to limit the bit lengths so that one DSP slice can handle the multiplication.
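The splitting step can be sketched in software; the function below is our own illustration (names are ours), splitting the 32-bit operand into an unsigned low half and a signed high half so that each partial product fits a signed 25 × 18-bit multiplier.

```python
def split_mul(a: int, b: int) -> int:
    """Multiply a 24-bit signed a by a 32-bit signed b using two
    DSP-sized partial products, as sketched in the text.

    b is split into an unsigned 16-bit low half (fits the 18-bit
    signed port) and a signed 16-bit high half.
    """
    b_lo = b & 0xFFFF   # unsigned low 16 bits
    b_hi = b >> 16      # arithmetic shift: signed high 16 bits
    p_lo = a * b_lo     # DSP 1: 24-bit signed x 17-bit signed
    p_hi = a * b_hi     # DSP 2: 24-bit signed x 16-bit signed
    return (p_hi << 16) + p_lo

assert split_mul(-1234567, -987654321) == -1234567 * -987654321
assert split_mul(8388607, 2147483647) == 8388607 * 2147483647
```

The identity b = (b >> 16)·2^16 + (b & 0xFFFF) holds for negative b as well, because the right shift is arithmetic.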

Block memories (BRAMs)
A block memory unit stores a certain large number of bits. Every BRAM provides several channels with partially customizable data widths during the hardware synthesis stage. We can read and/or write the data stored in one BRAM only via the channels. This fact means that we can access as many words simultaneously as the number of channels in one BRAM, and if we want to access more words at the same clock cycle, we need either to duplicate the data from the BRAM to another in advance, or to partition the data we want to store in two or more BRAMs.
Both Xilinx Zynq Ultrascale+ and Artix-7 provide BRAM units [24,25], each of which contains 36 kbits and two channels. Every BRAM unit can be divided into two blocks of 18 kbits, each of which in turn provides two channels, and the synthesis report records 0.5 BRAM of utilization when only an 18 kbit block is used. In both FPGAs, the data width of each 18 kbit block can be customized as 1, 2, 4, 9, or 18 bits.

Multiplication using Good's trick with NTT
Polynomial multiplication is one of the most important operations which needs to be carefully designed in NTRU Prime (the other is the polynomial inversion).
Polynomials in R/q can be written as f(x) = ∑_{i=0}^{p−1} f_i x^i with f_i = f_i mod± q, where we denote r = n mod± q (signed modulo) for an integer n if −(q−1)/2 ≤ r ≤ (q−1)/2 and there exists an integer m such that n = mq + r. Reducing modulo x^p − x − 1 is easy, since we only need to substitute x^j with x^{j−p+1} + x^{j−p} for every j ≥ p and bring the eventual polynomial into the form ∑_{i=0}^{p−1} f_i x^i. So the key is to evaluate f(x)g(x) mod± q. For multiplying two polynomials of degree p − 1, a fast-Fourier-transform-like approach can effectively reduce the number of integer multiplications needed, from O(p²) to O(p log p). Such an approach operating in a prime field Z/q rather than over the complex numbers is a number theoretic transform (NTT).
NTT is usually a 2^k-point transformation method with a pre-determined positive integer k (written as NTT_{2^k}(·), with inverse operation iNTT_{2^k}(·)). For polynomials f(x) and g(x) of degree at most 2^k − 1 and with at least 2^{k−1} zero coefficients, the polynomial multiplication can be implemented as f(x)g(x) = iNTT_{2^k}(NTT_{2^k}(f(x)) ∘ NTT_{2^k}(g(x))), where ∘ is the point-wise multiplication. For NTRU Prime with p = 761, we need to pad 263 monomials with zero coefficients onto the polynomials to make NTT_{2^11}(·) work. Good [11] provides another approach, applying NTT_{2^9}(·) instead and then doing 9 multiplications of 512-coefficient polynomials, each having at least 256 zero coefficients. In this approach, we need only pad 7 zeros onto the polynomials. This idea was originally introduced to NTRU Prime in [1,8].
In the case p = 761, we regard the polynomial as having degree 767 instead, with the coefficients of the high-degree terms set to 0. Now, since f(x)g(x) is of degree at most 1534, we have f(x)g(x) = f(x)g(x) mod (x^{3·512} − 1). We set x = yz, with y³ = 1 and z^{512} = 1. In detail, for integers i in [0, 1535], the mapping i ↦ (j, ℓ) = (i mod 3, i mod 512), equivalently i ≡ 1024j + 513ℓ (mod 1536), with 0 ≤ j ≤ 2 and 0 ≤ ℓ ≤ 511, is one-to-one and onto. Then, f(x) (and g(x), likewise) can be expressed as f(x) ≡ f_{y_0}(z) + f_{y_1}(z)·y + f_{y_2}(z)·y² (mod (y³ − 1)(z^{512} − 1)), where for convenience we define f_{y_j}(z) = ∑_{ℓ=0}^{511} f_{((ℓ−j) mod 3)·2^9 + ℓ} · z^ℓ. We can assert that the f_·(z) and g_·(z) are all of z-degree 511 with at least half of their coefficients being 0, so that each product can be evaluated as f_·(z)g_·(z) ≡ iNTT_{2^9}(NTT_{2^9}(f_·(z)) ∘ NTT_{2^9}(g_·(z))). Then, f(x)g(x) ≡ (f_{y_0}(z) + f_{y_1}(z)y + f_{y_2}(z)y²)(g_{y_0}(z) + g_{y_1}(z)y + g_{y_2}(z)y²) (mod (y³ − 1)(z^{512} − 1)). We can regard the polynomial multiplication h(x) = f(x)g(x) as a schoolbook multiplication with respect to y, where the coefficients of the powers of y are sums of products of the polynomials in z, which can be computed by NTT. Notice that for every h_{j,ℓ}, the index j points to the coefficient polynomial of y^j, and the index ℓ points to the coefficient of z^ℓ in that polynomial. To map the coefficients of h(z, y) back to those of h(x), the inverse of the index mapping above is applied.
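The index permutation at the heart of Good's trick can be checked with shrunken parameters. The sketch below (ours, purely illustrative) uses 3 × 8 points instead of 3 × 512, and a schoolbook 2-D convolution stands in for the per-row NTTs; only the CRT index mapping is the point being demonstrated.

```python
import random

N1, N2 = 3, 8        # stand-ins for 3 and 512 (must be coprime)
N = N1 * N2

def goods_map(f):
    """Map f(x) mod x^N - 1 onto the 2-D array of f(y, z) mod
    (y^N1 - 1)(z^N2 - 1) via x -> yz: index i goes to (i % N1, i % N2)."""
    F = [[0] * N2 for _ in range(N1)]
    for i, fi in enumerate(f):
        F[i % N1][i % N2] = fi
    return F

def goods_unmap(F):
    return [F[i % N1][i % N2] for i in range(N)]

def mul_2d(F, G):
    """Schoolbook multiplication mod (y^N1 - 1)(z^N2 - 1); in the
    hardware, the z-dimension products are done with N2-point NTTs."""
    H = [[0] * N2 for _ in range(N1)]
    for j1 in range(N1):
        for j2 in range(N1):
            for l1 in range(N2):
                for l2 in range(N2):
                    H[(j1 + j2) % N1][(l1 + l2) % N2] += F[j1][l1] * G[j2][l2]
    return H

def mul_cyclic(f, g):
    """Reference: multiplication mod x^N - 1."""
    h = [0] * N
    for i, fi in enumerate(f):
        for k, gk in enumerate(g):
            h[(i + k) % N] += fi * gk
    return h

f = [random.randrange(-2, 3) for _ in range(N)]
g = [random.randrange(-2, 3) for _ in range(N)]
assert goods_unmap(mul_2d(goods_map(f), goods_map(g))) == mul_cyclic(f, g)
```

The assertion holds because gcd(N1, N2) = 1 makes x ↦ yz a ring isomorphism between Z[x]/(x^N − 1) and Z[y, z]/(y^N1 − 1, z^N2 − 1).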

Chinese remainder theorem and NTT
To compute NTT_{2^k}(·), we need to find a 2^k-th root of unity in the field Z/q. Specifically, to apply Good's trick for p = 761 and q = 4591, we need to find a 512-th root of unity in Z/4591. This is impossible, since 4591 − 1 = 2 · 3³ · 5 · 17 does not contain 512 = 2⁹ as a factor.
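This obstruction, and the suitability of the alternative moduli, can be checked directly; the script below is a sanity check of ours, not part of the paper's design.

```python
def root_of_unity_512(q):
    """Return a primitive 512-th root of unity in Z/q, or None if
    512 does not divide q - 1 (so no such root exists)."""
    if (q - 1) % 512 != 0:
        return None
    for g in range(2, q):
        w = pow(g, (q - 1) // 512, q)
        # w has 2-power order dividing 512; order is exactly 512
        # iff squaring 8 times does not already reach 1
        if pow(w, 256, q) != 1:
            return w
    return None

print(root_of_unity_512(4591))   # None: 4590 = 2 * 3^3 * 5 * 17
for q in (7681, 12289, 15361):
    w = root_of_unity_512(q)
    assert w is not None and pow(w, 512, q) == 1
```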
In [8], it is suggested to apply the Chinese remainder theorem (CRT) to resolve this issue. To make clear how the CRT can be applied, the following two cases are considered: Case 1 Polynomial multiplications used in the standard of NTRU Prime are multiplications of one small polynomial (coefficients all −1, 0, or 1) with one R/q polynomial (coefficients in the range [−(q−1)/2, (q−1)/2], i.e., [−2295, 2295]). If we use the schoolbook scheme, we can see that all coefficients in the polynomial multiplication, before reduction modulo q, fit in 22-bit signed integers. If, instead, we want to apply Good's trick, we can choose two good primes (in whose finite fields we can find a 512-th root of unity) q_1 and q_2 such that q_1 q_2 > p(q − 1) + 1. Then, we apply Good's trick in Z/q_1 and Z/q_2 separately.
For all coefficients of the x^i computed with an NTT in Z/q_1 and Z/q_2, respectively, say h_{i,1} and h_{i,2}, we can get the eventual h_i from h_i^{(0)} = ((h_{i,1}·q̄_2) mod± q_1)·q_2 + ((h_{i,2}·q̄_1) mod± q_2)·q_1, where q̄_1 ≡ q_1^{−1} (mod q_2) and q̄_2 ≡ q_2^{−1} (mod q_1). We can see that h_i^{(0)} is in the range [−q_1q_2 + (q_1+q_2)/2, q_1q_2 − (q_1+q_2)/2], so we need only check whether it lies in [−q_1q_2/2, q_1q_2/2] and tune it up or down by q_1q_2 otherwise. Notice that we control the logic such that we always multiply a 25-bit signed integer with an 18-bit signed integer, as the built-in multipliers in the Xilinx FPGAs we focus on are at least signed 25 × 18-bit multipliers. Controlling the size of the multiplications in this manner provides portability between high-end and low-end FPGAs and utilizes the built-in multipliers more effectively. This fact is important in the next case.

Case 2 In our implementation, batch inversion is applied (see Sect. 3.4). This makes multiplication of two R/q polynomials necessary. In this case, all coefficients in the polynomial multiplication before reduction modulo q lie in [−4,008,206,025, 4,008,206,025], so the coefficients are 33-bit signed integers. Here, three good primes are picked, and the analogous three-term recombination is applied, tuning the result up or down by q_1q_2q_3. We choose q_1 = 7681, q_2 = 12,289, and q_3 = 15,361. In this configuration, q̄_23 = 2562 = (A02)_16, q̄_31 = 8182 = 2^13 − (A)_16, and q̄_12 = 10 = (A)_16, so all three products h̄_{i,a} = (h_{i,a}·q̄_bc) mod± q_a can be computed with simple additions or subtractions only, followed by a modulo operation. This makes all h̄_{i,a} representable as 14-bit signed integers. Multiplying by the remaining q_bq_c can then also be done with one 25 × 18-bit multiplier in this configuration.
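A software model of the signed CRT recombination with the three primes above can illustrate the arithmetic (helper names are ours; this is a sketch of the math, not the RTL):

```python
Q1, Q2, Q3 = 7681, 12289, 15361
Q = Q1 * Q2 * Q3

def smod(x, m):
    """Signed (centered) modulo: result in [-(m-1)//2, (m-1)//2] for odd m."""
    r = x % m
    return r - m if r > m // 2 else r

def crt3(h1, h2, h3):
    """Recombine residues h_a = h mod+- q_a into h, assuming |h| < Q/2
    (true for 33-bit coefficient sums, since Q1*Q2*Q3 > 2^34)."""
    # per-prime correction terms: (h_a * (q_b*q_c)^-1) mod+- q_a
    t1 = smod(h1 * pow(Q2 * Q3, -1, Q1), Q1)
    t2 = smod(h2 * pow(Q1 * Q3, -1, Q2), Q2)
    t3 = smod(h3 * pow(Q1 * Q2, -1, Q3), Q3)
    # weighted sum, then one final centering step ("tune up or down")
    return smod(t1 * Q2 * Q3 + t2 * Q1 * Q3 + t3 * Q1 * Q2, Q)

h = -4008206025  # extreme 33-bit coefficient from Case 2
assert crt3(smod(h, Q1), smod(h, Q2), smod(h, Q3)) == h
```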

Montgomery's trick
Montgomery's trick is a method to accelerate inversion by doing batch inversion [16]. This allows us to replace n inversions in a ring with a single inversion, together with 3n − 3 multiplications. Montgomery's trick is described in Algorithm 4. The trick leads to a significant speedup as long as a multiplication is more than 3 times as fast as a single inversion, and one has enough storage space for the intermediate products. Batch inversion with Montgomery's trick for NTRU Prime was already proposed in the original paper [4]. It was recently implemented for fast key generation in an integration of NTRU Prime into OpenSSL [3]. There, for the parameter set sntrup761 and a batch size of 32, it led to a key generation speed of 156,317 cycles per key, compared to the non-batch 819,332 cycles.

Algorithm 4: Description of Montgomery's trick for batch inversion
Input: n: the batch size; f_x: an array of n numbers to be inverted
Output: the array of the n inverses f_x^{−1}
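A minimal field-element model of the trick (ours, with Z/q standing in for the polynomial ring): build prefix products, invert the full product once, then sweep backwards, for 3(n − 1) multiplications in total.

```python
def batch_invert(xs, q):
    """Montgomery's trick: invert all elements of xs in Z/q using a
    single modular inversion and 3*(len(xs)-1) multiplications."""
    n = len(xs)
    prefix = [0] * n                 # prefix[i] = xs[0]*...*xs[i] mod q
    acc = 1
    for i, x in enumerate(xs):
        acc = acc * x % q            # n-1 nontrivial multiplications
        prefix[i] = acc
    inv = pow(acc, -1, q)            # the single inversion
    out = [0] * n
    for i in range(n - 1, 0, -1):
        out[i] = inv * prefix[i - 1] % q   # n-1 multiplications
        inv = inv * xs[i] % q              # n-1 multiplications
    out[0] = inv
    return out

q = 4591
xs = [3, 7, 2295, 1234]
assert all(x * y % q == 1 for x, y in zip(xs, batch_invert(xs, q)))
```

Note the caveat discussed in Sect. 3.4: a single non-invertible element would spoil the whole batch, so inputs must be invertible.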

Hardware implementation
In this section, we describe the basic functionality and architecture of all core functions and modules of our Streamlined NTRU Prime implementation.

Parallel schoolbook multiplier
This multiplier uses a massively parallel version of the schoolbook multiplication algorithm. It consists of an LFSR, an accumulator register, and a large number of multiply-accumulate units.
The use of schoolbook multiplication both for NTRU Prime [15] and for other lattice KEMs [10,18] is not new. Two different implementations, based on the same overall design architecture, are presented in this paper: the first is a high-speed, high-area implementation, and the second is a much smaller, but also slower implementation. Both are similar with regard to the speed-area product. They also have very simple memory access patterns. The difference between the two is that the faster implementation stores all values in flip-flops, whereas the compact implementation uses distributed RAM. The architecture is shown in Fig. 1.
The high performance and efficiency of this design is based on the fact that in Streamlined NTRU Prime, all multiplications have one polynomial in R/3, and the second either also in R/3 or in R/q. Multiplications with both polynomials in R/q do not normally occur. (The only exception is during the batch inversion using Montgomery's trick, see Sect. 3.4.) This idea was previously presented in [10,18], and allows a number of optimizations. The fact that one polynomial is always in R/3 allows the individual multiply-accumulate (MAC) units to be very simple, as only a very small number of bit operations are needed. This in turn leads to a very small footprint in the FPGA. In addition, we do not perform any modular reduction at this step. Its algorithmic description can be found in Algorithm 5.
Before the multiplication starts, the small R/3 polynomial is loaded into an LFSR of length p, with the tap points set to correspond to the polynomial of the NTRU Prime ring, x^p − x − 1. For this reason, 3 bits are needed per coefficient, as the tap points can lead to coefficients in the range from −2 to 2. Once the R/3 polynomial is fully shifted into the LFSR, the multiplication can begin. During multiplication, one coefficient from the R/q polynomial is retrieved from BRAM at a time. This coefficient is then multiplied with every single coefficient in the LFSR, and added coefficient-wise to an accumulator register. The LFSR is then shifted once, and the next coefficient from the R/q polynomial is retrieved. This repeats for every coefficient of the R/q polynomial. After this, the accumulator register contains the completed polynomial multiplication. The register contents are then sent to the multiplier output, where they are taken modulo q. Because of the LFSR, no additional polynomial modulo reduction is required.
For the high-speed schoolbook multiplier, p MAC units are instantiated, and as a result, one coefficient from the R/q polynomial can be processed per clock cycle. For the compact implementation using distributed RAM, 24 MAC units are instantiated. This number comes from the value of p and the size of the smallest distributed RAM blocks. In Xilinx FPGAs, the LUT can be configured as a 32-bit dual-port RAM, with one read/write port and one read-only port. With p = 761 and ⌈761/24⌉ = 32, 24 MAC units pack the RAM as densely as possible. This means that every 32 clock cycles, a new coefficient from the R/q polynomial is processed, and the multiplier thus also takes 32 times as many cycles.

Algorithm 5: Single-coefficient multiply-accumulate (MAC) algorithm. Note that no modulo calculation is performed here; 23 bits are large enough that no overflow can occur.
Input: a: a 23-bit signed number, b: a 13-bit signed number, c: a small coefficient in [−2, 2]
1: r ← b·c (only negations and shifts are needed, since c ∈ [−2, 2])
2: return a + r

Fig. 1 Architecture of the parallel schoolbook polynomial multiplier for the parameter set sntrup761. The accumulator array has a size of p·23 bits. The blocks with the label MAC are described in Algorithm 5. The differences between the high-speed and the low-area multiplier are the number of MAC units, and whether the accumulator array and the small-polynomial LFSR are implemented in flip-flops or in distributed RAM.
It takes p clock cycles to shift the R/3 polynomial into the LFSR. It also takes p clock cycles to shift the result out of the accumulator array, during which the accumulator array is also reset to 0. Both of these operations can be interleaved to save time, i.e., a new R/3 polynomial can be shifted in while the accumulator array is shifted out. As a result, for p = 761, the high-speed multiplication takes 1522 cycles when interleaved, and otherwise 2283 cycles.
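The multiplier above can be summarized in a behavioral model (ours, with a tiny p for illustration): shifting the LFSR is multiplication of its contents by x modulo x^p − x − 1, so the ring reduction is absorbed into the shifts and only a final mod q remains.

```python
import random

P, Q = 7, 4591   # tiny illustrative p; the paper uses p = 761

def lfsr_shift(s):
    """One LFSR shift = multiplying the stored polynomial by x mod
    x^P - x - 1: the top coefficient feeds back into the taps at
    positions 0 and 1, so ternary contents stay in [-2, 2]."""
    return [s[-1], s[0] + s[-1]] + s[1:-1]

def lfsr_schoolbook(small, big):
    """Behavioral model of the parallel multiplier: per 'cycle', one
    R/q coefficient is multiplied into every LFSR cell (the MAC
    units), then the LFSR shifts once."""
    acc = [0] * P            # the accumulator array (no mod yet)
    s = list(small)          # LFSR contents: small * x^0
    for gi in big:           # one R/q coefficient per cycle
        acc = [a + gi * c for a, c in zip(acc, s)]
        s = lfsr_shift(s)
    return [a % Q for a in acc]

def reference(small, big):
    """Naive multiply, then reduce mod x^P - x - 1 and mod Q."""
    h = [0] * (2 * P - 1)
    for i, a in enumerate(small):
        for j, b in enumerate(big):
            h[i + j] += a * b
    for j in range(2 * P - 2, P - 1, -1):
        h[j - P + 1] += h[j]
        h[j - P] += h[j]
    return [c % Q for c in h[:P]]

small = [random.choice([-1, 0, 1]) for _ in range(P)]
big = [random.randrange(-(Q - 1) // 2, Q // 2 + 1) for _ in range(P)]
assert lfsr_schoolbook(small, big) == reference(small, big)
```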

Architecture of R/q · R/q NTT multiplier
The architecture of the NTT multiplication employing Good's trick and a CRT map is shown in Fig. 2; it is modified from the NTT/INTT architecture of [27].
Coefficients of the polynomials f(x) and g(x) are partitioned into those of f_{y_0}(z), f_{y_1}(z), f_{y_2}(z), g_{y_0}(z), g_{y_1}(z), and g_{y_2}(z), as mentioned in Sect. 2.4. Each z-polynomial is put into banks 0 and 1, and with a proper design of the four address generators for the reading and writing channels of banks 0 and 1, the z-polynomials are passed through the 3 butterfly units (for Z/7681, Z/12,289, and Z/15,361, respectively), and the corresponding NTT vectors are calculated. Bank 2 is then used to store the result of the summation of the point multiplications; its content contains the NTT vectors of h_{y_0}(z), h_{y_1}(z), and h_{y_2}(z). These are then transformed back with three INTT operations, making h(x) = f(x)g(x) ready, where each coefficient is a 3-tuple whose entries represent the coefficient modulo 7681, 12,289, and 15,361, respectively. The CRT operation is then applied to find each coefficient modulo 4591, followed by the reduction modulo x^p − x − 1.

This multiplier is used for the R/q · R/q multiplication during batch inversion (see Sect. 3.4) and takes 35,463 clock cycles. The control unit consists of the following stages: load, NTT, point_mul, reload, INTT, crt, and reduce. When the product of the polynomials is ready, the control unit falls into the finish stage, and the result can be fetched out of the multiplier.

- In the INTT stage, the process is similar to the NTT stage. The differences are that the butterfly units and the ω address generator are now operated in the inverse mode. There are three INTT operations to be done, and 6945 cycles are consumed in this stage.
- In the crt stage, the DSP slices in the three butterfly units are applied to calculate the partial results h̄_{i,a}·q_b·q_c of the CRT. All of the partial results are then added into one integer, which is the input of the modulo-q unit. After this, f(x)g(x) is ready, but without the reduction modulo x^p − x − 1 applied.
It takes 3080 cycles to complete this stage. We now inspect in detail how the coefficients of the polynomials f_{y_i}(z) and g_{y_i}(z) are stored in the memory banks. One z-polynomial requires 512 cells for storing its coefficients, and we save half of the coefficients in 256 cells of bank 0 and the other half in 256 cells of bank 1. This design feeds the inputs simultaneously into the butterfly units; an efficient in-place memory addressing scheme introduced in [14] provides the formulas for the bank index B(·) and the lower bits of the address A_l(·). The higher bits of the address A_h(·) just indicate which polynomial it is. It should be noted that in the reload stage and at the end of the multiplication, as the NTT itself re-arranges the order of the coefficients such that the address within one polynomial is bit-reversed, the lower 9 bits of the address need to be reversed. The higher 3 bits do not join the bit reversal.

Fig. 2 Architecture of Good's trick NTT multiplication
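The address fix-up described above can be sketched as follows (the helper is ours): reverse only the lower 9 bits of a 12-bit address while leaving the 3-bit polynomial selector untouched.

```python
def fix_addr(addr: int) -> int:
    """Bit-reverse the lower 9 bits of a 12-bit address; the upper
    3 bits (the polynomial selector) do not join the reversal."""
    low = addr & 0x1FF
    rev = int(format(low, '09b')[::-1], 2)   # 9-bit bit reversal
    return (addr & 0xE00) | rev

# coefficient 1 of polynomial 2 maps to coefficient 256 of polynomial 2
assert fix_addr((2 << 9) | 1) == (2 << 9) | 256
```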

Generation of short polynomials
During the encapsulation and key generation in NTRU Prime, a so-called short polynomial has to be created. For this, the original NTRU Prime paper suggests using a sorting network [4], and using a sorting algorithm is a well-established method to randomly shuffle a list in constant time [10,20]. In our case, a list of p 32-bit random numbers is created. Of the first w, the least significant bit is set to 0 so that the number is always even. For the others, the lowest two bits are set to (0, 1). This list of numbers is then sorted, after which the upper 30 bits are discarded. The remaining two-bit numbers are then decremented by one. As a result, exactly w elements are either 1 or −1, and the rest are all zero.

An alternative method for generating short polynomials would be a shuffling algorithm such as Fisher-Yates, as used by a Dilithium hardware implementation [13]. However, in Dilithium, a public polynomial is sorted, whereas in NTRU Prime, a secret polynomial is sorted, which thus requires a constant-time algorithm. As the Fisher-Yates shuffle is difficult to implement in constant time [10,20], we do not consider it an option.

The reference C implementation of NTRU Prime [7], as well as the hardware implementation in [15], use a constant-time sorting network. However, on an FPGA, we can use a faster method in the form of the radix sorting algorithm [12]. Radix sort is an extremely fast sorting algorithm, offering O(n) speed compared to the O(n log n) of the sorting network used in [7,15]. But radix sort has the drawback of input-dependent addressing, which would disqualify it for memory architectures that have a cache, due to side-channel leakages. As the BRAMs on an FPGA do not have any sort of cache, we can safely implement the algorithm. Our implementation is based on the radix sorting algorithm found in the SUPERCOP benchmark suite [5]. As a result, we can generate a new short polynomial in 4837 cycles.
A comparison of different sorting algorithms is in Table 3.
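The generation procedure can be sketched end to end (our illustration; Python's sort stands in for the hardware radix sorter, and the randomness source is of course not a TRNG):

```python
import random

def short_poly(p: int, w: int):
    """Constant-weight short polynomial via sorting, as described:
    the first w random numbers are forced even (coefficient -1 or 1),
    the rest get low bits (0, 1) (coefficient 0). Sorting shuffles;
    the low 2 bits minus 1 give the coefficients."""
    nums = [random.getrandbits(32) for _ in range(p)]
    nums = [x & ~1 for x in nums[:w]] + [(x & ~3) | 1 for x in nums[w:]]
    nums.sort()                      # radix sort in the hardware
    return [(x & 3) - 1 for x in nums]

f = short_poly(761, 286)
assert sum(1 for c in f if c != 0) == 286      # exactly w nonzero
assert all(c in (-1, 0, 1) for c in f)
```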
A further optimization we have implemented is the pregeneration of short polynomials. As short polynomials can be generated independently of the operation (encapsulation or key generation) or any other input (e.g. the public key), we can pregenerate a short polynomial, instead of generating it on-demand. This pregenerated short polynomial is then cached, and is immediately output when the encapsulation or key generation starts. Once it has been output, we can use the rest of the time spent on encapsulation or key generation to pregenerate a new short polynomial for the next operation. This in particular speeds up encapsulation, as the rest of the modules do not have to wait until the sorting has completed. Note that this pregeneration is also possible for NTRU, but not Saber or Kyber, and would allow for a similar speed up.
The one case where this pregeneration would not be possible is if an encapsulation starts immediately after power-on. In that case, the encapsulation would have to wait until a short polynomial is generated. Further encapsulations, however, would be able to use a cached pregenerated short polynomial, so only the very first encapsulation would be delayed. However, the described scenario is unlikely to occur in the real world, as it disregards aspects such as loading the public key from flash storage, which will likely take longer than the 4837 cycles the sorting takes.

Batch inversion using Montgomery's trick
To accelerate the inversion during key generation, we employ batch inversion using Montgomery's trick. For the polynomial inversion itself, we use the constant-time extended GCD algorithm from [6]. This algorithm uses a constant number of "division steps" (or divsteps) to calculate the inverse of the input polynomial. It is used by the reference implementation of NTRU Prime and was also used in a previous hardware implementation [15]. We extend it by allowing a configurable number of divsteps per clock cycle; increasing the number of divsteps per clock cycle proportionally decreases the number of cycles. The architecture of the R/q inversion is shown in Fig. 3. We do not consider alternative inversion methods, such as Fermat's method or Hensel lifting, as they are either slower, not constant-time, or not applicable to the rings used in NTRU Prime [3,6].

Fig. 3 Architecture of the R/q inversion module using the extended GCD algorithm. The to-be-inverted polynomial is stored in RAM g. At the start of the algorithm, RAM v stores an all-zero polynomial, RAM r the polynomial (3^{−1} mod q, 0, …, 0), and RAM f the polynomial (1, 0, …, 0, −1, −1). The final result is stored in RAM v. The section marked "Divstep" is the part that is replicated when multiple divsteps are performed per clock cycle; this also requires wider read/write ports to the RAMs. The architecture of the R/3 inversion is identical, except that all arithmetic operations are performed modulo 3.
In our implementation, we only implement batch inversion for the inversion in R/q. For inversion in R/3, it is more efficient to simply increase the number of parallel divsteps, as the divstep operation in R/3 is trivial (see Table 6). With, e.g., 32 parallel divsteps, an inversion in R/3 takes 47,166 cycles. The inversion in R/3 also has the potential of encountering non-invertible polynomials. We skip the invertibility check and simply redo the inversion with a new polynomial in case of a non-invertible polynomial. For batch inversion, however, we would have to check every polynomial for invertibility, as a single non-invertible polynomial would force us to redo the entire batch.
Doing batch inversion has an additional caveat: it requires n multiplications where both polynomials are in R/q (line 7 in Algorithm 4). This is an issue, as the polynomial multiplier for NTRU Prime normally always has one operand in R/3. This means we cannot use our schoolbook multiplier, as it has optimizations that rely on one operand being in R/3. As a result, we add a second multiplier to our design, namely the NTT multiplier with a CRT map for the R/q · R/q multiplication.
Due to the additional R/q·R/q multiplier, batch inversion is not automatically the optimal way of inverting polynomials in R/q. This is because the additional multiplier consumes hardware resources that could otherwise be used to implement more parallel divsteps for the R/q inversion.
In addition, larger batch sizes require more BRAM to store intermediate results. Depending on the speed and hardware consumption of non-batch inversion, batch inversion, and multiplication, respectively, together with the available hardware resources and batch size, the optimal solution varies. A contour plot that shows the minimum batch size needed for Montgomery's trick to be worthwhile for different inversion and multiplication speeds is shown in Fig. 4. In practice, we recommend batch sizes of 5, 21, and 42. These sizes were found via experimentation and pack the 36 kbit BRAMs available in Xilinx FPGAs as densely as possible. Table 10 lists the additional BRAM cost for the different batch sizes, as well as the associated cycle counts.

Reduction without DSPs
In this section, we extend the technique of fast modular reduction in [27] (called shifting reduction in this paper) without using the additional DSP slices that are often necessary in a Barrett or Montgomery reduction unit. We apply this technique in the cases q ∈ {7681, 12,289, 15,361}. Moreover, for the case q = 4591, another reduction technique (called linear reduction in this paper) will be introduced. All four modular reductions are fully pipelined and can process one new operand per clock cycle.

Fig. 4 Minimum batch size when comparing the cycle count of the three multiplications incurred per polynomial inversion when using Montgomery's trick, against simply accelerating the inversion itself. This assumes a base R/q inversion speed of 1,200,000 cycles, which is roughly the number of cycles an R/q inversion takes with a single divstep per clock cycle. An example: assume the three multiplications take 40,000 cycles in total, which is roughly how long two R/q · R/3 and one R/q · R/q multiplications take in our design. At the same time, assume that with the extra hardware resources, we could alternatively accelerate the inversion by a factor of 2, so that it takes only 600,000 cycles. According to the plot, a batch size of 4 would be sufficient for Montgomery's trick to be worthwhile.

Fast signed modular multiplication on q = 12,289
We start with a modification of the unsigned reduction for q = 12,289 introduced in [27]; in the signed case, the reduction is slightly different.
The equivalent logic circuit is given in Fig. 5. The thicker blocks and dataflows mark where the signed reduction differs from [27].

Fast Signed Modular Multiplication on q = 7681
Shifting reduction can be easily applied in the case q = 7681, since q is of the form q = 2^h − 2^l + 1 (here h = 13 and l = 9).
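The idea can be sketched in software (ours): the identity 2^13 ≡ 2^9 − 1 (mod 7681) lets the bits above position 12 be folded down with shifts and additions only; a few folding rounds plus one final correction complete the reduction.

```python
Q = 7681   # 2^13 - 2^9 + 1

def shift_reduce(z: int) -> int:
    """Reduce z mod Q by folding the bits above position 12 using
    2^13 = 2^9 - 1 (mod Q): shifts and additions only."""
    for _ in range(3):                # three folding rounds suffice
        hi, lo = z >> 13, z & 0x1FFF  # here for 24-bit products
        z = (hi << 9) - hi + lo       # hi * (2^9 - 1) + lo
    return z % Q                      # final conditional correction

import random
a = random.randrange(-3840, 3841)     # centered R/q_1 operands
b = random.randrange(-3840, 3841)
assert shift_reduce(a * b) == (a * b) % Q
```

Negative inputs work because Python's right shift is arithmetic, so z = (z >> 13)·2^13 + (z & 0x1FFF) holds for all integers.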

Fast signed modular multiplication on q = 15,361
We still use shifting reduction in the case q = 15,361. Suppose −7680 ≤ a ≤ 7680 and −7680 ≤ b ≤ 7680. Then z = ab is a 27-bit signed number with (47C0000)_16 = −58,982,400 ≤ z = ab ≤ 58,982,400 = (3840000)_16. We have q = 2^14 − 2^10 + 1, and the sign bit is z[26]. Note that the definition of the positive partial sum is slightly different from the other cases: the positive part z_p is an unsigned integer bounded by 16,383, and the negative part z_n is bounded by 751 + 4095 + 255 + 15 = 5116. Therefore, z_0 = z_p − z_n ≡ z (mod q) is an integer in [−5116, 16,383]. We need only check whether the value of z_0 is greater than 7680, and perform a subtraction of q if this is the case.
The circuit for signed reduction modulo 15,361 is omitted, as it is similar to those for moduli 7681 and 12,289. The main difference is again the dataflow for the sign bit.

Fast signed modular reduction on q = 4591
The reduction of integers modulo q = 4591 (or the other q's in the parameter sets of NTRU Prime) using shifting reduction is not easily obtained, since none of these primes is of the form q = 2^h − 2^l + 1. Specifically, q = 4591 = 2^12 + 2^9 − 2^4 − 1 has effective Hamming weight 4. Shifting reduction would make the bits spread into the lower bits, making the positive and negative parts of the partial results (z_p and z_n, as defined in the cases q ∈ {7681, 12,289, 15,361}) hard to analyze.
In the signed modification of the modulo-12,289 reduction, we separated the sign bit from the other bits and considered it independently. In fact, every bit can be considered independently, which is particularly useful in the case q = 4591: the reduction problem can be transformed into several signed additions. We call this technique Linear Reduction.
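The per-bit decomposition can be modeled as follows. This is a behavioral sketch of Linear Reduction, not the adder-tree circuit: each bit of a 33-bit signed input contributes a precomputed centered residue, and the final `% q` stands in for the small constant-time correction at the end.

```python
# Behavioral sketch of Linear Reduction for q = 4591: every bit of a
# 33-bit signed (two's-complement) input is reduced independently via a
# precomputed table of centered residues, turning the reduction into a
# tree of signed additions. The final '% q' stands in for the small
# correction steps (|acc| is at most 33 * 2295).
q = 4591
BITS = 33

def centered(x):
    x %= q
    return x - q if x > q // 2 else x

# residue of each bit position; the top bit is the sign bit of the
# two's-complement input and therefore contributes -2^32 (mod q)
WEIGHTS = [centered(1 << i) for i in range(BITS - 1)] \
        + [centered(-(1 << (BITS - 1)))]

def linear_reduce(z):
    u = z & ((1 << BITS) - 1)   # 33-bit two's-complement encoding of z
    acc = sum(WEIGHTS[i] for i in range(BITS) if (u >> i) & 1)
    return acc % q
```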
In the implementation under consideration, the integer z to be reduced is a 33-bit signed integer bounded by (11117A137)_16 ≤ z ≤ (0EEE85EC9)_16.

Encoding and decoding

We know that when the encode starts, M = (q, . . . , q) and len(M) is odd. This implies that we need only track m_0, m_1, and the output bytes for each regular pair of r's and for the last r. Table 1 shows the values of m_0, m_1, and the output bytes. The total encoded output is 1158 bytes.
For Round-encode, which uses the modulus 1531 = ⌈q/3⌉ in place of q, similar tracking info can also easily be predetermined; it is shown in Table 2. The total encoded output is 1007 bytes.
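The byte counts above can be reproduced with the recursive Encode routine of the NTRU Prime reference implementation, sketched here in Python; the streaming order of our hardware is not implied, only the tracked (m_0, m_1) pairs and output lengths.

```python
# Recursive Encode routine, after the NTRU Prime reference
# implementation: pairs of values are merged radix-style, and bytes are
# emitted whenever the running modulus m reaches 2^14. The output
# length depends only on M, which is what our hardware tracking tables
# exploit.
def Encode(R, M):
    if len(M) == 0:
        return []
    S = []
    if len(M) == 1:
        r, m = R[0], M[0]
        while m > 1:
            S += [r % 256]
            r, m = r // 256, (m + 255) // 256
        return S
    R2, M2 = [], []
    for i in range(0, len(M) - 1, 2):
        m, r = M[i] * M[i + 1], R[i] + M[i] * R[i + 1]
        while m >= 16384:
            S += [r % 256]
            r, m = r // 256, (m + 255) // 256
        R2 += [r]
        M2 += [m]
    if len(M) & 1:
        R2 += [R[-1]]
        M2 += [M[-1]]
    return S + Encode(R2, M2)
```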
All of the tracking information is provided from outside the encoder and the decoder, so the same circuit can perform the encode/decode for either choice of moduli. Both the encoder and the decoder need an internal memory buffer to store the intermediate R.
The block diagrams of the encoder and decoder are shown in Figs. 7 and 8, where the dashed blocks are outside the module. The parameter module is a look-up table holding either Table 1 or Table 2, making the encoder/decoder flexible enough to perform (or invert) either the R/q-encode or the Round-encode. The encoder needs one DSP slice to evaluate r_0 = r_0 + m_0 · r_1, and the decoder needs 4 DSP slices to evaluate r_0 = r_0 mod m_0 via Barrett reduction.
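The decoder's Barrett step can be sketched as follows; this is a generic Barrett reduction for the modulus sizes that occur here, and does not show how the work is pipelined across the four DSP slices.

```python
# Generic Barrett reduction, as used by the decoder for r0 mod m0. The
# reciprocal v is precomputed once per modulus; for m < 2^14 and
# r < 2^28 the multiply-shift estimate of the quotient is never too
# large and is off by at most 2, so at most two conditional
# subtractions remain.
K = 28

def barrett_reduce(r, m, v):
    qhat = (r * v) >> K        # estimate of r // m, never too large
    rem = r - qhat * m
    while rem >= m:            # at most two iterations
        rem -= m
    return rem
```

The reciprocal is obtained once per modulus as `v = (1 << K) // m`.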
The encoding of a public key and of a ciphertext takes 2297 and 2296 cycles, respectively. The decoding of a public key and of a ciphertext takes 1550 and 1541 cycles, respectively.

SHA-512 hash function
Streamlined NTRU Prime uses SHA-512 internally as its hash function, both to generate the shared secret after encapsulation and decapsulation and to create the ciphertext confirmation hash. The ciphertext confirmation hash is a hash of the public key and the short polynomial r, and is appended to the ciphertext. Our SHA-512 implementation is based on the implementation used in [15,19], but has been optimized to increase performance. Hashing a 1024-bit block takes 117 cycles.
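In software terms, the hashes the module computes are SHA-512 truncated to 32 bytes over prefixed inputs, as in the sntrup specification; the concrete prefix values used for the session key and confirmation hash are defined there and are not repeated here.

```python
import hashlib

# The hash used throughout Streamlined NTRU Prime: SHA-512 truncated to
# its first 32 bytes, applied to the input prefixed with a one-byte
# domain separator b (per the sntrup specification).
def hash_prefix(b: int, data: bytes) -> bytes:
    return hashlib.sha512(bytes([b]) + data).digest()[:32]
```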

Evaluation and comparison with other implementations
In this section, we compare and evaluate the individual submodules as well as the full design, and provide area utilization and performance results.

Table 3 shows a comparison of different sorting algorithms for arrays of size 761. The radix sort from this work is significantly faster than the sorting network from [15]. While not quite as fast as the FIFO merge sort from [10], our radix sort uses fewer LUTs, FFs, and BRAMs, and runs at a higher frequency. Due to the pregeneration of short polynomials, the cycle count of the sorting does not factor into the cycle count of the encapsulation, as long as the encapsulation operation takes longer than the sorting, which is the case. As such, any additional speed-up in sorting would not lead to a speed-up in encapsulation. This is not the case for batch key generation, as multiple short polynomials are needed: there, the key generation must wait until the sorting algorithm has been executed a number of times equal to the batch size.

Table 4 shows a comparison of different multiplication algorithms for NTRU Prime. This includes the Karatsuba multiplier from [15], the high-speed and low-area schoolbook multipliers from this work, and our new NTT and CRT multiplier. Our new high-speed schoolbook multiplier is by far the fastest, by over an order of magnitude; at the same time, it is also by far the most resource intensive. The Karatsuba-based multiplier from [15] is the most compact with regard to LUTs, but it is also the slowest and has a comparatively high BRAM usage. Our new low-area schoolbook multiplier uses no BRAM and only slightly more LUTs, but is more than three times faster in cycle count than the Karatsuba-based multiplier. The NTT and CRT multiplier has the benefit of being extendable to perform R/q · R/q multiplication, with no increase in cycle count and only a moderate increase in resource consumption.
Otherwise, the low-area schoolbook multiplier is better both in resources consumed and in cycle count, and requires no BRAM or DSPs. As a result, we only use the NTT multiplier for the R/q · R/q multiplication during batch inversion. (For Table 4, the target FPGA is a Xilinx Zynq Ultrascale+. All multipliers except the NTT labeled R/q · R/q multiply one polynomial in R/3 by a second polynomial in R/q. For the high-speed schoolbook multiplier, we assume that the loading of the small polynomial is interleaved with the output of the result; otherwise a multiplication takes 2283 cycles.)

Table 5 shows a comparison of our new encoder and decoder with the encoder and decoder from [15]. Our new encoder and decoder have the same or lower resource consumption, while at the same time significantly reducing the cycle counts and increasing the maximum clock frequency.

Table 6 shows a comparison of different inversion modules for NTRU Prime and NTRU-HPS; all implement the extended GCD algorithm. Our R/q inversion improves on the R/q inversion from [15]: with two divsteps per clock cycle, we gain a nearly two-fold speed-up. At the same time, the amount of LUTs and FFs is reduced, thanks to the improved modular reduction algorithm of our work, and the DSP count remains the same. Increasing the number of divsteps to four gives another speed-up of almost two, but also increases DSP, FF, and LUT consumption; in addition, distributed RAM is used instead of BRAM. For the inversion in R/3, it is clearly visible that increasing the number of divsteps leads to only a comparatively small increase in resource consumption in exchange for a significant increase in performance. While the inversion from [10] is over an order of magnitude faster, it also uses significantly more LUTs and FFs than our design with 32 divsteps per cycle. However, since the R/3 inversion is not the bottleneck during key generation, further increasing its speed is unnecessary.
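The radix sort compared in Table 3 sorts fixed-width unsigned keys by processing one digit per pass. A software equivalent is sketched below; the 8-bit digit width (four passes over 32-bit keys) is an illustrative choice, not necessarily that of the hardware module.

```python
# LSD (least-significant-digit) radix sort over 32-bit keys: each pass
# is a stable counting sort on one digit, so after the final pass the
# whole array is sorted. The 8-bit digit width is assumed for
# illustration only.
def radix_sort32(a):
    for shift in (0, 8, 16, 24):
        buckets = [[] for _ in range(256)]
        for x in a:
            buckets[(x >> shift) & 0xFF].append(x)
        a = [x for b in buckets for x in b]
    return a
```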

Comparison of the full design
In this section, we compare our implementation with existing Streamlined NTRU Prime implementations [9,15], as well as with NTRU-HPS821 [10], a round-3 finalist key-encapsulation algorithm. All Streamlined NTRU Prime implementations use the parameter set sntrup761, and NTRU-HPS821 has comparable security strength. The benchmark numbers of the individual operations of our implementation for the Zynq Ultrascale+ and the Artix-7 are listed in Tables 7 and 8, respectively. A comparison with existing Streamlined NTRU Prime implementations, as well as with NTRU-HPS821, is shown in Table 9. Benchmark numbers of our full implementation are listed in Table 11. Note that the cycle counts are not simply the sums of the cycle counts of the sub-modules, as there is some overlap of operations. Also note that, like the design in [15], both of our implementations do not contain a random number generator. This mirrors the reference design of NTRU Prime [7], and allows us to directly use the inputs of the known-answer tests to verify the correctness of our design.
However, particularly for the high-speed encapsulation, this does somewhat skew the comparison with other KEMs. This does not apply to the decapsulation, as it does not require any randomness. In addition, for the low-area design, the SHA-512 hash function could be used to generate the randomness from a seed, as the hash function is both fast enough and has enough idle time between its normal usages, though we did not implement this for this work. Our high-speed implementation has the fastest cycle counts and execution times of all Streamlined NTRU Prime implementations for all three operations. At the same time, while our low-area implementation does require slightly more LUTs (at most 31% more) than the lightweight implementation from [15], it is significantly faster, with speed-ups of 2.05×, 4.08×, and 3.04× for key generation, encapsulation, and decapsulation, respectively.
When comparing our high-speed implementation with that of NTRU-HPS821 [10], one can see that our encapsulation uses fewer LUTs, flip-flops, and BRAMs, but more DSPs. While our cycle count is slightly higher, this is compensated by the higher frequency, leading to a slightly faster execution time. For decapsulation, our design uses less of every resource except BRAM; in particular, it uses 31% fewer flip-flops and 78% fewer DSPs. Although our cycle count is higher and our frequency is lower, the total execution time is only 11% slower. For key generation, our design uses fewer LUTs, flip-flops, and DSPs, while also having a lower cycle count and a faster clock speed. However, our design does use significantly more BRAM due to the batch inversion. Batch inversion also has the downside of a large initial latency while the whole batch is calculated. Table 10 compares the cycle counts for different batch sizes: larger batches increase the total number of cycles to complete the batch, but dramatically decrease the amortized cycles per key. However, the speed-up from increasing the batch size from 21 to 42 is relatively small.

High-speed vs. low-area design
The differences between our high-speed and low-area implementations lie in a number of sub-modules. The low-area version does not use batch inversion for key generation, uses only 2 divsteps per clock cycle instead of 4 during the R/q inversion, and uses 2 divsteps instead of 32 for the R/3 inversion. The low-area implementation also uses the compact version of the parallel schoolbook multiplier. Finally, the high-speed implementation uses two separate decoders, one for public keys and one for ciphertexts. This allows the secret key (which also contains the public key) and the ciphertext to be decoded in parallel during decapsulation.
In the low-area implementation, only one decoder is present, and the decoding occurs sequentially. Note that the clock frequency and the other FPGA resources are only minimally affected by increasing the batch size.

Timing side channels
Both the high-speed and the low-area implementation are fully constant-time with regard to secret inputs. The radix sorting used in the generation of short polynomials does include secret-dependent memory indexing. However, as the BRAMs on modern Xilinx FPGAs have no cache, this does not expose a timing side channel. We did not, however, implement any protections against more advanced attacks such as differential power analysis (DPA).

Applicability to NTRU LPRime
As mentioned earlier, the NIST submission of NTRU Prime describes the two KEMs Streamlined NTRU Prime and NTRU LPRime [7]. They share many components, and many parts of our design can be reused to implement NTRU LPRime, namely the multipliers, the sorting module, the encoders and decoders, the hash module, and the modular reduction modules. In addition, one would require the AES-based XOF used in NTRU LPRime, as well as new state machines for the control flow and operation scheduling.

Conclusion
We present a novel and complete constant-time hardware implementation of Streamlined NTRU Prime, with two variants: a high-speed implementation and a low-area one. Both compare favorably to existing Streamlined NTRU Prime implementations, as well as to the round-3 finalist NTRU-HPS821. The full source code of our implementation, in mixed Verilog and VHDL, can be found on GitHub at https://github.com/AdrianMarotzke/SNTRUP_on_FPGA.