1 Introduction

Fully Homomorphic Encryption (FHE) allows computation to be performed directly on encrypted data (ciphertext) without prior decryption, thus offering better privacy [1]. FHE has emerged as a powerful cryptographic tool in recent years as it has been shown to possess both additive and multiplicative homomorphic properties. However, it is still far from practical deployment due to its complexity, mainly because of the huge key sizes involved. Three variants of FHE, namely Lattice-Based, Ring Learning with Errors (RLWE) and Integer-Based, have been areas of active research in recent years, with both software [2,3,4,5] and hardware [6,7,8,9] implementations used to investigate the potential and limitations of FHE.

A software implementation of Lattice-Based FHE was first proposed in [2]; it requires huge key sizes ranging from 17 Megabytes (MB) to 2.3 Gigabytes (GB), with key generation taking from 2.5 s to 2.2 h. Van Dijk et al. revised the original FHE scheme and proposed Integer-Based FHE [10], where both homomorphic properties are computed over the integers with the objective of keeping the scheme simple. Later, Coron et al. improved this scheme with smaller key sizes of 0.95 Mb to 802 Mb and key generation times from 4.38 s to 43 min [4].

A modulus switching technique was introduced in [5] which allows leveled multiplication on smaller moduli and hence results in smaller public key sizes. The authors of [5], working on RLWE-based FHE, also managed to reduce noise growth from quadratic to linear even without modulus switching. Cousins et al. introduced the Chinese Remainder Transform (CRT) on Lattice-Based FHE, which splits a larger modulus into multiple moduli so that parallelization can be employed on a Virtex 6 Field-Programmable Gate Array (FPGA); however, extra time is needed for re-conversion from the Montgomery domain to regular integers [11, 12]. Later, Gentry in [13] presented a homomorphic encryption of 150-bit Advanced Encryption Standard (AES) which takes 73.03 s for key generation and uses 3 Gb of memory without bootstrapping.

Apart from FHE, recent research has also focused on Somewhat Homomorphic Encryption (SHE) [14, 15]. Smart et al. in [3] suggested multiple stages of encryption (known as re-crypt) on larger message sizes rather than the single bit proposed originally in [1]; however, key generation still requires more than an hour even for a small key size. An improved version of [3] introduces a Single Instruction Multiple Data (SIMD) implementation in [16], which achieves 4.13 times faster re-cryption and 12 times smaller ciphertexts than the version without SIMD. Also working on SHE, Poppelmann et al. [17] showed that Lattice-Based SHE can be deployed on a Spartan-6 FPGA with 9063 Number Theoretic Transform (NTT) coefficient multiplications per second, provided the NTT parameters are selected appropriately. More recent SHE work is based on Ring-LWE variants; it aims at accelerating encryption for cloud computing at the FPGA level and also enlarges the NTT coefficients by introducing a 1228-bit modulus [18]. However, the resulting multiplication was slower than the software implementation with the same NTT size (26.67 s and 2.98 s respectively), with memory access being the bottleneck.

To accelerate FHE performance, the authors in [6] exploited the speed of Graphics Processing Units (GPUs) and encrypted 7.68 times faster than on standard Central Processing Units (CPUs). Then, the authors in [19] introduced batched Integer-Based FHE to reduce the bottleneck on AES encryption. Later, Doroz et al. proposed pre-computation of Schönhage Strassen multiplier parameters, which allowed FHE encryption to complete in only 18.1 milliseconds (ms) [20]. Recently, the concept of a re-cryption box was proposed by Roy et al. [21] at the hardware level to reduce the effects of growing noise on the ciphertexts. The re-cryption box is also exploited to accelerate the search operation on the encrypted data.

The first hardware implementation of Integer-Based FHE was proposed by Cao et al. in [8], with two building blocks, a large NTT multiplier and Barrett reduction, used to speed up FHE on the high-end Virtex 7 FPGA. Their encryption time is 44.72 times faster than the software implementation for the ‘Large’ key size. Comba scheduling is proposed in [22]; it utilizes Digital Signal Processing (DSP) slices for uneven operands to shorten the delays during multiplication while reducing ‘Write to Memory’ operations. Meanwhile, recent research by Cao et al. [9] proposed a Low Hamming Weight (LHW) design on Virtex 7 to allow simpler multiplications while reducing hardware usage at the same time. The encryption time of this work outperforms benchmark software implementations by 131 times for the ‘Large’ key size and ranges from 0.0006 s to 3.317 s, making it the best FHE result so far with a reasonable speed and small footprint.

Inspired by the significant performance reported and the strong potential for improvement, we focused our work on the Integer-Based FHE scheme by Van Dijk et al. [10]. The central theme of this scheme is simplicity. Its parameters are easier to select than those of Lattice-Based schemes, while its hardness is based on the approximate Greatest Common Divisor (GCD) problem. Furthermore, the sizes of the parameters in Integer-Based FHE are defined clearly in [9], unlike the other variants where only the matrix size is defined rather than the bit size.

We propose to accelerate FHE over the integers by adopting frequency domain multiplication using the NTT specifically targeted for FPGAs. The FPGA platform is chosen over custom hardware Application-Specific Integrated Circuits (ASICs) due to the high availability of resources such as DSPs which have dedicated mathematical functions on modern FPGAs.

We followed the seminal work in [2, 5], classifying the operand sizes into four groups: Toy, Small, Medium and Large, as shown in Table 1. The encryption step requires operands of at least 150 k to 19 m bits, which are very large numbers, hence normal Schoolbook multiplication is no longer efficient. In recent years, researchers have reported many ideas to optimize large number multiplication, especially in cryptography, such as Comba [22, 23], Karatsuba [24, 25] and frequency domain conversion methods [14, 26]. The idea of adopting a frequency domain approach on hardware, such as in [8, 14, 15, 27, 28], has increasingly gained acceptance as an efficient method to accelerate the multiplication process, given its computational complexity in the order of \( O\left( {n \log n} \right) \) for \( n \)-bit operands. Researchers in [9, 29, 30] have also shown that NTT hardware implementations outperformed software implementations at certain magnitudes.

Table 1. Test instances for encryption process

In this paper, we further advance research in this area by relaxing the strict relationship between the NTT parameters to allow for more optimized hardware implementations on FPGA. To speed up the large integer multiplications required in FHE schemes such as the one proposed in [5], previous research has sought to optimize the multiplication steps within the NTT computations by fixing the kernel \( \alpha \) to be simply two or a power of two [9]. However, such an approach tends to impose restrictions on the possible transform lengths, thereby limiting potential optimizations in the overall multiplication process.

In this paper, we propose a different methodology whereby we relax the requirement for a simple kernel in favour of transform lengths and moduli that are optimal in terms of the number of overall iterations, the data path, and the FPGA architecture. The kernel multiplications by powers of \( \alpha \) required for the optimal word lengths and moduli can be easily implemented in the form of Look-Up Tables (LUTs), which are integral to any FPGA fabric.

The specific contributions of this paper are summarised as follows:

  • A set of NTT parameters that supports large operands for NTT multiplication is proposed.

  • Analysis of important hardware design trade-offs; such as the butterfly costs of the NTT building blocks against multiplication iteration for each key group in FHE (Toy, Small, Medium and Large).

  • An iterative multiplication method is incorporated to support a small footprint design on hardware while at the same time maximizing the multiplier size to speed-up the overall multiplication process.

  • Hardware implementation is validated with results showing improved performance.

The rest of the paper is organised as follows. Section 2 recaps the mathematical background of FHE over the integers. Our proposed methodology is presented in Sect. 3. Section 4 covers the implementation aspects, with results given in Sect. 5. The paper ends with a Conclusion section.

2 Integer Based Fully Homomorphic Encryption

Integer-Based FHE consists of key generation, encryption and decryption, with the additional step of evaluation. Our work in this paper, in line with previously reported implementations [9], focuses solely on the encryption step defined in (1). The scheme in [9] works for binary messages only, i.e. a message space of size \( Q = 2 \) with values in \( \left\{ {0,1} \right\} \); [31] proposed a larger message space \( Q > 2 \), which means the message can be non-binary with an extended circuit. Their key size is also reduced, although no specific size is reported.

$$ c \leftarrow \left( {m + 2r + 2\sum\nolimits_{i = 1}^{\tau } {X_{i} \cdot B_{i} } } \right) \bmod X_{0} $$
(1)

Here, \( c \) is the ciphertext; \( m \) is a single plaintext bit (0 or 1); \( r \) is a random signed integer; \( X_{0} \) is a part of the public key; \( B_{i} \) is a random integer sequence, and \( X_{i} \) is a \( \tau \)-bits public key sequence with \( 1 \le i \le \theta \). We refer the interested reader to the original work in [5, 10] for details on the parameter selection in (1) and Table 1.
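To make the structure of (1) concrete, the following is a minimal Python sketch of encryption over the integers with deliberately tiny, insecure parameters. The key generation, the parameter sizes and the names `keygen`, `encrypt` and `decrypt` are illustrative assumptions and are not taken from Table 1; the sketch also assumes that the `mod` in (1) applies to the whole sum and that \( X_{0} \) is a noise-free multiple of the secret.

```python
import random

def keygen(rng, tau=8, p_bits=64, q_bits=96, noise_bits=8):
    """Toy key material (assumed layout): odd secret p, X0 = p*q0, X_i = p*q_i + r_i."""
    p = rng.getrandbits(p_bits) | (1 << (p_bits - 1)) | 1   # force top bit and odd
    x0 = p * (rng.getrandbits(q_bits) | 1)                  # noise-free multiple of p
    bound = 1 << noise_bits
    xs = [p * rng.getrandbits(q_bits) + rng.randint(-bound, bound)
          for _ in range(tau)]
    return p, x0, xs

def encrypt(m, x0, xs, rng, r_bits=8, b_bits=8):
    """c = (m + 2r + 2 * sum(X_i * B_i)) mod X0, following the shape of (1)."""
    r = rng.randint(-(1 << r_bits), 1 << r_bits)            # random signed integer
    bs = [rng.getrandbits(b_bits) for _ in xs]               # random integer sequence B_i
    return (m + 2 * r + 2 * sum(x * b for x, b in zip(xs, bs))) % x0

def decrypt(c, p):
    """Recover m as (c mod p) mod 2, using a centred reduction mod p."""
    v = c % p
    if v > p // 2:
        v -= p
    return v % 2
```

The sum of products \( X_{i} \cdot B_{i} \) is where a very long operand meets a much shorter one, which is the multiplication that the remainder of the paper sets out to accelerate.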

As seen from (1), the FHE encryption step needs two core operational building blocks: (1) Multiplication; and (2) Reduction. These can be designed as individual building blocks and combined later into a complete FHE encryption process. Meanwhile, as can be seen from Table 1, the multiplicands \( X_{i} \) and \( B_{i} \) are not symmetrical in size. The multiplicands are referred to as operands from this point on. We exploit this asymmetry to propose a hybrid multiplication approach combining Schoolbook and NTT-based multiplication: a Schoolbook multiplier is employed for the outer iterations, whereas NTT multipliers are used for the inner multiplications. In fact, employing symmetric multiplication methods for asymmetric operands leads to a significant waste of computational time as well as hardware resources.
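The hybrid idea can be sketched in a few lines of Python. The outer Schoolbook loops walk over fixed-size chunks of both operands, and each chunk pair is handed to an inner block multiplier. Here the inner multiplier defaults to Python's built-in multiplication as a stand-in for the NTT block; the function names and the chunk size parameter `n_c` are assumptions made for illustration.

```python
def split_chunks(value, chunk_bits):
    """Split a non-negative integer into little-endian chunks of chunk_bits bits."""
    mask = (1 << chunk_bits) - 1
    chunks = []
    while True:
        chunks.append(value & mask)
        value >>= chunk_bits
        if value == 0:
            return chunks

def hybrid_multiply(x, b, n_c, block_mul=lambda u, v: u * v):
    """Schoolbook outer loops over n_c-bit chunks; block_mul stands in for the NTT block.

    Returns the product and the number of inner block multiplications used.
    """
    acc = 0
    calls = 0
    b_chunks = split_chunks(b, n_c)
    for i, xc in enumerate(split_chunks(x, n_c)):
        for j, bc in enumerate(b_chunks):
            # Each partial product is shifted to its chunk position and accumulated.
            acc += block_mul(xc, bc) << (n_c * (i + j))
            calls += 1
    return acc, calls
```

With a long `x` and a short `b`, the inner loop runs only once per chunk of `x` as soon as `b` fits in a single block, which is the saving that asymmetric operands make available.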

2.1 Number Theoretic Transform (NTT) Multiplication

The NTT has been used widely in signal processing for implementing convolution and correlation operations because it is error-free (no rounding or truncation errors) and efficient to implement. Recently there has been a revival of interest in NTTs as a frequency domain approach to the large operand multiplications required in new cryptographic schemes. Dai et al. in [32] proposed large NTTs of \( 2^{15} \) coefficients integrated with CRT in order to accelerate NTRU-based FHE. Meanwhile, diminished-1 NTTs are used for computing the SWIFFT hash function in [33] to simplify modular NTTs, but they are limited to certain modular forms such as Fermat primes. Because it lends itself to parallelization, the NTT is also widely used in hardware implementations with good performance [8, 34]. The mathematical representation of an NTT is given in (2), where \( k = 0, \ldots, N - 1 \) and \( \alpha \) is the twiddle factor satisfying \( \alpha^{N} \equiv 1 \bmod m \).

$$ X\left( k \right) = \sum\nolimits_{n = 0}^{N - 1} {x\left( n \right)\alpha^{nk} \bmod m } $$
(2)

From (2), parameters \( \alpha , m \) and \( N \) are interdependent. The desirable choice of NTT parameters traditionally involved [35]:

  • \( \alpha \) to be selected as two or a power of two so that the exponentiation operations required can be implemented as shift operation;

  • \( N \) to be highly composite, a power of two if possible, so that efficient NTT-type algorithms can be employed;

  • \( m \) has a special form so that reduction can be a simple operation.
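As a small illustration of these traditional choices, the sketch below evaluates (2) directly for the textbook parameter set \( m = 17 \), \( \alpha = 2 \) and \( N = 8 \), which is chosen here purely as an example and is not one of the parameter sets used in this paper. With \( \alpha = 2 \), every power of the kernel is a shift followed by a reduction. The function names are illustrative, and `pow(x, -1, m)` requires Python 3.8 or later.

```python
def ntt_direct(x, alpha, m):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * alpha^(n*k) mod m."""
    n = len(x)
    return [sum(x[j] * pow(alpha, j * k, m) for j in range(n)) % m
            for k in range(n)]

def intt_direct(big_x, alpha, m):
    """Inverse transform: same sum with alpha^-1, scaled by N^-1 mod m."""
    n = len(big_x)
    inv_alpha = pow(alpha, -1, m)
    inv_n = pow(n, -1, m)
    return [inv_n * v % m for v in ntt_direct(big_x, inv_alpha, m)]

# Example parameters (assumed for illustration): 2 has order 8 modulo 17.
M, ALPHA, N = 17, 2, 8
signal = [1, 2, 3, 4, 0, 0, 0, 0]
spectrum = ntt_direct(signal, ALPHA, M)
recovered = intt_direct(spectrum, ALPHA, M)
```

The price of such a convenient kernel is visible in the parameters: with \( \alpha = 2 \) and \( m = 17 \) the transform length cannot exceed 8, which is the kind of restriction the proposed methodology relaxes.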

In this paper, we use the Classical Modular NTT, in which each operation is performed in the ring \( Z_{m} \), where \( m \) is the modulus. Algorithm 1 describes NTT multiplication, which consists of four underlying steps: Forward Transform, Pointwise Multiplication, Inverse Transform and Carry Accumulation.

Algorithm 1. NTT multiplication
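The four steps named above can be sketched in Python as follows. This is not a transcription of Algorithm 1. It assumes the 64-bit Solinas prime \( 2^{64} - 2^{32} + 1 \) with generator 7, a transform length of 128 and 28-bit coefficients; these values are assumptions chosen to be consistent with the 1792-bit multiplier size quoted later, and the function names are illustrative.

```python
P = 2**64 - 2**32 + 1   # assumed 64-bit Solinas prime
G = 7                   # assumed generator of the multiplicative group mod P

def ntt(x, alpha):
    """Recursive radix-2 transform: X(k) = sum_n x(n) * alpha^(n*k) mod P."""
    n = len(x)
    if n == 1:
        return list(x)
    alpha_sq = alpha * alpha % P
    even = ntt(x[0::2], alpha_sq)
    odd = ntt(x[1::2], alpha_sq)
    out = [0] * n
    w = 1
    for i in range(n // 2):
        t = w * odd[i] % P
        out[i] = (even[i] + t) % P            # butterfly, upper output
        out[i + n // 2] = (even[i] - t) % P   # butterfly, lower output
        w = w * alpha % P
    return out

def to_limbs(value, limb_bits, count):
    """Little-endian split of value into `count` coefficients of limb_bits bits."""
    mask = (1 << limb_bits) - 1
    return [(value >> (limb_bits * i)) & mask for i in range(count)]

def ntt_multiply(a, b, n=128, limb_bits=28):
    half = n // 2
    assert max(a.bit_length(), b.bit_length()) <= half * limb_bits
    alpha = pow(G, (P - 1) // n, P)           # element of order n
    # Step 1: forward transforms of the zero-padded operands.
    fa = ntt(to_limbs(a, limb_bits, half) + [0] * half, alpha)
    fb = ntt(to_limbs(b, limb_bits, half) + [0] * half, alpha)
    # Step 2: pointwise multiplication.
    fc = [u * v % P for u, v in zip(fa, fb)]
    # Step 3: inverse transform (alpha^-1, then scale by n^-1).
    inv_n = pow(n, P - 2, P)
    coeffs = [v * inv_n % P for v in ntt(fc, pow(alpha, P - 2, P))]
    # Step 4: carry accumulation, recombining overlapping coefficients.
    return sum(c << (limb_bits * i) for i, c in enumerate(coeffs))
```

Python's arbitrary-precision integers hide the carry chain in step 4; in hardware this step is a sequence of shifts and additions.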

3 Proposed Methodology

As explained before, the efficiency of an NTT design is closely tied to the trade-off between its three inter-related parameters, namely the kernel \( \alpha \), the transform length \( N \) and the modulus \( m \). In this paper we argue that, in the context of FHE where very large multiplications of asymmetric operands are required, a methodology that allows more flexibility in the transform length offers better scope for improving overall FHE performance on modern FPGA platforms. The proposed methodology is more efficient than traditional methodologies, which simplify the multiplications by the transform kernel to the detriment of the transform length. In this case, the impact of the transform length on overall performance is far more significant than that of the kernel multiplication within the NTT, because a longer multiplier unit can cater for larger operand sizes and thus minimizes the number of partial product iterations. As a result, multiplication complexity can be reduced, specifically for asymmetric operands. The NTT parameters and their optimization are discussed in the next section.

3.1 NTT Parameters Optimization

The central parameter to be optimized is the NTT length, since a larger NTT length, obtained by relaxing the restriction on the kernel \( \alpha \), can accommodate larger operands. The choice of modulus needs specific consideration, as explained later, so that every operation during the NTT over the defined ring is optimized for the targeted hardware. Importantly, the NTT coefficients must stay within the dynamic range \( b \), as expressed in (3), to ensure that no overflow error occurs. More details on the dynamic range can be found in [36].

$$ \frac{N}{2}\left( {b - 1} \right)^{2} < m $$
(3)
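Condition (3) is easy to check numerically. The sketch below assumes that \( b \) in (3) denotes the number of values a coefficient can take, i.e. \( 2^{\text{limb bits}} \), so that \( b - 1 \) is the largest coefficient. The modulus used in the example is the 64-bit Solinas prime \( 2^{64} - 2^{32} + 1 \), which is likewise an assumption, as are the function name and the sample values of \( N \) and the coefficient width.

```python
def within_dynamic_range(n, limb_bits, m):
    """Check N/2 * (b - 1)^2 < m, with b = 2**limb_bits (assumed reading of (3))."""
    b = 1 << limb_bits
    return (n // 2) * (b - 1) ** 2 < m

SOLINAS = 2**64 - 2**32 + 1   # assumed 64-bit Solinas prime

# A 128-point transform with 28-bit coefficients stays inside the range,
# whereas 30-bit coefficients would overflow the modulus.
ok_28 = within_dynamic_range(128, 28, SOLINAS)
ok_30 = within_dynamic_range(128, 30, SOLINAS)
```

The check makes the coupling explicit: doubling the transform length costs roughly half a bit of coefficient width for a fixed modulus.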

To illustrate the improvements in operand sizes achieved by the proposed approach we report in Table 2 the comparison between two types of moduli, Solinas and Fermat (F6); they are 64 bits and 65 bits moduli respectively. Solinas 1 and F6 1 show the NTT parameter set without optimization, whereas Solinas 2 and F6 2 show these parameters with our proposed optimization. The optimization is done by enlarging the NTT length as well as relaxing the kernel restriction. As a result, both Solinas 2 and F6 2 result in much larger multiplier sizes of 1792 bits and 3072 bits which correspond to almost double the length.

Table 2. Comparison between Solinas and Fermat moduli NTT parameters

Consider the reduction \( y \bmod p \), where \( y = 2^{96} a + 2^{64} b + 2^{32} c + d \) is a 128-bit integer and \( a, b, c, d \) are its 32-bit words. The Solinas reduction can then be simplified as (4).

$$ 2^{32} \left( {b + c} \right) - a - b + d $$
(4)
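Expression (4) is what results for the prime \( p = 2^{64} - 2^{32} + 1 \), since then \( 2^{64} \equiv 2^{32} - 1 \) and \( 2^{96} \equiv -1 \pmod p \). The sketch below assumes this value of \( p \) and uses an illustrative function name. The final `% P` stands in for the small number of conditional additions or subtractions that a hardware design would use to bring the result into \( [0, p) \).

```python
P = 2**64 - 2**32 + 1   # assumed Solinas prime, consistent with (4)
MASK32 = 0xFFFFFFFF

def solinas_reduce(y):
    """Reduce a 128-bit integer y modulo P using only shifts, additions and subtractions."""
    assert 0 <= y < 1 << 128
    d = y & MASK32            # bits 0..31
    c = (y >> 32) & MASK32    # bits 32..63
    b = (y >> 64) & MASK32    # bits 64..95
    a = (y >> 96) & MASK32    # bits 96..127
    t = ((b + c) << 32) - a - b + d   # expression (4); may lie slightly outside [0, P)
    return t % P              # final correction step
```

The intermediate value `t` is bounded by roughly \( 2^{65} \) in magnitude, so the correction never needs more than a couple of additions or subtractions of \( p \).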

Algorithm 2 is used for special-form moduli of \( 2^{n - 1} \pm 1 \), as proposed in [39]. We used this algorithm for the Fermat F6 1 and Fermat F6 2 reductions.

Algorithm 2. Reduction for special-form moduli
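For intuition only, reduction modulo a number of the form \( 2^{k} + 1 \) can be sketched by folding, because \( 2^{k} \equiv -1 \) means that successive \( k \)-bit chunks are alternately added and subtracted. The sketch below is a generic illustration of that idea with \( k = 64 \) as an assumed default; it is not a transcription of Algorithm 2 or of the method in [39], and the function name is hypothetical.

```python
def reduce_mod_pow2_plus_1(y, k=64):
    """Reduce y modulo 2**k + 1 by alternately adding and subtracting k-bit chunks."""
    m = (1 << k) + 1
    mask = (1 << k) - 1
    acc = 0
    sign = 1
    while y:
        acc += sign * (y & mask)   # chunk weight is (-1)^j because 2^k = -1 mod m
        sign = -sign
        y >>= k
    return acc % m                 # final correction into [0, m)
```

Note that residues modulo \( 2^{64} + 1 \) need 65 bits, which is the word-size penalty discussed in the next paragraph.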

In terms of reduction cost, the Solinas modulus needs only shifts, additions and subtractions. The Solinas form also lends itself to efficient FPGA implementation. As the goal of this work is to design a large multiplier on a targeted FPGA, Solinas 2 was chosen as the optimal modulus: it covers an acceptable operand size of 1792 bits, and its 64-bit modulus fits exactly in a single word. Although F6 2 can cover larger operands of 3072 bits, its 65-bit modulus needs more than a single word, which is not optimal for hardware implementation. Even if the diminished-1 number system can be adopted to handle 65-bit modular operations as suggested in [40], the conversion to and from this number system is costly and can become a performance bottleneck, in particular for the special case of zero detection.

Cost Analysis

We first analyze the operational cost of the NTT block for Solinas 1 and Solinas 2 individually, and then the cost of the overall multiplication during encryption. For a fair comparison, we presume that the powers of \( \alpha \) for Solinas 1 and Solinas 2 are pre-computed over the Solinas modulus and stored in LUTs as 64-bit Read-Only Memories (ROMs). This was also done in [27, 41] with the same purpose of speeding up the kernel multiplication process.

In our work, \( \frac{N}{2} \) pre-computed 64-bit operands, i.e. \( 64 \times \frac{N}{2} \) bits, need to be stored in the LUTs, which is relatively small compared to the available LUTs of the targeted hardware, Kintex 7. Exponentiation of \( \alpha \) during the Butterfly operation in (5) can then be replaced with a ‘Read’ operation, which is faster than computing the powers of \( \alpha \) algorithmically.

Meanwhile, because the roots of unity of an NTT over the ring are symmetrical, the same table can also be used for retrieving \( \alpha^{ - 1} \) for the Inverse NTT (INTT) [36]. This way, only \( N \) multiplications are required for each transform. As the overall NTT multiplication building block has 2 forward and 1 inverse transforms, \( 3N \) multiplications are required. The same goes for the ‘Read’ operations during the NTT multiplication, which number \( 3\left[ {\frac{N}{2} \log_{2} N} \right] \).

$$ X_{i} = A_{i} + \alpha^{i} B_{i} \,, \quad X_{{i + \frac{N}{2}}} = A_{i} - \alpha^{i} B_{i} $$
(5)
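The role of the twiddle ROM in the butterfly (5) can be sketched as below. The table holds \( \alpha^{i} \) for \( 0 \le i < \frac{N}{2} \), and the inverse powers needed by the INTT are read from the same table through the identity \( \alpha^{-i} = -\alpha^{N/2 - i} \), which holds because \( \alpha^{N/2} \equiv -1 \). The modulus, the generator 7, the value \( N = 128 \) and all names are assumptions made for this illustration.

```python
P = 2**64 - 2**32 + 1            # assumed 64-bit Solinas prime
G = 7                            # assumed generator modulo P
N = 128                          # assumed transform length
ALPHA = pow(G, (P - 1) // N, P)  # kernel of order N

# Pre-computed table: N/2 entries of 64 bits each, read instead of exponentiating.
ROM = [pow(ALPHA, i, P) for i in range(N // 2)]

def twiddle(i):
    """Forward twiddle alpha^i, a single table read."""
    return ROM[i]

def inverse_twiddle(i):
    """Inverse twiddle alpha^(-i) from the same table: -alpha^(N/2 - i) mod P."""
    return ROM[0] if i == 0 else P - ROM[N // 2 - i]

def butterfly(a_i, b_i, w):
    """Butterfly (5): returns (A_i + w*B_i, A_i - w*B_i) modulo P."""
    t = w * b_i % P
    return (a_i + t) % P, (a_i - t) % P
```

Only the sign flip, a subtraction from \( p \), is added on the inverse path, so no second table is needed.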

The NTT multiplier size \( n_{c} \) can be determined from (6). The division by two arises because we use zero-padded convolution, meaning that only \( \frac{N}{2} \) coefficients carry operand data while the other \( \frac{N}{2} \) are padded with zeros.

$$ n_{c} = \frac{N \times b}{2} $$
(6)
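The counts discussed above can be gathered into a small helper. The sketch evaluates (6) together with the ROM size \( 64 \times \frac{N}{2} \), the \( 3N \) multiplications and the \( 3\left[ \frac{N}{2} \log_{2} N \right] \) reads. The example values \( N = 128 \) and 28-bit coefficients are assumptions that happen to reproduce the 1792-bit multiplier size quoted for Solinas 2; they are not read from Table 2, and the function name is illustrative.

```python
def ntt_block_costs(n, limb_bits, word_bits=64):
    """Per-block figures for an N-point NTT multiplier (N assumed to be a power of two)."""
    stages = n.bit_length() - 1                 # log2(N)
    butterflies_per_transform = (n // 2) * stages
    return {
        "multiplier_bits": n * limb_bits // 2,  # n_c from (6)
        "rom_bits": word_bits * n // 2,         # pre-computed twiddle storage
        "multiplications": 3 * n,               # 2 forward + 1 inverse transform
        "reads": 3 * butterflies_per_transform, # one table read per butterfly
    }

costs = ntt_block_costs(128, 28)
```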

Table 3 compares Solinas 1 and Solinas 2 in terms of the operations during the NTT and the space required to store the pre-computed operands. As illustrated in Table 3, the Butterfly, ‘Read’ and Addition/Subtraction operations dominate the cost in Solinas 2 and, as expected, are higher than in Solinas 1, since Solinas 2 caters for more NTT points. Solinas 2 also requires more LUT space to store the pre-computed powers of \( \alpha \). Crucially, though, Solinas 2 has the largest number of NTT points among similar previous work [28, 42].

Table 3. Solinas 1 vs Solinas 2

Next, we analyze the entire multiplication, but first we explain how the multiplication building block works during FHE encryption. As discussed earlier, the NTT multiplier blocks are used for computing the partial products, whereas accumulation is completed using the Schoolbook method. For symmetric \( n \)-bit operands, the Schoolbook method needs \( n^{2} \) multiplications and \( 2n - 1 \) accumulations. In our case, however, asymmetric multiplication is required and the partial products are computed by the NTT multiplier block. Given two asymmetric operands \( a \) (\( n_{a} \) bits) and \( b \) (\( n_{b} \) bits) and a multiplier size of \( n_{c} \) bits, we therefore take the number of partial product iterations \( P_{i} \) in (7) as the number of multiplications, while the accumulation count \( A_{i} \) in (8) represents the number of additions required for accumulating the partial products.

$$ P_{i} = \left\lceil {\frac{{n_{a} }}{{n_{c} }}} \right\rceil \times \left\lceil {\frac{{n_{b} }}{{n_{c} }}} \right\rceil $$
(7)
$$ A_{i} = \left\lceil {\frac{{n_{a} }}{{n_{c} }}} \right\rceil + \left\lceil {\frac{{n_{b} }}{{n_{c} }}} \right\rceil $$
(8)

Figure 1 illustrates the impact of the multiplier size on the partial product iterations and accumulations. Let \( a \) and \( b \) be asymmetric operands of 32 bits and 16 bits respectively. Two different multipliers, 8-bit and 16-bit, are used to show the relationship between the multiplier size and the complexity of the multiplication. The 32-bit operand is split into multiple blocks depending on the multiplier size. The accumulation chain depends on the partial product iterations. For example, an 8-bit multiplier requires 8 partial product iterations and 5 accumulation chains, whereas a 16-bit multiplier only consumes 2 partial product iterations and 2 accumulation chains. Essentially, fewer iterations are needed for larger multipliers, and long carry accumulation chains are also minimized.

Fig. 1. 8-bit multiplier vs 16-bit multiplier
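Equations (7) and (8) translate directly into integer arithmetic, as in the sketch below (function names are illustrative). For the 32-bit by 16-bit example, (7) gives 8 and 2 partial product iterations for the 8-bit and 16-bit multipliers. Evaluating (8) exactly as written gives 6 and 3; the chain counts of 5 and 2 quoted for Fig. 1 are one lower, which corresponds to counting the \( \lceil n_{a}/n_{c} \rceil + \lceil n_{b}/n_{c} \rceil - 1 \) columns of the Schoolbook layout. That reading is an assumption and is exposed separately as `accumulation_chains`.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)

def partial_products(n_a, n_b, n_c):
    """P_i from (7): number of block multiplications."""
    return ceil_div(n_a, n_c) * ceil_div(n_b, n_c)

def accumulations(n_a, n_b, n_c):
    """A_i from (8), evaluated as written."""
    return ceil_div(n_a, n_c) + ceil_div(n_b, n_c)

def accumulation_chains(n_a, n_b, n_c):
    """Column count of the Schoolbook layout (assumed reading of the Fig. 1 counts)."""
    return accumulations(n_a, n_b, n_c) - 1
```

Either way of counting shows the same trend: doubling the multiplier size cuts the partial products by a factor of four in this example and shortens the accumulation chain.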

We analyze the complexity of the multiplication building block during FHE encryption for the key sizes Toy, Small, Medium and Large, as illustrated in Table 4. \( P_{i} \) and \( A_{i} \) are obtained from Eqs. (7) and (8) respectively. We also include in Table 4 the butterfly cost \( B_{i} \), which corresponds to the number of butterflies involved in the NTT multiplications performed during FHE encryption, as shown in (9), where \( B_{u} \) is the butterfly cost of a single multiplier block. The values of \( P_{i} \), \( A_{i} \) and \( B_{i} \) in Table 4 represent the overall cost and complexity of the multiplication during FHE encryption.

Table 4. Complexity costs of Solinas 1 and Solinas 2
$$ B_{i} = B_{u} \times P_{i} $$
(9)

As can be seen from Table 4, if the multiplier is large enough to cover the operand \( b_{i} \) in a minimum number of NTT blocks, the partial product iterations and accumulations are reduced significantly. For example, the Toy operand \( b_{i} \) fits in a single NTT block of Solinas 2. For Solinas 1, however, operand \( b_{i} \) does not fit in a single NTT block; 2 NTT block iterations are needed instead, which complicates the multiplication process quadratically.

Overall, the number of partial product iterations (\( P_{i} \)) and accumulations (\( A_{i} \)) in Solinas 2 is reduced drastically compared with Solinas 1. In fact, the total butterfly cost in Solinas 2 is also much lower than in Solinas 1, even though Solinas 2 incurs a larger butterfly cost than Solinas 1 in a single multiplier block.
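The effect can be reproduced with (7) and (9) on made-up numbers. The sketch below compares a 64-point and a 128-point block, both with 28-bit coefficients, on a hypothetical operand pair of 150 000 and 900 bits. These sizes, the block parameters and the assumption that \( B_{u} \) counts the butterflies of three transforms, \( 3 \cdot \frac{N}{2} \log_{2} N \), are illustrative only and are not taken from Tables 1 to 4.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)

def butterflies_per_block(n, transforms=3):
    """Assumed B_u: butterflies of 2 forward + 1 inverse N-point transforms."""
    stages = n.bit_length() - 1          # log2(N) for a power-of-two N
    return transforms * (n // 2) * stages

def total_butterflies(n_a, n_b, n, limb_bits):
    """B_i = B_u * P_i, with n_c from (6) and P_i from (7)."""
    n_c = n * limb_bits // 2
    p_i = ceil_div(n_a, n_c) * ceil_div(n_b, n_c)
    return butterflies_per_block(n) * p_i

# Hypothetical operands: one very long, one short.
small_block = total_butterflies(150_000, 900, n=64, limb_bits=28)    # n_c = 896
large_block = total_butterflies(150_000, 900, n=128, limb_bits=28)   # n_c = 1792
```

In this made-up case the short operand just misses fitting in the 896-bit block, so the smaller block pays for twice as many partial products along the short operand as well as along the long one, and its total butterfly count ends up higher despite the cheaper block.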

Based on this analysis, we confirm that choosing an appropriate multiplier size can significantly reduce the complexity of the multiplication building block. Therefore, by relaxing the kernel restriction to enable a longer NTT, the overall complexity cost of the multiplication building blocks is reduced significantly.

The complexity analysis in Table 4 also shows that our parameter optimization using Solinas 2 gives a significant improvement over Solinas 1. For that reason, we conclude that Solinas 2 is more efficient for large asymmetric operands. This is due to the large size of the multiplier, which leads to fewer partial product iterations and a shorter carry chain. Solinas 2 also costs fewer butterflies, which reduces the complexity of the entire multiplication.

4 The Architecture of NTT Multiplier

LabVIEW FPGA 15 is used for this hardware implementation, targeting the Xilinx Kintex-7 XC7K160T FPGA device with the Xilinx Vivado 2014.4 compiler. Given the size of the operands, it is assumed that Block Random Access Memories (BRAMs) are used and are sufficient to store \( X_{i} \) and \( B_{i} \) as multiple data chunks of \( b \) bits each.

The architecture of the NTT Multiplier is depicted in Fig. 2. Initially, both NTT1 and NTT2 are used to transform the \( B_{i} \) operands. After the \( B_{i} \) operands are completely transformed into the frequency domain, they are stored in a BRAM \( B_{i} \). Next, the \( X_{i} \) operands are transformed into the frequency domain using NTT1 and NTT2. This also means that, for each iteration, the NTT block can cover \( 2n_{c} \) bits. Then, pointwise (PW) multiplication takes place in parallel on 2 PW units, PW1 and PW2, which have 128 points each. During pointwise multiplication, \( X_{i} \) is fed on the fly from the NTT1 and NTT2 outputs, whereas \( B_{i} \) is read from BRAM \( Y \). The outputs of PW1 and PW2 are then loaded into INTT1 and INTT2 respectively. The proposed design is pipelined, so once the INTT has started producing results, each subsequent INTT output is generated on the following clock cycle. The product is then loaded into the accumulation unit for addition and carry management. This unit merely involves shifting and addition.

Fig. 2. The proposed NTT multiplier architecture

In the case where \( B_{i} \) does not fit into a single NTT unit, pointwise multiplication must be done iteratively. For example, the operands \( B_{i} \) for Medium and Large exceed the multiplier size and need two NTT blocks, so pointwise multiplication must undergo 2 iterations to complete the multiplication for both blocks; hence more clock cycles are required in this case.

5 Results and Discussion

The synthesis results in Table 5 show that our proposed NTT Multiplier fits within the available resources of the targeted Kintex-7 device. The register and LUT counts are the same for all key sizes (Toy, Small, Medium and Large) because the same NTT units are used for each group. The latency differs because the number of iterations is different for each group. Meanwhile, the BRAM figures represent the storage for the operands \( X_{i} \) and \( B_{i} \) as well as the final results after the reduction.

Table 5. Synthesis results for proposed NTT multiplier

The latency in Table 6 is calculated from the clock cycle count and the synthesized design frequency generated by the tools. Once the timings of both the multiplication and the reduction building blocks are obtained, the encryption time \( Enc_{t} \) can be computed as (10).

Table 6. Latency and timing for proposed NTT multiplier of each group
$$ Enc_{t} = \left( {Group\,1 \,timing \times \tau } \right) + \left( {2 \times Group\,2 \,timing} \right) $$
(10)

In (10), the first bracket refers to the multiplication timing whereas the second bracket refers to the reduction timing. Note that we used Barrett reduction, which utilizes the same NTT building blocks with different operands [9]. The multiplication by two for the reduction building block arises because Barrett reduction needs two large multiplications [43]. The encryption time of each group is presented in Table 6.
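The two large multiplications in Barrett reduction are visible in the textbook form of the algorithm, sketched below in Python. This is the standard formulation and not necessarily the exact variant used in [9] or [43]; the function names are illustrative. In the proposed design, each of the two products would be computed by the NTT multiplier rather than by the built-in `*`.

```python
def barrett_setup(n):
    """Pre-compute k = bit length of n and mu = floor(2^(2k) / n)."""
    k = n.bit_length()
    mu = (1 << (2 * k)) // n
    return k, mu

def barrett_reduce(x, n, k, mu):
    """Return x mod n for 0 <= x < 2^(2k), using two large multiplications."""
    q = (x * mu) >> (2 * k)   # first large multiplication: quotient estimate
    r = x - q * n             # second large multiplication: remainder estimate
    while r >= n:             # the estimate is low by at most a small constant
        r -= n
    return r
```

Because `mu` depends only on the modulus, it is computed once and reused for every reduction modulo the same \( X_{0} \).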

We also compared our results with previous research [9] in Table 7. As can be seen, our design outperforms [9] for the Medium and Large groups. This shows that our design reduces multiplication complexity specifically for large operands such as those of Medium and Large. Although [9] performs better for Toy and Small, the encryption time of our design does not increase gradually from Toy to Large. We notice that our design is not efficient for Toy and Small because the operand \( B_{i} \) utilizes only 20% and 41% of the full NTT blocks respectively. This can be improved in the future with a scalable design that adapts to the size of the operands.

Table 7. Encryption Time of our proposed design and previous research [9]

6 Conclusion

In this paper, we proposed a new methodology to speed up the large modular multiplications required in FHE schemes in the frequency domain using NTTs. The methodology is based on relaxing the strict relationship between the NTT parameters imposed by having a simple transform kernel. In our approach, more emphasis is put on the transform length, as this parameter was shown to have more effect on overall hardware performance. Both the analysis and the implementation presented in this paper show that the proposed methodology leads to improved large NTT multiplication. In fact, our optimized NTT Multiplier is 55% and 76% faster than [9] for the Medium and Large groups respectively. The results attained illustrate that FHE encryption time is improved. Further enhancements can be carried out by deploying several NTT blocks in parallel.