Abstract
With this work we provide further evidence that lattice-based cryptography is a promising and efficient alternative to secure embedded applications. So far it is known for solid security reductions, but implementations of specific instances have often been reported to be too complex to be practical. In this work, we present an efficient and scalable microcode engine for Ring-LWE encryption that combines polynomial multiplication based on the Number Theoretic Transform (NTT), polynomial addition, subtraction, and Gaussian sampling in a single unit. This unit can encrypt and decrypt a block in 26.19 µs and 16.80 µs on a Virtex-6 LX75T FPGA, respectively, at moderate resource requirements of about 1506 slices and a few block RAMs. Additionally, we provide solutions for several practical issues with Ring-LWE encryption, including the reduction of ciphertext expansion and error rate as well as constant-time operation. We hope that this contribution helps to pave the way for the deployment of ideal lattice-based encryption in future real-world systems.
Keywords
 Ideal lattices
 Ring-LWE
 FPGA implementation
1 Introduction and Motivation
Resistance against quantum computers and long-term security have been issues that cryptographers have been trying to solve for some time [12]. However, while quite a few alternative schemes and problem classes are available, not many of them have received the attention from both cryptanalysts and implementers that would be needed to establish the confidence and efficiency required for their deployment in real-world systems. In the field of patent-free lattice-based public-key encryption there are a few promising proposals such as a provably secure NTRU variant [49] or the cryptosystem based on the (Ring-)LWE problem [32, 36]. For the latter scheme Göttert et al. presented a proof-of-concept implementation in [22] demonstrating that LWE encryption is feasible in software. However, their corresponding hardware implementation is quite large: for secure parameters it can only be placed fully on a Virtex-7 2000T and does not even fit onto the largest Xilinx Virtex-6 FPGA.^{Footnote 1} Several other important aspects of Ring-LWE encryption have also not been considered yet, such as the reduction of the extensive ciphertext expansion and constant-time operation to withstand timing attacks.
Contribution. In this work we aim to resolve the aforementioned deficiencies and present an efficient hardware implementation of Ring-LWE encryption that can be placed even on a low-cost Xilinx Spartan-6 FPGA. Our implementation of Ring-LWE encryption achieves significant performance, namely 42.88 µs to encrypt and 27.51 µs to decrypt a block, even with very moderate resource requirements on the low-cost Spartan-6 family. By providing evidence that Ring-LWE encryption can be both fast and cheap in hardware, we hope to complement the work by Göttert et al. [22] and demonstrate that lattice-based cryptography is indeed a promising and practical alternative for asymmetric encryption in future real-world systems. In summary, the contributions of this work are as follows:

1. Efficient hardware implementation of Ring-LWE encryption. We present a microcode processor implementing Ring-LWE encryption as proposed by [32, 36] in hardware, capable of performing the Number Theoretic Transform (NTT), polynomial additions and subtractions as well as Gaussian sampling. For a fair comparison of our implementation with previous work, we use the same parameters as in [22] and improve their results by at least an order of magnitude considering throughput/area on a similar reconfigurable platform. Moreover, our processor is designed as a versatile building block for the implementation of future ideal lattice-based schemes and is not limited to Ring-LWE encryption. All parts of our implementation have constant runtime and inherently provide resistance against timing attacks.

2. Efficient Gaussian sampling. We present a constant-time Gaussian sampler implementing the inverse transform method. The sampler is optimized for sampling from narrow Gaussian distributions and is the first hardware implementation of this method in the context of lattice-based cryptography.

3. Reducing ciphertext expansion and decryption failure rates. A major drawback of Ring-LWE encryption is the large expansion of the ciphertext^{Footnote 2} and the occurrence of (rare) decryption errors. We analyze different approaches to reduce the impact of both problems and harden Ring-LWE encryption for deployment in real-world systems.
In order to allow third-party evaluation of our results we will make source code files, testbenches and documentation available on our website.^{Footnote 3}
Outline. In Sect. 2 we introduce the implemented ring-based encryption scheme. The implementation of our processor, the Gaussian sampler and the cryptosystem are discussed in Sect. 3. In Sect. 4 we give detailed results including a comparison with previous and related works and conclude with Sect. 5.
2 The Ring-LWEEncrypt Cryptosystem
In this section we briefly introduce the original definition of the implemented Ring-LWE public key encryption system (Ring-LWEEncrypt) and propose modifications in order to decrease ciphertext expansion and error rate without affecting the security properties of the scheme.
2.1 Background on LWE
Since the seminal result by Ajtai [2], who proved a worst-case to average-case reduction between several lattice problems, the whole field of lattice-based cryptography has received significant attention. The reason for this seems to be that the underlying lattice problems are very versatile and allow the construction of hierarchical identity-based encryption (HIBE) [1] or homomorphic encryption [19, 42] but have also led to the introduction of reasonably efficient public-key encryption systems [22, 32, 36], signature schemes [14, 23, 34], and even hash functions [35]. A significant long-term advantage of such schemes is that quantum algorithms do not seem to yield significant improvements over classical ones and that some schemes exhibit a security reduction that relates the hardness of breaking the scheme to the presumably intractable problem of solving a worst-case (ideal) lattice problem. This is a huge advantage over heuristic and patent-protected schemes like NTRU [29], which are merely related to lattice problems, might suffer from not yet known weaknesses, and have repeatedly had to raise their parameters as an immediate reaction to attacks [28]. A particular example is the NTRU signature scheme NTRUSign, which has been completely broken [17, 43]. As a consequence, while NTRU with larger parameters can be considered secure, it seems worthwhile to investigate possible alternatives.
However, the biggest practical problems of lattice-based cryptography are huge key sizes and quite inefficient matrix-vector and matrix-matrix arithmetic. This led to the definition of cyclic [40] and more generalized ideal lattices [33] which correspond to ideals in the ring \(\mathbb {Z}[x]/\langle f \rangle \) for some irreducible polynomial \(f\) of degree \(n\). While certain properties can be established for various rings, in most cases the ring \(R=\mathbb {Z}_q[\mathbf{x}]/\langle {x}^n+1\rangle \) is used. Some papers proposing parameters then also follow the methodology of choosing \(n\) as a power of two and \(q\) as a prime such that \(q\equiv 1 \mathrm{~mod~ }2n\), and thus support asymptotic quasi-linear runtime by direct usage of FFT techniques. Recent work also suggests that \(q\) does not have to be prime in order to allow security reductions [11].
Nowadays, the most popular average-case problem to base lattice-based cryptography on is presumably the learning with errors (LWE) problem [48]. In order to solve the decisional Ring-LWE problem in the ring \(R=\mathbb {Z}_q[\mathbf{x}]/\langle {x}^n+1\rangle \), an attacker has to decide whether the samples \((a_{1},t_{1}),\dots ,(a_{m},t_{m}) \in R \times R\) are chosen uniformly at random or whether each \(t_{i}=a_{i}s+e_{i}\), where \(s,e_{1},\dots ,e_{m}\) have small coefficients from a Gaussian distribution \(D_{\sigma }\) [36].^{Footnote 4} This distribution \(D_{\sigma }\) is defined as the (one-dimensional) discrete Gaussian distribution on \(\mathbb {Z}\) with standard deviation \(\sigma \) and mean 0. The probability of sampling \(x \in \mathbb {Z}\) is \(\rho _{\sigma }(x)/\rho _{\sigma }(\mathbb {Z})\) where \( \rho _{\sigma }(x)=\exp {\left(\frac{-x^2}{2\sigma ^2}\right)}\mathrm{~and~ }\rho _{\sigma } (\mathbb {Z})=\sum _{k=-\infty }^{\infty }\rho _{\sigma }(k). \) In this simple case the standard deviation \(\sigma \) completely describes the Gaussian distribution. Note that some works, e.g., [22, 32], use the parameter \(s=\sqrt{2\pi }\sigma \) to describe the Gaussian.
2.2 Ring-LWEEncrypt
The properties of the Ring-LWE problem can be used to realize a semantically secure public key encryption scheme with a reduction to decisional Ring-LWE. The scheme has been introduced in the full version [37] of Lyubashevsky et al. [36] and parameters have been proposed by Lindner and Peikert [32] as well as Göttert et al. [22]. The scheme (Gen, Enc, Dec) is defined as follows and will from now on be referred to as Ring-LWEEncrypt:

Gen(\(a\)): Choose \(r_1, r_2 \leftarrow D_{\sigma }\) and let \(p=r_1 - a \cdot r_2 \in R\). The public key is \(p\) and the secret key is \(r_2\), while \(r_1\) is just noise and not needed anymore after key generation. The value \(a \in R\) can be defined as a global constant or chosen uniformly at random during key generation.

Enc(\(a,p,m\in {\{0,1\}}^n\)): Choose the noise terms \(e_1, e_2, e_3 \leftarrow D_{\sigma }\). Let \(\bar{m} = \mathtt{encode }(m) \in R\), and compute the ciphertext \([c_1 = a\cdot e_1+e_2, c_2 = p \cdot e_1 + e_3 + \bar{m}] \in R^2\)

Dec(\(c=[c_1,c_2], r_2\)): Output decode \((c_1\cdot r_2 +c_2) \in {\{0,1\}}^n\).
During encryption the encoded message \(\bar{m}\) is added to \(p e_1+e_3\), which is uniformly random and thus hides the message. Decryption is only possible with knowledge of \(r_2\) since otherwise the large term \(ae_1r_2\) cannot be eliminated when computing \(c_1r_2+c_2\). According to [32] the polynomial \(a\) can be chosen during key generation (as part of each public key) or regarded as a global constant, in which case it should be generated from a publicly verifiable random generator (e.g., using a binary interpretation of \(\pi \)). The encoding of the message of length \(n\) is necessary as the noise given by \(e_1 r_1 + e_2 r_2 + e_3\) is still present after calculating \(c_1 r_2 +c_2\) and would prohibit the retrieval of the binary message after decryption. Note that the noise is relatively small as all noise terms are sampled from a narrow Gaussian distribution. With the simple threshold encoding \(\mathtt{encode}(m)=\frac{q-1}{2}m\) the value \(\frac{q-1}{2}\) is assigned only to each binary one of the string \(m\). The corresponding decoding function needs to test whether a received coefficient \(z\in [0..q-1]\) is in the interval \(\frac{q-1}{4} \le z <3\frac{q-1}{4}\), which is interpreted as one, and as zero otherwise. As a consequence, the maximum error added to each coefficient must not be larger than \(\frac{q}{4}\) in order to decrypt correctly. The probability of a decryption error is mainly dominated by the tail-cut and the standard deviation of the Gaussian \(\sigma = \frac{s}{\sqrt{2\pi }}\). Decreasing \(s\) decreases the error probability but also negatively affects the security of the scheme.
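The three algorithms and the threshold encoding can be illustrated with a small software model. The following is only a sketch under stated assumptions: the helper names (`poly_mul`, `sample_noise`, `gen`, `enc`, `dec`) are hypothetical, multiplication is done with a slow schoolbook method instead of the NTT, and the noise sampler is a rounded continuous Gaussian with a hard tail cut that merely stands in for \(D_{\sigma}\). It is neither constant-time nor suitable for real use.

```python
import random

def poly_mul(a, b, q):
    """Schoolbook product in Z_q[x]/(x^n + 1); x^n wraps around as -1."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] = (res[k] + a[i] * b[j]) % q
            else:
                res[k - n] = (res[k - n] - a[i] * b[j]) % q
    return res

def poly_add(a, b, q):
    return [(x + y) % q for x, y in zip(a, b)]

def poly_sub(a, b, q):
    return [(x - y) % q for x, y in zip(a, b)]

def sample_noise(n, sigma, bound, rng):
    """Stand-in for D_sigma (assumption): rounded Gaussian, tail cut at +-bound."""
    out = []
    while len(out) < n:
        x = round(rng.gauss(0.0, sigma))
        if abs(x) <= bound:
            out.append(x)
    return out

def gen(a, q, sigma, bound, rng):
    """Gen: p = r1 - a*r2; returns (public key p, secret key r2)."""
    n = len(a)
    r1 = sample_noise(n, sigma, bound, rng)
    r2 = sample_noise(n, sigma, bound, rng)
    return poly_sub(r1, poly_mul(a, r2, q), q), r2

def enc(a, p, m, q, sigma, bound, rng):
    """Enc: c1 = a*e1 + e2, c2 = p*e1 + e3 + encode(m)."""
    n = len(a)
    e1, e2, e3 = (sample_noise(n, sigma, bound, rng) for _ in range(3))
    m_bar = [((q - 1) // 2) * bit for bit in m]  # threshold encoding
    c1 = poly_add(poly_mul(a, e1, q), e2, q)
    c2 = poly_add(poly_add(poly_mul(p, e1, q), e3, q), m_bar, q)
    return c1, c2

def dec(c1, c2, r2, q):
    """Dec: threshold-decode c1*r2 + c2 coefficient by coefficient."""
    z = poly_add(poly_mul(c1, r2, q), c2, q)
    lo, hi = (q - 1) // 4, 3 * (q - 1) // 4
    return [1 if lo <= zi < hi else 0 for zi in z]
```

With toy parameters such as \(n=16\), \(q=7681\) and noise bounded by 2, the accumulated noise \(e_1 r_1 + e_2 r_2 + e_3\) stays far below \(q/4\), so decryption in this model always recovers the message; with the real parameter sets a small failure probability remains, as discussed in the text.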
Parameter Selection. For details regarding parameter selection we refer to the work by Lindner and Peikert [32] who propose the parameter sets \((n,q,s)\) with (192, 4093, 8.87), (256, 4093, 8.35), and (320, 4093, 8.00) for low, medium, and high security levels, respectively. In this context, Lindner and Peikert [32] state that medium security should be roughly considered equivalent to the security of the symmetric AES-128 block cipher as the decoding attack requires an estimated runtime of approximately \(2^{120}\) s for the best runtime/advantage ratio. However, they did not provide bit-security results due to the new nature of the problem and several trade-offs in their attack.
In this context, the authors of [22] introduced hardware-friendly parameter sets for medium (256, 7681, 11.31) and high security (512, 12289, 12.18). With \(n\) being a power of two and \(q\) a prime such that \(q \equiv 1 \mathrm{~mod~ }2n\), the Fast Fourier Transform (FFT) in \(\mathbb {Z}_{q}\) (namely the Number Theoretic Transform (NTT)) can be directly applied for polynomial multiplication with a quasi-linear runtime of \({\mathcal {O}}(n \log {n})\). Increased security parameters (e.g., a larger \(n\)) therefore have much less impact on the efficiency compared to other schemes [36].
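To make the role of the condition \(q \equiv 1 \bmod 2n\) concrete, the following sketch multiplies two polynomials in \(\mathbb{Z}_q[x]/\langle x^n+1\rangle\) with an NTT and checks the result against schoolbook multiplication. It is a plain software illustration, not the hardware algorithm of this paper: the function names are hypothetical, the transform is a textbook recursive Cooley-Tukey variant, and the primitive \(2n\)-th root of unity \(\psi\) is simply found by search. The modular inverse via `pow(x, -1, q)` assumes Python 3.8 or newer.

```python
def find_psi(n, q):
    """Smallest psi with psi^n = -1 mod q. For n a power of two this is a
    primitive 2n-th root of unity; it exists when q is prime and q = 1 mod 2n."""
    for x in range(2, q):
        if pow(x, n, q) == q - 1:
            return x
    raise ValueError("no primitive 2n-th root of unity modulo q")

def ntt(a, omega, q):
    """Recursive radix-2 NTT; omega must be a primitive len(a)-th root of unity."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % q, q)
    odd = ntt(a[1::2], omega * omega % q, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * omega % q
    return out

def ntt_mul(a, b, q):
    """Negacyclic product a*b mod (x^n + 1, q) using O(n log n) operations."""
    n = len(a)
    psi = find_psi(n, q)
    psi_inv = pow(psi, -1, q)
    omega = psi * psi % q
    # Scaling by powers of psi turns the cyclic convolution into a negacyclic one.
    a_s = [x * pow(psi, i, q) % q for i, x in enumerate(a)]
    b_s = [x * pow(psi, i, q) % q for i, x in enumerate(b)]
    fa, fb = ntt(a_s, omega, q), ntt(b_s, omega, q)
    fc = [x * y % q for x, y in zip(fa, fb)]          # point-wise multiplication
    c_s = ntt(fc, pow(omega, -1, q), q)               # inverse transform (unscaled)
    n_inv = pow(n, -1, q)
    return [x * n_inv % q * pow(psi_inv, i, q) % q for i, x in enumerate(c_s)]

def schoolbook_mul(a, b, q):
    """Reference O(n^2) product in Z_q[x]/(x^n + 1)."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] = (res[k] + a[i] * b[j]) % q
            else:
                res[k - n] = (res[k - n] - a[i] * b[j]) % q
    return res
```

For example, `ntt_mul(a, b, 7681)` with two length-256 coefficient lists should agree with `schoolbook_mul(a, b, 7681)`, since \(7681 - 1 = 15 \cdot 512\).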
Security Implications of Gaussian Sampling. For practical and efficiency reasons it is common to bound the tail of the Gaussian. As an example, the authors of the first proof-of-concept implementation of Ring-LWEEncrypt [22] have chosen to bound their sampler to \([-\lceil 2s \rceil ,\lceil 2s \rceil ]\). Unfortunately, they provide neither a security analysis nor a justification for this specific value. In this context, the probability of sampling \(\pm 24\), which is out of this bound (recall that \(\lceil 2s \rceil =\lceil 2\cdot 11.32 \rceil = 23\)), is \(6.505\cdot 10^{-8}\) and thus not negligible. However, when increasing the tail-cut up to a certain level it can be ensured that certain values will only occur with a negligible probability. For \([-48,48]\), the probability of sampling an \(x=\pm 49\) is \(2.4092 \cdot 10^{-27} < 2^{-80}\), which is unlikely to happen in a real-world scenario. The overall quality of a Gaussian random number generator (GRNG) can be measured by computing the statistical distance \(\varDelta (X,Y) =\frac{1}{2} \sum _{\omega \in \varOmega }{|X(\omega )-Y(\omega )|}\) over a finite domain \(\varOmega \) between the probability of sampling a value \(x\) by the GRNG and the probability given by \(\rho _{\sigma }(x)/\rho _{\sigma }(\mathbb {Z})\).
Since in general attacks on LWE work better for smaller secrets (see [3, 4] for a survey on current attacks), the tail-cut will certainly influence the security level of the scheme. However, we are not aware of any detailed analysis of whether short tails or certain statistical distances lead to better attacks. Moreover, a recent trend in lattice-based cryptography is to move away from Gaussian to very small uniform distributions (e.g., \(-1/0/1\)) [23, 41]. It is therefore not clear whether a sampler has to have a statistical distance of \(2^{-80}\) or \(2^{-100}\) (which is required for worst-case to average-case reductions) in order to withstand practical attacks. Moreover, the parameter choices for the Ring-LWEEncrypt scheme and for most other practical lattice-based schemes already sacrifice the worst-case to average-case reduction in order to obtain practical parameters (i.e., small keys). As a consequence, we primarily implemented a \(\pm \lceil 2s \rceil \) bound sampler for straightforward comparison with the work by Göttert et al. [22] but also provide details and implementation results for larger sampler instantiations that support a much larger tail.
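The tail probabilities and the statistical distance discussed here can be explored numerically. The sketch below is an illustration under assumptions, not the evaluation script of this work: the function names are hypothetical, the infinite sum \(\rho_{\sigma}(\mathbb{Z})\) is truncated to \(|k| \le 200\), and ordinary double-precision floats are used, which is sufficient for the magnitudes involved but not for certifying distances near \(2^{-100}\).

```python
import math

def rho(x, sigma):
    """Unnormalized Gaussian weight exp(-x^2 / (2 sigma^2))."""
    return math.exp(-x * x / (2.0 * sigma * sigma))

def pmf(x, sigma, span=200):
    """rho(x) / rho(Z), with the sum over Z truncated to |k| <= span (assumption)."""
    norm = sum(rho(k, sigma) for k in range(-span, span + 1))
    return rho(x, sigma) / norm

def statistical_distance(X, Y):
    """Delta(X, Y) = 1/2 * sum |X(w) - Y(w)| for distributions given as dicts."""
    support = set(X) | set(Y)
    return 0.5 * sum(abs(X.get(w, 0.0) - Y.get(w, 0.0)) for w in support)

def tailcut_distance(sigma, cut, span=200):
    """Distance between the (nearly) ideal distribution and one that is
    cut to [-cut, cut] and renormalized."""
    ideal = {x: pmf(x, sigma, span) for x in range(-span, span + 1)}
    mass = sum(ideal[x] for x in range(-cut, cut + 1))
    truncated = {x: ideal[x] / mass for x in range(-cut, cut + 1)}
    return statistical_distance(ideal, truncated)

s = 11.32
sigma = s / math.sqrt(2.0 * math.pi)
print(pmf(24, sigma))               # first value outside [-23, 23]
print(pmf(49, sigma))               # first value outside [-48, 48]
print(tailcut_distance(sigma, 23))  # equals the probability mass outside the cut
```

The two printed probabilities should land close to the figures quoted above. For a sampler that is cut and renormalized, the statistical distance equals the probability mass outside the cut, which is why a wider tail directly improves it.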
2.3 Improving Efficiency
In this section we propose efficient modifications to Ring-LWEEncrypt to decrease the undesirable ciphertext expansion and the error rate at the same level of security.
Reducing the Ciphertext Expansion. Threshold encoding was proposed in [22, 32] to transfer \(n\) bits, resulting in an inflated ciphertext of size \(2n \log _{2}{q}\). Efficiency is further reduced if only a part of the \(n\) bits is used, for example to transfer a 128-bit AES key. Moreover, the Ring-LWEEncrypt scheme suffers from random decryption errors so that redundancy in the message \(m\) is required to correct those errors. In the following we analyze a simple but effective way to reduce the ciphertext expansion without significantly affecting the error rate. This approach has been previously applied to homomorphic encryption schemes [9, Sect. 6.4], [10, Sect. 4.2] and the idea is basically to cut off a certain number of least significant bits of \(c_2\) since they mostly carry noise and only little information supporting the threshold decoding. We experimentally verified the applicability of this approach in practice with regard to concrete parameters by measuring the error rates for reduced versions of \(c_2\) as shown in Table 1 (\(u=1\)).
As it turns out the error rate does not significantly increase, even if we remove 7 least significant bits of every coefficient and thus roughly halve the size of \(c_2\). It would also be possible to cut off very few bits (e.g., 1 to 3) of \(c_1\) at the cost of a higher error rate. A further extreme option to reduce ciphertext expansion is to omit whole coefficients of \(c_{2}\) in case they are not used to transfer message bits (e.g., to securely transport a symmetric key). Note that this approach does not affect the concrete security level of the scheme as the modification does not involve any knowledge of the secret key or message and thus does not leak any further information. When compared with much more complicated and hardware-consuming methods, e.g., the compression function for the Lyubashevsky signature scheme presented in [23], this straightforward approach is much more practical.
Decreasing the Error Rate. As noted above, decryption of Ring-LWEEncrypt is prone to undesired message bit-flips with some small probability. Such a faulty decryption is certainly highly undesirable and can also negatively affect security properties. One solution can be the subsequent application of forward error correcting codes but such methods obviously introduce additional complexity in hardware or software. As another approach, the error probability can be lowered by modifying the threshold encoding scheme, i.e., instead of encoding one bit into each coefficient of \(c_2\), a plaintext bit is now encoded into \(u\) coefficients of \(c_2\). This additive threshold encoding algorithm is shown in Fig. 1 where encode takes as input a plaintext bit-vector \(m\) of length \(\lfloor \frac{n}{u} \rfloor \) and outputs the threshold encoded vector \(\bar{m}\) of size \(n\). The decoding algorithm is given the encoded message vector \(\tilde{m}\) affected by an unknown error vector. The impact on the error rate of using additive threshold encoding (\(u=2\)) jointly with the removal of least significant bits is shown in Table 1. Note that this significantly lowers the error rate without any expensive encoding or decoding operations and is much more efficient than, e.g., a simple repetition code [38].
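The bit truncation described above amounts to a shift on every coefficient of \(c_2\). A minimal sketch follows; the function names `truncate` and `restore` are hypothetical, and filling the dropped bits with zeros on the receiver side is an assumption (other reconstruction rules are conceivable).

```python
def truncate(c2, t):
    """Drop the t least significant bits of every coefficient before transmission."""
    return [z >> t for z in c2]

def restore(c2_short, t):
    """Receiver side: shift back, filling the dropped bits with zeros (assumption)."""
    return [z << t for z in c2_short]

def ciphertext_bits(n, q, t):
    """Size of c1 plus truncated c2 in bits, using ceil(log2 q) bits per coefficient."""
    width = q.bit_length()
    return n * width + n * (width - t)

# For n = 256 and q = 7681 each coefficient needs 13 bits, so removing
# 7 bits from c2 leaves 6 bits per coefficient.
print(ciphertext_bits(256, 7681, 0), ciphertext_bits(256, 7681, 7))
```

The error introduced per coefficient is smaller than \(2^{t}\), which for \(t=7\) is small compared with the decoding margin of roughly \(q/4\); this is consistent with the observation that the measured error rate barely changes.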
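The following sketch shows one plausible reading of additive threshold encoding; Fig. 1 remains the authoritative description. It assumes that each plaintext bit is written into \(u\) consecutive coefficients and that decoding sums the absolute values of the corresponding centered coefficients and compares the sum against \(u \cdot q/4\). The function names and the exact comparison rule are assumptions made for illustration.

```python
def centered(z, q):
    """Map z in [0, q-1] to the representative in (-q/2, q/2]."""
    return z if z <= q // 2 else z - q

def encode(m, n, u, q):
    """Spread each of the floor(n/u) message bits over u coefficients."""
    assert len(m) == n // u
    m_bar = [0] * n
    for i, bit in enumerate(m):
        for j in range(u):
            m_bar[u * i + j] = ((q - 1) // 2) * bit
    return m_bar

def decode(m_tilde, u, q):
    """Sum |centered coefficient| over each group of u and threshold at u*q/4."""
    bits = []
    for i in range(len(m_tilde) // u):
        total = sum(abs(centered(m_tilde[u * i + j], q)) for j in range(u))
        bits.append(1 if 4 * total >= u * q else 0)
    return bits
```

With \(u=2\) and \(q=7681\), a zero bit whose two coefficients carry errors of 2500 and 100 still decodes correctly because the sum 2600 stays below \(2q/4\), whereas with \(u=1\) the first coefficient alone would already exceed \(q/4\). The price is that only \(\lfloor n/u \rfloor\) message bits fit into one ciphertext.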
3 Implementation of Ring-LWEEncrypt
In this section we describe the design and implementation of our processor with special focus on the efficient and flexible implementation of Gaussian sampling.
3.1 Gaussian Sampling
Besides its versatile applicability in lattice-based cryptography, sampling of Gaussian distributed numbers is also crucial in electrical engineering and information technology, e.g., for the simulation of complex communication systems (see [51] for a survey from this perspective). However, it is not clear how to adapt continuous Gaussian samplers, like the ones presented in [25, 31, 54], to the requirements of lattice-based cryptography. In the context of discrete Gaussian sampling for lattice-based cryptography the most straightforward method is rejection sampling. In this case a uniform integer \(x \in \{-\tau \sigma , ..., \tau \sigma \}\), where \(\tau \) is the “tail-cut” factor, is chosen from a certain range depending on the security parameter and then accepted with probability proportional to \(e^{-x^2/2\sigma ^2}\) [20]. This method has been implemented in software in [22] but the success rate is only approximately 20 % and it requires costly floating point arithmetic (cf. the laziness approach in [16]). Another method is a table-based approach where a memory array is filled with Gaussian distributed values and selected by a randomly generated address. Unfortunately, a large resolution, resulting in a very large table, is required for accurate sampling. It is not explicitly addressed in [22] how larger values such as \(x=\lceil 2s \rceil \) for \(s=6.67\) with a probability of \(\Pr [x=14]=1.46 \cdot 10^{-7}\) are accurately sampled from a table with a total resolution of only 1024 entries. We further refer to [15, Table 2] for a comparison of different methods to sample from a Gaussian distribution and a new approach.
Hardware Implementation Using the Inverse Transform Method. Since the aforementioned methods seem to be unsuitable for an efficient hardware implementation we decided to use the inverse transform method. When applying this method in general, a table of cumulative probabilities \(p_z = \Pr (x \leqslant z: x \leftarrow D_\sigma )\) for integers \(z \in [-\tau \sigma , ...,\tau \sigma ]\) is computed with a precision of \(\lambda \) bits. For a value \(x\) chosen uniformly at random from the interval \([0,1)\), the integer \(y \in \mathbb {Z}\) for which it holds that \(p_{y-1} \le x <p_y\) is then returned (still requiring costly floating point arithmetic) [15, 18, 44].
In hardware we operate with integers instead of floats by feeding a uniformly random value into a parallel array of comparators. Each comparator \(c_{i}\) compares its input to the cumulative distribution function scaled to the range of the PRNG outputting \(r\) bits. As we have to cut the tail at a certain point, we compute the accumulated probability over the positive half (as it is slightly smaller than \(0.5\)) until we reach the maximum value \(j\) (e.g., \(j=\lceil 2s \rceil \)) so that \(w = \sum _{k=0}^{j}{\rho _{\sigma }(x)/\rho _{\sigma }(\mathbb {Z})}\). We then compute the values fed into the comparators as \(v_{k} = \frac{2^{r-1}-1}{w}(v_{k-1} + \sum _{k=0}^{j}{\rho _{\sigma }(x)/\rho _{\sigma }(\mathbb {Z})})\) for \(0< k \le j\) and with \(v_0=\frac{2^{r-1}-1}{2w}\rho _{\sigma }(0)/\rho _{\sigma }(\mathbb {Z})\). Each comparator \(c_{i}\) is preloaded with the rounded value \(v_{i}\) and outputs a one bit if the input was smaller than or equal to \(v_{i}\). A subsequent circuit then identifies the first comparator \(c_l\) which returned a one bit and outputs either \(l\) or \(-l\).
The block diagram of the sampler is shown in Fig. 2 for the concrete parameter set (\(n=256, q=7681, s=11.32\)) where the output of the sampler is bound to \([-\lceil 2s\rceil ,\lceil 2s \rceil ] = [-5.09\sigma ,5.09\sigma ]\) and the amount of required randomness is 25 bits per sample. These random bits are supplied by a PRNG for which we used the output of an AES block cipher operating in counter mode. Each 128-bit output block of our AES-based PRNG allows sampling of 5 coefficients. One random bit is used for sign determination while the other 24 bits form a uniformly random value. Finally, the output of the sampler is buffered in a FIFO. When leaving the FIFO, the values are lifted to the target domain \([0,q-1]\). Although it is possible to generate a sampler directly in VHDL by computing the cumulative distribution function on-the-fly during synthesis, we have implemented a Python script for this purpose. The reason is that the VHDL floating point implementation only provides double accuracy while the Decimal ^{Footnote 5} data type supports arbitrary precision. The Python script also performs a direct evaluation of the properties of the sampler (e.g., statistical distance).
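A software model of the comparator array may help to follow the description. The sketch below is not the generator script of this work. It makes several assumptions that should be read as such: the thresholds are taken as the cumulative weights of \(0,\dots,j\) with the value 0 counted at half weight (because either sign bit maps to 0), scaled so that the last threshold equals \(2^{24}-1\); double-precision floats are used instead of arbitrary precision; and all names are hypothetical.

```python
import math

def comparator_table(s, j, bits):
    """Integer thresholds v_0..v_j for the comparators (assumed construction)."""
    sigma = s / math.sqrt(2.0 * math.pi)
    weights = [math.exp(-k * k / (2.0 * sigma * sigma)) for k in range(j + 1)]
    weights[0] /= 2.0            # 0 is produced for either sign bit
    w = sum(weights)             # accumulated weight of the (cut) positive half
    top = (1 << bits) - 1        # largest value of the uniform input
    table, acc = [], 0.0
    for k in range(j + 1):
        acc += weights[k]
        table.append(round(top * acc / w))
    table[-1] = top              # every input must match at least one comparator
    return table

def sample(table, rand_value, sign_bit):
    """Evaluate all comparators (no early exit) and pick the first that fires."""
    hits = [rand_value <= v for v in table]
    l = hits.index(True)
    return -l if sign_bit else l

def lift(x, q):
    """Map a signed sample to the target domain [0, q-1]."""
    return x % q

# Parameter set from the text: s = 11.32, bound ceil(2s) = 23, 24 uniform bits + 1 sign bit.
TABLE = comparator_table(11.32, 23, 24)
```

Every call to `sample` touches all \(j+1\) thresholds regardless of the input, which mirrors why a parallel comparator array has no data-dependent timing. In a real generator the table would be computed with arbitrary precision, as the text explains.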
3.2 Ring-LWE Processor Architecture
The core of our processor is built around an NTT-based polynomial multiplier which is described in [45]. The freely available implementation has been further optimized and the architecture has been extended from a simple polynomial multiplier into a full-blown and highly configurable microcode engine. Note that Aysu et al. [6] recently proposed some improvements to the architecture of [45] in order to improve the efficiency and area usage of the polynomial multiplier. While some improvements rely on their decision to fix the modulus \(q\) to \(2^{16} + 1\), other ideas are clearly applicable in future work and revisions of our implementations. However, we do not fix \(q\) as the design goal of our hardware processor is the native support for a large variety of ideal lattice-based schemes, including the most common operations on polynomials like addition, subtraction, multiplication by the NTT as well as sampling of Gaussian distributed polynomials. By supporting an arbitrary number of internal registers (each can store one polynomial) realized in block RAMs and by reusing the data path of the NTT multiplier for other arithmetic operations we achieve high performance at low resource consumption.
General Description and Instruction Set. The width of the datapath of our engine depicted in Fig. 3 depends on the size of the reduction prime \(q\) and is thus \(\log _{2}{q}\) bits, as polynomial coefficients are processed serially in a pipeline. Four registers are fixed: registers R0 and R1 are part of the NTT block, while the Gaussian sampler is connected to register R2. Register R3 is exported to upper layers and operates as I/O port. More registers R4 to R\(x\) can be flexibly enabled during synthesis where each additional register can hold a polynomial with \(n\) elements of size \(\log _2{q}\). The Switch matrix is a dynamic multiplexer that connects registers to the ALU and the external interface and is designed to process statements in two-operand form like \({R1} \leftarrow {R1}+{R2}\). All additional registers R\(x\) for \(x>4\) are placed inside of the Register array component. The Decoder unit is responsible for interpreting instructions that configure the switch matrix and determines whether the ALU has to be used (SUB, ADD, MOV) or whether NTT-specific commands need to invoke the NTT multiplier. To improve resource utilization of the overall system, the butterfly unit of the NTT core is shared between the NTT multiplier and the ALU.
The most important instructions supported by the processor are the iterative forward (NTT_NTT) as well as the backward transform (NTT_INTT) which take \({\approx }\frac{n}{2} \log _{2}{n}\) cycles. Other instructions are for example used for the bit-reversal step (NTT_REV), point-wise multiplication (NTT_PW_MUL), addition (ADD), or subtraction (SUB), each consuming \({\approx }n\) cycles. Note that the sampler and the I/O port are just treated as general purpose registers. Thus no specific I/O or sampling instructions are necessary and, for example, the MOV command can be used. Note also that the NTT is performed in place and commands for the backward transformation (e.g., NTT_PW_MUL, or NTT_INTT) modify only register R1. Therefore, after a backward transform a value in R0 is still available.
Implementation of Ring-LWEEncrypt. For our implementation we used the medium and high security parameter sets as proposed in [22] which are specifically optimized for hardware. We further exploit the general characteristic of the NTT which allows it to “decompose” a multiplication into two forward transforms and one backward transform. If one polynomial is fixed or needed twice it is wise to directly store it in NTT representation to save subsequent transformations. In Fig. 4 the modified algorithm is given, which is more efficient since the public constant \(a\) as well as the public and private keys \(p\) and \(r_{2}\) are stored in NTT representation.
As a consequence, an encryption operation consists of a certain overhead, one forward NTT transformation (\(n+ \frac{1}{2}n \log _2{n}\) cycles), two backward transforms (\(2 \cdot (2n+ \frac{1}{2}n \log _2{n})\) cycles), two coefficient-wise multiplications (\(2n\) cycles), three calls to the Gaussian sampling routine which return the error vectors (\(3n\) cycles) and some additions as well as data movement operations (\(3n\) cycles). For decryption, we just need two NTT transformations, one coefficient-wise multiplication and one addition.
The top-level module (LWEenc) in Fig. 3 instantiates the ideal lattice processor and uses a block RAM as external interface to export or import ciphertexts \(c_1,c_2\), keys \(r_2,p\) or messages \(m\) with straightforward clock domain separation (see again Fig. 3). The processor is controlled by a finite state machine (FSM) issuing commands to the lattice processor to perform encryption, decryption, key import or key generation. It is configured with three general purpose registers R4-R6 in order to permanently store the public key \(p\), the global constant \(a\) and the private key \(r_2\). More registers for several key pairs are also supported but optional. The implementation supports pre-initialization of registers so that all constant values and keys can be directly included in the (encrypted) bitstream. Note that, for encryption, the core is run similarly to a stream cipher as \(c_{1}\) and \(c_2\) can be computed independently from the message, which is then only added in the last step (e.g., comparable to the XOR operation used within stream ciphers).
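The cycle counts listed above can be added up for the two parameter sets. The helper below is a back-of-the-envelope sketch: the function name is hypothetical, and the unspecified "certain overhead" is deliberately left out, so the result is a lower bound on the arithmetic portion rather than a measured figure.

```python
import math

def encryption_cycles(n):
    """Sum of the per-operation cycle estimates from the text, without overhead."""
    log_n = int(math.log2(n))
    forward = n + (n // 2) * log_n               # one forward NTT
    backward = 2 * (2 * n + (n // 2) * log_n)    # two backward transforms
    pointwise = 2 * n                            # two coefficient-wise multiplications
    sampling = 3 * n                             # three Gaussian sampling calls
    add_move = 3 * n                             # additions and data movement
    return forward + backward + pointwise + sampling + add_move

for n in (256, 512):
    print(n, encryption_cycles(n))
```

The estimate shows that the three transforms account for roughly two thirds of the cycles, which is why storing \(a\), \(p\) and \(r_2\) in NTT representation pays off.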
4 Results and Performance
For performance analysis we primarily focus on Virtex-6 platforms (speed grade -2) but would also like to emphasize that our solution can be efficiently implemented even on a small and low-cost Spartan-6 FPGA. All results were obtained after post-place-and-route (Post-PAR) with Xilinx ISE 14.2.
4.1 Gaussian Sampling
In Table 2 we summarize resource requirements of six setups of the implemented comparatorbased Gaussian sampler for different tail cuts and statistical distances. Our random number generator is a round based AES in counter mode that computes a 128bit AES block in 13 cycles and comprises 349 slices, 1181/ 350 LUT/FF, two 18K block RAMs and runs with a maximum frequency of about 265 MHz. Combined with this PRNG^{Footnote 6}, Gaussian sampling based on the inverse transform method is efficient for small values of \(s\) (as typically used for RingLWEEncrypt) but would not be suitable for larger Gaussian parameters like, e.g., \(s=\sqrt{2\pi }2688=6737.8\) for the treeless signature scheme presented in [34]. While our sampler needs a huge number of random inputs, the AES engine is still able to generate these numbers (for each encryption we need \(3n\) samples). Table 2 also shows that it is possible to realize an efficient sampler even for a small statistical distance \({<}2^{80}\) since its resource consumption of roughly 250 slices is quite moderate (setup III/IV). With additional register levels and pipelining for versions I/II we achieved the overall clock frequency for the whole core reported in Table 3 in this section. As the PRNG does not provide enough randomness to sample a value in every clock cycle it is not required to evaluate the comparator array in every single cycle so that in particular setups IIIVI can use several clock cycles until output is provided. This lowers the critical path and thus allows higher clock frequencies without costs for pipelining registers. Setups V/VI are even more accurate and support (theoretical) requirements of a statistical distance smaller than \(2^{100}\) [18]. However, then a faster PRNG would be required as for \(n=256\) we would need \(105\cdot 3n=80640\) bits of random input.
4.2 Performance of RingLWEEncrypt
Table 3 lists the resource consumption and performance of our implementation of RingLWEEncrypt. As stated in Sect. 3.2 our implementation combines key generation, encryption and decryption in a holistic design and would not significantly benefit from removing any one of these functional units. The only exception might be a decryptiononly core in which no Gaussian sampling is needed.
Table 4 compares the results achieved in this work with the implementation by Göttert et al. [22] as well as other relevant asymmetric schemes and also adds performance figures for a Spartan6 instantiation. Note that a detailed comparison with [22] is unfair due to inaccuracies of synthesis results (the Virtex6 LX240T FPGA used in [22] was overmapped so that the subsequent placeandroute (PAR) step providing final results could not be performed). Figures for clock frequency, overall slice consumption, and cycles counts for individual operations or the whole encryption block are thus not given in [22]. We therefore can only refer to numbers providing the resource consumption of registers and LUT usage. For a rough comparison we apply the throughput to area (T/A) metric and define area equivalent to the usage of LUTs due to the restriction mentioned above. It turns out that our implementation for \(n=256\) is 32 times smaller regarding key generation, \(65\) times smaller for encryption and 27 times smaller for decryption, at a loss of a factor of about \(2\) and \(3.3\) in performance. When employing the \(\frac{\text {Bit/s}}{\text {LUT}}\) metric for medium security encryption we achieve \(\frac{9.77\cdot 10^6 \text {Bits}}{4549 \text {LUTs}}=2147\) while the work presented in [22] gives \(\frac{31.8\cdot 10^6 \text {Bits}}{298016 \text {LUTs}}=106\). This results in an improvement of a factor of roughly 20.^{Footnote 7}
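The throughput-per-area comparison can be reproduced with a few lines of arithmetic. The snippet only re-evaluates the numbers quoted in the text; the function name is hypothetical.

```python
def bits_per_second_per_lut(throughput_bits_per_s, luts):
    """Throughput-to-area metric with area measured in LUTs."""
    return throughput_bits_per_s / luts

ours = bits_per_second_per_lut(9.77e6, 4549)        # this work, medium security encryption
theirs = bits_per_second_per_lut(31.8e6, 298016)    # figures reported for [22]
print(int(ours), int(theirs), round(ours / theirs, 1))
```

The ratio of the two values confirms the improvement of roughly a factor of 20 stated above, even though the absolute throughput of [22] is about three times higher.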
In comparison with a recent implementation of the code-based Niederreiter scheme [27], we are faster for decryption and also use fewer resources on the same platform. Another natural target for comparison is the patent-protected NTRU scheme, which has been implemented on a large number of architectures [5, 7, 26]. The implementation in [30] is clearly faster than ours. However, the NTRU(251,3,12) variant implemented in [30] seems to be less secure than our scheme [28]. Unfortunately, we are not aware of any newer NTRU FPGA implementations that would allow us to determine the impact of increased security parameters on runtime and area consumption. In software, NTRU even seems to be rather slow for higher security levels, as can be seen from the 256-bit secure NTRU software implementation (ntruees787ep1) benchmarked using the eBACS framework [8], with secret/public key sizes of 1854/1574 bytes and a ciphertext of 1574 bytes. For the ideal lattice-based NTRU version presented in [49], no implementation and no concrete parameters have been published yet. In comparison with ECC over prime curves (i.e., a single point multiplication [24]) and RSA (random-exponent 1024-bit exponentiation [50]), our implementation is an order of magnitude faster, scales better for higher security levels, and also consumes fewer resources. However, we are not able to beat the recent binary curve implementation of Rebeiro et al. [47] in terms of throughput and performance.
4.3 Constant Time Operation
Side-channel attacks are a problem for all physical implementations [39]. A simple target for a side-channel attack is the timing behavior of the security algorithm, observed by measuring execution time or cycle counts. Our implementation of Ring-LWEEncrypt is fully pipelined and has no data-dependent operations. The processor core does not support any branches, and Gaussian sampling based on the inverse transform operates in constant time. In summary, all cryptographic operations of our core are timing-invariant.
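To make the notion of data-independent operation more concrete, the sketch below contrasts a table lookup that stops at the first match with one that always inspects every entry. It is a conceptual model only: the function names and the tiny table are hypothetical, the number of comparisons is used as a stand-in for execution time, and Python itself offers no real timing guarantees.

```python
def lookup_early_exit(table, r):
    """Data-dependent variant: stops at the first threshold >= r."""
    comparisons = 0
    for index, t in enumerate(table):
        comparisons += 1
        if r <= t:
            return index, comparisons
    return len(table), comparisons


def lookup_full_scan(table, r):
    """Data-independent variant: always compares against every threshold."""
    index = 0
    for t in table:
        index += int(r > t)  # no branch on secret-dependent data
    return index, len(table)


TABLE = [10, 20, 30, 40]  # toy non-decreasing thresholds (assumption)
for r in (0, 15, 35):
    print(r, lookup_early_exit(TABLE, r), lookup_full_scan(TABLE, r))
```

Both functions return the same index for a non-decreasing table, but only the full scan performs the same amount of work for every input, which is the property a comparator array evaluated in full provides in hardware.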
5 Conclusions and Future Work
In this work we presented a novel implementation of the ideal lattice-based Ring-LWE encryption scheme that fits even on a low-cost Spartan-6 FPGA. According to our findings, we improved on the results of the previous work [22] by at least an order of magnitude, using the same FPGA platform and far fewer resources.
Future work can combine our hardware engine with error correction facilities and CCA2 conversion. Additionally, countermeasures against further side-channel and fault-injection attacks need to be considered. As we intend to make our implementation publicly available, our work also offers the chance for third-party side-channel evaluation and cryptanalysis (e.g., exploiting the concrete implementation of the Gaussian sampler). Since our processor could also be utilized by other lattice-based cryptosystems, the provably secure NTRU variant presented in [49] can be another target for implementation. Moreover, a recent proposal of a lattice-based signature scheme by Ducas et al. [14] uses exactly the same parameters (\(n=512,q=12289\)) as Ring-LWEEncrypt and is thus a natural target for implementation based on our microcode engine.
Notes
 1.
The authors report that the utilization of LUTs required for LWE encryption exceeds the number of available LUTs on a Virtex-6 LX240T by 197 % and 410 % for parameters \(n=256\) and \(n=512\), respectively. Note that the Virtex-6 LX240T is a very expensive (above €1000 as of August 2013) and large FPGA.
 2.
For example, the parameters used for implementation in [22] result in a ciphertext expansion by a factor of 26.
 3.
See our web page at http://www.sha.rub.de/research/projects/lattice/
 4.
Note that this is the definition of Ring-LWE in Hermite normal form, where the secret \(s\) is sampled from the noise distribution \(D_{\sigma }\) instead of uniformly at random [37].
 5.
 6.
Generation of true random numbers is not in the scope of this work; we refer to the survey by Varchola [52] for how to achieve this.
 7.
For this comparison we assumed that for each encryption 256 bits are transmitted.
References
Agrawal, S., Boneh, D., Boyen, X.: Efficient lattice (H)IBE in the standard model. In: Gilbert [21], pp. 553–572
Ajtai, M.: Generating hard instances of lattice problems. In: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pp. 99–108. ACM (1996)
Albrecht, M., Cid, C., Faugère, J.-C., Fitzpatrick, R., Perret, L.: On the complexity of BKW algorithm against LWE. In: SCC’12: Proceedings of the 3rd International Conference on Symbolic Computation and Cryptography, Castro-Urdiales, July 2012, pp. 100–107 (2012)
Albrecht, M., Cid, C., Faugère, J.-C., Fitzpatrick, R., Perret, L.: On the complexity of the Arora-Ge algorithm against LWE. In: SCC’12: Proceedings of the 3rd International Conference on Symbolic Computation and Cryptography, Castro-Urdiales, July 2012, pp. 93–99 (2012)
Atici, A.C., Batina, L., Fan, J., Verbauwhede, I., Örs, S.B.: Low-cost implementations of NTRU for pervasive security. In: ASAP, pp. 79–84. IEEE Computer Society (2008)
Aysu, A., Patterson, C., Schaumont, P.: Low-cost and area-efficient FPGA implementations of lattice-based cryptography. In: IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), 2013. IEEE (2013, to appear)
Bailey, D.V., Coffin, D., Elbirt, A., Silverman, J.H., Woodbury, A.D.: NTRU in constrained devices. In: Koç, Ç.K., Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 262–272. Springer, Heidelberg (2001)
Bernstein, D.J., Lange, T.: eBACS: ECRYPT benchmarking of cryptographic systems. http://bench.cr.yp.to. Accessed 10 May 2013
Bos, J.W., Lauter, K., Loftus, J., Naehrig, M.: Improved security for a ring-based fully homomorphic encryption scheme. IACR Cryptol. ePrint Arch. 2013, 75 (2013)
Brakerski, Z.: Fully homomorphic encryption without modulus switching from classical GapSVP. In: Safavi-Naini, R., Canetti, R. (eds.) CRYPTO 2012. LNCS, vol. 7417, pp. 868–886. Springer, Heidelberg (2012)
Brakerski, Z., Langlois, A., Peikert, C., Regev, O., Stehlé, D.: Classical hardness of learning with errors. In: Boneh, D., Roughgarden, T., Feigenbaum, J. (eds.) STOC, pp. 575–584. ACM (2013)
Buchmann, J., May, A., Vollmer, U.: Perspectives for cryptographic long-term security. Commun. ACM 49(9), 50–55 (2006)
Canetti, R., Garay, J.A. (eds.): CRYPTO 2013, Part I. LNCS, vol. 8042. Springer, Heidelberg (2013)
Ducas, L., Durmus, A., Lepoint, T., Lyubashevsky, V.: Lattice signatures and bimodal Gaussians. In: Canetti and Garay [13], pp. 40–56. Proceedings version of [15]
Ducas, L., Durmus, A., Lepoint, T., Lyubashevsky, V.: Lattice signatures and bimodal Gaussians. IACR Cryptol. ePrint Arch. 2013, 383 (2013). (Full version of [14])
Ducas, L., Nguyen, P.Q.: Faster Gaussian lattice sampling using lazy floating-point arithmetic. In: Wang and Sako [53], pp. 415–432
Ducas, L., Nguyen, P.Q.: Learning a zonotope and more: cryptanalysis of NTRUSign countermeasures. In: Wang and Sako [53], pp. 433–450
Galbraith, S.D., Dwarakanath, N.C.: Efficient sampling from discrete Gaussians for lattice-based cryptography on a constrained device
Gentry, C.: Fully homomorphic encryption using ideal lattices. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, pp. 169–178. ACM (2009)
Gentry, C., Peikert, C., Vaikuntanathan, V.: Trapdoors for hard lattices and new cryptographic constructions. In: Dwork, C. (ed.) STOC, pp. 197–206. ACM (2008)
Gilbert, H. (ed.): EUROCRYPT 2010. LNCS, vol. 6110. Springer, Heidelberg (2010)
Göttert, N., Feller, T., Schneider, M., Buchmann, J., Huss, S.: On the design of hardware building blocks for modern lattice-based encryption schemes. In: Prouff and Schaumont [46], pp. 512–529
Güneysu, T., Lyubashevsky, V., Pöppelmann, T.: Practical lattice-based cryptography: a signature scheme for embedded systems. In: Prouff and Schaumont [46], pp. 530–547
Güneysu, T., Paar, C.: Ultra high performance ECC over NIST primes on commercial FPGAs. In: Oswald, E., Rohatgi, P. (eds.) CHES 2008. LNCS, vol. 5154, pp. 62–78. Springer, Heidelberg (2008)
Gutierrez, R., Torres, V., Valls, J.: Hardware architecture of a Gaussian noise generator based on the inversion method. IEEE Trans. Circ. Syst. 59-II(8), 501–505 (2012)
Hermans, J., Vercauteren, F., Preneel, B.: Speed records for NTRU. In: Pieprzyk, J. (ed.) CT-RSA 2010. LNCS, vol. 5985, pp. 73–88. Springer, Heidelberg (2010)
Heyse, S., Güneysu, T.: Towards one cycle per bit asymmetric encryption: code-based cryptography on reconfigurable hardware. In: Prouff and Schaumont [46], pp. 340–355
Hirschhorn, P.S., Hoffstein, J., Howgrave-Graham, N., Whyte, W.: Choosing NTRUEncrypt parameters in light of combined lattice reduction and MITM approaches. In: Abdalla, M., Pointcheval, D., Fouque, P.-A., Vergnaud, D. (eds.) ACNS 2009. LNCS, vol. 5536, pp. 437–455. Springer, Heidelberg (2009)
Hoffstein, J., Pipher, J., Silverman, J.H.: NTRU: a ring-based public key cryptosystem. In: Buhler, J.P. (ed.) ANTS 1998. LNCS, vol. 1423, pp. 267–288. Springer, Heidelberg (1998)
Kamal, A.A., Youssef, A.M.: An FPGA implementation of the NTRUEncrypt cryptosystem. In: 2009 International Conference on Microelectronics (ICM), pp. 209–212. IEEE (2009)
Lee, D.U., Luk, W., Villasenor, J.D., Zhang, G., Leong, P.H.W.: A hardware Gaussian noise generator using the Wallace method. IEEE Trans. Very Large Scale Integr. VLSI Syst. 13(8), 911–920 (2005)
Lindner, R., Peikert, C.: Better key sizes (and attacks) for LWE-based encryption. In: Kiayias, A. (ed.) CT-RSA 2011. LNCS, vol. 6558, pp. 319–339. Springer, Heidelberg (2011)
Lyubashevsky, V., Micciancio, D.: Generalized compact knapsacks are collision resistant. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 144–155. Springer, Heidelberg (2006)
Lyubashevsky, V.: Lattice signatures without trapdoors. In: Pointcheval, D., Johansson, T. (eds.) EUROCRYPT 2012. LNCS, vol. 7237, pp. 738–755. Springer, Heidelberg (2012)
Lyubashevsky, V., Micciancio, D., Peikert, C., Rosen, A.: SWIFFT: a modest proposal for FFT hashing. In: Nyberg, K. (ed.) FSE 2008. LNCS, vol. 5086, pp. 54–72. Springer, Heidelberg (2008)
Lyubashevsky, V., Peikert, C., Regev, O.: On ideal lattices and learning with errors over rings. In: Gilbert [21], pp. 1–23. Proceedings version of [37]
Lyubashevsky, V., Peikert, C., Regev, O.: On ideal lattices and learning with errors over rings. IACR Cryptol. ePrint Arch. 2012, 230 (2012). (Full version of [36])
MacWilliams, F.J., Sloane, N.J.A.: The Theory of Error-Correcting Codes. vol. 16, 762 pp, Elsevier Science Publishers B. V., North-Holland (2006). ISBN: 0444851933
Mangard, S., Oswald, E., Popp, T.: Power Analysis Attacks: Revealing the Secrets of Smart Cards (Advances in Information Security), 3rd edn. Springer, New York (2007)
Micciancio, D.: Generalized compact knapsacks, cyclic lattices, and efficient one-way functions. Comput. Complex. 16(4), 365–411 (2007)
Micciancio, D., Peikert, C.: Hardness of SIS and LWE with small parameters. In: Canetti and Garay [13], pp. 21–39
Naehrig, M., Lauter, K., Vaikuntanathan, V.: Can homomorphic encryption be practical? In: Proceedings of the 3rd ACM Workshop on Cloud Computing Security Workshop, CCSW ’11, pp. 113–124. ACM, New York (2011)
Nguyên, P.Q., Regev, O.: Learning a parallelepiped: cryptanalysis of GGH and NTRU signatures. In: Vaudenay, S. (ed.) EUROCRYPT 2006. LNCS, vol. 4004, pp. 271–288. Springer, Heidelberg (2006)
Peikert, C.: An efficient and parallel Gaussian sampler for lattices. In: Rabin, T. (ed.) CRYPTO 2010. LNCS, vol. 6223, pp. 80–97. Springer, Heidelberg (2010)
Pöppelmann, T., Güneysu, T.: Towards efficient arithmetic for lattice-based cryptography on reconfigurable hardware. In: Hevia, A., Neven, G. (eds.) LatinCrypt 2012. LNCS, vol. 7533, pp. 139–158. Springer, Heidelberg (2012)
Prouff, E., Schaumont, P. (eds.): CHES 2012. LNCS, vol. 7428. Springer, Heidelberg (2012)
Rebeiro, C., Roy, S.S., Mukhopadhyay, D.: Pushing the limits of high-speed GF(\(2^m\)) elliptic curve scalar multiplication on FPGAs. In: Prouff and Schaumont [46], pp. 494–511
Regev, O.: On lattices, learning with errors, random linear codes, and cryptography. In: Gabow, H.N., Fagin, R. (eds.) STOC, pp. 84–93. ACM (2005)
Stehlé, D., Steinfeld, R.: Making NTRU as secure as worst-case problems over ideal lattices. In: Paterson, K.G. (ed.) EUROCRYPT 2011. LNCS, vol. 6632, pp. 27–47. Springer, Heidelberg (2011)
Suzuki, D.: How to maximize the potential of FPGA resources for modular exponentiation. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727, pp. 272–288. Springer, Heidelberg (2007)
Thomas, D.B., Luk, W., Leong, P.H.W., Villasenor, J.D.: Gaussian random number generators. ACM Comput. Surv. 39(4), 11:1–11:38 (2007)
Varchola, M.: FPGA based true random number generators for embedded cryptographic applications. Ph.D. thesis, Technical University of Kosice (2008)
Wang, X., Sako, K. (eds.): ASIACRYPT 2012. LNCS, vol. 7658. Springer, Heidelberg (2012)
Zhang, G., Leong, P.H.W., Lee, D.U., Villasenor, J.D., Cheung, R.C.C., Luk, W.: Ziggurat-based hardware Gaussian random number generator. In: International Conference on Field Programmable Logic and Applications, 2005, pp. 275–280 (2005)
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pöppelmann, T., Güneysu, T. (2014). Towards Practical Lattice-Based Public-Key Encryption on Reconfigurable Hardware. In: Lange, T., Lauter, K., Lisoněk, P. (eds) Selected Areas in Cryptography - SAC 2013. Lecture Notes in Computer Science, vol 8282. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43414-7_4
DOI: https://doi.org/10.1007/978-3-662-43414-7_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43413-0
Online ISBN: 978-3-662-43414-7