Towards Practical Lattice-Based Public-Key Encryption on Reconfigurable Hardware
Abstract
With this work we provide further evidence that lattice-based cryptography is a promising and efficient alternative to secure embedded applications. So far it is known for solid security reductions, but implementations of specific instances have often been reported to be too complex to be practical. In this work, we present an efficient and scalable microcode engine for Ring-LWE encryption that combines polynomial multiplication based on the Number Theoretic Transform (NTT), polynomial addition, subtraction, and Gaussian sampling in a single unit. This unit can encrypt and decrypt a block in 26.19 µs and 16.80 µs on a Virtex-6 LX75T FPGA, respectively, at moderate resource requirements of about 1506 slices and a few block RAMs. Additionally, we provide solutions for several practical issues with Ring-LWE encryption, including the reduction of ciphertext expansion, error rate and constant-time operation. We hope that this contribution helps to pave the way for the deployment of ideal lattice-based encryption in future real-world systems.
Keywords
Ideal lattices, Ring-LWE, FPGA implementation
1 Introduction and Motivation
Resistance against quantum computers and long-term security have been issues that cryptographers have been trying to solve for some time [12]. However, while quite a few alternative schemes and problem classes are available, not many of them have received the attention from both cryptanalysts and implementers that would be needed to establish the confidence and efficiency required for their deployment in real-world systems. In the field of patent-free lattice-based public-key encryption there are a few promising proposals such as a provably secure NTRU variant [49] or the cryptosystem based on the (Ring-)LWE problem [32, 36]. For the latter scheme Göttert et al. presented a proof-of-concept implementation in [22] demonstrating that LWE encryption is feasible in software. However, their corresponding hardware implementation is quite large: it can only be placed fully on a Virtex-7 2000T and does not even fit onto the largest Xilinx Virtex-6 FPGA for secure parameters.^{1} Several other important aspects of Ring-LWE encryption have also not been addressed yet, such as the reduction of the extensive ciphertext expansion and constant-time operation to withstand timing attacks.
 1.
Efficient hardware implementation of Ring-LWE encryption. We present a microcode processor implementing Ring-LWE encryption as proposed by [32, 36] in hardware, capable of performing the Number Theoretic Transform (NTT), polynomial additions and subtractions as well as Gaussian sampling. For a fair comparison of our implementation with previous work, we use the same parameters as in [22] and improve their results by at least an order of magnitude considering throughput/area on a similar reconfigurable platform. Moreover, our processor is designed as a versatile building block for the implementation of future ideal lattice-based schemes and is not limited to Ring-LWE encryption. All parts of our implementation have constant runtime and inherently provide resistance against timing attacks.
 2.
Efficient Gaussian sampling. We present a constant-time Gaussian sampler implementing the inverse transform method. The sampler is optimized for sampling from narrow Gaussian distributions and is the first hardware implementation of this method in the context of lattice-based cryptography.
 3.
Reducing ciphertext expansion and decryption failure rates. A major drawback of Ring-LWE encryption is the large expansion of the ciphertext^{2} and the occurrence of (rare) decryption errors. We analyze different approaches to reduce the impact of both problems and harden Ring-LWE encryption for deployment in real-world systems.
Outline. In Sect. 2 we introduce the implemented ring-based encryption scheme. The implementation of our processor, the Gaussian sampler and the cryptosystem are discussed in Sect. 3. In Sect. 4 we give detailed results including a comparison with previous and related works and conclude with Sect. 5.
2 The Ring-LWEEncrypt Cryptosystem
In this section we briefly introduce the original definition of the implemented Ring-LWE public-key encryption system (Ring-LWEEncrypt) and propose modifications in order to decrease ciphertext expansion and error rate without affecting the security properties of the scheme.
2.1 Background on LWE
Since the seminal result by Ajtai [2], who proved a worst-case to average-case reduction between several lattice problems, the whole field of lattice-based cryptography has received significant attention. The reason for this seems to be that the underlying lattice problems are very versatile: they allow the construction of hierarchical identity-based encryption (HIBE) [1] or homomorphic encryption [19, 42] but have also led to the introduction of reasonably efficient public-key encryption systems [22, 32, 36], signature schemes [14, 23, 34], and even hash functions [35]. A significant long-term advantage of such schemes is that quantum algorithms do not seem to yield significant improvements over classical ones and that some schemes exhibit a security reduction that relates the hardness of breaking the scheme to the presumably intractable problem of solving a worst-case (ideal) lattice problem. This is a huge advantage over heuristic and patent-protected schemes like NTRU [29], which are merely related to lattice problems but might suffer from as yet unknown weaknesses and had to repeatedly raise their parameters as an immediate reaction to attacks [28]. A particular example is the NTRU signature scheme NTRUSign, which has been completely broken [17, 43]. As a consequence, while NTRU with larger parameters can be considered secure, it seems worthwhile to investigate possible alternatives.
However, the biggest practical problems of lattice-based cryptography are huge key sizes and quite inefficient matrix-vector and matrix-matrix arithmetic. This led to the definition of cyclic [40] and more generalized ideal lattices [33] which correspond to ideals in the ring \(\mathbb {Z}[x]/\langle f \rangle \) for some irreducible polynomial \(f\) of degree \(n\). While certain properties can be established for various rings, in most cases the ring \(R=\mathbb {Z}_q[\mathbf{x}]/\langle {x}^n+1\rangle \) is used. Some papers proposing parameters then also choose \(n\) as a power of two and \(q\) as a prime such that \(q\equiv 1 \mathrm{~mod~ }2n\), and thus support asymptotic quasi-linear runtime by direct usage of FFT techniques. Recent work also suggests that \(q\) does not have to be prime in order to allow security reductions [11].
Nowadays, the most popular average-case problem to base lattice-based cryptography on is presumably the learning with errors (LWE) problem [48]. In order to solve the decisional Ring-LWE problem in the ring \(R=\mathbb {Z}_q[\mathbf{x}]/\langle {x}^n+1\rangle \), an attacker has to decide whether the samples \((a_{1},t_{1}),\dots ,(a_{m},t_{m}) \in R \times R\) are chosen uniformly at random or whether each \(t_{i}=a_{i}s+e_{i}\), where \(s,e_{1},\dots ,e_{m}\) have small coefficients from a Gaussian distribution \(D_{\sigma }\) [36].^{4} This distribution \(D_{\sigma }\) is defined as the (one-dimensional) discrete Gaussian distribution on \(\mathbb {Z}\) with standard deviation \(\sigma \) and mean 0. The probability of sampling \(x \in \mathbb {Z}\) is \(\rho _{\sigma }(x)/\rho _{\sigma }(\mathbb {Z})\) where \( \rho _{\sigma }(x)=\exp {(\frac{-x^2}{2\sigma ^2})}\mathrm{~and~ }\rho _{\sigma } (\mathbb {Z})=\sum _{k=-\infty }^{\infty }\rho _{\sigma }(k). \) In this simple case the standard deviation \(\sigma \) completely describes the Gaussian distribution. Note that some works, e.g., [22, 32], use the parameter \(s=\sqrt{2\pi }\sigma \) to describe the Gaussian.
2.2 Ring-LWEEncrypt

Gen(\(a\)): Choose \(r_1, r_2 \leftarrow D_{\sigma }\) and let \(p=r_1 - a \cdot r_2 \in R\). The public key is \(p\) and the secret key is \(r_2\), while \(r_1\) is just noise and not needed anymore after key generation. The value \(a \in R\) can be defined as a global constant or chosen uniformly at random during key generation.

Enc(\(a,p,m\in {\{0,1\}}^n\)): Choose the noise terms \(e_1, e_2, e_3 \leftarrow D_{\sigma }\). Let \(\bar{m} = \mathtt{encode }(m) \in R\), and compute the ciphertext \([c_1 = a\cdot e_1+e_2, c_2 = p \cdot e_1 + e_3 + \bar{m}] \in R^2\)

Dec(\(c=[c_1,c_2], r_2\)): Output decode \((c_1\cdot r_2 +c_2) \in {\{0,1\}}^n\).
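The three operations can be sketched in a few lines of Python. This is only an illustrative model and not the hardware implementation discussed later: the helper names (`sample_poly`, `poly_mul`, `keygen`, `encrypt`, `decrypt`) are hypothetical, the multiplication is a schoolbook product in \(\mathbb {Z}_q[x]/\langle x^n+1\rangle \), sampling uses floating-point weights, and `encode`/`decode` assume a plain threshold encoding (bit 1 is mapped to \(\lfloor q/2 \rfloor \)) rather than the additive threshold encoding mentioned in Sect. 2.3.

```python
import math
import random

def sample_poly(rng, n, s, bound):
    """n coefficients from a discrete Gaussian with parameter s, cut to [-bound, bound]."""
    sigma = s / math.sqrt(2 * math.pi)
    support = list(range(-bound, bound + 1))
    weights = [math.exp(-x * x / (2 * sigma * sigma)) for x in support]
    return rng.choices(support, weights=weights, k=n)

def poly_mul(a, b, q):
    """Schoolbook product in Z_q[x]/(x^n + 1); x^n wraps around to -1."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                res[i + j] = (res[i + j] + a[i] * b[j]) % q
            else:
                res[i + j - n] = (res[i + j - n] - a[i] * b[j]) % q
    return res

def poly_add(a, b, q):
    return [(x + y) % q for x, y in zip(a, b)]

def poly_sub(a, b, q):
    return [(x - y) % q for x, y in zip(a, b)]

def encode(bits, q):
    # Assumed plain threshold encoding: 0 -> 0, 1 -> floor(q/2).
    return [bit * (q // 2) for bit in bits]

def decode(poly, q):
    # A coefficient closer to q/2 than to 0 (mod q) is read as a 1.
    return [1 if q // 4 < c < 3 * q // 4 else 0 for c in poly]

def keygen(rng, a, q, s, bound):
    r1 = sample_poly(rng, len(a), s, bound)
    r2 = sample_poly(rng, len(a), s, bound)
    p = poly_sub(r1, poly_mul(a, r2, q), q)  # p = r1 - a * r2
    return p, r2

def encrypt(rng, a, p, bits, q, s, bound):
    e1, e2, e3 = (sample_poly(rng, len(a), s, bound) for _ in range(3))
    c1 = poly_add(poly_mul(a, e1, q), e2, q)
    c2 = poly_add(poly_add(poly_mul(p, e1, q), e3, q), encode(bits, q), q)
    return c1, c2

def decrypt(c1, c2, r2, q):
    # c1 * r2 + c2 = e2 * r2 + r1 * e1 + e3 + encode(m), i.e. message plus small noise.
    return decode(poly_add(poly_mul(c1, r2, q), c2, q), q)
```

Decryption is correct whenever every coefficient of the noise term \(e_2 r_2 + r_1 e_1 + e_3\) stays below \(q/4\) in absolute value. For realistic parameters this holds only with high probability, which is the source of the rare decryption errors discussed in this work.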
Parameter Selection. For details regarding parameter selection we refer to the work by Lindner and Peikert [32] who propose the parameter sets \((n,q,s)\) with (192, 4093, 8.87), (256, 4093, 8.35), and (320, 4093, 8.00) for low, medium, and high security levels, respectively. In this context, Lindner and Peikert [32] state that medium security should be roughly considered equivalent to the security of the symmetric AES-128 block cipher as the decoding attack requires an estimated runtime of approximately \(2^{120}\) s for the best runtime/advantage ratio. However, they did not provide bit-security results due to the new nature of the problem and several trade-offs in their attack.
In this context, the authors of [22] introduced hardware-friendly parameter sets for medium (256, 7681, 11.31) and high security (512, 12289, 12.18). With \(n\) being a power of two and \(q\) a prime such that \(q\equiv 1 \mathrm{~mod~ }2n\), the Fast Fourier Transform (FFT) in \(\mathbb {Z}_{q}\) (namely the Number Theoretic Transform (NTT)) can be directly applied for polynomial multiplication with a quasi-linear runtime of \({\mathcal {O}}(n \log {n})\). Increased security parameters (e.g., a larger \(n\)) therefore have much less impact on the efficiency compared to other schemes [36].
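The NTT condition on \((n, q)\) is easy to check mechanically. The following sketch uses a hypothetical helper `supports_direct_ntt` with a naive trial-division primality test. Under these assumptions it accepts the two parameter sets of [22] and rejects the sets of [32], where \(q=4093\) is prime but \(q \not \equiv 1 \mathrm{~mod~ }2n\) (and \(n=192\) and \(n=320\) are not powers of two).

```python
def is_prime(q):
    """Naive trial division; sufficient for the small moduli considered here."""
    if q < 2:
        return False
    d = 2
    while d * d <= q:
        if q % d == 0:
            return False
        d += 1
    return True

def supports_direct_ntt(n, q):
    """True if n is a power of two and q is a prime with q = 1 mod 2n."""
    power_of_two = n > 0 and n & (n - 1) == 0
    return power_of_two and is_prime(q) and q % (2 * n) == 1

if __name__ == "__main__":
    for n, q in [(256, 7681), (512, 12289), (192, 4093), (256, 4093), (320, 4093)]:
        print(n, q, supports_direct_ntt(n, q))
```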
Security Implications of Gaussian Sampling. For practical and efficiency reasons it is common to bound the tail of the Gaussian. As an example, the authors of the first proof-of-concept implementation of Ring-LWEEncrypt [22] have chosen to bound their sampler to \([-\lceil 2s \rceil ,\lceil 2s \rceil ]\). Unfortunately, they provide neither a security analysis nor a justification for this specific value. In this context, the probability of sampling \(\pm 24\), which is out of this bound (recall that \(\lceil 2s \rceil =\lceil 2\cdot 11.32 \rceil = 23\)), is \(6.505\cdot 10^{-8}\) and thus not negligible. However, when increasing the tail-cut up to a certain level it can be ensured that certain values will only occur with a negligible probability. For \([-48,48]\), the probability of sampling an \(x=\pm 49\) is \(2.4092 \cdot 10^{-27} < 2^{-80}\), which is unlikely to happen in a real-world scenario. The overall quality of a Gaussian random number generator (GRNG) can be measured by computing the statistical distance \(\varDelta (X,Y) =\frac{1}{2} \sum _{\omega \in \varOmega }{|X(\omega )-Y(\omega )|}\) over a finite domain \(\varOmega \) between the probability of sampling a value \(x\) by the GRNG and the probability given by \(\rho _{\sigma }(x)/\rho _{\sigma }(\mathbb {Z})\).
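These quantities can be reproduced numerically with a short sketch. The function names `gauss_table` and `tailcut_distance` are hypothetical, \(\rho _{\sigma }(\mathbb {Z})\) is approximated by a sum over \([-200, 200]\), and the tail-cut sampler is modelled as an ideal one (the Gaussian restricted to the bound and renormalized), so the finite precision of a real sampler is not captured.

```python
import math

def gauss_table(s, span=200):
    """Probabilities of D_sigma on [-span, span], with sigma = s / sqrt(2 * pi)."""
    sigma = s / math.sqrt(2 * math.pi)
    weights = {k: math.exp(-k * k / (2 * sigma * sigma)) for k in range(-span, span + 1)}
    norm = sum(weights.values())  # stands in for rho_sigma(Z)
    return {k: w / norm for k, w in weights.items()}

def tailcut_distance(s, bound, span=200):
    """Statistical distance between D_sigma and its renormalized restriction to [-bound, bound]."""
    ideal = gauss_table(s, span)
    inside = sum(p for k, p in ideal.items() if abs(k) <= bound)
    total = 0.0
    for k, p in ideal.items():
        cut = p / inside if abs(k) <= bound else 0.0
        total += abs(cut - p)
    return total / 2

if __name__ == "__main__":
    table = gauss_table(11.32)
    print("Pr[x = 24]:", table[24])
    print("Pr[x = 49]:", table[49])
    print("log2 of distance for bound 23:", math.log2(tailcut_distance(11.32, 23)))
```

Under these assumptions the single-value probabilities should land close to the figures quoted above, and the distance of the idealized \(\pm \lceil 2s \rceil \) sampler falls between \(2^{-23}\) and \(2^{-22}\).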
Since in general attacks on LWE work better for smaller secrets (see [3, 4] for a survey on current attacks) the tail-cut will certainly influence the security level of the scheme. However, we are not aware of any detailed analysis of whether short tails or certain statistical distances lead to better attacks. Moreover, a recent trend in lattice-based cryptography is to move away from Gaussian to very small uniform distributions (e.g., \(-1/0/1\)) [23, 41]. It is therefore not clear whether a sampler has to have a statistical distance of \(2^{-80}\) or \(2^{-100}\) (which is required for a worst-case to average-case reduction) in order to withstand practical attacks. Moreover, the parameter choices for the Ring-LWEEncrypt scheme and for most other practical lattice-based schemes already sacrifice the worst-case to average-case reduction in order to obtain practical parameters (i.e., small keys). As a consequence, we primarily implemented a \(\pm \lceil 2s \rceil \) bound sampler for straightforward comparison with the work by Göttert et al. [22] but also provide details and implementation results for larger sampler instantiations that support a much larger tail.
2.3 Improving Efficiency
In this section we propose efficient modifications to Ring-LWEEncrypt to decrease the undesirable ciphertext expansion and the error rate at the same level of security.
Bit-error rate for the encryption and decryption of 160,000,000 bytes of plaintext when cutting off a certain number \(x\) of least significant bits of every coefficient of \(c_2\) for the parameter set (\(n=256, q=7681, s=11.31\)) where \(u\) is the parameter of the additive threshold encoding (see Algorithm 1) and \(\pm \lceil 2s \rceil \) the tail-cut bound. For a cutoff of 12 or 13 bits almost no message can be recovered.

| \(u\) | Cutoff \(x\) bits | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Errors (\(10^{3}\)) | 46 | 46 | 45.5 | 45.6 | 46 | 46.5 | 48.6 | 56.1 | 94.4 | 381 | 5359 | 135771 |
|  | Error rate (\(10^{-5}\)) | 3.59 | 3.59 | 3.56 | 3.57 | 3.59 | 3.63 | 3.80 | 4.38 | 7.38 | 29.81 | 418.7 | 10610 |
| 2 | Errors | 26 | 20 | 26 | 27 | 23 | 21 | 21 | 32 | 71 | 957 | 125796 | \(44\cdot 10^6\) |
|  | Error rate (\(10^{-8}\)) | 2.03 | 1.56 | 2.03 | 2.11 | 1.80 | 1.64 | 1.64 | 2.5 | 5.55 | 74.7 | 9830 | \(34\cdot 10^5\) |
As it turns out, the error rate does not increase significantly even if we remove the 7 least significant bits of every coefficient and thus roughly halve the size of \(c_2\). It would also be possible to cut off very few bits (e.g., 1 to 3) of \(c_1\) at the cost of a higher error rate. A further extreme option to reduce ciphertext expansion is to omit whole coefficients of \(c_{2}\) in case they are not used to transfer message bits (e.g., to securely transport a symmetric key). Note that this approach does not affect the concrete security level of the scheme as the modification does not involve any knowledge of the secret key or message and thus does not leak any further information. When compared with much more complicated and hardware-consuming methods, e.g., the compression function for the Lyubashevsky signature scheme presented in [23], this straightforward approach is much more practical.
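The effect on the ciphertext size can be illustrated with a small sketch. The function names are hypothetical, each coefficient is assumed to occupy \(\lceil \log _2 q \rceil \) bits, and the receiver is assumed to fill the dropped bits with zeros before decryption; how the missing bits are restored in practice is not specified here.

```python
def coeff_bits(q):
    """Bits needed to store one coefficient in [0, q - 1]."""
    return (q - 1).bit_length()

def truncate(c2, x):
    """Sender side: drop the x least significant bits of every coefficient of c2."""
    return [c >> x for c in c2]

def restore(c2_short, x):
    """Receiver side: shift back, filling the dropped bits with zeros (an assumption)."""
    return [c << x for c in c2_short]

def ciphertext_bits(n, q, x=0):
    """Size of (c1, c2) when x bits are cut from every coefficient of c2."""
    return n * coeff_bits(q) + n * (coeff_bits(q) - x)

def expansion(n, q, x=0):
    """Ciphertext bits per message bit for an n-bit message."""
    return ciphertext_bits(n, q, x) / n

if __name__ == "__main__":
    for x in (0, 7):
        print(x, ciphertext_bits(256, 7681, x), expansion(256, 7681, x))
```

For \(n=256\) and \(q=7681\) every coefficient takes 13 bits, so cutting 7 bits leaves 6 bits per coefficient of \(c_2\). Each restored coefficient deviates from the original by less than \(2^{7}\), which acts as additional noise during decoding.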
3 Implementation of Ring-LWEEncrypt
In this section we describe the design and implementation of our processor with special focus on the efficient and flexible implementation of Gaussian sampling.
3.1 Gaussian Sampling
Besides its versatile applicability in lattice-based cryptography, sampling of Gaussian distributed numbers is also crucial in electrical engineering and information technology, e.g., for the simulation of complex communication systems (see [51] for a survey from this perspective). However, it is not clear how to adapt continuous Gaussian samplers, like the ones presented in [25, 31, 54], to the requirements of lattice-based cryptography. In the context of discrete Gaussian sampling for lattice-based cryptography the most straightforward method is rejection sampling. In this case a uniform integer \(x \in \{-\tau \sigma , ..., \tau \sigma \}\), where \(\tau \) is the “tail-cut” factor, is chosen from a certain range depending on the security parameter and then accepted with probability proportional to \(e^{-x^2/2\sigma ^2}\) [20]. This method has been implemented in software in [22], but its success rate is only approximately 20 % and it requires costly floating-point arithmetic (cf. the laziness approach in [16]). Another method is a table-based approach where a memory array is filled with Gaussian distributed values and selected by a randomly generated address. Unfortunately, a large resolution, resulting in a very large table, is required for accurate sampling. It is not explicitly addressed in [22] how larger values such as \(x=\lceil 2s \rceil \) for \(s=6.67\) with a probability of \(\Pr [x=14]=1.46 \cdot 10^{-7}\) are accurately sampled from a table with a total resolution of only 1024 entries. We further refer to [15, Table 2] for a comparison of different methods to sample from a Gaussian distribution and a new approach.
Hardware Implementation Using the Inverse Transform Method. Since the aforementioned methods seem to be unsuitable for an efficient hardware implementation we decided to use the inverse transform method. When applying this method in general, a table of cumulative probabilities \(p_z = \Pr (x \leqslant z: x \leftarrow D_\sigma )\) for integers \(z \in [-\tau \sigma , ...,\tau \sigma ]\) is computed with a precision of \(\lambda \) bits. For a value \(x\) chosen uniformly at random from the interval \([0,1)\), the integer \(y \in \mathbb {Z}\) for which it holds that \(p_{y-1} \le x <p_y\) is then returned (still requiring costly floating-point arithmetic) [15, 18, 44].
In hardware we operate with integers instead of floats by feeding a uniformly random value into a parallel array of comparators. Each comparator \(c_{i}\) compares its input to the cumulative distribution function scaled to the range of the PRNG outputting \(r\) bits. As we have to cut the tail at a certain point, we compute the accumulated probability over the positive half (as it is slightly smaller than \(0.5\)) until we reach the maximum value \(j\) (e.g., \(j=\lceil 2s \rceil \)) so that \(w = \sum _{k=0}^{j}{\rho _{\sigma }(x)/\rho _{\sigma }(\mathbb {Z})}\). We then compute the values fed into the comparators as \(v_{k} = \frac{2^{r-1}-1}{w}(v_{k-1} + \sum _{k=0}^{j}{\rho _{\sigma }(x)/\rho _{\sigma }(\mathbb {Z})})\) for \(0< k \le j\) and with \(v_0=\frac{2^{r-1}-1}{2w}\rho _{\sigma }(0)/\rho _{\sigma }(\mathbb {Z})\). Each comparator \(c_{i}\) is preloaded with the rounded value \(v_{i}\) and outputs a one bit if the input was smaller than or equal to \(v_{i}\). A subsequent circuit then identifies the first comparator \(c_l\) which returned a one bit and outputs either \(l\) or \(-l\).
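A software model of such a comparator array may help to follow the construction. The sketch below is one possible reading of the description and not the circuit itself: the names `build_thresholds` and `sample` are hypothetical, the thresholds are the cumulative weights of \(0, 1, \dots , j\) with the weight of zero halved (since both signs map to zero) and scaled so that the last threshold equals \(2^{r-1}-1\), and one of the \(r\) random bits is assumed to select the sign.

```python
import math
import random

def build_thresholds(s, bound, r):
    """Integer comparator constants v_0..v_bound for an r-bit random input."""
    sigma = s / math.sqrt(2 * math.pi)
    rho = [math.exp(-k * k / (2 * sigma * sigma)) for k in range(bound + 1)]
    cumulative = [rho[0] / 2]  # zero is reachable with either sign, so halve its weight
    for k in range(1, bound + 1):
        cumulative.append(cumulative[-1] + rho[k])
    top = 2 ** (r - 1) - 1  # largest value of the (r - 1)-bit magnitude input
    return [round(c * top / cumulative[-1]) for c in cumulative]

def sample(rng, thresholds, r):
    """Draw one value by comparing r - 1 uniform bits against every threshold."""
    word = rng.getrandbits(r)
    sign = word & 1
    u = word >> 1
    # Every comparator is evaluated regardless of u, mirroring the parallel array.
    hits = [u <= v for v in thresholds]
    l = hits.index(True)  # first comparator that fired
    return -l if sign else l

if __name__ == "__main__":
    rng = random.Random(42)
    thresholds = build_thresholds(11.32, 23, 25)
    print([sample(rng, thresholds, 25) for _ in range(16)])
```

In hardware all comparisons happen in parallel in the same cycle. The Python model only mimics the data flow and makes no claim about constant execution time.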
3.2 Ring-LWE Processor Architecture
The core of our processor is built around an NTT-based polynomial multiplier which is described in [45]. The freely available implementation has been further optimized and the architecture has been extended from a simple polynomial multiplier into a full-blown and highly configurable microcode engine. Note that Aysu et al. [6] recently proposed some improvements to the architecture of [45] in order to improve the efficiency and reduce the area usage of the polynomial multiplier. While some improvements rely on their decision to fix the modulus \(q\) to \(2^{16} + 1\), other ideas are clearly applicable in future work and revisions of our implementations. However, we do not fix \(q\) as the design goal of our hardware processor is the native support for a large variety of ideal lattice-based schemes, including the most common operations on polynomials like addition, subtraction, multiplication by the NTT as well as sampling of Gaussian distributed polynomials. By supporting an arbitrary number of internal registers (each can store one polynomial) realized in block RAMs and by reusing the data path of the NTT multiplier for other arithmetic operations we achieve high performance at low resource consumption.
The most important instructions supported by the processor are the iterative forward (NTT_NTT) as well as the backward transform (NTT_INTT) which take \({\approx }\frac{n}{2} \log _{2}{n}\) cycles. Other instructions are for example used for the bit-reversal step (NTT_REV), point-wise multiplication (NTT_PW_MUL), addition (ADD), or subtraction (SUB), each consuming \({\approx }n\) cycles. Note that the sampler and the I/O port are just treated as general purpose registers. Thus no specific I/O or sampling instructions are necessary and for example the MOV command can be used. Note also that the implementation of the NTT is performed in place and commands for the backward transformation (e.g., NTT_PW_MUL, or NTT_INTT) modify only register R1. Therefore, after a backward transform a value in R0 is still available.
Implementation of Ring-LWEEncrypt. For our implementation we used the medium and high security parameter sets as proposed in [22] which are specifically optimized for hardware. We further exploit the general characteristic of the NTT which allows us to “decompose” a multiplication into two forward transforms and one backward transform. If one operand is fixed or needed twice it is wise to directly store it in NTT representation to save subsequent transformations. In Fig. 4 the modified algorithm is given which is more efficient since the public constant \(a\) as well as the public and private keys \(p\) and \(r_{2}\) are stored in NTT representation.
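The decomposition can be modelled in software as follows. This is a textbook-style sketch and not the data path of the processor: all function names are hypothetical, the transform is a standard iterative radix-2 NTT with an explicit bit-reversal step, and reduction modulo \(x^n+1\) is obtained by scaling the coefficients with powers of a \(2n\)-th root of unity \(\psi \) before the forward and after the backward transform. A schoolbook multiplier is included only for cross-checking.

```python
def find_root(order, q):
    """Element of Z_q (q prime) with multiplicative order `order`, a power of two."""
    for g in range(2, q):
        cand = pow(g, (q - 1) // order, q)
        if pow(cand, order // 2, q) == q - 1:  # cand^(order/2) = -1 certifies the order
            return cand
    raise ValueError("no root of the requested order")

def bit_reverse(a):
    """Permute a so that index i moves to the bit-reversed index."""
    bits = len(a).bit_length() - 1
    out = [0] * len(a)
    for i, x in enumerate(a):
        out[int(format(i, "0{}b".format(bits))[::-1], 2)] = x
    return out

def transform(a, omega, q):
    """Iterative radix-2 NTT; omega is a primitive len(a)-th root of unity mod q."""
    a = bit_reverse(a)
    n = len(a)
    length = 2
    while length <= n:
        w_len = pow(omega, n // length, q)
        for start in range(0, n, length):
            w = 1
            for j in range(length // 2):
                u = a[start + j]
                v = a[start + j + length // 2] * w % q
                a[start + j] = (u + v) % q
                a[start + j + length // 2] = (u - v) % q
                w = w * w_len % q
        length *= 2
    return a

def to_ntt(a, psi, q):
    """Forward transform with pre-scaling by psi^i (negative wrap-around)."""
    scaled = [x * pow(psi, i, q) % q for i, x in enumerate(a)]
    return transform(scaled, psi * psi % q, q)

def from_ntt(a_t, psi, q):
    """Backward transform, including the factors 1/n and psi^(-i)."""
    n = len(a_t)
    omega_inv = pow(psi * psi % q, q - 2, q)
    psi_inv = pow(psi, q - 2, q)
    n_inv = pow(n, q - 2, q)
    a = transform(a_t, omega_inv, q)
    return [x * n_inv * pow(psi_inv, i, q) % q for i, x in enumerate(a)]

def pointwise_mul(a_t, b_t, q):
    return [x * y % q for x, y in zip(a_t, b_t)]

def schoolbook_mul(a, b, q):
    """Reference product in Z_q[x]/(x^n + 1)."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1
            res[(i + j) % n] = (res[(i + j) % n] + sign * a[i] * b[j]) % q
    return res
```

With this model, a fixed operand such as \(a\) is transformed once with `to_ntt` and kept in that form. Every later product then costs one forward transform of the fresh operand, one point-wise multiplication, and one backward transform.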
The top-level module (LWEenc) in Fig. 3 instantiates the ideal lattice processor and uses a block RAM as external interface to export or import ciphertexts \(c_1,c_2\), keys \(r_2,p\) or messages \(m\) with straightforward clock domain separation (see again Fig. 3). The processor is controlled by a finite state machine (FSM) issuing commands to the lattice processor to perform encryption, decryption, key import or key generation. It is configured with three general purpose registers R4-R6 in order to permanently store the public key \(p\), the global constant \(a\) and the private key \(r_2\). More registers for several key pairs are also supported but optional. The implementation supports pre-initialization of registers so that all constant values and keys can be directly included in the (encrypted) bitstream. Note that, for encryption, the core operates similarly to a stream cipher as \(c_{1}\) and \(c_2\) can be computed independently of the message, which is then only added in the last step (e.g., comparable to the XOR operation used within stream ciphers).
4 Results and Performance
For performance analysis we primarily focus on Virtex-6 platforms (speed grade -2) but would also like to emphasize that our solution can be efficiently implemented even on a small and low-cost Spartan-6 FPGA. All results were obtained after post-place-and-route (Post-PAR) with Xilinx ISE 14.2.
4.1 Gaussian Sampling
Performance, resource consumption, and quality of the core part (shaded grey in Fig. 2) of the Gaussian sampler on a Virtex-6 LX75T (Post-PAR). The entry rnd denotes the number of random bits used to sample one value.

| Setup | s | Max s | rnd | Slices | LUT/FF | MHz | Stat. distance |
|---|---|---|---|---|---|---|---|
| I | 11.32 | 23 | 25 | 42 | 136/5 | 115 | \({<}2^{-22}\) |
| II | 12.18 | 25 | 25 | 46 | 149/5 | 118 | \({<}2^{-22}\) |
| III | 11.32 | 48 | 85 | 231 | 863/6 | 61 | \({<}2^{-80}\) |
| IV | 12.18 | 51 | 85 | 255 | 911/6 | 61 | \({<}2^{-80}\) |
| V | 11.32 | 53 | 105 | 314 | 1157/6 | 58 | \({<}2^{-100}\) |
| VI | 12.18 | 57 | 105 | 342 | 1248/6 | 50 | \({<}2^{-100}\) |
4.2 Performance of Ring-LWEEncrypt
Resource consumption and performance of the combined key generation, encryption and decryption engine for the two different security levels on a Virtex-6 LX75T (Post-PAR). The public key requires \(n\log _2{q}\) bits (when stored in NTT representation), the private key \(n\log _2{q}\) bits and the ciphertext \(2n\log _2{q}\) bits.

Table 4 compares the results achieved in this work with the implementation by Göttert et al. [22] as well as other relevant asymmetric schemes and also adds performance figures for a Spartan-6 instantiation. Note that a detailed comparison with [22] is unfair due to inaccuracies of synthesis results (the Virtex-6 LX240T FPGA used in [22] was over-mapped so that the subsequent place-and-route (PAR) step providing final results could not be performed). Figures for clock frequency, overall slice consumption, and cycle counts for individual operations or the whole encryption block are thus not given in [22]. We therefore can only refer to numbers providing the resource consumption of registers and LUT usage. For a rough comparison we apply the throughput to area (T/A) metric and define area equivalent to the usage of LUTs due to the restriction mentioned above. It turns out that our implementation for \(n=256\) is 32 times smaller regarding key generation, \(65\) times smaller for encryption and 27 times smaller for decryption, at a loss of a factor of about \(2\) and \(3.3\) in performance. When employing the \(\frac{\text {Bit/s}}{\text {LUT}}\) metric for medium security encryption we achieve \(\frac{9.77\cdot 10^6 \text {Bits}}{4549 \text {LUTs}}=2147\) while the work presented in [22] gives \(\frac{31.8\cdot 10^6 \text {Bits}}{298016 \text {LUTs}}=106\). This results in an improvement by a factor of roughly 20.^{7}
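The arithmetic behind this comparison can be retraced with a few lines. The helper names are hypothetical, and the latencies (26.19 µs and 8.05 µs) and LUT counts (4549 and 298016) are the encryption figures for \(n=256\) quoted in this section, with 256 message bits assumed per encryption.

```python
def bits_per_second(message_bits, latency_us):
    """Throughput when one block of message_bits is processed every latency_us microseconds."""
    return message_bits / (latency_us * 1e-6)

def throughput_per_lut(message_bits, latency_us, luts):
    """The Bit/s per LUT metric used for the rough comparison."""
    return bits_per_second(message_bits, latency_us) / luts

if __name__ == "__main__":
    ours = throughput_per_lut(256, 26.19, 4549)
    prior = throughput_per_lut(256, 8.05, 298016)
    print(ours, prior, ours / prior)
```

Small deviations from the values in the text are expected because the text rounds the throughput before dividing by the LUT count.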
Table 4. Performance comparison of our proposal with other public-key encryption schemes (\({\approx }80\) to 128 bit) comparable to the medium security (\(n=256, q=7681, s=11.31\)) parameter set which is capable of transferring \(256\)-bit messages. Our implementation is versatile enough to perform encryption, decryption and key generation in a single core. Figures denoted with an asterisk (*) are less accurate results obtained from synthesis due to extensive over-mapping of resources.

| Scheme | Device | Resources | Speed |
|---|---|---|---|
| Our work [Gen/Enc/Dec] (n=256) | S6LX16 @160 MHz | 4121 LUT/3513 FF/14 BRAM(8K)/1 DSP48 | 45.22 µs / 42.88 µs / 27.51 µs |
| Our work [Gen/Enc/Dec] (n=256) | V6LX75T @262 MHz | 4549 LUT/3624 FF/12 BRAM(18K)/1 DSP48 | 27.61 µs / 26.19 µs / 16.80 µs |
| Ring-LWEEncrypt [Gen] (n=256) [22] | V6LX240T | 146718 LUT/82463 FF |  |
| Ring-LWEEncrypt [Enc] (n=256) [22] | V6LX240T | 298016 LUT/143396 FF | 8.05 µs* |
| Ring-LWEEncrypt [Dec] (n=256) [22] | V6LX240T | 124158 LUT/65174 FF | 8.10 µs |
| Niederreiter [Enc] [27] | V6LX240T | 888 LUT/875 FF/17 BRAM | 0.66 µs |
| Niederreiter [Dec] [27] | V6LX240T | 9409 LUT/12861 FF/12 BRAM | 57.78 µs |
| NTRU [Enc/Dec] [30] | XCV1600E | 27292 LUT/5160 FF | 1.54 µs / 1.41 µs |
| 1024-bit mod. Exp. [50] | XC4VFX12 | 3937 SLICE/17 DSP48 | 1.71 ms |
| ECC-P224 [24] | XC4VFX12 | 1825 LUT/1892 FF/26 DSP48/11 BRAM | 365.1 µs |
| ECC-B233 [47] | XC5VLX85T | 18097 LUT/5644 SLICE | 12.3 µs |
4.3 Constant Time Operation
Side-channel attacks are a problem for all physical implementations [39]. A simple target for a side-channel attack is the use of timing information of the security algorithm by measuring execution time or cycles. Our implementation of Ring-LWEEncrypt is fully pipelined and has no data-dependent operations. The processor core does not support any branches and Gaussian sampling based on the inverse transform operates in constant time. In summary, all cryptographic operations of our core are timing-invariant.
5 Conclusions and Future Work
In this work we presented a novel implementation of the ideal lattice-based Ring-LWE encryption scheme that fits even on a low-cost Spartan-6 FPGA. According to our findings, we improved the results obtained in the previous work of [22] by at least an order of magnitude using the same FPGA platform and far fewer resources.
Future work can combine our hardware engine with error correction facilities and CCA2 conversion. Additionally, countermeasures against further side-channel and fault-injection attacks need to be considered. As we intend to make our implementation publicly available, our work also offers the chance for third-party side-channel evaluation and cryptanalysis (e.g., exploiting the concrete implementation of the Gaussian sampler). Since our processor could also be utilized by other lattice-based cryptosystems, the provably secure NTRU variant presented in [49] can be another target for implementation. Moreover, a recent proposal of a lattice-based signature scheme by Ducas et al. [14] uses exactly the same parameters (\(n=512,q=12289\)) as Ring-LWEEncrypt and is thus a natural target for implementation based on our microcode engine.
Footnotes
 1.
The authors report that the utilization of LUTs required for LWE encryption exceeds the number of available LUTs on a Virtex-6 LX240T by 197 % and 410 % for parameters \(n=256\) and \(n=512\), respectively. Note that the Virtex-6 LX240T is a very expensive (above €1000 as of August 2013) and large FPGA.
 2.
For example, the parameters used for implementation in [22] result in a ciphertext expansion by a factor of 26.
 3.
See our web page at http://www.sha.rub.de/research/projects/lattice/
 4.
Note that this is the definition of Ring-LWE in Hermite normal form where the secret \(s\) is sampled from the noise distribution \(D_{\sigma }\) instead of uniformly at random [37].
 5.
 6.
Generation of true random numbers is not in the scope of this work; we refer to the survey by Varchola [52] for how to achieve this.
 7.
For this comparison we assumed that for each encryption 256 bits are transmitted.
References
 1.Agrawal, S., Boneh, D., Boyen, X.: Efficient lattice (H)IBE in the standard model. In: Gilbert [21], pp. 553–572Google Scholar
 2.Ajtai, M.: Generating hard instances of lattice problems. In: Proceedings of the TwentyEighth Annual ACM Symposium on Theory of Computing, pp. 99–108. ACM (1996)Google Scholar
 3.Albrecht, M., Cid, C., Faugère, J.C., Fitzpatrick, R., Perret, L.: On the complexity of BKW algorithm against LWE. In: SCC’12: Proceedings of the 3nd International Conference on Symbolic Computation and Cryptography, CastroUrdiales, July 2012, pp. 100–107 (2012)Google Scholar
 4.Albrecht, M., Cid, C., Faugère, J.C., Fitzpatrick, R., Perret, L.: On the complexity of the AroraGe algorithm against LWE. In: SCC’12: Proceedings of the 3nd International Conference on Symbolic Computation and Cryptography, CastroUrdiales, July 2012, pp. 93–99 (2012)Google Scholar
 5.Atici, A.C., Batina, L., Fan, J., Verbauwhede, I., Örs, S.B.: Lowcost implementations of NTRU for pervasive security. In: ASAP, pp. 79–84. IEEE Computer Society (2008)Google Scholar
 6.Aysu, A., Patterson, C., Schaumont, P.: Lowcost and areaefficient FPGA implementations of latticebased cryptography. In: IEEE International Symposium on HardwareOriented Security and Trust (HOST), 2013. IEEE (2013, to appear)Google Scholar
 7.Bailey, D.V., Coffin, D., Elbirt, A., Silverman, J.H., Woodbury, A.D.: NTRU in constrained devices. In: Koç, Ç.K., Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 262–272. Springer, Heidelberg (2001) Google Scholar
 8.Bernstein, D.J., Lange, T.: eBACS: ECRYPT benchmarking of cryptographic systems. http://bench.cr.yp.to. Accessed 10 May 2013
 9.Bos, J.W., Lauter, K., Loftus, J., Naehrig, M.: Improved security for a ringbased fully homomorphic encryption scheme. IACR Cryptol. ePrint Arch. 2013, 75 (2013)Google Scholar
 10.Brakerski, Z.: Fully homomorphic encryption without modulus switching from classical GapSVP. In: SafaviNaini, R., Canetti, R. (eds.) CRYPTO 2012. LNCS, vol. 7417, pp. 868–886. Springer, Heidelberg (2012) CrossRefGoogle Scholar
 11.Brakerski, Z., Langlois, A., Peikert, C., Regev, O., Stehlé, D.: Classical hardness of learning with errors. In: Boneh, D., Roughgarden, T., Feigenbaum, J. (eds.) STOC, pp. 575–584. ACM (2013)Google Scholar
 12.Buchmann, J., May, A., Vollmer, U.: Perspectives for cryptographic longterm security. Commun. ACM 49(9), 50–55 (2006)CrossRefGoogle Scholar
 13.Canetti, R., Garay, J.A. (eds.): CRYPTO 2013, Part I. LNCS, vol. 8042. Springer, Heidelberg (2013)Google Scholar
 14.Ducas, L., Durmus, A., Lepoint, T., Lyubashevsky, V.: Lattice signatures and bimodal Gaussians. In: Canetti and Garay [13], pp. 40–56. Proceedings version of [15]Google Scholar
 15.Ducas, L., Durmus, A., Lepoint, T., Lyubashevsky, V.: Lattice signatures and bimodal Gaussians. IACR Cryptol. ePrint Arch. 2013, 383 (2013). (Full version of [14])Google Scholar
 16.Ducas, L., Nguyen, P.Q.: Faster Gaussian lattice sampling using lazy floatingpoint arithmetic. In: Wang and Sako [53], pp. 415–432Google Scholar
 17.Ducas, L., Nguyen, P.Q.: Learning a zonotope and more: cryptanalysis of NTRUSign countermeasures. In: Wang and Sako [53], pp. 433–450Google Scholar
 18.Galbraith, S.D., Dwarakanath, N.C.: Efficient sampling from discrete gaussians for latticebased cryptography on a constrained deviceGoogle Scholar
 19.Gentry, C.: Fully homomorphic encryption using ideal lattices. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, pp. 169–178. ACM (2009)Google Scholar
 20.Gentry, C., Peikert, C., Vaikuntanathan, V.: Trapdoors for hard lattices and new cryptographic constructions. In: Dwork, C. (ed.) STOC, pp. 197–206. ACM (2008)Google Scholar
 21. Gilbert, H. (ed.): EUROCRYPT 2010. LNCS, vol. 6110. Springer, Heidelberg (2010)
 22. Göttert, N., Feller, T., Schneider, M., Buchmann, J., Huss, S.: On the design of hardware building blocks for modern lattice-based encryption schemes. In: Prouff and Schaumont [46], pp. 512–529
 23. Güneysu, T., Lyubashevsky, V., Pöppelmann, T.: Practical lattice-based cryptography: a signature scheme for embedded systems. In: Prouff and Schaumont [46], pp. 530–547
 24. Güneysu, T., Paar, C.: Ultra high performance ECC over NIST primes on commercial FPGAs. In: Oswald, E., Rohatgi, P. (eds.) CHES 2008. LNCS, vol. 5154, pp. 62–78. Springer, Heidelberg (2008)
 25. Gutierrez, R., Torres, V., Valls, J.: Hardware architecture of a Gaussian noise generator based on the inversion method. IEEE Trans. Circ. Syst. 59-II(8), 501–505 (2012)
 26. Hermans, J., Vercauteren, F., Preneel, B.: Speed records for NTRU. In: Pieprzyk, J. (ed.) CT-RSA 2010. LNCS, vol. 5985, pp. 73–88. Springer, Heidelberg (2010)
 27. Heyse, S., Güneysu, T.: Towards one cycle per bit asymmetric encryption: code-based cryptography on reconfigurable hardware. In: Prouff and Schaumont [46], pp. 340–355
 28. Hirschhorn, P.S., Hoffstein, J., Howgrave-Graham, N., Whyte, W.: Choosing NTRUEncrypt parameters in light of combined lattice reduction and MITM approaches. In: Abdalla, M., Pointcheval, D., Fouque, P.-A., Vergnaud, D. (eds.) ACNS 2009. LNCS, vol. 5536, pp. 437–455. Springer, Heidelberg (2009)
 29. Hoffstein, J., Pipher, J., Silverman, J.H.: NTRU: a ring-based public key cryptosystem. In: Buhler, J.P. (ed.) ANTS 1998. LNCS, vol. 1423, pp. 267–288. Springer, Heidelberg (1998)
 30. Kamal, A.A., Youssef, A.M.: An FPGA implementation of the NTRUEncrypt cryptosystem. In: 2009 International Conference on Microelectronics (ICM), pp. 209–212. IEEE (2009)
 31. Lee, D.U., Luk, W., Villasenor, J.D., Zhang, G., Leong, P.H.W.: A hardware Gaussian noise generator using the Wallace method. IEEE Trans. Very Large Scale Integr. VLSI Syst. 13(8), 911–920 (2005)
 32. Lindner, R., Peikert, C.: Better key sizes (and attacks) for LWE-based encryption. In: Kiayias, A. (ed.) CT-RSA 2011. LNCS, vol. 6558, pp. 319–339. Springer, Heidelberg (2011)
 33. Lyubashevsky, V., Micciancio, D.: Generalized compact knapsacks are collision resistant. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 144–155. Springer, Heidelberg (2006)
 34. Lyubashevsky, V.: Lattice signatures without trapdoors. In: Pointcheval, D., Johansson, T. (eds.) EUROCRYPT 2012. LNCS, vol. 7237, pp. 738–755. Springer, Heidelberg (2012)
 35. Lyubashevsky, V., Micciancio, D., Peikert, C., Rosen, A.: SWIFFT: a modest proposal for FFT hashing. In: Nyberg, K. (ed.) FSE 2008. LNCS, vol. 5086, pp. 54–72. Springer, Heidelberg (2008)
 36. Lyubashevsky, V., Peikert, C., Regev, O.: On ideal lattices and learning with errors over rings. In: Gilbert [21], pp. 1–23. Proceedings version of [37]
 37. Lyubashevsky, V., Peikert, C., Regev, O.: On ideal lattices and learning with errors over rings. IACR Cryptol. ePrint Arch. 2012, 230 (2012). (Full version of [36])
 38. MacWilliams, F.J., Sloane, N.J.A.: The Theory of Error-Correcting Codes, vol. 16, 762 pp. Elsevier Science Publishers B.V., North-Holland (2006). ISBN: 0444851933
 39. Mangard, S., Oswald, E., Popp, T.: Power Analysis Attacks: Revealing the Secrets of Smart Cards (Advances in Information Security), 3rd edn. Springer, New York (2007)
 40. Micciancio, D.: Generalized compact knapsacks, cyclic lattices, and efficient one-way functions. Comput. Complex. 16(4), 365–411 (2007)
 41. Micciancio, D., Peikert, C.: Hardness of SIS and LWE with small parameters. In: Canetti and Garay [13], pp. 21–39
 42. Naehrig, M., Lauter, K., Vaikuntanathan, V.: Can homomorphic encryption be practical? In: Proceedings of the 3rd ACM Workshop on Cloud Computing Security Workshop, CCSW ’11, pp. 113–124. ACM, New York (2011)
 43. Nguyên, P.Q., Regev, O.: Learning a parallelepiped: cryptanalysis of GGH and NTRU signatures. In: Vaudenay, S. (ed.) EUROCRYPT 2006. LNCS, vol. 4004, pp. 271–288. Springer, Heidelberg (2006)
 44. Peikert, C.: An efficient and parallel Gaussian sampler for lattices. In: Rabin, T. (ed.) CRYPTO 2010. LNCS, vol. 6223, pp. 80–97. Springer, Heidelberg (2010)
 45. Pöppelmann, T., Güneysu, T.: Towards efficient arithmetic for lattice-based cryptography on reconfigurable hardware. In: Hevia, A., Neven, G. (eds.) LatinCrypt 2012. LNCS, vol. 7533, pp. 139–158. Springer, Heidelberg (2012)
 46. Prouff, E., Schaumont, P. (eds.): CHES 2012. LNCS, vol. 7428. Springer, Heidelberg (2012)
 47. Rebeiro, C., Roy, S.S., Mukhopadhyay, D.: Pushing the limits of high-speed GF(\(2^m\)) elliptic curve scalar multiplication on FPGAs. In: Prouff and Schaumont [46], pp. 494–511
 48. Regev, O.: On lattices, learning with errors, random linear codes, and cryptography. In: Gabow, H.N., Fagin, R. (eds.) STOC, pp. 84–93. ACM (2005)
 49. Stehlé, D., Steinfeld, R.: Making NTRU as secure as worst-case problems over ideal lattices. In: Paterson, K.G. (ed.) EUROCRYPT 2011. LNCS, vol. 6632, pp. 27–47. Springer, Heidelberg (2011)
 50. Suzuki, D.: How to maximize the potential of FPGA resources for modular exponentiation. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727, pp. 272–288. Springer, Heidelberg (2007)
 51. Thomas, D.B., Luk, W., Leong, P.H.W., Villasenor, J.D.: Gaussian random number generators. ACM Comput. Surv. 39(4), 11:1–11:38 (2007)
 52. Varchola, M.: FPGA based true random number generators for embedded cryptographic applications. Ph.D. thesis, Technical University of Kosice (2008)
 53. Wang, X., Sako, K. (eds.): ASIACRYPT 2012. LNCS, vol. 7658. Springer, Heidelberg (2012)
 54. Zhang, G., Leong, P.H.W., Lee, D.U., Villasenor, J.D., Cheung, R.C.C., Luk, W.: Ziggurat-based hardware Gaussian random number generator. In: International Conference on Field Programmable Logic and Applications, 2005, pp. 275–280 (2005)