Modeling and simulating the sample complexity of solving LWE using BKW-style algorithms

The Learning with Errors (LWE) problem receives much attention in cryptography, mainly due to its fundamental significance in post-quantum cryptography. Among its solving algorithms, the Blum-Kalai-Wasserman (BKW) algorithm, originally proposed for solving the Learning Parity with Noise (LPN) problem, performs well, especially for certain parameter settings with cryptographic importance. The BKW algorithm consists of two phases, the reduction phase and the solving phase. In this work, we study the performance of distinguishers used in the solving phase. We show that the Fast Fourier Transform (FFT) distinguisher from Eurocrypt’15 has the same sample complexity as the optimal distinguisher, when making the same number of hypotheses. We also show via simulation that it performs much better than previous theory predicts and develop a sample complexity model that matches the simulations better. We also introduce an improved, pruned version of the FFT distinguisher. Finally, we indicate, via extensive experiments, that the sample dependency due to both LF2 and sample amplification is limited.


Introduction
Post-quantum cryptography studies replacements for cryptographic primitives based on the factoring and discrete-logarithm problems, since both problems can be solved efficiently by a quantum computer [2]. Lattice-based cryptography is its main area. In the NIST Post-Quantum Cryptography Standardization [3], 5 out of 7 finalists and 2 out of 8 alternates are lattice-based.
The Learning with Errors (LWE) problem, introduced by Regev [4], is the major problem in lattice-based cryptography. Its average-case hardness can be based on the worst-case hardness of some standard lattice problems, which is of great interest in theoretical cryptography. The most famous of its many cryptographic applications is the design of Fully Homomorphic Encryption (FHE) schemes. Its binary counterpart, the Learning Parity with Noise problem (LPN), also plays a significant role in cryptography (see [5]), especially in lightweight cryptography for very constrained environments such as RFID tags and low-power devices.
The algorithms for solving LWE can be divided into lattice-based, algebraic, and combinatorial methods. The last class of algorithms all inherit from the famous Blum-Kalai-Wasserman (BKW) algorithm [6,7], and are the most relevant to our study. We refer interested readers to [8] for concrete complexity estimation for solving LWE instances, and to [9,10] for asymptotic complexity estimations.
The BKW-type algorithms include two phases: the reduction phase and the solving phase. The former consists of a series of operations, called BKW steps, that iteratively reduce the dimension of the problem at the cost of increasing its noise level. At the end of the reduction phase, the original LWE problem has been transformed into a new problem with a much smaller dimension. The new problem can be solved efficiently in the solving phase by a procedure called distinguishing.
One of the main challenges in understanding the precise performance of BKW variants for solving the LWE problem is the lack of extensive experimental studies, especially of the various distinguishers proposed for the solving phase. Firstly, many heuristics have been borrowed from BKW variants for the LPN problem, but they have been verified only roughly, or not at all, for the LWE problem. Secondly, the tightness of the nice theoretical bound in [11] on the sample complexity of the FFT distinguisher also needs to be checked experimentally. Lastly, a performance comparison of the different known distinguishers is still lacking.

Related work
The BKW algorithm proposed by Blum et al. [6,7] is the first sub-exponential algorithm for solving the LPN problem. Its initial distinguisher, an exhaustive search method in the binary field, recovers one bit of the secret by employing majority voting. Later, Levieil and Fouque [12] applied the fast Walsh-Hadamard transform (FWHT) technique to accelerate the distinguishing process and recovered a number of secret bits in one pass. They also proposed some heuristic versions and tested these assumptions by experiments. In [13] Kirchner proposed a secret-noise transform technique to change the secret distribution to be sparse. This technique is an application of the transform technique proposed in [14] for solving LWE. Bernstein and Lange [15] further instantiated an attack on the Ring-LPN problem, a variant of LPN with algebraic ring structures. In [16,17], Guo, Johansson, and Löndahl proposed a new distinguishing method called subspace hypothesis testing. Though this distinguisher can handle an instance with larger dimension by using covering codes, its inherent nature is still an FWHT distinguisher. Improvements of the BKW algorithm were further studied by Zhang et al. [18] and Bogos-Vaudenay [19]. An elaborate survey with experimental results on the BKW algorithm for solving LPN can be found in [20].
BKW for solving LWE follows a similar research line. Albrecht et al. initiated the study in [21]. In PKC 2014 [22], a new reduction technique called lazy modulus switching was proposed. In both works, the solving phase uses an exhaustive search approach. In [11] Duc et al. introduced the Fast Fourier Transform (FFT) technique in the distinguishing process and bounded the sample complexity theoretically from the Hoeffding inequality. Note that the actual performance regarding the bound is not experimentally verified and the information loss in the FFT distinguisher is unclear. There are new reduction methods in [23][24][25], and in [23], the authors also proposed a new method with polynomial reconstruction in the solving phase. This method has the same sample complexity as that of the exhaustive search approach but requires (q + 1) FFT operations rather than only one FFT in [11]. The BKW variants with memory constraints were recently studied in [26][27][28].

Contributions
In this paper, we compare the performance of the known distinguishers empirically. We investigate the performance of the optimal distinguisher and the FFT distinguisher. We also test the sample dependency introduced when using LF2 or sample amplification. Our contributions are the following.

Organization
The rest of the paper is organized as follows. Section 2 introduces some necessary background. In Section 3 we cover the basic BKW algorithm. Section 4 goes over distinguishers used for hypothesis testing when solving LWE using BKW and introduces the pruned FFT method. Next, in Section 5 we show why the FFT distinguisher and the optimal distinguisher perform identically for our setting, followed by simulation results in Section 6. In Section 7 we develop a new model for the sample complexity of the FFT distinguisher. Finally, Section 8 concludes the paper.

Background
Let us introduce some notation. Bold small letters denote vectors. Let 〈⋅,⋅〉 denote the scalar product of two vectors of the same dimension. By |x| we denote the absolute value of a real number x ∈ ℝ. For a complex number y ∈ ℂ, we denote by ℜ(y) its real part and by ‖y‖ its absolute value.

LWE
Let us define the LWE problem. Let n be a positive integer, q a modulus and χ an error distribution over ℤ_q. For a fixed secret vector s ∈ ℤ_q^n, an LWE sample is a pair

(a, b = 〈a, s〉 + e mod q),  (1)

where a is drawn uniformly at random from ℤ_q^n and e is drawn from χ. The (search) LWE problem is to recover s from a number of such samples.

Rounded Gaussian distribution
For the error we use the rounded Gaussian distribution. 1 Let f(x | 0, σ²) denote the PDF of the normal distribution with mean 0 and standard deviation σ, this distribution in turn being denoted N(0, σ²). The rounded Gaussian distribution samples from N(0, σ²), rounds to the nearest integer and wraps to the interval [−(q − 1)/2, (q − 1)/2]. In other words, the probability of choosing a certain error e ∈ [−(q − 1)/2, (q − 1)/2] is equal to

Pr[e] = Σ_{w ∈ ℤ} ∫_{e+wq−1/2}^{e+wq+1/2} f(x | 0, σ²) dx.  (2)

We denote this distribution by Ψ_{σ,q}. We use the well-known heuristic approximation that the sum of two independent random variables X₁ and X₂, drawn from Ψ_{σ₁,q} and Ψ_{σ₂,q}, is drawn from Ψ_{√(σ₁²+σ₂²),q}. We also use the notation α = σ/q. Finally, we let U(a, b) denote the discrete uniform distribution taking values from a up to b.
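As an illustration, here is a minimal Python sketch of sampling from Ψ_{σ,q}; the function name and parameter choices are ours, not from the paper.

```python
import random

def rounded_gaussian_sample(sigma, q, rng=random):
    """Draw from Psi_{sigma,q}: sample N(0, sigma^2), round to the
    nearest integer, and wrap into [-(q-1)/2, (q-1)/2]."""
    x = round(rng.gauss(0.0, sigma)) % q
    if x > (q - 1) // 2:
        x -= q  # map to the centered interval
    return x

q, alpha = 1601, 0.005
sigma = alpha * q  # sigma = 8.005, as in the paper's experiments
samples = [rounded_gaussian_sample(sigma, q) for _ in range(10000)]
```

Note that for σ ≪ q the wrapping almost never triggers, so the samples are essentially rounded draws from N(0, σ²).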

BKW
The BKW algorithm was originally invented to solve LPN. It was first used for LWE in [21]. The BKW algorithm consists of two parts, reduction and hypothesis testing.

Reduction
We sort the samples into categories based on the values in the first b positions of the a vector, so that two samples in the same category can be added or subtracted to cancel those positions. Given two samples (a₁, b₁) and (a₂, b₂) from the same category, we form a new vector a₁,₂ = a₁ ± a₂, which is zero in its first b positions. The corresponding b value is b₁,₂ = b₁ ± b₂. Now we have a new sample (a₁,₂, b₁,₂). The corresponding noise variable is e₁,₂ = e₁ ± e₂, with variance 2σ², where σ² is the variance of the original noise. By calculating a suitable number of new samples for each category we have reduced the dimensionality of the problem by b, but increased the noise variance to 2σ². If we repeat the reduction process t times we end up with a dimensionality of n − tb and a noise variance of 2^t ⋅ σ².

1 Also common is to use the Discrete Gaussian distribution, which is similar.
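A toy sketch of one such combination step (names and parameters are ours; a real implementation such as FBBL organizes samples by category first):

```python
import random

q, n, b = 1601, 6, 2  # toy parameters: cancel the first b positions

def combine(s1, s2, sign=-1):
    """Combine two samples (a1, b1), (a2, b2) from the same category
    into (a1 - a2, b1 - b2) mod q (or the sum, with sign=+1)."""
    (a1, b1), (a2, b2) = s1, s2
    a = [(x + sign * y) % q for x, y in zip(a1, a2)]
    return a, (b1 + sign * b2) % q

rng = random.Random(0)
a1 = [rng.randrange(q) for _ in range(n)]
a2 = a1[:b] + [rng.randrange(q) for _ in range(n - b)]  # same category
new_a, new_b = combine((a1, 17), (a2, 5))
# new_a is zero in its first b positions; the noise of new_b is doubled.
```

The combined sample can then be stored with its first b (zero) positions dropped, which is what reduces the dimension.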

LF1 and LF2
LF1 and LF2 are two implementation tricks originally proposed for solving LPN in [12]. Both can naturally be generalized for solving LWE.
In LF1 we choose one representative per category. We form new samples by combining the other samples with the representative. This way all samples at the hypothesis testing stage are independent of each other. However, the sample size shrinks by (q b − 1)/2 samples per generation, requiring a large initial sample size.
In LF2 we allow combining any pair of samples within a category, creating many more samples. If we form every possible sample, a sample size of 3(q^b − 1)/2 is enough to keep the sample size constant between steps. The disadvantage of this approach is that the samples are no longer independent, leading to higher noise levels in the hypothesis testing stage of BKW. It is generally assumed that this effect is quite small. This assumption is well tested for solving the LPN problem [12].

Sample amplification
Some versions of LWE limit the number of samples. We can get more samples using sample amplification. For example, by adding/subtracting triples of samples we can increase the initial sample size m up to a maximum of 4 ⋅ (m choose 3). This does increase the noise standard deviation by a factor of √3. It also leads to an increased dependency between samples in the hypothesis testing phase, similar in principle to LF2.
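A sketch of this amplification by triples (function and parameter names are ours):

```python
import random
from itertools import combinations

def amplify_triples(samples, q):
    """Form new samples a_i + s2*a_j + s3*a_k for all triples and all
    sign choices s2, s3 in {+1, -1}: up to 4 * C(m, 3) outputs (modulo
    an overall sign), each with noise std sqrt(3) times the original."""
    out = []
    for (a1, b1), (a2, b2), (a3, b3) in combinations(samples, 3):
        for s2 in (1, -1):
            for s3 in (1, -1):
                a = [(x + s2 * y + s3 * z) % q
                     for x, y, z in zip(a1, a2, a3)]
                out.append((a, (b1 + s2 * b2 + s3 * b3) % q))
    return out

rng = random.Random(1)
q, n, m = 1601, 2, 5
base = [([rng.randrange(q) for _ in range(n)], rng.randrange(q))
        for _ in range(m)]
amped = amplify_triples(base, q)  # 4 * C(5, 3) = 40 samples
```

In practice one would generate only as many of these combinations as needed, since the full set grows cubically in m.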

Secret-noise transformation
There is a transformation of the LWE problem that makes the distribution of the secret vector follow the distribution of the noise [13,14]. In short, one uses n samples whose coefficient matrix A₀ is invertible to change variables, after which the secret of the transformed problem is the noise vector of those n samples.
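To make the transformation concrete, here is a self-contained sketch under our own naming, assuming q prime; it uses plain Gauss-Jordan elimination mod q and checks the algebra on a toy instance.

```python
import random

def mat_inv_mod(A, q):
    """Invert an n x n matrix modulo a prime q by Gauss-Jordan
    elimination. Raises StopIteration if A is singular mod q."""
    n = len(A)
    M = [list(A[i]) + [int(i == j) for j in range(n)] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] % q)
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col] % q, -1, q)
        M[col] = [x * inv % q for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] % q:
                f = M[r][col]
                M[r] = [(x - f * y) % q for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

def transform_sample(A0inv, b0, a, b, q):
    """Given n base samples (A0, b0 = A0 s + e0) and a further sample
    (a, b), return (a', b') with b' = <a', e0> + e mod q, i.e. the
    secret of the transformed problem is the noise vector e0."""
    n = len(a)
    t = [sum(A0inv[i][j] * b0[j] for j in range(n)) % q
         for i in range(n)]                       # A0^{-1} b0 = s + A0^{-1} e0
    a_new = [-sum(A0inv[r][i] * a[r] for r in range(n)) % q
             for i in range(n)]                   # -(A0^{-1})^T a
    b_new = (b - sum(a[i] * t[i] for i in range(n))) % q
    return a_new, b_new

# Toy check of the algebra.
rng = random.Random(0)
q, n = 101, 3
s = [rng.randrange(q) for _ in range(n)]
while True:  # draw until A0 is invertible mod q
    A0 = [[rng.randrange(q) for _ in range(n)] for _ in range(n)]
    try:
        A0inv = mat_inv_mod(A0, q)
        break
    except StopIteration:
        pass
e0 = [rng.choice([-1, 0, 1]) for _ in range(n)]
b0 = [(sum(A0[i][j] * s[j] for j in range(n)) + e0[i]) % q for i in range(n)]
a = [rng.randrange(q) for _ in range(n)]
e = 1
bb = (sum(a[j] * s[j] for j in range(n)) + e) % q
a2, b2 = transform_sample(A0inv, b0, a, bb, q)
assert b2 == (sum(a2[j] * e0[j] for j in range(n)) + e) % q
```

The check verifies that b' ≡ 〈a', e₀〉 + e (mod q), so the new secret is indeed distributed as the noise.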

Improved reduction steps
There are many improvements of the plain BKW steps. Lazy modulus switching (LMS) was introduced in [22] and further developed in [24]. In [23] coded-BKW was introduced. Coded-BKW with sieving was introduced in [25] and improved in [10,29].
Since only the final noise level, not the type of steps, matters for the distinguishers, we only use plain steps in this paper.

Hypothesis testing
Assume that we have reduced all but k positions to 0, leaving k positions for the hypothesis testing phase. After the reduction phase we have samples of the form

z = 〈a, s〉 + e,  (3)

where e is (approximately) rounded Gaussian distributed with mean 0 and standard deviation σ_f = 2^{t/2} ⋅ σ. Now the problem is to distinguish the correct guess s = (s₁, s₂, …, s_k) from all the incorrect ones, among all q^k guesses. 2 For each guess ŝ we calculate the corresponding error terms in (3). For the correct guess the observed values of e are rounded Gaussian distributed, while for a wrong guess they are uniformly random. How to distinguish the right guess from all the wrong ones is explained in Section 4.

Distinguishers
For the hypothesis testing we study the optimal distinguisher, which is an exhaustive search method, and a faster method based on the Fast Fourier Transform.

Optimal distinguisher
Let D_ŝ denote the distribution of the e values for a given guess ŝ of the secret vector. As is shown in [30, Prop. 1], to optimally distinguish the hypothesis D_ŝ = U(0, q − 1) against D_ŝ = Ψ_{σ_f,q} we calculate the log-likelihood ratio

Σ_{e ∈ ℤ_q} N(e) log( Pr_{Ψ_{σ_f,q}}(e) / Pr_{U(0,q−1)}(e) ),  (4)

where N(e) denotes the number of times e occurs for the guess ŝ, σ_f denotes the standard deviation of the samples after the reduction phase and Pr_D(e) denotes the probability of drawing e from the distribution D. We choose the value ŝ that maximizes (4). The time complexity of this distinguisher is

O(m ⋅ q^k)  (5)

if we try all possible hypotheses. After performing the secret-noise transformation of Section 3.1.3 we can limit ourselves to assuming that the k values in s have an absolute value of at most d, reducing the complexity to

O(m ⋅ (2d + 1)^k).  (6)

By only testing the likely hypotheses we have a lower risk of choosing an incorrect one. 3 This trick of limiting the number of hypotheses can of course also be applied to the FFT method of Section 4.2, which we do in Section 4.4.
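A brute-force sketch of this distinguisher on a toy instance (parameters and names are ours; we approximate the Ψ_{σ_f,q} probabilities with a wrapped Gaussian):

```python
import math
import random
from itertools import product

def wrapped_gauss_pmf(sigma, q, wraps=10):
    """Approximate pmf of Psi_{sigma,q} by wrapping a continuous
    Gaussian and normalizing; good enough for scoring."""
    pmf = []
    for e in range(q):
        c = e if e <= (q - 1) // 2 else e - q  # centered representative
        pmf.append(sum(math.exp(-((c + w * q) ** 2) / (2 * sigma ** 2))
                       for w in range(-wraps, wraps + 1)))
    s = sum(pmf)
    return [p / s for p in pmf]

def optimal_distinguisher(samples, q, k, sigma_f):
    """Exhaustive log-likelihood-ratio test over all q^k hypotheses:
    score each guess by sum_j log(q * Pr_Psi(e_j))."""
    pmf = wrapped_gauss_pmf(sigma_f, q)
    best, best_score = None, -math.inf
    for guess in product(range(q), repeat=k):
        score = 0.0
        for a, b in samples:
            e = (b - sum(x * y for x, y in zip(a, guess))) % q
            score += math.log(q * pmf[e])
        if score > best_score:
            best, best_score = list(guess), score
    return best

# Toy instance: k = 2 remaining positions, modest final noise.
rng = random.Random(2)
q, k, sigma_f = 31, 2, 2.0
s = [rng.randrange(q) for _ in range(k)]
samples = []
for _ in range(200):
    a = [rng.randrange(q) for _ in range(k)]
    e = round(rng.gauss(0, sigma_f)) % q
    samples.append((a, (sum(x * y for x, y in zip(a, s)) + e) % q))
recovered = optimal_distinguisher(samples, q, k, sigma_f)
```

With this much bias the correct guess wins by a wide margin; the point of the sketch is the scoring rule, not the parameter sizes.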

Fast Fourier Transform method
For LWE, the idea of using a transform to speed up the distinguishing was introduced in [11]. Consider the function

f(x) = Σ_{j=1}^{m} 1_{a_j}(x) ⋅ θ_q^{b_j},  (7)

where x ∈ ℤ_q^k, 1_{a_j}(x) is equal to 1 if and only if x = a_j and 0 otherwise, and θ_q = e^{2πi/q} denotes the q-th root of unity. The idea of the FFT distinguisher is to calculate the FFT of f, that is

f̂(α) = Σ_{x ∈ ℤ_q^k} f(x) ⋅ θ_q^{−〈x,α〉}.  (8)

Given enough samples compared to the noise level, the correct guess α = s maximizes ℜ(f̂(α)) in (8).
The time complexity of the FFT distinguisher is

O(m + k ⋅ q^k ⋅ log q).  (9)

In general this complexity is much lower than the one in (5). However, the advantage does depend on the sparsity of the secret s. For a binary s, the exhaustive methods are better.
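A small sketch of this distinguisher (our own toy parameters), building f on the grid ℤ_q^k and taking a k-dimensional FFT; note the memory cost of materializing the full q^k grid:

```python
import numpy as np

def fft_distinguisher(samples, q, k):
    """Build f(x) = sum_j 1{x = a_j} * theta^{b_j} on the grid Z_q^k
    and take a k-dimensional FFT. With numpy's sign convention,
    fhat[alpha] = sum_x f(x) * exp(-2*pi*i*<x, alpha>/q), so the entry
    at the correct secret accumulates theta^{e_j} terms, giving a
    large real part."""
    theta = np.exp(2j * np.pi / q)
    f = np.zeros((q,) * k, dtype=complex)
    for a, b in samples:
        f[tuple(a)] += theta ** b
    fhat = np.fft.fftn(f)
    idx = np.unravel_index(np.argmax(fhat.real), fhat.shape)
    return [int(i) for i in idx]

# Toy instance: k = 2 remaining positions, modest final noise.
rng = np.random.default_rng(3)
q, k, sigma_f = 31, 2, 2.0
s = rng.integers(0, q, size=k)
A = rng.integers(0, q, size=(200, k))
e = np.rint(rng.normal(0, sigma_f, size=200)).astype(int)
b = (A @ s + e) % q
recovered = fft_distinguisher(list(zip(A.tolist(), b.tolist())), q, k)
```

One FFT over the whole grid scores all q^k hypotheses at once, which is where the speedup over the exhaustive method comes from.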
From [11, Thm. 16] we have the following (upper limit) formula for the sample complexity of the FFT distinguisher,

m > 8 ⋅ ln(q^k / ε) ⋅ e^{4π² σ_f² / q²},  (10)

where ε is the probability of guessing s incorrectly. Notice that the expression is slightly modified to fit our notation and that a minor error in the formula is corrected. 4

Polynomial reconstruction method
In [23], a method combining exhaustive search and the FFT was introduced. It is information-theoretically optimal as a distinguisher, while being computationally more efficient than the optimal distinguisher of Section 4.1. However, its complexity is roughly a factor q higher than that of the FFT distinguisher.
3 As long as the correct one is among our hypotheses.
4 Using our notation, k should be within the logarithm and not a factor in front of it as in [11].

Pruned FFT distinguisher
Also when using an FFT distinguisher we can limit the number of hypotheses. We then only need a small subset of the output values of the FFT in (8), so we can speed up the calculations using a pruned FFT. In general, if we only need K out of all N output values, the time complexity for calculating the FFT improves from O(N log N) to O(N log K) [31]. Limiting the magnitude of the guesses of the last k positions of s to d, this changes the time complexity from (9) to

O(m + k ⋅ q^k ⋅ log(2d + 1)).  (11)

More importantly, this method reduces the sample complexity. In the formula for the sample complexity (10), the numerator q^k corresponds to the number of values that s can take on the last k positions. Re-doing the proofs of [11, Thm. 16], limiting the magnitude of the guess in each position to d, we get (12), which is (10) with the numerator q^k replaced by (2d + 1)^k. This reduced sample complexity comes at no extra cost.
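Since the noise-dependent factor is the same in (10) and (12), the predicted gain of pruning reduces to a ratio of logarithms. A small helper (ours) for the setting of Section 6.1, where we assume d = ⌈3σ⌉ = 25 for σ = 8.005:

```python
import math

def pruned_gain(q, k, d, eps=0.5):
    """Predicted ratio between the sample complexities when testing
    all q^k hypotheses versus only the (2d+1)^k with |s_i| <= d: the
    noise factor cancels, leaving log(q^k/eps) / log((2d+1)^k/eps)."""
    return math.log(q ** k / eps) / math.log((2 * d + 1) ** k / eps)

gain = pruned_gain(q=1601, k=2, d=25)  # guessing |s_i| <= 3*sigma, rounded up
```

For these numbers the predicted gain is a factor of roughly 1.8, and it grows with q since the pruned hypothesis set does not.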

Equal performance of optimal and FFT distinguishers
When starting to run simulations, we noticed that the FFT distinguisher and the optimal distinguisher performed identically, in terms of the number of samples needed to correctly guess the secret. We give a brief explanation of this phenomenon. 5 Consider a sample of the form (3). By making a guess ŝ we calculate the corresponding error terms ê_j = z_j − 〈a_j, ŝ〉. The Fourier transform of the FFT distinguisher in (8) can now be written as

f̂(ŝ) = Σ_{j=1}^{m} θ_q^{ê_j}.  (13)

The real part of (13) is equal to

ℜ(f̂(ŝ)) = Σ_{j=1}^{m} cos(2π ê_j / q).  (14)
The FFT distinguisher picks the guess that maximizes (14). Now, let us rewrite (4) for the optimal distinguisher as

Σ_{j=1}^{m} log( q ⋅ Pr_{Ψ_{σ_f,q}}(ê_j) ).  (15)

It turns out that with increasing noise level, the terms in (15) can be approximated as cosine functions with a period of q, as illustrated in Fig. 1. The terms correspond to q = 1601, starting with rounded Gaussian noise with α = 0.005, σ = α ⋅ q = 8.005 and taking 12 or 13 steps of plain BKW respectively. Notice that the approximation gets drastically better with increasing noise level. 6 The 13-step picture corresponds to the setting used in most of the experiments in Section 6. For a large-scale problem, the noise level would of course be much larger, resulting in an even better cosine approximation. Since both distinguishers pick the ŝ that maximizes a sum of cosine functions with the same period, they will pick the same ŝ, hence they will perform identically.
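The quality of the cosine approximation can be checked numerically. The following sketch (toy parameters, names ours) correlates the log-probabilities of a wrapped Gaussian, standing in for the terms of the log-likelihood sum, with a cosine of period q:

```python
import math

def wrapped_gauss_pmf(sigma, q, wraps=20):
    """Wrapped Gaussian pmf on [-(q-1)/2, (q-1)/2], via a truncated
    sum over wraps; adequate for illustrating the shape."""
    support = range(-(q - 1) // 2, (q + 1) // 2)
    p = [sum(math.exp(-((e + w * q) ** 2) / (2 * sigma ** 2))
             for w in range(-wraps, wraps + 1)) for e in support]
    s = sum(p)
    return [x / s for x in p]

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

q, alpha, t = 101, 0.08, 6          # toy numbers; sigma_f comparable to q
sigma_f = 2 ** (t / 2) * alpha * q  # final noise after t reduction steps
logp = [math.log(x) for x in wrapped_gauss_pmf(sigma_f, q)]
cosine = [math.cos(2 * math.pi * e / q)
          for e in range(-(q - 1) // 2, (q + 1) // 2)]
r = correlation(logp, cosine)       # close to 1 for large noise
```

With σ_f well above q/2 the higher harmonics of the wrapped distribution are negligible, so the correlation is essentially 1.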
There are two immediate effects of this finding.
-The polynomial reconstruction method is obsolete.
-Unless the secret is very sparse, the FFT distinguisher is strictly better than the optimal distinguisher, since it is computationally cheaper.
Hence, from Section 6 onward we limit our investigation to the FFT distinguisher. We do not make any claims about the equivalence of the sample complexity of the two distinguishers outside of our context of solving LWE using BKW, when having large rounded (or Discrete) Gaussian noise. 7

Simulations and results
This section covers the simulations we ran, using the FBBL library [32] from [33], and the results they yielded. For all figures, each point corresponds to running plain BKW plus distinguishing at least 30 times. For most points we ran slightly more iterations. See Appendix for details on the number of iterations for all the points. We chose our parameters inspired by the Darmstadt LWE Challenge [34].
The challenges are a set of (search) LWE instances used to compare LWE-solving methods. Each instance consists of the dimension n, the modulus q ≈ n², the relative error size α and m ≈ n² equations of the form (1). Our simulations mostly use parameters inspired by the LWE challenges. We mostly let q = 1601 (corresponding to n = 40) and vary α to get problem instances that require a suitable number of samples for simulating hypothesis testing. The records for the LWE challenges are set using lattice sieving [35].

6 Also notice that the approximation is not necessarily the best cosine approximation. It is simply the approximation that matches the largest and the smallest value of the curve.
7 Although it could be interesting to investigate.

Varying noise level
In the upper part of Fig. 2 we compare the theoretical sample complexity from (10) with simulation results from an implementation of the FFT distinguisher of [11] and our pruned FFT distinguisher. The latter distinguisher guesses values of absolute value up to 3σ, rounded upwards. The simulated points are the median values of our simulations and the theoretical values correspond to setting ε = 0.5 in (10). We use q = 1601 and n = 28, and we take t = 13 steps of plain BKW, reducing 2 positions per step. Finally we guess the last 2 positions and measure the minimum number of samples needed to correctly guess the secret. We vary α between 0.005 and 0.006. We use LF1 to guarantee that the samples are independent.
We notice that there is a gap of roughly a factor 10 between theory and simulation. More exactly, the gap is a factor [10.8277, 8.6816, 10.1037, 8.6776, 10.5218, 10.1564] for the six points, counting in increasing order of noise level.
We also see a gap between the FFT distinguisher and the pruned FFT distinguisher. We can estimate the gap by comparing (12) and (10). Counting in increasing level of noise, by theory we expect the gap for the six points to be [

Varying q
In the lower part of Fig. 2 we instead vary q, taking t = [5, 7, 9, 11, 13, 15] steps of plain BKW for the respective instances. Thereby the final noise level and the original s vectors have almost the same distribution, making the q values the only varying factor. We use LF1 to guarantee that the samples are independent. Notice that the number of samples needed to guess the secret is roughly an order of magnitude lower than theory predicts; counting in increasing order of q, the gain is a factor [11.4537, 10.6112, 9.2315, 10.4473, 9.5561, 9.7822] for the six points.
Also notice that the pruned version is an improvement that increases with q. This is because the total number of hypotheses divided by the number of hypotheses we make increases with q. This gap can be estimated by comparing (12) and (10).

LF1 vs LF2
We investigate the increased number of samples needed due to dependencies, when using LF2. For LF2, depending on the number of samples needed for guessing, we used either the minimum number of samples to produce a new generation of the same size or a sample size roughly equal to the size needed for guessing at the end. To test the limit of LF2 we made sure to produce every possible sample from each category. See the upper part of Fig. 3 for details. The setting is the same as in Section 6.1. We only use the pruned FFT distinguisher. Notice that the performance is almost exactly the same in both the LF1 and the LF2 cases, as is generally assumed [12].

Sample amplification
The lower part of Fig. 3 shows the increased number of samples needed due to sample amplification. We use q = 1601 and 1600 initial samples. We form new samples by combining triples of samples to get a large enough sample size. We vary the noise level between α = 0.005/√3 and α = 0.006/√3. We take 13 steps of plain BKW, reducing 2 positions per step. Finally we guess the last 2 positions and measure the minimum number of samples needed to guess correctly. We use LF1 and we compare the results against starting with as many samples as we want and noise levels between α = 0.005 and α = 0.006, both tricks to isolate the dependency due to sample amplification. We only use the pruned FFT distinguisher. The difference between the points is small, implying that the dependency due to sample amplification is limited.

A new model for the sample complexity

Consider the real part of (13) for an incorrect guess. The sum is sampled from

Σ_{j=1}^{m} cos(2π U_j / q),  (16)

where U_j ∼ U(0, q − 1). The expected value of (16) is 0. Let us denote the variance of each term of (16) by σ_U². For the correct guess the real part of (13) is equal to

Σ_{j=1}^{m} cos(2π E_j / q),  (17)

where E_j is the sum of 2^t independent terms e_j ∼ Ψ_{σ,q}. Numerically we can calculate the mean and variance of each term of (17) with arbitrary precision. Denote these by μ_E and σ_E².
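These moments can be computed numerically, for instance as follows (a sketch with our own simplification: we use a wrapped Gaussian with the final standard deviation σ_f = 2^{t/2}σ in place of the exact 2^t-fold convolution):

```python
import math

def wrapped_gauss_pmf(sigma, q, wraps=20):
    """Wrapped Gaussian pmf on [-(q-1)/2, (q-1)/2]."""
    support = range(-(q - 1) // 2, (q + 1) // 2)
    p = [sum(math.exp(-((e + w * q) ** 2) / (2 * sigma ** 2))
             for w in range(-wraps, wraps + 1)) for e in support]
    s = sum(p)
    return [x / s for x in p]

q, alpha, t = 1601, 0.005, 13
sigma_f = 2 ** (t / 2) * alpha * q      # std of the final noise E_j
pmf = wrapped_gauss_pmf(sigma_f, q)
cos_vals = [math.cos(2 * math.pi * e / q)
            for e in range(-(q - 1) // 2, (q + 1) // 2)]

# Moments of a term for the correct guess:
mu_E = sum(p * c for p, c in zip(pmf, cos_vals))
var_E = sum(p * (c - mu_E) ** 2 for p, c in zip(pmf, cos_vals))

# Moments of a term for an incorrect guess (uniform U_j):
mu_U = sum(cos_vals) / q                             # = 0
var_U = sum(c * c for c in cos_vals) / q - mu_U ** 2  # = 1/2
```

For a full period, the uniform terms have mean exactly 0 and variance exactly 1/2, while μ_E is small but positive and σ_E² is just below 1/2; the distinguishing advantage comes from the small positive mean μ_E.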

Comparing the new model to previous theory
Here we discuss what predictions the new model makes compared to the old model, for parameter settings beyond what we can simulate. First we use the same setting as in Fig. 2, but let α increase a bit further. See Fig. 5. We see that the predictions differ by a factor of roughly 10 when testing all hypotheses and roughly 11 when limiting the number of hypotheses, almost independently of α. This is the behavior we observed in simulation too. Next, let us look at what happens when we vary ε. Fix q = 1601, α = 0.01 and vary ε between 0.05 and 0.95. See Fig. 6. Notice for both settings that the gap between the old and the new theory increases with ε. The "constant" gap of a factor around 10 that we observed in simulations is simply the gap for ε = 0.5.
Next we look at what happens when we vary both q and ε. We vary ε between 0.05 and 0.95 and vary q between 401 and 6401.
We plot the estimated sample complexity of the old model divided by the estimated sample complexity of the new model in Figs. 7, 8, 9, and 10. In Figs. 7 and 8 we estimate the sample complexity when testing all hypotheses, while in Figs. 9 and 10 we limit the number of hypotheses. In Figs. 7 and 9 we start at a high noise level using α = 0.01, while in Figs. 8 and 10 we start at a low noise level using α = 0.005. In all settings we see that the gap increases with ε. We also see for large values of ε that the gap is larger for the smaller values of q, especially when using a limited number of hypotheses.

Probability of the correct guess being top r
We get a slightly different model if we only require that the correct hypothesis is among the r top candidates. Let h denote the number of incorrect guesses, each with CDF F_X, and sort and re-label them in increasing order as X₍₁₎ ≤ X₍₂₎ ≤ ⋯ ≤ X₍h₎. With this notation the k-th smallest value has the CDF

F_{X₍k₎}(x) = Σ_{j=k}^{h} (h choose j) F_X(x)^j (1 − F_X(x))^{h−j};  (18)

see [36] for details. The r-th largest value, that is X₍h−r+1₎, has the CDF

F_{X₍h−r+1₎}(x) = Σ_{j=h−r+1}^{h} (h choose j) F_X(x)^j (1 − F_X(x))^{h−j}.  (19)

When calculating this sum, it is useful to know that it is the complement probability of the CDF of a Bin(h, F_X(x)) distribution, evaluated at h − r.
Next, the probability of Y being larger than the r-th largest value, in other words P(Y > X₍h−r+1₎), is equal to

P(Y > X₍h−r+1₎) = ∫_{−∞}^{∞} F_{X₍h−r+1₎}(x) f_Y(x) dx,  (20)

where f_Y denotes the PDF of the correct guess's value. In Fig. 11 we estimate the sample complexity in the same setting as Fig. 5, for various values of r, testing all and a limited number of hypotheses in the upper and the lower half of the figure respectively. In both settings we can clearly reduce the sample complexity. It comes at a cost though: for each of the r hypotheses we need to backtrack one step and check whether the hypothesis is correct or not. Since the noise level after the previous reduction step is so low, the cost of testing each of the r hypotheses gets reduced. We leave studying the details of this approach to future research.
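A numerical sketch of this computation with toy Gaussian score distributions (all names and parameters ours):

```python
import math

def norm_cdf(x, mu=0.0, sd=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def norm_pdf(x, mu=0.0, sd=1.0):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k)."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)
               for j in range(k + 1))

def prob_top_r(mu_y, h, r, lo=-10.0, hi=15.0, steps=2000):
    """P(Y > r-th largest of h i.i.d. incorrect scores X ~ N(0, 1)),
    with Y ~ N(mu_y, 1): integrate f_Y(x) against the CDF of the r-th
    largest order statistic, 1 - BinCDF(h - r; h, F_X(x))."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += (norm_pdf(x, mu_y)
                  * (1.0 - binom_cdf(h - r, h, norm_cdf(x))) * dx)
    return total

p1 = prob_top_r(mu_y=3.0, h=100, r=1)
p5 = prob_top_r(mu_y=3.0, h=100, r=5)  # requiring only top-5 is easier
```

The midpoint rule suffices here since the integrand is smooth; for the huge h = q^k − 1 of a real instance one would switch to a normal approximation of the binomial term.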

Conclusions
We have shown that the FFT distinguisher and the optimal distinguisher have the same sample complexity for solving LWE using BKW. We have also shown that the FFT distinguisher performs roughly an order of magnitude better than the upper-limit formula from [11, Thm. 16] predicts. We have developed a new sample complexity model, which matches our simulated complexities well. It also helps explain the gap between our simulated sample complexity originally reported in [1] and the previous theory from [11]. Our pruned version of the FFT method improves the sample complexity of the FFT solver at no cost. Finally, we have indicated that the sample dependency due to both LF2 and sample amplification is limited.

Number of iterations per point in the sample amplification experiment of Fig. 3:
Unlimited samples:     33 41 56 35 30 49
Sample amplification:  37 59 38 45 47 40