
1 Introduction

Decades of development in lattice-based cryptography have identified two important primitive hard problems, namely the Short Integer Solution (SIS) problem [1] and the Learning With Errors (LWE) problem [24], as standard building blocks of modern lattice-based cryptosystems.

In this work, we focus on the LWE problem proposed by Regev [24], which has attracted considerable attention since its proposal. Initially, LWE was shown, via a quantum reduction, to be at least as hard as GAPSVP (the decision version of the Shortest Vector Problem) and SIVP (the Shortest Independent Vectors Problem). This means that LWE is considered hard as long as there is no efficient quantum algorithm for GAPSVP or SIVP. Subsequently, the hardness result was sharpened to admit a classical reduction from these standard lattice problems [8]. As such, LWE-based schemes are widely studied as potential primitives in the post-quantum era.

LWE. Let n be a positive integer denoting the dimension of the lattice associated with the LWE problem, let q be an odd prime, and let \(\mathcal {D}\) be an error distribution over the ring of integers modulo q, \(\mathbb {Z}_q\). Denote by \(\mathbf { s}\) a fixed secret vector in \(\mathbb {Z}_q^n\) (in this manuscript we adopt the row vector convention, to be consistent with the software implementation), selected according to the uniform distribution on its support. Let \(\mathcal {L}_{n, q,\mathcal {D}}\) be the probability distribution on \(\mathbb {Z}_q^n \times \mathbb {Z}_q\) generated by choosing \(\mathbf {a} \in \mathbb {Z}_q^n\) uniformly at random, choosing an error e according to \(\mathcal {D}\), and returning

$$\begin{aligned} (\mathbf {a}, c) = (\mathbf {a}, \langle \mathbf {a}, \mathbf {s}\rangle + e) \end{aligned}$$

in \(\mathbb {Z}_q^n \times \mathbb {Z}_q\), where \(\langle \cdot , \cdot \rangle \) denotes the inner product of two vectors in \(\mathbb {Z}_q^n\). The search LWE problem is to find the secret vector \(\mathbf {s}\) given a fixed number m of samples from \(\mathcal {L}_{n, q, \mathcal {D}}\).
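To make the definition concrete, the sampling procedure can be sketched as follows. This is a minimal Python sketch with illustrative parameters; the rounded continuous Gaussian here is only a stand-in for the discrete Gaussian error distribution defined in Sect. 2.1.

```python
import math
import random

def lwe_samples(n, q, alpha, m, seed=0):
    """Sketch: draw m LWE samples (a_i, c_i) with c_i = <a_i, s> + e_i mod q.

    The error uses a rounded continuous Gaussian with width
    sigma = alpha*q/sqrt(2*pi), a stand-in for D_{Z_q, alpha q}."""
    rnd = random.Random(seed)
    sigma = alpha * q / math.sqrt(2 * math.pi)
    s = [rnd.randrange(q) for _ in range(n)]           # uniform secret vector
    A, c, errs = [], [], []
    for _ in range(m):
        a = [rnd.randrange(q) for _ in range(n)]       # uniform a in Z_q^n
        e = round(rnd.gauss(0, sigma))                 # small error
        A.append(a)
        c.append((sum(x * y for x, y in zip(a, s)) + e) % q)
        errs.append(e)
    return A, c, s, errs

# parameters in the style of the TU Darmstadt LWE challenge:
# q = 1601 is the next prime after n^2 = 1600
A, c, s, errs = lwe_samples(n=40, q=1601, alpha=0.005, m=100)
```

Recovering \(\mathbf {s}\) from (A, c) alone is exactly the search LWE problem.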

Although the intractability of LWE is well established by the reduction proofs, its concrete hardness is far from clear. In this work we follow the approach of Liu and Nguyen [21] to evaluate the performance of the BDD approach for solving the LWE problem.

1.1 Our Contribution

In this manuscript, our main contributions include:

  • We implement a parallel version of the BDD approach for solving the LWE problem. The implementation features an embarrassingly parallel design in which each thread takes a randomized basis and performs an independent enumeration. The advantage of this design is that the power of multi-core processors can be fully utilized.

  • We give a heuristic analysis of how to choose the optimal sub-dimension of the LWE instance. We use the Gaussian heuristic to estimate the cost of a pruned enumeration tree in order to find a better sub-dimension, which reduces the time to solve an LWE instance.

  • We compare our implementation with that of Kirshanova et al. [18] and show the advantages of ours. Specifically, we show that the performance of our parallelization strategy is not limited by Amdahl’s Law, and that the extreme pruning in our implementation brings a substantial speedup over the linear pruning used in the implementation of [18].

  • We demonstrate that our implementation solves a couple of instances from the TU Darmstadt LWE challenge.

2 Preliminaries

2.1 Discrete Gaussian Distribution

We first describe the error distribution \(\mathcal {D}\) in the LWE problem. In general, any error distribution with small variance suffices for the LWE problem to be hard. However, in this work, as in many previous works on LWE, we focus on the discrete Gaussian distribution over the ring \(\mathbb {Z}_q\) as the error distribution. The discrete Gaussian distribution over \(\mathbb {Z}\) with mean 0 and width parameter \(\sigma \), denoted by \(D_{\mathbb {Z},\sigma }\), assigns to each \(x \in \mathbb {Z}\) a probability proportional to \(\exp (-x^2/2\sigma ^2)\). The error distribution we consider for the LWE problem is the discrete Gaussian distribution over \(\mathbb {Z}_q\), denoted by \(D_{\mathbb {Z}_q,\sigma }\), obtained by accumulating the values of the probability mass function over all integers in each residue class mod q. In the original proposal of Regev, the width parameter associated with the modulus q is \(\sigma = \frac{\alpha q}{\sqrt{2 \pi }} \), where \(\alpha \) is the relative error rate. With a slight abuse of notation, we also denote this discrete Gaussian distribution by \(\mathcal {D}_{\mathbb {Z}_q, \alpha q}\). When the error distribution of an LWE instance is \(\mathcal {D}_{\mathbb {Z}_q, \alpha q}\), we write the LWE instance as \(\mathcal {L}_{n, q,\alpha }\).
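A simple way to sample from \(D_{\mathbb {Z},\sigma }\) is rejection sampling on a truncated support. The sketch below, with an illustrative tail cut \(\tau = 10\), is one possible implementation, not the one used in our experiments:

```python
import math
import random

def sample_discrete_gaussian(sigma, tau=10, rnd=random):
    """Sample from D_{Z, sigma}: P[x] proportional to exp(-x^2 / (2 sigma^2)).

    Rejection sampling from the uniform distribution on the truncated
    support [-tau*sigma, tau*sigma]; the mass outside is negligible."""
    bound = math.ceil(tau * sigma)
    while True:
        x = rnd.randint(-bound, bound)
        if rnd.random() < math.exp(-x * x / (2.0 * sigma * sigma)):
            return x

def sample_error(alpha, q, rnd=random):
    """Error for L_{n,q,alpha}: D_{Z, alpha q / sqrt(2 pi)} reduced mod q."""
    sigma = alpha * q / math.sqrt(2.0 * math.pi)
    return sample_discrete_gaussian(sigma, rnd=rnd) % q
```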

2.2 Lattice

A lattice in \(\mathbb {R}^m\) is a discrete additive subgroup generated by a (non-unique) basis \(\mathbf {B} = (\mathbf {b}_1, \ldots , \mathbf {b}_m)^T\). Equivalently, the lattice \(\varLambda (\mathbf {B})\) generated by \(\mathbf {B}\) is given by \(\varLambda (\mathbf {B}) = \{ \mathbf {x} \mid \mathbf {x} = \sum _{i=1}^{m}z_i\mathbf {b}_i, z_i \in \mathbb {Z}\}\). Note that by our convention, each vector \(\mathbf {b}_i\) in the basis matrix \(\mathbf {B}\) is a row vector. The rank of the lattice \(\varLambda (\mathbf {B})\) is defined as the rank of the basis matrix \(\mathbf {B}\). If the rank of \(\varLambda (\mathbf {B})\) equals m, we say that the lattice is full rank. A fundamental notion underlying various lattice problems is the successive minimum \(\lambda _k(\varLambda )\), defined as the smallest real number r such that the lattice contains k linearly independent nonzero vectors of Euclidean length at most r. In particular, \(\lambda _1(\varLambda )\) is the length of the shortest nonzero vector of the lattice \(\varLambda \).

The lattices we are interested in are a special type of lattices called q-ary lattices which are lattices satisfying \(q\mathbb {Z}^m \subset \varLambda \subset \mathbb {Z}^m\). Fix positive integers \(n \le m \le q\), where n serves as the main security parameter, and q is an odd prime. For any matrix \(\mathbf {A}\in \mathbb {Z}^{m\times n}\), define the following two lattices.

$$ \varLambda ^{\bot }_q (\mathbf {A}) = \{\mathbf { x} \in \mathbb {Z}^m : \mathbf { x}\mathbf { A}=0\, \texttt { mod }\, q\}, $$
$$ \varLambda _q(\mathbf {A}) = \{ \mathbf { x} \in \mathbb {Z}^m : \mathbf { x} = \mathbf { A}\mathbf {s}\; \texttt { mod }\, q\, \texttt { for }{} \texttt {some }\, \mathbf {s} \in \mathbb {Z}_q^n\}. $$

It is easy to check that both \(\varLambda ^{\bot }_q(\mathbf {A})\) and \(\varLambda _q(\mathbf {A})\) are q-ary lattices [23].

2.3 Lattice Reduction

As noted in Sect. 2.2, a lattice can be generated by different bases, and the quality of the basis plays a central role in the difficulty of various hard lattice problems. Informally, the more orthogonal the basis is, the easier the corresponding lattice problems are. As such, many attempts to solve hard lattice problems try to alter (often called reduce in the literature) the given basis in order to obtain a basis that generates the same lattice while achieving the highest orthogonality possible. We adopt the convention that the first vector \(\mathbf {b}_1\) in a reduced basis has the smallest length among the (reduced) basis vectors. After lattice reduction, we can use the vector \(\mathbf {b}_1\) as an approximation of the shortest vector. Since the determinant of a lattice is invariant under lattice reduction, when the basis is reduced, the lengths of the basis vectors decrease overall. The common measure of the quality of a lattice basis is the Hermite factor \(\delta ^m\), defined by \(||\mathbf {b}_1|| = \delta ^m \texttt {vol}(\varLambda )^{1/m}\). We also refer to \(\delta \) as the root-Hermite factor. A smaller root-Hermite factor typically implies a reduced basis of higher quality.
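The root-Hermite factor is straightforward to compute from a reduced basis. The sketch below uses illustrative numbers only (the first-vector length 1585 is hypothetical, chosen to land just below q); the logarithm of the volume is taken as input to avoid overflow for large lattices:

```python
import math

def root_hermite_factor(b1_norm, log_volume, m):
    """Solve ||b_1|| = delta^m * vol(L)^(1/m) for delta.

    log_volume is the natural log of vol(L), which avoids overflow
    when vol(L) = q^(m-n) is astronomically large."""
    return math.exp((math.log(b1_norm) - log_volume / m) / m)

# a q-ary lattice with q = 1601, n = 40, sub-dimension m = 152 has
# vol(L) = q^(m-n); a first vector of length ~1585 (hypothetical) then
# corresponds to delta ~ 1.0128, the BKZ-20 value reported in Table 1
q, n, m = 1601, 40, 152
delta = root_hermite_factor(1585.0, (m - n) * math.log(q), m)
```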

Lattice reduction algorithms can be viewed as a hierarchy of BKZ [26] parameterized by the blocksize \(\beta \). The case \(\beta =2\) corresponds to LLL reduction, invented by Lenstra et al. [20]. LLL reduction provably runs in time polynomial in the lattice dimension and outputs a short vector within an exponential factor of the minimal length of a lattice \(\varLambda \), i.e., \(\lambda _1(\varLambda )\). When \(\beta =m\), i.e., the full size of the basis, the output basis is HKZ reduced [17], which implies solving the SVP. The case where \(\beta \) lies strictly between 2 and m is known as BKZ-\(\beta \) reduction, the most referenced reduction algorithm in practice. Chen and Nguyen observed that the running time of BKZ reduction is mainly determined by the target root-Hermite factor \(\delta \) and is less affected by the dimension m. See Chen and Nguyen [11] for a detailed analysis and their improvements over standard BKZ, a collection of optimizations known as BKZ 2.0. See also Albrecht et al. [2] for a thorough comparison of different estimates of the complexity of BKZ.

2.4 Pruned Enumeration

Gram-Schmidt Orthogonalization. Given a lattice basis \(\mathbf {B} = (\mathbf {b}_1, \ldots , \mathbf {b}_m)^T\), the Gram-Schmidt orthogonalization of \(\mathbf {B}\) is denoted \(\mathbf {B^*} = (\mathbf {b}^*_1, \ldots , \mathbf {b}^*_m)\), where \(\mathbf {b}^*_i\) is computed as \(\mathbf {b}^*_i = \mathbf {b}_i - \sum ^{i-1}_{j=1}\mu _{i,j}\mathbf {b}^*_j\) for \(i = 1, \ldots , m\), with \(\mu _{i,j} = \langle \mathbf {b}_i, \mathbf {b}^*_j\rangle / ||\mathbf {b}^*_j|| ^2\) for all \(1 \le j < i \le m\). Denote by \(\pi _i(\cdot )\) the orthogonal projection onto \((\mathbf {b}_1, \mathbf {b}_2, \ldots , \mathbf {b}_{i-1})^{\bot }\). Then \(\mathbf {b}^*_i = \pi _i(\mathbf {b}_i)\). Also, \(\pi _i(\varLambda (\mathbf {B}))\) is an \((m + 1 -i)\)-dimensional lattice generated by the basis \((\pi _i(\mathbf {b}_i), \pi _i(\mathbf {b}_{i+1}), \ldots , \pi _i(\mathbf {b}_{m}))^T\).
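The Gram-Schmidt vectors and coefficients can be computed directly from the definition; a minimal sketch over a toy row basis:

```python
def gram_schmidt(B):
    """Gram-Schmidt orthogonalization of the row basis B (no normalization).

    Returns (B_star, mu) with b*_i = b_i - sum_{j<i} mu_{i,j} b*_j and
    mu_{i,j} = <b_i, b*_j> / ||b*_j||^2."""
    Bs, mu = [], []
    for i, b in enumerate(B):
        coeffs = []
        v = [float(x) for x in b]
        for j in range(i):
            m_ij = (sum(x * y for x, y in zip(b, Bs[j])) /
                    sum(y * y for y in Bs[j]))
            coeffs.append(m_ij)
            v = [x - m_ij * y for x, y in zip(v, Bs[j])]
        Bs.append(v)
        mu.append(coeffs)
    return Bs, mu

# toy 3-dimensional row basis
B = [[3, 1, 0], [1, 2, 0], [0, 1, 4]]
Bs, mu = gram_schmidt(B)
```

The returned vectors are pairwise orthogonal, and \(\mathbf {b}^*_1 = \mathbf {b}_1\) always holds.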

Lattice Enumeration. Given a target vector \(\mathbf {t}\), a lattice basis \(\mathbf {B} = (\mathbf {b}_1, \cdots , \mathbf {b}_m)^T\) and a radius R, the lattice enumeration algorithm enumerates all lattice vectors \(\mathbf {v} \in \varLambda (\mathbf {B})\) such that \(||\mathbf {v} - \mathbf {t}|| \le R\) and finds the closest one. The algorithm traverses a search tree leveled by the enumeration depth \(k \in [m]\). The root of the tree is at level \(k=0\) and represents the target vector, while \(k=m\) corresponds to the leaves. The nodes at level k of the search tree consist of all vectors \(\mathbf {v} \in \varLambda (\mathbf {B})\) such that \(||\pi _{m+1-k}(\mathbf {t}- \mathbf {v})|| \le R\). Gama et al. [14] use the Gaussian heuristic to approximate the number of nodes in the k-th level of the enumeration tree as

$$\begin{aligned} H_k = \frac{V_k(R)}{\prod _{i=m+1-k}^{m}||\mathbf {b}^*_i||}, \end{aligned}$$
(1)

where \(V_k(R)\) denotes the volume of a k-dimensional ball of radius R. Then the total number of nodes in the enumeration tree is \(N = \sum _{k=1}^{m}{H_k}\).
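Using \(V_k(R) = \pi ^{k/2}R^k/\varGamma (k/2+1)\), the estimate (1) and the total node count N can be evaluated from the Gram-Schmidt norms alone. A small sketch; the geometric GSO profile below is illustrative only:

```python
import math

def ball_volume(k, R):
    """Volume of the k-dimensional Euclidean ball of radius R."""
    return math.pi ** (k / 2.0) * R ** k / math.gamma(k / 2.0 + 1.0)

def enum_tree_size(gso_norms, R):
    """Gaussian-heuristic level sizes H_k = V_k(R) / prod_{i=m+1-k}^m ||b*_i||
    and the total node count N = sum_k H_k."""
    m = len(gso_norms)
    H, prod = [], 1.0
    for k in range(1, m + 1):
        prod *= gso_norms[m - k]          # multiplies in ||b*_{m+1-k}||
        H.append(ball_volume(k, R) / prod)
    return sum(H), H

# illustrative profile: ||b*_i|| decaying geometrically, as is typical
# for a BKZ-reduced basis
norms = [100.0 * 0.9 ** i for i in range(30)]
N, H = enum_tree_size(norms, R=50.0)
```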

Gama et al. also suggest using extreme pruning to accelerate the enumeration algorithm. The idea of extreme pruning is to deliberately set the probability that the solution vector remains in the tree after pruning to be very small, thereby cutting many branches of the enumeration tree. Though the success probability of finding the desired solution becomes quite low, this is compensated by the huge reduction in enumeration time. Their experiments show an exponential speedup over full enumeration. Formally, pruned enumeration bounds the enumeration tree by limiting the level-k nodes to those vectors \(\mathbf {v} \in \varLambda (\mathbf {B})\) such that \(||\pi _{m+1-k}(\mathbf {v} - \mathbf {t})|| \le R_k\), where \(R_k\) denotes the pruned radius and \(R_1 \le R_2 \le \ldots \le R_m = R\).

3 Related Work

We consider the search version of the LWE problem in this work. There are mainly three ways to solve the search LWE problem.

  1. BKW approach: Blum, Kalai and Wasserman proposed the BKW algorithm for the LPN (learning parity with noise) problem. Since LWE can be viewed as a generalization of the LPN problem, BKW was also adapted to solve LWE by Albrecht et al. [3].

  2. Algebraic approach: Arora and Ge [6] proposed to set up a system of algebraic equations over the integers describing the LWE problem and to solve the search problem by solving this equation system. Later, this method was improved using Gröbner basis techniques [4].

  3. BDD approach: This approach views the search LWE problem as a decoding problem in a lattice. We explain this idea in more detail in the following.

Bounded Distance Decoding (BDD): Given m samples \((\mathbf {a_i}, c_i)\) following the LWE distribution \(\mathcal {L}_{n, q,\mathcal {D}}\), we organize the input into a matrix \(\mathbf {A} \in \mathbb {Z}_q^{m \times n}\) whose i-th row is the vector \(\mathbf {a}_i\), and a vector \(\mathbf {c} \in \mathbb {Z}_q^m\) whose i-th element is \(c_i\) from the i-th sample. Note that \(\mathbf {c} = \mathbf {As} + \mathbf {e}\), where \(\mathbf {e}\) is the error vector, which follows the distribution \(\mathcal {D}^m\). When the error distribution of the LWE problem is the discrete Gaussian distribution \(\mathcal {D}_{\mathbb {Z}_q, \alpha q}\), the length of \(\mathbf {e}\) is relatively small, since each of its entries is distributed according to the discrete Gaussian. Consider the q-ary lattice

$$ \varLambda _q(\mathbf {A}) = \{ \mathbf { x} \in \mathbb {Z}^m : \mathbf { x} = \mathbf { A}\mathbf {s}\; \texttt { mod }\, q\, \texttt { for }{} \texttt {some }\, \mathbf {s} \in \mathbb {Z}_q^n\}, $$

induced by \(\mathbf {A}\). Then the vector \(\mathbf {c}\) lies within a bounded distance of a lattice vector \(\mathbf {v} \in \varLambda _q(\mathbf {A})\). Finding the vector \(\mathbf {v}\) in the q-ary lattice given \(\mathbf {c}\) is called the BDD problem.

One approach to solving the BDD problem is to reduce it to the unique SVP problem via the embedding technique of Kannan [17] and solve the corresponding SVP instance. Another, more common, approach is to adapt the well-established Babai's algorithm to solve the BDD problem directly. In this regard, Lindner and Peikert [22] proposed a variant of Babai's nearest plane algorithm to solve the BDD problem. Bischof et al. [10] and Kirshanova et al. [18] have implemented parallel versions of this algorithm and investigated its practical performance. Liu and Nguyen [21] observed that the Lindner-Peikert nearest plane algorithm can be viewed as a form of pruned enumeration in which the pruning strategy bounds the coefficients instead of, as is usual, the projection lengths. Further, they proposed using lattice enumeration with the GNR extreme pruning strategy to accelerate the search for the closest vector. This is the approach we use in our experimental study. We refer the reader to the excellent survey by Albrecht et al. [2] for a comprehensive exploration of the concrete hardness of LWE.
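For reference, Babai's nearest plane algorithm, the starting point of the Lindner-Peikert variant, can be sketched in a few lines: it rounds the target against the Gram-Schmidt vectors, from the last to the first. The basis below is a toy example for illustration only:

```python
def gram_schmidt(B):
    """Gram-Schmidt vectors of the row basis B (no normalization)."""
    Bs = []
    for b in B:
        v = [float(x) for x in b]
        for bs in Bs:
            mu = sum(x * y for x, y in zip(b, bs)) / sum(y * y for y in bs)
            v = [x - mu * y for x, y in zip(v, bs)]
        Bs.append(v)
    return Bs

def babai_nearest_plane(B, t):
    """Return a lattice vector of the row basis B close to the target t."""
    Bs = gram_schmidt(B)
    b = [float(x) for x in t]
    v = [0] * len(t)
    for i in reversed(range(len(B))):
        # project the residual onto b*_i and round to the nearest plane
        c = round(sum(x * y for x, y in zip(b, Bs[i])) /
                  sum(y * y for y in Bs[i]))
        b = [x - c * y for x, y in zip(b, B[i])]
        v = [x + c * y for x, y in zip(v, B[i])]
    return v

# toy example: basis {(2,0), (1,3)}, target (2.2, 3.1)
v = babai_nearest_plane([[2, 0], [1, 3]], (2.2, 3.1))
```

Note that nearest plane only guarantees a close vector, not the closest one; pruned enumeration explores many such rounding choices instead of one.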

4 Our Implementation

We choose to implement the Liu and Nguyen algorithm to study its practical performance. This algorithm uses enumeration with extreme pruning to solve the BDD problem. Given M samples \(\{(\mathbf {a}_i, c_i)\}_{i = 1, 2, \ldots , M}\) from the LWE distribution \(\mathcal {L}_{n, q, \alpha }\), we use the matrix representation to express the LWE problem as \(\mathbf {A} \mathbf {s} + \mathbf {e} = \mathbf {c}\). We outline the algorithm steps as follows:

[Algorithm 1: the BDD approach for solving LWE]

The parameters sub-dimension, BKZ block size and enumeration radius in Algorithm 1 need to be optimized to achieve good performance for our LWE solver. However, these parameters affect the running time and success probability of our LWE solver in an entangled way. It is therefore a multi-objective optimization problem, which in general is not trivial to solve.

We give a detailed explanation of the choices we make regarding each step of the algorithm in the subsequent subsections.

4.1 Compute the Basis

This is straightforward linear algebra. Given a matrix \(\mathbf {A}\) of size \(m\times n\), we want to find a matrix \(\mathbf {B}\) that is a basis of the q-ary lattice \(\varLambda _q(\mathbf {A}) = \{ \mathbf { x} \in \mathbb {Z}^m : \mathbf { x} = \mathbf { A}\mathbf {s}\; \texttt { mod } \,q\, \texttt { for } \texttt {some }\, \mathbf {s} \in \mathbb {Z}_q^n\}\). Recall that we use the row vector convention, so a lattice vector generated by the basis \(\mathbf {B}\) can be represented as \(\mathbf {zB}\), where \(\mathbf {z}\) is an integer row vector. If we ignore for a moment that we are working over the modular ring \(\mathbb {Z}_q\), then a generating set for the lattice is simply \(\mathbf { A}^T\), the transpose of \(\mathbf { A}\), since every lattice vector, except for those in \(q\mathbb {Z}^m\), can be expressed as an integer linear combination of the rows of \(\mathbf {A}^T\). To include \(q\mathbb {Z}^m\) and thereby make it a q-ary lattice, we further compute the Hermite normal form of \(\left[ \frac{\mathbf {A^T}}{q\mathbf {I}_m}\right] \) to get a basis of the q-ary lattice \(\varLambda _q(\mathbf {A})\). In other words, \(\mathbf {B} = \text { HNF}(\left[ \frac{\mathbf {A^T}}{q\mathbf {I}_m}\right] )\), where \(\text {HNF}(\mathbf {A})\) denotes the row Hermite normal form of a matrix \(\mathbf {A}\) with all zero rows removed. To see this, first note that \(q\mathbb {Z}^m\) itself is a lattice whose basis is the (m-dimensional) identity matrix scaled by q. Thus, \(\varLambda _q(\mathbf {A})\) is the lattice generated by the union of \(\varLambda (\mathbf {A^T})\) and \(\varLambda (q\mathbf {I}_m)\). The last step of our computation relies on the following fact:

Fact 1. Given two lattices \(\varLambda (\mathbf {B}_1)\) and \(\varLambda (\mathbf {B}_2)\) of the same dimension, the basis for the lattice generated by the union of \(\varLambda (\mathbf {B}_1)\) and \(\varLambda (\mathbf {B}_2)\) is \(\text {HNF}(\left[ \frac{\mathbf {B}_1}{\mathbf {B}_2}\right] )\).
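The construction above can be sketched end to end with a small integer HNF routine. This is a naive Euclidean-elimination sketch for illustration, not the optimized HNF implementation one would use in practice:

```python
def hnf(M):
    """Row-style Hermite normal form by integer row operations; zero rows
    are dropped. Naive Euclidean elimination, column by column."""
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        while True:
            nz = [i for i in range(r, rows) if M[i][c] != 0]
            if not nz:
                break
            p = min(nz, key=lambda i: abs(M[i][c]))   # smallest pivot candidate
            M[r], M[p] = M[p], M[r]
            done = True
            for i in range(r + 1, rows):
                if M[i][c] != 0:
                    k = M[i][c] // M[r][c]
                    M[i] = [a - k * b for a, b in zip(M[i], M[r])]
                    done = done and M[i][c] == 0
            if done:
                break
        if M[r][c] != 0:
            if M[r][c] < 0:
                M[r] = [-x for x in M[r]]
            for i in range(r):            # reduce the entries above the pivot
                k = M[i][c] // M[r][c]
                M[i] = [a - k * b for a, b in zip(M[i], M[r])]
            r += 1
    return M[:r]

def qary_basis(A, q):
    """Basis of Lambda_q(A): HNF of A^T stacked over q*I_m."""
    m, n = len(A), len(A[0])
    At = [[A[i][j] for i in range(m)] for j in range(n)]
    qI = [[q if i == j else 0 for j in range(m)] for i in range(m)]
    return hnf(At + qI)

# toy example: m = 4 samples in dimension n = 2, q = 7
B = qary_basis([[1, 0], [2, 1], [3, 1], [4, 2]], 7)
```

For this toy instance the result has m = 4 rows and determinant \(q^{m-n} = 49\), as expected for a q-ary lattice whose defining matrix has full rank mod q.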

4.2 Enumeration Radius and Basis Randomization

The enumeration radius only affects the running time and success probability of the enumeration part. Consider an LWE instance \(\mathcal {L}_{n, q,\alpha }\). Fix m samples from it to get the equation \(\mathbf {A}' \mathbf {s} + \mathbf {e}' = \mathbf {c}'\) mod q. The exact BDD radius is the length of the error vector \(\mathbf {e'} \in \mathbb {Z}^m\). Though we do not know the exact value of \(||\mathbf {e}'||^2\), we know that its coefficients are generated from the discrete Gaussian distribution \(\mathcal {D}_{\mathbb {Z}_q, \alpha q}\). According to the acceptance criterion of the LWE challenge, the requirement is that \(||\mathbf {e}'|| \le 2\sqrt{m}\alpha q\) for an LWE instance \(\mathcal {L}_{n,q,\alpha }\) with m samples. Thus one option is to take \(R = 2\sqrt{m}\alpha q\) as the BDD bound. More generally, we set the squared BDD bound \(R^2\) to \(c \cdot m\alpha ^2 q^2\) for some fixed constant c. To approximate the BDD bound R, we sampled error vectors according to the discrete Gaussian distribution and recorded the squared length of the error vector \(\mathbf {e}'\). See Fig. 1 for our experimental results. The figure shows histograms of the distribution of the squared norm of the error vector \(\mathbf {e}'\), with the bins given by the ratio between \(||\mathbf {e}'||^2\) and \(m\alpha ^2 q^2\) (i.e., the scalar c). From Fig. 1 we can see that \(c=1.3\) is an appropriate choice for ensuring that the closest vector can be found with overwhelming probability, and that the distribution of the scalar c does not depend on the parameters n, \(\alpha \) and m. However, if one chooses \(c=1\) so as to reduce the running time of the enumeration algorithm, then with probability about one half the closest vector cannot be found within this radius.

Typical applications of pruned enumeration will first randomize the lattice basis by multiplying the basis matrix with a random unimodular matrix and then apply pruned enumeration to find the desired shortest vector or closest vector. If we adopt this method in our LWE solver, two problems arise.

  1. The randomized basis usually has larger entries than the initial basis, adding some burden to the lattice reduction algorithm.

  2. If we choose a smaller enumeration radius, such as setting the scalar c below 1.3, we might miss the opportunity of finding the closest vector. This is primarily because we are working with different bases of a fixed lattice, and hence the error norm is fixed.

Our randomization is natural and effective. We do not randomize over a fixed lattice; instead, we choose a different lattice each time. In most instances, the number of samples M is larger than the LWE dimension n. It follows that, after deciding on a sub-dimension m with \(n< m < M\), we can randomly choose m samples from the total M samples to form a different lattice each time. This simple trick solves the two problems discussed above. First, the entries of the generated basis are all less than or equal to q. Second, since we randomize over the different m-combinations of the samples, the error vector \(\mathbf {e}'\) changes every time. We can then choose a lower enumeration bound R and be confident that a fixed portion of the trials yield error vectors within the bound. For example, according to Fig. 1, if we choose \(R^2 = m \alpha ^2 q^2\) then the closest vector can be found within this bound with probability about 50%.

Denote by \(p_{enum}\) the success probability of the enumeration algorithm given that the squared length \(||\mathbf {e'}||^2\) of the error vector is indeed within the squared enumeration bound \(R^2\). For simplicity we first consider \(||\mathbf {e}'||^2 \le 1.3 \cdot m \alpha ^2 q^2\). Let \(T_{enum}(c)\) be the running time of the enumeration algorithm when the squared enumeration radius is set to \(c \cdot m \alpha ^2 q^2\). We can estimate the total enumeration time to find the closest vector (when setting \(c=1.3\)) as \(T(1.3) = T_{enum}(1.3) / p_{enum}\). Further assuming that changing the enumeration radius does not affect \(p_{enum}\), we can approximate the success probability of the enumeration algorithm for different enumeration scalars c. For example, if we choose \(c = 1\), the probability that the error vector is within \(1 \cdot m \alpha ^2 q^2\) is about 0.5, so the total enumeration time to solve the BDD problem is \(T(1) = 2 \cdot T_{enum}(1)/p_{enum}\). In particular, choosing \(c=1\) leads to a faster algorithm if \(T_{enum}(1) < T_{enum}(1.3)/2\). Thus, by analyzing the impact of the enumeration radius on the running time of the enumeration algorithm, we can choose a nearly optimal enumeration radius. Finally, we can use the Gaussian heuristic, Eq. (1), to approximate \(T_{enum}(c)\) (see below).
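The radius trade-off above amounts to simple expected-value arithmetic; the sketch below uses hypothetical timings and probabilities for illustration only:

```python
def expected_total_time(T_enum, p_radius, p_enum):
    """Expected total time to solve the BDD instance: each trial succeeds
    only if the error falls within the radius (prob. p_radius) and the
    pruned enumeration then finds it (prob. p_enum)."""
    return T_enum / (p_radius * p_enum)

p_enum = 0.1                                    # hypothetical pruning success rate
T13 = expected_total_time(100.0, 1.0, p_enum)   # c = 1.3: error inside w.p. ~1
T10 = expected_total_time(40.0, 0.5, p_enum)    # c = 1.0: error inside w.p. ~0.5
# with these numbers T_enum(1) = 40 < T_enum(1.3)/2 = 50, so c = 1 wins
```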

Fig. 1. Histograms of the squared length of \(\mathbf {e}'\) for different parameters.

4.3 Choose Sub-dimension

In the typical setting of the LWE problem, the number of available samples M is bounded by a polynomial in the LWE dimension n. When treating LWE as a lattice problem, an important decision concerns a suitable choice of the lattice dimension, which equals the number of samples we choose: how many of the M samples do we use to form the generating matrix \(\mathbf {A}'\)?

First, we show that if the sub-dimension is chosen too small, the sub-LWE problem may not have a unique solution. Consider the equation \(\mathbf {A}'\mathbf {s}+ \mathbf {e'} = \mathbf {c}'\). For any choice of \(\mathbf {s}\), we can find an error vector \(\mathbf {e}'\) satisfying this equation. However, the LWE problem restricts the length of \(\mathbf {e}'\): each element of \(\mathbf {e}'\) is chosen from the discrete Gaussian distribution with small variance and thus cannot be too large. TU Darmstadt hosts an LWE challenge websiteFootnote 1 similar to the famous SVP challenge. According to Buchmann et al. [9], the acceptance criterion for a correct answer to the LWE problem \(\mathcal {L}_{n,q,\alpha }\) with M samples is that \(||\mathbf {e}|| \le 2\sqrt{M}\alpha q\). Based on this criterion, when we choose the sub-dimension to be m, we expect to find a secret vector \(\mathbf {s}\) that leads to an error vector of length less than \(2\sqrt{m}\alpha q\). Following the argument of Buchmann et al. [9], we calculate the probability that the sub-LWE problem has more than one solution. For a chosen matrix \(\mathbf {A}'\) of size \(m \times n \), let \(\varLambda _q(\mathbf {A}')\) denote the q-ary lattice generated by \(\mathbf {A}'\). Recall that \(\lambda _1(\varLambda _q(\mathbf {A}'))\) is the norm of the shortest nonzero vector in \(\varLambda _q(\mathbf {A}')\). Assume that we have two solutions \(\mathbf {s}_1\) and \(\mathbf {s}_2\) for the secret vector, satisfying \(\mathbf {A's}_1 + \mathbf {e}'_1 = \mathbf {c}' = \mathbf {A's}_2 + \mathbf {e}'_2\) and \(||\mathbf {e}'_i|| \le 2\sqrt{m}\alpha q\). Then by the triangle inequality, we have \(||\mathbf {A}'(\mathbf {s}_1 - \mathbf {s}_2)|| \le 4\sqrt{m}\alpha q\).
Since \(\mathbf {A}'(\mathbf {s}_1 - \mathbf {s}_2)\) is a vector in the q-ary lattice \(\varLambda _q(\mathbf {A}')\), the existence of more than one solution to the sub-LWE problem implies that \(\lambda _1(\varLambda _q(\mathbf {A}')) \le 4\sqrt{m}\alpha q\). On the other hand, the Gaussian heuristic tells us that the expected length of the shortest vector of \(\varLambda _q(\mathbf {A}')\) is \(q^{1-\frac{n}{m}}\sqrt{\frac{m}{2\pi e}}\). In view of this, in our implementation we choose the sub-dimension m such that the Gaussian heuristic value \(q^{1-\frac{n}{m}}\sqrt{\frac{m}{2\pi e}}\) is larger than \(4\sqrt{m}\alpha q\), so that the expected number of solutions is small.
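This uniqueness condition is easy to check numerically; a small sketch for the challenge parameters:

```python
import math

def likely_unique(n, q, alpha, m):
    """Heuristic uniqueness check for the m-sample sub-instance: the
    Gaussian-heuristic shortest length of Lambda_q(A') must exceed the
    doubled acceptance bound 4*sqrt(m)*alpha*q."""
    gh = q ** (1.0 - n / m) * math.sqrt(m / (2.0 * math.pi * math.e))
    return gh > 4.0 * math.sqrt(m) * alpha * q

# for the challenge instance n = 40, q = 1601, alpha = 0.005, a
# sub-dimension around 100 is too small, while m = 152 is safely in
# the unique regime
```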

The next question concerns how large the lattice sub-dimension m should be. Note that a large dimension invariably increases the time for basis reduction. In [15], the authors experimentally showed that for a random input lattice, the root-Hermite factor \(\delta \) after BKZ-\(\beta \) reduction is independent of the lattice dimension. Table 1 shows the root-Hermite factor obtained from m random samples taken from the LWE challenge with \(\alpha = 0.005\) under BKZ-20 reduction, averaged over 20 experiments for each pair of m and n.

Table 1. Average root-Hermite factor for LWE instances with sub-dimension m

One sees that for each n, the value of \(\delta \) is approximately 1.0128 for the first values of m but deviates from this value for larger values of m. A closer examination reveals that in the latter case, the shortest vectors produced by the reduction algorithm are the unit vectors scaled by q. In general, we would like the lattice reduction to produce vectors of length less than q, which suggests that the input vectors are well mixed by the reduction algorithm to produce short vectors. In view of this, for a given BKZ-\(\beta \) reduction, we select the sub-dimension m such that the predicted shortest vector has length less than q, namely, \(\delta ^mq^{1 - n/m} \le q\), or \(m \le \sqrt{n\log q/\log \delta }\), where \(\delta \) is the expected root-Hermite factor. For \(\delta = 1.0128\), one checks that this gives the pairs (n, m) as (40, 152), (45, 164), (50, 175), (55, 186), (60, 196) and (65, 206).
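The rule \(m \le \sqrt{n\log q/\log \delta }\) reproduces these pairs directly; a small sketch:

```python
import math

def next_prime(n):
    """Smallest prime strictly greater than n (trial division)."""
    def is_prime(k):
        return k >= 2 and all(k % d for d in range(2, math.isqrt(k) + 1))
    k = n + 1
    while not is_prime(k):
        k += 1
    return k

def best_subdim(n, delta=1.0128):
    """Largest m with delta^m * q^(1 - n/m) <= q, i.e.
    m = floor(sqrt(n * log q / log delta)); q is the next prime after n^2,
    following the LWE challenge parameter setting."""
    q = next_prime(n * n)
    return int(math.sqrt(n * math.log(q) / math.log(delta)))

pairs = [(n, best_subdim(n)) for n in (40, 45, 50, 55, 60, 65)]
```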

Fig. 2. Semi-log graph of the full cost N. Here n is the LWE dimension, \(\alpha \) is the relative error rate of the LWE instance, and \(\beta \) is the block size of the BKZ reduction algorithm used. Different colors stand for different trials. (Color figure online)

Apart from the lattice reduction, the size of m also affects the enumeration cost. Here, we examine the impact of m on the full enumeration tree. We propose to use the Gaussian heuristic, i.e., Eq. (1), to estimate the (full) enumeration cost and to decide the optimal sub-dimension. The total cost is \(N = \sum _{k=1}^{m} H_k\). We do not know how to solve this expression systematically for the optimal m that minimizes the total cost N. Instead, we determine the optimal sub-dimension m for a fixed BKZ block size \(\beta \) numerically: we plot the total cost \(N = \sum _{k=1}^{m}H_k\) for an LWE instance while varying the sub-dimension m. Figure 2 shows the estimated (logarithm of the) full enumeration cost for different parameters. We adopt the convention that for an LWE instance \(\mathcal {L}_{n, q, \alpha }\), q is set to the next prime after \(n^2\), which follows the parameter setting of the LWE challenge. Comparing Fig. 2a and b, we see that for fixed n and BKZ block size \(\beta \), the optimal sub-dimension m does not depend on the relative error rate \(\alpha \). Figure 2a and c show the impact of the BKZ block size \(\beta \) on the optimal sub-dimension. As we can see, by increasing \(\beta \), the optimal sub-dimension m also increases, and the full enumeration cost decreases for larger \(\beta \). However, a larger \(\beta \) requires more BKZ reduction time, so there is still a trade-off between the BKZ reduction time and the enumeration time when setting an appropriate block size \(\beta \). We discuss this in the next subsection. Finally, combining Fig. 2b and d, we see that for fixed relative error rate \(\alpha \) and BKZ block size \(\beta \), the larger the LWE dimension n, the larger the sub-dimension needed for optimal performance.
One problem with this numerical method for deciding the optimal sub-dimension is that it does not take the BKZ reduction into account. In practice we need to consider the running time of the BKZ reduction algorithm, so the actual optimal sub-dimension is usually smaller than the plot suggests. However, the plot can still serve as a rough guide for finding the optimal sub-dimension. Due to page constraints, we defer the details of the estimation of the enumeration cost, and the relation between the cost of full enumeration and that of enumeration with extreme pruning, to Appendix A.

4.4 Balancing Reduction and Enumeration

Since we use enumeration to solve the BDD problem, we first reduce the lattice basis before applying enumeration. BKZ is now the de facto standard lattice reduction algorithm in cryptanalysis; we use the BKZ implementation in the FPLLL [5] library to perform the reduction.

Fig. 3. Running time of BKZ reduction and pruned enumeration.

The quality of the reduced basis and the running time of the BKZ reduction algorithm depend heavily on the block size \(\beta \), so the choice of an appropriate \(\beta \) affects the total running time of our LWE solver. Generally speaking, a larger block size \(\beta \) leads to a longer BKZ reduction time, but the more strongly reduced basis decreases the running time of enumeration. The folklore is therefore that the optimal block size \(\beta \) should balance the running times of BKZ reduction and enumeration: when the two are close to each other, the total running time is minimized. See Fig. 3 for an example. We plot the actual running time of the BKZ reduction algorithm and the pruned enumeration time for the LWE instance \(\mathcal {L}_{40, 1601, 0.015}\) using sub-dimension \(m=120\), with the enumeration radius set to \(R^2 = 0.8m\alpha ^2q^2\). The figure confirms the folklore that the optimal block size \(\beta \) should roughly balance the running times of BKZ reduction and enumeration. However, finding such an optimal block size is not easy, especially with extreme pruning. In our experiments we manually tune the block size \(\beta \) by measuring the running times of the BKZ reduction part and the pruned enumeration part.

4.5 Parallelization

Parallelism is ubiquitous in today's program design: we have multi-core CPUs even in our laptops, so it is natural to implement the LWE solver in parallel. One option is to use parallel implementations of both the enumeration algorithm and the lattice reduction algorithm. Alternatively, one can use sequential implementations of lattice reduction and enumeration but launch several threads, each solving the BDD problem on a different randomized basis. We choose the latter approach for its simplicity and its embarrassing parallelism. Although there are parallel implementations of lattice enumeration algorithms [12, 13, 16, 19], we do not know of any publicly available parallel implementation of the BKZ reduction algorithm. Thus, if we want to use a parallel implementation of enumeration, we would have to pair it with a sequential implementation of BKZ reduction. Amdahl's Law bounds the potential program speedup in terms of the fraction of code (p) that can be parallelized: \(speedup = \frac{1}{1-p}\). When combining BKZ reduction and enumeration to solve the SVP or CVP problem, it is common knowledge that the total running time is minimized when the running times of the BKZ reduction part and the enumeration part are roughly equal (refer to the previous section and Fig. 3). If we aim for optimal performance, the fraction of parallelizable code would therefore be about 1/2, so regardless of how many threads are used, the speedup can be at most 2. We could circumvent this by using a small block size for the BKZ reduction, or by plugging parallel enumeration into the BKZ reduction itself, but those methods are either complicated or do not achieve the optimal performance gain.
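The finite-thread form of Amdahl's Law makes this bound concrete; a minimal sketch (the function name is ours):

```python
def amdahl_speedup(p, n_threads):
    # Speedup when a fraction p of the work is parallelizable and is spread
    # over n_threads threads; as n_threads grows this tends to 1 / (1 - p).
    return 1.0 / ((1.0 - p) + p / n_threads)
```

With p = 1/2 (BKZ and enumeration times balanced), 10 threads give a speedup of only about 1.8, and no number of threads can push it past 2.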

In our implementation we use the embarrassingly parallel design in which each thread works on a different randomized basis, so there are no load-balancing issues. To achieve the best performance we carefully choose the BKZ block size so that the BKZ reduction time is comparable to the enumeration time.
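This design can be sketched as follows. Hedged illustration: `try_randomized_basis` is a hypothetical stand-in for the real randomize/BKZ-reduce/enumerate pipeline, with a deterministic toy success rule in place of the actual pruned enumeration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def try_randomized_basis(seed):
    # Placeholder for one worker's job: randomize the lattice basis with
    # `seed`, BKZ-reduce it, then run pruned enumeration. Returns the
    # recovered secret on success, or None on failure.
    # (Toy deterministic stand-in: one seed in seven "succeeds".)
    return "secret" if seed % 7 == 3 else None

def solve_parallel(n_workers, max_trials):
    # Embarrassingly parallel: each worker gets an independent randomized
    # basis, so there is no shared state and no load balancing needed.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(try_randomized_basis, s)
                   for s in range(max_trials)]
        for f in as_completed(futures):
            if f.result() is not None:
                return f.result()
    return None
```

The first successful trial ends the search; with extreme pruning each individual trial is cheap, so wasted work in the losing threads is small.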

5 Experimental Results

Our implementation is written in C++, using the FPLLL library for BKZ reduction and lattice enumeration. Our program is compiled with gcc 5.4.0 on a desktop running Ubuntu 14.04 LTS. We test our LWE solver on instances from the LWE challenge website. We use extreme pruning [14] for lattice enumeration, as suggested by Liu and Nguyen [21]. Gama et al. [14] suggest using numerical approximation to generate optimal pruning coefficients by fixing the success probability and seeking the minimum overhead. Aono [7] also describes how to compute the optimal pruning coefficients, and we follow Aono's approach in our implementation. We are preparing to release the source code after further optimization; at this moment, it is available upon request.

5.1 LWE Challenge

TU Darmstadt runs an LWE challenge project. The challenge provides LWE instances with different parameters, each identified by two values: the LWE dimension n and the relative error rate \(\alpha \). The other parameters of an LWE instance are set as follows:

  • The modulus q is set to the smallest prime larger than \(n^2\);

  • The number of samples is set to \(M = n^2\);

  • The error distribution is the discrete Gaussian distribution with width parameter \(\sigma = \alpha q\), i.e., the distribution \(\mathcal {D}_{\mathbb {Z}_q, \sigma }\).
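This parameter derivation can be reproduced with a short script (the helper names are ours; trial-division primality testing suffices at these sizes):

```python
def is_prime(k):
    # naive trial division; fine for k around n^2 with small n
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def lwe_challenge_params(n, alpha):
    # modulus q: next prime after n^2; sample count: n^2; width: alpha * q
    q = n * n + 1
    while not is_prime(q):
        q += 1
    return {"q": q, "M": n * n, "sigma": alpha * q}
```

For example, n = 40 yields q = 1601, matching the instance \(\mathcal {L}_{40, 1601, \alpha }\) discussed earlier.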

Table 2. Results on solving some instances from the LWE Challenge website

Using the implementation described in the preceding section, we solved several instances from the LWE Challenge website. Table 2 records the LWE parameters, the block size used for BKZ, and the running time for solving these instances. All the instances except two were run on a single thread on a desktop with a 3.60 GHz Intel Core i7 processor with eight cores and 32 GB of 1600 MHz DDR3 memory. The instance \((n = 60, \alpha = 0.005)\) was run on a cluster of 20 c4.8xlarge instances, each with 36 cores and 60 GB of memory (720 threads in total), on the Amazon EC2 platform. The instance \((n = 40, \alpha = 0.02)\) was solved on a cluster of 8 desktops, each with a 3.60 GHz Intel Core i7 processor with eight cores and 32 GB of 1600 MHz DDR3 memory (64 threads in total).

In the experiments we carefully choose the BKZ block size \(\beta \) so that the BKZ reduction time is comparable with the enumeration time, in order to reduce the overall running time. Our experiments indeed confirm the folklore that the total running time is minimized when the BKZ reduction time roughly equals the enumeration time. The squared BDD bound \(R^2\) was set to \(c \cdot m\alpha ^2 q^2\). The success probability in our pruning strategy is set to 0.01. From the results of our experiments we find that the relative error rate \(\alpha \) plays an important role in the hardness of the LWE problem.
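For concreteness, the squared bound can be computed as follows (the helper name is ours):

```python
def squared_bdd_radius(c, m, alpha, q):
    # R^2 = c * m * alpha^2 * q^2, i.e. c * m * sigma^2 with sigma = alpha * q
    return c * m * (alpha * q) ** 2
```

For the Fig. 3 setting (m = 120, \(\alpha = 0.015\), q = 1601, c = 0.8) this gives \(R^2 \approx 55365\).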

5.2 Comparison with Other Implementations

Recently, Kirshanova et al. [18] reported a parallel implementation of BDD enumeration for solving LWE. They implement both the Lindner-Peikert [22] nearest planes algorithm and the pruned enumeration method proposed by Liu and Nguyen [21], directly implementing a pruned parallel enumeration algorithm. Their experiments show that the enumeration algorithm parallelizes well; for example, they achieve a linear speedup when increasing the number of threads up to 10. However, the BKZ reduction is not parallelized, and the impact of Amdahl's Law is visible in their experimental results. For example, to solve the LWE instance \(\mathcal {L}_{80, 4093, 5}\), their serial implementation needs \(4.3 + 13 = 17.3\) h. Their parallel implementation with 10 threads reduces the enumeration time from 13 h to 1.5 h, for a total running time of \(4.3 + 1.5 = 5.8\) h: a 3x speedup using 10 threads. Even when they increase the number of threads to 20, the total running time is \(4.3 + 0.8 = 5.1\) h, a 3.4x speedup using 20 threads. Although one can circumvent this by using a very small block size for the BKZ reduction part, as discussed in Sect. 4.4 this choice would increase the total running time of BKZ reduction and pruned enumeration.

In contrast, our strategy of using extreme pruning with many threads working on different randomized bases scales quite well with the number of threads. Moreover, extreme pruning greatly reduces the time spent on enumeration and thus the total time needed to solve the LWE instance. We compare the running time of our implementation and that of Kirshanova et al. in Table 3. In the table, the time t for BKZ and enumeration stands for the total BKZ time and enumeration time for solving the corresponding LWE instance. Kirshanova et al. fixed the number of samples for the LWE instance, and all their experiments use this fixed dimension. We try a different setting where the number of LWE samples is a polynomial in n, say \(n^2\), so that we can use the optimal sub-dimension to reduce the difficulty of the LWE instance.

Table 3. Comparison between Kirshanova et al.'s results and ours.

The first row of Table 3 shows that extreme pruning can indeed speed up the LWE solver. Kirshanova et al. need 109 min to solve the instance \(\mathcal {L}_{70, 4093, 6}\) on a single thread, while our implementation solves the instance with the same sub-dimension and block size \(\beta =20\) within 20 min. Furthermore, their implementation takes 70 min to solve the instance on 10 threads. Since our implementation uses 18 trials to solve the instance, given 10 threads we can solve it within the time of two rounds of trials, i.e., in less than 4 min. We note that the block size \(\beta =20\) is not optimal for our implementation: by changing \(\beta \) to 15, we solve the instance \(\mathcal {L}_{70, 4093, 6}\) in 12 min on a single thread.

To further demonstrate the effectiveness of extreme pruning, we compare the performance of both implementations on the instance \(\mathcal {L}_{80, 4093, 5}\). Using the same sub-dimension \(m=150\), the running time of our implementation on a single thread is 5.2 h, much less than the 17.3 h of Kirshanova et al. But our implementation has another advantage. Notice that the algorithm uses 347 trials to find the correct solution, which means a single trial takes on average only about 1 min. Kirshanova et al. solve the instance in more than 2 h using 20 threads; our implementation is expected to solve it in about \(347/20 \approx 18\) min, and with more than 400 threads we could solve the instance within 1 min. Moreover, if we apply the optimal sub-dimension trick, we do not need so many threads to achieve the speedup. For example, when we use 180 samples and BKZ block size \(\beta =10\), the total of 12 trials takes \(417+117 = 534\) s; with 12 threads our implementation is expected to solve the instance \(\mathcal {L}_{80, 4093, 5}\) in about 45 s. By contrast, the BKZ reduction of Kirshanova et al.'s implementation alone takes 1.3 h.
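These scaling estimates follow a simple round-based model, sketched below under the assumption of a uniform per-trial time (which real trials only approximate):

```python
import math

def parallel_wall_time(n_trials, per_trial_time, n_threads):
    # Independent randomized-basis trials: n_threads workers finish in
    # ceil(n_trials / n_threads) rounds, each costing one per-trial time.
    return math.ceil(n_trials / n_threads) * per_trial_time
```

With 347 one-minute trials on 20 threads this gives 18 min; with 12 trials of \(534/12 \approx 44.5\) s each on 12 threads, a single round of about 45 s.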

6 Conclusion and Future Work

This work has described our strategy for solving the BDD problem, the details of our implementation, and our experimental results on several LWE challenge instances. Our implementation features an embarrassingly parallel design, and its use of extreme pruning shows advantages over existing implementations. Potential future work includes:

  • We chose the optimal BKZ block size \(\beta \) manually in our experiments. This would be impractical for LWE instances with large dimension and/or large relative error rate. It would thus be useful to explore the relation between the BKZ reduction time and the (pruned) enumeration time, and to use heuristics to decide the optimal BKZ block size.

  • The success probabilities observed in our experiments seem to be higher than those estimated by Aono's algorithm, so fewer trials are needed than predicted. Since we use a parallel implementation of the LWE solver, we have room for a lower success probability: a lower success probability per trial reduces the trial's running time, and we can simply add more threads to compensate for the lower probability of success. In fact, our current environment of 20 c4.8xlarge Amazon EC2 instances contains more than 700 threads in total. We can exploit this in two ways: first, we can reduce the success probability used in the pruning strategy; second, we can deploy a two-level parallelization, using the first level to run the LWE solver in parallel and the second level to run a parallel enumeration algorithm.