Polar sampler: A novel Bernoulli sampler using polar codes with application to integer Gaussian sampling

Cryptographic constructions based on hard lattice problems have emerged as a front runner for the standardization of post-quantum public-key cryptography. As the standardization process takes place, optimizing specific parts of proposed schemes, e.g., Bernoulli sampling and integer Gaussian sampling, becomes a worthwhile endeavor. In this work, we propose a novel Bernoulli sampler based on polar codes, dubbed “polar sampler”. The polar sampler is information theoretically optimum in the sense that the number of uniformly random bits it consumes approaches the entropy bound asymptotically. It also features quasi-linear complexity and constant-time implementation. An integer Gaussian sampler is developed using multilevel polar samplers. Our algorithm becomes effective when sufficiently many samples are required at each query to the sampler. Security analysis is given based on Kullback–Leibler divergence and Rényi divergence. Experimental and asymptotic comparisons between our integer Gaussian sampler and state-of-the-art samplers verify its efficiency in terms of entropy consumption, running time and memory cost. We envisage that the proposed Bernoulli sampler can find other applications in cryptography in addition to Gaussian sampling.


Introduction
Lattice-based cryptography is one of the most promising candidates for cryptosystems in the post-quantum age. The security of lattice-based primitives is guaranteed by the hardness of worst-case lattice problems, e.g. the Learning With Errors (LWE) problem [22,34] and the Short Integer Solution (SIS) problem [24,25]. The discrete Gaussian distribution lies at the core of the security proofs of these primitives, and it is also one of the fundamental building blocks of practical lattice-based cryptographic applications, e.g. signature schemes, encryption and key exchange. In general, the security level of these cryptographic applications is closely related to the statistical quality of the discrete Gaussian sampling (DGS) algorithm. From an implementation standpoint, cryptographers also take other qualities of a DGS algorithm into consideration, including side-channel resistance and computation and storage efficiency. In practice, the trade-off among these qualities is the main bottleneck.
It has been widely assumed that for cryptographic applications with λ bits of security, the statistical distance (SD) between the ideal distribution and the approximated one should be roughly 2^{−λ}, such that there is only a minor loss in security [12]. Other measures such as the Kullback-Leibler (KL) divergence and the Rényi divergence have been shown to provide more efficient security analyses than SD, as they can lower the precision requirement and reduce the cost of the algorithms in many practical cases [5,30-32]. From a practical point of view, the difficulty of DGS lies in its implementation in cryptographic primitives with constrained resources. Besides resilience against potential side-channel attacks, designers looking for the optimal DGS solution for a specific application must balance memory consumption against running time, and precision against efficiency.
There is already a variety of works addressing the application of DGS in lattice-based primitives. Existing techniques include binary sampling [11], the cumulative distribution table (CDT) sampler [10], the Knuth-Yao sampler [20], and the discrete Ziggurat sampler [23]. In [14], rejection sampling is used to generate discrete Gaussian samples: one draws an element x from a discrete domain uniformly at random and accepts it with probability proportional to exp(−x²/(2σ²)), where σ is the standard deviation. However, calculating the exponential function requires high-precision arithmetic, and several trials are needed before the sampler produces an output. In [11], a CDT is used as a base sampler and rejection sampling is done in a bitwise manner to produce discrete Gaussian samples for BLISS. However, the CDT sampling itself takes 35 percent of the total running time of BLISS [19], and the precomputed table for rejection sampling requires more memory when a wider distribution is needed.
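The rejection step of [14] can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the tail bound of 10σ and the plain `random` RNG are our choices, and a real implementation would use high-precision, constant-time arithmetic.

```python
import math
import random

def rejection_sample_dgauss(sigma, tail=10, rng=random):
    """Draw one sample x with probability proportional to exp(-x^2 / (2 sigma^2))
    over the integers in [-tail*sigma, tail*sigma], by rejection sampling."""
    bound = int(math.ceil(tail * sigma))
    while True:
        x = rng.randint(-bound, bound)                       # uniform proposal
        if rng.random() < math.exp(-x * x / (2 * sigma * sigma)):
            return x                                         # accept
```

Each trial accepts with probability roughly √(2π)σ/(2·tail·σ), so several trials per sample are indeed needed, matching the drawback noted above.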
In [18], Hülsing et al. replaced the discrete Gaussian distribution with a rounded Gaussian distribution in Lyubashevsky's signature scheme without trapdoors and in BLISS, showing its effectiveness, security and efficiency. As the term suggests, a rounded Gaussian distribution is derived by rounding continuous Gaussian samples, which can be realized efficiently by the Box-Muller transform [7] in constant time. A convolution method, first proposed in [28], can expand a discrete Gaussian distribution with a small parameter into a wider one. Another sampling design [26] exploits a base sampler with small parameters to efficiently generate DGS with arbitrary and varying parameters in a convolutional manner. This application-independent algorithm consists of an online and an offline stage, both of which can be carried out in constant time, providing resilience against timing attacks. A constant-time sampler was proposed in [39], where the Rényi divergence was used to improve efficiency. In [35], arithmetic coding, a classical data compression technique, was adopted as a sampler in BLISS, reducing the signature size.
When reviewing the literature on DGS, we find that Bernoulli sampling is of vital importance to randomness generation. It is involved in many cryptographic designs; a typical example is BLISS [11], where Bernoulli sampling is employed to build a discrete Gaussian sampler. To make BLISS safe under side-channel attacks, especially timing attacks, improved Bernoulli samplers were devised in [8,13,29,39]. To improve the efficiency of Bernoulli sampling with biases in exponential or cosh form, as is the case in BLISS, polynomial approximations with sufficient precision were proposed in [6,39]. Our research begins with Bernoulli sampling, and we draw our inspiration from polar source coding. The proposed Bernoulli sampler can be used to generate an arbitrary discrete distribution; this paper is concerned with its application to Gaussian sampling.

Contribution
In this work, we propose a novel Bernoulli sampler using polar codes and apply it to DGS over the integers. Polar codes are the first class of efficiently encodable and decodable codes that provably achieve the channel capacity of symmetric channels [3]; they also achieve Shannon's data compression rate [4]. The power of polar codes stems from the polarization phenomenon: under Arıkan's polar transform, the information measures of the synthesized sources (or channels) converge to either 0 or 1, at which point coding becomes trivial. Moreover, state-of-the-art decoding runs with O(N log log N) complexity, where N denotes the block length of the polar code [38]. Given their attractive performance, polar codes have found a wide range of applications in information theory and communication systems; in particular, they have been standardized for the fifth-generation (5G) wireless communication networks.
This work tackles the sampling problem from a source coding perspective, namely, sampling can be considered the inverse problem of source coding. In source coding or data compression, one typically encodes a block of symbols of a certain distribution into some bits which become uniformly random as the block length tends to infinity [9]. Since a source code is invertible, inverting this process would produce samples from the desired distribution. Polar sampling is well suited for applications where a large number of independent discrete Gaussian samples are required (e.g. fully homomorphic encryption (FHE), digital signatures) as the averaged randomness consumption per sample decreases asymptotically to the optimum thanks to the polarization effect. Note that the polar sampler is not restricted to sampling from the discrete Gaussian distribution, but can be extended to other distributions of interest in cryptography.
The principal contributions of this paper are summarized as follows.
- A novel approach to sample from a Bernoulli distribution, together with an integer Gaussian sampler built from multilevel polar samplers. Using a binary partition tree, we recursively partition Z into 2 cosets, 4 cosets, and so on; the number of partition levels is only logarithmic in s. Each partition gives rise to a binary source, which is produced by one polar sampler. The advantage of this multilevel sampling approach is that only Bernoulli samples are needed, which allows a simpler implementation than sampling over the whole integer domain.
- Analysis of approximation errors. Although the multilevel polar samplers target the desired distribution D_{Z^N,c,s}, they do not produce it exactly: the polar sampler converts N i.i.d. Bernoullis into N polarized and unpolarized Bernoullis, and we approximate the polarized ones by either unbiased or deterministic Bernoullis, which yields only an approximate version of the desired distribution. We derive upper bounds on the closeness between the target discrete Gaussian and its approximation, measured by KL divergence.
- Security analysis. Based on KL divergence, we discuss how to set the parameters of our polar sampler so as to achieve a given security level in a standard cryptographic scheme with oracle access to a discrete Gaussian distribution. In cryptographic applications where the number of queries q to the Gaussian sampler is limited (e.g., q ≤ 2^64 in the NIST specifications of signatures), using Rényi divergence can yield considerable savings, following the previous work of [5,32]. We also apply Rényi divergence to improve the parameter selection of the polar sampler.
The proposed multilevel polar sampler complements and is distinct from existing discrete Gaussian samplers in the literature. In addition to offering a different approach, it exhibits several salient features.
- Information-theoretic optimality. Asymptotically, the multilevel polar sampler achieves the entropy bound of the discrete Gaussian distribution. This implies that it requires the minimum number of uniformly random bits to produce the desired distribution.
- Quasi-linear complexity. The proposed Gaussian sampling approach enjoys low complexity. The design of a polar sampler is carried out at the offline stage: given a target distribution, it is done once and for all. The online stage computes certain posterior probabilities, which can be implemented with O(N log N) complexity. We also give experimental and asymptotic comparisons between our DGS approach and other existing samplers, including Knuth-Yao sampling, binary sampling and CDT sampling. The prominent advantage of the multilevel polar sampler is its entropy consumption, which indicates the cost of randomness. The overall running time depends on both SC decoding (computing likelihood ratios) and Bernoulli sampling. Compared with the binary sampling of [11], the polar sampler has higher computational complexity but asymptotically and effectively reduces the entropy consumption. We also show in experiments that the multilevel polar sampler is faster than Knuth-Yao sampling.
- Constant-time implementation. The proposed discrete Gaussian sampler is constant-time in the sense that the running time is independent of the output values. This makes our sampler attractive when dealing with timing side-channel attacks. From a coding-theoretic perspective, the polar sampler is constant-time because polar codes have a fixed code length, which compares favorably with variable-length source coding techniques (e.g. Huffman coding).
Of course, the proposed sampler can be combined with existing "expander" techniques such as convolution if needed. In this work, we focus on the theoretical design and analysis of polar samplers, whereas various optimization issues (e.g., concrete computational and storage costs) are left to future work. Nevertheless, we found in benchmark experiments that even a prototype implementation significantly outperforms the Knuth-Yao sampler in speed.

Roadmap
The roadmap of this paper is given as follows. Section 2 introduces the proposed Bernoulli sampler, i.e., polar sampler, and elucidates its relation to polar source coding. Section 3 presents how we devise an integer Gaussian sampler using multiple polar samplers. Section 4 analyses the approximation error of our integer Gaussian sampler based on KL divergence. In Sect. 5, the security, precision and parameter selection are discussed based on KL and Rényi divergence. Section 6 gives a comprehensive analysis of the constant-time feature and compares our integer Gaussian sampler with state-of-the-art samplers regarding the complexity. Section 7 concludes this paper.

Notation
Given a vector x^{1:N} and a set A ⊂ {1, . . . , N}, we denote by x_A the subvector of x^{1:N} indexed by A, and by x^{(i)} the i-th coordinate of x^{1:N}. A capital letter denotes a random variable while its lowercase denotes a realization. We write X ∼ P for a random variable X with distribution P over a countable set X. The entropy of X is defined as H_P(X) = −∑_{x∈X} p(x) log p(x); we write H(X) = H_P(X) for brevity when the distribution is clear. Suppose X and Y have a joint distribution P(X, Y). The conditional entropy of X given Y is defined as H(X|Y) = ∑_{x∈X, y∈Y} p(x, y) log( p(y) / p(x, y) ). The logarithm to base 2 is denoted by log, while the natural logarithm is denoted by ln.
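The entropy quantities above can be checked numerically; a small helper in Python (our own code, representing distributions as dictionaries):

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x) for a distribution {x: p(x)}."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def conditional_entropy(pxy):
    """H(X|Y) = sum_{x,y} p(x,y) log2( p(y) / p(x,y) )
    for a joint distribution {(x, y): p(x, y)}."""
    py = {}
    for (x, y), p in pxy.items():          # marginalize out X to get p(y)
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(py[y] / p) for (x, y), p in pxy.items() if p > 0)
```

For a fair bit, `entropy({0: 0.5, 1: 0.5})` returns 1, and for two independent fair bits the conditional entropy is also 1, as expected.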

Source polarization
The key idea of polar source coding can be found in [4], where a polar code was proposed to achieve Shannon's source coding bound. Let (X^{1:N}, Y^{1:N}) denote N i.i.d. copies of a memoryless source (X, Y) with joint distribution P_{X,Y}, where X takes values in X = {0, 1} while Y takes values in a countable set Y. The two sources X and Y are correlated, and Y is called the side information. In source coding, the encoder compresses a sequence X^{1:N} into a shorter codeword such that the decoder can produce an estimate X̂^{1:N} of X^{1:N} given the codeword and the side information Y^{1:N}. Polar codes provably achieve Shannon's source coding bound asymptotically. The source polarization transform from X^{1:N} to U^{1:N} is performed by applying an entropy-preserving circuit to X^{1:N}, i.e.,

U^{1:N} = X^{1:N} G_N, with G_N = B_N F^{⊗n} and F = [1 0; 1 1],

where ⊗n denotes the n-th Kronecker power and B_N is a bit-reversal permutation [3] of the input vector. Figure 1 illustrates the source polarization transform of X^{1:2} and X^{1:4}, where ⊕ denotes the mod-2 sum. This transform preserves entropy in the sense that H(U^{1:N} | Y^{1:N}) = H(X^{1:N} | Y^{1:N}), and it polarizes entropy in the sense that each conditional entropy H(U^{(i)} | Y^{1:N}, U^{1:i-1}) approaches either 0 or 1 as N grows. By applying the construction in Fig. 1 recursively, we derive a bijection U^{1:N} = X^{1:N} G_N inducing a combined source pair (U^{1:N}, Y^{1:N}) and a transition W_N : U^{1:N} → Y^{1:N}. This combined source pair is then split into N synthesized source pairs (U^{(i)}, Y^{1:N} × U^{1:i-1}), giving rise to N sub-transitions W_N^{(i)} : U^{(i)} → Y^{1:N} × U^{1:i-1}. The polarization phenomenon observed for these sub-source pairs is stated as follows.

Fig. 1 The source polarization transform [3]: a A two-by-two transform. b A four-by-four transform

Theorem 1 (Source polarization [4]) Let (X, Y) be a source as above. For any N = 2^n, n ≥ 1, let U^{1:N} = X^{1:N} G_N. Then, for any 0 < β < 0.5, as N → ∞,

(1/N) |{i : Z(U^{(i)} | Y^{1:N}, U^{1:i-1}) ≥ 1 − 2^{−N^β}}| → H(X|Y),
(1/N) |{i : Z(U^{(i)} | Y^{1:N}, U^{1:i-1}) ≤ 2^{−N^β}}| → 1 − H(X|Y).

Note that in the absence of side information Y, the above theorem still holds by considering Y independent of X.
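The butterfly structure of G_N admits the usual O(N log N) implementation. A sketch follows (our own code; the bit-reversal permutation B_N is omitted since it only reorders indices, and (F^{⊗n})² = I over GF(2) makes the map self-inverse):

```python
def polar_transform(x):
    """Compute x · F^{⊗n} over GF(2), with F = [[1,0],[1,1]] and
    len(x) = 2^n, via the standard iterative butterfly network."""
    x = list(x)
    n = len(x)
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            for j in range(i, i + step):
                x[j] ^= x[j + step]       # butterfly: left bit ⊕= right bit
        step *= 2
    return x
```

Because the transform is an involution, applying it twice recovers the input, which is exactly the property the sampler exploits when it maps u^{1:N} back to x^{1:N}.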
Definition 1 (Bhattacharyya parameter [15]) Let (X, Y) ∈ X × Y be a pair of random variables where X = {0, 1} = GF(2) and Y is an arbitrary finite set, and let X and Y follow the joint distribution P_{XY}(x, y). If X is the source to be compressed and Y is the side information, the Bhattacharyya parameter is defined as

Z(X|Y) = 2 ∑_{y∈Y} √( P_{XY}(0, y) P_{XY}(1, y) ).
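Taking the standard form Z(X|Y) = 2 ∑_y √(P_{XY}(0,y) P_{XY}(1,y)), a direct computation looks as follows (our own sketch; Z ≈ 1 means X is nearly unpredictable from Y, while Z ≈ 0 means X is nearly determined by Y):

```python
import math

def bhattacharyya(pxy):
    """Z(X|Y) = 2 * sum_y sqrt(P(X=0, Y=y) * P(X=1, Y=y))
    for a joint distribution given as {(x, y): p(x, y)}."""
    ys = {y for (_, y) in pxy}
    return 2 * sum(math.sqrt(pxy.get((0, y), 0.0) * pxy.get((1, y), 0.0))
                   for y in ys)
```

For instance, when X is a fair bit independent of Y the parameter equals 1, and when X is a deterministic function of Y it equals 0.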

Proposition 1 ([4], Proposition 2) relates the Bhattacharyya parameter to the conditional entropy: Z(X|Y) is near 0 (resp. near 1) if and only if H(X|Y) is near 0 (resp. near 1).
It is implied by Proposition 1 that for a source (X, Y), the entropies H(U^{(i)} | Y^{1:N}, U^{1:i-1}) polarize together with the Bhattacharyya parameters. For β ∈ (0, 1/2) and α = 2^{−N^β}, the indexes of U^{1:N} can be divided into a low-entropy set

L_{X|Y} = {i : Z(U^{(i)} | Y^{1:N}, U^{1:i-1}) ≤ α}

and its complement L^c_{X|Y}. Again, in the absence of side information Y, the two sets are defined in the same way by considering Y independent of X and U.
This gives rise to the encoding and decoding scheme described in [4]. More specifically, for a realization (X^{1:N}, Y^{1:N}) = (x^{1:N}, y^{1:N}), the encoder computes u^{1:N} = x^{1:N} G_N and only shares u_{L^c_{X|Y}} with the decoder. The compression rate is defined as R = |L^c_{X|Y}| / N. The decoder obtains an estimate û^{1:N} of u^{1:N} in a successive manner, deciding each bit indexed by L_{X|Y} from its likelihood ratio given y^{1:N} and the previously decoded bits.

Theorem 2 (An upper bound on error probability [4]) For any fixed R > H(X|Y) and β < 0.5, the probability of error for the above polar source coding method is bounded as P_e = Pr(Û^{1:N} ≠ U^{1:N}) = O(2^{−N^β}).

This implies that any compression rate R > H(X|Y) is achievable with a vanishing block error probability for polar source coding of sufficiently large N. As N goes to infinity, the polarization process removes the randomness of the low-entropy set almost surely, while the complement set becomes random. Additionally, the complexity of polar encoding and decoding is O(N log N).

From source coding to Bernoulli sampling
Given the notion of memoryless source (X , Y ) ∼ P X ,Y and polar source coding, we now consider the sampling problem, i.e., to sample from a Bernoulli distribution P(X ) given (or without) side information Y . We propose a novel Bernoulli sampler called PolarSampler(·) in Algorithm 1. The interfaces, key operations and subroutines are described as follows.

Interfaces
PolarSampler(·) draws N samples x^{1:N} from the target distribution P(X) given N samples y^{1:N} of the side information. In addition to the low-entropy set L_{X|Y} as defined in formula (6), we further define a high-entropy set

H_{X|Y} = {i : Z(U^{(i)} | Y^{1:N}, U^{1:i-1}) ≥ 1 − α},

where α = 2^{−N^β}. In order to define the two sets H_{X|Y} and L_{X|Y}, one needs to calculate the Bhattacharyya parameters Z(U^{(i)} | Y^{1:N}, U^{1:i-1}) efficiently. However, as a source pair (X, Y) turns into a synthesized source pair (U^{(i)}, Y^{1:N} × U^{1:i-1}) under the polarization transform, the alphabet size of the side information grows exponentially with N, so calculating Z(U^{(i)} | Y^{1:N}, U^{1:i-1}) directly from Definition 1 becomes intractable. Efficient algorithms to calculate the Bhattacharyya parameters were proposed in [36]. In addition, preparing the bias table c and calculating the Bhattacharyya parameters can be done offline; we will refer to this offline stage as the construction stage of PolarSampler(·) in the sequel. Note that when there is no side information Y, the precomputed table c consists of a single bias c_1 = P(X = 1), and the bias vector is c[1^{1:N}] = [c_1, · · · , c_1] of length N. The high- and low-entropy sets H_X and L_X are defined in the same manner by considering Y independent of X.

Key operations
According to polar source coding, U^{(i)} for i ∈ H_{X|Y} has very high entropy: it is approximately uniformly distributed and approximately independent of both U^{1:i-1} and Y^{1:N}. Conversely, U^{(i)} for i ∈ L_{X|Y} is almost deterministic given U^{1:i-1} and Y^{1:N}. As N goes to infinity, the fraction of unpolarized indexes vanishes and the fraction of high-entropy indexes approaches the entropy H(X|Y) of X given Y, according to Theorem 1.
Since the polarization transform G_N is invertible with G_N^{−1} = G_N, one can produce an approximation Q_{X^{1:N}} of P_{X^{1:N}} by applying the transform to U^{1:N}, i.e. X^{1:N} = U^{1:N} G_N. However, the unpolarized set may not be negligible for finite length N, and should be handled with care. More precisely, the U^{(i)} for i ∈ H^c_{X|Y} \ L_{X|Y} are neither uniform nor deterministic; assigning their values inaccurately would cause non-negligible distortion of the target distribution. To minimize the distortion, U^{(i)} should obey the following rules: the polarized indices are assigned as in source coding, i.e., uniformly random bits for i ∈ H_{X|Y} and maximum a posteriori decisions for i ∈ L_{X|Y} (formula (10)), while each unpolarized U^{(i)} is drawn from the posterior P_{U^{(i)} | Y^{1:N}, U^{1:i-1}} (formula (11)). Figure 2 shows the difference between source coding and sampling: although the unpolarized set belongs to the compressed codeword in source coding, its bits should be randomized as in (11) in sampling.
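The three-way decision rule can be sketched as follows (a toy sketch with hypothetical interfaces: `H` and `L` stand for the precomputed index sets, and `posterior_one` for the posterior probability delivered by SC decoding):

```python
import random

def assign_bit(i, posterior_one, H, L, rng=random):
    """Decide U^(i) given posterior_one = P(U^(i)=1 | y^{1:N}, u^{1:i-1}).
    High-entropy indices get fresh uniform bits, low-entropy indices get
    the MAP value, and unpolarized indices are sampled from the posterior."""
    if i in H:                                    # formula (10): uniform bit
        return rng.getrandbits(1)
    if i in L:                                    # deterministic MAP decision
        return 1 if posterior_one > 0.5 else 0
    return 1 if rng.random() < posterior_one else 0   # formula (11)
```

Only the first and third branches consume randomness, which is why the entropy cost is governed by the sizes of the high-entropy and unpolarized sets.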
While sampling according to formula (10) is straightforward, the bottleneck of entropy consumption is determined by sampling the unpolarized set in formula (11). The following lemma, adapted from [27, Lemma 1], gives the fraction of the unpolarized set for finite n = log N.
Lemma 1 (Fraction of unpolarized set [27]) Let μ (3.579 ≤ μ ≤ 4.714) be a constant, called the upper bound on the scaling exponent, which is determined solely by the conditional probability P_{X|Y}. For a constant v > 1 and n = log N ≥ 1, the fraction of unpolarized indices, i.e., those whose Bhattacharyya parameter stays away from both 0 and 1, is bounded by c N^{−1/μ}, where the constant c depends solely on v and does not depend on n or P_{X|Y}.
In the absence of side information Y , the above property also holds by substituting X for X |Y .
By Proposition 1, the conditional entropies and the Bhattacharyya parameters polarize simultaneously. Therefore Theorem 1 can be restated in terms of the fractions of the high- and low-entropy sets: for any 0 < β < 0.5, as N → ∞, |H_{X|Y}|/N → H(X|Y) and |L_{X|Y}|/N → 1 − H(X|Y). Given a threshold 2^{−N^β}, we can find a v such that an index remains unpolarized with probability smaller than cN^{−1/μ}, where 3.579 ≤ μ ≤ 4.714 and c is determined by v and P(X|Y). Therefore, the fraction of the unpolarized set L^c_{X|Y} \ H_{X|Y} scales as N^{−1/μ}. The above conclusions still hold in the absence of side information Y by considering X independent of Y. Q.E.D.

Subroutines
To carry out the operations in formulas (10) and (11), one also needs to calculate the posterior probabilities P_{U^{(i)} | Y^{1:N}, U^{1:i-1}}. Recall that the likelihood ratio is defined as

L^{(i)}_N = P_{U^{(i)} | Y^{1:N}, U^{1:i-1}}(0 | y^{1:N}, u^{1:i-1}) / P_{U^{(i)} | Y^{1:N}, U^{1:i-1}}(1 | y^{1:N}, u^{1:i-1}).

Since these likelihood ratios can be computed with quasi-linear complexity O(N log N) by the SC decoding procedure of [3], the posterior probabilities P_{U^{(i)} | Y^{1:N}, U^{1:i-1}} in (10) and (11) can be computed at the same cost. When there is no side information Y, we compute P_{U^{(i)} | U^{1:i-1}} in the same manner by considering Y independent of X.
In PolarSampler(·), we first define a two-dimensional likelihood-ratio array LRReg[N][n + 1] and a two-dimensional bit array UReg[N][n + 1], indexed by integers 1 ≤ i ≤ N and 0 ≤ m ≤ n. We also define an array of N × (n + 1) nodes connected by multiple 2-by-2 butterfly circuits, as in Fig. 3 for N = 8. Each node is responsible for updating a unique element of LRReg[N][n + 1] and the element of UReg[N][n + 1] with the same index. We define two properties of each layer of the array, the phase and the branch, denoted by integers φ and ψ, respectively. In Fig. 3, we distinguish different phases at each layer by different colors. At layer m, the phase and branch satisfy 1 ≤ φ ≤ 2^m and 0 ≤ ψ < 2^{n−m}. Note that for any layer m, each index 1 ≤ i ≤ 2^n has a unique representation as a pair (φ, ψ).

For a generic array at layer m, an element can therefore be addressed equivalently by its index i or by its pair (φ, ψ).

To ease notation, we also write L^{(φ)}_{2^m} both for a likelihood ratio and for the node by which it is calculated. In Line 2 of PolarSampler(·), the raw likelihood ratios at layer 0 are initialized from the bias table. As a high-level description, CalLR(·) recursively assembles two likelihood ratios of the same phase but different branches at layer m − 1 (the two nodes on the RHS of a butterfly) and derives two new likelihood ratios of different phases but the same branch at layer m (the two nodes on the LHS of the butterfly) according to formulas (13) and (14). We give an example demonstrating how CalLR(·) and CalBit(·) work in Fig. 3; in PolarSampler(·), CalLR(n, i) begins with i = 1.
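In likelihood-ratio form, the standard SC update rules from [3], which we assume formulas (13) and (14) correspond to, read as follows (our own rendering):

```python
def lr_combine_upper(a, b):
    """Check-node update for the upper (XOR) branch:
    L = (a*b + 1) / (a + b), from two child likelihood ratios a, b."""
    return (a * b + 1.0) / (a + b)

def lr_combine_lower(a, b, u):
    """Variable-node update for the lower branch, once the upper bit u
    has been decided: L = a*b if u = 0, else b/a."""
    return b * a if u == 0 else b / a
```

Each butterfly applies one update of each kind, so a full pass over the n layers costs O(N log N) operations, matching the complexity stated above.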

Closeness analysis of polar sampler
A good closeness metric can help reduce the complexity of implementation. In this section, we will evaluate the closeness error of the proposed sampling scheme.

Definition 2 (Kullback-Leibler divergence)
Let P and Q be two distributions over a common countable set Ω, and let A ⊂ Ω be the strict support of P (P(a) > 0 iff a ∈ A). The KL divergence D_{KL} of Q from P is defined as

D_{KL}(P ∥ Q) = ∑_{a∈A} P(a) ln( P(a) / Q(a) ),

with the convention that ln(x/0) = +∞ for any x > 0.
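Definition 2 translates directly into code; a dictionary-based helper (our own):

```python
import math

def kl_divergence(P, Q):
    """D_KL(P || Q) = sum_{a in supp(P)} P(a) * ln(P(a) / Q(a)),
    with the convention ln(x/0) = +inf for x > 0."""
    d = 0.0
    for a, p in P.items():
        if p > 0:
            q = Q.get(a, 0.0)
            if q == 0:
                return math.inf        # P puts mass where Q has none
            d += p * math.log(p / q)
    return d
```

Note the asymmetry: D_KL(P ∥ Q) can be finite while D_KL(Q ∥ P) is infinite, which is one reason the later analysis must track the two directions separately.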
Let P_{X^{1:N}}(x^{1:N}) denote the distribution of N i.i.d. Bernoullis X defined as above. For any 0 < β < 0.5, N = 2^n, n ≥ 1 and the corresponding high- and low-entropy sets defined in (9) and (6), one can generate a distribution Q_{X^{1:N}}(x^{1:N}) using the rules (10) and (11). To bound the KL divergence between P_{X^{1:N}}(x^{1:N}) and Q_{X^{1:N}}(x^{1:N}), we first modify the Bernoulli sampling rules (10) and (11) so that only the deterministic decisions for U^{(i)}, i ∈ L_{X|Y}, are replaced by random decisions (formulas (15) and (16)). Let Q′_{X^{1:N}}(x^{1:N}) denote the distribution derived by the modified Bernoulli sampling rules (15) and (16).
In the presence of side information Y ∈ Y := {y_1, · · · , y_{|Y|}}, X can be seen as a combination of a sequence of Bernoullis with a bias table c = [c_1, . . . , c_{|Y|}], where c_y = P(X = 1 | Y = y). PolarSampler(·) takes the side information y^{1:N} and the high- and low-entropy sets H_{X|Y}, L_{X|Y} as input. The above closeness bound still holds by substituting X|Y for X.

Proof
The KL divergence between P_{X^{1:N}} and Q′_{X^{1:N}} is bounded by the KL divergence between P_{U^{1:N}|Y^{1:N}} and Q′_{U^{1:N}|Y^{1:N}} because the following relations hold.
The above equalities are derived as follows.
where the equalities and inequalities are explained as follows.
In a similar fashion, the KL divergence between Q′_{X^{1:N}} and Q_{X^{1:N}} is bounded as follows.
Therefore, the closeness measured by KL divergence follows by combining the two bounds above. Note that polar sampling without side information is easier: in the absence of Y, the above closeness analysis for the KL divergence still holds by seeing Y independent of X. Q.E.D.
Although we cannot bound the KL divergence between P_{X^{1:N}} and Q_{X^{1:N}} directly, owing to the lack of a triangle inequality, the absence of D_{KL}(P_{X^{1:N}} ∥ Q_{X^{1:N}}) does not prevent us from giving the security analysis, as will be explained in Sect. 5.

Gaussian sampling over the integers using polar sampler
Definition 3 For any c ∈ R and s > 0, define the discrete Gaussian distribution over Z as

D_{Z,c,s}(x) = ρ_{c,s}(x) / ρ_{c,s}(Z),

where ρ_{c,s}(x) = exp(−π|x − c|²/s²) and ρ_{c,s}(Z) = ∑_{z∈Z} ρ_{c,s}(z).
In the above definition, the denominator ρ_{c,s}(Z) is for normalization. For convenience, we may omit c when c = 0, e.g. ρ_{0,s}(x) = ρ_s(x) and D_{Z,0,s}(x) = D_{Z,s}(x).
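Definition 3 can be evaluated numerically by truncating the sum ρ_{c,s}(Z) to a wide window; an illustrative helper (the window width is our choice, the neglected mass being negligible for this width):

```python
import math

def discrete_gaussian_pmf(c, s, tail=12):
    """Probabilities D_{Z,c,s}(x) = rho_{c,s}(x) / rho_{c,s}(Z), with
    rho_{c,s}(x) = exp(-pi * |x - c|^2 / s^2), truncated to a window
    of half-width tail*s around c."""
    lo = int(math.floor(c - tail * s))
    hi = int(math.ceil(c + tail * s))
    rho = {x: math.exp(-math.pi * (x - c) ** 2 / s ** 2)
           for x in range(lo, hi + 1)}
    total = sum(rho.values())          # approximates rho_{c,s}(Z)
    return {x: v / total for x, v in rho.items()}
```

For c = 0 the resulting probabilities are symmetric and decay monotonically away from the origin, as expected.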
Gaussian sampling over the integers Z can be formulated as a multilevel sampling problem over a binary partition chain Z ⊃ 2Z ⊃ 4Z ⊃ · · · ⊃ 2^r Z ⊃ · · ·, each level of which is labeled by X_1, X_2, . . . , X_r, . . . (see Fig. 4). The discrete Gaussian distribution over the integers Z then induces a distribution P_{X_{1:r}} whose limit is exactly D_{Z,c,s} as r goes to infinity. By cutting off the tail area of negligible probability, a discrete Gaussian distribution over the integer lattice Z can be reduced to a distribution over a finite set. For example, if the cutoff points of D_{Z,0,s} with s = 3√(2π) are ±16, the left and right tail areas are approximately 2^{−20}.
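One convenient labeling of the partition chain assigns X_k the k-th bit of x mod 2^r; a sketch (illustrative, the paper's partition tree may fix the labeling differently):

```python
def partition_labels(x, r):
    """Label an integer by its coset chain Z ⊃ 2Z ⊃ 4Z ⊃ ...:
    X_k is the k-th bit of x mod 2^r, so X_1 says whether x is even or
    odd, (X_1, X_2) identifies the coset of 4Z, and so on."""
    m = x % (2 ** r)
    return tuple((m >> (k - 1)) & 1 for k in range(1, r + 1))
```

The map is one-to-one on any window of 2^r consecutive integers, which is what lets the multilevel sampler reassemble an integer from its r Bernoulli labels.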
We now demonstrate how PolarSampler(·) can be used for discrete Gaussian sampling, as in Algorithm 4 GaussianSampling(·). Suppose r levels of partition are employed to approximate D_{Z,c,s}. The chain rule of conditional probability and the chain rule of conditional entropy, i.e.

P_{X_{1:r}}(x_{1:r}) = ∏_{k=1}^{r} P_{X_k | X_{1:k-1}}(x_k | x_{1:k-1}) and H(X_{1:r}) = ∑_{k=1}^{r} H(X_k | X_{1:k-1}),

imply that the Gaussian distribution over the finite constellation can be generated level by level. For the k-th level, we sample from the component source X_k using PolarSampler(·), given the samples x_{1:k-1} from lower levels as side information. GaussianSampling(·) has access to a bias table c = [c_1, · · · , c_r], where c_k collects the conditional biases P(X_k = 1 | X_{1:k-1} = x_{1:k-1}).
The high-and low-entropy sets H X k |X 1:k−1 and L X k |X 1:k−1 are computed according to the bias table c offline. We call this stage the construction stage of GaussianSampling(·). The online stage of GaussianSampling(·) draws Bernoulli samples level by level using PolarSampler(·) and we call it the implementation stage.
For the first level, we want to generate the component source X 1 in the absence of any side information.
1. Construction. By performing the source polarization transform G_N on N i.i.d. copies of X_1, we obtain an N-dimensional vector U^{1:N}_1 = X^{1:N}_1 G_N. For any β ∈ (0, 1/2) and α = 2^{−N^β}, we formally define two sets H_{X_1} and L_{X_1} as

H_{X_1} = {i : Z(U^{(i)}_1 | U^{1:i-1}_1) ≥ 1 − α} and L_{X_1} = {i : Z(U^{(i)}_1 | U^{1:i-1}_1) ≤ α}.

2. Implementation. For any i ∈ H_{X_1}, U^{(i)}_1 is drawn uniformly from {0, 1}; for unpolarized i ∈ H^c_{X_1} \ L_{X_1}, U^{(i)}_1 is drawn from the posterior P_{U^{(i)}_1 | U^{1:i-1}_1}; and for i ∈ L_{X_1}, U^{(i)}_1 takes its maximum a posteriori value. Once we have a realization u^{1:N}_1 of U^{1:N}_1, we derive a realization x^{1:N}_1 = u^{1:N}_1 G_N of X^{1:N}_1 and pass it to the next level for further processing. For higher levels k ∈ (1, r], our task is to generate N i.i.d. samples of the source X_k given the side information x^{1:N}_{1:k-1} generated at the previous k − 1 levels.
At level k, the construction stage computes the Bhattacharyya parameters Z(U^{(i)}_k | X^{1:N}_{1:k-1}, U^{1:i-1}_k) to define H_{X_k|X_{1:k-1}} and L_{X_k|X_{1:k-1}}. The sampler then generates N i.i.d. copies of X_k by applying the polarization transform circuit to a vector U^{1:N}_k, each entry of which takes a value according to the rules in (28) and (29): uniformly random bits for the high-entropy indices, posterior sampling for the unpolarized indices, and maximum a posteriori decisions for the low-entropy indices. Once we have a realization u^{1:N}_k of U^{1:N}_k, we derive a realization x^{1:N}_k = u^{1:N}_k G_N of X^{1:N}_k and pass it to the next level for further processing. Recall that the approximation error at each level is determined by the parameters β and N; to achieve a target closeness between the ideal distribution and the one we can produce, the two sets H_{X_k|X_{1:k-1}} and L_{X_k|X_{1:k-1}} for each level are properly chosen and determined offline. By repeating these operations from level 2 to level r, we finally obtain N samples x^{1:N} approximately distributed as D_{Z,c,s}. Given the two parameters N and β, the closeness between the ideal distribution and the one our sampler produces will be analysed in the next section.

The approximation error model
In a concrete implementation, an ideal discrete Gaussian distribution is replaced by an approximation. To give a sharp estimation of the accuracy/security of a cryptographic primitive, the closeness between the ideal distribution and its approximation should be measured. In this section, we will derive the upper bounds on the closeness between the ideal distribution and the one generated by our sampling scheme measured by KL divergence.
The approximation error comes from two sources, the tailcut (owing to finite levels of partitions) and the polar sampling. On the one hand, we need to decide how many levels of partitions are needed. On the other hand, the error introduced by polar sampling should also be analysed. Denote by D Z,c,s the target discrete Gaussian distribution and we decide to employ r levels of partition. If polar sampling did not introduce any error, we would generate a distribution P X 1:r with a closeness measure δ(D Z,c,s , P X 1:r ) which is determined only by r for some metric δ. In reality, polar sampling produces a distribution Q X 1:r (x 1:r ) and it introduces an error of δ(P X 1:r , Q X 1:r ). We will in this section analyse the above two closeness quantities using Kullback-Leibler (KL) divergence.
Since the KL divergence does not satisfy the triangle inequality, we will give D K L (D Z,c,s P X 1:r ) and D K L (P X 1:r Q X 1:r ) separately rather than a total KL divergence D K L (D Z,c,s Q X 1:r ). However, as discussed in [31,Chapter 3] the lack of symmetry and triangle inequality can be handled by a KL-based security argument which will be presented in Sect. 5.
The smoothing parameter quantifies how large s must be for D_{Z,c,s} to behave like a continuous Gaussian distribution. It is implied by Definition 4 that, for any ε > 0, the smoothing parameter η_ε(Z) of Z is the smallest s such that ρ(sZ) ≤ 1 + ε.

Lemma 2 ([14, Lemma 4.2]) For any ε > 0, any s ≥ η_ε(Z), and any t > 0,

Pr_{x ∼ D_{Z,c,s}} [ |x − c| ≥ t·s ] ≤ 2 e^{−π t²} · (1 + ε)/(1 − ε).
Instead of sampling over the full domain of the integer lattice, a distribution tail of negligible probability is cut off in practice. Suppose 2^r points are left after the tailcut. Let

D^γ(a) = D_{Z,c,s}(a) / (1 − γ),

where a ∈ A and γ is the probability of the tail. This constellation A of 2^r points can be represented as a binary partition tree labeled by X_{1:r} in the same way as in Fig. 4. In our sampling scheme, we derive a sample labeled by x_{1:r}, and there exists a one-to-one mapping from X_{1:r} to A. Therefore P_{X_{1:r}} and the tailcut distribution D^γ are exactly the same, and we can obtain D_{KL}(D_{Z,c,s} ∥ P_{X_{1:r}}) by calculating the divergence between D_{Z,c,s} and D^γ. According to the second-order Taylor bound, if D_{Z,c,s}(a ∈ A) = 1 − γ for some 0 < γ < 1, this divergence is approximately γ, and so is D_{KL}(P_{X_{1:r}} ∥ D_{Z,c,s}).
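As a numeric sanity check of the tailcut, the tail mass γ can be summed directly; for s = 3√(2π) and cutoff points ±16 it lands well below 2^{−20} (our own script; the summation window is a choice wide enough for the truncation error to be irrelevant):

```python
import math

def tail_mass(s, cutoff, window=80):
    """Mass of D_{Z,0,s} outside [-cutoff, cutoff], by direct summation
    of rho_s over a window wide enough to make truncation negligible."""
    rho = {x: math.exp(-math.pi * x * x / (s * s))
           for x in range(-window, window + 1)}
    total = sum(rho.values())
    return sum(p for x, p in rho.items() if abs(x) > cutoff) / total
```

The quoted 2^{−20} in Sect. 3 is the bound from Lemma 2; the exact tail mass is even smaller.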

Approximation error from polar sampling
The target discrete Gaussian distribution is tailcut to P_{X_{1:r}}, which is exactly a distribution of r Bernoulli bits. Let P_{X^{1:N}_{1:r}} denote the distribution of N i.i.d. copies of X_{1:r}. As discussed in Sect. 3, for properly chosen 0 < β < 0.5, N = 2^n, n ≥ 1 and the corresponding high- and low-entropy sets defined in (26) and (27), one can approximate P_{X^{1:N}_{1:r}} in a level-by-level manner by running PolarSampler(·) r times, giving rise to the produced distribution Q_{X^{1:N}_{1:r}}.
Recall from Theorem 3 that an intermediate distribution $\bar{Q}$ was introduced to analyse the KL divergence between $P$ and $Q$. Likewise, for every $1 \le k \le r$ we introduce an intermediate distribution $\bar{Q}_{X^{1:N}_k}$ in which only the deterministic decisions for $U^{1:N}_k$ are kept. By (a) the one-to-one mapping from $X^{1:r}$ to $A$, the chain rule of KL divergence over the $r$ levels, and (e) the per-level bound of Theorem 3, one obtains
$$D_{KL}(P_{X^{1:r}_{1:N}} \| Q_{X^{1:r}_{1:N}}) \le r \cdot 2^{-2^{n\beta}+n+1} = 2^{-2^{n\beta}+n+\log r+1}.$$
An explicit security analysis based on the KL divergences $D_{KL}(P_{X^{1:r}} \| D_{\mathbb{Z},c,s})$, $D_{KL}(P_{X^{1:r}_{1:N}} \| \bar{Q}_{X^{1:r}_{1:N}})$ and $D_{KL}(\bar{Q}_{X^{1:r}_{1:N}} \| Q_{X^{1:r}_{1:N}})$ will be given in the sequel.

Remark 1
According to the KL-based closeness, GaussianSampling(·) can approximate $D_{\mathbb{Z}^{1:N},c,s}$ arbitrarily well for sufficiently large $N$ and properly chosen $\beta$ and $r$. We highlight that the multilevel polar sampler is attractive in applications consuming many discrete Gaussian samples per query. There are plenty of applications of this kind in lattice-based cryptography, the prominent one being fully homomorphic encryption (FHE), where the dimension $N$ can reach tens of thousands. Even for lattice signature schemes (e.g. BLISS, Falcon), $N = 512, 1024$ is quite common, and one may generate a batch of samples (except on embedded devices).

Security analysis with KL divergence
Lemma 3 (Bounding success probability variations, [30]) Let $\mathcal{E}^P$ be an algorithm making at most $q$ queries to an oracle sampling from a distribution $P$ and returning a bit. Let $\epsilon \ge 0$, and let $Q$ be a distribution such that $D_{KL}(P \| Q) < \epsilon$. Let $x$ (resp. $y$) denote the probability that $\mathcal{E}^P$ (resp. $\mathcal{E}^Q$) outputs 1. Then $|x - y| \le \sqrt{q\epsilon/2}$.

Security argument [31]. It can be concluded from Lemma 3 that if a scheme is λ-bit secure with oracle access to a perfect distribution $P$, and the KL divergence between $P$ and another distribution $Q$ satisfies $D_{KL}(P \| Q) \le 2^{-\lambda}$, then this scheme is also about λ-bit secure with oracle access to $Q$. Note that this security argument holds only if $\mathcal{E}$ solves a search problem, not a decisional one. Although the KL divergence itself is neither symmetric nor a metric, the KL-based security argument does satisfy symmetry and a triangle inequality (see [31, Sect. 3.2] for details).
Consider a scheme with access to a perfect distribution $D_{\mathbb{Z}^N,c,s}$ that is λ-bit secure, and assume an adversary obtains $N$ samples at each query. By the additivity of KL divergence and Eq. (31), we have $D_{KL}(D_{\mathbb{Z}^{1:N},c,s} \| P_{X^{1:r}_{1:N}}) \le N(\gamma + \gamma^2)$. In order to achieve λ-bit security after the tailcut, we need to set $N(\gamma + \gamma^2) \approx 2^{-\lambda}$ by selecting $t \approx \sqrt{(\lambda + \log N)\ln 2/\pi}$. The number of levels needed is therefore $r = \lceil \log(2t \cdot s) \rceil$. As given in Sect. 4.3, the approximation error introduced by polar sampling is determined by both $D_{KL}(P_{X^{1:r}_{1:N}} \| \bar{Q}_{X^{1:r}_{1:N}})$ and $D_{KL}(\bar{Q}_{X^{1:r}_{1:N}} \| Q_{X^{1:r}_{1:N}})$. In order to preserve λ-bit security after $P_{X^{1:r}_{1:N}}$ is replaced by $Q_{X^{1:r}_{1:N}}$, we need to select $n = \log N$ and β properly such that $2^{-2^{n\beta}+n+\log r+1} \approx 2^{-\lambda}$. Figure 6 shows how the security level of interest is related to β for different $s$ and $n$. We observe that curves with the same $n$ but different $s$ are quite close. This is understandable because the security depends on the approximation error, which is almost independent of the target distribution when the proposed GaussianSampling(·) is used. We also find that, to preserve λ = 128 bits of security, a larger $n$ implies a smaller β. This is because a larger $n$ means deeper polarization and therefore a smaller unpolarized set (i.e. a smaller β). This observation is instructive when selecting parameters for GaussianSampling(·). At the implementation stage, the entropy consumption to produce $U^{1:N}$ is determined by the sizes of the polarized and unpolarized sets. Given optional choices of β and $n$ that preserve λ-bit security, we suggest the smaller β for less entropy consumption per sample.
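The parameter selection of $t$, $r$ and β described above can be sketched as follows. This is an illustrative helper of ours, not the paper's implementation; the 0.01 grid over β is an arbitrary choice.

```python
import math

def select_params(lam, n, s):
    """Illustrative KL-based parameter selection (our helper, not the paper's code)."""
    N = 2 ** n                                   # samples per query
    # Tailcut: N*(gamma + gamma^2) ~ 2^-lam  =>  t ~ sqrt((lam + log N) * ln 2 / pi)
    t = math.sqrt((lam + n) * math.log(2) / math.pi)
    r = math.ceil(math.log2(2 * t * s))          # number of partition levels
    # Polar-sampling error: smallest beta (on a 0.01 grid) such that
    # 2^(-2^(n*beta) + n + log r + 1) <= 2^-lam; valid choices satisfy beta < 0.5.
    beta = 0.0
    while -(2 ** (n * beta)) + n + math.log2(r) + 1 > -lam:
        beta += 0.01
    return t, r, beta

t, r, beta = select_params(lam=64, n=13, s=8 * math.sqrt(2 * math.pi))
assert 0 < beta < 0.5
assert -(2 ** (13 * beta)) + 13 + math.log2(r) + 1 <= -64
```

Consistent with the discussion of Fig. 6, rerunning the search with a larger $n$ returns a smaller β, since deeper polarization shrinks the unpolarized set.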

Security analysis of tailcut and precision with Rényi divergence
The KL-based security analysis is reminiscent of an analysis based on the Rényi divergence (RD). However, the approximation error of the proposed sampler measured by RD is not given in this work, because how the polarization phenomenon converges under the RD metric is an open problem in coding theory. Nonetheless, RD can still be used to analyse the tailcut and the precision.
Definition 5 (Rényi divergence) Let $P, Q$ be two distributions with supports $S_P$ and $S_Q$, respectively, such that $S_P \subseteq S_Q$. For $a \in (1, +\infty)$, the Rényi divergence of order $a$ is defined as
$$R_a(P \| Q) = \left( \sum_{x \in S_P} \frac{P(x)^a}{Q(x)^{a-1}} \right)^{\frac{1}{a-1}}.$$

In [32], a sharper security bound based on the Rényi divergence is given under the assumption that the number of adversarial queries $q$ to a λ-bit secure scheme is far less than $2^\lambda$. Consider a cryptographic scheme with λ bits of security whose number of queries to $Q$ satisfies $q \le 2^{64}$. By the security argument in [32], this scheme loses at most one bit of security when $Q$ is replaced by $Q_\gamma$, provided that one of the following conditions is satisfied.
(a) If $Q_\gamma$ is the distribution under tailcut, then the tail probability γ must satisfy the tailcut bound of [32]. (b) If $Q_\gamma$ denotes a distribution having the same support as $Q$ subject to some relative error γ, then γ must satisfy the relative-error bound of [32]. In the context of multilevel polar sampling, if $t \approx \sqrt{(66+\log N)\ln 2/\pi}$ and $r = \left\lceil \log\left(2s\sqrt{(66+\log N)\ln 2/\pi}\right) \right\rceil$, a (λ+1)-bit secure scheme remains at least λ-bit secure when the ideal distribution $D_{\mathbb{Z},c,s}$ is replaced by its tailcut $P_{X^{1:r}}$, for λ = 128, $q = 2^{64}$ and $a = 2\lambda$. Compared with the KL-based analysis of the tailcut, the Rényi divergence thus yields a smaller $r$, reducing the partition by at most one level. Moreover, the security argument for relative error translates to $\gamma \le 2^{-36.5}$ for λ = 128, $q = 2^{64}$ and $a = 2\lambda$ [32]. The precision requirement for the polar sampler to achieve the target security is determined by the Bernoulli sampling steps in formulas (28) and (29). A practical approach to drawing a Bernoulli sample is to calculate the bias subject to the relative precision provided, then sample $q \in (0, 1)$ uniformly at random and output 1 if $q$ is below the bias. As long as the relative precision provided for the LR recursions in Algorithm 2 and for the biases computed in formulas (28) and (29) exceeds 36 bits, the polar sampler can achieve λ = 128. In our application, it suffices to use double precision in the LR recursions, which provides 52 bits of relative precision. The sampler can also be implemented with fixed-point numbers of 64 bits of precision, particularly on 64-bit architectures.
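The fixed-point Bernoulli step mentioned above can be sketched as follows. This is our illustration: `bernoulli_fixed_point` is a hypothetical helper, and the statistical check at the end is only a sanity test, not a proof of precision.

```python
import random

def bernoulli_fixed_point(p, rng, prec_bits=64):
    """Draw a Bernoulli(p) sample by comparing a uniform prec_bits-bit
    integer against the bias scaled to the same fixed-point grid."""
    threshold = int(p * (1 << prec_bits))      # bias with prec_bits of precision
    q = rng.getrandbits(prec_bits)             # uniform in [0, 2^prec_bits)
    return 1 if q < threshold else 0

rng = random.Random(2024)
samples = [bernoulli_fixed_point(0.3, rng) for _ in range(100_000)]
freq = sum(samples) / len(samples)
assert abs(freq - 0.3) < 0.01                  # empirical bias close to 0.3
```

On a 64-bit architecture the comparison `q < threshold` is a single integer comparison, which avoids the timing concerns attached to floating-point comparisons discussed in Sect. 6.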

A constant-time algorithm
Given a probabilistic or deterministic algorithm, we consider it to be constant-time if its execution time is independent of the sensitive part of its input and output [17]. Instead of making every operation finish in exactly a constant interval, we protect sensitive information from being recovered by timing-based side-channel attacks. The proposed GaussianSampling(·) algorithm is composed of multiple serially connected instances of PolarSampler(·). We now study which parts of PolarSampler(·) are constant-time and which are not. The input of PolarSampler(·) includes the block length $N$, the side information vector $y^{1:N}$, a precomputed bias table $c$ and the corresponding high- and low-entropy sets $\mathcal{H}_{X|Y}$, $\mathcal{L}_{X|Y}$; it yields a Bernoulli sample vector $x^{1:N}$. Normally, we do not expect to disclose either the Bernoulli biases or the output samples. Therefore, it makes sense to consider $N$ and $y^{1:N}$ non-sensitive, as they are irrelevant to the target biases and the output samples, whereas the bias table $c$ is sensitive; the sensitivity of each input and output is summarized in Table 1.
Firstly, the table lookup for $c[y^{1:N}]$ may remind us of the cache-based attack breaking BLISS, which exploits the weakness of the CDT table search and of Bernoulli sampling with a precomputed table of exponential values [8]. Fortunately, this is not the case for PolarSampler(·), since the bias vector $c[y^{1:N}]$ is indexed by the side information $y^{1:N}$, which is non-sensitive. Secondly, if the block length is $N$, the SC decoding carries out exactly $\frac{N \log N}{2}$ LR calculations as in formula (13), along with exactly $\frac{N \log N}{2}$ LR calculations as in formula (14), regardless of the input and output. Concerns arise from the floating-point instructions, and a comprehensive analysis is given as follows.
- If we assume floating-point instructions to be constant-time as in [17], or if we make a constant-time transformation of those floating-point calculations (e.g., the CTFP transformation [2] or replacement by fixed-point instructions), the LR calculations are safe.
- If we only consider floating-point division hard to implement in constant time, as in [33,39], a division-free alternative to the LR calculations with the same computational complexity is given in Appendix B, and the above concern is eliminated.

Thirdly, lines 6, 9 and 12 of Algorithm 1 are not constant-time with respect to the input if no further measures are taken. Sampling for the high- and low-entropy sets is easy, as it consumes only 1 bit of randomness for $\mathcal{H}_{X|Y}$ and 0 bits for $\mathcal{L}_{X|Y}$, whereas sampling for $\mathcal{H}^c_{X|Y} \setminus \mathcal{L}_{X|Y}$ is more complicated and takes longer. Therefore, the running time of lines 6, 9 and 12 may reveal the proportions of $\mathcal{H}_{X|Y}$, $\mathcal{L}_{X|Y}$ and $\mathcal{H}^c_{X|Y} \setminus \mathcal{L}_{X|Y}$, which in turn discloses some information about the biases. As for the output, we consider it safe regardless of the running time of lines 6, 9 and 12. The weak points are the floating-point comparisons in lines 9 and 12. Generally speaking, comparing two close floating-point values takes longer, but the comparison is nonetheless equally likely to return True and False. It therefore makes sense to consider this type of operation unsafe with respect to the input but safe with respect to the output.
Lastly, calculating $x^{1:N} = u^{1:N} G_N$ takes exactly $N \log N$ binary additions, which is obviously constant-time.
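The encoding $x^{1:N} = u^{1:N} G_N$ can be sketched with the standard butterfly recursion for the $2 \times 2$ kernel $F = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$ (our illustrative Python; the bit ordering follows the recursion itself). The same XOR operations are performed for every input, which makes the constant operation count evident.

```python
def polar_transform(u):
    """Compute x = u * G_N over GF(2) for the 2x2 kernel F = [[1,0],[1,1]],
    via the standard butterfly recursion."""
    n = len(u)
    if n == 1:
        return list(u)
    half = n // 2
    a, b = u[:half], u[half:]
    # Upper branch carries a XOR b, lower branch carries b.
    left = polar_transform([x ^ y for x, y in zip(a, b)])
    right = polar_transform(b)
    return left + right

# G_N is an involution over GF(2): applying the transform twice is the identity.
u = [0, 1, 1, 0, 1, 0, 0, 1]
assert polar_transform(polar_transform(u)) == u
assert polar_transform([0, 1, 0, 0]) == [1, 1, 0, 0]
```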
To conclude, we claim that PolarSampler(·) is constant-time in the sense that its running time is independent of the output samples. The same statement holds for the proposed integer Gaussian sampler when PolarSampler(·) is adopted as in Algorithm 4, since the number of levels $r$ is determined only by the width of the integer Gaussian distribution. We stress, however, that if no further measures are taken, the floating-point implementation of PolarSampler(·) is not constant-time with respect to the input.

Time complexity
The latest trend in DGS solutions is to expand a base sampler into one for arbitrary parameters. For example, the Knuth-Yao and CDT samplers can serve as base samplers producing samples which are then combined, in a convolutional manner, into new samples with a relatively large standard deviation [26,30]. Our Gaussian sampler is also eligible for such an extension for potential speedup. The focus of this subsection is to compare GaussianSampling(·) with other base samplers. Karmakar et al. [19] compared the time complexity of Knuth-Yao and CDT, showing that the former can be made faster. It is therefore fair to compare our Gaussian sampler only with Knuth-Yao, and we used a non-constant-time Knuth-Yao implemented in C++ 7 as well as its constant-time version. 8 We also give benchmarks of a prototype GaussianSampling(·) for different choices of the parameter $s$ in C++. 9 Note that we substitute the probability recursion for the LR recursion (see Appendix B) to avoid floating-point divisions.
The experiment is conducted on a PC with Ubuntu 18.04 and an Intel i9-9900K processor running at 3.60 GHz using one core. We use g++ to compile both Knuth-Yao and our implementation with the compilation flag -Ofast enabled. For the benchmarks, we select $s \in \sqrt{2\pi} \cdot \{3, 8, 32, 256\}$ and a target security level λ = 64. 10 According to the KL-based security analysis in Sect. 5, we specify β to achieve 64 bits of security with respect to $N \in \{2^{13}, 2^{14}, 2^{15}\}$, and we select $r = \lceil \log(2st) \rceil$ where $t \approx \sqrt{(\lambda + \log N)\ln 2/\pi}$. We assume that the adversary obtains $N$ integer samples for each query to the sampling algorithm. The simulation results are shown in Table 2. Firstly, our Gaussian sampler always outperforms Knuth-Yao in speed in the above setting. Secondly, Knuth-Yao slows down almost linearly as $2^r$ grows, while GaussianSampling(·) still provides a competitive speed. This advantage stems from the binary partition of the integers. Thirdly, GaussianSampling(·) shows a modest speed reduction as the block length increases from $2^{13}$ to $2^{15}$. This does not contradict the asymptotic information optimality, which implies less randomness consumption per sample for larger $N$: the polarization effect reduces the consumption of randomness and contributes to the speed, but the overall running time of the current implementation is dominated by the floating-point recursions in Fig. 3 with complexity $O(N \log N)$. In the literature, practical approaches to boosting the butterfly circuit include the semi-parallel design [21] and pruned SC decoding [1,38], reducing the per-bit computational complexity to $O(\log \log N)$.

Memory cost
At the construction stage, we need to store a table $c$ of biases, i.e., $P(x_k = 1 \mid x^{1:k-1})$ for $1 \le k \le r$. The table consists of $2^r - 1$ elements for the overall $r$ levels. The biases are stored in the natural order of $x^{1:k-1}$ so that, once the samples $x^{1:k-1}$ for the preceding $k-1$ levels are ready, we can find the associated bias in $c$ by moving the pointer by offset $x^{1:k-1}$.

7 https://github.com/AaronHall4/BKW-Algorithm.
8 https://github.com/jnortiz/HIBE-Gaussian-Sampling.
9 https://github.com/jwangit/polarsampler.
10 GaussianSampling(·) is extendable to other security levels (e.g. 128, 256) in our double-precision setting as discussed in Sect. 5.2.
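The natural-order layout of the bias table can be sketched as follows (our toy Python; `build_bias_table` and the 2-level distribution are illustrative). Level $k$ occupies offsets $2^{k-1}-1$ to $2^k-2$, so the prefix bits directly index the bias within the level.

```python
def build_bias_table(P, r):
    """Store P(x_k = 1 | x_1..x_{k-1}) in natural order of the prefix.
    P maps r-bit tuples to probabilities; the table has 2^r - 1 entries."""
    c = [0.0] * (2 ** r - 1)
    for k in range(1, r + 1):
        base = 2 ** (k - 1) - 1                  # start of level k in the table
        for prefix in range(2 ** (k - 1)):
            # Prefix bits, most significant first.
            bits = [(prefix >> (k - 2 - i)) & 1 for i in range(k - 1)]
            mass = {0: 0.0, 1: 0.0}
            for x, p in P.items():
                if list(x[: k - 1]) == bits:
                    mass[x[k - 1]] += p
            total = mass[0] + mass[1]
            c[base + prefix] = mass[1] / total if total > 0 else 0.0
    return c

# Toy distribution over 2 levels (r = 2); the table holds 2^2 - 1 = 3 biases.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
c = build_bias_table(P, r=2)
# Level 1: P(x1 = 1) = 0.5; level 2: P(x2 = 1 | x1 = 0) = 0.2, P(x2 = 1 | x1 = 1) = 0.6.
assert abs(c[0] - 0.5) < 1e-9 and abs(c[1] - 0.2) < 1e-9 and abs(c[2] - 0.6) < 1e-9
```

At the implementation stage the lookup is then a single pointer offset by the integer value of the sampled prefix, matching the description above.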

Asymptotic comparison
We also give in Table 3 an asymptotic comparison of the entropy consumption, computational complexity and storage complexity of GaussianSampling(·) and existing samplers, e.g. the binary sampling algorithm [11], the constant-time CDT [16] and the constant-time Knuth-Yao sampler [19].
The overall running time of GaussianSampling(·) depends on the SC decoding (LR recursions) and the Bernoulli sampling (entropy consumption). As indicated in Corollary 1, the fraction of the unpolarized set scales as $N^{-\frac{1}{\mu}}$. When $N \to \infty$, the average entropy consumption approaches $H(X)$, the Shannon entropy of the target distribution $X$.
For the binary sampling in BLISS [11], one sample requires an entropy consumption of approximately $6 + 3\log s$ bits. If we use a full-table-access CDT for the base sampler and a full-table-access Bernoulli sampler, the computational complexity is $O(\log(t \cdot s_0))$ (where $s_0$ is the parameter of the base sampler) for the table lookup, plus a constant number of integer operations, and the entropy cost is much greater than $6 + 3\log s$. Falcon uses a bimodal Gaussian and rejection sampling, which should have a complexity similar to binary sampling.
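To illustrate the gap between the entropy bound $H(X)$ and the binary-sampling cost $6 + 3\log s$, the following toy computation (ours, with an arbitrary numerical truncation of the support) evaluates the Shannon entropy of $D_{\mathbb{Z},0,s}$ for one of the benchmark widths:

```python
import math

def discrete_gaussian_entropy(s, tail=500):
    """Shannon entropy (in bits) of D_{Z,0,s}, numerically truncated."""
    rho = [math.exp(-math.pi * x * x / (s * s)) for x in range(-tail, tail + 1)]
    Z = sum(rho)
    return -sum((p / Z) * math.log2(p / Z) for p in rho if p > 0)

s = 32 * math.sqrt(2 * math.pi)                 # one of the benchmark widths
H = discrete_gaussian_entropy(s)
binary_sampling_cost = 6 + 3 * math.log2(s)     # entropy cost quoted for BLISS [11]
# The entropy bound H(X) is well below the binary-sampling cost.
assert H < binary_sampling_cost
```

Since the multilevel polar sampler approaches $H(X)$ asymptotically, this gap is exactly the randomness saving discussed above.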
The full-table-access CDT [16] has a computational complexity of $O(\log(t \cdot s))$ and requires a storage of $O(\lambda \cdot t \cdot s)$. The constant-time Knuth-Yao in [19], a variant of the standard Knuth-Yao in [12] that performs entirely different types of operations, has a computational complexity of $O(\log(t \cdot s))$ Boolean function evaluations with λ random bits of input, and its entropy consumption is $O(\lambda)$. It requires a large programmable memory to store $O(\log(t \cdot s))$ Boolean functions.
In conclusion, we claim that our multilevel polar sampler achieves information-theoretic optimality (i.e., asymptotically optimal entropy consumption), which compares favorably with other samplers. For most sampling methods, the overall speed depends on both the computational complexity and the randomness generation (e.g. producing Bernoulli samples in the binary sampling method). The computational complexity of our sampler is less attractive due to the $\log N$ factor, but (a) the multilevel polar sampler saves the time of producing randomness, which is accounted as entropy consumption rather than computational complexity, and (b) our experiments in Table 2 show that the multilevel polar sampler is much faster than a standard Knuth-Yao [12]. In addition, there is room to improve the computational efficiency, seeing that the state-of-the-art SC decoding achieves a per-bit complexity of $O(\log \log N)$ [38].

Conclusions and future work
The polar sampler and its multilevel application to DGS are efficient, application-independent and constant-time. Our algorithm is effective when a large number of samples from a certain distribution, e.g., an integer Gaussian, are required. The optimization of the entropy consumption stems from the polarization process, in which the randomness moves to the high-entropy set. For fixed parameters, the construction stage is prepared offline and the implementation stage is carried out online. The floating-point implementation given in this work is constant-time in the sense that its running time is independent of the output samples. KL and Rényi divergences are used for security analysis, precision analysis and parameter selection. Since the Rényi divergence-based analysis of polar coding is still an open problem and only the KL-based analysis is used by now, it deserves more effort to give a complete Rényi divergence-based analysis for our sampler and to carry out potential efficiency improvements.

[Table 3 near here: comparison of entropy, computational and storage complexity between the multilevel polar sampler and existing samplers (λ: precision; $t \cdot s$: tailcut width), including binary sampling [11] with $O(\log s)$ complexity, the constant-time CDT [16] with $O(\log(t \cdot s))$ complexity, the constant-time Knuth-Yao [19] and pruned SC decoding [38].]
In this paper, we only use the basic 2 × 2 kernel, whose finite-length performance is not the best. Optimizing finite-length performance using other kernels of polar codes as well as adapting the pruned/semi-parallel SC decoding are left to future work.
Funding This work was funded by the UK Engineering and Physical Sciences Research Council (Grant EP/S021043/1).
Data availability All data generated or analysed during this study are included in this published article.

Conflict of interest
The authors have no conflicts of interest that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
where temp = UReg[2κ − 1, ψ][m] and ⊕ denotes the XOR operation. Formulas (34), (35), (36) and (37) substitute for the LR calculations in formulas (13) and (14), respectively, and lines 5 and 6 of Algorithm 2 are adapted accordingly. After the $N$ probabilities in the leftmost column of PReg$_0$ and PReg$_1$ are derived, the probabilistic/deterministic sampling in lines 9 and 12 of Algorithm 1 is replaced by