
1 Introduction

Fully homomorphic encryption (FHE) allows arbitrary computations to be performed on encrypted data without decrypting it. The first construction was proposed in 2009 by Gentry [20], who introduced a new technique called bootstrapping to handle the noise propagation in ciphertexts. Although much effort has been devoted to improving FHE since this first proposal, it remains too slow for real-world applications. The most promising constructions are [5, 21, 30]. We focus on constructions based on the LWE problem, introduced by Regev in 2005 [28], and its ring variants [25]. Some public implementations are available, namely HElib [22, 23], FV-NFLlib [24] and SEAL [29], based on BGV [5, 18], and FHEW [17] and TFHE [14], based on GSW [12, 17, 21].

BGV-based schemes generally use slow operations, but they can process many bits at the same time, so they can pack and batch many operations in a SIMD manner, as in GPUs. However, the set of operations that are efficient in BGV is heavily constrained by the parameter set. Some parameters allow very fast vectorial sums and products modulo a fixed modulus (as in AES). But with these same parameters, a comparison, a classical addition, the extraction of one bit, or more complicated bit operations (as in SHA-256) are very slow.

On the other hand, recent developments have shown that GSW operations can evaluate independent elementary operations on bits very fast, as in a CPU. In the TFHE scheme (presented in [12] and based on GSW [21] and its ring variant [17]), the elementary operations are all the binary gates. It is therefore easy to represent any function that has few gates, and the running time is simply proportional to their number. A few methods have been proposed to perform multibit or packed/batched operations with GSW-based schemes. For instance, [4] extends the bootstrapping of FHEW [17] to evaluate non-linear functions with a few input bits. Unfortunately, the parameter sizes must increase exponentially with the number of bits in the plaintext space. Until this work, it was not clear how to perform efficient evaluations on packed data or batched operations, as is possible in BGV-based schemes.

Homomorphic encryption falls into two families: leveled (LHE) and fully (FHE) homomorphic encryption. Informally, in LHE, for each function there exist parameters that can homomorphically evaluate it.Footnote 1 The structure of the function to be evaluated (multiplicative depth in BGV, or depth of compositions of branching algorithms in GSW) translates into a noise overhead, and the parameters must be chosen large enough to support this noise bound. This concept is represented in the paper by the notion of parameter levels. In FHE, a single parameter set allows the evaluation of any function. This generalized definition implies that FHE is a particular case of LHE.

In many FHE schemes, the elementary operations consist of leveled gates with a symmetric noise propagation formula, where non-linear gates cost more than linear ones. The papers [3, 26] improve the efficiency of fully homomorphic implementations by optimizing the placement of bootstrappings between the gates throughout the circuit. This strategy does not really apply to GSW schemes, which strongly rely on the asymmetric noise propagation formula and in which all circuits are expressed as deterministic automata or branching algorithms, because the depth of the circuit has a very small impact on the noise.

The TFHE construction of [12] proposes two modes of operation: an FHE mode composed of bootstrapped binary gates, and an LHE mode which can evaluate a deterministic automaton or a branching algorithm and which supports a very large depth of transitions.Footnote 2 Note however that in the LHE mode of [12], inputs and outputs have different types, which makes it non-composable. In this paper we optimize both the FHE and LHE modes, and we remove the non-composability constraint.

Our Contribution. In this paper, we improve the TFHE construction of [12] for both FHE and LHE modes.

We first propose a blind rotation algorithm that we describe in Sect. 2. In FHE mode, this algorithm contributes to the acceleration of the gate bootstrapping of [12], and its implementation is now included in the core of the TFHE library [14]. This algorithm is also one of the building blocks we use to improve the LHE mode of TFHE.

Because of the asymmetric noise propagation, operating over packed ciphertexts in GSW-based schemes is harder than in BGV-based schemes. We describe two different techniques, which we call horizontal packing and vertical packing, that can be used to improve the evaluation of leveled circuits. An arbitrary function from \(\{0,1\}^n\rightarrow \mathbb {T}^p\) can be represented as a truth table with p columns and \(2^n\) rows. By packing these coefficients horizontally, the homomorphic evaluation of the function can be batched, and the p outputs can be produced in parallel. This technique is classical, but it is only efficient for very large p. We propose another technique, called vertical packing, which packs the coefficients column-wise, and which achieves its maximal speed-up even when p is equal to 1.
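To make the idea concrete, here is a minimal plaintext sketch of a vertical-packing look-up (our own toy code, not the homomorphic version or the TFHE API): the \(2^n\) truth-table entries are packed column-wise into blocks of N entries, a CMux tree driven by the high-order input bits selects the right block, and the low-order bits pick the slot inside it. All parameter values and names here are illustrative.

```python
# Toy plaintext model of vertical packing. Homomorphically, each block
# would be one TRLWE ciphertext with N coefficient slots.
N = 4                                   # slots per block
n = 4                                   # input bits: 2^n = 16 table rows
table = [i % 3 for i in range(2 ** n)]  # an arbitrary function {0,1}^4 -> Z
blocks = [table[i:i + N] for i in range(0, 2 ** n, N)]

def cmux(bit, d1, d0):
    # homomorphic version: d0 + C boxdot (d1 - d0); here on plaintexts
    return d1 if bit else d0

def lookup(x):
    hi, lo = divmod(x, N)               # high bits choose the block,
    layer = blocks                      # low bits the slot inside it
    for j in range(2):                  # 2 = n - log2(N) tree levels
        b = (hi >> j) & 1
        layer = [cmux(b, layer[2 * i + 1], layer[2 * i])
                 for i in range(len(layer) // 2)]
    # homomorphically: BlindRotate by lo, then SampleExtract coefficient 0
    return layer[0][lo]

assert all(lookup(x) == table[x] for x in range(2 ** n))
```

The CMux tree costs one gate per remaining block, so the whole look-up needs \(2^n/N - 1\) CMux gates plus one rotation, independently of p.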

We also extend the deterministic finite automata framework proposed in [12, 19] by working with deterministic weighted finite automata. For most multibit arithmetic functions, such as addition, multiplication and maximum, the latter allow the whole output to be computed in a running time that would previously have produced only a single bit.

Indeed, when an arithmetic operation is evaluated by a deterministic automaton, the only bit of information that is retained is whether or not the destination state is accepting; the rest of the path (which contains a lot of information on the result) is forgotten. Thus, one automaton must be evaluated for each bit of the result. Instead, by assigning a vector of weights to each transition, we are able to retain enough information along the path to get all the bits of the result at once, in a single pass of the automaton. This decreases the complexity of these operations by at least one order of magnitude. Furthermore, we propose a new homomorphic counter (called TBSR) that homomorphically supports all the basic operations related to multiplication: incrementation, division by 2 and extraction of bits. This technique gives another speed-up by a factor equal to the bit-size of the input. We show how to use it to represent the \(O(d^2)\) (with d equal to the size of the input) schoolbook multi-addition or multiplication circuits, without increasing the homomorphic depth and with very low noise overhead.
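As a cleartext illustration of the weighted-automaton idea (our own sketch; homomorphically each transition would be a CMux selection on encrypted state), the schoolbook addition automaton has the carry as its only state, and attaching the sum bit as a weight to each transition yields every output bit in one pass:

```python
# Weighted automaton for d-bit binary addition (plaintext illustration):
# the state is the carry in {0,1}; each transition on the input pair
# (a_i, b_i) carries the sum bit as its weight, so a single pass over the
# input produces all the output bits instead of one automaton per bit.
def add_via_weighted_automaton(a_bits, b_bits):
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):         # one transition per position
        out.append(a ^ b ^ carry)            # weight on the transition
        carry = (a & b) | (carry & (a ^ b))  # next state
    out.append(carry)                        # final state gives the top bit
    return out

bits = lambda x, d: [(x >> i) & 1 for i in range(d)]   # LSB first
val = lambda bs: sum(b << i for i, b in enumerate(bs))
assert val(add_via_weighted_automaton(bits(13, 4), bits(7, 4))) == 20
```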

Our last contribution solves the main problem of the leveled mode of TFHE, namely its non-composability, due to the fact that inputs and outputs have different types: the inputs are \({\mathrm {RingGSW}}\) ciphertexts, while the outputs are \(\mathsf {LWE}\) ciphertexts. We introduce a new bootstrapping, called circuit bootstrapping, that transforms \(\mathsf {LWE}\) ciphertexts back into \({\mathrm {RingGSW}}\) ciphertexts that can be reused as inputs of leveled circuits. The implementation of this circuit bootstrapping is publicly available [14]; it runs in 137 ms, improving on all previous techniques. The circuit bootstrapping closes the loop and allows all the new techniques described above to be applied in FHE mode as well. To show how these techniques improve homomorphic evaluations, we propose several examples with concrete parameters and running times. For instance, we show that we can evaluate a 10-bit to 1-bit (\(\{ 0,1 \}^{10} \rightarrow \{ 0,1 \}\)) look-up table in \(340\,\upmu \text {s}\) and bootstrap the output in just 137 ms.

Paper organization. We first review mathematical definitions for the continuous \(\mathsf {LWE}\) and \({\mathrm {RingGSW}}\) encryption over the torus and review the algorithmic procedures for the homomorphic evaluation of gates. In particular, we extend the keyswitching algorithm to evaluate public or private \(\mathbb {Z}\)-module morphisms, and explain how it can be used to pack, unpack and move data across the slots of a ciphertext. In Sect. 3, we show various techniques to speed up operations on packed data: horizontal and vertical packing, our method to evaluate arithmetic functions via weighted automata, and our TBSR counter technique. In Sect. 4, we introduce our circuit bootstrapping algorithm, which makes it possible to connect gates of either \({\mathrm {RingGSW}}\) or \(\mathsf {LWE}\) type, and give the practical execution timings we have obtained. Section 5 summarizes all our complexity results for different parameter sets.

2 Preliminaries

This section introduces and revisits some basic concepts needed to understand the rest of the paper. The homomorphic constructions we present are based on the LWE problem, introduced by Regev in 2005 [28], and on the GSW construction, proposed by Gentry, Sahai and Waters in 2013 [21]. We use the generalized definitions of \(\mathsf {TLWE}\) and \(\mathsf {TGSW}\) (the T stands for the torus representation) proposed in [12], and extend some of their results.

2.1 Background on TFHE

We denote by \(\lambda \) the security parameter. The set \(\{0,1\}\) is written as \(\mathbb {B}\). The real torus \(\mathbb {R}/\mathbb {Z}= \mathbb {R}\mod 1\) of real numbers mod 1 is denoted by \(\mathbb {T}\). \(\mathfrak {R}\) is the ring \(\mathbb {Z}[X]/(X^N+1)\) of integer polynomials modulo \(X^N+1\), and \(\mathbb {T}_N[X]\) is the module \(\mathbb {R}[X]/(X^N+1) \mod 1\) of torus polynomials, where N is a power of 2. \(\mathbb {B}_N[X]\) denotes the subset of \(\mathfrak {R}\) of polynomials with binary coefficients. Note that \(\mathbb {T}\) is a \(\mathbb {Z}\)-module and that \(\mathbb {T}_N[X]\) is a \(\mathfrak {R}\)-module. The set of vectors of size n in E is denoted by \(E^n\), and the set of \(n\times m\) matrices with entries in E is denoted by \(\mathcal {M}_{n,m}(E)\). As before, \(\mathbb {T}^n\) (resp. \(\mathbb {T}_N[X]^n\)) and \(\mathcal {M}_{n,m}(\mathbb {T})\) (resp. \(\mathcal {M}_{n,m}(\mathbb {T}_N[X])\)) are \(\mathbb {Z}\)-modules (resp. \(\mathfrak {R}\)-modules).

Distance, Lipschitzian functions, Norms. We use the standard \(\ell _p\)-distance over \(\mathbb {T}\), and use the (more convenient but improper) \(\left\| x\right\| _p\) notation to denote the distance between 0 and x. Note that it satisfies \(\forall m\in \mathbb {Z}, \left\| m\cdot \varvec{x}\right\| _p\le |m| \left\| \varvec{x}\right\| _p\). For an integer or torus polynomial a modulo \(X^N+1\), we write \(\left\| a\right\| _p\) for the norm of its unique representative of degree \(\le N-1\). The notion of lipschitzian function always refers to the \(\ell _\infty \) distance: a function f is R-lipschitzian iff \(\left\| f(x)-f(y)\right\| _\infty \le R\left\| x-y\right\| _\infty \) for all inputs x, y.
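Concretely, the improper norm \(\left\| x\right\| \) on \(\mathbb {T}\) is the distance from x to the nearest integer. A one-line helper (our own illustration, with floats standing in for torus elements) makes this and the scaling inequality explicit:

```python
# l_infty "norm" on the real torus R/Z: distance to the nearest integer.
def torus_norm(x: float) -> float:
    f = x % 1.0
    return min(f, 1.0 - f)

# the stated inequality ||m*x|| <= |m|*||x|| for integer m
assert torus_norm(0.75) == 0.25
assert torus_norm(3 * 0.1) <= 3 * torus_norm(0.1) + 1e-12
```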

\(\varvec{\mathsf {TLWE}.}\) \(\mathsf {TLWE}\) is a generalized and scale-invariant version, over the torus \(\mathbb {T}\), of the LWE problem proposed by Regev in 2005 [28].

Given a small linear lipschitzian function \(\varphi _s\) from \(\mathbb {T}_N[X]^{k+1}\) to \(\mathbb {T}_N[X]\) (which depends on the secret key), which we call the phase function, the \(\mathsf {TLWE}\) encryption of \(\mu \in \mathbb {T}_N[X]\) simply consists of picking a ciphertext c which is a Gaussian approximation of a preimage \(\varphi _s^{-1}(\mu )\). If the Gaussian noise is small enough, the distribution of \(\varphi _s(c)\) (over the probability space \(\varOmega \) of all possible choices of Gaussian noise) remains concentrated around the message \(\mu \), i.e. included in a ball of radius \(<\frac{1}{4}\) around \(\mu \). Because this distribution is concentrated, the intuitive notions of expectation and variance, which would in general not exist over the torus, are properly defined: the expectation of \(\varphi _s(c)\) is the original message \(\mu \), and its variance is equal to the variance of the Gaussian noise added during encryption. We refer to [12] for a general definition of the \(\varOmega \)-space and of concentrated distributions, expectation, variance, and Gaussian and sub-Gaussian distributions over the torus.

More precisely, a \(\mathsf {TLWE}\) secret key \(\varvec{s}\in \mathbb {B}_N[X]^k\) is a vector of k binary polynomials of degree \(\le N-1\). We assume that each coefficient of the secret key is chosen uniformly at random, so the key has \(n=kN\) bits of entropy.

Definition 2.1

( \(\mathsf {TLWE}\) , phase). \(\mathsf {TLWE}\) ciphertexts or samples are \(\varvec{c}=(\varvec{a},b)\in \mathbb {T}_N[X]^{k+1}\) that fall in one of the three cases:

  • Noiseless Trivial of \(\mu \) : \(\varvec{a}=\varvec{0}\) and \(b=\mu \). Note that this sample is independent of the secret key.

  • Fresh \(\mathsf {TLWE}\) sample of \(\mu \) with standard deviation \(\alpha \) : \(\varvec{a}\) is chosen uniformly at random in \(\mathbb {T}_N[X]^k\) and b follows a continuous Gaussian distribution of standard deviation \(\alpha \) (variance \(\alpha ^2\)) centered at \(\mu +\varvec{s}\cdot \varvec{a}\). In the following, we will write \((\varvec{a},b)\in \mathsf {TLWE}_{\varvec{s},\alpha }(\mu )\).

  • Combination of \(\mathsf {TLWE}\) samples: \(\varvec{c} = \sum _{j=1}^{p} r_j \cdot \varvec{c_j}\) is a \(\mathsf {TLWE}\) sample, where \(\varvec{c_1}, \ldots , \varvec{c_p}\) are \(\mathsf {TLWE}\) samples under the same key and \(r_1, \ldots , r_p\) are in \(\mathbb {Z}\) or \(\mathfrak {R}\).

The phase of a sample \(\varvec{c}\) is defined as \(\varphi _{\varvec{s}}(\varvec{c})=b-\varvec{s}\cdot \varvec{a}\).

Like in [12], we say that a \(\mathsf {TLWE}\) sample \(\varvec{c}\) is valid iff there exists a key \(\varvec{s}\in \mathbb {B}_N[X]^k\) such that the distribution of the phase \(\varphi _{\varvec{s}}(\varvec{c})\) is concentrated. The message of a sample \(\varvec{c}\), written \(\textsf {msg}(\varvec{c})\), is defined as the expectation of its phase over the \(\varOmega \)-probability space. We will write \(\varvec{c}\in \mathsf {TLWE}_{\varvec{s}}(\mu )\) iff \(\textsf {msg}(\varvec{c})=\mu \). The error of a \(\mathsf {TLWE}\) sample \(\varvec{c}\), written \(\textsf {Err}(\varvec{c})\), is then defined as \(\varphi _{\varvec{s}}(\varvec{c})-\textsf {msg}(\varvec{c})\). The variance of the error will be denoted \(\textsf {Var}(\textsf {Err}(\varvec{c}))\) and its maximal amplitude \(\left\| \textsf {Err}(\varvec{c})\right\| _\infty \).

The message of a fresh sample in \(\mathsf {TLWE}_{\varvec{s},\alpha }(\mu )\) is \(\mu \) and its error variance is \(\alpha ^2\). The message function is linear: if \(\varvec{c}=\sum _{j=1}^{p} r_j \cdot \varvec{c_j}\) with \(\varvec{c_j}\in \mathsf {TLWE}_{\varvec{s}}(\mu _j)\), then \(\textsf {msg}(\varvec{c})=\sum _{j=1}^p r_j \mu _j\), provided that the variance \(\textsf {Var}(\textsf {Err}(\varvec{c})) \le \sum _{j=1}^{p} \left\| r_j\right\| _2^2 \cdot \textsf {Var}(\textsf {Err}(\varvec{c_j}))\) and the maximal amplitude \(\left\| \textsf {Err}(\varvec{c})\right\| _\infty \le \sum _{j=1}^{p} \left\| r_j\right\| _1 \cdot \left\| \textsf {Err}(\varvec{c_j})\right\| _\infty \) remain small.

This definition of message has the great advantage of being linear and continuous, and of working with infinite precision even over the continuous torus. In the practical case where the message is known to belong to a discrete subset \(\mathcal {M}\) of \(\mathbb {T}_N[X]\) and the noise amplitude of \(\varvec{c}\) is smaller than the packing radius of \(\mathcal {M}\), the decryption algorithm can retrieve the message by rounding the phase of the sample to its nearest element in \(\mathcal {M}\). For example with \(\mathcal {M}=\{0,1/2\}[X]\), the packing radius is 1/4, and samples of variance smaller than \(2^{-10}\) are decryptable with overwhelming probability.
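The encrypt/phase/round pipeline can be illustrated by a toy scalar LWE over the torus, with floating-point numbers standing in for \(\mathbb {T}\). This is an insecure sketch with made-up parameters, not the TFHE library:

```python
# Toy scalar TLWE over the real torus (floats approximate T). Insecure,
# for illustration only: n and alpha are arbitrary small demo values.
import random

n, alpha = 16, 2 ** -15                       # key size and noise stddev
s = [random.randint(0, 1) for _ in range(n)]  # binary secret key

def encrypt(mu):
    a = [random.random() for _ in range(n)]   # uniform mask
    b = (mu + sum(si * ai for si, ai in zip(s, a))
         + random.gauss(0, alpha)) % 1.0      # Gaussian noise around mu + s.a
    return (a, b)

def phase(c):                                 # phi_s(a, b) = b - s.a (mod 1)
    a, b = c
    return (b - sum(si * ai for si, ai in zip(s, a))) % 1.0

def decrypt(c, M=(0.0, 0.5)):                 # round phase to nearest m in M
    p = phase(c)
    return min(M, key=lambda m: min((p - m) % 1.0, (m - p) % 1.0))

assert decrypt(encrypt(0.5)) == 0.5
assert decrypt(encrypt(0.0)) == 0.0
```

With \(\alpha =2^{-15}\) the noise amplitude stays far below the packing radius 1/4 of \(\mathcal {M}=\{0,1/2\}\), so rounding the phase recovers the message.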

Distinguishing \(\mathsf {TLWE}\) encryptions of \(\varvec{0}\) from random samples in \(\mathbb {T}_N[X]^k\times \mathbb {T}_N[X]\) is equivalent to the LWE problem initially defined by Regev [28] and its ring [25] and scale-invariant [6, 11, 13] variants.

The main parameters of \(\mathsf {TLWE}\) are the noise rate \(\alpha \) and the key entropy n, and the security parameter is a function of those parameters, as specified in [12, Sect. 6]. By choosing \(N=1\) and k large, the \(\mathsf {TLWE}\) problem becomes the (scalar) binary-\(\mathsf {TLWE}\) problem; when N is large and \(k=1\), \(\mathsf {TLWE}\) becomes binary-\({\mathsf {RingLWE}}\).

\(\varvec{\mathsf {TGSW}.}\) In the same line as \(\mathsf {TLWE}\), \(\mathsf {TGSW}\) generalizes the GSW encryption scheme, proposed by Gentry, Sahai and Waters in 2013 [21]. The gadget matrix H is defined with respect to a base \(B_g \in \mathbb {N}\) as the \(((k+1)\ell )\times (k+1)\) matrix with \(\ell \) repeated super-decreasing \(\mathbb {T}\)-polynomials \((1/B_g,\dots ,1/B_g^\ell )\) as:

$$ \begin{aligned} H = \begin{pmatrix} 1/B_g & & \\ \vdots & & \\ 1/B_g^\ell & & \\ & \ddots & \\ & & 1/B_g \\ & & \vdots \\ & & 1/B_g^\ell \end{pmatrix} \in \mathcal {M}_{(k+1)\ell ,k+1}(\mathbb {T}_N[X]) \end{aligned} $$ (1)

With this choice of gadget, it is possible to efficiently decompose elements of \(\mathbb {T}_N[X]^{k+1}\) as a small linear combination of rows of H. As in [12], we use an approximate decomposition. For a quality parameter \(\beta \in \mathbb {R}_{>0}\) and a precision \(\epsilon \in \mathbb {R}_{>0}\), we call \(Dec_{H, \beta , \epsilon }(\varvec{v})\) the (possibly randomized) algorithm that outputs a small vector \(\varvec{u} \in \mathfrak {R}^{(k+1)\ell }\) such that \(\left\| \varvec{u}\right\| _\infty \le \beta \) and \(\left\| \varvec{u}\cdot H - \varvec{v}\right\| _\infty \le \epsilon \). In this paper we will always use this gadget H with the decomposition in base \(B_g\), so we have \(\beta =B_g/2\) and \(\epsilon =1/(2B_g^\ell )\).
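The following toy code (our own, for the scalar case \(k=0\), \(N=1\)) illustrates such an approximate signed decomposition in base \(B_g\), checking quality \(\beta =B_g/2\) and precision \(\epsilon =1/(2B_g^\ell )\):

```python
# Approximate signed decomposition of a torus value v in base Bg:
# v ~ sum_{j=1..ell} u_j / Bg^j with |u_j| <= Bg/2 (quality beta)
# and torus distance at most 1/(2*Bg^ell) (precision eps).
Bg, ell = 4, 3

def decompose(v):
    x = round((v % 1.0) * Bg ** ell)   # round to nearest multiple of Bg^-ell
    digits = []
    for _ in range(ell):
        x, r = divmod(x, Bg)
        if r > Bg // 2:                # balanced digit: carry up
            r -= Bg
            x += 1
        digits.append(r)
    return digits[::-1]                # u_1 (weight 1/Bg) first

def recompose(u):
    return sum(uj / Bg ** (j + 1) for j, uj in enumerate(u)) % 1.0

v = 0.37
u = decompose(v)
assert all(abs(uj) <= Bg // 2 for uj in u)                 # quality beta
r = recompose(u)
assert min(abs(r - v), 1 - abs(r - v)) <= 1 / (2 * Bg ** ell) + 1e-12
```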

\(\mathsf {TGSW}\) samples. Let \(\varvec{s}\in \mathbb {B}_N[X]^k\) be a \(\mathsf {TLWE}\) secret key and \(H \in \mathcal {M}_{(k+1)\ell ,k+1}(\mathbb {T}_N[X])\) the gadget previously defined. A \(\mathsf {TGSW}\) sample C of a message \(\mu \in \mathfrak {R}\) is equal to the sum \(C = Z + \mu \cdot H \in \mathcal {M}_{(k+1)\ell ,k+1}(\mathbb {T}_N[X])\), where \(Z\in \mathcal {M}_{(k+1)\ell ,k+1}(\mathbb {T}_N[X])\) is a matrix such that each row is a random \(\mathsf {TLWE}\) sample of 0 under the same key.

A sample \(\varvec{C}\in \mathcal {M}_{(k+1)\ell ,k+1}(\mathbb {T}_N[X])\) is a valid \(\mathsf {TGSW}\) sample iff there exists a unique polynomial \(\mu \in \mathfrak {R}/H^\perp \) and a unique key \(\varvec{s}\) such that each row of \(\varvec{C}-\mu \cdot H\) is a valid \(\mathsf {TLWE}\) sample of 0 w.r.t. the key \(\varvec{s}\). We denote by \(\textsf {msg}(C)\) the message \(\mu \) of \(\varvec{C}\). By extension, we define the phase of a \(\mathsf {TGSW}\) sample C as the list of the \((k+1)\ell \) \(\mathsf {TLWE}\) phases of the rows of C, and its error as the list of the \((k+1)\ell \) \(\mathsf {TLWE}\) errors of the rows of C.

In addition, if we linearly combine \(\mathsf {TGSW}\) samples \(C_1,\dots ,C_p\) of messages \(\mu _1,\dots ,\mu _p\) with the same key and independent errors, then \(C=\sum _{i=1}^p e_i \cdot C_i\) is a sample of message \(\sum _{i=1}^p e_i \cdot \mu _i\), with variance \(\textsf {Var}(\textsf {Err}(C)) \le \sum _{i=1}^p \left\| e_i\right\| _{2}^{2} \cdot \textsf {Var}(\textsf {Err}(C_i))\) and noise infinity norm \(\left\| \textsf {Err}(C)\right\| _\infty \le \sum _{i=1}^p \left\| e_i\right\| _1 \cdot \left\| \textsf {Err}(C_i)\right\| _\infty \). Moreover, the lipschitzian property of the phase is preserved, i.e. \(\left\| \varphi _s(A)\right\| _\infty \le (Nk+1)\left\| A\right\| _\infty \).

Homomorphic Properties. As GSW, \(\mathsf {TGSW}\) inherits homomorphic properties. We can define the internal product between two \(\mathsf {TGSW}\) samples and the external \(\boxdot \) product already defined and used in [7, 12]. The external product is almost the GSW product [21], except that only one vector needs to be decomposed.

Definition 2.2

(External product). We define the product \(\boxdot \) as

$$ \begin{aligned} \boxdot :\mathsf {TGSW}\times \mathsf {TLWE}&\longrightarrow \mathsf {TLWE}\\ (A,\varvec{b})&\longmapsto A\boxdot \varvec{b} = Dec_{H,\beta ,\epsilon }(\varvec{b})\cdot A. \end{aligned} $$

The following theorem on the noise propagation of the external product was shown in [12, Sect. 3.2]:

Theorem 2.3

(External Product). If A is a valid \(\mathsf {TGSW}\) sample of message \(\mu _A\) and \(\varvec{b}\) is a valid \(\mathsf {TLWE}\) sample of message \(\mu _{\varvec{b}}\), then \(A \boxdot \varvec{b}\) is a \(\mathsf {TLWE}\) sample of message \(\mu _A \cdot \mu _{\varvec{b}}\) and \(\left\| \textsf {Err}(A\boxdot \varvec{b})\right\| _\infty \le (k+1)\ell N\beta \left\| \textsf {Err}(A)\right\| _\infty + \left\| \mu _A\right\| _1(1+kN)\epsilon + \left\| \mu _A\right\| _1\left\| \textsf {Err}(\varvec{b})\right\| _\infty \) (worst case), where \(\beta \) and \(\epsilon \) are the parameters used in the decomposition algorithm. If \(\left\| \textsf {Err}(A\boxdot \varvec{b})\right\| _\infty \le 1/4\), then \(A \boxdot \varvec{b}\) is a valid \(\mathsf {TLWE}\) sample. Under Assumption 2.4, we also have \(\textsf {Var}(\textsf {Err}(A\boxdot \varvec{b})) \le (k+1)\ell N\beta ^2\textsf {Var}(\textsf {Err}(A)) + (1+kN)\left\| \mu _A\right\| _2^2 \epsilon ^2 + \left\| \mu _A\right\| _2^2 \textsf {Var}(\textsf {Err}(\varvec{b}))\).

There also exists an internal product between two \(\mathsf {TGSW}\) samples, already presented in [1, 12, 17, 19, 21], which consists of \((k+1)\ell \) independent \(\boxdot \) products; it maps to the product of integer polynomials on plaintexts and turns \(\mathsf {TGSW}\) encryption into a ring homomorphism. Since we do not use this internal product in our constructions, we do not detail it.

Independence heuristic. All our average-case bounds on noise variances rely on the independence heuristic below. They usually correspond to the square root of the worst-case bounds, which do not require this heuristic. As already noticed in [17], this assumption matches experimental results.

Assumption 2.4

(Independence Heuristic). We assume that all the error coefficients of \(\mathsf {TLWE}\) or \(\mathsf {TGSW}\) samples of the linear combinations we consider are independent and concentrated. In particular, we assume that they are \(\sigma \)-subgaussian where \(\sigma \) is the square-root of their variance.

Notations. In the rest of the paper, the notation \(\mathsf {TLWE}\) is used to denote the (scalar) binary \(\mathsf {TLWE}\) problem, while for the ring mode, we use the notation \(\mathsf {TRLWE}\). \(\mathsf {TGSW}\) is only used in ring mode with notation \(\mathsf {TRGSW}\), to keep uniformity with the \(\mathsf {TRLWE}\) notation.

Sum-up of elementary homomorphic operations. Table 1 summarizes the possible operations on plaintexts that we can perform in LHE mode, and their correspondence over the ciphertexts. All these operations are expressed on the continuous message space \(\mathbb {T}\) for \(\mathsf {TLWE}\) and \(\mathbb {T}_N[X]\) for \(\mathsf {TRLWE}\). As previously mentioned, all samples contain noise; the user is free to discretize the message space accordingly to allow exact decryption in practice. All these algorithms will be described in the next sections.

Table 1. TFHE elementary operations - In this table, all \(\mu _i\)’s denote plaintexts in \(\mathbb {T}_N[X]\) and \(\varvec{c}_i\) the corresponding \(\mathsf {TRLWE}\) ciphertext. The \(m_i\)’s are plaintexts in \(\mathbb {T}\) and \(\mathfrak {c}\) their \(\mathsf {TLWE}\) ciphertext. The \(b_i\)’s are bit messages and \(C_i\) their \(\mathsf {TRGSW}\) ciphertext. The \(\vartheta _i\)’s are the noise variances of the respective ciphertexts. In the translation, w is in \(\mathbb {T}_N[X]\). In the rotation, the \(u_i\)’s are integer coefficients. In the \(\mathbb {Z}[X]\)-linear combination, the \(v_i\)’s are integer polynomials in \(\mathbb {Z}[X]\).

2.2 Key Switching Revisited

In the following, we instantiate \(\mathsf {TRLWE}\) and \(\mathsf {TRGSW}\) with different parameter sets and we keep the same name for the variables \(n,N,\alpha ,\ell ,B_g,\dots \), but we alternate between bar over and bar under variables to differentiate input and output parameters. In order to switch between keys in different parameter sets, but also to switch between the scalar and polynomial message spaces \(\mathbb {T}\) and \(\mathbb {T}_N[X]\), we use slightly generalized notions of sample extraction and keyswitching. Namely, we give to keyswitching algorithms the ability to homomorphically evaluate linear morphisms f from any \(\mathbb {Z}\)-module \(\mathbb {T}^p\) to \(\mathbb {T}_N[X]\). We define two flavors, one for a publicly known f, and one for a secret f encoded in the keyswitching key. In the following, we denote \(\mathsf {PubKS}(f,\mathsf {KS},\varvec{\mathfrak {c}})\) and \(\mathsf {PrivKS}(\mathsf {KS}^{({f})},\varvec{\mathfrak {c}})\) the output of Algorithms 1 and 2 on input the functional keyswitching keys \(\mathsf {KS}\) and \(\mathsf {KS}^{(f)}\) respectively and ciphertext \(\varvec{\mathfrak {c}}\).

[Algorithm 1: \(\mathsf {PubKS}\), the public functional keyswitching procedure — pseudocode figure not reproduced]

Theorem 2.5

(Public KeySwitch) Given p LWE ciphertexts \(\varvec{\mathfrak {c}}^{(z)}\in \mathsf {TLWE}_{\mathfrak {K}}(\mu _z)\) and a public R-lipschitzian morphism f of \(\mathbb {Z}\)-modules, from \(\mathbb {T}^p\) to \(\mathbb {T}_{\underline{N}}[X]\), and \(\mathsf {KS}_{i,j}\in \mathsf {TRLWE}_{\underline{K},\underline{\gamma }}(\frac{\mathfrak {K}_i}{2^j})\) with standard deviation \(\underline{\gamma }\), Algorithm 1 outputs a \(\mathsf {TRLWE}\) sample \(\varvec{c} \in \mathsf {TRLWE}_{\underline{K}}(f(\mu _1,\dots ,\mu _p))\) where:

  • \(\left\| \textsf {Err}(\varvec{c})\right\| _\infty \le R \left\| \textsf {Err}(\varvec{\mathfrak {c}})\right\| _\infty +n t N \mathcal {A}_{\mathsf {KS}} + n 2^{-(t+1)}\) (worst case),

  • \(\textsf {Var}(\textsf {Err}(\varvec{c})) \le R^2 \textsf {Var}(\textsf {Err}(\varvec{\mathfrak {c}})) + n t N \vartheta _{\mathsf {KS}} + n 2^{-2(t+1)}\) (average case), where \(\mathcal {A}_{\mathsf {KS}}\) and \(\vartheta _\mathsf {KS}=\gamma ^2\) are respectively the amplitude and the variance of the error of \(\mathsf {KS}\).

We have a similar result when the function is private. In this algorithm, we extend the input secret key \(\mathfrak {K}\) by adding an \((n+1)\)-th coefficient equal to \(-1\), so that \(\varphi _\mathfrak {K}(\varvec{\mathfrak {c}})=-\mathfrak {K}\cdot \varvec{\mathfrak {c}}\). A detailed proof of both the private and the public keyswitching is given in the full version.

[Algorithm 2: \(\mathsf {PrivKS}\), the private functional keyswitching procedure — pseudocode figure not reproduced]

Theorem 2.6

(Private KeySwitch) Given p \(\mathsf {TLWE}\) ciphertexts \(\varvec{\mathfrak {c}}^{(z)}\in \mathsf {TLWE}_{\mathfrak {K}}(\mu _z)\), \(\mathsf {KS}_{i,j}\in \mathsf {TRLWE}_{\underline{K},\underline{\gamma }}(f(0,\dots , \frac{\mathfrak {K}_i}{2^j},\dots ,0))\) where f is a private R-lipschitzian morphism of \(\mathbb {Z}\)-modules, from \(\mathbb {T}^p\) to \(\mathbb {T}_{\underline{N}}[X]\), Algorithm 2 outputs a \(\mathsf {TRLWE}\) sample \(\varvec{c} \in \mathsf {TRLWE}_{\underline{K}}(f(\mu _1,\dots ,\mu _p))\) where

  • \(\left\| \textsf {Err}(\varvec{c})\right\| _\infty \le R \left\| \textsf {Err}(\varvec{\mathfrak {c}})\right\| _\infty +(n+1)R 2^{-(n+1)}+p(n+1)\mathcal {A}_\mathsf {KS}\) (worst-case),

  • \(\textsf {Var}(\textsf {Err}(\varvec{c})) \le R^2 \textsf {Var}(\textsf {Err}(\varvec{\mathfrak {c}}))+(n+1)R^2 2^{-2(n+1)}+p(n+1)\vartheta _\mathsf {KS}\) (average case),

    where \(\mathcal {A}_{\mathsf {KS}}\) and \(\vartheta _\mathsf {KS}=\gamma ^2\) are respectively the amplitude and the variance of the error of \(\mathsf {KS}\).

Sample Packing and Sample Extraction. A \(\mathsf {TRLWE}\) message is a polynomial with N coefficients, which can be viewed as N slots over \(\mathbb {T}\). It is easy to homomorphically extract a coefficient as a scalar \(\mathsf {TLWE}\) sample. To that end, we will use the following convention in the rest of the paper: for all \(n=kN\), a binary vector \(\mathfrak {K}\in \mathbb {B}^n\) can be interpreted as a \(\mathsf {TLWE}\) key, or alternatively as a \(\mathsf {TRLWE}\) key \(K\in \mathbb {B}_N[X]^k\) having the same sequence of coefficients. Namely, \(K_i\) is the polynomial \(\sum _{j=0}^{N-1} \mathfrak {K}_{N(i-1)+j+1}X^j\). In this case, we say that K is the \(\mathsf {TRLWE}\) interpretation of \(\mathfrak {K}\), and \(\mathfrak {K}\) is the \(\mathsf {TLWE}\) interpretation of K.

Given a \(\mathsf {TRLWE}\) sample \(\varvec{c}=(\varvec{a},b)\in \mathsf {TRLWE}_{K}(\mu )\) and a position \(p\in [0,N-1]\), we call \(\mathsf {SampleExtract}_p(\varvec{c})\) the \(\mathsf {TLWE}\) sample \((\varvec{\mathfrak {a}},\mathfrak {b})\) where \(\mathfrak {b}=b_p\) and \(\mathfrak {a}_{N(i-1)+j+1}\) is the \((p-j)\)-th coefficient of \(a_{i}\) (using the N-antiperiodic indexes). This extracted sample encodes the p-th coefficient \(\mu _p\) with at most the same noise variance or amplitude as \(\varvec{c}\). In the rest of the paper, we will simply write \(\mathsf {SampleExtract}(\varvec{c})\) when \(p=0\). In the next Section, we will show how the \(\mathsf {KeySwitching}\) and the \(\mathsf {SampleExtract}\) are used to efficiently pack, unpack and move data across the slots, and how it differs from usual packing techniques.
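The index mapping of \(\mathsf {SampleExtract}\) can be checked on plaintext coefficients: the p-th coefficient of the negacyclic product \(s\cdot a \bmod X^N+1\) equals the dot product of the key coefficients with the rearranged N-antiperiodic vector. The sanity-check code below is our own (case \(k=1\)):

```python
import random

# Check that SampleExtract's rearranged vector a'_j = a_{p-j} (with the
# N-antiperiodic convention a_{i-N} = -a_i) pairs with the key exactly as
# coefficient p of the negacyclic product s(X)*a(X) mod X^N + 1.
N = 8
a = [random.randint(-5, 5) for _ in range(N)]
s = [random.randint(0, 1) for _ in range(N)]

def negacyclic_product_coeff(s, a, p):
    c = 0
    for i in range(N):
        for j in range(N):
            sign = 1 if i + j < N else -1   # because X^N = -1
            if (i + j) % N == p:
                c += sign * s[i] * a[j]
    return c

def extract(a, p):
    # the vector that SampleExtract pairs with the key at position p
    return [a[p - j] if p - j >= 0 else -a[p - j + N] for j in range(N)]

for p in range(N):
    assert negacyclic_product_coeff(s, a, p) == sum(
        si * ai for si, ai in zip(s, extract(a, p)))
```

Since the mapping is a pure rearrangement (with signs), the extracted sample carries at most the same noise as the input, as stated above.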

2.3 Gate Bootstrapping Overview

This lemma on the evaluation of the \(\mathtt {CMux}\)-gate extends Theorems 5.1 and 5.2 in [12] from the message space \(\{0,1/2\}\) to \(\mathbb {T}_N[X]\):

Lemma 2.7

( \(\mathtt {CMux}\) Gate). Let \(\varvec{d_1},\varvec{d_0}\) be \(\mathsf {TRLWE}\) samples and let \(C\in \mathsf {TRGSW}_{\varvec{s}}(\{0,1\})\). Then, \(\textsf {msg}(\mathtt {CMux}(C,\varvec{d_1},\varvec{d_0}))=\textsf {msg}(C)\text {?}\textsf {msg}(\varvec{d_1})\text {:}\textsf {msg}(\varvec{d_0})\). And we have \(\left\| \textsf {Err}(\mathtt {CMux}(C,\varvec{d_1},\varvec{d_0}))\right\| _\infty \le \max (\left\| \textsf {Err}(\varvec{d_0})\right\| _\infty ,\left\| \textsf {Err}(\varvec{d_1})\right\| _\infty ) + \eta (C)\), where \(\eta (C)=(k+1)\ell N\beta \left\| \textsf {Err}(C)\right\| _\infty + (kN+1)\epsilon \). Furthermore, under Assumption 2.4, we have: \(\textsf {Var}(\textsf {Err}(\mathtt {CMux}(C,\varvec{d_1},\varvec{d_0}))) \le \max (\textsf {Var}(\textsf {Err}(\varvec{d_0})),\textsf {Var}(\textsf {Err}(\varvec{d_1}))) + \vartheta (C)\), where \(\vartheta (C)=(k+1)\ell N\beta ^2\textsf {Var}(\textsf {Err}(C)) + (kN+1)\epsilon ^2\).

The proof is the same as for Theorems 5.1 and 5.2 in [12] because the noise of the output does not depend on the value of the \(\mathsf {TRLWE}\) message.
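On plaintexts, the \(\mathtt {CMux}\) gate reduces to the selector formula \(d_0 + C\cdot (d_1-d_0)\); homomorphically this is one external product and two \(\mathsf {TRLWE}\) additions, which is where the \(\eta (C)\) overhead in Lemma 2.7 comes from. A toy plaintext sketch (our own):

```python
# CMux on plaintexts: homomorphically computed as d0 + C boxdot (d1 - d0),
# so the selector bit never leaves the encrypted domain.
def cmux(c_bit, d1, d0):
    return d0 + c_bit * (d1 - d0)   # = d1 if c_bit == 1 else d0

assert cmux(1, 0.25, 0.75) == 0.25
assert cmux(0, 0.25, 0.75) == 0.75
```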

Blind rotate. In the following, we give a faster sub-routine for the main loop of the bootstrapping of [12]. The improvement consists of a new \(\mathtt {CMux}\) formula in the for loop of our Algorithm 3, replacing the formula used in Algorithm 3 of [12]. The BlindRotate algorithm multiplies the polynomial encrypted in the input \(\mathsf {TRLWE}\) ciphertext by an encrypted power of X. Theorem 2.8 follows from the fact that Algorithm 3 calls the \(\mathtt {CMux}\) evaluation of Lemma 2.7 p times.

Theorem 2.8

Let \(H \in \mathcal {M}_{(k+1)\ell ,k+1}(\mathbb {T}_N[X])\) be the gadget matrix and \(\text {Dec}_{H, \beta , \epsilon }\) its efficient approximate gadget decomposition algorithm with quality \(\beta \) and precision \(\epsilon \), defining \(\mathsf {TRLWE}\) and \(\mathsf {TRGSW}\) parameters. Let \(\alpha \in \mathbb {R}_{\ge 0}\) be a standard deviation, \(\mathfrak {K}\in \mathbb {B}^{n}\) be a \(\mathsf {TLWE}\) secret key and \(K\in \mathbb {B}_{N}[X]^{k}\) be its \(\mathsf {TRLWE}\) interpretation. Given one sample \(\varvec{c}\in \mathsf {TRLWE}_{K}(\varvec{v})\) with \(\varvec{v}\in \mathbb {T}_{N}[X]\), \(p+1\) integers \(a_1,\dots ,a_p\) and \(b\in \mathbb {Z}/2N\mathbb {Z}\), and p \(\mathsf {TRGSW}\) ciphertexts \(C_1,\dots ,C_p\), where each \(C_i \in \mathsf {TRGSW}_{K,\alpha }(s_i)\) for \(s_i\in \mathbb {B}\), Algorithm 3 outputs a sample \(\mathsf {ACC} \in \mathsf {TRLWE}_{K}(X^{-\rho }\cdot \varvec{v})\) where \(\rho =b-\sum _{i=1}^p s_i a_i\), such that

  • \(\left\| \textsf {Err}(\mathsf {ACC})\right\| _\infty \le \left\| \textsf {Err}(\varvec{c})\right\| _\infty + p(k+1)\ell N \beta \mathcal {A}_{C} + p(1+k N)\epsilon \) (worst case),

  • \(\textsf {Var}(\textsf {Err}(\mathsf {ACC})) \le \textsf {Var}(\textsf {Err}(\varvec{c}))+ p(k+1)\ell N \beta ^2 \vartheta _{C} + p(1+k N) \epsilon ^2\) (average case), where \(\vartheta _{C} = \alpha ^2\) and \(\mathcal {A}_{C}\) are the variance and amplitudes of \(\textsf {Err}(C_{i})\).


We define \(\textsf {BlindRotate}(\varvec{c},(a_1,\dots ,a_p,b),C)\), the procedure described in Algorithm 3 that outputs the \(\mathsf {TRLWE}\) sample \(\mathsf {ACC}\) as in Theorem 2.8.
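On plaintexts, the effect of BlindRotate can be checked with a small model. The sketch below (assumptions: polynomials in \(\mathbb {Z}[X]/(X^N+1)\) are plain coefficient lists, and the secret bits \(s_i\) are in the clear, so the Python conditional stands in for the \(\mathtt {CMux}\) of each loop iteration) verifies that the accumulator ends up holding \(X^{-\rho }\cdot v\) with \(\rho =b-\sum s_i a_i \bmod 2N\):

```python
def mul_by_xpow(v, e):
    """Multiply coefficient vector v by X^e in Z[X]/(X^N + 1); uses X^N = -1."""
    N = len(v)
    e %= 2 * N
    out = [0] * N
    for i, c in enumerate(v):
        j = (i + e) % (2 * N)
        if j < N:
            out[j] += c
        else:
            out[j - N] -= c      # wrap-around picks up a minus sign
    return out

def blind_rotate(v, a, b, s):
    """Plaintext model of Algorithm 3: ACC = X^{-b} * v, then one
    (conditional) rotation by X^{a_i} per secret bit s_i."""
    acc = mul_by_xpow(v, -b)
    for ai, si in zip(a, s):
        # encrypted version: ACC = CMux(C_i, X^{a_i} * ACC, ACC)
        acc = mul_by_xpow(acc, ai) if si else acc
    return acc

N = 8
v = list(range(N))                   # v = 0 + 1X + ... + 7X^7
a, b, s = [3, 5, 11], 6, [1, 0, 1]
rho = (b - sum(ai * si for ai, si in zip(a, s))) % (2 * N)
assert blind_rotate(v, a, b, s) == mul_by_xpow(v, -rho)
```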


Gate Bootstrapping ( \(\varvec{\mathsf {TLWE}}\) -to- \(\varvec{\mathsf {TLWE}}\))

Theorem 2.9

(Gate Bootstrapping (TLWE-to-TLWE)). Let \(\bar{H} \in \mathcal {M}_{(\bar{k}+1)\bar{\ell },\bar{k}+1}(\mathbb {T}_{\bar{N}}[X])\) be the gadget matrix and \(Dec_{\bar{H}, \bar{\beta }, \bar{\epsilon }}\) its efficient approximate gadget decomposition algorithm, with quality \(\bar{\beta }\) and precision \(\bar{\epsilon }\), defining \(\mathsf {TRLWE}\) and \(\mathsf {TRGSW}\) parameters. Let \(\underline{\mathfrak {K}}\in \mathbb {B}^{\underline{n}}\) and \(\bar{\mathfrak {K}}\in \mathbb {B}^{\bar{n}}\) be two \(\mathsf {TLWE}\) secret keys, let \(\bar{K}\in \mathbb {B}_{\bar{N}}[X]^{\bar{k}}\) be the \(\mathsf {TRLWE}\) interpretation of the key \(\bar{\mathfrak {K}}\), and let \(\bar{\alpha } \in \mathbb {R}_{\ge 0}\) be a standard deviation. Let \(\text {BK}_{\underline{\mathfrak {K}} \rightarrow \bar{\mathfrak {K}},\bar{\alpha }}\) be a bootstrapping key, composed of the \(\underline{n}\) \(\mathsf {TRGSW}\) encryptions \(\text {BK}_i \in \mathsf {TRGSW}_{\bar{K},\bar{\alpha }}(\underline{\mathfrak {K}}_i)\) for \(i\in [\![1,\underline{n} ]\!]\). Given one constant \(\mu _1\in \mathbb {T}\) and one sample \(\varvec{\underline{\varvec{\mathfrak {c}}}}\in \mathbb {T}^{\underline{n}+1}\) whose coefficients are all multiples of \(\frac{1}{2\bar{N}}\), Algorithm 4 outputs a \(\mathsf {TLWE}\) sample \(\varvec{\bar{\varvec{\mathfrak {c}}}} \in \mathsf {TLWE}_{\bar{\mathfrak {K}}}(\mu )\) where \(\mu =0\) if \(|\varphi _{\underline{\mathfrak {K}}}(\underline{\mathfrak {c}})|<\frac{1}{4}\) and \(\mu =\mu _1\) otherwise, such that:

  • \(\left\| \textsf {Err}(\varvec{\bar{\varvec{\mathfrak {c}}}})\right\| _\infty \le \underline{n}(\bar{k}+1)\bar{\ell } \bar{N} \bar{\beta } \bar{\mathcal {A}}_{\text {BK}} + \underline{n}(1+\bar{k}\bar{N})\bar{\epsilon }\) (worst case),

  • \(\textsf {Var}(\textsf {Err}(\varvec{\bar{\varvec{\mathfrak {c}}}})) \le \underline{n}(\bar{k}+1)\bar{\ell } \bar{N} \bar{\beta }^2 \bar{\vartheta }_{\text {BK}} + \underline{n}(1+\bar{k}\bar{N})\bar{\epsilon }^2\) (average case),

    where \(\bar{\mathcal {A}}_{\text {BK}}\) is the amplitude of \(\text {BK}\) and \(\bar{\vartheta }_{\text {BK}}\) its variance, s.t. \(\textsf {Var}(\textsf {Err}(\text {BK}_{\underline{\mathfrak {K}} \rightarrow \bar{\mathfrak {K}},\bar{\alpha }})) = \bar{\alpha }^2\).

Sketch of Proof. Algorithm 4 is almost the same as Algorithm 3 in [12], except that the main loop has been moved to a separate algorithm (Algorithm 3), called at line 2. In addition, the final \(\mathsf {KeySwitching}\) has been removed, which suppresses two terms in the norm inequality of the error. Note that the output is encrypted with the same key as the bootstrapping key. Another syntactic difference is that the coefficients of the input sample are multiples of \(1/2\bar{N}\) (which can be achieved by rounding all its coefficients). Also, a small difference in the way we associate \(\mathtt {CMux}\) operations removes a factor 2 in the noise compared to the previous gate bootstrapping procedure, and makes it faster.

Homomorphic operations (revisited) via Gate Bootstrapping. The fast bootstrapping of [17], improved in [4, 12], was presented for the \(\mathtt {Nand}\) gate: it evaluates a single \(\mathtt {Nand}\) operation and refreshes the result to make it usable for the next operations. Other elementary gates have also been presented: \(\mathtt {And}\), \(\mathtt {Or}\), \(\mathtt {Xor}\) (and trivially \(\mathtt {Nor}\), \(\mathtt {Xnor}\), \(\mathtt {AndNot}\), etc., since \(\mathtt {NOT}\) is cheap and noiseless). The term gate bootstrapping refers to the fact that this fast bootstrapping is performed after every gate evaluation.

The ternary \(\mathtt {Mux}\) gate (\(\mathtt {Mux}(c,d_0,d_1) = c?d_1:d_0 = (c\wedge d_1) \oplus ((1-c)\wedge d_0)\), for \(c,d_0,d_1 \in \mathbb {B}\)) is generally expressed as a combination of 3 binary gates. As already mentioned in [17], we can improve the \(\mathtt {Mux}\) evaluation by performing the middle \(\oplus \) as a regular addition before the final \(\mathsf {KeySwitching}\). Indeed, this xor has at most one operand which is true, so at this location it only contributes a negligible amount to the final noise, which is compensated by the factor 2 we save in the blind rotation (Algorithm 3) of the gate bootstrapping. Overall, the ternary \(\mathtt {Mux}\) gate can be evaluated in FHE mode with only two gate bootstrappings and one public keyswitch. We call this procedure native \(\mathtt {Mux}\); it computes:

  • \(c\wedge d_1\) via a gate bootstrapping (Algorithm 4) of \((\varvec{0}, -\frac{1}{8}) + \varvec{c} + \varvec{d_1}\);

  • \((1-c)\wedge d_0\) via a gate bootstrapping (Algorithm 4) of \( (\varvec{0}, \frac{1}{8}) - \varvec{c} + \varvec{d_0}\);

  • a final keyswitch on the sum (Algorithm 1) which dominates the noise.
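The two properties this construction relies on, the \(\mathtt {Mux}\) identity itself and the fact that the two xor operands are never simultaneously true (so the xor can be a plain addition), can be checked exhaustively on plaintext bits. A sketch only, with ciphertexts and noise not modeled:

```python
def native_mux(c, d0, d1):
    """Plaintext model of the native Mux: c ? d1 : d0."""
    left = c & d1             # first gate bootstrapping
    right = (1 - c) & d0      # second gate bootstrapping
    assert left + right <= 1  # operands never both true: xor == plain addition
    return left + right       # final sum + public keyswitch

# Exhaustive check of the identity over all 8 plaintext inputs.
for c in (0, 1):
    for d0 in (0, 1):
        for d1 in (0, 1):
            assert native_mux(c, d0, d1) == (d1 if c else d0)
```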

This native \(\mathtt {Mux}\) is therefore bootstrappable with the same parameters as in [12]. More details are given in the full version. In the rest of the paper, when we compare different homomorphic techniques, we refer to the gate-bootstrapping mode as the technique of evaluating small circuits expressed with any binary gates and/or the native \(\mathtt {Mux}\), and we use the following experimental timings (see Sect. 5):

Gate bootstrapping mode

  • Pre-bootstrap 1 bit: \(t_{GB}=13\,\text {ms}\)

  • Time per any binary gate (\(\mathtt {And}\), \(\mathtt {Or}\), \(\mathtt {Xor}\), ...): \(t_{GB}=13\,\text {ms}\)

  • Time per \(\mathtt {Mux}\): \(2t_{GB}=26\,\text {ms}\)

3 Leveled Homomorphic Circuits

Various packing techniques have already been proposed for homomorphic encryption, for instance the Lagrange embedding in Helib [22, 23], the diagonal matrices encoding in [27] or the CRT encoding in [2]. The message space is often a finite ring (e.g. \(\mathbb {Z}/p\mathbb {Z}\)), and the packing function is in general chosen as a ring isomorphism that preserves the structure of \((\mathbb {Z}/p\mathbb {Z})^N\). This way, elementary additions or products can be performed simultaneously on N independent slots, and thus packing is in general associated with the concept of batching a single operation on multiple datasets. These techniques have some limitations, especially if, in the whole program, each function is only run on a single dataset and most of the slots are unused. This is particularly true in the context of GSW evaluations, where functions are split into many branching algorithms or automata that are each executed only once.

In this paper, packing refers to the canonical coefficients embedding function, which maps N Scalar-\(\mathsf {TLWE}\) messages \(\mu _0,\dots ,\mu _{N-1}\in \mathbb {T}\) into a single \(\mathsf {TRLWE}\) message \(\mu (X)=\sum _{i=0}^{N-1} \mu _i X^i\). This function is a \(\mathbb {Z}\)-module isomorphism. Messages can be homomorphically unpacked from any slot using the (noiseless) \(\mathsf {SampleExtract}\) procedure. Reciprocally, we can repack, move data across the slots, or clear some slots by using our public functional key switching from Algorithm 1 to evaluate respectively the canonical coefficient embedding function (i.e. the identity), a permutation, or a projection. Since these functions are 1-lipschitzian, by Theorem 2.5 these keyswitch operations only induce a linear noise overhead. This is arguably more straightforward than the permutation network technique used in Helib. But as in [2, 10, 15], our technique relies on a circular security assumption, even in the leveled mode, since our keyswitching key encrypts its own key bits.

We now analyse how packing can speed up \(\mathsf {TRGSW}\) leveled computations, first for look-up tables or arbitrary functions, and then for most arithmetic functions.

3.1 Arbitrary Functions and Look-Up Tables

The first class of functions that we analyse are arbitrary functions \(f : \mathbb {B}^d \rightarrow \mathbb {T}^s\). Such a function can be expressed with a Look-Up Table (LUT), containing a list of \(2^d\) input values (each one composed of d bits) and the corresponding LUT values for the s sub-functions (1 element in \(\mathbb {T}\) per sub-function \(f_j\)).

In order to compute f(x), where \(x=\sum _{i=0}^{d-1} x_i 2^i\) is a d-bit integer, the classical evaluation of such a function, as proposed in [8, 12], consists in evaluating the s sub-functions separately, each of them being a binary decision tree composed of \(2^d-1\) \(\mathtt {CMux}\) gates. The classical evaluation therefore requires about \(s\cdot 2^d\) \(\mathtt {CMux}\) gates in total. Let’s call \(o_j= f_j(x) \in \mathbb {T}\) the j-th output of f(x), for \(j=0, \ldots , s-1\). Figure 1 summarizes the idea of the computation of \(o_j\).
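The binary decision tree can be sketched on plaintexts as follows. This is a model only: the Python conditional stands in for the encrypted \(\mathtt {CMux}\) of Lemma 2.7, and the toy truth table is an arbitrary illustration:

```python
def cmux(c, d1, d0):
    """Plaintext stand-in for the CMux gate: d1 if c == 1 else d0."""
    return d1 if c else d0

def lut_tree(table, bits):
    """Binary decision CMux tree for a LUT with 2^d entries.
    bits = [x_0, ..., x_{d-1}] (LSB first); uses 2^d - 1 CMux gates."""
    layer = list(table)
    for x_i in bits:                         # one tree level per input bit
        layer = [cmux(x_i, layer[2 * j + 1], layer[2 * j])
                 for j in range(len(layer) // 2)]
    return layer[0]

table = [(3 * x) % 8 for x in range(8)]      # toy LUT for d = 3
assert lut_tree(table, [1, 0, 1]) == table[5]   # x = 5, bits LSB first
```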

In this section we present two techniques, that we call horizontal and vertical packing, that can be used to improve the evaluation of a LUT.

Horizontal packing corresponds exactly to batching. In fact, it exploits the fact that the s sub-functions evaluate the same \(\mathtt {CMux}\) tree with the same inputs on different data, namely the s truth tables. For each of the \(2^d\) possible input values, we pack the LUT values of the s sub-functions in the first s slots of a single \(\mathsf {TRLWE}\) ciphertext (the remaining \(N-s\) are unused). A single \(\mathtt {CMux}\) tree of size \(2^d\) then selects the right ciphertext and yields all s slots at once, which is overall s times faster than the classical evaluation. Our vertical packing is very different from the batching technique. The basic idea is to pack several LUT values of a single sub-function in the same ciphertext, and to use both \(\mathtt {CMux}\) and blind rotations to extract the desired value. Unlike batching, this can also speed up functions that have only a single bit of output. In the following we detail these two techniques.

Fig. 1.

LUT with \(\mathtt {CMux}\) tree - Intuitively, the horizontal rectangle encircles the bits packed in the horizontal packing, while the vertical rectangle encircles the bits packed in the vertical packing. The dashed square represents the packing in the case where the two techniques are mixed. The right part of the figure represents the evaluation of the sub-function \(f_j\) on \(x=\sum _{i=0}^{d-1}x_i 2^i\) via a \(\mathtt {CMux}\) binary decision tree.

In order to evaluate f(x), the total number of homomorphic \(\mathtt {CMux}\) gates to be evaluated is \(s(2^d-1)\). If the function f is public, trivial samples of the LUT values \(\sigma _{j,0}, \ldots , \sigma _{j,2^d-1}\) are used as inputs in the \(\mathtt {CMux}\) gates. If f is private, the LUT values \(\sigma _{j,0}, \ldots , \sigma _{j,2^d-1}\) are given encrypted. An analysis of the noise propagation in the binary decision \(\mathtt {CMux}\) tree was already given in [12, 19].

Horizontal Packing. The idea of the Horizontal Packing is to evaluate all the outputs of the function f together, instead of evaluating all the \(f_j\) separately. This is possible because \(\mathsf {TRLWE}\) samples have \(\mathbb {T}_N[X]\) as message space: we can encrypt up to N LUT values \(\sigma _{j,h}\) (for a fixed \(h\in [\![0,2^{d}-1 ]\!]\)) per \(\mathsf {TRLWE}\) sample and evaluate the binary decision tree as described before. The number of \(\mathtt {CMux}\) gates to evaluate is \(\lceil \frac{s}{N} \rceil (2^{d}-1)\). This technique is optimal if the size s of the output is a multiple of N. Unfortunately, s is in general smaller than N; the number of gates to evaluate then remains \(2^{d}-1\), which is only s times smaller than the non-packed approach and thus not advantageous if s is small. Lemma 3.1 specifies the noise propagation; it follows immediately from Lemma 2.7 and from the construction of the binary decision \(\mathtt {CMux}\) tree, which has depth d.
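Horizontal packing is the same tree with packed leaves: each leaf carries the s outputs for one input value, so one tree evaluation yields all of f(x). In the plaintext sketch below (the slot lists model the first s coefficients of a \(\mathsf {TRLWE}\) message; the toy sub-functions are arbitrary):

```python
def cmux(c, d1, d0):
    """Plaintext CMux, applied to a whole packed leaf at once."""
    return d1 if c else d0

def lut_tree_packed(leaves, bits):
    """Same CMux tree as before, but every leaf packs the s outputs
    of f for one input value; bits are LSB first."""
    layer = list(leaves)
    for x_i in bits:
        layer = [cmux(x_i, layer[2 * j + 1], layer[2 * j])
                 for j in range(len(layer) // 2)]
    return layer[0]          # the s slots of f(x), all at once

d, s = 2, 3                  # toy f : B^2 -> T^3, one packed leaf per input
leaves = [[h % 2, (h >> 1) & 1, h * h % 4] for h in range(2 ** d)]
assert lut_tree_packed(leaves, [1, 1]) == leaves[3]   # x = 3
```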

Lemma 3.1

(Horizontal Packing). Let \(\varvec{d}_0, \ldots ,\varvec{d}_{2^d-1}\) be \(\mathsf {TRLWE}\) samples such that \(\varvec{d}_h \in \mathsf {TRLWE}_{K}(\sum _{j=0}^{s-1} \sigma _{j,h}X^j)\) for \(h\in [\![0,2^{d}-1 ]\!]\). Here the \(\sigma _{j,h}\) are the LUT values relative to an arbitrary function \(f : \mathbb {B}^d \rightarrow \mathbb {T}^s\). Let \(C_0, \ldots , C_{d-1}\) be \(\mathsf {TRGSW}\) samples, such that \(C_i \in \mathsf {TRGSW}_{K}(x_i)\) with \(x_i \in \mathbb {B}\) (for \(i\in [\![0,d-1 ]\!]\)), and \(x = \sum _{i=0}^{d-1} x_i 2^i\). Let \(\varvec{d}\) be the \(\mathsf {TRLWE}\) sample output by the f evaluation of the binary decision \(\mathtt {CMux}\) tree for the LUT (described in Fig. 1). Then, using the same notations as in Lemma 2.7 and setting \(\textsf {msg}(\varvec{d}) = f(x)\):

  • \(\left\| \textsf {Err}(\varvec{d})\right\| _\infty \le \mathcal {A}_{\mathsf {TRLWE}} + d \cdot ((k+1) \ell N \beta \mathcal {A}_{\mathsf {TRGSW}} + (kN+1)\epsilon )\) (worst case),

  • \(\textsf {Var}(\textsf {Err}(\varvec{d})) \le \vartheta _{\mathsf {TRLWE}} + d \cdot ((k+1) \ell N \beta ^2 \vartheta _{\mathsf {TRGSW}} + (kN+1)\epsilon ^2)\) (average case), where \(\mathcal {A}_{\mathsf {TRLWE}}\) and \(\mathcal {A}_{\mathsf {TRGSW}}\) are upper bounds on the infinite norm of the errors of the \(\mathsf {TRLWE}\) samples and the \(\mathsf {TRGSW}\) samples respectively, and \(\vartheta _{\mathsf {TRLWE}}\) and \(\vartheta _{\mathsf {TRGSW}}\) are upper bounds on their variances.

Vertical Packing. In order to improve the evaluation of the LUT, we propose a second optimization called Vertical Packing. As in horizontal packing, we use \(\mathsf {TRLWE}\) encryption to encode N values at the same time. But now, instead of packing the LUT values \(\sigma _{j,h}\) with respect to a fixed \(h\in [\![0,2^{d}-1 ]\!]\), i.e. “horizontally”, we pack N values \(\sigma _{j,h}\) “vertically”, with respect to a fixed \(j \in [\![0, s-1 ]\!]\). Then, instead of just evaluating a full \(\mathtt {CMux}\) tree, we use a different approach. With the LUT values packed in boxes, our technique first uses a packed \(\mathtt {CMux}\) tree to select the right box, and then a blind rotation (Algorithm 3) to find the element inside the box.

We now explain how to evaluate the function f, or just one of its sub-functions \(f_j\), on a fixed input \(x=\sum _{i=0}^{d-1} x_i 2^i\). We assume we know the LUT associated to \(f_j\) as in Fig. 1. To retrieve the output \(f_j(x)\), we just have to return the LUT value \(\sigma _{j,x}\) at position x.

Let \(\delta =\log _2(N)\). We analyse the general case where \(2^d\) is a multiple of \(N=2^{\delta }\). The LUT of \(f_j\), which is a column of \(2^d\) values, is now packed as \(2^{d-\delta }\) \(\mathsf {TRLWE}\) ciphertexts \(\varvec{d_0},\dots ,\varvec{d}_{2^{d-\delta }-1}\), where each \(\varvec{d}_k\) encodes the N consecutive LUT values \(\sigma _{j,kN+0},\dots ,\sigma _{j,kN+N-1}\). To retrieve \(f_j(x)\), we first need to select the block that contains \(\sigma _{j,x}\). This block has index \(p=\lfloor x/N\rfloor \), whose bits are the \(d-\delta \) most significant bits of x. Since the \(\mathsf {TRGSW}\) encryptions of these bits are among our inputs, one can use a \(\mathtt {CMux}\) tree to select this block \(\varvec{d_p}\). Then, \(\sigma _{j,x}\) is the \(\rho \)-th coefficient of the message of \(\varvec{d_p}\), where \(\rho =x\mod N=\sum _{i=0}^{\delta -1}x_i2^i\). The bits of \(\rho \) are the \(\delta \) least significant bits of x, which are also available as \(\mathsf {TRGSW}\) ciphertexts in our inputs. We can therefore use a blind rotation (Algorithm 3) to homomorphically multiply \(\varvec{d_p}\) by \(X^{-\rho }\), which brings the coefficient \(\sigma _{j,x}\) to position 0, and finally extract it with a \(\mathsf {SampleExtract}\). Algorithm 5 details the evaluation of \(f_j(x)\).
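The whole pipeline, block selection on the high bits, negacyclic rotation by \(\rho \) on the low bits, then extraction of coefficient 0, can be mirrored on plaintexts. A sketch under the same modeling assumptions as before (lists stand for \(\mathbb {T}_N[X]\) coefficient vectors, conditionals for \(\mathtt {CMux}\)):

```python
def cmux(c, d1, d0):
    return d1 if c else d0

def mul_by_xpow(v, e):
    """Multiply coefficient vector v by X^e in Z[X]/(X^N + 1)."""
    N = len(v)
    e %= 2 * N
    out = [0] * N
    for i, c in enumerate(v):
        j = (i + e) % (2 * N)
        if j < N:
            out[j] += c
        else:
            out[j - N] -= c          # X^N = -1
    return out

def vertical_lut(blocks, bits):
    """blocks: 2^(d-delta) vectors of N LUT values; bits: x LSB first."""
    N = len(blocks[0])
    delta = N.bit_length() - 1       # delta = log2(N)
    low, high = bits[:delta], bits[delta:]
    layer = list(blocks)
    for b in high:                   # CMux tree on the d - delta high bits
        layer = [cmux(b, layer[2 * j + 1], layer[2 * j])
                 for j in range(len(layer) // 2)]
    rho = sum(b << i for i, b in enumerate(low))
    return mul_by_xpow(layer[0], -rho)[0]   # blind rotation + SampleExtract(0)

N, d = 4, 4                          # 2^d = 16 LUT values in 4 blocks of N
lut = [(7 * x) % 16 for x in range(2 ** d)]
blocks = [lut[p * N:(p + 1) * N] for p in range(2 ** d // N)]
assert vertical_lut(blocks, [1, 0, 1, 1]) == lut[13]   # x = 13
```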


The entire cost of the evaluation of \(f_j(x)\) with Algorithm 5 consists in \(\frac{2^d}{N}-1\) \(\mathtt {CMux}\) gates and a single blind rotation, the latter corresponding to \(\delta \) \(\mathtt {CMux}\) gates. Overall, we get a speed-up by a factor N on the evaluation of each sub-function, and thus a factor N in total.

Lemma 3.2

(Vertical Packing LUT of \(f_j\) ). Let \(f_j: \mathbb {B}^d \rightarrow \mathbb {T}\) be a sub-function of the arbitrary function f, with LUT values \(\sigma _{j,0}, \ldots , \sigma _{j,2^d-1}\). Let \(\varvec{d}_0, \ldots , \varvec{d}_{\frac{2^d}{N}-1}\) be \(\mathsf {TRLWE}\) samples, such that \(\varvec{d}_p \in \mathsf {TRLWE}_{K}(\sum _{i=0}^{N-1} \sigma _{j,pN+i} X^i)\) for \(p \in [\![0,\frac{2^d}{N}-1 ]\!]\). Let \(C_0, \ldots , C_{d-1}\) be \(\mathsf {TRGSW}\) samples, such that \(C_i \in \mathsf {TRGSW}_{K}(x_i)\), with \(x_i \in \mathbb {B}\) and \(i \in [\![0,d-1 ]\!]\).

Then Algorithm 5 outputs a \(\mathsf {TLWE}\) sample \(\varvec{\mathfrak {c}}\) such that \(\textsf {msg}(\varvec{\mathfrak {c}}) = f_j(x) = o_j\) where \(x=\sum _{i=0}^{d-1} x_i 2^i\) and using the same notations as in Lemma 2.7 and Theorem 2.8, we have:

  • \(\left\| \textsf {Err}(\varvec{\mathfrak {c}})\right\| _\infty \le \mathcal {A}_{\mathsf {TRLWE}} + d \cdot ((k+1)\ell N \beta \mathcal {A}_{\mathsf {TRGSW}} + (1+kN)\epsilon )\) (worst case),

  • \(\textsf {Var}(\textsf {Err}(\varvec{\mathfrak {c}})) \le \vartheta _{\mathsf {TRLWE}} + d \cdot ((k+1)\ell N \beta ^2 \vartheta _{\mathsf {TRGSW}} + (1+kN)\epsilon ^2)\) (average case), where \(\mathcal {A}_{\mathsf {TRLWE}}\) and \(\mathcal {A}_{\mathsf {TRGSW}}\) are upper bounds on the infinite norm of the errors in the \(\mathsf {TRLWE}\) samples and the \(\mathsf {TRGSW}\) samples respectively, while \(\vartheta _{\mathsf {TRLWE}}\) and \(\vartheta _{\mathsf {TRGSW}}\) are upper bounds on the variances.

Proof

(Sketch). The proof follows immediately from the results of Lemma 2.7 and Theorem 2.8, and from the construction of the binary decision \(\mathtt {CMux}\) tree. In particular, the first \(\mathtt {CMux}\) tree has depth \(d-\delta \) and the blind rotation evaluates \(\delta \) \(\mathtt {CMux}\) gates, which gives a total depth of d. As the \(\mathtt {CMux}\) depth is the same as in horizontal packing, the noise propagation matches too.

Remark 1

As previously mentioned, the horizontal and vertical packing techniques can be mixed to improve the evaluation of f in the case where s and d are both small, i.e. the previous two methodologies cannot be applied efficiently on their own but \(2^d \cdot s>N\). In particular, if we pack \(s=x\) coefficients horizontally and \(y=N/x\) coefficients vertically, evaluating f requires \(\lceil 2^d/y \rceil -1\) \(\mathtt {CMux}\) gates plus one vertical packing LUT evaluation, the latter being equivalent to \(\log _2(y)\) \(\mathtt {CMux}\) evaluations. The result is composed of the first x \(\mathsf {TLWE}\) samples extracted.

3.2 Arithmetic Operations via Weighted Automata

In [12], the arithmetic operations were evaluated via deterministic finite automata using \(\mathtt {CMux}\) gates. This was made possible by the fact that the messages were binary. In this paper, the samples on which we perform the arithmetic operations pack several torus values together, so a more powerful tool is needed to manage the evaluations efficiently. Deterministic weighted finite automata (det-WFA) are deterministic finite automata in which each transition carries an additional weight. By reading a word, the outcome of a det-WFA is the sum of all weights encountered along the path (here, we work with an additive group), whereas the outcome of a deterministic finite automaton (DFA) is just a boolean that states whether the destination state is accepting. The weights of a det-WFA can be seen as a memory that stores the bits of the partial result all along the evaluation path. Take for example the evaluation of the MAX circuit, which takes as input two d-bit integers and returns the maximal value of the two. With DFAs, to retrieve all the d bits of the result we need d different automata, for a total of \(O(d^2)\) transitions. By introducing weights, all the bits of the result are obtained in one pass after only O(d) transitions. To our knowledge, this paper is the first one introducing this tool in the FHE context. In this section, we detail the use of det-WFA to evaluate some arithmetic functions largely used in applications, such as addition (and multi-addition), multiplication, squaring, comparison and max. We refer to [9, 16] for further details on the theory of weighted automata.

Definition 3.3

(Deterministic weighted finite automaton (det-WFA)). A deterministic weighted finite automaton (det-WFA) over a group \((S, \oplus )\) is a tuple \(\mathfrak {A} = (Q, i, \varSigma , \mathcal {T}, F)\), where Q is a finite set of states, i is the initial state, \(\varSigma \) is the alphabet, \(\mathcal {T}\subseteq Q \times \varSigma \times S \times Q\) is the set of transitions and \(F \subseteq Q\) is the set of final states. Every transition is a tuple \(t = q \overset{\sigma ,\nu }{\longrightarrow } q'\) from the state q to the state \(q'\), reading the letter \(\sigma \), with weight \(w(t)=\nu \); there is at most one transition per pair \((q, \sigma )\).

Let \(P = (t_1, \ldots , t_d)\) be a path, with \(t_j = q_{j-1} \overset{\sigma _j, \nu _j}{\longrightarrow }q_j\). The word \(\varvec{\sigma } = \sigma _1 \ldots \sigma _d \in \varSigma ^d\) induced by P is accepted by the det-WFA \(\mathfrak {A}\) if \(q_0 = i\) and \(q_d \in F\). The weight \(w(\varvec{\sigma })\) of a word \(\varvec{\sigma }\) is equal to \(\bigoplus _{j=1}^{d} w(t_j)\), where the \(w(t_j)\) are all the weights of the transitions in P: \(\varvec{\sigma }\) is called the label of P. Note that every label induces a single path (i.e. there is only one possible path per word).

Remark 2

In our applications, we fix the alphabet \(\varSigma = \mathbb {B}\). Definition 3.3 restricts the WFA to the deterministic (the non-deterministic case is not supported), complete and universally accepting case (i.e. all the words are accepted). In the general case, the additive group would be replaced by a semi-ring \((S, \oplus , \otimes , 0, 1)\). In the rest of the paper we set \((S,\oplus )\) to \((\mathbb {T}_N[X],+)\).

Theorem 3.4

(Evaluation of det-WFA). Let \(\mathfrak {A} = (Q, i, \mathbb {B}, \mathcal {T}, F)\) be a det-WFA with weights in \((\mathbb {T}_N[X],+)\), and let |Q| denote the total number of states. Let \(C_0, \ldots , C_{d-1}\) be d valid \(\mathsf {TRGSW}_{K}\) samples of the bits of a word \(\varvec{\sigma } = \sigma _0 \ldots \sigma _{d-1}\). By evaluating at most \(d\cdot |Q|\) \(\mathtt {CMux}\) gates, we output a \(\mathsf {TRLWE}\) sample \(\varvec{d}\) that encrypts the weight \(w(\varvec{\sigma })\), such that (using the same notations as in Lemma 2.7)

  • \(\left\| \textsf {Err}(\varvec{d})\right\| _\infty \le d \cdot ( (k+1)\ell N\beta \mathcal {A}_{\mathsf {TRGSW}} + (kN+1)\epsilon )\) (worst case),

  • \(\textsf {Var}(\textsf {Err}(\varvec{d})) \le d \cdot ( (k+1)\ell N\beta ^2 \vartheta _{\mathsf {TRGSW}} + (kN+1)\epsilon ^2)\) (average case),

    where \(\mathcal {A}_{\mathsf {TRGSW}}\) is an upper bound on the infinite norm of the error in the \(\mathsf {TRGSW}\) samples and \(\vartheta _{\mathsf {TRGSW}}\) is an upper bound of their variance. Moreover, if all the words connecting the initial state to a fixed state \(q\in Q\) have the same length, then the upper bound on the number of \(\mathtt {CMux}\) to evaluate decreases to |Q|.

Proof

(Sketch). This theorem generalizes Theorem 5.4 of [12] to det-WFA. The automaton is still evaluated from the last letter \(\sigma _{d-1}\) to the first one \(\sigma _{0}\), using one \(\mathsf {TRLWE}\) ciphertext \(\varvec{c}_{j,q}\) per position \(j \in [\![0,d-1 ]\!]\) in the word and per state \(q\in Q\). Before reading a letter, all the \(\mathsf {TRLWE}\) samples \(\varvec{c}_{d,q}\), for \(q \in Q\), are initialized to zero. When processing the j-th letter \(\sigma _{j}\), each pair of transitions \(q\overset{0,\nu _0}{\longrightarrow }q_0\) and \(q\overset{1,\nu _1}{\longrightarrow }q_1\) is evaluated as \(\varvec{c}_{j,q} = \mathtt {CMux}(C_j,\varvec{c}_{j+1,q_1}+(\varvec{0},\nu _1),\varvec{c}_{j+1,q_0}+(\varvec{0},\nu _0))\). The final result is \(\varvec{c}_{0,i}\), which encodes \(w(\varvec{\sigma })\) by induction on the \(\mathtt {CMux}\) graph. Since translations are noiseless, the output noise corresponds to a depth-d chain of \(\mathtt {CMux}\) gates. As in [12], the last condition implies that only |Q| of the d|Q| \(\mathtt {CMux}\) gates are accessible and need to be evaluated.    \(\square \)
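The backward recursion of this proof can be sketched on plaintexts. The toy det-WFA below, a single state with weight 1 on letter 1, computes the Hamming weight of the word; it is our own illustration (with weights in \(\mathbb {Z}\) instead of \(\mathbb {T}_N[X]\)), not an automaton from the paper:

```python
def cmux(c, d1, d0):
    return d1 if c else d0

def eval_det_wfa(delta, states, initial, word_bits):
    """Backward evaluation of a det-WFA, as in the proof of Theorem 3.4:
    c[q] holds the weight collected from the current position to the end
    when starting in state q. delta[q] = ((q0, w0), (q1, w1)) gives the
    target state and weight on reading 0 resp. 1 from q."""
    c = {q: 0 for q in states}              # c_{d,q} initialized to zero
    for sigma in reversed(word_bits):       # last letter first
        c = {q: cmux(sigma,
                     c[delta[q][1][0]] + delta[q][1][1],
                     c[delta[q][0][0]] + delta[q][0][1])
             for q in states}
    return c[initial]

# Toy det-WFA over (Z, +): one state, weight 1 on letter 1 -> Hamming weight.
delta = {"q": (("q", 0), ("q", 1))}
assert eval_det_wfa(delta, ["q"], "q", [1, 0, 1, 1, 0, 1]) == 4
```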

MAX. In order to evaluate the MAX circuit on two d-bit integers \(x=\sum _{i=0}^{d-1} x_i 2^i\) and \(y=\sum _{i=0}^{d-1} y_i 2^i\), we construct a det-WFA that takes as input all the bits \(x_{d-1}, \ldots , x_0\) of x and \(y_{d-1}, \ldots , y_0\) of y, and outputs the maximal value of the two. The idea is to enumerate the \(x_{i}\) and \(y_{i}\), starting from the most significant bits down to the least significant ones. The det-WFA described in Fig. 2 has 3 principal states (A, B, E) and 4 intermediary states ((A), (B), (E, 1), (E, 0)), which keep track of which number is the maximum and, in case of equality, of the last value of \(x_i\). A weight is added on every transition that reads the digit 1 of the maximum. Overall, the next lemma, which is a direct consequence of Theorem 3.4, shows that the Max can be computed by evaluating only 5d \(\mathtt {CMux}\) gates, instead of \(\varTheta (d^2)\) with classical deterministic automata.

Fig. 2.

Max: det-WFA - The states A and (A) mean that y is the maximal value, the states B and (B) mean that x is the maximal value, and the states E, (E, 1) and (E, 0) mean that x and y are equal on their most significant bits. If the current state is A or B, the following state stays the same. The initial state is E. If the current state is E, after reading \(x_i\) there are two possible intermediate states: (E, 1) if \(x_i=1\) and (E, 0) if \(x_i=0\). After reading the value of \(y_i\), the three states A, B and E are all reachable. The det-WFA is repeated as many times as the bit length of the integers evaluated, and the weights are given in clear.

Remark 3

In practice, to evaluate the MAX function, we convert the det-WFA into a circuit of 5d \(\mathtt {CMux}\) gates. Roughly speaking, we have to read the automaton in reverse. We initialize 5 states \(A, B, E_0, E_1, E\) as null \(\mathsf {TRLWE}\) samples. Then, for i from 0 to \(d-1\) (i.e. processing the letters in the reverse of the automaton’s MSB-first reading order), we update the states as follows:

$$ {\left\{ \begin{array}{ll} E_0 := \mathtt {CMux}(C_{i}^y, A + (\varvec{0},\frac{1}{2}X^i), E); \\ E_1 := \mathtt {CMux}(C_{i}^y, E, B); \\ A := \mathtt {CMux}(C_{i}^y, A + (\varvec{0},\frac{1}{2}X^i), A); \\ E := \mathtt {CMux}(C_{i}^x, E_1 + (\varvec{0},\frac{1}{2}X^i), E_0); \\ B := \mathtt {CMux}(C_{i}^x, B + (\varvec{0},\frac{1}{2}X^i), B). \end{array}\right. } $$

Here the \(C_{i}^x\) and \(C_{i}^y\) are \(\mathsf {TRGSW}\) encryptions of the input bits \(x_i\) and \(y_i\) respectively. The output of the evaluation is the \(\mathsf {TRLWE}\) sample E, which contains the maximal value.
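The five updates can be replayed on plaintext integers, with weights \(2^i\) in \(\mathbb {Z}\) standing in for \((\varvec{0},\frac{1}{2}X^i)\) and plain conditionals standing in for \(\mathtt {CMux}\). One detail is our reading of the evaluation order: since the circuit evaluates the automaton backwards, the updates below are applied from the least significant bit upward, which the exhaustive check confirms:

```python
def cmux(c, d1, d0):
    return d1 if c else d0

def wfa_max(x_bits, y_bits):
    """Plaintext model of the Max circuit: 5 CMux per bit position.
    x_bits, y_bits are LSB-first bit lists of equal length d."""
    A = B = E = 0                    # null accumulators (TRLWE zeros)
    for i, (xi, yi) in enumerate(zip(x_bits, y_bits)):
        w = 1 << i                   # weight 2^i models (0, X^i / 2)
        E0 = cmux(yi, A + w, E)
        E1 = cmux(yi, E, B)
        A = cmux(yi, A + w, A)
        E = cmux(xi, E1 + w, E0)
        B = cmux(xi, B + w, B)
    return E                         # accumulator of the initial state E

def bits(n, d):
    return [(n >> i) & 1 for i in range(d)]

d = 4
for x in range(2 ** d):              # exhaustive check for d = 4
    for y in range(2 ** d):
        assert wfa_max(bits(x, d), bits(y, d)) == max(x, y)
```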

Lemma 3.5

(Evaluation of Max det-WFA). Let \(\mathfrak {A}\) be the det-WFA of the Max, described in Fig. 2. Let \(C^x_0, \ldots , C^x_{d-1}, C^y_0, \ldots , C^y_{d-1}\) be \(\mathsf {TRGSW}_{K}\) samples of the bits of x and y respectively. By evaluating 5d \(\mathtt {CMux}\) gates (depth 2d), the Max det-WFA outputs a \(\mathsf {TRLWE}\) sample \(\varvec{d}\) encrypting the maximal value of x and y, and (with the same notations as in Lemma 2.7):

  • \(\left\| \textsf {Err}(\varvec{d})\right\| _\infty \le 2d \cdot ((k+1)\ell N\beta \mathcal {A}_{\mathsf {TRGSW}} + (kN+1)\epsilon )\) (worst case);

  • \(\textsf {Var}(\textsf {Err}(\varvec{d})) \le 2d \cdot ((k+1)\ell N\beta ^2 \vartheta _{\mathsf {TRGSW}} + (kN+1)\epsilon ^2)\) (average case).

    Here \(\mathcal {A}_{\mathsf {TRGSW}}\) and \(\vartheta _{\mathsf {TRGSW}}\) are upper bounds of the amplitude and of the variance of the errors in the \(\mathsf {TRGSW}\) samples.

Multiplication. For the multiplication we use the same approach and construct a det-WFA that mirrors the schoolbook multiplication. We illustrate the construction on the example of the multiplication of two 2-bit integers \(x=x_1x_0\) and \(y=y_1y_0\). After an initial step of bit-by-bit multiplication, a multi-addition (each line shifted one place to the left) is performed. The bits of the final result are computed as the sum of each column with carry.

The det-WFA computes the multiplication by keeping track of the partial sum of each column in the states, and by using the transitions to update these sums. For the multiplication of 2-bit integers, the automaton (described in Fig. 3) has 6 main states (i, \(c_0\), \(c_{10}\), \(c_{11}\), \(c_{20}\), \(c_{21}\)), plus 14 intermediary states that store the last bit read (noted with capital letters and parentheses). The value of the i-th output bit is put in a weight on the last transition of each column.

Fig. 3.

Schoolbook 2-bits multiplication and corresponding det-WFA

For the generic multiplication of two d-bit integers, we can upper bound the number of states by \(4d^3\), instead of \(\varTheta (d^4)\) with one classical automaton per output bit. For a more precise count, we wrote a C++ program that eliminates unreachable states and refines the leading coefficient. The depth is \(2d^2\), and the noise growth can easily be deduced from the previous results. The same principle can be used to construct the multi-addition, whose det-WFA is slightly simpler (one transition per bit in the sum instead of two).

3.3 TBSR Counter Techniques

We now present another design which is specific to the multi-addition (and its derivatives), but which is faster than the generic construction with weighted automata. The idea is to build a homomorphic scheme that can represent small integers, say between 0 and \(N=2^p\), and which is dedicated to only the three elementary operations used in the multi-addition algorithm, namely:

  1. Extract any of the bits of the value as a \(\mathsf {TLWE}\) sample;

  2. Increment the value by 1;

  3. Divide the value by 2 (integer division).

We will now explain the basic idea, and then, we will show how to implement it efficiently on \(\mathsf {TRLWE}\) ciphertexts.

For \(j\in [0,p=\log _2(N)]\) and \(k,l\in \mathbb {Z}\), we call \(B^{(l)}_{j,k}\) the j-th bit of \(k+l\) in the little endian signed binary representation. These form very simple binary sequences: \(B^{(0)}_0=(0,1,0,1,\ldots )\) is 2-periodic, \(B^{(0)}_1=(0,0,1,1,0,0,1,1\ldots )\) is 4-periodic, and more generally, for all \(j\in [0,p]\) and \(l\in \mathbb {Z}\), \(B^{(l)}_j\) is \(2^{j}\)-antiperiodic and is the left shift of \(B^{(0)}_j\) by l positions. Therefore, it suffices to have \(2^j\le N\) consecutive values of the sequence to (blindly) deduce all the remaining bits. Most importantly, for each integer \(k\in \mathbb {Z}\), \((B^{(l)}_{0,k},B^{(l)}_{1,k},\ldots ,B^{(l)}_{p,k})\) is the (little endian signed) binary representation of \(l+k\mod 2N\). We now suppose that an integer l in \([0,N-1]\) is represented by its Bit Sequence Representation, defined as \(BSR(l)=[B^{(l)}_0,\dots ,B^{(l)}_p]\). We first show how to compute \(BSR(l+1)\) and \(BSR(\lfloor l/2\rfloor )\) using only copy and negation operations on bits at fixed positions which do not depend on l (blind computation). Then, we will see how to represent these operations homomorphically on \(\mathsf {TRLWE}\) ciphertexts (Fig. 4).

Fig. 4. TBSR – example of addition \(+5\) and division by 2.

Increment: Let \(U=[u_0,\dots ,u_p]\) be the BSR of some unknown number \(l\in [0,N-1]\). Our goal is to compute \(V=[v_0,\dots ,v_p]\), which is the BSR of \(l+1\). Again, we recall that it suffices to define each sequence \(v_j\) on N consecutive values; the rest is deduced by (anti)periodicity. To implement the increment, all we need to do is shift the sequences by one position: \(v_{j,k}:=u_{j,k+1}\) for all \(k\in [0,N-1]\). Indeed, this operation transforms each \(B^{(l)}_{j,k}\) into \(B^{(l)}_{j,k+1}=B^{(l+1)}_{j,k}\), so the output V is the BSR of \(l+1\).

Integer division by two: Let \(U=[u_0,\dots ,u_p]\) be the BSR of some unknown number \(l\in [0,N-1]\). Our goal is to compute \(V=[v_0,\dots ,v_p]\), which is the BSR of \(\lfloor \frac{l}{2}\rfloor \). First, we note that the integer division by 2 corresponds to a right shift over the bits. Thus for \(j\in [0,p-1]\) and \(k\in \mathbb {N}\), we can set \(v_{j,k}=u_{j+1,2k}\). Indeed, \(u_{j+1,2k}\) is the \((j+1)\)-th bit of \(l+2k\), which is the j-th bit of its half \(\lfloor l/2 \rfloor +k\); this is our desired \(v_{j,k}=B^{(\lfloor l/2 \rfloor )}_{j,k}\). This is unfortunately not enough to reconstruct the last sequence \(v_p\), since we have no information on the \((p+1)\)-th bits in U. However, in our case, we can reconstruct this last sequence directly. First, the numbers \(\lfloor \frac{l}{2}\rfloor +k\) for \(k\in [0,N/2-1]\) are all \(<N\), so we can blindly set the corresponding \(v_{p,k}=0\). Then, we just need to note that \((u_{p,0},\dots ,u_{p,N-1})\) is \(N-l\) times 0 followed by l times 1, so our target \((v_{p,N/2},\dots ,v_{p,N-1})\) must consist of \(N/2-\lfloor l/2\rfloor \) times 0 followed by \(\lfloor l/2\rfloor \) times 1. Therefore, our target can be filled with the even positions \((u_{p,0},u_{p,2},\dots ,u_{p,N-2})\). To summarize, division by 2 corresponds to the following blind transformation:

$$ \left\{ \begin{array}{lcl} v_{j,k} &{} = &{} u_{j+1,2k} \text { for }j\in [0,p-1],k\in [0,N-1]\\ v_{p,k} &{} = &{} 0\text { for }k\in [0,\frac{N}{2}-1]\\ v_{p,N/2+k} &{} = &{} u_{p,2k}\text { for }k\in [0,\frac{N}{2}-1]\\ \end{array} \right. $$
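Both blind transformations can be verified on plaintext bit sequences against a direct recomputation of the BSR; a Python sketch (our own, for illustration):

```python
def bsr(l, p, length):
    """length consecutive values of B_j^{(l)}: bit j of (l+k) mod 2N."""
    N = 1 << p
    return [[(((l + k) % (2 * N)) >> j) & 1 for k in range(length)]
            for j in range(p + 1)]

def increment(U):
    # v_{j,k} = u_{j,k+1}: shift every sequence by one position
    return [row[1:] for row in U]

def div2(U, p, N):
    # v_{j,k} = u_{j+1,2k} for j in [0, p-1]; the last sequence v_p is
    # N/2 zeros followed by the even positions of u_p
    V = [[U[j + 1][2 * k] for k in range(N)] for j in range(p)]
    V.append([0] * (N // 2) + [U[p][2 * k] for k in range(N // 2)])
    return V

p, N = 3, 8
for l in range(N):
    U = bsr(l, p, 2 * N)
    if l < N - 1:  # increment is only used while the counter stays < N
        assert [r[:N] for r in increment(U)] == bsr(l + 1, p, N)
    assert div2(U, p, N) == bsr(l // 2, p, N)
```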

We now explain how we can encode these BSR sequences on \(\mathsf {TRLWE}\) ciphertexts, considering that all the coefficients need to be in the torus rather than in \(\mathbb {B}\), and that we need to encode sequences that are either N-periodic or N-antiperiodic. Furthermore, since the cyclic shift of coefficients is heavily used in the increment operation, we would like it to correspond to the multiplication by X, which has a similar behaviour on the coefficients of torus polynomials. This gives our basic encoding of the BSR sequences: let \(U=[u_0,\dots ,u_p]\) be the BSR of some unknown number \(l\in [0,N-1]\). For \(j\in [0,p-1]\), we represent \(u_j\) by the polynomial \(\mu _j=\sum _{k=0}^{N-1} \frac{1}{2}u_{j,k}X^k\), and we represent the last sequence \(u_p\) by the polynomial \(\mu _p=\sum _{k=0}^{N-1} (\frac{1}{2}u_{p,k}-\frac{1}{4})X^k\). This simple rescaling between the bit representation U and the torus representation \(M=[\mu _0,\dots ,\mu _p]\) is bijective. Using this encoding, the integer division transformation presented above immediately rewrites into the following affine function, which transforms the coefficients \((\mu _{j,k})_{j\in [1,p],k\in \{0,2,\ldots ,2N-2\}}\in \mathbb {T}^{pN}\) into \((\mu '_0,\dots ,\mu '_p)\):

$$ \pi _{\text {div2}} : \left\{ \begin{array}{lcl} \mu '_{j,k} &{} = &{} \mu _{j+1,2k} \text { for }j\in [0,p-2],k\in [0,N-1]\\ \mu '_{p-1,k} &{} = &{} \mu _{p,2k}+\frac{1}{4} \text { for }k\in [0,N-1]\\ \mu '_{p,k} &{} = &{} -\frac{1}{4}\text { for }k\in [0,\frac{N}{2}-1]\\ \mu '_{p,N/2+k} &{} = &{} \mu _{p,2k}\text { for }k\in [0,\frac{N}{2}-1]\\ \end{array} \right. $$
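Working with exact rationals, one can check that \(\pi _{\text {div2}}\) maps the torus encoding of BSR(l) to the torus encoding of \(BSR(\lfloor l/2\rfloor )\); a Python sketch of this plaintext verification (ours):

```python
from fractions import Fraction as F

def bsr(l, p, length):
    """length consecutive values of B_j^{(l)}: bit j of (l+k) mod 2N."""
    N = 1 << p
    return [[(((l + k) % (2 * N)) >> j) & 1 for k in range(length)]
            for j in range(p + 1)]

def encode(U, p):
    """Torus representation: mu_j = u_j/2 for j < p, mu_p = u_p/2 - 1/4."""
    return [[F(b, 2) for b in U[j]] for j in range(p)] + \
           [[F(b, 2) - F(1, 4) for b in U[p]]]

def pi_div2(M, p, N):
    out = [[M[j + 1][2 * k] for k in range(N)] for j in range(p - 1)]
    out.append([M[p][2 * k] + F(1, 4) for k in range(N)])
    out.append([F(-1, 4)] * (N // 2) + [M[p][2 * k] for k in range(N // 2)])
    return out

p, N = 3, 8
for l in range(N):
    assert pi_div2(encode(bsr(l, p, 2 * N), p), p, N) == \
           encode(bsr(l // 2, p, N), p)
```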

Finally, we call a TBSR ciphertext of an unknown integer \(l\in [0,N-1]\) a vector \(C=[c_0,\ldots ,c_p]\) of \(\mathsf {TRLWE}\) ciphertexts with messages \([\mu _0,\dots ,\mu _p]\).

Definition 3.6

(TBSR encryption).

  • Params and keys: \(\mathsf {TRLWE}\) parameters N with secret key \(K\in \mathbb {B}_N[X]\), and a circular-secure keyswitching key \(KS_{K\rightarrow K,\gamma }\) from K to itself, denoted simply KS.

  • \(\mathsf {TBSRSet}(l)\): return a vector of trivial \(\mathsf {TRLWE}\) ciphertexts encoding the torus representation of \([B^{(l)}_{0},\dots ,B^{(l)}_{p}]\).

  • \(\mathsf {TBSRBitExtract}_j(C)\): Return \(\mathsf {SampleExtract}_0(c_j)\) when \(j<p\).

  • \(\mathsf {TBSRIncrement}(C)\): Return \(X^{-1}\cdot C\).

  • \(\mathsf {TBSRDiv2}(C)\): Use KS to evaluate \(\pi _{div2}\) homomorphically on C. Since \(\pi _{div2}\) is a 1-lipschitzian affine function, this means: apply the public functional \(\mathsf {KeySwitch}\) to KS, the linear part of \(\pi _{div2}\) and C, and then translate the result by the constant part of \(\pi _{div2}\).
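The increment relies on the fact that multiplication by \(X^{-1}\) modulo \(X^N+1\) shifts the coefficients by one position and negates the coefficient that wraps around: negation mod 1 is the identity on the \(\{0,\frac{1}{2}\}\) coefficients, and on the \(\{\frac{1}{4},\frac{3}{4}\}\) coefficients of \(\mu _p\) it flips the encoded bit, matching the N-antiperiodicity of \(u_p\). A plaintext Python sketch (ours), with torus values as exact fractions mod 1:

```python
from fractions import Fraction as F

def mul_x_inv(c):
    """Multiply a torus polynomial (coefficient list, mod X^N + 1)
    by X^{-1} = -X^{N-1}: shift left, wrap -c_0 to position N-1."""
    return c[1:] + [(-c[0]) % 1]

def tbsr_encode(l, p):
    """Torus encoding of BSR(l), coefficients reduced mod 1."""
    N = 1 << p
    U = [[(((l + k) % (2 * N)) >> j) & 1 for k in range(N)]
         for j in range(p + 1)]
    M = [[F(b, 2) for b in row] for row in U[:p]]
    M.append([(F(b, 2) - F(1, 4)) % 1 for b in U[p]])
    return M

p, N = 3, 8
for l in range(N - 1):
    # X^{-1} * encoding(l) == encoding(l+1), coefficient-wise
    assert [mul_x_inv(row) for row in tbsr_encode(l, p)] == tbsr_encode(l + 1, p)
```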

Theorem 3.7

(TBSR operations). Let N, K, and KS be TBSR parameters as above, and C a TBSR ciphertext of l with noise amplitude \(\eta \) (or noise variance \(\vartheta \)). Then for \(j\in [0,p-1]\), \(\mathsf {TBSRBitExtract}_j(C)\) is a \(\mathsf {TLWE}_{\mathfrak {K}}\) ciphertext of the j-th bit of l, over the message space \(\{0,\frac{1}{2}\}\), with noise amplitude (resp. variance) \(\le \eta \) (resp. \(\le \vartheta \)). If \(l\le N-2\), \(\mathsf {TBSRIncrement}(C)\) is a TBSR ciphertext of \(l+1\) with noise amplitude (resp. variance) \(\le \eta \) (resp. \(\le \vartheta \)). \(C'=\mathsf {TBSRDiv2}(C)\) is a TBSR ciphertext of \(\lfloor l/2 \rfloor \) such that:

  • \(\left\| \textsf {Err}(C')\right\| _\infty \le \mathcal {A}+ N^2t\mathcal {A}_{\mathsf {KS}} + N2^{-(t+1)}\) (worst-case);

  • \(\textsf {Var}(\textsf {Err}(C')) \le \vartheta + N^2t\vartheta _{\mathsf {KS}} + N2^{-2(t+1)}\) (average case).

Proof

(sketch). Correctness has already been discussed; the noise bound corresponds to the application of a public keyswitch on the same key, with \(n=N\).

Using the TBSR counter for a multi-addition or a multiplication.

The TBSR counter allows us to perform a multi-addition or a multiplication using the elementary schoolbook algorithms. This leads to a leveled multiplication circuit with \(\mathsf {KeySwitching}\) whose size is quadratic instead of cubic with weighted automata.

Lemma 3.8

Let N, \(B_g\), \(\ell \) and \(\mathsf {KS}\) be TBSR and \(\mathsf {TRGSW}\) parameters with the same key K. We suppose that each TBSR ciphertext consists of \(p\le 1+\log (N)\) \(\mathsf {TRLWE}\) ciphertexts. And let \((A_{i})\) and \((B_i)\) for \(i\in [0,d-1]\) be \(\mathsf {TRGSW}\) encryptions of the bits of two d-bit integers (little endian), with the same noise amplitude \(\mathcal {A}_A\) (resp. variance \(\vartheta _A\)).

Then, there exists an algorithm (see the full version for more details) that computes all the bits of the product within \(2d^2p\) CMux and \((2d-2)p\) public keyswitch, and the output ciphertexts satisfy:

  • \(\left\| \textsf {Err}(Out)\right\| _\infty \le 2d^2 ((k+1)\ell N\beta \mathcal {A}_A+(kN+1)\epsilon ) + (2d-2) (N^2t\mathcal {A}_{\mathsf {KS}}+N2^{-(t+1)})\);

  • \(\textsf {Var}(\textsf {Err}(Out)) \le 2d^2 ((k+1)\ell N\beta ^2\vartheta _A+(kN+1)\epsilon ^2) + (2d-2) (N^2t\vartheta _{\mathsf {KS}}+N2^{-2(t+1)})\).

4 Combining Leveled with Bootstrapping

In the previous sections, we presented efficient leveled algorithms for some arithmetic operations, but the inputs and outputs have different types (e.g. \(\mathsf {TLWE}\)/\(\mathsf {TRGSW}\)), so we cannot compose these operations as in a usual algorithm. In fully homomorphic mode, connecting the two becomes possible if we have an efficient bootstrapping between \(\mathsf {TLWE}\) and \(\mathsf {TRGSW}\) ciphertexts. Fast bootstrapping procedures have been proposed in [12, 17], and the external product Theorem 2.3 from [7, 12] has contributed to accelerating leveled operations. Unfortunately, these bootstrappings cannot output GSW ciphertexts. Previous solutions proposed in [1, 19, 21], based on the internal product, are not practical. In this section, we propose an efficient technique to convert \(\mathsf {TLWE}\) ciphertexts back to \(\mathsf {TRGSW}\), which runs in 137 ms. We call it circuit bootstrapping.

Our goal is to convert a \(\mathsf {TLWE}\) sample with large noise amplitude over some binary message space (e.g. amplitude \(\frac{1}{4}\) over \(\{0,\frac{1}{2}\}\)), into a \(\mathsf {TRGSW}\) sample with a low noise amplitude \({<}2^{-20}\) over the integer message space \(\{0,1\}\).

In all previous constructions, the \(\mathsf {TLWE}\) decryption is expressed as a circuit, which is then evaluated using the internal addition and multiplication laws over \(\mathsf {TRGSW}\) ciphertexts. The target \(\mathsf {TRGSW}\) ciphertext is thus the result of an arithmetic expression over \(\mathsf {TRGSW}\) ciphertexts. Instead, we propose a more efficient technique, which reconstructs the target directly from its very sparse internal structure. Namely, a \(\mathsf {TRGSW}\) ciphertext of a message \(\mu \in \{0,1\}\) is a vector of \((k+1)\ell \) \(\mathsf {TRLWE}\) ciphertexts. Each of these \(\mathsf {TRLWE}\) ciphertexts encrypts the same message as \(\mu h_i\), where \(h_i\) is the corresponding row of the gadget matrix H. Depending on the position of the row (which can be indexed by \(u\in [1,k+1]\) and \(j\in [1,\ell ]\)), this message is \(\mu \cdot (-K_u)\cdot B_g^{-j}\), where \(K_u\) is the u-th polynomial of the secret key and \(K_{k+1}=-1\). So we can use \(\ell \) times the \(\mathsf {TLWE}\)-to-\(\mathsf {TLWE}\) bootstrapping of [12] to obtain a \(\mathsf {TLWE}\) sample of each message in \(\{\mu B_g^{-1},\ldots ,\mu B_g^{-\ell } \}\). Then we use the private key-switching technique to “multiply” these ciphertexts by the secret \(-K_u\), which reconstructs the correct messages.
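At the message level (ignoring all noise), the structure exploited here can be exhibited on a scalar toy example, following the sign convention of this paragraph: the row indexed by (u, w) carries \(-\mu K_u\cdot B_g^{-w}\), with \(K_{k+1}=-1\). All names and toy parameters below are ours:

```python
from fractions import Fraction as F

# Toy scalar parameters, for illustration only: k = 1, ell = 2, B_g = 4
k, ell, Bg = 1, 2, 4
K = [1, -1]   # K[0] plays the role of K_1; K[1] = K_{k+1} = -1 by convention

def circuit_bootstrap_messages(mu):
    """Message-level view of circuit bootstrapping: from mu in {0,1},
    rebuild the (k+1)*ell messages -mu*K_u/Bg^w of the target TRGSW rows."""
    # Step 1: ell TLWE-to-TLWE bootstrappings produce messages mu/Bg^w
    boot = [F(mu, Bg ** w) for w in range(1, ell + 1)]
    # Step 2: the private key-switch "multiplies" each one by the secret -K_u
    return [[-K[u] * c for c in boot] for u in range(k + 1)]

assert circuit_bootstrap_messages(1) == [[F(-1, 4), F(-1, 16)],
                                         [F(1, 4), F(1, 16)]]
assert circuit_bootstrap_messages(0) == [[0, 0], [0, 0]]
```

In the real scheme, each of these messages is carried by a \(\mathsf {TRLWE}\) ciphertext, and the noise analysis of Theorem 4.1 applies.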

4.1 Circuit Bootstrapping (\(\mathsf {TLWE}\)-to-\(\mathsf {TRGSW}\))

Our circuit bootstrapping, detailed in Algorithm 6, crosses 3 levels of noise and encryption (Fig. 5). Each level has its own key and parameter set. In order to distinguish the different levels, we use an intuitive notation with bars: the upper bar denotes level 2 variables, the under bar level 0 variables, and level 1 variables carry no bar. The main difference between the three levels of encryption is the amount of noise supported: the higher the level, the smaller the noise. Level 0 corresponds to ciphertexts with very large noise (typically, \(\underline{\alpha }\ge 2^{-11}\)). Level 0 parameters are very small and computations are almost instantaneous, but only a very limited number of linear operations is tolerated. Level 1 corresponds to medium noise (typically, \(\alpha \ge 2^{-30}\)). Ciphertexts in level 1 have medium-size parameters, which allows for relatively fast operations, for instance a leveled homomorphic evaluation of a relatively large automaton, with transition timings described in Sect. 5 of [12]. Level 2 corresponds to ciphertexts with small noise (typically, \(\bar{\alpha }\ge 2^{-50}\)). This level corresponds to the limit of what can be mapped over native 64-bit operations. Practical values and details are given in Sect. 5.

Fig. 5. The three levels of encryption between which our construction moves. The arrows show the operations that can be performed inside each level, and how to move from one level to another. To distinguish the objects by level, we adopt the intuitive notation “upper bar” for level 2, “no bar” for level 1 and “under bar” for level 0. The different stages of the circuit bootstrapping (described in detail below) are highlighted in blue.

Our circuit bootstrapping consists of three parts:

  • \(\mathsf {TLWE}\)-to-\(\mathsf {TLWE}\) Pre-keyswitch. The input of the algorithm is a \(\mathsf {TLWE}\) sample with a large noise amplitude over the message space \(\{0,\frac{1}{2}\}\). Without loss of generality, it can be keyswitched to a level 0 \(\mathsf {TLWE}\) ciphertext \(\underline{\varvec{\mathfrak {c}}} = (\underline{\varvec{\mathfrak {a}}},\underline{\mathfrak {b}}) \in \mathsf {TLWE}_{\underline{\mathfrak {K}},\underline{\eta }}(\mu \cdot \frac{1}{2})\) of a message \(\mu \in \mathbb {B}\), with respect to the small secret key \(\underline{\mathfrak {K}}\in \mathbb {B}^{\underline{n}}\) and a large standard deviation \(\underline{\eta }\in \mathbb {R}\) (typically, \(\underline{\eta }\le 2^{-5}\) to guarantee correct decryption with overwhelming probability). This step is standard.

  • \(\mathsf {TLWE}\)-to-\(\mathsf {TLWE}\) Bootstrapping (Algorithm 4): Given a level 2 bootstrapping key \(\text {BK}_{\underline{\mathfrak {K}} \rightarrow \bar{\mathfrak {K}},\bar{\alpha }} = (\text {BK}_i)_{i\in [\![1,\underline{n} ]\!]}\), where \(\text {BK}_i \in \mathsf {TRGSW}_{\bar{K},\bar{\alpha }}(\underline{\mathfrak {K}}_i)\), we apply the \(\mathsf {TLWE}\)-to-\(\mathsf {TLWE}\) bootstrapping \(\ell \) times to \(\underline{\varvec{\mathfrak {c}}}\), to obtain \(\ell \) \(\mathsf {TLWE}\) ciphertexts \(\bar{\varvec{\mathfrak {c}}}^{(1)},\dots ,\bar{\varvec{\mathfrak {c}}}^{(\ell )}\), where \(\bar{\varvec{\mathfrak {c}}}^{(w)}\in \mathsf {TLWE}_{\bar{\mathfrak {K}},\bar{\eta }}(\mu \cdot \frac{1}{\bar{B_g}^w})\), with respect to the same level 2 secret key \(\bar{\mathfrak {K}}\in \mathbb {B}^{\bar{n}}\), and with a fixed noise parameter \(\bar{\eta }\in \mathbb {R}\) which does not depend on the input noise. If the bootstrapping key has a level 2 noise \(\bar{\alpha }\), we expect the output noise \(\bar{\eta }\) to remain smaller than the level 1 value.

  • \(\mathsf {TLWE}\)-to-\(\mathsf {TRLWE}\) private key-switching (Algorithm 2): Finally, to reconstruct the final \(\mathsf {TRGSW}\) ciphertext of \(\mu \), we simply need to craft a \(\mathsf {TRLWE}\) ciphertext which has the same phase as \(\mu \cdot \varvec{h_i}\), for each row of the gadget matrix H. Since \(\varvec{h_i}\) contains only a single non-zero constant polynomial, in position \(u\in [1,k+1]\), whose value is \(\frac{1}{B_g^w}\) with \(w\in [1,\ell ]\), the phase of \(\mu \cdot \varvec{h_i}\) is \(\mu K_u \cdot \frac{1}{B_g^w}\), where \(K_u\) is the u-th term of the key K. If we call \(f_u\) the (secret) morphism from \(\mathbb {T}\) to \(\mathbb {T}_N[X]\) defined by \(f_u(x)=K_u\cdot x\), we just need to apply \(f_u\) homomorphically to the \(\mathsf {TLWE}\) sample \(\bar{\varvec{\mathfrak {c}}}^{(w)}\) to get the desired \(\mathsf {TRLWE}\) sample. Since \(f_u\) is 1-lipschitzian (for the infinity norm), this operation can be done with only additive noise overhead via the private functional keyswitch (Algorithm 2).

[Algorithm 6: Circuit Bootstrapping (\(\mathsf {TLWE}\)-to-\(\mathsf {TRGSW}\)); listing omitted]

Theorem 4.1

(Circuit Bootstrapping Theorem). Let \(n,\alpha ,N,k,B_g,\ell ,H,\epsilon \) denote \(\mathsf {TRLWE}\)/\(\mathsf {TRGSW}\) level 1 parameters, and the same variable names with under/upper bars denote level 0 and level 2 parameters. Let \(\underline{\mathfrak {K}}\in \mathbb {B}^{\underline{n}}\), \(\mathfrak {K}\in \mathbb {B}^{n}\) and \(\bar{\mathfrak {K}}\in \mathbb {B}^{\bar{n}}\) be level 0, 1 and 2 \(\mathsf {TLWE}\) secret keys, and \(\underline{K}, K, \bar{K}\) their respective \(\mathsf {TRLWE}\) interpretations. Let \(\text {BK}_{\underline{\mathfrak {K}}\rightarrow \bar{\mathfrak {K}},\bar{\alpha }}\) be a bootstrapping key, composed of the \(\underline{n}\) \(\mathsf {TRGSW}\) encryptions \(\text {BK}_i \in \mathsf {TRGSW}_{\bar{K},\bar{\alpha }}(\underline{\mathfrak {K}}_i)\) for \(i\in [\![1,\underline{n} ]\!]\). For each \(u\in [\![1,k+1 ]\!]\), let \(f_u\) be the morphism from \(\mathbb {T}\) to \(\mathbb {T}_N[X]\) defined by \(f_u(x)=K_u\cdot x\), and let \(\mathsf {KS}^{f_u}_{\bar{\mathfrak {K}} \rightarrow K,\gamma } = (\mathsf {KS}^{(u)}_{i,j} \in \mathsf {TRLWE}_{K,\gamma }(\bar{\mathfrak {K}}_i K_u\cdot 2^{-j}))_{i\in [\![1, \bar{n} ]\!], j\in [\![1,t ]\!]}\) be the corresponding private key-switching key. Given a level 0 \(\mathsf {TLWE}\) sample \(\underline{\varvec{\mathfrak {c}}} = (\underline{\varvec{\mathfrak {a}}},\underline{\mathfrak {b}}) \in \mathsf {TLWE}_{\underline{\mathfrak {K}}}(\mu \cdot \frac{1}{2})\) with \(\mu \in \mathbb {B}\), Algorithm 6 outputs a level 1 \(\mathsf {TRGSW}\) sample \(C \in \mathsf {TRGSW}_{K}(\mu )\) such that

  • \(\left\| \textsf {Err}(C)\right\| _\infty \le \underline{n}(\bar{k}+1)\bar{\ell } \bar{N} \bar{\beta } \mathcal {A}_{\text {BK}} + \underline{n}(1+\bar{k}\bar{N})\bar{\epsilon } + \bar{n} 2^{-(t+1)} + \bar{n} t \mathcal {A}_{\mathsf {KS}}\) (worst);

  • \(\textsf {Var}(\textsf {Err}(C)) \le \underline{n}(\bar{k}+1)\bar{\ell } \bar{N} \bar{\beta }^2 \bar{\vartheta }_{\text {BK}} + \underline{n}(1+\bar{k}\bar{N})\bar{\epsilon }^2 + \bar{n} 2^{-2(t+1)} + \bar{n} t \vartheta _{\mathsf {KS}}\) (average).

Here \(\bar{\vartheta }_{\text {BK}} = \bar{\alpha }^2\) and \(\mathcal {A}_{\text {BK}}\) are the variance and amplitude of \(\textsf {Err}(\text {BK}_{\underline{\mathfrak {K}} \rightarrow \bar{K},\bar{\alpha }})\), and \(\vartheta _{\mathsf {KS}} = \gamma ^2\) and \(\mathcal {A}_{KS}\) are the variance and amplitude of \(\textsf {Err}(\mathsf {KS}_{\bar{\mathfrak {K}} \rightarrow K,\gamma })\).

Proof

(sketch). The output \(\mathsf {TRGSW}\) ciphertext is correct because, by construction, the i-th \(\mathsf {TRLWE}\) component \(\varvec{c}^{(u,w)}\) has the correct message \(\textsf {msg}(\mu \cdot H_i)=\mu K_u/B_g^w\). \(\varvec{c}^{(u,w)}\) is obtained by chaining one \(\mathsf {TLWE}\)-to-\(\mathsf {TLWE}\) bootstrapping (Algorithm 4) with one private key switching, as in Algorithm 2. The maximal amplitude and variance of \(\textsf {Err}(C)\) are directly obtained from the partial results of Lemma 2.9 and Theorem 2.6. In total, Algorithm 6 performs exactly \(\ell \) bootstrappings (Algorithm 4) and \(\ell (k+1)\) private key switchings (Algorithm 2).

   \(\square \)

Comparison with previous bootstrappings for \(\varvec{\mathsf {TGSW}}\). The circuit bootstrapping we just described evaluates a quasilinear number of level-2 external products, and a quasilinear number of level 1 products in the private keyswitchings. With the parameters proposed in the next section, it runs in 0.137 s for a 110-bit security parameter, level 2 operations take 70% of the running time, and the private keyswitch the remaining 30%.

Our circuit bootstrapping is not the first bootstrapping algorithm that outputs a \(\mathsf {TRGSW}\) ciphertext. Many constructions have previously been proposed and achieve valid asymptotic complexities, but very few concrete parameters have been proposed. Most of these constructions are recalled in the last section of [19]. In all of them, the bootstrapped ciphertext is obtained as an arithmetic expression on \(\mathsf {TRGSW}\) ciphertexts involving linear combinations and internal products. First, all the schemes based on scalar variants of \(\mathsf {TRGSW}\) suffer from a slowdown of a factor at least quadratic in the security parameter, because the products of small matrices with polynomial coefficients (via FFT) are replaced with large dense matrix products. Thus, bootstrapping on \(\mathsf {TGSW}\) variants would require days of computation, instead of the 0.137 s we propose. Now, assuming that all the bootstrappings use (Ring) instantiations of \(\mathsf {TRGSW}\), the design in [8] based on the expansion of the decryption circuit via Barrington's theorem, as well as the expression of the same function as a minimal deterministic automaton in [19], require a quadratic number of internal level 2 \(\mathsf {TRGSW}\) products, which is much slower than what we propose. Finally, the CRT variant in [1, 19] uses only a quasi-linear number of products, but since it uses composition between automata, these products need to run in level 3 instead of level 2, which induces a huge slowdown (a factor 240 in our benchmarks), because elements cannot be represented as native 64-bit numbers.

5 Comparison and Practical Parameters

We now make explicit the practical parameters for our scheme, and we give the running-time comparison for the evaluation of the homomorphic circuits described before, in LHE and FHE mode (with and without the new optimization techniques).

In [12] the timing for the gate bootstrapping was 52 ms. We improved it to 13 ms: a speed-up of a factor 2 is due to the dedicated assembly FFT for \(X^N+1\) in double precision. An additional speed-up (a factor 1.5) is due to a new choice of parameters for the same security level (in particular, the \(\mathsf {TRGSW}\) parameter \(\ell \) is reduced from 3 to 2). Finally, we replaced the core of the gate bootstrapping with the simpler CMux and blindRotate (Algorithm 3) described in Sect. 2, which gives the last 1.33× speed-up. For the same reason, the external product is now executed in \(34\,\upmu \text {s}\). We added these optimizations to the public repository of the TFHE library [14]. An experimental measurement of the noise confirmed the average-case bounds on the variance, predicted under the independence assumption.

As a consequence, all binary gates are executed in 13 ms, and the native bootstrapped \(\mathsf {MUX}\) gate (also described in Sect. 2) takes 26 ms on a 64-bit single core (i7-4910MQ) at 2.90 GHz. Starting from all these considerations, we implemented our circuit bootstrapping as a proof of concept. The code is available in the experimental repository of TFHE [14]. We perform a circuit bootstrapping in 0.137 s. One of the main constraints to obtain this performance is to ensure that all the computations are feasible and correct under 53 bits of floating-point precision, in order to use the fast FFT. This required refining the parameters of the scheme. We verified the accuracy of the FFT with a slower but exact Karatsuba implementation of the polynomial product.

Concrete Parameters. In our three levels, we used the following \(\mathsf {TRLWE}\) and \(\mathsf {TRGSW}\) parameter sets, which have at least 110 bits of security, according to the security analysis in [12].

| Level | Minimal noise \(\alpha \) | n | \(B_g\) | \(\ell \) |
|-------|---------------------------|---|---------|-----------|
| 0 | \(\underline{\alpha }= 2^{-15.33}\) | \(\underline{n}= 500\) | N.A. | N.A. |
| 1 | \({\alpha }= 2^{-32.33}\) | \({n}=1024\) | \({B_g}=2^8\) | \(\ell =2\) |
| 2 | \(\bar{\alpha }=2^{-45.33}\) | \(\bar{n}=2048\) | \(\bar{B_g}=2^9\) | \(\bar{\ell }=4\) |

Since we assume circular security, we use only one key per level, and the following keyswitch parameters (in the leveled setting, the reader is free to increase the number of keys if they do not wish to assume circularity).

| Level | t | \(\gamma \) | Usage |
|-------|---|-------------|-------|
| \(1 \rightarrow 0\) | \(\underline{t}=12\) | \(\underline{\gamma }=2^{-14}\) | Circuit bootstrap, pre-KS |
| \(2 \rightarrow 1\) | \(\bar{t}=30\) | \(\bar{\gamma }=2^{-31}\) | Circuit bootstrap, step 4 in Algorithm 6 |
| \(1 \rightarrow 1\) | \(t=24\) | \(\gamma =2^{-24}\) | TBSR |

Thus, we get the following noise variances in input and output:

| Output \(\mathsf {TLWE}\) | Fresh \(\mathsf {TRGSW}\) in LHE | \(\mathsf {TRGSW}\) output of CB | Bootst. key |
|---|---|---|---|
| \(\vartheta \le 2^{-10.651}\) | \(\vartheta = 2^{-60}\) | \(\vartheta \le 2^{-47.03}\) | \(\vartheta _{BK} = 2^{-88}\) |

And finally, this table summarizes the timings (Core i7-4910MQ laptop), noise overheads, and maximal depths of all our primitives.

|  | CPU time | Var. noise add | Max depth |
|---|---|---|---|
| Circuit bootstrap | \(t_{CB}=137\,\text {ms}\) | N.A. | N.A. |
| Fresh \(\mathtt {CMux}\) | \(t_{XP}=34\,\upmu \text {s}\) | \(2^{-23.99}\) | 16384 |
| CB \(\mathtt {CMux}\) | \(t_{XP}=34\,\upmu \text {s}\) | \(2^{-20.86}\) | 3466 |
| \(\mathsf {PubKS}_{TBSR}\) | \(t_{KS}=180\,\text {ms}\) | \(2^{-23.42}\) | 16384 |

More details on these parameter choices are provided in the full version.

Fig. 6. The y coordinate represents the running time in seconds (in logscale), and the x coordinate the number of bits in the input (in logscale for c–f).

Time Comparison. With these parameters, we analyse the (single-core) execution timings for the evaluation of the LUT, MAX and Multiplication in LHE and FHE mode.

In the LHE mode (left-hand side of Fig. 6), all inputs are fresh ciphertexts (either \(\mathsf {TRLWE}\) or \(\mathsf {TRGSW}\)) and we compare the previous versions [12] (without packing/batching or gate bootstrapping) with the new optimizations, i.e. horizontal/vertical packing, weighted automata, or the TBSR techniques. In the FHE mode (right-hand side of Fig. 6), all inputs and outputs are \(\mathsf {TLWE}\) samples on the \(\{0, \frac{1}{2}\}\) message space with noise amplitude \(\frac{1}{4}\). Each operation starts by bootstrapping its inputs. We compare the gate-by-gate bootstrapping strategy with the mixed version where we use leveled encryption with circuit bootstrapping. Our goal is to identify which method is better for each of the 6 cases. We observe that, compared to gate bootstrapping, we obtain a huge speed-up for the homomorphic evaluation of arbitrary functions in both LHE and FHE mode; in particular, we can evaluate an 8-bit to 1-bit lookup table and bootstrap the output in just 137 ms, evaluate an arbitrary 8-bit to 8-bit function in 1.096 s, and an arbitrary 16-bit to 8-bit function in 2.192 s in FHE mode. For the multiplication in LHE mode, it is better to use the weighted-automata technique when the numbers are less than 128 bits long, and the TBSR counter beyond that. In the FHE mode, the weighted automata become faster than gate bootstrapping beyond 4-bit inputs, and the TBSR optimization becomes faster for inputs of more than 64 bits.

6 Conclusion

In this paper we improved the efficiency of TFHE by proposing new packing techniques. For the first time, we use det-WFA in the context of homomorphic encryption to optimize the evaluation of arithmetic circuits, and we introduced the TBSR counter. By combining these optimizations, we obtained a significant speed-up and decreased the ciphertext overhead for \(\mathsf {TLWE}\)- and \(\mathsf {TRGSW}\)-based encryption. We also solved the problem of the lack of composability of TFHE leveled gates by proposing an efficient circuit bootstrapping, which runs in 137 ms; we implemented it in the TFHE project [14].