
1 Introduction

Secure multi-party computation (MPC) protocols allow useful computations to be performed on private data, without the data owners having to reveal their inputs. The last decade has seen an enormous amount of progress in the practicality of MPC, with many works designing more efficient protocols and implementations. There has also been a growing interest in exploring the possible applications of MPC, with a number of works targeting specific computations such as auctions, statistics and stable matching [6, 7, 23, 29].

One promising application area that has recently emerged is the use of secure computation to protect long-term secret keys, for instance, in authentication servers or to protect company secrets [4]. Here, the secret key, \({\mathsf {sk}}\), is split up into n pieces, or shares, such that certain subsets of the n shares are needed to reconstruct \({\mathsf {sk}}\), and each share is stored on a different server (possibly in a different location and/or managed by a separate entity). When the key is needed by an application, say for a user logging in, the servers run an MPC protocol to authenticate the user, without ever revealing \({\mathsf {sk}}\). Typically, the type of computation required here can be performed using a symmetric primitive such as a block cipher or hash function.

Several previous works in secure computation have studied the above type of application, and the AES function is even considered a standard benchmark for new protocols [5, 16, 35, 37, 39]. A recent line of works has even looked at special-purpose symmetric primitives, designed to have low complexity when evaluated in MPC [1, 2, 27]. However, in industries such as banking and the wider financial sector, strict regulations and legacy systems mean that switching to new primitives can be very expensive, or even impossible. Indeed, most banking systems today are using AES or Triple DES (3DES) to secure their data [24], but may still benefit greatly from MPC technologies to prevent theft and data breaches.

1.1 Our Contributions

In this work, we focus on the task of secure multi-party computation of the AES and the (Triple) DES block ciphers, in the setting of active security against any number of corrupted parties. We present a new technique for the preprocessing phase of efficient, secure computation of table lookup (with a secret index), and apply this to evaluating the S-boxes of AES and DES. In addition, we describe a new method of secure MPC evaluation of the DES S-boxes based on evaluating polynomials over binary finite fields, which reduces the number of non-linear field multiplications.

Our protocol for secure table lookup builds upon the recent ‘TinyTable’ protocol for secure two-party computation by Damgård et al. [18]. This protocol requires a preprocessing phase, which is independent of the inputs, where randomly masked (or ‘scrambled’) lookup tables on random data are created. In the online phase, where the function is securely evaluated, each (one-time) masked table can be used to perform a single table lookup on a private index in the MPC protocol. The online phase of TinyTable is very efficient, as each party only needs to send \(\log _2{N}\) bits over the network, for a table of size N.

However, the suggested technique for creating the masked tables is far less efficient: for secure computation of AES, it would take at least 256 times longer to create the masked lookup tables, compared with using standard methods with a slower online time.

We extend and improve upon the TinyTable approach in two ways. Firstly, we show that the technique can easily be generalized to the multi-party setting and used in any SPDZ-like MPC protocol based on secret-sharing and information-theoretic MACs. Secondly, we describe a new, general approach for creating the masked tables using finite field arithmetic, which significantly improves the preprocessing cost of the protocol. Concretely, for a lookup table of size N, we can create the masked table using an arithmetic circuit over \(\mathbb {F}_{2^k}\) with fewer than \(N/k + \log {N}\) multiplications. This provides a range of possible instantiations with either binary or arithmetic circuit-based protocols. When using binary circuits, we only require \(N-2\) multiplications. For arithmetic circuits over \(\mathbb {F}_{2^8}\), an AES S-box can be preprocessed with 33 multiplications, improving on the method in [18], which takes 1792 multiplications, by more than 50 times. With current practical protocols, it turns out to be even more efficient to work over \(\mathbb {F}_{2^{40}}\), with only 11 multiplications. We remark that standard methods for computing AES based on polynomials or Boolean circuits can obtain better overall running times, but with a much slower online phase. The main goal of this work is to reduce the preprocessing cost whilst preserving the very fast online phase of TinyTable.

We also consider a new method for secure multi-party computation of DES based on a masking side-channel countermeasure technique. The DES S-box can be viewed as a lookup table mapping 6 bits to 4 bits, or as a polynomial over \(\mathbb {F}_{2^6}\). A naïve method requires 62 field multiplications to evaluate a DES S-box polynomial over \(\mathbb {F}_{2^{6}}\). Many recent works have reduced the number of non-linear multiplications required to evaluate polynomials over binary finite fields, including the DES S-box polynomials [10, 13, 14, 38, 41]. A recent proposal by Pulkus and Vivek [38] showed that the DES S-boxes, when represented over a different field, \(\mathbb {F}_{2^8}\), can be evaluated with only 3 non-linear multiplications. This is better than the best-known circuit over \(\mathbb {F}_{2^6}\), which needs 4 non-linear multiplications. Applying the Pulkus–Vivek method in our context, we show how 1 round of the DES block cipher can be computed with just 24 multiplications over \(\mathbb {F}_{2^8}\). This compares favorably with previous methods based on evaluating polynomials over \(\mathbb {F}_{2^6}\) and Boolean circuits.

Analogous to the MPC protocols based on table lookups, there are also masking side-channel countermeasures based on random-table lookups [11, 12]. This analogy should not come as a surprise, since the masking technique is also based on secret sharing. The state of the art for (higher-order) masking suggests that schemes based on evaluating S-box polynomials usually outperform table-lookup-based schemes in terms of time, RAM and randomness. We perform a similar comparison in the MPC context. To this end, we evaluate the complexity of the various methods for secure computation of AES and 3DES, and present some implementation results. We implemented the protocols using the online phase of the SPDZ [16, 19] MPC protocol. The preprocessing additionally requires some random multiplication triples and shared bits, for which we estimated costs using MASCOT [30] for arithmetic circuits, and based on the recent optimized TinyOT protocol [35, 43] for binary circuits.

Our experiments show that the fastest online evaluation is achieved using lookup tables. The preprocessing for this method costs much less when using arithmetic circuits over larger fields, compared with a binary circuit protocol such as TinyOT [35, 43], despite the quadratic (in the field bit length) communication cost of [30]. The polynomial-based methods for AES and DES still perform slightly better in the preprocessing phase, but for applications where a low online latency is desired, the lookup table approach is definitely preferred. If an application is mainly concerned with the total running time, then the polynomial-based methods actually lead to runtimes for AES that are comparable with the fastest recent 2-PC implementations using garbled circuits.

Related Work. A recent, independent work by Dessouky et al. [22] presented two different protocols for lookup table-based secure two-party computation in the semi-honest security model. The first protocol, OP-LUT, offers an online phase very similar to ours (and [18]), while the preprocessing stage, which is implemented using 1-out-of-N oblivious transfer, is incomparable to ours, as we must work much harder to achieve active security.

The second protocol, SP-LUT, proposes a more efficient preprocessing phase, which only requires random 1-out-of-N oblivious transfer computation, but a slower online evaluation; however this protocol has a much lower overall communication compared to the previous one. These two protocols are also compared with the OTTT (One-Time Truth-Table) protocol by Ishai et al. [28] with parallel circuit based preprocessing [20]. More detailed comparisons with our protocols are provided in Sect. 5.2.

The work of Dessouky et al. [22] also provides an FPGA-based synthesis tool that transforms a high-level function representation into a multi-input/multi-output table-lookup representation, which could also be used with our protocol.

2 Preliminaries

We denote by \(\lambda \) the computational security parameter and \(\kappa \) the statistical security parameter. We consider the sets \(\{0,1\}\) and \(\mathbb {F}_2^k\) endowed with the structure of the fields \(\mathbb {F}_2\) and \(\mathbb {F}_{2^k}\), respectively. We denote by \(\mathbb {F}=\mathbb {F}_{2^k}\) any finite field of characteristic two. Finally, we use \( a \mathop {\leftarrow }\limits ^{\small {\$}}A\) as notation for a uniformly random sampling of a from a set A.

Note that by linearity we always mean \(\mathbb {F}_2\)-linearity, as we only consider fields of characteristic 2.

2.1 MPC Computation Model

Our protocol builds upon the arithmetic black-box model for MPC, represented by the functionality \(\mathcal {F}_{\mathsf {ABB}}\) (shown in the full version). This functionality permits the parties to input and output secret-shared values and evaluate arbitrary binary circuits performing basic operations. This abstracts away the underlying details of secret sharing and MPC. Other than the standard Add and Mult commands, \(\mathcal {F}_{\mathsf {ABB}}\) also has a BitDec command for generating the bit decomposition of a given secret-shared value, two commands Random and RandomBit for generating random values according to different distributions and an Open command which allows the parties and the adversary to output values. BitDec can be implemented in a standard manner by opening and then bit-decomposing \(x + r\), where r is obtained using k secret random bits.
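The BitDec step described above can be illustrated with a small simulation. This is a minimal sketch, assuming plain XOR (additive) secret sharing without MACs and a trusted source of shared random bits; the helper names `share`, `open_value` and `bit_dec` are hypothetical and not part of \(\mathcal {F}_{\mathsf {ABB}}\).

```python
import secrets
from functools import reduce
from operator import xor

def share(value, n_parties, bits):
    """Split `value` into n XOR-shares (additive sharing in characteristic 2)."""
    shares = [secrets.randbits(bits) for _ in range(n_parties - 1)]
    shares.append(reduce(xor, shares, value))
    return shares

def open_value(shares):
    """Reconstruct a shared value by XOR-ing all shares."""
    return reduce(xor, shares, 0)

def bit_dec(x_shares, k):
    """Return k lists of shares, one per bit of x (least significant bit first)."""
    n = len(x_shares)
    # Preprocessing (assumed given): k secret random bits, each shared among the parties.
    r_bit_shares = [share(secrets.randbits(1), n, 1) for _ in range(k)]
    # Each party locally assembles its share of r = sum_i r_i * 2^i.
    r_shares = [reduce(xor, (r_bit_shares[i][p] << i for i in range(k)), 0)
                for p in range(n)]
    # Online: open c = x + r (XOR in characteristic 2); r masks x completely.
    c = open_value([xs ^ rs for xs, rs in zip(x_shares, r_shares)])
    # x_i = c_i + r_i: party 0 adds the public bit c_i to its share of r_i.
    result = []
    for i in range(k):
        bit_shares = list(r_bit_shares[i])
        bit_shares[0] ^= (c >> i) & 1
        result.append(bit_shares)
    return result
```

Only the single opening of \(x + r\) requires communication; the per-bit shares are then derived locally.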

We use the notation \(\llbracket x \rrbracket \) to denote an authenticated and secret-shared value x, which is stored by \(\mathcal {F}_{\mathsf {ABB}}\). More precisely, this can be implemented with active security using the SPDZ protocol [16, 19] based on additive secret sharing and unconditionally secure MACs. We also use the \(+\) and \(\cdot \) operators to denote calls to Add and Mult with the appropriate shared values in \(\mathcal {F}_{\mathsf {ABB}}\).

More concretely, our protocols are in the so-called preprocessing model and consist of two different phases: an online computation, where the actual evaluation takes place, and a preprocessing phase that is independent of the parties’ inputs. During the online evaluation, linear operations only require local computations thanks to the linearity of the secret sharing scheme and MAC. Multiplications and bit decompositions require random preprocessed data and interaction. More generally, the main task of the preprocessing step is to produce enough random secret data for the parties to use during the online computation: other than multiplication triples, which allow parties to compute products, it also provides random shared values. The preprocessing phase can be efficiently implemented using OT-based protocols for binary circuits [8, 25, 43] and arithmetic circuits [30].
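The text does not spell out how a multiplication triple is consumed online. The sketch below uses the standard Beaver technique as an assumption, over \(\mathbb {F}_{2^8}\) with the AES reduction polynomial, with XOR sharing and no MACs. The functions `dealer_triple` and `beaver_mul` are hypothetical names, and the trusted dealer merely stands in for the preprocessing phase.

```python
import secrets
from functools import reduce
from operator import xor

def gf_mul(a, b, poly=0x11B, k=8):
    """Multiplication in F_{2^k} modulo `poly` (AES polynomial by default)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> k:
            a ^= poly
    return r

def share(value, n, bits=8):
    shares = [secrets.randbits(bits) for _ in range(n - 1)]
    shares.append(reduce(xor, shares, value))
    return shares

def open_value(shares):
    return reduce(xor, shares, 0)

def dealer_triple(n):
    """Stand-in for preprocessing: shares of random a, b and of c = a * b."""
    a, b = secrets.randbits(8), secrets.randbits(8)
    return share(a, n), share(b, n), share(gf_mul(a, b), n)

def beaver_mul(x_sh, y_sh, triple):
    """Shares of x * y, using one triple and two openings."""
    a_sh, b_sh, c_sh = triple
    d = open_value([x ^ a for x, a in zip(x_sh, a_sh)])  # d = x + a
    e = open_value([y ^ b for y, b in zip(y_sh, b_sh)])  # e = y + b
    # x*y = (d + a)(e + b) = d*e + d*b + e*a + c in characteristic 2.
    z_sh = [c ^ gf_mul(d, b) ^ gf_mul(e, a) for a, b, c in zip(a_sh, b_sh, c_sh)]
    z_sh[0] ^= gf_mul(d, e)  # the public term is added by one party only
    return z_sh
```

The two openings can be sent in the same round, which is why each non-linear multiplication is counted as one round of interaction.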

Security Model. We describe our protocols in the universal composition (UC) framework of Canetti [9], and assume familiarity with this. Our protocols work with n parties from the set \(\mathcal {P}= \{P_1,\dots , P_n\}\), and we consider security against malicious, static adversaries, i.e. corruption may only take place before the protocols start, corrupting up to \(n-1\) parties.

3 Evaluating AES and DES S-box Polynomials

In this section, we recollect some of the previously known methods that aim to reduce the number of non-linear operations to evaluate univariate polynomials over binary finite fields, particularly, the AES and the DES S-boxes represented in this form. Note here that, by a non-linear multiplication, we mean those multiplications of polynomials that are neither multiplication by constants nor squaring operations. Since squaring is a linear operation in binary fields, once a monomial is computed, it can be repeatedly squared to generate as many more monomials as possible without costing any non-linear multiplication.

Due to limited space, a more detailed discussion can be found in the full version.

3.1 AES S-box

The AES S-box evaluation on a given input (as an element of \(\mathbb {F}_{2^{8}}\)) consists of first computing its multiplicative inverse in \(\mathbb {F}_{2^{8}}\) (mapping zero to zero), and then applying a bijective affine transformation. For the inverse S-box, the inverse affine transformation is applied first and then the multiplicative inverse. Note that the polynomial representation of the inverse function in \(\mathbb {F}_{2^{8}}\) is \(X^{254}\).

BitDecomposition Method. This approach, described by Damgård et al. [15], computes the squares \(X^{2^i}\), for \(i \in [7]\), and then multiplies them to get \(X^{254}\). This method needs 6 non-linear multiplications.

Rivain–Prouff Method. This method, as presented in Gentry et al. [26], is a variant of the method of Rivain–Prouff [40] to evaluate the AES S-box polynomial using only 4 non-linear multiplications in \(\mathbb {F}_{2^{8}}[X]\): \(\{X, X^2\} \overset{\times }{\rightarrow } \{X^3,X^{12}\} \overset{\times }{\rightarrow } \{X^{14}\} \overset{\times }{\rightarrow } \{X^{15},X^{240}\} \overset{\times }{\rightarrow } X^{254}.\)
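The addition chain above can be checked in the clear. The following sketch evaluates \(X^{254}\) in \(\mathbb {F}_{2^8}\) (AES reduction polynomial) following the chain step by step; the function names are illustrative only, and squarings are not counted as non-linear multiplications.

```python
def gf_mul(a, b, poly=0x11B, k=8):
    """Multiplication in F_{2^8} modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> k:
            a ^= poly
    return r

def gf_sq(a, times=1):
    """Repeated squaring; F_2-linear, so free in the non-linear count."""
    for _ in range(times):
        a = gf_mul(a, a)
    return a

def inverse_rp(x):
    """X^254 with 4 non-linear multiplications, following the chain in the text."""
    x2 = gf_sq(x)
    x3 = gf_mul(x, x2)        # non-linear multiplication 1
    x12 = gf_sq(x3, 2)        # (X^3)^4
    x14 = gf_mul(x12, x2)     # non-linear multiplication 2
    x15 = gf_mul(x14, x)      # non-linear multiplication 3
    x240 = gf_sq(x15, 4)      # (X^15)^16
    return gf_mul(x240, x14)  # non-linear multiplication 4
```

For every non-zero input the result multiplied by the input gives 1, and zero is mapped to zero, as required by the S-box definition in Sect. 3.1.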

3.2 DES S-boxes

Cyclotomic Class Method. Recall that DES has eight 6-to-4-bit S-boxes. In this naïve method given by Carlet et al. [10], the DES S-boxes are represented as univariate polynomials over \(\mathbb {F}_{2^{6}}\). In particular, the 4-bit S-box outputs are padded with zeros in the most significant bits and then identified with the elements of \(\mathbb {F}_{2^{6}}\). It turns out that these polynomials have degree at most 62 [41].

Over \(\mathbb {F}_{2^{m}}[X]\), define \(C^m_i := \left\{ X^{i\cdot 2^{j}}:\; j=0,1,\ldots , m-1\right\} \text { for } 0<i<2^m.\) Now we need to compute \(C^6_0, C^6_1, C^6_3, C^6_5, C^6_7, C^6_9, C^6_{11}, C^6_{13}, C^6_{15}, C^6_{21}, C^6_{23}, C^6_{27}, C^6_{31}\), to cover all monomials up to degree 62, and this needs at most 11 non-linear multiplications. The target polynomial is then simply obtained as a linear combination of the computed monomials.
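As a sanity check of the coverage claim, the sketch below computes the exponents appearing in each class \(C^6_i\) and verifies that the thirteen listed classes cover all exponents from 0 to 62. It assumes exponents may be reduced modulo \(2^m - 1\), which is valid here because \(X^{2^m} = X\) over \(\mathbb {F}_{2^m}\) and no non-zero exponent below \(2^m - 1\) is mapped to zero by doubling; the function name is hypothetical.

```python
def cyclotomic_class(i, m):
    """Exponents of the monomials in C^m_i, reduced modulo 2^m - 1."""
    n = (1 << m) - 1
    return {(i << j) % n for j in range(m)}

# Class representatives listed in the text for F_{2^6}.
REPS = [0, 1, 3, 5, 7, 9, 11, 13, 15, 21, 23, 27, 31]

covered = set()
for rep in REPS:
    covered |= cyclotomic_class(rep, 6)

# Classes other than C_0 (the constant) and C_1 (X and its squares) each need
# one non-linear multiplication to reach a first element.
classes_needing_a_mult = [r for r in REPS if r not in (0, 1)]
```

Under this accounting there are eleven classes that each need one non-linear multiplication, which agrees with the bound stated above.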

Pulkus–Vivek Method. This generic method to evaluate arbitrary polynomials over binary finite fields was proposed recently by Pulkus and Vivek [38] as an improvement over the method of Coron–Roy–Vivek [13, 14]. In the PV method, the DES S-boxes are represented as polynomials over \(\mathbb {F}_{2^{8}}\) instead of \(\mathbb {F}_{2^{6}}\). The 6-bit input strings of the DES S-boxes are padded with zeroes in the two most significant positions and then naturally identified with the elements of \(\mathbb {F}_{2^{8}}\). The four most significant coefficient bits of the polynomial outputs are discarded to obtain the desired 4-bit S-box output.

Firstly, a set of monomials \(L = C_1^8 \cup C_3^8 \cup C_7^8\) in \(\mathbb {F}_{2^{8}}[X]\) is computed. Then a polynomial, say P(X), representing the given S-box is sought as \( P(X) = p_1(X)\cdot q_1(X) + p_2(X)\), where \(p_1(X)\), \(q_1(X)\), and \(p_2(X)\) have monomials only from the set L. In total, the PV method needs 3 non-linear multiplications in \(\mathbb {F}_{2^{8}}[X]\) to evaluate each of the S-box polynomials.
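The shape of this evaluation can be sketched in the clear as follows. The code builds the 24 monomials of L from \(X\), \(X^3\) and \(X^7\) using two non-linear multiplications, then evaluates \(p_1 \cdot q_1 + p_2\) with a third. Two things are assumptions: the AES reduction polynomial is used for \(\mathbb {F}_{2^8}\), since no representation is fixed here, and the coefficient dictionaries passed to `eval_pv_shape` are arbitrary placeholders, not the actual DES S-box coefficients.

```python
def gf_mul(a, b, poly=0x11B, k=8):
    """Multiplication in F_{2^8}; the reduction polynomial is an assumption."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> k:
            a ^= poly
    return r

def monomials_L(x):
    """Map exponent -> x^exponent for L = C_1^8, C_3^8, C_7^8 (2 non-linear mults)."""
    x2 = gf_mul(x, x)
    x4 = gf_mul(x2, x2)
    x3 = gf_mul(x, x2)   # non-linear multiplication 1
    x7 = gf_mul(x3, x4)  # non-linear multiplication 2
    mons = {}
    for e, v in ((1, x), (3, x3), (7, x7)):
        for _ in range(8):       # each class has 8 elements, reached by squaring
            mons[e] = v
            e = (2 * e) % 255
            v = gf_mul(v, v)
    return mons

def eval_pv_shape(x, p1, q1, p2):
    """Evaluate p1(x) * q1(x) + p2(x); each polynomial is {exponent in L: coefficient}."""
    mons = monomials_L(x)

    def lin(p):
        acc = 0
        for e, c in p.items():
            acc ^= gf_mul(c, mons[e])  # multiplication by a public constant is linear
        return acc

    return gf_mul(lin(p1), lin(q1)) ^ lin(p2)  # non-linear multiplication 3
```

In an MPC setting the squarings inside `monomials_L` are the steps that require bit decomposition, as discussed in Sect. 3.3.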

3.3 MPC Evaluation of AES and DES S-box Polynomials

Here we detail the MPC evaluation of AES and DES S-boxes using the techniques described above. We recall that since the S-boxes, in both the ciphers we are considering, are the only non-linear components, they represent the only parts which actually need interactions in an MPC evaluation.

AES Evaluation. As mentioned in Sect. 3.1, the straightforward way to compute the S-box is the BitDecomposition method, which requires 6 multiplications in \(4+1\) rounds. Since we consider active security, the AES evaluation is done in the field \(\mathbb {F}_{2^{40}}\) instead of \(\mathbb {F}_{2^{8}}\), via the embedding \(\mathbb {F}_{2^8} \hookrightarrow \mathbb {F}_{2^{40}}\). This follows from the fact that we use the SPDZ protocol, which requires a field size of at least \(2^\kappa \), where \(\kappa \) is the statistical security parameter. This allows us to use only one MAC per data item [15].

The evaluation proceeds as follows: first X is bit-decomposed so that all the squarings can be locally evaluated, and then \(X^{254}\) is obtained as described in [15]:

$$ X^{254} = ((X^2 \cdot X^4) \cdot (X^8 \cdot X^{16})) \cdot ((X^{32} \cdot X^{64}) \cdot X^{128}). $$

This requires 4 rounds, out of which one is a call to \(\mathsf {BitDec}\). We also need an extra round for computing the inverse of the field embedding \(\mathbb {F}_{2^8} \hookrightarrow \mathbb {F}_{2^{40}}\) to evaluate the S-box linear layer. We denote this method by \(\texttt {AES-BD}\).
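The round count can be read off the multiplication tree. The sketch below evaluates the formula in the clear over \(\mathbb {F}_{2^8}\), grouping the six multiplications into the three layers in which they can run in parallel; together with the one BitDec call this gives the 4 rounds stated above. The function names are illustrative only.

```python
def gf_mul(a, b, poly=0x11B, k=8):
    """Multiplication in F_{2^8} modulo the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> k:
            a ^= poly
    return r

def inverse_bd(x):
    """X^254 as a product of the squares X^2, ..., X^128 (6 multiplications, depth 3)."""
    squares = [x]
    for _ in range(7):  # local once X is bit-decomposed
        squares.append(gf_mul(squares[-1], squares[-1]))
    x2, x4, x8, x16, x32, x64, x128 = squares[1:]
    # Layer 1: three independent products.
    a, b, c = gf_mul(x2, x4), gf_mul(x8, x16), gf_mul(x32, x64)
    # Layer 2: two independent products.
    d, e = gf_mul(a, b), gf_mul(c, x128)
    # Layer 3: final product, exponent 2 + 4 + ... + 128 = 254.
    return gf_mul(d, e)
```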

We denote by \(\texttt {AES-RP}\) the AES S-box evaluation that uses the Rivain–Prouff method (cf. Sect. 3.1). It requires \(6+1\) rounds to compute the four powers \(X^3, X^{14}, X^{15}, X^{254}\). Furthermore, this can be done with three calls to \(\mathsf {BitDec}\) and four non-linear multiplications, but some of the openings can be done in parallel, which yields a depth-6 circuit. As before, we need an extra round to call \(\mathsf {BitDec}\) and compute the S-box linear layer.

DES Evaluation. We denote by \(\texttt {DES-PV}\) the DES S-box evaluation using the Pulkus–Vivek method. Note that, although in the side-channel world computing the squares is free, since squaring is an \(\mathbb {F}_2\)-linear operation, in secret-sharing-based MPC with MACs this is no longer true and we need to bit-decompose.

The squares from \(C_1^8, C_3^8, C_7^8\), are obtained locally after \(X, X^3, X^7\) are bit-decomposed. Here we need two multiplications, since \(X^3 = X \cdot X^2\) and \(X^7 = X^3 \cdot X^4\). The third multiplication occurs when computing the product \(p_1(X)\cdot q_1(X)\), resulting in an S-box cost of only 3 triples, 24 bits and 5 communication rounds.

The number of rounds is due to 3 calls to \(\mathsf {BitDec}\) (on \(X^3, X^7\) and \( p_1(X) \cdot q_1(X) + p_2(X)\)) and 3 non-linear multiplications. Although at first glance there seem to be six rounds, \(\mathsf {BitDec}(X^7)\) is independent of \(\mathsf {BitDec}(X^3)\), as we can compute \(X^7\) without the call \(\mathsf {BitDec}(X^3)\), resulting in only five rounds.

4 MPC Evaluation of Boolean Circuits Using Lookup Tables

In this section we describe an efficient MPC protocol for securely evaluating circuits over extension fields of \(\mathbb {F}_2\) (including boolean circuits) containing additional ‘lookup table’ gates. This protocol is in the preprocessing model and follows the same approach proposed in [20], evaluating lookup table gates using preprocessed, masked lookup tables.

Fig. 1. The ideal functionality for MPC using lookup tables.

Fig. 2. Ideal functionality for the preprocessing of masked lookup tables.

The functionality that we implement is \(\mathcal {F}_{\mathsf {ABB-LUT}}\) (Fig. 1), which augments the standard \(\mathcal {F}_{\mathsf {ABB}}\) functionality with a table lookup command. The concrete online cost of each table lookup is just \(\log _2 N\) bits of communication per party, where N is the size of the table. Note that the functionality \(\mathcal {F}_{\mathsf {ABB-LUT}}\) works over a finite field \(\mathbb {F}_{2^k}\), and has been simplified by assuming that the size of the range and domain of the lookup table \(\mathsf {T}\) is not more than \(2^k\). However, our protocol actually works for general table sizes, and \(\mathcal {F}_{\mathsf {ABB-LUT}}\) can easily be extended to model this by representing a table lookup result with several field elements instead of one.

We now show how Protocol 1 implements the Table Lookup command of \(\mathcal {F}_{\mathsf {ABB-LUT}}\), given the right preprocessing material. For any non-linear function \(\mathsf {T}\), with \(\ell \) input and m output bits, it is well known that it can be implemented as a lookup table of \(2^\ell \) components of m bits each. To evaluate \(\mathsf {T}(\cdot )\) on a secret authenticated value \(\llbracket x \rrbracket , x \in \mathbb {F}_{2^\ell }\), the parties use a random authenticated \(\mathsf {T}\) evaluation from \(\mathcal {F}_{\mathsf {Prep-LUT}}\) (Fig. 2). More precisely, we would like the preprocessing to output values \(( \llbracket s \rrbracket ,\llbracket {\mathsf {Table}}(s) \rrbracket )\), where \(\llbracket s \rrbracket \) is a random authenticated value unknown to the parties and \(\llbracket {\mathsf {Table}}(s) \rrbracket \) is the table

$$ \llbracket {\mathsf {Table}}(s) \rrbracket = \left( \, \llbracket \mathsf {T}(s) \rrbracket , \llbracket \mathsf {T}(s \oplus 1) \rrbracket , \dots , \llbracket \mathsf {T}(s \oplus (2^{\ell } -1)) \rrbracket \, \right) , $$

so that \(\llbracket {\mathsf {Table}}(s) \rrbracket [j], 0 \le j \le 2^\ell -1,\) denotes the element \(\llbracket \mathsf {T}(s \oplus j) \rrbracket \). Given such a table, evaluating \(\llbracket \mathsf {T}(x) \rrbracket \) is straightforward: first the parties open the value \(h=x \oplus s\) and then they locally retrieve the value \(\llbracket {\mathsf {Table}}(s) \rrbracket [h] = \llbracket \mathsf {T}(s \oplus h) \rrbracket = \llbracket \mathsf {T}(s \oplus s \oplus x) \rrbracket = \llbracket \mathsf {T}(x) \rrbracket .\)


Correctness easily follows from the linearity of the \(\llbracket \cdot \rrbracket \)-representation and the discussion above. Privacy follows from the fact that the value s used in Table Lookup is randomly chosen and is used only once, thus it perfectly blinds the secret value x.
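The online lookup can be simulated in a few lines. This is a minimal sketch, assuming XOR sharing without MACs and a trusted dealer standing in for \(\mathcal {F}_{\mathsf {Prep-LUT}}\); the function names and the toy 4-bit table used in the usage note are hypothetical.

```python
import secrets
from functools import reduce
from operator import xor

def share(value, n, bits):
    shares = [secrets.randbits(bits) for _ in range(n - 1)]
    shares.append(reduce(xor, shares, value))
    return shares

def open_value(shares):
    return reduce(xor, shares, 0)

def preprocess_masked_table(T, ell, m, n):
    """Dealer stand-in: shares of a random mask s and of Table(s)[j] = T(s xor j)."""
    s = secrets.randbits(ell)
    s_shares = share(s, n, ell)
    table_shares = [share(T[s ^ j], n, m) for j in range(1 << ell)]
    return s_shares, table_shares

def table_lookup(x_shares, s_shares, table_shares):
    """Online phase: open h = x xor s, then pick entry h locally."""
    h = open_value([xs ^ ss for xs, ss in zip(x_shares, s_shares)])
    return table_shares[h]  # shares of T(s xor h) = T(x)
```

For example, with a toy table `T = [(7 * i + 3) % 16 for i in range(16)]`, calling `table_lookup` on shares of `x` returns shares that open to `T[x]`. The only value communicated online is the \(\ell \)-bit opening h, and each masked table must be used for a single lookup.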

4.1 The Preprocessing Phase: Securely Generating Masked Lookup Tables

In this section we describe how to securely implement \(\mathcal {F}_{\mathsf {Prep-LUT}}\) (see Fig. 2), and in particular how to generate masked lookup tables which can be used for the online phase evaluation.

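Recall that the goal is to obtain the shared values:

$$ \llbracket {\mathsf {Table}}(s) \rrbracket = \left( \, \llbracket \mathsf {T}(s) \rrbracket , \llbracket \mathsf {T}(s \oplus 1) \rrbracket , \dots , \llbracket \mathsf {T}(s \oplus (2^{\ell } -1)) \rrbracket \, \right) . $$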

Protocol 2 begins by taking a secret, random \(\ell \)-bit mask \(\llbracket s \rrbracket = (\llbracket s_0 \rrbracket , \dots , \llbracket s_{\ell -1} \rrbracket )\). Then, the parties expand s into a secret-shared bit vector \((s'_0, \dots , s'_{2^\ell -1})\) which has a 1 in the s-th entry and is 0 elsewhere. We denote this procedure—the most expensive part of the protocol—by \(\mathsf {Demux}\), and describe how to perform it in the next section.


Once this is done, the parties can obtain the i-th entry of the masked lookup table by computing:

$$ \mathsf {T}(i) \cdot \llbracket s'_0 \rrbracket + \mathsf {T}(i \oplus 1) \cdot \llbracket s'_1 \rrbracket + \dots + \mathsf {T}(i \oplus (2^\ell -1)) \cdot \llbracket s'_{2^\ell -1} \rrbracket , $$

which is clearly \(\llbracket \mathsf {T}(i \oplus s) \rrbracket \) as required. Note that since the S-box is public, this is a local computation for the parties. In the following we give an efficient protocol for computing \(\mathsf {Demux}\).
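The local nature of this step can be seen in the following sketch, in which each party turns its XOR share of the one-hot vector \(s'\) into its share of the masked table without any interaction. It assumes XOR sharing of bits without MACs; `masked_table_share` and the dealer-style helper `share_onehot` are hypothetical names.

```python
import secrets

def share_onehot(s, N, n):
    """Dealer stand-in for Demux: XOR shares of the one-hot vector with a 1 at position s."""
    shares = [[secrets.randbits(1) for _ in range(N)] for _ in range(n - 1)]
    last = []
    for j in range(N):
        bit = 1 if j == s else 0
        for sh in shares:
            bit ^= sh[j]
        last.append(bit)
    return shares + [last]

def masked_table_share(T, onehot_share):
    """One party's share of Table(s): entry i is the sum over j of T(i xor j) * s'_j."""
    N = len(T)  # N must be a power of two so that i xor j stays in range
    table = []
    for i in range(N):
        acc = 0
        for j in range(N):
            if onehot_share[j]:  # multiplication of a public value by a shared bit
                acc ^= T[i ^ j]
        table.append(acc)
    return table
```

XOR-ing the parties' resulting tables entry by entry gives \(\mathsf {T}(i \oplus s)\) in position i, since only the term with \(j = s\) survives.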


4.2 Computing \(\mathsf {Demux}\) with Finite Field Multiplications

We now present a general method for computing \(\mathsf {Demux}\) using fewer than \(N/k + \log {N}\) multiplications over \(\mathbb {F}_{2^k}\), when k is any power of 2 and \(N=2^\ell \) is the table size. Launchbury et al. [32] previously described a protocol with O(N) multiplications in \(\mathbb {F}_2\), but our protocol has fewer multiplications than theirs for all choices of k.

As said before, \(\mathsf {Demux}\) maps a binary representation \((s_0,\dots ,s_{\ell -1})\) of an integer \(s = \sum _{i=0}^{\ell -1} s_i \cdot 2^i\) into a unary representation of fixed length \(2^\ell \) that contains a one in the position s and zeros elsewhere. A straightforward way to compute \(\mathsf {Demux}\) is by computing, over \(\mathbb {F}_{2^{N}}\):

$$ s' = \prod _{i=0}^{\ell -1} \left( s_i \cdot X^{2^i} + (1 - s_i) \right) . $$

Notice that if \(s_i=1\) then the i-th term of the product equals \(X^{2^i}\), whereas the term equals 1 if \(s_i=0\). This means the entire product evaluates to \(s' = X^s\), where s is the integer representation of the bits \((s_0, \dots , s_{\ell -1})\). Bit decomposing \(s'\) obtains the demuxed output as required. Unfortunately, this approach does not scale well with N, the table size, as we must exponentially increase the size of the field.
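This argument can be checked in the clear by multiplying the factors \(s_i \cdot X^{2^i} + (1 - s_i)\) as polynomials over \(\mathbb {F}_2\). In the sketch below a polynomial is stored as an integer bitmask (bit j is the coefficient of \(X^j\)); since the product has degree below N, no modular reduction is needed. The function names are illustrative only.

```python
def clmul(a, b):
    """Carry-less product of two F_2[X] polynomials stored as bitmasks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def demux_product(bits):
    """Product over i of (s_i * X^(2^i) + (1 - s_i)); equals X^s as a bitmask."""
    prod = 1
    for i, s_i in enumerate(bits):
        factor = (s_i << (1 << i)) ^ (1 ^ s_i)  # X^(2^i) if s_i = 1, else 1
        prod = clmul(prod, factor)
    return prod

def to_onehot(mask, N):
    """Bit decomposition of the product: the demuxed vector of length N."""
    return [(mask >> j) & 1 for j in range(N)]
```

With \(\ell = 4\) and s = 11, `demux_product([1, 1, 0, 1])` returns the bitmask `1 << 11`, i.e. a single one in position 11.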

We now show how to compute this more generally, using operations over \(\mathbb {F}_{2^k}\), where k is a power of two. We will only ever perform multiplications between elements of \(\mathbb {F}_2\) and \(\mathbb {F}_{2^k}\), and will consider elements of \(\mathbb {F}_{2^k}\) as vectors over \(\mathbb {F}_2\). Define the partial products, for \(j = 1, \dots , \ell \):

$$ p_j(X) = \prod _{i=0}^{j-1} \left( s_i \cdot X^{2^i} + (1 - s_i) \right) , $$

and note that \(p_{j+1}(X) = p_{j}(X) \cdot (s_{j} \cdot X^{2^{j}} + (1 - s_{j}))\), for \(j < \ell \).

Note also that the polynomial \(p_j(X)\) has degree \(<2^j\), so \(p_j(X)\) can be represented as a vector in \(\mathbb {F}_2^{2^j}\) containing its coefficients, and more generally, a vector \(p_j\) containing \(\lceil 2^j / k \rceil \) elements of \(\mathbb {F}_2^k\). This is the main observation that allows us to emulate the computation of \(s'\) using only \(\mathbb {F}_{2^k}\) arithmetic.

Given a sharing of \(p_j\) represented in this way, a sharing of \(p_j(X) \cdot X^{2^j}\) can be seen as the vector (increasing the powers of X from left to right):

$$ (0^{2^{j}} \Vert p_j) \in \mathbb {F}_2^{2^{j+1}} $$

and a vector representation of \(p_{j+1}(X)\) is:

$$ \left( (0^{2^{j}} \Vert s_j \cdot p_j) + ((1 - s_j) \cdot p_j \Vert 0^{2^{j}})\right) \in \mathbb {F}_2^{2^{j+1}}. $$

Thus, given \(\llbracket p_j \rrbracket \) represented as \(\lceil 2^j/k \rceil \) shared elements of \(\mathbb {F}_{2^k}\), we can compute \(\llbracket p_{j+1} \rrbracket \) in MPC with \(\lceil 2^j/k \rceil \) multiplications between \(\llbracket s_j \rrbracket \) and a shared \(\mathbb {F}_{2^k}\) element, plus some local additions.

Starting with \(p_1(X)=s_0 \cdot X + (1 - s_0)\) we can iteratively apply the above method to compute \(p_\ell = s'\), as shown in Protocol 3. The overall complexity of this protocol is given by

$$\begin{aligned} \sum _{j=1}^{\ell -1} \lceil 2^{j}/k \rceil < N/k + \ell \end{aligned}$$

multiplications between bits and \(\mathbb {F}_{2^k}\) elements.
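The iteration and its cost can be sketched in the clear as follows. The vector \(p_j\) is kept as a list of bits, and the counter charges \(\lceil 2^j/k \rceil \) multiplications per step, i.e. one per k-bit word of \(p_j\), using the fact that \((1 - s_j) \cdot p_j = p_j + s_j \cdot p_j\) needs no extra product. The function `demux` is a hypothetical name and operates on clear values rather than shares.

```python
from math import ceil

def demux(bits, k):
    """Return (one-hot vector of length 2^len(bits), number of bit-by-word multiplications)."""
    p = [1 ^ bits[0], bits[0]]  # p_1 = (1 - s_0) + s_0 * X, obtained locally
    mults = 0
    for j in range(1, len(bits)):
        s_j = bits[j]
        high = [s_j & c for c in p]              # s_j * p_j, packed into ceil(2^j / k) words
        low = [c ^ h for c, h in zip(p, high)]   # (1 - s_j) * p_j = p_j + s_j * p_j
        mults += ceil(len(p) / k)
        p = low + high                           # coefficients of p_{j+1}, low degree first
    return p, mults

bits_of_s = [(0xA5 >> i) & 1 for i in range(8)]  # example mask s = 0xA5 for N = 256
onehot, count_k8 = demux(bits_of_s, 8)           # count_k8 == 33
_, count_k40 = demux(bits_of_s, 40)              # count_k40 == 11
_, count_k1 = demux(bits_of_s, 1)                # count_k1 == 254 == N - 2
```

These counts match the figures quoted in the introduction for the AES S-box: 33 multiplications for \(k = 8\), 11 for \(k = 40\), and \(N - 2\) for binary circuits.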

Table 1 illustrates this trade-off between the field size and number of multiplications for some example parameters. We note that the main factor affecting the best choice of k is the cost of performing a multiplication in \(\mathbb {F}_{2^k}\) in the underlying MPC protocol, and this may change as new protocols are developed. However, we compare costs of some current protocols in Sect. 5.

Table 1. Number of \(\mathbb {F}_2 \times \mathbb {F}_{2^k}\) multiplications for creating a masked lookup table of size N, for varying k.

4.3 MPC Evaluation of AES and DES Using Lookup Tables

We now show how to use the lookup table MPC protocol described above to evaluate AES and DES.

AES Evaluation. We require an MPC protocol which performs operations in \(\mathbb {F}_{2^{8}}\). In practice, we actually embed \(\mathbb {F}_{2^{8}}\) in \(\mathbb {F}_{2^{40}}\), since we use the SPDZ protocol which requires a field size of at least \(2^\kappa \), for statistical security parameter \(\kappa \). We implement the AES S-box using the table lookup method from Protocol 2 combined with \(\mathsf {Demux}\) (Protocol 3) over \(\mathbb {F}_{2^{40}}\), since this yields a lower communication cost (see Table 4). Notice that the data sent is highly dependent on the number of bits, triples and the field size.

In a naive implementation of this approach, we would have to call \(\mathsf {BitDec}\) on \(\llbracket {\mathsf {Table}}(s) \rrbracket \), in order to perform the embedding \(\mathbb {F}_{2^{8}} \hookrightarrow \mathbb {F}_{2^{40}}\). This is required since the table output is not embedded, but the MixColumns step needs this to perform multiplication by \(X \in \mathbb {F}_{2^{8}}\) on each state.

With a more careful analysis we can avoid the \(\mathsf {BitDec}\) calls by locally embedding the bit shares inside Protocol 2. We store the masked S-box table in bit decomposed form and then its bits are multiplied (in the clear) with \(\mathsf {Demux}\)’s output (secret-shared). This trick reduces the online communication by a factor of 8, halves the number of rounds required to evaluate AES and gives a very efficient online phase with only 10 rounds and 160 openings in \(\mathbb {F}_{2^{40}}\).

DES Evaluation. Using the fact that DES S-boxes have size 64, we chose to use the \(\mathsf {Demux}\) Protocol 3 with multiplications in \(\mathbb {F}_{2^{40}}\), based on the costs in Table 4. Like AES, we try to isolate the input-dependent phase as much as possible with no extra cost.

Every DES round performs only bitwise addition and no embedding is necessary here. The masked table can be bit-decomposed without interaction, exactly as described above for AES, by multiplying clear bits with secret shared values. This yields a low number of openings, one per S-box look-up, so the total online cost for \(\texttt {3DES}\) is 46 rounds with 384 openings.

5 Performance Evaluation

This section presents timings for \(\texttt {3DES}\) and \(\texttt {AES}\) using the methods presented in previous sections. We also discuss trade-offs and different optimizations which turn out to be crucial for our running times. The setup we have considered is that both the key and message used in the cipher are secret shared across two parties. We consider the input format for each block cipher as already embedded into \(\mathbb {F}_{2^{40}}\) for AES, or as a list of shared bits for DES. We implemented the protocols using the SPDZ software, and estimated times for computing the multiplication triples and random bits needed based on the costs of MASCOT [30].

The results, shown in Tables 2 and 3, give measurements in terms of latency and throughput. Latency indicates the online phase time required to evaluate one block cipher, whereas throughput (which we consider for both online and offline phases) shows the maximum number of blocks per second which can be evaluated in parallel during one execution. We also measure the number of rounds of interaction of the protocols, and the number of openings, which is the total number of secret-shared field elements opened during the online evaluation.

Benchmarking Environment. The experiments were run across two machines, each with an Intel i7-4790 CPU running at 3.60 GHz and 16 GB of RAM, connected over a 1 GBps LAN with an average ping of 0.3 ms (roundtrip). For experiments with 3–5 parties, we used three additional machines with i7-3770 CPUs at 3.1 GHz. In order to get accurate timings, each experiment was averaged over 5 executions, each with at least 1000 cipher calls.

Security Parameters and Field Sizes. Secret-sharing-based MPC can usually be split into two phases: preprocessing and online. In SPDZ-like systems, the preprocessing phase depends on a computational security parameter, and the online phase on a statistical security parameter, which depends on the field size. In our experiments the computational security parameter is \(\lambda = 128\). The statistical security parameter \(\kappa \) is 40 for every cipher except for \(\texttt {3DES-Raw}\), which requires an embedding into a 42-bit field.

Results. The theoretical costs and practical results are shown in Tables 2 and 3, respectively. Timings are taken only for the encryption calls, excluding the key schedule mechanism.

\(\texttt {AES-BD}\) is implemented by embedding each block into \(\mathbb {F}_{2^{40}}\), and then squaring the shares locally after the inputs are bit-decomposed. In this manner, each S-box computation costs 5 communication rounds and 6 multiplications. This method was described in [15].
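As a rough per-block tally for \(\texttt {AES-BD}\), the per-S-box figures can be scaled up as follows. This is an illustrative count only: it assumes the 16 S-boxes of each of the 10 AES rounds are evaluated in parallel, and Table 2 remains the authoritative source for the actual costs.

```latex
\[
\underbrace{16 \times 10}_{\text{S-boxes per block}} \times\, 6 = 960 \ \text{multiplications},
\qquad
10 \times 5 = 50 \ \text{communication rounds for the S-box layers}.
\]
```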

\(\texttt {3DES-Raw}\) represents the \(\texttt {3DES}\) cipher with the S-box evaluated as a polynomial of degree 62 over the field \(\mathbb {F}_{2^6} = \mathbb {F}_2[x] / (x^6 + x^4 + x^3 + x + 1)\). To make the comparison with the other ciphers meaningful in terms of active security, we chose to embed the S-box input in \(\mathbb {F}_{2^{42}}\), via the embedding \(\mathbb {F}_{2^6} \hookrightarrow \mathbb {F}_{2^{42}}\), where \(\mathbb {F}_{2^{42}} = \mathbb {F}_2[y] / (y^{42} + y^{21} + 1)\) and \(x \mapsto y^7+1\). The S-boxes used for interpolating are taken from the PyCrypto library [34]. \(\texttt {3DES-Raw}\) is implemented only for benchmarking purposes and has no added optimizations. One S-box has a cost of 62 multiplications and 62 rounds.
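The embedding can be checked mechanically. The sketch below represents polynomials over \(\mathbb {F}_2\) as Python integers (bit i is the coefficient of degree i) and reads the embedding as sending \(x\) to \(y^7+1\). This reading and all function names are assumptions made for illustration; it is not the SPDZ implementation.

```python
MOD6 = 0b1011011                      # x^6 + x^4 + x^3 + x + 1
MOD42 = (1 << 42) | (1 << 21) | 1     # y^42 + y^21 + 1
IMAGE_OF_X = (1 << 7) | 1             # y^7 + 1 (assumed image of x)


def gf2_mulmod(a, b, mod):
    """Multiply polynomials a and b over F_2 and reduce modulo `mod`.

    Assumes a is already reduced (degree below that of `mod`).
    """
    deg = mod.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> deg) & 1:
            a ^= mod
    return result


def embed(elem6):
    """Map sum(b_i x^i) in F_{2^6} to sum(b_i (y^7+1)^i) in F_{2^42}."""
    out, power = 0, 1
    for i in range(6):
        if (elem6 >> i) & 1:
            out ^= power
        power = gf2_mulmod(power, IMAGE_OF_X, MOD42)
    return out


def embedding_is_homomorphic():
    """Check that embed respects addition and multiplication on all pairs."""
    for a in range(64):
        for b in range(64):
            if embed(a ^ b) != embed(a) ^ embed(b):
                return False
            small = gf2_mulmod(a, b, MOD6)
            large = gf2_mulmod(embed(a), embed(b), MOD42)
            if embed(small) != large:
                return False
    return True
```

Calling `embedding_is_homomorphic()` exhaustively compares products computed in the small field with products computed in the large field, which is the property the S-box polynomial evaluation relies on.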

\(\texttt {3DES-PV}\) is \(\texttt {3DES}\) implemented with the Pulkus-Vivek method from Sect. 3.2. Since it has only a few multiplications in \(\mathbb {F}_{2^{40}}\), the amount of preprocessing data required is very small, close to \(\texttt {AES-BD}\). It suffers in terms of both latency and throughput due to the high number of communication rounds (needed for bit decomposition to perform the squarings).

Surprisingly, \(\texttt {AES-RP}\) (the polynomial-based method from Sect. 3.1) has a better throughput than \(\texttt {AES-BD}\) although it requires 20 more rounds and 2 times more shared bits to evaluate. The explanation for this is that in \(\texttt {AES-RP}\) there are fewer openings, thus less data sent between parties.

\(\texttt {AES-LT}\) and \(\texttt {3DES-LT}\) are the ciphers obtained with the lookup table protocol from Sect. 4. \(\texttt {AES-LT}\) achieves the lowest latency and the highest throughput in the online phase. The communication in the preprocessing phase is roughly twice the cost of the previous method, \(\texttt {AES-BD}\).

Packing Optimization. We notice that in the online phase of \(\texttt {AES-LT}\) each opening requires sending 8-bit values embedded in \(\mathbb {F}_{2^{40}}\). Instead of sending 40 bits across the network, we select only the relevant bits, which for \(\texttt {AES-LT}\) are 8 bits. This reduces the communication by a factor of 5 and gives a throughput of 236k AES/second over a LAN with a multi-threaded MPC engine.

The same packing technique is applied to \(\texttt {3DES-LT}\), since during the protocol we only open 6-bit values from Protocol 1. These bits are packed into a byte and sent to the other party. Here the multi-threaded version of \(\texttt {3DES-LT}\) improves the throughput only by a factor of 4.2x (versus 4.4x for \(\texttt {AES-LT}\)) due to the higher number of rounds and openings, along with the loss of 2 bits from packing.
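The effect of packing on message size can be illustrated with a small sketch. It assumes, for simplicity, that each opened value has already been mapped to its 8-bit (or 6-bit) representation in the low bits of an integer; the actual image of the embedding in \(\mathbb {F}_{2^{40}}\) is not laid out this way, and the function names are hypothetical.

```python
FIELD_BYTES = 5  # one element of F_{2^40} serialized as 40 bits


def serialize_full(values):
    """Send every opened value as a full 40-bit field element."""
    return b"".join(v.to_bytes(FIELD_BYTES, "little") for v in values)


def serialize_packed(values, bits):
    """Send only the `bits` relevant bits of each value, one byte per value."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    mask = (1 << bits) - 1
    return bytes(v & mask for v in values)


def deserialize_packed(data):
    """Recover the opened values from the packed byte string."""
    return list(data)


aes_openings = list(range(16))  # hypothetical 8-bit opened values
full = serialize_full(aes_openings)
packed = serialize_packed(aes_openings, bits=8)
```

With 8-bit values, `packed` is one fifth the size of `full`. With `bits=6`, each value still occupies a whole byte, which corresponds to the 2 bits lost per opening for \(\texttt {3DES-LT}\).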

Table 2. Communication cost for \(\texttt {AES}\) and \(\texttt {3DES}\) in MPC.
Table 3. 1 GBps LAN timings for evaluating \(\texttt {AES}\) and \(\texttt {3DES}\) in MPC.

General Costs of the Table Lookup Protocol. In Table 4, we estimate the communication cost for creating preprocessed, masked tables for a range of table sizes, using our protocol from Sect. 4.1. This requires multiplication triples over \(\mathbb {F}_{2^k}\), where k is a parameter of the protocol. When \(k=1\), we give figures using a recent optimized variant [43] of the two-party TinyOT protocol [35]. For larger choices of k, the costs are based on the MASCOT protocol [30]. We note that even though MASCOT has a communication complexity in \(O(k^2)\), it still gives the lowest costs (with \(k=40\)) for all the table sizes we considered.

Table 4. Total communication cost (kBytes) of the \(\mathbb {F}_2 \times \mathbb {F}_{2^k}\) multiplications needed in creating a masked lookup table of size N, with two parties. The \(k=1\) estimates are based on TinyOT [43], the others on MASCOT [30].

5.1 Multiparty Setting

We also ran the \(\texttt {AES-LT}\) protocol with different numbers of parties and measured the throughput of the preprocessing and online phases. Figure 3 indicates that the preprocessing gets more expensive as the number of parties increases, whereas the online phase throughput does not decrease by much. This is likely to be because the bottleneck for the preprocessing is in terms of communication (which is \(O(n^2)\) in total), whereas the online phase is more limited by the local computation done by each party.

Fig. 3. Table lookup-based AES throughput for multiple parties.

5.2 Comparison with Other Works

We now compare the performance of our protocols with other implementations in similar settings. Table 5 gives an overview of the most relevant previous works. We see that our \(\texttt {AES-LT}\) protocol comes very close to the best online throughput of TinyTable, whilst having a far more competitive offline cost (footnote 3). Our \(\texttt {AES-RP}\) variant has a slower online phase, but is comparable to the best garbled circuit protocols overall.

Table 5. Performance comparison with other 2-PC protocols for evaluating AES in a LAN setting.

TinyTable Protocol. The original, 2-party TinyTable protocol [18] presented implementations of the online phase only, with two different variants. The fastest variant is based on table lookup and obtains a throughput of around 340 thousand AES blocks per second over a 1 Gbps LAN, which is 1.51x faster than our online throughput. The latency (for sequential operations) is around 1 ms, the same as ours. We attribute the difference in throughput to the additional local computation in our implementation, since we need to compute on MACs for every linear operation.

TinyTable does not report figures for the preprocessing phase. However, we estimate that using TinyOT and the naive method suggested in the paper would need over 1.3 million TinyOT triples for AES (34 ANDs for each S-box, repeated 256 times to create one masked table, for 16 S-boxes in 10 rounds). In contrast, our table lookup method uses around 160 thousand TinyOT triples, or just 2080 triples over \(\mathbb {F}_{2^{40}}\) (cf. Table 1), per AES block.
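The estimate for the naive method follows directly from the figures quoted above, as the short calculation below shows. The constant and function names are illustrative, and the 160 thousand figure is the approximate value stated in the text rather than an exact count.

```python
ANDS_PER_SBOX = 34      # AND gates per S-box evaluation, as quoted above
TABLE_ENTRIES = 256     # one S-box evaluation per masked table entry
SBOXES_PER_ROUND = 16
ROUNDS = 10


def naive_tinytable_triples():
    """TinyOT triples needed by the naive masked-table generation for one AES block."""
    return ANDS_PER_SBOX * TABLE_ENTRIES * SBOXES_PER_ROUND * ROUNDS


naive = naive_tinytable_triples()
ours_tinyot = 160_000   # approximate figure quoted in the text
ratio = naive / ours_tinyot
```

Here `naive` evaluates to 1,392,640, so the naive approach needs between 8 and 9 times as many TinyOT triples as the table lookup method.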

Garbled Circuits. There are many implementations of AES for actively secure 2-PC using garbled circuits [33, 36, 39, 42, 43]. When measuring online throughput in a LAN setting, using garbled circuits gives much worse performance than methods based on table lookup, because evaluating a garbled circuit is much more expensive computationally. For example, out of all these works the lowest reported online time (even over a 10 GBps LAN) is 0.93 ms [43], and this does not improve in the amortized setting.

Some recent garbled circuit implementations, however, improve upon our performance in the preprocessing phase, where communication is typically the bottleneck. Wang et al. [43] require 2.57 MB of communication when 1024 circuits are being garbled at once, while Rindal and Rosulek [39] need only 1.6 MB. The runtime for both of these preprocessing phases is around 5 ms over a 10 GBps LAN; this would likely increase to at least 15–20 ms in a 1 GBps network, whereas our table lookup preprocessing takes around 60 ms using MASCOT. If a very fast online time is not required, our implementation of the Rivain–Prouff method would be more competitive, since this has a total amortized time of only 23 ms per AES block.

Secret-Sharing Based MPC. Other actively secure implementations of AES/DES in the dishonest-majority setting based on secret sharing include those using SPDZ [15, 31] and MiniMAC [17, 21]. Our AES-BD method is the same as [15] and obtains faster performance than both SPDZ implementations. For DES, our TinyTable approach improves upon the times of the binary circuit implementation from [31] (which are for single-DES, so must be multiplied by 3) by over 100 times. Regarding MiniMAC, the implementation of [17] obtains slower online phase times than our work and TinyTable, and it is not known how to do the preprocessing with concrete efficiency.

OP-LUT and SP-LUT. The 2-party protocols proposed by Dessouky et al. [22] only offer security in the semi-honest setting. The preprocessing phases of both protocols are based on 1-out-of-N oblivious transfer. In particular, the cost of the OP-LUT setup is essentially that of 1-out-of-N OT, while the cost of SP-LUT is the cost of 1-out-of-N random OT, which is much more efficient in terms of communication.

The online communication cost of OP-LUT is essentially the same as our online phase, since both protocols require each party to send \(\log _2 N\) bits for a table of size N. However, we incur some additional local computation costs and a MAC check (at the end of the function evaluation) to achieve active security. The online phase of SP-LUT is less efficient, but the overall communication of this protocol is very low, only 0.055 MB for a single AES evaluation in a LAN setting with a 1 GB network.

The work [22] reports figures for both preprocessing and online phase: using OP-LUT gives a latency of around 5 ms for 1 AES block in the LAN setting, and a throughput of 42000 blocks/s. These are both slower than our online phase figures using \(\texttt {AES-LT}\). The preprocessing runtimes of both OP-LUT and SP-LUT are much better than ours, however, achieving over 1000 blocks per second (roughly 80 times faster than \(\texttt {AES-LT}\)). This shows that we require a large overhead to obtain active security in the preprocessing, but the online phase cost is the same, or better.