1 Introduction

In 2-party secure computation, two parties A and B want to compute an agreed function securely on privately held inputs. The goal is to construct protocols ensuring that the only new information a party learns is the intended output.

In this paper we will focus on malicious security: one of the parties is under the control of an adversary and may behave arbitrarily. As is well known, this means that we cannot guarantee that the protocol always gives output to the honest party, but we can make sure that the output, if delivered, is correct. It is also well known that we cannot accomplish this task without using a computational assumption, and in fact heavy public-key machinery must be used to some extent.

However, as observed in several works [BDOZ11, DPSZ12, NNOB12, DZ13], one can confine the use of cryptography to a preprocessing phase where the inputs need not be known and which can therefore be executed at any time prior to the actual computation. The preprocessing produces “raw material” for the on-line phase, which is executed once the inputs are known. This phase can be information theoretically secure and usually has very small computational and communication complexity, but round complexity proportional to the depth of the computation in question. An alternative (which is not our focus here) is to use Yao-garbled circuits. This approach is incomparable to the preprocessing approach, even in the case we consider here where the function to compute is known in advance, because malicious security requires many garbled circuits to be stored and evaluated in the on-line phase. Hence, the round and communication complexity is smaller than for information theoretic protocols, but the storage and computational complexity is larger.

We will focus on the case where the desired computation is specified as a Boolean circuit. The case of arithmetic circuits over large fields was handled in [DPSZ12] which gave a solution where communication and computational complexities as well as the size of the preprocessed data (called data complexity in the following) are proportional to the circuit size. The requirement is that the field has \(2^{\varOmega (k)}\) elements where k is the security parameter and the allowed error probability is \(2^{- \varOmega (k)}\).

On the other hand, for Boolean circuits, the state of the art is the protocol from [DZ13], nicknamed MiniMac, which achieves data and communication complexity O(s) where s is the circuit size, and computational complexity \(O(k^\epsilon s)\), where \(\epsilon <1\) is a constant. For an alternative variant of the protocol, all complexities are \(O({\text {polylog}}(k) s)\). However, the construction only works for circuits with a sufficiently nice structure, called “well-formed” circuits in [DZ13]. Informally, a well-formed circuit allows a modest amount of parallelization throughout the computation – for instance, very tall and skinny circuits are not allowed. If well-formedness is not assumed, both complexities would be \(\varOmega (ks)\) using known protocols.

On the practical side, many of the protocols in the preprocessing model are potentially practical and several of them have been implemented. In particular, implementations of the MiniMac protocol were reported in [DLT14, DZ16]. In [DLT14], MiniMac was optimised and used for computing many instances of the same Boolean circuit in parallel, while in [DZ16] the protocol was adapted specifically for computing the AES circuit, which resulted in an implementation with latency about 6 ms and an amortised time of 0.4 ms per AES block.

Our Contribution. In this paper, we introduce a new protocol for the preprocessing model, nicknamed TinyTable. The idea is to implement each (non-linear) gate by a scrambled version of its truth table. Players will do look-ups in the tables using bits that are masked by uniformly random bits chosen in the preprocessing phase together with the tables.

The idea of gate-scrambling goes back at least to [CDvdG87] where a (much less efficient) approach based on the quadratic residuosity problem was proposed. Scrambled truth tables also appear more recently in [IKM+13], but here the truth table for the entire function is scrambled, leading to a protocol with complexity exponential in the length of the inputs (but very small communication). Even more recently, in [DZ16], a (different form of) table look-up was used to implement the AES S-boxes.

What we do here is to observe that the idea of scrambled truth tables makes especially good sense in the preprocessing model and, more importantly, that if we combine it with the “right” authentication scheme, we get an extremely practical and maliciously secure protocol. This first version of our protocol has communication complexity O(s) and data and computational complexity O(ks). Although the computational complexity is asymptotically inferior to previous protocols when counting elementary bit operations, the protocol works much better in practice: XOR and NOT gates require no communication, and for each non-linear gate, each player sends and receives 1 bit and XORs 1 word from memory into a register. This means that TinyTable saves a factor of at least 2 in communication in comparison to standard passively secure protocols in the preprocessing model, such as GMW with precomputed OTs or the protocol using circuit randomization via Beaver triples.

We implemented a version of this protocol that was optimised for AES computation, by using tables for each S-box. This is more costly in preprocessing but, compared to a Boolean circuit implementation, it reduces the round complexity of the on-line phase significantly (to about 10). On a fast cloud set-up, we obtain on-line latency of 1 ms and amortized time about 0.5 \(\upmu \)s per AES block for an error probability of \(2^{-64}\). To the best of our knowledge, this is the fastest on-line time obtained for secure two-party AES with malicious security. To illustrate what we gain from the AES specific approach, we also implemented the version that works for any Boolean circuit and applied it to an AES circuit computing the same function as the optimized protocol does. On the same cloud set-up and same security level, we obtained a latency of 2.4 ms and amortized time about 10 \(\upmu \)s per AES block.

We describe how one can do the preprocessing we require based on the preprocessing phase of the TinyOT protocol from [NNOB12]. This protocol is basically maliciously secure OT extension with some extra processing on top. In the case of Boolean circuits, the work needed to preprocess an AND gate roughly equals the work we would need per AND gate in TinyOT. For the case of AES S-box tables, we show how to use a method from [DK10] to preprocess such tables. This requires 7 binary multiplications per table entry and local computation.

As for the speeds of preprocessing one can obtain, the best approaches and implementations of this type of protocol are from [FKOS15]. They do not have an implementation of the exact case we need here (2-party TinyOT), but they estimate that one can obtain certainly more than 10,000 AND gates per second and most likely up to 100,000 per second [Kel]. This would mean that we could preprocess a binary AES circuit in about 40 ms.

Our final contribution is a version of our protocol that has better asymptotic performance. We get data and communication complexity O(s), and computational complexity \(O(k^\epsilon s)\), where \(\epsilon <1\) is a constant. Alternatively we can also have all complexities be \(O({\text {polylog}}(k) s)\). While this is the same result that was obtained for the MiniMac protocol, note that we get this for any circuit, not just for well-formed ones. Roughly speaking, the idea is to use the MiniMac protocol to authenticate the bits that players send during the on-line phase. This task is very simple: it parallelizes easily and can be done in constant depth. Therefore we get a better result than if we had used MiniMac directly on the circuit in question.

2 Construction

We show a 2PC protocol for securely computing a Boolean circuit C for two players A and B. The circuit may contain arbitrary gates taking two inputs and giving one output. We let \(G_1, \ldots , G_N\) be the gates of the circuit, and let \(w_1, \ldots , w_W\) be the wires. We note that arbitrary fan-out is allowed, and we do not assume special fan-out gates for this: we simply allow an arbitrary number of copies of a wire to leave a gate, and all these copies are assigned the same wire index. We assume for simplicity that both parties are to learn the output.

We will fix some arbitrary order in which the circuit can be evaluated gate by gate, such that the output gates come last, and such that when we are about to evaluate gate i, its inputs have already been computed. We assume throughout that the indexing of the gates satisfies this constraint. We call the wires coming out of the output gates the output wires.

We first specify a functionality for preparing preprocessing material that will allow computation of the circuit with semi-honest security, see Fig. 1.

Fig. 1. Functionality for preprocessing, semi-honest security.

The idea behind the construction of the tables is that when the time comes to compute gate \(G_i\), both players will know “encrypted bits” \(e_u= b_u\oplus r_u\) and \(e_v= b_v\oplus r_v\), for the input wires, where \(b_u,b_v\) are the actual “cleartext” bits going into \(G_i\), and \(r_u, r_v\) are random masking bits that are chosen in the preprocessing. In addition, the preprocessing sets up two tables \(A_i, B_i\) for each gate, one held by A and one held by B. These tables are used to get hold of a similar encrypted bit for the output wire: \(e_o = b_o\oplus r_o\), where \(b_o= G_i(b_u, b_v)\). This works because the tables are set up such that \(A_{i}[e_u,e_v], B_{i}[e_u,e_v]\) is an additive sharing of \(b_o \oplus r_o\), i.e.,

$$A_{i}[b_u \oplus r_u,b_v \oplus r_v] \oplus B_{i}[b_u \oplus r_u,b_v \oplus r_v] = b_o \oplus r_o\ .$$
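To make the table invariant concrete, here is a minimal Python sketch of preprocessing and evaluating a single gate. All names are ours, and the dealer and the two parties are collapsed into plain functions; it illustrates the invariant, not the actual protocol messages.

```python
import secrets

def preprocess_gate(gate, r_u, r_v, r_o):
    """Build an additively shared scrambled truth table for one gate.

    gate: function {0,1} x {0,1} -> {0,1};  r_u, r_v, r_o: wire masks.
    Returns (A_i, B_i) with A_i[c][d] ^ B_i[c][d] = gate(c ^ r_u, d ^ r_v) ^ r_o.
    """
    A_i = [[secrets.randbits(1) for _ in range(2)] for _ in range(2)]
    B_i = [[A_i[c][d] ^ gate(c ^ r_u, d ^ r_v) ^ r_o
            for d in range(2)] for c in range(2)]
    return A_i, B_i

def evaluate_gate(A_i, B_i, e_u, e_v):
    """On-line phase: each party looks up its table at (e_u, e_v); the two
    shares are exchanged and their XOR is the masked output bit e_o."""
    return A_i[e_u][e_v] ^ B_i[e_u][e_v]

# Tiny usage example for an AND gate.
AND = lambda a, b: a & b
r_u, r_v, r_o = secrets.randbits(1), secrets.randbits(1), secrets.randbits(1)
A_i, B_i = preprocess_gate(AND, r_u, r_v, r_o)
b_u, b_v = 1, 1                      # cleartext input bits
e_u, e_v = b_u ^ r_u, b_v ^ r_v      # masked bits known to both parties
e_o = evaluate_gate(A_i, B_i, e_u, e_v)
assert e_o ^ r_o == AND(b_u, b_v)    # e_o = b_o ^ r_o
```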

These considerations lead naturally to the protocol for computing C securely shown in Fig. 2. For the security of this construction, note that since the masking bits for the wires are uniformly random, each bit \(e_u\) is random as well (except when \(w_u\) is an output wire), and so the positions in which we look up in the tables are also random. With these observations, it is easy to see that we have:

Fig. 2. Protocol for semi-honest security.

Proposition 1

\(F^{pre}_{sem}\) composed with \(\pi _{sem}\) implements (with perfect semi-honest security) the ideal functionality \(F_{\mathsf {SFE}}\) for secure function evaluation.

Here, \(F_{\mathsf {SFE}}\) is a standard functionality that accepts C as input from both parties, then gets x from A and y from B, and finally outputs C(x, y) to both parties; this is the behaviour in case of semi-honest corruption. In case of malicious corruption, it will send the output to the corrupted party and let the adversary decide whether or not to abort; only if the adversary does not abort will it send the output to the honest party. We assume that the communication is not secret, even when both players are honest, so the protocol does not have to keep the output secret in this case. We give a precise specification in Fig. 4.

Fig. 3. Preprocessing functionality, malicious security.

Fig. 4. Secure function evaluation functionality, malicious security.

It is also very natural that we can get malicious security if we assume instead a functionality that commits A and B to the tables. We can get this effect by letting the functionality also store the tables and output bits from them to the other party on request. In other words, what we need is a functionality that allows us to commit players to correlated bit strings such that they can later open a subset of the bits. Note that the subset cannot be known in advance, which makes it more difficult to implement efficiently. Indeed, if we go for the simplest information theoretically secure solution in the preprocessing model, the storage requirement will be \(O(\ell k)\) bits where \(\ell \) is the string length and k is the security parameter. This is achieved essentially by committing to each bit individually; we give more details on this below. In contrast, if we just need to open the entire string or nothing, \(O(\ell +k)\) bits is sufficient using standard techniques. However, this difference is not inherent: as we discuss in more detail in Sect. 6, one can in fact achieve both storage and communication complexity \(O(\ell +k)\) bits also for opening a subset that is unknown a priori. On the other hand, the simple bit-by-bit solution works better in practice.

Fig. 5. Protocol for malicious security.

In Fig. 3, we show a functionality that does preprocessing as in the semi-honest case, but in addition commits players bit by bit to the content of the tables. The idea is that, for entry \(A_i[c,d]\) in a table, A is also given a random string \(a_{i,c,d}^{A_i[c,d]}\) which serves as an authentication code that A can use to show that he sends the correct value of \(A_i[c,d]\), while B is given the pair \(a_{i,c,d}^0, a_{i,c,d}^1\) serving as a key that B can use to verify that he gets the correct value. Of course, using the authentication codes directly in this way, we would have to send k bits to open each bit. However, in our application, we can bring down the communication needed to (essentially) \(\ell +k\) bits, because we can delay verification of most of the bits opened. The idea is that, instead of sending the authentication codes, players will accumulate the XOR of all of them and check for equality at the end. The protocol shown in Fig. 5 uses this idea to implement maliciously secure computation.
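The deferred verification can be illustrated with the following Python sketch (names such as preprocess_entry are ours, and only an honest execution is shown): each opened bit contributes one k-bit code to an accumulated XOR on each side, and only the accumulated values are compared at the end.

```python
import secrets

K = 64  # statistical security parameter (bits per authentication code)

def preprocess_entry(bit):
    """For one table bit held by A: B gets a key pair, A gets the code
    matching its bit.  (a0, a1) play the role of a_{i,c,d}^0, a_{i,c,d}^1."""
    a0, a1 = secrets.randbits(K), secrets.randbits(K)
    code_for_A = a1 if bit else a0
    keys_for_B = (a0, a1)
    return code_for_A, keys_for_B

# A opens a sequence of bits; both sides accumulate the XOR of the codes
# and compare only once at the end, so each opening costs one bit on-line.
bits = [secrets.randbits(1) for _ in range(1000)]
entries = [preprocess_entry(b) for b in bits]

acc_A, acc_B = 0, 0
for bit, (code_A, (a0, a1)) in zip(bits, entries):
    # A sends `bit`; B picks the key matching the claimed value.
    acc_A ^= code_A
    acc_B ^= a1 if bit else a0

assert acc_A == acc_B  # A finally sends acc_A; B checks it against acc_B
```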

Theorem 1

\(F^{pre}_{mal}\) composed with \(\pi _{mal}\) implements \(F_{\mathsf {SFE}}\) with statistical security against malicious and static corruption.

Proof

We will show security in the UC model (in the variant where the environment also plays the role of the adversary), which means we need to exhibit a simulator S such that no (unbounded) environment Z can distinguish \(F^{pre}_{mal}\) composed with \(\pi _{mal}\) from S composed with \(F_{\mathsf {SFE}}\). The case where no player is corrupt is trivial to simulate since the messages sent contain uniformly random bits, with the only exception of the output stage, where the shares sent determine the outputs, which the simulator knows; hence these shares are also easy to simulate. We now describe the simulation for the case where A is corrupt (the other case where B is corrupt is similar).

  • Set-up. S runs internally copies of \(F^{pre}_{mal}\) and \(\pi _{mal}\) where it corrupts A, and gives B a default input, say all zeros. It will let Z play the role of the corrupt A. We assume that both players send the same circuit C to be computed (otherwise there is nothing to simulate) and S will send C to \(F_{\mathsf {SFE}}\).

  • Input. In the input stage of the protocol, when Z (corrupt A) sends \(e_u\) for an input wire, S computes \(b_u = e_u\oplus r_u\) and sends it to \(F_{\mathsf {SFE}}\). This is the extracted input of A for this wire, note that S knows \(r_u\) from (its copy of) \(F^{pre}_{mal}\).

  • Computing stage. For the first M gates, S will simply let its copy of B run the protocol with Z acting as corrupt A. If B aborts, S sends “abort” to \(F_{\mathsf {SFE}}\) and stops, otherwise it sends “compute result”.

  • Output stage. S gets the outputs from \(F_{\mathsf {SFE}}\), i.e., \(b_o\) for each output wire \(w_o\). S modifies the tables inside its copy of B such that for each output gate \(G_i\) with input wires \(w_u, w_v\) and output \(w_o\), \(B_i[e_u,e_v]\) satisfies \(b_o= r_o\oplus A_i[e_u,e_v] \oplus B_i[e_u,e_v]\). Note that S knows \(r_o\) and \(A_i[e_u,e_v]\) from its copy of \(F^{pre}_{mal}\). S now runs the output stage according to the protocol. If Z lets the protocol finish normally, S sends “deliver output” to \(F_{\mathsf {SFE}}\), otherwise it sends “abort”.

To show that this simulation is statistically good, observe first that the simulation up until the point where the outputs are opened is perfect: this follows since in the computation phase of the protocol, the honest player sends only uniformly random bits that are independent of anything else in the environment’s view. Therefore this also holds for the simulated honest player, even if he runs with a default input. The reason why only random bits are sent is as follows: whenever the honest player sends a bit, it can be assigned to a particular wire, say \(w_v\) that has not been handled before. Therefore no information about the wire mask \(r_v\) was released before. The bit sent by the honest player can be written as \(r_v\) xored with other bits and since \(r_v\) was chosen independently of anything else, the bit sent is also random in the adversary’s view.

In the verification step (last part of Item 4 of the protocol), the honest player sends the correct verification value \(m_B\) that can be computed from what the environment has already seen. The correct verification value \(m_A= t_B\) to be sent by A is well defined (it can be computed easily from the view of the environment), and if the value actually sent is incorrect, the protocol will abort in both the real and the simulated process.

Now, if the protocol proceeds to the output step, the only new information the environment sees is, for each output gate \(G_i\) with input wires \(w_u, w_v\) and output \(w_o\): the output \(b_o\) as well as B’s share \(e_B= B_i[e_u,e_v]\) and the verification value \(b^{e_B}_{i,e_u,e_v}\). Note the latter two values are determined from \(b_o\) and the environment’s view so far: \(e_B\) is determined by \(b_o= e_A\oplus e_B\) and then \(b^{e_B}_{i,e_u,e_v}\) is determined by what A received from the preprocessing initially. It follows that the entire simulation is perfect given the event that the output generated is the same in the real as in the simulated process.

Now, in the simulation the output is always the correct output based on the inputs determined in the input stage. Therefore, to argue that the simulation is statistically close, it is sufficient to show that the real protocol completes with an incorrect output with probability at most \(2^{-k}\). This in turn follows easily from the fact that if A always sends correct shares from the tables, the output is always correct. And if he does not, the verification value corresponding to the incorrect message is completely unknown to A and can be guessed with probability at most \(2^{-k}\). Hence B will abort in any such case, except with probability \(2^{-k}\).

This theorem can quite easily be extended to adaptive corruption (of one player). For this case, the simulator would start by internally running a copy of the protocol where A and B are honest and use default inputs. When A (or B) is corrupted, one observes that the internal view of A can easily be modified so that it is consistent with the real input of A (which the simulator is now given). The simulator gives this modified view of A to the environment and continues as described in the proof of the above theorem.

2.1 Free XOR

It is easy to modify our construction to allow non-interactive processing of XOR gates. For simplicity, we only show how this works for the case of semi-honest security; malicious security is obtained in exactly the same way as in the previous section (Fig. 6).

The idea is to select the wire masks in a different way, exploiting the homomorphic properties of XOR gates. Basically, if we encounter a NOT gate, we make sure the output wire mask is equal to the input wire mask. For XOR gates we set the output wire mask to the XOR of the input wire masks. Since the wire masks can no longer be sampled independently at random, we ensure this invariant by traversing the circuit. One could set the output wire mask to zero for all output gates and traverse the gates backwards \(G_N,\dots ,G_1\), maintaining the invariant on the wire masks. Another approach, which is the one we use in the following, is to traverse the gates forwards \(G_1,\dots ,G_N\), maintaining the invariant, and then give the output wire masks to the parties, who will then remove the mask before returning the actual output. In the online phase, we can now process XOR and NOT gates locally by computing the respective function directly on masked values. The resulting protocol is shown in Fig. 7.
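A minimal sketch of the forward traversal assigning wire masks, assuming the gates are given in topological order (the data layout and names are ours):

```python
import secrets

def assign_masks(gates, input_wires):
    """Forward traversal choosing wire masks so that XOR and NOT gates need
    no table: XOR output mask = XOR of the input masks, NOT output mask =
    input mask, any other gate gets a fresh random output mask.
    `gates` is a list of (kind, in_wires, out_wire) tuples in topological order."""
    r = {w: secrets.randbits(1) for w in input_wires}
    for kind, ins, out in gates:
        if kind == "XOR":
            r[out] = r[ins[0]] ^ r[ins[1]]
        elif kind == "NOT":
            r[out] = r[ins[0]]        # on-line: e_o = 1 ^ e_u (or see Sect. 2.2)
        else:                         # AND, or any other table gate
            r[out] = secrets.randbits(1)
    return r

# Example: wire 2 = wire 0 XOR wire 1, wire 3 = NOT wire 2.
masks = assign_masks([("XOR", (0, 1), 2), ("NOT", (2,), 3)], input_wires=[0, 1])
```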

Fig. 6. Functionality for preprocessing, semi-honest security with free XOR.

Fig. 7. Protocol for semi-honest security, free XOR.

2.2 Removing NOT-Gates

A slight change in the preprocessing allows us to completely remove the on-line operations associated with NOT gates. Namely, when preprocessing a NOT gate \(G_i\) with input wire \(w_u\) and output wire \(w_o\), we will set \(r_o= 1-r_u\); everything else remains unchanged. Then, in the on-line phase, we can simply ignore the NOT gates, or in other words, by convention we will set \(e_o= e_u\).

2.3 Generalisation to Bigger Tables

If the circuit contains a part that evaluates a non-linear function f on a small input, it is natural to implement the computation of this function as a table. If the input is small, such a table is not prohibitively large. Suppose, for instance, that the input and output are 8 bits, as is the case for the AES S-box. Then we will store tables \(A_f, B_f\), each indexed by an 8-bit value M, such that \(A_f[M] \oplus B_f[M] = f(M \oplus B)\oplus O\), where B is an 8-bit input mask and O is an 8-bit output mask, both chosen in the preprocessing. We make sure in the preprocessing that the i’th bit of B, denoted B[i], equals the wire mask for the i’th wire going into the evaluation of f, whereas O[i] equals the wire mask for the i’th wire receiving output from f. We can then use the table for f exactly as we use the tables for the AND gates.
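A small sketch of how such a table pair could be built, using the notation above with in_mask in the role of B and out_mask in the role of O (the function and variable names are ours):

```python
import secrets

def preprocess_sbox_table(f, in_mask, out_mask):
    """Scrambled table pair for an 8-bit function f (e.g. the AES S-box):
    A_f[M] ^ B_f[M] = f(M ^ in_mask) ^ out_mask for every 8-bit index M."""
    A_f = [secrets.randbits(8) for _ in range(256)]
    B_f = [A_f[M] ^ f(M ^ in_mask) ^ out_mask for M in range(256)]
    return A_f, B_f

# On-line: both parties know the masked input e = x ^ in_mask; each looks up
# its own table at index e, the two entries are exchanged, and their XOR is
# f(x) ^ out_mask, i.e. the masked output.
```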

To get malicious security we must note that the simple authentication scheme we used in the binary case will be less practical here: as we have 256 possible output values, each party would need to store 256 strings per table entry. It turns out this can be optimized considerably using a different MAC scheme, as described in the following section.

3 The Linear MAC Scheme

In this section, we describe some variations over a well-known information theoretically secure MAC scheme that can be found, e.g., in [NNOB12]. We optimise it to be efficient on modern Intel processors; this requires some changes in the construction, and hence we need to specify and reprove the scheme. It is intended to be used in conjunction with the generalisation to bigger tables described in the previous section.

There is a committer \({\mathsf {C}}\), a verifier \({\mathsf {V}}\) and a preprocessor \({\mathsf {P}}\). There is a security parameter k. Some of the computations are done over the finite field \(\mathbb {F}= {\text {GF}}(2^k)\). Let \(p(\texttt {X})\) be the irreducible polynomial of degree k used to do the computation in \(\mathbb {F}= {\text {GF}}(2^k)\), i.e., elements \(x, y \in \mathbb {F}\) are polynomials of degree at most \(k-1\) and multiplication is computed as \(z = x y \bmod p\). We will also be doing computations in the finite field \(\mathbb {G}= {\text {GF}}(2^{2k-1})\). Let \(q(\texttt {X})\) be the irreducible polynomial of degree \(2k-1\) used to do the computation in \(\mathbb {G}\). Notice that elements \(x, y \in \mathbb {F}\) are polynomials of degree at most \(k-1\), so xy is a polynomial of degree at most \(2k-2\). We can therefore think of xy as an element of \(\mathbb {G}\). Note in particular that \(x y \bmod q = x y\) when \(x, y \in \mathbb {F}\).

3.1 Basic Version

The MAC scheme has message space \(\mathbb {F}\). We denote a generic message by \(x \in \mathbb {F}\). The MAC scheme has key space \(\mathbb {F}\times \mathbb {G}\). We denote a generic key by \(K = (\alpha , \beta ) \in \mathbb {F}\times \mathbb {G}\). The tag space is \(\mathbb {G}\). We denote a generic tag by \(y \in \mathbb {G}\). The tag is computed as \(y = {\text {mac}}_K(x) = \alpha x + \beta \). Note that \(\alpha \in \mathbb {F}\) and \(x \in \mathbb {F}\), so \(\alpha x \in \mathbb {G}\) as described above and hence can be added to \(\beta \) in \(\mathbb {G}\). We use this particular scheme because it can be computed very efficiently using the PCLMULQDQ instruction on modern Intel processors. With one PCLMULQDQ instruction we can compute \(\alpha x\) from which we can compute \(\alpha x + \beta \) using one additional XOR.

The intended use of the MAC scheme is as follows. The preprocessor samples a message x and a uniformly random key K and computes \(y = {\text {mac}}_K(x)\). It gives K to \({\mathsf {V}}\) and gives (xy) to \({\mathsf {C}}\). To reveal the message \({\mathsf {C}}\) sends (xy) to \({\mathsf {V}}\) who accepts if and only if \(y = {\text {mac}}_K(x)\). Since K is sampled independently of x, the scheme is clearly hiding in the sense that \({\mathsf {V}}\) gets no information on x before receiving the opening information. We now show that the scheme is binding.
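A Python sketch of the basic scheme, with the PCLMULQDQ instruction replaced by a plain carry-less product for illustration (all names are ours):

```python
import secrets

K = 64  # bit length of messages and of alpha (elements of F = GF(2^k))

def clmul(a, b):
    """Carry-less (GF(2)[X]) product of two bit-polynomials; in an
    implementation this is one PCLMULQDQ-style multiplication."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def keygen():
    alpha = secrets.randbits(K)          # alpha in F
    beta = secrets.randbits(2 * K - 1)   # beta in G = GF(2^{2k-1})
    return alpha, beta

def mac(key, x):
    alpha, beta = key
    return clmul(alpha, x) ^ beta        # alpha*x needs no reduction: degree <= 2k-2

def verify(key, x, y):
    return mac(key, x) == y

# Preprocessor side: sample x and a key, hand (x, y) to C and the key to V.
key = keygen()
x = secrets.randbits(K)
y = mac(key, x)
assert verify(key, x, y)
```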

Let \(\mathcal {A}\) be an unbounded adversary. Run \(\mathcal {A}\) to get a message \(x \in \mathbb {F}\). Sample a uniformly random key \((\alpha , \beta ) \in \mathbb {F}\times \mathbb {G}\) and compute \(y = \alpha x + \beta \). Give y to \(\mathcal {A}\). Run \(\mathcal {A}\) to get an output \((y', x') \in \mathbb {G}\times \mathbb {F}\). We say that \(\mathcal {A}\) wins if \(x' \ne x\) and \(y' = {\text {mac}}_K(x')\). We can show that no \(\mathcal {A}\) wins with probability better than \(2^{-k}\). To see this, notice that if \(\mathcal {A}\) wins then he knows \(x,y,x',y'\) such that \(y = \alpha x + \beta \) and \(y' = \alpha x' + \beta \). This implies that \(y'-y = \alpha (x'-x)\), from which it follows that \(\alpha = (y'-y)(x'-x)^{-1}\). This means that if \(\mathcal {A}\) can win with probability r then \(\mathcal {A}\) can guess \(\alpha \) with probability at least r. It is then sufficient to prove that no adversary can guess \(\alpha \) with probability better than \(2^{-k}\). This follows from the fact that \(\alpha \) is uniformly random given \(\alpha x + \beta \), because \(\alpha x\) is some element of \(\mathbb {G}\) and \(\beta \) is a uniformly random element of \(\mathbb {G}\) independent of \(\alpha x\).

3.2 The Homomorphic Vector Version

We now describe a vector version of the scheme which allows committing to multiple messages using a single key.

The MAC scheme has message space \(\mathbb {F}^\ell \). We denote a generic message by \(\varvec{x} \in \mathbb {F}^\ell \). The MAC scheme has key space \(\mathbb {F}\times \mathbb {G}^\ell \). We denote a generic key by \(K = (\alpha , \varvec{\beta }) \in \mathbb {F}\times \mathbb {G}^\ell \). The tag space is \(\mathbb {G}^\ell \). We denote a generic tag by \(\varvec{y} \in \mathbb {G}^{\ell }\). The tag is computed as \(\varvec{y} = {\text {mac}}_K(\varvec{x}) = \alpha \varvec{x} + \varvec{\beta }\), i.e., \(y_i = \alpha x_i + \beta _i\). Note that \(\alpha \in \mathbb {F}\) and \(x_i \in \mathbb {F}\), so \(\alpha x_i \in \mathbb {G}\).

The intended use of the MAC scheme is as follows. The preprocessor samples a message \(\varvec{x}\) and a uniformly random key K and computes \(\varvec{y} = {\text {mac}}_K(\varvec{x})\). It gives K to \({\mathsf {V}}\) and gives \((\varvec{x}, \varvec{y})\) to \({\mathsf {C}}\). To reveal the message \(x_i\) the committer \({\mathsf {C}}\) sends \((x_i, y_i)\) to \({\mathsf {V}}\), who accepts if and only if \(y_i = \alpha x_i + \beta _i\).

The preprocessed information also allows opening any sum of a subset of the \(x_i\)’s. Let \(\varvec{\lambda } \in \mathbb {F}^\ell \) with \(\lambda _i \in \{0,1\}\). Let \(x_\lambda = \sum _i \lambda _i x_i \pmod {\mathbb {F}}\), let \(y_\lambda = \sum _i \lambda _i y_i \pmod {\mathbb {G}}\), and let \(\beta _\lambda = \sum _i \lambda _i \beta _i \pmod {\mathbb {G}}\). To reveal \(x_\lambda \) the committer \({\mathsf {C}}\) sends \((x_\lambda , y_\lambda )\) and the verifier \({\mathsf {V}}\) checks that \(y_\lambda = \alpha x_\lambda + \beta _\lambda \). If both players are honest, this is clearly the case. The only non-trivial thing to notice is that since \(\sum _i \lambda _i x_i \pmod {\mathbb {F}}\) does not involve any reduction modulo p we have that \(\sum _i \lambda _i x_i \pmod {\mathbb {F}} = \sum _i \lambda _i x_i = \sum _i \lambda _i x_i \pmod {\mathbb {G}}\).
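The subset opening and its check can be sketched as follows; clmul is repeated from the sketch above so the fragment is self-contained, and the check succeeds because the carry-less product is linear over GF(2):

```python
import secrets

K = 64
ell = 8

def clmul(a, b):
    """Carry-less product over GF(2)[X], as in the basic scheme above."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

alpha = secrets.randbits(K)                                # key component alpha
betas = [secrets.randbits(2 * K - 1) for _ in range(ell)]  # one beta_i per message
xs = [secrets.randbits(K) for _ in range(ell)]             # messages
ys = [clmul(alpha, x) ^ b for x, b in zip(xs, betas)]      # tags held by C

lam = [1, 0, 1, 1, 0, 0, 1, 0]          # 0/1 coefficients of the opened combination
x_lam = y_lam = beta_lam = 0
for i in range(ell):
    if lam[i]:
        x_lam ^= xs[i]                  # plain XOR: no reduction mod p needed
        y_lam ^= ys[i]
        beta_lam ^= betas[i]

# V's check of the opening (x_lam, y_lam).
assert y_lam == clmul(alpha, x_lam) ^ beta_lam
```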

The scheme is hiding in the sense that after a number of openings to elements \(x_\lambda \) the verifier learns nothing more than what can be computed from the received values \(x_\lambda \). To see this, notice that K is independent of \(\varvec{x}\) and hence could be simulated by \({\mathsf {V}}\). Also the openings can be simulated. Namely, whenever \({\mathsf {V}}\) receives an opening \((x_\lambda , y_\lambda )\) from an honest \({\mathsf {C}}\), we know that \(y_\lambda = \alpha x_\lambda + \beta _\lambda \), so \({\mathsf {V}}\) could have computed \(y_\lambda \) itself from \(x_\lambda \) and K. Hence no information beyond \(x_\lambda \) is transmitted by sending \((x_\lambda , y_\lambda )\).

We then prove that the scheme is binding. Let \(\mathcal {A}\) be an unbounded adversary. Run \(\mathcal {A}\) to get a message \(\varvec{x} \in \mathbb {F}^\ell \). Sample a uniformly random key \((\alpha , \varvec{\beta }) \in \mathbb {F}\times \mathbb {G}^\ell \) and compute \(y_i = \alpha x_i + \beta _i\) for \(i = 1, \ldots , \ell \). Give \(\varvec{y}\) to \(\mathcal {A}\). Run \(\mathcal {A}\) to get an output \((y', x', \varvec{\lambda }) \in \mathbb {G}\times \mathbb {F}\times \{0,1\}^\ell \). We say that \(\mathcal {A}\) wins if \(x' \ne x_\lambda \) and \(y' = \alpha x' + \beta _{\lambda }\). We can show that no \(\mathcal {A}\) wins with probability better than \(2^{-k}\). To see this, notice that if \(\mathcal {A}\) wins then he knows \(x'\) and \(y'\) such that \(y' = \alpha x' + \beta _{\lambda }\). He also knows \(x_ {\lambda }\) and \(y_{\lambda }\) as these can be computed from \(\varvec{x}\), \(\varvec{y}\) and \(\varvec{\lambda }\), which he knows already. And, it holds that \(y_{\lambda } = \alpha x_{\lambda } + \beta _{\lambda }\). Therefore \(y' - y_{\lambda } = \alpha (x' - x_{\lambda })\), and it follows as above that \(\mathcal {A}\) can compute \(\alpha \). Since no information is leaked on \(\alpha \) by the value \(\alpha \varvec{x} + \varvec{\beta }\) given to \(\mathcal {A}\) it follows that it is uniform in \(\mathbb {F}\) in the view of \(\mathcal {A}\). Therefore \(\mathcal {A}\) can compute \(\alpha \) with probability at most \(2^{-k}\).

We note for completeness that the scheme could be extended to arbitrary linear combinations. In that case one would however have to send \(x_\lambda = \sum _i \lambda _i x_i \pmod {\mathbb {F}}\) and \(y_\lambda = \sum _i \lambda _i y_i \pmod {\mathbb {G}}\), which would involve doing polynomial reductions. The advantage of the above scheme where \(\lambda _i \in \{0,1\}\) is that no polynomial reductions are needed, allowing full use of the efficiency of the PCLMULQDQ instruction.

3.3 Batched Opening

We now present a method to open a large number of commitments in an amortised efficient way, sending only k bits of MAC material on top of the revealed values.

For notational simplicity we assume that \({\mathsf {C}}\) wants to reveal all the values \((x_1, \ldots , x_n)\), but the scheme trivially extends to opening arbitrary subsets and linear combinations. To reveal \((x_1, \ldots , x_n)\) as described above, \({\mathsf {C}}\) would send \(Y^{\mathsf {C}}=(y_1, \ldots , y_n)\) and \({\mathsf {V}}\) would compute \(y_i^{\mathsf {V}}= \alpha x_i + \beta _i\) for \(i = 1, \ldots , n\) and \(Y^{\mathsf {V}}=(y_1^{\mathsf {V}}, \ldots , y_n^{\mathsf {V}})\) and check that \(Y^{\mathsf {V}}= Y^{\mathsf {C}}\). Consider now the following optimization where \({\mathsf {C}}\) and \({\mathsf {V}}\) are given a function H that outputs (at least) k bits. They could then compare \(Y^{\mathsf {C}}\) and \(Y^{\mathsf {V}}\) by \({\mathsf {C}}\) sending \(h^{\mathsf {C}}= H(Y^{\mathsf {C}})\) and \({\mathsf {V}}\) checking that \(h^{\mathsf {C}}= H(Y^{\mathsf {V}})\).

We saw above that if \({\mathsf {C}}\) sends \((x_1', \ldots , x_n')\) and \((y_1', \ldots , y_n')\) where \(x_i' \ne x_i\) for some i, then \({\mathsf {C}}\) can guess \(y_i^{\mathsf {V}}\) with probability at most \(2^{-k}\), so that \(Y^{\mathsf {C}}\ne Y^{\mathsf {V}}\) with overwhelming probability. We could therefore let H be any collision resistant hash function, but we want something that can be implemented very efficiently on modern processors. So we will instead define H as follows:

$$ H(d_1, \ldots , d_n) = \bigoplus _{i=1}^n F(d_i) $$

where we think of F as a random oracle that outputs k bits.

This is clearly secure in the random oracle model: by what we saw above, if \(x_i' \ne x_i\) then with overwhelming probability, \({\mathsf {C}}\) has not been able to call the oracle on input \(y_i^{\mathsf {V}}\). Assuming he has not, he has no information on the value of \(F(y_i^{\mathsf {V}})\), and hence the probability that \(h^{\mathsf {C}}\) happens to be equal to \(H(Y^{\mathsf {V}})\) is \(2^{-k}\).

Recall that we are going to use the MAC scheme outlined here for the case where we use bigger tables than for Boolean gates, such as the AES S-box, and that in such a case the preprocessing will produce this type of linear MACs. This means that if we use the batch opening method, we will need to compute H in the on-line phase. Our definition of H is well suited for this on modern Intel processors: MACs will typically be of size at most 128 bits, so as F we can use AES encryption under a fixed key that is chosen for each protocol execution. We will make the heuristic assumption that we can model this as a random permutation. Then, assuming we will be calling the function much less than \(\sqrt{2^{128}} = 2^{64}\) times (which should certainly be true in practice), such a permutation is well known to be statistically indistinguishable from a random function.
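A sketch of the batched check follows. Here F is modelled with truncated SHA-256 purely for illustration; as discussed above, an implementation would use fixed-key AES via AES-NI instead:

```python
import secrets, hashlib

K = 64

def F(value, k=K):
    """Stand-in for the random oracle F: truncated SHA-256 of the tag."""
    digest = hashlib.sha256(int(value).to_bytes(32, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - k)

def H(tags):
    """H(d_1, ..., d_n) = XOR of F(d_i); only this k-bit value is sent."""
    acc = 0
    for t in tags:
        acc ^= F(t)
    return acc

# C sends the opened values together with h_C = H(Y_C); V recomputes the
# expected tags Y_V from its keys and checks H(Y_V) == h_C.
Y_C = [secrets.randbits(2 * K - 1) for _ in range(1000)]   # C's tags y_1..y_n
Y_V = list(Y_C)                                            # honest run: identical
assert H(Y_C) == H(Y_V)
```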

4 Preprocessing

In this section we show how to securely implement the preprocessing for Boolean circuits. The idea is to generate the tables by a computation on linearly secret-shared values, which in the case of malicious security also include MACs. We will consider an additive secret sharing of x where A holds \(x_A\in \{0,1\}\) and B holds \(x_B\in \{0,1\}\) such that \(x=x_A+x_B\).

In the case of malicious security the MACs are elements in a finite field \(\mathbb {F}\) of characteristic 2 and size at least \(2^k\) where k is the security parameter. Here A holds a key \(\alpha _A\in \mathbb {F}\) and B a key \(\alpha _B\in \mathbb {F}\). We denote a secret-shared value with MACs by \([\![x]\!]\), where A holds \((x_A, y_A, \beta _A)\) and B holds \((x_B, y_B, \beta _B)\) such that \(y_A=\alpha _B x_A + \beta _B\) and \(y_B=\alpha _A x_B + \beta _A\). If a value is to be opened, the MAC is checked: e.g. if x is opened to A, she receives \(x_B\) and \(y_B\) and checks that indeed \(y_B = \alpha _A x_B + \beta _A\), aborting otherwise. This is also the format used in the TinyOT protocol [NNOB12], so we can use the preprocessing protocol from there to produce single values \([\![a]\!]\) for random a and triples of the form \([\![x]\!], [\![y]\!], [\![z]\!]\) where x, y are random bits and \(z=xy\). Any other preprocessing producing the same data format will of course also work; for instance, the protocols presented in [FKOS15] will give better speeds than the original TinyOT.

Note that, by a standard protocol [Bea91], we can use one triple to produce from any \([\![a]\!], [\![b]\!]\) the product \([\![ab]\!]\); this just requires opening \(a+x, b+y\) and local computation. Also, we can compute the sum \([\![a+b]\!]\) using only local computation.
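For reference, a sketch of this multiplication step on plain additive shares over GF(2), with the MACs omitted and both parties' computations collapsed into one function (names are ours):

```python
import secrets

def share(bit):
    """Additive secret sharing of a bit between A and B."""
    a = secrets.randbits(1)
    return a, bit ^ a

def beaver_and(a_sh, b_sh, triple):
    """Compute shares of a*b from shares of a, b and a preprocessed triple
    (x, y, z) with z = x*y, following [Bea91] (char-2 version, so + is XOR)."""
    (xA, xB), (yA, yB), (zA, zB) = triple
    aA, aB = a_sh
    bA, bB = b_sh
    # Both parties open d = a ^ x and e = b ^ y.
    d = (aA ^ xA) ^ (aB ^ xB)
    e = (bA ^ yA) ^ (bB ^ yB)
    # a*b = (d^x)(e^y) = d*e ^ d*y ^ e*x ^ z ; only one party adds the d*e term.
    cA = (d & e) ^ (d & yA) ^ (e & xA) ^ zA
    cB = (d & yB) ^ (e & xB) ^ zB
    return cA, cB

# Usage with a fresh random triple.
x, y = secrets.randbits(1), secrets.randbits(1)
triple = (share(x), share(y), share(x & y))
a, b = 1, 1
cA, cB = beaver_and(share(a), share(b), triple)
assert cA ^ cB == a & b
```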

In Fig. 8, we describe a protocol that implements the preprocessing functionality \(F^{pre}_{mal}\) assuming a secure source of triples and single random values as described here, and also assuming that the circuit contains only AND, XOR and NOT gates. We use a function F that we model as a random oracle.

For simplicity the protocol is phrased as a loop that runs through all gates of the circuit, but we stress that it can be executed in a constant number of rounds: we can execute step 3 in the protocol by first doing all the XOR and NOT gates, using only local operations. At this point wire masks have been chosen for all input and output wires of all AND gates, and these gates can now all be handled in parallel.

Fig. 8. Protocol for preprocessing with malicious security

Preprocessing for AES. To preprocess an AES S-box table, we can again make use of the TinyOT preprocessing. This can be combined with a method from [DK10], which shows how to compute the S-box function securely using 7 binary multiplications and local operations. We can then build the table by simply computing each entry (in parallel). It is also possible to compute the S-box using arithmetic in \(\mathbb {F}_{256}\), but if we have to build such multiplications from binary ones, as we would if using TinyOT, this most likely does not pay off.

5 Implementation

We implement two clients, Alice and Bob, securely evaluating the AES-128 encryption function. Alice inputs the message, Bob inputs the expanded key and both parties learn the ciphertext. This function consists of several operations, all of which except SubBytes are linear. We first implement an optimized version where all linear operations are computed locally using the AES-NI/SSE instruction set, and every non-linear S-box lookup is replaced with a TinyTable lookup and opening. Afterwards we implement a binary version with free XOR gates and no NOT gates using the AES-expanded circuit from [TS]. For both implementations we benchmark a passively secure version and maliciously secure versions, all providing statistical security \(2^{-k}\). Here we test the linear MAC scheme with \(k=64\) and the lookup table MAC scheme with \(k=64\) and \(k=32\).

In the protocol the parties receive preprocessed data generated by a trusted party. We assume this data is present in memory and reuse the same instance for our benchmark. The parties proceed to compute the encryption function on a set of test vectors and verify correctness. We test the implementation on two setups: a basic LAN setup and a cloud setup. The LAN setup consists of two PCs connected via 1 GbE, where each machine has an i7-3770K CPU at 3.5 GHz and 32 GB RAM. For our cloud setup we use Amazon EC2 with two c4.8xlarge instances locally connected via 10 GbE with 36 vCPUs (hyperthreads). The parties communicate over the TCP protocol.

We now analyse the size of the preprocessed data; we concentrate on the tables and ignore the size of the input and output bit masks. For the optimized version, we need \(40\text { KiB}\) of preprocessing data for one passively secure evaluation (160 tiny tables with \(2^8\) entries of one byte each). For the binary version we need \(2.7\text { KiB}\) (5440 tiny tables with 4 entries of one bit each). For malicious security we add MACs. The simple lookup MAC scheme uses \(k + 256k\) bits extra per table entry. The linear MAC scheme reuses a k-bit key and uses 2k bits extra per table entry. Besides the input phase, the optimized version has 10 rounds of communication for the passive version and 12 rounds for the active version, i.e. two more rounds for MAC checking before and after opening the result. Similarly, the binary version has 41 rounds of communication (layers in the circuit) for the passive version and 43 for the active. The size of the preprocessed data can be seen in Table 1.
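For the passively secure variants, these figures follow directly from the table counts above:

$$160 \cdot 2^8 \cdot 8 \text{ bits} = 327680 \text{ bits} = 40\text { KiB}, \qquad 5440 \cdot 4 \cdot 1 \text{ bit} = 21760 \text{ bits} \approx 2.7\text { KiB}.$$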

Table 1. Size of preprocessed data

The results are measured as the average over 30 s. For the sequential tests in both the optimized and the binary version, the network delay is the major factor. For the amortized tests of the optimized version, the bandwidth is the limiting factor on the LAN setup, whereas computation takes over on the cloud setup as we raise the security parameter. For the amortized binary version the computation is the limiting factor on both setups. The timings for the optimized implementation are summarized in Table 2 and those for the binary version in Table 3.

Table 2. Execution times for optimized version
Table 3. Execution times for binary version

6 An Asymptotically Better Solution

Recall that the main problem we have with obtaining malicious security is that we must make sure that players reveal correct bits from the tables they are given, but on the other hand only the relevant bits should be revealed.

In this section we show an asymptotically better technique for committing players to their tables such that we can open only the relevant bits.

The idea is as follows: if player A is to commit to a string \(\varvec{s}\) that is known at preprocessing time, then the preprocessing protocol will establish a (verifiable) secret sharing of \(\varvec{s}\) among the players. Concretely, we will use the representation introduced for the so-called MiniMac protocol in [DZ13]: we choose an appropriate linear error-correcting (binary) code C. This code should be able to encode strings of length k bits, and have length and minimum distance linear in k.

For a string \(\varvec{s}\) of length \(\ell \) (comparable to the circuit size, so \(\ell \gg k\)) we define the encoding \(C(\varvec{s})\) by splitting \(\varvec{s}\) into k-bit blocks, encoding each block in C and concatenating the encodings. The preprocessing will share \(C(\varvec{s})\) additively among the players.

Fig. 9. Preprocessing functionality using the MiniMac protocol

The preprocessing also chooses a random string \(\varvec{a}\), unknown to both players, and both \(\varvec{a}\) and \(\varvec{a}* C(\varvec{s})\) are additively secret shared. Here \(*\) denotes the bitwise (Schur) product. We will use the notation \([\varvec{s}]\) as shorthand for all the additive shares of \(\varvec{s}\), \(\varvec{a}\) and \(\varvec{a}* C(\varvec{s})\). The idea is that \(\varvec{a}* C(\varvec{s})\) serves as a message authentication code for authenticating \(\varvec{s}\), where \(\varvec{a}\) is the key. We note that, as in [DZ13], \(\varvec{a}\) is in fact a global key that is also used for other data represented in this same format.

Fig. 10. Functionality giving on-line access to tables

Fig. 11. Simple protocol for malicious security.

Based on this preprocessing, we can design a protocol that allows A to open any desired substring of \(\varvec{s}\), as follows: let I denote the characteristic vector of the substring, i.e., it is an \(\ell \) bit string where the i’th bit is 1 if A is to reveal the i’th bit of \(\varvec{s}\) and 0 otherwise.

We then compute a representation [I] (which is trivial using the additive shares of \(\varvec{a}\)), use the MiniMac protocol to compute \([I*\varvec{s}]\) and then open this representation to reveal \(I*\varvec{s}\) which gives B the string he needs to know and nothing more. This is possible if we let the preprocessing supply appropriate correlated randomness for the multiplication of I and \(\varvec{s}\). The protocol we just sketched here will be called \(\pi _{MiniMac}\) in the following.
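A minimal sketch of the data-level operations in \(\pi _{MiniMac}\): blockwise encoding with a generator matrix and the componentwise selection \(I*\varvec{s}\). The authenticated sharing and the Schur-product machinery of MiniMac are omitted, and the toy code parameters and names are ours.

```python
import secrets

k, n = 4, 8   # toy block length and code length (a real instantiation uses a
              # code with minimum distance linear in the security parameter)
G = [[secrets.randbits(1) for _ in range(n)] for _ in range(k)]  # generator matrix

def encode_block(block):
    """Encode one k-bit block as block * G over GF(2)."""
    word = [0] * n
    for i, bit in enumerate(block):
        if bit:
            for j in range(n):
                word[j] ^= G[i][j]
    return word

def C_of(s_bits):
    """C(s): split s into k-bit blocks, encode each block, concatenate."""
    out = []
    for i in range(0, len(s_bits), k):
        out += encode_block(s_bits[i:i + k])
    return out

def select(I, s_bits):
    """The on-line task: the componentwise product I * s reveals exactly the
    bits of s at the positions where I is 1, and nothing else."""
    return [a & b for a, b in zip(I, s_bits)]

s = [secrets.randbits(1) for _ in range(2 * k)]
I = [1, 0, 0, 1, 0, 1, 1, 0]        # characteristic vector of the opened positions
codeword = C_of(s)                  # what the preprocessing secret-shares
opened = select(I, s)               # what B ends up learning
```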

In Fig. 9 we specify the preprocessing functionality \(F^{pre}_{MiniMac}\) we assumed in \(\pi _{MiniMac}\), i.e., it outputs the tables as well as MiniMac representations of them. Now consider the functionality \(F_{table}\) from Fig. 10 that simply stores the tables and outputs bits from them on request. By trivial modification of the security proof for MiniMac, we have

Lemma 1

The protocol \(\pi _{MiniMac}\) composed with \(F^{pre}_{MiniMac}\) implements \(F_{table}\) with statistical security against a malicious adversary.

As for the on-line efficiency of \(\pi _{MiniMac}\), note that in [DZ13] the MiniMac protocol is claimed to be efficient only for so-called well-formed circuits, but this is not a problem here since the circuit we need to compute is a completely regular depth-1 circuit. Indeed, bit-wise multiplication of strings is exactly the operation MiniMac was designed to do efficiently. Therefore, simple inspection of [DZ13] shows that the preprocessing data we need will be of size \(O(\ell ) = O(s)\), where s is the circuit size, and this is also the communication complexity. The computational complexity is dominated by the time spent on encoding in C. Unfortunately, we do not know codes with the right algebraic properties that also have fast encoding algorithms, so the only approach known is to simply multiply by the generator matrix. We can optimize by noting that if \(\ell > k^2\) we will always be doing many encodings in parallel, so we can collect all vectors to encode in a matrix and use fast matrix multiplication. With the current state of the art, this leads to computational complexity \(O(s k^\epsilon )\) where \(\epsilon \approx 0.3727\).

Alternatively, we can let C be a Reed-Solomon code over an extension field with \(\varOmega (k)\) elements. We can then use FFT algorithms for encoding and then all complexities will be \(O({\text {polylog}}(k) s)\).

As a final step, consider the protocol \(\pi _{mal}^{table}\) from Fig. 11. By trivial adaptation of the proof for semi-honest security, we get that

Lemma 2

The protocol \(\pi _{mal}^{table}\) composed with \(F_{table}\) implements \(F_{\mathsf {SFE}}\) with malicious and statistical security.

We can then combine Lemmas 1 and 2 to get a protocol for \(F_{\mathsf {SFE}}\) in the preprocessing model, which together with the efficiency consideration above gives us:

Theorem 2

There exists a 2-party protocol in the preprocessing model (using \(F^{pre}_{MiniMac}\)) for computing any Boolean circuit of size s with malicious and statistical security, where the preprocessed data size and communication complexity are O(s) and the computational complexity is \(O(k^\epsilon s)\), where k is the security parameter and \(\epsilon < 1\). There also exists a protocol for which all complexities are \(O({\text {polylog}}(k) s)\).